# Welcome to the COVID-19 Information Retrieval Engine

### This project is part of Text Analysis and Retrieval, FER

## 1.) Preprocessing of the dataset 

#### Note: This part can be replaced by a different parsing method; just make sure you end up with the same data structure as the one used here.

In [2]:
# Let's import our own Parser class, which parses the dataset based on its source.
from Parser import *

parser = Parser([Dataset.ALL])
parser.parse(indexByFile = False)
papers = parser.data_dicts

### Now we combine all papers into one dictionary. We do this because we want to test on all papers, but with a small tweak to the loop below you can choose which datasets to include.

In [5]:
all_papers = {}

current_id = 0
for dataset in papers: # <--- change this to choose which datasets go into the dictionary
    for paper_id in papers[dataset]:
        all_papers[current_id] = papers[dataset][paper_id]
        current_id += 1
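
To restrict the run to particular datasets, the loop above can be wrapped in a small helper. The sketch below is a minimal illustration on plain dictionaries: `merge_papers` and its `selected` argument are hypothetical names, not part of `Parser`, and the toy dictionary only mimics the `{dataset: {paper_id: paper}}` shape that the loop above suggests.

```python
def merge_papers(papers, selected=None):
    """Flatten {dataset: {paper_id: paper}} into {running_index: paper}.

    `selected` is an optional collection of dataset keys to keep;
    None keeps every dataset, which is what the loop above does.
    """
    merged = {}
    current_id = 0
    for dataset, dataset_papers in papers.items():
        if selected is not None and dataset not in selected:
            continue  # skip datasets that were not requested
        for paper in dataset_papers.values():
            merged[current_id] = paper
            current_id += 1
    return merged


# Toy stand-in for parser.data_dicts (keys and values are made up).
toy_papers = {
    "dataset_a": {"p1": "paper one", "p2": "paper two"},
    "dataset_b": {"p3": "paper three"},
}

all_toy = merge_papers(toy_papers)
only_b = merge_papers(toy_papers, selected={"dataset_b"})
```

Because the result is always indexed from 0 upward, later cells that iterate with `range(len(all_papers))` would keep working after a dataset is dropped.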

# 1.b) Importing queries

In [6]:
from TaskQuery import *

#We are importing all the queries here.
queries = TaskQuery.questions()

## IMPORTANT: Enter the query for which you want the end result here!

In [131]:
QUERY = "What is COVID-19 transmission?"
print(QUERY)

What is COVID-19 transmission?


# 2.) BioNER filtering

### Here we filter the dataset. We take a query and extract keywords from it. Then we extract keywords from each paper and compare the query keywords with the paper keywords. Papers that do not match any query keyword are removed.

In [67]:
import scispacy
import spacy

# Download https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.2.4/en_core_sci_sm-0.2.4.tar.gz and put it
# in the data folder.
bioNER = spacy.load("./data/en_core_sci_sm")
covid_keywords = ["covid19","sars-cov-2","covid-19","coronavirus","sars"]

In [132]:
#Getting keywords for Query
query_NER = bioNER(QUERY)
print(query_NER.ents)

(COVID-19, transmission)


In [133]:
# Now we go through all the documents and parse them to get the important biomedical terms, e.g. coronavirus.
# Then we count how many of a document's keywords match a keyword from the query and divide that count by
# the number of distinct keywords in the document. That gives us a relevance score that we can use.
from nltk.stem import WordNetLemmatizer 

lemmatizer = WordNetLemmatizer()

filtered_results = {}
alpha = 1 # how much we value COVID-19 terms in the body
beta = 1.8 # how much we value search terms in the title
gamma = 1.2 # how much we value search terms in the abstract

NER_scores = []
if(len(set(query_NER.ents)) != 0):
    for i in range(len(all_papers)):
        paper = all_papers[i]
        document = bioNER(paper.body)
        title = bioNER(paper.title)
        abstract = bioNER(paper.abstract)
        
        query_counter = 0
        for query in set(query_NER.ents):
            counter = 0
            for keyword in set(document.ents):
                if(lemmatizer.lemmatize(keyword.text.lower()) == lemmatizer.lemmatize(query.text.lower())):
                    counter += 1
                if(keyword.text.lower() in covid_keywords):
                    counter *= alpha
            counter = counter/len(set(document.ents))
            query_counter += counter
            
        
        title_counter = 0
        for query in set(query_NER.ents):
            for keyword in set(title.ents):
                if(lemmatizer.lemmatize(keyword.text.lower()) == lemmatizer.lemmatize(query.text.lower())):
                    title_counter += 1 
        title_counter *= beta
        
        abstract_counter = 0
        for query in set(query_NER.ents):
            counter = 0
            for keyword in set(abstract.ents):
                if(lemmatizer.lemmatize(keyword.text.lower()) == lemmatizer.lemmatize(query.text.lower())):
                    counter += 1
                if(keyword.text.lower() in covid_keywords):
                    counter *= gamma
            if(len(set(abstract.ents)) != 0):
                counter = counter/len(set(abstract.ents))
            else:
                counter = 0
            abstract_counter += counter
        
        
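        # NOTE: `query` is the last entity span left over from the loop above, so the divisor is the
        # entity count of that single span; len(set(query_NER.ents)) may have been intended.
        query_counter = (query_counter+title_counter+abstract_counter)/len(set(query.ents))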
        NER_scores.append((query_counter,i))
    NER_scores.sort(reverse = True)
    print(NER_scores)

[(3.918074515648286, 614), (3.8473820033955857, 610), (3.826650177683014, 408), (3.767235294117647, 210), (3.75435, 811), (3.7397382977595663, 977), (3.707265401531695, 833), (3.695212622601279, 672), (3.6939060618225614, 908), (3.693832391713748, 286), (3.6741580830532365, 139), (3.668987878787879, 298), (3.6573833902161548, 205), (3.654193279809694, 267), (3.652742946708464, 998), (3.6521125659566156, 743), (3.650686378035903, 857), (3.6495023572551077, 486), (3.649150286576169, 723), (3.6490153671030168, 35), (3.6416666666666666, 423), (3.640449438202247, 571), (3.637600716204118, 71), (3.635320088300221, 84), (3.6325581395348836, 458), (3.625595360506725, 20), (3.6174418604651164, 827), (2.1862574992764388, 918), (2.159602574019048, 913), (2.0985546428571427, 81), (2.0804468584853364, 464), (2.0589904973780393, 986), (2.057241379310345, 942), (2.054179815700807, 27), (2.019765058484489, 733), (2.01917316775483, 125), (2.018846431146359, 666), (2.0116666666666667, 978), (2.010660353

In [134]:
# Now we filter out all the papers that have a score of zero.
for score in NER_scores:
    if(score[0] != 0.0):
        filtered_results[all_papers[score[1]]] = score
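
The scoring and filtering steps above can be illustrated without spaCy. The following is a simplified sketch, not the notebook's exact formula: it works on plain keyword sets, skips lemmatization and the COVID-term weighting, and the helper names `overlap_fraction` and `ner_score` are hypothetical. It shows the core idea: a match fraction for body and abstract, a weighted match count for the title, and then removal of zero-score papers.

```python
def overlap_fraction(query_terms, doc_terms):
    """Fraction of the document's keywords that also occur in the query."""
    if not doc_terms:
        return 0.0  # guard against documents with no extracted keywords
    return len(query_terms & doc_terms) / len(doc_terms)


def ner_score(query_terms, body, title, abstract, beta=1.8):
    """Combine body, title and abstract matches into one score."""
    body_part = overlap_fraction(query_terms, body)
    title_part = beta * len(query_terms & title)  # raw count, weighted by beta
    abstract_part = overlap_fraction(query_terms, abstract)
    return (body_part + title_part + abstract_part) / len(query_terms)


query_terms = {"covid-19", "transmission"}

# Toy papers: (body keywords, title keywords, abstract keywords), all made up.
toy_docs = {
    0: ({"covid-19", "transmission", "outbreak", "china"},
        {"covid-19", "transmission"},
        {"transmission", "epidemic"}),
    1: ({"protein", "folding"}, {"protein"}, set()),
}

scores = sorted(
    ((ner_score(query_terms, *parts), idx) for idx, parts in toy_docs.items()),
    reverse=True,
)

# Keep only papers with a non-zero score, as in the cell above.
kept = {idx: score for score, idx in scores if score != 0.0}
```

Note the guard for empty keyword sets: the notebook applies it to the abstract only, so a paper whose body yields no entities would raise a `ZeroDivisionError` there.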

# 2.) Word2vec

In [12]:
#User manual
#----------------------------------
#Install --> pip3 install gensim (apart from gensim, you will need numpy)
#Download word2vec file -->  https://code.google.com/archive/p/word2vec/
import gensim.models.keyedvectors as word2vec
import numpy as np

unable to import 'smart_open.gcs', disabling that module


In [13]:
#Here we initialize word2vec with already pretrained vectors
word2vec = word2vec.KeyedVectors.load_word2vec_format('data/GoogleNews-vectors-negative300.bin', binary=True)

In [135]:
import nltk
query = nltk.word_tokenize(QUERY)
print(query)

['What', 'is', 'COVID-19', 'transmission', '?']


In [136]:
keywords_that_are_not_in_word2vec = ["covid19","sars-cov-2","covid-19"]

query_vector = np.zeros(300)
for word in query:
    if(word in word2vec.vocab):
        query_vector += word2vec[word]
    if(word.lower() in keywords_that_are_not_in_word2vec):
        query_vector += word2vec["coronavirus"]
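
The cell above sums the vectors of the query tokens and substitutes the vector of "coronavirus" for COVID-19 spellings that the pretrained model does not contain. A minimal sketch of that idea with a toy embedding table follows. The 3-dimensional vectors and the `embed_query` helper are made up for illustration, and plain Python lists stand in for the gensim model so the example has no dependencies.

```python
# Toy 3-dimensional embedding table standing in for the pretrained model.
toy_vectors = {
    "coronavirus": [1.0, 0.0, 0.0],
    "transmission": [0.0, 1.0, 0.0],
}

# Spellings absent from the model, mapped to a proxy word that is present.
fallback = {
    "covid19": "coronavirus",
    "covid-19": "coronavirus",
    "sars-cov-2": "coronavirus",
}


def embed_query(tokens, vectors, fallback, dim=3):
    """Sum token vectors, using a proxy word for known missing spellings."""
    total = [0.0] * dim
    for token in tokens:
        if token in vectors:
            vec = vectors[token]
        elif token.lower() in fallback:
            vec = vectors[fallback[token.lower()]]
        else:
            continue  # unknown tokens such as "What" or "?" add nothing
        total = [t + v for t, v in zip(total, vec)]
    return total


query_vec = embed_query(["What", "is", "COVID-19", "transmission", "?"],
                        toy_vectors, fallback)
print(query_vec)
```

Tokens outside both tables contribute nothing, which is also why stop words like "What" and "is" only affect the real query vector when the pretrained vocabulary happens to contain them.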

In [137]:
from numpy.linalg import norm

W2V_scores = []
for paper in filtered_results:
    paper_vector = np.zeros(300)
    title_vector = np.zeros(300)
    tokens = bioNER(paper.whole_text)
    title_tokens = bioNER(paper.title)
    
    for token in set(tokens.ents):
        if(token.text in word2vec.vocab):
            if(token.text in query):
                paper_vector += word2vec[token.text]*alpha
            else:
                paper_vector += word2vec[token.text]
        
    for token in set(title_tokens.ents):
        if(token.text in word2vec.vocab):
            if(token.text in query):
                title_vector += word2vec[token.text]*beta
            else:
                title_vector += word2vec[token.text]
        
    cos_sim_paper = np.inner(query_vector,paper_vector)/(norm(query_vector)*norm(paper_vector))
    cos_sim_title = np.inner(query_vector,title_vector)/(norm(query_vector)*norm(title_vector))
    
    cos_sim = (cos_sim_paper+cos_sim_title*beta)/2
    
    score = filtered_results[paper]
    w2v_score = (score[0]*cos_sim,score[1])
    
    filtered_results[paper] = w2v_score
    W2V_scores.append(w2v_score)
W2V_scores.sort(reverse = True)
print(W2V_scores)



[(3.48760715556577, 908), (3.0812467439777067, 210), (2.9839042741655284, 486), (2.9023581246244206, 84), (2.901514149548767, 811), (2.8716259986757198, 610), (2.8487708340214803, 408), (2.8263836575573915, 571), (2.692341964036869, 423), (2.664758259582012, 139), (2.655419237651965, 298), (2.6464454644567077, 20), (2.6029455501229664, 977), (2.32466607640678, 723), (2.2115029498354577, 71), (2.1998547448922556, 205), (2.0145069853843967, 998), (1.9923143231447125, 614), (1.9757196348499422, 286), (1.9408351847189003, 833), (1.8382440913679141, 743), (1.7166430128155044, 857), (1.6678665577091005, 973), (1.6585422646174355, 267), (1.6418784514107054, 827), (1.6396312737024403, 464), (1.5885082491171465, 693), (1.5704681603107642, 851), (1.5376047253344018, 501), (1.4159684225907865, 458), (1.3766180741420078, 672), (1.3537328155032449, 978), (1.2432471253991966, 677), (1.2374638719492335, 777), (1.234990356197077, 1048), (1.188860915768959, 397), (1.1855173402506494, 986), (1.160055778
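
One thing to watch in the similarity step: if none of a paper's entities are in the word2vec vocabulary, its vector stays all zeros and the cosine division yields `nan`. A possible guard is sketched below. `safe_cosine` is a hypothetical helper, and returning 0.0 for a zero vector is an assumption about the desired behaviour, not something the notebook does.

```python
import math


def safe_cosine(u, v):
    """Cosine similarity that returns 0.0 when either vector has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # assumption: treat "no known words" as "no similarity"
    return dot / (norm_u * norm_v)


same_direction = safe_cosine([1.0, 1.0, 0.0], [2.0, 2.0, 0.0])
orthogonal = safe_cosine([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
empty_paper = safe_cosine([1.0, 1.0, 0.0], [0.0, 0.0, 0.0])
```

The helper iterates element by element, so it accepts NumPy arrays as well as lists.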

In [141]:
print(all_papers[84])

[1mTitle[0m
Identification of a super-spreading chain of transmission associated with COVID-19

[1mAbstract[0m
341 words; main text: 2964 words

[1mBody[0m
The outbreak of coronavirus disease 2019 (COVID- 19) , which was caused by the novel coronavirus SARS-CoV-2, has posed tremendous challenges to the international communities [1] [2] [3] [4] [5] [6] [7] [8] .As of March 18, 2020 , there were 81,116 confirmed cases and 3231 deaths in China 9 ; Globally, 160 counties and territories have reported cases of COVID-19, including 191,127 confirmed cases and 7807 deaths 9 . In response to "the alarming levels of spread and severity", the World Health Organization (WHO) has characterized COVID-19 as a pandemic 10 . This is the first pandemic caused by a coronavirus. Coronaviruses (CoV) are a large family of RNA viruses that cause a variety of mild and severe diseases in humans and animals 11 . Prior to the COVID-19, there were two severe outbreaks of human coronavirus diseases in the pa

# 3.) Doc2vec 

In [82]:
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from nltk.tokenize import word_tokenize

In [142]:
tagged_data = []
for paper in filtered_results:
    score = filtered_results[paper]
    
    tokens = [token.text for token in bioNER(paper.title.lower()).ents]
    doc = TaggedDocument(words = tokens, tags = [str(score[1])])
    tagged_data.append(doc)

#### This is the training part. Here we build our own embeddings, which basically means making our own word2vec. For every word in our dataset the model learns a vector in a 20-dimensional space, and words that are similar get similar vectors. E.g. the vectors for the words "coronavirus" and "covid19" will be similar.

In [143]:
max_epochs = 1000
vec_size = 20 # word2vec has 300 dimensions, but we use 20 here
alpha = 0.025

model = Doc2Vec(vector_size=vec_size,
                alpha=alpha, 
                min_alpha=0.00025,
                min_count=1,
                dm =1)

model.build_vocab(tagged_data)

for epoch in range(max_epochs):
    model.train(tagged_data,total_examples=model.corpus_count,epochs=model.iter)
    print('iteration {0}'.format(epoch))
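    # decrease the learning rate
    # NOTE: 0.025 - 0.0002 * epoch reaches zero at epoch 125, so with max_epochs = 1000
    # the later iterations run with a non-positive alpha
    model.alpha -= 0.0002
    # fix the learning rate within this call, no decay
    model.min_alpha = model.alpha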
print("Done.")

  


iteration 0
iteration 1
iteration 2
...
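
A note on the learning-rate arithmetic in the training loop above: subtracting 0.0002 per iteration from an initial 0.025 reaches zero at iteration 125, so most of the 1000 iterations would run with a non-positive rate. The sketch below only does the arithmetic, with no gensim involved. `manual_schedule` and `linear_schedule` are hypothetical helpers; the second shows one possible alternative, a linear decay from `alpha` to `min_alpha` spread over all iterations.

```python
def manual_schedule(alpha, step, epochs):
    """Rate at the start of each iteration when `step` is subtracted every time."""
    return [alpha - step * epoch for epoch in range(epochs)]


def linear_schedule(alpha, min_alpha, epochs):
    """Linear decay from alpha (first iteration) to min_alpha (last iteration)."""
    if epochs == 1:
        return [alpha]
    delta = (alpha - min_alpha) / (epochs - 1)
    return [alpha - delta * epoch for epoch in range(epochs)]


manual = manual_schedule(alpha=0.025, step=0.0002, epochs=1000)
linear = linear_schedule(alpha=0.025, min_alpha=0.00025, epochs=1000)

# First iteration whose rate is effectively zero or negative
# (the tiny tolerance absorbs floating-point rounding).
first_non_positive = next(i for i, rate in enumerate(manual) if rate < 1e-9)

print(first_non_positive, manual[-1], linear[-1])
```

With the linear schedule every iteration keeps a positive rate, at the cost of a much smaller step per iteration than the 0.0002 used above.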

#### Here we test our embeddings with a query. We use the `QUERY` defined at the top ("What is COVID-19 transmission?") and hope to find all the documents that talk about coronavirus transmission.

In [144]:
test_data = word_tokenize(QUERY.lower()) #change this query to test different things 
v1 = model.infer_vector(test_data)
print(QUERY)

What is COVID-19 transmission?


In [145]:
similar_doc = model.docvecs.most_similar([v1],topn = len(tagged_data))
print(similar_doc)

[('189', 0.9195913076400757), ('1005', 0.9167531728744507), ('516', 0.9139246344566345), ('678', 0.9133954048156738), ('993', 0.9085265398025513), ('796', 0.9062878489494324), ('32', 0.8976616859436035), ('948', 0.8938473463058472), ('356', 0.8834942579269409), ('479', 0.8787457942962646), ('419', 0.8763800859451294), ('857', 0.8521550893783569), ('1048', 0.8455232381820679), ('955', 0.8376749157905579), ('664', 0.8245499730110168), ('576', 0.8239209651947021), ('381', 0.8203771114349365), ('768', 0.8132331967353821), ('76', 0.8021485805511475), ('963', 0.7985355257987976), ('261', 0.7968178987503052), ('529', 0.7923281192779541), ('424', 0.7906389236450195), ('687', 0.7891881465911865), ('305', 0.7886194586753845), ('392', 0.7864183187484741), ('977', 0.7863318920135498), ('1014', 0.7814918756484985), ('183', 0.7799147963523865), ('82', 0.7745561003684998), ('137', 0.7743425965309143), ('114', 0.7733232975006104), ('715', 0.7617603540420532), ('410', 0.7606906294822693), ('340', 0.751

In [146]:
results = {}
for doc in similar_doc:
    results[int(doc[0])] = doc[1]

d2v_results = []
for paper in filtered_results:
    score = filtered_results[paper]
    result = results[score[1]]
    freshed_score = (result*score[0],score[1])
    
    filtered_results[paper] = freshed_score
    d2v_results.append(freshed_score)
d2v_results.sort(reverse = True)
print(d2v_results)

[(2.2296324122455493, 908), (2.1203594089006548, 486), (2.0467790992364425, 977), (1.9995471134494691, 84), (1.6751908234638129, 993), (1.462846080016528, 857), (1.4616660298611586, 610), (1.1958763496948435, 20), (1.1802229329505745, 998), (1.1748085369398404, 298), (1.1687575377521757, 205), (1.0805184801856051, 614), (1.044213045095378, 1048), (1.0122324454354394, 948), (1.0075859522974455, 419), (0.9812888136421475, 978), (0.9320717391762217, 827), (0.9011342125513276, 356), (0.8804591910898337, 114), (0.872089775220494, 571), (0.8402797526332731, 332), (0.8315235535020249, 479), (0.8215144239327381, 672), (0.8187258277578253, 71), (0.799612569756476, 967), (0.7792684195465898, 123), (0.7485773511593097, 457), (0.7454436720639751, 47), (0.7291735312356018, 695), (0.7221559572975501, 677), (0.7136596408604605, 70), (0.7018063259166974, 381), (0.6993401801174808, 682), (0.6865365720891318, 583), (0.6842390258568636, 236), (0.6814291737028637, 750), (0.6801950648538536, 790), (0.67816
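
The re-ranking cell above multiplies each paper's running score by its Doc2vec similarity to the query and sorts again. Below is a minimal sketch of that step with made-up numbers. The `rerank` helper is hypothetical, and defaulting a missing similarity to 0.0 is an assumption added here.

```python
def rerank(previous, similarity):
    """Multiply each (score, paper_index) by the paper's similarity and re-sort.

    previous:   list of (score, paper_index) tuples from the earlier stage
    similarity: dict mapping paper_index -> similarity to the query
    """
    combined = [
        (score * similarity.get(index, 0.0), index)  # missing index -> 0.0 (assumption)
        for score, index in previous
    ]
    combined.sort(reverse=True)
    return combined


previous_scores = [(4.0, 10), (2.0, 11), (1.0, 12)]
doc_similarity = {10: 0.25, 11: 0.75, 12: 0.5}

reranked = rerank(previous_scores, doc_similarity)
print(reranked)
```

Because the stages are combined by multiplication, a paper that ranked first on keyword overlap (index 10 here) can drop below one with a lower earlier score but a higher similarity.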

In [151]:
print(all_papers[20])

[1mTitle[0m
Stochastic discrete epidemic modeling of COVID-19 transmission in the Province of Shaanxi incorporating public health intervention and case importation

[1mAbstract[0m
Before the lock-down of Wuhan/Hubei/China, on January 23 rd 2020, a large number of individuals infected by COVID-19 moved from the epicenter Wuhan and the Hubei province due to the Spring Festival, resulting in an epidemic in the other provinces including the Shaanxi province. The epidemic scale in Shaanxi was comparatively small and with half of cases being imported from the epicenter. Based on the complete epidemic data including the symptom onset time and transmission chains, we calculate the control reproduction number (1.48-1.69) in Xi'an. We could also compute the time transition, for each imported or local case, from the latent, to infected, to hospitalized compartment, as well as the effective reproduction number. This calculation enables us to revise our early deterministic transmission model to

# 4.) bm25? 

In [None]:
from gensim.corpora import Dictionary