# Welcome to the COVID-19 Information Retrieval Engine

### This project is part of Text Analysis and Retrieval, FER

## 1.) Preprocessing of the dataset 

#### Note: This part can be replaced by a different parsing method; just make sure you end up with the same data structure as ours.

In [1]:
#Let's import our own Parser class, which parses the dataset based on its source.
from Parser import *

parser = Parser([Dataset.TEST])
parser.parse(indexByFile = False)
papers = parser.data_dicts

### Now we combine all the papers into one dictionary. We do this because we want to test on all papers, but with a small tweak to the loop below you can choose which datasets to include.

In [3]:
all_papers = {}

current_id = 0
for dataset in papers: #<--- change this to choose which datasets go into the dictionary
    for paper_id in papers[dataset]:
        all_papers[current_id] = papers[dataset][paper_id]
        current_id += 1
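
The loop above can be wrapped in a small helper so that the choice of datasets becomes an argument. This is only a sketch: `combine_papers` and its `datasets` parameter are hypothetical names, and it assumes that `papers` maps each dataset key to a dictionary of papers, as `parser.data_dicts` appears to do.

```python
def combine_papers(papers, datasets=None):
    """Merge the selected datasets into one dict keyed by a running integer id.

    papers   -- mapping: dataset key -> {paper_id: paper}
    datasets -- iterable of dataset keys to include (hypothetical parameter);
                None means "use every dataset".
    """
    selected = papers.keys() if datasets is None else datasets
    combined = {}
    for dataset in selected:
        for paper in papers[dataset].values():
            # The next free integer id is simply the current size of the dict.
            combined[len(combined)] = paper
    return combined


# Toy stand-in for parser.data_dicts; the keys and values are made up.
toy = {"A": {"x1": "paper one", "x2": "paper two"}, "B": {"y1": "paper three"}}
everything = combine_papers(toy)
only_b = combine_papers(toy, datasets=["B"])
```

Passing a list of dataset keys restricts the result, while the default reproduces the behaviour of the loop above.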

# 1.b) Importing queries

In [4]:
from TaskQuery import *

#We are importing all the queries here.
queries = TaskQuery.questions()
print(queries)

['What is known about Covid-19 transmission?', 'What is known about Covid-19 incubation?', 'What is known about Covid-19 environmental stability?']


## IMPORTANT: Enter here the query for which you want the final results!

In [5]:
QUERY = queries[1]
print(QUERY)

What is known about Covid-19 incubation?


# 2.) BioNER filtering

### Here we filter the dataset. We take a query and extract keywords from it. Then we extract keywords from each paper and compare the query keywords with the paper keywords. Papers whose keywords do not match any query keyword are removed.
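
As a rough illustration of the idea, independent of scispaCy, the sketch below works on keyword lists that have already been extracted. The function `keyword_overlap` and the toy data are assumptions made for this example; the notebook's actual scoring, in the cells that follow, also lemmatizes keywords and weights title and abstract matches.

```python
def keyword_overlap(query_keywords, paper_keywords):
    """Fraction of a paper's distinct keywords that also occur in the query."""
    query = {k.lower() for k in query_keywords}
    paper = {k.lower() for k in paper_keywords}
    if not paper:
        return 0.0  # a paper without keywords cannot match anything
    return len(query & paper) / len(paper)


query_keywords = ["Covid-19", "incubation"]
papers_keywords = {
    0: ["incubation", "period", "covid-19", "patients"],
    1: ["protein", "structure"],
}

scores = {pid: keyword_overlap(query_keywords, kws)
          for pid, kws in papers_keywords.items()}
# Keep only the papers that share at least one keyword with the query.
kept = [pid for pid, score in scores.items() if score > 0]
```

Here paper 0 shares two of its four keywords with the query and is kept, while paper 1 shares none and is dropped.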

In [6]:
import scispacy
import spacy

#download this https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.2.4/en_core_sci_sm-0.2.4.tar.gz and put
#in the data folder
bioNER = spacy.load("./data/en_core_sci_sm")
covid_keywords = ["covid19","sars-cov-2","covid-19","coronavirus","sars"]

In [7]:
#Getting keywords for Query
query_NER = bioNER(QUERY)
print(query_NER.ents)

(Covid-19, incubation)


In [8]:
#Now we go through all the documents and parse them to get the important BIO terms, e.g. coronavirus.
#Then we count how many distinct entities in the document match a word from the query, and divide that count by
#the number of distinct entities in the document. That gives us a score that we can use.
from nltk.stem import WordNetLemmatizer 

lemmatizer = WordNetLemmatizer()

filtered_results = {}
alpha = 1 #parameter that determines how much we value COVID-19 terms in the body
beta = 1.8 #parameter that determines how much we value search terms in the title
gamma = 1.2 #parameter that determines how much we value search terms in the abstract

NER_scores = []
if(len(set(query_NER.ents)) != 0):
    for i in range(len(all_papers)):
        paper = all_papers[i]
        document = bioNER(paper.body)
        title = bioNER(paper.title)
        abstract = bioNER(paper.abstract)
        
        query_counter = 0
        for query in set(query_NER.ents):
            counter = 0
            for keyword in set(document.ents):
                if(lemmatizer.lemmatize(keyword.text.lower()) == lemmatizer.lemmatize(query.text.lower())):
                    counter += 1
                if(keyword.text.lower() in covid_keywords):
                    counter *= alpha
            if(len(set(document.ents)) != 0): #avoid division by zero for papers without entities
                counter = counter/len(set(document.ents))
            query_counter += counter
            
        
        title_counter = 0
        for query in set(query_NER.ents):
            for keyword in set(title.ents):
                if(lemmatizer.lemmatize(keyword.text.lower()) == lemmatizer.lemmatize(query.text.lower())):
                    title_counter += 1 
        title_counter *= beta
        
        abstract_counter = 0
        for query in set(query_NER.ents):
            counter = 0
            for keyword in set(abstract.ents):
                if(lemmatizer.lemmatize(keyword.text.lower()) == lemmatizer.lemmatize(query.text.lower())):
                    counter += 1
                if(keyword.text.lower() in covid_keywords):
                    counter *= gamma
            if(len(set(abstract.ents)) != 0):
                counter = counter/len(set(abstract.ents))
            else:
                counter = 0
            abstract_counter += counter
        
        
        query_counter = (query_counter+title_counter+abstract_counter)/len(set(query.ents))
        NER_scores.append((query_counter,i))
    NER_scores.sort(reverse = True)
    print(NER_scores)

[(3.6521125659566156, 70), (1.9290890882359208, 14), (1.926678550207962, 44), (1.8618975741239894, 50), (1.8057273768613975, 47), (0.12185451231755348, 25), (0.07731321784617717, 72), (0.04284795472086442, 85), (0.03071658615136876, 29), (0.030021406953569058, 20), (0.024999999999999998, 67), (0.0091991341991342, 57), (0.007692307692307693, 9), (0.00646551724137931, 1), (0.0046816479400749065, 84), (0.004048582995951417, 6), (0.0034965034965034965, 2), (0.003048780487804878, 56), (0.002621231979030144, 94), (0.002250562640660165, 11), (0.0020304568527918783, 39), (0.00130718954248366, 65), (0.0012804097311139564, 22), (0.0010672358591248667, 86), (0.0010416666666666667, 60), (0.0009900990099009901, 90), (0.0009082652134423251, 51), (0.0008312551953449709, 40), (0.0007867820613690008, 32), (0.0007407407407407407, 53), (0.0006646726487205051, 4), (0.0006596306068601583, 18), (0.0005494505494505495, 68), (0.0005420054200542005, 58), (0.00043243243243243243, 59), (0.0002924831822170225, 37

In [9]:
#Now we keep only the papers with a non-zero score; all the others are filtered out.
for score in NER_scores:
    if(score[0] != 0.0):
        filtered_results[all_papers[score[1]]] = score

# 2.b) Word2vec

In [10]:
#User manual
#----------------------------------
#Install --> pip3 install gensim (apart from gensim, you will need numpy)
#Download word2vec file -->  https://code.google.com/archive/p/word2vec/
import gensim.models.keyedvectors as word2vec
import numpy as np


In [11]:
#Here we initialize word2vec with already pretrained vectors
word2vec = word2vec.KeyedVectors.load_word2vec_format('data/GoogleNews-vectors-negative300.bin', binary=True)

In [12]:
import nltk
query = nltk.word_tokenize(QUERY)
print(query)

['What', 'is', 'known', 'about', 'Covid-19', 'incubation', '?']


In [13]:
keywords_that_are_not_in_word2vec = ["covid19","sars-cov-2","covid-19"]

query_vector = np.zeros(300)
for word in query:
    if(word in word2vec.vocab):
        query_vector += word2vec[word]
    if(word.lower() in keywords_that_are_not_in_word2vec):
        query_vector += word2vec["coronavirus"]
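
The cell above sums the vectors of the query words and uses the vector for "coronavirus" for COVID-19 spellings that the pretrained vocabulary lacks. Below is the same idea on a toy three-dimensional vocabulary, using plain lists so that it runs without the pretrained model. `toy_vectors`, `FALLBACKS` and `sum_vectors` are made-up names for this sketch, and unlike the cell above it looks a word up directly first and uses the fallback only if that fails.

```python
toy_vectors = {
    "coronavirus": [1.0, 0.0, 2.0],
    "incubation": [0.0, 3.0, 1.0],
}
# Out-of-vocabulary spellings mapped to an in-vocabulary stand-in.
FALLBACKS = {
    "covid19": "coronavirus",
    "sars-cov-2": "coronavirus",
    "covid-19": "coronavirus",
}


def sum_vectors(tokens, vectors, fallbacks, dim=3):
    """Sum the vectors of all known tokens, using fallbacks for unknown ones."""
    total = [0.0] * dim
    for token in tokens:
        key = token if token in vectors else fallbacks.get(token.lower())
        if key is None or key not in vectors:
            continue  # unknown word with no fallback: skip it
        total = [t + v for t, v in zip(total, vectors[key])]
    return total


query_vector = sum_vectors(
    ["What", "is", "Covid-19", "incubation", "?"], toy_vectors, FALLBACKS
)
```

Words such as "What" and "?" contribute nothing here because the toy vocabulary does not contain them.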

In [14]:
from numpy.linalg import norm

W2V_scores = []
for paper in filtered_results:
    paper_vector = np.zeros(300)
    title_vector = np.zeros(300)
    tokens = bioNER(paper.whole_text)
    title_tokens = bioNER(paper.title)
    
    for token in set(tokens.ents):
        if(token.text in word2vec.vocab):
            if(token.text in query):
                paper_vector += word2vec[token.text]*alpha
            else:
                paper_vector += word2vec[token.text]
        
    for token in set(title_tokens.ents):
        if(token.text in word2vec.vocab):
            if(token.text in query):
                title_vector += word2vec[token.text]*beta
            else:
                title_vector += word2vec[token.text]
        
    cos_sim_paper = np.inner(query_vector,paper_vector)/(norm(query_vector)*norm(paper_vector))
    cos_sim_title = np.inner(query_vector,title_vector)/(norm(query_vector)*norm(title_vector))
    
    cos_sim = (cos_sim_paper+cos_sim_title*beta)/2
    
    score = filtered_results[paper]
    w2v_score = (score[0]*cos_sim,score[1])
    
    filtered_results[paper] = w2v_score
    W2V_scores.append(w2v_score)
W2V_scores.sort(reverse = True)
print(W2V_scores)



[(1.7745787633280168, 70), (1.1651454139411335, 44), (0.9572473053210789, 14), (0.8919523639696857, 50), (0.777466692794606, 47), (0.09882556503606972, 25), (0.01955334546586607, 72), (0.018348460813301815, 85), (0.012520233950979263, 29), (0.010303128458724269, 20), (0.004333155666635088, 57), (0.004299592925933213, 9), (0.003738136730696603, 67), (0.0032602240579680205, 1), (0.0021496746601150727, 84), (0.0018462848262183202, 2), (0.001759438590582564, 6), (nan, 94), (0.0014688405951929077, 11), (0.0012089852186225373, 56), (0.0009545882658005207, 39), (0.0007264972096762861, 22), (0.0007009531331236247, 65), (0.000624956868539344, 60), (0.0005656798916535547, 86), (0.0005506408658853582, 90), (0.0004825135162031871, 40), (0.00048139173758721566, 51), (nan, 53), (0.00037243974984641637, 18), (0.0003665547910325716, 32), (0.0003558975120049752, 4), (0.00034362642169754183, 58), (0.00021625522895253865, 68), (0.00019026997850283298, 59), (0.00015771814518768338, 37)]
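
The `nan` entries in the output above (papers 94 and 53) are what a cosine similarity produces when one of the vectors is all zeros. Here that would happen if none of the entities in a paper or its title were found in the word2vec vocabulary; this cause is inferred from the code, not stated by the notebook. A minimal sketch of a guarded cosine similarity follows, with `safe_cosine` as a hypothetical helper written in plain Python:

```python
import math


def safe_cosine(u, v):
    """Cosine similarity that returns 0.0 instead of nan for zero vectors."""
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(u, v)) / (norm_u * norm_v)


same_direction = safe_cosine([1.0, 2.0], [2.0, 4.0])
orthogonal = safe_cosine([1.0, 0.0], [0.0, 1.0])
empty_title = safe_cosine([1.0, 2.0], [0.0, 0.0])
```

Returning 0.0 treats a paper without a usable vector as unrelated to the query; whether that is preferable to dropping the paper is a design choice.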


In [37]:
print(all_papers[17])

Title
Estimating the Relative Probability of Direct Transmission between Infectious Disease Patients

Abstract
Estimating infectious disease parameters such as the serial interval (time between symptom onset in primary and secondary cases) and reproductive number (average number of secondary cases produced by a primary case) are important to understand infectious disease dynamics. Many estimation methods require linking cases by direct transmission, a difficult task for most diseases.Using a subset of cases with detailed genetic or contact investigation data to develop a training set of probable transmission events, we build a model to estimate the relative transmission probability for all case-pairs from demographic, spatial and clinical data. Our method is based on naive Bayes, a machine learning classification algorithm which uses the observed frequencies in the training dataset to estimate the probability that a pair is linked given a set of covariates.In simulation

# 3.) Doc2vec 

In [15]:
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from nltk.tokenize import word_tokenize

In [16]:
tagged_data = []
for paper in filtered_results:
    score = filtered_results[paper]
    
    tokens = [token.text for token in bioNER(paper.title.lower()).ents]
    doc = TaggedDocument(words = tokens, tags = [str(score[1])])
    tagged_data.append(doc)

#### This is the training part. Here we train our own embeddings, which basically means making our own word2vec. In other words, for every word in our dataset the model learns a vector in a 20-dimensional space. Furthermore, the vectors of similar words will be similar. E.g. the vectors for the words coronavirus and covid19 should be similar.

In [17]:
max_epochs = 1000
vec_size = 20 #word2vec has 300, but we use 20 here
alpha = 0.025

model = Doc2Vec(vector_size=vec_size,
                alpha=alpha, 
                min_alpha=0.00025,
                min_count=1,
                dm =1)

model.build_vocab(tagged_data)

for epoch in range(max_epochs):
    model.train(tagged_data,total_examples=model.corpus_count,epochs=model.iter)
    print('iteration {0}'.format(epoch))
    # decrease the learning rate
    model.alpha -= 0.0002
    # fix the learning rate, no decay
    model.min_alpha = model.alpha
print("Done.")

  


iteration 0
iteration 1
iteration 2
...
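
One thing worth checking in the training loop above is the manual learning-rate schedule: `model.alpha` starts at 0.025 and is lowered by 0.0002 after every pass, so with `max_epochs = 1000` it reaches zero and then turns negative long before the loop ends. The arithmetic can be verified with exact fractions. `alpha_at` and `first_nonpositive_epoch` are helper names invented for this sketch, which assumes that each `train` call starts from the current `model.alpha`.

```python
from fractions import Fraction

START = Fraction("0.025")   # initial alpha from the training cell
STEP = Fraction("0.0002")   # amount subtracted after each pass


def alpha_at(epoch):
    """Learning rate in effect during the given zero-based pass."""
    return START - epoch * STEP


def first_nonpositive_epoch(max_epochs):
    """First pass whose learning rate is zero or negative, or None."""
    for epoch in range(max_epochs):
        if alpha_at(epoch) <= 0:
            return epoch
    return None


crossing = first_nonpositive_epoch(1000)
final_alpha = float(alpha_at(999))
```

Under these assumptions, keeping the number of passes below the crossing point or shrinking the step would keep the rate positive for the whole run.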

#### Here we test our embeddings with a query. We use the QUERY selected above (here "What is known about Covid-19 incubation?") and hope to find the documents that discuss it.

In [18]:
test_data = word_tokenize(QUERY.lower()) #change this query to test different things 
v1 = model.infer_vector(test_data)
print(QUERY)

What is known about Covid-19 incubation?


In [19]:
similar_doc = model.docvecs.most_similar([v1],topn = len(tagged_data))
print(similar_doc)

[('53', 0.9585644006729126), ('2', 0.9557319283485413), ('57', 0.9515500068664551), ('44', 0.9492290019989014), ('47', 0.9472017288208008), ('56', 0.9471619725227356), ('11', 0.9470564723014832), ('58', 0.9351019859313965), ('4', 0.9318826794624329), ('84', 0.9258043766021729), ('90', 0.9254163503646851), ('86', 0.9197592735290527), ('25', 0.9177792072296143), ('39', 0.9117348194122314), ('20', 0.911331057548523), ('29', 0.8868950605392456), ('59', 0.8069602847099304), ('68', 0.7970524430274963), ('67', 0.76641446352005), ('1', 0.6952894926071167), ('94', 0.694424569606781), ('65', 0.692328929901123), ('6', 0.6846479177474976), ('70', 0.6808638572692871), ('50', 0.6744297742843628), ('72', 0.6282376050949097), ('40', 0.6197509765625), ('9', 0.6145830750465393), ('51', 0.6060972213745117), ('85', 0.5939964056015015), ('18', 0.5676460266113281), ('60', 0.5560325980186462), ('32', 0.5553334951400757), ('14', 0.5230740904808044), ('37', 0.5217598676681519), ('22', 0.5160170197486877)]


In [20]:
results = {}
for doc in similar_doc:
    results[int(doc[0])] = doc[1]

d2v_results = []
for paper in filtered_results:
    score = filtered_results[paper]
    result = results[score[1]]
    freshed_score = (result*score[0],score[1])
    
    filtered_results[paper] = freshed_score
    d2v_results.append(freshed_score)
d2v_results.sort(reverse = True)
print(d2v_results)

[(1.2082465418276749, 70), (1.105989818458939, 44), (0.7364177955156412, 47), (0.6015592315044789, 50), (0.5007112635960242, 14), (0.09070004873282275, 25), (0.012284146927069112, 72), (0.01110413364791927, 29), (0.010898919771421281, 85), (0.00938956095434747, 20), (0.004123214304340037, 57), (0.002864962057021431, 67), (0.0026424570418683816, 9), (0.0022667995310501, 1), (0.0019901782086053227, 84), (0.0017645533572422866, 2), (0.0012045959674469443, 6), (nan, 94), (0.001391074992456606, 11), (0.001145104824421353, 56), (0.0008703313601326729, 39), (0.0005202893261972667, 86), (0.0005095720604692783, 90), (0.00048529013256631853, 65), (0.00037488492499289464, 22), (0.00034749639126352904, 60), (nan, 53), (0.00033165472710120966, 4), (0.0003213257493478709, 58), (0.00029903822287153086, 40), (0.0002917701945442595, 51), (0.00021141394415243525, 18), (0.00020356015326445805, 32), (0.0001723667585540915, 68), (0.00015354031602439843, 59), (8.229099856199205e-05, 37)]
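
Note that `(nan, 94)` and `(nan, 53)` sit in the middle of the sorted list above: every ordering comparison with `nan` is false, so `sort` cannot place those tuples reliably. Below is a small sketch of a ranking that pushes `nan` scores to the end. `rank_scores` is a hypothetical helper, and the sample values are shortened from the output above.

```python
import math


def rank_scores(scored):
    """Sort (score, paper_index) pairs by descending score, nan scores last."""
    valid = [pair for pair in scored if not math.isnan(pair[0])]
    missing = [pair for pair in scored if math.isnan(pair[0])]
    return sorted(valid, reverse=True) + missing


sample = [(0.0012, 6), (float("nan"), 94), (1.2082, 70), (0.0014, 11)]
ranked = rank_scores(sample)
```

Separating the `nan` pairs before sorting keeps the order of the valid scores well defined.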


In [41]:
print(all_papers[20])

Title
Title: First 12 patients with coronavirus disease

Abstract
Introduction: More than 93,000 cases of coronavirus disease have been reported worldwide. We describe the epidemiology, clinical course, and virologic characteristics of the first 12 U.S. patients with COVID-19.We collected demographic, exposure, and clinical information from 12 patients confirmed by CDC during January 20-February 5, 2020 to have COVID-19. Respiratory, stool, serum, and urine specimens were submitted for SARS-CoV-2 rRT-PCR testing, virus culture, and whole genome sequencing.Results: Among the 12 patients, median age was 53 years (range: 21-68); 8 were male, 10 had traveled to China, and two were contacts of patients in this series. Commonly reported signs and symptoms at illness onset were fever (n=7) and cough (n=8). Seven patients were hospitalized with radiographic evidence of pneumonia and demonstrated clinical or laboratory signs of worsening during the second week of illness. Three 

# 4.) bm25? 

In [None]:
from gensim.corpora import Dictionary