## Parsing the CORD-19 dataset


### Importing the parser and parsing the datasets

In [1]:
from Parser import *

In [2]:
#Creating a Parser and specifying what kind of dataset we want to parse
parser = Parser([Dataset.BIORXIV])
parser.parse(indexByFile = False);

### Example of accessing a paper by index

In [44]:
#We can access the data by index or by file name; the indexByFile argument
#of parse() selects which kind of lookup we want
print(parser.data_dicts[Dataset.BIORXIV][3])

Title
Deep Learning-based Detection for COVID-19 from Chest CT using Weak Label

Abstract
Accurate and rapid diagnosis of COVID-19 suspected cases plays a crucial role in timely quarantine and medical treatment. Developing a deep learning-based model for automatic COVID-19 detection on chest CT is helpful to counter the outbreak of SARS-CoV-2. A weakly-supervised deep learning-based software system was developed using 3D CT volumes to detect COVID-19. For each patient, the lung region was segmented using a pre-trained UNet; then the segmented 3D lung region was fed into a 3D deep neural network to predict the probability of COVID-19 infectious. 499 CT volumes collected from Dec. 13, 2019, to Jan. 23, 2020, were used for training and 131 CT volumes collected from Jan 24, 2020, to Feb 6, 2020, were used for testing. The deep learning algorithm obtained 0.959 ROC AUC and 0.976 PR AUC. There was an operating point with 0.907 sensitivity and 0.911 specificity in the ROC curv

### Accessing certain elements of the paper

In [49]:
#The methods titles(), abstracts() and bodies() give access to the corresponding elements of each paper
paper_titles = parser.titles()
for title in paper_titles[Dataset.BIORXIV].values():
    print(title)

Title
p53 is not necessary for DUX4 pathology


Title
Real-Time Estimation of the Risk of Death from Novel Coronavirus (COVID-19) Infection: Inference Using Exported Cases


Title
Potentially highly potent drugs for 2019-nCoV


Title
Deep Learning-based Detection for COVID-19 from Chest CT using Weak Label


Title
The Viral Protein Corona Directs Viral Pathogenesis and Amyloid Aggregation


Title
Significance of hydrophobic and charged sequence similarities in sodium-bile acid cotransporter and vitamin D-binding protein macrophage activating factor


Title
Dark Proteome of Newly Emerged SARS-CoV-2 in Comparison with Human and Bat Coronaviruses


Title
Title: Genome Detective Coronavirus Typing Tool for rapid identification and characterization of novel coronavirus genomes Short title: Automated tool for phylogenetic and mutational analysis of coronaviruses genomes


Title
Effect of SARS-CoV-2 infection upon male go

# Word2Vec demonstration

In [26]:
#User manual
#----------------------------------
#Install --> pip3 install gensim (gensim also requires numpy)
#Download the pretrained word2vec file --> https://code.google.com/archive/p/word2vec/
from gensim.models import KeyedVectors

In [27]:
#Here we load the already pretrained vectors; importing KeyedVectors directly
#keeps the name word2vec free for the loaded vectors instead of shadowing a module
word2vec = KeyedVectors.load_word2vec_format('data/GoogleNews-vectors-negative300.bin', binary=True)

In [24]:
#As you can see, the nearest neighbours of "coronavirus" are all other virus terms (cosine similarity between roughly 0.62 and 0.73)
word2vec.most_similar("coronavirus")

[('corona_virus', 0.7276226282119751),
 ('coronaviruses', 0.7216538190841675),
 ('paramyxovirus', 0.7113003730773926),
 ('SARS_coronavirus', 0.6601907014846802),
 ('arenavirus', 0.6494410037994385),
 ('influenza_virus', 0.6449826955795288),
 ('H#N#_subtype', 0.6360139846801758),
 ('H#N#_strain', 0.6324741840362549),
 ('H7_virus', 0.6261191964149475),
 ('flu_virus', 0.6249204874038696)]

In [25]:
#So word2vec is basically a dict: for each known word it returns a 300-dimensional vector.
#The more similar two words are, the more similar their vectors are (measured by cosine similarity).
word2vec["cure"]

array([-0.11914062,  0.0189209 , -0.02648926,  0.22460938, -0.17089844,
        0.4609375 ,  0.38867188, -0.19921875,  0.15429688, -0.00180054,
       -0.10693359, -0.26757812, -0.17089844,  0.17382812, -0.06982422,
        0.38085938,  0.10253906,  0.1171875 ,  0.14453125,  0.01409912,
        0.32226562,  0.45703125,  0.25976562,  0.06738281, -0.28515625,
        0.21289062, -0.20996094,  0.04418945, -0.14746094,  0.04296875,
       -0.22167969, -0.24707031, -0.24121094, -0.13574219, -0.15234375,
        0.02502441, -0.08203125, -0.328125  ,  0.44921875, -0.12988281,
        0.24414062,  0.01489258, -0.33203125, -0.14453125, -0.24023438,
       -0.11035156, -0.0300293 ,  0.06152344, -0.15917969, -0.12890625,
        0.02832031,  0.40039062, -0.046875  , -0.3203125 ,  0.09765625,
       -0.0859375 , -0.1171875 , -0.32421875, -0.0390625 , -0.09814453,
        0.41210938,  0.09765625,  0.19042969,  0.0859375 , -0.03710938,
        0.05688477,  0.05883789,  0.06640625,  0.0703125 ,  0.20
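
Since everything above relies on cosine similarity, here is a minimal sketch of how that score can be computed. The three vectors are made-up 3-dimensional toys, not real embeddings, and the pure-Python helper `cosine_similarity` is a hypothetical illustration rather than the routine gensim uses internally.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy 3-dimensional vectors (made up for illustration, not real embeddings)
virus = [0.9, 0.1, 0.3]
flu = [0.8, 0.2, 0.4]
banana = [-0.2, 0.9, -0.5]

print(cosine_similarity(virus, flu))     # close to 1: the vectors point in a similar direction
print(cosine_similarity(virus, banana))  # negative: the vectors point in different directions
```

The score only depends on the direction of the vectors, not on their length, which is why it works well for comparing embeddings.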

# Doc2Vec demonstration

### Now we bring out the big guns! Doc2Vec builds on word2vec: besides word vectors it also learns a vector for every document, and because we train it on our own dataset, its vocabulary covers exactly the words that appear there. That means words like Coronavirus, Covid19, Wuhan and other important phrases will be recognized by our model. In contrast, the pretrained word2vec vectors couldn't recognize covid19, because that is a new term for this disease.

In [36]:
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from nltk.tokenize import word_tokenize

#### Here we build our own training set. We take our own papers (parser.toList() returns all papers in the dataset), tokenize them, and tag each one with its index.

In [37]:
tagged_data = [TaggedDocument(words=word_tokenize(_d.lower()), tags=[str(i)]) for i, _d in enumerate(parser.toList())]
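
To make the shape of the tagged data easier to see, here is a small sketch that mimics the cell above without gensim or nltk. `TaggedDoc` is a hypothetical stand-in for `TaggedDocument`, `str.split()` is a simplified stand-in for `word_tokenize`, and the two paper strings are invented.

```python
from collections import namedtuple

# Stand-in for gensim's TaggedDocument (assumption: only the two fields used above)
TaggedDoc = namedtuple("TaggedDoc", ["words", "tags"])

# Hypothetical papers; in the notebook these come from parser.toList()
papers = [
    "Coronavirus transmission in Wuhan",
    "Deep learning for chest CT",
]

# str.split() is a simplified stand-in for nltk's word_tokenize
tagged = [
    TaggedDoc(words=text.lower().split(), tags=[str(i)])
    for i, text in enumerate(papers)
]

for doc in tagged:
    print(doc.tags, doc.words)
```

Each document ends up as a list of lowercase tokens plus a one-element tag list holding its position as a string, which is the id that shows up later in the similarity results.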

#### This is the training part. Here we make our own word embeddings, in other words our own word2vec: for every word in our dataset the model learns a vector in 20-dimensional space. Vectors will be similar if the words they represent are similar, e.g. the vectors for coronavirus and covid19 should end up close to each other.

In [45]:
max_epochs = 120
vec_size = 20 #the pretrained word2vec vectors have 300 dimensions, but we use 20 here
alpha = 0.025

model = Doc2Vec(vector_size=vec_size,
                alpha=alpha, 
                min_alpha=0.00025,
                min_count=1,
                dm =1)

model.build_vocab(tagged_data)

for epoch in range(max_epochs):
    #print('iteration {0}'.format(epoch))
    #model.epochs replaces the deprecated model.iter attribute
    model.train(tagged_data, total_examples=model.corpus_count, epochs=model.epochs)
    #decrease the learning rate for the next pass
    model.alpha -= 0.0002
    #keep the rate constant within a pass (no internal decay)
    model.min_alpha = model.alpha
print("Done.")


Done.
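
The manual learning-rate schedule in the training loop is easy to check by hand. The sketch below replays only the arithmetic (no model involved), using the same starting rate, step, and number of passes as the cell above; the list `alphas` is a hypothetical helper for inspection.

```python
alpha = 0.025      # starting learning rate, as in the training cell
step = 0.0002      # amount subtracted after each pass
max_epochs = 120

alphas = []
for epoch in range(max_epochs):
    alphas.append(alpha)   # rate used during this pass
    alpha -= step          # rate for the next pass

print(round(alphas[0], 4))    # rate on the first pass
print(round(alphas[-1], 4))   # rate on the last pass
print(round(alpha, 4))        # value left in alpha after the loop
```

With these values the rate falls linearly from 0.025 and stays positive for all 120 passes, ending near 0.001; a larger step or more passes would eventually push it below zero.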


#### Here we test our embeddings with a query. Our query is "Coronavirus transmission", and we hope to find the documents that talk about coronavirus transmission.

In [46]:
test_data = word_tokenize("Coronavirus transmission".lower()) #change this query to test different things 
v1 = model.infer_vector(test_data)

#### Here we look up the document embeddings closest to our query. most_similar() returns a list of (id, similarity) pairs, e.g. ('43', 0.834) means that the document with id 43 has a cosine similarity of 0.834 to the query. (This is a similarity score, not really a percentage, but higher means more similar, and that's the gist :) )

In [47]:
#finding the most similar doc
similar_doc = model.docvecs.most_similar([v1])
print(similar_doc)

[('57', 0.5976769924163818), ('82', 0.5365582704544067), ('13', 0.5226129293441772), ('44', 0.476043164730072), ('60', 0.46594515442848206), ('58', 0.4617353081703186), ('46', 0.44316622614860535), ('76', 0.41599059104919434), ('37', 0.4150567650794983), ('80', 0.4132328927516937)]
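
Because each tag was created as `str(i)`, the ids in this result can be turned back into list positions with `int()`. The sketch below shows that lookup on invented data; `papers`, `similar_doc`, and the helper `top_matches` are hypothetical stand-ins for `parser.toList()` and the output of `most_similar()`.

```python
# Hypothetical stand-ins: in the notebook, `papers` would be parser.toList()
# and `similar_doc` the list returned by most_similar()
papers = ["paper zero text", "paper one text", "paper two text"]
similar_doc = [("2", 0.61), ("0", 0.48)]

def top_matches(results, documents, preview_chars=40):
    """Pair each (tag, score) result with a short preview of its document."""
    matches = []
    for tag, score in results:
        text = documents[int(tag)]  # tags were created as str(i), so int() recovers the index
        matches.append((tag, round(score, 3), text[:preview_chars]))
    return matches

for tag, score, preview in top_matches(similar_doc, papers):
    print(tag, score, preview)
```

Printing a short preview for every hit makes it quicker to judge whether the ranking is sensible than inspecting one document at a time.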


### This is the document most similar to our query within our dataset of 100 papers.

In [49]:
#print(tagged_data[33])
print(parser.toList()[57])

Population movement, city closure and spatial transmission of the 2019-nCoV 1 infection in China 2The outbreak of pneumonia caused by a novel coronavirus (2019-nCoV) in Wuhan 17City of China obtained global concern, the population outflow from Wuhan has 18 contributed to spatial expansion in other parts of China. We examined the effects of 19 population outflow from Wuhan on the 2019-nCoV transmission in other provinces 20 and cities of China, as well as the impacts of the city closure in Wuhan. We observed 21 a significantly positive association between population movement and the number of 22 cases. Further analysis revealed that if the city closure policy was implemented two 23 days earlier, 1420 (95% CI: 1059, 1833) cases could be prevented, and if two days 24 later, 1462 (95% CI: 1090, 1886) more cases would be possible. Our findings suggest 25 that population movement might be one important trigger of the 2019-nCoV infection 26 transmission in China, and the policy of city closur