## Parsing the CORD-19 dataset


### Importing the parser and parsing the datasets

In [1]:
from Parser import *

In [2]:
#Creating a Parser and specifying what kind of dataset we want to parse
parser = Parser([Dataset.BIORXIV])
parser.parse(indexByFile = False);

### Example of accessing a paper by index

In [44]:
#We can access the data by index or by file name; the indexByFile argument of parse()
#selects which of the two is used
print(parser.data_dicts[Dataset.BIORXIV][3])

Title
Deep Learning-based Detection for COVID-19 from Chest CT using Weak Label

Abstract
Accurate and rapid diagnosis of COVID-19 suspected cases plays a crucial role in timely quarantine and medical treatment. Developing a deep learning-based model for automatic COVID-19 detection on chest CT is helpful to counter the outbreak of SARS-CoV-2. A weakly-supervised deep learning-based software system was developed using 3D CT volumes to detect COVID-19. For each patient, the lung region was segmented using a pre-trained UNet; then the segmented 3D lung region was fed into a 3D deep neural network to predict the probability of COVID-19 infectious. 499 CT volumes collected from Dec. 13, 2019, to Jan. 23, 2020, were used for training and 131 CT volumes collected from Jan 24, 2020, to Feb 6, 2020, were used for testing. The deep learning algorithm obtained 0.959 ROC AUC and 0.976 PR AUC. There was an operating point with 0.907 sensitivity and 0.911 specificity in the ROC curv

### Accessing certain elements of the paper

In [49]:
#The methods titles(), abstracts() and bodies() give access to individual parts of each paper
paper_titles = parser.titles()
for title in paper_titles[Dataset.BIORXIV].values():
    print(title)

Title
p53 is not necessary for DUX4 pathology


Title
Real-Time Estimation of the Risk of Death from Novel Coronavirus (COVID-19) Infection: Inference Using Exported Cases


Title
Potentially highly potent drugs for 2019-nCoV


Title
Deep Learning-based Detection for COVID-19 from Chest CT using Weak Label


Title
The Viral Protein Corona Directs Viral Pathogenesis and Amyloid Aggregation


Title
Significance of hydrophobic and charged sequence similarities in sodium-bile acid cotransporter and vitamin D-binding protein macrophage activating factor


Title
Dark Proteome of Newly Emerged SARS-CoV-2 in Comparison with Human and Bat Coronaviruses


Title
Title: Genome Detective Coronavirus Typing Tool for rapid identification and characterization of novel coronavirus genomes Short title: Automated tool for phylogenetic and mutational analysis of coronaviruses genomes


Title
Effect of SARS-CoV-2 infection upon male go

# Word2Vec demonstration

In [26]:
#User manual
#----------------------------------
#Install --> pip3 install gensim (gensim also requires numpy)
#Download the pretrained word2vec file --> https://code.google.com/archive/p/word2vec/
from gensim.models import KeyedVectors

In [27]:
#Load the pretrained vectors; the variable is named word2vec because the later cells use that name
word2vec = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)

In [24]:
#As you can see, the nearest neighbours of "coronavirus" are all other virus-related terms
word2vec.most_similar("coronavirus")

[('corona_virus', 0.7276226282119751),
 ('coronaviruses', 0.7216538190841675),
 ('paramyxovirus', 0.7113003730773926),
 ('SARS_coronavirus', 0.6601907014846802),
 ('arenavirus', 0.6494410037994385),
 ('influenza_virus', 0.6449826955795288),
 ('H#N#_subtype', 0.6360139846801758),
 ('H#N#_strain', 0.6324741840362549),
 ('H7_virus', 0.6261191964149475),
 ('flu_virus', 0.6249204874038696)]

In [25]:
#word2vec behaves like a dict: for a given word it returns a 300-dimensional vector. The more similar
#two words are, the more similar their vectors are (measured by cosine similarity).
word2vec["cure"]

array([-0.11914062,  0.0189209 , -0.02648926,  0.22460938, -0.17089844,
        0.4609375 ,  0.38867188, -0.19921875,  0.15429688, -0.00180054,
       -0.10693359, -0.26757812, -0.17089844,  0.17382812, -0.06982422,
        0.38085938,  0.10253906,  0.1171875 ,  0.14453125,  0.01409912,
        0.32226562,  0.45703125,  0.25976562,  0.06738281, -0.28515625,
        0.21289062, -0.20996094,  0.04418945, -0.14746094,  0.04296875,
       -0.22167969, -0.24707031, -0.24121094, -0.13574219, -0.15234375,
        0.02502441, -0.08203125, -0.328125  ,  0.44921875, -0.12988281,
        0.24414062,  0.01489258, -0.33203125, -0.14453125, -0.24023438,
       -0.11035156, -0.0300293 ,  0.06152344, -0.15917969, -0.12890625,
        0.02832031,  0.40039062, -0.046875  , -0.3203125 ,  0.09765625,
       -0.0859375 , -0.1171875 , -0.32421875, -0.0390625 , -0.09814453,
        0.41210938,  0.09765625,  0.19042969,  0.0859375 , -0.03710938,
        0.05688477,  0.05883789,  0.06640625,  0.0703125 ,  0.20
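Since the comment above talks about cosine similarity, here is a minimal sketch of how it is computed. It uses toy 3-dimensional vectors instead of the real 300-dimensional embeddings; the function name `cosine_similarity` and the vector values are made up for illustration.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 3-dimensional vectors (illustrative values, not real embeddings)
v_virus = [1.0, 2.0, 0.0]
v_flu = [2.0, 4.0, 0.0]   # points in the same direction as v_virus
v_car = [0.0, 0.0, 3.0]   # orthogonal to v_virus

sim_same = cosine_similarity(v_virus, v_flu)
sim_orth = cosine_similarity(v_virus, v_car)
print(sim_same, sim_orth)
```

Vectors pointing in the same direction score close to 1 regardless of their length, and orthogonal vectors score close to 0, which is why the measure suits embeddings.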

# Doc2Vec demonstration

### Now for the big guns! Doc2Vec builds on word2vec, but it is trained on our own dataset, so it learns a vector for each document as well as for each word that appears in it. That means terms like Coronavirus, Covid19 and Wuhan are in the model's vocabulary. The pretrained word2vec model, in contrast, cannot recognize covid19, because the term is newer than its training data.

In [36]:
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from nltk.tokenize import word_tokenize

#### Here we build our own training set: we take our papers (parser.toList() returns all papers in the dataset), tokenize them, and tag each one with its index.

In [37]:
tagged_data = [TaggedDocument(words=word_tokenize(_d.lower()), tags=[str(i)]) for i, _d in enumerate(parser.toList())]

#### This is the training part. The model learns its own embeddings from our dataset: a 20-dimensional vector for every word and for every tagged document. Words used in similar contexts end up with similar vectors, so e.g. the vectors for coronavirus and covid19 should be close.

In [45]:
max_epochs = 120
vec_size = 20 #the pretrained word2vec vectors have 300 dimensions; we use 20 here
alpha = 0.025

model = Doc2Vec(vector_size=vec_size,
                alpha=alpha,
                min_alpha=0.00025,
                min_count=1,
                dm=1)

model.build_vocab(tagged_data)

for epoch in range(max_epochs):
    #print('iteration {0}'.format(epoch))
    #model.epochs replaces the deprecated model.iter attribute
    model.train(tagged_data, total_examples=model.corpus_count, epochs=model.epochs)
    # decrease the learning rate manually after each call to train()
    model.alpha -= 0.0002
    # pin min_alpha to alpha so train() does not decay the rate on its own
    model.min_alpha = model.alpha
print("Done.")



Done.
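To see what the manual learning-rate schedule in the training loop does, the sketch below replays only the arithmetic, with no model involved. The values 0.025, 0.0002 and 120 come from the cell above; the names `step` and `alphas` are only for illustration.

```python
# Replay the manual learning-rate schedule without training anything:
# start at 0.025 and subtract 0.0002 after every iteration.
max_epochs = 120
alpha = 0.025
step = 0.0002

alphas = []
for epoch in range(max_epochs):
    alphas.append(alpha)   # rate used during this iteration
    alpha -= step          # rate for the next iteration

first_alpha = alphas[0]
last_alpha = alphas[-1]    # rate used in the final iteration
print(round(first_alpha, 4), round(last_alpha, 4), round(alpha, 4))
```

The rate falls linearly from 0.025 to roughly 0.001, so the last iterations only make small adjustments to the vectors.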


#### Here we test our embeddings with a query. The query is "Coronavirus transmission", and we hope to find the documents that discuss coronavirus transmission.

In [46]:
test_data = word_tokenize("Coronavirus transmission".lower()) #change this query to test different things 
v1 = model.infer_vector(test_data)

#### Here we look up the document vectors closest to our query. most_similar() returns pairs of document id and cosine similarity to the query. E.g. ('43', 0.834) means that the document with id 43 has a similarity of 0.834 with the query. (This is a similarity score, not a percentage, but higher means more similar.)

In [47]:
#finding the most similar documents (in gensim 4 and later, model.docvecs is called model.dv)
similar_doc = model.docvecs.most_similar([v1])
print(similar_doc)

[('57', 0.5976769924163818), ('82', 0.5365582704544067), ('13', 0.5226129293441772), ('44', 0.476043164730072), ('60', 0.46594515442848206), ('58', 0.4617353081703186), ('46', 0.44316622614860535), ('76', 0.41599059104919434), ('37', 0.4150567650794983), ('80', 0.4132328927516937)]
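The ranking above can be pictured as "compute the cosine similarity between the query vector and every document vector, then sort". The sketch below does that with numpy on toy 2-dimensional vectors. `top_k_similar` is a hypothetical helper written for this illustration, not gensim's implementation.

```python
import numpy as np

def top_k_similar(query, doc_vectors, k=3):
    """Return (doc_id, cosine similarity) pairs for the k closest documents."""
    docs = np.asarray(doc_vectors, dtype=float)
    q = np.asarray(query, dtype=float)
    # Normalise so that a dot product equals the cosine similarity
    docs_unit = docs / np.linalg.norm(docs, axis=1, keepdims=True)
    q_unit = q / np.linalg.norm(q)
    sims = docs_unit @ q_unit
    order = np.argsort(-sims)[:k]   # indices sorted by descending similarity
    return [(str(i), float(sims[i])) for i in order]

# Toy 2-dimensional "document vectors" (illustrative values only)
doc_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
ranking = top_k_similar([1.0, 0.1], doc_vectors, k=2)
print(ranking)
```

As in the real output, each result is a pair of a string id and a similarity score, with the best match first.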


### This is the document most similar to our query within our dataset of 100 papers.

In [49]:
#print(tagged_data[33])
print(parser.toList()[57])

Population movement, city closure and spatial transmission of the 2019-nCoV 1 infection in China 2The outbreak of pneumonia caused by a novel coronavirus (2019-nCoV) in Wuhan 17City of China obtained global concern, the population outflow from Wuhan has 18 contributed to spatial expansion in other parts of China. We examined the effects of 19 population outflow from Wuhan on the 2019-nCoV transmission in other provinces 20 and cities of China, as well as the impacts of the city closure in Wuhan. We observed 21 a significantly positive association between population movement and the number of 22 cases. Further analysis revealed that if the city closure policy was implemented two 23 days earlier, 1420 (95% CI: 1059, 1833) cases could be prevented, and if two days 24 later, 1462 (95% CI: 1090, 1886) more cases would be possible. Our findings suggest 25 that population movement might be one important trigger of the 2019-nCoV infection 26 transmission in China, and the policy of city closur

## Topic modelling - Demonstration

### First we need to preprocess the data - Lemmatization and Tokenization

In [40]:
from nltk.stem import WordNetLemmatizer
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS

In [119]:
#Create the lemmatizer once instead of on every call
lemmatizer = WordNetLemmatizer()

def lemmatize(word):
    #pos="v" lemmatizes each token as a verb (e.g. "learning" -> "learn")
    return lemmatizer.lemmatize(word, pos="v")

#part_of_paper can be "title", "abstract", "body" or "whole",
#depending on which part of the text you want to tokenize
def preprocess(paper, part_of_paper="whole"):
    if part_of_paper == "title":
        text = paper.title
    elif part_of_paper == "abstract":
        text = paper.abstract
    elif part_of_paper == "body":
        text = paper.body
    else:
        text = paper.whole_text

    #Drop stop words and tokens of 3 characters or fewer, then lemmatize the rest
    tokens = []
    for token in simple_preprocess(text):
        if token not in STOPWORDS and len(token) > 3:
            tokens.append(lemmatize(token))

    return tokens
    

In [120]:
print("----------------------Before preprocessing----------------------")
print(tokenize(parser.data_dicts[Dataset.BIORXIV][3])[:10])
print("----------------------After preprocessing-----------------------")
print(preprocess(parser.data_dicts[Dataset.BIORXIV][3])[:10])

----------------------Before preprocessing----------------------
['deep', 'learning', 'based', 'detection', 'for', 'covid', 'from', 'chest', 'ct', 'using']
----------------------After preprocessing-----------------------
['deep', 'learn', 'base', 'detection', 'covid', 'chest', 'weak', 'labelaccurate', 'rapid', 'diagnosis']
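To show why tokens such as 'for', 'from', 'ct' and 'using' disappear after preprocessing, here is a plain-Python sketch of the filtering step only. The stop-word set below is a tiny made-up stand-in for gensim's STOPWORDS, and lemmatization is left out because it needs the WordNet data.

```python
# Illustrative stop-word list; gensim's STOPWORDS set is much larger.
TOY_STOPWORDS = {"for", "from", "using", "the", "and"}

def filter_tokens(tokens, stopwords=TOY_STOPWORDS, min_len=4):
    """Keep tokens that are not stop words and have at least min_len characters."""
    return [t for t in tokens if t not in stopwords and len(t) >= min_len]

# The first ten tokens from the "before preprocessing" output
before = ['deep', 'learning', 'based', 'detection', 'for',
          'covid', 'from', 'chest', 'ct', 'using']
after = filter_tokens(before)
print(after)
```

The real preprocess() additionally lemmatizes each surviving token, which is what turns 'learning' into 'learn' and 'based' into 'base'.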


## Now let's make a dictionary of all words that appear in our dataset

In [121]:
from gensim.corpora import Dictionary

In [122]:
documents = []

papers = parser.data_dicts[Dataset.BIORXIV] #all papers from BIORXIV dataset
for index in papers:
    paper = papers[index]
    documents.append(preprocess(paper))

dictionary = Dictionary(documents)

### The first words in our dictionary

In [123]:
count = 0
for k, v in dictionary.iteritems():
    print(k, v)
    count += 1
    if count > 10:
        break

0 absence
1 acetylation
2 achieve
3 act
4 activate
5 activation
6 actively
7 add
8 adeno
9 adipogenic
10 affect


#### Filtering the top n most frequent tokens

In [124]:
dictionary.filter_extremes(keep_n=100000)

## It's time for bag-of-words. We convert each document into its bag-of-words representation.

In [125]:
bag_of_words_corpus = [dictionary.doc2bow(document) for document in documents]
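The two steps above (building a dictionary, then converting each document to bag-of-words) can be sketched in plain Python. This is only an illustration of the idea, not gensim's implementation: the names `documents_toy`, `token2id` and `to_bow` are invented here, and the way ids are assigned is an arbitrary choice.

```python
from collections import Counter

documents_toy = [
    ["covid", "patient", "covid", "wuhan"],
    ["protein", "bind", "protein", "covid"],
]

# Assign one integer id per distinct token (sorted order is an arbitrary choice here)
vocabulary = sorted({token for doc in documents_toy for token in doc})
token2id = {token: idx for idx, token in enumerate(vocabulary)}

def to_bow(doc, token2id):
    """Return sorted (token_id, count) pairs for tokens known to the vocabulary."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())

bow_corpus_toy = [to_bow(doc, token2id) for doc in documents_toy]
print(token2id)
print(bow_corpus_toy)
```

Each document becomes a list of (word id, count) pairs, and word order is discarded, which is exactly what "bag" refers to.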

### Example for one document

In [126]:
#Let's pick an arbitrary document index

bag_of_words_document_34 = bag_of_words_corpus[34]
print(bag_of_words_document_34)

#Let's explain the output
# (6,2) -> 6 is the id of the word in the dictionary
#       -> 2 is the number of times the word occurs in the document

[(7, 10), (8, 1), (15, 10), (30, 1), (33, 5), (37, 3), (54, 2), (57, 7), (77, 1), (78, 1), (83, 2), (91, 1), (97, 35), (105, 1), (119, 2), (120, 1), (125, 3), (144, 2), (166, 3), (169, 2), (176, 11), (185, 1), (189, 1), (193, 2), (195, 1), (234, 1), (237, 7), (247, 1), (249, 16), (254, 1), (255, 1), (264, 1), (266, 2), (271, 3), (273, 16), (274, 5), (275, 1), (278, 2), (280, 3), (281, 1), (285, 5), (289, 3), (290, 1), (291, 2), (300, 2), (303, 23), (306, 1), (307, 2), (309, 1), (314, 1), (318, 4), (319, 2), (322, 1), (325, 1), (326, 3), (327, 2), (329, 2), (331, 2), (337, 2), (338, 23), (341, 1), (344, 2), (346, 3), (349, 1), (350, 11), (351, 2), (352, 1), (355, 6), (357, 2), (359, 26), (360, 3), (367, 1), (369, 2), (370, 5), (371, 5), (374, 1), (376, 13), (379, 12), (380, 2), (383, 1), (384, 1), (396, 7), (400, 1), (403, 1), (411, 5), (414, 33), (415, 28), (416, 3), (417, 1), (418, 1), (423, 7), (424, 3), (425, 1), (429, 1), (440, 1), (445, 15), (447, 1), (456, 2), (458, 2), (459, 1),

### More visual example

In [127]:
for word_id, count in bag_of_words_document_34:
    print("Word {} (\"{}\") appears {} time.".format(word_id, dictionary[word_id], count))

Word 7 ("affect") appears 10 time.
Word 8 ("animal") appears 1 time.
Word 15 ("basic") appears 10 time.
Word 30 ("clear") appears 1 time.
Word 33 ("combine") appears 5 time.
Word 37 ("construct") appears 3 time.
Word 54 ("difficult") appears 2 time.
Word 57 ("directly") appears 7 time.
Word 77 ("explain") appears 1 time.
Word 78 ("expose") appears 1 time.
Word 83 ("fact") appears 2 time.
Word 91 ("functional") appears 1 time.
Word 97 ("growth") appears 35 time.
Word 105 ("imply") appears 1 time.
Word 119 ("line") appears 2 time.
Word 120 ("link") appears 1 time.
Word 125 ("make") appears 3 time.
Word 144 ("normal") appears 2 time.
Word 166 ("predict") appears 3 time.
Word 169 ("primary") appears 2 time.
Word 176 ("propagate") appears 11 time.
Word 185 ("reason") appears 1 time.
Word 189 ("reduction") appears 1 time.
Word 193 ("relatively") appears 2 time.
Word 195 ("relevant") appears 1 time.
Word 234 ("treat") appears 1 time.
Word 237 ("underlie") appears 7 time.
Word 247 ("word") app

## Besides bag-of-words we can also use TF-IDF, which down-weights words that occur in many documents and often gives more informative features.

In [128]:
from gensim import corpora, models
from pprint import pprint

In [129]:
tfidf = models.TfidfModel(bag_of_words_corpus)
corpus_tfidf = tfidf[bag_of_words_corpus]

In [130]:
#Example output
# (0,0.32) -> 0 is the id of the term in the dictionary
#          -> 0.32 is the TF-IDF weight of that term in this document
for doc in corpus_tfidf:
    pprint(doc)
    break

[(0, 0.09769017238665445),
 (1, 0.01061565164204266),
 (2, 0.020660439663961923),
 (3, 0.020660439663961923),
 (4, 0.01997696044845842),
 (5, 0.02998037611460888),
 (6, 0.011518049469993385),
 (7, 0.015629019266530028),
 (8, 0.03628945893049195),
 (9, 0.03628945893049195),
 (10, 0.03377374873177601),
 (11, 0.014761364006274133),
 (12, 0.03550693994808871),
 (13, 0.16330256518721378),
 (14, 0.10132124619532804),
 (15, 0.015186846739026929),
 (16, 0.051918478197021986),
 (17, 0.01656905971474332),
 (18, 0.02998037611460888),
 (19, 0.02998037611460888),
 (20, 0.041320879327923846),
 (21, 0.041320879327923846),
 (22, 0.03377374873177601),
 (23, 0.028702713696157706),
 (24, 0.1400728013036529),
 (25, 0.13429285781575248),
 (26, 0.048267733700932965),
 (27, 0.018144729465245975),
 (28, 0.01357353436826171),
 (29, 0.03377374873177601),
 (30, 0.01061565164204266),
 (31, 0.030373693478053857),
 (32, 0.02998037611460888),
 (33, 0.008754550081478307),
 (34, 0.024884716487559244),
 (35, 0.00875455
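To make the weights above less mysterious, here is a sketch of one common TF-IDF variant on a toy corpus: raw count times log2(N / document frequency), with each document then scaled to unit length. To my understanding this mirrors gensim's default weighting, but treat that as an assumption. The toy corpus and the helper name `tfidf` are made up.

```python
import math

# Toy bag-of-words corpus: each document is a list of (token_id, count) pairs
bow_corpus_toy = [
    [(0, 2), (1, 1), (2, 1)],
    [(1, 1), (2, 3)],
    [(1, 2), (3, 1)],
]

n_docs = len(bow_corpus_toy)

# Document frequency: in how many documents each token id occurs
doc_freq = {}
for doc in bow_corpus_toy:
    for token_id, _ in doc:
        doc_freq[token_id] = doc_freq.get(token_id, 0) + 1

def tfidf(doc):
    """Weight = count * log2(N / df), then scale the document to unit length."""
    weights = [(tid, cnt * math.log2(n_docs / doc_freq[tid])) for tid, cnt in doc]
    weights = [(tid, w) for tid, w in weights if w > 0.0]   # drop zero-weight terms
    norm = math.sqrt(sum(w * w for _, w in weights))
    return [(tid, w / norm) for tid, w in weights] if norm else []

tfidf_corpus_toy = [tfidf(doc) for doc in bow_corpus_toy]
print(tfidf_corpus_toy)
```

Token 1 occurs in every document, so its weight is zero and it drops out, while the rarer token 0 dominates the first document. That is the down-weighting effect TF-IDF is used for.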

# Finally! We are ready for our first topic modelling algorithm.

### Here we will use LDA, short for Latent Dirichlet Allocation. You can picture it as a soft-clustering algorithm, similar in spirit to a GMM, applied to topic modelling. It looks at which words occur together across documents and learns a set of topics, each a probability distribution over words; every document is then described as a mixture of those topics.
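One way to build intuition for LDA is to run its generative story forwards: draw a topic mixture for a document, then for every word pick a topic and draw a word from it. The sketch below does only that, on a made-up six-word vocabulary with random topics. Fitting LDA works in the opposite direction, inferring the topics and mixtures from the observed documents.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab = ["patient", "hospital", "protein", "bind", "network", "train"]
n_topics = 3

# Each topic is a probability distribution over the vocabulary
topic_word = rng.dirichlet(np.ones(len(vocab)), size=n_topics)

def generate_document(n_words, alpha=0.5):
    """Sample one document following LDA's generative story."""
    # Per-document mixture over topics
    doc_topics = rng.dirichlet(alpha * np.ones(n_topics))
    words = []
    for _ in range(n_words):
        z = rng.choice(n_topics, p=doc_topics)        # pick a topic for this word
        w = rng.choice(len(vocab), p=topic_word[z])   # pick a word from that topic
        words.append(vocab[w])
    return doc_topics, words

doc_topics, words = generate_document(8)
print(np.round(doc_topics, 3), words)
```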

In [131]:
from gensim.models import LdaModel

### Here we train the model. We use our TF-IDF corpus and select 5 topics. id2word tells the model to use our dictionary to map word ids back to words when printing topics (remove it to see the difference).

In [136]:
lda = LdaModel(corpus_tfidf, num_topics=5,id2word = dictionary,passes=4)

#### This output needs some explanation. We have 5 topics, i.e. five word distributions that together describe our dataset. Each topic is a distribution over words, and print_topic shows its highest-weighted words. For example, in Topic 0 the word "patients" has weight 0.006, "covid" has weight 0.004, and so on. Note also that Topic 0 is clearly different from Topic 2.

In [137]:
for index in range(5):
    print('\033[1m'+"Topic "+'\033[0m'+str(index)+":")
    print(lda.print_topic(index))
    print("-------------------------------------------")
    

[1mTopic [0m0:
0.006*"patients" + 0.004*"covid" + 0.004*"medrxiv" + 0.003*"perpetuity" + 0.003*"grant" + 0.002*"wuhan" + 0.002*"cells" + 0.002*"read" + 0.002*"hospital" + 0.002*"january"
-------------------------------------------
[1mTopic [0m1:
0.002*"heat" + 0.002*"antibody" + 0.002*"transition" + 0.002*"interval" + 0.001*"reproductive" + 0.001*"medrxiv" + 0.001*"outflow" + 0.001*"january" + 0.001*"covid" + 0.001*"variants"
-------------------------------------------
[1mTopic [0m2:
0.003*"protein" + 0.002*"proteins" + 0.002*"bind" + 0.002*"permission" + 0.002*"epitopes" + 0.002*"reuse" + 0.002*"ncov" + 0.002*"codon" + 0.002*"reserve" + 0.002*"structure"
-------------------------------------------
[1mTopic [0m3:
0.002*"train" + 0.002*"filter" + 0.002*"learn" + 0.002*"deep" + 0.001*"host" + 0.001*"strength" + 0.001*"speed" + 0.001*"network" + 0.001*"recovery" + 0.001*"covid"
-------------------------------------------
[1mTopic [0m4:
0.002*"protein" + 0.001*"adaptation" + 0.00

### Now that we have the main topics for this dataset, we want to see which topic each individual paper belongs to. To do that, we use the following code.

In [139]:
#Let's test with the document at index 1! This paper is clearly about Covid and the coronavirus,
#so we expect its topic to reflect that :)
print(documents[1])

['real', 'time', 'estimation', 'risk', 'death', 'novel', 'coronavirus', 'covid', 'infection', 'inference', 'export', 'casesthe', 'export', 'case', 'novel', 'coronavirus', 'infection', 'confirm', 'outside', 'china', 'provide', 'opportunity', 'estimate', 'cumulative', 'incidence', 'confirm', 'case', 'fatality', 'risk', 'ccfr', 'mainland', 'china', 'knowledge', 'ccfr', 'critical', 'characterize', 'severity', 'understand', 'pandemic', 'potential', 'covid', 'early', 'stage', 'epidemic', 'exponential', 'growth', 'rate', 'incidence', 'present', 'study', 'statistically', 'estimate', 'ccfr', 'basic', 'reproduction', 'number', 'average', 'number', 'secondary', 'case', 'generate', 'single', 'primary', 'case', 'naïve', 'population', 'model', 'epidemic', 'growth', 'single', 'index', 'case', 'illness', 'onset', 'december', 'scenario', 'growth', 'rate', 'fit', 'parameters', 'scenario', 'base', 'data', 'export', 'case', 'report', 'january', 'cumulative', 'incidence', 'china', 'january', 'estimate', 'c

### Here we apply LDA to the document with index 1 (more precisely, to its TF-IDF representation) to see its distribution over topics. This paper is about 93 percent Topic 0, and each of the other topics gets about 1.8%. So the main topic of this paper is the one about patients, covid and wuhan. This gives us a clue as to which papers are actually about the pandemic.

In [140]:
lda[corpus_tfidf[1]]

[(0, 0.9284133),
 (1, 0.017921286),
 (2, 0.017866932),
 (3, 0.017814443),
 (4, 0.017984083)]
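Once we have such a distribution, picking the dominant topic is a one-liner. The sketch below copies the values printed above for the document with index 1; the variable names are only for illustration.

```python
# Topic distribution printed above for the document with index 1
topic_distribution = [(0, 0.9284133), (1, 0.017921286), (2, 0.017866932),
                      (3, 0.017814443), (4, 0.017984083)]

# The dominant topic is the pair with the largest probability
dominant_topic, dominant_prob = max(topic_distribution, key=lambda pair: pair[1])
total_prob = sum(prob for _, prob in topic_distribution)

print(dominant_topic, round(dominant_prob, 3), round(total_prob, 3))
```

Running the same selection over every paper in the corpus would give a simple way to group the papers by their main topic.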