# Data Explainability and Visualization
In this notebook, we perform exploratory data analysis on the published COVID-19 literature and try to answer some basic questions, such as:

1. What is known about transmission, incubation, and environmental stability?
2. What do we know about COVID-19 risk factors?
3. What is the best medical care?
4. How does COVID-19 spread and evolve?
5. Which vaccines, if any, look most promising?

Locating where the literature addresses these questions makes it easier to resolve common doubts about the disease.

In [147]:
# Standard library
import glob
import json
import os
import pickle
import random
import re

# Third-party
import gensim
import matplotlib.pyplot as plt
import nltk
import numpy as np
import pandas as pd
import tqdm
from gensim import corpora
from IPython.display import display
from matplotlib import style
from nltk.corpus import wordnet as wn
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# Download the NLTK resources used below (skipped if already up to date).
# Note: word_tokenize additionally relies on NLTK's 'punkt' tokenizer data.
nltk.download('wordnet')
nltk.download('stopwords')
en_stop = set(nltk.corpus.stopwords.words('english'))

style.use('ggplot')
%matplotlib inline

[nltk_data] Downloading package wordnet to /Users/Janjua/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/Janjua/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [4]:
basepath = "/Users/Janjua/Desktop/Projects/Octofying-COVID19-Literature/dataset"
datapath = "CORD-19-research-challenge"

In [9]:
papers = [x for x in glob.glob(os.path.join(basepath, datapath) + "/*/*/*.json")]
print('Total papers found: ', len(papers))
print(papers[:1])

Total papers found:  33375
['/Users/Janjua/Desktop/Projects/Octofying-COVID19-Literature/dataset/CORD-19-research-challenge/custom_license/custom_license/86a998617c077f4fe2ab26214995a3548fbc0fc5.json']
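The `/*/*/*.json` pattern above only finds files that sit exactly three levels below the dataset folder. If the folder layout ever differs (an assumption, not something checked in this notebook), a recursive search is more forgiving. The sketch below builds a throwaway directory to show the idea; `find_json_files` is a hypothetical helper, and unlike the original pattern it also picks up JSON files at any other depth.

```python
import tempfile
from pathlib import Path

def find_json_files(root):
    """Return every .json file below root, at any depth, as sorted path strings."""
    return sorted(str(p) for p in Path(root).rglob("*.json"))

# Build a small temporary tree that mimics the nested dataset layout.
with tempfile.TemporaryDirectory() as tmp:
    nested = Path(tmp) / "custom_license" / "custom_license"
    nested.mkdir(parents=True)
    (nested / "a.json").write_text("{}")
    (Path(tmp) / "top.json").write_text("{}")
    found = find_json_files(tmp)
    names = [Path(p).name for p in found]
```

Here `names` contains both the deeply nested file and the one placed directly under the root.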


In [18]:
# Preview the first paper to inspect the JSON structure
for paper in papers:
    with open(paper) as f:
        read_paper = json.load(f)
    title = read_paper['metadata']['title']
    try:
        abstract = read_paper['abstract'][0]['text']
    except (KeyError, IndexError):
        # Some papers ship without an abstract
        abstract = "No abstract found"
    paper_text = ""
    for text in read_paper['body_text']:
        paper_text += text['text'] + '\n\n'
    print("="*100)
    print('Title: ', title)
    print("="*100)
    print('Abstract: ', abstract)
    print("="*100)
    print('Paper Contents: ', paper_text)
    print("="*100)
    break

Title:  Middle East Respiratory Syndrome and Severe Acute Respiratory Syndrome
Abstract:  The recent emergence of the Middle East respiratory syndrome (MERS)-CoV, a close relative of the Severe Acute respiratory syndrome (SARS)-CoV, both of which caused a lethal respiratory infection in humans, reinforces the need for further understanding of coronavirus pathogenesis and the host immune response. These viruses have evolved diverse strategies to evade and block host immune responses, facilitating infection and transmission. Pathogenesis following infection with these viruses is characterized by a marked delay in the induction of Type I interferon (IFN I) and, subsequently, by a poor adaptive immune response. Therapies that expedite IFN I induction as well as interventions that antagonize immunoevasive virus proteins are thus promising candidates for immune modulation.
Paper Contents:  While most CoVs cause the common cold in humans, infection with two recently emerged CoVs, SARS-CoV and

In [22]:
def draw_horizontal_lines(times):
    print("="*times)
    
def read_papers():
    papers_contents = []
    for paper in tqdm.tqdm(papers):
        # Close each file promptly instead of leaking the handle
        with open(paper) as f:
            read_paper = json.load(f)
        title = read_paper['metadata']['title']
        try:
            abstract = read_paper['abstract'][0]['text']
        except (KeyError, IndexError):
            # Some papers ship without an abstract
            abstract = "No abstract found"
        paper_text = ""
        for text in read_paper['body_text']:
            paper_text += text['text'] + '\n\n'
        papers_contents.append([title, abstract, paper_text])
    return papers_contents
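
The reading code above depends on three fields of each JSON file: `metadata.title`, the `text` of the first `abstract` entry, and the `text` of every `body_text` entry. A minimal sketch of that extraction on an in-memory record follows. `extract_fields` and the sample record are hypothetical and only mirror the keys used in this notebook, not the full CORD-19 schema. Unlike the loop above, it does not leave a trailing blank line after the last paragraph.

```python
import json

def extract_fields(record, missing="No abstract found"):
    """Return (title, abstract, body) from a dict shaped like the files read above."""
    title = record.get("metadata", {}).get("title", "")
    # The abstract is a list of paragraphs; it may be absent or empty.
    abstract_parts = record.get("abstract") or []
    abstract = abstract_parts[0].get("text", missing) if abstract_parts else missing
    # Join body paragraphs with a blank line between them.
    body = "\n\n".join(part["text"] for part in record.get("body_text", []))
    return title, abstract, body

# Hypothetical record with an empty abstract, parsed from a JSON string.
sample = json.loads(
    '{"metadata": {"title": "Example"}, "abstract": [], '
    '"body_text": [{"text": "First."}, {"text": "Second."}]}'
)
title, abstract, body = extract_fields(sample)
```

Using `.get` with defaults avoids the `try`/`except` around the abstract lookup while keeping the same placeholder text.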

In [23]:
print("Reading Papers!")
draw_horizontal_lines(100)
papers = read_papers()
print(papers[0])
draw_horizontal_lines(100)

  0%|          | 58/33375 [00:00<00:58, 571.19it/s]

Reading Papers!


100%|██████████| 33375/33375 [01:05<00:00, 508.82it/s]

['Middle East Respiratory Syndrome and Severe Acute Respiratory Syndrome', 'The recent emergence of the Middle East respiratory syndrome (MERS)-CoV, a close relative of the Severe Acute respiratory syndrome (SARS)-CoV, both of which caused a lethal respiratory infection in humans, reinforces the need for further understanding of coronavirus pathogenesis and the host immune response. These viruses have evolved diverse strategies to evade and block host immune responses, facilitating infection and transmission. Pathogenesis following infection with these viruses is characterized by a marked delay in the induction of Type I interferon (IFN I) and, subsequently, by a poor adaptive immune response. Therapies that expedite IFN I induction as well as interventions that antagonize immunoevasive virus proteins are thus promising candidates for immune modulation.', 'While most CoVs cause the common cold in humans, infection with two recently emerged CoVs, SARS-CoV and MERS-CoV, resulted in more 




In [88]:
draw_horizontal_lines(100)
print("Create a dataframe for processing!")
draw_horizontal_lines(100)
df_covid = pd.DataFrame(papers, columns=["title", "abstract", "text"])
display(df_covid.head())

Create a dataframe for processing!


Unnamed: 0,title,abstract,text
0,Middle East Respiratory Syndrome and Severe Ac...,The recent emergence of the Middle East respir...,While most CoVs cause the common cold in human...
1,"Integrated, Multi-cohort Analysis Identifies C...",Graphical Abstract Highlights d MVS is a commo...,Clinically relevant respiratory viral signatur...
2,Evolutionary Medicine IV. Evolution and Emerge...,No abstract found,The evolutionary history of humans is characte...
3,International aviation emissions to 2025: Can ...,"International aviation is growing rapidly, res...","Sixty years ago, civil aviation was an infant ..."
4,2 Mechanisms of diarrhoea,No abstract found,Acute infections of the gastrointestinal tract...


In [90]:
def keyword_based_search(keyword, df):
    # Literal (non-regex) match; rows with missing text are skipped
    result = df[df['text'].str.contains(keyword, regex=False, na=False)]
    text_content = result.text.values
    title = result.title.values
    display(result.head())
    relevant_sentences = {'title': [], 'sents': []}
    for sent in tqdm.tqdm(range(len(text_content))):
        # Naive sentence split on '.'; keep only sentences containing the keyword
        sentences = text_content[sent].split('.')
        relevant_sentences['title'].append(title[sent])
        relevant_sentences['sents'].append([s for s in sentences if keyword in s])
    return relevant_sentences

draw_horizontal_lines(100)
print("Getting the subset of DF containing the keyword!")
draw_horizontal_lines(100)
relevant_sentences = keyword_based_search("pregnant women", df_covid)
for i in range(min(5, len(relevant_sentences['title']))):
    draw_horizontal_lines(100)
    print("Title: ", relevant_sentences['title'][i])
    print("Sentences: ", relevant_sentences['sents'][i])
    print()

Getting the subset of DF containing the keyword!


Unnamed: 0,title,abstract,text
136,Medical issues associated with commercial fl i...,Almost 2 billion people travel aboard commerci...,Fitness for air travel is a growing issue beca...
157,Pandemic 2009 influenza A (H1N1) infection amo...,No abstract found,Hajj is the largest annual recurring religious...
201,-NC-ND license (http://creativecommons.org/lic...,Many industrialized countries have implemented...,Vaccination of older adults has been shown to ...
290,Human Metapneumovirus and Other Respiratory Vi...,No abstract found,Human metapneumovirus (HMPV) is a respiratory ...
351,Emerging infectious disease outbreaks: Old les...,No abstract found,The long and prominent role of infectious dise...


100%|██████████| 1126/1126 [00:00<00:00, 4386.68it/s]

Title:  Medical issues associated with commercial fl ights
Sentences:  [' 69 Recommendations need to be in place for pregnant women because the fetus is exposed to the same radiation dose as the mother']

Title:  Pandemic 2009 influenza A (H1N1) infection among 2009 Hajj Pilgrims from Southern Iran: a real-time RT-PCR-based study
Sentences:  [' This finding is understandable in view of the fact that there were no members of high-risk groups such as pregnant women or individuals with chronic health conditions among the pilgrims, and secondly, as previous reports indicate, A(H1N1)pdm09 infection has not been associated with high mortality rates and finally the instructions given to the pilgrims about contact and hand hygiene and respiratory etiquette']

Title:  -NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/)
Sentences:  [' immunization visits at ages 6, 10 and 14 weeks, and 9 months) and pregnant women']

Title:  Human Metapneumovirus and Other Respiratory Viral Infect
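
The search above splits text on every `.` and matches the keyword case-sensitively, so decimals such as `0.5` are cut in half and a sentence starting with "Pregnant women" is missed. One possible alternative, sketched with only the standard library, splits on sentence-ending punctuation followed by whitespace and compares in lowercase. `find_sentences` is a hypothetical helper, and the regex is a rough heuristic that still splits after abbreviations; it is not a full sentence tokenizer.

```python
import re

# Split where '.', '!' or '?' is followed by whitespace; "0.5" stays intact.
_SENT_END = re.compile(r"(?<=[.!?])\s+")

def find_sentences(text, keyword):
    """Return the sentences of text that contain keyword, ignoring case."""
    needle = keyword.lower()
    return [s.strip() for s in _SENT_END.split(text) if needle in s.lower()]

example = (
    "The dose was 0.5 mg. Pregnant women were excluded. "
    "No other group was affected."
)
matches = find_sentences(example, "pregnant women")
```

In this example `matches` holds the single sentence that mentions the keyword.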




## Topic Modelling - Latent Dirichlet allocation
In this section, we fit a Latent Dirichlet Allocation (LDA) topic model to the sentences returned by the keyword search, to see which themes the matching papers discuss.
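
LDA does not work on raw text: each document is first reduced to a bag of words, a list of `(token_id, count)` pairs. The sketch below mimics that representation in plain Python to show what the model consumes. `build_vocab` and `to_bow` are hypothetical stand-ins for illustration, not gensim's implementation; this notebook itself relies on `corpora.Dictionary` and its `doc2bow` method.

```python
from collections import Counter

def build_vocab(docs):
    """Assign each distinct token an integer id in order of first appearance."""
    vocab = {}
    for doc in docs:
        for token in doc:
            vocab.setdefault(token, len(vocab))
    return vocab

def to_bow(doc, vocab):
    """Convert a token list into sorted (token_id, count) pairs."""
    counts = Counter(token for token in doc if token in vocab)
    return sorted((vocab[token], n) for token, n in counts.items())

# Two toy documents, already tokenized.
docs = [["pregnant", "woman", "infection"], ["infection", "virus", "infection"]]
vocab = build_vocab(docs)
bows = [to_bow(doc, vocab) for doc in docs]
```

Word order is discarded at this step, which is why LDA captures which words co-occur in a document rather than how they are phrased.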

In [162]:
def get_tokens(text):
    return word_tokenize(text)

def get_lemma(word):
    lemma = wn.morphy(word)
    if lemma is None:
        return word
    else:
        return lemma
    
def get_lemma2(word):
    return WordNetLemmatizer().lemmatize(word)

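def prepare_text_for_lda(text):
    tokens = get_tokens(text)
    # Keep tokens longer than 4 characters, drop stop words, then lemmatize.
    # Tokens are not lowercased, so capitalized stop words pass the filter.
    tokens = [token for token in tokens if len(token) > 4]
    tokens = [token for token in tokens if token not in en_stop]
    tokens = [get_lemma(token) for token in tokens]
    return tokens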

def generate_tokens_for_searched_sentences():
    text_data = []
    for ix in range(len(relevant_sentences['title'])):
        # Join with a space so words at sentence boundaries do not fuse
        full_text = ' '.join(relevant_sentences['sents'][ix])
        tokens = prepare_text_for_lda(full_text)
        # Keep a random sample of roughly 1% of the matched papers
        if random.random() > 0.99:
            text_data.append(tokens)
    return text_data
        
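def get_corpus(token_data):
    # Map tokens to integer ids, then encode each document as a bag of words
    dictionary = corpora.Dictionary(token_data)
    corpus = [dictionary.doc2bow(text) for text in token_data]
    # Persist both so the visualization cell can reload them
    with open('corpus.pkl', 'wb') as f:
        pickle.dump(corpus, f)
    dictionary.save('dictionary.gensim')
    print("Saved dictionary!")
    return corpus, dictionary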
    
tokens_data = generate_tokens_for_searched_sentences()
corpus, dictionary = get_corpus(tokens_data)

Saved dictionary!


In [163]:
NUM_TOPICS = 5
ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics = NUM_TOPICS, id2word=dictionary, passes=15)
ldamodel.save('model5.gensim')
topics = ldamodel.print_topics(num_words=10)
for topic in topics:
    print(topic)

(0, '0.061*"pregnant" + 0.061*"woman" + 0.019*"death" + 0.019*"admission" + 0.019*"Khuroo" + 0.019*"associate" + 0.019*"infection" + 0.010*"virus" + 0.010*"population" + 0.010*"clinically"')
(1, '0.028*"disease" + 0.028*"pregnant" + 0.028*"woman" + 0.028*"intrapartum" + 0.028*"issuance" + 0.028*"guideline" + 0.028*"culture" + 0.028*"first" + 0.028*"recommend" + 0.028*"national"')
(2, '0.024*"woman" + 0.024*"disease" + 0.024*"young" + 0.024*"pregnant" + 0.024*"child" + 0.024*"priority" + 0.024*"designate" + 0.024*"provider" + 0.024*"complication" + 0.024*"person"')
(3, '0.045*"viral" + 0.024*"therapy" + 0.024*"testing" + 0.024*"product" + 0.024*"receive" + 0.024*"emerge" + 0.024*"highlight" + 0.024*"technology" + 0.024*"Irish" + 0.024*"vulnerability"')
(4, '0.028*"pregnant" + 0.028*"woman" + 0.028*"transmission" + 0.016*"universal" + 0.015*"healthy" + 0.015*"routinely" + 0.015*"handling" + 0.015*"otherwise" + 0.015*"increase" + 0.015*"baby"')
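
Each topic printed above is a string of `weight*"word"` terms joined by ` + `. To work with the weights numerically, the string can be parsed back into pairs. The sketch below does this with a regular expression on a shortened copy of topic 3. `parse_topic` is a hypothetical helper, and the parsed weights are the rounded values from the printout; gensim's `show_topic` method should give comparable `(word, weight)` pairs directly from the model.

```python
import re

# Matches one term such as 0.045*"viral" and captures weight and word.
_TERM = re.compile(r'([0-9.]+)\*"([^"]+)"')

def parse_topic(topic_string):
    """Turn a printed topic string into a list of (word, weight) pairs."""
    return [(word, float(weight)) for weight, word in _TERM.findall(topic_string)]

topic = '0.045*"viral" + 0.024*"therapy" + 0.024*"testing"'
pairs = parse_topic(topic)
```

Many terms in the printed topics share the same weight, so a ranking built from `pairs` has frequent ties.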


In [165]:
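# Reload the saved artifacts and render the interactive topic visualization
dictionary = gensim.corpora.Dictionary.load('dictionary.gensim')
with open('corpus.pkl', 'rb') as f:
    corpus = pickle.load(f)
lda = gensim.models.ldamodel.LdaModel.load('model5.gensim')
# Newer pyLDAvis releases expose this module as pyLDAvis.gensim_models
import pyLDAvis.gensim
lda_display = pyLDAvis.gensim.prepare(lda, corpus, dictionary, sort_topics=False)
pyLDAvis.display(lda_display)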

