# Lab 4 : Topic Modelling

In this lab, we will work with research papers published over the years on different aspects of coronaviruses. Our goal is to use topic modelling to identify the areas each research paper covers and to answer some important questions about the viruses.

1. We will begin by extracting the full body text, abstract and title from each paper and cleaning the body text and titles.
2. We will then use the gensim library to create an LDA topic model on the extracted body texts.
3. We will then use the topic model to find the most relevant papers on aspects such as vaccines and respiratory viruses.
4. Finally, we will look at the coherence score as a way of tuning the number of topics in the LDA topic model.

## Important Instructions - 

1. Please make changes only inside the graded function. Do not make changes anywhere else in the notebook.
2. Please read the description of every graded function very carefully. The description clearly states what is expected of each graded function.
3. After almost every graded function, there is a cell which you can run and see if the expected output matches the output you are getting. 

## Grading Policy -
1. You will receive full credit if the code passes all test cases.
2. In case of an error, partial credit will be awarded based on your code and the no. of test cases passed.

In [1]:
#importing libraries
import pandas as pd
import numpy as np
import json
import os
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.stem import WordNetLemmatizer
from sklearn.cluster import AgglomerativeClustering,SpectralClustering,KMeans
import scipy.cluster.hierarchy as shc
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.decomposition import LatentDirichletAllocation as LDA
import seaborn as sns
import scispacy
import spacy
from gensim.models import LdaModel,CoherenceModel
from gensim import corpora

In [2]:
#setting stopwords and lemmatizer
stop_words = set(stopwords.words("english"))
customize_stop_words = set([
    'doi', 'preprint', 'copyright', 'org', 'https', 'et', 'al', 'author', 'figure', 'table',
    'rights', 'reserved', 'permission', 'use', 'used', 'using', 'biorxiv', 'medrxiv', 'license', 'fig', 'fig.', 'al.', 'Elsevier', 'PMC', 'CZI',
    '-PRON-', 'usually','study','also'])
stop_words=set(list(customize_stop_words)+list(stop_words))

lemmatizer = WordNetLemmatizer()

In [3]:
#the purpose of clean_abstract function is to lowercase the text, remove stopwords, punctuation,
#special characters and extra spaces, and lemmatize the remaining words
def clean_abstract(abstract):
    '''Clean the text and return it as a list of lemmatized tokens'''
    
    # Convert the text to lower case
    abstract = abstract.lower()
    # Clean the text
    abstract = re.sub(r"<br />", " ", abstract)
    abstract = re.sub(r"[^a-z]", " ", abstract)
    abstract = re.sub(r"\s+", " ", abstract) # Collapse any extra spaces
    # Tokenize, drop stopwords and words of 3 letters or fewer, then lemmatize
    tokenized = word_tokenize(abstract)
    abstract = [lemmatizer.lemmatize(w) for w in tokenized if w not in stop_words and len(w) > 3]

    # Return a list of words
    return abstract
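
To see what the regex steps in `clean_abstract` do independently of NLTK, here is a minimal sketch that uses only the standard library. It is a simplified stand-in, not the function above: `str.split` replaces `word_tokenize`, there is no lemmatization, and `TOY_STOP_WORDS` and `simple_clean` are hypothetical names introduced only for illustration.

```python
import re

# Hypothetical, tiny stop-word set for illustration only.
TOY_STOP_WORDS = {"this", "with", "that", "from", "also", "study"}

def simple_clean(text):
    """Lowercase, keep only letters, then drop stop words and short tokens."""
    text = text.lower()
    text = re.sub(r"<br />", " ", text)   # strip HTML line breaks
    text = re.sub(r"[^a-z]", " ", text)   # replace anything that is not a-z
    tokens = text.split()                 # split() also collapses repeated spaces
    return [w for w in tokens if w not in TOY_STOP_WORDS and len(w) > 3]

print(simple_clean("This study: SARS-CoV-2 binds ACE2<br />receptors (Fig. 3)."))
# ['sars', 'binds', 'receptors']
```

Note that digits and hyphens are replaced by spaces before the length filter runs, so a term such as "SARS-CoV-2" survives only as "sars".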

## The below code cell prepares the following important objects for analysis :

1. cleaned_text - list of lists where each sublist holds the cleaned tokens of the full text of a research paper.

2. text - list of strings where each string is the full text of a research paper.

3. cleaned_titles - list of lists where each sublist holds the cleaned tokens of the title of a research paper.

4. titles - list of strings where each string is the title of a research paper.

5. abstracts - list of strings where each string is the abstract of a research paper.

In [4]:
# extracting full text, abstracts and titles and corresponding paper ids from json data.
# we will clean the full text and titles.
cleaned_text=[]
cleaned_titles=[]
paper_ids=[]
text=[]
abstracts=[]
titles=[]
count=0
for file in os.listdir("pdf_json") :
    with open('pdf_json/' + file) as json_data:
        data=json.load(json_data)
    body_sections=data['body_text']
    abstract_sections=data['abstract']
    # skip papers that have no body text or no abstract
    if len(abstract_sections)==0 or len(body_sections)==0:
        continue
    count+=1
    full_text=""
    for d in body_sections :
        full_text+=d["text"]+" "
    # keep only papers that mention coronavirus in the body text
    if 'coronavirus' in full_text :
        # paper_ids is appended here so that it stays aligned with the other lists
        paper_ids.append(data['paper_id'])
        text.append(full_text)
        cleaned_text.append(clean_abstract(full_text))
        abstract=""
        for d in abstract_sections :
            abstract+=d["text"]+" "
        abstracts.append(abstract)
        titles.append(data['metadata']['title'])
        cleaned_titles.append(clean_abstract(data['metadata']['title']))
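
As a rough illustration of the record layout the loop above relies on, the sketch below parses one made-up JSON record and joins its body paragraphs. Only the keys (`paper_id`, `metadata.title`, `abstract`, `body_text`, `text`) come from the code above. The values and the `join_paragraphs` helper are hypothetical, and the helper joins with single spaces, so unlike the loop it leaves no trailing space.

```python
import json

# Made-up record that mirrors only the fields the loop above reads.
record_json = """
{
  "paper_id": "example-id",
  "metadata": {"title": "An example title"},
  "abstract": [{"text": "Abstract paragraph."}],
  "body_text": [{"text": "Intro on coronavirus."}, {"text": "Methods."}]
}
"""

def join_paragraphs(sections):
    # Each section is a dict with a "text" key; join them with single spaces.
    return " ".join(section["text"] for section in sections)

record = json.loads(record_json)
body = join_paragraphs(record["body_text"])
if "coronavirus" in body:  # same case-sensitive substring filter as the loop
    print(record["metadata"]["title"], "|", body)
```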
        

## Graded Function 1 (10 marks) :

Purpose - To create dictionary and corpus objects which will be used for creating gensim topic model.

You should use the corpora package of the gensim library.

The input to the function is the cleaned_text list which we have created above.

You should return both the dictionary and corpus.

For more information on how to create the dictionary and corpus, refer to the documentation - 

https://radimrehurek.com/gensim/models/ldamodel.html
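
To make the idea of a dictionary and a bag-of-words corpus concrete, here is a minimal standard-library sketch. It is a toy stand-in and not gensim's implementation: `build_vocab` and `to_bow` are hypothetical helpers, and gensim's actual id assignment and filtering options may differ.

```python
from collections import Counter

def build_vocab(docs):
    """Assign an integer id to each distinct token, in order of first appearance."""
    token2id = {}
    for doc in docs:
        for token in doc:
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id

def to_bow(doc, token2id):
    """Return sorted (token_id, count) pairs for one tokenized document."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())

toy_docs = [["virus", "cell", "virus"], ["cell", "protein"]]
vocab = build_vocab(toy_docs)
bows = [to_bow(d, vocab) for d in toy_docs]
print(vocab)  # {'virus': 0, 'cell': 1, 'protein': 2}
print(bows)   # [[(0, 2), (1, 1)], [(1, 1), (2, 1)]]
```

Each document becomes a list of `(token_id, count)` pairs, which is the shape of input the LDA model is trained on in the next step.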

In [5]:
from gensim.corpora.dictionary import Dictionary
def create_corpus(text) :
    # start code here
    dictio = Dictionary(text)
    corpus = [dictio.doc2bow(texts) for texts in text]
    # end code here
    return dictio, corpus

In [6]:
dictionary,corpus=create_corpus(cleaned_text)
for i in range(20) :
    print(dictionary[i])

abcam
ability
able
absence
abundant
accelerated
accomplished
according
accordingly
accumulate
accumulation
acetate
achieved
acid
acknowledged
acquire
across
actin
acting
active


## Expected Output -

ability

able

absence

absorbance

accca

according

acid

act

activated

activity

acute

added

addition

additional

additionally

adenoviral

adjuvant

administered

adsorption

affect

## Graded Function 2 : (5 marks)

Purpose - To create an LDA topic model using gensim.

Inputs will be the dictionary and corpus objects created above and the no. of top words we want to extract from each topic.

While creating the model, you can set the no. of topics to 8 and random_state=25. You can change these parameters if you want when answering the questions, but I received good results using these parameters. 

You should return the created LDA model and the important words for each topic. 

The LDA model object has a method which you can use to get the top words of each topic. 

In [7]:
def create_lda_model(dictionary,corpus,n_words) :
    # start code here
    lda = LdaModel(corpus, num_topics=8,random_state=25,id2word=dictionary)
    # end code here
    return lda,lda.show_topics(num_topics=8, num_words=n_words, formatted=True)

In [8]:
lda_model,topics=create_lda_model(dictionary,corpus,40)
print(len(topics))
print(topics[0])

8
(0, '0.007*"virus" + 0.007*"infection" + 0.006*"cell" + 0.005*"cat" + 0.005*"group" + 0.005*"disease" + 0.005*"sample" + 0.004*"gene" + 0.004*"animal" + 0.003*"assay" + 0.003*"control" + 0.003*"however" + 0.003*"case" + 0.003*"health" + 0.003*"number" + 0.003*"study" + 0.003*"viral" + 0.003*"data" + 0.003*"time" + 0.003*"positive" + 0.003*"sequence" + 0.002*"analysis" + 0.002*"clinical" + 0.002*"risk" + 0.002*"high" + 0.002*"infected" + 0.002*"specie" + 0.002*"country" + 0.002*"type" + 0.002*"model" + 0.002*"level" + 0.002*"different" + 0.002*"three" + 0.002*"reported" + 0.002*"result" + 0.002*"pig" + 0.002*"detection" + 0.002*"shown" + 0.002*"treatment" + 0.002*"human"')


In [9]:
#Printing the remaining topics to see which one gives the highest weight to certain words
for x in range(1,8):
    print(topics[x])

(1, '0.018*"cell" + 0.015*"protein" + 0.010*"virus" + 0.006*"infection" + 0.006*"viral" + 0.005*"human" + 0.004*"gene" + 0.004*"sequence" + 0.004*"binding" + 0.004*"host" + 0.004*"sars" + 0.003*"interaction" + 0.003*"activity" + 0.003*"shown" + 0.003*"structure" + 0.003*"site" + 0.003*"different" + 0.003*"data" + 0.003*"genome" + 0.003*"mouse" + 0.003*"analysis" + 0.002*"response" + 0.002*"type" + 0.002*"factor" + 0.002*"result" + 0.002*"infected" + 0.002*"however" + 0.002*"could" + 0.002*"replication" + 0.002*"target" + 0.002*"found" + 0.002*"receptor" + 0.002*"acid" + 0.002*"fusion" + 0.002*"sample" + 0.002*"domain" + 0.002*"high" + 0.002*"animal" + 0.002*"effect" + 0.002*"function"')
(2, '0.014*"cell" + 0.011*"protein" + 0.007*"viral" + 0.006*"infection" + 0.006*"virus" + 0.006*"sequence" + 0.005*"mers" + 0.005*"human" + 0.005*"antibody" + 0.004*"sample" + 0.004*"response" + 0.004*"anti" + 0.003*"gene" + 0.003*"patient" + 0.003*"infected" + 0.003*"sars" + 0.003*"result" + 0.003*"ana

## Expected Output -

len(topics) = 8

topics[0] =

(0,
 '0.014*"virus" + 0.007*"sample" + 0.007*"sequence" + 0.006*"cell" + 0.006*"sars" + 0.006*"viral" + 0.006*"infection" + 0.005*"human" + 0.004*"disease" + 0.004*"time" + 0.004*"result" + 0.003*"patient" + 0.003*"data" + 0.003*"case" + 0.003*"genome" + 0.003*"protein" + 0.003*"positive" + 0.003*"clinical" + 0.003*"primer" + 0.003*"study"')
 
#### Note - This output will change if no. of topics and random_state are different.

## Graded Function 3 : (15 marks)

Purpose - To get the top 20 papers for a given topic number. A paper belongs to the topic whose proportion is the highest among all topics in that paper. To get the composition of all topics in a paper, use the get_document_topics() function of the LDA model object. 

Input is the topic no. k (0-based indexing).

After getting all the papers belonging to topic k, sort them in descending order of their proportion of topic k and return the 20 papers with the highest proportion.

Output will be a list of tuples with the paper no. as the 1st value and the proportion of topic k as the 2nd value.
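
The selection rule described above can be sketched on toy data without gensim. The input here is a hand-written list that mirrors the `(topic_no, proportion)` pairs produced per document. `top_docs_for_topic` and the toy values are assumptions made for illustration, not part of the graded solution.

```python
def top_docs_for_topic(doc_topics, k, n=20):
    """Rank documents whose dominant topic is k by their share of topic k."""
    ranked = []
    for doc_no, topic_probs in enumerate(doc_topics):
        if not topic_probs:
            continue  # no topic reported for this document
        top_topic, top_prob = max(topic_probs, key=lambda pair: pair[1])
        if top_topic == k:
            ranked.append((doc_no, top_prob))
    ranked.sort(key=lambda pair: pair[1], reverse=True)
    return ranked[:n]

toy_doc_topics = [
    [(0, 0.70), (1, 0.30)],   # dominant topic 0
    [(0, 0.40), (1, 0.60)],   # dominant topic 1; topic 0 is present but not dominant
    [(0, 0.95), (1, 0.05)],   # dominant topic 0
]
print(top_docs_for_topic(toy_doc_topics, k=0, n=2))  # [(2, 0.95), (0, 0.7)]
```

Document 1 contains some of topic 0 but is excluded, because its dominant topic is 1.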

In [10]:
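def get_top_articles(k) :
    # start code here
    doc_topics=lda_model.get_document_topics(corpus)
    ranked=[]
    for doc_no,topic_probs in enumerate(doc_topics):
        if not topic_probs:
            continue
        # a paper belongs to the topic with the highest proportion in it
        top_topic,top_prob=max(topic_probs, key = lambda t: t[1])
        if top_topic==k:
            ranked.append((doc_no,top_prob))
    ranked.sort(key = lambda t: t[1],reverse=True)
    # end code here
    return ranked[:20]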

In [11]:
top_20_0=get_top_articles(0)
top_20_0

[(2595, 0.99980026),
 (2409, 0.9997993),
 (2086, 0.9997805),
 (2785, 0.99975777),
 (1382, 0.99968994),
 (2331, 0.9996309),
 (3608, 0.9995981),
 (3659, 0.9995814),
 (1805, 0.99954945),
 (2431, 0.9995453),
 (271, 0.9995075),
 (1106, 0.9995037),
 (2699, 0.99944925),
 (2406, 0.9994471),
 (2468, 0.99942887),
 (1879, 0.99942493),
 (2117, 0.9994037),
 (2857, 0.99937415),
 (2271, 0.9993598),
 (212, 0.99933577)]

## Expected Value -

[(2215, 0.9996958),

 (1954, 0.9996701),
 
 (3684, 0.9996221),
 
 (1342, 0.99961907),
 
 (3524, 0.9995674),
 
 (3492, 0.999553),
 
 (2277, 0.99951327),
 
 (519, 0.9992938),
 
 (2817, 0.99925375),
 
 (2256, 0.99917614),
 
 (965, 0.9991082),
 
 (3732, 0.9990591),
 
 (1369, 0.9988257),
 
 (1225, 0.9977632),
 
 (3263, 0.99463576),
 
 (3532, 0.9943344),
 
 (716, 0.9935407),
 
 (2372, 0.99306047),
 
 (3267, 0.9916947),
 
 (897, 0.9912204)]
 
#### Note - This output will change if no. of topics and random_state are different in LDA model

## Question 1 :  What do we know about vaccine development efforts for viruses? (5 marks) 

You should look to get the 20 most relevant articles on the subject of vaccines for various viruses.  

In [12]:
vaccine_articles=get_top_articles(4)

In [13]:
for i,x in vaccine_articles[:5] :
    print(titles[i])
    print("\n")

Recombinant Chimeric Transmissible Gastroenteritis Virus (TGEV)-Porcine Epidemic Diarrhea Virus (PEDV) Virus Provides Protection against Virulent PEDV


Trypsin-independent porcine epidemic diarrhea virus US strain with altered virus entry mechanism


The Program for New Century Excellent Talents in University of Ministry of Education of P.R. China (NCET-10-0144), Sponsored by Chang Jiang Scholar Candidates Programme for Provincial Universities in Heilongjiang


Contribution of porcine aminopeptidase N to porcine deltacoronavirus infection


Genetic manipulation of porcine deltacoronavirus reveals insights into NS6 and NS7 functions: a novel strategy for vaccine design




<b>Analyzing the output of function 2, we see that across the 8 topics the probability of the word "vaccine" is low within the top 40 words. However, topic 4 does contain it, as 0.003*"vaccine". So we choose topic 4 to get the 20 most relevant articles on the subject of vaccines.</b>
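
The manual lookup described above can be automated. The sketch below parses formatted topic strings of the form printed earlier (`0.003*"vaccine" + ...`) and lists the topics that contain a given word. `parse_topic`, `topics_with_word` and the toy topic strings are hypothetical, and the weights are made up rather than taken from the trained model.

```python
import re

def parse_topic(formatted):
    # Map each quoted word in a formatted topic string to its weight.
    pairs = re.findall(r'([0-9.]+)\*"([^"]+)"', formatted)
    return {word: float(weight) for weight, word in pairs}

def topics_with_word(topics, word):
    """Return (topic_no, weight) pairs for topics that list `word`, heaviest first."""
    hits = []
    for topic_no, formatted in topics:
        weights = parse_topic(formatted)
        if word in weights:
            hits.append((topic_no, weights[word]))
    return sorted(hits, key=lambda pair: pair[1], reverse=True)

toy_topics = [
    (0, '0.014*"cell" + 0.011*"protein"'),
    (1, '0.009*"virus" + 0.003*"vaccine"'),
]
print(topics_with_word(toy_topics, "vaccine"))  # [(1, 0.003)]
```

Passing the `topics` list returned by `create_lda_model` in place of `toy_topics` should work the same way, assuming it keeps the formatted-string output shown earlier.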

## Expected Output - 

REVIEW Intranasal and oral vaccination with protein-based antigens: advantages, challenges and formulation strategies


Peptide Vaccine: Progress and Challenges


Journal of Immune Based Therapies and Vaccines Prospects for control of emerging infectious diseases with plasmid DNA vaccines


Emergence of Pathogenic Coronaviruses in Cats by Homologous Recombination between Feline and Canine Coronaviruses


Effects of Adjuvants on the Immunogenicity and Efficacy of a Zika Virus Envelope Domain III Subunit Vaccine

#### Note - This output may vary based on your parameters of the LDA model.

## Question 2 : What do we know about other viruses causing respiratory problems in adults and children? (5 marks)

You should look to get the most relevant 20 articles on the subject of respiratory problems in adults and children.

In [14]:
resp_articles=get_top_articles(7)

In [15]:
for i,x in resp_articles[:5] :
    print(titles[i])
    print("\n")

Respiratory Virus Infections in Hematopoietic Cell Transplant Recipients


viruses Perspective Potential Maternal and Infant Outcomes from Coronavirus 2019-nCoV (SARS-CoV-2) Infecting Pregnant Women: Lessons from SARS, MERS, and Other Human Coronavirus Infections


Exposure Patterns Driving Ebola Transmission in West Africa: A Retrospective Observational Study International Ebola Response Team


Bacterial and viral pathogen spectra of acute respiratory infections in under-5 children in hospital settings in Dhaka city


Goal-Oriented Respiratory Management for Critically Ill Patients with Acute Respiratory Distress Syndrome




<b>Analyzing the output of function 2, we see that among the 8 topics listed, the words "respiratory" and "child" have their highest probability within the top 40 words in topic 7. So we choose topic 7 to get the 20 most relevant articles on the subject of respiratory problems.</b>

## Expected Output - 

Comparing Human Metapneumovirus and Respiratory Syncytial Virus: Viral Co- Detections, Genotypes and Risk Factors for Severe Disease


Bocavirus Infection in Otherwise Healthy Children with Respiratory Disease


Surveillance and Genome Analysis of Human Bocavirus in Patients with Respiratory Infection in Guangzhou


Imported Case of Acute Respiratory Tract Infection Associated with a Member of Species Nelson Bay Orthoreovirus


Clinical Epidemiology of Bocavirus, Rhinovirus, Two Polyomaviruses and Four Coronaviruses in HIV-Infected and HIV-Uninfected South African Children

#### Note - This output may vary based on your parameters of the LDA model.

## Graded Function 4 : (10 marks)

Purpose - To calculate coherence score of LDA model for different no. of topics.

Input will be a list of no. of topics for which we want to calculate coherence score.

Output will be a list of tuples where the 1st element is no. of topics and 2nd element is coherence score.

Please refer to the following documentation on coherence score -

https://radimrehurek.com/gensim/models/coherencemodel.html

Please use coherence='c_v' in CoherenceModel. Also, keep random_state=25 in lda model to get the same expected output.

In [16]:
from gensim.models.coherencemodel import CoherenceModel
def get_coherence_scores(n_topics) :
    # start code here
    cm_score=[]
    for x in n_topics:
        model2 = LdaModel(corpus, num_topics=x,random_state=25,id2word=dictionary)
        cm = CoherenceModel(model=model2,texts=cleaned_text, corpus=corpus,dictionary=dictionary, coherence='c_v')
        cm_score.append((x,cm.get_coherence()))
    # end code here
    return cm_score

In [17]:
n_topics=[3,6,8,10]
coherence_scores=get_coherence_scores(n_topics)
coherence_scores

[(3, 0.3237877027385651),
 (6, 0.34323254567915923),
 (8, 0.3383538853673067),
 (10, 0.3341392450816718)]

## Expected Output -

[(3, 0.3271153212140945),

 (6, 0.3278043307090321),
 
 (8, 0.3396271207143783),
 
 (10, 0.33361763232155234)]
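
Once the scores are available, choosing the number of topics can be as simple as taking the highest coherence. The sketch below does this with the values from the notebook run above (not the expected output). Treat it as one reasonable selection rule rather than the only one, since the scores here are close to each other.

```python
# Scores copied from the notebook run shown above.
coherence_scores = [
    (3, 0.3237877027385651),
    (6, 0.34323254567915923),
    (8, 0.3383538853673067),
    (10, 0.3341392450816718),
]

# Pick the (n_topics, score) pair with the highest coherence score.
best_n, best_score = max(coherence_scores, key=lambda pair: pair[1])
print(best_n, round(best_score, 4))  # 6 0.3432
```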