# Topic Modelling

In this project, we work with research papers published over the years on different aspects of coronaviruses. Our goal is to use topic modelling to identify the areas each paper covers and to answer some important questions about the viruses.

1. We begin by extracting the full body text, abstract and title from each paper and cleaning them.
2. We then use the gensim library to build an LDA topic model on the extracted body texts.
3. We then use the topic model to find the most relevant papers on subjects such as vaccines and respiratory viruses.
4. Finally, we look at the coherence score as a way of tuning the number of topics in the LDA topic model.

In [10]:
#importing libraries
import pandas as pd
import numpy as np
import json
import os
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.stem import WordNetLemmatizer
from sklearn.cluster import AgglomerativeClustering, SpectralClustering, KMeans
import scipy.cluster.hierarchy as shc
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.decomposition import LatentDirichletAllocation as LDA
import seaborn as sns
import scispacy
import spacy
from gensim.models import LdaModel, CoherenceModel
from gensim import corpora
# nltk.download('wordnet')

In [5]:
#setting stopwords and lemmatizer
stop_words = set(stopwords.words("english"))
customize_stop_words = set([
    'doi', 'preprint', 'copyright', 'org', 'https', 'et', 'al', 'author', 'figure', 'table',
    'rights', 'reserved', 'permission', 'use', 'used', 'using', 'biorxiv', 'medrxiv', 'license', 'fig', 'fig.', 'al.', 'Elsevier', 'PMC', 'CZI',
    '-PRON-', 'usually','study','also'])
stop_words=set(list(customize_stop_words)+list(stop_words))

lemmatizer = WordNetLemmatizer()

In [6]:
def clean_abstract(abstract):
    '''Clean the text and return a list of lemmatized tokens with stopwords removed'''

    # Convert the text to lower case
    abstract = abstract.lower()
    # Replace HTML line breaks and every non-letter character with a space
    abstract = re.sub(r"<br />", " ", abstract)
    abstract = re.sub(r"[^a-z]", " ", abstract)
    abstract = re.sub(r"\s+", " ", abstract) # Collapse runs of whitespace
    # Tokenize, drop stopwords and words of 3 letters or fewer, then lemmatize
    tokenized = word_tokenize(abstract)
    abstract = [lemmatizer.lemmatize(w) for w in tokenized if w not in stop_words and len(w) > 3]

    # Return a list of words
    return abstract
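
To make the effect of the regex and length filters concrete, here is a minimal standard-library sketch. It leaves out NLTK, so there is no lemmatization, and it uses a tiny hypothetical stop list (`demo_stops`) and a hypothetical helper name (`normalize_demo`) that are not part of the notebook.

```python
import re

# Hypothetical, tiny stop list used only for this illustration
demo_stops = {"with", "were", "using"}

def normalize_demo(raw):
    """Lowercase, keep letters only, split on whitespace, drop short words and stop words."""
    lowered = raw.lower()
    letters_only = re.sub(r"[^a-z]", " ", lowered)
    tokens = letters_only.split()
    return [t for t in tokens if t not in demo_stops and len(t) > 3]

sample = "Patients with COVID-19 were tested<br />using RT-PCR assays."
print(normalize_demo(sample))  # ['patients', 'covid', 'tested', 'assays']
```

Digits and hyphens are replaced by spaces, so "COVID-19" becomes "covid", and short fragments such as "pcr" are dropped by the length filter.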

The code cell below prepares the following objects for analysis. Only papers that have both an abstract and body text, and whose body text contains the word 'coronavirus', are kept.

1. cleaned_text - list of lists, where each sublist holds the cleaned tokens of a paper's full text.

2. text - list of strings, where each string is the raw full text of a paper.

3. cleaned_titles - list of lists, where each sublist holds the cleaned tokens of a paper's title.

4. titles - list of strings, where each string is the title of a paper.

5. abstracts - list of strings, where each string is the abstract of a paper.
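
The extraction loop reads only a few fields from each JSON file: `paper_id`, `metadata.title`, and the `text` entries of `abstract` and `body_text`. The sketch below shows that shape on a hypothetical record (the values and the helper `join_sections` are made up for illustration) and how the sections are joined into one string.

```python
import json

# Hypothetical record; only the fields read by the extraction loop are shown
record = json.loads("""
{
  "paper_id": "demo-0001",
  "metadata": {"title": "A demo coronavirus paper"},
  "abstract": [{"text": "Short abstract."}],
  "body_text": [{"text": "First paragraph on coronavirus."},
                {"text": "Second paragraph."}]
}
""")

def join_sections(sections):
    """Concatenate the 'text' field of each section, separated by spaces."""
    return " ".join(section["text"] for section in sections)

body = join_sections(record["body_text"])
print(body)                   # First paragraph on coronavirus. Second paragraph.
print("coronavirus" in body)  # True
```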

In [11]:
# Extract full text, abstracts, titles and the corresponding paper ids from the JSON data.
# We clean the full text and the titles.
cleaned_text=[]
cleaned_titles=[]
paper_ids=[]
text=[]
abstracts=[]
titles=[]
count=0  # number of papers that have both an abstract and body text
for file in os.listdir("pdf_json") :
    with open(os.path.join("pdf_json", file)) as json_data:
        data=json.load(json_data)
    body_sections=data['body_text']
    abstract_sections=data['abstract']
    if len(abstract_sections)==0 or len(body_sections)==0:
        continue
    count+=1
    body=""
    for d in body_sections :
        body+=d["text"]+" "
    # Keep only papers whose body text mentions 'coronavirus' (case-sensitive match)
    if 'coronavirus' in body :
        # Append the id here so that paper_ids stays aligned with the other lists
        paper_ids.append(data['paper_id'])
        text.append(body)
        cleaned_text.append(clean_abstract(body))
        abstract=""
        for d in abstract_sections :
            abstract+=d["text"]+" "
        abstracts.append(abstract)
        titles.append(data['metadata']['title'])
        cleaned_titles.append(clean_abstract(data['metadata']['title']))
        

We now create the dictionary and corpus objects needed to build the gensim topic model, using the corpora package of the gensim library. The input to the function is the cleaned_text list created above.

In [12]:
from gensim.corpora.dictionary import Dictionary
def create_corpus(text) :
    
    dictio = Dictionary(text)
    corpus = [dictio.doc2bow(texts) for texts in text]
    
    return dictio, corpus

In [13]:
dictionary,corpus=create_corpus(cleaned_text)
for i in range(20) :
    print(dictionary[i])

aberrant
able
absent
abundant
accepted
according
acid
acknowledgment
acquire
acquired
act
activate
activated
activates
activating
activation
actively
activity
adaptive
adaptor
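
As a rough illustration of what the dictionary and bag-of-words corpus contain, the following pure-Python sketch maps tokens to integer ids and encodes each document as (id, count) pairs. It mimics the idea only and is not gensim's implementation. Assigning ids in sorted order within each document is an assumption, chosen because it would be consistent with the alphabetical listing above. The helper name `build_bow` and the toy documents are hypothetical.

```python
from collections import Counter

def build_bow(docs):
    """Map tokens to integer ids and encode each document as sorted (id, count) pairs."""
    token2id = {}
    for doc in docs:
        # New tokens receive ids in sorted order within each document (assumption)
        for token in sorted(set(doc)):
            if token not in token2id:
                token2id[token] = len(token2id)
    bows = [sorted((token2id[t], c) for t, c in Counter(doc).items()) for doc in docs]
    return token2id, bows

docs = [["virus", "cell", "virus"], ["cell", "vaccine"]]
token2id, bows = build_bow(docs)
print(token2id)  # {'cell': 0, 'virus': 1, 'vaccine': 2}
print(bows)      # [[(0, 1), (1, 2)], [(0, 1), (2, 1)]]
```

Word order inside a document is discarded. Only the counts per token id are passed on to the topic model.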


We now create the LDA topic model using gensim. The inputs are the dictionary and corpus objects created above and the number of top words we want to extract from each topic.

While creating the model, we set the number of topics to 8 and random_state=25.

In [14]:
def create_lda_model(dictionary,corpus,n_words) :
    
    lda = LdaModel(corpus, num_topics=8,random_state=25,id2word=dictionary)
    
    return lda,lda.show_topics(num_topics=8, num_words=n_words, formatted=True)

In [15]:
lda_model,topics=create_lda_model(dictionary,corpus,40)
print(len(topics))
print(topics[0])

8
(0, '0.015*"virus" + 0.008*"infection" + 0.008*"sample" + 0.006*"case" + 0.006*"disease" + 0.006*"respiratory" + 0.006*"viral" + 0.005*"patient" + 0.005*"cell" + 0.004*"time" + 0.004*"positive" + 0.004*"child" + 0.004*"human" + 0.004*"result" + 0.004*"clinical" + 0.004*"group" + 0.004*"pathogen" + 0.003*"detection" + 0.003*"data" + 0.003*"influenza" + 0.003*"study" + 0.003*"detected" + 0.003*"reported" + 0.003*"analysis" + 0.003*"animal" + 0.003*"strain" + 0.003*"however" + 0.003*"control" + 0.003*"year" + 0.003*"assay" + 0.003*"different" + 0.003*"high" + 0.003*"number" + 0.003*"outbreak" + 0.002*"found" + 0.002*"infected" + 0.002*"rate" + 0.002*"well" + 0.002*"protein" + 0.002*"associated"')


In [16]:
#Printing the remaining topics (1 to 7) to see which one gives the highest weight to certain words
for x in range(1,8):
    print(topics[x])

(1, '0.016*"patient" + 0.009*"infection" + 0.006*"respiratory" + 0.006*"case" + 0.006*"virus" + 0.005*"data" + 0.004*"group" + 0.004*"influenza" + 0.004*"disease" + 0.004*"viral" + 0.004*"level" + 0.003*"analysis" + 0.003*"result" + 0.003*"cell" + 0.003*"clinical" + 0.003*"hospital" + 0.003*"year" + 0.003*"sample" + 0.003*"study" + 0.003*"number" + 0.003*"sars" + 0.003*"risk" + 0.003*"treatment" + 0.003*"control" + 0.003*"pneumonia" + 0.003*"child" + 0.003*"however" + 0.003*"among" + 0.003*"model" + 0.003*"associated" + 0.003*"rate" + 0.003*"health" + 0.003*"time" + 0.003*"day" + 0.003*"significant" + 0.002*"population" + 0.002*"test" + 0.002*"effect" + 0.002*"protein" + 0.002*"symptom"')
(2, '0.024*"cell" + 0.014*"infection" + 0.012*"virus" + 0.009*"protein" + 0.009*"mouse" + 0.008*"viral" + 0.005*"expression" + 0.005*"infected" + 0.005*"response" + 0.004*"gene" + 0.004*"replication" + 0.004*"type" + 0.003*"host" + 0.003*"level" + 0.003*"result" + 0.003*"human" + 0.003*"effect" + 0.00

Getting the top 20 papers for a given topic number. The input is the topic number k (0-based indexing). We collect every paper in which topic k appears in the paper's topic distribution, sort these papers in descending order of their topic-k proportion, and return the 20 papers with the highest proportion. For topic 0 below, the returned proportions are all above 0.99, so topic 0 is also the dominant topic of each returned paper.


In [10]:
def get_top_articles(k) :
    
    doc_topics=lda_model.get_document_topics(corpus)
    track_dict=[]
    for x,y in enumerate(doc_topics):
        for tup in y:
            if (tup[0]==k):
                track_dict.append((x,tup[1]))
    sort_flat=sorted(track_dict, key = lambda x: x[1],reverse=True)
    return sort_flat[:20]

In [11]:
top_20_0=get_top_articles(0)
top_20_0

[(2595, 0.99980026),
 (2409, 0.9997993),
 (2086, 0.9997805),
 (2785, 0.99975777),
 (1382, 0.99968994),
 (2331, 0.9996309),
 (3608, 0.9995981),
 (3659, 0.9995814),
 (1805, 0.99954945),
 (2431, 0.9995453),
 (271, 0.9995075),
 (1106, 0.9995037),
 (2699, 0.99944925),
 (2406, 0.9994471),
 (2468, 0.99942887),
 (1879, 0.99942493),
 (2117, 0.9994037),
 (2857, 0.99937415),
 (2271, 0.9993598),
 (212, 0.99933577)]
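
get_top_articles ranks every paper in which topic k appears. A stricter variant would keep only papers whose dominant topic is k. The sketch below shows both behaviours on hypothetical toy distributions in the (topic_id, proportion) format. The function name `rank_by_topic` and the `dominant_only` flag are illustrative and not part of the notebook.

```python
# Hypothetical per-document topic distributions as (topic_id, proportion) pairs
toy_doc_topics = [
    [(0, 0.70), (1, 0.30)],
    [(0, 0.40), (1, 0.60)],
    [(0, 0.90), (2, 0.10)],
]

def rank_by_topic(doc_topics, k, top_n=20, dominant_only=False):
    """Rank documents by their proportion of topic k, highest first."""
    ranked = []
    for doc_id, dist in enumerate(doc_topics):
        weights = dict(dist)
        if k not in weights:
            continue  # topic k is absent from this document
        if dominant_only and max(weights, key=weights.get) != k:
            continue  # another topic has a larger share
        ranked.append((doc_id, weights[k]))
    ranked.sort(key=lambda pair: pair[1], reverse=True)
    return ranked[:top_n]

print(rank_by_topic(toy_doc_topics, 0))                      # [(2, 0.9), (0, 0.7), (1, 0.4)]
print(rank_by_topic(toy_doc_topics, 0, dominant_only=True))  # [(2, 0.9), (0, 0.7)]
```

When the top proportions are close to 1, as in the output above, both variants return the same papers.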

What do we know about vaccine development efforts for viruses?
To answer this, we retrieve the 20 most relevant articles on the subject of vaccines for various viruses.

In [12]:
vaccine_articles=get_top_articles(4)

In [13]:
for i,x in vaccine_articles[:5] :
    print(titles[i])
    print("\n")

Recombinant Chimeric Transmissible Gastroenteritis Virus (TGEV)-Porcine Epidemic Diarrhea Virus (PEDV) Virus Provides Protection against Virulent PEDV


Trypsin-independent porcine epidemic diarrhea virus US strain with altered virus entry mechanism


The Program for New Century Excellent Talents in University of Ministry of Education of P.R. China (NCET-10-0144), Sponsored by Chang Jiang Scholar Candidates Programme for Provincial Universities in Heilongjiang


Contribution of porcine aminopeptidase N to porcine deltacoronavirus infection


Genetic manipulation of porcine deltacoronavirus reveals insights into NS6 and NS7 functions: a novel strategy for vaccine design




Looking at the topic listings printed earlier, we see that the word vaccine has a low probability among the top 40 words of all 8 topics. However, topic 4 lists vaccine with a weight of 0.003 (0.003*"vaccine"), so we choose it to retrieve the 20 most relevant articles on the subject of vaccines.
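
Rather than reading the long topic strings by eye, the weight of a word in each topic can be looked up programmatically. The sketch below parses strings in the `weight*"word"` format shown in the topic output. The topic strings are shortened, hypothetical examples, and `word_weight_by_topic` is an illustrative helper, not a gensim function.

```python
import re

# Shortened, hypothetical topic strings in the weight*"word" format
toy_topics = [
    (0, '0.015*"virus" + 0.008*"infection" + 0.006*"respiratory"'),
    (1, '0.016*"patient" + 0.009*"infection"'),
    (2, '0.010*"protein" + 0.003*"vaccine"'),
]

def word_weight_by_topic(topics, word):
    """Return {topic_id: weight} for every topic whose listed words include `word`."""
    found = {}
    for topic_id, formatted in topics:
        for weight, term in re.findall(r'([0-9.]+)\*"([^"]+)"', formatted):
            if term == word:
                found[topic_id] = float(weight)
    return found

print(word_weight_by_topic(toy_topics, "vaccine"))    # {2: 0.003}
print(word_weight_by_topic(toy_topics, "infection"))  # {0: 0.008, 1: 0.009}
```

A word that is missing from a topic's top-word list is simply absent from the result, so only the listed words can be compared this way.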

What do we know about other viruses causing respiratory problems in adults and children?

We retrieve the 20 most relevant articles on the subject of respiratory problems in adults and children.

In [14]:
resp_articles=get_top_articles(7)

In [15]:
for i,x in resp_articles[:5] :
    print(titles[i])
    print("\n")

Respiratory Virus Infections in Hematopoietic Cell Transplant Recipients


viruses Perspective Potential Maternal and Infant Outcomes from Coronavirus 2019-nCoV (SARS-CoV-2) Infecting Pregnant Women: Lessons from SARS, MERS, and Other Human Coronavirus Infections


Exposure Patterns Driving Ebola Transmission in West Africa: A Retrospective Observational Study International Ebola Response Team


Bacterial and viral pathogen spectra of acute respiratory infections in under-5 children in hospital settings in Dhaka city


Goal-Oriented Respiratory Management for Critically Ill Patients with Acute Respiratory Distress Syndrome




Looking at the topic listings printed earlier, we see that among the 8 topics the words "respiratory" and "child" have their highest probabilities, within the top 40 words, in topic 7. So we choose topic 7 to retrieve the 20 most relevant articles on the subject of respiratory problems.

We now calculate the coherence score of the LDA model for different numbers of topics. The input is a list of topic counts for which we want the coherence score. The output is a list of tuples in which the first element is the number of topics and the second element is the coherence score.

In [16]:
from gensim.models.coherencemodel import CoherenceModel
def get_coherence_scores(n_topics) :
    
    cm_score=[]
    for x in n_topics:
        model2 = LdaModel(corpus, num_topics=x,random_state=25,id2word=dictionary)
        cm = CoherenceModel(model=model2,texts=cleaned_text, corpus=corpus,dictionary=dictionary, coherence='c_v')
        cm_score.append((x,cm.get_coherence()))
    
    return cm_score

In [17]:
n_topics=[3,6,8,10]
coherence_scores=get_coherence_scores(n_topics)
coherence_scores

[(3, 0.3237877027385651),
 (6, 0.34323254567915923),
 (8, 0.3383538853673067),
 (10, 0.3341392450816718)]
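
One simple way to read these scores is to pick the topic count with the highest coherence. The sketch below does this with the values copied from the output above. The variable names `best_k`, `best_score` and `spread` are illustrative.

```python
# Scores copied from the output above
coherence_scores = [
    (3, 0.3237877027385651),
    (6, 0.34323254567915923),
    (8, 0.3383538853673067),
    (10, 0.3341392450816718),
]

# Pick the (number of topics, score) pair with the highest coherence
best_k, best_score = max(coherence_scores, key=lambda pair: pair[1])
# Gap between the best and worst score in the list
spread = best_score - min(score for _, score in coherence_scores)

print(best_k, round(best_score, 4))  # 6 0.3432
print(round(spread, 4))              # 0.0194
```

The gap between the best and worst score is about 0.02, so with a single random seed the preference for 6 topics over 8 should be read as tentative.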