# Topic Modeling on Research Paper Abstracts
Iqra Munawar (M.S. Analytics, NCSU 2020) & Sameen Salam (M.S. Analytics, NCSU 2020)

There is a massive amount of research related to COVID-19. In this notebook, we implement two methods for clustering abstracts by topical similarity: K-means on a TF-IDF matrix and LDA topic modeling. By doing this, we hope to make a large corpus of research like this one easier for individual labs to parse and more accessible to the general public.

The original data comes from the CORD-19 Research Data Challenge hosted on Kaggle. We built our own text pre-processing pipeline (paper_abstract_cleaner.ipynb) and ran the original Kaggle dataset through it, producing an output with all original rows and columns plus an extra "abstract2" column. This additional column contains the cleaned, model-ready abstracts and is the primary input for this analysis notebook. Unfortunately, as of this version, we have not found a way to pull the dataset directly through the Kaggle API or to store it on GitHub as a large file, so this code cannot be run locally from the repo contents alone. If you have a Kaggle account, you can download the cleaned data from the Kaggle commit output here: https://www.kaggle.com/sameensalam/paper-abstract-cleaner/output. If you do not have a Kaggle account, you will not be able to get the cleaned abstracts; running the paper_abstract_cleaner notebook in this repo on the original Kaggle dataset would take several hours and tie up your local machine. We hope to solve this problem in the next rollout.

## Pending Items
* See if we can make the Mallet LDA perform any better than a 0.56 coherence score.   
* Look into summarizing key points made frequently in each topic.   
* Make a density plot of documents colored by topic cluster to see the distribution (TSNE or UMAP).   
* Dataset storage within the repo via LFS or linkup to Kaggle API for direct access and reproducibility.   
* Look into how small abstracts (single short sentence in length) affect models and potential removal strategies.  
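
For the last item, one possible starting point is a simple token-count filter. The sketch below is only an illustration: the threshold `MIN_TOKENS` and the helper `drop_short_abstracts` are hypothetical names that do not appear elsewhere in this notebook, and it assumes each abstract is already a list of tokens, as in the "abstract2" column.

```python
# Hypothetical helper (not part of the original pipeline): keep only
# abstracts that contain at least `min_tokens` tokens.
MIN_TOKENS = 10  # assumed threshold; it would need tuning on the real corpus


def drop_short_abstracts(token_lists, min_tokens=MIN_TOKENS):
    """Return (kept_abstracts, dropped_indices) for a list of tokenized abstracts."""
    kept, dropped = [], []
    for idx, tokens in enumerate(token_lists):
        if len(tokens) >= min_tokens:
            kept.append(tokens)
        else:
            dropped.append(idx)
    return kept, dropped


# Toy data standing in for the cleaned abstracts
toy_abstracts = [
    ["virus", "bind", "receptor"],  # a single short sentence
    ["patient"] * 12,               # long enough to keep
]
kept, dropped = drop_short_abstracts(toy_abstracts)
print(len(kept), dropped)
```

Returning the dropped indices makes it easy to inspect what was removed before committing to a threshold.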

## Setup
Import the necessary libraries and read in the data.

In [3]:
#Load in the necessary libraries
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import math
import random
import matplotlib.pyplot as plt
import string
import re
import pickle
import gensim
import pyLDAvis.gensim
import time
import os
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.cluster import KMeans 
from nltk import FreqDist
from gensim.models import LdaModel
from gensim import corpora
from kaggle.api.kaggle_api_extended import KaggleApi
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
from gensim.models import Phrases
from gensim.models import CoherenceModel

In [138]:
#Code for Kaggle API workaround (not solved)
#api = KaggleApi()
#api.authenticate()

In [143]:
#Update path. As of this version, this notebook cannot be run locally due to a lack of access to the data. 
metadata = pd.read_csv(r'C:\Users\USER\Documents\Misc\covid19_research\abstract_cleaned.csv')

## K-Means Approach
Here we use K-means clustering on the TF-IDF matrix of the entire 75,000-abstract corpus. We chose the TF-IDF matrix as the input for our clustering algorithm because, with L2-normalized rows, it accounts for abstracts of different lengths (unlike raw bag-of-words counts). Note that the true_k value of 5 clusters corresponds to the optimal number of topics found in our LDA model, which we cover later in the notebook.
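
To make the point about document length concrete, here is a minimal sketch on a made-up three-document corpus (the documents and the `toy_` names are assumptions for illustration only). With `norm='l2'`, an abstract and a copy of it that is twice as long get the same TF-IDF row, so K-means sees them as the same point.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus: the second document repeats the first one twice
toy_docs = [
    "virus protein cell",
    "virus protein cell virus protein cell",
    "patient surgery outcome",
]

toy_tv = TfidfVectorizer(norm="l2", use_idf=True, smooth_idf=True)
toy_matrix = toy_tv.fit_transform(toy_docs).toarray()

# Every row has unit length, so only the direction of the vector matters
row_norms = np.linalg.norm(toy_matrix, axis=1)
same_vector = np.allclose(toy_matrix[0], toy_matrix[1])
print(np.round(row_norms, 3), same_vector)
```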

In [144]:
metadata.shape

(75000, 20)

In [145]:
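#Changing each entry in the abstract2 column from its string representation back to a list of tokens
#literal_eval only parses Python literals, so it is safer than eval on data read from disk
from ast import literal_eval
cleaned_abstracts = metadata['abstract2']
cleaned_abstracts = cleaned_abstracts.apply(literal_eval)

#Creating new cleaned_abstracts_tf object since these next lines of code only apply to K-means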
cleaned_abstracts_tf = cleaned_abstracts.apply(lambda x: ' '.join(x))
cleaned_abstracts_tf = list(cleaned_abstracts_tf)

In [147]:
#Direct way to get the TF-IDF model without having to go through an initial bag-of-words model
tv = TfidfVectorizer(min_df= 0.00833, max_df = 0.5, norm = 'l2', use_idf = True, smooth_idf= True, lowercase=False, analyzer="word",
                     token_pattern=r"(?u)\S\S+")

#Transforming the cleaned_abstracts_tf object
tv_matrix = tv.fit_transform(cleaned_abstracts_tf)
tv_matrix = tv_matrix.toarray()

#Creating and showing a dataframe to help show how the transformation happened. Each row is a document, each column is a token
vocab = tv.get_feature_names() #On scikit-learn 1.0 and later, use tv.get_feature_names_out() instead
tv_dataframe = pd.DataFrame(np.round(tv_matrix, 2), columns=vocab)
tv_dataframe

Unnamed: 0,2019-ncov,ACE2,ARDS,CD4,CD8,CI,COVID-19,CT,CoV,ELISA,...,world,worldwide,would,wound,wuhan,year,yet,yield,young,zoonotic
0,0.0,0.0,0.0,0.0,0.0,0.0,0.00,0.0,0.0,0.0,...,0.00,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00
1,0.0,0.0,0.0,0.0,0.0,0.0,0.09,0.0,0.0,0.0,...,0.00,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00
2,0.0,0.0,0.0,0.0,0.0,0.0,0.00,0.0,0.0,0.0,...,0.00,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.21
3,0.0,0.0,0.0,0.0,0.0,0.0,0.00,0.0,0.0,0.0,...,0.00,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00
4,0.0,0.0,0.0,0.0,0.0,0.0,0.00,0.0,0.0,0.0,...,0.00,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.10
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
74995,0.0,0.0,0.0,0.0,0.0,0.0,0.00,0.0,0.0,0.0,...,0.00,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00
74996,0.0,0.0,0.0,0.0,0.0,0.0,0.00,0.0,0.0,0.0,...,0.00,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00
74997,0.0,0.0,0.0,0.0,0.0,0.0,0.00,0.0,0.0,0.0,...,0.00,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00
74998,0.0,0.0,0.0,0.0,0.0,0.0,0.00,0.0,0.0,0.0,...,0.19,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00


In [183]:
#Defining a kmeans model object with 5 clusters
true_k = 5
kmeans_model = KMeans(n_clusters= true_k, random_state= 0)

In [184]:
#Fitting the kmeans model
kmeans_model.fit(tv_matrix)

KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
       n_clusters=5, n_init=10, n_jobs=None, precompute_distances='auto',
       random_state=0, tol=0.0001, verbose=0)

In [185]:
#Checking to see if the model properly labeled a few of the documents
kmeans_model.labels_

array([4, 4, 4, ..., 2, 4, 2])

In [186]:
#Getting the contribution of each word in descending order per centroid
order_centroids = kmeans_model.cluster_centers_.argsort()[:, ::-1]
order_centroids

array([[1096,  676, 1501, ..., 1643,  131,  540],
       [   6, 1096,  232, ...,  542, 1511,  218],
       [ 237, 1216, 1645, ..., 1065,  672,   18],
       [1645, 1327,  766, ..., 1459,  745,  521],
       [ 686, 1613,  454, ..., 1022, 1077,  939]], dtype=int64)
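
The `argsort()[:, ::-1]` idiom above can be hard to read at this scale, so here is a small sketch of the same step on made-up numbers (the `toy_` arrays are assumptions for illustration). Each row of the result lists column indices from the largest centroid weight to the smallest, which is how the top terms per cluster are looked up later.

```python
import numpy as np

# Toy vocabulary and two made-up cluster centroids (one weight per term)
toy_terms = ["cell", "patient", "virus", "surgery"]
toy_centers = np.array([
    [0.1, 0.7, 0.0, 0.5],
    [0.6, 0.0, 0.8, 0.1],
])

# argsort sorts ascending; reversing each row gives descending order
toy_order = toy_centers.argsort()[:, ::-1]

# Map the two highest-weight indices of each centroid back to terms
top_two = [[toy_terms[i] for i in row[:2]] for row in toy_order]
print(top_two)
```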

In [187]:
#Getting all of the unique tokens kept during the TF-IDF step
terms = tv.get_feature_names()

Below, you will see the top words that define each cluster. These topics are what we believe each one of those clusters represents: 
* **Cluster 0:** Medical procedures and treatments for COVID-19 and its various complications  
* **Cluster 1:** Public health measures and epidemiology of the novel coronavirus 
* **Cluster 2:** Cellular and genetic mechanisms of the coronavirus  
* **Cluster 3:** Testing and symptomatic presentation of COVID-19
* **Cluster 4:** Meta-analyses of research/literature review on anything that might have to do with COVID-19

In [188]:
for i in range(true_k):
    print("Cluster %d:" % i)
    for ind in order_centroids[i, :15]:
        print(' %s' % terms[ind])
    print("-----------------------------")

Cluster 0:
 patient
 group
 surgery
 laparoscopic
 complication
 outcome
 treatment
 study
 use
 postoperative
 day
 mean
 rate
 procedure
 perform
-----------------------------
Cluster 1:
 COVID-19
 patient
 case
 SARS-cov-2
 pandemic
 coronavirus
 disease
 health
 infection
 severe
 care
 china
 respiratory
 report
 spread
-----------------------------
Cluster 2:
 cell
 protein
 virus
 viral
 expression
 infection
 gene
 RNA
 replication
 bind
 response
 activity
 host
 human
 mice
-----------------------------
Cluster 3:
 virus
 respiratory
 infection
 influenza
 viral
 sample
 strain
 human
 detect
 child
 assay
 detection
 sequence
 disease
 test
-----------------------------
Cluster 4:
 health
 use
 disease
 model
 study
 datum
 system
 risk
 review
 infection
 control
 public
 case
 include
 increase
-----------------------------


We clearly have some really interesting preliminary results here, but we wanted to see if we could improve upon them further with the more robust LDA model. 

## LDA Model Approach

Here we use the Latent Dirichlet Allocation (LDA) model to model the topics in this corpus of 75,000 abstracts. We included bigrams (in this case, pairs of tokens that occur together at least 20 times) and eliminated tokens that appeared in fewer than 625 documents or in more than 50% of documents. Using the resulting token ids for the entire corpus and the resulting individual bag-of-words models for each abstract, we conducted a grid search for the optimal values of the following parameters:  
* **Number of topics**- The number of topics the model assumes while looking at each document.  
* **Alpha**- The prior on the per-document topic distribution. A higher value of alpha tells the model that each document is made up of more topics. 
* **Beta/Eta**- The prior on the per-topic word distribution. A higher value of beta/eta tells the model that each topic includes more words. 
* **Learning decay**- The rate at which the model "forgets" the old weights as it passes through the corpus.  

We tried these parameters on both the regular Gensim LDA algorithm and the Mallet implementation via Gensim. The ideal model was selected based on coherence score, which measures how similar the high-scoring words are in each topic. Using the final candidate model, we plotted the topic distribution using the pyLDAvis library, a staple for this particular algorithm. 
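
The grid search itself is an exhaustive loop over every parameter combination. The sketch below shows that loop in isolation, assuming the same candidate values as the grid search cell further down; `grid_search` and `placeholder_score` are hypothetical names, and the placeholder scoring function merely stands in for training an LDA model and computing its coherence.

```python
from itertools import product

# Candidate values, mirroring the grid search cell in this notebook
param_grid = {
    "num_topics": [4, 5, 6, 7],
    "alpha": [0.05, 0.17, 0.5],
    "eta": [0.05, 0.17, 0.5],
    "decay": [0.5, 0.6, 0.7],
}


def grid_search(score_fn, grid):
    """Score every combination in `grid`; return (best_score, best_params, n_models)."""
    names = list(grid)
    results = []
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        results.append((score_fn(params), params))
    best_score, best_params = max(results, key=lambda pair: pair[0])
    return best_score, best_params, len(results)


def placeholder_score(params):
    # Stand-in for "train LDA with these params, return its coherence".
    # It is made up so that 5 topics scores highest.
    return -abs(params["num_topics"] - 5)


best_score, best_params, n_models = grid_search(placeholder_score, param_grid)
print(n_models, best_params["num_topics"])
```

With these lists the grid contains 4 × 3 × 3 × 3 = 108 combinations, which is why the real search takes hours when each combination trains a full model.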

In [162]:
# Compute bigrams from the cleaned_abstracts object defined earlier
# Add bigrams to docs (only ones that appear 20 times or more).
bigram = Phrases(cleaned_abstracts, min_count=20)
for idx in range(len(cleaned_abstracts)):
    for token in bigram[cleaned_abstracts[idx]]:
        if '_' in token:
            # Token is a bigram, add to document.
            cleaned_abstracts[idx].append(token)

In [163]:
#Create token ids for entire corpus
dictionary = corpora.Dictionary(cleaned_abstracts)

# Filter out words that occur in fewer than 625 documents, or in more than 50% of the documents.
dictionary.filter_extremes(no_below= 625, no_above=0.5)

#Creating individual bag of words for each abstract mapped according to the dictionary
corpus = [dictionary.doc2bow(text) for text in cleaned_abstracts]

#Saving the dictionary and corpus items for later use
pickle.dump(corpus, open('corpus.pkl', 'wb'))
dictionary.save('dictionary.gensim')
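
The `filter_extremes` call above keeps or drops tokens based on document frequency, meaning the number of abstracts a token appears in rather than its total count. As a rough plain-Python sketch of that rule (the helper `filter_by_doc_freq` is hypothetical, and it ignores gensim's additional `keep_n` cap on vocabulary size):

```python
from collections import Counter


def filter_by_doc_freq(token_lists, no_below, no_above):
    """Keep tokens found in at least `no_below` documents and in at most
    a fraction `no_above` of all documents."""
    n_docs = len(token_lists)
    doc_freq = Counter()
    for tokens in token_lists:
        doc_freq.update(set(tokens))  # count each token once per document
    return {
        token
        for token, df in doc_freq.items()
        if no_below <= df <= no_above * n_docs
    }


toy_docs = [
    ["virus", "cell"],
    ["virus", "patient"],
    ["virus", "cell", "cell"],
    ["virus", "surgery"],
]
# "virus" is in every document (too common); "patient" and "surgery" are in one (too rare)
kept_tokens = filter_by_doc_freq(toy_docs, no_below=2, no_above=0.5)
print(kept_tokens)
```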

In [190]:
#Gridsearch parameters for LDA model. This code takes about 6hrs to run, so it's commented out for convenience here. 

#num_topics = [4,5,6,7]#5 topics was the best value
#test_alphas = [0.05,0.17,0.5]-----default (1/num_topics 1/6~0.17) was best
#test_betas = [0.05,0.17,0.5]----0.5 was best, but only very very slightly
#test_decay = [0.5,0.6,0.7]-----0.5 test decay was the best value
#coherence_vals = []
#counter = 1

#for topic_val in num_topics:
    
#    for alpha_val in test_alphas:
        
#        for beta_val in test_betas:
            
#            for decay_val in test_decay:
                
#                start_time = time.time()

#                ldamodel_gensim = gensim.models.LdaMulticore(corpus, num_topics = topic_val, id2word=dictionary, passes=7, workers=3, random_state=0, alpha=alpha_val, eta=beta_val, decay=decay_val)

#                end_time = time.time()

                #print("Perplexity Score:", ldamodel_gensim.log_perplexity(corpus))

#                coherence_model_lda = CoherenceModel(model=ldamodel_gensim,texts= cleaned_abstracts, dictionary=dictionary, coherence='c_v')
#                coherence_vals.append(coherence_model_lda.get_coherence())
                
#                print("Finished model:",counter, "of 108") #4*3*3*3 = 108 parameter combinations
#                counter+=1

#coherence_vals

In [189]:
#Regular Gensim LDA model with gridsearch ideal parameters
start_time = time.time()

ldamodel_gensim = gensim.models.LdaMulticore(corpus, num_topics = 5, id2word=dictionary, passes=7, workers=3, random_state=0, alpha=0.17, eta=0.5, decay=0.5)

end_time = time.time()

coherence_model_lda = CoherenceModel(model=ldamodel_gensim,texts= cleaned_abstracts, dictionary=dictionary, coherence='c_v')
coherence_model_lda.get_coherence()

0.5333850126916893

To run this section of the code, you need to have installed the mallet software package. This link will take you to where you can download it: http://mallet.cs.umass.edu/download.php. This link will take you to a tutorial that helps with proper usage of Mallet: https://www.tutorialspoint.com/gensim/gensim_creating_lda_mallet_model.htm.  

In [179]:
#Mallet implementation of gensim LDA with ideal number of topics
os.environ.update({'MALLET_HOME':r'C:/Users/USER/Documents/Misc/covid19_research/mallet-2.0.8/'})

mallet_path = 'C:/Users/USER/Documents/Misc/covid19_research/mallet-2.0.8/bin/mallet'

start_time = time.time()

ldamodel_gensim_mallet = gensim.models.wrappers.LdaMallet(mallet_path= mallet_path, corpus= corpus, num_topics= 5, id2word=dictionary, workers=3,random_seed=1)

end_time = time.time()

#print("Getting coherence now!")

coherence_model_ldamallet = CoherenceModel(model=ldamodel_gensim_mallet,texts= cleaned_abstracts, dictionary=dictionary, coherence='c_v')
coherence_model_ldamallet.get_coherence()

#print("Run time:",end_time-start_time)


0.5550664345002915

After finding that the Mallet LDA model (0.555) performed better than the regular Gensim implementation (0.533), we decided to move forward with the Mallet model for visualization.

In [180]:
#This snippet is necessary to convert the Mallet model object into something that the pyLDAvis library can use
#Function to bypass gensim.models.wrappers.ldamallet.malletmodel2ldamodel, which has known bugs that reduce model performance
#Credit to Stackoverflow user: norpa
def ldaMalletConvertToldaGen(mallet_model):
    model_gensim = LdaModel(id2word=mallet_model.id2word, num_topics=mallet_model.num_topics, alpha=mallet_model.alpha, eta=0, iterations=1000, gamma_threshold=0.001, dtype=np.float32)
    model_gensim.state.sstats[...] = mallet_model.wordtopics
    model_gensim.sync_state()
    return model_gensim

converted_model = ldaMalletConvertToldaGen(ldamodel_gensim_mallet)

In [181]:
#Saving the converted model for use in PyLDAvis
converted_model.save('model.gensim')

#Printing the top 10 words in each of the 5 topics
topics = converted_model.print_topics(num_words=10)
for topic in topics:
    print(topic)

(0, '0.051*"virus" + 0.031*"cell" + 0.023*"protein" + 0.018*"human" + 0.017*"viral" + 0.012*"response" + 0.011*"gene" + 0.011*"show" + 0.010*"vaccine" + 0.010*"RNA"')
(1, '0.037*"infection" + 0.029*"respiratory" + 0.021*"patient" + 0.021*"clinical" + 0.017*"disease" + 0.017*"acute" + 0.016*"severe" + 0.015*"test" + 0.013*"influenza" + 0.013*"high"')
(2, '0.034*"COVID-19" + 0.027*"disease" + 0.025*"health" + 0.016*"case" + 0.014*"care" + 0.014*"pandemic" + 0.012*"risk" + 0.011*"control" + 0.011*"number" + 0.011*"report"')
(3, '0.026*"study" + 0.020*"model" + 0.015*"review" + 0.014*"datum" + 0.013*"system" + 0.013*"method" + 0.012*"result" + 0.012*"provide" + 0.011*"include" + 0.010*"analysis"')
(4, '0.058*"patient" + 0.025*"group" + 0.016*"treatment" + 0.016*"study" + 0.014*"compare" + 0.013*"outcome" + 0.013*"rate" + 0.013*"time" + 0.011*"perform" + 0.009*"significantly"')


In [182]:
#Loading the necessary objects for pyLDAvis
dictionary = gensim.corpora.Dictionary.load('dictionary.gensim')
corpus = pickle.load(open('corpus.pkl', 'rb'))
lda = gensim.models.ldamodel.LdaModel.load('model.gensim')

#Plotting the LDA model results
lda_display = pyLDAvis.gensim.prepare(lda, corpus, dictionary, sort_topics=False)
pyLDAvis.display(lda_display)

[Figure: interactive pyLDAvis visualization of the 5 LDA topics]


Based on the graphic above, you can see that the 5 topics in our model are spread apart with minimal overlap, a good sign that our LDA model performed well. These are our interpretations of the 5 topics in the visual (numbered 1 to 5 there, corresponding to topics 0 to 4 in the printed output above):  
* **Topic 1**: Cellular and genetic mechanisms of the coronavirus  
* **Topic 2**: Testing and symptomatic presentation of COVID-19  
* **Topic 3**: Public health measures and epidemiology of the novel coronavirus  
* **Topic 4**: Meta-analyses of research/literature review on anything that might have to do with COVID-19  
* **Topic 5**: Medical procedures and treatments for COVID-19 and its various complications  

As you can see, they map out very similarly to our K-means based clustering. Despite the differences between the algorithms (for example, LDA allows a document to carry multiple topical identities, whereas K-means assigns each document to exactly one cluster), there seems to be a strong degree of similarity in their results. Below, you will see individual topic coherences and topic-defining words.  
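
One note on reading the numbers that follow: gensim's `top_topics` uses the `'u_mass'` coherence measure by default, not the `'c_v'` measure used for model selection above, so the per-topic values are typically negative and are not comparable to the 0.53 to 0.56 scores. The sketch below is a simplified UMass-style score in plain Python, written from the commonly cited formula (log of smoothed co-document frequency over document frequency, summed over ordered word pairs). It is an illustration under that assumption, and gensim's own implementation may differ in details such as how pair scores are aggregated.

```python
import math
from itertools import combinations


def umass_coherence(top_words, token_lists):
    """Simplified UMass-style coherence for one topic's top words.

    Assumes every word in `top_words` occurs in at least one document.
    """
    doc_sets = [set(tokens) for tokens in token_lists]

    def doc_count(*words):
        # Number of documents that contain all of the given words
        return sum(all(w in doc for w in words) for doc in doc_sets)

    score = 0.0
    # combinations() pairs each word with the words ranked after it
    for earlier, later in combinations(top_words, 2):
        score += math.log((doc_count(earlier, later) + 1) / doc_count(earlier))
    return score


toy_docs = [
    ["virus", "cell", "protein"],
    ["virus", "cell"],
    ["patient", "surgery"],
    ["virus", "patient"],
]
coherent = umass_coherence(["virus", "cell"], toy_docs)       # words that co-occur
incoherent = umass_coherence(["virus", "surgery"], toy_docs)  # words that never co-occur
print(coherent, incoherent)
```

Words that rarely share a document pull the score down, so values closer to zero indicate a more coherent topic under this measure.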

In [192]:
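#top_topics scores each topic with the 'u_mass' coherence measure by default, so these values are not on the same scale as the 'c_v' scores above
top_topics = converted_model.top_topics(corpus)
avg_topic_coherence = sum([t[1] for t in top_topics]) / len(top_topics)
print('Average topic coherence: %.4f.' % avg_topic_coherence)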

from pprint import pprint
pprint(top_topics)

Average topic coherence: -1.7548.
[([(0.057637893, 'patient'),
   (0.02463671, 'group'),
   (0.015843473, 'treatment'),
   (0.01575325, 'study'),
   (0.0144310305, 'compare'),
   (0.012819775, 'outcome'),
   (0.012600204, 'rate'),
   (0.012529143, 'time'),
   (0.011136661, 'perform'),
   (0.009381687, 'significantly'),
   (0.009208425, 'day'),
   (0.008997637, 'surgery'),
   (0.008667881, 'significant'),
   (0.007949283, 'complication'),
   (0.00789978, 'difference'),
   (0.007771231, 'high'),
   (0.0076985722, 'year'),
   (0.007697774, 'CI'),
   (0.0075628376, 'increase'),
   (0.0074470635, 'low')],
  -1.6134076469292093),
 ([(0.037298456, 'infection'),
   (0.029244183, 'respiratory'),
   (0.020698465, 'patient'),
   (0.020525016, 'clinical'),
   (0.017482165, 'disease'),
   (0.016621439, 'acute'),
   (0.016335152, 'severe'),
   (0.01494568, 'test'),
   (0.013403273, 'influenza'),
   (0.013358512, 'high'),
   (0.012544412, 'sample'),
   (0.012443698, 'case'),
   (0.011379681, 'detect'

## Conclusions
Based on our results so far, we have been able to boil down the research presented in the CORD-19 Research Data Challenge into five main topics:
* **Topic 1**: Cellular and genetic mechanisms of the coronavirus  
* **Topic 2**: Testing and symptomatic presentation of COVID-19  
* **Topic 3**: Public health measures and epidemiology of the novel coronavirus  
* **Topic 4**: Meta-analyses of research/literature review on anything that might have to do with COVID-19  
* **Topic 5**: Medical procedures and treatments for COVID-19 and its various complications  

From here, we will assign these topic labels back to the individual abstracts. We will create topical density plots to show the abstract distribution and how "soft" the boundaries are between topics in a more granular way. We will also implement some type of summarizing algorithm that takes in all abstracts attributed to a specific topic label and breaks them down into the most important sentences and phrases. 
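
As a sketch of what the label assignment could look like, the snippet below takes one document's topic distribution as a list of `(topic_id, probability)` pairs, the shape returned by gensim's `get_document_topics`, and returns the dominant topic along with a simple "softness" indicator. The helper `dominant_topic`, the margin-based indicator, and the toy distributions are all assumptions for illustration, not results from this notebook.

```python
def dominant_topic(topic_probs):
    """Return (topic_id, margin) for one document.

    `margin` is the gap between the top two topic probabilities:
    a small margin means the document sits near a soft topic boundary.
    """
    ranked = sorted(topic_probs, key=lambda pair: pair[1], reverse=True)
    best_id, best_p = ranked[0]
    runner_up_p = ranked[1][1] if len(ranked) > 1 else 0.0
    return best_id, best_p - runner_up_p


# Made-up topic distributions for two documents
clear_doc = [(0, 0.9), (2, 0.1)]
mixed_doc = [(1, 0.45), (3, 0.40), (4, 0.15)]

clear_label, clear_margin = dominant_topic(clear_doc)
mixed_label, mixed_margin = dominant_topic(mixed_doc)
print(clear_label, round(clear_margin, 2), mixed_label, round(mixed_margin, 2))
```

Keeping the margin alongside the label preserves some of LDA's soft assignment, which could later drive the shading in a density plot.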

### Sources
We found the following articles helpful for code and/or concepts:  
https://www.machinelearningplus.com/nlp/topic-modeling-gensim-python/  
https://towardsdatascience.com/evaluate-topic-model-in-python-latent-dirichlet-allocation-lda-7d57484bb5d0  
https://towardsdatascience.com/lda-topic-modeling-an-explanation-e184c90aadcd  
https://radimrehurek.com/gensim/models/ldamulticore.html  
https://markroxor.github.io/gensim/static/notebooks/lda_training_tips.html  
https://towardsdatascience.com/using-mallet-lda-to-learn-why-players-hate-pok%C3%A9mon-sword-shield-23b12e4fc395  
https://pythonprogramminglanguage.com/kmeans-text-clustering/  
https://www.youtube.com/channel/UCgBncpylJ1kiVaPyP-PZauQ  