## Clustering asthma-related papers in the CORD-19 dataset

### Introduction
The goal of this project is to explore research topics in asthma and coronaviruses. What are the most popular topics the research community is focused on, before and after the COVID-19 outbreak? Are the areas of interest around asthma and coronaviruses the same before and after the appearance of SARS-CoV-2?

In this project, I use Natural Language Processing (NLP) techniques in Python to explore research topics linking asthma and coronaviruses, both before the identification of SARS-CoV-2 and after the outbreak of the pandemic. The analysis is based on clustering scientific publications, in order to create groups of papers with similar topics. Two groups of clusters are created, one for papers published before and one for papers published after the COVID-19 outbreak. For both periods, clustering aims at identifying popular research topics and finding potential gaps in research between asthma and the new coronavirus.

More details about the motivation and the scientific background of this data analysis can be found in my Medium article:


### Data
In response to the COVID-19 pandemic, a large database, the COVID-19 Open Research Dataset (CORD-19), was created and made publicly available. CORD-19 is a resource of hundreds of thousands of scholarly articles about COVID-19, SARS-CoV-2, and related coronaviruses: https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge

### Collecting and Preprocessing Data

From this large database, I kept only those papers where the word "asthma" appears at least once in the abstract.
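
As an illustration of this filtering step, here is a minimal sketch on a toy table. The `metadata` DataFrame and its rows are made up for the example; only the `abstract` and `abstract_lower` column names mirror the data used later in this notebook.

```python
import pandas as pd

# Toy stand-in for the CORD-19 metadata (hypothetical rows, for illustration only)
metadata = pd.DataFrame({
    "title": ["Paper A", "Paper B", "Paper C"],
    "abstract": [
        "Viral infections can trigger exacerbations of Asthma in children.",
        "Spike protein structure of a novel coronavirus.",
        None,  # some papers have no abstract
    ],
})

# Lower-case the abstracts so that the match is case-insensitive
metadata["abstract_lower"] = metadata["abstract"].str.lower()

# Keep rows whose abstract mentions "asthma"; missing abstracts count as no match
mask = metadata["abstract_lower"].str.contains("asthma", na=False, regex=False)
asthma_subset = metadata.loc[mask].reset_index(drop=True)

print(asthma_subset["title"].tolist())
```

Passing `na=False` keeps papers without an abstract out of the selection instead of propagating missing values into the mask.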

I went through the following text preprocessing steps, using NLTK and SpaCy:
- Removal of stop words
- Removal of non-English publications

Using their publication date, I divided the papers into those published before the outbreak of the pandemic (December 2019) and those published after. For the two groups of papers I applied:
- Tokenization
- Stemming
- Use of scikit-learn's `TfidfVectorizer` to transform tokens into a matrix of TF-IDF features
- Application of the KMeans algorithm for Clustering
- Application of the PCA algorithm for dimensionality reduction and clusters' visualization
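
To make the TF-IDF step in the list above more concrete, here is a small sketch on three made-up sentences. It uses the default tokenizer of `TfidfVectorizer` rather than the custom one defined later in this notebook, so the vocabulary is only illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Three made-up "abstracts" (illustration only)
docs = [
    "rhinovirus infection and wheezing in children",
    "coronavirus infection and airway inflammation",
    "inhaled corticosteroids and airway inflammation",
]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)  # sparse matrix: one row per document

# With the default norm="l2", every document vector has unit length
row_norms = np.linalg.norm(tfidf.toarray(), axis=1)

print(tfidf.shape)
print(row_norms)
```

Each row is one document and each column one term; terms shared by many documents receive lower weights than terms specific to a few.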

*Note: This data analysis was performed in February 2021 and doesn't take into account later updates of the database.*


### Loading libraries

In [1]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import PCA
from sklearn import preprocessing
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from PIL import Image

In [2]:
pd.options.display.max_colwidth = 200

In [3]:
import nltk
import string
from nltk.stem import PorterStemmer
from sklearn.cluster import KMeans
from langdetect import detect

In [4]:
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

from yellowbrick.cluster import KElbowVisualizer

In [5]:
import pickle

In [6]:
import plotly.express as px
import plotly.io as pio
pio.renderers
import plotly.graph_objects as go

import seaborn as sns
from wordcloud import WordCloud, ImageColorGenerator


In [7]:
from nltk.corpus import stopwords
stop_words = stopwords.words("english")


In [8]:
spacy_stop_words = ['whence', 'here', 'show', 'were', 'why', 'n’t', 'the', 'whereupon', 'not', 'more', 'how', 'eight', 'indeed', 'i', 'only', 'via', 'nine', 're', 'themselves', 'almost', 'to', 'already', 'front', 'least', 'becomes', 'thereby', 'doing', 'her', 'together', 'be', 'often', 'then', 'quite', 'less', 'many', 'they', 'ourselves', 'take', 'its', 'yours', 'each', 'would', 'may', 'namely', 'do', 'whose', 'whether', 'side', 'both', 'what', 'between', 'toward', 'our', 'whereby', "'m", 'formerly', 'myself', 'had', 'really', 'call', 'keep', "'re", 'hereupon', 'can', 'their', 'eleven', '’m', 'even', 'around', 'twenty', 'mostly', 'did', 'at', 'an', 'seems', 'serious', 'against', "n't", 'except', 'has', 'five', 'he', 'last', '‘ve', 'because', 'we', 'himself', 'yet', 'something', 'somehow', '‘m', 'towards', 'his', 'six', 'anywhere', 'us', '‘d', 'thru', 'thus', 'which', 'everything', 'become', 'herein', 'one', 'in', 'although', 'sometime', 'give', 'cannot', 'besides', 'across', 'noone', 'ever', 'that', 'over', 'among', 'during', 'however', 'when', 'sometimes', 'still', 'seemed', 'get', "'ve", 'him', 'with', 'part', 'beyond', 'everyone', 'same', 'this', 'latterly', 'no', 'regarding', 'elsewhere', 'others', 'moreover', 'else', 'back', 'alone', 'somewhere', 'are', 'will', 'beforehand', 'ten', 'very', 'most', 'three', 'former', '’re', 'otherwise', 'several', 'also', 'whatever', 'am', 'becoming', 'beside', '’s', 'nothing', 'some', 'since', 'thence', 'anyway', 'out', 'up', 'well', 'it', 'various', 'four', 'top', '‘s', 'than', 'under', 'might', 'could', 'by', 'too', 'and', 'whom', '‘ll', 'say', 'therefore', "'s", 'other', 'throughout', 'became', 'your', 'put', 'per', "'ll", 'fifteen', 'must', 'before', 'whenever', 'anyone', 'without', 'does', 'was', 'where', 'thereafter', "'d", 'another', 'yourselves', 'n‘t', 'see', 'go', 'wherever', 'just', 'seeming', 'hence', 'full', 'whereafter', 'bottom', 'whole', 'own', 'empty', 'due', 'behind', 'while', 'onto', 'wherein', 'off', 
'again', 'a', 'two', 'above', 'therein', 'sixty', 'those', 'whereas', 'using', 'latter', 'used', 'my', 'herself', 'hers', 'or', 'neither', 'forty', 'thereupon', 'now', 'after', 'yourself', 'whither', 'rather', 'once', 'from', 'until', 'anything', 'few', 'into', 'such', 'being', 'make', 'mine', 'please', 'along', 'hundred', 'should', 'below', 'third', 'unless', 'upon', 'perhaps', 'ours', 'but', 'never', 'whoever', 'fifty', 'any', 'all', 'nobody', 'there', 'have', 'anyhow', 'of', 'seem', 'down', 'is', 'every', '’ll', 'much', 'none', 'further', 'me', 'who', 'nevertheless', 'about', 'everywhere', 'name', 'enough', '’d', 'next', 'meanwhile', 'though', 'through', 'on', 'first', 'been', 'hereby', 'if', 'move', 'so', 'either', 'amongst', 'for', 'twelve', 'nor', 'she', 'always', 'these', 'as', '’ve', 'amount', '‘re', 'someone', 'afterwards', 'you', 'nowhere', 'itself', 'done', 'hereafter', 'within', 'made', 'ca', 'them']

In [9]:
# extending the list of stopwords taken into account
stop_words.extend(spacy_stop_words)

### Data overview

In [10]:
# Loading the csv file to a pandas dataframe, to have a look at the papers' metadata:
asthma_df = pd.read_csv("asthma_data.csv")
asthma_df.head(n=3)

Unnamed: 0,gitcord_uid,sha,source_x,title,doi,pmcid,pubmed_id,license,abstract,publish_time,...,journal,mag_id,who_covidence_id,arxiv_id,pdf_json_files,pmc_json_files,url,s2_id,abstract_lower,title_lower
0,qva0jt86,4ba79e54ecf81b30b56461a6aec2094eaf7b7f06,PMC,Relevance of human metapneumovirus in exacerbations of COPD,10.1186/1465-9921-6-150,PMC1334186,16371156.0,cc-by,"BACKGROUND AND METHODS: Human metapneumovirus (hMPV) is a recently discovered respiratory virus associated with bronchiolitis, pneumonia, croup and exacerbations of asthma. Since respiratory virus...",2005-12-21,...,Respir Res,,,,document_parses/pdf_json/4ba79e54ecf81b30b56461a6aec2094eaf7b7f06.json,document_parses/pmc_json/PMC1334186.xml.json,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1334186/,,"background and methods: human metapneumovirus (hmpv) is a recently discovered respiratory virus associated with bronchiolitis, pneumonia, croup and exacerbations of asthma. since respiratory virus...",relevance of human metapneumovirus in exacerbations of copd
1,chz8luni,d68d71553d3a31381c0c3851351f912a9a7be1c9,PMC,Surfactant therapy for acute respiratory failure in children: a systematic review and meta-analysis,10.1186/cc5944,PMC2206432,17573963.0,cc-by,"INTRODUCTION: Exogenous surfactant is used to treat acute respiratory failure in children, although the benefits and harms in this setting are not clear. The objective of the present systematic re...",2007-06-15,...,Crit Care,,,,document_parses/pdf_json/d68d71553d3a31381c0c3851351f912a9a7be1c9.json,document_parses/pmc_json/PMC2206432.xml.json,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2206432/,,"introduction: exogenous surfactant is used to treat acute respiratory failure in children, although the benefits and harms in this setting are not clear. the objective of the present systematic re...",surfactant therapy for acute respiratory failure in children: a systematic review and meta-analysis
2,3zh8jmc2,fe2000f280297c40bc53ce95d703a9ca6aac19fd,PMC,Differential Regulation of Type I Interferon and Epidermal Growth Factor Pathways by a Human Respirovirus Virulence Factor,10.1371/journal.ppat.1000587,PMC2736567,19806178.0,cc-by,"A number of paramyxoviruses are responsible for acute respiratory infections in children, elderly and immuno-compromised individuals, resulting in airway inflammation and exacerbation of chronic d...",2009-09-18,...,PLoS Pathog,,,,document_parses/pdf_json/fe2000f280297c40bc53ce95d703a9ca6aac19fd.json,document_parses/pmc_json/PMC2736567.xml.json,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2736567/,,"a number of paramyxoviruses are responsible for acute respiratory infections in children, elderly and immuno-compromised individuals, resulting in airway inflammation and exacerbation of chronic d...",differential regulation of type i interferon and epidermal growth factor pathways by a human respirovirus virulence factor


There are 2567 papers containing the word "asthma" among the coronavirus-related publications.

In [11]:
asthma_df.shape

(2567, 21)

In [None]:
# Use the langdetect library to detect the language of the "abstract_lower" column in our dataframe

asthma_df["lang_detect"] = asthma_df["abstract_lower"].apply(detect)

In [None]:
# Non-English papers make up only a small share of the total, so they are excluded from the analysis

asthma_df["lang_detect"].value_counts()

There are 2528 papers in total written in English.

In [None]:
asthma_df = asthma_df.loc[asthma_df['lang_detect'] == "en"]
asthma_df.shape

I'd like to quickly visualize the text of these papers, so I create a word cloud where the most frequent words are drawn in a bigger font! I'd also like to give the image a shape and use it as a cover for my article. What would it be?

In [None]:
virus_mask = np.array(Image.open("coronavirus_canvas.png"))

In [None]:
text = " ".join(ab for ab in asthma_df["abstract_lower"].tolist())

In [None]:
wc = WordCloud(background_color="white", max_words=100, mask=virus_mask,
               stopwords=stop_words, contour_width=3, contour_color='gray')

wc.generate(text)

wc.to_file("coronavirus_new.png")

plt.figure(figsize=[20,10])
plt.imshow(wc, interpolation='bilinear')
plt.axis("off")
plt.show()
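
To check which words are likely to dominate the word cloud, the frequencies can be counted directly. This is a rough sketch using only the standard library; `top_words`, the sample sentence and the short stop-word set are hypothetical, and `WordCloud` applies its own tokenization, so the exact counts may differ.

```python
from collections import Counter

def top_words(text, stop_words, n=5):
    """Return the n most frequent words in text, ignoring stop words."""
    # Split on whitespace and strip common punctuation from both ends
    tokens = [t.strip(".,;:()") for t in text.lower().split()]
    tokens = [t for t in tokens if t and t not in stop_words]
    return Counter(tokens).most_common(n)

sample_text = ("Asthma and viral infection. "
               "Viral infection in children with asthma and wheezing.")
sample_stop_words = {"and", "in", "with"}

print(top_words(sample_text, sample_stop_words, n=3))
```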

Let's go back to data processing!


Papers are divided between those published before the identification of the new coronavirus, SARS-CoV-2, and those published after. I pick December 2019 as the cut-off date.

The dataset contains 1023 papers published before December 2019 and 1507 papers published in December 2019 or later.

In [None]:
asthma_before_covid = asthma_df.loc[asthma_df['publish_time']<"2019-12-01"].reset_index(drop=True)
asthma_before_covid.shape

In [None]:
asthma_after_covid = asthma_df.loc[asthma_df['publish_time']>="2019-12-01"].reset_index(drop=True)
asthma_after_covid.shape
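
The two cells above compare `publish_time` as text, which works because dates written as `YYYY-MM-DD` sort in chronological order. A sketch of the same split with parsed dates is shown below; the `papers` table is made up, and it assumes every date is a complete `YYYY-MM-DD` string.

```python
import pandas as pd

# Toy table with complete ISO dates (hypothetical rows)
papers = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "publish_time": ["2005-12-21", "2019-11-30", "2019-12-01", "2020-06-15"],
})

cutoff = pd.Timestamp("2019-12-01")
papers["publish_date"] = pd.to_datetime(papers["publish_time"], format="%Y-%m-%d")

# Papers dated exactly on the cut-off fall into the "after" group
before = papers.loc[papers["publish_date"] < cutoff].reset_index(drop=True)
after = papers.loc[papers["publish_date"] >= cutoff].reset_index(drop=True)

print(before["title"].tolist(), after["title"].tolist())
```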

Let's have a look at the number of papers published per month since the COVID-19 outbreak.

In [None]:
asthma_after_covid['publish_time_new'] =  pd.to_datetime(asthma_after_covid['publish_time'])

In [None]:
asthma_after_covid['publish_month_year'] = pd.to_datetime(asthma_after_covid['publish_time']).dt.to_period('M')
asthma_after_covid.head()

In [None]:
asthma_after_covid = asthma_after_covid.sort_values('publish_month_year')

In [None]:
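dates = asthma_after_covid["publish_month_year"].value_counts()
# Name both columns explicitly, so the cells below work regardless of the pandas version
dates_df = dates.rename_axis("index").reset_index(name="publish_month_year")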

In [None]:
dates_df = dates_df.sort_values("index")

In [None]:
dates_df.rename(columns={"index": "date_published", "publish_month_year":"number of papers"})

The graph below illustrates the number of papers published per month over the last 12-14 months. However, the two peaks in January 2020 and January 2021 are not completely accurate: a certain number of papers had only the year (yyyy) as publication date, so January 1st of that year (01/01/yyyy) was taken as their complete date.

As a result, we cannot draw a very accurate picture of how publications are distributed across the months.
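
One possible way to limit this effect is to flag year-only dates before counting papers per month. The sketch below works on a made-up series and assumes that a year-only date is stored as a four-character string such as "2020"; this was not verified against the full dataset.

```python
import pandas as pd

# Made-up publication dates: some complete, some year-only
publish_time = pd.Series(["2020-03-15", "2020", "2020-03-02", "2021", "2021-01-20"])

# Assumption: a year-only date is a four-character string such as "2020"
year_only = publish_time.str.len() == 4

# Count papers per month using only the complete dates
full_dates = pd.to_datetime(publish_time[~year_only], format="%Y-%m-%d")
monthly_counts = full_dates.dt.to_period("M").value_counts().sort_index()

print("year-only entries:", int(year_only.sum()))
print(monthly_counts)
```

The year-only papers can then be reported separately instead of inflating the January counts.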

In [None]:
dates_df.plot(x ='index', y = 'publish_month_year')

## Clustering

### Before COVID-19

In [None]:
texts_before = asthma_before_covid["abstract_lower"].tolist()

In [None]:
def custom_tokenizer(str_input):
    
    stemmer = PorterStemmer()
    words = nltk.word_tokenize(str_input)
    words = [word for word in words if word.lower() not in stop_words]
    
    words = [word.replace('â¡', '') for word in words]
    words = [word.replace('â¢', '') for word in words]
    words = [word.replace('â£', '') for word in words]
       
    words = [''.join(c for c in word if c not in string.punctuation+'©±×≤≥●＜--“”→„') for word in words]
    words = [word for word in words if word not in ['‘', '’', '„']]
        
    words = [word for word in words if word]
    words = [word for word in words if not any(char.isdigit() for char in word)]
    
    words = [stemmer.stem(word) for word in words]
    words = [word for word in words if len(word)> 1]
    words = [word for word in words if "asthma" not in word]
    
        
    return words

In [None]:
vec_before = TfidfVectorizer(tokenizer=custom_tokenizer,
                             max_features=2000,
                             stop_words='english')

matrix_before = vec_before.fit_transform(texts_before)
# get_feature_names_out() needs scikit-learn >= 1.0; older versions use get_feature_names()
df_before = pd.DataFrame(matrix_before.toarray(), columns=vec_before.get_feature_names_out())
df_before.head()

I use the "elbow" method to estimate the optimal number of clusters for the group of papers:
https://www.scikit-yb.org/en/latest/api/cluster/elbow.html

However, in the context of this analysis the results are not reproducible: the model suggests a different number of clusters from one run to the next. For this reason, its suggestion is not followed strictly.
The final number of clusters for each group of papers is chosen after running the code for different numbers of clusters and reviewing the clusters' content (papers' topics) every time.
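
For intuition, the quantity behind an elbow plot can also be computed by hand. The sketch below fits K-means for a range of `k` on synthetic blobs (not on the paper abstracts) and records `inertia_`, the sum of squared distances of the points to their closest centroid. Fixing both `random_state` and `n_init` is one way to make repeated runs comparable; the values used here are arbitrary.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with four well-separated groups (illustration only)
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.6, random_state=0)

inertias = {}
for k in range(2, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = km.inertia_  # sum of squared distances to the closest centroid

for k, value in inertias.items():
    print(k, round(value, 1))
```

The "elbow" is the value of `k` after which the inertia stops dropping sharply. On sparse, high-dimensional TF-IDF data the curve is often much smoother than on blobs like these, which may explain why the suggestion is unstable here.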

For more information about the use of K-means in clustering:
https://towardsdatascience.com/understanding-k-means-clustering-in-machine-learning-6a6e67336aa1

In [None]:
def vizualize_elbow(data, min_cluster=4, max_cluster=20):
    
    model = KMeans(random_state=2)
    visualizer = KElbowVisualizer(model, k=(min_cluster, max_cluster))

    visualizer.fit(data)
       
    return visualizer  

In [None]:
viz_before = vizualize_elbow(matrix_before)
viz_before.show()

In [None]:
number_of_clusters=11
km_before = KMeans(n_clusters=number_of_clusters, random_state=1)
model_before = km_before.fit(matrix_before)

In [None]:
#pickle.dump(model_before, open("model_before.pkl", "wb"))
#km_before = pickle.load(open("model_before.pkl", "rb"))

Let's have an overview of our clusters' centers (centroids) and labels. Then we get the top 20 terms for every cluster, that is, the terms with the highest TF-IDF weight in each cluster's centroid. Note that since we have applied stemming, we only see the "root" of each word.

In [None]:
centroids_before, labels_before = model_before.cluster_centers_, model_before.labels_
print(labels_before)

In [None]:
print("Top terms per cluster:")
order_centroids_before = centroids_before.argsort()[:, ::-1]
# get_feature_names_out() needs scikit-learn >= 1.0; older versions use get_feature_names()
terms = vec_before.get_feature_names_out()
for i in range(number_of_clusters):
    top_words = [terms[ind] for ind in order_centroids_before[i, :20]]
    print("Cluster {}: {}".format(i, ' '.join(top_words)))
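
The `argsort()[:, ::-1]` idiom used above can be checked on a tiny example. The terms and centroid weights below are made up.

```python
import numpy as np

# Hypothetical vocabulary (stemmed) and two hypothetical centroids
terms = ["airway", "child", "infect", "viru"]
centroids = np.array([
    [0.10, 0.40, 0.05, 0.30],
    [0.50, 0.00, 0.20, 0.10],
])

# argsort sorts column indices by ascending weight; [:, ::-1] reverses each row
order = centroids.argsort()[:, ::-1]

# Keep the two highest-weighted terms per centroid
top_terms = [[terms[i] for i in row[:2]] for row in order]
print(top_terms)
```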

PCA and its implementation in Python:
https://jakevdp.github.io/PythonDataScienceHandbook/05.09-principal-component-analysis.html

In [None]:
T_before = preprocessing.Normalizer().fit_transform(df_before)

# Fit PCA on the TF-IDF values and project them onto two components
pca_model = PCA(n_components=2, random_state = 2)
pca_model.fit(T_before) 
T_before = pca_model.transform(T_before)

#Transform the centroids
centroids_before_pca = pca_model.transform(centroids_before)
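
When reading the 2D plots, it helps to know how much of the original variance the two components keep. Here is a sketch on random data (not the TF-IDF matrix) that projects points and centroids with the same fitted PCA and prints the retained share of variance:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Random 10-dimensional data as a stand-in for the TF-IDF features
rng = np.random.default_rng(0)
X = rng.random((60, 10))

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Fit PCA once, then project both the points and the centroids with it
pca = PCA(n_components=2, random_state=0).fit(X)
points_2d = pca.transform(X)
centroids_2d = pca.transform(km.cluster_centers_)

print(points_2d.shape, centroids_2d.shape)
print(pca.explained_variance_ratio_.sum())
```

If the printed share is small, clusters that are separate in the full feature space may still overlap in the scatter plot.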

In [None]:
type(centroids_before_pca)

In [None]:
centroids_before_df = pd.DataFrame(centroids_before_pca, columns = ['dimension 1','dimension 2'])
centroids_before_df

In [None]:
centroids_before_df["pca_1"] = centroids_before_pca[:,0]
centroids_before_df["pca_2"] = centroids_before_pca[:,1]
centroids_before_df

In [None]:
asthma_before_covid['Labels'] = km_before.labels_
asthma_before_covid['pca_1'] = T_before[:, 0]
asthma_before_covid['pca_2'] = T_before[:, 1]

In [None]:
asthma_before_covid = asthma_before_covid.sort_values(by = "Labels", ascending = True)

In [None]:
asthma_before_covid['Labels'] = asthma_before_covid['Labels'].astype(str)

In [None]:
asthma_before_covid['Labels'].value_counts().sort_values(ascending=True)

Finally, let's plot the clusters!

In [None]:
fig_clusters_before = px.scatter(asthma_before_covid, 
                 x="pca_1", 
                 y="pca_2", 
                 color="Labels",
                 hover_data=['title'])

fig_clusters_before.show()

Having a look at the clusters, we assign labels to them so that we can easily get a sense of each cluster's topic!

In [None]:
centroids_before_titles = ["Immune response", 
                           "Molecular links",
                           "Asthma and COPD exacerbations",
                           "title4",
                           "title5",                           
                           "title6",
                           "title7",                           
                           "title8",
                           "title9",
                           "title10",
                           "title11"]

In [None]:
centroids_before_df["centroids_labels"] = centroids_before_titles
centroids_before_df

In [None]:
fig_clusters_before.add_scatter(y=centroids_before_df["pca_2"].tolist(),
                         x=centroids_before_df["pca_1"].tolist(),
                         mode="markers+text",
                         text=centroids_before_df["centroids_labels"],
                         marker=dict(size=10, color="white"),
                         name="Centroids")

In [None]:
fig_clusters_before.update_layout(
    title_text='Asthma and various coronaviruses',
    legend=dict(
        font=dict(
            size=15)))

Below, we also get the size of each cluster, that is, the number of papers it contains.

In [None]:
asthma_before_covid['Labels'].value_counts().sort_values(ascending=False)

### After COVID-19

The exact same process is followed for the group of papers published after the SARS-CoV-2 outbreak.
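
Since the same steps run twice, they could also be wrapped in one helper. The function below is a hypothetical sketch, not the code used in this notebook: `cluster_abstracts` is an invented name, and it relies on the default tokenizer instead of `custom_tokenizer` so that it runs on its own.

```python
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer

def cluster_abstracts(texts, n_clusters, max_features=1000, random_state=1):
    """Vectorize texts with TF-IDF, cluster with K-means, project to 2D with PCA."""
    vectorizer = TfidfVectorizer(max_features=max_features)
    dense = vectorizer.fit_transform(texts).toarray()
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=random_state).fit(dense)
    pca = PCA(n_components=2, random_state=random_state).fit(dense)
    # vocabulary_ maps each term to its column index; sort terms by that index
    terms = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
    return {
        "labels": km.labels_,
        "points_2d": pca.transform(dense),
        "centroids_2d": pca.transform(km.cluster_centers_),
        "terms": terms,
    }

# Made-up abstracts, for illustration only
toy_texts = [
    "rhinovirus infection triggers wheezing in children",
    "viral infection and wheezing in young children",
    "children with wheezing after rhinovirus infection",
    "inhaled corticosteroids reduce airway inflammation",
    "airway inflammation and corticosteroids treatment",
    "treatment of airway inflammation with inhaled corticosteroids",
]
result = cluster_abstracts(toy_texts, n_clusters=2)
print(result["labels"])
```

Calling the helper once per period would keep the "before" and "after" analyses in step whenever a preprocessing choice changes.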

In [None]:
texts_after = asthma_after_covid["abstract_lower"].tolist()

In [None]:
vec_after = TfidfVectorizer(tokenizer=custom_tokenizer,
                            stop_words='english',
                            max_features=1000)

matrix_after = vec_after.fit_transform(texts_after)
# get_feature_names_out() needs scikit-learn >= 1.0; older versions use get_feature_names()
df_after = pd.DataFrame(matrix_after.toarray(), columns=vec_after.get_feature_names_out())

In [None]:
df_after.head()

In [None]:
viz_after = vizualize_elbow(matrix_after)
viz_after.show()

In [None]:
number_of_clusters=12
km_after = KMeans(n_clusters=number_of_clusters, random_state=1)
model_after = km_after.fit(matrix_after)

In [None]:
#pickle.dump(model_after, open("model_after.pkl", "wb"))
#km_after = pickle.load(open("model_after.pkl", "rb"))

In [None]:
centroids_after, labels_after = model_after.cluster_centers_, model_after.labels_
print(centroids_after)

In [None]:
print("Top terms per cluster:")
order_centroids_after = km_after.cluster_centers_.argsort()[:, ::-1]
# get_feature_names_out() needs scikit-learn >= 1.0; older versions use get_feature_names()
terms = vec_after.get_feature_names_out()
for i in range(number_of_clusters):
    # Top 20 terms by centroid weight
    top_words = [terms[ind] for ind in order_centroids_after[i, :20]]
    print("Cluster {}: {}".format(i, ' '.join(top_words)))

In [None]:
order_centroids_after

In [None]:
T_after = preprocessing.Normalizer().fit_transform(df_after)

# Dimensionality reduction to 2 components
pca_model = PCA(n_components=2, random_state=2)
pca_model.fit(T_after) 
T_after = pca_model.transform(T_after) 

#Transform the cluster's centroids
centroids_after_pca = pca_model.transform(centroids_after)

In [None]:
asthma_after_covid['Labels'] = km_after.labels_
asthma_after_covid['pca_1'] = T_after[:, 0]
asthma_after_covid['pca_2'] = T_after[:, 1]

In [None]:
centroids_after_df = pd.DataFrame(centroids_after_pca, columns = ['dimension 1','dimension 2'])
centroids_after_df

In [None]:
centroids_after_df["pca_1"] = centroids_after_pca[:,0]
centroids_after_df["pca_2"] = centroids_after_pca[:,1]
centroids_after_df

In [None]:
asthma_after_covid = asthma_after_covid.sort_values(by = "Labels", ascending = True)

In [None]:
asthma_after_covid['Labels'] = asthma_after_covid['Labels'].astype(str)

In [None]:
fig_clusters_after = px.scatter(asthma_after_covid, 
                 x="pca_1", 
                 y="pca_2", 
                 color="Labels",
                 hover_data=['title'])

fig_clusters_after.show()

Labels are assigned to this group of clusters as well.

In [None]:
centroids_after_titles = ["title 1", "title2", "title3", "title4","title5", "title6", "title7","title8","title9","title10", "title 12", "title13"]


In [None]:
centroids_after_df["centroids_labels"] = centroids_after_titles
centroids_after_df

In [None]:
fig_clusters_after.add_scatter(y=centroids_after_df["pca_2"].tolist(),
                         x=centroids_after_df["pca_1"].tolist(),
                         mode="markers+text",
                         text=centroids_after_df["centroids_labels"],
                         marker=dict(size=10, color="white"),
                         name="Centroids")

In [None]:
fig_clusters_after.update_layout(
    title_text='Asthma and SARS-CoV-2',
    legend=dict(
        font=dict(
            size=15)))

In [None]:
asthma_after_covid['Labels'].value_counts().sort_values(ascending=False)