<b>

<p>
<center>
<font size="5">
COVID-19 Open Research Dataset Challenge (CORD-19)
</font>
</center>
</p>

<p>
<center>
<font size="4">
Authors: Chao Zhou, Ruijin Jia, Matteo Bucalossi
</font>
</center>
</p>

<p>
<center>
<font size="3">
Machine Learning I (DATS 6202), Spring 2020
</font>
</center>
</p>

<p>
<center>
<font size="2">
[GitHub Repository](https://github.com/matteobucalossi50/CORD-19-Challenge)
</font>
</center>
</p>

</b>

### Introduction

Last month, in response to the global COVID-19 pandemic, the White House and leading research institutions, including the Allen Institute for AI, released the COVID-19 Open Research Dataset (CORD-19). The dataset includes over 57,000 scholarly articles, most of them with full text, about COVID-19, SARS-CoV-2, and related coronaviruses. As the published literature keeps growing, the medical community needs AI tools to quickly extract insights and directions for fighting the disease.  
Kaggle has issued a [competition](https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge) to develop such tools and help the medical community with this high-priority scientific challenge. The public dataset is a highly readable and clean collection of materials, and it is being updated constantly as more publications are released and added to the corpus.  

Because the dataset requires minimal pre-processing, we decided to apply a Transformer model to extract sentence embeddings for task-specific machine learning "heads". Since the pre-trained sentence transformer from UKPLab is built on BERT, we decided to train and fine-tune the same transformer on SciBERT by Allen AI to obtain embeddings better suited to a scientific and medical corpus.  
We then used these embeddings for unsupervised learning: UMAP for dimensionality reduction, HDBSCAN for density-based clustering, K-Means for centroid-based clustering, and LDA for probabilistic topic modeling. We also integrated a semantic search algorithm based on cosine similarity that takes user questions (we recommend drawing ideas from the [Tasks](https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge/tasks)) and returns the top 5 most relevant articles from the corpus.

# Data Preprocessing

Luckily, the dataset had already been cleaned for the most part, and we only had to build a hashable corpus of text from the JSON files. Once we had the body text and the abstract of each article, along with its relevant metadata, in a dataframe, we cleaned the text columns of odd characters using regular expressions. Here we call the [preprocessing](https://github.com/matteobucalossi50/CORD-19-Challenge/blob/master/scripts/preprocessing.py) script to obtain a clean dataframe for our machine learning tasks.

In [0]:
# load the preprocessing script and its dependencies
import numpy as np
import pandas as pd
import os
import glob
import json
import scispacy
import spacy
import tqdm
import matplotlib.pyplot as plt

import preprocessing

In [0]:
# directories and paths
root_path = '/Users/Matteo/Desktop/ML1/project/data/'
metadata_path = f'{root_path}/all_sources_metadata_2020-03-13.csv'
metadata = pd.read_csv(metadata_path)
metadata.head()
metadata.info()

# collect every JSON article file under the data directory (searched recursively)
all_json = glob.glob(f'{root_path}/**/*.json', recursive=True)
print(len(all_json))

In [0]:
# read json files
first_row = preprocessing.FileReader(all_json[0])
print(first_row)

In [0]:
# build dataframe
df_covid = preprocessing.read_directory_files(all_json)
df_covid.head()

In [0]:
# clean abstract and body_text of odd characters
df_covid['abstract'] = df_covid['abstract'].apply(preprocessing.clean_text)
df_covid['body_text'] = df_covid['body_text'].apply(preprocessing.clean_text)

In [0]:
pd.DataFrame([[df_covid.shape[0], df_covid.shape[1]]], columns=['# rows', '# columns'])

In [0]:
# save dataframe
df_covid.to_pickle('/Users/Matteo/Desktop/ML1/project/data/preprocessed_dataframe.pkl')

# Train Sentence Transformer on sciBERT

### Transformers

In natural language processing, recent developments have seen attention mechanisms prevail over more traditional RNN models. These models use an architecture called the Transformer, which is much faster and easier to parallelize than recurrent networks. This is where these models revolutionize the field: an attention mechanism looks at an input sequence and decides at each step which other components of the sequence are important<sup>[1](https://medium.com/inside-machine-learning/what-is-a-transformer-d07dd1fbec04)</sup>, meaning it can replicate the way we actually process text, i.e. not only focusing on a single word but also considering what is around it to make sense of the language.   

Transformers are architectures that transform one sequence into another using an Encoder and a Decoder; yet they employ only attention mechanisms, without any recurrent networks (previously the go-to models for many NLP tasks). Here's an image<sup>[2](https://arxiv.org/abs/1706.03762)</sup> illustrating the Transformer architecture, with the Encoder on the left and the Decoder on the right.
![](images/Transformers_scheme.png)
We won't bore you with the details of the model and its mathematical aspects, but we can point out two main characteristics of Transformers:
 - the Multi-Head Attention layers model each word's relationship with every other word in the same sentence, so that more than one word is attended to when processing a sequence. These layers apply scaled dot-product attention, $\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(QK^\top / \sqrt{d_k}\right)V$, to every sentence (input or target, depending on encoder or decoder): ![](images/scaled%20dot-prod%20attention.png)
 - the positional encodings are dense vectors (extra embeddings) that represent the position of each word within the sentence and are added to each word's embedding.  
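
To make the attention equation concrete, here is a minimal NumPy sketch of scaled dot-product attention for a single head. The function names and the random toy matrices are our own illustrative assumptions; real Transformer implementations add learned projections, multiple heads, and masking.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def scaled_dot_product_attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d_k)) V for a single attention head.

    Q, K and V have shape (sequence_length, d_k); this toy version
    uses no masking and no learned projections.
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # similarity of every word with every other word
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V, weights


# Toy example: a "sentence" of 4 words with 8-dimensional vectors (hypothetical data)
rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 8))

output, weights = scaled_dot_product_attention(Q, K, V)
print(output.shape)          # one output vector per word
print(weights.sum(axis=1))   # attention weights of each word sum to 1
```

Each row of `weights` tells us how much a given word attends to every other word in the sentence, which is the "relationship with every other word" described above.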



### BERT & SciBERT

A 2018 paper<sup>[3](https://arxiv.org/abs/1810.04805)</sup> published by Google researchers introduced BERT (*Bidirectional Encoder Representations from Transformers*), a now state-of-the-art application of the Transformer architecture to self-supervised pre-training on large corpora. BERT is a method of pre-training language representations, so that we can train a general-purpose model on an immense corpus and then use that model for downstream NLP tasks. Its bidirectional nature allows it to represent each word within its context (i.e. the other words in the sentence, both to the left and to the right of the represented word).

While the original BERT trained a large (12- to 24-layer) Transformer on a large corpus drawn from Wikipedia and BookCorpus, more recent BERT-like models have trained the same architecture on domain-specific corpora for domain-relevant tasks. For instance, in 2019 Allen AI released SciBERT,<sup>[4](https://arxiv.org/abs/1903.10676)</sup> a BERT model trained on a huge corpus of scientific papers (82% of them from the biomedical domain) from [semanticscholar.org](https://www.semanticscholar.org/), which significantly improves on BERT's performance in downstream NLP tasks specific to scientific problems. We believe that using SciBERT for this project will yield much better results than the generic original BERT model, given the specificity of our data.

### Sentence-BERT

The BERT-like models described above have set a new bar for sentence-pair regression tasks, but unfortunately they need both sentences to be fed into the Transformer together, causing a computational overhead that makes unsupervised tasks such as clustering virtually impossible.  
Thus, in 2019 researchers at UKPLab released SBERT,<sup>[5](https://arxiv.org/abs/1908.10084)</sup> a modification of pre-trained BERT that derives semantic sentence embeddings which are easy to compare for similarity and clustering. SBERT fine-tunes BERT-like models with a siamese or triplet network structure to obtain such embeddings. The paper proposes a model that maintains BERT-level accuracy while reducing the time needed to find the most similar pair in a collection of 10,000 sentences from about 65 hours with BERT to about 5 seconds, which makes it suitable for similarity comparison, semantic search and clustering (exactly what we are trying to do here!).  
Here's an illustration of the SBERT architecture as described in the paper (the structure would not change if the objective function were different, as it may be for different tasks): ![](images/SBERT%20architecture.png) 
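
As a rough illustration of the siamese idea, the sketch below pools token vectors into one fixed-size sentence embedding (mean pooling, one of the pooling strategies described in the SBERT paper) and compares two sentences by cosine similarity. The token vectors are random stand-ins for Transformer outputs and the function names are hypothetical; this is not the SBERT library's code.

```python
import numpy as np


def mean_pool(token_embeddings, attention_mask):
    """Average the token vectors of one sentence, ignoring padding positions.

    token_embeddings: array of shape (n_tokens, dim)
    attention_mask:   array of shape (n_tokens,), 1 for real tokens, 0 for padding
    """
    mask = np.asarray(attention_mask, dtype=float)[:, None]
    summed = (token_embeddings * mask).sum(axis=0)
    n_real_tokens = max(mask.sum(), 1.0)   # avoid division by zero
    return summed / n_real_tokens


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


# Random stand-ins for the token outputs of the *same* encoder applied to two sentences
rng = np.random.default_rng(0)
tokens_a = rng.normal(size=(6, 16))
mask_a = np.array([1, 1, 1, 1, 0, 0])      # last two positions are padding
tokens_b = rng.normal(size=(6, 16))
mask_b = np.array([1, 1, 1, 1, 1, 1])

emb_a = mean_pool(tokens_a, mask_a)
emb_b = mean_pool(tokens_b, mask_b)
similarity = cosine_similarity(emb_a, emb_b)
print(emb_a.shape, similarity)
```

Because each sentence is encoded independently, the embeddings can be computed once and reused for every comparison, which is where the speed-up over feeding sentence pairs through BERT comes from.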

# Sentence embeddings for analysis

Now, we intend to perform clustering as well as semantic search on the CORD-19 dataset. To do so, we need SBERT sentence embeddings so that we can accomplish these tasks in reasonable time. We decided that embeddings pre-trained on BERT would not provide the state-of-the-art outcome we were looking for, so we fine-tuned our sentence embedding model on SciBERT to get science-specific sentence embeddings.  
The SBERT library provides the code to tune any BERT-like model (in our case, SciBERT) on Natural Language Inference (NLI) data. The following [script](https://github.com/matteobucalossi50/CORD-19-Challenge/blob/master/scripts/training_scibert.py) trained SciBERT on the NLI dataset using a softmax classifier as the training loss and the STS (Semantic Textual Similarity) dataset as the evaluation benchmark, and provided us with a fine-tuned sentence embedder to derive embeddings for our clustering and search tasks on the CORD-19 dataset.

```python
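import math

from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, SentencesDataset, losses, models
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator

# nli_reader, sts_reader, batch_size, train_num_labels and model_save_path
# are assumed to be defined earlier, as in the linked training script

# select one Transformer
model_name = 'allenai/scibert_scivocab_uncased'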

# Use sciBERT model for mapping tokens to embeddings
word_embedding_model = models.BERT(model_name)

# Apply mean pooling to get one fixed sized sentence vector
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension(),
                               pooling_mode_mean_tokens=True,
                               pooling_mode_cls_token=False,
                               pooling_mode_max_tokens=False)

model = SentenceTransformer(modules=[word_embedding_model, pooling_model])

# Convert the dataset to a DataLoader ready for training
train_data = SentencesDataset(nli_reader.get_examples('train.gz'), model=model)
train_dataloader = DataLoader(train_data, shuffle=True, batch_size=batch_size)
train_loss = losses.SoftmaxLoss(model=model, sentence_embedding_dimension=model.get_sentence_embedding_dimension(), num_labels=train_num_labels)

dev_data = SentencesDataset(examples=sts_reader.get_examples('sts-dev.csv'), model=model)
dev_dataloader = DataLoader(dev_data, shuffle=False, batch_size=batch_size)
evaluator = EmbeddingSimilarityEvaluator(dev_dataloader)

num_epochs = 2
# len(train_dataloader) is already the number of batches per epoch,
# so this uses 10% of all training steps for warm-up
warmup_steps = math.ceil(len(train_dataloader) * num_epochs * 0.1)
# Train the model
model.fit(train_objectives=[(train_dataloader, train_loss)],
          evaluator=evaluator,
          epochs=num_epochs,
          evaluation_steps=1000,
          warmup_steps=warmup_steps,
          output_path=model_save_path
          )
```

Training the sentence transformer ultimately took around 5 hours on a Kaggle notebook with a GPU.

At this point, we can simply use our fine-tuned model (perfect for our biomedical dataset) to encode our corpus. This script will provide the embeddings we need, both for the abstract and separately for the full text of each article. We will then access these embeddings in the dataframe for downstream tasks below.

In [0]:
# import the embeddings helper script
import pandas as pd
from sentence_transformers import SentenceTransformer
import embeddings

# load the SciBERT sentence transformer we fine-tuned and saved above
model = SentenceTransformer('training_nli_allenai-scibert_scivocab_uncased-2020-04-26_13-22-06')

# import dataframe
df_covid = pd.read_pickle('/Users/Matteo/Desktop/ML1/project/data/preprocessed_dataframe.pkl')


In [0]:
# add abstract embeddings to dataframe
df_covid['abs_embeddings'] = embeddings.sent_embeddings(df_covid['abstract'], model)

# save dataframe
df_covid.to_pickle('./data/preprocessed_dataframe.pkl')

df_covid.head()

In [0]:
# add full text embeddings to dataframe
df_covid['body_embeddings'] = embeddings.sent_embeddings(df_covid['body_text'], model)

# save dataframe
df_covid.to_pickle('./data/preprocessed_dataframe.pkl')

df_covid.head()

# Clustering 

Now we reduce the dimensionality of the embeddings and cluster them. UMAP reduces the embeddings to two dimensions, retaining meaningful structure and removing noise to make clustering easier. HDBSCAN and K-Means are then applied to group and label the articles. Finally, we model the topics within each cluster using LDA.

## UMAP

Uniform Manifold Approximation and Projection (UMAP)<sup>[6](https://arxiv.org/abs/1802.03426)</sup> is an algorithm for dimension reduction based on manifold learning techniques and ideas from topological data analysis. It provides a very general framework for approaching manifold learning and dimension reduction, but can also provide specific concrete realizations and can preserve more of the global structure with superior run time performance.

In [0]:
!pip install umap-learn

In [0]:
!pip install matplotlib

In [0]:
import umap.umap_ as umap
import matplotlib.pyplot as plt

In [0]:
# open the pickle file that holds the embeddings
df=open('/content/drive/My Drive/Colab Notebooks/final/preprocessed_dataframe_withabs.pkl','rb')

In [0]:
import pickle
import pandas as pd
import numpy as np

In [0]:
data3=pickle.load(df)

In [0]:
reducer = umap.UMAP(n_neighbors = 5)

In [0]:
# stack the abstract embeddings into one 2-D array (one row per embedding)
numpy_array = np.vstack(data3['abs_embeddings'].tolist())

In [0]:
# stack the body text embeddings into one 2-D array (one row per embedding)
numpy_array2 = np.vstack(data3['text_embeddings'].tolist())

### Visualization

##### Abstract

In [0]:
# reduce to two dimensions and plot the result
clusterable_embedding = reducer.fit_transform(numpy_array)
plt.figure(figsize=(12,8))
plt.scatter(clusterable_embedding[:,0],clusterable_embedding[:,1])
print(clusterable_embedding.shape)
print(clusterable_embedding)

##### Body Text

In [0]:
# reduce the body text embeddings to two dimensions and plot the result
clusterable_embedding2 = reducer.fit_transform(numpy_array2)
plt.figure(figsize=(12,8))
plt.scatter(clusterable_embedding2[:,0],clusterable_embedding2[:,1])
print(clusterable_embedding2.shape)
print(clusterable_embedding2)

## HDBSCAN

We separate the literature using the embeddings in two ways; the first is HDBSCAN. HDBSCAN is a clustering algorithm developed by Campello, Moulavi, and Sander.<sup>[7](https://link.springer.com/chapter/10.1007%2F978-3-642-37456-2_14)</sup> It extends DBSCAN (a classic density-based spatial clustering algorithm) by converting it into a hierarchical clustering algorithm and then extracting a flat clustering based on the stability of clusters. 

In [0]:
!pip install hdbscan

In [0]:
import hdbscan
import numpy as np
import seaborn as sns
import pandas as pd

In [0]:
#Abstract
clusterer = hdbscan.HDBSCAN(min_cluster_size=2, gen_min_span_tree=True)
clusterer=clusterer.fit(clusterable_embedding)

In [0]:
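#Body Text
# use a separate HDBSCAN instance so the abstract clustering is not overwritten
clusterer2 = hdbscan.HDBSCAN(min_cluster_size=2, gen_min_span_tree=True)
clusterer2 = clusterer2.fit(clusterable_embedding2)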

### visualization

#### Abstract

In [0]:
#Build the minimum spanning tree
clusterer.minimum_spanning_tree_.plot(edge_cmap='viridis',
                                      edge_alpha=0.6,
                                      node_size=80,
                                      edge_linewidth=2)

In [0]:
#Build the cluster hierarchy
clusterer.single_linkage_tree_.plot(cmap='viridis', colorbar=True)

In [0]:
#Condense the cluster tree
clusterer.condensed_tree_.plot()

In [0]:
#Extract the clusters
clusterer.condensed_tree_.plot(select_clusters=True, selection_palette=sns.color_palette())

In [0]:
#clusterer = hdbscan.HDBSCAN(min_cluster_size=10, prediction_data=True).fit(clusterable_embedding)
# one color per cluster; noise points (label -1) are drawn in gray
n_clusters = clusterer.labels_.max() + 1
color_palette = sns.color_palette('Paired', n_clusters)
cluster_colors = [color_palette[x] if x >= 0
                  else (0.5, 0.5, 0.5)
                  for x in clusterer.labels_]
# fade each point according to its cluster membership probability
cluster_member_colors = [sns.desaturate(x, p) for x, p in
                        zip(cluster_colors, clusterer.probabilities_)]
plt.scatter(*clusterable_embedding.T, s=50, linewidth=0, c=cluster_member_colors, alpha=0.25)

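We obtained more than 1,000 clusters in total, which is too many to interpret. This is likely because `min_cluster_size=2` allows very small clusters, so HDBSCAN with these settings may not be the best way to cluster our data. We try K-Means next.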
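
A cluster count like the one quoted above can be read off the label array that HDBSCAN produces, where noise points carry the label -1. The helper below is a hypothetical sketch (its name and the toy labels are ours, not part of the notebook) that summarizes such an array.

```python
import numpy as np


def summarize_labels(labels):
    """Summarize a density-based clustering from its label array.

    Points labelled -1 are treated as noise, as HDBSCAN does.
    """
    labels = np.asarray(labels)
    cluster_ids, sizes = np.unique(labels[labels >= 0], return_counts=True)
    return {
        "n_clusters": int(cluster_ids.size),
        "n_noise": int((labels == -1).sum()),
        "median_size": float(np.median(sizes)) if sizes.size else 0.0,
    }


# Toy label array standing in for clusterer.labels_ (hypothetical values)
toy_labels = [0, 0, 1, 1, 1, -1, 2, 2, -1]
summary = summarize_labels(toy_labels)
print(summary)
```

In the notebook this would be called as `summarize_labels(clusterer.labels_)`; a very small median cluster size would support the suspicion that the clustering is too fragmented.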

#### Body Text

In [0]:
#Build the minimum spanning tree
clusterer2.minimum_spanning_tree_.plot(edge_cmap='viridis',
                                      edge_alpha=0.6,
                                      node_size=80,
                                      edge_linewidth=2)

In [0]:
#Build the cluster hierarchy
clusterer2.single_linkage_tree_.plot(cmap='viridis', colorbar=True)

In [0]:
#Condense the cluster tree
clusterer2.condensed_tree_.plot()

In [0]:
#Extract the clusters
clusterer2.condensed_tree_.plot(select_clusters=True, selection_palette=sns.color_palette())

In [0]:
#clusterer = hdbscan.HDBSCAN(min_cluster_size=10, prediction_data=True).fit(clusterable_embedding)
# one color per cluster; noise points (label -1) are drawn in gray
n_clusters2 = clusterer2.labels_.max() + 1
color_palette2 = sns.color_palette('Paired', n_clusters2)
cluster_colors2 = [color_palette2[x] if x >= 0
                   else (0.5, 0.5, 0.5)
                   for x in clusterer2.labels_]
# fade each point according to its cluster membership probability
cluster_member_colors2 = [sns.desaturate(x, p) for x, p in
                        zip(cluster_colors2, clusterer2.probabilities_)]
plt.scatter(*clusterable_embedding2.T, s=50, linewidth=0, c=cluster_member_colors2, alpha=0.25)

## K-means

Next, we look at how K-Means clustering behaves and how it differs from HDBSCAN. The first step is finding the best value of k. Here, distortion is the average Euclidean distance from each point to its assigned cluster center. When distortion is plotted against k, there is a value of k beyond which further decreases in distortion are minimal (the "elbow"). This is the desired number of clusters.

In [0]:
from sklearn.cluster import KMeans

#### Abstract

In [0]:
from scipy.spatial.distance import cdist
distortions = []
K = range(2, 40)
for k in K:
    k_means = KMeans(n_clusters=k, random_state=42).fit(clusterable_embedding)
    # distance from each point to its nearest cluster center, averaged over the articles
    nearest = np.min(cdist(clusterable_embedding, k_means.cluster_centers_, 'euclidean'), axis=1)
    distortions.append(nearest.sum() / len(data3))

In [0]:
X_line = [K[0], K[-1]]
Y_line = [distortions[0], distortions[-1]]

# Plot the elbow
plt.plot(K, distortions, 'b-')
plt.plot(X_line, Y_line, 'r')
plt.xlabel('k')
plt.ylabel('Distortion')
plt.title('The Elbow Method showing the optimal k')
plt.show()

The elbow falls at roughly k = 9 to 12, so we chose k = 10.
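
The elbow can also be estimated numerically instead of by eye. One common heuristic, sketched below, picks the k whose point on the curve lies farthest from the straight line joining the first and last points (the red line in the plot). The function name and the toy distortion values are hypothetical, and the result should be treated as a starting point rather than a definitive answer.

```python
import numpy as np


def elbow_k(ks, distortions):
    """Return the k whose (k, distortion) point is farthest from the chord
    joining the first and last points of the curve.

    Both axes are rescaled to [0, 1] so that neither dominates the distance.
    Assumes the distortions are not all identical.
    """
    ks = np.asarray(ks, dtype=float)
    d = np.asarray(distortions, dtype=float)
    x = (ks - ks[0]) / (ks[-1] - ks[0])
    y = (d - d.min()) / (d.max() - d.min())

    # Perpendicular distance of every point to the line through the end points
    x0, y0, x1, y1 = x[0], y[0], x[-1], y[-1]
    numerator = np.abs((y1 - y0) * x - (x1 - x0) * y + x1 * y0 - y1 * x0)
    distance = numerator / np.hypot(y1 - y0, x1 - x0)
    return int(ks[np.argmax(distance)])


# Toy distortion curve (hypothetical values) that flattens out after a few clusters
toy_ks = list(range(2, 12))
toy_distortions = [10.0, 6.0, 4.0, 3.0, 2.8, 2.6, 2.5, 2.4, 2.3, 2.2]
best_k = elbow_k(toy_ks, toy_distortions)
print(best_k)
```

In the notebook this would be called as `elbow_k(list(K), distortions)`.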

In [0]:
kmeans = KMeans(n_clusters=10, random_state=0).fit_predict(clusterable_embedding)
kmeans

In [0]:
kmeans_clustering_model=KMeans(n_clusters=10, random_state=0)
kmeans_clustering_model.fit(clusterable_embedding)
kmeans_label=kmeans_clustering_model.labels_

In [0]:
data3['cluster_']=kmeans_label

Now we can compare the two ways of clustering. K-Means usually works well when clusters are roughly "round" or spherical, densest at the center, and not contaminated by noise or outliers. Our embeddings do not form strongly arbitrary shapes and do not contain too much noise, so K-Means works well here.

### visualization

In [0]:
plt.scatter(clusterable_embedding[:,0],clusterable_embedding[:,1], c=kmeans, cmap='rainbow')

#### Body Text

In [0]:
from scipy.spatial.distance import cdist
distortions2 = []
K2 = range(2, 40)
for k in K2:
    k_means2 = KMeans(n_clusters=k, random_state=42).fit(clusterable_embedding2)
    # distance from each point to its nearest cluster center, averaged over the articles
    nearest2 = np.min(cdist(clusterable_embedding2, k_means2.cluster_centers_, 'euclidean'), axis=1)
    distortions2.append(nearest2.sum() / len(data3))

In [0]:
X_line2 = [K2[0], K2[-1]]
Y_line2 = [distortions2[0], distortions2[-1]]

# Plot the elbow
plt.plot(K2, distortions2, 'b-')
plt.plot(X_line2, Y_line2, 'r')
plt.xlabel('k')
plt.ylabel('Distortion')
plt.title('The Elbow Method showing the optimal k')
plt.show()

In [0]:
kmeans2 = KMeans(n_clusters=10, random_state=0).fit_predict(clusterable_embedding2)
kmeans2

In [0]:
kmeans_clustering_model2=KMeans(n_clusters=10, random_state=0)
kmeans_clustering_model2.fit(clusterable_embedding2)
kmeans_label2=kmeans_clustering_model2.labels_

In [0]:
data3['textcluster_']=kmeans_label2

## LDA

In this part, we divide the clusters by topic using Latent Dirichlet Allocation (LDA).<sup>[9](http://www.jmlr.org/papers/volume3/blei03a/blei03a.pdf)</sup> LDA is a generative statistical model that uses unobserved groups to explain why some parts of a corpus are similar. The model posits that each document is a mixture of a small number of topics, where each topic is described by a distribution over words and each word's presence in the document can be attributed to one of these topics.   
Let's see now what topics our model identifies:

In [0]:
## call lda

### visualization

# Semantic search

### Cosine similarity
A powerful use of the embeddings obtained with our fine-tuned Sentence Transformer model is to query the corpus and find the embeddings most similar to the query's embedding. This can be done simply by calculating the cosine similarity between the query embedding and every article embedding, and then selecting the articles with the highest similarity as the most relevant to the query.  
Here we propose a semantic search system where the user can input a question about COVID-19 and get the top 5 most relevant articles from the corpus. We suggest querying the system with questions taken from Kaggle's suggested [tasks](https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge/tasks).
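
The ranking step can be sketched in a few lines of NumPy. The internals of `searches.sem_search` are not shown in this notebook, so the function below is a hypothetical stand-in that only illustrates cosine-similarity ranking on toy vectors.

```python
import numpy as np


def top_k_cosine(query_embedding, corpus_embeddings, k=5):
    """Return the indices and cosine similarities of the k corpus vectors
    most similar to the query, in descending order of similarity."""
    query = np.asarray(query_embedding, dtype=float)
    corpus = np.asarray(corpus_embeddings, dtype=float)

    # Normalize to unit length so that a dot product equals the cosine similarity
    query = query / np.linalg.norm(query)
    corpus = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)

    similarities = corpus @ query
    top_idx = np.argsort(-similarities)[:k]
    return top_idx, similarities[top_idx]


# Toy corpus of four 3-dimensional "article embeddings" (hypothetical values)
corpus = np.array([
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])
query = np.array([1.0, 0.05, 0.0])

indices, scores = top_k_cosine(query, corpus, k=2)
print(indices, scores)
```

With real data, `query` would be the embedding of the user's question produced by the same sentence transformer that embedded the corpus, and the returned indices would be used to look up the matching rows of the dataframe.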

In [0]:
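## run the search script and print the results table
import pandas as pd
from sentence_transformers import SentenceTransformer
import searches

# load the sentence transformer used to embed the corpus, so that queries and
# articles share one embedding space (alternative: 'bert-large-nli-mean-tokens')
embedder = SentenceTransformer('training_nli_allenai-scibert_scivocab_uncased-2020-04-26_13-22-06')

# load corpus
df_covid = pd.read_pickle('./data/preprocessed_dataframe.pkl')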


In [0]:
# asking the user
query = input('What would you like to know from CORD-19? ')
print('\nUse abstracts:')
searches.sem_search(query, embedder, df_covid, df_covid['abs_embeddings'])


In [0]:
# asking the user (slower)
print('\nUse full text:')
searches.sem_search(query, embedder, df_covid, df_covid['body_embeddings'])


# Conclusion

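In this notebook we fine-tuned a SciBERT-based Sentence Transformer on NLI data, used it to embed the abstracts and full texts of the CORD-19 corpus, and reduced the embeddings with UMAP. HDBSCAN produced more than 1,000 clusters, so we relied on K-Means with k = 10 to group the articles, and we used LDA to describe the topics within the clusters. Finally, we built a cosine-similarity semantic search that returns the top 5 most relevant articles for a user's question.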