### Intro to CORD-19 dataset - NLP and Unsupervised Learning

In [None]:
import os
import json
import pandas as pd
import numpy as np
import string
import matplotlib.pyplot as plt
from scipy.spatial.distance import cdist

# NLP Text processing libraries
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

# Tokenizers
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Unsupervised learning
from sklearn.decomposition import NMF, PCA
from sklearn.manifold import TSNE
from sklearn.cluster import KMeans

%matplotlib inline
import warnings
warnings.filterwarnings("ignore")

In [None]:
# note - the first time you run this notebook, you need to install data from some of the NLP packages
# nltk.download()

There are a huge number of NLP packages that you can use - here we will just use NLTK, as it is relatively simple to understand and use, but some other common packages (which you should totally check out) include:
- spaCy
- Gensim
- SparkNLP

#### Config parameters

In [None]:
datadir = "data/CORD-19-research-challenge/"

### Load the metadata - find a suitable subset of files

This is going to be vital for loading appropriate data, and understanding what we have

In [None]:
# load the metadata csv
metadata = pd.read_csv(os.path.join(datadir, "metadata.csv"))
metadata.head()

In [None]:
# Let's see all of the possible sources of the papers
metadata["source_x"].unique()

In [None]:
# Let's get a list of all the papers that are in the BioRxiv
bioarxiv_locations = metadata["source_x"] == "BioRxiv"
bioarxiv_df = metadata.loc[bioarxiv_locations]
bioarxiv_papers = bioarxiv_df["pdf_json_files"].dropna().tolist()

In [None]:
# How many papers do we have?
print(f"There are {len(bioarxiv_papers)} papers in our sample")

#### Load the papers into a dataframe so we can get started

Each of the papers exists as a separate JSON file, so we now want to load those into memory. Note, many of these will be very large - the whole dataset is larger than pandas can efficiently take in a single data frame.

In [None]:
# Extract the papers into a list of JSON objects, and then load into a dataframe
json_extracts = []
for paper in bioarxiv_papers:
    with open(os.path.join(datadir, paper), "rb") as f:  # close each file once it has been read
        json_extracts.append(json.load(f))
df = pd.DataFrame.from_records(json_extracts)

In [None]:
# Let's take a look at one of the abstracts
df.iloc[0]["abstract"][0]

In [None]:
# Note that some of the abstracts are empty - let's get rid of those
df = df[df["abstract"].map(lambda x: len(x)) > 0]
df.reset_index(drop=True, inplace=True)  # drop=True avoids keeping the old index as an extra column
print(f"New number of papers = {df.shape[0]}")

In [None]:
# Now let's extract the abstracts
df["abstract"] = df["abstract"].apply(lambda x: x[0]["text"])

In [None]:
# Extract only the abstracts to move forward with
df_nlp = df[["paper_id", "abstract"]].copy()  # copy so that later edits do not touch df

In [None]:
df_nlp

### Preprocessing part 1: Perform some simple text filtering

We want to remove elements of the text that are perhaps useful for idiomatic language and human understanding, but are not necessarily needed in order to extract useful information from the documents.

This includes things like:
- Punctuation
- Capitalization
- Stop words

In English, stop words include common linking words and articles such as:
- a
- an
- the
- but

Most NLP packages contain a stopwords list. Here we will use the one from the Natural Language Toolkit (NLTK).

Note also - you can add to or create your own custom list of stopwords - for example you may wish to exclude the word "covid" from some analyses
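
A custom list can be as simple as a set union. The sketch below is only illustrative: `base_stop_words` is a tiny hand-written stand-in for `stopwords.words("english")`, and `custom_stop_words` and `remove_custom_stopwords` are hypothetical names chosen for this example.

```python
# A tiny stand-in for the NLTK English stop word list (assumption: the real
# list is much longer and would come from stopwords.words("english")).
base_stop_words = {"a", "an", "the", "but", "of", "in", "is"}

# Hypothetical domain-specific additions for a COVID-19 corpus
custom_stop_words = {"covid", "covid19"}

all_stop_words = base_stop_words | custom_stop_words


def remove_custom_stopwords(text, stop_words=all_stop_words):
    """Drop every whitespace-separated token that is in stop_words."""
    return " ".join(word for word in text.split() if word not in stop_words)


example = remove_custom_stopwords("the spread of covid in a hospital")
print(example)
```

Keeping the custom words in their own set makes it easy to switch them on or off for different analyses.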

In [None]:
# Let's remove the punctuation from the corpus
def remove_punctuation(text):
    for punctuation in string.punctuation:
        text = text.replace(punctuation, "")
    return text

df_nlp["abstract"] = df_nlp["abstract"].apply(remove_punctuation)

In [None]:
# Remove all capitalizations in our corpus
def make_lower_case(text):
    lower_words_list = [word.lower() for word in text.split(" ")] # note - returns a list of words
    return " ".join(lower_words_list)
    
df_nlp["abstract"] = df_nlp["abstract"].apply(make_lower_case)

In [None]:
# Let's have a look at the NLTK stopwords
print(stopwords.words('english'))

In [None]:
# Let's remove any of these stopwords from our corpus
def remove_stopwords(text):
    stop_words = set(stopwords.words("english"))
    filtered_text = [word for word in text.split(" ") if word not in stop_words] # note - returns a list of words
    return " ".join(filtered_text)
    
df_nlp["abstract"] = df_nlp["abstract"].apply(remove_stopwords)

In [None]:
df_nlp

### Preprocessing part 2 - stemming and lemmatizing

We often want to reduce the space of all possible words further by combining semantically similar information together. These could include things like:
- Combining singular and plural words together
- Combining synonyms together

#### Stemming

Stemming is the idea of removing certain common suffixes of words - for example removing the s at the end of words, with the hope of making them the same. Note - this can be too aggressive in some contexts

Various stemming schemes exist, with varying rules and levels of aggression about what parts of words they will remove. Some examples within NLTK are
- Snowball stemmer
- Porter stemmer

To give an example the sentence <strong>Programers program with programing languages</strong> becomes <strong>program program with program languag</strong> when passed through a porter stemmer
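
To get a feel for how the choice of rules changes the result, here is a toy suffix stripper. It is not the Porter algorithm, just a minimal sketch, and the two rule lists (`gentle_rules`, `aggressive_rules`) are hypothetical.

```python
def strip_suffixes(word, suffixes, min_stem=3):
    """Remove the first matching suffix, keeping at least min_stem characters."""
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word


# Hypothetical rule sets: longer suffixes are listed first so they match first
gentle_rules = ["s"]
aggressive_rules = ["ions", "ion", "ing", "ed", "es", "s"]

words = ["infections", "infected", "infecting", "virus", "news"]

gentle = [strip_suffixes(w, gentle_rules) for w in words]
aggressive = [strip_suffixes(w, aggressive_rules) for w in words]

print(gentle)
print(aggressive)
```

The aggressive rules merge the three "infect" variants into one token, but both rule sets also mangle "virus" and "news", which is the kind of over-stemming to watch out for.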

In [None]:
def stemmer(text):
    """
    Perform stemming on a single element of the corpus
    """
    porter = PorterStemmer()  # Define the PorterStemmer
    filtered_text = [porter.stem(word) for word in text.split(" ")] # note - returns a list of words
    return " ".join(filtered_text)

df_nlp["abstract"] = df_nlp["abstract"].apply(stemmer)

In [None]:
df_nlp

#### Lemmatizing

The idea of lemmatizing is to group together the inflected forms of a word under its dictionary form (its lemma), for example mapping "mice" to "mouse". It can be used with, or instead of, stemming.

While full morphological analysis could make something more human readable, it doesn't guarantee any practical benefit over stemming for information retrieval purposes.

There exist a number of tools for performing lemmatization. One of the most famous is the WordNet lemmatizer, built on the WordNet lexical database, which we will use here

In [None]:
def lemmatizer(text):
    """
    Perform lemmatizing on a single element of the corpus
    """
    wordnet = WordNetLemmatizer()  # Define the WordNetLemmatizer
    filtered_text = [wordnet.lemmatize(word) for word in text.split(" ")] # note - returns a list of words
    return " ".join(filtered_text)

df_nlp["abstract"] = df_nlp["abstract"].apply(lemmatizer)

In [None]:
df_nlp

### Preprocessing part 3: Tokenizing

So now our data is reasonably constructed, we want to begin to prepare the data for input to a machine learning model

Recall - ML models are mathematical objects, so we need to create a mathematical model of our text data. There are many possible schemes, often dependent upon the technology that you intend to use.

For this example, we will show a few possible schemes - the bag of words representation and the TF-IDF representation

#### Bag of words / Count Vectorizer

Here we convert the whole corpus to a matrix where each column is a word, each row is a document, and the value in each column is how many times the word appears in the document.

For example, the phrase "the cat sat on the mat" becomes:

| the | cat | sat | on | mat | dog | ran |
|-----|-----|-----|----|-----|-----|-----|
|  2  |  1  |  1  |  1 |  1  |  0  |  0  |

Every distinct word that appears in the training set will have a unique column. This is called a <strong>vocabulary</strong>

Note - only a small subset of words are likely to appear in a single document - most elements in a single row are likely to be zero. Thus, this matrix is <strong>sparse</strong> 
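
The table above can be reproduced by hand in a few lines. This is a minimal sketch on a made-up two-document corpus; the second document ("the dog ran") is an assumption added so that the `dog` and `ran` columns exist.

```python
from collections import Counter

documents = ["the cat sat on the mat", "the dog ran"]

# The vocabulary is every distinct word in the corpus, in first-seen order
vocabulary = []
for doc in documents:
    for word in doc.split():
        if word not in vocabulary:
            vocabulary.append(word)

# One row per document, one column per vocabulary word
count_rows = []
for doc in documents:
    counts = Counter(doc.split())
    count_rows.append([counts[word] for word in vocabulary])

print(vocabulary)
for row in count_rows:
    print(row)

# Fraction of zero entries, a rough measure of sparsity
n_zero = sum(value == 0 for row in count_rows for value in row)
sparsity = n_zero / (len(count_rows) * len(vocabulary))
print(f"sparsity = {sparsity:.2f}")
```

Note that `CountVectorizer` orders its vocabulary alphabetically, so its column order will differ from the first-seen order used here.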

In [None]:
count_vect = CountVectorizer()
count_matrix = count_vect.fit_transform(df_nlp["abstract"])
print(count_matrix.shape)

In [None]:
# This is a sparse matrix, so cannot trivially be printed out
print(count_matrix[0,1:2000])

In [None]:
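# But we can print out the words if we want to:
# (get_feature_names_out() replaces get_feature_names(), which newer scikit-learn releases no longer provide)
word_list = count_vect.get_feature_names_out()
test_word_counts = count_matrix[0, 1:2000]
for n in test_word_counts.indices:
    # the slice starts at column 1, so shift by one to index the full vocabulary
    print(f"the word '{word_list[n + 1]}' appears {test_word_counts[0, n]} times")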

#### Term-Frequency Inverse-Document-Frequency (TFIDF) Vectorizer

This technique is super useful for information retrieval and search, and is regularly used as a preprocessing step across a range of NLP modelling approaches.

The idea behind it is that certain words are more important in telling documents apart than others. For example, the word "football" commonly occurs in sports articles, but will be rare in other kinds of article. TFIDF tries to capitalize on this by creating weights that are high when a word appears a lot, but only in a few documents, and low when a word is infrequent or appears in a lot of documents

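A simplified version of the score for each word is:

<strong>number of times word appears in the document</strong> / <strong>number of documents the word appears in at least once</strong>

So if the word "cat" appears twice in a document, and is present at least once in 10 documents, the simplified TFIDF value would be 0.2. In practice, libraries usually log-scale the document frequency and normalize each document's scores, so the values that TfidfVectorizer produces will differ from this simple ratio.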
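
scikit-learn's `TfidfVectorizer` uses a log-scaled variant of this ratio. With its default settings the weight of a word is tf × (ln((1 + N) / (1 + df)) + 1), where tf is the count in the document, df is the number of documents containing the word and N is the total number of documents; each document row is then scaled to unit Euclidean length. Below is a minimal pure-Python sketch of that calculation on a made-up three-document corpus (`tfidf_row` is a hypothetical helper, not a library function).

```python
import math
from collections import Counter

# Made-up toy corpus, already lower-cased and stripped of punctuation
docs = ["cat sat mat", "cat ran", "dog ran fast"]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents does each word appear?
doc_freq = Counter(word for tokens in tokenized for word in set(tokens))


def tfidf_row(tokens):
    """TF-IDF weights for one document, L2-normalized."""
    term_freq = Counter(tokens)
    weights = {
        word: count * (math.log((1 + n_docs) / (1 + doc_freq[word])) + 1)
        for word, count in term_freq.items()
    }
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {word: w / norm for word, w in weights.items()}


first_doc = tfidf_row(tokenized[0])
for word, weight in first_doc.items():
    print(f"{word}: {weight:.3f}")
```

In the first document, "cat" receives a lower weight than "sat" or "mat" because it also appears in a second document.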

In [None]:
tfidf_vect = TfidfVectorizer()
tfidf_matrix = tfidf_vect.fit_transform(df_nlp["abstract"])
print(tfidf_matrix.shape)

In [None]:
# This is also a sparse matrix, so cannot trivially be printed out
# Bear in mind - as the number of documents increases, the TFIDF score will tend to get smaller
print(tfidf_matrix[0,1:2000])

In [None]:
word_list = tfidf_vect.get_feature_names_out()
test_word_scores = tfidf_matrix[0, 1:2000]
for n in test_word_scores.indices:
    # the slice starts at column 1, so shift by one to index the full vocabulary
    print(f"the word '{word_list[n + 1]}' has a TFIDF score of {test_word_scores[0, n]}")

#### More tokenizing choices

You aren't limited to just unigrams (single words). You could also do bigrams (each token is two consecutive words), trigrams etc

You can also do combinations

And there are other options like skipgrams, which are used by the well known word2vec embedding model
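
To make the idea of combined unigrams and bigrams concrete, here is a small sketch of an n-gram generator. The function `ngrams` and its `n_min`/`n_max` parameters are hypothetical names, chosen to mirror the `ngram_range` argument used in the next cell.

```python
def ngrams(text, n_min=1, n_max=2):
    """Return all word n-grams of length n_min..n_max, shortest first."""
    tokens = text.split()
    grams = []
    for n in range(n_min, n_max + 1):
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams


print(ngrams("viral protein binding site"))
```

Each extra n-gram length adds many new columns to the vocabulary, which is why the bigram matrix is much wider than the unigram one.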

In [None]:
tfidf_vect_bg = TfidfVectorizer(ngram_range = (1,2))
tfidf_matrix_bg = tfidf_vect_bg.fit_transform(df_nlp["abstract"])
print(tfidf_matrix_bg.shape)

In [None]:
word_list = tfidf_vect_bg.get_feature_names_out()
test_word_scores = tfidf_matrix_bg[0, 1:8000]
for n in test_word_scores.indices:
    # the slice starts at column 1, so shift by one to index the full vocabulary
    print(f"the term '{word_list[n + 1]}' has a TFIDF score of {test_word_scores[0, n]}")

For the remainder of this session, we will be working with the TFIDF vectorizer, with the unigram tokenizing scheme

### Starting with some unsupervised learning

This dataset does not come with any labels. Now, we could go through and manually create labels, or we can try and gain some insight into the data with some unsupervised learning

Unsupervised learning is a technique whereby we, the human, pass in some data to an algorithm and allow it to "learn" something about the data on its own

#### Can we extract something about the topics that are contained in our corpus?

Topic modelling can be done using a range of techniques. The idea here is to try and find "topics" that might exist in our corpus - ie clusters of "similar" papers, which we might interpret as different topics.

Caveat - this is much easier to do on a corpus that has a broader range of different topics! 

There are a range of techniques that we can use to perform topic modelling, including:
- Non-negative matrix factorization (NMF)
- Latent Dirichlet Allocation (LDA)

For the purposes of this exercise, we will look at the NMF algorithm, and see how it can be leveraged. Note though, a lot of the Kaggle submissions are using LDA - so definitely have a go with that technique

This is an example of a dimensionality reduction technique - we are trying to transform our data in such a way that valuable information is surfaced in early dimensions, enabling us to remove later dimensions. In the case below, we will attempt to transform the data such that it can be optimally reduced to 10 dimensions.

Due to the way this algorithm works, the 10 dimensions can often be interpreted as topics that appear in the corpus
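
NMF approximates a non-negative document-term matrix V as a product W × H, where W holds one row of topic weights per document and H holds one row of word weights per topic. The sketch below illustrates this with the classic multiplicative update rules on a made-up 4 × 6 matrix. It is an assumption-laden toy, and not necessarily the solver that scikit-learn's `NMF` uses internally.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up document-term matrix: 4 documents x 6 words, two obvious "topics"
V = np.array([
    [3, 2, 1, 0, 0, 0],
    [2, 3, 2, 0, 0, 0],
    [0, 0, 0, 1, 3, 2],
    [0, 0, 0, 2, 2, 3],
], dtype=float)

n_topics = 2
W = rng.random((V.shape[0], n_topics))  # document-topic weights
H = rng.random((n_topics, V.shape[1]))  # topic-word weights
eps = 1e-9                              # avoids division by zero

initial_error = np.linalg.norm(V - W @ H)
for _ in range(200):
    # Multiplicative updates keep every entry of W and H non-negative
    H *= (W.T @ V) / (W.T @ W @ H + eps)
    W *= (V @ H.T) / (W @ H @ H.T + eps)
final_error = np.linalg.norm(V - W @ H)

# For each topic, list the word columns with the largest weights
top_words = np.argsort(H, axis=1)[:, ::-1][:, :3]
print(f"reconstruction error: {initial_error:.3f} -> {final_error:.3f}")
print(top_words)
```

The rows of H play the same role as `model.components_` in the cells that follow: sorting each row by weight gives the words that characterize each topic.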

In [None]:
# Define an NMF object (you're getting the pattern by now!)
# Here, we will try and see if we can detect 10 different topics
model = NMF(n_components = 10)
features = model.fit_transform(tfidf_matrix)

In [None]:
# Extract the list of words that could possibly exist
words = np.array(tfidf_vect.get_feature_names_out())

# Extract the h-matrix from the NMF algorithm
h_sk = model.components_

# Extract the top 8 words from each of the observed 10 topics
topics = np.flip(h_sk.argsort(axis = 1), axis=1)[:, :8]
for n, topic in enumerate(topics):
    print(f"topic {n+1}: {words[topic]}")

Looking at this from a human angle, we can start to see possible topics emerging from this. Some examples:

- Topics 1 and 8 both mention the ACE2 enzyme alongside protein binding verbiage
- Topic 5 includes a lot of words about experimental methods 
- Topic 10 looks like it includes epidemiological information about infection
- Topic 9 is probably junk

This, or similar, information could be used, for example, to provide a provisional labelling scheme for your papers, or for information extraction - eg finding all the papers that mention bats and infections

#### Visualizing clusters - t-distributed Stochastic Neighbor Embedding (t-SNE)

I wanted to show you this to demonstrate that not everything works, and that is OK.

In [None]:
tsne_model = TSNE(perplexity = 50, n_iter=1000000, learning_rate=10)
features = tsne_model.fit_transform(tfidf_matrix)

In [None]:
plt.scatter(features[:,0], features[:,1], s=5)

This method can often be really helpful in visualizing extremely high dimensional data, but it is not particularly helpful in this scenario - it needs a lot of tuning of the adjustable hyperparameters.

###  A simple machine learning model: k-means clustering 

This is one of the simplest unsupervised machine learning algorithms, but illustrates nicely how we can apply such techniques to learn something, and then to predict on something unknown
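
The "learn, then predict" idea can be sketched in plain NumPy. This is a minimal version of the standard assign-and-update loop on made-up 2-D points, with a deliberately simple initialization; scikit-learn's `KMeans` is more careful about initialization and convergence, and `assign` is a hypothetical helper.

```python
import numpy as np

# Made-up 2-D data: two well-separated groups of three points each
points = np.array([
    [0.0, 0.0], [0.2, 0.1], [0.1, 0.3],
    [5.0, 5.0], [5.2, 4.9], [4.9, 5.3],
])

# Start from two of the data points (a simple, hypothetical initialization)
centroids = points[[0, 3]].copy()


def assign(data, centers):
    """Index of the nearest centroid for every row of data."""
    distances = np.linalg.norm(data[:, None, :] - centers[None, :, :], axis=2)
    return distances.argmin(axis=1)


for _ in range(10):
    labels = assign(points, centroids)
    # Move each centroid to the mean of the points assigned to it
    centroids = np.array([points[labels == k].mean(axis=0) for k in range(2)])

# "Predict" the cluster of a point the model has never seen
new_point = np.array([[4.5, 4.0]])
predicted = assign(new_point, centroids)
print(labels, predicted)
```

Prediction is just the assignment step applied to new data: the unseen point is given the label of its nearest learned centroid.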

An important note - k-means performs exceedingly badly on high dimensional data due to the <strong>curse of dimensionality</strong>. Thus, we need to reduce the number of dimensions (a lot) before applying this technique. There are many ways of doing this for NLP tasks, including using deep neural networks, but here we will apply a simpler technique
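
One way to see the curse of dimensionality is to compare the nearest and farthest neighbours of a query point. The sketch below uses uniformly random made-up data, and `relative_contrast` is a hypothetical helper: as the number of dimensions grows, the nearest and farthest points end up at almost the same distance, so "closest centroid" carries less and less information.

```python
import numpy as np

rng = np.random.default_rng(0)


def relative_contrast(n_dims, n_points=500):
    """(farthest - nearest) / nearest distance from one query to random points."""
    data = rng.random((n_points, n_dims))
    query = rng.random(n_dims)
    distances = np.linalg.norm(data - query, axis=1)
    return (distances.max() - distances.min()) / distances.min()


contrast_low = relative_contrast(2)
contrast_high = relative_contrast(1000)
print(f"2 dimensions:    {contrast_low:.2f}")
print(f"1000 dimensions: {contrast_high:.2f}")
```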

#### Apply Principal Component Analysis (PCA) to reduce the dimensionality to help k-means perform better


In [None]:
pca = PCA(n_components=2)
reduced_matrix = pca.fit_transform(tfidf_matrix.toarray())  # PCA needs a dense array; toarray() returns an ndarray rather than np.matrix

In [None]:
print(f"Size of reduced matrix = {reduced_matrix.shape}")

#### Apply the k-means algorithm

In [None]:
kmeans = KMeans(n_clusters=6)
kmeans.fit(reduced_matrix)

In [None]:
kmeans

#### Let us see what we're actually getting out of this algorithm

In [None]:
# Step size of the mesh. Decrease to increase the quality of the VQ.
h = .02

# Plot the decision boundary. For that, we will assign a color to each
# point in the mesh [x_min, x_max]x[y_min, y_max].
x_min, x_max = reduced_matrix[:, 0].min() - 0.2, reduced_matrix[:, 0].max() + 0.2
y_min, y_max = reduced_matrix[:, 1].min() - 0.2, reduced_matrix[:, 1].max() + 0.2
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))

# Obtain labels for each point in mesh. Use last trained model.
Z = kmeans.predict(np.c_[xx.ravel(), yy.ravel()])

# Put the result into a color plot
Z = Z.reshape(xx.shape)
plt.figure(1)
plt.clf()
plt.imshow(Z, interpolation='nearest',
           extent=(xx.min(), xx.max(), yy.min(), yy.max()),
           cmap=plt.cm.Paired,
           aspect='auto', origin='lower')

plt.plot(reduced_matrix[:, 0], reduced_matrix[:, 1], 'k.', markersize=2)
# Plot the centroids as a white X
centroids = kmeans.cluster_centers_
plt.scatter(centroids[:, 0], centroids[:, 1],
            marker='x', s=169, linewidths=3,
            color='w', zorder=10)
plt.title('K-means clustering on the CORD-19 abstracts (PCA-reduced data)\n'
          'Centroids are marked with white cross')
plt.xlim(x_min, x_max)
plt.ylim(y_min, y_max)
plt.xticks(())
plt.yticks(())
plt.show()

#### How do we choose a "best" K?

In [None]:
# We will use something called the elbow method
distortions = []
K = range(1,10)
for k in K:
    kmeanModel = KMeans(n_clusters=k).fit(reduced_matrix)  # fit once per value of k
    distortions.append(sum(np.min(cdist(reduced_matrix, kmeanModel.cluster_centers_, 'euclidean'), axis=1)) / reduced_matrix.shape[0])

In [None]:
# Create a new figure for the elbow plot
plt.figure()

# Plot the elbow
plt.plot(K, distortions, 'bx-')
plt.xlabel('k')
plt.ylabel('Distortion')
plt.title('The Elbow Method showing the optimal k')
plt.show()

Note - the 3 or 6-cluster models may be optimal - but be warned about spurious results!

Let's look at the PCA <strong>variance explained</strong>

In [None]:
pca = PCA(n_components=500)
test_matrix = pca.fit_transform(tfidf_matrix.toarray())
plt.plot(np.cumsum(pca.explained_variance_ratio_))
plt.xlabel('number of components')
plt.ylabel('cumulative explained variance');

So we can see that 2 components actually capture relatively little of the total variance; we need several hundred components to capture most of it. This doesn't necessarily invalidate our results, but it does suggest that we should apply some caution
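
Rather than reading the number of components off the plot, it can be computed from the cumulative sum. The sketch below uses made-up explained-variance ratios and a hypothetical 85% target; with the real data the array would be `pca.explained_variance_ratio_`.

```python
import numpy as np

# Made-up explained-variance ratios, sorted largest first as PCA returns them
explained_variance_ratio = np.array([0.40, 0.25, 0.15, 0.10, 0.06, 0.04])

cumulative = np.cumsum(explained_variance_ratio)
target = 0.85

# First position where the running total reaches the target (+1: 0-based index)
n_components_needed = int(np.searchsorted(cumulative, target) + 1)
print(n_components_needed)
```

Here the first four components are needed, since the first three only reach about 80% of the variance.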

#### Let's see which papers fall into which cluster

In [None]:
cluster_assignment = kmeans.predict(reduced_matrix)

In [None]:
# Let's see which papers ended up in cluster 1
papers_in_cluster1 = cluster_assignment == 1

In [None]:
paper_df = df[["paper_id", "abstract"]]
paper_df.loc[papers_in_cluster1]

#### When all is said and done - the human has to be in the loop to decide whether the clustering and information retrieval we have done makes sense