# Topic Modelling COVID-19 Dataset

**Table of Contents**

<a href='#section1'> 1. Introduction</a>

<a href='#section2'> 2. Data Loading and Exploration</a>

<a href='#section3'> 3. Summary Statistics of Corpus</a>

<a href='#section4'> 4. Preprocessing: Tokenization, Lemmatization and Stopwords</a>

<a href='#section5'> 5. Topic Model: Fitting Latent Dirichlet Allocation Model</a>

<a href='#section6'> 6. Inspection of the topics</a>

><a href='#section6.1'>6.1. Visualization of topics</a>

><a href='#section6.2'>6.2 Word Similarity</a>

><a href='#section6.3'>6.3 Plotting words in 2D</a>

<a href='#section7'> 7. Document Similarity</a>

><a href='#section7.1'> 7.1. Plotting Documents in 2D</a>

<a href='#section8'> 8. Model Performance</a>

<a href='#section9'> 9. Document Clustering</a>

----

**Exercise list**

- [Ex1: Important information](#ex1)

- <a href='#ex2'>Ex2: Text vectorization</a>

- <a href='#ex3'>Ex3: Topic inspection</a>

- <a href='#ex4'>Ex4: Document similarity</a>

- <a href='#ex5'>Ex5: Model performance</a>

- <a href='#ex6'>Ex6: Document clustering</a>

-----

<a id='section1'></a>
## 1. Introduction
Topic models allow us to discover (induce) the topics in a given set of documents. A topic model can be understood as a probabilistic clustering method that groups words belonging to the same topic, each with an associated probability. We assume that each document exhibits multiple topics (a multinomial distribution over topics), and that each topic is a multinomial distribution over words.

In this assignment we will estimate the emerging topics of a corpus from the biomedical domain. We will be using a part of the [COVID-19 Open Research Dataset Challenge](https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge) dataset. The CORD-19 dataset contains metadata of over 51,000 scientific papers (full text is also available for around 40,000 of them) about COVID-19, SARS-CoV-2, and related coronaviruses. In this notebook we will model topics for papers published between 2019 and 2020.


>>![](https://drive.google.com/uc?id=1fUETDUs3vMLvbe9xWq1Wr-fuimjpf51_)


**Set up**

Please, run the following cells to mount your Drive to your Colaboratory session, and enable plot.ly charts as well.

In [None]:
!pip install pyldavis

In [None]:
# Mount Drive files
from google.colab import drive
drive.mount('/content/drive')

In [None]:
def enable_plotly_in_cell():
  import IPython
  from plotly.offline import init_notebook_mode
  display(IPython.core.display.HTML('''
        <script src="/static/components/requirejs/require.js"></script>
  '''))
  init_notebook_mode(connected=False)

<a id='section2'></a>
## 2. Data Loading and Exploration
We will first load the corpus with Pandas as a DataFrame object. The DataFrame has many columns, but the text we will model is stored in `abstract`.

__Note__: Make sure your path in "data_path" is correct. Otherwise you won't be able to run the notebook.

In [None]:
import pandas as pd
import numpy as np
from scipy import stats


# Plotly imports
import plotly.offline as py
py.init_notebook_mode(connected=True)
import plotly.graph_objs as go
import plotly.tools as tls
from matplotlib import pyplot as plt
%matplotlib inline


def load_metadata(path_to_metada, below_year=2020, above_year=1950):
    # Select interesting fields from metadata file
    fields = ['cord_uid', 'title', 'authors', 'publish_time', 'abstract', 'journal', 'url']
    # Extract selected fields from metadata file into dataframe
    df_mdata = pd.read_csv(path_to_metada, skipinitialspace=True, index_col='cord_uid', usecols=fields)

    # WARNING: cord_uid is described as unique, but c4u0gxp5 is repeated, so we keep only the first occurrence
    df_mdata = df_mdata.loc[~df_mdata.index.duplicated(keep='first')]
    df_mdata['publish_time'] = pd.to_datetime(df_mdata['publish_time'], errors="coerce")
    df_mdata['publish_year'] = df_mdata['publish_time'].dt.year
    df_mdata = df_mdata[df_mdata['abstract'].notna()]
    df_mdata = df_mdata[df_mdata['authors'].notna()]
    # df_mdata = df_mdata[df_mdata['sha'].notna()]
    df_mdata['authors'] = df_mdata['authors'].apply(lambda row: str(row).split('; '))

    relevant_time = df_mdata.publish_year.between(above_year, below_year)
    df_mdata = df_mdata[relevant_time]

    return df_mdata[['title', 'authors', 'publish_time', 'abstract', 'journal', 'url']]


data_path='drive/MyDrive/Colab Notebooks/nlp-app-II/data/'
df = load_metadata(data_path+'metadata.csv', below_year=2020, above_year=2019)

#df['doc_name'] = df['doc_name'].astype(str)
#df['text'] = df['text'].astype(str)

df.head()

<a id='section3'></a>
## 3. Summary statistics of the COVID-19 corpus

Here we visualize some basic statistics of the data, such as the distribution of document lengths across the articles that make up the corpus. For this purpose, we will use the Plot.ly visualisation library to draw some simple plots.

The following code approximately counts the length of each document in words. Note that we split the words on whitespace, which is not the most accurate way of tokenizing.

In [None]:
enable_plotly_in_cell()

lengths = np.asarray(df['abstract'].apply(lambda sent: len(sent.split(' '))))

print('Mean length: {}'.format(lengths.mean()))
print('Most repeated length: {}'.format(stats.mode(lengths)))

data = [go.Histogram(x=lengths)]

layout = go.Layout(
    title='Document length frequencies',
    xaxis=dict(
        title='Abstract length [words]'
    ),
    yaxis=dict(
        title='Count'
    ),
)

fig = go.Figure(data=data, layout=layout)

py.iplot(fig, filename='basic-bar')
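
To illustrate why whitespace splitting is only an approximation, the sketch below compares it with a simple regular-expression tokenizer on a made-up sentence. The sentence and the regex pattern are illustrative assumptions, not the tokenizer used later in this notebook.

```python
import re

# Made-up sentence for illustration only
sentence = "SARS-CoV-2 spreads fast; symptoms (fever, cough) vary."

# Whitespace splitting keeps punctuation attached to the words
whitespace_tokens = sentence.split(' ')

# A simple regex tokenizer: runs of word characters, optionally joined by hyphens
regex_tokens = re.findall(r"\w+(?:-\w+)*", sentence)

print(whitespace_tokens)
print(regex_tokens)
```

Both approaches return the same number of tokens here, so the length estimate is reasonable, but the whitespace tokens carry punctuation (`fast;`, `(fever,`), which would inflate the vocabulary if used for modeling.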

**Number of documents**

Next, we calculate the size of the dataset in terms of the number of abstracts. Note that when loading the metadata file we only keep the papers published between 2019 and 2020.

In [None]:
print("Number of abstracts loaded: {} ".format(df.shape[0]))

**Word frequencies**


Word frequencies can tell a lot about the corpus we are working with, but interpreting them is not completely straightforward.


In [None]:
enable_plotly_in_cell()

all_words = df['abstract'].str.split(expand=True).unstack().value_counts()
data = [go.Bar(
            x = all_words.index.values[2:50],
            y = all_words.values[2:50],
            marker= dict(colorscale='Jet',
                         color = all_words.values[2:50]  # same slice as x and y
                        ),
            text='Word counts'
    )]

layout = go.Layout(
    title='Top 50 (Uncleaned) Word frequencies in the corpus'
)

fig = go.Figure(data=data, layout=layout)

py.iplot(fig, filename='basic-bar')

Do you see anything odd about the words that appear in this word frequency plot? Do these words actually tell us much about the themes and concepts that a biomedical corpus should contain?

These are all very common words that you could find in just about any text. Therefore, we must first preprocess our dataset to strip out these commonly occurring words, which do not bring much to the table.

#### Exercise 1 <a class="anchor" id="ex1"></a>

- How would you surface/extract the most important features for modeling the topics?

<a id='section4'></a>

## 4. Preprocessing: Tokenization, lemmatization, removing semantically empty stuff

Almost every Natural Language Processing task requires a few pre-processing steps that convert the raw input text into a form that is readable by your model and the machine. Text pre-processing can be boiled down to these few simple steps:

1. **Tokenization** - Segmentation of the text into its individual constituent words.
2. **Stopwords** - Throw away any words that occur too frequently, as their frequency of occurrence will not help in detecting relevant texts (as an aside, also consider throwing away words that occur very infrequently).
3. **Vectorization** - Converting text into vector format. One of the simplest methods is the famous bag-of-words approach, where you create a matrix with one row for each document or text in the corpus. In its simplest form, this matrix stores word frequencies (word counts); building it is often referred to as vectorization of the raw text.
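
The three steps above can be sketched in plain Python on a toy corpus. The documents and the one-word stopword list are made up for illustration; the notebook itself relies on spaCy and scikit-learn for these steps.

```python
from collections import Counter

# Toy corpus and a tiny hand-picked stopword list (both illustrative)
docs = ["The virus infects the cells", "The vaccine protects the cells"]
stopwords = {"the"}

# 1. Tokenization: lowercase and split on whitespace
tokenized = [doc.lower().split() for doc in docs]

# 2. Stopword removal
filtered = [[tok for tok in doc if tok not in stopwords] for doc in tokenized]

# 3. Vectorization: one count vector per document over a shared vocabulary
vocab = sorted({tok for doc in filtered for tok in doc})
vectors = [[Counter(doc)[word] for word in vocab] for doc in filtered]

print(vocab)
print(vectors)
```

Each row of `vectors` is the bag-of-words representation of one document, with columns ordered as in `vocab`.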

There are many toolkits in Python that help preprocessing input text. Two well-known packages are: 

- [**Natural Language Toolkit (NLTK)**](http://www.nltk.org/) 

- [**SpaCy**](https://spacy.io/)

We will use the spaCy toolkit in this notebook, but the same steps should look similar with NLTK.

In [None]:
import spacy
import string

def preprocess(data, remove_stopwords=True, remove_func_words=True):
    # spaCy 3 removed the 'en' shortcut, so the small English model is loaded by its full name.
    # If it is missing, install it with: python -m spacy download en_core_web_sm
    nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])

    if remove_func_words:
        # Keep only the lemmas of open-class (content) words
        open_class_words = set(['NOUN', 'ADV', 'VERB','ADJ'])
        data['preproc'] = data['abstract'].apply(lambda row: [tok.lemma_.lower() for tok in nlp(row) if tok.pos_ in open_class_words])
    else:
        data['preproc'] = data['abstract'].apply(lambda row: [tok.lemma_.lower() for tok in nlp(row)])
    
    if remove_stopwords:
        stop = nlp.Defaults.stop_words
        data['preproc'] = data['preproc'].apply(lambda row: [word for word in row if word not in stop])
        # Drop tokens made up entirely of punctuation characters
        data['preproc'] = data['preproc'].apply(lambda row: [word for word in row if not all([c in string.punctuation for c in word])])
        extended_punctuation = '…–—«»'
        data['preproc'] = data['preproc'].apply(lambda row: [word for word in row if not all([c in extended_punctuation for c in word])])
        
    # Join the tokens back into one whitespace-separated string per document
    data['preproc'] = data['preproc'].apply(lambda row: ' '.join(row))
    return data

df = preprocess(df)

The `preprocess` function adds a new column to the dataframe: `preproc`. This column contains the tokenized and cleaned documents, obtained with the following steps:

- spaCy tokenization and lemmatization.
- spaCy part-of-speech tagging to keep only content words.
- Remove stopwords.
- Remove punctuation marks.

In [None]:
df.head()

**Revisiting term frequencies**

Having lemmatized and cleaned the corpus, let us revisit the term-frequency plot of the top 50 words. As the plot shows, our preprocessing efforts have not gone to waste: with the stopwords from the earlier term-frequency plot removed, the remaining words are much more meaningful.

In [None]:
enable_plotly_in_cell()

# word frequencies
all_words = df['preproc'].str.split(expand=True).unstack().value_counts()

data = [go.Bar(
            x = all_words.index.values[0:50],
            y = all_words.values[0:50],
            marker= dict(colorscale='Jet',
                         color = all_words.values[0:50]  # same slice as x and y
                        ),
            text='Word counts'
    )]

layout = go.Layout(
    title='Top 50 (cleaned) Word frequencies in the training dataset'
)

fig = go.Figure(data=data, layout=layout)

py.iplot(fig, filename='basic-bar')

The plot clearly shows the domain of the corpus. The most representative words are now terms such as 'cell', 'gene', or 'treatment'. However, we still have some general words like 'use', 'find' and 'common', among others.

<a id='section5'></a>

## 5. Topic Model: Fitting Latent Dirichlet Allocation Model

Now we are ready to estimate the topics from text using the [Latent Dirichlet Allocation](http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.LatentDirichletAllocation.html) algorithm. We will be using Sklearn's implementation. Another very well-known LDA implementation is Radim Rehurek's [gensim](https://radimrehurek.com/gensim/), so check it out as well.

**Corpus - Document - Word : Topic Generation**

In LDA, the modelling process revolves around three things: the text corpus, its collection of documents D, and the words W in the documents. The algorithm attempts to uncover K topics from this corpus in the following way (illustrated by the diagram):

![Three_Level Bayesian Model](http://scikit-learn.org/stable/_images/lda_model_graph.png)

For each topic $k$, draw its word distribution $\beta_{k}$ from a Dirichlet prior:

![](http://scikit-learn.org/stable/_images/math/2c1ff5b3d6f342d7dad0395210c8a13947de451c.png)

Model each document d by another Dirichlet distribution parameterized by $\alpha$:

![](http://scikit-learn.org/stable/_images/math/530c80986933767c9d182af83075c13d72cdef97.png)

Subsequently, for each word position in document d, we draw a topic from a multinomial distribution over the document's topic proportions, and then draw the corresponding word from the multinomial word distribution of that topic:

![](http://scikit-learn.org/stable/_images/math/0bb078c0fe621366a147231b5c8240efacdb895b.png)
![](http://scikit-learn.org/stable/_images/math/f7960e06327c64a1e9da6769e68aa4342b7de73d.png)


*(Image source: http://scikit-learn.org/stable/modules/decomposition.html#latentdirichletallocation)*

LDA first models each document as a mixture of topics. Each topic, in turn, assigns weights to words according to its own probability distribution. It is this probabilistic assignment over words that allows a user of LDA to say how likely it is that a particular word belongs to a topic. From the collection of words assigned to a particular topic, we can then gain an insight into what that topic may represent from a lexical point of view.
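
The generative story described above can be simulated with NumPy. This is a minimal sketch for a single document; the vocabulary, the hyperparameter values, and the variable names are illustrative assumptions, and it only samples from the model rather than fitting it.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab = ["virus", "cell", "vaccine", "trial", "patient"]  # toy vocabulary
n_topics, doc_length = 2, 8
alpha, eta = 0.5, 0.5  # symmetric Dirichlet hyperparameters (illustrative values)

# For each topic k, draw a word distribution beta_k ~ Dirichlet(eta)
beta = rng.dirichlet([eta] * len(vocab), size=n_topics)

# For one document d, draw topic proportions theta_d ~ Dirichlet(alpha)
theta = rng.dirichlet([alpha] * n_topics)

# For each word position: draw a topic z ~ Multinomial(theta_d),
# then draw the word w ~ Multinomial(beta_z)
topics = rng.choice(n_topics, size=doc_length, p=theta)
words = [vocab[rng.choice(len(vocab), p=beta[z])] for z in topics]

print(theta)
print(words)
```

Fitting LDA goes in the opposite direction: given only the observed words, it estimates plausible values for `beta` and for each document's `theta`.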

A standard LDA model has a few key parameters that we should keep in mind and consider tuning before fitting the model:
1. n_components: the number of topics that you specify to the model
2. $\alpha$ parameter: the Dirichlet parameter of the document-topic prior (`doc_topic_prior` in Sklearn)
3. $\beta$ parameter: the Dirichlet parameter of the topic-word prior (`topic_word_prior` in Sklearn)
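
As a rough sketch of how these three parameters map onto scikit-learn's constructor arguments, the example below fits a tiny model on a made-up corpus. The documents and the prior values are arbitrary choices for illustration, not recommended settings.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Made-up mini corpus for illustration only
toy_docs = [
    "virus infection cell virus",
    "vaccine trial patient vaccine",
    "cell infection virus protein",
    "patient trial treatment vaccine",
]
X = CountVectorizer().fit_transform(toy_docs)

lda = LatentDirichletAllocation(
    n_components=2,         # number of topics K
    doc_topic_prior=0.1,    # alpha: prior on per-document topic proportions
    topic_word_prior=0.01,  # beta in the list above: prior on per-topic word distributions
    random_state=0,
)
doc_topics = lda.fit_transform(X)  # one row of topic proportions per document
print(doc_topics.round(2))
```

When the two priors are left at `None`, scikit-learn falls back to `1 / n_components` for both.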

To invoke the algorithm, we simply create an LDA instance through Sklearn's *LatentDirichletAllocation* class. The various parameters would ideally be obtained through some sort of validation scheme. In this instance, the value of n_components (the number of topics) was found with a KMeans + Latent Semantic Analysis scheme (as shown in this paper here), in which the number of KMeans clusters and the number of LSA dimensions were iterated over and the configuration with the best mean silhouette score was kept.

<a id='section4.1'></a>
### 4.1 Vectorizing Text
Text vectorization is a basic tool that converts text into numbers that machines understand. There are different ways to convert text into vectors. Scikit-learn offers various functions to do this (https://scikit-learn.org/stable/modules/feature_extraction.html). We will explore a few of them:

- __Bag-of-Words__: [CountVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html#sklearn.feature_extraction.text.CountVectorizer)

- __TF-IDF__: [TfidfVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html#sklearn.feature_extraction.text.TfidfVectorizer)


**Example of count vectorizer**

There are multiple ways to vectorize tokens. The example below shows the bag-of-words approach, in which each word in the vocabulary is indexed and its number of occurrences is counted. This representation is very useful in models like topic models, where the frequency of words is very important.

Below, we show the vectorization of two sequences of words. Note that in the first sentence 'eat' is repeated three times.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
# Defining our sentence
sentence = ["I love to eat eat and eat Burgers", 
            "I love to eat Fries"]
# min_df=1 keeps every word; recent scikit-learn versions reject an integer min_df below 1
vectorizer = CountVectorizer(min_df=1)
sentence_transform = vectorizer.fit_transform(sentence)

We create a simple term-frequency object called `vectorizer` with the CountVectorizer class. The only parameter we set explicitly is `min_df`, the minimum document frequency; the rest are left at their defaults. The vectorizer drops every word that appears in fewer documents than this value. For a detailed read-up on this method as well as the rest of the parameters that one could use, I refer you to the [Sklearn website](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html).

Finally, we apply the fit_transform method, which actually comprises two steps. The first is the fit method, where the vectorizer learns the vocabulary of the dataset you provide. Once this is done, the transform method performs the actual vectorization, turning the raw text into its vector form as shown below.


In [None]:
print("The features are:\n {}".format(vectorizer.get_feature_names_out()))
print("\nThe vectorized array looks like:\n {}".format(sentence_transform.toarray()))

**Vectorizing the whole dataset**

Combining Pandas and Scikit-learn makes this part of the preprocessing very easy. First, we create sklearn's vectorizer object, and then we convert our tokenized and cleaned corpus to vectors directly with the fit_transform method.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# the vectorizer object will be used to transform text to vector form
vectorizer = CountVectorizer(strip_accents = 'unicode',
                             stop_words = 'english',
                             lowercase = True,
                             token_pattern = r'\b[a-zA-Z]{3,}\b',
                             max_df = 0.5, 
                             min_df = 10)

# apply transformation
tf = vectorizer.fit_transform(df['preproc'])

# tf_feature_names is the word index of the vectorizer
tf_feature_names = vectorizer.get_feature_names_out()
print(tf_feature_names[0:5])

<a id='ex2'></a>
#### Exercise 2
You can apply different options when vectorizing the input. For example, you can try different thresholds (min_df, max_df) and see what difference they make.
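
As a starting point for this exercise, the sketch below shows how `min_df` and `max_df` change the vocabulary on a made-up four-document corpus. The documents and thresholds are illustrative; on the real corpus you would pass `df['preproc']` instead.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up mini corpus for illustration only
toy_docs = [
    "virus infects cell",
    "virus vaccine trial",
    "virus patient treatment",
    "patient treatment outcome",
]

# Compare vocabularies under different document-frequency thresholds
vocabularies = {}
for min_df, max_df in [(1, 1.0), (2, 1.0), (1, 0.5)]:
    vec = CountVectorizer(min_df=min_df, max_df=max_df).fit(toy_docs)
    vocabularies[(min_df, max_df)] = sorted(vec.vocabulary_)
    print(min_df, max_df, vocabularies[(min_df, max_df)])
```

Raising `min_df` removes rare words, while lowering `max_df` removes words that appear in a large share of the documents (here `virus`, which occurs in three of four documents).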

### 5.1 Fitting topics

In [None]:
from sklearn.decomposition import LatentDirichletAllocation
number_of_topics = 100
model = LatentDirichletAllocation(n_components=number_of_topics, random_state=0)

In [None]:
model.fit(tf)

<a id='section6'></a>

## 6. Inspection of the topics

In [None]:
def display_topics(model, feature_names, no_top_words):
    topic_dict = {}
    for topic_idx, topic in enumerate(model.components_):
        topic_dict["Topic %d words" % (topic_idx)]= ['{}'.format(feature_names[i])
                        for i in topic.argsort()[:-no_top_words - 1:-1]]
        topic_dict["Topic %d weights" % (topic_idx)]= ['{:.1f}'.format(topic[i])
                        for i in topic.argsort()[:-no_top_words - 1:-1]]
    return pd.DataFrame(topic_dict)

In [None]:
no_top_words= 10
display_topics(model, tf_feature_names, no_top_words)
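
The slicing idiom `argsort()[:-no_top_words - 1:-1]` used in `display_topics` can be hard to read. The sketch below unpacks it on a made-up weight vector.

```python
import numpy as np

topic = np.array([0.1, 2.5, 0.7, 1.9, 0.3])  # made-up topic-word weights
no_top_words = 3

# argsort() orders indices by ascending weight; the reversed slice
# [:-no_top_words - 1:-1] walks backwards from the end and keeps the last 3 indices
top_idx = topic.argsort()[:-no_top_words - 1:-1]

print(top_idx)         # indices of the 3 largest weights, largest first
print(topic[top_idx])  # the corresponding weights
```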

<a id='section6.1'></a>
### 6.1. Visualization of topics

In [None]:
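# pip install pyldavis
import pyLDAvis
try:
    # pyLDAvis >= 3.4 renamed the scikit-learn helper module to lda_model
    import pyLDAvis.lda_model as pyldavis_sklearn
except ImportError:
    # older pyLDAvis releases
    import pyLDAvis.sklearn as pyldavis_sklearn
pyLDAvis.enable_notebook()

In [None]:
pyldavis_sklearn.prepare(model, tf, vectorizer)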

#### Exercise 3 <a class="anchor" id="ex3"></a>
- Do the estimated topics make sense? Could you name some of them?

<a id='section6.2'></a>
### 6.2 Word similarity

In [None]:
from sklearn.metrics.pairwise import cosine_distances

def most_probable_words(topic_word_vector, word_topic_table, vocab, top_n=10):
  sim = topic_word_vector.dot(word_topic_table)
  knn = np.argsort(-sim)
  similar_words = [vocab[i] for i in knn[0:top_n]]
  return similar_words

def most_similar_words(word_vector, word_table, vocab, top_n=10):
  dists = cosine_distances(word_vector.reshape(1, -1), word_table)
  pairs = enumerate(dists[0])
  most_similar = sorted(pairs, key=lambda item: item[1])[:top_n]
  similar_words = [vocab[i[0]] for i in most_similar]
  return similar_words

# P(t | w): topic distr given word w
topic_prob = np.transpose(model.components_) # P(t | w)
row_sums = np.sum(topic_prob, axis=1)
topic_prob = topic_prob / row_sums[:, np.newaxis]

# P(w | t): word probability given topic t
word_prob = model.components_ # P(w | t)
row_sums = np.sum(word_prob, axis=1)
word_prob = word_prob / row_sums[:, np.newaxis]

# word indices
word2idx = {w : i for i, w in enumerate(tf_feature_names)}

Change the variable `word` to look up the words most similar to other terms.

In [None]:
word = 'virus'

# P(w2 | w1)
topic_word_vector = topic_prob[word2idx[word]]  # p(t | w)
similar_words = most_probable_words(topic_word_vector, word_prob, tf_feature_names)
print('P(w2 | w1)\nMost probable words to "{}":'.format(word))
print(similar_words)

print("")

# cos(p(t|w1), p(t|w2))
similar_words = most_similar_words(topic_word_vector, topic_prob, tf_feature_names)
print('cos(p(t|w1), p(t|w2))\nMost similar words to "{}":'.format(word))
print(similar_words)
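
To make the `P(w2 | w1)` computation concrete, here is a toy version with two topics and three words. The weight matrix is made up and plays the role of `model.components_`; the normalization follows the same convention as the cell above.

```python
import numpy as np

vocab = ["virus", "infection", "gene"]
# Made-up topic-word weights, shape (n_topics, n_words)
components = np.array([[8.0, 6.0, 1.0],
                       [1.0, 2.0, 9.0]])

# P(w | t): normalize each topic (row) over words
p_w_given_t = components / components.sum(axis=1, keepdims=True)
# P(t | w): normalize each word (column) over topics, then transpose to (n_words, n_topics)
p_t_given_w = (components / components.sum(axis=0, keepdims=True)).T

# P(w2 | w1) = sum_t P(w2 | t) * P(t | w1), here for w1 = "virus"
p_w2_given_w1 = p_t_given_w[vocab.index("virus")] @ p_w_given_t
print(dict(zip(vocab, p_w2_given_w1.round(3))))
```

Because "virus" belongs almost entirely to the first topic, the words that are prominent in that topic ("virus" itself, then "infection") receive most of the probability mass.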

<a id='section6.3'></a>
### 6.3 Plotting words in 2D

In [None]:
from sklearn.manifold import TSNE

tsne = TSNE(n_components=2, verbose=1, random_state=0, angle=.99, init='pca')
words_2d = tsne.fit_transform(topic_prob)

In [None]:
enable_plotly_in_cell()

df_2d = pd.DataFrame(columns=['x', 'y', 'word'])
df_2d['x'], df_2d['y'], df_2d['word'] = words_2d[:,0], words_2d[:,1], np.asarray(tf_feature_names)

df_2d = df_2d.sample(n=200)

data = [go.Scatter(
            x = df_2d['x'],
            y = df_2d['y'],
            mode = 'markers',
            marker = dict(
                size = 10,
                color = 'rgba(255, 182, 193, .9)',
                line = dict(
                    width = 2,
                )
            ),
            text=df_2d['word']
    )]

layout = go.Layout(
    title='2D visualization of words'
)

fig = go.Figure(data=data, layout=layout)

py.iplot(fig, filename='basic-scatter')

<a id='section7'></a>
## 7. Document Similarity

Note that we can represent documents by their topic proportions. This latent representation can be seen as a generalization over the occurring words: documents that use different words but share similar topics can be considered similar.

In [None]:
doc_name = ["unseen document"]
text = str(df['abstract'].iloc[20])
title = str(df['title'].iloc[20])
print('Title: {}'.format(title))
print(text)
new = pd.DataFrame({'doc_name':doc_name, 'abstract':text})
new = preprocess(new)
# Vectorize the preprocessed text, matching how the model was trained
x_new = model.transform(vectorizer.transform(new['preproc']))

In [None]:
from sklearn.metrics.pairwise import euclidean_distances, cosine_distances
 
def most_similar(x, Z, top_n=5):
    dists = cosine_distances(x.reshape(1, -1), Z)
    pairs = enumerate(dists[0])
    most_similar = sorted(pairs, key=lambda item: item[1])[:top_n]
    return most_similar
  
# get latent representation of documents
tf_Z = model.transform(tf)

# get most similar documents as (row index, cosine distance) pairs
similarities = most_similar(x_new, tf_Z)
document_id, similarity = similarities[0]

similar_ids = [sim[0] for sim in similarities]

# show the most similar documents
df.iloc[similar_ids]
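
The ranking step can be checked by hand on a toy example. The topic proportions below are made up; in the notebook they come from `model.transform`.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_distances

# Made-up topic proportions for three documents over three topics
doc_topics = np.array([[0.80, 0.15, 0.05],
                       [0.10, 0.10, 0.80],
                       [0.50, 0.45, 0.05]])
query = np.array([[0.75, 0.20, 0.05]])

# Smaller cosine distance means more similar topic mixtures
dists = cosine_distances(query, doc_topics)[0]
ranking = np.argsort(dists)

print(ranking)
print(dists.round(3))
```

The query is dominated by the first topic, so the first document ranks closest and the second document, dominated by the third topic, ranks last.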


#### Exercise 4 <a class="anchor" id="ex4"></a>

- Select a random article from the dataset and predict the topics of the unseen document.
- Do not forget to preprocess the new document!

<a id='section7.1'></a>
### 7.1 Plotting documents in 2D

In [None]:
from sklearn.manifold import TSNE

tsne_model = TSNE(n_components=2, verbose=1, random_state=0, angle=.99, init='pca')
tsne_lda = tsne_model.fit_transform(tf_Z)

In [None]:
df_2d = pd.DataFrame(columns=['x', 'y', 'document'])
df_2d['x'], df_2d['y'], df_2d['document'] = tsne_lda[:,0], tsne_lda[:,1], np.asarray(df['title'])


In [None]:
enable_plotly_in_cell()

data = [go.Scatter(
            x = df_2d['x'],
            y = df_2d['y'],
            mode = 'markers',
            marker = dict(
                size = 10,
                color = 'rgba(255, 182, 193, .9)',
                line = dict(
                    width = 2,
                )
            ),
            text=df_2d['document']
    )]

layout = go.Layout(
    title='2D visualization of documents'
)

fig = go.Figure(data=data, layout=layout)

py.iplot(fig, filename='basic-scatter')

<a id='section8'></a>
## 8. Diagnose Model Performance
Measuring the performance of topic models is challenging, as there is no easy way to measure how coherent the estimated topics are. One sound way is to evaluate the model extrinsically, by applying it to a final task such as information retrieval. This kind of evaluation is usually quite expensive. Therefore, the most common way to evaluate such models is to measure the log-likelihood or perplexity of the model on a given unseen set of documents.

The two measures are mathematically related, but whereas we look for higher log-likelihood values, we want a lower perplexity value. Intuitively, the metrics measure how well the model fits the data (or how likely it is to generate the given data).

In [None]:
from pprint import pprint

# Log likelihood: higher is better
print("Log Likelihood: ", model.score(tf))

# Perplexity: Lower the better. Perplexity = exp(-1. * log-likelihood per word)
print("Perplexity: ", model.perplexity(tf))

# See model parameters
pprint(model.get_params())
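
The relation between the two metrics can be verified numerically. The sketch below assumes scikit-learn's implementation, where `score` returns an approximate log-likelihood (a variational bound) for the whole matrix and `perplexity` exponentiates its negative per-word value; the toy corpus is made up.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Made-up mini corpus for illustration only
toy_docs = [
    "virus infection cell virus",
    "vaccine trial patient vaccine",
    "cell infection virus protein",
    "patient trial treatment vaccine",
]
X = CountVectorizer().fit_transform(toy_docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

log_likelihood = lda.score(X)  # approximate log-likelihood of the whole matrix
n_words = X.sum()              # total number of word occurrences
manual_perplexity = np.exp(-log_likelihood / n_words)
reported_perplexity = lda.perplexity(X)

print(manual_perplexity, reported_perplexity)
```

Both printed values should agree, which is why maximizing the log-likelihood and minimizing the perplexity select the same model on a fixed dataset.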

#### Exercise 5 <a class="anchor" id="ex5"></a>

- Try to find the best number of topics that get low perplexity. This can take a good amount of time, so do not go for a very exhaustive search.
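
One possible way to approach this exercise is a small loop over candidate topic counts, scored by perplexity on held-out documents. The sketch below uses a made-up corpus and an arbitrary candidate list; with the real data you would replace `X` by `tf` and pick a coarse grid to keep the runtime manageable.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

# Made-up mini corpus for illustration only
toy_docs = [
    "virus infection cell virus", "vaccine trial patient vaccine",
    "cell infection virus protein", "patient trial treatment vaccine",
    "protein cell gene virus", "treatment patient outcome trial",
    "gene protein cell infection", "vaccine patient outcome treatment",
]
X = CountVectorizer().fit_transform(toy_docs)
X_train, X_test = train_test_split(X, test_size=0.25, random_state=0)

# Fit one model per candidate topic count and score it on held-out documents
perplexities = {}
for k in [2, 3, 4]:
    lda = LatentDirichletAllocation(n_components=k, random_state=0).fit(X_train)
    perplexities[k] = lda.perplexity(X_test)

best_k = min(perplexities, key=perplexities.get)
print(perplexities, best_k)
```

On such a tiny corpus the numbers are not meaningful; the point is only the structure of the search.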

<a id='section9'></a>
## 9. Clustering documents
Topic models can be seen as a latent representation of documents that makes the main topics/themes of each document visible. As we have seen above, this representation can be useful for finding similar documents (and words). Having a way to represent documents and measure their similarity, we can easily set up a clustering algorithm that groups documents by their topics.

In [None]:
# Construct the k-means clusters
from sklearn.cluster import KMeans
clusters = KMeans(n_clusters=15, random_state=100).fit_predict(tf_Z)

clusters.shape

# Build the t-SNE model that projects the topic proportions to 2D
tsne_model = TSNE(n_components=2, verbose=0, random_state=0, angle=.99, init='pca')  # 2 components 
lda_output_tsne = tsne_model.fit_transform(tf_Z)

# X and Y axes of the plot using the t-SNE projection
x = lda_output_tsne[:, 0]
y = lda_output_tsne[:, 1]

In [None]:
enable_plotly_in_cell()

data = [go.Scatter(
            x = x,
            y = y,
            mode = 'markers',
            marker = dict(
                size = 10,
                color = clusters,
                line = dict(
                    width = 2,
                )
            ),
            text=np.asarray(df['title'])
    )]

layout = go.Layout(
    title='Segregation of Topic Clusters'
)

fig = go.Figure(data=data, layout=layout)

py.iplot(fig, filename='basic-scatter')

#### Exercise 6 <a class="anchor" id="ex6"></a>
- Inspect some clusters and identify examples that make sense.
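
One simple way to inspect clusters is to list the titles that fall into each one. The sketch below uses made-up titles and labels standing in for `df['title']` and the `clusters` array computed above.

```python
import pandas as pd

# Made-up titles and cluster labels for illustration only
titles = ["Vaccine trial results", "Spike protein structure",
          "Vaccine safety study", "Protein binding analysis"]
labels = [0, 1, 0, 1]

clustered = pd.DataFrame({"title": titles, "cluster": labels})

# Collect the titles that belong to each cluster
titles_by_cluster = clustered.groupby("cluster")["title"].apply(list).to_dict()

for cluster_id, cluster_titles in titles_by_cluster.items():
    print(cluster_id, cluster_titles)
```

With the real data, the same grouping could be built from `pd.DataFrame({'title': df['title'].values, 'cluster': clusters})`.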