<a href="https://colab.research.google.com/github/howard-haowen/NLP-demos/blob/main/NSYSU/W03-document-vectorization-and-clustering.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This notebook is adapted by [Haowen Jiang](https://howard-haowen.rohan.tw/) from [this one](https://github.com/dipanjanS/nlp_workshop_odsc19/blob/master/Module05%20-%20NLP%20Applications/Project02%20-%20Text_Clustering.ipynb) included in the [dipanjanS/nlp_workshop_odsc19](https://github.com/dipanjanS/nlp_workshop_odsc19) repo. It is meant for the 2022 [NLP Workshop at NSYSU](https://howard-haowen.rohan.tw/NLP-demos/nsysu_workshop).

In [41]:
from datetime import date

today = date.today()
print("Last updated:", today)

Last updated: 2022-04-28


# Machine Learning Overview

- Machine learning workflow
![](https://miro.medium.com/max/1200/1*XgcF3ayEH2Q8JEbZx8D09Q.png)

- Types of machine learning
![](https://www.researchgate.net/publication/354960266/figure/fig1/AS:1075175843983363@1633353305883/The-main-types-of-machine-learning-Main-approaches-include-classification-and.png)

# Understanding Data Vectorization


Data vectorization is the process of representing data points, be they sounds, images, or texts, as sequences of numbers (i.e. vectors) in a meaningful way. It is an important step in a machine learning workflow. In this tutorial, we'll focus on algorithms for document vectorization.

## Count vectorizer

![](https://raw.githubusercontent.com/cassieview/intro-nlp-wine-reviews/master/imgs/vectorchart.PNG)

In [42]:
from sklearn.feature_extraction.text import CountVectorizer
corpus = [
        'This is the first document.',
        'This document is the second document.',
        'And this is the third one.',
        'Is this the first document?',
        ]
count_vectorizer = CountVectorizer()
count_vectors = count_vectorizer.fit_transform(corpus)
count_vectors.toarray()

array([[0, 1, 1, 1, 0, 0, 1, 0, 1],
       [0, 2, 0, 1, 0, 1, 1, 0, 1],
       [1, 0, 0, 1, 1, 0, 1, 1, 1],
       [0, 1, 1, 1, 0, 0, 1, 0, 1]])

In [43]:
count_vectorizer.get_feature_names_out()

array(['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third',
       'this'], dtype=object)
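
To make the counting explicit, here is a minimal from-scratch sketch that rebuilds the matrix above without scikit-learn. It assumes the default behaviour of `CountVectorizer` (lowercasing, and keeping only tokens of two or more word characters). The names `tokenize`, `vocabulary`, and `count_matrix` are illustrative and not part of scikit-learn.

```python
import re
from collections import Counter

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]

def tokenize(text):
    # Assumption: mimic CountVectorizer's defaults
    # (lowercase, tokens of two or more word characters).
    return re.findall(r"\b\w\w+\b", text.lower())

# The vocabulary is the sorted set of all tokens; each term gets one column.
vocabulary = sorted({tok for doc in corpus for tok in tokenize(doc)})

# Each row counts how often every vocabulary term occurs in one document.
count_matrix = []
for doc in corpus:
    counts = Counter(tokenize(doc))
    count_matrix.append([counts[term] for term in vocabulary])

print(vocabulary)
for row in count_matrix:
    print(row)
```

Each column corresponds to one vocabulary term, so the second document gets a `2` in the `document` column because the word occurs twice in it.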

## TF-IDF vectorizer

![](https://miro.medium.com/max/1200/1*V9ac4hLVyms79jl65Ym_Bw.jpeg)

In [44]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer()
tfidf_vectors = tfidf_vectorizer.fit_transform(corpus)
tfidf_vectors.toarray()

array([[0.        , 0.46979139, 0.58028582, 0.38408524, 0.        ,
        0.        , 0.38408524, 0.        , 0.38408524],
       [0.        , 0.6876236 , 0.        , 0.28108867, 0.        ,
        0.53864762, 0.28108867, 0.        , 0.28108867],
       [0.51184851, 0.        , 0.        , 0.26710379, 0.51184851,
        0.        , 0.26710379, 0.51184851, 0.26710379],
       [0.        , 0.46979139, 0.58028582, 0.38408524, 0.        ,
        0.        , 0.38408524, 0.        , 0.38408524]])

In [45]:
tfidf_vectorizer.get_feature_names_out()

array(['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third',
       'this'], dtype=object)
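
The TF-IDF values above can be reproduced by hand. The sketch below starts from the raw counts of the toy corpus and applies what scikit-learn documents as the default settings of `TfidfVectorizer`: a smoothed inverse document frequency, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row. The variable names are illustrative only.

```python
import math

# Term counts for the four toy documents, copied from the CountVectorizer output.
terms = ['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']
counts = [
    [0, 1, 1, 1, 0, 0, 1, 0, 1],
    [0, 2, 0, 1, 0, 1, 1, 0, 1],
    [1, 0, 0, 1, 1, 0, 1, 1, 1],
    [0, 1, 1, 1, 0, 0, 1, 0, 1],
]
n_docs = len(counts)

# Document frequency: the number of documents in which each term occurs.
doc_freq = [sum(1 for row in counts if row[j] > 0) for j in range(len(terms))]

# Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1.
idf = [math.log((1 + n_docs) / (1 + df)) + 1 for df in doc_freq]

tfidf = []
for row in counts:
    weighted = [tf * w for tf, w in zip(row, idf)]
    norm = math.sqrt(sum(v * v for v in weighted))  # L2 norm of the row
    tfidf.append([v / norm for v in weighted])

for term, value in zip(terms, tfidf[0]):
    print(f"{term:>8}: {value:.8f}")
```

Terms that occur in every document (`is`, `the`, `this`) receive the lowest IDF, which is why `first` outweighs them in the first document even though each of these words occurs once.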

## Word embeddings

Famous architectures for word embeddings include:

- [Word2vec](https://en.wikipedia.org/wiki/Word2vec) by Google
- [fastText](https://en.wikipedia.org/wiki/FastText) by Facebook
- [GloVe](https://en.wikipedia.org/wiki/GloVe) (Global Vectors) by the Stanford NLP team

- The intuition: Distributional semantics

![](https://slideplayer.com/slide/12147948/71/images/10/John+Rupert+Firth+You+shall+know+a+word+by+the+company+it+keeps.jpg)

- The training logic
![](https://jaxenter.com/wp-content/uploads/2018/08/image-2.png)

- Impact of word embeddings

![](https://miro.medium.com/max/1400/1*sAJdxEsDjsPMioHyzlN3_A.png)

You can play with Word2vec embeddings for 10K words in English [here](https://projector.tensorflow.org/).
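
The distributional intuition can be illustrated without any neural network. The following sketch is not Word2vec: it simply represents each word by counts of the words that share a sentence with it, then compares words with cosine similarity. The sentences are hypothetical and invented for this illustration, as are the names `cooccurrence` and `cosine`.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy sentences, invented for illustration only.
sentences = [
    "the cat drinks milk",
    "the dog drinks water",
    "the cat chases the mouse",
    "the dog chases the ball",
    "the car needs fuel",
    "the truck needs fuel",
]

# Represent each word by counts of the other words in the same sentence.
cooccurrence = defaultdict(Counter)
for sentence in sentences:
    tokens = sentence.split()
    for i, word in enumerate(tokens):
        for j, neighbor in enumerate(tokens):
            if i != j:
                cooccurrence[word][neighbor] += 1

def cosine(word_a, word_b):
    """Cosine similarity between the co-occurrence vectors of two words."""
    vec_a, vec_b = cooccurrence[word_a], cooccurrence[word_b]
    dot = sum(vec_a[k] * vec_b[k] for k in vec_a)
    norm_a = math.sqrt(sum(v * v for v in vec_a.values()))
    norm_b = math.sqrt(sum(v * v for v in vec_b.values()))
    return dot / (norm_a * norm_b)

print("cat vs dog:", round(cosine("cat", "dog"), 3))
print("cat vs car:", round(cosine("cat", "car"), 3))
```

Because `cat` and `dog` keep similar company (`drinks`, `chases`), their vectors end up closer to each other than `cat` and `car`. Word embedding models learn dense vectors that capture the same kind of regularity from much larger corpora.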

# Understanding Document Clustering


Clustering is one of the most important unsupervised machine learning techniques. Clustering algorithms come in handy especially when labeled data is a luxury. They help us understand the underlying patterns in data (mainly which items are similar to one another) and can also be used to bootstrap certain supervised learning approaches.

Clustering techniques have been studied in depth over the years, and some very powerful clustering algorithms are available. For this tutorial, we will be working with a movie dataset containing each movie's plot, cast, genres, and other related information. We will be working with __K-Means__, one of the most famous algorithms for clustering.

![](https://bookdown.org/tpinto_home/Unsupervised-learning/kmeans.png)

# Install Dependencies

In [None]:
!pip install -U pip setuptools wheel
!pip install -U spacy
!pip install -U gensim
!python -m spacy download en_core_web_sm

# Import Dependencies

In [47]:
# standard Python
from collections import Counter
# NLP
import spacy
# tabularization of data
import pandas as pd
# traditional machine learning
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
# fast operations on numbers
import numpy as np
# word embedding
from gensim.models import FastText
# interactive plotting
import plotly.express as px

In [48]:
pd.options.plotting.backend = "plotly"

# Clustering with TF-IDF

## Cluster similar movies

Here you will learn how to cluster text documents (in this case movies). We will use the following pipeline:
- Text pre-processing
- Feature Engineering
- Clustering Using K-Means
- Prepare Movie Clusters
- Visualize Clusters

Clustering is an unsupervised approach to finding groups of similar items in a given dataset. There are many clustering algorithms, and __K-Means__ is a simple yet effective one.

In [49]:
raw_df = pd.read_csv('https://github.com/dipanjanS/nlp_workshop_dhs18/raw/master/Unit%2010%20-%20Project%208%20-%20Movie%20Recommendations%20with%20Document%20Similarity/tmdb_5000_movies.csv.gz', compression='gzip')
raw_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4803 entries, 0 to 4802
Data columns (total 20 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   budget                4803 non-null   int64  
 1   genres                4803 non-null   object 
 2   homepage              1712 non-null   object 
 3   id                    4803 non-null   int64  
 4   keywords              4803 non-null   object 
 5   original_language     4803 non-null   object 
 6   original_title        4803 non-null   object 
 7   overview              4800 non-null   object 
 8   popularity            4803 non-null   float64
 9   production_companies  4803 non-null   object 
 10  production_countries  4803 non-null   object 
 11  release_date          4802 non-null   object 
 12  revenue               4803 non-null   int64  
 13  runtime               4801 non-null   float64
 14  spoken_languages      4803 non-null   object 
 15  status               

In [50]:
cols = ['title', 'tagline', 'overview', 'popularity']
df = raw_df[cols].copy()
# assign the result back instead of using inplace=True on a column selection
df['tagline'] = df['tagline'].fillna('')
df['text'] = df['tagline'] + ' ' + df['overview']
df.dropna(inplace=True)
df.sort_values(by='popularity', inplace=True, ascending=False)
df.set_index('title', inplace=True)
df

title,tagline,overview,popularity,text
Minions,"Before Gru, they had a history of bad bosses","Minions Stuart, Kevin and Bob are recruited by...",875.581305,"Before Gru, they had a history of bad bosses M..."
Interstellar,Mankind was born on Earth. It was never meant ...,Interstellar chronicles the adventures of a gr...,724.247784,Mankind was born on Earth. It was never meant ...
Deadpool,Witness the beginning of a happy ending,Deadpool tells the origin story of former Spec...,514.569956,Witness the beginning of a happy ending Deadpo...
Guardians of the Galaxy,All heroes start somewhere.,"Light years from Earth, 26 years after being a...",481.098624,All heroes start somewhere. Light years from E...
Mad Max: Fury Road,What a Lovely Day.,An apocalyptic story set in the furthest reach...,434.278564,What a Lovely Day. An apocalyptic story set in...
...,...,...,...,...
Midnight Cabaret,The hot spot where Satan's waitin'.,A Broadway producer puts on a play with a Devi...,0.001389,The hot spot where Satan's waitin'. A Broadway...
Hum To Mohabbat Karega,,"Raju, a waiter, is in love with the famous TV ...",0.001186,"Raju, a waiter, is in love with the famous TV..."
Penitentiary,"There's only one way out, and 100 fools stand ...",A hitchhiker named Martel Gordone gets in a fi...,0.001117,"There's only one way out, and 100 fools stand ..."
Alien Zone,Don't you dare go in there!,A man who is having an affair with a married w...,0.000372,Don't you dare go in there! A man who is havin...


## Text pre-processing

We will do some basic text pre-processing on our movie descriptions before we build our features.

In [51]:
nlp = spacy.load("en_core_web_sm")

In [52]:
def preprocess_texts(texts):
    # nlp.pipe() is more efficient than nlp()
    clean_texts = []
    # NOTE: The lemmatizer requires the `tagger` component, so don't disable it!
    for doc in nlp.pipe(texts, disable=["ner", "parser"]):
        tokens = [tok.lemma_.lower() for tok in doc if (
                    not tok.is_stop 
                    and not tok.is_punct
                    and not tok.is_currency
                    and not tok.is_space
                    and not tok.like_num
                    and not tok.like_url
                    and not tok.like_email
                )
        ]
        clean_text = " ".join(tokens)
        clean_texts.append(clean_text)
    return clean_texts

In [53]:
interstellar = df.at['Interstellar', 'text']
interstellar

'Mankind was born on Earth. It was never meant to die here. Interstellar chronicles the adventures of a group of explorers who make use of a newly discovered wormhole to surpass the limitations on human space travel and conquer the vast distances involved in an interstellar voyage.'

In [54]:
preprocess_texts([interstellar])

['mankind bear earth mean die interstellar chronicle adventure group explorer use newly discover wormhole surpass limitation human space travel conquer vast distance involve interstellar voyage']

In [55]:
%time clean_texts = preprocess_texts(df['text'])

CPU times: user 33 s, sys: 437 ms, total: 33.5 s
Wall time: 41.6 s


In [56]:
df['clean_text'] = clean_texts
df[['text', 'clean_text']]

title,text,clean_text
Minions,"Before Gru, they had a history of bad bosses M...",gru history bad boss minions stuart kevin bob ...
Interstellar,Mankind was born on Earth. It was never meant ...,mankind bear earth mean die interstellar chron...
Deadpool,Witness the beginning of a happy ending Deadpo...,witness beginning happy end deadpool tell orig...
Guardians of the Galaxy,All heroes start somewhere. Light years from E...,hero start light year earth year abduct peter ...
Mad Max: Fury Road,What a Lovely Day. An apocalyptic story set in...,lovely day apocalyptic story set furth reach p...
...,...,...
Midnight Cabaret,The hot spot where Satan's waitin'. A Broadway...,hot spot satan waitin broadway producer put pl...
Hum To Mohabbat Karega,"Raju, a waiter, is in love with the famous TV...",raju waiter love famous tv reporter greeta kap...
Penitentiary,"There's only one way out, and 100 fools stand ...",way fool stand way hitchhiker name martel gord...
Alien Zone,Don't you dare go in there! A man who is havin...,dare man have affair married woman drop wrong ...


## Extract TF-IDF features

`TfidfVectorizer` (in the `sklearn` library) has many parameters. Some important ones are listed below. The explanations are taken straight from [the official documentation](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html), the best place to learn more about it.

- `strip_accents`:{‘ascii’, ‘unicode’}, default=None
Remove accents and perform other character normalization during the preprocessing step. ‘ascii’ is a fast method that only works on characters that have a direct ASCII mapping. ‘unicode’ is a slightly slower method that works on any characters. None (default) does nothing. Both ‘ascii’ and ‘unicode’ use NFKD normalization from unicodedata.normalize.

- `ngram_range`: tuple (min_n, max_n), default=(1, 1)
The lower and upper boundary of the range of n-values for different n-grams to be extracted. All values of n such that min_n <= n <= max_n will be used. For example an ngram_range of (1, 1) means only unigrams, (1, 2) means unigrams and bigrams, and (2, 2) means only bigrams. Only applies if analyzer is not callable.

- `max_df`: float or int, default=1.0
When building the vocabulary ignore terms that have a document frequency strictly higher than the given threshold (corpus-specific stop words). If float in range [0.0, 1.0], the parameter represents a proportion of documents, integer absolute counts. This parameter is ignored if vocabulary is not None.

- `min_df`: float or int, default=1
When building the vocabulary ignore terms that have a document frequency strictly lower than the given threshold. This value is also called cut-off in the literature. If float in range of [0.0, 1.0], the parameter represents a proportion of documents, integer absolute counts. This parameter is ignored if vocabulary is not None.

- `max_features`: int, default=None
If not None, build a vocabulary that only considers the top max_features ordered by term frequency across the corpus.
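
To see what `min_df` and `max_df` do, the sketch below mimics the pruning logic on the small four-sentence corpus used earlier. It is a simplified illustration, not scikit-learn's implementation: it only handles float thresholds (proportions of documents), and the function name `prune_vocabulary` is hypothetical.

```python
from collections import Counter

# Tokenized toy documents (the four-sentence corpus, lowercased and split).
docs = [
    ['this', 'is', 'the', 'first', 'document'],
    ['this', 'document', 'is', 'the', 'second', 'document'],
    ['and', 'this', 'is', 'the', 'third', 'one'],
    ['is', 'this', 'the', 'first', 'document'],
]

def prune_vocabulary(docs, min_df=0.0, max_df=1.0):
    """Keep terms whose document-frequency proportion lies in [min_df, max_df]."""
    n_docs = len(docs)
    # set(doc) ensures that a term is counted at most once per document.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    return sorted(
        term for term, df in doc_freq.items()
        if min_df <= df / n_docs <= max_df
    )

print(prune_vocabulary(docs))                           # no pruning
print(prune_vocabulary(docs, min_df=0.5, max_df=0.95))  # rare and ubiquitous terms dropped
```

With `min_df=0.5` and `max_df=0.95`, words that occur in only one of the four documents (`and`, `one`, `second`, `third`) and words that occur in all of them (`is`, `the`, `this`) are removed, leaving `document` and `first`.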

In [57]:
tfidf_vec = TfidfVectorizer(
    strip_accents='ascii',
    max_df=0.95,
    min_df=0.01, 
    )

In [58]:
corpus = df['clean_text']
tfidf_matrix = tfidf_vec.fit_transform(corpus)
tfidf_matrix.shape

(4800, 491)

In [59]:
type(tfidf_matrix)

scipy.sparse.csr.csr_matrix

## Cluster documents using K-Means

K-Means is one of the most popular "clustering" algorithms. K-means stores $k$ centroids that it uses to define clusters. A point is considered to be in a particular cluster if it is closer to that cluster's centroid than any other centroid.

K-Means finds the best centroids by alternating between

1. assigning data points to clusters based on the current centroids
2. choosing centroids (points which are the center of a cluster) based on the current assignment of data points to clusters

until convergence, i.e. until the assignments stop changing.
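
The two alternating steps can be sketched in a few lines of plain Python. This is a minimal illustration under simplifying assumptions: the 2-D points are hypothetical, the first two points serve as the initial centroids, and the function name `kmeans` is not scikit-learn's. scikit-learn's `KMeans` follows the same idea but adds a smarter initialization (k-means++ by default) and several restarts.

```python
import math

def kmeans(points, centroids, max_iter=100):
    """Alternate between assignment and centroid update until nothing changes."""
    for _ in range(max_iter):
        # Step 1: assign each point to its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Step 2: move each centroid to the mean of the points assigned to it.
        new_centroids = []
        for c in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                new_centroids.append(tuple(sum(dim) / len(members) for dim in zip(*members)))
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster's centroid
        if new_centroids == centroids:  # converged
            break
        centroids = new_centroids
    return labels, centroids

# Hypothetical 2-D points forming two obvious groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
labels, centroids = kmeans(points, centroids=[points[0], points[1]])
print(labels)
print(centroids)
```

Even though both initial centroids start inside the same group, the second one is pulled toward the far group after the first update, and the assignments settle within a few iterations.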

![](https://i.imgur.com/42n9uvR.png)

The features we use here are bag-of-words (BOW) TF-IDF scores.

`KMeans` (in the `sklearn` library) has many parameters. Some important ones are listed below. The explanations are taken straight from [the official documentation](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html), the best place to learn more about it.

- `n_clusters`: int, default=8
The number of clusters to form as well as the number of centroids to generate.
- `max_iter`: int, default=300
Maximum number of iterations of the k-means algorithm for a single run.
- `random_state`: int, RandomState instance or None, default=None
Determines random number generation for centroid initialization. Use an int to make the randomness deterministic. 

In [60]:
NUM_CLUSTERS = 6
km = KMeans(
    n_clusters=NUM_CLUSTERS, 
    max_iter=10000, 
    random_state=123,
    ).fit(tfidf_matrix)

In [61]:
km

KMeans(max_iter=10000, n_clusters=6, random_state=123)

In [62]:
Counter(km.labels_)

Counter({0: 2261, 1: 421, 2: 400, 3: 617, 4: 455, 5: 646})

In [63]:
df['kmeans_cluster'] = km.labels_
cols_to_show = ['kmeans_cluster', 'popularity']
movie_clusters = df[cols_to_show] \
                .sort_values(
                    by=['kmeans_cluster', 'popularity'], 
                    ascending=False
                    ) \
                .groupby('kmeans_cluster').head(20)
movie_clusters

title,kmeans_cluster,popularity
Inception,5,167.583710
The Shawshank Redemption,5,136.747729
Harry Potter and the Chamber of Secrets,5,132.397737
Inside Out,5,128.655964
Twilight,5,127.084938
...,...,...
The Hunger Games: Mockingjay - Part 2,0,127.284427
Star Wars,0,126.393695
Brave,0,125.114374
The Lord of the Rings: The Return of the King,0,123.630332


In [64]:
km.cluster_centers_

array([[0.0042941 , 0.00442863, 0.0039245 , ..., 0.01940531, 0.01142658,
        0.02455702],
       [0.00175059, 0.0011706 , 0.0028379 , ..., 0.02826776, 0.008413  ,
        0.02823072],
       [0.00489074, 0.00405404, 0.00356831, ..., 0.03193439, 0.01135203,
        0.03026509],
       [0.00142843, 0.00264851, 0.00180987, ..., 0.02765681, 0.00785481,
        0.01965747],
       [0.00294351, 0.00387855, 0.00647657, ..., 0.0326279 , 0.01226612,
        0.02601481],
       [0.00971298, 0.00154296, 0.00242636, ..., 0.02738348, 0.01131609,
        0.02980695]])

In [65]:
feature_names = tfidf_vec.get_feature_names_out()
topn_features = 30
ordered_centroids = km.cluster_centers_.argsort()[:, ::-1]

In [66]:
# get key features for each cluster
# get movies belonging to each cluster
for cluster_num in range(NUM_CLUSTERS):
    key_features = (
        [
         feature_names[index] for index 
         in ordered_centroids[cluster_num, :topn_features]
        ]
    )
    cluster_filter = (movie_clusters['kmeans_cluster'] == cluster_num)
    movies = movie_clusters[cluster_filter].index.tolist()
    print('CLUSTER #' + str(cluster_num+1))
    print('Key Features:', key_features)
    print('Popular Movies:', movies)
    print('-'*80)

CLUSTER #1
Key Features: ['man', 'find', 'new', 'young', 'woman', 'time', 'love', 'year', 'film', 'come', 'team', 'go', 'try', 'murder', 'way', 'take', 'town', 'work', 'force', 'city', 'group', 'discover', 'kill', 'help', 'get', 'turn', 'want', 'day', 'father', 'know']
Popular Movies: ['Pirates of the Caribbean: The Curse of the Black Pearl', 'Dawn of the Planet of the Apes', 'The Hunger Games: Mockingjay - Part 1', 'Whiplash', 'The Dark Knight', 'Fight Club', "Pirates of the Caribbean: Dead Man's Chest", 'Teenage Mutant Ninja Turtles', 'Gone Girl', 'Pixels', 'Rise of the Planet of the Apes', 'The Lord of the Rings: The Fellowship of the Ring', 'Despicable Me 2', 'Pirates of the Caribbean: On Stranger Tides', "One Flew Over the Cuckoo's Nest", 'The Hunger Games: Mockingjay - Part 2', 'Star Wars', 'Brave', 'The Lord of the Rings: The Return of the King', 'Iron Man']
--------------------------------------------------------------------------------
CLUSTER #2
Key Features: ['story', 'true'

## Visualize clusters

- **Principal Component Analysis** (PCA) is a common technique for reducing dimensions.

![](https://miro.medium.com/max/1400/1*mgncZaKaVx9U6OCQu_m8Bg.jpeg)
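
As a rough intuition for what dimensionality reduction does, the sketch below reduces hypothetical 2-D points to 1-D. It finds the direction of largest variance with power iteration on the covariance matrix and projects the points onto it. This is only an illustration: scikit-learn's `PCA` relies on a singular value decomposition instead, and all names here are made up for the example.

```python
import math

# Hypothetical 2-D points that lie close to the line y = 2x.
points = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8), (5.0, 10.0)]
n = len(points)

# Center the data so that each dimension has zero mean.
mean_x = sum(x for x, _ in points) / n
mean_y = sum(y for _, y in points) / n
centered = [(x - mean_x, y - mean_y) for x, y in points]

# Entries of the 2x2 covariance matrix.
cov_xx = sum(x * x for x, _ in centered) / (n - 1)
cov_xy = sum(x * y for x, y in centered) / (n - 1)
cov_yy = sum(y * y for _, y in centered) / (n - 1)

# Power iteration: repeatedly applying the covariance matrix to a vector
# pulls it toward the direction of largest variance (the first principal component).
direction = (1.0, 0.0)
for _ in range(100):
    vx = cov_xx * direction[0] + cov_xy * direction[1]
    vy = cov_xy * direction[0] + cov_yy * direction[1]
    length = math.hypot(vx, vy)
    direction = (vx / length, vy / length)

# Project every centered point onto that direction: 2-D points become 1-D coordinates.
projected = [x * direction[0] + y * direction[1] for x, y in centered]

# Share of the total variance that survives the projection.
explained = sum(p * p for p in projected) / sum(x * x + y * y for x, y in centered)

print("first component:", direction)
print("1-D coordinates:", projected)
print("share of variance kept:", explained)
```

Because the points are almost collinear, a single coordinate keeps nearly all of the variance. With hundreds of TF-IDF dimensions, the first two or three components typically keep far less, so the plots should be read as a rough view of the data rather than a faithful one.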

In [67]:
# initialize PCA with 2 components
pca = PCA(n_components=2, random_state=123)
# pass tfidf_matrix to the pca and store the reduced vectors as pca_vecs
pca_vecs = pca.fit_transform(tfidf_matrix.toarray())
# save our two dimensions into pca_0 and pca_1 in df
df['pca_0'] = pca_vecs[:, 0]
df['pca_1'] = pca_vecs[:, 1]

In [68]:
fig = px.scatter(df, x="pca_0", y="pca_1", 
                 color="kmeans_cluster",
                 size='popularity', 
                 hover_name=df.index,)
fig.show()

# Clustering with Embeddings

Here we use FastText embeddings as features and use K-means for clustering.

## Train an embedding model

In [69]:
!pip list | grep gensim

gensim                        4.1.2


`FastText` (in the `gensim` library) has many parameters. Some important ones are listed below. The explanations are taken straight from [the official documentation](https://radimrehurek.com/gensim/models/fasttext.html), the best place to learn more about it. 

- `sentences` (iterable of list of str, optional) – Can be simply a list of lists of tokens, but for larger corpora, consider an iterable that streams the sentences directly from disk/network. See BrownCorpus, LineSentence in word2vec module for such examples. If you don’t supply sentences, the model is left uninitialized – use if you plan to initialize it in some other way.
- `min_count` (int, optional) – The model ignores all words with total frequency lower than this.
- `vector_size` (int, optional) – Dimensionality of the word vectors.
- `window` (int, optional) – The maximum distance between the current and predicted word within a sentence.
- `workers` (int, optional) – Use these many worker threads to train the model (=faster training with multicore machines)
- `sg` ({1, 0}, optional) – Training algorithm: skip-gram if sg=1, otherwise CBOW.
- `seed` (int, optional) – Seed for the random number generator. 
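
The `window` and `sg` parameters are easier to picture with a concrete example. The sketch below lists, for a short token sequence taken from the preprocessed Interstellar text, the training examples each setting would look at: CBOW predicts the center word from its surrounding context, while skip-gram predicts each context word from the center word. It is a simplification for illustration only; the function name `context_windows` is hypothetical, and real implementations add further details that are omitted here.

```python
def context_windows(tokens, window):
    """Yield (center, context) pairs, with up to `window` tokens on each side."""
    for i, center in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        yield center, left + right

tokens = ["mankind", "bear", "earth", "mean", "die"]

# CBOW (sg=0): the whole context is used to predict the center word.
cbow_examples = [(context, center) for center, context in context_windows(tokens, 2)]

# Skip-gram (sg=1): the center word is used to predict each context word in turn.
skipgram_pairs = [
    (center, ctx)
    for center, context in context_windows(tokens, 2)
    for ctx in context
]

print(cbow_examples[2])
print(skipgram_pairs[:4])
```

A larger `window` gives every word more context words, so the learned vectors tend to reflect broader topical similarity.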

In [70]:
tokenized_docs = [doc.split() for doc in corpus]
ft_model = FastText(
    tokenized_docs, 
    vector_size=300, 
    min_count=2, 
    window=10, 
    workers=4, 
    sg=0, # 1 for skip-gram and 0 for CBOW 
    seed=123,
    )

![](https://kavita-ganesan.com/wp-content/uploads/skipgram-vs-cbow-continuous-bag-of-words-word2vec-word-representation.png)

## Vectorize a document with an embedding model

In [71]:
def averaged_word2vec_vectorizer(corpus, model, num_features):
    vocabulary = set(model.wv.index_to_key)
    
    def average_word_vectors(words, model, vocabulary, num_features):
        feature_vector = np.zeros((num_features,), dtype="float64")
        nwords = 0.
        
        for word in words:
            if word in vocabulary: 
                nwords = nwords + 1.
                feature_vector = np.add(feature_vector, model.wv[word])
        if nwords:
            feature_vector = np.divide(feature_vector, nwords)

        return feature_vector

    features = [average_word_vectors(tokenized_sentence, model, vocabulary, num_features)
                    for tokenized_sentence in corpus]
    return np.array(features)

In [72]:
doc_vecs_ft = averaged_word2vec_vectorizer(tokenized_docs, ft_model, 300)
doc_vecs_ft.shape

(4800, 300)
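
The averaging step is easier to follow on a tiny example. The sketch below uses hypothetical 3-dimensional word vectors instead of a trained model and applies the same rule as the function above: tokens missing from the vocabulary are skipped, and a document with no known tokens gets an all-zero vector. The names `toy_vectors` and `average_vector` are made up for the illustration.

```python
# Hypothetical 3-dimensional word vectors, invented for illustration only.
toy_vectors = {
    "space":  [1.0, 0.0, 2.0],
    "travel": [3.0, 2.0, 0.0],
    "voyage": [2.0, 4.0, 4.0],
}

def average_vector(tokens, vectors, size):
    """Average the vectors of known tokens; unknown tokens are ignored."""
    known = [vectors[tok] for tok in tokens if tok in vectors]
    if not known:
        return [0.0] * size
    # zip(*known) groups the values of each dimension together.
    return [sum(dim) / len(known) for dim in zip(*known)]

print(average_vector(["space", "travel", "wormhole"], toy_vectors, 3))
print(average_vector(["wormhole"], toy_vectors, 3))
```

Every document ends up with a vector of the same length regardless of how many words it contains, which is what allows K-Means to compare documents directly.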

In [73]:
NUM_CLUSTERS = 6
km_fasttext = KMeans(
    n_clusters=NUM_CLUSTERS, 
    max_iter=10000, 
    n_init=100, 
    random_state=123,
    ).fit(doc_vecs_ft)

In [74]:
Counter(km_fasttext.labels_)

Counter({0: 423, 1: 1508, 2: 1304, 3: 449, 4: 1115, 5: 1})

In [75]:
df['kmeans_cluster_ft'] = km_fasttext.labels_
cols_to_show = ['kmeans_cluster_ft', 'popularity']
movie_clusters_ft = df[cols_to_show] \
                    .sort_values(
                    by=['kmeans_cluster_ft', 'popularity'], 
                    ascending=False
                    ) \
                    .groupby('kmeans_cluster_ft').head(20)
movie_clusters_ft

title,kmeans_cluster_ft,popularity
The Helix... Loaded,5,0.020600
Mad Max: Fury Road,4,434.278564
Captain America: Civil War,4,198.372395
Frozen,4,165.125366
Batman v Superman: Dawn of Justice,4,155.790452
...,...,...
Seventh Son,0,63.628459
Princess Mononoke,0,60.732738
In Time,0,60.231382
Dallas Buyers Club,0,59.454473


In [76]:
# embedding dimensions are not interpretable terms, so unlike the TF-IDF case
# we only list the most popular movies belonging to each cluster
for cluster_num in range(NUM_CLUSTERS):
    cluster_filter = (movie_clusters_ft['kmeans_cluster_ft'] == cluster_num)
    movies = movie_clusters_ft[cluster_filter].index.tolist()
    print('CLUSTER #' + str(cluster_num+1))
    print('Popular Movies:', movies)
    print('-'*80)

CLUSTER #1
Popular Movies: ["Pirates of the Caribbean: At World's End", 'Pirates of the Caribbean: On Stranger Tides', 'Chappie', 'Night at the Museum: Secret of the Tomb', 'Man of Steel', 'Prisoners', 'The Good, the Bad and the Ugly', 'Harry Potter and the Prisoner of Azkaban', 'About Time', 'The Sixth Sense', 'Ted', 'Ted 2', 'Dumb and Dumber To', 'Room', 'Self/less', 'Seventh Son', 'Princess Mononoke', 'In Time', 'Dallas Buyers Club', 'The Godfather: Part III']
--------------------------------------------------------------------------------
CLUSTER #2
Popular Movies: ['Interstellar', 'Guardians of the Galaxy', 'Pirates of the Caribbean: The Curse of the Black Pearl', 'Terminator Genisys', 'The Martian', 'Avatar', 'Teenage Mutant Ninja Turtles', 'Gone Girl', 'X-Men: Apocalypse', 'The Shawshank Redemption', 'The Maze Runner', 'Tomorrowland', 'Inside Out', "One Flew Over the Cuckoo's Nest", 'Pulp Fiction', 'Iron Man', 'Ant-Man', 'Lucy', 'Batman Begins', 'The Dark Knight Rises']
--------

## Visualize clusters

In [81]:
# initialize PCA with 3 components
pca = PCA(n_components=3, random_state=123)
# pass fasttext vectors to the pca and store the reduced vectors as pca_vecs
pca_vecs = pca.fit_transform(doc_vecs_ft)
# save our three dimensions into pca_0_ft, pca_1_ft, and pca_2_ft in df
df['pca_0_ft'] = pca_vecs[:, 0]
df['pca_1_ft'] = pca_vecs[:, 1]
df['pca_2_ft'] = pca_vecs[:, 2]

In [82]:
fig = px.scatter_3d(df, x="pca_0_ft", y="pca_1_ft", z="pca_2_ft", 
                 color="kmeans_cluster_ft",
                 size='popularity', 
                 hover_name=df.index,)
fig.show()