<table align="center">
   <td align="center"><a target="_blank" href="https://colab.research.google.com/github/ds5110/summer-2021/blob/master/11d-lda-imdb.ipynb">
<img src="https://github.com/ds5110/summer-2021/raw/master/colab.png"  style="padding-bottom:5px;" />Run in Google Colab</a></td>
</table>


# 11d -- Latent Dirichlet Allocation (LDA) of the IMDb dataset

Unsupervised LDA for topic modeling of the IMDb dataset

Reference: Raschka's [ch08.ipynb](https://github.com/rasbt/python-machine-learning-book-3rd-edition/blob/master/ch08/ch08.ipynb) -- github


* Latent Dirichlet Allocation (LDA) is an unsupervised generative statistical modeling technique
* With LDA, documents are organized into groups
  * Each document is a mixture of topics, and each word's presence is attributable to one of the document's topics
  * Topics are defined by groups of words that appear together frequently across documents
  * Given a bag-of-words as input, LDA decomposes the data into two matrices:
    * document-to-topic matrix
    * word-to-topic matrix
  * LDA is a generalization of probabilistic Latent Semantic Analysis (pLSA)
    * LDA can be used to create synthetic documents
    * the synthetic documents share the same statistical properties of the training data
    * that's what is meant by generative
  * pLSA builds upon LSA by using a probabilistic multinomial distribution to model document-word co-occurrence
    * pLSA is not generative in the same sense as LDA
  * LSA is closely related to PCA
    * LDA is related to PCA, in that the matrix decomposition of the data optimizes a measure of error in the data representation.
* The number of topics is a hyperparameter of the analysis, akin to the number of components chosen for dimensionality reduction with PCA.

In [None]:
# Read movie reviews from CSV in Raschka's github repo
# This cell replaces cells 2, 3 & 4
import os
import sys
import time
import pandas as pd
import urllib.request

def reporthook(count, block_size, total_size):
    global start_time
    if count == 0:
        start_time = time.time()
        return
    duration = time.time() - start_time
    progress_size = int(count * block_size)
    speed = progress_size / (1024.**2 * duration)
    percent = count * block_size * 100. / total_size

    sys.stdout.write("\r%d%% | %d MB | %.2f MB/s | %d sec elapsed" %
                    (percent, progress_size / (1024.**2), speed, duration))
    sys.stdout.flush()

target = "movie_data.csv.gz"
source = "https://github.com/rasbt/python-machine-learning-book-3rd-edition/raw/master/ch08/" + target
if not os.path.isfile(target):
    urllib.request.urlretrieve(source, target, reporthook)

df = pd.read_csv(target, compression='gzip')

assert df.shape == (50000, 2)
df.head(3)

In [None]:
# Cell 46
from sklearn.feature_extraction.text import CountVectorizer

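# Drop English stop words; ignore words that occur in more than 10% of the
# reviews (max_df=.1); keep only the 5000 most frequent remaining words
count = CountVectorizer(stop_words='english',
                        max_df=.1,
                        max_features=5000)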
X = count.fit_transform(df['review'].values)

### Expectation-maximization algorithm

* Goal: Estimate parameters of a statistical model so you can use it to make predictions.
    * With a generative model, you can estimate the probability of the data given a set of parameters.
* Expectation -- make predictions (e.g., classify data) based on statistical model (and its presumed parameters)
* Maximization -- update the unknown parameters by optimizing some "fitness" function (e.g., prediction errors)
* VanderPlas demonstrates E-M with K-means [05.11-k-means](https://jakevdp.github.io/PythonDataScienceHandbook/05.11-k-means.html)
    * Initialize: randomly choose K cluster centers
        * i.e., initialize parameters of statistical model
    * Step 1: Expectation -- make predictions based on statistical model
        * i.e., classify data by assigning samples to clusters based on distance from the K centers
    * Step 2: Maximization -- update the parameters based on the data and some "fitness" criterion
        * i.e., recompute the cluster centers from the data and the predicted labels from Step 1
    * Repeat steps 1 & 2 until done
        * i.e., go back to Step 1 and repeat until parameters stop changing
* Visualization using Old Faithful geyser data -- wait-time (delay) between eruptions vs eruption duration
    * Ref: [Expectation-maximization](https://en.wikipedia.org/wiki/Expectation%E2%80%93maximization_algorithm) -- wikipedia

<img src="https://upload.wikimedia.org/wikipedia/commons/6/69/EM_Clustering_of_Old_Faithful_data.gif" width="400"/>
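
The K-means steps listed above can be sketched in a few lines of NumPy. This is a minimal illustration, not VanderPlas's code: the function name `kmeans_em`, its parameters, and the two synthetic blobs in `X_demo` are assumptions made for this example.

```python
import numpy as np

def kmeans_em(X, k, n_iter=100, seed=0):
    """Minimal K-means written as alternating E and M steps (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    # Initialize: pick k distinct samples as the starting cluster centers
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 1 (Expectation): assign each sample to its nearest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 2 (Maximization): recompute each center as the mean of its samples
        # (an empty cluster keeps its previous center)
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        # Stop when the parameters (the centers) stop changing
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels

# Hypothetical synthetic data: two well-separated 2-D blobs
rng = np.random.default_rng(42)
X_demo = np.vstack([
    rng.normal(loc=[0, 0], scale=0.5, size=(50, 2)),
    rng.normal(loc=[5, 5], scale=0.5, size=(50, 2)),
])
centers, labels = kmeans_em(X_demo, k=2)
print(np.round(centers, 1))
```

K-means uses hard assignments in the expectation step; the Old Faithful animation instead shows a Gaussian mixture, where each sample gets a probability of belonging to each cluster.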



In [None]:
# Cell 47 (takes ~7 minutes in Colab)
from sklearn.decomposition import LatentDirichletAllocation

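# n_components is the number of topics; 'batch' uses all reviews in each
# update, which is slower than 'online' (mini-batch) learning
lda = LatentDirichletAllocation(n_components=10,
                                random_state=123,
                                learning_method='batch')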
X_topics = lda.fit_transform(X)

In [None]:
# Cell 48
# (n_topics, n_words): one row of word weights per topic
lda.components_.shape

In [None]:
# Cell 49
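n_top_words = 5
# Vocabulary terms, indexed by column of X (requires scikit-learn >= 1.0)
feature_names = count.get_feature_names_out()

for topic_idx, topic in enumerate(lda.components_):
    print("Topic %d:" % (topic_idx + 1))
    # argsort is ascending, so take the last n_top_words indices in reverse order
    top_word_idx = topic.argsort()[:-n_top_words - 1:-1]
    print(" ".join(feature_names[i] for i in top_word_idx))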

Based on the top-ranked words for each topic, you might guess that LDA identified the following topics:
1. Generally bad movies (not really a topic category)
2. Movies about families
3. War movies
4. Art movies
5. Crime movies
6. Horror movies
7. Comedy movies
8. Movies somehow related to TV shows
9. Movies based on books
10. Action movies

In [None]:
# Cell 50
# Topic 6 (horror) is column index 5; sort reviews by descending topic weight
horror = X_topics[:, 5].argsort()[::-1]

for iter_idx, movie_idx in enumerate(horror[:3]):
    print('\nHorror movie #%d:' % (iter_idx + 1))
    print(df['review'][movie_idx][:300], '...')