# Topic Modeling using Latent Dirichlet Allocation (Clustering)

# Introduction
This notebook uses the [Stanford IMDb Review dataset](http://ai.stanford.edu/~amaas/data/sentiment "Stanford IMDb Large Movie Review Dataset").
Download it, extract it locally, and set the `base_path` variable below to the file system path of the dataset.

This notebook is about topic modeling using a technique called Latent Dirichlet Allocation (LDA).

# Dataset Loading

In [2]:
# Set the base of the data path where folders test/neg, train/pos, etc, live.
base_path = "../../data/aclImdb" # Change this here to the right path.

# The folders where to look for the reviews.
data_sets = ['test', 'train']
sa_dir_names = ['neg', 'pos']

# List the content of the data path for the sake of checking the data set folders.
files = !ls {base_path}
print(files)

['README', 'aclImdb_100000.csv', 'aclImdb_100000_raw.parquet', 'aclImdb_10000_raw.parquet', 'aclImdb_1000_raw.parquet', 'aclImdb_100_raw.parquet', 'aclImdb_20000_raw.parquet', 'aclImdb_2000_raw.parquet', 'aclImdb_200_raw.parquet', 'aclImdb_210_raw.parquet', 'aclImdb_211_raw.parquet', 'aclImdb_250.csv', 'aclImdb_250_raw.parquet', 'aclImdb_251_raw.parquet', 'aclImdb_252_raw.parquet', 'aclImdb_300_raw.parquet', 'aclImdb_301_raw.parquet', 'aclImdb_50000_raw.parquet', 'imdb.vocab', 'imdbEr.txt', 'test', 'train']


# Data Prep

LDA operates on numeric vectors, not on raw text. Each review therefore has to be converted into a feature vector of term weights. LDA uses these vectors to infer topics, and each review can then be grouped under the topic that dominates it.
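
As a minimal sketch of what a feature vector representation means here, the snippet below builds a vocabulary from a tiny, made-up corpus and turns each document into a vector of term counts. The corpus and the helper names (`build_vocabulary`, `to_count_vector`) are illustrative assumptions and not part of this notebook, which relies on Spark's `CountVectorizer` for this step.

```python
from collections import Counter

# Hypothetical toy corpus: each document is already tokenized.
docs = [
    ["great", "movie", "great", "cast"],
    ["bad", "movie"],
]

def build_vocabulary(documents):
    """Return terms ordered by descending corpus frequency (ties alphabetical)."""
    counts = Counter(term for doc in documents for term in doc)
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ordered]

def to_count_vector(doc, vocabulary):
    """Return one count per vocabulary term for a single document."""
    counts = Counter(doc)
    return [counts[term] for term in vocabulary]

vocabulary = build_vocabulary(docs)
vectors = [to_count_vector(doc, vocabulary) for doc in docs]
print(vocabulary)
print(vectors)
```

Each position in a vector corresponds to one vocabulary term, which is the same idea as the sparse vectors Spark produces, only stored densely.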

In [3]:
# Set up Python system path to find our modules.
import os
import sys
module_path = os.path.abspath(os.path.join('../src'))
if module_path not in sys.path:
    sys.path.append(module_path)

# Import our modules.
import file_loader as fl

# Add the file to SparkContext for the executor to find it.
sc.addPyFile('../src/file_loader.py')

In [4]:
# Number of observations.
obs_nb = 1000

# Load the data in a parquet file.
file_parquet, _ = fl.load_data(base_path, obs_nb, spark)

In [5]:
!ls -d {base_path}/*.parquet

../../data/aclImdb/aclImdb_100000_raw.parquet
../../data/aclImdb/aclImdb_10000_raw.parquet
../../data/aclImdb/aclImdb_1000_raw.parquet
../../data/aclImdb/aclImdb_100_raw.parquet
../../data/aclImdb/aclImdb_20000_raw.parquet
../../data/aclImdb/aclImdb_2000_raw.parquet
../../data/aclImdb/aclImdb_200_raw.parquet
../../data/aclImdb/aclImdb_210_raw.parquet
../../data/aclImdb/aclImdb_211_raw.parquet
../../data/aclImdb/aclImdb_250_raw.parquet
../../data/aclImdb/aclImdb_251_raw.parquet
../../data/aclImdb/aclImdb_252_raw.parquet
../../data/aclImdb/aclImdb_300_raw.parquet
../../data/aclImdb/aclImdb_301_raw.parquet
../../data/aclImdb/aclImdb_50000_raw.parquet


In [6]:
file_parquet

'../../data/aclImdb/aclImdb_1000_raw.parquet'

In [7]:
# Read the parquet file into a data frame.
df_pqt = spark.read.parquet(file_parquet)

# Showing some observations (entries).
df_pqt.persist()
df_pqt.show()

+-----------+-----------+----------------+--------+--------------+------------+--------------------+
|datasettype|   filename| datetimecreated|reviewid|reviewpolarity|reviewrating|                text|
+-----------+-----------+----------------+--------+--------------+------------+--------------------+
|       test| 3515_8.txt|20181026T102204Z|    3515|             1|           8|I didn't have ver...|
|       test|2823_10.txt|20181026T102204Z|    2823|             1|          10|This movie makes ...|
|       test| 4278_9.txt|20181026T102204Z|    4278|             1|           9|I have to admit I...|
|       test|5651_10.txt|20181026T102204Z|    5651|             1|          10|This film is a kn...|
|       test|4366_10.txt|20181026T102204Z|    4366|             1|          10|Yes, this movie w...|
|       test|5100_10.txt|20181026T102204Z|    5100|             1|          10|I first saw this ...|
|       test|12123_7.txt|20181026T102204Z|   12123|             1|           7|I don't know

# Text Cleansing

In [28]:
import nltk
from nltk.corpus import stopwords

# Download the NLTK stop words and load the English list.
nltk.download('stopwords')
stopwords_set = list(set(stopwords.words('english')))

stopwords_set[:10]
# stopwords_bc = spark.sparkContext.broadcast(set(stopwords.words('english')))

[nltk_data] Downloading package stopwords to /Users/hujol/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


list

# Enrich the Stop Words List

In [72]:
stopwords_set += ['-', '&', 'i\'m', '2', 'one', 'two', '.', 'can\'t', 'i\'ve']

In [73]:
from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, StopWordsRemover

# Remove all HTML tags.
html_tags_remover = fl.HTMLTagsRemover(inputCol='text', outputCol='textclean')

# Tokenize and remove stop words.
tokenizer = Tokenizer(inputCol=html_tags_remover.getOutputCol(), outputCol="words_tknz")
remover = StopWordsRemover(inputCol=tokenizer.getOutputCol(), outputCol="words", 
                           stopWords=stopwords_set)

# Create the pipeline.
pipeline_cleaner = Pipeline(stages=[html_tags_remover, tokenizer, remover])

# Fit the pipeline.
model_p = pipeline_cleaner.fit(df_pqt)

# Transform the data frame.
df_cleaned = model_p.transform(df_pqt)

In [74]:
# Check the resulting transformation.
len(df_cleaned.head().words)
a_sample = df_cleaned.take(5)[4]
print(len(a_sample['text']), a_sample['text'][250:600])
print(len(a_sample['textclean']), a_sample['textclean'][250:600])
print(len(a_sample['words']), a_sample['words'][:10])

1647 rs of this film are Laurence Harvey and Julie Harris. Now before this film, I'd only see Miss Harris in East of Eden with James Dean and I own an audio tape of The Glass Menagerie that she did on stage with Monty Clift and Jessica Tandy, so I wasn't sure how she'd be in this role and BOY, did she impress me. How hammy was she? I love ham! ;-) Mr. H
1630 rs of this film are Laurence Harvey and Julie Harris. Now before this film, I'd only see Miss Harris in East of Eden with James Dean and I own an audio tape of The Glass Menagerie that she did on stage with Monty Clift and Jessica Tandy, so I wasn't sure how she'd be in this role and BOY, did she impress me. How hammy was she? I love ham! ;- Mr. Ha
152 ['yes,', 'movie', 'hilarious', 'acting', 'top', 'notch', 'whole', 'cast.', 'except', 'shelley']
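
The tokens above still carry punctuation (`'yes,'`, `'cast.'`), because Spark's `Tokenizer` lowercases the text and splits it on whitespace, keeping whatever is attached to a word. One possible remedy, sketched here in plain Python, is to strip leading and trailing punctuation from each token before counting. The helper `strip_punctuation` is a hypothetical name; this is not what `fl.HTMLTagsRemover` or `Tokenizer` do.

```python
import string

def strip_punctuation(tokens):
    """Strip leading/trailing punctuation from each token and drop empty results."""
    stripped = (token.strip(string.punctuation) for token in tokens)
    return [token for token in stripped if token]

# Sample tokens, partly taken from the output above.
tokens = ['yes,', 'movie', 'hilarious', 'cast.', '-', "can't"]
cleaned_tokens = strip_punctuation(tokens)
print(cleaned_tokens)
```

Inner apostrophes such as the one in `can't` are preserved, while a bare `-` disappears. Inside a Spark pipeline, `RegexTokenizer` from `pyspark.ml.feature` is one candidate for a comparable effect.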


In [75]:
# Split the df into train and test
df_p_training, df_p_test = df_cleaned.randomSplit([0.9, 0.1], seed=12345)

df_p_training.count(), df_p_test.count()

(905, 95)
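
The split yields 905 and 95 rows rather than exactly 900 and 100 because `randomSplit` assigns rows to splits at random according to the normalized weights, so the split sizes are only approximately proportional. The plain-Python sketch below mimics that idea with one uniform draw per row. `random_split` is a hypothetical helper, not Spark's implementation, and it will not reproduce Spark's exact counts.

```python
import random

def random_split(rows, weights, seed):
    """Assign each row to one split using a single uniform draw per row."""
    total = float(sum(weights))
    # Cumulative upper bounds, e.g. [0.9, 1.0] for weights [0.9, 0.1].
    bounds = []
    running = 0.0
    for weight in weights:
        running += weight / total
        bounds.append(running)
    bounds[-1] = 1.0  # Guard against floating-point round-off.

    rng = random.Random(seed)
    splits = [[] for _ in weights]
    for row in rows:
        draw = rng.random()
        for split, bound in zip(splits, bounds):
            if draw < bound:
                split.append(row)
                break
    return splits

train, test = random_split(range(1000), [0.9, 0.1], seed=12345)
print(len(train), len(test))
```

Fixing the seed makes the split reproducible, which is why the notebook passes `seed=12345`.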

# Create a Feature Vector

In [76]:
from pyspark.ml.feature import CountVectorizer, IDF

df_p_training = df_p_training.drop('featurestf')

# Define the count vector so the IDF can compute the features vector.
cv = CountVectorizer(inputCol=remover.getOutputCol(), outputCol="featurestf", vocabSize=30000, minDF=1.0)
idf = IDF(inputCol=cv.getOutputCol(), outputCol="features")

# Create the pipeline.
pipeline = Pipeline(stages=[cv, idf])

# Fit the pipeline.
model_idf = pipeline.fit(df_p_training)

# Transform the data frame that was cleaned by the pipeline_cleaner.
df_idf = model_idf.transform(df_p_training)

In [77]:
# Check the result.
a_sample = df_idf.take(1)[0]
a_sample['features']

SparseVector(28011, {6: 1.1959, 14: 1.4667, 32: 1.8462, 34: 1.8051, 39: 1.7852, 51: 1.9038, 77: 2.3431, 90: 2.3317, 118: 2.4396, 146: 5.3963, 155: 5.3963, 166: 2.5895, 170: 2.6502, 171: 2.9589, 191: 2.8017, 218: 2.9378, 233: 2.9589, 246: 2.9172, 264: 3.0714, 279: 6.8157, 288: 3.0478, 568: 3.631, 573: 3.5132, 588: 3.5132, 610: 3.5902, 635: 3.631, 654: 3.631, 687: 3.7645, 769: 3.8133, 810: 4.101, 893: 4.0365, 929: 3.9187, 1112: 4.101, 1235: 4.4111, 1260: 8.4882, 1433: 4.4111, 1507: 4.5065, 1612: 4.5065, 1759: 4.6118, 1839: 4.5065, 1845: 4.7296, 2663: 4.8631, 2937: 4.8631, 3003: 5.0173, 3074: 5.0173, 3295: 5.0173, 3357: 5.0173, 3402: 5.0173, 3411: 5.0173, 3594: 5.0173, 3676: 5.4227, 4104: 5.1996, 4130: 5.1996, 4445: 5.1996, 5091: 5.4227, 5236: 18.3477, 5424: 5.4227, 5502: 5.4227, 6566: 5.7104, 6949: 12.2318, 8737: 5.7104, 9072: 5.7104, 9318: 5.7104, 9622: 5.7104, 9958: 5.7104, 10086: 6.1159, 10808: 6.1159, 11210: 6.1159, 14057: 6.1159, 15814: 6.1159, 16998: 6.1159, 20354: 6.1159, 20572: 6
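
The weights in this vector are consistent with the smoothed inverse document frequency documented for Spark's `IDF`, `log((m + 1) / (df + 1))`, multiplied by the term count, where `m` is the number of documents and `df` is the number of documents containing the term. A small sketch, assuming the 905 training reviews counted earlier; the function names are illustrative.

```python
import math

def smoothed_idf(num_docs, doc_freq):
    """Smoothed IDF: log((m + 1) / (df + 1))."""
    return math.log((num_docs + 1) / (doc_freq + 1))

def tf_idf(term_count, num_docs, doc_freq):
    """TF-IDF weight: raw term count times the smoothed IDF."""
    return term_count * smoothed_idf(num_docs, doc_freq)

# A term found in a single review out of 905, occurring once and three times.
weight_once = tf_idf(1, 905, 1)
weight_thrice = tf_idf(3, 905, 1)
print(round(weight_once, 4), round(weight_thrice, 4))
```

Under these assumptions the two results should line up with the `6.1159` and `18.3477` entries in the vector above, which suggests that the term at index 5236 occurs three times in that review and in no other training review.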

# Latent Dirichlet Allocation Applied

In [78]:
from pyspark.ml.clustering import LDA

lda = LDA(k=5, seed=1, optimizer="em")
model_lda = lda.fit(df_idf)

In [79]:

# Check the result.
model_lda.vocabSize()

28011

## Topics Found

In [80]:
topics_info = model_lda.describeTopics().toPandas()
topics_info

| | topic | termIndices | termWeights |
|---|---|---|---|
| 0 | 0 | [1, 2, 4, 207, 0, 6, 13, 10, 11, 12] | [0.002003425392168439, 0.0014707285271285076, ... |
| 1 | 1 | [1, 0, 3, 5, 12, 82, 18, 10, 8, 6] | [0.0018693115981087355, 0.001570741148537871, ... |
| 2 | 2 | [1, 0, 2, 4, 52, 17, 18, 9, 23, 33] | [0.0017917824315858437, 0.001779742523937128, ... |
| 3 | 3 | [1, 0, 44, 2, 13, 7, 14, 25, 31, 49] | [0.0016817130608839928, 0.001667629490558416, ... |
| 4 | 4 | [0, 1, 2, 3, 14, 6, 21, 20, 8, 777] | [0.0019845099274118905, 0.00181662420598082, 0... |


In [81]:
topics_matrix = model_lda.topicsMatrix()

In [82]:
# Check the topic matrix.
topics_matrix.toArray()[:10,]

array([[109.64332199, 132.70762858, 163.25540536, 140.43941247,
        176.14908165],
       [177.26205424, 157.93303021, 164.35982355, 141.6254603 ,
        161.24720827],
       [130.12930801,  99.80417035, 129.80960645, 123.08570722,
        133.51152058],
       [ 82.85808233, 116.79360372, 100.21672657,  64.47604241,
        127.11991003],
       [124.48142814,  75.54921523, 114.48371413,  78.10292185,
        105.14220462],
       [ 83.84125523, 113.27301487,  89.83652336,  87.33993819,
        100.30977706],
       [103.57355513,  99.96773992,  87.58418959,  94.33253502,
        114.43286178],
       [ 70.37627816,  88.5766793 ,  95.90784817,  99.74589052,
        102.39038325],
       [ 90.40969962, 101.15412813,  78.50702074,  80.17414352,
        105.59311164],
       [ 79.42631087,  97.32157873, 109.04295968,  86.51586956,
         81.41165087]])
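
The entries of `topicsMatrix()` are not probabilities: each column holds unnormalized term weights for one topic. Dividing a column by its sum gives a distribution over terms, which appears to be what `describeTopics()` reports as `termWeights`. The sketch below normalizes a small, made-up matrix, then checks that the ratio between two entries of topic 0 in the matrix matches the ratio between the two corresponding `termWeights`. The helper name and the toy matrix are assumptions.

```python
def normalize_columns(matrix):
    """Scale each column of a row-major matrix so that it sums to 1."""
    column_sums = [sum(column) for column in zip(*matrix)]
    return [[value / total for value, total in zip(row, column_sums)]
            for row in matrix]

# Hypothetical 3-term x 2-topic matrix of unnormalized weights.
toy = [[2.0, 1.0],
       [1.0, 1.0],
       [1.0, 2.0]]
normalized = normalize_columns(toy)
print(normalized)

# Values copied from the outputs above: rows 1 and 2 of topic 0 in the
# topics matrix, and the first two termWeights of topic 0 (terms 1 and 2).
matrix_ratio = 177.26205424 / 130.12930801
weight_ratio = 0.002003425392168439 / 0.0014707285271285076
print(round(matrix_ratio, 3), round(weight_ratio, 3))
```

If the two ratios agree, the term weights are plausibly the column-normalized matrix entries, up to the rounding of the printed values.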

In [83]:
# Reuse the CountVectorizer model fitted inside the pipeline, so the vocabulary
# indices match the term indices used by the LDA model.
model_cv = model_idf.stages[0]

In [84]:
print(model_cv.vocabulary[:200])
terms_in_stop_words_list = [a_term for a_term in model_cv.vocabulary if a_term in stopwords_set]
len(terms_in_stop_words_list)

['movie', 'film', 'like', 'good', 'would', 'even', 'really', 'see', 'get', 'great', 'story', 'time', 'much', 'first', 'think', 'make', 'also', 'people', 'could', 'never', 'many', 'watch', 'it.', 'bad', 'way', 'made', 'know', 'movies', 'well', 'characters', 'movie.', 'ever', 'character', 'still', 'little', 'say', 'best', 'plot', 'love', 'film.', 'seen', 'films', 'go', 'something', 'show', 'acting', 'real', 'going', 'better', 'watching', 'look', 'nothing', 'old', 'film,', 'movie,', 'every', 'back', 'man', 'makes', 'actually', 'scene', 'quite', 'actors', 'want', 'lot', 'find', 'saw', 'thing', 'scenes', 'without', 'may', 'part', 'life', 'end', 'give', 'years', 'take', 'seems', 'around', 'another', 'pretty', 'things', 'funny', 'always', 'come', 'thought', 'music', 'bit', 'us', 'played', 'gets', 'kind', "he's", 'probably', 'new', 'feel', 'director', 'whole', 'almost', 'big', 'work', 'young', 'it,', "that's", 'must', 'long', '"the', 'anything', 'might', 'since', 'main', 'enough', "there's", '

0

## Display the Words per Topic

In [85]:
# Map the term indices of each topic back to vocabulary terms.
for topic, term_indices, _term_weights in topics_info.itertuples(index=False):
    terms = [model_cv.vocabulary[i] for i in term_indices]

    print("Topic %i:" % topic, ", ".join(terms))

    # Flag any stop word that slipped through the cleaning pipeline.
    term_stop_words = [a_term for a_term in terms if a_term in stopwords_set]
    if term_stop_words:
        print("Terms in stop words list: %s" % ", ".join(term_stop_words))

Topic 0: film, like, would, black, movie, really, first, story, time, much
Topic 1: film, movie, good, even, much, funny, could, story, get, really
Topic 2: film, movie, like, would, old, people, could, great, bad, still
Topic 3: film, movie, show, like, first, see, think, made, ever, watching
Topic 4: movie, film, like, good, think, really, watch, many, get, cheesy
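
The five topics share many of their top terms (`film` and `movie` lead every list), which suggests they are not well separated. One way to quantify this, sketched here with the term lists printed above, is the Jaccard similarity between the top-term sets of two topics. The `jaccard` helper is an illustrative assumption and not part of the notebook.

```python
from itertools import combinations

# Top terms copied from the output above.
topics = {
    0: "film, like, would, black, movie, really, first, story, time, much",
    1: "film, movie, good, even, much, funny, could, story, get, really",
    2: "film, movie, like, would, old, people, could, great, bad, still",
    3: "film, movie, show, like, first, see, think, made, ever, watching",
    4: "movie, film, like, good, think, really, watch, many, get, cheesy",
}
top_terms = {topic: set(terms.split(", ")) for topic, terms in topics.items()}

def jaccard(a, b):
    """Size of the intersection divided by the size of the union."""
    return len(a & b) / len(a | b)

for left, right in combinations(sorted(top_terms), 2):
    print(left, right, round(jaccard(top_terms[left], top_terms[right]), 2))
```

A value near 0 means two topics have almost no top terms in common, and a value near 1 means they are nearly identical. High values across all pairs would be a hint to extend the stop word list or to revisit the number of topics.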


## Display Reviews per Topic

In [92]:
# Apply the model to the data frame.
df_review_topic = model_lda.transform(df_idf)

## Show Topic Distribution

In [97]:
# Topic distribution of the first review, and the index of its most probable topic.
distributions = df_review_topic.select('topicDistribution').take(1)[0]['topicDistribution'].toArray()
int(distributions.argmax())

0
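
The expression above returns the position of the largest value in the topic distribution, that is, the most probable topic for the first review. A plain-Python sketch of the same idea, using a made-up distribution over five topics (`dominant_topic` is a hypothetical helper):

```python
def dominant_topic(distribution):
    """Return the index of the largest probability (the first one on ties)."""
    return max(range(len(distribution)), key=lambda i: distribution[i])

# Hypothetical topic distribution for one review.
example = [0.41, 0.12, 0.27, 0.15, 0.05]
print(dominant_topic(example))
```

Picking a single topic per review discards how close the runner-up was, so reviews whose two largest probabilities are nearly equal may deserve a separate look.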

In [None]:
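from pyspark.sql import functions as F
from pyspark.sql.types import IntegerType

# Add the index of the most probable topic of each review as a new column.
dominant_topic = F.udf(lambda v: int(v.toArray().argmax()), IntegerType())
df_dominant = df_review_topic.withColumn('topic', dominant_topic('topicDistribution'))
df_dominant.select('filename', 'reviewid', 'topic').toPandas()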