# Topic Modeling using Latent Dirichlet Allocation (Clustering)

# Introduction
This notebook uses the [Stanford IMDb Review dataset](http://ai.stanford.edu/~amaas/data/sentiment "Stanford IMDb Large Movie Review Dataset").
Download it, extract it locally, and set the variable `base_path` below to the dataset's path on the file system.

This notebook is about topic modeling using a technique called Latent Dirichlet Allocation (LDA).

# Data set Loading

In [15]:
# Set the base of the data path where folders test/neg, train/pos, etc, live.
base_path = "../../data/aclImdb"

# The folders where to look for the reviews.
data_sets = ['test', 'train']
sa_dir_names = ['neg', 'pos']

# List the content of the data path for the sake of checking the data set folders.
files = !ls {base_path}
print(files)

['README', 'aclImdb_100000.csv', 'aclImdb_100000_raw.parquet', 'aclImdb_10000_raw.parquet', 'aclImdb_1000_raw.parquet', 'aclImdb_100_raw.parquet', 'aclImdb_20000_raw.parquet', 'aclImdb_2000_raw.parquet', 'aclImdb_200_raw.parquet', 'aclImdb_210_raw.parquet', 'aclImdb_211_raw.parquet', 'aclImdb_250.csv', 'aclImdb_250_raw.parquet', 'aclImdb_251_raw.parquet', 'aclImdb_252_raw.parquet', 'aclImdb_300_raw.parquet', 'aclImdb_301_raw.parquet', 'aclImdb_50000_raw.parquet', 'imdb.vocab', 'imdbEr.txt', 'test', 'train']


# Data Prep

LDA operates on numeric vectors, not on raw text. Each review therefore has to be converted into a feature vector representation before LDA can be fitted. The topics LDA infers from those vectors then serve to group reviews with similar content.
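
As a minimal illustration of that conversion, the sketch below turns tokenized reviews into term-count vectors in plain Python. The helper names `build_vocabulary` and `to_count_vector` and the two toy documents are assumptions made for this example; the notebook itself relies on Spark's `CountVectorizer` further down.

```python
from collections import Counter


def build_vocabulary(docs):
    """Return terms ordered by descending corpus frequency (ties alphabetical)."""
    counts = Counter(token for doc in docs for token in doc)
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ordered]


def to_count_vector(doc, vocabulary):
    """Map a list of tokens to a dense vector of term counts."""
    index = {term: i for i, term in enumerate(vocabulary)}
    vector = [0] * len(vocabulary)
    for token in doc:
        if token in index:
            vector[index[token]] += 1
    return vector


# Two toy "reviews", already tokenized (hypothetical data).
docs = [["great", "movie", "great", "cast"], ["boring", "movie"]]
vocabulary = build_vocabulary(docs)
vectors = [to_count_vector(doc, vocabulary) for doc in docs]
print(vocabulary)
print(vectors)
```

Each position in a vector corresponds to one vocabulary term, which is the shape of input a topic model needs.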

In [16]:
# Set up Python system path to find our modules.
import os
import sys
module_path = os.path.abspath(os.path.join('../src'))
if module_path not in sys.path:
    sys.path.append(module_path)

# Import our modules.
import file_loader as fl

# Add the file to SparkContext for the executor to find it.
sc.addPyFile('../src/file_loader.py')

In [17]:
# Load the data into a parquet file; the first returned value is the file's path.
file_parquet, _ = fl.load_data(base_path, 301, spark)

In [18]:
!ls -d {base_path}/*.parquet

../../data/aclImdb/aclImdb_100000_raw.parquet
../../data/aclImdb/aclImdb_10000_raw.parquet
../../data/aclImdb/aclImdb_1000_raw.parquet
../../data/aclImdb/aclImdb_100_raw.parquet
../../data/aclImdb/aclImdb_20000_raw.parquet
../../data/aclImdb/aclImdb_2000_raw.parquet
../../data/aclImdb/aclImdb_200_raw.parquet
../../data/aclImdb/aclImdb_210_raw.parquet
../../data/aclImdb/aclImdb_211_raw.parquet
../../data/aclImdb/aclImdb_250_raw.parquet
../../data/aclImdb/aclImdb_251_raw.parquet
../../data/aclImdb/aclImdb_252_raw.parquet
../../data/aclImdb/aclImdb_300_raw.parquet
../../data/aclImdb/aclImdb_301_raw.parquet
../../data/aclImdb/aclImdb_50000_raw.parquet


In [19]:
file_parquet

'../../data/aclImdb/aclImdb_301_raw.parquet'

In [None]:
# Read the parquet file into a data frame.
df_pqt = spark.read.parquet(file_parquet)

# As needed.
# df_pqt = df_pqt.drop('words')

# Showing some observations (entries).
df_pqt.persist()
df_pqt.show()

# Text Cleansing

In [95]:
import nltk
from nltk.corpus import stopwords

# Download the NLTK English stop words and build a de-duplicated list.
nltk.download('stopwords')
stopwords_set = list(set(stopwords.words('english')))

stopwords_set[:10]
# stopwords_bc = spark.sparkContext.broadcast(set(stopwords.words('english')))

[nltk_data] Downloading package stopwords to /Users/hujol/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


['both',
 'was',
 'been',
 'ourselves',
 'doing',
 'because',
 'am',
 'as',
 'and',
 'me']

In [96]:
from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, StopWordsRemover

# Remove all HTML tags.
df_pqt = fl.transform_html_clean(df_pqt, 'textclean')

# Tokenize and remove stop words.
tokenizer = Tokenizer(inputCol="textclean", outputCol="words_tknz")
remover = StopWordsRemover(inputCol=tokenizer.getOutputCol(), outputCol="words_test", 
                           stopWords=stopwords_set)
pipeline = Pipeline(stages=[tokenizer, remover])

# Fit the pipeline on the HTML-cleaned data frame.
model_p = pipeline.fit(df_pqt)

# Transform the data frame.
df_p = model_p.transform(df_pqt)

In [97]:
# Check the resulting transformation.
len(df_p.head().words_test)
a_sample = df_p.take(5)[4]
print(len(a_sample['text']), a_sample['text'][250:600])
print(len(a_sample['textclean']), a_sample['textclean'][250:600])
print(len(a_sample['words_test']), a_sample['words_test'][:10])

3471 ll back on the tried and true toilet humor of a teen sex comedy [i.e. "American Pie"], or warm the audience with the sentimentality of a romantic comedy [i.e. Julia Roberts' entire career]. It can only maintain a push to the end, and hope that the audience can appreciate the almost required irony of it's resolution.<br /><br />Written by husband/wi
3377 ll back on the tried and true toilet humor of a teen sex comedy i.e. "American Pie", or warm the audience with the sentimentality of a romantic comedy i.e. Julia Roberts' entire career. It can only maintain a push to the end, and hope that the audience can appreciate the almost required irony of it's resolution.Written by husband/wife team Wally Wo
327 ['sophisticated', 'sex', 'comedies', 'always', 'difficult', 'pull', 'off.', 'look', 'films', 'blake']


In [98]:
# Split the df into train and test
df_p_training, df_p_test = df_p.randomSplit([0.9, 0.1], seed=12345)

df_p_training.count(), df_p_test.count()

(230, 22)

# Create a features vector

In [102]:
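from pyspark.ml.feature import CountVectorizer, IDF

# Drop the column left by a previous run, if any, so the cell can be re-executed.
df_p_training = df_p_training.drop('featurestf')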

# Define the count vector so the IDF can compute the features vector.
cv = CountVectorizer(inputCol=remover.getOutputCol(), outputCol="featurestf", vocabSize=30000, minDF=1.0)
idf = IDF(inputCol=cv.getOutputCol(), outputCol="features")

# Create the pipeline.
pipeline = Pipeline(stages=[cv, idf])

# Fit the pipeline.
model_idf = pipeline.fit(df_p_training)

# Transform the data frame.
df_idf = model_idf.transform(df_p_training)

In [106]:
# Check the result.
a_sample = df_idf.take(1)[0]
a_sample['features']

SparseVector(10578, {0: 4.8218, 1: 2.9677, 2: 5.9311, 3: 3.4441, 4: 1.073, 5: 3.3351, 7: 4.0943, 8: 2.1714, 9: 3.2571, 10: 2.7639, 12: 2.5055, 15: 1.3649, 17: 1.5712, 19: 1.4534, 20: 1.4534, 22: 1.6138, 23: 1.4534, 25: 1.6582, 29: 1.6358, 32: 3.5577, 36: 1.6812, 37: 1.8048, 39: 5.2606, 40: 3.3165, 43: 2.1466, 44: 3.7741, 45: 1.8048, 46: 3.6097, 47: 1.7288, 48: 5.4945, 51: 1.8048, 57: 1.9767, 58: 3.8321, 59: 4.1502, 61: 1.9767, 62: 6.3306, 66: 2.1102, 67: 2.3514, 68: 2.0084, 70: 6.2254, 72: 6.4397, 74: 2.3979, 78: 2.1843, 85: 2.0751, 87: 6.6706, 88: 9.5916, 94: 2.2644, 95: 2.1843, 100: 4.6138, 102: 2.3979, 104: 2.2235, 106: 2.2644, 107: 2.3514, 109: 4.7958, 116: 4.6138, 117: 4.7958, 118: 4.7028, 121: 2.2644, 132: 2.4467, 139: 2.552, 141: 2.498, 144: 7.6561, 150: 2.6092, 159: 2.6092, 164: 2.6698, 171: 5.6067, 179: 2.552, 181: 2.552, 184: 5.4687, 192: 2.6698, 202: 5.6067, 203: 2.7344, 206: 5.3397, 209: 22.9979, 219: 2.8034, 247: 2.8775, 250: 2.7344, 256: 2.8034, 258: 5.7549, 271: 2.8034, 
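
The weights in this sparse vector are term counts scaled by an inverse document frequency. The sketch below assumes the smoothed formula documented for Spark's `IDF`, idf = log((m + 1) / (df + 1)), where m is the number of documents and df the number of documents containing the term. The function names `idf` and `tf_idf` are hypothetical helpers for this example.

```python
import math


def idf(num_docs, doc_freq):
    """Smoothed inverse document frequency: log((m + 1) / (df + 1))."""
    return math.log((num_docs + 1) / (doc_freq + 1))


def tf_idf(term_count, num_docs, doc_freq):
    """Weight of a term in one document: raw count times the IDF."""
    return term_count * idf(num_docs, doc_freq)


# A term present in every document carries no weight.
print(idf(230, 230))

# A term seen once in a review and present in 17 of 230 reviews.
weight = tf_idf(1, 230, 17)
print(round(weight, 3))
```

Under that assumption, a weight such as 2.552 in the output above is consistent with a term that occurs once in the review and appears in 17 of the 230 training reviews.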

# Latent Dirichlet Allocation Applied

In [120]:
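from pyspark.ml.clustering import LDA

# Fit LDA with 5 topics on the 'features' column (the default featuresCol).
# Note: Spark's LDA documentation describes its input as term-count vectors;
# this notebook feeds it the TF-IDF weighted vectors instead.
lda = LDA(k=5, seed=1, optimizer="em")
model_lda = lda.fit(df_idf)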

In [121]:
# Check the result.
model_lda.vocabSize()

10578

In [154]:
model_lda.describeTopics().toPandas()

   topic                               termIndices                                         termWeights
0      0     [0, 1, 9, 74, 624, 4, 52, 281, 6, 19]  [0.0020609759941865036, 0.002036337672082326, ...
1      1   [60, 55, 1, 474, 307, 0, 14, 2, 88, 41]  [0.003516612765568113, 0.002261082291291362, 0...
2      2    [6, 0, 453, 1, 2, 81, 3, 47, 209, 221]  [0.003052733224759423, 0.0026427669067573475, ...
3      3    [443, 14, 2, 25, 1, 15, 33, 21, 10, 0]  [0.00213380505133751, 0.002106536718676652, 0....
4      4  [153, 6, 410, 5, 465, 1, 156, 2, 119, 0]  [0.0038359394566674686, 0.0024447223590275805,...
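
`describeTopics()` reports term indices rather than words. Assuming the indices are positions in the fitted `CountVectorizer` vocabulary, a lookup like the following makes a topic readable. The vocabulary here is only the 20-term prefix printed at the end of this notebook, and `terms_for_topic` is a hypothetical helper; indices beyond that prefix are left as placeholders rather than guessed.

```python
# First 20 vocabulary terms, copied from the notebook's printed output.
vocabulary_prefix = [
    'movie', 'film', 'like', 'one', 'good', 'really', '-', 'would', 'even',
    'see', 'get', 'time', 'think', 'great', 'people', 'story', 'much',
    'first', 'it.', 'also',
]


def terms_for_topic(term_indices, vocabulary):
    """Replace each term index by its word when the index is known."""
    return [
        vocabulary[i] if i < len(vocabulary) else f"<index {i}>"
        for i in term_indices
    ]


# Term indices of topic 3 from the table above.
topic_3 = [443, 14, 2, 25, 1, 15, 33, 21, 10, 0]
topic_3_terms = terms_for_topic(topic_3, vocabulary_prefix)
print(topic_3_terms)
```

With the full vocabulary available (for example from a fitted `CountVectorizerModel`), the same lookup would resolve every index.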


In [132]:
topics_matrix = model_lda.topicsMatrix()

In [152]:
topics_matrix.toArray()[:10,]

array([[36.82449917, 37.14671488, 50.37797794, 29.70896658, 34.68058176],
       [36.38427382, 42.43309555, 36.06745967, 31.73441111, 38.86509604],
       [24.40049702, 35.97351675, 35.84170214, 37.58443714, 35.65941904],
       [22.11280188, 31.05716164, 31.38364112, 21.02460429, 30.1208124 ],
       [33.12244131, 18.21111216, 24.59345038, 19.48178838, 29.05571131],
       [19.89510702, 25.04741253, 24.73496457, 15.64804686, 43.62985596],
       [30.68420526, 30.93673338, 58.19299715, 28.40444837, 50.59886487],
       [26.12092055, 23.08888505, 21.64141218, 15.86207514, 27.92734258],
       [26.70890323, 24.35316999, 20.86747695, 20.84783924, 25.56487893],
       [33.69699164, 15.63287622, 13.26981784, 20.41148247, 25.55972021]])
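
Each row of this matrix corresponds to one vocabulary term and each column to one of the five topics; the values are unnormalized weights. As a rough sketch of how to read them, the code below takes the first three rows printed above, converts each row into shares that sum to 1, and picks the topic with the largest weight. `topic_shares` and `dominant_topic` are hypothetical helpers introduced for this example.

```python
# First three rows of the printed topics matrix (terms 0, 1 and 2).
rows = [
    [36.82449917, 37.14671488, 50.37797794, 29.70896658, 34.68058176],
    [36.38427382, 42.43309555, 36.06745967, 31.73441111, 38.86509604],
    [24.40049702, 35.97351675, 35.84170214, 37.58443714, 35.65941904],
]


def topic_shares(row):
    """Normalize one term's weights so they sum to 1 across topics."""
    total = sum(row)
    return [value / total for value in row]


def dominant_topic(row):
    """Index of the topic holding the largest weight for this term."""
    return max(range(len(row)), key=lambda k: row[k])


shares = [topic_shares(row) for row in rows]
dominant = [dominant_topic(row) for row in rows]
print(dominant)
```

The weights for these frequent terms are spread fairly evenly across topics, which suggests the terms do little to distinguish one topic from another.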

In [159]:
model_cv = cv.fit(df_p_training)

In [163]:
print(model_cv.vocabulary[:20])
'it' in stopwords_set

['movie', 'film', 'like', 'one', 'good', 'really', '-', 'would', 'even', 'see', 'get', 'time', 'think', 'great', 'people', 'story', 'much', 'first', 'it.', 'also']


True
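
The vocabulary still contains tokens such as '-' and 'it.' even though 'it' is a stop word. Spark's `Tokenizer` lowercases the text and splits on whitespace only, so punctuation stays attached to words and the stop-word filter no longer matches them. The sketch below shows one possible punctuation-aware tokenization in plain Python; the regular expression, the sample sentence and the helper `clean_tokens` are assumptions for illustration. In Spark, `RegexTokenizer` could play a similar role.

```python
import re

# Keep runs of letters, digits and apostrophes; everything else separates tokens.
WORD_RE = re.compile(r"[a-z0-9']+")


def clean_tokens(text, stop_words):
    """Lowercase, split on non-word characters, then drop stop words."""
    tokens = WORD_RE.findall(text.lower())
    # Strip stray apostrophes left over from quoted words.
    tokens = [token.strip("'") for token in tokens]
    return [token for token in tokens if token and token not in stop_words]


sample = "I liked it. The plot - oddly - works."
tokens = clean_tokens(sample, {"i", "it", "the"})
print(tokens)
```

With this kind of tokenization, 'it.' becomes 'it' and is removed, and the lone '-' never reaches the vocabulary.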