# Topic Modeling using Latent Dirichlet Allocation (Clustering)

# Introduction
This notebook uses the [Stanford IMDb Review dataset](http://ai.stanford.edu/~amaas/data/sentiment "Stanford IMDb Large Movie Review Dataset").
Download the dataset, extract it locally, and set the `base_path` variable below to its location on the file system.

This notebook demonstrates topic modeling with Latent Dirichlet Allocation (LDA).

# Data Set Loading

In [2]:
# Base of the data path, where the folders test/neg, train/pos, etc. live.
base_path = "../../data/aclImdb" # Change this to the right path.

# The folders in which to look for the reviews.
data_sets = ['test', 'train']
sa_dir_names = ['neg', 'pos']

# List the content of the data path to check that the data set folders are there.
files = !ls {base_path}
print(files)

['README', 'aclImdb_100000.csv', 'aclImdb_100000_raw.parquet', 'aclImdb_10000_raw.parquet', 'aclImdb_1000_raw.parquet', 'aclImdb_100_raw.parquet', 'aclImdb_20000_raw.parquet', 'aclImdb_2000_raw.parquet', 'aclImdb_200_raw.parquet', 'aclImdb_210_raw.parquet', 'aclImdb_211_raw.parquet', 'aclImdb_250.csv', 'aclImdb_250_raw.parquet', 'aclImdb_251_raw.parquet', 'aclImdb_252_raw.parquet', 'aclImdb_300_raw.parquet', 'aclImdb_301_raw.parquet', 'aclImdb_50000_raw.parquet', 'imdb.vocab', 'imdbEr.txt', 'test', 'train']


# Data Prep

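LDA operates on numeric vectors, not on raw text. Each review must first be converted into a feature vector (here, term counts weighted by inverse document frequency) from which LDA infers topics. Each topic is a distribution over terms and each review is described as a mixture of topics, which is what allows similar reviews to be grouped together.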
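As a rough illustration of what a feature vector representation means here, the sketch below turns two made-up token lists into sparse term-count vectors in plain Python. The documents and the helper name `count_vector` are hypothetical; the notebook itself relies on Spark's `CountVectorizer` further down.

```python
from collections import Counter

# Two toy documents, already tokenized (hypothetical data, not from the IMDb set).
docs = [
    ["great", "movie", "great", "cast"],
    ["boring", "movie"],
]

# Build a vocabulary ordered by corpus frequency, most frequent first;
# ties are broken alphabetically to keep the result deterministic.
term_freq = Counter(token for doc in docs for token in doc)
vocabulary = sorted(term_freq, key=lambda term: (-term_freq[term], term))
index_of = {term: i for i, term in enumerate(vocabulary)}

def count_vector(tokens):
    """Return a sparse {term index: count} mapping for one document."""
    counts = Counter(index_of[t] for t in tokens if t in index_of)
    return dict(sorted(counts.items()))

vectors = [count_vector(doc) for doc in docs]
print(vocabulary)
print(vectors)
```

Only the non-zero counts are stored per document, which is the same idea as the `SparseVector` objects Spark produces later in this notebook.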

In [3]:
# Set up Python system path to find our modules.
import os
import sys
module_path = os.path.abspath(os.path.join('../src'))
if module_path not in sys.path:
    sys.path.append(module_path)

# Import our modules.
import file_loader as fl

# Add the file to SparkContext for the executor to find it.
sc.addPyFile('../src/file_loader.py')

In [4]:
# Number of observations.
obs_nb = 1000

# Load the data in a parquet file.
file_parquet, _ = fl.load_data(base_path, obs_nb, spark)

In [5]:
!ls -d {base_path}/*.parquet

../../data/aclImdb/aclImdb_100000_raw.parquet
../../data/aclImdb/aclImdb_10000_raw.parquet
../../data/aclImdb/aclImdb_1000_raw.parquet
../../data/aclImdb/aclImdb_100_raw.parquet
../../data/aclImdb/aclImdb_20000_raw.parquet
../../data/aclImdb/aclImdb_2000_raw.parquet
../../data/aclImdb/aclImdb_200_raw.parquet
../../data/aclImdb/aclImdb_210_raw.parquet
../../data/aclImdb/aclImdb_211_raw.parquet
../../data/aclImdb/aclImdb_250_raw.parquet
../../data/aclImdb/aclImdb_251_raw.parquet
../../data/aclImdb/aclImdb_252_raw.parquet
../../data/aclImdb/aclImdb_300_raw.parquet
../../data/aclImdb/aclImdb_301_raw.parquet
../../data/aclImdb/aclImdb_50000_raw.parquet


In [6]:
file_parquet

'../../data/aclImdb/aclImdb_1000_raw.parquet'

In [7]:
# Read the parquet file into a data frame.
df_pqt = spark.read.parquet(file_parquet)

# Cache the data frame and show some observations (entries).
df_pqt.persist()
df_pqt.show()

+-----------+-----------+----------------+--------+--------------+------------+--------------------+
|datasettype|   filename| datetimecreated|reviewid|reviewpolarity|reviewrating|                text|
+-----------+-----------+----------------+--------+--------------+------------+--------------------+
|       test| 3515_8.txt|20181026T091736Z|    3515|             1|           8|I didn't have ver...|
|       test|2823_10.txt|20181026T091736Z|    2823|             1|          10|This movie makes ...|
|       test| 4278_9.txt|20181026T091736Z|    4278|             1|           9|I have to admit I...|
|       test|5651_10.txt|20181026T091736Z|    5651|             1|          10|This film is a kn...|
|       test|4366_10.txt|20181026T091736Z|    4366|             1|          10|Yes, this movie w...|
|       test|5100_10.txt|20181026T091736Z|    5100|             1|          10|I first saw this ...|
|       test|12123_7.txt|20181026T091736Z|   12123|             1|           7|I don't know

# Text Cleansing

In [9]:
import nltk
from nltk.corpus import stopwords

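# Download the NLTK English stop word list; the words themselves are removed
# later by StopWordsRemover. Despite its name, stopwords_set is a list.
nltk.download('stopwords')
stopwords_set = list(set(stopwords.words('english')))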

stopwords_set[:10]
# stopwords_bc = spark.sparkContext.broadcast(set(stopwords.words('english')))

[nltk_data] Downloading package stopwords to /Users/hujol/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


["you'd",
 'down',
 'needn',
 "mustn't",
 'hers',
 'against',
 'your',
 'through',
 'further',
 'once']

In [19]:
from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, StopWordsRemover

# Remove all HTML tags.
html_tags_remover = fl.HTMLTagsRemover(inputCol='text', outputCol='textclean')

# Tokenize and remove stop words.
tokenizer = Tokenizer(inputCol=html_tags_remover.getOutputCol(), outputCol="words_tknz")
remover = StopWordsRemover(inputCol=tokenizer.getOutputCol(), outputCol="words", 
                           stopWords=stopwords_set)

# Create the pipeline.
pipeline_cleaner = Pipeline(stages=[html_tags_remover, tokenizer, remover])

# Fit the pipeline.
model_p = pipeline_cleaner.fit(df_pqt)

# Transform the data frame.
df_cleaned = model_p.transform(df_pqt)

In [20]:
# Check the resulting transformation.
len(df_cleaned.head().words)
a_sample = df_cleaned.take(5)[4]
print(len(a_sample['text']), a_sample['text'][250:600])
print(len(a_sample['textclean']), a_sample['textclean'][250:600])
print(len(a_sample['words']), a_sample['words'][:10])

1647 rs of this film are Laurence Harvey and Julie Harris. Now before this film, I'd only see Miss Harris in East of Eden with James Dean and I own an audio tape of The Glass Menagerie that she did on stage with Monty Clift and Jessica Tandy, so I wasn't sure how she'd be in this role and BOY, did she impress me. How hammy was she? I love ham! ;-) Mr. H
1630 rs of this film are Laurence Harvey and Julie Harris. Now before this film, I'd only see Miss Harris in East of Eden with James Dean and I own an audio tape of The Glass Menagerie that she did on stage with Monty Clift and Jessica Tandy, so I wasn't sure how she'd be in this role and BOY, did she impress me. How hammy was she? I love ham! ;- Mr. Ha
156 ['yes,', 'movie', 'hilarious', 'acting', 'top', 'notch', 'whole', 'cast.', 'except', 'shelley']
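The tokens above keep their punctuation (`'yes,'`, `'cast.'`), which is consistent with Spark's `Tokenizer` lowercasing the text and splitting on whitespace only. The plain-Python sketch below contrasts that behavior with splitting on non-word characters; the sample sentence and both helper names are hypothetical, and this is an illustration rather than the notebook's own pipeline.

```python
import re

def whitespace_tokens(text):
    """Lowercase and split on whitespace, so punctuation stays attached."""
    return text.lower().split()

def word_tokens(text):
    """Lowercase and split on runs of non-word characters, dropping empties."""
    return [tok for tok in re.split(r"\W+", text.lower()) if tok]

sample = "Yes, this movie was hilarious."  # hypothetical sentence
print(whitespace_tokens(sample))
print(word_tokens(sample))
```

In Spark, `RegexTokenizer` from `pyspark.ml.feature` offers pattern-based splitting of this kind. One trade-off to keep in mind: a `\W+` pattern also breaks contractions such as "didn't" into two tokens.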


In [21]:
# Split the df into train and test
df_p_training, df_p_test = df_cleaned.randomSplit([0.9, 0.1], seed=12345)

df_p_training.count(), df_p_test.count()

(905, 95)

# Create a Features Vector

In [22]:
from pyspark.ml.feature import CountVectorizer, IDF

# Defensive: drop a stale 'featurestf' column if one exists (a no-op otherwise).
df_p_training = df_p_training.drop('featurestf')

# Count term occurrences per review, then weight the counts by inverse document frequency (IDF).
cv = CountVectorizer(inputCol=remover.getOutputCol(), outputCol="featurestf", vocabSize=30000, minDF=1.0)
idf = IDF(inputCol=cv.getOutputCol(), outputCol="features")

# Create the pipeline.
pipeline = Pipeline(stages=[cv, idf])

# Fit the pipeline.
model_idf = pipeline.fit(df_p_training)

# Transform the data frame.
df_idf = model_idf.transform(df_p_training)

In [23]:
# Check the result.
a_sample = df_idf.take(1)[0]
a_sample['features']

SparseVector(28020, {2: 0.6844, 7: 1.1959, 9: 1.8891, 16: 1.4667, 35: 1.8462, 37: 1.8051, 42: 1.7852, 55: 1.9038, 82: 2.3431, 96: 2.3317, 126: 2.4396, 152: 5.3963, 159: 5.3963, 173: 2.5895, 178: 2.6502, 181: 2.9589, 199: 2.8017, 227: 2.9378, 242: 2.9589, 255: 2.9172, 272: 3.0714, 290: 6.8157, 293: 3.0478, 586: 3.5132, 588: 3.631, 602: 3.5132, 615: 3.5902, 645: 3.631, 664: 3.631, 688: 3.7645, 782: 3.8133, 800: 4.101, 905: 4.0365, 940: 3.9187, 1117: 4.101, 1274: 4.4111, 1285: 8.4882, 1416: 4.4111, 1585: 4.5065, 1633: 4.5065, 1789: 4.5065, 1796: 4.7296, 1822: 4.6118, 2845: 4.8631, 2925: 4.8631, 3198: 5.0173, 3336: 5.0173, 3439: 5.0173, 3450: 5.0173, 3492: 5.0173, 3619: 5.0173, 3642: 5.0173, 3750: 5.1996, 3782: 5.1996, 3797: 5.4227, 4461: 5.1996, 4685: 5.4227, 4825: 18.3477, 5401: 5.4227, 5461: 5.4227, 6641: 12.2318, 6723: 5.7104, 6926: 5.7104, 7450: 5.7104, 8715: 5.7104, 9393: 5.7104, 9552: 5.7104, 10211: 6.1159, 12322: 6.1159, 13475: 6.1159, 17190: 6.1159, 17227: 6.1159, 21756: 6.1159, 2
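The weights in the vector above appear consistent with the smoothed IDF formula documented for Spark MLlib, `log((m + 1) / (df + 1))`, multiplied by the term's count in the review, where `m` is the number of training documents (905 here) and `df` is the number of documents containing the term. The sketch below recomputes a few values under that assumption; the function name `smoothed_idf` is hypothetical, and this is a reading of the numbers rather than something the notebook states.

```python
import math

def smoothed_idf(num_docs, doc_freq):
    """Smoothed inverse document frequency: log((m + 1) / (df + 1))."""
    return math.log((num_docs + 1) / (doc_freq + 1))

m = 905  # size of the training split reported above

# Weight = term count in the review * IDF of the term.
once_in_one_doc = 1 * smoothed_idf(m, 1)    # term seen once, in a single document
thrice_in_one_doc = 3 * smoothed_idf(m, 1)  # same rarity, three occurrences
once_in_two_docs = 1 * smoothed_idf(m, 2)   # term present in two documents

print(round(once_in_one_doc, 4))
print(round(thrice_in_one_doc, 4))
print(round(once_in_two_docs, 4))
```

Under this reading, entries such as 6.1159 would belong to terms that occur in only one training review, and 18.3477 to such a term used three times in this review.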

# Latent Dirichlet Allocation Applied

In [24]:
from pyspark.ml.clustering import LDA

lda = LDA(k=5, seed=1, optimizer="em")
model_lda = lda.fit(df_idf)

In [25]:
# Check the result.
model_lda.vocabSize()

28020

In [26]:
model_lda.describeTopics().toPandas()

   topic                           termIndices                                        termWeights
0      0   [1, 3, 0, 2, 9, 787, 27, 12, 23, 7]  [0.0022201148561984247, 0.0014447574195974062,...
1      1  [1, 0, 3, 15, 21, 344, 48, 14, 2, 4]  [0.0017325556461041337, 0.0014006685968434127,...
2      2    [9, 1, 0, 4, 2, 5, 37, 18, 95, 11]  [0.0021506657295716843, 0.0016757227399249817,...
3      3    [1, 0, 9, 145, 7, 6, 3, 2, 23, 10]  [0.0018550225680454926, 0.0018344894985548024,...
4      4      [0, 9, 3, 1, 8, 7, 5, 13, 87, 2]  [0.0020106356340426285, 0.0019638129275122364,...


In [27]:
topics_matrix = model_lda.topicsMatrix()

In [28]:
# First 10 rows: one row per vocabulary term, one column per topic.
topics_matrix.toArray()[:10]

array([[121.78837608, 123.37933398, 130.1902185 , 162.13468156,
        184.70223993],
       [190.6509602 , 152.61394607, 144.86314676, 163.94942222,
        150.35010131],
       [110.38035744, 102.92382594, 112.34209082, 105.37327057,
        109.6216282 ],
       [124.06763035, 116.58349022,  95.88234195, 106.20637297,
        173.60047713],
       [ 75.35353221, 102.25215346, 123.31206116,  81.31305096,
        109.23356727],
       [ 92.27207712,  89.30318714, 108.73439725,  79.46784329,
        127.98197917],
       [ 97.75653344,  87.62623873,  73.11879014, 106.80084342,
        109.29810297],
       [101.69432469,  71.3374006 ,  90.28007237, 108.40931056,
        128.16977323],
       [ 71.35872522,  97.7858671 ,  73.36306366,  83.20950862,
        131.27991479],
       [105.24280858,  67.51132737, 185.9210941 , 114.53798567,
        180.40098383]])

In [29]:
# Fit the CountVectorizer on its own to expose the vocabulary behind the term indices.
model_cv = cv.fit(df_p_training)

In [30]:
# Show the 20 most frequent terms, then check that 'it' is among the stop words.
print(model_cv.vocabulary[:20])
'it' in stopwords_set

['movie', 'film', 'one', 'like', 'good', 'would', 'even', 'really', 'see', '-', 'get', 'great', 'story', 'time', 'much', 'first', 'think', 'make', 'also', 'people']


True
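The vocabulary makes the `termIndices` returned by `describeTopics()` readable. The sketch below maps the indices of topic 0 back to words using only the 20 vocabulary entries printed above; indices outside that slice are kept as placeholders. It assumes that `model_cv` yields the same vocabulary as the `CountVectorizer` stage fitted inside the pipeline (same data and parameters), and the helper name `index_to_term` is hypothetical.

```python
# First 20 vocabulary entries, copied from the output above.
vocabulary_head = ['movie', 'film', 'one', 'like', 'good', 'would', 'even',
                   'really', 'see', '-', 'get', 'great', 'story', 'time',
                   'much', 'first', 'think', 'make', 'also', 'people']

# termIndices of topic 0, copied from the describeTopics() output.
topic0_indices = [1, 3, 0, 2, 9, 787, 27, 12, 23, 7]

def index_to_term(index, vocabulary):
    """Look up a term; indices beyond the known slice are shown as '<index>'."""
    return vocabulary[index] if index < len(vocabulary) else f"<{index}>"

topic0_terms = [index_to_term(i, vocabulary_head) for i in topic0_indices]
print(topic0_terms)
```

With the full `model_cv.vocabulary` list in place of `vocabulary_head`, every index resolves to a word. The presence of `'-'` among the top terms hints that punctuation-only tokens could be filtered out during cleansing.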