# Topic Modeling using Latent Dirichlet Allocation (Clustering)

# Introduction
This notebook uses the [Stanford IMDb Review dataset](http://ai.stanford.edu/~amaas/data/sentiment "Stanford IMDb Large Movie Review Dataset").
Download it, extract it locally, and set the variable `base_path` below to the file-system path of the dataset.

This notebook is about topic modeling using a technique called Latent Dirichlet Allocation (LDA).

# Data Set Loading

In [15]:
# Set the base of the data path where folders test/neg, train/pos, etc, live.
base_path = "../../data/aclImdb"

# The folders where to look for the reviews.
data_sets = ['test', 'train']
sa_dir_names = ['neg', 'pos']

# List the content of the data path for the sake of checking the data set folders.
files = !ls {base_path}
print(files)

['README', 'aclImdb_100000.csv', 'aclImdb_100000_raw.parquet', 'aclImdb_10000_raw.parquet', 'aclImdb_1000_raw.parquet', 'aclImdb_100_raw.parquet', 'aclImdb_20000_raw.parquet', 'aclImdb_2000_raw.parquet', 'aclImdb_200_raw.parquet', 'aclImdb_210_raw.parquet', 'aclImdb_211_raw.parquet', 'aclImdb_250.csv', 'aclImdb_250_raw.parquet', 'aclImdb_251_raw.parquet', 'aclImdb_252_raw.parquet', 'aclImdb_300_raw.parquet', 'aclImdb_301_raw.parquet', 'aclImdb_50000_raw.parquet', 'imdb.vocab', 'imdbEr.txt', 'test', 'train']
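
The `!ls` call above only works inside IPython and leaves the check to the reader's eye. As a portable alternative, the expected review folders can be verified programmatically. This is a minimal sketch; the helper name `missing_review_dirs` is hypothetical and not part of the notebook's own modules.

```python
import os


def missing_review_dirs(base_path, data_sets=('test', 'train'), sa_dir_names=('neg', 'pos')):
    """Return the expected review folders that do not exist under base_path."""
    # Build every combination such as <base_path>/test/neg, <base_path>/train/pos.
    expected = [
        os.path.join(base_path, data_set, sa_dir)
        for data_set in data_sets
        for sa_dir in sa_dir_names
    ]
    # Keep only the paths that are not existing directories.
    return [path for path in expected if not os.path.isdir(path)]
```

Calling `missing_review_dirs(base_path)` returns an empty list when all four folders are in place, and the list of missing paths otherwise.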


# Data Prep

LDA works on numbers, not on text. Each review has to be converted into a feature vector (for example, per-document word counts) before LDA can be fit. The fitted model then describes each review as a mixture of topics, and that description is what serves to group similar observations together.
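
To make the idea of a feature vector concrete, here is a minimal bag-of-words sketch in plain Python. It is an illustration only, not the vectorization this notebook uses: the function names `build_vocabulary` and `to_count_vector` are hypothetical, and tokenization is reduced to lowercasing and splitting on whitespace. In a Spark pipeline one would more likely use a vectorizer from `pyspark.ml.feature`, such as `CountVectorizer`.

```python
from collections import Counter


def build_vocabulary(docs):
    """Map each distinct lowercase token to a column index, in sorted order."""
    tokens = sorted({tok for doc in docs for tok in doc.lower().split()})
    return {tok: idx for idx, tok in enumerate(tokens)}


def to_count_vector(doc, vocab):
    """Return the term-count vector of one document over the vocabulary."""
    counts = Counter(doc.lower().split())
    # Dicts preserve insertion order, so columns follow the sorted vocabulary.
    return [counts.get(tok, 0) for tok in vocab]


docs = ["a great great movie", "a dull movie"]
vocab = build_vocabulary(docs)
vectors = [to_count_vector(doc, vocab) for doc in docs]
print(vocab)    # {'a': 0, 'dull': 1, 'great': 2, 'movie': 3}
print(vectors)  # [[1, 0, 2, 1], [1, 1, 0, 1]]
```

Each row is one document and each column one vocabulary term, which is the numeric form a topic model can be fit on.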

In [16]:
# Set up Python system path to find our modules.
import os
import sys
module_path = os.path.abspath(os.path.join('../src'))
if module_path not in sys.path:
    sys.path.append(module_path)

# Import our modules.
import file_loader as fl

# Add the file to SparkContext for the executor to find it.
sc.addPyFile('../src/file_loader.py')

In [17]:
# Load the data into a parquet file; the first return value is the path to that file.
file_parquet, _ = fl.load_data(base_path, 301, spark)

In [18]:
!ls -d {base_path}/*.parquet

../../data/aclImdb/aclImdb_100000_raw.parquet
../../data/aclImdb/aclImdb_10000_raw.parquet
../../data/aclImdb/aclImdb_1000_raw.parquet
../../data/aclImdb/aclImdb_100_raw.parquet
../../data/aclImdb/aclImdb_20000_raw.parquet
../../data/aclImdb/aclImdb_2000_raw.parquet
../../data/aclImdb/aclImdb_200_raw.parquet
../../data/aclImdb/aclImdb_210_raw.parquet
../../data/aclImdb/aclImdb_211_raw.parquet
../../data/aclImdb/aclImdb_250_raw.parquet
../../data/aclImdb/aclImdb_251_raw.parquet
../../data/aclImdb/aclImdb_252_raw.parquet
../../data/aclImdb/aclImdb_300_raw.parquet
../../data/aclImdb/aclImdb_301_raw.parquet
../../data/aclImdb/aclImdb_50000_raw.parquet


In [19]:
file_parquet

'../../data/aclImdb/aclImdb_301_raw.parquet'

In [None]:
# Read the parquet file into a data frame.
df_pqt = spark.read.parquet(file_parquet)

# As needed.
# df_pqt = df_pqt.drop('words')

# Cache the data frame, then show some observations (entries).
df_pqt.persist()
df_pqt.show()

# Text Cleansing
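
The raw reviews contain HTML markup such as `<br />` line breaks, which would otherwise leak into the vocabulary. The implementation of `fl.transform_html_clean` is not shown in this notebook, so the following is only a minimal sketch of what such a cleansing step could look like on a single string. The name `strip_html` and the regex-based approach are assumptions for illustration, not the notebook's actual code.

```python
import html
import re

# Matches any HTML tag, e.g. <br /> or <i>.
TAG_RE = re.compile(r'<[^>]+>')


def strip_html(text):
    """Replace HTML tags with a space, unescape entities, collapse whitespace."""
    no_tags = TAG_RE.sub(' ', text)
    unescaped = html.unescape(no_tags)
    return re.sub(r'\s+', ' ', unescaped).strip()


print(strip_html('resolution.<br /><br />Written by Ed &amp; Alice'))
# resolution. Written by Ed & Alice
```

Replacing tags with a space rather than the empty string keeps the words on either side of a `<br />` from being glued together.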

In [10]:
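# Clean the HTML out of the reviews; the result goes into the 'textclean' column.
df_pqt = fl.transform_html_clean(df_pqt, 'textclean')

# Take the first five reviews and compare raw vs. cleaned text for one of them.
tt = df_pqt.select('text', 'textclean').take(5)
i = 4
len(tt[i]['text']), len(tt[i]['textclean']), tt[i]['text'], tt[i]['textclean']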

(3471,
 3377,
 'Sophisticated sex comedies are always difficult to pull off. Look at the films of Blake Edwards, who is arguably the master of the genre, and you will find just as many misses as hits. For, if a film of this nature ever fails to work, it can never fall back on the tried and true toilet humor of a teen sex comedy [i.e. "American Pie"], or warm the audience with the sentimentality of a romantic comedy [i.e. Julia Roberts\' entire career]. It can only maintain a push to the end, and hope that the audience can appreciate the almost required irony of it\'s resolution.<br /><br />Written by husband/wife team Wally Wolodarsky and Maya Forbes, "Seeing Other People" opens with engaged couple Ed & Alice [Jay Mohr & Julianne Nicholson] only seconds away from rear-ending the car in front of them. As the frame freezes, we unexpectedly hear the thoughts and fears of both characters. From here on out, we welcome that the story about to unfold will enjoy a point of view from both sexes