# Topic Modeling using Latent Dirichlet Allocation (Clustering)

# Introduction
This notebook uses the [Stanford IMDb Review dataset](http://ai.stanford.edu/~amaas/data/sentiment "Stanford IMDb Large Movie Review Dataset").
Download the dataset, extract it locally, and set the variable `base_path` below to the file-system path of the dataset.

This notebook is about topic modeling using a technique called Latent Dirichlet Allocation (LDA).

# Data Set Loading

In [1]:
# Set the base data path, i.e. the directory that contains the folders test/neg, train/pos, etc.
base_path = "../../data/aclImdb"

# The folders in which to look for the reviews.
data_sets = ['test', 'train']
sa_dir_names = ['neg', 'pos']

# List the content of the data path to check that the data set folders are present.
files = !ls {base_path}
print(files)

['README', 'aclImdb_100000.csv', 'aclImdb_100000_raw.parquet', 'aclImdb_10000_raw.parquet', 'aclImdb_1000_raw.parquet', 'aclImdb_100_raw.parquet', 'aclImdb_20000_raw.parquet', 'aclImdb_2000_raw.parquet', 'aclImdb_200_raw.parquet', 'aclImdb_210_raw.parquet', 'aclImdb_211_raw.parquet', 'aclImdb_250_raw.parquet', 'aclImdb_251_raw.parquet', 'aclImdb_252_raw.parquet', 'aclImdb_300_raw.parquet', 'aclImdb_50000_raw.parquet', 'imdb.vocab', 'imdbEr.txt', 'test', 'train']


# Data Prep

LDA works on numeric features, not on raw text. Each review therefore has to be converted into a feature-vector representation before LDA can compute its metrics. Those metrics then serve to define clusters and group observations together.
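
To make the idea of a feature vector concrete, here is a minimal pure-Python sketch of a bag-of-words (term-count) representation. It is not the pipeline this notebook uses: the toy `docs` corpus, the whitespace tokenizer, and the helper `to_count_vector` are all assumptions made for illustration only.

```python
from collections import Counter

# Hypothetical toy corpus standing in for the IMDb reviews.
docs = [
    "a brilliant film with a brilliant cast",
    "a dull film",
]

# Tokenize by lowercasing and splitting on whitespace (a deliberate simplification).
tokenized = [doc.lower().split() for doc in docs]

# Build a sorted vocabulary so that each term gets a stable column position.
vocabulary = sorted({term for tokens in tokenized for term in tokens})

def to_count_vector(tokens):
    """Return the term-count vector of one document over the vocabulary."""
    counts = Counter(tokens)
    return [counts.get(term, 0) for term in vocabulary]

# One count vector per document; all vectors share the vocabulary's length.
vectors = [to_count_vector(tokens) for tokens in tokenized]

print(vocabulary)
for vector in vectors:
    print(vector)
```

Each document becomes a fixed-length list of term counts, which is the kind of numeric input a topic model such as LDA can consume. A real pipeline would typically also remove stop words and punctuation before counting.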

In [2]:
# Set up Python system path to find our modules.
import os
import sys
module_path = os.path.abspath(os.path.join('../src'))
if module_path not in sys.path:
    sys.path.append(module_path)

# Import our modules.
import file_loader as fl0

In [3]:
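# Load the data into a parquet file; load_data returns the path of that file.
# `spark` is assumed to be an already-created Spark session.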
file_parquet = fl0.load_data(base_path, 301, spark)

In [4]:
!ls -d {base_path}/*.parquet

../../data/aclImdb/aclImdb_100000_raw.parquet
../../data/aclImdb/aclImdb_10000_raw.parquet
../../data/aclImdb/aclImdb_1000_raw.parquet
../../data/aclImdb/aclImdb_100_raw.parquet
../../data/aclImdb/aclImdb_20000_raw.parquet
../../data/aclImdb/aclImdb_2000_raw.parquet
../../data/aclImdb/aclImdb_200_raw.parquet
../../data/aclImdb/aclImdb_210_raw.parquet
../../data/aclImdb/aclImdb_211_raw.parquet
../../data/aclImdb/aclImdb_250_raw.parquet
../../data/aclImdb/aclImdb_251_raw.parquet
../../data/aclImdb/aclImdb_252_raw.parquet
../../data/aclImdb/aclImdb_300_raw.parquet
../../data/aclImdb/aclImdb_301_raw.parquet
../../data/aclImdb/aclImdb_50000_raw.parquet


In [5]:
file_parquet

'../../data/aclImdb/aclImdb_301_raw.parquet'

In [6]:
# Read the parquet file into a data frame.
df_pqt = spark.read.parquet(file_parquet)

# As needed.
# df_pqt = df_pqt.drop('words')

# Cache the data frame, then show some observations (entries).
df_pqt.persist()
df_pqt.show()

+-----------+-----------+----------------+--------+--------------+------------+--------------------+
|datasettype|   filename| datetimecreated|reviewid|reviewpolarity|reviewrating|                text|
+-----------+-----------+----------------+--------+--------------+------------+--------------------+
|       test|10167_9.txt|20181024T155532Z|   10167|             1|           9|A film that tends...|
|       test| 903_10.txt|20181024T155532Z|     903|             1|          10|This is a well do...|
|       test| 1466_8.txt|20181024T155532Z|    1466|             1|           8|A strange relatio...|
|       test| 6176_8.txt|20181024T155532Z|    6176|             1|           8|This is a brillia...|
|       test|5124_10.txt|20181024T155532Z|    5124|             1|          10|i read the book b...|
|       test|2807_10.txt|20181024T155532Z|    2807|             1|          10|I played Sam (the...|
|       test| 4038_9.txt|20181024T155532Z|    4038|             1|           9|"Snow Queen"