# Univ. of Illinois Data Mining Project on Coursera
## Task 01 - Initial Topic Investigation
2018-09-16
loganjtravis@gmail.com (Logan Travis)

In [4]:
# Suppress warnings. Python 3.7 complains about a number of packages using a
# soon-to-be-deprecated import syntax.
import warnings
warnings.filterwarnings("ignore")

In [5]:
# Imports
import os, pickle, random
import gensim.models as models, gensim.matutils as matutils, gensim.corpora as corpora
import nltk
from scipy.sparse import load_npz, save_npz
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd
import pyLDAvis.gensim

In [6]:
# Set random seed for repeatability
random.seed(42)

### Summary

From course page [Week 1 > Task 1 Information > Task 1 Overview](https://www.coursera.org/learn/data-mining-project/supplement/z2jpZ/task-1-overview):

> The goal of this task is to explore the Yelp data set to get a sense about what the data look like and their characteristics. You can think about the goal as being to answer questions such as:
> 
> 1. What are the major topics in the reviews? Are they different in the positive and negative reviews? Are they different for different cuisines?
> 2. What does the distribution of the number of reviews over other variables (e.g., cuisine, location) look like?
> 3. What does the distribution of ratings look like?
>
> In general, you can address such questions by showing visualization of statistics computed based on the data set or topics extracted from review text.

### Grading Rubric

From course page [Week 1 > Task 1 Information > Task 1 Rubric](https://www.coursera.org/learn/data-mining-project/supplement/Xk8lq/task-1-rubric):

> You will evaluate your peers' submission for Task 1 using this rubric. While evaluating, consider the following questions:
> 
> * Application of a topic model: Was the description of the topic modeling procedure clear enough such that you can produce the same results?
> * Topic visualization: Does the topic visualization effectively display the data?
> * Data exploration: Was the description of the two sets of data they selected for comparison clear enough to follow?
> * Visualization comparison: Does the visualization component highlight the differences/similarities between the data?
> 
> Note that the examples listed in the "Excellent" column are not an exclusive list for each category. You may choose to award 6 points for any effort in your peers' submissions that goes beyond what is required.
> 
> | Criteria | Poor (1 point) | Fair (3 points) | Good (5 points) | Excellent (6 points) |
> | --- | --- | --- | --- | --- |
> | **Task 1.1: Application of a topic model** | A topic model was either not used or did not generate any topic. | A topic model was used, but the report fails to mention what model was used and/or how it is applied to the data set. | The report clearly explains what topic model was used and how it was applied to the data set. | For example, multiple topic models were used and the report analyzes the differences between them. |
> | **Task 1.1: Generated visualization** | The visualization is either absent or useless. | The visualization is present but does not help make clear what topics the people have talked about in the reviews. | The visualization clearly shows and distinguishes what topics people have talked about in the reviews. | For example, multiple visualizations were used and the report analyzes the comparative strengths of each. |
> | **Task 1.2: Generated sets of topics** | The two subsets are not comparable. | The two subsets are comparable. A topic model was used on the two subsets, but the report fails to mention what model was used and/or how it was applied to the data set. | The two subsets are comparable. The report clearly explains what topic model was used and how it was applied to the two subsets. | For example, multiple interesting subsets were identified and assessed for their usefulness, or multiple topic models were applied to the two subsets with differences between them analyzed. |
> | **Task 1.2: Visualization of comparison** | The two subsets are visualized in such a way that similarities and differences are not clear. | The two subsets are visualized in such a way to show the similarity of the two subsets, but no attempt was made to show the differences. | The two subsets are visualized in such a way that both similarities and differences are very apparent. | Extra transformation of the data was done to improve visualization, or multiple ways of visualizing the topics were used to provide a very comprehensive comparison. |
> | **Visualizations: Appropriateness of choice** | The visualization methods are not suitable for the type of data. | The visualization methods are suitable for the type of data, but another way to visualize the data is clearly better. | The visualization methods used are quite suitable for the type of data and made relationships clear. | Furthermore, extra effort was made to make the visualizations beautifully designed and/or usefully interactive. |

### Get Data Set

Note: I cleaned and saved a Pandas dataframe (as a GZIPped pickle) from the Yelp reviews dataset in a separate notebook "task00-yelp-reviews-to-pandas-dataframe.ipynb".

In [7]:
# Set paths to data source, work in process ("WIP"), and output
PATH_SOURCE = "source/"
PATH_WIP = "wip/"
PATH_OUTPUT = "output/"

# Set review file path and work-in-process file paths
PATH_SOURCE_YELP_REVIEWS = PATH_SOURCE + "yelp_academic_dataset_review.pkl.gzip"
PATH_WIP_TOKENIZER = PATH_WIP + "task01_tokenizer.pkl"
PATH_WIP_TOKEN_MATRIX = PATH_WIP + "task01_token_matrix.npz"
PATH_WIP_LDA_MODEL = PATH_WIP + "task01_lda_model"

In [8]:
# Read pickled dataframe
dfYelpReviews = pd.read_pickle(PATH_SOURCE_YELP_REVIEWS)

In [9]:
# Print dataframe shape and head
print(f"Shape: {dfYelpReviews.shape}")
dfYelpReviews.head()

Shape: (1125458, 9)


| review_id | business_id | date | stars | text | type | user_id | votes_cool | votes_funny | votes_useful |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 15SdjuK7DmYqUAj6rjGowg | vcNAWiLM4dR7D2nwwJ7nCA | 2007-05-17 | 5 | dr. goldberg offers everything i look for in a... | review | Xqd0DzHaiyRqVH3WRG7hzg | 1 | 0 | 2 |
| RF6UnRTtG7tWMcrO2GEoAg | vcNAWiLM4dR7D2nwwJ7nCA | 2010-03-22 | 2 | Unfortunately, the frustration of being Dr. Go... | review | H1kH6QZV7Le4zqTRNxoZow | 0 | 0 | 2 |
| -TsVN230RCkLYKBeLsuz7A | vcNAWiLM4dR7D2nwwJ7nCA | 2012-02-14 | 4 | Dr. Goldberg has been my doctor for years and ... | review | zvJCcrpm2yOZrxKffwGQLA | 1 | 0 | 1 |
| dNocEAyUucjT371NNND41Q | vcNAWiLM4dR7D2nwwJ7nCA | 2012-03-02 | 4 | Been going to Dr. Goldberg for over 10 years. ... | review | KBLW4wJA_fwoWmMhiHRVOA | 0 | 0 | 0 |
| ebcN2aqmNUuYNoyvQErgnA | vcNAWiLM4dR7D2nwwJ7nCA | 2012-05-15 | 4 | Got a letter in the mail last week that said D... | review | zvJCcrpm2yOZrxKffwGQLA | 1 | 0 | 2 |


### TF-IDF


In [10]:
# Set flag to load token matrix from file if found; set this to False when changing
# other parameters
load_token_matrix_from_file = True
overwrite_saved_token_matrix = not load_token_matrix_from_file

# Set token limit
max_features = 10000

# Set document frequency ceiling; topic analysis will ignore words found in more
# than this share of documents
max_df = 0.5

# Set document frequency floor; topic analysis will ignore words found in fewer
# than this share of documents
min_df = 0.001
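
To make the two document-frequency bounds concrete, the sketch below applies the same kind of filtering to a toy corpus in plain Python. It is an illustration only, not scikit-learn's implementation: the `filter_vocabulary` helper and the toy reviews are hypothetical. It assumes float bounds are read as shares of the document count and that the `max_features` cap is applied after the bounds, keeping the most frequent surviving terms.

```python
from collections import Counter


def filter_vocabulary(tokenized_docs, min_df=0.0, max_df=1.0, max_features=None):
    """Return the sorted vocabulary that survives document-frequency filtering.

    Hypothetical helper: min_df and max_df are shares of the document count.
    """
    n_docs = len(tokenized_docs)
    doc_freq = Counter()   # number of documents containing each term
    term_freq = Counter()  # total occurrences of each term
    for tokens in tokenized_docs:
        term_freq.update(tokens)
        doc_freq.update(set(tokens))

    low, high = min_df * n_docs, max_df * n_docs
    kept = [term for term in doc_freq if low <= doc_freq[term] <= high]

    # Apply the vocabulary cap last, keeping the most frequent terms
    kept.sort(key=lambda term: (-term_freq[term], term))
    if max_features is not None:
        kept = kept[:max_features]
    return sorted(kept)


# Toy corpus (made up for illustration)
toy_docs = [
    ["great", "food", "pizza"],
    ["great", "service", "pizza"],
    ["great", "massage", "spa"],
    ["great", "food", "gelato"],
]

# "great" appears in every document (above the ceiling); the words that
# appear only once fall below the floor
vocab = filter_vocabulary(toy_docs, min_df=0.5, max_df=0.75)
print(vocab)
```

Because the bounds prune before the cap, the final vocabulary can end up smaller than `max_features`, which is one plausible reason the token matrix further down reports fewer than 10,000 tokens.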

#### Custom Tokenizer

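The `TfidfVectorizer` class has a default pre-processor and tokenizer. The pre-processing step meets my needs for lower-casing, but the default tokenizer neither lemmatizes nor stems words, and either step should produce more stable topics. I therefore create my own tokenizer, which currently applies lemmatization only. One caveat: punctuation removal comes from the default tokenizer's token pattern rather than the pre-processor, so replacing the tokenizer lets punctuation tokens such as `]` and `n't` through (they appear in the topics below).

Note: I create `MyTokenizer` as a class to internalize instantiation of NLTK's `WordNetLemmatizer`.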

In [11]:
class MyTokenizer:
    """Callable string tokenizer that lemmatizes each token with WordNet."""

    def __init__(self):
        # Instantiate the lemmatizer once and reuse it for every document
        self.wnl = nltk.stem.WordNetLemmatizer()

    def __call__(self, document):
        """Return lemmatized tokens from a string."""
        return [self.wnl.lemmatize(token) for token in nltk.word_tokenize(document)]
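
Because a custom tokenizer replaces the vectorizer's default token pattern, any punctuation filtering has to happen inside the tokenizer itself. The sketch below shows one possible variant that keeps only alphabetic tokens (with an optional apostrophe suffix) before normalizing them. `FilteringTokenizer`, its regular expression, and the `normalize` hook are hypothetical names for illustration. Passing `nltk.stem.WordNetLemmatizer().lemmatize` as `normalize` should restore lemmatization, but that is not exercised here so the example runs without NLTK data.

```python
import re


class FilteringTokenizer:
    """Callable tokenizer that drops punctuation-only tokens (hypothetical sketch)."""

    def __init__(self, normalize=None):
        # normalize is any str -> str function, e.g. a lemmatizer; identity by default
        self.normalize = normalize if normalize is not None else (lambda token: token)
        # Letters, optionally followed by an apostrophe and more letters
        self.pattern = re.compile(r"[a-z]+(?:'[a-z]+)?")

    def __call__(self, document):
        """Return normalized alphabetic tokens from a string."""
        return [self.normalize(token) for token in self.pattern.findall(document.lower())]


tokenizer = FilteringTokenizer()
tokens = tokenizer("Dr. Goldberg's [office] isn't far!")
print(tokens)
```

Lower-casing inside the tokenizer duplicates the vectorizer's pre-processing, which is harmless and keeps the class usable on its own.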

#### Vectorized TF-IDF


In [12]:
# Create TF-IDF vectorizer limited by vocabulary size and document-frequency bounds
vectorizer = TfidfVectorizer(max_features=max_features, max_df=max_df, min_df=min_df,
                             stop_words="english", use_idf=True, tokenizer=MyTokenizer())
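
As a rough illustration of the weights the vectorizer produces, the sketch below computes TF-IDF by hand for a toy corpus. It assumes the smoothed inverse document frequency `ln((1 + n) / (1 + df)) + 1` followed by L2 normalization of each document, which I understand to be scikit-learn's defaults. The `tfidf_rows` helper and the toy documents are hypothetical.

```python
import math
from collections import Counter


def tfidf_rows(tokenized_docs):
    """Return one {term: weight} dict per document (hypothetical helper)."""
    n_docs = len(tokenized_docs)
    doc_freq = Counter(term for tokens in tokenized_docs for term in set(tokens))
    # Smoothed inverse document frequency (assumed formula, see lead-in)
    idf = {term: math.log((1 + n_docs) / (1 + df)) + 1 for term, df in doc_freq.items()}

    rows = []
    for tokens in tokenized_docs:
        counts = Counter(tokens)
        weights = {term: count * idf[term] for term, count in counts.items()}
        # L2-normalize so every document vector has unit length
        norm = math.sqrt(sum(weight ** 2 for weight in weights.values()))
        rows.append({term: weight / norm for term, weight in weights.items()})
    return rows


rows = tfidf_rows([["great", "pizza"], ["great", "spa"]])
for row in rows:
    print({term: round(weight, 3) for term, weight in row.items()})
```

In the first toy document, "pizza" receives a higher weight than "great" because "great" appears in every document.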

In [13]:
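# Set working dataframe to a 30% sample of the full data set; too large otherwise.
# Pass random_state because random.seed() does not seed the pandas/NumPy sampler.
df = dfYelpReviews.sample(frac=0.3, random_state=42)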

In [14]:
%%time
# Load token matrix and vectorizer from file if found and flag set to permit;
# otherwise vectorize documents
tokenMatrix = None
if (load_token_matrix_from_file
        and os.path.isfile(PATH_WIP_TOKEN_MATRIX)
        and os.path.isfile(PATH_WIP_TOKENIZER)):
    print(f"Loading token matrix from file \"{PATH_WIP_TOKEN_MATRIX}\"...")
    tokenMatrix = load_npz(PATH_WIP_TOKEN_MATRIX)
    with open(PATH_WIP_TOKENIZER, "rb") as f:
        vectorizer = pickle.load(f)
else:
    print("Vectorizing documents to build token matrix...")
    tokenMatrix = vectorizer.fit_transform(df.text)
    overwrite_saved_token_matrix = True

Loading token matrix from file "wip/task01_token_matrix.npz"...
CPU times: user 1.17 s, sys: 83.4 ms, total: 1.26 s
Wall time: 1.25 s


In [15]:
# Print token matrix shape
print("Found {0[1]:,} tokens in {0[0]:,} documents".format(tokenMatrix.shape))

Found 4,653 tokens in 337,637 documents


In [16]:
# Save token matrix and vectorizer to file if changed
if overwrite_saved_token_matrix:
    save_npz(PATH_WIP_TOKEN_MATRIX, tokenMatrix)
    with open(PATH_WIP_TOKENIZER, "wb") as f:
        pickle.dump(vectorizer, f)
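
The load-or-recompute pattern used for the token matrix (and again for the LDA model below) can be sketched as a single helper. This is a simplified illustration using only `pickle` and a temporary directory. `load_or_compute` and the demo file name are hypothetical, and unlike the cells above it saves immediately instead of in a separate step.

```python
import os
import pickle
import tempfile


def load_or_compute(path, compute, load_from_file=True):
    """Return (result, loaded_from_file) using a pickle cache at path.

    Hypothetical helper: compute is a zero-argument function that is
    called only when the cache is missing or loading is disabled.
    """
    if load_from_file and os.path.isfile(path):
        with open(path, "rb") as f:
            return pickle.load(f), True
    result = compute()
    with open(path, "wb") as f:
        pickle.dump(result, f)
    return result, False


# Demonstration with a throwaway cache file
cache_path = os.path.join(tempfile.mkdtemp(), "demo_cache.pkl")
first, first_loaded = load_or_compute(cache_path, lambda: {"tokens": 3})
second, second_loaded = load_or_compute(cache_path, lambda: {"tokens": 99})
print(first, first_loaded)
print(second, second_loaded)
```

The second call returns the cached value and ignores the new computation, which is why the flag needs to be set to `False` after changing any parameter.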

### Find Topics Using LDA


In [47]:
# Set flag to load LDA model from file if found; set this to False when changing
# other parameters
load_lda_model_from_file = True
overwrite_saved_lda_model = not load_lda_model_from_file

# Set number of topics
num_topics = 50

# Set number of words to display for each topic
num_words = 10

In [48]:
# Convert token vectors to a Gensim corpus (rows are documents)
corpus = matutils.Sparse2Corpus(tokenMatrix, documents_columns=False)

In [49]:
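# Create a Gensim dictionary for documents; Note: Passes the vectorizer tokens as
# a single "document". (scikit-learn 1.2 removed get_feature_names(); newer
# versions provide get_feature_names_out() instead.)
dictionary = corpora.Dictionary([vectorizer.get_feature_names()])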
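
For readers unfamiliar with Gensim's corpus format, the sketch below mimics the conversion in plain Python: each document becomes a list of `(token_id, weight)` pairs, and an id-to-word mapping ties each column index back to its token. The `rows_to_bow` helper, the toy matrix, and the three feature names are hypothetical. The point is that column `j` of the token matrix must map to feature name `j`.

```python
def rows_to_bow(rows):
    """Convert dense rows (one per document) to lists of (token_id, weight) pairs."""
    return [
        [(col, weight) for col, weight in enumerate(row) if weight != 0.0]
        for row in rows
    ]


# Toy stand-ins for the vectorizer's feature names and the token matrix
feature_names = ["food", "massage", "pizza"]
id2word = dict(enumerate(feature_names))
toy_matrix = [
    [0.6, 0.0, 0.8],  # document 0 mentions "food" and "pizza"
    [0.0, 1.0, 0.0],  # document 1 mentions only "massage"
]

toy_corpus = rows_to_bow(toy_matrix)
readable = [[(id2word[token_id], weight) for token_id, weight in doc] for doc in toy_corpus]
print(toy_corpus)
print(readable)
```

Building the mapping with `dict(enumerate(...))` makes the column alignment explicit, which may be a simpler alternative to passing the tokens through a dictionary object.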

In [50]:
%%time
# Load LDA model from file if found and flag set to permit; otherwise find topics
lda = None
if load_lda_model_from_file and os.path.isfile(PATH_WIP_LDA_MODEL):
    print(f"Loading LDA model from \"{PATH_WIP_LDA_MODEL}\"...")
    lda = models.ldamulticore.LdaMulticore.load(PATH_WIP_LDA_MODEL)
else:
    print("Finding topics using LDA...")
    lda = models.ldamulticore.LdaMulticore(
        corpus, num_topics=num_topics, id2word=dict(dictionary.items())
    )
    overwrite_saved_lda_model = True

Finding topics using LDA...
CPU times: user 4min 42s, sys: 4.25 s, total: 4min 46s
Wall time: 4min 45s


In [54]:
# Print topics
lda.show_topics(num_topics=num_topics, num_words=num_words)

[(0,
  '0.028*"filipino" + 0.023*"candy" + 0.022*"bistro" + 0.022*"ravioli" + 0.019*"cheeseburger" + 0.019*"safeway" + 0.018*"di" + 0.018*"ramsay" + 0.017*"jose" + 0.016*"tony"'),
 (1,
  '0.028*"massage" + 0.015*"spa" + 0.013*"haircut" + 0.009*"great" + 0.009*"groupon" + 0.008*"facial" + 0.007*"barber" + 0.007*"workout" + 0.007*"hair" + 0.007*"relaxing"'),
 (2,
  '0.024*"dog" + 0.012*"dr." + 0.010*"doctor" + 0.009*"care" + 0.008*"pet" + 0.008*"office" + 0.008*"patient" + 0.007*"staff" + 0.007*"animal" + 0.006*"n\'t"'),
 (3,
  '0.019*"gelato" + 0.015*"omelette" + 0.014*"cookie" + 0.013*"cooky" + 0.012*"]" + 0.012*"cheesecake" + 0.011*"juice" + 0.010*"http" + 0.010*"croissant" + 0.010*"john"'),
 (4,
  '0.035*"irish" + 0.027*"print" + 0.026*"beatles" + 0.026*"bank" + 0.025*"refund" + 0.024*"jungle" + 0.023*"hut" + 0.018*"voucher" + 0.016*"robert" + 0.016*"pickup"'),
 (5,
  '0.042*"great" + 0.030*"food" + 0.022*"korean" + 0.021*"awesome" + 0.020*"service" + 0.018*"atmosphere" + 0.015*"hawa

In [55]:
# Save LDA model to file if changed
if(overwrite_saved_lda_model):
    lda.save(PATH_WIP_LDA_MODEL)
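
The topic strings printed above are awkward to compare or plot directly. As a minimal sketch, they can be parsed into `(word, weight)` pairs with a regular expression. The `parse_topic` helper and `TERM_PATTERN` are hypothetical names, and the sample string is copied from the first terms of topic 1. I believe Gensim can also return the pairs directly via `show_topics(formatted=False)`, which would avoid parsing altogether.

```python
import re

# Matches one weight*"word" term, e.g. 0.028*"massage"
TERM_PATTERN = re.compile(r'([0-9.]+)\*"([^"]+)"')


def parse_topic(topic_string):
    """Return a list of (word, weight) pairs from a formatted topic string."""
    return [(word, float(weight)) for weight, word in TERM_PATTERN.findall(topic_string)]


sample = '0.028*"massage" + 0.015*"spa" + 0.013*"haircut"'
terms = parse_topic(sample)
print(terms)
```

A term cut off by truncated output (no closing quote) simply fails to match and is skipped.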

### Graphing Topics


In [57]:
%%time
# Prepare visualization
vis = pyLDAvis.gensim.prepare(lda, corpus, dictionary)

CPU times: user 3min, sys: 1.41 s, total: 3min 1s
Wall time: 3min 4s


In [58]:
# Display visualization
pyLDAvis.display(vis)