# Learning from Big Data: Module 1 - Natural Language Processing

#### Session 3 - LDA and Word2Vec

# Introduction
#### This file illustrates `LDA` (Latent Dirichlet Allocation) and `Word2Vec`. 

# 1. Loading Packages

In [2]:
# Loading the required packages
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from gensim.models import Word2Vec
from gensim.utils import simple_preprocess

# 2. Loading the Reviews

Next, we load the review data. **Note** that we pass `encoding='ISO-8859-1'` to the `pd.read_csv()` function. This ensures that non-ASCII symbols in the review text are interpreted correctly for further processing.
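To see why the encoding matters, here is a minimal sketch using only the standard library. The sample string is made up for illustration and is not taken from the review data. It encodes a short text as ISO-8859-1 bytes and then decodes those bytes in two ways.

```python
# Hypothetical sample text containing a non-ASCII character
text = "Amélie was a great film"

# Bytes as they would be stored in an ISO-8859-1 (Latin-1) encoded file
raw_bytes = text.encode("ISO-8859-1")

# Decoding with the matching encoding recovers the original text
decoded_ok = raw_bytes.decode("ISO-8859-1")
print(decoded_ok)

# Decoding the same bytes as UTF-8 fails: the byte 0xE9 ("é" in Latin-1)
# is not followed by the continuation bytes that UTF-8 expects
try:
    raw_bytes.decode("utf-8")
    utf8_failed = False
except UnicodeDecodeError:
    utf8_failed = True
print(utf8_failed)
```

The same mismatch is what can make reading a Latin-1 file fail or produce garbled characters when a different encoding is assumed.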

In [3]:
# Loading the review data.
reviews_raw = pd.read_csv('../../data/reviews/reviews_tiny.csv', encoding='ISO-8859-1')
reviews_raw = reviews_raw[
    ['movie_name',
     'review_code',
     'reviewer',
     'review_date',
     'num_eval',
     'prob_sentiment',
     'words_in_lexicon_sentiment_and_review',
     'ratio_helpful',
     'raters',
     'prob_storyline',
     'prob_acting',
     'prob_sound_visual',
     'full_text',
     'processed_text',
     'release_date',
     'first_week_box_office',
     'MPAA',
     'studio',
     'num_theaters']
]

TOT_REVIEWS = len(reviews_raw)

### 2.1 Calculating the likelihoods with your own content likelihood file

In [4]:
# TODO: compute the content likelihoods for all the words in the training data...
likelihoods_content = pd.read_csv('../../data/lexicons/example_100_fake_likelihood_content.csv')

### 2.2 Inspecting the list of words to be passed to LDA:

In [5]:
# Converting the first column to a list of strings
lexicon_content = likelihoods_content.iloc[:, 0].values.astype('U')
print(lexicon_content)

['story' 'hero' 'world' 'character' 'moral' 'audience' 'opponent' 'ofthe'
 'scene' 'one' 'not' 'characters' 'plot' 'will' 'can' 'also' 'man'
 'desire' 'stories' 'two' 'time' 'see' 'line' 'great' 'must' 'good' 'way'
 'revelation' 'ofa' 'first' 'need' 'make' 'michael' 'change' 'house'
 'heros' 'action' 'main' 'get' 'love' 'dialogue' 'selfrevelation' 'many'
 'just' 'technique' 'end' 'structure' 'steps' 'tells' 'life' 'argument'
 'symbol' 'key' 'george' 'wants' 'anatomy' 'only' 'theme' 'use' 'well'
 'even' 'single' 'place' 'principle' 'opposition' 'comes' 'look' 'values'
 'rick' 'storyteller' 'new' 'point' 'writers' 'big' 'web' 'within' 'says'
 'premise' 'scenes' 'people' 'conflict' 'human' 'weakness' 'back' 'take'
 'form' 'down' 'beginning' 'give' 'come' 'show' 'designing' 'doesnt'
 'makes' 'king' 'three' 'example' 'family' 'plan' 'know']


# 3. Unsupervised Learning: Latent Dirichlet Allocation (LDA)

In [6]:
# Creating a CountVectorizer to create the Document-Term matrix
vectorizer = CountVectorizer(analyzer='word',       
                             vocabulary={word: i for i, word in enumerate(lexicon_content)}, 
                             stop_words='english',             
                             lowercase=True,                   
                             token_pattern='[a-zA-Z0-9]{3,}',  
                            )

# Applying the vectorizer
data_vectorized = vectorizer.fit_transform(reviews_raw['processed_text'])

**Next**, we will set the `LDA` parameters.
+ `k` is the number of topics we ask LDA to estimate. In supervised learning, we set it equal to 3. In this example, we arbitrarily set `k` equal to 10.
+ `SEED` is for replicability (i.e., to obtain the same results every time the code is run).
+ `ITER` is the maximum number of iterations (passes over the data) for the variational Bayes algorithm used by sklearn's LDA implementation.
    + In the unlikely case that you get a "no convergence" warning, you may increase `ITER` to 2000 or 4000.

In [7]:
# Setting the LDA parameters
SEED = 100
ITER = 1000
k = 10

#### Tip: choosing which `k` to use in LDA is a **model selection problem**.
Typically, the best approach is to fit a model for each candidate value of `k`, save each model's log-likelihood, and choose the `k` that produces the highest log-likelihood.
+ The `LatentDirichletAllocation` object in sklearn has a method called `score`, which returns an approximate log-likelihood.
+ The `score` method can be used after the model has been fitted, as follows:
  + `loglikelihood_k = lda_model_k.score(data_vectorized)`
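The selection loop described in this tip can be sketched as follows. This is a minimal illustration on a synthetic document-term matrix, not the review data. The names `toy_dtm`, `candidate_ks`, and `loglik_by_k` are hypothetical, and the grid of candidate values is arbitrary.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

# Synthetic document-term matrix (20 documents, 15 terms), for illustration only
rng = np.random.default_rng(100)
toy_dtm = rng.poisson(lam=1.0, size=(20, 15))

candidate_ks = [2, 3, 5]   # hypothetical grid of topic counts
loglik_by_k = {}

for k_candidate in candidate_ks:
    lda_k = LatentDirichletAllocation(n_components=k_candidate,
                                      max_iter=20,
                                      random_state=100)
    lda_k.fit(toy_dtm)
    # score() returns an approximate log-likelihood (higher is better)
    loglik_by_k[k_candidate] = lda_k.score(toy_dtm)

# Pick the candidate with the highest log-likelihood
best_k = max(loglik_by_k, key=loglik_by_k.get)
print(loglik_by_k)
print(f"Selected k = {best_k}")
```

A common variation is to call `score` on a held-out split rather than on the data used for fitting, so that the comparison reflects how well each model generalizes.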

**Next**, we will run the LDA and save the model. The model produced by `LatentDirichletAllocation()` is an object of class LatentDirichletAllocation (https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.LatentDirichletAllocation.html). This object holds the topics, the score (log-likelihood), and a lot more. To extract these elements, use the attributes and methods listed in the documentation.

In [8]:
# Fitting the LDA Model
lda_model = LatentDirichletAllocation(n_components=k,               
                                      max_iter=ITER,
                                      learning_method='online',
                                      random_state=SEED,          
                                      batch_size=128,            
                                      evaluate_every = -1,       
                                      n_jobs = -1,               
                                     )
lda_output = lda_model.fit_transform(data_vectorized)

#### Printing the log-likelihood.

In [9]:
log_likelihood = lda_model.score(data_vectorized)
print(f"The log-likelihood for k = {k} is {log_likelihood:.3f}")

The log-likelihood for k = 10 is -56185.756


#### Inspecting the posteriors.

In [10]:
# The columns/topics names (formatting)
topic_names = ["Topic" + str(i + 1) for i in range(lda_model.n_components)]

# The rows/indices names (formatting)
doc_names = ["Review_" + str(i + 1) for i in range(data_vectorized.shape[0])]

# Posterior probabilities per document by topic
df_document_topic = pd.DataFrame(np.round(lda_output, 3), columns=topic_names, index=doc_names)
print(df_document_topic)

             Topic1  Topic2  Topic3  Topic4  Topic5  Topic6  Topic7  Topic8  \
Review_1      0.058   0.005   0.465   0.005   0.005   0.005   0.005   0.441   
Review_2      0.003   0.003   0.272   0.003   0.003   0.003   0.106   0.412   
Review_3      0.003   0.397   0.477   0.003   0.003   0.003   0.003   0.058   
Review_4      0.003   0.291   0.465   0.003   0.003   0.003   0.003   0.003   
Review_5      0.011   0.495   0.272   0.011   0.011   0.011   0.011   0.011   
...             ...     ...     ...     ...     ...     ...     ...     ...   
Review_996    0.011   0.011   0.194   0.011   0.011   0.011   0.011   0.338   
Review_997    0.008   0.008   0.008   0.008   0.008   0.008   0.290   0.287   
Review_998    0.006   0.435   0.006   0.006   0.006   0.006   0.355   0.168   
Review_999    0.009   0.694   0.009   0.009   0.009   0.009   0.009   0.009   
Review_1000   0.006   0.006   0.006   0.006   0.006   0.006   0.006   0.707   

             Topic9  Topic10  
Review_1      0.005 

In [11]:
# Printing the 999th review (row index 998)
print(df_document_topic[998:999])

            Topic1  Topic2  Topic3  Topic4  Topic5  Topic6  Topic7  Topic8  \
Review_999   0.009   0.694   0.009   0.009   0.009   0.009   0.009   0.009   

            Topic9  Topic10  
Review_999   0.233    0.009  


#### Tip: for the data splits.
For the data splits, mind the time dimension if you can. It is best to train on a split that temporally precedes the prediction split, so that the model never learns from reviews written after the ones it predicts. Sometimes that is not viable, but it is good to be aware of it.
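A time-ordered split can be sketched as follows. This is a minimal illustration on a made-up data frame. It assumes that `review_date` holds ISO-style date strings, which may not match the actual file, and the 70/30 cut is only one possible choice.

```python
import pandas as pd

# Made-up stand-in for the review data; only the date column matters here
toy_reviews = pd.DataFrame({
    "review_code": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    "review_date": ["2010-03-01", "2009-07-15", "2011-01-20", "2010-11-05",
                    "2009-02-10", "2012-06-30", "2011-09-12", "2010-05-25",
                    "2012-01-08", "2009-12-31"],
})

# Assumed ISO-style date strings; another format would need an explicit `format=`
toy_reviews["review_date"] = pd.to_datetime(toy_reviews["review_date"])

# Sort chronologically, then cut at 70%: earlier reviews train, later ones predict
toy_sorted = toy_reviews.sort_values("review_date").reset_index(drop=True)
cutoff = int(0.7 * len(toy_sorted))
train_split = toy_sorted.iloc[:cutoff]
prediction_split = toy_sorted.iloc[cutoff:]

print(len(train_split), len(prediction_split))
```

Because the rows are sorted before cutting, every review in `train_split` is dated no later than any review in `prediction_split`.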

# 4. Unsupervised Learning: Word Embeddings

Our word embedding example has three steps.
+ First, run Word2Vec to train a model using the training data split.
+ Second, use the trained model to analyze the prediction data split.
+ Third, use the constructed variables to forecast the `box office`.

### Step 1: Training step

In [15]:
full_data = reviews_raw['full_text'].str.lower()

In [16]:
# TODO: use a split of the data here (say 70%) instead of the entire dataset
# train_data = ...

In [17]:
# Tokenizing each sentence into a list of words
full_data = [simple_preprocess(line, deacc=True) for line in full_data]

# Number of embedding dimensions for Word2Vec (called "topics" here by analogy with LDA)
topics_word2vec = 10

# Training the Word2Vec model
model = Word2Vec(full_data, vector_size=topics_word2vec, sg=0, epochs=20)

# The embeddings in gensim's Word2Vec model can be accessed via the 'wv' attribute
embeddings = model.wv

### Step 2: Constructing variables from word embeddings

In [18]:
# TODO: use the other split of the data here (30%)
# test_data = ...

In [19]:
# Initializing the embeddings matrix
all_embeddings = np.zeros((TOT_REVIEWS, topics_word2vec))

# Looping through each review
for review in range(TOT_REVIEWS):
    
    # Tokenizing the review into separate words, with the same settings as in training
    tokenized_review = simple_preprocess(reviews_raw['full_text'].iloc[review], deacc=True)

    # Getting the word vectors per review
    embedding_review = [] # Initializing an empty list to store the word vectors

    # Looping through each word in the tokenized review
    for word in tokenized_review:
    
        # Checking if the word exists in the Word2Vec model vocabulary
        if word in model.wv.key_to_index:
        
            # If it does, get its vector and add it to the list
            word_vector = model.wv[word]
            embedding_review.append(word_vector)

    # If none of the review's words are in the Word2Vec vocabulary,
    # skip the review; its row in all_embeddings stays all zeros
    if not embedding_review:
        continue
    
    # Compute mean across all words in the review 
    all_embeddings[review, :] = np.mean(embedding_review, axis=0)

#### Inspecting the embeddings

In [21]:
# Word embeddings per document by topic (these are not probabilities!)
# The columns/topics names (formatting)
topic_names_w2v = ["Topic" + str(i + 1) for i in range(topics_word2vec)]

# The rows/indices names (formatting)
doc_names_w2v = ["Review_" + str(i + 1) for i in range(all_embeddings.shape[0])]

# Mean word embedding per document by dimension
df_document_w2v_topic = pd.DataFrame(np.round(all_embeddings, 3), columns=topic_names_w2v, index=doc_names_w2v)
print(df_document_w2v_topic)

             Topic1  Topic2  Topic3  Topic4  Topic5  Topic6  Topic7  Topic8  \
Review_1     -0.319  -0.674   0.766  -0.042   0.346   0.958   0.392   0.532   
Review_2     -0.190  -0.686   0.530  -0.007   0.142   0.851   0.492   0.490   
Review_3     -0.115  -0.913   0.390  -0.021   0.542   0.905   0.582   0.222   
Review_4     -0.623  -0.874   0.476  -0.287   0.394   1.418   0.740   0.289   
Review_5     -0.023  -0.788   0.304  -0.112   0.534   0.937   0.758   0.416   
...             ...     ...     ...     ...     ...     ...     ...     ...   
Review_996    0.131  -0.716   0.515   0.097  -0.034   0.822   0.221   0.474   
Review_997    0.088  -0.727   0.397   0.519   0.851   0.573   0.478   0.111   
Review_998    0.474  -1.082  -0.132   0.019   0.633   1.190   0.553  -0.278   
Review_999    0.508  -0.592   0.284   0.148   0.832   0.735   0.353  -0.102   
Review_1000   0.169  -0.998   0.225   0.085   0.135   0.631   0.329   0.378   

             Topic9  Topic10  
Review_1     -0.479 

In [22]:
# Printing the 999th review (row index 998)
print(df_document_w2v_topic[998:999])

            Topic1  Topic2  Topic3  Topic4  Topic5  Topic6  Topic7  Topic8  \
Review_999   0.508  -0.592   0.284   0.148   0.832   0.735   0.353  -0.102   

            Topic9  Topic10  
Review_999  -0.954   -0.533  


### Step 3: Using the constructed variables to forecast the `box office`

In [23]:
# TODO: Implementation...
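One possible way to approach this step is sketched below on synthetic data. This is an illustration under stated assumptions, not the intended solution: the feature matrix `X` stands in for the averaged embeddings, the outcome `y` is made up, and ordinary least squares is only one of many possible forecasting models. How review-level variables map onto movie-level box office is left open here.

```python
import numpy as np

rng = np.random.default_rng(100)

# Synthetic stand-ins: 50 "reviews" with 10 embedding averages each,
# and a made-up box-office outcome
n_reviews, n_dims = 50, 10
X = rng.normal(size=(n_reviews, n_dims))
true_coefs = rng.normal(size=n_dims)
y = X @ true_coefs + rng.normal(scale=0.1, size=n_reviews)

# Fit on the first 70% of rows, forecast the remaining 30%
cutoff = int(0.7 * n_reviews)
X_train, X_pred = X[:cutoff], X[cutoff:]
y_train, y_pred_actual = y[:cutoff], y[cutoff:]

# Ordinary least squares with an intercept column
X_train_i = np.column_stack([np.ones(len(X_train)), X_train])
X_pred_i = np.column_stack([np.ones(len(X_pred)), X_pred])
coefs, *_ = np.linalg.lstsq(X_train_i, y_train, rcond=None)

# Forecast the held-out rows and measure the error
forecast = X_pred_i @ coefs
rmse = np.sqrt(np.mean((forecast - y_pred_actual) ** 2))
print(f"Out-of-sample RMSE: {rmse:.3f}")
```

Evaluating on rows that were not used for fitting gives a more honest picture of forecasting accuracy than the in-sample fit.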