# Learning from Big Data: Module 1 - Final Assignment Template

#### Student Name: FIRSTNAME SECONDNAME (000000XX)

# Introduction

This file provides a template for **Assignment 1** from the Learning from Big Data course. This Jupyter Notebook file was prepared to save you time so that you can focus on the theory and technical parts of the methods seen in class. This was prepared with a specific application in mind: *movie reviews*. For the supervised learning tasks, we will focus on three topics: acting, storyline, and visual/sound effects.

You have by now received the dataset of reviews, the three dictionaries with the training set of words for each topic, a list of stopwords, and a validation dataset containing sentences classified by a panel of human judges. This Jupyter Notebook file has a lot of (Python) code written to handle things such as loading these data files and the general settings of the environment we use to perform the analysis. The supervised learning code in this file was covered in **Session 02**.

This Jupyter Notebook file will load all the above-mentioned files and make them available for solving the NLP problems listed here. The questions **you are to answer** are marked as "`QUESTION`". The parts **you are expected to code yourself** are marked as "`# ADD YOUR CODE HERE`". There, you are expected to write your own code based on the in-class discussions and the decisions you will make as you study the theory, material, and models.

#### This assignment has the following structure:
1. **General Guidelines**
2. **Research Question**
3. **Load the Packages**
4. **Load the Reviews**
5. **Data Aggregation and Formatting**
6. **Supervised Learning: The Naive Bayes Classifier (NBC)**
7. **Supervised Learning: Inspect the NBC Performance**
8. **Unsupervised Learning: Predicting the Box Office using LDA**
9. **Unsupervised Learning: Predicting the Box Office using Word2Vec**
10. **Analysis - Answering the Research Question**
11. **OPTIONAL - Run and interpret the VADER lexicon for sentiment**
12. **APPENDIX**

# 1. General Guidelines

**Page limit**. This template has 8 pages, and you are allowed to add 8 to 10 pages (not including the appendix). Appendices are not subject to a page limit, but use your pages wisely. For example, a table with 2 rows and 3 columns that takes up 50% (or even 25%) of a page is not a wise use of space.

# 2. Research Question

`QUESTION I:` Present here the main research question you are going to answer with your text analysis.
You are free to choose the problem and change it until the last minute before handing in your report. However, your question should not be so simple that it does not require text analysis. For example, if your question can be answered by reading two reviews, you do not need text analysis; all you need is 10 seconds to read two reviews. Your question should also not be so difficult that you cannot answer it in your report. Your question needs to be answered in these pages.


`SOLUTION I:`

# 3. Load the Packages

Before starting the problem set, make sure that you have all the required packages installed properly. Simply run the code cell below (Shift-Enter). **Note**: you are free to add other packages.

In [15]:
# Loading the required packages
import re
import string
import numpy as np
import pandas as pd
from collections import Counter, namedtuple
from sklearn.feature_extraction import text
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import confusion_matrix, accuracy_score
from nltk.corpus import stopwords
from sklearn.decomposition import LatentDirichletAllocation
from gensim.models import Word2Vec
from gensim.utils import simple_preprocess
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

# 4. Load the Reviews

We will explore the concepts from this problem set with a dataset of online movie reviews. The reviews were written from 2009 to 2013, and the data was collected in 2014. Each observation in the data includes the textual review, a numerical rating from 1 to 10 (i.e., the number of stars), the movie title, the reviewer, and the date the review was written. Each observation also includes data on the movie being reviewed: the movie release date, the box office in the first week (as that is the strongest predictor of movie success), the studio that produced the movie, the number of theaters that the movie was released in, and the MPAA rating. Finally, each observation includes two pieces of information on the quality of the review itself: the number of readers who found the review useful, and the number of readers who rated the review as useful or not useful. Some reviews were not rated by anyone. The date on which a review was rated is not available.

### The data set contains the following 19 columns:
+ **movie_name**: title of the movie being reviewed.
+ **review_code**: a unique serial identifier for all the reviews in this dataset.
+ **reviewer**: the reviewer who wrote the review.
+ **num_eval**: the number of stars.
+ **review_date**: the date the review was written.
+ **prob_sentiment**: a placeholder variable to store the probability the review is positive. `TODO`: You need to compute this.
+ **words_in_lexicon_sentiment_and_review**: the number of words that are found both in the review and in the sentiment lexicon you will be using.
+ **ratio_helpful**: number of people that rated the review as useful divided by the total number of people that rated the review.
+ **raters**: number of people that rated the review as either useful or not useful.
+ **prob_storyline**: a placeholder variable to store the probability the review is about the movie storyline.
+ **prob_acting**: a placeholder variable to store the probability the review is about acting.
+ **prob_sound_visual**: a placeholder variable to store the probability that the review is about the movie special effects (sound or visual).
+ **full_text**: raw review text.
+ **processed_text**: the cleaned review text, free of punctuation marks.
+ **release_date**: the day the movie was released.
+ **first_week_box_office**: number of movie theater tickets sold in the first week after the movie's release. Data from boxofficemojo.com
+ **MPAA**: MPAA rating of the movie (e.g., PG-rated).
+ **studio**: movie studio that produced the movie.
+ **num_theaters**: number of movie theaters in which this movie was shown on the release date. Data from boxofficemojo.com

### Loading the reviews:

In [16]:
# Loading the data
reviews_raw = pd.read_csv('../data/reviews/reviews_tiny.csv', encoding='ISO-8859-1')
reviews_raw = reviews_raw[
    ['movie_name',
     'review_code',
     'reviewer',
     'review_date',
     'num_eval',
     'prob_sentiment',
     'words_in_lexicon_sentiment_and_review',
     'ratio_helpful',
     'raters',
     'prob_storyline',
     'prob_acting',
     'prob_sound_visual',
     'full_text',
     'processed_text',
     'release_date',
     'first_week_box_office',
     'MPAA',
     'studio',
     'num_theaters']
]

# 5. Data Aggregation and Formatting

`QUESTION II:` Decide on how to aggregate or structure your data. The data you received is at the review level (i.e., each row/observation is a review). However, the variables in the data are very rich and allow you to use your creativity when designing your *research question*. For example, there are timestamps, which allow you to aggregate the data at the daily level (or even hourly level). There is information on reviewers, which allows you to inspect patterns of rating by reviewers. There is information on the studios, and more. Please explicitly indicate how you structured your dataset and what your motivation is for doing so. Even if you are using the data at the review level, indicate how and why that is needed for your research question.
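
As one illustration of restructuring, the sketch below aggregates review-level rows to the movie level and to the movie-day level with `groupby`. It uses a tiny hand-made DataFrame rather than the course data. The column names follow the list in Section 4, while the values and the choice of summary statistics (mean rating, review count) are assumptions made only for this example.

```python
import pandas as pd

# Toy review-level data; column names follow Section 4, values are made up.
toy_reviews = pd.DataFrame({
    "movie_name": ["A", "A", "B", "B", "B"],
    "num_eval": [8, 6, 3, 5, 4],
    "review_date": pd.to_datetime(
        ["2010-01-01", "2010-01-02", "2011-05-01", "2011-05-01", "2011-05-03"]
    ),
})

# One row per movie: average star rating and number of reviews.
movie_level = (
    toy_reviews
    .groupby("movie_name", as_index=False)
    .agg(mean_rating=("num_eval", "mean"), n_reviews=("num_eval", "size"))
)

# One row per movie and day: number of reviews written that day.
daily_level = (
    toy_reviews
    .groupby(["movie_name", "review_date"], as_index=False)
    .agg(n_reviews=("num_eval", "size"))
)

print(movie_level)
print(daily_level)
```

Whichever level you choose, the unit of observation should match the unit at which your research question is posed.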

`SOLUTION II:`

# 6. Supervised Learning: the Naive Bayes Classifier (NBC)

## 6.0 Load Support Functions and Global Parameters

The two functions, `compute_posterior_sentiment` and `compute_posterior_content`, are called once per review. They use the Bayes rule we saw in **Session 02**: the first computes the posterior probability that the sentiment in the review is positive or negative, and the second computes the posterior probabilities that the review is about each topic. The functions are loaded by executing the code cell below.

In [17]:
# Function for computing posterior sentiment
def compute_posterior_sentiment(prior, corpus_in, dict_words, p_w_given_c, TOT_DIMENSIONS):
    prior = np.array(prior)
    vec = CountVectorizer(vocabulary=dict_words, lowercase=True)
    word_matrix = vec.fit_transform([corpus_in]).toarray()

    # Check if there are any relevant words in the review, if there are, treat them. 
    # If not, use the prior.
    if word_matrix.sum() == 0:
        posterior = prior
        words_ = ['']
    else:
        # Positions in word matrix that have words from this review
        word_matrix_indices = np.where(word_matrix > 0)[1]

        # Initializing posterior vector with the prior, so the result stays a
        # valid distribution even if no matched word is found in p_w_given_c
        posterior = prior.copy()
        vec_likelihood = np.zeros(TOT_DIMENSIONS)

        # Loop around words found in review
        for word_matrix_index in word_matrix_indices:
            word = vec.get_feature_names_out()[word_matrix_index]

            # Check if the word exists in p_w_given_c.words
            p_w_given_c_indices = np.where(p_w_given_c.words == word)[0]
            if p_w_given_c_indices.size > 0:
                p_w_given_c_index = p_w_given_c_indices[0]
                vec_likelihood = np.array([p_w_given_c.pos_likelihood[p_w_given_c_index], 
                                           p_w_given_c.neg_likelihood[p_w_given_c_index]])

                # Looping around occurrences | word
                for i in range(word_matrix[0, word_matrix_index]):
                    numerat = prior * vec_likelihood
                    denomin = prior.dot(vec_likelihood)
                    posterior = numerat / denomin

                    if np.sum(posterior) > 1.01:
                        raise ValueError('Posterior probabilities sum to more than 1')

                    prior = np.array(posterior)

        words_ = vec.get_feature_names_out()[word_matrix_indices]

    return {'posterior_': posterior, 'words_': words_}


# Function for computing posterior content
def compute_posterior_content(prior, corpus_in, dict_words, p_w_given_c, BIGRAM, TOT_DIMENSIONS):
    vec = CountVectorizer(vocabulary=dict_words, lowercase=True, ngram_range=(1, BIGRAM))
    word_matrix = vec.fit_transform([corpus_in]).toarray()

    # Check if there are any relevant words in the review, if there are, treat them. 
    # If not, use the prior.
    if word_matrix.sum() == 0:
        posterior = prior
    else:
        # Positions in word matrix that have words from this review
        word_matrix_indices = np.where(word_matrix > 0)[1]
        posterior = np.zeros(TOT_DIMENSIONS)

        # Loop around words found in review
        for word_matrix_index in word_matrix_indices:
            word = vec.get_feature_names_out()[word_matrix_index]
            p_w_given_c_index = np.where(p_w_given_c.words == word)[0][0]
            vec_likelihood = np.array([p_w_given_c.storyline[p_w_given_c_index], 
                                       p_w_given_c.acting[p_w_given_c_index], 
                                       p_w_given_c.visual[p_w_given_c_index]])

            # Looping around occurrences | word
            for i in range(word_matrix[0, word_matrix_index]):
                numerat = prior * vec_likelihood
                denomin = prior.dot(vec_likelihood)
                posterior = numerat / denomin

                if np.sum(posterior) > 1.01:
                    raise ValueError('Posterior probabilities sum to more than 1')

                prior = posterior

    return {'posterior_': posterior}


# Setting Global Parameters
PRIOR_SENT = 1/2
PRIOR_CONTENT = 1/3
TOT_REVIEWS = len(reviews_raw)
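
The update at the core of both functions can be traced by hand on a toy case. The sketch below applies Bayes' rule once per word occurrence, feeding each posterior back in as the next prior, as the loops above do. The two likelihood vectors are invented numbers for illustration, not values from the course lexicons, and the helper name `bayes_update` is hypothetical.

```python
import numpy as np

def bayes_update(prior, likelihood):
    """One step of Bayes' rule: posterior is proportional to prior * likelihood."""
    numerator = prior * likelihood
    return numerator / numerator.sum()

# Uniform prior over (positive, negative), as with PRIOR_SENT = 1/2.
prior = np.array([0.5, 0.5])

# Hypothetical P(word | class) values for two words seen in a review.
likelihoods = [
    np.array([0.03, 0.01]),  # a word three times likelier under "positive"
    np.array([0.02, 0.02]),  # an uninformative word
]

posterior = prior
for likelihood in likelihoods:
    posterior = bayes_update(posterior, likelihood)

print(posterior)  # the first entry is P(positive | words)
```

A word that is equally likely under every class leaves the posterior unchanged, which is why only words that discriminate between classes move the result away from the prior.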

## 6.1 Likelihoods

`QUESTION III:` Create the content likelihoods based on the 3 lists of words below. Be explicit about the decisions you made in the process, and why you made them (e.g., which smoothing approach you used).

`SOLUTION III:`

### 6.1.1 Loading the Dictionaries (Training Data)

In [18]:
# Loading the storyline dictionary
dictionary_storyline = pd.read_csv('../data/lexicons/storyline_33k.txt')

# Loading the acting dictionary
dictionary_acting = pd.read_csv('../data/lexicons/acting_33k.txt')

# Loading the visual dictionary
dictionary_visual = pd.read_csv('../data/lexicons/visual_33k.txt')

### 6.1.2 Content/Topic Likelihoods

In [19]:
# ADD YOUR CODE HERE, replacing these fake likelihoods
likelihoods_content = pd.read_csv('../data/lexicons/example_100_fake_likelihood_content.csv')

# Converting the first column to a list of strings
lexicon_content = likelihoods_content.iloc[:, 0].astype(str).tolist()
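
One possible way to build such a table is additive (Laplace) smoothing over the three word lists. The sketch below is a minimal illustration under assumptions: each dictionary is taken to be a plain list of tokens (the actual layout of the `*_33k.txt` files is not shown here), and the output columns `words`, `storyline`, `acting`, and `visual` mirror the attribute names that `compute_posterior_content` reads. The helper name `laplace_likelihoods` and the toy tokens are hypothetical.

```python
from collections import Counter

import pandas as pd

def laplace_likelihoods(tokens_by_topic, alpha=1.0):
    """Estimate P(word | topic) with additive smoothing.

    tokens_by_topic maps a topic name to a list of training tokens.
    """
    counts = {topic: Counter(tokens) for topic, tokens in tokens_by_topic.items()}
    vocabulary = sorted(set().union(*counts.values()))
    table = pd.DataFrame({"words": vocabulary})
    for topic, counter in counts.items():
        total = sum(counter.values())
        # Add alpha to every count so unseen words never get probability zero.
        table[topic] = [
            (counter[word] + alpha) / (total + alpha * len(vocabulary))
            for word in vocabulary
        ]
    return table

# Toy training tokens, invented for illustration.
toy_tokens = {
    "storyline": ["plot", "plot", "twist", "ending"],
    "acting": ["actor", "performance", "actor"],
    "visual": ["effects", "sound", "effects", "cgi"],
}

toy_likelihoods = laplace_likelihoods(toy_tokens)
print(toy_likelihoods)
```

With smoothing, each topic column sums to 1 over the shared vocabulary and contains no zeros, so a single word that is absent from one topic's list cannot force that topic's posterior to zero.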

### 6.1.3 Sentiment Likelihoods

`QUESTION IV:` Locate a list of sentiment words that fits your research question. For example, you may want to look just at positive and negative sentiment (hence two dimensions), or you may want to look at other sentiment dimensions, such as specific emotions (excitement, fear, etc.).
**TIP:** Google will go a long way for finding these, but do check if there is a paper you can cite that uses your list.

`SOLUTION IV:`

In [20]:
# ADD YOUR CODE HERE, load your sentiment word list
# dictionary_sentiment = pd.read_csv('')

# ADD YOUR CODE HERE, replacing these fake likelihoods
likelihoods_sentiment = pd.read_csv('../data/lexicons/example_100_fake_likelihood_sentiment.csv')

# Converting the first column to a list of strings
lexicon_sentiment = likelihoods_sentiment.iloc[:, 0].astype(str).tolist()

## 6.2 Run NBC for Sentiment

In [21]:
for review_index in range(TOT_REVIEWS):
    if review_index % 100 == 0:
        print(f"Computing sentiment of review #{review_index}")

    # Reset the prior as each review is looked at separately
    prior_sent = [PRIOR_SENT, 1-PRIOR_SENT]

    text_review = str(reviews_raw['processed_text'].iloc[review_index])

    # Pre-process the review to remove punctuation marks and numbers
    # Note: we are not removing stopwords here (nor elsewhere - a point for improvement)
    text_review = text_review.translate(str.maketrans('', '', string.punctuation))
    text_review = ''.join([i for i in text_review if not i.isdigit()])

    # Computing posterior probability the review is positive
    TOT_DIMENSIONS = 2
    sent_results = compute_posterior_sentiment(prior=prior_sent,
                                               corpus_in=text_review,
                                               dict_words=lexicon_sentiment,
                                               p_w_given_c=likelihoods_sentiment,
                                               TOT_DIMENSIONS=TOT_DIMENSIONS)
    
    words_sent = sent_results['words_']
    posterior_sent = sent_results['posterior_']

    # Setting the posterior sentiment in the prob_sentiment column
    reviews_raw.loc[review_index, 'prob_sentiment'] = posterior_sent[0]
    reviews_raw.loc[review_index, 'words_in_lexicon_sentiment_and_review'] = ' '.join(words_sent)

Computing sentiment of review #0
Computing sentiment of review #100
Computing sentiment of review #200
Computing sentiment of review #300
Computing sentiment of review #400
Computing sentiment of review #500
Computing sentiment of review #600
Computing sentiment of review #700
Computing sentiment of review #800
Computing sentiment of review #900
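
The comments in the sentiment loop flag stopword removal as a point for improvement. A minimal sketch of a cleaning step that also drops stopwords follows. The tiny `TOY_STOPWORDS` set and the helper name `clean_review` are placeholders for illustration; in practice the stopword list supplied with the assignment (or NLTK's) would take its place.

```python
import string

# Placeholder stopword set; swap in the course list or NLTK's English list.
TOY_STOPWORDS = {"the", "a", "an", "and", "was", "is", "of"}

def clean_review(text, stopwords=TOY_STOPWORDS):
    """Lowercase, strip punctuation and digits, then drop stopwords."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    text = "".join(ch for ch in text if not ch.isdigit())
    tokens = [tok for tok in text.split() if tok not in stopwords]
    return " ".join(tokens)

print(clean_review("The acting was great, and the 2 leads were superb!"))
```

Because `compute_posterior_content` also matches bigrams, removing stopwords changes which word pairs are adjacent, so the lexicon and the cleaning step should be kept consistent with each other.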


## 6.3 Run NBC for Content

In [22]:
for review_index in range(TOT_REVIEWS):
    if review_index % 100 == 0:
        print(f'Computing content of review # {review_index}')
    
    if reviews_raw['full_text'].iloc[review_index] != "":
        text_review = str(reviews_raw['processed_text'].iloc[review_index])

        # Pre-process the review to remove punctuation marks and numbers
        # Note: we are not removing stopwords here (nor elsewhere - a point for improvement)
        text_review = text_review.translate(str.maketrans('', '', string.punctuation))
        text_review = ''.join([i for i in text_review if not i.isdigit()])
        
        # Compute posterior probability the review is about each topic/content
        TOT_DIMENSIONS = 3
        prior_content = np.repeat(PRIOR_CONTENT, TOT_DIMENSIONS).reshape(-1, TOT_DIMENSIONS)
        posterior_content = compute_posterior_content(prior=prior_content, 
                                              corpus_in=text_review,
                                              dict_words=lexicon_content,
                                              p_w_given_c=likelihoods_content, 
                                              BIGRAM=2,
                                              TOT_DIMENSIONS=TOT_DIMENSIONS)
        
        reviews_raw.loc[review_index, 'prob_storyline'] = posterior_content['posterior_'][0][0]
        reviews_raw.loc[review_index, 'prob_acting'] = posterior_content['posterior_'][0][1]
        reviews_raw.loc[review_index, 'prob_sound_visual'] = posterior_content['posterior_'][0][2]

processed_reviews = reviews_raw

# Save the updated file, now including the sentiment and content/topic posteriors.
processed_reviews.to_csv('../output/test_processed_reviews.csv', index=False)

Computing content of review # 0
Computing content of review # 100
Computing content of review # 200
Computing content of review # 300
Computing content of review # 400
Computing content of review # 500
Computing content of review # 600
Computing content of review # 700
Computing content of review # 800
Computing content of review # 900


# 7. Supervised Learning: Inspect the NBC Performance

## 7.1 Load the Judge Scores

In [23]:
# Loading the judges scores
ground_truth_judges = pd.read_csv('../data/judges/judges.csv')

## 7.2 Compute Confusion Matrix, Precision, and Recall

`QUESTION V:` Compare the performance of your NBC implementation (for content) against the judges' ground truth by building the confusion matrix and computing the precision and accuracy scores. **Do not forget to interpret your findings.**

In [24]:
# ADD YOUR CODE HERE.
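
As a hedged illustration of the metrics involved, the sketch below builds a confusion matrix and derives per-class precision, per-class recall, and overall accuracy from two toy label vectors, using NumPy only. The labels are invented. How a predicted label is derived from the posteriors (here imagined as the topic with the largest posterior) and how the judges file encodes its labels are assumptions to be checked against your data.

```python
import numpy as np

labels = ["storyline", "acting", "visual"]

# Toy data: judges' labels (truth) and NBC labels (prediction), invented.
truth = np.array([0, 0, 1, 1, 2, 2, 2, 0])
predicted = np.array([0, 1, 1, 1, 2, 0, 2, 0])

# Rows index the true class, columns the predicted class.
n_classes = len(labels)
confusion = np.zeros((n_classes, n_classes), dtype=int)
for t, p in zip(truth, predicted):
    confusion[t, p] += 1

true_positives = np.diag(confusion)
precision = true_positives / confusion.sum(axis=0)  # per predicted class
recall = true_positives / confusion.sum(axis=1)     # per true class
accuracy = true_positives.sum() / confusion.sum()

print(confusion)
for name, p, r in zip(labels, precision, recall):
    print(f"{name}: precision={p:.2f}, recall={r:.2f}")
print(f"accuracy={accuracy:.2f}")
```

The `confusion_matrix` function imported from `sklearn.metrics` in Section 3 uses the same orientation (rows for true labels, columns for predicted labels), so the same formulas apply to its output.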

`SOLUTION V:`

# 8. Unsupervised Learning: Predicting the Box Office using LDA

`QUESTION VI:` Using Latent Dirichlet Allocation (LDA), predict the movie box office.

In [25]:
# ADD YOUR CODE HERE. Note: you are allowed to use the code from Session 03.

`SOLUTION VI:`

# 9. Unsupervised Learning: Predicting the Box Office using Word2Vec

`QUESTION VII:` Using Word2Vec, predict the movie box office.
+ **Tip 1:** You can reduce the dimensionality of the output of Word2Vec with PCA/Factor Analysis. This will save you computing time.
+ **Tip 2:** Word2Vec will give you word vectors. You can then compute the average of these word vectors for all words in a review. This will give you a vector describing the content of a review, which you can use as your constructed variable(s).
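
The averaging step in Tip 2 can be sketched without training a model. The example below averages word vectors over the tokens of a review, using a small hand-written dictionary of 3-dimensional vectors as a stand-in for a trained Word2Vec model. The vectors, the dimensionality, and the helper name `average_vector` are assumptions for illustration; with gensim, the trained vectors in `model.wv` can be looked up by word in a similar way.

```python
import numpy as np

# Stand-in for trained embeddings: word -> vector (values invented).
toy_vectors = {
    "great": np.array([1.0, 0.0, 2.0]),
    "acting": np.array([0.0, 1.0, 0.0]),
    "plot": np.array([2.0, 2.0, 1.0]),
}

def average_vector(tokens, vectors, dim=3):
    """Mean of the vectors of known tokens; zeros if none are known."""
    known = [vectors[tok] for tok in tokens if tok in vectors]
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)

review_tokens = "great acting but confusing plot".split()
review_vector = average_vector(review_tokens, toy_vectors)
print(review_vector)
```

Tokens without a vector are skipped rather than counted as zeros, so out-of-vocabulary words do not shrink the average toward the origin.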

In [26]:
# ADD YOUR CODE HERE. Note: you are allowed to use the code from Session 03.

`SOLUTION VII:`

# 10. Analysis: Use the constructed variables to answer your research question

`QUESTION VIII:` Now that you have constructed your NLP variables for sentiment and content using both supervised and unsupervised methods, use them to answer your original research question.

In [27]:
# ADD YOUR CODE HERE. The code for the analysis of your research question should be written here.

`SOLUTION VIII:`

# 11. OPTIONAL: Run and interpret sentiment with the supervised learning VADER lexicon

`QUESTION IX (optional):` Using the VADER code you received in the lecture, compute the sentiment using the VADER package. Then compare the performance of your NBC implementation (for sentiment) against it, treating the VADER classification as if it were the ground truth: build the confusion matrix, compute the precision, and compute the recall. **Note** that we are interested in understanding how much the two classifications differ and how; we are not implying that VADER is error-free, far from it. We are interested in uncovering sources of systemic differences that can be attributed to the algorithms or lexicons. **Do interpret your findings**.

In [28]:
# ADD YOUR CODE HERE.

`SOLUTION IX (optional):`

# 12. APPENDIX