# [95-865] Unstructured Data Analysis: Final Exam Q1

Name: Jose Alberto Rodriguez 
Andrew ID: Josealbr

# Q1:  Sentiment Analysis [Total: 25 Points]

Universal Brothers, a fictional movie production company, approaches you to find out whether people like the movies they produce.
Download the dataset from https://www.dropbox.com/s/1ztjhjsznlhtv10/ExamData-1.zip?dl=0 <br>
Unzip it into the same folder as this notebook. Do not place the files inside any inner folder.
The archive contains an IMDB dataset of movie reviews. You will perform sentiment analysis on this dataset. Poorly rated movies are labeled 0 and highly rated movies are labeled 1. You are provided with training and test data.

### (a) Load [1 Point]
Load the train and test files into a dataframe. Name the columns to help you in the rest of the problem.

In [69]:
import pandas as pd
import numpy as np

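# Note: header=None makes pandas treat the first line of each CSV as data, so the
# header line of train.csv ("id", "sentiment", "review") shows up as row 0 in the output.
test = pd.read_csv('test.csv', encoding="ISO-8859-1", header=None)
train = pd.read_csv('train.csv', encoding="ISO-8859-1", header=None)

train.columns = ['id', 'sentiment', 'review']
test.columns = ['id', 'sentiment', 'review']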

train.head()

| | id | sentiment | review |
|---|---|---|---|
| 0 | id | sentiment | review |
| 1 | 5814_8 | 1 | With all this stuff going down at the moment w... |
| 2 | 2381_9 | 1 | \The Classic War of the Worlds\" by Timothy Hi... |
| 3 | 7759_3 | 0 | The film starts with a manager (Nicholas Bell)... |
| 4 | 3630_4 | 0 | It must be assumed that those who praised this... |


### (b) Clean the Dataset [4 points]

- Tokenize the reviews using spaCy
- Convert tokens to lower case
- Only keep alphanumeric characters in the reviews
- Use the stopwords.txt file to remove stop words from the tokens list
- Remove punctuation from the reviews
- Perform the above on both the train and test review datasets
- List the tokens for the first 5 reviews

In [70]:
import spacy
import re

# 'en' is the spaCy 2.x shortcut name; spaCy 3+ requires a full model name such as 'en_core_web_sm'
nlp = spacy.load('en')

# One stop word per line; a set makes the membership test below fast
with open("stopwords.txt") as f:
    stopwords = set(f.read().split('\n'))


def is_alpha(token):
    """Return True if the token consists only of the letters a-z / A-Z."""
    return bool(re.match('[a-zA-Z]+$', token))


def get_words(doc):
    """Yield the purely alphabetic tokens of a spaCy doc (drops punctuation and numbers)."""
    for token in doc:
        if is_alpha(token.orth_):
            yield token


def process(df):
    """Return one cleaned, space-separated token string per review in df."""
    column = []
    for r in df['review']:
        doc = nlp(r)
        s = ''
        for w in get_words(doc):
            if w.lower_ not in stopwords:
                s += w.lower_ + ' '
        column.append(s)
    return column

train['tokens'] = process(train)
test['tokens'] = process(test)
    

In [64]:
train.head(4)

| | id | sentiment | review | tokens |
|---|---|---|---|---|
| 0 | id | sentiment | review | review |
| 1 | 5814_8 | 1 | With all this stuff going down at the moment w... | stuff going moment mj started listening music ... |
| 2 | 2381_9 | 1 | \The Classic War of the Worlds\" by Timothy Hi... | classic war timothy hines entertaining film ob... |
| 3 | 7759_3 | 0 | The film starts with a manager (Nicholas Bell)... | film starts manager nicholas bell giving welco... |


### (c) Turning movie reviews into vectors with the help of word embeddings [Loading word embeddings - 2 points, feature matrix - 8 points]

In the RNN demo, you saw how to load a pre-computed GloVe word embedding. Please repeat this to load in 50-dimensional GloVe word embedding vectors from `glove.6B.50d.txt`. 

A movie review is composed of words. Given a review, let's define a "review embedding" to be the average of the individual token (word) embeddings. Write a function to create a matrix whose rows are the review embeddings created as described in the previous sentence. If a word does not have a GloVe word embedding, ignore it. Each movie review now is an embedding vector which could be thought of as the feature vector for that movie review.
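
The averaging described above can be sketched as follows. This is only an illustration under stated assumptions: the function name `review_embeddings` and the toy two-dimensional vectors are hypothetical stand-ins, not part of the exam template, and with the real data `embeddings_index` would be the dictionary loaded from `glove.6B.50d.txt`. A review with no known words is left as an all-zero row, which is one possible choice among several.

```python
import numpy as np

def review_embeddings(token_strings, embeddings_index, embedding_dim=50):
    """Return a (num_reviews, embedding_dim) array of averaged word vectors."""
    features = np.zeros((len(token_strings), embedding_dim), dtype='float32')
    for i, tokens in enumerate(token_strings):
        # Words without an embedding are ignored
        vectors = [embeddings_index[w] for w in tokens.split() if w in embeddings_index]
        if vectors:  # reviews with no known words stay as all zeros
            features[i] = np.mean(vectors, axis=0)
    return features

# Toy 2-dimensional "embeddings" standing in for GloVe (hypothetical values)
toy_index = {
    'good': np.array([1.0, 0.0], dtype='float32'),
    'film': np.array([0.0, 1.0], dtype='float32'),
}
toy_reviews = ['good film', 'good unknownword', '']
X = review_embeddings(toy_reviews, toy_index, embedding_dim=2)
print(X)
```

Applied to the notebook's data, the same function could be called on `train['tokens']` and `test['tokens']` to obtain one feature matrix per split.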

In [87]:
embeddings_index = {}

# We will use the 50-dimensional embedding vectors
with open("glove.6B.50d.txt", encoding='utf-8') as f:
    # Each row represents a word vector
    for line in f:
        values = line.split()
        # The first part is word
        word = values[0]
        # The rest are the embedding vector
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs

print('Found %s word vectors.' % len(embeddings_index))


Found 400000 word vectors.

In [89]:
embedding_dim = 50

# Assign each distinct token (test reviews first, then train) a row index
word_index = {}
index = 0
for df in (test, train):
    for tokens in df['tokens']:
        for word in tokens.split(' '):
            if word not in word_index:
                word_index[word] = index
                index += 1


In [103]:
# We first initialize the embedding matrix with zeros
embedding_dim = 50
embedding_matrix = np.zeros((len(word_index), embedding_dim))
for word, i in word_index.items():
    # Rows for words without a GloVe vector stay all zeros
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector
         

In [105]:
# write your code here for computing a 2D numpy array of review embeddings
# (so the i-th movie review should have an embedding given by the i-th row of the 2D array)
print(len(embedding_matrix))

embedding_matrix[1:3]


19246


array([[ 0.37632999,  0.058652  ,  0.17005   ,  0.46862999,  0.95798999,
        -0.82027   , -0.83683002,  0.53315002, -0.22664   , -1.15050006,
         0.11026   ,  0.22662   , -0.80680001,  0.12202   ,  0.91294003,
         0.39002001, -0.0051597 ,  0.11369   ,  0.45456001, -0.11737   ,
        -0.074381  ,  1.50880003,  0.46654999,  0.04601   ,  0.68558002,
        -2.28719997, -0.081728  ,  0.55559999, -1.12220001, -0.042912  ,
         2.56509995, -0.12145   , -0.42656001, -0.11731   , -0.51801002,
        -0.51683003,  0.58125001, -0.20615999,  0.67071998,  0.82279998,
         0.21314   ,  1.36619997, -0.18691   , -0.78496999,  0.73258001,
        -0.51868999, -1.53690004,  0.84912997,  0.51594001,  0.87638998],
       [ 1.62450004, -0.71608001,  0.15488   ,  0.33987999, -0.54799998,
         0.94953001,  0.53549999,  0.29616001,  0.78119999,  0.57072997,
        -0.56431001, -0.98483998, -0.51095998,  0.40357   ,  0.39757001,
        -0.31668001, -0.19241001,  0.32126001, -1.

### (d) SVM [Cross Validation - 6 points, Correct Prediction - 4 Points]
We now train a polynomial kernel SVM using the review embeddings from the previous part as feature vectors. But first, we have to figure out SVM parameter $C$ and also which polynomial degree $d$ to use! We do a grid search over $C$ in the range `np.logspace(-4, 2, 3)`, and polynomial degree $d$ in the range `range(1,4)`. For each parameter choice $(C, d)$, compute the 5-fold cross validation prediction accuracy (this will be the cross validation score for $(C, d)$). Across all choices of $(C, d)$ we will use whichever one has the highest cross validation score -- this will correspond to the best $C$ and best $d$. Then train the polynomial kernel SVM using the best $C$ and best $d$ and report the accuracy on the actual test data.

Note that this problem involves writing grid search cross validation code (do NOT use scikit-learn or some external resource code that does this grid search for you; this problem is asking you to complete the grid search code below!).
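
To make the size of the search concrete, the short sketch below enumerates the parameter grid described above. It only lists the $(C, d)$ pairs and does not train anything; the variable name `grid` is introduced here for illustration.

```python
import numpy as np
from itertools import product

C_values = np.logspace(-4, 2, 3)   # three values evenly spaced in log10 from 1e-4 to 1e2
D_values = range(1, 4)             # polynomial degrees 1, 2, 3

# Every (C, d) combination that the grid search has to score
grid = list(product(C_values, D_values))
print(C_values)
print(len(grid))
```

With 5-fold cross validation, each of these pairs requires training five SVMs, so the full search fits 45 models before the final one.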

In [3]:
from sklearn import svm
from sklearn.model_selection import KFold

num_folds = 5
k_fold = KFold(num_folds)
C_values = np.logspace(-4, 2, 3)
D_values = range(1, 4)

arg_max = None
max_cross_val_score = -np.inf
for C in C_values:
    for d in D_values:
        ################################################################################
        # Write your code only in this block -- do not change code outside of this block!
        # Hint: You will want to use `k_fold` (defined above) in conjunction with your
        # training movie review feature vectors (these are the review embeddings) as
        # well as training labels. The code here should *not* look at any part of the
        # test data!
        
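        # Assumes `train_features` (2D array of review embeddings, one row per
        # training review) and `train_labels` (1D array of 0/1 labels) were built
        # in part (c); neither name is defined elsewhere in this notebook.
        fold_scores = []
        for train_idx, val_idx in k_fold.split(train_features):
            clf = svm.SVC(kernel='poly', C=C, degree=d)
            clf.fit(train_features[train_idx], train_labels[train_idx])
            fold_scores.append(clf.score(train_features[val_idx], train_labels[val_idx]))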
        
        ################################################################################

        cross_val_score = np.mean(fold_scores)
        if cross_val_score > max_cross_val_score:
            max_cross_val_score = cross_val_score
            arg_max = (C, d)
            
best_C, best_d = arg_max
print(best_C, best_d)

In [4]:
# write your code here that trains a polynomial kernel SVM using the best C and best d
# (and the full training dataset), and then prints out the classifier accuracy on the
# test data
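
A minimal sketch of this last step is given below. It runs on synthetic data rather than the review embeddings, so the printed accuracy says nothing about the IMDB task. The helper `make_toy_data`, the arrays `X_train`, `y_train`, `X_test`, `y_test`, and the values assigned to `best_C` and `best_d` are all assumptions made for illustration; in the notebook they would be replaced by the review embedding matrices, the sentiment labels, and the output of the grid search.

```python
import numpy as np
from sklearn import svm

rng = np.random.RandomState(0)

def make_toy_data(n_per_class):
    # Hypothetical stand-in data: two well-separated Gaussian clusters in 5 dimensions
    X = np.vstack([rng.randn(n_per_class, 5) + 2.0,
                   rng.randn(n_per_class, 5) - 2.0])
    y = np.array([1] * n_per_class + [0] * n_per_class)
    return X, y

X_train, y_train = make_toy_data(100)
X_test, y_test = make_toy_data(50)

# Placeholder values, not the result of the grid search above
best_C, best_d = 0.1, 1

# Train on the full training set with the chosen parameters, then score on held-out test data
clf = svm.SVC(kernel='poly', C=best_C, degree=best_d)
clf.fit(X_train, y_train)
test_accuracy = clf.score(X_test, y_test)
print('Test accuracy:', test_accuracy)
```

`SVC.score` returns the mean accuracy on the given data, which is the quantity the question asks to report.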