# HW 2: Data Preparation for Sentiment Classification

In this homework we will prepare the IMDB movie review sentiment dataset. We will prepare it to fit a model that will predict whether a new review has a positive or negative sentiment. 

**Start by downloading the IMDB_Dataset from the .csv file into a pandas DataFrame**

In [309]:
import pandas as pd
import numpy as np

#Download the dataset into a Pandas DataFrame and display the first 5 rows
## YOUR CODE HERE
imdb = pd.read_csv("IMDB_Dataset.csv")

In [310]:
imdb.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


We have to process the data first, so that we have fixed length sequences.

<b>We want to split the dataset into reviews and labels. </b>

In [311]:
imdb_x = imdb["review"]
imdb_y = imdb["sentiment"]

In [312]:
# Paste the toBinary function created in HW 1 from this hw set (week 2)
def toBinary(data, positive):
    return data.apply(lambda val: 1 if val == positive else 0)

**Use the toBinary method to tranform the sentiment column into binary, 1 for positive and 0 for negative.**

In [313]:
imdb_y = toBinary(imdb_y, "positive")

"Lemmatization is the process of converting a word to its base form. The difference between stemming and lemmatization is, lemmatization considers the context and converts the word to its meaningful base form, whereas stemming just removes the last few characters, often leading to incorrect meanings and spelling errors."
    https://www.machinelearningplus.com/nlp/lemmatization-examples-python/

<b>Lemmatize the sentences using any library from the article. Make sure to filter out non-alphabetical characters. </b>

In [314]:
import nltk
from nltk.stem import WordNetLemmatizer 
from nltk.corpus import wordnet

def Lemmatize(data):
    # Init the Wordnet Lemmatizer
    data = data.str.split(" |,|<br /><br />|\"|\'|!") # splits the data from sentences to words, feel free to change
    lemmatizer = WordNetLemmatizer()
    return data.apply(lambda words: list(map(lemmatizer.lemmatize, words)))

imdb_x = Lemmatize(imdb_x)

The data has to be put into integer form, so each integer represents a unique word, 0 represents a PAD character, 1 represents a START character and 2 represents a character that is unknown because it is not in the top `num_words`. 
Thus 3 represents the first real word. 

<i> Do not implement dictionary keys for PAD, START, and unknown characters, this will be done later. </i>

Also the words should be in decreasing order of frequency, so the word that 3 represents is the most common word in the dataset. 

Complete CreateDict which will take in a data column <br>
    <b>1) Create a Dict that maps {Word: Apperances in dataset} </b> <br>
    <b>2) Choose the top N most recurring words and give ascending indexes starting at 3</b> <br>

In [315]:
numWords = 1000

def CreateDict(data, topN):
    dictionary = {}
    
    for review in data:
        for word in review:
            if word in dictionary.keys():
                dictionary[word] += 1
            else:
                dictionary[word] = 1
                
    sorted_dict = sorted(dictionary.items(), key=lambda x: x[1], reverse = True)
    
    new_dict = {}
    i = 2 
    for word in sorted_dict:
        if i < numWords:
            new_dict[word[0]] = i
            i += 1 
        else:
            break
    return new_dict
    
wordCounter = CreateDict(imdb_x, numWords)

Complete  replaceByIndex which will replace known words with their index and unknown words with a 2.

In [316]:
def replaceByIndex(data, wordCounter):
    unknown = 2
    for i in range(len(data)):
        data[i] = [wordCounter[w] if w in wordCounter else unknown for w in data[i]]
    return data

imdb_x = replaceByIndex(imdb_x, wordCounter)

In [317]:
imdb_x.head()

0    [290, 6, 3, 86, 2, 42, 2, 12, 129, 171, 44, 70...
1    [120, 437, 128, 2, 2, 16, 2, 2, 8, 51, 2, 51, ...
2    [10, 195, 14, 15, 4, 437, 102, 7, 2, 58, 23, 4...
3    [2, 60, 13, 4, 259, 125, 4, 128, 368, 2, 98, 6...
4    [2, 2, 13, 2, 2, 9, 3, 2, 6, 2, 2, 8, 4, 2, 2,...
Name: review, dtype: object

#### We want to process the data into NumPy arrays of sequences that are all length 200. We will use these criteria: 
* We want to add a 1 at the beginning of every review to signal the beginning of the text.
* If a given sequence is shorter than 200 tokens we want to pad the beginning of the sequence out with zeros so that the sequence is 200 long. 
* Else if the sequence is longer than 200 (including the starting 1) we want to cut it down to length 200. 


In [318]:
def process_data(data):
    processed = []
    for review in data:
        review_arr = np.append(np.array([1]), np.array(review))
        if len(review_arr) > 200:
            review_arr = review_arr[:200]
        elif len(review_arr) < 200:
            difference = np.zeros((200 - len(review_arr)))
            review_arr = np.append(difference, review_arr)
        processed.append(review_arr)
    
    return np.array(processed)

imdb_x = process_data(imdb_x)

<b> Separate the dataset into train and test sets, test set should be 1/3 of the set.</b> <p>
This sklearn method will make your life much easier: 
[train_test_split](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html)

In [319]:
from sklearn.model_selection import train_test_split

x_train_proc, x_test_proc, y_train, y_test = train_test_split(imdb_x, imdb_y, test_size=0.33)

At this point **your job is done!!!** Congratulations, if done correctly, the sentences are processed and ready to be used as features and labels to train a Recurrent Neural Network (LSTM). You will learn how to do this yourself in the next couple weeks. For now, you can just sit back and "follow along" as we build this model using Keras and then train it. 

The first thing we will do is initialize the model using Sequential.

In [320]:
import keras
from keras import Sequential

imdb_model = Sequential()

Now we want to add an embedding layer. The purpose of an embedding layer is to take a sequence of integers representing words in our case and turn each integer into a dense vector in some embedding space. (This is essentially the idea of Word2Vec https://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf). We want to create an embedding layer with vocab size equal to the max num words we allowed when we loaded the data (in this case 1000), and a fixed dense vector of size 32. Then we have to specify the max length of our sequences and we want to mask out zeros in our sequence since we used zero to pad.
Use the docs for embedding layer to fill out the missing entries: https://keras.io/layers/embeddings/

In [321]:
from keras.layers.embeddings import Embedding
imdb_model.add(Embedding(1000, 32, input_length=200, mask_zero=True))

#### **(a)** We add an LSTM layer with 32 outputs, then a Dense layer with 16 neurons, then a relu activation, then a dense layer with 1 neuron, then a sigmoid activation. Then we print out the model summary. The Keras documentation is here: https://keras.io/

In [322]:
from keras.layers.recurrent import LSTM
from keras.layers import Dense, Activation
imdb_model.add(LSTM(32))

In [323]:
imdb_model.add(Dense(units=16, activation='relu'))
imdb_model.add(Dense(units=1, activation='sigmoid'))

In [324]:
imdb_model.summary()

Model: "sequential_9"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_9 (Embedding)      (None, 200, 32)           32000     
_________________________________________________________________
lstm_7 (LSTM)                (None, 32)                8320      
_________________________________________________________________
dense_13 (Dense)             (None, 16)                528       
_________________________________________________________________
dense_14 (Dense)             (None, 1)                 17        
Total params: 40,865
Trainable params: 40,865
Non-trainable params: 0
_________________________________________________________________


#### **(b)** Now we compile the model with binary cross entropy, and the adam optimizer. We include accuracy as a metric in the compile. Then train the model on the processed data.

In [325]:
imdb_model.compile(loss=keras.losses.binary_crossentropy, optimizer=keras.optimizers.Adam(), metrics=['acc'])

In [326]:
imdb_model.fit(x_train_proc, y_train)

  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "


Epoch 1/1


<keras.callbacks.callbacks.History at 0x7f89bcb3ee10>

In [327]:
print("Accuracy: ", imdb_model.evaluate(x_test_proc, y_test)[1])

Accuracy:  0.7984848618507385


## If you did the data pre-processing correctly you should be getting around an 80% accuracy. congratulations, that is much better than random! 
<i>If you are getting a test accuracy that is significantly lower, you probably did something wrong, slack your NMEP team or go to office hours to get help sorting it out :) </i>

#### Now we can look at our predictions and the sentences they correspond to.

In [328]:
y_pred = imdb_model.predict(x_test_proc)

In [329]:
word_to_id = wordCounter
word_to_id["<PAD>"] = 0
word_to_id["<START>"] = 1
word_to_id["<UNK>"] = 2

id_to_word = {value:key for key,value in word_to_id.items() if value < 2000}
def get_words(token_sequence):
    return ' '.join(id_to_word[token] for token in token_sequence)

def get_sentiment(y_pred, index):
    return 'Positive' if y_pred[index] else 'Negative'

In [330]:
y_test = [i for i in y_test]
y_pred = np.vectorize(lambda x: int(x >= 0.5))(y_pred)
correct = []
incorrect = []
for i, pred in enumerate(y_pred):
    if y_test[i] == pred:
        correct.append(i)
    else:
        incorrect.append(i)

#### Now we print out one of the sequences we got correct.

In [331]:
print(get_sentiment(y_pred, correct[10]))
print(get_words(x_test_proc[correct[10]]))

Positive
<PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <START> <UNK> <UNK> is a great romantic <UNK> <UNK> <UNK> <UNK> <UNK> <UNK> <UNK> is a great <UNK> I think she did a great <UNK> I hope that <UNK> that the Disney <UNK> could put this movie on <UNK> I think it s kind of cool and a little bit <UNK> It s kind of sad when <UNK> <UNK> <UNK> in a <UNK> <UNK> in the beginning at first it make you want to <UNK> or <UNK> But in the middle when no one <UNK> <UNK> <UNK> it get kind of funny and <UNK> There is a little bit of mystery in this movie but not <UNK> But still I would recommend this movie to the whole family if they enjoy comedy <UNK> mystery <UNK> or romance type of movies. It s Great <UNK> I think tha

#### And one we got wrong.

In [332]:
print(get_sentiment(y_pred, incorrect[10]))
print(get_words(x_test_proc[incorrect[10]]))

Positive
<PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <START> Even though the book wasn t <UNK> <UNK> to the real situation it <UNK> it still <UNK> a sense of <UNK> I find it hard to believe that anyone who wa involved in making this film had ever been to <UNK> a it didn t feel Japanese in the <UNK> <UNK> everything about it wa <UNK> I will admit the actor were <UNK> quite good but couldn t stand a chance of <UNK> it. <UNK> the film started I wa surprised that there were only ten people in the cinema on a <UNK> night <UNK> after the movie had <UNK> in <UNK> <UNK> minute in I wa <UNK> they <UNK> I <UNK> so I would have the right to <UNK> it. The whole movie wa <UNK> my <UNK> and <UNK> laugh of <UNK> from my Japanese <UNK> <UNK> I saw <UNK> out of that cinema had look of <UNK> and <UNK> on their <UNK> <UNK> To the <UNK> of this movie <UNK> you

#### As you can see the amount of UNKNOWN characters in the sequence cause by having only 1000 vocab words is hurting our performance. If you want, go back and increase the number of vocab words to 2000 and compare your accuracy.

## And that's it! Now you should feel like a data engineering/preprocessing expert :) 