# Practical 6 - Reviews Sentiment Classification 
In this notebook, we will attempt to classify, whether a review is good or bad, given the text of the review.  
This is the solution notebook for the [Kaggle Competition Here](https://www.kaggle.com/c/mixed-reviews-dataset/overview)

## Imports

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import os
import re
import random
import string
from datetime import datetime
from functools import partial

from keras import (models, optimizers, layers, callbacks, 
                   regularizers)

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from gensim.models import KeyedVectors

import baseline_model

%matplotlib inline

Using TensorFlow backend.


## Sourcing the data
We will use pandas to source the data from the CSV:

In [2]:
reviews_df = pd.read_csv("reviews_train.csv")

Read the test data from the CSV:

In [3]:
test_df = pd.read_csv("reviews_test.csv")

## Exploring the data
Taking a look at the first five rows of the data:

In [4]:
reviews_df.head()

Unnamed: 0.1,Unnamed: 0,review,label
0,0,"102 Dalmatians (2000, Dir. Kevin Lima) <br /><...",0
1,1,By 1971 it was becoming more and more obvious ...,0
2,2,"This film, once sensational for its forward-th...",0
3,3,One has to be careful whom one tells about wat...,1
4,4,I read somewhere where this film was supposed ...,0


The `Unnamed: 0` column is confusingly named, upon closer inspection, we realise that it is actually the `id`.

As such we rename `Unnamed: 0` to `id`:

In [5]:
reviews_df = reviews_df.rename(columns={"Unnamed: 0": "id"})
test_df = test_df.rename(columns={"Unnamed: 0": "id"})
reviews_df.columns

Index(['id', 'review', 'label'], dtype='object')

Lets take a look at a random review to see how the dataset works.

In [6]:
rand_idx = random.randint(0, len(reviews_df))
# Get random review and label from reviews dataframe 
rand_review = reviews_df.loc[rand_idx, "review"]
rand_label = reviews_df.loc[rand_idx, "label"]

print("Review: \n", rand_review, "\n Label: ", rand_label)

Review: 
 I bought this movie last weekend at my local Movie Gallery. It was buy 2 get 2 free and I needed one more so I chose this one. Horrible mistake. The box reads like it would be a really good movie. Well, it starts out like it is going to be this great movie. For about 5 minutes, that is. The movie is about a young woman, Laila, who gets killed trying to save her beau, Jack, from a bull. Laila's dad, Cordobes, is a rancher that the townspeople are afraid of. He assumes that Jack killed Laila because she was supposedly afraid of this bull, and goes on this hunt to find him. That was the first 5 minutes that is good. What follows after that is only gonna get 100 times worse. Whoever wrote the script, in my opinion, had to of been on some kind acid trip or something because nothing else made any sense what so ever. Jack is on the run and finds this traveling radio DJ named Mary who gives him a ride. I think Mary is supposed to be a virgin Mary type character. You know, Jesus' moth

As we can observe from the reviews, we have to do the following steps before we can conduct ML:
- Split up the words in the review
- Remove the HTML tags (ie `<br/>`)
- Deal with the punctuation somehow>
- Dealing capital letters in the reviews
- Dealing with reviews with no words whatever

## Preparing the data
Removing the HTML tags from the data as they not relevant:

In [7]:
# make a copy so that we dont overwrite the original dataframe
pp_reviews_df = reviews_df.copy()
pp_test_df = test_df.copy()
# remove html tags
pp_reviews_df["review"] = reviews_df["review"].transform(
    (lambda review: re.sub("<.+>", "", review)))
pp_test_df["review"] = test_df["review"].transform(
    (lambda review: re.sub("<.+>", "", review)))

Removing punctuation in the reviews:

In [8]:
def remove_punctuation(input_str):
    for p in string.punctuation:
        input_str = input_str.replace(p, "")
    return input_str

pp_reviews_df["review"] = pp_reviews_df["review"].transform(
     remove_punctuation)
pp_test_df["review"] = pp_test_df["review"].transform(
     remove_punctuation)

Remove the captial letters by converting them to lower case:

In [9]:
pp_reviews_df["review"] = pp_reviews_df["review"].transform(
    (lambda review: review.lower()))
pp_test_df["review"] = pp_test_df["review"].transform(
    (lambda review: review.lower()))

We have finished with the data cleaning process.  
Lets take a look at the processed data:

In [10]:
pp_reviews_df.head()

Unnamed: 0,id,review,label
0,0,102 dalmatians 2000 dir kevin lima shes change...,0
1,1,by 1971 it was becoming more and more obvious ...,0
2,2,this film once sensational for its forwardthin...,0
3,3,one has to be careful whom one tells about wat...,1
4,4,i read somewhere where this film was supposed ...,0


### Converting the Words into Vectors
#### One hot encoding
We will convert the each review into a vector of counts of each word.

In [11]:
vectorizer = CountVectorizer()
reviews = pp_reviews_df["review"]
review_vectors = vectorizer.fit_transform(reviews)

In [12]:
review_vectors.toarray()

MemoryError: 

In [13]:
print("required memory:")
len(pp_reviews_df) * len(vectorizer.get_feature_names())

required memory:


6945550000

However, One Hot Encoding the data using `CountVectorizer` does not seem very feasible because it requires a load of RAM we do not have.

### Feature Extraction using Word Vectors
Here we apply a form of transfer learning: using other peoples models to do feature extraction. In this case we use Facebook AI research's Fasttext model to convert the words to vectors

First we need to download [Fasttext pretrained word vectors](https://dl.fbaipublicfiles.com/fasttext/vectors-english/wiki-news-300d-1M.vec.zip) and  unzip it in the notebook's folder

Then we load the word to vector encoder like so:
> note: its going to take some time to load the vectors

In [14]:
%%time
encoder = KeyedVectors.load_word2vec_format("wiki-news-300d-1M.vec",
                                            limit=200000)

  'See the migration notes for details: %s' % _MIGRATION_NOTES_URL


CPU times: user 1min 9s, sys: 334 ms, total: 1min 9s
Wall time: 1min 9s


Once we have loaded the vectors, we can convert the words in the 
reviews into word vectors.

Since vector arithmetic works on word vectors, we compute the mean of the word vectors for each review to get a vector representing the "meaning" of review:

In [15]:
# define a function to convert a review into a vector
def convert_review_to_vector(review):
    # split the review into words
    words = review.split()
    
    # convert to the words into vectors
    unknown_vector = np.zeros((300,))
    vectors = [ unknown_vector ] # handle no word reviews
    
    for word in words:
        if word in encoder.vocab:
            vector = encoder[word]
        else: # word is not in encoder vocabulary
            vector = unknown_vector
        vectors.append(vector) # convert to float32 to reduce ram use

    # mean up the word vectors to get the "meaning" of the review
    return np.mean(vectors, axis=0)

# convert the reviews in the dataframe itself
%time pp_reviews_df["review"] = pp_reviews_df["review"].transform(convert_review_to_vector)
%time pp_test_df["review"] = pp_test_df["review"].transform(convert_review_to_vector)

CPU times: user 16.9 s, sys: 35.8 ms, total: 16.9 s
Wall time: 16.9 s
CPU times: user 1.02 s, sys: 11 µs, total: 1.02 s
Wall time: 1.02 s


Extract out the inputs (review vectors) and the outputs (labels):

In [16]:
review_vectors = np.stack(pp_reviews_df["review"].values) #inputs
test_review_vectors = np.stack(pp_test_df["review"].values) #inputs
labels = np.stack(pp_reviews_df["label"].values) #outputs

Split the data into the training and validation sets:

In [17]:
train_vectors, valid_vectors, train_labels, valid_labels = (
    train_test_split(review_vectors, labels, 
                     test_size=10000,
                     shuffle=True))

### Build the model
Now we proceed to build the model:

In [18]:
# Build model given parameters
def build_model(input_shape, n_outputs, scale_width, scale_depth,
               activation, l2_lambda):
    model = models.Sequential()

    # Input layer
    model.add(layers.InputLayer(input_shape))
    
    # add hidden layers
    for i in range(scale_depth):
        model.add(layers.Dense(scale_width,
                               kernel_regularizer=
                               regularizers.l2(l2_lambda)))
        model.add(activation())
    
    # output layer
    model.add(layers.Dense(n_outputs, activation="sigmoid"))
    
    return model

Lets test our build function like so:

In [19]:
model = build_model(input_shape=(300,),
                   n_outputs=1,
                   scale_width=64,
                   scale_depth=3,
                   activation=layers.ReLU,
                   l2_lambda=0)

model.summary()

Instructions for updating:
Colocations handled automatically by placer.
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_1 (Dense)              (None, 64)                19264     
_________________________________________________________________
re_lu_1 (ReLU)               (None, 64)                0         
_________________________________________________________________
dense_2 (Dense)              (None, 64)                4160      
_________________________________________________________________
re_lu_2 (ReLU)               (None, 64)                0         
_________________________________________________________________
dense_3 (Dense)              (None, 64)                4160      
_________________________________________________________________
re_lu_3 (ReLU)               (None, 64)                0         
_________________________________________________________________
dens

## Training the Model
Since this is a binary classification problem ('good' or 'bad'),  
we should use the `binary_crossentropy` loss function.


In [20]:
# We compile the model as follows:
model.compile(
    optimizers.Adam(lr=1e-3),
    loss="binary_crossentropy",
    metrics=["accuracy"])

# derive a unique name for our run based on the current time
run_name = "run_" + datetime.strftime(datetime.now(), "%Y_%m_%d__%H_%M_%S")
run_path = os.path.join("logs", run_name)
os.makedirs(run_path, exist_ok=True)

# Training the model:
model.fit(train_vectors, train_labels,
         validation_data=(valid_vectors, valid_labels),
         batch_size=64,
         epochs=4,
         callbacks=[callbacks.TensorBoard(log_dir=run_path)])

Instructions for updating:
Use tf.cast instead.
Train on 40000 samples, validate on 10000 samples
Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4


<keras.callbacks.History at 0x7fc7b3244dd8>

# Iterating the Model
We change different hyperparameters of the model to attempt to improve its performance (accuracy):
- learning rate
- no. of epochs trained
- no. of hidden layers
- no. of hidden units
- adding regularization

In [21]:
model = build_model(input_shape=(300,),
                   n_outputs=1,
                   scale_width=128,
                   scale_depth=2,
                   activation=(lambda:layers.LeakyReLU(0.3)),
                   l2_lambda=1e-4)

# We compile the model as follows:
model.compile(
    optimizers.Adam(lr=1e-3),
    loss="binary_crossentropy",
    metrics=["accuracy"])

# derive a unique name for our run based on the current time
run_name = "run_" + datetime.strftime(datetime.now(), "%Y_%m_%d__%H_%M_%S")
run_path = os.path.join("logs", run_name)
os.makedirs(run_path, exist_ok=True)

# Training the model:
model.fit(train_vectors, train_labels,
         validation_data=(valid_vectors, valid_labels),
         batch_size=64,
         epochs=100,
         callbacks=[callbacks.TensorBoard(log_dir=run_path),
                    callbacks.ReduceLROnPlateau(monitor="val_loss",
                                               factor=0.5,
                                               patience=4,
                                               cooldown=4)])

Train on 40000 samples, validate on 10000 samples
Epoch 1/100
Epoch 2/100
Epoch 3/100
Epoch 4/100
Epoch 5/100
Epoch 6/100
Epoch 7/100
Epoch 8/100
Epoch 9/100
Epoch 10/100
Epoch 11/100
Epoch 12/100
Epoch 13/100
Epoch 14/100
Epoch 15/100
Epoch 16/100
Epoch 17/100
Epoch 18/100
Epoch 19/100
Epoch 20/100
Epoch 21/100
Epoch 22/100
Epoch 23/100
Epoch 24/100
Epoch 25/100
Epoch 26/100
Epoch 27/100
Epoch 28/100
Epoch 29/100
Epoch 30/100
Epoch 31/100
Epoch 32/100
Epoch 33/100
Epoch 34/100
Epoch 35/100
Epoch 36/100
Epoch 37/100
Epoch 38/100
Epoch 39/100
Epoch 40/100
Epoch 41/100
Epoch 42/100
Epoch 43/100
Epoch 44/100
Epoch 45/100
Epoch 46/100
Epoch 47/100
Epoch 48/100
Epoch 49/100
Epoch 50/100
Epoch 51/100
Epoch 52/100
Epoch 53/100
Epoch 54/100
Epoch 55/100
Epoch 56/100
Epoch 57/100
Epoch 58/100
Epoch 59/100
Epoch 60/100
Epoch 61/100
Epoch 62/100
Epoch 63/100
Epoch 64/100
Epoch 65/100
Epoch 66/100
Epoch 67/100
Epoch 68/100
Epoch 69/100
Epoch 70/100
Epoch 71/100
Epoch 72/100
Epoch 73/100
Epoch 74/1

<keras.callbacks.History at 0x7fc76056de80>

## Predicting using the Model
We predict using the model by feeding it test data:

In [22]:
valid_probs = model.predict(valid_vectors)
test_probs = model.predict(test_review_vectors)

We threshold the probabilties to get a set of predictions:

In [23]:
threshold = 0.5
valid_preds = np.squeeze(valid_probs >= threshold)
test_preds = np.squeeze(test_probs >= threshold)

We tune threshold by cross validating using accuracy:

In [24]:
print(f"accuracy: {accuracy_score(valid_labels, valid_preds)}")

accuracy: 0.8271


## Preparing your Predictions
Prepare test predictions in the format `id`, `labels`:

In [25]:
# Obtain the ids
ids = test_df["id"].values

# Create the submission dataframe
submit_df = pd.DataFrame(data={
    "id": ids,
    "label": test_preds.astype("int")
})

submit_df.head()

Unnamed: 0,id,label
0,0,1
1,1,1
2,2,0
3,3,0
4,4,1


Finally, we write out submission dataframe into a CSV file:

In [26]:
print("Predictions: ", len(submit_df))
submit_df.to_csv("submit.csv", index=False)

Predictions:  3000
