# Sentiment Analysis with Keras

The IMDB dataset holds 50,000 movie reviews. They are highly polarised and split evenly into positive and negative reviews (25,000 each). The task is binary sentiment classification: given a review, can a model predict whether it is positive or negative about the movie?

Most of the skills used here come from these DataCamp courses: 
- Recurrent Neural Networks for Language Modeling in Python
- Introduction to Deep Learning with Keras

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

from tensorflow import keras
from tensorflow.keras.layers import Dense, Dropout, Embedding, LSTM, BatchNormalization
from tensorflow.keras.models import Sequential
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.optimizers import Adam

from keras.preprocessing import sequence
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.text import text_to_word_sequence

import nltk
#nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
#nltk.download('averaged_perceptron_tagger')
from nltk.tag import pos_tag
from nltk.corpus import wordnet

from sklearn.model_selection import RandomizedSearchCV, KFold
from keras.wrappers.scikit_learn import KerasClassifier

Using TensorFlow backend.


## Introduction to the Data

In [3]:
df = pd.read_csv('IMDB Dataset.csv')
#df = df.sample(frac=0.1)

In [4]:
df.head()

| | review | sentiment |
|---|---|---|
| 0 | One of the other reviewers has mentioned that ... | positive |
| 1 | A wonderful little production. <br /><br />The... | positive |
| 2 | I thought this was a wonderful way to spend ti... | positive |
| 3 | Basically there's a family where a little boy ... | negative |
| 4 | Petter Mattei's "Love in the Time of Money" is... | positive |


In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   review     50000 non-null  object
 1   sentiment  50000 non-null  object
dtypes: object(2)
memory usage: 781.4+ KB


In [6]:
print (len(df[df['sentiment']=='negative']))
print (len(df[df['sentiment']=='positive']))

25000
25000


The dataset is balanced between negative and positive reviews. 

## Preparing the Text Data

Both Keras and NLTK can be used to get the text data ready to be fed into a neural network. 

In [145]:
reviews = df['review']

Keras has the `text_to_word_sequence` function that converts a text block into a list of words. It also filters out punctuation and converts text to lowercase. 

In [146]:
reviews = [text_to_word_sequence(phrase) for phrase in reviews]

In [None]:
print (reviews[1])

The next step is to keep only alphabetic tokens and to drop the `br` tokens left behind by the HTML `<br />` tags:

In [148]:
alphabetic_reviews = list()
for review in reviews: 
    alphabetic_review = [word for word in review if word.isalpha() and word != 'br']
    alphabetic_reviews.append(alphabetic_review)

Stopwords are commonly used words that add little to the meaning of a phrase, so they won't help our model predict sentiment. NLTK has a list of stopwords which can be used to filter the reviews of this dataset. 

In [150]:
stop_words = set(stopwords.words('english'))
filtered_reviews = list()
for review in alphabetic_reviews: 
    filtered_review = [word for word in review if word not in stop_words]
    filtered_reviews.append(filtered_review)

The NLTK `WordNetLemmatizer` reduces inflected words to their root form. Examples of this include:
- 'Cats' becoming 'cat'
- 'Corpora' becoming 'corpus'

In [152]:
lemmatizer= WordNetLemmatizer()

For the lemmatizer to work well, it needs to know what type of word it's processing. NLTK's `pos_tag` returns the POS (Part Of Speech) that each word falls under, e.g. verb. This helps the lemmatizer know how to lemmatize the word. But the lemmatizer takes a different kind of POS tag, so each tag has to be converted from an NLTK tag to a WordNet tag. This is done using a function (derived from https://gaurav5430.medium.com/using-nltk-for-lemmatizing-sentences-c1bfff963258):

In [153]:
def nltk_tag_to_wordnet_tag(nltk_tag):
    if nltk_tag.startswith('J'):
        return wordnet.ADJ
    elif nltk_tag.startswith('V'):
        return wordnet.VERB
    elif nltk_tag.startswith('N'):
        return wordnet.NOUN
    elif nltk_tag.startswith('R'):
        return wordnet.ADV
    else:          
        return None

Then another function can be created to lemmatize each review. First the words of each review are tagged with an NLTK Part Of Speech. Then these tags are converted to WordNet tags. Once each word is paired with its WordNet tag, the lemmatizer can run on each word. 

In [154]:
def lemmatize(review):
    #get nltk tags as a list of (word, tag) pairs
    nltk_tagged = nltk.pos_tag(review)
    #pair each word with its wordnet tag
    #(a list rather than a dict, so repeated words and word order are kept)
    wordnet_tagged = [(word, nltk_tag_to_wordnet_tag(tag)) for word, tag in nltk_tagged]

    lemmatized_sentence = []
    for word, tag in wordnet_tagged:
        if tag is None:
            #no tag means can't lemmatize
            lemmatized_sentence.append(word)
        else:
            lemmatized_sentence.append(lemmatizer.lemmatize(word, tag))
    return lemmatized_sentence

In [155]:
#run function on reviews
lemmatized_reviews = list()
for review in filtered_reviews: 
    lemmatized_reviews.append(lemmatize(review))

In [None]:
print (lemmatized_reviews[1])

Now each review has been cleaned up. But it's still not ready to be used in an ML model. The next step is to represent each review as a sequence of numbers. A Keras `Tokenizer` can be used to do this: it assigns an integer to every word and converts the reviews to sequences of those integers. The sequences then need to be padded so that they are all the same length before they can be used in the model. 

In [157]:
#initialise tokenizer 
tokenizer = Tokenizer()
tokenizer.fit_on_texts(lemmatized_reviews)
#convert words to numbers
features = tokenizer.texts_to_sequences(lemmatized_reviews)
#add zeros to arrays until they are all the same length 
features = sequence.pad_sequences(features)
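
To make the two steps above concrete, here is a minimal pure-Python sketch of what tokenising and padding do to a toy corpus. It is an illustration only, not the Keras implementation: the helper names `build_word_index`, `to_sequences` and `pad_left` are hypothetical, and the sketch assumes the Keras defaults of numbering words by frequency (1 for the most common word) and padding with zeros at the front.

```python
from collections import Counter

def build_word_index(tokenised_texts):
    """Map each word to an integer, most frequent first (0 is kept for padding)."""
    counts = Counter(word for text in tokenised_texts for word in text)
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

def to_sequences(tokenised_texts, word_index):
    """Replace every known word with its integer index."""
    return [[word_index[w] for w in text if w in word_index] for text in tokenised_texts]

def pad_left(sequences, fill=0):
    """Add zeros at the front until every sequence is as long as the longest one."""
    max_len = max(len(s) for s in sequences)
    return [[fill] * (max_len - len(s)) + s for s in sequences]

# Toy stand-in for lemmatized_reviews
toy_reviews = [['great', 'film'], ['bad', 'film', 'bad'], ['film']]
toy_index = build_word_index(toy_reviews)
toy_features = pad_left(to_sequences(toy_reviews, toy_index))

print(toy_index)
print(toy_features)
```

Because index 0 is never assigned to a word, the model can treat the leading zeros purely as padding.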

Our features are now ready to be used in a model. Next we need a target. Luckily this is a much simpler task: 

In [158]:
target = pd.get_dummies(df['sentiment'],drop_first=True)

## Transfer Learning for Language Models

Transfer learning, as the name suggests, involves transferring information from an already developed model to a new one. It means rather than the model starting with randomly chosen weights, it is initialised with better weights. 

GloVe (Global Vectors for Word Representation) is an example of this for language models. It was developed at Stanford. 

Using functions from DataCamp (Recurrent Neural Networks for Language Modeling in Python), the pretrained GloVe vectors can be extracted and set up for use with this dataset. 

In [159]:
#vocabulary dict holds the list of words collected by the tokenizer and the number it assigned to it 
vocabulary_dict = tokenizer.word_index
vocabulary_size = len(vocabulary_dict)

In [160]:
#path to the GloVe file (300-dimensional vectors)
file = 'glove.6B.300d.txt'

In [161]:
#first get all the words and associated vectors from the glove file 
glove_vector_dict = dict()
with open(file, encoding='utf-8') as f: 
    for line in f: 
        values = line.split()
        word = values[0]
        coefs = values[1:]
        glove_vector_dict[word] = np.asarray(coefs,dtype='float32')

In [162]:
#then create a matrix specific to this dataset by looking up words in the glove dictionary
#and adding their already established vector to an array 
embedding_matrix = np.zeros((vocabulary_size+1, 300))
for word, i in vocabulary_dict.items():
    embedding_vector = glove_vector_dict.get(word)
    if embedding_vector is not None: 
        embedding_matrix[i] = embedding_vector
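
The lookup above can be illustrated on a toy vocabulary. This is a sketch with made-up 3-dimensional vectors rather than real GloVe values, and all the `toy_` names are hypothetical. It shows why the matrix needs `vocabulary_size + 1` rows (tokenizer indices start at 1, so row 0 is left for padding) and that a word missing from the pretrained vectors keeps a row of zeros.

```python
# Made-up 3-dimensional vectors standing in for the 300-dimensional GloVe vectors
toy_glove = {'good': [0.1, 0.2, 0.3], 'bad': [-0.1, -0.2, -0.3]}

# Tokenizer-style index: numbering starts at 1, so row 0 is left for padding
toy_vocabulary = {'good': 1, 'zzzunknown': 2, 'bad': 3}

dim = 3
toy_matrix = [[0.0] * dim for _ in range(len(toy_vocabulary) + 1)]
for word, i in toy_vocabulary.items():
    vector = toy_glove.get(word)
    if vector is not None:
        # copy the pretrained vector into the row for this word
        toy_matrix[i] = list(vector)

for row in toy_matrix:
    print(row)
```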

## Deep Learning Model

The text data has been cleaned and translated into numbers and initial weights have been established. Now it's time to actually create a model. 

In [178]:
np.random.seed(1)

In [179]:
#initialise sequential model
model = Sequential()

## Embedding Layer
An embedding layer creates vectors in such a way that:
- It uses fewer dimensions than one-hot encoding (one-hot encoding would create very sparse vectors when dealing with the many unique words of this task).
- It places similar words closer together, because the vectors are learned from data rather than assigned arbitrarily. An embedding will, for instance, put positive words closer together.
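
The two points above can be sketched with a three-word vocabulary. The 2-dimensional vectors below are hand-picked for illustration and are not learned values; `toy_vocab`, `toy_embedding` and `cosine` are hypothetical names.

```python
import math

toy_vocab = ['good', 'great', 'awful']

def one_hot(word):
    """One dimension per vocabulary word, with a single 1."""
    return [1.0 if w == word else 0.0 for w in toy_vocab]

# Hand-picked dense vectors: fewer dimensions than the vocabulary has words
toy_embedding = {'good': [0.9, 0.1], 'great': [0.8, 0.2], 'awful': [-0.7, 0.3]}

def cosine(a, b):
    """Cosine similarity: 1 means same direction, 0 unrelated, -1 opposite."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# One-hot vectors of different words are always orthogonal
print(cosine(one_hot('good'), one_hot('great')))
# Dense vectors can express that 'good' is close to 'great' and far from 'awful'
print(cosine(toy_embedding['good'], toy_embedding['great']))
print(cosine(toy_embedding['good'], toy_embedding['awful']))
```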

In [180]:
input_length = features.shape[1]
output_dim = embedding_matrix.shape[1]

In [181]:
#add embedding layer using the embedding matrix from glove 
model.add(Embedding(input_dim=vocabulary_size + 1,
                     output_dim=output_dim, weights = [embedding_matrix], input_length=input_length)) 

## Batch Normalisation 
Batch normalisation rescales the inputs to a layer so that, within each batch, they have roughly zero mean and unit variance. This tends to make training more stable and helps the network converge more quickly. 
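
As a rough sketch of the core calculation, the function below normalises one batch of values for a single feature. It is a simplification: a real batch normalisation layer also learns a scale and a shift and keeps moving averages for use at prediction time. The name `batch_normalise` and the value of `epsilon` are assumptions made for this example.

```python
import math

def batch_normalise(values, epsilon=1e-3):
    """Subtract the batch mean and divide by the batch standard deviation."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    # epsilon avoids dividing by zero when the variance is tiny
    return [(v - mean) / math.sqrt(variance + epsilon) for v in values]

batch = [2.0, 4.0, 6.0, 8.0]
normalised = batch_normalise(batch)
print(normalised)
```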

In [182]:
model.add(BatchNormalization())

## RNNs 
RNNs use the same information that simple neural networks do, but they also learn the context of their predictions. This is useful in sentiment analysis: the words that precede or follow a given word can change its sentiment. 

In the case of sentiment analysis a 'many to one' RNN would be used. This takes a sequence of inputs and generates a single output. However, simple RNNs aren't used very often in practice. The main reasons for this are the vanishing and exploding gradient problems. 

Vanishing gradient problem: simple RNNs don't perform well on long sequences. As the error is propagated back through many time steps, the gradient can become vanishingly small, so the weights that depend on early inputs barely update and the model stops improving. 

Exploding gradient problem: over many time steps an RNN can also accumulate very large error gradients. This results in very large updates to the weights, which makes training unstable and the model perform poorly. 
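
Both problems come from repeated multiplication. The sketch below is a deliberate simplification: a real RNN multiplies by matrices of derivatives at every time step, whereas here a single scalar factor stands in for them. The function name `backpropagated_gradient` is hypothetical.

```python
def backpropagated_gradient(factor, time_steps, start=1.0):
    """Multiply a gradient by the same factor once per time step."""
    gradient = start
    for _ in range(time_steps):
        gradient *= factor
    return gradient

for steps in (10, 100, 1000):
    shrinking = backpropagated_gradient(0.9, steps)   # factor below 1: vanishes
    growing = backpropagated_gradient(1.1, steps)     # factor above 1: explodes
    print(steps, shrinking, growing)
```

The padded reviews in this notebook are over a thousand tokens long, which is well inside the range where either effect becomes severe.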

## LSTMs 
Long Short Term Memory networks (LSTMs) are a type of RNN. They function better than a simple RNN for most sequence problems. They are more selective in what they remember in order to predict the future. 

They largely mitigate the vanishing gradient problem because their cell state carries information across many time steps and is only changed where the gates allow it. Exploding gradients can still occur and are usually handled separately, for example with gradient clipping. 

The difference between these networks and a simple RNN is the set of operations inside the LSTM cells. They contain mechanisms called gates that control the flow of information. The gates allow the network to decide whether to keep or forget information based on importance. The gates use sigmoid activations, which keep values between 0 and 1. Here 0 means a value should be forgotten and 1 means it should be kept. 
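
The effect of a gate can be sketched in a few lines. This shows only the forget step, with gate inputs chosen by hand; in a real LSTM those inputs are computed from the current word and the previous hidden state using learned weights, and there are further gates for input and output. `apply_forget_gate` is a hypothetical helper.

```python
import math

def sigmoid(x):
    """Squash any real number into the range 0 to 1."""
    return 1.0 / (1.0 + math.exp(-x))

def apply_forget_gate(cell_state, gate_pre_activations):
    """Scale each stored value by a gate value between 0 (forget) and 1 (keep)."""
    gate = [sigmoid(z) for z in gate_pre_activations]
    return [g * c for g, c in zip(gate, cell_state)]

cell_state = [2.0, -1.5, 0.5]
gate_inputs = [10.0, -10.0, 0.0]   # strongly keep, strongly forget, keep half
print(apply_forget_gate(cell_state, gate_inputs))
```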

In [183]:
model.add(LSTM(6, return_sequences = False))                    

## Dropout Layer
Dropout involves randomly choosing some neurons to ignore during training. This makes the model less sensitive to the specific weights of individual neurons and should help it not to overfit the training data. 
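
A minimal sketch of the idea, assuming the common "inverted dropout" formulation in which the surviving activations are scaled up so that their expected sum stays the same. The function below is a hypothetical stand-in for illustration, not the Keras layer itself.

```python
import random

def dropout(activations, rate, rng):
    """Zero each activation with probability `rate` and rescale the rest."""
    keep_prob = 1.0 - rate
    return [a / keep_prob if rng.random() >= rate else 0.0 for a in activations]

rng = random.Random(1)          # seeded so the run is repeatable
activations = [1.0] * 10
print(dropout(activations, rate=0.2, rng=rng))
```

At prediction time nothing is dropped, so the layer passes its inputs through unchanged.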

In [184]:
model.add(Dropout(0.2))

In [185]:
#output layer
model.add(Dense(1,activation='sigmoid'))

In [186]:
model.summary()

Model: "sequential_108"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_108 (Embedding)    (None, 1401, 300)         26229300  
_________________________________________________________________
batch_normalization_9 (Batch (None, 1401, 300)         1200      
_________________________________________________________________
lstm_106 (LSTM)              (None, 6)                 7368      
_________________________________________________________________
dropout_1 (Dropout)          (None, 6)                 0         
_________________________________________________________________
dense_13 (Dense)             (None, 1)                 7         
Total params: 26,237,875
Trainable params: 26,237,275
Non-trainable params: 600
_________________________________________________________________
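
The parameter counts in the summary can be reproduced by hand, which is a useful check that the layers are wired as intended. In the sketch below the number of embedding rows (87,431) is inferred by dividing the embedding parameter count by 300; it is not stated elsewhere in the notebook.

```python
vocabulary_rows = 87431      # vocabulary_size + 1, inferred from 26,229,300 / 300
embedding_dim = 300
lstm_units = 6

# One 300-dimensional vector per row of the embedding matrix
embedding_params = vocabulary_rows * embedding_dim

# Per feature: scale, shift, moving mean, moving variance
batch_norm_params = 4 * embedding_dim

# Four weight sets (three gates plus the candidate cell state), each with
# input weights, recurrent weights and a bias
lstm_params = 4 * (lstm_units * (embedding_dim + lstm_units) + lstm_units)

# One weight per LSTM unit plus a bias
dense_params = lstm_units * 1 + 1

total = embedding_params + batch_norm_params + lstm_params + dense_params
non_trainable = 2 * embedding_dim   # the moving mean and moving variance
print(embedding_params, batch_norm_params, lstm_params, dense_params)
print(total, total - non_trainable, non_trainable)
```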


In [187]:
#early stopping helps prevent overfitting 
early_stopping_monitor = EarlyStopping(patience=2,monitor='val_loss')

In [188]:
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

In [189]:
model.fit(features, target, validation_split=0.3, epochs=3,
          verbose=True,callbacks = [early_stopping_monitor])

Train on 35000 samples, validate on 15000 samples
Epoch 1/3
Epoch 2/3
Epoch 3/3


## Hyperparameter Tuning 

### Activation Functions 

An activation function is a mathematical function applied to the output of a neuron, and it determines what that neuron passes on to the next layer. Without activation functions a network is essentially just a linear model, however many layers it has, which prevents it from learning complex patterns in the data. 

Three activation functions to consider: 
- Sigmoid: varies between 0 and 1 for all possible X input values 
- Tanh: varies between -1 and 1
- ReLU: varies between 0 and infinity 
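
The three functions listed above can be written out directly to compare their ranges. This is a plain-Python sketch for single numbers; Keras applies the same functions element-wise to whole tensors.

```python
import math

def sigmoid(x):
    """Output between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-x))

def tanh(x):
    """Output between -1 and 1."""
    return math.tanh(x)

def relu(x):
    """Output is 0 for negative inputs and x otherwise."""
    return max(0.0, x)

for x in (-5.0, 0.0, 5.0):
    print(x, round(sigmoid(x), 4), round(tanh(x), 4), relu(x))
```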

### Batch Size 

A batch is a subset of the training data. Batch size is a hyperparameter that controls how many samples the model works through before its internal parameters are updated, so one epoch consists of several such updates. As a rule of thumb, the bigger the dataset, the bigger the batch size. 
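
The trade-off is easy to quantify: a larger batch size means fewer weight updates per epoch. The sketch below uses the 35,000 training samples reported by `model.fit` earlier and the three batch sizes tried in the search; `updates_per_epoch` is a hypothetical helper.

```python
import math

def updates_per_epoch(n_samples, batch_size):
    """Number of weight updates in one pass over the data (last batch may be smaller)."""
    return math.ceil(n_samples / batch_size)

n_training_samples = 35000
for batch_size in (32, 128, 256):
    print(batch_size, updates_per_epoch(n_training_samples, batch_size))
```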

### Learning Rate

Adam is a commonly recommended optimizer for deep learning problems. Its learning rate is a very important hyperparameter: it controls how much the weights change in response to the estimated error each time they are updated. If the learning rate is too small, training may take too long; if it is too large, the updates can overshoot and the model may settle on poor weights or fail to converge. 
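
The effect of the learning rate can be seen on a one-dimensional toy problem. Note the assumptions: this sketch uses plain gradient descent rather than Adam, and minimises f(x) = x² instead of a network loss, so it only illustrates the general behaviour.

```python
def gradient_descent(learning_rate, steps=20, start=5.0):
    """Minimise f(x) = x**2, whose gradient is 2*x, starting from `start`."""
    x = start
    for _ in range(steps):
        x -= learning_rate * 2 * x
    return x

# The minimum is at x = 0
for learning_rate in (0.001, 0.1, 1.1):
    print(learning_rate, gradient_descent(learning_rate))
```

With the smallest rate the value has barely moved after 20 steps, the middle rate gets close to the minimum, and the largest rate overshoots further on every step and diverges.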

## Cross-Validation
I ran cross-validation on a smaller subset of the dataset and the best parameters that were found were: 

In [None]:
{'learning_rate': 0.01, 'batch_size': 256, 'activation': 'tanh'}

In [190]:
#set random seed for reproducibility 
np.random.seed(1)

In [191]:
# Creates a model given an activation and learning rate
def create_model(learning_rate, activation):
    opt = keras.optimizers.Adam(learning_rate = learning_rate)
    model = Sequential()
    model.add(Embedding(input_dim=vocabulary_size + 1, output_dim=output_dim, weights = [embedding_matrix], input_length=input_length)) 
    model.add(Dense(6, activation = activation))
    model.add(LSTM(6, return_sequences = False))
    model.add(Dropout(0.2))
    model.add(Dense(1,activation='sigmoid', name = 'output'))
    model.compile(optimizer = opt, loss = 'binary_crossentropy', metrics = ['accuracy'])
    return model

In [None]:
tuning_model = KerasClassifier(build_fn = create_model,verbose=0)

params = {'activation': ['sigmoid','relu', 'tanh'], 'batch_size': [32, 128, 256],
          'learning_rate': [0.1, 0.01, 0.001]}

#search for parameters that give the lowest cross-validated log loss
random_search = RandomizedSearchCV(tuning_model, param_distributions = params, cv = KFold(3), scoring = 'neg_log_loss')

random_search.fit(features,target)
print (random_search.best_params_)
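
For reference, this is roughly how a 3-fold split without shuffling divides the data: each fold takes a turn as the validation set while the model is trained on the other two. The generator below is a simplified sketch written for this explanation, not scikit-learn's own code, and `k_fold_indices` is a hypothetical name.

```python
def k_fold_indices(n_samples, n_splits):
    """Yield (train, validation) index lists for contiguous, unshuffled folds."""
    fold_sizes = [n_samples // n_splits] * n_splits
    # spread any remainder over the first folds
    for i in range(n_samples % n_splits):
        fold_sizes[i] += 1
    start = 0
    for size in fold_sizes:
        stop = start + size
        validation = list(range(start, stop))
        train = [i for i in range(n_samples) if i < start or i >= stop]
        yield train, validation
        start = stop

for train, validation in k_fold_indices(10, 3):
    print('train:', train, 'validation:', validation)
```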

## Conclusion

Using deep learning for sentiment classification is known to be effective. Here our model has learnt about movie reviews using these steps:
1. Translating the text of the reviews into numbers 
2. Using GloVe embeddings to give the model a better than random starting point 
3. Building a deep learning model using LSTM, Dropout, BatchNormalization and Dense layers.
4. Evaluating this model using cross validation to find the best hyperparameters 