[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Humboldt-WI/adams/blob/master/exercises/Ex11-sentiment-analysis.ipynb) 


# ADAMS Tutorial #11: Sentiment Analysis

Sentiment analysis is one of the most popular applications of text classification. It is also interesting from a business perspective. For example, many companies have an interest in analyzing text data emerging in social media to understand how consumers value their products, services, brands, etc.

The goal of sentiment analysis is to model the polarity of a piece of text, whether it is rather positive or rather negative. We can frame that as a binary classification problem, with labels of one and zero indicating positive or negative sentiment, respectively. That is the approach we will take today. Other options exist and could involve modeling a three-level target (positive, neutral, negative) or a numeric target variable the values of which represent different strengths of polarity (e.g., between +5 and -5). Whenever approaching the sentiment analysis task by supervised learning, we depend on having some data with sentiment labels. That is often the real challenge in practice - where do the labels come from? - and explains why many labeled data sets re-occur in papers.

We will look at the preprocessed movie review data set that we used in the last tutorial, originally gathered and studied by Maas et al. (https://www.aclweb.org/anthology/P11-1015). We will apply different modeling approaches to predict review sentiment, ranging from a simple dictionary-based approach through conventional supervised machine learning to several deep learning techniques. Here is the outline of the tutorial.
 
 1. Preliminaries
 2. Dictionary-based sentiment analysis
 3. Linear classifier with vectorized inputs
 4. Deep learning for text classification

## 1. Preliminaries
We reuse the data set from the [last tutorial](https://github.com/Humboldt-WI/adams/blob/master/exercises/Ex10_word2vec/Ex10_word2vec.ipynb). To set up our environment we load the IMDB 50K review data set and also the list of cleaned reviews. We then add the cleaned reviews to the data set to have everything in one place. It is a good idea to examine a few reviews and make sure that the original version and the cleaned version match. Once this is confirmed, you can safely discard the raw review text to save some memory. Finally, we update the coding of our target variable and encode positive and negative reviews as one and zero, respectively.

The notebook sets a new milestone in terms of demand for computational resources. We recommend running the notebook  in Colab or another cloud-based platform of your choice.  

In [None]:
# Import standard libraries
import pandas as pd
import numpy as np
import time
import matplotlib.pyplot as plt
%matplotlib inline

# Load/save data from/to disk 
import pickle

# Assess sentiment classification models 
from sklearn.metrics import accuracy_score, confusion_matrix

# Working with pre-trained embeddings from Tutorial #10
from gensim.models import KeyedVectors
from gensim.models.keyedvectors import Word2VecKeyedVectors

In [None]:
# Create a global variable to indicate whether the notebook is run in Colab
import sys
IN_COLAB = 'google.colab' in sys.modules


# Configure variables pointing to directories and stored files 
if IN_COLAB:
    # Mount Google-Drive
    from google.colab import drive
    drive.mount('/content/drive')
    DATA_DIR = '/content/drive/My Drive/data/'  # adjust to Google drive folder with the data if applicable
else:
    DATA_DIR = '../../data/' # adjust to the directory where data is stored on your machine (if running the notebook locally)

IMDB_50K = 'IMDB-50K-Movie-Review.zip'  # CSV file with the original IMDB 50K data set
CLEAN_REVIEW = 'imdb_clean_full.pkl'   # List with tokenized reviews after standard NLP preparation

### Data integration
We load the data set with the movie reviews and their binary sentiment label. We then load the cleaned review data from [Tutorial #10](https://github.com/Humboldt-WI/adams/blob/master/exercises/Ex10_word2vec/Ex10_word2vec.ipynb) and store it in the data frame.

In [None]:
# Load the raw review data set
df = pd.read_csv(DATA_DIR + IMDB_50K, sep=",", encoding="ISO-8859-1")
df.info()

In [None]:
# Binary-encode the target variable
df['sentiment'] = df['sentiment'].map({'positive' : 1, 'negative': 0})
df['sentiment'].value_counts()

In [None]:
# Load the cleaned review text from tutorial #10
import pickle
with open(DATA_DIR + CLEAN_REVIEW,'rb') as path_name:
    reviews = pickle.load(path_name)
assert len(reviews)==50000

The cleaned reviews are stored as a list of lists. To store them in one column of the data frame, we need to revert the tokenization and create one long string for every review. NLTK provides functions to undo tokenization, as we illustrate below. Admittedly, this approach is a little inefficient: we will need to tokenize the reviews again later to feed them into Keras. Despite this justified criticism, following the notebook is probably easiest if all the data is stored in one central place, namely our data frame. Hence, we prepare the data in this way and do not worry about efficiency.

In [None]:
# Undo the tokenization and put the data into a new column in the data frame.
from nltk.tokenize.treebank import TreebankWordDetokenizer

df['review_clean'] = [TreebankWordDetokenizer().detokenize(review) for review in reviews]

In [None]:
df

All looks good, let's start with our first sentiment model. However, before moving on, it is probably a good idea to save our data frame to disk. After all, the detokenization took a little while and we don't want to have to do it again if something happens to our data frame. Since we used Pickle before, we stick to this library and simply pickle our data frame. To save disk space, we get rid of the original review text before saving.

In [None]:
df.drop('review', inplace=True, axis=1)

In [None]:
# Store data frame to disk
file_name = 'ex11_imdb50K_clean.pkl'
df.to_pickle(DATA_DIR + file_name)

In [None]:
# Load data frame from disk
file_name = 'ex11_imdb50K_clean.pkl'
df = pd.read_pickle(DATA_DIR + file_name)

In [None]:
df

#### Downsampling the data to increase speed
One more thing before moving on. You should decide whether you want to proceed with the full data frame (i.e., 50K reviews) or draw a random sample to decrease the runtime of the following examples. Using all the data is feasible on any decent computer, but prepare for a bit of waiting when training our neural networks. Here is a little bit of code to reduce the amount of data. You can also go back to the previous code and *pickle* the sampled data frame to store it for further use.

In [None]:
# Draw a random sample of n reviews to increase the speed of the following steps.
# We sample without replacement so that no review is drawn twice.
n = 5000
df = df.sample(n=n, random_state=111)  # the fixed seed is our choice, for reproducibility
df
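
Why sample without replacement? The following minimal sketch uses toy sizes and only the standard library (not Pandas) to illustrate that drawing random row positions with replacement typically yields duplicates, whereas sampling without replacement returns exactly `n` distinct rows. All names and sizes are illustrative assumptions.

```python
import random

random.seed(111)  # fixed seed so the sketch is reproducible

n_rows, n = 50, 40  # toy sizes, not the tutorial's 50K reviews / 5000 samples

# With replacement: the same row position can be drawn more than once
with_repl = [random.randrange(n_rows) for _ in range(n)]

# Without replacement: every drawn position is distinct
without_repl = random.sample(range(n_rows), n)

print(len(set(with_repl)), 'distinct rows when sampling with replacement')
print(len(set(without_repl)), 'distinct rows when sampling without replacement')
```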

## 2. Dictionary-based sentiment analysis

A simple approach to rating the sentiment of a text is to literally model it as the sum of its parts, that is, through the sentiment of each word. AFINN is an English word list developed by Finn Årup Nielsen. Word scores range from minus five (negative) to plus five (positive). The English language dictionary consists of 2,477 coded words. Note that you will need to install the library before being able to run the following code.

In [None]:
#    !pip install afinn

from afinn import Afinn
afinn = Afinn(language='en')

We look up the sentiment score for each word in turn and sum up the sentiment values over words. Here are a few examples. Quite easy, isn't it?

In [None]:
#* Some examples how to rate texts. Larger values indicate stronger positive feelings
print(afinn.score("What a marvelous evening, the weather is simply delightful. Wonderful!"))
print(afinn.score("I am devastated, the donuts are not what they used to be, what a horrendous taste"))
print(afinn.score("To be or not to be, that is the question.."))
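
To make the "sum of its parts" idea concrete, here is a minimal sketch of a dictionary-based scorer. The lexicon `toy_lexicon` and its scores are made up for illustration and are not the actual AFINN values; the function `toy_score` is likewise hypothetical and much cruder than the library.

```python
import re

# Hypothetical mini-lexicon; the scores are invented and NOT taken from AFINN
toy_lexicon = {'marvelous': 3, 'delightful': 3, 'wonderful': 4,
               'devastated': -2, 'horrendous': -3}

def toy_score(text, lexicon=toy_lexicon):
    """Sum the scores of all known words; unknown words contribute zero."""
    words = re.findall(r"[a-z']+", text.lower())
    return sum(lexicon.get(word, 0) for word in words)

print(toy_score("What a marvelous evening, the weather is simply delightful. Wonderful!"))
print(toy_score("To be or not to be, that is the question.."))
```

Note that such a scorer treats every word in isolation. It cannot handle negation or sarcasm, which is one reason why dictionary-based approaches may struggle with longer reviews.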

Since we already have a data frame, why not add the sentiment score of every review as a new column? This is a nice use case for the *.apply()* function that Pandas data frames support. We score the cleaned version of the review. If you fancy a little exercise, consider also scoring the original review text and comparing the two scores (since we dropped the original text above, you would need to reload it first). You could then identify reviews where the sentiment scores differ substantially between the original and cleaned text. That might point to some issues in our data preparation, i.e., the cleaning of the review text in [Tutorial #10](https://github.com/Humboldt-WI/adams/blob/master/exercises/Ex10_word2vec/Ex10_word2vec.ipynb).

In [None]:
# Add the Afinn scores to our data frame 
# Caution: if you use the full data set of 50K reviews, the scoring will take a while.
start = time.time()
df['afinn_score'] = df['review_clean'].apply(afinn.score)
end = time.time()

print('Processed {} reviews in {:.0f} sec.'.format(df.shape[0], end-start))

In [None]:
df['afinn_score'].describe() # overall rather positive

We can treat the sentiment scores as class predictions. Applying a classification cut-off of zero, we posit that every review with a positive score is classified as positive, and negative otherwise. We can then examine the predictive accuracy of the dictionary-based classifier using standard performance measures from the field of binary classification. 

In [None]:
#* Calculate accuracy of the afinn classifier using a cut-off of zero
df['yhat_afinn'] = np.where(df['afinn_score']>0, 1, 0)

In [None]:
df.head()

In [None]:
score_dict=accuracy_score(df['sentiment'], df['yhat_afinn'])
print("Accuracy: {:.4f}".format(score_dict))
confusion_matrix(df['sentiment'], df['yhat_afinn'])

Seems that our dictionary-based classifier is biased toward the positive reviews. Note that your result might differ depending on which data you are using (all reviews, random sample). 
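
How can we read such a bias off the confusion matrix? The sketch below uses made-up labels and scores (not the IMDB results) and only plain Python. It builds a 2x2 matrix in the layout that scikit-learn uses (rows are true classes, columns are predicted classes) and derives accuracy and the per-class recall from it.

```python
# Made-up labels and dictionary scores for eight reviews (illustration only)
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
scores = [5, 2, -1, 3, 1, -4, 2, 7]

# Cut-off of zero: strictly positive scores are classified as positive
y_pred = [1 if s > 0 else 0 for s in scores]

# cmat[i][j] counts reviews with true class i and predicted class j
cmat = [[0, 0], [0, 0]]
for t, p in zip(y_true, y_pred):
    cmat[t][p] += 1

accuracy = (cmat[0][0] + cmat[1][1]) / len(y_true)  # trace divided by total
recall_neg = cmat[0][0] / sum(cmat[0])  # share of negative reviews detected
recall_pos = cmat[1][1] / sum(cmat[1])  # share of positive reviews detected

print(cmat)
print(accuracy, recall_neg, recall_pos)
```

In this toy example, most negative reviews end up in the upper-right cell (predicted positive), so the recall for the negative class is much lower than for the positive class. A similar pattern in the real confusion matrix would indicate a bias toward positive predictions.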

## 3. Linear classifier with vectorized inputs
Before building complex deep-learning based sentiment classifiers, we can estimate a simple logit model and use it as a benchmark. Now that we start estimating models, we also need to partition our data to estimate model performance on a hold-out test set. 

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(df['review_clean'], df['sentiment'], test_size=0.25, random_state=111)

We will use count vectorization for our baseline model: take the words of each sentence and create a vocabulary of all the unique words in the sentences. This vocabulary can then be used to create a feature vector of word counts.
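
The idea can be sketched in a few lines of plain Python. The toy corpus and the helper `count_vector` are illustrative assumptions; the sketch splits on whitespace only, whereas scikit-learn's `CountVectorizer` applies its own tokenization rules and returns a sparse matrix.

```python
from collections import Counter

docs = ['great movie great cast', 'boring movie']  # toy corpus

# Vocabulary: every unique word mapped to a column index (alphabetical order)
unique_words = sorted({word for doc in docs for word in doc.split()})
vocab = {word: idx for idx, word in enumerate(unique_words)}

def count_vector(doc):
    """Return the word counts of doc, one entry per vocabulary word."""
    counts = Counter(doc.split())
    return [counts.get(word, 0) for word in vocab]

print(vocab)
print([count_vector(doc) for doc in docs])
```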

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

In [None]:
#* Transform the review text using a CountVectorizer
vectorizer = CountVectorizer()
vectorizer.fit(X_train)

X_train_countvec = vectorizer.transform(X_train)  
X_test_countvec  = vectorizer.transform(X_test)  

In [None]:
X_train_countvec.shape

In [None]:
print(X_train_countvec[0])

In [None]:
# Verify the above data with the original review
case = X_train.iloc[0]  
case

In [None]:
# Look-up the index of the first word
print('Index of the first word is {}'.format(vectorizer.vocabulary_['dr']))  # enter first word here

Now that we know the index of the word in the vocabulary, we can check the 'feature value' in the training data. This will be some number, which is supposed to quantify how often the first word occurred in the review. Then, examining the review text, we should be able to verify correctness of the count. 

In [None]:
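# Check the word count of that feature (we look up the index instead of hard-coding it,
# because the index depends on the data sample you are working with)
X_train_countvec[0, vectorizer.vocabulary_['dr']]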

You can also select some index from the above print of `X_train_countvec` and query the corresponding word. That is just another way of checking that the feature value correctly captures how many times a word appeared in the review. 

In [None]:
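# Note: get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
vectorizer.get_feature_names_out()[148]  # You can take any feature index from the above print-out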

Having convinced ourselves that the data is sound, we proceed by estimating a linear classifier. Given the high dimensionality of the data set, which is characteristic of count-based text representations, we choose LASSO, i.e., a logistic regression with an L1 penalty. We set the `solver` argument of the linear model to `liblinear`, an efficient library for regularized linear models that scikit-learn interfaces.

In [None]:
# Estimate LASSO model
from sklearn.linear_model import LogisticRegression

classifier = LogisticRegression(solver="liblinear", penalty='l1')
classifier.fit(X_train_countvec, y_train)
score_lr = classifier.score(X_test_countvec, y_test)

# Calculate classification accuracy
print("Accuracy: {:.4f}".format(score_lr))
confusion_matrix(y_test, classifier.predict(X_test_countvec))  # Note that we do not tune the classification cut-off

The above model is not the best conceivable benchmark. While LASSO is suitable for high-dimensional (text) data, using count vectorization for sentiment analysis is questionable. TFxIDF weights might be a little better but suffer from the same 'flaw' of considering all the words in the vocabulary as features and counting on the LASSO penalty to filter out the irrelevant words. To make a serious attempt to improve the model, we would need to go back to the data and do some more cleaning (i.e., removing rare and non-sentiment words).
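
For readers who have not met TFxIDF weights before, here is a minimal sketch of the textbook variant, in which the term frequency is multiplied by the logarithm of the inverse document frequency. The toy corpus and the function `tfidf` are assumptions for illustration; scikit-learn's `TfidfVectorizer` uses a smoothed formula and normalizes the vectors, so its numbers would differ.

```python
import math
from collections import Counter

docs = [['great', 'movie'], ['boring', 'movie'], ['great', 'great', 'cast']]
n_docs = len(docs)

# Document frequency: in how many documents does each word occur?
doc_freq = Counter(word for doc in docs for word in set(doc))

def tfidf(doc):
    """Textbook TFxIDF: term count times log(N / document frequency)."""
    term_freq = Counter(doc)
    return {word: term_freq[word] * math.log(n_docs / doc_freq[word])
            for word in term_freq}

for doc in docs:
    print(tfidf(doc))
```

Words that appear in few documents ('cast', 'boring') receive a larger weight than words that appear in many documents ('movie'). Still, every vocabulary word remains a feature, which is the limitation discussed above.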

## 4. Deep learning for text classification
If we can use LASSO, we can also use a neural network for sentiment prediction. Previous tutorials have already introduced us to different types of models. Time to test them on our review data. However, prior to building models in Keras, we need to do a bit of housekeeping. Our data is not yet in the right form. Keras expects a sequence of integers, which represent (sparse) one-hot-encoded words. So, we have to build our vocabulary and need to make sure that we can move seamlessly from words to integers and vice versa. Last time, we developed corresponding dictionaries from scratch; remember our *word2id* and *id2word* dictionaries from the [W2V tutorial](https://github.com/Humboldt-WI/adams/blob/master/exercises/Ex10_word2vec/Ex10_word2vec.ipynb). To introduce yet more options, we will use Keras functionality today.

In [None]:
#* Build vocabulary using Keras
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

NUM_WORDS = 2500  # We could use all words as in the LASSO example but this would increase training times substantially

# Create tokenizer object and build vocab from the training set
tokenizer_obj = Tokenizer(num_words=NUM_WORDS, oov_token='<OOV>')  # We fit the tokenizer to the training set reviews. The test set might include
tokenizer_obj.fit_on_texts(X_train)  # words that are not part of the training data. With oov_token set, Keras maps such words to index 1, which it reserves for the OOV token

In [None]:
# Convert training set reviews to sequences of integer values
X_tr_int = tokenizer_obj.texts_to_sequences(X_train)

In [None]:
X_tr_int[0][:10]

After *fitting* the tokenizer, we have access to its internal vocabulary, which was built up as part of the fitting. For example, we can convert the integer-encoded text back to words as follows:

In [None]:
demo = [tokenizer_obj.index_word[token] for token in X_tr_int[0][:10]]
demo

And again back to integers... 

In [None]:
[tokenizer_obj.word_index[token] for token in demo]
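
What happens behind the scenes? The following plain-Python sketch mimics the logic as we understand it: the out-of-vocabulary placeholder receives index 1, the remaining words are indexed by descending frequency, and every word whose index is not below `num_words` is replaced by the placeholder index. The corpus, the placeholder string, and the function `encode` are illustrative assumptions, not the Keras implementation.

```python
from collections import Counter

train_docs = ['good movie', 'good cast', 'bad movie good plot']  # toy corpus
OOV = '<OOV>'   # hypothetical placeholder token
num_words = 4   # keep only indices 1..3; index 0 is reserved for padding

# Index 1 goes to the OOV token, then words follow by descending frequency
counts = Counter(word for doc in train_docs for word in doc.split())
word_index = {OOV: 1}
for idx, (word, _) in enumerate(counts.most_common(), start=2):
    word_index[word] = idx

def encode(doc):
    """Map words to integers; unseen or too-rare words get the OOV index."""
    encoded = []
    for word in doc.split():
        idx = word_index.get(word, word_index[OOV])
        encoded.append(idx if idx < num_words else word_index[OOV])
    return encoded

print(word_index)
print(encode('good new movie cast'))
```

In the last line, 'new' never occurred in the training corpus and 'cast' is too rare to make it into the top `num_words` entries, so both are encoded as 1.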

The Keras layers that we will use later expect the input data to have a fixed, pre-defined shape. For example, you might remember the previous LSTM examples in which we had to make sure that our inputs are of the form *samples / timesteps / features*. At present, our reviews differ substantially in length. So, the next task on our to-do list is to pad the reviews and ensure a consistent sequence length.

In [None]:
#* Determine the maximum review length in the training set
max_review_length = max([len(review) for review in X_tr_int])
print('The longest review of the training set has {} words.'.format(max_review_length))

Standard practice in NLP is to embed words in a vector space. Considering an embedding dimension of, e.g., 100, each word in the input data (i.e., review) will be mapped to a 100-dimensional vector. Working with a large embedding dimension and long sequences will result in slow training. Since we care more about illustrating concepts than building the best possible sentiment classifier, we will set an upper bound on the text length and pad reviews accordingly. All reviews that are shorter than our upper bound will be padded with zeros. Longer reviews will be truncated. In practice, you would need to experiment carefully to find out whether and how much truncating the data hurts performance.

In [None]:
# Upper bound of the review length for padding
MAX_REVIEW_LENGTH = 400

X_tr_int_pad = pad_sequences(X_tr_int, MAX_REVIEW_LENGTH)
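
To see what padding and truncation do to a single sequence, consider the following sketch. The helper `pad` is a hypothetical plain-Python stand-in that mimics the default behavior of `pad_sequences` as we understand it (zeros are added at the front, and sequences that are too long lose their first elements); it is not the Keras function itself.

```python
def pad(seq, maxlen, value=0):
    """Left-pad seq with value, or keep only its last maxlen items."""
    seq = list(seq)[-maxlen:]  # truncate from the front if too long
    return [value] * (maxlen - len(seq)) + seq

print(pad([5, 8, 2], 5))              # short sequence: zeros are added in front
print(pad([1, 2, 3, 4, 5, 6, 7], 5))  # long sequence: the first items are dropped
```

With these defaults, the end of a long review survives while its beginning is cut off. Whether that is desirable depends on where reviewers tend to state their verdict, so it is worth checking on your data.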

So far, we dealt only with the training data. So it is about time to also process the test data.

In [None]:
# Encode and pad the test data
X_ts_int = tokenizer_obj.texts_to_sequences(X_test)  # Due to oov_token argument, new words will be mapped to 1
X_ts_int_pad = pad_sequences(X_ts_int, MAX_REVIEW_LENGTH)

In [None]:
# Structure of the prepared training and test data
X_tr_int_pad.shape, y_train.shape, X_ts_int_pad.shape, y_test.shape

Time to build some neural networks using Keras.

In [None]:
from keras.models import Sequential
from keras.layers import Dense, Embedding, GRU, Dropout  # Embedding is already included here, so no separate import is needed
from keras.initializers import Constant

In [None]:
# Some variables to centralize the configuration of deep learning models
NB_HIDDEN = 16
EPOCH = 5
BATCH_SIZE = 64 #128
EMBEDDING_DIM = 50
VAL_SPLIT = 0.25  # fraction of the training set used for validation

### Model 1: Basic GRU 
We begin with a basic GRU. We chose GRU over LSTM because training the former is faster. More importantly, the input to our GRU are the word embeddings, which we obtain from the Keras `Embedding layer`. In our first model, we initialize the embeddings randomly and train them together with the other network parameters (i.e., in the GRU layer). This architecture is a fairly basic approach toward text classification. We advance the model as we go along.  

In [None]:
# Embedding layer
embedding_layer=Embedding(input_dim=NUM_WORDS, 
                          output_dim=EMBEDDING_DIM, 
                          input_length=MAX_REVIEW_LENGTH
                         )
# GRU text classifier
model1=Sequential()                        
model1.add(embedding_layer)
model1.add(GRU(NB_HIDDEN))
model1.add(Dense(1, activation="sigmoid"))
model1.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
model1.summary()
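
It can be instructive to verify the parameter counts reported by `model1.summary()` by hand. The sketch below recomputes them from the configuration values of this tutorial. The GRU formula assumes the Keras variant with `reset_after=True`, which keeps two bias vectors per weight block; other versions or settings may report slightly different numbers, so treat the result as a plausibility check rather than a guarantee.

```python
# Values copied from the configuration cells of this tutorial
NUM_WORDS, EMBEDDING_DIM, NB_HIDDEN = 2500, 50, 16

# Embedding layer: one EMBEDDING_DIM-sized vector per vocabulary index
n_embedding = NUM_WORDS * EMBEDDING_DIM

def gru_params(input_dim, units, reset_after=True):
    """Three weight blocks (update gate, reset gate, candidate state),
    each with input weights, recurrent weights and bias terms."""
    n_bias = 2 * units if reset_after else units  # assumption, see text above
    return 3 * (input_dim * units + units * units + n_bias)

n_gru = gru_params(EMBEDDING_DIM, NB_HIDDEN)
n_dense = NB_HIDDEN + 1  # one weight per hidden unit plus a bias

print('Embedding:', n_embedding)
print('GRU:', n_gru)
print('Dense:', n_dense)
print('Total:', n_embedding + n_gru + n_dense)
```

The embedding matrix clearly dominates the parameter count, which helps explain why freezing or pre-training it, as we do in the later models, is an attractive option.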

In [None]:
model1_story = model1.fit(X_tr_int_pad, y_train, batch_size=BATCH_SIZE, epochs=EPOCH, validation_split=VAL_SPLIT)

#### A little bit of infrastructure
The GRU was just the first model in a chain of models of increasing sophistication. Sounds promising, doesn't it?
Since we are about to train more and more networks, we should develop a little bit of infrastructure to work with them. Specifically, for each network, we need to produce test set predictions. Also, we would like to examine the development of the loss during training; e.g., to judge whether increasing the number of epochs would make sense. Last, it would be useful to save trained models to disk. After all, we spend quite some time on training them, so making a backup in case something goes wrong with our notebook makes a lot of sense. Let's develop some helper functions for these tasks.

##### Helper function for model evaluation

In [None]:
SCORE_BAG = {}  # Dictionary to store the results of different Keras models

In [None]:
def diag_model(model, story, x_ts, y_ts, cut_off=0.5):
    ''' 
        Diagnose fitted keras models by plotting results from the
        story (e.g., development of training loss) and calculating
        classification accuracy on the test set
    '''
    score = model.evaluate(x_ts, y_ts, verbose=0)
    # Confusion matrix (use the model passed as argument, not the global model1)
    cmat = confusion_matrix(y_ts, (model.predict(x_ts) > cut_off).astype(int).ravel())
    print('Test loss:', score[0])
    print('Test accuracy: {:.4f}'.format(np.trace(cmat)/np.sum(cmat)))
    print('Confusion matrix:')
    print(cmat)
    
    plt.plot(story.history['accuracy'])
    plt.plot(story.history['val_accuracy'])
    plt.title('model accuracy')
    plt.ylabel('accuracy')
    plt.xlabel('epoch')
    plt.legend(['train', 'validation'], loc='upper left')
    plt.show()
    return score
 

Let's demonstrate our helper function in action by inspecting our first Keras model.

In [None]:
SCORE_BAG.update({'M1' : diag_model(model1, model1_story, X_ts_int_pad, y_test)})

#### Save trained network to disk
Saving a model is easy enough and does not warrant a helper function. Using *Pickle*, we store models as follows:


In [None]:
to_disk = (model1, model1_story)
with open(DATA_DIR + 'model1.pkl','wb') as file_name:
    pickle.dump(to_disk, file_name)

In [None]:
# Load from disk if needed 
with open(DATA_DIR + 'model1.pkl','rb') as file_name:
    model1, model1_story = pickle.load(file_name)

### Model 2: GRU with pre-trained IMDB embeddings

Model #1 used word embeddings but trained these as part of learning the classifier. You can imagine that corresponding embeddings are different from those resulting from a model that is specifically designed to learn embeddings such as Word-to-Vec. Weights in Model #1 including the weights in the embedding matrix were trained to predict review sentiment. Word-to-Vec, on the other hand, solves a different prediction task related to the co-occurrences of words in a pre-defined context window. We have trained corresponding weights in the [previous tutorial](https://github.com/Humboldt-WI/adams/blob/master/exercises/Ex10_word2vec/Ex10_word2vec.ipynb). It is about time to put these embeddings into action. Our second model will be similar to the first one but use the pre-trained embeddings from [Tutorial 10](https://github.com/Humboldt-WI/adams/blob/master/exercises/Ex10_word2vec/Ex10_word2vec.ipynb).

#### Setting up the environment
The following demo counts on you having available a stored version of 'good' embeddings from [Tutorial 10](https://github.com/Humboldt-WI/adams/blob/master/exercises/Ex10_word2vec/Ex10_word2vec.ipynb). If needed, you can find such embeddings in our Moodle folder.

In [None]:
# Load pretrained W2V embeddings obtained from the IMDB review data set
imdb_index = KeyedVectors.load_word2vec_format(DATA_DIR + 'w2v_imdb_dim50_embeddings.model', binary=False)
print('Loaded pre-trained embeddings for {} words.'.format(len(imdb_index.vocab)))

Pre-trained embeddings are essentially just a bunch of numbers for individual words. Needless to say, the numbers carry meaning, capturing syntactic and semantic relationships between words, etc. However, what we just loaded is a dictionary-like data structure in which words serve as a key and the value is the pre-trained embedding of that word. Let's illustrate this using the word *movie* as an example.  

In [None]:
e = imdb_index['movie']
print(e.shape)
e[:10]

A few words on embeddings...

At present, we use embeddings that were obtained from the same corpus, namely the IMDB movie review data set, as the one we are working with right now. That is not common. Typically, the pre-training is done on some other, much larger corpus. Remember that the very purpose of using pre-trained embeddings is the hope that they embody some information about word relationships that also proves valuable for our task. As a rule of thumb, the larger the pre-training corpus, the better.

Working with two different corpora, that used for pre-training embeddings and that used in the target task, poses some challenges. First, the pre-trained embeddings will include word vectors for words that do not appear in our corpus. That is less of a problem. 

Second, and more importantly, our corpus will include some words for which we lack an embedding. Addressing this issue in a satisfactory manner is out of the scope of this tutorial. Pre-training an embedding for unknown words might be a way forward. We will apply a rough fix and map out-of-vocabulary words to an embedding vector of zeros. 

Third, **and most importantly**, our matrix of pre-trained embeddings will function like a lookup table. Remember that the Keras embedding layer will not compute a dot product between a one-hot encoded input word and the embedding matrix because this would be inefficient. Instead, Keras expects to find the embedding of a word with index i in the i'th row of the embedding matrix. Therefore, it is critical that the word indices match between our pre-trained embeddings and our model's embedding matrix.
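
The equivalence between the two views is easy to check. The following plain-Python sketch uses a tiny, made-up embedding matrix and shows that multiplying a one-hot vector with the matrix returns exactly the row that a direct lookup would return. The names `toy_embeddings`, `one_hot`, and `lookup_by_matmul` are illustrative assumptions.

```python
# Toy embedding matrix: 4 vocabulary indices, embedding dimension 2
toy_embeddings = [[0.0, 0.0],   # index 0: padding
                  [0.1, 0.2],
                  [0.3, 0.4],
                  [0.5, 0.6]]

def one_hot(index, size):
    """Vector of zeros with a single one at position index."""
    return [1 if pos == index else 0 for pos in range(size)]

def lookup_by_matmul(index):
    """Dot product of a one-hot row vector with the embedding matrix."""
    vec = one_hot(index, len(toy_embeddings))
    dim = len(toy_embeddings[0])
    return [sum(vec[row] * toy_embeddings[row][col] for row in range(len(vec)))
            for col in range(dim)]

print(lookup_by_matmul(2))  # via matrix multiplication
print(toy_embeddings[2])    # via direct row lookup
```

Both calls return the same vector, but the lookup touches one row only, whereas the multiplication touches every entry of the matrix. This is also why a misaligned matrix goes unnoticed: the layer would silently return the embedding of a different word.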

Since we will work with different pre-trained embeddings in what follows, we implement a little helper function to create an embedding matrix for our corpus. We will then use that embedding matrix when creating our next Keras model.

In [None]:
def get_embedding_matrix(tokenizer, pretrain, vocab_size):
    '''
        Helper function to construct an embedding matrix for 
        the focal corpus based on some pre-trained embeddings.
    '''
    
    dim = 0
    if isinstance(pretrain, KeyedVectors) or isinstance(pretrain, Word2VecKeyedVectors):
        dim = pretrain.vector_size        
    elif isinstance(pretrain, dict):
        dim = next(iter(pretrain.values())).shape[0]  # get embedding of an arbitrary word
    else:
        raise Exception('{} is not supported'.format(type(pretrain)))
    
    
    # Initialize embedding matrix
    emb_mat = np.zeros((vocab_size, dim))

    # There will be some words in our corpus for which we lack a pre-trained embedding.
    # In this tutorial, we will simply use a vector of zeros for such words. We also keep
    # track of the words to do some debugging if needed
    oov_words = []
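    # Below we use the tokenizer object that created our task vocabulary. This is crucial to ensure
    # that the position of a word in our embedding matrix corresponds to its index in our integer
    # encoded input data
    for word, i in tokenizer.word_index.items():
        # word_index holds all words seen during fitting, but the matrix only has rows for indices
        # below vocab_size. We skip the rest so that they are not miscounted as oov words.
        if i >= vocab_size:
            continue
        # try-except together with a zero-initialized embedding matrix achieves our rough fix for oov words
        try:
            emb_mat[i] = pretrain[word]
        except KeyError:
            oov_words.append(word)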
    print('Created embedding matrix of shape {}'.format(emb_mat.shape))
    print('Encountered {} out-of-vocabulary words.'.format(len(oov_words)))
    return (emb_mat, oov_words)

In [None]:
# Create embedding weight matrix
imdb_weights, _ = get_embedding_matrix(tokenizer_obj, imdb_index, NUM_WORDS)

In [None]:
imdb_weights[tokenizer_obj.word_index['film'],:]
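
A check like the one above can be generalized into a small alignment test. The sketch below rebuilds the idea with toy data in plain Python: it fills a zero-initialized matrix row by row and then verifies that, for every word with a pre-trained vector, the row at the word's index equals that vector. The dictionaries `toy_word_index` and `toy_pretrained` are made-up stand-ins for the tokenizer vocabulary and the pre-trained embeddings.

```python
# Made-up stand-ins for tokenizer.word_index and the pre-trained embeddings
toy_word_index = {'<OOV>': 1, 'movie': 2, 'film': 3, 'zzyzx': 4}
toy_pretrained = {'movie': [0.1, 0.2], 'film': [0.3, 0.4]}
vocab_size, dim = 5, 2  # row 0 is reserved for padding

# Zero-initialized matrix; rows of words without a vector simply stay zero
toy_matrix = [[0.0] * dim for _ in range(vocab_size)]
missing = []
for word, idx in toy_word_index.items():
    if idx >= vocab_size:
        continue  # no row available for this index
    if word in toy_pretrained:
        toy_matrix[idx] = list(toy_pretrained[word])
    else:
        missing.append(word)

# Alignment test: row idx must hold the vector of the word with index idx
for word, vector in toy_pretrained.items():
    assert toy_matrix[toy_word_index[word]] == vector

print(toy_matrix)
print('Words without pre-trained vector:', missing)
```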

#### Training the network
The code to build our GRU is actually very similar to the previous one. The only difference lies in the embedding layer, where we now use our pre-trained embeddings as weights. One other notable point concerns the argument `trainable`, which we set to false. With this setting, the weights in the embedding matrix will not change. This simplifies the training but might prevent our network from unfolding its full potential. Guess what our next classifier will be once we have trained this one ;)

In [None]:
# Embedding layer
embedding_layer=Embedding(input_dim=NUM_WORDS, 
                          output_dim=EMBEDDING_DIM, 
                          input_length=MAX_REVIEW_LENGTH,
                          embeddings_initializer=Constant(imdb_weights), #weights to start with, and not touch during training
                          trainable=False  # do not update the weights of the embedding matrix
                         )
# GRU text classifier
model2=Sequential()                        
model2.add(embedding_layer)
model2.add(GRU(NB_HIDDEN))
model2.add(Dense(1, activation="sigmoid"))
model2.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
model2.summary()

In [None]:
model2_story = model2.fit(X_tr_int_pad, y_train, batch_size=BATCH_SIZE, epochs=EPOCH, validation_split=VAL_SPLIT)

In [None]:
# Assess the model
SCORE_BAG.update( {'M2': diag_model(model2, model2_story, X_ts_int_pad, y_test) })

# And save it 
to_disk = (model2, model2_story)
with open(DATA_DIR + 'model2.pkl','wb') as file_name:
    pickle.dump(to_disk, file_name)

### Model 3: GRU with pre-trained IMDB embeddings and fine-tuning

Let's try to improve the previous model. Using pre-trained embeddings, maybe even from some other big corpus, makes sense. At the same time, accuracy might rise if we allow our model to adjust the embeddings to our target task. To test this, we will now build a text classifier where the embedding weights are trainable. In addition, we will demonstrate how to transfer the GRU weights of the previous model. Using the final, trained weights of Model #2 as a starting point for this model should help accuracy. Let's try it out.

In [None]:
#* Extract weights of the GRU layer of the previous model
GRUw = model2.layers[1].get_weights()

In [None]:
model3=Sequential()
embedding_layer=Embedding(NUM_WORDS, 
                         EMBEDDING_DIM,  
                         embeddings_initializer=Constant(imdb_weights),  # weights to start with; they are fine-tuned during training
                         input_length=MAX_REVIEW_LENGTH, 
                         trainable=True  # Note this difference to Model #2
                         )
model3.add(embedding_layer)
#model3.add(Dropout(0.2))
model3.add(GRU(NB_HIDDEN, weights=GRUw))  # optionally: dropout=0.1, recurrent_dropout=0.1 for the inputs and the recurrent state
model3.add(Dense(1, activation="sigmoid"))
model3.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
model3.summary()

In [None]:
# Train the model 
model3_story = model3.fit(X_tr_int_pad, y_train, batch_size=BATCH_SIZE, epochs=EPOCH, validation_split=VAL_SPLIT)

# Assess and store the model
SCORE_BAG.update( {'M3': diag_model(model3, model3_story, X_ts_int_pad, y_test) })

to_disk = (model3, model3_story)
with open(DATA_DIR + 'model3.pkl','wb') as file_name:
    pickle.dump(to_disk, file_name)

### Model 4: GRU with pre-trained GloVe embeddings
Building your own embeddings might not be the smartest thing. In case the vocabulary you will be working with is not too specific, you can always use pre-trained embeddings that were crafted from very big corpora. GloVe stands for "Global Vectors for Word Representation". It is a popular embedding technique based on factorizing a matrix of word co-occurrence statistics (https://nlp.stanford.edu/projects/glove/). We could equally stick to W2V and download one of the many available pre-trained versions. There is no specific reason for switching to GloVe other than trying out something new. Since the data files of pre-trained embeddings can be quite large, we stick to the smallest available version of GloVe, which has an embedding dimension of 50.


**Warning**
Even when using an embedding dimension of just 50, the data that we process in memory gets quite big. You might experience problems when running the code on your own computer (e.g., slow response times). Should you find that running the code on your machine is practically infeasible, a workaround is to not use GloVe embeddings but re-use the embeddings of previous models. Of course, you would not be able to compare the results properly, but you could at least run the code. To do this, you can make use of the *hack*

In [None]:
glove_weights = imdb_weights

and skip over the next code cells in which we load the GloVe embeddings and create our weight matrix.

In [None]:
# Load GloVe embeddings
# Make sure to adjust the path / name to find the file on your hard disk. You can download it from the above URL
glove_index = {}
with open(DATA_DIR + 'glove.6B.50d.txt', 'r', encoding="utf8") as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        glove_index[word] = coefs

print('Found %s word vectors.' % len(glove_index))
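
To make the file format concrete: each line of a GloVe text file holds a token followed by its vector components, separated by spaces. The following is a minimal sketch that parses two made-up lines from an in-memory string instead of the real file. The sample tokens and numbers are illustrative assumptions, not actual GloVe values.

```python
import io

# Two made-up lines in the GloVe text layout: token, then space-separated floats.
# The numbers are illustrative only, not real GloVe coefficients.
sample = io.StringIO("good 0.1 -0.2 0.3\nbad -0.4 0.5 -0.6\n")

toy_index = {}
for line in sample:
    values = line.split()
    # First field is the word, the remaining fields are its vector
    toy_index[values[0]] = [float(v) for v in values[1:]]

print(len(toy_index), "word vectors of dimension", len(toy_index["good"]))
```

The loop has the same structure as the cell above; only the data source and the container for the coefficients (a plain list instead of a NumPy array) differ.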

We can re-use our helper function to obtain a proper look-up table for Keras. And that is about it. Everything else is in place, so we can move straight on to building our next classifier. 

In [None]:
# Create matrix with GloVe embeddings
# Caution: this operation may take a long time and consume a lot of memory
glove_weights, _ = get_embedding_matrix(tokenizer_obj, glove_index, NUM_WORDS)
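
The helper `get_embedding_matrix` was defined earlier in the tutorial and is not repeated here. As a rough sketch of the idea only, the hypothetical function below (`build_lookup_table`, not the tutorial's helper) fills row *i* of a matrix with the pre-trained vector of the word that has integer index *i* and leaves words without a pre-trained vector at zero. The toy `toy_word_index`, the toy vectors, and the convention that index 0 is kept for padding are assumptions made for illustration.

```python
import numpy as np

def build_lookup_table(word_index, vectors, num_words, dim):
    """Hypothetical stand-in for the tutorial's helper, for illustration only."""
    weights = np.zeros((num_words, dim), dtype="float32")
    hits = 0
    for word, i in word_index.items():
        if i >= num_words:        # index lies outside the vocabulary cap
            continue
        vec = vectors.get(word)
        if vec is not None:       # words without a pre-trained vector keep zeros
            weights[i] = vec
            hits += 1
    return weights, hits

# Toy inputs (assumed): index 0 is left unused, as if reserved for padding
toy_word_index = {"good": 1, "bad": 2, "zyxw": 3}
toy_vectors = {"good": np.array([0.1, 0.2]), "bad": np.array([-0.1, -0.2])}

toy_weights, toy_hits = build_lookup_table(toy_word_index, toy_vectors, num_words=4, dim=2)
print(toy_weights.shape, toy_hits)
```

Counting the hits is a cheap sanity check: a low share of matched words suggests a mismatch between the tokenizer's vocabulary and the pre-trained vocabulary (for example, different casing).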

In [None]:
# Embedding layer with the pre-trained GloVe weights
embedding_layer = Embedding(NUM_WORDS, 
                         EMBEDDING_DIM,  
                         embeddings_initializer=Constant(glove_weights), 
                         input_length=MAX_REVIEW_LENGTH, 
                         trainable=False  # we start with frozen weights and relax this choice in model #5
                         )

model4=Sequential()
model4.add(embedding_layer)
#model4.add(Dropout(0.25))
model4.add(GRU(NB_HIDDEN))
#model4.add(Dropout(0.25))
model4.add(Dense(1, activation="sigmoid"))
model4.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
model4.summary()

In [None]:
# Train the model 
model4_story = model4.fit(X_tr_int_pad, y_train, batch_size=BATCH_SIZE, epochs=EPOCH, validation_split=VAL_SPLIT)

# Assess and store the model
SCORE_BAG.update( {'M4': diag_model(model4, model4_story, X_ts_int_pad, y_test) })

to_disk = (model4, model4_story)
with open(DATA_DIR + 'model4.pkl','wb') as file_name:
    pickle.dump(to_disk, file_name)

### Model 5: Extend Model #4 by fine-tuning the embeddings
Let's see if fine-tuning the GloVe weights helps. Model #5 is equivalent to Model #4 but updates the embeddings as part of the training. Similar to the previous example with the IMDB embeddings (i.e., Model #2 vs. Model #3), we re-use the weights of the GRU layer to start the training process.

In [None]:
# Extract the GRU weights from the previous model
GRUglo = model4.layers[1].get_weights()

# Set up embedding layer
embedding_layer = Embedding(NUM_WORDS, 
                         EMBEDDING_DIM,  
                         embeddings_initializer=Constant(glove_weights), 
                         input_length=MAX_REVIEW_LENGTH, 
                         trainable=True  # main difference to the previous model
                         )
model5=Sequential()
model5.add(embedding_layer)
model5.add(GRU(NB_HIDDEN, weights=GRUglo))#, dropout=0.1, recurrent_dropout=0.1 ))
#model5.add(Dropout(0.25))
model5.add(Dense(1, activation="sigmoid"))
model5.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
model5.summary()

In [None]:
# Train the model 
model5_story = model5.fit(X_tr_int_pad, y_train, batch_size=BATCH_SIZE, epochs=EPOCH, validation_split=VAL_SPLIT)

# Assess and store the model
SCORE_BAG.update( {'M5': diag_model(model5, model5_story, X_ts_int_pad, y_test) })

to_disk = (model5, model5_story)
with open(DATA_DIR + 'model5.pkl','wb') as file_name:
    pickle.dump(to_disk, file_name)

### Model 6: Bidirectional GRU
Ok, we have trained a lot of different models, and to be fair, in terms of programming and how we use Keras, the code has become quite repetitive. Let's build one last model and conclude. Our last model will be a **bidirectional GRU**. In sentiment analysis, the full text is available when making a prediction. Therefore, bidirectional NLP models are feasible. Given their conceptual advantage of having access to both left and right context, we would expect them to perform a little better. Let's see whether this holds true for our data.  

In [None]:
embedding_layer = Embedding(NUM_WORDS, 
                         EMBEDDING_DIM,  
                         embeddings_initializer=Constant(glove_weights),  # weights to start with; not touched during training
                         input_length=MAX_REVIEW_LENGTH, 
                         trainable=False  
                         )

A bidirectional layer is actually two layers with the same structure. Both layers take the input step-by-step, one from beginning to end and one from end-to-beginning. The two hidden states at step $t$ are typically merged by concatenating or summing them. 
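
The merging step can be illustrated with a few lines of NumPy. This is a minimal sketch, not what Keras does internally: it uses a plain tanh RNN instead of a GRU to stay short, and all sizes and random weights are toy assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
T, D, H = 5, 3, 4                  # time steps, input dim, hidden units (toy sizes)
x = rng.normal(size=(T, D))        # one toy input sequence

def run_rnn(seq, W, U):
    """Plain tanh RNN; returns the hidden state at every step."""
    h = np.zeros(U.shape[0])
    states = []
    for x_t in seq:
        h = np.tanh(x_t @ W + h @ U)
        states.append(h)
    return np.stack(states)

# Each direction has its own set of parameters
W_f, U_f = rng.normal(size=(D, H)), rng.normal(size=(H, H))
W_b, U_b = rng.normal(size=(D, H)), rng.normal(size=(H, H))

h_fwd = run_rnn(x, W_f, U_f)               # reads the sequence from start to end
h_bwd = run_rnn(x[::-1], W_b, U_b)[::-1]   # reads end to start, then re-aligned to step t

merged_concat = np.concatenate([h_fwd, h_bwd], axis=-1)   # shape (T, 2 * H)
merged_sum = h_fwd + h_bwd                                 # shape (T, H)
print(merged_concat.shape, merged_sum.shape)
```

Concatenation doubles the size of the representation that the next layer sees, whereas summing keeps it at the number of hidden units.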

In [None]:
from keras.layers import Bidirectional

model6 = Sequential()
model6.add(embedding_layer)  # frozen GloVe embeddings (trainable=False)
model6.add(Bidirectional(GRU(NB_HIDDEN), merge_mode="concat"))
#model6.add(Dropout(0.25))
model6.add(Dense(units=1, activation='sigmoid'))
model6.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])

model6.summary()

# Train the model 
model6_story = model6.fit(X_tr_int_pad, y_train, batch_size=BATCH_SIZE, epochs=EPOCH, validation_split=VAL_SPLIT)

# Assess and store the model
SCORE_BAG.update( {'M6': diag_model(model6, model6_story, X_ts_int_pad, y_test) })

to_disk = (model6, model6_story)
with open(DATA_DIR + 'model6.pkl','wb') as file_name:
    pickle.dump(to_disk, file_name)

## Conclusions
The tutorial has covered several important concepts in deep learning for NLP and text classification. We developed several deep learning-based text classifiers using Keras and advanced our understanding of word embeddings. We also saw examples of how to use pre-trained word embeddings in downstream tasks, such as sentiment analysis. 

Although the focus was more on concepts and code examples than on building top-notch sentiment classifiers, let's conclude the tutorial with a brief analysis of how the different approaches compare to each other. Note that the best results for the IMDB data set range from 91 to 94% accuracy.

In [None]:
# Add logit model benchmark results to the score dictionary
SCORE_BAG.update( {'Logit': [0, score_lr] })

# Put everything in a data frame
compare = pd.DataFrame(SCORE_BAG, index=['loss', 'accuracy'])

In [None]:
# If needed, you can run this code (after some adjustment) to load saved models from disk 
# and do the evaluation ex-post
score_dic = {}
for i in range(1,7):
  file = DATA_DIR + 'model' + str(i) + '.pkl'
  print(file)
  # Load from disk if needed 
  with open(file,'rb') as f:
    model, story = pickle.load(f)
    key = 'M' + str(i)
    score_dic.update( {key : diag_model(model, story, X_ts_int_pad, y_test) })

# Add logit model benchmark results to the score dictionary
score_dic.update( {'Logit': [0, score_lr] })

# Put everything in a data frame
compare = pd.DataFrame(score_dic, index=['loss', 'accuracy'])


In [None]:
# Plot the results 
x = np.arange(len(compare.columns.values))  # the label locations
width = 0.35  # the width of the bars

fig, ax = plt.subplots()
rects1 = ax.bar(x - width/2, compare.loc['loss'], width, label='Loss')
rects2 = ax.bar(x + width/2, compare.loc['accuracy'], width, label='Accuracy')

# Add some text for labels, title and custom x-axis tick labels, etc.
ax.set_ylabel('Scores')
ax.set_title('Model comparison')
ax.set_xticks(x)
ax.set_xticklabels(compare.columns.values)
ax.legend()

fig.tight_layout()

plt.show()