# Lab 9 

In this lab, we will see how to develop and train a Convolutional Neural Network (CNN) using the Python library Keras. We will use the movie reviews provided on Moodle (training.txt), which we also used last week.

When implementing a Convolutional Neural Network:
1. First, we need to split the data into training and testing sets and then extract features (or vice versa).
2. Then we need to define the architecture of the convolutional neural network.
3. Then we need to train the model, and finally,
4. Evaluate it (more on this in Unit 10).

### First, you will need to load the dataset. In this case, the dataset is provided on Moodle and contains movie reviews. In this example, we will use pandas to load the data (there are other ways to load a .txt file, as we saw last week).

In [None]:
import pandas as pd

In [None]:
# The data is loaded into a DataFrame with two columns: label and review
df = pd.read_csv('training.txt', names=['label', 'review'], sep='\t')

In [None]:
# We can see what the first example looks like below
print(df.iloc[0])

In [None]:
# We store the reviews and labels in two arrays as follows:
reviews = df['review'].values
labels = df['label'].values

### Then, we need to split the data into training and testing sets. Last week, we saw how to do this manually (i.e. writing our own code). Sklearn offers a function called train_test_split that you can also use as follows: 

In [None]:
from sklearn.model_selection import train_test_split

reviews_train, reviews_test, y_train, y_test = train_test_split(reviews, labels, test_size=0.20)
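
For comparison, a manual split along the lines of last week's lab could look roughly like the sketch below. It is plain Python, not sklearn's implementation; the function name `manual_split` and the toy data are assumptions made for illustration. Note that `train_test_split` also accepts a `random_state` argument if you want the same split on every run.

```python
import random

def manual_split(items, labels, test_size=0.20, seed=0):
    """Shuffle the row indices, then hold out a `test_size` fraction for testing."""
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)  # fixed seed so the split is repeatable
    n_test = round(len(items) * test_size)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return ([items[i] for i in train_idx], [items[i] for i in test_idx],
            [labels[i] for i in train_idx], [labels[i] for i in test_idx])

# Toy data (made up): ten "reviews" with alternating 0/1 labels
toy_reviews = [f"review {i}" for i in range(10)]
toy_labels = [i % 2 for i in range(10)]

r_train, r_test, l_train, l_test = manual_split(toy_reviews, toy_labels)
print(len(r_train), len(r_test))
```

Shuffling the indices rather than the data keeps each review paired with its label.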

# Convolutional Neural Nets

The Keras library allows us to quickly define a CNN architecture, train and evaluate a CNN model with minimal effort. But firstly, we will need to introduce a new concept: __Embeddings__

1. Word Embedding 

Word embeddings are approaches for representing words and documents using vectors. Word embeddings offer an improvement over the traditional bag-of-words encoding paradigms, where large sparse vectors are used to represent each word or text. This way of representing words and texts introduces the sparsity problem: the vocabulary can be vast, so a given word or text is represented by a large vector made up mostly of zero values.

Instead, in an embedding, words are represented by __dense vectors__ (smaller vectors with minimal zero values) where a vector represents the projection of the word into a continuous vector space.
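
The difference between the two representations can be illustrated with a small sketch. The eight-word vocabulary and the three-dimensional dense vectors below are made up for illustration (real embeddings are learnt from data); the point is only that one-hot vectors of different words never overlap, whereas dense vectors can place related words close together.

```python
import math

vocabulary = ["a", "bad", "film", "good", "great", "movie", "plot", "terrible"]
word_to_index = {word: i for i, word in enumerate(vocabulary)}

def one_hot(word):
    """Sparse representation: all zeros except a single 1 at the word's index."""
    vector = [0] * len(vocabulary)
    vector[word_to_index[word]] = 1
    return vector

# Hypothetical dense vectors; the values are invented, not learnt
dense = {
    "good": [0.8, 0.1, -0.3],
    "great": [0.9, 0.2, -0.2],
    "terrible": [-0.7, 0.0, 0.4],
}

def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

print(one_hot("good"))                               # mostly zeros
print(cosine(one_hot("good"), one_hot("great")))     # different one-hot vectors share nothing
print(cosine(dense["good"], dense["great"]))         # close together in the dense space
print(cosine(dense["good"], dense["terrible"]))      # pointing in opposite directions
```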

The position of a word within the vector space is learnt from text and is based on the surrounding words of a word when it is used. The __position of a word__ in the learnt vector space is known as its __embedding__.

There are a few popular ways of learning word embeddings from text, such as Word2Vec, Doc2Vec, GloVe and ELMo.

In addition to these popular methods, word embeddings can be learned as part of a deep learning model as we will see below. 

2. Keras Embedding Layer

Keras offers an Embedding layer that can be used for training neural networks using text data. Keras' Embedding layer takes as input integer encoded data, so that each word is represented by a unique integer. We can use tokenisation to pre-process the data. 

Then, the weights of the Embedding layer are randomly initialized and it is trained to learn an embedding for all of the words in the training dataset.

The Embedding layer is defined as the first hidden layer of a network. It must specify 3 arguments:

1. __input_dim__: the size of the vocabulary in the text data. For example, if your data is integer encoded to values between 0-199, then the input_dim will be 200.
2. __output_dim__: the size of the vector space in which words will be embedded. It is a user-specified value, so test different values for your task.
3. __input_length__: This is the length of input sequences, as you would define for any input layer of a Keras model. For example, if all of your input documents are comprised of 1000 words, this would be 1000.

For example, below we define an Embedding layer with a vocabulary of 1922 (e.g. integer encoded words from 0 to 1921, inclusive), a vector space of 50 dimensions in which words will be embedded, and input documents that have 50 words each.

__Embedding(1922, 50, input_length=50)__
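
Conceptually, an embedding layer behaves like a lookup table with `input_dim` rows and `output_dim` columns: each integer in the input selects one row. The sketch below imitates this in plain Python rather than calling Keras; the small sizes (200, 8 and 5) and the random initial values are assumptions chosen to keep the example readable.

```python
import random

input_dim, output_dim, input_length = 200, 8, 5  # toy sizes, not the lab's values

rng = random.Random(0)
# One row of `output_dim` weights per integer in the vocabulary, randomly initialised
embedding_table = [[rng.uniform(-0.05, 0.05) for _ in range(output_dim)]
                   for _ in range(input_dim)]

def embed(sequence):
    """Replace every integer in the sequence with its row of the table."""
    return [embedding_table[token_id] for token_id in sequence]

document = [rng.randrange(input_dim) for _ in range(input_length)]  # made-up encoded document
embedded = embed(document)

n_weights = input_dim * output_dim
print(len(embedded), len(embedded[0]))  # input_length rows, output_dim values per row
print(n_weights)                        # number of trainable weights in the table
```

During training, backpropagation would adjust the rows of this table; here they simply stay at their random starting values.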

In [None]:
from keras.preprocessing.text import Tokenizer

In [None]:
#define the tokenizer: https://keras.io/preprocessing/text/ 
tokenizer = Tokenizer(num_words=2000)

# Fit the tokenizer on the training data only, so that nothing is learnt from the test set!
tokenizer.fit_on_texts(reviews_train)

X_train = tokenizer.texts_to_sequences(reviews_train)
X_test = tokenizer.texts_to_sequences(reviews_test)

vocab_size = len(tokenizer.word_index) + 1  # Adding 1 because of reserved 0 index

print(reviews_train[0])
print(X_train[0])
vocab_size
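
To make the integer encoding less of a black box, here is a simplified imitation of what the tokenizer does. It is a sketch, not the Keras implementation: it only lowercases and splits on whitespace (Keras also strips punctuation by default), and the function names and toy sentences are assumptions made for illustration.

```python
from collections import Counter

def fit_word_index(texts):
    """Rank words by frequency; the most frequent word gets index 1 (0 is reserved)."""
    counts = Counter(word for text in texts for word in text.lower().split())
    ranked = sorted(counts.items(), key=lambda item: -item[1])  # stable: ties keep first-seen order
    return {word: rank for rank, (word, _) in enumerate(ranked, start=1)}

def texts_to_sequences(texts, word_index, num_words=None):
    """Map each word to its index, skipping unseen words and words ranked num_words or later."""
    sequences = []
    for text in texts:
        sequence = []
        for word in text.lower().split():
            index = word_index.get(word)
            if index is None:
                continue  # word never seen while fitting
            if num_words is not None and index >= num_words:
                continue  # word too rare to be kept
            sequence.append(index)
        sequences.append(sequence)
    return sequences

train_texts = ["the movie was great", "the plot was bad", "the movie was long"]
word_index = fit_word_index(train_texts)

print(word_index)
print(texts_to_sequences(["the movie was awful"], word_index))
print(texts_to_sequences(["the movie was awful"], word_index, num_words=3))
```

In Keras, `tokenizer.word_index` lists every word seen during fitting regardless of `num_words`; the `num_words` limit is applied only when texts are converted to sequences.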

The above method produces sequences of varying length, because reviews contain different numbers of words. To counter this, we can use pad_sequences(), which pads each sequence with zeros. You will usually also want to set the maxlen parameter to specify how long the sequences should be. For more options of the __pad_sequences__ function, look here: https://keras.io/preprocessing/sequence/

In the following code, you can see how to pad sequences with Keras:

In [None]:
from keras.preprocessing.sequence import pad_sequences

maxlen = 50

X_train = pad_sequences(X_train, padding='post', maxlen=maxlen)
X_test = pad_sequences(X_test, padding='post', maxlen=maxlen)
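
The effect of padding can be sketched in plain Python. The helper `pad_post` below is a hypothetical stand-in, not the Keras function: it pads at the end (as `padding='post'` does) and, for sequences that are too long, keeps the last `maxlen` items, which mirrors Keras' default `truncating='pre'`.

```python
def pad_post(sequences, maxlen, value=0):
    """Pad short sequences at the end; cut long ones by dropping items from the front."""
    padded = []
    for sequence in sequences:
        kept = list(sequence)[-maxlen:]                      # keep at most the last `maxlen` items
        padded.append(kept + [value] * (maxlen - len(kept)))  # fill the remainder with `value`
    return padded

toy_sequences = [[1, 2, 3], [4, 5, 6, 7, 8, 9]]  # made-up integer-encoded reviews
toy_padded = pad_post(toy_sequences, maxlen=4)
print(toy_padded)
```

After padding, every row has the same length, which is what lets the data be stacked into a single array for the network.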

As we saw in the workbook/lecture, a CNN is just a sequence of different types of layers. Keras allows us to "build" this sequence of layers easily. First, we need to define the type of model, in this case Sequential, as follows:

__model = Sequential()__

Then we can add layers as we want, e.g. we can add Convolutional Layers (e.g. Conv1D), Pooling layers (e.g. MaxPooling1D), Fully Connected layers (e.g. Dense) and so on. For a full list of acceptable layers, please see the Keras documentation: https://keras.io/layers/about-keras-layers/ 

Then you simply add the layers of your choice as follows:

__model.add(layers.Dense(1, activation='sigmoid'))__

As you see from the example, you can choose the activation function you want as well. 
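
To see what a convolutional layer followed by global max pooling computes, the sketch below applies a single filter to a short sequence of word vectors in plain Python. It is an illustration under simplifying assumptions (one filter, no padding, made-up numbers); a layer such as Conv1D learns many such filters at once.

```python
def conv1d_single_filter(sequence, kernel, bias=0.0):
    """Slide the kernel over the sequence and apply ReLU at every position."""
    width = len(kernel)
    outputs = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        total = bias + sum(w * x
                           for w_vec, x_vec in zip(kernel, window)
                           for w, x in zip(w_vec, x_vec))
        outputs.append(max(0.0, total))  # ReLU: negative responses become 0
    return outputs

def global_max_pool(values):
    """Keep only the strongest response of the filter across all positions."""
    return max(values)

# Six word positions, each a 2-dimensional embedding (values invented)
sequence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
# One filter spanning three consecutive words
kernel = [[1.0, -1.0], [0.5, 0.5], [-1.0, 1.0]]

feature_map = conv1d_single_filter(sequence, kernel)
pooled = global_max_pool(feature_map)
print(feature_map)  # one value per window position: 6 - 3 + 1 = 4 values
print(pooled)
```

Because the pooling step keeps only the maximum, the filter signals whether its pattern occurs anywhere in the review, regardless of position.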


In [None]:
from keras.models import Sequential
from keras import layers

embedding_dim = 50

model = Sequential()
model.add(layers.Embedding(vocab_size, embedding_dim, input_length=maxlen)) #https://keras.io/layers/embeddings/ 
model.add(layers.Conv1D(128, 5, activation='relu')) #https://keras.io/layers/convolutional/
model.add(layers.GlobalMaxPooling1D())
model.add(layers.Dense(10, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))
model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])
model.summary()

# Spend some time familiarising yourself with the Keras documentation.
# It is impossible to remember all the options available, but you should be able to remember the basics.

The summary shows how many parameters the model has to train; in the run described here, the total is 129,529. The embedding layer contributes vocab_size times embedding_dim of them (1922 x 50 = 96,100 for a vocabulary of 1922). Your exact numbers will depend on the vocabulary found in your training split. The weights of the embedding layer are initialized randomly and are then adjusted through backpropagation during training. The model takes the words as input in the order in which they appear in each review. You can train it with the following:
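
The figures in the summary can be reproduced by hand from the layer sizes. The sketch below is plain arithmetic, not a Keras call; the vocabulary size of 1922 is an assumption taken from the earlier Embedding example, so substitute the `vocab_size` printed by your own run.

```python
def count_parameters(vocab_size, embedding_dim, filters, kernel_size, dense_units):
    """Parameter counts for Embedding -> Conv1D -> GlobalMaxPooling1D -> Dense -> Dense(1)."""
    embedding = vocab_size * embedding_dim               # one vector per vocabulary entry
    conv1d = (kernel_size * embedding_dim + 1) * filters  # weights plus one bias per filter
    pooling = 0                                          # pooling has nothing to learn
    dense_hidden = (filters + 1) * dense_units           # weights plus one bias per unit
    dense_output = dense_units + 1                       # single sigmoid unit plus its bias
    total = embedding + conv1d + pooling + dense_hidden + dense_output
    return {"embedding": embedding, "conv1d": conv1d, "dense_hidden": dense_hidden,
            "dense_output": dense_output, "total": total}

counts = count_parameters(vocab_size=1922, embedding_dim=50,
                          filters=128, kernel_size=5, dense_units=10)
for name, value in counts.items():
    print(f"{name}: {value:,}")
```

The embedding layer holds by far the largest share of the weights, which is typical for text models with a learnt embedding.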

In [None]:
training = model.fit(X_train, y_train, epochs=10, verbose=False, validation_data=(X_test, y_test), batch_size=10)
#details about the model: https://keras.io/models/model/ 

You can see the accuracy on the training and testing sets as follows:

In [None]:
loss, accuracy = model.evaluate(X_train, y_train, verbose=False)
print("Training Accuracy: {:.4f}".format(accuracy))
loss, accuracy = model.evaluate(X_test, y_test, verbose=False)
print("Testing Accuracy:  {:.4f}".format(accuracy))
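
The two numbers returned by the evaluation can be understood with a small hand computation. The sketch below is plain Python with made-up labels and predicted probabilities; it assumes the usual 0.5 threshold for turning a sigmoid output into a class, and the `eps` clipping is an assumption added to avoid taking the logarithm of zero.

```python
import math

def binary_accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of examples whose thresholded prediction matches the label."""
    predictions = [1 if p > threshold else 0 for p in y_prob]
    correct = sum(int(pred == true) for pred, true in zip(predictions, y_true))
    return correct / len(y_true)

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Average negative log-likelihood of the true labels."""
    total = 0.0
    for true, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # keep log() away from 0
        total += -(true * math.log(p) + (1 - true) * math.log(1 - p))
    return total / len(y_true)

y_true = [1, 0, 1, 1]          # made-up labels
y_prob = [0.9, 0.2, 0.4, 0.7]  # made-up sigmoid outputs

print(binary_accuracy(y_true, y_prob))
print(binary_crossentropy(y_true, y_prob))
```

Accuracy only counts whether each prediction lands on the right side of the threshold, while the loss also penalises predictions that are right but not confident.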

Finally, we can observe how fast the model learns by plotting the accuracy and loss recorded at each epoch. In our run, the model reached a good level of accuracy after only about three epochs; your curves may differ because the train/test split is random.

In [None]:
import matplotlib.pyplot as plt
plt.style.use('ggplot')

def plot_history(training):
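    # The accuracy key is 'accuracy' in recent Keras versions and 'acc' in older ones
    acc = training.history.get('accuracy', training.history.get('acc'))
    val_acc = training.history.get('val_accuracy', training.history.get('val_acc'))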
    loss = training.history['loss']
    val_loss = training.history['val_loss']
    x = range(1, len(acc) + 1)

    plt.figure(figsize=(14, 8))
    plt.subplot(1, 2, 1)  # left panel: accuracy
    plt.plot(x, acc, 'b', label='Training acc')
    plt.plot(x, val_acc, 'r', label='Validation acc')
    plt.title('Training and validation accuracy')
    plt.legend()
    plt.subplot(1, 2, 2)
    plt.plot(x, loss, 'b', label='Training loss')
    plt.plot(x, val_loss, 'r', label='Validation loss')
    plt.title('Training and validation loss')
    plt.legend()

In [None]:
plot_history(training)
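
Besides reading the plot by eye, the recorded history can be inspected directly. The sketch below uses a made-up history dictionary (the numbers are invented, not results from this lab) and a hypothetical helper `best_epoch` to find the epoch with the lowest validation loss.

```python
# Hypothetical per-epoch values, shaped like the dictionary in training.history
example_history = {
    "loss": [0.65, 0.40, 0.22, 0.10, 0.04],
    "val_loss": [0.62, 0.41, 0.35, 0.38, 0.44],
}

def best_epoch(val_losses):
    """Return the 1-based epoch with the lowest validation loss."""
    best_index = min(range(len(val_losses)), key=val_losses.__getitem__)
    return best_index + 1

epoch = best_epoch(example_history["val_loss"])
print(epoch)
```

When the training loss keeps falling while the validation loss starts to rise, as in these invented numbers, the model is most likely beginning to overfit, and training for more epochs is unlikely to help.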

### This was a walkthrough of how to create and train a CNN. As an exercise, familiarise yourself with the Keras documentation and try adding new layers to the CNN architecture or changing the values of the variables, and observe what happens. You can also try a CNN with the assignment data.
