# An NLP workshop - Categorizing tweets into relevant or non-relevant
#### adapted from https://github.com/hundredblocks/concrete_NLP_tutorial.git

## 4 - Deep Learning - CNN

In this notebook we use Word2Vec embeddings with a Convolutional Neural Network (CNN) to classify our tweets.

First, let's import all the libraries we will need up front.

In [None]:
import gensim
import pandas as pd
import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt

import itertools

In [None]:
%matplotlib inline

### Let's read in our cleansed dataset

In [None]:
clean_questions = pd.read_csv("clean_data.csv")

We also define a function that plots a *Confusion Matrix*, which helps us see our false positives and false negatives.

In [None]:
from sklearn.metrics import confusion_matrix

def plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.winter):
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title, fontsize=30)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, fontsize=20)
    plt.yticks(tick_marks, classes, fontsize=20)
    
    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.

    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt), horizontalalignment="center", 
                 color="white" if cm[i, j] < thresh else "black", fontsize=40)
    
    plt.tight_layout()
    plt.ylabel('True label', fontsize=30)
    plt.xlabel('Predicted label', fontsize=30)

    return plt
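
The `normalize=True` branch divides each row of the matrix by its row sum, so every cell becomes the fraction of a true class that was assigned to each predicted class. A minimal sketch of that step, using made-up counts (the numbers are illustrative, not results from this dataset):

```python
import numpy as np

# Hypothetical 2x2 confusion matrix: rows are true labels, columns are predictions
cm_example = np.array([[8, 2],
                       [1, 9]])

# Same row-wise normalization used inside plot_confusion_matrix
cm_normalized = cm_example.astype('float') / cm_example.sum(axis=1)[:, np.newaxis]
print(cm_normalized)
```

After normalization every row sums to 1, which makes classes of different sizes directly comparable.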

### Load pre-trained Word2Vec embeddings

In [None]:
word2vec_path = "~/Downloads/GoogleNews-vectors-negative300.bin"
word2vec = gensim.models.KeyedVectors.load_word2vec_format(word2vec_path, binary=True)

# Leveraging text structure
The models in our previous notebooks have performed well, but they completely ignore sentence structure. To see whether capturing some of that structure helps, we will try a final, more complex model.

## CNNs for text classification
Here, we will use a Convolutional Neural Network for sentence classification. While not as popular as RNNs for text, CNNs have been shown to give competitive results (sometimes beating the best models) and are very fast to train, making them a good choice for this workshop. First, let's embed our text!

We start by tokenizing the text, creating an index for every unique word.

In [None]:
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical

EMBEDDING_DIM = 300
MAX_SEQUENCE_LENGTH = 35
VOCAB_SIZE = 20000
VALIDATION_SPLIT = 0.2

tokenizer = Tokenizer(num_words=VOCAB_SIZE)
tokenizer.fit_on_texts(clean_questions["text"].tolist())
sequences = tokenizer.texts_to_sequences(clean_questions["text"].tolist())
word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))
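
As a rough mental model of what the tokenizer produces (a simplified sketch, not the Keras implementation, which also strips punctuation), words are ranked by frequency and replaced by their 1-based rank. The helper names `build_word_index` and `to_sequences` and the toy sentences are hypothetical:

```python
from collections import Counter

def build_word_index(texts):
    """Map each word to a 1-based rank, most frequent word first."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # most_common sorts by descending count; ties keep first-seen order
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def to_sequences(texts, word_index, num_words=None):
    """Replace words by their index, dropping words ranked num_words or higher."""
    sequences = []
    for text in texts:
        ids = [word_index[w] for w in text.lower().split() if w in word_index]
        if num_words is not None:
            ids = [i for i in ids if i < num_words]
        sequences.append(ids)
    return sequences

toy_texts = ["forest fire near town", "fire in the forest"]
toy_index = build_word_index(toy_texts)
toy_sequences = to_sequences(toy_texts, toy_index)
print(toy_index)
print(toy_sequences)
```

Index 0 is never assigned to a word, which leaves it free to act as the padding value.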

Our CNN needs a constant-length input, so we pad or truncate every sequence to exactly 35 tokens.

In [None]:
cnn_data = pad_sequences(sequences, maxlen=MAX_SEQUENCE_LENGTH)
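
With its default arguments (`padding='pre'`, `truncating='pre'`), `pad_sequences` adds zeros on the left of short sequences and keeps only the last `maxlen` tokens of long ones. A small NumPy sketch of that behavior, where `pad_pre` is a hypothetical helper written for illustration:

```python
import numpy as np

def pad_pre(sequences, maxlen, value=0):
    """Left-pad short sequences and keep the last maxlen items of long ones."""
    out = np.full((len(sequences), maxlen), value, dtype="int32")
    for row, seq in enumerate(sequences):
        if not seq:
            continue  # an empty sequence stays all padding
        kept = seq[-maxlen:]
        out[row, -len(kept):] = kept
    return out

toy_padded = pad_pre([[1, 2], [1, 2, 3, 4, 5]], maxlen=3)
print(toy_padded)
```

Tweets are short, so with a length of 35 most rows end up padded rather than truncated.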

Shuffle the sequences together with their labels and compute the size of the validation split.

In [None]:
labels = to_categorical(np.asarray(clean_questions["class_label"]))
indices = np.arange(cnn_data.shape[0])

np.random.shuffle(indices)
cnn_data = cnn_data[indices]
labels = labels[indices]
num_validation_samples = int(VALIDATION_SPLIT * cnn_data.shape[0])
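
The important detail in this step is that the data and the labels are reordered with the same index array, so each row keeps its label. A self-contained sketch on toy arrays; the fixed seed and the `toy_*` names are additions for illustration, not part of the notebook:

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, added here for reproducibility

toy_data = np.arange(10).reshape(10, 1)       # sample i holds the value i
toy_labels = np.eye(2)[np.arange(10) % 2]     # one-hot label: parity of i

# One permutation, applied to both arrays, keeps rows and labels aligned
order = rng.permutation(len(toy_data))
toy_data, toy_labels = toy_data[order], toy_labels[order]

n_val = int(0.2 * len(toy_data))
toy_train, toy_val = toy_data[:-n_val], toy_data[-n_val:]
print(toy_train.shape, toy_val.shape)
```

Note that slicing with `[:-n_val]` returns an empty array when `n_val` is 0, so a very small dataset would need a guard.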

For every word in the word index, look up the corresponding Word2Vec embedding, substituting a random vector if the word is not in the Word2Vec vocabulary.

In [None]:
embedding_weights = np.zeros((len(word_index)+1, EMBEDDING_DIM))
for word,index in word_index.items():
    embedding_weights[index,:] = word2vec[word] if word in word2vec else np.random.rand(EMBEDDING_DIM)
print(embedding_weights.shape)
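
The same lookup can be tried without downloading the Google News vectors. In this sketch a plain dictionary stands in for `word2vec`; the words, the 4-dimensional vectors, and the `toy_*` names are all made up for illustration:

```python
import numpy as np

toy_dim = 4
# Stand-in for the pre-trained vectors: only two words are "known"
toy_vectors = {"fire": np.ones(toy_dim), "forest": np.full(toy_dim, 2.0)}
toy_word_index = {"fire": 1, "forest": 2, "zzzunknown": 3}

rng = np.random.default_rng(0)
# Row 0 is reserved for padding and stays all zeros
toy_weights = np.zeros((len(toy_word_index) + 1, toy_dim))
for word, index in toy_word_index.items():
    toy_weights[index] = toy_vectors[word] if word in toy_vectors else rng.random(toy_dim)

print(toy_weights.shape)
```

Row `i` of the matrix is the vector for the word whose index is `i`, which is the layout the embedding layer expects.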

Now, we will define a simple Convolutional Neural Network

In [None]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, Embedding, Flatten
from tensorflow.keras.layers import SpatialDropout1D, Conv1D, MaxPooling1D


def ConvNet(embeddings, max_sequence_length, num_words, embedding_dim, labels_index, trainable=False):
    model = Sequential()
    model.add(Embedding(num_words, embedding_dim, weights=[embeddings], input_length=max_sequence_length, trainable=trainable)) 
    model.add(Conv1D(filters=128, kernel_size=3, activation='relu'))
    model.add(MaxPooling1D(pool_size=3))
    model.add(Dropout(0.5))
    model.add(Flatten())
    model.add(Dense(128, activation='relu'))
    model.add(Dense(labels_index, activation='softmax'))
    
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model
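
It can help to work out the tensor shapes by hand before reading `model.summary()`. The sketch below assumes the Keras defaults used above (`'valid'` padding and stride 1 for `Conv1D`, a pooling stride equal to `pool_size`) and assumes three output classes; `conv1d_output_length` is a hypothetical helper:

```python
def conv1d_output_length(length, kernel_size, stride=1):
    """Output length of a 1D convolution or pooling window with 'valid' padding."""
    return (length - kernel_size) // stride + 1

seq_len, embedding_dim, filters = 35, 300, 128
n_classes = 3  # assumption: three distinct class labels in the data

after_conv = conv1d_output_length(seq_len, kernel_size=3)               # Conv1D
after_pool = conv1d_output_length(after_conv, kernel_size=3, stride=3)  # MaxPooling1D
flat = after_pool * filters                                             # Flatten

# Trainable parameters: weights plus one bias per output unit
conv_params = 3 * embedding_dim * filters + filters
dense_params = flat * 128 + 128
out_params = 128 * n_classes + n_classes

print(after_conv, after_pool, flat)
print(conv_params, dense_params, out_params)
```

The embedding layer is frozen (`trainable=False`), so its weights count as non-trainable parameters in the summary.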

In [None]:
model = ConvNet(embedding_weights, MAX_SEQUENCE_LENGTH, len(word_index)+1, EMBEDDING_DIM, 
                len(list(clean_questions["class_label"].unique())), trainable=False)
model.summary()

Use a checkpoint to store the weights whenever the validation accuracy improves.

In [None]:
from tensorflow.keras.callbacks import ModelCheckpoint

# checkpoint
filepath="weights.best.hdf5"
checkpoint = ModelCheckpoint(filepath, monitor='val_accuracy', verbose=1, save_best_only=True, mode='max')
callbacks_list = [checkpoint]
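
With `save_best_only=True` and `mode='max'`, the callback writes the file only when the monitored value is strictly greater than the best value seen so far. A plain-Python sketch of that rule, with invented accuracy values and a hypothetical helper name:

```python
def epochs_that_save(val_accuracies):
    """Return the 1-based epochs at which a new best value would be saved."""
    best = float("-inf")
    saved = []
    for epoch, acc in enumerate(val_accuracies, start=1):
        if acc > best:  # strict improvement only; ties do not overwrite
            best = acc
            saved.append(epoch)
    return saved

print(epochs_that_save([0.70, 0.75, 0.74, 0.78, 0.78]))
```

Here the file would be written after epochs 1, 2 and 4, so the weights on disk come from the earliest epoch that reached the top score.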

Now let's train our Neural Network

In [None]:
x_train = cnn_data[:-num_validation_samples]
y_train = labels[:-num_validation_samples]
x_val = cnn_data[-num_validation_samples:]
y_val = labels[-num_validation_samples:]

NUM_EPOCHS = 5
BATCH_SIZE = 128

model.fit(x_train, y_train, validation_data=(x_val, y_val), epochs=NUM_EPOCHS, batch_size=BATCH_SIZE, callbacks=callbacks_list)

Let's load in the weights that gave us the best validation accuracy

In [None]:
# load weights
model.load_weights("weights.best.hdf5")

How does this compare to the simpler classifiers used earlier?

In [None]:
# Re-calculate accuracy on validation dataset using loaded weights
scores = model.evaluate(x_val, y_val, verbose=0)
print("%s: %.2f%%" % (model.metrics_names[1], scores[1]*100))

### Confusion Matrix

In [None]:
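# Sequential.predict_classes was removed in TensorFlow 2.6;
# take the argmax of the softmax output instead
y_hat = np.argmax(model.predict(x_val, batch_size=BATCH_SIZE), axis=-1)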
y_val_equiv = []
for val in y_val:
    if   val[0]== 1. : y_val_equiv.append(0)
    elif val[1]== 1. : y_val_equiv.append(1)
    else             : y_val_equiv.append(2)
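
Because the labels are one-hot encoded, the position of the 1 in each row is the class id, so the conversion above can also be expressed with `np.argmax`. A small sketch on a made-up label array:

```python
import numpy as np

# Hypothetical one-hot labels for three samples
toy_one_hot = np.array([[1., 0., 0.],
                        [0., 0., 1.],
                        [0., 1., 0.]])

# Index of the largest entry in each row is the integer class id
toy_class_ids = np.argmax(toy_one_hot, axis=1)
print(toy_class_ids)
```

This form also works unchanged if the number of classes grows beyond three.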

In [None]:
cm_cnn = confusion_matrix(y_val_equiv, y_hat)
fig = plt.figure(figsize=(10, 10))
plot = plot_confusion_matrix(cm_cnn, classes=['Irrelevant','Disaster','Unsure'], normalize=False, title='Confusion matrix')
plt.show()
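
The counts in the matrix can also be turned into per-class precision and recall. With the scikit-learn convention (rows are true labels, columns are predictions), precision divides the diagonal by the column sums and recall divides it by the row sums. A sketch on an invented matrix, not the output of this model:

```python
import numpy as np

# Made-up 3x3 confusion matrix: rows are true labels, columns are predictions
toy_cm = np.array([[50, 5, 0],
                   [10, 30, 0],
                   [1, 0, 0]])

true_pos = np.diag(toy_cm).astype(float)
predicted_totals = toy_cm.sum(axis=0)  # how often each class was predicted
actual_totals = toy_cm.sum(axis=1)     # how often each class truly occurred

# Guard against division by zero for classes that were never predicted or never seen
precision = np.divide(true_pos, predicted_totals,
                      out=np.zeros_like(true_pos), where=predicted_totals > 0)
recall = np.divide(true_pos, actual_totals,
                   out=np.zeros_like(true_pos), where=actual_totals > 0)

print(precision)
print(recall)
```

A rare class such as the third one here can score zero on both measures while overall accuracy still looks high, which is why the matrix is worth inspecting alongside the accuracy figure.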