# An NLP workshop - Categorizing tweets into relevant or non-relevant
#### adapted from https://github.com/hundredblocks/concrete_NLP_tutorial.git

## 4 - Deep Learning - CNN

In this notebook we use Word2Vec embeddings with a Convolutional Neural Network (CNN) to classify our tweets.

First, let's import the main libraries we will need up front.

In [None]:
import os
import itertools
import datetime

import gensim
import pandas as pd
import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt
import seaborn as sn

In [None]:
%matplotlib inline

In [None]:
%load_ext tensorboard

### Let's read in our cleansed dataset

In [None]:
clean_questions = pd.read_csv("clean_data.csv")

Let's drop the small number of 'undecided' labels

In [None]:
clean_questions = clean_questions[clean_questions.class_label != 2]

### Load pre-trained Word2Vec embeddings

In [None]:
word2vec_path = "~/Downloads/GoogleNews-vectors-negative300.bin"
word2vec = gensim.models.KeyedVectors.load_word2vec_format(word2vec_path, binary=True)

In [None]:
# word2vec.most_similar("fire")  # Takes a while to run first time, subsequent calls are faster

```python
word2vec.most_similar("fire")
```




    [('blaze', 0.7516718506813049),
     ('fires', 0.7222490310668945),
     ('Fire', 0.69910728931427),
     ('flames', 0.6387674808502197),
     ('carelessly_discarded_cigarette', 0.6215550899505615),
     ('inferno', 0.6056278347969055),
     ('firefighters', 0.6039329767227173),
     ('alarm_blaze', 0.5841655731201172),
     ('brush_fires', 0.579571008682251),
     ('grassfire', 0.5759598612785339)]

# Leveraging text structure
The models in our previous notebook have been performing well, but they completely ignore sentence structure, such as the order in which words appear. To see whether capturing some of that structure helps, we will try a final, more complex model.

## CNNs for text classification
Here, we will use a Convolutional Neural Network for sentence classification. While CNNs are less popular than RNNs for text, they have been shown to produce competitive results (sometimes beating the best models) and are very fast to train, which makes them a good fit for this workshop.

<img src="cnn.png"/>

*Image from “Convolutional Neural Networks for Sentence Classification.” by Yoon Kim (2014)*
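
To make the picture above more concrete, here is a minimal sketch of the two operations the network relies on: a filter sliding over the sequence, followed by max-over-time pooling. It is a toy in plain Python, not the Keras implementation. Each position holds a single number instead of a 300-dimensional embedding, and the names `conv1d_valid`, `toy_sequence` and `toy_kernel` are made up for illustration.

```python
def conv1d_valid(sequence, kernel):
    """Slide `kernel` over `sequence` without padding and apply ReLU to each dot product."""
    width = len(kernel)
    outputs = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        activation = sum(x * w for x, w in zip(window, kernel))
        outputs.append(max(0.0, activation))  # ReLU keeps only positive responses
    return outputs


# Toy input: one scalar per position (an assumption made to keep the example small)
toy_sequence = [0.0, 1.0, 2.0, 1.0, 0.0]
toy_kernel = [1.0, 0.0, -1.0]

feature_map = conv1d_valid(toy_sequence, toy_kernel)
pooled = max(feature_map)  # max-over-time pooling: keep the strongest response

print(feature_map, pooled)
```

A sequence of length 5 and a filter of width 3 give a feature map of length 3; pooling then reduces it to a single number per filter.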

First, we tokenize the text, creating an index for every unique word

In [None]:
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical

In [None]:
EMBEDDING_DIM = 300           # Our word2vec model has 300 dimensions
MAX_SEQUENCE_LENGTH = 35      # Every tweet is padded or truncated to this many tokens
VOCAB_SIZE = 20000            # Max number of distinct tokens the tokenizer keeps
VALIDATION_SPLIT = 0.2        # 80/20 train/validation split

In [None]:
tokenizer = Tokenizer(num_words=VOCAB_SIZE)
tokenizer.fit_on_texts(clean_questions["text"].tolist())
sequences = tokenizer.texts_to_sequences(clean_questions["text"].tolist())
word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))
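
If you are curious what the tokenizer is doing, the idea can be sketched in a few lines of plain Python: count the words, give the most frequent word index 1, the next one index 2, and so on, keeping 0 free for padding. This is only a rough approximation of the Keras `Tokenizer` (it skips punctuation filtering and the `num_words` cut-off), and `build_word_index`, `texts_to_ids` and the toy tweets are hypothetical names and data.

```python
from collections import Counter


def build_word_index(texts):
    """Map each word to a rank, starting at 1 for the most frequent word."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # most_common() sorts by count; words with equal counts keep first-seen order
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}


def texts_to_ids(texts, word_index):
    """Replace every known word with its index, dropping unknown words."""
    return [[word_index[w] for w in text.lower().split() if w in word_index]
            for text in texts]


toy_texts = ["forest fire near town", "fire crews fight forest fire"]
toy_index = build_word_index(toy_texts)
toy_sequences = texts_to_ids(toy_texts, toy_index)

print(toy_index)
print(toy_sequences)
```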

Our CNN needs a constant-length input, so we pad (or truncate) every sequence to exactly 35 tokens

In [None]:
cnn_data = pad_sequences(sequences, maxlen=MAX_SEQUENCE_LENGTH)
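
By default `pad_sequences` pads with zeros at the front and, for sequences that are too long, drops tokens from the front as well. The sketch below reproduces that default behaviour for a single sequence in plain Python; `pad_pre` is a hypothetical helper written for this illustration, not part of Keras.

```python
def pad_pre(sequence, maxlen, value=0):
    """Left-pad (or left-truncate) one list of token ids to exactly `maxlen` items."""
    truncated = sequence[-maxlen:]  # keep only the last `maxlen` ids
    return [value] * (maxlen - len(truncated)) + truncated


short = pad_pre([5, 6, 7], maxlen=5)
long = pad_pre([1, 2, 3, 4, 5, 6, 7], maxlen=5)

print(short)
print(long)
```

Padding at the front means the real words always sit at the end of the sequence, and index 0 never collides with a real word because the word index starts at 1.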

Shuffle the sequences and determine the train/val split

In [None]:
labels = to_categorical(np.asarray(clean_questions["class_label"]))
indices = np.arange(cnn_data.shape[0])

np.random.shuffle(indices)
cnn_data = cnn_data[indices]
labels = labels[indices]
num_validation_samples = int(VALIDATION_SPLIT * cnn_data.shape[0])
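
The important detail here is that the data and the labels are reordered with the same shuffled indices, so every row keeps its own label. Here is a small self-contained sketch of that idea using only the standard library; `shuffled_split`, the fixed seed and the toy data are assumptions made for the example.

```python
import random


def shuffled_split(features, targets, validation_fraction, seed=0):
    """Shuffle features and targets together, then hold out the last rows for validation."""
    order = list(range(len(features)))
    random.Random(seed).shuffle(order)  # one permutation shared by both lists
    features = [features[i] for i in order]
    targets = [targets[i] for i in order]
    n_val = int(validation_fraction * len(features))
    train = (features[:-n_val], targets[:-n_val])
    val = (features[-n_val:], targets[-n_val:])
    return train, val


toy_features = list(range(10))
toy_targets = [x * 10 for x in toy_features]  # each target is tied to its feature
(train_x, train_y), (val_x, val_y) = shuffled_split(toy_features, toy_targets, 0.2)

print(len(train_x), len(val_x))
```

Note that the negative slicing only works when at least one row is held out; with zero validation rows `features[:-0]` would be empty.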

We have converted each label to a one-hot encoded vector

In [None]:
print(labels)
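
As a quick illustration of what one-hot encoding means, here is a plain Python sketch of the encoding and its inverse. The helper names `one_hot` and `from_one_hot` are hypothetical; in the notebook itself `to_categorical` does the encoding.

```python
def one_hot(label, num_classes):
    """Return a row of zeros with a single 1.0 at position `label`."""
    row = [0.0] * num_classes
    row[label] = 1.0
    return row


def from_one_hot(row):
    """Recover the class index as the position of the largest value."""
    return row.index(max(row))


toy_labels = [0, 1, 1, 0]
encoded = [one_hot(label, num_classes=2) for label in toy_labels]
decoded = [from_one_hot(row) for row in encoded]

print(encoded)
print(decoded)
```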

For every word in the word index, find the corresponding word2vec embedding, substituting a random embedding if not found

In [None]:
embedding_weights = np.zeros((len(word_index)+1, EMBEDDING_DIM))
for word,index in word_index.items():
    embedding_weights[index,:] = word2vec[word] if word in word2vec else np.random.rand(EMBEDDING_DIM)
print(embedding_weights.shape)
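
The lookup above can be hard to picture with 300 dimensions, so here is the same logic on a toy scale with plain Python lists. Everything in it is made up for illustration: the three-dimensional `toy_vectors`, the `toy_word_index`, and the helper `build_embedding_rows`. Row 0 is left as zeros because index 0 is reserved for padding.

```python
import random


def build_embedding_rows(word_index, vectors, dim, seed=0):
    """Return one row per word index; unknown words get a random row."""
    rng = random.Random(seed)
    rows = [[0.0] * dim for _ in range(len(word_index) + 1)]  # row 0 = padding
    missing = []
    for word, idx in word_index.items():
        if word in vectors:
            rows[idx] = list(vectors[word])
        else:
            rows[idx] = [rng.random() for _ in range(dim)]  # values in [0, 1)
            missing.append(word)
    return rows, missing


toy_vectors = {"fire": [0.9, 0.1, 0.0]}        # stand-in for the word2vec model
toy_word_index = {"fire": 1, "zzyzx": 2}        # "zzyzx" has no pre-trained vector
rows, missing = build_embedding_rows(toy_word_index, toy_vectors, dim=3)

print(len(rows), missing)
```

Keeping track of the missing words, as the sketch does, is a cheap way to check how much of the vocabulary the pre-trained embeddings actually cover.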

Now, we will define a simple Convolutional Neural Network

In [None]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, Embedding, Flatten
from tensorflow.keras.layers import Conv1D, MaxPooling1D

In [None]:
def ConvNet(embeddings, max_sequence_length, num_words, embedding_dim, num_labels, trainable=False):
    model = Sequential()
    model.add(Embedding(num_words, embedding_dim, weights=[embeddings], input_length=max_sequence_length, trainable=trainable)) 
    model.add(Conv1D(filters=128, kernel_size=3, activation='relu'))
    model.add(MaxPooling1D(pool_size=3))
    model.add(Dropout(0.5))
    model.add(Flatten())
    model.add(Dense(128, activation='relu'))
    model.add(Dense(num_labels, activation='softmax'))
    
    # One-hot targets with a softmax output: categorical cross-entropy is the conventional loss
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model

In [None]:
model = ConvNet(embedding_weights, MAX_SEQUENCE_LENGTH, len(word_index)+1, EMBEDDING_DIM, 
                len(list(clean_questions["class_label"].unique())), trainable=False)
model.summary()
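
It can help to work out by hand the shapes that `model.summary()` reports. The sketch below traces the sequence length through the layers of `ConvNet` and counts the trainable parameters after the frozen embedding, assuming two output classes and the Keras defaults (stride 1 and no padding for `Conv1D`, stride equal to `pool_size` for `MaxPooling1D`). The helper functions are hypothetical and exist only for this calculation.

```python
def conv1d_out_len(length, kernel_size, stride=1):
    """Output length of a 1D convolution without padding."""
    return (length - kernel_size) // stride + 1


def maxpool1d_out_len(length, pool_size):
    """Output length of 1D max pooling with stride equal to the pool size."""
    return (length - pool_size) // pool_size + 1


seq_len, embedding_dim, filters, num_labels = 35, 300, 128, 2  # num_labels is assumed

after_conv = conv1d_out_len(seq_len, kernel_size=3)     # sequence length after Conv1D
after_pool = maxpool1d_out_len(after_conv, pool_size=3)  # sequence length after pooling
flattened = after_pool * filters                          # units going into Dense(128)

conv_params = 3 * embedding_dim * filters + filters       # weights + biases
dense_params = flattened * 128 + 128
output_params = 128 * num_labels + num_labels

print(after_conv, after_pool, flattened)
print(conv_params, dense_params, output_params)
```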

Use a checkpoint to store the weights whenever the validation accuracy improves

In [None]:
from tensorflow.keras.callbacks import ModelCheckpoint

In [None]:
filepath="weights.best.hdf5"
checkpoint = ModelCheckpoint(filepath, monitor='val_accuracy', verbose=1, save_best_only=True, mode='max')
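
With `save_best_only=True` and `mode='max'`, the checkpoint only overwrites the file when the monitored value beats the best one seen so far. The logic behind that can be sketched as follows; this is a simplified illustration rather than the Keras source, and the accuracy values are invented.

```python
def epochs_that_save(val_accuracies):
    """Return the epochs at which a 'save best only' checkpoint would write to disk."""
    best = float("-inf")
    saved = []
    for epoch, accuracy in enumerate(val_accuracies):
        if accuracy > best:  # strictly better than anything seen before
            best = accuracy
            saved.append(epoch)
    return saved


toy_history = [0.70, 0.75, 0.74, 0.78]  # made-up validation accuracies
saved_epochs = epochs_that_save(toy_history)

print(saved_epochs)
```

The weights on disk at the end of training therefore come from the best epoch, which is not necessarily the last one.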

Set up TensorBoard, which will help us visualize model training

In [None]:
!rm -rf ./logs/ 

In [None]:
logdir = os.path.join("logs", datetime.datetime.now().strftime("%Y%m%d-%H%M%S"))
tensorboard_callback = tf.keras.callbacks.TensorBoard(logdir, histogram_freq=1)

In [None]:
callbacks_list = [checkpoint, tensorboard_callback]

Create our training and validation sets

In [None]:
x_train = cnn_data[:-num_validation_samples]
y_train = labels[:-num_validation_samples]
x_val = cnn_data[-num_validation_samples:]
y_val = labels[-num_validation_samples:]

Set a couple of very important *hyper-parameters*

In [None]:
NUM_EPOCHS = 10
BATCH_SIZE = 256
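
Together these two numbers decide how many weight updates the network performs. As a rough worked example, assuming a hypothetical training set of 8,000 tweets (the real number depends on your data):

```python
import math

num_epochs = 10
batch_size = 256
num_training_rows = 8000  # hypothetical size, used only for this illustration

steps_per_epoch = math.ceil(num_training_rows / batch_size)  # the last batch may be smaller
total_updates = steps_per_epoch * num_epochs

print(steps_per_epoch, total_updates)
```

A larger batch size means fewer, smoother updates per epoch; a smaller one means more, noisier updates.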

Now let's train our Neural Network

In [None]:
model.fit(x_train, y_train, 
          validation_data=(x_val, y_val), 
          epochs=NUM_EPOCHS, 
          batch_size=BATCH_SIZE, 
          callbacks=callbacks_list)

Let's fire up TensorBoard to visualize our training accuracy

In [None]:
%tensorboard --logdir "./logs"

Let's load in the weights that gave us the best validation accuracy

In [None]:
# load weights
model.load_weights("weights.best.hdf5")

How does this compare to the simpler classifiers used earlier?

In [None]:
# Re-calculate accuracy on validation dataset using loaded weights
scores = model.evaluate(x_val, y_val, verbose=0)
print("%s: %.2f%%" % (model.metrics_names[1], scores[1]*100))

Let's also calculate **precision**, **recall** and **f1-score**.

First let's get the predicted value, y_hat

In [None]:
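# predict_classes is no longer available in recent TensorFlow releases;
# take the argmax of the softmax output to get the predicted class instead
y_hat = np.argmax(model.predict(x_val, batch_size=BATCH_SIZE), axis=-1)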
print(y_hat)

y_val is a one-hot encoded output, so let's convert it back into a simple vector, y_val_equiv

In [None]:
y_val

In [None]:
# argmax recovers the class index from each one-hot row
y_val_equiv = np.argmax(y_val, axis=1)
print(y_val_equiv)

In [None]:
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score, classification_report

def get_metrics(y_test, y_predicted):  
    # true positives / (true positives+false positives)
    precision = precision_score(y_test, y_predicted, pos_label=None,
                                    average='weighted')             
    # true positives / (true positives + false negatives)
    recall = recall_score(y_test, y_predicted, pos_label=None,
                              average='weighted')
    
    # harmonic mean of precision and recall
    f1 = f1_score(y_test, y_predicted, pos_label=None, average='weighted')
    
    # (true positives + true negatives) / total
    accuracy = accuracy_score(y_test, y_predicted)
    return accuracy, precision, recall, f1

In [None]:
accuracy, precision, recall, f1 = get_metrics(y_val_equiv, y_hat)
print("accuracy = {:2.2%}, precision = {:2.2%}, recall = {:2.2%}, f1 = {:2.2%}".format(accuracy, precision, recall, f1))
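
The `average='weighted'` option computes each metric per class and then averages the results, weighting every class by how many true examples it has. To see what that means in practice, here is a plain Python sketch of the same calculation on a handful of invented labels; `weighted_scores` is a hypothetical helper, not a scikit-learn function.

```python
def weighted_scores(y_true, y_pred):
    """Support-weighted precision, recall and F1 across all classes in y_true."""
    total = len(y_true)
    precision = recall = f1 = 0.0
    for c in sorted(set(y_true)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == c and p == c)
        fp = sum(1 for t, p in pairs if t != c and p == c)
        fn = sum(1 for t, p in pairs if t == c and p != c)
        p_c = tp / (tp + fp) if tp + fp else 0.0
        r_c = tp / (tp + fn) if tp + fn else 0.0
        f_c = 2 * p_c * r_c / (p_c + r_c) if p_c + r_c else 0.0
        weight = (tp + fn) / total  # share of true examples in class c
        precision += weight * p_c
        recall += weight * r_c
        f1 += weight * f_c
    return precision, recall, f1


toy_true = [0, 0, 0, 1, 1]
toy_pred = [0, 0, 1, 1, 0]
toy_precision, toy_recall, toy_f1 = weighted_scores(toy_true, toy_pred)

print(toy_precision, toy_recall, toy_f1)
```

One consequence worth knowing: the weighted recall always equals the plain accuracy, so the two numbers printed above are expected to match.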

In [None]:
print(classification_report(y_val_equiv, y_hat))

### Confusion Matrix
Let's plot a *confusion matrix*, which helps us see our false positives and false negatives

In [None]:
from sklearn.metrics import confusion_matrix

df_cm = pd.DataFrame(confusion_matrix(y_val_equiv, y_hat),
                     index=["Irrelevant", "Disaster"], 
                     columns=["Irrelevant", "Disaster"])
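
To read the matrix correctly, remember the scikit-learn convention: rows are the actual classes and columns are the predicted classes. The counting itself is simple, as this small plain Python sketch on invented labels shows (`confusion_counts` is a hypothetical helper):

```python
def confusion_counts(y_true, y_pred, num_classes=2):
    """Count examples per (actual, predicted) pair; rows are actual classes."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for actual, predicted in zip(y_true, y_pred):
        matrix[actual][predicted] += 1
    return matrix


toy_true = [0, 0, 0, 1, 1]
toy_pred = [0, 0, 1, 1, 0]
toy_matrix = confusion_counts(toy_true, toy_pred)

print(toy_matrix)
```

With class 1 as the positive class, `toy_matrix[0][1]` holds the false positives and `toy_matrix[1][0]` the false negatives.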

In [None]:
plt.figure(figsize=(10,8))
sn.set(font_scale=1.4) # for label size
sn.heatmap(df_cm, 
           annot=True, 
           annot_kws={"size": 30}, 
           cmap=plt.cm.winter, 
           fmt='.0f',
           lw=0.5, 
           linecolor='w')
plt.title('Confusion matrix', fontsize=30)
plt.tight_layout()
plt.ylabel('Actual', fontsize=20)
plt.xlabel('Predicted', fontsize=20)
plt.show();

## Alternative CNN architectures

In [None]:
from tensorflow.keras.layers import GlobalMaxPooling1D, Activation

def ConvNet2(embeddings, max_sequence_length, num_words, embedding_dim, labels_index, trainable=False):
    model = Sequential()
    model.add(Embedding(num_words,
                        embedding_dim,
                        weights=[embeddings],
                        input_length=max_sequence_length,
                        trainable=trainable))
    model.add(Dropout(0.2))
    model.add(Conv1D(250,
                     3,
                     padding='valid',
                     activation='relu',
                     strides=1))
    model.add(GlobalMaxPooling1D())  # output is already flat, so no Flatten layer is needed
    model.add(Dense(250))
    model.add(Dropout(0.2))
    model.add(Dense(labels_index, activation='sigmoid'))


    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model

In [None]:
def ConvNet3(embeddings, max_sequence_length, num_words, embedding_dim, labels_index, trainable=False):
    model = Sequential()
    model.add(Embedding(num_words,
                        embedding_dim,
                        weights=[embeddings],
                        input_length=max_sequence_length,
                        trainable=trainable))

    model.add(Conv1D(128, 3, activation='relu'))
    model.add(MaxPooling1D(3))
    model.add(Conv1D(128, 3, activation='relu'))
    model.add(MaxPooling1D(3))
    model.add(Conv1D(128, 3, activation='relu'))
    #model.add(MaxPooling1D(35))  # global max pooling
    model.add(Flatten())
    model.add(Dense(128, activation='relu'))
    model.add(Dense(labels_index, activation='softmax'))

    model.compile(loss='categorical_crossentropy',  # one-hot targets with a softmax output
                  optimizer='adam',
                  metrics=['accuracy'])
    return model
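
Stacking convolution and pooling layers shrinks the sequence quickly, so it is worth checking that the architecture still fits a 35-token input. The sketch below traces the sequence length through `ConvNet3` and compares the size of the flattened output with `ConvNet2`, assuming the Keras defaults (no padding, convolution stride 1, pooling stride equal to the pool size). The helper functions are hypothetical.

```python
def conv1d_out_len(length, kernel_size):
    """Output length of a 1D convolution with stride 1 and no padding."""
    return length - kernel_size + 1


def maxpool1d_out_len(length, pool_size):
    """Output length of 1D max pooling with stride equal to the pool size."""
    return (length - pool_size) // pool_size + 1


# ConvNet3: Conv1D -> MaxPooling1D -> Conv1D -> MaxPooling1D -> Conv1D
lengths = [35]
for layer in ["conv", "pool", "conv", "pool", "conv"]:
    step = conv1d_out_len if layer == "conv" else maxpool1d_out_len
    lengths.append(step(lengths[-1], 3))

convnet3_flat_units = lengths[-1] * 128  # 128 filters in the last Conv1D

# ConvNet2: global max pooling keeps one value per filter, whatever the length
convnet2_flat_units = 250

print(lengths, convnet3_flat_units, convnet2_flat_units)
```

By the last convolution only one position is left, which is why the commented-out global pooling layer in `ConvNet3` would make no difference at this input length. A shorter `MAX_SEQUENCE_LENGTH` would leave too few positions for the final layers.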