## Neural network (RNN with LSTM layer)

Why a neural network with an LSTM layer?

We are still working on the binary classification of tweets/comments: label 0 means "sarcasm" and label 1 means "cyberbullying".

Unlike the Naive Bayes model we tried earlier, which treats each word independently, a recurrent neural network reads a comment as a sequence and can therefore make use of context.
A Long Short-Term Memory (LSTM) layer lets the model capture “long-term dependencies”, which a classical RNN struggles to learn.
The gated cell state of an LSTM also mitigates the vanishing-gradient problem, although it does not eliminate it entirely.
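
To make the gating idea concrete, here is a minimal NumPy sketch of a single LSTM time step. It is an illustration only, not the Keras implementation: the gate ordering, the weight names `W`, `U`, `b`, and the random initialisation are assumptions chosen for readability.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM time step.

    W: (4 * units, input_dim), U: (4 * units, units), b: (4 * units,)
    """
    units = h_prev.shape[0]
    z = W @ x_t + U @ h_prev + b
    i = sigmoid(z[0:units])                # input gate
    f = sigmoid(z[units:2 * units])        # forget gate
    g = np.tanh(z[2 * units:3 * units])    # candidate cell state
    o = sigmoid(z[3 * units:4 * units])    # output gate
    c_t = f * c_prev + i * g               # additive update of the cell state
    h_t = o * np.tanh(c_t)                 # new hidden state
    return h_t, c_t

# Hypothetical sizes matching this notebook: 30-dim inputs, 20 LSTM units
rng = np.random.default_rng(0)
input_dim, units = 30, 20
W = rng.normal(scale=0.1, size=(4 * units, input_dim))
U = rng.normal(scale=0.1, size=(4 * units, units))
b = np.zeros(4 * units)

h, c = np.zeros(units), np.zeros(units)
for x_t in rng.normal(size=(5, input_dim)):   # a made-up sequence of 5 steps
    h, c = lstm_step(x_t, h, c, W, U, b)

print(h.shape, c.shape)
```

The cell state `c_t` is updated additively (`f * c_prev + i * g`), which is the part usually credited with letting gradients survive over longer sequences.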

In [1]:
import numpy as np
import pandas as pd
import gensim.downloader as api
from gensim.models import Word2Vec
from sklearn.model_selection import train_test_split
from tensorflow.keras import layers, Sequential
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.preprocessing.text import text_to_word_sequence


In [2]:
data = pd.read_csv('merged_all_modified_csv.csv')
X = data['comment_cleaned_lower']
y = data['label']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 0)

In [16]:
X_train.shape

(29751,)

In [30]:
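# Train a Word2Vec model on the training comments.
# Note: each "sentence" passed here is a plain string, so gensim iterates over its
# characters and learns character embeddings (hence the small vocabulary below).
word2vec = Word2Vec(sentences=[str(x) for x in X_train], vector_size=30, window=2, min_count=5)
# For word-level embeddings, pass lists of tokens instead, e.g.:
# word2vec = Word2Vec(sentences=[text_to_word_sequence(str(x)) for x in X_train], vector_size=30, window=2, min_count=5)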
print(word2vec)

Word2Vec<vocab=124, vector_size=30, alpha=0.025>
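
A vocabulary of only 124 entries for almost 30,000 comments is what one would expect from character-level tokens: gensim treats each sentence as an iterable of tokens, and iterating over a Python string yields single characters. The sketch below shows the difference on two made-up comments; splitting on whitespace is just one possible tokenizer, used here as an assumption.

```python
from collections import Counter

# Made-up comments, for illustration only
comments = ["you are so smart", "oh great another monday"]

# Iterating over a string yields characters ...
char_vocab = Counter(ch for text in comments for ch in text)

# ... whereas splitting first yields words
word_vocab = Counter(word for text in comments for word in text.split())

print(sorted(char_vocab)[:5])
print(sorted(word_vocab)[:5])
```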


In [31]:
# Convert a sentence (an iterable of tokens) into a matrix of embedding vectors.
# Tokens that are not in the Word2Vec vocabulary are skipped.
def embed_sentence(word2vec, sentence):
    embedded_sentence = []
    for word in sentence:
        if word in word2vec.wv:
            embedded_sentence.append(word2vec.wv[word])

    return np.array(embedded_sentence)

# Convert a list of sentences into a list of matrices (one per sentence)
def embedding(word2vec, sentences):
    return [embed_sentence(word2vec, sentence) for sentence in sentences]

# Embed the training and test sentences
X_train_embed = embedding(word2vec, [str(x) for x in X_train])
X_test_embed = embedding(word2vec, [str(x) for x in X_test])

# Pad the training and test embedded sentences
X_train_pad = pad_sequences(X_train_embed, dtype='float32', padding='post', maxlen=200)
X_test_pad = pad_sequences(X_test_embed, dtype='float32', padding='post', maxlen=200)

In [32]:
X_train_pad.shape, X_test_pad.shape

((29751, 200, 30), (12751, 200, 30))
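
The shapes above come from padding every embedded sentence to 200 time steps. As a rough illustration of what `pad_sequences` does with `padding='post'` and its default `truncating='pre'`, here is a NumPy sketch; `pad_post` is a hypothetical helper, not part of Keras.

```python
import numpy as np

def pad_post(seqs, maxlen, dim):
    """Zero-pad at the end; if a sequence is too long, keep its last `maxlen` steps."""
    out = np.zeros((len(seqs), maxlen, dim), dtype="float32")
    for i, seq in enumerate(seqs):
        seq = np.asarray(seq, dtype="float32").reshape(-1, dim)
        seq = seq[-maxlen:]          # drop steps from the start when too long
        out[i, :len(seq)] = seq      # real steps first, zeros after
    return out

# Made-up sequences of lengths 2, 5 and 0, each with 3 features per step
seqs = [np.ones((2, 3)), np.ones((5, 3)), np.array([])]
padded = pad_post(seqs, maxlen=4, dim=3)
print(padded.shape)
```

Sentences with no in-vocabulary token end up as all-zero rows, which is worth keeping in mind when masking is used downstream.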

In [33]:
# Sanity checks: X_train_pad and X_test_pad should be 3-D NumPy arrays whose
# first dimension matches the number of samples and whose last dimension
# matches the Word2Vec embedding size.

for arr in [X_train_pad, X_test_pad]:
    assert isinstance(arr, np.ndarray)
    assert arr.ndim == 3
    assert arr.shape[-1] == word2vec.wv.vector_size


assert X_train_pad.shape[0] == len(X_train)
assert X_test_pad.shape[0] == len(X_test)

## Baseline accuracy

In [34]:
# Share of label 1 in the training set, i.e. the accuracy of always predicting label 1
baseline_accuracy = y_train.sum() / y_train.shape[0]
baseline_accuracy

0.7490168397700918
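
The figure above equals the accuracy of a constant classifier only because label 1 happens to be the majority class. A slightly more general sketch, which works whichever class dominates, could look like this (`majority_baseline` is a hypothetical helper and the labels are made up):

```python
import numpy as np

def majority_baseline(y):
    """Accuracy of always predicting the most frequent binary label."""
    positive_rate = np.asarray(y).mean()
    return max(positive_rate, 1 - positive_rate)

print(majority_baseline([1, 1, 1, 0]))
print(majority_baseline([0, 0, 0, 1]))
```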

## RNN model, without transfer learning

In [35]:
vocab_size = word2vec.wv.vectors.shape[0]

In [36]:
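def init_model():
    model = Sequential()
    # Skip zero-padded time steps (mask_value defaults to 0.0)
    model.add(layers.Masking())
    # Recurrent layer: returns the final hidden state (20 units)
    model.add(layers.LSTM(20, activation="tanh"))
    model.add(layers.Dense(10, activation="relu"))
    # Single sigmoid unit: predicted probability of label 1
    model.add(layers.Dense(1, activation="sigmoid"))

    model.compile(loss="binary_crossentropy",
                  optimizer="rmsprop",
                  metrics=["accuracy"])
    return model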

model = init_model()
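
The `Masking` layer tells the LSTM to skip time steps whose features all equal the mask value (0.0 by default), which is how the zero padding is ignored. A rough NumPy sketch of the mask it computes, on a made-up batch:

```python
import numpy as np

# Made-up batch: 2 sequences, 4 time steps, 3 features, zero-padded at the end
batch = np.zeros((2, 4, 3), dtype="float32")
batch[0, :2] = 1.0   # first sequence has 2 real time steps
batch[1, :3] = 0.5   # second sequence has 3 real time steps

# A time step is kept if at least one feature differs from the mask value
mask = np.any(batch != 0.0, axis=-1)
lengths = mask.sum(axis=1)

print(mask)
print(lengths)
```

One consequence of this rule is that a real token whose embedding is exactly zero in every dimension would be masked as well.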

In [37]:
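# Stop when training accuracy has not improved for 3 consecutive epochs.
# No validation data is passed to fit(), so only training metrics can be monitored here.
callback = EarlyStopping(monitor='accuracy', patience=3)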
model.fit(X_train_pad, y_train, epochs=10, batch_size=128, callbacks=[callback], verbose=1)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7fc5c1433760>
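
To illustrate what `patience=3` means, the following pure-Python sketch mimics the stopping rule on a made-up list of per-epoch scores. `stopping_epoch` is a hypothetical helper that approximates the callback's logic for a metric that should increase; it is not the Keras code itself.

```python
def stopping_epoch(values, patience):
    """Return the 1-based epoch at which training would stop, or None if it never does."""
    best = float("-inf")
    wait = 0
    for epoch, value in enumerate(values, start=1):
        if value > best:
            best, wait = value, 0    # improvement: reset the counter
        else:
            wait += 1                # no improvement this epoch
            if wait >= patience:
                return epoch
    return None

# Made-up scores: the best value (0.78) is reached at epoch 3 and never beaten
scores = [0.70, 0.75, 0.78, 0.77, 0.78, 0.76, 0.79]
print(stopping_epoch(scores, patience=3))
```

With these numbers the run stops at epoch 6 and never sees the better score at epoch 7, which is the usual trade-off when choosing a small patience.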

In [38]:
model.summary()

Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 masking_1 (Masking)         (None, 200, 30)           0         
                                                                 
 lstm_1 (LSTM)               (None, 20)                4080      
                                                                 
 dense_2 (Dense)             (None, 10)                210       
                                                                 
 dense_3 (Dense)             (None, 1)                 11        
                                                                 
Total params: 4,301
Trainable params: 4,301
Non-trainable params: 0
_________________________________________________________________
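
The parameter counts in the summary can be reproduced by hand. An LSTM layer has four weight sets (three gates plus the candidate cell state), each with an input kernel, a recurrent kernel, and a bias; a dense layer has one weight per input-output pair plus one bias per unit. The helper names below are hypothetical.

```python
def lstm_params(input_dim, units):
    # 4 weight sets, each: input kernel + recurrent kernel + bias
    return 4 * (units * input_dim + units * units + units)

def dense_params(input_dim, units):
    # weights + biases
    return input_dim * units + units

lstm = lstm_params(input_dim=30, units=20)
dense_hidden = dense_params(input_dim=20, units=10)
dense_out = dense_params(input_dim=10, units=1)
total = lstm + dense_hidden + dense_out

print(lstm, dense_hidden, dense_out, total)  # 4080 210 11 4301
```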


In [39]:
model.predict(X_test_pad)



array([[0.97180146],
       [0.95489734],
       [0.9362625 ],
       ...,
       [0.9579762 ],
       [0.9467767 ],
       [0.9647433 ]], dtype=float32)
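
The network outputs probabilities of label 1, not class labels. A common convention, assumed here, is to threshold at 0.5, which is also the default threshold of the binary accuracy metric in Keras. A minimal sketch using the first three values printed above:

```python
import numpy as np

# First three probabilities from the output above
probs = np.array([[0.97180146], [0.95489734], [0.9362625]])

labels = (probs.ravel() >= 0.5).astype(int)   # 1 = "cyberbullying", 0 = "sarcasm"
share_positive = labels.mean()                # fraction predicted as label 1

print(labels, share_positive)
```

Running the same check on the full prediction array would show whether the model predicts label 1 for almost every comment, which the displayed values hint at but do not prove.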

In [None]:
model.evaluate(X_test_pad, y_test)

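## Same model with transfer learning: pretrained GloVe embeddings

Instead of training Word2Vec on our own comments, we reuse GloVe vectors pretrained on a much larger (and similar) dataset of tweets.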


In [40]:
# List the pretrained models available through gensim's downloader
print(list(api.info()['models'].keys()))

['fasttext-wiki-news-subwords-300', 'conceptnet-numberbatch-17-06-300', 'word2vec-ruscorpora-300', 'word2vec-google-news-300', 'glove-wiki-gigaword-50', 'glove-wiki-gigaword-100', 'glove-wiki-gigaword-200', 'glove-wiki-gigaword-300', 'glove-twitter-25', 'glove-twitter-50', 'glove-twitter-100', 'glove-twitter-200', '__testing_word2vec-matrix-synopsis']


In [41]:
word2vec_transfer = api.load('glove-twitter-200')

In [42]:
embedding_size_transfer = word2vec_transfer.vector_size
vocab_size_transfer = word2vec_transfer.vectors.shape[0]

In [44]:
# Convert a sentence (an iterable of tokens) into a matrix of pretrained vectors.
# Here `word2vec` is the KeyedVectors object returned by api.load, so it is indexed
# directly (no `.wv`). Tokens missing from the pretrained vocabulary are skipped.
def embed_sentence_with_TF(word2vec, sentence):
    embedded_sentence = []
    for word in sentence:
        if word in word2vec:
            embedded_sentence.append(word2vec[word])
    return np.array(embedded_sentence)

# Convert a list of sentences into a list of matrices.
# Note: this redefines the earlier `embedding` helper.
def embedding(word2vec, sentences):
    return [embed_sentence_with_TF(word2vec, sentence) for sentence in sentences]

# Embed the training and test sentences
X_train_embed_transfer = embedding(word2vec_transfer, [str(x) for x in X_train])
X_test_embed_transfer = embedding(word2vec_transfer, [str(x) for x in X_test])

In [45]:
# Pad (or truncate) every embedded sentence to 200 time steps
X_train_pad_transfer = pad_sequences(X_train_embed_transfer, dtype='float32', padding='post', maxlen=200)
X_test_pad_transfer = pad_sequences(X_test_embed_transfer, dtype='float32', padding='post', maxlen=200)

In [46]:
model_transfer = init_model()

In [47]:
model_transfer.fit(X_train_pad_transfer, y_train, epochs=100, batch_size=128, verbose=1)

Epoch 1/100
...
Epoch 100/100


<keras.callbacks.History at 0x7fc594da4910>

In [48]:
model_transfer.predict(X_test_pad_transfer)



array([[9.9960357e-01],
       [9.9983239e-01],
       [9.9295008e-01],
       ...,
       [9.9990988e-01],
       [4.3896434e-04],
       [9.9986994e-01]], dtype=float32)

In [49]:
res = model_transfer.evaluate(X_test_pad_transfer, y_test, verbose=2)

399/399 - 22s - loss: 0.1285 - accuracy: 0.9602 - 22s/epoch - 56ms/step
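
The test accuracy of 0.9602 is well above the 0.749 baseline. Because about three quarters of the training labels are 1, per-class metrics can be more informative than accuracy alone. The sketch below computes precision and recall from confusion counts on made-up labels; `confusion_counts` is a hypothetical helper, and the numbers are not results from this notebook.

```python
import numpy as np

def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    tn = int(np.sum((y_true == 0) & (y_pred == 0)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    return tp, tn, fp, fn

# Made-up labels, for illustration only
y_true_toy = [1, 1, 1, 0, 0, 1]
y_pred_toy = [1, 1, 0, 0, 1, 1]

tp, tn, fp, fn = confusion_counts(y_true_toy, y_pred_toy)
precision = tp / (tp + fp)
recall = tp / (tp + fn)

print(tp, tn, fp, fn, precision, recall)
```

In the notebook, `y_pred` could be obtained by thresholding `model_transfer.predict(X_test_pad_transfer)` and compared against `y_test`.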
