# __Recurrent Neural Network (RNN)__

A recurrent neural network (RNN) is a neural network designed for sequential data. Unlike a feedforward network, an RNN feeds the output of the previous step back in as an input to the current step. This feedback is carried by a hidden state, which acts as a memory of past inputs and lets the network capture temporal dependencies in the sequence. That makes RNNs well suited to tasks such as natural language processing and speech recognition.
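To make the role of the hidden state concrete, here is a minimal pure-Python sketch of the standard vanilla RNN recurrence, h_t = tanh(W_x x_t + W_h h_{t-1} + b). The function names and toy weights are illustrative assumptions; this is not how Keras implements `SimpleRNN` internally, only the same idea at toy scale.

```python
import math

def rnn_step(x_t, h_prev, W_x, W_h, b):
    """One step of a vanilla RNN: h_t = tanh(W_x x_t + W_h h_prev + b)."""
    h_t = []
    for i in range(len(b)):
        total = b[i]
        total += sum(W_x[i][j] * x_t[j] for j in range(len(x_t)))
        total += sum(W_h[i][k] * h_prev[k] for k in range(len(h_prev)))
        h_t.append(math.tanh(total))
    return h_t

def run_rnn(inputs, W_x, W_h, b):
    """Process a whole sequence and return the final hidden state."""
    h = [0.0] * len(b)  # the hidden state starts at zero
    for x_t in inputs:
        h = rnn_step(x_t, h, W_x, W_h, b)
    return h

# Toy weights (illustrative values only): hidden size 2, input size 1
W_x = [[0.5], [-0.3]]
W_h = [[0.1, 0.2], [0.0, 0.4]]
b = [0.0, 0.0]

# Both sequences end with the same input, but differ in the first step
h_a = run_rnn([[1.0], [0.0]], W_x, W_h, b)
h_b = run_rnn([[0.0], [0.0]], W_x, W_h, b)
print(h_a, h_b)
```

Although both sequences end with the same input, the final hidden states differ, because `h_a` still carries information about the earlier step.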

In [42]:
import os
import pandas as pd
import numpy as np
from sklearn.metrics import (
    confusion_matrix,
    classification_report
)
from tensorflow.keras.layers import (
    Embedding, 
    SimpleRNN, 
    Flatten, 
    Dense, 
    Dropout
)
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint
from tensorflow.keras.models import load_model

# Custom libraries
import sys
sys.path.append('..')
from functions.models import *

In [43]:
train_path = '../data/train_prep.tsv'
test_path = '../data/test_prep.tsv'
validation_path = '../data/validation_prep.tsv'

train = pd.read_csv(train_path, sep='\t')
test = pd.read_csv(test_path, sep='\t')
validation = pd.read_csv(validation_path, sep='\t')

### Tokenization, Padding and Sequencing

In natural language processing (NLP), tokenization is the process of splitting a text into smaller units: words, characters, or groups of words known as n-grams.

Each unit is then assigned an index to represent it. This allows us to transform a piece of text into a sequence of numbers that a machine learning model can understand.
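The mapping from text to integer sequences can be sketched in a few lines of plain Python. This is a simplified illustration, assuming whitespace tokenization and frequency-ranked indices starting at 1; the helper names `build_word_index` and `texts_to_sequences` are hypothetical, and the Keras `Tokenizer` does more (for example, it filters punctuation).

```python
from collections import Counter

def build_word_index(texts):
    """Assign index 1 to the most frequent word, 2 to the next, and so on."""
    counts = Counter()
    for text in texts:
        counts.update(text.lower().split())
    # most_common() sorts by descending frequency
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

def texts_to_sequences(texts, word_index):
    """Replace each known word by its index; unknown words are dropped."""
    return [
        [word_index[w] for w in text.lower().split() if w in word_index]
        for text in texts
    ]

toy_texts = ["love this game", "love love this", "bad game"]
word_index = build_word_index(toy_texts)
sequences = texts_to_sequences(["love bad movie"], word_index)
print(word_index)
print(sequences)
```

Index 0 is deliberately left unused so that it can later serve as the padding value.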

We first check that there are no NaN values that could cause problems during tokenization.

In [44]:
train.shape

(40091, 3)

In [45]:
train = train.dropna()

In [46]:
train.shape

(40034, 3)

The training set did contain some NaN values: `dropna()` removed 57 rows, so they will not reach the tokenizer.

In [54]:
# Tokenize the column 'comment'
tokenizer = Tokenizer()
tokenizer.fit_on_texts(train['comment'])

The index assigned to each token (lower indices correspond to more frequent tokens):

In [55]:
tokenizer.word_index

{'name': 1,
 'like': 2,
 'love': 3,
 'would': 4,
 'good': 5,
 'get': 6,
 'one': 7,
 'people': 8,
 'thanks': 9,
 'really': 10,
 'lol': 11,
 'know': 12,
 'think': 13,
 'thank': 14,
 'time': 15,
 'make': 16,
 'thing': 17,
 'see': 18,
 'hope': 19,
 'much': 20,
 'look': 21,
 'got': 22,
 'sorry': 23,
 'going': 24,
 'year': 25,
 'great': 26,
 'well': 27,
 'want': 28,
 'go': 29,
 'still': 30,
 'even': 31,
 'feel': 32,
 'right': 33,
 'oh': 34,
 'guy': 35,
 'way': 36,
 'never': 37,
 'cannot': 38,
 'game': 39,
 'could': 40,
 'bad': 41,
 'yeah': 42,
 'u': 43,
 'man': 44,
 'need': 45,
 'day': 46,
 'happy': 47,
 'better': 48,
 'say': 49,
 'best': 50,
 'actually': 51,
 'pretty': 52,
 'also': 53,
 'glad': 54,
 'back': 55,
 'sure': 56,
 'wow': 57,
 'though': 58,
 'someone': 59,
 'work': 60,
 'something': 61,
 'thought': 62,
 'take': 63,
 'mean': 64,
 'always': 65,
 'post': 66,
 'yes': 67,
 'hate': 68,
 'first': 69,
 'made': 70,
 'life': 71,
 'wish': 72,
 'nice': 73,
 'help': 74,
 'new': 75,
 'find': 76

In [56]:
train.head(2)

| Unnamed: 0 | comment | id | label |
|---|---|---|---|
| 0 | everyone think he laugh screwing people instea... | ed00q6i | 27 |
| 1 | fuck bayless isoing | eezlygj | 2 |


Once the tokenizer has been fitted, we use its word index to convert each comment into a sequence of integers:

In [57]:
train_sequences = tokenizer.texts_to_sequences(train["comment"])
train_sequences[:3]

[[111, 13, 1017, 260, 5387, 8, 365, 51, 437],
 [81, 6366, 11487],
 [16, 32, 2974]]

As can be seen, the comments differ in length, so the resulting sequences do too. To feed the data into the neural network, however, all sequences must have the same length. We therefore apply padding: shorter sequences are filled with zeros until they reach a common length, here the length of the longest training sequence.

In [58]:
max_seq_len = max([len(seq) for seq in train_sequences])
max_seq_len

20

In [59]:
train_padded = pad_sequences(train_sequences, maxlen=max_seq_len, padding="post")

In [60]:
# Sequences after padding
train_padded[:3]

array([[  111,    13,  1017,   260,  5387,     8,   365,    51,   437,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0],
       [   81,  6366, 11487,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0],
       [   16,    32,  2974,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0]])
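As a rough illustration of what post-padding does, the sketch below pads with plain Python lists. The helper `pad_post` is hypothetical and only mimics the behavior used above. It also shows a case that does not arise in the training set but may in the validation or test sets: a sequence longer than `maxlen` has to be truncated, and `pad_sequences` by default drops tokens from the start (`truncating="pre"`).

```python
def pad_post(sequences, maxlen, value=0):
    """Pad each sequence at the end up to maxlen (maxlen must be > 0).

    Sequences longer than maxlen keep only their last maxlen tokens,
    i.e. they are truncated from the front.
    """
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]
        padded.append(seq + [value] * (maxlen - len(seq)))
    return padded

toy = [[111, 13, 1017], [81], [5, 6, 7, 8, 9, 10]]
padded = pad_post(toy, maxlen=4)
print(padded)
```

Padding at the end keeps the real tokens at the start of every row, which matches the `padding="post"` choice made above.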

### Model Building

In [None]:
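# Vocabulary size: number of known tokens plus one, because index 0 is reserved for padding
vocab_size = len(tokenizer.word_index) + 1

model = Sequential()

# Add Embedding layer
model.add(Embedding(input_dim=vocab_size, output_dim=50, input_length=max_seq_len))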

# Add SimpleRNN layer
model.add(SimpleRNN(128, recurrent_dropout=0.5, dropout=0.5))

# Flatten (a no-op here, since SimpleRNN already returns a 2D tensor) and Dense layer
model.add(Flatten())
model.add(Dropout(0.2))
model.add(Dense(64, activation="relu"))

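# Output layer
# The labels are integer class ids (e.g. 27 and 2 in the sample rows), so we use one
# softmax unit per class; this assumes the ids run from 0 to the maximum label.
num_classes = int(train["label"].max()) + 1
model.add(Dense(num_classes, activation="softmax"))

In [None]:
model.compile(loss="sparse_categorical_crossentropy", optimizer="adam", metrics=["acc"])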

In [None]:

model.summary()

### Model Training

### Model Validation

### Model Evaluation