### Importing Libraries

In [57]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import gc

import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, SimpleRNN, Dense, Dropout
from sklearn.model_selection import train_test_split

from sklearn.metrics import log_loss,confusion_matrix,classification_report,roc_curve,auc

import string
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from scipy import sparse
%matplotlib inline
seed = 42
import os
os.environ['OMP_NUM_THREADS'] = '4'


### Loading the Data

In [58]:


# Read the training data from the CSV file
train = pd.read_csv('train.csv')

# Read the test data from the CSV file
test = pd.read_csv('test.csv')

# Read the test labels from the CSV file
test_labels = pd.read_csv('test_labels.csv')

# Display the number of rows and columns in the training data
print('Number of rows and columns in the train data set:', train.shape)

# Display the number of rows and columns in the test data
print('Number of rows and columns in the test data set:', test.shape)

# Display the number of rows and columns in the test labels data
print('Number of rows and columns in the test labels data set:', test_labels.shape)


Number of rows and columns in the train data set: (159571, 8)
Number of rows and columns in the test data set: (153164, 2)
Number of rows and columns in the test labels data set: (153164, 7)


In [59]:
# Merge the test data with the test labels based on the 'id' column
test = pd.merge(test, test_labels, on='id')

# Display the first few rows of the merged test data
test.head()

# Filter out rows where any label has a value of -1 (indicating missing or invalid label);
# .copy() makes the result an independent DataFrame, so later in-place edits are safe
test = test[(test[['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']] != -1).all(axis=1)].copy()

# Display the number of rows and columns in the filtered test data
print('Number of rows and columns in the test data set:', test.shape)


Number of rows and columns in the test data set: (63978, 8)


In [61]:
test.fillna(' ',inplace=True)


## Text Data Tokenization and Padding

The following code segment outlines the essential steps of tokenizing and padding text data for effective text classification:

### Objective:

Prepare text data for classification by applying tokenization and padding techniques.

### Hyperparameters:

- **Vocabulary Size (vocab_size):** Set to 20,000, controlling the number of unique words in the vocabulary.
- **Embedding Dimension (embedding_dim):** Chosen as 128, determining the size of the embedding vectors for words.
- **Maximum Length (max_length):** Limited to 200, ensuring uniform sequence length for the classification model.
- **Truncation Type (trunc_type):** Set to 'post', indicating truncation from the end of sequences.
- **Padding Type (padding_type):** Set to 'post', indicating padding at the end of sequences.
- **Out-of-Vocabulary Token (oov_tok):** Defined as '<OOV>', representing out-of-vocabulary words.

### Tokenization Process:

1. The Keras Tokenizer is employed to tokenize the sentences in the comment text.
2. The tokenizer is fitted on the training data, creating a vocabulary index.
3. Sequences of integers are generated for both the training and test sets based on the tokenizer's vocabulary.
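
The three steps above can be sketched in plain Python. This is a simplified stand-in for the Keras Tokenizer, not its implementation: it only lowercases and splits on whitespace, and the helper names `fit_vocab` and `texts_to_sequences` as well as the toy sentences are assumptions made for illustration. Like Keras, it reserves index 1 for the OOV token, ranks words by frequency, and maps any word whose index is not below `num_words` to the OOV index.

```python
from collections import Counter

def fit_vocab(texts, oov_token="<OOV>"):
    """Build a word -> index map: index 1 is the OOV token, then words by frequency."""
    counts = Counter(word for text in texts for word in text.lower().split())
    word_index = {oov_token: 1}
    for rank, (word, _) in enumerate(counts.most_common(), start=2):
        word_index[word] = rank
    return word_index

def texts_to_sequences(texts, word_index, num_words, oov_token="<OOV>"):
    """Map each word to its index; unseen or too-rare words become the OOV index."""
    oov = word_index[oov_token]
    sequences = []
    for text in texts:
        seq = []
        for word in text.lower().split():
            idx = word_index.get(word, oov)
            seq.append(idx if idx < num_words else oov)
        sequences.append(seq)
    return sequences

# Toy data (hypothetical): the vocabulary is fitted on the training texts only
train_texts = ["you are great", "you are not great", "you are you"]
word_index = fit_vocab(train_texts)
train_seqs = texts_to_sequences(train_texts, word_index, num_words=4)
test_seqs = texts_to_sequences(["you are awful"], word_index, num_words=4)
print(train_seqs)
print(test_seqs)
```

Fitting on the training texts only means that a word seen for the first time in the test set ("awful" here) is mapped to the OOV index instead of leaking test vocabulary into the model.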

### Padding:

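1. Every sequence is brought to a fixed length of 200 tokens: shorter sequences are padded with zeros and longer ones are truncated.
2. Both padding and truncation use the 'post' setting, so zeros are appended to the end of short sequences and tokens are dropped from the end of long ones.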
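
As a minimal sketch of what 'post' padding and 'post' truncation do, the hypothetical helper `pad_post` below mimics the behaviour on plain Python lists. It is an illustration only, not the `pad_sequences` implementation, and the short `maxlen` is chosen to keep the example readable.

```python
def pad_post(sequences, maxlen, value=0):
    """Truncate from the end, then pad at the end, so every row has length maxlen."""
    padded = []
    for seq in sequences:
        kept = list(seq[:maxlen])                # 'post' truncation drops the tail
        kept += [value] * (maxlen - len(kept))   # 'post' padding appends the pad value
        padded.append(kept)
    return padded

# One sequence shorter than maxlen and one longer
rows = pad_post([[5, 8], [3, 9, 4, 7, 2, 6]], maxlen=4)
print(rows)
```

The short sequence gains trailing zeros and the long one loses its last tokens, so both rows end up with the same length.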

### Train-Validation Split:

1. The training set is split 80/20 into training and validation sets (test_size=0.2, random_state=42); the validation set is used to monitor the model during training.

This process prepares the textual data for subsequent use in training a text classification model.


In [63]:
# Objective: Tokenization and Padding for Text Classification

# Defining target columns for classification
target_col = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']
y = train[target_col]

# Hyperparameters for tokenization and padding
vocab_size = 20000
embedding_dim = 128
max_length = 200
trunc_type = 'post'
padding_type = 'post'
oov_tok = '<OOV>'

# Tokenizing the sentences using the Keras Tokenizer
tokenizer = Tokenizer(num_words=vocab_size, oov_token=oov_tok)
tokenizer.fit_on_texts(train['comment_text'])
word_index = tokenizer.word_index

# Converting the train and test sets into sequences
train_sequences = tokenizer.texts_to_sequences(train['comment_text'])
train_padded = pad_sequences(train_sequences, maxlen=max_length, padding=padding_type, truncating=trunc_type)

test_sequences = tokenizer.texts_to_sequences(test['comment_text'])
test_padded = pad_sequences(test_sequences, maxlen=max_length, padding=padding_type, truncating=trunc_type)

# Splitting the train set into train and validation sets
train_padded, val_padded, train_labels, val_labels = train_test_split(train_padded, y, test_size=0.2, random_state=42)


In [64]:
# Define the text classification model
model = Sequential([
    # Embedding layer for word representation
    Embedding(vocab_size, embedding_dim, input_length=max_length),
    
    # First LSTM layer with 64 units and return sequences set to True
    LSTM(64, return_sequences=True),
    
    # Second LSTM layer with 32 units for further sequence processing
    LSTM(32),
    
    # Dense layer with 64 units and ReLU activation
    Dense(64, activation='relu'),
    
    # Dropout layer with a dropout rate of 50% to prevent overfitting
    Dropout(0.5),
    
    # Output Dense layer with 6 nodes and sigmoid activation for multi-label classification
    Dense(6, activation='sigmoid')
])


# Recurrent Neural Network (RNN) for Text Classification

To classify toxic comments, we use a Recurrent Neural Network (RNN) built with Keras on a TensorFlow backend. The architecture is structured as follows:

## Model Architecture

- **Embedding Layer:** The initial layer for word representation using word embeddings, capturing semantic meanings.
  
- **LSTM Layers:** Two Long Short-Term Memory (LSTM) layers with 64 and 32 units, respectively. The first layer returns its full output sequence (one vector per time step) so that the second LSTM can consume it; the second layer returns only its final output, a single 32-dimensional vector per comment.

- **Dense Layer:** A dense layer with 64 units and ReLU activation for additional non-linearity.

- **Dropout Layer:** To prevent overfitting, a dropout layer with a dropout rate of 50% is introduced.

- **Output Dense Layer:** The final dense layer with 6 nodes and sigmoid activation is designed for multi-label classification, producing probabilities for each toxic label.
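
For a sense of scale, the trainable parameter counts implied by these layer sizes can be worked out by hand. The sketch below assumes the standard LSTM formulation with four gates and bias terms (the Keras default) and the hyperparameters used in this notebook (vocab_size = 20000, embedding_dim = 128); the helper names are illustrative.

```python
def embedding_params(vocab_size, dim):
    # One dim-sized vector per vocabulary index
    return vocab_size * dim

def lstm_params(input_dim, units):
    # Four gates, each with input weights, recurrent weights, and a bias vector
    return 4 * ((input_dim + units) * units + units)

def dense_params(input_dim, units):
    # Weight matrix plus bias vector
    return input_dim * units + units

layers = {
    "embedding": embedding_params(20000, 128),
    "lstm_64": lstm_params(128, 64),
    "lstm_32": lstm_params(64, 32),
    "dense_64": dense_params(32, 64),
    "output_6": dense_params(64, 6),
}
total = sum(layers.values())

for name, count in layers.items():
    print(f"{name:>10}: {count:>9,}")
print(f"{'total':>10}: {total:>9,}")
```

The dropout layer has no trainable parameters, and under these assumptions the embedding matrix accounts for the vast majority of the total.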

## Compilation and Training

The model is compiled using binary crossentropy loss and the Adam optimizer, with accuracy as the metric. Training is performed for three epochs on the training data, validating on a separate validation set.
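
With a sigmoid output per label, binary crossentropy treats each of the six labels as an independent yes/no decision and averages the per-label losses. A minimal sketch of that computation for a single comment follows; the clipping constant `eps` and the example probabilities are assumptions, and Keras applies its own clipping internally.

```python
import math

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean of -(y*log(p) + (1-y)*log(1-p)) over the labels of one example."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

# Hypothetical six-label target for one comment
labels = [1, 0, 1, 0, 1, 0]
confident = binary_crossentropy(labels, [0.9, 0.1, 0.9, 0.1, 0.9, 0.1])
uncertain = binary_crossentropy(labels, [0.5] * 6)
print(f"confident and correct: {confident:.4f}")
print(f"uninformative (all 0.5): {uncertain:.4f}")
```

Confident, correct probabilities give a loss near zero, while an uninformative prediction of 0.5 for every label gives log(2), about 0.693.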

## Prediction

After training, the model predicts toxic labels on the test set, providing insights into its performance.

## Why RNN with LSTM?

- **Sequential Nature of Text:** Toxic comments often exhibit patterns and context that require understanding of the sequential nature of text. LSTM, as a type of RNN, is well-suited for capturing such long-range dependencies.

- **Semantic Representations:** The ability of LSTM to capture semantic representations is crucial for discerning the nuanced meaning in toxic comments, where context plays a vital role.

- **Contextual Understanding:** The use of LSTM layers, especially one returning sequences, allows the model to build a richer contextual understanding by considering the order of words in a comment.

This RNN architecture, particularly with LSTM layers, proves suitable for the complexities of toxic comment classification, leveraging sequential information and semantic representations for enhanced accuracy.


In [65]:

# Compile the model
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

# Train the model
num_epochs = 3
history = model.fit(train_padded, train_labels, epochs=num_epochs, validation_data=(val_padded, val_labels), verbose=2)

# Predict on test set
test_pred = model.predict(test_padded)

Epoch 1/3
3990/3990 - 572s - loss: 0.1512 - accuracy: 0.9002 - val_loss: 0.1416 - val_accuracy: 0.9941 - 572s/epoch - 143ms/step
Epoch 2/3
3990/3990 - 755s - loss: 0.0992 - accuracy: 0.9938 - val_loss: 0.0540 - val_accuracy: 0.9941 - 755s/epoch - 189ms/step
Epoch 3/3
3990/3990 - 792s - loss: 0.0525 - accuracy: 0.9942 - val_loss: 0.0518 - val_accuracy: 0.9941 - 792s/epoch - 199ms/step


In [66]:
print(test_pred)

[[1.9348902e-03 4.4465221e-12 3.0096526e-05 5.5960708e-08 7.2777533e-05
  5.6555446e-06]
 [1.6209672e-01 1.0431326e-04 3.1703383e-02 2.1255468e-03 4.2972788e-02
  9.6544931e-03]
 [1.0993951e-01 2.8888813e-05 1.7916238e-02 9.9200802e-04 2.5854262e-02
  5.5971574e-03]
 ...
 [7.1963382e-01 1.0672147e-02 3.1103054e-01 2.2713598e-02 2.9539418e-01
  6.0466934e-02]
 [9.9724633e-01 2.5943160e-01 9.3785584e-01 3.1459097e-02 8.3884734e-01
  1.3564067e-01]
 [5.0212285e-03 1.9356791e-10 1.3561477e-04 6.1508717e-07 2.9553953e-04
  3.0352878e-05]]


In [67]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score

# test_pred holds the predicted probabilities; threshold them to obtain binary predictions
threshold = 0.5
test_pred_binary = (test_pred > threshold).astype(int)

target_cols = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']

# Iterate over each label to compute metrics for both classes
for i, col in enumerate(target_cols):
    y_true = test[col]
    y_pred = test_pred_binary[:, i]

    # Metrics for class 0
    precision_0 = precision_score(y_true, y_pred, pos_label=0, zero_division=0)
    recall_0 = recall_score(y_true, y_pred, pos_label=0, zero_division=0)
    f1_0 = f1_score(y_true, y_pred, pos_label=0, zero_division=0)

    # Metrics for class 1
    precision_1 = precision_score(y_true, y_pred, pos_label=1, zero_division=0)
    recall_1 = recall_score(y_true, y_pred, pos_label=1, zero_division=0)
    f1_1 = f1_score(y_true, y_pred, pos_label=1, zero_division=0)

    # AUC for the label (computed from probabilities, not thresholded predictions)
    auc_score = roc_auc_score(y_true, test_pred[:, i])

    print(f"Metrics for label: {col}")
    print(f"Class 0 - Precision: {precision_0:.2f}, Recall: {recall_0:.2f}, F1-Score: {f1_0:.2f}")
    print(f"Class 1 - Precision: {precision_1:.2f}, Recall: {recall_1:.2f}, F1-Score: {f1_1:.2f}")
    print(f"AUC: {auc_score:.2f}")
    print('-' * 50)


Metrics for label: toxic
Class 0 - Precision: 0.98, Recall: 0.94, F1-Score: 0.96
Class 1 - Precision: 0.60, Recall: 0.79, F1-Score: 0.68
AUC: 0.96
--------------------------------------------------
Metrics for label: severe_toxic
Class 0 - Precision: 0.99, Recall: 1.00, F1-Score: 1.00
Class 1 - Precision: 0.31, Recall: 0.01, F1-Score: 0.02
AUC: 0.98
--------------------------------------------------
Metrics for label: obscene
Class 0 - Precision: 0.98, Recall: 0.98, F1-Score: 0.98
Class 1 - Precision: 0.67, Recall: 0.70, F1-Score: 0.68
AUC: 0.97
--------------------------------------------------
Metrics for label: threat
Class 0 - Precision: 1.00, Recall: 1.00, F1-Score: 1.00
Class 1 - Precision: 0.00, Recall: 0.00, F1-Score: 0.00
AUC: 0.94
--------------------------------------------------
Metrics for label: insult
Class 0 - Precision: 0.98, Recall: 0.98, F1-Score: 0.98
Class 1 - Precision: 0.63, Recall: 0.60, F1-Score: 0.61
AUC: 0.96
--------------------------------------------------

In [70]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, confusion_matrix

# test_pred holds the predicted probabilities; threshold them to obtain binary predictions
threshold = 0.5
test_pred_binary = (test_pred > threshold).astype(int)

target_cols = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']

# Iterate over each label to compute metrics and confusion matrix for both classes
for i, col in enumerate(target_cols):
    y_true = test[col]
    y_pred = test_pred_binary[:, i]

    # Confusion Matrix (rows: true class, columns: predicted class)
    cm = confusion_matrix(y_true, y_pred)

    # Metrics for class 0
    precision_0 = precision_score(y_true, y_pred, pos_label=0, zero_division=0)
    recall_0 = recall_score(y_true, y_pred, pos_label=0, zero_division=0)
    f1_0 = f1_score(y_true, y_pred, pos_label=0, zero_division=0)

    # Metrics for class 1
    precision_1 = precision_score(y_true, y_pred, pos_label=1, zero_division=0)
    recall_1 = recall_score(y_true, y_pred, pos_label=1, zero_division=0)
    f1_1 = f1_score(y_true, y_pred, pos_label=1, zero_division=0)

    # AUC for the label (computed from probabilities, not thresholded predictions)
    auc_score = roc_auc_score(y_true, test_pred[:, i])
    
    print(f"Metrics for label: {col}")
    print("Confusion Matrix:")
    print(cm)
    print(f"Class 0 - Precision: {precision_0:.2f}, Recall: {recall_0:.2f}, F1-Score: {f1_0:.2f}")
    print(f"Class 1 - Precision: {precision_1:.2f}, Recall: {recall_1:.2f}, F1-Score: {f1_1:.2f}")
    print(f"AUC: {auc_score:.2f}")
    print('-' * 50)


Metrics for label: toxic
Confusion Matrix:
[[54633  3255]
 [ 1264  4826]]
Class 0 - Precision: 0.98, Recall: 0.94, F1-Score: 0.96
Class 1 - Precision: 0.60, Recall: 0.79, F1-Score: 0.68
AUC: 0.96
--------------------------------------------------
Metrics for label: severe_toxic
Confusion Matrix:
[[63602     9]
 [  363     4]]
Class 0 - Precision: 0.99, Recall: 1.00, F1-Score: 1.00
Class 1 - Precision: 0.31, Recall: 0.01, F1-Score: 0.02
AUC: 0.98
--------------------------------------------------
Metrics for label: obscene
Confusion Matrix:
[[59013  1274]
 [ 1110  2581]]
Class 0 - Precision: 0.98, Recall: 0.98, F1-Score: 0.98
Class 1 - Precision: 0.67, Recall: 0.70, F1-Score: 0.68
AUC: 0.97
--------------------------------------------------
Metrics for label: threat
Confusion Matrix:
[[63767     0]
 [  211     0]]
Class 0 - Precision: 1.00, Recall: 1.00, F1-Score: 1.00
Class 1 - Precision: 0.00, Recall: 0.00, F1-Score: 0.00
AUC: 0.94
--------------------------------------------------
...

## Findings
### Toxic:
- The model performs well in identifying non-toxic comments (Class 0) with high precision (0.98) and recall (0.94).
- For toxic comments (Class 1), precision is moderate (0.60), indicating some false positives, but recall is relatively high (0.79), capturing a good portion of actual toxic comments.
- Overall, the model achieves a good balance with an AUC of 0.96.

### Severe Toxic:
- The model excels in correctly classifying non-severe toxic comments (Class 0) with high precision (0.99) and recall (1.00).
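- However, for severe toxic comments (Class 1), the model almost never predicts the positive class: only 13 comments are flagged, 4 of them correctly, which gives a precision of 0.31. Recall is extremely low (0.01), as just 4 of the 367 severe toxic comments are identified.
- The AUC is nonetheless high at 0.98, which indicates that the predicted probabilities rank severe toxic comments well even though few of them exceed the 0.5 threshold.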

### Obscene:
- The model shows excellent performance in distinguishing non-obscene comments (Class 0) with high precision (0.98) and recall (0.98).
- For obscene comments (Class 1), precision is moderate (0.67), and recall is reasonable (0.70), indicating a decent identification of obscene content.
- The AUC is high at 0.97.

### Threat:
- The model is highly accurate in identifying non-threatening comments (Class 0) with high precision (1.00) and recall (1.00).
- However, for threatening comments (Class 1), the model never predicts the positive class, so none of the 211 threatening comments are identified. Recall is therefore 0.00, and precision is undefined; it is reported as 0.00 only because of zero_division=0, not because of false positives.
- The AUC is good at 0.94.
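
The Class 1 figures above follow directly from the confusion matrices printed earlier, where rows are true classes and columns are predicted classes. A small sketch that recomputes them from the reported counts (the helper name `class1_metrics` is illustrative, and the zero-denominator handling mirrors zero_division=0):

```python
def class1_metrics(cm):
    """Return (predicted positives, precision, recall, F1) for the positive class."""
    (tn, fp), (fn, tp) = cm
    predicted_pos = tp + fp
    actual_pos = tp + fn
    precision = tp / predicted_pos if predicted_pos else 0.0
    recall = tp / actual_pos if actual_pos else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return predicted_pos, precision, recall, f1

# Confusion matrices copied from the output above
matrices = {
    "toxic": [[54633, 3255], [1264, 4826]],
    "severe_toxic": [[63602, 9], [363, 4]],
    "obscene": [[59013, 1274], [1110, 2581]],
    "threat": [[63767, 0], [211, 0]],
}

results = {label: class1_metrics(cm) for label, cm in matrices.items()}
for label, (predicted_pos, precision, recall, f1) in results.items():
    print(f"{label}: predicted positives={predicted_pos}, "
          f"precision={precision:.2f}, recall={recall:.2f}, F1={f1:.2f}")
```

Reading the number of predicted positives alongside precision makes it clear when a low precision comes from a handful of predictions (severe_toxic) or from no positive predictions at all (threat).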

In summary, the model reliably identifies the majority (negative) class for every label, but it struggles with the rare positive classes: at the 0.5 threshold it finds almost no severe toxic comments and no threatening comments. Because the AUC for these labels remains high (0.98 and 0.94), a lower, label-specific threshold might recover some recall. The right trade-off between precision and recall depends on the application scenario.
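
One way to act on this trade-off, which was not done in this notebook, is to choose a separate decision threshold per label on the validation set instead of using 0.5 everywhere. The sketch below picks the candidate threshold with the best F1 for a single rare label; the validation labels, probabilities, and candidate list are made-up toy values, and the helper names are hypothetical.

```python
def f1_at(y_true, y_prob, threshold):
    """F1 of the positive class when predicting 1 for probabilities above threshold."""
    tp = sum(1 for y, p in zip(y_true, y_prob) if y == 1 and p > threshold)
    fp = sum(1 for y, p in zip(y_true, y_prob) if y == 0 and p > threshold)
    fn = sum(1 for y, p in zip(y_true, y_prob) if y == 1 and p <= threshold)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def best_threshold(y_true, y_prob, candidates):
    """Return the candidate threshold with the highest F1."""
    return max(candidates, key=lambda t: f1_at(y_true, y_prob, t))

# Toy validation data for one rare label: positives get low but relatively higher scores
val_true = [0, 0, 0, 0, 0, 0, 1, 1]
val_prob = [0.01, 0.02, 0.03, 0.05, 0.08, 0.12, 0.20, 0.35]
candidates = [0.05, 0.10, 0.15, 0.25, 0.50]

chosen = best_threshold(val_true, val_prob, candidates)
print("F1 at 0.5:", f1_at(val_true, val_prob, 0.5))
print("chosen threshold:", chosen, "F1:", f1_at(val_true, val_prob, chosen))
```

In this toy case no probability exceeds 0.5, so the default threshold yields an F1 of 0 even though the scores separate the classes perfectly. Thresholds should be tuned on validation data only, so that the test metrics stay unbiased.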

## Conclusion
The RNN improves on the Naive Bayes model for labels where the imbalance between classes 1 and 0 is not extreme. However, where the imbalance is severe, as for the threat label, it performs worse, with a Class 1 precision and recall of 0. While this is the better model in general, these failures point to our areas for improvement.

While we did perform data sampling for Naive Bayes, we did not do so in this notebook: when we tried it, the model performed very well on the validation set but poorly on the test set, perhaps because of overfitting. We therefore decided not to undersample to balance the classes. This remains an area for improvement, where better data sampling could yield better results for heavily imbalanced classes.
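
For reference, random undersampling of the majority class for a single label can be sketched as below. This is a generic illustration and not the sampling code used for the Naive Bayes model; the function name, the `ratio` parameter, and the toy data are assumptions. In a multi-label setting one would also have to decide which label, or combination of labels, to balance on.

```python
import random

def undersample(rows, labels, ratio=1.0, seed=42):
    """Keep all positives and a random subset of negatives (ratio negatives per positive)."""
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    n_keep = min(len(neg), int(len(pos) * ratio))
    kept = sorted(pos + rng.sample(neg, n_keep))  # preserve the original row order
    return [rows[i] for i in kept], [labels[i] for i in kept]

# Toy data: 1000 comments, 2% positives
labels = [1 if i % 50 == 0 else 0 for i in range(1000)]
rows = [f"comment {i}" for i in range(1000)]

sub_rows, sub_labels = undersample(rows, labels, ratio=3.0)
print(len(sub_rows), "rows kept,", sum(sub_labels), "positives")
```

Resampling of this kind should be applied to the training split only. If the validation set is resampled as well, its class balance no longer matches the test set, which is one possible reason for strong validation scores alongside weak test scores.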

Moreover, while we added LSTM layers to better capture long-term dependencies, RNNs are known to struggle with them. Another area for improvement is therefore the model itself, which brings us to our third and final model, the state of the art: Transformers!

(Note: Model 2 > Model 1 holds in general, as the RNN performed better overall.)