# LSTM



The LSTM (Long Short-Term Memory) model is a recurrent neural network architecture designed for sequential data. It is used here for toxicity classification because it can capture sequential dependencies in text. The reasons are outlined below.

#### 1. Sequential Data Processing:

- **Nature of Text Data:** Toxicity classification tasks involve analyzing text data where the order and context of words matter. Traditional machine learning models built on bag-of-words features discard word order, which makes them less effective for tasks like toxicity detection in sentences.

- **Sequential Dependencies:** In sentences, the meaning of a word can depend on the words around it. An LSTM reads the sequence in order and carries information from earlier words forward, so each word is interpreted in the context of what came before. This makes LSTMs well-suited for tasks where context is crucial. (The single-direction LSTM used in this notebook does not look ahead at later words.)

#### 2. Overcoming Limitations of Regular RNNs:

- **Vanishing Gradient Problem:** Regular RNNs suffer from the vanishing gradient problem, where gradients diminish as they are backpropagated through time, making it challenging for the model to learn long-term dependencies.

- **LSTM's Solution:** LSTMs address the vanishing gradient problem with a more complex architecture that includes memory cells and gating mechanisms. This allows LSTMs to selectively remember or forget information over long sequences, enabling them to capture dependencies over extended contexts.
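The gating mechanism described above can be sketched as a single LSTM time step in NumPy. This is a minimal illustration, not the Keras implementation: the weight shapes, the gate ordering inside the stacked matrices, and the random initial values are all assumptions made for the example.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM time step.

    W: (4 * hidden, input_dim), U: (4 * hidden, hidden), b: (4 * hidden,)
    Assumed gate order in the stacked weights: input, forget, candidate, output.
    """
    hidden = h_prev.shape[0]
    z = W @ x_t + U @ h_prev + b
    i = sigmoid(z[0 * hidden:1 * hidden])  # input gate: how much new information to write
    f = sigmoid(z[1 * hidden:2 * hidden])  # forget gate: how much old memory to keep
    g = np.tanh(z[2 * hidden:3 * hidden])  # candidate values for the memory cell
    o = sigmoid(z[3 * hidden:4 * hidden])  # output gate: how much memory to expose
    c_t = f * c_prev + i * g               # additive update of the memory cell
    h_t = o * np.tanh(c_t)                 # hidden state passed to the next step
    return h_t, c_t


rng = np.random.default_rng(0)
input_dim, hidden = 4, 3                   # toy sizes, chosen for illustration
W = rng.normal(scale=0.1, size=(4 * hidden, input_dim))
U = rng.normal(scale=0.1, size=(4 * hidden, hidden))
b = np.zeros(4 * hidden)

h = np.zeros(hidden)
c = np.zeros(hidden)
for x_t in rng.normal(size=(5, input_dim)):  # a sequence of 5 toy input vectors
    h, c = lstm_step(x_t, h, c, W, U, b)
```

The cell state `c_t` is updated additively (`f * c_prev + i * g`) rather than being passed through a repeated matrix multiplication, which is what lets gradients survive over longer spans than in a plain RNN.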

#### 3. Context Understanding in Toxicity Classification:

- **Context Sensitivity:** Toxicity in language often involves subtle nuances and context-dependent meanings. A model must understand the context of words in a sentence to accurately identify toxic elements.

- **Long-Term Dependencies:** LSTMs excel at capturing long-term dependencies, allowing them to consider the entire context of a sentence, even when the toxic elements are separated by several words.

#### 4. Enhanced Semantic Understanding:

- **Semantic Richness:** An LSTM combines the word embeddings of a sentence into a hidden state that carries rich semantic information. This enables the model to learn not just from individual words but also from the relationships between them.

#### 5. Adaptability to Variable-Length Sequences:

- **Handling Varied Lengths:** Comments or sentences in toxicity classification datasets can have varying lengths. LSTMs can handle variable-length sequences because they process input one step at a time and produce a fixed-size output regardless of input length. In this notebook, sequences are additionally padded or truncated to a common length (`max_len`) so that they can be batched.
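A minimal sketch of how sequences of different lengths can be brought to a common length before batching. It assumes the same behaviour as the Keras `pad_sequences` defaults (zeros added at the front, long sequences cut from the front); the helper name `pad_or_truncate` and the toy sequences are hypothetical.

```python
def pad_or_truncate(seq, max_len, pad_value=0):
    """Left-pad short sequences and keep the last max_len items of long ones."""
    if len(seq) >= max_len:
        return seq[-max_len:]
    return [pad_value] * (max_len - len(seq)) + seq


sequences = [[5, 8], [3, 1, 4, 1, 5, 9], [7]]  # toy token-id sequences
padded = [pad_or_truncate(s, max_len=4) for s in sequences]
print(padded)
```

Every row now has length 4, so the batch can be stored as a rectangular array.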

#### Conclusion:

In summary, the choice of LSTM for toxicity classification is grounded in its ability to overcome the limitations of regular RNNs, capture long-term dependencies, understand context, and handle variable-length sequences. The sequential nature of language is effectively modeled, making LSTMs a suitable choice for tasks where the order of words significantly influences the meaning, as is the case in toxicity classification.

## Imports

In [1]:
import re

import numpy as np
import pandas as pd

import nltk
from nltk.corpus import stopwords

from sklearn.model_selection import train_test_split

import gensim.downloader as api
from gensim.models.word2vec import Word2Vec

from gpt4all import Embed4All

import torch
import torch.nn as nn

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense
from sklearn.metrics import confusion_matrix, classification_report
from tensorflow.keras.utils import to_categorical





## Preprocessing

### Preprocessing Approach Explanation:

#### 1. Dataset Loading:

- **Test and Train Data:** Two datasets, test_data and train_data, are loaded from CSV files (data/test.csv and data/train.csv respectively).
- **Test Labels:** Test labels are loaded from a separate CSV file (data/test_labels.csv).

#### 2. Filtering Train Data:

- **Conditions:** The train_data is filtered based on specific conditions for each toxicity label. Rows are retained only if all toxicity labels (toxic, severe_toxic, obscene, threat, insult, identity_hate) have valid values (not equal to -1).

#### 3. Text Normalization Functions:

- **Lowercasing (lowercase):** Converts text to lowercase, ensuring uniformity and reducing the vocabulary size.
- **Removing Punctuation (remove_punctuation):** Eliminates punctuation, reducing noise in the text.
- **Removing Stopwords (remove_stopwords):** Removes common English stopwords, reducing dimensionality and focusing on informative words.
- **Removing Numbers (remove_numbers):** Excludes numerical values, as they may not contribute significantly to toxicity classification.
- **Removing URLs (remove_url):** Eliminates URLs, as they often do not contribute to the semantic meaning of the text.

#### 4. Normalization Pipeline (normalize_sentence):

- **Integration of Functions:** The normalize_sentence function chains the above preprocessing functions into a single text normalization pipeline, applied in this order: lowercasing, punctuation removal, stopword removal, number removal, and URL removal.
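The order of the steps in such a pipeline matters. The sketch below uses the same two regular expressions as `remove_punctuation` and `remove_url` on a made-up sentence (the sentence and variable names are hypothetical) to show what happens when punctuation is stripped before or after URL removal.

```python
import re

URL_PATTERN = re.compile(r'https?://\S+|www\.\S+')
PUNCT_PATTERN = re.compile(r'[^\w\s]')

text = "see https://example.com/page for details"

# Punctuation first: "://", "." and "/" are stripped, so the URL pattern
# no longer finds anything and the link survives as one fused token.
punct_first = URL_PATTERN.sub('', PUNCT_PATTERN.sub('', text))

# URL first: the whole link is removed before punctuation is stripped.
url_first = PUNCT_PATTERN.sub('', URL_PATTERN.sub('', text))
url_first = ' '.join(url_first.split())  # collapse the leftover double space

print(punct_first)
print(url_first)
```

If URLs should be removed completely, running the URL step before the punctuation step is the safer ordering. A similar consideration applies to stopwords that contain apostrophes.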

#### 5. Applying Normalization to Train Data:

- **Normalization of Comment Texts:** The comment_text column in the train_data is transformed using the normalize_sentence function.
- **Length Filter:** Rows are further filtered based on the length of the normalized comment text. Only comments with a length greater than 20 characters are retained.

#### 6. Train Data and Labels:

- **Conversion to NumPy Arrays:** The comment_text column is converted to a NumPy array for model input.
- **Labels Extraction:** The toxicity labels are extracted from the filtered train_data and converted to a NumPy array for model training.

### Why This Preprocessing Works:

1. **Noise Reduction:**
   - The combination of lowercase conversion, punctuation removal, and stopword elimination reduces noise in the text, helping the model focus on meaningful words.

2. **Dimensionality Reduction:**
   - Removing numbers and stopwords reduces the dimensionality of the data, making it more manageable and improving model efficiency.

3. **Normalization for Consistency:**
   - Normalizing the text ensures consistency in representation, allowing the model to learn more effectively across diverse comments.

4. **Handling URLs:**
   - Removing URLs is beneficial as they typically do not contribute much to the meaning of the text and can introduce noise.

5. **Length Filtering:**
   - Filtering out short comments (20 characters or fewer after normalization) ensures that the model is trained on comments with sufficient context, improving its ability to understand and classify toxicity in longer sentences.

6. **Label Filtering:**
   - Filtering train_data based on the availability of valid toxicity labels ensures that the model is trained on relevant and annotated data, avoiding potential issues with missing or incomplete labels.

In summary, this preprocessing approach is designed to clean and standardize the text data, reduce noise, and focus on informative features. These steps aim to enhance the model's ability to learn meaningful patterns and improve its performance in toxicity classification.

In [17]:
test_data = pd.read_csv('data/test.csv')
test_labels = pd.read_csv('data/test_labels.csv')

label_cols = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']

train_data = pd.read_csv('data/train.csv')
# Keep only rows where every toxicity label is valid (not equal to -1)
train_data = train_data[(train_data[label_cols] != -1).all(axis=1)]

In [18]:
# nltk.download('stopwords')  # uncomment on first run to fetch the stopword list

def lowercase(txt):
    return txt.lower()

def remove_punctuation(txt):
    # Drop every character that is neither a word character nor whitespace
    return re.sub(r'[^\w\s]', '', txt)

def remove_stopwords(txt):
    stop_words = set(stopwords.words('english'))
    return ' '.join(x for x in txt.split() if x not in stop_words)

def remove_numbers(txt):
    # Keep purely alphabetic tokens; tokens containing digits or underscores are dropped
    return ' '.join(x for x in txt.split() if x.isalpha())

def remove_url(txt):
    url = re.compile(r'https?://\S+|www\.\S+')
    return url.sub('', txt)

def normalize_sentence(txt):
    '''
    Aggregates all the above functions to normalize/clean a sentence
    '''
    txt = lowercase(txt)
    txt = remove_punctuation(txt)
    txt = remove_stopwords(txt)
    txt = remove_numbers(txt)
    # Punctuation is already stripped at this point, so the URL pattern
    # (which needs "://" or "www.") no longer matches anything here
    txt = remove_url(txt)
    return txt

train_data['comment_text'] = train_data['comment_text'].apply(normalize_sentence)
train_data = train_data[train_data['comment_text'].apply(lambda x: len(x) > 20)]

train_labels = np.array(train_data[["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]])
train_data = np.array(train_data["comment_text"])

In [19]:
X = train_data
y = train_labels

print(X)

# Split the dataset into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# Tokenize and pad sequences
max_words = 10000  # Define the maximum number of words in your vocabulary
max_len = 100  # Define the maximum length of sequences

tokenizer = Tokenizer(num_words=max_words)
tokenizer.fit_on_texts(X_train)

X_train_seq = tokenizer.texts_to_sequences(X_train)
X_val_seq = tokenizer.texts_to_sequences(X_val)

X_train_padded = pad_sequences(X_train_seq, maxlen=max_len)
X_val_padded = pad_sequences(X_val_seq, maxlen=max_len)

['explanation edits made username hardcore metallica fan reverted werent vandalisms closure gas voted new york dolls fac please dont remove template talk page since im retired'
 'daww matches background colour im seemingly stuck thanks talk january utc'
 'hey man im really trying edit war guy constantly removing relevant information talking edits instead talk page seems care formatting actual info'
 ... 'spitzer umm theres actual article prostitution ring crunch captain'
 'looks like actually put speedy first version deleted look'
 'really dont think understand came idea bad right away kind community goes bad ideas go away instead helping rewrite']
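The tokenization step in the cell above maps each word to an integer rank based on how often it occurs in the training texts. The following is a simplified sketch of that idea in plain Python, not the Keras `Tokenizer` itself; the helper names and the three-sentence corpus are hypothetical, and details such as tie-breaking and text filtering may differ from the library.

```python
from collections import Counter


def build_word_index(texts):
    counts = Counter(word for text in texts for word in text.split())
    # Rank 1 is the most frequent word; index 0 is left free for padding.
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}


def texts_to_sequences(texts, word_index, num_words):
    # Keep only words whose rank is below num_words; unknown or rarer words are dropped.
    return [
        [word_index[w] for w in text.split()
         if w in word_index and word_index[w] < num_words]
        for text in texts
    ]


corpus = ["talk page edits", "talk page", "talk"]
word_index = build_word_index(corpus)
sequences = texts_to_sequences(["talk edits unknown page"], word_index, num_words=3)
print(word_index)
print(sequences)
```

Fitting the index on the training split only, as the notebook does, keeps validation and test vocabulary from leaking into the model's input encoding.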


### Model Architecture and Training Parameter Choices:

#### 1. Embedding Dimension (`embedding_dim = 100`):

- **Decision:** The embedding dimension is set to 100.
- **Explanation:** The embedding dimension determines the size of the word vectors learned by the model. A relatively low value of 100 is chosen to limit the dimensionality of the word representations. This is a trade-off between capturing rich semantic information and managing model complexity, and a less complex model was sufficient in this case. In our experiments, higher dimensions increased the error as well as the validation and test loss.

#### 2. Model Architecture:

- **Decision:** Sequential model architecture with an Embedding layer, an LSTM layer, and a Dense output layer.
- **Explanation:** This architecture is suitable for sequential data processing: the LSTM captures long-term dependencies, and the Dense layer produces one independent binary output per toxicity label.

#### 3. LSTM Layer Configuration (`LSTM(50)`):

- **Decision:** LSTM layer with 50 units.
- **Explanation:** The number of LSTM units determines the capacity of the layer to capture sequential patterns. A choice of 50 units strikes a balance between model capacity and avoiding overfitting. A higher value overfitted the data and increased the test loss.
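To put these layer sizes in perspective, the number of trainable parameters can be estimated by hand. The sketch below uses the values from this notebook (`max_words = 10000`, `embedding_dim = 100`, `LSTM(50)`, `Dense(6)`) and assumes the standard LSTM layout with four weight blocks and one bias vector per block; the helper function names are hypothetical.

```python
def embedding_params(vocab_size, embedding_dim):
    # One vector of size embedding_dim per vocabulary entry
    return vocab_size * embedding_dim


def lstm_params(input_dim, units):
    # Four blocks (input, forget, candidate, output), each with
    # input weights, recurrent weights and a bias
    return 4 * (units * (input_dim + units) + units)


def dense_params(input_dim, units):
    # Weight matrix plus one bias per output unit
    return input_dim * units + units


emb = embedding_params(10000, 100)
lstm = lstm_params(100, 50)
dense = dense_params(50, 6)
total = emb + lstm + dense
print(emb, lstm, dense, total)
```

Under these assumptions almost all parameters sit in the embedding table, so the embedding dimension influences model size far more than the number of LSTM units does.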

#### 4. Dense Output Layer (`Dense(6, activation='sigmoid')`):

- **Decision:** Dense output layer with 6 units (for each toxicity type) and sigmoid activation.
- **Explanation:** The output layer is designed for binary classification of each toxicity type independently. Sigmoid activation squashes output values to the range [0, 1], allowing the model to predict the probability of the presence of each toxicity type.

#### 5. Model Compilation:

- **Decision:** Adam optimizer, binary cross-entropy loss, and accuracy metric.
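- **Explanation:** The Adam optimizer is chosen for its adaptive learning rate capabilities. Binary cross-entropy is suitable because each of the six outputs is an independent binary decision, and accuracy is tracked as the evaluation metric. Because most comments carry no toxic label, accuracy alone can look high, so the per-label precision and recall in the Metrics section give a fuller picture.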
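As a rough illustration of how the sigmoid output and the binary cross-entropy loss fit together for one comment with six labels, consider the sketch below. The logits and labels are made-up values, and the clipping constant `eps` is an assumption for numerical safety; this is not the Keras implementation.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy over independent labels."""
    total = 0.0
    for t, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / len(y_true)


# Hypothetical raw scores for: toxic, severe_toxic, obscene, threat, insult, identity_hate
logits = [2.0, -3.0, 1.0, -4.0, 0.5, -2.0]
y_true = [1, 0, 1, 0, 1, 0]

y_prob = [sigmoid(z) for z in logits]    # one probability per label
loss = binary_cross_entropy(y_true, y_prob)
y_pred = [int(p > 0.5) for p in y_prob]  # same 0.5 threshold as in the Metrics section
print(loss, y_pred)
```

Each label is scored on its own, so a comment can be flagged as both `toxic` and `insult` at the same time, which a softmax output would not allow.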

#### 6. Training Parameters (`epochs = 4`, `batch_size = 64`):

- **Decision:** Training for 4 epochs with a batch size of 64.
- **Explanation:** The number of epochs determines how many times the model iterates over the entire training dataset. A low value of 4 is chosen to keep training fast: higher values greatly increased computation time and, for some batch sizes, also increased the loss. The batch size of 64 balances computational efficiency against stable gradient updates during training. A smaller batch size overfitted the data and greatly increased computation time, while there was no need for a larger one, as 64 seemed sufficient for the complexity of the task.

#### 7. Model Training:

- **Training Data and Validation Data:** The model is trained on `X_train_padded` and `y_train`, and is evaluated after each epoch on the held-out validation set passed via `validation_data=(X_val_padded, y_val)`.

#### 8. Evaluation:

- **Validation Loss and Accuracy:** The model's performance is evaluated on the validation set, and the loss and accuracy metrics are printed.

### Why These Choices:

- **Embedding Dimension:**
  - Lower embedding dimension reduces model complexity, potentially improving efficiency.

- **LSTM Layer Units:**
  - 50 LSTM units strike a balance between capturing sequential patterns and avoiding overfitting.

- **Dense Output Layer:**
  - 6 units with sigmoid activation are chosen for binary classification of each toxicity type independently.
  - Sigmoid activation is suitable for binary classification tasks.

- **Training Parameters:**
  - 4 epochs are chosen to balance model learning and training time.
  - Batch size of 64 balances computational efficiency and stable training dynamics.

- **Model Compilation:**
  - Adam optimizer adapts the learning rate during training.
  - Binary cross-entropy loss suits binary classification.
  - Accuracy is a simple summary metric, complemented by the per-label classification reports in the Metrics section.

These choices aim to strike a balance between model complexity, computational efficiency, and the ability to capture sequential dependencies in text data. Fine-tuning hyperparameters and monitoring model performance during training and validation can further optimize these choices based on specific dataset characteristics.
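One caution on the accuracy metric: when a label is rare, accuracy can be high even for a model that never predicts it. The toy example below uses hypothetical data (10 positives among 1000 comments), not figures from this dataset.

```python
# Hypothetical label column: 10 positive comments among 1000
y_true = [1] * 10 + [0] * 990
y_pred = [0] * 1000  # a "model" that never flags anything

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
recall = true_positives / sum(y_true)

print(f"accuracy={accuracy:.2f}, recall={recall:.2f}")
```

This is why per-label precision, recall, and F1 are worth checking alongside accuracy for rare labels such as `threat`.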

In [20]:
embedding_dim = 100  # Choose the dimension of your word embeddings

model = Sequential()
model.add(Embedding(input_dim=max_words, output_dim=embedding_dim, input_length=max_len))
model.add(LSTM(50))  # Adjust the number of LSTM units as needed
model.add(Dense(6, activation='sigmoid'))  # Output layer with 6 units (for each toxicity type)

model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

In [21]:
epochs = 4  # Choose the number of epochs
batch_size = 64  # Choose the batch size

model.fit(X_train_padded, y_train, epochs=epochs, batch_size=batch_size, validation_data=(X_val_padded, y_val))

loss, accuracy = model.evaluate(X_val_padded, y_val)
print(f'Validation Loss: {loss}, Validation Accuracy: {accuracy}')


Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4
Validation Loss: 0.053878527134656906, Validation Accuracy: 0.972442626953125


In [22]:
test_data

Unnamed: 0,id,comment_text
0,00001cee341fdb12,Yo bitch Ja Rule is more succesful then you'll...
1,0000247867823ef7,== From RfC == \n\n The title is fine as it is...
2,00013b17ad220c46,""" \n\n == Sources == \n\n * Zawe Ashton on Lap..."
3,00017563c3f7919a,":If you have a look back at the source, the in..."
4,00017695ad8997eb,I don't anonymously edit articles at all.
...,...,...
153159,fffcd0960ee309b5,". \n i totally agree, this stuff is nothing bu..."
153160,fffd7a9a6eb32c16,== Throw from out field to home plate. == \n\n...
153161,fffda9e8d6fafa9e,""" \n\n == Okinotorishima categories == \n\n I ..."
153162,fffe8f1340a79fc2,""" \n\n == """"One of the founding nations of the..."


In [23]:
# Attach the labels column-wise (rows are matched by index)
test_data = pd.concat([test_data, test_labels], axis=1)

# Keep only rows where every toxicity label is valid (not equal to -1)
label_cols = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]
test_data = test_data[(test_data[label_cols] != -1).all(axis=1)].copy()

# Apply the same normalization and length filter as for the training data
test_data['comment_text'] = test_data['comment_text'].apply(normalize_sentence)
test_data = test_data[test_data['comment_text'].apply(lambda x: len(x) > 20)]

test_labels = np.array(test_data[label_cols])
test_data = np.array(test_data["comment_text"])


In [24]:
X_test = test_data
X_test_seq = tokenizer.texts_to_sequences(X_test)
X_test_padded = pad_sequences(X_test_seq, maxlen=max_len)



In [28]:
test_labels

array([[0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0],
       ...,
       [0, 0, 0, 0, 0, 0],
       [1, 0, 1, 0, 1, 0],
       [0, 0, 0, 0, 0, 0]], dtype=int64)

In [29]:
test_loss, test_accuracy = model.evaluate(X_test_padded, test_labels)

print(f'Test Loss: {test_loss}, Test Accuracy: {test_accuracy}')


Test Loss: 0.07164348661899567, Test Accuracy: 0.9669210910797119


## Metrics

In [30]:
from sklearn.metrics import classification_report

# Get model predictions on validation set
y_val_pred_prob = model.predict(X_val_padded)

# Convert probabilities to binary predictions (0 or 1)
y_val_pred = (y_val_pred_prob > 0.5).astype(int)

# Generate classification report
class_names = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]
classification_rep = classification_report(y_val, y_val_pred, target_names=class_names)

print("Classification Report on Validation Set:\n", classification_rep)

Classification Report on Validation Set:
                precision    recall  f1-score   support

        toxic       0.84      0.69      0.76      2805
 severe_toxic       0.57      0.14      0.23       277
      obscene       0.84      0.74      0.79      1513
       threat       1.00      0.04      0.09        90
       insult       0.74      0.58      0.65      1429
identity_hate       0.61      0.16      0.25       276

    micro avg       0.81      0.62      0.71      6390
    macro avg       0.77      0.39      0.46      6390
 weighted avg       0.80      0.62      0.69      6390
  samples avg       0.06      0.06      0.06      6390
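The summary rows of the report can be related back to the per-label rows. The sketch below recomputes the `toxic` F1 score and the macro and weighted F1 averages from the rounded values printed in the validation report above, so the results are approximate; the variable names are illustrative.

```python
# Rounded values copied from the validation report above
f1_scores = [0.76, 0.23, 0.79, 0.09, 0.65, 0.25]
support = [2805, 277, 1513, 90, 1429, 276]

# F1 is the harmonic mean of precision and recall (shown here for "toxic")
precision, recall = 0.84, 0.69
f1_toxic = 2 * precision * recall / (precision + recall)

# Macro average: every label counts equally
macro_f1 = sum(f1_scores) / len(f1_scores)

# Weighted average: each label is weighted by its number of true instances
weighted_f1 = sum(f * s for f, s in zip(f1_scores, support)) / sum(support)

print(round(f1_toxic, 2), round(macro_f1, 2), round(weighted_f1, 2))
```

The macro average is pulled down by the rare labels (`threat`, `severe_toxic`, `identity_hate`), while the weighted average is dominated by `toxic`, `obscene`, and `insult`.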





In [34]:
from sklearn.metrics import classification_report

# Get model predictions on test set
y_test_pred_prob = model.predict(X_test_padded)

# Convert probabilities to binary predictions (0 or 1)
y_test_pred = (y_test_pred_prob > 0.5).astype(int)

# Generate classification report
class_names = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]
classification_rep_test = classification_report(test_labels, y_test_pred, target_names=class_names)

print("Classification Report on Test Set:\n", classification_rep_test)

Classification Report on Test Set:
                precision    recall  f1-score   support

        toxic       0.56      0.78      0.65      5137
 severe_toxic       0.42      0.26      0.32       337
      obscene       0.61      0.72      0.66      3149
       threat       0.27      0.02      0.04       178
       insult       0.64      0.58      0.61      2942
identity_hate       0.63      0.18      0.28       608

    micro avg       0.59      0.66      0.62     12351
    macro avg       0.52      0.42      0.43     12351
 weighted avg       0.59      0.66      0.61     12351
  samples avg       0.06      0.06      0.06     12351



