<a href="https://colab.research.google.com/github/pragyamishraa517/Hate-Speech-Classification/blob/main/Hate_Speech_Classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Step 1: Install and Import Necessary Libraries**

kagglehub is a library for downloading Kaggle datasets directly; it is installed and imported in Step 2.

Here, we import the following libraries:

1. NumPy and Pandas for data handling.

2. TensorFlow and its Keras API for building and training the neural network.

3. scikit-learn (sklearn) for splitting the dataset and generating a classification report.

4. Tokenizer and pad_sequences for text preprocessing.


In [None]:
import numpy as np
import pandas as pd
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout, Bidirectional
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report


**Step 2: Download and Load the Dataset Using kagglehub**

We use kagglehub to download the **Hate Speech and Offensive Language Dataset** through Kaggle's API.

dataset_path stores the local directory where the dataset is saved.

We search that directory for the CSV file and load it into a pandas DataFrame, df.

We then display the first few rows as a preview of the data's structure and columns.

In [None]:
!pip install kagglehub




In [None]:
import kagglehub
import os
import pandas as pd

# Download the dataset
dataset_path = kagglehub.dataset_download("mrmorj/hate-speech-and-offensive-language-dataset")

print("Path to dataset files:", dataset_path)

# Search the downloaded directory tree for the first CSV file
dataset_file_path = None  # stays None if no CSV is found, so the check below cannot raise NameError
for root, _, files in os.walk(dataset_path):
    csv_files = [f for f in files if f.endswith(".csv")]  # Assuming the dataset file is a CSV
    if csv_files:
        dataset_file_path = os.path.join(root, csv_files[0])
        break  # Stop searching once found in any directory

# Check if the dataset file was found
if dataset_file_path:
    # Load the CSV file into a pandas DataFrame
    df = pd.read_csv(dataset_file_path)

    # Display the first few rows to understand the dataset structure
    print(df.head())
else:
    print("Dataset file not found within the downloaded directory.")

Path to dataset files: /root/.cache/kagglehub/datasets/mrmorj/hate-speech-and-offensive-language-dataset/versions/1
   Unnamed: 0  count  hate_speech  offensive_language  neither  class  \
0           0      3            0                   0        3      2   
1           1      3            0                   3        0      1   
2           2      3            0                   3        0      1   
3           3      3            0                   2        1      1   
4           4      6            0                   6        0      1   

                                               tweet  
0  !!! RT @mayasolovely: As a woman you shouldn't...  
1  !!!!! RT @mleew17: boy dats cold...tyga dwn ba...  
2  !!!!!!! RT @UrKindOfBrand Dawg!!!! RT @80sbaby...  
3  !!!!!!!!! RT @C_G_Anderson: @viva_based she lo...  
4  !!!!!!!!!!!!! RT @ShenikaRoberts: The shit you...  


**Step 3: Data Preprocessing**

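**Map Labels to Binary Format**: In this dataset, class is 0 for hate speech, 1 for offensive language, and 2 for neither (the preview above shows class 2 on the row where all three annotators chose neither). We map it to a binary label, where 0 indicates non-offensive and 1 indicates offensive or hate speech.

The label must therefore be 0 only when class is 2. A test such as class > 0 would not work here: it would group class 2 (neither) with class 1 (offensive language) and leave hate speech as the only 0 label.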

**Tokenize and Pad Text Data:** Here, we separate the tweet text (tweet column) into X and the label (label column) into y for model training.

vocab_size sets the maximum number of unique words in our vocabulary.

max_length defines the maximum number of words in each text sample. Longer texts will be truncated.

oov_token handles words not in the vocabulary.

Tokenizer is initialized with vocab_size and oov_token.

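fit_on_texts(X) builds the vocabulary from the tweets, assigning each word an integer ID by frequency (the most frequent words get the lowest IDs).

word_index holds the mapping of every word seen to its unique integer ID. The vocab_size limit is applied later, when texts are converted to sequences.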

texts_to_sequences(X) converts each text to a sequence of integers where each integer represents a word from the vocabulary.

pad_sequences ensures each sequence has the same length (max_length). Shorter sequences are padded with zeros, and longer ones are truncated at the end.
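
To make these steps concrete, here is a minimal pure-Python sketch of the same pipeline on two toy sentences. It imitates the behaviour described above (frequency-ranked IDs, an OOV slot, padding and truncation at the end) but it is not Keras's implementation; the helper names build_word_index, to_sequence and pad_post are made up for this illustration.

```python
from collections import Counter

def build_word_index(texts, oov_token="<OOV>"):
    """Assign integer IDs by descending frequency; ID 1 is reserved for OOV."""
    counts = Counter(word for text in texts for word in text.lower().split())
    word_index = {oov_token: 1}
    for rank, (word, _) in enumerate(counts.most_common(), start=2):
        word_index[word] = rank
    return word_index

def to_sequence(text, word_index, vocab_size, oov_token="<OOV>"):
    """Map words to IDs; unknown words and IDs >= vocab_size become OOV."""
    oov_id = word_index[oov_token]
    ids = []
    for word in text.lower().split():
        idx = word_index.get(word, oov_id)
        ids.append(idx if idx < vocab_size else oov_id)
    return ids

def pad_post(seq, max_length):
    """Truncate at the end, then pad with zeros at the end."""
    seq = seq[:max_length]
    return seq + [0] * (max_length - len(seq))

texts = ["the cat sat", "the dog sat on the mat"]
word_index = build_word_index(texts)
padded = [pad_post(to_sequence(t, word_index, vocab_size=5), max_length=4)
          for t in texts]
print(word_index)
print(padded)
```

The first sentence is shorter than max_length and gets a trailing zero; the second is longer and loses its last words, and its rarer words fall back to the OOV ID.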

In [None]:
# Convert labels into binary (0 for non-offensive, 1 for offensive/hate).
# Dataset encoding: class 0 = hate speech, 1 = offensive language, 2 = neither
df['label'] = df['class'].apply(lambda x: 0 if x == 2 else 1)

# Splitting dataset into text (X) and labels (y)
X = df['tweet'].values
y = df['label'].values

# Tokenize the text
vocab_size = 10000  # Vocabulary size
max_length = 50     # Max length of each tweet
oov_token = "<OOV>"

# Initialize Tokenizer
tokenizer = Tokenizer(num_words=vocab_size, oov_token=oov_token)
tokenizer.fit_on_texts(X)
word_index = tokenizer.word_index

# Convert texts to sequences
X_sequences = tokenizer.texts_to_sequences(X)

# Pad sequences to ensure uniform input size
X_padded = pad_sequences(X_sequences, maxlen=max_length, padding='post', truncating='post')


We use train_test_split to split X_padded and y into training and test sets.
test_size=0.2 means 20% of the data is for testing.
random_state=42 ensures the split is consistent each time you run the code.

In [None]:
# Splitting the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X_padded, y, test_size=0.2, random_state=42)
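
Because the classes in this dataset are heavily imbalanced, it can help to keep the class proportions identical in both splits. The following is a small pure-Python sketch of a stratified, seeded split on toy labels; the function name stratified_split and the toy data are assumptions for illustration, not part of the notebook.

```python
import random

def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_idx, test_idx) with each class split in the same proportion."""
    rng = random.Random(seed)  # fixed seed -> the same split on every run
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

labels = [1] * 90 + [0] * 10  # toy imbalanced labels
train_idx, test_idx = stratified_split(labels)
test_minority = sum(1 for i in test_idx if labels[i] == 0)
print(len(train_idx), len(test_idx), test_minority)
```

With scikit-learn, the same effect is available by passing stratify=y to train_test_split.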


In [None]:
# Set class weights so the rarer class contributes more to the loss
from sklearn.utils import class_weight

classes = np.unique(y_train)
class_weights = class_weight.compute_class_weight('balanced', classes=classes, y=y_train)
# Map each class label to its weight, using plain int keys
class_weights_dict = {int(c): float(w) for c, w in zip(classes, class_weights)}

# Train model with class weights (model, epochs and batch_size come from the model-definition and training cells)
history = model.fit(X_train, y_train, epochs=epochs, batch_size=batch_size, validation_data=(X_test, y_test), class_weight=class_weights_dict, verbose=2)


Epoch 1/10
310/310 - 59s - 192ms/step - accuracy: 0.9778 - loss: 0.0710 - val_accuracy: 0.9116 - val_loss: 0.3472
Epoch 2/10
310/310 - 74s - 240ms/step - accuracy: 0.9837 - loss: 0.0485 - val_accuracy: 0.8874 - val_loss: 0.5389
Epoch 3/10
310/310 - 82s - 266ms/step - accuracy: 0.9895 - loss: 0.0266 - val_accuracy: 0.9112 - val_loss: 0.5892
Epoch 4/10
310/310 - 81s - 260ms/step - accuracy: 0.9856 - loss: 0.0371 - val_accuracy: 0.8443 - val_loss: 0.8445
Epoch 5/10
310/310 - 84s - 270ms/step - accuracy: 0.9878 - loss: 0.0268 - val_accuracy: 0.8977 - val_loss: 0.6393
Epoch 6/10
310/310 - 83s - 267ms/step - accuracy: 0.9908 - loss: 0.0231 - val_accuracy: 0.8899 - val_loss: 0.6934
Epoch 7/10
310/310 - 45s - 146ms/step - accuracy: 0.9929 - loss: 0.0155 - val_accuracy: 0.9044 - val_loss: 0.7825
Epoch 8/10
310/310 - 92s - 296ms/step - accuracy: 0.9866 - loss: 0.0318 - val_accuracy: 0.9131 - val_loss: 0.5649
Epoch 9/10
310/310 - 75s - 242ms/step - accuracy: 0.9891 - loss: 0.0249 - val_accuracy: 
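
For reference, the 'balanced' option of compute_class_weight gives each class the weight n_samples / (n_classes * class_count). The sketch below reproduces that formula in pure Python on toy labels with 290 examples of one class and 4667 of the other, the supports shown in this notebook's classification reports; the helper name balanced_class_weights is an assumption for illustration.

```python
from collections import Counter

def balanced_class_weights(labels):
    """weight[c] = n_samples / (n_classes * count[c])"""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count)
            for cls, count in sorted(counts.items())}

y_toy = [0] * 290 + [1] * 4667  # toy labels, not the real training split
weights = balanced_class_weights(y_toy)
for cls, weight in weights.items():
    print(f"class {cls}: weight {weight:.3f}")
```

The minority class receives a weight well above 1 and the majority class a weight below 1, so both classes contribute equally to the total weighted loss.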

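**Threshold for Prediction**

The default 0.5 threshold for converting probabilities to binary class labels may not suit this dataset, because the classes are heavily imbalanced.
Solution: Evaluate candidate thresholds between 0.0 and 1.0 on a validation set and keep the one with the best precision/recall trade-off. Ideally this validation set is separate from the final test set, since a threshold chosen on the test data makes the test scores look better than they are.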
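
As a rough illustration of the selection step, the sketch below sweeps thresholds over made-up validation probabilities and keeps the one with the highest macro-averaged F1. The data, the helper names and the choice of macro F1 as the criterion are all assumptions for this example; another metric may suit a given application better.

```python
def f1_for_class(y_true, y_pred, cls):
    """F1 score treating `cls` as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def macro_f1(y_true, probs, threshold):
    """Average the per-class F1 so the rare class counts as much as the common one."""
    y_pred = [1 if p > threshold else 0 for p in probs]
    return (f1_for_class(y_true, y_pred, 0) + f1_for_class(y_true, y_pred, 1)) / 2

# Toy validation labels and predicted probabilities (illustrative only)
y_val = [0, 0, 0, 1, 1, 1, 1, 1]
probs = [0.2, 0.4, 0.7, 0.6, 0.8, 0.9, 0.55, 0.95]

thresholds = [i / 10 for i in range(1, 10)]
best_threshold = max(thresholds, key=lambda t: macro_f1(y_val, probs, t))
best_score = macro_f1(y_val, probs, best_threshold)
print(f"Best threshold: {best_threshold:.1f} (macro F1 = {best_score:.3f})")
```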

In [None]:
# Try several decision thresholds (the test split stands in for a validation set here)
for threshold in np.arange(0.1, 1.0, 0.1):
    y_val_pred = (model.predict(X_test) > threshold).astype(int)
    print(f"Threshold: {threshold}")
    print(classification_report(y_test, y_val_pred))


[1m155/155[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m4s[0m 24ms/step
Threshold: 0.1
              precision    recall  f1-score   support

           0       0.31      0.37      0.33       290
           1       0.96      0.95      0.95      4667

    accuracy                           0.91      4957
   macro avg       0.63      0.66      0.64      4957
weighted avg       0.92      0.91      0.92      4957

[1m155/155[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m6s[0m 36ms/step
Threshold: 0.2
              precision    recall  f1-score   support

           0       0.30      0.38      0.33       290
           1       0.96      0.94      0.95      4667

    accuracy                           0.91      4957
   macro avg       0.63      0.66      0.64      4957
weighted avg       0.92      0.91      0.92      4957

[1m155/155[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m4s[0m 24ms/step
Threshold: 0.30000000000000004
              precision    recall  f1-score   support

  

model.summary() provides a summary of the model’s architecture, layers, and parameters.

In [None]:
import re

# Basic text preprocessing function
def preprocess_text(text):
    text = text.lower()  # Convert to lowercase
    text = re.sub(r'[^a-z\s]', '', text)  # Keep only letters and whitespace (drops punctuation, digits, emoji)
    return text

# Apply preprocessing (run this before the tokenization in Step 3 so that it affects X)
df['tweet'] = df['tweet'].apply(preprocess_text)


**Hyperparameter Tuning**

The model’s hyperparameters (e.g., number of LSTM layers, dropout rates, embedding dimension) might not be optimal for this dataset.
Solution: Experiment with different architectures, focusing on the number of LSTM layers, the hidden state size, and the dropout rates.

**Improving Data Preprocessing**

If the input text is not cleaned well (e.g. punctuation left in place, inconsistent casing), the model might struggle to learn patterns effectively.
Solution: Add more preprocessing steps, like converting to lowercase, removing punctuation, and filtering stopwords.
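
A possible extended cleaning function is sketched below. The stopword list is a tiny hand-picked set used only for illustration (a real run would likely use a fuller list), and the function name clean_tweet and the URL, mention and retweet rules are assumptions rather than steps taken in this notebook.

```python
import re

# Tiny illustrative stopword list (an assumption, not a standard list)
STOPWORDS = {"a", "an", "the", "is", "are", "to", "and", "of"}

def clean_tweet(text):
    text = text.lower()                          # normalize case
    text = re.sub(r"https?://\S+", " ", text)    # drop URLs
    text = re.sub(r"@\w+", " ", text)            # drop @mentions
    text = re.sub(r"\brt\b", " ", text)          # drop the retweet marker
    text = re.sub(r"[^a-z\s]", " ", text)        # keep letters and whitespace only
    tokens = [t for t in text.split() if t not in STOPWORDS]
    return " ".join(tokens)

print(clean_tweet("!!! RT @user123: Check THIS out http://t.co/abc the BEST day"))
```

Whether stopword removal helps is worth checking empirically, since short function words can carry signal in abusive language.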

**Increase Training Data or Use Transfer Learning**:
Sometimes the dataset may be too small or insufficiently varied to capture all the nuances between hate and non-hate speech.
Solution: Add more labeled data if possible, or explore transfer learning by using pre-trained embeddings like GloVe or word2vec in the embedding layer.

In [None]:
#Download the GloVe embeddings:
!wget http://nlp.stanford.edu/data/glove.6B.zip
!unzip glove.6B.zip

# Load GloVe embeddings and set up the embedding layer with them
embeddings_index = {}
with open('glove.6B.100d.txt', 'r', encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coef = np.array(values[1:], dtype='float32')
        embeddings_index[word] = coef

# Create an embedding matrix
embedding_matrix = np.zeros((vocab_size, 100))
for word, i in tokenizer.word_index.items():
    if i < vocab_size:
        embedding_vector = embeddings_index.get(word)
        if embedding_vector is not None:
            embedding_matrix[i] = embedding_vector

# Set up a frozen embedding layer initialized with the pre-trained vectors
embedding_layer = Embedding(vocab_size, 100, trainable=False,
                            embeddings_initializer=tf.keras.initializers.Constant(embedding_matrix))


--2024-11-11 04:41:44--  http://nlp.stanford.edu/data/glove.6B.zip
Resolving nlp.stanford.edu (nlp.stanford.edu)... 171.64.67.140
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://nlp.stanford.edu/data/glove.6B.zip [following]
--2024-11-11 04:41:44--  https://nlp.stanford.edu/data/glove.6B.zip
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://downloads.cs.stanford.edu/nlp/data/glove.6B.zip [following]
--2024-11-11 04:41:44--  https://downloads.cs.stanford.edu/nlp/data/glove.6B.zip
Resolving downloads.cs.stanford.edu (downloads.cs.stanford.edu)... 171.64.64.22
Connecting to downloads.cs.stanford.edu (downloads.cs.stanford.edu)|171.64.64.22|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 862182613 (822M) [application/zip]
Saving to: ‘glove.6B.zip’


202



1. **Embedding Layer:** Converts words (integer-encoded) to dense vector representations of length 64.
2. **Bidirectional LSTM:** An LSTM that processes data forwards and backwards to capture context from both directions.
3. **return_sequences=True:** Makes the first LSTM layer output its hidden state at every time step, so the next LSTM receives a full sequence rather than only the final state.
4. **Dropout Layers:** Regularize the model by randomly deactivating neurons, helping reduce overfitting.
5. **Dense Layer with ReLU Activation:** Fully connected layer with ReLU activation to add non-linearity.
6. **Output Layer with Sigmoid Activation:** Outputs a probability between 0 and 1 for binary classification.
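
As a rough check on these layer sizes, the parameter counts can be worked out by hand, assuming vocab_size = 10000 and the standard LSTM layout of four gates, each with input weights, recurrent weights and a bias. The helper lstm_params is a name introduced for this sketch.

```python
def lstm_params(input_dim, units):
    """Four gates, each with input weights, recurrent weights and a bias."""
    return 4 * ((input_dim + units + 1) * units)

vocab_size, embed_dim = 10000, 64

embedding = vocab_size * embed_dim          # one 64-d vector per word ID
bilstm_1 = 2 * lstm_params(embed_dim, 64)   # forward + backward LSTM(64)
bilstm_2 = 2 * lstm_params(2 * 64, 32)      # input is the 128-d concatenation
dense = (2 * 32) * 64 + 64                  # 64-d input -> 64 units, plus biases
output = 64 * 1 + 1                         # single sigmoid unit

total = embedding + bilstm_1 + bilstm_2 + dense + output
for name, count in [("embedding", embedding), ("bilstm_1", bilstm_1),
                    ("bilstm_2", bilstm_2), ("dense", dense), ("output", output)]:
    print(f"{name:<10} {count:>8,}")
print(f"{'total':<10} {total:>8,}")
```

Dropout layers add no parameters. Most of the weights sit in the embedding table, which is one reason frozen pre-trained embeddings can reduce overfitting.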

In [None]:
# Define the model architecture
model = Sequential([
    Embedding(vocab_size, 64, input_length=max_length),  # Embedding layer
    Bidirectional(LSTM(64, return_sequences=True)),      # Bidirectional LSTM for better context capture
    Dropout(0.5),                                        # Dropout for regularization
    Bidirectional(LSTM(32)),                             # Another LSTM layer
    Dense(64, activation='relu'),                        # Dense layer with ReLU activation
    Dropout(0.5),                                        # Dropout for regularization
    Dense(1, activation='sigmoid')                       # Output layer for binary classification
])

# Compile the model
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

# Display model summary
model.summary()


1. loss='binary_crossentropy' specifies a loss function for binary classification.
2. optimizer='adam' uses the Adam optimization algorithm.
3. metrics=['accuracy'] tracks accuracy during training and testing.

**Train the model**:
1. model.fit trains the model on the training set (X_train, y_train).
2. epochs=10 makes 10 full passes over the training data.
3. batch_size=64 is the number of samples processed before each weight update.
4. validation_data=(X_test, y_test) evaluates the model on the test set after each epoch; these samples are never used to update the weights.
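
The step counts printed during training and prediction follow from these settings. The sketch below assumes a test set of 4957 samples (the support shown in the classification reports) and estimates the training set as four times that, since the exact training size is not printed in this notebook.

```python
import math

n_test = 4957             # support reported in the classification reports
n_train_est = 4 * n_test  # rough estimate for an 80/20 split (assumption)

batch_size = 64           # used by model.fit in this notebook
predict_batch_size = 32   # Keras default batch size for model.predict

steps_per_epoch = math.ceil(n_train_est / batch_size)
predict_steps = math.ceil(n_test / predict_batch_size)
print(steps_per_epoch, predict_steps)
```

These correspond to the 310/310 progress counter in the training logs and the 155/155 counter printed by model.predict.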

In [None]:
# Train the model
epochs = 10
batch_size = 64

history = model.fit(X_train, y_train, epochs=epochs, batch_size=batch_size, validation_data=(X_test, y_test), verbose=2)


Epoch 1/10
310/310 - 66s - 214ms/step - accuracy: 0.9396 - loss: 0.2420 - val_accuracy: 0.9415 - val_loss: 0.2218
Epoch 2/10
310/310 - 58s - 188ms/step - accuracy: 0.9427 - loss: 0.1836 - val_accuracy: 0.9415 - val_loss: 0.1702
Epoch 3/10
310/310 - 80s - 258ms/step - accuracy: 0.9558 - loss: 0.1257 - val_accuracy: 0.9389 - val_loss: 0.1760
Epoch 4/10
310/310 - 53s - 170ms/step - accuracy: 0.9681 - loss: 0.0915 - val_accuracy: 0.9389 - val_loss: 0.2047
Epoch 5/10
310/310 - 83s - 268ms/step - accuracy: 0.9788 - loss: 0.0628 - val_accuracy: 0.9381 - val_loss: 0.2422
Epoch 6/10
310/310 - 55s - 178ms/step - accuracy: 0.9836 - loss: 0.0433 - val_accuracy: 0.9282 - val_loss: 0.3070
Epoch 7/10
310/310 - 81s - 263ms/step - accuracy: 0.9887 - loss: 0.0329 - val_accuracy: 0.9235 - val_loss: 0.3640
Epoch 8/10
310/310 - 53s - 170ms/step - accuracy: 0.9903 - loss: 0.0249 - val_accuracy: 0.9288 - val_loss: 0.3836
Epoch 9/10
310/310 - 86s - 278ms/step - accuracy: 0.9933 - loss: 0.0188 - val_accuracy: 
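
In this log the validation loss is lowest at epoch 2 and rises afterwards while training accuracy keeps climbing, which suggests overfitting. The sketch below replays the logged val_loss values through a simple early-stopping rule; the function early_stop and the patience of 2 are assumptions for illustration.

```python
# val_loss values copied from epochs 1-8 of the training log above
val_losses = [0.2218, 0.1702, 0.1760, 0.2047, 0.2422, 0.3070, 0.3640, 0.3836]

def early_stop(losses, patience=2):
    """Return (best_epoch, stop_epoch) for a patience-based stopping rule."""
    best_epoch, best_loss, wait = 0, float("inf"), 0
    for epoch, loss in enumerate(losses, start=1):
        if loss < best_loss:
            best_epoch, best_loss, wait = epoch, loss, 0
        else:
            wait += 1
            if wait >= patience:
                return best_epoch, epoch
    return best_epoch, len(losses)

best_epoch, stop_epoch = early_stop(val_losses)
print(f"Best epoch: {best_epoch}, training would stop after epoch {stop_epoch}")
```

In Keras the same rule is available as tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=2, restore_best_weights=True), passed to model.fit through the callbacks argument.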

In [None]:
# Evaluate the model on the test set
loss, accuracy = model.evaluate(X_test, y_test, verbose=0)
print(f"Test Accuracy: {accuracy * 100:.2f}%")

# Generate classification report for detailed performance analysis
y_pred = (model.predict(X_test) > 0.5).astype("int32")
print(classification_report(y_test, y_pred, target_names=['Non-Hate', 'Hate']))


Test Accuracy: 93.10%
[1m155/155[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m5s[0m 27ms/step
              precision    recall  f1-score   support

    Non-Hate       0.35      0.21      0.27       290
        Hate       0.95      0.98      0.96      4667

    accuracy                           0.93      4957
   macro avg       0.65      0.59      0.61      4957
weighted avg       0.92      0.93      0.92      4957
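
The 93% accuracy needs context. The short calculation below uses only the supports and F1 scores from the report above to compute the accuracy of always predicting the larger class, and to show how the macro and weighted averages are formed. The variable names are introduced for this sketch.

```python
# Values copied from the classification report above
support = {"Non-Hate": 290, "Hate": 4667}
f1 = {"Non-Hate": 0.27, "Hate": 0.96}

total = sum(support.values())

# Accuracy of a model that always predicts the larger class
majority_baseline = max(support.values()) / total

# Macro average: every class counts equally
macro_f1 = sum(f1.values()) / len(f1)

# Weighted average: classes count in proportion to their support
weighted_f1 = sum(f1[c] * support[c] for c in f1) / total

print(f"Majority baseline accuracy: {majority_baseline:.4f}")
print(f"Macro F1: {macro_f1:.3f}, weighted F1: {weighted_f1:.3f}")
```

The majority baseline is slightly higher than the reported test accuracy, so accuracy alone says little here; the macro average, which is pulled down by the weak minority-class F1, is the more informative figure.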



**Test the model on a few sample texts as a quick sanity check of its predictions.**

In [None]:
# Define sample tweets to test the model
sample_texts = [
    "you are the worst",
    "You're not such a great friend!",
    "This is absolutely disgusting and I hope you disappear.",
    "Have a horrible day and spread negativity!"
]

# Tokenize and pad the sample texts the same way as the training data
sample_sequences = tokenizer.texts_to_sequences(sample_texts)
sample_padded = pad_sequences(sample_sequences, maxlen=max_length, padding='post', truncating='post')

# Predict hate speech probabilities
predictions = model.predict(sample_padded)

# Display results
for i, text in enumerate(sample_texts):
    print(f"Text: {text}")
    print(f"Prediction (1 = Hate, 0 = Non-Hate): {'Hate' if predictions[i][0] > 0.5 else 'Non-Hate'}\n")


1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 33ms/step
Text: you are the worst
Prediction (1 = Hate, 0 = Non-Hate): Hate

Text: You're not such a great friend!
Prediction (1 = Hate, 0 = Non-Hate): Hate

Text: This is absolutely disgusting and I hope you disappear.
Prediction (1 = Hate, 0 = Non-Hate): Hate

Text: Have a horrible day and spread negativity!
Prediction (1 = Hate, 0 = Non-Hate): Hate


