## **Detecting Hate Speech on Twitter using NLP and Machine Learning**

#### **Project Overview**

**Twitter** is a massive social media platform where users can freely express their opinions. Unfortunately, some users spread hate speech, which can harm communities and individuals.

In this project, we aim to automatically detect hate speech in tweets using **Natural Language Processing** (NLP) and **Machine Learning** techniques.

We will:
- Clean and preprocess tweet text.
- Convert the text into numerical features using **TF-IDF**.
- Build a **Logistic Regression** model.
- Handle **class imbalance**.
- Perform **hyperparameter tuning** using **GridSearchCV** with **Stratified K-Fold** cross-validation.

## 📂 Dataset
**Columns:**
- `id` -> Unique identifier for the tweet.
- `label` -> 0 = Non-hate, 1 = Hate speech.
- `tweet` -> The actual tweet text.
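
Because class imbalance is handled later in the notebook, it can help to look at the label distribution first. The sketch below is only an illustration: `labels` is a small made-up list standing in for the real `label` column, and `class_distribution` is a hypothetical helper, not part of the notebook.

```python
from collections import Counter

# Made-up labels standing in for hate_speech['label'] (0 = non-hate, 1 = hate)
labels = [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0]

def class_distribution(labels):
    """Return {label: (count, share of all rows)} for each distinct label."""
    counts = Counter(labels)
    total = len(labels)
    return {label: (n, n / total) for label, n in sorted(counts.items())}

dist = class_distribution(labels)
for label, (n, share) in dist.items():
    print(f"label={label}: {n} tweets ({share:.1%})")
```

On the real data, the same counts could be obtained from the `label` column once the CSV is loaded.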

In [23]:
# Data handling
import pandas as pd
import numpy as np

# Text processing
import re
from nltk.tokenize import TweetTokenizer
from nltk.corpus import stopwords
from collections import Counter
import nltk

nltk.download('stopwords')
nltk.download('punkt')

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Machine Learning
from sklearn.model_selection import train_test_split, GridSearchCV, StratifiedKFold
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, recall_score, f1_score

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\KgomotsoMkhawane\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\KgomotsoMkhawane\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


#### **Basic Exploratory Data Analysis**

In [24]:
# Load Data

hate_speech = pd.read_csv('TwitterHate.csv')

print(hate_speech.head(20))
hate_speech.info()  # info() prints its summary directly and returns None

    id  label                                              tweet
0    1      0   @user when a father is dysfunctional and is s...
1    2      0  @user @user thanks for #lyft credit i can't us...
2    3      0                                bihday your majesty
3    4      0  #model   i love u take with u all the time in ...
4    5      0             factsguide: society now    #motivation
5    6      0  [2/2] huge fan fare and big talking before the...
6    7      0   @user camping tomorrow @user @user @user @use...
7    8      0  the next school year is the year for exams.ð...
8    9      0  we won!!! love the land!!! #allin #cavs #champ...
9   10      0   @user @user welcome here !  i'm   it's so #gr...
10  11      0   â #ireland consumer price index (mom) climb...
11  12      0  we are so selfish. #orlando #standwithorlando ...
12  13      0  i get to see my daddy today!!   #80days #getti...
13  14      1  @user #cnn calls #michigan middle school 'buil...
14  15      1  no comment

In [25]:
hate_speech.isna().sum()

id       0
label    0
tweet    0
dtype: int64

In [26]:
hate_speech.shape

(31962, 3)

#### **Convert Tweets to a List**

In [27]:
tweets = hate_speech['tweet'].tolist()

#### **Text Cleaning**

We will:
- Lowercase text
- Remove URLs
- Remove user handles (@username)
- Remove stopwords and redundant terms (amp, rt)
- Remove # but keep the word
- Remove single-character terms
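
The steps in this list, together with the URL and stop-word removal performed in the cell below, can be sketched with the standard library alone. This is a simplified stand-in, not the notebook's method: it swaps `TweetTokenizer` for a crude regular-expression tokenizer, and `STOP_WORDS` is a tiny hypothetical subset of NLTK's English stop-word list.

```python
import re

# Hypothetical stop-word subset for illustration; the notebook uses NLTK's full English list.
STOP_WORDS = {"when", "a", "is", "and", "so", "he", "his", "into"}
REDUNDANT = {"amp", "rt"}

def clean_tweet(text):
    """Apply the cleaning steps with plain regular expressions."""
    text = text.lower()                            # lowercase
    text = re.sub(r"http\S+|www\.\S+", "", text)   # drop URLs
    text = re.sub(r"@\w+", "", text)               # drop user handles
    text = text.replace("#", "")                   # drop '#' but keep the word
    tokens = re.findall(r"[a-z0-9']+", text)       # crude word tokenizer
    return [t for t in tokens
            if t not in STOP_WORDS and t not in REDUNDANT and len(t) > 1]

sample = " @user when a father is dysfunctional and is so selfish he drags his kids into his dysfunction.   #run"
print(clean_tweet(sample))
```

Unlike `TweetTokenizer`, this tokenizer discards punctuation and emoticons entirely, so its output will differ on tweets that contain them.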

In [37]:
# Display before cleaning
print("Sample tweet before cleaning:", tweets[0])

# Initialize Tokenizer
tokenizer = TweetTokenizer(
    preserve_case=False, # Lowercase text
    strip_handles=True, # Remove user handles
    reduce_len=True # Shorten runs of 3+ repeated characters to exactly 3 (e.g. "sooooo" -> "sooo")
)

# Stopwords
stop_words = set(stopwords.words('english'))
redundant_words = {'amp', 'rt'}

cleaned_tweets = []

for tweet in tweets:
    # Remove URLs
    tweet = re.sub(r'http\S+|www\S+', '', tweet)  # 'http\S+' already covers https links

    # Tokenize
    tokens = tokenizer.tokenize(tweet)

    # Remove stopwords, redundant words, hashtags, and single characters
    tokens = [re.sub(r'#', '', word) for word in tokens] # remove #
    tokens = [word for word in tokens if word not in stop_words and word not in redundant_words and len(word) > 1] # remove stopwords, redundant words, and single characters

    cleaned_tweets.append(tokens)

print("Sample cleaned tweet:", cleaned_tweets[0])

Sample tweet before cleaning:  @user when a father is dysfunctional and is so selfish he drags his kids into his dysfunction.   #run
Sample cleaned tweet: ['father', 'dysfunctional', 'selfish', 'drags', 'kids', 'dysfunction', 'run']


#### **Check Top 12 Most Common Words**

In [38]:
all_words = [word for tokens in cleaned_tweets for word in tokens]
word_freq = Counter(all_words)
print(word_freq.most_common(12))

[('...', 2808), ('love', 2748), ('day', 2276), ('happy', 1684), ('time', 1131), ('life', 1118), ('like', 1047), ('today', 1013), ('new', 994), ('thankful', 946), ('positive', 931), ('get', 917)]


#### **Prepare Data for Modelling**

`TfidfVectorizer` expects each document as a single string by default, so we join each token list back into one string before vectorizing.

In [30]:
cleaned_tweets = [" ".join(tokens) for tokens in cleaned_tweets]  # Join tokens into strings

X = cleaned_tweets
y = hate_speech['label']

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

#### **TF-IDF Vectorization**

In [31]:
vectorizer = TfidfVectorizer(max_features=5000, ngram_range=(1, 2), stop_words='english')
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)
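
To make the weighting concrete, here is a minimal pure-Python sketch of TF-IDF on a toy corpus. It assumes the smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, and L2 row normalization that scikit-learn documents as the `TfidfVectorizer` defaults; it ignores `max_features`, bigrams, and the library's own tokenizer. The `tfidf` helper and the three toy documents are hypothetical.

```python
import math
from collections import Counter

def tfidf(docs):
    """TF-IDF with smoothed IDF and L2 row normalization on whitespace tokens."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    vocab = sorted({tok for toks in tokenized for tok in toks})
    # Document frequency: in how many documents each term appears
    df = Counter(tok for toks in tokenized for tok in set(toks))
    idf = {tok: math.log((1 + n_docs) / (1 + df[tok])) + 1 for tok in vocab}
    rows = []
    for toks in tokenized:
        tf = Counter(toks)
        weights = [tf[tok] * idf[tok] for tok in vocab]
        norm = math.sqrt(sum(w * w for w in weights)) or 1.0
        rows.append([w / norm for w in weights])
    return vocab, rows

docs = ["love happy day", "happy day", "hate"]
vocab, rows = tfidf(docs)
for doc, row in zip(docs, rows):
    print(doc, [round(w, 3) for w in row])
```

In the first document, `love` receives a higher weight than `day` because it appears in fewer documents.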

#### **Logistic Regression Model**

In [32]:
# First Model
lr = LogisticRegression(max_iter=1000, random_state=42)
lr.fit(X_train_tfidf, y_train)

# Prediction
y_train_pred = lr.predict(X_train_tfidf)
y_test_pred = lr.predict(X_test_tfidf)

# Evaluation (training set only; with average='weighted', recall is mathematically equal to accuracy)
print("Train Accuracy:", accuracy_score(y_train, y_train_pred))
print("Train Recall:" , recall_score(y_train, y_train_pred, average='weighted'))
print("Train F1 Score:", f1_score(y_train, y_train_pred, average='weighted'))

Train Accuracy: 0.955649419218585
Train Recall: 0.955649419218585
Train F1 Score: 0.9470729109073178
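
The identical accuracy and recall values above are expected: support-weighted recall sums each class's correct predictions and divides by the total number of samples, which is the definition of accuracy. The sketch below checks this on made-up labels; the helper functions are hypothetical and only illustrate the arithmetic.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def recall_per_class(y_true, y_pred):
    """Return ({class: recall}, {class: support})."""
    support = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {c: hits[c] / support[c] for c in support}, support

def weighted_recall(y_true, y_pred):
    """Average per-class recall, weighted by each class's share of samples."""
    per_class, support = recall_per_class(y_true, y_pred)
    n = len(y_true)
    return sum(per_class[c] * support[c] / n for c in per_class)

# Made-up labels: 8 non-hate, 2 hate; one hate tweet is missed
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 1, 0]

per_class, _ = recall_per_class(y_true, y_pred)
print("Accuracy:", accuracy(y_true, y_pred))
print("Weighted recall:", weighted_recall(y_true, y_pred))
print("Recall on class 1 only:", per_class[1])
```

Here the weighted recall matches the accuracy even though only half of the positive class is recovered, which is why the positive-class recall is the more informative number for this task.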


#### **Handle Class Imbalance**

In [33]:
lr_balanced = LogisticRegression(class_weight='balanced', max_iter=1000, random_state=42)
lr_balanced.fit(X_train_tfidf, y_train)

y_train_pred_balanced = lr_balanced.predict(X_train_tfidf)

# Evaluation
print("Balanced Train Accuracy:", accuracy_score(y_train, y_train_pred_balanced))
print("Balanced Train Recall:", recall_score(y_train, y_train_pred_balanced, average='weighted'))
print("Balanced Train F1 Score:", f1_score(y_train, y_train_pred_balanced, average='weighted'))

Balanced Train Accuracy: 0.9425476162540577
Balanced Train Recall: 0.9425476162540577
Balanced Train F1 Score: 0.9495076676748813


#### **Hyperparameter Tuning with GridSearchCV**

In [34]:
param_grid = {
    'C': [0.01, 0.1, 1, 10], # Inverse of regularization strength
    'penalty': ['l2'], # Regularization type; 'l1' is not supported by 'lbfgs' (it needs 'liblinear' or 'saga')
    'solver': ['lbfgs', 'liblinear'] # Optimization algorithm
}

skf = StratifiedKFold(n_splits=4, shuffle=True, random_state=42)

grid = GridSearchCV(
    LogisticRegression(class_weight='balanced', max_iter=1000, random_state=42), # Base model
    param_grid, # Hyperparameter grid
    scoring='recall', # Evaluation metric
    cv=skf, # Cross-validation strategy
    n_jobs=-1, # Use all available cores
)

grid.fit(X_train_tfidf, y_train)

print("Best Parameters:", grid.best_params_)
best_model = grid.best_estimator_

Best Parameters: {'C': 1, 'penalty': 'l2', 'solver': 'liblinear'}
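
To see how much work this search involves, the grid can be expanded by hand. The sketch below enumerates every parameter combination in `param_grid` and multiplies by the number of folds; `expand_grid` is a hypothetical helper that only approximates what the search does internally.

```python
from itertools import product

param_grid = {
    'C': [0.01, 0.1, 1, 10],
    'penalty': ['l2'],
    'solver': ['lbfgs', 'liblinear'],
}
n_splits = 4  # matches StratifiedKFold(n_splits=4)

def expand_grid(grid):
    """Return one dict per combination of parameter values."""
    keys = sorted(grid)
    return [dict(zip(keys, values)) for values in product(*(grid[k] for k in keys))]

candidates = expand_grid(param_grid)
n_fits = len(candidates) * n_splits
print(len(candidates), "candidates,", n_fits, "cross-validation fits")
```

This gives 8 candidates and 32 cross-validation fits. With the default `refit=True`, the best candidate is then trained once more on the full training set, which is the model returned as `best_estimator_`.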


#### **Final Evaluation**

In [35]:
y_test_pred_best = best_model.predict(X_test_tfidf)

print("Test Accuracy:", accuracy_score(y_test, y_test_pred_best))
print("Test Recall:", recall_score(y_test, y_test_pred_best))
print("Test F1 Score:", f1_score(y_test, y_test_pred_best, average='weighted'))

Test Accuracy: 0.9206945096198967
Test Recall: 0.7857142857142857
Test F1 Score: 0.9299297047168323
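
Note that the recall above is computed for the hate class only (the default for binary labels), while the F1 score uses `average='weighted'` and is therefore dominated by the majority class. To judge the hate class on its own, precision and F1 for the positive class can be computed as well. The sketch below does this on made-up labels; `binary_metrics` is a hypothetical helper, and the numbers are not the notebook's results.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Precision, recall, and F1 for the positive class only."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"precision": precision, "recall": recall, "f1": f1}

# Made-up labels: 4 hate tweets among 20; 3 are caught, 3 non-hate tweets are flagged
y_true = [1, 1, 1, 1] + [0] * 16
y_pred = [1, 1, 1, 0] + [1, 1, 1] + [0] * 13

metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

In this toy case the accuracy is 0.8 while the positive-class F1 is only 0.6, which shows how a high overall score can coexist with many false positives on the rare class.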


## 📌 **Insights & Conclusion**

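- **Best Parameters:** `C=1`, `penalty='l2'`, `solver='liblinear'`, found using GridSearchCV with 4-fold Stratified K-Fold cross-validation and recall as the scoring metric.
- **Recall** is prioritised because we want to **catch as many hate tweets as possible**. The tuned model reaches a test recall of about 0.79 on the hate class, with a test accuracy of about 0.92.
- **TF-IDF** (unigrams and bigrams, capped at 5,000 features) provided the numerical representation of the tweets.
- **Class imbalance** was handled using `class_weight='balanced'`.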