# Hate Speech Detection with Machine Learning

## Introduction
Hate speech is a serious issue on social media platforms. In this notebook, we train a hate speech detection model in Python: the tweets are cleaned, converted to bag-of-words counts, and classified with Multinomial Naive Bayes.

## Libraries

In [4]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report
import re
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\user\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


## Load Data

In [9]:
# Load the dataset
data = pd.read_csv("twitter.csv")  # Replace with your dataset path
data.head()

Unnamed: 0.1,Unnamed: 0,count,hate_speech,offensive_language,neither,class,tweet
0,0,3,0,0,3,2,!!! RT @mayasolovely: As a woman you shouldn't...
1,1,3,0,3,0,1,!!!!! RT @mleew17: boy dats cold...tyga dwn ba...
2,2,3,0,3,0,1,!!!!!!! RT @UrKindOfBrand Dawg!!!! RT @80sbaby...
3,3,3,0,2,1,1,!!!!!!!!! RT @C_G_Anderson: @viva_based she lo...
4,4,6,0,6,0,1,!!!!!!!!!!!!! RT @ShenikaRoberts: The shit you...


## EDA

In [12]:
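# Add labels based on the annotator vote counts.
# Note: hate_speech and offensive_language hold the number of annotators who chose
# each category, so `== 1` matches only rows with exactly one such vote. The final
# verdict for each tweet is stored in the `class` column.
data['labels'] = np.where(data['hate_speech'] == 1, 'Hate Speech',
                          np.where(data['offensive_language'] == 1, 'Offensive Language',
                                   'No Hate and Offensive'))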

# Select relevant columns
data = data[['tweet', 'labels']]
data.head()

Unnamed: 0,tweet,labels
0,!!! RT @mayasolovely: As a woman you shouldn't...,No Hate and Offensive
1,!!!!! RT @mleew17: boy dats cold...tyga dwn ba...,No Hate and Offensive
2,!!!!!!! RT @UrKindOfBrand Dawg!!!! RT @80sbaby...,No Hate and Offensive
3,!!!!!!!!! RT @C_G_Anderson: @viva_based she lo...,No Hate and Offensive
4,!!!!!!!!!!!!! RT @ShenikaRoberts: The shit you...,No Hate and Offensive
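
In the first preview, rows 1 to 4 have `class` equal to 1 and three or more `offensive_language` votes, yet they end up labeled "No Hate and Offensive". A minimal sketch of an alternative labeling follows. It assumes the `class` column is encoded as 0 = hate speech, 1 = offensive language, 2 = neither; that encoding matches the preview rows but is not stated in the notebook, so check it against the dataset documentation. The `sample` frame and `CLASS_NAMES` are illustrative names.

```python
import pandas as pd

# Toy frame mimicking three rows of the dataset preview (tweet text shortened).
sample = pd.DataFrame({
    "hate_speech": [0, 0, 0],
    "offensive_language": [0, 3, 6],
    "neither": [3, 0, 0],
    "class": [2, 1, 1],
    "tweet": ["tweet a", "tweet b", "tweet c"],
})

# Assumed encoding of the `class` column; verify before relying on it.
CLASS_NAMES = {
    0: "Hate Speech",
    1: "Offensive Language",
    2: "No Hate and Offensive",
}

# Map the final class of each tweet to a readable label.
sample["labels"] = sample["class"].map(CLASS_NAMES)
print(sample[["class", "labels"]])
```

Labeling the full dataset this way would change the class counts and every metric reported in the rest of the notebook.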


In [14]:
data.tail()

Unnamed: 0,tweet,labels
24778,you's a muthaf***in lie &#8220;@LifeAsKing: @2...,No Hate and Offensive
24779,"you've gone and broke the wrong heart baby, an...",Offensive Language
24780,young buck wanna eat!!.. dat nigguh like I ain...,No Hate and Offensive
24781,youu got wild bitches tellin you lies,No Hate and Offensive
24782,~~Ruffled | Ntac Eileen Dahlia - Beautiful col...,No Hate and Offensive


In [16]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 24783 entries, 0 to 24782
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   tweet   24783 non-null  object
 1   labels  24783 non-null  object
dtypes: object(2)
memory usage: 387.4+ KB


In [18]:
data.shape

(24783, 2)
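
`data.info()` and `data.shape` confirm the size of the dataset but say nothing about how the labels are distributed. A small sketch of a class-balance check, shown on a hypothetical toy series rather than the real `data['labels']` column:

```python
import pandas as pd

# Hypothetical label column with a strong majority class.
toy_labels = pd.Series(
    ["No Hate and Offensive"] * 8 + ["Hate Speech"] + ["Offensive Language"],
    name="labels",
)

counts = toy_labels.value_counts()                # absolute counts per label
shares = toy_labels.value_counts(normalize=True)  # fraction of rows per label
majority_share = shares.max()

print(counts)
print(f"Largest class share: {majority_share:.0%}")
```

On the notebook's data the same check would be `data['labels'].value_counts(normalize=True)`. A large majority share is a sign that accuracy alone will be hard to interpret.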

## Clean Text

In [21]:
# Function to clean tweets
def clean_tweet(tweet):
    tweet = tweet.lower()  # Lowercase
    # Remove mentions (the pattern stops at '_', so handles with underscores are only partly removed)
    tweet = re.sub(r'@[A-Za-z0-9]+', '', tweet)
    tweet = re.sub(r'http\S+|www\S+', '', tweet)  # Remove URLs
    tweet = re.sub(r'#', '', tweet)  # Remove the '#' symbol but keep the hashtag text
    # Remove punctuation (underscores count as word characters and are kept)
    tweet = re.sub(r'[^\w\s]', '', tweet)
    return tweet

# Clean the tweet column
data['tweet'] = data['tweet'].apply(clean_tweet)
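
The cleaning step above leaves some residue: handles containing underscores are only partly removed, HTML entities such as `&#8220;` turn into stray digits, and the `RT` marker stays in the text. The sketch below is one possible stricter variant, not the function used for the results in this notebook. The name `clean_tweet_sketch` and the choice to drop `rt` are assumptions made for illustration.

```python
import html
import re

def clean_tweet_sketch(tweet: str) -> str:
    """Hypothetical stricter cleaning step (not the notebook's original)."""
    tweet = html.unescape(tweet)                     # decode entities such as &#8220; or &amp;
    tweet = tweet.lower()
    tweet = re.sub(r'http\S+|www\.\S+', ' ', tweet)  # URLs
    tweet = re.sub(r'@\w+', ' ', tweet)              # mentions, including underscores
    tweet = re.sub(r'\brt\b', ' ', tweet)            # retweet marker
    tweet = re.sub(r'[^\w\s]|_', ' ', tweet)         # punctuation and underscores
    return re.sub(r'\s+', ' ', tweet).strip()        # collapse repeated whitespace

example = "!!! RT @C_G_Anderson: check http://t.co/abc #Wow &#8220;hi&#8221;"
print(clean_tweet_sketch(example))
```

Swapping this in would change the vocabulary and therefore the scores reported below.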

## Prepare Training

In [24]:
X = data['tweet']
y = data['labels']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
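
The split above is purely random, so the class proportions in the test set can drift from those in the full dataset. A minimal sketch of a stratified split on hypothetical toy data, using the `stratify` parameter of `train_test_split`:

```python
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced data: 80 majority examples, 20 minority examples.
toy_texts = [f"tweet {i}" for i in range(100)]
toy_labels = ["majority"] * 80 + ["minority"] * 20

# stratify keeps the 80/20 class ratio in both the train and test parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    toy_texts, toy_labels, test_size=0.2, random_state=42, stratify=toy_labels
)

print(len(y_te), y_te.count("minority"))
```

In the notebook this would correspond to passing `stratify=y`, which would also change the reported scores slightly.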

In [28]:
vectorizer = CountVectorizer(stop_words='english')
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)

In [32]:
# Train the Multinomial Naive Bayes model
model = MultinomialNB()
model.fit(X_train_vec, y_train)

In [34]:
y_pred = model.predict(X_test_vec)
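
When classifying new text, the vectorizer and the model have to be applied in the same order as during training. One way to keep them together is a scikit-learn pipeline. The sketch below uses hypothetical toy texts and neutral labels (`pos`/`neg`) purely to show the mechanics.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy training data.
toy_texts = ["good nice happy", "great nice day",
             "bad awful terrible", "awful horrible day"]
toy_labels = ["pos", "pos", "neg", "neg"]

# The pipeline vectorizes raw text and then applies Naive Bayes.
pipeline = make_pipeline(CountVectorizer(), MultinomialNB())
pipeline.fit(toy_texts, toy_labels)

new_texts = ["nice happy"]
predicted = pipeline.predict(new_texts)
probabilities = pipeline.predict_proba(new_texts)  # columns follow pipeline.classes_

print(predicted, probabilities)
```

With the notebook's data, the same pattern would be fitted on `X_train` and `y_train`, and new tweets should go through the same cleaning function first.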

In [36]:
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy * 100:.2f}%')

Accuracy: 77.30%


In [38]:
print(classification_report(y_test, y_pred))

                       precision    recall  f1-score   support

          Hate Speech       0.27      0.02      0.04       681
No Hate and Offensive       0.78      0.99      0.87      3849
   Offensive Language       0.40      0.05      0.08       427

             accuracy                           0.77      4957
            macro avg       0.48      0.35      0.33      4957
         weighted avg       0.68      0.77      0.69      4957
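
A useful sanity check for the accuracy figure is a baseline that always predicts the most frequent class. The sketch below rebuilds the test labels from the support column of the report (681, 3849, 427) and scores such a baseline. Fitting the dummy model on the test labels is a simplification for illustration; in practice it would be fitted on `y_train`.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, f1_score

# Test-set class counts copied from the support column of the report.
support = {"Hate Speech": 681, "No Hate and Offensive": 3849, "Offensive Language": 427}
y_true = np.repeat(list(support.keys()), list(support.values()))

# The features are irrelevant for this baseline, so a column of zeros is enough.
dummy_features = np.zeros((len(y_true), 1))
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(dummy_features, y_true)
y_base = baseline.predict(dummy_features)

baseline_accuracy = accuracy_score(y_true, y_base)
baseline_macro_f1 = f1_score(y_true, y_base, average="macro", zero_division=0)

print(f"Baseline accuracy: {baseline_accuracy * 100:.2f}%")
print(f"Baseline macro F1: {baseline_macro_f1:.2f}")
```

The baseline accuracy is 3849 / 4957, which is slightly above the 77.30% reached by the model, so the macro-averaged scores and per-class recall are the more informative numbers here.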

## Summary

So this is how you can train a machine learning model for detecting hate speech in Python: the tweets are cleaned, vectorized with `CountVectorizer`, and classified with Multinomial Naive Bayes. The model reaches 77.30% accuracy, but the classification report shows recall of only 0.02 for Hate Speech and 0.05 for Offensive Language, so it assigns almost every tweet to the majority class and accuracy alone overstates its usefulness. Hate speech remains one of the serious issues we see daily on social media platforms like Twitter and Facebook. I hope you liked this article on detecting hate speech with machine learning in Python. Feel free to ask your questions in the comments section below.