**Model to Detect Hate Speech**

*Loading Data*

In [7]:
import pandas as pd

hatespeechprepared = "Data/hate-speech-prepared-spreadsheet.csv"
df = pd.read_csv(hatespeechprepared, delimiter="\t", encoding="utf-8")

df.head()

Unnamed: 0.1,Unnamed: 0,###,Comment,Type,Target,Implicit,Metaphor/metonymy,Sarcasm/humor,Rhetorical question,Circumlocution,binary-hate-speech
0,0,https://www.facebook.com/228735667216_10153273...,Can we shut up about refugees already?,Acceptable speech,No target,,,,,,Acceptable speech
1,1,https://www.facebook.com/228735667216_10153273...,Why should we? It's the biggest humanitarian c...,Acceptable speech,No target,,,,,,Acceptable speech
2,2,https://www.facebook.com/228735667216_10153273...,these refugees adult males are cowards for not...,Background offensive,Migrants,0.0,,,,,Hate speech
3,3,https://www.facebook.com/228735667216_10153273...,Does Syria own the BBC?.........,Acceptable speech|Other offensive,No target|Journalist or medium,,,,,,discard
4,4,https://www.facebook.com/228735667216_10153273...,They are all mentally jerking off to the refug...,Background offensive,Migrants,0.0,,,,,Hate speech
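
The `Unnamed: 0` and `Unnamed: 0.1` columns in the output above are typically what pandas produces when a DataFrame's index is written to disk and then read back without `index_col`. Two of them suggest the file went through two such round trips, though the notebook does not confirm this. A minimal sketch of the effect, using a made-up two-row frame and an in-memory buffer instead of the real file:

```python
import io

import pandas as pd

# Hypothetical toy frame standing in for the real data
toy = pd.DataFrame({"Comment": ["first comment", "second comment"]})

# to_csv writes the index as a leading column with an empty header
buffer = io.StringIO()
toy.to_csv(buffer, sep="\t")

# Reading it back without index_col yields an 'Unnamed: 0' column
buffer.seek(0)
with_extra = pd.read_csv(buffer, sep="\t")

# Passing index_col=0 restores the index instead of creating a column
buffer.seek(0)
restored = pd.read_csv(buffer, sep="\t", index_col=0)

print(list(with_extra.columns))  # ['Unnamed: 0', 'Comment']
print(list(restored.columns))    # ['Comment']
```

Writing with `index=False` or reading with `index_col=0` avoids having to drop these columns by hand later.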


*Pre-Processing Data*

In [8]:
df = df.drop(columns=['Unnamed: 0', '###', 'Type'])
df = df[df['binary-hate-speech'] != 'discard']
df.head()

Unnamed: 0,Comment,Target,Implicit,Metaphor/metonymy,Sarcasm/humor,Rhetorical question,Circumlocution,binary-hate-speech
0,Can we shut up about refugees already?,No target,,,,,,Acceptable speech
1,Why should we? It's the biggest humanitarian c...,No target,,,,,,Acceptable speech
2,these refugees adult males are cowards for not...,Migrants,0.0,,,,,Hate speech
4,They are all mentally jerking off to the refug...,Migrants,0.0,,,,,Hate speech
5,You only see what you want to see. Pretty much...,Commenter,1.0,1.0,1.0,0.0,0.0,Hate speech
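
Filtering out the `discard` rows leaves gaps in the row labels, and pandas does not renumber them. This matters whenever new columns are built from plain arrays, which is why the vectorizing cell further down passes `index=df.index`. A small sketch with a hypothetical three-row frame shows what happens with and without that argument:

```python
import pandas as pd

# Hypothetical toy frame; row 1 plays the role of a 'discard' row
toy = pd.DataFrame({"label": ["keep", "discard", "keep"]})
kept = toy[toy["label"] != "discard"]
print(kept.index.tolist())  # [0, 2]

# Features built from a plain list get a fresh 0..n-1 index by default,
# so concat aligns on mismatched labels and introduces NaN rows
features_default = pd.DataFrame({"feature": [0.1, 0.2]})
misaligned = pd.concat([kept, features_default], axis=1)

# Reusing the filtered frame's index keeps each feature row with its comment
features_aligned = pd.DataFrame({"feature": [0.1, 0.2]}, index=kept.index)
aligned = pd.concat([kept, features_aligned], axis=1)

print(len(misaligned), len(aligned))  # 3 2
```

Calling `reset_index(drop=True)` right after filtering would be another way to sidestep the issue.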


In [11]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.impute import SimpleImputer

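# Mapping 'Acceptable speech' to 0 and 'Hate speech' to 1.
# Series.map() turns any value that is not a dictionary key into NaN, so the
# mapping is applied only while the labels are still non-numeric; re-running
# this cell then leaves already-mapped labels untouched.
if not pd.api.types.is_numeric_dtype(df['binary-hate-speech']):
    df['binary-hate-speech'] = df['binary-hate-speech'].map({'Acceptable speech': 0, 'Hate speech': 1})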

# Handling missing text data by imputing with 'unknown' placeholder
text_imputer = SimpleImputer(strategy='constant', fill_value='unknown')
df['Comment'] = pd.Series(text_imputer.fit_transform(df[['Comment']]).flatten(), index=df.index)

# Vectorizing the text data using TF-IDF
vectorizer = TfidfVectorizer(max_features=500)
text_features = vectorizer.fit_transform(df['Comment'])

# Convert the sparse matrix to a DataFrame and merge with the original dataframe
text_df = pd.DataFrame(text_features.toarray(), columns=vectorizer.get_feature_names_out(), index=df.index)
df = pd.concat([df, text_df], axis=1)

# Handling missing values in other columns by filling with the most frequent value (mode)
df['Target'] = df['Target'].fillna(df['Target'].mode()[0])

# Encoding the categorical 'Target' column as numerical codes ('Type' was dropped earlier)
df['Target'] = df['Target'].astype('category').cat.codes
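
The cell above fits the TF-IDF vectorizer on every comment before any train/test split, so the vocabulary and IDF weights also reflect the rows later used for testing. A common alternative is to place the vectorizer inside a scikit-learn `Pipeline`, so that it is fitted on training text only. The sketch below is an illustration under stated assumptions: the comments and labels are made up, and `LogisticRegression` is a lightweight stand-in, not the XGBoost model this notebook uses.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Made-up comments and labels for illustration (1 = hateful, 0 = acceptable)
comments = [
    "thank you for sharing this",
    "what a thoughtful article",
    "hope everyone stays safe",
    "great reporting as always",
    "you are all idiots",
    "go away nobody wants you here",
    "these people are worthless",
    "those cowards deserve nothing",
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

train_text, test_text, y_train, y_test = train_test_split(
    comments, labels, test_size=0.25, random_state=42, stratify=labels
)

# The vectorizer sits inside the pipeline, so fit() only sees training text
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(max_features=500)),
    ("clf", LogisticRegression(max_iter=1000)),  # stand-in classifier
])
pipeline.fit(train_text, y_train)

# predict() reuses the training vocabulary to transform the test text
predictions = pipeline.predict(test_text)
print(len(predictions))
```

Any estimator with a scikit-learn interface, including `xgb.XGBClassifier`, could take the place of the `clf` step.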

*Actual Model*

In [12]:
from sklearn.model_selection import train_test_split
import xgboost as xgb
from sklearn.metrics import accuracy_score

# Defining features (X) and target (y)
X = df.drop(columns=['binary-hate-speech', 'Comment'])  # Excluding 'Comment' column as it is already vectorized
y = df['binary-hate-speech']

# Splitting the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Training the XGBoost model
model = xgb.XGBClassifier(eval_metric='logloss')
model.fit(X_train, y_train)

# Making predictions and evaluating the model
y_pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))

ValueError: Invalid classes inferred from unique values of `y`.  Expected: [0], got [nan]
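
The traceback says that `y` contains nothing but `nan`. One plausible cause, which the notebook itself does not confirm, is that the label-mapping cell ran more than once in the same session (the execution counter jumps from 8 to 11). `Series.map` with a dictionary returns NaN for every value that is not a key, so applying the mapping to labels that are already 0/1 wipes the column. A minimal reproduction on a hypothetical three-row Series:

```python
import pandas as pd

label_map = {"Acceptable speech": 0, "Hate speech": 1}
labels = pd.Series(["Acceptable speech", "Hate speech", "Acceptable speech"])

# First pass: the strings become 0/1 as intended
first_pass = labels.map(label_map)

# Second pass: 0 and 1 are not keys of label_map, so every value becomes NaN
second_pass = first_pass.map(label_map)

print(first_pass.tolist())         # [0, 1, 0]
print(second_pass.isna().all())    # True
```

Checking `df['binary-hate-speech'].value_counts(dropna=False)` before training is a quick way to see whether the labels are in the expected state.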

**Other Predictors**