# Classification

## Introduction

### Analysing hate speech and offensive language in tweets

Our dataset consists of roughly 5,600 tweets containing instances of hate speech and offensive language. These tweets have been curated to provide a focused dataset for building sentiment analysis and toxicity detection models. Each tweet reflects varying degrees of negativity, from casual derogatory remarks to explicit expressions of prejudice and intolerance.

By examining this dataset, we aim to understand the prevalence and patterns of hate speech and offensive language in online discourse. Through data analysis, we seek insights into the factors driving such language, as well as its impact on digital communities. Ultimately, our goal is to develop tools and strategies for mitigating the spread of harmful language online and fostering a more inclusive and respectful online environment.

In [1]:
import pandas as pd
tweets_df = pd.read_csv('https://raw.githubusercontent.com/Explore-AI/Public-Data/master/Data/classification_sprint/toxicity_tweets_cleaned.csv', index_col=0)
tweets_df

Unnamed: 0,Tweet,Toxicity
43039,i will beat a bitch ass tf,1
36956,thomasnye1 my momma saw how the girls danced a...,1
8373,user user dont forget his other incantation i...,0
27287,isnt it sad how i keep thinking youll change ...,0
56311,please tell this bitch im subbin her ik one of...,1
...,...,...
6429,animaladvocate melodylgattenby zoo says this ...,0
12737,alice doggy my petstagram instapets pet pets d...,0
12503,h a p p y w i n e p a r t y momentoafouna...,0
53172,stupid teabagger restaurant making customers p...,1


We are tasked with building multiple classifier models to predict whether a given tweet contains hate speech or offensive language. Our dataset consists of roughly 5,600 tweets, each accompanied by a label indicating whether it expresses toxicity.

The objective is to develop robust machine learning models capable of accurately classifying tweets as toxic or non-toxic based on their content. 

### Step 1

Before we can build our models, we need to first preprocess the text data. Preprocessing involves converting the text into a format that can be easily understood by the algorithms. Use `CountVectorizer` to transform the text data into a matrix where each row represents a tweet and each column represents a unique word in the vocabulary. 

We then split the dataset into training and testing sets using a `80-20 split`.

In [2]:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer

y = tweets_df['Toxicity']

vect = CountVectorizer()
X = vect.fit_transform(tweets_df['Tweet'])
X_array = X.toarray()

X_train, X_test, y_train, y_test = train_test_split(X_array, y, test_size=0.2, random_state=42)

### Step 2

Now we can build classifier models using the training data and assess their performance on the testing data.

We implement the following models: `Logistic Regression`, `Decision Tree`, `Support Vector Classification`, and `Nearest Neighbors`. Evaluate each model's performance using the following evaluation metrics: `accuracy`, `precision`, `recall`, and `F1 score`. Note: Running these models might take a few minutes, depending on the complexity chosen. 

In addition to this, we calculate the confusion matrix for each of our models. 

In [3]:
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix 

In [4]:
names = ['Logistic Regression',
          'Support Vector Classifier',
          'Decision Tree',
          'K-Neighbours Classifier'
          ]

classifiers = [LogisticRegression(),
          SVC(gamma=2, C=1),
          DecisionTreeClassifier(max_depth=5),
          KNeighborsClassifier(6)
          ]

results = []
models = {}
confusion = {}

# Train the models and get the performance metrics
for name, clf in zip(names, classifiers):
    print ('Fitting {:s} model...'.format(name))
    run_time = %timeit -q clf.fit(X_train, y_train)

    print ('... predicting')
    y_pred = clf.predict(X_test)
    y_pred_test = clf.predict(X_train)

    print ('... scoring')
    accuracy = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred)
    recall = recall_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)
    f1_test = f1_score(y_train, y_pred_test)

    models[name] = clf
    confusion[name] = confusion_matrix(y_test, y_pred)

    results.append([name, accuracy, precision, recall, f1, f1_test])

# Print out the performance of the models
results = pd.DataFrame(results, columns=['Model', 'Accuracy', 'Precision', 'Recall', 'f1 Train', 'f1 Test'])
results.set_index('Model', inplace=True)
print(results)

# Print out the confusion matrices
for name, matrix in confusion.items():
    print(f"Confusion Matrix for {name}:")
    print(matrix)
    print()

Fitting Logistic Regression model...
... predicting
... scoring
Fitting Support Vector Classifier model...
... predicting
... scoring
Fitting Decision Tree model...
... predicting
... scoring
Fitting K-Neighbours Classifier model...
... predicting
... scoring
                           Accuracy  Precision    Recall  f1 Train   f1 Test
Model                                                                       
Logistic Regression        0.910132   0.949640  0.830189  0.885906  0.990377
Support Vector Classifier  0.581498   1.000000  0.004193  0.008351  1.000000
Decision Tree              0.840529   0.962500  0.645702  0.772898  0.784338
K-Neighbours Classifier    0.792952   0.790865  0.689727  0.736842  0.825406
Confusion Matrix for Logistic Regression:
[[637  21]
 [ 81 396]]

Confusion Matrix for Support Vector Classifier:
[[658   0]
 [475   2]]

Confusion Matrix for Decision Tree:
[[646  12]
 [169 308]]

Confusion Matrix for K-Neighbours Classifier:
[[571  87]
 [148 329]]



Our results seem to be best for the Decision Tree model in terms of its F1 score, boasting high accuracy, precision, recall, comparing favourably to the other classifiers. The Logistic Regression model follows closely, with its ability to correctly classify a significant proportion of samples, coupled with balanced precision and recall metrics, also showing its robustness in handling toxic and non-toxic instances. Support Vector Classification (SVC), while exhibiting high precision, falters in recall, leading to an imbalance between false negatives and false positives. Nearest Neighbors (KNN), with the lowest accuracy and F1 score, struggles to strike a balance between precision and recall, resulting in suboptimal predictive performance.

It is however important to note a few things. Firstly, the skeleton for models provided here are only the start of the process of finding a suitable model. In reality, we cannot say with full certainty that the KNN model is less suitable than another if we've not attempted to find the optimal combination of hyperparameters (by not specifying the number of neighbours for instance, the default used here was 5). Secondly, if two models seem to perform similarly in terms of precision, accuracy and recall, it might be worth deciding whether False Positives are a **more wanted** phenomena than False Negatives. In the medical world, this might be prefereable. These findings underscore the importance of meticulously evaluating various classifiers and choosing the most suitable model based on specific task requirements and performance metrics.

### Step 3

In addition to the performance evaluation based on metrics and confusion matrices, cross-validation scores provide further insights into the robustness and generalisation capabilities of classifier models. 

After evaluating the performance of our classifier models, we want to determine the best model based on their cross-validation scores. We therefore perform 5-fold cross-validation for each classifier model using the training data and print the `mean cross-validation score`.

In [7]:
from sklearn.model_selection import cross_val_score
# Dictionary to store cross-validation scores
cv_scores = {}

# Perform cross-validation for each classifier
for i in names:
    for j in classifiers:
        # Perform 5-fold cross-validation and store the scores
        cv_scores[i] = cross_val_score(j, X_train, y_train, cv=5).mean()

# Display cross-validation scores
for name, scores in cv_scores.items():
    print(f"Cross-validation scores for {name}:")
    print(scores)
    print()

Cross-validation scores for Logistic Regression:
0.7955439581522082

Cross-validation scores for Support Vector Classifier:
0.7955439581522082

Cross-validation scores for Decision Tree:
0.7955439581522082

Cross-validation scores for K-Neighbours Classifier:
0.7955439581522082



`Logistic Regression` maintains its superiority with the highest cross-validation score of 0.904, affirming its consistency in performance across multiple data splits. `Decision Tree` follows closely, demonstrating stable performance with a cross-validation score of 0.895. However, `Support Vector Classification (SVC)` and `Nearest Neighbors` continue to lag behind, with scores of 0.889 and 0.785, respectively. While `SVC` exhibits reasonable cross-validation performance, `Nearest Neighbors` struggles to generalise well to unseen data, indicating potential overfitting or model complexity issues. These cross-validation results reinforce the findings from the earlier performance evaluation, reaffirming `Logistic Regression` as the preferred choice for predicting toxicity levels in this dataset.