<a href="https://colab.research.google.com/github/hjdeck/Cyberbullying-Classification/blob/main/Cyberbullying_Classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Cyberbullying Classification**
#### **Objective**: train 3 models to classify cyberbullying tweets; report and discuss the evaluation results
## **About Dataset**
*cyberbullying_tweets.csv* contains 2 features and 47692 observations
- *tweet_text*: predictor variable
- *cyberbullying_type*: response variable

### Potential Models
- Naive Bayes
- Linear SVM
- Logistic Regression
- RNN (may suit sequential text such as sentences better than a CNN, which is geared toward spatial data)

### TODO
- preprocess text data
- exploratory data analysis?
- decide 3 models (1 deep learning/2 traditional?)
- train and optimize 3 models
- test and evaluate
- interpretation
- conclusion


In [17]:
import pandas as pd
from IPython.display import display

In [18]:
# raw data
tweets = pd.read_csv('cyberbullying_tweets.csv')
display(tweets)

Unnamed: 0,tweet_text,cyberbullying_type
0,"In other words #katandandre, your food was cra...",not_cyberbullying
1,Why is #aussietv so white? #MKR #theblock #ImA...,not_cyberbullying
2,@XochitlSuckkks a classy whore? Or more red ve...,not_cyberbullying
3,"@Jason_Gio meh. :P thanks for the heads up, b...",not_cyberbullying
4,@RudhoeEnglish This is an ISIS account pretend...,not_cyberbullying
...,...,...
47687,"Black ppl aren't expected to do anything, depe...",ethnicity
47688,Turner did not withhold his disappointment. Tu...,ethnicity
47689,I swear to God. This dumb nigger bitch. I have...,ethnicity
47690,Yea fuck you RT @therealexel: IF YOURE A NIGGE...,ethnicity


# **Text Preprocessing**
- normalizing text
- removing unicode characters
- removing stopwords
- stemming/lemmatization
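
As a minimal sketch of the "removing unicode characters" step, the hypothetical helper `strip_unicode` below (not part of the notebook) uses the standard-library `unicodedata` module to decompose accented letters and drop anything that has no ASCII equivalent. It assumes transliteration is wanted (for example `é` becomes `e`); a pattern such as `[^A-Za-z0-9]+` would instead replace those characters with spaces.

```python
import unicodedata


def strip_unicode(text):
    """Return text with non-ASCII characters transliterated or dropped.

    Hypothetical helper for illustration; the name is an assumption.
    """
    # NFKD splits accented letters into a base letter plus combining marks
    decomposed = unicodedata.normalize("NFKD", text)
    # Encoding to ASCII with errors="ignore" discards the marks and any
    # character (emoji, symbols) that has no ASCII form
    return decomposed.encode("ascii", "ignore").decode("ascii")


example = strip_unicode("café naïve")
```

If used, a helper like this would most naturally run before the regex-based cleaning, so that accented letters survive as plain letters instead of splitting words.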


In [19]:
import re
from wordcloud import STOPWORDS

In [20]:
# Raw string avoids invalid-escape warnings; "https?:\S+" already covers http and https links
TEXT_CLEANING_RE = r"@\S+|https?:\S+|[^A-Za-z0-9]+"
STOPWORDS.update(['rt', 'mkr', 'didn', 'bc', 'n', 'm', 'im', 'll', 'y', 've', 'u', 'ur', 'don', 't', 's'])

def lower(text):
    return text.lower()

def remove_twitter(text):
    return re.sub(TEXT_CLEANING_RE, ' ', text)

def remove_stopwords(text):
    return " ".join([word for word in str(text).split() if word not in STOPWORDS])

def clean_text(text):
    text = lower(text)
    text = remove_twitter(text)
    text = remove_stopwords(text)
    return text

In [21]:
# preprocessed data
tweets['tweet_text'] = tweets['tweet_text'].apply(clean_text)
display(tweets)

Unnamed: 0,tweet_text,cyberbullying_type
0,words katandandre food crapilicious,not_cyberbullying
1,aussietv white theblock imacelebrityau today s...,not_cyberbullying
2,classy whore red velvet cupcakes,not_cyberbullying
3,meh p thanks heads concerned another angry dud...,not_cyberbullying
4,isis account pretending kurdish account islam ...,not_cyberbullying
...,...,...
47687,black ppl aren expected anything depended anyt...,ethnicity
47688,turner withhold disappointment turner called c...,ethnicity
47689,swear god dumb nigger bitch got bleach hair re...,ethnicity
47690,yea fuck therealexel youre nigger fucking unfo...,ethnicity


# **Naive Bayes Classifier**
## Optimization
- train on n-grams
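
One way the n-gram idea could be tried is through the `ngram_range` parameter of `TfidfVectorizer`. The sketch below is illustrative only: the helper name `build_ngram_nb`, the choice of unigrams plus bigrams, and the two toy documents are assumptions, not settings or results from this notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline


def build_ngram_nb(ngram_range=(1, 2)):
    """Build a TF-IDF + Multinomial Naive Bayes pipeline over word n-grams.

    Hypothetical helper; ngram_range=(1, 2) means unigrams and bigrams.
    """
    return make_pipeline(TfidfVectorizer(ngram_range=ngram_range), MultinomialNB())


# Toy data for illustration only (two cleaned snippets, one per class)
toy_texts = ["words katandandre food crapilicious", "turner withhold disappointment"]
toy_labels = ["not_cyberbullying", "ethnicity"]

ngram_model = build_ngram_nb()
ngram_model.fit(toy_texts, toy_labels)

# The fitted vocabulary now holds bigrams such as "katandandre food"
vocab = ngram_model.named_steps["tfidfvectorizer"].vocabulary_
```

On the real data the same pipeline would be fitted on `train_data.tweet_text`; whether bigrams help is an empirical question to check on the held-out set.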

In [12]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.metrics import confusion_matrix, accuracy_score

In [22]:
# Splitting training/testing data
train_data, test_data = train_test_split(tweets, test_size = 0.3, random_state = 1)

# Naive Bayes Model
naive_bayes_model = make_pipeline(TfidfVectorizer(), MultinomialNB())

# Train the model on the training data
naive_bayes_model.fit(train_data.tweet_text, train_data.cyberbullying_type)

# Predict labels for the test data
labels = naive_bayes_model.predict(test_data.tweet_text)

In [None]:
# Plot confusion matrix (rows of mat are true labels, so transpose for the heatmap)
class_names = naive_bayes_model.classes_  # one tick label per class, in sorted order
mat = confusion_matrix(test_data.cyberbullying_type, labels, labels = class_names)
sns.heatmap(mat.T, square = True, annot = True, fmt = "d",
            xticklabels = class_names,
            yticklabels = class_names)
plt.xlabel("true label")
plt.ylabel("predicted label")
plt.show()

# Accuracy Score
print("The accuracy is {}".format(accuracy_score(test_data.cyberbullying_type, labels)))
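
Accuracy alone can hide weak classes, so a per-class breakdown may help when discussing the results. The sketch below uses `classification_report` from scikit-learn; the helper name `per_class_report` and the toy labels are assumptions for illustration, not output from this notebook.

```python
from sklearn.metrics import classification_report


def per_class_report(y_true, y_pred):
    """Return precision, recall and F1 per class as a nested dict.

    Hypothetical helper; zero_division=0 avoids warnings for classes
    that are never predicted.
    """
    return classification_report(y_true, y_pred, output_dict=True, zero_division=0)


# Toy labels for illustration only
toy_true = ["ethnicity", "ethnicity", "not_cyberbullying", "not_cyberbullying", "not_cyberbullying"]
toy_pred = ["ethnicity", "not_cyberbullying", "not_cyberbullying", "not_cyberbullying", "ethnicity"]

report = per_class_report(toy_true, toy_pred)
```

With the notebook's variables, the call would be `per_class_report(test_data.cyberbullying_type, labels)`.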

# **Recurrent Neural Network (RNN)**