# Cyberbullying Classification
## **Objective: train ≥ 3 classification models for cyberbullying classification, then report and discuss the evaluation results**
### Potential Models
- Naive Bayes
- Linear SVM
- Logistic Regression
- CNN/RNN

### TODO
- preprocess text data
- exploratory data analysis???
- decide 3 models (1 deep learning/2 classic?)
- train and tune 3 models
- test and evaluate
- interpretation
- conclude


In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import re
from sklearn.feature_extraction.text import CountVectorizer

import plotly.graph_objs as go
from plotly.offline import iplot

import cufflinks
cufflinks.go_offline()
cufflinks.set_config_file(world_readable = True, theme = 'pearl')

In [None]:
tweets = pd.read_csv("cyberbullying_tweets.csv")
tweets.head()

Unnamed: 0,tweet_text,cyberbullying_type
0,"In other words #katandandre, your food was cra...",not_cyberbullying
1,Why is #aussietv so white? #MKR #theblock #ImA...,not_cyberbullying
2,@XochitlSuckkks a classy whore? Or more red ve...,not_cyberbullying
3,"@Jason_Gio meh. :P thanks for the heads up, b...",not_cyberbullying
4,@RudhoeEnglish This is an ISIS account pretend...,not_cyberbullying


# Text Preprocessing
- case normalization
- remove special characters/punctuation
- remove stop words

In [None]:
# twitter text cleaning pattern from https://www.kaggle.com/paoloripamonti/twitter-sentiment-analysis

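# Raw string avoids invalid-escape warnings. '@' is excluded from the catch-all class so
# that a mention preceded by a space or punctuation is removed whole by `@\S+`, not just its '@'.
TEXT_CLEANING_RE = r"@\S+|https?:\S+|[^A-Za-z0-9@]+|@"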

from wordcloud import STOPWORDS
STOPWORDS.update(['rt', 'mkr', 'didn', 'bc', 'n', 'm', 'im', 'll', 'y', 've', 'u', 'ur', 'don', 't', 's'])

def lower(text):
    return text.lower()

def remove_twitter(text):
    return re.sub(TEXT_CLEANING_RE, ' ', text)

def remove_stopwords(text):
    return " ".join([word for word in str(text).split() if word not in STOPWORDS])

def clean_text(text):
    text = lower(text)
    text = remove_twitter(text)
    text = remove_stopwords(text)
    return text
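
As a quick check of the three steps (lowercase, strip Twitter artifacts, drop stop words), here is a minimal, dependency-free sketch on a made-up tweet. `DEMO_STOPWORDS`, `clean_text_demo`, and the sample text are hypothetical stand-ins for illustration. The pattern here excludes `@` from the catch-all class, an assumption of this sketch, so that mid-sentence mentions are dropped whole.

```python
import re

# Assumed cleaning pattern: mentions, URLs, then any other non-alphanumeric run.
DEMO_CLEANING_RE = r"@\S+|https?:\S+|[^A-Za-z0-9@]+|@"

# Hypothetical stand-in for wordcloud.STOPWORDS plus the custom additions.
DEMO_STOPWORDS = {"rt", "mkr", "is", "so", "this"}

def clean_text_demo(text):
    text = text.lower()                          # case normalization
    text = re.sub(DEMO_CLEANING_RE, " ", text)   # mentions, URLs, punctuation
    return " ".join(w for w in text.split() if w not in DEMO_STOPWORDS)

sample = "RT @user: Check THIS out!! https://t.co/abc #MKR is so bad"
cleaned = clean_text_demo(sample)
print(cleaned)  # check out bad
```

Note that hashtags lose only the `#`, so the tag word stays as a token unless it is in the stop word set (as `mkr` is here).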

In [None]:
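def get_top_n_gram(corpus, ngram_range, n=None):
    # CountVectorizer expects stop_words as a list (or 'english'/None), not a set
    vec = CountVectorizer(ngram_range=ngram_range, stop_words=list(STOPWORDS)).fit(corpus)
    bag_of_words = vec.transform(corpus)
    # total count of each n-gram across the whole corpus
    sum_words = bag_of_words.sum(axis=0)
    words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]
    # most frequent first; n=None returns every n-gram
    words_freq = sorted(words_freq, key=lambda x: x[1], reverse=True)
    return words_freq[:n]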
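
To make explicit what this helper counts, here is a dependency-free sketch that uses `collections.Counter` instead of `CountVectorizer`. It is not the same implementation: tokenization is a plain whitespace split and no stop words are removed, so results can differ from the helper above on real tweets. `top_ngrams` and `toy_corpus` are hypothetical names for illustration.

```python
from collections import Counter

def top_ngrams(corpus, ngram_range=(1, 1), n=None):
    """Count whitespace-token n-grams over all documents; return the n most frequent."""
    lo, hi = ngram_range
    counts = Counter()
    for doc in corpus:
        tokens = doc.split()
        for size in range(lo, hi + 1):
            # slide a window of `size` tokens across the document
            for i in range(len(tokens) - size + 1):
                counts[" ".join(tokens[i:i + size])] += 1
    return counts.most_common(n)  # n=None returns every n-gram

toy_corpus = ["red velvet cupcakes", "red velvet cake", "angry dude"]
top_bigrams = top_ngrams(toy_corpus, ngram_range=(2, 2), n=1)
print(top_bigrams)  # [('red velvet', 2)]
```

An `ngram_range` of `(1, 2)` would count unigrams and bigrams together, which mirrors how the same parameter behaves in `CountVectorizer`.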

In [None]:
# apply the cleaning pipeline to every tweet
tweets['tweet_text'] = tweets['tweet_text'].apply(clean_text)
tweets.head()

Unnamed: 0,tweet_text,cyberbullying_type
0,words katandandre food crapilicious,not_cyberbullying
1,aussietv white theblock imacelebrityau today s...,not_cyberbullying
2,classy whore red velvet cupcakes,not_cyberbullying
3,meh p thanks heads concerned another angry dud...,not_cyberbullying
4,isis account pretending kurdish account islam ...,not_cyberbullying


In [None]:
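from nltk.stem import WordNetLemmatizer

# Requires the WordNet corpus; run nltk.download('wordnet') once if it is missing
lemmatizer = WordNetLemmatizer()

def lemmatizer_words(text):
    # lemmatize each whitespace-separated token (default part of speech: noun)
    return " ".join([lemmatizer.lemmatize(word) for word in text.split()])

tweets['tweet_text'] = tweets['tweet_text'].apply(lemmatizer_words)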


# Naive Bayes Classifier