# Table of Contents

* [1. Results](#results)
* [2. Import libraries and data](#import)
    * [2.1 Data dictionary](#dict)
* [3. Preprocessing](#preprocessing)
* [4. Feature Engineering](#feature)
* [5. Modelling](#model)

## 1. Results <a class="anchor" id="results"></a>

Veggies es bonus vobis, proinde vos postulo essum magis kohlrabi welsh onion daikon amaranth tatsoi tomatillo melon azuki bean garlic.

Gumbo beet greens corn soko endive gumbo gourd. Parsley shallot courgette tatsoi pea sprouts fava bean collard greens dandelion okra wakame tomato. Dandelion cucumber earthnut pea peanut soko zucchini.

Turnip greens yarrow ricebean rutabaga endive cauliflower sea lettuce kohlrabi amaranth water spinach avocado daikon napa cabbage asparagus winter purslane kale. Celery potato scallion desert raisin horseradish spinach carrot soko. Lotus root water spinach fennel kombu maize bamboo shoot green bean swiss chard seakale pumpkin onion chickpea gram corn pea. Brussels sprout coriander water chestnut gourd swiss chard wakame kohlrabi beetroot carrot watercress. Corn amaranth salsify bunya nuts nori azuki bean chickweed potato bell pepper artichoke.

### Resources

- Multi-label:
    - https://machinelearningmastery.com/multi-label-classification-with-deep-learning/
    - https://towardsdatascience.com/multi-label-text-classification-with-scikit-learn-30714b7819c5

## 2. Import libraries and data <a class="anchor" id="import"></a>

**Import libraries**

In [164]:
# Reading files
import os
import json

# Data cleaning
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Preprocessing
from nltk import word_tokenize
from nltk.tokenize import RegexpTokenizer  # remove punctuation
from nltk.corpus import stopwords
from nltk.corpus import words
from nltk.stem import WordNetLemmatizer

from sklearn.feature_extraction.text import CountVectorizer  # to create document-term matrices from X_train and X_test

# Model utility
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score, KFold
from sklearn.model_selection import train_test_split

from imblearn.over_sampling import SMOTE

# SMOTE cannot be applied directly to a multi-label target: it has to be run one-vs-rest, label by label
# https://www.kaggle.com/questions-and-answers/93669

# Modelling
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier

from sklearn.multiclass import OneVsRestClassifier

# Model evaluation
from sklearn.metrics import f1_score, accuracy_score, recall_score, precision_score, cohen_kappa_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.metrics import classification_report

**Read data**

In [45]:
os.chdir('../data')

# df1 = pd.read_csv("dataset_1.csv", encoding="latin1")

# with open('MMHS150K_GT.json', encoding="utf8") as json_file:
#     json_file = json.load(json_file)
    
df = pd.read_csv("full_df.csv", encoding="latin1", index_col=0)

In [46]:
df.head()

Unnamed: 0,Tweets,NotHate,Racist,Sexist,Homophobe,Religion,OtherHate
0,@FriskDontMiss Nigga https://t.co/cAsaLWEpue,0,1,0,1,1,0
1,My horses are retarded https://t.co/HYhqc6d5WN,0,0,0,0,0,1
2,âNIGGA ON MA MOMMA YOUNGBOY BE SPITTING REAL...,1,0,0,0,0,0
3,RT xxSuGVNGxx: I ran into this HOLY NIGGA TODA...,1,1,0,0,0,0
4,âEVERYbody calling you Nigger now!â https:...,1,1,0,0,0,0
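The stray `â` characters in the output above are a typical symptom of UTF-8 text that was decoded as latin1. Assuming that is the cause here (the real encoding of `full_df.csv` is not confirmed), a minimal sketch of reversing the damage on a single string could look like this. `fix_mojibake` is a hypothetical helper name, not part of the notebook.

```python
def fix_mojibake(text):
    """Undo a 'UTF-8 read as latin1' decode; return the input unchanged if that is not possible."""
    try:
        # Re-encode to the original bytes, then decode them with the encoding assumed to be correct
        return text.encode("latin1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # The text was not produced by this specific mix-up, so leave it alone
        return text


# Simulate the problem: curly quotes encoded as UTF-8 but decoded as latin1
broken = "“quoted”".encode("utf-8").decode("latin1")
print(fix_mojibake(broken))
```

Applied to the dataframe this would be `df['Tweets'].map(fix_mojibake)`. If the file really is UTF-8, re-reading it with `encoding="utf8"` is probably the simpler fix.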


In [27]:
df1.rename(columns = {
    "id": "id",
    "Tweets": "tweets1",
    "Label": "label1"
}, inplace=True)

In [18]:
json_file['1114679353714016256']

{'img_url': 'http://pbs.twimg.com/tweet_video_thumb/D3gi9MHWAAAgfl7.jpg',
 'labels': [4, 1, 3],
 'tweet_url': 'https://twitter.com/user/status/1114679353714016256',
 'tweet_text': '@FriskDontMiss Nigga https://t.co/cAsaLWEpue',
 'labels_str': ['Religion', 'Racist', 'Homophobe']}

In [31]:
def convert_json_todf(data):
    """
    Convert the JSON data (a dict keyed by tweet id) into a dataframe, using the following mapping:
    id: JSON key
    tweets: tweet_text
    label (list of int): labels
    label_str (list of str): labels_str
    """
    # The argument is named `data` so that it does not shadow the imported `json` module
    res = {"id": [], "tweets": [], "label": [], "label_str": []}

    for key, value in data.items():
        res["id"].append(key)
        res["tweets"].append(value["tweet_text"])
        res["label"].append(value["labels"])
        res["label_str"].append(value["labels_str"])

    df = pd.DataFrame(res)

    return df

In [34]:
df2 = convert_json_todf(json_file)
df2.head()

Unnamed: 0,id,tweets,label,label_str
0,1114679353714016256,@FriskDontMiss Nigga https://t.co/cAsaLWEpue,"[4, 1, 3]","[Religion, Racist, Homophobe]"
1,1063020048816660480,My horses are retarded https://t.co/HYhqc6d5WN,"[5, 5, 5]","[OtherHate, OtherHate, OtherHate]"
2,1108927368075374593,“NIGGA ON MA MOMMA YOUNGBOY BE SPITTING REAL S...,"[0, 0, 0]","[NotHate, NotHate, NotHate]"
3,1114558534635618305,RT xxSuGVNGxx: I ran into this HOLY NIGGA TODA...,"[1, 0, 0]","[Racist, NotHate, NotHate]"
4,1035252480215592966,“EVERYbody calling you Nigger now!” https://t....,"[1, 0, 1]","[Racist, NotHate, Racist]"
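Comparing `df2` with `df` suggests that each list of annotator votes was turned into six binary columns, with a column set to 1 when at least one annotator chose that label. The sketch below shows one way to do that conversion. The id-to-name mapping is read off the rows shown above, except for `2: "Sexist"`, which is an assumption based on the column order. `to_multi_hot` is a hypothetical helper.

```python
# Label ids as they appear in the JSON; id 2 is assumed, the others are visible in df2.head()
LABEL_NAMES = {0: "NotHate", 1: "Racist", 2: "Sexist", 3: "Homophobe", 4: "Religion", 5: "OtherHate"}


def to_multi_hot(labels):
    """Map a list of annotator votes to one 0/1 flag per label (1 if any annotator chose it)."""
    voted = set(labels)
    return {name: int(idx in voted) for idx, name in LABEL_NAMES.items()}


print(to_multi_hot([4, 1, 3]))  # votes for the first tweet in df2
```

With pandas, `pd.DataFrame(df2["label"].map(to_multi_hot).tolist())` would give one column per label.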


In [48]:
print(len(df1), len(df2), len(df))

16035 149823 165858


In [47]:
df.head(20)  # we will use this dataset

Unnamed: 0,Tweets,NotHate,Racist,Sexist,Homophobe,Religion,OtherHate
0,@FriskDontMiss Nigga https://t.co/cAsaLWEpue,0,1,0,1,1,0
1,My horses are retarded https://t.co/HYhqc6d5WN,0,0,0,0,0,1
2,âNIGGA ON MA MOMMA YOUNGBOY BE SPITTING REAL...,1,0,0,0,0,0
3,RT xxSuGVNGxx: I ran into this HOLY NIGGA TODA...,1,1,0,0,0,0
4,âEVERYbody calling you Nigger now!â https:...,1,1,0,0,0,0
5,â real ass bitch give a fuck boutta niggaâ...,1,0,0,0,0,0
6,@WhiteHouse @realDonaldTrump Fuck ice. White s...,0,1,0,0,0,1
7,Dayâs a cunt https://t.co/Ie6QZReHsw,1,0,0,0,0,0
8,#sissy faggot https://t.co/bm1nk8HcYO,1,0,0,1,0,0
9,@Gloriko_ Nigga what? https://t.co/nOwIJtgtU1,1,0,0,1,1,0


In [56]:
# Share of tweets flagged with each label (a tweet can carry several labels)
for label in ["NotHate", "Racist", "Sexist", "Homophobe", "Religion", "OtherHate"]:
    pct = df[label].value_counts()[1] * 100 / len(df)
    print("Percentage of {} classified: {}%".format(label, pct))

Percentage of NotHate classified: 91.75258353531333%
Percentage of Racist classified: 31.25625535096287%
Percentage of Sexist classified: 13.306563445839211%
Percentage of Homophobe classified: 7.340013746698983%
Percentage of Religion classified: 1.4596823789024347%
Percentage of OtherHate classified: 14.875375321057772%


### 2.1 Data Dictionary <a class="anchor" id="dict"></a>

|Column Name|Variable Name|Description|
|---|:---:|:---|
|id|id|Unique identifier for each tweet|
|Tweets|Tweet content|Body of the tweet|
|Label|Classification label|Multi-label target (a tweet can carry several): sexism, racism, homophobe, religion, other hate or none|

## 3. Preprocessing <a class="anchor" id="preprocessing"></a>

TODO: Add code to combine datasets

0. Tokenizing (NLTK)
1. Remove stopwords
2. Remove links and '@' (not necessary if we keep only English words)
3. Stemming / Lemmatization (https://www.datacamp.com/community/tutorials/stemming-lemmatization-python)
4. Add custom hate speech vocab (decide later)
5. Dealing with emojis (TODO: later)
6. Tokenizing (CountVectorizer)
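For the "remove links and '@'" step, one option is to strip URLs and mentions with regular expressions before tokenizing, so that fragments such as `https`, `co` and the short-link id never become tokens. This is a minimal sketch, not the notebook's own method: the two patterns and the helper name `strip_links_and_mentions` are assumptions.

```python
import re

URL_RE = re.compile(r"https?://\S+")  # a link up to the next whitespace
MENTION_RE = re.compile(r"@\w+")      # an @username


def strip_links_and_mentions(text):
    """Remove URLs and @mentions, then collapse the leftover whitespace."""
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    return " ".join(text.split())


print(strip_links_and_mentions("@WhiteHouse look at this https://t.co/abc123 now"))
```

It could be applied with `df['Tweets'].map(strip_links_and_mentions)` ahead of the NLTK tokenizer.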

TODO: Create custom vocab <br>
TODO2: https://stackoverflow.com/questions/28339622/is-there-a-corpora-of-english-words-in-nltk

- Reason for not filtering down to dictionary English words: some people misspell words, sometimes on purpose
    - We can try word-similarity matching and autocorrect instead

https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0221152

https://stackoverflow.com/questions/15547409/how-to-get-rid-of-punctuation-using-nltk-tokenizer

**Proposal**: Similarity matching for misspellings
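One lightweight way to try the similarity-matching idea is the standard library's `difflib`, which ranks candidates by a sequence-similarity ratio. In the sketch below the vocabulary, the `cutoff` of 0.8 and the helper name `autocorrect` are illustrative assumptions. A real vocabulary would come from the custom hate speech word list.

```python
from difflib import get_close_matches


def autocorrect(token, vocabulary, cutoff=0.8):
    """Return the closest vocabulary word, or the token itself if nothing is similar enough."""
    matches = get_close_matches(token.lower(), vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else token


vocab = ["racist", "sexist", "horse"]  # toy vocabulary for illustration
print(autocorrect("raciist", vocab))   # close to a known word
print(autocorrect("zebra", vocab))     # nothing similar, left unchanged
```

`get_close_matches` compares the token against every vocabulary entry, so on 165k tweets it would be worth caching results per distinct token.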

### 3.1 Tokenize (NLTK) and remove stopwords

In [76]:
stopword_list = stopwords.words("english")
stopword_list[:10]  # all in lowercase

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]

In [77]:
english_words = words.words()
english_words[:10]  # case-sensitive

['A',
 'a',
 'aa',
 'aal',
 'aalii',
 'aam',
 'Aani',
 'aardvark',
 'aardwolf',
 'Aaron']

In [99]:
def tokenize(df):
    """
    Takes in a dataframe and returns a list of tokenized tweets (one list of tokens per tweet).
    Applies the following filters:
    1. Strips punctuation, including '@' (the username itself is kept)
    2. Removes English stopwords (case-insensitive)
    3. (optional, disabled) Removes non-English words
    """
    res = []

    stopword_set = set(stopwords.words("english"))  # set for fast membership tests
    # english_words = set(words.words())  # only needed for the optional filter below

    tokenizer = RegexpTokenizer(r'\w+')  # keep runs of word characters, dropping punctuation
    tokenized = list(map(tokenizer.tokenize, list(df['Tweets'])))

    for tweet in tokenized:
        cleaned_tweet = [w for w in tweet if w.lower() not in stopword_set]
        # cleaned_tweet = [w for w in cleaned_tweet if w in english_words]  # too slow with a list; use the set above

        res.append(cleaned_tweet)

    return res

In [100]:
tokenized_list = tokenize(df)
tokenized_list[:5]

[['FriskDontMiss', 'Nigga', 'https', 'co', 'cAsaLWEpue'],
 ['horses', 'retarded', 'https', 'co', 'HYhqc6d5WN'],
 ['â',
  'NIGGA',
  'MOMMA',
  'YOUNGBOY',
  'SPITTING',
  'REAL',
  'SHIT',
  'NIGGAâ',
  'https',
  'co',
  'UczofqHrLq'],
 ['RT',
  'xxSuGVNGxx',
  'ran',
  'HOLY',
  'NIGGA',
  'TODAY',
  'ð',
  'ð',
  'ð',
  'ð',
  'https',
  'co',
  'Wa6Spl9kIw'],
 ['â', 'EVERYbody', 'calling', 'Nigger', 'â', 'https', 'co', '6mguJ6KIBF']]

### 3.2 Lemmatization

In [104]:
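def lemmatize(lst):
    """
    Lemmatizes every token in a list of tokenized tweets.
    WordNetLemmatizer treats each word as a noun by default (pos='n').
    """
    res = []
    wordnet_lemmatizer = WordNetLemmatizer()

    for tweet in lst:
        lemmatized_tweet = [wordnet_lemmatizer.lemmatize(w) for w in tweet]
        res.append(lemmatized_tweet)

    return res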

In [105]:
lemmatized_list = lemmatize(tokenized_list)
lemmatized_list[:5]

[['FriskDontMiss', 'Nigga', 'http', 'co', 'cAsaLWEpue'],
 ['horse', 'retarded', 'http', 'co', 'HYhqc6d5WN'],
 ['â',
  'NIGGA',
  'MOMMA',
  'YOUNGBOY',
  'SPITTING',
  'REAL',
  'SHIT',
  'NIGGAâ',
  'http',
  'co',
  'UczofqHrLq'],
 ['RT',
  'xxSuGVNGxx',
  'ran',
  'HOLY',
  'NIGGA',
  'TODAY',
  'ð',
  'ð',
  'ð',
  'ð',
  'http',
  'co',
  'Wa6Spl9kIw'],
 ['â', 'EVERYbody', 'calling', 'Nigger', 'â', 'http', 'co', '6mguJ6KIBF']]

In [110]:
# Collect the distinct (lemma, original token) pairs that the lemmatizer changed, in order of first appearance
lemma_test = []
seen = set()

for lemmas, tokens in zip(lemmatized_list, tokenized_list):
    for lemma, token in zip(lemmas, tokens):
        if lemma != token and (lemma, token) not in seen:
            seen.add((lemma, token))
            lemma_test.append((lemma, token))

In [112]:
lemma_test[:10]

[('http', 'https'),
 ('horse', 'horses'),
 ('as', 'ass'),
 ('say', 'says'),
 ('call', 'calls'),
 ('co', 'cos'),
 ('racist', 'racists'),
 ('jena', 'jenas'),
 ('saving', 'savings'),
 ('there', 'theres')]

**Append back to dataframe**

Issue: too many junk words, which leads to high dimensionality

### 3.3 Use CountVectorizer

In [114]:
df.head()

Unnamed: 0,Tweets,NotHate,Racist,Sexist,Homophobe,Religion,OtherHate,tokenized_tweets
0,@FriskDontMiss Nigga https://t.co/cAsaLWEpue,0,1,0,1,1,0,"[@, FriskDontMiss, Nigga, https, :, //t.co/cAs..."
1,My horses are retarded https://t.co/HYhqc6d5WN,0,0,0,0,0,1,"[My, horses, are, retarded, https, :, //t.co/H..."
2,âNIGGA ON MA MOMMA YOUNGBOY BE SPITTING REAL...,1,0,0,0,0,0,"[âNIGGA, ON, MA, MOMMA, YOUNGBOY, BE, SPITTI..."
3,RT xxSuGVNGxx: I ran into this HOLY NIGGA TODA...,1,1,0,0,0,0,"[RT, xxSuGVNGxx, :, I, ran, into, this, HOLY, ..."
4,âEVERYbody calling you Nigger now!â https:...,1,1,0,0,0,0,"[âEVERYbody, calling, you, Nigger, now, !, â..."


In [143]:
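def count_vectorizer_tokenize(df, stop_words=None, max_df=1.0, min_df=1):
    """
    1. Split data
    2. Create document-term matrices for train and test
    Code taken from: 04 Natural_Language_Processing using NB

    max_df / min_df: an int is an absolute document count, a float is a proportion of documents.
    min_df defaults to the int 1 (keep every term); the float 1.0 would require a term to appear in every document.
    """
    vect = CountVectorizer(
        stop_words=stop_words,
        max_df=max_df,
        min_df=min_df
    )  # flexible parameters

    X = df['Tweets']
    y = df.drop(columns=["Tweets", "tokenized_tweets"])  # the six binary label columns

    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

    # Fit the vocabulary on the training set only, then reuse it for the test set
    X_train_dtm = vect.fit_transform(X_train)
    X_test_dtm = vect.transform(X_test)

    return vect, X_train_dtm, X_test_dtm, y_train, y_test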

In [118]:
vect, X_train_dtm, X_test_dtm, y_train, y_test = count_vectorizer_tokenize(df)

In [119]:
X_train_dtm.shape, X_test_dtm.shape

((124393, 221454), (41465, 221454))

In [144]:
vect, X_train_dtm, X_test_dtm, y_train, y_test = count_vectorizer_tokenize(df, 
                                                                           stop_words="english", max_df=1.0, min_df=0.0001)

In [145]:
X_train_dtm.shape, X_test_dtm.shape  # using 0.0001 we get ~6000 features

((124393, 6066), (41465, 6066))

In [153]:
vect.get_feature_names()[:10]  # on scikit-learn >= 1.0, use get_feature_names_out() instead

# can remove numbers, gibberish (aalwuhaib1977), userids (allanharper1920)

['00', '000', '01', '02', '03', '04', '05', '05pm', '06', '06jank']
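The numeric features above get through because `CountVectorizer`'s default `token_pattern` is `r"(?u)\b\w\w+\b"`, and `\w` matches digits as well as letters. The sketch below uses plain `re` to compare that default with a letters-only pattern. The alternative pattern and the sample sentence are assumptions for illustration, not something the notebook uses.

```python
import re

DEFAULT_PATTERN = r"(?u)\b\w\w+\b"      # CountVectorizer's default token_pattern
LETTERS_ONLY = r"(?u)\b[a-zA-Z]{2,}\b"  # assumed alternative: tokens made of ASCII letters only

text = "posted at 05pm by allanharper1920 in 2019"

print(re.findall(DEFAULT_PATTERN, text))  # keeps numbers and user ids
print(re.findall(LETTERS_ONLY, text))     # drops any token containing a digit
```

The pattern could then be passed as `CountVectorizer(token_pattern=LETTERS_ONLY)`. Because it is ASCII-only, it would also drop tokens containing non-ASCII letters, which removes the mojibake tokens but also any legitimate non-English words.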

Since the professor's example used 16,825 features, we lower `min_df` until the feature count lands near that number.

In [147]:
vect, X_train_dtm, X_test_dtm, y_train, y_test = count_vectorizer_tokenize(df, 
                                                                           stop_words="english", 
                                                                           max_df=1.0, min_df=0.00005)

In [148]:
X_train_dtm.shape, X_test_dtm.shape 

((124393, 9920), (41465, 9920))

In [151]:
vect, X_train_dtm, X_test_dtm, y_train, y_test = count_vectorizer_tokenize(df, 
                                                                           stop_words="english", 
                                                                           max_df=1.0, min_df=0.00003)

In [152]:
X_train_dtm.shape, X_test_dtm.shape  # 0.00003 looks good

((124393, 15761), (41465, 15761))
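When `min_df` is a float, `CountVectorizer` reads it as a proportion of the documents it is fitted on, so the tiny values tried above correspond to small absolute document counts. The sketch below works out those counts for the 124,393 training tweets. `min_doc_count` is a hypothetical helper, and the rounding reflects the rule that a term is kept when its document frequency is at least `min_df * n_docs`.

```python
import math

N_TRAIN_DOCS = 124393  # rows in X_train_dtm above


def min_doc_count(min_df, n_docs=N_TRAIN_DOCS):
    """Smallest number of documents a term must appear in to survive a fractional min_df."""
    return math.ceil(min_df * n_docs)


for min_df in (0.0001, 0.00005, 0.00003):
    print(min_df, "->", min_doc_count(min_df), "documents")
```

Under that reading, `min_df=0.00003` keeps any term that appears in at least 4 training tweets, so passing the integer `min_df=4` would express the same threshold more directly.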

In [156]:
vect.get_feature_names()[:100:-1]  # reversed feature list, from the last feature down to index 101 (output truncated)

['ø³ù',
 'øªù',
 'ï¼',
 '¾ð',
 '¾ï',
 '¾â',
 '½ð',
 '½ï',
 '½â',
 '¼ð',
 '¼ï',
 '¼ã',
 '¼â',
 'ºð',
 'ºï',
 'ºâ',
 'ºredneck',
 '¹ð',
 '¹ï',
 '¹â',
 'µð',
 'µï',
 'µâ',
 '³ó',
 '³ð',
 '³ï',
 '³â',
 '²ð',
 'ªð',
 'ªï',
 'ªã',
 'ªâ',
 'zâ',
 'zython86',
 'zupta_chologist',
 'zulu',
 'zuckles',
 'zoro',
 'zoom',
 'zoo',
 'zones',
 'zone',
 'zombies',
 'zombie',
 'zoeydollaz',
 'zoey',
 'zoe',
 'zo',
 'zirtun',
 'zip_zona',
 'zip',
 'ziorim',
 'zionist',
 'zionazi',
 'zion',
 'zimmerman',
 'zigmanfreud',
 'zh_ha89',
 'zge1s6xz84',
 'zeus',
 'zesty',
 'zero',
 'zendayacoochie',
 'zebra',
 'zealanders',
 'zealand',
 'zakirnaikirf',
 'zaki_safar',
 'zainab',
 'zaibatsunews',
 'zahoorgorsi',
 'zackfox',
 'zack',
 'zach',
 'za',
 'yâ',
 'yywodfargd',
 'yvngplank',
 'yvettecoopermp',
 'yves',
 'yusuke',
 'yusufpeaceful',
 'yuskan0723',
 'yup',
 'yungill314',
 'yungbastard_',
 'yung',
 'yummy',
 'yum',
 'yui',
 'yuhkyle',
 'yuh',
 'yugi',
 'yuck',
 'yu',
 'yt',
 'yslchain',
 'ysl',
 'yrs',
 'yr',

## 4. Feature Engineering <a class="anchor" id="feature"></a>

Maybe in this part we can try to remove words (take some ideas from the final exam)

## 5. Modelling <a class="anchor" id="model"></a>

Try out different approaches:
- Approach 3: Don't convert to lowercase
- Approach 4: Include 1-grams and 2-grams
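Both approaches map onto existing `CountVectorizer` parameters: `lowercase=False` keeps the original casing, and `ngram_range=(1, 2)` adds 2-grams next to the single words. The plain-Python sketch below only illustrates how those settings change the features. `word_ngrams` is a hypothetical helper and the sentence is made up. It is not scikit-learn's implementation.

```python
def word_ngrams(tokens, min_n=1, max_n=2):
    """Return all n-grams for n in [min_n, max_n], each joined with a single space."""
    grams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


tokens = "My horses are FAST".split()

print(word_ngrams(tokens))                       # casing preserved (Approach 3) plus 2-grams (Approach 4)
print(word_ngrams([t.lower() for t in tokens]))  # lowercased, as CountVectorizer does by default
```

Keeping case means `FAST` and `fast` become separate features, and adding 2-grams multiplies the vocabulary size, so `min_df` would probably need retuning for either approach.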

In [169]:
def build_model(model_class, X_train_dtm, X_test_dtm, y_train, y_test):
    """
    Fit a one-vs-rest classifier (one binary model per label) and print train/test metrics.
    For multi-label targets, accuracy_score is the exact-match (subset) accuracy.
    """
    model = OneVsRestClassifier(model_class())

    print('Features: ', X_train_dtm.shape[1])

    model.fit(X_train_dtm, y_train)
    y_pred_train = model.predict(X_train_dtm)  # predict once and reuse for both training metrics
    y_pred_class = model.predict(X_test_dtm)

    print('Training Accuracy: ', accuracy_score(y_train, y_pred_train))
    print('Test Accuracy: ', accuracy_score(y_test, y_pred_class))

    print('Training F1: ', f1_score(y_train, y_pred_train, average='macro'))
    print('Test F1: ', f1_score(y_test, y_pred_class, average='macro'))

### 5.1 Base Model: Multinomial Naive Bayes

In [170]:
build_model(MultinomialNB, X_train_dtm, X_test_dtm, y_train, y_test)

Features:  15761
Training Accuracy:  0.4580000482342254
Test Accuracy:  0.41483178584348246
Training F1:  0.5615938682318115
Test F1:  0.5160897313132401
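The accuracy is lower than the F1 score partly because, for multi-label targets, `accuracy_score` reports subset accuracy: a tweet only counts as correct when all six labels match. Macro F1 is instead computed per label and then averaged. The toy example below reimplements both metrics by hand to show the difference. The labels and predictions are made up and are not results from this model.

```python
def subset_accuracy(y_true, y_pred):
    """Fraction of rows where every label matches exactly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-label F1 scores (F1 = 2TP / (2TP + FP + FN))."""
    scores = []
    for j in range(len(y_true[0])):
        tp = sum(t[j] == 1 and p[j] == 1 for t, p in zip(y_true, y_pred))
        fp = sum(t[j] == 0 and p[j] == 1 for t, p in zip(y_true, y_pred))
        fn = sum(t[j] == 1 and p[j] == 0 for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Made-up data: 4 tweets, 3 labels; two rows are wrong on the middle label only
y_true = [[1, 0, 0], [1, 1, 0], [0, 0, 1], [1, 0, 1]]
y_pred = [[1, 0, 0], [1, 0, 0], [0, 0, 1], [1, 1, 1]]

print("Subset accuracy:", subset_accuracy(y_true, y_pred))
print("Macro F1:", round(macro_f1(y_true, y_pred), 3))
```

A single wrong label makes the whole row count as a miss for accuracy, while macro F1 still gives full credit to the labels that were predicted correctly.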
