# Table of Contents

* [1. Results](#results)
* [2. Import libraries and data](#import)
    * [2.1 Data dictionary](#dict)
* [3. Preprocessing](#preprocessing)
* [4. Feature Engineering](#feature)
* [5. Modelling](#model)

## 1. Results <a class="anchor" id="results"></a>

Veggies es bonus vobis, proinde vos postulo essum magis kohlrabi welsh onion daikon amaranth tatsoi tomatillo melon azuki bean garlic.

Gumbo beet greens corn soko endive gumbo gourd. Parsley shallot courgette tatsoi pea sprouts fava bean collard greens dandelion okra wakame tomato. Dandelion cucumber earthnut pea peanut soko zucchini.

Turnip greens yarrow ricebean rutabaga endive cauliflower sea lettuce kohlrabi amaranth water spinach avocado daikon napa cabbage asparagus winter purslane kale. Celery potato scallion desert raisin horseradish spinach carrot soko. Lotus root water spinach fennel kombu maize bamboo shoot green bean swiss chard seakale pumpkin onion chickpea gram corn pea. Brussels sprout coriander water chestnut gourd swiss chard wakame kohlrabi beetroot carrot watercress. Corn amaranth salsify bunya nuts nori azuki bean chickweed potato bell pepper artichoke.

## 2. Import libraries and data <a class="anchor" id="import"></a>

**Import libraries**

In [14]:
# Reading files
import os
import json

# Data cleaning
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Model util
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score, KFold
from sklearn.model_selection import train_test_split

from imblearn.over_sampling import SMOTE

# Modelling
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

# Model evaluation
from sklearn.metrics import f1_score, accuracy_score, recall_score, precision_score, cohen_kappa_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.metrics import classification_report

**Read data**

In [11]:
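# Work relative to the data directory (re-running this cell changes directory again)
os.chdir('../data')

df1 = pd.read_csv("dataset_1.csv", encoding="latin1")

# Keep the file handle (f) separate from the parsed JSON stored in json_file
with open('MMHS150K_GT.json', encoding="utf8") as f:
    json_file = json.load(f)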
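
The cell above depends on the current working directory. As a minimal sketch of an alternative, the two files can be loaded from an explicit folder instead, so the cell can be re-run safely. The helper name `load_inputs` and its `data_dir` parameter are hypothetical and not part of the original notebook; the file names and encodings are the ones used above.

```python
import json
from pathlib import Path

import pandas as pd


def load_inputs(data_dir):
    """Load the CSV and JSON inputs from data_dir without calling os.chdir.

    Hypothetical helper: only the file names and encodings come from the notebook.
    """
    data_dir = Path(data_dir)
    tweets_df = pd.read_csv(data_dir / "dataset_1.csv", encoding="latin1")
    with open(data_dir / "MMHS150K_GT.json", encoding="utf8") as fh:
        annotations = json.load(fh)
    return tweets_df, annotations
```

With this helper, `df1, json_file = load_inputs("../data")` would produce the same two objects as the cell above, assuming the notebook is started from its own folder.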

In [28]:
df1.head()

Unnamed: 0,id,tweets1,label1
0,7281,The jokes and puns are flying free in this cam...,none
1,7282,#MKR Lets see who the producers think are goin...,none
2,7283,Praying Jac and Shaz do well! They're my faves...,none
3,7284,RT @realityraver: Pete Evans the Paleo Capital...,none
4,7285,If Kat and Andre stay tonight I will stop watc...,none


In [27]:
# Rename the raw CSV headers to the column names used in the rest of the notebook
df1.rename(columns={
    "Tweets": "tweets1",
    "Label": "label1"
}, inplace=True)

In [18]:
json_file['1114679353714016256']

{'img_url': 'http://pbs.twimg.com/tweet_video_thumb/D3gi9MHWAAAgfl7.jpg',
 'labels': [4, 1, 3],
 'tweet_url': 'https://twitter.com/user/status/1114679353714016256',
 'tweet_text': '@FriskDontMiss Nigga https://t.co/cAsaLWEpue',
 'labels_str': ['Religion', 'Racist', 'Homophobe']}

In [31]:
def convert_json_todf(data):
    """
    Convert the JSON annotations into a dataframe, using the following mapping:
    id: JSON key (tweet id)
    tweets: tweet_text
    label (list of int): labels
    label_str (list of str): labels_str
    """
    # The parameter is named `data` so that it does not shadow the json module
    res = {"id": [], "tweets": [], "label": [], "label_str": []}

    for key, value in data.items():
        res["id"].append(key)
        res["tweets"].append(value["tweet_text"])
        res["label"].append(value["labels"])
        res["label_str"].append(value["labels_str"])

    df = pd.DataFrame(res)

    return df

In [34]:
df2 = convert_json_todf(json_file)
df2.head()

Unnamed: 0,id,tweets,label,label_str
0,1114679353714016256,@FriskDontMiss Nigga https://t.co/cAsaLWEpue,"[4, 1, 3]","[Religion, Racist, Homophobe]"
1,1063020048816660480,My horses are retarded https://t.co/HYhqc6d5WN,"[5, 5, 5]","[OtherHate, OtherHate, OtherHate]"
2,1108927368075374593,“NIGGA ON MA MOMMA YOUNGBOY BE SPITTING REAL S...,"[0, 0, 0]","[NotHate, NotHate, NotHate]"
3,1114558534635618305,RT xxSuGVNGxx: I ran into this HOLY NIGGA TODA...,"[1, 0, 0]","[Racist, NotHate, NotHate]"
4,1035252480215592966,“EVERYbody calling you Nigger now!” https://t....,"[1, 0, 1]","[Racist, NotHate, Racist]"
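
Each row above carries three entries in `label` and `label_str`. Assuming these are three separate annotator votes (the notebook does not state this), one possible way to reduce them to a single label per tweet is a majority vote. The sketch below is illustrative only: `majority_label`, `tie_value`, and the `example` frame are hypothetical names, and treating a three-way disagreement as "no label" is a design choice, not something the notebook prescribes.

```python
from collections import Counter

import pandas as pd


def majority_label(votes, tie_value=None):
    """Return the most frequent vote, or tie_value when the top count is shared."""
    counts = Counter(votes).most_common()
    if not counts:
        return tie_value
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return tie_value  # no single label leads, e.g. three different votes
    return counts[0][0]


# Small hand-made frame that mirrors the label_str column shown above
example = pd.DataFrame({
    "id": ["a", "b", "c"],
    "label_str": [
        ["NotHate", "NotHate", "NotHate"],
        ["Racist", "NotHate", "NotHate"],
        ["Religion", "Racist", "Homophobe"],
    ],
})
example["majority"] = example["label_str"].apply(majority_label)
```

Applied to `df2`, the same call would be `df2["label_str"].apply(majority_label)`. Rows where all three votes differ would get `tie_value` and would need a separate decision.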


### 2.1 Data dictionary <a class="anchor" id="dict"></a>

|Column Name (raw CSV)|Variable Name (in `df1`)|Description
|---|:---:|:---
|id|id|Unique identifier for each tweet
|Tweets|tweets1|Tweet content (body of the tweet)
|Label|label1|Classification label (multi-class): sexism, racism or none
## 3. Preprocessing <a class="anchor" id="preprocessing"></a>

## 4. Feature Engineering <a class="anchor" id="feature"></a>

## 5. Modelling <a class="anchor" id="model"></a>