# Exploratory Data Analysis 
---
*Disclaimer: This project contains text that is profane, vulgar, and offensive due to the nature of the dataset.*

We live in an age where people's lives have become intertwined with their online presence, allowing humans to engage with one another on a larger scale than ever before.  However, not all of these interactions are friendly.  Some people exploit the fact that their identity remains hidden and target others with unwarranted abuse, causing harm to people they will never meet in person.  For our society to truly prosper in the digital age, we have to combat this toxic behavior.  Doing so will let people who are afraid of how others will react to their posts engage freely, without the fear of becoming the target of online hate.

Our task is to improve civility on social media platforms (e.g., Twitter) and online comment forums (e.g., Reddit) by training a model that estimates how likely a user's comment is to make another user leave a conversation.  With our model, we will create a web application that tracks how toxic each social media platform and online comment forum is, to raise awareness of this issue and spark initiatives to create a toxicity-free environment for all users.

This issue can only be solved with the help of everyone encouraging kindness instead of spreading hate.

## Data Description
The dataset used in this project is from the Kaggle competition [Jigsaw Unintended Bias in Toxicity Classification](https://www.kaggle.com/c/jigsaw-unintended-bias-in-toxicity-classification/data).

In 2017, the Civil Comments platform shut down and released its archive of ~2 million public comments for researchers to study in an effort to improve civility online. Jigsaw funded the annotation of this data by human raters.

The column `toxicity` is our toxicity label. It contains a number between 0 and 1 denoting the fraction of human labelers who believed the comment would make someone else leave a conversation.

For our analysis, we'll define a comment as toxic (denoted 1) when the value of our target `toxicity` is greater than or equal to 0.5; otherwise, we'll assign the instance to the negative class (denoted 0).
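
As a minimal sketch of this rule, assuming a small made-up series of scores rather than the real data, an explicit comparison maps each score to one of the two classes. The name `TOXICITY_THRESHOLD` is introduced here only for illustration.

```python
import pandas as pd

# Hypothetical toxicity scores (fraction of labelers who flagged each comment).
scores = pd.Series([0.0, 0.2, 0.5, 0.83, 1.0], name="toxicity")

# Comments with a score of at least 0.5 form the positive (toxic) class.
TOXICITY_THRESHOLD = 0.5
labels = (scores >= TOXICITY_THRESHOLD).astype(int)

print(labels.tolist())
```

Writing the threshold as a comparison keeps the boundary case (a score of exactly 0.5) in the positive class, as the definition requires.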

There are also many additional labels.  The toxicity subtype labels give the fraction of human labelers who believed the comment showed a specific kind of toxicity, and the identity labels give the fraction who believed a specific identity was mentioned in the comment. These subtype attributes are:
* severe_toxicity
* obscene
* threat
* insult
* identity_attack
* sexual_explicit
* male
* female
* transgender
* other_gender
* heterosexual
* homosexual_gay_or_lesbian
* bisexual
* other_sexual_orientation
* christian
* jewish
* muslim
* hindu
* buddhist
* atheist
* other_religion
* black
* white
* asian
* latino
* other_race_or_ethnicity
* physical_disability
* intellectual_or_learning_disability
* psychiatric_or_mental_illness
* other_disability

The competition page also notes that different comments might have the exact same text but be labeled with different targets or subtype attributes.

These subtype attributes will be **removed** from the training schema because we will not have access to them in a production environment.  However, they will be very useful as tags during error analysis to see exactly where our model performs poorly.
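
The split between training columns and error-analysis tags could look roughly like the sketch below. The toy frame, the `predicted` series, and the helper names are all made up for illustration; only the column names come from the dataset.

```python
import pandas as pd

# Toy frame standing in for the raw data; the values are made up.
raw = pd.DataFrame({
    "comment_text": ["you are great", "you are awful", "what a mess"],
    "toxicity": [0.0, 0.8, 0.6],
    "insult": [0.0, 0.7, 0.1],
    "obscene": [0.0, 0.0, 0.5],
})

# Only the text and the target go into the training schema.
TRAIN_COLUMNS = ["comment_text", "toxicity"]
train_view = raw[TRAIN_COLUMNS]

# The remaining columns are kept aside as tags for error analysis.
tags = raw.drop(columns=TRAIN_COLUMNS)

# Hypothetical model predictions, used only to illustrate slicing errors by tag.
predicted = pd.Series([0, 1, 0], index=raw.index)
actual = (raw["toxicity"] >= 0.5).astype(int)
errors = predicted != actual

# Error rate among comments that at least half of the labelers tagged as obscene.
obscene_mask = tags["obscene"] >= 0.5
obscene_error_rate = errors[obscene_mask].mean()
print(obscene_error_rate)
```

Slicing the errors by one tag at a time in this way shows whether misclassifications cluster in a particular subtype.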

### Labeling Schema
As mentioned on Kaggle, each comment was shown to up to 10 human labelers.  The labelers were asked to rate how toxic each comment is on the following scale:
* Very Toxic
* Toxic
* Hard to Say
* Not Toxic

The ratings were then aggregated into the target column (`toxicity` in the file we load).

The subtype attributes were collected by asking the human labelers whether the attribute was mentioned in the comment.  These responses were then aggregated into a value between 0 and 1 representing the fraction of human labelers who believed the subtype attribute was mentioned in the comment.

*Note: Some comments were shown to more than 10 human labelers because of sampling strategies to increase rating accuracy.*
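
To make the aggregation concrete, here is a minimal sketch of turning one comment's ratings into a fraction. It assumes that "Toxic" and "Very Toxic" both count as a toxic vote, which is an assumption of this example and not something the source states; the function name `toxic_fraction` is also hypothetical.

```python
# Hypothetical ratings for one comment, one entry per human labeler.
ratings = ["Not Toxic", "Toxic", "Very Toxic", "Hard to Say", "Not Toxic"]

# Assumption: these two choices count as a "toxic" vote.
TOXIC_CHOICES = {"Toxic", "Very Toxic"}


def toxic_fraction(ratings):
    """Return the fraction of labelers whose rating counts as toxic."""
    if not ratings:
        raise ValueError("at least one rating is required")
    toxic_votes = sum(rating in TOXIC_CHOICES for rating in ratings)
    return toxic_votes / len(ratings)


print(toxic_fraction(ratings))  # 2 of 5 labelers -> 0.4
```

The same counting pattern would apply to a subtype attribute, with each labeler's yes/no answer taking the place of the rating.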

In [1]:
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
import os
import pandas as pd

import nltk
from sklearn.model_selection import train_test_split
import tensorflow as tf
import textblob
import wordcloud

# Root directory used to navigate tree.
PROJECT_ROOT_DIR = os.path.dirname(os.getcwd())

# Path to save images and figures.
IMAGES_PATH = os.path.join(PROJECT_ROOT_DIR, "reports", "figures")
os.makedirs(IMAGES_PATH, exist_ok=True)

def save_fig(fig_id, tight_layout=True, fig_extension="png", resolution=300):
    """
    Saves figures in toxic_media/reports/figures.
    
    Parameters
    ----------
    fig_id : str
        Name of the figure to be saved.
    tight_layout : bool, default=True
        Enables padding between figure edge and edges of subplots when set to True.
    fig_extension : {'jpg', 'png', 'svg'}, default='png'
        Figure format.
    resolution : int, default=300
        Figure resolution in dots per inch.
    """
    path = os.path.join(IMAGES_PATH, fig_id + "." + fig_extension)
    print(f"Saving figure: {fig_id}")
    if tight_layout:
        plt.tight_layout()
    plt.savefig(path, format=fig_extension, dpi=resolution)

## Load Data
We'll start by loading the raw data to get a quick look at it before creating the training set that we'll explore in depth. As mentioned above, the raw data file has extra columns that we won't train on.  We load the full file here so we can inspect them, and keep only the comment text and the target column when we create the train and test sets.
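
If memory becomes a concern, a load can be restricted to just the text and target columns. The sketch below uses the `usecols` parameter of `pd.read_csv` on a tiny in-memory CSV; the rows are made up and stand in for the real file.

```python
import io

import pandas as pd

# Tiny in-memory CSV standing in for the raw file; the rows are made up.
csv_text = (
    "id,comment_text,toxicity,insult\n"
    "1,hello there,0.0,0.0\n"
    "2,you fool,0.7,0.6\n"
)

# usecols restricts parsing to the listed columns, which saves memory on wide files.
subset = pd.read_csv(io.StringIO(csv_text), usecols=["comment_text", "toxicity"])
print(subset.columns.tolist())
```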

In [2]:
DATA_DIR = os.path.join(PROJECT_ROOT_DIR, "data")

def load_data(path="raw/all_data.csv"):
    """
    Load data into pandas dataframe object.
    
    Parameters
    ----------
    path : str, default='raw/all_data.csv'
        Path to the CSV file relative to the data directory, in the format 'directory/filename.csv'.
    """
    csv_path = os.path.join(DATA_DIR, path)
    
    return pd.read_csv(csv_path)

In [3]:
toxic = load_data()

In [4]:
toxic.head()

Unnamed: 0,id,comment_text,split,created_date,publication_id,parent_id,article_id,rating,funny,wow,...,white,asian,latino,other_race_or_ethnicity,physical_disability,intellectual_or_learning_disability,psychiatric_or_mental_illness,other_disability,identity_annotator_count,toxicity_annotator_count
0,1083994,He got his money... now he lies in wait till a...,train,2017-03-06 15:21:53.675241+00,21,,317120,approved,0,0,...,,,,,,,,,0,67
1,650904,Mad dog will surely put the liberals in mental...,train,2016-12-02 16:44:21.329535+00,21,,154086,approved,0,0,...,,,,,,,,,0,76
2,5902188,And Trump continues his lifelong cowardice by ...,train,2017-09-05 19:05:32.341360+00,55,,374342,approved,1,0,...,,,,,,,,,0,63
3,7084460,"""while arresting a man for resisting arrest"".\...",test,2016-11-01 16:53:33.561631+00,13,,149218,approved,0,0,...,,,,,,,,,0,76
4,5410943,Tucker and Paul are both total bad ass mofo's.,train,2017-06-14 05:08:21.997315+00,21,,344096,approved,0,0,...,,,,,,,,,0,80


In [5]:
toxic.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1999516 entries, 0 to 1999515
Data columns (total 46 columns):
 #   Column                               Dtype  
---  ------                               -----  
 0   id                                   int64  
 1   comment_text                         object 
 2   split                                object 
 3   created_date                         object 
 4   publication_id                       int64  
 5   parent_id                            float64
 6   article_id                           int64  
 7   rating                               object 
 8   funny                                int64  
 9   wow                                  int64  
 10  sad                                  int64  
 11  likes                                int64  
 12  disagree                             int64  
 13  toxicity                             float64
 14  severe_toxicity                      float64
 15  obscene                         

In [6]:
toxic.isnull().sum() / len(toxic)

id                                     0.000000e+00
comment_text                           5.001210e-07
split                                  0.000000e+00
created_date                           0.000000e+00
publication_id                         0.000000e+00
parent_id                              4.325082e-01
article_id                             0.000000e+00
rating                                 0.000000e+00
funny                                  0.000000e+00
wow                                    0.000000e+00
sad                                    0.000000e+00
likes                                  0.000000e+00
disagree                               0.000000e+00
toxicity                               0.000000e+00
severe_toxicity                        0.000000e+00
obscene                                0.000000e+00
sexual_explicit                        0.000000e+00
identity_attack                        0.000000e+00
insult                                 0.000000e+00
threat      

In [7]:
toxic.describe()

Unnamed: 0,id,publication_id,parent_id,article_id,funny,wow,sad,likes,disagree,toxicity,...,white,asian,latino,other_race_or_ethnicity,physical_disability,intellectual_or_learning_disability,psychiatric_or_mental_illness,other_disability,identity_annotator_count,toxicity_annotator_count
count,1999516.0,1999516.0,1134709.0,1999516.0,1999516.0,1999516.0,1999516.0,1999516.0,1999516.0,1999516.0,...,448000.0,448000.0,448000.0,448000.0,448000.0,448000.0,448000.0,448000.0,1999516.0,1999516.0
mean,4065400.0,49.88997,3715138.0,281025.7,0.2776687,0.04437174,0.1089289,2.441188,0.5808151,0.1029241,...,0.056534,0.011886,0.006151,0.008158,0.001351,0.001117,0.012068,0.001219,1.431667,8.77572
std,2527563.0,27.71895,2451507.0,104077.8,1.054819,0.2458644,0.455557,4.712994,1.854332,0.1970386,...,0.215175,0.086906,0.058828,0.042429,0.017461,0.016391,0.089072,0.014114,17.63593,43.31605
min,59848.0,2.0,61006.0,2006.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.0
25%,856579.8,21.0,793011.0,160003.8,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0
50%,5340220.0,54.0,5217531.0,331925.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0
75%,5955782.0,54.0,5774684.0,366227.0,0.0,0.0,0.0,3.0,0.0,0.1666667,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,6.0
max,7194640.0,115.0,6333965.0,399544.0,102.0,21.0,31.0,300.0,187.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.6,1866.0,4936.0


In [9]:
toxic["toxicity"].round().value_counts() / len(toxic)

0.0    0.941048
1.0    0.058952
Name: toxicity, dtype: float64
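
One caveat about the cell above: `Series.round()` rounds halves to the nearest even integer, so a score of exactly 0.5 becomes 0.0. The proportions shown may therefore differ from what the "greater than or equal to 0.5" definition would give. The sketch below illustrates the difference on made-up scores; how many real comments sit exactly at 0.5 is not shown here.

```python
import pandas as pd

# Made-up scores that include the borderline value 0.5.
scores = pd.Series([0.4, 0.5, 0.6], name="toxicity")

# Series.round() rounds halves to the nearest even integer, so 0.5 becomes 0.0.
rounded = scores.round()

# An explicit comparison keeps 0.5 in the positive class.
thresholded = (scores >= 0.5).astype(float)

print(rounded.tolist())
print(thresholded.tolist())

# Fraction of comments whose label depends on which rule is used.
borderline_share = (rounded != thresholded).mean()
print(borderline_share)
```

Running the same comparison on the real `toxicity` column would show how much the two rules disagree.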

### Summary
After our initial look at the data, we found a few things to take note of:
* We're missing text in a very small number of instances, a fraction of about 5e-07 of the rows (handle this in preprocessing).
* Our target column `toxicity` is heavily skewed towards 0, meaning we'll have to deal with a class imbalance when it comes to modeling (~6% of comments are labeled toxic in the cell above).
* The identity attributes are about 77% null, meaning we might not be able to make full use of these tags during error analysis.  However, the toxicity subtype attributes are complete and will help us understand where our model misclassifies toxic comments.
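
Two of these points can be sketched on toy data: dropping rows with missing text, and computing class weights with the common "balanced" heuristic. Both are illustrative choices under assumed data, not necessarily what the later preprocessing and modeling steps do.

```python
import numpy as np
import pandas as pd

# Toy frame with one missing comment and an imbalanced label; values are made up.
df = pd.DataFrame({
    "comment_text": ["fine", None, "nice", "awful", "ok", "good"],
    "label": [0, 0, 0, 1, 0, 0],
})

# Drop rows whose text is missing (one possible preprocessing choice).
df = df.dropna(subset=["comment_text"]).reset_index(drop=True)

# "Balanced" heuristic: weight = n_samples / (n_classes * class_count).
counts = np.bincount(df["label"])
class_weights = {
    cls: len(df) / (len(counts) * count) for cls, count in enumerate(counts)
}
print(class_weights)
```

The rarer class receives the larger weight, which is one way to keep a model from ignoring the minority of toxic comments.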

## Test Set
Before we dive deep into the dataset, we need to create a test set.  Our test set will be 5% of the data (~100,000 instances), which is large enough to give reliable performance estimates in our modeling phase. Since we plan on building deep learning models, which benefit greatly from large amounts of data, this allows us to allocate the remaining 95% of our dataset for training.

Since our dataset is imbalanced, we will use stratified sampling to preserve the ratio of classes in the test set.
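
The effect of stratification can be checked on toy data. This sketch assumes 1,000 made-up comments with 6% positives and uses the `stratify` parameter of scikit-learn's `train_test_split`; the `random_state` value is arbitrary.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy imbalanced labels: 60 positives out of 1,000 (6%), chosen for illustration.
y = pd.Series([1] * 60 + [0] * 940, name="label")
X = pd.Series([f"comment {i}" for i in range(len(y))], name="comment_text")

# stratify=y keeps the class ratio (approximately) equal in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.05, stratify=y, random_state=42
)

print(len(y_test), y_train.mean(), y_test.mean())
```

Without `stratify`, a 5% sample of a rare class can drift noticeably from the overall ratio, which would distort the test metrics.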

In [45]:
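X = toxic['comment_text']
# Binarize with an explicit threshold so a score of exactly 0.5 counts as toxic,
# matching the definition above. Series.round() rounds halves to even and would map 0.5 to 0.
y = (toxic['toxicity'] >= 0.5).astype(int)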

def create_test(X, y, test_size=0.05, random_state=42):
    """
    Create holdout data with stratified sampling techniques.
    
    Parameters
    ----------
    X : pandas.Series of shape (n_samples,)
        Raw comment text to be split into training and test sets.
    y : pandas.Series of shape (n_samples,)
        Binary target used for stratification.
    test_size : float, default=0.05
        Fraction of the dataset used to create the holdout set.
    random_state : int, default=42
        Seed that makes the split reproducible.
        
    Notes
    -----
    This function only creates the test set if it has not been created yet.  
    """
    # Check if test set exists.
    interim_dir = os.path.join(DATA_DIR, "interim")
    test_path = os.path.join(interim_dir, "test.csv")
    if not os.path.exists(test_path):
        
        # Train test split with stratified sampling of the target.
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=test_size, stratify=y, random_state=random_state)
        
        # Combine target and text into train and test sets.
        train = pd.concat([X_train, y_train], axis=1)
        test = pd.concat([X_test, y_test], axis=1)
        
        # Save train and test set in directory ~/data/interim,
        # creating the directory first if it is missing.
        os.makedirs(interim_dir, exist_ok=True)
        train.to_csv(os.path.join(interim_dir, "train.csv"), index=False)
        test.to_csv(os.path.join(interim_dir, "test.csv"), index=False)

create_test(X, y)

## Data Exploration

In [46]:
toxic = load_data(path="interim/train.csv")