# Data Exploration: Toxic Comments

## Set Up: Load modules and training data

In [1]:
import numpy as np
import pandas as pd

train = pd.read_csv("./input/train.csv")

## Label Exploration

First we need to learn about the class labels. Let's count the toxic comments for starters.

In [3]:
train.toxic.value_counts()

0    144277
1     15294
Name: toxic, dtype: int64

The good news here is that, even though it might feel worse, toxic comments are a clear 
minority: 15,294 of 159,571 comments, or about 9.6%.
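The share of toxic comments can be computed directly rather than read off the counts. This is a minimal sketch that re-uses the counts printed above instead of loading `train.csv`; on the real frame, `train.toxic.value_counts(normalize=True)` should give the same figures.

```python
import pandas as pd

# Counts copied from the value_counts() output above.
counts = pd.Series({0: 144277, 1: 15294}, name="toxic")

# Share of each class among all comments.
shares = counts / counts.sum()
print(shares.round(4))
```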

In [4]:
pd.crosstab(train.toxic, train.severe_toxic)


toxic,severe_toxic=0,severe_toxic=1
0,144277,0
1,13699,1595


As we might expect, all severely toxic comments are also toxic comments, but not all toxic 
comments are severe.
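The subset relationship can also be checked without reading a crosstab. The sketch below uses a small made-up frame (`toy`) in place of `train`, and the helper name `is_subset_label` is hypothetical; it relies only on the labels being coded 0/1.

```python
import pandas as pd

def is_subset_label(df: pd.DataFrame, child: str, parent: str) -> bool:
    """Return True if every row flagged `child` is also flagged `parent`."""
    # With 0/1 labels, child <= parent fails only when child=1 and parent=0.
    return bool((df[child] <= df[parent]).all())

# Toy frame standing in for `train` (values are made up).
toy = pd.DataFrame({"toxic": [0, 1, 1, 0], "severe_toxic": [0, 0, 1, 0]})
subset_ok = is_subset_label(toy, "severe_toxic", "toxic")
print(subset_ok)
```

On the real data the equivalent call would be `is_subset_label(train, "severe_toxic", "toxic")`.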

In [5]:
pd.crosstab(train.toxic, [train.obscene, train.threat, train.insult, train.identity_hate])

obscene,0,0,0,0,0,0,0,0,1,1,1,1,1,1,1
threat,0,0,0,0,1,1,1,1,0,0,0,0,1,1,1
insult,0,0,1,1,0,0,1,1,0,0,1,1,0,1,1
identity_hate,0,1,0,1,0,1,0,1,0,1,0,1,0,0,1
toxic=0,143346,54,301,28,22,0,3,0,317,3,181,18,2,2,0
toxic=1,5707,139,1229,141,124,8,17,3,1916,41,4789,883,15,195,87


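Interestingly, over a third of toxic comments (5,707 of 15,294, about 37%) are "civilly" toxic, 
meaning they are neither obscene, insulting, threatening, nor identity hate, yet they are still 
disruptive. However, the presence of any of these labels greatly increases the prevalence of 
toxicity: about 91% of comments carrying at least one of them (9,587 of 10,518) are toxic, 
compared with under 4% of comments carrying none (5,707 of 149,053).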
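The conditional toxicity rates can be computed without the full cross-tabulation. This is a minimal sketch on a made-up frame (`toy`); the helper name `toxicity_by_other_labels` and the group names are illustrative, while the column names are those of `train`.

```python
import pandas as pd

OTHER = ["obscene", "threat", "insult", "identity_hate"]

def toxicity_by_other_labels(df: pd.DataFrame) -> pd.Series:
    """Mean of `toxic` for rows with and without any of the OTHER labels."""
    has_other = df[OTHER].any(axis=1)
    group = has_other.map({False: "no other label", True: "some other label"})
    return df.groupby(group)["toxic"].mean()

# Toy frame standing in for `train` (values are made up).
toy = pd.DataFrame({
    "toxic":         [0, 0, 1, 1, 1, 0],
    "obscene":       [0, 0, 0, 1, 1, 0],
    "threat":        [0, 0, 0, 0, 0, 0],
    "insult":        [0, 0, 0, 0, 1, 1],
    "identity_hate": [0, 0, 0, 0, 0, 0],
})
rates = toxicity_by_other_labels(toy)
print(rates)
```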

In [6]:
pd.crosstab(train.severe_toxic, [train.obscene, train.threat, train.insult, train.identity_hate])

obscene,0,0,0,0,0,0,0,0,1,1,1,1,1,1,1
threat,0,0,0,0,1,1,1,1,0,0,0,0,1,1,1
insult,0,0,1,1,0,0,1,1,0,0,1,1,0,1,1
identity_hate,0,1,0,1,0,1,0,1,0,1,0,1,0,0,1
severe_toxic=0,149012,190,1516,162,135,7,19,3,2075,38,3981,636,13,133,56
severe_toxic=1,41,3,14,7,11,1,1,0,158,6,989,265,4,64,31


"Civil" severely toxic comments are much rarer: there are only 41 among the 1,595 severely 
toxic comments (about 2.6%). Generally, severely toxic comments make up a smaller share than 
toxic ones, no matter what other labels we condition on.

In [13]:
train.iloc[:, 2:8].corr()

,toxic,severe_toxic,obscene,threat,insult,identity_hate
toxic,1.0,0.308619,0.676515,0.157058,0.647518,0.266009
severe_toxic,0.308619,1.0,0.403014,0.123601,0.375807,0.2016
obscene,0.676515,0.403014,1.0,0.141179,0.741272,0.286867
threat,0.157058,0.123601,0.141179,1.0,0.150022,0.115128
insult,0.647518,0.375807,0.741272,0.150022,1.0,0.337736
identity_hate,0.266009,0.2016,0.286867,0.115128,0.337736,1.0


The correlation matrix is another way of summarizing the relationships between labels,
although here we only see pair-wise correlations rather than the full cross-tabulation.
The story stays the same: all the correlations are positive, and the correlations for 
severe_toxic are always smaller than for toxic. 

The correlation matrix will make for a good sanity check later when making multi-label 
predictions. We should expect the correlation matrix of the predicted probabilities to look 
broadly similar to this one; if it does not, something is likely awry.
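One way to run that sanity check is to subtract the two correlation matrices and look at the largest gap. The sketch below is only an illustration: `corr_gap` is a hypothetical helper, and both the labels and the "predictions" are synthetic (labels plus Gaussian noise), not output from any model.

```python
import numpy as np
import pandas as pd

def corr_gap(labels: pd.DataFrame, preds: pd.DataFrame) -> pd.DataFrame:
    """Element-wise difference between prediction and label correlations."""
    return preds[labels.columns].corr() - labels.corr()

rng = np.random.default_rng(0)

# Synthetic, positively correlated 0/1 labels (not the real data).
base = rng.random(1000)
labels = pd.DataFrame({
    "toxic": (base > 0.70).astype(int),
    "insult": (base > 0.85).astype(int),
})

# Stand-in "predicted probabilities": the labels plus a little noise.
preds = labels + rng.normal(0.0, 0.1, size=labels.shape)

gap = corr_gap(labels, preds)
print(gap.abs().max().max())  # largest absolute difference
```

What counts as "too large" a gap is a judgment call; the point is only to flag predictions whose label relationships look nothing like the training data.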

## What makes a comment toxic?

Let's start out with an overview of the comments' structures.

In [8]:
train[train.comment_text.isnull()]

id,comment_text,toxic,severe_toxic,obscene,threat,insult,identity_hate (0 rows)


There are no missing values for the comment texts, so let's check for empty strings.

In [15]:
train[train.comment_text == '']

id,comment_text,toxic,severe_toxic,obscene,threat,insult,identity_hate,comment_length (0 rows)


Looks okay: there are no empty strings either. If we find secretly missing values later (for example, whitespace-only comments), we can deal with them then.
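A slightly broader check would also catch comments that contain only whitespace. This is a sketch on a made-up Series; `blank_mask` is a hypothetical helper, and on the real data it would be called as `blank_mask(train.comment_text)`.

```python
import pandas as pd

def blank_mask(text: pd.Series) -> pd.Series:
    """True where a comment is missing, empty, or whitespace-only."""
    return text.isna() | text.str.strip().eq("")

# Toy comments (made up) covering each case.
toy_text = pd.Series(["hello", "", "   ", None, "\n\t"])
mask = blank_mask(toy_text)
print(mask.tolist())
```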

In [10]:
train['comment_length'] = train.comment_text.str.len()
train.comment_length.describe()

count    159571.000000
mean        394.073221
std         590.720282
min           6.000000
25%          96.000000
50%         205.000000
75%         435.000000
max        5000.000000
Name: comment_length, dtype: float64

The mean is about double the median, so there are some huge comments skewing the data to the 
right. The largest comment is 5000 characters, while the middle half of comments (the 25th to 
75th percentile) runs only from 96 to 435 characters. Let's look at the longest comments and 
see if they are naughty or nice.

In [11]:
train = train.sort_values(by="comment_length", ascending=False)
pd.set_option('display.max_colwidth', None)  # None disables truncation; -1 is deprecated since pandas 1.0
train.comment_text.head(1)

46583    hahahahahahahahahahahahahahahahahaha vandalism rules vandalism rules vandalism rules vandalism rules vandalism rules vandalism rules vandalism rules vandalism rules vandalism rules vandalism rules vandalism rules vandalism rules vandalism rules vandalism rules vandalism rules vandalism rules vandalism rules vandalism rules vandalism rules vandalism rules vandalism rules vandalism rules vandalism rules vandalism rules vandalism rules vandalism rules vandalism rules vandalism rules vandalism rules vandalism rules vandalism rules vandalism rules vandalism rules vandalism rules vandalism rules vandalism rules vandalism rules vandalism rules vandalism rules vandalism rules vandalism rules vandalism rules vandalism rules vandalism rules vandalism rules vandalism rules vandalism rules vandalism rules vandalism rules vandalism rules vandalism rules vandalism rules vandalism rules vandalism rules vandalism rules vandalism rules vandalism rules vandalism rules vandalism rules vandalism 

I've only displayed one comment here, but change this to head(10) or so and you'll see for
yourself that the longest comments are very vulgar and spammy. You could probably flag this 
kind of basic spam by looking for a low ratio of unique words to comment length.
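As a rough sketch of that idea, the snippet below computes a unique-word ratio per comment. Note the assumptions: it divides by the number of whitespace-separated tokens rather than by the character length, `unique_word_ratio` is a hypothetical helper, and the two example comments are made up. It also assumes there are no missing comments, which holds for `train` as checked earlier.

```python
import pandas as pd

def unique_word_ratio(text: pd.Series) -> pd.Series:
    """Distinct tokens divided by total tokens (NaN when there are no tokens)."""
    tokens = text.str.lower().str.split()
    n_total = tokens.str.len()
    n_unique = tokens.apply(lambda t: len(set(t)))
    # Avoid dividing by zero for empty comments.
    return n_unique / n_total.where(n_total > 0)

toy_text = pd.Series([
    "vandalism rules vandalism rules vandalism rules",
    "this is a normal comment",
])
ratio = unique_word_ratio(toy_text)
print(ratio)
```

Repetitive spam scores close to 0, while ordinary prose scores closer to 1; any cut-off would need to be tuned against the labels.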

In [14]:
one_percent = int(np.ceil(train.shape[0] / 100))
train_sub = train.iloc[0:one_percent, :]
train_sub.toxic.value_counts()

0    1390
1     206
Name: toxic, dtype: int64

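Long comments in general aren't especially toxic. Among the longest 1% of comments (1,596 in 
total), 206 are toxic, about 12.9%. That is only modestly above the 9.6% rate overall, and 
roughly 87% of these very long comments are not toxic.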
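The comparison between the longest comments and the overall rate can be wrapped in a small helper. This is a sketch on a made-up frame; `toxic_rate_longest` and its `top_percent` parameter are hypothetical names, and the frame is assumed to have `comment_length` and `toxic` columns as `train` does.

```python
import numpy as np
import pandas as pd

def toxic_rate_longest(df: pd.DataFrame, top_percent: float = 1):
    """Return (toxic rate among the longest top_percent%, overall toxic rate)."""
    n = int(np.ceil(len(df) * top_percent / 100))
    longest = df.nlargest(n, "comment_length")
    return longest["toxic"].mean(), df["toxic"].mean()

# Toy frame (made up): 200 comments with lengths 1..200.
toxic = np.zeros(200, dtype=int)
toxic[:19] = 1   # 19 short toxic comments
toxic[-1] = 1    # the single longest comment is toxic
toy = pd.DataFrame({"comment_length": np.arange(1, 201), "toxic": toxic})

top_rate, base_rate = toxic_rate_longest(toy, top_percent=1)
print(top_rate, base_rate)
```

Reporting both rates side by side makes it clear whether long comments are unusual relative to the baseline, rather than just mostly non-toxic.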