# Jigsaw Rate Severity of Toxic Comments

In [14]:
import re
import numpy as np
import pandas as pd
from string import printable, punctuation
from itertools import groupby

from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
stopwords = set(STOPWORDS)

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor, VotingRegressor
from sklearn.naive_bayes import MultinomialNB

import matplotlib.pyplot as plt
plt.style.use('ggplot')

## EDA: Love at First Sight 👉👈
I have to be comfortable with the data I'm working with in order to understand it. So let's take a look at what we have:
- **comments_to_score.csv**: This is our test set which we use for the final prediction. We must score the toxicity level of each comment.
- **sample_submission.csv**: Just a sample submission file.
- **validation_data.csv**: Pairs of comments that are used for model validation. Our validation is mainly based on this dataset (however, we could use more data from the dataset described next).

You have probably wondered, where the hell is the training data??? Don't worry, it is in the previous competition [Toxic Comment Classification Challenge](https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge/data). We have:
- **sample_submission.csv**: Yeah, you guessed it.
- **test.csv**: Dataset used for model prediction.
- **test_labels.csv**: The actual answers for the *test.csv*.
- **train.csv**: Yeeey! our training data!!!

In the second dataset, we know the actual answers to *test.csv*. So, we could combine them with our *train.csv* to **extend our training data** or use them for **validation** (maybe a different validation from what we have in the original dataset). It's really up to you what you do with them, but I suggest trying both!

In [15]:
# Original competition data
val_df = pd.read_csv('../input/jigsaw-toxic-severity-rating/validation_data.csv')
test_df = pd.read_csv('../input/jigsaw-toxic-severity-rating/comments_to_score.csv')

# Previous competition data - Adjusted the column names for more clarity
train_df = pd.read_csv('../input/jigsaw-toxic-comment-classification-challenge/train.csv')
train_df.rename(columns = {'id': 'comment_id','comment_text': 'text'}, inplace = True)

print(f'train_df\n- Shape: {train_df.shape}\n- Columns: {list(train_df.columns)}\n')
print(f'test_df\n- Shape: {test_df.shape}\n- Columns: {list(test_df.columns)}\n')
print(f'val_df\n- Shape: {val_df.shape}\n- Columns: {list(val_df.columns)}\n')

Good, we know what we've got. Time to do some ***Exploratory Data Analysis*** and ***Feature Engineering***...
### Exploratory Data Analysis (EDA)
Since the training data is divided into 6 categories, we must analyze them separately and see (for example) what makes a *toxic* comment different from an *insult*. We do the following:
- First, plot the distribution of all categories to see the occurrence of each category.
- It wouldn't hurt to use a WordCloud to see the important words for each category.

In [16]:
categories = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']

fig = plt.figure(figsize = (20, 5))
plt.title('Toxicity Categories Count')
# Labels are 0/1, so the column sum is the number of positive comments
plt.bar(categories, [train_df[cat].sum() for cat in categories], label = 'Number of occurrences')
plt.legend()
plt.show()

In [17]:
wordcloud = WordCloud(stopwords = stopwords)
fig, ax = plt.subplots(3, 2, figsize = (20, 10))

i = 0
for row in ax:
    for col in row:        
        wordcloud.generate(' '.join(train_df.loc[train_df[categories[i]] == 1, 'text'].tolist()))
        col.set_title(categories[i])        
        col.imshow(wordcloud)        
        col.axis("off")
        i += 1
plt.tight_layout(pad = 0)
plt.show()

According to our results, the two columns *identity_hate* and *threat* are pretty straightforward because they have their own vocabulary (: The other categories share many common words.
### Feature Engineering
We need to somehow combine these 6 categories into one score. Looking at the categories, *severe_toxic* and *threat*, for instance, cannot be treated equally. The same applies to all categories, so we need to assign a weight to each one. **It is extremely important what weight you use for each category!!!**

In [18]:
toxicity_coefs = {
    'toxic': 1,
    'severe_toxic': 2,
    'obscene': 1,
    'threat': 1,
    'insult': 1,
    'identity_hate': 2
}
# Weighted sum of the six binary labels (avoid shadowing the built-in `type`)
train_df['toxicity'] = sum([train_df[category] * coef for category, coef in toxicity_coefs.items()])

print(f"Number of distinct values: {len(train_df['toxicity'].unique())}")
print(f"Toxicity Mean: {train_df['toxicity'].mean()}")
print(f"Standard Deviation: {train_df['toxicity'].std()}")

In [6]:
toxicity_values = train_df['toxicity'].value_counts()

plt.figure(figsize = (20, 5))
plt.title('Toxicity Level Distribution')
plt.bar(toxicity_values.keys(), toxicity_values.values, color = 'g')
plt.show()

## Balancing our data
As you saw in the plot above, our set of weights matters a lot. We should choose a set that both **makes sense** and is **as balanced as it can be**.

For instance, if we set all weights to 1, we would violate both rules: the weights wouldn't make sense, as *toxic* isn't equal to *threat* (obviously!), and the result would be very unbalanced, with more than 140k comments at toxicity 0 but fewer than 10k comments for the other toxicity scores.

Imbalanced data can be very problematic. Yes, we can pull a few tricks, but at a cost. Here are our options for dealing with it:
- Our most reliable option is to downsample our data (and we will).
- Our second most reliable option is ***NOT*** our weights. Yes, ***NOT***! We have many zeros, and there is absolutely nothing our weights can do about them.
- Our third option is to add *test.csv* and *test_labels.csv* from the previous competition to our training data. Yes, this makes matters worse in terms of balance (because *test.csv* itself is unbalanced!), but at least we get more data for each toxicity level, so we lose less when we downsample afterwards.
- We could also smooth our *toxicity score* by merging a few values (see *Smoothing Toxicity Scores* below).
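
To make the point about zeros concrete, here is a minimal sketch on a tiny made-up table (the rows and the helper name `toxicity_distribution` are assumptions for illustration, not competition data). It compares the score distribution under equal weights and under the weights used in this notebook:

```python
import pandas as pd

categories = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']

# Hypothetical toy labels: 4 clean comments and 5 toxic ones
rows = [
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [1, 0, 0, 0, 0, 0],
    [1, 0, 1, 0, 1, 0],
    [1, 1, 1, 0, 1, 0],
    [1, 0, 0, 1, 0, 0],
    [1, 0, 0, 0, 0, 1],
]
toy_df = pd.DataFrame(rows, columns=categories)

def toxicity_distribution(df, coefs):
    """Return how many comments fall on each weighted toxicity score."""
    score = sum(df[cat] * coef for cat, coef in coefs.items())
    return score.value_counts().sort_index()

equal_coefs = {cat: 1 for cat in categories}
weighted_coefs = {'toxic': 1, 'severe_toxic': 2, 'obscene': 1,
                  'threat': 1, 'insult': 1, 'identity_hate': 2}

equal = toxicity_distribution(toy_df, equal_coefs)
weighted = toxicity_distribution(toy_df, weighted_coefs)

print(equal.to_dict())
print(weighted.to_dict())
```

Whatever weights we pick, the clean comments stay at score 0, so the weights only reshuffle the non-zero part of the distribution.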

### Increasing our Training Data

In [7]:
# Merge two dataframes test.csv & test_labels.csv
tmp_df = pd.merge(
    pd.read_csv('../input/jigsaw-toxic-comment-classification-challenge/test.csv'),
    pd.read_csv('../input/jigsaw-toxic-comment-classification-challenge/test_labels.csv'),
    on = 'id',
    how = 'outer'
).rename(columns = {'id': 'comment_id', 'comment_text': 'text'})

# Combine all toxicity levels into one with the same weights set
tmp_df['toxicity'] = sum([tmp_df[category] * coef for category, coef in toxicity_coefs.items()])

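# Add to the training data. Rows labeled -1 were not used for scoring;
# their weighted sum is negative, so the filter below drops them.
labeled_df = tmp_df.loc[tmp_df['toxicity'] >= 0]
train_df = pd.concat([train_df, labeled_df])

print(f"Added {labeled_df.shape[0]} entries to train_df. Now we have {train_df.shape[0]} entries.")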

### Smoothing Toxicity Scores
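If your weights create too many distinct *toxicity scores*, you can smooth them out by merging values. Here we collapse everything into two levels: 0 (not toxic) and 1 (toxic).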

In [8]:
# Convert our toxicity scores into a smoothed toxicity score
train_df['smoothed_toxicity'] = (train_df['toxicity'] > 0).astype(int)
# train_df['smoothed_toxicity'] = train_df['toxicity']
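
If two levels feel too coarse, a middle ground is to merge the scores into a few buckets. The sketch below uses `pd.cut` on a handful of made-up scores; the bin edges (0, 1 to 2, 3 to 8) are an assumption chosen only for illustration, with 8 being the largest score the weights above can produce.

```python
import pandas as pd

# Hypothetical toxicity scores (not taken from the real training data)
toxicity = pd.Series([0, 0, 1, 2, 3, 5, 8], name='toxicity')

# Intervals are right-inclusive: (-1, 0], (0, 2], (2, 8]
bins = [-1, 0, 2, 8]
labels = [0, 1, 2]  # 0 = clean, 1 = mildly toxic, 2 = very toxic

smoothed = pd.cut(toxicity, bins=bins, labels=labels).astype(int)
print(smoothed.tolist())
```

More than two levels do not fit the `predict_proba(...)[:, 1]` scoring used later in this notebook as-is, so this variant would need a different way of turning predictions into a single score.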

### Downsampling
The code below will work for any number of discrete *smoothed_toxicity* values and applies the cutoff accordingly.

In [9]:
# Be careful with the cutoff value: if it is too low, the model will likely underfit (we lose too much training data!)
smoothed_toxicity = train_df['smoothed_toxicity'].value_counts()
cutoff = smoothed_toxicity.min()

# Apply cutoff to each toxicity score and save them in a list
cutoff_partitions = [train_df.loc[train_df['smoothed_toxicity'] == toxicity].sample(cutoff) for toxicity in smoothed_toxicity.index]

# Concatenate and save them into the train_df
train_df = pd.concat(cutoff_partitions)
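
The same downsampling can also be written with a grouped sample. This is only a sketch on a made-up frame (`toy_df` and `balanced_df` are hypothetical names), and it fixes `random_state` so that reruns pick the same rows, which the cell above does not do:

```python
import pandas as pd

# Hypothetical unbalanced data: 7 clean comments, 3 toxic ones
toy_df = pd.DataFrame({
    'text': [f'comment {i}' for i in range(10)],
    'smoothed_toxicity': [0] * 7 + [1] * 3,
})

# Size of the rarest class
cutoff = int(toy_df['smoothed_toxicity'].value_counts().min())

# Draw `cutoff` rows from every class
balanced_df = (
    toy_df.groupby('smoothed_toxicity')
          .sample(n=cutoff, random_state=42)
          .reset_index(drop=True)
)

print(balanced_df['smoothed_toxicity'].value_counts().to_dict())
```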

### Text Cleaning
As language models improve, text-cleaning is becoming less necessary, but that's not the case for all models. My strategy is to start simple and test some cleaning methods to see if they help the model or not.
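
As a starting point, a simple cleaner could look like the sketch below. The specific steps (dropping non-printable characters, lowercasing, removing URLs, shortening repeated characters) and the name `clean_text` are my assumptions about what "simple" means here, not a tested recipe; try each step separately and keep it only if the validation score improves.

```python
import re
from itertools import groupby
from string import printable

def clean_text(text, max_repeat=2):
    """Apply a few basic cleaning steps to a single comment."""
    # Keep only printable ASCII characters (this also drops emojis and accented letters)
    text = ''.join(ch for ch in text if ch in printable)
    text = text.lower()
    # Replace URLs with a space
    text = re.sub(r'https?://\S+', ' ', text)
    # Shorten runs of the same character: "stuuuupid" -> "stuupid"
    text = ''.join(''.join(list(run)[:max_repeat]) for _, run in groupby(text))
    # Collapse any whitespace run into a single space
    return re.sub(r'\s+', ' ', text).strip()

cleaned = clean_text("You are SOOOOO   stuuuupid!!!! see http://example.com \n")
print(cleaned)
```

To use it on the data, something like `train_df['text'].apply(clean_text)` would do, applied the same way to the validation and test texts.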

### Validation: Defining our Validation Method
We need some sort of validation for our model to monitor its behavior. One way to accomplish this is to use the data from *validation_data.csv*. We can use our model to predict on both the *less_toxic* and *more_toxic* columns and compare the results: the model is right on a pair when it scores the *more_toxic* comment higher.

In [10]:
# Defining the validation function
def validate_pipe(pipe, less_toxic, more_toxic):
    
    prob_1 = pipe.predict_proba(less_toxic)
    prob_2 = pipe.predict_proba(more_toxic)
    
    return (prob_1[:, 1] < prob_2[:, 1]).mean()
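
Here is what this metric computes, shown on a few made-up scores. The helper `pairwise_accuracy` is a hypothetical name; it takes plain score arrays, so the same idea would also work for a model that has `predict` but no `predict_proba` (a regressor, for example).

```python
import numpy as np

def pairwise_accuracy(less_scores, more_scores):
    """Fraction of pairs where the more toxic comment gets the strictly higher score."""
    less_scores = np.asarray(less_scores, dtype=float)
    more_scores = np.asarray(more_scores, dtype=float)
    return float((less_scores < more_scores).mean())

# Hypothetical scores for three comment pairs
less = [0.10, 0.40, 0.80]
more = [0.90, 0.30, 0.95]

acc = pairwise_accuracy(less, more)
print(acc)  # the second pair is ranked the wrong way round, so 2 of 3 pairs are correct
```

Note that ties count as wrong because the comparison is strict.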

### Creating the Pipeline

In [12]:
# Define pipeline
pipe = Pipeline([
    ('vectorizer', TfidfVectorizer(stop_words = 'english')),
    ('model', MultinomialNB())
])

# Train the pipeline
pipe.fit(train_df['text'], train_df['smoothed_toxicity'])

# Validate
validate_pipe(pipe, val_df['less_toxic'], val_df['more_toxic'])

### Creating Submission

In [13]:
y_pred = pipe.predict_proba(test_df['text'])

# Build the submission first, then write it (to_csv returns None when given a path)
submission_df = pd.DataFrame(data = {
    'comment_id': test_df['comment_id'],
    'score': y_pred[:, 1]
})
submission_df.to_csv('submission.csv', index = False)