# Jigsaw Rate Severity of Toxic Comments
Special thanks to:
- [Jigsaw - Incredibly Simple Naive Bayes [0.768]](https://www.kaggle.com/julian3833/jigsaw-incredibly-simple-naive-bayes-0-768)
- [JRSoTC - RidgeRegression (ensemble of 3)](https://www.kaggle.com/steubk/jrsotc-ridgeregression-ensemble-of-3/notebook)

In [2]:
import re
import numpy as np
import pandas as pd
from copy import deepcopy

import matplotlib.pyplot as plt
plt.style.use('ggplot')

from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
stopwords = set(STOPWORDS)

from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_squared_error
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import StratifiedKFold
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import Ridge

In [3]:
RANDOM_STATE = 201

## Strategy ⁉
There are three datasets introduced on the competition page. I will be using all three to build an Ensemble model. Besides the competition's own data, the datasets used in the notebook are:
- [jigsaw-toxic-comment-classification-challenge](https://www.kaggle.com/julian3833/jigsaw-toxic-comment-classification-challenge)
- [jigsaw-unintended-bias-in-toxicity-classification](https://www.kaggle.com/julian3833/jigsaw-unintended-bias-in-toxicity-classification)
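
The cells further down train on one dataset at a time. As a rough sketch of how models trained on the different datasets could be combined, the snippet below averages per-model ranks instead of raw scores, since each dataset puts toxicity on its own scale and only the ordering of comments matters for the pairwise comparison used in validation. `rank_average` and the prediction arrays are hypothetical and not part of the original notebook.

```python
import numpy as np

def rank_average(*score_arrays):
    """Average the rank each model gives to every comment (higher = more toxic).

    Ties are broken by position, which is good enough for a sketch.
    """
    ranks = [np.argsort(np.argsort(scores)) for scores in score_arrays]
    return np.mean(ranks, axis=0)

# Hypothetical predictions from two models on the same four comments
preds_a = np.array([0.1, 2.5, 0.7, 4.0])
preds_b = np.array([3.0, 9.0, 1.0, 12.0])

blended = rank_average(preds_a, preds_b)
print(blended)
```

The blended values are only meaningful relative to each other, which is all a ranking-style submission needs.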

In [4]:
# Toxicity coefficients - These weights are used to combine all toxicity levels into one
toxicity_coefs = {
    'toxic': 1,
    'severe_toxic': 2,
    'obscene': 1,
    'threat': 1,
    'insult': 1,
    'identity_hate': 2,
    'sexual_explicit': 1
}

toxicity_types = list(toxicity_coefs.keys())
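
To make the role of these weights concrete, here is a minimal sketch on a toy dataframe. It uses only a subset of the weights, and the rows are made up for illustration.

```python
import pandas as pd

# Subset of the weights above, enough to show the idea
coefs = {'toxic': 1, 'severe_toxic': 2, 'obscene': 1}

# Made-up binary labels for three comments
toy_df = pd.DataFrame({
    'toxic':        [0, 1, 1],
    'severe_toxic': [0, 0, 1],
    'obscene':      [0, 1, 1],
})

# Weighted sum of the label columns that exist in the dataframe
toy_df['toxicity'] = sum(toy_df[col] * w for col, w in coefs.items() if col in toy_df)
print(toy_df)
```

A comment flagged as `severe_toxic` ends up with a higher combined score than one that is only `toxic` and `obscene`, which is the ordering the weights are meant to encode.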

#### Validation: Defining our Validation Method
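We need to validate our model to track its performance. For this, we use *validation_data.csv* as our validation set, with *RMSE* and *Accuracy* as metrics. *Accuracy* here is the share of validation pairs in which the model scores the *more_toxic* comment above the *less_toxic* one.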

**NOTE #1**: *Accuracy* is not a recommended metric as our data is strongly unbalanced! ([See why](https://machinelearningmastery.com/failure-of-accuracy-for-imbalanced-class-distributions/))
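
Both metrics can be illustrated on toy numbers before wiring them into the K-Fold loop. The scores below are invented; only the two formulas mirror what the validation function computes.

```python
import numpy as np

# Invented model scores for three (less_toxic, more_toxic) pairs
less_scores = np.array([0.2, 1.5, 0.9])
more_scores = np.array([0.8, 1.1, 2.0])

# Pairwise accuracy: how often the "more toxic" comment gets the higher score
pairwise_accuracy = (less_scores < more_scores).mean()

# RMSE between invented targets and predictions
y_true = np.array([0.0, 2.0, 4.0])
y_pred = np.array([1.0, 2.0, 3.0])
rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))

print(pairwise_accuracy, rmse)
```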

In [5]:
# Performs a Stratified K-Fold validation using the given pipeline
def kfold_validate(pipe, folds, X, y, less_toxic, more_toxic, verbose = False):
    skf = StratifiedKFold(n_splits = folds, shuffle = True, random_state = RANDOM_STATE)
    accuracies, rmse_scores = [], []    
    
    for fold, (train_index, val_index) in enumerate(skf.split(X, y)):
        X_train, y_train = X[train_index], y[train_index]
        X_val, y_val = X[val_index], y[val_index]
        
        # Fit the pipeline (Re-copy the pipeline to avoid fitting on the same one!)
        _pipe = deepcopy(pipe)
        _pipe.fit(X_train, y_train)
        
        # Calculate RMSE as the square root of the MSE
        # (the `squared` argument is deprecated in newer scikit-learn releases)
        rmse_score = np.sqrt(mean_squared_error(y_val, _pipe.predict(X_val)))
        rmse_scores.append(rmse_score)
        
        # Calculate accuracy
        prob_1 = _pipe.predict(less_toxic)
        prob_2 = _pipe.predict(more_toxic)
        accuracy = (prob_1 < prob_2).mean()
        accuracies.append(accuracy)
        
        if verbose:
            print(f"FOLD #{fold + 1}: Accuracy: {accuracy}, RMSE: {rmse_score}")
        
    return np.array(accuracies).mean(), np.array(rmse_scores).mean()

#### Visualization: Plotting the Data
Since we pretty much do the same analysis on all datasets, the plots are wrapped in functions to avoid duplication.

In [6]:
# Plots the distribution of values in the toxicity column of the given dataframe
def plot_toxicity_dist(df):
    toxicity_values = df['toxicity'].value_counts()
    
    plt.figure(figsize = (20, 5))
    plt.title('Toxicity Level Distribution')
    plt.bar(toxicity_values.keys(), toxicity_values.values, color = 'g')
    plt.show()


# Plots number of values for each toxicity level in the given dataframe
def plot_toxic_types_dist(df):
    # Only plot the toxicity types that exist in the given dataframe
    types = [t for t in toxicity_types if t in df.columns]
    counts = [(df[t] == 1).sum() for t in types]

    plt.figure(figsize = (20, 5))
    plt.title('Toxicity Categories Count')
    plt.bar(types, counts, label = 'Number of occurrences')
    plt.legend()
    plt.show()


# Plots the wordcloud for each toxicity level of the given data frame (Stopwords are removed)
def plot_wordcloud(df):
    wordcloud = WordCloud(stopwords = stopwords)
    fig, ax = plt.subplots(3, 2, figsize = (20, 10))

    i = 0
    for row in ax:
        for col in row:        
            wordcloud.generate(' '.join(df.loc[df[toxicity_types[i]] != 0, 'text'].tolist()))
            col.set_title(toxicity_types[i])        
            col.imshow(wordcloud)        
            col.axis("off")
            i += 1
    plt.tight_layout(pad = 0)
    plt.show()

## Jigsaw Rate Severity of Toxic Comments
This is the original dataset for the competition. Let's take a look:
- *comments_to_score.csv*: This is the final dataset that we have to make predictions on.
- *validation_data.csv*: The dataset that will help validate our model.
- *sample_submission.csv*: A sample submission file.

In [7]:
val_df = pd.read_csv('../input/jigsaw-toxic-severity-rating/validation_data.csv')
test_df = pd.read_csv('../input/jigsaw-toxic-severity-rating/comments_to_score.csv')

print(f'test_df\n- Shape: {test_df.shape}\n- Columns: {list(test_df.columns)}\n')
print(f'Duplicated texts: {test_df.duplicated("text").sum()}')

print(f'val_df\n- Shape: {val_df.shape}\n- Columns: {list(val_df.columns)}\n')
# val_df has no 'text' column; each row is a (less_toxic, more_toxic) pair
print(f'Duplicated pairs: {val_df.duplicated(["less_toxic", "more_toxic"]).sum()}')

## Jigsaw Toxic Comment Classification Challenge
This is the second dataset in the notebook. Let's take a look at what we have:
- *train.csv*: The training data.
- *test.csv*: The test data used for final prediction.
- *test_labels.csv*: The actual answers for the *test.csv*.
- *sample_submission.csv*: A sample submission file.

**NOTE #1**: Since the actual test labels are published, I will be using them to increase the amount of training data.

**NOTE #2**: I will be renaming the columns to match the column names of the original dataset. (This applies to all datasets used.)

#### Loading the Data: *train.csv*, *test.csv* and *test_labels.csv*

In [8]:
# Renaming the columns will help avoid any complications
jtc_df = pd.read_csv('../input/jigsaw-toxic-comment-classification-challenge/train.csv').rename(
    columns = { 'id': 'comment_id', 'comment_text': 'text'}
)

jtc_test_df = pd.merge(
    pd.read_csv('../input/jigsaw-toxic-comment-classification-challenge/test.csv'),
    pd.read_csv('../input/jigsaw-toxic-comment-classification-challenge/test_labels.csv'),
    on = 'id',
    how = 'outer'
).rename(columns = {'id': 'comment_id', 'comment_text': 'text'})

# Add to the training data
jtc_df = pd.concat([jtc_df, jtc_test_df])

print(f'- Shape: {jtc_df.shape}\n- Columns: {list(jtc_df.columns)}\n')
print(f'Duplicated texts: {jtc_df.duplicated("text").sum()}')

#### Combining the Toxicity Levels

In [9]:
# Combine all toxicity levels into one with the same weights set
jtc_df['toxicity'] = sum([jtc_df[type] * coef for type, coef in toxicity_coefs.items() if type in jtc_df])

# Drop rows with negative toxicity (test rows labelled -1 were not scored, so their labels are invalid)
jtc_df = jtc_df.loc[jtc_df['toxicity'] >= 0]

print(f"- New shape: {jtc_df.shape}")

#### Downsampling
Our data is heavily unbalanced ([See why that's bad](https://machinelearningmastery.com/what-is-imbalanced-classification/)) and we must fix it. There are a few tricks we can pull off, but I went with down-sampling the non-toxic comments.

In [10]:
# Cutoff & threshold
threshold = 0
cutoff = (jtc_df['toxicity'] > threshold).sum()
cutoff_coef = 1.5

# Downsample non-toxic comments
jtc_non_toxic_df = jtc_df.loc[jtc_df['toxicity'] <= threshold].sample(int(cutoff * cutoff_coef), random_state = RANDOM_STATE)

# Concatenate the two dataframes
jtc_df = pd.concat([jtc_non_toxic_df, jtc_df[jtc_df['toxicity'] > threshold]])
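
A quick check on a toy frame shows what `cutoff_coef` does to the class ratio. The data is made up; the steps mirror the cell above.

```python
import pandas as pd

# Made-up frame: 20 non-toxic rows and 4 toxic rows
toy_df = pd.DataFrame({'toxicity': [0] * 20 + [1, 2, 3, 1]})

threshold = 0
cutoff_coef = 1.5

n_toxic = (toy_df['toxicity'] > threshold).sum()
n_keep = int(n_toxic * cutoff_coef)  # non-toxic rows to keep

non_toxic = toy_df[toy_df['toxicity'] <= threshold].sample(n_keep, random_state=0)
balanced = pd.concat([non_toxic, toy_df[toy_df['toxicity'] > threshold]])

print((balanced['toxicity'] > threshold).value_counts())
```

With `cutoff_coef = 1.5`, the result keeps 1.5 non-toxic comments for every toxic one. Note that `sample` raises an error if it is asked for more rows than are available, so the coefficient cannot be pushed arbitrarily high.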

### EDA: Exploratory Data Analysis
Our data from **jigsaw-toxic-comment-classification-challenge** is now ready to train a model and make predictions, which results in a clean 77% score on the submission. I explored the data further, but to keep things short those calls are commented out below. (Run them in separate cells.)

In [11]:
# plot_toxic_types_dist(jtc_df)
# plot_toxicity_dist(jtc_df)
# plot_wordcloud(jtc_df)

## Jigsaw Unintended Bias in Toxicity Classification
Our third dataset comes as several CSV files, but luckily *all_data.csv* contains them all.

**NOTE #1**: The *toxicity_annotator_count* feature can be used to remove comments with very few annotators. The code below keeps only comments with more than 5 annotators.

**NOTE #2**: The *sexual_explicit* feature is new compared to the previous dataset, but keeping it might be a good idea. Why?

In [28]:
jutc_features_to_select = ['comment_id', 'text', 'toxic', 'severe_toxic', 'obscene', 'insult', 'identity_hate', 'sexual_explicit', 'toxicity_annotator_count']

# Load and rename columns
jutc_df = pd.read_csv('../input/jigsaw-unintended-bias-in-toxicity-classification/all_data.csv').rename(
    columns = {
        'id': 'comment_id',
        'comment_text': 'text',
        'identity_attack': 'identity_hate',
        'toxicity': 'toxic',
        'severe_toxicity': 'severe_toxic'
    }
)

# Filter by annotator count and select only the features we need
jutc_df = jutc_df.loc[jutc_df['toxicity_annotator_count'] > 5, jutc_features_to_select]

print(f'- Shape: {jutc_df.shape}\n- Columns: {list(jutc_df.columns)}\n')

In [29]:
# Calculate toxicity as the sum of the sub-type scores
jutc_df['toxicity'] = jutc_df[['severe_toxic', 'obscene', 'insult', 'identity_hate', 'sexual_explicit']].sum(axis = 1)

# For comments with an overall 'toxic' score of at most 0.5, use that score instead of the sum
jutc_df['toxicity'] = np.where(jutc_df['toxic'] <= 0.5, jutc_df['toxic'], jutc_df['toxicity'])

#### Downsampling
This dataset is more balanced than the previous one, but still requires down-sampling.

In [38]:
# Cutoff and threshold
threshold = 0.5
cutoff = (jutc_df['toxicity'] > threshold).sum()
cutoff_coef = 1.5

# Downsample non-toxic comments
jutc_non_toxic_df = jutc_df[jutc_df['toxicity'] <= threshold].sample(int(cutoff * cutoff_coef), random_state = RANDOM_STATE)

# Concatenate the two dataframes
jutc_df = pd.concat([jutc_non_toxic_df, jutc_df[jutc_df['toxicity'] > threshold]])

print(f'- Shape: {jutc_df.shape}')

In [15]:
# Convert to discrete values (instead of continuous)
jutc_df['toxicity'] = (np.round(jutc_df['toxicity'], decimals = 1) * 10).astype(int)
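
The discretization matters because `StratifiedKFold` stratifies on class labels and rejects continuous targets. A small sketch with made-up scores shows how the rounding maps them into integer buckets:

```python
import numpy as np

# Made-up continuous toxicity scores
scores = np.array([0.0, 0.04, 0.26, 0.5, 1.33])

# Round to one decimal, then scale so each 0.1 step becomes one integer bucket
buckets = (np.round(scores, decimals=1) * 10).astype(int)
print(buckets)
```

Scores that differ by less than about 0.1 can land in the same bucket, so the model is trained on a coarser target than the raw annotations.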

### EDA: Exploratory Data Analysis

In [16]:
# plot_toxicity_dist(jutc_df)

### Text Cleaning
As language models improve, text-cleaning is becoming less necessary, but that's not the case for all models. My strategy is to start simple and test some cleaning methods to see if they help the model or not.
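
As a starting point for such experiments, here is a minimal sketch of a light-touch cleaner. `clean_text` and each of its rules are assumptions to be tested against the validation accuracy, not steps taken from the original notebook.

```python
import re

def clean_text(text):
    """Hypothetical cleaner: every rule here is a candidate to test, not a given."""
    text = text.lower()
    text = re.sub(r'https?://\S+', ' ', text)   # drop URLs
    text = re.sub(r'(.)\1{2,}', r'\1\1', text)  # squeeze runs of 3+ identical characters to 2
    text = re.sub(r'\s+', ' ', text)            # collapse whitespace
    return text.strip()

print(clean_text("Sooooo   BAD!!! see http://x.com/a now"))
```

It could be applied with something like `train_X.apply(clean_text)` before fitting, comparing the validation metrics with and without each rule.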

### Modeling: Creating the Pipeline

In [39]:
train_X = jtc_df['text']
train_y = jtc_df['toxicity']
test_X = test_df['text']

# train_X = jutc_df['text']
# train_y = jutc_df['toxicity']
# test_X = test_df['text']

In [40]:
# Define pipeline
pipe = Pipeline([
    ('vectorizer', TfidfVectorizer(analyzer = 'char_wb', max_df = 0.5, min_df = 3, ngram_range = (4, 6))),
    ('model', Ridge())
])

# Validate (Pipeline must not be fitted!)
acc_mean, rmse_mean = kfold_validate(
    pipe = pipe,
    folds = 5,
    X = np.array(train_X),
    y = np.array(train_y),
    less_toxic = val_df['less_toxic'],
    more_toxic = val_df['more_toxic'],
    verbose = True
)
print(f"Mean Accuracy: {acc_mean}\nMean RMSE: {rmse_mean}")

### Creating the Submission

In [42]:
# Train the pipeline
pipe.fit(train_X, train_y)

# Make predictions
y_pred = pipe.predict(test_X)

# Create submission file (to_csv returns None, so keep the dataframe separately)
submission_df = pd.DataFrame(data = {
    'comment_id': test_df['comment_id'],
    'score': y_pred
})
submission_df.to_csv('submission.csv', index = False)