---
## [Jigsaw Rate Severity of Toxic Comments][1]
---
Reference notebooks:
1. [☣️ Jigsaw - Incredibly Simple Naive Bayes [0.768]][2]
2. [AutoNLP for toxic ratings ;)][3]


[1]: https://www.kaggle.com/c/jigsaw-toxic-severity-rating/overview
[2]: https://www.kaggle.com/julian3833/jigsaw-incredibly-simple-naive-bayes-0-768
[3]: https://www.kaggle.com/abhishek/autonlp-for-toxic-ratings

# 0. Settings

In [1]:
# Import dependencies
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt 
%matplotlib inline

import os
import pathlib
import gc
import sys
import math 
import time 
from tqdm import tqdm
import random

import warnings
warnings.filterwarnings('ignore')

from sklearn.model_selection import KFold 
from sklearn.model_selection import StratifiedKFold 

import tensorflow as tf
import tensorflow.keras as keras
from tensorflow.keras.layers.experimental import preprocessing

import transformers 
import datasets 

In [2]:
# global config set up
config = {
    'nfolds': 10,
    'learning_rate': 1e-4,
    'num_epochs': 3,
    'batch_size': 8,
}

AUTOTUNE = tf.data.experimental.AUTOTUNE

# For reproducible results
def seed_all(s):
    """Seed the Python, NumPy and TensorFlow random number generators."""
    random.seed(s)
    np.random.seed(s)
    tf.random.set_seed(s)
    os.environ['TF_CUDNN_DETERMINISTIC'] = '1'
    os.environ['PYTHONHASHSEED'] = str(s)


global_seed = 42
seed_all(global_seed)

In [3]:
# data
DATA_PATH = '../input/jigsaw-toxic-comment-classification-challenge/train.csv'


# 1. Data Preprocessing

### 1.1 Create train data

For training data, I use the [Toxic Comment Classification Challenge][4] dataset.

[4]: https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge/data

I collapse its six labels into a single binary toxic / non-toxic target: a comment gets `y = 1` if any of the six labels is set, and `y = 0` otherwise.

In [8]:
df = pd.read_csv(DATA_PATH)
df['y'] = (df[['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']].sum(axis=1) > 0 ).astype(int)
df = df[['comment_text', 'y']].rename(columns={'comment_text': 'text'})
df.sample(10)

Unnamed: 0,text,y
108725,July 2009 \n\nThe recent edit you made to soju...,0
102598,"""\n OK. Then don't )Although I resent your sug...",0
149377,An editor should be careful and specific about...,0
11262,(Note: I have posted a message on WP:AN regard...,0
32217,Kurt Wallander. 194.81.33.9,0
2540,This should be noted as it makes it a detrimen...,0
96947,"I'll be back!! \n\nFrom now on SJP, or shithea...",1
139092,what don't you understand about fair use? does...,0
105051,Samanda \n\nWhat is your source on the new sin...,0
95556,Rot In Hell \n\nRot in hell asshole.\nYou dirt...,1
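
The labelling rule used above (a comment counts as toxic if any of the six labels is set) can be checked on a couple of made-up rows. This is a plain-Python sketch; the example rows and the helper name `to_binary_label` are hypothetical and only illustrate the rule.

```python
LABEL_COLS = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']


def to_binary_label(row):
    """Return 1 if any of the six labels is set, else 0 (hypothetical helper)."""
    return int(sum(row[c] for c in LABEL_COLS) > 0)


# Made-up rows for illustration only.
clean = dict.fromkeys(LABEL_COLS, 0)
insulting = {**clean, 'obscene': 1, 'insult': 1}

print(to_binary_label(clean), to_binary_label(insulting))
```

Note that this target discards how many labels are set, so a comment with one label and a comment with all six both end up as `y = 1`.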


### 1.2 Undersampling

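The dataset is heavily imbalanced: about 90% of the comments are non-toxic (143,346 vs. 16,225). Here we undersample the majority class down to the size of the minority class. Other strategies might work better.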

In [11]:
df['y'].value_counts(normalize=True)

0    0.898321
1    0.101679
Name: y, dtype: float64

In [13]:
df['y'].value_counts()

0    143346
1     16225
Name: y, dtype: int64

In [14]:
min_len = (df['y'] == 1).sum()
df_y0_undersample = df[df['y'] == 0].sample(n=min_len, random_state=global_seed)
train_df = pd.concat([df[df['y'] == 1], df_y0_undersample]).reset_index(drop=True)
train_df['y'].value_counts()

1    16225
0    16225
Name: y, dtype: int64

In [18]:
train_df.sample(10)

Unnamed: 0,text,y
31654,}}\n{{WikiProject Energy| class = Start| impor...,0
22024,"Blocking, gagging, and so forth \n\nDoes it no...",0
19337,Personally I don't care about whatever has gon...,0
11274,MUSLIM SCUM go die soon will you,1
4749,"""\n\nI didn't call you a """"biased backward yan...",1
20888,You called? \n\nWhat's up Shiitthead? I have 2...,0
26383,Propaganda \n\nThis article is pure propaganda...,0
19156,"""\nFeel free to notify me when you submit any ...",0
25590,"Please refrain from adding nonsense, such as e...",0
21270,"""\n\nAccording to this comment I am bashing La...",0
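
As noted in this section, strategies other than undersampling might work better. One common alternative is to keep every row and weight the loss by inverse class frequency. The sketch below computes such weights in plain Python from the class counts printed above; the "balanced" formula `n_samples / (n_classes * count)` and the helper name `balanced_class_weights` are assumptions chosen for illustration, not something this notebook uses.

```python
def balanced_class_weights(counts):
    """Inverse-frequency weights: n_samples / (n_classes * count_c)."""
    n_samples = sum(counts.values())
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}


# Class counts taken from df['y'].value_counts() above.
class_weights = balanced_class_weights({0: 143346, 1: 16225})
print(class_weights)
```

Keras `model.fit` accepts a dictionary of this shape through its `class_weight` argument, so the full 159,571 rows could be used instead of the 32,450 kept after undersampling.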


### 1.3 k-fold

In [21]:
n_folds = config['nfolds']  # 10

# Assign each row a validation fold, stratified on y so every fold stays balanced.
skf = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=global_seed)
for nfold, (train_index, val_index) in enumerate(skf.split(X=train_df.index,
                                                           y=train_df.y)):
    train_df.loc[val_index, 'fold'] = nfold
print(train_df.groupby(['fold', train_df.y]).size())

fold  y
0.0   0    1622
      1    1623
1.0   0    1622
      1    1623
2.0   0    1622
      1    1623
3.0   0    1622
      1    1623
4.0   0    1622
      1    1623
5.0   0    1623
      1    1622
6.0   0    1623
      1    1622
7.0   0    1623
      1    1622
8.0   0    1623
      1    1622
9.0   0    1623
      1    1622
dtype: int64


In [20]:
p_fold = 0
p_train = train_df.query(f'fold != {p_fold}').reset_index(drop=True)
p_valid = train_df.query(f'fold == {p_fold}').reset_index(drop=True)

print(len(p_train))
print(len(p_valid))

p_train.head(10)

29205
3245


Unnamed: 0,text,y,fold
0,COCKSUCKER BEFORE YOU PISS AROUND ON MY WORK,1,2.0
1,Hey... what is it..\n@ | talk .\nWhat is it......,1,1.0
2,"Bye! \n\nDon't look, come or think of comming ...",1,1.0
3,"FUCK YOUR FILTHY MOTHER IN THE ASS, DRY!",1,8.0
4,I'm Sorry \n\nI'm sorry I screwed around with ...,1,6.0
5,GET FUCKED UP. GET FUCKEEED UP. GOT A DRINK T...,1,1.0
6,Stupid peace of shit stop deleting my stuff as...,1,8.0
7,=Tony Sidaway is obviously a fistfuckee. He lo...,1,1.0
8,Why can't you believe how fat Artie is? Did yo...,1,6.0
9,All of my edits are good. Cunts like you who ...,1,6.0


# 2. DataSet

In [25]:
checkpoint = "bert-large-uncased"
tokenizer = transformers.AutoTokenizer.from_pretrained(checkpoint)

In [None]:
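# Alternative (not run here): the smaller bert-base-uncased checkpoint.
# Running this cell replaces the checkpoint and tokenizer defined above.
checkpoint = "bert-base-uncased"
tokenizer = transformers.BertTokenizer.from_pretrained(checkpoint)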

In [26]:
tokenizer

PreTrainedTokenizerFast(name_or_path='bert-large-uncased', vocab_size=30522, model_max_len=512, is_fast=True, padding_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'})

In [27]:
train_ds = datasets.Dataset.from_pandas(p_train)
valid_ds = datasets.Dataset.from_pandas(p_valid)

print(train_ds)
print(valid_ds)

Dataset({
    features: ['text', 'y', 'fold'],
    num_rows: 29205
})
Dataset({
    features: ['text', 'y', 'fold'],
    num_rows: 3245
})


In [28]:
def tokenize_function(example):
    return tokenizer(example["text"], truncation=True)

tokenized_train_ds = train_ds.map(tokenize_function, batched=True)
tokenized_valid_ds = valid_ds.map(tokenize_function, batched=True)

print(tokenized_train_ds)
print(tokenized_valid_ds)

Dataset({
    features: ['attention_mask', 'fold', 'input_ids', 'text', 'token_type_ids', 'y'],
    num_rows: 29205
})
Dataset({
    features: ['attention_mask', 'fold', 'input_ids', 'text', 'token_type_ids', 'y'],
    num_rows: 3245
})


In [29]:
data_collator = transformers.DataCollatorWithPadding(tokenizer=tokenizer)

tf_train_ds = tokenized_train_ds.to_tf_dataset(
    columns=["attention_mask", "input_ids", "token_type_ids"],
    label_cols=["y"],
    shuffle=True,
    collate_fn=data_collator,
    batch_size=config['batch_size'],
)

tf_valid_ds = tokenized_valid_ds.to_tf_dataset(
    columns=["attention_mask", "input_ids", "token_type_ids"],
    label_cols=["y"],
    shuffle=False,
    collate_fn=data_collator,
    batch_size=config['batch_size'],
)

print(len(tf_train_ds))
print(len(tf_valid_ds))

3650
406
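
The two batch counts printed above follow from the dataset sizes and `batch_size=8`. The arithmetic below is a sketch; it assumes the shuffled training set drops its final incomplete batch while the validation set keeps it, which is what the printed numbers are consistent with.

```python
import math

batch_size = 8
n_train, n_valid = 29205, 3245

train_batches = n_train // batch_size            # incomplete last batch dropped
valid_batches = math.ceil(n_valid / batch_size)  # incomplete last batch kept

print(train_batches, valid_batches)
```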


# 3. Model Training

In [30]:
from transformers import TFAutoModelForSequenceClassification

model = TFAutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

Downloading:   0%|          | 0.00/1.47G [00:00<?, ?B/s]

All model checkpoint layers were used when initializing TFBertForSequenceClassification.

Some layers of TFBertForSequenceClassification were not initialized from the model checkpoint at bert-large-uncased and are newly initialized: ['classifier']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [31]:
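# Note: these values are set here directly and differ from
# config['num_epochs'] (3) and config['learning_rate'] (1e-4).
num_epochs = 2
num_train_steps = len(tf_train_ds) * num_epochs

# Decay the learning rate from 5e-5 to 0 over all training steps.
lr_scheduler = tf.keras.optimizers.schedules.PolynomialDecay(
    initial_learning_rate=5e-5, end_learning_rate=0.0, decay_steps=num_train_steps
)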

model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=lr_scheduler),
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=['accuracy'])

model.summary()

Model: "tf_bert_for_sequence_classification"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
bert (TFBertMainLayer)       multiple                  335141888 
_________________________________________________________________
dropout_73 (Dropout)         multiple                  0         
_________________________________________________________________
classifier (Dense)           multiple                  2050      
Total params: 335,143,938
Trainable params: 335,143,938
Non-trainable params: 0
_________________________________________________________________
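
With its default `power=1.0`, `PolynomialDecay` reduces to a linear ramp from the initial to the end learning rate. The plain-Python sketch below reproduces that formula for the values used in the compile cell above (`5e-5` down to `0.0` over `3650 * 2` steps); the function name `linear_decay_lr` is hypothetical.

```python
def linear_decay_lr(step, initial_lr=5e-5, end_lr=0.0, decay_steps=3650 * 2, power=1.0):
    """Learning rate at a given optimizer step under polynomial decay."""
    step = min(step, decay_steps)  # the rate stays at end_lr after decay_steps
    return (initial_lr - end_lr) * (1 - step / decay_steps) ** power + end_lr


# Start of training, end of the first epoch, end of the second epoch.
for step in (0, 3650, 7300):
    print(step, linear_decay_lr(step))
```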


In [33]:
# tensorflow is already imported as tf in the settings cell
print("Num GPUs Available: ", len(tf.config.list_physical_devices('GPU')))

Num GPUs Available:  1


If the output is `Num GPUs Available: 1` (or more), TensorFlow has detected the GPU.

If TensorFlow detects the GPU but still runs on the CPU, you can place the training explicitly on the GPU with `tf.device`.

In [34]:
with tf.device('/GPU:0'):
    fit_history = model.fit(tf_train_ds,
                            epochs=num_epochs,
                            validation_data=tf_valid_ds,
                            verbose=1)

Epoch 1/2


ResourceExhaustedError:  failed to allocate memory
	 [[node tf_bert_for_sequence_classification/bert/encoder/layer_._21/attention/self/dropout_64/dropout/GreaterEqual (defined at opt/conda/lib/python3.7/site-packages/transformers/models/bert/modeling_tf_bert.py:272) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info. This isn't available when running in Eager mode.
 [Op:__inference_train_function_45425]

Function call stack:
train_function
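
The out-of-memory error is plausible given the model size. As a rough estimate, assuming float32 values and an Adam optimizer that keeps gradients plus two moment estimates per parameter, the 335,143,938 parameters reported by `model.summary()` already need several GiB before any activations are stored. The breakdown below is an approximation under those assumptions, not a measurement.

```python
n_params = 335_143_938   # total params from model.summary() above
bytes_per_value = 4      # float32 (assumption)
copies = 4               # weights + gradients + Adam m and v (assumption)

weights_gib = n_params * bytes_per_value / 2**30
training_gib = weights_gib * copies

print(f"weights only: {weights_gib:.2f} GiB")
print(f"weights + gradients + Adam state: {training_gib:.2f} GiB")
```

Activation memory comes on top of this and grows with batch size and sequence length, so a smaller batch, a shorter `max_length` in the tokenizer, or the smaller `bert-base-uncased` checkpoint are the usual levers to try.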


# 4. Prediction & Submit

In [None]:
test_df = pd.read_csv("../input/jigsaw-toxic-severity-rating/comments_to_score.csv")
test_ds = datasets.Dataset.from_pandas(test_df)
tokenized_test_ds = test_ds.map(tokenize_function, batched=True)
tf_test_ds = tokenized_test_ds.to_tf_dataset(
    columns=["attention_mask", "input_ids", "token_type_ids"],
    shuffle=False,
    collate_fn=data_collator,
    batch_size=config['batch_size'],
)

In [None]:
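raw_result = model.predict(tf_test_ds)
# The head has two logits (num_labels=2); softmax turns them into class probabilities.
result = tf.nn.softmax(raw_result.logits, axis=-1)

# Use the probability of class 1 (toxic) as the severity score.
test_df['score'] = result.numpy()[:, 1]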
submission_df = test_df[['comment_id', 'score']]

# submission_df.to_csv("submission.csv", index=False) 
submission_df
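
For a two-logit head like the one configured with `num_labels=2`, the softmax probability of class 1 equals the sigmoid of the logit difference, whereas a sigmoid applied to a single logit ignores the other one. A plain-Python sketch with made-up logits (the values are hypothetical, and the `[non-toxic, toxic]` ordering follows `y = 0 / 1`):

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    shifted = [x - max(logits) for x in logits]
    exps = [math.exp(x) for x in shifted]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


logits = [0.5, 2.0]  # made-up [non-toxic, toxic] logits
p_toxic = softmax(logits)[1]

# Both expressions give the same probability for the toxic class.
print(p_toxic, sigmoid(logits[1] - logits[0]))
```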