## About this notebook

*[Jigsaw Multilingual Toxic Comment Classification](https://www.kaggle.com/c/jigsaw-multilingual-toxic-comment-classification)* is the third annual competition organized by the Jigsaw team. It follows the *[Toxic Comment Classification Challenge](https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge)*, the original 2018 competition, and *[Jigsaw Unintended Bias in Toxicity Classification](https://www.kaggle.com/c/jigsaw-unintended-bias-in-toxicity-classification)*, which required competitors to account for unintended bias in their models' predictions. This year, the goal is to use English-only training data to predict toxicity in many different languages. Multilingual models make this possible, and TPUs speed up training.

Many great notebooks have already been published for this competition, and several of them use interesting technologies such as [PyTorch XLA](https://www.kaggle.com/theoviel/bert-pytorch-huggingface-starter). This notebook instead aims at constructing a **fast, concise, reusable, and beginner-friendly model scaffold**.

**THIS VERSION TRAINS ON A 30% SAMPLE OF GOOGLE-TRANSLATED TRAINING DATA, AND IT ALSO TRAINS ON THE VALIDATION SET.**


### References
* Original Author: [@xhlulu](https://www.kaggle.com/xhlulu/)
* Original notebook: [Link](https://www.kaggle.com/xhlulu/jigsaw-tpu-distilbert-with-huggingface-and-keras)

In [None]:
import os
import csv

import numpy as np
import pandas as pd
import tensorflow as tf
from tensorflow.keras.layers import Dense, Input, Dropout
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.models import Model
from tensorflow.keras.callbacks import ModelCheckpoint
from kaggle_datasets import KaggleDatasets
import transformers
from transformers import TFAutoModel, AutoTokenizer
from tqdm.notebook import tqdm
from tokenizers import Tokenizer, models, pre_tokenizers, decoders, processors

## Data

In [None]:
files_translated = [
"../input/jigsaw-train-multilingual-coments-google-api/jigsaw-toxic-comment-train-google-es.csv",
"../input/jigsaw-train-multilingual-coments-google-api/jigsaw-toxic-comment-train-google-fr.csv",
"../input/jigsaw-train-multilingual-coments-google-api/jigsaw-toxic-comment-train-google-it.csv",
"../input/jigsaw-train-multilingual-coments-google-api/jigsaw-toxic-comment-train-google-pt.csv",
"../input/jigsaw-train-multilingual-coments-google-api/jigsaw-toxic-comment-train-google-ru.csv",
"../input/jigsaw-train-multilingual-coments-google-api/jigsaw-toxic-comment-train-google-tr.csv"]

In [None]:
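def one_train(files_translated):
    """Merge a 30% sample of the translated training files into one CSV."""
    train = []

    for file in files_translated:
        # The language code is the last dash-separated token: "...-google-es.csv" -> "es"
        lang = file.split('-')[-1].split('.')[0]
        df = pd.read_csv(file).dropna(subset=['comment_text', 'toxic'])
        df.loc[:, 'lang'] = lang
        train.append(df[['comment_text', 'lang', 'toxic']])
    # Concatenate all languages, then keep a random 30% of the rows
    train = pd.concat(train, axis=0).sample(frac=0.3)
    train.to_csv('train_translated.csv', index=False)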
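
The language tag in `one_train` comes from the file name alone. A minimal sketch of that parsing step, assuming every path ends in `-<lang>.csv` as the files listed above do (`lang_from_path` is a hypothetical helper name, not part of the notebook):

```python
import os

def lang_from_path(path):
    """Return the language code from a path ending in '-<lang>.csv'."""
    # Work on the file name only so dashes in directory names cannot interfere
    name = os.path.basename(path)
    stem, _ext = os.path.splitext(name)
    return stem.split('-')[-1]

example = "../input/jigsaw-train-multilingual-coments-google-api/jigsaw-toxic-comment-train-google-es.csv"
print(lang_from_path(example))
```

Splitting the base name rather than the whole path gives the same result for these files and stays correct if a directory name ever ends in a dash-separated token.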


In [None]:
one_train(files_translated)

In [None]:
def regular_encode(texts, tokenizer, maxlen=512):
    enc = tokenizer.batch_encode_plus(
        texts, 
        return_attention_masks=True, 
        return_token_type_ids=False,
        pad_to_max_length=True,
        return_tensors='tf',
        max_length=maxlen
    )
    
    return enc['input_ids'], enc['attention_mask']

def fast_encode(texts, tokenizer, chunk_size=256, maxlen=512):
    """
    https://www.kaggle.com/xhlulu/jigsaw-tpu-distilbert-with-huggingface-and-keras
    """
    tokenizer.enable_truncation(max_length=maxlen)
    tokenizer.enable_padding(max_length=maxlen)
    all_ids = []
    
    for i in tqdm(range(0, len(texts), chunk_size)):
        text_chunk = texts[i:i+chunk_size].tolist()
        encs = tokenizer.encode_batch(text_chunk)
        all_ids.extend([enc.ids for enc in encs])
    
    return np.array(all_ids)
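
`fast_encode` returns only token ids, and the model in this notebook takes only `input_ids`. If an attention mask were wanted, it could be derived from the padded ids. A minimal sketch, assuming padding uses id 0 (believed to be the `tokenizers` default for `enable_padding`; the helper name and the `pad_id` argument are assumptions for illustration, and the right id depends on the vocabulary):

```python
import numpy as np

def attention_mask_from_ids(ids, pad_id=0):
    """Return 1 where a real token is present and 0 at padding positions."""
    ids = np.asarray(ids)
    return (ids != pad_id).astype(np.int32)

# Two toy sequences padded to length 5 with the assumed pad id 0
toy_ids = np.array([[5, 8, 2, 0, 0],
                    [7, 3, 9, 4, 2]])
toy_mask = attention_mask_from_ids(toy_ids)
print(toy_mask)
```

Note that this approach would also mask any real token whose id happens to equal `pad_id`, so it is only safe when the pad id is reserved for padding.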

In [None]:
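class CustomSchedule(tf.keras.optimizers.schedules.LearningRateSchedule):
    """Linear warmup from initial_lr to max_lr, then linear decay toward zero."""

    def __init__(self, max_lr, initial_lr, warmup_steps=10, cooling_steps=100):
        super(CustomSchedule, self).__init__()

        # Warmup must step by the span (max_lr - initial_lr), not by max_lr
        self.warmup = tf.range(initial_lr, max_lr, (max_lr - initial_lr) / warmup_steps)
        self.cooling = tf.range(max_lr, 0, -max_lr / cooling_steps)
        self.lrs = tf.concat([self.warmup, self.cooling], axis=0)

    def __call__(self, step):
        # Clamp the index so steps past the end of the table reuse the last rate
        last = tf.shape(self.lrs)[0] - 1
        return self.lrs[tf.minimum(tf.cast(step, tf.int32), last)]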


def build_model(transformer, max_len=512):
    """
    https://www.kaggle.com/xhlulu/jigsaw-tpu-distilbert-with-huggingface-and-keras
    """
    input_word_ids = Input(shape=(max_len,), dtype=tf.int32, name="input_ids")
#     attention_mask = Input(shape=(max_len,), dtype=tf.int32, name="attention_mask")
    sequence_output = transformer(input_word_ids)[0]
    cls_token = sequence_output[:, 0, :]
    drop = Dropout(0.1)(cls_token)
    out = Dense(1, activation='sigmoid', name='output')(drop)
    
    model = Model(inputs=input_word_ids, outputs=out)
#     learning_rate = CustomSchedule(2e-6, 0.0, 500, 10000)  # (max_lr, initial_lr, warmup_steps, cooling_steps)
    model.compile(Adam(2e-6), loss='binary_crossentropy', metrics=['accuracy'])
    
    return model
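
The commented-out `CustomSchedule` line describes a warmup-then-cooldown learning rate. A minimal pure-Python sketch of that shape, assuming linear warmup from `initial_lr` to `max_lr` followed by linear decay to zero (`lr_at` is a hypothetical helper, independent of TensorFlow, and the values below are illustrative rather than the notebook's settings):

```python
def lr_at(step, max_lr, initial_lr=0.0, warmup_steps=10, cooling_steps=100):
    """Learning rate at a given step: linear warmup, then linear decay to zero."""
    if step < warmup_steps:
        # Fraction of the warmup completed so far
        return initial_lr + (max_lr - initial_lr) * step / warmup_steps
    # Steps spent in the cooling phase, clamped so the rate never goes negative
    cooled = min(step - warmup_steps, cooling_steps)
    return max_lr * (1 - cooled / cooling_steps)

for s in (0, 5, 10, 60, 110, 500):
    print(s, lr_at(s, max_lr=1.0))
```

Computing the rate from the step directly avoids building a lookup table and stays defined for any step count, at the cost of not being a drop-in Keras schedule.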

## TPU Configs

In [None]:
# Detect hardware, return appropriate distribution strategy

# TPU detection. No parameters necessary if TPU_NAME environment variable is
# set: this is always the case on Kaggle.
tpu = tf.distribute.cluster_resolver.TPUClusterResolver()
print('Running on TPU ', tpu.master())

tf.config.experimental_connect_to_cluster(tpu)
tf.tpu.experimental.initialize_tpu_system(tpu)
strategy = tf.distribute.experimental.TPUStrategy(tpu)

print("REPLICAS: ", strategy.num_replicas_in_sync)

In [None]:
AUTO = tf.data.experimental.AUTOTUNE

# Data access
# GCS_DS_PATH = KaggleDatasets().get_gcs_path('/kaggle/working/train_translated.csv')

# Configuration
EPOCHS = 1
BATCH_SIZE = 12 * strategy.num_replicas_in_sync
MAX_LEN = 300
MODEL = 'jplu/tf-xlm-roberta-large'

## Create fast tokenizer

In [None]:
from tokenizers import (
    ByteLevelBPETokenizer,
    CharBPETokenizer,
    SentencePieceBPETokenizer,
    BertWordPieceTokenizer,
)

In [None]:
# !wget https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-multilingual-cased-vocab.txt

In [None]:
!wget https://s3.amazonaws.com/models.huggingface.co/bert/xlm-roberta-large-sentencepiece.bpe.model

In [None]:
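# First load the real tokenizer
# tokenizer = BertWordPieceTokenizer('bert-base-multilingual-cased-vocab.txt', lowercase=False)
# NOTE (unverified assumption): SentencePieceBPETokenizer is normally built from a
# vocab/merges file pair, so check that the encoded ids look sensible for this .model file.
tokenizer = SentencePieceBPETokenizer("xlm-roberta-large-sentencepiece.bpe.model")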

In [None]:
train = pd.read_csv('train_translated.csv')

valid = pd.read_csv('/kaggle/input/jigsaw-multilingual-toxic-comment-classification/validation.csv')
test = pd.read_csv('/kaggle/input/jigsaw-multilingual-toxic-comment-classification/test.csv')
sub = pd.read_csv('/kaggle/input/jigsaw-multilingual-toxic-comment-classification/sample_submission.csv')

In [None]:
%%time
x_train = fast_encode(train.comment_text.values, tokenizer, maxlen=MAX_LEN)

In [None]:
x_train.shape

In [None]:
x_valid = fast_encode(valid.comment_text.values, tokenizer, maxlen=MAX_LEN)
x_test = fast_encode(test.content.values, tokenizer, maxlen=MAX_LEN)

y_train = train.toxic.values
y_valid = valid.toxic.values

In [None]:
train_dataset = (
    tf.data.Dataset
    .from_tensor_slices((x_train, y_train))
    .repeat()
    .shuffle(2048)
    .batch(BATCH_SIZE)
    .prefetch(AUTO)
)

valid_dataset = (
    tf.data.Dataset
    .from_tensor_slices((x_valid, y_valid))
    .batch(BATCH_SIZE)
    .cache()
    .prefetch(AUTO)
)

test_dataset = (
    tf.data.Dataset
    .from_tensor_slices((x_test, ))
    .batch(BATCH_SIZE)
)

In [None]:
%%time
with strategy.scope():
    transformer_layer = TFAutoModel.from_pretrained(MODEL)
    model = build_model(transformer_layer, max_len=MAX_LEN)
model.summary()

## Train Model

In [None]:
# train_iterator = generate_batches('train_translated.csv', tokenizer, BATCH_SIZE, MAX_LEN)
# valid_iterator = generate_batches_val('/kaggle/input/jigsaw-multilingual-toxic-comment-classification/validation.csv', tokenizer, BATCH_SIZE, MAX_LEN)

In [None]:
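EPOCHS = 20
# n_steps is one full pass over the training data; it is computed for reference only,
# since each "epoch" below runs a fixed 1000 steps over the repeated dataset.
n_steps = x_train.shape[0] // BATCH_SIZE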
train_history = model.fit(
    train_dataset,
    steps_per_epoch=1000,
    validation_data=valid_dataset,
    epochs=EPOCHS
)
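
Because `train_dataset` repeats indefinitely, an "epoch" here is just `steps_per_epoch` batches, not a full pass over the data. A small sketch of the arithmetic, assuming 8 TPU replicas (typical for a Kaggle TPU v3-8, though the notebook prints the real value) and a hypothetical training-set size:

```python
def epoch_coverage(n_examples, per_replica_batch, n_replicas, steps_per_epoch):
    """Return (global batch size, examples seen per epoch, fraction of data per epoch)."""
    global_batch = per_replica_batch * n_replicas
    seen = global_batch * steps_per_epoch
    return global_batch, seen, seen / n_examples

# 400_000 rows is a made-up size for illustration only
global_batch, seen, fraction = epoch_coverage(
    n_examples=400_000, per_replica_batch=12, n_replicas=8, steps_per_epoch=1000
)
print(global_batch, seen, round(fraction, 2))
```

Under these assumed numbers each epoch covers roughly a quarter of the data, so the reported per-epoch validation metrics act as frequent checkpoints rather than full-pass results.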

Now that the model has been trained on the translated data, we continue training it on the `validation` set, which is significantly smaller but contains a mixture of different languages. The cell below runs 20 epochs over it.

In [None]:
n_steps = x_valid.shape[0] // BATCH_SIZE
train_history_2 = model.fit(
    valid_dataset.repeat(),
    steps_per_epoch=n_steps,
    epochs=20
)

## Submission

In [None]:
# predict() returns shape (n, 1); flatten it to a 1-D column
sub['toxic'] = model.predict(test_dataset).ravel()

In [None]:
submission = sub[['id', 'toxic']]

In [None]:
submission.to_csv('submission.csv', index=False)
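
A Keras model with a single sigmoid output returns predictions of shape `(n, 1)`, while a submission column is one-dimensional. A minimal sketch of flattening before assignment, using a stand-in array instead of real model output (the ids and scores below are made up):

```python
import numpy as np
import pandas as pd

# Stand-in for model.predict(...): one probability per row, shape (3, 1)
fake_preds = np.array([[0.91], [0.07], [0.55]])

demo_sub = pd.DataFrame({'id': [0, 1, 2]})
# ravel() turns the (3, 1) column vector into a flat array of length 3
demo_sub['toxic'] = fake_preds.ravel()
print(demo_sub)
```

Flattening explicitly keeps the assignment independent of how a given pandas version handles two-dimensional arrays.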