# Problem Statement
<ul style='font-family: Segoe UI; font-size: 1.5em; font-weight: 400; font-size: 15px'>
<li>Build a model that produces scores that rank each pair of comments the same way as the professional raters in the training dataset.</li>
</ul>

<h2 style='font-family: Segoe UI; font-weight: 400;'>Why this competition?</h2>
<p style='font-family: Segoe UI; font-size: 1.5em; font-weight: 400; font-size: 15px'>As is evident from the problem statement, this competition presents a unique challenge that serves a greater purpose. Online bullying has become an epidemic with the boom in connectivity.<br>Hopefully the solutions contribute towards controlling this behaviour so that the internet remains a safe place for everyone.</p>

<h2 style='font-family: Segoe UI; font-weight: 400;'>Expected Outcome</h2>
<p style='font-family: Segoe UI; font-size: 1.5em; font-weight: 400; font-size: 15px'>In this competition we rank comments in order of severity of toxicity.<br>We are given a list of comments, and each comment should be scored according to its relative toxicity. Comments with a higher degree of toxicity should receive a higher numerical value than comments with a lower degree of toxicity.</p>

<h2 style='font-family: Segoe UI; font-weight: 400;'>Data Description</h2>
<p style='font-family: Segoe UI; font-size: 1.5em; font-weight: 400; font-size: 15px'>There is no training data for this competition. We can refer to previous Jigsaw competitions for data that might be useful to train models.<br>However, we are provided with a set of paired toxicity rankings (as judged by expert raters) that can be used to validate models.</p>

<h2 style='font-family: Segoe UI; font-weight: 400;'>Grading Metric</h2>
<p style='font-family: Segoe UI; font-size: 1.5em; font-weight: 400; font-size: 15px'>Submissions are evaluated on <b>Average Agreement</b> with Annotators.<br>
For the ground truth, annotators were shown two comments and asked to identify which of the two was more toxic. Pairs of comments can be, and often are, rated by more than one annotator, and may have been ordered differently by different annotators.</p>
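
<p style='font-family: Segoe UI; font-size: 1.5em; font-weight: 400; font-size: 15px'>As a rough illustration of the metric described above, the sketch below counts how often a model's scores agree with the annotators' ordering. The function name <code>average_agreement</code>, the sample scores, and the choice to treat ties as disagreements are assumptions made for this example, not the official scoring code.</p>

```python
def average_agreement(more_toxic_scores, less_toxic_scores):
    """Fraction of annotated pairs in which the comment judged more toxic
    received a strictly higher score than the comment judged less toxic."""
    if len(more_toxic_scores) != len(less_toxic_scores):
        raise ValueError("score lists must have the same length")
    if not more_toxic_scores:
        raise ValueError("at least one pair is required")
    # Assumption: a tie counts as a disagreement (strict inequality).
    agreements = sum(
        1 for more, less in zip(more_toxic_scores, less_toxic_scores) if more > less
    )
    return agreements / len(more_toxic_scores)


# Hypothetical model scores for four annotated pairs (illustrative values only).
more_scores = [0.9, 0.4, 0.7, 0.2]
less_scores = [0.1, 0.6, 0.3, 0.2]
agreement = average_agreement(more_scores, less_scores)
print(agreement)  # 0.5: pairs 1 and 3 agree, pair 2 is inverted, pair 4 is a tie
```

<p style='font-family: Segoe UI; font-size: 1.5em; font-weight: 400; font-size: 15px'>Because the same pair can be ordered differently by different annotators, a perfect score of 1.0 is generally not reachable on such data.</p>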

<p style='background:MediumSeaGreen; border:0; color: white; text-align: center; font-family: Segoe UI; font-size: 1.5em; font-weight: 400; font-size: 24px'>If you found this notebook useful or use parts of it in your work, please don't forget to show your appreciation by upvoting this kernel. That keeps me motivated and inspires me to write and share such public kernels.<br>Thanks! 😊</p>

# About This Notebook:-
<ul style='font-family: Segoe UI; font-size: 1.5em; font-weight: 400; font-size: 15px'>
<li>This notebook demonstrates transfer learning using the Hugging Face Transformers and PyTorch libraries.</li>
<li>We use a vanilla <b>roberta-base</b> transformer model for extracting language embeddings and pass them through a dense head to find the rankings.</li>
<li>We use <a href='https://pytorch.org/docs/stable/generated/torch.nn.MarginRankingLoss.html'><b>MarginRankingLoss</b></a> as our loss function.</li>
<li>This notebook only covers the training part. Inference can be found in the notebook linked below.</li>
</ul>
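
<p style='font-family: Segoe UI; font-size: 1.5em; font-weight: 400; font-size: 15px'>To make the loss concrete, here is a dependency-free sketch of the formula given in the PyTorch documentation for <code>MarginRankingLoss</code>, <code>max(0, -y * (x1 - x2) + margin)</code>, averaged over the batch. The scores below are made up for illustration; the margin of 0.7 matches the value used later in this notebook.</p>

```python
def margin_ranking_loss(x1, x2, y, margin=0.0):
    """Mean of max(0, -y * (x1 - x2) + margin) over all pairs.

    With y = 1 the loss is zero only when x1 exceeds x2 by at least `margin`.
    """
    per_pair = [max(0.0, -t * (a - b) + margin) for a, b, t in zip(x1, x2, y)]
    return sum(per_pair) / len(per_pair)


# Illustrative scores: x1 = "more toxic" comment, x2 = "less toxic" comment.
more = [2.0, 0.5, 1.0]
less = [0.5, 0.25, 1.5]
target = [1.0, 1.0, 1.0]  # 1 means x1 should be ranked above x2

loss = margin_ranking_loss(more, less, target, margin=0.7)
print(round(loss, 4))
# Pair 1 is separated by more than the margin and contributes 0.
# Pair 2 is ordered correctly but by less than the margin, so it is still penalised.
# Pair 3 is ordered incorrectly and contributes the most.
```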

<p style='font-family: Segoe UI; font-size: 1.5em; font-weight: 400; font-size: 15px'>Inference Notebook:- <a href='https://www.kaggle.com/manabendrarout/jrstc-pytorch-roberta-ranking-baseline-infer'><b>https://www.kaggle.com/manabendrarout/jrstc-pytorch-roberta-ranking-baseline-infer</b></a></p>

# Get GPU Info

In [None]:
!nvidia-smi

# Imports

In [None]:
# Aesthetics
import warnings
import sklearn.exceptions
warnings.filterwarnings('ignore', category=DeprecationWarning)
warnings.filterwarnings('ignore', category=UserWarning)
warnings.filterwarnings('ignore', category=FutureWarning)
warnings.filterwarnings("ignore", category=sklearn.exceptions.UndefinedMetricWarning)

# General
from tqdm.auto import tqdm
from bs4 import BeautifulSoup
from collections import defaultdict
import pandas as pd
import numpy as np
import os
import re
import random
import gc
pd.set_option('display.max_columns', None)
np.seterr(divide='ignore', invalid='ignore')
gc.enable()

# Deep Learning
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
from torch.optim.lr_scheduler import OneCycleLR, CosineAnnealingWarmRestarts
# NLP
from transformers import AutoTokenizer, AutoModel

# Random Seed Initialize
RANDOM_SEED = 42

def seed_everything(seed=RANDOM_SEED):
    os.environ['PYTHONHASHSEED'] = str(seed)
    np.random.seed(seed)
    random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False  # cuDNN autotuning can pick different kernels between runs
    
seed_everything()

# Device Optimization
if torch.cuda.is_available():
    device = torch.device('cuda')
else:
    device = torch.device('cpu')
    
print(f'Using device: {device}')

# Reading File

In [None]:
data_dir = '../input/jrstc-train-folds'
train_file_path = os.path.join(data_dir, 'validation_data_5_folds.csv')
print(f'Train file: {train_file_path}')

In [None]:
train_df = pd.read_csv(train_file_path)

# Text Cleaning

In [None]:
def text_cleaning(text):
    '''
    Cleans text into a basic form for NLP. Operations are applied in this order:
    1. Removes embedded URL links
    2. Removes HTML tags
    3. Removes emojis
    4. Replaces special characters like &, #, etc. with spaces
    5. Removes extra spaces
    
    text - Text piece to be cleaned.
    '''
    template = re.compile(r'https?://\S+|www\.\S+') #Removes website links
    text = template.sub(r'', text)
    
    soup = BeautifulSoup(text, 'lxml') #Removes HTML tags
    only_text = soup.get_text()
    text = only_text
    
    emoji_pattern = re.compile("["
                               u"\U0001F600-\U0001F64F"  # emoticons
                               u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                               u"\U0001F680-\U0001F6FF"  # transport & map symbols
                               u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                               u"\U00002702-\U000027B0"
                               u"\U000024C2-\U0001F251"
                               "]+", flags=re.UNICODE)
    text = emoji_pattern.sub(r'', text)
    
    text = re.sub(r"[^a-zA-Z\d]", " ", text) # Replace special characters with spaces
    text = re.sub(' +', ' ', text) # Remove extra spaces
    text = text.strip() # remove spaces at the beginning and at the end of string

    return text
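
<p style='font-family: Segoe UI; font-size: 1.5em; font-weight: 400; font-size: 15px'>The effect of the regex steps is easiest to see on a small input. The sketch below reproduces only the URL, special-character, and whitespace steps with the standard library; it skips the BeautifulSoup and emoji steps, and the helper name <code>basic_clean</code> and the sample string are made up for this example.</p>

```python
import re

URL_PATTERN = re.compile(r'https?://\S+|www\.\S+')


def basic_clean(text):
    """Simplified stand-in for text_cleaning (no HTML or emoji handling)."""
    text = URL_PATTERN.sub('', text)           # drop links
    text = re.sub(r'[^a-zA-Z\d]', ' ', text)   # non-alphanumerics become spaces
    text = re.sub(' +', ' ', text)             # collapse runs of spaces
    return text.strip()


sample = "Check   this out: https://example.com/page?x=1 !!! You're #1"
cleaned = basic_clean(sample)
print(cleaned)  # Check this out You re 1
```

<p style='font-family: Segoe UI; font-size: 1.5em; font-weight: 400; font-size: 15px'>Note that case is preserved and that punctuation is replaced rather than deleted, so contractions such as "You're" are split into two tokens.</p>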

In [None]:
tqdm.pandas()
train_df['less_toxic'] = train_df['less_toxic'].progress_apply(text_cleaning)
train_df['more_toxic'] = train_df['more_toxic'].progress_apply(text_cleaning)

In [None]:
train_df.sample(10)

In [None]:
train_df.groupby(['kfold']).size()

# CFG

In [None]:
params = {
    'device': device,
    'debug': False,
    'checkpoint': 'roberta-base',
    'output_logits': 768,
    'max_len': 256,
    'num_folds': train_df['kfold'].nunique(),
    'batch_size': 16,
    'dropout': 0.2,
    'num_workers': 2,
    'epochs': 3,
    'lr': 2e-5,
    'margin': 0.7,
    'scheduler_name': 'OneCycleLR',
    'max_lr': 5e-5,                 # OneCycleLR
    'pct_start': 0.1,               # OneCycleLR
    'anneal_strategy': 'cos',       # OneCycleLR
    'div_factor': 1e3,              # OneCycleLR
    'final_div_factor': 1e3,        # OneCycleLR
    'no_decay': True
}

In [None]:
if params['debug']:
    train_df = train_df.sample(frac=0.01)
    print('Reduced training Data Size for Debugging purposes')

# Dataset

In [None]:
class BERTDataset(Dataset):
    def __init__(self, more_toxic, less_toxic, max_len=params['max_len'], checkpoint=params['checkpoint']):
        self.more_toxic = more_toxic
        self.less_toxic = less_toxic
        self.max_len = max_len
        self.checkpoint = checkpoint
        self.tokenizer = AutoTokenizer.from_pretrained(checkpoint)
        self.num_examples = len(self.more_toxic)

    def __len__(self):
        return self.num_examples

    def __getitem__(self, idx):
        more_toxic = str(self.more_toxic[idx])
        less_toxic = str(self.less_toxic[idx])

        tokenized_more_toxic = self.tokenizer(
            more_toxic,
            add_special_tokens=True,
            truncation=True,
            padding='max_length',
            max_length=self.max_len,
            return_attention_mask=True,
            return_token_type_ids=True,
        )

        tokenized_less_toxic = self.tokenizer(
            less_toxic,
            add_special_tokens=True,
            truncation=True,
            padding='max_length',
            max_length=self.max_len,
            return_attention_mask=True,
            return_token_type_ids=True,
        )

        ids_more_toxic = tokenized_more_toxic['input_ids']
        mask_more_toxic = tokenized_more_toxic['attention_mask']
        token_type_ids_more_toxic = tokenized_more_toxic['token_type_ids']

        ids_less_toxic = tokenized_less_toxic['input_ids']
        mask_less_toxic = tokenized_less_toxic['attention_mask']
        token_type_ids_less_toxic = tokenized_less_toxic['token_type_ids']

        return {'ids_more_toxic': torch.tensor(ids_more_toxic, dtype=torch.long),
                'mask_more_toxic': torch.tensor(mask_more_toxic, dtype=torch.long),
                'token_type_ids_more_toxic': torch.tensor(token_type_ids_more_toxic, dtype=torch.long),
                'ids_less_toxic': torch.tensor(ids_less_toxic, dtype=torch.long),
                'mask_less_toxic': torch.tensor(mask_less_toxic, dtype=torch.long),
                'token_type_ids_less_toxic': torch.tensor(token_type_ids_less_toxic, dtype=torch.long),
                'target': torch.tensor(1, dtype=torch.float)}

# Scheduler

In [None]:
def get_scheduler(optimizer, scheduler_params=params):
    if scheduler_params['scheduler_name'] == 'CosineAnnealingWarmRestarts':
        scheduler = CosineAnnealingWarmRestarts(
            optimizer,
            T_0=scheduler_params['T_0'],
            eta_min=scheduler_params['min_lr'],
            last_epoch=-1
        )
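    elif scheduler_params['scheduler_name'] == 'OneCycleLR':
        scheduler = OneCycleLR(
            optimizer,
            max_lr=scheduler_params['max_lr'],
            # Relies on the global df_train set inside the fold loop; the +1 covers the final partial batch
            steps_per_epoch=int(df_train.shape[0] / params['batch_size']) + 1,
            epochs=scheduler_params['epochs'],
            pct_start=scheduler_params['pct_start'],
            anneal_strategy=scheduler_params['anneal_strategy'],
            div_factor=scheduler_params['div_factor'],
            final_div_factor=scheduler_params['final_div_factor'],
        )
    else:
        # Fail loudly instead of hitting an UnboundLocalError on the return below
        raise ValueError(f"Unsupported scheduler: {scheduler_params['scheduler_name']}")
    return scheduler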
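
<p style='font-family: Segoe UI; font-size: 1.5em; font-weight: 400; font-size: 15px'>The OneCycleLR settings in the config translate into concrete learning rates as follows. According to the PyTorch documentation, the schedule starts at <code>max_lr / div_factor</code> and ends at that starting value divided by <code>final_div_factor</code>. The sketch below works through the arithmetic; the helper name <code>one_cycle_summary</code> and the row count of 24,000 are hypothetical, and the warm-up length is an approximation of where the peak falls.</p>

```python
def one_cycle_summary(max_lr, div_factor, final_div_factor, pct_start,
                      num_rows, batch_size, epochs):
    """Derive the key points of a one-cycle schedule from its settings."""
    # Same expression the notebook uses for steps_per_epoch.
    steps_per_epoch = int(num_rows / batch_size) + 1
    total_steps = steps_per_epoch * epochs
    initial_lr = max_lr / div_factor            # learning rate at step 0
    final_lr = initial_lr / final_div_factor    # learning rate at the last step
    warmup_steps = int(pct_start * total_steps)  # approximate step of the peak
    return {
        'steps_per_epoch': steps_per_epoch,
        'total_steps': total_steps,
        'initial_lr': initial_lr,
        'peak_lr': max_lr,
        'final_lr': final_lr,
        'warmup_steps': warmup_steps,
    }


summary = one_cycle_summary(
    max_lr=5e-5, div_factor=1e3, final_div_factor=1e3, pct_start=0.1,
    num_rows=24000,  # hypothetical size of one training split
    batch_size=16, epochs=3,
)
for key, value in summary.items():
    print(f'{key}: {value}')
```

<p style='font-family: Segoe UI; font-size: 1.5em; font-weight: 400; font-size: 15px'>With these settings the rate climbs from about 5e-8 to 5e-5 over roughly the first 10% of steps and then decays to about 5e-11. As far as I understand, once OneCycleLR is attached it drives the optimizer's learning rate, so the <code>lr</code> value passed to AdamW has little effect.</p>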

# Metrics

In [None]:
class MetricMonitor:
    def __init__(self, float_precision=4):
        self.float_precision = float_precision
        self.reset()

    def reset(self):
        self.metrics = defaultdict(lambda: {"val": 0, "count": 0, "avg": 0})

    def update(self, metric_name, val):
        metric = self.metrics[metric_name]

        metric["val"] += val
        metric["count"] += 1
        metric["avg"] = metric["val"] / metric["count"]

    def __str__(self):
        return " | ".join(
            [
                "{metric_name}: {avg:.{float_precision}f}".format(
                    metric_name=metric_name, avg=metric["avg"],
                    float_precision=self.float_precision
                )
                for (metric_name, metric) in self.metrics.items()
            ]
        )

# NLP Model

In [None]:
class ToxicityModel(nn.Module):
    def __init__(self, checkpoint=params['checkpoint'], params=params):
        super(ToxicityModel, self).__init__()
        self.checkpoint = checkpoint
        self.bert = AutoModel.from_pretrained(checkpoint, return_dict=False)
        self.layer_norm = nn.LayerNorm(params['output_logits'])
        self.dropout = nn.Dropout(params['dropout'])
        self.dense = nn.Sequential(
            nn.Linear(params['output_logits'], 128),
            nn.SiLU(),
            nn.Dropout(params['dropout']),
            nn.Linear(128, 1)
        )

    def forward(self, input_ids, token_type_ids, attention_mask):
        _, pooled_output = self.bert(input_ids=input_ids, token_type_ids=token_type_ids, attention_mask=attention_mask)
        pooled_output = self.layer_norm(pooled_output)
        pooled_output = self.dropout(pooled_output)
        preds = self.dense(pooled_output)
        return preds

# Training And Validation Loops

## 1. Train Function

In [None]:
def train_fn(train_loader, model, criterion, optimizer, epoch, params, scheduler=None):
    metric_monitor = MetricMonitor()
    model.train()
    stream = tqdm(train_loader)
    
    for i, batch in enumerate(stream, start=1):
        ids_more_toxic = batch['ids_more_toxic'].to(device)
        mask_more_toxic = batch['mask_more_toxic'].to(device)
        token_type_ids_more_toxic = batch['token_type_ids_more_toxic'].to(device)
        ids_less_toxic = batch['ids_less_toxic'].to(device)
        mask_less_toxic = batch['mask_less_toxic'].to(device)
        token_type_ids_less_toxic = batch['token_type_ids_less_toxic'].to(device)
        target = batch['target'].to(device)

        # squeeze(-1) turns (batch, 1) scores into (batch,) so they match the target's shape
        logits_more_toxic = model(ids_more_toxic, token_type_ids_more_toxic, mask_more_toxic).squeeze(-1)
        logits_less_toxic = model(ids_less_toxic, token_type_ids_less_toxic, mask_less_toxic).squeeze(-1)
        loss = criterion(logits_more_toxic, logits_less_toxic, target)
        metric_monitor.update('Loss', loss.item())
        loss.backward()
        optimizer.step()
            
        if scheduler is not None:
            scheduler.step()
        
        optimizer.zero_grad()
        stream.set_description(f"Epoch: {epoch:02}. Train. {metric_monitor}")

## 2. Validate Function

In [None]:
def validate_fn(val_loader, model, criterion, epoch, params):
    metric_monitor = MetricMonitor()
    model.eval()
    stream = tqdm(val_loader)
    all_loss = []
    with torch.no_grad():
        for i, batch in enumerate(stream, start=1):
            ids_more_toxic = batch['ids_more_toxic'].to(device)
            mask_more_toxic = batch['mask_more_toxic'].to(device)
            token_type_ids_more_toxic = batch['token_type_ids_more_toxic'].to(device)
            ids_less_toxic = batch['ids_less_toxic'].to(device)
            mask_less_toxic = batch['mask_less_toxic'].to(device)
            token_type_ids_less_toxic = batch['token_type_ids_less_toxic'].to(device)
            target = batch['target'].to(device)

            # squeeze(-1) turns (batch, 1) scores into (batch,) so they match the target's shape
            logits_more_toxic = model(ids_more_toxic, token_type_ids_more_toxic, mask_more_toxic).squeeze(-1)
            logits_less_toxic = model(ids_less_toxic, token_type_ids_less_toxic, mask_less_toxic).squeeze(-1)
            loss = criterion(logits_more_toxic, logits_less_toxic, target)
            all_loss.append(loss.item())
            metric_monitor.update('Loss', loss.item())
            stream.set_description(f"Epoch: {epoch:02}. Valid. {metric_monitor}")
            
    return np.mean(all_loss)

# Run

In [None]:
best_models_of_each_fold = []

In [None]:
gc.collect()
for fold in range(params['num_folds']):
    print(f'******************** Training Fold: {fold+1} ********************')
    current_fold = fold
    df_train = train_df[train_df['kfold'] != current_fold].copy()
    df_valid = train_df[train_df['kfold'] == current_fold].copy()

    train_dataset = BERTDataset(
        df_train.more_toxic.values,
        df_train.less_toxic.values
    )
    valid_dataset = BERTDataset(
        df_valid.more_toxic.values,
        df_valid.less_toxic.values
    )

    train_dataloader = DataLoader(
        train_dataset, batch_size=params['batch_size'], shuffle=True,
        num_workers=params['num_workers'], pin_memory=True
    )
    valid_dataloader = DataLoader(
        valid_dataset, batch_size=params['batch_size']*2, shuffle=False,
        num_workers=params['num_workers'], pin_memory=True
    )
    
    model = ToxicityModel()
    model = model.to(params['device'])
    criterion = nn.MarginRankingLoss(margin=params['margin'])
    if params['no_decay']:
        # Biases and LayerNorm parameters are excluded from weight decay
        no_decay = ['bias', 'LayerNorm.weight', 'LayerNorm.bias']
        optimizer_grouped_parameters = [
            {'params': [p for n, p in model.named_parameters() if not any(nd in n for nd in no_decay)], 'weight_decay': 0.01},
            {'params': [p for n, p in model.named_parameters() if any(nd in n for nd in no_decay)], 'weight_decay': 0.0}
        ]
        optimizer = optim.AdamW(optimizer_grouped_parameters, lr=params['lr'])
    else:
        optimizer = optim.AdamW(model.parameters(), lr=params['lr'])
    scheduler = get_scheduler(optimizer)

    # Training and Validation Loop
    best_loss = np.inf
    best_epoch = 0
    best_model_name = None
    for epoch in range(1, params['epochs'] + 1):
        train_fn(train_dataloader, model, criterion, optimizer, epoch, params, scheduler)
        valid_loss = validate_fn(valid_dataloader, model, criterion, epoch, params)
        if valid_loss <= best_loss:
            best_loss = valid_loss
            best_epoch = epoch
            if best_model_name is not None:
                os.remove(best_model_name)  # keep only the best checkpoint per fold
            best_model_name = f"{params['checkpoint']}_{epoch}_epoch_f{fold+1}.pth"
            torch.save(model.state_dict(), best_model_name)

    # Print summary of this fold
    print('')
    print(f'The best LOSS: {best_loss} for fold {fold+1} was achieved on epoch: {best_epoch}.')
    print(f'The Best saved model is: {best_model_name}')
    best_models_of_each_fold.append(best_model_name)
    del df_train, df_valid, train_dataset, valid_dataset, train_dataloader, valid_dataloader, model
    _ = gc.collect()
    torch.cuda.empty_cache()
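
<p style='font-family: Segoe UI; font-size: 1.5em; font-weight: 400; font-size: 15px'>The weight-decay grouping in the loop above is a plain substring match on parameter names. The sketch below applies the same rule to a handful of names; the names are illustrative examples of what <code>model.named_parameters()</code> would likely yield for this model, not an exhaustive or verified list.</p>

```python
no_decay = ['bias', 'LayerNorm.weight', 'LayerNorm.bias']

# Illustrative parameter names (assumed, not read from a real model).
names = [
    'bert.encoder.layer.0.attention.self.query.weight',
    'bert.encoder.layer.0.attention.self.query.bias',
    'bert.encoder.layer.0.attention.output.LayerNorm.weight',
    'layer_norm.weight',  # the head's own LayerNorm, named in snake_case
    'layer_norm.bias',
    'dense.0.weight',
]


def uses_weight_decay(name):
    """Same test as the optimizer grouping: decay unless a pattern matches."""
    return not any(pattern in name for pattern in no_decay)


decayed = [n for n in names if uses_weight_decay(n)]
not_decayed = [n for n in names if not uses_weight_decay(n)]
print('weight_decay=0.01:', decayed)
print('weight_decay=0.0: ', not_decayed)
```

<p style='font-family: Segoe UI; font-size: 1.5em; font-weight: 400; font-size: 15px'>One consequence of the case-sensitive match is that a parameter named <code>layer_norm.weight</code> does not match <code>LayerNorm.weight</code> and therefore ends up in the decayed group. Adding <code>'layer_norm.weight'</code> to the list would be one way to exclude it, if that is the intent.</p>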

In [None]:
for i, name in enumerate(best_models_of_each_fold):
    print(f'Best model of fold {i+1}: {name}')

<p style='font-family: Segoe UI; font-size: 1.5em; font-weight: 400; font-size: 15px'>This is a simple starter kernel that implements transfer learning with PyTorch for this problem. The Hugging Face Transformers library has many SOTA NLP models that you can try out by following the approach in this notebook.<br>I hope you have learnt something from this notebook. I created it as a baseline model that you can easily fork and play around with to get much better results. I might update parts of it down the line when I get more GPU hours and some interesting ideas.</p>
