# Finetuning DistilBERT Model for Toxic Comment Classification

### Importing Libraries
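*NOTE: This notebook assumes a Transformers release newer than 4.20 for `Trainer`, `TrainingArguments`, and `DistilBertTokenizerFast`. The `transformers==4.2` pin in the cell below resolves to 4.2.0, which is older than 4.20; in the output shown, its `tokenizers` dependency fails to build.*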

In [None]:
%pip install transformers==4.2
import os
import numpy as np
import pandas as pd
from tqdm import tqdm
from sklearn import metrics
import transformers
import torch
from torch.utils.data import Dataset, DataLoader
from transformers import DistilBertTokenizerFast, DistilBertForSequenceClassification
from transformers import Trainer, TrainingArguments
from sklearn.model_selection import train_test_split

Collecting transformers==4.2
  Using cached transformers-4.2.0-py3-none-any.whl (1.8 MB)
Collecting tokenizers==0.9.4
  Using cached tokenizers-0.9.4.tar.gz (184 kB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Building wheels for collected packages: tokenizers
  Building wheel for tokenizers (pyproject.toml) ... error
  error: subprocess-exited-with-error
  
  × Building wheel for tokenizers (pyproject.toml) did not run successfully.
  │ exit code: 1
  ╰─> [47 lines of output]
      running bdist_wheel
      running build
      running build_py
      creating build
      creating build/lib.linux-x86_64-cpython-310
      creating build/lib.linux-x86_64-cpython-310/tokenizers
      copying py_src/



### Setting Up For GPU Usage

In [2]:
from torch import cuda
device = torch.device('cuda' if cuda.is_available() else 'cpu')

print(f"Current device: {device}")

Current device: cuda


## Preprocessing and Cleaning Domain Data
*Preprocessing assumes that both CSV files have been downloaded, unzipped, and saved as `data/input/train.csv` and `data/input/test.csv`.*

1. Read the CSV files into dataframes using Pandas.
2. Randomly sample 50% of the training rows (`random_state=42`).
3. Drop the `id` column from the data.
4. Combine the values of the individual toxicity categories into a new list-valued column, `labels`.
5. Drop the original per-category columns.
6. Convert all comment text to lower case.
7. Replace non-breaking spaces with regular spaces and collapse runs of whitespace to a single space.
8. Split the result into training (90%) and validation (10%) sets.
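
The label-combining and text-cleaning steps above can be tried on a small frame before running them on the full data. The following is a minimal sketch: the two rows are made up, and the six category column names are assumptions, since the notebook does not spell them out.

```python
import pandas as pd

# Hypothetical two-row frame; the category column names are assumed, not taken from the notebook
label_cols = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]
toy = pd.DataFrame({
    "id": ["a1", "b2"],
    "comment_text": ["Hello\xa0 WORLD", "You   are  RUDE"],
    "toxic": [0, 1],
    "severe_toxic": [0, 0],
    "obscene": [0, 0],
    "threat": [0, 0],
    "insult": [0, 1],
    "identity_hate": [0, 0],
})

# Drop the ID column, then fold the category columns into one list per row
toy = toy.drop(columns=["id"])
toy["labels"] = toy[label_cols].values.tolist()
toy = toy.drop(columns=label_cols)

# Lower-case, replace non-breaking spaces, and collapse repeated whitespace
toy["comment_text"] = (
    toy["comment_text"]
    .str.lower()
    .str.replace("\xa0", " ", regex=False)
    .str.split()
    .str.join(" ")
)

print(toy)
```

Each row ends up with one cleaned string and one six-element list, which is the shape the dataset class later converts to a float tensor.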

In [None]:
train_path = 'data/input/train.csv'
test_path = 'data/input/test.csv'

df = pd.read_csv(train_path)

df_test = pd.read_csv(test_path)
df = df.sample(frac=0.5, random_state=42)
print(f"Sampled Training Records : {len(df)}")

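# Drop the ID column; comment_text is now column 0, followed by the category columns
df.drop(['id'], inplace=True, axis=1)
# Collect the per-category values of each row into one list-valued 'labels' column
df['labels'] = df.iloc[:, 1:].values.tolist()
# Drop the per-category columns, keeping only comment_text and labels
df.drop(df.columns.values[1:-1].tolist(), inplace=True, axis=1)

# Lower-case the text, replace non-breaking spaces, and collapse repeated whitespace
df["comment_text"] = df["comment_text"].str.lower()
df["comment_text"] = df["comment_text"].str.replace("\xa0", " ", regex=False).str.split().str.join(" ")

# Hold out 10% of the sampled rows for validation
df_train, df_val = train_test_split(df, test_size=0.1)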
df_train = df_train.reset_index(drop=True)
df_val = df_val.reset_index(drop=True)

In [4]:
df.head()

| | comment_text | labels |
|---|---|---|
| 119105 | geez, are you forgetful! we've already discuss... | [0, 0, 0, 0, 0, 0] |
| 131631 | carioca rfa thanks for your support on my requ... | [0, 0, 0, 0, 0, 0] |
| 125326 | " birthday no worries, it's what i do ;)enjoy ... | [0, 0, 0, 0, 0, 0] |
| 111256 | pseudoscience category? i'm assuming that this... | [0, 0, 0, 0, 0, 0] |
| 83590 | (and if such phrase exists, it would be provid... | [0, 0, 0, 0, 0, 0] |


### Training Parameters

In [5]:
training_args = TrainingArguments(
    output_dir='./results',          # output directory for model predictions and checkpoints
    num_train_epochs=3,              # total number of training epochs
    per_device_train_batch_size=8,   # batch size per device during training
    per_device_eval_batch_size=16,   # batch size for evaluation
    warmup_steps=500,                # number of warmup steps for learning rate scheduler
    weight_decay=0.01,               # strength of weight decay
    logging_dir='./logs',            # directory for storing logs
    logging_steps=10,                # log the training loss every 10 steps
)

MAX_LEN = 200
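
`warmup_steps=500` means the learning rate ramps up linearly over the first 500 optimizer steps. With the Trainer's default linear scheduler it then decays linearly to zero by the final step. The sketch below reproduces that shape in plain Python. It assumes the default peak learning rate of 5e-5 (not set explicitly above) and the 26,928 total steps reported in this notebook's training log; `linear_warmup_lr` is a hypothetical helper name, not a library function.

```python
def linear_warmup_lr(step, base_lr=5e-5, warmup_steps=500, total_steps=26928):
    """Learning rate at a given optimizer step: linear warmup, then linear decay to zero."""
    if step < warmup_steps:
        # Ramp from 0 up to base_lr over the warmup phase
        return base_lr * step / max(1, warmup_steps)
    # Decay from base_lr down to 0 over the remaining steps
    remaining = max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))
    return base_lr * remaining

for step in (0, 250, 500, 13714, 26928):
    print(step, linear_warmup_lr(step))
```

The rate peaks exactly at step 500 and reaches zero at the last step, so most of training runs on a slowly shrinking learning rate.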

## Defining the Model

In [6]:
model = DistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased', num_labels=6)

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_transform.bias', 'vocab_layer_norm.weight', 'vocab_projector.weight', 'vocab_layer_norm.bias', 'vocab_transform.weight', 'vocab_projector.bias']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.weight', 'pre_classifier.bias', 'classifier
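
Because each comment carries six independent 0/1 labels stored as floats, this is a multi-label problem. Recent Transformers releases appear to select a binary cross-entropy loss on the logits in that situation; treat this as an assumption about the installed version rather than something the notebook configures. The first logged training loss in this notebook (0.6934) is consistent with it: a freshly initialized classifier head outputs logits near zero, and binary cross-entropy at zero logits is ln 2. A minimal pure-Python sketch of that calculation, where `bce_with_logits` is an illustrative helper:

```python
import math

def bce_with_logits(logits, targets):
    """Mean binary cross-entropy over independent labels, computed from raw logits."""
    total = 0.0
    for z, y in zip(logits, targets):
        # Numerically stable form of -[y*log(sigmoid(z)) + (1-y)*log(1-sigmoid(z))]
        total += max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
    return total / len(logits)

# Zero logits for all six labels, as a newly initialized head would roughly produce
loss_at_init = bce_with_logits([0.0] * 6, [0, 0, 0, 0, 0, 0])
print(round(loss_at_init, 4))  # 0.6931
```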

## Preparing the Dataset and Dataloader

### ToxicCommentsDataset Dataset Class
* This class takes a dataframe, a tokenizer, and a maximum sequence length, and produces the tokenized inputs and label tensors consumed by the DistilBERT model.
* The DistilBERT tokenizer tokenizes the text in the `comment_text` dataframe column; each comment is padded or truncated to `MAX_LEN` tokens.
* The class is used to create two datasets, one for training and one for validation, from the 90/10 train/validation split made during preprocessing.
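
To make the fixed-length output concrete, the sketch below mimics what padding to `max_length` with truncation does to a sequence of token ids. It is an illustration only, not the tokenizer's implementation: `pad_or_truncate` is a hypothetical helper, the word-piece ids are placeholders, and the special-token ids (101, 102, 0) are those of the BERT uncased vocabulary.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # [CLS], [SEP], [PAD] in the BERT uncased vocabulary

def pad_or_truncate(token_ids, max_len):
    """Wrap token ids in [CLS]/[SEP], then truncate or pad to exactly max_len."""
    body = token_ids[: max_len - 2]            # leave room for the two special tokens
    input_ids = [CLS_ID] + body + [SEP_ID]
    attention_mask = [1] * len(input_ids)      # 1 marks real tokens
    n_pad = max_len - len(input_ids)
    input_ids += [PAD_ID] * n_pad
    attention_mask += [0] * n_pad              # 0 marks padding the model should ignore
    return {"input_ids": input_ids, "attention_mask": attention_mask}

short = pad_or_truncate([7592, 2088], max_len=6)             # shorter than max_len: padded
long = pad_or_truncate(list(range(1000, 1010)), max_len=6)   # longer than max_len: truncated
print(short)
print(long)
```

Every item therefore has the same length, which lets the default collator stack items into a batch without extra padding logic.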

In [7]:
class ToxicCommentsDataset(Dataset):
    def __init__(self, dataframe, tokenizer, max_len):
        self.tokenizer = tokenizer
        self.data = dataframe
        self.comment_text = dataframe.comment_text
        self.targets = self.data['labels'].values
        self.max_len = max_len

    def __len__(self):
        return len(self.comment_text)

    def __getitem__(self, index):
        # Positional lookup, so the dataset does not depend on the dataframe's index labels
        comment_text = str(self.comment_text.iloc[index])
        comment_text = " ".join(comment_text.split())

        # Tokenize, then pad or truncate to exactly max_len tokens
        inputs = self.tokenizer.encode_plus(
            comment_text,
            None,
            add_special_tokens=True,
            max_length=self.max_len,
            padding='max_length',
            truncation=True
        )
        ids = inputs['input_ids']
        mask = inputs['attention_mask']

        # Labels are floats: one 0/1 value per toxicity category
        return {
            'input_ids': torch.tensor(ids, dtype=torch.long),
            'attention_mask': torch.tensor(mask, dtype=torch.long),
            'labels': torch.tensor(self.targets[index], dtype=torch.float)
        }

### Loading Tokenizer and Generating Training Set

In [8]:
# Truncation is requested per call inside ToxicCommentsDataset, so it is not passed here
tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased', do_lower_case=True)
training_set = ToxicCommentsDataset(df_train, tokenizer, MAX_LEN)
validation_set = ToxicCommentsDataset(df_val, tokenizer, MAX_LEN)

### Defining the Trainer

In [9]:
trainer = Trainer(
    model=model,                         # the instantiated Transformers model to be trained
    args=training_args,                  # training arguments, defined above
    train_dataset=training_set,          # training dataset
    eval_dataset=validation_set          # evaluation dataset
)
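
As configured, the Trainer reports only the loss when it evaluates. A `compute_metrics` callable can be passed to `Trainer` to score the validation set as well. The following is a hedged sketch of one possible choice for this multi-label setup, not something the notebook itself uses: it applies a sigmoid to the logits, thresholds at 0.5 (an assumed cut-off), and reports micro-averaged F1 and exact-match accuracy with scikit-learn.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

def compute_metrics(eval_pred, threshold=0.5):
    """Multi-label metrics from raw logits; eval_pred unpacks to (logits, labels)."""
    logits, labels = eval_pred
    probs = 1.0 / (1.0 + np.exp(-np.asarray(logits)))   # per-label sigmoid
    preds = (probs >= threshold).astype(int)            # independent decision per label
    labels = np.asarray(labels).astype(int)
    return {
        "micro_f1": f1_score(labels, preds, average="micro", zero_division=0),
        "subset_accuracy": accuracy_score(labels, preds),  # all six labels must match
    }

# Tiny made-up batch: two comments, six labels each
example_logits = np.array([[3.0, -2.0, -2.0, -2.0, 1.0, -2.0],
                           [-3.0, -2.0, -2.0, -2.0, -1.0, -2.0]])
example_labels = np.array([[1, 0, 0, 0, 1, 0],
                           [0, 0, 0, 0, 1, 0]], dtype=float)
example_scores = compute_metrics((example_logits, example_labels))
print(example_scores)
```

To use it, pass `compute_metrics=compute_metrics` when constructing the `Trainer` and call `trainer.evaluate()` after training. Micro-F1 is chosen here because most comments have all-zero labels, which makes plain accuracy easy to inflate.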

### Training the Model

In [10]:
trainer.train()

***** Running training *****
  Num examples = 71807
  Num Epochs = 3
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 26928


| Step | Training Loss |
|---|---|
| 10 | 0.6934 |
| 20 | 0.6854 |
| 30 | 0.6732 |
| 40 | 0.6496 |
| 50 | 0.6039 |
| 60 | 0.5397 |
| 70 | 0.4919 |
| 80 | 0.4072 |
| 90 | 0.3645 |
| 100 | 0.2777 |


Saving model checkpoint to ./results/checkpoint-500
Configuration saved in ./results/checkpoint-500/config.json
Model weights saved in ./results/checkpoint-500/pytorch_model.bin
Saving model checkpoint to ./results/checkpoint-1000
Configuration saved in ./results/checkpoint-1000/config.json
Model weights saved in ./results/checkpoint-1000/pytorch_model.bin
Saving model checkpoint to ./results/checkpoint-1500
Configuration saved in ./results/checkpoint-1500/config.json
Model weights saved in ./results/checkpoint-1500/pytorch_model.bin
Saving model checkpoint to ./results/checkpoint-2000
Configuration saved in ./results/checkpoint-2000/config.json
Model weights saved in ./results/checkpoint-2000/pytorch_model.bin
Saving model checkpoint to ./results/checkpoint-2500
Configuration saved in ./results/checkpoint-2500/config.json
Model weights saved in ./results/checkpoint-2500/pytorch_model.bin
Saving model checkpoint to ./results/checkpoint-3000
Configuration saved in ./results/checkpoint-3

TrainOutput(global_step=26928, training_loss=0.039592880258459515, metrics={'train_runtime': 7528.952, 'train_samples_per_second': 28.612, 'train_steps_per_second': 3.577, 'total_flos': 1.11477715236648e+16, 'train_loss': 0.039592880258459515, 'epoch': 3.0})
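
The step count in the log above can be checked by hand from the number of examples, the batch size, and the number of epochs. A short sketch of that arithmetic, assuming a single device and that the final partial batch of each epoch is kept (both consistent with the logged total batch size of 8):

```python
import math

num_examples = 71807   # "Num examples" from the training log
batch_size = 8         # per_device_train_batch_size on one device
num_epochs = 3

# The last batch of each epoch is smaller than 8 but still counts as one step
steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs
print(steps_per_epoch, total_steps)  # 8976 26928
```

This matches both `Total optimization steps = 26928` and `global_step=26928` in the output.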

### Saving the Model

In [11]:
model.save_pretrained("./toxic_comment_model")
tokenizer.save_pretrained("./toxic_comment_model")

Configuration saved in ./toxic_comment_model/config.json
Model weights saved in ./toxic_comment_model/pytorch_model.bin
tokenizer config file saved in ./toxic_comment_model/tokenizer_config.json
Special tokens file saved in ./toxic_comment_model/special_tokens_map.json


('./toxic_comment_model/tokenizer_config.json',
 './toxic_comment_model/special_tokens_map.json',
 './toxic_comment_model/vocab.txt',
 './toxic_comment_model/added_tokens.json',
 './toxic_comment_model/tokenizer.json')