## 1 Importing the Pre-Processed Dataset

The pre-processed dataset was split into four files, which we now load back from disk:

- X_train (training variables of the dataset)
- X_val (validation variables of the dataset)
- y_train (training labels of the dataset)
- y_val (validation labels of the dataset)

In [2]:
import pandas as pd

In [3]:
X_train = pd.read_csv('X_train.csv')
X_val = pd.read_csv('X_val.csv')
y_train = pd.read_csv('y_train.csv')
y_val = pd.read_csv('y_val.csv')

We only want to classify each sample as `True` or `False` from its text, so we select the "cleaned_text" column from X_train and X_val and the "target" column from y_train and y_val.

In [17]:
train_text = X_train['cleaned_text'].to_list()
train_label = y_train['target'].to_list()
val_text = X_val['cleaned_text'].to_list()
val_label = y_val['target'].to_list()

In [33]:
train_text[:5]

['  jimmyfallon crush squirrel bone mortar pestl school  bio dept  realli sure whi worstsummerjob',
 ' mccainenl think spectacular look stonewal riot obliter white house ',
 'can t bloodi wait   soni set date stephen king       the dark tower    stephenk thedarktow    bdisgust',
 'protest ralli stone mountain  atleast they r burn build loot store like individu  protest ',
 ' rbcinsur quot websit   disaster  tri 3 browser  amp  3 machines  alway get  miss info  error due non exist drop down ']

In [34]:
train_label[:5]

[0, 1, 0, 0, 0]
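
Before tokenizing, it can be worth confirming that every text has exactly one label and looking at how the two classes are distributed. The sketch below runs on short stand-in lists (`sample_text` and `sample_label` are hypothetical, not rows from the dataset); in the notebook the same checks would be applied to `train_text` and `train_label`.

```python
from collections import Counter

# Hypothetical stand-in data; use train_text / train_label in the notebook.
sample_text = [
    "flood warn issu river",
    "love new song",
    "wildfir spread near town",
    "great day beach",
]
sample_label = [1, 0, 1, 0]

# Every text must have exactly one label.
assert len(sample_text) == len(sample_label)

# Count how many examples fall into each class.
label_counts = Counter(sample_label)
print(label_counts)
```

A strongly skewed count would be a reason to look at precision and recall rather than accuracy alone.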

## 2 Tokenizing the Text with the Pre-Trained DistilBERT Tokenizer

In this notebook, we use the tokenizer of the pre-trained DistilBERT model to tokenize the text from the dataset.

We first download the pre-trained tokenizer:

In [10]:
from transformers import DistilBertTokenizerFast
tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')


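Apply the tokenizer to the training and validation text. With `truncation=True`, inputs longer than the model's maximum length are cut off, and with `padding=True`, shorter inputs are padded to the longest sequence in the call.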

In [11]:
train_encodings = tokenizer(train_text, truncation=True, padding=True)
val_encodings = tokenizer(val_text, truncation=True, padding=True)

In [52]:
train_encodings[0]

Encoding(num_tokens=53, attributes=[ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, overflowing])
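
To make the effect of padding concrete, here is a simplified, dependency-free illustration of what padding to the longest sequence produces. It is not the tokenizer's actual implementation; `pad_batch` and `toy_batch` are hypothetical names, and the token ids are made up apart from 101, 102 and 0, which appear as the start, end and padding ids in the output shown in the next section.

```python
def pad_batch(batch_ids, pad_id=0):
    """Pad every sequence to the longest one and build matching attention masks."""
    max_len = max(len(ids) for ids in batch_ids)
    input_ids, attention_mask = [], []
    for ids in batch_ids:
        n_pad = max_len - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)             # real ids, then padding
        attention_mask.append([1] * len(ids) + [0] * n_pad)  # 1 = real token, 0 = padding
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Hypothetical token ids for two short texts of different length.
toy_batch = [[101, 2228, 102], [101, 2317, 2160, 2298, 102]]
padded = pad_batch(toy_batch)
print(padded["input_ids"])
print(padded["attention_mask"])
```

The attention mask is what lets the model ignore the padded positions, which is why every encoding here has the same length (53 tokens) regardless of how short the original text was.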

## 3 Defining the Dataset Class

Before feeding the data into the neural network, we define a Dataset class and instantiate it to hold the encodings and labels.

In [26]:
import torch

class DatasetClass(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)

train_dataset = DatasetClass(train_encodings, train_label)
val_dataset = DatasetClass(val_encodings, val_label)

In [27]:
train_dataset[1]

{'input_ids': tensor([  101, 19186,  2368,  2140,  2228, 12656,  2298,  2962, 13476, 11421,
         27885, 22779,  2099,  2317,  2160,   102,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0]),
 'attention_mask': tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0]),
 'labels': tensor(1)}
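
The key step in `__getitem__` is turning column-wise encodings (one list per field, covering all samples) into a single per-sample dictionary. The sketch below shows that indexing pattern with plain Python lists instead of tensors; `toy_encodings`, `toy_labels` and `get_item` are hypothetical names used only for illustration.

```python
# Hypothetical column-wise encodings for three samples, already padded to length 4.
toy_encodings = {
    "input_ids": [[101, 7, 102, 0], [101, 8, 9, 102], [101, 102, 0, 0]],
    "attention_mask": [[1, 1, 1, 0], [1, 1, 1, 1], [1, 1, 0, 0]],
}
toy_labels = [0, 1, 0]

def get_item(encodings, labels, idx):
    """Collect every field of sample `idx` into one dict (no tensor conversion)."""
    item = {key: values[idx] for key, values in encodings.items()}
    item["labels"] = labels[idx]
    return item

sample = get_item(toy_encodings, toy_labels, 1)
print(sample)
```

The real class does the same lookup and additionally wraps each value in `torch.tensor`, which is the form the Trainer expects.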

## 4 Defining Evaluation Metrics

In [29]:
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def evaluation_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average='binary')
    acc = accuracy_score(labels, preds)
    return {
        'accuracy': acc,
        'f1': f1,
        'precision': precision,
        'recall': recall
    }
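
As a rough illustration of what these metrics measure, the following dependency-free sketch recomputes them by hand for the positive class (label 1). It is not a replacement for the scikit-learn calls above; `binary_metrics`, `toy_logits` and `metric_labels` are hypothetical names, and the logits are made up.

```python
def binary_metrics(logits, labels):
    """Accuracy, precision, recall and F1 for the positive class (label 1)."""
    # argmax over the class scores of each example
    preds = [max(range(len(row)), key=row.__getitem__) for row in logits]
    tp = sum(p == 1 and y == 1 for p, y in zip(preds, labels))  # true positives
    fp = sum(p == 1 and y == 0 for p, y in zip(preds, labels))  # false positives
    fn = sum(p == 0 and y == 1 for p, y in zip(preds, labels))  # false negatives
    correct = sum(p == y for p, y in zip(preds, labels))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": correct / len(labels),
        "f1": f1,
        "precision": precision,
        "recall": recall,
    }

# Made-up logits for four examples; each row is [score for class 0, score for class 1].
toy_logits = [[2.0, -1.0], [0.1, 0.9], [-0.5, 1.5], [1.2, 0.3]]
metric_labels = [0, 1, 0, 1]
toy_scores = binary_metrics(toy_logits, metric_labels)
print(toy_scores)
```

Here the predictions are `[0, 1, 1, 0]`, giving one true positive, one false positive and one false negative.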

## 5 Fine-tuning with Huggingface Trainer

The Hugging Face Trainer wraps most of the training loop, so we only need to define the following arguments before training:

- **args** -> TrainingArguments (holds the hyperparameters; most training-related settings are configured here)
- **model** -> Model (a model that inherits from *transformers.PreTrainedModel* or *torch.nn.Module*; according to the official documentation, Trainer is optimised for transformers.PreTrainedModel)
- **compute_metrics** -> Evaluation Metrics (defines how the results of the fine-tuned model are evaluated)
- **train_dataset** -> Train Dataset
- **eval_dataset** -> Validation Dataset

In [28]:
from torch import cuda
device = 'cuda' if cuda.is_available() else 'cpu'
device

'cuda'

In [30]:
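import os
from transformers import DistilBertForSequenceClassification, Trainer, TrainingArguments

# Disable Weights & Biases logging. This environment variable is deprecated
# (see the warning in the output); newer versions recommend report_to instead.
os.environ["WANDB_DISABLED"] = "true"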

training_args = TrainingArguments(
    output_dir='./DistilBert-results',  # output directory
    num_train_epochs=3,                 # total number of training epochs
    per_device_train_batch_size=16,     # batch size per device during training
    per_device_eval_batch_size=64,      # batch size for evaluation
    warmup_steps=500,                   # number of warmup steps for learning rate scheduler
    weight_decay=0.01,                  # strength of weight decay
    logging_dir='./logs',               # directory for storing logs
    logging_steps=10,                   # log the training loss every 10 steps
)

model = DistilBertForSequenceClassification.from_pretrained("distilbert-base-uncased")

trainer = Trainer(
    model=model,                         # the instantiated pre-trained model to be trained
    args=training_args,                  # training arguments, defined above
    compute_metrics=evaluation_metrics,  # metric function, defined in section 4
    train_dataset=train_dataset,         # training dataset
    eval_dataset=val_dataset,            # validation dataset
)

Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).


Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_projector.bias', 'vocab_transform.weight', 'vocab_transform.bias', 'vocab_projector.weight', 'vocab_layer_norm.weight', 'vocab_layer_norm.bias']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.weight', 'pre_classifier.bias', 'pre_classi
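
To see what `warmup_steps=500` means in practice, the sketch below computes the learning rate under linear warmup followed by linear decay. Two things are assumptions rather than settings from this notebook: the base learning rate of 5e-5 and the linear schedule, which are taken to be the library defaults because the notebook does not set them. The total of 1143 steps comes from the training log. `linear_warmup_lr` is a hypothetical helper, not a library function.

```python
def linear_warmup_lr(step, base_lr=5e-5, warmup_steps=500, total_steps=1143):
    """Learning rate at an optimizer step: linear warmup, then linear decay to zero."""
    if step < warmup_steps:
        # ramp up from 0 to base_lr over the warmup phase
        return base_lr * step / warmup_steps
    # decay linearly from base_lr to 0 over the remaining steps
    return base_lr * max(0.0, (total_steps - step) / (total_steps - warmup_steps))

for step in (0, 250, 500, 1143):
    print(step, linear_warmup_lr(step))
```

Under these assumptions the learning rate peaks at step 500, so a little under half of this short run is spent warming up.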

In [31]:
trainer.train()

***** Running training *****
  Num examples = 6090
  Num Epochs = 3
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 1143
  Number of trainable parameters = 66955010


Saving model checkpoint to ./results\checkpoint-500
Configuration saved in ./results\checkpoint-500\config.json
Model weights saved in ./results\checkpoint-500\pytorch_model.bin
Saving model checkpoint to ./results\checkpoint-1000
Configuration saved in ./results\checkpoint-1000\config.json
Model weights saved in ./results\checkpoint-1000\pytorch_model.bin


Training completed. Do not forget to share your model on huggingface.co/models =)




TrainOutput(global_step=1143, training_loss=0.39357276679336867, metrics={'train_runtime': 230.8184, 'train_samples_per_second': 79.153, 'train_steps_per_second': 4.952, 'total_flos': 250526380454280.0, 'train_loss': 0.39357276679336867, 'epoch': 3.0})
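
The step count in the log can be checked by hand from the number of examples, the batch size and the number of epochs. The short calculation below assumes a single device and no gradient accumulation, as the log indicates.

```python
import math

num_examples = 6090   # "Num examples" in the training log
batch_size = 16       # per_device_train_batch_size on one device
num_epochs = 3        # num_train_epochs

# The last batch of each epoch may be smaller, so round up.
steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs
print(steps_per_epoch, total_steps)
```

The result matches the `Total optimization steps` and `global_step` values reported above.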

## 6 Results of Evaluating the Validation Dataset

In [32]:
trainer.evaluate()

***** Running Evaluation *****
  Num examples = 1523
  Batch size = 64


{'eval_loss': 0.5285603404045105,
 'eval_accuracy': 0.8063033486539725,
 'eval_f1': 0.7619047619047619,
 'eval_precision': 0.7866666666666666,
 'eval_recall': 0.7386541471048513,
 'eval_runtime': 4.3464,
 'eval_samples_per_second': 350.403,
 'eval_steps_per_second': 5.522,
 'epoch': 3.0}
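
As a quick consistency check on these numbers, F1 is the harmonic mean of precision and recall, so it can be recomputed from the two reported values:

```python
precision = 0.7866666666666666   # eval_precision reported above
recall = 0.7386541471048513      # eval_recall reported above

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 4))
```

The recomputed value agrees with the reported `eval_f1`. Recall is lower than precision here, meaning the model misses more positive examples than it raises false alarms.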