<a href="https://colab.research.google.com/github/mtzig/NLP_CTF/blob/main/PyTorchDemo/Demonstration.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### A Demonstration of Training and Evaluating a Toxicity Classifier
Thomas Zeng
9/20/22

## Colab setup

This section is only pertinent if the notebook is run in Colab and not on a local machine.

If you're using Colab, run the code below to clone the repo.

In [1]:
!git clone https://github.com/mtzig/NLP_CTF.git
%cd /content/NLP_CTF/PyTorchDemo

Cloning into 'NLP_CTF'...
remote: Enumerating objects: 27, done.
remote: Counting objects: 100% (8/8), done.
remote: Compressing objects: 100% (6/6), done.
remote: Total 27 (delta 1), reused 7 (delta 1), pack-reused 19
Unpacking objects: 100% (27/27), done.
/content/NLP_CTF/PyTorchDemo


If you are on Colab and using a GPU instance, the command below shows the GPU that Google allocated to this session.

In [3]:
!nvidia-smi

Wed Sep 21 02:34:52 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   61C    P8    12W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

Colab does not come with the Python library `transformers` (used in the code below) preinstalled, so we need to install it manually whenever we start up an instance.

In [6]:
!pip install transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.22.1-py3-none-any.whl (4.9 MB)
     |████████████████████████████████| 4.9 MB 7.2 MB/s 
Collecting tokenizers!=0.11.3,<0.13,>=0.11.1
  Downloading tokenizers-0.12.1-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (6.6 MB)
     |████████████████████████████████| 6.6 MB 51.4 MB/s 
Collecting huggingface-hub<1.0,>=0.9.0
  Downloading huggingface_hub-0.9.1-py3-none-any.whl (120 kB)
     |████████████████████████████████| 120 kB 59.3 MB/s 
Installing collected packages: tokenizers, huggingface-hub, transformers
Successfully installed huggingface-hub-0.9.1 tokenizers-0.12.1 transformers-4.22.1


## Notebook Setup

The autoreload extension is used so that every time you modify Python files imported by the notebook, you don't have to restart the notebook.

In [4]:
%load_ext autoreload
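# Mode 2 reloads every imported module before each cell is run.
# The comment sits on its own line because line magics receive the rest of the line, comment included, as their argument.
%autoreload 2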

In [7]:
import torch
from utils import load_jigsaw
from dataloaders import InfiniteDataLoader
from models import BertClassifier
from train_eval import train, evaluate

The cache for model files in Transformers v4.22.0 has been updated. Migrating your old cache. This is a one-time only operation. You can interrupt this and resume the migration later on by calling `transformers.utils.move_cache()`.


Moving 0 files to the new cache system


0it [00:00, ?it/s]

PyTorch can run its computations on three kinds of devices:

1. CUDA: if you have an Nvidia GPU and a CUDA-enabled version of PyTorch installed (fastest speed)
2. Metal: if you have an Apple device with Apple Silicon (M1 or M2). This implementation of PyTorch is new, not feature complete, and rather buggy (medium speed)
3. CPU: this is just the CPU (slowest speed)

In [8]:
if torch.cuda.is_available():
    print('Using GPU')
    DEVICE = torch.device('cuda')
elif torch.backends.mps.is_available() and torch.backends.mps.is_built():
    # macbooks can use metal if the right version of pytorch is installed
    print('Using Metal')
    DEVICE = torch.device('mps')
else:
    print('Using cpu')
    DEVICE = torch.device('cpu')

Using GPU


## Data Initialization

PyTorch requires its datasets to be accessible through the [Dataset API](https://pytorch.org/tutorials/beginner/basics/data_tutorial.html#creating-a-custom-dataset-for-your-files).

I wrote a simple function, `load_jigsaw`, to load the [Jigsaw Dataset](https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge) that the paper [Counterfactual Fairness in Text Classification through Robustness](https://dl.acm.org/doi/pdf/10.1145/3306618.3317950) used to train its toxicity classifier.
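
As a minimal sketch of that API: a map-style dataset only has to report its length and return one sample per index. The class below is hypothetical (it is not what `load_jigsaw` returns), its two example strings and the 0/1 label encoding are made up, and it is written in plain Python so it runs without PyTorch. In real code it would typically subclass `torch.utils.data.Dataset`.

```python
class ToyCommentDataset:
    """Hypothetical map-style dataset: indexable, with a known length."""

    def __init__(self, texts, labels):
        if len(texts) != len(labels):
            raise ValueError("texts and labels must have the same length")
        self.texts = list(texts)
        self.labels = list(labels)

    def __len__(self):
        # Number of samples in the dataset.
        return len(self.texts)

    def __getitem__(self, idx):
        # Return one (input, label) pair.
        # Assumed encoding for this sketch: 1 = toxic, 0 = non-toxic.
        return self.texts[idx], self.labels[idx]


# Made-up samples, for illustration only.
toy_data = ToyCommentDataset(["have a nice day", "you are an idiot"], [0, 1])
print(len(toy_data), toy_data[1])
```

A real dataset for this task would also tokenize each comment in `__getitem__` (or ahead of time) so that the model receives tensors rather than strings.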

In [9]:
# demo_mode=True loads only a subset of the dataset so that training is faster for demonstration purposes
train_data, test_data = load_jigsaw(device=DEVICE, demo_mode=True)

Downloading:   0%|          | 0.00/232k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/28.0 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/483 [00:00<?, ?B/s]

100%|██████████| 10000/10000 [00:18<00:00, 548.61it/s]
100%|██████████| 1000/1000 [00:01<00:00, 567.62it/s]


PyTorch models receive data for training and inference through a dataloader. A dataloader samples from a dataset and returns one batch of samples each time it is iterated.

In [10]:
train_loader = InfiniteDataLoader(train_data, batch_size=16)
test_loader = InfiniteDataLoader(test_data, batch_size=32)
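
To make the idea of batching concrete, here is a rough plain-Python sketch. It is not how the repo's `InfiniteDataLoader` is implemented (that class is not shown here); `iter_batches` and `batches_per_epoch` are hypothetical helpers, and the sketch does no shuffling. The dataset sizes 10,000 and 1,000 are taken from the demo-mode loading output above.

```python
import math


def iter_batches(dataset, batch_size):
    """Yield consecutive lists of at most `batch_size` samples."""
    for start in range(0, len(dataset), batch_size):
        stop = min(start + batch_size, len(dataset))
        yield [dataset[i] for i in range(start, stop)]


def batches_per_epoch(num_samples, batch_size):
    # The last batch may be smaller than the rest, hence the ceiling.
    return math.ceil(num_samples / batch_size)


samples = list(range(10))
batches = list(iter_batches(samples, batch_size=4))
print(batches)  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]

# Batch counts for the demo-mode train and test sets with the batch sizes used above.
print(batches_per_epoch(10_000, 16), batches_per_epoch(1_000, 32))  # 625 32
```

These counts line up with the 625 training steps and 32 evaluation steps reported by the progress bars in the training and evaluation sections.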

## Model and Training Initialization

For our model, we use transfer learning, i.e., we start from a pre-trained model, DistilBERT, as implemented by HuggingFace. Our `BertClassifier` is a simple wrapper around their model.

In [11]:
model = BertClassifier(device=DEVICE)

Downloading:   0%|          | 0.00/268M [00:00<?, ?B/s]

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_layer_norm.bias', 'vocab_projector.bias', 'vocab_projector.weight', 'vocab_transform.bias', 'vocab_transform.weight', 'vocab_layer_norm.weight']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'pre_classifier.bias', 'classifier.w

An epoch is one full pass through your dataset during training. That is, you have trained for 1 epoch when you have seen every sample in your training dataset once.<br>
The loss function is the training objective we want our model to minimize.<br>
The optimizer is used at every time step, i.e., every time we compute the loss and its gradient, to update the model weights.

In [14]:
epochs = 1
loss_fn = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.AdamW(model.parameters(), lr=0.001)
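
To illustrate what the loss and the optimizer do, the sketch below computes the cross-entropy of one two-class prediction by hand and applies a single update. It rests on simplifying assumptions: it uses plain gradient descent rather than AdamW, it treats the two logits themselves as the trainable parameters, and the helper names and the step size of 0.5 are made up for this example.

```python
import math


def softmax(logits):
    # Subtract the max for numerical stability.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target):
    """Negative log-probability assigned to the correct class."""
    return -math.log(softmax(logits)[target])


def gradient_step(logits, target, lr):
    # For softmax + cross-entropy, d(loss)/d(logit_k) = p_k - 1[k == target].
    probs = softmax(logits)
    grads = [p - (1.0 if k == target else 0.0) for k, p in enumerate(probs)]
    # Move each parameter a small step against its gradient.
    return [z - lr * g for z, g in zip(logits, grads)]


logits = [0.0, 0.0]  # an undecided two-class prediction
before = cross_entropy(logits, target=1)
after = cross_entropy(gradient_step(logits, target=1, lr=0.5), target=1)
print(before, after)
```

The loss before the update is ln 2 ≈ 0.693, the value any two-class model gets when it assigns equal probability to both classes, and it is lower after the update.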

## Train and Evaluation

We train for 1 epoch.

In [15]:
for epoch in range(epochs):
    print(f'Epoch {epoch+1}/{epochs}')
    train(train_loader, model, loss_fn, optimizer, verbose=True, use_tqdm=True)

Epoch 1/1


100%|██████████| 625/625 [04:22<00:00,  2.39it/s]

Average training loss: 0.7019077969551086





We evaluate our results on the test data.

In [16]:
_ = evaluate(test_loader, model, get_loss=True, verbose=True)

100%|██████████| 32/32 [00:08<00:00,  3.59it/s]

Loss: 0.6939319968223572
Accuracy: 0.5, Sensitivity: 1.0, Specificity: 0.0
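
For reference, the three metrics can be sketched with their standard definitions as follows. This is an illustration, not the repo's `evaluate` function, which may compute them differently; `binary_metrics` is a hypothetical helper, the labels are made up, and which class counts as "positive" is an assumption.

```python
def binary_metrics(preds, labels):
    """Return (accuracy, sensitivity, specificity) for 0/1 predictions."""
    pairs = list(zip(preds, labels))
    tp = sum(p == 1 and y == 1 for p, y in pairs)  # true positives
    tn = sum(p == 0 and y == 0 for p, y in pairs)  # true negatives
    fp = sum(p == 1 and y == 0 for p, y in pairs)  # false positives
    fn = sum(p == 0 and y == 1 for p, y in pairs)  # false negatives
    accuracy = (tp + tn) / len(pairs)
    # Sensitivity: fraction of actual positives that were predicted positive.
    sensitivity = tp / (tp + fn) if tp + fn else float("nan")
    # Specificity: fraction of actual negatives that were predicted negative.
    specificity = tn / (tn + fp) if tn + fp else float("nan")
    return accuracy, sensitivity, specificity


labels = [1, 0, 1, 0]  # made-up, balanced labels (1 = positive class)
preds = [1, 1, 1, 1]   # a classifier that always predicts the positive class
print(binary_metrics(preds, labels))  # (0.5, 1.0, 0.0)
```

On balanced labels, a classifier that always predicts the positive class reproduces the pattern above: accuracy 0.5, sensitivity 1.0, specificity 0.0.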





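Our results are poor: an accuracy of 0.5 with sensitivity 1.0 and specificity 0.0 means the model assigns every test sample to the positive class. Normally this would suggest we need hyperparameter tuning (for example, of the learning rate). <br>
In this case, however, it is most likely because we used only a very small subset of the dataset and did not train for very long.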