<a href="https://colab.research.google.com/github/mtzig/NLP_CTF/blob/main/baseline.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Baseline Toxicity Classifier
Thomas Zeng
9/27/22

## Colab setup

This section is only relevant if the notebook is run in Colab rather than on a local machine.

If you're using Colab, run the code below to clone the repo.

In [1]:
!git clone https://github.com/mtzig/NLP_CTF.git
%cd /content/NLP_CTF/

fatal: destination path 'NLP_CTF' already exists and is not an empty directory.
/content/NLP_CTF


Download the pretrained Word2Vec embeddings:

In [2]:
%cd ./data
!wget -O 'GoogleNews-vectors-negative300.bin.gz' 'https://drive.google.com/u/0/uc?id=0B7XkCwpI5KDYNlNUTTlSS21pQmM&export=download&confirm=t&uuid=e1f49911-ab4d-44ba-af00-f6733ccabb98'
!gzip -d 'GoogleNews-vectors-negative300.bin.gz'
%cd ..

/content/NLP_CTF/data
--2022-09-27 16:06:40--  https://drive.google.com/u/0/uc?id=0B7XkCwpI5KDYNlNUTTlSS21pQmM&export=download&confirm=t&uuid=e1f49911-ab4d-44ba-af00-f6733ccabb98
Resolving drive.google.com (drive.google.com)... 64.233.182.139, 64.233.182.113, 64.233.182.138, ...
Connecting to drive.google.com (drive.google.com)|64.233.182.139|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://drive.google.com/uc?id=0B7XkCwpI5KDYNlNUTTlSS21pQmM&export=download&confirm=t&uuid=e1f49911-ab4d-44ba-af00-f6733ccabb98 [following]
--2022-09-27 16:06:40--  https://drive.google.com/uc?id=0B7XkCwpI5KDYNlNUTTlSS21pQmM&export=download&confirm=t&uuid=e1f49911-ab4d-44ba-af00-f6733ccabb98
Reusing existing connection to drive.google.com:443.
HTTP request sent, awaiting response... 303 See Other
Location: https://doc-0g-8s-docs.googleusercontent.com/docs/securesc/ha0ro937gcuc7l7deffksulhg5h7mbp1/3k9m6ebft92d5iino7sl77bm340qjjmg/1664294775000/06848720943842814915/*/0B7X

Colab does not come with the Python library `transformers` preinstalled (the code below uses it), so we need to install it manually each time we start a new instance.

In [5]:
!pip install transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.22.2-py3-none-any.whl (4.9 MB)
Collecting huggingface-hub<1.0,>=0.9.0
  Downloading huggingface_hub-0.9.1-py3-none-any.whl (120 kB)
Collecting tokenizers!=0.11.3,<0.13,>=0.11.1
  Downloading tokenizers-0.12.1-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (6.6 MB)
Installing collected packages: tokenizers, huggingface-hub, transformers
Successfully installed huggingface-hub-0.9.1 tokenizers-0.12.1 transformers-4.22.2


## Notebook Setup

In [6]:
%load_ext autoreload
%autoreload 2

In [7]:
import torch
from getData import get_jigsaw_datasets, init_embed_lookup
from dataloaders import InfiniteDataLoader
from models import CNNClassifier
from train_eval import train, evaluate

PyTorch can run its computations on one of three devices:

1. CUDA: if you have an NVIDIA GPU and a CUDA-enabled build of PyTorch installed (fastest speed)
2. Metal: if you have an Apple device with Apple Silicon (M1 or M2). This backend (`mps`) is new, not feature complete, and rather buggy (medium speed)
3. CPU: the fallback when no GPU backend is available (slowest speed)

In [8]:
if torch.cuda.is_available():
    print('Using GPU')
    DEVICE = torch.device('cuda')
elif torch.backends.mps.is_available() and torch.backends.mps.is_built():
    # macbooks can use metal if the right version of pytorch is installed
    print('Using Metal')
    DEVICE = torch.device('mps')
else:
    print('Using cpu')
    DEVICE = torch.device('cpu')

Using GPU


## Data Initialization

PyTorch requires datasets to be accessible through the [datasets API](https://pytorch.org/tutorials/beginner/basics/data_tutorial.html#creating-a-custom-dataset-for-your-files).
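
As a rough illustration of what that API asks for: a map-style dataset only needs to implement `__len__` and `__getitem__`. The sketch below is plain Python so that it runs without PyTorch installed. `ToyCommentDataset` and its sample data are hypothetical and are not the repo's `get_jigsaw_datasets` implementation; in real code the class would normally subclass `torch.utils.data.Dataset` and return tensors.

```python
class ToyCommentDataset:
    """Hypothetical map-style dataset: it is sized and indexable."""

    def __init__(self, comments, labels):
        if len(comments) != len(labels):
            raise ValueError("comments and labels must have the same length")
        self.comments = list(comments)
        self.labels = list(labels)

    def __len__(self):
        # Number of samples in the dataset.
        return len(self.comments)

    def __getitem__(self, idx):
        # Return one (input, label) pair; 1 = toxic, 0 = nontoxic.
        return self.comments[idx], self.labels[idx]


toy_data = ToyCommentDataset(
    ["thanks for the help", "you are an idiot", "nice edit"],
    [0, 1, 0],
)
print(len(toy_data))  # 3
print(toy_data[1])    # ('you are an idiot', 1)
```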

I wrote a simple function, `get_jigsaw_datasets`, to load the [Jigsaw Dataset](https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge) that the paper [Counterfactual Fairness in Text Classification through Robustness](https://dl.acm.org/doi/pdf/10.1145/3306618.3317950) used to train its toxicity classifier.

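For demonstration purposes, demo mode (`demo_mode=True`) uses only a very small subset of the available data: 256 comments (128 toxic and 128 nontoxic), sampled randomly, for the train set and likewise for the test set. Note that the call below does not pass `demo_mode=True`, and the training output further down shows 2494 batches of 64 per epoch, so the recorded run used far more data than this demo subset.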

In [9]:
# Passing demo_mode=True loads only a subset of the dataset, which makes training faster for demonstration purposes.
train_data, test_data = get_jigsaw_datasets(device=DEVICE)

PyTorch models receive data for training and inference through a dataloader. A dataloader samples from a dataset and returns a batch of samples each time it is called.
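
The name `InfiniteDataLoader` suggests a loader that never runs out of batches. Its actual implementation lives in the repo's `dataloaders` module and is not shown in this notebook, so the following is only a plain-Python sketch of the general idea, under the assumption that such a loader reshuffles and starts over whenever the dataset is exhausted. The function name `infinite_batches` is hypothetical.

```python
import itertools
import random


def infinite_batches(dataset, batch_size, seed=0):
    """Yield batches forever, reshuffling the sample order on every pass."""
    rng = random.Random(seed)
    indices = list(range(len(dataset)))
    while True:
        rng.shuffle(indices)
        for start in range(0, len(indices), batch_size):
            yield [dataset[i] for i in indices[start:start + batch_size]]


samples = list(range(10))  # stand-in for (comment, label) pairs
# Take 5 batches: more than one pass over 10 samples at batch size 4.
batches = list(itertools.islice(infinite_batches(samples, batch_size=4), 5))
print([len(b) for b in batches])  # [4, 4, 2, 4, 4]
```

Because the generator never stops on its own, the caller decides how many batches make up one "epoch", for example by slicing it as above.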

In [10]:
train_loader = InfiniteDataLoader(train_data, batch_size=64)
test_loader = InfiniteDataLoader(test_data, batch_size=64)

## Model and Training Stuff Initialization

In [11]:
pretrained_embed = torch.from_numpy(init_embed_lookup().vectors)


In [12]:

model = CNNClassifier(pretrained_embed, device=DEVICE)

An epoch is one full pass through the training dataset: you have trained for 1 epoch once the model has seen every sample in the training dataset once.<br>
The loss function is the training objective that we want the model to minimize.<br>
The optimizer is applied at every training step, i.e. every time we compute the loss and its gradient, to update the model weights.
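
To make these three terms concrete, here is a minimal sketch in plain Python rather than PyTorch. It fits a one-feature logistic regression with ordinary mini-batch gradient descent, which is swapped in for AdamW purely to keep the example short. The data, learning rate, and function names are made up for illustration and have nothing to do with the CNN above.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def binary_cross_entropy(p, y):
    # Loss for one sample: small when the predicted probability p agrees with label y.
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


# Toy data: negative x belongs to class 0, positive x to class 1.
xs = [-2.0, -1.0, 1.0, 2.0]
ys = [0, 0, 1, 1]
w, b, lr, batch_size = 0.0, 0.0, 0.1, 2

history = []
for epoch in range(20):  # one epoch = one full pass over xs
    epoch_loss = 0.0
    for start in range(0, len(xs), batch_size):
        bx = xs[start:start + batch_size]
        by = ys[start:start + batch_size]
        ps = [sigmoid(w * x + b) for x in bx]
        epoch_loss += sum(binary_cross_entropy(p, y) for p, y in zip(ps, by))
        # Gradients of the mean batch loss with respect to w and b.
        grad_w = sum((p - y) * x for p, y, x in zip(ps, by, bx)) / len(bx)
        grad_b = sum(p - y for p, y in zip(ps, by)) / len(bx)
        # Optimizer step: move the weights against the gradient.
        w -= lr * grad_w
        b -= lr * grad_b
    history.append(epoch_loss / len(xs))

print(f"first epoch loss {history[0]:.3f}, last epoch loss {history[-1]:.3f}")
```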

In [13]:
epochs = 20
loss_fn = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.AdamW(model.parameters())

## Train and Evaluation

For training, we run up to 20 epochs, the value of `epochs` set above. (The run recorded below was interrupted manually during epoch 6.) <br>
In general, you should (or, more precisely, are required to) train and evaluate on different datasets.

In [14]:
for epoch in range(epochs):
    print(f'Epoch {epoch+1}/{epochs}')
    train(train_loader, model, loss_fn, optimizer, verbose=True, use_tqdm=True)

Epoch 1/20


100%|██████████| 2494/2494 [00:39<00:00, 62.89it/s]


Average training loss: 0.32278019489471016
Epoch 2/20


100%|██████████| 2494/2494 [00:32<00:00, 76.42it/s]


Average training loss: 0.318268960872075
Epoch 3/20


100%|██████████| 2494/2494 [00:33<00:00, 75.05it/s]


Average training loss: 0.31790827812795747
Epoch 4/20


100%|██████████| 2494/2494 [00:33<00:00, 74.25it/s]


Average training loss: 0.3172767324807793
Epoch 5/20


100%|██████████| 2494/2494 [00:33<00:00, 73.45it/s]


Average training loss: 0.31701686424774556
Epoch 6/20


 11%|█         | 276/2494 [00:03<00:30, 72.98it/s]


KeyboardInterrupt: ignored

We first evaluate our results on the train data:

In [15]:
_ = evaluate(train_loader, model, get_loss=True, verbose=True)

100%|██████████| 2494/2494 [00:18<00:00, 136.27it/s]

Loss: 0.31613287329673767
Accuracy: 0.904349787868723, Sensitivity: 0.0, Specificity: 1.0
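
For reading these numbers, it helps to recall the usual confusion-matrix definitions. The sketch below assumes that `evaluate` uses those standard definitions with toxic as the positive class, which this notebook does not show; `binary_metrics` is a hypothetical helper. It illustrates that a classifier which labels every comment nontoxic produces exactly the pattern above: sensitivity 0, specificity 1, and accuracy equal to the share of nontoxic comments.

```python
def binary_metrics(y_true, y_pred):
    """Return (accuracy, sensitivity, specificity) for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    accuracy = (tp + tn) / len(pairs)
    # Sensitivity: fraction of toxic comments that were flagged.
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    # Specificity: fraction of nontoxic comments that were left alone.
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return accuracy, sensitivity, specificity


# 10 comments, 1 of them toxic; the "model" predicts nontoxic for all of them.
y_true = [0, 0, 0, 1, 0, 0, 0, 0, 0, 0]
y_pred = [0] * len(y_true)
print(binary_metrics(y_true, y_pred))  # (0.9, 0.0, 1.0)
```

If that assumption holds, the accuracy of about 0.90 mostly reflects class imbalance rather than an ability to detect toxic comments.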





We then evaluate our results on the test data:

In [16]:
_ = evaluate(test_loader, model, get_loss=True, verbose=True)

100%|██████████| 1000/1000 [00:07<00:00, 134.61it/s]

Loss: 0.31514671444892883
Accuracy: 0.9048110287911469, Sensitivity: 0.0, Specificity: 1.0
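
One way to sanity-check these losses: a model that ignores its input and always outputs the class frequencies has a cross-entropy equal to the entropy of the label distribution. The sketch below computes that value from the reported accuracies. It assumes that accuracy equals the nontoxic fraction (which follows if sensitivity is 0 and specificity is 1 under the standard definitions) and that `evaluate` reports mean cross-entropy in nats, as `torch.nn.CrossEntropyLoss` does by default. `label_entropy` is a hypothetical helper.

```python
import math


def label_entropy(p_negative):
    """Entropy in nats of a binary label distribution with P(nontoxic) = p_negative."""
    p_positive = 1.0 - p_negative
    return -(p_negative * math.log(p_negative) + p_positive * math.log(p_positive))


# Accuracy and loss values copied from the two evaluation outputs in this notebook.
reported = {
    "train": {"accuracy": 0.904349787868723, "loss": 0.31613287329673767},
    "test": {"accuracy": 0.9048110287911469, "loss": 0.31514671444892883},
}
for split, r in reported.items():
    baseline = label_entropy(r["accuracy"])
    print(f"{split}: constant-predictor loss {baseline:.4f}, reported loss {r['loss']:.4f}")
```

Both baselines come out within roughly 0.001 of the reported losses, which is consistent with the model outputting close to the class frequencies for every comment instead of separating toxic from nontoxic ones.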



