<a href="https://colab.research.google.com/github/mtzig/NLP_CTF/blob/main/baseline.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Baseline Toxicity Classifier
Thomas Zeng
9/27/22

## Colab setup

This section is only pertinent if the notebook is run in Colab and not on a local machine.

If you are using Colab, run the cell below to clone the repo.

In [1]:
!git clone https://github.com/mtzig/NLP_CTF.git
%cd /content/NLP_CTF/

Cloning into 'NLP_CTF'...
remote: Enumerating objects: 127, done.
remote: Counting objects: 100% (4/4), done.
remote: Compressing objects: 100% (4/4), done.
remote: Total 127 (delta 0), reused 1 (delta 0), pack-reused 123
Receiving objects: 100% (127/127), 74.31 MiB | 14.76 MiB/s, done.
Resolving deltas: 100% (64/64), done.
Checking out files: 100% (24/24), done.
/content/NLP_CTF


Download Word2Vec Embeddings

In [2]:
%cd ./data
# !wget -O 'GoogleNews-vectors-negative300.bin.gz' 'https://drive.google.com/u/0/uc?id=0B7XkCwpI5KDYNlNUTTlSS21pQmM&export=download&confirm=t&uuid=e1f49911-ab4d-44ba-af00-f6733ccabb98'
# !gzip -d 'GoogleNews-vectors-negative300.bin.gz'
!wget --load-cookies /tmp/cookies.txt "https://docs.google.com/uc?export=download&confirm=$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate 'https://docs.google.com/uc?export=download&id=1JXm1N6SHmzIawgH7Aa4Ag-ZVuqLX7ba7' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')&id=1JXm1N6SHmzIawgH7Aa4Ag-ZVuqLX7ba7" -O GoogleNews-vectors-negative300.bin && rm -rf /tmp/cookies.txt
%cd ..

/content/NLP_CTF/data
--2022-09-27 21:10:34--  https://docs.google.com/uc?export=download&confirm=t&id=1JXm1N6SHmzIawgH7Aa4Ag-ZVuqLX7ba7
Resolving docs.google.com (docs.google.com)... 142.251.10.102, 142.251.10.138, 142.251.10.113, ...
Connecting to docs.google.com (docs.google.com)|142.251.10.102|:443... connected.
HTTP request sent, awaiting response... 303 See Other
Location: https://doc-08-7o-docs.googleusercontent.com/docs/securesc/ha0ro937gcuc7l7deffksulhg5h7mbp1/mdmjs1jp7afof5qi51d9d3s1o01m4cj6/1664313000000/15857340408018396550/*/1JXm1N6SHmzIawgH7Aa4Ag-ZVuqLX7ba7?e=download&uuid=d81e2d60-ef1c-4ae0-8c43-0f72d75a6493 [following]
--2022-09-27 21:10:36--  https://doc-08-7o-docs.googleusercontent.com/docs/securesc/ha0ro937gcuc7l7deffksulhg5h7mbp1/mdmjs1jp7afof5qi51d9d3s1o01m4cj6/1664313000000/15857340408018396550/*/1JXm1N6SHmzIawgH7Aa4Ag-ZVuqLX7ba7?e=download&uuid=d81e2d60-ef1c-4ae0-8c43-0f72d75a6493
Resolving doc-08-7o-docs.googleusercontent.com (doc-08-7o-docs.googleusercontent.co

Colab does not come with the Python library `transformers` (used in the code below) preinstalled, so we need to install it manually each time a new instance starts. The same cell also upgrades `gensim`.

In [3]:
!pip install transformers
!pip install --upgrade gensim


Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.22.2-py3-none-any.whl (4.9 MB)
     |████████████████████████████████| 4.9 MB 22.9 MB/s
Collecting tokenizers!=0.11.3,<0.13,>=0.11.1
  Downloading tokenizers-0.12.1-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (6.6 MB)
     |████████████████████████████████| 6.6 MB 49.5 MB/s
Collecting huggingface-hub<1.0,>=0.9.0
  Downloading huggingface_hub-0.9.1-py3-none-any.whl (120 kB)
     |████████████████████████████████| 120 kB 74.0 MB/s
Installing collected packages: tokenizers, huggingface-hub, transformers
Successfully installed huggingface-hub-0.9.1 tokenizers-0.12.1 transformers-4.22.2
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting gensim
  Downloading gensim-4.2.0-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (24.1 MB)
     |██████

## Notebook Setup

In [4]:
%load_ext autoreload
%autoreload 2

In [5]:
import torch
from getData import get_jigsaw_datasets, init_embed_lookup
from dataloaders import InfiniteDataLoader
from models import CNNClassifier
from train_eval import train, evaluate

PyTorch can run its computations on three kinds of device:

1. CUDA: if you have an NVIDIA GPU and a CUDA-enabled build of PyTorch installed (fastest)
2. Metal: if you have an Apple device with Apple Silicon (M1 or M2). This PyTorch backend is new, not feature complete and rather buggy (medium speed)
3. CPU: the fallback that runs on any machine (slowest)

In [6]:
if torch.cuda.is_available():
    print('Using GPU')
    DEVICE = torch.device('cuda')
elif torch.backends.mps.is_available() and torch.backends.mps.is_built():
    # macbooks can use metal if the right version of pytorch is installed
    print('Using Metal')
    DEVICE = torch.device('mps')
else:
    print('Using cpu')
    DEVICE = torch.device('cpu')

Using GPU


## Data Initialization

PyTorch requires its datasets to be accessible through the [datasets API](https://pytorch.org/tutorials/beginner/basics/data_tutorial.html#creating-a-custom-dataset-for-your-files).
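
As a rough illustration of what that API asks for, here is a minimal sketch of a map-style dataset. The class name `ToyCommentDataset` and its toy tensors are hypothetical and are not taken from `getData.py`; the only real requirement shown is that a dataset implements `__len__` and `__getitem__`.

```python
import torch
from torch.utils.data import Dataset


class ToyCommentDataset(Dataset):
    """Hypothetical map-style dataset: token-id rows paired with 0/1 toxicity labels."""

    def __init__(self, token_ids, labels):
        # token_ids: (num_samples, seq_len) integer tensor
        # labels:    (num_samples,) integer tensor
        assert len(token_ids) == len(labels)
        self.token_ids = token_ids
        self.labels = labels

    def __len__(self):
        # Number of samples in the dataset
        return len(self.labels)

    def __getitem__(self, idx):
        # Return one (features, label) pair
        return self.token_ids[idx], self.labels[idx]


# Three made-up "comments", already converted to padded token ids
toy_data = ToyCommentDataset(
    token_ids=torch.tensor([[1, 2, 3, 0], [4, 5, 0, 0], [6, 7, 8, 9]]),
    labels=torch.tensor([0, 1, 0]),
)
print(len(toy_data), toy_data[1])
```

Anything that follows this interface can be handed to a dataloader, which is presumably why the loading function below returns dataset objects rather than raw arrays.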

Below I wrote a simple function to load the [Jigsaw Dataset](https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge) that the paper [Counterfactual Fairness in Text Classification through Robustness](https://dl.acm.org/doi/pdf/10.1145/3306618.3317950) used to train its toxicity classifier.

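For demonstration purposes, the loader can use only a very small subset of the available data: specifically, 256 comments (128 toxic and 128 nontoxic) sampled randomly for each of the train and test sets. Note that the call below does not pass `demo_mode=True`, and the outputs recorded further down come from a much larger run (2494 training batches of 64 per epoch).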

In [7]:
# Passing demo_mode=True loads only a subset of the dataset, which makes training faster for demonstration purposes
train_data, test_data = get_jigsaw_datasets(device=DEVICE)

PyTorch models receive data for training and inference through a dataloader. A dataloader samples from a dataset and returns a batch of samples each time it is called.

In [9]:
train_loader = InfiniteDataLoader(train_data, batch_size=64)
test_loader = InfiniteDataLoader(test_data, batch_size=64)
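
To make the batching behaviour concrete, here is a small sketch using the standard `torch.utils.data.DataLoader`. It is not the repo's `InfiniteDataLoader`, whose implementation is not shown in this notebook, and the toy tensors are made up.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy stand-in data: 10 "comments" of 4 token ids each, with 0/1 labels
features = torch.arange(40).reshape(10, 4)
labels = torch.tensor([0, 1] * 5)
toy_dataset = TensorDataset(features, labels)

# shuffle=False keeps the order fixed so the batch contents are predictable
toy_loader = DataLoader(toy_dataset, batch_size=4, shuffle=False)

# Each iteration yields one (features, labels) batch
batch_shapes = [tuple(x.shape) for x, y in toy_loader]
print(batch_shapes)  # [(4, 4), (4, 4), (2, 4)]
```

With 10 samples and a batch size of 4, one pass yields two full batches and a final batch of 2. For training you would normally pass `shuffle=True`.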

## Model and Training Initialization

In [10]:
pretrained_embed = torch.from_numpy(init_embed_lookup().vectors)


In [11]:

model = CNNClassifier(pretrained_embed, device=DEVICE)
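
For readers curious how a matrix of pretrained vectors might feed a convolutional classifier, here is a minimal sketch. `TinyTextCNN` is a hypothetical stand-in and makes no claim about how the repo's `CNNClassifier` is actually built; the layer sizes and the random "pretrained" vectors are assumptions chosen to keep the example small.

```python
import torch
from torch import nn


class TinyTextCNN(nn.Module):
    """Hypothetical stand-in, not the repo's CNNClassifier."""

    def __init__(self, pretrained_embed, num_filters=8, kernel_size=3, num_classes=2):
        super().__init__()
        # Look-up table initialised from the pretrained vectors; freeze=True keeps them fixed
        self.embed = nn.Embedding.from_pretrained(pretrained_embed, freeze=True)
        embed_dim = pretrained_embed.shape[1]
        self.conv = nn.Conv1d(embed_dim, num_filters, kernel_size)
        self.fc = nn.Linear(num_filters, num_classes)

    def forward(self, token_ids):
        x = self.embed(token_ids)      # (batch, seq_len, embed_dim)
        x = x.transpose(1, 2)          # Conv1d expects (batch, channels, seq_len)
        x = torch.relu(self.conv(x))   # (batch, num_filters, seq_len - kernel_size + 1)
        x = x.max(dim=2).values        # max over positions -> (batch, num_filters)
        return self.fc(x)              # unnormalised class scores (logits)


torch.manual_seed(0)
fake_vectors = torch.randn(20, 5)      # 20-word vocabulary, 5-dim vectors (the Word2Vec file has 300)
toy_model = TinyTextCNN(fake_vectors)
logits = toy_model(torch.randint(0, 20, (4, 7)))  # batch of 4 comments, 7 tokens each
print(logits.shape)  # torch.Size([4, 2])
```

The two output columns are unnormalised scores for the nontoxic and toxic classes, which is the input format `torch.nn.CrossEntropyLoss` expects.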

An epoch is one full pass through your dataset during training. That is, you have trained for 1 epoch when you have seen every sample in your training dataset once.<br>
The loss function is the training objective we want our model to minimize.<br>
The optimizer updates the model weights. It is applied at every time step, i.e. every time we compute the loss and its gradient.

In [12]:
epochs = 10
loss_fn = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.AdamW(model.parameters())
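
The way epochs, the loss function and the optimizer fit together can be sketched on a toy problem. This is only an illustration of the usual PyTorch training step and is not the repo's `train` function; the linear model, the synthetic data and the learning rate are all assumptions.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Toy data: 32 samples with 4 features each; the label is 1 when the features sum to > 0
inputs = torch.randn(32, 4)
targets = (inputs.sum(dim=1) > 0).long()

toy_model = nn.Linear(4, 2)  # stand-in for the classifier
toy_loss_fn = nn.CrossEntropyLoss()
toy_optimizer = torch.optim.AdamW(toy_model.parameters(), lr=0.1)

batch_size = 8
num_batches = len(inputs) // batch_size
epoch_losses = []
for epoch in range(20):
    running = 0.0
    # One epoch = one pass over every sample, here in 4 batches of 8
    for start in range(0, len(inputs), batch_size):
        x = inputs[start:start + batch_size]
        y = targets[start:start + batch_size]

        toy_optimizer.zero_grad()            # clear gradients from the previous step
        loss = toy_loss_fn(toy_model(x), y)  # training objective on this batch
        loss.backward()                      # compute gradients of the loss
        toy_optimizer.step()                 # update the weights
        running += loss.item()
    epoch_losses.append(running / num_batches)  # average training loss for the epoch

print(epoch_losses[0], epoch_losses[-1])
```

The average loss per epoch should fall as training proceeds, which is the same quantity the training cell below reports after each epoch.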

## Train and Evaluation

We train for 10 epochs.<br>
In general, you should (or, more precisely, are required to) train and evaluate on different datasets.

In [13]:
for epoch in range(epochs):
    print(f'Epoch {epoch+1}/{epochs}')
    train(train_loader, model, loss_fn, optimizer, verbose=True, use_tqdm=True)

Epoch 1/10


100%|██████████| 2494/2494 [00:40<00:00, 62.34it/s]


Average training loss: 0.1273053494090126
Epoch 2/10


100%|██████████| 2494/2494 [00:34<00:00, 73.18it/s]


Average training loss: 0.10737995810339222
Epoch 3/10


100%|██████████| 2494/2494 [00:34<00:00, 72.23it/s]


Average training loss: 0.09863319539998261
Epoch 4/10


100%|██████████| 2494/2494 [00:35<00:00, 71.15it/s]


Average training loss: 0.08987199589462597
Epoch 5/10


100%|██████████| 2494/2494 [00:35<00:00, 70.27it/s]


Average training loss: 0.08245628459267797
Epoch 6/10


100%|██████████| 2494/2494 [00:35<00:00, 70.39it/s]


Average training loss: 0.075326835919424
Epoch 7/10


100%|██████████| 2494/2494 [00:35<00:00, 69.48it/s]


Average training loss: 0.06925108117023789
Epoch 8/10


100%|██████████| 2494/2494 [00:35<00:00, 69.65it/s]


Average training loss: 0.06395984761403468
Epoch 9/10


100%|██████████| 2494/2494 [00:35<00:00, 69.64it/s]


Average training loss: 0.059577055947605664
Epoch 10/10


100%|██████████| 2494/2494 [00:36<00:00, 69.07it/s]

Average training loss: 0.05518341557467812





We first evaluate our results on the train data:

In [14]:
_ = evaluate(train_loader, model, get_loss=True, verbose=True)

100%|██████████| 2494/2494 [00:18<00:00, 131.79it/s]

Loss: 0.03164718300104141
Accuracy: 0.988036673330367, Sensitivity: 0.8896299202301556, Specificity: 0.9984682243184985





We then evaluate our results on the test data:

In [15]:
_ = evaluate(test_loader, model, get_loss=True, verbose=True)

100%|██████████| 1000/1000 [00:07<00:00, 134.93it/s]

Loss: 0.21121707558631897
Accuracy: 0.9310075338397574, Sensitivity: 0.7559934318555008, Specificity: 0.9494195688225538
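
For reference, the three reported metrics can be computed from predictions as sketched below. This assumes the standard definitions with toxic (label 1) as the positive class; the repo's `evaluate` function is not shown here and may differ in detail, and `binary_metrics` is a hypothetical helper.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, sensitivity and specificity for 0/1 labels (1 = toxic, assumed positive class)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # toxic, predicted toxic
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # nontoxic, predicted nontoxic
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # nontoxic, predicted toxic
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # toxic, predicted nontoxic

    accuracy = (tp + tn) / len(pairs)
    sensitivity = tp / (tp + fn)  # fraction of toxic comments that were caught
    specificity = tn / (tn + fp)  # fraction of nontoxic comments left alone
    return accuracy, sensitivity, specificity


# Made-up labels: 4 toxic and 6 nontoxic comments
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
accuracy, sensitivity, specificity = binary_metrics(y_true, y_pred)
print(accuracy, sensitivity, round(specificity, 4))  # 0.8 0.75 0.8333
```

Under these definitions, a test sensitivity of about 0.76 alongside a specificity of about 0.95 would mean the model misses roughly a quarter of toxic comments while rarely flagging nontoxic ones. When one class is much more common than the other, accuracy alone can hide this kind of gap.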



