<a href="https://colab.research.google.com/github/mtzig/NLP_CTF/blob/main/notebooks/baseline.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Baseline Toxicity Classifier
Thomas Zeng
9/27/22

## Colab setup

This section applies only when the notebook runs in Colab rather than on a local machine.

If you are using Colab, run the cell below to clone the repo.

In [1]:
!git clone https://github.com/mtzig/NLP_CTF.git
%cd /content/NLP_CTF/

Cloning into 'NLP_CTF'...
remote: Enumerating objects: 502, done.
remote: Counting objects: 100% (173/173), done.
remote: Compressing objects: 100% (129/129), done.
remote: Total 502 (delta 97), reused 111 (delta 44), pack-reused 329
Receiving objects: 100% (502/502), 87.01 MiB | 13.32 MiB/s, done.
Resolving deltas: 100% (295/295), done.
Checking out files: 100% (42/42), done.
/content/NLP_CTF


Download the pretrained Word2Vec embeddings (`GoogleNews-vectors-negative300.bin`) into the `data` directory.

In [2]:
%cd ./data
# !wget -O 'GoogleNews-vectors-negative300.bin.gz' 'https://drive.google.com/u/0/uc?id=0B7XkCwpI5KDYNlNUTTlSS21pQmM&export=download&confirm=t&uuid=e1f49911-ab4d-44ba-af00-f6733ccabb98'
# !gzip -d 'GoogleNews-vectors-negative300.bin.gz'
!wget --load-cookies /tmp/cookies.txt "https://docs.google.com/uc?export=download&confirm=$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate 'https://docs.google.com/uc?export=download&id=1JXm1N6SHmzIawgH7Aa4Ag-ZVuqLX7ba7' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')&id=1JXm1N6SHmzIawgH7Aa4Ag-ZVuqLX7ba7" -O GoogleNews-vectors-negative300.bin && rm -rf /tmp/cookies.txt
%cd ..

/content/NLP_CTF/data
--2022-10-25 19:35:01--  https://docs.google.com/uc?export=download&confirm=t&id=1JXm1N6SHmzIawgH7Aa4Ag-ZVuqLX7ba7
Resolving docs.google.com (docs.google.com)... 142.251.12.138, 142.251.12.139, 142.251.12.102, ...
Connecting to docs.google.com (docs.google.com)|142.251.12.138|:443... connected.
HTTP request sent, awaiting response... 303 See Other
Location: https://doc-08-7o-docs.googleusercontent.com/docs/securesc/ha0ro937gcuc7l7deffksulhg5h7mbp1/tv4ecg3dmv5oo61qkr79n9193q8uc3s0/1666726500000/15857340408018396550/*/1JXm1N6SHmzIawgH7Aa4Ag-ZVuqLX7ba7?e=download&uuid=a851da9a-37bb-46e2-b294-7a842f99e159 [following]
--2022-10-25 19:35:02--  https://doc-08-7o-docs.googleusercontent.com/docs/securesc/ha0ro937gcuc7l7deffksulhg5h7mbp1/tv4ecg3dmv5oo61qkr79n9193q8uc3s0/1666726500000/15857340408018396550/*/1JXm1N6SHmzIawgH7Aa4Ag-ZVuqLX7ba7?e=download&uuid=a851da9a-37bb-46e2-b294-7a842f99e159
Resolving doc-08-7o-docs.googleusercontent.com (doc-08-7o-docs.googleusercontent.co

The cell below upgrades the Python library `gensim`, which the embedding-loading code presumably relies on. Colab instances start fresh, so we need to run this install manually whenever we start up an instance.

In [None]:
!pip install --upgrade gensim


## Notebook Setup

In [4]:
%load_ext autoreload
%autoreload 2

In [None]:
# %cd ..

/Users/tzeng/repos/NLP_CTF


In [5]:
import torch
from process_data import get_jigsaw_datasets, init_embed_lookup, get_ctf_datasets, get_CivilComments_Datasets
from models import CNNClassifier
from train_eval import train, evaluate, CTF, get_pred
from torch.utils.data import DataLoader
from loss import CLP_loss, ERM_loss

In [6]:
if torch.cuda.is_available():
    print('Using GPU')
    DEVICE = torch.device('cuda')
elif torch.backends.mps.is_available() and torch.backends.mps.is_built():
    # macbooks can use metal if the right version of pytorch is installed
    print('Using Metal')
    DEVICE = torch.device('mps')
else:
    print('Using cpu')
    DEVICE = torch.device('cpu')

Using GPU


## Data Initialization

PyTorch expects datasets to be accessible through its [datasets API](https://pytorch.org/tutorials/beginner/basics/data_tutorial.html#creating-a-custom-dataset-for-your-files).
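As a rough illustration of that API, a map-style dataset only needs to report its length and return one sample per index. The sketch below is plain Python with no PyTorch dependency; `ToyCommentDataset`, its two comments, and the convention that label 1 means toxic are all made up for illustration and are not the repo's `get_jigsaw_datasets` implementation.

```python
class ToyCommentDataset:
    """Hypothetical map-style dataset: supports len(ds) and ds[i]."""

    def __init__(self, comments, labels):
        if len(comments) != len(labels):
            raise ValueError("comments and labels must have the same length")
        self.comments = comments
        self.labels = labels

    def __len__(self):
        # Number of samples in the dataset.
        return len(self.comments)

    def __getitem__(self, idx):
        # One (input, label) pair; a real dataset would return tensors.
        return self.comments[idx], self.labels[idx]


# Assumed label convention for this toy example: 0 = nontoxic, 1 = toxic.
toy_data = ToyCommentDataset(["have a nice day", "you are an idiot"], [0, 1])
print(len(toy_data))
print(toy_data[1])
```

A subclass of `torch.utils.data.Dataset` implements the same two methods, which is what lets a dataloader index into it.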

Below I use a simple function I wrote to load the [Jigsaw Dataset](https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge), which the paper [Counterfactual Fairness in Text Classification through Robustness](https://dl.acm.org/doi/pdf/10.1145/3306618.3317950) used to train its toxicity classifier.

The cell below loads the full training split rather than a small sample: the progress bar reports 159,571 comments, which matches the 2,494 batches of 64 per epoch in the training logs.

In [7]:
embed_lookup = init_embed_lookup()

In [8]:
train_data = get_jigsaw_datasets(device=DEVICE, data_type='baseline', embed_lookup=embed_lookup)

100%|██████████| 159571/159571 [00:13<00:00, 12225.80it/s]


PyTorch models receive data for training and inference through a dataloader. A dataloader draws samples from a dataset and yields one batch each time it is iterated.

In [9]:
train_loader = DataLoader(train_data, batch_size=64)
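The batch count shown in the training progress bars can be checked by hand. This small sketch assumes the dataloader keeps the final partial batch, which is PyTorch's default when `drop_last` is not set; the variable names are just for illustration.

```python
import math

num_comments = 159571  # dataset size from the loading progress bar
batch_size = 64        # batch size passed to the dataloader

# With the final partial batch kept, the batch count is a ceiling division.
num_batches = math.ceil(num_comments / batch_size)
last_batch_size = num_comments - (num_batches - 1) * batch_size

print(num_batches)      # 2494
print(last_batch_size)  # 19
```

So each epoch consists of 2,493 full batches plus one final batch of 19 comments.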

## Model and Training Initialization

In [10]:
pretrained_embed = torch.from_numpy(embed_lookup.vectors)

In [11]:

model = CNNClassifier(pretrained_embed, device=DEVICE)

An epoch is one full pass through the training dataset. That is, you have trained for 1 epoch once the model has seen every sample in the training dataset once.<br>
The loss function is the training objective we want our model to minimize.<br>
The optimizer updates the model weights at every step, i.e. every time we compute the loss and its gradient.

In [12]:
epochs = 5
loss_fn = ERM_loss(torch.nn.CrossEntropyLoss())

optimizer = torch.optim.AdamW(model.parameters())
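To make the terms epoch, loss, and optimizer step concrete, here is a framework-free toy that trains a two-class linear model with cross-entropy loss and plain gradient descent. It is only a sketch: the four data points, the learning rate, and the one-sample-per-step update are invented for illustration, and it stands in for neither `CNNClassifier`, `ERM_loss`, nor AdamW.

```python
import math


def softmax(logits):
    # Subtract the max for numerical stability.
    shift = max(logits)
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target):
    # Negative log-probability assigned to the correct class.
    return -math.log(softmax(logits)[target])


# Toy data (made up): one feature x, label 1 when x is positive.
data = [(-2.0, 0), (-1.0, 0), (1.0, 1), (2.0, 1)]

# Model: logit_k = w[k] * x + b[k] for classes k = 0, 1.
w = [0.0, 0.0]
b = [0.0, 0.0]
lr = 0.5
toy_epochs = 5
history = []

for epoch in range(toy_epochs):       # one epoch = one pass over `data`
    total_loss = 0.0
    for x, y in data:                 # one optimizer step per sample
        logits = [w[k] * x + b[k] for k in range(2)]
        total_loss += cross_entropy(logits, y)
        probs = softmax(logits)
        for k in range(2):
            # Gradient of cross-entropy w.r.t. logit k is p_k - 1[k == y].
            grad = probs[k] - (1.0 if k == y else 0.0)
            w[k] -= lr * grad * x     # gradient descent update
            b[k] -= lr * grad
    history.append(total_loss / len(data))
    print(f"Epoch {epoch + 1}/{toy_epochs} average loss: {history[-1]:.4f}")
```

The real notebook follows the same pattern, with batches of 64 comments per step instead of single samples.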

## Train and Evaluation

We train for 5 epochs, matching `epochs = 5` set above. <br>
In general, you should (and for any meaningful evaluation must) train and evaluate on different datasets.

In [13]:
for epoch in range(epochs):
    print(f'Epoch {epoch+1}/{epochs}')
    train(train_loader, model, loss_fn, optimizer, verbose=True)

Epoch 1/5


100%|██████████| 2494/2494 [00:38<00:00, 64.03it/s]


Average training loss: 0.12558598912813207
Epoch 2/5


100%|██████████| 2494/2494 [00:32<00:00, 76.34it/s]


Average training loss: 0.10657821136206365
Epoch 3/5


100%|██████████| 2494/2494 [00:33<00:00, 74.96it/s]


Average training loss: 0.09713671822026193
Epoch 4/5


100%|██████████| 2494/2494 [00:33<00:00, 74.09it/s]


Average training loss: 0.09003182570412666
Epoch 5/5


100%|██████████| 2494/2494 [00:33<00:00, 73.58it/s]

Average training loss: 0.08173915485940342





As a quick sanity check, we first run the trained model on a single input string with `get_pred`.

In [57]:
get_pred('f', model, embed_lookup=embed_lookup)

([0.24295976758003235, -0.11695495247840881], 0.410980224609375)
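The output format of `get_pred` is not documented in this notebook. One reading that is consistent with the numbers above is that the list holds two class logits and the trailing scalar is the softmax probability of the class at index 1. The sketch below checks that arithmetic in plain Python; which index corresponds to "toxic" is an assumption that the notebook does not confirm.

```python
import math


def softmax(logits):
    # Subtract the max for numerical stability.
    shift = max(logits)
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


logits = [0.24295976758003235, -0.11695495247840881]  # list from the output above
reported = 0.410980224609375                          # scalar from the output above

probs = softmax(logits)
print(probs)
# True if the reported scalar matches the softmax probability at index 1.
print(abs(probs[1] - reported) < 1e-5)
```

If this reading is right, the model assigns roughly 41% probability to class 1 for the input `'f'`.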