### Table 5 generator

Creates the data for Table 5 in the appendix. Based on the code in `baseline.ipynb`.


## Colab setup

This section only applies when the notebook is run in Colab rather than on a local machine. If you are using Colab, run the code below to clone the repo.

In [None]:
!git clone https://github.com/mtzig/NLP_CTF.git
%cd /content/NLP_CTF/

Download the Word2Vec embeddings and the Civil Comments dataset.

In [None]:
%cd /content/NLP_CTF/data
!wget -O GoogleNews-vectors-negative300.bin  'https://www.dropbox.com/s/mlg71vsawice3xd/GoogleNews-vectors-negative300.bin?dl=1'
%cd ./civil_comments
!wget -O civil_comments.csv 'https://www.dropbox.com/s/xv8zkmcmg74n0ak/civil_comments.csv?dl=1'
%cd ..
%cd ..

The code below depends on the Python library `gensim`, so we need to install or upgrade it manually whenever we start a new Colab instance.

In [None]:
!pip install --upgrade gensim

## Notebook Setup

In [None]:
%load_ext autoreload
%autoreload 2

In [None]:
# %cd ..

In [None]:
import torch
from process_data import get_jigsaw_datasets, init_embed_lookup, get_ctf_datasets, get_CivilComments_Datasets
from models import CNNClassifier
from train_eval import train, evaluate, CTF, get_pred
from torch.utils.data import DataLoader
from loss import CLP_loss, ERM_loss

In [None]:
if torch.cuda.is_available():
    print('Using GPU')
    DEVICE = torch.device('cuda')
elif torch.backends.mps.is_available() and torch.backends.mps.is_built():
    # macbooks can use metal if the right version of pytorch is installed
    print('Using Metal')
    DEVICE = torch.device('mps')
else:
    print('Using cpu')
    DEVICE = torch.device('cpu')

## Data Initialization

PyTorch requires its datasets to be accessible through the [datasets API](https://pytorch.org/tutorials/beginner/basics/data_tutorial.html#creating-a-custom-dataset-for-your-files).
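
As a rough illustration of that API, a map-style dataset only has to implement `__len__` and `__getitem__`. The sketch below uses plain Python so it runs on its own. The class name `ToyCommentDataset` and its toy data are hypothetical and not part of this repo; a real dataset would normally subclass `torch.utils.data.Dataset` and return tensors.

```python
class ToyCommentDataset:
    """Hypothetical map-style dataset holding (comment, label) pairs."""

    def __init__(self, comments, labels):
        if len(comments) != len(labels):
            raise ValueError("comments and labels must have the same length")
        self.comments = list(comments)
        self.labels = list(labels)

    def __len__(self):
        # number of samples in the dataset
        return len(self.comments)

    def __getitem__(self, idx):
        # return one (input, label) pair by index
        return self.comments[idx], self.labels[idx]


toy_data = ToyCommentDataset(["first comment", "second comment"], [0, 1])
print(len(toy_data), toy_data[1])
```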

Below I use a simple function to load the [Jigsaw Dataset](https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge) that the paper [Counterfactual Fairness in Text Classification through Robustness](https://dl.acm.org/doi/pdf/10.1145/3306618.3317950) used to train its toxicity classifier.

I use only a very small subset of the available data here for demonstration purposes. Specifically, 256 comments (128 toxic and 128 nontoxic) are sampled randomly for each of the train and test sets.

In [None]:
embed_lookup = init_embed_lookup()

In [None]:
train_data = get_jigsaw_datasets(device=DEVICE, data_type='baseline', embed_lookup=embed_lookup)

PyTorch models receive data for training and inference through a dataloader. A dataloader samples from a dataset and returns a batch of samples each time it is called.
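
To make the batching behaviour concrete, here is a minimal plain-Python sketch of what a dataloader does conceptually: walk over the dataset and yield fixed-size batches. The helper name `iter_batches` is hypothetical. The real `torch.utils.data.DataLoader` additionally handles shuffling, collation into tensors, and parallel loading.

```python
def iter_batches(dataset, batch_size):
    """Yield consecutive batches (lists of samples) from an indexable dataset."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(dataset), batch_size):
        stop = min(start + batch_size, len(dataset))
        # the last batch may be smaller than batch_size
        yield [dataset[i] for i in range(start, stop)]


samples = list(range(10))
batches = list(iter_batches(samples, batch_size=4))
print([len(b) for b in batches])  # → [4, 4, 2]
```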

In [None]:
train_loader = torch.utils.data.DataLoader(train_data, batch_size=64)

## Model and Training Initialization

In [None]:
pretrained_embed = torch.from_numpy(embed_lookup.vectors)

In [None]:
model = CNNClassifier(pretrained_embed, device=DEVICE)

An epoch is one full pass through your dataset during training. That is, you have trained for 1 epoch when you have seen every sample in your training dataset once.<br>
The loss function is the training objective we want our model to minimize.<br>
The optimizer is used at every time step, i.e. every time we compute the loss and its gradient. It uses the gradient to update the model weights.
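
The interplay of these three pieces can be sketched without any framework. The toy example below fits a single weight `w` in the model `y = w * x` by plain gradient descent on a mean squared error. All names and numbers are made up for illustration, and the notebook itself uses `AdamW` and a cross-entropy loss instead.

```python
def mse_loss(w, data):
    """Mean squared error of the model y = w * x over the data."""
    return sum((w * x - y) ** 2 for x, y in data) / len(data)


def mse_grad(w, data):
    """Gradient of the mean squared error with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in data) / len(data)


toy_samples = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # toy samples of y = 2x
w = 0.0               # the single model weight
learning_rate = 0.05
toy_epochs = 50

losses = []
for epoch in range(toy_epochs):
    # one epoch = one pass over every sample (here as a single full batch)
    losses.append(mse_loss(w, toy_samples))
    # "optimizer step": move the weight against the gradient of the loss
    w -= learning_rate * mse_grad(w, toy_samples)

print(f"w = {w:.3f}, first loss = {losses[0]:.3f}, last loss = {losses[-1]:.6f}")
```

The loss shrinks from epoch to epoch and `w` approaches 2, the slope used to generate the toy samples.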

In [None]:
epochs = 5
loss_fn = ERM_loss(torch.nn.CrossEntropyLoss())

optimizer = torch.optim.AdamW(model.parameters())

## Training and Evaluation Baseline

For training, we train for 5 epochs (the value of `epochs` set above). <br>
In general, you should (or more specifically, are required to) train and evaluate on different datasets.
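
One common way to obtain separate datasets is to shuffle the sample indices once and split them. The sketch below is plain Python with a hypothetical helper `split_indices`. It is not how this notebook builds its data, which comes from `get_jigsaw_datasets`.

```python
import random


def split_indices(n_samples, test_fraction=0.2, seed=0):
    """Return disjoint (train, test) index lists covering range(n_samples)."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be between 0 and 1")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # fixed seed keeps the split reproducible
    n_test = int(n_samples * test_fraction)
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = split_indices(10, test_fraction=0.2)
print(len(train_idx), len(test_idx))
```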

In [None]:
for epoch in range(epochs):
    print(f'Epoch {epoch+1}/{epochs}')
    train(train_loader, model, loss_fn, optimizer, verbose=True)

In [1]:
get_pred('Some people are gay', model, embed_lookup=embed_lookup)
get_pred('Some people are straight', model, embed_lookup=embed_lookup)
get_pred('Some people are Jewish', model, embed_lookup=embed_lookup)
get_pred('Some people are Muslim', model, embed_lookup=embed_lookup)
get_pred('Some people are Christian', model, embed_lookup=embed_lookup)

## Setup train eval for blindness
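
Blindness, as described in the paper linked above, replaces every identity term with one shared placeholder token so that the model cannot tell the terms apart. The sketch below illustrates the idea on raw strings. The term list, the placeholder name, and the function `blind` are assumptions for illustration, not the preprocessing that `get_jigsaw_datasets` performs.

```python
IDENTITY_TERMS = {"gay", "straight", "jewish", "muslim", "christian"}  # illustrative subset


def blind(text, identity_terms=IDENTITY_TERMS, placeholder="IDENTITY"):
    """Replace every whitespace-separated identity term with one placeholder token."""
    return " ".join(
        placeholder if word.lower() in identity_terms else word
        for word in text.split()
    )


print(blind("Some people are gay"))  # → Some people are IDENTITY
```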

In [None]:
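# Assumption: start from a fresh model so this run does not continue from the weights trained above
model = CNNClassifier(pretrained_embed, device=DEVICE)
train_data = get_jigsaw_datasets(device=DEVICE, data_type='blindness', embed_lookup=embed_lookup)
train_loader = torch.utils.data.DataLoader(train_data, batch_size=64)
optimizer = torch.optim.AdamW(model.parameters())
loss_fn = ERM_loss(torch.nn.CrossEntropyLoss())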

In [None]:
for epoch in range(epochs):
    print(f'Epoch {epoch+1}/{epochs}')
    train(train_loader, model, loss_fn, optimizer, verbose=True)

In [None]:
get_pred('Some people are gay', model, embed_lookup=embed_lookup)
get_pred('Some people are straight', model, embed_lookup=embed_lookup)
get_pred('Some people are Jewish', model, embed_lookup=embed_lookup)
get_pred('Some people are Muslim', model, embed_lookup=embed_lookup)
get_pred('Some people are Christian', model, embed_lookup=embed_lookup)

## Setup train eval for augment
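
Counterfactual augmentation, as described in the paper linked above, extends the training set with copies of each example in which an identity term is swapped for another one, keeping the original label. The sketch below illustrates this on raw strings. The term list and the helpers `counterfactuals` and `augment` are hypothetical and do not reproduce the repo's data pipeline.

```python
IDENTITY_TERMS = ["gay", "straight", "Jewish", "Muslim", "Christian"]  # illustrative subset


def counterfactuals(text, identity_terms=IDENTITY_TERMS):
    """Return variants of text with each identity term swapped for every other term."""
    words = text.split()
    variants = []
    for i, word in enumerate(words):
        if word in identity_terms:
            for other in identity_terms:
                if other != word:
                    variants.append(" ".join(words[:i] + [other] + words[i + 1:]))
    return variants


def augment(examples):
    """Add each counterfactual to the training set with the original label."""
    augmented = list(examples)
    for text, label in examples:
        augmented.extend((variant, label) for variant in counterfactuals(text))
    return augmented


augmented = augment([("Some people are gay", 0), ("Have a nice day", 0)])
print(len(augmented))  # → 6
```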

In [None]:
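# Assumption: start from a fresh model so this run does not continue from the weights trained above
model = CNNClassifier(pretrained_embed, device=DEVICE)
train_data = get_jigsaw_datasets(device=DEVICE, data_type='augment', embed_lookup=embed_lookup)
train_loader = torch.utils.data.DataLoader(train_data, batch_size=64)
optimizer = torch.optim.AdamW(model.parameters())
loss_fn = ERM_loss(torch.nn.CrossEntropyLoss())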

In [None]:
for epoch in range(epochs):
    print(f'Epoch {epoch+1}/{epochs}')
    train(train_loader, model, loss_fn, optimizer, verbose=True)

In [None]:
get_pred('Some people are gay', model, embed_lookup=embed_lookup)
get_pred('Some people are straight', model, embed_lookup=embed_lookup)
get_pred('Some people are Jewish', model, embed_lookup=embed_lookup)
get_pred('Some people are Muslim', model, embed_lookup=embed_lookup)
get_pred('Some people are Christian', model, embed_lookup=embed_lookup)

## Setup train eval for CTF
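
Counterfactual logit pairing (CLP), as described in the paper linked above, adds a penalty to the usual classification loss. The penalty measures how much the model's logit changes when an identity term in the input is swapped for another one, and it is weighted by a coefficient λ (`lmbda` in the cell below). The plain-Python sketch that follows illustrates the idea on made-up numbers. The function names are hypothetical and this is not the repo's `CLP_loss` implementation, whose details (including the role of `A`) are not shown in this notebook.

```python
def clp_penalty(logits, counterfactual_logits):
    """Mean absolute gap between each logit and its counterfactual's logit."""
    if len(logits) != len(counterfactual_logits):
        raise ValueError("each example needs exactly one counterfactual logit")
    gaps = [abs(g - g_cf) for g, g_cf in zip(logits, counterfactual_logits)]
    return sum(gaps) / len(gaps)


def clp_objective(task_loss, logits, counterfactual_logits, lmbda):
    """Task loss plus the lambda-weighted counterfactual penalty."""
    return task_loss + lmbda * clp_penalty(logits, counterfactual_logits)


# made-up numbers: logits for two comments and for their counterfactual versions
logits = [2.0, -1.0]
counterfactual_logits = [0.5, -1.0]
total = clp_objective(task_loss=0.25, logits=logits,
                      counterfactual_logits=counterfactual_logits, lmbda=5.0)
print(total)  # → 4.0
```

A larger `lmbda` pushes the model harder toward giving identical logits to an input and its counterfactuals, at some possible cost to the task loss.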

In [None]:
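# Assumption: start from a fresh model so this run does not continue from the weights trained above
model = CNNClassifier(pretrained_embed, device=DEVICE)
train_data, A = get_jigsaw_datasets(device=DEVICE, data_type='CLP', embed_lookup=embed_lookup)
train_loader = torch.utils.data.DataLoader(train_data, batch_size=64)
optimizer = torch.optim.AdamW(model.parameters())
loss_fn = CLP_loss(torch.nn.CrossEntropyLoss(), A, lmbda=5.0)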

In [None]:
for epoch in range(epochs):
    print(f'Epoch {epoch+1}/{epochs}')
    train(train_loader, model, loss_fn, optimizer, verbose=True)

In [None]:
get_pred('Some people are gay', model, embed_lookup=embed_lookup)
get_pred('Some people are straight', model, embed_lookup=embed_lookup)
get_pred('Some people are Jewish', model, embed_lookup=embed_lookup)
get_pred('Some people are Muslim', model, embed_lookup=embed_lookup)
get_pred('Some people are Christian', model, embed_lookup=embed_lookup)