<a href="https://colab.research.google.com/github/mtzig/NLP_CTF/blob/main/notebooks/table5.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Table 5 generator

Creates the data for Table 5 in the appendix. Based on the code in baseline.ipynb.


## Colab setup

This section applies only when the notebook is run in Colab rather than on a local machine. If you're using Colab, run the code below to clone the repo.

In [18]:
!git clone https://github.com/mtzig/NLP_CTF.git
%cd /content/NLP_CTF/

Cloning into 'NLP_CTF'...
remote: Enumerating objects: 698, done.
remote: Counting objects: 100% (138/138), done.
remote: Compressing objects: 100% (99/99), done.
remote: Total 698 (delta 70), reused 87 (delta 31), pack-reused 560
Receiving objects: 100% (698/698), 99.25 MiB | 7.17 MiB/s, done.
Resolving deltas: 100% (416/416), done.
Checking out files: 100% (66/66), done.
/content/NLP_CTF


Download the Word2Vec embeddings and the Civil Comments dataset

In [None]:
%cd /content/NLP_CTF/data
!wget -O GoogleNews-vectors-negative300.bin  'https://www.dropbox.com/s/mlg71vsawice3xd/GoogleNews-vectors-negative300.bin?dl=1'
%cd ./civil_comments
!wget -O civil_comments.csv 'https://www.dropbox.com/s/xv8zkmcmg74n0ak/civil_comments.csv?dl=1'
%cd ..
%cd ..

/content/NLP_CTF/data
--2022-10-30 22:58:00--  https://www.dropbox.com/s/mlg71vsawice3xd/GoogleNews-vectors-negative300.bin?dl=1
Resolving www.dropbox.com (www.dropbox.com)... 162.125.81.18, 2620:100:6023:18::a27d:4312
Connecting to www.dropbox.com (www.dropbox.com)|162.125.81.18|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: /s/dl/mlg71vsawice3xd/GoogleNews-vectors-negative300.bin [following]
--2022-10-30 22:58:01--  https://www.dropbox.com/s/dl/mlg71vsawice3xd/GoogleNews-vectors-negative300.bin
Reusing existing connection to www.dropbox.com:443.
HTTP request sent, awaiting response... 302 Found
Location: https://uc7db1f3056c1239cff0ba03a45b.dl.dropboxusercontent.com/cd/0/get/Bv2_Q0Dz3XlzuGxuIBpfPCrkWiO0fKOy7xfY-y1Cj2ZFWklGgUYKDYtqApAqCCzBsMKOrh_EXO-iBanud288VRrEjd2NHFhAFe-oUNoFTICqMYg1dsI9R_B04u_GcfGwqm6P9fv8o9g-0_0pkeheBDyI6W5ryOSXPGwAOaBHkpM_ZEGMhiAZGhn2EO-Fj6Q9ILM/file?dl=1# [following]
--2022-10-30 22:58:01--  https://uc7db1f3056c1239cff0ba03a45b.d

In [None]:
!pip install --upgrade gensim

## Notebook Setup

In [None]:
%load_ext autoreload
%autoreload 2

In [None]:
# %cd ..

In [None]:
import torch
from process_data import get_jigsaw_datasets, init_embed_lookup, get_ctf_datasets, get_CivilComments_Datasets
from models import CNNClassifier
from train_eval import train, evaluate, CTF, get_pred
from torch.utils.data import DataLoader
from loss import CLP_loss, ERM_loss
import pandas as pd
from pathlib import Path


In [None]:
if torch.cuda.is_available():
    print('Using GPU')
    DEVICE = torch.device('cuda')
elif torch.backends.mps.is_available() and torch.backends.mps.is_built():
    # macbooks can use metal if the right version of pytorch is installed
    print('Using Metal')
    DEVICE = torch.device('mps')
else:
    print('Using cpu')
    DEVICE = torch.device('cpu')

## Data Initialization

In [None]:
embed_lookup = init_embed_lookup()

In [None]:
train_data = get_jigsaw_datasets(device=DEVICE, data_type='baseline', embed_lookup=embed_lookup)

In [None]:
train_loader = DataLoader(train_data, batch_size=64)

## Model and Training Stuff Initialization

In [None]:
pretrained_embed = torch.from_numpy(embed_lookup.vectors)

In [None]:
model = CNNClassifier(pretrained_embed, device=DEVICE)

An epoch is one full pass through your dataset during training. That is, you have trained for 1 epoch when you have seen every sample in your training dataset once.<br>
The loss function is the training objective we want our model to minimize.<br>
The optimizer is used at every time step, i.e. every time we compute the loss and its gradient. It updates the model weights.
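
To make the three terms concrete, here is a minimal sketch in plain Python (no PyTorch) that fits a single weight `w` to toy data. The function names, the learning rate, and the toy data are all illustrative assumptions; this is not how `train` in `train_eval` is implemented.

```python
# Toy illustration only: fit y = w * x with plain Python.
def squared_error_loss(w, batch):
    """Loss function: mean squared error of the prediction w * x over a batch."""
    return sum((w * x - y) ** 2 for x, y in batch) / len(batch)

def loss_gradient(w, batch):
    """Derivative of the mean squared error with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

def sgd_step(w, grad, lr=0.05):
    """Optimizer: move w a small step against the gradient."""
    return w - lr * grad

def fit(dataset, epochs=5, batch_size=2, w=0.0):
    history = []
    for _ in range(epochs):  # one epoch = one full pass over the dataset
        for start in range(0, len(dataset), batch_size):
            batch = dataset[start:start + batch_size]
            # one optimizer step per batch (per "time step")
            w = sgd_step(w, loss_gradient(w, batch))
        history.append(squared_error_loss(w, dataset))
    return w, history

toy_data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]  # generated from y = 2x
w, history = fit(toy_data)
```

After the five passes, `w` ends up close to 2 and the per-epoch loss in `history` shrinks, which is the same pattern the verbose training output below should show for the real model.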

In [None]:
epochs = 5
loss_fn = ERM_loss(torch.nn.CrossEntropyLoss())

optimizer = torch.optim.AdamW(model.parameters())

## Training and Evaluation Baseline

For training, we train for 5 epochs (the value of `epochs` set above). <br>
In general, you should (or more specifically are required to) train and evaluate using different datasets.
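
As a rough illustration of keeping training and evaluation data separate, the sketch below shuffles a list of samples and splits it into two disjoint parts. `split_dataset`, `eval_fraction`, and `seed` are hypothetical names for this example; the datasets returned by `get_jigsaw_datasets` may already be split differently.

```python
import random

def split_dataset(samples, eval_fraction=0.2, seed=0):
    """Shuffle a copy of `samples` and split it into disjoint train/eval lists."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_eval = max(1, int(len(shuffled) * eval_fraction))
    return shuffled[n_eval:], shuffled[:n_eval]

train_split, eval_split = split_dataset(range(10))
```

No sample appears in both lists, so a score measured on `eval_split` is not inflated by examples the model has already seen.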

In [None]:
for epoch in range(epochs):
    print(f'Epoch {epoch+1}/{epochs}')
    train(train_loader, model, loss_fn, optimizer, verbose=True)

In [None]:
get_pred('Some people are gay', model, embed_lookup=embed_lookup)
get_pred('Some people are straight', model, embed_lookup=embed_lookup)
get_pred('Some people are Jewish', model, embed_lookup=embed_lookup)
get_pred('Some people are Muslim', model, embed_lookup=embed_lookup)
get_pred('Some people are Christian', model, embed_lookup=embed_lookup)

In [None]:
baseline_data = []
baseline_data.append(get_pred('Some people are gay', model, embed_lookup=embed_lookup)[1])
baseline_data.append(get_pred('Some people are straight', model, embed_lookup=embed_lookup)[1])
baseline_data.append(get_pred('Some people are Jewish', model, embed_lookup=embed_lookup)[1])
baseline_data.append(get_pred('Some people are Muslim', model, embed_lookup=embed_lookup)[1])
baseline_data.append(get_pred('Some people are Christian', model, embed_lookup=embed_lookup)[1])
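
The five `append` calls are repeated for every training variant in this notebook. One possible way to factor them out is sketched below. `collect_scores` and `SENTENCES` are hypothetical names, and `fake_predict` is a stand-in so the sketch runs without a trained model; the only assumption about `get_pred` is that its return value can be indexed with `[1]`, as the cells in this notebook do.

```python
SENTENCES = [
    'Some people are gay',
    'Some people are straight',
    'Some people are Jewish',
    'Some people are Muslim',
    'Some people are Christian',
]

def collect_scores(predict, sentences=SENTENCES):
    """Return element [1] of predict(sentence) for each sentence, in order."""
    return [predict(sentence)[1] for sentence in sentences]

# Stand-in predictor: returns (label, score) with the sentence length as the score.
def fake_predict(sentence):
    return ('label', len(sentence))

scores = collect_scores(fake_predict)
```

In the notebook, the equivalent call would be `collect_scores(lambda s: get_pred(s, model, embed_lookup=embed_lookup))`.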

## Setup train eval for blindness

In [None]:
train_data = get_jigsaw_datasets(device=DEVICE, data_type='blindness', embed_lookup=embed_lookup)
train_loader = DataLoader(train_data, batch_size=64)
optimizer = torch.optim.AdamW(model.parameters())
loss_fn = ERM_loss(torch.nn.CrossEntropyLoss())

In [None]:
for epoch in range(epochs):
    print(f'Epoch {epoch+1}/{epochs}')
    train(train_loader, model, loss_fn, optimizer, verbose=True)

In [None]:
get_pred('Some people are gay', model, embed_lookup=embed_lookup)
get_pred('Some people are straight', model, embed_lookup=embed_lookup)
get_pred('Some people are Jewish', model, embed_lookup=embed_lookup)
get_pred('Some people are Muslim', model, embed_lookup=embed_lookup)
get_pred('Some people are Christian', model, embed_lookup=embed_lookup)

In [None]:
blindness_data = []
blindness_data.append(get_pred('Some people are gay', model, embed_lookup=embed_lookup)[1])
blindness_data.append(get_pred('Some people are straight', model, embed_lookup=embed_lookup)[1])
blindness_data.append(get_pred('Some people are Jewish', model, embed_lookup=embed_lookup)[1])
blindness_data.append(get_pred('Some people are Muslim', model, embed_lookup=embed_lookup)[1])
blindness_data.append(get_pred('Some people are Christian', model, embed_lookup=embed_lookup)[1])




## Setup train eval for augment

In [None]:
train_data = get_jigsaw_datasets(device=DEVICE, data_type='augment', embed_lookup=embed_lookup)
train_loader = DataLoader(train_data, batch_size=64)
optimizer = torch.optim.AdamW(model.parameters())
loss_fn = ERM_loss(torch.nn.CrossEntropyLoss())

In [None]:
for epoch in range(epochs):
    print(f'Epoch {epoch+1}/{epochs}')
    train(train_loader, model, loss_fn, optimizer, verbose=True)

In [None]:
get_pred('Some people are gay', model, embed_lookup=embed_lookup)
get_pred('Some people are straight', model, embed_lookup=embed_lookup)
get_pred('Some people are Jewish', model, embed_lookup=embed_lookup)
get_pred('Some people are Muslim', model, embed_lookup=embed_lookup)
get_pred('Some people are Christian', model, embed_lookup=embed_lookup)

In [None]:
augment_data = []
augment_data.append(get_pred('Some people are gay', model, embed_lookup=embed_lookup)[1])
augment_data.append(get_pred('Some people are straight', model, embed_lookup=embed_lookup)[1])
augment_data.append(get_pred('Some people are Jewish', model, embed_lookup=embed_lookup)[1])
augment_data.append(get_pred('Some people are Muslim', model, embed_lookup=embed_lookup)[1])
augment_data.append(get_pred('Some people are Christian', model, embed_lookup=embed_lookup)[1])

## Setup train eval for CTF

In [None]:
train_data, A = get_jigsaw_datasets(device=DEVICE, data_type='CLP', embed_lookup=embed_lookup)
train_loader = DataLoader(train_data, batch_size=64)
optimizer = torch.optim.AdamW(model.parameters())
loss_fn = CLP_loss(torch.nn.CrossEntropyLoss(), A, lmbda=float(5))
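
For intuition about the `lmbda` weight, the sketch below assumes that CLP stands for counterfactual logit pairing: the task loss plus `lmbda` times the average gap between the logit of an example and the logit of its counterfactual. This is an assumption about the general idea, written in plain Python; the actual `CLP_loss` in `loss.py` and the role of `A` are not shown in this notebook and may differ.

```python
def clp_penalty(logits, counterfactual_logits):
    """Mean absolute gap between each logit and its counterfactual's logit."""
    gaps = [abs(a - b) for a, b in zip(logits, counterfactual_logits)]
    return sum(gaps) / len(gaps)

def clp_objective(task_loss, logits, counterfactual_logits, lmbda=5.0):
    """Task loss plus lmbda times the pairing penalty (illustrative only)."""
    return task_loss + lmbda * clp_penalty(logits, counterfactual_logits)

# Two examples: the first logit moves by 1.0 under the counterfactual, the second is unchanged.
total = clp_objective(task_loss=0.5, logits=[1.0, -2.0], counterfactual_logits=[0.0, -2.0])
```

A larger `lmbda` makes disagreement between an example and its counterfactual cost more relative to the classification loss.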

In [None]:
for epoch in range(epochs):
    print(f'Epoch {epoch+1}/{epochs}')
    train(train_loader, model, loss_fn, optimizer, verbose=True)

In [None]:
get_pred('Some people are gay', model, embed_lookup=embed_lookup)
get_pred('Some people are straight', model, embed_lookup=embed_lookup)
get_pred('Some people are Jewish', model, embed_lookup=embed_lookup)
get_pred('Some people are Muslim', model, embed_lookup=embed_lookup)
get_pred('Some people are Christian', model, embed_lookup=embed_lookup)

In [None]:
ctf_data = []
ctf_data.append(get_pred('Some people are gay', model, embed_lookup=embed_lookup)[1])
ctf_data.append(get_pred('Some people are straight', model, embed_lookup=embed_lookup)[1])
ctf_data.append(get_pred('Some people are Jewish', model, embed_lookup=embed_lookup)[1])
ctf_data.append(get_pred('Some people are Muslim', model, embed_lookup=embed_lookup)[1])
ctf_data.append(get_pred('Some people are Christian', model, embed_lookup=embed_lookup)[1])

## Create dataframe

In [None]:
lst = [baseline_data, blindness_data, augment_data, ctf_data]
df = pd.DataFrame(
    lst,
    columns=['Some people are gay', 'Some people are straight', 'Some people are Jewish',
             'Some people are Muslim', 'Some people are Christian'],
    index=['Baseline', 'Blindness', 'Augment', 'CLP lambda = 5'],
)
df = df.T
print(df)

df.to_csv(Path("./data/table5.csv"))