# GPT2 Models

<a href="https://colab.research.google.com/github/hjesse92/style_transfer_w266/blob/main/notebooks/GPT2_Models.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Setup

In [1]:
!pip install -q transformers rouge_score accelerate evaluate

In [1]:
#Am I running a GPU and what type is it?
!nvidia-smi

Sun Apr  2 21:22:38 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12    Driver Version: 525.85.12    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  NVIDIA A10G         On   | 00000000:00:1E.0 Off |                    0 |
|  0%   23C    P8    31W / 300W |      0MiB / 23028MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [2]:
import torch

# Clear out cuda
torch.cuda.empty_cache()

if torch.cuda.is_available():     
    device = torch.device("cuda")
    print('Number of GPU(s) available:', torch.cuda.device_count())
    print('GPU device name:', torch.cuda.get_device_name(0))

else:
    print('No GPU available')
    device = torch.device("cpu")

Number of GPU(s) available: 1
GPU device name: NVIDIA A10G


In [3]:
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader

from transformers import GPT2LMHeadModel, AutoTokenizer, GPT2Tokenizer
from datasets import load_metric, load_dataset
from transformers import AdamW, TrainingArguments, Trainer, DataCollatorForLanguageModeling

import re
import random
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
import pprint
import nltk

from logging import warning
import warnings
warnings.filterwarnings('ignore')

RANDOM_SEED = 42
np.random.seed(RANDOM_SEED)
torch.manual_seed(RANDOM_SEED)
nltk.download('punkt')

[nltk_data] Downloading package punkt to /home/ubuntu/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [7]:
# from google.colab import drive
# drive.mount('/content/drive')

Mounted at /content/drive


In [4]:
# cd drive/MyDrive/w266/style_transfer_w266/
%cd ..

/home/ubuntu/style_transfer_w266


In [6]:
train_file = 'data/original-train.tsv'
dev_file = 'data/original-dev.tsv'
test_file = 'data/original-test.tsv'
df_train = pd.read_csv(train_file, sep='\t')
df_dev = pd.read_csv(dev_file, sep='\t')
df_test = pd.read_csv(test_file, sep='\t')

## Few Shot Learning GPT2

In [None]:
#@title 3-Shot Learning
df_fewshot = df_test.copy()
n_shots = 3

for row in range(len(df_fewshot)):
    # For each item in the test set, draw 3 examples from the training set to use as the n shots
    sources_targets = df_train.sample(n_shots, replace=False)

    for i in range(n_shots):
        df_fewshot.loc[row, f'shot{i+1}_source'] = sources_targets.iloc[i,0]
        df_fewshot.loc[row, f'shot{i+1}_target'] = sources_targets.iloc[i,1]


df_fewshot = df_fewshot.rename(columns={'offensive-text':'source', 'style-transferred-text':'target'})
df_fewshot.head()

Unnamed: 0,source,target,shot1_source,shot1_target,shot2_source,shot2_target,shot3_source,shot3_target
0,So maybe you should be more retarded.,So maybe you should be more backward,Has been a US military puppet from jump,Has been a tool for the US military from the b...,It's not though you fucking idiot,It's not though,Maybe you shouldn't turn your new country into...,You should be careful not to bring your old pr...
1,THERES A MEGATHREAD FOR VACCINE OR COVID RELAT...,THERES ACTUALLY A MEGATHREAD FOR VACCINE OR CO...,wasn't that like 12 feds and an autistic fuck,wasn't that like 12 feds and an unintelligent ...,"Sympathy: That sucks, I'm sorry you feel that ...","Sympathy: That stinks, I'm sorry you feel that...","Yep, it sucks...","Yeah, that's bad"
2,the fuck.... you are on drugs,uh..... you are not being realistic,This guy is overrated..... But he is right abo...,This guy is overrated...... But he is right ab...,That roommate's presence is gross and we need ...,That roommate's presence is unbearable and we ...,Funny how Biden canceled Trump’s insulin cap w...,Funny how Biden canceled Trump’s insulin cap w...
3,NTA. Dump his ass. Tablemanners are no rocke...,You should leave him. Tablemanners are no rock...,You don’t deserve it dumbass.,You don't deserve it.,I just realized you are from India LMAO bet yo...,I just realized you are from India LMAO are yo...,Also sounds like facism,Also sounds like autocracy
4,Youre soft as baby shit,Youre really soft,"Don't dare say a word, whitey. We run this lib...",Don't dare say a word. This sub is liberal.,Yep my Karma is going way down 😆 and my masoch...,Yep my Karma is going way down 😆 and my pain l...,"Pre-Covid Libtards: My body, my choice! Black...","Pre-Covid Liberals: My body, my choice! Black ..."


In [None]:
# df_fewshot['prompt'] = df_fewshot.apply(lambda x: 
#                  'Rewrite the toxic text in non-toxic style: \n\n'
#                  'Toxic text: ' + x['shot1_source'] + '\n' + 'Non-toxic text: ' + x['shot1_target'] + '\n\n' + \
#                  'Toxic text: ' + x['shot2_source'] + '\n' + 'Non-toxic text: ' + x['shot2_target'] + '\n\n' + \
#                  'Toxic text: ' + x['shot3_source'] + '\n' + 'Non-toxic text: ' + x['shot3_target'] + '\n\n' + \
#                  'Toxic text: ' + x['source'] + '\n' + 'Non-toxic text: ', axis=1)

df_fewshot['prompt'] = df_fewshot.apply(lambda x: 
                 'Rewrite the toxic text in non-toxic style: \n###\n'
                 'Toxic text: ' + x['shot1_source'] + '\n' + 'Non-toxic text: ' + x['shot1_target'] + '\n' + '###' + '\n' +\
                 'Toxic text: ' + x['shot2_source'] + '\n' + 'Non-toxic text: ' + x['shot2_target'] + '\n' + '###' + '\n' +\
                 'Toxic text: ' + x['shot3_source'] + '\n' + 'Non-toxic text: ' + x['shot3_target'] + '\n' + '###' + '\n' +\
                 'Toxic text: ' + x['source'] + '\n' + 'Non-toxic text: ', axis=1)

In [None]:
print(df_fewshot['prompt'][0])

Rewrite the toxic text in non-toxic style: 
###
Toxic text: Has been a US military puppet from jump
Non-toxic text: Has been a tool for the US military from the beginning
###
Toxic text: It's not though you fucking idiot
Non-toxic text: It's not though
###
Toxic text: Maybe you shouldn't turn your new country into the shithole you just left.
Non-toxic text: You should be careful not to bring your old problems from your old country to your new country.
###
Toxic text: So maybe you should be more retarded.
Non-toxic text: 
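
The prompt layout printed above can also be assembled by a small helper instead of a long string concatenation. This is a minimal sketch, not the notebook's own code: `build_prompt` and its arguments are hypothetical names introduced here, and the function only assumes the same `###` delimiter and `Toxic text:` / `Non-toxic text:` labels used in the cell above.

```python
def build_prompt(shots, source,
                 instruction="Rewrite the toxic text in non-toxic style: "):
    """Assemble a few-shot prompt from (toxic, non_toxic) pairs and a new source."""
    blocks = [instruction]
    for toxic, non_toxic in shots:
        blocks.append(f"Toxic text: {toxic}\nNon-toxic text: {non_toxic}")
    # The final block is left open so the model completes the rewrite
    blocks.append(f"Toxic text: {source}\nNon-toxic text: ")
    return "\n###\n".join(blocks)


# Example with one shot taken from the table above
prompt = build_prompt([("Yep, it sucks...", "Yeah, that's bad")],
                      "Youre soft as baby shit")
print(prompt)
```

Because the number of shots is just the length of `shots`, the same helper would work if `n_shots` were changed from 3.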


In [None]:
gpt2model = GPT2LMHeadModel.from_pretrained("gpt2")
gpt2model.to(device)
gpt2tokenizer = AutoTokenizer.from_pretrained("gpt2")

Downloading (…)olve/main/vocab.json:   0%|          | 0.00/1.04M [00:00<?, ?B/s]

Downloading (…)olve/main/merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

Downloading (…)/main/tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

In [None]:
for i in range(len(df_fewshot)):
    if i % 10 == 0:
        print(f'Working on number: {i}')
    # Tokenize one prompt at a time, so no padding is needed
    gpt2_input = gpt2tokenizer([df_fewshot.prompt[i]], return_tensors='pt')
    gpt2_input_ids = gpt2_input.input_ids.to(device)
    gpt2_input_mask = gpt2_input.attention_mask.to(device)

    # Beam search combined with sampling; note that max_length counts the prompt tokens too
    generated_ids = gpt2model.generate(input_ids=gpt2_input_ids, 
                                      attention_mask=gpt2_input_mask,
                                      num_beams=5,
                                      no_repeat_ngram_size=2,
                                      num_return_sequences=1,
                                      top_p = 0.92,
                                      top_k = 50,
                                      max_length = 512,
                                      do_sample=True,
                                      temperature=0.9,
                                      early_stopping=True,
                                      )
    # The decoded text contains the prompt followed by the generated continuation
    output = gpt2tokenizer.batch_decode(generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
    df_fewshot.loc[i,'fewshot_output'] = output

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.

Working on number: 0
Working on number: 10
Working on number: 20
Working on number: 30
Working on number: 40
Working on number: 50
Working on number: 60
Working on number: 70
Working on number: 80
Working on number: 90
Working on number: 100
Working on number: 110
Working on number: 120
Working on number: 130
Working on number: 140
Working on number: 150
Working on number: 160
Working on number: 170
Working on number: 180
Working on number: 190


In [None]:
df_fewshot.loc[:,['source','target','prompt','fewshot_output']]

Unnamed: 0,source,target,prompt,fewshot_output
0,So maybe you should be more retarded.,So maybe you should be more backward,Rewrite the toxic text in non-toxic style: \n#...,Rewrite the toxic text in non-toxic style: \n#...
1,THERES A MEGATHREAD FOR VACCINE OR COVID RELAT...,THERES ACTUALLY A MEGATHREAD FOR VACCINE OR CO...,Rewrite the toxic text in non-toxic style: \n#...,Rewrite the toxic text in non-toxic style: \n#...
2,the fuck.... you are on drugs,uh..... you are not being realistic,Rewrite the toxic text in non-toxic style: \n#...,Rewrite the toxic text in non-toxic style: \n#...
3,NTA. Dump his ass. Tablemanners are no rocke...,You should leave him. Tablemanners are no rock...,Rewrite the toxic text in non-toxic style: \n#...,Rewrite the toxic text in non-toxic style: \n#...
4,Youre soft as baby shit,Youre really soft,Rewrite the toxic text in non-toxic style: \n#...,Rewrite the toxic text in non-toxic style: \n#...
...,...,...,...,...
194,"NTA, Jes-us your brother is dumb. How long did...","NTA, your brother is not thinking straight. Ho...",Rewrite the toxic text in non-toxic style: \n#...,Rewrite the toxic text in non-toxic style: \n#...
195,Formula one…nascar….shit it exactly the same e...,Formula one…nascar….it exactly the same except...,Rewrite the toxic text in non-toxic style: \n#...,Rewrite the toxic text in non-toxic style: \n#...
196,"BRB, gonna go call the mods pathetic egotistic...","BRB, gonna go call the mods.",Rewrite the toxic text in non-toxic style: \n#...,Rewrite the toxic text in non-toxic style: \n#...
197,CUCKOLD Carlson is a problem,Carlson is the problem,Rewrite the toxic text in non-toxic style: \n#...,Rewrite the toxic text in non-toxic style: \n#...


In [None]:
print(df_fewshot['fewshot_output'][0])

Rewrite the toxic text in non-toxic style: 
###
Toxic text: Has been a US military puppet from jump
Non-toxic text: Has been a tool for the US military from the beginning
###
Toxic text: It's not though you fucking idiot
Non-toxic text: It's not though
###
Toxic text: Maybe you shouldn't turn your new country into the shithole you just left.
Non-toxic text: You should be careful not to bring your old problems from your old country to your new country.
###
Toxic text: So maybe you should be more retarded.
Non-toxic text:  I don't know what to do with you. You're just a fucking moron. I'm just going to give you the benefit of the doubt.


In [None]:
import re

def extract_predicted_text(x):
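  # Take the fifth '###'-delimited block: block 0 is the instruction, blocks 1-3 are the shots
  few_shot_pred = x.split('\n###\n')[4]
  # Extract the text GPT-2 generated after the final "Non-toxic text: " label (first line only)
  few_shot_pred = re.findall("\nNon-toxic text: (.*)", few_shot_pred)[0]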
  return few_shot_pred
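
The extraction above raises an `IndexError` if a generation happens to lack the expected delimiter or label. A more defensive variant might look like the sketch below. It is an illustration under assumptions rather than the notebook's method: `extract_prediction` and its `n_shots` parameter are hypothetical names, and stripping the surrounding whitespace is an added choice that the original function does not make.

```python
import re


def extract_prediction(generated, n_shots=3):
    """Return the model's rewrite for the final prompt block, or '' if absent."""
    blocks = generated.split("\n###\n")
    # Block 0 is the instruction and blocks 1..n_shots are the examples,
    # so the block the model had to complete sits at index n_shots + 1
    if len(blocks) <= n_shots + 1:
        return ""
    # '.' does not match a newline, so only the first generated line is kept
    match = re.search(r"\nNon-toxic text: (.*)", blocks[n_shots + 1])
    return match.group(1).strip() if match else ""


# Tiny one-shot example with made-up text
demo = ("Instruction: \n###\nToxic text: a\nNon-toxic text: b"
        "\n###\nToxic text: c\nNon-toxic text:  hello there\nextra line")
print(extract_prediction(demo, n_shots=1))
```

Returning an empty string keeps the row in the DataFrame, so malformed generations can be counted or inspected afterwards instead of stopping the `apply` call.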

In [None]:
df_fewshot['fewshot_output_text'] = df_fewshot['fewshot_output'].apply(extract_predicted_text)

In [None]:
df_fewshot.loc[:,['source','target','prompt','fewshot_output_text']]

Unnamed: 0,source,target,prompt,fewshot_output_text
0,So maybe you should be more retarded.,So maybe you should be more backward,Rewrite the toxic text in non-toxic style: \n#...,I don't know what to do with you. You're just...
1,THERES A MEGATHREAD FOR VACCINE OR COVID RELAT...,THERES ACTUALLY A MEGATHREAD FOR VACCINE OR CO...,Rewrite the toxic text in non-toxic style: \n#...,"I'm not going to talk about that shit here, b..."
2,the fuck.... you are on drugs,uh..... you are not being realistic,Rewrite the toxic text in non-toxic style: \n#...,"I don't care what you think about me, I'm her..."
3,NTA. Dump his ass. Tablemanners are no rocke...,You should leave him. Tablemanners are no rock...,Rewrite the toxic text in non-toxic style: \n#...,"I'm not a rocket scientist, I'm a human being."
4,Youre soft as baby shit,Youre really soft,Rewrite the toxic text in non-toxic style: \n#...,I'm not soft. I'm hard as a baby. I've got a...
...,...,...,...,...
194,"NTA, Jes-us your brother is dumb. How long did...","NTA, your brother is not thinking straight. Ho...",Rewrite the toxic text in non-toxic style: \n#...,I'm not going to lie to you. I know what you'...
195,Formula one…nascar….shit it exactly the same e...,Formula one…nascar….it exactly the same except...,Rewrite the toxic text in non-toxic style: \n#...,"You're not going to win a race, you're just n..."
196,"BRB, gonna go call the mods pathetic egotistic...","BRB, gonna go call the mods.",Rewrite the toxic text in non-toxic style: \n#...,"I don't care what you think about me, I'm goi..."
197,CUCKOLD Carlson is a problem,Carlson is the problem,Rewrite the toxic text in non-toxic style: \n#...,"I'm not sure if it's true or not, but he's a ..."


In [None]:
df_fewshot = df_fewshot.loc[:,['source','target','prompt','fewshot_output_text']]

In [None]:
df_fewshot.to_csv('outputs/gpt2_few_shot_output.csv',sep='\t',index=False)

### Evaluation with ROUGE

In [None]:
import evaluate

rouge = evaluate.load('rouge')

Downloading builder script:   0%|          | 0.00/6.27k [00:00<?, ?B/s]

In [None]:
#@title Score after few shot learning
print(rouge.compute(predictions=df_fewshot.fewshot_output_text,
              references=df_fewshot.target))

{'rouge1': 0.11761212163377222, 'rouge2': 0.016773842025308953, 'rougeL': 0.0988545581550871, 'rougeLsum': 0.09854884567491168}
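
To make the `rouge1` number easier to interpret, here is a simplified sketch of a unigram-overlap F1 score. It is not the implementation behind `evaluate.load('rouge')`: it assumes lowercased whitespace tokenization and no stemming, so its values will generally differ from the scores reported above. `rouge1_f1` is a hypothetical helper name.

```python
from collections import Counter


def rouge1_f1(prediction, reference):
    """Unigram-overlap F1 between a prediction and a single reference."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Clipped overlap: each token counts at most as often as it appears in both texts
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


print(rouge1_f1("Youre really soft", "Youre really soft"))
print(rouge1_f1("I'm not soft.", "Youre really soft"))
```

A low score like the one above means the generated continuations share few words with the reference rewrites, which is consistent with GPT-2 producing free-form replies rather than edits of the source sentence.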


### Evaluation with NonToxicScore

In [8]:
import sys
sys.path.append('./notebooks')
from DistilBertClassification import BertClassificationML, NonToxicScoreDataLoader, NonToxicScore

# Load DistilBERT Classification Model to calculate NonToxicScore
score_model = BertClassificationML()
score_model = score_model.to(device)

# Load trained weights
pretrained_weights = torch.load('models/DistilBertToxicClassification7.pth', map_location=device)
score_model.load_state_dict(pretrained_weights)

Downloading (…)lve/main/config.json:   0%|          | 0.00/483 [00:00<?, ?B/s]

Downloading pytorch_model.bin:   0%|          | 0.00/268M [00:00<?, ?B/s]

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertModel: ['vocab_projector.weight', 'vocab_transform.weight', 'vocab_layer_norm.weight', 'vocab_projector.bias', 'vocab_layer_norm.bias', 'vocab_transform.bias']
- This IS expected if you are initializing DistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


<All keys matched successfully>

In [9]:
output_file = 'outputs/gpt2_few_shot_output.csv'
output_col = 'fewshot_output_text'

# Create Data Loader
score_loader = NonToxicScoreDataLoader(output_file, output_col)

# Calculate NonToxicScore
fewshot_NonToxicScores, avg_score = NonToxicScore(score_loader, score_model)

Downloading (…)solve/main/vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

Downloading (…)okenizer_config.json:   0%|          | 0.00/28.0 [00:00<?, ?B/s]

{'NonToxicScore': 0.572811652515458}


In [16]:
# Look at output texts with a low NonToxicScore, i.e. still classified as toxic after style transfer
df_fewshot = pd.read_csv('outputs/gpt2_few_shot_output.csv',sep='\t')
df_fewshot['NonToxicScore'] = fewshot_NonToxicScores

df_fewshot_lowscore = df_fewshot[df_fewshot['NonToxicScore'] < 0.5][['source', 'target', 'fewshot_output_text', 'NonToxicScore']]

for i in range(len(df_fewshot_lowscore)):
    print(df_fewshot_lowscore.fewshot_output_text.values[i])
    print('NonToxicScore:', df_fewshot_lowscore.NonToxicScore.values[i])

 I don't know what to do with you. You're just a fucking moron. I'm just going to give you the benefit of the doubt.
NonToxicScore: 0.0013592148898169398
 I'm not going to talk about that shit here, because I don't want you to hear about it, but if you want to, you can go ahead and do it here. If you're not sure how to do that, here's a link to get started: http://www.youtube.com/watch?v=6WZJtR6xQyU
NonToxicScore: 0.020876146852970123
 I'm not soft. I'm hard as a baby.  I've got a lot of shit to deal with, but I can handle it. You're not a soft baby, you're a hard guy.
NonToxicScore: 0.08097605407238007
 I don't care if you're a man or a woman. If you are, I'm going to kick you out of the house. I will not allow you to be a part of this shit. You are not a human being, you‪t belong in this fucking house and you will be kicked out.
NonToxicScore: 0.001381579670123756
 I'm not a racist. I don't want to be. But I do want you to know that I am not racist, and I will not tolerate your bigot

## Fine Tuning GPT2

In [5]:
# Clear out cuda
torch.cuda.empty_cache()

In [60]:
gpt2tokenizer = GPT2Tokenizer.from_pretrained("gpt2", 
                                              bos_token='<|startoftext|>',
                                              eos_token='<|endoftext|>',
                                              pad_token='<pad>'
                                             )

gpt2model = GPT2LMHeadModel.from_pretrained("gpt2")
gpt2model.to(device)
gpt2model.resize_token_embeddings(len(gpt2tokenizer))

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


Embedding(50259, 768)
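
The embedding size printed above follows from simple arithmetic. A minimal sketch, assuming the stock `gpt2` vocabulary holds 50,257 tokens and already contains `<|endoftext|>` (its id, 50256, shows up in the generation warnings later in this notebook), so that only `<|startoftext|>` and `<pad>` are new:

```python
# Stock GPT-2 vocabulary size (assumption stated in the text above)
BASE_VOCAB_SIZE = 50257

# Special tokens passed to GPT2Tokenizer.from_pretrained in the cell above
special_tokens = ['<|startoftext|>', '<|endoftext|>', '<pad>']

# '<|endoftext|>' is assumed to exist already, so it adds no new embedding row
already_in_vocab = {'<|endoftext|>'}
new_tokens = [tok for tok in special_tokens if tok not in already_in_vocab]

new_vocab_size = BASE_VOCAB_SIZE + len(new_tokens)
print(new_vocab_size)  # 50259, matching Embedding(50259, 768)
```

The two new rows start out untrained, which is why the tokenizer warns that the associated embeddings should be fine-tuned.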

In [6]:
train_file = 'data/original-train.tsv'
dev_file = 'data/original-dev.tsv'
test_file = 'data/original-test.tsv'
df_train = pd.read_csv(train_file, sep='\t')
df_dev = pd.read_csv(dev_file, sep='\t')
df_test = pd.read_csv(test_file, sep='\t')

In [7]:
dataset = load_dataset('csv', sep="\t",
                       data_files={'train': train_file, 'validation': dev_file,'test': test_file})

Found cached dataset csv (/home/ubuntu/.cache/huggingface/datasets/csv/default-c5dfda268eae0812/0.0.0/6954658bab30a358235fa864b05cf819af0e179325c740e4bc853bcc7ec513e1)


In [61]:
## Data Clean Up
import re

def clean_up_text(x):
  """Remove line breaks, special characters, and extra whitespace from a post."""
  # Replace line breaks and HTML <br> tags with a space.
  # This runs first: the next step strips "<" and ">", after which <br> could no longer match.
  x = re.sub(r"\r\n|\n|\r|<br\s*/?>", " ", x)

  # Remove special characters and some punctuation
  SPECIAL_CHARS_PATTERN = re.compile(r"(\*)|(\~)|(\=)|(\’)|(\_)|(\-)|(\")|(\|)|(\()|(\))|(\[)|(\])|(\%)|(\$)|(\>)|(\<)|(\\)|(\{)|(\})")
  x = SPECIAL_CHARS_PATTERN.sub("", x)

  # Collapse runs of whitespace into a single space
  x = re.sub(r"\s+", " ", x.strip())

  return x
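
To make the effect of the clean-up concrete, here is a small self-contained sketch. It restates the same three steps with a character class instead of the long alternation, and it applies the line-break step first; both are choices made for this illustration, and the sample string is made up.

```python
import re

# Same characters as the notebook's pattern, written as one character class
SPECIAL_CHARS = re.compile(r'[*~=’_\-"|()\[\]%$><\\{}]')

def clean_up_text_sketch(text):
    """Illustrative restatement of the clean-up steps, not the notebook's exact code."""
    text = re.sub(r"\r\n|\n|\r|<br\s*/?>", " ", text)  # line breaks -> space
    text = SPECIAL_CHARS.sub("", text)                   # drop special characters
    return re.sub(r"\s+", " ", text.strip())             # collapse whitespace

sample = "you’re *so*\nwrong   (really)"
print(clean_up_text_sketch(sample))  # youre so wrong really
```

Note that the curly apostrophe is removed while a straight one is kept, which is why forms such as `youre` and `DON'T` both appear in the decoded examples further down.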

In [62]:
def preprocess_data(examples, tokenizer=gpt2tokenizer, data='train'):
    input_prefix = '<|startoftext|>Offensive text: '
    label_prefix = '\nInoffensive text: '
    max_input_length = 512
    max_target_length = 512

    source_inputs = [input_prefix + clean_up_text(text) + label_prefix for text in examples['offensive-text']]
    target_inputs = [clean_up_text(text) for text in examples['style-transferred-text']]

    # Add labels into training set source inputs
    if data == 'train' or data =='validation':  
        source_inputs = [source_inputs[i] + target_inputs[i] +'<|endoftext|>' for i in range(len(source_inputs))]

    # Tokenize inputs and labels
    model_inputs = tokenizer(source_inputs, max_length=max_input_length, padding="max_length", truncation=True)
    target_tokens = tokenizer(target_inputs, max_length=max_target_length, padding="max_length", truncation=True)

    # Labels are not attached here: the language-modeling data collator used for
    # training derives them from input_ids, so target_tokens is currently unused
    # model_inputs["labels"] = target_tokens.input_ids
  
    return model_inputs
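
The prompt layout produced by `preprocess_data` can be summarized with a short sketch. `build_example` is a hypothetical helper introduced only for illustration; the prefixes and the end-of-text marker are taken from the cell above.

```python
BOS = '<|startoftext|>'
EOS = '<|endoftext|>'
INPUT_PREFIX = BOS + 'Offensive text: '
LABEL_PREFIX = '\nInoffensive text: '

def build_example(offensive, target=None):
    """Hypothetical helper mirroring the string layout in preprocess_data.

    With a target (train/validation) the full sequence is returned;
    without one (test) only the prompt the model has to complete.
    """
    prompt = INPUT_PREFIX + offensive + LABEL_PREFIX
    if target is None:
        return prompt
    return prompt + target + EOS

train_text = build_example('Youre soft as baby shit', 'Youre really soft')
test_text = build_example('Youre soft as baby shit')
print(repr(train_text))
print(repr(test_text))
```

Training and validation rows carry the reference rewrite followed by `<|endoftext|>`, so the model learns where to stop; test rows end right after the label prefix.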

In [63]:
encoded_train_ds = dataset['train'].map(lambda x: preprocess_data(x, data='train'), batched=True, remove_columns=dataset['train'].column_names)
encoded_val_ds = dataset['validation'].map(lambda x: preprocess_data(x, data='validation'), batched=True, remove_columns=dataset['validation'].column_names)
encoded_test_ds = dataset['test'].map(lambda x: preprocess_data(x,data='test'), batched=True, remove_columns=dataset['test'].column_names)
encoded_train_ds.set_format(type="torch")
encoded_val_ds.set_format(type="torch")
encoded_test_ds.set_format(type="torch")

Map:   0%|          | 0/1584 [00:00<?, ? examples/s]

Map:   0%|          | 0/198 [00:00<?, ? examples/s]

Map:   0%|          | 0/199 [00:00<?, ? examples/s]

In [64]:
encoded_train_ds

Dataset({
    features: ['input_ids', 'attention_mask'],
    num_rows: 1584
})

In [65]:
# Check train dataset model input
print(gpt2tokenizer.decode(encoded_train_ds[1]['input_ids'], skip_special_tokens=True))

Offensive text: Ok, this makes no sense. This will create vigilantes. And I hope they put this savage animal down before he hurts anyone else.
Inoffensive text: Ok, this makes no sense. This will create vigilantes. And I hope they take care of this uncontrollable person before he hurts anyone else


In [66]:
# Check val dataset model input
print(gpt2tokenizer.decode(encoded_val_ds[1]['input_ids'], skip_special_tokens=True))

Offensive text: My favorite part is that youre still madly replying as if its going to quell your cousin fucking urges.
Inoffensive text: My favorite part is that you are still replying.


In [67]:
# Check test dataset model input
print(gpt2tokenizer.decode(encoded_test_ds[1]['input_ids'], skip_special_tokens=True))

Offensive text: THERES A MEGATHREAD FOR VACCINE OR COVID RELATED TOPICS. DON'T TALK ABOUT THAT SHIT HERE IDIOT!
Inoffensive text: 


In [12]:
# print(gpt2tokenizer.decode(encoded_train_ds[1]['labels'], skip_special_tokens=True))

Ok, this makes no sense. This will create vigilantes. And I hope they take care of this uncontrollable person before he hurts anyone else


In [68]:
# Trainer Argument
batch_size = 8
model_dir = "models/gpt2"

args = TrainingArguments(
    output_dir=model_dir,
    overwrite_output_dir=True,
    do_train=True,
    evaluation_strategy="epoch",
    logging_strategy="epoch",
    save_strategy="epoch",
    # logging_steps=100,
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    weight_decay=0.01,
    num_train_epochs=4,
    load_best_model_at_end=True,
    # metric_for_best_model="rouge1",
    remove_unused_columns=False
)

In [17]:
metric = load_metric("rouge")

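def compute_metrics(eval_pred, tokenizer=gpt2tokenizer):
    # Note: Trainer passes raw logits here unless preprocess_logits_for_metrics is set,
    # so predictions may need an argmax over the vocabulary axis before decoding
    predictions, labels = eval_pred
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    # Keep only the text after the label prefix used in preprocess_data;
    # fall back to an empty string when the prefix is missing
    matches = [re.findall(r'\nInoffensive text: (.*)', pred) for pred in decoded_preds]
    decoded_preds = [m[-1] if m else '' for m in matches]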
    
    # labels = np.where(labels != -100, labels, gpt2tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
    
    # Rouge expects a newline after each sentence
    decoded_preds = ["\n".join(nltk.sent_tokenize(pred.strip()))
                      for pred in decoded_preds]
    decoded_labels = ["\n".join(nltk.sent_tokenize(label.strip())) 
                      for label in decoded_labels]
    
    # Compute ROUGE scores
    result = metric.compute(predictions=decoded_preds, references=decoded_labels,
                            use_stemmer=True)

    # Extract ROUGE f1 scores
    result = {key: value.mid.fmeasure * 100 for key, value in result.items()}
    
    # Add mean generated length to metrics
    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id)
                      for pred in predictions]
    result["gen_len"] = np.mean(prediction_lens)
    
    return {k: round(v, 4) for k, v in result.items()}

In [69]:
from transformers import AdamW, get_cosine_schedule_with_warmup

optimizer = AdamW(gpt2model.parameters(), lr=2e-5)
scheduler = get_cosine_schedule_with_warmup(optimizer, num_warmup_steps=10, num_training_steps=800)
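
The `num_training_steps=800` value can be checked against the data. A minimal sketch, assuming a single GPU and no gradient accumulation; `cosine_with_warmup` is a hand-written approximation of the default half-cycle cosine schedule, not the library function itself:

```python
import math

num_examples = 1584  # rows in encoded_train_ds
batch_size = 8
num_epochs = 4

steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs
print(total_steps)  # 792

def cosine_with_warmup(step, warmup_steps=10, training_steps=800):
    """Learning-rate multiplier: linear warmup, then half-cosine decay to zero."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, training_steps - warmup_steps)
    return max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))

# Multiplier at the start, at the end of warmup, and at the last real step
for step in (0, 10, total_steps):
    print(step, cosine_with_warmup(step))
```

Because training stops at step 792 while the schedule is sized for 800, the learning rate ends slightly above zero rather than exactly at it. Computing `num_training_steps` from the dataset size would remove that small mismatch.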

In [70]:
data_collator = DataCollatorForLanguageModeling(tokenizer=gpt2tokenizer, mlm=False)
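
With `mlm=False`, the collator builds causal language-modeling labels from `input_ids`, which is why `preprocess_data` returns no `labels` column. The sketch below imitates that step on plain Python lists; the token ids are made up and `PAD_ID` is a hypothetical id for `<pad>`.

```python
IGNORE_INDEX = -100  # label value that the cross-entropy loss skips

def make_causal_lm_labels(input_ids, pad_token_id):
    """Copy input_ids as labels and mask padding positions."""
    return [tok if tok != pad_token_id else IGNORE_INDEX for tok in input_ids]

PAD_ID = 50258  # hypothetical id for '<pad>'
example_ids = [50257, 9, 8, 7, 50256, PAD_ID, PAD_ID]

labels = make_causal_lm_labels(example_ids, PAD_ID)
print(labels)  # [50257, 9, 8, 7, 50256, -100, -100]
```

The one-position shift between inputs and labels is applied inside `GPT2LMHeadModel`, so the labels stay aligned with `input_ids` here. Since `<pad>` is a separate token from `<|endoftext|>`, the end-of-text position keeps its label and still contributes to the loss. The loss is also computed over the prompt tokens, not only over the rewrite.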

In [71]:
trainer = Trainer(
    model=gpt2model,
    args=args,
    train_dataset=encoded_train_ds, 
    eval_dataset=encoded_val_ds,
    data_collator=data_collator,
    tokenizer=gpt2tokenizer,
    optimizers=(optimizer, scheduler),
    # compute_metrics=compute_metrics
)

In [72]:
trainer.train()

Epoch,Training Loss,Validation Loss
1,8.4224,2.427534
2,2.3769,2.408058
3,2.2702,2.392786
4,2.2324,2.383458


TrainOutput(global_step=792, training_loss=3.8254518219918916, metrics={'train_runtime': 336.2538, 'train_samples_per_second': 18.843, 'train_steps_per_second': 2.355, 'total_flos': 1655546314752000.0, 'train_loss': 3.8254518219918916, 'epoch': 4.0})

In [73]:
# save training weights
trainer.save_model('models/gpt2')
torch.save(gpt2model.state_dict(), 'models/gpt2.pth')

In [58]:
# gpt2tokenizer = GPT2Tokenizer.from_pretrained("gpt2", 
#                                               bos_token='<|startoftext|>',
#                                               eos_token='<|endoftext|>',
#                                               pad_token='<pad>'
#                                              )

# gpt2model = GPT2LMHeadModel.from_pretrained('gpt2')
# gpt2model.resize_token_embeddings(len(gpt2tokenizer))

# # Load training weights
# pretrained_weights = torch.load('models/gpt2.pth')
# gpt2model.load_state_dict(pretrained_weights )

# gpt2model.to(device)
# gpt2model 

In [108]:
#### Test block
test_prompt = "<|startoftext|>Offensive text: You are a special kind of idiot.\nInoffensive text:"
generated = gpt2tokenizer(test_prompt, return_tensors='pt', add_special_tokens=False).input_ids.to(device)
# do_sample=False selects greedy decoding, so top_k, top_p, and temperature have no effect here
sample_outputs = gpt2model.generate(generated,
                                    max_length=512,
                                    min_length=10,
                                    top_k=50,
                                    do_sample=False,
                                    top_p=0.9,
                                    temperature=1.,
                                   )
predicted_text = gpt2tokenizer.decode(sample_outputs[0], skip_special_tokens=True)
predicted_text

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


'Offensive text: You are a special kind of idiot.\nInoffensive text: You are a special kind of person.'
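
The call above sets `do_sample=False`, which means greedy decoding: at each step the most likely token is taken, and the sampling settings (`top_k`, `top_p`, `temperature`) are not used. The toy sketch below, with made-up logits, shows one greedy step and why rescaling by a positive temperature would not change the choice anyway.

```python
import math

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities after temperature scaling."""
    scaled = [value / temperature for value in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]

def greedy_pick(logits, temperature=1.0):
    """Return the index of the most probable token."""
    probs = softmax(logits, temperature)
    return max(range(len(probs)), key=probs.__getitem__)

toy_logits = [1.2, 3.4, 0.5, 2.9]  # made-up scores for a 4-token vocabulary
picks = {t: greedy_pick(toy_logits, t) for t in (0.5, 1.0, 2.0)}
print(picks)  # {0.5: 1, 1.0: 1, 2.0: 1}
```

Greedy decoding is deterministic, so repeated runs on the same prompt give the same rewrite. Setting `do_sample=True` would be needed for the sampling settings to take effect.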

In [81]:
#### Test block
re.findall('\nInoffensive text: (.*)', predicted_text)[0]

'You are a special kind of person.'
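
Indexing `re.findall(...)[0]` raises an `IndexError` when the label prefix is missing from the decoded text. A slightly more defensive version might look like the sketch below; `extract_generation` and its `fallback` argument are hypothetical names introduced here.

```python
import re

LABEL_PREFIX = '\nInoffensive text: '

def extract_generation(decoded_text, fallback='None'):
    """Return the text on the line after the label prefix, or a fallback."""
    # '.' does not match newlines, so only the first generated line is captured
    match = re.search(re.escape(LABEL_PREFIX) + r'(.*)', decoded_text)
    if match is None:
        return fallback
    text = match.group(1).strip()
    return text if text else fallback

decoded = ('Offensive text: You are a special kind of idiot.'
           '\nInoffensive text: You are a special kind of person.')
print(extract_generation(decoded))  # You are a special kind of person.
print(extract_generation('no prefix at all'))  # None
```

Stripping before the emptiness check means whitespace-only generations are also mapped to the fallback.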

In [109]:
# Extract predicted texts for test set
gpt2model.eval()

input_prefix = '<|startoftext|>Offensive text: '
label_prefix = '\nInoffensive text: '

original_texts = dataset['test']['offensive-text']
mod_texts = dataset['test']['style-transferred-text']
generated_texts = []

for org in original_texts:
    prompt = input_prefix + org + label_prefix
    prompt_encoded = gpt2tokenizer(prompt, return_tensors='pt', add_special_tokens=False).input_ids.to(device)
    output =  gpt2model.generate(prompt_encoded,
                                  max_length=512,
                                  min_length=10,
                                  top_k=50,
                                  do_sample=False,
                                  top_p=0.9,
                                  temperature=1., 
                                 )
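    predicted_text = gpt2tokenizer.decode(output[0], skip_special_tokens=True)
    # Take the text after the label prefix; guard against a missing prefix and
    # strip before the emptiness check so whitespace-only generations become "None"
    matches = re.findall('\nInoffensive text: (.*)', predicted_text)
    gen_text = matches[0].strip() if matches else ""
    if len(gen_text) == 0:
        gen_text = "None"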
        
    generated_texts.append(gen_text.strip())

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.

In [106]:
# gpt2model.eval()

# test_dataloader = DataLoader(encoded_test_ds, batch_size=8, collate_fn=data_collator)

# generated_texts = []
# with torch.no_grad():
#     for batch in test_dataloader:
#         input_ids = batch['input_ids'].to(device)
#         # attention_mask = batch['attention_mask'].to(device)
#         output = gpt2model.generate(input_ids=input_ids, 
#                                     # attention_mask=attention_mask,
#                                     max_length=512,
#                                     min_length=10,
#                                     top_k=50,
#                                     do_sample=False,
#                                     top_p=0.9,
#                                     temperature=1., 
#                                     )
#         generated_texts.extend(gpt2tokenizer.batch_decode(output, skip_special_tokens=True))

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Input length of input_ids is 512, but `max_length` is set to 512. This can lead to unexpected behavior. You should consider increasing `max_new_tokens`.

In [110]:
generated_texts[:20]

['ive never seen anything like this',
 "__________THERES A MEGATHREAD FOR VACCINE OR COVID RELATED TOPICS. DON'T TALK ABOUT THAT SHIT HERE.",
 'ive been on drugs for years',
 'NTA.  Dump his ass. Tablemanners are no rocket science. Treating other people like human beings is no rocket science. His still a child.',
 'ive never seen a person so soft',
 '”because everyone knows why. It’s jadas fault. Will was just triggered because of all the shit he has to deal with because of her.',
 '”poor kids are just as bright as white kids”…. Your blundering bafoon chooses racism as well',
 'iced up Tucker is a racist that no one will miss when he’s gone… he’s also a racist ruble-rouser',
 '’there’s plenty of evidence, it’s exactly why they won’t allow forensic audits',
 'iced this up, carlsen is a moron and defines recreational outrage im',
 'None',
 'None',
 'LMFAO  Christ this sub needs help',
 '___________',
 '__________',
 'ive seen less than no value that’s cute coming from some douchebag that

In [89]:
# Create a new DataFrame with the generated texts
df_gen = pd.DataFrame({'gpt2_tuning_output': generated_texts})
df_gen['gpt2_tuning_output'] = df_gen['gpt2_tuning_output'].astype(str)

# Add the generated_text column to the df_test dataframe
df_output = pd.concat([df_test, df_gen], axis=1)

df_output.head(10)

Unnamed: 0,offensive-text,style-transferred-text,gpt2_tuning_output
0,So maybe you should be more retarded.,So maybe you should be more backward,ive never seen anything like this
1,THERES A MEGATHREAD FOR VACCINE OR COVID RELAT...,THERES ACTUALLY A MEGATHREAD FOR VACCINE OR CO...,__________THERES A MEGATHREAD FOR VACCINE OR C...
2,the fuck.... you are on drugs,uh..... you are not being realistic,ive been on drugs for years
3,NTA. Dump his ass. Tablemanners are no rocke...,You should leave him. Tablemanners are no rock...,NTA. Dump his ass. Tablemanners are no rocket...
4,Youre soft as baby shit,Youre really soft,ive never seen a person so soft
5,Because everyone knows why. It’s jadas fault. ...,Because everyone knows it's jadas fault. Will ...,”because everyone knows why. It’s jadas fault....
6,And “poor kids are just as bright as white kid...,"He said ""poor kids are just as bright as white...",”poor kids are just as bright as white kids”…....
7,Tucker is a garbage person that no one will mi...,Tucker is a bad person who instigates racism,iced up Tucker is a racist that no one will mi...
8,"There’s plenty of evidence, it’s exactly why t...","There is evidence, which is why they won't all...","’there’s plenty of evidence, it’s exactly why ..."
9,"Came here to say this, carlsen is a freaking m...","Came here to say this, carlsen is pretty unint...","iced this up, carlsen is a moron and defines r..."


In [92]:
df_output.describe()

Unnamed: 0,offensive-text,style-transferred-text,gpt2_tuning_output
count,199,199,199.0
unique,199,199,132.0
top,So maybe you should be more retarded.,So maybe you should be more backward,
freq,1,1,44.0


In [93]:
# Save the updated dataframe as a CSV file
df_output.to_csv('outputs/gpt2_tuning_output.csv',sep='\t',index=False)

### Evaluation with Rouge

In [94]:
import evaluate

rouge = evaluate.load('rouge')

In [95]:
print(rouge.compute(predictions=df_output['gpt2_tuning_output'],
              references=df_output['style-transferred-text']))

{'rouge1': 0.27805597863114406, 'rouge2': 0.21061381676608837, 'rougeL': 0.27496133382093735, 'rougeLsum': 0.27455547216970266}
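
For intuition about the `rouge1` number, here is a simplified ROUGE-1 F1 for a single pair, using one row from the test output above. It lowercases and splits on whitespace only, with no stemming, so it is an approximation of what the `rouge` metric computes, not a replacement for it.

```python
from collections import Counter

def rouge1_f1(prediction, reference):
    """Simplified unigram-overlap F1 (whitespace tokens, no stemming)."""
    pred_counts = Counter(prediction.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Clipped overlap: each word counts at most as often as it appears in both
    overlap = sum((pred_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)

score = rouge1_f1('ive never seen a person so soft', 'Youre really soft')
print(round(score, 4))  # 0.2
```

Only `soft` is shared, giving precision 1/7 and recall 1/3. Outputs that copy the offensive input nearly verbatim score high on ROUGE because the references also reuse most of the input, so ROUGE alone does not show whether the toxicity was removed.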


### Evaluation with NonToxicScore

In [96]:
import sys
sys.path.append('./notebooks')
from DistilBertClassification import BertClassificationML, NonToxicScoreDataLoader, NonToxicScore

# Load DistilBERT Classification Model to calculate NonToxicScore
score_model = BertClassificationML()
score_model = score_model.to(device)

# Load training weights
pretrained_weights = torch.load('./models/DistilBertToxicClassification7.pth', map_location=device)
score_model.load_state_dict(pretrained_weights)

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertModel: ['vocab_projector.weight', 'vocab_layer_norm.bias', 'vocab_transform.bias', 'vocab_projector.bias', 'vocab_transform.weight', 'vocab_layer_norm.weight']
- This IS expected if you are initializing DistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


<All keys matched successfully>

In [98]:
## Calculate NonToxicScore
output_file = 'outputs/gpt2_tuning_output.csv'
output_col = 'gpt2_tuning_output'

# Create Data Loader
score_loader = NonToxicScoreDataLoader(output_file, output_col, max_length=512)

# Calculate NonToxicScore
gpt2_NonToxicScores, avg_score = NonToxicScore(score_loader, score_model)

{'NonToxicScore': 0.674387867600498}


In [99]:
# Look at output texts with a low NonToxicScore, i.e. still classified as toxic after style transfer
df_finetune = pd.read_csv('outputs/gpt2_tuning_output.csv',sep='\t')
df_finetune['NonToxicScore'] = gpt2_NonToxicScores

df_finetune_lowscore = df_finetune[df_finetune['NonToxicScore'] < 0.5][['offensive-text', 'style-transferred-text', 'gpt2_tuning_output', 'NonToxicScore']]
df_finetune_lowscore

Unnamed: 0,offensive-text,style-transferred-text,gpt2_tuning_output,NonToxicScore
1,THERES A MEGATHREAD FOR VACCINE OR COVID RELAT...,THERES ACTUALLY A MEGATHREAD FOR VACCINE OR CO...,__________THERES A MEGATHREAD FOR VACCINE OR C...,0.001785
3,NTA. Dump his ass. Tablemanners are no rocke...,You should leave him. Tablemanners are no rock...,NTA. Dump his ass. Tablemanners are no rocket...,0.001312
5,Because everyone knows why. It’s jadas fault. ...,Because everyone knows it's jadas fault. Will ...,”because everyone knows why. It’s jadas fault....,0.111351
6,And “poor kids are just as bright as white kid...,"He said ""poor kids are just as bright as white...",”poor kids are just as bright as white kids”…....,0.022050
7,Tucker is a garbage person that no one will mi...,Tucker is a bad person who instigates racism,iced up Tucker is a racist that no one will mi...,0.015307
...,...,...,...,...
177,Re-education camps for them. Cut their evil ou...,Re-education camps for them. They need to be p...,iced up camps for them. Cut their evil out 100%,0.108084
184,You guys love oppression. Let me get some barb...,You guys seem to love oppression. Let me get s...,iced coffee is bad. Let me get some barbecue r...,0.140414
185,You mean that chick with a dick?,You mean that transgender woman?,ive never seen a chick with a dick.,0.002308
190,You’ve been completely subverted yet here you ...,It's a little ironic that you've been subverte...,”you’ve been completely subverted yet here you...,0.146821


In [100]:
for i in range(len(df_finetune_lowscore)):
    print('Toxic Text:', df_finetune_lowscore['offensive-text'].values[i])
    print('Style Transfer Text:', df_finetune_lowscore.gpt2_tuning_output.values[i])
    print('NonToxicScore:', df_finetune_lowscore.NonToxicScore.values[i])

Toxic Text: THERES A MEGATHREAD FOR VACCINE OR COVID RELATED TOPICS. DON'T TALK ABOUT THAT SHIT HERE IDIOT!
Style Transfer Text: __________THERES A MEGATHREAD FOR VACCINE OR COVID RELATED TOPICS. DON'T TALK ABOUT THAT SHIT HERE.
NonToxicScore: 0.0017852680757641792
Toxic Text: NTA.   Dump his ass. Tablemanners are no rocket science. Treating other people like human beings is no rocket science. His still a child.
Style Transfer Text: NTA.  Dump his ass. Tablemanners are no rocket science. Treating other people like human beings is no rocket science. His still a child.
NonToxicScore: 0.0013116915943101048
Toxic Text: Because everyone knows why. It’s jadas fault. Will was just triggered because of all the shit he has to deal with because of her.
Style Transfer Text: ”because everyone knows why. It’s jadas fault. Will was just triggered because of all the shit he has to deal with because of her.
NonToxicScore: 0.11135101318359375
Toxic Text: And “poor kids are just as bright as white kids”