# 0. Introduction 

The following notebook was created for a final project for CS 6741. The goal was to determine low-cost methods for increasing 2-class toxicity classification using the huggingface jigsaw-toxicity-pred dataset. For a current working paper, please go to https://github.com/kcsadow/Low-Cost-Methods-BERT.

Sections include. 

1. Set up 
2. Experiment 0 -> DistilBERT (i.e. baseline)
3. Misclassification Analysis 
4. Log-Odds Method
5. Experiment 1 -> Log-odds
6. Experiment 2 -> Log-odds for nouns
7. Experiment 3 -> "Hand annotated" misclassified examples
8. Experiment 4 -> Log-odds using "hand annotated" misclassified examples
9. Misclassifaction Overlap Analysis on Experiments 0 and 4
10. Conclusion 

#1. Set-up 

## 1a. Install Python packages

In [1]:
# Install the libraries necessary for HuggingFace datasets and transformers
!pip install datasets
!pip install transformers
#!pip install sacremoses
#!pip install tokenizers
#!pip install sentencepiece

Collecting datasets
[?25l  Downloading https://files.pythonhosted.org/packages/46/1a/b9f9b3bfef624686ae81c070f0a6bb635047b17cdb3698c7ad01281e6f9a/datasets-1.6.2-py3-none-any.whl (221kB)
[K     |█▌                              | 10kB 25.6MB/s eta 0:00:01[K     |███                             | 20kB 32.7MB/s eta 0:00:01[K     |████▍                           | 30kB 22.4MB/s eta 0:00:01[K     |██████                          | 40kB 17.6MB/s eta 0:00:01[K     |███████▍                        | 51kB 15.0MB/s eta 0:00:01[K     |████████▉                       | 61kB 16.9MB/s eta 0:00:01[K     |██████████▍                     | 71kB 13.9MB/s eta 0:00:01[K     |███████████▉                    | 81kB 13.5MB/s eta 0:00:01[K     |█████████████▎                  | 92kB 14.3MB/s eta 0:00:01[K     |██████████████▊                 | 102kB 14.6MB/s eta 0:00:01[K     |████████████████▎               | 112kB 14.6MB/s eta 0:00:01[K     |█████████████████▊              | 122kB 14

In [2]:
#importing packages
from google.colab import drive

import datasets
from datasets import Dataset, load_dataset, concatenate_datasets
from torch.utils.data import Dataset, DataLoader

import torch

import transformers as transformers
from transformers import TrainingArguments, Trainer
from sklearn.metrics import accuracy_score, precision_recall_fscore_support, classification_report

import numpy as np
import pandas as pd

from collections import Counter

import re

import nltk
nltk.download('stopwords')
from nltk import word_tokenize, sent_tokenize
from nltk.stem.snowball import SnowballStemmer
stopwords = nltk.corpus.stopwords.words('english')

nltk.download('averaged_perceptron_tagger')
nltk.download('punkt')
stemmer = SnowballStemmer("english")

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


In [3]:
#Get device and use GPU if available
def get_default_device():
    """Pick GPU if available, else CPU"""
    if torch.cuda.is_available():
        return torch.device('cuda')
    else:
        return torch.device('cpu')
 
def to_device(data, device):
    """Move tensor(s) to chosen device"""
    if isinstance(data, (list,tuple)):
        return [to_device(x, device) for x in data]
    return data.to(device, non_blocking=True)   

device=get_default_device()
device

device(type='cuda')

# 1b. Load data from Google Drive and set up preprocesser

Note that the dataset cannot be accessed from HuggingFace directly. It must be downloaded from its Kaggle page.

In [4]:
# Setting the drive
drive.mount('/content/gdrive')
drive_folder = "gdrive/My Drive/CS6741 - Topics in Natural Language Processing and Machine Learning/CS6741 Replication Project/Final Project"
#drive_folder = "gdrive/My Drive/CS6741 Replication Project/Final Project"

Mounted at /content/gdrive


In [5]:
# Import data from huggingface
data = datasets.load_dataset('jigsaw_toxicity_pred', data_dir=drive_folder+'/data/jigsaw_toxicity_pred')

HBox(children=(FloatProgress(value=0.0, description='Downloading', max=2163.0, style=ProgressStyle(description…




HBox(children=(FloatProgress(value=0.0, description='Downloading', max=639.0, style=ProgressStyle(description_…

Using custom data configuration default-d9843c66a1f7bf76



Downloading and preparing dataset jigsaw_toxicity_pred/default (download: Unknown size, generated: 94.91 MiB, post-processed: Unknown size, total: 94.91 MiB) to /root/.cache/huggingface/datasets/jigsaw_toxicity_pred/default-d9843c66a1f7bf76/1.1.0/b5a7e4444c940e3254416217128ad87ab7a53c9a54db4c72df349baecd5f43e6...


HBox(children=(FloatProgress(value=1.0, bar_style='info', max=1.0), HTML(value='')))



HBox(children=(FloatProgress(value=1.0, bar_style='info', max=1.0), HTML(value='')))

Dataset jigsaw_toxicity_pred downloaded and prepared to /root/.cache/huggingface/datasets/jigsaw_toxicity_pred/default-d9843c66a1f7bf76/1.1.0/b5a7e4444c940e3254416217128ad87ab7a53c9a54db4c72df349baecd5f43e6. Subsequent calls will reuse this data.


In [6]:
# Focusing on 2 part classification, therefore remove all columns in train and test apart from 'comment_text' and 'toxic'
data = data.remove_columns(['severe_toxic', 'obscene', 'threat', 'insult','identity_hate'])

In [7]:
print(data['train'][:10])

{'comment_text': ["Explanation\nWhy the edits made under my username Hardcore Metallica Fan were reverted? They weren't vandalisms, just closure on some GAs after I voted at New York Dolls FAC. And please don't remove the template from the talk page since I'm retired now.89.205.38.27", "D'aww! He matches this background colour I'm seemingly stuck with. Thanks.  (talk) 21:51, January 11, 2016 (UTC)", "Hey man, I'm really not trying to edit war. It's just that this guy is constantly removing relevant information and talking to me through edits instead of my talk page. He seems to care more about the formatting than the actual info.", '"\nMore\nI can\'t make any real suggestions on improvement - I wondered if the section statistics should be later on, or a subsection of ""types of accidents""  -I think the references may need tidying so that they are all in the exact same format ie date format etc. I can do that later on, if no-one else does first - if you have any preferences for formatt

In [8]:
sum(data['train']['toxic'])

15294

In [9]:
# Create a function to preprocess the text (which will help find meaningful log-odds differences later on)
# Credit: https://www.kaggle.com/satyamkryadav/bert-model-96-77

punct = "/-'?!.,#$%\'()*+-/:;<=>@[\\]^_`{|}~"+ '""“”’'+ u"\u25AF"


def clean_special_chars(text, punct):
  for p in punct:
        text=text.lower()
        text = text.replace(p, "")
        text = text.strip('``')
        text = re.sub(r'[0-9]+' , '' ,text)
        text = re.sub(r'&amp;', '&', text)
        text = re.sub(r'\s+', ' ', text).strip()
        encoded_string = text.encode("ascii", "ignore")
        decode_string = encoded_string.decode()
  return decode_string


In [10]:
# Save the true labels of our test set
true_label=data['test']['toxic']
print(len(true_label))

63978


# 2. Experiment 0: Run DistilBERT model

This model runs DistilBERT on jigsaw-toxicity-pred dataset. We limited the classification task to predicting whether a comment is toxic or not. 

In [None]:
# Define the parameters for loading BERT
model_class, tokenizer_class, pretrained_weights = (transformers.DistilBertForSequenceClassification, transformers.DistilBertTokenizerFast, 'distilbert-base-uncased') # To avoid "KeyError: 'loss'" message during train()

# Load tokenizer
tokenizer = tokenizer_class.from_pretrained(pretrained_weights)

HBox(children=(FloatProgress(value=0.0, description='Downloading', max=231508.0, style=ProgressStyle(descripti…




HBox(children=(FloatProgress(value=0.0, description='Downloading', max=466062.0, style=ProgressStyle(descripti…




HBox(children=(FloatProgress(value=0.0, description='Downloading', max=28.0, style=ProgressStyle(description_w…




In [None]:
# Preprocess the dataset
def text_preprocessing_byexample(example):
  example['comment_text'] = clean_special_chars(example['comment_text'], punct)
  return example

preprocessed_dataset = data.map(text_preprocessing_byexample)

print('preprocessed_dataset:', preprocessed_dataset)
print('preprocessed_dataset examples:', preprocessed_dataset['train'][0:10]['comment_text'])

def preprocess_function(examples):
  return tokenizer(examples['comment_text'], truncation=True)

encoded_dataset = preprocessed_dataset.map(preprocess_function, batched=True)
encoded_dataset_2 = encoded_dataset.rename_column("toxic", "label") # To avoid "KeyError: 'loss'" message during train()
type(encoded_dataset_2)

HBox(children=(FloatProgress(value=0.0, max=159571.0), HTML(value='')))




HBox(children=(FloatProgress(value=0.0, max=63978.0), HTML(value='')))


preprocessed_dataset: DatasetDict({
    train: Dataset({
        features: ['comment_text', 'toxic'],
        num_rows: 159571
    })
    test: Dataset({
        features: ['comment_text', 'toxic'],
        num_rows: 63978
    })
})
preprocessed_dataset examples: ['explanation why the edits made under my username hardcore metallica fan were reverted they werent vandalisms just closure on some gas after i voted at new york dolls fac and please dont remove the template from the talk page since im retired now', 'daww he matches this background colour im seemingly stuck with thanks talk january utc', 'hey man im really not trying to edit war its just that this guy is constantly removing relevant information and talking to me through edits instead of my talk page he seems to care more about the formatting than the actual info', 'more i cant make any real suggestions on improvement i wondered if the section statistics should be later on or a subsection of types of accidents i think the refe

HBox(children=(FloatProgress(value=0.0, max=160.0), HTML(value='')))




HBox(children=(FloatProgress(value=0.0, max=64.0), HTML(value='')))




datasets.dataset_dict.DatasetDict

In [None]:
# Fine-tuning the BERT model

# Load pre-trained BERT model
num_labels = 2 # Toxic labels
model = model_class.from_pretrained(pretrained_weights, num_labels=num_labels)

metric_name = "accuracy"

training_args = TrainingArguments(
    evaluation_strategy = "epoch",   # set the evaluation to be done at the end of each epoch
    num_train_epochs=3,              # total number of training epochs
    per_device_train_batch_size=16,  # batch size per device during training
    per_device_eval_batch_size=16,   # batch size for evaluation
    learning_rate=2e-5,              # initial learning rate for Adam optimizer
    warmup_steps=100,                 # number of warmup steps for learning rate scheduler (set lower because of small dataset size)
    weight_decay=0.01,               # strength of weight decay
    load_best_model_at_end=True,
    metric_for_best_model=metric_name,
    output_dir='./results',          # output directory
)

def compute_metrics(pred):
  labels = pred.label_ids
  preds = pred.predictions.argmax(-1)
  acc = accuracy_score(labels,preds)
  return {
      'accuracy': acc
  }

trainer = Trainer(
    model,
    training_args,
    train_dataset=encoded_dataset_2["train"],
    eval_dataset=encoded_dataset_2["test"],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

trainer.train()

HBox(children=(FloatProgress(value=0.0, description='Downloading', max=442.0, style=ProgressStyle(description_…




HBox(children=(FloatProgress(value=0.0, description='Downloading', max=267967963.0, style=ProgressStyle(descri…




Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_transform.bias', 'vocab_layer_norm.bias', 'vocab_transform.weight', 'vocab_projector.weight', 'vocab_layer_norm.weight', 'vocab_projector.bias']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.bias', 'classifier.weight', 'classifier

Epoch,Training Loss,Validation Loss,Accuracy
1,0.0931,0.166074,0.935415
2,0.0645,0.219898,0.927397
3,0.0442,0.320262,0.923614


TrainOutput(global_step=29922, training_loss=0.073197956096679, metrics={'train_runtime': 10639.673, 'train_samples_per_second': 2.812, 'total_flos': 0, 'epoch': 3.0, 'init_mem_cpu_alloc_delta': 1825714176, 'init_mem_gpu_alloc_delta': 268953088, 'init_mem_cpu_peaked_delta': 0, 'init_mem_gpu_peaked_delta': 0, 'train_mem_cpu_alloc_delta': 25358336, 'train_mem_gpu_alloc_delta': 815231488, 'train_mem_cpu_peaked_delta': 0, 'train_mem_gpu_peaked_delta': 6593935872})

In [None]:
#Saving model
cached_model_directory_name = '/results_preprocess'  
trainer.save(drive_folder+cached_model_directory_name)

In [None]:
#Accuracy results
predicted_results = trainer.predict(encoded_dataset_2["test"])
predicted_results.predictions.shape
predicted_labels = predicted_results.predictions.argmax(-1)
predicted_labels.shape
predicted_labels = predicted_labels.flatten().tolist()
print(classification_report(true_label, predicted_labels))

              precision    recall  f1-score   support

           0       0.98      0.95      0.96     57888
           1       0.62      0.84      0.71      6090

    accuracy                           0.94     63978
   macro avg       0.80      0.89      0.84     63978
weighted avg       0.95      0.94      0.94     63978



## 3. Misclassification Analysis: Examining Misclassified Examples from Experiment 0

In [None]:
#Grab misclassified examples from model E0
df=pd.DataFrame(list(zip(true_label, predicted_labels, data['test']['comment_text'])), columns=['true_label', 'predicted_labels', 'comment'])
df.head() 

In [None]:
#Get false positives and false negatives
df_fp=df[(df['true_label']==0) & (df['predicted_labels']==1)]
df_fn=df[(df['true_label']==1) & (df['predicted_labels']==0)]
df_tp=df[(df['true_label']==1) & (df['predicted_labels']==1)]
df_tn=df[(df['true_label']==0) & (df['predicted_labels']==0)]
print('Misclassified', np.sum(np.where(df['true_label']!=df['predicted_labels'], 1, 0)), 'Correctly classified', np.sum(np.where(df['true_label']==df['predicted_labels'], 1, 0)))

In [None]:
#Export results
df_fp.to_csv(drive_folder+'/results/fp.csv')
df_fn.to_csv(drive_folder+'/results/fn.csv')

In [None]:
#Examine False Positives
df_fp2=df_fp.sample(frac=.005)
for x in df_fp2['true_label']:
  for y in df_fp2['predicted_labels']: 
    if (x != y):
      print('TRUE LABEL:', x)
      print('PREDICTED LABEL:', y)
      print('REVIEW TEXT:', df_fp2['comment'][:100])
      print()

TRUE LABEL: 0
PREDICTED LABEL: 1
REVIEW TEXT: 3393     Is it too much to ask that vandalism, idiotic ...
18016    I read a lot about history, and find that shit...
10512    " \n == This is the worst fucking article I ha...
38160    == Oh fucking come on, do you expect it NOT to...
27114           ::Now it's out of my head (GPL brain dump)
57369    :::::Yup...and thanks to assholes like me and ...
25590    == Retake lies == \n\n This section is full of...
61626    ==Alexis Gounaris== \n Cheeky time wasting los...
39147    01B== \n  IO011990,                01B,      5...
14699    *If you go to his website, after a while an ev...
58122    == is it just my crappy computer? == \n\n or i...
60775                         so this hass stuff is stupid
35827                       GUESS what: yer dad \n\n dildo
30967    " \n\n It's not the greatest attraction, howev...
63025    ::::Maybe you should just take a deep breath a...
Name: comment, dtype: object

TRUE LABEL: 0
PREDICTED LABEL: 1
REVIEW

In [None]:
#Examine False Negatives
df_fn2=df_fn.sample(frac=.03)
for x in df_fn2['true_label']:
  for y in df_fn2['predicted_labels']: 
    if (x != y):
      print('TRUE LABEL:', x)
      print('PREDICTED LABEL:', y)
      print('REVIEW TEXT:', df_fn2['comment'][:100])
      print()

[1;30;43mStreaming output truncated to the last 5000 lines.[0m
20654    == There is NO TURK IN TUNISIA AT ALL AT ALL!!...
19534    Its a good thing GIANDREA that you admit you a...
24957         ==  \n == I like poop! == \n\n ==  == \n  ==
29291    Hello \n Were you sexually abused as a kid? Th...
33386    *I think there should be some mention of how h...
62378    TO all My Pakistani Brothers:- you can not jud...
38999    " \n == bias in the article == \n\n I can't qu...
15791    Pride? This school disgusts me. I hear stories...
20834    " \n\n == The word ""penetration"" == \n\n Goo...
36146    I can't believe this article exists.  This is ...
61839    == Putin == \n\n Your addition of a gay porn i...
55012    " \n\n ==Calvin999== \n You remember the above...
36547    u r krazy Asian woman? I don't think the quest...
40060    *Going out of your way to add the photos to ca...
30059    Deliogul you dont know what you are talking ab...
16749    I have a problem with editing!? Hello inc

# 4. Log-odds Method: determining the most unique words in the test dataset

We begin by preprocessing our text so that we can see what the most unique words are in the test versus train examples (sans stopwords). Note that, apart from the stopwords, the preprocessing function is identical to the one run when preparing the dataset for Experiment 0. We perform log-odds calculations on the same inputs that BERT sees during fine-tuning. 

We calculate log-odds because in an online learning setting when new data is constantly streaming in we wouldn't know the "true" labels of incoming examples; however, we can see how different the words of those incoming examples are from the words in our training set through a metric like log-odds. We simulate that situation here by calculating the log-odds difference between  words in our training/test datasets. 

In [11]:
#Pre-processing
punct = "/-'?!.,#$%\'()*+-/:;<=>@[\\]^_`{|}~"+ '""“”’'+ u"\u25AF"
stopwords2 = ["padding", "px", "e", "rowspan", "styleverticalalign"]
#nltk_stopwords = ["i", "me", "my", "myself", "we", "our", "ours", "ourselves", "you", "your", "yours", "yourself", "yourselves", "he", "him", "his", "himself", "she", "her", "hers", "herself", "it", "its", "itself", "they", "them", "their", "theirs", "themselves", "what", "which", "who", "whom", "this", "that", "these", "those", "am", "is", "are", "was", "were", "be", "been", "being", "have", "has", "had", "having", "do", "does", "did", "doing", "a", "an", "the", "and", "but", "if", "or", "because", "as", "until", "while", "of", "at", "by", "for", "with", "about", "against", "between", "into", "through", "during", "before", "after", "above", "below", "to", "from", "up", "down", "in", "out", "on", "off", "over", "under", "again", "further", "then", "once", "here", "there", "when", "where", "why", "how", "all", "any", "both", "each", "few", "more", "most", "other", "some", "such", "no", "nor", "not", "only", "own", "same", "so", "than", "too", "very", "s", "t", "can", "will", "just", "don", "should", "now"]

def clean_special_chars(text, punct):
  for p in punct:
        text=text.lower()
        text = text.replace(p, "")
        text = text.strip('``')
        text = re.sub(r'[0-9]+' , '' ,text)
        text = re.sub(r'&amp;', '&', text)
        text = re.sub(r'\s+', ' ', text).strip()
        encoded_string = text.encode("ascii", "ignore")
        decode_string = encoded_string.decode()
  return decode_string

def preprocess_text(text, punct):
    clean_text=clean_special_chars(text, punct)
    clean_text=re.split('\W+', clean_text)
    clean_text=[token for token in clean_text if token not in stopwords]  
    clean_text=[token for token in clean_text if token not in stopwords2]  
    return " ".join(clean_text)

#Check function output on the previous examples
train = [preprocess_text(x, punct) for x in data['train']['comment_text']]
test = [preprocess_text(x, punct) for x in data['test']['comment_text']]

In [12]:
print(train[0:10])

['explanation edits made username hardcore metallica fan reverted werent vandalisms closure gas voted new york dolls fac please dont remove template talk page since im retired', 'daww matches background colour im seemingly stuck thanks talk january utc', 'hey man im really trying edit war guy constantly removing relevant information talking edits instead talk page seems care formatting actual info', 'cant make real suggestions improvement wondered section statistics later subsection types accidents think references may need tidying exact format ie date format etc later noone else first preferences formatting style references want please let know appears backlog articles review guess may delay reviewer turns listed relevant form eg wikipediagoodarticlenominationstransport', 'sir hero chance remember page thats', 'congratulations well use tools well talk', 'cocksucker piss around work', 'vandalism matt shirvington article reverted please dont banned', 'sorry word nonsense offensive anywa

In [13]:
#Create a counter for the training and test datasets separately

train_count=Counter()
test_count=Counter()

#for line in data['train']['comment_text']:
for line in train:
    tokens=word_tokenize(line.strip())
    for token in tokens:
        train_count[token]+=1

#for line in data['test']['comment_text']:
for line in test:
    tokens=word_tokenize(line.strip())
    for token in tokens:
        test_count[token]+=1

In [14]:
#Getting a counter for ALL vocabulary in train/test

count=Counter()

#for line in data['train']['comment_text']:
for line in train:
    tokens=word_tokenize(line.strip())
    for token in tokens:
        count[token]+=1

#for line in data['test']['comment_text']:
for line in test:
    tokens=word_tokenize(line.strip())
    for token in tokens:
        count[token]+=1

In [15]:
#Getting the entire vocabulary
vocabulary = list(count.keys())
vocab_train = list(train_count.keys())
vocab_test = list(test_count.keys())
vocab_size = len(vocab_train)+len(vocab_test)
vocab_set = set(vocab_train) | set(vocab_test)
len(set(vocab_train) | set(vocab_test))

298952

In [16]:
#Calculating word proportions
word_counts = np.zeros((len(vocabulary), 2))

for row, word in enumerate(vocabulary):
    word_counts[row,0] = train_count[word]
    word_counts[row,1] = test_count[word]

sums = word_counts.sum(axis=0)
word_proportions = word_counts / sums[np.newaxis,:]

In [17]:
#Plotting differences between train and test
def plot_words(y_value):
    y_std = y_value.std()
    pyplot.figure(figsize=(6,18))
    pyplot.scatter(log_type_frequencies, y_value, alpha=0.3)
    for i in range(len(vocabulary)):
        if np.abs(y_value[i]) > 2 * y_std:
            pyplot.text(log_type_frequencies[i], y_value[i], vocabulary[i])
    pyplot.show()

In [18]:
#Plotting differences between train and test
from matplotlib import pyplot
type_frequencies = word_counts.sum(axis=1)
log_type_frequencies = np.log(type_frequencies)
#plot_words(word_proportions[:,0] - word_proportions[:,1])

In [19]:
#Smoothing the word counts - some words have 0 freq
smoothed_word_counts = word_counts + 1.0
smoothed_author_sums = smoothed_word_counts.sum(axis=0)
smoothed_word_odds = smoothed_word_counts / (smoothed_author_sums[np.newaxis,:] - smoothed_word_counts)

In [20]:
#Using smoothed values to log odd difference between test/train
log_odds_difference = np.log(smoothed_word_odds[:,0] / smoothed_word_odds[:,1])
log_odds_variances = 1.0 / smoothed_word_counts[:,0] + 1.0 / smoothed_word_counts[:,1]

In [21]:
#Sorting the log-odd/vocabulary pair - get top 25
def sort_words(scores, n=25):
    sorted_pairs = sorted(zip(scores, vocabulary))
    sorted_words = [w for s,w in sorted_pairs]
    print("[more frequently in test] ", ", ".join(sorted_words[:n]))
    print("...")
    print(", ".join(sorted_words[-n:]), " [more frequently in train]")
sort_words(log_odds_difference / np.sqrt(log_odds_variances), 25)

[more frequently in test]  poop, dicks, gay, youfuck, niggers, bitch, fuck, wtf, youi, prick, stupid, hole, poo, da, boob, willy, bananas, traitor, die, nigger, penis, curps, und, fags, kill
...
pages, fish, thank, lol, wanker, attack, jew, admin, contributions, deletion, pig, blocked, edit, edits, personal, wiki, utc, moron, please, contribs, block, user, wikipedia, page, talk  [more frequently in train]


In [22]:
#Finding the most different words (word counts total)
word_odds=log_odds_difference / np.sqrt(log_odds_variances)
word_counts_tot=word_counts[:,0] + word_counts[:,1]
odds=pd.DataFrame(list(zip(word_counts_tot,  word_counts[:,1], word_odds, vocabulary)), columns=['word_counts','word counts test', 'logg_odds', 'vocabulary'])
#odds.to_csv(drive_folder+'/results/odds.csv')
odds=odds.sort_values(by=['logg_odds'])
odds.head()

Unnamed: 0,word_counts,word counts test,logg_odds,vocabulary
10923,2220.0,1700.0,-41.192611,poop
30329,1620.0,1454.0,-37.258679,dicks
865,5667.0,2936.0,-35.81183,gay
96184,1480.0,1165.0,-34.449425,youfuck
3316,1217.0,1079.0,-32.509025,niggers


In [23]:
print(sum('poop' in s for s in train), sum('poop' in s for s in test))

93 150


# 5a. Experiment 1

Step 1: Determine 20 most unique words in test dataset using log-odds ratios

Step 2: Iterate through list finding one example in the test dataset containing that word


In [None]:
#Using the test dataset to get examples (These aren't misclassified - just the raw test)
test=data['test']['comment_text']
df=pd.DataFrame(test, columns=['comment'])
df['index'] = df.index
df.head()

Unnamed: 0,comment,index
0,Thank you for understanding. I think very high...,0
1,:Dear god this site is horrible.,1
2,"""::: Somebody will invariably try to add Relig...",2
3,""" \n\n It says it right there that it IS a typ...",3
4,""" \n\n == Before adding a new product to the l...",4


In [None]:
#Get list of top 20 words that are unique to test 
def sort_words(scores, n=30):
  sorted_pairs = sorted(zip(scores, vocabulary))
  sorted_words = [w for s,w in sorted_pairs]
  final_list=sorted_words[:n]
  return final_list
test_final=sort_words(word_odds)
print(test_final)

['poop', 'dicks', 'gay', 'youfuck', 'niggers', 'bitch', 'fuck', 'wtf', 'youi', 'prick', 'stupid', 'hole', 'poo', 'da', 'boob', 'willy', 'bananas', 'traitor', 'die', 'nigger', 'penis', 'curps', 'und', 'fags', 'kill', 'je', 'middle', 'fucking', 'r', 'fucked']


In [None]:
#Emerging words - grab first 20 
df=df.sample(frac=1) #randomly shuffle 
total=0
def check_list(text, elist, emerge_word=0):
    global total
    clean_text=re.split('\W+', text)
    for token in clean_text:
       if (token in elist) and (total<=19):
          emerge_word+=1
       if (token in elist) and (emerge_word==1) and (total<=19):
          elist.remove(token)
          total += 1
    final_count=np.sum(emerge_word)
    return final_count
df['tag'] = df['comment'].apply(lambda x: check_list(x, test_final))
df['tag']=np.where(df['tag']>0.0, 1, 0)

In [None]:
df.head()

Unnamed: 0,comment,index,tag
0,Thank you for understanding. I think very high...,0,0
1,:Dear god this site is horrible.,1,0
2,"""::: Somebody will invariably try to add Relig...",2,0
3,""" \n\n It says it right there that it IS a typ...",3,0
4,""" \n\n == Before adding a new product to the l...",4,0


In [None]:
#Experiment 1 - creating dataframe to add to train AND creating new test that drops old examples 
df_final=df[df['tag']==1] #keeps the 20 examples with the emerging words (1 example per emerging word)
final_index=df_final['index'].to_list() #.pop(0) removes by index, .remove() removes based on the value in the list 
e1_train_add=data['test'].select(final_index)
e1_test=data['test']
for i in final_index:
  e1_test=e1_test.filter(lambda example, indice: (indice!=i) , with_indices=True)
print('train add examples:', len(e1_train_add), 'test drop examples:', len(e1_test))

HBox(children=(FloatProgress(value=0.0, max=64.0), HTML(value='')))


train add examples: 20 test drop examples: 63978


In [None]:
#Creating new train dataset
e1_train = concatenate_datasets([data['train'], e1_train_add])
print('Old train size', len(data['train']), 'New train size', len(e1_train))

Old train size 159571 New train size 159591


In [None]:
# Define the parameters for loading BERT
model_class, tokenizer_class, pretrained_weights = (transformers.DistilBertForSequenceClassification, transformers.DistilBertTokenizerFast, 'distilbert-base-uncased') # To avoid "KeyError: 'loss'" message during train()

# Load tokenizer
tokenizer = tokenizer_class.from_pretrained(pretrained_weights)

In [None]:
# Preprocess the dataset
def text_preprocessing_byexample(example):
  example['comment_text'] = clean_special_chars(example['comment_text'], punct)
  return example

e1_train_prep = e1_train.map(text_preprocessing_byexample)
e1_test_prep = e1_test.map(text_preprocessing_byexample)

def preprocess_function(examples):
  return tokenizer(examples['comment_text'], truncation=True)

e1_train_encoded = e1_train_prep.map(preprocess_function, batched=True)
e1_train_encoded_2 = e1_train_encoded.rename_column("toxic", "label") 

e1_test_encoded = e1_test_prep.map(preprocess_function, batched=True)
e1_test_encoded_2 = e1_test_encoded.rename_column("toxic", "label") 

In [None]:
# Fine-tuning the BERT model

# Load pre-trained BERT model
num_labels = 2 # Toxic labels
model = model_class.from_pretrained(pretrained_weights, num_labels=num_labels)

metric_name = "accuracy"

training_args = TrainingArguments(
    evaluation_strategy = "epoch",   # set the evaluation to be done at the end of each epoch
    num_train_epochs=3,              # total number of training epochs
    per_device_train_batch_size=16,  # batch size per device during training
    per_device_eval_batch_size=16,   # batch size for evaluation
    learning_rate=2e-5,              # initial learning rate for Adam optimizer
    warmup_steps=100,                 # number of warmup steps for learning rate scheduler (set lower because of small dataset size)
    weight_decay=0.01,               # strength of weight decay
    load_best_model_at_end=True,
    metric_for_best_model=metric_name,
    output_dir='./results_e1',          # output directory
)

def compute_metrics(pred):
  labels = pred.label_ids
  preds = pred.predictions.argmax(-1)
  acc = accuracy_score(labels,preds)
  return {
      'accuracy': acc
  }

trainer = Trainer(
    model,
    training_args,
    train_dataset=e1_train_encoded_2,
    eval_dataset=e1_test_encoded_2,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

trainer.train()

In [None]:
#Save model
cached_model_directory_name = '/results_e1'  
trainer.save_model(drive_folder+cached_model_directory_name)

In [None]:
#Create new test labels
true_label=e1_test['toxic']
print(len(true_label))

In [None]:
#Accuracy score 
predicted_results = trainer.predict(e1_test_encoded_2)
predicted_results.predictions.shape
predicted_labels = predicted_results.predictions.argmax(-1)
predicted_labels.shape
predicted_labels = predicted_labels.flatten().tolist()
print(classification_report(true_label, predicted_labels))

# 5b. Experiment 1 - Control

Reuse the same test dataset from experiment 1 (removing the same set of 20 examples), but add 20 control examples to train instead of moving the 20 examples identified through log-odds to train.

In [None]:
#Using control examples
control = load_dataset( 'csv', data_files=['gdrive/My Drive/CS6741 Replication Project/Final Project/data/control.csv'], encoding='ISO-8859-1')
e1_train_control = concatenate_datasets([data['train'], control['train']])
print('Old train size', len(data['train']), 'New train size', len(e1_train_control))

In [None]:
# Define the parameters for loading BERT
model_class, tokenizer_class, pretrained_weights = (transformers.DistilBertForSequenceClassification, transformers.DistilBertTokenizerFast, 'distilbert-base-uncased') # To avoid "KeyError: 'loss'" message during train()

# Load tokenizer
tokenizer = tokenizer_class.from_pretrained(pretrained_weights)

In [None]:
# Preprocess the dataset
def text_preprocessing_byexample(example):
  example['comment_text'] = clean_special_chars(example['comment_text'], punct)
  return example

e1_train_control_prep = e1_train_control.map(text_preprocessing_byexample)

def preprocess_function(examples):
  return tokenizer(examples['comment_text'], truncation=True)

e1_train_control_encoded = e1_train_control_prep.map(preprocess_function, batched=True)
e1_train_control_encoded_2 = e1_train_control_encoded.rename_column("toxic", "label") 

In [None]:
# Fine-tuning the BERT model

# Load pre-trained BERT model
num_labels = 2 # Toxic labels
model = model_class.from_pretrained(pretrained_weights, num_labels=num_labels)

metric_name = "accuracy"

training_args = TrainingArguments(
    evaluation_strategy = "epoch",   # set the evaluation to be done at the end of each epoch
    num_train_epochs=3,              # total number of training epochs
    per_device_train_batch_size=16,  # batch size per device during training
    per_device_eval_batch_size=16,   # batch size for evaluation
    learning_rate=2e-5,              # initial learning rate for Adam optimizer
    warmup_steps=100,                 # number of warmup steps for learning rate scheduler (set lower because of small dataset size)
    weight_decay=0.01,               # strength of weight decay
    load_best_model_at_end=True,
    metric_for_best_model=metric_name,
    output_dir='./results_e1_control',          # output directory
)

def compute_metrics(pred):
  labels = pred.label_ids
  preds = pred.predictions.argmax(-1)
  acc = accuracy_score(labels,preds)
  return {
      'accuracy': acc
  }

trainer = Trainer(
    model,
    training_args,
    train_dataset=e1_train_control_encoded_2,
    eval_dataset=e1_test_encoded_2,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

trainer.train()

In [None]:
#Save model
cached_model_directory_name = '/results_e1_control'  
trainer.save_model(drive_folder+cached_model_directory_name)

In [None]:
#Accuracy score 
predicted_results = trainer.predict(e1_test_encoded_2)
predicted_results.predictions.shape
predicted_labels = predicted_results.predictions.argmax(-1)
predicted_labels.shape
predicted_labels = predicted_labels.flatten().tolist()
print(classification_report(true_label, predicted_labels))

# 6a. Experiment 2

Step 1: Determine 20 most unique NOUNS in test dataset using log-odds ratios

Step 2: Iterate through list finding one example in the test dataset containing that noun


In [None]:
#Get list of top 20 words that are unique to test 
def sort_words(scores, n=100):
  sorted_pairs = sorted(zip(scores, vocabulary))
  sorted_words = [w for s,w in sorted_pairs]
  final_list=sorted_words[:n]
  return final_list
test_final=sort_words(word_odds)
print(test_final)

['poop', 'dicks', 'gay', 'youfuck', 'niggers', 'bitch', 'fuck', 'wtf', 'youi', 'prick', 'stupid', 'hole', 'poo', 'da', 'boob', 'willy', 'bananas', 'traitor', 'die', 'nigger', 'penis', 'curps', 'und', 'fags', 'kill', 'je', 'middle', 'fucking', 'r', 'fucked', 'anal', 'boi', 'motherfucker', 'faggot', 'de', 'stylecolor', 'em', 'solid', 'der', 'vandal', 'stylefontsize', 'comedy', 'omfg', 'se', 'wang', 'dictatorship', 'k', 'vandalizer', 'labour', 'gray', 'fuckers', 'scratch', 'orly', 'wheels', 'moon', 'xlarge', 'fdffe', 'stinks', 'width', 'hamish', 'bella', 'crap', 'goverment', 'trevor', 'stylebackgroundcolor', 'babies', 'f', 'height', 'styleborder', 'n', 'su', 'anime', 'kids', 'du', 'aligncenter', 'porn', 'harte', 'bums', 'que', 'en', 'projectors', 'pelican', 'france', 'sack', 'ist', 'rubbish', 'cellspacing', 'scruffy', 'thorpe', 'slut', 'ballz', 'la', 'cellpadding', 'mother', 'styleverticalaligntop', 'anus', 'es', 'den', 'fart', 'te']


In [None]:
# part of speech tag - keep first 20 most frequent nouns 
# https://stackoverflow.com/questions/15388831/what-are-all-possible-pos-tags-of-nltk

pos=nltk.pos_tag(test_final) #nouns-NN, NNP, NNS
df_noun=pd.DataFrame(pos, columns=['word', 'pos'])
df_noun=df_noun[(df_noun['pos']=="NN") | (df_noun['pos']=="NNP") | (df_noun['pos']=="NNS")]
df_noun=df_noun[:25]
test_final=df_noun['word'].to_list()

In [None]:
#Emerging words (NOUNS) - grab first 20 
df=df.sample(frac=1) #randomly shuffle 
total=0
def check_list(text, elist, emerge_word=0):
    global total
    clean_text=re.split('\W+', text)
    for token in clean_text:
       if (token in elist) and (total<=19):
          emerge_word+=1
       if (token in elist) and (emerge_word==1) and (total<=19):
          elist.remove(token)
          total += 1
    final_count=np.sum(emerge_word)
    return final_count
df['tag'] = df['comment'].apply(lambda x: check_list(x, test_final))
df['tag']=np.where(df['tag']>0.0, 1, 0)

In [None]:
#Experiment 2 - creating dataframe to add to train AND creating new test that drops old examples 

df_final=df[df['tag']==1] #keeps the 20 examples with the emerging words (1 example per emerging word)
final_index=df_final['index'].to_list() #.pop(0) removes by index, .remove() removes based on the value in the list 
e2_train_add=data['test'].select(final_index)
e2_test=data['test']
for i in final_index:
  e2_test=e2_test.filter(lambda example, indice: (indice!=i) , with_indices=True)
print('train add examples:', len(e2_train_add), 'test examples after drop:', len(e2_test))

HBox(children=(FloatProgress(value=0.0, max=64.0), HTML(value='')))


train add examples: 20 test examples after drop: 63978


In [None]:
#Creating new train dataset
e2_train = concatenate_datasets([data['train'], e2_train_add])
print('Old train size', len(data['train']), 'New train size', len(e2_train))

Old train size 159571 New train size 159591


In [None]:
# Define the parameters for loading BERT
model_class, tokenizer_class, pretrained_weights = (transformers.DistilBertForSequenceClassification, transformers.DistilBertTokenizerFast, 'distilbert-base-uncased') # To avoid "KeyError: 'loss'" message during train()

# Load tokenizer
tokenizer = tokenizer_class.from_pretrained(pretrained_weights)

In [None]:
# Preprocess the dataset
def text_preprocessing_byexample(example):
  example['comment_text'] = clean_special_chars(example['comment_text'], punct)
  return example

e2_train_prep = e2_train.map(text_preprocessing_byexample)
e2_test_prep = e2_test.map(text_preprocessing_byexample)

def preprocess_function(examples):
  return tokenizer(examples['comment_text'], truncation=True)

e2_train_encoded = e2_train_prep.map(preprocess_function, batched=True)
e2_train_encoded_2 = e2_train_encoded.rename_column("toxic", "label") 

e2_test_encoded = e2_test_prep.map(preprocess_function, batched=True)
e2_test_encoded_2 = e2_test_encoded.rename_column("toxic", "label") 

In [None]:
# Fine-tuning the BERT model

# Load pre-trained BERT model
num_labels = 2 # Toxic labels
model = model_class.from_pretrained(pretrained_weights, num_labels=num_labels)

metric_name = "accuracy"

training_args = TrainingArguments(
    evaluation_strategy = "epoch",   # set the evaluation to be done at the end of each epoch
    num_train_epochs=3,              # total number of training epochs
    per_device_train_batch_size=16,  # batch size per device during training
    per_device_eval_batch_size=16,   # batch size for evaluation
    learning_rate=2e-5,              # initial learning rate for Adam optimizer
    warmup_steps=100,                 # number of warmup steps for learning rate scheduler (set lower because of small dataset size)
    weight_decay=0.01,               # strength of weight decay
    load_best_model_at_end=True,
    metric_for_best_model=metric_name,
    output_dir='./results_e2',          # output directory
)

def compute_metrics(pred):
  labels = pred.label_ids
  preds = pred.predictions.argmax(-1)
  acc = accuracy_score(labels,preds)
  return {
      'accuracy': acc
  }

trainer = Trainer(
    model,
    training_args,
    train_dataset=e2_train_encoded_2,
    eval_dataset=e2_test_encoded_2,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

trainer.train()

In [None]:
#Save model
cached_model_directory_name = '/results_e2'  
trainer.save_model(drive_folder+cached_model_directory_name)

In [None]:
#Create new test labels
true_label=e2_test['toxic']
print(len(true_label))

In [None]:
#Accuracy score 
predicted_results = trainer.predict(e2_test_encoded_2)
predicted_results.predictions.shape
predicted_labels = predicted_results.predictions.argmax(-1)
predicted_labels.shape
predicted_labels = predicted_labels.flatten().tolist()
print(classification_report(true_label, predicted_labels))

# 6b. Experiment 2 - Control

Reuse the same test dataset from experiment 2 (removing the same set of 20 examples), but add 20 control examples to train instead of moving the 20 examples identified through log-odds to train. (NOTE: The control train dataset is the same as in experiment 1)

In [None]:
# Fine-tuning the BERT model

# Load pre-trained BERT model
num_labels = 2 # Toxic labels
model = model_class.from_pretrained(pretrained_weights, num_labels=num_labels)

metric_name = "accuracy"

training_args = TrainingArguments(
    evaluation_strategy = "epoch",   # set the evaluation to be done at the end of each epoch
    num_train_epochs=3,              # total number of training epochs
    per_device_train_batch_size=16,  # batch size per device during training
    per_device_eval_batch_size=16,   # batch size for evaluation
    learning_rate=2e-5,              # initial learning rate for Adam optimizer
    warmup_steps=100,                 # number of warmup steps for learning rate scheduler (set lower because of small dataset size)
    weight_decay=0.01,               # strength of weight decay
    load_best_model_at_end=True,
    metric_for_best_model=metric_name,
    output_dir='./results_e2_control',          # output directory
)

def compute_metrics(pred):
  labels = pred.label_ids
  preds = pred.predictions.argmax(-1)
  acc = accuracy_score(labels,preds)
  return {
      'accuracy': acc
  }

trainer = Trainer(
    model,
    training_args,
    train_dataset=e1_train_control_encoded_2,
    eval_dataset=e2_test_encoded_2, # Test is different because different examples are dropped; train is consistent among all controls
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

trainer.train()

In [None]:
#Save model
cached_model_directory_name = '/results_e2_control'  
trainer.save_model(drive_folder+cached_model_directory_name)

In [None]:
#Accuracy score 
predicted_results = trainer.predict(e2_test_encoded_2)
predicted_results.predictions.shape
predicted_labels = predicted_results.predictions.argmax(-1)
predicted_labels.shape
predicted_labels = predicted_labels.flatten().tolist()
print(classification_report(true_label, predicted_labels))

# 7a. Experiment 3

Step 1: Determine the first 20 misclassified examples
For our simulation, we will use the false positive and false negative files we created after the initial run of distilBERT in experiment 0. 

In [None]:
#Import misclassified examples
df_fp_import = pd.read_csv(drive_folder+"/results/fp.csv", encoding='utf-8')
df_fn_import = pd.read_csv(drive_folder+"/results/fn.csv", encoding='utf-8')
df=pd.concat([df_fp_import, df_fn_import], axis=0)
df['comment_clean'] = df['comment'].apply(lambda x: preprocess_text(x, punct))

3948

In [None]:
#Grab random 20 misclassified examples 
df=df.sample(frac=1) #randomly shuffle 
df=df[:20] #select 20 obs
df=df.rename(columns={'Unnamed: 0': 'index'})
final_index=df['index'].to_list() #.pop(0) removes by index, .remove() removes based on the value in the list 
e3_train_add=data['test'].select(final_index)
e3_test=data['test']
for i in final_index:
  e3_test=e3_test.filter(lambda example, indice: (indice!=i) , with_indices=True)
print('train add examples:', len(e3_train_add), 'test examples after drop:', len(e3_test))

HBox(children=(FloatProgress(value=0.0, max=64.0), HTML(value='')))


train add examples: 20 test drop examples: 63978


In [None]:
#Creating new train dataset
e3_train = concatenate_datasets([data['train'], e3_train_add])
print('Old train size', len(data['train']), 'New train size', len(e3_train))

In [None]:
# Define the parameters for loading BERT
model_class, tokenizer_class, pretrained_weights = (transformers.DistilBertForSequenceClassification, transformers.DistilBertTokenizerFast, 'distilbert-base-uncased') # To avoid "KeyError: 'loss'" message during train()

# Load tokenizer
tokenizer = tokenizer_class.from_pretrained(pretrained_weights)

In [None]:
# Preprocess the dataset
def text_preprocessing_byexample(example):
  example['comment_text'] = clean_special_chars(example['comment_text'], punct)
  return example

e3_train_prep = e3_train.map(text_preprocessing_byexample)
e3_test_prep = e3_test.map(text_preprocessing_byexample)

def preprocess_function(examples):
  return tokenizer(examples['comment_text'], truncation=True)

e3_train_encoded = e3_train_prep.map(preprocess_function, batched=True)
e3_train_encoded_2 = e3_train_encoded.rename_column("toxic", "label") 

e3_test_encoded = e3_test_prep.map(preprocess_function, batched=True)
e3_test_encoded_2 = e3_test_encoded.rename_column("toxic", "label") 

In [None]:
# Fine-tuning the BERT model

# Load pre-trained BERT model
num_labels = 2 # Toxic labels
model = model_class.from_pretrained(pretrained_weights, num_labels=num_labels)

metric_name = "accuracy"

training_args = TrainingArguments(
    evaluation_strategy = "epoch",   # set the evaluation to be done at the end of each epoch
    num_train_epochs=3,              # total number of training epochs
    per_device_train_batch_size=16,  # batch size per device during training
    per_device_eval_batch_size=16,   # batch size for evaluation
    learning_rate=2e-5,              # initial learning rate for Adam optimizer
    warmup_steps=100,                 # number of warmup steps for learning rate scheduler (set lower because of small dataset size)
    weight_decay=0.01,               # strength of weight decay
    load_best_model_at_end=True,
    metric_for_best_model=metric_name,
    output_dir='./results_e3',          # output directory
)

def compute_metrics(pred):
  labels = pred.label_ids
  preds = pred.predictions.argmax(-1)
  acc = accuracy_score(labels,preds)
  return {
      'accuracy': acc
  }

trainer = Trainer(
    model,
    training_args,
    train_dataset=e3_train_encoded_2,
    eval_dataset=e3_test_encoded_2,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

trainer.train()

In [None]:
#Save model
cached_model_directory_name = '/results_e3'  
trainer.save_model(drive_folder+cached_model_directory_name)

In [None]:
#Create new test labels
true_label=e3_test['toxic']
print(len(true_label))

In [None]:
#Accuracy score 
predicted_results = trainer.predict(e3_test_encoded_2)
predicted_results.predictions.shape
predicted_labels = predicted_results.predictions.argmax(-1)
predicted_labels.shape
predicted_labels = predicted_labels.flatten().tolist()
print(classification_report(true_label, predicted_labels))

# 7b. Experiment 3 - Control

Reuse the same test dataset from experiment 3 (removing the same set of 20 examples), but add 20 control examples to train instead of moving the 20 examples identified through log-odds to train. (NOTE: The control train dataset is the same as in experiment 1)

In [None]:
# Fine-tuning the BERT model

# Load pre-trained BERT model
num_labels = 2 # Toxic labels
model = model_class.from_pretrained(pretrained_weights, num_labels=num_labels)

metric_name = "accuracy"

training_args = TrainingArguments(
    evaluation_strategy = "epoch",   # set the evaluation to be done at the end of each epoch
    num_train_epochs=3,              # total number of training epochs
    per_device_train_batch_size=16,  # batch size per device during training
    per_device_eval_batch_size=16,   # batch size for evaluation
    learning_rate=2e-5,              # initial learning rate for Adam optimizer
    warmup_steps=100,                 # number of warmup steps for learning rate scheduler (set lower because of small dataset size)
    weight_decay=0.01,               # strength of weight decay
    load_best_model_at_end=True,
    metric_for_best_model=metric_name,
    output_dir='./results_e3_control',          # output directory
)

def compute_metrics(pred):
  labels = pred.label_ids
  preds = pred.predictions.argmax(-1)
  acc = accuracy_score(labels,preds)
  return {
      'accuracy': acc
  }

trainer = Trainer(
    model,
    training_args,
    train_dataset=e1_train_control_encoded_2, #this is the same for all - augment with 20 random features
    eval_dataset=e3_test_encoded_2, #remove the same 20 from test as in E3
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

trainer.train()

In [None]:
#Save model
cached_model_directory_name = '/results_e3_control'  
trainer.save_model(drive_folder+cached_model_directory_name)

In [None]:
#Accuracy score 
predicted_results = trainer.predict(e3_test_encoded_2)
predicted_results.predictions.shape
predicted_labels = predicted_results.predictions.argmax(-1)
predicted_labels.shape
predicted_labels = predicted_labels.flatten().tolist()
print(classification_report(true_label, predicted_labels))

# 8a. Experiment 4 

Step 1: Determine 20 most unique words in test dataset using log-odds ratios

Step 2: Iterate through the list of words and find the first misclassified example in the test dataset  

In [24]:
#Loop through misclassified examples/ get random example with one of the top 
df_fp_import = pd.read_csv(drive_folder+"/results/fp.csv", encoding='utf-8')
df_fn_import = pd.read_csv(drive_folder+"/results/fn.csv", encoding='utf-8')
df=pd.concat([df_fp_import, df_fn_import], axis=0)
df['comment_clean'] = df['comment'].apply(lambda x: preprocess_text(x, punct))
len(df)

3948

In [25]:
#Get list of top 20 words that are unique to test 
def sort_words(scores, n=30):
  sorted_pairs = sorted(zip(scores, vocabulary))
  sorted_words = [w for s,w in sorted_pairs]
  final_list=sorted_words[:n]
  return final_list
test_final=sort_words(word_odds)
print(test_final)

['poop', 'dicks', 'gay', 'youfuck', 'niggers', 'bitch', 'fuck', 'wtf', 'youi', 'prick', 'stupid', 'hole', 'poo', 'da', 'boob', 'willy', 'bananas', 'traitor', 'die', 'nigger', 'penis', 'curps', 'und', 'fags', 'kill', 'je', 'middle', 'fucking', 'r', 'fucked']


In [26]:
#Emerging words - grab first 20 
df=df.sample(frac=1) #randomly shuffle 
total=0
def check_list(text, elist, emerge_word=0):
    global total
    clean_text=re.split('\W+', text)
    for token in clean_text:
       if (token in elist) and (total<=19):
          emerge_word+=1
       if (token in elist) and (emerge_word==1) and (total<=19):
          elist.remove(token)
          total += 1
    final_count=np.sum(emerge_word)
    return final_count
df['tag'] = df['comment'].apply(lambda x: check_list(x, test_final))
df['tag']=np.where(df['tag']>0.0, 1, 0)

In [27]:
#Experiment 4 - creating dataframe to add to train AND creating new test that drops old examples 
df_final=df[df['tag']==1] #keeps the 20 examples with the emerging words (1 example per emerging word)
df_final=df_final.rename(columns={'Unnamed: 0': 'index'})
final_index=df_final['index'].to_list() #.pop(0) removes by index, .remove() removes based on the value in the list 
e4_train_add=data['test'].select(final_index)
e4_test=data['test']
for i in final_index:
  e4_test=e4_test.filter(lambda example, indice: (indice!=i) , with_indices=True)
print('train add examples:', len(e4_train_add), 'test examples after drop:', len(e4_test))

HBox(children=(FloatProgress(value=0.0, max=64.0), HTML(value='')))




HBox(children=(FloatProgress(value=0.0, max=64.0), HTML(value='')))




HBox(children=(FloatProgress(value=0.0, max=64.0), HTML(value='')))




HBox(children=(FloatProgress(value=0.0, max=64.0), HTML(value='')))




HBox(children=(FloatProgress(value=0.0, max=64.0), HTML(value='')))




HBox(children=(FloatProgress(value=0.0, max=64.0), HTML(value='')))




HBox(children=(FloatProgress(value=0.0, max=64.0), HTML(value='')))




HBox(children=(FloatProgress(value=0.0, max=64.0), HTML(value='')))




HBox(children=(FloatProgress(value=0.0, max=64.0), HTML(value='')))




HBox(children=(FloatProgress(value=0.0, max=64.0), HTML(value='')))




HBox(children=(FloatProgress(value=0.0, max=64.0), HTML(value='')))




HBox(children=(FloatProgress(value=0.0, max=64.0), HTML(value='')))




HBox(children=(FloatProgress(value=0.0, max=64.0), HTML(value='')))




HBox(children=(FloatProgress(value=0.0, max=64.0), HTML(value='')))




HBox(children=(FloatProgress(value=0.0, max=64.0), HTML(value='')))




HBox(children=(FloatProgress(value=0.0, max=64.0), HTML(value='')))




HBox(children=(FloatProgress(value=0.0, max=64.0), HTML(value='')))




HBox(children=(FloatProgress(value=0.0, max=64.0), HTML(value='')))




HBox(children=(FloatProgress(value=0.0, max=64.0), HTML(value='')))




HBox(children=(FloatProgress(value=0.0, max=64.0), HTML(value='')))


train add examples: 20 test examples after drop: 63958


In [28]:
#Creating new train dataset
e4_train = concatenate_datasets([data['train'], e4_train_add])
print('Old train size', len(data['train']), 'New train size', len(e4_train))

Old train size 159571 New train size 159591


In [29]:
# Define the parameters for loading BERT
model_class, tokenizer_class, pretrained_weights = (transformers.DistilBertForSequenceClassification, transformers.DistilBertTokenizerFast, 'distilbert-base-uncased') # To avoid "KeyError: 'loss'" message during train()

# Load tokenizer
tokenizer = tokenizer_class.from_pretrained(pretrained_weights)

HBox(children=(FloatProgress(value=0.0, description='Downloading', max=231508.0, style=ProgressStyle(descripti…




HBox(children=(FloatProgress(value=0.0, description='Downloading', max=466062.0, style=ProgressStyle(descripti…




HBox(children=(FloatProgress(value=0.0, description='Downloading', max=28.0, style=ProgressStyle(description_w…




In [30]:
# Preprocess the dataset
def text_preprocessing_byexample(example):
  example['comment_text'] = clean_special_chars(example['comment_text'], punct)
  return example

e4_train_prep = e4_train.map(text_preprocessing_byexample)
e4_test_prep = e4_test.map(text_preprocessing_byexample)

def preprocess_function(examples):
  return tokenizer(examples['comment_text'], truncation=True)

e4_train_encoded = e4_train_prep.map(preprocess_function, batched=True)
e4_train_encoded_2 = e4_train_encoded.rename_column("toxic", "label") 

e4_test_encoded = e4_test_prep.map(preprocess_function, batched=True)
e4_test_encoded_2 = e4_test_encoded.rename_column("toxic", "label") 

HBox(children=(FloatProgress(value=0.0, max=159591.0), HTML(value='')))




HBox(children=(FloatProgress(value=0.0, max=63958.0), HTML(value='')))




HBox(children=(FloatProgress(value=0.0, max=160.0), HTML(value='')))




HBox(children=(FloatProgress(value=0.0, max=64.0), HTML(value='')))




In [31]:
# Fine-tuning the BERT model

# Load pre-trained BERT model
num_labels = 2 # Toxic labels
model = model_class.from_pretrained(pretrained_weights, num_labels=num_labels)

metric_name = "accuracy"

training_args = TrainingArguments(
    evaluation_strategy = "epoch",   # set the evaluation to be done at the end of each epoch
    num_train_epochs=3,              # total number of training epochs
    per_device_train_batch_size=16,  # batch size per device during training
    per_device_eval_batch_size=16,   # batch size for evaluation
    learning_rate=2e-5,              # initial learning rate for Adam optimizer
    warmup_steps=100,                 # number of warmup steps for learning rate scheduler (set lower because of small dataset size)
    weight_decay=0.01,               # strength of weight decay
    load_best_model_at_end=True,
    metric_for_best_model=metric_name,
    output_dir='./results_e4',          # output directory
)

def compute_metrics(pred):
  labels = pred.label_ids
  preds = pred.predictions.argmax(-1)
  acc = accuracy_score(labels,preds)
  return {
      'accuracy': acc
  }

trainer = Trainer(
    model,
    training_args,
    train_dataset=e4_train_encoded_2,
    eval_dataset=e4_test_encoded_2,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

trainer.train()

HBox(children=(FloatProgress(value=0.0, description='Downloading', max=442.0, style=ProgressStyle(description_…




HBox(children=(FloatProgress(value=0.0, description='Downloading', max=267967963.0, style=ProgressStyle(descri…




Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_projector.bias', 'vocab_layer_norm.bias', 'vocab_layer_norm.weight', 'vocab_projector.weight', 'vocab_transform.bias', 'vocab_transform.weight']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.bias', 'pre_classifier.weight', 'classi

Epoch,Training Loss,Validation Loss,Accuracy
1,0.0942,0.158435,0.925217
2,0.0708,0.255163,0.927359
3,0.0396,0.33941,0.920682


TrainOutput(global_step=29925, training_loss=0.07505640289638077, metrics={'train_runtime': 9873.945, 'train_samples_per_second': 3.031, 'total_flos': 0, 'epoch': 3.0, 'init_mem_cpu_alloc_delta': 1831927808, 'init_mem_gpu_alloc_delta': 268953088, 'init_mem_cpu_peaked_delta': 0, 'init_mem_gpu_peaked_delta': 0, 'train_mem_cpu_alloc_delta': 112705536, 'train_mem_gpu_alloc_delta': 813401600, 'train_mem_cpu_peaked_delta': 0, 'train_mem_gpu_peaked_delta': 6859132416})

In [32]:
#Save model
cached_model_directory_name = '/results_e4'  
trainer.save_model(drive_folder+cached_model_directory_name)

In [33]:
#Create new test labels
true_label=e4_test['toxic']
print(len(true_label))

63958


In [34]:
#Accuracy score 
predicted_results = trainer.predict(e4_test_encoded_2)
predicted_results.predictions.shape
predicted_labels = predicted_results.predictions.argmax(-1)
predicted_labels.shape
predicted_labels = predicted_labels.flatten().tolist()
print(classification_report(true_label, predicted_labels))

              precision    recall  f1-score   support

           0       0.99      0.93      0.96     57870
           1       0.58      0.88      0.70      6088

    accuracy                           0.93     63958
   macro avg       0.78      0.91      0.83     63958
weighted avg       0.95      0.93      0.93     63958



# 8b. Experiment 4 - Control

Reuse the same test dataset from experiment 4 (removing the same set of 20 examples), but add 20 control examples to train instead of moving the 20 examples identified through log-odds to train. (NOTE: the control train dataset is the same as in experiment 1).

In [None]:
# Fine-tuning the BERT model

# Load pre-trained BERT model
num_labels = 2 # Toxic labels
model = model_class.from_pretrained(pretrained_weights, num_labels=num_labels)

metric_name = "accuracy"

training_args = TrainingArguments(
    evaluation_strategy = "epoch",   # set the evaluation to be done at the end of each epoch
    num_train_epochs=3,              # total number of training epochs
    per_device_train_batch_size=16,  # batch size per device during training
    per_device_eval_batch_size=16,   # batch size for evaluation
    learning_rate=2e-5,              # initial learning rate for Adam optimizer
    warmup_steps=100,                 # number of warmup steps for learning rate scheduler (set lower because of small dataset size)
    weight_decay=0.01,               # strength of weight decay
    load_best_model_at_end=True,
    metric_for_best_model=metric_name,
    output_dir='./results_e4_control',          # output directory
)

def compute_metrics(pred):
  labels = pred.label_ids
  preds = pred.predictions.argmax(-1)
  acc = accuracy_score(labels,preds)
  return {
      'accuracy': acc
  }

trainer = Trainer(
    model,
    training_args,
    train_dataset=e1_train_control_encoded_2, #this is the same as in experiment 1
    eval_dataset=e4_test_encoded_2, 
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

trainer.train()

In [None]:
#Save model
cached_model_directory_name = '/results_e4_control'  
trainer.save_model(drive_folder+cached_model_directory_name)

In [None]:
#Accuracy score 
predicted_results = trainer.predict(e4_test_encoded_2)
predicted_results.predictions.shape
predicted_labels = predicted_results.predictions.argmax(-1)
predicted_labels.shape
predicted_labels = predicted_labels.flatten().tolist()
print(classification_report(true_label, predicted_labels))

# 9 Misclassifaction Overlap Analysis on Experiments 0 and 4

We perform a robustness check to verify whether the examples that were misclassified in experiment 0 overlap with those that were misclassified in experiment 4.

In [42]:
#Check overlap between misclassified examples from E0 and E4

#Create dataframes for E4 false positives and false negatives
df_check=pd.DataFrame(list(zip(true_label, predicted_labels, e4_test['comment_text'])), columns=['true_label', 'predicted_labels', 'comment'])

df_fp_check=df_check[(df_check['true_label']==0) & (df_check['predicted_labels']==1)]
df_fn_check=df_check[(df_check['true_label']==1) & (df_check['predicted_labels']==0)]
df_tp_check=df_check[(df_check['true_label']==1) & (df_check['predicted_labels']==1)]
df_tn_check=df_check[(df_check['true_label']==0) & (df_check['predicted_labels']==0)]
print('Misclassified', np.sum(np.where(df_check['true_label']!=df_check['predicted_labels'], 1, 0)), 'Correctly classified', np.sum(np.where(df_check['true_label']==df_check['predicted_labels'], 1, 0)))


Misclassified 4646 Correctly classified 59312


In [45]:
print(len(df_fp_check['comment'])) # E4
print(len(df_fp_import['comment'])) # E0
print(len(np.setdiff1d(df_fp_check['comment'], df_fp_import['comment'])))
print(np.setdiff1d(df_fp_check['comment'], df_fp_import['comment'])) # Prints the comments in df_fp_check that are NOT in df_fp_import

3925
2904
1305
["!!! YOU MORRONS   ... FOR FOREIGNERS  (COST-YOU-SHKO) BOOYAH !!! \n and dont think.. it is the most simple and managable Explanation ... Jezus I'm the first to thought it...??? it's sad"
 '" \n\n  \n\n     * ""I heard a news report  some lawyer in Florida wanna take us to court. Somebody tell that country-ass hick  to go suck a dead man\'s dick."" - Geto Boys, referring to Thompson \n\n Your comment on why it should stay: ""this quote shows other people\'s views of Thompson as an anti-rap alarmist""; \n\n No, it doesn\'t; it just says that somebody really doesn\'t like him. I\'m removing it again. If you think it should stay, we can request a third party\'s opinion on the matter.  "'
 '" \n\n  \n :: OK, TDC, your quote:  ""so please dont give me this tired bullshit that we are somehow driving Wikipedia to a right wing bias"" - read what I wrote, again.  I told the other two what I\'m afraid of is Wikipedia being at either extreme.  What I worry about is people coming h

In [46]:
print(len(df_fn_check['comment'])) # E4
print(len(df_fn_import['comment'])) # E0
print(len(np.setdiff1d(df_fn_check['comment'], df_fn_import['comment'])))
print(np.setdiff1d(df_fn_check['comment'], df_fn_import['comment'])) # Prints the comments in df_fn_check that are NOT in df_fn_import

721
1044
91
['" \n\n  \n == Roy Harper poem == \n\n ""..of knickers and ass,"" \n Is \'ass\' correct? In UK English \'arse\' is the correct spelling, further, it scans with \'grass\'.   "'
 '" \n\n ::I think it\'s stupid too, but since the manufacturer invented the car and the name Prius, they do get to be the ""official"" authority on how to spell it. It doesn\'t make it make sense, but it is ""official.""   "'
 '" \n\n ::It is going fine. I\'m working on a Tag title list at the moment, plus other things. Before you know it, you\'ll see a shit load of edits by me. That is if you have alot of TNA articles on your watch list.-Joe\'s gonna kill you!!!) "'
 '" \n\n :You\'re adding shite that fails {{WP:NOTDIR]], The images and references are fine but the huge route list isn\'t!, You can add images, Updated info and references WITHOUT the route list!..... - "'
 '" \n\n == Ad hominem remarks == \n\n Hi MSK, please tone down the comments you\'re making about other editors. Your edit summarie

# 10. Conclusion

This concludes our experiment. We found that the accuracy did not change regardless of the experiments we ran. We also adjusted the number of examples moved from test to train, iterating over N=100, and found no notable differences. However, there may be notable underlying misclassification patterns. Please see our paper for a discussion on why this might be the case and future directions.