In [27]:
# Load Data
import torch
import torch.nn as nn
from torch import optim
import numpy as np
import random
import gensim.downloader as api
import pandas as pd


# The Task: 'Toxicity' Modeling

In 2016 Google published a Kaggle competition called 'toxic comments', offering a prize to the user who could make the best model for predicting whether a given comment was toxic.

https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge

By offering a prize for the best model, google employed a shadow labor force of skilled workers---over 5000 submissions to date. 

kaggle is a good place to look for datasets. 

The data comes from Wikipedia talk pages. 

This dataset is already stored in huggingface: 
    https://huggingface.co/datasets

In [28]:
from datasets import load_dataset

dataset = load_dataset("jigsaw_toxicity_pred", data_dir='../data/jigsaw-toxic-comment-classification-challenge/')


Some examples labeled toxic.

# Exploratory Data Analysis
## Let's examine our data.

the data get returned to us as a Dataset object. ie an instance of the huggingface Datasets class. To see what you can do see the documentation here: https://huggingface.co/datasets

in the huggingface datasets library, a split is a specific subset of a dataset like train and test. List a dataset’s split names with the get_dataset_split_names() function. This data is already segmented for us

In [29]:
from datasets import get_dataset_split_names

get_dataset_split_names("jigsaw_toxicity_pred")


['train', 'test']

let's look at the first entry in the train split

In [30]:
# let's look at just train
train = dataset['train']

# Get the first row in the train set. 
train[0] #

{'comment_text': "Explanation\nWhy the edits made under my username Hardcore Metallica Fan were reverted? They weren't vandalisms, just closure on some GAs after I voted at New York Dolls FAC. And please don't remove the template from the talk page since I'm retired now.89.205.38.27",
 'toxic': 0,
 'severe_toxic': 0,
 'obscene': 0,
 'threat': 0,
 'insult': 0,
 'identity_hate': 0}

Use the - operator to start from the end of the dataset:

In [31]:
# Get the last row in the train set

train[-1]

{'comment_text': '"\nAnd ... I really don\'t think you understand.  I came here and my idea was bad right away.  What kind of community goes ""you have bad ideas"" go away, instead of helping rewrite them.   "',
 'toxic': 0,
 'severe_toxic': 0,
 'obscene': 0,
 'threat': 0,
 'insult': 0,
 'identity_hate': 0}

Indexing by the column name returns a list of all the values in the column:

In [32]:
train["comment_text"][0]

"Explanation\nWhy the edits made under my username Hardcore Metallica Fan were reverted? They weren't vandalisms, just closure on some GAs after I voted at New York Dolls FAC. And please don't remove the template from the talk page since I'm retired now.89.205.38.27"

You can combine row and column name indexing to return a specific value at a position:

In [33]:
train[0]["comment_text"]

"Explanation\nWhy the edits made under my username Hardcore Metallica Fan were reverted? They weren't vandalisms, just closure on some GAs after I voted at New York Dolls FAC. And please don't remove the template from the talk page since I'm retired now.89.205.38.27"

But it is important to remember that indexing order matters, especially when working with large audio and image datasets. Indexing by the column name returns all the values in the column first, then loads the value at that position. For large datasets, it may be slower to index by the column name first.

### Aside: dataframes

We are using the built in Datasets class from hugginface, which is useful because it allows us to download a ton of datasets right in Jupyter. But plain old Pandas dataframes are very powerful for storing, examining, manipulating, and transforming data.

Make a dataframe from our Dataset object

In [34]:
df = pd.DataFrame.from_records(train)

In [35]:
df[df['toxic']==1]

Unnamed: 0,comment_text,toxic,severe_toxic,obscene,threat,insult,identity_hate
6,COCKSUCKER BEFORE YOU PISS AROUND ON MY WORK,1,1,1,0,1,0
12,Hey... what is it..\n@ | talk .\nWhat is it......,1,0,0,0,0,0
16,"Bye! \n\nDon't look, come or think of comming ...",1,0,0,0,0,0
42,You are gay or antisemmitian? \n\nArchangel WH...,1,0,1,0,1,1
43,"FUCK YOUR FILTHY MOTHER IN THE ASS, DRY!",1,0,1,0,1,0
...,...,...,...,...,...,...,...
159494,"""\n\n our previous conversation \n\nyou fuckin...",1,0,1,0,1,1
159514,YOU ARE A MISCHIEVIOUS PUBIC HAIR,1,0,0,0,1,0
159541,Your absurd edits \n\nYour absurd edits on gre...,1,0,1,0,1,0
159546,"""\n\nHey listen don't you ever!!!! Delete my e...",1,0,0,0,1,0


## Multi-label vs Binary classification

Each example is a dictionary with comment text and then 6 annotations, which are coded for either 1 or 0. This is a multilabel classification problem. We are going to simplify it first into a binary classification task. Given a post, is it 'toxic' or not? 

What are the inputs to the model? what is the output?

# Pre-processing the Data

We have to transform our text inputs to numbers. How have we been doing this so far?

In [36]:
a: embeddings

<img src="../img/fullyconnected.png" alt="Alternative text" />

Another way to think of the layer of input neurons is as a vector. in fact, this is how we represent the weights in each layer of the network.

How can we represent the comment text as a vector? 



## Loading embeddings

We can use embeddings.

Let's load our word embeddings that we trained on X. 

In [37]:
"""
MISSING
"""

# load word2vec embeddings we trained

'\nMISSING\n'

Alternately, we can use pretrained vectors or embeddings downloaded from the internet. We can use Word2Vec, or GloVe, which is a model that came out a few years later and works very well. 

We use `gensim` which is a great library for working with embeddings and training topic models

In [38]:
info = api.info()

Print out available models (i.e. embeddings)

In [39]:
for model_name, model_data in sorted(info['models'].items()):
    print(
        '%s (%d records): %s' % (
            model_name,
            model_data.get('num_records', -1),
            model_data['description'][:40] + '...',
        )
    )

__testing_word2vec-matrix-synopsis (-1 records): [THIS IS ONLY FOR TESTING] Word vecrors ...
conceptnet-numberbatch-17-06-300 (1917247 records): ConceptNet Numberbatch consists of state...
fasttext-wiki-news-subwords-300 (999999 records): 1 million word vectors trained on Wikipe...
glove-twitter-100 (1193514 records): Pre-trained vectors based on  2B tweets,...
glove-twitter-200 (1193514 records): Pre-trained vectors based on 2B tweets, ...
glove-twitter-25 (1193514 records): Pre-trained vectors based on 2B tweets, ...
glove-twitter-50 (1193514 records): Pre-trained vectors based on 2B tweets, ...
glove-wiki-gigaword-100 (400000 records): Pre-trained vectors based on Wikipedia 2...
glove-wiki-gigaword-200 (400000 records): Pre-trained vectors based on Wikipedia 2...
glove-wiki-gigaword-300 (400000 records): Pre-trained vectors based on Wikipedia 2...
glove-wiki-gigaword-50 (400000 records): Pre-trained vectors based on Wikipedia 2...
word2vec-google-news-300 (3000000 records): Pre-trai

We'll continue with twitter embeddings. Feel free to sub in a different kind and see what kind of results you get.

In [40]:
# download the model and return as object ready for use
#embeddings_glove_twitter = api.load("glove-twitter-50")
embeddings = api.load("glove-twitter-100")



We can access the embedding of a single word in the dictionary like this

In [41]:
embeddings['sup']

array([-0.072944 ,  0.31349  , -0.37301  , -0.74591  ,  0.024118 ,
        0.26288  , -0.52766  ,  0.45845  ,  0.66482  , -0.32284  ,
        0.070524 , -0.23753  , -1.9064   , -0.12384  ,  0.34087  ,
        0.10557  ,  0.64763  , -1.5884   ,  0.28275  , -0.48506  ,
        0.37902  , -0.28709  , -0.0066354,  0.24017  , -0.034383 ,
       -0.3468   , -0.0061342, -0.12497  , -0.011999 , -0.63745  ,
       -0.35676  , -0.17062  , -0.86248  ,  0.18034  ,  0.1995   ,
       -0.25941  ,  0.2586   ,  0.17861  ,  0.7617   ,  0.8704   ,
       -0.53839  ,  0.38899  ,  0.27174  ,  0.33564  , -0.29995  ,
        0.9688   ,  0.15263  , -0.48423  ,  0.70449  ,  0.1936   ,
       -0.30026  , -0.46582  , -0.11025  , -1.3443   ,  0.7221   ,
        0.41727  ,  0.078679 , -0.34484  , -0.24705  , -0.65691  ,
       -0.22723  , -0.68642  ,  0.30941  ,  0.38086  ,  0.029261 ,
       -0.22846  , -0.33021  , -0.48214  , -0.28144  , -0.17876  ,
        0.071934 ,  0.17553  ,  0.49776  , -0.44887  ,  0.0209

# Preprocessing

## Transform model inputs into vector representation  (x) 

First we need to transform the input text into a vector representation. 

Let's write a function that constructs an input vector from the training example. It will take a single example in the form of a dictionary OR a comment string, and should have a switch depending on the type of the input. This will help us down the road 

what needs to happen in this function?

In [42]:
def form_input(example, embeddings):
    """
    :example: training example from huggingface dataset in the form of a dictionary, e.g.
    
    {'comment_text': '"\nAnd ... I really don\'t think you understand.  I came here and my idea was bad right away.  What kind of community goes ""you have bad ideas"" go away, instead of helping rewrite them.   "',
     'toxic': 0,
     'severe_toxic': 0,
     'obscene': 0,
     'threat': 0,
     'insult': 0,
     'identity_hate': 0}

    returns a dictionary with a new key, x, which maps to a vector representing the comment
    """
    
    if isinstance(example, str):
        raise Exception("not implemented")
    else:
        raise Exception("not implemented")
        
    example['x'] = None
    return example

We need to process the input string into words. It makes sense to use the same tokenization function we used for the w2v training corpus, because those are the vectors we have to work with.

In [43]:
import re

def tokenize(string):
    """
    tokenization function from n-gram models
    """
    tokenized = re.sub(r'(\w)([.,?!;:])', r'\1 \2', string) 
    tokenized = tokenized.split()
    tokenized = [word.lower() for word in tokenized]
    return tokenized

In [51]:
def form_input(example, word_embeddings):
    """
    :example: training example from huggingface dataset in the form of a dictionary, e.g.
    
    {'comment_text': '"\nAnd ... I really don\'t think you understand.  I came here and my idea was bad right away.  What kind of community goes ""you have bad ideas"" go away, instead of helping rewrite them.   "',
     'toxic': 0,
     'severe_toxic': 0,
     'obscene': 0,
     'threat': 0,
     'insult': 0,
     'identity_hate': 0}

    returns a dictionary with a new key, x, which maps to a vector representing the comment
    """
    
    # get the tokens
    if isinstance(example, str):
        tokens = tokenize(example)
        example = {} # dummy input dict
    else:
        tokens = tokenize(example['comment_text'])

        # tokens = ["segf","awe"]
        # embeddings 

        vectors = [embeddings[t] for t in tokens]

        

        
        
    # get the vectors for each token and average them into one vector
    vecs = []
    for word in tokens:
        try:
            vec = word_embeddings[word]
        except KeyError: # this token is not in our embeddings dictionary
            vec = np.zeros(word_embeddings.vector_size)
        vecs.append(vec)
    
    centroid = np.mean(vecs, axis=0)
    
    # convert numpy array to torch tensor
    torch_tensor = torch.from_numpy(centroid).float()
    
    # return the same example with a new key 'x' containing the vectorized input
    # example['x'] = torch_tensor
    example['x'] = centroid
    return example

In [52]:
form_input("fuck you, you fcuking fuck", embeddings)

{'x': array([ 0.02811934,  0.01134666,  0.064648  ,  0.4508567 , -0.19203   ,
         0.23058133,  0.16306323, -0.05258501,  0.22441666,  0.35127833,
         0.17506166, -0.22616668, -4.2064385 , -0.130898  , -0.05190283,
        -0.11419783,  0.05207833, -0.15887333, -0.27789333, -0.20398398,
        -0.11011901,  0.06850133, -0.16275083,  0.230263  ,  0.288115  ,
        -0.72416997,  0.06175267,  0.17263532,  0.05422334, -0.29975167,
        -0.345325  , -0.11183766, -0.02660634,  0.09518132, -0.46431866,
         0.28865   , -0.144839  ,  0.26856667,  0.32010666, -0.15639667,
        -0.82060415,  0.28755298, -0.04442   ,  0.04758999,  0.155905  ,
        -0.061172  ,  0.08584667,  0.06134185,  0.15439667,  0.20977919,
        -0.13046001,  0.18997884,  0.19542266, -0.30403498,  0.4323765 ,
         0.02775649, -0.6237783 , -0.044715  ,  0.03132983,  0.40435883,
         0.167084  ,  0.07760134, -0.051705  ,  0.19622315,  0.03124   ,
        -0.05763416, -0.46400836,  0.14215769,

We test out our form_input function and see that it adds a new key to the example dict

In [53]:
sample = train[0]
form_input(sample, embeddings)

KeyError: "Key 'weren't' not present"

we can also use a regular string as input

In [54]:
form_input("fuck you you fucking fuck", embeddings)

{'x': array([ 0.095279  ,  0.059992  ,  0.13705759,  0.49236003, -0.24676602,
        -0.0165944 ,  0.32601985, -0.03514801,  0.0622496 ,  0.47403723,
         0.12793799, -0.080112  , -5.2482595 , -0.2469496 , -0.1907184 ,
        -0.0444044 , -0.005334  , -0.1239496 , -0.433752  , -0.15544479,
        -0.22081602,  0.02698599, -0.13078442,  0.17862199,  0.636536  ,
        -0.758012  ,  0.0469492 ,  0.22871597, -0.032672  , -0.387164  ,
        -0.44408798, -0.23165521, -0.16098759, -0.0405344 , -0.44118243,
         0.35053402, -0.18633   ,  0.373294  ,  0.43902603, -0.30212998,
        -0.899114  ,  0.21428959, -0.09860601, -0.10665802,  0.361494  ,
        -0.1636624 ,  0.18448   ,  0.235944  ,  0.25262   ,  0.261711  ,
        -0.2349462 ,  0.28258198,  0.262682  , -0.32094198,  0.610094  ,
         0.05624299, -0.69200003,  0.025616  , -0.111052  ,  0.43973398,
         0.124874  ,  0.0639364 , -0.04205   ,  0.21510401,  0.065932  ,
        -0.19254479, -0.55749804,  0.41179925,

## Preprocessing: Model Outputs

We need to transform the output label into vector representation (gold y). 

### Question: What is our golf 

In [55]:
def form_output(example):
    example['y'] = example['toxic']
    return example

In [56]:
form_output(sample)

{'comment_text': "Explanation\nWhy the edits made under my username Hardcore Metallica Fan were reverted? They weren't vandalisms, just closure on some GAs after I voted at New York Dolls FAC. And please don't remove the template from the talk page since I'm retired now.89.205.38.27",
 'toxic': 0,
 'severe_toxic': 0,
 'obscene': 0,
 'threat': 0,
 'insult': 0,
 'identity_hate': 0,
 'y': 0}

# Building the Pytorch Model

We have several decisions to make.

- What is the shape of the input?
- What is the shape of the output?
- What form do we want the output to take? 
- How many layers do we want the model to have?
- What activation function should we use at each layer?
- Do we use dropout?



In [None]:
one matrix, dimensions of 

We subclass nn.Module (which itself is a class and able to keep track of state). In this case, we want to create a class that holds our weights, bias, and method for the forward step. nn.Module has a number of attributes and methods (such as .parameters() and .zero_grad()) which we will be using.

We also can store the loss function here as well.

Here is an example FFNN.


In [57]:
class FFNN(nn.Module):
    """
    Defines the core neural network for doing binary classification over a single datapoint at a time. This consists
    of matrix multiplication, tanh nonlinearity, another matrix multiplication, and then
    a sigmoid layer to give the ouputs.

    The forward() function does the important computation. The backward() method is inherited from nn.Module and
    handles backpropagation.
    """
    def __init__(self, word_embeddings, inp, hid, out):
        """
        Constructs the computation graph by instantiating the various layers and initializing weights.

        :param inp: size of input (integer)
        :param hid: size of hidden layer(integer)
        :param out: size of output (integer), which should be the number of classes
        """
        super(FFNN, self).__init__()
        self.V = nn.Linear(inp, hid)
        self.g = nn.Tanh()
        #self.g = nn.ReLU()
        self.W = nn.Linear(hid, out)
        self.sigmoid = nn.Sigmoid()
        
        # Initialize weights according to a formula due to Xavier Glorot.
        nn.init.xavier_uniform_(self.V.weight)
        nn.init.xavier_uniform_(self.W.weight)
        
        self.num_classes = out
        self.loss = nn.BCELoss()
        

    def forward(self, x):
        """
        Runs the neural network on the given data and returns log probabilities of the various classes.

        :param x: a [inp]-sized tensor of input data
        :return: an [out]-sized tensor of log probabilities. (In general your network can be set up to return either log
        probabilities or a tuple of (loss, log probability) if you want to pass in y to this function as well
        """
        raise Exception ("not implemented!!!")
    

Since we’re now using an object instead of just using a function, we first have to instantiate our model

- how many input dimensions does it have? output?

In [58]:
dimensions = None
hidden = None
output = None

model = FFNN(embeddings, dimensions,hidden,output)
print(model)

TypeError: empty() received an invalid combination of arguments - got (tuple, dtype=NoneType, device=NoneType), but expected one of:
 * (tuple of ints size, *, tuple of names names, torch.memory_format memory_format, torch.dtype dtype, torch.layout layout, torch.device device, bool pin_memory, bool requires_grad)
 * (tuple of ints size, *, torch.memory_format memory_format, Tensor out, torch.dtype dtype, torch.layout layout, torch.device device, bool pin_memory, bool requires_grad)


let's run the forward function to compute the output

In [None]:
example = train[0]
print(example)
inp = form_input(example, embeddings)
outp = model.forward(inp['x'])
outp

we can access the logits of the returned value

In [None]:
outp.detach()

And we can change the PyTorch tensor to a Numpy array

In [None]:
outp.detach().numpy()

# Making Predictions

The final layer is a continuous, differentiable function. It will predict a number between 0 and 1. But we want our output to be either 0 or 1 with no in between. How do we transform this output logit into a prediction?

Write a function called predict() that will generate a prediction that is either a 0 or a 1

In [None]:
def predict(model, word_embeddings, example) -> int:
    raise Exception("not implemented!")

In [None]:
output = predict(model, embeddings, "fuck you you fucking fuck")
output

The model doesn't know what google means by toxicity. we have to train it

# Evaluate

First, a way to evaluate the model performance on a whole bunch of predictions

In [None]:
def print_evaluation(golds, predictions):
    """
    Prints evaluation statistics comparing golds and predictions, each of which is a sequence of 0/1 labels.
    Prints accuracy as well as precision/recall/F1 of the positive class, which can sometimes be informative if either
    the golds or predictions are highly biased.

    :param golds: gold labels, list of ints
    :param predictions: pred labels, list of ints
    :return:
    """
    
    num_correct = 0
    num_pos_correct = 0
    num_pred = 0
    num_gold = 0
    num_total = 0
    if len(golds) != len(predictions):
        raise Exception("Mismatched gold/pred lengths: %i / %i" % (len(golds), len(predictions)))
    for idx in range(0, len(golds)):
        gold = golds[idx]
        #print("gold ", gold)
        prediction = predictions[idx]
        #print("prediction ", prediction)
        if prediction == gold:
            num_correct += 1
        if prediction == 1:
            num_pred += 1
        if gold == 1:
            num_gold += 1
        if prediction == 1 and gold == 1:
            num_pos_correct += 1
        num_total += 1
    acc = float(num_correct) / num_total
    output_str = "Accuracy: %i / %i = %f" % (num_correct, num_total, acc)
    prec = float(num_pos_correct) / num_pred if num_pred > 0 else 0.0
    rec = float(num_pos_correct) / num_gold if num_gold > 0 else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec > 0 and rec > 0 else 0.0
    output_str += ";\nPrecision (fraction of predicted positives that are correct): %i / %i = %f" % (num_pos_correct, num_pred, prec)
    output_str += ";\nRecall (fraction of true positives predicted correctly): %i / %i = %f" % (num_pos_correct, num_gold, rec)
    output_str += ";\nF1 (harmonic mean of precision and recall): %f;\n" % f1
    print(output_str)
    return acc, f1, output_str

# Lab: Coding the Training Loop

Your task will be to write the training loop for training the feed forward neural network.



### Get the data into the right shape

huggingface datasets has a .map() function that will alter our dataset in place. it asks for a function as its input.
that function has to be of a particular type. in particular, it has to expect something of the same type as a single example (a dictionary) and return a dictionary. We define an anonymous function using a lambda expression that uses our form_input function. no need to worry about the details here unless you are interested because type calculus is sick and so are lambda expressions.


In [None]:
dataset = dataset.map(lambda ex: form_input(ex, embeddings))
dataset['train'][0]

now our dataset has an extra column with vector representations of each example. we do this once ahead of time rather than on the fly because we will iterate through our training and dev sets many times, and we don't want waste time running form_input over and over

We do the same thing with form_output. clearly, we didnt need separate functions, but it helps to separate our thinking

In [None]:
dataset = dataset.map(lambda ex: form_output(ex))
dataset['train'][0]

### Dataset Split

Remember, the data are already split into train and test sets. 

It's normal to have a third small data split called val (validate) or dev, that we use to evaluate the model while the training loop is running. Then, at the end of training, we evaluate on the test set. The dev set can be used to do things like choose **hyperparameters** for the model---we can't use the test set for this because then we'd be 'cheating'

In [None]:
# load training data

train = dataset['train']
dev = dataset['test']
test = dataset['test']

print(repr(len(train)) + " / " + repr(len(dev)) + " / " + repr(len(test)) + " train/dev/test examples")


Let's set the model hyperparameters. 

In [None]:
# set hyperparameters

num_epochs = 10
hidden_size = 200
initial_learning_rate = 0.1

We initialize a model and an optimizer

In [None]:
ffnn = FFNN(embeddings, embeddings.vector_size, hidden_size, 1)
optimizer = optim.Adam(ffnn.parameters(), lr=initial_learning_rate)

Next we need to code the actual training loop. This loop will run through the training data for a set number of epochs. In each epoch, it should iterate through all of the training examples. Ideally, it will do this in a random order every time. You should run the example through the model, calculate the loss, and then use the loss to run the backward step. **be sure to zero out the gradients before calling the forward step!!** This is done with `ffnn.zero_grad()` Our loss function is stored in the FFNN object, and can be called using `ffnn.loss()`. The backpropagation step is run on the loss, i.e. `loss.backward()` (where `loss` is the loss computed at this time step). At the end of each epoch, the

In [None]:
## training loop for toxicity classification



for epoch in range(0, num_epochs):
    ex_indices = [i for i in range(0, len(train))]
    random.shuffle(ex_indices)
    total_loss = 0.0
    
    ffnn.train()
    for idx in ex_indices:
        
        # get inputs in the right format
        x = torch.Tensor(train[idx]['x'])
        y = train[idx]['y']
        y = torch.from_numpy(np.asarray(y,dtype=np.float32)).unsqueeze(-1)
        
        # Zero out the gradients from the FFNN object. *THIS IS VERY IMPORTANT TO DO BEFORE CALLING BACKWARD()*
        ffnn.zero_grad()
        y_hat = ffnn.forward(x)
        
        # Can also use built-in NLLLoss as a shortcut here but we're being explicit here
        loss = ffnn.loss(y_hat, y)
        total_loss += loss
        
        # Computes the gradient and takes the optimizer step
        loss.backward()
        optimizer.step()
    print("Total loss on epoch %s: %f" % (epoch, total_loss))
    
    ffnn.eval()
    dev_y_hats = [predict(ffnn, embeddings, ex['comment_text']) for ex in dev]
    print_evaluation(dev['y'], dev_y_hats)



# Make Predictions

Let's use the model to make some predictions:

In [None]:
predict(ffnn, embeddings, "fuck you you fucking fuck")

# Improving the Model with Features

This works okay. How can we get better performance? See if you can improve the model to X F1 score. You could try using different embeddings. You could change the model itself, adding more layers or trying a different activation function. You could add dropout after one of the layers (this takes one line to implement with pytorch, but you might have to look up about what it is or how it's used). You can also add hand-crafted features in addition to the embeddings to serve as input. This is a very common strategy and usually helps a lot.

Instead of just representing the input as word embeddings, we can represent the input as a vector of word embeddings concatenated with additional dimensions that encode other variables we are interested in. The features can be real-valued, like: how many tokens are in the example? what's the average word length? Features can also be binary, like: does the sentence contain a question mark? 

For example, s length. Here is a new form_input that adds an average word length feature and a contains X feature

In [None]:
def extract_features(tokens):
    question = 1 if True else 0
    features = {
        "avg_word_length": np.mean([len(t) for t in tokens]),
        "contains_q": 1 if "?" in tokens else 0
    }
    return features

def form_input_updated(example, word_embeddings):
    """
    :example: training example from huggingface dataset in the form of a dictionary, e.g.
    
    {'comment_text': '"\nAnd ... I really don\'t think you understand.  I came here and my idea was bad right away.  What kind of community goes ""you have bad ideas"" go away, instead of helping rewrite them.   "',
     'toxic': 0,
     'severe_toxic': 0,
     'obscene': 0,
     'threat': 0,
     'insult': 0,
     'identity_hate': 0}

    returns a dictionary with a new key, x, which maps to a vector representing the comment
    """
    
    if isinstance(example, str):
        tokens = tokenize(example)
        example = {} # dummy input dict
    else:
        tokens = tokenize(example['comment_text'])

    vecs = []
    for word in tokens:
        try:
            vec = word_embeddings[word]
        except KeyError: # this token is not in our embeddings dictionary
            vec = np.zeros(word_embeddings.vector_size)
        vecs.append(vec)
    
    centroid = np.mean(vecs, axis=0)
    
    
    features_dict = extract_features(tokens)
    features_list = list(features_dict.values())
    
    final_vector = np.append(centroid, features_list, axis=0)
    
    # we need torch form which is a tensor, not a numpy array
    
    torch_tensor = torch.from_numpy(final_vector).float()
    example['x'] = torch_tensor
    return example
    

In [None]:
form_input_updated("hello my name is fhwfhjwhfjwklfjwj4klwfkl?", embeddings)

In [None]:
# re-form our inputs, make a new model and train that one. 

ffnn = FFNN(embeddings, embeddings.vector_size, hidden_size, 1)
optimizer = optim.Adam(ffnn.parameters(), lr=initial_learning_rate)

dataset = dataset.map(lambda ex: form_input_updated(ex, embeddings))


In [None]:
dataset['train'][0]["x"]

In [None]:
## training loop for toxicity classification



for epoch in range(0, num_epochs):
    ex_indices = [i for i in range(0, len(train))]
    random.shuffle(ex_indices)
    total_loss = 0.0
    
    ffnn.train()
    for idx in ex_indices:
        
        # prepare input (x) and gold toxicity label (y) as torch Tensors
        
        x = torch.Tensor(train[idx]['x'])   
        y = train[idx]['y']
        y = torch.from_numpy(np.asarray(y,dtype=np.float32)).unsqueeze(-1)
        
        # Zero out the gradients from the FFNN object. *THIS IS VERY IMPORTANT TO DO BEFORE CALLING BACKWARD()*
        ffnn.zero_grad()
        y_hat = ffnn.forward(x)
        
        # Can also use built-in NLLLoss as a shortcut here but we're being explicit here
        loss = ffnn.loss(y_hat, y)
        total_loss += loss
        
        # Computes the gradient and takes the optimizer step
        loss.backward()
        optimizer.step()
    print("Total loss on epoch %s: %f" % (epoch, total_loss))
    
    ffnn.eval()
    dev_y_hats = [predict(ffnn, embeddings, ex['comment_text']) for ex in dev]
    print_evaluation(dev['y'], dev_y_hats)



# Appendix: Example Training Loop

Here is an example training loop for learning the XOR function

In [None]:
# MAKE THE DATA
# Synthetic data for XOR: y = x1 XOR x2
train_xs = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=np.float32)
train_ys = np.array([0, 1, 1, 0], dtype=np.float32)

# Define some constants
# Inputs are of size 2
feat_vec_size = 2

# Let's use 4 hidden units
embedding_size = 4

# We're using 2 classes. What's presented here is multi-class code that can scale to more classes, though
# slightly more compact code for the binary case is possible.
num_classes = 1


# set hyperparameters
num_epochs = 100
ffnn = FFNN(feat_vec_size, embedding_size, num_classes)
initial_learning_rate = 0.1
optimizer = optim.Adam(ffnn.parameters(), lr=initial_learning_rate)


# RUN TRAINING
for epoch in range(0, num_epochs):
    
    ex_indices = [i for i in range(0, len(train_xs))]
    random.shuffle(ex_indices)
    total_loss = 0.0

    for idx in ex_indices:
        x =  torch.from_numpy(train_xs[idx]).float()
        y = train_ys[idx]
        y = torch.from_numpy(np.asarray(y,dtype=np.float32)).unsqueeze(-1)

        # Build one-hot representation of y. Instead of the label 0 or 1, y_onehot is either [0, 1] or [1, 0]. This
        # way we can take the dot product directly with a probability vector to get class probabilities.
        #y_onehot = torch.zeros(num_classes)
        
        # scatter will write the value of 1 into the position of y_onehot given by y
        #y_onehot.scatter_(0, torch.from_numpy(np.asarray(y,dtype=np.int64)), 1)
        # Zero out the gradients from the FFNN object. *THIS IS VERY IMPORTANT TO DO BEFORE CALLING BACKWARD()*
        ffnn.zero_grad()
        log_probs = ffnn.forward(x)
        
        print(x)
        print(log_probs)
        print(y)
        
        # Can also use built-in NLLLoss as a shortcut here but we're being explicit here
        loss = ffnn.loss(log_probs, y)
        
        total_loss += loss
        # Computes the gradient and takes the optimizer step
        loss.backward()
        optimizer.step()
    print("Total loss on epoch %i: %f" % (epoch, total_loss))
    

# Evaluate on the train set
# RUN TRAINING AND TEST

train_correct = 0
for idx in range(0, len(train_xs)):
    x = torch.from_numpy(train_xs[idx]).float()
    y = train_ys[idx]
    log_probs = ffnn.forward(x)
    prediction = 1 if log_probs > 0.5 else 0
    if y == prediction:
        train_correct += 1
    print("Example " + repr(train_xs[idx]) + "; gold = " + repr(train_ys[idx]) + "; pred = " +\
          repr(prediction) + " with probs " + repr(log_probs))
print(repr(train_correct) + "/" + repr(len(train_ys)) + " correct after training")

# References

- https://vijaygadepalli.medium.com/toxic-comments-classification-696603741872
    