## Prerequisites

torch==1.1.0

In [23]:
from collections import Counter

import pandas as pd 
import numpy as np 
import torch 
import torch.nn as nn 

In [35]:
df = pd.read_json("../jigsaw-toxic-comment-classification-challenge/train.json")

In [36]:
df.head()

Unnamed: 0,id,comment_text,toxic,severe_toxic,obscene,threat,insult,identity_hate,cleaned
0,0000997932d777bf,Explanation\nWhy the edits made under my usern...,0,0,0,0,0,0,"[explanation, edits, made, username, hardcore,..."
1,000103f0d9cfb60f,D'aww! He matches this background colour I'm s...,0,0,0,0,0,0,"[d'aww, match, background, colour, 'm, seeming..."
2,000113f07ec002fd,"Hey man, I'm really not trying to edit war. It...",0,0,0,0,0,0,"[hey, man, 'm, really, trying, edit, war, 's, ..."
3,0001b41b1c6bb37e,"""\nMore\nI can't make any real suggestions on ...",0,0,0,0,0,0,"[``, ca, n't, make, real, suggestion, improvem..."
4,0001d958c54c6e35,"You, sir, are my hero. Any chance you remember...",0,0,0,0,0,0,"[sir, hero, chance, remember, page, 's]"


In this notebook you will learn the basics of PyTorch, the framework you will use to build simple neural networks in this task.  
The first network we will train is a feed-forward neural network (FFNN) with one fully connected hidden layer.  
An FFNN can have one or more fully connected layers; it is also called a multilayer perceptron (MLP). 

Read about PyTorch here:  
https://en.wikipedia.org/wiki/PyTorch

And here:

https://neurohive.io/ru/tutorial/glubokoe-obuchenie-s-pytorch/

While reading these articles you will probably meet some unfamiliar terms: 
backpropagation algorithm, gradient descent, activation function, loss function, etc.  
Please look up what each of them is and why it is needed. 

Answer these questions about neural nets: 

1. In previous tasks we created features manually, weighted them, and selected special words for vectorization. How does deep learning solve this problem? 

2. Why do we work with tensors in PyTorch?

3. Why do we need activation functions in our models? Read about the XOR problem for an MLP without an activation function and use it to answer the question. 

4. What is a gradient? Why do we need the gradient descent algorithm, and which problem does it solve? 

5. What is the backpropagation algorithm? 

6. What is a loss function? 

In [6]:
# Answer for the question number 1 

In [7]:
# Answer for the question number 2

In [8]:
# Answer for the question number 3

In [9]:
# Answer for the question number 4

In [5]:
# Answer for the question number 5

In [10]:
# Answer for the question number 6

Read the following article:

https://en.wikipedia.org/wiki/Feedforward_neural_network

What is an FFNN? 

In [11]:
# Your answer here

## PyTorch basics

### Autograd

In [13]:
# creating a tensor:
x = torch.ones(1, requires_grad=True)

print(x.grad)    # returns None

None


`x.grad` is `None` because no backward pass has been run yet: PyTorch fills in `.grad` only after `backward()` is called on a value computed from `x`.

In [17]:
x = torch.ones(1, requires_grad=True)
y = 20 + x
z = (y ** 2) * 2 
z.backward()     # auto gradient calculation
print(x.grad)    # ∂z/∂x 

tensor([84.])
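
The value 84 can be checked by hand: z = 2(20 + x)^2, so ∂z/∂x = 4(20 + x), which is 84 at x = 1. The following is a minimal sketch in plain Python (no PyTorch) that compares this analytic derivative with a central finite difference. The function names are illustrative and not part of the notebook.

```python
def z_of(x):
    """Same function as in the autograd cell: z = 2 * (20 + x) ** 2."""
    return 2 * (20 + x) ** 2


def dz_dx_analytic(x):
    """Derivative obtained with the chain rule: dz/dx = 4 * (20 + x)."""
    return 4 * (20 + x)


def dz_dx_numeric(x, eps=1e-6):
    """Central finite-difference approximation of dz/dx."""
    return (z_of(x + eps) - z_of(x - eps)) / (2 * eps)


analytic = dz_dx_analytic(1.0)
numeric = dz_dx_numeric(1.0)
print(analytic)                         # 84.0
print(abs(analytic - numeric) < 1e-4)   # True
```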


### Let's create a vocabulary: 

In [38]:
def flat_nested(nested):
    flatten = []
    for item in nested:
        if isinstance(item, list):
            flatten.extend(item)
        else:
            flatten.append(item)
    return flatten

cnt_vocab = Counter(flat_nested(df.cleaned.tolist()))

In [47]:
threshold_count_l = 15
threshold_count_h = 500
threshold_len = 4
cleaned_vocab = [
    token
    for token, count in cnt_vocab.items()
    if threshold_count_l < count < threshold_count_h and len(token) > threshold_len
]

In [48]:
len(cleaned_vocab)

13061

In [49]:
# Each token needs a unique integer id
token_to_id = {token: idx for idx, token in enumerate(sorted(cleaned_vocab))}
id_to_token = {idx: token for token, idx in token_to_id.items()}
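
To illustrate how such a mapping might be used, here is a small sketch on a toy vocabulary. The names `toy_vocab`, `toy_token_to_id`, `toy_id_to_token` and `encode` are hypothetical and chosen so that they do not overwrite the notebook's variables. Skipping out-of-vocabulary tokens is an assumption, not something the notebook prescribes.

```python
# Toy vocabulary standing in for cleaned_vocab (illustrative values only)
toy_vocab = ["username", "background", "explanation"]
toy_token_to_id = {token: idx for idx, token in enumerate(sorted(toy_vocab))}
toy_id_to_token = {idx: token for token, idx in toy_token_to_id.items()}


def encode(tokens, mapping):
    """Map tokens to ids, skipping tokens that are not in the vocabulary."""
    return [mapping[t] for t in tokens if t in mapping]


ids = encode(["explanation", "edits", "username"], toy_token_to_id)
print(ids)                                # [1, 2]
print([toy_id_to_token[i] for i in ids])  # ['explanation', 'username']
```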

### Prepare the data

In [53]:
df.head()

Unnamed: 0,id,comment_text,toxic,severe_toxic,obscene,threat,insult,identity_hate,cleaned
0,0000997932d777bf,Explanation\nWhy the edits made under my usern...,0,0,0,0,0,0,"[explanation, edits, made, username, hardcore,..."
1,000103f0d9cfb60f,D'aww! He matches this background colour I'm s...,0,0,0,0,0,0,"[d'aww, match, background, colour, 'm, seeming..."
2,000113f07ec002fd,"Hey man, I'm really not trying to edit war. It...",0,0,0,0,0,0,"[hey, man, 'm, really, trying, edit, war, 's, ..."
3,0001b41b1c6bb37e,"""\nMore\nI can't make any real suggestions on ...",0,0,0,0,0,0,"[``, ca, n't, make, real, suggestion, improvem..."
4,0001d958c54c6e35,"You, sir, are my hero. Any chance you remember...",0,0,0,0,0,0,"[sir, hero, chance, remember, page, 's]"


In [113]:
for column in df.columns: 
    if column not in ['id', 'comment_text', 'cleaned']:
        df[column] = df[column].astype('int32')
        

# create a toxicity column (sums all of the toxic labels)
df['toxicity'] = df.iloc[:,2:8].sum(axis=1)

clean = df[df['toxicity'] == 0]
obscene = df[df['obscene'] == 1]

# DataFrame.append was removed in pandas 2.0; pd.concat does the same job
df_binary = pd.concat([clean, obscene], ignore_index=True, sort=False)

In [115]:
# Shuffle
df_binary = df_binary.sample(frac=1)
df_binary.reset_index(inplace=True)

In [116]:
df_binary.head()

Unnamed: 0,index,id,comment_text,toxic,severe_toxic,obscene,threat,insult,identity_hate,cleaned,toxicity
0,138659,ac8ec1ee5c7e8878,Now I know why I wasn't that good at school an...,0,0,0,0,0,0,"[know, wa, n't, good, school, think, wa, n't, ...",0
1,79952,ee605f61eceee18c,"""\nYou need to talk to the drafting arb more s...",0,0,0,0,0,0,"[``, need, talk, drafting, arb, —, •, talk, •,...",0
2,56817,a9683a17d9a5d171,You're doing it again. Stop it.,0,0,0,0,0,0,"['re, stop]",0
3,139156,b5778a2fdb6e2abc,Note the control he believes he has over you. ...,0,0,0,0,0,0,"[note, control, belief, ha, like, ira, man, be...",0
4,144782,47f06447da579cb8,"""\nGo rot in hell you evil liittle bastard. ''...",1,0,1,0,1,0,"[``, go, rot, hell, evil, liittle, bastard., `...",3


In [157]:
# Load W2V model 

from gensim.models import KeyedVectors

we_model = KeyedVectors.load_word2vec_format('../GoogleNews-vectors-negative300.bin', binary=True)

In [190]:
# This draws a plain random sample, so the classes stay imbalanced.
# Task: replace it with stratified sampling, for example 500 examples with obscene == 1 and 500 clean examples.
df_sample = df_binary.sample(1000)

# Obtain vectors for documents, vectorized documents list and labels
X = []
documents = []
labels = []
for i, (document, label) in enumerate(zip(df_sample.cleaned, df_sample.obscene)):
    row_vectors = []
    for kw in document:
        try: 
            row_vectors.append(we_model[kw])
        except (IndexError, KeyError): 
            continue
    if not row_vectors:
        continue
    row_vectors = np.asarray(row_vectors)
    vec = row_vectors.mean(axis=0)
    X.append(torch.tensor(vec))
    documents.append(document)
    labels.append(torch.tensor(label, dtype=torch.float))
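
The stratified sampling mentioned in the comment at the top of this cell could look roughly like the sketch below. It works on a plain list of toy labels rather than on `df_sample`, and the helper `balanced_indices` is a hypothetical name, not part of pandas or the notebook.

```python
import random


def balanced_indices(class_labels, n_per_class, seed=0):
    """Return shuffled indices containing n_per_class examples of each label."""
    rng = random.Random(seed)
    by_class = {}
    for i, label in enumerate(class_labels):
        by_class.setdefault(label, []).append(i)
    picked = []
    for idxs in by_class.values():
        # rng.sample raises ValueError if a class has fewer than n_per_class items
        picked.extend(rng.sample(idxs, n_per_class))
    rng.shuffle(picked)
    return picked


# Toy, imbalanced labels: 90 clean (0) and 10 obscene (1)
toy_labels = [0] * 90 + [1] * 10
picked_idx = balanced_indices(toy_labels, n_per_class=10)
print(sum(toy_labels[i] for i in picked_idx), len(picked_idx))  # 10 20
```

With the DataFrame, the same idea amounts to sampling the same number of rows from `df_binary[df_binary.obscene == 1]` and from `df_binary[df_binary.obscene == 0]` and then concatenating and shuffling the two parts.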

### How to create a simple NN: 

In [142]:
import random

In [212]:
# Task: make the model and the training loop work with batches, not only with a single item.

class FeedForward(nn.Module):
    
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.input_size = input_size
        self.hidden_size  = hidden_size
        self.fc1 = nn.Linear(self.input_size, self.hidden_size)
        self.relu = nn.ReLU()
        self.logits = nn.Linear(self.hidden_size, 1)
        self.sigmoid = nn.Sigmoid()
        
    def forward(self, x):
        hidden = self.fc1(x)
        relu = self.relu(hidden)
        logits = self.logits(relu)
        output = self.sigmoid(logits)
        return output
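
To make the computation inside `forward` concrete, here is a plain-Python sketch of the same pipeline (linear layer, ReLU, linear layer, sigmoid) for a single input vector. It does not use PyTorch. The sizes, the random weights and all names in it are illustrative assumptions, and the weight initialization differs from what `nn.Linear` does.

```python
import math
import random


def linear(x, weights, bias):
    """y = W x + b for one input vector x; weights is a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def relu(v):
    """Element-wise max(0, a)."""
    return [max(0.0, a) for a in v]


def sigmoid(a):
    """Squash a real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-a))


rng = random.Random(0)
toy_input_size, toy_hidden_size = 4, 3   # toy sizes; the notebook uses 300 and 100
w1 = [[rng.uniform(-1, 1) for _ in range(toy_input_size)]
      for _ in range(toy_hidden_size)]
b1 = [0.0] * toy_hidden_size
w2 = [[rng.uniform(-1, 1) for _ in range(toy_hidden_size)]]
b2 = [0.0]

x_toy = [0.5, -1.0, 0.25, 2.0]
hidden_toy = relu(linear(x_toy, w1, b1))          # hidden layer activations
prob_toy = sigmoid(linear(hidden_toy, w2, b2)[0])  # probability of the positive class
print(len(hidden_toy), 0.0 < prob_toy < 1.0)       # 3 True
```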

In [208]:
model = FeedForward(300, 100)

In [209]:
criterion = nn.BCELoss()
optimizer = torch.optim.SGD(model.parameters(), lr = 0.01)

In [211]:
# Switch the model to training mode
model.train()
epochs = 20

batch_size = 1    # not used yet; see the task below
for epoch in range(epochs):  
    # Note: each "epoch" here is a single SGD step on one randomly chosen sample.
    # Selects only 1 sample, modify it to select N samples, N == batch_size
    idx = random.sample(range(len(X)), 1)
    optimizer.zero_grad()    # Reset gradients left over from the previous step
    
    x_train = X[idx[0]]
    y_train = labels[idx[0]]
    
    # Forward pass
    y_pred = model(x_train)
    # Compute loss
    loss = criterion(y_pred.squeeze(), y_train)

    print('Epoch {}: train loss: {}'.format(epoch, loss.item()))
    # Backward pass and parameter update
    loss.backward()
    optimizer.step()

Epoch 0: train loss: 1.229011058807373
Epoch 1: train loss: 0.35259485244750977
Epoch 2: train loss: 0.34339311718940735
Epoch 3: train loss: 0.3355575501918793
Epoch 4: train loss: 0.35065779089927673
Epoch 5: train loss: 1.2297319173812866
Epoch 6: train loss: 0.33943042159080505
Epoch 7: train loss: 0.3437454402446747
Epoch 8: train loss: 0.338482528924942
Epoch 9: train loss: 0.337346613407135
Epoch 10: train loss: 0.34821411967277527
Epoch 11: train loss: 0.33761122822761536
Epoch 12: train loss: 0.3327389657497406
Epoch 13: train loss: 0.3430103659629822
Epoch 14: train loss: 0.33334851264953613
Epoch 15: train loss: 0.32554060220718384
Epoch 16: train loss: 0.335260808467865
Epoch 17: train loss: 0.332034707069397
Epoch 18: train loss: 0.32259663939476013
Epoch 19: train loss: 0.32579734921455383
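
One possible reading of this log, offered as an assumption because the labels are not printed next to the loss: the model outputs roughly the same probability (about 0.29) for every sample, so clean samples give a loss near 0.34 and the occasional obscene sample gives a loss near 1.24. The sketch below computes the binary cross-entropy by hand to show where such numbers would come from. The function `bce` is an illustrative reimplementation of the formula, not the code of `nn.BCELoss`.

```python
import math


def bce(p, y):
    """Binary cross-entropy for one prediction p in (0, 1) and label y in {0, 1}."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


loss_clean = bce(0.29, 0)     # predicted 0.29, true label 0
loss_obscene = bce(0.29, 1)   # predicted 0.29, true label 1
print(round(loss_clean, 2), round(loss_obscene, 2))  # 0.34 1.24
```

If this reading is right, the loss barely changes because a single-sample step on an imbalanced sample mostly sees clean examples, which is one more reason to balance the classes and train on batches.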


### Task: 

1. Create a stratified dataset so that your classes are balanced. 

2. Retrain the model. 

3. Add a batch size: modify your model and training loop so that they process batches, not only single items. 

4. Experiment with the model: change the hidden size, the number of layers, etc. 