## Imports 

In [144]:
from collections import Counter

import pandas as pd 
import numpy as np 
import torch
import torch.nn as nn 
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

from string import punctuation
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

import nltk

nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')

lemmatizer = WordNetLemmatizer() 
stop_words = set(stopwords.words('english'))

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


## RNNs

Please read about RNNs (Recurrent Neural Networks).  

1. Understand how they differ from FFNNs. (Write your answer below.)  

https://towardsdatascience.com/recurrent-neural-networks-rnn-explained-the-eli5-way-3956887e8b75

https://towardsdatascience.com/learn-how-recurrent-neural-networks-work-84e975feaaf7

2. Why do we need recurrent neural networks? 
3. For which tasks would they work better? 

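1. Unlike an FFNN, which maps each input to an output in a single pass, a recurrent net feeds its hidden state from the previous step back in as an input at the next step.
2. Recurrent networks can process sequences of varying length with the same set of weights. The hidden state also gives them a kind of memory of earlier elements in the sequence.
3. Recurrent networks are often applied to sequential data, most commonly in NLP.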
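
To make answers 1 and 2 concrete, here is a minimal NumPy sketch of a tanh recurrence of the kind `nn.RNN` applies by default. The function name `rnn_forward`, the weight names, the single merged bias and the toy sizes are assumptions made for illustration; they are not part of the assignment code.

```python
import numpy as np

def rnn_forward(inputs, W_xh, W_hh, b_h):
    """Run a plain tanh RNN over one sequence and return the final hidden state."""
    h = np.zeros(W_hh.shape[0])          # initial hidden state h_0 = 0
    for x_t in inputs:                   # one step per sequence element
        # The previous hidden state h is fed back in together with the new input
        h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)
    return h

rng = np.random.default_rng(0)
emb_size, hid_size = 4, 3                # toy sizes, chosen only for illustration
W_xh = rng.normal(size=(hid_size, emb_size))
W_hh = rng.normal(size=(hid_size, hid_size))
b_h = np.zeros(hid_size)

short_seq = rng.normal(size=(2, emb_size))   # a sequence of 2 steps
long_seq = rng.normal(size=(7, emb_size))    # a sequence of 7 steps

# The same weights handle both lengths; the output size depends only on hid_size
h_short = rnn_forward(short_seq, W_xh, W_hh, b_h)
h_long = rnn_forward(long_seq, W_xh, W_hh, b_h)
print(h_short.shape, h_long.shape)
```

An FFNN with a fixed input layer could not accept both `short_seq` and `long_seq` without padding or truncation, which is the practical difference the answers describe.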

## Load data 

In [145]:
# Load the DF created during the previous task
from google.colab import drive
drive.mount('/content/drive')
df = pd.read_csv("/content/drive/My Drive/train.csv")

def preprocess_text(tokenizer, lemmatizer, stop_words, punctuation, text): 
    tokens = tokenizer(text.lower())
    lemmas = [lemmatizer.lemmatize(token) for token in tokens]
    # Keep lemmas of 5-19 characters that are neither stop words nor punctuation
    return [token for token in lemmas
            if token not in stop_words and token not in punctuation and 4 < len(token) < 20]

df['cleaned'] = df.comment_text.apply(lambda x: preprocess_text(word_tokenize, lemmatizer, stop_words, punctuation, x))

for column in df.columns: 
    if column not in ['id', 'comment_text', 'cleaned']:
        df[column] = df[column].astype('int32')

df['toxicity'] = df.iloc[:,2:8].sum(axis=1)
clean = df[df['toxicity'] == 0]
obscene = df[df['obscene'] == 1]
# DataFrame.append was removed in pandas 2.0; pd.concat is the supported equivalent
df_binary = pd.concat([clean, obscene], ignore_index=True, sort=False)
df_binary = df_binary.sample(frac=1)
df_binary.reset_index(inplace=True)

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [0]:
# Work with a small portion of the data (30% of the rows, stratified by label):
df_sample, _ = train_test_split(df_binary, test_size=0.7, stratify=df_binary['obscene'])

In [147]:
def flat_nested(nested):
    flatten = []
    for item in nested:
        if isinstance(item, list):
            flatten.extend(item)
        else:
            flatten.append(item)
    return flatten

cnt_vocab = Counter(flat_nested(df_sample.cleaned.tolist()))

print("Vocab size before filtering: {}".format(len(cnt_vocab)))

threshold_count_l = 1
threshold_count_h = 500
threshold_len = 2

cleaned_vocab = [token for token, count in cnt_vocab.items() if 
                     threshold_count_h > count > threshold_count_l and len(token) > threshold_len
                ]
print("Vocab size after filtering: {}".format(len(cleaned_vocab)))

Vocab size before filtering: 90423
Vocab size after filtering: 35199


In [0]:
# Add the padding token used by vectorize()
cleaned_vocab.append(" ")
# Convert the list to a set for fast membership checks
cleaned_vocab = set(cleaned_vocab)

In [0]:
token_to_id = {v: k for k, v in enumerate(sorted(cleaned_vocab))}
id_to_token = {v: k for k, v in token_to_id.items()}

Before passing raw text to the model, we need to represent each text as a vector.   
We do this by mapping every token to its id and padding each sequence to a common length with the id of the padding token `' '`. 

In [0]:
def vectorize(data, token_to_id, max_len=None, dtype='int32', batch_first=True):
    """
    Cast a list of token sequences into an RNN-digestible matrix.
        "data" must contain only tokens present in the dictionary; filter out noise tokens beforehand.
    """
    seq_lengths = list(map(len, data))
    max_len = max_len or max(map(len, data))
    # Create a matrix of shape [batch size, max number of tokens in a sequence]
    data_ix = np.zeros([len(data), max_len], dtype) + token_to_id[' ']

    for i in range(len(data)):
        line_ix = [token_to_id[c] for c in data[i]]
        data_ix[i, :len(line_ix)] = line_ix

    return data_ix, seq_lengths
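
As a quick check of the padding scheme, the following toy sketch pads two token lists of different lengths. The vocabulary `toy_vocab` and the helper `pad_to_matrix` are made up for illustration and only mirror the logic of `vectorize`.

```python
import numpy as np

toy_vocab = {" ": 0, "hello": 1, "world": 2, "again": 3}   # " " is the padding token

def pad_to_matrix(data, token_to_id):
    """Map tokens to ids and right-pad every row with the id of ' '."""
    seq_lengths = [len(seq) for seq in data]
    matrix = np.full((len(data), max(seq_lengths)), token_to_id[" "], dtype="int32")
    for i, seq in enumerate(data):
        matrix[i, :len(seq)] = [token_to_id[tok] for tok in seq]
    return matrix, seq_lengths

matrix, seq_lengths = pad_to_matrix([["hello", "world", "again"], ["hello"]], toy_vocab)
print(matrix)        # [[1 2 3]
                     #  [1 0 0]]
print(seq_lengths)   # [3, 1]
```

The true lengths are returned alongside the matrix because the model needs them later to ignore the padded positions.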

In [0]:
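def filter_noise_tokens(df, cleaned_vocab): 
    # Work on a copy so that adding a column to a slice does not trigger SettingWithCopyWarning
    df = df.copy()
    df['filtered_tokens'] = df.cleaned.apply(lambda x: [tok for tok in x if tok in cleaned_vocab])
    return df 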

In [152]:
# After filtering, some comments may have no tokens left (empty lists).
df_sample = filter_noise_tokens(df_sample, cleaned_vocab)

# Remove examples that have no tokens left
df_filtered = df_sample[df_sample['filtered_tokens'].map(len) > 0]

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  


In [0]:
# Perform a stratified train-test split (the classes stay imbalanced)
df_train, df_test = train_test_split(df_filtered, test_size=0.4, stratify=df_filtered['obscene'])

In [154]:
print("Train shape: {}".format(df_train.shape))
print("Test shape: {}".format(df_test.shape))

Train shape: (25755, 12)
Test shape: (17170, 12)


In [0]:
class RNNLoop(nn.Module):
    
    def __init__(self, num_tokens, emb_size=200, hid_size=128):
        # super(self.__class__, self) recurses infinitely under subclassing; use the zero-argument form
        super().__init__()
        self.emb = nn.Embedding(num_tokens, emb_size)
        self.rnn = nn.RNN(emb_size, hid_size, batch_first=True)
        self.logits = nn.Linear(hid_size, 1)
        self.sigmoid = nn.Sigmoid()
        
    def forward(self, x, seq_lengths):
        # Embed the obtained sequence 
        emb = self.emb(x)
        # Pack padded sequence - why do we need this, refer to:
        # https://stackoverflow.com/questions/51030782/why-do-we-pack-the-sequences-in-pytorch
        
        pack = torch.nn.utils.rnn.pack_padded_sequence(emb,
                                                   seq_lengths,
                                                   batch_first=True,
                                                   enforce_sorted=False
                                                  ) 
        all_hidden_states, hidden = self.rnn(pack)
        logits = self.logits(hidden)
        # Cast logits to the range from 0 to 1 
        output = self.sigmoid(logits)
        return output
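
The comment in `forward` asks why the sequence is packed. The sketch below is one way to see it, under assumed toy sizes that are not taken from the notebook: with packing, the final hidden state of a short sequence matches what the RNN produces on that sequence alone, whereas without packing the RNN keeps updating its state on the padded steps.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence

torch.manual_seed(0)
rnn = nn.RNN(input_size=5, hidden_size=4, batch_first=True)

# Two sequences of true lengths 3 and 1; the second is zero-padded to length 3
batch = torch.randn(2, 3, 5)
batch[1, 1:] = 0.0
lengths = [3, 1]

with torch.no_grad():
    packed = pack_padded_sequence(batch, lengths, batch_first=True, enforce_sorted=False)
    _, h_packed = rnn(packed)            # final state taken at each true length
    _, h_padded = rnn(batch)             # final state after all 3 steps, padding included
    _, h_alone = rnn(batch[1:2, :1])     # the short sequence run on its own, no padding

# Compare the final state of the short sequence under each treatment
print(torch.allclose(h_packed[0, 1], h_alone[0, 0], atol=1e-6))
print(torch.allclose(h_padded[0, 1], h_alone[0, 0], atol=1e-6))
```

Since the classifier reads only the final hidden state, skipping the padded steps keeps short comments from being represented by a state that has drifted over padding tokens.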

In [156]:
# Initialise the model 
model = RNNLoop(num_tokens=len(cleaned_vocab))
# specify loss function
criterion = nn.BCELoss()
# specify optimizer
optimizer = torch.optim.Adam(model.parameters(), lr = 1e-2)
history = []

batch_size = 64
n_epochs = 10
n_iters = df_train.shape[0] // batch_size
print("Number of iterations for 1 epoch: {}".format(n_iters))

for epoch in range(n_epochs):
    epoch_loss = 0 
    for step in range(n_iters):

        optimizer.zero_grad()    # Reset gradients left over from the previous step
        # Make a random sample from the dataframe 
        sample = df_train.sample(batch_size)

        # Vectorize the obtained sample 
        batch_ix, seq_lengths = vectorize(sample.filtered_tokens.tolist(), token_to_id)
        # Convert vectorized batch to tensor 
        batch_ix = torch.tensor(batch_ix, dtype=torch.int64)

        # Select true labels 
        y_true = sample.obscene.tolist()
        # Convert true labels to tensor 
        y_true = torch.tensor(y_true, dtype=torch.float)

        # Make prediction 
        y_pred = model(batch_ix, seq_lengths)

        loss = criterion(y_pred.squeeze(), y_true)

        epoch_loss += loss.item() / n_iters
        loss.backward()   # Backward pass 
        optimizer.step()
            
    print('Epoch {}: train loss: {}'.format(epoch, epoch_loss))    

Number of iterations for 1 epoch: 402
Epoch 0: train loss: 0.16399330706497775
Epoch 1: train loss: 0.09334906738770389
Epoch 2: train loss: 0.07606762232124549
Epoch 3: train loss: 0.0681611931122914
Epoch 4: train loss: 0.050832275790625665
Epoch 5: train loss: 0.04252213836370715
Epoch 6: train loss: 0.03742606459554414
Epoch 7: train loss: 0.031569908581852484
Epoch 8: train loss: 0.026241310798943564
Epoch 9: train loss: 0.024904278009142144


In [0]:
# Functions for splitting the test dataset into batches 

def index_marks(nrows, chunk_size):
    return range(1 * chunk_size, (nrows // chunk_size + 1) * chunk_size, chunk_size)

def split(df, chunk_size):
    indices = index_marks(df.shape[0], chunk_size)
    return np.split(df, indices)

In [0]:
def make_predictions(model, df_test, batch_size, threshold): 
    n_prints = 0
    predictions = []
    true_labels = []
    # Split data in batches 
    test_batches = split(df_test, batch_size)
    for batch in test_batches:
        if not batch.empty:
            batch_ix, seq_lengths = vectorize(batch.filtered_tokens.tolist(), token_to_id)
            # Convert vectorized batch to tensor 
            batch_ix = torch.tensor(batch_ix, dtype=torch.int64)

            # Select true labels 
            y_true = batch.obscene.tolist()

            # Make prediction 
            # reshape(-1) keeps a 1-D array even when the last batch holds a single example
            y_pred = model(batch_ix, seq_lengths).detach().reshape(-1).numpy()
            # Convert it to binaries 
            y_pred = [int(pred.item() > threshold) for pred in y_pred]
        
            # Add them to parallel lists 
            predictions.extend(y_pred)
            true_labels.extend(y_true)
        
            # Print some examples with obscene documents texts and predicted and true labels 
            for true, pred, document in zip(y_true, y_pred, batch.comment_text):
                if true == 1.0 and n_prints < 10:
                    print("Predicted label: {}".format(pred))
                    print("True label: {}".format(true))
                    print("Document: {}".format(document))
                    print("*-*-"*20)
                    n_prints += 1
        
    return true_labels, predictions

In [159]:
true_labels, predictions = make_predictions(model, df_test, batch_size=64, threshold=0.3)

Predicted label: 0
True label: 1
Document: A proposal to cut down on AIDS 

Less AIDS would be spread if you were to stop inserting your minuscule penis into little boys.
*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-
Predicted label: 1
True label: 1
Document: HEY 
Nigger. Get a real job you cocksucking jew.
*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-
Predicted label: 1
True label: 1
Document: "

 A barnstar for you! 

  The Photographer's Barnstar your photos are horrible. You are an idiot. Fuck you    "
*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-
Predicted label: 0
True label: 1
Document: Hey 

Hey guess what.  Not that I told anyone this yet but Larry Sanger sucked my c*****shhh during carnival in Rio back in '03 and I nutted in his mouth so basically I get to do whatever I want on Wikipedia for life.  Sucks for you eh?
*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*

In [160]:
# Print a classification report: 

print(classification_report(true_labels, predictions))

              precision    recall  f1-score   support

           0       0.97      0.95      0.96     16264
           1       0.36      0.47      0.41       906

    accuracy                           0.93     17170
   macro avg       0.67      0.71      0.69     17170
weighted avg       0.94      0.93      0.93     17170
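
The f1-score column is the harmonic mean of precision and recall, which is why the obscene class scores far below the headline accuracy of 0.93. A small check against the rounded values printed above (the helper `f1` exists only for this illustration):

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Rounded precision/recall values taken from the report above
f1_clean = round(f1(0.97, 0.95), 2)      # class 0 (not obscene)
f1_obscene = round(f1(0.36, 0.47), 2)    # class 1 (obscene)
print(f1_clean, f1_obscene)              # 0.96 0.41
```

Because class 1 has only 906 of the 17170 test rows, accuracy and the weighted average are dominated by class 0; the class 1 row and the macro average are the more informative numbers here.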



In [0]:
class LSTMnet(nn.Module):
    def __init__(self, num_tokens, emb_size=200, hid_size=128):
        # super(self.__class__, self) recurses infinitely under subclassing; use the zero-argument form
        super().__init__()
        self.emb = nn.Embedding(num_tokens, emb_size)
        self.rnn = nn.LSTM(emb_size, hid_size, batch_first=True)
        self.logits = nn.Linear(hid_size, 1)
        self.sigmoid = nn.Sigmoid()
        
    def forward(self, x, seq_lengths): 
        emb = self.emb(x)
        pack = torch.nn.utils.rnn.pack_padded_sequence(emb, seq_lengths, batch_first=True, enforce_sorted=False) 
        packed_output, (hidden, cell) = self.rnn(pack)
        logits = self.logits(hidden.squeeze(0))
        output = self.sigmoid(logits)
        return output

In [162]:
# Train the LSTM on the obscene label
model = LSTMnet(num_tokens=len(cleaned_vocab))
criterion = nn.BCELoss()
optimizer = torch.optim.Adam(model.parameters(), lr = 1e-2)
history = []

batch_size = 64
n_epochs = 10
n_iters = df_train.shape[0] // batch_size
print("Number of iterations for 1 epoch: {}".format(n_iters))

for epoch in range(n_epochs):
    epoch_loss = 0 
    for step in range(n_iters):

        optimizer.zero_grad()
        sample = df_train.sample(batch_size)

        batch_ix, seq_lengths = vectorize(sample.filtered_tokens.tolist(), token_to_id) 
        batch_ix = torch.tensor(batch_ix, dtype=torch.int64)

        y_true = sample.obscene.tolist()
        y_true = torch.tensor(y_true, dtype=torch.float)

        y_pred = model(batch_ix, seq_lengths)

        loss = criterion(y_pred.squeeze(), y_true)

        epoch_loss += loss.item() / n_iters
        loss.backward()
        optimizer.step()
            
    print('Epoch {}: train loss: {}'.format(epoch, epoch_loss))    

Number of iterations for 1 epoch: 402
Epoch 0: train loss: 0.13710353038834408
Epoch 1: train loss: 0.06930935842017477
Epoch 2: train loss: 0.04380234466826273
Epoch 3: train loss: 0.02714419715463842
Epoch 4: train loss: 0.02467759663588945
Epoch 5: train loss: 0.016561179003673633
Epoch 6: train loss: 0.01505775541404568
Epoch 7: train loss: 0.010431133473226979
Epoch 8: train loss: 0.009237416144064791
Epoch 9: train loss: 0.008229222731683673


In [163]:
true_labels, predictions = make_predictions(model, df_test, batch_size=64, threshold=0.3)
print(classification_report(true_labels, predictions))

Predicted label: 0
True label: 1
Document: A proposal to cut down on AIDS 

Less AIDS would be spread if you were to stop inserting your minuscule penis into little boys.
*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-
Predicted label: 1
True label: 1
Document: HEY 
Nigger. Get a real job you cocksucking jew.
*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-
Predicted label: 0
True label: 1
Document: "

 A barnstar for you! 

  The Photographer's Barnstar your photos are horrible. You are an idiot. Fuck you    "
*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-
Predicted label: 0
True label: 1
Document: Hey 

Hey guess what.  Not that I told anyone this yet but Larry Sanger sucked my c*****shhh during carnival in Rio back in '03 and I nutted in his mouth so basically I get to do whatever I want on Wikipedia for life.  Sucks for you eh?
*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*

In [165]:
# Train the LSTM on the toxic label
model = LSTMnet(num_tokens=len(cleaned_vocab))
criterion = nn.BCELoss()
optimizer = torch.optim.Adam(model.parameters(), lr = 1e-2)
history = []

batch_size = 64
n_epochs = 10
n_iters = df_train.shape[0] // batch_size
print("Number of iterations for 1 epoch: {}".format(n_iters))

for epoch in range(n_epochs):
    epoch_loss = 0 
    for step in range(n_iters):

        optimizer.zero_grad()
        sample = df_train.sample(batch_size)

        batch_ix, seq_lengths = vectorize(sample.filtered_tokens.tolist(), token_to_id) 
        batch_ix = torch.tensor(batch_ix, dtype=torch.int64)

        y_true = sample.toxic.tolist()
        y_true = torch.tensor(y_true, dtype=torch.float)

        y_pred = model(batch_ix, seq_lengths)

        loss = criterion(y_pred.squeeze(), y_true)

        epoch_loss += loss.item() / n_iters
        loss.backward()
        optimizer.step()
            
    print('Epoch {}: train loss: {}'.format(epoch, epoch_loss))    

Number of iterations for 1 epoch: 402
Epoch 0: train loss: 0.12198619825987911
Epoch 1: train loss: 0.06342415216747223
Epoch 2: train loss: 0.03659168031328449
Epoch 3: train loss: 0.027594609049945762
Epoch 4: train loss: 0.016940820696781066
Epoch 5: train loss: 0.01490897669238837
Epoch 6: train loss: 0.013052777406966323
Epoch 7: train loss: 0.009750815514315538
Epoch 8: train loss: 0.009463562128989883
Epoch 9: train loss: 0.010163960251625593


In [166]:
def make_predictions(model, df_test, batch_size, threshold): 
    n_prints = 0
    predictions = []
    true_labels = []
    test_batches = split(df_test, batch_size)
    for batch in test_batches:
        if not batch.empty:
            batch_ix, seq_lengths = vectorize(batch.filtered_tokens.tolist(), token_to_id)

            batch_ix = torch.tensor(batch_ix, dtype=torch.int64)

            y_true = batch.toxic.tolist()
            # reshape(-1) keeps a 1-D array even when the last batch holds a single example
            y_pred = model(batch_ix, seq_lengths).detach().reshape(-1).numpy()
            y_pred = [int(pred.item() > threshold) for pred in y_pred]
        

            predictions.extend(y_pred)
            true_labels.extend(y_true)

    return true_labels, predictions

true_labels, predictions = make_predictions(model, df_test, batch_size=64, threshold=0.3)
print(classification_report(true_labels, predictions))

              precision    recall  f1-score   support

           0       0.98      0.97      0.97     16336
           1       0.49      0.52      0.50       834

    accuracy                           0.95     17170
   macro avg       0.73      0.74      0.74     17170
weighted avg       0.95      0.95      0.95     17170



In [0]:
# Balance the dataset: keep every obscene example and an equal number of fully clean ones
def balance_df(df):
    obscene = df[df['obscene'] == 1]
    categories = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']
    # Rows where every toxicity label is 0
    clean = df[(df[categories] == 0).all(axis=1)]
    clean = clean.sample(len(obscene))
    # DataFrame.append was removed in pandas 2.0, so concatenate and then shuffle
    return pd.concat([obscene, clean]).sample(frac=1)

df_sample = filter_noise_tokens(df_binary, cleaned_vocab)
df_filtered = df_sample[df_sample.astype(str)['filtered_tokens'] != '[]']
df_train, df_test = train_test_split(df_filtered, test_size=0.2, stratify=df_filtered['obscene'])

df_train = balance_df(df_train) 


In [168]:
# Train on the toxic label, using the training set balanced by the obscene label
model = LSTMnet(num_tokens=len(cleaned_vocab))
criterion = nn.BCELoss()
optimizer = torch.optim.Adam(model.parameters(), lr = 1e-2)
history = []

batch_size = 64
n_epochs = 10
n_iters = df_train.shape[0] // batch_size
print("Number of iterations for 1 epoch: {}".format(n_iters))

for epoch in range(n_epochs):
    epoch_loss = 0 
    for step in range(n_iters):

        optimizer.zero_grad()
        sample = df_train.sample(batch_size)

        batch_ix, seq_lengths = vectorize(sample.filtered_tokens.tolist(), token_to_id) 
        batch_ix = torch.tensor(batch_ix, dtype=torch.int64)

        y_true = sample.toxic.tolist()
        y_true = torch.tensor(y_true, dtype=torch.float)

        y_pred = model(batch_ix, seq_lengths)

        loss = criterion(y_pred.squeeze(), y_true)

        epoch_loss += loss.item() / n_iters
        loss.backward()
        optimizer.step()
            
    print('Epoch {}: train loss: {}'.format(epoch, epoch_loss))   

Number of iterations for 1 epoch: 186
Epoch 0: train loss: 0.38721716628279734
Epoch 1: train loss: 0.21900644421737678
Epoch 2: train loss: 0.14094438809420795
Epoch 3: train loss: 0.09143749080718526
Epoch 4: train loss: 0.06720119281872226
Epoch 5: train loss: 0.05604481067688715
Epoch 6: train loss: 0.03504557028645649
Epoch 7: train loss: 0.030621828294799782
Epoch 8: train loss: 0.031731418567404204
Epoch 9: train loss: 0.033035006839156104


In [169]:
true_labels, predictions = make_predictions(model, df_test, batch_size=64, threshold=0.3)
print(classification_report(true_labels, predictions))

              precision    recall  f1-score   support

           0       0.99      0.81      0.89     27035
           1       0.18      0.79      0.29      1400

    accuracy                           0.81     28435
   macro avg       0.58      0.80      0.59     28435
weighted avg       0.95      0.81      0.86     28435



## Task

1. Make the dataset balanced: for example, select all of the obscene messages, count them, and sample an equal number of examples from the clean messages. **(1) See if it increased your score on toxic messages.** 

As an **additional** task you can modify how your dataset is sampled during training/testing. Read about Datasets, Samplers and DataLoaders in PyTorch and try to apply them. 


2. Read about the different types of RNNs (LSTMs and GRUs): 
  https://colah.github.io/posts/2015-08-Understanding-LSTMs/  

  https://towardsdatascience.com/illustrated-guide-to-lstms-and-gru-s-a-step-by-step-explanation-44e9eb85bf21 
  
  **(2)What is the difference between RNN and LSTM? Why do we need LSTM? Explain it in your own words.**  
  
  **(3)What is the difference between LSTM and GRU? Explain it in your own words.** 
  
  
3. Modify your network so that it can work with nn.LSTM or nn.GRU layers. (Their outputs differ slightly from those of nn.RNN, so be careful to modify your code accordingly.) 

4. Compare all of the previous examples: classification with RNN (or LSTM/GRU) and FFNN. **(4) Which one performed better according to the metrics? (5) According to the time?**

5. **(6) How does dataset imbalance influence your model? Read about dataset imbalance and the ways to handle it. (7) Write down below what we can do about it, or implement a solution.** 
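
One common way to handle imbalance without discarding data is to oversample the rare class at batch time. The sketch below combines the Dataset, sampler and DataLoader mentioned in the additional task. `ToyCommentDataset` and the synthetic tensors are hypothetical stand-ins for the real vectorized comments, and the 95/5 split is an assumed ratio chosen only to mimic the imbalance.

```python
import torch
from torch.utils.data import Dataset, DataLoader, WeightedRandomSampler

class ToyCommentDataset(Dataset):
    """Hypothetical dataset holding pre-vectorized comments and binary labels."""
    def __init__(self, features, labels):
        self.features = features
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        return self.features[idx], self.labels[idx]

torch.manual_seed(0)
# Toy data: 95 negative and 5 positive examples
labels = torch.cat([torch.zeros(95), torch.ones(5)])
features = torch.randint(0, 100, (100, 8))      # 8 made-up token ids per example
dataset = ToyCommentDataset(features, labels)

# Weight every example by the inverse frequency of its class
class_counts = torch.bincount(labels.long())
sample_weights = 1.0 / class_counts[labels.long()].float()

# Draw with replacement so that rare positives can appear many times per epoch
sampler = WeightedRandomSampler(sample_weights, num_samples=len(dataset), replacement=True)
loader = DataLoader(dataset, batch_size=20, sampler=sampler)

# Fraction of positives seen over one pass; roughly 0.5 in expectation
seen = torch.cat([y for _, y in loader])
print(seen.mean().item())
```

Compared with undersampling the clean class, this keeps all clean examples available across epochs, at the cost of repeating the few positive ones. The test set should still be left at its natural class ratio.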
  
  

Please answer questions 1-7 and write your answers below: 

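1. Scores became worse. With the same 0.3 threshold, recall on toxic rose from 0.52 to 0.79, but precision fell from 0.49 to 0.18, so F1 dropped from 0.50 to 0.29. (The two runs use different test splits, so the comparison is approximate.)
2. A plain RNN has a single hidden state updated by one tanh layer, so it struggles to retain information over long sequences because gradients vanish. An LSTM adds a separate cell state and uses input, forget and output gates to control what is stored, kept and exposed, which eases this short-term memory problem.
3. A GRU is a simplified variation of the LSTM: it has two gates (reset and update) instead of three and merges the cell state into the hidden state, so it has fewer parameters. Like the LSTM, it was designed to mitigate the vanishing gradient problem.
4. LSTM net
5. FFNN
6. __
7. __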
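
For question 3, the structural difference between the layers shows up directly in their size. An LSTM layer computes four weight blocks per step (input, forget and output gates plus the candidate cell value), a GRU three (reset and update gates plus the candidate state), and a plain RNN one. The sketch below counts parameters for a single-layer, unidirectional recurrent layer as PyTorch parameterizes it (input weights, hidden weights and two bias vectors per block); the helper name `recurrent_param_count` is an assumption made for this example.

```python
def recurrent_param_count(emb_size, hid_size, n_gate_blocks):
    """Parameters of one recurrent layer: input weights, hidden weights, two bias vectors per block."""
    per_block = hid_size * emb_size + hid_size * hid_size + 2 * hid_size
    return n_gate_blocks * per_block

emb_size, hid_size = 200, 128    # defaults used by RNNLoop and LSTMnet above
counts = {name: recurrent_param_count(emb_size, hid_size, blocks)
          for name, blocks in [("RNN", 1), ("GRU", 3), ("LSTM", 4)]}
print(counts)   # {'RNN': 42240, 'GRU': 126720, 'LSTM': 168960}
```

The embedding and output layers are identical across the three networks, so this is the only part of the model whose size changes when the recurrent layer is swapped.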