# Project Overview

## Introduction

Discussing things we care about on online platforms can be difficult. The threat of abuse and harassment means that many people stop expressing themselves and give up on seeking different opinions. Platforms struggle to facilitate conversations effectively, leading many communities to limit or completely shut down user comments.

In an effort to monitor online conversations, one area of focus is the study of negative online behaviors, like toxic comments (i.e. comments that are rude, disrespectful or otherwise likely to make someone leave a discussion).

In this project, I have tried to build a model that is capable of detecting different types of toxicity, such as `threats, obscenity, insults, and identity-based hate`. The dataset consists of comments from Wikipedia’s talk page edits. Such a model will hopefully help online discussions become more productive and respectful.

This was a Kaggle competition hosted in 2018, and the dataset can be found [here](https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge/data).

## 1. Cleaning Data

### 1.1. Importing Libraries and Data

In [2]:
# checking gpu configuration

!nvidia-smi

Thu Apr 29 22:33:24 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 465.19.01    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla V100-SXM2...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   35C    P0    24W / 300W |      0MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [3]:
# hugging face transformers

!pip install transformers



In [8]:
# importing general libraries pandas and numpy
import pandas as pd
import numpy as np

# reading the test and train data
train = pd.read_csv('/content/drive/MyDrive/Jigsaw-Toxic/train.csv')
test = pd.read_csv('/content/drive/MyDrive/Jigsaw-Toxic/test.csv')

In [10]:
from tqdm.notebook import tqdm # progress bar

import regex as re # for regex
import pickle # to save tokenizer


import nltk #for text processing
from unidecode import unidecode # for ascii characters
from nltk.stem import PorterStemmer # for stemming tokens
from nltk.stem import WordNetLemmatizer # for lemmatizing tokens
from nltk.corpus import stopwords # for stopwords

nltk.download('wordnet') # download 'wordnet' for wordnet lemmatizer
nltk.download('stopwords') # download 'stopwords' for list of stopwords
nltk.download('punkt') # tokenizer models used by nltk.word_tokenize
from gensim.scripts.glove2word2vec import glove2word2vec # converts GloVe vectors to word2vec format

# for neural networks
import tensorflow as tf 
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.layers import *
from keras.callbacks import EarlyStopping, ModelCheckpoint
from keras import Model
from keras.layers import SpatialDropout1D

# for transformers
from transformers import RobertaTokenizer, RobertaConfig, RobertaModel,TrainingArguments,Trainer, RobertaForSequenceClassification,get_linear_schedule_with_warmup, XLNetModel, XLNetTokenizer
import torch
from torch.utils.data import Dataset, random_split, DataLoader, \
                             RandomSampler, SequentialSampler, TensorDataset
# metric
from sklearn.metrics import roc_auc_score

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [11]:
max_len_sequence = 256 # maximum number of tokens in a sentence
embedding_dim = 300 # dimension of the word embedding

### 1.2. Clean the data

In [40]:
def clean_data(text):
    
  '''
      function to clean a string for use in classification.
      steps performed are:
      1. lowercasing, collapsing whitespace and transliterating non-ASCII characters
      2. removing punctuation and non-alphabetic tokens
      3. lemmatization and removing stopwords
  '''

  # lowercasing, collapsing whitespace and transliterating non-ASCII characters
  text = text.lower().split()
  text = " ".join(text)
  text = unidecode(text)
  
  # removing punctuation
  text = re.sub(r"[^A-Za-z^,!.\/'+\-=]", " ", text) 
  text = re.sub(r"[-'?!.,#$%\'()*+-/:;<=>@[\\]^_`{|}~“”’∞θ÷α•à−β∅³π‘₹´°£€\×™√²—–&']", "", text)
  text = nltk.word_tokenize(text)
  text =" ".join([word.lower() for word in text if word.isalpha()]) 
    
  # lemmatization and removing stopwords  
  lemmatizer = WordNetLemmatizer() # create the lemmatizer once per call instead of once per word
  text = " ".join([lemmatizer.lemmatize(x) for x in text.split()])
  stop_words = set(stopwords.words('english')) 
  text = " ".join([x for x in text.split() if x not in stop_words])

  return text
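
To make the three cleaning steps easier to follow, here is a minimal sketch that uses only the standard library. It is a simplified stand-in, not the function above: `clean_data_sketch` and `STOP_WORDS` are hypothetical names, the stopword list is a tiny illustrative subset rather than NLTK's English list, and there is no transliteration or lemmatization.

```python
import re

# Hypothetical mini stopword list, standing in for NLTK's English list.
STOP_WORDS = {"the", "is", "a", "to", "and", "you", "are"}

def clean_data_sketch(text: str) -> str:
    """Simplified stand-in for clean_data: no unidecode, no lemmatizer."""
    text = " ".join(text.lower().split())   # lowercase and collapse whitespace
    text = re.sub(r"[^a-z\s]", " ", text)   # replace everything except letters with a space
    tokens = [t for t in text.split() if t not in STOP_WORDS]  # drop stopwords
    return " ".join(tokens)

print(clean_data_sketch("You   ARE the WORST editor, stop!!! 123"))  # worst editor stop
```

Unlike the original, this sketch replaces apostrophes with a space instead of deleting them, so contractions are split differently.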


In [None]:
# cleaning train and test data
train['comment_text'] = train['comment_text'].apply(clean_data)
test['comment_text'] = test['comment_text'].apply(clean_data)

In [None]:
# saving cleaned text as csv
train.to_csv('/content/drive/MyDrive/Jigsaw-Toxic/clean-train.csv')
test.to_csv('/content/drive/MyDrive/Jigsaw-Toxic/clean-test.csv')

In [12]:
# loading cleaned data 
train = pd.read_csv('/content/drive/MyDrive/Jigsaw-Toxic/clean-train.csv')
test = pd.read_csv('/content/drive/MyDrive/Jigsaw-Toxic/clean-test.csv')
train.drop('Unnamed: 0',axis=1, inplace=True)
test.drop('Unnamed: 0',axis=1, inplace=True)

In [14]:
# a sample sentence from train set after cleaning
clean_data(train.comment_text[15])

'juelz santanas age juelz santana wa year old came february th make juelz turn making song diplomat third neff signed cam label roc fella wa year old coming single santana town yes born really could older lloyd bank could birthday passed homie neff year old juelz death god forbid thinking equal go caculator stop changing year birth god'

In [17]:
# shuffling
train = train.sample(frac=1)

In [18]:
print('Train shape: ', train.shape)
print('Test shape: ', test.shape)

Train shape:  (159571, 8)
Test shape:  (153164, 2)


In [19]:
# filling empty comments with the string "NA" and converting the data to NumPy arrays

train["comment_text"] = train["comment_text"].fillna("NA")
list_train_x = train['comment_text'].values
list_classes = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]
list_train_y = train[list_classes].values
test["comment_text"] = test["comment_text"].fillna("NA")
list_test_x = test['comment_text'].values

In [20]:
list_train_x

array(['think current event mark helpful content page constantly changing week episode',
       'tag ha placed patrick cecil bishop requesting speedily deleted wikipedia ha done article seems person group people band club company web content doe indicate subject notable article subject included wikipedia criterion speedy deletion article assert notability may deleted time please see guideline generally accepted notable indicate subject article notable may contest tagging add top page existing db tag leave note article talk page explaining position please remove speedy deletion tag hesitate add information article would confirm subject notability guideline guideline specific type article may want check criterion biography web site band company feel free leave note talk page question',
       'rutherford hayes edits please refrain making unconstructive edits wikipedia rutherford hayes edit chance little shit',
       ...,
       'well done maybe soon back shorten block day maybe soon sic

In [21]:
list_train_y[:5]

array([[0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0],
       [1, 0, 1, 0, 1, 0],
       [0, 0, 0, 0, 0, 0],
       [1, 0, 0, 0, 0, 0]])
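
Each row of `list_train_y` holds six independent 0/1 flags, one per class, so a comment can carry several labels at once. The sketch below decodes a row back into class names; `decode_labels` is a hypothetical helper added only for illustration.

```python
list_classes = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

def decode_labels(row):
    """Return the names of the classes whose flag is 1 in a six-element label row."""
    return [name for name, flag in zip(list_classes, row) if flag == 1]

print(decode_labels([1, 0, 1, 0, 1, 0]))  # ['toxic', 'obscene', 'insult']
print(decode_labels([0, 0, 0, 0, 0, 0]))  # []
```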

## 2. Tokenization and converting to a Dataloader

### 2.1. Tokenization

In [22]:
def get_tokenizer():
    
    '''
    This function is used to get the RoBERTa tokenizer
    '''
    
    tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
    return tokenizer
tokenizer = get_tokenizer()

In [23]:
# applying the tokenizer to our input data

input_ids = [tokenizer.encode(x) for x in tqdm(list_train_x)]  # tokenization
input_ids = pad_sequences(sequences = input_ids, maxlen = max_len_sequence, dtype = 'long', padding='post', truncating='post') # padding (with 0) and truncation of sentences to a common length
# attention mask: 1 where the token id is > 0, else 0
# note: in roberta-base the `<s>` token has id 0 and `<pad>` has id 1, so this rule also masks the leading `<s>` token
attention_masks = [[1 if i>0 else 0 for i in x] for x in input_ids]

Token indices sequence length is longer than the specified maximum sequence length for this model (1541 > 512). Running this sequence through the model will result in indexing errors
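
Padding and masking can also be keyed on the pad token instead of on `id > 0`. The sketch below is a plain-Python illustration under the assumption that `roberta-base` uses id 0 for `<s>`, 1 for `<pad>` and 2 for `</s>`; `pad_and_mask` is a hypothetical helper, not part of the pipeline above.

```python
BOS_ID, PAD_ID, EOS_ID = 0, 1, 2  # assumed roberta-base special-token ids

def pad_and_mask(token_ids, max_len, pad_id=PAD_ID):
    """Truncate or right-pad token_ids to max_len and build the matching attention mask."""
    ids = list(token_ids[:max_len])      # post-truncation, as in the cell above
    mask = [1] * len(ids)                # every real token is attended to, including <s>
    padding = max_len - len(ids)
    return ids + [pad_id] * padding, mask + [0] * padding

ids, mask = pad_and_mask([BOS_ID, 37251, 65, EOS_ID], max_len=6)
print(ids)   # [0, 37251, 65, 2, 1, 1]
print(mask)  # [1, 1, 1, 1, 0, 0]
```

The Hugging Face tokenizer can likely produce the same result directly, for example with `tokenizer(texts, padding='max_length', truncation=True, max_length=256)`, which also avoids the sequence-length warning shown above.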





### 2.2. Converting data to tensors

In [24]:
# converting data to tensors, using the first 80% of rows for training and the rest for validation

split = int(0.8*len(list_train_x)) # index where the validation slice starts

train_input_ids = torch.tensor(input_ids[:split])
val_input_ids = torch.tensor(input_ids[split:])

train_attention_masks = torch.tensor(attention_masks[:split])
val_attention_masks = torch.tensor(attention_masks[split:])

train_y = torch.tensor(list_train_y[:split])
val_y = torch.tensor(list_train_y[split:])

In [25]:
# shape of input
train_attention_masks.shape

torch.Size([127656, 256])
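
The shape above follows from the 80/20 positional split. Because the training frame was shuffled earlier with `train.sample(frac=1)`, slicing by position amounts to a random split. A quick check of the arithmetic, using the row count reported in section 1:

```python
n_rows = 159571               # training rows reported by train.shape
split = int(0.8 * n_rows)     # index where the validation slice starts
n_train, n_val = split, n_rows - split
print(n_train, n_val)  # 127656 31915
```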

### 2.3. Converting data to datasets and dataloaders

In [None]:


# converting tensors to Dataset, each datapoint consists of input_ids, attention_masks, labels

train_dataset = TensorDataset(train_input_ids, train_attention_masks, train_y)  
val_dataset = TensorDataset(val_input_ids, val_attention_masks, val_y)

# sampling
train_sampler = RandomSampler(train_dataset) #A Sampler that returns indices shuffled
val_sampler = SequentialSampler(val_dataset) #A Sampler that returns indices sequentially

# dataset to batches
train_dataloader = DataLoader(train_dataset, sampler = train_sampler, batch_size = 32) 
val_dataloader = DataLoader(val_dataset, sampler = val_sampler, batch_size = 32)

In [28]:
# this shows one item in our dataloader
a = next(iter(train_dataloader))
a

[tensor([[    0, 37251,    65,  ...,     0,     0,     0],
         [    0,  5087,  8490,  ...,     0,     0,     0],
         [    0, 37111,  7316,  ...,     0,     0,     0],
         ...,
         [    0, 16714,  7878,  ...,     0,     0,     0],
         [    0, 14656,  9458,  ...,     0,     0,     0],
         [    0, 23233,  3872,  ...,     0,     0,     0]]),
 tensor([[0, 1, 1,  ..., 0, 0, 0],
         [0, 1, 1,  ..., 0, 0, 0],
         [0, 1, 1,  ..., 0, 0, 0],
         ...,
         [0, 1, 1,  ..., 0, 0, 0],
         [0, 1, 1,  ..., 0, 0, 0],
         [0, 1, 1,  ..., 0, 0, 0]]),
 tensor([[0, 0, 0, 0, 0, 0],
         [0, 0, 0, 0, 0, 0],
         [0, 0, 0, 0, 0, 0],
         [0, 0, 0, 0, 0, 0],
         [0, 0, 0, 0, 0, 0],
         [0, 0, 0, 0, 0, 0],
         [0, 0, 0, 0, 0, 0],
         [0, 0, 0, 0, 0, 0],
         [1, 0, 1, 0, 0, 0],
         [0, 0, 0, 0, 0, 0],
         [1, 0, 0, 0, 1, 0],
         [0, 0, 0, 0, 0, 0],
         [0, 0, 0, 0, 0, 0],
         [0, 0, 0, 0, 0, 0]

## 3. Model Training and Prediction

### 3.1. Define Model

In [27]:
%time
import torch
import torch.nn as nn
from transformers import BertModel


class Bert(nn.Module):
    
    '''
        this class defines our model: a classifier stacked on top of the RoBERTa model
    '''
    def __init__(self, freeze_bert=False):
        
        '''
            initialise configuration of our model
        '''
        
        super(Bert, self).__init__() # initialising parameters of the superclass
        
        
        # Specify hidden size of RoBERTa, hidden size of our classifier, and number of labels
        D_in, H, D_out = 768, 64, 6

        # Instantiate RoBERTa model
        self.bert = RobertaModel.from_pretrained('roberta-base')

        # Instantiate a feed-forward classifier: one hidden layer and one output layer
        self.classifier = nn.Sequential(
            nn.Linear(D_in, H),
            nn.ReLU(),
            nn.Linear(H, D_out)
        )

        # Freeze the RoBERTa model
        if freeze_bert:
            for param in self.bert.parameters():
                param.requires_grad = False
        
    def forward(self, input_ids, attention_mask):
        '''
            this method is invoked when the model instance is called
            computes the logits for one batch from the dataloader
        '''
        # Feed input to RoBERTa
        outputs = self.bert(input_ids=input_ids,
                            attention_mask=attention_mask)
        # Extract the last hidden state of the first token (`<s>`, RoBERTa's counterpart of BERT's `[CLS]`), which is used for classification
        last_hidden_state_cls = outputs[0][:, 0, :]
        # Feed the extracted hidden state to the classifier to compute logits
        logits = self.classifier(last_hidden_state_cls)

        return logits


CPU times: user 3 µs, sys: 0 ns, total: 3 µs
Wall time: 5.72 µs
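
To see what the classification head does in isolation, the sketch below applies the same 768 → 64 → 6 stack to a random tensor that stands in for the encoder output, so nothing has to be downloaded. The batch size of 4 and the random input are assumptions made only for illustration.

```python
import torch
import torch.nn as nn

# Same head sizes as in the class above: 768 -> 64 -> 6
head = nn.Sequential(nn.Linear(768, 64), nn.ReLU(), nn.Linear(64, 6))

# Random stand-in for the encoder output: (batch, sequence length, hidden size)
hidden_states = torch.randn(4, 256, 768)
first_token = hidden_states[:, 0, :]   # hidden state at position 0, shape (4, 768)
logits = head(first_token)             # shape (4, 6): one logit per label

# Number of trainable parameters in the head alone
n_head_params = sum(p.numel() for p in head.parameters())
print(tuple(logits.shape), n_head_params)  # (4, 6) 49606
```

With `freeze_bert=True`, these 49,606 head parameters are the only ones that receive gradient updates.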


In [41]:
# one sample preview for output from model
# this is random now because the model is not trained

bert= Bert(freeze_bert=True)
bert(input_ids = a[0],attention_mask = a[1])

tensor([[ 0.0490,  0.0231, -0.1335, -0.0067, -0.0061,  0.0630]],
       grad_fn=<AddmmBackward>)

### 3.2. Define Metric

In [29]:

def accuracy_thresh(y_pred, y_true, thresh:float=0.4, sigmoid:bool=True):
    
    '''
        this function defines our metric: thresholded accuracy
        for each sample, computes the fraction of the six classes whose
        thresholded prediction matches the label, and returns the sum of
        these fractions over the batch (the caller divides by the dataset size)
    '''
    
    if sigmoid: y_pred = y_pred.sigmoid()
    return np.mean(((y_pred>thresh).float()==y_true.float()).float().cpu().numpy(), axis=1).sum()
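
The metric above is a thresholded accuracy, whereas the competition was scored with the mean column-wise ROC AUC. The two can behave very differently on imbalanced labels. The sketch below is a plain-Python illustration on a toy column; `thresholded_accuracy` and `roc_auc` are hypothetical helpers, and in practice `roc_auc_score` from scikit-learn (imported earlier) would be the natural choice for the AUC.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def thresholded_accuracy(logits, labels, thresh=0.4):
    """Fraction of (sample, label) cells where sigmoid(logit) > thresh matches the label."""
    hits = total = 0
    for row_logits, row_labels in zip(logits, labels):
        for z, y in zip(row_logits, row_labels):
            hits += int((sigmoid(z) > thresh) == bool(y))
            total += 1
    return hits / total

def roc_auc(scores, labels):
    """Probability that a random positive outranks a random negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy column: 1 positive among 10 comments, and a model that scores every comment the same.
labels = [1] + [0] * 9
logits = [-3.0] * 10
print(thresholded_accuracy([[z] for z in logits], [[y] for y in labels]))  # 0.9
print(roc_auc(logits, labels))                                             # 0.5
```

A model that never flags anything reaches 0.9 accuracy here while its ROC AUC stays at chance level, so a high thresholded accuracy alone says little about ranking quality.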

### 3.3. Training and Evaluation

In [34]:
def train(model, loss_fn, train_loader, val_loader,  scheduler,epochs=20, device="cuda", optimizer=None):
    '''
        training the model and evaluation
    '''
    
    model.to(device) # load model to gpu memory
    # note: this creates a new optimizer, so the `optimizer` argument is ignored
    # and the scheduler stays attached to the optimizer it was built with
    optimizer = torch.optim.AdamW(model.parameters()) # calling optimizer here because device affects this, in our case "cuda"
    for epoch in (range(1, epochs+1)):
        
        training_loss = 0.0 # initialise training loss
        valid_loss = 0.0 # initialise validation loss
        model.train() # model is in training phase so weights will be updated
        micro_roc_auc_acc_train = 0.0 # initialise our metric (thresholded accuracy) for training
        micro_roc_auc_acc_val=0.0 # initialise our metric (thresholded accuracy) for validation

        for batch in tqdm(train_loader):
            
            optimizer.zero_grad() # prevents gradients from accumulating
            input_ids, attention_mask, targets = tuple(t.to(device) for t in batch) # loading data to gpu memory
            

            output = model(input_ids, attention_mask) # model outputs
            targets = targets.type_as(output) # same data type of both output and target
         
            micro_roc_auc_acc_train +=  accuracy_thresh(output.view(-1,6),targets.view(-1,6)) # calculating metric for each step

            loss = loss_fn(output, targets) # computing loss
            loss.backward() # computing gradients
            optimizer.step() # move the weights
            scheduler.step() # change learning rate according to schedule
            training_loss += loss.data.item() * input_ids.size(0)
        training_loss /= len(train_loader.dataset) # loss for the epoch
        micro_roc_auc_acc_train /= len(train_loader.dataset) # metric for the epoch
        
        model.eval() # evaluation mode: disables dropout
   
        
        with torch.no_grad(): # gradients are not needed for validation
            for batch in val_loader:
                
                input_ids, attention_mask, targets = tuple(t.to(device) for t in batch) # loading data to gpu memory


                output = model(input_ids, attention_mask) # model outputs
                targets = targets.type_as(output) # same data type of both output and target
                loss = loss_fn(output,targets) # computing loss
                valid_loss += loss.data.item() * input_ids.size(0)

                micro_roc_auc_acc_val +=  accuracy_thresh(output.view(-1,6),targets.view(-1,6)) # metric

        # torch.save({
        #     'epoch': epoch,
        #     'model_state_dict': model.state_dict(),
        #     'optimizer_state_dict': optimizer.state_dict(),
        #     
        #     }, './checkpoint_bert')

        #saving the model
        
        torch.save({
            'epoch': epoch,
            'model_state_dict': model.state_dict(),
            'optimizer_state_dict': optimizer.state_dict(),
    
            }, './checkpoint_bert_one_cycle_4epochs')




        valid_loss /= len(val_loader.dataset) # loss for the epoch
        micro_roc_auc_acc_val /= len(val_loader.dataset) # metric for the epoch

        print('Epoch: {}, Training Loss: {:.2f}, Validation Loss: {:.2f}, auc_score_training = {:.2f}, auc_score_val = {:.2f}'.format(epoch, training_loss,
        valid_loss, micro_roc_auc_acc_train, micro_roc_auc_acc_val))
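
One detail of the function above deserves a closer look: a learning rate scheduler only changes the optimizer object it was constructed with. The function builds a fresh `AdamW` internally, while the scheduler passed in was built around an optimizer created outside. The sketch below illustrates the difference on a tiny stand-in model; the layer sizes, step counts and variable names are assumptions chosen for illustration.

```python
import torch
import torch.nn as nn

model = nn.Linear(4, 1)  # tiny stand-in for the real model

# Optimizer wrapped by a scheduler, and a second one that no scheduler knows about
scheduled = torch.optim.AdamW(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.OneCycleLR(scheduled, max_lr=0.1, total_steps=10)
unscheduled = torch.optim.AdamW(model.parameters())  # PyTorch default lr of 1e-3

lrs = []
for _ in range(5):
    scheduled.zero_grad()
    loss = model(torch.randn(8, 4)).pow(2).mean()
    loss.backward()
    scheduled.step()     # step the same optimizer object the scheduler wraps
    scheduler.step()     # then advance the schedule
    lrs.append(scheduled.param_groups[0]["lr"])

print(lrs)                                 # changes from step to step
print(unscheduled.param_groups[0]["lr"])   # stays at 0.001
```

If the training loop is meant to follow the schedule, one option is to use the `optimizer` argument inside the function instead of creating a new one.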



### 3.4. Tuning hyperparameters

**Training 1**

With the parameters of the RoBERTa model unfrozen, a linear learning rate schedule, and 2 epochs


In [33]:
model = Bert() # bert model
optimizer = torch.optim.AdamW(model.parameters()) # adamw optimizer with default params
scheduler = get_linear_schedule_with_warmup(optimizer, num_training_steps=len(train_dataloader)*2, num_warmup_steps=1)
loss_func = nn.BCEWithLogitsLoss() # Binary cross entropy loss for multilabel classification
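
As I understand it, `get_linear_schedule_with_warmup` multiplies the optimizer's base learning rate by a factor that rises linearly from 0 during warmup and then falls linearly to 0 at the last step. The sketch below reproduces that factor in plain Python; `linear_warmup_factor` is a hypothetical helper, and the total of 1000 steps is an assumed figure for illustration.

```python
def linear_warmup_factor(step, num_warmup_steps, num_training_steps):
    """Multiplier applied to the base learning rate at a given optimizer step."""
    if step < num_warmup_steps:
        return step / max(1, num_warmup_steps)          # linear warmup from 0 to 1
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))  # linear decay to 0

base_lr = 1e-3   # AdamW's default learning rate in PyTorch
total = 1000     # hypothetical number of training steps
for step in (0, 1, 500, 1000):
    print(step, base_lr * linear_warmup_factor(step, 1, total))
```

With `num_warmup_steps=1`, as in the cell above, the warmup phase is a single step, so the schedule is almost a pure linear decay from the base learning rate.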

In [72]:
train(model, optimizer = optimizer, loss_fn=loss_func, train_loader=train_dataloader, val_loader=val_dataloader,scheduler=scheduler, epochs=2)

Epoch: 1, Training Loss: 0.15, Validation Loss: 0.14, auc_score_training = 0.96, auc_score_val = 0.96


Epoch: 2, Training Loss: 0.15, Validation Loss: 0.14, auc_score_training = 0.96, auc_score_val = 0.96
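
The losses printed above are mean binary cross-entropy values computed directly on logits. As a worked example, the sketch below evaluates the same quantity in plain Python using the numerically stable form `max(z, 0) - z*y + log(1 + exp(-|z|))`; `bce_with_logits` is a hypothetical helper for illustration.

```python
import math

def bce_with_logits(logits, targets):
    """Mean binary cross-entropy over all elements, in the numerically stable form."""
    total = 0.0
    for z, y in zip(logits, targets):
        total += max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
    return total / len(logits)

# An uninformative model (all logits 0, i.e. probability 0.5) scores log(2) per label.
print(round(bce_with_logits([0.0] * 6, [1, 0, 1, 0, 1, 0]), 4))  # 0.6931
```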


**Training 2**

With the RoBERTa model unfrozen, a One Cycle learning rate scheduler, and 2 epochs

In [25]:
model = Bert()
optimizer = torch.optim.AdamW(model.parameters())
scheduler = torch.optim.lr_scheduler.OneCycleLR(optimizer, max_lr=0.1, steps_per_epoch=int(len(train_dataloader)), epochs=2)
loss_func = nn.BCEWithLogitsLoss()
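
To get a feel for the shape of the One Cycle schedule, the sketch below recomputes it in plain Python. It assumes PyTorch's documented defaults for `OneCycleLR` (`pct_start=0.3`, `div_factor=25`, `final_div_factor=1e4`, cosine annealing); `one_cycle_lr` is a hypothetical helper, and 100 total steps is an assumed figure for illustration.

```python
import math

def one_cycle_lr(step, total_steps, max_lr, pct_start=0.3, div_factor=25.0, final_div_factor=1e4):
    """Learning rate at a given step of a two-phase cosine one-cycle schedule."""
    initial_lr = max_lr / div_factor           # starting learning rate
    min_lr = initial_lr / final_div_factor     # learning rate at the very last step
    peak_step = pct_start * total_steps - 1    # step at which max_lr is reached

    def cos_anneal(start, end, pct):
        return end + (start - end) / 2.0 * (math.cos(math.pi * pct) + 1.0)

    if step <= peak_step:                      # phase 1: warm up to max_lr
        return cos_anneal(initial_lr, max_lr, step / peak_step)
    # phase 2: anneal from max_lr down to min_lr
    return cos_anneal(max_lr, min_lr, (step - peak_step) / (total_steps - 1 - peak_step))

for step in (0, 29, 99):
    print(step, one_cycle_lr(step, total_steps=100, max_lr=0.1))
```

Under these assumptions, `max_lr=0.1` means the schedule starts at 0.004, peaks at 0.1 after roughly 30% of the steps, and ends near 4e-7.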

In [26]:
train(model, loss_func, train_dataloader, val_dataloader, scheduler, 2)

Epoch: 1, Training Loss: 0.14, Validation Loss: 0.14, auc_score_training = 0.96, auc_score_val = 0.96


Epoch: 2, Training Loss: 0.14, Validation Loss: 0.14, auc_score_training = 0.96, auc_score_val = 0.96


__One Cycle LR gives a slightly lower training loss than the linear schedule (0.14 vs. 0.15), with the same validation loss and metric score, so from now on I will use One Cycle LR__

**Training 3**

With the RoBERTa model frozen

In [32]:
model = Bert(freeze_bert=True)
optimizer = torch.optim.AdamW(model.parameters())
scheduler = torch.optim.lr_scheduler.OneCycleLR(optimizer, max_lr=0.1, steps_per_epoch=int(len(train_dataloader)), epochs=2)
loss_func = nn.BCEWithLogitsLoss()

In [33]:
train(model, loss_func, train_dataloader, val_dataloader, scheduler, 2)

Epoch: 1, Training Loss: 0.11, Validation Loss: 0.09, auc_score_training = 0.97, auc_score_val = 0.97


Epoch: 2, Training Loss: 0.10, Validation Loss: 0.09, auc_score_training = 0.97, auc_score_val = 0.97


__The frozen model gives a better metric score (0.97 vs. 0.96) and a lower validation loss (0.09 vs. 0.14)__

**Training 4**

With 4 epochs

In [35]:
model = Bert(freeze_bert=True)
optimizer = torch.optim.AdamW(model.parameters())
scheduler = torch.optim.lr_scheduler.OneCycleLR(optimizer, max_lr=0.1, steps_per_epoch=int(len(train_dataloader)), epochs=4)
loss_func = nn.BCEWithLogitsLoss()

In [36]:
train(model, loss_func, train_dataloader, val_dataloader, scheduler, 4)

Epoch: 1, Training Loss: 0.11, Validation Loss: 0.09, auc_score_training = 0.97, auc_score_val = 0.97


Epoch: 2, Training Loss: 0.10, Validation Loss: 0.09, auc_score_training = 0.97, auc_score_val = 0.97


Epoch: 3, Training Loss: 0.10, Validation Loss: 0.09, auc_score_training = 0.97, auc_score_val = 0.97


Epoch: 4, Training Loss: 0.10, Validation Loss: 0.09, auc_score_training = 0.97, auc_score_val = 0.97


__Increasing the number of epochs did not improve the result, so I will use the model trained for 2 epochs__

### 3.5. Making Predictions on the Test Dataset

In [45]:
# loading model to make predictions and submission

model = Bert()

checkpoint = torch.load('/content/drive/MyDrive/Jigsaw-Toxic/checkpoint_bert_one_cycle_4epochs')
model.load_state_dict(checkpoint['model_state_dict'])

model.eval()

Bert(
  (bert): RobertaModel(
    (embeddings): RobertaEmbeddings(
      (word_embeddings): Embedding(50265, 768, padding_idx=1)
      (position_embeddings): Embedding(514, 768, padding_idx=1)
      (token_type_embeddings): Embedding(1, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): RobertaEncoder(
      (layer): ModuleList(
        (0): RobertaLayer(
          (attention): RobertaAttention(
            (self): RobertaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): RobertaSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-

In [38]:
# test data
test.head()

|   | id | comment_text |
|---|---|---|
| 0 | 00001cee341fdb12 | yo bitch ja rule succesful ever whats hating s... |
| 1 | 0000247867823ef7 | rfc title fine imo |
| 2 | 00013b17ad220c46 | source zawe ashton lapland |
| 3 | 00017563c3f7919a | look back source information updated wa correc... |
| 4 | 00017695ad8997eb | anonymously edit article |


In [41]:
# clean the test data
test['comment_text'] = test['comment_text'].apply(clean_data)

In [42]:
# tokenizing the data
list_test_x = test['comment_text'].values
input_ids = [tokenizer.encode(x) for x in tqdm(list_test_x)]
input_ids = pad_sequences(sequences = input_ids, maxlen = max_len_sequence, dtype = 'long', padding='post', truncating='post')
attention_masks = [[1 if i>0 else 0 for i in x] for x in input_ids]

In [60]:
# converting to tensors
test_input_ids = torch.tensor(input_ids)
test_attention_masks = torch.tensor(attention_masks)

test_dataset = TensorDataset(test_input_ids, test_attention_masks)

test_sampler = SequentialSampler(test_dataset)
test_dataloader = DataLoader(test_dataset, sampler = test_sampler, batch_size = 1)

__The batch size is 1 here so that each prediction can be appended to a list one comment at a time__

In [61]:
def preds(model,test_loader, device=torch.device("cuda")):
    
    '''
        this function makes prediction on test data
        
    '''
    model.to(device)
    predictions = []
    for batch in tqdm(test_loader):
        input_ids, attention_mask = tuple(t.to(device) for t in batch)

        with torch.no_grad():
            outputs = model(input_ids, attention_mask)
            outputs = torch.sigmoid(outputs) # sigmoid converts logits to probabilities; it does not change their ranking, so ROC AUC would be the same without it
            predictions.append(outputs.cpu().detach().numpy().tolist())
    return predictions
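
A batch size of 1 is not strictly required for building the list of predictions. The sketch below shows, in plain Python, how per-batch outputs of any size could be flattened into one row per comment with `extend`; `collect_predictions` and the fake batches are hypothetical and only stand in for the model outputs.

```python
def collect_predictions(batches):
    """Flatten per-batch outputs (each a list of rows) into one list of rows."""
    predictions = []
    for batch_outputs in batches:
        predictions.extend(batch_outputs)   # extend keeps one row per comment
    return predictions

# Two hypothetical batches of sigmoid outputs: sizes 2 and 1, six scores per comment.
fake_batches = [
    [[0.9, 0.3, 0.9, 0.1, 0.8, 0.2], [0.1, 0.0, 0.0, 0.0, 0.0, 0.0]],
    [[0.0, 0.0, 0.0, 0.0, 0.0, 0.0]],
]
rows = collect_predictions(fake_batches)
print(len(rows), len(rows[0]))  # 3 6
```

In `preds`, the equivalent change would be to call `predictions.extend(outputs.cpu().tolist())`, which should allow a larger batch size and faster inference.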

In [62]:
predictions = preds(model=model,test_loader=test_dataloader)
predictions = np.array(predictions)[:,0]

In [63]:
# converting into a format required for submission

submission = pd.DataFrame(predictions,columns=['toxic','severe_toxic','obscene','threat','insult','identity_hate'])
test[['toxic','severe_toxic','obscene','threat','insult','identity_hate']]=submission
final_sub = test[['id','toxic','severe_toxic','obscene','threat','insult','identity_hate']]
final_sub.head()

|   | id | toxic | severe_toxic | obscene | threat | insult | identity_hate |
|---|---|---|---|---|---|---|---|
| 0 | 00001cee341fdb12 | 0.959409 | 0.316901 | 0.92823 | 0.056724 | 0.886492 | 0.224608 |
| 1 | 0000247867823ef7 | 0.055276 | 0.00079 | 0.015786 | 0.000569 | 0.015154 | 0.002363 |
| 2 | 00013b17ad220c46 | 0.012814 | 0.000324 | 0.004613 | 0.000281 | 0.003916 | 0.000835 |
| 3 | 00017563c3f7919a | 0.021597 | 0.000449 | 0.007614 | 0.000362 | 0.006122 | 0.000623 |
| 4 | 00017695ad8997eb | 0.037285 | 0.00097 | 0.012257 | 0.000707 | 0.012376 | 0.001658 |


In [64]:
# converting to a csv
final_sub.to_csv('/content/drive/MyDrive/Jigsaw-Toxic/submission.csv', index=False) # index=False keeps the row index out of the submission file