# Advanced Programme in Deep Learning (Foundations and Applications)
## A Program by IISc and TalentSprint

### Mini Project Notebook: Irrelevant/inappropriate Questions Classification using Deep Neural Networks.


## Learning Objectives

At the end of the mini-hackathon, you will be able to:

* preprocess the text data
* represent the text/words using pretrained word embeddings - Word2Vec/Glove
* build deep neural networks to classify questions as irrelevant/inappropriate or not


## Dataset

The challenge in this competition is to predict whether a question asked on a well known public forum/platform is irrelevant/inappropriate or not.

An irrelevant/inappropriate question is defined as a question intended to make a statement rather than to look for helpful/meaningful answers. The following are some of the characteristics that can signify that a question is irrelevant/inappropriate:

* Based on false information, or contains absurd assumptions
* Has a non-neutral tone
* Has an exaggerated tone to underscore a point about a group of people
* Is rhetorical and meant to imply a statement about a group of people
* Is disparaging or inflammatory against an individual or a group of people
* Uses sexual content (such as incest, pedophilia), and not to seek genuine answers
* Suggests a discriminatory idea against a protected class of people, or seeks confirmation of a stereotype
* Based on an unrealistic premise about a group of people
* Is not grounded in reality

The training dataset includes the 1,044,897 questions that were asked, each labeled as irrelevant/inappropriate (target = 1) or relevant/appropriate (target = 0). The test dataset consists of approximately 261,000 questions.

The training data might be imbalanced or noisy. They are not guaranteed to be perfect. Please take the necessary actions/steps while building the model.


## Description

This dataset has the following information:

1. **qid** - unique question identifier
2. **question_text** - the text of the question asked in the well known public forum/platform
3. **target** - a question labeled "irrelevant/inappropriate" has a value of 1, otherwise 0



## Problem Statement

To classify approximately 261000 questions asked on a well-known public forum as 'irrelevant/inappropriate' or 'relevant/appropriate' using Deep Neural Networks such as RNN/CNN/BERT/LSTM.

## Grading = 10 Marks

Here is a handy link to Kaggle's competition documentation (https://www.kaggle.com/docs/competitions), which includes, among other things, instructions on submitting predictions (https://www.kaggle.com/docs/competitions#making-a-submission).

## Instructions for downloading train and test dataset from Kaggle API are as follows:

### 1. Create an API key in Kaggle.

To do this, go to the competition site on Kaggle at (https://www.kaggle.com/t/bde6f23028154933a99e4b4ca8a3dff2) and click on user then click on your profile as shown below. Click Account.

![alt text](https://cdn.iisc.talentsprint.com/DLFA/Experiment_related_data/Capture-NLP.PNG)

### 2. Next, scroll down to the API access section and click on **Create New Token** to download an API key (kaggle.json).

![alt text](https://cdn.iisc.talentsprint.com/DLFA/Experiment_related_data/Capture-NLP_1.PNG)

### 3. Upload your kaggle.json file using the following snippet in a code cell:



Set Runtime Type to GPU

In [None]:
from google.colab import files
files.upload()

Saving kaggle.json to kaggle (1).json


{'kaggle.json': b'{"username":"ssupadhya","key":"<redacted>"}'}

In [None]:
#If successfully uploaded in the above step, the 'ls' command here should display the kaggle.json file.
%ls

'kaggle (1).json'   kaggle.json   sample_data/


### 4. Install the Kaggle API using the following command


The command below may fail the first time it is executed. If it does, restart the runtime and run the notebook again from the start; the command then executes successfully.

In [None]:
!pip install -U -q kaggle==1.5.8

### 5. Move the kaggle.json file into ~/.kaggle, which is where the API client expects your token to be located:



In [None]:
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/

In [None]:
# Execute the following command to verify whether the kaggle.json is stored in the appropriate location: ~/.kaggle/kaggle.json
!ls ~/.kaggle

kaggle.json


In [None]:
!chmod 600 /root/.kaggle/kaggle.json # run this command to ensure your Kaggle API token is secure on colab
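The placement and permission steps above can also be checked from Python before calling the Kaggle CLI. The following is a minimal sketch, not part of the original notebook: `check_kaggle_token` is a hypothetical helper, and it only assumes that the token is a JSON object with `username` and `key` fields, as seen in the upload output earlier.

```python
import json
import os
import stat
from pathlib import Path


def check_kaggle_token(token_path):
    """Return a list of problems found with a Kaggle token file (empty list = OK)."""
    path = Path(token_path)
    if not path.is_file():
        return [f"{path} does not exist"]
    problems = []
    # Group/other permission bits should be cleared, i.e. the mode should be 600.
    mode = stat.S_IMODE(path.stat().st_mode)
    if mode & (stat.S_IRWXG | stat.S_IRWXO):
        problems.append(f"permissions are {oct(mode)}, expected 0o600")
    # The token is a small JSON object with "username" and "key" fields.
    try:
        token = json.loads(path.read_text())
    except json.JSONDecodeError:
        return problems + ["file is not valid JSON"]
    for field in ("username", "key"):
        if field not in token:
            problems.append(f"missing field: {field}")
    return problems


# Usage on Colab (path taken from the steps above):
# print(check_kaggle_token(os.path.expanduser("~/.kaggle/kaggle.json")))
```

An empty list means the file is in place, readable only by its owner, and structurally valid; it does not confirm that the key itself is still accepted by Kaggle.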

### 6. Now download the train and test data from Kaggle

**NOTE: If you get a '404 - Not Found' error after running the cell below, it is most likely that the user (whose kaggle.json is uploaded above) has not 'accepted' the rules of the competition and therefore has 'not joined' the competition.**

If you encounter a **401 - Unauthorized** error, download the latest **kaggle.json** by repeating steps 1 and 2.

In [None]:
#If you get a forbidden link, you have most likely not joined the competition.
!kaggle competitions download -c toxic-questions-classification

Downloading toxic-questions-classification.zip to /content
 99% 60.0M/60.6M [00:04<00:00, 17.2MB/s]
100% 60.6M/60.6M [00:04<00:00, 15.0MB/s]


In [None]:
!unzip /content/toxic-questions-classification.zip

Archive:  /content/toxic-questions-classification.zip
  inflating: sample_submission.csv   
  inflating: test_dataset.csv        
  inflating: train_dataset.csv       


## YOUR CODING STARTS FROM HERE

## Import required packages

`nlpaug` is used for data augmentation, which oversamples the minority class in the data.

In [None]:
# Import required packages
import numpy as np
import pandas as pd
from sklearn.utils import shuffle
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from gensim.utils import simple_preprocess
import keras
from keras.preprocessing.text import Tokenizer
from keras.utils import pad_sequences
from keras.layers import Input, Embedding, Dense, Bidirectional, Dropout, GRU
from keras.models import Sequential

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


##   **Stage 1**: Data Loading and Exploratory Data Analysis (1 Point)

In [None]:
# Data Loading
df_train = pd.read_csv('train_dataset.csv')
df_test = pd.read_csv('test_dataset.csv')

In [None]:
df_train.head()

Unnamed: 0,qid,question_text,target
0,2549b81c4adff1849a7f,Is CSE at bit Meara good?,0
1,0558ed93a4630e68f7ac,Is it better to exercise before or after the b...,0
2,5d72d5233059e44f8a8e,Can character naming in writing infringe on tr...,0
3,3968636ac28841d0c901,Why does everyone making YouTube videos in Jap...,0
4,201d2b9a777bbf25443f,Is there any relation between horse power and ...,0


In [None]:
df_test.head()

Unnamed: 0,qid,question_text
0,d5cacbea9be29bd47a78,Is Minance any good?
1,5650c4a236fe3b555c31,Do computers have reserved key strokes?
2,b778db4f09f9326195ea,When was the last time that the US had such a ...
3,e91c299cffc74a66aaf5,Are you still living in Wasilla?
4,2e129e7a85739a73b70a,What distinguishes the acting style of Piolo P...


In [None]:
df_train.shape

(1044897, 3)

In [None]:
df_test.shape

(261221, 2)

In [None]:
df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1044897 entries, 0 to 1044896
Data columns (total 3 columns):
 #   Column         Non-Null Count    Dtype 
---  ------         --------------    ----- 
 0   qid            1044897 non-null  object
 1   question_text  1044897 non-null  object
 2   target         1044897 non-null  int64 
dtypes: int64(1), object(2)
memory usage: 23.9+ MB


No missing values in the train dataset

In [None]:
df_train.target.value_counts()

0    980293
1     64604
Name: target, dtype: int64

In [None]:
df_train.target.value_counts(normalize=True)

0    0.938172
1    0.061828
Name: target, dtype: float64

The data is imbalanced: class 0 is the majority class with about 94% of the records, while class 1 is the minority class with about 6%.

This needs to be addressed so that the model does not become biased towards the majority class.
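To put a number on the imbalance, the sketch below computes inverse-frequency class weights from the counts printed above, using the heuristic `n_samples / (n_classes * class_count)` (the same formula scikit-learn applies for `class_weight='balanced'`). This is only an illustration of the size of the imbalance: the notebook itself addresses it through augmentation, and `balanced_class_weights` is a hypothetical helper name.

```python
def balanced_class_weights(class_counts):
    """Weight each class by n_samples / (n_classes * class_count)."""
    n_samples = sum(class_counts.values())
    n_classes = len(class_counts)
    return {label: n_samples / (n_classes * count)
            for label, count in class_counts.items()}


# Counts taken from the value_counts() output above.
counts = {0: 980293, 1: 64604}
weights = balanced_class_weights(counts)
print(weights)
```

With these weights each class contributes the same total weight, so a weighted loss would treat an error on a class 1 question as roughly fifteen times as costly as an error on a class 0 question.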

In [None]:
df_train.duplicated().sum()

0

No duplicate records found.

In [None]:
df_train[df_train.target == 1].head()

Unnamed: 0,qid,question_text,target
16,8ea797496fc68c9d8d98,Why are black people always tormented?,1
28,72e1085eab12b6aa55e2,How do you spell aye?,1
29,8137a860b078efcadd4c,Why do Conservatives want all news to be conse...,1
55,4233e8ed3bbbf5b8a242,Are we all for calling the people born in the ...,1
67,4c4e07c6a1723d0fe649,Why did the frustrated Catholics of South Indi...,1


Analysis based on number of characters in the question text:

In [None]:
print('Maximum length of the question text', df_train.question_text.str.len().max())
print('Minimum length of the question text', df_train.question_text.str.len().min())
print('Average length of the question text', df_train.question_text.str.len().mean())

Maximum length of the question text 878
Minimum length of the question text 1
Average length of the question text 70.67046321312053


In [None]:
df_train[df_train.question_text.str.len() == 878]

Unnamed: 0,qid,question_text,target
875869,1ffca149bd0a19cd714c,What is [math]\overbrace{\sum_{\vartheta=8}^{\...,1


In [None]:
df_train.question_text.loc[875869]

'What is [math]\\overbrace{\\sum_{\\vartheta=8}^{\\infty} \\vec{\\frac{\\sum_{\\kappa=7}^{\\infty} \\overbrace{1x^0}^{\\text{Read carefully.}}-3x^{-1} \\div 1x^5+{\\sqrt[3]{2x^{-3}}}^{1x^0}+\\vec{\\vec{{3x^{-3}}^{1x^{-2}}}}}{\\sum_{\\dagger=9}^{\\infty} \\vec{\\boxed{\\boxed{3x^{-1}}+3x^1 \\times 1x^{-5}}}}} \\div \\sin(\\boxed{\\boxed{\\vec{3x^{-5}}}+\\sqrt[4]{2x^{-4}}+\\vec{2x^{-3}} \\div \\sin(\\sqrt[5]{\\int_{1x^5}^{2x^5} 2x^{-3} d\\varrho}) \\times \\vec{{\\underbrace{2x^1}_{\\text{Prove This.}}}^{3x^4} \\div \\sqrt[5]{2x^{-3}}+\\sum_{\\theta=8}^{\\infty} 1x^4}}) \\times {\\boxed{\\vec{\\sum_{\\nu=8}^{\\infty} \\sum_{4=6}^{\\infty} \\sum_{\\xi=9}^{\\infty} \\boxed{3x^1}-\\boxed{\\sqrt[3]{\\sqrt[3]{2x^{-2}}}}}}}^{1x^3}-\\cos({{\\tan(\\sum_{0=6}^{\\infty} \\tan(\\overbrace{\\frac{\\boxed{1x^1}-\\sqrt[3]{3x^{-2}}}{\\sum_{\\eta=10}^{\\infty} 1x^{-3} \\div 1x^1}}^{\\text{Molar Quantity.}}))}^{1x^3}}^{1x^{-4}})}^{\\text{Expanded.}}[/math]?'

Data cleanup is required to extract meaningful words from question text such as this.
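One possible cleanup for questions like the one above is to collapse each `[math]...[/math]` span into a single placeholder token, so that the LaTeX markup does not flood the tokenizer. This is a minimal sketch under that assumption; the helper name `replace_math` and the placeholder string are hypothetical, and the notebook does not apply this step.

```python
import re

# Matches a [math]...[/math] span, including newlines inside the formula.
MATH_SPAN = re.compile(r"\[math\].*?\[/math\]", flags=re.DOTALL)


def replace_math(text, placeholder="[formula]"):
    """Replace every [math]...[/math] span with a short placeholder token."""
    return MATH_SPAN.sub(placeholder, text)


sample = r"What is [math]\frac{1}{x}[/math] when x grows?"
print(replace_math(sample))  # What is [formula] when x grows?
```

The non-greedy `.*?` keeps two separate formulas in one question from being merged into a single match.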

In [None]:
df_train[df_train.question_text.str.len() < 10]

Unnamed: 0,qid,question_text,target
32540,0c2a113858db20e0a4db,Quora:,1
74507,48206e5f0dcedf1f00e6,Hungary:,1
83882,45efae151057c2c0e49c,To Quora:,1
133702,7014915ed4fd6def410e,I'm an,1
208279,c309469a202434b5f1d2,W,1
307367,18b058d2aabadb23c12d,In Islam?,0
348868,83d01336b3406133723e,Bye Bye?,1
365454,7abbb52cdd2cd7bc5e48,#NAME?,1
472383,2cfd7dec2231e47afd6c,I 12?,0
483562,a7193652063b3b3b2566,#NAME?,0


Although these question texts do not make much sense, they need to be retained because most of them belong to class 1, the minority class.

Analysis based on number of words in the question text:

In [None]:
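# Find the question with the most whitespace-separated words
idx_max = df_train.question_text.str.split().str.len().idxmax()
val_max = df_train.question_text.loc[idx_max]
# Note: len() on a string counts characters, so this is the character length of that question
chars_max = len(val_max)
print(idx_max)
print(chars_max)
print(val_max)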

348157
752
In "Star Trek 2013" why did they :

*Spoilers*
*Spoilers*
*Spoilers*
*Spoilers*

1)Make warping look quite a bit like an hyperspace jump
2)what in the world were those bright particles as soon as they jumped.
3)Why in the world did they make it possible for two entities to react in warp space in separate jumps.
4)Why did Spock get emotions for this movie.
5)What was the point of hiding the "Enterprise" underwater.
6)When they were intercepted by the dark ship, how come they reached Earth when they were far away from her.(I don't seem to remember the scene where they warp to earth).
7)How did the ship enter earth's atmosphere when it wasnt even in orbit.
8)When Scotty opened the door of the black ship , how come pike and khan didn't slow down?


In [None]:
df_train[df_train.question_text.str.contains("Spoilers")]

Unnamed: 0,qid,question_text,target
348157,663c7523d48f5ee66a3e,"In ""Star Trek 2013"" why did they :\n\n*Spoiler...",0
497353,5f8adae7e14ca03c781b,Spoilers: Why prime minister did nothing after...,0
543845,9a203937cbcc8add5baf,How can I block a topic on Quora? Spoilers abo...,0
651755,21db0297c7942c7a6bc2,Spoilers: How Aarav knew that he would find th...,0
791063,caaf597913fd836c819a,[Spoilers] What is the probability of finding ...,0
818156,08c47e108dbca9d8859f,(Spoilers) Why does Thanos sound so gloomy aft...,0
1027622,f3f391f13f83afdc1260,[Spoilers] In the 2017 Ghost in the Shell movi...,0


In [None]:
print(df_train.question_text.loc[1027622])

[Spoilers] In the 2017 Ghost in the Shell movie, where did the antagonist get his body?


The questions listed above seem to be valid even though they contain the word "Spoilers".

Observations:

1) The data is imbalanced.
2) Bad data:
a) Multiple questions in a single question text entry.

##   **Stage 2**: Data Pre-Processing  (1 Point)

####  Clean and Transform the data into a specified format


Oversampling the minority class

In [None]:
pip install nlpaug

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting nlpaug
  Downloading nlpaug-1.1.11-py3-none-any.whl (410 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 410.5/410.5 kB 22.8 MB/s eta 0:00:00
Installing collected packages: nlpaug
Successfully installed nlpaug-1.1.11


In [None]:
import nlpaug
import nlpaug.augmenter.word as naw

In [None]:
df_train_1 = df_train[df_train.target == 1].copy()
df_train_1.shape

(64604, 3)

In [None]:
df_train_1.head()

Unnamed: 0,qid,question_text,target
16,8ea797496fc68c9d8d98,Why are black people always tormented?,1
28,72e1085eab12b6aa55e2,How do you spell aye?,1
29,8137a860b078efcadd4c,Why do Conservatives want all news to be conse...,1
55,4233e8ed3bbbf5b8a242,Are we all for calling the people born in the ...,1
67,4c4e07c6a1723d0fe649,Why did the frustrated Catholics of South Indi...,1


In [None]:
df_train_1.loc[16].question_text

'Why are black people always tormented?'

In [None]:
# Check a sample to view how augmentation using synonyms works
aug = naw.SynonymAug(aug_src='wordnet',aug_max=2)
print('Original:', df_train_1.loc[16].question_text)
sample = aug.augment(df_train_1.loc[16].question_text,n=3)
print(sample)

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...


Original: Why are black people always tormented?


[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


['Wherefore be black people always tormented?', 'Wherefore are black hoi polloi always tormented?', 'Wherefore be black people always tormented?']
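The idea behind the synonym augmentation shown above can be sketched without `nlpaug`: pick up to `aug_max` words that have a known synonym and swap each for one of its alternatives. The toy synonym table and the function `synonym_augment` below are assumptions made for illustration; `nlpaug` draws its synonyms from WordNet and its internals may differ.

```python
import random

# Toy synonym table; a stand-in for WordNet, for illustration only.
TOY_SYNONYMS = {
    "why": ["wherefore"],
    "people": ["folks", "masses"],
    "always": ["constantly", "perpetually"],
}


def synonym_augment(text, aug_max=2, rng=None):
    """Return a copy of `text` with up to `aug_max` words swapped for synonyms."""
    rng = rng or random.Random()
    words = text.split()
    # Positions whose lower-cased word has an entry in the table.
    candidates = [i for i, w in enumerate(words) if w.lower() in TOY_SYNONYMS]
    for i in rng.sample(candidates, k=min(aug_max, len(candidates))):
        words[i] = rng.choice(TOY_SYNONYMS[words[i].lower()])
    return " ".join(words)


rng = random.Random(0)
original = "Why are black people always tormented?"
for _ in range(3):
    print(synonym_augment(original, aug_max=2, rng=rng))
```

Because the replacements are sampled at random, repeated calls can return the same sentence twice, which is also visible in the `nlpaug` output above.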


Generate additional samples for the minority class using synonym replacement.

The code below takes a while to run, as it generates 8 augmented copies of each question text.

In [None]:
aug = naw.SynonymAug(aug_src='wordnet',aug_max=3)
aug_text_1 = []
for i in df_train_1.index:
    # Generate 8 augmented variants of each minority-class question
    new_text = aug.augment(df_train_1.loc[i].question_text,n=8)
    aug_text_1.extend(new_text)

In [None]:
len(aug_text_1)

516832

Create a dataframe for additional records, assign it to Class 1 and concatenate it with the original train dataset

In [None]:
df_aug_data = pd.DataFrame(aug_text_1, columns=['question_text'])
df_aug_data['qid'] = df_aug_data.index
df_aug_data['target'] = 1
df_aug_data.head()

Unnamed: 0,question_text,qid,target
0,Why make up opprobrious people constantly torm...,0,1
1,Why represent ignominious people constantly to...,1,1
2,Wherefore exist black people always excruciate?,2,1
3,Why equal black masses always torment?,3,1
4,Wherefore equal black people perpetually torme...,4,1


In [None]:
df_train_new = pd.concat([df_train,df_aug_data])

In [None]:
df_train_new = shuffle(df_train_new)

In [None]:
df_train_new.reset_index(inplace=True, drop=True)

In [None]:
df_train_new.head()

Unnamed: 0,qid,question_text,target
0,301447,"Why, in the history of the JEE, has no Brahmin...",1
1,3e1dc709644e34791347,Can I get seat in VIT Vellore with a rank 89936?,0
2,1df75f9be81b580c497c,Why neutral is provided in 3 ph transformer?,0
3,875634b65845762ce66a,How do I start a momos business in India?,0
4,fc6b1de4088654a451d9,Can you tell me any good B-school in Jaipur?,0


In [None]:
df_train_new.shape

(1561729, 3)

In [None]:
df_train_new.target.value_counts(normalize=True)

0    0.627697
1    0.372303
Name: target, dtype: float64
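The new class proportions follow directly from the counts reported earlier. The short calculation below reproduces them as a sanity check; all numbers come from the notebook outputs, and only the variable names are introduced here.

```python
# Counts from the notebook outputs above.
n_majority = 980293          # target == 0
n_minority = 64604           # target == 1
n_aug_per_question = 8       # value of n passed to aug.augment

n_augmented = n_minority * n_aug_per_question
n_minority_new = n_minority + n_augmented
n_total = n_majority + n_minority_new

minority_share = n_minority_new / n_total
print(n_augmented, n_total, round(minority_share, 6))
```

The augmented set is therefore still not balanced: class 1 makes up about 37% of the records rather than 50%.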

In [None]:
df_test.head()

Unnamed: 0,qid,question_text
0,d5cacbea9be29bd47a78,Is Minance any good?
1,5650c4a236fe3b555c31,Do computers have reserved key strokes?
2,b778db4f09f9326195ea,When was the last time that the US had such a ...
3,e91c299cffc74a66aaf5,Are you still living in Wasilla?
4,2e129e7a85739a73b70a,What distinguishes the acting style of Piolo P...


In [None]:
text = df_train_new.question_text.values
labels = df_train_new.target.values
test_text = df_test.question_text.values

In [None]:
df_train_new.to_csv('df_train_aug.csv', index=False)

##   **Stage 3**: Build the Word Embeddings using pretrained Word2vec/Glove (Text Representation) (1 Point)



In [None]:
!pip install transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.30.2-py3-none-any.whl (7.2 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 7.2/7.2 MB 63.3 MB/s eta 0:00:00
Collecting huggingface-hub<1.0,>=0.14.1 (from transformers)
  Downloading huggingface_hub-0.15.1-py3-none-any.whl (236 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 236.8/236.8 kB 28.2 MB/s eta 0:00:00
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1 (from transformers)
  Downloading tokenizers-0.13.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 7.8/7.8 MB 95.0 MB/s eta 0:00:00
Collecting safetensors>=0.3.1 (from transformers)
  Downloading safetensors-0.3.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)

In [None]:
import torch
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler
from transformers import BertTokenizer, BertForSequenceClassification
from tabulate import tabulate
from tqdm import trange
import random

In [None]:
tokenizer = BertTokenizer.from_pretrained(
    'bert-base-uncased',
    do_lower_case = True
    )

Downloading (…)solve/main/vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

Downloading (…)okenizer_config.json:   0%|          | 0.00/28.0 [00:00<?, ?B/s]

Downloading (…)lve/main/config.json:   0%|          | 0.00/570 [00:00<?, ?B/s]

In [None]:
def print_rand_sentence():
  '''Displays the tokens and respective IDs of a random text sample'''
  index = random.randint(0, len(text)-1)
  table = np.array([tokenizer.tokenize(text[index]),
                    tokenizer.convert_tokens_to_ids(tokenizer.tokenize(text[index]))]).T
  print(tabulate(table,
                 headers = ['Tokens', 'Token IDs'],
                 tablefmt = 'fancy_grid'))

In [None]:
print_rand_sentence()

╒═════════════╤═════════════╕
│ Tokens      │   Token IDs │
╞═════════════╪═════════════╡
│ why         │        2339 │
├─────────────┼─────────────┤
│ is          │        2003 │
├─────────────┼─────────────┤
│ life        │        2166 │
├─────────────┼─────────────┤
│ so          │        2061 │
├─────────────┼─────────────┤
│ competitive │        6975 │
├─────────────┼─────────────┤
│ ?           │        1029 │
╘═════════════╧═════════════╛


In [None]:
token_id = []
attention_masks = []

In [None]:
def preprocessing(input_text, tokenizer):
    # Pad every input to max_length and truncate longer inputs explicitly
    # (padding = 'max_length' replaces the deprecated pad_to_max_length = True)
    return tokenizer.encode_plus(
                        input_text,
                        add_special_tokens = True,
                        max_length = 32,
                        padding = 'max_length',
                        truncation = True,
                        return_attention_mask = True,
                        return_tensors = 'pt'
                   )

In [None]:
for sample in text:
    encoding_dict = preprocessing(sample, tokenizer)
    token_id.append(encoding_dict['input_ids'])
    attention_masks.append(encoding_dict['attention_mask'])
token_id = torch.cat(token_id, dim = 0)
attention_masks = torch.cat(attention_masks, dim = 0)
labels = torch.tensor(labels)


In [None]:
token_id[6]

tensor([  101,  2073,  2024,  1996, 18792,  2545,  1997, 20116,  2243, 19819,
         2075,  2044,  1996,  2345,  7749, 10911,  1029,   102,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0])
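The tensor above shows the layout the tokenizer produces: `101` is the `[CLS]` token, `102` is `[SEP]`, and the trailing `0` values are `[PAD]` tokens that fill the sequence up to `max_length = 32`. The sketch below rebuilds that layout in plain Python, together with the matching attention mask. `pad_and_mask` is a hypothetical helper written for illustration and is not how the `transformers` library implements it.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # special-token IDs in bert-base-uncased


def pad_and_mask(content_ids, max_length=32):
    """Wrap token IDs in [CLS]/[SEP], truncate, pad, and build the attention mask."""
    # Reserve two positions for [CLS] and [SEP], then truncate the rest.
    ids = [CLS_ID] + list(content_ids)[: max_length - 2] + [SEP_ID]
    n_real = len(ids)
    input_ids = ids + [PAD_ID] * (max_length - n_real)
    # 1 marks a real token, 0 marks padding that attention should ignore.
    attention_mask = [1] * n_real + [0] * (max_length - n_real)
    return input_ids, attention_mask


# Content token IDs of the example printed above (everything between 101 and 102).
content = [2073, 2024, 1996, 18792, 2545, 1997, 20116, 2243, 19819,
           2075, 2044, 1996, 2345, 7749, 10911, 1029]
input_ids, attention_mask = pad_and_mask(content)
print(input_ids)
print(attention_mask)
```

Questions longer than 30 word pieces lose their tail under this scheme, which is the trade-off of choosing a small `max_length`.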

In [None]:
val_ratio = 0.2
# Recommended batch size: 16, 32.
batch_size = 32

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
# Indices of the train and validation splits stratified by labels
train_idx, val_idx = train_test_split(
    np.arange(len(labels)),
    test_size = val_ratio,
    shuffle = True,
    stratify = labels)

In [None]:
# Train and validation sets
train_set = TensorDataset(token_id[train_idx],
                          attention_masks[train_idx],
                          labels[train_idx])

val_set = TensorDataset(token_id[val_idx],
                        attention_masks[val_idx],
                        labels[val_idx])

In [None]:
# Prepare DataLoader
train_dataloader = DataLoader(
            train_set,
            sampler = RandomSampler(train_set),
            batch_size = batch_size
        )

validation_dataloader = DataLoader(
            val_set,
            sampler = SequentialSampler(val_set),
            batch_size = batch_size
        )

In [None]:
def b_tp(preds, labels):
  '''Returns True Positives (TP): count of correct predictions of actual class 1'''
  return sum([preds == labels and preds == 1 for preds, labels in zip(preds, labels)])

In [None]:
def b_fp(preds, labels):
  '''Returns False Positives (FP): count of samples predicted as class 1 whose actual class is 0'''
  return sum([p != l and p == 1 for p, l in zip(preds, labels)])

In [None]:
def b_tn(preds, labels):
  '''Returns True Negatives (TN): count of correct predictions of actual class 0'''
  return sum([preds == labels and preds == 0 for preds, labels in zip(preds, labels)])

In [None]:
def b_fn(preds, labels):
  '''Returns False Negatives (FN): count of samples predicted as class 0 whose actual class is 1'''
  return sum([p != l and p == 0 for p, l in zip(preds, labels)])

In [None]:
def b_metrics(preds, labels):
  '''
  Returns the following metrics:
    - accuracy    = (TP + TN) / N
    - precision   = TP / (TP + FP)
    - recall      = TP / (TP + FN)
    - specificity = TN / (TN + FP)
  '''
  preds = np.argmax(preds, axis = 1).flatten()
  labels = labels.flatten()
  tp = b_tp(preds, labels)
  tn = b_tn(preds, labels)
  fp = b_fp(preds, labels)
  fn = b_fn(preds, labels)
  b_accuracy = (tp + tn) / len(labels)
  b_precision = tp / (tp + fp) if (tp + fp) > 0 else 'nan'
  b_recall = tp / (tp + fn) if (tp + fn) > 0 else 'nan'
  b_specificity = tn / (tn + fp) if (tn + fp) > 0 else 'nan'
  return b_accuracy, b_precision, b_recall, b_specificity
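`b_metrics` is evaluated once per batch, and the training loop later averages the per-batch values. That average can differ from the metric computed over the whole validation set, particularly for precision and recall when some batches contain few or no positive samples. A hedged alternative is to sum the confusion counts over all batches and compute each metric once. The helpers `confusion_counts` and `metrics_from_counts` and the toy batches below are illustrative assumptions, not part of the notebook.

```python
def confusion_counts(preds, labels):
    """Return (tp, fp, tn, fn) for binary predictions and labels."""
    tp = sum(p == 1 and y == 1 for p, y in zip(preds, labels))
    fp = sum(p == 1 and y == 0 for p, y in zip(preds, labels))
    tn = sum(p == 0 and y == 0 for p, y in zip(preds, labels))
    fn = sum(p == 0 and y == 1 for p, y in zip(preds, labels))
    return tp, fp, tn, fn


def metrics_from_counts(tp, fp, tn, fn):
    """Compute the four metrics once from counts summed over all batches."""
    def safe_div(num, den):
        return num / den if den > 0 else float("nan")
    return {
        "accuracy": safe_div(tp + tn, tp + fp + tn + fn),
        "precision": safe_div(tp, tp + fp),
        "recall": safe_div(tp, tp + fn),
        "specificity": safe_div(tn, tn + fp),
    }


# Two toy batches of (predictions, labels); the values are made up.
toy_batches = [([1, 0, 0, 1], [1, 0, 1, 1]), ([0, 0, 1, 0], [0, 0, 0, 0])]
totals = [0, 0, 0, 0]
for batch_preds, batch_labels in toy_batches:
    for k, count in enumerate(confusion_counts(batch_preds, batch_labels)):
        totals[k] += count
global_metrics = metrics_from_counts(*totals)
print(global_metrics)
```

In this toy example the second batch has no positive labels, so its per-batch recall is undefined, while the pooled counts still give a well-defined recall for the full set.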

##   **Stage 4**: Build and Train the Deep networks model using Pytorch/Keras (5 Points)



In [None]:
# Load the BertForSequenceClassification model
model = BertForSequenceClassification.from_pretrained(
    'bert-base-uncased',
    num_labels = 2,
    output_attentions = False,
    output_hidden_states = False,
)

Downloading model.safetensors:   0%|          | 0.00/440M [00:00<?, ?B/s]

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.seq_relationship.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.weight']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly i

In [None]:
# Recommended learning rates (Adam): 5e-5, 3e-5, 2e-5. See: https://arxiv.org/pdf/1810.04805.pdf
optimizer = torch.optim.AdamW(model.parameters(),
                              lr = 3e-5,
                              eps = 1e-08
                              )

# Run on GPU
model.cuda()

BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12,

In [None]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Recommended number of epochs: 2, 3, 4. See: https://arxiv.org/pdf/1810.04805.pdf
epochs = 2

for _ in trange(epochs, desc = 'Epoch'):

    # ========== Training ==========

    # Set model to training mode
    model.train()

    # Tracking variables
    tr_loss = 0
    nb_tr_examples, nb_tr_steps = 0, 0

    for step, batch in enumerate(train_dataloader):
        batch = tuple(t.to(device) for t in batch)
        b_input_ids, b_input_mask, b_labels = batch
        optimizer.zero_grad()
        # Forward pass
        train_output = model(b_input_ids,
                             token_type_ids = None,
                             attention_mask = b_input_mask,
                             labels = b_labels)
        # Backward pass
        train_output.loss.backward()
        optimizer.step()
        # Update tracking variables
        tr_loss += train_output.loss.item()
        nb_tr_examples += b_input_ids.size(0)
        nb_tr_steps += 1

    # ========== Validation ==========

    # Set model to evaluation mode
    model.eval()

    # Tracking variables
    val_accuracy = []
    val_precision = []
    val_recall = []
    val_specificity = []

    for batch in validation_dataloader:
        batch = tuple(t.to(device) for t in batch)
        b_input_ids, b_input_mask, b_labels = batch
        with torch.no_grad():
          # Forward pass
          eval_output = model(b_input_ids,
                              token_type_ids = None,
                              attention_mask = b_input_mask)
        logits = eval_output.logits.detach().cpu().numpy()
        label_ids = b_labels.to('cpu').numpy()
        # Calculate validation metrics
        b_accuracy, b_precision, b_recall, b_specificity = b_metrics(logits, label_ids)
        val_accuracy.append(b_accuracy)
        # Update precision only when (tp + fp) !=0; ignore nan
        if b_precision != 'nan': val_precision.append(b_precision)
        # Update recall only when (tp + fn) !=0; ignore nan
        if b_recall != 'nan': val_recall.append(b_recall)
        # Update specificity only when (tn + fp) !=0; ignore nan
        if b_specificity != 'nan': val_specificity.append(b_specificity)

    print('\n\t - Train loss: {:.4f}'.format(tr_loss / nb_tr_steps))
    print('\t - Validation Accuracy: {:.4f}'.format(sum(val_accuracy)/len(val_accuracy)))
    print('\t - Validation Precision: {:.4f}'.format(sum(val_precision)/len(val_precision)) if len(val_precision)>0 else '\t - Validation Precision: NaN')
    print('\t - Validation Recall: {:.4f}'.format(sum(val_recall)/len(val_recall)) if len(val_recall)>0 else '\t - Validation Recall: NaN')
    print('\t - Validation Specificity: {:.4f}\n'.format(sum(val_specificity)/len(val_specificity)) if len(val_specificity)>0 else '\t - Validation Specificity: NaN')

Epoch:  50%|█████     | 1/2 [2:17:06<2:17:06, 8226.48s/it]


	 - Train loss: 0.1152
	 - Validation Accuracy: 0.9665
	 - Validation Precision: 0.9612
	 - Validation Recall: 0.9484
	 - Validation Specificity: 0.9774



In [None]:
# We need Token IDs and Attention Mask for inference on the new sentence
predict_class = []
test_ids = []
test_attention_mask = []

In [None]:
len(test_text)

261221

In [None]:
# Apply the tokenizer and predict one question at a time
model.eval()
for i, sample in enumerate(test_text, start=1):
    encoding = preprocessing(sample, tokenizer)
    # Extract IDs and Attention Mask (each of shape [1, max_length])
    t_ids = encoding['input_ids']
    t_attention_mask = encoding['attention_mask']
    with torch.no_grad():
        output = model(t_ids.to(device), token_type_ids = None, attention_mask = t_attention_mask.to(device))
    # argmax over the two logits gives the predicted class (0 or 1)
    prediction = int(np.argmax(output.logits.cpu().numpy()))
    predict_class.append(prediction)
    # Print progress at a few checkpoints
    if i in [10000,50000,100000,150000,200000]:
        print(i)

NameError: ignored
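The loop above runs one forward pass per question, which means 261,221 passes over the test set. Grouping questions into batches reduces that to roughly a thousand passes at a batch size of 256. The sketch below shows only the batching helper, which is runnable on its own; `iter_batches` is a hypothetical name, and the commented usage assumes the notebook's `tokenizer`, `model`, `device`, and `predict_class` objects and has not been run here.

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Hypothetical usage with the notebook's objects (not executed here):
# for chunk in iter_batches(list(test_text), 256):
#     enc = tokenizer(chunk, padding='max_length', truncation=True,
#                     max_length=32, return_tensors='pt')
#     with torch.no_grad():
#         logits = model(enc['input_ids'].to(device),
#                        attention_mask=enc['attention_mask'].to(device)).logits
#     predict_class.extend(logits.argmax(dim=1).tolist())

questions = [f"question {i}" for i in range(10)]
print([len(chunk) for chunk in iter_batches(questions, 4)])  # [4, 4, 2]
```

The batch size of 256 is an arbitrary choice for illustration; the value that fits depends on available GPU memory.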

In [None]:
# predict_class is a plain Python list, so it can be assigned directly
df_test['target'] = predict_class
df_test.head()

In [None]:
df_test.target.value_counts()

In [None]:
df_test.to_csv('submission_bert.csv', index = False)