# Advanced Programme in Deep Learning (Foundations and Applications)
## A Program by IISc and TalentSprint

### Mini Project Notebook: Irrelevant/inappropriate Questions Classification using Deep Neural Networks.


## Learning Objectives

At the end of the mini-hackathon, you will be able to:

* preprocess the text data
* represent the text/words using pretrained word embeddings - Word2Vec/GloVe
* build deep neural networks to classify the questions as irrelevant/inappropriate or not


## Dataset

The challenge in this competition is to predict whether a question asked on a well-known public forum/platform is irrelevant/inappropriate or not.

An irrelevant/inappropriate question is defined as a question intended to make a statement rather than to look for helpful/meaningful answers. The following are some of the characteristics that can signify that a question is irrelevant/inappropriate:

* Based on false information, or contains absurd assumptions
* Has a non-neutral tone
* Has an exaggerated tone to underscore a point about a group of people
* Is rhetorical and meant to imply a statement about a group of people
* Is disparaging or inflammatory against an individual or a group of people
* Uses sexual content (such as incest, pedophilia), and not to seek genuine answers
* Suggests a discriminatory idea against a protected class of people, or seeks confirmation of a stereotype
* Based on an unrealistic premise about a group of people
* Is not grounded in reality

The training dataset includes the 1044897 questions that were asked, and whether each was identified as irrelevant/inappropriate (target = 1) or as relevant/appropriate (target = 0). The test dataset consists of approximately 261000 questions.

The training data might be imbalanced or noisy, and it is not guaranteed to be perfect. Please take the necessary steps to handle this while building the model.
 

## Description

This dataset has the following information:

1. **qid** - unique question identifier
2. **question_text** - the text of the question asked on the well-known public forum/platform
3. **target** - a question labeled "irrelevant/inappropriate" has a value of 1, otherwise 0



## Problem Statement

To classify approximately 261000 questions asked on a well-known public forum as 'irrelevant/inappropriate' or 'relevant/appropriate' using Deep Neural Networks such as RNN/CNN/BERT/LSTM.

## Grading = 10 Marks

Here is a handy link to Kaggle's competition documentation (https://www.kaggle.com/docs/competitions), which includes, among other things, instructions on submitting predictions (https://www.kaggle.com/docs/competitions#making-a-submission).

## Instructions for downloading train and test dataset from Kaggle API are as follows:

### 1. Create an API key in Kaggle.

To do this, go to the competition site on Kaggle at (https://www.kaggle.com/t/bde6f23028154933a99e4b4ca8a3dff2) and click on user then click on your profile as shown below. Click Account.

![alt text](https://cdn.iisc.talentsprint.com/DLFA/Experiment_related_data/Capture-NLP.PNG)

### 2. Next, scroll down to the API access section and click on **Create New Token** to download an API key (kaggle.json). 

![alt text](https://cdn.iisc.talentsprint.com/DLFA/Experiment_related_data/Capture-NLP_1.PNG)

### 3. Upload your kaggle.json file using the following snippet in a code cell:



Set Runtime Type to GPU

In [1]:
from google.colab import files
files.upload()

Saving kaggle.json to kaggle.json


{'kaggle.json': b'{"username":"<redacted>","key":"<redacted>"}'}

In [2]:
#If successfully uploaded in the above step, the 'ls' command here should display the kaggle.json file.
%ls

kaggle.json  sample_data/


### 4. Install the Kaggle API using the following command


The command below may raise an error the first time it is executed. If it does, restart the runtime and run the notebook again from the start; the install then completes successfully.

In [3]:
!pip install -U -q kaggle==1.5.8

     59.2/59.2 kB 2.4 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
     118.8/118.8 kB 7.7 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
  Building wheel for kaggle (setup.py) ... done
  Building wheel for slugify (setup.py) ... done
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
torchdata 0.6.1 requires urllib3>=1.25, but you have urllib3 1.24.3 which is incompatible.

### 5. Move the kaggle.json file into ~/.kaggle, which is where the API client expects your token to be located:



In [4]:
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/

In [5]:
# Execute the following command to verify whether the kaggle.json is stored in the appropriate location: ~/.kaggle/kaggle.json
!ls ~/.kaggle

kaggle.json


In [6]:
!chmod 600 /root/.kaggle/kaggle.json # run this command to ensure your Kaggle API token is secure on colab
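
The token file can also be checked from Python without printing its contents. The following is only a sketch: `check_kaggle_token` is a hypothetical helper (not part of the Kaggle API), and the demo uses a throwaway file with placeholder credentials rather than a real token.

```python
import json
import os
import stat
import tempfile


def check_kaggle_token(path):
    """Return a list of problems found with a Kaggle token file (empty if none)."""
    if not os.path.isfile(path):
        return [f"{path} does not exist"]
    with open(path, encoding="utf-8") as fh:
        try:
            token = json.load(fh)
        except json.JSONDecodeError:
            return [f"{path} is not valid JSON"]
    if not isinstance(token, dict):
        return [f"{path} does not contain a JSON object"]
    problems = []
    for field in ("username", "key"):
        if not token.get(field):
            problems.append(f"missing field: {field}")
    # Group/other permission bits should be cleared (the effect of `chmod 600`).
    mode = stat.S_IMODE(os.stat(path).st_mode)
    if mode & 0o077:
        problems.append(f"permissions too open: {oct(mode)}")
    return problems


# Demonstration on a temporary file with placeholder (not real) credentials.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "kaggle.json")
    with open(demo_path, "w", encoding="utf-8") as fh:
        json.dump({"username": "example_user", "key": "not-a-real-key"}, fh)
    os.chmod(demo_path, 0o600)
    demo_problems = check_kaggle_token(demo_path)

print(demo_problems)
```

In the notebook, the helper would be pointed at the real token location (for example `os.path.expanduser('~/.kaggle/kaggle.json')`); it never prints the key itself.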

### 6. Now download the train and test data from Kaggle

**NOTE: If you get a '404 - Not Found' error after running the cell below, it is most likely that the user (whose kaggle.json is uploaded above) has not 'accepted' the rules of the competition and therefore has 'not joined' the competition.**

If you encounter a **401 - Unauthorized** error, download the latest **kaggle.json** by repeating steps 1 & 2.

In [7]:
#If you get a forbidden link, you have most likely not joined the competition.
!kaggle competitions download -c toxic-questions-classification

Downloading toxic-questions-classification.zip to /content
 94% 57.0M/60.6M [00:03<00:00, 21.9MB/s]
100% 60.6M/60.6M [00:03<00:00, 19.5MB/s]


In [8]:
!unzip /content/toxic-questions-classification.zip

Archive:  /content/toxic-questions-classification.zip
  inflating: sample_submission.csv   
  inflating: test_dataset.csv        
  inflating: train_dataset.csv       


## YOUR CODING STARTS FROM HERE

## Import required packages

The nlpaug package is installed for data augmentation.

Data augmentation is used later to oversample the minority class in the data.

In [9]:
pip install nlpaug

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting nlpaug
  Downloading nlpaug-1.1.11-py3-none-any.whl (410 kB)
     410.5/410.5 kB 14.9 MB/s eta 0:00:00
Installing collected packages: nlpaug
Successfully installed nlpaug-1.1.11


In [10]:
# Import required packages
import numpy as np
import pandas as pd
import nlpaug
import nlpaug.augmenter.word as naw
from sklearn.utils import shuffle
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords 
from gensim.utils import simple_preprocess
import keras
from keras.preprocessing.text import Tokenizer
from keras.utils import pad_sequences
from keras.layers import Input, Embedding, Dense, Bidirectional, Dropout, GRU
from keras.models import Sequential

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


##   **Stage 1**:  Data Loading and Exploratory Data Analysis (1 Point)

In [11]:
# Data Loading
df_train = pd.read_csv('train_dataset.csv')
df_test = pd.read_csv('test_dataset.csv')

In [12]:
df_train.head()

Unnamed: 0,qid,question_text,target
0,2549b81c4adff1849a7f,Is CSE at bit Meara good?,0
1,0558ed93a4630e68f7ac,Is it better to exercise before or after the b...,0
2,5d72d5233059e44f8a8e,Can character naming in writing infringe on tr...,0
3,3968636ac28841d0c901,Why does everyone making YouTube videos in Jap...,0
4,201d2b9a777bbf25443f,Is there any relation between horse power and ...,0


In [13]:
df_test.head()

Unnamed: 0,qid,question_text
0,d5cacbea9be29bd47a78,Is Minance any good?
1,5650c4a236fe3b555c31,Do computers have reserved key strokes?
2,b778db4f09f9326195ea,When was the last time that the US had such a ...
3,e91c299cffc74a66aaf5,Are you still living in Wasilla?
4,2e129e7a85739a73b70a,What distinguishes the acting style of Piolo P...


In [14]:
df_train.shape

(1044897, 3)

In [15]:
df_test.shape

(261221, 2)

In [16]:
df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1044897 entries, 0 to 1044896
Data columns (total 3 columns):
 #   Column         Non-Null Count    Dtype 
---  ------         --------------    ----- 
 0   qid            1044897 non-null  object
 1   question_text  1044897 non-null  object
 2   target         1044897 non-null  int64 
dtypes: int64(1), object(2)
memory usage: 23.9+ MB


No missing values in the train dataset

In [17]:
df_train.target.value_counts()

0    980293
1     64604
Name: target, dtype: int64

In [18]:
df_train.target.value_counts(normalize=True)

0    0.938172
1    0.061828
Name: target, dtype: float64

The data is imbalanced: Class 0 is the majority class with about 94% of the records, while Class 1 is the minority class with about 6%.

This needs to be addressed so that the model does not become biased towards the majority class.
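
The notebook addresses the imbalance through augmentation in Stage 2. A commonly used alternative is to weight the classes during training. The sketch below is an illustration only, not what the notebook does: it applies the usual "balanced" heuristic (weight = n_samples / (n_classes * class_count)) to the counts printed above.

```python
# Class counts copied from the value_counts() output above.
counts = {0: 980293, 1: 64604}

n_samples = sum(counts.values())
n_classes = len(counts)

# "Balanced" heuristic: rarer classes receive proportionally larger weights.
class_weight = {
    label: n_samples / (n_classes * count) for label, count in counts.items()
}

for label, weight in class_weight.items():
    print(f"class {label}: weight {weight:.3f}")
```

A dictionary of this form could be passed to the `class_weight` argument of Keras `model.fit`, in which case the augmentation step would need to be reconsidered so that the minority class is not compensated for twice.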

In [19]:
df_train.duplicated().sum()

0

No duplicate records found.

In [20]:
df_train[df_train.target == 1].head()

Unnamed: 0,qid,question_text,target
16,8ea797496fc68c9d8d98,Why are black people always tormented?,1
28,72e1085eab12b6aa55e2,How do you spell aye?,1
29,8137a860b078efcadd4c,Why do Conservatives want all news to be conse...,1
55,4233e8ed3bbbf5b8a242,Are we all for calling the people born in the ...,1
67,4c4e07c6a1723d0fe649,Why did the frustrated Catholics of South Indi...,1


Analysis based on number of characters in the question text:

In [21]:
print('Maximum length of the question text', df_train.question_text.str.len().max())
print('Minimum length of the question text', df_train.question_text.str.len().min())
print('Average length of the question text', df_train.question_text.str.len().mean())

Maximum length of the question text 878
Minimum length of the question text 1
Average length of the question text 70.67046321312053


In [22]:
df_train[df_train.question_text.str.len() == 878]

Unnamed: 0,qid,question_text,target
875869,1ffca149bd0a19cd714c,What is [math]\overbrace{\sum_{\vartheta=8}^{\...,1


In [23]:
df_train.question_text.loc[875869]

'What is [math]\\overbrace{\\sum_{\\vartheta=8}^{\\infty} \\vec{\\frac{\\sum_{\\kappa=7}^{\\infty} \\overbrace{1x^0}^{\\text{Read carefully.}}-3x^{-1} \\div 1x^5+{\\sqrt[3]{2x^{-3}}}^{1x^0}+\\vec{\\vec{{3x^{-3}}^{1x^{-2}}}}}{\\sum_{\\dagger=9}^{\\infty} \\vec{\\boxed{\\boxed{3x^{-1}}+3x^1 \\times 1x^{-5}}}}} \\div \\sin(\\boxed{\\boxed{\\vec{3x^{-5}}}+\\sqrt[4]{2x^{-4}}+\\vec{2x^{-3}} \\div \\sin(\\sqrt[5]{\\int_{1x^5}^{2x^5} 2x^{-3} d\\varrho}) \\times \\vec{{\\underbrace{2x^1}_{\\text{Prove This.}}}^{3x^4} \\div \\sqrt[5]{2x^{-3}}+\\sum_{\\theta=8}^{\\infty} 1x^4}}) \\times {\\boxed{\\vec{\\sum_{\\nu=8}^{\\infty} \\sum_{4=6}^{\\infty} \\sum_{\\xi=9}^{\\infty} \\boxed{3x^1}-\\boxed{\\sqrt[3]{\\sqrt[3]{2x^{-2}}}}}}}^{1x^3}-\\cos({{\\tan(\\sum_{0=6}^{\\infty} \\tan(\\overbrace{\\frac{\\boxed{1x^1}-\\sqrt[3]{3x^{-2}}}{\\sum_{\\eta=10}^{\\infty} 1x^{-3} \\div 1x^1}}^{\\text{Molar Quantity.}}))}^{1x^3}}^{1x^{-4}})}^{\\text{Expanded.}}[/math]?'

Data cleanup required to get meaningful words from the question text
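
One possible cleanup step for questions like the one above is to remove the `[math]...[/math]` markup before tokenizing. This is a minimal sketch using only the standard library; `strip_math` is a hypothetical helper and the notebook itself does not apply this step.

```python
import re

# Matches a [math]...[/math] span, including embedded newlines (non-greedy).
MATH_SPAN = re.compile(r"\[math\].*?\[/math\]", flags=re.DOTALL)


def strip_math(text, placeholder=" "):
    """Replace [math]...[/math] spans with a placeholder and collapse whitespace."""
    without_math = MATH_SPAN.sub(placeholder, text)
    return re.sub(r"\s+", " ", without_math).strip()


example = r"What is [math]\frac{1}{2} + \sqrt{x}[/math]?"
print(strip_math(example))
```

Without such a step, LaTeX command names inside the markup (for example "overbrace" or "sqrt") would likely survive tokenization as ordinary words.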

In [24]:
df_train[df_train.question_text.str.len() < 10]

Unnamed: 0,qid,question_text,target
32540,0c2a113858db20e0a4db,Quora:,1
74507,48206e5f0dcedf1f00e6,Hungary:,1
83882,45efae151057c2c0e49c,To Quora:,1
133702,7014915ed4fd6def410e,I'm an,1
208279,c309469a202434b5f1d2,W,1
307367,18b058d2aabadb23c12d,In Islam?,0
348868,83d01336b3406133723e,Bye Bye?,1
365454,7abbb52cdd2cd7bc5e48,#NAME?,1
472383,2cfd7dec2231e47afd6c,I 12?,0
483562,a7193652063b3b3b2566,#NAME?,0


Although these question texts do not make much sense, they need to be retained because most of them belong to Class 1, the minority class.

Analysis based on number of words in the question text:

In [25]:
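# Locate the question with the largest number of whitespace-separated words
idx_max = df_train.question_text.str.split().str.len().idxmax()
val_max = df_train.question_text.loc[idx_max]
# Note: len() on a string counts characters, so this is the character length of that question
chars_max = len(val_max)
print(idx_max)
print(chars_max)
print(val_max)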

348157
752
In "Star Trek 2013" why did they :

*Spoilers*
*Spoilers*
*Spoilers*
*Spoilers*

1)Make warping look quite a bit like an hyperspace jump
2)what in the world were those bright particles as soon as they jumped.
3)Why in the world did they make it possible for two entities to react in warp space in separate jumps.
4)Why did Spock get emotions for this movie.
5)What was the point of hiding the "Enterprise" underwater.
6)When they were intercepted by the dark ship, how come they reached Earth when they were far away from her.(I don't seem to remember the scene where they warp to earth).
7)How did the ship enter earth's atmosphere when it wasnt even in orbit.
8)When Scotty opened the door of the black ship , how come pike and khan didn't slow down?
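
Character length and word count are different measures: `len()` on a string counts characters, while `len(text.split())` counts whitespace-separated words. The small sketch below shows the difference on three questions taken from the test preview earlier in the notebook.

```python
questions = [
    "Is Minance any good?",
    "Do computers have reserved key strokes?",
    "Are you still living in Wasilla?",
]

char_lengths = [len(q) for q in questions]         # characters, spaces included
word_counts = [len(q.split()) for q in questions]  # whitespace-separated words

# Index of the question with the most words (the first one wins on ties).
longest = max(range(len(questions)), key=lambda i: word_counts[i])
print(questions[longest], char_lengths[longest], word_counts[longest])
```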


In [26]:
df_train[df_train.question_text.str.contains("Spoilers")]

Unnamed: 0,qid,question_text,target
348157,663c7523d48f5ee66a3e,"In ""Star Trek 2013"" why did they :\n\n*Spoiler...",0
497353,5f8adae7e14ca03c781b,Spoilers: Why prime minister did nothing after...,0
543845,9a203937cbcc8add5baf,How can I block a topic on Quora? Spoilers abo...,0
651755,21db0297c7942c7a6bc2,Spoilers: How Aarav knew that he would find th...,0
791063,caaf597913fd836c819a,[Spoilers] What is the probability of finding ...,0
818156,08c47e108dbca9d8859f,(Spoilers) Why does Thanos sound so gloomy aft...,0
1027622,f3f391f13f83afdc1260,[Spoilers] In the 2017 Ghost in the Shell movi...,0


In [27]:
print(df_train.question_text.loc[1027622])

[Spoilers] In the 2017 Ghost in the Shell movie, where did the antagonist get his body?


The questions listed above appear to be valid even though they contain the word "Spoilers".

Observations:

1) Imbalance in data.
2) Bad data:
a) Multiple questions in the question text column.

##   **Stage 2**: Data Pre-Processing  (1 Point)

####  Clean and Transform the data into a specified format


Generate data samples for class 1 using data augmentation (synonyms)

In [28]:
df_train_1 = df_train[df_train.target == 1].copy()
df_train_1.shape

(64604, 3)

In [29]:
df_train_1.head()

Unnamed: 0,qid,question_text,target
16,8ea797496fc68c9d8d98,Why are black people always tormented?,1
28,72e1085eab12b6aa55e2,How do you spell aye?,1
29,8137a860b078efcadd4c,Why do Conservatives want all news to be conse...,1
55,4233e8ed3bbbf5b8a242,Are we all for calling the people born in the ...,1
67,4c4e07c6a1723d0fe649,Why did the frustrated Catholics of South Indi...,1


In [30]:
df_train_1.loc[16].question_text

'Why are black people always tormented?'

In [31]:
# Check a sample to view how augmentation using synonyms works
aug = naw.SynonymAug(aug_src='wordnet',aug_max=2)
print('Original:', df_train_1.loc[16].question_text)
sample = aug.augment(df_train_1.loc[16].question_text,n=3)
print(sample)

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...


Original: Why are black people always tormented?


[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


['Why be black multitude always tormented?', 'Why constitute black people always torture?', 'Why are black people perpetually rack?']


Generate data samples for the minority class using the synonym technique

The code below takes a while to run, as it generates 8 augmented variants of each question text

In [32]:
aug = naw.SynonymAug(aug_src='wordnet', aug_max=3)
aug_text_1 = []
for text in df_train_1.question_text:
    # Generate 8 synonym-replaced variants of each minority-class question
    aug_text_1.extend(aug.augment(text, n=8))

In [33]:
len(aug_text_1)

516832

Create a dataframe for additional records, assign it to Class 1 and concatenate it with the original train dataset

In [34]:
df_aug_data = pd.DataFrame(aug_text_1, columns=['question_text'])
df_aug_data['qid'] = df_aug_data.index
df_aug_data['target'] = 1 
df_aug_data.head()

Unnamed: 0,question_text,qid,target
0,Wherefore are contraband people always rag?,0,1
1,Wherefore are black multitude e'er tormented?,1,1
2,Wherefore be black people constantly tormented?,2,1
3,Wherefore are smuggled hoi polloi always torme...,3,1
4,Wherefore be black people forever tormented?,4,1


In [35]:
df_train_new = pd.concat([df_train,df_aug_data])

In [36]:
df_train_new = shuffle(df_train_new)

In [37]:
df_train_new.reset_index(inplace=True, drop=True)

In [38]:
df_train_new.head()

Unnamed: 0,qid,question_text,target
0,1038c9c8dd314da559cb,Why has there been relatively little progress ...,0
1,457845,What are the Singaporeans ' advice to PRC Chin...,1
2,f49b64c3df2a2fd6fcd8,Where can I find brands that are searching for...,0
3,e54ecc784e868f993da6,How do I write a research paper about harmful ...,0
4,93bb95c62ba052d6e753,How do I start preparation for UPSC without co...,0


In [39]:
df_train_new.shape

(1561729, 3)

In [40]:
df_train_new.target.value_counts(normalize=True)

0    0.627697
1    0.372303
Name: target, dtype: float64

We now have a dataset that is better balanced than the original data, with a class ratio of 63:37.
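
The class shares follow directly from the counts shown earlier. The short calculation below reproduces them; the last line is an extra, illustrative figure for how many augmented copies per question a 50:50 split would need.

```python
n_class0 = 980293          # original majority-class rows
n_class1 = 64604           # original minority-class rows
copies_per_question = 8    # augmented variants generated per minority question

n_augmented = n_class1 * copies_per_question
n_total = n_class0 + n_class1 + n_augmented
share_class1 = (n_class1 + n_augmented) / n_total

print(n_augmented, n_total, round(share_class1, 6))

# Copies per question that would be needed for an exactly balanced dataset.
copies_for_balance = n_class0 / n_class1 - 1
print(round(copies_for_balance, 2))
```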

Data Preprocessing

Data cleanup for both train and test dataset

In [41]:
df_train_new['question_text'] = df_train_new['question_text'].apply(lambda x:simple_preprocess(x, max_len=70))

In [42]:
df_test['question_text'] = df_test['question_text'].apply(lambda x:simple_preprocess(x, max_len=70))

In [43]:
# Remove stop words
stop_words = set(stopwords.words('english'))

df_train_new['question_text'] = df_train_new['question_text'].apply(lambda x: [w for w in x if w not in stop_words])

In [44]:
df_test['question_text'] = df_test['question_text'].apply(lambda x: [w for w in x if w not in stop_words])

In [45]:
df_train_new.head()

Unnamed: 0,qid,question_text,target
0,1038c9c8dd314da559cb,"[relatively, little, progress, consumer, batte...",0
1,457845,"[singaporeans, advice, prc, chinese, order, se...",1
2,f49b64c3df2a2fd6fcd8,"[find, brands, searching, sales, representatives]",0
3,e54ecc784e868f993da6,"[write, research, paper, harmful, effects, com...",0
4,93bb95c62ba052d6e753,"[start, preparation, upsc, without, coaching, ...",0


In [46]:
df_test.head()

Unnamed: 0,qid,question_text
0,d5cacbea9be29bd47a78,"[minance, good]"
1,5650c4a236fe3b555c31,"[computers, reserved, key, strokes]"
2,b778db4f09f9326195ea,"[last, time, us, scandal, driven, administration]"
3,e91c299cffc74a66aaf5,"[still, living, wasilla]"
4,2e129e7a85739a73b70a,"[distinguishes, acting, style, piolo, pascual]"


Tokenize and pad sequences

In [47]:
# Hyperparameters 
MAX_SENT_LEN = 70   # Number of words to consider from each question
MAX_VOCAB_SIZE = 50000  # Max vocabulary size

In [48]:
tokenizer = Tokenizer(num_words=MAX_VOCAB_SIZE)
tokenizer.fit_on_texts([' '.join(seq[:MAX_SENT_LEN]) for seq in df_train_new['question_text']])

print("Number of words in vocabulary:", len(tokenizer.word_index))

Number of words in vocabulary: 166123


In [49]:
# Convert the sequence of words to a sequence of indices
X = tokenizer.texts_to_sequences([' '.join(seq[:MAX_SENT_LEN]) for seq in df_train_new['question_text']])
X = pad_sequences(X, maxlen=MAX_SENT_LEN, padding='post', truncating='post')

y = df_train_new['target']
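
To make the two steps above concrete, here is a simplified pure-Python illustration of frequency-ranked indexing followed by post-truncation and post-padding. It is a sketch of the idea only, not the Keras implementation; `build_word_index` and `encode_and_pad` are hypothetical names.

```python
from collections import Counter


def build_word_index(token_lists):
    """Map each word to an integer rank, most frequent first; 0 is kept for padding."""
    counts = Counter(word for tokens in token_lists for word in tokens)
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}


def encode_and_pad(token_lists, word_index, max_len, num_words=None):
    """Convert tokens to indices, drop unknown or rare words, then truncate and pad at the end."""
    padded = []
    for tokens in token_lists:
        ids = [
            word_index[w]
            for w in tokens
            if w in word_index and (num_words is None or word_index[w] < num_words)
        ]
        ids = ids[:max_len]                              # like truncating='post'
        padded.append(ids + [0] * (max_len - len(ids)))  # like padding='post'
    return padded


docs = [["minance", "good"], ["computers", "reserved", "key", "strokes"], ["good", "key"]]
word_index = build_word_index(docs)
print(encode_and_pad(docs, word_index, max_len=3))
```

Every row ends up with exactly `max_len` entries, which is what allows the sequences to be stacked into a single array of shape (number of questions, `MAX_SENT_LEN`).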

Prepare the test data using the tokenizer fitted on the training data

In [50]:
Z = tokenizer.texts_to_sequences([' '.join(seq[:MAX_SENT_LEN]) for seq in df_test['question_text']])
Z = pad_sequences(Z, maxlen=MAX_SENT_LEN, padding='post', truncating='post')

In [51]:
Z.shape

(261221, 70)

Splitting the data into training and validation sets (the validation split is named X_test/y_test in the code)

In [52]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=123, test_size=0.3)

In [53]:
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((1093210, 70), (468519, 70), (1093210,), (468519,))

##   **Stage 3**: Build the Word Embeddings using pretrained Word2vec/Glove (Text Representation) (1 Point)



Download GloVe from the nlp.stanford.edu site

In [54]:
!wget http://nlp.stanford.edu/data/glove.6B.zip
!unzip -q glove.6B.zip

--2023-06-07 06:33:40--  http://nlp.stanford.edu/data/glove.6B.zip
Resolving nlp.stanford.edu (nlp.stanford.edu)... 171.64.67.140
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://nlp.stanford.edu/data/glove.6B.zip [following]
--2023-06-07 06:33:40--  https://nlp.stanford.edu/data/glove.6B.zip
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://downloads.cs.stanford.edu/nlp/data/glove.6B.zip [following]
--2023-06-07 06:33:41--  https://downloads.cs.stanford.edu/nlp/data/glove.6B.zip
Resolving downloads.cs.stanford.edu (downloads.cs.stanford.edu)... 171.64.64.22
Connecting to downloads.cs.stanford.edu (downloads.cs.stanford.edu)|171.64.64.22|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 862182613 (822M) [application/zip]
Saving to: ‘glove.6B.zip’


202

In [55]:
embeddings_index = {}
# Load the 300-dimensional GloVe vectors into a word -> vector dictionary
with open('/content/glove.6B.300d.txt', encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs

print('Found %s word vectors.' % len(embeddings_index))

Found 400000 word vectors.


In [56]:
# Adding 1 because index 0 is reserved (used for padding)
words_not_found = []
vocab_size = len(tokenizer.word_index) + 1
print('Loaded %s word vectors.' % len(embeddings_index))

embedding_dim = 300

# Create a weight matrix for words in the training data
embedding_matrix = np.zeros((vocab_size, embedding_dim))
for word, i in tokenizer.word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None and len(embedding_vector) > 0:
        # Words found in GloVe get their pretrained vector
        embedding_matrix[i] = embedding_vector
    else:
        # Words missing from GloVe keep an all-zero row
        words_not_found.append(word)

Loaded 400000 word vectors.


In [57]:
len(words_not_found)

66474

In [58]:
print(len(tokenizer.word_index))

166123
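
The two counts above give the share of the vocabulary that has a pretrained vector. The sketch below rebuilds the same kind of matrix on a toy vocabulary (the words are borrowed from the test preview, the vector values are made up) and then computes that coverage from the notebook's numbers.

```python
import numpy as np

# Toy stand-ins for tokenizer.word_index and the GloVe dictionary (values are made up).
toy_word_index = {"good": 1, "key": 2, "minance": 3}
toy_embeddings = {
    "good": np.array([0.1, 0.2, 0.3, 0.4], dtype="float32"),
    "key": np.array([0.5, 0.6, 0.7, 0.8], dtype="float32"),
}
dim = 4

# Row 0 stays all-zero for the padding index; unknown words also keep zero rows.
matrix = np.zeros((len(toy_word_index) + 1, dim), dtype="float32")
missing = []
for word, idx in toy_word_index.items():
    vector = toy_embeddings.get(word)
    if vector is None:
        missing.append(word)
    else:
        matrix[idx] = vector

toy_coverage = 1 - len(missing) / len(toy_word_index)
print(missing, round(toy_coverage, 3))

# The same ratio with the counts reported above: 66474 of 166123 words had no vector.
notebook_coverage = 1 - 66474 / 166123
print(round(notebook_coverage, 4))
```

Roughly 40% of the vocabulary therefore maps to all-zero rows, and because the embedding layer is frozen (`trainable=False`), those words carry no signal into the model.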


##   **Stage 4**: Build and Train the Deep networks model using PyTorch/Keras (5 Points)



In [59]:
# Build a sequential model by stacking neural net units 
model = Sequential()
embedding_layer = Embedding(vocab_size,
                            embedding_dim, 
                            weights = [embedding_matrix],
                            input_length = MAX_SENT_LEN,
                            trainable=False)
model.add(embedding_layer)
model.add(Bidirectional(GRU(128, return_sequences=True, dropout=0.50, name='first_gru_layer')))
model.add(Dropout(0.5))
model.add(Bidirectional(GRU(64, name='second_gru_layer')))
model.add(Dropout(0.5))
model.add(Dense(32, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(1, activation='sigmoid', name='output_layer'))

In [60]:
print('Summary of the built model...')
model.summary()

Summary of the built model...
Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 70, 300)           49837200  
                                                                 
 bidirectional (Bidirectiona  (None, 70, 256)          330240    
 l)                                                              
                                                                 
 dropout (Dropout)           (None, 70, 256)           0         
                                                                 
 bidirectional_1 (Bidirectio  (None, 128)              123648    
 nal)                                                            
                                                                 
 dropout_1 (Dropout)         (None, 128)               0         
                                                                 
 dense (Dense)            
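
The parameter counts in the summary can be reproduced by hand. The sketch below assumes the Keras GRU default `reset_after=True`, which gives each gate two bias vectors; the helper names are hypothetical.

```python
def embedding_params(vocab_size, dim):
    """One dim-sized vector per vocabulary index."""
    return vocab_size * dim


def gru_params(input_dim, units, reset_after=True):
    """One GRU layer: 3 gates, each with an input kernel, a recurrent kernel and bias."""
    bias_per_gate = 2 * units if reset_after else units
    return 3 * (units * (input_dim + units) + bias_per_gate)


def bidirectional(params_one_direction):
    """A Bidirectional wrapper holds a forward and a backward copy of the layer."""
    return 2 * params_one_direction


print(embedding_params(166124, 300))         # embedding layer
print(bidirectional(gru_params(300, 128)))   # first Bidirectional GRU
print(bidirectional(gru_params(256, 64)))    # second Bidirectional GRU
```

The second GRU takes 256 input features because the first bidirectional layer concatenates its two 128-unit directions.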

In [61]:
model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

In [62]:
BATCH_SIZE = 32
N_EPOCHS = 5

In [63]:
import tensorflow as tf
# Returns the name of the GPU device, or an empty string if no GPU is available
tf.test.gpu_device_name()

''

In [None]:
model.fit(X_train, y_train,
          batch_size=BATCH_SIZE,
          epochs=N_EPOCHS,
          validation_data=(X_test, y_test))

Epoch 1/5
 5090/34163 [===>..........................] - ETA: 3:21:08 - loss: 0.3232 - accuracy: 0.8695
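
The step count in the progress bar is the number of training rows divided by the batch size, rounded up. The calculation below reproduces it; the larger batch sizes are hypothetical values included only to show how the number of steps per epoch would shrink.

```python
import math

n_train = 1093210   # rows in X_train, from the shapes printed earlier
batch_size = 32     # BATCH_SIZE in the notebook

steps_per_epoch = math.ceil(n_train / batch_size)
print(steps_per_epoch)

# Hypothetical larger batch sizes, for comparison only.
alternative_steps = {size: math.ceil(n_train / size) for size in (128, 512)}
print(alternative_steps)
```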

##   **Stage 5**: Evaluate the Model and get model predictions on the test dataset (2 Points)








In [None]:
print('Testing...')
model.evaluate(X_test, y_test)

In [None]:
preds = model.predict(X_test)

In [None]:
preds

In [None]:
len(preds)

In [None]:
y_test.values

In [None]:
df_test_eval = pd.DataFrame(preds, columns=['pred_prob'])
df_test_eval['act_label'] = y_test.values
df_test_eval['pred_label'] = np.where(df_test_eval['pred_prob'] > 0.5, 1, 0)

In [None]:
df_test_eval.head()

In [None]:
from sklearn.metrics import classification_report

In [None]:
print(classification_report(df_test_eval.act_label, df_test_eval.pred_label))
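
With imbalanced classes, accuracy alone can be misleading, and the 0.5 cut-off is only one possible choice. The sketch below computes precision, recall and F1 for the positive class at several thresholds. The labels and probabilities are made up for illustration, and `precision_recall_f1` is a hypothetical helper, not a scikit-learn function.

```python
def precision_recall_f1(y_true, y_prob, threshold=0.5):
    """Precision, recall and F1 for the positive class at a given probability threshold."""
    y_pred = [1 if p > threshold else 0 for p in y_prob]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Made-up labels and probabilities, for illustration only.
y_true = [0, 0, 1, 1, 0, 1]
y_prob = [0.1, 0.6, 0.8, 0.4, 0.2, 0.9]

for threshold in (0.3, 0.5, 0.7):
    p, r, f = precision_recall_f1(y_true, y_prob, threshold)
    print(f"threshold={threshold}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

Lowering the threshold trades precision for recall on the minority class, so it may be worth comparing a few values on the validation split before predicting on the test set.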

In [None]:
preds_final = model.predict(Z)

In [None]:
len(preds_final)

In [None]:
preds_final

In [None]:
df_final = pd.DataFrame(preds_final, columns=['target_prob'])
df_final['qid'] = df_test.qid.values
df_final['question_text'] = df_test.question_text.values
df_final['target'] = np.where(df_final['target_prob'] > 0.5, 1, 0)
df_final.head()

In [None]:
df_final.target.value_counts()

In [None]:
df_final[df_final.target == 1]
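
The remaining step is to write the predictions to a file for submission. The sketch below uses only the standard library and assumes the submission needs a `qid` column and a `target` column. That layout is an assumption and should be checked against sample_submission.csv. The probabilities are made up; the two qids come from the test preview earlier in the notebook.

```python
import csv
import os
import tempfile


def write_submission(qids, probabilities, path, threshold=0.5):
    """Write one row per question: its qid and the thresholded 0/1 prediction."""
    with open(path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh)
        writer.writerow(["qid", "target"])  # assumed header; check sample_submission.csv
        for qid, prob in zip(qids, probabilities):
            writer.writerow([qid, int(prob > threshold)])


# Made-up probabilities for two qids from the test preview.
demo_qids = ["d5cacbea9be29bd47a78", "5650c4a236fe3b555c31"]
demo_probs = [0.12, 0.91]

out_path = os.path.join(tempfile.mkdtemp(), "submission.csv")
write_submission(demo_qids, demo_probs, out_path)

with open(out_path, encoding="utf-8") as fh:
    rows = list(csv.reader(fh))
print(rows)
```

In the notebook, the same result could be obtained from the existing dataframe with something like `df_final[['qid', 'target']].to_csv('submission.csv', index=False)`.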