# Advanced Programme in Deep Learning (Foundations and Applications)
## A Program by IISc and TalentSprint

### Mini Project Notebook: Irrelevant/inappropriate Questions Classification using Deep Neural Networks.


## Learning Objectives

At the end of the mini-hackathon, you will be able to:

* preprocess the text data
* represent the text/words using pretrained word embeddings (Word2Vec/GloVe)
* build deep neural networks to classify questions as irrelevant/inappropriate or not


## Dataset

The challenge in this competition is to predict whether a question asked on a well-known public forum/platform is irrelevant/inappropriate or not.

An irrelevant/inappropriate question is defined as a question intended to make a statement rather than to look for helpful/meaningful answers. The following are some of the characteristics that can signify that a question is irrelevant/inappropriate:

* Based on false information, or contains absurd assumptions
* Has a non-neutral tone
* Has an exaggerated tone to underscore a point about a group of people
* Is rhetorical and meant to imply a statement about a group of people
* Is disparaging or inflammatory against an individual or a group of people
* Uses sexual content (such as incest, pedophilia), and not to seek genuine answers
* Suggests a discriminatory idea against a protected class of people, or seeks confirmation of a stereotype
* Based on an unrealistic premise about a group of people
* Is not grounded in reality

The training dataset includes the 1044897 questions that were asked, and whether each was identified as irrelevant/inappropriate (target = 1) or as relevant/appropriate (target = 0). The test dataset consists of approximately 261000 questions.

The training data might be imbalanced or noisy and is not guaranteed to be perfect. Please take the necessary steps to account for this while building the model.
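
One common way to act on the imbalance is to weight each class inversely to its frequency. The sketch below uses the "balanced" heuristic, `n_samples / (n_classes * class_count)`. The helper name `balanced_class_weights` and the toy labels are illustrative assumptions and are not taken from the dataset.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Return {class: n_samples / (n_classes * class_count)} for each class."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}


# Toy labels (made up): 9 relevant questions and 1 irrelevant question
toy_labels = [0] * 9 + [1]
toy_weights = balanced_class_weights(toy_labels)
print(toy_weights)
```

A dictionary of this shape can be passed as `class_weight` when fitting a Keras model. Whether the balanced heuristic or a hand-tuned ratio works better here would have to be checked on a validation split.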


## Description

This dataset has the following information:

1. **qid** - unique question identifier
2. **question_text** - the text of the question asked on the well-known public forum/platform
3. **target** - a question labeled "irrelevant/inappropriate" has a value of 1, otherwise 0



## Problem Statement

To classify approximately 261000 questions asked on a well-known public forum as 'irrelevant/inappropriate' or 'relevant/appropriate' using Deep Neural Networks such as RNN/CNN/BERT/LSTM.

## Grading = 10 Marks

Here is a handy link to Kaggle's competition documentation (https://www.kaggle.com/docs/competitions), which includes, among other things, instructions on submitting predictions (https://www.kaggle.com/docs/competitions#making-a-submission).

## Instructions for downloading train and test dataset from Kaggle API are as follows:

### 1. Create an API key in Kaggle.

To do this, go to the competition site on Kaggle (https://www.kaggle.com/t/bde6f23028154933a99e4b4ca8a3dff2), click on your user icon, and then click on your profile as shown below. Click **Account**.

![alt text](https://cdn.iisc.talentsprint.com/DLFA/Experiment_related_data/Capture-NLP.PNG)

### 2. Next, scroll down to the API access section and click on **Create New Token** to download an API key (kaggle.json).

![alt text](https://cdn.iisc.talentsprint.com/DLFA/Experiment_related_data/Capture-NLP_1.PNG)

### 3. Upload your kaggle.json file using the following snippet in a code cell:



In [None]:
from google.colab import files
files.upload()

In [None]:
#If successfully uploaded in the above step, the 'ls' command here should display the kaggle.json file.
%ls

### 4. Install the Kaggle API using the following command


In [None]:
!pip install -U -q kaggle==1.5.8

### 5. Move the kaggle.json file into ~/.kaggle, which is where the API client expects your token to be located:



In [None]:
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/

In [None]:
# Execute the following command to verify whether the kaggle.json is stored in the appropriate location: ~/.kaggle/kaggle.json
!ls ~/.kaggle

In [None]:
!chmod 600 /root/.kaggle/kaggle.json # run this command to ensure your Kaggle API token is secure on colab
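
As an optional check, the following sketch confirms from Python that the token file exists and is readable only by its owner, which is what `chmod 600` is meant to achieve. The helper name `token_is_private` is a hypothetical name introduced for illustration.

```python
import os
import stat


def token_is_private(path):
    """True if the file exists and grants no permissions to group or others."""
    if not os.path.isfile(path):
        return False
    mode = stat.S_IMODE(os.stat(path).st_mode)
    return (mode & (stat.S_IRWXG | stat.S_IRWXO)) == 0


# Prints False if the token is missing or readable by other users
print(token_is_private(os.path.expanduser("~/.kaggle/kaggle.json")))
```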

### 6. Now download the train and test data from Kaggle

**NOTE: If you get a '404 - Not Found' error after running the cell below, it is most likely that the user (whose kaggle.json is uploaded above) has not 'accepted' the rules of the competition and therefore has 'not joined' the competition.**

If you encounter a **401 - Unauthorized** error, download the latest **kaggle.json** by repeating steps 1 and 2.

In [None]:
#If you get a forbidden link, you have most likely not joined the competition.
!kaggle competitions download -c toxic-questions-classification

In [None]:
!unzip /content/toxic-questions-classification.zip

## YOUR CODING STARTS FROM HERE

## Import required packages

In [None]:
# Import required packages

In [None]:
import itertools
import random
from collections import Counter

import matplotlib.pyplot as plt
import nltk
import numpy as np
import pandas as pd
import seaborn as sns
from gensim.parsing.preprocessing import preprocess_string, strip_tags, strip_punctuation, strip_multiple_whitespaces, strip_numeric, remove_stopwords, strip_short
from gensim.utils import simple_preprocess
from IPython.display import HTML, display
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.model_selection import train_test_split
from tqdm import tqdm
from wordcloud import WordCloud

# WordNet data is required by WordNetLemmatizer
nltk.download('wordnet')


from keras.layers import Input, Embedding, Dense, Bidirectional, Dropout, GRU
from keras.models import Sequential   # the model
from keras.callbacks import EarlyStopping
from sklearn.metrics import classification_report

##   **Stage 1**:  Data Loading and Exploratory Data Analysis (1 Point)

In [None]:
train = pd.read_csv("/content/train_dataset.csv")
train

In [None]:
test = pd.read_csv("/content/test_dataset.csv")
test

In [None]:
combo = pd.concat([train, test], axis=0)

In [None]:
# Let's see what the data looks like: display three consecutive rows from a random position
random_index = random.randint(0, train.shape[0] - 3)
for row in train[['question_text', 'target']][random_index:random_index + 3].itertuples():
    _, text, label = row
    class_name = "1" if label == 1 else "0"
    display(HTML(f"<h5><b style='color:red'>question_text: </b>{text}</h5>"))
    display(HTML(f"<h5><b style='color:red'>target: </b>{class_name}<br><hr></h5>"))
# The data contains a lot of noise and needs to be cleaned

In [None]:
colors=['#AB47BC','#6495ED']
plt.pie(train['target'].value_counts(),labels=['0','1'],autopct='%.2f%%',explode=[0.01,0.01],colors=colors);
plt.title('Distribution of target')
plt.ylabel('target');

In [None]:
nltk.download('stopwords')

In [None]:
# target == 0: relevant/appropriate questions; target == 1: irrelevant/inappropriate questions
relevant_data = train[train['target'] == 0]['question_text']
irrelevant_data = train[train['target'] == 1]['question_text']

def wordcloud_draw(data, color, s):
    words = ' '.join(data)
    # Drop a few very frequent words that carry little meaning in either class
    cleaned_word = " ".join([word for word in words.split() if(word not in ['would','get','like','people','think','take'])])
    wordcloud = WordCloud(stopwords=stopwords.words('english'),background_color=color,width=2500,height=2000).generate(cleaned_word)
    plt.imshow(wordcloud)
    plt.title(s)
    plt.axis('off')

plt.figure(figsize=[20,10])
plt.subplot(1,2,1)
wordcloud_draw(relevant_data, 'white', 'Most common words in relevant questions (target = 0)')

plt.subplot(1,2,2)
wordcloud_draw(irrelevant_data, 'white', 'Most common words in irrelevant questions (target = 1)')
plt.show()

In [None]:
train['text_word_count']=train['question_text'].apply(lambda x:len(x.split()))

numerical_feature_cols=['text_word_count']

In [None]:
plt.figure(figsize=(20,3))
for i,col in enumerate(numerical_feature_cols):
    # One subplot per feature column
    plt.subplot(1,len(numerical_feature_cols),i+1)
    sns.histplot(data=train,x=col,hue='target',bins=50)
    plt.title(f"Distribution of {col} with respect to target")
plt.tight_layout()
plt.show()

##   **Stage 2**: Data Pre-Processing  (1 Point)

####  Clean and Transform the data into a specified format


In [None]:
combo['question_text'] = combo['question_text'].apply(lambda x:simple_preprocess(x))

In [None]:
combo['question_text'] = combo['question_text'].apply(lambda tokens: ' '.join(tokens))

In [None]:
# Create the lemmatizer once instead of on every call
lemmatizer = WordNetLemmatizer()

# String-level filters, applied in order by preprocess_string
filters = [lambda x: x.lower(),  # Convert to lowercase
           strip_tags,  # Remove HTML tags
           strip_punctuation,  # Remove punctuation
           strip_multiple_whitespaces,  # Remove multiple whitespaces
           strip_numeric,  # Remove numbers
           remove_stopwords,  # Remove stopwords
           strip_short]  # Remove short words

def custom_preprocess(s):
    # preprocess_string applies every filter to the whole string and then splits it
    # into tokens, so lemmatization has to be applied per token afterwards
    tokens = preprocess_string(s, filters)
    return [lemmatizer.lemmatize(token) for token in tokens]

In [None]:
combo['question_text'] = combo['question_text'].apply(lambda x:custom_preprocess(x))

In [None]:
combo['question_text'] = combo['question_text'].apply(lambda tokens: ' '.join(tokens))

In [None]:
def get_text_length(text):
    tokens = simple_preprocess(text)
    return len(tokens)

In [None]:
combo['text_length'] = combo['question_text'].apply(get_text_length)

In [None]:
fig = plt.figure(figsize=(10,6))
# Add axes to the figure. Create the first main window
ax1 = fig.add_axes([0, 0, 0.95, 0.95])
ax1.hist(np.array(combo.text_length), bins=50, label='length', alpha=0.6, color='blue');

In [None]:
combo['text_length'].quantile(0.995)
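
The quantile above is presumably used to pick a sequence length that covers almost all questions without padding every input to the longest outlier. The sketch below reproduces a linearly interpolated quantile (the default interpolation in pandas) on made-up token counts; `quantile_linear` and `toy_lengths` are illustrative names.

```python
def quantile_linear(values, q):
    """Quantile with linear interpolation between the two nearest ranks."""
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)
    lower = int(pos)
    upper = min(lower + 1, len(ordered) - 1)
    frac = pos - lower
    return ordered[lower] + frac * (ordered[upper] - ordered[lower])


# Made-up token counts with a single long outlier
toy_lengths = [3, 5, 4, 6, 8, 5, 7, 30]
print(quantile_linear(toy_lengths, 0.5))
print(quantile_linear(toy_lengths, 0.995))
```

On a sample this small the 0.995 quantile is still dominated by the outlier. On a corpus with over a million questions it sits far below the maximum length, which is what makes it a reasonable cut-off.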

In [None]:
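# Note: max_len in simple_preprocess is the maximum token length in characters (default 15),
# not the number of tokens per question; questions are cut to a fixed number of tokens
# later, when the sequences are padded
combo['question_text'] = combo['question_text'].apply(lambda x:simple_preprocess(x, max_len=17))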

In [None]:
combo['question_text'] = combo['question_text'].apply(lambda tokens: ' '.join(tokens))

In [None]:
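# Count how often each token occurs across the combined train and test text
all_tokens = list(itertools.chain.from_iterable([i.split() for i in combo.question_text]))
token_counts = Counter(all_tokens)
sorted_token_counts = dict(sorted(token_counts.items(), key=lambda item: item[1], reverse=True))
print("Distinct tokens:", len(sorted_token_counts))
# Count the tokens that occur more than 11 times
count = 0
for key, value in sorted_token_counts.items():
    if value > 11:
        count+=1
print(count)
# Spot-check the frequency of a single token
sorted_token_counts.get('approach', 0)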

In [None]:
def vocab_limiter(x):
    x = x.split()
    y = []
    for i in x:
        if sorted_token_counts.get(i, 0) > 11:
            y.append(i)
    return ' '.join(y)

In [None]:
combo['question_text'] = combo['question_text'].apply(vocab_limiter)
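
To make the effect of the frequency cut-off concrete, here is a minimal sketch of the same idea on a toy corpus. The function `limit_vocab`, its `min_count` parameter, and the example sentences are assumptions made for illustration; the notebook itself keeps tokens that occur more than 11 times.

```python
from collections import Counter


def limit_vocab(texts, min_count):
    """Drop every token that occurs fewer than min_count times across texts."""
    counts = Counter(tok for text in texts for tok in text.split())
    return [
        " ".join(tok for tok in text.split() if counts[tok] >= min_count)
        for text in texts
    ]


toy_texts = ["best way learn python", "best python book", "learn rare zxqv"]
print(limit_vocab(toy_texts, min_count=2))
# → ['best learn python', 'best python', 'learn']
```

Rare tokens are removed entirely rather than mapped to an out-of-vocabulary id, so a question made up only of rare words becomes an empty string.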

##   **Stage 3**: Build the Word Embeddings using pretrained Word2vec/Glove (Text Representation) (1 Point)



In [None]:
import tensorflow as tf

In [None]:
MAX_SENT_LEN = 17   # Number of words to consider from each question
MAX_VOCAB_SIZE = 29581 # Max vocabulary size
BATCH_SIZE = 32
N_EPOCHS = 5

In [None]:
tokenizer = tf.keras.preprocessing.text.Tokenizer()
tokenizer.fit_on_texts(combo['question_text'])
print("Number of words in vocabulary:", len(tokenizer.word_index))
X = tokenizer.texts_to_sequences(combo['question_text'])
X = tf.keras.preprocessing.sequence.pad_sequences(X, maxlen=MAX_SENT_LEN, padding='post', truncating='post')
y = combo['target']
n_train = len(train)   # the first len(train) rows of combo are the training questions
testX = X[n_train:]
X = X[:n_train]
y = y.iloc[:n_train]
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42, test_size = 0.1)
X_train.shape, X_test.shape, y_train.shape, y_test.shape
embeddings_index = {}
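
The padding step above uses `padding='post'` and `truncating='post'`. The pure-Python sketch below imitates that behaviour on toy sequences to show what happens to short and long inputs; `pad_post` is a hypothetical stand-in, not the Keras implementation.

```python
def pad_post(sequences, maxlen, value=0):
    """Keep the first maxlen ids of each sequence, then right-pad with value."""
    padded = []
    for seq in sequences:
        kept = list(seq[:maxlen])  # post-truncation drops ids from the end
        padded.append(kept + [value] * (maxlen - len(kept)))
    return padded


toy_sequences = [[4, 9, 2], [7, 1, 3, 8, 5, 6]]
print(pad_post(toy_sequences, maxlen=4))
# → [[4, 9, 2, 0], [7, 1, 3, 8]]
```

Index 0 is used only for padding, which is why the embedding matrix later needs one more row than there are words in the tokenizer's vocabulary.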

In [None]:
!wget https://nlp.stanford.edu/data/glove.42B.300d.zip

In [None]:
!unzip glove*.zip

In [None]:
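# Open the GloVe file with an explicit encoding so that non-ASCII tokens are read correctly
f = open('/content/glove.42B.300d.txt', encoding='utf-8')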

In [None]:
for line in tqdm(f):
    values = line.split()
    word = values[0]
    coefs = np.asarray(values[1:], dtype='float32')
    embeddings_index[word] = coefs
f.close()

print('Found %s word vectors.' % len(embeddings_index))
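
Each line of a GloVe text file holds a token followed by its vector components, separated by spaces. The sketch below parses that layout from an in-memory toy file and skips lines with the wrong number of fields; `parse_glove_lines` and the three-dimensional toy vectors are assumptions for illustration.

```python
import io


def parse_glove_lines(handle, dim):
    """Yield (word, vector) pairs, skipping lines that do not hold dim floats."""
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            continue  # malformed line, or a token that itself contains spaces
        yield parts[0], [float(v) for v in parts[1:]]


toy_file = io.StringIO("cat 0.1 0.2 0.3\nbroken 0.5\ndog -0.4 0.0 0.9\n")
toy_index = dict(parse_glove_lines(toy_file, dim=3))
print(sorted(toy_index))
# → ['cat', 'dog']
```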

In [None]:
# Adding 1 because index 0 is reserved for padding
words_not_found = []
vocab_size = len(tokenizer.word_index) + 1
print('Loaded %s word vectors.' % len(embeddings_index))

In [None]:
embedding_dim = 300
embedding_matrix = np.zeros((vocab_size, embedding_dim))
len(tokenizer.word_index.items())

In [None]:
for word, i in tokenizer.word_index.items():
    embedding_vector = embeddings_index.get(word)
    if (embedding_vector is not None) and len(embedding_vector) > 0:
        embedding_matrix[i] = embedding_vector
    else:
        words_not_found.append(word)

In [None]:
len(words_not_found)

In [None]:
print(len(tokenizer.word_index))
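
The number of words without a pretrained vector is easier to interpret as a coverage ratio, both over distinct words and over token occurrences. The sketch below computes both on made-up counts; `embedding_coverage` is a hypothetical helper, and with the Keras tokenizer the per-word frequencies would come from its `word_counts` attribute.

```python
def embedding_coverage(word_counts, embeddings):
    """Return (share of distinct words covered, share of token occurrences covered)."""
    covered = [word for word in word_counts if word in embeddings]
    vocab_share = len(covered) / len(word_counts)
    token_share = sum(word_counts[w] for w in covered) / sum(word_counts.values())
    return vocab_share, token_share


toy_counts = {"python": 6, "india": 3, "zxqv": 1}  # made-up corpus frequencies
toy_vectors = {"python": [0.1, 0.2], "india": [0.3, 0.4]}
vocab_share, token_share = embedding_coverage(toy_counts, toy_vectors)
print(round(vocab_share, 3), round(token_share, 3))
```

Token-level coverage is usually the more informative number, because words missing from the embedding file tend to be rare.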

##   **Stage 4**: Build and Train the Deep Network Model using PyTorch/Keras (5 Points)



In [None]:
early_stopping = EarlyStopping(monitor='val_loss', patience=3)

In [None]:
weights_assigned={0:1,1:20}

In [None]:
# Build a sequential model by stacking neural net units
model = Sequential()
embedding_layer = Embedding(vocab_size,
                            embedding_dim,
                            weights = [embedding_matrix],
                            trainable=False)
model.add(embedding_layer)
model.add(Bidirectional(GRU(128, return_sequences=True, dropout=0.50, name='first_gru_layer')))
model.add(Dropout(0.5))
model.add(Bidirectional(GRU(64, name='second_gru_layer')))
model.add(Dropout(0.5))
model.add(Dense(32, activation='tanh'))
model.add(Dropout(0.4))
model.add(Dense(8, activation='relu'))
model.add(Dropout(0.4))
model.add(Dense(1, activation='sigmoid', name='output_layer'))

In [None]:
print('Summary of the built model...')
model.summary()

In [None]:
model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['AUC'])

In [None]:
model.fit(X_train, y_train,
          batch_size=BATCH_SIZE,
          epochs=2,
          validation_data=(X_test, y_test),
          callbacks=[early_stopping],
          class_weight=weights_assigned)
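
The training call combines `epochs=2` with an early-stopping patience of 3. The simplified sketch below models patience-based stopping on a list of validation losses, ignoring options such as `min_delta`, to show when the callback could fire; `stopping_epoch` and the loss values are made up for illustration.

```python
def stopping_epoch(val_losses, patience):
    """Return the 1-based epoch at which training would stop, or None if it never does."""
    best = float("inf")
    waited = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best, waited = loss, 0  # improvement resets the counter
        else:
            waited += 1
            if waited >= patience:
                return epoch
    return None


toy_losses = [0.30, 0.28, 0.29, 0.31, 0.30, 0.27]
print(stopping_epoch(toy_losses, patience=3))      # → 5
print(stopping_epoch(toy_losses[:2], patience=3))  # → None
```

Under this model a run of only two epochs can never accumulate three epochs without improvement, so the callback only matters if the number of epochs is raised.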

##   **Stage 5**: Evaluate the Model and get model predictions on the test dataset (2 Points)








In [None]:
print('Testing...')
model.evaluate(X_test, y_test)

In [None]:
predz = model.predict(X_test)

In [None]:
fig = plt.figure(figsize=(10,6))
# Add axes to the figure. Create the first main window
ax1 = fig.add_axes([0, 0, 0.95, 0.95])
ax1.hist(np.array(predz), bins=50, label='predicted probability', alpha=0.6, color='blue');

In [None]:
y_predz1 = [1 if i >= 0.92 else 0 for i in predz]

In [None]:
print(classification_report(y_test.to_numpy(), y_predz1, target_names=['0','1']))
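
The 0.92 cut-off above is a fixed value. One way to choose such a threshold, sketched below, is to sweep a few candidates on the validation split and keep the one with the highest F1 score for the positive class. The function names, toy labels, and toy probabilities are illustrative assumptions, and F1 is only one possible selection criterion.

```python
def f1_at_threshold(y_true, probs, threshold):
    """F1 score of the positive class when probs >= threshold is predicted as 1."""
    preds = [1 if p >= threshold else 0 for p in probs]
    tp = sum(1 for t, p in zip(y_true, preds) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, preds) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, preds) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    return 2 * tp / (2 * tp + fp + fn)


def best_threshold(y_true, probs, candidates):
    """Return the candidate threshold with the highest F1 score."""
    return max(candidates, key=lambda th: f1_at_threshold(y_true, probs, th))


toy_true = [0, 0, 0, 0, 1, 1, 0, 1]
toy_probs = [0.05, 0.20, 0.55, 0.60, 0.70, 0.95, 0.40, 0.65]
print(best_threshold(toy_true, toy_probs, [0.3, 0.5, 0.62, 0.9]))
# → 0.62
```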

In [None]:
preds = model.predict(testX)


In [None]:
test['target'] = preds

In [None]:
preds1 = [1 if i >= 0.92 else 0 for i in preds]

In [None]:
test['target'] = preds1

In [None]:
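# Take an explicit copy so that the submission frame is independent of test;
# it already contains the qid column, so no further assignment is needed
test_submit = test[['qid','target']].copy()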

In [None]:
test_submit.to_csv('upload1.csv',index=False)
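
Before uploading, it can help to sanity-check the file. The sketch below assumes the submission is a CSV with exactly the columns `qid` and `target`, as written by the cell above; the exact format required by the competition should be confirmed on its page. `submission_problems` and the toy rows are hypothetical.

```python
import csv
import io


def submission_problems(handle, expected_qids):
    """Return a list of human-readable problems found in a qid,target CSV."""
    reader = csv.DictReader(handle)
    if reader.fieldnames != ["qid", "target"]:
        return [f"unexpected header: {reader.fieldnames}"]
    rows = list(reader)
    qids = [row["qid"] for row in rows]
    problems = []
    if len(set(qids)) != len(qids):
        problems.append("duplicate qid values")
    if set(qids) != set(expected_qids):
        problems.append("qid values do not match the test set")
    if any(row["target"] not in {"0", "1"} for row in rows):
        problems.append("target values other than 0/1")
    return problems


good = io.StringIO("qid,target\na1,0\nb2,1\n")
bad = io.StringIO("qid,target\na1,0\na1,2\n")
good_problems = submission_problems(good, ["a1", "b2"])
bad_problems = submission_problems(bad, ["a1", "b2"])
print(good_problems)
print(bad_problems)
```

For the real file, the handle would come from `open('upload1.csv', newline='')` and the expected ids from the `qid` column of the test set.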

In [None]:
test_submit