# COMP-4730 Final Project
**Created by Saffa Alvi, Nour ElKott, & Nandini Patel**

## Objective
> The purpose of this research data analysis project is to apply deep learning approaches to explore Natural Language Processing (NLP), and to create a model for the Google QUEST Q&A Labeling competition on Kaggle.com. 

> The objective of this competition is to use a new dataset, compiled by the CrowdSource team at Google, to create a predictive model “for different subjective aspects of question-answering” [1]. The goal of this project is to build a well-performing model that accurately predicts the classes of the unlabeled data and to answer the research questions, defined in this report, that relate to NLP and this competition topic. Our model's accuracy will also be compared with that of existing models and evaluated to see which properties of our model affect its overall accuracy.

> This research project benefited from the help and direction of our professor, Dr. Robin Gras, and from a few online resources.

## Imports

The following libraries and functions were imported to help construct the NLP model.

In [84]:
import pandas as pd
import numpy as np
import seaborn as sb
import matplotlib.pyplot as plt
import string
import os
import re

In [88]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import Activation
from tensorflow.keras.layers import Dropout
from tensorflow.keras.layers import Input
from tensorflow.keras.utils import plot_model
from tensorflow.keras.layers import (
    LSTM, Embedding, Concatenate, TimeDistributed, Bidirectional, GRU,
    Flatten, Conv2D, Conv1D, GlobalMaxPooling1D, GlobalMaxPool1D, SimpleRNN,
)
from tensorflow.keras.models import Model
from tensorflow.keras.callbacks import EarlyStopping

import tensorflow_hub as hub
from sklearn.metrics.pairwise import cosine_similarity

from sklearn.preprocessing import LabelEncoder
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer

from sklearn.model_selection import train_test_split
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

## Training Set, Testing Set, and Submission File

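The following data files are provided to complete this project:
<br />**train.csv**
<br />contains the training set of questions and answers, together with their target labels
<br />**test.csv**
<br />contains the testing set of questions and answers
<br />**sample_submission.csv**
<br />contains a sample submission; its columns are used below to identify the target variables

Here, the program reads the three .csv files and stores the contents of each in a corresponding DataFrame.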

In [71]:
trainingSet      = pd.read_csv('train.csv')
testingSet       = pd.read_csv('test.csv')
sampleSubmission = pd.read_csv('sample_submission.csv')

**Display the Target Variables in the Training Data Set**
<br/> The targets all have a value between 0 and 1, inclusive. 

In [72]:
pd.set_option('display.max_columns', None)

targets = list(sampleSubmission.columns[1:])
trainingSet[targets].describe()

Unnamed: 0,question_asker_intent_understanding,question_body_critical,question_conversational,question_expect_short_answer,question_fact_seeking,question_has_commonly_accepted_answer,question_interestingness_others,question_interestingness_self,question_multi_intent,question_not_really_a_question,question_opinion_seeking,question_type_choice,question_type_compare,question_type_consequence,question_type_definition,question_type_entity,question_type_instructions,question_type_procedure,question_type_reason_explanation,question_type_spelling,question_well_written,answer_helpful,answer_level_of_information,answer_plausible,answer_relevance,answer_satisfaction,answer_type_instructions,answer_type_procedure,answer_type_reason_explanation,answer_well_written
count,6079.0,6079.0,6079.0,6079.0,6079.0,6079.0,6079.0,6079.0,6079.0,6079.0,6079.0,6079.0,6079.0,6079.0,6079.0,6079.0,6079.0,6079.0,6079.0,6079.0,6079.0,6079.0,6079.0,6079.0,6079.0,6079.0,6079.0,6079.0,6079.0,6079.0
mean,0.892663,0.595301,0.057301,0.698525,0.772633,0.793689,0.587478,0.507275,0.238745,0.004469,0.429978,0.284915,0.038137,0.010035,0.030762,0.065225,0.497587,0.166063,0.386385,0.000823,0.799931,0.925408,0.654823,0.960054,0.968626,0.85468,0.479547,0.130641,0.502468,0.908254
std,0.132047,0.21947,0.182196,0.350938,0.303023,0.336622,0.1359,0.185987,0.335057,0.045782,0.365952,0.368826,0.153635,0.07424,0.138065,0.197582,0.423138,0.257301,0.383384,0.020489,0.17842,0.114836,0.107666,0.086926,0.074631,0.130743,0.422921,0.225718,0.407097,0.100708
min,0.333333,0.333333,0.0,0.0,0.0,0.0,0.333333,0.333333,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.333333,0.333333,0.333333,0.333333,0.333333,0.2,0.0,0.0,0.0,0.333333
25%,0.777778,0.444444,0.0,0.5,0.666667,0.666667,0.444444,0.333333,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.666667,0.888889,0.666667,1.0,1.0,0.8,0.0,0.0,0.0,0.888889
50%,0.888889,0.555556,0.0,0.666667,1.0,1.0,0.555556,0.444444,0.0,0.0,0.333333,0.0,0.0,0.0,0.0,0.0,0.666667,0.0,0.333333,0.0,0.833333,1.0,0.666667,1.0,1.0,0.866667,0.5,0.0,0.5,0.888889
75%,1.0,0.777778,0.0,1.0,1.0,1.0,0.666667,0.666667,0.333333,0.0,0.666667,0.666667,0.0,0.0,0.0,0.0,1.0,0.333333,0.666667,0.0,1.0,1.0,0.666667,1.0,1.0,0.933333,1.0,0.333333,1.0,1.0
max,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.666667,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


**Display the Training Set Size**
<br />**Training Set** is composed of:
<br />41 columns: 11 input columns (e.g. question_title, answer) and 30 target columns (e.g. answer_helpful).
<br />6079 rows, one for each entry.

In [73]:
print("Training Set Size (rows, cols): ", trainingSet.shape)

trainingSetCols = trainingSet.columns
n = 1

print("\nLIST OF COLUMN NAMES IN TRAINING SET:")
print("----------------------------------------")
for i in trainingSetCols:
    print(n, ". ", i)
    n = n + 1

Training Set Size (rows, cols):  (6079, 41)

LIST OF COLUMN NAMES IN TRAINING SET:
----------------------------------------
1 .  qa_id
2 .  question_title
3 .  question_body
4 .  question_user_name
5 .  question_user_page
6 .  answer
7 .  answer_user_name
8 .  answer_user_page
9 .  url
10 .  category
11 .  host
12 .  question_asker_intent_understanding
13 .  question_body_critical
14 .  question_conversational
15 .  question_expect_short_answer
16 .  question_fact_seeking
17 .  question_has_commonly_accepted_answer
18 .  question_interestingness_others
19 .  question_interestingness_self
20 .  question_multi_intent
21 .  question_not_really_a_question
22 .  question_opinion_seeking
23 .  question_type_choice
24 .  question_type_compare
25 .  question_type_consequence
26 .  question_type_definition
27 .  question_type_entity
28 .  question_type_instructions
29 .  question_type_procedure
30 .  question_type_reason_explanation
31 .  question_type_spelling
32 .  question_well_written
33 . 

**Display the Testing Set Size**
<br />**Testing Set** is composed of:
<br />11 input columns (e.g. question_title, answer_user_name); the target columns are not included.
<br />476 rows, one for each entry.

In [74]:
print("Testing Set Size (rows, cols): ", testingSet.shape)

testingSetCols = testingSet.columns
n = 1

print("\nLIST OF COLUMN NAMES IN TESTING SET:")
print("----------------------------------------")
for i in testingSetCols:
    print(n, ". ", i)
    n = n + 1

Testing Set Size (rows, cols):  (476, 11)

LIST OF COLUMN NAMES IN TESTING SET:
----------------------------------------
1 .  qa_id
2 .  question_title
3 .  question_body
4 .  question_user_name
5 .  question_user_page
6 .  answer
7 .  answer_user_name
8 .  answer_user_page
9 .  url
10 .  category
11 .  host


**Display the contents of the Training Set**

In [75]:
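# xTrain is only defined later, in the RNN section, so select the ten text and metadata columns directly
trainingSet[testingSet.columns[1:]].head()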

Unnamed: 0,question_title,question_body,question_user_name,question_user_page,answer,answer_user_name,answer_user_page,url,category,host
0,What am I losing when using extension tubes in...,After playing around with macro photography on...,ysap,https://photo.stackexchange.com/users/1024,"I just got extension tubes, so here's the skin...",rfusca,https://photo.stackexchange.com/users/1917,http://photo.stackexchange.com/questions/9169/...,LIFE_ARTS,photo.stackexchange.com
1,What is the distinction between a city and a s...,I am trying to understand what kinds of places...,russellpierce,https://rpg.stackexchange.com/users/8774,It might be helpful to look into the definitio...,Erik Schmidt,https://rpg.stackexchange.com/users/1871,http://rpg.stackexchange.com/questions/47820/w...,CULTURE,rpg.stackexchange.com
2,Maximum protusion length for through-hole comp...,I'm working on a PCB that has through-hole com...,Joe Baker,https://electronics.stackexchange.com/users/10157,Do you even need grooves? We make several pro...,Dwayne Reid,https://electronics.stackexchange.com/users/64754,http://electronics.stackexchange.com/questions...,SCIENCE,electronics.stackexchange.com
3,Can an affidavit be used in Beit Din?,"An affidavit, from what i understand, is basic...",Scimonster,https://judaism.stackexchange.com/users/5151,"Sending an ""affidavit"" it is a dispute between...",Y e z,https://judaism.stackexchange.com/users/4794,http://judaism.stackexchange.com/questions/551...,CULTURE,judaism.stackexchange.com
4,How do you make a binary image in Photoshop?,I am trying to make a binary image. I want mor...,leigero,https://graphicdesign.stackexchange.com/users/...,Check out Image Trace in Adobe Illustrator. \n...,q2ra,https://graphicdesign.stackexchange.com/users/...,http://graphicdesign.stackexchange.com/questio...,LIFE_ARTS,graphicdesign.stackexchange.com


## Building a Sentiment Classifier
We will build a sentiment classifier and test its performance. The Sentiment Classifier will be generated by using the 'category' and 'question_body' columns of the training data. This involves two steps:
1. Mine the data: ignore unhelpful words, characters, etc. in order to focus on the important content in the data.
2. Create a Recurrent Neural Network (RNN), whose connections form a directed graph along the sequence of inputs. In this case, the inputs are the data in the **training set**.


## Begin Mining Data
We begin by removing unhelpful content from each question and answer in the dataset.
The following steps are applied so that only useful information is kept:
1. remove URLs
2. convert uppercase letters to lowercase letters
3. remove tags
4. remove words containing possible errors (any token that contains a digit)
5. remove special characters
6. remove 'stop words'
7. stemming and lemmatization
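
As a rough, dependency-free illustration of steps 1 to 6 above, the sketch below applies them with only Python's `re` module. The function name `mine_text_sketch` and the tiny `SKETCH_STOP_WORDS` set are assumptions made for illustration; the project itself relies on NLTK's English stop word list and WordNet lemmatizer, and step 7 is omitted here.

```python
import re

# Hypothetical, deliberately tiny stop word set (the project uses NLTK's list).
SKETCH_STOP_WORDS = {"i", "a", "an", "the", "to", "is", "am", "that", "and", "of"}

def mine_text_sketch(text):
    """Apply steps 1-6 of the cleaning pipeline to one string."""
    text = re.sub(r"http\S+", "", text)     # 1. remove URLs
    text = text.lower()                     # 2. convert to lowercase
    text = re.sub(r"<.*?>", " ", text)      # 3. remove tags
    text = re.sub(r"\S*\d\S*", "", text)    # 4. drop tokens containing digits
    text = re.sub(r"[^a-z]+", " ", text)    # 5. replace special characters with spaces
    words = [w for w in text.split() if w not in SKETCH_STOP_WORDS]  # 6. remove stop words
    return " ".join(words)

sample = "I am migrating <b>a site</b> to WordPress, see http://example.com/page1 (v2)!"
cleaned = mine_text_sketch(sample)
print(cleaned)  # migrating site wordpress see
```

Note that step 4 is a blunt heuristic: it discards every token that contains a digit, such as `(v2)!` in the sample, whether or not it is actually an error.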

In [76]:
import nltk

# will be used to remove the stop words
from nltk.corpus import stopwords
nltk.download('stopwords')
nltk.download('punkt')
from nltk.tokenize import word_tokenize

# will be used for stemming and lemmatization
nltk.download('wordnet')
from nltk.stem.wordnet import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

[nltk_data] Downloading package stopwords to /home/nour/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /home/nour/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /home/nour/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


**Print a random Q&A in the training set before being processed.**

In [77]:
print("RANDOM QUESTION BODY W/OUT MINING")
print("---------------------------------------------------------------")
print(trainingSet['question_body'].values[344])

print("RANDOM ANSWER W/OUT MINING")
print("---------------------------------------------------------------")
print(trainingSet['answer'].values[344])

RANDOM QUESTION BODY W/OUT MINING
---------------------------------------------------------------
I have a site that I am migrating to WordPress, and I have a need to add properties that each of the users can edit (e.g., Address, City, State, Business Name, etc), along with some properties that Administrators can edit (IsActive, CanEmail) that wouldn't be displayed to the user.  In addition, I need to be able to display the properties in a table (similar to how the plugin, "Members List", displays, but with the custom fields displaying as well.

Given these requirements, I had attempted to use a combination of "Cimy User Extra Fields" and "Members List", but the members list grid did not have an option to display the extra fields created by the other plugin.

How would you recommend I approach this?

EDIT: 

So I guess the crux of my question is, what is the preferred method to add properties to the User?

RANDOM ANSWER W/OUT MINING
-----------------------------------------------------

**Mining Function**
<br />Clean the text of each question and answer as follows:
<br />1. Remove web links
<br />2. Remove tags
<br />3. Convert upper-case letters to lower-case
<br />4. Remove words that may contain typos (any token containing a digit)
<br />5. Replace special characters with spaces
<br />6. Remove stop words
<br />7. Lemmatize the words

In [86]:
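# Build the stop word set once; calling stopwords.words() for every word is slow.
stopWords = set(stopwords.words('english'))

def mineText(text):
    text = re.sub(r'http\S+', '', text)         # remove web links
    text = re.sub(r'<.*?>', ' ', text)          # remove tags
    text = text.lower()                         # convert to lower-case
    text = re.sub(r"\S*\d\S*", "", text)        # remove tokens containing digits
    text = re.sub('[^A-Za-z0-9]+', ' ', text)   # replace special characters with a space
    text = ' '.join([word for word in text.split(' ') if word not in stopWords])  # remove stop words
    text = ' '.join([lemmatizer.lemmatize(word) for word in text.split(' ')])     # lemmatize each word
    return text

trainingSet['question_body'] = trainingSet['question_body'].apply(mineText)
trainingSet['answer'] = trainingSet['answer'].apply(mineText)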

In [80]:
print("RANDOM QUESTION BODY AFTER MINING")
print("---------------------------------------------------------------")
print(trainingSet['question_body'].values[344])

print("\nRANDOM ANSWER AFTER MINING")
print("---------------------------------------------------------------")
print(trainingSet['answer'].values[344])

RANDOM QUESTION BODY AFTER MINING
---------------------------------------------------------------
site migrating wordpress need add property user edit e g address city state business name etc along property administrator edit isactive canemail displayed user addition need able display property table similar plugin member list display custom field displaying well given requirement attempted use combination cimy user extra field member list member list grid option display extra field created plugin would recommend approach edit guess crux question preferred method add property user 

RANDOM ANSWER AFTER MINING
---------------------------------------------------------------
answer first part question put class ttt user profile addon github class offer simple interface add field profile page added example checkbox subclass code initialize per function php work plugin course build placeholder add separate filter markup input value make extending class easier set custom capability showing sa

## Creating a Recurrent Neural Network (RNN)

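In this project, we will generate a simple model composed of the following layers:
<br /> 1. Input: a sequence of `maxLength` integer word indices per question.
<br /> 2. Embedding: maps each word index to a dense vector. We will do this in order to have space for more semantic nuances in sentences.
<br /> 3. Bidirectional RNN: a simple RNN that reads the sequence both forwards and backwards.
<br /> 4. Global Max Pooling: keeps the maximum value of each feature across all time steps.
<br /> 5. Dense Layer: 16 units with ReLU activation.
<br /> 6. Dense Layer: 5 units with softmax activation, producing the category probabilities.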
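
To make layer 4 concrete: global max pooling collapses the per-time-step outputs of the recurrent layer into one vector by keeping, for each feature, the largest value seen at any time step. The sketch below shows this on a toy sequence with 3 time steps and 2 features in plain Python; `global_max_pool` is a hypothetical helper written for illustration, not the Keras layer itself.

```python
def global_max_pool(sequence):
    """sequence: list of time steps, each a list of feature values.
    Returns one value per feature: the maximum over all time steps."""
    return [max(feature_values) for feature_values in zip(*sequence)]

# Toy RNN output: 3 time steps, 2 features per step.
rnn_output = [
    [0.1, -0.4],
    [0.7,  0.2],
    [0.3,  0.9],
]
pooled = global_max_pool(rnn_output)
print(pooled)  # [0.7, 0.9]
```

A sequence of shape (time steps, features) therefore becomes a single vector of length features, regardless of where in the question the strongest activation occurred.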

In [30]:
# parameters used for typical embeddings
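maxLength = 1000      # number of word indices kept per question
maxFeatures = 5000    # vocabulary size
embeddingSize = 768   # dimension of each word vector

inp = Input(shape=(maxLength,)) # each sample is a sequence of maxLength word indices

z = Embedding(maxFeatures,embeddingSize,input_length = maxLength)(inp)
z = Bidirectional(SimpleRNN(60,return_sequences=True))(z) # return_sequences takes a boolean, not the string 'True'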
z = GlobalMaxPool1D()(z)
z = Dense(16,activation='relu')(z)
z = Dense(5,activation='softmax')(z)

model = Model(inputs=inp,outputs=z)
model.compile(loss='sparse_categorical_crossentropy',optimizer='adam',metrics=['accuracy'])
model.summary()


Model: "model_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_2 (InputLayer)         [(None, 1000)]            0         
_________________________________________________________________
embedding_1 (Embedding)      (None, 1000, 768)         3840000   
_________________________________________________________________
bidirectional_1 (Bidirection (None, 1000, 120)         99480     
_________________________________________________________________
global_max_pooling1d_1 (Glob (None, 120)               0         
_________________________________________________________________
dense_2 (Dense)              (None, 16)                1936      
_________________________________________________________________
dense_3 (Dense)              (None, 5)                 85        
Total params: 3,941,501
Trainable params: 3,941,501
Non-trainable params: 0
_________________________________________________
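
The parameter counts in the summary above can be reproduced by hand, which is a useful sanity check on the layer sizes. The sketch below assumes the standard formulas for these layer types: an embedding table of vocabulary size times embedding size, a simple RNN with input weights, recurrent weights, and a bias per unit, and dense layers with a weight matrix plus one bias per output. The variable names are chosen for illustration.

```python
max_features = 5000    # vocabulary size
embedding_size = 768   # embedding dimension
rnn_units = 60         # SimpleRNN units per direction

embedding_params = max_features * embedding_size
# One SimpleRNN direction: (input weights + recurrent weights + bias) per unit.
rnn_one_direction = rnn_units * (embedding_size + rnn_units + 1)
bidirectional_params = 2 * rnn_one_direction
# Dense layers: inputs * outputs + one bias per output.
dense_16_params = (2 * rnn_units) * 16 + 16
dense_5_params = 16 * 5 + 5

total = embedding_params + bidirectional_params + dense_16_params + dense_5_params
print(embedding_params, bidirectional_params, dense_16_params, dense_5_params, total)
```

The results match the summary (3,840,000, 99,480, 1,936, and 85, for a total of 3,941,501), and they show that about 97% of the parameters sit in the embedding table.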

In [31]:
yLabel = LabelEncoder() # label encoder
labels = yLabel.fit_transform(trainingSet['category']) # encode each category as an integer label

yTrain = labels # the integer category labels are the y (target) values
# split 'question_body' and the labels into training and validation parts; test_size=0.25 holds out 25% of the rows, random_state fixes the shuffle
xTrain,xTest,yTrain, yTest = train_test_split(trainingSet['question_body'], yTrain, test_size=0.25,random_state=30)

tokenizer = Tokenizer(num_words = maxFeatures) # limit the vocabulary to the most frequent words (at most maxFeatures)
tokenizer.fit_on_texts(list(xTrain)) # build the word index from the training questions only

xVal = xTest # set xVal to xTest
xVal = tokenizer.texts_to_sequences(xVal) # transform xVal into sequence of integers

xTrain = tokenizer.texts_to_sequences(xTrain) # transform xTrain into sequence of integers

xTrain = pad_sequences(xTrain, maxlen = maxLength) # pad sequence so that the vectors can have the same lengths
xVal   = pad_sequences(xVal, maxlen = maxLength) # pad sequence so that the vectors can have the same lengths

yVal = yTest

print("PADDED AND TOKENIZED SEQUENCES, ALL VECTORS HAVE THE SAME LENGTH")
print("----------------------------------------------------------------")

print("xTrain Sequence, Padded and Tokenized: ", xTrain.shape)
print("yTrain                               : ", yTrain.shape)

print("xVal Sequence, Padded and Tokenized  : ", xVal.shape)
print("yVal                                 : ", yVal.shape)


model.fit(xTrain, yTrain, batch_size=128, epochs=10, verbose=2,validation_data = (xVal,yVal))

PADDED AND TOKENIZED SEQUENCES, ALL VECTORS HAVE THE SAME LENGTH
----------------------------------------------------------------
xTrain Sequence, Padded and Tokenized:  (4559, 1000)
yTrain                               :  (4559,)
xVal Sequence, Padded and Tokenized  :  (1520, 1000)
yVal                                 :  (1520,)
Epoch 1/10
36/36 - 54s - loss: 1.3087 - accuracy: 0.4350 - val_loss: 1.0614 - val_accuracy: 0.5197
Epoch 2/10
36/36 - 62s - loss: 0.7407 - accuracy: 0.7951 - val_loss: 0.6744 - val_accuracy: 0.8263
Epoch 3/10
36/36 - 64s - loss: 0.3239 - accuracy: 0.9566 - val_loss: 0.4638 - val_accuracy: 0.8500
Epoch 4/10
36/36 - 65s - loss: 0.1169 - accuracy: 0.9868 - val_loss: 0.4138 - val_accuracy: 0.8546
Epoch 5/10
36/36 - 64s - loss: 0.0519 - accuracy: 0.9945 - val_loss: 0.4162 - val_accuracy: 0.8605
Epoch 6/10
36/36 - 65s - loss: 0.0234 - accuracy: 0.9987 - val_loss: 0.4310 - val_accuracy: 0.8599
Epoch 7/10
36/36 - 64s - loss: 0.0153 - accuracy: 0.9982 - val_loss: 0.491

<tensorflow.python.keras.callbacks.History at 0x7f2bc1b00040>
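
For readers unfamiliar with the tokenizing and padding calls used above, the following plain-Python sketch mimics what they do on a toy corpus: words are ranked by frequency and replaced by integer indices starting at 1, then every sequence is left-padded with zeros (or truncated from the front) to a fixed length. This mirrors the default behavior of the Keras `Tokenizer` and `pad_sequences` as we understand it. The helper names `fit_word_index`, `texts_to_ids`, and `pad_left` are hypothetical, and the sketch ignores details such as punctuation filtering and the `num_words` cutoff.

```python
from collections import Counter

def fit_word_index(texts):
    """Map each word to an integer, most frequent word first, starting at 1."""
    counts = Counter(word for text in texts for word in text.split())
    ranked = sorted(counts.items(), key=lambda item: -item[1])  # stable for ties
    return {word: rank for rank, (word, _) in enumerate(ranked, start=1)}

def texts_to_ids(texts, word_index):
    """Replace known words by their index; unknown words are skipped."""
    return [[word_index[w] for w in text.split() if w in word_index] for text in texts]

def pad_left(sequences, max_len):
    """Left-pad with zeros, keeping only the last max_len tokens of long sequences."""
    padded = []
    for seq in sequences:
        seq = seq[-max_len:]
        padded.append([0] * (max_len - len(seq)) + seq)
    return padded

train_texts = ["add property user", "display property table user", "add field"]
word_index = fit_word_index(train_texts)
train_ids = pad_left(texts_to_ids(train_texts, word_index), max_len=3)
val_ids = pad_left(texts_to_ids(["add unknown table"], word_index), max_len=3)
print(word_index)
print(train_ids)  # [[1, 2, 3], [2, 5, 3], [0, 1, 6]]
print(val_ids)    # [[0, 1, 5]]
```

As in the notebook, the word index is fitted on the training texts only, so a validation word that never appears in training (here `unknown`) receives no index and is dropped.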

**Embedding Matrix**

In [129]:
model = hub.KerasLayer("https://tfhub.dev/google/universal-sentence-encoder-large/5?tf-hub-format=compressed")
trainingSet['question_title'] = trainingSet['question_title'].apply(mineText)

In [125]:
questionEmbedding = [model([trainingSet.iloc[i].question_title])[0] for i in range(trainingSet.shape[0])]

**Testing the Semantic Similarity Process**

In [126]:
def sentimentSimilarity(query):
    queryEmbedding = model([query])
    similarity = cosine_similarity(questionEmbedding, queryEmbedding)
    similarityVal = [similarity[i][0] for i in range(similarity.shape[0])]
    return np.argmax(similarityVal)

In [130]:
question = 'I am having an issue with accessing different OpenCV libraries. What are some functions that may help solve my issue?'

In [131]:
print('SAMPLE QUESTION')
print("-----------------------------")
print(question)

print('\nPREPROCESSED QUESTION')
print("-----------------------------")
question = mineText(question)
print(question)

predictedQuery = sentimentSimilarity(question)

print('\nSEMANTIC SIMILARITY')
print("-----------------------------")
print(trainingSet.iloc[predictedQuery].question_title)

SAMPLE QUESTION
-----------------------------
I am having an issue with accessing different OpenCV libraries. What are some functions that may help solve my issue?

PREPROCESSED QUESTION
-----------------------------
issue accessing different opencv library function may help solve issue 
(6079, 1)

SEMANTIC SIMILARITY
-----------------------------
change default checkerboard blocksize opencv
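
The lookup above boils down to computing the cosine similarity between the query embedding and every stored question embedding, then returning the index of the largest value. A minimal plain-Python sketch of that idea follows, using made-up 3-dimensional vectors in place of the real sentence embeddings; `cosine` and `most_similar` are hypothetical helper names.

```python
import math

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def most_similar(stored, query):
    """Return the index of the stored vector closest in direction to the query."""
    scores = [cosine(vec, query) for vec in stored]
    return max(range(len(scores)), key=scores.__getitem__)

# Made-up embeddings standing in for three question titles.
stored_embeddings = [
    [1.0, 0.0, 0.0],
    [0.6, 0.8, 0.0],
    [0.0, 0.0, 1.0],
]
query_embedding = [0.5, 0.9, 0.1]
best = most_similar(stored_embeddings, query_embedding)
print(best)  # 1
```

Because cosine similarity ignores vector length, only the direction of the embeddings matters when ranking the stored questions.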
