# Colab for Feature Engineering #

---


## To Do: ##
1. Separate the Duplicates and Non-Duplicates
2. Copy the DataFrames into new variables
3. Process the data to extract each feature
4. Plot a histogram to observe the distribution of each feature


---


## Features ##

1. Number of unique words which occur in both q1 and q2
2. Ratio of common words / total words (q1 + q2)
3. Common Word Ratio min (common words / min(len(q1), len(q2)))
4. Common Word Ratio max (common words / max(len(q1), len(q2)))
5. Common Stop Words min (common stop words / min(len(q1), len(q2)))
6. Common Stop Words max (common stop words / max(len(q1), len(q2)))
7. Common Tokens min (common tokens / min(len(q1), len(q2)))
8. Common Tokens max (common tokens / max(len(q1), len(q2)))
9. Common Adjectives min (common adjectives / min(len(q1), len(q2)))
10. Common Adjectives max (common adjectives / max(len(q1), len(q2)))
11. Common Noun min (common nouns / min(len(q1), len(q2)))
12. Common Noun max (common nouns / max(len(q1), len(q2)))
13. Fuzz Ratio
14. Fuzz Partial Ratio
15. Fuzz Token Sort Ratio
16. Fuzz Token Set Ratio
17. Mean Length of the 2 questions
18. Ratio of Length of Questions (len(q1) / len(q2))
19. Absolute Length Difference (|len(q1) - len(q2)|)
20. Longest Matching Substring min (longest substring / min(len(q1), len(q2)))
21. Longest Matching Substring max (longest substring / max(len(q1), len(q2)))
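
The word-overlap features in the list above share one pattern: count the common items, then normalise by the total, the shorter, or the longer question. A minimal sketch of that pattern follows. It assumes whitespace splitting, and `word_overlap_features` is an illustrative helper rather than one of the notebook's own functions; returning 0.0 for empty questions is also an assumption.

```python
def word_overlap_features(q1: str, q2: str) -> dict:
    """Word-overlap ratios for one question pair (illustrative helper)."""
    words1 = set(q1.lower().split())
    words2 = set(q2.lower().split())
    common = len(words1 & words2)
    shorter = min(len(words1), len(words2))
    longer = max(len(words1), len(words2))
    total = len(words1) + len(words2)
    return {
        "common_words": common,
        # Guard every ratio so an empty question yields 0.0 instead of an error
        "common_ratio_total": common / total if total else 0.0,
        "common_ratio_min": common / shorter if shorter else 0.0,
        "common_ratio_max": common / longer if longer else 0.0,
    }


features = word_overlap_features("how do i learn python",
                                 "how can i learn python fast")
print(features)
```

The min variant rewards a short question that is fully contained in a longer one, while the max variant penalises the extra words, which is why both are kept as separate features.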



Download your required libraries here

In [None]:
!pip install bs4
!pip install fuzzywuzzy
!pip install TextBlob
!pip install pickle5
!python -m spacy download en_core_web_lg
!pip install keras==2.6.0

Collecting fuzzywuzzy
  Downloading fuzzywuzzy-0.18.0-py2.py3-none-any.whl (18 kB)
Installing collected packages: fuzzywuzzy
Successfully installed fuzzywuzzy-0.18.0
Collecting pickle5
  Downloading pickle5-0.0.12-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.whl (256 kB)
Installing collected packages: pickle5
Successfully installed pickle5-0.0.12
Collecting en_core_web_lg==2.2.5
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-2.2.5/en_core_web_lg-2.2.5.tar.gz (827.9 MB)
Building wheels for collected packages: en-core-web-lg
  Building wheel for en-core-web-lg (setup.py) ... done
  Created wheel for en-core-web-lg: filename=en_core_web_lg-2.2.5-py3-none-any.whl size=829180942 sha256=9d50f1c826803fde486e554edc26b13f7b47017066850fc05b27e8336cc8f61c
  Stored in directory: /tmp/pip-ephem-wheel-cache-adgdwep

Import your required libraries here

In [None]:
import pandas as pd
import numpy as np
import re
from bs4 import BeautifulSoup
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from keras.preprocessing.text import Tokenizer
import nltk
from fuzzywuzzy import fuzz
from difflib import SequenceMatcher #For finding longest substring
from textblob import TextBlob
from sklearn.feature_extraction.text import TfidfVectorizer
import spacy
import en_core_web_lg
import pickle5
from keras.models import load_model
from keras import backend as K
from keras.preprocessing.sequence import pad_sequences
nlp = spacy.load('en_core_web_lg')
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger') # for pos tagging
from tqdm import tqdm_notebook
from scipy.spatial.distance import cosine
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score, confusion_matrix



[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


Mount the Google Drive folder that holds the dataset in this Colab session

In [None]:
from google.colab import drive
drive.mount('/content/drive')
%cd "/content/drive/MyDrive/CS3244 45 Project/RNN Models"

Mounted at /content/drive
/content/drive/.shortcut-targets-by-id/1ixE_YVbTLblbUJgpPcDWfO-zlQeoNGq4/CS3244 45 Project/RNN Models


In [None]:
#Loading the tokenizer
with open('../tokenizer.pickle', 'rb') as saved_tokenizer:
    tokenizer = pickle5.load(saved_tokenizer)

def exponent_neg_manhattan_distance(left, right):
    return K.exp(-K.sum(K.abs(left-right), axis=1, keepdims=True))

model = load_model('ys.h5')
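
The helper `exponent_neg_manhattan_distance` in the cell above maps the L1 (Manhattan) distance between two vectors to a score in (0, 1]: identical vectors give 1, and the score decays towards 0 as the vectors move apart. A plain-Python sketch of the same formula, using a hypothetical function name and toy vectors, makes the behaviour easy to check without Keras:

```python
import math


def exp_neg_manhattan(left, right):
    """exp(-L1 distance) between two equal-length vectors (illustrative analogue)."""
    l1_distance = sum(abs(a - b) for a, b in zip(left, right))
    return math.exp(-l1_distance)


same = exp_neg_manhattan([0.2, 0.5], [0.2, 0.5])   # identical vectors
far = exp_neg_manhattan([0.0, 0.0], [3.0, 4.0])    # L1 distance of 7
print(same, far)
```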



In [None]:
model_initial = load_model('baseline_3')

## **Preprocess the questions**

In [None]:
# This function accepts a question, preprocesses it, and returns the cleaned question.
# This section of code references Sourab Vadlamani's work "Quora Question Pairs Similarity, Tackling a real life NLP problem"
# https://towardsdatascience.com/quora-question-pairs-similarity-tackling-a-real-life-nlp-problem-ab55c5da2e84

def preprocess(q):
  # Firstly, we convert to lowercase and remove trailing and leading spaces
  q = str(q).lower().strip()

  # Replace certain special characters with their string equivalents
  q = q.replace('%', ' percent')
  q = q.replace('$', ' dollar ')
  q = q.replace('₹', ' rupee ')
  q = q.replace('€', ' euro ')
  q = q.replace('@', ' at ')

  # The pattern '[math]' appears around 900 times in the whole dataset.
  q = q.replace('[math]', '')

  # Replacing some numbers with string equivalents (not perfect, can be done better to account for more cases)
  q = q.replace(',000,000,000 ', 'b ')
  q = q.replace(',000,000 ', 'm ')
  q = q.replace(',000 ', 'k ')
  q = re.sub(r'([0-9]+)000000000', r'\1b', q)
  q = re.sub(r'([0-9]+)000000', r'\1m', q)
  q = re.sub(r'([0-9]+)000', r'\1k', q)

  # Decontracting words
  # https://en.wikipedia.org/wiki/Wikipedia%3aList_of_English_contractions
  # https://stackoverflow.com/a/19794953
  contractions = { 
    "ain't": "am not",
    "aren't": "are not",
    "can't": "can not",
    "can't've": "can not have",
    "'cause": "because",
    "could've": "could have",
    "couldn't": "could not",
    "couldn't've": "could not have",
    "didn't": "did not",
    "doesn't": "does not",
    "don't": "do not",
    "hadn't": "had not",
    "hadn't've": "had not have",
    "hasn't": "has not",
    "haven't": "have not",
    "he'd": "he would",
    "he'd've": "he would have",
    "he'll": "he will",
    "he'll've": "he will have",
    "he's": "he is",
    "how'd": "how did",
    "how'd'y": "how do you",
    "how'll": "how will",
    "how's": "how is",
    "i'd": "i would",
    "i'd've": "i would have",
    "i'll": "i will",
    "i'll've": "i will have",
    "i'm": "i am",
    "i've": "i have",
    "isn't": "is not",
    "it'd": "it would",
    "it'd've": "it would have",
    "it'll": "it will",
    "it'll've": "it will have",
    "it's": "it is",
    "let's": "let us",
    "ma'am": "madam",
    "mayn't": "may not",
    "might've": "might have",
    "mightn't": "might not",
    "mightn't've": "might not have",
    "must've": "must have",
    "mustn't": "must not",
    "mustn't've": "must not have",
    "needn't": "need not",
    "needn't've": "need not have",
    "o'clock": "of the clock",
    "oughtn't": "ought not",
    "oughtn't've": "ought not have",
    "shan't": "shall not",
    "sha'n't": "shall not",
    "shan't've": "shall not have",
    "she'd": "she would",
    "she'd've": "she would have",
    "she'll": "she will",
    "she'll've": "she will have",
    "she's": "she is",
    "should've": "should have",
    "shouldn't": "should not",
    "shouldn't've": "should not have",
    "so've": "so have",
    "so's": "so as",
    "that'd": "that would",
    "that'd've": "that would have",
    "that's": "that is",
    "there'd": "there would",
    "there'd've": "there would have",
    "there's": "there is",
    "they'd": "they would",
    "they'd've": "they would have",
    "they'll": "they will",
    "they'll've": "they will have",
    "they're": "they are",
    "they've": "they have",
    "to've": "to have",
    "wasn't": "was not",
    "we'd": "we would",
    "we'd've": "we would have",
    "we'll": "we will",
    "we'll've": "we will have",
    "we're": "we are",
    "we've": "we have",
    "weren't": "were not",
    "what'll": "what will",
    "what'll've": "what will have",
    "what're": "what are",
    "what's": "what is",
    "what've": "what have",
    "when's": "when is",
    "when've": "when have",
    "where'd": "where did",
    "where's": "where is",
    "where've": "where have",
    "who'll": "who will",
    "who'll've": "who will have",
    "who's": "who is",
    "who've": "who have",
    "why's": "why is",
    "why've": "why have",
    "will've": "will have",
    "won't": "will not",
    "won't've": "will not have",
    "would've": "would have",
    "wouldn't": "would not",
    "wouldn't've": "would not have",
    "y'all": "you all",
    "y'all'd": "you all would",
    "y'all'd've": "you all would have",
    "y'all're": "you all are",
    "y'all've": "you all have",
    "you'd": "you would",
    "you'd've": "you would have",
    "you'll": "you will",
    "you'll've": "you will have",
    "you're": "you are",
    "you've": "you have"
  }

  q_decontracted = []

  for word in q.split():
    if word in contractions:
      word = contractions[word]
  
    q_decontracted.append(word)

  q = ' '.join(q_decontracted)
  q = q.replace("'ve", " have")
  q = q.replace("n't", " not")
  q = q.replace("'re", " are")
  q = q.replace("'ll", " will")

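  # Removing HTML tags (name the parser explicitly to avoid the bs4 "no parser specified" warning)
  q = BeautifulSoup(q, 'html.parser')
  q = q.get_text()

  # Remove punctuation: every non-word character becomes a space
  pattern = re.compile(r'\W')
  q = re.sub(pattern, ' ', q).strip()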

  return q
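
To see what the number-shortening step of `preprocess` does on its own, the sketch below isolates those six replacements in a hypothetical helper, `abbreviate_numbers`, and runs them on a few made-up strings. The last example shows the limitation the code comment admits to: a year such as 2000 is shortened as well.

```python
import re


def abbreviate_numbers(q: str) -> str:
    """Apply only the number-shortening steps of preprocess() (hypothetical helper)."""
    # Comma-separated thousands, millions and billions followed by a space
    q = q.replace(',000,000,000 ', 'b ')
    q = q.replace(',000,000 ', 'm ')
    q = q.replace(',000 ', 'k ')
    # Plain digit runs ending in 9, 6 or 3 zeros, largest unit first
    q = re.sub(r'([0-9]+)000000000', r'\1b', q)
    q = re.sub(r'([0-9]+)000000', r'\1m', q)
    q = re.sub(r'([0-9]+)000', r'\1k', q)
    return q


samples = ("earn 5,000 a month", "1000000 users", "year 2000")
examples = [abbreviate_numbers(s) for s in samples]
for before, after in zip(samples, examples):
    print(f"{before!r} -> {after!r}")
```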

## **Jun An**

1. Ratio of Common Words (Common words / total words) (Done)
2. Ratio of Common Tokens (Common tokens/ max(q1, q2)) (Done)
3. Fuzz partial ratio (Done)
4. Longest Matching Substring Min (Done)

In [None]:
def num_common_words_ratio(row):
  set1 = set(row['question1'].lower().split())
  set2 = set(row['question2'].lower().split())
  total = len(set1) + len(set2)
  return len(set1.intersection(set2))/total

In [None]:
def common_tokens_ratio_max(row):
  q1 = set(word_tokenize(row['question1'].lower()))
  q2 = set(word_tokenize(row['question2'].lower()))
  stop_words = set(stopwords.words('english'))
  token1 = [word for word in q1 if word not in stop_words]
  token2 = [word for word in q2 if word not in stop_words]
  ratio = len(set(token1).intersection(set(token2))) / max(len(row['question1']), len(row['question2']))

  return ratio


In [None]:
def fuzz_partial_ratio(row):
  q1 = row['question1']
  q2 = row['question2']
  fuzz_partial = fuzz.partial_ratio(q1,q2)
  return fuzz_partial

In [None]:
def min_longest_substring(row):
  q1 = row['question1']
  q2 = row['question2']
  match = SequenceMatcher(None, q1, q2).find_longest_match(0, len(q1), 0, len(q2))
  return match.size/min(len(q1), len(q2))
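
The longest-matching-substring feature relies on `SequenceMatcher.find_longest_match` from the standard library, which works on characters when given strings. The sketch below uses a made-up question pair and a hypothetical helper name to show which substring is found and how the min and max normalisations differ.

```python
from difflib import SequenceMatcher


def longest_common_substring(q1: str, q2: str) -> str:
    """Return the longest run of characters shared by q1 and q2."""
    match = SequenceMatcher(None, q1, q2).find_longest_match(0, len(q1), 0, len(q2))
    return q1[match.a:match.a + match.size]


q1 = "how do i learn python"
q2 = "how can i learn python fast"
shared = longest_common_substring(q1, q2)

# Same numerator, different denominators (character lengths)
ratio_min = len(shared) / min(len(q1), len(q2))
ratio_max = len(shared) / max(len(q1), len(q2))
print(repr(shared), ratio_min, ratio_max)
```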

## **Penn Han**

1. Number of unique words that occur in q1 and q2
2. Ratio of Common Tokens to min(len(q1), len(q2))
3. Fuzz Ratio
4. Absolute Length Difference between q1 and q2
5. Mean TF-IDF value
6. Mean IDF-weighted vector

In [None]:
def unique_words_count(row):
  set1 = set(row['question1'].lower().split())
  set2 = set(row['question2'].lower().split())
  return len(set1.intersection(set2))

In [None]:
def common_token_ratio_min(row):
  q1 = set(word_tokenize(row['question1'].lower()))
  q2 = set(word_tokenize(row['question2'].lower()))
  stop_words = set(stopwords.words('english'))
  token1 = [word for word in q1 if word not in stop_words]
  token2 = [word for word in q2 if word not in stop_words]
  ratio = len(set(token1).intersection(set(token2))) / min(len(row['question1']), len(row['question2']))
  return ratio

In [None]:
def fuzz_ratio(row):
  q1 = row['question1']
  q2 = row['question2']
  fuzz_ratio = fuzz.ratio(q1,q2)
  return fuzz_ratio

In [None]:
def abs_len_difference(row):
  q1 = row['question1']
  q2 = row['question2']
  abs_len_diff = abs(len(q1) - len(q2))
  return abs_len_diff

In [None]:
# Stop words are not removed here. Use either this cell or the one below, not both.

#tf_idf_vectoriser = TfidfVectorizer(lowercase=True)
#q1_train_list = list(train_set['question1'])
#q2_train_list = list(train_set['question2'])
#question_corpus = list(q1_train_list + q2_train_list)
#tf_idf_vectoriser.fit(question_corpus)
#idf = dict(zip(tf_idf_vectoriser.get_feature_names(), tf_idf_vectoriser.idf_))  #For Weighted W2V
#nlp = en_core_web_lg.load()

In [None]:
def mean_tfidf_value_q1(row):
  q1 = word_tokenize(row['question1'].lower())
  stop_words = set(stopwords.words('english'))
  token1 = [word for word in q1 if word not in stop_words]
  if len(token1) > 0:
    q1_vector_matrix = tf_idf_vectoriser.transform(token1)  #Transform must take in a iterable so [str]
    return q1_vector_matrix  #Returns a sparse matrix
  else:
    return 0

In [None]:
def mean_tfidf_value_q2(row):
  q2 = word_tokenize(row['question2'].lower())
  stop_words = set(stopwords.words('english'))
  # Filter the tokens of question2 (the original referenced q1 and token1, which are undefined here)
  token2 = [word for word in q2 if word not in stop_words]
  if len(token2) > 0:
    q2_vector_matrix = tf_idf_vectoriser.transform(token2)  # transform() needs an iterable of strings
    return q2_vector_matrix  # Returns a sparse matrix
  else:
    return 0

In [None]:
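def calculate_weighted_vector(question):
    # Mean of the spaCy token vectors, each scaled by the token's IDF weight.
    # Tokens missing from the idf dict get a weight of 0.
    doc = nlp(question)
    mean_vec = np.zeros(nlp.vocab.vectors_length)
    if len(doc) == 0:
        return mean_vec  # empty question: avoids indexing doc[0] and dividing by zero
    for word in doc:
        idf_weight = idf.get(str(word), 0)
        mean_vec += word.vector * idf_weight
    mean_vec /= len(doc)
    return mean_vec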
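
The IDF weighting above can be checked by hand on a toy example. The sketch below is a plain-Python analogue with made-up two-dimensional vectors and IDF values; `idf_weighted_mean`, `toy_vectors` and `toy_idf` are illustrative names. Like the notebook function, it divides by the number of tokens rather than by the sum of the weights.

```python
def idf_weighted_mean(tokens, vectors, idf):
    """Average of per-token vectors, each scaled by its IDF weight (0 if unseen)."""
    if not tokens:
        return []
    dim = len(next(iter(vectors.values())))
    total = [0.0] * dim
    for tok in tokens:
        weight = idf.get(tok, 0.0)            # unseen tokens contribute nothing
        vec = vectors.get(tok, [0.0] * dim)   # unknown tokens map to a zero vector
        total = [t + weight * v for t, v in zip(total, vec)]
    return [t / len(tokens) for t in total]


toy_vectors = {"learn": [1.0, 0.0], "python": [0.0, 1.0], "the": [1.0, 1.0]}
toy_idf = {"learn": 2.0, "python": 4.0}      # "the" has no IDF entry
mean_vec = idf_weighted_mean(["the", "learn", "python"], toy_vectors, toy_idf)
print(mean_vec)
```

Rare words (high IDF) dominate the result, while words without an IDF entry still count towards the denominator and so pull the vector towards zero.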

In [None]:
def mean_idfweighted_vector_q1(row):
  idfweighted_vector_q1 = calculate_weighted_vector(row['question1'])
  return idfweighted_vector_q1

In [None]:
def mean_idfweighted_vector_q2(row):
  idfweighted_vector_q2 = calculate_weighted_vector(row['question2'])
  return idfweighted_vector_q2

In [None]:
#train_set["tfidf_matrix_q1"] = train_set.apply(mean_tfidf_value_q1, axis=1)
#train_set["tfidf_matrix_q2"] = train_set.apply(mean_tfidf_value_q2, axis=1)
#train_set["mean_idfweighted_vector_q1"] = train_set.apply(mean_idfweighted_vector_q1, axis=1)
#train_set["mean_idfweighted_vector_q2"] = train_set.apply(mean_idfweighted_vector_q2, axis=1)

## Jeremy
1. common stop words min
2. common noun min
3. mean length of 2 questions

In [None]:
stop_words = set(stopwords.words('english'))

In [None]:
def get_min_len_qn(row):
    return min(len(row['question1'].split()), len(row['question2'].split()))

In [None]:
def calc_common_stop_words_min(row):
    q1 = word_tokenize(row['question1'])
    q2 = word_tokenize(row['question2'])
    stop_words_q1 = set([x for x in q1 if x in stop_words])
    stop_words_q2 = set([x for x in q2 if x in stop_words])
    num_intersect = len(stop_words_q1.intersection(stop_words_q2))
    return num_intersect / get_min_len_qn(row)

In [None]:
def calc_common_nouns_min(row):
    q1_tokens = word_tokenize(row["question1"].lower())
    q2_tokens = word_tokenize(row["question2"].lower())
    pos_tagged_q1 = nltk.pos_tag(q1_tokens)
    pos_tagged_q2 = nltk.pos_tag(q2_tokens)
    # x[0] is the word, x[1] is the tag
    q1_nouns = set([x[0] for x in pos_tagged_q1 if x[1] == "NN"]) 
    q2_nouns = set([x[0] for x in pos_tagged_q2 if x[1] == "NN"])
    return len(q1_nouns.intersection(q2_nouns)) / get_min_len_qn(row)

In [None]:
def mean_len_qns(row):
    return (len(word_tokenize(row["question1"].lower())) + len(word_tokenize(row["question2"].lower()))) / 2

## Kay Chi
1. Common stop words max
2. Common noun max
3. Ratio of length of questions

In [None]:
def get_max_len_qn(row):
    return max(len(row['question1'].split()), len(row['question2'].split()))

In [None]:
def calc_common_stop_words_max(row):
    q1 = word_tokenize(row['question1'])
    q2 = word_tokenize(row['question2'])
    stop_words_q1 = set([x for x in q1 if x in stop_words])
    stop_words_q2 = set([x for x in q2 if x in stop_words])
    num_intersect = len(stop_words_q1.intersection(stop_words_q2))
    return num_intersect / get_max_len_qn(row)

In [None]:
def calc_common_nouns_max(row):
    q1_tokens = word_tokenize(row["question1"].lower())
    q2_tokens = word_tokenize(row["question2"].lower())
    pos_tagged_q1 = nltk.pos_tag(q1_tokens)
    pos_tagged_q2 = nltk.pos_tag(q2_tokens)
    # x[0] is the word, x[1] is the tag
    q1_nouns = set([x[0] for x in pos_tagged_q1 if x[1] == "NN"]) 
    q2_nouns = set([x[0] for x in pos_tagged_q2 if x[1] == "NN"])
    return len(q1_nouns.intersection(q2_nouns)) / get_max_len_qn(row)

In [None]:
def ratio_len_qn(row):
  q1 = row['question1']
  q2 = row['question2']
  return len(q1) / len(q2)
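
The three length features mix two units: in this notebook the mean length counts tokens, while the ratio and the absolute difference count characters. The sketch below puts them side by side on a made-up pair. It uses `str.split` as a stand-in for `word_tokenize`, and the 0.0 fallback for an empty second question is an assumption added here, not behaviour of `ratio_len_qn`.

```python
def length_features(q1: str, q2: str) -> dict:
    """Length-based features for one question pair (illustrative helper)."""
    chars1, chars2 = len(q1), len(q2)
    words1, words2 = len(q1.split()), len(q2.split())
    return {
        "mean_len_words": (words1 + words2) / 2,
        "abs_len_diff_chars": abs(chars1 - chars2),
        # Assumed fallback: avoid ZeroDivisionError when q2 is empty
        "ratio_len_chars": chars1 / chars2 if chars2 else 0.0,
    }


lengths = length_features("how do i learn python", "how can i learn python fast")
print(lengths)
```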

## YS
1. Common Word Ratio max ( words common/ max(len(q1), len(q2))) 
2. Common Adjectives max ( common adjectives /max(len(q1), len(q2)))
3. Fuzz Token Set Ratio 

In [None]:
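def common_word_ratio_max(row):
  # Compare sets of words; calling set() on the raw strings would compare characters instead
  set1 = set(row['question1'].split())
  set2 = set(row['question2'].split())
  return len(set1.intersection(set2)) / max(len(set1), len(set2))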

In [None]:
# This has been tested to be correct, but result seems off.
def get_adjectives(text):
  blob = TextBlob(text)
  return set(word for (word,tag) in blob.tags if tag.startswith("JJ"))
  
def common_adjectives_max(row):
  q1 = row['question1']
  q2 = row['question2']
  return len(get_adjectives(q1).intersection(get_adjectives(q2))) / max(len(q1), len(q2))

In [None]:
def calc_fuzz_token_set_ratio(row):
  q1 = row['question1']
  q2 = row['question2']
  return fuzz.token_set_ratio(q1, q2)

## **Neaton**

In [None]:
def common_words_ratio_min(row):
  set1 = set(row['question1'].lower().split())
  set2 = set(row['question2'].lower().split())
  common_words = len(set1.intersection(set2))
  return common_words/min(len(set1), len(set2))


In [None]:
# This has been tested to be correct, but result seems off.
def get_adjectives(text):
  blob = TextBlob(text)
  return set(word for (word,tag) in blob.tags if tag.startswith("JJ"))
  
def common_adjectives_min(row):
  q1 = row['question1']
  q2 = row['question2']
  return len(get_adjectives(q1).intersection(get_adjectives(q2))) / min(len(q1), len(q2))

In [None]:
def fuzz_token_sort_ratio(row):
  q1 = row['question1']
  q2 = row['question2']
  fuzz_token = fuzz.token_sort_ratio(q1,q2)
  return fuzz_token

In [None]:
def max_longest_substring(row):
  q1 = row['question1']
  q2 = row['question2']
  match = SequenceMatcher(None, q1, q2).find_longest_match(0, len(q1), 0, len(q2))
  return match.size/max(len(q1), len(q2))

## Loading GloVe Embedding

In [None]:
%cd '../'

/content/drive/.shortcut-targets-by-id/1ixE_YVbTLblbUJgpPcDWfO-zlQeoNGq4/CS3244 45 Project


In [None]:
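embeddings_index = {}
with open('glove.840B.300d.txt', encoding='utf-8') as f:
  for line in f:
    # Split from the right so the last 300 fields are always the vector;
    # a few tokens in glove.840B contain spaces themselves
    values = line.rstrip().rsplit(' ', 300)
    word = values[0]
    embedding = np.asarray(values[1:], dtype='float32')
    embeddings_index[word] = embedding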

In [None]:
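def sent2vec(s):
    # Sum the GloVe vectors of the alphabetic, non-stop-word tokens, then L2-normalise
    words = str(s).lower()
    words = word_tokenize(words)
    words = [w for w in words if w not in stop_words]
    words = [w for w in words if w.isalpha()]
    # Out-of-vocabulary words contribute a zero vector with the same shape, (300,)
    M = [embeddings_index.get(w, np.zeros(300)) for w in words]
    if not M:
        return np.zeros(300)  # no usable tokens
    v = np.array(M).sum(axis=0)
    norm = np.sqrt((v ** 2).sum())
    return v / norm if norm > 0 else v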
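
The sentence vectors built this way feed a cosine-distance feature later in the notebook. A minimal plain-Python sketch of the same two steps follows, using made-up two-dimensional embeddings; `sentence_vector`, `cosine_distance` and `toy_embeddings` are illustrative names. Returning a distance of 1.0 when either vector is all zeros is an assumption chosen for this sketch.

```python
import math


def sentence_vector(words, embeddings, dim):
    """Sum the word vectors (zeros for unknown words) and L2-normalise the sum."""
    total = [0.0] * dim
    for w in words:
        vec = embeddings.get(w, [0.0] * dim)
        total = [t + v for t, v in zip(total, vec)]
    norm = math.sqrt(sum(t * t for t in total))
    return [t / norm for t in total] if norm > 0 else total


def cosine_distance(u, v):
    """1 - cosine similarity; 1.0 if either vector is all zeros (assumed convention)."""
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 1.0
    dot = sum(a * b for a, b in zip(u, v))
    return 1.0 - dot / (norm_u * norm_v)


toy_embeddings = {"fear": [1.0, 0.0], "scared": [0.8, 0.6], "tax": [0.0, 1.0]}
v_fear = sentence_vector(["overcome", "fear"], toy_embeddings, dim=2)
v_scared = sentence_vector(["scared"], toy_embeddings, dim=2)
v_tax = sentence_vector(["tax"], toy_embeddings, dim=2)

near = cosine_distance(v_fear, v_scared)
far = cosine_distance(v_fear, v_tax)
print(near, far)
```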

In [None]:
def wrapper(train_set):

  #Preprocessing questions
  train_set['question1'] = train_set['question1'].apply(preprocess)
  train_set['question2'] = train_set['question2'].apply(preprocess)

  #JA
  train_set['common_words_ratio'] = train_set.apply(num_common_words_ratio, axis=1)
  train_set['common_tokens_ratio'] = train_set.apply(common_tokens_ratio_max, axis=1)
  train_set['fuzz_partial_ratio'] = train_set.apply(fuzz_partial_ratio, axis=1)
  train_set['min_longest_substring'] = train_set.apply(min_longest_substring, axis=1)

  #Penn Han
  train_set["unique_words_count"] = train_set.apply(unique_words_count, axis=1)
  train_set["common_token_ratio_min"] = train_set.apply(common_token_ratio_min, axis=1)
  train_set["fuzz_ratio"] = train_set.apply(fuzz_ratio, axis=1)
  train_set["abs_len_difference"] = train_set.apply(abs_len_difference, axis=1)

  #train_set["tfidf_matrix_q1"] = train_set.apply(mean_tfidf_value_q1, axis=1)
  #train_set["tfidf_matrix_q2"] = train_set.apply(mean_tfidf_value_q2, axis=1)
  #train_set["mean_idfweighted_vector_q1"] = train_set.apply(mean_idfweighted_vector_q1, axis=1)
  #train_set["mean_idfweighted_vector_q2"] = train_set.apply(mean_idfweighted_vector_q2, axis=1)

  #Jeremy
  train_set['common_stop_words_min'] = train_set.apply(calc_common_stop_words_min, axis=1)
  train_set['common_nouns_min'] = train_set.apply(calc_common_nouns_min, axis=1)
  train_set['mean_len'] = train_set.apply(mean_len_qns, axis=1)

  #KC
  train_set['common_stop_words_max'] = train_set.apply(calc_common_stop_words_max, axis=1)
  train_set['common_nouns_max'] = train_set.apply(calc_common_nouns_max, axis=1)
  train_set['ratio_len_qn'] = train_set.apply(ratio_len_qn, axis=1)

  #YS
  train_set['common_word_ratio_max'] = train_set.apply(common_word_ratio_max, axis=1)
  train_set['common_adjectives_max'] = train_set.apply(common_adjectives_max, axis=1)
  train_set['fuzz_token_set_ratio'] = train_set.apply(calc_fuzz_token_set_ratio, axis=1)

  #Neaton
  train_set['common_words_ratio_min'] = train_set.apply(common_words_ratio_min, axis=1)
  train_set['common_adjectives_min'] = train_set.apply(common_adjectives_min, axis=1)
  train_set['fuzz_token_sort_ratio'] = train_set.apply(fuzz_token_sort_ratio, axis=1)
  train_set['max_longest_substring'] = train_set.apply(max_longest_substring, axis=1)

  question1_vectors = np.zeros((train_set.shape[0], 300))
  for i, q in enumerate(tqdm_notebook(train_set.question1.values)):
    question1_vectors[i, :] = sent2vec(q)
    
  question2_vectors  = np.zeros((train_set.shape[0], 300))
  for i, q in enumerate(tqdm_notebook(train_set.question2.values)):
      question2_vectors[i, :] = sent2vec(q)
    
  train_set['embed_cos_dist'] = [cosine(x, y) for (x, y) in zip(np.nan_to_num(question1_vectors), np.nan_to_num(question2_vectors))]


  return train_set


In [None]:
train_set = pd.read_csv('features_with_word_embedding.csv', index_col=[0])

In [None]:
#features = wrapper(train_set)

In [None]:
SEED = 42
TRAIN_TEST = 0.1
MAX_WORDS = 20000
MAX_SEQUENCE = 25

Y_labels = train_set["is_duplicate"]
X_features = train_set.drop("is_duplicate", axis=1)

X_train, X_test, y_train, y_test = train_test_split(X_features, Y_labels, test_size=TRAIN_TEST, random_state=SEED)

In [None]:
q1_test = X_test['question1'].astype(str)
q2_test = X_test['question2'].astype(str)
q1_train = X_train['question1'].astype(str)
q2_train = X_train['question2'].astype(str)

In [None]:
X_test.drop(['question1', 'question2', 'qid1', 'qid2'], axis=1, inplace=True)

In [None]:
#dic = {'question1':["How to overcome fear"], "question2": ["How not to be scared"] }

#train_set = pd.DataFrame(dic)

In [None]:
#final_features = features.drop(['question1', 'question2'], axis=1)

In [None]:
id = X_test.reset_index()['id']

In [None]:
MAX_SEQUENCE = 25
questions = q1_train.tolist() + q2_train.tolist()
tokenizer = Tokenizer(num_words=MAX_WORDS)
tokenizer.fit_on_texts(questions)

question1_token = tokenizer.texts_to_sequences(q1_test.tolist())
question2_token = tokenizer.texts_to_sequences(q2_test.tolist())

q1_prepared = pad_sequences(question1_token, maxlen=MAX_SEQUENCE)
q2_prepared = pad_sequences(question2_token, maxlen=MAX_SEQUENCE)
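
With its default arguments, `pad_sequences` pads on the left with zeros and, for sequences longer than `maxlen`, drops tokens from the front. That is why the prepared arrays start with runs of 0. The sketch below reproduces that default behaviour for a single sequence in plain Python; `pad_pre` is an illustrative name, not a Keras function.

```python
def pad_pre(sequence, maxlen, value=0):
    """Left-pad or left-truncate a list of token ids to exactly maxlen entries."""
    trimmed = sequence[-maxlen:]                      # keep the last maxlen tokens
    return [value] * (maxlen - len(trimmed)) + trimmed


short = pad_pre([44, 6, 13783], maxlen=5)
long = pad_pre([1, 2, 3, 4, 5, 6, 7], maxlen=5)
print(short, long)
```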

In [None]:
q1_prepared

array([[    0,     0,     0, ...,    44,     6, 13783],
       [    0,     0,     0, ...,  4489,     7,   935],
       [    0,     0,     0, ...,    12,   851,   118],
       ...,
       [    0,     0,     0, ...,  5396,    87,   342],
       [  602,   113,   180, ...,     4,  1453,    59],
       [    0,     0,     0, ...,   235,   299,   184]], dtype=int32)

## Final RNN Model

In [None]:
incorrects = model.predict([q1_prepared, q2_prepared, X_test], verbose=1)



In [None]:
incorrects[incorrects > 0.5] = 1
incorrects[incorrects <= 0.5] = 0
flattened_incorrect = incorrects.flatten()

In [None]:
y_pred = flattened_incorrect

In [None]:
final_df = pd.DataFrame(id)
final_df['Actual'] = y_test.values
final_df['Pred'] = y_pred.astype(int)

In [None]:
wrong_class = final_df[final_df['Actual'] != final_df['Pred']]

## Deep Dive into the False Positives for the Tuned Model

In [None]:
wrong_class

Unnamed: 0,id,Actual,Pred
0,162455,0,1
2,158538,0,1
10,257581,1,0
11,145274,0,1
13,333020,0,1
...,...,...,...
40391,195572,0,1
40393,366331,0,1
40395,96642,1,0
40408,401326,1,0


In [None]:
false_positives = wrong_class[wrong_class['Actual'] == 0]
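
The thresholding and filtering steps above amount to: turn each predicted probability into a 0/1 label at 0.5, then keep the rows where the label disagrees with the ground truth. A small sketch with made-up labels and probabilities shows the split into false positives and false negatives; `classify_errors` is an illustrative helper.

```python
def classify_errors(actual, probabilities, threshold=0.5):
    """Return predicted labels plus the positions of false positives and false negatives."""
    # Strictly greater than the threshold counts as a duplicate, as in the notebook
    predicted = [1 if p > threshold else 0 for p in probabilities]
    pairs = list(zip(actual, predicted))
    false_pos = [i for i, (a, p) in enumerate(pairs) if a == 0 and p == 1]
    false_neg = [i for i, (a, p) in enumerate(pairs) if a == 1 and p == 0]
    return predicted, false_pos, false_neg


actual = [0, 1, 1, 0, 0]
probs = [0.9, 0.8, 0.3, 0.1, 0.5]
predicted, false_pos, false_neg = classify_errors(actual, probs)
print(predicted, false_pos, false_neg)
```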

In [None]:
false_positives

Unnamed: 0,id,Actual,Pred
0,162455,0,1
2,158538,0,1
11,145274,0,1
13,333020,0,1
42,145652,0,1
...,...,...,...
40368,77275,0,1
40371,81291,0,1
40391,195572,0,1
40393,366331,0,1


In [None]:
false_positives_id = false_positives['id'].values

In [None]:
fp_df = q1_test.reset_index()
fp_df['question2'] = q2_test.reset_index()['question2']
fp_df['Actual'] = y_test.values
fp_df['Pred'] = y_pred.astype(int)

In [None]:
fp_df = fp_df[fp_df.id.isin(false_positives_id)]

In [None]:
# Renumber the rows from 0 and discard the old index in one step
fp_df = fp_df.reset_index(drop=True)

In [None]:
fp_df

Unnamed: 0,id,question1,question2,Actual,Pred
0,162455,how good a phil barone saxophones,what are phil barone saxophones,0,1
1,158538,how do i learn and master things,how can i learn mastering music,0,1
2,145274,who won the second presidential debate trump ...,in your opinion who won or performed better ...,0,1
3,333020,why is the first 20 minutes usually red colour...,why is the first 20 minutes usually red colour...,0,1
4,145652,how do i get rid of my addiction to facebook,what is the best way to get rid of addictions ...,0,1
...,...,...,...,...,...
3947,77275,in a world where everyone goes around naked h...,what if everyone in the world yelled at the sa...,0,1
3948,81291,what are examples of long term goals,what are some examples of long term and short ...,0,1
3949,195572,how do i turn off 2 step verification on my gm...,how do i recover my gmail account password wit...,0,1
3950,366331,when will avicii release his new album,will avicii release his new album in 2016,0,1


In [None]:
fp_df.loc[0:3]

Unnamed: 0,id,question1,question2,Actual,Pred
0,162455,how good a phil barone saxophones,what are phil barone saxophones,0,1
1,158538,how do i learn and master things,how can i learn mastering music,0,1
2,145274,who won the second presidential debate trump ...,in your opinion who won or performed better ...,0,1
3,333020,why is the first 20 minutes usually red colour...,why is the first 20 minutes usually red colour...,0,1
