# Feature engineering for semantic analysis
Natalie Ho | April 2017

This notebook explores a variety of approaches to common natural language processing (NLP) problems. These techniques will be explained and applied in context of feature engineering a Quora dataset.

I hope that by the end of this notebook, you'll gain familiarity with standard practices as well as recent methods used for NLP tasks. The ever-evolving field has a range of applications from [information retrieval](https://cloud.google.com/natural-language/) to [AI](https://www.ibm.com/developerworks/library/os-ind-watson/), and is well worth a [deeper](https://www.ibm.com/watson/developercloud/doc/natural-language-understanding/index.html) [dive](https://www.ted.com/talks/deb_roy_the_birth_of_a_word).


## Table of Contents

1. [Introduction](#bullet-1)<br/>
    1.1 [Data preview & pre-processing](#bullet-2)<br/>
    1.2 [What is feature engineering?](#bullet-3)<br/>
    1.3 [What is NLP?](#bullet-4)<br/>
<br/>    
2. [Syntax](#bullet-5)<br/>
    2.1 [Basic string cleaning](#bullet-6)<br/>
    2.2 [Simplify question pairs](#bullet-7)<br/>
    2.3 [Measuring similarity](#bullet-8)<br/>
<br/>   
3. [Semantics](#bullet-9)<br/>
    3.1 [Single word analysis](#bullet-10)<br/>
    3.2 [Sentence analysis](#bullet-11)<br/>
    3.3 [Weighted analysis](#bullet-12)<br/>
    3.4 [Feature creation](#bullet-13)
    
    

## 1.0 Introduction<a class="anchor" id="bullet-1"></a>

Quora is a knowledge sharing platform that functions simply on questions and answers. Their mission, plainly stated: "We want the Quora answer to be the definitive answer for everybody forever." In order to ensure the quality of these answers, Quora must protect the integrity of the questions. They accomplish this by adhering to a principle that each logically distinct question should reside on its own page. Unfortunately, the English language is a fickle thing, and intention can vary significantly with subtle shifts in syntactic structure.

Our goal is to create features for syntactically similar, but semantically distinct pairs of strings. We'll be working with Quora's first [public dataset](https://data.quora.com/First-Quora-Dataset-Release-Question-Pairs).

### Load data

In [1]:
import pandas as pd
df = pd.read_csv('questions.csv')

### 1.1 Data preview  & pre-processing<a class="anchor" id="bullet-2"></a>
The Quora dataset is simple, containing columns for question strings, unique IDs, and a binary variable indicating whether the pair is logically distinct. 

In [2]:
from IPython.display import display
pd.set_option('display.max_colwidth', -1)

# checking for missing values
df.isnull().any()

# drop rows with missing values
df=df.dropna()

print df.shape
display(df[14:19])

(404349, 6)


Unnamed: 0,id,qid1,qid2,question1,question2,is_duplicate
14,14,29,30,"What are the laws to change your status from a student visa to a green card in the US, how do they compare to the immigration laws in Canada?",What are the laws to change your status from a student visa to a green card in the US? How do they compare to the immigration laws in Japan?,0
15,15,31,32,What would a Trump presidency mean for current international master’s students on an F1 visa?,How will a Trump presidency affect the students presently in US or planning to study in US?,1
16,16,33,34,What does manipulation mean?,What does manipulation means?,1
17,17,35,36,Why do girls want to be friends with the guy they reject?,How do guys feel after rejecting a girl?,0
18,18,37,38,Why are so many Quora users posting questions that are readily answered on Google?,Why do people ask Quora questions which can be answered easily by Google?,1


### 1.2 What is feature engineering?<a class="anchor" id="bullet-3"></a>

**Feature engineering** is the practice of generating data attributes that are useful for prediction. Although the task is loosely defined and depends heavily on the domain in question, it is a key process for optimizing model building. The goal is to find information which best describes the target to be predicted. 

In our case, the target is logical distinction - will one answer suffice for each pair of questions? This target is described by the binary is_duplicate label in the dataset. We will need to process the Quora data to create features that capture the structure and semantics of each question. This will be accomplished by using natural language processing (NLP) methods on the strings. 

### 1.3 What is natural language processing?<a class="anchor" id="bullet-4"></a>
NLP is the field concerned with computational handling of natural language. Grammar is full of seemingly arbitrary exceptions, vocabulary is constantly transforming, and meaning hinges precariously on culture and context. It is no small feat for a machine to find patterns in this dynamic mess (which is somehow easily grasped by the human brain).

We will start with the simpler task of describing syntax. Skip to [section 3](link) for semantic processing techniques.

## 2.0 Syntax<a class="anchor" id="bullet-5"></a>

A **corpus** is the body of text that we are working with - in this case, the dataset of Quora questions. Our first task is to break the strings down into more manageable units. We will be using the Natural Language Processing Toolkit (NLTK) library to apply the following methods.


### 2.1 Basic string cleaning<a class="anchor" id="bullet-6"></a>

**Tokenization**<br/>
Converting each string into a series of useful units (usually words). We can use NLTK's word_tokenize function to convert a question string into a list of word tokens.

In [3]:
import nltk
from nltk.tokenize import word_tokenize

teststring = df['question1'][12]
tokens = word_tokenize(df['question1'][12])

print teststring
print tokens

What can make Physics easy to learn?
['What', 'can', 'make', 'Physics', 'easy', 'to', 'learn', '?']


**Stopwords**<br/>
Common words to the corpus that do not significantly alter meaning. The NLTK library includes a set of English language stopwords (e.g. I, you, this, that), which we'll remove from the list of word tokens.

In [4]:
from nltk.corpus import stopwords
stop_words = stopwords.words('english')
stop_words += ['?'] # adding ? character to stop words, since we are working with a corpus of questions

filtered_tokens = [t for t in tokens if not t in stop_words]
print filtered_tokens

['What', 'make', 'Physics', 'easy', 'learn']


**Stemming**<br/>
Removes prefixes and suffixes to extract the **stem** of a word, which may be derived from a **root**. For example, the word "destabilized", has the stem "destablize", but the root "stabil-". The Porter stemming algorithm is often used in practice to handle this task.

In [5]:
#!pip install stemming
from stemming.porter2 import stem

stem_tokens = [stem(t) for t in filtered_tokens]
print stem_tokens

['What', 'make', 'Physic', 'easi', 'learn']


### 2.4 Simplify question pairs <a class="anchor" id="bullet-7"></a>
We will combine the string cleaning methods into a function, and apply that across both question columns in the dataset. To prepare for basic comparison, the function will also convert the words to lowercase and sort them alphabetically.

In [6]:
import string

def simplify(s):
    s = str(s).lower().decode('utf-8')
    tokens = word_tokenize(s)
    stop_words = stopwords.words('english')
    stop_words += string.punctuation
    filtered_tokens = [t for t in tokens if not t in stop_words]
    stem_tokens = [stem(t) for t in filtered_tokens]
    sort_tokens = sorted(stem_tokens)
    if sort_tokens is not []:
        tokenstr = " ".join(sort_tokens)
    else:
        tokenstr = ""
    return tokenstr.encode('utf-8')

df['q1_tokens'] = df['question1'].map(simplify)
df['q2_tokens'] = df['question2'].map(simplify)

simplifydf=df[['question1','q1_tokens','question2','q2_tokens','is_duplicate']]
display(simplifydf[12:13])

Unnamed: 0,question1,q1_tokens,question2,q2_tokens,is_duplicate
12,What can make Physics easy to learn?,easi learn make physic,How can you make physics easy to learn?,easi learn make physic,1


### 2.5 Measuring similarity<a class="anchor" id="bullet-8"></a>

The simplest way to compare the difference between two strings is by **edit distance**. 

**Levenshtein distance**: calculates edit distance by counting the number of operations (add, replace, or delete) that are required to transform one string into another.

**Token sort ratio**: A method from the [FuzzyWuzzy library](http://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/) that uses Levenshtein distance to get the proportion of common tokens between two strings. The score is normalized from 0-100 for easier interpretation.

We'll create our first two features with these methods.

In [7]:
#!pip install python-Levenshtein
from Levenshtein import distance
#!pip install fuzzywuzzy
from fuzzywuzzy import fuzz

df['edit_distance'] = df.apply(lambda x: distance(x['q1_tokens'], x['q2_tokens']), axis=1)
df['in_common'] = df.apply(lambda x: fuzz.token_sort_ratio(x['q1_tokens'], x['q2_tokens']), axis=1)

syntaxdf=df[['question1','q1_tokens','question2','q2_tokens','edit_distance','in_common','is_duplicate']]
display(syntaxdf[508:510]) # example

Unnamed: 0,question1,q1_tokens,question2,q2_tokens,edit_distance,in_common,is_duplicate
508,What is the best way to learn algebra by yourself?,algebra best learn way,How do you learn algebra 1 fast?,1 algebra fast learn,8,76,1
509,How does it feel to retake a class in college?,class colleg feel retak,Does retaking subjects in college affect future job prospects?,affect colleg futur job prospect retak subject,30,49,0


Clearly the edit distance or proportion of common tokens is not sufficient to predict duplicate intention. For example, question pair 508 is duplicate, but has a larger edit distance and smaller proportion of common tokens than pair 509.

Let's try to improve on our features by working with semantic methods.


## 3.0 Semantics<a class="anchor" id="bullet-9"></a>

To a machine, words look like characters stored next to one another. Syntax methods allow us to compare words by manipulating them mathematically - counting the number of characters, measuring the amount of work needed to turn one set of characters into another. 

Semantic analysis strives to represent how each sequence of characters is related to any other sequence of characters. These relationships can be derived from large bodies of language as a separate machine learning task. A **document** is the group of words in question. In our case, each question from the Quora corpus is one document.

To start, we'll create lists of word tokens (filtered for stopwords, but not stemmed), to support the methods we'll use in this section.

In [8]:
def word_set(s,t,q):
    s = str(s).lower().decode('utf-8')
    t = str(t).lower().decode('utf-8')
    
    s_tokens, t_tokens = word_tokenize(s), word_tokenize(t)
    
    stop_words = stopwords.words('english')
    stop_words += string.punctuation
    
    s_tokens = [x for x in s_tokens if not x in stop_words]
    t_tokens = [x for x in t_tokens if not x in stop_words]
    
    s_temp = set(s_tokens)
    t_temp = set(t_tokens)
    
    s_distinct = [x for x in s_tokens if x not in t_temp]
    t_distinct = [x for x in t_tokens if x not in s_temp]

    if q == "q1_words":
        return s_tokens
    elif q == "q2_words":
        return t_tokens
    elif q == "q1_distinct":
        return s_distinct
    elif q == "q2_distinct":
        return t_distinct

df['q1_words'] = df.apply(lambda x: word_set(x['question1'], x['question2'],"q1_words"), axis=1)
df['q2_words'] = df.apply(lambda x: word_set(x['question1'], x['question2'],"q2_words"), axis=1)

wordsdf=df[['question1','q1_words','question2','q2_words','is_duplicate']]
display(wordsdf[508:510])

Unnamed: 0,question1,q1_words,question2,q2_words,is_duplicate
508,What is the best way to learn algebra by yourself?,"[best, way, learn, algebra]",How do you learn algebra 1 fast?,"[learn, algebra, 1, fast]",1
509,How does it feel to retake a class in college?,"[feel, retake, class, college]",Does retaking subjects in college affect future job prospects?,"[retaking, subjects, college, affect, future, job, prospects]",0


### 3.1 Single word analysis<a class="anchor" id="bullet-10"></a>

**Word embeddings**<br/>
This method represents individual words as vectors, and semantic relationships as the distance between vectors. The more related words are, the closer they should exist in vector space. Word embeddings come from the field of [distributional semantics](https://en.wikipedia.org/wiki/Distributional_semantics), which suggests that words are semantically related if they are frequently used in similar contexts (i.e. they are often surrounded by the same words).

For example, 'Canada' and 'Toronto' should exist closer together in the vector space than 'Canada' and 'Camara' (which would be closer in edit distance).

**Word2Vec**<br/>
The mapping of words to vectors is in itself the result of a machine learning algorithm. Developed by Google in 2013, the Word2Vec algorithm is a neural network that takes a large corpus as training data, and produces vector co-ordinates for each word by the word embedding concept. 

We will be using an pre-trained model from Google that was created from over 100 billion words from Google News. The model needs to be [downloaded](https://code.google.com/archive/p/word2vec/) and handled using the [gensim library](https://radimrehurek.com/gensim/) for word vectors. The model is a dictionary that contains every word and its corresponding vector representation, which look like 300 dimensional co-ordinates stored in an array.

**Comparing word vectors**<br/>
To compare word vectors, we can use cosine similarity. As the name suggests, this metric measures similarity by taking the cosine of the angle between vectors. The cosine function scales the similarity between 0 and 1, representing words from least to most semantically related.

In [9]:
#!pip install gensim
import gensim

model = gensim.models.KeyedVectors.load_word2vec_format('googlenews-vectors.bin', binary=True)

# using gensim built-in similarity function for examples
print "Cosine of angle between Canada, Toronto:" + "\n",
print model.similarity('Canada','Toronto')
print "\n" + "Cosine of angle between Canada, Camara:" + "\n",
print model.similarity('Canada','Camara')



Cosine of angle between Canada, Toronto:
0.564658820406

Cosine of angle between Canada, Camara:
0.0305788253005


### 3.2 Sentence analysis<a class="anchor" id="bullet-11"></a>

Since the model works like a dictionary, it can only give us vector representations for single words. There are two ways to get a vector representation of a sentence: 

1. Train a model on ordered words (e.g. sentences or phrases). Since word order is included during training, the resulting vectors will preserve the relationships between words. I won't be training a new model in this notebook, as it is computationally heavy, but here are some [resources](https://rare-technologies.com/doc2vec-tutorial) for the curious.
<br/>
<br/>
2. Convert a sentence to a set of words, and get the corresponding set of vectors. Averaging the vector set (summing and dividing by total vector length) will give us a single vector that represents that particular set of words. This method can only give a 'bag of words' representation - i.e. word order is not captured.

*Comment: I think that getting new embeddings specific to a corpus is the best-performing method in practice. For the purpose of illustrating NLP problem-solving, I will do my best with bag-of-words methods.*

The following function implements the second method to get the average embedded vector from a set of words.

In [10]:
import numpy as np
from __future__ import division

def vectorize(words):
    V = np.zeros(300)
    
    for w in words:
        try: 
            V = np.add(V,model[w]) 
        except:
            continue
    else:
        avg_vector = V / np.sqrt((V ** 2).sum())
        return avg_vector

Let's see how the average vectors compare for question pair 508:

In [11]:
from sklearn.metrics.pairwise import cosine_similarity

sent1_q508 = wordsdf['q1_words'][508]
sent2_q508 = wordsdf['q2_words'][508]

vec1_q508 = vectorize(sent1_q508).reshape(1,-1)
vec2_q508 = vectorize(sent2_q508).reshape(1,-1)

display(wordsdf[508:509])

print "\n" + "Cosine similarity of [best, way, learn, algebra] and [learn, algebra, 1, fast]:" + "\n",
print cosine_similarity(vec1_q508, vec2_q508)[0][0]

Unnamed: 0,question1,q1_words,question2,q2_words,is_duplicate
508,What is the best way to learn algebra by yourself?,"[best, way, learn, algebra]",How do you learn algebra 1 fast?,"[learn, algebra, 1, fast]",1



Cosine similarity of [best, way, learn, algebra] and [learn, algebra, 1, fast]:
0.788908032578


How do the averaged vectors represent the cosine similarities of its components? 

Intuitively, if our question pair differs by a closely related word (best vs. ideal) we would get a larger cosine similarity. And if our question pair differs by a very distinct word (algebra vs. juggling), the cosine similarity is smaller.

In [12]:
# bag of words, so same set of words in a different order does not matter
print "\n" + "Distance between [best, way, learn, algebra] and [learn, algebra, best, way]:" + "\n",
print model.n_similarity(['best','way','learn','algebra'],['learn','algebra','best','way'])

# difference is a semantically similar word
print "\n" + "Distance between [best, way, learn, algebra] and [ideal, way, learn, algebra]:" + "\n",
print model.n_similarity(['best','way','learn','algebra'],['ideal','way','learn','algebra'])

# difference is not semantically similar
print "\n" + "Distance between [best, way, learn, algebra] and [best, way, learn, juggling]:" + "\n",
print model.n_similarity(['best','way','learn','algebra'],['best','way','learn','juggling'])


Distance between [best, way, learn, algebra] and [learn, algebra, best, way]:
1.0

Distance between [best, way, learn, algebra] and [ideal, way, learn, algebra]:
0.905135532973

Distance between [best, way, learn, algebra] and [best, way, learn, juggling]:
0.732704534446


**Word mover's distance** <br/>
An implementation of Earth mover's distance for natural language processing problems by Kusner et al. <a href="#footnote-1"><sup>[1]</sup></a>

WM distance is an approach that combines the ideas of edit distance with vector representation. It measures the work required to transform one set of vectors into another. Instead of counting edit operations, we use distance between word vectors - how far one vector would have to move to occupy the same spot as the second.

How Word Mover's Distance is calculated:
</a><br/><img src="https://raw.githubusercontent.com/nllho/quora-nlp/master/images/wmd.PNG" width="400" height="400"/>
1. All the words in each set are paired off with each other
2. Calculate the distance between each pair (instead of cosine similarity, Euclidean distance is used here)
3. Sum the distances between pairs with minimum distances

If the two sets do not have the same number of words, the problem becomes an optimization of another measurement called **flow**.
</a><br/><img src="https://raw.githubusercontent.com/nllho/quora-nlp/master/images/flow.PNG" width="320" height="320"/>

1. The flow is equal to 1/(number of words in the set), so words from the smaller set have a larger flow<br/>
(words on the bottom have a flow of 0.33, while words on the top have a flow of 0.25)
2. Extra flow gets attributed to the next most similar words<br/>
(see the arrows drawn from the bottom words to more than one word in the top row)
3. The optimization problem identifies the pairs with minimum distances by solving for minimum flow.

We can use the WM distance method directly from gensim.

In [13]:
print "\n" + "WM distance between [best, way, learn, algebra] and [learn, algebra, 1, fast]:" + "\n",
print model.wmdistance(sent1_q508, sent2_q508)


WM distance between [best, way, learn, algebra] and [learn, algebra, 1, fast]:
1.43031281792


### 3.3 Weighted analysis<a class="anchor" id="bullet-12"></a>

In the example below, we can see that the words are the same except for the name of the country in question (Canada vs. Japan). However, the country name makes all the semantic difference, which we fail to capture using only cosine similarity or WM distance.

In [14]:
display(wordsdf[14:15])

sent1_q14 = wordsdf['q1_words'][14]
sent2_q14 = wordsdf['q2_words'][14]

print "\n" + "Cosine angle:" + "\n",
print model.n_similarity(sent1_q14, sent2_q14)

print "\n" + "WM distance:" + "\n",
print model.wmdistance(sent1_q14, sent2_q14)

Unnamed: 0,question1,q1_words,question2,q2_words,is_duplicate
14,"What are the laws to change your status from a student visa to a green card in the US, how do they compare to the immigration laws in Canada?","[laws, change, status, student, visa, green, card, us, compare, immigration, laws, canada]",What are the laws to change your status from a student visa to a green card in the US? How do they compare to the immigration laws in Japan?,"[laws, change, status, student, visa, green, card, us, compare, immigration, laws, japan]",0



Cosine angle:
0.970663125514

WM distance:
0.332179243272


**Weighing uncommon words**<br/>
Let's assume that 'rare' words are more likely to be semantically significant. We can represent this at the word vector level by multiplying those words by a numerical weight.  

**Term frequency-inverse document frequency** (tf-idf) is a method that assigns weights to word vectors depending on how common they are to a document. The frequency of a word is measured in two ways:

* How many documents contain the word (N)
* How many times a word appears in one document (f)

The weight is calculated from the frequency as log(N/f), so the less frequently a word appears in some documents, the higher its weight.

This method can be implemented via sci-kit learn's built in [Tf-idf Vectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html), which generates weights given a corpus. To save memory and computing time, I decided to simplify the premise of tf-idf for use on pairs of similar questions.

(1) Assume that distinct words  are the most important in telling the difference between question pairs.

In [15]:
# get list of distinct words for each question
df['q1_distinct'] = df.apply(lambda x: word_set(x['question1'], x['question2'],"q1_distinct"), axis=1)
df['q2_distinct'] = df.apply(lambda x: word_set(x['question1'], x['question2'],"q2_distinct"), axis=1)

distinctdf=df[['question1','q1_words','q1_distinct','question2','q2_words','q2_distinct','is_duplicate']]
display(distinctdf[14:15])

Unnamed: 0,question1,q1_words,q1_distinct,question2,q2_words,q2_distinct,is_duplicate
14,"What are the laws to change your status from a student visa to a green card in the US, how do they compare to the immigration laws in Canada?","[laws, change, status, student, visa, green, card, us, compare, immigration, laws, canada]",[canada],What are the laws to change your status from a student visa to a green card in the US? How do they compare to the immigration laws in Japan?,"[laws, change, status, student, visa, green, card, us, compare, immigration, laws, japan]",[japan],0


It might be useful to get features for the cosine similarity and WM distance for distinct words.

In [16]:
distinct1 = distinctdf['q1_distinct'][14]
distinct2 = distinctdf['q2_distinct'][14]

distinct_vec1 = vectorize(distinct1).reshape(1,-1)
distinct_vec2 = vectorize(distinct2).reshape(1,-1)

print "Cosine similarity between distinct vectors ({0}, {1}):".format(distinct1[0], distinct2[0]) + "\n",
print cosine_similarity(distinct_vec1, distinct_vec2)[0][0]

print "\n" + "WM distance between distinct vectors ({0}, {1}):".format(distinct1[0], distinct2[0]) + "\n",
print model.wmdistance(distinct1, distinct2)

Cosine similarity between distinct vectors (canada, japan):
0.482060432649

WM distance between distinct vectors (canada, japan):
3.98616600037


(2) Distinct words only appear in one of the two questions, so we can take N = 1. We assumed that distinct words are important, so we assign the distinct words a small frequency of 1/(number of words in the question) for a larger weight.

In [17]:
# modify vectorize function to add weights
def get_weight(words):
    n = len(words)
    weight = 1
    
    if n != 0:
        weight = np.log(1/(1/n))
        
    return weight

(3) Generate an array containing the weights for every question in the dataset.

In [18]:
# empty arrays
q1_weights = np.zeros((df.shape[0],300))
q2_weights = np.zeros((df.shape[0],300))

# fill arrays with weights for each question
for i, q in enumerate(df.q1_words.values):
    q1_weights[i, :] = get_weight(q)
    
for i, q in enumerate(df.q2_words.values):
    q2_weights[i, :] = get_weight(q)

(4) Calculate the average weighted vectors. We can see how weighing distinct words translates to reduced cosine similarity.

In [19]:
avg_vec1 = vectorize(sent1_q14).reshape(1,-1)
avg_vec2 = vectorize(sent2_q14).reshape(1,-1)

print "\n" + "Cosine similarity between averaged question vectors:" + "\n",
print cosine_similarity(avg_vec1, avg_vec2)[0][0]

w_distinct_vec1 = distinct_vec1*q1_weights[14]
w_distinct_vec2 = distinct_vec2*q2_weights[14]

avg_weight_distinct_vec1 = np.add(avg_vec1, -(distinct_vec1), w_distinct_vec1) 
avg_weight_distinct_vec2 = np.add(avg_vec2, -(distinct_vec2), w_distinct_vec2)

print "\n" + "Cosine similiarity between weighted question vectors:" + "\n",
print cosine_similarity(avg_weight_distinct_vec1, avg_weight_distinct_vec2)[0][0]


Cosine similarity between averaged question vectors:
0.970663125278

Cosine similiarity between weighted question vectors:
0.752143561408


### 3.4 Feature creation<a class="anchor" id="bullet-13"></a> <a href="#footnote-1"><sup>[2]</sup></a>
We can apply these methods to our dataset to create the following features:

* Word mover's distance between sentence sets
* Word mover's distance between distinct word sets
* Angle between averaged sentence vectors
* Angle between averaged distinct word vectors
* Angle between weighted sentence vectors

In [20]:
# word mover's distance between sentence sets
df['wm_dist_words'] = df.apply(lambda x: model.wmdistance(x['q1_words'], x['q2_words']), axis=1)

# word mover's distance between distinct sets
df['wm_dist_distinct'] = df.apply(lambda x: model.wmdistance(x['q1_distinct'], x['q2_distinct']), axis=1)

# angle between averaged sentence vectors
q1_avg_vectors = np.zeros((df.shape[0], 300))
q2_avg_vectors = np.zeros((df.shape[0], 300))

for i, q in enumerate(df.q1_words.values):
    q1_avg_vectors[i, :] = vectorize(q)

for i, q in enumerate(df.q2_words.values):
    q2_avg_vectors[i, :] = vectorize(q)
    
    
df['cos_angle_words'] = [cosine_similarity(x.reshape(1,-1), y.reshape(1,-1))[0][0]
                        for (x, y) in zip(np.nan_to_num(q1_avg_vectors),
                                          np.nan_to_num(q2_avg_vectors))]

# angle between averaged distinct sentence vectors
q1_dist_vectors = np.zeros((df.shape[0], 300))
q2_dist_vectors = np.zeros((df.shape[0], 300))

for i, q in enumerate(df.q1_distinct.values):
    q1_dist_vectors[i, :] = vectorize(q)

for i, q in enumerate(df.q2_distinct.values):
    q2_dist_vectors[i, :] = vectorize(q)
    
    
df['cos_angle_distinct'] = [cosine_similarity(x.reshape(1,-1), y.reshape(1,-1))[0][0]
                           for (x, y) in zip(np.nan_to_num(q1_dist_vectors),
                                             np.nan_to_num(q2_dist_vectors))]

# get array of weighted distinct vectors
q1_weight_distinct_vec = np.multiply(q1_dist_vectors,q1_weights)
q2_weight_distinct_vec = np.multiply(q2_dist_vectors,q2_weights)

# get sentence vectors with weights
q1_avg_weight_vectors = np.add(q1_avg_vectors, -(q1_dist_vectors), + q1_weight_distinct_vec)
q2_avg_weight_vectors = np.add(q2_avg_vectors, -(q2_dist_vectors), + q2_weight_distinct_vec)

df['cos_angle_weighted'] = [cosine_similarity(x.reshape(1,-1), y.reshape(1,-1))[0][0]
                           for (x, y) in zip(np.nan_to_num(q1_avg_weight_vectors),
                                             np.nan_to_num(q2_avg_weight_vectors))]

df[14:15]



Unnamed: 0,id,qid1,qid2,question1,question2,is_duplicate,q1_tokens,q2_tokens,edit_distance,in_common,q1_words,q2_words,q1_distinct,q2_distinct,wm_dist_words,wm_dist_distinct,cos_angle_words,cos_angle_distinct,cos_angle_weighted
14,14,29,30,"What are the laws to change your status from a student visa to a green card in the US, how do they compare to the immigration laws in Canada?",What are the laws to change your status from a student visa to a green card in the US? How do they compare to the immigration laws in Japan?,0,canada card chang compar green immigr law law status student us visa,card chang compar green immigr japan law law status student us visa,13,90,"[laws, change, status, student, visa, green, card, us, compare, immigration, laws, canada]","[laws, change, status, student, visa, green, card, us, compare, immigration, laws, japan]",[canada],[japan],0.332179,3.986166,0.970663,0.48206,0.752144


You can now export the feature engineered dataset for use with your preferred model!

In [21]:
featuredf = df.drop(['q1_tokens','q2_tokens','q1_words','q2_words','q1_distinct','q2_distinct'], axis=1)
featuredf.to_csv('/nlp/quora_features.csv', index=False)

## Further reading

* Follow [this tutorial](http://nbviewer.jupyter.org/gist/nllho/4496a06e2bec93f06858851b5d822298) to build an XGBoost classifier, and make predictions using our new features
* Try [Doc2Vec](https://rare-technologies.com/doc2vec-tutorial) to train a model for sentences or phrases
* Try [Tf-idf Vectorizer](http://www.markhneedham.com/blog/2015/02/15/pythonscikit-learn-calculating-tfidf-on-how-i-met-your-mother-transcripts/) to generate specific weights based on word frequency in a corpus


## References

<p id="footnote-1"><sup>[1]</sup> Kusner, M. J. and Sun, Y. and Kolkin, N. I. and Weinberger, K. Q. (2015) [From Word Embeddings to Document Distances](http://proceedings.mlr.press/v37/kusnerb15.pdf)

<p id="footnote-1"><sup>[2]</sup> Thakur, A. (April 2017) [Is that a duplicate Quora Question?](https://www.linkedin.com/pulse/duplicate-quora-question-abhishek-thakur)