
# Preprocessing the Quora Question Pairs dataset

## The dataset

This project starts with the Quora Question Pairs dataset, which consists of:

* 537933 unique questions
* 404290 pairs of questions
* 149263 pairs of questions marked as duplicates
* 255027 pairs of questions marked as non-duplicates

Each instance of the dataset has the following attributes:
* id: A unique identifier for each training set question pair
* qid1: A unique identifier for the first question in the pair
* qid2: A unique identifier for the second question in the pair
* question1: The full text of the first question 
* question2: The full text of the second question
* is_duplicate: The target variable, set to 1 if question1 and question2 have essentially the same meaning, and 0 otherwise
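
As a minimal sketch of how these attributes look once parsed, the snippet below reads a made-up two-row sample with `csv.DictReader`. The sample rows are hypothetical and only mirror the column layout listed above; they are not taken from the dataset.

```python
import csv
import io

# Hypothetical two-row sample in the same column layout as quora.csv
sample = io.StringIO(
    'id,qid1,qid2,question1,question2,is_duplicate\n'
    '0,1,2,"What is NLP?","What does NLP mean?",1\n'
    '1,3,4,"How do I cook rice?","What is a VPN?",0\n'
)

rows = list(csv.DictReader(sample))

# csv.DictReader returns every field as a string, so the label is '0' or '1'
labels = [row['is_duplicate'] for row in rows]
n_duplicates = sum(label == '1' for label in labels)
print(rows[0]['question1'], '|', rows[0]['question2'], '|', labels[0])
```

Because every field arrives as a string, comparisons against the target variable need to use `'0'` and `'1'` rather than integers.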

## Upload the dataset

Upload the file *quora.csv* that will be used in this Python 3 notebook.

In [2]:
from google.colab import files

# Choose from your own machine the file to upload - name should be "quora.csv"
uploaded = files.upload()

Saving quora.csv to quora.csv


## Download a pre-trained English language model

We will use the large pre-trained statistical model for English, `en_core_web_lg`, provided by **spaCy**, a free open-source library for NLP in Python. Find more [here](https://spacy.io/models/en#en_core_web_lg).

In [4]:
!python -m spacy download en_core_web_lg

# Load the model
import spacy
nlp = spacy.load('en_core_web_lg')


    Linking successful
    /usr/local/lib/python3.6/dist-packages/en_core_web_lg -->
    /usr/local/lib/python3.6/dist-packages/spacy/data/en_core_web_lg

    You can now load the model via spacy.load('en_core_web_lg')



## Importing Python libraries

In [0]:
import csv
import numpy as np
from prettytable import PrettyTable

## Load the dataset

In [6]:
# Loading the Quora Question Pairs dataset
data_file = 'quora.csv'
new_data = []
# newline='' lets the csv module handle line breaks inside quoted questions
with open(data_file, newline='') as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        new_data.append(row)

# Extracting the unique questions from the dataset,
# all pairs, duplicate and non-duplicate pairs
uniqueQ = {}
pairs = {}
pairs_count = 0
duplicate = {}
dup = 0
non_duplicate = {}
ndup = 0
for dd in new_data:
    uniqueQ[dd['qid1']] = dd['question1']
    uniqueQ[dd['qid2']] = dd['question2']
    pairs_count += 1
    pairs[pairs_count] = (dd['question1'], dd['question2'])
    if dd['is_duplicate'] == '0':
        ndup += 1
        non_duplicate[ndup] = (dd['question1'], dd['question2'])
    else:
        dup += 1
        duplicate[dup] = (dd['question1'], dd['question2'])

# Print stats for the dataset
t = PrettyTable(['Pairs type', '# of Pairs'])
t.add_row(['Unique Pairs', len(pairs)])
t.add_row(['Duplicate Questions', len(duplicate)])
t.add_row(['non-Duplicate Questions', len(non_duplicate)])
print(t)


+-------------------------+------------+
|        Pairs type       | # of Pairs |
+-------------------------+------------+
|       Unique Pairs      |   404290   |
|   Duplicate Questions   |   149263   |
| non-Duplicate Questions |   255027   |
+-------------------------+------------+


## Examples of duplicate vs. non-duplicate questions

In [9]:
print('Duplicate questions:')
print('\t-' + duplicate[17][0])
print('\t-' + duplicate[17][1])
print()
print('non-Duplicate questions:')
print('\t-' + non_duplicate[3][0])
print('\t-' + non_duplicate[3][1])

Duplicate questions:
	-Will a Blu Ray play on a regular DVD player? If so, how?
	-How can you play a Blu Ray DVD on a regular DVD player?

non-Duplicate questions:
	-How can I increase the speed of my internet connection while using a VPN?
	-How can Internet speed be increased by hacking through DNS?


## Implementing a cosine similarity function

`cosine(vA, vB)` calculates the cosine similarity between two vectors `vA` and `vB`.

In [0]:
def cosine(vA, vB):
  cos = np.dot(vA, vB) / (np.sqrt(np.dot(vA, vA)) * np.sqrt(np.dot(vB, vB)))
  return cos.astype('float64')
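
The function computes $\cos(v_A, v_B) = \frac{v_A \cdot v_B}{\lVert v_A \rVert \, \lVert v_B \rVert}$. As a quick sanity check, the sketch below repeats the definition so that it runs on its own and applies it to three toy vector pairs chosen for illustration: same direction, orthogonal, and opposite direction.

```python
import numpy as np

def cosine(vA, vB):
    # Dot product divided by the product of the two Euclidean norms
    cos = np.dot(vA, vB) / (np.sqrt(np.dot(vA, vA)) * np.sqrt(np.dot(vB, vB)))
    return cos.astype('float64')

# Toy vectors, not word embeddings
same_direction = cosine(np.array([1.0, 2.0]), np.array([2.0, 4.0]))
orthogonal = cosine(np.array([1.0, 0.0]), np.array([0.0, 3.0]))
opposite = cosine(np.array([1.0, 1.0]), np.array([-1.0, -1.0]))
print(same_direction, orthogonal, opposite)
```

The three values should come out close to 1, 0 and -1 respectively, up to floating-point rounding.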

## Calculating the cosine similarity

Here we compare two ways of calculating the cosine similarity for a pair of duplicate or non-duplicate questions: generating the vector representation from the full text of each question, and generating it from only the noun chunks extracted from each question.

For the non-duplicate pair in the example below, using only the noun chunks results in a lower similarity score (0.67 versus 0.93 for the full text), which is preferable. We will test the hypothesis that calculating the question embeddings from only the main verb arguments of a question (approximated here by its noun chunks) gives a more accurate representation, one that makes it easier to separate semantically related questions from unrelated ones.
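
One detail worth keeping in mind: this notebook builds the noun-chunk embedding by summing the chunk vectors, whereas spaCy's default document vector is, per its documentation, an average of token vectors. Cosine similarity ignores positive scaling, so summing or averaging the same set of vectors gives the same score. The sketch below illustrates this with made-up 3-dimensional vectors standing in for chunk vectors; the helper is a local stand-in, not the notebook's `cosine`.

```python
import numpy as np

def cosine(vA, vB):
    # Cosine similarity as a plain Python float
    return float(np.dot(vA, vB) / (np.linalg.norm(vA) * np.linalg.norm(vB)))

# Made-up 3-dimensional "noun chunk" vectors for two questions
chunks_q1 = np.array([[0.2, 0.1, 0.7], [0.4, 0.3, 0.1]])
chunks_q2 = np.array([[0.3, 0.2, 0.5], [0.1, 0.6, 0.2], [0.5, 0.1, 0.1]])

# Question embedding as the sum of chunk vectors, and as their mean
cos_sum = cosine(chunks_q1.sum(axis=0), chunks_q2.sum(axis=0))
cos_mean = cosine(chunks_q1.mean(axis=0), chunks_q2.mean(axis=0))
print(cos_sum, cos_mean)
```

Any difference between the full-text and noun-chunk scores therefore comes from which tokens contribute to the embedding, not from the choice of sum over mean.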

In [18]:
(a, b) = duplicate[17]
foo1 = nlp(a)
foo2 = nlp(b)
# Calculating the cosine similarity of the two questions (full text)
cos = foo1.similarity(foo2)

print('Duplicate questions:')
print('Question1: ', foo1)
print('Question2: ', foo2)
print('Cosine similarity (full-text):', cos)

# Calculating the cosine similarity (only noun chunks)
calc1 = []
for nn in foo1.noun_chunks:
    calc1.append(nn.vector)
calc2 = []
for nn in foo2.noun_chunks:
    calc2.append(nn.vector)
# vA: question embedding for question1
# vB: question embedding for question2
vA = np.sum(calc1, axis=0)
vB = np.sum(calc2, axis=0)
cos = cosine(vA, vB)
print('Cosine similarity (noun chunks):', cos)

print()

(a, b) = non_duplicate[3]
foo1 = nlp(a)
foo2 = nlp(b)
# Calculating the cosine similarity of the two questions (full text)
cos = foo1.similarity(foo2)

print('non-Duplicate questions:')
print('Question1: ', foo1)
print('Question2: ', foo2)
print('Cosine similarity (full-text):', cos)

# Calculating the cosine similarity (only noun chunks)
calc1 = []
for nn in foo1.noun_chunks:
    calc1.append(nn.vector)
calc2 = []
for nn in foo2.noun_chunks:
    calc2.append(nn.vector)
# vA: question embedding for question1
# vB: question embedding for question2
vA = np.sum(calc1, axis=0)
vB = np.sum(calc2, axis=0)
cos = cosine(vA, vB)
print('Cosine similarity (noun chunks):', cos)

Duplicate questions:
Question1:  Will a Blu Ray play on a regular DVD player? If so, how?
Question2:  How can you play a Blu Ray DVD on a regular DVD player?
Cosine similarity (full-text): 0.9717130630470544
Cosine similarity (noun chunks): 0.914749264717102

non-Duplicate questions:
Question1:  How can I increase the speed of my internet connection while using a VPN?
Question2:  How can Internet speed be increased by hacking through DNS?
Cosine similarity (full-text): 0.9290680453645342
Cosine similarity (noun chunks): 0.670642077922821


## Duplicate questions

Here we calculate the cosine similarity for about 5,000 pairs of duplicate questions (`range(1, 5000)` covers 4,999 pairs, and pairs whose noun-chunk similarity is undefined are skipped), using:
* The question vector of each full-text question.
* The question vector built from the noun chunks of each question.
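
The loop in the next cell seeds each list of chunk vectors with a zero vector, so a question without noun chunks ends up with an all-zero embedding. Its cosine similarity is then 0/0, which NumPy reports as `nan`, and the `np.isnan` check drops that pair. A minimal sketch of that behaviour, using a toy dimension of 4 instead of 300 and made-up values:

```python
import warnings
import numpy as np

def cosine(vA, vB):
    cos = np.dot(vA, vB) / (np.sqrt(np.dot(vA, vA)) * np.sqrt(np.dot(vB, vB)))
    return cos.astype('float64')

dim = 4  # toy dimension; the notebook's vectors have 300 dimensions
# Question with no noun chunks: only the seed zero vector is summed
no_chunks = np.sum([np.zeros(dim)], axis=0)
# Question with one (made-up) noun chunk vector
with_chunks = np.sum([np.zeros(dim), np.array([0.5, 0.1, 0.0, 0.2])], axis=0)

with warnings.catch_warnings():
    warnings.simplefilter('ignore', RuntimeWarning)  # silence the 0/0 warning
    cos_missing = cosine(no_chunks, with_chunks)
cos_ok = cosine(with_chunks, with_chunks)

# Keep only the pairs whose similarity is defined
kept = [c for c in (cos_missing, cos_ok) if not np.isnan(c)]
print(len(kept))
```

This is why the number of scores collected can be slightly lower than the number of pairs processed.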


In [24]:
# sim_duplicate: stores the cosine similarity calculated on the full-text questions
# sim_duplicate2: stores the cosine similarity calculated only on the noun chunks of the questions
sim_duplicate = []
sim_duplicate2 = []
for i in range(1, 5000):
  (a, b) = duplicate[i]
  foo1 = nlp(a)
  foo2 = nlp(b)
  
  calc1 = [np.zeros(300, dtype='float64')]
  for nn in foo1.noun_chunks:
      calc1.append(nn.vector)
  calc2 = [np.zeros(300, dtype='float64')]
  for nn in foo2.noun_chunks:
      calc2.append(nn.vector)
  vA = np.sum(calc1, axis=0).astype('float64')
  vB = np.sum(calc2, axis=0).astype('float64')
  cos = cosine(vA, vB)
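  # cos is NaN when a question yields an all-zero vector (e.g. no noun chunks); skip such pairs
  try:
    if not np.isnan(cos):
      sim_duplicate.append(foo1.similarity(foo2))
      sim_duplicate2.append(cos)
  except Exception:
    continue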

# Print stats for each vector representation
print('For pairs of questions that are duplicates, the cosine similarity stats:')
t = PrettyTable(['', 'Full text question', 'Noun chunks'])
t.add_row(['Min', min(sim_duplicate), min(sim_duplicate2)])
t.add_row(['Max', max(sim_duplicate), max(sim_duplicate2)])
t.add_row(['Avg', np.average(sim_duplicate), np.average(sim_duplicate2)])
print(t)

  


For pairs of questions that are duplicates, the cosine similarity stats:
+-----+--------------------+-----------------------+
|     | Full text question |      Noun chunks      |
+-----+--------------------+-----------------------+
| Min | 0.6468996546040794 | -0.008805530308907383 |
| Max | 1.0000000858950187 |   1.0000000000000004  |
| Avg | 0.9465413936763243 |   0.8805787954723814  |
+-----+--------------------+-----------------------+


## non-Duplicate questions

Here we calculate the cosine similarity for about 5,000 pairs of non-duplicate questions (`range(1, 5000)` covers 4,999 pairs, and pairs whose noun-chunk similarity is undefined are skipped), using:
* The question vector of each full-text question.
* The question vector built from the noun chunks of each question.


In [25]:
# sim_nonduplicate: stores the cosine similarity calculated on the full-text questions
# sim_nonduplicate2: stores the cosine similarity calculated only on the noun chunks of the questions
sim_nonduplicate = []
sim_nonduplicate2 = []
for i in range(1, 5000):
  (a, b) = non_duplicate[i]
  foo1 = nlp(a)
  foo2 = nlp(b)
  
  calc1 = [np.zeros(300, dtype='float64')]
  for nn in foo1.noun_chunks:
      calc1.append(nn.vector)
  calc2 = [np.zeros(300, dtype='float64')]
  for nn in foo2.noun_chunks:
      calc2.append(nn.vector)
  vA = np.sum(calc1, axis=0).astype('float64')
  vB = np.sum(calc2, axis=0).astype('float64')
  cos = cosine(vA, vB)
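  # cos is NaN when a question yields an all-zero vector (e.g. no noun chunks); skip such pairs
  try:
    if not np.isnan(cos):
      sim_nonduplicate.append(foo1.similarity(foo2))
      sim_nonduplicate2.append(cos)
  except Exception:
    continue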

# Print stats for each vector representation
print('For pairs of questions that are non duplicates, the cosine similarity stats:')
t = PrettyTable(['', 'Full text question', 'Noun chunks'])
t.add_row(['Min', min(sim_nonduplicate), min(sim_nonduplicate2)])
t.add_row(['Max', max(sim_nonduplicate), max(sim_nonduplicate2)])
t.add_row(['Avg', np.average(sim_nonduplicate), np.average(sim_nonduplicate2)])
print(t)

  


For pairs of questions that are non duplicates, the cosine similarity stats:
+-----+---------------------+---------------------+
|     |  Full text question |     Noun chunks     |
+-----+---------------------+---------------------+
| Min | 0.29264479268934257 | -0.1132305785162487 |
| Max |  1.0000001169458588 |  1.0000000000000002 |
| Avg |  0.8947737652759219 |  0.7746351169036362 |
+-----+---------------------+---------------------+
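
To put the two tables side by side, the short calculation below takes the four average values reported above and computes, for each representation, the gap between the duplicate and non-duplicate averages. This is only a rough indicator: it ignores the spread of each distribution, which the min and max rows show to be wide.

```python
# Average cosine similarities copied from the two tables above
avg = {
    'full_text': {'duplicate': 0.9465413936763243, 'non_duplicate': 0.8947737652759219},
    'noun_chunks': {'duplicate': 0.8805787954723814, 'non_duplicate': 0.7746351169036362},
}

# Gap between the class averages for each representation
gaps = {name: v['duplicate'] - v['non_duplicate'] for name, v in avg.items()}
for name, gap in gaps.items():
    print(f'{name}: gap between class averages = {gap:.4f}')
```

The gap is about 0.052 for the full text and about 0.106 for the noun chunks, so on these samples the noun-chunk averages sit roughly twice as far apart.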


## Save the data

Finally, we create two files:
* `full_questions.csv`
* `noun_chunks.csv`

Each file holds the similarity scores for a subset of roughly 10,000 pairs of questions, about 5,000 duplicate and 5,000 non-duplicate, so that we can then perform classification and decide which model is better for generating question embeddings for the purpose of finding semantically related questions.

In [0]:
# toCSV1 stores the cosine similarity calculated using the vector representation of the full text of the question
# toCSV2 stores the cosine similarity calculated using the vector representation of only the noun chunks of the question
toCSV1 = []
toCSV2 = []
for i in sim_duplicate:
  toCSV1.append({'Class':'Duplicate', 'Similarity_Metric':i})
for i in sim_duplicate2:
  toCSV2.append({'Class':'Duplicate', 'Similarity_Metric':i})
for i in sim_nonduplicate:
  toCSV1.append({'Class':'NonDuplicate', 'Similarity_Metric':i})
for i in sim_nonduplicate2:
  toCSV2.append({'Class':'NonDuplicate', 'Similarity_Metric':i})
  
fieldnames = ['Class', 'Similarity_Metric']

# newline='' keeps the csv module from writing extra blank lines on some platforms
with open('full_questions.csv', 'w', newline='') as csvfile:
  writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
  writer.writeheader()
  writer.writerows(toCSV1)

with open('noun_chunks.csv', 'w', newline='') as csvfile:
  writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
  writer.writeheader()
  writer.writerows(toCSV2)
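
The classification step itself is outside this notebook. As one possible starting point, and purely as an assumption about how the files might be used, the sketch below reads rows in the same `Class,Similarity_Metric` layout and picks the similarity threshold that gives the best accuracy. The rows are made up for illustration, and `accuracy` is a hypothetical helper, not part of the project.

```python
import csv
import io

# Made-up rows in the same layout as full_questions.csv / noun_chunks.csv
sample = io.StringIO(
    'Class,Similarity_Metric\n'
    'Duplicate,0.95\n'
    'Duplicate,0.88\n'
    'NonDuplicate,0.91\n'
    'NonDuplicate,0.60\n'
)
rows = [(r['Class'], float(r['Similarity_Metric'])) for r in csv.DictReader(sample)]

def accuracy(rows, threshold):
    # Predict 'Duplicate' when the similarity reaches the threshold
    correct = sum((sim >= threshold) == (label == 'Duplicate') for label, sim in rows)
    return correct / len(rows)

# Try each observed similarity value as a candidate threshold
candidates = sorted({sim for _, sim in rows})
best_threshold = max(candidates, key=lambda t: accuracy(rows, t))
print(best_threshold, accuracy(rows, best_threshold))
```

Running the same procedure on each of the two files would give one accuracy figure per representation, which is one simple way to compare them; a held-out split would be needed for a fair estimate.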