<a href="https://colab.research.google.com/github/kzafeiroudi/PRML/blob/master/Preprocessing_Quora.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Preprocessing the Quora Question Pairs dataset

## The dataset

For this project, we first examine the Quora Question Pairs dataset, which consists of:

* 537933 unique questions
* 404290 pairs of questions
* 149263 pairs of questions marked as duplicates
* 255027 pairs of questions marked as non-duplicates

Each instance of the dataset has the following attributes:
* id: A unique identifier for each question pair in the training set
* qid1: A unique identifier for the first question in the pair
* qid2: A unique identifier for the second question in the pair
* question1: The full text of the first question 
* question2: The full text of the second question
* is_duplicate: The target variable, set to 1 if question1 and question2 have essentially the same meaning, and 0 otherwise
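
As a rough illustration of how the counts above relate to these attributes, the sketch below tallies unique question ids, pairs, duplicates and non-duplicates. The three sample rows and the `summarize` helper are hypothetical and only reuse the column names listed above; the real figures come from the full *quora.csv* file.

```python
import csv
import io

# Hypothetical three-row sample using the column names listed above
sample = io.StringIO(
    "id,qid1,qid2,question1,question2,is_duplicate\n"
    "0,1,2,How do I learn Python?,What is the best way to learn Python?,1\n"
    "1,3,4,What is a VPN?,How do I bake bread?,0\n"
    "2,1,5,How do I learn Python?,Is Python hard to learn?,0\n"
)

def summarize(rows):
    """Count unique question ids, pairs, duplicates and non-duplicates."""
    unique_ids = set()
    pairs = 0
    duplicates = 0
    for row in rows:
        unique_ids.update((row["qid1"], row["qid2"]))
        pairs += 1
        if row["is_duplicate"] == "1":
            duplicates += 1
    return {
        "unique_questions": len(unique_ids),
        "pairs": pairs,
        "duplicates": duplicates,
        "non_duplicates": pairs - duplicates,
    }

stats = summarize(csv.DictReader(sample))
print(stats)
```

Question 1 appears in two pairs of the sample, which is why the number of unique questions is lower than twice the number of pairs, as in the real dataset.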

## Upload the dataset

Upload the file *quora.csv* that will be used in this Python 3 notebook.

In [2]:
from google.colab import files

# Choose the file to upload from your own machine; it should be named "quora.csv"
uploaded = files.upload()
!ls

Saving quora.csv to quora.csv
gdrive	quora.csv  sample_data


In [0]:
!python -m spacy download en_core_web_lg

In [0]:
import csv
import spacy

# Loading the Quora Question Pairs dataset
data_file = 'quora.csv'
new_data = []
with open(data_file, newline='', encoding='utf-8') as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        new_data.append(row)

# Extracting the unique questions from the dataset
uniqueQ = {}
classes = {}
class_count = 0
pairs = {}
pairs_count = 0
duplicate = {}
dup = 0
non_duplicate = {}
ndup = 0
for dd in new_data:
    uniqueQ[dd['qid1']] = dd['question1']
    uniqueQ[dd['qid2']] = dd['question2']
    if dd['qid1'] in classes:
      classes[dd['qid2']] = classes[dd['qid1']]
    elif dd['qid2'] in classes:
      classes[dd['qid1']] = classes[dd['qid2']]
    else:
      class_count += 1
      classes[dd['qid1']] = "Class_" + str(class_count)
      classes[dd['qid2']] = "Class_" + str(class_count)
    pairs_count += 1
    pairs[pairs_count] = (dd['question1'], dd['question2'])
    if (dd['is_duplicate'] == '0'):
      ndup += 1
      non_duplicate[ndup] = (dd['question1'], dd['question2'])
    else:
      dup += 1
      duplicate[dup] = (dd['question1'], dd['question2'])


# Load the large English model, which provides the word vectors used below
nlp = spacy.load('en_core_web_lg')



In [0]:
print(duplicate[17])
print(non_duplicate[3])

('Will a Blu Ray play on a regular DVD player? If so, how?', 'How can you play a Blu Ray DVD on a regular DVD player?')
('How can I increase the speed of my internet connection while using a VPN?', 'How can Internet speed be increased by hacking through DNS?')


In [0]:
# print(len(uniqueQ))
# for i in range(1,30):
#   print(pairs[i])
# print(pairs_count)
import numpy as np

def cosine(vA, vB):
  # Cosine similarity; evaluates to NaN when either vector has zero norm
  cos = np.dot(vA, vB) / (np.sqrt(np.dot(vA, vA)) * np.sqrt(np.dot(vB, vB)))
  return cos.astype('float64')

(a, b) = duplicate[17]
foo1 = nlp(a)
print(foo1)
foo2 = nlp(b)
print(foo2)
print(foo1.similarity(foo2))
calc1 = []
for nn in foo1.noun_chunks:
    print(nn)
    calc1.append(nn.vector)
calc2 = []
for nn in foo2.noun_chunks:
    print(nn)
    calc2.append(nn.vector)
vA = np.sum(calc1, axis=0)
vB = np.sum(calc2, axis=0)
print(cosine(vA, vB))

print('****')

(a, b) = non_duplicate[3]
foo1 = nlp(a)
print(foo1)
foo2 = nlp(b)
print(foo2)
print(foo1.similarity(foo2))
calc1 = []
for nn in foo1.noun_chunks:
    print(nn)
    calc1.append(nn.vector)
calc2 = []
for nn in foo2.noun_chunks:
    print(nn)
    calc2.append(nn.vector)
vA = np.sum(calc1, axis=0)
vB = np.sum(calc2, axis=0)
print(cosine(vA, vB))

Will a Blu Ray play on a regular DVD player? If so, how?
How can you play a Blu Ray DVD on a regular DVD player?
0.9717130630470544
a Blu Ray
a regular DVD player
you
a Blu Ray DVD
a regular DVD player
0.914749264717102
****
How can I increase the speed of my internet connection while using a VPN?
How can Internet speed be increased by hacking through DNS?
0.9290680453645342
I
the speed
my internet connection
a VPN
Internet speed
DNS
0.670642077922821
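
The second score in each block above is the cosine similarity of the summed noun-chunk vectors. When a question has no noun chunks, the sum is the zero vector and the cosine is undefined, which is why the later cells filter out NaN values. The sketch below shows the same computation with an explicit zero-norm guard. It uses plain Python lists so it runs without spaCy or NumPy; the names `cosine_or_none` and `sum_vectors` and the toy 3-dimensional vectors are assumptions made for illustration, not part of the notebook.

```python
import math

def cosine_or_none(v_a, v_b):
    """Cosine similarity of two equal-length vectors, or None if either has zero norm."""
    dot = sum(x * y for x, y in zip(v_a, v_b))
    norm_a = math.sqrt(sum(x * x for x in v_a))
    norm_b = math.sqrt(sum(y * y for y in v_b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Undefined: a question with no noun chunks sums to the zero vector
        return None
    return dot / (norm_a * norm_b)

def sum_vectors(vectors, dim):
    """Element-wise sum of a list of vectors; an empty list gives the zero vector."""
    total = [0.0] * dim
    for vec in vectors:
        for i, value in enumerate(vec):
            total[i] += value
    return total

# Toy 3-dimensional "noun chunk" vectors (made up for illustration)
chunks_a = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
chunks_b = [[2.0, 2.0, 0.0]]

same_direction = cosine_or_none(sum_vectors(chunks_a, 3), sum_vectors(chunks_b, 3))
no_chunks = cosine_or_none(sum_vectors([], 3), sum_vectors(chunks_b, 3))
print(same_direction)  # close to 1: the summed vectors point in the same direction
print(no_chunks)       # None: one side has no noun chunks
```

Returning `None` instead of NaN is a design choice of this sketch; it makes the undefined case explicit at the call site rather than relying on a later `np.isnan` check.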


In [0]:
sim_duplicate = []
sim_duplicate2 = []
for i in range(1, 5000):
  (a, b) = duplicate[i]
  foo1 = nlp(a)
  foo2 = nlp(b)
  

  calc1 = [np.zeros(300, dtype='float64')]
  for nn in foo1.noun_chunks:
#       print(nn)
      calc1.append(nn.vector)
  calc2 = [np.zeros(300, dtype='float64')]
  for nn in foo2.noun_chunks:
#       print(nn)
      calc2.append(nn.vector)
  vA = np.sum(calc1, axis=0).astype('float64')
  vB = np.sum(calc2, axis=0).astype('float64')
  cos = cosine(vA, vB)
  # Skip pairs where a question has no noun chunks (zero vector, so the cosine is NaN)
  if not np.isnan(cos):
    sim_duplicate.append(foo1.similarity(foo2))
    sim_duplicate2.append(cos)

print(min(sim_duplicate))
print(max(sim_duplicate))
print(np.average(sim_duplicate))
print('***')
print(min(sim_duplicate2))
print(max(sim_duplicate2))
print(np.average(sim_duplicate2))

0.6468996546040794
1.0000000858950187
0.9465413936763243
***
-0.008805530308907383
1.0000000000000004
0.8805787954723814


In [0]:
duplicate[1]

('Astrology: I am a Capricorn Sun Cap moon and cap rising...what does that say about me?',
 "I'm a triple Capricorn (Sun, Moon and ascendant in Capricorn) What does this say about me?")

In [0]:
sim_nonduplicate = []
sim_nonduplicate2 = []
for i in range(1, 5000):
  (a, b) = non_duplicate[i]
  foo1 = nlp(a)
  foo2 = nlp(b)
  
  
  calc1 = [np.zeros(300, dtype='float64')]
  for nn in foo1.noun_chunks:
#       print(nn)
#       print(nn.vector)
      calc1.append(nn.vector)
  calc2 = [np.zeros(300, dtype='float64')]
  for nn in foo2.noun_chunks:
#       print(nn)
#       print(nn.vector)
      calc2.append(nn.vector)
  vA = np.sum(calc1, axis=0).astype('float64')
  vB = np.sum(calc2, axis=0).astype('float64')
  cos = cosine(vA, vB)
  if not (np.isnan(cos)):
    sim_nonduplicate2.append(cos)
    sim_nonduplicate.append(foo1.similarity(foo2))
#   print('Cosine: ', np.isnan(cosine(vA, vB)))
#   print('****')
#   print(i)
#   print(sim_nonduplicate2)

print(min(sim_nonduplicate))
print(max(sim_nonduplicate))
print(np.average(sim_nonduplicate))
print('****')
print(min(sim_nonduplicate2))
print(max(sim_nonduplicate2))
print(np.average(sim_nonduplicate2))

0.29264479268934257
1.0000001169458588
0.8947737652759219
****
-0.1132305785162487
1.0000000000000002
0.7746351169036362
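
The duplicate and non-duplicate cells above run the same loop and print the same three statistics (minimum, maximum, mean). One way to avoid the repetition is a helper that takes the scoring function as a parameter, sketched below. So that the example runs without spaCy, the score used here is a simple token-overlap (Jaccard) ratio, which is only a stand-in and not the embedding similarity used in this notebook; `summarize_similarities`, `token_overlap` and the toy pairs are all hypothetical.

```python
from statistics import mean

def summarize_similarities(pairs, similarity):
    """Return (min, max, mean) of similarity(a, b) over pairs, skipping undefined scores."""
    scores = []
    for a, b in pairs:
        score = similarity(a, b)
        # Drop None and NaN (NaN is the only value not equal to itself)
        if score is not None and score == score:
            scores.append(score)
    return min(scores), max(scores), mean(scores)

def token_overlap(a, b):
    """Stand-in score: Jaccard overlap of lowercase tokens, None if both are empty."""
    tokens_a = set(a.lower().split())
    tokens_b = set(b.lower().split())
    union = tokens_a | tokens_b
    return len(tokens_a & tokens_b) / len(union) if union else None

toy_pairs = [
    ("how do i learn python", "how can i learn python"),
    ("what is a vpn", "how do i bake bread"),
    ("", ""),  # undefined score, skipped like the NaN cases above
]

low, high, avg = summarize_similarities(toy_pairs, token_overlap)
print(low, high, avg)
```

With spaCy loaded, the same helper could be called once per class with a function that wraps `nlp(a).similarity(nlp(b))` or the noun-chunk cosine.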


In [0]:
toCSV1 = []
toCSV2 = []
for i in sim_duplicate:
  toCSV1.append({'Class':'Duplicate', 'Similarity_Metric':i})
for i in sim_duplicate2:
  toCSV2.append({'Class':'Duplicate', 'Similarity_Metric':i})
for i in sim_nonduplicate:
  toCSV1.append({'Class':'NonDuplicate', 'Similarity_Metric':i})
for i in sim_nonduplicate2:
  toCSV2.append({'Class':'NonDuplicate', 'Similarity_Metric':i})

In [0]:
with open('full_questions.csv', 'w') as csvfile:
  fieldnames = ['Class', 'Similarity_Metric']
  writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
  writer.writeheader()
  for i in range(len(toCSV1)):
    writer.writerow(toCSV1[i])

In [0]:
with open('noun_chunks.csv', 'w') as csvfile:
  fieldnames = ['Class', 'Similarity_Metric']
  writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
  writer.writeheader()
  for i in range(len(toCSV2)):
    writer.writerow(toCSV2[i])

In [0]:
# for start in range(400000, len(uniqueQ), 5000):
for start in range(0, 1000, 1000):
    end = min(len(uniqueQ), start+1000)
    toCSV = []
    for idQ in list(uniqueQ.keys())[start:end]:
        foo = nlp(uniqueQ[idQ])
        boo = {}
        boo['QID'] = idQ
        boo['Question'] = uniqueQ[idQ].replace("\"", "").replace("\'", "")
        boo['Class'] = classes[idQ]
        for i in range(300):
            boo['vector_' + str(i+1)] = foo.vector[i]
        toCSV.append(boo)

    print('Writing file: processed_quora_'+str(start)+'.csv')
    with open('processed_quora_'+str(start)+'.csv', 'w') as csvfile:
        fieldnames = ['QID', 'Question', 'Class']
        for i in range(300):
            fieldnames.append('vector_' + str(i+1))
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
        writer.writeheader()
        for i in range(len(toCSV)):
            writer.writerow(toCSV[i])

Writing file: processed_quora_0.csv
