<a href="https://colab.research.google.com/github/kzafeiroudi/QuestRecommend/blob/master/Preprocessing_SQuAD.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Preprocessing the SQuAD dataset

## The dataset

After looking into the Quora Question Pairs dataset, it is now time to apply the developed process to the SQuAD dataset. The part of the SQuAD dataset that we are going to use here consists of:

*  3111 unique questions
*  30 unique article topics from Wikipedia

Each instance of this part of the dataset has the following attributes:
* sentID : A unique identifier for each question
* Question: The full text of the question
* Class: The article title of the Wikipedia page the question was derived from, used here as a nominal class

## Upload the dataset

Upload the file `squad.csv` that will be used in this Python 3 notebook.

In [1]:
from google.colab import files

# Choose from your own machine the file to upload - name should be "squad.csv"
uploaded = files.upload()

Saving squad.csv to squad.csv


## Download a pre-trained English language model

We will be using the large pre-trained statistical model for English provided by **spaCy**, a free open-source library for NLP in Python. Find more [here](https://spacy.io/models/en#en_core_web_lg).

In [2]:
!python -m spacy download en_core_web_lg

# Load the model
import spacy
nlp = spacy.load('en_core_web_lg')

Collecting en_core_web_lg==2.0.0 from https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-2.0.0/en_core_web_lg-2.0.0.tar.gz#egg=en_core_web_lg==2.0.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-2.0.0/en_core_web_lg-2.0.0.tar.gz (852.3MB)
     |████████████████████████████████| 852.3MB 1.2MB/s 
Building wheels for collected packages: en-core-web-lg
  Building wheel for en-core-web-lg (setup.py) ... done
  Stored in directory: /tmp/pip-ephem-wheel-cache-ppjihu0k/wheels/0d/bc/67/e6a9108ab86cd076703af19ad4e0f02f57381ac6583df16249
Successfully built en-core-web-lg
Installing collected packages: en-core-web-lg
Successfully installed en-core-web-lg-2.0.0

    Linking successful
    /usr/local/lib/python3.6/dist-packages/en_core_web_lg -->
    /usr/local/lib/python3.6/dist-packages/spacy/data/en_core_web_lg

    You can now load the model via spacy.load('en_core_web_lg')



## Importing Python libraries

In [0]:
import csv
import numpy as np
import random
from prettytable import PrettyTable

## Load the dataset

In [15]:
# Loading the SQuAD dataset
data_file = 'squad.csv'
new_data = []
with open(data_file) as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        new_data.append(row)

# Extracting the unique questions from the dataset, and all unique topics
uniqueQ = []
topics = {}
for dd in new_data:
    uniqueQ.append(dd['Question'])
    if (dd['Class'] in topics):
      topics[dd['Class']].append(dd['Question'])
    else:
      topics[dd['Class']] = [dd['Question']]

# Print stats for the dataset
print('Unique Questions: ', len(uniqueQ))
print('Unique Topics: ', len(topics.keys()))
print()

# Print stats for each topic
t = PrettyTable(['Topic', '# of Questions'])
for tt in topics:
  t.add_row([tt, len(topics[tt])])
print(t)

Unique Questions:  3111
Unique Topics:  30

+---------------------------------+----------------+
|              Topic              | # of Questions |
+---------------------------------+----------------+
|          Super_Bowl_50          |      106       |
|              Warsaw             |      104       |
|             Normans             |      101       |
|           Nikola_Tesla          |      108       |
| Computational_complexity_theory |      103       |
|             Teacher             |      103       |
|          Martin_Luther          |      102       |
|       Southern_California       |      101       |
|       Sky_(United_Kingdom)      |      103       |
|       Victoria_(Australia)      |      104       |
|             Huguenot            |      104       |
|           Steam_engine          |      105       |
|              Oxygen             |      105       |
|         1973_oil_crisis         |      102       |
|          Apollo_program         |      103       |
| 
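
The grouping step in the cell above can also be written with `collections.defaultdict`, which removes the explicit membership test. The following is a minimal sketch on a hypothetical three-row sample; the sample rows are made up for illustration, and only the column names (`sentID`, `Question`, `Class`) are taken from the dataset description.

```python
import csv
import io
from collections import defaultdict

# Hypothetical three-row sample with the same columns as squad.csv
sample_csv = io.StringIO(
    "sentID,Question,Class\n"
    "1,Who founded Warsaw?,Warsaw\n"
    "2,What is oxygen?,Oxygen\n"
    "3,Where is Warsaw located?,Warsaw\n"
)

# Group questions by topic; defaultdict creates the empty list on first access
questions_by_topic = defaultdict(list)
for row in csv.DictReader(sample_csv):
    questions_by_topic[row['Class']].append(row['Question'])

for topic, questions in questions_by_topic.items():
    print(topic, len(questions))
```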

## Implementing a cosine similarity function

`cosine(vA, vB)` calculates the cosine similarity between two vectors `vA` and `vB`.

In [0]:
def cosine(vA, vB):
  cos = np.dot(vA, vB) / (np.sqrt(np.dot(vA, vA)) * np.sqrt(np.dot(vB, vB)))
  return cos.astype('float64')
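
As a quick sanity check, the function can be tried on vectors whose similarity is known in advance. The sketch below repeats the definition so that it runs on its own; the three test vectors are illustrative choices, not data from the notebook.

```python
import numpy as np

def cosine(vA, vB):
    # Same formula as above: dot product divided by the product of the norms
    cos = np.dot(vA, vB) / (np.sqrt(np.dot(vA, vA)) * np.sqrt(np.dot(vB, vB)))
    return cos.astype('float64')

# Parallel vectors should give 1, orthogonal ones 0, opposite ones -1
parallel = cosine(np.array([1.0, 2.0, 3.0]), np.array([2.0, 4.0, 6.0]))
orthogonal = cosine(np.array([1.0, 0.0]), np.array([0.0, 1.0]))
opposite = cosine(np.array([1.0, 1.0]), np.array([-1.0, -1.0]))
print(parallel, orthogonal, opposite)
```

Note that an all-zero input makes the denominator zero, so the function returns `nan` in that case.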

## Calculating the question vector representation

We will use only the main verb arguments of the question (extracted here as spaCy noun chunks) to calculate the question embeddings, since the process we followed for the Quora Question Pairs dataset showed that questions are then less likely to be considered duplicates when they are not.

The final result is saved in the file `squad_vectors.csv`, which will then be used to perform classification and clustering.

In [0]:
squad = []
for i in range(len(new_data)):
  foo = nlp(new_data[i]['Question'])
  calc = [np.zeros(300, dtype='float64')]
  for nn in foo.noun_chunks:
      calc.append(nn.vector)
  vA = np.sum(calc, axis=0).astype('float64')
  squad.append(vA)

toCSV = []
for i in range(len(new_data)):
  boo = new_data[i]
  boo['Question'] = boo['Question'].replace('\"', '').replace('\'','')
  for j in range(300):
    boo['vector_' + str(j+1)] = squad[i][j]
  toCSV.append(boo)

# newline='' is what the csv module recommends when opening files for writing
with open('squad_vectors.csv', 'w', newline='') as csvfile:
  fieldnames = list(toCSV[0].keys())
  writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
  writer.writeheader()
  for i in range(len(toCSV)):
    writer.writerow(toCSV[i])
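
For the later classification and clustering steps, the saved file has to be read back into a numeric matrix. The following is a rough sketch of one way to do that, run on a tiny stand-in file so that it is self-contained. The demo file name, the two sample rows, and the reduced dimension `DIM = 4` are assumptions for illustration; the notebook itself writes 300 `vector_*` columns.

```python
import csv
import os
import tempfile
import numpy as np

DIM = 4  # the notebook uses 300; a small value keeps the demo readable

# Write a tiny stand-in file with the same column layout as squad_vectors.csv
path = os.path.join(tempfile.mkdtemp(), 'squad_vectors_demo.csv')
fieldnames = ['sentID', 'Question', 'Class'] + ['vector_' + str(j + 1) for j in range(DIM)]
rows = [
    {'sentID': '1', 'Question': 'Who founded Warsaw?', 'Class': 'Warsaw'},
    {'sentID': '2', 'Question': 'What is oxygen?', 'Class': 'Oxygen'},
]
for k, row in enumerate(rows):
    for j in range(DIM):
        row['vector_' + str(j + 1)] = float(k + j)

with open(path, 'w', newline='') as csvfile:
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)

# Read it back: csv returns strings, so the vector columns need converting
labels, vectors = [], []
with open(path, newline='') as csvfile:
    for row in csv.DictReader(csvfile):
        labels.append(row['Class'])
        vectors.append([float(row['vector_' + str(j + 1)]) for j in range(DIM)])

X = np.array(vectors, dtype='float64')
print(X.shape, labels)
```

Here `X` holds one row per question and `labels` holds the matching class names in the same order.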

## Testing whether the dataset is large enough for the task

In this section, we calculate the mean vector representation across all questions that belong to the same class, showing that these vectors do not coincide, and thus that we have a heterogeneous dataset good enough for the task.

As we can see by running this code, there is no occurrence of *Same means* in the standard output.

In [0]:
# classes: all different classes
classes = list(topics.keys())

# calc: to calculate the mean vector representation per class
calc = {}
for cc in classes:
  calc[cc] = []

for i in range(len(new_data)):
  calc[new_data[i]['Class']].append(squad[i])

# means: to store the mean vector representation of each class
means = {}

for cc in classes:
  means[cc] = np.mean(calc[cc], axis=0)

for i in range(len(classes)):
  for j in range(i+1, len(classes)):
    if (any(means[classes[i]] == means[classes[j]])):
      print('Same means')
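
The check above uses `any`, so it flags two classes as soon as a single vector component matches, which is stricter than testing whether the whole vectors coincide. A possible alternative, sketched here on hypothetical toy means (`demo_means` is made up and stands in for `means`), compares whole vectors with `np.allclose` and collects the offending pairs.

```python
import itertools
import numpy as np

# Hypothetical class means; in the notebook these would come from `means`
demo_means = {
    'A': np.array([1.0, 0.0, 2.0]),
    'B': np.array([0.5, 1.0, 2.0]),
    'C': np.array([1.0, 0.0, 2.0]),  # deliberately identical to 'A'
}

identical_pairs = []
for a, b in itertools.combinations(demo_means, 2):
    # np.allclose tolerates floating-point noise; all components must match
    if np.allclose(demo_means[a], demo_means[b]):
        identical_pairs.append((a, b))

print(identical_pairs)
```

In this toy data, classes `A` and `B` share their last component, so the `any`-based check would flag them, whereas the whole-vector comparison only reports `A` and `C`.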

## Evaluating the results if we were to rely completely on the cosine similarity

For this task, we will randomly pick 1000 questions and calculate the vector representation on the full body of each question. By calculating the cosine similarity between split1 (340 questions) and split2 (660 questions), we will show which questions the process developed so far suggests grouping together.

In [0]:
# Shuffling the sequence of questions
new = list(range(len(new_data)))
random.shuffle(new)

# Picking randomly 340 for split1
# and 660 for split2
split1 = {}
split2 = {}
for i in new[:340]:
  q = new_data[i]['Question']
  split1[q] = {'vec': nlp(q), 'class' : new_data[i]['Class']}
for i in new[340:1000]:
  q = new_data[i]['Question']
  split2[q] = {'vec': nlp(q), 'class' : new_data[i]['Class']}

# Calculate the question vector on the full body of the question
# and save the questions from split2 that are closely related to 
# questions from split1 (cosine similarity > 0.93 as per our first
# classification task)
rank = {}
for tt in split1:
  rank[tt] = []
  vA = split1[tt]['vec'].vector
  for tr in split2:
    vB = split2[tr]['vec'].vector
    cos = cosine(vA, vB)
    if (cos > 0.93):
      rank[tt].append(tr)
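
The nested loop above calls `cosine` once per question pair. If the vectors are stacked into matrices, the same thresholding can be done with a single matrix product. The following is only a sketch under that assumption: `cosine_matrix` is a hypothetical helper, and the toy 2-dimensional vectors stand in for the question vectors of split1 and split2.

```python
import numpy as np

def cosine_matrix(A, B):
    # Normalise each row to unit length, then one matrix product
    # yields the cosine similarity of every row of A with every row of B
    A_unit = A / np.linalg.norm(A, axis=1, keepdims=True)
    B_unit = B / np.linalg.norm(B, axis=1, keepdims=True)
    return A_unit @ B_unit.T

# Hypothetical toy vectors standing in for the split1 / split2 question vectors
split1_vecs = np.array([[1.0, 0.0], [0.0, 1.0]])
split2_vecs = np.array([[2.0, 0.1], [1.0, 1.0], [0.0, 3.0]])

sims = cosine_matrix(split1_vecs, split2_vecs)
# For every split1 row, indices of split2 rows above the 0.93 threshold
related = [np.flatnonzero(row > 0.93).tolist() for row in sims]
print(related)
```

Each entry of `related` lists the positions in split2 that would be appended to `rank` for the corresponding split1 question.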

In [75]:
for q in rank:
  if(rank[q] != []):
    print('Input Question:')
    print('\t-',q)
    print('Related Questions:')
    for rr in rank[q]:
      print('\t-', rr)
    print()

Input Question:
	- What did Tesla dress in while in Tominaj?
Related Questions:
	- What did Tesla struggle with while in school? 

Input Question:
	- Other than Point Conception, what landmark is used in the other definition of southern California?
Related Questions:
	- Point Conception is an example of a landmark among what boundary of southern California?

Input Question:
	- What is one part of the innate immune system that doesnt attack microbes directly?
Related Questions:
	- What does the immune system protect against?

Input Question:
	- What is one issue that adds to the complexity of a pharmacists job?
Related Questions:
	- What is one example of an instance that the quantitative answer to the traveling salesman problem fails to answer?
	- What is one example of what a clinical pharmacists duties entail?

Input Question:
	- In a 4-cylinder compound engine, what degree were the individual pistons balanced at?
Related Questions:
	- At what angle were the groups of pistons set in 