<a href="https://colab.research.google.com/github/kzafeiroudi/QuestRecommend/blob/master/TrainingOnQuora.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Deciding the right vector representation

In this section, we train two simple decision-tree classifiers to identify the similarity-metric threshold that best separates duplicate question pairs from non-duplicate ones.

## Upload the two datasets

Upload the files `full_questions.csv` and `noun_chunks.csv` that will be used in this Python 3 notebook.

In [2]:
from google.colab import files

# Choose from your own machine the files to upload - names should be 
# "full_questions.csv" and "noun_chunks.csv"
uploaded = files.upload()
uploaded = files.upload()

Saving full_questions.csv to full_questions (1).csv


Saving noun_chunks.csv to noun_chunks.csv
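Before moving on, it may help to confirm that both files are present under the expected names, since the upload output above shows that a file can be saved under a suffixed name such as `full_questions (1).csv`. The following is a minimal sketch, not part of the original notebook; `missing_files` and `EXPECTED_FILES` are hypothetical names introduced here for illustration.

```python
import os

# File names the rest of the notebook expects to find (taken from the text above).
EXPECTED_FILES = ["full_questions.csv", "noun_chunks.csv"]

def missing_files(directory=".", expected=EXPECTED_FILES):
    """Return the expected file names that are not present in `directory`."""
    return [name for name in expected
            if not os.path.isfile(os.path.join(directory, name))]

# An empty list means both files can be loaded by name later on.
print("Missing files:", missing_files())
```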


## Download all important packages

We will be using the **Weka** machine learning software through a Python wrapper. The following commands are necessary to install the packages before we use the software.

In [0]:
# pygraphviz
!wget https://anaconda.org/anaconda/pygraphviz/1.3/download/linux-64/pygraphviz-1.3-py36h14c3975_1.tar.bz2
!tar xvjf pygraphviz-1.3-py36h14c3975_1.tar.bz2
!cp -r lib/python3.6/site-packages/* /usr/local/lib/python3.6/dist-packages/

import pygraphviz

In [0]:
%%bash
# Install deps from 
# https://github.com/mwydmuch/ViZDoom/blob/master/doc/Building.md#-linux
apt-get update
apt-get install build-essential zlib1g-dev libsdl2-dev libjpeg-dev \
nasm tar libbz2-dev libgtk2.0-dev cmake git libfluidsynth-dev libgme-dev \
libopenal-dev timidity libwildmidi-dev unzip

# Boost libraries
apt-get install libboost-all-dev

# Lua binding dependencies
apt-get install liblua5.1-dev

In [7]:
import os       #importing os to set environment variable
def install_java():
  !apt-get install -y openjdk-8-jdk-headless -qq > /dev/null      #install openjdk
  os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"     #set environment variable
  !java -version       #check java version
install_java()

openjdk version "11.0.2" 2019-01-15
OpenJDK Runtime Environment (build 11.0.2+9-Ubuntu-3ubuntu118.04.3)
OpenJDK 64-Bit Server VM (build 11.0.2+9-Ubuntu-3ubuntu118.04.3, mixed mode, sharing)


In [0]:
!apt-get install libproj-dev proj-data proj-bin
!apt-get install libgeos-dev
!pip install cython
!pip install python-weka-wrapper3

## Importing Python libraries

In [9]:
import weka.core.jvm as jvm
jvm.start()

from weka.core.converters import Loader
from weka.core.classes import Random
from weka.classifiers import Classifier, Evaluation

DEBUG:weka.core.jvm:Adding bundled jars
DEBUG:weka.core.jvm:Classpath=['/usr/local/lib/python3.6/dist-packages/javabridge/jars/rhino-1.7R4.jar', '/usr/local/lib/python3.6/dist-packages/javabridge/jars/runnablequeue.jar', '/usr/local/lib/python3.6/dist-packages/javabridge/jars/cpython.jar', '/usr/local/lib/python3.6/dist-packages/weka/lib/weka.jar', '/usr/local/lib/python3.6/dist-packages/weka/lib/python-weka-wrapper.jar']
DEBUG:weka.core.jvm:MaxHeapSize=default
DEBUG:weka.core.jvm:Package support disabled


## Load the dataset `full_questions.csv`

We first train a J48 tree classifier on the similarity metrics obtained when the question vector representations are built from the full text of each question.

In [33]:
loader = Loader(classname="weka.core.converters.CSVLoader")
data_file = 'full_questions.csv'
data = loader.load_file(data_file)

print('Sample size: ', data.num_instances)

Sample size:  9888


## Train the classifier

We classify on the nominal attribute **Class**. We first split the dataset into training and test sets, with 66% of the instances going to the training split.

For the J48 classifier, which generates a pruned C4.5 decision tree, we use a pruning confidence factor of 0.25.

The resulting decision tree can be seen in the standard output.

In [34]:
print('Classifying on: ', data.attribute(0))
data.class_index = 0
train, test = data.train_test_split(66.0, Random(1))


cls = Classifier(classname="weka.classifiers.trees.J48", options=["-C", "0.25"])
cls.build_classifier(train)

print(cls)

Classifying on:  @attribute Class {Duplicate,NonDuplicate}
J48 pruned tree
------------------

Similarity_Metric <= 0.927621: NonDuplicate (2799.0/814.0)
Similarity_Metric > 0.927621: Duplicate (3727.0/1267.0)

Number of Leaves  : 	2

Size of the tree : 	3
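
In J48's output, each leaf is annotated as `(N/E)`, where N is the number of training instances that reach the leaf and E is how many of those the leaf's label misclassifies. The sketch below uses the counts printed above to derive each leaf's purity and the overall training accuracy. `leaf_purity` and the `leaves` dictionary are hypothetical names introduced for illustration only.

```python
def leaf_purity(reached, misclassified):
    """Fraction of training instances at a leaf that carry the leaf's label."""
    return (reached - misclassified) / reached

# (reached, misclassified) counts copied from the J48 output above.
leaves = {
    "NonDuplicate (Similarity_Metric <= 0.927621)": (2799.0, 814.0),
    "Duplicate (Similarity_Metric > 0.927621)": (3727.0, 1267.0),
}

total = sum(reached for reached, _ in leaves.values())
correct = sum(reached - wrong for reached, wrong in leaves.values())
training_accuracy = correct / total

for label, (reached, wrong) in leaves.items():
    print(label, "purity:", round(leaf_purity(reached, wrong), 4))
print("Training instances:", int(total))
print("Training accuracy:", round(training_accuracy, 4))
```

The two leaf purities (roughly 0.709 and 0.660) are the same values that show up as the majority-class probability in the per-instance distributions printed later in the notebook.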



## Evaluating the classifier

We evaluate the trained model against the test split.

In [35]:
evl = Evaluation(train)
evl.test_model(cls, test)
print(evl.summary())


Correctly Classified Instances        2229               66.2998 %
Incorrectly Classified Instances      1133               33.7002 %
Kappa statistic                          0.3264
Mean absolute error                      0.4397
Root mean squared error                  0.4718
Relative absolute error                 87.9477 %
Root relative squared error             94.3515 %
Total Number of Instances             3362     
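
The Kappa statistic in the summary compares the observed accuracy with the agreement that would be expected by chance. The sketch below computes Cohen's kappa from a confusion matrix. The notebook does not print the actual confusion matrix, so the matrix used here is invented purely for illustration, and `cohen_kappa` is a hypothetical helper rather than part of the Weka wrapper.

```python
def cohen_kappa(matrix):
    """Cohen's kappa for a square confusion matrix (rows: actual, columns: predicted)."""
    size = len(matrix)
    total = sum(sum(row) for row in matrix)
    # Observed agreement: share of instances on the diagonal.
    observed = sum(matrix[i][i] for i in range(size)) / total
    # Chance agreement: product of row and column marginals, summed over classes.
    expected = sum(
        sum(matrix[i]) * sum(row[i] for row in matrix)
        for i in range(size)
    ) / (total * total)
    return (observed - expected) / (1 - expected)

# Illustrative values only, NOT taken from the notebook's results.
toy_matrix = [[40, 10],
              [20, 30]]
print("Kappa on the toy matrix:", round(cohen_kappa(toy_matrix), 4))
```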



## Actual predictions on the test split

We print the first 10 predictions on the test split.

In [36]:
import itertools

# Print the actual class, predicted class, error flag, and class
# distribution for the first 10 instances of the test split.
print("# - actual - predicted - error - distribution")
for index, inst in enumerate(itertools.islice(test, 10)):
  pred = cls.classify_instance(inst)
  dist = cls.distribution_for_instance(inst)
  print(
      "%d - %s - %s - %s  - %s" %
      (index+1,
       inst.get_string_value(inst.class_index),
       inst.class_attribute.value(int(pred)),
       "yes" if pred != inst.get_value(inst.class_index) else "no",
       str(dist.tolist())))

# - actual - predicted - error - distribution
1 - NonDuplicate - NonDuplicate - no  - [0.29081814933904965, 0.7091818506609503]
2 - Duplicate - Duplicate - no  - [0.6600482962167964, 0.33995170378320366]
3 - NonDuplicate - NonDuplicate - no  - [0.29081814933904965, 0.7091818506609503]
4 - Duplicate - NonDuplicate - yes  - [0.29081814933904965, 0.7091818506609503]
5 - NonDuplicate - NonDuplicate - no  - [0.29081814933904965, 0.7091818506609503]
6 - NonDuplicate - NonDuplicate - no  - [0.29081814933904965, 0.7091818506609503]
7 - Duplicate - NonDuplicate - yes  - [0.29081814933904965, 0.7091818506609503]
8 - NonDuplicate - Duplicate - yes  - [0.6600482962167964, 0.33995170378320366]
9 - Duplicate - Duplicate - no  - [0.6600482962167964, 0.33995170378320366]
10 - NonDuplicate - NonDuplicate - no  - [0.29081814933904965, 0.7091818506609503]


## Load the dataset `noun_chunks.csv`

We now train a J48 tree classifier on the similarity metrics obtained when the question vector representations are built from the main verb arguments extracted from each question.

In [37]:
data_file = 'noun_chunks.csv'
data = loader.load_file(data_file)

print('Sample size: ', data.num_instances)

Sample size:  9888


## Train the classifier

We classify on the nominal attribute **Class**. We first split the dataset into training and test sets, with 66% of the instances going to the training split.

For the J48 classifier, which generates a pruned C4.5 decision tree, we use a pruning confidence factor of 0.25.

The resulting decision tree can be seen in the standard output.

In [38]:
print('Classifying on: ', data.attribute(0))
data.class_index = 0
train, test = data.train_test_split(66.0, Random(1))

cls = Classifier(classname="weka.classifiers.trees.J48", options=["-C", "0.25"])
cls.build_classifier(train)

print(cls)

Classifying on:  @attribute Class {Duplicate,NonDuplicate}
J48 pruned tree
------------------

Similarity_Metric <= 0.883042: NonDuplicate (3483.0/1259.0)
Similarity_Metric > 0.883042: Duplicate (3043.0/1028.0)

Number of Leaves  : 	2

Size of the tree : 	3



## Evaluating the classifier

We evaluate the trained model against the test split.

In [39]:
evl = Evaluation(train)
evl.test_model(cls, test)
print(evl.summary())


Correctly Classified Instances        2125               63.2064 %
Incorrectly Classified Instances      1237               36.7936 %
Kappa statistic                          0.264 
Mean absolute error                      0.4603
Root mean squared error                  0.4825
Relative absolute error                 92.0636 %
Root relative squared error             96.5002 %
Total Number of Instances             3362     



## Actual predictions on the test split

We print the first 10 predictions on the test split.

In [40]:
import itertools

# Print the actual class, predicted class, error flag, and class
# distribution for the first 10 instances of the test split.
print("# - actual - predicted - error - distribution")
for index, inst in enumerate(itertools.islice(test, 10)):
  pred = cls.classify_instance(inst)
  dist = cls.distribution_for_instance(inst)
  print(
      "%d - %s - %s - %s  - %s" %
      (index+1,
       inst.get_string_value(inst.class_index),
       inst.class_attribute.value(int(pred)),
       "yes" if pred != inst.get_value(inst.class_index) else "no",
       str(dist.tolist())))

# - actual - predicted - error - distribution
1 - NonDuplicate - NonDuplicate - no  - [0.36146999712891187, 0.6385300028710882]
2 - Duplicate - NonDuplicate - yes  - [0.36146999712891187, 0.6385300028710882]
3 - NonDuplicate - NonDuplicate - no  - [0.36146999712891187, 0.6385300028710882]
4 - Duplicate - NonDuplicate - yes  - [0.36146999712891187, 0.6385300028710882]
5 - NonDuplicate - NonDuplicate - no  - [0.36146999712891187, 0.6385300028710882]
6 - NonDuplicate - NonDuplicate - no  - [0.36146999712891187, 0.6385300028710882]
7 - Duplicate - NonDuplicate - yes  - [0.36146999712891187, 0.6385300028710882]
8 - NonDuplicate - Duplicate - yes  - [0.6621754847190273, 0.33782451528097274]
9 - Duplicate - Duplicate - no  - [0.6621754847190273, 0.33782451528097274]
10 - NonDuplicate - Duplicate - yes  - [0.6621754847190273, 0.33782451528097274]


## Discussion

Both classifiers reach an accuracy of 63-66%, which is not very high. This was expected: duplicate and non-duplicate question pairs both show high correlation in the vocabulary used, so they are hard to separate. The main purpose of training a decision tree model was to find the value of the similarity metric that best separates the two classes, duplicates from non-duplicates.

When the question vector representation is built from the full text of the question, the best separation occurs at a cosine similarity of **0.93**. When only the main verb arguments of the question are used to build the representation, the cosine similarity that best separates duplicate from non-duplicate questions is **0.88**.
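
As a rough illustration of how these thresholds could be applied, the sketch below computes the cosine similarity of two vectors and labels the pair with the same rule as the one-split trees above. The notebook does not show how the question vectors are built, so the vectors here are toy values, and `cosine_similarity` and `predict_label` are hypothetical helpers.

```python
import math

# Split points copied from the two J48 trees above.
FULL_TEXT_THRESHOLD = 0.927621
NOUN_CHUNK_THRESHOLD = 0.883042

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)

def predict_label(similarity, threshold):
    """Mirror the tree rule: above the threshold means Duplicate."""
    return "Duplicate" if similarity > threshold else "NonDuplicate"

# Toy vectors for illustration only; real question vectors are assumed
# to come from the representation step that precedes this notebook.
similarity = cosine_similarity([1.0, 2.0, 3.0], [1.0, 2.0, 2.5])
print(predict_label(similarity, FULL_TEXT_THRESHOLD))
```

A pair with a similarity of 0.90, for example, falls below the full-text threshold but above the main-verb-argument threshold, so the two trees would label it differently.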

The main purpose of this exercise was to show that we can indeed achieve a decrease in the cosine similarity between questions that are not marked as duplicates.
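
One way to check this directly would be to compare the mean similarity per class in the two files. The sketch below assumes that each CSV has a header row with columns named `Class` and `Similarity_Metric` (the attribute names shown in the Weka output); `mean_similarity_by_class` is a hypothetical helper and not part of the original notebook.

```python
import csv
from collections import defaultdict

def mean_similarity_by_class(path):
    """Return {class label: mean Similarity_Metric} for one CSV file."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    with open(path, newline="") as handle:
        for row in csv.DictReader(handle):
            # Column names are assumed to match the Weka attribute names.
            sums[row["Class"]] += float(row["Similarity_Metric"])
            counts[row["Class"]] += 1
    return {label: sums[label] / counts[label] for label in sums}

# Example usage once both files are available in the working directory:
# for name in ("full_questions.csv", "noun_chunks.csv"):
#     print(name, mean_similarity_by_class(name))
```

A lower mean for `NonDuplicate` in `noun_chunks.csv` than in `full_questions.csv` would be consistent with the statement above.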