
# Classifying the questions

In this section, we train a Random Forest classifier to check whether the vector representation we picked for the questions is good enough for a model to identify the topic a question belongs to.

## Upload the dataset

Upload the file `squad_vectors.csv` that will be used in this Python 3 notebook.

In [62]:
from google.colab import files

# Choose the file to upload from your own machine - its name should be
# "squad_vectors.csv"
uploaded = files.upload()

Saving squad_vectors.csv to squad_vectors.csv


## Download all important packages

We will be using the **Weka** machine learning software through a Python wrapper. The following commands are necessary to install the packages before we use the software.

In [0]:
# pygraphviz
!wget https://anaconda.org/anaconda/pygraphviz/1.3/download/linux-64/pygraphviz-1.3-py36h14c3975_1.tar.bz2
!tar xvjf pygraphviz-1.3-py36h14c3975_1.tar.bz2
!cp -r lib/python3.6/site-packages/* /usr/local/lib/python3.6/dist-packages/

import pygraphviz

In [0]:
%%bash
# Install deps from 
# https://github.com/mwydmuch/ViZDoom/blob/master/doc/Building.md#-linux
apt-get update
apt-get install build-essential zlib1g-dev libsdl2-dev libjpeg-dev \
nasm tar libbz2-dev libgtk2.0-dev cmake git libfluidsynth-dev libgme-dev \
libopenal-dev timidity libwildmidi-dev unzip

# Boost libraries
apt-get install libboost-all-dev

# Lua binding dependencies
apt-get install liblua5.1-dev

In [4]:
import os       #importing os to set environment variable
def install_java():
  !apt-get install -y openjdk-8-jdk-headless -qq > /dev/null      #install openjdk
  os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"     #set environment variable
  !java -version       #check java version
install_java()

openjdk version "11.0.2" 2019-01-15
OpenJDK Runtime Environment (build 11.0.2+9-Ubuntu-3ubuntu118.04.3)
OpenJDK 64-Bit Server VM (build 11.0.2+9-Ubuntu-3ubuntu118.04.3, mixed mode, sharing)


In [0]:
!apt-get install libproj-dev proj-data proj-bin
!apt-get install libgeos-dev
!pip install cython
!pip install python-weka-wrapper3

## Importing Python libraries

In [1]:
import weka.core.jvm as jvm
jvm.start()

from weka.core.converters import Loader, Saver
from weka.core.classes import Random
from weka.classifiers import Classifier, Evaluation
from weka.core.dataset import Instances, Instance, Attribute
from weka.filters import Filter
from weka.attribute_selection import ASSearch, ASEvaluation, AttributeSelection
from weka.clusterers import Clusterer, ClusterEvaluation

DEBUG:weka.core.jvm:Adding bundled jars
DEBUG:weka.core.jvm:Classpath=['/usr/local/lib/python3.6/dist-packages/javabridge/jars/rhino-1.7R4.jar', '/usr/local/lib/python3.6/dist-packages/javabridge/jars/runnablequeue.jar', '/usr/local/lib/python3.6/dist-packages/javabridge/jars/cpython.jar', '/usr/local/lib/python3.6/dist-packages/weka/lib/weka.jar', '/usr/local/lib/python3.6/dist-packages/weka/lib/python-weka-wrapper.jar']
DEBUG:weka.core.jvm:MaxHeapSize=default
DEBUG:weka.core.jvm:Package support disabled


## Load the dataset `squad_vectors.csv`

We will first train a tree classifier (RandomForest) on the question vector representation that we have calculated.

In [63]:
loader = Loader(classname="weka.core.converters.CSVLoader")
data_file = 'squad_vectors.csv'
data = loader.load_file(data_file)

print('Sample size: ', data.num_instances)

Sample size:  3111


## Train the classifier

We classify on the nominal attribute **Class**. We first split the dataset into a training set and a test set, with 66% of the instances going to the training set.

We train Weka's RandomForest classifier with its default options, which bags 100 random trees.

A summary of the trained model is printed to standard output.
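
To make the 66% split concrete, here is a minimal pure-Python sketch of a percentage split. It assumes the training size is the instance count times the percentage, rounded to the nearest integer. The helper names `percentage_split_sizes` and `percentage_split` are hypothetical and are not part of the Weka wrapper.

```python
import random

def percentage_split_sizes(num_instances, train_percent):
    """Return (train_size, test_size), assuming the train size is rounded."""
    train_size = int(round(num_instances * train_percent / 100.0))
    return train_size, num_instances - train_size

def percentage_split(items, train_percent, seed=1):
    """Shuffle a copy of the items, then cut it at the training size."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    train_size, _ = percentage_split_sizes(len(shuffled), train_percent)
    return shuffled[:train_size], shuffled[train_size:]

# Our dataset has 3111 instances and we keep 66% for training.
train_size, test_size = percentage_split_sizes(3111, 66.0)
print(train_size, test_size)  # 2053 1058

# The same split applied to a toy list of instance ids.
toy_train, toy_test = percentage_split(range(10), 66.0)
```

Under this assumption, 3111 instances give 2053 training and 1058 test instances, which matches the number of instances reported in the evaluation.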

In [12]:
print('Classifying on: ', data.attribute(0))
data.class_index = 0
train, test = data.train_test_split(66.0, Random(1))


cls = Classifier(classname="weka.classifiers.trees.RandomForest")
cls.build_classifier(train)

print(cls)

Classifying on:  @attribute Class {Super_Bowl_50,Warsaw,Normans,Nikola_Tesla,Computational_complexity_theory,Teacher,Martin_Luther,Southern_California,Sky_(United_Kingdom),Victoria_(Australia),Huguenot,Steam_engine,Oxygen,1973_oil_crisis,Apollo_program,European_Union_law,Amazon_rainforest,Ctenophora,'Fresno,_California',Packet_switching,Black_Death,Geology,Newcastle_upon_Tyne,Victoria_and_Albert_Museum,American_Broadcasting_Company,Genghis_Khan,Pharmacy,Immune_system,Civil_disobedience,Construction}
RandomForest

Bagging with 100 iterations and base learner

weka.classifiers.trees.RandomTree -K 0 -M 1.0 -V 0.001 -S 1 -do-not-check-capabilities


## Evaluating the classifier

We evaluate the trained model against the test split.

In [13]:
evl = Evaluation(train)
evl.test_model(cls, test)
print(evl.summary("=== Percentage split 66% ==="))


Correctly Classified Instances         762               72.0227 %
Incorrectly Classified Instances       296               27.9773 %
Kappa statistic                          0.7105
Mean absolute error                      0.0495
Root mean squared error                  0.1453
Relative absolute error                 76.8582 %
Root relative squared error             80.931  %
Total Number of Instances             1058     
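
As a reading aid for the summary above, the following sketch shows how the accuracy follows from the instance counts and how a kappa statistic can be computed from a confusion matrix. The summary does not print the confusion matrix, so the 3-class matrix below is a made-up toy example, and `cohen_kappa` is a hypothetical helper that uses the standard Cohen's kappa formula.

```python
def accuracy_percent(correct, total):
    """Percentage of correctly classified instances."""
    return 100.0 * correct / total

def cohen_kappa(matrix):
    """Cohen's kappa for a square confusion matrix (rows: actual, columns: predicted)."""
    total = sum(sum(row) for row in matrix)
    observed = sum(matrix[i][i] for i in range(len(matrix))) / total
    row_sums = [sum(row) for row in matrix]
    col_sums = [sum(col) for col in zip(*matrix)]
    # Agreement expected by chance, from the row and column marginals
    expected = sum(r * c for r, c in zip(row_sums, col_sums)) / (total * total)
    return (observed - expected) / (1.0 - expected)

# Counts taken from the summary: 762 correct out of 1058 test instances.
reported_accuracy = round(accuracy_percent(762, 1058), 4)
print(reported_accuracy)  # 72.0227

# Toy confusion matrix (hypothetical, not taken from our experiment).
toy_matrix = [
    [8, 1, 1],
    [2, 7, 1],
    [0, 2, 8],
]
toy_kappa = cohen_kappa(toy_matrix)  # approximately 0.65
```

Kappa corrects the raw accuracy for agreement that would be expected by chance, which matters here because we have 30 classes.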



## Train another classifier

In this case, we train the same tree classifier (Random Forest) using 10-fold cross-validation.

In [18]:
# Choose Random Forest as the classifier
classifier = Classifier(classname="weka.classifiers.trees.RandomForest")

# Randomize the data
folds = 10
rnd = Random(1)
rand_data = Instances.copy_instances(data)
rand_data.randomize(rnd)
rand_data.stratify(folds)

# Perform cross-validation and add predictions
predicted_data = None
evaluation = Evaluation(rand_data)
for i in range(folds):
  train = rand_data.train_cv(folds, i)
  test = rand_data.test_cv(folds, i)

  # Build and evaluate the classifier
  cls = Classifier.make_copy(classifier)
  cls.build_classifier(train)
  evaluation.test_model(cls, test)

  # Add predictions
  addcls = Filter(
    classname="weka.filters.supervised.attribute.AddClassification",
    options=["-classification", "-distribution", "-error"])
  addcls.set_property("classifier", Classifier.make_copy(cls))
  addcls.inputformat(train)
  # Filtering the training fold first trains the classifier inside the filter
  addcls.filter(train)
  pred = addcls.filter(test)
  if predicted_data is None:
    predicted_data = Instances.template_instances(pred, 0)
  for n in range(pred.num_instances):
    predicted_data.add_instance(pred.get_instance(n))


print(evaluation.summary("=== " + str(folds) + " -fold Cross-Validation ==="))

=== 10 -fold Cross-Validation ===
Correctly Classified Instances        2296               73.8026 %
Incorrectly Classified Instances       815               26.1974 %
Kappa statistic                          0.729 
Mean absolute error                      0.0476
Root mean squared error                  0.1413
Relative absolute error                 73.9209 %
Root relative squared error             78.7353 %
Total Number of Instances             3111     
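
The idea behind stratified cross-validation can be sketched in pure Python as follows. This is only an illustration of the principle and not Weka's exact `stratify` algorithm. The function names and the toy labels are hypothetical.

```python
import random
from collections import Counter, defaultdict

def stratified_folds(labels, k, seed=1):
    """Deal instance indices into k folds, class by class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    folds = [[] for _ in range(k)]
    position = 0
    for label in sorted(by_class):
        members = by_class[label]
        rng.shuffle(members)
        # Round-robin dealing keeps the class proportions similar in each fold
        for index in members:
            folds[position % k].append(index)
            position += 1
    return folds

def cross_validation_splits(folds):
    """Yield (train_indices, test_indices), using each fold once as the test set."""
    for i, test in enumerate(folds):
        train = [index for j, fold in enumerate(folds) if j != i for index in fold]
        yield train, test

# Toy labels (hypothetical): 100 instances from three of our topics.
toy_labels = ["Oxygen"] * 50 + ["Geology"] * 30 + ["Pharmacy"] * 20
folds = stratified_folds(toy_labels, k=10)
fold_class_counts = [Counter(toy_labels[i] for i in fold) for fold in folds]
splits = list(cross_validation_splits(folds))
```

Every instance lands in exactly one test fold, which is why the cross-validation summary reports all 3111 instances rather than only a held-out third.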



## Attribute selection

In this section, we pick the 100 attributes with the highest InfoGain with respect to the Class attribute.
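
For intuition, information gain is the class entropy minus the class entropy that remains once the attribute value is known. The sketch below computes it for nominal attribute values and ranks the attributes. It assumes the attribute values are already discrete (numeric values such as our vector components would need to be discretized first). The helper names and the toy data are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def info_gain(values, labels):
    """Class entropy minus the entropy remaining after splitting on the attribute."""
    total = len(labels)
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(value, []).append(label)
    remaining = sum(len(group) / total * entropy(group) for group in groups.values())
    return entropy(labels) - remaining

# Toy data (hypothetical): two attributes and a binary class.
toy_labels = ["Oxygen", "Oxygen", "Geology", "Geology"]
toy_attributes = {
    "informative": ["low", "low", "high", "high"],    # separates the classes perfectly
    "uninformative": ["low", "high", "low", "high"],  # tells us nothing about the class
}

gains = {name: info_gain(values, toy_labels) for name, values in toy_attributes.items()}
# Rank the attributes by decreasing gain and keep the top N (here N = 1)
ranking = sorted(gains, key=gains.get, reverse=True)[:1]
```

An attribute that separates the classes perfectly gets a gain equal to the class entropy, while an attribute that is independent of the class gets a gain of zero.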

In [64]:
full = loader.load_file(data_file)
full.class_index = 0
search = ASSearch(classname="weka.attributeSelection.Ranker", options=["-N", "100"])
evaluation = ASEvaluation("weka.attributeSelection.InfoGainAttributeEval")
attsel = AttributeSelection()
attsel.ranking(True)
attsel.folds(2)
attsel.crossvalidation(True)
attsel.seed(42)
attsel.search(search)
attsel.evaluator(evaluation)
attsel.select_attributes(full)
print("# of selected attributes: " + str(attsel.number_attributes_selected))
print(attsel.results_string)

# of selected attributes: 100


=== Attribute Selection on all input data ===

Search Method:
	Attribute ranking.

Attribute Evaluator (supervised, Class (nominal): 1 Class):
	Information Gain Ranking Filter

Ranked attributes:
 0.351    270 vector_269
 0.319      6 vector_5
 0.312    287 vector_286
 0.263    285 vector_284
 0.263     91 vector_90
 0.261     89 vector_88
 0.259    267 vector_266
 0.253    215 vector_214
 0.247    217 vector_216
 0.245     30 vector_29
 0.243    111 vector_110
 0.242     59 vector_58
 0.24      19 vector_18
 0.24     272 vector_271
 0.236     87 vector_86
 0.233     66 vector_65
 0.232    290 vector_289
 0.231    174 vector_173
 0.23     219 vector_218
 0.229    156 vector_155
 0.228    260 vector_259
 0.223    278 vector_277
 0.22     299 vector_298
 0.219    218 vector_217
 0.218    221 vector_220
 0.218    289 vector_288
 0.216    202 vector_201
 0.212    125 vector_124
 0.212    101 vector_100
 0.211    181 vector_180
 0.211    295 vector_294
 0.209

We then remove the 200 lowest-ranked attributes, reducing the feature set from 300 to 100 attributes. To check how this InfoGain-based attribute selection affects a model trained on the dataset, we again evaluate a tree classifier (Random Forest), this time trained on only these 100 features. It reaches an accuracy of about 67%, compared with about 72% on the full feature set with the same 66% split.

In [0]:
# Selected attribute indices are 0-based; attribute 0 is the class
attr = attsel.selected_attributes
# We want to keep only the 100 top ranked attributes
keep = set(int(i) for i in attr[:100])
# The Remove filter expects 1-based indices, hence the +1
goo = ",".join(str(i + 1) for i in range(1, 301) if i not in keep)

r = Filter("weka.filters.unsupervised.attribute.Remove", options = ["-R", goo])
r.inputformat(full)
filtered = r.filter(full)
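
The index bookkeeping in the cell above is easy to get wrong, so here is a small worked example. It assumes that the selected indices are 0-based, that the class sits at index 0, and that the Remove filter takes a comma-separated list of 1-based indices. The helper `removal_range` is hypothetical.

```python
def removal_range(selected, num_features, class_index=0):
    """Build a 1-based, comma-separated list of the attributes to remove.

    selected     -- 0-based indices of the attributes to keep
    num_features -- number of attributes, not counting the class
    """
    keep = set(selected) | {class_index}
    return ",".join(
        str(index + 1)                       # convert from 0-based to 1-based
        for index in range(num_features + 1)
        if index not in keep
    )

# Toy case: class at index 0, five features at indices 1..5, keep features 3 and 1.
toy_range = removal_range([3, 1], num_features=5)
print(toy_range)  # 3,5,6
```

In the toy case, the attributes at 0-based indices 2, 4 and 5 are dropped, which the filter sees as positions 3, 5 and 6.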

In [66]:
print('Classifying on: ', filtered.attribute(0))
filtered.class_index = 0
train, test = filtered.train_test_split(66.0, Random(1))


cls = Classifier(classname="weka.classifiers.trees.RandomForest")
cls.build_classifier(train)

evl = Evaluation(train)
evl.test_model(cls, test)
print(evl.summary("=== Percentage split 66% ==="))

Classifying on:  @attribute Class {Super_Bowl_50,Warsaw,Normans,Nikola_Tesla,Computational_complexity_theory,Teacher,Martin_Luther,Southern_California,Sky_(United_Kingdom),Victoria_(Australia),Huguenot,Steam_engine,Oxygen,1973_oil_crisis,Apollo_program,European_Union_law,Amazon_rainforest,Ctenophora,'Fresno,_California',Packet_switching,Black_Death,Geology,Newcastle_upon_Tyne,Victoria_and_Albert_Museum,American_Broadcasting_Company,Genghis_Khan,Pharmacy,Immune_system,Civil_disobedience,Construction}
=== Percentage split 66% ===
Correctly Classified Instances         713               67.3913 %
Incorrectly Classified Instances       345               32.6087 %
Kappa statistic                          0.6627
Mean absolute error                      0.0491
Root mean squared error                  0.1453
Relative absolute error                 76.1026 %
Root relative squared error             80.9465 %
Total Number of Instances             1058     



## Clustering

To cluster the data and evaluate the clusters against our classes, we had to work with a smaller dataset, because Google Colab could not handle the full one. We therefore reduced the number of samples and the number of desired clusters (that is, the number of classes in the initial dataset). We again perform attribute selection and then apply SimpleKMeans clustering.
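
As a rough picture of what the clustering step does, the sketch below runs a plain k-means on toy 2-D points and then maps each cluster to its majority class to count errors. It is a simplified stand-in, not Weka's SimpleKMeans or its exact classes-to-clusters evaluation. All names and data points are hypothetical.

```python
import random
from collections import Counter

def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def k_means(points, k, seed=10, max_iter=100):
    """Plain k-means: random initial centroids, then assign and update until stable."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    assignment = None
    for _ in range(max_iter):
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # no point changed cluster, so we have converged
        assignment = new_assignment
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return assignment, centroids

def classes_to_clusters(assignment, labels):
    """Map each cluster to its majority class and count the misplaced instances."""
    table = {}
    for cluster, label in zip(assignment, labels):
        table.setdefault(cluster, Counter())[label] += 1
    mapping = {c: counts.most_common(1)[0][0] for c, counts in table.items()}
    errors = sum(1 for c, l in zip(assignment, labels) if mapping[c] != l)
    return mapping, errors

# Toy data (hypothetical): two well-separated groups of points.
toy_points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
toy_labels = ["Oxygen"] * 4 + ["Geology"] * 4

assignment, centroids = k_means(toy_points, k=2)
mapping, errors = classes_to_clusters(assignment, toy_labels)
```

The class labels are only used after clustering, to judge how well the clusters line up with the topics. This is why the class attribute is removed before the clusterer is built.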

In [67]:
# Choose the file to upload from your own machine - its name should be
# "squad_vectors_reduced.csv"
uploaded = files.upload()

Saving squad_vectors_reduced.csv to squad_vectors_reduced.csv


In [0]:
data_file = 'squad_vectors_reduced.csv'
full = loader.load_file(data_file)
full.class_index = 0
search = ASSearch(classname="weka.attributeSelection.Ranker", options=["-N", "100"])
evaluation = ASEvaluation("weka.attributeSelection.InfoGainAttributeEval")
attsel = AttributeSelection()
attsel.ranking(True)
attsel.folds(2)
attsel.crossvalidation(True)
attsel.seed(42)
attsel.search(search)
attsel.evaluator(evaluation)
attsel.select_attributes(full)

# Selected attribute indices are 0-based; attribute 0 is the class
attr = attsel.selected_attributes
# We want to keep only the 100 top ranked attributes
keep = set(int(i) for i in attr[:100])
# The Remove filter expects 1-based indices, hence the +1
goo = ",".join(str(i + 1) for i in range(1, 301) if i not in keep)

r = Filter("weka.filters.unsupervised.attribute.Remove", options = ["-R", goo])
r.inputformat(full)
filtered = r.filter(full)

In [70]:
full = Instances.copy_instances(filtered)
data = Instances.copy_instances(filtered)
data.no_class()
data.delete_attribute(0)


clusterer = Clusterer(classname="weka.clusterers.SimpleKMeans", options=["-N", "6"])
clusterer.build_clusterer(data)

# classes to clusters
evl = ClusterEvaluation()
evl.set_model(clusterer)
evl.test_model(full)
print("Cluster results")
print(evl.cluster_results)
print("Classes to clusters")
print(evl.classes_to_clusters)

Cluster results

kMeans

Number of iterations: 31
Within cluster sum of squared errors: 966.297534874898

Initial starting points (random):

Cluster 0: 0.147448,0.197533,-0.044818,1.066038,-0.524173,0.390865,0.094538,0.357294,-0.180145,-0.014014,-0.156085,-0.1754,0.403217,-0.137324,-0.00115,0.064803,-0.16955,0.240462,-0.080003,-0.009862,0.182377,-0.012749,-0.173015,0.05764,0.070989,-0.083744,-0.190712,0.171715,-0.194349,0.112126,0.032233,0.124618,-0.293527,-0.091996,-0.943734,-0.063402,-0.048336,-0.163679,0.098866,0.020664,-0.144495,-0.091228,0.232769,0.123182,-0.061444,0.058062,0.01416,0.194798,-0.000054,-0.171339,-0.034409,0.242292,0.345444,0.081944,0.193546,0.113629,0.085784,-0.146271,-0.073855,0.039765,-0.083515,0.138332,0.057803,0.026438,0.115902,0.07499,0.0182,-0.046944,-0.018517,0.074831,0.025846,-0.136665,0.143654,-0.010292,0.174061,-0.20188,-0.152503,0.068272,-0.023235,-0.098519,0.056706,0.165668,-0.088316,-0.161943,-0.210242,-0.064989,-0.2291,-0.049827,-0.371325,0.251394,-0.0