## 2.7 Machine Learning for Ranking

First we need to load our training data into pandas dataframe.  The data is in tab separated formatand the easiest way to load this is using the ```pandas``` ```read_table()``` method. 

In [34]:
import csv
import pandas as pd

df = pd.read_table("data/fullDataset.tsv",header=0)

Let's check out the shape and column names:

In [36]:
print(df.shape)
print(df.columns)

(79575, 19)
Index([u'key', u'query', u'Title', u'LeafCats', u'ItemID', u'X_unit_id',
       u'SCORE', u'label_relevanceGrade', u'label_relevanceBinary',
       u'feature_1', u'feature_2', u'feature_3', u'feature_4', u'feature_5',
       u'feature_6', u'feature_7', u'feature_8', u'feature_9', u'feature_10'],
      dtype='object')


Here is a sample of the data:

In [35]:
df.sample(5)

Unnamed: 0,key,query,Title,LeafCats,ItemID,X_unit_id,SCORE,label_relevanceGrade,label_relevanceBinary,feature_1,feature_2,feature_3,feature_4,feature_5,feature_6,feature_7,feature_8,feature_9,feature_10
25297,40418,argand,"BRASS ""FLAT TOP"" DEFLECTOR/CHIMNEY RING FOR AR...",1407,20244,740191598,3:-1:3,5,0,3093.916748,1,-7.62221,-4.739789,46.223152,0.388945,38,-1000000,219,-100.0
8724,15736,age of empires 3,Age of Empires III: Gold Edition WIN XP ...,139973,8060,724905631,3:3:3,6,1,2952.142822,1,-8.121331,-5.078777,53.495548,0.674927,93,333333,143,1.607907
1562,78514,alton ellis,ALTON ELLIS Sunday Coming/Another Night STUDIO...,176985,39005,740197728,3:3:3,6,1,2873.857178,1,-7.882324,-4.961157,25.345171,0.489932,151,-1000000,172,-100.0
72072,19626,woman shoes wedges,Womens Fashion Sandals Cute Wedge Heel Sandal ...,55793|62107,9937,724908777,3:3:3,6,1,3560.105225,1,-9.081261,-4.514035,69.695389,0.508286,440205,-1000000,225,-100.0
57503,23660,amc cars,AMC PACER AUTO ICONS STRIP OF 10 MINT CAR STAM...,14024|1313,11897,753410348,-1:-1:-1,2,0,760.0,1,-7.390667,-4.736623,18.733725,0.0,1335,-1000000,139,-100.0


The columns are:

Column name             | Description
------------------------|-----------------------------------------------------------------------
key                     |  Used to join back to the original dataset and add any additional fields as needed
query                   |  Un-normalized query keywords (without user constraints)
Title                   |  Un-normalized title
LeafCats                |  Item leaf category IDs
ItemID                  |  Anonymized itemID. This is not the actual item ID
X_unit_id               |  The query ID used for grouping query-item pairs by their search, primarily for per-query metrics. Essentially it's a "search ID". We can also group by query or normalized query
SCORE                   |  The scores given by up to three judges characterizing the relevance/relevance problem of the query-item pair
label_relevanceGrade    |  The SCORE averaged and rounded and converted to a relevance graded judgment 0-6, 6 being the best. Note this is very approximate
label_relevanceBinary   |  The SCORE converted to a binary relevant(==1) or not relevant (==0) judgment.  This is a more accurate label than the Grade, I recommend it as a training target

Features (In brief): 

* query features: feature_2, feature_7, feature_8
* item features: feature_3, feature_4, feature_9
* query-item features: feature_1, feature_5, feature_6, feature_10

## Getting ready for Machine Learning

We start by simply exploring how we might classify queries as relevent / not relevant.  We will explore a series of different models to do this. The first is logistic regression, we will also use SVM and finally classification tree's.  Along the way we will look at over fitting / generalization and how to evaluate models.  

We can do all this using Spark MLLIB - first we have to findspark and get our spark context:


In [38]:
import findspark
import os
findspark.init(os.getenv('HOME') + '/spark-1.6.0-bin-hadoop2.6')
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.databricks:spark-csv_2.10:1.3.0 pyspark-shell'

In [39]:
import pyspark
try: 
    print(sc)
except NameError:
    sc = pyspark.SparkContext()
    print(sc)

<pyspark.context.SparkContext object at 0x7f17779a0a50>


It is easy to load the tsv data into a Spark DataFrame:

In [52]:
from pyspark.sql import SQLContext
import os

sqlContext = SQLContext(sc)
df = sqlContext.read.format('com.databricks.spark.csv').options() \
        .options(header='true', inferSchema='true', delimiter='\t') \
        .load(os.getcwd() + '/data/fullDataset.tsv') 
        
df.schema


StructType(List(StructField(key,IntegerType,true),StructField(query,StringType,true),StructField(Title,StringType,true),StructField(LeafCats,StringType,true),StructField(ItemID,IntegerType,true),StructField(X_unit_id,IntegerType,true),StructField(SCORE,StringType,true),StructField(label_relevanceGrade,IntegerType,true),StructField(label_relevanceBinary,IntegerType,true),StructField(feature_1,DoubleType,true),StructField(feature_2,IntegerType,true),StructField(feature_3,DoubleType,true),StructField(feature_4,DoubleType,true),StructField(feature_5,DoubleType,true),StructField(feature_6,DoubleType,true),StructField(feature_7,IntegerType,true),StructField(feature_8,DoubleType,true),StructField(feature_9,IntegerType,true),StructField(feature_10,DoubleType,true)))

Now we can extract the features and the target for the machine learning algorithms:

In [91]:
sqlContext.registerDataFrameAsTable(df,'dataset')
sqlContext.tableNames()

data_full = sqlContext.sql("select label_relevanceBinary, feature_1, feature_2, feature_3, feature_4 \
                       feature_5, feature_6, feature_7, feature_8, feature_9, feature_10 \
               from dataset").rdd

We also split the data into test and validation data sets - splitting 75%:25% between the training and test sets:

In [96]:
from pyspark.mllib.classification import SVMWithSGD, SVMModel
from pyspark.mllib.regression import LabeledPoint

# Load and parse the data
def parseRecord(line):
    return LabeledPoint(line[0], line[1:])

data_train, data_test = data_full.randomSplit([0.75,0.25])

In [93]:
print('Training data records = ' + str(data_train.count()))
print('Training data records = ' + str(data_test.count()))

Training data records = 59892
Training data records = 19559


## Fitting an SVM - a simple classifier

In [102]:
model = SVMWithSGD.train(data_train.map(parseRecord), iterations=100)

In [99]:
model

(weights=[1194.26941609,0.461463715096,-2.21656945582,-0.993141240628,0.2438952531,24884.355726,63515.4566756,51.6824038304,-11.1044683869], intercept=0.0)

In [103]:
# Evaluating the model on test data
preds = data_test.map(parseRecord).map(lambda p: (p.label, model.predict(p.features)))
err = preds.filter(lambda (v, p): v != p).count() / float(data_test.count())
print("Training Error = " + str(err))

Training Error = 0.489422642078


## Fitting Logistic Regression

In [104]:
from pyspark.mllib.classification import LogisticRegressionWithLBFGS, LogisticRegressionModel

model = LogisticRegressionWithLBFGS.train(data_train.map(parseRecord))

In [105]:
model

(weights=[0.00042945137487,0.639302892683,0.0172727048381,0.235558307927,1.13591233639,3.55644013574e-07,-1.32957685793e-07,1.88841414344e-05,0.000132492242944], intercept=0.0)

In [107]:
# Evaluating the model on training data
preds = data_test.map(parseRecord).map(lambda p: (p.label, model.predict(p.features)))
err = preds.filter(lambda (v, p): v != p).count() / float(data_test.count())
print("Training Error = " + str(err))

Training Error = 0.290688910105
