# Jonathan Halverson
# Tuesday, December 27, 2016
# Wine classification in Spark 2

Here we work a standard machine learning binary classification problem, with the twist that the class 3 records are split at random between classes 0 and 1. This deliberately degrades the classifier so that we can examine its performance. For the EDA, see the appropriate notebook in the machine_learning directory.

Here is a nice notebook by Ben Sadeghi on a related topic:
http://nbviewer.jupyter.org/github/bensadeghi/pyspark-churn-prediction/blob/master/churn-prediction.ipynb

In [1]:
from __future__ import print_function
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[4]").appName("Wine classification").getOrCreate()

In [2]:
df = spark.read.csv('../../machine_learning/wine.csv', header=False, inferSchema=True)
df.sample(False, 0.1).show()

+---+-----+----+----+----+---+----+----+----+----+----+----+----+----+
|_c0|  _c1| _c2| _c3| _c4|_c5| _c6| _c7| _c8| _c9|_c10|_c11|_c12|_c13|
+---+-----+----+----+----+---+----+----+----+----+----+----+----+----+
|  1|13.77| 1.9|2.68|17.1|115| 3.0|2.79|0.39|1.68| 6.3|1.13|2.93|1375|
|  1|13.72|1.43| 2.5|16.7|108| 3.4|3.67|0.19|2.04| 6.8|0.89|2.87|1285|
|  2|12.33| 1.1|2.28|16.0|101|2.05|1.09|0.63|0.41|3.27|1.25|1.67| 680|
|  2|11.96|1.09| 2.3|21.0|101|3.38|2.14|0.13|1.65|3.21|0.99|3.13| 886|
|  2|11.84|0.89|2.58|18.0| 94| 2.2|2.21|0.22|2.35|3.05|0.79|3.08| 520|
|  2|12.08|1.33| 2.3|23.6| 70| 2.2|1.59|0.42|1.38|1.74|1.07|3.21| 625|
|  2| 12.0|3.43| 2.0|19.0| 87| 2.0|1.64|0.37|1.87|1.28|0.93|3.05| 564|
|  2|12.42|4.43|2.73|26.5|102| 2.2|2.13|0.43|1.71|2.08|0.92|3.12| 365|
|  2|11.79|2.13|2.78|28.5| 92|2.13|2.24|0.58|1.76| 3.0|0.97|2.44| 466|
|  3|12.25|4.72|2.54|21.0| 89|1.38|0.47|0.53| 0.8|3.85|0.75|1.27| 720|
|  3|13.52|3.17|2.72|23.5| 97|1.55|0.52| 0.5|0.55|4.35|0.89|2.06| 520|
|  3|1

In [3]:
df.printSchema()

root
 |-- _c0: integer (nullable = true)
 |-- _c1: double (nullable = true)
 |-- _c2: double (nullable = true)
 |-- _c3: double (nullable = true)
 |-- _c4: double (nullable = true)
 |-- _c5: integer (nullable = true)
 |-- _c6: double (nullable = true)
 |-- _c7: double (nullable = true)
 |-- _c8: double (nullable = true)
 |-- _c9: double (nullable = true)
 |-- _c10: double (nullable = true)
 |-- _c11: double (nullable = true)
 |-- _c12: double (nullable = true)
 |-- _c13: integer (nullable = true)



Class labels must begin with 0 and count up in Spark. Here we consider only a binary classification problem, so we randomly assign the class 3 records to the other two classes. This will lead to mistakes by the classifier, which allows for an interesting validation:

In [4]:
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType
from random import random as rng
from random import seed

seed(123456)
def randomFlip3rd(x):
    if (x == 3):
        if (rng() > 0.5):
            return 0
        else:
            return 1
    else:
        return x - 1
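
Here is a minimal standalone sketch of the same relabeling rule, useful for checking it without Spark. The function name `relabel`, the private `random.Random` generator, and the use of the UCI wine class sizes (59, 71, 48) are assumptions made for illustration and are not part of the notebook:

```python
import random
from collections import Counter

def relabel(x, rng):
    """Map class 1 -> 0 and class 2 -> 1; send class 3 to 0 or 1 with equal odds."""
    if x == 3:
        return 0 if rng.random() > 0.5 else 1
    return x - 1

# Private generator, so the draw does not depend on the global random state
rng = random.Random(123456)

# Assumed class sizes of the UCI wine data: 59 + 71 + 48 = 178 records
original = [1] * 59 + [2] * 71 + [3] * 48
relabeled = [relabel(x, rng) for x in original]

counts = Counter(relabeled)
print("class 0: {}, class 1: {}".format(counts[0], counts[1]))
```

Passing the generator explicitly keeps the result reproducible from run to run, which is convenient when the relabeled data are used for validation.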

In [5]:
trans3rd = udf(randomFlip3rd, IntegerType())
df = df.withColumn('_c0', trans3rd(df._c0))

In [6]:
df.sample(False, 0.1).show()

+---+-----+----+----+----+---+----+----+----+----+-----+----+----+----+
|_c0|  _c1| _c2| _c3| _c4|_c5| _c6| _c7| _c8| _c9| _c10|_c11|_c12|_c13|
+---+-----+----+----+----+---+----+----+----+----+-----+----+----+----+
|  0|13.63|1.81| 2.7|17.2|112|2.85|2.91| 0.3|1.46|  7.3|1.28|2.88|1310|
|  0| 13.5|1.81|2.61|20.0| 96|2.53|2.61|0.28|1.66| 3.52|1.12|3.82| 845|
|  0|13.05|2.05|3.22|25.0|124|2.63|2.68|0.47|1.92| 3.58|1.13| 3.2| 830|
|  0|13.87| 1.9| 2.8|19.4|107|2.95|2.97|0.37|1.76|  4.5|1.25| 3.4| 915|
|  0|14.22| 1.7| 2.3|16.3|118| 3.2| 3.0|0.26|2.03| 6.38|0.94|3.31| 970|
|  1|12.37|0.94|1.36|10.6| 88|1.98|0.57|0.28|0.42| 1.95|1.05|1.82| 520|
|  1|12.99|1.67| 2.6|30.0|139| 3.3|2.89|0.21|1.96| 3.35|1.31| 3.5| 985|
|  1|11.66|1.88|1.92|16.0| 97|1.61|1.57|0.34|1.15|  3.8|1.23|2.14| 428|
|  1|12.08|1.13|2.51|24.0| 78| 2.0|1.58| 0.4| 1.4|  2.2|1.31|2.72| 630|
|  1|12.16|1.61|2.31|22.8| 90|1.78|1.69|0.43|1.56| 2.45|1.33|2.26| 495|
|  1|12.29|1.41|1.98|16.0| 85|2.55| 2.5|0.29|1.77|  2.9|1.23|2.7

In [7]:
df.groupby('_c0').count().toPandas()

Unnamed: 0,_c0,count
0,1,101
1,0,77


We see that the two classes appear in comparable proportions (101 versus 77 records), so stratified sampling is not required.
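
As a quick check on the class balance, the proportions and the accuracy of a majority-class baseline can be worked out from the counts above. This is a plain Python sketch and the variable names are illustrative:

```python
# Class counts taken from the groupby output above
counts = {0: 77, 1: 101}
total = sum(counts.values())

# Proportion of each class in the relabeled data
proportions = {label: n / float(total) for label, n in counts.items()}
for label in sorted(proportions):
    print("class {}: {:.3f}".format(label, proportions[label]))

# Accuracy obtained by always predicting the most common class
baseline_accuracy = max(counts.values()) / float(total)
print("majority-class baseline: {:.3f}".format(baseline_accuracy))
```

Any classifier trained below should beat this baseline to be considered useful.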

Note that in local mode, even with local[4], only one partition is being used:

In [8]:
df.rdd.getNumPartitions()

1

In [9]:
df.printSchema()

root
 |-- _c0: integer (nullable = true)
 |-- _c1: double (nullable = true)
 |-- _c2: double (nullable = true)
 |-- _c3: double (nullable = true)
 |-- _c4: double (nullable = true)
 |-- _c5: integer (nullable = true)
 |-- _c6: double (nullable = true)
 |-- _c7: double (nullable = true)
 |-- _c8: double (nullable = true)
 |-- _c9: double (nullable = true)
 |-- _c10: double (nullable = true)
 |-- _c11: double (nullable = true)
 |-- _c12: double (nullable = true)
 |-- _c13: integer (nullable = true)



Let's change the data type of _c5 and _c13 to double:

In [10]:
df = df.withColumn('_c5', df['_c5'].cast('double'))
df = df.withColumn('_c13', df['_c13'].cast('double'))
df.printSchema()

root
 |-- _c0: integer (nullable = true)
 |-- _c1: double (nullable = true)
 |-- _c2: double (nullable = true)
 |-- _c3: double (nullable = true)
 |-- _c4: double (nullable = true)
 |-- _c5: double (nullable = true)
 |-- _c6: double (nullable = true)
 |-- _c7: double (nullable = true)
 |-- _c8: double (nullable = true)
 |-- _c9: double (nullable = true)
 |-- _c10: double (nullable = true)
 |-- _c11: double (nullable = true)
 |-- _c12: double (nullable = true)
 |-- _c13: double (nullable = true)



Let's give the columns more meaningful names:

In [11]:
columns = ['Class', 'Alcohol', 'Malic acid', 'Ash', 'Alcalinity of ash', 'Magnesium', 'Total phenols', \
           'Flavanoids', 'Nonflavanoid phenols', 'Proanthocyanins', 'Color intensity', 'Hue', \
           'OD280/OD315 of diluted wines', 'Proline']

In [12]:
for u, v in zip(df.schema.names, columns):
    df = df.withColumnRenamed(u, v)

In [13]:
df.printSchema()

root
 |-- Class: integer (nullable = true)
 |-- Alcohol: double (nullable = true)
 |-- Malic acid: double (nullable = true)
 |-- Ash: double (nullable = true)
 |-- Alcalinity of ash: double (nullable = true)
 |-- Magnesium: double (nullable = true)
 |-- Total phenols: double (nullable = true)
 |-- Flavanoids: double (nullable = true)
 |-- Nonflavanoid phenols: double (nullable = true)
 |-- Proanthocyanins: double (nullable = true)
 |-- Color intensity: double (nullable = true)
 |-- Hue: double (nullable = true)
 |-- OD280/OD315 of diluted wines: double (nullable = true)
 |-- Proline: double (nullable = true)



Here is an alternative version of assigning the column names:

In [14]:
from functools import reduce  # a builtin on Python 2, but must be imported on Python 3
wineRaw = reduce(lambda data, i: data.withColumnRenamed(df.schema.names[i], columns[i]), range(len(columns)), df)
wineRaw.sample(False, 0.05).toPandas().applymap(lambda x: round(x, 1))

Unnamed: 0,Class,Alcohol,Malic acid,Ash,Alcalinity of ash,Magnesium,Total phenols,Flavanoids,Nonflavanoid phenols,Proanthocyanins,Color intensity,Hue,OD280/OD315 of diluted wines,Proline
0,0.0,14.1,2.0,2.4,18.8,103.0,2.8,2.9,0.3,2.4,6.2,1.1,2.8,1060.0
1,0.0,13.9,1.7,2.3,17.4,108.0,2.9,3.5,0.3,2.1,8.9,1.1,3.1,1260.0
2,1.0,12.8,3.4,2.0,16.0,80.0,1.6,1.3,0.4,0.8,3.4,0.7,2.1,372.0
3,0.0,12.8,2.7,2.5,22.0,112.0,1.5,1.4,0.2,1.3,10.8,0.5,1.5,480.0
4,0.0,13.7,4.4,2.3,22.5,88.0,1.3,0.5,0.5,1.1,6.6,0.8,1.8,520.0


Here are the descriptive statistics -- of course, no standardization has been performed yet:

In [15]:
wineRaw.select(wineRaw.schema.names[1:]).toPandas().describe().applymap(lambda x: round(x, 1))

Unnamed: 0,Alcohol,Malic acid,Ash,Alcalinity of ash,Magnesium,Total phenols,Flavanoids,Nonflavanoid phenols,Proanthocyanins,Color intensity,Hue,OD280/OD315 of diluted wines,Proline
count,178.0,178.0,178.0,178.0,178.0,178.0,178.0,178.0,178.0,178.0,178.0,178.0,178.0
mean,13.0,2.3,2.4,19.5,99.7,2.3,2.0,0.4,1.6,5.1,1.0,2.6,746.9
std,0.8,1.1,0.3,3.3,14.3,0.6,1.0,0.1,0.6,2.3,0.2,0.7,314.9
min,11.0,0.7,1.4,10.6,70.0,1.0,0.3,0.1,0.4,1.3,0.5,1.3,278.0
25%,12.4,1.6,2.2,17.2,88.0,1.7,1.2,0.3,1.3,3.2,0.8,1.9,500.5
50%,13.1,1.9,2.4,19.5,98.0,2.4,2.1,0.3,1.6,4.7,1.0,2.8,673.5
75%,13.7,3.1,2.6,21.5,107.0,2.8,2.9,0.4,1.9,6.2,1.1,3.2,985.0
max,14.8,5.8,3.2,30.0,162.0,3.9,5.1,0.7,3.6,13.0,1.7,4.0,1680.0


Reformat the data into a new dataframe with features as a vector:

In [16]:
from pyspark.sql import Row
from pyspark.ml.linalg import Vectors

In [17]:
wineRaw = wineRaw.select('Class', 'Ash', 'Hue', 'Alcohol', 'Flavanoids').rdd.map(lambda row: Row(label=row.Class, features=Vectors.dense(row[1:]))).toDF()
wineRaw.sample(False, 0.1).show(5)

+--------------------+-----+
|            features|label|
+--------------------+-----+
|[2.87,1.04,13.24,...|    0|
|[2.48,1.23,14.19,...|    0|
|[2.56,0.96,13.64,...|    0|
|[2.36,1.11,13.71,...|    0|
|[2.8,1.25,13.87,2...|    0|
+--------------------+-----+
only showing top 5 rows



Now that we have the correct format, a train-test split can be performed before we standardize:

In [18]:
trainUnSTD, testUnSTD = wineRaw.randomSplit([0.7, 0.3])
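
The split sizes are only approximately 70/30 because each row is assigned at random. Here is a minimal pure-Python sketch of that idea, assuming an independent draw per row; the helper `bernoulli_split` and the seed value are hypothetical and this is not Spark's implementation:

```python
import random

def bernoulli_split(rows, train_fraction, seed):
    """Assign each row independently to train with probability train_fraction."""
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        if rng.random() < train_fraction:
            train.append(row)
        else:
            test.append(row)
    return train, test

rows = list(range(178))  # stand-in for the 178 wine records

# The same seed gives the same split, which makes results repeatable
train_a, test_a = bernoulli_split(rows, 0.7, seed=42)
train_b, test_b = bernoulli_split(rows, 0.7, seed=42)

print("train: {}, test: {}".format(len(train_a), len(test_a)))
```

In the notebook itself, passing a seed to `randomSplit` (for example `randomSplit([0.7, 0.3], seed=42)`) should make the split repeatable in the same way.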

Let's standardize the data by making the mean and variance 0 and 1, respectively, for each column:

In [19]:
from pyspark.ml.feature import StandardScaler
scaler = StandardScaler(inputCol="features", outputCol="scaledFeatures", withStd=True, withMean=True)
scalerModel = scaler.fit(trainUnSTD)
train = scalerModel.transform(trainUnSTD).cache()

In [20]:
train.printSchema()

root
 |-- features: vector (nullable = true)
 |-- label: long (nullable = true)
 |-- scaledFeatures: vector (nullable = true)



In [21]:
train.show(5)

+--------------------+-----+--------------------+
|            features|label|      scaledFeatures|
+--------------------+-----+--------------------+
|[1.36,1.05,12.37,...|    1|[-3.4963574389052...|
|[1.7,1.12,13.11,3...|    1|[-2.3063143247853...|
|[1.71,1.19,13.03,...|    1|[-2.2713130567230...|
|[1.75,1.28,12.21,...|    1|[-2.1313079844736...|
|[1.82,0.75,11.46,...|    1|[-1.8862991080371...|
+--------------------+-----+--------------------+
only showing top 5 rows



Let's check that the standardized features have a mean of 0 and a variance of 1:

In [22]:
train.rdd.map(lambda row: row.scaledFeatures.values.tolist()).toDF().toPandas().describe().applymap(lambda x: round(x, 1))

Unnamed: 0,_1,_2,_3,_4
count,130.0,130.0,130.0,130.0
mean,0.0,-0.0,-0.0,-0.0
std,1.0,1.0,1.0,1.0
min,-3.5,-2.0,-2.4,-1.7
25%,-0.6,-0.8,-0.8,-0.9
50%,0.0,0.0,0.0,0.1
75%,0.7,0.7,0.8,0.8
max,3.0,3.3,1.9,3.0
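
Here is a small sketch of what the standardization does for one column, assuming the scaler subtracts the training mean and divides by the sample (n - 1) standard deviation. The helper names are hypothetical; the values are a handful of Ash measurements of the kind shown in the outputs of this notebook and are used only for illustration:

```python
import math

def fit_scaler(column):
    """Return the mean and the sample (n - 1) standard deviation of a column."""
    n = len(column)
    mean = sum(column) / float(n)
    variance = sum((v - mean) ** 2 for v in column) / (n - 1)
    return mean, math.sqrt(variance)

def transform(column, mean, std):
    """Standardize a column using previously fitted statistics."""
    return [(v - mean) / std for v in column]

train_ash = [1.36, 1.70, 1.71, 1.75, 1.82]  # illustrative training values
test_ash = [2.17, 2.21, 2.24]               # illustrative test values

# Fit on the training data only, then reuse the statistics on the test data
mean, std = fit_scaler(train_ash)
scaled_train = transform(train_ash, mean, std)
scaled_test = transform(test_ash, mean, std)

check_mean, check_std = fit_scaler(scaled_train)
print("train mean: {:.3f}, train std: {:.3f}".format(abs(check_mean), check_std))
```

The scaled test values will generally not have mean 0 and variance 1, because the statistics come from the training data. That is the intended behavior, since it keeps information from the test set out of the model.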


Now that the wine DataFrame is properly formatted, we create an ML model with cross-validation and hyperparameter optimization:

In [23]:
from pyspark.ml import Pipeline
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.ml.classification import LogisticRegression

lr = LogisticRegression(featuresCol='scaledFeatures', labelCol='label', maxIter=10, threshold=0.5)
pipeline = Pipeline(stages=[lr])

paramGrid = ParamGridBuilder().addGrid(lr.regParam, [1.0, 0.1, 0.01]).addGrid(lr.elasticNetParam, [1.0, 0.1, 0.01]).build()
bce = BinaryClassificationEvaluator(metricName="areaUnderROC")
crossval = CrossValidator(estimator=pipeline, estimatorParamMaps=paramGrid, evaluator=bce, numFolds=5)
cvModel = crossval.fit(train)

Here are some details about the cross-validation procedure and the final coefficients:

In [24]:
cvModel.avgMetrics

[0.5,
 0.8788335275835276,
 0.8845648795648796,
 0.8771814296814296,
 0.8836169386169386,
 0.8892676767676768,
 0.8826981351981351,
 0.8907128982128982,
 0.8882420357420358]
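
There are nine values because the grid has 3 x 3 parameter combinations, and each value is the areaUnderROC averaged over the five folds. A short sketch for locating the best entry, using the numbers printed above (the variable names are illustrative):

```python
# Average areaUnderROC per parameter combination, copied from the output above
avg_metrics = [0.5,
               0.8788335275835276,
               0.8845648795648796,
               0.8771814296814296,
               0.8836169386169386,
               0.8892676767676768,
               0.8826981351981351,
               0.8907128982128982,
               0.8882420357420358]

# A larger area under the ROC curve is better, so take the index of the maximum
best_index = max(range(len(avg_metrics)), key=lambda i: avg_metrics[i])
print("best index: {}, metric: {:.4f}".format(best_index, avg_metrics[best_index]))
```

Assuming `avgMetrics` follows the order of `paramGrid`, `paramGrid[best_index]` in the notebook should give the corresponding regParam and elasticNetParam values.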

In [25]:
cvModel.bestModel.stages[0].coefficients

DenseVector([-0.2983, 0.2142, -0.9822, -0.3259])

In [26]:
cvModel.bestModel.stages[0].intercept

0.282737661939952

Let's evaluate the model using the test data. We begin by standardizing the test data using the previous standardizer object:

In [27]:
test = scalerModel.transform(testUnSTD).cache()
prediction = cvModel.transform(test)
prediction.sample(False, 0.2).show()

+--------------------+-----+--------------------+--------------------+--------------------+----------+
|            features|label|      scaledFeatures|       rawPrediction|         probability|prediction|
+--------------------+-----+--------------------+--------------------+--------------------+----------+
|[2.17,0.86,12.07,...|    1|[-0.6612547258548...|[-1.2811598552068...|[0.21735285524044...|       1.0|
|[2.17,1.08,14.83,...|    0|[-0.6612547258548...|[2.08970240664770...|[0.88989827102082...|       0.0|
|[2.21,1.04,14.02,...|    0|[-0.5212496536054...|[0.94134911665876...|[0.71937209212416...|       0.0|
|[2.24,0.98,13.49,...|    1|[-0.4162458494184...|[0.20425613810141...|[0.55088723704585...|       0.0|
|[2.27,1.01,13.86,...|    0|[-0.3112420452313...|[1.09483461308741...|[0.74929101709491...|       0.0|
|[2.3,0.72,12.82,0...|    0|[-0.2062382410443...|[-0.7155495933593...|[0.32837374459018...|       1.0|
|[2.35,0.7,13.36,0.5]|    0|[-0.0312319007325...|[-0.0181335961903...|[0.

Here are the probabilities for class 0 and class 1 for each record:

In [28]:
prediction.select('probability').rdd.map(lambda row: row.probability.values.tolist()).toDF().toPandas().applymap(lambda x: round(x, 7))[:5]

Unnamed: 0,_1,_2
0,0.091828,0.908172
1,0.091364,0.908636
2,0.209529,0.790471
3,0.577755,0.422245
4,0.408536,0.591464


The raw prediction for class 1 is the inner product of the scaled feature vector and the coefficients, plus the intercept. The value for class 0 is its negative:

In [29]:
rawPred = prediction.select('rawPrediction').rdd.map(lambda row: row.rawPrediction.values.tolist()).toDF().toPandas().applymap(lambda x: round(x, 4))[:5]
rawPred

Unnamed: 0,_1,_2
0,-2.2915,2.2915
1,-2.2971,2.2971
2,-1.3278,1.3278
3,0.3136,-0.3136
4,-0.37,0.37


In [30]:
import numpy as np
dp = rawPred.iloc[0, 0]
np.exp(dp) / (1 + np.exp(dp)), np.exp(-dp) / (1 + np.exp(-dp))

(0.091829378202088724, 0.9081706217979113)
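
To tie the pieces together, here is a sketch of how a raw prediction and the two probabilities follow from the fitted coefficients and intercept. The coefficients are the rounded values printed earlier, the helper names are hypothetical, and the input record is an invented one sitting at the training mean (all scaled features equal to 0):

```python
import math

coefficients = [-0.2983, 0.2142, -0.9822, -0.3259]  # rounded values from the output above
intercept = 0.282737661939952

def sigmoid(z):
    """Logistic function mapping a margin to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

def predict(scaled_features):
    """Return the raw prediction pair and the class probabilities."""
    margin = sum(w * x for w, x in zip(coefficients, scaled_features)) + intercept
    p1 = sigmoid(margin)
    return [-margin, margin], [1.0 - p1, p1]

# Hypothetical record at the training mean: the margin reduces to the intercept
raw, prob = predict([0.0, 0.0, 0.0, 0.0])
print("raw: {}, probability: {}".format(raw, prob))

# Reproduce the class 1 probability of the first test record from its margin
print("{:.6f}".format(sigmoid(2.2915)))
```

The two probabilities always sum to 1, which is why the raw prediction pair is symmetric about zero.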

Let's compute two quantities for evaluation purposes, the area under the ROC curve and the area under the precision-recall curve:

In [31]:
evaluator = BinaryClassificationEvaluator(rawPredictionCol="rawPrediction", labelCol="label")
evaluator.evaluate(prediction, {evaluator.metricName: "areaUnderROC"})

0.8973913043478261

In [32]:
evaluator.evaluate(prediction, {evaluator.metricName: "areaUnderPR"})

0.9158088615450615
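
The area under the ROC curve can be read as the probability that a randomly chosen positive record receives a higher score than a randomly chosen negative one. Here is a minimal sketch of that pairwise definition on made-up labels and scores. The function `roc_auc` is hypothetical, and Spark's evaluator computes the area from the curve itself, so small numerical differences are possible:

```python
def roc_auc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count one half."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Invented example: three of the four positive/negative pairs are ordered correctly
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(labels, scores)
print(auc)  # 0.75
```

A value of 0.5 corresponds to random ranking, which matches the first entry of the cross-validation metrics where the regularization is strongest.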

In [33]:
predictions_and_labels = prediction.select('prediction', 'label')
predictions_and_labels = predictions_and_labels.withColumn('prediction', predictions_and_labels['prediction'].cast('integer'))
predictions_and_labels.show(48)

+----------+-----+
|prediction|label|
+----------+-----+
|         1|    1|
|         1|    1|
|         1|    1|
|         0|    0|
|         1|    1|
|         1|    1|
|         1|    1|
|         0|    0|
|         0|    0|
|         1|    1|
|         0|    1|
|         0|    0|
|         0|    0|
|         1|    1|
|         0|    0|
|         1|    1|
|         1|    1|
|         1|    1|
|         1|    0|
|         1|    1|
|         0|    0|
|         1|    1|
|         1|    0|
|         0|    0|
|         0|    0|
|         1|    1|
|         0|    0|
|         0|    0|
|         1|    1|
|         0|    0|
|         1|    1|
|         1|    1|
|         0|    1|
|         0|    0|
|         0|    1|
|         0|    0|
|         0|    0|
|         0|    0|
|         0|    0|
|         1|    1|
|         1|    0|
|         0|    0|
|         0|    0|
|         1|    1|
|         0|    1|
|         0|    0|
|         0|    0|
|         0|    0|
+----------+-----+



In [34]:
tp = predictions_and_labels.filter('prediction == 1 and label == 1').count()
tp

19

In [35]:
fp = predictions_and_labels.filter('prediction == 1 and label == 0').count()
fp

3

In [36]:
tn = predictions_and_labels.filter('prediction == 0 and label == 0').count()
tn

22

In [37]:
fn = predictions_and_labels.filter('prediction == 0 and label == 1').count()
fn

4
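
The four counts above can also be obtained in a single pass over the (prediction, label) pairs. Here is a plain Python sketch on a few invented pairs; in Spark, something like `predictions_and_labels.groupBy('prediction', 'label').count()` should give the same table in one job:

```python
from collections import Counter

# Hypothetical (prediction, label) pairs
pairs = [(1, 1), (0, 0), (1, 0), (0, 1), (1, 1), (0, 0)]

# Count every combination once instead of filtering four times
counts = Counter(pairs)

tp_demo = counts[(1, 1)]  # predicted 1, actually 1
fp_demo = counts[(1, 0)]  # predicted 1, actually 0
tn_demo = counts[(0, 0)]  # predicted 0, actually 0
fn_demo = counts[(0, 1)]  # predicted 0, actually 1

print("tp={}, fp={}, tn={}, fn={}".format(tp_demo, fp_demo, tn_demo, fn_demo))
```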

In [38]:
accuracy = float(tp + tn) / (tp + tn + fp + fn)
accuracy

0.8541666666666666

In [39]:
# precision: when the prediction is 1, how often is it correct
# the proportion of predicted positive cases that are actually positive
precision = float(tp) / (tp + fp)
precision

0.8636363636363636

In [40]:
# classification error
ce = float(fn + fp) / (tp + tn + fp + fn)
ce

0.14583333333333334

In [41]:
# sensitivity (or recall or true positive rate): when the actual value is positive, how often is the prediction correct
# the proportion of actual positive cases which are correctly identified
recall = float(tp) / (tp + fn)
recall

0.8260869565217391

In [42]:
# the proportion of actual negative cases which are correctly identified
specificity = float(tn) / (fp + tn)
specificity

0.88

In [43]:
# (false positive rate) when the actual value is negative, how often is the prediction incorrect
fpr = float(fp) / (tn + fp)
fpr

0.12
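
The quantities above can be gathered in one helper, which also makes it easy to add the F1 score (the harmonic mean of precision and recall). This is a sketch using the four counts reported in this notebook; the function name `binary_metrics` is hypothetical:

```python
def binary_metrics(tp, fp, tn, fn):
    """Compute common binary classification metrics from confusion-matrix counts."""
    total = float(tp + fp + tn + fn)
    precision = tp / float(tp + fp)   # correct among predicted positives
    recall = tp / float(tp + fn)      # correct among actual positives
    return {
        'accuracy': (tp + tn) / total,
        'error': (fp + fn) / total,
        'precision': precision,
        'recall': recall,
        'specificity': tn / float(tn + fp),
        'fpr': fp / float(tn + fp),
        'f1': 2 * precision * recall / (precision + recall),
    }

# Counts from the test set above
metrics = binary_metrics(tp=19, fp=3, tn=22, fn=4)
for name in sorted(metrics):
    print("{}: {:.4f}".format(name, metrics[name]))
```

Note that specificity and the false positive rate always sum to 1, as do accuracy and the classification error.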