# Jonathan Halverson
# Tuesday, December 27, 2016
# Wine classification in Spark 2

Here we work through a standard machine learning binary classification problem. For the exploratory data analysis (EDA), see the corresponding notebook in the machine_learning directory.

Here is a nice notebook by Ben Sadeghi on a related topic:
http://nbviewer.jupyter.org/github/bensadeghi/pyspark-churn-prediction/blob/master/churn-prediction.ipynb

In [1]:
from __future__ import print_function
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[4]").appName("Wine classification").getOrCreate()

In [2]:
df = spark.read.csv('../../machine_learning/wine.csv', header=False, inferSchema=True)
df.sample(False, 0.1).show()

+---+-----+----+----+----+---+----+----+----+----+----+----+----+----+
|_c0|  _c1| _c2| _c3| _c4|_c5| _c6| _c7| _c8| _c9|_c10|_c11|_c12|_c13|
+---+-----+----+----+----+---+----+----+----+----+----+----+----+----+
|  1| 13.2|1.78|2.14|11.2|100|2.65|2.76|0.26|1.28|4.38|1.05| 3.4|1050|
|  1|14.38|1.87|2.38|12.0|102| 3.3|3.64|0.29|2.96| 7.5| 1.2| 3.0|1547|
|  1|13.82|1.75|2.42|14.0|111|3.88|3.74|0.32|1.87|7.05|1.01|3.26|1190|
|  2|12.17|1.45|2.53|19.0|104|1.89|1.75|0.45|1.03|2.95|1.45|2.23| 355|
|  2|12.21|1.19|1.75|16.8|151|1.85|1.28|0.14| 2.5|2.85|1.28|3.07| 718|
|  2|13.49|1.66|2.24|24.0| 87|1.88|1.84|0.27|1.03|3.74|0.98|2.78| 472|
|  2|12.33|0.99|1.95|14.8|136| 1.9|1.85|0.35|2.76| 3.4|1.06|2.31| 750|
|  2|11.56|2.05|3.23|28.5|119|3.18|5.08|0.47|1.87| 6.0|0.93|3.69| 465|
|  3|12.25|3.88| 2.2|18.5|112|1.38|0.78|0.29|1.14|8.21|0.65| 2.0| 855|
|  3|12.45|3.03|2.64|27.0| 97| 1.9|0.58|0.63|1.14| 7.5|0.67|1.73| 880|
|  3|13.69|3.26|2.54|20.0|107|1.83|0.56| 0.5| 0.8|5.88|0.96|1.82| 680|
|  3| 

Spark expects class labels to start at 0 and count up. The raw data has three classes, so to obtain a binary problem each class-3 sample is randomly reassigned to class 1 or 2, and the labels are then shifted down by one:

In [3]:
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType
from random import random as rng

def randomFlip3rd(x):
    if (x == 3):
        if (rng() > 0.5):
            return 1
        else:
            return 2
    else:
        return x

In [4]:
# Wrap the Python function as a Spark UDF that returns an integer.
toRandomFlip3rd = udf(randomFlip3rd, IntegerType())
# Reassign class 3 at random to class 1 or 2 ...
df = df.withColumn('_c0', toRandomFlip3rd(df._c0))
# ... then shift the labels from {1, 2} to {0, 1}.
df = df.withColumn('_c0', df['_c0'] - 1)
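
Because randomFlip3rd draws from an unseeded generator, the class counts can differ from run to run. The remapping can be sketched in plain Python with a seeded generator. The name remap_label and the seed value are illustrative and not part of the notebook:

```python
import random

def remap_label(label, rng):
    """Map wine classes {1, 2, 3} onto binary labels {0, 1}.

    Class 3 is reassigned to class 1 or 2 with equal probability,
    then every label is shifted down by one so labels start at 0.
    """
    if label == 3:
        label = 1 if rng.random() > 0.5 else 2
    return label - 1

rng = random.Random(42)  # hypothetical seed, chosen only for reproducibility
raw_labels = [1, 2, 3, 3, 1, 3]
binary_labels = [remap_label(x, rng) for x in raw_labels]
print(binary_labels)
```

Classes 1 and 2 always map to 0 and 1; only the former class-3 rows depend on the random draw.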

In [5]:
#df = df.filter(df._c0 < 3).withColumn('_c0', df['_c0'] - 1)
df.sample(False, 0.1).show()

+---+-----+----+----+----+---+----+----+----+----+-----+----+----+----+
|_c0|  _c1| _c2| _c3| _c4|_c5| _c6| _c7| _c8| _c9| _c10|_c11|_c12|_c13|
+---+-----+----+----+----+---+----+----+----+----+-----+----+----+----+
|  0|14.12|1.48|2.32|16.8| 95| 2.2|2.43|0.26|1.57|  5.0|1.17|2.82|1280|
|  0|12.93| 3.8|2.65|18.6|102|2.41|2.41|0.25|1.98|  4.5|1.03|3.52| 770|
|  0|13.41|3.84|2.12|18.8| 90|2.45|2.68|0.27|1.48| 4.28|0.91| 3.0|1035|
|  0|14.38|3.59|2.28|16.0|102|3.25|3.17|0.27|2.19|  4.9|1.04|3.44|1065|
|  1|12.33| 1.1|2.28|16.0|101|2.05|1.09|0.63|0.41| 3.27|1.25|1.67| 680|
|  1|11.03|1.51| 2.2|21.5| 85|2.46|2.17|0.52|2.01|  1.9|1.71|2.87| 407|
|  1|12.77|3.43|1.98|16.0| 80|1.63|1.25|0.43|0.83|  3.4| 0.7|2.12| 372|
|  1|11.56|2.05|3.23|28.5|119|3.18|5.08|0.47|1.87|  6.0|0.93|3.69| 465|
|  1|12.37|1.63| 2.3|24.5| 88|2.22|2.45| 0.4| 1.9| 2.12|0.89|2.78| 342|
|  0| 12.7|3.55|2.36|21.5|106| 1.7| 1.2|0.17|0.84|  5.0|0.78|1.29| 600|
|  0| 12.6|2.46| 2.2|18.5| 94|1.62|0.66|0.63|0.94|  7.1|0.73|1.5

In [6]:
df.groupby('_c0').count().toPandas()

Unnamed: 0,_c0,count
0,1,94
1,0,84


We see that the two classes appear with approximately equal proportions so stratified sampling is not required.
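
As a quick check of that statement, the class proportions follow directly from the counts in the table above:

```python
counts = {1: 94, 0: 84}  # class counts reported by the groupby above
total = sum(counts.values())

# float() keeps the division exact on both Python 2 and Python 3.
proportions = {label: n / float(total) for label, n in counts.items()}

for label in sorted(proportions):
    print(label, round(proportions[label], 3))
# 0 0.472
# 1 0.528
```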

Note that even with the master set to local[4], the DataFrame occupies a single partition:

In [7]:
df.rdd.getNumPartitions()

1

In [8]:
df.printSchema()

root
 |-- _c0: integer (nullable = true)
 |-- _c1: double (nullable = true)
 |-- _c2: double (nullable = true)
 |-- _c3: double (nullable = true)
 |-- _c4: double (nullable = true)
 |-- _c5: integer (nullable = true)
 |-- _c6: double (nullable = true)
 |-- _c7: double (nullable = true)
 |-- _c8: double (nullable = true)
 |-- _c9: double (nullable = true)
 |-- _c10: double (nullable = true)
 |-- _c11: double (nullable = true)
 |-- _c12: double (nullable = true)
 |-- _c13: integer (nullable = true)



Let's change the data type of _c5 and _c13 to double:

In [9]:
df = df.withColumn('_c5', df['_c5'].cast('double'))
df = df.withColumn('_c13', df['_c13'].cast('double'))
df.printSchema()

root
 |-- _c0: integer (nullable = true)
 |-- _c1: double (nullable = true)
 |-- _c2: double (nullable = true)
 |-- _c3: double (nullable = true)
 |-- _c4: double (nullable = true)
 |-- _c5: double (nullable = true)
 |-- _c6: double (nullable = true)
 |-- _c7: double (nullable = true)
 |-- _c8: double (nullable = true)
 |-- _c9: double (nullable = true)
 |-- _c10: double (nullable = true)
 |-- _c11: double (nullable = true)
 |-- _c12: double (nullable = true)
 |-- _c13: double (nullable = true)



Let's give the columns more meaningful names:

In [10]:
columns = ['Class', 'Alcohol', 'Malic acid', 'Ash', 'Alcalinity of ash', 'Magnesium', 'Total phenols', \
           'Flavanoids', 'Nonflavanoid phenols', 'Proanthocyanins', 'Color intensity', 'Hue', \
           'OD280/OD315 of diluted wines', 'Proline']

In [11]:
for u, v in zip(df.schema.names, columns):
    df = df.withColumnRenamed(u, v)

In [12]:
df.printSchema()

root
 |-- Class: integer (nullable = true)
 |-- Alcohol: double (nullable = true)
 |-- Malic acid: double (nullable = true)
 |-- Ash: double (nullable = true)
 |-- Alcalinity of ash: double (nullable = true)
 |-- Magnesium: double (nullable = true)
 |-- Total phenols: double (nullable = true)
 |-- Flavanoids: double (nullable = true)
 |-- Nonflavanoid phenols: double (nullable = true)
 |-- Proanthocyanins: double (nullable = true)
 |-- Color intensity: double (nullable = true)
 |-- Hue: double (nullable = true)
 |-- OD280/OD315 of diluted wines: double (nullable = true)
 |-- Proline: double (nullable = true)



Here is an alternative way to assign the column names:

In [13]:
from functools import reduce  # built in on Python 2, must be imported on Python 3
wineRaw = reduce(lambda data, i: data.withColumnRenamed(df.schema.names[i], columns[i]), range(len(columns)), df)
wineRaw.sample(False, 0.05).toPandas().applymap(lambda x: round(x, 1))

Unnamed: 0,Class,Alcohol,Malic acid,Ash,Alcalinity of ash,Magnesium,Total phenols,Flavanoids,Nonflavanoid phenols,Proanthocyanins,Color intensity,Hue,OD280/OD315 of diluted wines,Proline
0,0.0,13.8,1.6,2.6,20.0,115.0,3.0,3.4,0.4,1.7,6.6,1.1,2.6,1130.0
1,0.0,12.9,3.8,2.6,18.6,102.0,2.4,2.4,0.3,2.0,4.5,1.0,3.5,770.0
2,0.0,13.8,1.9,2.7,17.1,115.0,3.0,2.8,0.4,1.7,6.3,1.1,2.9,1375.0
3,1.0,12.0,0.9,2.0,19.0,86.0,2.4,2.3,0.3,1.4,2.5,1.4,3.1,278.0
4,1.0,12.2,1.3,1.9,19.0,92.0,2.4,2.0,0.4,2.1,2.7,0.9,3.0,312.0
5,1.0,11.8,2.1,2.8,28.5,92.0,2.1,2.2,0.6,1.8,3.0,1.0,2.4,466.0
6,1.0,13.5,3.2,2.7,23.5,97.0,1.6,0.5,0.5,0.6,4.3,0.9,2.1,520.0
7,0.0,13.7,3.3,2.5,20.0,107.0,1.8,0.6,0.5,0.8,5.9,1.0,1.8,680.0


Here are the descriptive statistics -- of course, no standardization has been performed yet:

In [14]:
wineRaw.select(wineRaw.schema.names[1:]).toPandas().describe().applymap(lambda x: round(x, 1))

Unnamed: 0,Alcohol,Malic acid,Ash,Alcalinity of ash,Magnesium,Total phenols,Flavanoids,Nonflavanoid phenols,Proanthocyanins,Color intensity,Hue,OD280/OD315 of diluted wines,Proline
count,178.0,178.0,178.0,178.0,178.0,178.0,178.0,178.0,178.0,178.0,178.0,178.0,178.0
mean,13.0,2.3,2.4,19.5,99.7,2.3,2.0,0.4,1.6,5.1,1.0,2.6,746.9
std,0.8,1.1,0.3,3.3,14.3,0.6,1.0,0.1,0.6,2.3,0.2,0.7,314.9
min,11.0,0.7,1.4,10.6,70.0,1.0,0.3,0.1,0.4,1.3,0.5,1.3,278.0
25%,12.4,1.6,2.2,17.2,88.0,1.7,1.2,0.3,1.3,3.2,0.8,1.9,500.5
50%,13.1,1.9,2.4,19.5,98.0,2.4,2.1,0.3,1.6,4.7,1.0,2.8,673.5
75%,13.7,3.1,2.6,21.5,107.0,2.8,2.9,0.4,1.9,6.2,1.1,3.2,985.0
max,14.8,5.8,3.2,30.0,162.0,3.9,5.1,0.7,3.6,13.0,1.7,4.0,1680.0


Reformat the data into a new dataframe with features as a vector:

In [15]:
from pyspark.sql import Row
from pyspark.ml.linalg import Vectors

In [16]:
wineRaw = wineRaw.rdd.map(lambda row: Row(label=row.Class, features=Vectors.dense(row[1:]))).toDF()
wineRaw.sample(False, 0.1).show(5)

+--------------------+-----+
|            features|label|
+--------------------+-----+
|[14.23,1.71,2.43,...|    0|
|[13.2,1.78,2.14,1...|    0|
|[14.2,1.76,2.45,1...|    0|
|[14.06,2.15,2.61,...|    0|
|[13.75,1.73,2.41,...|    0|
+--------------------+-----+
only showing top 5 rows



Now that we have the correct format, a train-test split can be performed before we standardize:

In [17]:
trainUnSTD, testUnSTD = wineRaw.randomSplit([0.7, 0.3])
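
The weights given to randomSplit act as sampling probabilities rather than exact fractions, so the split sizes are only approximately 70/30 (128 training rows in this run, against 0.7 × 178 ≈ 125). randomSplit also accepts a seed argument, which makes the split repeatable. The sketch below illustrates the idea in plain Python, assuming one independent uniform draw per row. The helper bernoulli_split is illustrative and is not Spark's implementation:

```python
import random

def bernoulli_split(rows, train_fraction, seed):
    """Assign each row to train or test with an independent random draw."""
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        (train if rng.random() < train_fraction else test).append(row)
    return train, test

rows = list(range(178))  # stand-in for the 178 wine samples
train_rows, test_rows = bernoulli_split(rows, 0.7, seed=0)
print(len(train_rows), len(test_rows))  # close to, but not exactly, a 70/30 split
```

Calling the helper again with the same seed returns the same two lists.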

Let's standardize the data by making the mean and variance 0 and 1, respectively, for each column:

In [18]:
from pyspark.ml.feature import StandardScaler
scaler = StandardScaler(inputCol="features", outputCol="scaledFeatures", withStd=True, withMean=True)
scalerModel = scaler.fit(trainUnSTD)
train = scalerModel.transform(trainUnSTD).cache()
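
The scaler is fit on the training split only, and the same fitted statistics are later reused for the test split, so no information from the test data leaks into training. The sketch below shows this for a single feature column in plain Python, assuming the sample (n - 1) standard deviation. The numbers are made up for illustration:

```python
from statistics import mean, stdev

train_col = [11.0, 12.5, 13.0, 13.5, 14.0]  # hypothetical training values
test_col = [12.0, 14.5]                      # hypothetical test values

# Fit: estimate the statistics from the training data only.
mu = mean(train_col)
sigma = stdev(train_col)  # sample standard deviation (n - 1 denominator)

def standardize(values):
    """Center and scale values using the training statistics."""
    return [(v - mu) / sigma for v in values]

train_scaled = standardize(train_col)
test_scaled = standardize(test_col)  # reuses mu and sigma from the training data

print(round(mean(train_scaled), 6), round(stdev(train_scaled), 6))
```

The scaled training column has mean 0 and standard deviation 1 by construction; the scaled test column generally does not, which is expected.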

In [19]:
train.printSchema()

root
 |-- features: vector (nullable = true)
 |-- label: long (nullable = true)
 |-- scaledFeatures: vector (nullable = true)



In [20]:
train.show(5)

+--------------------+-----+--------------------+
|            features|label|      scaledFeatures|
+--------------------+-----+--------------------+
|[11.03,1.51,2.2,2...|    1|[-2.3837202600692...|
|[11.41,0.74,2.5,2...|    1|[-1.9199436462701...|
|[11.45,2.4,2.42,2...|    1|[-1.8711250553439...|
|[11.46,3.74,1.82,...|    1|[-1.8589204076124...|
|[11.61,1.35,2.7,2...|    1|[-1.6758506916390...|
+--------------------+-----+--------------------+
only showing top 5 rows



Let's check that the standardized features have a mean of 0 and a variance of 1:

In [21]:
train.rdd.map(lambda row: row.scaledFeatures.values.tolist()).toDF().toPandas().describe().applymap(lambda x: round(x, 1))

Unnamed: 0,_1,_2,_3,_4,_5,_6,_7,_8,_9,_10,_11,_12,_13
count,128.0,128.0,128.0,128.0,128.0,128.0,128.0,128.0,128.0,128.0,128.0,128.0,128.0
mean,-0.0,-0.0,0.0,-0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
std,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
min,-2.4,-1.4,-2.9,-2.6,-2.1,-2.0,-1.6,-1.9,-2.1,-1.7,-2.0,-1.8,-1.5
25%,-0.8,-0.7,-0.5,-0.7,-0.8,-0.9,-1.0,-0.8,-0.5,-0.8,-0.8,-1.0,-0.8
50%,0.1,-0.4,-0.0,0.1,-0.1,-0.0,0.0,-0.0,-0.2,-0.1,0.1,0.3,-0.2
75%,0.8,0.7,0.7,0.5,0.5,0.8,0.8,0.8,0.6,0.5,0.6,0.8,0.8
max,2.2,2.8,2.3,3.2,3.7,2.6,1.9,2.2,3.8,2.9,3.3,2.0,2.6


Now that the wine DataFrame is properly formatted, we create an ML model:

In [22]:
from pyspark.ml import Pipeline
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.ml.classification import LogisticRegression

lr = LogisticRegression(featuresCol='scaledFeatures', labelCol='label', maxIter=10, threshold=0.5)
pipeline = Pipeline(stages=[lr])

paramGrid = ParamGridBuilder().addGrid(lr.regParam, [1.0, 0.1, 0.01]).addGrid(lr.elasticNetParam, [1.0, 0.1, 0.01]).build()
bce = BinaryClassificationEvaluator(metricName="areaUnderROC")
crossval = CrossValidator(estimator=pipeline, estimatorParamMaps=paramGrid, evaluator=bce, numFolds=5)
cvModel = crossval.fit(train)
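
The grid crosses three values of regParam with three values of elasticNetParam, so cross-validation scores nine candidate models, one per entry of avgMetrics. The sketch below enumerates the grid and evaluates the elastic-net penalty for a weight vector, assuming the usual form regParam * (alpha * ||w||_1 + (1 - alpha) / 2 * ||w||_2^2) with alpha equal to elasticNetParam. The helper elastic_net_penalty and the weights are illustrative:

```python
from itertools import product

reg_params = [1.0, 0.1, 0.01]
elastic_net_params = [1.0, 0.1, 0.01]

# Every (regParam, elasticNetParam) pair is one candidate model.
grid = list(product(reg_params, elastic_net_params))
print(len(grid))  # 9 combinations

def elastic_net_penalty(weights, reg_param, alpha):
    """Mix of L1 (alpha = 1) and L2 (alpha = 0) regularization."""
    l1 = sum(abs(w) for w in weights)
    l2 = sum(w * w for w in weights)
    return reg_param * (alpha * l1 + (1.0 - alpha) / 2.0 * l2)

weights = [0.5, -0.25, 0.0]  # hypothetical coefficients
for reg_param, alpha in grid[:3]:
    print(reg_param, alpha, round(elastic_net_penalty(weights, reg_param, alpha), 4))
```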

Here are some details about the cross-validation procedure and the final coefficients:

In [23]:
cvModel.avgMetrics

[0.5,
 0.9135933837404426,
 0.913283399606929,
 0.9068871487989134,
 0.9164018824312942,
 0.9152993085346026,
 0.9074318165494637,
 0.9057457248633719,
 0.9065693456869929]

In [24]:
cvModel.bestModel.stages[0].coefficients

DenseVector([-0.6265, -0.1964, -0.3267, 0.2522, 0.0, -0.1282, -0.0524, 0.0938, 0.0242, -0.307, 0.0162, -0.0856, -0.66])

In [25]:
cvModel.bestModel.stages[0].intercept

0.11877831164402701

Let's evaluate the model using the test data. We begin by standardizing the test data using the previous standardizer object:

In [26]:
test = scalerModel.transform(testUnSTD).cache()
prediction = cvModel.transform(test)
prediction.sample(False, 0.2).show()

+--------------------+-----+--------------------+--------------------+--------------------+----------+
|            features|label|      scaledFeatures|       rawPrediction|         probability|prediction|
+--------------------+-----+--------------------+--------------------+--------------------+----------+
|[12.04,4.3,2.38,2...|    1|[-1.1510508391822...|[-1.4508164913835...|[0.18987593900319...|       1.0|
|[12.42,1.61,2.19,...|    1|[-0.6872742253831...|[-2.4165564141809...|[0.08191886866055...|       1.0|
|[12.43,1.53,2.29,...|    1|[-0.6750695776516...|[-1.7633809686264...|[0.14636740283458...|       1.0|
|[12.52,2.43,2.17,...|    1|[-0.5652277480676...|[-1.9772097572436...|[0.12161659380022...|       1.0|
|[13.05,2.05,3.22,...|    0|[0.08161858170477...|[0.70312705273351...|[0.66888071500915...|       0.0|
|[13.73,1.5,2.7,22...|    0|[0.91153462745048...|[2.01436632854915...|[0.88229721777144...|       0.0|
|[13.78,2.76,2.3,2...|    1|[0.97255786610825...|[0.22189059710223...|[0.

Here are the probabilities for class 0 and class 1:

In [27]:
prediction.select('probability').rdd.map(lambda row: row.probability.values.tolist()).toDF().toPandas().applymap(lambda x: round(x, 7))[:5]

Unnamed: 0,_1,_2
0,0.301341,0.698659
1,0.068303,0.931697
2,0.052442,0.947558
3,0.190876,0.809124
4,0.189358,0.810642


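The raw prediction is the dot product of the scaled feature vector with the coefficients, plus the intercept. The two columns hold the negative and the positive of this margin: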

In [28]:
prediction.select('rawPrediction').rdd.map(lambda row: row.rawPrediction.values.tolist()).toDF().toPandas().applymap(lambda x: round(x, 2))[:5]

Unnamed: 0,_1,_2
0,-0.84,0.84
1,-2.61,2.61
2,-2.89,2.89
3,-1.44,1.44
4,-1.45,1.45


In [29]:
2.7182**-8.62 / (1+2.7182**-8.62)

0.00018047451181084533
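
For binary logistic regression, the class-1 probability is the logistic (sigmoid) function of the class-1 raw prediction. As a quick check, the sketch below applies the sigmoid to the five rounded raw predictions shown above and prints them next to the class-1 probabilities from the earlier cell. Small differences come from the rounding of the raw predictions to two decimals:

```python
import math

def sigmoid(z):
    """Logistic function mapping a real-valued margin to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Class-1 raw predictions (rounded) and class-1 probabilities from the cells above.
raw_class1 = [0.84, 2.61, 2.89, 1.44, 1.45]
prob_class1 = [0.698659, 0.931697, 0.947558, 0.809124, 0.810642]

for z, p in zip(raw_class1, prob_class1):
    print(z, round(sigmoid(z), 4), p)
```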

Let's compute two evaluation metrics, the area under the ROC curve and the area under the precision-recall curve:

In [30]:
evaluator = BinaryClassificationEvaluator(rawPredictionCol="rawPrediction", labelCol="label")
evaluator.evaluate(prediction, {evaluator.metricName: "areaUnderROC"})

0.9756493506493507

In [31]:
evaluator.evaluate(prediction, {evaluator.metricName: "areaUnderPR"})

0.9845378719840736
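
The area under the ROC curve can be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below computes it pairwise on made-up scores (not the wine predictions). The helper pairwise_auc is illustrative and is not how Spark computes the metric:

```python
def pairwise_auc(scores, labels):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

scores = [0.9, 0.8, 0.7, 0.3]  # hypothetical class-1 scores
labels = [1, 0, 1, 0]
print(pairwise_auc(scores, labels))  # 0.75: three of the four pairs are ordered correctly
```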

In [32]:
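predictions_and_labels = prediction.select('prediction', 'label')
# Cast the prediction column itself (not label) to integer so it can be compared with label.
# The outputs recorded below came from a run that cast 'label' here, which makes prediction equal label in every row.
predictions_and_labels = predictions_and_labels.withColumn('prediction', predictions_and_labels['prediction'].cast('integer'))
predictions_and_labels.show(5)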

+----------+-----+
|prediction|label|
+----------+-----+
|         1|    1|
|         1|    1|
|         1|    1|
|         1|    1|
|         1|    1|
+----------+-----+
only showing top 5 rows



In [33]:
predictions_and_labels.count()

50

In [34]:
predictions_and_labels.filter('prediction == label').count()

50

In [35]:
predictions_and_labels.filter('prediction == 0 and label == 0').count()

22

In [36]:
predictions_and_labels.filter('prediction == 1 and label == 1').count()

28
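
Counts like these are the cells of a confusion matrix, from which accuracy, precision, and recall follow. The sketch below works through the arithmetic in plain Python on made-up predictions, not the notebook's results. The helper confusion_counts is illustrative:

```python
def confusion_counts(predictions, labels):
    """Return (tp, fp, tn, fn) for binary predictions and labels."""
    pairs = list(zip(predictions, labels))
    tp = sum(1 for p, y in pairs if p == 1 and y == 1)
    fp = sum(1 for p, y in pairs if p == 1 and y == 0)
    tn = sum(1 for p, y in pairs if p == 0 and y == 0)
    fn = sum(1 for p, y in pairs if p == 0 and y == 1)
    return tp, fp, tn, fn

predictions = [1, 1, 0, 0, 1]  # hypothetical model output
labels = [1, 0, 0, 1, 1]       # hypothetical ground truth
tp, fp, tn, fn = confusion_counts(predictions, labels)

accuracy = (tp + tn) / float(len(labels))
precision = tp / float(tp + fp)
recall = tp / float(tp + fn)
print(tp, fp, tn, fn, accuracy)  # 2 1 1 1 0.6
```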