<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px"> 
# Spark MLlib Lab
---

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
 
plt.style.use('ggplot')
sns.set(font_scale=1.5)
%config InlineBackend.figure_format = 'retina'
%matplotlib inline

## Create the Spark context

In [2]:
import pyspark as ps    # for the pyspark suite
import warnings         # for displaying warnings
from pyspark.sql import SQLContext

from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import HashingTF, Tokenizer, StandardScaler

In [3]:
try:
    # create a SparkContext that runs locally on 4 worker threads
    # ('local[*]' would use all available cores instead)
    sc = ps.SparkContext('local[4]')
    sqlContext = SQLContext(sc)
    print("Just created a SparkContext")
except ValueError:
    # give a warning if SparkContext already exists (for use inside pyspark)
    warnings.warn("SparkContext already exists in this scope")

Just created a SparkContext


## Label encoding categorical features

Often we have categorical features with values given as strings which we would like to transform to numerical values. The analogue of sklearn's `LabelEncoder` is the `StringIndexer`.

In [4]:
from pyspark.ml.feature import StringIndexer, OneHotEncoderEstimator

In [5]:
# StringIndexer is equivalent to LabelEncoder; usage is shown in the example below

In [6]:
# OneHotEncoderEstimator is just dummification (one-hot encoding)

In [7]:
ex_1 = sqlContext.createDataFrame([
    (4, "high"),
    (5, "low"),
    (6, "high"),
    (7, "high"),
    (8, "medium")
], ["id", "label"])

In [8]:
string_indexer = StringIndexer(
        inputCol='label',
        outputCol='label' + "_index"
    )

In [9]:
ex_2 = string_indexer.fit(ex_1).transform(ex_1)
ex_2.show()

+---+------+-----------+
| id| label|label_index|
+---+------+-----------+
|  4|  high|        0.0|
|  5|   low|        1.0|
|  6|  high|        0.0|
|  7|  high|        0.0|
|  8|medium|        2.0|
+---+------+-----------+
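
The indices in the table above follow `StringIndexer`'s default ordering, which assigns `0.0` to the most frequent label. The following pure-Python sketch only mimics that rule for illustration and is not Spark's implementation; `frequency_desc_index` is a hypothetical helper, and breaking ties alphabetically is an assumption that may not hold for every Spark version.

```python
from collections import Counter


def frequency_desc_index(labels):
    """Map each label to an index, most frequent label first."""
    counts = Counter(labels)
    # Sort by descending count; ties are broken alphabetically (assumption).
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}


# Same labels as in ex_1
mapping = frequency_desc_index(["high", "low", "high", "high", "medium"])
print(mapping)  # {'high': 0.0, 'low': 1.0, 'medium': 2.0}
```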



In [10]:
onehot = OneHotEncoderEstimator(
        dropLast=True,
        inputCols=['label_index'],
        outputCols=['label' + "_index_1"]
    )

In [11]:
onehot.fit(ex_2).transform(ex_2).show()

+---+------+-----------+-------------+
| id| label|label_index|label_index_1|
+---+------+-----------+-------------+
|  4|  high|        0.0|(2,[0],[1.0])|
|  5|   low|        1.0|(2,[1],[1.0])|
|  6|  high|        0.0|(2,[0],[1.0])|
|  7|  high|        0.0|(2,[0],[1.0])|
|  8|medium|        2.0|    (2,[],[])|
+---+------+-----------+-------------+



The one-hot-encoded values are given as a sparse vector for each observation, printed as `(size, [indices], [values])`: the first number is the length of the vector, the first bracket lists the positions that are non-zero, and the second bracket lists the values stored at those positions. As the last row shows, the last category (`medium`) is encoded as an all-zero vector, because dropping one redundant category is enabled here (`dropLast=True`, which is also the default).
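
To make the `(size, [indices], [values])` notation concrete, here is a small pure-Python sketch that expands such a triple into a dense list. `sparse_to_dense` is a hypothetical helper written for this illustration and is not part of the Spark API.

```python
def sparse_to_dense(size, indices, values):
    """Expand a (size, indices, values) triple into a dense list of floats."""
    dense = [0.0] * size
    for position, value in zip(indices, values):
        dense[position] = value
    return dense


# The three distinct encodings from the table above
print(sparse_to_dense(2, [0], [1.0]))  # high   -> [1.0, 0.0]
print(sparse_to_dense(2, [1], [1.0]))  # low    -> [0.0, 1.0]
print(sparse_to_dense(2, [], []))      # medium -> [0.0, 0.0]
```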

## Read in the car evaluation dataset 

```python
df = pd.read_csv('../../../../resource-datasets/car_evaluation/car.csv')
```

Use `acceptability` as target.

In [17]:
df = pd.read_csv('/Users/paxton615/GA/DSI9-lessons/week11/day5_spark_machine_learning/spark-ml-lab/car_evaluation/car.csv')
df.head()

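| | buying | maint | doors | persons | lug_boot | safety | acceptability |
|---|---|---|---|---|---|---|---|
| 0 | vhigh | vhigh | 2 | 2 | small | low | unacc |
| 1 | vhigh | vhigh | 2 | 2 | small | med | unacc |
| 2 | vhigh | vhigh | 2 | 2 | small | high | unacc |
| 3 | vhigh | vhigh | 2 | 2 | med | low | unacc |
| 4 | vhigh | vhigh | 2 | 2 | med | med | unacc |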


In [27]:
df.acceptability.unique()

array(['unacc', 'acc', 'vgood', 'good'], dtype=object)

In [18]:
spark_df = sqlContext.createDataFrame(df)
spark_df.first()

Row(buying='vhigh', maint='vhigh', doors='2', persons='2', lug_boot='small', safety='low', acceptability='unacc')

In [19]:
spark_df.dtypes

[('buying', 'string'),
 ('maint', 'string'),
 ('doors', 'string'),
 ('persons', 'string'),
 ('lug_boot', 'string'),
 ('safety', 'string'),
 ('acceptability', 'string')]

In [20]:
spark_df.select('buying').dtypes

[('buying', 'string')]

In [26]:
spark_df.select('acceptability').dtypes

[('acceptability', 'string')]

In [21]:
[name for name, dtype in spark_df.dtypes if dtype == 'string']

['buying', 'maint', 'doors', 'persons', 'lug_boot', 'safety', 'acceptability']

In [22]:
spark_df

DataFrame[buying: string, maint: string, doors: string, persons: string, lug_boot: string, safety: string, acceptability: string]

## Dummify the categorical variables.

First use the `StringIndexer`, then the `OneHotEncoderEstimator`, to create the dummified variables. Be careful not to one-hot encode the target variable (`acceptability`); it only needs to be indexed.

In [43]:
target_1 = sqlContext.createDataFrame([
    (4, "unacc"),
    (5, "acc"),
    (6, "vgood"),
    (7, "good"),
], ["id", "quality"])

In [44]:
string_indexer = StringIndexer(
    inputCol='quality',
    outputCol='quality' + '_index'
)

In [38]:
target_2 = string_indexer.fit(target_1).transform(target_1)

In [39]:
target_2.show()

+---+-------+-------------+
| id|quality|quality_index|
+---+-------+-------------+
|  4|  unacc|          0.0|
|  5|    acc|          1.0|
|  6|  vgood|          3.0|
|  7|   good|          2.0|
+---+-------+-------------+



In [45]:
spark_df.columns

['buying', 'maint', 'doors', 'persons', 'lug_boot', 'safety', 'acceptability']

In [47]:
onehot = OneHotEncoderEstimator(
    dropLast=True,
    inputCols=['quality_index'],
    outputCols=['quality_index_oh']
)

In [51]:
onehot.fit(target_2).transform(target_2).show(4)

+---+-------+-------------+----------------+
| id|quality|quality_index|quality_index_oh|
+---+-------+-------------+----------------+
|  4|  unacc|          0.0|   (3,[0],[1.0])|
|  5|    acc|          1.0|   (3,[1],[1.0])|
|  6|  vgood|          3.0|       (3,[],[])|
|  7|   good|          2.0|   (3,[2],[1.0])|
+---+-------+-------------+----------------+



In [55]:
df.head(2)

| | buying | maint | doors | persons | lug_boot | safety | acceptability |
|---|---|---|---|---|---|---|---|
| 0 | vhigh | vhigh | 2 | 2 | small | low | unacc |
| 1 | vhigh | vhigh | 2 | 2 | small | med | unacc |


In [65]:
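# Note: each pass fits on the original spark_df and overwrites ex_2,
# so only the index of the last column (acceptability) survives the loop.
for i in spark_df.columns:
    string_indexer = StringIndexer(
        inputCol=i,
        outputCol=i + "_index"
    )
    ex_2 = string_indexer.fit(spark_df).transform(spark_df)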

In [67]:
ex_2.show()

+------+-----+-----+-------+--------+------+-------------+-------------------+
|buying|maint|doors|persons|lug_boot|safety|acceptability|acceptability_index|
+------+-----+-----+-------+--------+------+-------------+-------------------+
| vhigh|vhigh|    2|      2|   small|   low|        unacc|                0.0|
| vhigh|vhigh|    2|      2|   small|   med|        unacc|                0.0|
| vhigh|vhigh|    2|      2|   small|  high|        unacc|                0.0|
| vhigh|vhigh|    2|      2|     med|   low|        unacc|                0.0|
| vhigh|vhigh|    2|      2|     med|   med|        unacc|                0.0|
| vhigh|vhigh|    2|      2|     med|  high|        unacc|                0.0|
| vhigh|vhigh|    2|      2|     big|   low|        unacc|                0.0|
| vhigh|vhigh|    2|      2|     big|   med|        unacc|                0.0|
| vhigh|vhigh|    2|      2|     big|  high|        unacc|                0.0|
| vhigh|vhigh|    2|      4|   small|   low|        

## Prepare your feature columns with `VectorAssembler`

In [16]:
from pyspark.ml.feature import VectorAssembler

## Fit and evaluate a Spark decision tree model and tune with grid search

Once done, also try other models.

In [17]:
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.ml.evaluation import BinaryClassificationEvaluator, MulticlassClassificationEvaluator