### Install & Run MongoDB 

- Installation
```bash
brew install mongo
sudo mkdir -p /data/db
sudo chown -R `id -u` /data/db
```
- Run Mongo Daemon
```bash
mongod
```
- Import data into mongo (in terminal)
```bash
mongoimport --db msan697 --collection business --file /Users/ThyKhueLy/msan697/inclass/data/business.json
```


- Increase the driver memory if PySpark runs out of memory:
```bash
pyspark --driver-memory=2g
```

## Machine Learning with Spark - Spark ML

### MLlib
### ML
- Newer and more flexible
- MLlib's DataFrame-based API
- Supports pipelines of estimators, transformers, and evaluators

### Example 1
Develop a simple linear regression model (closed form, without stochastic gradient descent) on `iris.csv`. The code below predicts `petal_length` from `sepal_length`.

In [1]:
from pyspark import SparkContext
from pyspark.sql.types import *
from pyspark.sql import Row, SQLContext
from pyspark.sql.functions import *
import math

In [2]:
sc = SparkContext.getOrCreate()
sqlContext = SQLContext(sc)

In [3]:
# Drop the test table if it exists.
sqlContext.sql("DROP TABLE IF EXISTS test")

DataFrame[]

In [4]:
# create schema
irisSchema = StructType([StructField("sepal_length", DoubleType(), True), 
                         StructField("sepal_width", DoubleType(), True),
                         StructField("petal_length", DoubleType(), True), 
                         StructField("petal_width", DoubleType(), True),
                         StructField("class", StringType(), True)])

In [5]:
iris = sqlContext.read \
                .format('com.databricks.spark.csv') \
                .options(header='false') \
                .load('./data/iris.csv', schema = irisSchema)

In [6]:
iris = iris.select('sepal_length', 'petal_length')

In [7]:
iris.show(n=5)

+------------+------------+
|sepal_length|petal_length|
+------------+------------+
|         5.1|         1.4|
|         4.9|         1.4|
|         4.7|         1.3|
|         4.6|         1.5|
|         5.0|         1.4|
+------------+------------+
only showing top 5 rows



In [17]:
# split training and test set
train, test = iris.randomSplit([0.9, 0.1])
train.cache() # faster to read from the memory - improve performance
test.cache()
test.write.saveAsTable('test')

In [9]:
# compute covariance 
covariance = train.cov('sepal_length', 'petal_length')

In [10]:
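# compute variance (note: this rebinds the name `variance`, shadowing the pyspark.sql.functions.variance function imported above)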
variance = train.select(variance('sepal_length'))

In [11]:
variance.show()

+----------------------+
|var_samp(sepal_length)|
+----------------------+
|     0.684387303149607|
+----------------------+



In [12]:
variance = variance.first()[0]
variance

0.684387303149607

In [13]:
# slope: coeff_0 = covariance(x, y) / variance(x)
coeff_0 = covariance/variance

In [14]:
# intercept: coeff_1 = mean(y) - coeff_0 * mean(x)
coeff_1 =  train.select(mean("petal_length")).first()[0] - coeff_0 * train.select(mean("sepal_length")).first()[0]

In [15]:
coeff_0, coeff_1

(1.8430654178726151, -6.992009096302155)

In [18]:
test_pred = sqlContext.sql("SELECT sepal_length, petal_length, sepal_length * {0} + {1} AS prediction FROM test"\
                        .format(coeff_0, coeff_1))

In [19]:
test_pred.show(10)

+------------+------------+------------------+
|sepal_length|petal_length|        prediction|
+------------+------------+------------------+
|         4.4|         1.4|1.1174787423280002|
|         4.8|         1.9|1.8547049094759984|
|         4.9|         1.5|2.0390114512630007|
|         4.9|         4.5|2.0390114512630007|
|         5.3|         1.5| 2.776237618410999|
|         5.5|         4.0|    3.144850701985|
|         5.6|         3.6|3.3291572437719985|
|         5.6|         3.9|3.3291572437719985|
|         5.9|         4.2|    3.882076869133|
|         6.3|         5.6| 4.619303036280998|
+------------+------------+------------------+
only showing top 10 rows



In [20]:
mse = test_pred.rdd.map(lambda x : (x["prediction"] - x["petal_length"])**2)\
                   .reduce(lambda x,y : x+y)/test_pred.count()

In [21]:
rmse = math.sqrt(mse)

In [22]:
mse, rmse

(0.7688573516151774, 0.8768451126710906)
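
The same closed-form fit can be sketched without Spark. This is a minimal illustration, assuming plain Python lists; the function names and the toy data are hypothetical and are not the iris data used above.

```python
import math

def fit_simple_linear_regression(xs, ys):
    """Return (slope, intercept) of the least-squares line y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / float(n)
    mean_y = sum(ys) / float(n)
    # Sample covariance and sample variance both divide by n - 1,
    # so that factor cancels in the ratio below.
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)
    var_x = sum((x - mean_x) ** 2 for x in xs) / (n - 1)
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return slope, intercept

def rmse(xs, ys, slope, intercept):
    """Root mean squared error of the fitted line on (xs, ys)."""
    squared_errors = [(slope * x + intercept - y) ** 2 for x, y in zip(xs, ys)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical toy data lying exactly on y = 2x + 1.
toy_x = [1.0, 2.0, 3.0, 4.0]
toy_y = [3.0, 5.0, 7.0, 9.0]
slope, intercept = fit_simple_linear_regression(toy_x, toy_y)
```

Here `slope` plays the role of `coeff_0` and `intercept` the role of `coeff_1` in the notebook cells.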

## Main Components

- Spark ML works on DataFrame and Dataset (a strongly typed collection of objects, of which DataFrame is a special case)

#### 1. Transformers
+ Convert one dataset into another
+ Types:
    + **Feature Transformer**: takes a DataFrame and outputs a DataFrame with new columns, such as feature vectors
    + **Learning Model**: takes a DataFrame and outputs a DataFrame with predicted labels
+ `transform()`: takes a DataFrame and optional parameters
+ For example, a model needs a numeric **feature vector**, so the DataFrame's string columns are first converted to numeric values and then combined with a **VectorAssembler** (each step yields a new DataFrame)
    + e.g. ```"female, separated, MS" ==> [1,2,3]```
#### 2. Estimators
+ Algorithms that produce transformers by fitting on a dataset
+ `fit()` takes a DataFrame and parameters
+ e.g. linear regression produces a linear regression model with fitted weights and an intercept; that model is a transformer
#### 3. Evaluators
+ Evaluate the performance of a model
+ `evaluate()`: takes a DataFrame of predictions and returns a metric
#### 4. ML Parameters
+ Specify parameters for estimators and transformers
+ **`ParamGridBuilder()`** builds a grid of parameter combinations; `CrossValidator()` then selects the model produced by the best-performing combination
#### 5. ML Pipeline
+ `Pipeline()` is an estimator; fitting it yields a `PipelineModel()`
+ In ML, the same steps are often repeated with slightly different parameters to find the best result
+ A `Pipeline` chains multiple `Transformers` and `Estimators` together to specify an ML workflow and runs them in order
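
The fit/transform contract described above can be sketched in plain Python. This is a toy illustration only, not Spark's implementation; every class name here (`MeanCenterer`, `Doubler`, `ToyPipeline`, and so on) is hypothetical.

```python
class Doubler(object):
    """Toy transformer: has transform() only, nothing to learn."""
    def transform(self, rows):
        return [2 * r for r in rows]

class MeanCenterer(object):
    """Toy estimator: fit() learns the mean and returns a transformer."""
    def fit(self, rows):
        return MeanCentererModel(sum(rows) / float(len(rows)))

class MeanCentererModel(object):
    """Toy fitted model: a transformer that holds the learned mean."""
    def __init__(self, mean):
        self.mean = mean
    def transform(self, rows):
        return [r - self.mean for r in rows]

class ToyPipeline(object):
    """Toy pipeline: an estimator that chains stages in order."""
    def __init__(self, stages):
        self.stages = stages
    def fit(self, rows):
        fitted, current = [], rows
        for stage in self.stages:
            # Estimators are fitted on the data produced so far;
            # plain transformers are used as they are.
            model = stage.fit(current) if hasattr(stage, "fit") else stage
            fitted.append(model)
            current = model.transform(current)
        return ToyPipelineModel(fitted)

class ToyPipelineModel(object):
    """Toy fitted pipeline: applies every fitted stage in order."""
    def __init__(self, stages):
        self.stages = stages
    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows

pipeline_model = ToyPipeline([Doubler(), MeanCenterer()]).fit([1.0, 2.0, 3.0])
centered = pipeline_model.transform([1.0, 2.0, 3.0])
```

The fitted pipeline remembers the mean learned during `fit()`, so it can be applied unchanged to new rows.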

## Algorithms

- Feature Extractors, Transformers, and Selectors.
- Feature Extractors : TF-IDF, Word2Vec, CountVectorizer
- Feature Transformers : Tokenizer, StopWordsRemover, n-gram, Binarizer, PCA, PolynomialExpansion, Discrete Cosine Transform (DCT), StringIndexer, IndexToString, OneHotEncoder, VectorIndexer, Normalizer, StandardScaler, MinMaxScaler, MaxAbsScaler, Bucketizer, ElementwiseProduct, SQLTransformer, VectorAssembler, QuantileDiscretizer
- Feature Selectors : VectorSlicer, RFormula, ChiSqSelector

### Example 2 - Logistic Regression 

Use the adult census data (loaded below from `./data/adult.raw`) to fit a logistic regression model.
1. Create an RDD.
2. Convert the RDD to DataFrame.
3. Clean the data.
    1. Missing data imputation.
    2. Convert strings to categorical values.
4. Train the model.
5. Interpret the model parameters.
6. Evaluate the model.

In [26]:
# Step 1: Create RDD 
def toDoubleSafe(v):
    try: return float(v)
    except ValueError: return v

# Load and convert data
cencus_raw = sc.textFile('./data/adult.raw', 4)\
               .map(lambda x: x.split(", "))
cencus_raw = cencus_raw.map(lambda r: [toDoubleSafe(x) for x in r])

In [27]:
cencus_raw.take(1)

[[39.0,
  u'State-gov',
  77516.0,
  u'Bachelors',
  u'Never-married',
  u'Adm-clerical',
  u'Not-in-family',
  u'White',
  u'Male',
  2174.0,
  0.0,
  40.0,
  u'United-States',
  u'<=50K']]

In [28]:
# Step 2: Convert RDD to DataFrame
# Create schema (fields are listed alphabetically because Row(**kwargs) sorts its fields by name in Spark 2.x)
from pyspark.sql.types import *
adultSchema = StructType([
    StructField("age",DoubleType(),True),
    StructField("capital_gain",DoubleType(),True),
    StructField("capital_loss",DoubleType(),True),
    StructField("education",StringType(),True),
    StructField("fnlwgt",DoubleType(),True),
    StructField("hours_per_week",DoubleType(),True),
    StructField("income",StringType(),True),
    StructField("marital_status",StringType(),True),
    StructField("native_country",StringType(),True),
    StructField("occupation",StringType(),True),
    StructField("race",StringType(),True),
    StructField("relationship",StringType(),True),
    StructField("sex",StringType(),True),
    StructField("workclass",StringType(),True)   
])

In [29]:
from pyspark.sql import Row
columns = ['age', 'workclass', 'fnlwgt', 'education', 'marital_status',
          'occupation', 'relationship', 'race', 'sex', 'capital_gain', 'capital_loss',
          'hours_per_week', 'native_country', 'income']

cencusRdd = cencus_raw.map(lambda row:
                           Row(**{x[0]: x[1] for x in zip(columns, row)}))

In [30]:
cencusRdd.take(1)

[Row(age=39.0, capital_gain=2174.0, capital_loss=0.0, education=u'Bachelors', fnlwgt=77516.0, hours_per_week=40.0, income=u'<=50K', marital_status=u'Never-married', native_country=u'United-States', occupation=u'Adm-clerical', race=u'White', relationship=u'Not-in-family', sex=u'Male', workclass=u'State-gov')]

In [31]:
dfraw = sqlContext.createDataFrame(cencusRdd, adultSchema)
dfraw.show(n=5)

+----+------------+------------+---------+--------+--------------+------+------------------+--------------+-----------------+-----+-------------+------+----------------+
| age|capital_gain|capital_loss|education|  fnlwgt|hours_per_week|income|    marital_status|native_country|       occupation| race| relationship|   sex|       workclass|
+----+------------+------------+---------+--------+--------------+------+------------------+--------------+-----------------+-----+-------------+------+----------------+
|39.0|      2174.0|         0.0|Bachelors| 77516.0|          40.0| <=50K|     Never-married| United-States|     Adm-clerical|White|Not-in-family|  Male|       State-gov|
|50.0|         0.0|         0.0|Bachelors| 83311.0|          13.0| <=50K|Married-civ-spouse| United-States|  Exec-managerial|White|      Husband|  Male|Self-emp-not-inc|
|38.0|         0.0|         0.0|  HS-grad|215646.0|          40.0| <=50K|          Divorced| United-States|Handlers-cleaners|White|Not-in-family|  Mal

In [35]:
# Step 3: Clean the data
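# Check the most common values of the columns that contain missing entries ("?").
# show() prints the table itself and returns None, so the surrounding print emits the "None" lines seen in the output.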
print dfraw.groupBy("workclass").count().orderBy("count", ascending=False).show()
print dfraw.groupBy("occupation").count().orderBy("count", ascending=False).show()
print dfraw.groupBy("native_country").count().orderBy("count", ascending=False).show()

+----------------+-----+
|       workclass|count|
+----------------+-----+
|         Private|33906|
|Self-emp-not-inc| 3862|
|       Local-gov| 3136|
|               ?| 2799|
|       State-gov| 1981|
|    Self-emp-inc| 1695|
|     Federal-gov| 1432|
|     Without-pay|   21|
|    Never-worked|   10|
+----------------+-----+

None
+-----------------+-----+
|       occupation|count|
+-----------------+-----+
|   Prof-specialty| 6172|
|     Craft-repair| 6112|
|  Exec-managerial| 6086|
|     Adm-clerical| 5611|
|            Sales| 5504|
|    Other-service| 4923|
|Machine-op-inspct| 3022|
|                ?| 2809|
| Transport-moving| 2355|
|Handlers-cleaners| 2072|
|  Farming-fishing| 1490|
|     Tech-support| 1446|
|  Protective-serv|  983|
|  Priv-house-serv|  242|
|     Armed-Forces|   15|
+-----------------+-----+

None
+------------------+-----+
|    native_country|count|
+------------------+-----+
|     United-States|43832|
|            Mexico|  951|
|                 ?|  857|
|      

**Missing Data Imputation**: 
+ **`.na`**: returns a `DataFrameNaFunctions` object for handling missing values
+ **`.replace(old_value, new_value, subset=None)`**: subset = optional list of column names
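
The idea behind this step, replacing the `"?"` placeholder with a column's most frequent value, can be sketched in plain Python. This is an illustration under assumptions: the function name `impute_with_mode` and the toy column are hypothetical.

```python
from collections import Counter

def impute_with_mode(values, missing="?"):
    """Replace every `missing` marker with the most frequent non-missing value."""
    counts = Counter(v for v in values if v != missing)
    mode = counts.most_common(1)[0][0]
    return [mode if v == missing else v for v in values]

# Hypothetical toy column, not the real census data.
toy_workclass = ["Private", "?", "State-gov", "Private", "?"]
imputed = impute_with_mode(toy_workclass)
```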

In [37]:
# Missing Data Imputation: replace "?" with the most frequent value of the column
dfraw_nona = dfraw.na.replace(["?"], ["Private"], ["workclass"])
dfraw_nona = dfraw_nona.na.replace(["?"], ["Prof-specialty"], ["occupation"])
# NOTE: native_country is not imputed here, so its "?" values remain a separate category.

#### Numeric Encoding / StringIndexer
- Convert String Categorical Values into Integer Indexes
- Take a DataFrame, fit a **`StringIndexerModel`**, and use it for the transformation

#### One-hot Encoding
- Expand a column to as many columns as there are distinct values in it
- For each value, exactly one column is 1 and the rest are 0
- Create a new column as a one-hot-encoded sparse vector. (Replace a column with a vector.)

#### Vector Assembly
- A transformer that combines a given list of columns into a single vector column
- Merge all the new vectors and the original columns into a single vector.
- Useful for combining raw features and features generated by different feature transformers into a single feature vector, in order to train ML models like logistic regression and decision trees
- ML algorithms work with two columns called **features** and **label** by default.
![img](vectorAssembler.png)
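
The three steps above can be sketched on a toy column in plain Python. This is only an illustration: the helper names are hypothetical, dense lists stand in for Spark's sparse vectors, and ties in frequency are broken alphabetically here as an assumption.

```python
from collections import Counter

def fit_string_index(values):
    """Map each distinct string to an index; the most frequent value gets 0.0."""
    counts = Counter(values)
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    return {v: float(i) for i, v in enumerate(ordered)}

def one_hot(index, size):
    """Expand an index into a vector with a single 1.0."""
    vec = [0.0] * size
    vec[int(index)] = 1.0
    return vec

def assemble(*parts):
    """Concatenate raw numbers and vectors into one flat feature vector."""
    features = []
    for part in parts:
        if isinstance(part, list):
            features.extend(part)
        else:
            features.append(float(part))
    return features

# Hypothetical toy column.
toy_status = ["married", "single", "married", "divorced"]
index_of = fit_string_index(toy_status)
encoded = [one_hot(index_of[s], len(index_of)) for s in toy_status]
# Combine a raw numeric feature (an age of 39.0) with the encoded vector of row 1.
row_features = assemble(39.0, encoded[1])
```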

In [36]:
# Convert String to Categorical Values 
# Numeric Encoding / StringIndexer
from pyspark.ml.feature import StringIndexer

def indexStringColumns(df, cols):
    #variable newdf will be updated several times
    newdf = df
    
    for c in cols:
        # For each given column, fit a StringIndexerModel.
        si = StringIndexer(inputCol=c, outputCol=c+"-num")
        sm = si.fit(newdf)
        # Create a DataFrame with the transformed values in a new column with the suffix "-num",
        # drop the original column,
        # and then rename the new column to remove the "-num" suffix.
        newdf = sm.transform(newdf).drop(c)
        newdf = newdf.withColumnRenamed(c+"-num", c)
    return newdf

In [38]:
dfnumeric = indexStringColumns(dfraw_nona, 
                            ["workclass", "education", "marital_status",
                            "occupation", "relationship", "race", "sex",
                             "native_country", "income"])

In [39]:
dfnumeric.show(2)

+----+------------+------------+-------+--------------+---------+---------+--------------+----------+------------+----+---+--------------+------+
| age|capital_gain|capital_loss| fnlwgt|hours_per_week|workclass|education|marital_status|occupation|relationship|race|sex|native_country|income|
+----+------------+------------+-------+--------------+---------+---------+--------------+----------+------------+----+---+--------------+------+
|39.0|      2174.0|         0.0|77516.0|          40.0|      3.0|      2.0|           1.0|       3.0|         1.0| 0.0|0.0|           0.0|   0.0|
|50.0|         0.0|         0.0|83311.0|          13.0|      1.0|      2.0|           0.0|       2.0|         0.0| 0.0|0.0|           0.0|   0.0|
+----+------------+------------+-------+--------------+---------+---------+--------------+----------+------------+----+---+--------------+------+
only showing top 2 rows



In [40]:
# One-hot Encoding
from pyspark.ml.feature import OneHotEncoder
def oneHotEncodeColumns(df, cols):
    newdf = df
    for c in cols:
        # For each given column, create a OneHotEncoder.
        # (Spark 2.x API: OneHotEncoder is a Transformer here; in Spark 3.x it is an Estimator and needs fit() first.)
        # dropLast : whether to drop the last category in the encoded vector (default: true)
        onehotenc = OneHotEncoder(inputCol=c, outputCol=c+"-onehot",
                                  dropLast=False)
        
        # Create a DataFrame with the transformed values in a new column with the suffix "-onehot",
        # drop the original column,
        # and then rename the new column to remove the "-onehot" suffix.
        newdf = onehotenc.transform(newdf).drop(c)
        newdf = newdf.withColumnRenamed(c+"-onehot", c)
    return newdf

In [41]:
dfhot = oneHotEncodeColumns(dfnumeric, 
                            ["workclass", "education", "marital_status",
                             "occupation", "relationship", "race",
                             "native_country"])

In [42]:
dfhot.show(2)

+----+------------+------------+-------+--------------+---+------+-------------+--------------+--------------+--------------+-------------+-------------+--------------+
| age|capital_gain|capital_loss| fnlwgt|hours_per_week|sex|income|    workclass|     education|marital_status|    occupation| relationship|         race|native_country|
+----+------------+------------+-------+--------------+---+------+-------------+--------------+--------------+--------------+-------------+-------------+--------------+
|39.0|      2174.0|         0.0|77516.0|          40.0|0.0|   0.0|(8,[3],[1.0])|(16,[2],[1.0])| (7,[1],[1.0])|(14,[3],[1.0])|(6,[1],[1.0])|(5,[0],[1.0])|(42,[0],[1.0])|
|50.0|         0.0|         0.0|83311.0|          13.0|0.0|   0.0|(8,[1],[1.0])|(16,[2],[1.0])| (7,[0],[1.0])|(14,[2],[1.0])|(6,[0],[1.0])|(5,[0],[1.0])|(42,[0],[1.0])|
+----+------------+------------+-------+--------------+---+------+-------------+--------------+--------------+--------------+-------------+-------------+--

In [44]:
# Vector Assembler
# Merging the data with Vector Assembler.
from pyspark.ml.feature import VectorAssembler
input_cols=["age","capital_gain","capital_loss","fnlwgt",
            "hours_per_week","sex","workclass","education",
            "marital_status","occupation","relationship",
            "native_country","race"]

va = VectorAssembler(outputCol="features", inputCols=input_cols)
lpoints = va.transform(dfhot)\
            .select("features", "income")\
            .withColumnRenamed('income', 'label')

In [46]:
lpoints.show(5)

+--------------------+-----+
|            features|label|
+--------------------+-----+
|(104,[0,1,3,4,9,1...|  0.0|
|(104,[0,3,4,7,16,...|  0.0|
|(104,[0,3,4,6,14,...|  0.0|
|(104,[0,3,4,6,19,...|  0.0|
|(104,[0,3,4,5,6,1...|  0.0|
+--------------------+-----+
only showing top 5 rows



In [47]:
# Step 4: Train the Model
splits = lpoints.randomSplit([0.8, 0.2], seed=1)
adult_train = splits[0].cache()
adult_valid = splits[1].cache()

In [48]:
from pyspark.ml.classification import LogisticRegression
lr = LogisticRegression(regParam=0.01, maxIter=1000, fitIntercept=True)
lrmodel = lr.fit(adult_train)

In [49]:
# Step 5: Interpret Model Parameters
lrmodel.coefficientMatrix

DenseMatrix(1, 104, [0.0203, 0.0001, 0.0006, 0.0, 0.0271, -0.5083, 0.0227, -0.3644, ..., 0.1557, 0.6976, -1.1457, 0.085, -0.1768, 0.2418, -0.2531, -0.0301], 1)

In [51]:
lrmodel.intercept

-4.312013382056819

In [52]:
# Step 6: Evaluate Classification Model
# transform the validation data
# rawPrediction : [-margin, margin], the log-odds for label 0 and label 1.
# probability : [P(label = 0), P(label = 1)], the sigmoid of rawPrediction.
valid_pred = lrmodel.transform(adult_valid)
valid_pred.show(5)

+--------------------+-----+--------------------+--------------------+----------+
|            features|label|       rawPrediction|         probability|prediction|
+--------------------+-----+--------------------+--------------------+----------+
|(104,[0,1,3,4,5,6...|  1.0|[-0.5087780542625...|[0.37548002179441...|       1.0|
|(104,[0,1,3,4,5,6...|  0.0|[0.72928404272965...|[0.67464814102871...|       0.0|
|(104,[0,1,3,4,5,6...|  0.0|[3.40717737102968...|[0.96792809443462...|       0.0|
|(104,[0,1,3,4,5,6...|  0.0|[4.04395070246335...|[0.98277385419711...|       0.0|
|(104,[0,1,3,4,5,6...|  0.0|[5.04840035706332...|[0.99362135402343...|       0.0|
+--------------------+-----+--------------------+--------------------+----------+
only showing top 5 rows
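
The relation between `rawPrediction` and `probability` in the table above can be checked by hand. The sketch below assumes a binary model with a plain list of coefficients; `predict_proba` is a hypothetical helper, not a Spark API.

```python
import math

def sigmoid(z):
    """Logistic function mapping log-odds to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(features, coefficients, intercept):
    """Return (rawPrediction, probability) in the two-element layout shown above."""
    margin = sum(w * x for w, x in zip(coefficients, features)) + intercept
    raw_prediction = [-margin, margin]
    p1 = sigmoid(margin)
    probability = [1.0 - p1, p1]
    return raw_prediction, probability

# First rawPrediction entry of the first row shown above (digits as printed).
p_label0 = sigmoid(-0.5087780542625)
```

`p_label0` comes out at about 0.3755, matching the first `probability` entry of that row. Since P(label = 1) is then about 0.62, the row is predicted as `1.0`.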



In [53]:
# Evaluate model
from pyspark.ml.evaluation import BinaryClassificationEvaluator

In [54]:
bc_eval = BinaryClassificationEvaluator()
print (bc_eval.getMetricName() + ": " +
      str(bc_eval.evaluate(valid_pred)))

areaUnderROC: 0.906050949299


In [55]:
# Step 7: Cross Validation & ParamGridBuilder()
from pyspark.ml.tuning import CrossValidator
from pyspark.ml.tuning import ParamGridBuilder

In [57]:
cv = CrossValidator().setEstimator(lr).setEvaluator(bc_eval).setNumFolds(5)

In [59]:
# ParamGridBuilder() builds every combination of the listed parameter values.
paramGrid = ParamGridBuilder()\
                .addGrid(lr.maxIter, [1000])\
                .addGrid(lr.regParam, [0.0001, 0.001, 0.005, 0.01, 0.05, 0.1, 0.5])\
                .build()

#setEstimatorParamMaps() takes ParamGridBuilder().
cv.setEstimatorParamMaps(paramGrid)
cvmodel = cv.fit(adult_train)

In [60]:
print cvmodel.bestModel.coefficients

[0.0224927021591,0.00031651782064,0.000647586459743,8.47531789584e-07,0.0306393184324,-0.701387996131,-0.45098299074,-0.912575087859,-0.451257069228,-0.649649495815,-0.234993248993,0.139186669814,-1.78995080113,-4.74000741816,-0.609515412487,-0.224071953022,0.552330041065,0.969271322403,-0.0977332604368,-1.34494908268,-0.0331316208049,-1.52606503781,-1.90185612026,1.50688701438,-1.60622067874,-0.865815243833,1.57198074027,-1.83368979664,-2.03569219159,-6.13340968996,1.24336513018,-1.46244075993,-1.02653839072,-1.06629316021,-0.937571407924,-0.893528033646,1.30588782025,-0.032480333163,-0.128320425344,0.430993329517,-0.219387686659,0.019702705536,-1.03778194865,-0.583501012425,-0.332063306887,-0.907586307763,-1.24860168828,0.24119614506,0.147861570695,-2.08772980355,0.329667516064,-0.402429982504,0.111681099415,-1.05807847097,-0.0307793365924,0.723277461932,-0.82071809166,-0.873603300068,-1.713918162,-1.129334384,-0.790947001714,-0.837361839514,-1.26720680309,-0.327245378511,-1.24323267

In [61]:
print cvmodel.bestModel.intercept

-1.40997985804


In [62]:
print cvmodel.bestModel._java_obj.getMaxIter()

1000


In [63]:
print cvmodel.bestModel._java_obj.getRegParam()

0.0001


In [66]:
# Best model selection
bestModel_pred = cvmodel.bestModel.transform(adult_valid)
BinaryClassificationEvaluator().evaluate(bestModel_pred)

0.9087492937870021

In [67]:
BinaryClassificationEvaluator().setMetricName("areaUnderPR")\
                               .evaluate(bestModel_pred)

0.7791236223234642
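
What the grid search with k-fold cross-validation does can be sketched in plain Python. This is a simplified illustration under assumptions: all function names are hypothetical, folds are assigned round-robin rather than randomly, and the "model" is a toy shrunken mean rather than logistic regression.

```python
import itertools

def build_param_grid(grid):
    """Expand {name: [values]} into a list of {name: value} parameter maps."""
    names = sorted(grid)
    combos = itertools.product(*(grid[n] for n in names))
    return [dict(zip(names, combo)) for combo in combos]

def k_fold_indices(n_rows, num_folds):
    """Yield (train_indices, validation_indices) for each fold."""
    for fold in range(num_folds):
        held_out = list(range(fold, n_rows, num_folds))
        held = set(held_out)
        yield [i for i in range(n_rows) if i not in held], held_out

def cross_validate(rows, param_maps, num_folds, fit, score):
    """Return the parameter map with the best mean validation score."""
    best_params, best_score = None, None
    for params in param_maps:
        fold_scores = []
        for train_idx, valid_idx in k_fold_indices(len(rows), num_folds):
            model = fit([rows[i] for i in train_idx], params)
            fold_scores.append(score(model, [rows[i] for i in valid_idx]))
        mean_score = sum(fold_scores) / len(fold_scores)
        if best_score is None or mean_score > best_score:
            best_params, best_score = params, mean_score
    return best_params, best_score

# Toy model: predict the training mean, shrunk towards 0 by regParam.
def toy_fit(train, params):
    return (sum(train) / float(len(train))) / (1.0 + params["regParam"])

# Score is the negative mean squared error, so larger is better.
def toy_score(model, valid):
    return -sum((v - model) ** 2 for v in valid) / len(valid)

toy_grid = build_param_grid({"maxIter": [1000], "regParam": [0.0, 0.5, 2.0]})
best_params, best_score = cross_validate(
    [1.0, 2.0, 3.0, 4.0, 5.0, 6.0], toy_grid, 3, toy_fit, toy_score)
```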

### Case Study - Preparing Data

In [75]:
# Load data & create RDD 
pen_raw = sc.textFile("./data/penbased.dat", 4)\
            .map(lambda x: x.split(", "))\
            .map(lambda r: [float(x) for x in r])

In [76]:
from pyspark.sql.types import *
from pyspark.sql import Row

In [77]:
# Create schema
penschema = StructType([
    StructField("pix1",DoubleType(),True),
    StructField("pix2",DoubleType(),True),
    StructField("pix3",DoubleType(),True),
    StructField("pix4",DoubleType(),True),
    StructField("pix5",DoubleType(),True),
    StructField("pix6",DoubleType(),True),
    StructField("pix7",DoubleType(),True),
    StructField("pix8",DoubleType(),True),
    StructField("pix9",DoubleType(),True),
    StructField("pix10",DoubleType(),True),
    StructField("pix11",DoubleType(),True),
    StructField("pix12",DoubleType(),True),
    StructField("pix13",DoubleType(),True),
    StructField("pix14",DoubleType(),True),
    StructField("pix15",DoubleType(),True),
    StructField("pix16",DoubleType(),True),
    StructField("label",DoubleType(),True)
])

In [92]:
dfpen = sqlContext.createDataFrame(pen_raw.map(lambda x: Row(*x)), penschema)

In [93]:
dfpen.show(1)

+----+-----+----+----+----+----+----+----+----+-----+-----+-----+-----+-----+-----+-----+-----+
|pix1| pix2|pix3|pix4|pix5|pix6|pix7|pix8|pix9|pix10|pix11|pix12|pix13|pix14|pix15|pix16|label|
+----+-----+----+----+----+----+----+----+----+-----+-----+-----+-----+-----+-----+-----+-----+
|47.0|100.0|27.0|81.0|57.0|37.0|26.0| 0.0| 0.0| 23.0| 56.0| 53.0|100.0| 90.0| 40.0| 98.0|  8.0|
+----+-----+----+----+----+----+----+----+----+-----+-----+-----+-----+-----+-----+-----+-----+
only showing top 1 row



In [94]:
# Merging the data with Vector Assembler.
from pyspark.ml.feature import VectorAssembler

In [95]:
va = VectorAssembler(outputCol="features",
                     inputCols=dfpen.columns[0:-1]) #except the last col.

penlpoints = va.transform(dfpen).select("features", "label")

In [97]:
penlpoints.show(5)

+--------------------+-----+
|            features|label|
+--------------------+-----+
|[47.0,100.0,27.0,...|  8.0|
|[0.0,89.0,27.0,10...|  2.0|
|[0.0,57.0,31.0,68...|  1.0|
|[0.0,100.0,7.0,92...|  4.0|
|[0.0,67.0,49.0,83...|  1.0|
+--------------------+-----+
only showing top 5 rows



In [112]:
# Create Training and Test data.
pendtsets = penlpoints.randomSplit([0.8, 0.2])
pendt_train = pendtsets[0].cache()
pendt_valid = pendtsets[1].cache()

### Example 3 - Decision Tree

### Decision Tree
```
DecisionTreeClassifier()
```
- Input
    1. **features** - Feature vector.
    2. **label** - Label to predict.
- Output
    1. **prediction** – Predicted label.
    2. **rawPrediction** - Vector of length # classes, with the counts of training instance labels at the tree node which makes the prediction.
    3. **probability** - Vector of length # classes equal to rawPrediction normalized to a multinomial distribution
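
How the three outputs relate can be sketched in a few lines of plain Python. The helper names and the toy counts are hypothetical; this only illustrates the normalization described above.

```python
def counts_to_probability(raw_prediction):
    """Normalize per-class training counts at a leaf so they sum to 1."""
    total = float(sum(raw_prediction))
    return [c / total for c in raw_prediction]

def predicted_label(raw_prediction):
    """The prediction is the class with the largest count at the leaf."""
    best = max(range(len(raw_prediction)), key=lambda i: raw_prediction[i])
    return float(best)

# Hypothetical leaf that saw 3 training rows of class 0 and 1 row of class 1.
toy_raw = [3.0, 1.0]
toy_probability = counts_to_probability(toy_raw)
toy_prediction = predicted_label(toy_raw)
```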

In [113]:
# Train the data.
# Parameters
# maxDepth : maximum tree depth (default : 5).
# maxBins : maximum number of bins when binning continuous features (default : 32).
# minInstancesPerNode : minimum number of dataset samples each branch needs to have after a split (default : 1).
# minInfoGain : minimum information gain for a split (default : 0).

from pyspark.ml.classification import DecisionTreeClassifier

dt = DecisionTreeClassifier(maxDepth=20, maxBins=32, minInstancesPerNode=1, minInfoGain = 0)
dtmodel = dt.fit(pendt_train)

In [114]:
# The tree itself!
print dtmodel._call_java('toDebugString')

DecisionTreeClassificationModel (uid=DecisionTreeClassifier_47d280b5e945c60c7fe4) of depth 19 with 621 nodes
  If (feature 15 <= 51.0)
   If (feature 4 <= 41.0)
    If (feature 9 <= 20.0)
     If (feature 14 <= 64.0)
      If (feature 5 <= 32.0)
       If (feature 0 <= 28.0)
        Predict: 4.0
       Else (feature 0 > 28.0)
        Predict: 0.0
      Else (feature 5 > 32.0)
       If (feature 1 <= 74.0)
        Predict: 1.0
       Else (feature 1 > 74.0)
        If (feature 0 <= 22.0)
         If (feature 0 <= 19.0)
          Predict: 6.0
         Else (feature 0 > 19.0)
          Predict: 4.0
        Else (feature 0 > 22.0)
         Predict: 6.0
     Else (feature 14 > 64.0)
      If (feature 10 <= 44.0)
       If (feature 10 <= 11.0)
        If (feature 0 <= 38.0)
         Predict: 1.0
        Else (feature 0 > 38.0)
         Predict: 8.0
       Else (feature 10 > 11.0)
        If (feature 15 <= 21.0)
         Predict: 2.0
        Else (feature 15 > 21.0)
         If (feature 0 <= 

In [115]:
# Evaluate the model
dtpredicts = dtmodel.transform(pendt_valid)
dtpredicts.show(5)

+--------------------+-----+--------------------+--------------------+----------+
|            features|label|       rawPrediction|         probability|prediction|
+--------------------+-----+--------------------+--------------------+----------+
|[0.0,0.0,41.0,16....|  9.0|[0.0,0.0,0.0,0.0,...|[0.0,0.0,0.0,0.0,...|       9.0|
|[0.0,4.0,74.0,29....|  1.0|[0.0,404.0,0.0,0....|[0.0,1.0,0.0,0.0,...|       1.0|
|[0.0,24.0,43.0,47...|  1.0|[0.0,404.0,0.0,0....|[0.0,1.0,0.0,0.0,...|       1.0|
|[0.0,33.0,31.0,60...|  1.0|[0.0,404.0,0.0,0....|[0.0,1.0,0.0,0.0,...|       1.0|
|[0.0,36.0,35.0,56...|  1.0|[0.0,404.0,0.0,0....|[0.0,1.0,0.0,0.0,...|       1.0|
+--------------------+-----+--------------------+--------------------+----------+
only showing top 5 rows



In [116]:
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

evaluator = MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction", metricName="accuracy")
accuracy = evaluator.evaluate(dtpredicts)

In [117]:
print("Test Error = %g" % (1.0 - accuracy))

Test Error = 0.0384216


In [118]:
# Confusion Matrix
dtpredicts.select('label','prediction').rdd.map(lambda x : (x,1)).countByKey() 

defaultdict(int,
            {Row(label=0.0, prediction=0.0): 200,
             Row(label=0.0, prediction=1.0): 1,
             Row(label=0.0, prediction=8.0): 1,
             Row(label=0.0, prediction=9.0): 2,
             Row(label=1.0, prediction=1.0): 189,
             Row(label=1.0, prediction=2.0): 4,
             Row(label=1.0, prediction=3.0): 4,
             Row(label=1.0, prediction=5.0): 1,
             Row(label=1.0, prediction=7.0): 3,
             Row(label=1.0, prediction=9.0): 1,
             Row(label=2.0, prediction=1.0): 4,
             Row(label=2.0, prediction=2.0): 173,
             Row(label=2.0, prediction=4.0): 1,
             Row(label=2.0, prediction=7.0): 1,
             Row(label=3.0, prediction=1.0): 1,
             Row(label=3.0, prediction=2.0): 1,
             Row(label=3.0, prediction=3.0): 181,
             Row(label=3.0, prediction=5.0): 1,
             Row(label=3.0, prediction=9.0): 2,
             Row(label=4.0, prediction=3.0): 1,
             Ro

In [119]:
# precision() is deprecated since Spark 2.0; use accuracy instead
from pyspark.mllib.evaluation import MulticlassMetrics

dtresrdd = dtpredicts.select("prediction", "label").rdd # convert DataFrame to RDD.
dtmm = MulticlassMetrics(dtresrdd) 

In [120]:
print dtmm.precision() 
print(dtmm.confusionMatrix())

0.961578400831
DenseMatrix([[ 200.,    1.,    0.,    0.,    0.,    0.,    0.,    0.,    1.,
                 2.],
             [   0.,  189.,    4.,    4.,    0.,    1.,    0.,    3.,    0.,
                 1.],
             [   0.,    4.,  173.,    0.,    1.,    0.,    0.,    1.,    0.,
                 0.],
             [   0.,    1.,    1.,  181.,    0.,    1.,    0.,    0.,    0.,
                 2.],
             [   0.,    0.,    0.,    1.,  200.,    0.,    0.,    0.,    0.,
                 0.],
             [   0.,    1.,    0.,    0.,    1.,  168.,    0.,    1.,    5.,
                 5.],
             [   1.,    2.,    0.,    1.,    1.,    1.,  194.,    1.,    0.,
                 0.],
             [   0.,    2.,    2.,    2.,    0.,    0.,    2.,  191.,    0.,
                 0.],
             [   2.,    0.,    0.,    0.,    0.,    6.,    1.,    0.,  174.,
                 0.],
             [   0.,    1.,    0.,    2.,    0.,    3.,    1.,    0.,    1.,
               18
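
Both views above (the per-pair counts and the dense matrix) hold the same information. The plain-Python sketch below builds a confusion matrix from `(label, prediction)` pairs, with rows for actual labels and columns for predictions; the function names and toy pairs are hypothetical.

```python
from collections import Counter

def confusion_matrix(pairs, num_classes):
    """Rows are actual labels, columns are predicted labels."""
    counts = Counter(pairs)
    return [[counts[(float(actual), float(predicted))]
             for predicted in range(num_classes)]
            for actual in range(num_classes)]

def accuracy(matrix):
    """Share of rows on the diagonal, i.e. predictions that match the label."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / float(total)

# Hypothetical (label, prediction) pairs for three classes.
toy_pairs = [(0.0, 0.0), (0.0, 1.0), (1.0, 1.0),
             (1.0, 1.0), (2.0, 2.0), (2.0, 0.0)]
toy_matrix = confusion_matrix(toy_pairs, 3)
toy_accuracy = accuracy(toy_matrix)
```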

In [121]:
# n-fold validation and the results.
from pyspark.ml.tuning import CrossValidator
from pyspark.ml.tuning import ParamGridBuilder

In [122]:
cv = CrossValidator().setEstimator(dt).setEvaluator(evaluator).setNumFolds(5)

# ParamGridBuilder() builds every combination of the listed parameter values.
paramGrid = ParamGridBuilder().addGrid(dt.maxDepth, [5,10,15,20,25,30]).build()

#setEstimatorParamMaps() takes ParamGridBuilder().
cv.setEstimatorParamMaps(paramGrid)
cvmodel = cv.fit(pendt_train)

In [123]:
print cvmodel.bestModel._java_obj.getMaxDepth()
# MulticlassClassificationEvaluator() defaults to the F1 metric, not accuracy.
print "F1 : " +  str(MulticlassClassificationEvaluator()\
                           .evaluate(cvmodel.bestModel.transform(pendt_valid)))

20
F1 : 0.961567856114


### Example 4 - Random Forest Classifier

In [124]:
# Fit
from pyspark.ml.classification import RandomForestClassifier

rf = RandomForestClassifier(maxDepth=20)
rfmodel = rf.fit(pendt_train)

In [125]:
print rfmodel._call_java('toDebugString')

RandomForestClassificationModel (uid=RandomForestClassifier_4b7ab1a1841d550335ef) with 20 trees
  Tree 0 (weight 1.0):
    If (feature 15 <= 51.0)
     If (feature 1 <= 99.0)
      If (feature 14 <= 40.0)
       If (feature 0 <= 64.0)
        If (feature 9 <= 54.0)
         If (feature 3 <= 80.0)
          If (feature 8 <= 67.0)
           If (feature 7 <= 99.0)
            If (feature 3 <= 76.0)
             Predict: 8.0
            Else (feature 3 > 76.0)
             Predict: 9.0
           Else (feature 7 > 99.0)
            Predict: 9.0
          Else (feature 8 > 67.0)
           If (feature 11 <= 20.0)
            If (feature 0 <= 19.0)
             Predict: 1.0
            Else (feature 0 > 19.0)
             Predict: 9.0
           Else (feature 11 > 20.0)
            Predict: 0.0
         Else (feature 3 > 80.0)
          If (feature 9 <= 16.0)
           If (feature 15 <= 26.0)
            If (feature 4 <= 41.0)
             If (feature 9 <= 4.0)
              Predict: 6.0
 

In [126]:
# Evaluate
rfpredicts = rfmodel.transform(pendt_valid)

In [129]:
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

evaluator = MulticlassClassificationEvaluator(labelCol='label',
                                             predictionCol='prediction',
                                             metricName='accuracy')

In [130]:
accuracy = evaluator.evaluate(rfpredicts)
print "Test Error = %g" % (1.0 - accuracy)

Test Error = 0.0103842


In [131]:
# Confusion Matrix
rfpredicts.select('label', 'prediction').rdd.map(lambda x: (x,1)).countByKey()

defaultdict(int,
            {Row(label=0.0, prediction=0.0): 202,
             Row(label=0.0, prediction=4.0): 1,
             Row(label=0.0, prediction=6.0): 1,
             Row(label=1.0, prediction=1.0): 193,
             Row(label=1.0, prediction=2.0): 5,
             Row(label=1.0, prediction=3.0): 3,
             Row(label=1.0, prediction=5.0): 1,
             Row(label=2.0, prediction=1.0): 1,
             Row(label=2.0, prediction=2.0): 178,
             Row(label=3.0, prediction=3.0): 186,
             Row(label=4.0, prediction=4.0): 200,
             Row(label=4.0, prediction=9.0): 1,
             Row(label=5.0, prediction=5.0): 180,
             Row(label=5.0, prediction=9.0): 1,
             Row(label=6.0, prediction=2.0): 1,
             Row(label=6.0, prediction=5.0): 2,
             Row(label=6.0, prediction=6.0): 198,
             Row(label=7.0, prediction=7.0): 198,
             Row(label=7.0, prediction=8.0): 1,
             Row(label=8.0, prediction=5.0): 1,
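
The `countByKey` result above is a dictionary keyed by `(label, prediction)` pairs. As a minimal sketch in plain Python, such counts can be arranged into a matrix and reduced to per-class recall. The `counts` dictionary below is a small made-up sample that uses plain tuples in place of `Row` objects, and `confusion_matrix` and `recall_per_class` are hypothetical helper names, not part of PySpark.

```python
# Hypothetical sample counts keyed by (label, prediction). In the notebook the
# keys are Row(label=..., prediction=...) objects, which unpack the same way
# because Row is a tuple subclass.
counts = {
    (0.0, 0.0): 8, (0.0, 1.0): 2,
    (1.0, 0.0): 1, (1.0, 1.0): 9,
}

def confusion_matrix(counts):
    """Arrange (label, prediction) counts into one row per true label."""
    classes = sorted({c for pair in counts for c in pair})
    index = {c: i for i, c in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for (label, prediction), n in counts.items():
        matrix[index[label]][index[prediction]] += n
    return classes, matrix

def recall_per_class(classes, matrix):
    """Fraction of each true label that was predicted correctly."""
    return {c: float(row[i]) / sum(row)
            for i, (c, row) in enumerate(zip(classes, matrix))
            if sum(row) > 0}

classes, matrix = confusion_matrix(counts)
recall = recall_per_class(classes, matrix)
for c, row in zip(classes, matrix):
    print("%s %s" % (c, row))
```

Rows are true labels and columns are predictions, so the diagonal holds the correct classifications.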
       

### Example 5 - k-Means Clustering
- Unsupervised learning
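- Features should be standardized first: k-means uses Euclidean distance, so features on a larger scale would otherwise dominate
- Typical uses: partitioning data into groups, anomaly detection, text/topic categorization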
- Parameters: 
    + k : Number of clusters to find (default – 2).
    + maxIter : Maximum number of iterations (default – 20). 
    + tol : Convergence tolerance (default – 0.0001).
    + seed : Random seed value for cluster initialization.

- Input: Features (feature vector)
- Output: Prediction (index of the predicted cluster)
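
To make the parameters above concrete, here is a minimal sketch of the basic k-means loop (Lloyd's algorithm) in plain Python. It is a simplification, not Spark's implementation: it assumes random initial centers drawn from the data, whereas Spark uses a different initialization by default, and the function names `kmeans` and `squared_distance` are made up for illustration.

```python
import math
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k=2, max_iter=20, tol=1e-4, seed=None):
    """Plain Lloyd's algorithm; returns (centers, assignments)."""
    rng = random.Random(seed)  # seed fixes the random initialization
    centers = [list(p) for p in rng.sample(points, k)]
    assignments = [0] * len(points)
    for _ in range(max_iter):  # max_iter caps the number of passes
        # Assignment step: each point goes to its nearest center.
        assignments = [
            min(range(k), key=lambda j: squared_distance(p, centers[j]))
            for p in points
        ]
        # Update step: each center moves to the mean of its members.
        new_centers = []
        for j in range(k):
            members = [p for p, a in zip(points, assignments) if a == j]
            if members:
                new_centers.append(
                    [sum(col) / float(len(members)) for col in zip(*members)])
            else:
                new_centers.append(centers[j])  # keep an empty cluster's center
        # tol: stop once no center moved farther than the tolerance.
        shift = max(math.sqrt(squared_distance(c, n))
                    for c, n in zip(centers, new_centers))
        centers = new_centers
        if shift <= tol:
            break
    return centers, assignments

# Made-up 1-D sample with two well-separated groups.
sample = [[0.0], [1.0], [2.0], [100.0], [101.0], [102.0]]
centers, assignments = kmeans(sample, k=2, seed=1)
print(sorted(centers))
print(assignments)
```

The returned assignments are cluster indices, which is what the `prediction` column holds after `transform`.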

In [132]:
# Merging the data with Vector Assembler.
from pyspark.ml.feature import VectorAssembler

In [133]:
va = VectorAssembler(outputCol="features",
                     inputCols=dfpen.columns[0:-1]) #except the last col
penlpoints = va.transform(dfpen)
penlpoints.show(5)

+----+-----+----+-----+-----+-----+-----+-----+----+-----+-----+-----+-----+-----+-----+-----+-----+--------------------+
|pix1| pix2|pix3| pix4| pix5| pix6| pix7| pix8|pix9|pix10|pix11|pix12|pix13|pix14|pix15|pix16|label|            features|
+----+-----+----+-----+-----+-----+-----+-----+----+-----+-----+-----+-----+-----+-----+-----+-----+--------------------+
|47.0|100.0|27.0| 81.0| 57.0| 37.0| 26.0|  0.0| 0.0| 23.0| 56.0| 53.0|100.0| 90.0| 40.0| 98.0|  8.0|[47.0,100.0,27.0,...|
| 0.0| 89.0|27.0|100.0| 42.0| 75.0| 29.0| 45.0|15.0| 15.0| 37.0|  0.0| 69.0|  2.0|100.0|  6.0|  2.0|[0.0,89.0,27.0,10...|
| 0.0| 57.0|31.0| 68.0| 72.0| 90.0|100.0|100.0|76.0| 75.0| 50.0| 51.0| 28.0| 25.0| 16.0|  0.0|  1.0|[0.0,57.0,31.0,68...|
| 0.0|100.0| 7.0| 92.0|  5.0| 68.0| 19.0| 45.0|86.0| 34.0|100.0| 45.0| 74.0| 23.0| 67.0|  0.0|  4.0|[0.0,100.0,7.0,92...|
| 0.0| 67.0|49.0| 83.0|100.0|100.0| 81.0| 80.0|60.0| 60.0| 40.0| 40.0| 33.0| 20.0| 47.0|  0.0|  1.0|[0.0,67.0,49.0,83...|
+----+-----+----+-----+-----+-----+-----+-----+----+-----+-----+-----+-----+-----+-----+-----+-----+--------------------+

In [134]:
penlpoints = penlpoints.select("features") # do not select label

In [142]:
# Fit the model
from pyspark.ml.clustering import KMeans
kmeans = KMeans(k=10)
kmeans_model = kmeans.fit(penlpoints)

In [138]:
# Evaluate
# Compute the Within Set Sum of Squared Errors (WSSSE).
# Note: computeCost was deprecated in Spark 2.4 and removed in 3.0.
wssse = kmeans_model.computeCost(penlpoints)
print("Within Set Sum of Sqr Errors: " + str(wssse))

Within Set Sum of Sqr Errors: 46160858.0701


In [139]:
# Average (root-mean-square) distance from the assigned center
import math
print("Avg distance from the center: " + str(
                            math.sqrt(wssse/pen_raw.count())))

Avg distance from the center: 68.2427139368
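
As a small worked example of the two numbers above, the sketch below computes WSSSE and the derived per-point distance by hand. The points and centers are made up for illustration, and `wssse_of` is a hypothetical helper, not a PySpark function.

```python
import math

def wssse_of(points, centers):
    """Sum over all points of the squared distance to the nearest center."""
    total = 0.0
    for p in points:
        total += min(
            sum((x - c) ** 2 for x, c in zip(p, center))
            for center in centers
        )
    return total

# Four made-up points, each exactly 1.0 away from its nearest center.
points = [[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]]
centers = [[0.0, 1.0], [10.0, 1.0]]

cost = wssse_of(points, centers)
rms = math.sqrt(cost / len(points))
print("WSSSE: %s" % cost)
print("RMS distance: %s" % rms)
```

Note that `sqrt(wssse / n)` is the root-mean-square distance. It equals the mean distance only when every point is equally far from its center, as in this sample, and is otherwise larger.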


In [141]:
# Show the result
centers = kmeans_model.clusterCenters()
print("Cluster Centers: ")
for center in centers[:3]:
    print(center)

Cluster Centers: 
[ 88.00580833  97.78993224  52.64762827  87.28848015  21.28944821
  59.95062924   7.01548887  28.31945789  32.33881897   4.47918683
  79.51016457  11.47821878  62.06582769  30.81219748  13.39303001
  24.91674734]
[ 27.44980443  83.71968709  63.03259452  94.55997392  85.55280313
  87.22946545  55.13233377  65.58148631  69.64602347  45.38787484
  87.32920469  22.85397653  52.22946545   7.26597132   4.30247718
   9.58083442]
[ 44.53996448  98.30195382  13.6660746   77.04795737   5.36234458
  49.47424512  69.21669627  47.98401421  96.60923623  65.72824156
  77.97513321  67.89698046  62.92717584  34.38543517  50.60035524
   0.34280639]


In [145]:
kmeans_model.transform(penlpoints)

DataFrame[features: vector, prediction: int]

In [149]:
kmeans_model.transform(va.transform(dfpen)).select('label', 'prediction').rdd.map(lambda x: (x,1)).countByKey()

defaultdict(int,
            {Row(label=0.0, prediction=0): 31,
             Row(label=0.0, prediction=2): 12,
             Row(label=0.0, prediction=3): 1,
             Row(label=0.0, prediction=4): 2,
             Row(label=0.0, prediction=5): 352,
             Row(label=0.0, prediction=6): 630,
             Row(label=0.0, prediction=8): 6,
             Row(label=0.0, prediction=9): 3,
             Row(label=1.0, prediction=0): 7,
             Row(label=1.0, prediction=1): 70,
             Row(label=1.0, prediction=3): 66,
             Row(label=1.0, prediction=4): 573,
             Row(label=1.0, prediction=8): 304,
             Row(label=1.0, prediction=9): 2,
             Row(label=2.0, prediction=4): 16,
             Row(label=2.0, prediction=8): 1006,
             Row(label=3.0, prediction=1): 919,
             Row(label=3.0, prediction=3): 2,
             Row(label=3.0, prediction=4): 19,
             Row(label=3.0, prediction=8): 1,
             Row(label=3.0, prediction=9): 1