# General pipeline for binary classification tasks in spark ml

$\color{blue}{\text{Covering major components in real life scenarios for binary classification tasks using structured datasets.}}$

Modelling and data transformation part will be mostly operated in pure spark, but exploration and plotting metrics may involve pandas components.

The toolkits are in the lib/ folder, covering following topics:

0. summary on transformer, estimators, pipelines
1. spark and pandas dataframe conversion, tips in converting datatypes and assign correct schema
2. typical udf to transform columns
3. explorative analysis on spark df
4. categorical variables encoding methods, some advanced types of encoding implemented
5. feature selection methods in spark ml, selection based on model, lasso...
6. handling skewed datasets and highly imbalanced labels (up/down sampling) SMOTE in spark
7. modelling toolkits, contains common classifiers and their tuning guidance, use of xgboost in spark
8. metrics plotting tools, to plot common metrics after training

### Understanding of spark ml structure:
Key components:
1. Transformer
2. Estimator
3. Pipeline

$\textbf{Transformer}$ can transform one df into another df by appending new columns onto original df. E.g., an ML model is a Transformer which transforms a DataFrame with features into a DataFrame with predictions. It has .transform() method, normally taking df as input. Transformers can be trained models, trained encoders.

$\textbf{Estimator}$ is an algorithm to be fit on a df to produce a Transformer. E.g., a learning algorithm is an Estimator which trains on a df and produces a model; if we specify a One-hot-encoder, it is an estimarot object, we need to .fit() it onto a column and obtain a transformer. Output of fitted/trained estimator is transformer.

$\textbf{Pipeline}$ chains multiple Transformers and Estimators together to specify an ML workflow. When executing the pipeline, spark will automatically sort out the steps to execute, depending on whether you called a .fit() or .transform() method. A Pipeline is specified as a sequence of stages, and each stage is either a Transformer or an Estimator. These stages are run in order, and the input DataFrame is transformed as it passes through each stage. For Transformer stages, the transform() method is called on the DataFrame. For Estimator stages, the fit() method is called to produce a Transformer (which becomes part of the PipelineModel, or fitted Pipeline). Pipeline is an estimator, after calling pipeline.fit() method, the output will be PipelineModel, a transformer ready to apply .transform() at test time.

In [3]:
import os
import random
import pandas as pd
pd.options.display.max_columns=None
pd.options.display.max_rows=None

#import toolkits
from lib import util
from lib import logger

def initialize_spark(app_name='spark_pipeline'):
    import findspark
    #spark path using default value
    findspark.init()

    import pyspark
    import pyarrow
    from pyspark.sql import SQLContext

    conf = pyspark.SparkConf()\
        .setAppName(app_name)\
        .setMaster('local')\
        .set('spark.driver.memory', '8g')\
        .set('spark.executor.memory', '8g')\
        .set('spark.executor.instances', 4)\
        .set('spark.executor.cores', 4)\
        .set('spark.driver.maxResultSize', '8g')\
        .set('spark.sql.shuffle.partitions', 100)\
        .set('spark.default.parallelism', 200)\
        .set('spark.sql.broadcastTimeout', 36000)\ #broadcastTimeout is purposedly set to be large due to development on single machine
        .set('spark.kryoserializer.buffer.max', '1024m')\
        .set('spark.sql.execution.arrow.enabled', 'false')\
        .set('spark.dynamicAllocation.enabled', "False")\
        .set('spark.port.maxRetries',30) 

    sc = pyspark.SparkContext.getOrCreate(conf)
    spark = pyspark.sql.SparkSession(sc)
    sqlContext = SQLContext.getOrCreate(sc)    
    return sc,spark,sqlContext

from pyspark.ml import Pipeline
from pyspark.ml.feature import StandardScaler
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml.feature import OneHotEncoderEstimator, StringIndexer, VectorAssembler
from pyspark.sql.types import IntegerType
from pyspark.sql.functions import when, lit
from distutils.version import LooseVersion
from importlib import reload
import pyspark.sql.functions as func
import pyspark.sql.types as typ

In [2]:
sc,spark,sqlContext = initialize_spark()

## Loading data into spark dataframe

In [114]:
df = pd.read_csv('datasets/adult.csv')
# if directly using spark.read.csv('datasets/adult.csv',header=True), unless we specify schema manually,
# all columns will be interpreted as string type, troublesome for later process
dataset = util.pandas_to_spark(sqlContext,df)
dataset = dataset.withColumn('income', when(dataset.income=='<=50K', lit(0)).otherwise(1))
cols = dataset.columns

In [133]:
from lib import imbalance_handler as imbalance_handle
reload(imbalance_handle)

<module 'lib.imbalance_handler' from '/Users/hwang/Desktop/spark_pipelines/lib/imbalance_handler.py'>

In [139]:
down_sampled_df = imbalance_handle.spark_df_down_sampling(dataset, 1, 'income', major_class_val = 0)

After downsampling "income": label distribution is [Row(income=0, count=11687), Row(income=1, count=11687)]


In [74]:
num_cols, cat_cols = util.get_num_cat_feat(dataset)

All columns are been covered.


In [107]:
min_cat = 2
max_cat = 20

In [152]:
#function to automate generate string columns and numerical columns from spark df
cat_coverage_df,no_info_col,cols_high_cardinality = util.coverage_test_spark(dataset,cat_cols,min_cat,max_cat)

Start the count computation for categorical features...
The no. of categorical features: 8


In [211]:
from lib import feature_selection as f_selector
reload(f_selector)

f_selector.num_cols_correlation_test(dataset,num_cols,0.1)

['educational-num', 'capital-gain']

# Smote, encoding, preprocessing, normalization should happen before smote
## Example of spark LSH projection to find nearnest neighbours, as a step in smote upsampling
BucketedRandomProjectionLSH: LSH class for Euclidean distance metrics. The input is dense or sparse vectors, each of which represents a point in the Euclidean distance space. The output will be vectors of configurable dimension. Hash values in the same dimension are calculated by the same hash function.

Input spark df divided into 3 parts:
1. LongType discrete attributes
2. StringType discrete attributes
3. Continuous numerical type attributes

In [None]:
import random
import numpy as np
from pyspark.sql import Row
from sklearn import neighbors
from pyspark.ml.feature import VectorAssembler

#for categorical columns, must take its stringIndexed form (smote should be after string indexing, default by frequency)
#using smote on stringIndexed categorical cols, just take randomly of sample value or it's neighbour's value
def __vectorAssembler(df,cols_2_vec,target_col):
    if(df.select(target_col).distinct().count() != 2):
        raise ValueError("Target field must have only 2 distinct classes")
        
    if target_col in cols_2_vec:
        cols_2_vec.remove(target_col)

    #only vectorize desired columns
    df = df.select(*(cols_2_vec+[target_col]))
    
    #only assembled numeric columns
    assembler=VectorAssembler(inputCols = cols_2_vec, outputCol = 'features')
    
    pos_vectorized = assembler.transform(df)
    
    vectorized = pos_vectorized.select('features',target_col).withColumn('label',pos_vectorized[target_col]).drop(target_col)
    
    return vectorized

In [83]:
vectorized = __vectorAssembler(dataset,num_cols,'income')

In [88]:
sample = vectorized.rdd.map(lambda x: x[0])

In [None]:
#in spark same syntex as pandas to slice df
dataInput_min = vectorized[vectorized['label'] == 1]
dataInput_maj = vectorized[vectorized['label'] == 0]

In [66]:
feature = dataInput_min.select('features')

In [96]:
from pyspark.ml.feature import BucketedRandomProjectionLSH
from pyspark.ml.linalg import Vectors
from pyspark.sql.functions import col
#fit the model of bucketedrandomprojection
brp = BucketedRandomProjectionLSH(inputCol="features", outputCol="hashes",seed=12345, bucketLength=7)
model = brp.fit(dataInput_min)
# Feature Transformation
print("The hashed dataset where hashed values are stored in the column 'hashes':")
model.transform(dataInput_min).show()

The hashed dataset where hashed values are stored in the column 'hashes':
+--------------------+-----+-----------+
|            features|label|     hashes|
+--------------------+-----+-----------+
|[25.0,226802.0,7....|    1|[[10174.0]]|
|[38.0,89814.0,9.0...|    1| [[4027.0]]|
|[18.0,103497.0,10...|    1| [[4642.0]]|
|[34.0,198693.0,6....|    1| [[8913.0]]|
|[29.0,227026.0,9....|    1|[[10184.0]]|
|[24.0,369667.0,10...|    1|[[16584.0]]|
|[55.0,104996.0,4....|    1| [[4710.0]]|
|[36.0,212465.0,13...|    1| [[9531.0]]|
|[26.0,82091.0,9.0...|    1| [[3681.0]]|
|[58.0,299831.0,9....|    1|[[13451.0]]|
|[20.0,444554.0,10...|    1|[[19945.0]]|
|[43.0,128354.0,9....|    1| [[5757.0]]|
|[37.0,60548.0,9.0...|    1| [[2715.0]]|
|[34.0,238588.0,10...|    1|[[10703.0]]|
|[72.0,132015.0,4....|    1| [[5922.0]]|
|[25.0,220931.0,13...|    1| [[9911.0]]|
|[25.0,205947.0,13...|    1| [[9239.0]]|
|[22.0,236427.0,9....|    1|[[10607.0]]|
|[23.0,134446.0,9....|    1| [[6029.0]]|
|[54.0,99516.0,9.0...|  

In [82]:
# Compute the locality sensitive hashes for the input rows, then perform approximate
# similarity join.
# We could avoid computing hashes by passing in the already-transformed dataset, e.g.
# `model.approxSimilarityJoin(transformedA, transformedB, 1.5)`
print("Approximately joining dfA and dfB on Euclidean distance smaller than 1.5:")
model.approxSimilarityJoin(vectorized, dataInput_min, 1.5, distCol="EuclideanDistance").show()

Approximately joining dfA and dfB on Euclidean distance smaller than 1.5:
+--------------------+--------------------+-----------------+
|            datasetA|            datasetB|EuclideanDistance|
+--------------------+--------------------+-----------------+
|[[39.0,182828.0,1...|[[39.0,182828.0,1...|              0.0|
|[[30.0,182833.0,1...|[[30.0,182833.0,1...|              0.0|
|[[48.0,246367.0,1...|[[48.0,246367.0,1...|              0.0|
|[[47.0,46857.0,10...|[[47.0,46857.0,10...|              0.0|
|[[36.0,73023.0,9....|[[36.0,73023.0,9....|              0.0|
|[[42.0,125461.0,1...|[[42.0,125461.0,1...|              0.0|
|[[22.0,126613.0,1...|[[21.0,126613.0,1...|              1.0|
|[[42.0,335846.0,1...|[[42.0,335846.0,1...|              0.0|
|[[26.0,40255.0,11...|[[26.0,40255.0,11...|              0.0|
|[[38.0,194809.0,1...|[[38.0,194809.0,1...|              0.0|
|[[64.0,201700.0,4...|[[64.0,201700.0,4...|              0.0|
|[[30.0,340899.0,1...|[[30.0,340899.0,1...|              0

In [None]:
key = Vectors.dense([1.0, 0.0])

# Compute the locality sensitive hashes for the input rows, then perform approximate nearest
# neighbor search.
# We could avoid computing hashes by passing in the already-transformed dataset, e.g.
# `model.approxNearestNeighbors(transformedA, key, 2)`

print("Approximately searching dfA for 2 nearest neighbors of the key:")
model.approxNearestNeighbors(dfA, key, 2).show()

#method 2: just use udf to wrap in the model and distributedly compute each minority sample and find their k nearnest neighbours from the original dataframe

In [205]:
from pyspark.ml.feature import BucketedRandomProjectionLSH
from pyspark.ml.linalg import Vectors
from pyspark.sql.functions import col

data_A = [(0, Vectors.dense([-1.0, -1.0 ]),),
        (1, Vectors.dense([-1.0, 1.0 ]),),
        (2, Vectors.dense([1.0, -1.0 ]),),
        (3, Vectors.dense([1.0, 1.0]),)]
dfA = spark.createDataFrame(data_A, ["id", "features"])

data_B = [(4, Vectors.dense([2.0, 2.0 ]),),
        (5, Vectors.dense([2.0, 3.0 ]),),
        (6, Vectors.dense([3.0, 2.0 ]),),
        (7, Vectors.dense([3.0, 3.0]),)]
dfB = spark.createDataFrame(data_B, ["id", "features"])

In [206]:
#fit the model of bucketedrandomprojection
brp = BucketedRandomProjectionLSH(inputCol="features", outputCol="hashes",seed=12345, bucketLength=7)
model = brp.fit(dfA)
# Feature Transformation
print("The hashed dataset where hashed values are stored in the column 'hashes':")
model.transform(dfA).show()

# Compute the locality sensitive hashes for the input rows, then perform approximate
# similarity join.
# We could avoid computing hashes by passing in the already-transformed dataset, e.g.
# `model.approxSimilarityJoin(transformedA, transformedB, 1.5)`
print("Approximately joining dfA and dfB on Euclidean distance smaller than 1.5:")
model.approxSimilarityJoin(dfA, dfB, 1.5, distCol="EuclideanDistance")\
    .select(col("datasetA.id").alias("idA"),
            col("datasetB.id").alias("idB"),
            col("EuclideanDistance")).show()

key = Vectors.dense([1.0, 0.0])

# Compute the locality sensitive hashes for the input rows, then perform approximate nearest
# neighbor search.
# We could avoid computing hashes by passing in the already-transformed dataset, e.g.
# `model.approxNearestNeighbors(transformedA, key, 2)`

print("Approximately searching dfA for 2 nearest neighbors of the key:")
model.approxNearestNeighbors(dfA, key, 2).show()

#BAD METHOD: method 1: using approxSimilarityJoin to find all similar point pairs within the given threshold euclidean distance
#the euclidean distance will be obtained by monte-carlo method on a few points using approxNearestNeighbors and given desired nearest neighbours

#method 2: just use udf to wrap in the model and distributedly compute each minority sample and find their k nearnest neighbours from the original dataframe



In [276]:
from pyspark.ml.feature import VectorSlicer
from pyspark.ml.linalg import Vectors
from pyspark.sql.types import Row

In [273]:
'''
Row(userFeatures=Vectors.sparse(3, {0: -2.0, 1: 2.3})),
vectors.sparse: first param shows length of this vector, then comes with a dict storing actual index-value pair
vectors.dense: input param is a list containing all the information at all positions (thus dense)
'''
#VectorSlicer is a transformer that takes a feature vector and outputs a new feature vector 
#with a sub-array of the original features. It is useful for extracting features from a vector column.

df = spark.createDataFrame([Row(userFeatures=Vectors.sparse(3, {0: -2.0, 1: 2.3})),
    Row(userFeatures=Vectors.dense([-2.0, 2.3, 0.0])),
    Row(userFeatures=Vectors.sparse(3, {0: -2.0, 1: 2.3})),
    Row(userFeatures=Vectors.sparse(3, {0: -2.0, 1: 2.3}))])

#first deal with all numerical features and then keep its length, to be used in vectorslicer 

slicer = VectorSlicer(inputCol="userFeatures", outputCol="features", indices=[0,1,2])

output = slicer.transform(df)

output.select("userFeatures", "features").show()

+--------------------+--------------------+
|        userFeatures|            features|
+--------------------+--------------------+
|(3,[0,1],[-2.0,2.3])|(3,[0,1],[-2.0,2.3])|
|      [-2.0,2.3,0.0]|      [-2.0,2.3,0.0]|
|(3,[0,1],[-2.0,2.3])|(3,[0,1],[-2.0,2.3])|
|(3,[0,1],[-2.0,2.3])|(3,[0,1],[-2.0,2.3])|
+--------------------+--------------------+



In [112]:
from pyspark.sql.functions import udf

@udf("string")
def sum_dense_vector_udf(s):
    return str(sum(s))

@udf("string")
def __find_knn(rowin):
    print(rowin)
    return str(knn_df.count())
    

In [None]:
class smote_pyspark():
    discreteAttributes
    continuousAttributes
    bucketLength #length of each bucket
    numHashTables #number of hash tables
    sizeMultiplier #synthetics per example
    numNearestNeighbours
    seed
    spark
    
    
    

### ChiSquare feature selector on categorical features
ChiSqSelector stands for Chi-Squared feature selection. It operates on labeled data with categorical features. ChiSqSelector uses the Chi-Squared test of independence to decide which features to choose. It supports five selection methods: numTopFeatures, percentile, fpr, fdr, fwe:

In [97]:
from pyspark.ml.feature import ChiSqSelector
from pyspark.ml.linalg import Vectors
'''
numTopFeatures, percentile
akin to yielding the features with the most predictive power.
'''

# example code 1
# doing chi-square test on categorical cols
import pyspark.mllib.linalg as ln

for cat in categorical_cols[1:]:
    agg = dataset \
    .groupby('race') \
    .pivot(cat) \
    .count()
    
    agg_rdd = agg \
    .rdd \
    .map(lambda row: (row[1:])) \
    .flatMap(lambda row:[0 if e == None else e for e in row]).collect()
    
    row_length = len(agg.collect()[0]) - 1
    
    agg = ln.Matrices.dense(row_length, 2, agg_rdd)
    
    test = st.Statistics.chiSqTest
    (agg)
    
    print(cat, round(test.pValue, 4))
    
    

# example code 2
df = spark.createDataFrame([
    (7, Vectors.dense([0.0, 0.0, 18.0, 1.0]), 1.0,),
    (8, Vectors.dense([0.0, 1.0, 12.0, 0.0]), 0.0,),
    (9, Vectors.dense([1.0, 0.0, 15.0, 0.1]), 0.0,)], ["id", "features", "clicked"])

selector = ChiSqSelector(numTopFeatures=1, featuresCol="features",
                         outputCol="selectedFeatures", labelCol="clicked")

result = selector.fit(df).transform(df)

#will choose the most useful column from features (assembled) 
print("ChiSqSelector output with top %d features selected" % selector.getNumTopFeatures())
result.show()

# so for categorical features with not too high cardinality, we can one hot encode it and then do chi-squared test of independence to find which top columns
# from the OHE sub-vectors to be kept (but how to retain this information at test/scoring time?) Only transform at test time
# When there are three or more levels for the predictor, the degree of association 
# between predictor and outcome can be measured with statistics such as X2 (chi-squared) tests …

ChiSqSelector output with top 1 features selected
+---+------------------+-------+----------------+
| id|          features|clicked|selectedFeatures|
+---+------------------+-------+----------------+
|  7|[0.0,0.0,18.0,1.0]|    1.0|          [18.0]|
|  8|[0.0,1.0,12.0,0.0]|    0.0|          [12.0]|
|  9|[1.0,0.0,15.0,0.1]|    0.0|          [15.0]|
+---+------------------+-------+----------------+



In [48]:
import pyspark
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, VectorAssembler
from distutils.version import LooseVersion
from pyspark.ml.feature import StandardScaler,MinMaxScaler


#indexes each categorical column using the StringIndexer, 
#and then converts the indexed categories into one-hot encoded variables. 
#The resulting output has the binary vectors appended to the end of each row.
    
def one_hot_encode_cat_cols(sdf,cat_cols,label_col=None):
    '''
    perform one hot encoding for cat_cols 
    input:
    * sdf: spark df
    * cat_cols: categorical columns
    * stages: as a list
    output:
    * stages
    '''
    stages = [] # stages in our Pipeline

    for categoricalCol in cat_cols:
        # Category Indexing with StringIndexer, will encode to numerical according to frequency, highest frequency will be encoded to 0
        # when applying this stringIndexer onto another dataset and encounter missing encoded value, we can throw exception or setHandleInvalid(“skip”)
        # like indexer.fit(df1).setHandleInvalid("skip").transform(df2), will remove all rows unable to encode    
        stringIndexer = StringIndexer(inputCol=categoricalCol, outputCol=categoricalCol + "Index")

        # Use OneHotEncoder to convert categorical variables into binary SparseVectors，
        # binary sparse vectors like (2,[0],[1.0]) means a vector of length 2 with 1.0 at position 0 and 0 elsewhere.
        # spark OHE will automatically drop the last category, you can force it not to drop by dropLast=False
        # it omits the final category to break the correlation between features

        if LooseVersion(pyspark.__version__) < LooseVersion("3.0"):
            from pyspark.ml.feature import OneHotEncoderEstimator
            encoder = OneHotEncoderEstimator(inputCols=[stringIndexer.getOutputCol()], outputCols=[categoricalCol + "classVec"])
        else:
            from pyspark.ml.feature import OneHotEncoder
            encoder = OneHotEncoder(inputCols=[stringIndexer.getOutputCol()], outputCols=[categoricalCol + "classVec"])
        # Add stages.  These are not run here, but will run all at once later on.
        stages += [stringIndexer, encoder]

    #check if input sdf has label col
    if label_col == None:
        return stages
    else:
        # Convert label into label indices using the StringIndexer
        label_stringIdx = StringIndexer(inputCol = label_col, outputCol = "label")
        # now stages contains a lot of stringIndexer and oneHotencoder and a label stringindexer
        stages += [label_stringIdx]

    return stages
    
def assemble_into_features(sdf,num_cols,cat_cols,stages):
    '''
    assemble all features into vector
    input:
    * 
    '''
    # to combine all the feature columns into a single vector column. 
    # This includes both the numeric columns and the one-hot encoded binary vector columns in our dataset.
    # Transform all features into a vector using VectorAssembler
    
    # keep track of num cols indices for any smote purposes, in dealing with smote on cat cols
    num_cols_indices = list(range(len(num_cols)))
    
    assemblerInputs = num_cols + [c + "classVec" for c in cat_cols]

    #assemblerInputs stores all necessary (transformed) columns after all the stages
    #VectorAssembler only applied to numerical or transformed categorical columns
    assembler = VectorAssembler(inputCols=assemblerInputs, outputCol="features")
    stages += [assembler] 

    # then we apply scaling on the vectorized features, 2 additional params are:
    # withStd: True by default. Scales the data to unit standard deviation.
    # withMean: False by default. Centers the data with mean before scaling.
    scaler = StandardScaler(inputCol="features", outputCol="scaled_features",withMean=True)
    #scaler = MinMaxScaler(min=0, max=1, inputCol='features', outputCol='features_minmax')

    stages += [scaler] 
    return stages


In [45]:
stages = one_hot_encode_cat_cols(dataset,cat_cols)

In [49]:
stages_2 = assemble_into_features(dataset,num_cols,cat_cols,stages)

In [None]:
from pyspark.ml.classification import LogisticRegression

#having compiled the stages into a list, at execution, it will automatically sort out the sequence to perform steps in stages
#like when .fit() is called, what should be executed...
partialPipeline = Pipeline().setStages(stages_2) #type is pipeline, independent of dataframe, only using stages 

pipelineModel = partialPipeline.fit(dataset) #type is pipelinemodel, use the prepared staged pipelines to fit dataframe

preppedDataDF = pipelineModel.transform(dataset) #type is stage transformed dataframe, it contains all original columns, and indexed/encoded/vector_encoded columns

In [358]:
# Keep relevant columns
cols = dataset.columns
selectedcols = ["label", "features"] + cols

dataset = preppedDataDF.select(selectedcols)

In [359]:
### Randomly split data into training and test sets. set seed for reproducibility
(trainingData, testData) = dataset.randomSplit([0.7, 0.3], seed=100)

print(trainingData.count())

print(testData.count())

34294
14548


In [361]:
# now to train on the train set
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.classification import RandomForestClassifier

# Create an initial RandomForest model.
rf = RandomForestClassifier(labelCol="label", featuresCol="features")

# Train model with Training Data
rfModel = rf.fit(trainingData)

In [362]:
# Make predictions on test data using the transform() method.
# LogisticRegression.transform() will only use the 'features' column.
predictions = rfModel.transform(testData)

In [363]:
# View model's predictions and probabilities of each prediction class
# You can select any columns in the above schema to view as well. For example's sake we will choose age & occupation
selected = predictions.select("label", "prediction", "probability", "age", "occupation")
display(selected)

DataFrame[label: double, prediction: double, probability: vector, age: bigint, occupation: string]

In [364]:
from pyspark.ml.evaluation import BinaryClassificationEvaluator
# Evaluate model
evaluator = BinaryClassificationEvaluator(rawPredictionCol="rawPrediction")
evaluator.evaluate(predictions)

In [366]:
evaluator.getMetricName()

'areaUnderROC'

In [368]:
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator

paramGrid = (ParamGridBuilder()
             .addGrid(rf.maxDepth, [2, 4, 6])
             .addGrid(rf.maxBins, [20, 60])
             .addGrid(rf.numTrees, [5, 20])
             .build())
# paramGrid contains 3*2*2 = 12 models
# cv is 5 folds, so total 60 models are searched

# Create 5-fold CrossValidator, input is an estimator (rf classifier e.g.)
cv = CrossValidator(estimator=rf, estimatorParamMaps=paramGrid, evaluator=evaluator, numFolds=5)
# Run cross validations
cvModel = cv.fit(trainingData)
# this will likely take a fair amount of time because of the amount of models that we're creating and testing

In [369]:
# Use test set to measure the accuracy of our model on new data
predictions = cvModel.transform(testData)

In [370]:
# cvModel uses the best model found from the Cross Validation
# Evaluate best model
evaluator.evaluate(predictions)

0.8999505914844388

In [202]:
# vector assembler can have inputs as: numeric,bool,vector
# output will be a flattened vector (even if input could have vector)

In [371]:
bestModel = cvModel.bestModel

In [372]:
# Generate predictions for entire dataset
finalPredictions = bestModel.transform(dataset)
# Evaluate best model
evaluator.evaluate(finalPredictions)