# General pipeline for binary classification tasks in spark ml

$\color{blue}{\text{Covering major components in real life scenarios for binary classification tasks using structured datasets.}}$

Modelling and data transformation part will be mostly operated in pure spark, but exploration and plotting metrics may involve pandas components.

The toolkits are in the lib/ folder, covering following topics:

0. summary on transformer, estimators, pipelines
1. spark and pandas dataframe conversion, tips in converting datatypes and assign correct schema
2. categorical variables encoding methods: label encoding (string indexer), one hot encoding.
3. feature selection methods in spark ml, selection based on model, lasso.
4. handling skewed datasets and highly imbalanced labels (up/down sampling) SMOTE in spark
5. modelling toolkits, contains common classifiers and their tuning guidance, use of xgboost in spark
6. metrics plotting tools, to plot common metrics after training

### Spark ml structure:
Key components:
1. Transformer
2. Estimator
3. Pipeline

$\textbf{Transformer}$ can transform one df into another df by appending new columns onto original df. E.g., an ML model is a Transformer which transforms a DataFrame with features into a DataFrame with predictions. It has .transform() method, normally taking df as input. Transformers can be trained models, trained encoders.

$\textbf{Estimator}$ is an algorithm to be fit on a df to produce a Transformer. E.g., a learning algorithm is an Estimator which trains on a df and produces a model; if we specify a One-hot-encoder, it is an estimarot object, we need to .fit() it onto a column and obtain a transformer. Output of fitted/trained estimator is transformer.

$\textbf{Pipeline}$ chains multiple Transformers and Estimators together to specify an ML workflow. When executing the pipeline, spark will automatically sort out the steps to execute, depending on whether you called a .fit() or .transform() method. A Pipeline is specified as a sequence of stages, and each stage is either a Transformer or an Estimator. These stages are run in order, and the input DataFrame is transformed as it passes through each stage. For Transformer stages, the transform() method is called on the DataFrame. For Estimator stages, the fit() method is called to produce a Transformer (which becomes part of the PipelineModel, or fitted Pipeline). Pipeline is an estimator, after calling pipeline.fit() method, the output will be PipelineModel, a transformer ready to apply .transform() at test time.

In [146]:
import os
import random
import pandas as pd
pd.options.display.max_columns=None
pd.options.display.max_rows=None

#import toolkits
from lib import util
from lib import logger

def initialize_spark(app_name='spark_pipeline'):
    import findspark
    #spark path using default value
    findspark.init()

    import pyspark
    import pyarrow
    from pyspark.sql import SQLContext
    
    #broadcastTimeout is purposedly set to be large due to development on single machine
    conf = pyspark.SparkConf()\
        .setAppName(app_name)\
        .setMaster('local')\
        .set('spark.driver.memory', '8g')\
        .set('spark.executor.memory', '8g')\
        .set('spark.executor.instances', 4)\
        .set('spark.executor.cores', 4)\
        .set('spark.driver.maxResultSize', '8g')\
        .set('spark.sql.shuffle.partitions', 100)\
        .set('spark.default.parallelism', 200)\
        .set('spark.sql.broadcastTimeout', 36000)\
        .set('spark.kryoserializer.buffer.max', '1024m')\
        .set('spark.sql.execution.arrow.enabled', 'false')\
        .set('spark.dynamicAllocation.enabled', "False")\
        .set('spark.port.maxRetries',30) 

    sc = pyspark.SparkContext.getOrCreate(conf)
    spark = pyspark.sql.SparkSession(sc)
    sqlContext = SQLContext.getOrCreate(sc)    
    return sc,spark,sqlContext

from pyspark.ml import Pipeline
from pyspark.ml.feature import StandardScaler
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml.feature import OneHotEncoderEstimator, StringIndexer, VectorAssembler
from pyspark.sql.types import IntegerType,DecimalType
from pyspark.sql.functions import when, lit
from distutils.version import LooseVersion
from importlib import reload
import pyspark.sql.functions as func
import pyspark.sql.types as typ

In [2]:
sc,spark,sqlContext = initialize_spark()

### Step 1: Load data into spark df, stringIndex all cat_cols

Stringindex all -> train test split -> smote train -> restore smoted train to original columns -> train models -> transform testset

In [240]:
df = pd.read_csv('datasets/adult.csv')
# if directly using spark.read.csv('datasets/adult.csv',header=True), unless we specify schema manually,
# all columns will be interpreted as string type, troublesome for later process
dataset = util.pandas_to_spark(sqlContext,df)
dataset = dataset.withColumn('income', when(dataset.income=='<=50K', lit(0)).otherwise(1))
cols = dataset.columns

In [241]:
trainingData, testData = dataset.randomSplit([0.7, 0.3], seed=100)

print(trainingData.count())
print(testData.count())

34294
14548


### (Optional) Random downsampling

In [333]:
from config.conf_template import Struct as Section
from pyspark.ml.feature import OneHotEncoderEstimator
import lib.imbalance_handler as imbalance_handle
import lib.feature_selection as fs
import lib.categorical_handler as ctgy
from pyspark.ml import Pipeline
reload(ctgy)
reload(fs)
reload(imbalance_handle)

<module 'lib.imbalance_handler' from '/Users/hwang/Desktop/spark_pipelines/lib/imbalance_handler.py'>

In [243]:
down_sampled_df = imbalance_handle.spark_df_down_sampling(trainingData, 2, 'income', major_class_val = 0)

After downsampling "income": label distribution is [Row(income=0, count=16501), Row(income=1, count=8251)]


### Step 2: Smote

In [245]:
# get num_cols and cat_cols from spark df
num_cols, cat_cols = util.get_num_cat_feat(dataset)

All columns are been covered.


In [246]:
min_cat = 2
max_cat = 20

In [247]:
cat_coverage_df,no_info_col,cols_high_cardinality = fs.cat_col_cardinality_test(dataset,cat_cols,min_cat,max_cat)

Start the count computation for categorical features...
The no. of categorical features: 8


In [248]:
# find highly correlated columns
fs.num_cols_correlation_test(dataset,num_cols,0.1)

['income', 'hours-per-week', 'educational-num', 'capital-gain', 'age']

In [250]:
conf = Section("smote_config")
conf.seed = 48
conf.bucketLength = 100
conf.k = 4
conf.multiplier = 3

In [296]:
# get num_cols and cat_cols from spark df
num_cols, cat_cols = util.get_num_cat_feat(dataset)
vectorized,stages1 = imbalance_handle.pre_smote_df_process(trainingData,num_cols,cat_cols,'income',False)
res = imbalance_handle.smote(vectorized,conf)

All columns are been covered.
return num cols vectorized df and stages for testset transformation
generating batch 0 of synthetic instances
generating batch 1 of synthetic instances
generating batch 2 of synthetic instances


In [299]:
res_restored = restore_smoted_df(num_cols,res,'features')

### Step 3: Encode categorical cols

In [337]:
stages_rf = ctgy.assemble_into_features_RF(res_restored,num_cols,cat_cols,'_index')

In [330]:
# with encoding on cat cols
allstages = ctgy.assemble_into_features(res_restored,num_cols,cat_cols,'_index','_ohe')

In [338]:
#having compiled the stages into a list, at execution, it will automatically sort out the sequence to perform steps in stages
#like when .fit() is called, what should be executed...
partialPipeline = Pipeline().setStages(stages_rf) #type is pipeline, independent of dataframe, only using stages 
pipelineModel = partialPipeline.fit(res_restored) #type is pipelinemodel, use the prepared staged pipelines to transform test dataframe or train
preppedDataDF = pipelineModel.transform(res_restored) #type is stage transformed dataframe, it contains all original columns, and indexed/encoded/vector_encoded columns

In [339]:
preppedDataDF_test = pipelineModel.transform(testData)
preppedDataDF_test = preppedDataDF_test.withColumnRenamed("income","label")

In [408]:
# now to train on the train set
from pyspark.ml.classification import RandomForestClassifier
import numpy as np

rf_settings = {'maxBins':100,
              'labelCol':'label',
              'featuresCol':'features'}

In [409]:
RF_feature_selector(rf_settings,preppedDataDF,10)

building random forest feature selector using maxBins:100


['marital-status_index',
 'relationship_index',
 'educational-num',
 'capital-gain',
 'occupation_index',
 'education_index',
 'gender_index',
 'hours-per-week',
 'age',
 'capital-loss']

### Model construction

In [344]:
# now to train on the train set
from pyspark.ml.classification import RandomForestClassifier
# Create an initial RandomForest model.
rf = RandomForestClassifier(maxBins=100, labelCol="label", featuresCol="features")
# Train model with Training Data
rfModel = rf.fit(preppedDataDF)

In [345]:
# Make predictions on test data using the transform() method.
# LogisticRegression.transform() will only use the 'features' column.
predictions = rfModel.transform(preppedDataDF_test)

In [346]:
from pyspark.ml.evaluation import BinaryClassificationEvaluator
# Evaluate model

evaluator = BinaryClassificationEvaluator(rawPredictionCol="rawPrediction")
areaUnderROC = evaluator.setMetricName("areaUnderROC").evaluate(predictions)
areaUnderPR = evaluator.setMetricName("areaUnderPR").evaluate(predictions)

In [317]:
preds = predictions.select("prediction","label").toPandas()

In [324]:
from sklearn.metrics import confusion_matrix

In [323]:
confusion_matrix(y_pred=preds['prediction'],y_true=preds['label'])

array([[7433, 3679],
       [ 372, 3064]])

In [398]:
#for ohe transformed features 
pandasDF = pd.DataFrame(preppedDataDF.schema["features"].metadata["ml_attr"]["attrs"]["nominal"]+preppedDataDF.schema["features"].metadata["ml_attr"]["attrs"]["numeric"]).sort_values("idx")

feature_dict = dict(zip(pandasDF["idx"],pandasDF["name"])) 
feature_dict_broad = sc.broadcast(feature_dict)

In [399]:
import numpy as np

importances = list(np.array(rfModel.featureImportances))

col_importance_val = []
for i,importance in enumerate(importances):
    col_importance_val.append([i,importance])

final_sorted_importance = sorted(col_importance_val, key=lambda x: x[1], reverse =True)

In [400]:
top_feature_index = [a[0] for a in final_sorted_importance[:15]]

res = []
for i,importance in enumerate(importances):
    feature_nm = feature_dict[i]
    res.append([feature_nm,importance])
    
sorted_important_fs = sorted(res, key=lambda x: x[1], reverse =True)

In [396]:
sorted_important_fs

[['marital-status_index', 0.4082046991270808],
 ['relationship_index', 0.1665893438731733],
 ['educational-num', 0.16535002474366226],
 ['capital-gain', 0.10703501011498343],
 ['occupation_index', 0.06260924091269396],
 ['education_index', 0.04168504655448925],
 ['gender_index', 0.02210641676730955],
 ['hours-per-week', 0.01547938921110717],
 ['capital-loss', 0.005645579417788404],
 ['age', 0.004702962533155233],
 ['native-country_index', 0.0003211945125361774],
 ['workclass_index', 0.0001543171015540678],
 ['race_index', 7.339355307677681e-05],
 ['fnlwgt', 4.338157738968021e-05]]

In [368]:
# Doing cross validation and params tuning

from pyspark.ml.tuning import ParamGridBuilder, CrossValidator
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator

paramGrid = (ParamGridBuilder()
             .addGrid(rf.maxDepth, [2, 4, 6])
             .addGrid(rf.maxBins, [20, 60])
             .addGrid(rf.numTrees, [5, 20])
             .build())
# paramGrid contains 3*2*2 = 12 models
# cv is 5 folds, so total 60 models are searched

# Create 5-fold CrossValidator, input is an estimator (rf classifier e.g.)
cv = CrossValidator(estimator=rf, estimatorParamMaps=paramGrid, evaluator=evaluator, numFolds=5)
# Run cross validations
cvModel = cv.fit(trainingData)
# this will likely take a fair amount of time because of the amount of models that we're creating and testing

In [369]:
# Use test set to measure the accuracy of our model on new data
predictions = cvModel.transform(testData)

In [370]:
# cvModel uses the best model found from the Cross Validation
# Evaluate best model
evaluator.evaluate(predictions)

0.8999505914844388

In [202]:
# vector assembler can have inputs as: numeric,bool,vector
# output will be a flattened vector (even if input could have vector)

In [371]:
bestModel = cvModel.bestModel

In [372]:
# Generate predictions for entire dataset
finalPredictions = bestModel.transform(dataset)
# Evaluate best model
evaluator.evaluate(finalPredictions)

In [None]:
'''
from pyspark.ml.feature import StringIndexer, IndexToString
labelReverse = IndexToString().setInputCol("race_index").setOutputCol("recover")
labelReverse.transform(vectorized).select("race_index","recover").show()
'''