# General pipeline for binary classification tasks in Spark ML

$\color{blue}{\text{Covering the major components of real-life binary classification tasks on structured datasets.}}$

Modelling and data transformation are done mostly in pure Spark; data exploration and metric plotting may use pandas.

The toolkits live in the lib/ folder and cover the following topics:

0. Summary of Transformers, Estimators and Pipelines.
1. Conversion between Spark and pandas DataFrames, with tips on converting data types and assigning the correct schema.
2. Encoding methods for categorical variables: label encoding (StringIndexer) and one-hot encoding.
3. Feature selection in Spark ML, based on random forest feature importance.
4. Handling skewed datasets and highly imbalanced labels: up/down sampling and SMOTE in Spark.
5. Modelling toolkits: common classifiers with tuning guidance, and the use of XGBoost in Spark.
6. Metric plotting tools for common metrics after training.

### Spark ML structure:
Key components:
1. Transformer
2. Estimator
3. Pipeline

$\textbf{Transformer}$ transforms one DataFrame into another, usually by appending new columns to the original. E.g., a trained ML model is a Transformer that turns a DataFrame with features into a DataFrame with predictions. It exposes a .transform() method that normally takes a DataFrame as input. Trained models and fitted encoders are both Transformers.

$\textbf{Estimator}$ is an algorithm that is fit on a DataFrame to produce a Transformer. E.g., a learning algorithm is an Estimator that trains on a DataFrame and produces a model. Likewise, a one-hot encoder specification is an Estimator object: we call .fit() on it with the DataFrame that holds the columns to encode and obtain a Transformer. The output of a fitted/trained Estimator is always a Transformer.

$\textbf{Pipeline}$ chains multiple Transformers and Estimators together to specify an ML workflow. A Pipeline is specified as a sequence of stages, and each stage is either a Transformer or an Estimator. The stages run in order, and the input DataFrame is transformed as it passes through each one. Spark decides what to call on each stage depending on whether .fit() or .transform() was invoked on the pipeline: for Transformer stages, the transform() method is called on the DataFrame; for Estimator stages, the fit() method is called to produce a Transformer, which becomes part of the PipelineModel (the fitted Pipeline).

A Pipeline is itself an Estimator: calling pipeline.fit() returns a PipelineModel, a Transformer that is ready to apply .transform() at test time.
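The fit/transform contract described above can be illustrated with a toy, pure-Python imitation. This is only a sketch: none of the classes below are Spark classes, the names (`MeanCenter`, `ToyPipeline`, and so on) are made up for illustration, and a "DataFrame" here is just a list of dicts.

```python
# Toy imitation of the Estimator / Transformer / Pipeline contract.
# These are NOT Spark classes; a "DataFrame" is a list of dicts.

class MeanCenterModel:
    """Transformer: appends a centered copy of a column."""
    def __init__(self, col, mean):
        self.col, self.mean = col, mean

    def transform(self, rows):
        out_col = self.col + "_centered"
        return [dict(r, **{out_col: r[self.col] - self.mean}) for r in rows]


class MeanCenter:
    """Estimator: learns the column mean, then returns a Transformer."""
    def __init__(self, col):
        self.col = col

    def fit(self, rows):
        mean = sum(r[self.col] for r in rows) / len(rows)
        return MeanCenterModel(self.col, mean)


class ToyPipelineModel:
    """Fitted pipeline: every stage is a Transformer."""
    def __init__(self, stages):
        self.stages = stages

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


class ToyPipeline:
    """Pipeline: itself an Estimator whose fit() returns a ToyPipelineModel."""
    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            # Estimators are fit first; Transformers are used as they are.
            model = stage.fit(rows) if hasattr(stage, "fit") else stage
            rows = model.transform(rows)  # output feeds the next stage
            fitted.append(model)
        return ToyPipelineModel(fitted)


train = [{"age": 20.0}, {"age": 40.0}]
test = [{"age": 50.0}]

pipeline_model = ToyPipeline([MeanCenter("age")]).fit(train)
scored_test = pipeline_model.transform(test)  # reuses the mean learned on train
```

The point of the sketch is that statistics are learned once on the training data during fit() and then reused unchanged when the fitted pipeline transforms the test data.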

In [3]:
import os
import random
import pandas as pd
pd.options.display.max_columns=None
pd.options.display.max_rows=None

#import toolkits
from lib import util
from lib import logger

def initialize_spark(app_name='spark_pipeline'):
    import findspark
    #spark path using default value
    findspark.init()

    import pyspark
    import pyarrow
    from pyspark.sql import SQLContext
    
    # broadcastTimeout is purposely set large because development runs on a single machine
    conf = pyspark.SparkConf()\
        .setAppName(app_name)\
        .setMaster('local')\
        .set('spark.driver.memory', '8g')\
        .set('spark.executor.memory', '8g')\
        .set('spark.executor.instances', 4)\
        .set('spark.executor.cores', 4)\
        .set('spark.driver.maxResultSize', '8g')\
        .set('spark.sql.shuffle.partitions', 100)\
        .set('spark.default.parallelism', 200)\
        .set('spark.sql.broadcastTimeout', 36000)\
        .set('spark.kryoserializer.buffer.max', '1024m')\
        .set('spark.sql.execution.arrow.enabled', 'false')\
        .set('spark.dynamicAllocation.enabled', "False")\
        .set('spark.port.maxRetries',30) 

    sc = pyspark.SparkContext.getOrCreate(conf)
    spark = pyspark.sql.SparkSession(sc)
    sqlContext = SQLContext.getOrCreate(sc)    
    return sc,spark,sqlContext

from importlib import reload
from distutils.version import LooseVersion

from pyspark.ml import Pipeline
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import BinaryClassificationEvaluator
# OneHotEncoderEstimator is the Spark 2.3/2.4 name; Spark 3.0 renamed it to OneHotEncoder
from pyspark.ml.feature import OneHotEncoderEstimator, StandardScaler, StringIndexer, VectorAssembler
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator
from pyspark.sql.functions import when, lit
from pyspark.sql.types import IntegerType, DecimalType
import pyspark.sql.functions as func
import pyspark.sql.types as typ
from sklearn.metrics import confusion_matrix

from config.conf_template import Struct as Section
import lib.imbalance_handler as imbalance_handle
import lib.feature_selection as fs
import lib.categorical_handler as ctgy
import lib.modelling as model

# reload the toolkit modules so edits under lib/ are picked up without restarting the kernel
reload(ctgy)
reload(fs)
reload(imbalance_handle)
reload(model)

In [2]:
sc,spark,sqlContext = initialize_spark()

### Step 1: Load data into spark df, stringIndex all cat_cols

Workflow: StringIndex all categorical columns -> train/test split -> SMOTE the training set -> restore the SMOTEd training set to its original columns -> train models -> transform the test set

In [39]:
df = pd.read_csv('datasets/adult.csv')
# reading directly with spark.read.csv('datasets/adult.csv', header=True) interprets every column
# as a string unless a schema is supplied (or inferSchema=True is set), which complicates later steps
dataset = util.pandas_to_spark(sqlContext,df)
dataset = dataset.withColumn('income', when(dataset.income=='<=50K', lit(0)).otherwise(1))

In [40]:
# get num_cols and cat_cols from spark df
num_cols, cat_cols = util.get_num_cat_feat(dataset)

All columns are been covered.


In [41]:
trainingData,testData = dataset.randomSplit([0.7, 0.3], seed=100)
# rename the target column to 'label' on the test set by hand; the train set is converted automatically
testData = testData.withColumnRenamed("income","label")

### (Optional) Random downsampling, for highly imbalanced dataset
Example usage of downsampling a Spark df:

    down_sampled_df = imbalance_handle.spark_df_down_sampling(trainingData, 2, 'income', major_class_val=0)
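As a rough illustration of what random downsampling does, here is a pure-Python sketch. It assumes that the `2` in the call above is the target majority-to-minority ratio; that reading is a guess and is not confirmed by the toolkit source. The function names are hypothetical.

```python
import random

def majority_keep_fraction(n_major, n_minor, ratio):
    """Fraction of majority rows to keep so that major:minor is about ratio:1."""
    if n_major <= 0 or n_minor <= 0:
        raise ValueError("both classes need at least one row")
    return min(1.0, ratio * n_minor / n_major)


def down_sample(rows, label_key, major_class_val, ratio, seed=0):
    """Keep every minority row and a random subset of the majority rows."""
    major = [r for r in rows if r[label_key] == major_class_val]
    minor = [r for r in rows if r[label_key] != major_class_val]
    fraction = majority_keep_fraction(len(major), len(minor), ratio)
    rng = random.Random(seed)
    kept_major = rng.sample(major, round(fraction * len(major)))
    return kept_major + minor


# Toy data: 90 majority rows (income == 0) and 10 minority rows (income == 1).
rows = [{"income": 0}] * 90 + [{"income": 1}] * 10
sampled = down_sample(rows, "income", major_class_val=0, ratio=2)
n_major = sum(1 for r in sampled if r["income"] == 0)
n_minor = sum(1 for r in sampled if r["income"] == 1)
```

On a Spark DataFrame, the same keep fraction could be passed to `DataFrame.sampleBy('income', fractions={0: fraction, 1: 1.0}, seed=...)`; because Spark samples each row independently, the resulting class counts are only approximately at the target ratio.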

### Step 2: Smote

In [63]:
conf = Section("smote_config")
conf.seed = 48
conf.bucketLength = 100
conf.k = 4
conf.multiplier = 3

In [64]:
vectorized,smote_stages = imbalance_handle.pre_smote_df_process(trainingData,num_cols,cat_cols,'income',False)

return num cols vectorized df and stages for testset transformation


In [65]:
smoted_train_df = imbalance_handle.smote(vectorized,conf)

generating batch 0 of synthetic instances
generating batch 1 of synthetic instances
generating batch 2 of synthetic instances
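For readers unfamiliar with SMOTE, the core idea is to create synthetic minority rows by interpolating between a minority point and one of its nearest minority neighbours. The sketch below is pure Python with brute-force neighbours and is not the toolkit's implementation. It assumes `k` is the number of neighbours and `multiplier` is the number of synthetic points per minority row (consistent with the three batches printed above, but still an assumption). The `bucketLength` setting suggests the toolkit finds approximate neighbours with an LSH such as `BucketedRandomProjectionLSH`; that is also a guess.

```python
import math
import random

def k_nearest(point, others, k):
    """Brute-force k nearest neighbours by Euclidean distance."""
    return sorted(others, key=lambda o: math.dist(point, o))[:k]


def smote(minority, k, multiplier, seed):
    """Create `multiplier` synthetic points per minority point."""
    rng = random.Random(seed)
    synthetic = []
    for i, point in enumerate(minority):
        others = minority[:i] + minority[i + 1:]
        neighbours = k_nearest(point, others, k)
        for _ in range(multiplier):
            neighbour = rng.choice(neighbours)
            gap = rng.random()  # position along the segment, in [0, 1)
            synthetic.append(
                tuple(p + gap * (n - p) for p, n in zip(point, neighbour))
            )
    return synthetic


# Toy minority class: five points in the unit square.
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (0.5, 0.5)]
synthetic = smote(minority, k=4, multiplier=3, seed=48)
```

Interpolation only makes sense for numeric features, which is presumably why the categorical columns are string-indexed before this step and the numeric columns are restored afterwards.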


In [84]:
res_restored = imbalance_handle.restore_smoted_df(num_cols,smoted_train_df,'features')

### Step 3: Feature selection using RF, for use cases with a large number of features

In [22]:
rf_settings = {'maxBins':100,
              'labelCol':'label',
              'featuresCol':'features'}

In [23]:
# consider dropping high-cardinality cat cols before RF: maxBins must be at least the largest number of categories in any cat col
features_selected = fs.RF_feature_selector(rf_settings,res_restored,num_cols,cat_cols,10)

preparing stages to prepare for rf input vectorized df
transforming df (label encoding and vectorizing)
fitting random forest model using maxBins:100
return top 10 features from rf feature importance
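The two ideas in this step, screening categorical columns against maxBins and keeping the top-k features by importance, can be sketched in plain Python. The helper names and all numbers below are hypothetical and for illustration only; they are not the internals of `fs.RF_feature_selector`.

```python
def split_by_cardinality(cardinalities, max_bins):
    """Separate categorical columns that fit within max_bins from those that do not."""
    keep = [c for c, n in cardinalities.items() if n <= max_bins]
    drop = [c for c, n in cardinalities.items() if n > max_bins]
    return keep, drop


def top_k_features(importances, k):
    """Names of the k features with the largest importance, highest first."""
    ranked = sorted(importances.items(), key=lambda kv: kv[1], reverse=True)
    return [name for name, _ in ranked[:k]]


# Hypothetical distinct-value counts per categorical column.
cardinalities = {"workclass": 9, "education": 16, "zip_code": 4500}
keep, drop = split_by_cardinality(cardinalities, max_bins=100)

# Hypothetical feature importances from a fitted random forest.
importances = {"age": 0.30, "education": 0.25, "workclass": 0.05, "hours": 0.40}
selected = top_k_features(importances, k=2)
```

In Spark, the cardinalities could come from a distinct count per column, and the importances from the `featureImportances` attribute of a fitted `RandomForestClassificationModel`.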


### Step 4: Encode categorical cols (optional for tree based algo)

With one-hot encoding on the categorical columns, the encoded output columns carry the suffix '_index_ohe':

    allstages = ctgy.assemble_into_features(res_restored,num_cols,cat_cols,'_index','_ohe')
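To make the two encoding steps concrete, here is a pure-Python sketch of what label encoding followed by one-hot encoding produces. It mirrors Spark's defaults as I understand them (StringIndexer orders categories by descending frequency; the one-hot encoder drops the last category), but the functions themselves are hypothetical and are not the Spark or toolkit implementation.

```python
from collections import Counter

def fit_string_index(values):
    """Map each category to an index, most frequent category first."""
    counts = Counter(values)
    # Sort by descending frequency, then alphabetically to break ties.
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    return {category: index for index, category in enumerate(ordered)}


def one_hot(index, n_categories, drop_last=True):
    """One-hot vector for an index; with drop_last the last category is all zeros."""
    size = n_categories - 1 if drop_last else n_categories
    return [1.0 if i == index else 0.0 for i in range(size)]


workclass = ["Private", "Private", "Private", "State-gov", "State-gov", "Self-emp"]
mapping = fit_string_index(workclass)
encoded = [one_hot(mapping[v], len(mapping)) for v in workclass]
```

Dropping the last category keeps the encoded columns linearly independent, which matters for linear models; tree-based models can work directly on the indexed column, which is why this step is optional for them.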

### Step 5: Model construction, tuned RF and GBT

Random Forests (RF) versus Gradient-Boosted Trees (GBT):

1. GBTs train one tree at a time, so they can take longer to train than Random Forests, which can train multiple trees in parallel.
2. On the other hand, it is often reasonable to use smaller (shallower) trees with GBTs than with Random Forests, and training smaller trees takes less time.
3. Random Forests can be less prone to overfitting. Training more trees in a Random Forest reduces the likelihood of overfitting, but training more trees with GBTs increases it. (In statistical language, Random Forests reduce variance by using more trees, whereas GBTs reduce bias by using more trees.)
4. Random Forests can be easier to tune, since performance improves monotonically with the number of trees, whereas performance can start to decrease for GBTs if the number of trees grows too large.
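The variance-reduction claim for Random Forests can be made explicit with a standard result. As a simplifying assumption (the symbols are introduced here and are not from the text above), let the $B$ tree predictions $T_1(x), \dots, T_B(x)$ be identically distributed with variance $\sigma^2$ and pairwise correlation $\rho$. The variance of their average is then

$$\operatorname{Var}\left(\frac{1}{B}\sum_{b=1}^{B} T_b(x)\right) = \rho\,\sigma^2 + \frac{1-\rho}{B}\,\sigma^2$$

The second term shrinks as $B$ grows, so adding trees never increases the variance of the ensemble under this assumption. The first term does not depend on $B$, which is why the benefit of extra trees flattens out rather than growing without limit.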

In [162]:
params = {'RF':{'maxDepth':[4,6],'maxBins':[80],'numTrees':[20]},
             'GBT':{'maxDepth':[6],'maxBins':[80]}}
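The cost of a grid like this grows with the number of parameter combinations and the number of folds. The sketch below expands the grid in plain Python and counts model fits, assuming the toolkit wraps Spark's `CrossValidator`, which refits the best setting once on the full training set after cross-validation. That the toolkit works this way is an assumption, and the helper names are hypothetical.

```python
from itertools import product

def expand_grid(grid):
    """All parameter combinations of a {name: [values]} grid."""
    names = sorted(grid)
    return [dict(zip(names, combo)) for combo in product(*(grid[n] for n in names))]


def count_fits(grid, cv_folds):
    """Model fits for k-fold CV over the grid, plus one final refit of the best setting."""
    return len(expand_grid(grid)) * cv_folds + 1


params = {'RF': {'maxDepth': [4, 6], 'maxBins': [80], 'numTrees': [20]},
          'GBT': {'maxDepth': [6], 'maxBins': [80]}}

rf_candidates = expand_grid(params['RF'])
rf_fits = count_fits(params['RF'], cv_folds=2)
```

Under these assumptions the RF grid above has two candidates, so two folds mean five forest fits in total; each extra value in any parameter list multiplies the number of candidates.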

In [148]:
%%time
Best_RF = model.train_rf_param_tuning(params,res_restored,testData,num_cols,cat_cols,cv_Folds=2)

received rf param grid: {'maxDepth': [4, 6], 'maxBins': [80], 'numTrees': [20]}
transforming train df
transforming test df
fitting cv rf models on transformed train df
best model performance on test set:
AUC: 0.8920428449542431, AUPR: 0.7162176681645379
return best rf model
CPU times: user 3.25 s, sys: 1.13 s, total: 4.38 s
Wall time: 14min 40s
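As a reminder of what the reported AUC measures: it is the probability that a randomly chosen positive row receives a higher score than a randomly chosen negative row. The pure-Python sketch below computes it by brute force on made-up labels and scores. It is for intuition only and is not how `BinaryClassificationEvaluator` computes its `areaUnderROC` metric.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up labels and model scores, for illustration only.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(labels, scores)
```

AUPR (area under the precision-recall curve) is usually the more informative of the two on highly imbalanced labels, because it is not inflated by the large number of true negatives.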
