# General pipeline framework for binary classification tasks in spark ml

$\color{blue}{\text{Covering major components in real life scenarios.}}$

The toolkits are in the lib/ folder and including following topics:

0. summary on transformer, estimators, pipelines
1. spark and pandas dataframe conversion, tips in converting datatypes and assign correct schema
2. typical udf to transform columns
3. explorative analysis on spark df
4. categorical variables encoding methods, some advanced types of encoding implemented
5. feature selection methods in spark ml, selection based on model, lasso...
6. handling skewed datasets and highly imbalanced labels (up/down sampling) SMOTE in spark
7. modelling toolkits, contains common classifiers and their tuning guidance, use of xgboost in spark
8. metrics plotting tools, to plot common metrics after training

### Understanding of spark ml structure:
Key components:
1. Transformer
2. Estimator
3. Pipeline

$\textbf{Transformer}$ can transform one df into another df by appending new columns onto original df. E.g., an ML model is a Transformer which transforms a DataFrame with features into a DataFrame with predictions. It has .transform() method, normally taking df as input. Transformers can be trained models, trained encoders.

$\textbf{Estimator}$ is an algorithm to be fit on a df to produce a Transformer. E.g., a learning algorithm is an Estimator which trains on a df and produces a model; if we specify a One-hot-encoder, it is an estimarot object, we need to .fit() it onto a column and obtain a transformer. Output of fitted/trained estimator is transformer.

$\textbf{Pipeline}$ chains multiple Transformers and Estimators together to specify an ML workflow. When executing the pipeline, spark will automatically sort out the steps to execute, depending on whether you called a .fit() or .transform() method. A Pipeline is specified as a sequence of stages, and each stage is either a Transformer or an Estimator. These stages are run in order, and the input DataFrame is transformed as it passes through each stage. For Transformer stages, the transform() method is called on the DataFrame. For Estimator stages, the fit() method is called to produce a Transformer (which becomes part of the PipelineModel, or fitted Pipeline). Pipeline is an estimator, after calling pipeline.fit() method, the output will be PipelineModel, a transformer ready to apply .transform() at test time.

In [3]:
def initialize_spark(app_name='spark_pipeline'):
    import findspark
    #spark path using default value
    findspark.init()

    import pyspark
    import pyarrow
    from pyspark.sql import SQLContext

    conf = pyspark.SparkConf()\
        .setAppName(app_name)\
        .setMaster('local')\
        .set('spark.driver.memory', '8g')\
        .set('spark.executor.memory', '8g')\
        .set('spark.executor.instances', 4)\
        .set('spark.executor.cores', 4)\
        .set('spark.driver.maxResultSize', '8g')\
        .set('spark.sql.shuffle.partitions', 100)\
        .set('spark.default.parallelism', 200)\
        .set('spark.sql.broadcastTimeout', 36000)\
        .set('spark.kryoserializer.buffer.max', '1024m')\
        .set('spark.sql.execution.arrow.enabled', 'false')\
        .set('spark.dynamicAllocation.enabled', "False")\
        .set('spark.port.maxRetries',30) 

    sc = pyspark.SparkContext.getOrCreate(conf)
    spark = pyspark.sql.SparkSession(sc)
    sqlContext = SQLContext.getOrCreate(sc)    
    return sc,spark,sqlContext

In [443]:
import os
import random
import pandas as pd
pd.options.display.max_columns=None
pd.options.display.max_rows=None

import pyspark
from pyspark.ml import Pipeline
from pyspark.ml.feature import StandardScaler
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml.feature import OneHotEncoderEstimator, StringIndexer, VectorAssembler
from pyspark.sql.types import IntegerType
from pyspark.sql.functions import when, lit
from distutils.version import LooseVersion
from importlib import reload

#import toolkits
from lib import util
from lib import logger

In [135]:
sc,spark,sqlContext = initialize_spark()

## Loading data into spark dataframe

In [381]:
df = pd.read_csv('datasets/adult.csv')
# if directly using spark.read.csv('datasets/adult.csv',header=True), unless we specify schema manually,
# all columns will be interpreted as string type, troublesome for later process
dataset = util.pandas_to_spark(sqlContext,df)
dataset = dataset.withColumn('income', when(dataset.income=='<=50K', lit(1)).otherwise(0))
cols = dataset.columns

In [265]:
import pyspark.sql.functions as func
import pyspark.sql.types as typ

In [449]:
#cols = [(col.name, col.dataType) for col in dataset_transformed.schema]

In [467]:
#function to automate generate string columns and numerical columns from spark df
x,y,z=util.coverage_test_spark(dataset,b)

Start the count computation for categorical features...
The no. of categorical features: 8


In [382]:
import pyspark.mllib.stat as st
import numpy as np

#basic stats for numerical cols
numericCols = ["age", "fnlwgt", "educational-num", "capital-gain", "capital-loss", "hours-per-week"]

numeric_rdd = dataset.select(numericCols).rdd.map(lambda row: [e for e in row])

mllib_stats = st.Statistics.colStats(numeric_rdd)

for col, m, v in zip(numericCols,
    mllib_stats.mean(),
    mllib_stats.variance()):
    print('{0}: \t{1:.2f} \t {2:.2f}'.format(col, m, np.sqrt(v)))
    
#basic stats for categorical cols
categorical_cols = [e for e in dataset.columns if e not in numericCols]

categorical_rdd = dataset\
  .select(categorical_cols)\
  .rdd \
  .map(lambda row: [e for e in row])

for i, col in enumerate(categorical_cols):
    agg = categorical_rdd \
        .groupBy(lambda row: row[i]) \
        .map(lambda row: (row[0], len(row[1])))
    
    print(col, sorted(agg.collect(),key=lambda el: el[1],reverse=True))

age: 	38.64 	 13.71
fnlwgt: 	189664.13 	 105604.03
educational-num: 	10.08 	 2.57
capital-gain: 	1079.07 	 7452.02
capital-loss: 	87.50 	 403.00
hours-per-week: 	40.42 	 12.39
workclass [('Private', 33906), ('Self-emp-not-inc', 3862), ('Local-gov', 3136), ('?', 2799), ('State-gov', 1981), ('Self-emp-inc', 1695), ('Federal-gov', 1432), ('Without-pay', 21), ('Never-worked', 10)]
education [('HS-grad', 15784), ('Some-college', 10878), ('Bachelors', 8025), ('Masters', 2657), ('Assoc-voc', 2061), ('11th', 1812), ('Assoc-acdm', 1601), ('10th', 1389), ('7th-8th', 955), ('Prof-school', 834), ('9th', 756), ('12th', 657), ('Doctorate', 594), ('5th-6th', 509), ('1st-4th', 247), ('Preschool', 83)]
marital-status [('Married-civ-spouse', 22379), ('Never-married', 16117), ('Divorced', 6633), ('Separated', 1530), ('Widowed', 1518), ('Married-spouse-absent', 628), ('Married-AF-spouse', 37)]
occupation [('Prof-specialty', 6172), ('Craft-repair', 6112), ('Exec-managerial', 6086), ('Adm-clerical', 5611), 

In [None]:
import pyspark.mllib.feature as ft
import pyspark.mllib.regression as reg

hashing = ft.HashingTF(7)

births_hashed = births_transformed \
  .rdd \
  .map(lambda row: [
      list(hashing.transform(row[1]).toArray())
          if col == 'BIRTH_PLACE'
          else row[i]
      for i, col
      in enumerate(features_to_keep)]) \
  .map(lambda row: [[e] if type(e) == int else e
          for e in row]) \
  .map(lambda row: [item for sublist in row
          for item in sublist]) \
  .map(lambda row: reg.LabeledPoint(
      row[0],
      ln.Vectors.dense(row[1:]))
      )

In [385]:
#find multi-colinearlity
multicolinearity_thres = 0

corrs = st.Statistics.corr(numeric_rdd)

for i, el in enumerate(corrs > multicolinearity_thres):
    correlated = [(numericCols[j], corrs[i][j]) for j, e in enumerate(el) if e == 1.0 and j != i]
    
    if len(correlated) > 0:
        for e in correlated:
            print('{0}-to-{1}: {2:.2f}'.format(numericCols[i], e[0], e[1]))

age-to-educational-num: 0.03
age-to-capital-gain: 0.08
age-to-capital-loss: 0.06
age-to-hours-per-week: 0.07
educational-num-to-age: 0.03
educational-num-to-capital-gain: 0.13
educational-num-to-capital-loss: 0.08
educational-num-to-hours-per-week: 0.14
capital-gain-to-age: 0.08
capital-gain-to-educational-num: 0.13
capital-gain-to-hours-per-week: 0.08
capital-loss-to-age: 0.06
capital-loss-to-educational-num: 0.08
capital-loss-to-hours-per-week: 0.05
hours-per-week-to-age: 0.07
hours-per-week-to-educational-num: 0.14
hours-per-week-to-capital-gain: 0.08
hours-per-week-to-capital-loss: 0.05


In [436]:
import random
import numpy as np
from pyspark.sql import Row
from sklearn import neighbors
from pyspark.ml.feature import VectorAssembler

def vectorizerFunction(dataInput, TargetFieldName):
    if(dataInput.select(TargetFieldName).distinct().count() != 2):
        raise ValueError("Target field must have only 2 distinct classes")
        
    columnNames = list(dataInput.columns)
    columnNames.remove(TargetFieldName)
    
    dataInput = dataInput.select(*(columnNames+[TargetFieldName]))
    
    #only assembled numeric columns
    assembler=VectorAssembler(inputCols = columnNames, outputCol = 'features')
    
    pos_vectorized = assembler.transform(dataInput)
    
    vectorized = pos_vectorized.select('features',TargetFieldName).withColumn('label',pos_vectorized[TargetFieldName]).drop(TargetFieldName)
    
    return vectorized

def SmoteSampling(vectorized, k = 5, minorityClass = 1, majorityClass = 0, percentageOver = 200, percentageUnder = 100):
    if(percentageUnder > 100|percentageUnder < 10):
        raise ValueError("Percentage Under must be in range 10 - 100");
    if(percentageOver < 100):
        raise ValueError("Percentage Over must be in at least 100");
        
    #in spark same syntex as pandas to slice df
    dataInput_min = vectorized[vectorized['label'] == minorityClass]
    dataInput_maj = vectorized[vectorized['label'] == majorityClass]
    
    feature = dataInput_min.select('features')
    feature = feature.rdd
    feature = feature.map(lambda x: x[0])
    
    #still collected as list not spark smote
    feature = feature.collect()

    #modified to pure spark
    knn = NearestNeighbors(n_neighbors=3, radius=2.0, 
                       algorithm='brute', metric='euclidean')
    knn.fit(samples)

    feature = np.asarray(feature)
    
    #using the dense vectors to fit and find neighbors
    nbrs = neighbors.NearestNeighbors(n_neighbors=k, algorithm='auto').fit(feature)
    
    neighbours =  nbrs.kneighbors(feature)
    
    gap = neighbours[0]
    neighbours = neighbours[1]
    
    #minority rdd
    min_rdd = dataInput_min.drop('label').rdd
    
    pos_rddArray = min_rdd.map(lambda x : list(x))
    pos_ListArray = pos_rddArray.collect()
    
    min_Array = list(pos_ListArray)
    
    newRows = []
    
    nt = len(min_Array)
    
    nexs = int(percentageOver/100)
    
    for i in range(nt):
        for j in range(nexs):
            neigh = random.randint(1,k)
            difs = min_Array[neigh][0] - min_Array[i][0]
            newRec = (min_Array[i][0]+random.random()*difs)
            newRows.insert(0,(newRec))
            
    newData_rdd = sc.parallelize(newRows)
    newData_rdd_new = newData_rdd.map(lambda x: Row(features = x, label = 1))
    new_data = newData_rdd_new.toDF()
    new_data_minor = dataInput_min.unionAll(new_data)
    new_data_major = dataInput_maj.sample(False, (float(percentageUnder)/float(100)))
    return new_data_major.unionAll(new_data_minor)


In [437]:
vall = SmoteSampling(vectorizerFunction(keep_ds_p_sp, 'income'), k = 5, minorityClass = 0, majorityClass = 1, percentageOver = 300)

11687
[28.0,336951.0,12.0,0.0,0.0,40.0]


In [412]:
vall.select("label").toPandas()['label'].value_counts()

1    25286
0    11687
Name: label, dtype: int64

In [353]:
import pyspark
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, VectorAssembler
from distutils.version import LooseVersion

categoricalColumns = ["workclass", "education", "marital-status", "occupation", "relationship", "race", "gender", "native-country"]
stages = [] # stages in our Pipeline, as a list

#indexes each categorical column using the StringIndexer, 
#and then converts the indexed categories into one-hot encoded variables. 
#The resulting output has the binary vectors appended to the end of each row.
    
for categoricalCol in categoricalColumns:
    # Category Indexing with StringIndexer, will encode to numerical according to frequency, highest frequency will be encoded to 0
    # when applying this stringIndexer onto another dataset and encounter missing encoded value, we can throw exception or setHandleInvalid(“skip”)
    # like indexer.fit(df1).setHandleInvalid("skip").transform(df2), will remove all rows unable to encode    
    stringIndexer = StringIndexer(inputCol=categoricalCol, outputCol=categoricalCol + "Index")
    
    # Use OneHotEncoder to convert categorical variables into binary SparseVectors，
    # binary sparse vectors like (2,[0],[1.0]) means a vector of length 2 with 1.0 at position 0 and 0 elsewhere.
    # spark OHE will automatically drop the last category, you can force it not to drop by dropLast=False
    # it omits the final category to break the correlation between features
    
    if LooseVersion(pyspark.__version__) < LooseVersion("3.0"):
        from pyspark.ml.feature import OneHotEncoderEstimator
        encoder = OneHotEncoderEstimator(inputCols=[stringIndexer.getOutputCol()], outputCols=[categoricalCol + "classVec"])
    else:
        from pyspark.ml.feature import OneHotEncoder
        encoder = OneHotEncoder(inputCols=[stringIndexer.getOutputCol()], outputCols=[categoricalCol + "classVec"])
    # Add stages.  These are not run here, but will run all at once later on.
    stages += [stringIndexer, encoder]

# Convert label into label indices using the StringIndexer
label_stringIdx = StringIndexer(inputCol="income", outputCol="label")
#now stages contains a lot of stringIndexer and oneHotencoder and a label stringindexer
stages += [label_stringIdx]

# to combine all the feature columns into a single vector column. 
# This includes both the numeric columns and the one-hot encoded binary vector columns in our dataset.
# Transform all features into a vector using VectorAssembler
numericCols = ["age", "fnlwgt", "educational-num", "capital-gain", "capital-loss", "hours-per-week"]
assemblerInputs = [c + "classVec" for c in categoricalColumns] + numericCols

#assemblerInputs stores all necessary (transformed) columns after all the stages
#VectorAssembler only applied to numerical or transformed categorical columns
assembler = VectorAssembler(inputCols=assemblerInputs, outputCol="features")

stages += [assembler] 

# then we apply scaling on the vectorized features, 2 additional params are:
# withStd: True by default. Scales the data to unit standard deviation.
# withMean: False by default. Centers the data with mean before scaling.
from pyspark.ml.feature import StandardScaler,MinMaxScaler
scaler = StandardScaler(inputCol="features", outputCol="scaled_features",withMean=True)
#scaler = MinMaxScaler(min=0, max=1, inputCol='features', outputCol='features_minmax')

stages += [scaler] 

In [354]:
from pyspark.ml.classification import LogisticRegression

#having compiled the stages into a list, at execution, it will automatically sort out the sequence to perform steps in stages
#like when .fit() is called, what should be executed...
partialPipeline = Pipeline().setStages(stages) #type is pipeline, independent of dataframe, only using stages 

pipelineModel = partialPipeline.fit(dataset) #type is pipelinemodel, use the prepared staged pipelines to fit dataframe

preppedDataDF = pipelineModel.transform(dataset) #type is stage transformed dataframe, it contains all original columns, and indexed/encoded/vector_encoded columns

In [358]:
# Keep relevant columns
cols = dataset.columns
selectedcols = ["label", "features"] + cols

dataset = preppedDataDF.select(selectedcols)

In [359]:
### Randomly split data into training and test sets. set seed for reproducibility
(trainingData, testData) = dataset.randomSplit([0.7, 0.3], seed=100)

print(trainingData.count())

print(testData.count())

34294
14548


In [361]:
# now to train on the train set
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.classification import RandomForestClassifier

# Create an initial RandomForest model.
rf = RandomForestClassifier(labelCol="label", featuresCol="features")

# Train model with Training Data
rfModel = rf.fit(trainingData)

In [362]:
# Make predictions on test data using the transform() method.
# LogisticRegression.transform() will only use the 'features' column.
predictions = rfModel.transform(testData)

In [363]:
# View model's predictions and probabilities of each prediction class
# You can select any columns in the above schema to view as well. For example's sake we will choose age & occupation
selected = predictions.select("label", "prediction", "probability", "age", "occupation")
display(selected)

DataFrame[label: double, prediction: double, probability: vector, age: bigint, occupation: string]

In [364]:
from pyspark.ml.evaluation import BinaryClassificationEvaluator
# Evaluate model
evaluator = BinaryClassificationEvaluator(rawPredictionCol="rawPrediction")
evaluator.evaluate(predictions)

In [366]:
evaluator.getMetricName()

'areaUnderROC'

In [368]:
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator

paramGrid = (ParamGridBuilder()
             .addGrid(rf.maxDepth, [2, 4, 6])
             .addGrid(rf.maxBins, [20, 60])
             .addGrid(rf.numTrees, [5, 20])
             .build())
# paramGrid contains 3*2*2 = 12 models
# cv is 5 folds, so total 60 models are searched

# Create 5-fold CrossValidator, input is an estimator (rf classifier e.g.)
cv = CrossValidator(estimator=rf, estimatorParamMaps=paramGrid, evaluator=evaluator, numFolds=5)
# Run cross validations
cvModel = cv.fit(trainingData)
# this will likely take a fair amount of time because of the amount of models that we're creating and testing

In [369]:
# Use test set to measure the accuracy of our model on new data
predictions = cvModel.transform(testData)

In [370]:
# cvModel uses the best model found from the Cross Validation
# Evaluate best model
evaluator.evaluate(predictions)

0.8999505914844388

In [202]:
# vector assembler can have inputs as: numeric,bool,vector
# output will be a flattened vector (even if input could have vector)

In [371]:
bestModel = cvModel.bestModel

In [372]:
# Generate predictions for entire dataset
finalPredictions = bestModel.transform(dataset)
# Evaluate best model
evaluator.evaluate(finalPredictions)

In [None]:
# doing chi-square test on categorical cols
import pyspark.mllib.linalg as ln

for cat in categorical_cols[1:]:
    agg = dataset \
    .groupby('race') \
    .pivot(cat) \
    .count()
    
    agg_rdd = agg \
    .rdd \
    .map(lambda row: (row[1:])) \
    .flatMap(lambda row:[0 if e == None else e for e in row]).collect()
    
    row_length = len(agg.collect()[0]) - 1
    
    agg = ln.Matrices.dense(row_length, 2, agg_rdd)
    
    test = st.Statistics.chiSqTest(agg)
    
    print(cat, round(test.pValue, 4))