# PySpark Processing Pipeline Template

This notebook implements a typical preprocessing pipeline for tabular data:
- Split the data into training and testing sets
- Index all string (categorical) columns, then one-hot encode them. Missing values are handled by `handleInvalid='keep'`
- Impute all numeric columns, then standardize them
- Assemble all processed columns into a single vector column, `features`
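
To make the `handleInvalid='keep'` step more concrete, here is a minimal plain-Python sketch of the indexing idea: labels seen in training are numbered by descending frequency, and any missing or unseen label is sent to one extra index. This is an illustration only, not Spark's implementation; the function names `fit_index` and `transform_keep` and the labels `'A'` to `'D'` are hypothetical.

```python
from collections import Counter

def fit_index(values):
    """Map each label seen in training to an index, most frequent label first."""
    counts = Counter(v for v in values if v is not None)
    # Sort by descending frequency, then alphabetically to break ties.
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: i for i, label in enumerate(ordered)}

def transform_keep(values, mapping):
    """Index values; missing or unseen labels all go to one extra index."""
    extra = len(mapping)  # the additional bucket that 'keep' creates
    return [mapping.get(v, extra) for v in values]

# Hypothetical training labels: 'A' appears 3 times, 'B' twice, 'C' once.
train_labels = ['A', 'A', 'A', 'B', 'B', 'C']
mapping = fit_index(train_labels)

# None (missing) and 'D' (never seen in training) share the extra index.
indexed = transform_keep(['A', 'C', None, 'D'], mapping)
print(mapping)
print(indexed)
```

Because the extra index exists, rows with a missing or new category can still be transformed instead of raising an error when the fitted pipeline is applied to the testing set.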

User parameters: 
- `hdfs_path`: path to HDFS folder
- `data_file`: data file name
- `split_ratio`: a list of two ratios, the training proportion and the testing proportion
- `integer_cols`: a list of the names of all integer columns. These will be cast to `double`
- `drop_cols`: a list of all columns to drop from modeling data. These are usually ID columns or name columns
- `string_cols`: a list of all string (categorical) columns. These will be one-hot encoded
- `numeric_cols`: a list of all numeric columns. These will undergo imputation and standardization
- `target`: the single target column
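
Since the parameters above are typed by hand, a quick consistency check before running the pipeline may save a failed Spark job. The following is a minimal sketch in plain Python; `check_params` is a hypothetical helper, not part of the notebook or of PySpark, and the rules it enforces are assumptions about what a sensible configuration looks like.

```python
def check_params(split_ratio, drop_cols, string_cols, numeric_cols, target):
    """Return a list of problems found in the user parameters (empty if none)."""
    problems = []
    if len(split_ratio) != 2 or any(r <= 0 for r in split_ratio):
        problems.append('split_ratio must hold two positive numbers')
    features = list(string_cols) + list(numeric_cols)
    if target in features:
        problems.append('target must not be listed as a feature column')
    # A dropped column cannot be used later as a feature or as the target.
    dropped = sorted(set(drop_cols) & set(features + [target]))
    if dropped:
        problems.append('dropped columns are still referenced: ' + ', '.join(dropped))
    both = sorted(set(string_cols) & set(numeric_cols))
    if both:
        problems.append('columns listed as both string and numeric: ' + ', '.join(both))
    return problems

# The values used in this notebook pass all checks.
issues = check_params(
    split_ratio=[0.7, 0.3],
    drop_cols=['PatientID'],
    string_cols=['ChestPainType', 'RestingECG', 'ExerciseAngina', 'ST_Slope'],
    numeric_cols=['Age', 'RestingBP', 'Cholesterol', 'FastingBS', 'MaxHR', 'Oldpeak'],
    target='HeartDisease',
)
print(issues)
```

The check does not require the two ratios to sum to 1, because `randomSplit` normalizes its weights.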

In [1]:
%spark2.pyspark

#user parameters
hdfs_path = '/tmp/data/'
data_file = 'heart_disease.csv'
split_ratio = [0.7, 0.3]
drop_cols = ['PatientID']
integer_cols = ['Age', 'RestingBP', 'Cholesterol', 'FastingBS', 'MaxHR', 'HeartDisease']
string_cols = ['ChestPainType', 'RestingECG', 'ExerciseAngina', 'ST_Slope']
numeric_cols = ['Age','RestingBP','Cholesterol','FastingBS','MaxHR','Oldpeak']
target = 'HeartDisease'


from pyspark.sql.functions import col
from pyspark.sql.types import DoubleType

#read data from the path built from the user parameters
data = spark.read.options(header='True',inferSchema='True',delimiter=',').csv(hdfs_path + data_file)

#drop columns
data = data.drop(*drop_cols)

#cast integer columns to double
for c in integer_cols:
    data = data.withColumn(c, col(c).cast(DoubleType()))
    
#train-test split
data_train, data_test = data.randomSplit(split_ratio)

from pyspark.ml.feature import StringIndexer, OneHotEncoder, Imputer, StandardScaler, VectorAssembler
from pyspark.ml import Pipeline

###one hot encode the categorical columns
encoders = []
for c in string_cols:
    encoders.append(StringIndexer(inputCol=c, outputCol=c+'Index', handleInvalid='keep'))
    encoders.append(OneHotEncoder(inputCol=c+'Index', outputCol=c+'Codes'))

###impute the numeric columns
imputer = Imputer(inputCols = numeric_cols, outputCols = [c+'Imp' for c in numeric_cols], strategy = 'median')

###standardization
num_assembler = VectorAssembler(inputCols=[c+'Imp' for c in numeric_cols], outputCol='imputed')
scaler = StandardScaler(inputCol = 'imputed', outputCol = 'scaled')

###combine results
assembler = VectorAssembler(inputCols=[c+'Codes' for c in string_cols]+['scaled'], outputCol='features')



###build pipeline
pipeline = Pipeline(stages = encoders + [imputer, num_assembler, scaler, assembler])

###train pipeline
pipeline_trained = pipeline.fit(data_train)
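
The numeric branch of the pipeline above can be illustrated on a single column in plain Python. The sketch below assumes the defaults used in the code: `Imputer` with `strategy='median'` fills missing values with the training median, and `StandardScaler` with its default settings (`withStd=True`, `withMean=False`) divides by the training standard deviation without centering. The helper `impute_and_scale` and the sample values are hypothetical, and Spark computes the median approximately, so results on real data may differ slightly.

```python
import statistics

def impute_and_scale(train_values, new_values):
    """Fill None with the training median, then divide by the training std."""
    observed = [v for v in train_values if v is not None]
    median = statistics.median(observed)
    # The scaler is fitted on the training column after imputation.
    filled_train = [median if v is None else v for v in train_values]
    std = statistics.stdev(filled_train)  # sample standard deviation (n - 1)
    # No centering: this mirrors withMean=False.
    return [(median if v is None else v) / std for v in new_values]

# Hypothetical column: training median is 4.0, std after imputation is 2.0.
train_col = [0.0, 4.0, None, 4.0]
scaled = impute_and_scale(train_col, [None, 6.0, 1.0])
print(scaled)
```

Both statistics come from the training data only, which is why the pipeline is fitted on `data_train` and then reused unchanged on `data_test`. If centered features are wanted, `StandardScaler` accepts `withMean=True`.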

# Testing the pipeline

Transform the training and testing sets with the fitted pipeline. Because this is a classification problem, a logistic regression model is fitted as a demonstration.


In [3]:
%spark2.pyspark

#transform the training and testing data
train_prc = pipeline_trained.transform(data_train).select(target,'features')
test_prc = pipeline_trained.transform(data_test).select(target,'features')

#create and fit a logistic regression model
from pyspark.ml.classification import LogisticRegression
logistic_model = LogisticRegression(featuresCol='features', labelCol=target)
logistic_trained = logistic_model.fit(train_prc)

In [4]:
%spark2.pyspark
train_predicted = logistic_trained.transform(train_prc)
train_predicted.crosstab(target, 'prediction').show()

+-----------------------+---+---+
|HeartDisease_prediction|0.0|1.0|
+-----------------------+---+---+
|                    1.0| 37|329|
|                    0.0|246| 50|
+-----------------------+---+---+



In [5]:
%spark2.pyspark
test_predicted = logistic_trained.transform(test_prc)
test_predicted.crosstab(target, 'prediction').show()

+-----------------------+---+---+
|HeartDisease_prediction|0.0|1.0|
+-----------------------+---+---+
|                    1.0| 15|127|
|                    0.0| 91| 23|
+-----------------------+---+---+
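
The two crosstabs above can be reduced to summary metrics with a few lines of plain Python. This is a minimal sketch that copies the counts from the printed tables by hand, treating `1.0` as the positive class (rows are the true `HeartDisease` value, columns are the prediction); `summarize` is a hypothetical helper, not a PySpark function.

```python
def summarize(tn, fp, fn, tp):
    """Compute accuracy, precision and recall from confusion-matrix counts."""
    total = tn + fp + fn + tp
    return {
        'accuracy': (tp + tn) / total,
        'precision': tp / (tp + fp),
        'recall': tp / (tp + fn),
    }

# Counts copied from the training and testing crosstabs.
train_metrics = summarize(tn=246, fp=50, fn=37, tp=329)
test_metrics = summarize(tn=91, fp=23, fn=15, tp=127)

for name, metrics in [('train', train_metrics), ('test', test_metrics)]:
    print(name, {k: round(v, 3) for k, v in metrics.items()})
```

For the testing set this gives an accuracy of 218/256, about 0.85, close to the training accuracy of 575/662, about 0.87. Because `randomSplit` is called without a seed, a rerun of the notebook will produce different counts.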

