# Binary Tabular Data Classification with PySpark

This notebook covers a classification problem in machine learning and walks through a comprehensive guide to successfully developing an end-to-end class prediction model using PySpark.

**Classification Algorithms**
Several classification algorithms can be used to predict the class of a sample. When developing our machine learning models, we will train and evaluate a number of them and keep those with the best predictive performance.

A non-exhaustive list of some of the most widely used algorithms is:

- Logistic Regression
- Decision Trees
- Random Forests
- Support Vector Machines
- K-Nearest Neighbors (KNN)

**ROC**
The metric that we will use in our project is based on the Receiver Operating Characteristic (ROC) curve.
The ROC curve shows how well the model distinguishes between the two classes. We summarize it with the area under the curve (AUC), which ranges from 0 to 1: the better the model, the closer the value is to 1, while 0.5 corresponds to random guessing.
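
As a rough illustration of what the area under the ROC curve measures, the sketch below computes it in plain Python as the fraction of (positive, negative) pairs in which the positive sample receives the higher score, counting ties as half. The function name `roc_auc` and the toy scores are assumptions made for this example; they are not part of the notebook's pipeline.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison (ties count 0.5)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one sample of each class")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # tie counts as half a win
    return wins / (len(positives) * len(negatives))


# Toy data: three of the four positive/negative pairs are ordered correctly
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_auc = roc_auc(toy_labels, toy_scores)
```

The pairwise loop is quadratic, so it is only meant for reading; on real data the Spark evaluator used later in the notebook is the practical choice.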

We will use a number of different supervised algorithms to predict individuals’ income using data collected from the 1994 U.S. Census.

We will then choose the best candidate algorithm from preliminary results and further optimize this algorithm to best model the data.
Our goal with this implementation is to build a model that accurately predicts whether an individual makes more than $50,000.

From our previous research, we found that the individuals who are most likely to donate money to a charity are the ones who make more than $50,000.

Therefore, we are facing a binary classification problem, where we want to determine whether an individual makes more than $50K a year (class 1) or does not (class 0).

In [1]:
# We use the findspark library to locate Spark on our local machine
import findspark
findspark.init('C:/Users/bokhy/spark/spark-2.4.6-bin-hadoop2.7')

In [2]:
import pandas as pd
import numpy as np
from datetime import date, timedelta, datetime
import time

import pyspark # only run this after findspark.init()
from pyspark.sql import SparkSession, SQLContext
from pyspark.context import SparkContext
from pyspark.sql.functions import * 
from pyspark.sql.types import * 

### 1. Load Data

The census dataset loaded here has 48,842 data points (the sum of the train and test counts printed after the split below), each with 14 feature columns plus the `income` label.

The dataset for this project can be found in the [UCI Machine Learning Repo](https://archive.ics.uci.edu/ml/machine-learning-databases/adult/).

In [3]:
# Initiate the Spark Session
spark = SparkSession.builder.appName('imbalanced_binary_classification').getOrCreate()

In [4]:
spark

In [5]:
# File location and type
file_location = "./data/census.csv"
file_type = "csv"

# CSV options
infer_schema = "true"
first_row_is_header = "False"
delimiter = ","

# Make sure to add column names, as the CSV does not contain a header row


# The applied options are for CSV files. For other file types, these will be ignored.
df = spark.read.format(file_type) \
  .option("inferSchema", infer_schema) \
  .option("header", first_row_is_header) \
  .option("sep", delimiter) \
  .load(file_location) \
  .toDF("age", "workClass", "fnlwgt", "education", "education-num","marital-status", "occupation", "relationship",
        "race", "sex", "capital-gain", "capital-loss", "hours-per-week", "native-country", "income")

display(df)

DataFrame[age: int, workClass: string, fnlwgt: int, education: string, education-num: int, marital-status: string, occupation: string, relationship: string, race: string, sex: string, capital-gain: int, capital-loss: int, hours-per-week: int, native-country: string, income: string]

In [6]:
df.show()

+---+----------------+------+------------+-------------+--------------------+-----------------+-------------+------------------+------+------------+------------+--------------+--------------+------+
|age|       workClass|fnlwgt|   education|education-num|      marital-status|       occupation| relationship|              race|   sex|capital-gain|capital-loss|hours-per-week|native-country|income|
+---+----------------+------+------------+-------------+--------------------+-----------------+-------------+------------------+------+------------+------------+--------------+--------------+------+
| 39|       State-gov| 77516|   Bachelors|           13|       Never-married|     Adm-clerical|Not-in-family|             White|  Male|        2174|           0|            40| United-States| <=50K|
| 50|Self-emp-not-inc| 83311|   Bachelors|           13|  Married-civ-spouse|  Exec-managerial|      Husband|             White|  Male|           0|           0|            13| United-States| <=50K|
| 38|

### 2. Data Preprocessing

In [7]:
# Import pyspark functions
from pyspark.sql import functions as F
# Add a new binary label column to the dataset (1 if income is above 50K, else 0)
df = df.withColumn('>50K', F.when(df.income == '<=50K', 0).otherwise(1))
# Drop the original income column
df = df.drop('income')
# Show dataset's columns
df.columns

['age',
 'workClass',
 'fnlwgt',
 'education',
 'education-num',
 'marital-status',
 'occupation',
 'relationship',
 'race',
 'sex',
 'capital-gain',
 'capital-loss',
 'hours-per-week',
 'native-country',
 '>50K']
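
The label step above compares `income` to the exact string `'<=50K'`. If the CSV was built from the raw UCI files, labels may carry leading whitespace or a trailing period (as in `adult.test`), in which case an exact match would silently send those rows to class 1. The following is a minimal plain-Python sketch of a more tolerant mapping; the helper `income_to_label` is hypothetical, and whether the local `census.csv` needs it is an assumption worth checking against the distinct values of the column.

```python
def income_to_label(raw_income):
    """Map a raw income string to 1 (>50K) or 0 (<=50K), tolerating stray formatting."""
    # Remove surrounding whitespace first, then any trailing period
    cleaned = raw_income.strip().rstrip(".")
    if cleaned == ">50K":
        return 1
    if cleaned == "<=50K":
        return 0
    # Fail loudly instead of guessing a class for unknown values
    raise ValueError("unexpected income value: {!r}".format(raw_income))


examples = ["<=50K", " >50K", " <=50K.", ">50K."]
labels = [income_to_label(v) for v in examples]
```

In Spark, the same idea could be expressed by applying `F.trim` and `F.regexp_replace` to the column before the `F.when` comparison.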

#### Vectorizing Numerical Features and One-Hot Encoding Categorical Features

In [8]:
# Selecting categorical features
categorical_columns = [
 'workClass',
 'education',
 'marital-status',
 'occupation',
 'relationship',
 'race',
 'sex',
 'hours-per-week',  # numeric column, treated as categorical here (one category per distinct value)
 'native-country',
 ]

In [9]:
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler
from pyspark.ml.classification import (DecisionTreeClassifier, GBTClassifier, RandomForestClassifier, LogisticRegression)
from pyspark.ml.evaluation import BinaryClassificationEvaluator

# Index the string values of each categorical column
indexers = [
    StringIndexer(inputCol=c, outputCol="{0}_indexed".format(c))
    for c in categorical_columns]
# One-hot encode the indexed values of each column
encoders = [
    OneHotEncoder(dropLast=False, inputCol=indexer.getOutputCol(),
                  outputCol="{0}_encoded".format(indexer.getOutputCol()))
    for indexer in indexers]

The code above indexes each categorical column with `StringIndexer` and then converts the indexed categories into one-hot encoded vectors with `OneHotEncoder`. Once the pipeline is applied, each row gains one `_indexed` column and one `_indexed_encoded` vector column per categorical feature.
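
To make the two stages concrete, here is a small plain-Python sketch of the same idea: categories are indexed by descending frequency (the default ordering of `StringIndexer`), and each index is then expanded into a one-hot vector that keeps every category, as with `dropLast=False`. The helper names and the alphabetical tie-break are assumptions for illustration, and Spark stores the result as sparse vectors rather than the dense lists used here.

```python
from collections import Counter


def fit_string_index(values):
    """Assign index 0 to the most frequent category, 1 to the next, and so on."""
    counts = Counter(values)
    # Sort by descending frequency; ties are broken alphabetically (an assumption
    # made here so that the sketch is deterministic).
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    return {category: index for index, category in enumerate(ordered)}


def one_hot(index, size):
    """Return a dense one-hot list that keeps every category (like dropLast=False)."""
    vector = [0.0] * size
    vector[index] = 1.0
    return vector


sex_values = ["Male", "Male", "Female", "Male", "Female"]
mapping = fit_string_index(sex_values)
encoded = [one_hot(mapping[v], len(mapping)) for v in sex_values]
```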

#### Join the categorical encoded features with the numerical ones and make a vector with both of them

In [10]:
# Vectorizing encoded values
categorical_encoded = [encoder.getOutputCol() for encoder in encoders]
numerical_columns = ['age', 'education-num', 'capital-gain', 'capital-loss']
inputcols = categorical_encoded + numerical_columns
assembler = VectorAssembler(inputCols=inputcols, outputCol="features")
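
Conceptually, the assembler concatenates the listed columns, in order, into a single feature vector per row. The sketch below shows that concatenation in plain Python; `assemble` and the toy row are hypothetical and only mirror the behaviour for scalar and list-valued inputs.

```python
def assemble(row, input_cols):
    """Concatenate scalar and list-valued columns into one flat feature list."""
    features = []
    for col in input_cols:
        value = row[col]
        if isinstance(value, (list, tuple)):
            # Vector-like column: append each component in order
            features.extend(float(x) for x in value)
        else:
            # Scalar numeric column: append as a single component
            features.append(float(value))
    return features


toy_row = {"sex_indexed_encoded": [1.0, 0.0], "age": 39, "education-num": 13}
toy_features = assemble(toy_row, ["sex_indexed_encoded", "age", "education-num"])
```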

#### Set up a pipeline to automate these stages

In [11]:
pipeline = Pipeline(stages=indexers + encoders + [assembler])
model = pipeline.fit(df)
# Transform data
transformed = model.transform(df)
display(transformed)

DataFrame[age: int, workClass: string, fnlwgt: int, education: string, education-num: int, marital-status: string, occupation: string, relationship: string, race: string, sex: string, capital-gain: int, capital-loss: int, hours-per-week: int, native-country: string, >50K: int, workClass_indexed: double, education_indexed: double, marital-status_indexed: double, occupation_indexed: double, relationship_indexed: double, race_indexed: double, sex_indexed: double, hours-per-week_indexed: double, native-country_indexed: double, workClass_indexed_encoded: vector, education_indexed_encoded: vector, marital-status_indexed_encoded: vector, occupation_indexed_encoded: vector, relationship_indexed_encoded: vector, race_indexed_encoded: vector, sex_indexed_encoded: vector, hours-per-week_indexed_encoded: vector, native-country_indexed_encoded: vector, features: vector]

#### Finally, we will select a dataset only with the relevant features.

In [12]:
# Keep only the feature vector and the label
final_data = transformed.select('features', '>50K')

### 3. Build a Model

In [13]:
# Initialize the classification models
# Decision Trees
# Random Forests
# Gradient Boosted Trees

dtc = DecisionTreeClassifier(labelCol='>50K', featuresCol='features')

rfc = RandomForestClassifier(numTrees=150, labelCol='>50K', featuresCol='features')

gbt = GBTClassifier(labelCol='>50K', featuresCol='features', maxIter=10)

In [14]:
# Split data
# We will perform a classic 80/20 split between training and testing data.
train_data, test_data = final_data.randomSplit([0.8,0.2], seed=623)
print(train_data.count())
print(test_data.count())

39010
9832
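
The counts printed above add up to 48,842 rows, and 39,010 of them (about 79.9%) landed in the training set rather than exactly 80%. That is expected, because `randomSplit` assigns rows randomly according to the weights instead of cutting the data at a fixed position. The sketch below mimics that behaviour in plain Python; `random_split` is a hypothetical helper, and it is not how Spark implements the split internally.

```python
import random


def random_split(rows, train_weight, seed):
    """Send each row to train with probability train_weight, otherwise to test."""
    rng = random.Random(seed)  # a fixed seed makes the split reproducible
    train, test = [], []
    for row in rows:
        (train if rng.random() < train_weight else test).append(row)
    return train, test


rows = list(range(10000))
train_rows, test_rows = random_split(rows, 0.8, seed=623)
train_fraction = len(train_rows) / len(rows)  # close to 0.8, but rarely exact
```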


### 4. Start Training

In [15]:
dtc_model = dtc.fit(train_data)
rfc_model = rfc.fit(train_data)
gbt_model = gbt.fit(train_data)

### 5. Evaluate with Test-set

In [16]:
dtc_preds = dtc_model.transform(test_data)
rfc_preds = rfc_model.transform(test_data)
gbt_preds = gbt_model.transform(test_data)

### 6. Evaluating Model’s Performance

In [17]:
# Our evaluator uses the area under the ROC curve (the default metric)
my_eval = BinaryClassificationEvaluator(labelCol='>50K')

In [18]:
# Display Decision Tree evaluation metric
print('DTC')
print(my_eval.evaluate(dtc_preds))

DTC
0.5849312593442992


In [19]:
# Display Random Forest evaluation metric
print('RFC')
print(my_eval.evaluate(rfc_preds))

RFC
0.8914577709920453


In [20]:
# Display Gradient Boosted Trees evaluation metric
print('GBT')
print(my_eval.evaluate(gbt_preds))

GBT
0.9044179860557597


### 7. Improving Model Performance (Model Tuning)

We will try to do this with grid search cross-validation. With it, we evaluate the model's performance for every combination of a predefined set of hyperparameter values.

The hyperparameters that we will tune are:

- Max Depth
- Max Bins
- Max Iterations
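
To get a feel for the size of the search, the sketch below enumerates a grid in plain Python with `itertools.product`. The candidate values mirror the ones passed to `ParamGridBuilder` in the next cell; the dictionary layout and variable names are assumptions for illustration only.

```python
from itertools import product

# Candidate values per hyperparameter (same values as the tuning cell)
grid = {
    "maxDepth": [2, 4, 6],
    "maxBins": [20, 60],
    "maxIter": [10, 20],
}

names = list(grid)
# Every combination of one value per hyperparameter
combinations = [
    dict(zip(names, values))
    for values in product(*(grid[name] for name in names))
]

num_folds = 5
# Each combination is trained once per fold during cross-validation
total_fits = len(combinations) * num_folds
```

With 3 x 2 x 2 = 12 combinations and 5 folds, cross-validation trains 60 models before refitting the best combination on the full training set, which is why the tuning cell is slow.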

In [21]:
# Import libraries
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator

# Set the Parameters grid
paramGrid = (ParamGridBuilder()
             .addGrid(gbt.maxDepth, [2, 4, 6])
             .addGrid(gbt.maxBins, [20, 60])
             .addGrid(gbt.maxIter, [10, 20])
             .build())

# Initialize the cross validator
cv = CrossValidator(estimator=gbt, estimatorParamMaps=paramGrid, evaluator=my_eval, numFolds=5)

# Run cross validation. This can take about 6 minutes, since each of the 12 parameter combinations is trained on 5 folds (60 GBT fits, plus a final refit of the best model)
cvModel = cv.fit(train_data)
gbt_predictions_2 = cvModel.transform(test_data)
my_eval.evaluate(gbt_predictions_2)

0.9143539096589867
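
The cross-validation used above can be pictured as follows: the training rows are divided into five folds, and each fold takes one turn as the validation set while the other four are used for fitting. This is a minimal plain-Python sketch of that bookkeeping; `k_fold_indices` is a hypothetical helper, and the shuffled round-robin assignment is an assumption rather than Spark's exact fold assignment.

```python
import random


def k_fold_indices(n_rows, n_folds, seed):
    """Yield (train_indices, validation_indices) for each of n_folds folds."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into n_folds groups, round-robin
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for i, validation in enumerate(folds):
        # Training set is every index that is not in the current validation fold
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation


splits = list(k_fold_indices(n_rows=10, n_folds=5, seed=0))
```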

#### We can also inspect the cross-validation scores and the best model's feature importances

In [None]:
# Gradient boosted trees have no intercept or coefficients (those belong to linear models)
print('Mean CV AUC per parameter combination: ', cvModel.avgMetrics)

In [None]:
importances = cvModel.bestModel.featureImportances
importances = [(float(w),) for w in importances.toArray()]  # convert numpy type to float, and to tuple
importancesDF = spark.createDataFrame(importances, ["Feature Importance"])
display(importancesDF)

In [None]:
# View best model's predictions and probabilities of each prediction class
# (final_data only kept 'features' and '>50K', so the raw columns are not available here)
selected = gbt_predictions_2.select(">50K", "prediction", "probability")
display(selected)

In [None]:
# End Spark Session
spark.stop()