# Multi-class classification using a Decision Tree with PySpark

Each year, San Francisco Airport (SFO) conducts a customer satisfaction survey to find out what they are doing well and where they can improve. The survey gauges satisfaction with SFO facilities, services, and amenities. SFO compares results to previous surveys to discover elements of the guest experience that are not satisfactory.

The 2013 SFO Survey Results consists of customer responses to survey questions and an overall satisfaction rating with the airport. We investigated whether we could use machine learning to predict a customer's overall response given their responses to the individual questions. That in and of itself is not very useful, because the customer has already provided an overall rating as well as individual ratings for various aspects of the airport such as parking, food quality, and restroom cleanliness. However, we didn't stop at prediction; instead, we asked the question:

What factors drove the customer to give the overall rating?

Here is an outline of our data flow:

- Load data: Load the data as a DataFrame
- Understand the data: Compute statistics and create visualizations to get a better understanding of the data to see if we can use basic statistics to answer the question above.
- Create a model: Build and train a decision tree model on the training dataset.
- Evaluate the model: Now look at the test dataset. Compare the initial model with the tuned model to see the benefit of tuning parameters.
- Feature Importance: Determine the importance of each of the individual ratings in determining the overall rating by the customer

In [1]:
#we use the findspark library to locate spark on our local machine
import findspark
findspark.init('C:/Users/bokhy/spark/spark-2.4.6-bin-hadoop2.7')

In [2]:
import os
import pandas as pd
import numpy as np
from datetime import date, timedelta, datetime
import time

import pyspark # only run this after findspark.init()
from pyspark.sql import SparkSession, SQLContext
from pyspark.context import SparkContext
from pyspark.sql.functions import * 
from pyspark.sql.types import * 

## 1. Load Data 
This dataset is available as a public dataset from https://catalog.data.gov/dataset/2013-sfo-customer-survey-d3541.

In [3]:
# Initiate the Spark Session
spark = SparkSession.builder.appName('Decision-Tree').getOrCreate()

In [27]:
spark

In [5]:
survey = spark.read.csv("./data/2013_SFO_Customer_Survey.csv", header="true", inferSchema="true")

In [6]:
display(survey)

DataFrame[RESPNUM: int, CCGID: int, RUN: int, INTDATE: int, GATE: int, STRATA: int, PEAK: int, METHOD: int, AIRLINE: int, FLIGHT: int, DEST: int, DESTGEO: int, DESTMARK: int, ARRTIME: string, DEPTIME: string, Q2PURP1: int, Q2PURP2: int, Q2PURP3: int, Q2PURP4: int, Q2PURP5: int, Q2PURP6: string, Q3GETTO1: int, Q3GETTO2: int, Q3GETTO3: int, Q3GETTO4: int, Q3GETTO5: string, Q3GETTO6: string, Q3PARK: int, Q4BAGS: int, Q4BUY: int, Q4FOOD: int, Q4WIFI: int, Q5FLYPERYR: int, Q6TENURE: double, SAQ: int, Q7A_ART: int, Q7B_FOOD: int, Q7C_SHOPS: int, Q7D_SIGNS: int, Q7E_WALK: int, Q7F_SCREENS: int, Q7G_INFOARR: int, Q7H_INFODEP: int, Q7I_WIFI: int, Q7J_ROAD: int, Q7K_PARK: int, Q7L_AIRTRAIN: int, Q7M_LTPARK: int, Q7N_RENTAL: int, Q7O_WHOLE: int, Q8COM1: int, Q8COM2: int, Q8COM3: int, Q9A_CLNBOARD: int, Q9B_CLNAIRTRAIN: int, Q9C_CLNRENT: int, Q9D_CLNFOOD: int, Q9E_CLNBATH: int, Q9F_CLNWHOLE: int, Q9COM1: int, Q9COM2: int, Q9COM3: int, Q10SAFE: int, Q10COM1: int, Q10COM2: int, Q10COM3: int, Q11A_US

In [7]:
survey.printSchema()

root
 |-- RESPNUM: integer (nullable = true)
 |-- CCGID: integer (nullable = true)
 |-- RUN: integer (nullable = true)
 |-- INTDATE: integer (nullable = true)
 |-- GATE: integer (nullable = true)
 |-- STRATA: integer (nullable = true)
 |-- PEAK: integer (nullable = true)
 |-- METHOD: integer (nullable = true)
 |-- AIRLINE: integer (nullable = true)
 |-- FLIGHT: integer (nullable = true)
 |-- DEST: integer (nullable = true)
 |-- DESTGEO: integer (nullable = true)
 |-- DESTMARK: integer (nullable = true)
 |-- ARRTIME: string (nullable = true)
 |-- DEPTIME: string (nullable = true)
 |-- Q2PURP1: integer (nullable = true)
 |-- Q2PURP2: integer (nullable = true)
 |-- Q2PURP3: integer (nullable = true)
 |-- Q2PURP4: integer (nullable = true)
 |-- Q2PURP5: integer (nullable = true)
 |-- Q2PURP6: string (nullable = true)
 |-- Q3GETTO1: integer (nullable = true)
 |-- Q3GETTO2: integer (nullable = true)
 |-- Q3GETTO3: integer (nullable = true)
 |-- Q3GETTO4: integer (nullable = true)
 |-- Q3GETT

As you can see above, there are many questions in the survey, including what airline the customer flew on, where they live, etc. To answer the question above, focus on the Q7A, Q7B, Q7C .. Q7O questions, since they relate directly to customer satisfaction, which is what you want to measure. If you drill down on those variables, you get the following:

|Column Name|Data Type|Description|
| --- | --- | --- |
|Q7A_ART|INTEGER|Artwork and exhibitions|
|Q7B_FOOD|INTEGER|Restaurants|
|Q7C_SHOPS|INTEGER|Retail shops and concessions|
|Q7D_SIGNS|INTEGER|Signs and Directions inside SFO|
|Q7E_WALK|INTEGER|Escalators / elevators / moving walkways|
|Q7F_SCREENS|INTEGER|Information on screens and monitors|
|Q7G_INFOARR|INTEGER|Information booth near arrivals area|
|Q7H_INFODEP|INTEGER|Information booth near departure areas|
|Q7I_WIFI|INTEGER|Airport WiFi|
|Q7J_ROAD|INTEGER|Signs and directions on SFO airport roadways|
|Q7K_PARK|INTEGER|Airport parking facilities|
|Q7L_AIRTRAIN|INTEGER|AirTrain|
|Q7M_LTPARK|INTEGER|Long term parking lot shuttle|
|Q7N_RENTAL|INTEGER|Airport rental car center|
|Q7O_WHOLE|INTEGER|SFO Airport as a whole|

Q7O_WHOLE is the target variable.

The possible values for the above are:

**0 = no answer, 1 = Unacceptable, 2 = Below Average, 3 = Average, 4 = Good, 5 = Outstanding, 6 = Not visited or not applicable**

Select only the fields we are interested in.

In [8]:
dataset = survey.select("Q7A_ART", "Q7B_FOOD", "Q7C_SHOPS", "Q7D_SIGNS", "Q7E_WALK", "Q7F_SCREENS", "Q7G_INFOARR", "Q7H_INFODEP", "Q7I_WIFI", "Q7J_ROAD", "Q7K_PARK", "Q7L_AIRTRAIN", "Q7M_LTPARK", "Q7N_RENTAL", "Q7O_WHOLE")

Let's get some basic statistics such as looking at the **average of each column**.

In [9]:
a = map(lambda s: "'missingValues(" + s +") " + s + "'",["Q7A_ART", "Q7B_FOOD", "Q7C_SHOPS", "Q7D_SIGNS", "Q7E_WALK", "Q7F_SCREENS", "Q7G_INFOARR", "Q7H_INFODEP", "Q7I_WIFI", "Q7J_ROAD", "Q7K_PARK", "Q7L_AIRTRAIN", "Q7M_LTPARK", "Q7N_RENTAL", "Q7O_WHOLE"])
", ".join(a)

"'missingValues(Q7A_ART) Q7A_ART', 'missingValues(Q7B_FOOD) Q7B_FOOD', 'missingValues(Q7C_SHOPS) Q7C_SHOPS', 'missingValues(Q7D_SIGNS) Q7D_SIGNS', 'missingValues(Q7E_WALK) Q7E_WALK', 'missingValues(Q7F_SCREENS) Q7F_SCREENS', 'missingValues(Q7G_INFOARR) Q7G_INFOARR', 'missingValues(Q7H_INFODEP) Q7H_INFODEP', 'missingValues(Q7I_WIFI) Q7I_WIFI', 'missingValues(Q7J_ROAD) Q7J_ROAD', 'missingValues(Q7K_PARK) Q7K_PARK', 'missingValues(Q7L_AIRTRAIN) Q7L_AIRTRAIN', 'missingValues(Q7M_LTPARK) Q7M_LTPARK', 'missingValues(Q7N_RENTAL) Q7N_RENTAL', 'missingValues(Q7O_WHOLE) Q7O_WHOLE'"

Let's start with the overall rating.

In [10]:
from pyspark.sql.functions import *
dataset.selectExpr('avg(Q7O_WHOLE) Q7O_WHOLE').take(1)

[Row(Q7O_WHOLE=3.8743988684582744)]

The overall rating is only 3.87, so slightly above average. Let's get the averages of the constituent ratings:

In [11]:
avgs = dataset.selectExpr('avg(Q7A_ART) Q7A_ART', 'avg(Q7B_FOOD) Q7B_FOOD', 'avg(Q7C_SHOPS) Q7C_SHOPS', 'avg(Q7D_SIGNS) Q7D_SIGNS', 'avg(Q7E_WALK) Q7E_WALK', 'avg(Q7F_SCREENS) Q7F_SCREENS', 'avg(Q7G_INFOARR) Q7G_INFOARR', 'avg(Q7H_INFODEP) Q7H_INFODEP', 'avg(Q7I_WIFI) Q7I_WIFI', 'avg(Q7J_ROAD) Q7J_ROAD', 'avg(Q7K_PARK) Q7K_PARK', 'avg(Q7L_AIRTRAIN) Q7L_AIRTRAIN', 'avg(Q7M_LTPARK) Q7M_LTPARK', 'avg(Q7N_RENTAL) Q7N_RENTAL')
display(avgs)

DataFrame[Q7A_ART: double, Q7B_FOOD: double, Q7C_SHOPS: double, Q7D_SIGNS: double, Q7E_WALK: double, Q7F_SCREENS: double, Q7G_INFOARR: double, Q7H_INFODEP: double, Q7I_WIFI: double, Q7J_ROAD: double, Q7K_PARK: double, Q7L_AIRTRAIN: double, Q7M_LTPARK: double, Q7N_RENTAL: double]

In [12]:
display(dataset)

DataFrame[Q7A_ART: int, Q7B_FOOD: int, Q7C_SHOPS: int, Q7D_SIGNS: int, Q7E_WALK: int, Q7F_SCREENS: int, Q7G_INFOARR: int, Q7H_INFODEP: int, Q7I_WIFI: int, Q7J_ROAD: int, Q7K_PARK: int, Q7L_AIRTRAIN: int, Q7M_LTPARK: int, Q7N_RENTAL: int, Q7O_WHOLE: int]

So basic statistics can't seem to answer the question: **What factors drove the customer to give the overall rating?**

So let's try to use a predictive algorithm to see if these individual ratings can be used to predict an overall rating.

## 2. Create a Model

First, you need to treat responses of 0 = No Answer and 6 = Not Visited or Not Applicable as missing values. One way to do this is a technique called mean imputation, in which the mean of the column replaces each missing value. Here, you can use a replace function to set all values of 0 or 6 to 3, the rating labeled Average. You also need a label column of type double, so create that as well.

In [13]:
training = dataset.withColumn("label", dataset['Q7O_WHOLE']*1.0).na.replace(0,3).replace(6,3)

In [14]:
display(training)

DataFrame[Q7A_ART: int, Q7B_FOOD: int, Q7C_SHOPS: int, Q7D_SIGNS: int, Q7E_WALK: int, Q7F_SCREENS: int, Q7G_INFOARR: int, Q7H_INFODEP: int, Q7I_WIFI: int, Q7J_ROAD: int, Q7K_PARK: int, Q7L_AIRTRAIN: int, Q7M_LTPARK: int, Q7N_RENTAL: int, Q7O_WHOLE: int, label: double]
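
The cell above fills every missing code with the constant 3. As a minimal sketch in plain Python (not the notebook's Spark code), here is how that constant fill compares with imputing the mean of the valid responses. The `impute` helper and the sample responses are hypothetical and only illustrate the idea:

```python
from statistics import mean

# Codes the survey uses for "no answer" and "not visited / not applicable"
MISSING_CODES = {0, 6}

def impute(ratings, fill=None):
    """Replace missing codes with `fill`, or with the mean of the valid ratings."""
    valid = [r for r in ratings if r not in MISSING_CODES]
    if fill is None:
        fill = mean(valid)
    return [fill if r in MISSING_CODES else r for r in ratings]

# Hypothetical responses to one survey question
sample = [4, 0, 5, 6, 3, 4]

constant_filled = impute(sample, fill=3)  # constant fill, as in the cell above
mean_filled = impute(sample)              # fill with the mean of the valid ratings
```

In this made-up sample the valid ratings average 4, so the two strategies give different fills. Whether that difference matters for the survey data would need to be checked against the real column means.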

##### Create 'Model Pipeline'

In [15]:
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import DecisionTreeRegressor
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator
from pyspark.ml.evaluation import RegressionEvaluator

inputCols = ['Q7A_ART', 'Q7B_FOOD', 'Q7C_SHOPS', 'Q7D_SIGNS', 'Q7E_WALK', 'Q7F_SCREENS', 'Q7G_INFOARR', 'Q7H_INFODEP', 'Q7I_WIFI', 'Q7J_ROAD', 'Q7K_PARK', 'Q7L_AIRTRAIN', 'Q7M_LTPARK', 'Q7N_RENTAL']
va = VectorAssembler(inputCols=inputCols,outputCol="features")
dt = DecisionTreeRegressor(labelCol="label", featuresCol="features", maxDepth=4)
evaluator = RegressionEvaluator(metricName = "rmse", labelCol="label")
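# Grid and cross-validator for tuning maxDepth. Note: cv is defined but never fitted;
# the pipeline below trains dt directly with maxDepth=4, so no tuning takes place.
grid = ParamGridBuilder().addGrid(dt.maxDepth, [3, 5, 7, 10]).build()
cv = CrossValidator(estimator=dt, estimatorParamMaps=grid, evaluator=evaluator, numFolds = 10)
pipeline = Pipeline(stages=[va, dt])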

## 3. Train a Model

In [16]:
model = pipeline.fit(training)

In [17]:
display(model.stages[-1])

DecisionTreeRegressionModel (uid=DecisionTreeRegressor_73754465424e) of depth 4 with 31 nodes

In [18]:
predictions = model.transform(training)
display(predictions)

DataFrame[Q7A_ART: int, Q7B_FOOD: int, Q7C_SHOPS: int, Q7D_SIGNS: int, Q7E_WALK: int, Q7F_SCREENS: int, Q7G_INFOARR: int, Q7H_INFODEP: int, Q7I_WIFI: int, Q7J_ROAD: int, Q7K_PARK: int, Q7L_AIRTRAIN: int, Q7M_LTPARK: int, Q7N_RENTAL: int, Q7O_WHOLE: int, label: double, features: vector, prediction: double]

## 4. Evaluate the model

In [19]:
from pyspark.ml.evaluation import RegressionEvaluator

evaluator = RegressionEvaluator()

# Note: predictions were computed on the training data, so this is a training RMSE
evaluator.evaluate(predictions, {evaluator.metricName: "rmse"})

0.555808023551782
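
RMSE is the square root of the mean squared difference between label and prediction. A minimal plain-Python sketch with made-up values (not the survey data) shows the arithmetic behind this metric. The `rmse` helper is hypothetical and stands in for what the evaluator computes:

```python
import math

def rmse(labels, predictions):
    """Root mean squared error between two equal-length sequences."""
    squared_errors = [(y - p) ** 2 for y, p in zip(labels, predictions)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical overall ratings and model predictions
labels = [4.0, 3.0, 5.0, 4.0]
predictions = [4.0, 4.0, 4.0, 4.0]

score = rmse(labels, predictions)  # sqrt((0 + 1 + 1 + 0) / 4)
```

Read this way, an RMSE of about 0.56 on a 1 to 5 scale means the predictions are off by roughly half a rating point in the root-mean-square sense.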

## 5. Save the model

In [None]:
import uuid
model_save_path = f"/tmp/sfo_survey_model/{str(uuid.uuid4())}"
model.write().overwrite().save(model_save_path)

## 6. Feature Importance
Feature importance is a measure of information gain. It is scaled from 0.0 to 1.0, and the importances of all features sum to 1. As an example, feature 1 (Q7B_FOOD) in the output below is rated as 0.1173, or 11.73% of the total importance for all the features.
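
Spark's documentation describes a tree's feature importance as the sum of the gains at the nodes that split on a feature, with each gain scaled by the number of rows passing through the node, and the result normalized to sum to 1. The following plain-Python sketch illustrates that calculation. The node list is hypothetical and is not taken from the fitted model:

```python
from collections import defaultdict

# Hypothetical internal nodes: (split feature, impurity gain, rows reaching the node)
nodes = [
    ("Q7D_SIGNS", 0.30, 1000),
    ("Q7F_SCREENS", 0.20, 600),
    ("Q7D_SIGNS", 0.10, 400),
    ("Q7B_FOOD", 0.05, 200),
]

# Accumulate the gain of each split, weighted by how many rows it affects
raw = defaultdict(float)
for feature, gain, count in nodes:
    raw[feature] += gain * count

# Normalize so that the importances sum to 1
total = sum(raw.values())
importances = {feature: value / total for feature, value in raw.items()}
```

A feature that never appears in a split gets no entry here, which matches the way unused features are missing from a sparse importance vector.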

In [21]:
model.stages[1].featureImportances

SparseVector(14, {0: 0.0653, 1: 0.1173, 2: 0.0099, 3: 0.5219, 4: 0.0052, 5: 0.2403, 8: 0.0028, 10: 0.0059, 13: 0.0314})
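
To read the sparse vector, pair each index with the input column at the same position. The sketch below does this in plain Python with the values copied from the output above and the column order from `inputCols`. The variable names are illustrative only:

```python
# Index -> importance, copied from the SparseVector output above
importances = {0: 0.0653, 1: 0.1173, 2: 0.0099, 3: 0.5219, 4: 0.0052,
               5: 0.2403, 8: 0.0028, 10: 0.0059, 13: 0.0314}

# Same order as inputCols passed to the VectorAssembler
input_cols = ['Q7A_ART', 'Q7B_FOOD', 'Q7C_SHOPS', 'Q7D_SIGNS', 'Q7E_WALK',
              'Q7F_SCREENS', 'Q7G_INFOARR', 'Q7H_INFODEP', 'Q7I_WIFI', 'Q7J_ROAD',
              'Q7K_PARK', 'Q7L_AIRTRAIN', 'Q7M_LTPARK', 'Q7N_RENTAL']

# Rank features by importance, highest first
ranked = sorted(((input_cols[i], v) for i, v in importances.items()),
                key=lambda pair: pair[1], reverse=True)

top3 = ranked[:3]
top3_share = sum(value for _, value in top3)  # 0.5219 + 0.2403 + 0.1173
```

Indices that do not appear in the vector (6, 7, 9, 11, and 12) have an importance of zero, meaning the tree never split on those columns.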

In [22]:
featureImportance = model.stages[1].featureImportances.toArray()
featureNames = map(lambda s: s.name, dataset.schema.fields)
featureImportanceMap = zip(featureImportance, featureNames)

In [23]:
featureImportanceMap

<zip at 0x168a65ac308>

In [None]:
# parallelize is a SparkContext method, so reach it through spark.sparkContext
importancesDf = spark.createDataFrame(spark.sparkContext.parallelize(featureImportanceMap).map(lambda r: [r[1], float(r[0])]))

importancesDf = importancesDf.withColumnRenamed("_1", "Feature").withColumnRenamed("_2", "Importance")

With the importances in a DataFrame, you can view them and save them so other users can rely on this information.

In [None]:
display(importancesDf.orderBy(desc("Importance")))

As the feature importances show, the 3 most important features are:

- Signs
- Screens
- Food

This is useful information for the airport management. It means that people want to first know where they are going. Second, they check the airport screens and monitors so they can find their gate and be on time for their flight. Third, they like to have good quality food.

This is especially interesting considering that taking the average of these feature variables told us nothing about the importance of the variables in determining the overall rating by the survey responder.

These 3 features combine to make up about **88**% of the total feature importance (0.5219 + 0.2403 + 0.1173 = 0.8795).

In [None]:
importancesDf.orderBy(desc("Importance")).limit(3).agg(sum("Importance")).take(1)

In [None]:
# View it as a pie chart
display(importancesDf.orderBy(desc("Importance")))

In [None]:
display(importancesDf.orderBy(desc("Importance")).limit(5))

## 7. Conclusion
So if you run SFO, artwork and shopping are nice-to-haves but signs, monitors, and food are what keep airport customers happy!

In [None]:
# delete saved model (dbutils exists only on Databricks; shutil works for a local path)
import shutil
shutil.rmtree(model_save_path, ignore_errors=True)