# Project Big Data - Pipelines
## Hélène Lechêne, Marie Philippe, Claire Serraz & Romane Soler
## M2 D3S

The aim of this notebook is to build a **pipeline using Spark**.
A pipeline is specified as a sequence of stages, where each stage is either a transformer or an estimator:

* **Transformer**: an algorithm that transforms one DataFrame into another DataFrame (e.g. *Tokenizer*, *Binarizer*...)
* **Estimator**: an algorithm that can be fit on a DataFrame to produce a Transformer (e.g. *LogisticRegression*, *DecisionTreeClassifier*, *StringIndexer*...)
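
To make the distinction concrete, here is a minimal plain-Python sketch of the two roles. It does not use Spark: `ToyIndexer` and `ToyIndexerModel` are hypothetical names invented for this illustration, and rows are plain dictionaries rather than DataFrame rows.

```python
# Plain-Python illustration of the Transformer / Estimator contract.
# ToyIndexer and ToyIndexerModel are hypothetical names, not Spark classes.

class ToyIndexerModel:
    """Transformer: maps each row to a new row using a fixed mapping."""

    def __init__(self, mapping):
        self.mapping = mapping

    def transform(self, rows):
        # Return new rows; the input rows are left untouched
        return [{**row, "index": self.mapping[row["label"]]} for row in rows]


class ToyIndexer:
    """Estimator: learns the mapping from data and returns a Transformer."""

    def fit(self, rows):
        labels = sorted({row["label"] for row in rows})
        return ToyIndexerModel({label: i for i, label in enumerate(labels)})


train_rows = [{"label": "satisfied"}, {"label": "neutral or dissatisfied"}]
test_rows = [{"label": "satisfied"}]

model = ToyIndexer().fit(train_rows)   # Estimator -> Transformer
indexed = model.transform(test_rows)   # Transformer -> new rows
print(indexed)
```

The point of the split is that whatever is learned in `fit` (here, the label-to-index mapping) is learned once on the training rows and then reused unchanged on any other rows.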

The data concern the customer satisfaction of an airline company. Customers are classified as **satisfied** or **neutral or dissatisfied**.

# Part 0: Preliminary part

In this preliminary part, we simply import the packages needed and load the data. We choose to work separately on the original train and test datasets rather than concatenating them, for simplicity's sake when building the pipeline.

## 0.1. Libraries

In [0]:
# Standard library
import hashlib
import os
import sys
from functools import reduce
from operator import add

# PySpark SQL functions
import pyspark.sql.functions as F
from pyspark.sql import functions as fn
from pyspark.sql.functions import col, count, isnan, lit, when

# Pipeline and ML
from pyspark.ml import Pipeline
from pyspark.ml.feature import *
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.mllib.evaluation import BinaryClassificationMetrics
from pyspark.mllib.evaluation import MulticlassMetrics

assert sys.version_info.major == 3


def hash(x):
    # SHA-1 hex digest of the string form of x (shadows the built-in hash)
    return hashlib.sha1(str(x).encode('utf-8')).hexdigest()


## 0.2. Load the data

In [0]:
# Load train set

# Get data
train = sqlContext.read.format("csv").option("header", True).option(
    'sep', ',').option('inferSchema', True).load('/FileStore/tables/train.csv')
# Delete id column
train = train.drop("id")
# Rename column
train = train.withColumnRenamed(train.columns[0], 'id')
# Display DF
train.display()
# Print the number of lines
print(train.count())
# Display columns
train.printSchema()


id,Gender,Customer Type,Age,Type of Travel,Class,Flight Distance,Inflight wifi service,Departure/Arrival time convenient,Ease of Online booking,Gate location,Food and drink,Online boarding,Seat comfort,Inflight entertainment,On-board service,Leg room service,Baggage handling,Checkin service,Inflight service,Cleanliness,Departure Delay in Minutes,Arrival Delay in Minutes,satisfaction
0,Male,Loyal Customer,13,Personal Travel,Eco Plus,460,3,4,3,1,5,3,5,5,4,3,4,4,5,5,25,18.0,neutral or dissatisfied
1,Male,disloyal Customer,25,Business travel,Business,235,3,2,3,3,1,3,1,1,1,5,3,1,4,1,1,6.0,neutral or dissatisfied
2,Female,Loyal Customer,26,Business travel,Business,1142,2,2,2,2,5,5,5,5,4,3,4,4,4,5,0,0.0,satisfied
3,Female,Loyal Customer,25,Business travel,Business,562,2,5,5,5,2,2,2,2,2,5,3,1,4,2,11,9.0,neutral or dissatisfied
4,Male,Loyal Customer,61,Business travel,Business,214,3,3,3,3,4,5,5,3,3,4,4,3,3,3,0,0.0,satisfied
5,Female,Loyal Customer,26,Personal Travel,Eco,1180,3,4,2,1,1,2,1,1,3,4,4,4,4,1,0,0.0,neutral or dissatisfied
6,Male,Loyal Customer,47,Personal Travel,Eco,1276,2,4,2,3,2,2,2,2,3,3,4,3,5,2,9,23.0,neutral or dissatisfied
7,Female,Loyal Customer,52,Business travel,Business,2035,4,3,4,4,5,5,5,5,5,5,5,4,5,4,4,0.0,satisfied
8,Female,Loyal Customer,41,Business travel,Business,853,1,2,2,2,4,3,3,1,1,2,1,4,1,2,0,0.0,neutral or dissatisfied
9,Male,disloyal Customer,20,Business travel,Eco,1061,3,3,3,4,2,3,3,2,2,3,4,4,3,2,0,0.0,neutral or dissatisfied


103904
root
 |-- id: integer (nullable = true)
 |-- Gender: string (nullable = true)
 |-- Customer Type: string (nullable = true)
 |-- Age: integer (nullable = true)
 |-- Type of Travel: string (nullable = true)
 |-- Class: string (nullable = true)
 |-- Flight Distance: integer (nullable = true)
 |-- Inflight wifi service: integer (nullable = true)
 |-- Departure/Arrival time convenient: integer (nullable = true)
 |-- Ease of Online booking: integer (nullable = true)
 |-- Gate location: integer (nullable = true)
 |-- Food and drink: integer (nullable = true)
 |-- Online boarding: integer (nullable = true)
 |-- Seat comfort: integer (nullable = true)
 |-- Inflight entertainment: integer (nullable = true)
 |-- On-board service: integer (nullable = true)
 |-- Leg room service: integer (nullable = true)
 |-- Baggage handling: integer (nullable = true)
 |-- Checkin service: integer (nullable = true)
 |-- Inflight service: integer (nullable = true)
 |-- Cleanliness: integer (nullable = true)

In [0]:
# Load test set

# Get data
test = sqlContext.read.format("csv").option("header", True).option(
    'sep', ',').option('inferSchema', True).load('/FileStore/tables/test.csv')
# Delete id column
test = test.drop("id")
# Rename column
test = test.withColumnRenamed(test.columns[0], 'id')
# Display DF
test.display()
# Print the number of lines
print(test.count())
# Display columns
test.printSchema()


id,Gender,Customer Type,Age,Type of Travel,Class,Flight Distance,Inflight wifi service,Departure/Arrival time convenient,Ease of Online booking,Gate location,Food and drink,Online boarding,Seat comfort,Inflight entertainment,On-board service,Leg room service,Baggage handling,Checkin service,Inflight service,Cleanliness,Departure Delay in Minutes,Arrival Delay in Minutes,satisfaction
0,Female,Loyal Customer,52,Business travel,Eco,160,5,4,3,4,3,4,3,5,5,5,5,2,5,5,50,44.0,satisfied
1,Female,Loyal Customer,36,Business travel,Business,2863,1,1,3,1,5,4,5,4,4,4,4,3,4,5,0,0.0,satisfied
2,Male,disloyal Customer,20,Business travel,Eco,192,2,0,2,4,2,2,2,2,4,1,3,2,2,2,0,0.0,neutral or dissatisfied
3,Male,Loyal Customer,44,Business travel,Business,3377,0,0,0,2,3,4,4,1,1,1,1,3,1,4,0,6.0,satisfied
4,Female,Loyal Customer,49,Business travel,Eco,1182,2,3,4,3,4,1,2,2,2,2,2,4,2,4,0,20.0,satisfied
5,Male,Loyal Customer,16,Business travel,Eco,311,3,3,3,3,5,5,3,5,4,3,1,1,2,5,0,0.0,satisfied
6,Female,Loyal Customer,77,Business travel,Business,3987,5,5,5,5,3,5,5,5,5,5,5,4,5,3,0,0.0,satisfied
7,Female,Loyal Customer,43,Business travel,Business,2556,2,2,2,2,4,4,5,4,4,4,4,5,4,3,77,65.0,satisfied
8,Male,Loyal Customer,47,Business travel,Eco,556,5,2,2,2,5,5,5,5,2,2,5,3,3,5,1,0.0,satisfied
9,Female,Loyal Customer,46,Business travel,Business,1744,2,2,2,2,3,4,4,4,4,4,4,5,4,4,28,14.0,satisfied


25976
root
 |-- id: integer (nullable = true)
 |-- Gender: string (nullable = true)
 |-- Customer Type: string (nullable = true)
 |-- Age: integer (nullable = true)
 |-- Type of Travel: string (nullable = true)
 |-- Class: string (nullable = true)
 |-- Flight Distance: integer (nullable = true)
 |-- Inflight wifi service: integer (nullable = true)
 |-- Departure/Arrival time convenient: integer (nullable = true)
 |-- Ease of Online booking: integer (nullable = true)
 |-- Gate location: integer (nullable = true)
 |-- Food and drink: integer (nullable = true)
 |-- Online boarding: integer (nullable = true)
 |-- Seat comfort: integer (nullable = true)
 |-- Inflight entertainment: integer (nullable = true)
 |-- On-board service: integer (nullable = true)
 |-- Leg room service: integer (nullable = true)
 |-- Baggage handling: integer (nullable = true)
 |-- Checkin service: integer (nullable = true)
 |-- Inflight service: integer (nullable = true)
 |-- Cleanliness: integer (nullable = true)


# Part 1: Clean the data

Preliminary exploration with pandas and Spark DataFrames showed that some rows need to be deleted: those with missing values and those with non-applicable values (a kind of missing value).

## 1.1. Delete missing values

In [0]:
train.select([count(when(isnan(c) | col(c).isNull(), c)).alias(c)
             for c in train.columns]).display()


id,Gender,Customer Type,Age,Type of Travel,Class,Flight Distance,Inflight wifi service,Departure/Arrival time convenient,Ease of Online booking,Gate location,Food and drink,Online boarding,Seat comfort,Inflight entertainment,On-board service,Leg room service,Baggage handling,Checkin service,Inflight service,Cleanliness,Departure Delay in Minutes,Arrival Delay in Minutes,satisfaction
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,310,0


In [0]:
test.select([count(when(isnan(c) | col(c).isNull(), c)).alias(c)
            for c in test.columns]).display()


id,Gender,Customer Type,Age,Type of Travel,Class,Flight Distance,Inflight wifi service,Departure/Arrival time convenient,Ease of Online booking,Gate location,Food and drink,Online boarding,Seat comfort,Inflight entertainment,On-board service,Leg room service,Baggage handling,Checkin service,Inflight service,Cleanliness,Departure Delay in Minutes,Arrival Delay in Minutes,satisfaction
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,83,0


We observe that there are missing values in the column *Arrival Delay in Minutes* in both the train and test datasets (310 and 83 rows respectively). We drop these rows:

In [0]:
train = train.na.drop()

# Check number of lines
print(train.count())


103594


In [0]:
test = test.na.drop()

# Check number of lines
print(test.count())


25893


## 1.2. Delete non applicable values

Fourteen variables are satisfaction ratings that take values between 1 (worst) and 5 (best). In what follows, we call them **score variables**. Some of these variables are occasionally equal to 0, which means the rating is not applicable. Rows containing such values are deleted.

In [0]:
# Keep only the rows in which every score variable is different from 0
train = train.where(
    fn.col('Inflight wifi service') != 0).where(
    fn.col('Departure/Arrival time convenient') != 0).where(
    fn.col('Ease of Online booking') != 0).where(
    fn.col('Gate location') != 0).where(
    fn.col('Food and drink') != 0).where(
    fn.col('Online boarding') != 0).where(
    fn.col('Seat comfort') != 0).where(
    fn.col('Inflight entertainment') != 0).where(
    fn.col('On-board service') != 0).where(
    fn.col('Leg room service') != 0).where(
    fn.col('Baggage handling') != 0).where(
    fn.col('Checkin service') != 0).where(
    fn.col('Inflight service') != 0).where(
    fn.col('Cleanliness') != 0)

print(train.count())


95415


In [0]:
test = test.where(
    fn.col('Inflight wifi service') != 0).where(
    fn.col('Departure/Arrival time convenient') != 0).where(
    fn.col('Ease of Online booking') != 0).where(
    fn.col('Gate location') != 0).where(
    fn.col('Food and drink') != 0).where(
    fn.col('Online boarding') != 0).where(
    fn.col('Seat comfort') != 0).where(
    fn.col('Inflight entertainment') != 0).where(
    fn.col('On-board service') != 0).where(
    fn.col('Leg room service') != 0).where(
    fn.col('Baggage handling') != 0).where(
    fn.col('Checkin service') != 0).where(
    fn.col('Inflight service') != 0).where(
    fn.col('Cleanliness') != 0)

print(test.count())


23789


## 1.3. Converting to double

When trying some PySpark transformers, we noticed that Binarizer() did not work when the score variables were of type integer, so we convert all of them to double.

In [0]:
cols = ['Inflight wifi service',
        'Departure/Arrival time convenient',
        'Ease of Online booking',
        'Gate location',
        'Food and drink',
        'Online boarding',
        'Seat comfort',
        'Inflight entertainment',
        'On-board service',
        'Leg room service',
        'Baggage handling',
        'Checkin service',
        'Inflight service',
        'Cleanliness']

# Converting in the train dataset
for col_name in cols:
    train = train.withColumn(col_name, col(col_name).cast('double'))

# Checking the type
train.printSchema()

# Converting in the test dataset
for col_name in cols:
    test = test.withColumn(col_name, col(col_name).cast('double'))

# Checking the type
test.printSchema()


root
 |-- id: integer (nullable = true)
 |-- Gender: string (nullable = true)
 |-- Customer Type: string (nullable = true)
 |-- Age: integer (nullable = true)
 |-- Type of Travel: string (nullable = true)
 |-- Class: string (nullable = true)
 |-- Flight Distance: integer (nullable = true)
 |-- Inflight wifi service: double (nullable = true)
 |-- Departure/Arrival time convenient: double (nullable = true)
 |-- Ease of Online booking: double (nullable = true)
 |-- Gate location: double (nullable = true)
 |-- Food and drink: double (nullable = true)
 |-- Online boarding: double (nullable = true)
 |-- Seat comfort: double (nullable = true)
 |-- Inflight entertainment: double (nullable = true)
 |-- On-board service: double (nullable = true)
 |-- Leg room service: double (nullable = true)
 |-- Baggage handling: double (nullable = true)
 |-- Checkin service: double (nullable = true)
 |-- Inflight service: double (nullable = true)
 |-- Cleanliness: double (nullable = true)
 |-- Departure Delay

# Part 2: Pipelines

First, we define our transformers and estimators.

## 2.1. Transformers

For the most part, this step creates almost the same dummy variables as in the other notebooks:
- **Gender**: a dummy variable (*female*) is created, equal to 1 if the passenger is female and 0 otherwise.
- **Customer Type**: a dummy variable (*loyal customers*) is created, equal to 1 if the customer is loyal and 0 otherwise.
- **Type of Travel**: a dummy variable (*business travel*) is created, equal to 1 if the travel is for business and 0 if it is personal.
- **Class**: a dummy variable (*business class*) is created, equal to 1 if the passenger is in business class and 0 if the passenger is in eco or eco plus class.
- **Satisfaction**: this is the target variable (*satisfied*), equal to 1 if the customer is satisfied and 0 if the customer is neutral or dissatisfied.

The VectorAssembler steps then build the feature vectors for the different models.
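
The way an alphabetical ordering turns labels into indices can be checked by hand. The sketch below is a plain-Python emulation, not PySpark: `index_labels` is a hypothetical helper, and it assumes labels are compared character by character by code point, so that lowercase letters sort after uppercase ones. The resulting indices agree with the values shown in the pipeline outputs of this notebook (for instance *Eco* has class index 1.0 and *Business* has 2.0).

```python
# Plain-Python emulation of ranking labels in alphabetical order.
# index_labels is a hypothetical helper, not part of PySpark.

def index_labels(labels, descending):
    """Map each distinct label to its rank in (code point) alphabetical order."""
    ordered = sorted(set(labels), reverse=descending)
    return {label: float(rank) for rank, label in enumerate(ordered)}


gender = index_labels(["Male", "Female"], descending=True)
customer = index_labels(["Loyal Customer", "disloyal Customer"], descending=True)
travel = index_labels(["Personal Travel", "Business travel"], descending=True)
travel_class = index_labels(["Eco", "Eco Plus", "Business"], descending=True)
target = index_labels(["satisfied", "neutral or dissatisfied"], descending=False)

for name, mapping in [("Gender", gender), ("Customer Type", customer),
                      ("Type of Travel", travel), ("Class", travel_class),
                      ("satisfaction", target)]:
    print(name, mapping)
```

With a descending order, the alphabetically last label gets index 0 and serves as the reference level, which is why *Female*, *Loyal Customer* and *Business travel* all end up coded as 1. For the target, an ascending order is needed so that *satisfied* is coded as 1.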

In [0]:
# We use StringIndexer to convert the categorical variables
# to dummy variables. We use two different indexers because we choose
# our reference level using descending alphabetical order for
# most variables, but this does not work for "satisfaction",
# for which we need ascending alphabetical order.

si = StringIndexer(inputCols=["Gender",
                              "Customer Type",
                              "Type of Travel",
                              "Class"],
                   outputCols=["female",
                               "Loyal customers",
                               "Business Travel",
                               "Class index"],
                   stringOrderType="alphabetDesc")

si_rest = StringIndexer(inputCol="satisfaction", outputCol="satisfied",
                        stringOrderType="alphabetAsc")

# Because the variable Class index has 3 values (0 for Eco Plus, 1 for Eco
# and 2 for Business, since labels are ranked in descending alphabetical
# order), we binarize it: the variable 'Business class' will be
# equal to 1 when Class index > 1

bi_class = Binarizer(threshold=1.0,
                     inputCol="Class index", outputCol="Business class")

# We also binarize all scores by creating "good"
# variables, equal to 1 when the initial score > 3
bi_scores = Binarizer(inputCols=['Inflight wifi service',
                                 'Departure/Arrival time convenient',
                                 'Ease of Online booking',
                                 'Gate location',
                                 'Food and drink',
                                 'Online boarding',
                                 'Seat comfort',
                                 'Inflight entertainment',
                                 'On-board service',
                                 'Leg room service',
                                 'Baggage handling',
                                 'Checkin service',
                                 'Inflight service',
                                 'Cleanliness'],
                      outputCols=['Inflight wifi service good',
                                  'Departure/Arrival time convenient good',
                                  'Ease of Online booking good',
                                  'Gate location good',
                                  'Food and drink good',
                                  'Online boarding good',
                                  'Seat comfort good',
                                  'Inflight entertainment good',
                                  'On-board service good',
                                  'Leg room service good',
                                  'Baggage handling good',
                                  'Checkin service good',
                                  'Inflight service good',
                                  'Cleanliness good'], threshold=3.)

# We build two VectorAssemblers because we will build two different models
va_1 = VectorAssembler(inputCols=['Business class',
                                  'female',
                                  'Loyal customers',
                                  'Business Travel',
                                  'Age',
                                  'Flight Distance',
                                  'Inflight wifi service',
                                  'Departure/Arrival time convenient',
                                  'Ease of Online booking',
                                  'Gate location',
                                  'Food and drink',
                                  'Online boarding',
                                  'Seat comfort',
                                  'Inflight entertainment',
                                  'On-board service',
                                  'Leg room service',
                                  'Baggage handling',
                                  'Checkin service',
                                  'Inflight service',
                                  'Cleanliness',
                                  'Departure Delay in Minutes',
                                  'Arrival Delay in Minutes'],
                       outputCol="features")
va_2 = VectorAssembler(inputCols=['Business class',
                                  'female',
                                  'Loyal customers',
                                  'Business Travel',
                                  'Age',
                                  'Flight Distance',
                                  'Inflight wifi service good',
                                  'Departure/Arrival time convenient good',
                                  'Ease of Online booking good',
                                  'Gate location good',
                                  'Food and drink good',
                                  'Online boarding good',
                                  'Seat comfort good',
                                  'Inflight entertainment good',
                                  'On-board service good',
                                  'Leg room service good',
                                  'Baggage handling good',
                                  'Checkin service good',
                                  'Inflight service good',
                                  'Cleanliness good',
                                  'Departure Delay in Minutes',
                                  'Arrival Delay in Minutes'],
                       outputCol="features")

## 2.2. Estimators

We selected a decision tree for the estimation step because it was the model that worked best in the MLlib part.

In [0]:
dt = DecisionTreeClassifier(labelCol="satisfied", featuresCol="features",
                            impurity='gini', maxDepth=5, maxBins=32)


## 2.3. First pipeline: 1 to 5 scores

For the first pipeline, we build a **decision tree** keeping all scores as ordinal variables that go **from 1 to 5**, as in the original data.

In [0]:
# We define our first pipeline
first_pipeline = Pipeline(stages=[si, si_rest, bi_class, va_1, dt])


In [0]:
# We fit the pipeline on the train set and apply the fitted model
# to the test set; first_model holds the resulting predictions
first_model = first_pipeline.fit(train).transform(test)


In [0]:
# In the following dataframe we get the result of the pipeline, with, 
# for the most interesting part, the predicted values of 
# the variable "satisfied" 
first_model.display()


id,Gender,Customer Type,Age,Type of Travel,Class,Flight Distance,Inflight wifi service,Departure/Arrival time convenient,Ease of Online booking,Gate location,Food and drink,Online boarding,Seat comfort,Inflight entertainment,On-board service,Leg room service,Baggage handling,Checkin service,Inflight service,Cleanliness,Departure Delay in Minutes,Arrival Delay in Minutes,satisfaction,female,Loyal customers,Business Travel,Class index,satisfied,Business class,features,rawPrediction,probability,prediction
0,Female,Loyal Customer,52,Business travel,Eco,160,5.0,4.0,3.0,4.0,3.0,4.0,3.0,5.0,5.0,5.0,5.0,2.0,5.0,5.0,50,44.0,satisfied,1.0,1.0,1.0,1.0,1.0,0.0,"Map(vectorType -> dense, length -> 22, values -> List(0.0, 1.0, 1.0, 1.0, 52.0, 160.0, 5.0, 4.0, 3.0, 4.0, 3.0, 4.0, 3.0, 5.0, 5.0, 5.0, 5.0, 2.0, 5.0, 5.0, 50.0, 44.0))","Map(vectorType -> dense, length -> 2, values -> List(1349.0, 26613.0))","Map(vectorType -> dense, length -> 2, values -> List(0.04824404549030827, 0.9517559545096917))",1.0
1,Female,Loyal Customer,36,Business travel,Business,2863,1.0,1.0,3.0,1.0,5.0,4.0,5.0,4.0,4.0,4.0,4.0,3.0,4.0,5.0,0,0.0,satisfied,1.0,1.0,1.0,2.0,1.0,1.0,"Map(vectorType -> dense, length -> 22, values -> List(1.0, 1.0, 1.0, 1.0, 36.0, 2863.0, 1.0, 1.0, 3.0, 1.0, 5.0, 4.0, 5.0, 4.0, 4.0, 4.0, 4.0, 3.0, 4.0, 5.0, 0.0, 0.0))","Map(vectorType -> dense, length -> 2, values -> List(1349.0, 26613.0))","Map(vectorType -> dense, length -> 2, values -> List(0.04824404549030827, 0.9517559545096917))",1.0
4,Female,Loyal Customer,49,Business travel,Eco,1182,2.0,3.0,4.0,3.0,4.0,1.0,2.0,2.0,2.0,2.0,2.0,4.0,2.0,4.0,0,20.0,satisfied,1.0,1.0,1.0,1.0,1.0,0.0,"Map(vectorType -> dense, length -> 22, values -> List(0.0, 1.0, 1.0, 1.0, 49.0, 1182.0, 2.0, 3.0, 4.0, 3.0, 4.0, 1.0, 2.0, 2.0, 2.0, 2.0, 2.0, 4.0, 2.0, 4.0, 0.0, 20.0))","Map(vectorType -> dense, length -> 2, values -> List(2146.0, 172.0))","Map(vectorType -> dense, length -> 2, values -> List(0.9257981018119068, 0.07420189818809318))",0.0
5,Male,Loyal Customer,16,Business travel,Eco,311,3.0,3.0,3.0,3.0,5.0,5.0,3.0,5.0,4.0,3.0,1.0,1.0,2.0,5.0,0,0.0,satisfied,0.0,1.0,1.0,1.0,1.0,0.0,"Map(vectorType -> dense, length -> 22, values -> List(0.0, 0.0, 1.0, 1.0, 16.0, 311.0, 3.0, 3.0, 3.0, 3.0, 5.0, 5.0, 3.0, 5.0, 4.0, 3.0, 1.0, 1.0, 2.0, 5.0, 0.0, 0.0))","Map(vectorType -> dense, length -> 2, values -> List(1349.0, 26613.0))","Map(vectorType -> dense, length -> 2, values -> List(0.04824404549030827, 0.9517559545096917))",1.0
6,Female,Loyal Customer,77,Business travel,Business,3987,5.0,5.0,5.0,5.0,3.0,5.0,5.0,5.0,5.0,5.0,5.0,4.0,5.0,3.0,0,0.0,satisfied,1.0,1.0,1.0,2.0,1.0,1.0,"Map(vectorType -> dense, length -> 22, values -> List(1.0, 1.0, 1.0, 1.0, 77.0, 3987.0, 5.0, 5.0, 5.0, 5.0, 3.0, 5.0, 5.0, 5.0, 5.0, 5.0, 5.0, 4.0, 5.0, 3.0, 0.0, 0.0))","Map(vectorType -> dense, length -> 2, values -> List(1349.0, 26613.0))","Map(vectorType -> dense, length -> 2, values -> List(0.04824404549030827, 0.9517559545096917))",1.0
7,Female,Loyal Customer,43,Business travel,Business,2556,2.0,2.0,2.0,2.0,4.0,4.0,5.0,4.0,4.0,4.0,4.0,5.0,4.0,3.0,77,65.0,satisfied,1.0,1.0,1.0,2.0,1.0,1.0,"Map(vectorType -> dense, length -> 22, values -> List(1.0, 1.0, 1.0, 1.0, 43.0, 2556.0, 2.0, 2.0, 2.0, 2.0, 4.0, 4.0, 5.0, 4.0, 4.0, 4.0, 4.0, 5.0, 4.0, 3.0, 77.0, 65.0))","Map(vectorType -> dense, length -> 2, values -> List(1349.0, 26613.0))","Map(vectorType -> dense, length -> 2, values -> List(0.04824404549030827, 0.9517559545096917))",1.0
8,Male,Loyal Customer,47,Business travel,Eco,556,5.0,2.0,2.0,2.0,5.0,5.0,5.0,5.0,2.0,2.0,5.0,3.0,3.0,5.0,1,0.0,satisfied,0.0,1.0,1.0,1.0,1.0,0.0,"Map(vectorType -> dense, length -> 22, values -> List(0.0, 0.0, 1.0, 1.0, 47.0, 556.0, 5.0, 2.0, 2.0, 2.0, 5.0, 5.0, 5.0, 5.0, 2.0, 2.0, 5.0, 3.0, 3.0, 5.0, 1.0, 0.0))","Map(vectorType -> dense, length -> 2, values -> List(1349.0, 26613.0))","Map(vectorType -> dense, length -> 2, values -> List(0.04824404549030827, 0.9517559545096917))",1.0
9,Female,Loyal Customer,46,Business travel,Business,1744,2.0,2.0,2.0,2.0,3.0,4.0,4.0,4.0,4.0,4.0,4.0,5.0,4.0,4.0,28,14.0,satisfied,1.0,1.0,1.0,2.0,1.0,1.0,"Map(vectorType -> dense, length -> 22, values -> List(1.0, 1.0, 1.0, 1.0, 46.0, 1744.0, 2.0, 2.0, 2.0, 2.0, 3.0, 4.0, 4.0, 4.0, 4.0, 4.0, 4.0, 5.0, 4.0, 4.0, 28.0, 14.0))","Map(vectorType -> dense, length -> 2, values -> List(1349.0, 26613.0))","Map(vectorType -> dense, length -> 2, values -> List(0.04824404549030827, 0.9517559545096917))",1.0
10,Female,Loyal Customer,47,Business travel,Eco,1235,4.0,1.0,1.0,1.0,5.0,1.0,5.0,3.0,3.0,4.0,3.0,1.0,3.0,4.0,29,19.0,satisfied,1.0,1.0,1.0,1.0,1.0,0.0,"Map(vectorType -> dense, length -> 22, values -> List(0.0, 1.0, 1.0, 1.0, 47.0, 1235.0, 4.0, 1.0, 1.0, 1.0, 5.0, 1.0, 5.0, 3.0, 3.0, 4.0, 3.0, 1.0, 3.0, 4.0, 29.0, 19.0))","Map(vectorType -> dense, length -> 2, values -> List(953.0, 451.0))","Map(vectorType -> dense, length -> 2, values -> List(0.6787749287749287, 0.3212250712250712))",0.0
11,Female,Loyal Customer,33,Business travel,Business,325,2.0,5.0,5.0,5.0,1.0,3.0,4.0,2.0,2.0,2.0,2.0,3.0,2.0,4.0,18,7.0,neutral or dissatisfied,1.0,1.0,1.0,2.0,0.0,1.0,"Map(vectorType -> dense, length -> 22, values -> List(1.0, 1.0, 1.0, 1.0, 33.0, 325.0, 2.0, 5.0, 5.0, 5.0, 1.0, 3.0, 4.0, 2.0, 2.0, 2.0, 2.0, 3.0, 2.0, 4.0, 18.0, 7.0))","Map(vectorType -> dense, length -> 2, values -> List(8062.0, 609.0))","Map(vectorType -> dense, length -> 2, values -> List(0.9297658862876255, 0.07023411371237458))",0.0
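
In the output above, each `probability` vector appears to be the `rawPrediction` vector divided by its sum. This is what one would expect if `rawPrediction` holds the per-class counts of training rows in the leaf reached by the row, which is an assumption here since the notebook does not state it. The plain-Python sketch below recomputes the probabilities from three `rawPrediction` vectors copied from the table, and also computes the Gini impurity of each leaf. `normalize` and `gini` are hypothetical helpers written for this illustration.

```python
# Values copied from the rawPrediction column of the output above.
# normalize and gini are hypothetical helpers written for this illustration.

def normalize(counts):
    """Turn per-class counts into class probabilities."""
    total = sum(counts)
    return [c / total for c in counts]


def gini(counts):
    """Gini impurity of a node: 1 minus the sum of squared class shares."""
    return 1.0 - sum(p * p for p in normalize(counts))


leaves = {
    "mostly satisfied": [1349.0, 26613.0],
    "mostly dissatisfied": [2146.0, 172.0],
    "mixed": [953.0, 451.0],
}

for name, counts in leaves.items():
    probs = normalize(counts)
    predicted = float(probs.index(max(probs)))  # class with the largest share
    print(name, probs, predicted, round(gini(counts), 4))
```

The predicted class is simply the one with the largest share in the leaf, so a row falling in the "mixed" leaf is predicted as 0.0 even though almost a third of the training rows in that leaf were satisfied.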


In [0]:
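# Select the label ("satisfied") and the prediction, in that order
predictionAndTarget = first_model.select("satisfied", "prediction")

# Create two evaluators to have all the metrics.
# Note: both mllib classes are documented as taking (prediction, label)
# pairs; with (label, prediction) pairs as here, accuracy is unaffected
# but per-class precision and recall are exchanged.
metrics_binary = BinaryClassificationMetrics(
                 predictionAndTarget.rdd.map(tuple))
metrics_multi = MulticlassMetrics(predictionAndTarget.rdd.map(tuple))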

accuracy = metrics_multi.accuracy
precision1 = metrics_multi.precision(1.0)
recall1 = metrics_multi.recall(1.0)
precision0 = metrics_multi.precision(0.0)
recall0 = metrics_multi.recall(0.0)
auc = metrics_binary.areaUnderROC

print("Accuracy = %s" % accuracy)
print("Error = %s" % (1.0 - accuracy))
print("Precision satisfied = %s" % precision1)
print("Recall satisfied = %s" % recall1)
print("Precision neutral or dissatisfied = %s" % precision0)
print("Recall neutral or dissatisfied = %s" % recall0)
print("Area under the curve = %s" % auc)

Accuracy = 0.9161377107066291
Error = 0.08386228929337092
Precision satisfied = 0.9162758890191481
Recall satisfied = 0.891794237900542
Precision neutral or dissatisfied = 0.9160333505496938
Recall neutral or dissatisfied = 0.9354279686558168
Area under the curve = 0.9136111032781793


Using this model, we get a **high accuracy of about 92%, i.e. an error rate of around 8%**, which is similar to what we got in the MLlib part.
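
The per-class precision and recall figures depend on which element of each pair is treated as the prediction. The `pyspark.mllib` documentation describes `MulticlassMetrics` as taking (prediction, label) pairs, whereas the evaluation cell selects `satisfied` (the label) first. The plain-Python sketch below uses made-up pairs and a hypothetical helper, `class_metrics`, to show how the three quantities are defined and what happens when the pair order is reversed.

```python
# Plain-Python sketch with made-up pairs; class_metrics is a hypothetical helper.

def class_metrics(pairs, positive=1.0):
    """Accuracy, precision and recall from (prediction, label) pairs."""
    tp = sum(1 for p, y in pairs if p == positive and y == positive)
    fp = sum(1 for p, y in pairs if p == positive and y != positive)
    fn = sum(1 for p, y in pairs if p != positive and y == positive)
    correct = sum(1 for p, y in pairs if p == y)
    return {
        "accuracy": correct / len(pairs),
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
    }


# (prediction, label) pairs, invented for the illustration
pairs = [(1.0, 1.0), (1.0, 1.0), (1.0, 0.0), (0.0, 1.0), (0.0, 1.0), (0.0, 0.0)]
swapped = [(y, p) for p, y in pairs]

print(class_metrics(pairs))
print(class_metrics(swapped))
```

Reversing the pairs leaves accuracy unchanged but exchanges precision and recall, so the accuracy-based comparison in this notebook does not depend on the pair order, while the precision and recall lines should be read with that order in mind.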

## 2.4. Second pipeline: Binarized scores

In this second pipeline, we keep the decision tree model but replace the 1-to-5 scores with **binarized scores**: each score becomes a dummy variable equal to 1 (a "good" score) when the score is > 3.
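
As a quick check of the thresholding rule, here is a plain-Python emulation. `binarize` is a hypothetical helper, not the PySpark Binarizer itself. It applies the strict comparison described above, under which a value equal to the threshold maps to 0.

```python
# Plain-Python emulation of thresholding a score; binarize is a hypothetical helper.

def binarize(value, threshold):
    """Return 1.0 when value is strictly greater than threshold, else 0.0."""
    return 1.0 if value > threshold else 0.0


scores = [1.0, 2.0, 3.0, 4.0, 5.0]
good = [binarize(s, threshold=3.0) for s in scores]
print(good)  # [0.0, 0.0, 0.0, 1.0, 1.0]

# Same rule applied to the class index with threshold 1.0
business = [binarize(i, threshold=1.0) for i in (0.0, 1.0, 2.0)]
print(business)  # [0.0, 0.0, 1.0]
```

A score of exactly 3 is therefore not counted as "good", and only a class index of 2 yields a business class dummy equal to 1.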

In [0]:
second_pipeline = Pipeline(stages=[si, si_rest, bi_class, bi_scores, va_2, dt])


In [0]:
second_model = second_pipeline.fit(train).transform(test)


In [0]:
second_model.display()


id,Gender,Customer Type,Age,Type of Travel,Class,Flight Distance,Inflight wifi service,Departure/Arrival time convenient,Ease of Online booking,Gate location,Food and drink,Online boarding,Seat comfort,Inflight entertainment,On-board service,Leg room service,Baggage handling,Checkin service,Inflight service,Cleanliness,Departure Delay in Minutes,Arrival Delay in Minutes,satisfaction,female,Loyal customers,Business Travel,Class index,satisfied,Business class,Inflight wifi service good,Departure/Arrival time convenient good,Ease of Online booking good,Gate location good,Food and drink good,Online boarding good,Seat comfort good,Inflight entertainment good,On-board service good,Leg room service good,Baggage handling good,Checkin service good,Inflight service good,Cleanliness good,features,rawPrediction,probability,prediction
0,Female,Loyal Customer,52,Business travel,Eco,160,5.0,4.0,3.0,4.0,3.0,4.0,3.0,5.0,5.0,5.0,5.0,2.0,5.0,5.0,50,44.0,satisfied,1.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,"Map(vectorType -> dense, length -> 22, values -> List(0.0, 1.0, 1.0, 1.0, 52.0, 160.0, 1.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 1.0, 1.0, 1.0, 0.0, 1.0, 1.0, 50.0, 44.0))","Map(vectorType -> dense, length -> 2, values -> List(1349.0, 26613.0))","Map(vectorType -> dense, length -> 2, values -> List(0.04824404549030827, 0.9517559545096917))",1.0
1,Female,Loyal Customer,36,Business travel,Business,2863,1.0,1.0,3.0,1.0,5.0,4.0,5.0,4.0,4.0,4.0,4.0,3.0,4.0,5.0,0,0.0,satisfied,1.0,1.0,1.0,2.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,"Map(vectorType -> dense, length -> 22, values -> List(1.0, 1.0, 1.0, 1.0, 36.0, 2863.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 1.0, 1.0, 0.0, 0.0))","Map(vectorType -> dense, length -> 2, values -> List(1349.0, 26613.0))","Map(vectorType -> dense, length -> 2, values -> List(0.04824404549030827, 0.9517559545096917))",1.0
4,Female,Loyal Customer,49,Business travel,Eco,1182,2.0,3.0,4.0,3.0,4.0,1.0,2.0,2.0,2.0,2.0,2.0,4.0,2.0,4.0,0,20.0,satisfied,1.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,"Map(vectorType -> sparse, length -> 22, indices -> List(1, 2, 3, 4, 5, 8, 10, 17, 19, 21), values -> List(1.0, 1.0, 1.0, 49.0, 1182.0, 1.0, 1.0, 1.0, 1.0, 20.0))","Map(vectorType -> dense, length -> 2, values -> List(29719.0, 546.0))","Map(vectorType -> dense, length -> 2, values -> List(0.9819593589955394, 0.018040641004460598))",0.0
5,Male,Loyal Customer,16,Business travel,Eco,311,3.0,3.0,3.0,3.0,5.0,5.0,3.0,5.0,4.0,3.0,1.0,1.0,2.0,5.0,0,0.0,satisfied,0.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,"Map(vectorType -> sparse, length -> 22, indices -> List(2, 3, 4, 5, 10, 11, 13, 14, 19), values -> List(1.0, 1.0, 16.0, 311.0, 1.0, 1.0, 1.0, 1.0, 1.0))","Map(vectorType -> dense, length -> 2, values -> List(1349.0, 26613.0))","Map(vectorType -> dense, length -> 2, values -> List(0.04824404549030827, 0.9517559545096917))",1.0
6,Female,Loyal Customer,77,Business travel,Business,3987,5.0,5.0,5.0,5.0,3.0,5.0,5.0,5.0,5.0,5.0,5.0,4.0,5.0,3.0,0,0.0,satisfied,1.0,1.0,1.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,"Map(vectorType -> dense, length -> 22, values -> List(1.0, 1.0, 1.0, 1.0, 77.0, 3987.0, 1.0, 1.0, 1.0, 1.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0))","Map(vectorType -> dense, length -> 2, values -> List(1349.0, 26613.0))","Map(vectorType -> dense, length -> 2, values -> List(0.04824404549030827, 0.9517559545096917))",1.0
7,Female,Loyal Customer,43,Business travel,Business,2556,2.0,2.0,2.0,2.0,4.0,4.0,5.0,4.0,4.0,4.0,4.0,5.0,4.0,3.0,77,65.0,satisfied,1.0,1.0,1.0,2.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,"Map(vectorType -> dense, length -> 22, values -> List(1.0, 1.0, 1.0, 1.0, 43.0, 2556.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 77.0, 65.0))","Map(vectorType -> dense, length -> 2, values -> List(1349.0, 26613.0))","Map(vectorType -> dense, length -> 2, values -> List(0.04824404549030827, 0.9517559545096917))",1.0
8,Male,Loyal Customer,47,Business travel,Eco,556,5.0,2.0,2.0,2.0,5.0,5.0,5.0,5.0,2.0,2.0,5.0,3.0,3.0,5.0,1,0.0,satisfied,0.0,1.0,1.0,1.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,"Map(vectorType -> sparse, length -> 22, indices -> List(2, 3, 4, 5, 6, 10, 11, 12, 13, 16, 19, 20), values -> List(1.0, 1.0, 47.0, 556.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0))","Map(vectorType -> dense, length -> 2, values -> List(1349.0, 26613.0))","Map(vectorType -> dense, length -> 2, values -> List(0.04824404549030827, 0.9517559545096917))",1.0
9,Female,Loyal Customer,46,Business travel,Business,1744,2.0,2.0,2.0,2.0,3.0,4.0,4.0,4.0,4.0,4.0,4.0,5.0,4.0,4.0,28,14.0,satisfied,1.0,1.0,1.0,2.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,"Map(vectorType -> dense, length -> 22, values -> List(1.0, 1.0, 1.0, 1.0, 46.0, 1744.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 28.0, 14.0))","Map(vectorType -> dense, length -> 2, values -> List(1349.0, 26613.0))","Map(vectorType -> dense, length -> 2, values -> List(0.04824404549030827, 0.9517559545096917))",1.0
10,Female,Loyal Customer,47,Business travel,Eco,1235,4.0,1.0,1.0,1.0,5.0,1.0,5.0,3.0,3.0,4.0,3.0,1.0,3.0,4.0,29,19.0,satisfied,1.0,1.0,1.0,1.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,"Map(vectorType -> sparse, length -> 22, indices -> List(1, 2, 3, 4, 5, 6, 10, 12, 15, 19, 20, 21), values -> List(1.0, 1.0, 1.0, 47.0, 1235.0, 1.0, 1.0, 1.0, 1.0, 1.0, 29.0, 19.0))","Map(vectorType -> dense, length -> 2, values -> List(192.0, 519.0))","Map(vectorType -> dense, length -> 2, values -> List(0.270042194092827, 0.729957805907173))",1.0
11,Female,Loyal Customer,33,Business travel,Business,325,2.0,5.0,5.0,5.0,1.0,3.0,4.0,2.0,2.0,2.0,2.0,3.0,2.0,4.0,18,7.0,neutral or dissatisfied,1.0,1.0,1.0,2.0,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,"Map(vectorType -> sparse, length -> 22, indices -> List(0, 1, 2, 3, 4, 5, 7, 8, 9, 12, 19, 20, 21), values -> List(1.0, 1.0, 1.0, 1.0, 33.0, 325.0, 1.0, 1.0, 1.0, 1.0, 1.0, 18.0, 7.0))","Map(vectorType -> dense, length -> 2, values -> List(8087.0, 659.0))","Map(vectorType -> dense, length -> 2, values -> List(0.9246512691516122, 0.07534873084838783))",0.0


In [0]:
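# Note: the mllib metrics classes are documented as taking
# (prediction, label) pairs; with this (label, prediction) order,
# per-class precision and recall are exchanged (accuracy is unaffected)
predictionAndTarget = second_model.select("satisfied", "prediction")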

metrics_binary = BinaryClassificationMetrics(
                 predictionAndTarget.rdd.map(tuple))
metrics_multi = MulticlassMetrics(predictionAndTarget.rdd.map(tuple))

accuracy = metrics_multi.accuracy
precision1 = metrics_multi.precision(1.0)
recall1 = metrics_multi.recall(1.0)
precision0 = metrics_multi.precision(0.0)
recall0 = metrics_multi.recall(0.0)
auc = metrics_binary.areaUnderROC

print("Accuracy = %s" % accuracy)
print("Error = %s" % (1.0 - accuracy))
print("Precision satisfied = %s" % precision1)
print("Recall satisfied = %s" % recall1)
print("Precision neutral or dissatisfied = %s" % precision0)
print("Recall neutral or dissatisfied = %s" % recall0)
print("Area under the curve = %s" % auc)


Accuracy = 0.9015511370801631
Error = 0.0984488629198369
Precision satisfied = 0.8816920672137554
Recall satisfied = 0.8886372587632926
Precision neutral or dissatisfied = 0.9165498413635358
Recall neutral or dissatisfied = 0.9111714222841635
Area under the curve = 0.899904340523728


With this second pipeline, which uses binarized scores, we get a **slightly lower accuracy of 90% and thus a higher error rate of 10%. The first model should therefore be preferred.**