# Project Big Data
## Hélène Lechêne, Marie Philippe, Claire Serraz & Romane Soler
## M2 D3S

The aim of this notebook is to perform a classification with Spark's MLlib library. The data describe the satisfaction of an airline's customers, who will be classified as either satisfied or dissatisfied/neutral. 
First, the data are cleaned according to what was learned during the preliminary exploration. A few new features are then added, and finally the customers are classified.

# Part 0: Preliminary part

In this preliminary part, the required packages are imported, then the two data files are loaded and concatenated.

## 0.1. Libraries

In [0]:
# Standard library
import hashlib
import sys

# This notebook requires Python 3
assert sys.version_info.major == 3


def hash(x):
    # SHA-1 hex digest of the string form of x (note: shadows the built-in hash)
    return hashlib.sha1(str(x).encode('utf-8')).hexdigest()


# Pyspark libraries (both aliases fn and F are kept)
from pyspark.sql.functions import isnan, when, count, col, lit
from pyspark.sql import functions as fn
import pyspark.sql.functions as F
from pyspark.ml.feature import StringIndexer

# Other
import os
from functools import reduce
from operator import add

# MLlib
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.evaluation import MulticlassMetrics
from pyspark.mllib.tree import DecisionTree, DecisionTreeModel
from pyspark.mllib.tree import RandomForest, RandomForestModel
from pyspark.mllib.tree import GradientBoostedTrees, GradientBoostedTreesModel

# ML
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.stat import Correlation
from pyspark.ml.classification import DecisionTreeClassifier, RandomForestClassifier, GBTClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator


## 0.2. Load the data

In [0]:
# Path of the test and train data
path_train = '/FileStore/tables/train.csv'
path_test = '/FileStore/tables/test.csv'


In [0]:
# Load train set

# Get data
train = sqlContext.read.format("csv").option("header", True).option(
    'sep', ',').option('inferSchema', True).load(path_train)
# Delete id column
train = train.drop("id")
# Rename column
train = train.withColumnRenamed(train.columns[0], 'id')
# Print the number of lines
print(train.count())
# Display columns
train.printSchema()


103904
root
 |-- id: integer (nullable = true)
 |-- Gender: string (nullable = true)
 |-- Customer Type: string (nullable = true)
 |-- Age: integer (nullable = true)
 |-- Type of Travel: string (nullable = true)
 |-- Class: string (nullable = true)
 |-- Flight Distance: integer (nullable = true)
 |-- Inflight wifi service: integer (nullable = true)
 |-- Departure/Arrival time convenient: integer (nullable = true)
 |-- Ease of Online booking: integer (nullable = true)
 |-- Gate location: integer (nullable = true)
 |-- Food and drink: integer (nullable = true)
 |-- Online boarding: integer (nullable = true)
 |-- Seat comfort: integer (nullable = true)
 |-- Inflight entertainment: integer (nullable = true)
 |-- On-board service: integer (nullable = true)
 |-- Leg room service: integer (nullable = true)
 |-- Baggage handling: integer (nullable = true)
 |-- Checkin service: integer (nullable = true)
 |-- Inflight service: integer (nullable = true)
 |-- Cleanliness: integer (nullable = true)

In [0]:
# Load test set

# Get data
test = sqlContext.read.format("csv").option("header", True).option(
    'sep', ',').option('inferSchema', True).load(path_test)
# Delete id column
test = test.drop("id")
# Rename column
test = test.withColumnRenamed(test.columns[0], 'id')
# Print the number of lines
print(test.count())
# Display columns
test.printSchema()


25976
root
 |-- id: integer (nullable = true)
 |-- Gender: string (nullable = true)
 |-- Customer Type: string (nullable = true)
 |-- Age: integer (nullable = true)
 |-- Type of Travel: string (nullable = true)
 |-- Class: string (nullable = true)
 |-- Flight Distance: integer (nullable = true)
 |-- Inflight wifi service: integer (nullable = true)
 |-- Departure/Arrival time convenient: integer (nullable = true)
 |-- Ease of Online booking: integer (nullable = true)
 |-- Gate location: integer (nullable = true)
 |-- Food and drink: integer (nullable = true)
 |-- Online boarding: integer (nullable = true)
 |-- Seat comfort: integer (nullable = true)
 |-- Inflight entertainment: integer (nullable = true)
 |-- On-board service: integer (nullable = true)
 |-- Leg room service: integer (nullable = true)
 |-- Baggage handling: integer (nullable = true)
 |-- Checkin service: integer (nullable = true)
 |-- Inflight service: integer (nullable = true)
 |-- Cleanliness: integer (nullable = true)


As expected, both dataframes have the same columns. 

Now, let us concatenate them.

In [0]:
# Concatenate train and test

# Union of the above dataframes
df = test.union(train)

# Print the number of lines
print(df.count())
# Display columns
df.printSchema()


129880
root
 |-- id: integer (nullable = true)
 |-- Gender: string (nullable = true)
 |-- Customer Type: string (nullable = true)
 |-- Age: integer (nullable = true)
 |-- Type of Travel: string (nullable = true)
 |-- Class: string (nullable = true)
 |-- Flight Distance: integer (nullable = true)
 |-- Inflight wifi service: integer (nullable = true)
 |-- Departure/Arrival time convenient: integer (nullable = true)
 |-- Ease of Online booking: integer (nullable = true)
 |-- Gate location: integer (nullable = true)
 |-- Food and drink: integer (nullable = true)
 |-- Online boarding: integer (nullable = true)
 |-- Seat comfort: integer (nullable = true)
 |-- Inflight entertainment: integer (nullable = true)
 |-- On-board service: integer (nullable = true)
 |-- Leg room service: integer (nullable = true)
 |-- Baggage handling: integer (nullable = true)
 |-- Checkin service: integer (nullable = true)
 |-- Inflight service: integer (nullable = true)
 |-- Cleanliness: integer (nullable = true)

In [0]:
# See 5 first rows (vertically)
df.show(5, vertical=True)


-RECORD 0-------------------------------------------------
 id                                | 0                    
 Gender                            | Female               
 Customer Type                     | Loyal Customer       
 Age                               | 52                   
 Type of Travel                    | Business travel      
 Class                             | Eco                  
 Flight Distance                   | 160                  
 Inflight wifi service             | 5                    
 Departure/Arrival time convenient | 4                    
 Ease of Online booking            | 3                    
 Gate location                     | 4                    
 Food and drink                    | 3                    
 Online boarding                   | 4                    
 Seat comfort                      | 3                    
 Inflight entertainment            | 5                    
 On-board service                  | 5                  

# Part 1: Clean the data

The preliminary explorations with pandas and Spark dataframes showed that some rows need to be deleted: those with missing values and those with non-applicable values (a kind of missing value).

## 1.1. Delete missing values

The preliminary exploration showed that the variable "Arrival Delay in Minutes" has missing values.

In [0]:
# Work under a new name (Spark dataframes are immutable, so this is a reference, not a copy)
df_clean = df


In [0]:
# Check missing values
df_clean.select([count(when(isnan(c) | col(c).isNull(),
                            c)).alias(c) for c in df_clean.columns]).show(
                                vertical=True)


-RECORD 0--------------------------------
 id                                | 0   
 Gender                            | 0   
 Customer Type                     | 0   
 Age                               | 0   
 Type of Travel                    | 0   
 Class                             | 0   
 Flight Distance                   | 0   
 Inflight wifi service             | 0   
 Departure/Arrival time convenient | 0   
 Ease of Online booking            | 0   
 Gate location                     | 0   
 Food and drink                    | 0   
 Online boarding                   | 0   
 Seat comfort                      | 0   
 Inflight entertainment            | 0   
 On-board service                  | 0   
 Leg room service                  | 0   
 Baggage handling                  | 0   
 Checkin service                   | 0   
 Inflight service                  | 0   
 Cleanliness                       | 0   
 Departure Delay in Minutes        | 0   
 Arrival Delay in Minutes         

The rows containing missing values are deleted.

In [0]:
# Deleting missing values
df_clean = df_clean.na.drop()

# Check number of lines
print(df_clean.count())

# Check missing values
df_clean.select([count(when(isnan(c) | col(c).isNull(),
                            c)).alias(c) for c in df_clean.columns]).show(
                                vertical=True)


129487
-RECORD 0--------------------------------
 id                                | 0   
 Gender                            | 0   
 Customer Type                     | 0   
 Age                               | 0   
 Type of Travel                    | 0   
 Class                             | 0   
 Flight Distance                   | 0   
 Inflight wifi service             | 0   
 Departure/Arrival time convenient | 0   
 Ease of Online booking            | 0   
 Gate location                     | 0   
 Food and drink                    | 0   
 Online boarding                   | 0   
 Seat comfort                      | 0   
 Inflight entertainment            | 0   
 On-board service                  | 0   
 Leg room service                  | 0   
 Baggage handling                  | 0   
 Checkin service                   | 0   
 Inflight service                  | 0   
 Cleanliness                       | 0   
 Departure Delay in Minutes        | 0   
 Arrival Delay in Minutes  

The missing values have been deleted correctly: 129487 rows remain, which is 129880 - 393.

## 1.2. Delete non applicable values

14 variables are satisfaction grades that range from 1 (worst) to 5 (best). Some of them are nevertheless equal to 0, which means "not applicable". Rows containing such a value are deleted.

In [0]:
# Keep only the rows where every grade is different from 0
df_clean = df_clean.where(
    fn.col('Inflight wifi service') != 0).where(
    fn.col('Departure/Arrival time convenient') != 0).where(
    fn.col('Ease of Online booking') != 0).where(
    fn.col('Gate location') != 0).where(
    fn.col('Food and drink') != 0).where(
    fn.col('Online boarding') != 0).where(
    fn.col('Seat comfort') != 0).where(
    fn.col('Inflight entertainment') != 0).where(
    fn.col('On-board service') != 0).where(
    fn.col('Leg room service') != 0).where(
    fn.col('Baggage handling') != 0).where(
    fn.col('Checkin service') != 0).where(
    fn.col('Inflight service') != 0).where(
    fn.col('Cleanliness') != 0)

print(df_clean.count())


119204


119204 rows are left, which corresponds to what was found with pandas.

# Part 2: Converting the data to a MLlib Matrix

As explained during the preliminary exploration using pandas, dummy variables must be created from the qualitative variables so that the data fit the numeric format expected by MLlib. 

- **Gender**: one dummy variable, equal to 1 if the passenger is female and 0 otherwise. 
- **Customer Type**: one dummy variable, equal to 1 if the customer is loyal and 0 otherwise. 
- **Type of Travel**: one dummy variable, equal to 1 if the travel is for business and 0 if it is personal. 
- **Class**: two dummy variables, one equal to 1 if the passenger is in business class and 0 otherwise, the other equal to 1 if the passenger is in eco class and 0 otherwise. No dummy is created for the remaining class, which serves as the reference category, to avoid multicollinearity. 
- **Satisfaction**: this is the target variable. It is equal to 1 if the customer is satisfied and 0 if the customer is dissatisfied or neutral.
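The encoding described above can be sketched in plain Python (not Spark). A variable with k levels is turned into k - 1 dummies, and the level left without a dummy is the reference. The helper name `encode_class` is hypothetical, and the third level is assumed to be named `Eco Plus`; both are illustrative only.

```python
# Hypothetical helper: encode a 3-level class variable with 2 dummies.
# The level that gets no dummy (assumed here to be "Eco Plus") is the reference.
def encode_class(value, levels=("Business", "Eco")):
    """Return one 0/1 dummy per level listed in `levels`."""
    return {f"{level} Class": int(value == level) for level in levels}


samples = ["Business", "Eco", "Eco Plus"]
encoded = [encode_class(v) for v in samples]
for value, dummies in zip(samples, encoded):
    print(value, dummies)
```

A passenger in the reference class gets 0 on both dummies, so no information is lost by omitting the third dummy.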

## 2.1. Gender

In [0]:
# Create dummy for the Gender
df_clean = df_clean.withColumn('Gender',
                               when(col('Gender') == 'Female', 1).otherwise(0))
df_clean.show(5, vertical=True)


-RECORD 0--------------------------------------------
 id                                | 0               
 Gender                            | 1               
 Customer Type                     | Loyal Customer  
 Age                               | 52              
 Type of Travel                    | Business travel 
 Class                             | Eco             
 Flight Distance                   | 160             
 Inflight wifi service             | 5               
 Departure/Arrival time convenient | 4               
 Ease of Online booking            | 3               
 Gate location                     | 4               
 Food and drink                    | 3               
 Online boarding                   | 4               
 Seat comfort                      | 3               
 Inflight entertainment            | 5               
 On-board service                  | 5               
 Leg room service                  | 5               
 Baggage handling           

In [0]:
# Frequency
df_clean.cube('Gender').count().show()


+------+------+
|Gender| count|
+------+------+
|     1| 60416|
|     0| 58788|
|  null|119204|
+------+------+



As before, there are almost as many females (1) as males (0). The null row is the grand total that cube adds to the per-value counts.
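The frequency tables in this part all contain a null row. As a rough plain-Python imitation (not Spark code) of what a one-column cube count returns: one row per distinct value, plus one grand-total row whose key is empty. The function name `cube_count` and the sample values are made up for illustration.

```python
from collections import Counter


def cube_count(values):
    """Imitate a one-column cube().count(): per-value counts plus a grand total."""
    counts = Counter(values)
    rows = [(key, n) for key, n in counts.items()]
    # Grand-total row: the grouping key is None (displayed as null by Spark)
    rows.append((None, len(values)))
    return rows


rows = cube_count([1, 0, 1, 1, 0])
for key, n in rows:
    print(key, n)
```

If only the per-value rows are wanted, `df_clean.groupBy('Gender').count()` is the usual alternative to `cube`.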

## 2.2. Customer Type

In [0]:
# Create dummy for the Customer Type
df_clean = df_clean.withColumn('Customer Type', when(col(
    'Customer Type') == 'Loyal Customer', 1).otherwise(0))

# Show dataframe
df_clean.show(5, vertical=True)


-RECORD 0--------------------------------------------
 id                                | 0               
 Gender                            | 1               
 Customer Type                     | 1               
 Age                               | 52              
 Type of Travel                    | Business travel 
 Class                             | Eco             
 Flight Distance                   | 160             
 Inflight wifi service             | 5               
 Departure/Arrival time convenient | 4               
 Ease of Online booking            | 3               
 Gate location                     | 4               
 Food and drink                    | 3               
 Online boarding                   | 4               
 Seat comfort                      | 3               
 Inflight entertainment            | 5               
 On-board service                  | 5               
 Leg room service                  | 5               
 Baggage handling           

In [0]:
# Frequency
df_clean.cube('Customer Type').count().show()


+-------------+------+
|Customer Type| count|
+-------------+------+
|            1|100024|
|            0| 19180|
|         null|119204|
+-------------+------+



There are 100 024 loyal customers. 19 180 are not.

## 2.3. Type of Travel

In [0]:
# Create dummy for the Type of Travel
df_clean = df_clean.withColumn('Type of Travel', when(col(
  'Type of Travel') == 'Business travel', 1).otherwise(0))

# Show dataframe
df_clean.show(5, vertical=True)


-RECORD 0--------------------------------------
 id                                | 0         
 Gender                            | 1         
 Customer Type                     | 1         
 Age                               | 52        
 Type of Travel                    | 1         
 Class                             | Eco       
 Flight Distance                   | 160       
 Inflight wifi service             | 5         
 Departure/Arrival time convenient | 4         
 Ease of Online booking            | 3         
 Gate location                     | 4         
 Food and drink                    | 3         
 Online boarding                   | 4         
 Seat comfort                      | 3         
 Inflight entertainment            | 5         
 On-board service                  | 5         
 Leg room service                  | 5         
 Baggage handling                  | 5         
 Checkin service                   | 2         
 Inflight service                  | 5  

In [0]:
# Frequency
df_clean.cube('Type of Travel').count().show()


+--------------+------+
|Type of Travel| count|
+--------------+------+
|             1| 82445|
|             0| 36759|
|          null|119204|
+--------------+------+



82 445 customers traveled for business and 36 759 for personal reasons.

## 2.4. Class

In [0]:
# Create dummy for the Class
df_clean = df_clean.withColumn('Business Class', when(col(
  'Class') == 'Business', 1).otherwise(0))
df_clean = df_clean.withColumn('Eco Class', when(col(
    'Class') == 'Eco', 1).otherwise(0))

# Delete variable class that is now useless
df_clean = df_clean.drop("Class")

# Show dataframe
df_clean.show(5, vertical=True)


-RECORD 0--------------------------------------
 id                                | 0         
 Gender                            | 1         
 Customer Type                     | 1         
 Age                               | 52        
 Type of Travel                    | 1         
 Flight Distance                   | 160       
 Inflight wifi service             | 5         
 Departure/Arrival time convenient | 4         
 Ease of Online booking            | 3         
 Gate location                     | 4         
 Food and drink                    | 3         
 Online boarding                   | 4         
 Seat comfort                      | 3         
 Inflight entertainment            | 5         
 On-board service                  | 5         
 Leg room service                  | 5         
 Baggage handling                  | 5         
 Checkin service                   | 2         
 Inflight service                  | 5         
 Cleanliness                       | 5  

In [0]:
# Frequency
df_clean.cube('Business Class').count().show()


+--------------+------+
|Business Class| count|
+--------------+------+
|             1| 57992|
|             0| 61212|
|          null|119204|
+--------------+------+



People traveling in business class represent about 49% of the customers.

In [0]:
# Frequency
df_clean.cube('Eco Class').count().show()


+---------+------+
|Eco Class| count|
+---------+------+
|        1| 52459|
|        0| 66745|
|     null|119204|
+---------+------+



People traveling in eco class represent about 44% of the customers. The remaining 8 753 customers (about 7%) travel in the third class, Eco Plus, for which no dummy is created.

## 2.5. Satisfaction

In [0]:
# Create dummy for the satisfaction
df_clean = df_clean.withColumn('satisfaction', when(col(
  'satisfaction') == 'satisfied', 1).otherwise(0))


In [0]:
# Frequency
df_clean.cube('satisfaction').count().show()


+------------+------+
|satisfaction| count|
+------------+------+
|           1| 50874|
|           0| 68330|
|        null|119204|
+------------+------+



More customers are dissatisfied or neutral (68 330, about 57%) than satisfied (50 874, about 43%).

In [0]:
# Put the satisfaction in first column, keeping the other columns in order
df_clean = df_clean.select(
    'satisfaction',
    *[c for c in df_clean.columns if c != 'satisfaction'])

df_clean.show(5, vertical=True)


-RECORD 0---------------------------------
 satisfaction                      | 1    
 id                                | 0    
 Gender                            | 1    
 Customer Type                     | 1    
 Age                               | 52   
 Type of Travel                    | 1    
 Flight Distance                   | 160  
 Inflight wifi service             | 5    
 Departure/Arrival time convenient | 4    
 Ease of Online booking            | 3    
 Gate location                     | 4    
 Food and drink                    | 3    
 Online boarding                   | 4    
 Seat comfort                      | 3    
 Inflight entertainment            | 5    
 On-board service                  | 5    
 Leg room service                  | 5    
 Baggage handling                  | 5    
 Checkin service                   | 2    
 Inflight service                  | 5    
 Cleanliness                       | 5    
 Departure Delay in Minutes        | 50   
 Arrival De

# Part 3: Add features

Some new variables are created to try to capture more information.

## 3.1. Business traveler in business class

This dummy variable is equal to 1 if the customer travels for business and is in business class. In that case the flight was probably paid for by the company, so the satisfaction might differ since the customer paid nothing personally.

In [0]:
# Create dummy for business travelers in business class
df_clean = df_clean.withColumn('Business in Business', when((col(
    'Type of Travel') == 1) & (col('Business Class') == 1), 1).otherwise(0))

# Show dataframe
df_clean.show(5, vertical=True)


-RECORD 0---------------------------------
 satisfaction                      | 1    
 id                                | 0    
 Gender                            | 1    
 Customer Type                     | 1    
 Age                               | 52   
 Type of Travel                    | 1    
 Flight Distance                   | 160  
 Inflight wifi service             | 5    
 Departure/Arrival time convenient | 4    
 Ease of Online booking            | 3    
 Gate location                     | 4    
 Food and drink                    | 3    
 Online boarding                   | 4    
 Seat comfort                      | 3    
 Inflight entertainment            | 5    
 On-board service                  | 5    
 Leg room service                  | 5    
 Baggage handling                  | 5    
 Checkin service                   | 2    
 Inflight service                  | 5    
 Cleanliness                       | 5    
 Departure Delay in Minutes        | 50   
 Arrival De

In [0]:
# Frequency
df_clean.cube('Business in Business').count().show()


+--------------------+------+
|Business in Business| count|
+--------------------+------+
|                   1| 55542|
|                   0| 63662|
|                null|119204|
+--------------------+------+



## 3.2. Average grade

This variable is the average, for each customer, of the 14 graded variables.

In [0]:
# Columns with a grade
df_clean.columns[7:21]


Out[25]: ['Inflight wifi service',
 'Departure/Arrival time convenient',
 'Ease of Online booking',
 'Gate location',
 'Food and drink',
 'Online boarding',
 'Seat comfort',
 'Inflight entertainment',
 'On-board service',
 'Leg room service',
 'Baggage handling',
 'Checkin service',
 'Inflight service',
 'Cleanliness']

In [0]:
# Number of columns with a grade
n = lit(len(df_clean.columns[7:21]))

# Compute grade
df_clean = df_clean.withColumn('Mean grades',
                               reduce(add,
                                   (col(x) for x in df_clean.columns[7:21])) / n)

# Show dataframe
df_clean.show(5, vertical=True)


-RECORD 0-----------------------------------------------
 satisfaction                      | 1                  
 id                                | 0                  
 Gender                            | 1                  
 Customer Type                     | 1                  
 Age                               | 52                 
 Type of Travel                    | 1                  
 Flight Distance                   | 160                
 Inflight wifi service             | 5                  
 Departure/Arrival time convenient | 4                  
 Ease of Online booking            | 3                  
 Gate location                     | 4                  
 Food and drink                    | 3                  
 Online boarding                   | 4                  
 Seat comfort                      | 3                  
 Inflight entertainment            | 5                  
 On-board service                  | 5                  
 Leg room service              

In [0]:
# Summary statistics
df_clean.select(col('Mean grades')).describe().show()


+-------+------------------+
|summary|       Mean grades|
+-------+------------------+
|  count|            119204|
|   mean|3.2793770103591964|
| stddev|0.6546830805250025|
|    min|1.1428571428571428|
|    max|               5.0|
+-------+------------------+



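The average grade is about 3.28, with a standard deviation of about 0.65. The maximum of 5.0 shows that at least one customer gave a 5 to every item, while the minimum of about 1.14 shows that no customer gave a 1 to every item.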
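As a quick check of the computation, the mean grade of the first record can be reproduced in plain Python with the same reduce/add pattern. The 14 grades below are copied from the first record's feature vector printed in section 4.2.1; the variable names are illustrative.

```python
from functools import reduce
from operator import add

# The 14 grades of the first record (wifi, time convenience, ..., cleanliness)
grades = [5, 4, 3, 4, 3, 4, 3, 5, 5, 5, 5, 2, 5, 5]

# Same pattern as the Spark expression: sum with reduce(add, ...), divide by n
mean_grade = reduce(add, grades) / len(grades)
print(mean_grade)
```

The grades sum to 58, so the mean is 58 / 14, about 4.14, which agrees with the last feature of that record.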

# Part 4: Classification model with mllib

The classification is done with Spark's machine learning library, MLlib, which is divided into two packages. 
One subsection therefore focuses on the mllib package, the original API based on RDDs, and another on the ml package, the API based on dataframes.

The goal is to classify the customers into 2 categories: 
- satisfied, or
- neutral or dissatisfied.

This is therefore a binary classification problem.

3 different models are going to be used and compared: 
 - **Decision tree**: The decision tree is a supervised classification method that explains a target variable from other, so-called explanatory, variables. The algorithm partitions the individuals into groups that are as homogeneous as possible with respect to the variable to be predicted. The result is a tree that reveals hierarchical relationships between the variables. The algorithm is iterative: at each iteration, it splits the individuals into groups to explain the target variable. The first split is obtained by choosing the explanatory variable that best separates the individuals of the train set (this is the root of the tree). This split produces sub-populations that correspond to the first nodes of the tree. The splitting process is then repeated for each sub-population until a stopping criterion is met.
 - **Random forest**: Random forests can be used to solve classification problems. This method overcomes some disadvantages of single decision trees while retaining their advantages. The key to the performance of random forests is the way in which each tree of the forest is built, which involves two random selection steps. The first randomly selects, with replacement, observations from the train set. As a result, each tree is developed on a different sample of the observations, and the observations left out of that sample can be used to test the accuracy of the tree. The second random step concerns the splitting conditions at each node of the tree: at each node, a subset of the predictor variables is randomly selected to create the binary rule.
 - **Gradient-boosted tree**: It is an ensemble machine learning algorithm, which means that a single model aggregates the output of several models. Here the ensemble is made of decision trees that are built sequentially and added to the ensemble one at a time. Each new tree is fitted to reduce the prediction error of the trees built before it; this is the principle of boosting. The fitting relies on a loss function and on gradient descent optimization. Depending on the implementation, a tree that does not improve the model enough may be pruned or left out of the ensemble.
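The two random steps of a random forest can be sketched in plain Python. This is a generic illustration under simple assumptions, not a description of Spark's implementation; the function names, the seed and the feature names used in the example are illustrative.

```python
import random


def bootstrap_sample(n_rows, rng):
    """Step 1: draw n_rows row indices with replacement.

    Returns the drawn indices (in-bag) and the indices never drawn (out-of-bag).
    """
    in_bag = [rng.randrange(n_rows) for _ in range(n_rows)]
    out_of_bag = sorted(set(range(n_rows)) - set(in_bag))
    return in_bag, out_of_bag


def random_feature_subset(feature_names, k, rng):
    """Step 2: at a node, pick k candidate features without replacement."""
    return rng.sample(feature_names, k)


rng = random.Random(10)
n_rows = 1000
in_bag, out_of_bag = bootstrap_sample(n_rows, rng)
candidates = random_feature_subset(["Age", "Flight Distance", "Cleanliness", "Seat comfort"], 2, rng)
print(len(set(in_bag)), "distinct rows in the bag,", len(out_of_bag), "rows left out")
print("candidate features at this node:", candidates)
```

Because rows are drawn with replacement, some rows appear several times in the bag and others not at all; the latter are the ones available for testing the tree.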

To assess how good a model is, several metrics can be used. They can be computed from the confusion matrix, which shows:
- **True Positive** (TP), cases where the prediction is positive and the actual value is positive,
- **True Negative** (TN), cases where the prediction is negative and the actual value is negative,
- **False Positive** (FP), cases where the prediction is positive but the actual value is negative,
- **False Negative** (FN), cases where the prediction is negative but the actual value is positive.

In this particular project:
- positive means being satisfied, and
- negative means being neutral or dissatisfied.

From these, 3 metrics can be computed:
    
- The **accuracy** is the number of correct predictions over the total number of predictions. 
$$ Accuracy = \frac{TP + TN}{TP + TN + FP + FN}$$
- The **precision** measures the quality of the positive predictions made by the model for a given class: among the cases predicted as belonging to the class, the share that actually belong to it.
$$ Precision_{class} = \frac{TP_{class}}{TP_{class} + FP_{class}}$$ 
- The **recall** measures how well the model identifies the actual members of a given class: among the cases that actually belong to the class, the share that are predicted as such.
$$ Recall_{class} = \frac{TP_{class}}{TP_{class} + FN_{class}}$$
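As a small worked example of these three metrics for the positive class (satisfied), here is a plain-Python sketch. The function name `binary_metrics` and the four counts are made up for illustration; they are not results from this notebook.

```python
def binary_metrics(tp, tn, fp, fn):
    """Return (accuracy, precision, recall) for the positive class."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total
    # Guard against a class that is never predicted or never present
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall


# Made-up confusion matrix counts, for illustration only
accuracy, precision, recall = binary_metrics(tp=40, tn=45, fp=5, fn=10)
print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}")
```

With these counts, 85 of the 100 predictions are correct, 40 of the 45 positive predictions are right, and 40 of the 50 actual positives are found.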

## 4.1. Variables summary

In [0]:
# Print the number of lines
print(df_clean.count())
# Display columns
df_clean.printSchema()


119204
root
 |-- satisfaction: integer (nullable = false)
 |-- id: integer (nullable = true)
 |-- Gender: integer (nullable = false)
 |-- Customer Type: integer (nullable = false)
 |-- Age: integer (nullable = true)
 |-- Type of Travel: integer (nullable = false)
 |-- Flight Distance: integer (nullable = true)
 |-- Inflight wifi service: integer (nullable = true)
 |-- Departure/Arrival time convenient: integer (nullable = true)
 |-- Ease of Online booking: integer (nullable = true)
 |-- Gate location: integer (nullable = true)
 |-- Food and drink: integer (nullable = true)
 |-- Online boarding: integer (nullable = true)
 |-- Seat comfort: integer (nullable = true)
 |-- Inflight entertainment: integer (nullable = true)
 |-- On-board service: integer (nullable = true)
 |-- Leg room service: integer (nullable = true)
 |-- Baggage handling: integer (nullable = true)
 |-- Checkin service: integer (nullable = true)
 |-- Inflight service: integer (nullable = true)
 |-- Cleanliness: integer (n

## 4.2. Spark.mllib (RDD based)

This subsection focuses on the spark.mllib package, which is RDD based.

### 4.2.1. Change dataframe to RDD

In [0]:
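# Create RDD of labeled points: the label is the first column (satisfaction),
# the features are all the remaining columns (id is dropped first)
df_rdd = df_clean.drop('id')
rdd = df_rdd.rdd.map(lambda line: LabeledPoint(line[0], list(line[1:])))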
rdd.take(10)


Out[29]: [LabeledPoint(1.0, [1.0,1.0,52.0,1.0,160.0,5.0,4.0,3.0,4.0,3.0,4.0,3.0,5.0,5.0,5.0,5.0,2.0,5.0,5.0,50.0,44.0,0.0,1.0,0.0,4.142857142857143]),
 LabeledPoint(1.0, [1.0,1.0,36.0,1.0,2863.0,1.0,1.0,3.0,1.0,5.0,4.0,5.0,4.0,4.0,4.0,4.0,3.0,4.0,5.0,0.0,0.0,1.0,0.0,1.0,3.4285714285714284]),
 LabeledPoint(1.0, [1.0,1.0,49.0,1.0,1182.0,2.0,3.0,4.0,3.0,4.0,1.0,2.0,2.0,2.0,2.0,2.0,4.0,2.0,4.0,0.0,20.0,0.0,1.0,0.0,2.642857142857143]),
 LabeledPoint(1.0, [0.0,1.0,16.0,1.0,311.0,3.0,3.0,3.0,3.0,5.0,5.0,3.0,5.0,4.0,3.0,1.0,1.0,2.0,5.0,0.0,0.0,0.0,1.0,0.0,3.2857142857142856]),
 LabeledPoint(1.0, [1.0,1.0,77.0,1.0,3987.0,5.0,5.0,5.0,5.0,3.0,5.0,5.0,5.0,5.0,5.0,5.0,4.0,5.0,3.0,0.0,0.0,1.0,0.0,1.0,4.642857142857143]),
 LabeledPoint(1.0, [1.0,1.0,43.0,1.0,2556.0,2.0,2.0,2.0,2.0,4.0,4.0,5.0,4.0,4.0,4.0,4.0,5.0,4.0,3.0,77.0,65.0,1.0,0.0,1.0,3.5]),
 LabeledPoint(1.0, [0.0,1.0,47.0,1.0,556.0,5.0,2.0,2.0,2.0,5.0,5.0,5.0,5.0,2.0,2.0,5.0,3.0,3.0,5.0,1.0,0.0,0.0,1.0,0.0,3.642857142857143]),
 LabeledPoint(

### 4.2.2. Creation of a test, validation and training RDD

The RDD is split into three pieces:

- **train_RDD** the training dataset, which is used to train the models (60%),
- **val_RDD** the validation dataset, which is used to choose the best model (20%),
- **test_RDD** the test dataset, which is used for the final evaluation of the chosen model (20%).
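A weighted random split assigns each row to a piece at random, so the resulting sizes are only approximately 60%, 20% and 20%. The plain-Python sketch below illustrates this general mechanism under simple assumptions; it is not a transcription of Spark's randomSplit, and the function name `random_split` is hypothetical.

```python
import random
from itertools import accumulate


def random_split(items, weights, seed):
    """Assign each item to one part, with probability proportional to its weight."""
    total = sum(weights)
    bounds = list(accumulate(w / total for w in weights))
    bounds[-1] = 1.0  # guard against floating-point rounding
    rng = random.Random(seed)
    parts = [[] for _ in weights]
    for item in items:
        u = rng.random()  # uniform draw in [0, 1)
        for part, bound in zip(parts, bounds):
            if u < bound:
                part.append(item)
                break
    return parts


train_part, val_part, test_part = random_split(range(10000), [0.6, 0.2, 0.2], seed=10)
print(len(train_part), len(val_part), len(test_part))
```

Every row lands in exactly one part, and fixing the seed makes the split reproducible.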

In [0]:
# Creation of the train, validation and test RDD
train_RDD, val_RDD, test_RDD = rdd.randomSplit([0.6, 0.2, 0.2], seed=10)

# Print number of entries
print('Training entries: %s' % train_RDD.count())
print('Validation entries: %s' % val_RDD.count())
print('Test entries: %s' % test_RDD.count())

# Print 3 first entries
print(train_RDD.take(3))
print(val_RDD.take(3))
print(test_RDD.take(3))


Training entries: 71623
Validation entries: 23915
Test entries: 23666
[LabeledPoint(1.0, [1.0,1.0,52.0,1.0,160.0,5.0,4.0,3.0,4.0,3.0,4.0,3.0,5.0,5.0,5.0,5.0,2.0,5.0,5.0,50.0,44.0,0.0,1.0,0.0,4.142857142857143]), LabeledPoint(1.0, [1.0,1.0,36.0,1.0,2863.0,1.0,1.0,3.0,1.0,5.0,4.0,5.0,4.0,4.0,4.0,4.0,3.0,4.0,5.0,0.0,0.0,1.0,0.0,1.0,3.4285714285714284]), LabeledPoint(1.0, [1.0,1.0,49.0,1.0,1182.0,2.0,3.0,4.0,3.0,4.0,1.0,2.0,2.0,2.0,2.0,2.0,4.0,2.0,4.0,0.0,20.0,0.0,1.0,0.0,2.642857142857143])]
[LabeledPoint(1.0, [1.0,1.0,46.0,1.0,1744.0,2.0,2.0,2.0,2.0,3.0,4.0,4.0,4.0,4.0,4.0,4.0,5.0,4.0,4.0,28.0,14.0,1.0,0.0,1.0,3.4285714285714284]), LabeledPoint(1.0, [1.0,1.0,46.0,1.0,1009.0,5.0,5.0,5.0,5.0,4.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,3.0,0.0,0.0,1.0,0.0,1.0,4.785714285714286]), LabeledPoint(1.0, [1.0,1.0,52.0,1.0,925.0,2.0,2.0,2.0,2.0,5.0,5.0,4.0,4.0,4.0,4.0,4.0,3.0,4.0,5.0,10.0,0.0,1.0,0.0,1.0,3.5714285714285716])]
[LabeledPoint(1.0, [0.0,1.0,16.0,1.0,311.0,3.0,3.0,3.0,3.0,5.0,5.0,3.0,5.0,4.0,3.

### 4.2.3. Decision tree classifier

In [0]:
# Decision tree classifier model
decision_tree_mllib = DecisionTree.trainClassifier(train_RDD, numClasses = 2, 
                                                   categoricalFeaturesInfo={}, 
                                                   impurity = 'gini', maxDepth = 5, 
                                                   maxBins = 32)


In [0]:
# Prediction on the validation dataset
predictions_DT = decision_tree_mllib.predict(val_RDD.map(lambda x: x.features))

# Actual labels
true_labels = val_RDD.map(lambda y: y.label)

In [0]:
# Actual label and prediction
trueLabel_pred_DT = true_labels.zip(predictions_DT)
trueLabel_pred_DT.take(10)

Out[33]: [(1.0, 1.0),
 (1.0, 1.0),
 (1.0, 1.0),
 (0.0, 0.0),
 (0.0, 0.0),
 (0.0, 0.0),
 (1.0, 0.0),
 (0.0, 0.0),
 (0.0, 0.0),
 (0.0, 0.0)]

In [0]:
# Define the metrics function
metrics_DT = MulticlassMetrics(trueLabel_pred_DT)




In [0]:
# Confusion matrix
metrics_DT.confusionMatrix().toArray()


Out[35]: array([[12736.,   895.],
       [  989.,  9295.]])

In [0]:
# False positive rate of each label (the printed names FN and FP denote rates, not counts)
print("FN = %s" % metrics_DT.falsePositiveRate(0.0))
print("FP = %s" % metrics_DT.falsePositiveRate(1.0))

# True positive rate of each label (likewise printed as the rates TN and TP)
print("TN = %s" % metrics_DT.truePositiveRate(0.0))
print("TP = %s" % metrics_DT.truePositiveRate(1.0))


FN = 0.09616880591209646
FP = 0.0656591592693126
TN = 0.9343408407306873
TP = 0.9038311940879036


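Both error rates are small, below 10%. One caveat on their reading: `MulticlassMetrics` expects (prediction, label) pairs, whereas the pairs built above are (label, prediction), so the rows of the confusion matrix correspond to the predictions and its columns to the actual labels. The rate printed as FN is therefore the share of customers predicted as satisfied who are actually neutral or dissatisfied (9.6%), and the rate printed as FP is the share of customers predicted as neutral or dissatisfied who are actually satisfied (6.6%). Either way, the large majority of customers in both classes are classified correctly.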
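
The four printed rates can be recovered from the confusion matrix of Out[35] by dividing each cell by the total of its row. The following is a minimal pure-Python sketch (the helper `row_rates` is hypothetical and not part of MLlib), which also shows that FN, FP, TN and TP denote rates here, not counts:

```python
# Confusion matrix copied from Out[35]
cm = [[12736.0, 895.0],
      [989.0, 9295.0]]


def row_rates(matrix):
    """Divide every cell of a matrix by the total of its row."""
    return [[cell / sum(row) for cell in row] for row in matrix]


rates = row_rates(cm)
printed_tn = rates[0][0]  # value printed as TN
printed_fp = rates[0][1]  # value printed as FP
printed_fn = rates[1][0]  # value printed as FN
printed_tp = rates[1][1]  # value printed as TP

print("FN = %s, FP = %s" % (printed_fn, printed_fp))
print("TN = %s, TP = %s" % (printed_tn, printed_tp))
```

Each row sums to one, which is why TN and FP on the one hand, and FN and TP on the other hand, are complementary.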

In [0]:
# Computation of some metrics
accuracy = metrics_DT.accuracy
precision1 = metrics_DT.precision(1.0)
recall1 = metrics_DT.recall(1.0)
precision0 = metrics_DT.precision(0.0)
recall0 = metrics_DT.recall(0.0)

print("Accuracy = %s" % accuracy)
print("Error = %s" % (1.0 - accuracy))
print("Precision satisfied = %s" % precision1)
print("Recall satisfied = %s" % recall1)
print("Precision neutral or dissatisfied = %s" % precision0)
print("Recall neutral or dissatisfied = %s" % recall0)


Accuracy = 0.9212209910098265
Error = 0.07877900899017354
Precision satisfied = 0.9121687929342492
Recall satisfied = 0.9038311940879036
Precision neutral or dissatisfied = 0.9279417122040073
Recall neutral or dissatisfied = 0.9343408407306873


Overall, the accuracy is 92%: the satisfaction of about 92% of the customers in the validation set was correctly predicted. The precision and recall are also high, above 90% for both classes.
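
How precision and recall depend on the order of the pairs can be checked by hand. The sketch below is a minimal pure-Python illustration, not the `MulticlassMetrics` implementation: the helper `precision_recall` is hypothetical, and the counts are those of Out[35], read with rows as predictions and columns as actual labels. That reading is an assumption, supported by the fact that the column totals (13725 and 10190) are identical in the three confusion matrices of this section, as expected of the actual class counts of a fixed validation set.

```python
def precision_recall(counts, positive):
    """Precision and recall of `positive` from {(prediction, label): count}."""
    true_pos = counts.get((positive, positive), 0)
    predicted_pos = sum(n for (pred, _), n in counts.items() if pred == positive)
    actual_pos = sum(n for (_, label), n in counts.items() if label == positive)
    return true_pos / predicted_pos, true_pos / actual_pos


# Counts keyed by (prediction, label), taken from Out[35] under the
# assumption stated above (rows = predictions, columns = actual labels)
pred_label = {(0, 0): 12736, (0, 1): 895, (1, 0): 989, (1, 1): 9295}

# Same counts keyed by (label, prediction), i.e. with the two roles exchanged
label_pred = {(label, pred): n for (pred, label), n in pred_label.items()}

precision_ok, recall_ok = precision_recall(pred_label, positive=1)
precision_swapped, recall_swapped = precision_recall(label_pred, positive=1)

print("(prediction, label) order: precision=%.4f recall=%.4f"
      % (precision_ok, recall_ok))
print("(label, prediction) order: precision=%.4f recall=%.4f"
      % (precision_swapped, recall_swapped))
```

Exchanging the two roles exchanges precision and recall, while the accuracy is unaffected. In the notebook itself, pairs in the order expected by `MulticlassMetrics` would be obtained with `predictions_DT.zip(true_labels)`.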

### 4.2.4. Random forest classifier

In [0]:
# Random forest classifier model
RF_mllib = RandomForest.trainClassifier(train_RDD, numClasses=2,
                                        categoricalFeaturesInfo={},
                                        numTrees=10,
                                        featureSubsetStrategy="auto",
                                        impurity='gini', maxDepth=4,
                                        maxBins=32)


In [0]:
# Prediction on the validation dataset
predictions_RF = RF_mllib.predict(val_RDD.map(lambda x: x.features))

# Actual labels
true_labels = val_RDD.map(lambda y: y.label)


In [0]:
# Actual label and prediction
trueLabel_pred_RF = true_labels.zip(predictions_RF)
trueLabel_pred_RF.take(10)


Out[40]: [(1.0, 1.0),
 (1.0, 1.0),
 (1.0, 1.0),
 (0.0, 0.0),
 (0.0, 0.0),
 (0.0, 0.0),
 (1.0, 0.0),
 (0.0, 0.0),
 (0.0, 0.0),
 (0.0, 0.0)]

In [0]:
# Define the metrics function
metrics_RF = MulticlassMetrics(trueLabel_pred_RF)


In [0]:
# Confusion matrix
metrics_RF.confusionMatrix().toArray()


Out[42]: array([[12502.,   809.],
       [ 1223.,  9381.]])

In [0]:
# False positive rate
print("FN = %s" % metrics_RF.falsePositiveRate(0.0))
print("FP = %s" % metrics_RF.falsePositiveRate(1.0))

# True positive rate
print("TN = %s" % metrics_RF.truePositiveRate(0.0))
print("TP = %s" % metrics_RF.truePositiveRate(1.0))


FN = 0.11533383628819313
FP = 0.0607768011419127
TN = 0.9392231988580872
TP = 0.8846661637118068


As before, the FN and FP rates are low (about 12% and 6%).

In [0]:
# Computation of some metrics
accuracy = metrics_RF.accuracy
precision1 = metrics_RF.precision(1.0)
recall1 = metrics_RF.recall(1.0)
precision0 = metrics_RF.precision(0.0)
recall0 = metrics_RF.recall(0.0)

print("Accuracy = %s" % accuracy)
print("Error = %s" % (1.0 - accuracy))
print("Precision satisfied = %s" % precision1)
print("Recall satisfied = %s" % recall1)
print("Precision neutral or dissatisfied = %s" % precision0)
print("Recall neutral or dissatisfied = %s" % recall0)


Accuracy = 0.9150324064394731
Error = 0.08496759356052686
Precision satisfied = 0.9206084396467125
Recall satisfied = 0.8846661637118068
Precision neutral or dissatisfied = 0.9108925318761384
Recall neutral or dissatisfied = 0.9392231988580872


The results of this model are good as well, but the accuracy, 91.5%, is slightly lower than with the decision tree classifier. 
The precision and recall are also rather high and very similar to those obtained with the previous model.

### 4.2.5. Gradient-boosted tree

In [0]:
# Gradient-boosted trees classifier model
GBT_mllib = GradientBoostedTrees.trainClassifier(train_RDD,
                                             categoricalFeaturesInfo={}, numIterations=3)


In [0]:
# Prediction on the validation dataset
predictions_GBT = GBT_mllib.predict(val_RDD.map(lambda x: x.features))

# Actual labels
true_labels = val_RDD.map(lambda y: y.label)


In [0]:
# Actual label and prediction
trueLabel_pred_GBT = true_labels.zip(predictions_GBT)
trueLabel_pred_GBT.take(10)


Out[47]: [(1.0, 1.0),
 (1.0, 1.0),
 (1.0, 1.0),
 (0.0, 0.0),
 (0.0, 0.0),
 (0.0, 0.0),
 (1.0, 1.0),
 (0.0, 0.0),
 (0.0, 0.0),
 (0.0, 0.0)]

In [0]:
# Define the metrics function
metrics_GBT = MulticlassMetrics(trueLabel_pred_GBT)


In [0]:
# Confusion matrix
metrics_GBT.confusionMatrix().toArray()


Out[49]: array([[12109.,  1190.],
       [ 1616.,  9000.]])

In [0]:
# False positive rate
print("FN = %s" % metrics_GBT.falsePositiveRate(0.0))
print("FP = %s" % metrics_GBT.falsePositiveRate(1.0))

# True positive rate
print("TN = %s" % metrics_GBT.truePositiveRate(0.0))
print("TP = %s" % metrics_GBT.truePositiveRate(1.0))


FN = 0.1522230595327807
FP = 0.08948041206105722
TN = 0.9105195879389428
TP = 0.8477769404672193


Compared to before, the FN and FP rates are higher (about 15% and 9%). This suggests that the model performs worse than the two previous ones.

In [0]:
# Computation of some metrics
accuracy = metrics_GBT.accuracy
precision1 = metrics_GBT.precision(1.0)
recall1 = metrics_GBT.recall(1.0)
precision0 = metrics_GBT.precision(0.0)
recall0 = metrics_GBT.recall(0.0)

print("Accuracy = %s" % accuracy)
print("Error = %s" % (1.0 - accuracy))
print("Precision satisfied = %s" % precision1)
print("Recall satisfied = %s" % recall1)
print("Precision neutral or dissatisfied = %s" % precision0)
print("Recall neutral or dissatisfied = %s" % recall0)


Accuracy = 0.8826677817269496
Error = 0.1173322182730504
Precision satisfied = 0.8832188420019627
Recall satisfied = 0.8477769404672193
Precision neutral or dissatisfied = 0.8822586520947177
Recall neutral or dissatisfied = 0.9105195879389428


Indeed, this model is the worst of the three, since the satisfaction of only 88% of the customers was correctly predicted. Thus, it won't be used for the experiment with the test set.

### 4.2.6. Comparison and experiment with the test RDD

Comparing the models on the validation RDD leads to the conclusion that the best model is the **Decision Tree Classifier model**, since it has the highest accuracy, around 92%. 
Thus, it is the model that is going to be used for the experiment with the test set.
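
The selection rule used here (keep the model with the highest validation accuracy) can be written down explicitly. This is a minimal sketch in plain Python; the dictionary simply copies the three accuracies printed above, and its name and keys are illustrative:

```python
# Validation accuracies printed in sections 4.2.3 to 4.2.5
validation_accuracy = {
    "Decision tree": 0.9212209910098265,
    "Random forest": 0.9150324064394731,
    "Gradient-boosted trees": 0.8826677817269496,
}

# Keep the model whose validation accuracy is the highest
best_model = max(validation_accuracy, key=validation_accuracy.get)

# Print the models from best to worst
ranking = sorted(validation_accuracy.items(), key=lambda kv: kv[1], reverse=True)
for name, acc in ranking:
    print("%s: %.4f" % (name, acc))
print("Selected model: %s" % best_model)
```

The test RDD plays no role in this choice; it is only used afterwards, to evaluate the selected model.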

In [0]:
# Decision tree classifier model
decision_tree_mllib2 = DecisionTree.trainClassifier(train_RDD, numClasses = 2, 
                                                   categoricalFeaturesInfo={}, 
                                                   impurity = 'gini', maxDepth = 5, 
                                                   maxBins = 32)

# Prediction on the test dataset
predictions_DT2 = decision_tree_mllib2.predict(test_RDD.map(lambda x: x.features))

# Actual labels
true_labels = test_RDD.map(lambda y: y.label)

# Actual label and prediction
trueLabel_pred_DT2 = true_labels.zip(predictions_DT2)


In [0]:
# Define the metrics function
metrics_DT2 = MulticlassMetrics(trueLabel_pred_DT2)

# Computation of some metrics
accuracy = metrics_DT2.accuracy
precision1 = metrics_DT2.precision(1.0)
recall1 = metrics_DT2.recall(1.0)
precision0 = metrics_DT2.precision(0.0)
recall0 = metrics_DT2.recall(0.0)

print("Accuracy = %s" % accuracy)
print("Error = %s" % (1.0 - accuracy))
print("Precision satisfied = %s" % precision1)
print("Recall satisfied = %s" % recall1)
print("Precision neutral or dissatisfied = %s" % precision0)
print("Recall neutral or dissatisfied = %s" % recall0)


Accuracy = 0.9217020197752049
Error = 0.07829798022479506
Precision satisfied = 0.9161574676950817
Recall satisfied = 0.900019681165125
Precision neutral or dissatisfied = 0.9257472776437916
Recall neutral or dissatisfied = 0.93801836492891


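The results on the test RDD are very similar to those obtained on the validation RDD, since the accuracy is 92% for both. Hence, there are only around 8% of errors.  
Furthermore, the precisions and recalls are all at least 90%. As the pairs passed to `MulticlassMetrics` are again ordered (label, prediction) rather than (prediction, label), the printed precision and recall are interchanged: the value printed as precision for the satisfied class means that about 92% of the actually satisfied customers were predicted as satisfied, and the value printed as recall for the other class means that about 94% of the customers predicted as neutral or dissatisfied really are.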

## 4.3. Spark.ml (Dataframe based)

This subsection focuses on the spark.ml package, which is DataFrame based.

### 4.3.1. Generate feature vectors

In [0]:
# Dataframe columns
df_clean.printSchema()

root
 |-- satisfaction: integer (nullable = false)
 |-- id: integer (nullable = true)
 |-- Gender: integer (nullable = false)
 |-- Customer Type: integer (nullable = false)
 |-- Age: integer (nullable = true)
 |-- Type of Travel: integer (nullable = false)
 |-- Flight Distance: integer (nullable = true)
 |-- Inflight wifi service: integer (nullable = true)
 |-- Departure/Arrival time convenient: integer (nullable = true)
 |-- Ease of Online booking: integer (nullable = true)
 |-- Gate location: integer (nullable = true)
 |-- Food and drink: integer (nullable = true)
 |-- Online boarding: integer (nullable = true)
 |-- Seat comfort: integer (nullable = true)
 |-- Inflight entertainment: integer (nullable = true)
 |-- On-board service: integer (nullable = true)
 |-- Leg room service: integer (nullable = true)
 |-- Baggage handling: integer (nullable = true)
 |-- Checkin service: integer (nullable = true)
 |-- Inflight service: integer (nullable = true)
 |-- Cleanliness: integer (nullable

In [0]:
# Creation of the feature vectors
features = ["Gender", "Customer Type", "Age", "Type of Travel",
            "Flight Distance",
            "Inflight wifi service", "Departure/Arrival time convenient",
            "Ease of Online booking", "Gate location", "Food and drink",
            "Online boarding", "Seat comfort", "Inflight entertainment",
            "On-board service", "Leg room service", "Baggage handling",
            "Checkin service", "Inflight service", "Cleanliness",
            "Departure Delay in Minutes", "Arrival Delay in Minutes",
            "Business Class", "Eco Class", "Business in Business",
            "Mean grades"]
assembler = VectorAssembler(inputCols=features, outputCol="features")
df_features = assembler.setHandleInvalid("keep").transform(df_clean)


### 4.3.2. Creation of a test, validation and training dataframe

The dataframe is split into three pieces:
- **train_df** the training dataset, which is used to train the models (60%),
- **val_df** the validation dataset, which is used to choose the best model (20%),
- **test_df** the test dataset, which is used for the experiment (20%).

In [0]:
# Creation of the train, validation and test dataframes
(train_df, val_df, test_df) = df_features.randomSplit([0.6, 0.2, 0.2], seed=1)

# Print number of entries
print('Training entries: %s' % train_df.count())
print('Validation entries: %s' % val_df.count())
print('Test entries: %s' % test_df.count())


Training entries: 71429
Validation entries: 24138
Test entries: 23637
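
Since `randomSplit` assigns rows at random according to the weights, the sizes of the three pieces only approximate the requested 60/20/20 proportions. A quick plain-Python check on the counts printed above (the variable names are illustrative):

```python
# Number of rows printed above for each piece of the split
counts = {"train": 71429, "validation": 24138, "test": 23637}
requested = {"train": 0.6, "validation": 0.2, "test": 0.2}

total = sum(counts.values())
fractions = {name: n / total for name, n in counts.items()}

for name in counts:
    print("%s: %.4f (requested %.1f)" % (name, fractions[name], requested[name]))
```

The total is the same as for the RDD split of section 4.2.2, as both start from the same cleaned data.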


### 4.3.4. Decision tree classifier

In [0]:
# Decision tree model
DT_ml = DecisionTreeClassifier(maxDepth=2, labelCol="satisfaction", 
                               featuresCol="features", seed=1)


In [0]:
# Fit the model with the train dataframe
model_DT_ml = DT_ml.fit(train_df)

# Prediction on the validation dataframe
DT_ml_predictions = model_DT_ml.transform(val_df)


In [0]:
# Define the metrics function
evaluator = MulticlassClassificationEvaluator(labelCol="satisfaction",
                                           predictionCol="prediction")


In [0]:
# False positive rate
print("FN = %s" % evaluator.evaluate(
    DT_ml_predictions,
    {evaluator.metricName: "falsePositiveRateByLabel",
     evaluator.metricLabel: 0.0}))
print("FP = %s" % evaluator.evaluate(
    DT_ml_predictions,
    {evaluator.metricName: "falsePositiveRateByLabel",
     evaluator.metricLabel: 1.0}))

# True positive rate
print("TN = %s" % evaluator.evaluate(
    DT_ml_predictions,
    {evaluator.metricName: "truePositiveRateByLabel",
     evaluator.metricLabel: 0.0}))
print("TP = %s" % evaluator.evaluate(
    DT_ml_predictions,
    {evaluator.metricName: "truePositiveRateByLabel",
     evaluator.metricLabel: 1.0}))


FN = 0.12610340479192939
FP = 0.1309566852266975
TN = 0.8690433147733024
TP = 0.8738965952080706


The FN and FP rates are both above 12%, which is more than when using the spark.mllib package, and means the overall error will be higher.

In [0]:
# Computation of some metrics
accuracy = evaluator.evaluate(
    DT_ml_predictions,
    {evaluator.metricName: "accuracy"})

print("Accuracy = %s" % accuracy)
print("Error = %s" % (1.0 - accuracy))


Accuracy = 0.8711160825254785
Error = 0.12888391747452155


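As expected, the accuracy is 87%, which is lower than when using mllib, and the error rate is accordingly higher, about 13%. Note, however, that this tree is limited to maxDepth=2, whereas the mllib tree was trained with maxDepth=5, so the two results are not strictly comparable. The precisions and recalls are all almost equal to the accuracy.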

### 4.3.5. Random forest classifier

In [0]:
# Random forest model
RF_ml = RandomForestClassifier(numTrees=3, maxDepth=2, labelCol="satisfaction",
                               featuresCol="features", seed=1)


In [0]:
# Fit the model with the train dataframe
model_RF_ml = RF_ml.fit(train_df)

# Prediction on the validation dataframe
RF_ml_predictions = model_RF_ml.transform(val_df)


In [0]:
# False positive rate
print("FN = %s" % evaluator.evaluate(
    RF_ml_predictions,
    {evaluator.metricName: "falsePositiveRateByLabel",
     evaluator.metricLabel: 0.0}))
print("FP = %s" % evaluator.evaluate(
    RF_ml_predictions,
    {evaluator.metricName: "falsePositiveRateByLabel",
     evaluator.metricLabel: 1.0}))

# True positive rate
print("TN = %s" % evaluator.evaluate(
    RF_ml_predictions,
    {evaluator.metricName: "truePositiveRateByLabel",
     evaluator.metricLabel: 0.0}))
print("TP = %s" % evaluator.evaluate(
    RF_ml_predictions,
    {evaluator.metricName: "truePositiveRateByLabel",
     evaluator.metricLabel: 1.0}))


FN = 0.30439421864390337
FP = 0.054956974473931594
TN = 0.9450430255260684
TP = 0.6956057813560966


The FN rate is high, at about 30%: 30% of the satisfied customers were predicted to be dissatisfied or neutral. As the TP rate is its complement, it is only around 70%. Conversely, the FP rate is very small (about 5.5%) and the TN rate is very high (about 94.5%).

In [0]:
# Computation of some metrics
accuracy = evaluator.evaluate(
    RF_ml_predictions,
    {evaluator.metricName: "accuracy"})

print("Accuracy = %s" % accuracy)
print("Error = %s" % (1.0 - accuracy))


Accuracy = 0.8385118899660287
Error = 0.16148811003397134


This issue with the FN and TP rates has an impact on the accuracy, which is only about 84%. It is the worst accuracy so far: this model is clearly not the best one.
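
The link between the per-class rates and the accuracy can be made explicit: the accuracy is the average of the TP and TN rates, weighted by the share of each class. The sketch below is a minimal illustration, assuming the printed TP and TN are the true positive rates of labels 1 and 0 (the function name `positive_share` is hypothetical). It backs out the share of satisfied customers implied by the figures printed for the decision tree and for the random forest; the two shares should coincide, because both models were evaluated on the same validation dataframe.

```python
def positive_share(accuracy, tp_rate, tn_rate):
    """Share p of the positive class solving accuracy = p*tp_rate + (1-p)*tn_rate."""
    return (accuracy - tn_rate) / (tp_rate - tn_rate)


# Values printed above for the spark.ml decision tree and random forest
share_dt = positive_share(0.8711160825254785,
                          tp_rate=0.8738965952080706,
                          tn_rate=0.8690433147733024)
share_rf = positive_share(0.8385118899660287,
                          tp_rate=0.6956057813560966,
                          tn_rate=0.9450430255260684)
print("Implied share of satisfied customers: %.4f and %.4f" % (share_dt, share_rf))

# Accuracy of the random forest rebuilt from its two rates and that share
rebuilt_rf = share_rf * 0.6956057813560966 + (1 - share_rf) * 0.9450430255260684
print("Rebuilt accuracy: %s" % rebuilt_rf)
```

With satisfied customers making up a bit more than 40% of the validation dataframe, a TP rate near 70% pulls the accuracy down even though the TN rate is close to 95%.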

### 4.3.6. Gradient-boosted tree

In [0]:
# Gradient-boosted tree model
GBT_ml = GBTClassifier(maxIter=5, maxDepth=2, labelCol="satisfaction",
                       featuresCol="features", seed=1)


In [0]:
# Fit the model with the train dataframe
model_GBT_ml = GBT_ml.fit(train_df)

# Prediction on the validation dataframe
GBT_ml_predictions = model_GBT_ml.transform(val_df)

In [0]:
# False positive rate
print("FN = %s" % evaluator.evaluate(
    GBT_ml_predictions,
    {evaluator.metricName: "falsePositiveRateByLabel",
     evaluator.metricLabel: 0.0}))
print("FP = %s" % evaluator.evaluate(
    GBT_ml_predictions,
    {evaluator.metricName: "falsePositiveRateByLabel",
     evaluator.metricLabel: 1.0}))

# True positive rate
print("TN = %s" % evaluator.evaluate(
    GBT_ml_predictions,
    {evaluator.metricName: "truePositiveRateByLabel",
     evaluator.metricLabel: 0.0}))
print("TP = %s" % evaluator.evaluate(
    GBT_ml_predictions,
    {evaluator.metricName: "truePositiveRateByLabel",
     evaluator.metricLabel: 1.0}))


FN = 0.11873120574255505
FP = 0.1309566852266975
TN = 0.8690433147733024
TP = 0.881268794257445


The results of this model are similar to those obtained with the Decision Tree model: the FN and FP rates are around 12% and 13% respectively. However, the FN rate is a little smaller here and, correspondingly, the TP rate is a little higher than before (88% against 87%). This model is likely to be the best of the three.

In [0]:
# Computation of some metrics
accuracy = evaluator.evaluate(
    GBT_ml_predictions,
    {evaluator.metricName: "accuracy"})

print("Accuracy = %s" % accuracy)
print("Error = %s" % (1.0 - accuracy))


Accuracy = 0.8742646449581573
Error = 0.12573535504184274


The accuracy is indeed a little higher with the Gradient-Boosted Tree model than with the Decision Tree model, although the results are very close: the accuracy is 87% here as well. One may notice that, even though it is the best model when using the spark.ml package, the result is still worse than when using the spark.mllib package.

### 4.3.7. Comparison and experiment with the test dataframe

Comparing the models on the validation dataframe leads to the conclusion that the best model is the **Gradient-Boosted Tree** model, since it has the highest accuracy, around 87%. Thus, it is the model that is going to be used for the experiment with the test set.

In [0]:
# The fitted model is unchanged; only the prediction is now made
# on the test dataframe
GBT_ml_predictions2 = model_GBT_ml.transform(test_df)


In [0]:
# False positive rate
print("FN = %s" % evaluator.evaluate(
    GBT_ml_predictions2,
    {evaluator.metricName: "falsePositiveRateByLabel",
     evaluator.metricLabel: 0.0}))
print("FP = %s" % evaluator.evaluate(
    GBT_ml_predictions2,
    {evaluator.metricName: "falsePositiveRateByLabel",
     evaluator.metricLabel: 1.0}))

# True positive rate
print("TN = %s" % evaluator.evaluate(
    GBT_ml_predictions2,
    {evaluator.metricName: "truePositiveRateByLabel",
     evaluator.metricLabel: 0.0}))
print("TP = %s" % evaluator.evaluate(
    GBT_ml_predictions2,
    {evaluator.metricName: "truePositiveRateByLabel",
     evaluator.metricLabel: 1.0}))


FN = 0.11725094078035254
FP = 0.1342787502769776
TN = 0.8657212497230223
TP = 0.8827490592196474


The FN and FP rates are very similar to the ones obtained with the validation dataframe: they are again around 12% and 13% respectively. The TN and TP rates are around 87% and 88% respectively. One can therefore expect the accuracy to be very close to the one obtained previously.

In [0]:
# Computation of some metrics
accuracy = evaluator.evaluate(
    GBT_ml_predictions2,
    {evaluator.metricName: "accuracy"})

print("Accuracy = %s" % accuracy)
print("Error = %s" % (1.0 - accuracy))


Accuracy = 0.8729957270381182
Error = 0.12700427296188177


The results on the test dataframe are very similar to those obtained on the validation dataframe, since the accuracy is around 87% for both. Hence, there are only around 13% of errors.
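
One rough way to quantify how close the two accuracies are is to compare their gap with its sampling error. The sketch below is only an approximation under stated assumptions: each accuracy is treated as a binomial proportion, the two sets are treated as independent samples, and the helper `accuracy_std_error` is hypothetical.

```python
import math


def accuracy_std_error(accuracy, n):
    """Binomial standard error of an accuracy estimated on n examples."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n)


# Accuracies and sizes printed above for the validation and test dataframes
val_acc, val_n = 0.8742646449581573, 24138
test_acc, test_n = 0.8729957270381182, 23637

gap = val_acc - test_acc
se_gap = math.sqrt(accuracy_std_error(val_acc, val_n) ** 2
                   + accuracy_std_error(test_acc, test_n) ** 2)

print("gap = %.4f, standard error of the gap = %.4f" % (gap, se_gap))
```

Under these assumptions the gap is smaller than its standard error, which is consistent with reading the two results as equivalent.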

# Conclusion

After cleaning the data and adding new features, the spark.mllib (RDD based) and spark.ml (dataframe based) packages could be used. Three models were compared with each package: Decision Tree Classifier, Random Forest, and Gradient-Boosted Tree. The initial dataframe was split into 3 parts: a training, a validation and a test set. 

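The models fitted on the training data were first evaluated on the validation data. When using the spark.mllib package, the best model was the Decision Tree Classifier with an accuracy of 92%, and when using the spark.ml package, the best model was the Gradient-Boosted Tree with an accuracy of 87%. Moreover, the results obtained with the spark.mllib package were always better than those obtained with the spark.ml package. Note, however, that the spark.ml models were fitted with shallower trees (maxDepth=2) than the spark.mllib decision tree (maxDepth=5) and random forest (maxDepth=4), so this gap probably reflects, at least in part, these settings rather than the packages themselves. 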

To conclude, if one had to keep only one package and model to classify the customers into satisfied or neutral/dissatisfied customers, one would keep the Decision Tree Classifier model with the spark.mllib package.