---
title: "Final Presentation"
author: "Anthony Kulowski"
date: "December 17, 2018"
output: ioslides_presentation
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = FALSE)
library(caret)
library(ggplot2)
```

## Project Introduction
- Predict whether airline customers were satisfied with their flight, based on surveys given to a set of passengers.
- The results would help airlines identify what produces the most satisfied customers and the most enjoyable flights.
- Airlines could use the findings to improve specific aspects of the flying experience, win more returning customers, and so increase sales.

## Data Introduction
- The data was downloaded from Kaggle
- 4 Numeric Variables
- 19 Categorical Variables
- Survey satisfaction ratings on a scale of 0 (worst) to 5 (best)
- No information on how the data was collected
- Packages used: caret and ggplot2
- 393 Missing Values
- 129,880 Observations

## List of Input Variables
- ID
- Satisfaction
- Gender
- Customer Type
- Age
- Type of Travel
- Class
- Flight Distance
- Seat comfort
- Departure/Arrival time convenient
- Food and drink / Gate location

## Input Variables cont.
- Inflight wifi service
- Inflight entertainment
- Online support
- Ease of Online booking
- On-board service
- Leg room service
- Baggage handling
- Checkin service
- Cleanliness
- Online boarding
- Arrival/Departure Delay in Minutes

## Read and Prepare Dataset

```{r echo=TRUE, results='hide'}
# Read the airline passenger satisfaction survey data into "satisfaction"
satisfaction = read.csv("C:/Users/student/Documents/Bryant University/4 - Senior Year 2018-19/2018 FA/MATH 421/Final Project/satisfaction.csv")

# Lower-case all column names
colnames(satisfaction) <- tolower(colnames(satisfaction))

# Drop the ID column, then rename selected columns by position
satisfaction$id <- NULL
colnames(satisfaction)[1] <- "target"
colnames(satisfaction)[3] <- "customertype"
colnames(satisfaction)[5] <- "traveltype"
colnames(satisfaction)[22] <- "departdelay"
colnames(satisfaction)[23] <- "arrivaldelay"
```
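
Renaming by position depends on the exact column order of the CSV. A minimal sketch of renaming by name instead is shown here on a small made-up data frame. The data frame `toy`, its values, and the pre-rename column names (`satisfaction`, `customer.type`, `type.of.travel`) are assumptions for illustration, not taken from the real file.

```r
# Hypothetical toy data; column names are assumed, not read from the real CSV.
toy <- data.frame(
  satisfaction   = c("satisfied", "neutral or dissatisfied"),
  gender         = c("Female", "Male"),
  customer.type  = c("Loyal Customer", "disloyal Customer"),
  type.of.travel = c("Business travel", "Personal Travel"),
  stringsAsFactors = TRUE
)

# Map old names to new names; lookup is by name, so column order does not matter.
rename_map <- c(
  satisfaction   = "target",
  customer.type  = "customertype",
  type.of.travel = "traveltype"
)

# Rename only the columns that appear in the map.
hits <- names(toy) %in% names(rename_map)
names(toy)[hits] <- unname(rename_map[names(toy)[hits]])

print(names(toy))
```

Columns that are not listed in `rename_map` (here `gender`) keep their names.
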


## Graph of Count of Target
```{r graph1, echo = TRUE}
# One-variable graphs (2 graphs)
ggplot(data = satisfaction) + geom_bar(mapping = aes(x = target)) + ggtitle("Count of Target")
```

## Graph of Density of Flight Distance
```{r graph2, echo = TRUE}
ggplot(data = satisfaction) + geom_density(mapping = aes(x = flight.distance)) + ggtitle("Density of Flight Distance")
```

## Graph of Satisfaction by Gender
```{r graph3, echo = TRUE}
# Two-variable graphs (7 graphs)
ggplot(data = satisfaction) + geom_bar(mapping = aes(x = target, fill = gender), position = "dodge") + ggtitle("Satisfaction by Gender")
```

## Graph of Histogram of Satisfaction by Age
```{r graph4, echo = TRUE}
ggplot(data = satisfaction) + geom_histogram(binwidth = 5, mapping = aes(x = age, fill = target), position = "dodge") + ggtitle("Histogram of Satisfaction by Age")
```

## Graph of Level of Seat Comfort by Gender
```{r graph5, echo = TRUE}
ggplot(data = satisfaction) + geom_bar(mapping = aes(x = seat.comfort, fill = gender), position = "dodge") + ggtitle("Level of Seat Comfort by Gender")
```

## Graph of Satisfaction by Travel Type
```{r graph6, echo = TRUE}
ggplot(data = satisfaction) + geom_bar(mapping = aes(x = traveltype, fill = target), position = "dodge") + ggtitle("Satisfaction by Travel Type")
```

## Graph of Satisfaction by Travel Class
```{r graph7, echo = TRUE}
ggplot(data = satisfaction) + geom_bar(mapping = aes(x = class, fill = target), position = "dodge") + ggtitle("Satisfaction by Travel Class")
```

## Graph of Satisfaction by Customer Type
```{r graph8, echo = TRUE}
ggplot(data = satisfaction) + geom_bar(mapping = aes(x = customertype, fill = target), position = "dodge") + ggtitle("Satisfaction by Customer Type")
```

## Graph of Density of Flight Distance by Satisfaction
```{r graph9, echo = TRUE}
# Overlay the two densities with transparency; dodging does not apply to density curves
ggplot(data = satisfaction) + geom_density(mapping = aes(x = flight.distance, fill = target), alpha = 0.5) + ggtitle("Density of Flight Distance by Satisfaction")
```

## Graph of Satisfaction by Gender and Quality of Seat Comfort
```{r graph10, echo = TRUE}
# Three-variable graphs (8 graphs)
ggplot(data = satisfaction) + geom_histogram(binwidth = 1, mapping = aes(x = seat.comfort, fill = target), position = "dodge") + facet_wrap(~gender) + ggtitle("Satisfaction by Gender and Quality of Seat Comfort")
```

## Graph of Satisfaction by Gender and Quality of Food/Drink
```{r graph11, echo = TRUE}
ggplot(data = satisfaction) + geom_histogram(binwidth = 1, mapping = aes(x = food.and.drink, fill = target), position = "dodge") + facet_wrap(~gender) + ggtitle("Satisfaction by Gender and Quality of Food/Drink")
```

## Graph of Satisfaction by Ease of Gate Location Relative to Class Flown
```{r graph12, echo = TRUE}
ggplot(data = satisfaction) + geom_histogram(binwidth = 1, mapping = aes(x = gate.location, fill = target), position = "dodge") + facet_wrap(~class) + ggtitle("Satisfaction by Ease of Gate Location Relative to Class Flown")
```

## Graph of Satisfaction by Gender and Quality of WiFi
```{r graph13, echo = TRUE}
ggplot(data = satisfaction) + geom_histogram(binwidth = 1, mapping = aes(x = inflight.wifi.service, fill = target), position = "dodge") + facet_wrap(~gender) + ggtitle("Satisfaction by Gender and Quality of WiFi")
```

## Graph of Satisfaction by Quality of Entertainment Relative to Loyalty
```{r graph14, echo = TRUE}
ggplot(data = satisfaction) + geom_histogram(binwidth = 1, mapping = aes(x = inflight.entertainment, fill = target), position = "dodge") + facet_wrap(~customertype) + ggtitle("Satisfaction by Quality of Entertainment Relative to Loyalty")
```

## Graph of Satisfaction by Gender and Quality of Leg Room
```{r graph15, echo = TRUE}
ggplot(data = satisfaction) + geom_histogram(binwidth = 1, mapping = aes(x = leg.room.service, fill = target), position = "dodge") + facet_wrap(~gender) + ggtitle("Satisfaction by Gender and Quality of Leg Room")
```

## Graph of Satisfaction by Baggage Handling Relative to Type of Travel
```{r graph16, echo = TRUE}
ggplot(data = satisfaction) + geom_histogram(binwidth = 1, mapping = aes(x = baggage.handling, fill = target), position = "dodge") + facet_wrap(~traveltype) + ggtitle("Satisfaction by Baggage Handling Relative to Type of Travel")
```

## Graph of Satisfaction by Gender and Cleanliness
```{r graph17, echo = TRUE}
ggplot(data = satisfaction) + geom_histogram(binwidth = 1, mapping = aes(x = cleanliness, fill = target), position = "dodge") + facet_wrap(~gender) + ggtitle("Satisfaction by Gender and Cleanliness")
```


## Factor Variables
```{r echo=TRUE, results='hide'}
satisfaction$seat.comfort = as.factor(satisfaction$seat.comfort)
satisfaction$departure.arrival.time.convenient = as.factor(satisfaction$departure.arrival.time.convenient)
satisfaction$food.and.drink  = as.factor(satisfaction$food.and.drink)
satisfaction$gate.location = as.factor(satisfaction$gate.location)
satisfaction$inflight.wifi.service = as.factor(satisfaction$inflight.wifi.service)
satisfaction$inflight.entertainment = as.factor(satisfaction$inflight.entertainment)
satisfaction$online.support = as.factor(satisfaction$online.support)
satisfaction$ease.of.online.booking = as.factor(satisfaction$ease.of.online.booking)
satisfaction$on.board.service  = as.factor(satisfaction$on.board.service)
satisfaction$leg.room.service  = as.factor(satisfaction$leg.room.service)
satisfaction$baggage.handling  = as.factor(satisfaction$baggage.handling)
satisfaction$checkin.service  = as.factor(satisfaction$checkin.service)
satisfaction$cleanliness  = as.factor(satisfaction$cleanliness)
satisfaction$online.boarding  = as.factor(satisfaction$online.boarding)

str(satisfaction)
```
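
The same conversion can be written once for a vector of column names. The sketch below uses a made-up data frame (`toy` and `rating_cols` are hypothetical names). It also fixes the level set to 0 through 5, which is a design choice not made in the chunk above: it guarantees that every rating column has the same six levels even if some score never occurs in a column.

```r
# Hypothetical toy data with two 0-5 rating columns and one numeric column.
toy <- data.frame(
  seat.comfort = c(0, 3, 5, 2),
  cleanliness  = c(4, 4, 1, 5),
  age          = c(23, 41, 35, 60)
)

# Columns to treat as categorical ratings (illustrative subset).
rating_cols <- c("seat.comfort", "cleanliness")

# Convert each rating column to a factor with the full 0-5 level set.
toy[rating_cols] <- lapply(toy[rating_cols], factor, levels = 0:5)

str(toy)
```
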

## Impute Missing Values
## Handling missing values #1
The first method for handling missing values was median imputation: medianImpute.
```{r echo=TRUE, results='hide'}
preProcess_med <- preProcess(satisfaction, method = 'medianImpute')
MedData <- predict(preProcess_med, newdata = satisfaction)
sum(is.na(MedData))
```
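
As a rough illustration of what median imputation does, the base R sketch below replaces each missing value in a numeric column with the median of that column's observed values. It runs on made-up numbers and is not caret's internal implementation; `toy` and `impute_median` are hypothetical names.

```r
# Hypothetical toy data with missing delay values.
toy <- data.frame(
  departdelay  = c(0, 10, NA, 30),
  arrivaldelay = c(5, NA, 15, 45)
)

# Count missing values per column before imputing.
print(colSums(is.na(toy)))

# Replace each NA with the median of the non-missing values in its column.
impute_median <- function(x) {
  x[is.na(x)] <- median(x, na.rm = TRUE)
  x
}
num_cols <- sapply(toy, is.numeric)
toy[num_cols] <- lapply(toy[num_cols], impute_median)

print(toy)
```
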
## Handling missing values #2
The second method was k-nearest-neighbors imputation: knnImpute. In caret, this method also centers and scales the numeric predictors.
```{r echo=TRUE, results='hide'}
preProcess_knn <- preProcess(satisfaction, method = 'knnImpute')
KnnData <- predict(preProcess_knn, newdata = satisfaction)
sum(is.na(KnnData))
```
## Handling missing values #3
The third method was bagged-tree imputation: bagImpute.
```{r echo=TRUE, results='hide'}
preProcess_bag <- preProcess(satisfaction, method = 'bagImpute')
BagData <- predict(preProcess_bag, newdata = satisfaction)
sum(is.na(BagData))
```

## Imputation Methods - Decision Trees

MedData

- cp - 0.01355604
- Accuracy - 0.8285032

KnnData

- cp - 0.01355604
- Accuracy - 0.8279180

BagData

- cp - 0.01355604
- Accuracy - 0.8277487


## Encode/Recode Categorical Variables
```{r echo=TRUE, results='hide'}
encodeMedData <- MedData
levels(encodeMedData$seat.comfort) = c("bad","bad","medium","medium","good","good")
levels(encodeMedData$departure.arrival.time.convenient) = c("bad","bad","medium","medium","good","good")
levels(encodeMedData$food.and.drink) = c("bad","bad","medium","medium","good","good")
levels(encodeMedData$gate.location) = c("bad","bad","medium","medium","good","good")
levels(encodeMedData$inflight.wifi.service) = c("bad","bad","medium","medium","good","good")
levels(encodeMedData$inflight.entertainment) = c("bad","bad","medium","medium","good","good")
levels(encodeMedData$online.support) = c("bad","bad","medium","medium","good","good")
levels(encodeMedData$ease.of.online.booking) = c("bad","bad","medium","medium","good","good")
levels(encodeMedData$on.board.service) = c("bad","bad","medium","medium","good","good")
levels(encodeMedData$leg.room.service) = c("bad","bad","medium","medium","good","good")
levels(encodeMedData$baggage.handling) = c("bad","bad","medium","medium","good","good")
levels(encodeMedData$checkin.service) = c("bad","bad","medium","medium","good","good")
levels(encodeMedData$cleanliness) = c("bad","bad","medium","medium","good","good")
levels(encodeMedData$online.boarding) = c("bad","bad","medium","medium","good","good")
```
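
Assigning a plain character vector to `levels()` pairs labels with existing levels by position, so it relies on every column having exactly the levels 0 through 5 in that order. If a column lacked one of the scores, the labels could end up paired with the wrong scores. A hedged alternative is the named-list form of `levels<-`, which names the old levels explicitly. The sketch uses a made-up factor (`rating` is a hypothetical name).

```r
# Hypothetical toy rating on the 0-5 scale; score 0 never occurs in the data.
rating <- factor(c(1, 2, 3, 4, 5, 5), levels = 0:5)

# Named-list form of `levels<-`: each new level lists the old levels it absorbs.
levels(rating) <- list(
  bad    = c("0", "1"),
  medium = c("2", "3"),
  good   = c("4", "5")
)

print(table(rating))
```
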

## Predictive Model of Each Dataset
`set.seed(41)`

Decision Tree

`encodedt1 <- train(target ~ ., data = MedData, method = "rpart", trControl = trainControl(method = "cv", number = 3, verboseIter = TRUE))`


Decision Tree Encoded

`encodedt2 <- train(target ~ ., data = encodeMedData, method = "rpart", trControl = trainControl(method = "cv", number = 3, verboseIter = TRUE))`
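
With `method = "cv"` and `number = 3`, the reported accuracy is an average over three held-out folds. The base R sketch below illustrates that calculation on made-up data. To stay in base R it swaps in a logistic regression (`glm`) for the `rpart` tree, so it shows the resampling idea only, not the models above. All names and the simulated data are hypothetical.

```r
set.seed(41)

# Hypothetical toy data: a binary target loosely driven by one numeric predictor.
n <- 150
toy <- data.frame(x = rnorm(n))
toy$target <- factor(ifelse(toy$x + rnorm(n) > 0, "satisfied", "dissatisfied"))

# Assign each row to one of 3 folds at random.
k <- 3
fold <- sample(rep(seq_len(k), length.out = n))

# For each fold: fit on the other folds, score on the held-out fold.
fold_accuracy <- sapply(seq_len(k), function(i) {
  train_rows <- toy[fold != i, ]
  test_rows  <- toy[fold == i, ]
  fit  <- glm(target ~ x, data = train_rows, family = binomial)
  prob <- predict(fit, newdata = test_rows, type = "response")
  # glm models the probability of the second factor level ("satisfied").
  pred <- ifelse(prob > 0.5, levels(toy$target)[2], levels(toy$target)[1])
  mean(pred == test_rows$target)
})

# Cross-validated accuracy is the mean of the per-fold accuracies.
cv_accuracy <- mean(fold_accuracy)
print(cv_accuracy)
```
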


## Encode/Recode Results on MedData
Non-encoded MedData

- cp - 0.01355604
- Accuracy - 0.8278103

Encoded MedData

- cp - 0.008691511
- Accuracy - 0.8295889

# Predictive Models
## Decision Tree
Decision Tree

- Accuracy - 0.8295889
- cp - 0.008691511

Decision Tree Tuned

- Accuracy - 0.8605483
- cp - 0.005

## Random Forest
Random Forest

- Accuracy - 0.9195950
- mtry - 19
- splitrule - gini

Random Forest Tuned

- Accuracy - 0.8908993
- mtry - 2
- min.node.size - 2

## GLM Net
GLMNet

- Accuracy - 0.8618041 
- alpha - 0.10
- lambda - 0.000597716

GLMNet Tuned

- Accuracy - 0.8625332
- alpha - 0.005
- lambda - 0.001

## Neural Network
Neural Network

- Accuracy - 0.8518706
- decay - 0.0001

Neural Network Tuned

- Accuracy - 0.8620029
- decay - 0.1

## Conclusion
- The best model was the untuned Random Forest (mtry = 19, splitrule = gini), with an accuracy of 0.9195950
- Missing values were filled with median imputation
- All variables rated on the 0 to 5 satisfaction scale were recoded
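
The accuracies reported on the model slides can be collected into one table to make the comparison explicit. This is a small sketch that only restates those numbers; `results` and `best` are hypothetical names.

```r
# Accuracies copied from the model slides above.
results <- data.frame(
  model = c("Decision Tree", "Decision Tree Tuned",
            "Random Forest", "Random Forest Tuned",
            "GLMNet", "GLMNet Tuned",
            "Neural Network", "Neural Network Tuned"),
  accuracy = c(0.8295889, 0.8605483,
               0.9195950, 0.8908993,
               0.8618041, 0.8625332,
               0.8518706, 0.8620029)
)

# Sort from most to least accurate and pick the top row.
results <- results[order(results$accuracy, decreasing = TRUE), ]
print(results, row.names = FALSE)

best <- as.character(results$model[1])
print(best)
```
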

