### INFO 2950 Final Project Phase 2
Harrison Chin (hc955), Julie Jeong (sj598), Claire Jiang (cj337)

In [6]:
import numpy as np
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression, LogisticRegression
import duckdb, sqlalchemy

ModuleNotFoundError: No module named 'duckdb'

In [7]:
%load_ext sql

%config SqlMagic.autopandas = True
%config SqlMagic.feedback = False
%config SqlMagic.displaycon = False

%sql duckdb:///:memory:

ModuleNotFoundError: No module named 'sql'

# Research Question(s)

The overall goal of our project is to provide useful insights about in-flight services that an airline can use to improve customer satisfaction. To do this, we will investigate three key questions: 

**1. Are there differences in ratings across different categories based on whether the customer is loyal or disloyal?**

**2. How do ratings or satisfactions change depending on the type of travel? (Business travel vs Personal Travel)?**

**3. How do ratings change between business and economy class passengers?**

These questions could help an airline identify what inflight services they want to improve for which kind of customer. For example, if it appears that business class passengers are very satisfied with food but economy class passengers are not, then an airline could focus on improving economy class food. 

We also want to create a model that will predict customer satisfaction or dissatisfaction based on various characteristics of the customer ahead of time. It is possible that if this model were put into place, then airlines could see what customers are likely to be dissatisfied and focus more efforts on making sure they are happy with the inflight services. A successful implementation of this concept could cause passengers who are not easy to please to be further inclined to use this airline again, thereby increasing profits. 

# Data Description

In the airline dataset that we have chosen, our observations are passengers who have filled out a satisfaction survey and the attributes are various characteristics of the passengers rating the flight, the flight itself, a rating from 1-5 in terms of satisfaction for various inflight and out of flight services, and a classification of the passenger as either “satisfied” or “neutral/dissatisfied”. The characteristics of the passengers rating the flight are Gender, Customer Type (Loyal customer, disloyal customer), and Age. The characteristics of the flight are Type of Travel (Personal Travel, Business Travel), Class (The Seat Class; Business, Eco, Eco Plus), Flight distance, Departure Delay in Minutes (Minutes delayed when departing), and Arrival Delay in Minutes (Minutes delayed when arriving). The services the passengers are rating are Inflight wifi service, Departure/Arrival time convenience, Ease of Online booking, Gate location, Food and drink, Online boarding, Seat comfort, Inflight entertainment, On-board service, Leg room service, Baggage handling, Check-in service, Inflight service, and Cleanliness.

# Data Collection
Our data was collected from the Kaggle website.

In [None]:
satisfaction_df = pd.read_csv('passenger_satisfaction.csv')

# Data Cleaning

## Checking for Null Values

We wanted to make sure that there would be no empty/null values in our dataframe so we would not have any weird outcomes when computing statistics or creating visualizations.

In [None]:
satisfaction_df.isnull().values.any()

## Uncleaned Data

In [None]:
satisfaction_df.head()

In [None]:
satisfaction_df.columns

## Reasoning for Removing Columns

We decided to remove columns that did not pertain to the inflight aspect of the airline as we suspect that these categories are not something that the airline can necessarily completely control on their own. Our goal is to help airlines pinpoint categories that they can improve on and raise satisfaction ratings overall and we believe that an airline can change the most within their own aircraft and the services they provide within it. So, I used the built-in pandas drop function to drop the columns that we didn’t need. Specifically, we dropped  'Departure/Arrival time convenient', 'Ease of Online booking', 'Gate location', 'Online boarding', 'Baggage handling', 'Checkin service',   'Departure Delay in Minutes', 'Arrival Delay in Minutes', 'Unnamed: 0'. 
 

In [None]:
satisfaction_df = satisfaction_df.rename(columns={"Customer Type": "CustomerType", 
                                                  "Flight Distance": "FlightDistance", 
                                                  "Food and drink": "FoodAndDrink", 
                                                  "Seat comfort": "SeatComfort", 
                                                  "Inflight entertainment": "InflightEntertainment", 
                                                  "Leg room service": "LegRoomService", 
                                                  'Inflight wifi service': 'InflightWifi',
                                                  'On-board service': 'OnboardService',
                                                  'Type of Travel': 'TravelType',
                                                  'Inflight service': 'InflightService',
                                                  'satisfaction': 'Satisfaction'})
satisfaction_df.columns

## Cleaned Data

In [None]:
satisfaction_df.drop(columns=['Departure/Arrival time convenient', 'Ease of Online booking',
       'Gate location', 'Online boarding', 'Baggage handling', 'Checkin service', 
                      'Departure Delay in Minutes', 'Arrival Delay in Minutes', 'Unnamed: 0'], inplace=True)

satisfaction_df.head()

## Reasons for Removing Short Distance Flights

We wanted to remove short to mid distance flights when considering satisfaction levels because longer distance flights will cause more categories to aggravate customers as they are on the plane longer than usual. Short to mid distance flights constitute flights that are less than 3000 miles long or about 6 hours long or less. This will expose the weaknesses that airlines have within their flights and will help them realize what they need to improve. In addition, the lower the chance that customers will feel less satisfied with certain categories because they are on the plane for a shorter time. By focusing on the longer flights and removing shorter distance flights, we can create better predictions on satisfaction levels of customers. We did this by using a SQL query to filter flights that had distances greater than or equal to 3000 to create a new data frame that we would use for our data analysis.

In [None]:
%sql cleaned_satisfaction_df << SELECT * FROM satisfaction_df\
WHERE FlightDistance >= 3000

cleaned_satisfaction_df.head()

In [None]:
#export as cleaned so we can use the cleaned data directly
cleaned_satisfaction_df.to_csv('cleaned_satisfaction_data.csv', index=False)

In [None]:
cleaned_satisfaction_df = pd.read_csv("cleaned_satisfaction_data.csv")

In [None]:
cleaned_satisfaction_df

## Encoding Categorical Variables into Binary Values

In order to be able to utilize the different categories that we want to check rating/satisfaction values against, we encoded Gender, CustomerType, TravelType, Class, and Satisfaction into binary values using LabelEncoder() such that we could better facilitate our data analysis.

In [None]:
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()

cleaned_satisfaction_df['Gender'] = le.fit_transform(cleaned_satisfaction_df['Gender']) # Female: 0, Male: 1
cleaned_satisfaction_df['CustomerType'] = le.fit_transform(cleaned_satisfaction_df['CustomerType']) #Loyal: 0, Disloyal: 1
cleaned_satisfaction_df['TravelType'] = le.fit_transform(cleaned_satisfaction_df['TravelType']) # Business travel: 0, Personal travel: 1
cleaned_satisfaction_df['Class'] = le.fit_transform(cleaned_satisfaction_df['Class']) # Business: 0, Eco: 1, Other: ?
cleaned_satisfaction_df['Satisfaction'] = le.fit_transform(cleaned_satisfaction_df['Satisfaction']) # Neutral or Dissatisfied: 0, Satisfied: 1

cleaned_satisfaction_df

In [None]:
cleaned_satisfaction_df.columns

In [None]:
cleaned_satisfaction_df.corr()

In [None]:
#Correlation Matrix for variables
plt.figure(figsize=(12,12))
ax = sns.heatmap(cleaned_satisfaction_df.corr(), annot=True)

In [None]:
# Female: 0, Male: 1
cleaned_satisfaction_df.groupby(['Gender']).mean()
gender_df = cleaned_satisfaction_df.groupby(['Gender']).mean()

# Ratings based on gender 

There is no apparent difference in ratings when separated by gender. This is somewhat expected.

In [None]:
# Reference: https://www.geeksforgeeks.org/plotting-multiple-bar-charts-using-matplotlib-in-python/

X = ['InflightWifi', 'FoodAndDrink', 'SeatComfort',
       'InflightEntertainment', 'OnboardService', 'LegRoomService',
       'InflightService', 'Cleanliness', 'Satisfaction']

In [None]:
Female = gender_df.iloc[0, 6:]
Male = gender_df.iloc[1, 6:]
  
X_axis = np.arange(len(X))

plt.figure(figsize=(17,6))

plt.bar(X_axis - 0.2, Female, 0.4, label = 'Female')
plt.bar(X_axis + 0.2, Male, 0.4, label = 'Male')
  
plt.xticks(X_axis, X)
plt.xlabel("Categories")
plt.ylabel("Ratings")
plt.title("Ratings by Gender")
plt.legend()
plt.show()

# Ratings based on Customer Type

Overall, loyal customers have higher average ratings than disloyal customers. Among categories, FoodAndDrink has the biggest difference and legRoomService has the smallest difference.

In [None]:
#Loyal: 0, Disloyal: 1

cleaned_satisfaction_df.groupby(['CustomerType']).mean()
customer_df = cleaned_satisfaction_df.groupby(['CustomerType']).mean()

In [None]:
Loyal = customer_df.iloc[0, 6:]
Disloyal = customer_df.iloc[1, 6:]
  
X_axis = np.arange(len(X))

plt.figure(figsize=(17,6))

plt.bar(X_axis - 0.2, Loyal, 0.4, label = 'Loyal')
plt.bar(X_axis + 0.2, Disloyal, 0.4, label = 'Disloyal')
  
plt.xticks(X_axis, X)
plt.xlabel("Categories")
plt.ylabel("Ratings")
plt.title("Ratings by Customer Type")
plt.legend()
plt.show()

# Ratings based on Travel Type

Overall, passengers on business travel have higher average ratings than passengers on personal travel. Among categories, InflightService and FoodAndDrink have the biggest differences and legRoomService has the smallest difference.

In [None]:
# Business travel: 0, Personal travel: 1
cleaned_satisfaction_df.groupby(['TravelType']).mean()
travel_df = cleaned_satisfaction_df.groupby(['TravelType']).mean()

In [None]:
Business = travel_df.iloc[0, 6:]
Personal = travel_df.iloc[1, 6:]
  
X_axis = np.arange(len(X))

plt.figure(figsize=(17,6))

plt.bar(X_axis - 0.2, Business, 0.4, label = 'Business')
plt.bar(X_axis + 0.2, Personal, 0.4, label = 'Personal')
  
plt.xticks(X_axis, X)
plt.xlabel("Categories")
plt.ylabel("Ratings")
plt.title("Ratings by Travel Type")
plt.legend()
plt.show()

# Ratings based on Class

Overall, business class passengers have higher average ratings than economy class passengers. Note that class 'others' was not included in the bar chart. Among categories, InflightService, SeatComfort FoodAndDrink have the biggest differences and legRoomService has the smallest difference.

In [None]:
# Business: 0, Eco: 1, Others: 2
class_df = cleaned_satisfaction_df.groupby(['Class']).mean()
class_df

In [None]:
Business = class_df.iloc[0, 6:]
Eco = class_df.iloc[1, 6:]
  
X_axis = np.arange(len(X))

plt.figure(figsize=(17,6))

plt.bar(X_axis - 0.2, Business, 0.4, label = 'Business')
plt.bar(X_axis + 0.2, Eco, 0.4, label = 'Eco')
  
plt.xticks(X_axis, X)
plt.xlabel("Categories")
plt.ylabel("Ratings")
plt.title("Ratings by Class")
plt.legend()
plt.show()

# Ratings based on Overall Satisfaction

Among customers whose overall satisfaction was "satisfied" vs "neutral or dissatisfied", each category had the following differences in ratings:<br>
- InflightEntertainment    1.685928 <br>
- OnboardService           1.528124 <br>
- LegRoomService           1.449006 <br>
- InflightService          1.422147 <br>
- Cleanliness              1.348444 <br>
- SeatComfort              1.233816 <br>
- FoodAndDrink             0.790447 <br>
- InflightWifi             0.442529 <br>

From this, we can infer that InflightEntertainment, OnboardService, LegRoomService have the most impact in determining passengers' satisfaction.

In [None]:
# Neutral or Dissatisfied: 0, Satisfied: 1
satisfaction_df = cleaned_satisfaction_df.groupby(['Satisfaction']).mean()
satisfaction_df

In [None]:
X_satisfaction = ['InflightWifi', 'FoodAndDrink', 'SeatComfort',
       'InflightEntertainment', 'OnboardService', 'LegRoomService',
       'InflightService', 'Cleanliness']

NeutralOrDissatisfied = satisfaction_df.iloc[0, 7:]
Satisfied = satisfaction_df.iloc[1, 7:]
  
X_axis = np.arange(len(X_satisfaction))

plt.figure(figsize=(17,6))

plt.bar(X_axis - 0.2, NeutralOrDissatisfied, 0.4, label = 'NeutralOrDissatisfied')
plt.bar(X_axis + 0.2, Satisfied, 0.4, label = 'Satisfied')
  
plt.xticks(X_axis, X_satisfaction)
plt.xlabel("Categories")
plt.ylabel("Ratings")
plt.title("Ratings by Satisfaction")
plt.legend()
plt.show()

In [None]:
print("Difference in Ratings (Satisfied - NeutralOrDissatisfied)")

satisfaction_diff = Satisfied - NeutralOrDissatisfied
satisfaction_diff.sort_values(ascending=False)

# Limitations

A main limitation of our data is that satisfaction is not an objective metric. For example, if two people feel the same absolute satisfaction about something, it is still possible that they rate that attribute of their flight differently. Additionally, something that is extremely displeasing to one passenger could be just a minor inconvenience to another, so expectations and mood also play into this survey. Other factors could influence the ratings as well. For example, if someone usually flies economy class and flies business class because someone else pays for them, then they might rate services higher or lower just because they are used to different ones. The classification of customers also falls into only two categories: “satisfied” and “neutral or dissatisfied,” which are not very specifically reflective of someone’s experience. Additionally, for the larger goal of our project to help airlines improve customer satisfaction, this dataset only pertains to one airline and is therefore less generalizable, but our project could help an airline see what customers value. 


# Questions For Reviewers

1. Do you have suggestions for other directions we could take our data exploration?
2. Should we make our research questions more broad? More specific? Do you think there is enough for us to do?
3. If there isn't a significant finding to our research questions, should we change our direction or have conclusions with that?

# Phase 4

We used hypothesis testing to see if the differences in average ratings for different categories are significantly different between two types of customers. First, business vs economy class travels, then loyal and disloyal customers, and finally between customers travelling for personal vs business reasons. 

$H_0: \mu_1 - \mu_2 = 0$

$H_a: \mu_1 - \mu_2 > 0$

In [None]:
#function for calculating t scores
def t_scores(arr_1, arr_2, list_vars):
    n1 = len(arr_1)
    n2 = len(arr_2)
    for s in list_vars:
        std_1 = np.std(arr_1[s])
        std_2 = np.std(arr_2[s])
        t_score = (np.mean(arr_1[s]) - np.mean(arr_2[s]))/((std_1**2/n1) + (std_2**2/2))**0.5
        print(s, ": ", round(t_score, 2))

In [None]:
#list of variables we are calculating t scores for 
list_vars = ['InflightWifi', 'FoodAndDrink', 'SeatComfort',
       'InflightEntertainment', 'OnboardService', 'LegRoomService',
       'InflightService', 'Cleanliness']

# Ratings based on Class

Overall, there is not a significant difference in the averages of ratings for any category betweeen business and economy class at the 95% confidence level. The difference in average food and drink rating is significant at the 70% confidence level. 

In [None]:
bus = cleaned_satisfaction_df[cleaned_satisfaction_df["Class"] == 0][list_vars]
eco = cleaned_satisfaction_df[cleaned_satisfaction_df["Class"] == 1][list_vars]

print("degrees of freedom: ", len(bus) - 1 + len(eco) - 1)

In [None]:
t_scores(bus, eco, list_vars)

# Ratings based on Customer Type

Overall, there is not a significant difference in the averages of ratings for any category between disloyal and loyal customers at the 95% confidence level. The difference in average food and drink rating is significant at the 70% confidence level. 

In [None]:
loyal = cleaned_satisfaction_df[cleaned_satisfaction_df["CustomerType"]==0][list_vars]
disloyal = cleaned_satisfaction_df[cleaned_satisfaction_df["CustomerType"]==1][list_vars]

print("degrees of freedom: ", len(loyal) - 1 + len(disloyal) - 1)

In [None]:
t_scores(loyal, disloyal, list_vars)

# Ratings based on Travel Type

There is no significant difference in the averages of ratings for any category between business and personal travel customers at the 95% confidence level. The differences in average food and drink and inflight service ratings are significant at the 60% confidence level. 

In [None]:
bustrav = cleaned_satisfaction_df[cleaned_satisfaction_df["TravelType"]==0][list_vars]
perstrav = cleaned_satisfaction_df[cleaned_satisfaction_df["TravelType"]==1][list_vars]

print("degrees of freedom: ", len(bustrav) - 1 + len(perstrav) - 1)

In [None]:
t_scores(bustrav, perstrav, list_vars)

# Hypothesis Test Conclusions

None of these service categoreis received significantly different average ratings between any two categories of customer that we investigated (at the 95% confidence level). Food and Drink was overall the most significant category. 

# Does Inflight Entertainment have a higher effect on customer satisfaction than Inflight Wifi?

In [None]:
satisfaction_train, satisfaction_test = train_test_split(cleaned_satisfaction_df, test_size = .2)

