In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Summarise data quality per column: dtype, plus count and % of NaNs and zeros
def summary(dtf):
    n_rows = dtf.shape[0]
    n_nan = dtf.isna().sum()
    n_zero = (dtf == 0).sum()
    sumary = pd.DataFrame({
        'Type': dtf.dtypes,
        'NaN': n_nan.astype(str) + ' (' + (n_nan * 100 / n_rows).astype(int).astype(str) + '%)',
        'Zeros': n_zero.astype(str) + ' (' + (n_zero * 100 / n_rows).astype(int).astype(str) + '%)',
    })
    # print() returns None, so print the table instead of returning it
    print(sumary)

In [None]:
df_train = pd.read_csv("/kaggle/input/airline-passenger-satisfaction/train.csv")
df_test  = pd.read_csv("/kaggle/input/airline-passenger-satisfaction/test.csv")

# First Glance
First, we examine the data types of the features. Most are numerical, except for a few obviously categorical features (e.g. Gender, Customer Type and Class), which are not hard to handle.
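As a small illustration of how the categorical columns can be listed programmatically instead of read off the `dtypes` output, here is a minimal sketch. The frame `demo` and its values are made up for the example; only the column names mirror the dataset.

```python
import pandas as pd

# Tiny stand-in frame: column names mirror the dataset, values are invented.
demo = pd.DataFrame({
    "Gender": ["Male", "Female", "Female"],
    "Class": ["Eco", "Business", "Eco Plus"],
    "Age": [25, 41, 60],
    "Flight Distance": [460, 2300, 1100],
})

# Split the columns by dtype: everything non-numeric is treated as categorical.
categorical_cols = demo.select_dtypes(exclude="number").columns.tolist()
numeric_cols = demo.select_dtypes(include="number").columns.tolist()

print("categorical:", categorical_cols)
print("numeric:", numeric_cols)

# Listing the distinct values shows what each category needs to be mapped to.
for col in categorical_cols:
    print(col, "->", demo[col].unique().tolist())
```

On the real data the same two `select_dtypes` calls could be run on `df_train`.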

In [None]:
df_train.dtypes

Let's look at a preview of the data. 

The '*Unnamed: 0*' column is a dummy column, and *id* is supposedly the unique identifier of each passenger, which does not give us any useful insight. We will drop both right away.



In [None]:
df_train.head()

In [None]:
df_train.drop(['Unnamed: 0', 'id'], inplace=True, axis=1)
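# Assign the result back: an in-place replace on a selected column may not update df_train in newer pandas
df_train['satisfaction'] = df_train['satisfaction'].map({'satisfied': 1, 'neutral or dissatisfied': 0})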

# Exploratory Data Analysis
Looking at the dataset, these questions came to mind first:
* What are the overall customer demographics in this dataset?
* Are the sub-satisfactions (e.g. Inflight wifi service, Ease of Online booking) affected by customer demographics?
* How much does each sub-satisfaction affect the overall satisfaction level?

These questions are worth exploring in an EDA!

## Customer distribution by Gender
Female passengers slightly outnumber male passengers overall, particularly among young passengers aged 10-30, but the difference is small relative to the size of this dataset.

In [None]:
sns.histplot(x='Age',hue='Gender',data=df_train, binwidth=10)

## Customer distribution by Loyalty
The great majority of passengers are loyal customers, centred around 40 years old. Disloyal customers, on the other hand, are centred around young adults aged 20.

In [None]:
sns.histplot(x='Age',hue='Customer Type',data=df_train, binwidth=10)

## Customer distribution by Class & Travel Type
As expected, the great majority of Business class passengers were on a business trip. For Eco and Eco Plus, travel types are relatively balanced, although Eco has about 5k more passengers on personal travel than on business travel.
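The counts behind a stacked histogram like this can also be read off a cross-tabulation, which makes statements such as "about 5k more" easy to check. A minimal sketch with invented rows (the frame `demo` is an assumption for illustration, not the real data):

```python
import pandas as pd

# Invented rows; only the column names and category labels mirror the dataset.
demo = pd.DataFrame({
    "Class": ["Business", "Business", "Business", "Eco", "Eco", "Eco",
              "Eco Plus", "Eco Plus"],
    "Type of Travel": ["Business travel", "Business travel", "Personal Travel",
                       "Business travel", "Personal Travel", "Personal Travel",
                       "Business travel", "Personal Travel"],
})

# Absolute passenger counts per Class and Type of Travel.
counts = pd.crosstab(demo["Class"], demo["Type of Travel"])

# Row-wise shares: each Class row sums to 1.
shares = pd.crosstab(demo["Class"], demo["Type of Travel"], normalize="index")

print(counts)
print(shares.round(2))
```

Swapping `demo` for `df_train` (before the categories are encoded as numbers) would give the actual figures.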

In [None]:
sns.histplot(x='Class',data=df_train, hue='Type of Travel')

## Relationship between Sub-Satisfactions (Correlation analysis)
1. Overall *satisfaction* is not heavily influenced by any single sub-satisfaction, with the mild exceptions of *Online boarding* (+0.5) and *Inflight entertainment* (+0.4). We could dive further into other features which are highly correlated with these 2 sub-satisfactions.

2. There is a strong positive relationship (+0.7) between *Cleanliness* and both *Food and drink* and *Seat comfort*. This is reasonable, and we could deduce that **cleanliness from a passenger's perspective refers to food hygiene and seat/cabin cleanliness** as a whole. Although *Inflight entertainment* is also highly correlated with *Cleanliness*, a direct link does not make sense (correlation ≠ causation!); it is probably explained by its correlation with *Seat comfort* (+0.6) instead.

3. There is a noticeable positive relationship between *Inflight wifi service* and *Ease of Online booking* (+0.7) as well as *Online boarding* (+0.5). This is also reasonable, as these services offer convenience to tech-savvy passengers during booking and boarding, probably youngsters (we will verify this in the next step).
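Reading values off a large heatmap is error-prone, so it can help to rank the correlations with the target directly. The sketch below uses synthetic ratings (the data and the rule that generates `satisfaction` are assumptions made up for the example) to show the pattern:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500

# Synthetic 1-5 ratings; the column names mirror the dataset, the values do not.
online_boarding = rng.integers(1, 6, size=n)
gate_location = rng.integers(1, 6, size=n)

# Made-up rule: satisfaction depends on online boarding plus noise only.
satisfaction = (online_boarding + rng.normal(0, 1.5, size=n) > 3).astype(int)

demo = pd.DataFrame({
    "Online boarding": online_boarding,
    "Gate location": gate_location,
    "satisfaction": satisfaction,
})

# Correlation of every feature with the target, strongest (by magnitude) first.
corr_with_target = (
    demo.corr()["satisfaction"]
    .drop("satisfaction")
    .sort_values(key=abs, ascending=False)
)
print(corr_with_target.round(2))
```

The same expression applied to the rating columns of `df_train` lists the sub-satisfactions in order of their linear association with *satisfaction*.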

In [None]:
sns.heatmap(df_train[['Inflight wifi service','Departure/Arrival time convenient','Ease of Online booking','Gate location','Food and drink','Online boarding','Seat comfort','Inflight entertainment','On-board service','Leg room service','Baggage handling','Checkin service','Inflight service','Cleanliness','satisfaction']].corr(), annot=True, fmt=".1f")

## Customer Satisfaction breakdown by Age & Sub-satisfaction
It's a good idea to investigate which features each age group prioritizes the most for a better overall experience.

Let's start with an example using *Cleanliness*: we expect overall satisfaction to be higher when the *Cleanliness* rating is higher.

Based on the scatterplot, we do indeed have more satisfied customers at higher *Cleanliness* ratings, especially among seniors aged 50 and above.

Do note that **Orange = Satisfied** and **Blue = Dissatisfied/Neutral**

In [None]:
sns.scatterplot(data=df_train, x='Age',hue='satisfaction', y='Cleanliness')

But how about other features?

Well, based on further analysis, we can categorize the features into 3 groups depending on who thinks they are important:
1. Youngsters and adults aged 35 and below
2. Senior citizens aged 50 and above
3. None (nobody thinks it is important!)
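Because ratings are integers, points in an Age-vs-rating scatterplot overlap heavily, and the visible colour depends partly on drawing order. A numeric cross-check is the share of satisfied passengers per age band and rating. The sketch below is one way to compute it, using synthetic data in which satisfaction depends only on the rating (an assumption for the example, so no age effect should show up). The age-band edges are also my own choice.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 2000

# Synthetic passengers; the real df_train is not used here.
demo = pd.DataFrame({
    "Age": rng.integers(7, 86, size=n),
    "Inflight wifi service": rng.integers(1, 6, size=n),
})
# Made-up rule: a higher wifi rating makes satisfaction more likely.
demo["satisfaction"] = (
    demo["Inflight wifi service"] + rng.normal(0, 1, size=n) > 3.5
).astype(int)

# Hypothetical age bands matching the groups discussed in the text.
demo["Age band"] = pd.cut(
    demo["Age"], bins=[0, 35, 50, 120], labels=["<=35", "36-50", ">50"]
)

# Mean of a 0/1 column = share of satisfied passengers in each cell.
rate = demo.pivot_table(
    index="Age band",
    columns="Inflight wifi service",
    values="satisfaction",
    aggfunc="mean",
    observed=False,
)
print(rate.round(2))
```

If a feature matters more to one age group, its row in this table should rise more steeply from rating 1 to rating 5 than the other rows.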


### Youngsters and adults aged 35 and below
Youngsters and adults tend to prioritize these for a better experience:
* Inflight Wifi Service
* Online Boarding
* Ease of Online Booking

Looking at the scatterplots, youngsters who gave a 4-5 star rating for these features are satisfied overall, although the same seems to hold for passengers aged 50 and above.

This supports our earlier assumption that online services are crucial for the younger age groups!

In [None]:
fig, axs = plt.subplots(1, 3, figsize=(10, 3))
sns.scatterplot(data=df_train, x='Age',hue='satisfaction', y='Inflight wifi service', legend=False, ax=axs[0])
sns.scatterplot(data=df_train, x='Age',hue='satisfaction', y='Online boarding', legend=False, ax=axs[1])
sns.scatterplot(data=df_train, x='Age',hue='satisfaction', y='Ease of Online booking', legend=False, ax=axs[2])
plt.show()

### Senior citizens aged 50 and above
Seniors, on the other hand, put much more weight on these:
* Leg Room Service
* Checkin Service
* Inflight Service
* Inflight Entertainment

Overall satisfaction clearly increases for seniors who rate these features higher, whereas youngsters' satisfaction is largely unaffected by them. Most of these features involve human interaction, so we can say that the behaviour of flight attendants and check-in crew affects the overall satisfaction of seniors.

As for inflight entertainment, seniors rely heavily on it to keep themselves entertained, whereas youngsters prefer their smartphones with inflight wifi, which gives them the flexibility to choose content beyond the inflight TV.

In [None]:
fig, axs = plt.subplots(2, 2, figsize=(10, 10))
sns.scatterplot(data=df_train, x='Age',hue='satisfaction', y='Leg room service', legend=False, ax=axs[0,0])
sns.scatterplot(data=df_train, x='Age',hue='satisfaction', y='Inflight entertainment', legend=False, ax=axs[0,1])
sns.scatterplot(data=df_train, x='Age',hue='satisfaction', y='Checkin service', legend=False, ax=axs[1,0])
sns.scatterplot(data=df_train, x='Age',hue='satisfaction', y='Inflight service', legend=False, ax=axs[1,1])
plt.show()

### Not important features
There are also quite a number of features which do not seem to affect overall satisfaction, regardless of the rating given to them:
* Departure/Arrival Time Convenient
* On-board Service
* Gate Location
* Food and Drink
* Baggage Handling
* Seat Comfort

In [None]:
fig, axs = plt.subplots(2, 3, figsize=(10, 7))
sns.scatterplot(data=df_train, x='Age',hue='satisfaction', y='Departure/Arrival time convenient', legend=False, ax=axs[0, 0])
sns.scatterplot(data=df_train, x='Age',hue='satisfaction', y='On-board service', legend=False, ax=axs[0, 1])
sns.scatterplot(data=df_train, x='Age',hue='satisfaction', y='Gate location', legend=False, ax=axs[0, 2])
sns.scatterplot(data=df_train, x='Age',hue='satisfaction', y='Food and drink', legend=False, ax=axs[1, 0])
sns.scatterplot(data=df_train, x='Age',hue='satisfaction', y='Baggage handling', legend=False, ax=axs[1, 1])
sns.scatterplot(data=df_train, x='Age',hue='satisfaction', y='Seat comfort', legend=False, ax=axs[1, 2])
plt.show()

# Feature Engineering
Time to convert our categorical features into numerical ones.

In [None]:
# Assign the mapped columns back instead of replacing in place on a selected column
df_train['Gender'] = df_train['Gender'].map({'Male': 1, 'Female': 0})
df_train['Type of Travel'] = df_train['Type of Travel'].map({'Business travel': 1, 'Personal Travel': 0})
df_train['Class'] = df_train['Class'].map({'Business': 2, 'Eco Plus': 1, 'Eco': 0})
df_train['Customer Type'] = df_train['Customer Type'].map({'Loyal Customer': 1, 'disloyal Customer': 0})

Flight Distance is positively skewed; we can make its distribution more symmetric using a cube root transformation.
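The effect of the transformation can be quantified with the sample skewness, in addition to judging it from the plots. A minimal sketch on synthetic right-skewed data (a log-normal sample standing in for Flight Distance, which is an assumption of the example):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Synthetic right-skewed "distances"; not the real Flight Distance column.
distance = pd.Series(
    rng.lognormal(mean=6.5, sigma=0.8, size=5000), name="Flight Distance"
)

# Sample skewness: 0 for a symmetric distribution, > 0 for a long right tail.
skew_raw = distance.skew()
skew_cbrt = np.cbrt(distance).skew()
skew_log = np.log1p(distance).skew()  # a common alternative, for comparison

print(f"raw:  {skew_raw:.2f}")
print(f"cbrt: {skew_cbrt:.2f}")
print(f"log:  {skew_log:.2f}")
```

Running the same three lines on `df_train['Flight Distance']` before it is overwritten shows how much of the skew the cube root removes on the real data.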

In [None]:
fig, axs = plt.subplots(2, 2, figsize=(10, 5), gridspec_kw={'height_ratios': [1, 5]})
sns.boxplot(x=df_train['Flight Distance'], orient='h', ax=axs[0, 0])
axs[0, 0].set_xlabel('')
axs[0, 0].set_title('Flight Distance')
sns.histplot(x=df_train['Flight Distance'], ax=axs[1, 0])
axs[1, 0].set_xlabel('')
axs[1, 0].set_ylabel('')
sns.boxplot(x=np.cbrt(df_train['Flight Distance']), orient='h', ax=axs[0, 1])
axs[0, 1].set_xlabel('')
axs[0, 1].set_title('cbrt(Flight Distance)')
sns.histplot(x=np.cbrt(df_train['Flight Distance']), ax=axs[1, 1])
axs[1, 1].set_xlabel('')
axs[1, 1].set_ylabel('')
plt.show()

df_train['Flight Distance']=np.cbrt(df_train['Flight Distance'])

In [None]:
summary(df_train)

Looks like we are not done with data cleansing yet: there are NaNs in the *Arrival Delay* feature.

However, we can simply fill the NaNs using their respective *Departure Delay* (I know you're asking "why!?"; please see the next section, "**Effect of Demographics and Misc. features on Satisfaction**", for the explanation).

## Effect of Demographics and Misc. features on Satisfaction
Gender, Arrival Delay and Departure Delay do not seem to impact satisfaction levels.

The near-perfect correlation between Arrival and Departure Delay is reasonable, because a late departure usually carries over into a similarly late arrival, as flight durations do not vary by much.

In [None]:
sns.heatmap(df_train[['Gender','Type of Travel','Class','Customer Type','Flight Distance','Arrival Delay in Minutes','Departure Delay in Minutes','satisfaction']].corr(), annot=True, fmt=".1f")

*(cont. from Feature Engineering)*

Remember earlier when I said to impute the NaNs in *Arrival Delay* with their respective *Departure Delay* values? Here's why.

Notice in the correlation plot that *Departure Delay* and *Arrival Delay* are almost perfectly correlated (r = +1.0 at the heatmap's one-decimal precision). This could mean either of the following:
1. Arrival Delay = Departure Delay (ideal situation with a simple formula)
2. Arrival Delay = k*Departure Delay+c (the two variables are not equal, but can be modelled by a line with slope k and intercept c)
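The two cases can be told apart by estimating k and c from the rows where both delays are known. The sketch below does this on synthetic delays and then imputes the gaps with the fitted line. The data, the noise level and the use of `np.polyfit` are assumptions of this example, not the notebook's own method.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
n = 1000

# Synthetic delays: arrival = departure + small noise, floored at 0 minutes.
dep = rng.exponential(scale=15, size=n).round()
arr = np.clip(dep + rng.normal(0, 3, size=n), 0, None).round()
demo = pd.DataFrame({
    "Departure Delay in Minutes": dep,
    "Arrival Delay in Minutes": arr,
})
# Knock out a few arrival delays to imitate the missing values.
missing_idx = demo.sample(20, random_state=0).index
demo.loc[missing_idx, "Arrival Delay in Minutes"] = np.nan

# Fit arrival = slope * departure + intercept on the complete rows only.
known = demo.dropna(subset=["Arrival Delay in Minutes"])
slope, intercept = np.polyfit(
    known["Departure Delay in Minutes"], known["Arrival Delay in Minutes"], deg=1
)
r = known["Departure Delay in Minutes"].corr(known["Arrival Delay in Minutes"])
print(f"slope={slope:.3f} intercept={intercept:.3f} r={r:.3f}")

# Fill only the missing entries with the fitted line.
predicted = slope * demo["Departure Delay in Minutes"] + intercept
demo["Arrival Delay in Minutes"] = demo["Arrival Delay in Minutes"].fillna(predicted)
```

A slope close to 1 with an intercept close to 0 would correspond to case 1; anything else is case 2.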

We can plot a simple scatterplot to visualize it (see below).

There is an obvious linear relationship, with a line that probably passes through the origin. Let's fit the line of best fit and impute our NaNs accordingly:

In [None]:
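from sklearn.linear_model import LinearRegression
# Fit only on rows where Arrival Delay is known
known = df_train['Arrival Delay in Minutes'].notna()
x_train = df_train.loc[known, 'Departure Delay in Minutes']
y_train = df_train.loc[known, 'Arrival Delay in Minutes']
# to_frame() gives scikit-learn a 2-D input; reset_index() would add the row index
# as an extra feature and target, so the printed slope would describe index vs. index
LR = LinearRegression().fit(x_train.to_frame(), y_train.to_frame())
print('intercept:', LR.intercept_[0])
print('slope:', LR.coef_[0][0])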

In [None]:
slp=LR.coef_[0][0]
fig, ax = plt.subplots(figsize=(6, 4))
p1 = sns.scatterplot(data=df_train, y='Arrival Delay in Minutes', x='Departure Delay in Minutes', ax=ax)
p2 = sns.lineplot(x=x_train, y=x_train*slp, color='g', ax=ax)

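# Fill the missing arrival delays with slope * departure delay (assign back rather than fill in place)
df_train['Arrival Delay in Minutes'] = df_train['Arrival Delay in Minutes'].fillna(df_train['Departure Delay in Minutes']*slp)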

Since we have identified the variables which do not influence overall satisfaction, we can remove them before training our models on the training data and evaluating them on the test data.

Let's apply the same transformations to the test data first.

In [None]:
# Data Scaling
from sklearn.preprocessing import MinMaxScaler
numeric_col=['Age','Class','Flight Distance','Arrival Delay in Minutes','Departure Delay in Minutes','Inflight wifi service','Departure/Arrival time convenient','Ease of Online booking','Gate location','Food and drink','Online boarding','Seat comfort','Inflight entertainment','On-board service','Leg room service','Baggage handling','Checkin service','Inflight service','Cleanliness']
# Fit the scaler on the training set only, so the test set can reuse the same min/max
scaler = MinMaxScaler()
df_train[numeric_col] = scaler.fit_transform(df_train[numeric_col])
# Drop the features found to have little influence on satisfaction
drop_col = ['Gender', 'Arrival Delay in Minutes','Departure Delay in Minutes','Departure/Arrival time convenient','Gate location']
df_train.drop(drop_col, inplace=True, axis=1)


In [None]:
# Perform same transformation on test set
df_test.drop(['Unnamed: 0', 'id'], inplace=True, axis=1)
df_test['satisfaction'] = df_test['satisfaction'].map({'satisfied': 1, 'neutral or dissatisfied': 0})
df_test['Gender'] = df_test['Gender'].map({'Male': 1, 'Female': 0})
df_test['Type of Travel'] = df_test['Type of Travel'].map({'Business travel': 1, 'Personal Travel': 0})
df_test['Class'] = df_test['Class'].map({'Business': 2, 'Eco Plus': 1, 'Eco': 0})
df_test['Customer Type'] = df_test['Customer Type'].map({'Loyal Customer': 1, 'disloyal Customer': 0})
df_test['Flight Distance']=np.cbrt(df_test['Flight Distance'])
# Reuse the scaler fitted on the training set (transform only, no refit)
df_test[numeric_col] = scaler.transform(df_test[numeric_col])
# Same columns as dropped from df_train, so both sets share one feature list
df_test.drop(drop_col, inplace=True, axis=1)

# Model Training
For this binary classification problem, we can approach it with these algorithms:
1. Logistic Regression
2. Naive Bayes
3. Decision Tree
4. Random Forest
5. K-Nearest Neighbors
6. Support Vector Machine

Their hyperparameters will be optimised using a handy tool called **Optuna**. (see 'Appendix' for steps)
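One thing to watch when tuning: if the objective scores each trial on the test set, the reported test accuracy becomes optimistic, because the hyperparameters were chosen to fit that very set. A common alternative is to score each candidate with cross-validation on the training data only. The sketch below shows such a scoring function on synthetic data. `cv_accuracy`, the candidate grid and the generated dataset are all assumptions made for illustration, and a plain loop stands in for the Optuna study.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for (x_train, y_train); the test set is never touched here.
X_demo, y_demo = make_classification(
    n_samples=600, n_features=10, n_informative=5, random_state=0
)

def cv_accuracy(max_depth, min_samples_leaf):
    """Mean 5-fold cross-validated accuracy for one hyperparameter setting."""
    model = DecisionTreeClassifier(
        max_depth=max_depth, min_samples_leaf=min_samples_leaf, random_state=0
    )
    return cross_val_score(model, X_demo, y_demo, cv=5, scoring="accuracy").mean()

# Hypothetical candidates; a tuner would propose these values instead.
candidates = [(3, 5), (6, 5), (10, 2)]
scores = {params: cv_accuracy(*params) for params in candidates}
best_params = max(scores, key=scores.get)

for params, score in scores.items():
    print(params, round(score, 3))
print("best:", best_params)
```

Inside an Optuna objective, the same idea amounts to returning the cross-validated score instead of the accuracy on `x_test`.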

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score, accuracy_score, roc_auc_score
import optuna

y_train=df_train['satisfaction']
x_train=df_train.drop(['satisfaction'],axis=1, inplace=False)
y_test=df_test['satisfaction']
x_test=df_test.drop(['satisfaction'],axis=1, inplace=False)

In [None]:
# Logistic Regression Model
model=LogisticRegression(solver='lbfgs', fit_intercept=True)
model.fit(x_train,y_train)
y_pred = model.predict(x_test)
print(accuracy_score(y_test, y_pred))
# Only the last expression of a cell is displayed, so print the matrix explicitly
print(confusion_matrix(y_test, y_pred))

# Create table to document model performance
eva = pd.DataFrame(columns=['Algorithm','Accuracy','Precision','Recall','F1-Score'])
# 'row' avoids shadowing the built-in list
row=['Logistic Regression',accuracy_score(y_test, y_pred),precision_score(y_test, y_pred),recall_score(y_test, y_pred), f1_score(y_test, y_pred)]
eva.loc[len(eva)] = row

In [None]:
# Naive Bayes Model
model=GaussianNB()
model.fit(x_train,y_train)
y_pred = model.predict(x_test)
print(accuracy_score(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))

# Append model result
row=['Naive Bayes',accuracy_score(y_test, y_pred),precision_score(y_test, y_pred),recall_score(y_test, y_pred), f1_score(y_test, y_pred)]
eva.loc[len(eva)] = row

In [None]:
# Decision Tree Model
model=DecisionTreeClassifier(criterion='log_loss', splitter='best', max_depth=10, min_samples_split=36, min_samples_leaf=9, max_features='log2')
model.fit(x_train,y_train)
y_pred = model.predict(x_test)
print(accuracy_score(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))

# Append model result
row=['Decision Tree',accuracy_score(y_test, y_pred),precision_score(y_test, y_pred),recall_score(y_test, y_pred), f1_score(y_test, y_pred)]
eva.loc[len(eva)] = row

In [None]:
# Random Forest Model
model=RandomForestClassifier(criterion='log_loss', max_depth=15, min_samples_split=3, min_samples_leaf=4, max_features='log2')
model.fit(x_train,y_train)
y_pred = model.predict(x_test)
print(accuracy_score(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))

# Append model result
row=['Random Forest',accuracy_score(y_test, y_pred),precision_score(y_test, y_pred),recall_score(y_test, y_pred), f1_score(y_test, y_pred)]
eva.loc[len(eva)] = row

In [None]:
# K-Nearest Neighbors Model
model=KNeighborsClassifier(n_neighbors=5,  weights='uniform', algorithm='auto')
model.fit(x_train,y_train)
y_pred = model.predict(x_test)
print(accuracy_score(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))

# Append model result
row=['K-Nearest Neighbors',accuracy_score(y_test, y_pred),precision_score(y_test, y_pred),recall_score(y_test, y_pred), f1_score(y_test, y_pred)]
eva.loc[len(eva)] = row

In [None]:
# Support Vector Machine Model
model=SVC()
model.fit(x_train,y_train)
y_pred = model.predict(x_test)
print(accuracy_score(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))

# Append model result
row=['SVC',accuracy_score(y_test, y_pred),precision_score(y_test, y_pred),recall_score(y_test, y_pred), f1_score(y_test, y_pred)]
eva.loc[len(eva)] = row

# Model Evaluation
The Random Forest outperforms all other algorithms on every metric, including precision and recall. However, a forest of deep trees can still overfit. The Support Vector Machine, on the other hand, has a built-in regularization term that helps against overfitting, and its performance is comparable to the Random Forest.

Therefore, the **Support Vector Machine** model should be considered for this dataset.
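Whether a model actually overfits can be checked by comparing its accuracy on the training data with its accuracy on held-out data: a large gap suggests overfitting. The sketch below runs that comparison on a synthetic dataset (an assumption for the example), so the numbers it prints say nothing about the airline data itself.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic binary classification problem standing in for the real features.
X, y = make_classification(
    n_samples=1000, n_features=12, n_informative=6, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

models = [
    ("Random Forest", RandomForestClassifier(max_depth=15, random_state=0)),
    ("SVC", SVC()),
]

# Train-minus-test accuracy per model; a larger gap hints at more overfitting.
gaps = {}
for name, model in models:
    model.fit(X_tr, y_tr)
    train_acc = accuracy_score(y_tr, model.predict(X_tr))
    test_acc = accuracy_score(y_te, model.predict(X_te))
    gaps[name] = train_acc - test_acc
    print(f"{name}: train={train_acc:.3f} test={test_acc:.3f} gap={gaps[name]:.3f}")
```

Applying the same loop to the tuned models with `x_train`, `y_train`, `x_test` and `y_test` would show whether the overfitting concern holds for this dataset.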

In [None]:
eva=eva.set_index('Algorithm')
eva.sort_values(by=['Accuracy'], ascending=False)

# Appendix

In [None]:
# def objective(trial):
#     criterion = trial.suggest_categorical('criterion', ['log_loss', 'entropy', 'gini'])
#     max_depth = trial.suggest_int('max_depth', 8, 15)
#     min_samples_split = trial.suggest_int('min_samples_split', 2, 20)
#     min_samples_leaf = trial.suggest_int('min_samples_leaf', 2, 5)
#     max_features = trial.suggest_categorical('max_features', ['sqrt','log2'])
#     model = RandomForestClassifier(criterion=criterion, max_depth=max_depth, min_samples_split=min_samples_split, min_samples_leaf=min_samples_leaf, max_features=max_features)
#     model.fit(x_train,y_train)
#     y_pred = model.predict(x_test)
#     acc = accuracy_score(y_test, y_pred)
#     return acc

# study = optuna.create_study(direction='maximize')
# study.optimize(objective, n_trials=20)
# study.best_params

In [None]:
# def objective(trial):
#     n_neighbors = trial.suggest_int('n_neighbors' , 2 , 20)
#     weights = trial.suggest_categorical('weights', ['uniform', 'distance'])
#     #algorithm = trial.suggest_categorical('algorithm' , ['auto', 'ball_tree', 'kd_tree', 'brute'])
#     #leaf_size = trial.suggest_int('leaf_size' , 1 , 5)
#     model = KNeighborsClassifier(n_neighbors=n_neighbors, weights=weights, algorithm='ball_tree')
#     model.fit(x_train,y_train)
#     y_pred = model.predict(x_test)
#     acc = accuracy_score(y_test, y_pred)
#     return acc

# study = optuna.create_study(direction='maximize')
# study.optimize(objective, n_trials=20)
# study.best_params

In [None]:
# def objective(trial):
#     #criterion = trial.suggest_categorical('criterion', ['log_loss', 'entropy', 'gini'])
#     #splitter = trial.suggest_categorical('splitter', ['best','random'])
#     #max_depth = trial.suggest_int('max_depth', 5, 10)
#     min_samples_split = trial.suggest_int('min_samples_split', 20, 50)
#     min_samples_leaf = trial.suggest_int('min_samples_leaf', 2, 30)
#     #max_features = trial.suggest_categorical('max_features', ['sqrt','log2'])
#     model = DecisionTreeClassifier(criterion='log_loss', splitter='best', max_depth=10, min_samples_split=min_samples_split, min_samples_leaf=min_samples_leaf, max_features='log2')
#     model.fit(x_train,y_train)
#     y_pred = model.predict(x_test)
#     acc = accuracy_score(y_test, y_pred)
#     return acc

# study = optuna.create_study(direction='maximize')
# study.optimize(objective, n_trials=50)
# study.best_params