# Airline Customer Satisfaction Prediction

In this project, we try to predict whether an airline customer will be satisfied, based on features such as wifi service, food and drink service, and ease of online booking.

To achieve this, we use machine learning algorithms. We will build different models using:

* Logistic Regression
* Support Vector Machine
* Multi Layer Perceptron
* Random Forest
* Gradient Boosting

Then we compare the models against each other and pick the one that predicts most accurately.
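The comparison can be sketched as follows. This is only a minimal sketch on synthetic data from scikit-learn's `make_classification`, not on the airline dataset. The hyperparameters, the `models` dictionary, the scaling step, and the use of plain accuracy as the metric are all assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the real features and the satisfaction label
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Scale-sensitive models get a StandardScaler; tree ensembles do not need one
models = {
    "Logistic Regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "Support Vector Machine": make_pipeline(StandardScaler(), SVC()),
    "Multi Layer Perceptron": make_pipeline(
        StandardScaler(), MLPClassifier(max_iter=500, random_state=0)
    ),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "Gradient Boosting": GradientBoostingClassifier(random_state=0),
}

# Fit every model on the same split and record its test accuracy
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, model.predict(X_test))

best_model = max(scores, key=scores.get)
for name, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {score:.3f}")
print("Best:", best_model)
```

Using one shared train/test split keeps the comparison fair, since every model is scored on exactly the same held-out rows.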

## First, let us import the needed libraries and the data

In [1]:
import pandas as pd

test_data = pd.read_csv('test.csv')
train_data = pd.read_csv('train.csv')

In [2]:
# DataFrame.append was removed in pandas 2.0; pd.concat stacks the rows the same way
data = pd.concat([train_data, test_data])
data.head()

Unnamed: 0.1,Unnamed: 0,id,Gender,Customer Type,Age,Type of Travel,Class,Flight Distance,Inflight wifi service,Departure/Arrival time convenient,...,Inflight entertainment,On-board service,Leg room service,Baggage handling,Checkin service,Inflight service,Cleanliness,Departure Delay in Minutes,Arrival Delay in Minutes,satisfaction
0,0,70172,Male,Loyal Customer,13,Personal Travel,Eco Plus,460,3,4,...,5,4,3,4,4,5,5,25,18.0,neutral or dissatisfied
1,1,5047,Male,disloyal Customer,25,Business travel,Business,235,3,2,...,1,1,5,3,1,4,1,1,6.0,neutral or dissatisfied
2,2,110028,Female,Loyal Customer,26,Business travel,Business,1142,2,2,...,5,4,3,4,4,4,5,0,0.0,satisfied
3,3,24026,Female,Loyal Customer,25,Business travel,Business,562,2,5,...,2,2,5,3,1,4,2,11,9.0,neutral or dissatisfied
4,4,119299,Male,Loyal Customer,61,Business travel,Business,214,3,3,...,3,3,4,4,3,3,3,0,0.0,satisfied


In [3]:
data.shape

(129880, 25)

In [4]:
data.columns

Index(['Unnamed: 0', 'id', 'Gender', 'Customer Type', 'Age', 'Type of Travel',
       'Class', 'Flight Distance', 'Inflight wifi service',
       'Departure/Arrival time convenient', 'Ease of Online booking',
       'Gate location', 'Food and drink', 'Online boarding', 'Seat comfort',
       'Inflight entertainment', 'On-board service', 'Leg room service',
       'Baggage handling', 'Checkin service', 'Inflight service',
       'Cleanliness', 'Departure Delay in Minutes', 'Arrival Delay in Minutes',
       'satisfaction'],
      dtype='object')

## We should check if there are values that are missing

In [5]:
data.isnull().sum()

Unnamed: 0                             0
id                                     0
Gender                                 0
Customer Type                          0
Age                                    0
Type of Travel                         0
Class                                  0
Flight Distance                        0
Inflight wifi service                  0
Departure/Arrival time convenient      0
Ease of Online booking                 0
Gate location                          0
Food and drink                         0
Online boarding                        0
Seat comfort                           0
Inflight entertainment                 0
On-board service                       0
Leg room service                       0
Baggage handling                       0
Checkin service                        0
Inflight service                       0
Cleanliness                            0
Departure Delay in Minutes             0
Arrival Delay in Minutes             393
satisfaction                           0
dtype: int64

We see that 393 values in the "Arrival Delay in Minutes" column are missing. We should find out whether they are missing at random or whether the missingness carries meaning (for example, the arrival was not delayed, so nothing was recorded instead of a 0).

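To do that, we can use the "Departure/Arrival time convenient" feature: if a missing delay meant an on-time arrival, we would expect those rows to be rated noticeably more convenient than the rest.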

In [6]:
data.groupby(data['Arrival Delay in Minutes'].isnull())['Departure/Arrival time convenient'].mean()

Arrival Delay in Minutes
False    3.057349
True     3.139949
Name: Departure/Arrival time convenient, dtype: float64
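How grouping by a boolean missingness mask works can be seen on a toy frame. This is a small sketch with made-up values; the column names `delay` and `rating` are hypothetical stand-ins for the real columns.

```python
import numpy as np
import pandas as pd

# Made-up data: two of the five delays are missing
toy = pd.DataFrame({
    "delay": [5.0, np.nan, 0.0, np.nan, 12.0],
    "rating": [2, 4, 3, 2, 4],
})

# True where the delay is missing, False where it is recorded
is_missing = toy["delay"].isnull()

# Mean rating of the rows with a recorded delay vs. the rows without one
mean_rating = toy.groupby(is_missing)["rating"].mean()
print(mean_rating)
```

In this toy case both groups average 3.0, which is the pattern we would read as "missingness is unrelated to the rating".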

The mean rating is almost the same in both groups (3.06 vs. 3.14), so a missing value does not appear to imply that the plane arrived on time. We therefore fill the missing entries with the mean of "Arrival Delay in Minutes".

In [7]:
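# Assign the result back instead of using inplace=True on a column selection,
# which may not update the DataFrame in recent pandas versions
data['Arrival Delay in Minutes'] = data['Arrival Delay in Minutes'].fillna(data['Arrival Delay in Minutes'].mean())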
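The effect of mean imputation can be checked on a toy column. This is a minimal sketch with made-up numbers; the `delay` column is a hypothetical stand-in for "Arrival Delay in Minutes".

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"delay": [10.0, np.nan, 20.0, np.nan, 0.0]})

# Series.mean skips NaN by default: (10 + 20 + 0) / 3 = 10.0
mean_delay = toy["delay"].mean()

# Replace each missing entry with that mean and assign the result back
toy["delay"] = toy["delay"].fillna(mean_delay)

remaining = int(toy["delay"].isnull().sum())
print(mean_delay, remaining)
```

If the delays turn out to be heavily skewed, the median is a common alternative to the mean, since a few very long delays can pull the mean upwards.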

## Now we encode the non-numeric categorical values so that the models can work with them

In [8]:
data['satisfaction'].value_counts()

neutral or dissatisfied    73452
satisfied                  56428
Name: satisfaction, dtype: int64

In [9]:
satisfaction_num = {'neutral or dissatisfied': 0, 'satisfied': 1} 
data['satisfaction'] = data['satisfaction'].map(satisfaction_num) 

In [10]:
data['Gender'].value_counts()

Female    65899
Male      63981
Name: Gender, dtype: int64

In [11]:
gender_num = {'Female': 0, 'Male': 1} 
data['Gender'] = data['Gender'].map(gender_num) 

In [12]:
data['Customer Type'].value_counts()

Loyal Customer       106100
disloyal Customer     23780
Name: Customer Type, dtype: int64

In [13]:
loyal_num = {'Loyal Customer': 0, 'disloyal Customer': 1} 
data['Customer Type'] = data['Customer Type'].map(loyal_num) 

In [14]:
data['Type of Travel'].value_counts()

Business travel    89693
Personal Travel    40187
Name: Type of Travel, dtype: int64

In [15]:
traveltype_num = {'Business travel': 0, 'Personal Travel': 1} 
data['Type of Travel'] = data['Type of Travel'].map(traveltype_num) 

In [16]:
data['Class'].value_counts()

Business    62160
Eco         58309
Eco Plus     9411
Name: Class, dtype: int64

In [17]:
class_num = {'Business': 0, 'Eco': 1, 'Eco Plus': 2} 
data['Class'] = data['Class'].map(class_num) 
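Two things are worth keeping in mind with dictionary mapping. Any label that is absent from the dictionary silently becomes NaN, and integer codes for a three-valued column such as "Class" suggest an ordering to the model. The sketch below, on made-up rows, checks for unmapped labels and shows one-hot encoding with `pd.get_dummies` as a possible alternative. It is an illustration, not the encoding used in this notebook.

```python
import pandas as pd

toy = pd.DataFrame({"Class": ["Business", "Eco", "Eco Plus", "Eco"]})

# Same mapping style as above; labels missing from the dict would map to NaN
class_num = {"Business": 0, "Eco": 1, "Eco Plus": 2}
mapped = toy["Class"].map(class_num)
unmapped = int(mapped.isnull().sum())

# Alternative: one indicator column per category, with no implied order
one_hot = pd.get_dummies(toy["Class"], prefix="Class")

print(unmapped)
print(list(one_hot.columns))
```

For the binary columns (gender, customer type, type of travel) a single 0/1 column is already equivalent to a one-hot encoding with one level dropped.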

## Dropping the unnecessary columns

We drop the "Unnamed: 0" and "id" columns: they are arbitrary row identifiers and carry no information about the satisfaction label.

In [18]:
data = data.drop(['Unnamed: 0', 'id'],axis=1)
data.head(20)

Unnamed: 0,Gender,Customer Type,Age,Type of Travel,Class,Flight Distance,Inflight wifi service,Departure/Arrival time convenient,Ease of Online booking,Gate location,...,Inflight entertainment,On-board service,Leg room service,Baggage handling,Checkin service,Inflight service,Cleanliness,Departure Delay in Minutes,Arrival Delay in Minutes,satisfaction
0,1,0,13,1,2,460,3,4,3,1,...,5,4,3,4,4,5,5,25,18.0,0
1,1,1,25,0,0,235,3,2,3,3,...,1,1,5,3,1,4,1,1,6.0,0
2,0,0,26,0,0,1142,2,2,2,2,...,5,4,3,4,4,4,5,0,0.0,1
3,0,0,25,0,0,562,2,5,5,5,...,2,2,5,3,1,4,2,11,9.0,0
4,1,0,61,0,0,214,3,3,3,3,...,3,3,4,4,3,3,3,0,0.0,1
5,0,0,26,1,1,1180,3,4,2,1,...,1,3,4,4,4,4,1,0,0.0,0
6,1,0,47,1,1,1276,2,4,2,3,...,2,3,3,4,3,5,2,9,23.0,0
7,0,0,52,0,0,2035,4,3,4,4,...,5,5,5,5,4,5,4,4,0.0,1
8,0,0,41,0,0,853,1,2,2,2,...,1,1,2,1,4,1,2,0,0.0,0
9,1,1,20,0,1,1061,3,3,3,4,...,2,2,3,4,4,3,2,0,0.0,0


Saving the clean data so that we can use it later.

In [20]:
data.to_csv('data.csv')
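By default `to_csv` also writes the row index, which comes back as an extra "Unnamed: 0" column when the file is read again. The sketch below shows the difference on a toy frame, using in-memory buffers instead of files. Whether to pass `index=False` here is a design choice that depends on how the file is loaded later.

```python
import io

import pandas as pd

toy = pd.DataFrame({"Age": [13, 25], "satisfaction": [0, 1]})

# Default: the index is written as an unnamed first column
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
reloaded_with_index = pd.read_csv(with_index)

# index=False: only the data columns are written
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
reloaded_without_index = pd.read_csv(without_index)

print(list(reloaded_with_index.columns))     # ['Unnamed: 0', 'Age', 'satisfaction']
print(list(reloaded_without_index.columns))  # ['Age', 'satisfaction']
```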