## Introduction

We can classify customer satisfaction as a performance indicator that shows how well a company is able to meet consumer expectations before, during, and after a purchase.

When it is high, this metric signals compatibility between what the company offers and what the customer needs. It is a good gauge of whether the service provided and the experience created make sense for the target audience.

On the other hand, low customer satisfaction means that the consumer's expectations were not met, whether with the product or with the service. This can greatly harm your brand reputation.

In this project, the customer satisfaction dataset extracted from Kaggle will be analyzed, and the following phases will be implemented:

    1) Data import, 
    2) Data analysis, 
    3) Data transformation, 
    4) Loading of processed data, 
    5) Machine learning model training, 
    6) Model testing, and
    7) Model implementation.

## Import

In [1]:
# Importing libraries

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from ydata_profiling import ProfileReport

%matplotlib inline

In [2]:
# Importing dataset

url = 'https://raw.githubusercontent.com/liliansom/ML_CustomerSatisfaction/main/data/Invistico_Airline.csv'
data = pd.read_csv(url)



In [3]:
# Observing the data

data.head()

In [4]:
# Data Dictionary

columns = {
    'Column Name': [
        'satisfaction',
        'Gender',
        'Customer Type',
        'Age',
        'Type of Travel',
        'Class',
        'Flight Distance',
        'Seat comfort',
        'Departure/Arrival time convenient',
        'Food and drink',
        'Gate location',
        'Inflight wifi service',
        'Inflight entertainment',
        'Online support',
        'Ease of Online booking',
        'On-board service',
        'Leg room service',
        'Baggage handling',
        'Checkin service',
        'Cleanliness',
        'Online boarding',
        'Departure Delay in Minutes',
        'Arrival Delay in Minutes'
    ],
    'Description': [
        'Indicates the level of customer satisfaction',
        'Represents the gender of the customer',
        'Specifies whether the customer is a loyal or disloyal customer',
        'Represents the age of the customer',
        'Indicates the purpose of the travel, such as business or personal',
        'Specifies the class of the flight, such as Business, Eco, or Eco Plus',
        'Represents the distance of the flight in miles',
        'Indicates the level of comfort experienced with the seating arrangements',
        'Represents the convenience of departure and arrival times',
        'Indicates the satisfaction level with the food and drink options',
        'Represents the satisfaction level with the location of the boarding gate',
        'Indicates the satisfaction level with the in-flight wifi service',
        'Represents the satisfaction level with the in-flight entertainment options',
        'Indicates the satisfaction level with the online customer support',
        'Represents the ease of booking flights online',
        'Indicates the satisfaction level with the on-board services provided',
        'Represents the satisfaction level with the legroom space',
        'Indicates the satisfaction level with the baggage handling process',
        'Represents the satisfaction level with the check-in process',
        'Indicates the satisfaction level with the cleanliness of the flight',
        'Represents the satisfaction level with the online boarding process',
        'Represents the number of minutes of departure delay',
        'Represents the number of minutes of arrival delay'
    ]
}

# Build the data dictionary from the `columns` mapping defined above
dict_df = pd.DataFrame(columns)
dict_df



## Data Analysis

In [5]:
profile = ProfileReport(data)

In [6]:
profile.to_file(output_file = 'Customer_Satisfaction_Profiling.html')

In [7]:
profile.to_notebook_iframe()

In [8]:
# Checking the number of rows and columns
data.shape

In [9]:
# Checking the type of the columns
data.info()

In [10]:
# Checking null data
data.isnull().sum()

### Analysis of Entries by Column

In [11]:
# Imputing the rounded mean for missing arrival delays

mean_delay = round(data['Arrival Delay in Minutes'].mean(), 0)
data['Arrival Delay in Minutes'] = data['Arrival Delay in Minutes'].fillna(mean_delay)
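
The mean imputation step can be illustrated on a small frame. This is a minimal sketch: the `toy` frame and its delay values are made up for illustration and are not taken from the Kaggle dataset. It assigns the filled column back to the frame rather than filling in place.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the values are illustrative, not from the dataset
toy = pd.DataFrame({'Arrival Delay in Minutes': [0.0, 10.0, np.nan, 21.0]})

# Mean of the non-missing values, rounded to a whole number of minutes
mean_delay = round(toy['Arrival Delay in Minutes'].mean(), 0)

# Assign the result back to the column instead of filling in place
toy['Arrival Delay in Minutes'] = toy['Arrival Delay in Minutes'].fillna(mean_delay)
```

Delay values tend to be right-skewed, so the median may be worth comparing against the mean here; that would be a separate design choice, not something this notebook does.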

In [12]:
# Analysing customer type

count_customer_type = data['Customer Type'].value_counts()
count_customer_type

In [13]:
# Analysing type of travel

count_travel_type = data['Type of Travel'].value_counts()
count_travel_type

In [14]:
# Analysing class

count_class = data['Class'].value_counts()
count_class

### Data transformation to numeric codes

In [15]:
# Class needs to be int (0=Business; 1=Eco, 2=Eco Plus)
data.loc[data['Class'] == 'Business', 'Class'] = '0'
data.loc[data['Class'] == 'Eco', 'Class'] = '1'
data.loc[data['Class'] == 'Eco Plus', 'Class'] = '2'

In [16]:
# Satisfaction needs to be binary (0=dissatisfied; 1=satisfied)
data.loc[data['satisfaction'] == 'satisfied', 'satisfaction'] = '1'
data.loc[data['satisfaction'] == 'dissatisfied', 'satisfaction'] = '0'

In [17]:
# Gender needs to be binary (0=female; 1=male)
data.loc[data['Gender'] == 'Female', 'Gender'] = '0'
data.loc[data['Gender'] == 'Male', 'Gender'] = '1'

In [18]:
# Customer Type needs to be binary (0=Loyal; 1=Disloyal)
data.loc[data['Customer Type'] == 'Loyal Customer', 'Customer Type'] = '0'
data.loc[data['Customer Type'] == 'disloyal Customer', 'Customer Type'] = '1'

In [19]:
# Travel Type needs to be binary (0=Business travel; 1=Personal Travel)
data.loc[data['Type of Travel'] == 'Business travel', 'Type of Travel'] = '0'
data.loc[data['Type of Travel'] == 'Personal Travel', 'Type of Travel'] = '1'
data['Type of Travel'].value_counts()
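
The same encoding can also be expressed with one dictionary per column and `Series.map`. The sketch below is an alternative to the `.loc` assignments, shown on a hypothetical `toy` frame that reuses the category labels from the cells above; the `encodings` name is introduced here for illustration only.

```python
import pandas as pd

# Hypothetical rows using the category labels that appear in the notebook
toy = pd.DataFrame({
    'Class': ['Business', 'Eco', 'Eco Plus'],
    'Customer Type': ['Loyal Customer', 'disloyal Customer', 'Loyal Customer'],
})

# One dictionary per column, mirroring the codes used in the cells above
encodings = {
    'Class': {'Business': 0, 'Eco': 1, 'Eco Plus': 2},
    'Customer Type': {'Loyal Customer': 0, 'disloyal Customer': 1},
}

for column, mapping in encodings.items():
    # Labels missing from the mapping become NaN, which makes typos easy to spot
    toy[column] = toy[column].map(mapping)
```

With this approach the columns hold integers directly, so no later string-to-number conversion is needed for them.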

### Descriptive Statistics

In [20]:
data.head()

In [21]:
# Convert every column except the already numeric arrival delay
for column in data.columns:
    if column != 'Arrival Delay in Minutes':
        data[column] = pd.to_numeric(data[column], errors='coerce')

In [22]:
data.dtypes
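
Because `errors='coerce'` silently turns anything unparseable into `NaN`, it can be worth counting how many values were lost in the conversion. A minimal sketch on a hypothetical series (the `'unknown'` label is invented for the example):

```python
import pandas as pd

# Hypothetical encoded column with one label that was never mapped to a code
encoded = pd.Series(['0', '1', 'unknown'])

# Unparseable entries become NaN instead of raising an error
converted = pd.to_numeric(encoded, errors='coerce')

# Number of values that could not be converted
n_coerced = int(converted.isna().sum())
```

A non-zero `n_coerced` on a column that had no missing values beforehand would suggest that some category was left unencoded.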

In [23]:
# Descriptive Analysis 
data.describe().round(2)

In [24]:
# Data Correlation satisfaction x variables
data_corr = data.corr().round(4)
data_corr = data_corr.iloc[:, 0].sort_values(ascending=False)
data_corr = pd.DataFrame(data_corr)

In [25]:
data_corr.rename(columns={'satisfaction': 'Correlation'}, inplace=True)

In [26]:
data_corr = data_corr.sort_values(by='Correlation', ascending=False)
data_corr
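
As an alternative to building the full correlation matrix and selecting its first column by position, `DataFrame.corrwith` can correlate every feature with the target directly. The sketch below uses a hypothetical `toy` frame with made-up values; only the column names come from the dataset.

```python
import pandas as pd

# Hypothetical, already-encoded data; the numbers are illustrative only
toy = pd.DataFrame({
    'satisfaction': [0, 0, 1, 1, 1, 0],
    'Inflight entertainment': [1, 2, 4, 5, 4, 2],
    'Gate location': [3, 3, 3, 2, 4, 3],
})

# Pearson correlation of each feature column against the target column
target_corr = (
    toy.drop(columns='satisfaction')
       .corrwith(toy['satisfaction'])
       .sort_values(ascending=False)
       .rename('Correlation')
)
```

Selecting the target by name avoids depending on `satisfaction` being the first column of the frame.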

## Interpretation of Pearson's Correlation Coefficient

Plus or minus 0.9 indicates a very strong correlation.


0.5 to 0.7 positive or negative indicates a **moderate correlation**.

* Positive
        
        Inflight entertainment          0.5235


0.3 to 0.5 positive or negative indicates a **weak correlation**.

* Positive
        
        Ease of Online booking          0.4318
        Online support                  0.3901
        On-board service                0.3520
        Online boarding                 0.3381
        Leg room service                0.3049
        
0 to 0.3 positive or negative indicates a negligible correlation.

* Positive
        
        Checkin service                 0.2662
        Baggage handling                0.2603
        Cleanliness                     0.2593
        Seat comfort                    0.2424
        Inflight wifi service           0.2271

* Negative

        Gender                         -0.2122
        Class                          -0.2789
        Customer Type                  -0.2926
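
The bands above can be sketched as a small helper. The function name `correlation_strength` is hypothetical, and the label for the 0.7 to 0.9 band is an assumption, since the text does not name that range.

```python
def correlation_strength(r: float) -> str:
    """Label a Pearson coefficient using the bands listed above."""
    magnitude = abs(r)  # the bands apply to positive and negative values alike
    if magnitude >= 0.9:
        return 'very strong'
    if magnitude >= 0.7:
        # Assumption: the text does not name this band; 'strong' is a common label
        return 'strong'
    if magnitude >= 0.5:
        return 'moderate'
    if magnitude >= 0.3:
        return 'weak'
    return 'negligible'


# Coefficients copied from the lists above
labels = {name: correlation_strength(r) for name, r in [
    ('Inflight entertainment', 0.5235),
    ('Ease of Online booking', 0.4318),
    ('Customer Type', -0.2926),
]}
```

Applied to the coefficients listed above, this labels `Inflight entertainment` as moderate, `Ease of Online booking` as weak, and `Customer Type` as negligible.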

In [27]:
# Plotting histogram
sns.set(font_scale=1.0, rc={'figure.figsize': (15,15)})
axis = data.hist(bins=20)

## Boxplot of the variables with moderate and weak correlation

In [28]:
plt.figure(figsize=(10,5))

ax = sns.boxplot(data=data, x='satisfaction', y='Inflight entertainment')
ax.set_title('Boxplot Inflight Entertainment', fontsize=20)

In [29]:
plt.figure(figsize=(10,5))

ax = sns.boxplot(data=data, x='satisfaction', y='Ease of Online booking')
ax.set_title('Boxplot Ease of Online booking', fontsize=20)

In [30]:
plt.figure(figsize=(10,5))

ax = sns.boxplot(data=data, x='satisfaction', y='Checkin service')
ax.set_title('Boxplot Checkin Service', fontsize=20)

In [31]:
plt.figure(figsize=(10,5))

ax = sns.boxplot(data=data, x='satisfaction', y='On-board service')
ax.set_title('Boxplot On-board Service', fontsize=20)

## Saving dataset after processing

In [33]:
data.to_csv('data/Invistico_Airline_treated.csv')
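
By default `to_csv` also writes the row index, which comes back as an extra unnamed column when the file is read again. The sketch below shows a round trip with `index=False` on a hypothetical `toy` frame and a temporary file; whether the index should be dropped for this project is an assumption, not something the notebook states.

```python
import os
import tempfile

import pandas as pd

# Hypothetical processed data; values are illustrative only
toy = pd.DataFrame({'satisfaction': [1, 0], 'Age': [65, 47]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'toy_treated.csv')
    # index=False keeps the row index out of the file
    toy.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

# True when the reloaded frame has the same columns as the original
same_columns = list(reloaded.columns) == list(toy.columns)
```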