## Introduction

Customer satisfaction is a performance indicator that shows how well a company meets consumer expectations before, during, and after a purchase.

When it is high, this metric signals compatibility between what the company offers and what the customer needs. It is a good gauge of whether the service provided and the experience created make sense for the target audience.

On the other hand, low customer satisfaction means that the consumer's expectations were not met, whether the shortfall relates to the product or to the service purchased. This can greatly harm your brand reputation.

In this project, the customer satisfaction dataset extracted from Kaggle will be analyzed, and the following phases will be implemented:

1) Data import
2) Data analysis
3) Data transformation
4) Loading of processed data
5) Machine learning model training
6) Model testing
7) Model implementation

## Import

In [1]:
# Importing libraries

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import pandas_profiling



In [13]:
# Importing dataset

url = 'https://raw.githubusercontent.com/liliansom/ML_CustomerSatisfaction/main/data/Invistico_Airline.csv'
data = pd.read_csv(url)


In [14]:
# Observing the data

data.head()

Unnamed: 0,satisfaction,Gender,Customer Type,Age,Type of Travel,Class,Flight Distance,Seat comfort,Departure/Arrival time convenient,Food and drink,...,Online support,Ease of Online booking,On-board service,Leg room service,Baggage handling,Checkin service,Cleanliness,Online boarding,Departure Delay in Minutes,Arrival Delay in Minutes
0,satisfied,Female,Loyal Customer,65,Personal Travel,Eco,265,0,0,0,...,2,3,3,0,3,5,3,2,0,0.0
1,satisfied,Male,Loyal Customer,47,Personal Travel,Business,2464,0,0,0,...,2,3,4,4,4,2,3,2,310,305.0
2,satisfied,Female,Loyal Customer,15,Personal Travel,Eco,2138,0,0,0,...,2,2,3,3,4,4,4,2,0,0.0
3,satisfied,Female,Loyal Customer,60,Personal Travel,Eco,623,0,0,0,...,3,1,1,0,1,4,1,3,0,0.0
4,satisfied,Female,Loyal Customer,70,Personal Travel,Eco,354,0,0,0,...,4,2,2,0,2,4,2,5,0,0.0


In [15]:
# Data Dictionary
# Stored under its own name so that the `data` DataFrame loaded above is not overwritten

data_dictionary = {
    'Column Name': [
        'satisfaction',
        'Gender',
        'Customer Type',
        'Age',
        'Type of Travel',
        'Class',
        'Flight Distance',
        'Seat comfort',
        'Departure/Arrival time convenient',
        'Food and drink',
        'Gate location',
        'Inflight wifi service',
        'Inflight entertainment',
        'Online support',
        'Ease of Online booking',
        'On-board service',
        'Leg room service',
        'Baggage handling',
        'Checkin service',
        'Cleanliness',
        'Online boarding',
        'Departure Delay in Minutes',
        'Arrival Delay in Minutes'
    ],
    'Description': [
        'Indicates the level of customer satisfaction',
        'Represents the gender of the customer',
        'Specifies whether the customer is a loyal or disloyal customer',
        'Represents the age of the customer',
        'Indicates the purpose of the travel, such as business or personal',
        'Specifies the class of the flight, such as Business, Eco, or Eco Plus',
        'Represents the distance of the flight in miles',
        'Indicates the level of comfort experienced with the seating arrangements',
        'Represents the convenience of departure and arrival times',
        'Indicates the satisfaction level with the food and drink options',
        'Represents the satisfaction level with the location of the boarding gate',
        'Indicates the satisfaction level with the in-flight wifi service',
        'Represents the satisfaction level with the in-flight entertainment options',
        'Indicates the satisfaction level with the online customer support',
        'Represents the ease of booking flights online',
        'Indicates the satisfaction level with the on-board services provided',
        'Represents the satisfaction level with the legroom space',
        'Indicates the satisfaction level with the baggage handling process',
        'Represents the satisfaction level with the check-in process',
        'Indicates the satisfaction level with the cleanliness of the flight',
        'Represents the satisfaction level with the online boarding process',
        'Represents the number of minutes of departure delay',
        'Represents the number of minutes of arrival delay'
    ]
}

df = pd.DataFrame(data_dictionary)
df



Unnamed: 0,Column Name,Description
0,satisfaction,Indicates the level of customer satisfaction
1,Gender,Represents the gender of the customer
2,Customer Type,Specifies whether the customer is a loyal or d...
3,Age,Represents the age of the customer
4,Type of Travel,"Indicates the purpose of the travel, such as b..."
5,Class,"Specifies the class of the flight, such as Bus..."
6,Flight Distance,Represents the distance of the flight in miles
7,Seat comfort,Indicates the level of comfort experienced wit...
8,Departure/Arrival time convenient,Represents the convenience of departure and ar...
9,Food and drink,Indicates the satisfaction level with the food...
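
A data dictionary is only useful if its names match the real column labels exactly. The following is a minimal sketch of such a check; it uses a small stand-in DataFrame rather than the airline data, and the helper name `compare_with_dictionary` is hypothetical.

```python
import pandas as pd


def compare_with_dictionary(frame: pd.DataFrame, documented: list) -> dict:
    """Return columns absent from the dictionary and dictionary entries absent from the frame."""
    actual = set(frame.columns)
    listed = set(documented)
    return {
        "undocumented": sorted(actual - listed),  # in the data, not in the dictionary
        "unknown": sorted(listed - actual),       # in the dictionary, not in the data
    }


# Stand-in data: one dictionary entry is spelled differently from the column label
sample = pd.DataFrame({"Checkin service": [5, 2], "Cleanliness": [3, 3]})
result = compare_with_dictionary(sample, ["Check-in service", "Cleanliness"])
print(result)
```

Both lists being empty would suggest that the dictionary and the dataset agree on column names.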


## Data Analysis

In [24]:
pip install --upgrade pandas pandas-profiling





In [23]:
profile = pandas_profiling.ProfileReport(data)
profile.to_file("report.html")

In [19]:
# Checking the number of rows and columns
data.shape

In [7]:
# Checking the type of the columns
data.info()

In [8]:
# Checking null data
data.isnull().sum()
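
The null check can be extended to report the share of missing values per column, which helps decide whether imputation is reasonable. This is a sketch on made-up values, not on the airline dataset; `missing_summary` is a hypothetical helper name.

```python
import numpy as np
import pandas as pd


def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Count and percentage of missing values, keeping only columns that have gaps."""
    counts = frame.isnull().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(frame) * 100).round(2),
    })
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)


# Made-up sample with a single gap in the arrival delay column
toy = pd.DataFrame({
    "Departure Delay in Minutes": [0, 310, 0, 0],
    "Arrival Delay in Minutes": [0.0, 305.0, np.nan, 0.0],
})
summary = missing_summary(toy)
print(summary)
```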

### Analysis of Entries by Column

In [None]:
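# Imputing the rounded mean for missing arrival delays
# (assigning the result back avoids calling fillna with inplace=True on a column selection)

mean_arrival_delay = round(data['Arrival Delay in Minutes'].mean(), 0)
data['Arrival Delay in Minutes'] = data['Arrival Delay in Minutes'].fillna(mean_arrival_delay)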
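
Mean imputation is sensitive to extreme values. If the delay column is skewed by a few very long delays (an assumption, not something verified here), the median may be a more representative fill value. The sketch below compares both on made-up numbers.

```python
import numpy as np
import pandas as pd

# Made-up delays (minutes) with one extreme value and one gap
delays = pd.Series([0.0, 0.0, 4.0, 304.0, np.nan], name="Arrival Delay in Minutes")

mean_fill = round(delays.mean(), 0)  # mean of the four observed values
median_fill = delays.median()        # middle of the four observed values

# fillna returns a new Series, so the result is assigned to a new name
imputed_mean = delays.fillna(mean_fill)
imputed_median = delays.fillna(median_fill)

print(mean_fill, median_fill)
```

Here the single long delay pulls the mean far above the typical value, while the median stays close to it.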

In [None]:
# Analysing customer type

count_customer_type = data['Customer Type'].value_counts()
count_customer_type

In [None]:
# Analysing type of travel

count_travel_type = data['Type of Travel'].value_counts()
count_travel_type

In [None]:
# Analysing class

count_class = data['Class'].value_counts()
count_class
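
Raw counts are easier to compare across columns when paired with percentages. A small sketch on made-up labels, using `value_counts(normalize=True)`:

```python
import pandas as pd

# Made-up class labels, not the airline data
classes = pd.Series(["Eco", "Business", "Eco", "Eco Plus"], name="Class")

counts = classes.value_counts()
shares = classes.value_counts(normalize=True).mul(100).round(1)

# Both Series share the same index (the labels), so they align in one table
table = pd.DataFrame({"count": counts, "percent": shares})
print(table)
```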

### Data transformation to numeric codes

In [None]:
# Class needs to be int (0=Business; 1=Eco; 2=Eco Plus)
data['Class'] = data['Class'].map({'Business': 0, 'Eco': 1, 'Eco Plus': 2})

In [None]:
# Satisfaction needs to be binary (0=dissatisfied; 1=satisfied)
data['satisfaction'] = data['satisfaction'].map({'dissatisfied': 0, 'satisfied': 1})

In [None]:
# Gender needs to be binary (0=female; 1=male)
data['Gender'] = data['Gender'].map({'Female': 0, 'Male': 1})

In [None]:
# Customer Type needs to be binary (0=Loyal; 1=Disloyal)
data['Customer Type'] = data['Customer Type'].map({'Loyal Customer': 0, 'disloyal Customer': 1})

In [None]:
# Travel Type needs to be binary (0=Business travel; 1=Personal Travel)
data['Type of Travel'] = data['Type of Travel'].map({'Business travel': 0, 'Personal Travel': 1})
data['Type of Travel'].value_counts()
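
The per-column recoding can also be driven by a single lookup table, with a check that no label is left unmapped. This is a sketch under stated assumptions: `ENCODINGS` and `encode_columns` are hypothetical names, and the sample rows are made up.

```python
import pandas as pd

# Hypothetical lookup: column name -> {label: integer code}
ENCODINGS = {
    "Class": {"Business": 0, "Eco": 1, "Eco Plus": 2},
    "Gender": {"Female": 0, "Male": 1},
}


def encode_columns(frame: pd.DataFrame, encodings: dict) -> pd.DataFrame:
    """Return a copy in which each listed column holds integer codes instead of labels."""
    encoded = frame.copy()
    for column, mapping in encodings.items():
        codes = encoded[column].map(mapping)
        if codes.isnull().any():
            # Labels missing from the mapping would silently become NaN, so fail loudly
            unexpected = encoded.loc[codes.isnull(), column].unique().tolist()
            raise ValueError(f"Unmapped values in {column!r}: {unexpected}")
        encoded[column] = codes.astype(int)
    return encoded


toy = pd.DataFrame({
    "Class": ["Eco", "Business", "Eco Plus"],
    "Gender": ["Female", "Male", "Female"],
})
encoded = encode_columns(toy, ENCODINGS)
print(encoded.dtypes)
```

Storing integers rather than the strings `'0'` and `'1'` keeps the columns numeric for `describe()` and `corr()`.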

### Descriptive Statistics

In [None]:
# Descriptive Analysis 
data.describe().round(2)

In [None]:
# Analyzing correlations between variables
data_corr = data.corr().round(4)
data_corr

In [None]:
# Correlation of each variable with satisfaction, sorted from highest to lowest
data_corr = data.corr()['satisfaction'].round(4).sort_values(ascending=False)
data_corr

## Interpretation of Pearson's Correlation Coefficient

The ranges below refer to the absolute value of the coefficient. The sign only gives the direction of the relationship.

0.9 to 1.0 positive or negative indicates a very strong correlation.

0.7 to 0.9 positive or negative indicates a strong correlation.

0.5 to 0.7 positive or negative indicates a **moderate correlation**.

* Positive

        Inflight entertainment               0.5235

0.3 to 0.5 positive or negative indicates a **weak correlation**.

* Positive

        Ease of Online booking               0.4318
        Online support                       0.3901
        On-board service                     0.3520
        Online boarding                      0.3381
        Leg room service                     0.3049

0 to 0.3 positive or negative indicates a negligible correlation. The strongest negative coefficient falls just below the weak threshold:

* Negative

        Customer Type                       -0.2926
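
These ranges can be expressed as a small helper that labels any coefficient. This is a sketch: the function name `correlation_strength` is hypothetical, and treating each lower bound as inclusive is an assumption, since the ranges do not say which side a boundary value belongs to.

```python
def correlation_strength(r: float) -> str:
    """Label a Pearson coefficient by its absolute value."""
    magnitude = abs(r)
    if magnitude >= 0.9:
        return "very strong"
    if magnitude >= 0.7:
        return "strong"
    if magnitude >= 0.5:
        return "moderate"
    if magnitude >= 0.3:
        return "weak"
    return "negligible"


# Coefficients quoted in this section
coefficients = {
    "Inflight entertainment": 0.5235,
    "Ease of Online booking": 0.4318,
    "Leg room service": 0.3049,
    "Customer Type": -0.2926,
}
labels = {name: correlation_strength(r) for name, r in coefficients.items()}
print(labels)
```

With these cut-offs, -0.2926 is labeled negligible because its magnitude is below 0.3.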

In [None]:
# Plotting histogram
sns.set(font_scale=1.0, rc={'figure.figsize': (15,15)})
axis = data.hist(bins=20)

## Boxplot of the variables with moderate and weak correlation

In [None]:
plt.figure(figsize=(10,5))

ax = sns.boxplot(data=data, x='satisfaction', y='Inflight entertainment')
ax.set_title('Boxplot Inflight Entertainment', fontsize=20)

In [None]:
plt.figure(figsize=(10,5))

ax = sns.boxplot(data=data, x='satisfaction', y='Ease of Online booking')
ax.set_title('Boxplot Ease of Online booking', fontsize=20)

In [None]:
plt.figure(figsize=(10,5))

ax = sns.boxplot(data=data, x='satisfaction', y='Checkin service')
ax.set_title('Boxplot Checkin Service', fontsize=20)

In [None]:
plt.figure(figsize=(10,5))

ax = sns.boxplot(data=data, x='satisfaction', y='On-board service')
ax.set_title('Boxplot On-board Service', fontsize=20)

## Saving dataset after processing

In [None]:
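# index=False keeps the row index out of the file, so no extra "Unnamed: 0" column appears on reload
data.to_csv('data/Invistico_Airline_treated.csv', index=False)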
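
Saving fails if the target folder does not exist, because `to_csv` does not create directories. The sketch below creates the folder first and reloads the file to confirm the round trip. It writes a made-up frame to a temporary directory, and the file name is hypothetical.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Made-up processed rows, not the airline data
toy = pd.DataFrame({"satisfaction": [1, 0], "Class": [1, 2]})

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical output location inside a throwaway folder
    target = Path(tmp) / "data" / "treated.csv"
    target.parent.mkdir(parents=True, exist_ok=True)  # create data/ if it is missing

    toy.to_csv(target, index=False)  # leave the row index out of the file
    reloaded = pd.read_csv(target)

print(reloaded.columns.tolist())
```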