## Airline Satisfaction - PCA Analysis

We'll perform a Principal Component Analysis (PCA) on the Airline Satisfaction dataset. We then predict customer satisfaction twice, once from the original variables and once from the principal components derived by PCA. Finally, we'll compare the two outcomes and draw conclusions.

- Step 1: Import libraries
- Step 2: Load the dataset
- Step 3: Exploratory data analysis (EDA)
- Step 4: Data preparation for PCA
- Step 5: PCA on the variables
- Step 6: Predict satisfaction using the original variables
- Step 7: Predict satisfaction using the principal components
- Step 8: Compare and conclude

### What is PCA

PCA, or Principal Component Analysis, is a dimensionality reduction technique that is often used in machine learning and data visualization. The goal of PCA is to reduce the number of dimensions in a dataset by creating new features that are linear combinations of the original ones. These new features, the "principal components", are chosen so that they capture as much of the variance in the data as possible.

The advantage of PCA is that it can help reduce the complexity of the model and prevent overfitting without losing much information. It can also improve computational efficiency, because there are fewer features to process.
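
To make the idea concrete, here is a minimal sketch on made-up data (not the airline dataset). The toy array `X_toy` and all variable names are assumptions chosen for illustration: two of the three features are nearly identical, so two principal components should capture almost all of the variance.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: 200 samples, 3 features, where the second
# feature is almost a copy of the first (the data is nearly 2-dimensional).
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = a + 0.05 * rng.normal(size=200)
c = rng.normal(size=200)
X_toy = np.column_stack([a, b, c])

# Standardize, then project onto the principal components.
X_toy_scaled = StandardScaler().fit_transform(X_toy)
pca_toy = PCA(n_components=3)
X_toy_pca = pca_toy.fit_transform(X_toy_scaled)

# Each row of components_ holds the weights of one linear combination
# of the original features; the ratios show how much of the total
# variance each component captures, in decreasing order.
print(pca_toy.components_.shape)
print(pca_toy.explained_variance_ratio_)
```

In this sketch the third ratio should be close to zero, which is the situation in which dropping a component loses little information.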

In [3]:

# Import necessary libraries
import pandas as pd
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, accuracy_score
    

In [7]:

# Load the dataset (adjust the path if the CSV file is stored elsewhere)
df = pd.read_csv('Airlinesatisfaction.csv')

# Display the first few rows of the dataframe
df.head()
    

Unnamed: 0.1,Unnamed: 0,id,Gender,Customer Type,Age,Type of Travel,Class,Flight Distance,Inflight wifi service,Departure/Arrival time convenient,...,Inflight entertainment,On-board service,Leg room service,Baggage handling,Checkin service,Inflight service,Cleanliness,Departure Delay in Minutes,Arrival Delay in Minutes,satisfaction
0,0,70172,Male,Loyal Customer,13,Personal Travel,Eco Plus,460,3,4,...,5,4,3,4,4,5,5,25,18.0,neutral or dissatisfied
1,1,5047,Male,disloyal Customer,25,Business travel,Business,235,3,2,...,1,1,5,3,1,4,1,1,6.0,neutral or dissatisfied
2,2,110028,Female,Loyal Customer,26,Business travel,Business,1142,2,2,...,5,4,3,4,4,4,5,0,0.0,satisfied
3,3,24026,Female,Loyal Customer,25,Business travel,Business,562,2,5,...,2,2,5,3,1,4,2,11,9.0,neutral or dissatisfied
4,4,119299,Male,Loyal Customer,61,Business travel,Business,214,3,3,...,3,3,4,4,3,3,3,0,0.0,satisfied


In [8]:

# Perform exploratory data analysis (EDA)
# EDA code can go here, e.g. df.describe() and df.info()

# Drop columns that are not used in the analysis
# (the leftover index column 'Unnamed: 0' is kept, so it ends up among the features)
df = df.drop(columns=['id', 'Gate location'])

# Fill missing arrival delays with the column mean
mean_delay = df['Arrival Delay in Minutes'].mean()
df['Arrival Delay in Minutes'] = df['Arrival Delay in Minutes'].fillna(mean_delay)

# Replace spaces in column names with underscores
df.columns = [c.replace(' ', '_') for c in df.columns]

# Replace categorical data with numeric codes.
# Assigning the result back avoids the chained `inplace=True` pattern,
# which pandas warns will stop working in pandas 3.0.
df['Gender'] = df['Gender'].map({'Male': 0, 'Female': 1})
df['satisfaction'] = df['satisfaction'].map({'neutral or dissatisfied': 0, 'satisfied': 1})
df['Customer_Type'] = df['Customer_Type'].map({'disloyal Customer': 0, 'Loyal Customer': 1})
df['Type_of_Travel'] = df['Type_of_Travel'].map({'Personal Travel': 0, 'Business travel': 1})
df['Class'] = df['Class'].map({'Eco': 0, 'Eco Plus': 1, 'Business': 2})

# Print the first 5 rows of the dataframe
df.head()

Unnamed: 0,Unnamed:_0,Gender,Customer_Type,Age,Type_of_Travel,Class,Flight_Distance,Inflight_wifi_service,Departure/Arrival_time_convenient,Ease_of_Online_booking,...,Inflight_entertainment,On-board_service,Leg_room_service,Baggage_handling,Checkin_service,Inflight_service,Cleanliness,Departure_Delay_in_Minutes,Arrival_Delay_in_Minutes,satisfaction
0,0,0,1,13,0,1,460,3,4,3,...,5,4,3,4,4,5,5,25,18.0,0
1,1,0,0,25,1,2,235,3,2,3,...,1,1,5,3,1,4,1,1,6.0,0
2,2,1,1,26,1,2,1142,2,2,2,...,5,4,3,4,4,4,5,0,0.0,1
3,3,1,1,25,1,2,562,2,5,5,...,2,2,5,3,1,4,2,11,9.0,0
4,4,0,1,61,1,2,214,3,3,3,...,3,3,4,4,3,3,3,0,0.0,1
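
Step 3 (EDA) is only hinted at in the notebook. The following is a minimal sketch of checks that could be run on the raw data before cleaning, shown here on a hypothetical five-row frame `toy` that borrows a few column names and values from the output above (the missing value is invented for the example). On the real data the same calls would be applied to `df`.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the airline data, for illustration only.
toy = pd.DataFrame({
    "Age": [13, 25, 26, 25, 61],
    "Class": ["Eco Plus", "Business", "Business", "Business", "Business"],
    "Arrival Delay in Minutes": [18.0, 6.0, np.nan, 9.0, 0.0],
    "satisfaction": [
        "neutral or dissatisfied", "neutral or dissatisfied",
        "satisfied", "neutral or dissatisfied", "satisfied",
    ],
})

# How many values are missing in each column?
missing_per_column = toy.isna().sum()

# Is the target balanced? (shares of each class)
class_balance = toy["satisfaction"].value_counts(normalize=True)

# Summary statistics of the numeric columns.
numeric_summary = toy.describe()

print(missing_per_column)
print(class_balance)
print(numeric_summary)
```

Checks like these would show which columns need imputation and whether accuracy is a reasonable metric for the target.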


In [9]:

# Prepare the data for PCA; the target variable is 'satisfaction'
X = df.drop(columns='satisfaction')  # Features
y = df['satisfaction']  # Target

# Split the data into training and test sets first
X_train_raw, X_test_raw, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Standardize the features before applying PCA. The scaler is fitted on the
# training set only, so no test-set statistics leak into training.
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train_raw)
X_test = scaler.transform(X_test_raw)
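
One way to make sure that scaling and PCA only ever learn from the training data is to chain the steps in a scikit-learn `Pipeline`. The sketch below uses synthetic data from `make_classification` as a stand-in for the airline features; the data, the variable names, and the `n_estimators=50` setting are assumptions for illustration, not part of the original notebook.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the airline features (an assumption for illustration).
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, n_informative=5, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Scaler and PCA are fitted on the training data only when pipe.fit is
# called, then reused unchanged on the test data.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=0.85)),
    ("clf", RandomForestClassifier(n_estimators=50, random_state=42)),
])
pipe.fit(X_tr, y_tr)

demo_accuracy = pipe.score(X_te, y_te)
n_kept = pipe.named_steps["pca"].n_components_
print(f"Components kept: {n_kept}")
print(f"Test accuracy: {demo_accuracy:.3f}")
```

A pipeline like this can also be passed to cross-validation helpers, which refit the scaler and PCA inside each fold.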
    

In [10]:

# Perform PCA
pca = PCA(n_components=0.85)  # keep enough components to explain 85% of the variance
X_train_pca = pca.fit_transform(X_train)
X_test_pca = pca.transform(X_test)
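
When `n_components` is a float between 0 and 1, scikit-learn keeps the smallest number of components whose cumulative explained variance exceeds that fraction. A minimal sketch of how the result could be inspected, again on synthetic data (the data and the names `X_demo`, `pca_demo`, and `cumulative` are assumptions for illustration):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic data with many redundant (linearly dependent) features.
X_demo, _ = make_classification(
    n_samples=500, n_features=20, n_informative=5, n_redundant=10,
    random_state=0,
)
X_demo_scaled = StandardScaler().fit_transform(X_demo)

# Ask PCA for enough components to explain 85% of the variance.
pca_demo = PCA(n_components=0.85)
pca_demo.fit(X_demo_scaled)

# Cumulative share of variance explained by the kept components.
cumulative = np.cumsum(pca_demo.explained_variance_ratio_)
print(f"Components kept: {pca_demo.n_components_} of {X_demo.shape[1]}")
print(f"Variance explained: {cumulative[-1]:.3f}")
```

In the notebook itself, `pca.n_components_` and `pca.explained_variance_ratio_` would give the same information for the airline data.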
    

In [11]:

# Predict satisfaction using all variables (using RandomForest as an example)
clf_all_features = RandomForestClassifier()
clf_all_features.fit(X_train, y_train)
y_pred_all = clf_all_features.predict(X_test)
print('Classification Report for all features:\n', classification_report(y_test, y_pred_all))
    

Classification Report for all features:
               precision    recall  f1-score   support

           0       0.95      0.98      0.97     11713
           1       0.97      0.94      0.95      9068

    accuracy                           0.96     20781
   macro avg       0.96      0.96      0.96     20781
weighted avg       0.96      0.96      0.96     20781



In [12]:

# Predict satisfaction using the principal components
clf_pca = RandomForestClassifier()
clf_pca.fit(X_train_pca, y_train)
y_pred_pca = clf_pca.predict(X_test_pca)
print('Classification Report for PCA features:\n', classification_report(y_test, y_pred_pca))
    

Classification Report for PCA features:
               precision    recall  f1-score   support

           0       0.92      0.95      0.94     11713
           1       0.94      0.90      0.92      9068

    accuracy                           0.93     20781
   macro avg       0.93      0.92      0.93     20781
weighted avg       0.93      0.93      0.93     20781



In [13]:

# Compare the outcomes and draw conclusions
accuracy_all = accuracy_score(y_test, y_pred_all)
accuracy_pca = accuracy_score(y_test, y_pred_pca)
print(f'Accuracy with all features: {accuracy_all}')
print(f'Accuracy with PCA features: {accuracy_pca}')

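# Conclusion: in the run shown below, the model trained on all features reaches
# an accuracy of about 0.961 and the model trained on the principal components
# about 0.928. Keeping 85% of the variance therefore costs roughly 3 percentage
# points of accuracy here. Whether that trade-off is worthwhile depends on how
# much the smaller feature set reduces computation time, which was not measured.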
    

Accuracy with all features: 0.9611183292430585
Accuracy with PCA features: 0.9277705596458303
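
The two reported accuracies differ by about 0.033, or roughly 3.3 percentage points. To judge whether PCA pays for that loss, training time could be measured alongside accuracy. The following is a minimal sketch on synthetic data; the helper `fit_and_score`, the data, and all variable names are hypothetical and not taken from the notebook.

```python
import time

from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler


def fit_and_score(X_fit, X_eval, y_fit, y_eval):
    """Train a random forest; return (test accuracy, training time in seconds)."""
    clf = RandomForestClassifier(n_estimators=50, random_state=42)
    start = time.perf_counter()
    clf.fit(X_fit, y_fit)
    elapsed = time.perf_counter() - start
    return accuracy_score(y_eval, clf.predict(X_eval)), elapsed


# Synthetic stand-in for the airline data (an assumption for illustration).
X_syn, y_syn = make_classification(
    n_samples=1000, n_features=20, n_informative=8, random_state=42
)
X_fit, X_eval, y_fit, y_eval = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=42
)

# Scale and reduce using statistics from the training split only.
scaler_syn = StandardScaler().fit(X_fit)
X_fit_s, X_eval_s = scaler_syn.transform(X_fit), scaler_syn.transform(X_eval)
pca_syn = PCA(n_components=0.85).fit(X_fit_s)
X_fit_p, X_eval_p = pca_syn.transform(X_fit_s), pca_syn.transform(X_eval_s)

acc_all, time_all = fit_and_score(X_fit_s, X_eval_s, y_fit, y_eval)
acc_pca, time_pca = fit_and_score(X_fit_p, X_eval_p, y_fit, y_eval)

print(f"All features: accuracy={acc_all:.3f}, fit time={time_all:.2f}s")
print(f"PCA features: accuracy={acc_pca:.3f}, fit time={time_pca:.2f}s")
print(f"Accuracy difference: {(acc_all - acc_pca) * 100:.1f} percentage points")
```

Timings from a single run are noisy, so repeating the measurement a few times would give a fairer comparison.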
