# CS7324 - Lab 3: Extending Logistic Regression
### Jarad Angel, Zach Bohl, and Luigi Allen

Airline Passenger Satisfaction Dataset: https://www.kaggle.com/datasets/teejmahal20/airline-passenger-satisfaction

# Preparation and Overview

### Task Explanation and Business Case

The dataset was originally built to predict customer satisfaction, which has only two outcomes (satisfied, or neutral or dissatisfied). To fulfill the lab requirement that the target contain three or more classes, we explore an alternative classification problem within the Airline Passenger Satisfaction dataset: predicting each passenger's travel class (Business, Economy, or Economy Plus). Here is how we frame the task:

### Classification Task: Predict Passenger Travel Class
This task involves predicting whether a passenger is flying in Business Class, Economy Class, or Economy Plus based on various features available in the dataset. To predict travel class effectively, several features from the dataset are likely to be important. Here are some key features that can significantly influence the prediction:

1. **Flight Distance:** Longer flights might have a higher proportion of Business and Economy Plus passengers.
2. **In-flight Services:** Ratings for Wi-Fi, entertainment, and seat comfort can indicate the class, as Business class typically offers superior services.
3. **Baggage Handling:** Satisfaction with baggage handling might correlate with higher travel classes.
4. **Customer Satisfaction:** Overall satisfaction scores can be higher for Business and Economy Plus passengers.
5. **Flight Delays:** Passengers in higher classes might experience fewer delays or have different perceptions of delays.
6. **Age:** Different age groups might prefer different travel classes.
7. **Gender:** There might be trends in travel class preferences based on gender.
8. **Type of Traveler:** Business travelers are more likely to be in Business class, while leisure travelers might prefer Economy or Economy Plus.


### Business Use-Case:
The goal of predicting travel class can provide insights into passenger behavior and help airlines such as American Airlines (AA) to optimize their service offerings and revenue management strategies. Airlines could use this information for:

- **Revenue Optimization:** Understanding the factors that influence travel class choices allows airlines to adjust pricing, promotions, and services to encourage passengers to upgrade to higher classes (e.g., from Economy to Economy Plus or Business).

- **Passenger Segmentation:** Airlines can use this model to tailor services to specific passenger segments, such as offering personalized upgrades or add-on services to Economy Plus passengers likely to switch to Business class. Currently, American Airlines does not offer its lowest-tier (Economy) customers upgrades to the top tier (Business class).

- **Operational Efficiency:** The model can help forecast the demand for each travel class on specific routes, allowing airlines to adjust flight capacity (e.g., increasing Business class seats on popular routes).

### Interested Parties:
The parties at American Airlines that would be interested in the results are as follows:

- **Revenue Management Teams:** They can use the model to identify passengers more likely to purchase upgrades or additional services, optimizing pricing strategies for different classes.

- **Customer Experience Teams:** Understanding class preferences allows these teams to tailor in-flight experiences, loyalty rewards, and offers based on predicted class choice.

- **Airline Executives:** Senior leadership can make data-driven decisions regarding fleet configuration (e.g., more Business class seats on high-demand routes).

- **Marketing Teams:** These teams can leverage insights to create targeted promotions and campaigns for each class of traveler, improving customer acquisition and retention.

### Offline Analysis vs. Deployed Model:
- **Offline Analysis:** This model can be used to identify travel class trends for various passenger demographics, seasonality, or routes. This helps in revenue forecasting, adjusting pricing strategies, and configuring flight classes.

- **Deployed Model:** It can also be deployed in the booking system to predict and offer real-time travel class upgrades or special offers for passengers who are likely to switch from Economy to Economy Plus or Business.

### Performance Requirements:
To be valuable for real-time deployment or offline analysis, the model should achieve at least 80% accuracy. However, since American Airlines makes significant revenue from upgrades, precision and recall for Business class predictions may matter more than overall accuracy. Airlines would aim for high recall (e.g., >90%) in identifying potential Business class passengers, as a missed opportunity here could directly impact revenue.

The classifier's performance could also be evaluated using F1-score, particularly for underrepresented classes (e.g., Economy Plus, which might have fewer passengers than Economy).
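
As a rough illustration of how these targets could be checked, the sketch below computes accuracy, per-class recall, and per-class F1 with scikit-learn. The labels in `y_true` and `y_pred` are made-up placeholders chosen for the example, not predictions from any model in this lab.

```python
from sklearn.metrics import accuracy_score, f1_score, recall_score

# Hypothetical labels for illustration only; not model output from this lab.
labels = ["Business", "Eco", "Eco Plus"]
y_true = ["Business", "Business", "Eco", "Eco", "Eco Plus", "Eco Plus"]
y_pred = ["Business", "Eco", "Eco", "Eco", "Eco Plus", "Eco"]

accuracy = accuracy_score(y_true, y_pred)

# average=None returns one score per class, in the order given by `labels`.
per_class_recall = recall_score(y_true, y_pred, labels=labels, average=None)
per_class_f1 = f1_score(y_true, y_pred, labels=labels, average=None)

# Macro averaging weights every class equally, so a small class such as
# Eco Plus counts as much as Eco.
macro_f1 = f1_score(y_true, y_pred, labels=labels, average="macro")

for name, rec, f1 in zip(labels, per_class_recall, per_class_f1):
    print(f"{name:<9} recall={rec:.2f}  f1={f1:.2f}")
print(f"accuracy={accuracy:.2f}  macro-F1={macro_f1:.2f}")
```

Reporting the Business class recall on its own line makes it easy to compare against the >90% target, while macro-F1 keeps the smaller Economy Plus class from being hidden by overall accuracy.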

In [314]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

# Loading the dataset and displaying feature information for analysis
df = pd.read_csv('Archive/train.csv')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 103904 entries, 0 to 103903
Data columns (total 25 columns):
 #   Column                             Non-Null Count   Dtype  
---  ------                             --------------   -----  
 0   Unnamed: 0                         103904 non-null  int64  
 1   id                                 103904 non-null  int64  
 2   Gender                             103904 non-null  object 
 3   Customer Type                      103904 non-null  object 
 4   Age                                103904 non-null  int64  
 5   Type of Travel                     103904 non-null  object 
 6   Class                              103904 non-null  object 
 7   Flight Distance                    103904 non-null  int64  
 8   Inflight wifi service              103904 non-null  int64  
 9   Departure/Arrival time convenient  103904 non-null  int64  
 10  Ease of Online booking             103904 non-null  int64  
 11  Gate location                      1039

In [315]:
print(df.describe())

          Unnamed: 0             id            Age  Flight Distance  \
count  103904.000000  103904.000000  103904.000000    103904.000000   
mean    51951.500000   64924.210502      39.379706      1189.448375   
std     29994.645522   37463.812252      15.114964       997.147281   
min         0.000000       1.000000       7.000000        31.000000   
25%     25975.750000   32533.750000      27.000000       414.000000   
50%     51951.500000   64856.500000      40.000000       843.000000   
75%     77927.250000   97368.250000      51.000000      1743.000000   
max    103903.000000  129880.000000      85.000000      4983.000000   

       Inflight wifi service  Departure/Arrival time convenient  \
count          103904.000000                      103904.000000   
mean                2.729683                           3.060296   
std                 1.327829                           1.525075   
min                 0.000000                           0.000000   
25%                 2.000

In [316]:
# A look at the entire dataframe to understand all of the available features and sample values
pd.set_option('display.max_columns', None) # Display option to show all columns
df

Unnamed: 0.1,Unnamed: 0,id,Gender,Customer Type,Age,Type of Travel,Class,Flight Distance,Inflight wifi service,Departure/Arrival time convenient,Ease of Online booking,Gate location,Food and drink,Online boarding,Seat comfort,Inflight entertainment,On-board service,Leg room service,Baggage handling,Checkin service,Inflight service,Cleanliness,Departure Delay in Minutes,Arrival Delay in Minutes,satisfaction
0,0,70172,Male,Loyal Customer,13,Personal Travel,Eco Plus,460,3,4,3,1,5,3,5,5,4,3,4,4,5,5,25,18.0,neutral or dissatisfied
1,1,5047,Male,disloyal Customer,25,Business travel,Business,235,3,2,3,3,1,3,1,1,1,5,3,1,4,1,1,6.0,neutral or dissatisfied
2,2,110028,Female,Loyal Customer,26,Business travel,Business,1142,2,2,2,2,5,5,5,5,4,3,4,4,4,5,0,0.0,satisfied
3,3,24026,Female,Loyal Customer,25,Business travel,Business,562,2,5,5,5,2,2,2,2,2,5,3,1,4,2,11,9.0,neutral or dissatisfied
4,4,119299,Male,Loyal Customer,61,Business travel,Business,214,3,3,3,3,4,5,5,3,3,4,4,3,3,3,0,0.0,satisfied
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
103899,103899,94171,Female,disloyal Customer,23,Business travel,Eco,192,2,1,2,3,2,2,2,2,3,1,4,2,3,2,3,0.0,neutral or dissatisfied
103900,103900,73097,Male,Loyal Customer,49,Business travel,Business,2347,4,4,4,4,2,4,5,5,5,5,5,5,5,4,0,0.0,satisfied
103901,103901,68825,Male,disloyal Customer,30,Business travel,Business,1995,1,1,1,3,4,1,5,4,3,2,4,5,5,4,7,14.0,neutral or dissatisfied
103902,103902,54173,Female,disloyal Customer,22,Business travel,Eco,1000,1,1,1,5,1,1,1,1,4,5,1,5,4,1,0,0.0,neutral or dissatisfied


In [317]:
# Bucket passenger ages into four categories.
# With right=False each bin is left-inclusive: [0, 12), [12, 25), [25, 65), [65, inf)
bins = [0, 12, 25, 65, np.inf]
labels = ['Child', 'Young Adult', 'Adult', 'Elderly']

# Create a new column with the age categories
df['Age_Group'] = pd.cut(df['Age'], bins=bins, labels=labels, right=False)

# The original 'Age' column is kept; uncomment to remove it in-place
# df.drop('Age', axis=1, inplace=True)

df

Unnamed: 0.1,Unnamed: 0,id,Gender,Customer Type,Age,Type of Travel,Class,Flight Distance,Inflight wifi service,Departure/Arrival time convenient,Ease of Online booking,Gate location,Food and drink,Online boarding,Seat comfort,Inflight entertainment,On-board service,Leg room service,Baggage handling,Checkin service,Inflight service,Cleanliness,Departure Delay in Minutes,Arrival Delay in Minutes,satisfaction,Age_Group
0,0,70172,Male,Loyal Customer,13,Personal Travel,Eco Plus,460,3,4,3,1,5,3,5,5,4,3,4,4,5,5,25,18.0,neutral or dissatisfied,Young Adult
1,1,5047,Male,disloyal Customer,25,Business travel,Business,235,3,2,3,3,1,3,1,1,1,5,3,1,4,1,1,6.0,neutral or dissatisfied,Adult
2,2,110028,Female,Loyal Customer,26,Business travel,Business,1142,2,2,2,2,5,5,5,5,4,3,4,4,4,5,0,0.0,satisfied,Adult
3,3,24026,Female,Loyal Customer,25,Business travel,Business,562,2,5,5,5,2,2,2,2,2,5,3,1,4,2,11,9.0,neutral or dissatisfied,Adult
4,4,119299,Male,Loyal Customer,61,Business travel,Business,214,3,3,3,3,4,5,5,3,3,4,4,3,3,3,0,0.0,satisfied,Adult
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
103899,103899,94171,Female,disloyal Customer,23,Business travel,Eco,192,2,1,2,3,2,2,2,2,3,1,4,2,3,2,3,0.0,neutral or dissatisfied,Young Adult
103900,103900,73097,Male,Loyal Customer,49,Business travel,Business,2347,4,4,4,4,2,4,5,5,5,5,5,5,5,4,0,0.0,satisfied,Adult
103901,103901,68825,Male,disloyal Customer,30,Business travel,Business,1995,1,1,1,3,4,1,5,4,3,2,4,5,5,4,7,14.0,neutral or dissatisfied,Adult
103902,103902,54173,Female,disloyal Customer,22,Business travel,Eco,1000,1,1,1,5,1,1,1,1,4,5,1,5,4,1,0,0.0,neutral or dissatisfied,Young Adult
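
The bucket boundaries depend on the `right=False` argument, which makes each bin include its left edge and exclude its right edge. A small sketch on a handful of boundary ages (values picked for illustration, not taken from the dataset) shows where each edge falls:

```python
import numpy as np
import pandas as pd

bins = [0, 12, 25, 65, np.inf]
labels = ["Child", "Young Adult", "Adult", "Elderly"]

# Boundary ages chosen to show which side of each edge is included.
ages = pd.Series([11, 12, 24, 25, 64, 65])
groups = pd.cut(ages, bins=bins, labels=labels, right=False)

print(pd.DataFrame({"Age": ages, "Age_Group": groups}))
```

This agrees with the table above, where a 13-year-old is labeled Young Adult and a 25-year-old is labeled Adult.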


In [319]:
# Remove the 'Ease of Online booking' column in-place
df.drop('Ease of Online booking', axis=1, inplace=True)

# Remove the row-index and id columns; they are identifiers, not features
df.drop(['Unnamed: 0', 'id'], axis=1, inplace=True)

# Drop rows with missing values
df.dropna(inplace=True)

df

Unnamed: 0,Gender,Customer Type,Age,Type of Travel,Class,Flight Distance,Inflight wifi service,Departure/Arrival time convenient,Gate location,Food and drink,Online boarding,Seat comfort,Inflight entertainment,On-board service,Leg room service,Baggage handling,Checkin service,Inflight service,Cleanliness,Departure Delay in Minutes,Arrival Delay in Minutes,satisfaction,Age_Group
0,Male,Loyal Customer,13,Personal Travel,Eco Plus,460,3,4,1,5,3,5,5,4,3,4,4,5,5,25,18.0,neutral or dissatisfied,Young Adult
1,Male,disloyal Customer,25,Business travel,Business,235,3,2,3,1,3,1,1,1,5,3,1,4,1,1,6.0,neutral or dissatisfied,Adult
2,Female,Loyal Customer,26,Business travel,Business,1142,2,2,2,5,5,5,5,4,3,4,4,4,5,0,0.0,satisfied,Adult
3,Female,Loyal Customer,25,Business travel,Business,562,2,5,5,2,2,2,2,2,5,3,1,4,2,11,9.0,neutral or dissatisfied,Adult
4,Male,Loyal Customer,61,Business travel,Business,214,3,3,3,4,5,5,3,3,4,4,3,3,3,0,0.0,satisfied,Adult
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
103899,Female,disloyal Customer,23,Business travel,Eco,192,2,1,3,2,2,2,2,3,1,4,2,3,2,3,0.0,neutral or dissatisfied,Young Adult
103900,Male,Loyal Customer,49,Business travel,Business,2347,4,4,4,2,4,5,5,5,5,5,5,5,4,0,0.0,satisfied,Adult
103901,Male,disloyal Customer,30,Business travel,Business,1995,1,1,3,4,1,5,4,3,2,4,5,5,4,7,14.0,neutral or dissatisfied,Adult
103902,Female,disloyal Customer,22,Business travel,Eco,1000,1,1,5,1,1,1,1,4,5,1,5,4,1,0,0.0,neutral or dissatisfied,Young Adult
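
Before dropping rows, it can help to see which columns contain missing values and how many rows the drop removes. The sketch below does this on a small made-up frame; only the column names mirror the dataset, and the values (including the single `NaN`) are invented for the example.

```python
import numpy as np
import pandas as pd

# Toy frame with invented values; only the column names mirror the dataset.
toy = pd.DataFrame({
    "Departure Delay in Minutes": [25, 1, 0, 11],
    "Arrival Delay in Minutes": [18.0, np.nan, 0.0, 9.0],
})

# Count missing values per column before removing anything.
missing_per_column = toy.isna().sum()

rows_before = len(toy)
cleaned = toy.dropna()
rows_dropped = rows_before - len(cleaned)

print(missing_per_column)
print(f"Dropped {rows_dropped} of {rows_before} rows")
```

Running the same two checks on `df` before `df.dropna(inplace=True)` would show how much data the cleaning step discards.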


## Transforming Categorical Data


In [320]:
# Convert Gender from string to binary

# Mapping for Male and Female
gender_map = {'Male': 0, 'Female': 1}

# Add the numeric column, then drop the original string column.
# Any value missing from gender_map keeps its original label via fillna.
df['Gender_Numeric'] = df['Gender'].map(gender_map).fillna(df['Gender'])
df.drop('Gender', axis=1, inplace=True)
df

Unnamed: 0,Customer Type,Age,Type of Travel,Class,Flight Distance,Inflight wifi service,Departure/Arrival time convenient,Gate location,Food and drink,Online boarding,Seat comfort,Inflight entertainment,On-board service,Leg room service,Baggage handling,Checkin service,Inflight service,Cleanliness,Departure Delay in Minutes,Arrival Delay in Minutes,satisfaction,Age_Group,Gender_Numeric
0,Loyal Customer,13,Personal Travel,Eco Plus,460,3,4,1,5,3,5,5,4,3,4,4,5,5,25,18.0,neutral or dissatisfied,Young Adult,0
1,disloyal Customer,25,Business travel,Business,235,3,2,3,1,3,1,1,1,5,3,1,4,1,1,6.0,neutral or dissatisfied,Adult,0
2,Loyal Customer,26,Business travel,Business,1142,2,2,2,5,5,5,5,4,3,4,4,4,5,0,0.0,satisfied,Adult,1
3,Loyal Customer,25,Business travel,Business,562,2,5,5,2,2,2,2,2,5,3,1,4,2,11,9.0,neutral or dissatisfied,Adult,1
4,Loyal Customer,61,Business travel,Business,214,3,3,3,4,5,5,3,3,4,4,3,3,3,0,0.0,satisfied,Adult,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
103899,disloyal Customer,23,Business travel,Eco,192,2,1,3,2,2,2,2,3,1,4,2,3,2,3,0.0,neutral or dissatisfied,Young Adult,1
103900,Loyal Customer,49,Business travel,Business,2347,4,4,4,2,4,5,5,5,5,5,5,5,4,0,0.0,satisfied,Adult,0
103901,disloyal Customer,30,Business travel,Business,1995,1,1,3,4,1,5,4,3,2,4,5,5,4,7,14.0,neutral or dissatisfied,Adult,0
103902,disloyal Customer,22,Business travel,Eco,1000,1,1,5,1,1,1,1,4,5,1,5,4,1,0,0.0,neutral or dissatisfied,Young Adult,1


In [321]:

from sklearn.preprocessing import OneHotEncoder

# One-hot encode three categorical columns to preview the result.
# sparse_output=False returns a dense result (the older `sparse` argument is deprecated),
# and set_output(transform='pandas') keeps readable column names.
onehot_encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=False).set_output(transform='pandas')
df_encoded = onehot_encoder.fit_transform(df[['Customer Type', 'Type of Travel', 'Class']])

# Merging the encoded columns back into df and scaling the numeric
# features are handled in the next cell.

df_encoded



Unnamed: 0,Customer Type_Loyal Customer,Customer Type_disloyal Customer,Type of Travel_Business travel,Type of Travel_Personal Travel,Class_Business,Class_Eco,Class_Eco Plus
0,1.0,0.0,0.0,1.0,0.0,0.0,1.0
1,0.0,1.0,1.0,0.0,1.0,0.0,0.0
2,1.0,0.0,1.0,0.0,1.0,0.0,0.0
3,1.0,0.0,1.0,0.0,1.0,0.0,0.0
4,1.0,0.0,1.0,0.0,1.0,0.0,0.0
...,...,...,...,...,...,...,...
103899,0.0,1.0,1.0,0.0,0.0,1.0,0.0
103900,1.0,0.0,1.0,0.0,1.0,0.0,0.0
103901,0.0,1.0,1.0,0.0,1.0,0.0,0.0
103902,0.0,1.0,1.0,0.0,0.0,1.0,0.0


In [322]:
import pandas as pd
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline


# Encode categorical variables.
# Note: this selects every object column, including 'satisfaction' and the target 'Class'.
# Age_Group has category dtype, so it is not selected and stays unencoded.
categorical_features = df.select_dtypes(include=['object']).columns
onehot_encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
df_encoded = pd.DataFrame(onehot_encoder.fit_transform(df[categorical_features]))

# Replace the original categorical columns with the encoded ones.
# The encoded columns get integer names (0-8), and 'Class' no longer exists in df after the drop.
df_encoded.index = df.index
df = df.drop(categorical_features, axis=1)
df = pd.concat([df, df_encoded], axis=1)

# Ensure all column names are strings
df.columns = df.columns.astype(str)

# Scale numeric variables.
# Note: the one-hot columns and Gender_Numeric are numeric, so they are standardized as well.
numeric_features = df.select_dtypes(include=['int64', 'float64']).columns
scaler = StandardScaler()
df[numeric_features] = scaler.fit_transform(df[numeric_features])

# Describe the final dataset
print(df.describe())

# Breakdown of variables after preprocessing
numeric_stats = df[numeric_features].describe().T

# Check if df_encoded is empty before describing
if not df_encoded.empty:
    categorical_stats = df_encoded.describe().T
    print("Categorical Features Stats:")
    print(categorical_stats)
else:
    print("No categorical features to describe.")

print("Numeric Features Stats:")
print(numeric_stats)

df

                Age  Flight Distance  Inflight wifi service  \
count  1.035940e+05     1.035940e+05           1.035940e+05   
mean  -8.779415e-18     9.671074e-18          -4.197658e-17   
std    1.000005e+00     1.000005e+00           1.000005e+00   
min   -2.142550e+00    -1.161470e+00          -2.055754e+00   
25%   -8.191903e-01    -7.774302e-01          -5.495706e-01   
50%    4.099330e-02    -3.482682e-01           2.035210e-01   
75%    7.688410e-01     5.551780e-01           9.566125e-01   
max    3.018552e+00     3.803974e+00           1.709704e+00   

       Departure/Arrival time convenient  Gate location  Food and drink  \
count                       1.035940e+05   1.035940e+05    1.035940e+05   
mean                       -1.034325e-16   1.392360e-16   -1.144067e-16   
std                         1.000005e+00   1.000005e+00    1.000005e+00   
min                        -2.006314e+00  -2.329958e+00   -2.408709e+00   
25%                        -6.950321e-01  -7.646656e-01  

Unnamed: 0,Age,Flight Distance,Inflight wifi service,Departure/Arrival time convenient,Gate location,Food and drink,Online boarding,Seat comfort,Inflight entertainment,On-board service,Leg room service,Baggage handling,Checkin service,Inflight service,Cleanliness,Departure Delay in Minutes,Arrival Delay in Minutes,Age_Group,Gender_Numeric,0,1,2,3,4,5,6,7,8
0,-1.745542,-0.731305,0.203521,0.616249,-1.547312,1.352401,-0.185632,1.182991,1.231530,0.479237,-0.267143,0.311853,0.549773,1.156211,1.305913,0.268966,0.072905,Young Adult,-1.015154,0.472883,-0.472883,-1.491414,1.491414,-0.957206,-0.904105,3.587718,0.874582,-0.874582
1,-0.951526,-0.956916,0.203521,-0.695032,0.017981,-1.656487,-0.185632,-1.849863,-1.769166,-1.849452,1.253304,-0.534854,-1.821038,0.305580,-1.742432,-0.360682,-0.237184,Adult,-1.015154,-2.114687,2.114687,0.670505,-0.670505,1.044708,-0.904105,-0.278729,0.874582,-0.874582
2,-0.885358,-0.047454,-0.549571,-0.695032,-0.764666,1.352401,1.296479,1.182991,1.231530,0.479237,-0.267143,0.311853,0.549773,0.305580,1.305913,-0.386917,-0.392229,Adult,0.985072,0.472883,-0.472883,0.670505,-0.670505,1.044708,-0.904105,-0.278729,-1.143403,1.143403
3,-0.951526,-0.629028,-0.549571,1.271890,1.583273,-0.904265,-0.926688,-1.091649,-1.018992,-1.073222,1.253304,-0.534854,-1.821038,0.305580,-0.980345,-0.098328,-0.159662,Adult,0.985072,0.472883,-0.472883,0.670505,-0.670505,1.044708,-0.904105,-0.278729,0.874582,-0.874582
4,1.430521,-0.977973,0.203521,-0.039391,0.017981,0.600179,1.296479,1.182991,-0.268818,-0.296993,0.493081,0.311853,-0.240497,-0.545051,-0.218259,-0.386917,-0.392229,Adult,-1.015154,0.472883,-0.472883,0.670505,-0.670505,1.044708,-0.904105,-0.278729,-1.143403,1.143403
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
103899,-1.083862,-1.000033,-0.549571,-1.350673,0.017981,-0.904265,-0.926688,-1.091649,-1.018992,-0.296993,-1.787590,0.311853,-1.030767,-0.545051,-0.980345,-0.308211,-0.392229,Young Adult,0.985072,-2.114687,2.114687,0.670505,-0.670505,-0.957206,1.106066,-0.278729,0.874582,-0.874582
103900,0.636505,1.160818,0.956612,0.616249,0.800627,-0.904265,0.555423,1.182991,1.231530,1.255467,1.253304,1.158561,1.340043,1.156211,0.543827,-0.386917,-0.392229,Adult,-1.015154,0.472883,-0.472883,0.670505,-0.670505,1.044708,-0.904105,-0.278729,-1.143403,1.143403
103901,-0.620686,0.807862,-1.302662,-1.350673,0.017981,0.600179,-1.667744,1.182991,0.481356,-0.296993,-1.027367,0.311853,1.340043,1.156211,0.543827,-0.203270,-0.030458,Adult,-1.015154,-2.114687,2.114687,0.670505,-0.670505,1.044708,-0.904105,-0.278729,0.874582,-0.874582
103902,-1.150030,-0.189839,-1.302662,-1.350673,1.583273,-1.656487,-1.667744,-1.849863,-1.769166,0.479237,1.253304,-2.228269,1.340043,0.305580,-1.742432,-0.386917,-0.392229,Young Adult,0.985072,-2.114687,2.114687,0.670505,-0.670505,-0.957206,1.106066,-0.278729,0.874582,-0.874582


# Optional Response to question (DELETE)

Given that the dataset already comes with separate training and testing files in `.csv` format, there is no need to perform an additional split. Here are the key points supporting this decision:

1. **Pre-Split Data**: The dataset is already divided into training and testing sets, ensuring that the data is appropriately partitioned for model development and evaluation. This pre-split nature saves time and effort, allowing you to focus on analysis and model building.

2. **Consistency**: Using the provided splits ensures consistency in data usage, which is crucial for reproducibility. Any additional splitting could introduce variability and potential biases, affecting the reliability of your results.

3. **Standardization**: The provided splits are likely standardized, meaning they have been carefully curated to represent the overall distribution of the data. This standardization helps in maintaining the integrity of the dataset and ensures that both training and testing sets are representative of the entire dataset.

4. **Avoiding Data Leakage**: By using the pre-split data, you minimize the risk of data leakage, where information from the test set could inadvertently influence the training process. This is crucial for maintaining the validity of your model evaluation.

5. **Efficiency**: Utilizing the pre-split files directly is more efficient, as it eliminates the need for additional preprocessing steps. This allows you to streamline your workflow and focus on more critical aspects of your analysis.

In [325]:
from sklearn.model_selection import train_test_split

print(df.columns)

# Separate features and target variable
X = df.drop(columns=['Class'])
y = df['Class']

# Split the data into training and testing sets (80% training, 20% testing)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Output the shapes of the resulting datasets
print(f"X_train shape: {X_train.shape}")
print(f"X_test shape: {X_test.shape}")
print(f"y_train shape: {y_train.shape}")
print(f"y_test shape: {y_test.shape}")

Index(['Age', 'Flight Distance', 'Inflight wifi service',
       'Departure/Arrival time convenient', 'Gate location', 'Food and drink',
       'Online boarding', 'Seat comfort', 'Inflight entertainment',
       'On-board service', 'Leg room service', 'Baggage handling',
       'Checkin service', 'Inflight service', 'Cleanliness',
       'Departure Delay in Minutes', 'Arrival Delay in Minutes', 'Age_Group',
       'Gender_Numeric', '0', '1', '2', '3', '4', '5', '6', '7', '8'],
      dtype='object')


KeyError: "['Class'] not found in axis"

In [293]:
#one hot encoding
# df = pd.get
# Displaying the first 5 rows of the dataset    


In [294]:
# Dimensionality reduction

# Modeling

# Deployment

# Exceptional Work 

### Citation
https://towardsdatascience.com/predicting-satisfaction-of-airline-passengers-with-classification-76f1516e1d16