## End to end: Airline Passenger Satisfaction

In this tutorial, we walk through an end-to-end workflow on a real dataset, following these steps:

- Load the training and test datasets downloaded from Kaggle.
- Run a kernel with one of the top scores. This kernel trains a LightGBM model, a very common choice in Kaggle competitions.
- Reduce the complexity of the model and measure the performance difference between the original model and the reduced model.

To understand how serialization and model complexity reduction work, please review these notebooks before continuing with this tutorial:
- reduce_model_complexity.ipynb
- serialize_my_model.ipynb

In [None]:
# This example requires lightgbm. auto_zkml does not need all packages installed,
# so we install lightgbm here to ensure the notebook works correctly.

!pip install lightgbm

In [2]:
# Some imports

import lightgbm as lgb
import pandas as pd
from sklearn.metrics import roc_auc_score
from auto_zkml import mcr
from auto_zkml import serialize_model

### Data preprocessing

Download the data from here: https://www.kaggle.com/datasets/teejmahal20/airline-passenger-satisfaction/code?datasetId=522275&sortBy=voteCount

In [14]:
# Replace ./ with your input path
train = pd.read_csv("./train.csv")
test = pd.read_csv("./test.csv")

We copy the essential preprocessing from an existing kernel, for example this one: https://www.kaggle.com/code/teejmahal20/classification-predicting-customer-satisfaction

In [None]:
def transform_gender(x):
    if x == 'Female':
        return 1
    elif x == 'Male':
        return 0
    else:
        return -1
    
def transform_customer_type(x):
    if x == 'Loyal Customer':
        return 1
    elif x == 'disloyal Customer':
        return 0
    else:
        return -1
    
def transform_travel_type(x):
    if x == 'Business travel':
        return 1
    elif x == 'Personal Travel':
        return 0
    else:
        return -1
    
def transform_class(x):
    if x == 'Business':
        return 2
    elif x == 'Eco Plus':
        return 1
    elif x == 'Eco':
        return 0    
    else:
        return -1
    
def transform_satisfaction(x):
    if x == 'satisfied':
        return 1
    elif x == 'neutral or dissatisfied':
        return 0
    else:
        return -1
    
def process_data(df):
    df = df.drop(['Unnamed: 0', 'id'], axis = 1)
    df['Gender'] = df['Gender'].apply(transform_gender)
    df['Customer Type'] = df['Customer Type'].apply(transform_customer_type)
    df['Type of Travel'] = df['Type of Travel'].apply(transform_travel_type)
    df['Class'] = df['Class'].apply(transform_class)
    df['satisfaction'] = df['satisfaction'].apply(transform_satisfaction)
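    # Assign the result instead of using inplace=True on a column selection,
    # which may not update df in recent pandas versions.
    df['Arrival Delay in Minutes'] = df['Arrival Delay in Minutes'].fillna(df['Arrival Delay in Minutes'].median())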
    
    return df

train = process_data(train)
test = process_data(test)

features = ['Gender', 'Customer Type', 'Age', 'Type of Travel', 'Class',
       'Flight Distance', 'Inflight wifi service',
       'Departure/Arrival time convenient', 'Ease of Online booking',
       'Gate location', 'Food and drink', 'Online boarding', 'Seat comfort',
       'Inflight entertainment', 'On-board service', 'Leg room service',
       'Baggage handling', 'Checkin service', 'Inflight service',
       'Cleanliness', 'Departure Delay in Minutes', 'Arrival Delay in Minutes']
target = ['satisfaction']

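# Build feature matrices and 1-D label arrays for train and test
# (ravel() flattens the single-column label frame to avoid a column-vector warning)
X_train = train[features].to_numpy()
y_train = train[target].to_numpy().ravel()
X_test = test[features].to_numpy()
y_test = test[target].to_numpy().ravel()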
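
The five `transform_*` helpers above share one pattern: look up a known category and fall back to -1 for anything else. As a minimal sketch of the same idea, the mappings can be written as data. The `ENCODINGS` table and the `encode` helper are illustrative names introduced here; they are not part of the original kernel or of auto_zkml.

```python
# Illustrative, table-driven version of the transform_* helpers.
# ENCODINGS and encode are hypothetical names introduced for this sketch.
ENCODINGS = {
    "Gender": {"Female": 1, "Male": 0},
    "Customer Type": {"Loyal Customer": 1, "disloyal Customer": 0},
    "Type of Travel": {"Business travel": 1, "Personal Travel": 0},
    "Class": {"Business": 2, "Eco Plus": 1, "Eco": 0},
    "satisfaction": {"satisfied": 1, "neutral or dissatisfied": 0},
}


def encode(column, value):
    """Return the integer code for value in column, or -1 if it is unknown."""
    return ENCODINGS[column].get(value, -1)


print(encode("Class", "Eco Plus"))      # 1
print(encode("Gender", "unspecified"))  # -1
```

With pandas, the same table could be applied column by column, for example `df[col] = df[col].map(lambda v: encode(col, v))`. Keeping the codes in one table makes it easier to check that train and test are encoded identically.
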

In [None]:
params_lgb ={'colsample_bytree': 0.85, 
         'max_depth': 15, 
         'min_split_gain': 0.1, 
         'n_estimators': 200, 
         'num_leaves': 50, 
         'reg_alpha': 1.2, 
         'reg_lambda': 1.2, 
         'subsample': 0.95, 
         'subsample_freq': 20,
         'verbose' : -1}

model_lgb = lgb.LGBMClassifier(**params_lgb)
model_lgb.fit(X_train, y_train)

We measure the model's performance.

In [18]:
y_pred = model_lgb.predict(X_test)
roc_auc = roc_auc_score(y_test, y_pred)
print("ROC_AUC = {}".format(roc_auc))

ROC_AUC = 0.9621874665245571
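
Note that the cell above passes hard class labels from `predict` to `roc_auc_score`. ROC AUC measures how well a model ranks positives above negatives, so it is more commonly computed on scores such as `model_lgb.predict_proba(X_test)[:, 1]`. With hard labels, all samples that receive the same label are tied. The toy sketch below uses made-up scores and a hand-written pairwise AUC to show how the two can differ; `pairwise_auc` is a helper defined here for illustration, not a library function.

```python
def pairwise_auc(y_true, y_score):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair.
    """
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Made-up data: two negatives followed by two positives.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.45, 0.8]                    # probability-like scores
y_label = [1 if s >= 0.5 else 0 for s in y_score]  # thresholded at 0.5

auc_scores = pairwise_auc(y_true, y_score)
auc_labels = pairwise_auc(y_true, y_label)
print(auc_scores, auc_labels)  # 1.0 0.75
```

The values reported in this tutorial are computed from hard labels, so they are comparable with each other but not necessarily with AUC values computed from probabilities.
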


Next, we reduce the model's complexity with `mcr` and inspect the resulting architecture.

In [None]:
model, transformer = mcr(model = model_lgb,
                         X_train = X_train,
                         y_train = y_train, 
                         X_eval = X_test, 
                         y_eval = y_test, 
                         eval_metric = 'auc', 
                         transform_features = True)

We measure the performance again, this time for the reduced model.

In [20]:
X_test_transformed = transformer.transform(X_test)
y_pred = model.predict(X_test_transformed)
roc_auc = roc_auc_score(y_test, y_pred)
print("ROC_AUC = {}".format(roc_auc))

ROC_AUC = 0.9405872068623853


In [13]:
model.get_params()

{'boosting_type': 'gbdt',
 'class_weight': None,
 'colsample_bytree': 1.0,
 'importance_type': 'split',
 'learning_rate': 0.1,
 'max_depth': 4,
 'min_child_samples': 20,
 'min_child_weight': 0.001,
 'min_split_gain': 0.0,
 'n_estimators': 150,
 'n_jobs': None,
 'num_leaves': 24,
 'objective': None,
 'random_state': None,
 'reg_alpha': 0.0,
 'reg_lambda': 0.0,
 'subsample': 1.0,
 'subsample_for_bin': 200000,
 'subsample_freq': 0,
 'min_data_in_leaf': 51,
 'feature_fraction': 0.461039390211762,
 'bagging_fraction': 0.2799799959644911,
 'verbose': -1,
 'early_stopping_rounds': 10}

In [12]:
model_lgb.get_params()

{'boosting_type': 'gbdt',
 'class_weight': None,
 'colsample_bytree': 0.85,
 'importance_type': 'split',
 'learning_rate': 0.1,
 'max_depth': 15,
 'min_child_samples': 20,
 'min_child_weight': 0.001,
 'min_split_gain': 0.1,
 'n_estimators': 200,
 'n_jobs': None,
 'num_leaves': 50,
 'objective': None,
 'random_state': None,
 'reg_alpha': 1.2,
 'reg_lambda': 1.2,
 'subsample': 0.95,
 'subsample_for_bin': 200000,
 'subsample_freq': 20,
 'verbose': -1}
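
Comparing two parameter dumps by eye is error-prone. A small helper can list only the keys whose values differ. The sketch below runs on a subset of the values printed above; `diff_params` is a hypothetical name, and in the notebook it could be called as `diff_params(model_lgb.get_params(), model.get_params())`.

```python
def diff_params(original, reduced):
    """Map each differing key to an (original, reduced) pair of values.

    Keys missing on one side are reported with None for that side.
    """
    keys = sorted(set(original) | set(reduced))
    return {
        k: (original.get(k), reduced.get(k))
        for k in keys
        if original.get(k) != reduced.get(k)
    }


# Subset of the parameters printed above.
original = {"learning_rate": 0.1, "max_depth": 15, "n_estimators": 200, "num_leaves": 50}
reduced = {"learning_rate": 0.1, "max_depth": 4, "n_estimators": 150, "num_leaves": 24,
           "min_data_in_leaf": 51}

changes = diff_params(original, reduced)
for name, (before, after) in changes.items():
    print(f"{name}: {before} -> {after}")
```
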

As the parameters show, the complexity of the model has decreased drastically: we have gone from 200 trees with a maximum depth of 15 to 150 trees with a maximum depth of 4.
In exchange, the ROC AUC dropped from 0.962 to 0.941, a decrease of about 0.02 (roughly two percentage points).
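
To make the trade-off concrete, the sketch below recomputes the score difference from the two ROC AUC values reported above and derives a rough upper bound on the total number of leaves. The bound assumes that a tree of maximum depth d has at most 2^d leaves and that each tree is also capped by `num_leaves`. With early stopping, the reduced model may contain fewer than `n_estimators` trees, so these are upper bounds rather than measured sizes.

```python
# ROC AUC values reported earlier in this tutorial.
auc_original = 0.9621874665245571
auc_reduced = 0.9405872068623853

absolute_drop = auc_original - auc_reduced
relative_drop = absolute_drop / auc_original
print(f"absolute drop: {absolute_drop:.4f}")  # 0.0216
print(f"relative drop: {relative_drop:.2%}")  # 2.24%


def max_total_leaves(n_estimators, max_depth, num_leaves):
    """Upper bound on leaves: each tree has at most min(num_leaves, 2**max_depth)."""
    return n_estimators * min(num_leaves, 2 ** max_depth)


leaves_original = max_total_leaves(200, 15, 50)  # 200 * 50 = 10000
leaves_reduced = max_total_leaves(150, 4, 24)    # 150 * 16 = 2400
print(leaves_original, leaves_reduced)
```

Under these assumptions, the reduced model has at most about a quarter of the leaves of the original.
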

Finally, we serialize our reduced model as follows:

In [7]:
# Replace "./" with your output path

# Pass the reduced model returned by mcr (model), not the original model_lgb
serialize_model(model, "./", "lgbm_reg.json")