# PyCaret Models

In this notebook I use PyCaret, an automated machine learning tool, to get a quick read on which models and features best predict whether a passenger was transported to another dimension.

I take a step-wise approach to deciding which features and models to carry forward.

I use PyCaret to compare several input feature sets: all features, logarithmic spending features only, raw (non-logarithmic) spending features only, and variants that remove the redundant aggregated "total spending" feature. This lets me observe how various models perform given these different feature sets.
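The feature sets compared in this notebook can be written down once as "columns to drop" lists. The sketch below is only an illustration: the dictionary keys and the `columns_to_drop` helper are hypothetical names, while the column names are the ones present in `train_data`.

```python
# Column groups, using the column names that appear in train_data.
RAW_SPENDING = ["RoomService", "FoodCourt", "ShoppingMall", "Spa", "VRDeck"]
LOG_SPENDING = [f"Log{name}" for name in RAW_SPENDING]
TARGET = "Transported"

# Hypothetical labels for each run, mapped to the feature columns that are
# dropped before preprocessing (the target is always dropped as well).
FEATURE_SETS = {
    "all_features": [],
    "log_with_both_totals": RAW_SPENDING,
    "log_without_raw_total": RAW_SPENDING + ["TotalSpending"],
    "log_individual_only": RAW_SPENDING + ["TotalSpending", "LogTotalSpending"],
    "raw_with_total": LOG_SPENDING + ["LogTotalSpending"],
    "raw_individual_only": LOG_SPENDING + ["TotalSpending", "LogTotalSpending"],
}


def columns_to_drop(feature_set):
    """Return the target plus the columns excluded for the named feature set."""
    return [TARGET] + FEATURE_SETS[feature_set]


for name in FEATURE_SETS:
    print(f"{name}: drops {len(columns_to_drop(name))} columns")
```

With a helper like this, each experiment would start from `X = df.drop(columns=columns_to_drop("log_individual_only"))` rather than a hand-written list.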

In [1]:
import sys
import os

parent_dir = os.path.abspath(os.path.join(os.getcwd(), os.pardir, os.pardir))
sys.path.append(parent_dir)

In [2]:
import pandas as pd
from pycaret.classification import ClassificationExperiment
from sklearn import set_config
from sklearn.compose import ColumnTransformer
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler
from utils.machine_learning import Rounder

In [3]:
train_data = pd.read_pickle(
    "../../data/train_processed.pkl"
)
train_data

Unnamed: 0,PassengerNum,Age,HomePlanet,Destination,CabinDeck,CabinSide,CryoSleep,VIP,RoomService,FoodCourt,...,YesShoppingMall,YesSpa,YesVRDeck,YesTotalSpending,LogRoomService,LogFoodCourt,LogShoppingMall,LogSpa,LogVRDeck,LogTotalSpending
0,01,39.0,Europa,TRAPPIST-1e,B,P,False,False,0.0,0.0,...,False,False,False,False,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
1,01,24.0,Earth,TRAPPIST-1e,F,S,False,False,109.0,9.0,...,True,True,True,True,4.700480,2.302585,3.258097,6.309918,3.806662,6.602588
2,01,58.0,Europa,TRAPPIST-1e,A,S,False,True,43.0,3576.0,...,False,True,True,True,3.784190,8.182280,0.000000,8.812248,3.912023,9.248021
3,02,33.0,Europa,TRAPPIST-1e,A,S,False,False,0.0,1283.0,...,True,True,True,True,0.000000,7.157735,5.918894,8.110728,5.267858,8.551981
4,01,16.0,Earth,TRAPPIST-1e,F,S,False,False,303.0,70.0,...,True,True,True,True,5.717028,4.262680,5.023881,6.338594,1.098612,6.995766
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8688,01,41.0,Europa,55 Cancri e,A,P,False,True,0.0,6819.0,...,False,True,True,True,0.000000,8.827615,0.000000,7.404888,4.317488,9.052165
8689,01,18.0,Earth,PSO J318.5-22,G,S,True,False,0.0,0.0,...,False,False,False,False,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
8690,01,26.0,Earth,TRAPPIST-1e,G,S,False,False,0.0,0.0,...,True,True,False,True,0.000000,0.000000,7.535297,0.693147,0.000000,7.535830
8691,01,32.0,Europa,55 Cancri e,E,S,False,False,0.0,1049.0,...,False,True,True,True,0.000000,6.956545,0.000000,5.869297,8.082093,8.442039
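The `Log*` columns in the preview look like the natural log of one plus the raw amount. That is an inference from the displayed rows, not something the preprocessing code confirms here. A quick check of that assumption against a few values copied from the preview (`log_spending` is a hypothetical helper name):

```python
import math

# (raw value, displayed log value) pairs read from the preview above.
displayed_pairs = [
    (109.0, 4.700480),  # RoomService / LogRoomService, row 1
    (9.0, 2.302585),    # FoodCourt / LogFoodCourt, row 1
    (43.0, 3.784190),   # RoomService / LogRoomService, row 2
    (0.0, 0.0),         # zero spending stays at zero
]


def log_spending(amount):
    """Assumed transform: natural log of (1 + amount), so that 0 maps to 0."""
    return math.log1p(amount)


max_error = max(abs(log_spending(raw) - shown) for raw, shown in displayed_pairs)
print(f"largest mismatch: {max_error:.1e}")
```

If the assumption holds, the mismatch is no larger than the rounding of the displayed values.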


In [4]:
train_data.dtypes

PassengerNum           object
Age                   float64
HomePlanet             object
Destination            object
CabinDeck              object
CabinSide              object
CryoSleep              object
VIP                    object
RoomService           float64
FoodCourt             float64
ShoppingMall          float64
Spa                   float64
VRDeck                float64
Transported            object
NameMissing            object
HomeMissing            object
DestinationMissing     object
CabinMissing           object
CryoMissing            object
VIPMissing             object
PartySize             float64
FamilyGroupMember      object
CabinBin              float64
TotalSpending         float64
YesRoomService         object
YesFoodCourt           object
YesShoppingMall        object
YesSpa                 object
YesVRDeck              object
YesTotalSpending       object
LogRoomService        float64
LogFoodCourt          float64
LogShoppingMall       float64
LogSpa    

In [5]:
train_data.describe(include="all")

Unnamed: 0,PassengerNum,Age,HomePlanet,Destination,CabinDeck,CabinSide,CryoSleep,VIP,RoomService,FoodCourt,...,YesShoppingMall,YesSpa,YesVRDeck,YesTotalSpending,LogRoomService,LogFoodCourt,LogShoppingMall,LogSpa,LogVRDeck,LogTotalSpending
count,8693.0,8514.0,8693,8693,8693,8693,8693,8693,8512.0,8510.0,...,8693,8693,8693,8693,8512.0,8510.0,8485.0,8510.0,8505.0,7785.0
unique,8.0,,4,4,9,3,3,3,,,...,2,2,2,2,,,,,,
top,1.0,,Earth,TRAPPIST-1e,F,S,False,False,,,...,False,False,False,True,,,,,,
freq,6217.0,,4602,5915,2794,4288,5439,8291,,,...,5795,5507,5683,4538,,,,,,
mean,,28.82793,,,,,,,224.687617,458.077203,...,,,,,1.772195,1.947541,1.638622,1.878394,1.796809,4.305709
std,,14.489021,,,,,,,666.717663,1611.48924,...,,,,,2.736122,2.950822,2.586336,2.785687,2.764405,3.700501
min,,0.0,,,,,,,0.0,0.0,...,,,,,0.0,0.0,0.0,0.0,0.0,0.0
25%,,19.0,,,,,,,0.0,0.0,...,,,,,0.0,0.0,0.0,0.0,0.0,0.0
50%,,27.0,,,,,,,0.0,0.0,...,,,,,0.0,0.0,0.0,0.0,0.0,6.602588
75%,,38.0,,,,,,,47.0,76.0,...,,,,,3.871201,4.343805,3.332205,4.094345,3.850148,7.304516


## Include All Data

In [6]:
df = train_data.copy()

In [7]:
X = df.drop(columns="Transported")
y = df["Transported"]

In [8]:
numerical_columns = list(X.select_dtypes(include="number").drop(columns="CabinBin"))
categorical_columns = list(X.select_dtypes(include=["object"]))

In [9]:
set_config(transform_output="pandas")

cat_pipeline = Pipeline([("one_hot", OneHotEncoder(sparse_output=False))])
num_pipeline = Pipeline(
    [
        ("imputer", IterativeImputer(random_state=0)),
        ("scaler", StandardScaler()),
    ]
)
ord_pipeline = Pipeline(
    [
        ("oe", OrdinalEncoder()),
        ("imputer", IterativeImputer(random_state=0)),
        ("rounder", Rounder(decimals=0)),
    ]
)
feature_preprocessing = ColumnTransformer(
    [
        ("cat", cat_pipeline, categorical_columns),
        ("num", num_pipeline, numerical_columns),
        ("ord", ord_pipeline, ["CabinBin"]),
    ],
    verbose_feature_names_out=False,
)

In [10]:
X_processed = feature_preprocessing.fit_transform(X)

In [11]:
pc_workflow = ClassificationExperiment()
pc_workflow.setup(X_processed, target=y, session_id=1)

Unnamed: 0,Description,Value
0,Session id,1
1,Target,Transported
2,Target type,Binary
3,Target mapping,"False: 0, True: 1"
4,Original data shape,"(8693, 76)"
5,Transformed data shape,"(8693, 76)"
6,Transformed train set shape,"(6085, 76)"
7,Transformed test set shape,"(2608, 76)"
8,Numeric features,75
9,Preprocess,True


<pycaret.classification.oop.ClassificationExperiment at 0x137202050>
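The train and test shapes in the setup summary are consistent with a 70/30 split of the 8,693 rows. The sketch below reproduces the arithmetic, assuming PyCaret's default `train_size` of 0.7 and assuming the held-out count is rounded up, as scikit-learn's `train_test_split` does for a fractional test size.

```python
import math

n_rows = 8693      # rows in train_data
train_size = 0.7   # assumed: PyCaret's default for setup(train_size=...)

# Assumption: the held-out size is rounded up, and the remainder is used
# for training.
n_test = math.ceil(n_rows * (1 - train_size))
n_train = n_rows - n_test

print(n_train, n_test)  # 6085 2608
```

The 76 columns in the reported shapes would then be the 75 numeric features plus the target.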

In [12]:
best = pc_workflow.compare_models()

Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC,TT (Sec)
lightgbm,Light Gradient Boosting Machine,0.8026,0.8965,0.8026,0.8031,0.8026,0.6052,0.6057,1.076
gbc,Gradient Boosting Classifier,0.7956,0.893,0.7956,0.7977,0.7951,0.5909,0.5931,0.422
xgboost,Extreme Gradient Boosting,0.7956,0.8913,0.7956,0.7959,0.7955,0.5911,0.5915,0.114
rf,Random Forest Classifier,0.7918,0.8711,0.7918,0.7931,0.7916,0.5837,0.585,0.277
lr,Logistic Regression,0.7911,0.8738,0.7911,0.792,0.7909,0.5821,0.583,0.895
lda,Linear Discriminant Analysis,0.7906,0.8675,0.7906,0.791,0.7905,0.5811,0.5816,0.029
ridge,Ridge Classifier,0.7905,0.8676,0.7905,0.7909,0.7904,0.5808,0.5813,0.034
ada,Ada Boost Classifier,0.7887,0.8765,0.7887,0.7917,0.788,0.577,0.5801,0.142
et,Extra Trees Classifier,0.7788,0.8411,0.7788,0.7826,0.7782,0.5579,0.5615,0.224
knn,K Neighbors Classifier,0.7629,0.8448,0.7629,0.7634,0.7628,0.5258,0.5263,0.05


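A few things to note:
- This feature set is likely to overfit because it contains redundant features (raw, logarithmic, and total spending together).
- Training is slow because there are a lot of features.
- Tree-based ensembles handle this feature set best; the linear models (logistic regression, LDA, ridge) trail LightGBM by roughly one percentage point of accuracy.
- The best mean accuracy in the table above is 0.8026 (LightGBM).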

In [13]:
pc_workflow.evaluate_model(best)

interactive(children=(ToggleButtons(description='Plot Type:', icons=('',), options=(('Pipeline Plot', 'pipelin…

Notes on the Feature Importance plot:
- Age appears to be an important variable.
- Total Spending is important, as are all of the other spending features.
- The model ranks the raw spending features as more important than the logarithmic ones.

## What if we only use the log values for spending variables? 

In [14]:
df = train_data.copy()
df.columns

Index(['PassengerNum', 'Age', 'HomePlanet', 'Destination', 'CabinDeck',
       'CabinSide', 'CryoSleep', 'VIP', 'RoomService', 'FoodCourt',
       'ShoppingMall', 'Spa', 'VRDeck', 'Transported', 'NameMissing',
       'HomeMissing', 'DestinationMissing', 'CabinMissing', 'CryoMissing',
       'VIPMissing', 'PartySize', 'FamilyGroupMember', 'CabinBin',
       'TotalSpending', 'YesRoomService', 'YesFoodCourt', 'YesShoppingMall',
       'YesSpa', 'YesVRDeck', 'YesTotalSpending', 'LogRoomService',
       'LogFoodCourt', 'LogShoppingMall', 'LogSpa', 'LogVRDeck',
       'LogTotalSpending'],
      dtype='object')

In [15]:
df = train_data.copy()
X = df.drop(
    columns=["Transported", "RoomService", "FoodCourt", "ShoppingMall", "Spa", "VRDeck"]
)
y = df["Transported"]
numerical_columns = list(X.select_dtypes(include="number").drop(columns="CabinBin"))
categorical_columns = list(X.select_dtypes(include=["object"]))
set_config(transform_output="pandas")

cat_pipeline = Pipeline([("one_hot", OneHotEncoder(sparse_output=False))])
num_pipeline = Pipeline(
    [
        ("imputer", IterativeImputer(random_state=0)),
        ("scaler", StandardScaler()),
    ]
)
ord_pipeline = Pipeline(
    [
        ("oe", OrdinalEncoder()),
        ("imputer", IterativeImputer(random_state=0)),
        ("rounder", Rounder(decimals=0)),
    ]
)
feature_preprocessing = ColumnTransformer(
    [
        ("cat", cat_pipeline, categorical_columns),
        ("num", num_pipeline, numerical_columns),
        ("ord", ord_pipeline, ["CabinBin"]),
    ],
    verbose_feature_names_out=False,
)
X_processed = feature_preprocessing.fit_transform(X)
pc_workflow = ClassificationExperiment()
pc_workflow.setup(X_processed, target=y, session_id=2)

Unnamed: 0,Description,Value
0,Session id,2
1,Target,Transported
2,Target type,Binary
3,Target mapping,"False: 0, True: 1"
4,Original data shape,"(8693, 71)"
5,Transformed data shape,"(8693, 71)"
6,Transformed train set shape,"(6085, 71)"
7,Transformed test set shape,"(2608, 71)"
8,Numeric features,70
9,Preprocess,True


<pycaret.classification.oop.ClassificationExperiment at 0x11197b710>

In [16]:
best = pc_workflow.compare_models()

Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC,TT (Sec)
gbc,Gradient Boosting Classifier,0.8041,0.8961,0.8041,0.8048,0.8039,0.6081,0.6088,0.416
lightgbm,Light Gradient Boosting Machine,0.803,0.8973,0.803,0.8033,0.8029,0.6059,0.6062,1.051
xgboost,Extreme Gradient Boosting,0.797,0.8951,0.797,0.7975,0.797,0.5941,0.5945,0.099
ada,Ada Boost Classifier,0.788,0.8789,0.788,0.7896,0.7876,0.5757,0.5775,0.17
rf,Random Forest Classifier,0.7877,0.8662,0.7877,0.7909,0.7872,0.5756,0.5787,0.235
ridge,Ridge Classifier,0.7865,0.8612,0.7865,0.7876,0.7863,0.5729,0.5741,0.025
lda,Linear Discriminant Analysis,0.7865,0.8611,0.7865,0.7876,0.7863,0.5729,0.5741,0.036
lr,Logistic Regression,0.7842,0.8626,0.7842,0.7856,0.7839,0.5682,0.5697,0.038
svm,SVM - Linear Kernel,0.7684,0.8512,0.7684,0.7746,0.767,0.5369,0.5429,0.052
et,Extra Trees Classifier,0.7666,0.8285,0.7666,0.7725,0.7655,0.5337,0.5393,0.22


In [17]:
pc_workflow.evaluate_model(best)

interactive(children=(ToggleButtons(description='Plot Type:', icons=('',), options=(('Pipeline Plot', 'pipelin…

This performs slightly better (0.8041 vs. 0.8026 accuracy for the top model), although the top of the leaderboard is still dominated by gradient-boosted tree models. I find it preferable to using all features. One thing I notice, however, is that this feature set includes both TotalSpending and LogTotalSpending. Let's remove this redundancy.
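One way to see why a raw column and its logarithm are redundant for tree-based models: a monotonic transform keeps the ordering of the values, so any threshold split on one column has an equivalent split on the other. The sketch below uses made-up spending totals and a hypothetical `split_mask` helper, and it assumes the log column is `log1p` of the raw one. Imputation and scaling in the real pipeline may blur this equivalence slightly.

```python
import math

# Hypothetical spending totals, for illustration only.
total_spending = [0.0, 736.0, 10383.0, 5176.0, 1091.0]
log_total_spending = [math.log1p(v) for v in total_spending]


def split_mask(values, threshold):
    """Mimic a decision-tree split: True where the value goes to the left branch."""
    return [v <= threshold for v in values]


raw_threshold = 1000.0
raw_split = split_mask(total_spending, raw_threshold)
log_split = split_mask(log_total_spending, math.log1p(raw_threshold))

# Both columns partition the passengers in exactly the same way.
print(raw_split == log_split)  # True
```

Linear models are a different story, since the log transform changes the shape of the relationship they can fit.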

In [18]:
df = train_data.copy()
df.columns

Index(['PassengerNum', 'Age', 'HomePlanet', 'Destination', 'CabinDeck',
       'CabinSide', 'CryoSleep', 'VIP', 'RoomService', 'FoodCourt',
       'ShoppingMall', 'Spa', 'VRDeck', 'Transported', 'NameMissing',
       'HomeMissing', 'DestinationMissing', 'CabinMissing', 'CryoMissing',
       'VIPMissing', 'PartySize', 'FamilyGroupMember', 'CabinBin',
       'TotalSpending', 'YesRoomService', 'YesFoodCourt', 'YesShoppingMall',
       'YesSpa', 'YesVRDeck', 'YesTotalSpending', 'LogRoomService',
       'LogFoodCourt', 'LogShoppingMall', 'LogSpa', 'LogVRDeck',
       'LogTotalSpending'],
      dtype='object')

In [19]:
df = train_data.copy()
X = df.drop(
    columns=[
        "Transported",
        "RoomService",
        "FoodCourt",
        "ShoppingMall",
        "Spa",
        "VRDeck",
        "TotalSpending",
    ]
)
y = df["Transported"]
numerical_columns = list(X.select_dtypes(include="number").drop(columns="CabinBin"))
categorical_columns = list(X.select_dtypes(include=["object"]))
set_config(transform_output="pandas")

cat_pipeline = Pipeline([("one_hot", OneHotEncoder(sparse_output=False))])
num_pipeline = Pipeline(
    [
        ("imputer", IterativeImputer(random_state=0)),
        ("scaler", StandardScaler()),
    ]
)
ord_pipeline = Pipeline(
    [
        ("oe", OrdinalEncoder()),
        ("imputer", IterativeImputer(random_state=0)),
        ("rounder", Rounder(decimals=0)),
    ]
)
feature_preprocessing = ColumnTransformer(
    [
        ("cat", cat_pipeline, categorical_columns),
        ("num", num_pipeline, numerical_columns),
        ("ord", ord_pipeline, ["CabinBin"]),
    ],
    verbose_feature_names_out=False,
)
X_processed = feature_preprocessing.fit_transform(X)

In [20]:
pc_workflow = ClassificationExperiment()
pc_workflow.setup(X_processed, target=y, session_id=3)

Unnamed: 0,Description,Value
0,Session id,3
1,Target,Transported
2,Target type,Binary
3,Target mapping,"False: 0, True: 1"
4,Original data shape,"(8693, 70)"
5,Transformed data shape,"(8693, 70)"
6,Transformed train set shape,"(6085, 70)"
7,Transformed test set shape,"(2608, 70)"
8,Numeric features,69
9,Preprocess,True


<pycaret.classification.oop.ClassificationExperiment at 0x139735fd0>

In [21]:
best = pc_workflow.compare_models()

Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC,TT (Sec)
gbc,Gradient Boosting Classifier,0.8043,0.8938,0.8043,0.8056,0.804,0.6083,0.6098,0.376
lightgbm,Light Gradient Boosting Machine,0.8029,0.8926,0.8029,0.8033,0.8029,0.6058,0.6062,1.072
xgboost,Extreme Gradient Boosting,0.8015,0.8869,0.8015,0.8017,0.8014,0.6029,0.6032,0.092
rf,Random Forest Classifier,0.7916,0.8672,0.7916,0.7942,0.7913,0.5835,0.5859,0.199
ada,Ada Boost Classifier,0.7901,0.8789,0.7901,0.7927,0.7896,0.58,0.5826,0.13
lr,Logistic Regression,0.7806,0.8599,0.7806,0.7817,0.7803,0.561,0.5622,0.048
ridge,Ridge Classifier,0.7801,0.8589,0.7801,0.781,0.7799,0.56,0.561,0.021
lda,Linear Discriminant Analysis,0.7801,0.8588,0.7801,0.781,0.7799,0.56,0.561,0.03
et,Extra Trees Classifier,0.7661,0.833,0.7661,0.7704,0.7654,0.5327,0.5367,0.228
knn,K Neighbors Classifier,0.754,0.8257,0.754,0.755,0.7538,0.5081,0.509,0.045


In [22]:
pc_workflow.evaluate_model(best)

interactive(children=(ToggleButtons(description='Plot Type:', icons=('',), options=(('Pipeline Plot', 'pipelin…

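I would say that this is our best-performing feature set yet, although the margin is small (0.8043 vs. 0.8041 accuracy for the Gradient Boosting Classifier).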
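To put the size of these accuracy differences in context, here is a rough binomial standard error for an accuracy estimate. This is only an approximation: it assumes the cross-validated predictions cover the 6,085 training rows once and treats them as independent, which is not strictly true. `accuracy_standard_error` is a hypothetical helper.

```python
import math


def accuracy_standard_error(accuracy, n_samples):
    """Binomial standard error of an accuracy estimate over n_samples predictions."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_samples)


n_train = 6085          # training rows reported by setup()
with_total = 0.8041     # GBC, log features plus TotalSpending
without_total = 0.8043  # GBC, log features without TotalSpending

se = accuracy_standard_error(without_total, n_train)
gap = without_total - with_total
print(f"gap = {gap:.4f}, approx. standard error = {se:.4f}")
```

Under these assumptions the gap between the two runs is much smaller than the standard error, so the ranking of nearby feature sets should be read as suggestive rather than conclusive.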

## What happens if we remove total spending entirely and have the model rely solely on the individual spending variables?

In [23]:
df.columns

Index(['PassengerNum', 'Age', 'HomePlanet', 'Destination', 'CabinDeck',
       'CabinSide', 'CryoSleep', 'VIP', 'RoomService', 'FoodCourt',
       'ShoppingMall', 'Spa', 'VRDeck', 'Transported', 'NameMissing',
       'HomeMissing', 'DestinationMissing', 'CabinMissing', 'CryoMissing',
       'VIPMissing', 'PartySize', 'FamilyGroupMember', 'CabinBin',
       'TotalSpending', 'YesRoomService', 'YesFoodCourt', 'YesShoppingMall',
       'YesSpa', 'YesVRDeck', 'YesTotalSpending', 'LogRoomService',
       'LogFoodCourt', 'LogShoppingMall', 'LogSpa', 'LogVRDeck',
       'LogTotalSpending'],
      dtype='object')

In [24]:
df = train_data.copy()
X = df.drop(
    columns=[
        "Transported",
        "RoomService",
        "FoodCourt",
        "ShoppingMall",
        "Spa",
        "VRDeck",
        "TotalSpending",
        "LogTotalSpending",
    ]
)
y = df["Transported"]
numerical_columns = list(X.select_dtypes(include="number").drop(columns="CabinBin"))
categorical_columns = list(X.select_dtypes(include=["object"]))
set_config(transform_output="pandas")

cat_pipeline = Pipeline([("one_hot", OneHotEncoder(sparse_output=False))])
num_pipeline = Pipeline(
    [
        ("imputer", IterativeImputer(random_state=0)),
        ("scaler", StandardScaler()),
    ]
)
ord_pipeline = Pipeline(
    [
        ("oe", OrdinalEncoder()),
        ("imputer", IterativeImputer(random_state=0)),
        ("rounder", Rounder(decimals=0)),
    ]
)
feature_preprocessing = ColumnTransformer(
    [
        ("cat", cat_pipeline, categorical_columns),
        ("num", num_pipeline, numerical_columns),
        ("ord", ord_pipeline, ["CabinBin"]),
    ],
    verbose_feature_names_out=False,
)
X_processed = feature_preprocessing.fit_transform(X)

In [25]:
pc_workflow = ClassificationExperiment()
pc_workflow.setup(X_processed, target=y, session_id=3)

Unnamed: 0,Description,Value
0,Session id,3
1,Target,Transported
2,Target type,Binary
3,Target mapping,"False: 0, True: 1"
4,Original data shape,"(8693, 69)"
5,Transformed data shape,"(8693, 69)"
6,Transformed train set shape,"(6085, 69)"
7,Transformed test set shape,"(2608, 69)"
8,Numeric features,68
9,Preprocess,True


<pycaret.classification.oop.ClassificationExperiment at 0x13a40e3d0>

In [26]:
best = pc_workflow.compare_models()

Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC,TT (Sec)
lightgbm,Light Gradient Boosting Machine,0.8082,0.8944,0.8082,0.8085,0.8081,0.6163,0.6167,0.799
gbc,Gradient Boosting Classifier,0.8077,0.8935,0.8077,0.809,0.8075,0.6152,0.6166,0.344
xgboost,Extreme Gradient Boosting,0.7933,0.8832,0.7933,0.7934,0.7932,0.5865,0.5866,0.116
ada,Ada Boost Classifier,0.7908,0.8768,0.7908,0.7923,0.7905,0.5814,0.5829,0.099
rf,Random Forest Classifier,0.7882,0.8642,0.7882,0.7912,0.7877,0.5766,0.5795,0.187
lr,Logistic Regression,0.7803,0.8596,0.7803,0.7811,0.78,0.5604,0.5613,0.047
ridge,Ridge Classifier,0.7801,0.8587,0.7801,0.7809,0.7799,0.56,0.5609,0.021
lda,Linear Discriminant Analysis,0.7798,0.8586,0.7798,0.7806,0.7796,0.5594,0.5603,0.042
et,Extra Trees Classifier,0.7609,0.8304,0.7609,0.7646,0.7602,0.5221,0.5257,0.263
knn,K Neighbors Classifier,0.754,0.8264,0.754,0.755,0.7538,0.5081,0.509,0.048


In [27]:
pc_workflow.evaluate_model(best)

interactive(children=(ToggleButtons(description='Plot Type:', icons=('',), options=(('Pipeline Plot', 'pipelin…

## What if I use the non-logarithmic spending variables?

In [28]:
df = train_data.copy()
X = df.drop(
    columns=[
        "Transported",
        "LogRoomService",
        "LogFoodCourt",
        "LogShoppingMall",
        "LogSpa",
        "LogVRDeck",
        "LogTotalSpending",
    ]
)
y = df["Transported"]
numerical_columns = list(X.select_dtypes(include="number").drop(columns="CabinBin"))
categorical_columns = list(X.select_dtypes(include=["object"]))
set_config(transform_output="pandas")

cat_pipeline = Pipeline([("one_hot", OneHotEncoder(sparse_output=False))])
num_pipeline = Pipeline(
    [
        ("imputer", IterativeImputer(random_state=0)),
        ("scaler", StandardScaler()),
    ]
)
ord_pipeline = Pipeline(
    [
        ("oe", OrdinalEncoder()),
        ("imputer", IterativeImputer(random_state=0)),
        ("rounder", Rounder(decimals=0)),
    ]
)
feature_preprocessing = ColumnTransformer(
    [
        ("cat", cat_pipeline, categorical_columns),
        ("num", num_pipeline, numerical_columns),
        ("ord", ord_pipeline, ["CabinBin"]),
    ],
    verbose_feature_names_out=False,
)
X_processed = feature_preprocessing.fit_transform(X)

In [29]:
pc_workflow = ClassificationExperiment()
pc_workflow.setup(X_processed, target=y, session_id=5)

Unnamed: 0,Description,Value
0,Session id,5
1,Target,Transported
2,Target type,Binary
3,Target mapping,"False: 0, True: 1"
4,Original data shape,"(8693, 70)"
5,Transformed data shape,"(8693, 70)"
6,Transformed train set shape,"(6085, 70)"
7,Transformed test set shape,"(2608, 70)"
8,Numeric features,69
9,Preprocess,True


<pycaret.classification.oop.ClassificationExperiment at 0x13a320610>

In [30]:
best = pc_workflow.compare_models()

Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC,TT (Sec)
lightgbm,Light Gradient Boosting Machine,0.8051,0.896,0.8051,0.8052,0.8051,0.6102,0.6103,0.977
gbc,Gradient Boosting Classifier,0.8003,0.8938,0.8003,0.8005,0.8003,0.6006,0.6008,0.345
xgboost,Extreme Gradient Boosting,0.7979,0.8893,0.7979,0.7984,0.7978,0.5958,0.5963,0.089
rf,Random Forest Classifier,0.7898,0.8662,0.7898,0.793,0.7893,0.5799,0.583,0.21
lr,Logistic Regression,0.7883,0.8721,0.7883,0.7886,0.7883,0.5766,0.5769,0.038
ada,Ada Boost Classifier,0.7864,0.8764,0.7864,0.7865,0.7863,0.5726,0.5728,0.101
ridge,Ridge Classifier,0.7798,0.8599,0.7798,0.7813,0.7796,0.5598,0.5612,0.031
lda,Linear Discriminant Analysis,0.7796,0.8599,0.7796,0.7812,0.7794,0.5594,0.5609,0.032
svm,SVM - Linear Kernel,0.7698,0.8646,0.7698,0.7788,0.7677,0.54,0.5485,0.055
knn,K Neighbors Classifier,0.768,0.8404,0.768,0.7688,0.7678,0.536,0.5368,0.044


## Lastly, let's try again, this time including neither total spending nor the logarithmic spending variables

In [31]:
df = train_data.copy()

In [32]:
X = df.drop(
    columns=[
        "Transported",
        "TotalSpending",
        "LogRoomService",
        "LogFoodCourt",
        "LogShoppingMall",
        "LogSpa",
        "LogVRDeck",
        "LogTotalSpending",
    ]
)
y = df["Transported"]

In [33]:
numerical_columns = list(X.select_dtypes(include="number").drop(columns="CabinBin"))
categorical_columns = list(X.select_dtypes(include=["object"]))

In [34]:
set_config(transform_output="pandas")

cat_pipeline = Pipeline([("one_hot", OneHotEncoder(sparse_output=False))])
num_pipeline = Pipeline(
    [
        ("imputer", IterativeImputer(random_state=0)),
        ("scaler", StandardScaler()),
    ]
)
ord_pipeline = Pipeline(
    [
        ("oe", OrdinalEncoder()),
        ("imputer", IterativeImputer(random_state=0)),
        ("rounder", Rounder(decimals=0)),
    ]
)
feature_preprocessing = ColumnTransformer(
    [
        ("cat", cat_pipeline, categorical_columns),
        ("num", num_pipeline, numerical_columns),
        ("ord", ord_pipeline, ["CabinBin"]),
    ],
    verbose_feature_names_out=False,
)

In [35]:
X_processed = feature_preprocessing.fit_transform(X)

In [36]:
pc_workflow = ClassificationExperiment()
pc_workflow.setup(X_processed, target=y, session_id=6)

Unnamed: 0,Description,Value
0,Session id,6
1,Target,Transported
2,Target type,Binary
3,Target mapping,"False: 0, True: 1"
4,Original data shape,"(8693, 69)"
5,Transformed data shape,"(8693, 69)"
6,Transformed train set shape,"(6085, 69)"
7,Transformed test set shape,"(2608, 69)"
8,Numeric features,68
9,Preprocess,True


<pycaret.classification.oop.ClassificationExperiment at 0x13a168890>

In [37]:
best = pc_workflow.compare_models()

Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC,TT (Sec)
lightgbm,Light Gradient Boosting Machine,0.8066,0.8952,0.8066,0.807,0.8065,0.6131,0.6135,0.942
gbc,Gradient Boosting Classifier,0.8026,0.8922,0.8026,0.8034,0.8025,0.6051,0.6059,0.393
xgboost,Extreme Gradient Boosting,0.7974,0.887,0.7974,0.7976,0.7973,0.5947,0.595,0.116
ada,Ada Boost Classifier,0.7918,0.8771,0.7918,0.7927,0.7916,0.5834,0.5844,0.1
lr,Logistic Regression,0.791,0.8759,0.791,0.7918,0.7907,0.5818,0.5826,0.052
rf,Random Forest Classifier,0.787,0.8651,0.787,0.7897,0.7866,0.5743,0.5769,0.206
ridge,Ridge Classifier,0.776,0.857,0.776,0.7774,0.7758,0.5521,0.5535,0.021
lda,Linear Discriminant Analysis,0.776,0.8569,0.776,0.7774,0.7758,0.5521,0.5535,0.024
knn,K Neighbors Classifier,0.764,0.8384,0.764,0.7647,0.7639,0.5281,0.5287,0.047
svm,SVM - Linear Kernel,0.7579,0.8683,0.7579,0.7811,0.752,0.5159,0.5378,0.04
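For a side-by-side view, the top row of each `compare_models()` table in this notebook can be collected and ranked. The accuracies are copied from the outputs above; the dictionary keys are informal labels of my own.

```python
# Top row of each compare_models() table, copied from the outputs in this notebook.
# The keys are informal labels, not names used elsewhere.
top_models = {
    "all features": ("lightgbm", 0.8026),
    "log spending, both totals": ("gbc", 0.8041),
    "log spending, LogTotalSpending only": ("gbc", 0.8043),
    "log individual spending only": ("lightgbm", 0.8082),
    "raw spending, TotalSpending only": ("lightgbm", 0.8051),
    "raw individual spending only": ("lightgbm", 0.8066),
}

# Rank feature sets by the accuracy of their best model, highest first.
ranked = sorted(top_models.items(), key=lambda item: item[1][1], reverse=True)
for label, (model, accuracy) in ranked:
    print(f"{accuracy:.4f}  {model:<9} {label}")

best_label, (best_model, best_accuracy) = ranked[0]
```

The two highest entries are the feature sets that leave out both total-spending columns.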


In [38]:
pc_workflow.evaluate_model(best)

interactive(children=(ToggleButtons(description='Plot Type:', icons=('',), options=(('Pipeline Plot', 'pipelin…

# Conclusions

- All models performed fairly similarly: across every feature set, the ten models listed fall between roughly 0.75 and 0.81 accuracy.
- Tree-based models performed the best: LightGBM and the Gradient Boosting Classifier typically had the highest accuracy, followed by XGBoost and Random Forest.
- Logistic Regression varied in its performance, performing best with all of the information (however, this is potentially evidence of overfitting rather than generalizable performance).
- Based on these results, my approach will be to optimize a variety of these models and combine them in an ensemble voting model.
- Overall, I recommend a tree-based model as the primary method, trained on the individual spending features (raw or logarithmic) without the total spending feature.
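As a reminder of what an ensemble voting model does, here is a minimal pure-Python sketch of hard (majority) voting. It is an illustration, not the implementation I will use; in practice scikit-learn's `VotingClassifier` or PyCaret's `blend_models` would do this. The predictions are made up and `hard_vote` is a hypothetical helper.

```python
from collections import Counter


def hard_vote(predictions_per_model):
    """Return the majority label for each sample across models.

    predictions_per_model: list of equal-length prediction lists, one per model.
    """
    n_samples = len(predictions_per_model[0])
    if any(len(p) != n_samples for p in predictions_per_model):
        raise ValueError("every model must predict the same number of samples")
    voted = []
    # zip(*...) groups the predictions for one sample across all models.
    for sample_votes in zip(*predictions_per_model):
        label, _count = Counter(sample_votes).most_common(1)[0]
        voted.append(label)
    return voted


# Hypothetical predictions from three models for four passengers.
lightgbm_pred = [True, False, True, True]
gbc_pred = [True, False, False, True]
xgboost_pred = [False, False, True, True]

print(hard_vote([lightgbm_pred, gbc_pred, xgboost_pred]))  # [True, False, True, True]
```

Using an odd number of models avoids ties on a binary target; soft voting would average predicted probabilities instead of counting labels.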