## Hotel Booking Classification Exercise

This exercise uses the Hotel Reservations dataset from Kaggle, which has been used in ML competitions as a classification task: predicting booking cancellations.

The metadata and the dataset download are available here:  
https://www.kaggle.com/datasets/ahsan81/hotel-reservations-classification-dataset

#### *I will keep the code in this exercise to a minimum, since the lab compares and contrasts it with PyCaret classification.*

In [1]:
# imports
from autogluon.tabular import TabularDataset, TabularPredictor
from sklearn.model_selection import train_test_split

In [2]:
# Load the CSV with TabularDataset so it can be passed straight to TabularPredictor
# The result behaves like a pandas DataFrame, so all pandas functionality is still available

data = TabularDataset('https://raw.githubusercontent.com/psdbia/Class-Share/refs/heads/main/data/Hotel%20Reservations.csv')
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 36275 entries, 0 to 36274
Data columns (total 19 columns):
 #   Column                                Non-Null Count  Dtype  
---  ------                                --------------  -----  
 0   Booking_ID                            36275 non-null  object 
 1   no_of_adults                          36275 non-null  int64  
 2   no_of_children                        36275 non-null  int64  
 3   no_of_weekend_nights                  36275 non-null  int64  
 4   no_of_week_nights                     36275 non-null  int64  
 5   type_of_meal_plan                     36275 non-null  object 
 6   required_car_parking_space            36275 non-null  int64  
 7   room_type_reserved                    36275 non-null  object 
 8   lead_time                             36275 non-null  int64  
 9   arrival_year                          36275 non-null  int64  
 10  arrival_month                         36275 non-null  int64  
 11  arrival_date   

In [3]:
# Train/test split
# Stratified split: preserves the class proportions of 'booking_status' in both the train and test sets
# train_df is used for training and validation; test_df is treated as the unseen dataset
train_df, test_df = train_test_split(data, test_size=0.2, random_state=42, stratify=data['booking_status'])
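
To see what `stratify` does, here is a minimal sketch on a small synthetic frame. The toy values and the 2:1 class ratio are made up for illustration and are not taken from the hotel data. It compares the class proportions in each split against the full frame.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the hotel data: 200 'Not_Canceled' and 100 'Canceled' rows.
toy = pd.DataFrame({
    "lead_time": range(300),
    "booking_status": ["Not_Canceled"] * 200 + ["Canceled"] * 100,
})

toy_train, toy_test = train_test_split(
    toy, test_size=0.2, random_state=42, stratify=toy["booking_status"]
)

# Share of each class in the full frame and in each split.
proportions = pd.DataFrame({
    "full": toy["booking_status"].value_counts(normalize=True),
    "train": toy_train["booking_status"].value_counts(normalize=True),
    "test": toy_test["booking_status"].value_counts(normalize=True),
})
print(proportions)
```

The same `value_counts(normalize=True)` check can be run on `train_df` and `test_df` from the cell above.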

In [4]:
# Create the predictor; trained models are saved under the 'reservation_predictors' directory
predictor = TabularPredictor(label='booking_status', path='reservation_predictors')



In [5]:
# Fit on the training split and observe the log output
predictor.fit(train_df)

Verbosity: 2 (Standard Logging)
AutoGluon Version:  1.2
Python Version:     3.11.11
Operating System:   Windows
Platform Machine:   AMD64
Platform Version:   10.0.26100
CPU Count:          16
Memory Avail:       19.08 GB / 31.91 GB (59.8%)
Disk Space Avail:   620.76 GB / 930.66 GB (66.7%)
No presets specified! To achieve strong results with AutoGluon, it is recommended to use the available presets. Defaulting to `'medium'`...
	Recommended Presets (For more details refer to https://auto.gluon.ai/stable/tutorials/tabular/tabular-essentials.html#presets):
	presets='experimental' : New in v1.2: Pre-trained foundation model + parallel fits. The absolute best accuracy without consideration for inference speed. Does not support GPU.
	presets='best'         : Maximize accuracy. Recommended for most users. Use in competitions and benchmarks.
	presets='high'         : Strong accuracy with fast inference speed.
	presets='good'         : Good accuracy with very fast inference speed.
	presets='medi

[1000]	valid_set's binary_error: 0.114


	0.8908	 = Validation score   (accuracy)
	1.75s	 = Training   runtime
	0.02s	 = Validation runtime
Fitting model: LightGBM ...


[1000]	valid_set's binary_error: 0.1032


	0.9008	 = Validation score   (accuracy)
	1.5s	 = Training   runtime
	0.02s	 = Validation runtime
Fitting model: RandomForestGini ...
	0.9004	 = Validation score   (accuracy)
	0.74s	 = Training   runtime
	0.04s	 = Validation runtime
Fitting model: RandomForestEntr ...
	0.8996	 = Validation score   (accuracy)
	0.73s	 = Training   runtime
	0.05s	 = Validation runtime
Fitting model: CatBoost ...
	0.8964	 = Validation score   (accuracy)
	117.96s	 = Training   runtime
	0.0s	 = Validation runtime
Fitting model: ExtraTreesGini ...
	0.9	 = Validation score   (accuracy)
	0.59s	 = Training   runtime
	0.04s	 = Validation runtime
Fitting model: ExtraTreesEntr ...
	0.9	 = Validation score   (accuracy)
	0.64s	 = Training   runtime
	0.04s	 = Validation runtime
Fitting model: NeuralNetFastAI ...
	0.8784	 = Validation score   (accuracy)
	16.74s	 = Training   runtime
	0.02s	 = Validation runtime
Fitting model: XGBoost ...
	0.8908	 = Validation score   (accuracy)
	0.75s	 = Training   runtime
	0.01s	 = Va

<autogluon.tabular.predictor.predictor.TabularPredictor at 0x1cd2e9e8f90>

In [6]:
# summary
predictor.fit_summary()

If you only need to load model weights and optimizer state, use the safe `Learner.load` instead.
  warn("load_learner` uses Python's insecure pickle module, which can execute malicious arbitrary code when loading. Only load files you trust.\nIf you only need to load model weights and optimizer state, use the safe `Learner.load` instead.")


*** Summary of fit() ***
Estimated performance of each model:
                  model  score_val eval_metric  pred_time_val    fit_time  pred_time_val_marginal  fit_time_marginal  stack_level  can_infer  fit_order
0   WeightedEnsemble_L2     0.9096    accuracy       0.118617  122.114187                0.001014           0.100243            2       True         14
1              LightGBM     0.9008    accuracy       0.016189    1.498738                0.016189           1.498738            1       True          4
2         LightGBMLarge     0.9004    accuracy       0.012040    1.091211                0.012040           1.091211            1       True         13
3      RandomForestGini     0.9004    accuracy       0.039431    0.735633                0.039431           0.735633            1       True          5
4        ExtraTreesEntr     0.9000    accuracy       0.039289    0.636207                0.039289           0.636207            1       True          9
5        ExtraTreesGini   



{'model_types': {'KNeighborsUnif': 'KNNModel',
  'KNeighborsDist': 'KNNModel',
  'LightGBMXT': 'LGBModel',
  'LightGBM': 'LGBModel',
  'RandomForestGini': 'RFModel',
  'RandomForestEntr': 'RFModel',
  'CatBoost': 'CatBoostModel',
  'ExtraTreesGini': 'XTModel',
  'ExtraTreesEntr': 'XTModel',
  'NeuralNetFastAI': 'NNFastAiTabularModel',
  'XGBoost': 'XGBoostModel',
  'NeuralNetTorch': 'TabularNeuralNetTorchModel',
  'LightGBMLarge': 'LGBModel',
  'WeightedEnsemble_L2': 'WeightedEnsembleModel'},
 'model_performance': {'KNeighborsUnif': 0.7928,
  'KNeighborsDist': 0.8176,
  'LightGBMXT': 0.8908,
  'LightGBM': 0.9008,
  'RandomForestGini': 0.9004,
  'RandomForestEntr': 0.8996,
  'CatBoost': 0.8964,
  'ExtraTreesGini': 0.9,
  'ExtraTreesEntr': 0.9,
  'NeuralNetFastAI': 0.8784,
  'XGBoost': 0.8908,
  'NeuralNetTorch': 0.8832,
  'LightGBMLarge': 0.9004,
  'WeightedEnsemble_L2': 0.9096},
 'model_best': 'WeightedEnsemble_L2',
 'model_paths': {'KNeighborsUnif': ['KNeighborsUnif'],
  'KNeighborsDi
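
The `model_performance` entries in the summary above can be ranked with plain Python to see how much the weighted ensemble gains over the best single model. The dictionary below is copied from that output; the variable names are my own.

```python
# Validation accuracies copied from the fit_summary() output above.
model_performance = {
    "KNeighborsUnif": 0.7928,
    "KNeighborsDist": 0.8176,
    "LightGBMXT": 0.8908,
    "LightGBM": 0.9008,
    "RandomForestGini": 0.9004,
    "RandomForestEntr": 0.8996,
    "CatBoost": 0.8964,
    "ExtraTreesGini": 0.9,
    "ExtraTreesEntr": 0.9,
    "NeuralNetFastAI": 0.8784,
    "XGBoost": 0.8908,
    "NeuralNetTorch": 0.8832,
    "LightGBMLarge": 0.9004,
    "WeightedEnsemble_L2": 0.9096,
}

# Rank every model from best to worst validation accuracy.
ranked = sorted(model_performance.items(), key=lambda kv: kv[1], reverse=True)
for name, score in ranked:
    print(f"{name:<22}{score:.4f}")

# Best non-ensemble model and the ensemble's margin over it.
best_single = max(
    (item for item in model_performance.items()
     if not item[0].startswith("WeightedEnsemble")),
    key=lambda kv: kv[1],
)
gain = model_performance["WeightedEnsemble_L2"] - best_single[1]
print(f"Ensemble gain over {best_single[0]}: {gain:.4f}")
```

These are validation scores from the fit, so they are an estimate; the held-out `test_df` is the fairer check.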

In [7]:
# Validate the model against unseen data: separate the label from the test features
y_test = test_df["booking_status"]
test_data = test_df.drop(columns=["booking_status"])

In [8]:
y_pred = predictor.predict(test_data)

In [9]:
metrics = predictor.evaluate_predictions(y_true=y_test, y_pred=y_pred, auxiliary_metrics=True)

In [10]:
metrics

{'accuracy': 0.9116471399035149,
 'balanced_accuracy': 0.8904026440348543,
 'mcc': 0.7968631713207306,
 'f1': 0.9354416356128512,
 'precision': 0.9194218966541279,
 'recall': 0.9520295202952029}
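
For reference, each metric in that dictionary can be derived from the four cells of a binary confusion matrix. The sketch below uses made-up counts (not the counts behind the numbers above) purely to show the formulas. Which label AutoGluon treats as the positive class can be checked with `predictor.positive_class`.

```python
import math

def binary_metrics(tp, fp, fn, tn):
    """Return common binary-classification metrics from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)        # true positive rate
    specificity = tn / (tn + fp)   # true negative rate
    mcc_denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "balanced_accuracy": (recall + specificity) / 2,
        "mcc": (tp * tn - fp * fn) / mcc_denominator,
        "f1": 2 * precision * recall / (precision + recall),
        "precision": precision,
        "recall": recall,
    }

# Hypothetical counts chosen only for illustration.
toy_metrics = binary_metrics(tp=90, fp=10, fn=5, tn=45)
for name, value in toy_metrics.items():
    print(f"{name:<18}{value:.4f}")
```

Because balanced accuracy is the mean of the two per-class recalls, the reported recall of 0.952 and balanced accuracy of 0.890 imply a recall of about 0.829 on the other class.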

In [18]:
# Feature Importance
importance = predictor.feature_importance(test_df)
importance

These features in provided data are not utilized by the predictor and will be ignored: ['Booking_ID']
Computing feature importance via permutation shuffling for 17 features using 5000 rows with 5 shuffle sets...
	33.28s	= Expected runtime (6.66s per shuffle set)
	8.75s	= Actual runtime (Completed 5 of 5 shuffle sets)


| feature | importance | stddev | p_value | n | p99_high | p99_low |
|---|---|---|---|---|---|---|
| lead_time | 0.18996 | 0.006864 | 2.04158e-07 | 5 | 0.204092 | 0.175828 |
| no_of_special_requests | 0.1344 | 0.004069 | 1.007336e-07 | 5 | 0.142779 | 0.126021 |
| avg_price_per_room | 0.0912 | 0.003496 | 2.585199e-07 | 5 | 0.098398 | 0.084002 |
| market_segment_type | 0.07836 | 0.002165 | 6.987741e-08 | 5 | 0.082818 | 0.073902 |
| arrival_month | 0.03828 | 0.002897 | 3.90585e-06 | 5 | 0.044245 | 0.032315 |
| arrival_year | 0.02008 | 0.001425 | 3.02734e-06 | 5 | 0.023015 | 0.017145 |
| no_of_weekend_nights | 0.0138 | 0.002054 | 5.72224e-05 | 5 | 0.01803 | 0.00957 |
| arrival_date | 0.01312 | 0.002344 | 0.0001171262 | 5 | 0.017945 | 0.008295 |
| no_of_week_nights | 0.01236 | 0.001417 | 2.037332e-05 | 5 | 0.015278 | 0.009442 |
| no_of_adults | 0.00828 | 0.00218 | 0.0005268726 | 5 | 0.012768 | 0.003792 |
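
The log lines above say the importances come from permutation shuffling: one column is shuffled, predictions are recomputed, and the drop in the score is recorded. Below is a minimal sketch of that idea, not AutoGluon's implementation. `permutation_importance`, `ThresholdModel`, and the toy data are hypothetical names and values of my own.

```python
import numpy as np
import pandas as pd

def permutation_importance(model, X, y, n_shuffles=5, seed=0):
    """Mean drop in accuracy when each column of X is shuffled on its own."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y)
    baseline = np.mean(np.asarray(model.predict(X)) == y)
    importances = {}
    for column in X.columns:
        drops = []
        for _ in range(n_shuffles):
            shuffled = X.copy()
            # Break the link between this column and the label; leave the rest intact.
            shuffled[column] = rng.permutation(X[column].to_numpy())
            score = np.mean(np.asarray(model.predict(shuffled)) == y)
            drops.append(baseline - score)
        importances[column] = float(np.mean(drops))
    return pd.Series(importances).sort_values(ascending=False)

class ThresholdModel:
    """Toy model that looks only at lead_time and ignores every other column."""
    def predict(self, X):
        return np.where(X["lead_time"] > 100, "Canceled", "Not_Canceled")

# Made-up data: labels are generated by the toy model itself.
toy_X = pd.DataFrame({
    "lead_time": np.arange(200),
    "arrival_month": np.arange(200) % 12 + 1,
})
toy_y = ThresholdModel().predict(toy_X)

toy_importance = permutation_importance(ThresholdModel(), toy_X, toy_y)
print(toy_importance)
```

A column the model never reads scores exactly zero here, which is why small importances in the table above are best read alongside their `p_value` and `p99_low` columns.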


In [12]:
# Use Case!
# Adjust the lead time on the reservation, or any other feature, and test!
res = {
    "Booking_ID" : "INN01961234234",
    "no_of_adults" : 1,
    "no_of_children" : 0,
    "no_of_weekend_nights" : 0,
    "no_of_week_nights" : 3,
    "type_of_meal_plan" : "Meal Plan 1",
    "required_car_parking_space" : 0,
    "room_type_reserved" : "Room_Type 1",
    "lead_time" : 190,
    "arrival_year" : 2023,
    "arrival_month" : 11,
    "arrival_date" : 3,
    "market_segment_type" : "Online",
    "repeated_guest" : 0,
    "no_of_previous_cancellations" : 0,
    "no_of_previous_bookings_not_canceled" : 10,
    "avg_price_per_room" : 190.9,
    "no_of_special_requests" : 0
}

In [13]:
reservation_data = TabularDataset([res])
predictor.predict(reservation_data)

0    Canceled
Name: booking_status, dtype: object
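
To act on the suggestion above and vary one feature at a time, a small helper can build the variants and score them in one call. This is only a sketch: `sweep_feature` is a hypothetical helper of my own, and it assumes only that the predictor exposes `predict` on a DataFrame, as `TabularPredictor` does.

```python
import pandas as pd

def sweep_feature(predictor, base_row, feature, values):
    """Predict once per value of `feature`, keeping all other fields of base_row fixed."""
    # One row per candidate value; every other field is copied from base_row.
    variants = pd.DataFrame([{**base_row, feature: value} for value in values])
    predictions = predictor.predict(variants)
    return pd.DataFrame({feature: list(values), "prediction": list(predictions)})

# Example call in this notebook, using `predictor` and `res` defined above:
# sweep_feature(predictor, res, "lead_time", [0, 30, 90, 190, 365])
```

If class probabilities are more useful than hard labels, `predictor.predict_proba` could be swapped in for `predict` with a matching change to the returned columns.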