<div style="background-color: darkslategray; color: white; padding: 15px; border-radius: 8px;">
    <center><h3 style="font-family: Arial, sans-serif;">MLOps Project - Modeling Exploration</h3></center>
</div>

**<h3>Table of Contents</h3>**
* [1. Environment Setup](#1-environment-setup)
    * [1.1 Import Libraries](#11-import-libraries)
    * [1.2 Import Dataset](#12-import-dataset)
* [2. Preprocessing](#2-preprocessing)
    * [2.1 Data Split](#21-data-split)
    * [2.2 Scaling](#22-scaling)
* [3. Modeling](#3-modeling)
    * [3.1 Logistic Regression](#31-logistic-regression)
    * [3.2 Decision Tree](#32-decision-tree)
    * [3.3 Random Forest](#33-random-forest)
    * [3.4 XGBoost](#34-xgboost)
* [4. Wrap-up for Pipelines](#4-wrap-up-for-pipelines)

<div class="alert alert-block alert-success">

# **1.** Environment Setup

<div>

In [4]:
import pandas as pd
import os

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.metrics import classification_report, confusion_matrix

# Models
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.tree import DecisionTreeClassifier

In [5]:
os.getcwd()

'C:\\Users\\Asus\\Documents\\GitHub\\MLOps\\project_mlops\\notebooks'

In [6]:
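# Move from notebooks/ to the project root so that relative data paths resolve (machine-specific path)
os.chdir("C:/Users/Asus/Documents/GitHub/MLOps/project_mlops")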

In [7]:
df = pd.read_csv("data/01_raw/booking.csv")

In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 36285 entries, 0 to 36284
Data columns (total 17 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Booking_ID                36285 non-null  object 
 1   number of adults          36285 non-null  int64  
 2   number of children        36285 non-null  int64  
 3   number of weekend nights  36285 non-null  int64  
 4   number of week nights     36285 non-null  int64  
 5   type of meal              36285 non-null  object 
 6   car parking space         36285 non-null  int64  
 7   room type                 36285 non-null  object 
 8   lead time                 36285 non-null  int64  
 9   market segment type       36285 non-null  object 
 10  repeated                  36285 non-null  int64  
 11  P-C                       36285 non-null  int64  
 12  P-not-C                   36285 non-null  int64  
 13  average price             36285 non-null  float64
 14  specia

<div class="alert alert-block alert-success">

# **2.** Preprocessing

<div>

## **2.1** Data Split

In [11]:
df["booking status"] = df["booking status"].map({
    "Canceled": 1,
    "Not_Canceled": 0
})

# Split features and target
X = df.drop(columns='booking status')
y = df['booking status']
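
`Series.map` with a dictionary silently turns every label that is not a key into `NaN`, so a quick sanity check after the mapping can catch typos or unexpected labels. The sketch below uses a toy series, not the booking data; the misspelled third label is a hypothetical example.

```python
import pandas as pd

# Toy stand-in for df["booking status"]; the third label is a deliberate typo.
status = pd.Series(["Canceled", "Not_Canceled", "not canceled"])

mapped = status.map({"Canceled": 1, "Not_Canceled": 0})

# Any value missing from the dictionary becomes NaN after the mapping.
n_unmapped = int(mapped.isna().sum())
print(f"unmapped labels: {n_unmapped}")
```

Running the same `isna().sum()` check on the real target column before splitting would confirm that only the two expected labels are present.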


In [12]:
# Step 1: Hold out 20% of the data as the test set; the remaining 80% is the temp set
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Step 2: Split the temp set 75/25 into train and validation (60% / 20% of the full data)
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, stratify=y_temp, random_state=42
)
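
The two-step split should yield roughly 60% train, 20% validation and 20% test, with the class ratio preserved by `stratify`. A minimal sketch that checks this on synthetic data, assuming only the same row count as `booking.csv` (36,285); the features and labels here are made up.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: same number of rows as the dataset, roughly one third positives.
rng = np.random.default_rng(42)
X = np.arange(36285).reshape(-1, 1)
y = (rng.random(36285) < 0.33).astype(int)

# Same two-step split as in the notebook cell.
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, stratify=y_temp, random_state=42
)

for name, part in [("train", y_train), ("val", y_val), ("test", y_test)]:
    share = len(part) / len(y)
    print(f"{name}: {len(part)} rows ({share:.1%}), positive rate {part.mean():.3f}")
```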

## **2.2** Scaling

In [14]:
# Identify column types
# Note: X still contains Booking_ID (object dtype), so it ends up in cat_cols
num_cols = X.select_dtypes(include=['int64', 'float64']).columns.tolist()
cat_cols = X.select_dtypes(include=['object', 'category']).columns.tolist()

# Define transformers
numeric_transformer = Pipeline([
    ("imputer", SimpleImputer(strategy="mean")),
    ("scaler", StandardScaler())
])

categorical_transformer = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("encoder", OneHotEncoder(handle_unknown="ignore"))
])

# Combine into a column transformer
preprocessor = ColumnTransformer([
    ("num", numeric_transformer, num_cols),
    ("cat", categorical_transformer, cat_cols)
])

# For tree-based models: imputation only, no scaling
numeric_transformer_tree = Pipeline([
    ("imputer", SimpleImputer(strategy="mean"))
])

# The categorical branch is identical, so categorical_transformer from above is reused
preprocessor_tree = ColumnTransformer([
    ("num", numeric_transformer_tree, num_cols),
    ("cat", categorical_transformer, cat_cols)
])
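
Both preprocessors use `OneHotEncoder(handle_unknown="ignore")`. The sketch below illustrates what that option does when validation or test data contains a category that was absent during fitting. The column values are hypothetical and not taken from the dataset.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical category values, for illustration only.
train = pd.DataFrame({"room type": ["Room_Type 1", "Room_Type 2", "Room_Type 1"]})
new = pd.DataFrame({"room type": ["Room_Type 9"]})  # category never seen during fit

encoder = OneHotEncoder(handle_unknown="ignore").fit(train)

# An unseen category is encoded as all zeros instead of raising an error.
encoded_new = encoder.transform(new).toarray()
print(encoded_new)
```

With the default `handle_unknown="error"`, the same `transform` call would raise instead, which is why the option matters once the pipeline meets unseen data.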


<div class="alert alert-block alert-success">

# **3.** Modeling

<div>

## **3.1** Logistic Regression
Possible parameters to tune:

| Parameter | Description                                                            | Common Values                      |
| --------- | ---------------------------------------------------------------------- | ---------------------------------- |
| `C`       | Inverse of regularization strength (smaller = stronger regularization) | `0.01`, `0.1`, `1`, `10`, `100`    |
| `penalty` | Type of regularization                                                 | `'l1'`, `'l2'`, `'elasticnet'`     |
| `solver`  | Optimization algorithm (depends on penalty)                            | `'liblinear'`, `'saga'`, `'lbfgs'` |
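
Because not every solver supports every penalty, a tuning grid is usually written as a list of dictionaries, one per valid combination. The sketch below is one possible way to do that with `GridSearchCV` on synthetic data; the dataset, the `l1_ratio` value and the `f1` scoring choice are assumptions, not settings from this project.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic data as a stand-in for the preprocessed booking features.
X, y = make_classification(n_samples=500, n_features=10, random_state=42)

C_values = [0.01, 0.1, 1, 10, 100]

# One dictionary per solver, listing only the penalties that solver supports.
param_grid = [
    {"solver": ["lbfgs"], "penalty": ["l2"], "C": C_values},
    {"solver": ["liblinear"], "penalty": ["l1", "l2"], "C": C_values},
    # elasticnet is only available with saga and needs an l1_ratio
    {"solver": ["saga"], "penalty": ["elasticnet"], "l1_ratio": [0.5], "C": C_values},
]

search = GridSearchCV(
    LogisticRegression(max_iter=5000, random_state=42),
    param_grid,
    scoring="f1",
    cv=3,
)
search.fit(X, y)
print(search.best_params_)
```

When the estimator sits inside a pipeline such as `logreg_pipe`, the keys would need the step prefix (for example `clf__C`).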

- Open question: run the tuning through MLflow so that each model version is recorded?

In [17]:
logreg_pipe = Pipeline([
    ("preprocessing", preprocessor),
    ("clf", LogisticRegression(max_iter=1000, random_state=42))
])


# Train on train set
logreg_pipe.fit(X_train, y_train)

# Validate on validation set
y_val_pred_lg = logreg_pipe.predict(X_val)
print("Validation Results (Logistic Regression):")
print('___________________________________________________________________________________________________________')
print('                                            Confusion Matrix                                               ')
print(confusion_matrix(y_val, y_val_pred_lg))
print('___________________________________________________________________________________________________________')
print('                                       Classification Report                                               ')
print(classification_report(y_val, y_val_pred_lg))

Validation Results (Logistic Regression):
___________________________________________________________________________________________________________
                                            Confusion Matrix                                               
[[4371  508]
 [ 797 1581]]
___________________________________________________________________________________________________________
                                       Classification Report                                               
              precision    recall  f1-score   support

           0       0.85      0.90      0.87      4879
           1       0.76      0.66      0.71      2378

    accuracy                           0.82      7257
   macro avg       0.80      0.78      0.79      7257
weighted avg       0.82      0.82      0.82      7257
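
The class-1 figures in the report can be recomputed directly from the confusion matrix (rows are true labels, columns are predicted labels), which is a useful cross-check when reading these outputs. A short worked example using the validation matrix printed above:

```python
import numpy as np

# Validation confusion matrix for Logistic Regression, copied from the output above.
cm = np.array([[4371, 508],
               [797, 1581]])
tn, fp, fn, tp = cm.ravel()

precision = tp / (tp + fp)  # share of predicted cancellations that were real
recall = tp / (tp + fn)     # share of real cancellations that were caught
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / cm.sum()

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} accuracy={accuracy:.2f}")
# precision=0.76 recall=0.66 f1=0.71 accuracy=0.82
```

The 797 false negatives are cancellations the model missed, which is why recall for class 1 is the weakest number in the report.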



In [18]:
# Final evaluation on test set
y_test_pred_lg = logreg_pipe.predict(X_test)
print("Final Test Results (Logistic Regression):")
print('___________________________________________________________________________________________________________')
print('                                            Confusion Matrix                                               ')
print(confusion_matrix(y_test, y_test_pred_lg))
print('___________________________________________________________________________________________________________')
print('                                       Classification Report                                               ')
print(classification_report(y_test, y_test_pred_lg))


Final Test Results (Logistic Regression):
___________________________________________________________________________________________________________
                                            Confusion Matrix                                               
[[4416  463]
 [ 828 1550]]
___________________________________________________________________________________________________________
                                       Classification Report                                               
              precision    recall  f1-score   support

           0       0.84      0.91      0.87      4879
           1       0.77      0.65      0.71      2378

    accuracy                           0.82      7257
   macro avg       0.81      0.78      0.79      7257
weighted avg       0.82      0.82      0.82      7257



## **3.2** Decision Tree

Possible parameters to tune:

| Parameter           | Description                                    | Common Values                       |
| ------------------- | ---------------------------------------------- | ----------------------------------- |
| `max_depth`         | Maximum depth of the tree                      | `3`, `5`, `10`, `None`              |
| `min_samples_split` | Min samples required to split an internal node | `2`, `5`, `10`                      |
| `min_samples_leaf`  | Min samples required at a leaf node            | `1`, `5`, `10`                      |
| `criterion`         | Function to measure split quality              | `'gini'`, `'entropy'`, `'log_loss'` |
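
To see why `max_depth` is usually the first parameter to tune, the sketch below compares train and validation accuracy for the depths listed in the table. It runs on synthetic data, so the numbers say nothing about the booking dataset; only the pattern (an unrestricted tree memorizing the training set) is the point.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data as a stand-in for the preprocessed booking features.
X, y = make_classification(n_samples=2000, n_features=10, n_informative=5, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

results = {}
for depth in [3, 5, 10, None]:
    tree = DecisionTreeClassifier(max_depth=depth, random_state=42).fit(X_train, y_train)
    # Store the depth actually reached plus train and validation accuracy.
    results[depth] = (tree.get_depth(), tree.score(X_train, y_train), tree.score(X_val, y_val))
    actual, train_acc, val_acc = results[depth]
    print(f"max_depth={depth}: depth={actual}, train={train_acc:.3f}, val={val_acc:.3f}")
```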


In [55]:
# Define the pipeline
dt_pipe = Pipeline([
    ("preprocessing", preprocessor_tree),
    ("clf", DecisionTreeClassifier(random_state=42))
])

# Fit on training data
dt_pipe.fit(X_train, y_train)

# Validate
y_val_pred_dt = dt_pipe.predict(X_val)
print("Validation Results (Decision Tree):")
print('___________________________________________________________________________________________________________')
print('                                            Confusion Matrix                                               ')
print(confusion_matrix(y_val, y_val_pred_dt))
print('___________________________________________________________________________________________________________')
print('                                       Classification Report                                               ')
print(classification_report(y_val, y_val_pred_dt))

Validation Results (Decision Tree):
___________________________________________________________________________________________________________
                                            Confusion Matrix                                               
[[4470  409]
 [ 506 1872]]
___________________________________________________________________________________________________________
                                       Classification Report                                               
              precision    recall  f1-score   support

           0       0.90      0.92      0.91      4879
           1       0.82      0.79      0.80      2378

    accuracy                           0.87      7257
   macro avg       0.86      0.85      0.86      7257
weighted avg       0.87      0.87      0.87      7257



In [21]:
# Final test
y_test_pred_dt = dt_pipe.predict(X_test)
print("Final Test Results (Decision Tree):")
print('___________________________________________________________________________________________________________')
print('                                            Confusion Matrix                                               ')
print(confusion_matrix(y_test, y_test_pred_dt))
print('___________________________________________________________________________________________________________')
print('                                       Classification Report                                               ')
print(classification_report(y_test, y_test_pred_dt))


Final Test Results (Decision Tree):
___________________________________________________________________________________________________________
                                            Confusion Matrix                                               
[[4511  368]
 [ 499 1879]]
___________________________________________________________________________________________________________
                                       Classification Report                                               
              precision    recall  f1-score   support

           0       0.90      0.92      0.91      4879
           1       0.84      0.79      0.81      2378

    accuracy                           0.88      7257
   macro avg       0.87      0.86      0.86      7257
weighted avg       0.88      0.88      0.88      7257



## **3.3** Random Forest

Possible parameters to tune:

| Parameter           | Description                         | Common Values                |
| ------------------- | ----------------------------------- | ---------------------------- |
| `n_estimators`      | Number of trees in the forest       | `100`, `200`, `500`          |
| `max_depth`         | Max depth of each tree              | `None`, `5`, `10`, `20`      |
| `max_features`      | Features to consider when splitting | `'sqrt'`, `'log2'`, `None`   |
| `min_samples_split` | Min samples to split a node         | `2`, `5`, `10`               |
| `min_samples_leaf`  | Min samples at a leaf               | `1`, `5`, `10`               |
| `bootstrap`         | Whether to use bootstrap samples    | `True`, `False`              |
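
With this many parameters, a randomized search is often a cheaper first pass than a full grid. The sketch below shows how the parameter names would be prefixed with the pipeline step name; it assumes a step called `classifier` and uses synthetic data with a deliberately small search space.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline

# Synthetic data as a stand-in for the booking features.
X, y = make_classification(n_samples=400, n_features=8, random_state=42)

pipe = Pipeline([
    ("preprocessing", SimpleImputer(strategy="mean")),
    ("classifier", RandomForestClassifier(random_state=42)),
])

# Keys follow the pattern <step name>__<parameter name>.
param_distributions = {
    "classifier__n_estimators": [50, 100],
    "classifier__max_depth": [None, 5, 10],
    "classifier__max_features": ["sqrt", "log2", None],
    "classifier__min_samples_leaf": [1, 5, 10],
}

search = RandomizedSearchCV(
    pipe, param_distributions, n_iter=5, scoring="f1", cv=3, random_state=42
)
search.fit(X, y)
print(search.best_params_)
```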


In [23]:
rf_pipe = Pipeline([
    ("preprocessing", preprocessor_tree),
    ("classifier", RandomForestClassifier(n_estimators=100, random_state=42))
])

# Train on train set
rf_pipe.fit(X_train, y_train)

# Validate on validation set
y_val_pred_rf = rf_pipe.predict(X_val)
print("🔍 Validation Results (Random Forest):")
print('___________________________________________________________________________________________________________')
print('                                            Confusion Matrix                                               ')
print(confusion_matrix(y_val, y_val_pred_rf))
print('___________________________________________________________________________________________________________')
print('                                       Classification Report                                               ')
print(classification_report(y_val, y_val_pred_rf))

🔍 Validation Results (Random Forest):
___________________________________________________________________________________________________________
                                            Confusion Matrix                                               
[[4658  221]
 [ 606 1772]]
___________________________________________________________________________________________________________
                                       Classification Report                                               
              precision    recall  f1-score   support

           0       0.88      0.95      0.92      4879
           1       0.89      0.75      0.81      2378

    accuracy                           0.89      7257
   macro avg       0.89      0.85      0.86      7257
weighted avg       0.89      0.89      0.88      7257



In [24]:
# Final evaluation on test set
y_test_pred_rf = rf_pipe.predict(X_test)
print("🧪 Final Test Results (Random Forest):")
print('___________________________________________________________________________________________________________')
print('                                            Confusion Matrix                                               ')
print(confusion_matrix(y_test, y_test_pred_rf))
print('___________________________________________________________________________________________________________')
print('                                       Classification Report                                               ')
print(classification_report(y_test, y_test_pred_rf))


🧪 Final Test Results (Random Forest):
___________________________________________________________________________________________________________
                                            Confusion Matrix                                               
[[4678  201]
 [ 610 1768]]
___________________________________________________________________________________________________________
                                       Classification Report                                               
              precision    recall  f1-score   support

           0       0.88      0.96      0.92      4879
           1       0.90      0.74      0.81      2378

    accuracy                           0.89      7257
   macro avg       0.89      0.85      0.87      7257
weighted avg       0.89      0.89      0.89      7257



## **3.4** XGBoost

Possible parameters to tune:

| Parameter          | Description                        | Common Values               |
| ------------------ | ---------------------------------- | --------------------------- |
| `n_estimators`     | Number of boosting rounds (trees)  | `100`, `200`, `500`         |
| `max_depth`        | Max depth of each tree             | `3`, `5`, `10`              |
| `learning_rate`    | Step size shrinkage                | `0.01`, `0.1`, `0.2`, `0.3` |
| `subsample`        | Fraction of rows used per tree     | `0.5`, `0.7`, `1`           |
| `colsample_bytree` | Fraction of features used per tree | `0.5`, `0.7`, `1`           |
| `gamma`            | Min loss reduction to split        | `0`, `1`, `5`               |
| `reg_alpha`        | L1 regularization term             | `0`, `0.1`, `1`             |
| `reg_lambda`       | L2 regularization term             | `0`, `0.1`, `1`             |
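
Taking every value in the table at once gives a very large grid, which is an argument for sampling the space instead of enumerating it. The sketch below only builds and counts the search space with scikit-learn helpers, so it does not need XGBoost installed; the `classifier__` prefix assumes the pipeline step name used in this notebook.

```python
from sklearn.model_selection import ParameterGrid, ParameterSampler

# Search space taken from the table above, prefixed with the pipeline step name.
xgb_space = {
    "classifier__n_estimators": [100, 200, 500],
    "classifier__max_depth": [3, 5, 10],
    "classifier__learning_rate": [0.01, 0.1, 0.2, 0.3],
    "classifier__subsample": [0.5, 0.7, 1.0],
    "classifier__colsample_bytree": [0.5, 0.7, 1.0],
    "classifier__gamma": [0, 1, 5],
    "classifier__reg_alpha": [0, 0.1, 1],
    "classifier__reg_lambda": [0, 0.1, 1],
}

# Full grid: 3**7 * 4 combinations, each of which would be fitted once per CV fold.
n_combinations = len(ParameterGrid(xgb_space))
print(n_combinations)  # 8748

# A random sample of 20 candidates is a far cheaper starting point.
sampled = list(ParameterSampler(xgb_space, n_iter=20, random_state=42))
print(sampled[0])
```

The same dictionary could be passed as `param_distributions` to `RandomizedSearchCV` together with `xgb_pipe`.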


In [26]:
xgb_pipe = Pipeline([
    ("preprocessing", preprocessor_tree),
    ("classifier", XGBClassifier(eval_metric='logloss', random_state=42))
])

# Train on train set
xgb_pipe.fit(X_train, y_train)

# Validate on validation set
y_val_pred_xgb = xgb_pipe.predict(X_val)
print("🔍 Validation Results (XGBoost):")
print('___________________________________________________________________________________________________________')
print('                                            Confusion Matrix                                               ')
print(confusion_matrix(y_val, y_val_pred_xgb))
print('___________________________________________________________________________________________________________')
print('                                       Classification Report                                               ')
print(classification_report(y_val, y_val_pred_xgb))



🔍 Validation Results (XGBoost):
___________________________________________________________________________________________________________
                                            Confusion Matrix                                               
[[4561  318]
 [ 557 1821]]
___________________________________________________________________________________________________________
                                       Classification Report                                               
              precision    recall  f1-score   support

           0       0.89      0.93      0.91      4879
           1       0.85      0.77      0.81      2378

    accuracy                           0.88      7257
   macro avg       0.87      0.85      0.86      7257
weighted avg       0.88      0.88      0.88      7257



In [27]:
# Final evaluation on test set
y_test_pred_xgb = xgb_pipe.predict(X_test)
print("🧪 Final Test Results (XGBoost):")
print('___________________________________________________________________________________________________________')
print('                                            Confusion Matrix                                               ')
print(confusion_matrix(y_test, y_test_pred_xgb))
print('___________________________________________________________________________________________________________')
print('                                       Classification Report                                               ')
print(classification_report(y_test, y_test_pred_xgb))


🧪 Final Test Results (XGBoost):
___________________________________________________________________________________________________________
                                            Confusion Matrix                                               
[[4592  287]
 [ 564 1814]]
___________________________________________________________________________________________________________
                                       Classification Report                                               
              precision    recall  f1-score   support

           0       0.89      0.94      0.92      4879
           1       0.86      0.76      0.81      2378

    accuracy                           0.88      7257
   macro avg       0.88      0.85      0.86      7257
weighted avg       0.88      0.88      0.88      7257



<div class="alert alert-block alert-success">

# **4.** Wrap-up for Pipelines

<div>

#### Logistic Regression
| Requirement              | Details                                                                               |
| ------------------------ | ------------------------------------------------------------------------------------- |
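| **Feature Scaling**      | **Yes** – StandardScaler or MinMaxScaler is recommended: the regularization penalty treats all coefficients alike, so features should be on a comparable scale. |
| **Categorical Encoding** | **Yes** – Use One-Hot Encoding for nominal features; Ordinal Encoding only when the categories have a natural order. |
| **Missing Values**       | **Not supported** – Must be imputed before training.                                |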
| **Feature Selection**    | Optional – Helps prevent overfitting, especially with high-dimensional data.       |
| **Feature Interaction**  | Not handled automatically – must be engineered manually.                            |
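
The missing-values row can be checked directly: scikit-learn's `LogisticRegression` rejects input containing `NaN`, while the same data passes once an imputer precedes it in a pipeline. A minimal sketch on a made-up six-row array:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Tiny made-up dataset with one missing value.
X = np.array([[1.0, 10.0], [2.0, np.nan], [3.0, 30.0],
              [4.0, 40.0], [5.0, 50.0], [6.0, 60.0]])
y = np.array([0, 0, 0, 1, 1, 1])

# Fitting on raw data fails because of the NaN.
try:
    LogisticRegression().fit(X, y)
    raised = False
except ValueError as err:
    raised = True
    print("without imputation:", type(err).__name__)

# Impute, then scale, then fit: the same order as the numeric branch in this notebook.
pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="mean")),
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression()),
])
pipe.fit(X, y)
preds = pipe.predict(X)
print(preds)
```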

#### Decision Trees

| Requirement              | Details                                                              |
| ------------------------ | -------------------------------------------------------------------- |
| **Feature Scaling**      | **No** – Trees are scale-invariant.                                |
| **Categorical Encoding** | **Yes** – Use Ordinal Encoding (not One-Hot for high cardinality). |
| **Missing Values**       | Depends on the implementation and version: older `sklearn` releases require imputation, while recent ones can split on NaN directly. |
| **Feature Selection**    | Built-in – Tree automatically selects relevant features.           |
| **Feature Interaction**  | Handled internally by the tree splits.                             |
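
Scale invariance can be demonstrated by fitting the same tree on raw and on rescaled features and comparing predictions. The sketch below uses made-up integer features and a simple per-feature affine rescaling as a stand-in for `StandardScaler`, chosen so that the arithmetic is exact in floating point.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Made-up integer-valued features and a rule-based label.
rng = np.random.default_rng(42)
X = rng.integers(0, 100, size=(1000, 4)).astype(float)
y = (((X[:, 0] > 50) & (X[:, 1] < 30)) | (X[:, 2] > 90)).astype(int)

# Affine rescaling keeps the ordering of values within each feature.
X_scaled = X * 10.0 + 5.0

tree_raw = DecisionTreeClassifier(max_depth=5, random_state=42).fit(X, y)
tree_scaled = DecisionTreeClassifier(max_depth=5, random_state=42).fit(X_scaled, y)

# Splits depend only on the ordering of values, so both trees partition the rows identically.
same = np.array_equal(tree_raw.predict(X), tree_scaled.predict(X_scaled))
print("identical predictions:", same)
```

This is the reason `preprocessor_tree` skips the scaler: it would add a step without changing the splits.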


#### Random Forest

| Requirement              | Details                                                      |
| ------------------------ | ------------------------------------------------------------ |
| **Feature Scaling**      | **No** – Same reason as for a single tree.                 |
| **Categorical Encoding** | **Yes** – Ordinal Encoding or One-Hot for low-cardinality. |
| **Missing Values**       | Imputation is needed unless the library handles NaN natively. |
| **Feature Selection**    | Implicit – Less important features get low importance.     |
| **Feature Interaction**  | Captures interactions via ensemble splits.                 |



#### XGBoost


| Requirement              | Details                                                                                                                       |
| ------------------------ | ----------------------------------------------------------------------------------------------------------------------------- |
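| **Feature Scaling**      | **No** – Tree splits do not depend on feature scale.       |
| **Categorical Encoding** | **Yes** – Inputs must be numeric; this notebook uses One-Hot Encoding (recent XGBoost versions also offer native categorical support via `enable_categorical`). |
| **Missing Values**       | **Handled natively** – NaN values are sent down a learned default direction at each split, so imputation is optional. |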
| **Feature Selection**    | Built-in regularization + importance metrics.              |
| **Feature Interaction**  | Learns interactions automatically via boosting.            |


