# Hotel Booking Cancellation Prediction

## Project Overview
This notebook walks through an end-to-end workflow for predicting hotel booking cancellations.
Starting from a dataset of hotel bookings, we preprocess the data, train classification models, and evaluate how well they identify bookings that are likely to be canceled.

## Objective
- Analyze factors influencing cancellations.
- Build a machine learning model to predict cancellations.
- Provide actionable business insights.

In [None]:
import sys
import os
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Add parent directory to path to import src modules
sys.path.append(os.path.abspath(os.path.join(os.getcwd(), '..')))

from src.data_preprocessing import load_data, preprocess_data, split_and_scale_data
from src.modeling import train_logistic_regression_statsmodels, train_decision_tree, tune_decision_tree
from src.evaluation import model_performance_classification, plot_confusion_matrix

# Set visualization style
sns.set_style('whitegrid')
%matplotlib inline

## 1. Data Loading and Exploration

In [None]:
# Load data
data_path = '../data/raw/INNHotelsGroup.csv'
df = load_data(data_path)

# Display first few rows
display(df.head())

# Basic Info
df.info()  # info() prints its summary directly and returns None, so no print() is needed

## 2. Data Preprocessing
We will drop the identifier column and encode the target variable.
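
The `preprocess_data` helper is imported from `src.data_preprocessing` and its body is not shown in this notebook. As a rough sketch of the two steps described above, the transformation could look like the following. The identifier column name `Booking_ID`, the raw labels `Canceled` and `Not_Canceled`, and the `lead_time` feature in the toy frame are all assumptions made for illustration, not details confirmed by the source.

```python
import pandas as pd


def preprocess_sketch(df, id_col="Booking_ID", target_col="booking_status"):
    """Hypothetical stand-in for preprocess_data: drop the ID, encode the target."""
    # drop() returns a new DataFrame, so the caller's frame is left untouched
    out = df.drop(columns=[id_col])
    # Assumed label names; any label missing from this mapping would become NaN
    out[target_col] = out[target_col].map({"Not_Canceled": 0, "Canceled": 1})
    return out


# Tiny synthetic frame, only to show the shape of the transformation
toy = pd.DataFrame({
    "Booking_ID": ["A1", "A2", "A3"],
    "lead_time": [10, 200, 35],
    "booking_status": ["Not_Canceled", "Canceled", "Not_Canceled"],
})
toy_clean = preprocess_sketch(toy)
```

After this step the target is numeric, which matches the `0: Not Canceled, 1: Canceled` convention used in the plot title below.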

In [None]:
# Preprocess data
df_clean = preprocess_data(df)

# Check distribution of target variable
sns.countplot(x='booking_status', data=df_clean)
plt.title('Distribution of Booking Status (0: Not Canceled, 1: Canceled)')
os.makedirs('../visuals', exist_ok=True)  # savefig fails if the output folder is missing
plt.savefig('../visuals/booking_status_distribution.png')
plt.show()

## 3. Train-Test Split and Scaling
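
`split_and_scale_data` returns a dictionary holding the raw and scaled splits. Its implementation lives in `src.data_preprocessing` and is not shown here, so the sketch below is only one plausible way to produce the same keys. The 25% test size, the stratified split, the `random_state`, and the use of `StandardScaler` are assumptions. The point it illustrates is that the scaler is fitted on the training split only, so no test-set statistics leak into training.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler


def split_and_scale_sketch(df, target_col="booking_status",
                           test_size=0.25, random_state=1):
    """Hypothetical stand-in for split_and_scale_data."""
    X = df.drop(columns=[target_col])
    y = df[target_col]
    # Stratify so both splits keep the same cancellation rate
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=random_state, stratify=y
    )
    # Fit the scaler on the training data only, then apply it to both splits
    scaler = StandardScaler().fit(X_train)
    X_train_scaled = pd.DataFrame(
        scaler.transform(X_train), columns=X.columns, index=X_train.index
    )
    X_test_scaled = pd.DataFrame(
        scaler.transform(X_test), columns=X.columns, index=X_test.index
    )
    return {
        "X_train": X_train, "X_test": X_test,
        "y_train": y_train, "y_test": y_test,
        "X_train_scaled": X_train_scaled, "X_test_scaled": X_test_scaled,
    }


# Synthetic data with made-up feature names, for illustration only
toy = pd.DataFrame({
    "feature_a": range(12),
    "feature_b": [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5, 8],
    "booking_status": [0, 1] * 6,
})
splits = split_and_scale_sketch(toy)
```

The tree-based models below are trained on the unscaled features, since decision trees are insensitive to feature scale, while the logistic regression uses the scaled ones.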

In [None]:
# Split and scale
data_dict = split_and_scale_data(df_clean)

X_train, y_train = data_dict['X_train'], data_dict['y_train']
X_test, y_test = data_dict['X_test'], data_dict['y_test']
X_train_scaled, X_test_scaled = data_dict['X_train_scaled'], data_dict['X_test_scaled']

print(f"Training set shape: {X_train.shape}")
print(f"Test set shape: {X_test.shape}")

## 4. Model Building
We will train a Logistic Regression model and a Decision Tree classifier.
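
The models are judged on held-out test data through `model_performance_classification`, which is imported from `src.evaluation` and not shown in this notebook. As an assumption about what such a helper reports, the sketch below computes accuracy, recall, precision, and F1 from true and predicted labels in plain Python. The function name and the example labels are hypothetical.

```python
def classification_metrics(y_true, y_pred):
    """Compute basic binary-classification metrics (1 = canceled, 0 = not canceled)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # canceled, predicted canceled
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # kept, predicted kept
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # kept, predicted canceled
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # canceled, predicted kept

    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when a class is never present or predicted
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"Accuracy": accuracy, "Recall": recall,
            "Precision": precision, "F1": f1}


# Made-up labels, for illustration only
y_true_demo = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred_demo = [1, 0, 0, 1, 0, 1, 1, 0]
metrics_demo = classification_metrics(y_true_demo, y_pred_demo)
```

For cancellation prediction, recall measures the share of actual cancellations the model catches, while precision measures how many flagged bookings really were canceled. Which one matters more depends on the cost of each kind of error.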

In [None]:
# Train Logistic Regression (Statsmodels)
print("Training Logistic Regression...")
log_reg_model = train_logistic_regression_statsmodels(X_train_scaled, y_train)
print(log_reg_model.summary())

In [None]:
# Train Decision Tree
print("Training Decision Tree...")
dt_model = train_decision_tree(X_train, y_train)

# Evaluate on Test Set
perf_dt = model_performance_classification(dt_model, X_test, y_test)
print("Decision Tree Performance:")
display(perf_dt)

## 5. Hyperparameter Tuning
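
`tune_decision_tree` is imported from `src.modeling` and its search strategy is not shown in this notebook. One common approach, sketched here purely as an assumption, is a cross-validated grid search with scikit-learn's `GridSearchCV`. The parameter grid, the `recall` scoring choice, the fold count, and the synthetic data are all illustrative and not taken from the project.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier


def tune_tree_sketch(X, y, random_state=1):
    """Hypothetical stand-in for tune_decision_tree using a small grid search."""
    # Illustrative grid: limit depth and leaf size to curb overfitting
    param_grid = {
        "max_depth": [2, 4, 6],
        "min_samples_leaf": [1, 5, 10],
    }
    search = GridSearchCV(
        DecisionTreeClassifier(random_state=random_state),
        param_grid,
        scoring="recall",  # assumed metric; f1 or accuracy are equally valid choices
        cv=5,
    )
    search.fit(X, y)
    # With the default refit=True, best_estimator_ is retrained on all of X, y
    return search.best_estimator_, search.best_params_


# Synthetic data, only so the sketch runs end to end
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=1)
best_tree, best_params = tune_tree_sketch(X_demo, y_demo)
```

Because the search uses cross-validation on the training data only, the test set stays untouched until the final evaluation.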

In [None]:
# Tune Decision Tree
print("Tuning Decision Tree...")
# Note: This might take a minute
dt_tuned = tune_decision_tree(X_train, y_train)

# Evaluate Tuned Model
perf_dt_tuned = model_performance_classification(dt_tuned, X_test, y_test)
print("Tuned Decision Tree Performance:")
display(perf_dt_tuned)

# Confusion Matrix
plot_confusion_matrix(dt_tuned, X_test, y_test)
plt.title('Confusion Matrix - Tuned Decision Tree')
plt.savefig('../visuals/confusion_matrix_tuned_dt.png')
plt.show()

## 6. Conclusion
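The tuned Decision Tree is the final model in this notebook. Compare its test-set metrics with those of the untuned tree above to judge how much the tuning helped.
The feature importances of the tuned model, plotted below, show which booking attributes drive its predictions and are the basis for business insights and recommendations.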

In [None]:
# Feature Importances
import numpy as np
importances = dt_tuned.feature_importances_
indices = np.argsort(importances)[::-1]
feature_names = X_train.columns

plt.figure(figsize=(10, 6))
plt.title("Feature Importances")
plt.bar(range(X_train.shape[1]), importances[indices], align="center")
plt.xticks(range(X_train.shape[1]), feature_names[indices], rotation=90)
plt.tight_layout()
plt.savefig('../visuals/feature_importance.png')
plt.show()