<a href="https://colab.research.google.com/github/hansensean123-cell/Sean-Hansen/blob/main/Assignments/assignment_12_bayes_svm_neural.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Assignment 12: Predicting Hotel Booking Cancellations  
## Models: Naïve Bayes, Support Vector Machine (SVM), and Neural Network

**Objectives:**
- Understand how to use classification models (Naïve Bayes, SVM, Neural Networks) to predict hotel cancellations.
- Compare models in terms of accuracy, complexity, and business relevance.
- Interpret and communicate model results from a business perspective.

## Business Scenario

You work as a data analyst for a hospitality group that manages both **Resort** and **City Hotels**. One major challenge in operations is the unpredictability of **booking cancellations**, which affects staffing, inventory, and revenue planning.

You've been asked to use historical booking data to predict whether a future booking will be canceled. Your insights will help management plan more effectively.


Your task is to:
1. Build and evaluate three models: Naïve Bayes, SVM, and Neural Network.
2. Compare performance.
3. Recommend which model is best suited for the business needs.



## Dataset Description: Hotel Bookings

This dataset contains booking information for two types of hotels: a **city hotel** and a **resort hotel**. Each record corresponds to a single booking and includes various details about the reservation, customer demographics, booking source, and whether the booking was canceled.

**Source**: [GitHub - TidyTuesday: Hotel Bookings](https://github.com/rfordatascience/tidytuesday/blob/master/data/2020/2020-02-11/readme.md)

### Key Use Cases
- Understand customer booking behavior
- Explore factors related to cancellations
- Segment guests based on booking characteristics
- Compare city vs. resort hotel performance

### Data Dictionary

| Variable | Type | Description |
|----------|------|-------------|
| `hotel` | character | Hotel type: City or Resort |
| `is_canceled` | integer | 1 = Canceled, 0 = Not Canceled |
| `lead_time` | integer | Days between booking and arrival |
| `arrival_date_year` | integer | Year of arrival |
| `arrival_date_month` | character | Month of arrival |
| `stays_in_weekend_nights` | integer | Nights stayed on weekends |
| `stays_in_week_nights` | integer | Nights stayed on weekdays |
| `adults` | integer | Number of adults |
| `children` | integer | Number of children |
| `babies` | integer | Number of babies |
| `meal` | character | Type of meal booked |
| `country` | character | Country code of origin |
| `market_segment` | character | Booking source (e.g., Direct, Online TA) |
| `distribution_channel` | character | Booking channel used |
| `is_repeated_guest` | integer | 1 = Repeated guest, 0 = New guest |
| `previous_cancellations` | integer | Past booking cancellations |
| `previous_bookings_not_canceled` | integer | Past bookings not canceled |
| `reserved_room_type` | character | Initially reserved room type |
| `assigned_room_type` | character | Room type assigned at check-in |
| `booking_changes` | integer | Number of booking modifications |
| `deposit_type` | character | Deposit type (No Deposit, Non-Refund, etc.) |
| `agent` | character | Agent ID who made the booking |
| `company` | character | Company ID (if booking through company) |
| `days_in_waiting_list` | integer | Days on the waiting list |
| `customer_type` | character | Booking type: Contract, Transient, etc. |
| `adr` | float | Average Daily Rate (price per night) |
| `required_car_parking_spaces` | integer | Requested parking spots |
| `total_of_special_requests` | integer | Number of special requests made |
| `reservation_status` | character | Final status (Canceled, No-Show, Check-Out) |
| `reservation_status_date` | date | Date of the last status update |

This dataset is ideal for classification, segmentation, and trend analysis exercises.


## 1. Load and Prepare the Hotel Booking Dataset

**Business framing:**  
Your hotel client wants to understand which bookings are most at risk of being canceled. But before modeling, your job is to prepare the data to ensure clean and reliable input.

### Do the following:
- Load the `hotels.csv` file
- Remove or impute missing values
- Encode categorical variables
- Create your `X` (features) and `y` (target = `is_canceled`)
- Split the data into training and test sets (70/30)

### In Your Response:
1. How many total rows and columns are in the dataset?
2. What types of features (categorical, numerical) are included?
3. What steps did you take to clean or prepare the data?


In [None]:
# Add code here 🔧
import pandas as pd

df = pd.read_csv('/content/hotels.csv')
print("DataFrame loaded successfully. First 5 rows:")
print(df.head())

print("Missing values before imputation:")
print(df.isnull().sum()[df.isnull().sum() > 0])

df['children'] = df['children'].fillna(df['children'].mode()[0])
df['country'] = df['country'].fillna(df['country'].mode()[0])
df['agent'] = df['agent'].fillna(0)
df['company'] = df['company'].fillna(0)

print("Missing values after imputation:")
print(df.isnull().sum()[df.isnull().sum() > 0])

print("Categorical columns identified:")
categorical_cols = df.select_dtypes(include=['object']).columns
print(categorical_cols)

df['reservation_status_date'] = pd.to_datetime(df['reservation_status_date'])

# Drop reservation_status and reservation_status_date as they are direct outcomes of cancellation or date of final status
df = df.drop(columns=['reservation_status', 'reservation_status_date'])

categorical_cols = df.select_dtypes(include=['object']).columns

print("Categorical columns to encode:")
print(categorical_cols)

df_encoded = pd.get_dummies(df, columns=categorical_cols, drop_first=True)

print("DataFrame after one-hot encoding. First 5 rows:")
print(df_encoded.head())
print("Shape after encoding:", df_encoded.shape)

from sklearn.model_selection import train_test_split

X = df_encoded.drop('is_canceled', axis=1)
y = df_encoded['is_canceled']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

print("Shape of X_train:", X_train.shape)
print("Shape of X_test:", X_test.shape)
print("Shape of y_train:", y_train.shape)
print("Shape of y_test:", y_test.shape)

### ✍️ Your Response: 🔧
1. The dataset initially contained 119,390 rows and 32 columns. After dropping `reservation_status` and `reservation_status_date` and one-hot encoding the remaining categorical columns, the encoded DataFrame (`df_encoded`) has 119,390 rows and 248 columns.
2. The dataset includes both numerical and categorical features. Numerical features include `lead_time`, `arrival_date_year`, `arrival_date_week_number`, `arrival_date_day_of_month`, `stays_in_weekend_nights`, `stays_in_week_nights`, `adults`, `children`, `babies`, `is_repeated_guest`, `previous_cancellations`, `previous_bookings_not_canceled`, `booking_changes`, `agent`, `company`, `days_in_waiting_list`, `adr`, `required_car_parking_spaces`, and `total_of_special_requests`. Categorical features, before encoding, included `hotel`, `arrival_date_month`, `meal`, `country`, `market_segment`, `distribution_channel`, `reserved_room_type`, `assigned_room_type`, `deposit_type`, `customer_type`, `reservation_status`, and `reservation_status_date`.
3. The following steps were taken to clean and prepare the data:
   - **Missing value imputation:** Missing values in the `children` and `country` columns were imputed with their respective modes. Missing values in the `agent` and `company` columns were filled with 0.
   - **Type conversion:** The `reservation_status_date` column was converted to datetime objects.
   - **Column dropping:** The `reservation_status` and `reservation_status_date` columns were dropped because they directly reflect the outcome (cancellation status), so keeping them as features would leak the target into the model.
   - **Categorical encoding:** All remaining categorical columns were converted to numerical format using one-hot encoding (`pd.get_dummies`) with `drop_first=True`, which drops one level per column to avoid perfectly collinear dummy columns.
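The imputation and encoding steps described above can be illustrated on a tiny example. This is only a sketch: the four rows below are made-up stand-ins for `hotels.csv`, and the names `toy`, `text_cols`, and `toy_encoded` are hypothetical.

```python
import pandas as pd

# Hypothetical toy rows standing in for hotels.csv; the values are made up.
toy = pd.DataFrame({
    "hotel": ["City Hotel", "Resort Hotel", "City Hotel", "Resort Hotel"],
    "children": [0.0, None, 1.0, 0.0],
    "deposit_type": ["No Deposit", "Non Refund", "No Deposit", "Refundable"],
    "is_canceled": [0, 1, 0, 1],
})

# Impute the missing value with the column mode (0.0 in this toy data).
toy["children"] = toy["children"].fillna(toy["children"].mode()[0])

# One-hot encode the text columns; drop_first=True drops one level per column,
# so a 2-level column yields 1 dummy and a 3-level column yields 2.
text_cols = ["hotel", "deposit_type"]
toy_encoded = pd.get_dummies(toy, columns=text_cols, drop_first=True)

print(toy_encoded.columns.tolist())
print("Shape after encoding:", toy_encoded.shape)
```

The same mechanism explains why the real dataset grows from 32 columns to a few hundred: a high-cardinality column such as `country` contributes one dummy column per level, minus the dropped one.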

## 2. Build a Naïve Bayes Model

**Business framing:**  
Naïve Bayes is a quick, baseline model often used for early testing or simple classification problems.

### Do the following:
- Train a Naïve Bayes classifier on your training data
- Use it to predict on your test data
- Print a classification report and confusion matrix

### In Your Response:
1. How accurate is this model?
2. Where might this model be useful for the hotel (e.g. real-time alerts, operational decisions)?


In [None]:
# Add code here 🔧
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import classification_report, confusion_matrix

# Instantiate the Gaussian Naive Bayes model
nb_model = GaussianNB()

# Train the model
nb_model.fit(X_train, y_train)

print("Naïve Bayes model trained successfully.")

# Predict cancellations for the held-out test set
y_pred_nb = nb_model.predict(X_test)
print("Predictions made successfully on test data.")

# Evaluate the predictions against the true test labels
print("Classification Report for Naïve Bayes Model:")
print(classification_report(y_test, y_pred_nb))

print("\nConfusion Matrix for Naïve Bayes Model:")
print(confusion_matrix(y_test, y_pred_nb))



### ✍️ Your Response: 🔧
1.

2.

## 3. Build a Support Vector Machine (SVM) Model

**Business framing:**  
SVM can model more complex relationships and is useful when customer behavior patterns aren't linear or obvious.

### Do the following:
- Train an SVM classifier (use RBF kernel)
- Make predictions and evaluate with classification metrics

### In Your Response:
1. How well does the model perform?
2. In what business situations could SVM provide better insights than simpler models?


In [None]:
# Add code here 🔧
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print("Data scaled successfully for SVM.")

from sklearn.svm import SVC

# Initialize an SVC model with an RBF kernel
svm_model = SVC(kernel='rbf', random_state=42)

# Train the SVM model using the scaled training data
svm_model.fit(X_train_scaled, y_train)

print("SVM model trained successfully.")

# Make predictions on the scaled test data
y_pred_svm = svm_model.predict(X_test_scaled)
print("Predictions made successfully on test data for SVM.")

### ✍️ Your Response: 🔧
1.

2.

## 4. Build a Neural Network Model

**Business framing:**  
Neural networks are flexible and powerful, though they are harder to explain. They may work well when subtle patterns exist in the data.

### Do the following:
- Build an `MLPClassifier` model using the `sklearn.neural_network` module
- Choose a simple architecture (e.g., 2 hidden layers)
- Evaluate accuracy and performance

### In Your Response:
1. How does this model compare to the others?
2. Would the business be comfortable using a "black box" model like this? Why or why not?
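As a rough illustration of the "simple architecture" mentioned above, the sketch below trains an `MLPClassifier` with two hidden layers. It runs on synthetic data from `make_classification` rather than the hotel dataset, and the layer sizes `(32, 16)`, `max_iter=500`, and all `*_demo` names are illustrative assumptions, not values prescribed by the assignment.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption); swap in X_train / X_test from Section 1.
X_demo, y_demo = make_classification(n_samples=1000, n_features=20, random_state=42)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.3, random_state=42)

# Neural networks generally train more reliably on standardized inputs.
scaler_demo = StandardScaler()
Xtr_s = scaler_demo.fit_transform(Xtr)
Xte_s = scaler_demo.transform(Xte)

# Two hidden layers with 32 and 16 units (illustrative sizes).
mlp_demo = MLPClassifier(hidden_layer_sizes=(32, 16), max_iter=500, random_state=42)
mlp_demo.fit(Xtr_s, ytr)

mlp_pred = mlp_demo.predict(Xte_s)
print(classification_report(yte, mlp_pred))
```

On the real data, the scaled arrays produced for the SVM in Section 3 could likely be reused here instead of fitting a second scaler.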


In [None]:
# Add code here 🔧

### ✍️ Your Response: 🔧
1.

2.

## 5. Compare All Three Models

### Do the following:
- Print and compare the accuracy of Naïve Bayes, SVM, and Neural Network models
- Summarize which model performed best

### In Your Response:
1. Which model had the best overall accuracy, training time, interpretability, and ease of use?
2. Would you recommend this model for deployment, and why?
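One possible way to line up accuracy and training time for the three model families is sketched below. It uses synthetic data so that it runs on its own; the `models` dictionary, the `results` structure, and the hyperparameters are assumptions for illustration, and the numbers it prints say nothing about the hotel dataset.

```python
import time

from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (assumption); replace with the hotel train/test split.
X_demo, y_demo = make_classification(n_samples=1000, n_features=20, random_state=42)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.3, random_state=42)

# SVM and the neural network are wrapped with a scaler; Naive Bayes is not.
models = {
    "Naive Bayes": GaussianNB(),
    "SVM (RBF)": make_pipeline(StandardScaler(), SVC(kernel="rbf", random_state=42)),
    "Neural Network": make_pipeline(
        StandardScaler(),
        MLPClassifier(hidden_layer_sizes=(32, 16), max_iter=500, random_state=42),
    ),
}

results = {}
for name, model in models.items():
    start = time.perf_counter()
    model.fit(Xtr, ytr)
    fit_seconds = time.perf_counter() - start
    results[name] = {
        "accuracy": accuracy_score(yte, model.predict(Xte)),
        "fit_seconds": fit_seconds,
    }

for name, row in results.items():
    print(f"{name:15s} accuracy={row['accuracy']:.3f} fit_time={row['fit_seconds']:.2f}s")

best_model = max(results, key=lambda n: results[n]["accuracy"])
print("Highest accuracy:", best_model)
```

Accuracy and fit time are the only criteria measured here; interpretability and ease of use still need a qualitative judgment.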


In [None]:
# Add code here 🔧

### ✍️ Your Response: 🔧
1.

2.

## 6. Final Business Recommendation

### In Your Response:
1. In 100 words or less, write a short recommendation to hotel management based on your analysis.

Possible info to include:
- Which model do you recommend implementing?
- What business problem does it help solve?
- Are there any risks or limitations?
- What additional data might improve the results in the future?
2. How does this relate to the customized learning outcome you created in Canvas?


### ✍️ Your Response: 🔧
1.

2.

## Submission Instructions
✅ Checklist:
- All code cells run without error
- All markdown responses are complete
- Submit on Canvas as instructed

In [None]:
!jupyter nbconvert --to html "assignment_12_LastnameFirstname.ipynb"

# Task
Load the `hotels.csv` dataset, handle missing values, encode categorical variables, define features (X) and target (y = 'is_canceled'), and split the data into 70/30 training and testing sets; then, report the total number of rows and columns, the types of features included, and a summary of the data cleaning and preparation steps.

## Load Dataset

### Subtask:
Load the 'hotels.csv' file into a pandas DataFrame for initial inspection and processing.


**Reasoning**:
To load the dataset, I need to import the pandas library and then use its `read_csv` function to load the 'hotels.csv' file into a DataFrame. Then, I will display the first few rows to confirm successful loading.



## Train Naive Bayes Classifier

### Subtask:
Train a Naïve Bayes classifier on the preprocessed training data (`X_train`, `y_train`).


**Reasoning**:
To train a Naïve Bayes classifier, I need to import the `GaussianNB` model from `sklearn.naive_bayes`, instantiate it, and then fit it to the training data (`X_train`, `y_train`).



In [7]:
from sklearn.naive_bayes import GaussianNB

# Instantiate the Gaussian Naive Bayes model
nb_model = GaussianNB()

# Train the model
nb_model.fit(X_train, y_train)

print("Naïve Bayes model trained successfully.")

Naïve Bayes model trained successfully.


**Reasoning**:
Now that the Naïve Bayes model is trained, I will use it to make predictions on the test data (`X_test`). This is a necessary step before evaluating the model's performance with a classification report and confusion matrix.



In [8]:
y_pred_nb = nb_model.predict(X_test)
print("Predictions made successfully on test data.")

Predictions made successfully on test data.


**Reasoning**:
With the predictions made, the next step is to evaluate the Naïve Bayes model by generating a classification report and a confusion matrix. Together these show its accuracy, precision, recall, and F1-score, as well as the types of errors it makes.



In [9]:
from sklearn.metrics import classification_report, confusion_matrix

print("Classification Report for Naïve Bayes Model:")
print(classification_report(y_test, y_pred_nb))

print("\nConfusion Matrix for Naïve Bayes Model:")
print(confusion_matrix(y_test, y_pred_nb))

Classification Report for Naïve Bayes Model:
              precision    recall  f1-score   support

           0       0.86      0.33      0.47     22478
           1       0.45      0.91      0.60     13339

    accuracy                           0.54     35817
   macro avg       0.65      0.62      0.54     35817
weighted avg       0.71      0.54      0.52     35817


Confusion Matrix for Naïve Bayes Model:
[[ 7308 15170]
 [ 1172 12167]]
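
The headline figures in the classification report can be recomputed from the confusion matrix alone. The sketch below copies the four counts from the output above and assumes scikit-learn's layout (rows are actual classes, columns are predicted classes); the variable names are illustrative.

```python
import numpy as np

# Counts copied from the confusion matrix output above.
# Rows = actual class (0, 1); columns = predicted class (0, 1).
cm = np.array([[7308, 15170],
               [1172, 12167]])

tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()       # share of all bookings classified correctly
precision_canceled = tp / (tp + fp)   # of predicted cancellations, how many were real
recall_canceled = tp / (tp + fn)      # of real cancellations, how many were caught
recall_kept = tn / (tn + fp)          # of kept bookings, how many were recognized

print(f"accuracy:             {accuracy:.2f}")
print(f"precision (canceled): {precision_canceled:.2f}")
print(f"recall (canceled):    {recall_canceled:.2f}")
print(f"recall (kept):        {recall_kept:.2f}")
```

Rounded to two decimals these come to 0.54, 0.45, 0.91, and 0.33, matching the report. The model catches most cancellations, but it does so by flagging 15,170 bookings that were actually kept.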


## Build a Support Vector Machine (SVM) Model

### Subtask:
Train an SVM classifier (use RBF kernel), make predictions, and evaluate with classification metrics.


## Scale Data for SVM

### Subtask:
Initialize a `StandardScaler` and scale `X_train` and `X_test`.

**Reasoning**:
Scaling features matters for SVMs because the RBF kernel is computed from distances between samples, so features with larger numeric ranges can dominate the model. `StandardScaler` transforms each feature to have a mean of 0 and a standard deviation of 1, which puts all features on a comparable scale. To prepare the data, I need to import `StandardScaler`, fit it on the training data only, and then transform both the training and test sets with that fitted scaler.
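
An alternative way to express the same scale-then-fit idea is a scikit-learn `Pipeline`, which keeps the scaler and the SVM together so the scaler is only ever fitted on training data. This is a sketch on synthetic data, not the notebook's own approach; `svm_pipeline`, the inflated first feature, and the sample sizes are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (assumption); one feature is inflated to mimic
# a column on a much larger numeric scale than the rest.
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=42)
X_demo[:, 0] *= 1000
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.3, random_state=42)

# fit() scales the training data and trains the SVM in one call;
# predict() reuses the training-set mean and standard deviation.
svm_pipeline = make_pipeline(StandardScaler(), SVC(kernel="rbf", random_state=42))
svm_pipeline.fit(Xtr, ytr)

demo_accuracy = accuracy_score(yte, svm_pipeline.predict(Xte))
print(f"Accuracy on synthetic test data: {demo_accuracy:.3f}")
```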



In [10]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print("Data scaled successfully for SVM.")

Data scaled successfully for SVM.


**Reasoning**:
Now that the data is scaled, I will initialize an SVC model with an RBF kernel, train it on the scaled training data, and then make predictions on the scaled test data. This aligns with the subtask instructions to train an SVM classifier and make predictions.



In [None]:
from sklearn.svm import SVC

# Initialize an SVC model with an RBF kernel
svm_model = SVC(kernel='rbf', random_state=42)

# Train the SVM model using the scaled training data
svm_model.fit(X_train_scaled, y_train)

print("SVM model trained successfully.")

# Make predictions on the scaled test data
y_pred_svm = svm_model.predict(X_test_scaled)
print("Predictions made successfully on test data for SVM.")

SVM model trained successfully.
