# **Tree Models - Guided Model Building**

In this session, you will build a decision tree model using `scikit-learn` to predict hotel booking cancellations. The target variable for this task is `booking status`.

Some of the tasks you will perform include:
* Exploring and understanding the dataset
* Data cleaning and preprocessing
* Encoding categorical variables
* Creating a train-test split
* Training a predictive model to classify the `booking status`
* Evaluating and visualising the model's performance

Let's get started by importing the necessary libraries.



In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    confusion_matrix, classification_report, roc_curve, roc_auc_score,
)
import warnings
warnings.filterwarnings('ignore')

## **1 Data Preparation**

### **1.1 Load and inspect the data**

Load the data and inspect it to get a sense of the data and to check for the presence of missing values, incorrect formatting, and other anomalies.

In [None]:
df = pd.read_csv("booking.csv")

In [None]:
df.head()

In [None]:
df.info()

### **1.2 Clean and preprocess the data**

Clean the data. Be sure to handle missing values, formatting issues, and any other anomalies that may hamper model building and training.

In [None]:
### YOUR CODE HERE ###


In [None]:
# Check for missing values
print(df.isnull().sum())

# Depending on the output, you might decide to drop rows/columns or fill missing values.
# Example (fill with median for numerical columns):
# for col in df.columns:
#     if df[col].dtype in ['int64', 'float64']:
#         df[col] = df[col].fillna(df[col].median())

# Example (drop rows with any missing values):
# df.dropna(inplace=True)

# Check for duplicate rows
print(df.duplicated().sum())

# Example (drop duplicate rows):
# df.drop_duplicates(inplace=True)

### **1.3 Encoding Categorical Variables**

The data contains categorical columns that we want to keep in the analysis, so we may need to one-hot encode them. You may use `pd.get_dummies()` to achieve this. Be sure that all values are of numeric type before proceeding.

**Note:** Not all models require this step. Please consult the documentation of the model you are using to see whether encoding is necessary.
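As a minimal sketch of what `pd.get_dummies()` produces, consider the toy frame below. The column names and labels (`room_type`, `lead_time`, `status`, `"Cancelled"`, `"Kept"`) are hypothetical and are not taken from `booking.csv`. The sketch also assumes a text target, which it maps to 0/1 before encoding so that the target column keeps its name.

```python
import pandas as pd

# Toy data; column names and labels are illustrative assumptions
toy = pd.DataFrame({
    "room_type": ["A", "B", "A", "C"],
    "lead_time": [10, 45, 3, 120],
    "status": ["Cancelled", "Kept", "Kept", "Cancelled"],
})

# Map the text target to 0/1 first so get_dummies does not rename it
toy["status"] = (toy["status"] == "Cancelled").astype(int)

# One-hot encode only the categorical feature; drop_first removes the
# redundant first level ("A"), and dtype=int yields 0/1 instead of booleans
toy_encoded = pd.get_dummies(toy, columns=["room_type"], drop_first=True, dtype=int)

print(toy_encoded.columns.tolist())
print(toy_encoded)
```

Each remaining level of `room_type` becomes its own 0/1 column, while `lead_time` and `status` pass through unchanged.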


In [None]:
### YOUR CODE HERE ###

# Identify categorical columns (adjust this to your data).
# Note: if `booking status` is stored as text it is selected here too, and
# get_dummies will replace it with prefixed dummy columns. Convert the target
# to 0/1 in the cleaning step so the column keeps its name for section 2.1.
categorical_cols = df.select_dtypes(include=['object']).columns

# Perform one-hot encoding (drop_first drops one redundant level per column)
df_encoded = pd.get_dummies(df, columns=categorical_cols, drop_first=True)

# Convert values to numeric type; anything that cannot be parsed becomes NaN
df_encoded = df_encoded.apply(pd.to_numeric, errors='coerce')

# Drop rows with NaNs introduced by the coercion above
df_encoded = df_encoded.dropna()

# Display the first few rows of the encoded dataframe
print(df_encoded.head())


## **2 Model Building**

### **2.1 Create the train-test splits**

You may use `sklearn.model_selection.train_test_split()` to create a train-test split of your data. A separate validation split is not necessary: the grid search in section 2.4 cross-validates within the training split. We recommend an $80:20$ split.
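One option worth knowing about is the `stratify` argument of `train_test_split()`, which keeps the class proportions the same in both splits. The sketch below uses made-up labels (80 negatives, 20 positives) rather than the booking data, purely to illustrate the effect.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy imbalanced labels: 80 negatives and 20 positives (illustrative only)
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.array([0] * 80 + [1] * 20)

# stratify=y_toy preserves the 80:20 class ratio in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

print("Train positive rate:", y_tr.mean())
print("Test positive rate:", y_te.mean())
```

Whether stratification is needed for the booking data depends on how imbalanced `booking status` turns out to be.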

In [None]:
### YOUR CODE HERE ###

In [None]:
# Define features (X) and target (y)
X = df_encoded.drop('booking status', axis=1)
y = df_encoded['booking status']

# Create the train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print("Shape of X_train:", X_train.shape)
print("Shape of X_test:", X_test.shape)
print("Shape of y_train:", y_train.shape)
print("Shape of y_test:", y_test.shape)

### **2.2 Train the model**

You can use `sklearn.tree.DecisionTreeClassifier()` to build a classification model.

Train this model using only the training split of the data.


In [None]:
### YOUR CODE HERE ###

In [None]:
# Initialize the Decision Tree Classifier
dt_model = DecisionTreeClassifier(random_state=42)

# Train the model
dt_model.fit(X_train, y_train)

### **2.3 Feature Importance**

You can use the `feature_importances_` attribute of the trained model to get the importance score for each feature.

Visualise the feature importances to understand which features contribute the most to the decision-making process.
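A small sketch on synthetic data (not the booking data) shows how the scores behave: `feature_importances_` holds one non-negative value per feature, and the values are normalised to sum to 1. The feature names `f0` to `f5` are placeholders.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with placeholder feature names
X_syn, y_syn = make_classification(
    n_samples=300, n_features=6, n_informative=3, random_state=0
)
cols = [f"f{i}" for i in range(6)]

tree = DecisionTreeClassifier(random_state=0).fit(X_syn, y_syn)

# Pair each importance with its feature name and sort for readability
importances = pd.Series(tree.feature_importances_, index=cols)
importances = importances.sort_values(ascending=False)

print(importances)
print("Sum of importances:", importances.sum())
```

Because the scores are relative, a high value only means a feature was used heavily in this particular tree, not that it is causally important.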

In [None]:
### YOUR CODE HERE ###

In [None]:
# Get feature importances
feature_importances = dt_model.feature_importances_

# Create a DataFrame for visualization
feature_importance_df = pd.DataFrame({'feature': X.columns, 'importance': feature_importances})

# Sort features by importance
feature_importance_df = feature_importance_df.sort_values(by='importance', ascending=False)

# Visualize feature importances
plt.figure(figsize=(12, 8))
sns.barplot(x='importance', y='feature', data=feature_importance_df)
plt.title('Feature Importances')
plt.show()

### **2.4 Hyperparameter Tuning**

You can use `GridSearchCV()` from `sklearn.model_selection` to find the best combination of hyperparameters for your decision tree classifier.

Fit the grid search using only the training split of the data.

**Hyperparameters to Tune:**

**max_depth:** Controls the tree's maximum depth.

**min_samples_split:** Minimum samples required to split a node.

**min_samples_leaf:** Minimum samples required at a leaf node.

**criterion:** Split quality measure. 'gini' (Gini impurity) or 'entropy' (information gain).

With the default `refit=True`, `GridSearchCV()` refits a model on the whole training split using the best hyperparameters found and stores it in `grid_search.best_estimator_`.

You can directly access this `best_estimator_` attribute and use it for further predictions, evaluation, or visualisation.
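The attributes mentioned above can be explored on a small synthetic problem. The grid here is deliberately tiny and the data is generated, so the numbers say nothing about the booking dataset.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, used only to keep the example self-contained
X_syn, y_syn = make_classification(n_samples=300, n_features=6, random_state=0)

# 3 x 2 = 6 candidate combinations, each scored with 3-fold cross-validation
small_grid = {"max_depth": [2, 4, None], "min_samples_leaf": [1, 5]}
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0), small_grid, cv=3, scoring="accuracy"
)
search.fit(X_syn, y_syn)

print("Best parameters:", search.best_params_)
print("Mean CV accuracy of best candidate:", search.best_score_)

# best_estimator_ is already fitted and can be used for predictions directly
best_tree = search.best_estimator_
print("Depth of refitted tree:", best_tree.get_depth())
```

Note that `best_score_` is a cross-validated score on the training data, so it is not a substitute for evaluating on the held-out test split.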

In [None]:
### YOUR CODE HERE ###

In [None]:
# Define the hyperparameter grid to search
param_grid = {
    'max_depth': [None, 5, 10, 15, 20],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'criterion': ['gini', 'entropy']
}

# Initialize GridSearchCV
grid_search = GridSearchCV(estimator=dt_model, param_grid=param_grid, cv=5, scoring='accuracy', n_jobs=-1)

# Fit GridSearchCV to the training data
grid_search.fit(X_train, y_train)

# Get the best hyperparameters and best estimator
best_params = grid_search.best_params_
best_dt_model = grid_search.best_estimator_

print("Best Hyperparameters:", best_params)
print("Best Decision Tree Model:", best_dt_model)

### **2.5 Visualising the Decision Tree**

After training your `DecisionTreeClassifier`, you can visualise the tree to better understand how the model is making decisions based on the features. You can use `plot_tree` from `sklearn.tree` to visualise the structure of the decision tree.

To keep the visualisation simple and focused, we can limit the plot to the top of the tree by passing `max_depth=3` to `plot_tree`. This truncates the drawing only and does not change the fitted model, making it easier to interpret the key decision points without overwhelming the viewer.

In [None]:
### YOUR CODE HERE ###

In [None]:
# Visualize the best decision tree
plt.figure(figsize=(20, 10))
plot_tree(best_dt_model,
          feature_names=X.columns.tolist(),
          class_names=['Not Cancelled', 'Cancelled'], # Assuming 0 is not cancelled and 1 is cancelled
          filled=True,
          rounded=True,
          max_depth=3) # Limit depth for better visualization
plt.title('Best Decision Tree Classifier (plot truncated at depth 3)')
plt.show()


Visualising a decision tree allows us to understand how the model makes decisions and which features are most important in predicting the target variable. It shows the rules the model uses to classify data, helping to explain its behaviour.

This can also reveal if the model is too complex, which might indicate overfitting, or too simple, which might suggest it isn't learning enough from the data. By examining the tree, we can gain insights into the patterns in the data, helping to refine the model for better performance.
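Alongside the plot, tree complexity can be checked numerically, and the rules can be printed as text. The sketch below assumes synthetic data with placeholder feature names; `get_depth()`, `get_n_leaves()`, and `export_text()` are part of `sklearn.tree`.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic data with placeholder feature names
X_syn, y_syn = make_classification(
    n_samples=300, n_features=4, n_informative=2, n_redundant=0, random_state=0
)
names = ["f0", "f1", "f2", "f3"]

# An unconstrained tree keeps splitting until its leaves are pure
full_tree = DecisionTreeClassifier(random_state=0).fit(X_syn, y_syn)
print("Depth:", full_tree.get_depth())
print("Number of leaves:", full_tree.get_n_leaves())

# Print the top of the tree as if/else rules; deeper branches are truncated
rules = export_text(full_tree, feature_names=names, max_depth=2)
print(rules)
```

A very large depth or leaf count relative to the number of training rows is one hint that the tree may be memorising the training data.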

### **2.6 Make predictions**

Make predictions using the trained model. Make separate predictions on the training and test sets.

In [None]:
### YOUR CODE HERE ###

In [None]:
# Make predictions on the training set
y_train_pred = best_dt_model.predict(X_train)

# Make predictions on the test set
y_test_pred = best_dt_model.predict(X_test)

## **3 Model Evaluation**

### **3.1 ROC Curve**

You can evaluate the performance of your classification model using the ROC (Receiver Operating Characteristic) curve. It visualises the trade-off between the true positive rate (sensitivity) and the false positive rate (1 - specificity) across different classification thresholds. This matters whenever the two kinds of error carry different costs, for example when missing a cancellation (a false negative) is more costly than flagging a booking that is kept.
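Each point on the ROC curve corresponds to one threshold. As a worked example with made-up labels and scores, the sketch below computes the true positive rate and false positive rate at a single threshold of 0.5.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up true labels and predicted probabilities for the positive class
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.6, 0.7, 0.9])

# Turn scores into hard predictions at one threshold
threshold = 0.5
y_pred = (y_score >= threshold).astype(int)

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

tpr = tp / (tp + fn)  # sensitivity: share of positives that were caught
fpr = fp / (fp + tn)  # 1 - specificity: share of negatives flagged wrongly

print(f"TPR = {tpr:.2f}, FPR = {fpr:.2f}")  # TPR = 0.75, FPR = 0.25
```

Repeating this for every distinct threshold and plotting FPR against TPR gives the ROC curve that `roc_curve()` computes.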


In [None]:
### YOUR CODE HERE ###

In [None]:
# Get predicted probabilities for the positive class (booking status = 1)
y_prob = best_dt_model.predict_proba(X_test)[:, 1]

# Calculate ROC curve
fpr, tpr, thresholds = roc_curve(y_test, y_prob)

# Calculate ROC AUC score
roc_auc = roc_auc_score(y_test, y_prob)

# Plot the ROC curve
plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, label=f'ROC curve (AUC = {roc_auc:.2f})')
plt.plot([0, 1], [0, 1], 'k--', label='Random guess')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend()
plt.show()

### **3.2 Classification Report**

Use the classification report to evaluate model performance with key metrics like precision, recall, and F1-score. It helps assess how well the model handles each class, especially in imbalanced datasets.
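To see why per-class metrics matter on imbalanced data, here is a small sketch with made-up labels: overall accuracy looks high while recall on the minority class is only 0.5. Passing `output_dict=True` returns the report as a dictionary, which is convenient when the values are needed in code.

```python
from sklearn.metrics import classification_report

# Made-up imbalanced labels: 8 negatives, 2 positives
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
# The model finds one of the two positives and makes no false alarms
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 1, 0]

# Human-readable report
print(classification_report(y_true, y_pred))

# Same numbers as a nested dict, keyed by the class label as a string
report = classification_report(y_true, y_pred, output_dict=True)
print("Accuracy:", report["accuracy"])
print("Recall for class 1:", report["1"]["recall"])
print("Precision for class 1:", report["1"]["precision"])
```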

In [None]:
### YOUR CODE HERE ###


In [None]:
# Generate classification report
class_report = classification_report(y_test, y_test_pred)

# Print the classification report
print(class_report)


## **4 Using Other Models & Techniques**

You've built a decision tree model and tuned it. Now, try stronger models like Random Forest (ensemble of trees) and XGBoost (boosting method).
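As a starting point, a random forest can be swapped in with almost no code changes, since it shares the `fit`/`predict` interface. The sketch below uses synthetic data and an arbitrary choice of 50 trees, so the comparison is illustrative only. XGBoost is left out because it requires a separate package.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the encoded booking features
X_syn, y_syn = make_classification(
    n_samples=500, n_features=10, n_informative=5, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=42, stratify=y_syn
)

# Fit a single tree and a forest of 50 trees on the same training split
single_tree = DecisionTreeClassifier(random_state=42).fit(X_tr, y_tr)
forest = RandomForestClassifier(n_estimators=50, random_state=42).fit(X_tr, y_tr)

tree_acc = accuracy_score(y_te, single_tree.predict(X_te))
forest_acc = accuracy_score(y_te, forest.predict(X_te))

print(f"Single tree test accuracy:   {tree_acc:.3f}")
print(f"Random forest test accuracy: {forest_acc:.3f}")
```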

For tuning, try techniques other than grid search. Consider using `RandomizedSearchCV()` for speed, or tune manually if you understand the model well.
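A minimal sketch of `RandomizedSearchCV()` follows, assuming synthetic data and arbitrarily chosen ranges. Instead of trying every combination, it samples `n_iter` candidates from the given distributions, which keeps the cost fixed no matter how large the search space is.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, used only to keep the example self-contained
X_syn, y_syn = make_classification(n_samples=300, n_features=6, random_state=0)

# Distributions to sample from; randint(a, b) draws integers in [a, b)
param_distributions = {
    "max_depth": randint(2, 20),
    "min_samples_split": randint(2, 20),
    "min_samples_leaf": randint(1, 10),
    "criterion": ["gini", "entropy"],
}

# Evaluate 10 randomly sampled candidates with 3-fold cross-validation
rand_search = RandomizedSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_distributions,
    n_iter=10,
    cv=3,
    scoring="accuracy",
    random_state=0,
)
rand_search.fit(X_syn, y_syn)

print("Best parameters:", rand_search.best_params_)
print("Best mean CV accuracy:", rand_search.best_score_)
```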

Also, you can enhance evaluation with tools like the precision-recall curve to better assess performance.
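The precision-recall curve can be computed in much the same way as the ROC curve. The sketch below assumes synthetic, imbalanced data (roughly 20% positives) and a shallow tree; plotting is left out, and `recall` against `precision` can be passed to `plt.plot()` as in section 3.1.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import average_precision_score, precision_recall_curve
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data: about 80% negatives and 20% positives
X_syn, y_syn = make_classification(
    n_samples=500, n_features=8, n_informative=4, weights=[0.8, 0.2], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=42, stratify=y_syn
)

# A shallow tree, so that predicted probabilities take several distinct values
clf = DecisionTreeClassifier(max_depth=4, random_state=42).fit(X_tr, y_tr)
y_prob = clf.predict_proba(X_te)[:, 1]

# One (precision, recall) pair per threshold, plus a final (1, 0) end point
precision, recall, thresholds = precision_recall_curve(y_te, y_prob)
ap = average_precision_score(y_te, y_prob)

print("Number of thresholds:", len(thresholds))
print(f"Average precision: {ap:.3f}")
```

Unlike the ROC curve, this view ignores true negatives, which tends to make it more informative when the positive class is rare.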