# GTC ML Project 1 - Data Cleaning & Preprocessing

This project focuses on preparing the **hotel bookings dataset** for machine learning.  
The business problem is predicting booking cancellations, but our task is only **data preprocessing** — not building the final model.  

We will follow three phases:
1. Exploratory Data Analysis (EDA) & Data Quality Report  
2. Data Cleaning  
3. Feature Engineering & Preprocessing


**Objective:** Build a robust data preprocessing pipeline for a hotel booking cancellation prediction model.

**Business Problem:** The revenue team has identified that last-minute booking cancellations significantly impact profitability. Your task is not to build the final model, but to prepare the raw data for it. The quality of your data cleaning will directly determine the model's future success.

## Phase 1: Exploratory Data Analysis (EDA) & Data Quality Report

In this phase, we:
- Loaded the dataset and explored its structure.
- Generated summary statistics.
- Identified missing values and visualized them.
- Detected outliers in key numerical features (`adr`, `lead_time`).
- Documented the main data quality issues.


### Step 1: Load the Dataset
Upload the hotel bookings dataset into Google Colab and read it using Pandas.


In [None]:
from google.colab import files
uploaded = files.upload()

In [None]:
import pandas as pd
df = pd.read_csv('hotel_bookings - hotel_bookings.csv')

### Step 2: Import the Libraries
Import the remaining libraries for data analysis and visualization (pandas was already imported in Step 1).


In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
import missingno as msno

### Step 3: Dataset Overview
Check the shape of the dataset, basic info, and summary statistics.


In [None]:
print("Shape of dataset:", df.shape)
print("\n--- Info ---")
df.info()  # info() prints its report directly and returns None
print("\n--- Summary Statistics ---")
print(df.describe(include="all"))

### Step 4: Missing Values
Check for missing values and visualize them with a heatmap.


In [None]:
print("\n--- Missing Values ---")
print(df.isnull().sum())

In [None]:
plt.figure(figsize=(12,6))
sns.heatmap(df.isnull())
plt.title("Missing Values", fontsize=16)
plt.xlabel("Columns", fontsize=15)
plt.ylabel("Rows", fontsize=15)
plt.show()
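
The raw counts above can be hard to compare across columns. As a minimal sketch, the helper below (the name `missing_report` and the toy data are illustrative assumptions, not part of the original notebook) summarizes missing values as both counts and percentages, keeping only the affected columns.

```python
import numpy as np
import pandas as pd

def missing_report(frame: pd.DataFrame) -> pd.DataFrame:
    """Return missing-value counts and percentages for columns with gaps."""
    counts = frame.isnull().sum()
    report = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(frame) * 100).round(2),
    })
    # Keep only columns that actually have missing values, worst first.
    return report[report["missing"] > 0].sort_values("missing", ascending=False)

# Toy frame standing in for the hotel bookings data (values are made up).
toy = pd.DataFrame({
    "company": [np.nan, np.nan, np.nan, 40.0],
    "agent": [9.0, np.nan, 14.0, 9.0],
    "adr": [75.0, 98.0, 120.5, 60.0],
})
report = missing_report(toy)
print(report)
```

On the real data, the same helper could be called as `missing_report(df)`.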

### Step 5: Outlier Detection
Detect outliers in important numeric columns like `adr` and `lead_time` using boxplots and the IQR method.


In [None]:
plt.figure(figsize=(8,5))
sns.boxplot(x=df["adr"])
plt.title("Outliers in ADR")
plt.show()

plt.figure(figsize=(8,5))
sns.boxplot(x=df["lead_time"])
plt.title("Outliers in Lead Time")
plt.show()
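
The boxplots show outliers visually; the IQR method mentioned in this step can also be computed directly. The sketch below assumes the conventional multiplier of 1.5 and uses made-up values in place of `df["adr"]`; the helper name `iqr_bounds` is hypothetical.

```python
import pandas as pd

def iqr_bounds(series: pd.Series, k: float = 1.5):
    """Return the (lower, upper) fences Q1 - k*IQR and Q3 + k*IQR."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Toy values standing in for df["adr"]; not taken from the real dataset.
toy_adr = pd.Series([50.0, 60.0, 70.0, 80.0, 90.0, 100.0, 110.0, 5400.0])
lower, upper = iqr_bounds(toy_adr)

# Values outside the fences are flagged as outliers.
outliers = toy_adr[(toy_adr < lower) | (toy_adr > upper)]
print(f"Fences: [{lower:.1f}, {upper:.1f}], outliers: {outliers.tolist()}")
```

The same function could be applied to `df["adr"]` and `df["lead_time"]` to count how many rows fall outside the fences.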


## Phase 2: Data Cleaning

In this phase, we:
- Handled missing values:
  - `company` → replaced with "None"
  - `agent` → replaced with `0`
  - `country` → imputed with most frequent value
  - `children` → imputed with median
- Removed duplicate rows.
- Handled outliers in `adr` by capping values above 1000.
- Fixed data types (converted date columns).


### Step 6: Handle Missing Values
- company → "None"  
- agent → 0  
- country → most frequent value  
- children → median


In [None]:
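# Assign the result back instead of calling fillna(inplace=True) on a column
# selection; recent pandas versions warn about that pattern, and under
# copy-on-write it does not update df.
df["company"] = df["company"].fillna("None")
df["agent"] = df["agent"].fillna(0)

df["country"] = df["country"].fillna(df["country"].mode()[0])

df["children"] = df["children"].fillna(df["children"].median())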


### Step 7: Remove Duplicates
Drop exact duplicate rows so that repeated records are not counted more than once.


In [None]:
print("Before removing duplicates:", df.shape)
df.drop_duplicates(inplace=True)
print("After removing duplicates:", df.shape)


### Step 8: Handle Outliers
Cap `adr` at 1000 so that a few extreme values do not dominate the distribution.


In [None]:
print("Max ADR before capping:", df["adr"].max())

df["adr"] = df["adr"].clip(upper=1000)

print("Max ADR after capping:", df["adr"].max())


### Step 9: Fix Data Types
Convert date columns (like `reservation_status_date`) to datetime format.


In [None]:
df["reservation_status_date"] = pd.to_datetime(df["reservation_status_date"], errors="coerce")

print("Data types after conversion:")
print(df.dtypes)


In [None]:
print("Remaining Missing Values:")
print(df.isnull().sum())
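
Beyond printing the remaining missing values, the Phase 2 steps can be verified programmatically. The following is a minimal sketch: the helper `check_cleaned` and the toy frame are assumptions for illustration, and the checks mirror only the steps described above (imputed columns, duplicates, and the ADR cap of 1000).

```python
import pandas as pd

def check_cleaned(frame: pd.DataFrame, adr_cap: float = 1000) -> dict:
    """Return simple pass/fail checks for the Phase 2 cleaning steps."""
    imputed_cols = ["company", "agent", "country", "children"]
    return {
        "no_missing_in_imputed_cols": not frame[imputed_cols].isnull().any().any(),
        "no_duplicate_rows": not frame.duplicated().any(),
        "adr_within_cap": bool(frame["adr"].max() <= adr_cap),
    }

# Toy frame mimicking the cleaned columns (values are made up).
toy = pd.DataFrame({
    "company": ["None", "40.0"],
    "agent": [0.0, 9.0],
    "country": ["PRT", "GBR"],
    "children": [0.0, 2.0],
    "adr": [75.0, 1000.0],
})
results = check_cleaned(toy)
print(results)
```

Calling `check_cleaned(df)` after Step 9 would flag any cleaning step that did not take effect.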


## Phase 3: Feature Engineering & Preprocessing

In this phase, we:
- Created new features:
  - `total_guests` = adults + children + babies
  - `total_nights` = stays_in_weekend_nights + stays_in_week_nights
  - `is_family` = binary flag for bookings with children/babies
- Encoded categorical variables:
  - One-Hot Encoding for `meal` and `market_segment`
  - Grouped rare `country` values into "Other"
- Removed data leakage columns (`reservation_status`, `reservation_status_date`).
- Split the dataset into training and testing sets (80% / 20%).


### Step 10: Create New Features
- total_guests  
- total_nights  
- is_family (binary flag for families)


In [None]:
df["total_guests"] = df["adults"] + df["children"] + df["babies"]

df["total_nights"] = df["stays_in_weekend_nights"] + df["stays_in_week_nights"]

# Vectorized flag: 1 if the booking includes any children or babies, else 0.
df["is_family"] = ((df["children"] + df["babies"]) > 0).astype(int)


### Step 11: Encode Categorical Variables
- One-Hot Encoding for low-cardinality categories  
- Group rare countries into "Other"


In [None]:
df = pd.get_dummies(df, columns=["meal", "market_segment"], drop_first=True)

country_counts = df["country"].value_counts()
rare_countries = country_counts[country_counts < 100].index
df["country"] = df["country"].replace(rare_countries, "Other")
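
After grouping, `country` is still a string column, so most models would need it encoded as well. One possible way, sketched on toy data, is to one-hot encode the grouped column. The threshold `min_count = 2` is a hypothetical value chosen for the toy frame (the notebook uses 100), and the country codes are made up for illustration.

```python
import pandas as pd

# Toy frame standing in for the hotel bookings data.
toy = pd.DataFrame({
    "country": ["PRT", "PRT", "PRT", "GBR", "GBR", "FRA", "BRA"],
    "adr": [80.0, 95.0, 60.0, 120.0, 110.0, 99.0, 70.0],
})

min_count = 2  # Hypothetical threshold for the toy data; the notebook uses 100.
counts = toy["country"].value_counts()
rare = counts[counts < min_count].index

# Keep frequent countries as-is and replace rare ones with "Other".
toy["country"] = toy["country"].where(~toy["country"].isin(rare), "Other")

# One-hot encode the grouped column, dropping the first level as in Step 11.
encoded = pd.get_dummies(toy, columns=["country"], drop_first=True)
print(encoded.columns.tolist())
```

Grouping first keeps the number of dummy columns small, which is the reason for handling rare countries before encoding.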


### Step 12: Remove Data Leakage
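Drop `reservation_status` and `reservation_status_date`. Both are recorded only after the booking outcome is known, so keeping them would leak the target into the features.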


In [None]:
df.drop(["reservation_status", "reservation_status_date"], axis=1, inplace=True)


### Step 13: Train-Test Split
Split the dataset into 80% training and 20% testing sets.


In [None]:
from sklearn.model_selection import train_test_split

train, test = train_test_split(df, test_size=0.2, random_state=42)

print("Training set shape:", train.shape)
print("Testing set shape:", test.shape)
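
For the later modeling stage, it is common to separate the features from the target and to stratify the split so both sets keep the same cancellation rate. The sketch below is one way to do that under stated assumptions: the target column name `is_canceled` is assumed rather than taken from this notebook, and the toy frame is made up.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame; "is_canceled" is the assumed name of the target column.
toy = pd.DataFrame({
    "lead_time": range(50),
    "is_canceled": [0] * 35 + [1] * 15,
})

X = toy.drop(columns=["is_canceled"])
y = toy["is_canceled"]

# stratify=y keeps the share of cancellations equal in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print("Train:", X_train.shape, "cancel rate:", y_train.mean())
print("Test:", X_test.shape, "cancel rate:", y_test.mean())
```

Stratifying matters most when cancellations are a minority class, since a purely random split can leave the test set with a noticeably different class balance.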


# Final Summary

The dataset has been cleaned and preprocessed through the three phases above.  
Key improvements made:
- Missing values handled.
- Outliers capped.
- Duplicates removed.
- Dates converted to proper format.
- New features engineered.
- Categorical variables encoded.
- Data leakage removed.
- Dataset split into training and testing sets (80% / 20%).

