<a href="https://colab.research.google.com/github/kitlapp/HotelBookingSQLPreprocessing/blob/kimon/PythonPreprocessing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Hotel Booking Data Preprocessing in the Google Colab Platform

The goal of this project is to migrate the LookerStudioKPIDashboard project, originally prepared in a local environment, to the cloud, specifically to Google Colab. This notebook therefore includes all Python scripts required to create a cleaned DataFrame, suitable for KPI calculations for booking cancellation monitoring and further BI processing.

The preprocessing here differs slightly from the one used for machine learning (BookingCancellationPrediction). Some steps are shared: null values must be handled in both DataFrames, because BI tools cannot work with them any more than ML models can. Other steps are not. For dates, the KPI-focused DataFrame does not need trigonometric components to encode cyclical patterns; it only needs dates in a consistent format, e.g., "YYYY-MM-DD".
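To make the contrast concrete, here is a minimal sketch of the two ways of treating a month value. The toy values and the variable names (`months`, `month_sin`, `month_cos`, `dates`) are illustrative assumptions and are not taken from either project.

```python
import numpy as np
import pandas as pd

# Toy arrival months (1-12); illustrative values, not taken from the dataset
months = pd.Series([1, 4, 7, 12], name="arrival_date_month")

# ML-style cyclical encoding: project the month onto the unit circle,
# so that December (12) ends up next to January (1)
month_sin = np.sin(2 * np.pi * months / 12)
month_cos = np.cos(2 * np.pi * months / 12)

# BI-style handling: a plain calendar date is enough
# (year 2017 and day 1 are placeholders for this sketch)
dates = pd.to_datetime(pd.DataFrame({"year": 2017, "month": months, "day": 1}))
print(dates.dt.strftime("%Y-%m-%d").tolist())
```

A dashboard filters and groups by calendar date, so only the last step is relevant in this notebook.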

The advantages of this migration are:

1. No local environment setup is required (e.g., connecting to a Jupyter server, managing the command line through Conda, keeping command history logs for a reproducible setup, or using Git Bash for version control). Even a `.gitignore` file is unnecessary, since one can choose exactly which files to upload from a PC to this GitHub repository.
2. Working in Google Colab, with the option of purchasing additional resources in the future, makes this approach scalable.
3. The environment is easier to share and collaborate in, because replicating the project on any machine only requires cloning the GitHub repository in Google Colab. From there, one can immediately run the scripts or start enhancing the code.
4. Google Colab has Gemini integrated, which provides coding assistance inside the notebook.



# 1. Importing Libraries and Reading the Raw Data from Google Drive

In [4]:
# Authorize Google Colab to access Google Drive
# The mount below only needs to run once per runtime session
from google.colab import drive
drive.mount('/content/drive')

import pandas as pd

# Google Drive path to the dataset
filepath = '/content/drive/MyDrive/PyCharm_Projects/hotel_booking_RAW.csv'

# Read csv file to a DataFrame
df_raw = pd.read_csv(filepath)

# Rows and columns check of raw data (For SQL preprocessing comparison)
print(df_raw.shape)

# Check for duplicates
print('Number of Duplicates:', df_raw.duplicated().sum())

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
(119390, 36)
Number of Duplicates: 0


# 2. Preprocessing

## 2.1. Handling Null Values

In [None]:
# Check for nulls
df_raw.isna().sum()

In [6]:
# Make a copy before further preprocessing ('dash': Dashboard, to link with the purpose of this preprocessing)
dfdash2 = df_raw.copy()

# Fill missing values in 'children' column with the most frequent value (mode)
dfdash2['children'] = dfdash2['children'].fillna(value=dfdash2['children'].mode()[0])

# Replace missing values in 'agent' with 0, indicating direct bookings without a travel agent
dfdash2['agent'] = dfdash2['agent'].fillna(value=0)

# Replace missing values in 'company' with 0, meaning bookings not linked to any company
dfdash2['company'] = dfdash2['company'].fillna(value=0)

# Drop all rows with missing 'country' values since location info is important for analysis
dfdash2 = dfdash2.dropna(subset=['country'])

# Rows and columns check of dfdash2 (For SQL preprocessing comparison)
print(dfdash2.shape)

(118902, 36)
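As a sanity check on the rules above, the following sketch applies the same four steps to a small made-up DataFrame and confirms that no nulls remain. The toy values and the names `toy` and `cleaned` are assumptions for illustration only; the real notebook operates on `df_raw`.

```python
import numpy as np
import pandas as pd

# Toy data with the same four nullable columns; values are made up
toy = pd.DataFrame({
    "children": [0.0, np.nan, 0.0, 2.0],
    "agent": [9.0, np.nan, 14.0, np.nan],
    "company": [np.nan, 40.0, np.nan, np.nan],
    "country": ["PRT", "GBR", None, "ESP"],
})

cleaned = toy.copy()
# Mode imputation for 'children'
cleaned["children"] = cleaned["children"].fillna(cleaned["children"].mode()[0])
# 0 marks "no agent" / "no company"
cleaned[["agent", "company"]] = cleaned[["agent", "company"]].fillna(0)
# Rows without a country are dropped
cleaned = cleaned.dropna(subset=["country"])

# No nulls should remain, and only the row without a country is gone
assert cleaned.isna().sum().sum() == 0
print(cleaned.shape)
```

Running the same `isna().sum().sum() == 0` check on `dfdash2` would be a cheap guard before moving on to the date columns.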


## 2.2. Handling Date-Related Columns

In [7]:
# Make a copy before further preprocessing
dfdash3 = dfdash2.copy()

# Create a dictionary to convert month names to their corresponding month numbers
month_mapping = {
    "January": 1, "February": 2, "March": 3, "April": 4, "May": 5,
    "June": 6, "July": 7, "August": 8, "September": 9, "October": 10,
    "November": 11, "December": 12
}

# Map month names to integers; astype(int) fails loudly if a name is not in the dictionary
dfdash3['arrival_date_month'] = dfdash3['arrival_date_month'].map(month_mapping).astype(int)

# Combine year, month, and day columns into a single 'year-month-day' string (month and day are not zero-padded)
dfdash3['arrival_date'] = (
    dfdash3['arrival_date_year'].astype(str) + '-' +
    dfdash3['arrival_date_month'].astype(str) + '-' +
    dfdash3['arrival_date_day_of_month'].astype(str)
)

# Convert the date strings into proper datetime objects
# for easier time-based analysis
dfdash3['arrival_date'] = pd.to_datetime(dfdash3['arrival_date'], format='%Y-%m-%d')

# Rows and columns check of dfdash3 (For SQL preprocessing comparison)
print(dfdash3.shape)

(118902, 37)
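An alternative to string concatenation, shown here only as a sketch, is to pass the date components to `pd.to_datetime` as a DataFrame with `year`, `month`, and `day` columns. The toy rows and the names `MONTHS`, `month_to_num`, and `parts` are assumptions for illustration; the column names mirror the ones used in this notebook.

```python
import pandas as pd

# Toy rows using the notebook's column names; values are made up
toy = pd.DataFrame({
    "arrival_date_year": [2015, 2016],
    "arrival_date_month": ["July", "February"],
    "arrival_date_day_of_month": [1, 29],
})

MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]
month_to_num = {name: number for number, name in enumerate(MONTHS, start=1)}

# pd.to_datetime can assemble dates from columns named year, month, day
parts = pd.DataFrame({
    "year": toy["arrival_date_year"],
    "month": toy["arrival_date_month"].map(month_to_num),
    "day": toy["arrival_date_day_of_month"],
})
toy["arrival_date"] = pd.to_datetime(parts)
print(toy["arrival_date"].dt.strftime("%Y-%m-%d").tolist())
```

Both routes should produce the same `datetime64` column; this one simply skips the intermediate string.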


## 2.3. Dropping Unimportant Columns

In [10]:
# List of columns to be dropped from the dashboard dataframe
dashcols_to_be_dropped = ['name', 'email', 'arrival_date_month', 'arrival_date_day_of_month', 'phone-number',
                          'credit_card', 'reservation_status', 'reservation_status_date', 'assigned_room_type',
                          'deposit_type', 'required_car_parking_spaces', 'arrival_date_week_number']

# Create a copy of the dashboard dataframe for further processing
dfdash4 = dfdash3.copy()

# Drop the unimportant columns from the dashboard dataframe
dfdash4 = dfdash4.drop(columns=dashcols_to_be_dropped)

# Rows and columns check of dfdash4 (For SQL preprocessing comparison)
print(dfdash4.shape)


(118902, 25)
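`DataFrame.drop` raises a `KeyError` if any listed column is missing, for example after a rename in the raw file. The sketch below shows one way to surface such mismatches before dropping. The toy DataFrame and the names `toy`, `to_drop`, `missing`, and `slim` are hypothetical.

```python
import pandas as pd

# Toy frame; 'phone-number' is deliberately absent
toy = pd.DataFrame({
    "hotel": ["City Hotel"],
    "name": ["A. Guest"],
    "email": ["guest@example.com"],
    "adr": [80.0],
})
to_drop = ["name", "email", "phone-number"]

# Report requested columns that are not present
missing = [col for col in to_drop if col not in toy.columns]
print("Not found:", missing)

# errors='ignore' drops what exists and silently skips the rest
slim = toy.drop(columns=to_drop, errors="ignore")
print(slim.columns.tolist())
```

Whether to skip silently or fail is a design choice: for the SQL comparison in this notebook, the default strict behavior of `drop` is arguably safer, since it guarantees the final column count.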
