# **EDA AND DATA CLEANING NOTEBOOK**

## Objectives

* Carry out EDA relating to the quality of the data
* Clean data and save for use in further analysis

## Inputs

* **Raw Dataset:** inputs/datasets/raw/hotel_bookings.csv

## Outputs

* **Cleaned Dataset with All Records:** outputs/{version}/datasets/cleaned/cleaned_all_records.csv
* **Cleaned Dataset with Duplicates Removed:** outputs/{version}/datasets/cleaned/cleaned_deduplicated.csv
* **Images from Duplicates Analysis:**
  - outputs/{version}/images/cancellations_by_dup_category.png
  - outputs/{version}/images/market_segments_by_dup_category.png

---

# Import Packages and Load Data

Imports

In [None]:
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

Load data

In [None]:
from pathlib import Path

project_root = Path.cwd().parent
dataset_file = project_root / 'inputs' / 'datasets' / 'raw' / 'hotel_bookings.csv'
df = pd.read_csv(dataset_file)
df.head(3)

---

# Convert Data Types

View current data types

In [None]:
df.info()

Convert data types ready for profiling report.
- `Int64` allows for the missing values and is used for `children`, `agent` and `company` to remove decimal digits before converting to `category` type

In [None]:
df['hotel'] = df['hotel'].astype('category')
df['is_canceled'] = df['is_canceled'].astype('bool')
df['arrival_date_year'] = df['arrival_date_year'].astype('category')
df['arrival_date_month'] = df['arrival_date_month'].astype('category')
df['children'] = df['children'].astype('Int64')
df['meal'] = df['meal'].astype('category')
df['country'] = df['country'].astype('category')
df['market_segment'] = df['market_segment'].astype('category')
df['distribution_channel'] = df['distribution_channel'].astype('category')
df['is_repeated_guest'] = df['is_repeated_guest'].astype('bool')
df['reserved_room_type'] = df['reserved_room_type'].astype('category')
df['assigned_room_type'] = df['assigned_room_type'].astype('category')
df['deposit_type'] = df['deposit_type'].astype('category')
df['agent'] = df['agent'].astype('Int64').astype('category')
df['company'] = df['company'].astype('Int64').astype('category')
df['customer_type'] = df['customer_type'].astype('category')
df['reservation_status'] = df['reservation_status'].astype('category')
df['reservation_status_date'] = pd.to_datetime(df['reservation_status_date'])

df.info()
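The effect of going through `Int64` before `category` can be seen on a toy column. This is a minimal sketch with made-up values (the `toy` Series is hypothetical and not taken from the dataset): a float column with a missing value keeps decimal labels if converted directly, but gets integer labels when converted via the nullable integer type.

```python
import numpy as np
import pandas as pd

# Hypothetical ID column as read from CSV: stored as float because of the missing value
toy = pd.Series([9.0, 240.0, np.nan, 9.0])

# Direct conversion keeps float labels (9.0, 240.0) as the categories
direct = toy.astype('category')

# Converting to nullable Int64 first gives integer labels and keeps the missing value
via_int = toy.astype('Int64').astype('category')

print(direct.cat.categories.tolist())
print(via_int.cat.categories.tolist())
print('Missing values preserved:', via_int.isna().sum())
```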

# Profile Report

In [None]:
from ydata_profiling import ProfileReport
profile = ProfileReport(df.drop_duplicates(), title="Hotel Bookings Profile Report", minimal=True)
profile.to_notebook_iframe()

The Profile Report alerts us to some possible data issues which require further exploration:
- **Data Ranges:**
  - `lead_time` shows two values over 600
  - `stays_in_weekend_nights` shows some values over 10
  - `stays_in_week_nights` shows some values over 30
  - `adults` shows some values over 20
  - `children` shows a high value of 10
  - `babies` shows high values of 9 and 10
  - `previous_cancellations` shows values over 20
  - `previous_bookings_not_canceled` shows values over 60
  - `booking_changes` shows values over 20
  - `days_in_waiting_list` shows values over 300
  - `adr` shows at least one negative value, many zero values and an incredibly high value of 5400
  - `required_car_parking_spaces` shows a high value of 8
- **Undefined Values for Categorical Variables:**
  - there are 'Undefined' values for `meal`, `market_segment`, `distribution_channel`
- **Missing Values**
  - there are missing values for `children`, `country`, `agent` and `company`

---

# Explore Statistical Outliers

## Functions to assist analysis

Define helper functions: one for summarising value counts and percentages in a table, and one for drawing count plots faceted by a grouping column

In [None]:
def value_counts_and_percentages(df, filter_by_cols=None):
    data = df[filter_by_cols] if filter_by_cols else df
    df_count = data.value_counts(dropna=False)
    df_percent = round(data.value_counts(normalize=True, dropna=False) * 100, 1)
    summary = pd.concat([df_count, df_percent], axis=1)
    summary.columns = ['Count', '%']
    return summary

def plot_categorical_facets(data, x, facet_by, xlim=None):
    g = sns.catplot(data=data, x=x, kind="count", col=facet_by, col_wrap=2, sharey=False, hue=x, legend=False)
    for ax in g.axes.flatten():
        plt.setp(ax.get_xticklabels(), rotation=90)
    if xlim:
        plt.xlim(xlim)
    plt.show()
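As a quick illustration of what the summary table looks like, the sketch below applies the same count-plus-percentage technique to a made-up `meal` column (the `toy` data is hypothetical, and a single Series is used for brevity rather than a list of columns). Missing values are counted as their own row because `dropna=False` is passed.

```python
import pandas as pd

# Hypothetical data with one missing value
toy = pd.DataFrame({'meal': ['BB', 'BB', 'HB', None]})

# Raw counts and percentages, both including missing values
counts = toy['meal'].value_counts(dropna=False)
percents = (toy['meal'].value_counts(normalize=True, dropna=False) * 100).round(1)

# Combine side by side into one table
summary = pd.concat([counts, percents], axis=1)
summary.columns = ['Count', '%']
print(summary)
```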

## Remove 'Undefined' Values

### Meal

View categories

In [None]:
summary = value_counts_and_percentages(df, ['meal'])
summary

Drop 'Undefined'

In [None]:
# Drop rows with values of 'Undefined'
condition = (df['meal'] == 'Undefined')
df = df[~condition].copy()  # copy so the assignment below does not act on a slice

# Remove 'Undefined' as a category
df['meal'] = df['meal'].cat.remove_categories(['Undefined'])

# Check values have been dropped
summary = value_counts_and_percentages(df, ['meal'])
summary

### Market Segment

View categories

In [None]:
summary = value_counts_and_percentages(df, ['market_segment'])
summary

Drop 'Undefined'

In [None]:
# Drop rows with values of 'Undefined'
condition = (df['market_segment'] == 'Undefined')
df = df[~condition].copy()  # copy so the assignment below does not act on a slice

# Remove 'Undefined' as a category
df['market_segment'] = df['market_segment'].cat.remove_categories(['Undefined'])

# Check values have been dropped
summary = value_counts_and_percentages(df, ['market_segment'])
summary

### Distribution Channel

View categories

In [None]:
summary = value_counts_and_percentages(df, ['distribution_channel'])
summary

Drop 'Undefined'

In [None]:
# Drop rows with values of 'Undefined'
condition = (df['distribution_channel'] == 'Undefined')
df = df[~condition].copy()  # copy so the assignment below does not act on a slice

# Remove 'Undefined' as a category
df['distribution_channel'] = df['distribution_channel'].cat.remove_categories(['Undefined'])

# Check values have been dropped
summary = value_counts_and_percentages(df, ['distribution_channel'])
summary

## Lead Time

Show distribution

In [None]:
sns.histplot(df, x='lead_time')

Analyse values over 600

In [None]:
cols = ['hotel', 'lead_time', 'arrival_date_year', 'arrival_date_month', 'arrival_date_week_number', 'stays_in_weekend_nights',
        'stays_in_week_nights', 'meal', 'market_segment', 'distribution_channel', 'agent', 'company','is_repeated_guest',
        'reserved_room_type','assigned_room_type', 'deposit_type', 'days_in_waiting_list',
        'customer_type', 'adr', 'required_car_parking_spaces', 'total_of_special_requests', 'reservation_status', 'reservation_status_date']
cols = df.columns  # NOTE: this overrides the subset above, so all columns are used
data = df[df['lead_time'] > 600][cols].value_counts(dropna=False)
pd.DataFrame(data)

The bookings with lead times between 600 and 700 share a number of similarities:
- same hotel (City Hotel)
- same travel agent
- for off-peak seasons
- all group bookings
- same room type and meal plan
- adr within 58-63 range
- resulted in cancelled bookings

The bookings with lead times above 700 also share a number of similarities:
- same hotel (Resort Hotel)
- direct bookings
- were not cancelled

They are also both unusual in different ways:
- one has no overnight stays (either weekdays or weekends) and an adr of 0
  - NOTE: a deeper analysis of adr=0 follows later
- one has 28 overnight stays (i.e. 4 weeks)

These bookings do not appear to be random anomalies but rather coherent subsets of bookings. Their lead times are also not far above the rest of the lead time distribution.

**ACTION:** keep all records

## Overnight Stays

Add a calculated `total_nights` variable to assist in analysis

In [None]:
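df_nights = df.drop_duplicates().copy()
# Sum both columns from df_nights (not df) so the calculation uses the deduplicated frame throughout
df_nights['total_nights'] = df_nights['stays_in_weekend_nights'] + df_nights['stays_in_week_nights']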

### Check consistency between `stays_in_weekend_nights` and `stays_in_week_nights`

By calculating the weeks that a guest stayed using `stays_in_weekend_nights` and `stays_in_week_nights`, we can check that the difference is never greater than 1.

In [None]:
# Create columns for weeks calculated by weekday nights and weekend nights
df_nights['weekend_weeks'] = df_nights['stays_in_weekend_nights'] / 2
df_nights['weekday_weeks'] = df_nights['stays_in_week_nights'] / 5

# If consistent, the difference between weekend_weeks and weekday_weeks should be <= 1
df_nights['weeks_are_consistent'] = (abs(df_nights['weekend_weeks'] - df_nights['weekday_weeks'])<=1)
df_nights['weeks_are_consistent'].value_counts()

All records have consistent values.
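The reasoning behind the check: a full week contributes 2 weekend nights and 5 weekday nights, so the two "weeks" estimates can differ by at most 1 for any run of consecutive nights. A minimal sketch on made-up stays (the `toy` rows are hypothetical) shows three consistent cases and one that would be flagged:

```python
import pandas as pd

# Hypothetical stays: one full week, Mon-Fri only, two full weeks,
# and an impossible stay of 10 weekend nights with no weekday nights
toy = pd.DataFrame({
    'stays_in_weekend_nights': [2, 0, 4, 10],
    'stays_in_week_nights': [5, 5, 10, 0],
})

# Estimate the number of weeks from each column separately
weekend_weeks = toy['stays_in_weekend_nights'] / 2
weekday_weeks = toy['stays_in_week_nights'] / 5

# Consistent stays differ by at most one week between the two estimates
toy['weeks_are_consistent'] = (weekend_weeks - weekday_weeks).abs() <= 1
print(toy)
```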

### Analyse highest values

Inspect records with long stays

In [None]:
high_overnight_stays = df_nights[(df_nights['stays_in_weekend_nights']>10) | (df_nights['stays_in_week_nights']>30)]
high_overnight_stays

These observations all seem plausible. For example, there are no babies or children staying for this long. Some of the average daily rates are very low with 4 of the stays being without charge. This is presumably a concession for special guests.

### Zero Overnight Stays

Inspect records with no overnight stays

In [None]:
no_overnight_stays = df_nights[(df_nights['total_nights']==0)]
no_overnight_stays

It is unclear what these bookings are but they are clearly a special case.

These bookings constitute less than 1% of the overall data (duplicates removed).

In [None]:
summary = value_counts_and_percentages(df_nights, ['total_nights'])
summary.loc[[0.0]]

All of these records have **adr = 0**. According to the [original data source](https://www.sciencedirect.com/science/article/pii/S2352340918315191), Average Daily Rates are calculated by
> dividing the sum of all lodging transactions by the total number of staying nights.

If there are no staying nights, the value will be undefined and the system presumably defaults to zero.
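To make the arithmetic concrete, here is a rough sketch of how an ADR of zero could arise for stays with no nights. The column name `lodging_total` and the fallback to zero are assumptions for illustration only; the source does not document how the hotel system actually handles this case.

```python
import numpy as np
import pandas as pd

# Hypothetical bookings: total lodging revenue and number of staying nights
toy = pd.DataFrame({
    'lodging_total': [300.0, 0.0, 120.0],
    'total_nights': [3, 0, 2],
})

# Replace zero nights with NaN so the division is undefined rather than infinite
nights = toy['total_nights'].replace(0, np.nan)

# Assumed behaviour: an undefined ADR is stored as zero
toy['adr'] = (toy['lodging_total'] / nights).fillna(0.0)
print(toy)
```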

In [None]:
summary = value_counts_and_percentages(no_overnight_stays, ['adr'])
summary

Cancellation rates are much lower for this type of booking than for overnight stays

In [None]:
condition = df_nights['total_nights'] == 0.0

# Get percentage counts for day-only bookings and overnight bookings
day_only_counts = value_counts_and_percentages(df_nights[condition], ['is_canceled'])
overnight_counts = value_counts_and_percentages(df_nights[~condition], ['is_canceled'])

# Concatenate data into one summary table
summary = pd.concat([day_only_counts, overnight_counts], axis=1)

# Drop Counts and rename columns
summary.drop('Count', axis=1, inplace=True)
summary.columns = ['No Overnight Stays (%)', 'Yes Overnight Stays (%)']

summary

Of all bookings with adr = 0.0, the majority (~63%) include at least one overnight stay. Bookings with no overnight stays still form a significant minority, so they will not be removed for the moment.

In [None]:
zero_adr = df_nights[df_nights['adr'] == 0.0]
summary = value_counts_and_percentages(zero_adr, ['total_nights'])
summary

## Adults

View values

In [None]:
summary = value_counts_and_percentages(df.drop_duplicates(), ['adults'])
summary

Analyse records with more than 5 adults

In [None]:
high_adults = df[df['adults'] > 5].drop_duplicates()
high_adults

These records share some similarities
- same hotel (Resort)
- booked for Sep and Oct in 2015
- booked over 300 days in advance
- group bookings
- many were booked through the same travel agent
- adr = 0
- country = PRT
- all were cancelled - most in Jan 2015 but one in Sep 2015

Since they were group bookings with other similarities, these records appear to be valid and will not be removed.

## Children

View values

In [None]:
summary = value_counts_and_percentages(df.drop_duplicates(), ['children'])
summary

It is highly unlikely but not impossible that a booking involved 10 children

In [None]:
df[df['children'] == 10]

View the distribution of ADR for the same room type

In [None]:
room_D = df[df['assigned_room_type'] == 'D'].drop_duplicates()
sns.histplot(room_D, x='adr')

It is likely an anomaly because:
- Most hotels don’t allow a single room with 2 adults + 10 children so it was likely meant to say 1 or 0.
- ADR (133.16) looks like a normal rate for 1–2 rooms, not for such a large group.
- Reservation status = 'No-Show', which might explain why it slipped through without being corrected on arrival.

**ACTION:** drop

In [None]:
condition = df['children'] == 10
df = df[~condition]

## Babies

View values

In [None]:
summary = value_counts_and_percentages(df.drop_duplicates(), ['babies'])
summary

Inspect babies > 5

In [None]:
df[df['babies'] > 5]

The rest of the details look to be consistent with small bookings. These are considered to be anomalies.

**ACTION:** drop

In [None]:
condition = df['babies'] > 5
df = df[~condition]

## Previous Cancellations

View values

In [None]:
summary = value_counts_and_percentages(df.drop_duplicates(), ['previous_cancellations'])
summary

Inspect records with high previous_cancellations

In [None]:
high_cancellations = df[df['previous_cancellations'] > 15].drop_duplicates()
high_cancellations


View relationship between previous_cancellations and previous_bookings_not_canceled

In [None]:
sns.scatterplot(df.drop_duplicates(), x='previous_cancellations', y='previous_bookings_not_canceled')

Guests who have very high cancellation rates are typically characterised by having no previous non-cancelled bookings and no other records after Oct 2015.

These are probably valid system records which were flagged by the hotel to prevent future bookings by these customers. Alternatively, they could reflect a system design error (e.g. the same “guest” ID being reused by an agent, inflating cancellation counts).

**ACTION:** keep

## Previous Bookings Not Cancelled

Inspect previous_bookings_not_canceled > 40

In [None]:
df[df['previous_bookings_not_canceled'] > 40].drop_duplicates()


These bookings appear to be valid as they were all made to the City Hotel by the same company with low lead times.

**ACTION:** keep

## Booking Changes

View values

In [None]:
summary = value_counts_and_percentages(df.drop_duplicates(), ['booking_changes'])
summary

In [None]:
df[df['booking_changes'] > 15].drop_duplicates()

These do not look like random errors since there are similarities:
- All of these bookings are associated with a company or an agent (most are tied to agent 9 or agent 240, which may represent large travel agencies or corporate portals)
- Some of these are very long stays and these are more likely to require some booking changes

They could possibly be explained by agencies using "change" instead of cancel/rebook.

**ACTION:** keep

## Days in Waiting List

In [None]:
summary = value_counts_and_percentages(df.drop_duplicates(), ['days_in_waiting_list'])
summary

In [None]:
df[df['days_in_waiting_list'] > 300].drop_duplicates()

These look to be valid entries because they share the following similarities:
- same hotel (City)
- group bookings by the same agent (1)
- high lead times

They reflect group reservations booked far in advance, held on a waiting list until confirmed.

**ACTION:** keep

## Average Daily Rates

A negative value for ADR is not valid

**ACTION:** drop

In [None]:
condition = df['adr'] < 0
df = df[~condition]

View values

In [None]:
summary = value_counts_and_percentages(df.drop_duplicates(), ['adr'])
summary

Approximately 2% of deduplicated records have **adr = 0**, so this is worth exploring further.

In [None]:
df_adr = df.copy().drop_duplicates()
df_adr['zero_adr'] = df_adr['adr'] == 0.0
df_adr['zero_adr'].value_counts(normalize=True)

### Distributions in Categorical Features

Look for differences in the distribution of categorical variables.

In [None]:
cat_features = df_adr.select_dtypes(['category', 'bool']).columns.to_list()
for feature in cat_features:
    plot_categorical_facets(df_adr, feature, 'zero_adr')

View which agents booked the most zero-adr bookings

In [None]:
df_adr[df_adr['zero_adr'] == True]['agent'].value_counts().head(3)

View which companies booked the most zero-adr bookings

In [None]:
df_adr[df_adr['zero_adr'] == True]['company'].value_counts().head(3)

### Distributions in Numeric Features

View numeric features to investigate

In [None]:
numeric_features = df_adr.select_dtypes('number').columns.to_list()
print(numeric_features)

View distributions of lead times

In [None]:
sns.histplot(df_adr, x='lead_time', hue='zero_adr', stat='density', common_norm=False, kde=True)
plt.xlim(0,300)

Bookings with zero-adr are much more likely to have zero lead time

In [None]:
df_zero = df_adr[df_adr['zero_adr']==True]
df_non_zero = df_adr[df_adr['zero_adr']==False]

zero_count = value_counts_and_percentages(df_zero, ['lead_time'])
non_zero_count = value_counts_and_percentages(df_non_zero, ['lead_time'])

summary = pd.concat([zero_count, non_zero_count], axis=1)
summary.drop('Count', axis=1, inplace=True)
summary.columns = ['zero-adr (%)', 'non zero-adr (%)']
summary

View duration of stay

In [None]:
plot_categorical_facets(df_adr, 'stays_in_weekend_nights', 'zero_adr')

In [None]:
plot_categorical_facets(df_adr, 'stays_in_week_nights', 'zero_adr', (-1,10))

In [None]:
df_adr['total_nights'] = df_adr['stays_in_week_nights'] + df_adr['stays_in_weekend_nights']
plot_categorical_facets(df_adr, 'total_nights', 'zero_adr', (-1,20))

In [None]:
for feature in numeric_features:
    sns.histplot(df_adr, x=feature, hue='zero_adr', stat='density', common_norm=False, kde=True)
    plt.show()

### Conclusions

Compared to standard bookings, bookings with zero ADR have:
- lower percentage cancellations
- higher percentage `is_repeated_guest`
- higher percentages in Oct and Dec for `arrival_date_month`
- higher percentages of 'Complementary' for `market_segment`
- higher percentages of 'Direct' for `distribution_channel`
- higher percentage of room I and K for `assigned_room_type`
- higher percentage for 240 for `agent`
- predominantly been booked through company 45
- higher occurrences of zero lead time
- fewer overnight stays and a large proportion of zero overnight stays

These patterns make it unlikely that they are random errors. It is more likely that zero ADR represents valid "non-revenue" bookings such as staff stays, promotions, group allotments or company-specific deals where revenue is not recorded at ADR level.

**ACTION:** Keep

## Car Parking

View values

In [None]:
summary = value_counts_and_percentages(df.drop_duplicates(), ['required_car_parking_spaces'])
summary

Inspect required_car_parking_spaces > 5

In [None]:
df[df['required_car_parking_spaces'] > 5]

These appear to be valid bookings because they are both corporate bookings classed under a `customer_type` of 'Transient-Party'.

**ACTION:** Keep

# Data Consistency Analysis

## Investigate Distribution Channel

### Initial Investigations

View values for `distribution_channel`

In [None]:
summary = value_counts_and_percentages(df, ['distribution_channel'])
display(summary)

### `distribution_channel` + `agent`

We expect that when an **agent** is specified, **distribution_channel = TA/TO**

In [None]:
data = df[~df['agent'].isna()]
summary = value_counts_and_percentages(data, ['distribution_channel'])
display(summary)

Findings
- **TA/TO** is expected:
  - This is the majority class with over 90% of all observations
  - ACTION: keep
- **Direct** is inconsistent:
  - Probably `distribution_channel` or `agent` is mis-labelled but impossible to know which.
  - ACTION: drop
- **Corporate** is possible:
  - Some corporate bookings are handled by travel agents (e.g. travel management companies).
  - ACTION: investigate further
- **GDS** is valid:
  - Many travel agents book via GDS.
  - ACTION: keep

Check that similar patterns are seen with both hotels

In [None]:
# City
city_data = df[(~df['agent'].isna() & (df['hotel'] == 'City Hotel'))]
city_summary = value_counts_and_percentages(city_data, ['distribution_channel'])

# Resort
resort_data = df[(~df['agent'].isna() & (df['hotel'] == 'Resort Hotel'))]
resort_summary = value_counts_and_percentages(resort_data, ['distribution_channel'])

# Concatenate into one table and remove counts
summary = pd.concat([city_summary, resort_summary], axis=1)
summary.drop('Count', axis=1, inplace=True)

# Rename column headings and display
summary.columns = ['% of City Hotel Bookings', '% of Resort Hotel Bookings']
summary


Similar patterns are seen with both hotels, so the inconsistencies do not appear to stem from administrative errors at only one of the hotels.

Investigate `Corporate` further
- How many of the 1134 corporate bookings have a company ID specified alongside the travel agent?

In [None]:
data = df[(~df['agent'].isna()) & (~df['company'].isna())]
summary = value_counts_and_percentages(data, ['distribution_channel'])
display(summary)

Findings
- Only 132 of the 1134 agent bookings with `distribution_channel` = 'Corporate' have a company ID associated with them (~12%)
- These may still be valid records where the company was not recorded
- ACTION: keep 'Corporate' but drop 'Direct'


In [None]:
# Drop rows with specified agent but distribution_channel = 'Direct'
condition = (~df['agent'].isna()) & (df['distribution_channel'] == 'Direct')
df = df[~condition]

data = df[~df['agent'].isna()]
summary = value_counts_and_percentages(data, ['distribution_channel'])
display(summary)

### `distribution_channel` + `company`

We expect that when a **company** is specified, **distribution_channel = Corporate**

In [None]:
data = df[~df['company'].isna()]
summary = value_counts_and_percentages(data, ['distribution_channel'])
display(summary)

Findings
- **Corporate** is expected:
  - This is the majority class with ~ 75% of all observations
  - ACTION: keep
- **TA/TO** is possible:
  - Some corporate bookings are handled by travel agents (e.g. travel management companies).
  - ACTION: keep
- **Direct** is inconsistent:
  - Probably `distribution_channel` or `company` is mis-labelled but impossible to know which.
  - ACTION: drop
- **GDS** is possible:
  - The company could have booked via GDS.
  - ACTION: keep

In [None]:
# Drop rows with specified company but distribution_channel = 'Direct'
condition = (~df['company'].isna()) & (df['distribution_channel'] == 'Direct')
df = df[~condition]

data = df[~df['company'].isna()]
summary = value_counts_and_percentages(data, ['distribution_channel'])
display(summary)
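The two consistency checks above follow the same pattern: cross-tabulate whether an ID is present against `distribution_channel`, then drop the combinations judged inconsistent. A minimal sketch on made-up rows (the `toy` data and the `has_agent` name are hypothetical) for the agent case:

```python
import numpy as np
import pandas as pd

# Hypothetical bookings: two have no agent recorded
toy = pd.DataFrame({
    'agent': [9, 240, np.nan, 9, np.nan],
    'distribution_channel': ['TA/TO', 'Direct', 'Corporate', 'GDS', 'Direct'],
})

# Cross-tabulate agent presence against distribution channel
has_agent = toy['agent'].notna().rename('has_agent')
table = pd.crosstab(has_agent, toy['distribution_channel'])
print(table)

# Drop only the inconsistent combination: agent specified but channel is 'Direct'
inconsistent = toy['agent'].notna() & (toy['distribution_channel'] == 'Direct')
cleaned = toy[~inconsistent]
print(cleaned)
```

Note that 'Direct' bookings without an agent are kept; only rows where both conditions hold are removed.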

## Investigate `is_canceled`

We expect that records with **is_canceled = 0** should have **reservation_status = 'Check-Out'**

In [None]:
data = df[df['is_canceled']==0]
summary = value_counts_and_percentages(data, ['reservation_status'])
display(summary)

This is the case. We also expect that records with **is_canceled = 1** should have **reservation_status = 'Canceled' or 'No-Show'**

In [None]:
data = df[df['is_canceled']==1]
summary = value_counts_and_percentages(data, ['reservation_status'])
display(summary)

This is also the case so no cleaning required here.

## Investigate `is_repeated_guest`

This variable should relate to `previous_bookings_not_canceled` and `previous_cancellations` so these will each be investigated first.

### Investigate `previous_bookings_not_canceled`

According to the [original data source](https://www.sciencedirect.com/science/article/pii/S2352340918315191), this variable was assigned as follows:
> In case there was no customer profile associated with the booking, the value is set to 0. Otherwise, the value is the number of bookings with the same customer profile created before the current booking and not canceled.

Presumably this also applies to bookings made before the date range of the current dataset.

View counts

In [None]:
data = df.drop_duplicates(keep='first')  # Drop duplicates to prevent skewing the data
summary = value_counts_and_percentages(data, ['previous_bookings_not_canceled'])
display(summary)

### Investigate `previous_cancellations`

According to the [original data source](https://www.sciencedirect.com/science/article/pii/S2352340918315191), this variable was assigned as follows:
> In case there was no customer profile associated with the booking, the value is set to 0. Otherwise, the value is the number of bookings with the same customer profile created before the current booking and canceled.

Presumably this also applies to bookings made before the date range of the current dataset.

In [None]:
data = df.drop_duplicates(keep='first')  # Drop duplicates to prevent skewing the data
summary = value_counts_and_percentages(data, ['previous_cancellations'])
display(summary)

### Investigate `is_repeated_guest`

According to the [original data source](https://www.sciencedirect.com/science/article/pii/S2352340918315191), this variable was

> *"created by verifying if a profile was associated with the booking customer. If so, and if the customer profile creation date was prior to the creation date for the booking on the PMS database it was assumed the booking was from a repeated guest."*

Create a `total_previous_bookings` column

In [None]:
df_prev_bookings = df.drop_duplicates(keep='first').copy()  # copy before adding a new column
df_prev_bookings['total_previous_bookings'] = df_prev_bookings['previous_bookings_not_canceled'] + df_prev_bookings['previous_cancellations']
df_prev_bookings.head(3)


We expect that all records with **is_repeated_guest = 0** will have no previous non-cancelled bookings but this is not found to be the case.

In [None]:
data = df_prev_bookings[df_prev_bookings['is_repeated_guest']==0]
summary = value_counts_and_percentages(data, ['total_previous_bookings', 'previous_bookings_not_canceled', 'previous_cancellations', 'is_repeated_guest'])
display(summary)

We expect that all records with **is_repeated_guest = 1** will have at least one previous non-cancelled booking but this is not found to be the case either. 

In [None]:
data = df_prev_bookings[df_prev_bookings['is_repeated_guest']==1]
summary = value_counts_and_percentages(data, ['total_previous_bookings', 'previous_bookings_not_canceled', 'previous_cancellations', 'is_repeated_guest'])
display(summary)

Since the is_repeated_guest feature is derived from whether the PMS had a guest profile created before the booking (rather than using the previous booking data), the discrepancy may have predictive power when training the model.
- ACTION: add an additional feature (an inconsistency flag) during feature engineering and assess feature importance after training the model to see if the signal has any significance.
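One possible shape for such a flag is sketched below. The column name `repeat_guest_inconsistent` and the exact rule (comparing `is_repeated_guest` with whether any previous booking, cancelled or not, is recorded) are assumptions for illustration; the final definition is left to the feature engineering stage.

```python
import pandas as pd

# Hypothetical bookings covering the consistent and inconsistent combinations
toy = pd.DataFrame({
    'is_repeated_guest': [False, False, True, True],
    'previous_bookings_not_canceled': [0, 2, 3, 0],
    'previous_cancellations': [0, 0, 1, 0],
})

# A guest "has history" if any previous booking is recorded against the profile
total_previous = toy['previous_bookings_not_canceled'] + toy['previous_cancellations']
has_history = total_previous > 0

# Flag rows where the repeated-guest indicator disagrees with the booking history
toy['repeat_guest_inconsistent'] = toy['is_repeated_guest'] != has_history
print(toy)
```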

# Missing Data

## Initial Overview

Helper function for summarising missing data

In [None]:
def summarise_missing_data():
    # Get variables with missing data
    missing = pd.DataFrame(df.isna().sum(), columns=['count'])
    missing = missing[missing['count'] > 0]

    # Add percentage column
    total = len(df)
    missing['%'] = round(100 * missing['count'] / total,2)

    return missing

View variables with missing data

In [None]:
summarise_missing_data()

*NOTE: Although the profile report identified 4 missing values for `children`, these have been removed in the data cleaning steps already taken.*

## Handling missing `agent` and `company`

According to the [original data source](https://www.sciencedirect.com/science/article/pii/S2352340918315191), an empty value should be interpreted as 'not applicable' for `agent` and `company`. However, in this dataset, when comparing missing values with the `distribution_channel`, it appears that a significant number of bookings were made by travel agents (or tour operators) but the agent ID was not recorded.

In [None]:
summary = value_counts_and_percentages(df[df['agent'].isna()].drop_duplicates(), ['distribution_channel'])
summary

Similarly, some bookings were made by companies but the company ID was not recorded.

In [None]:
summary = value_counts_and_percentages(df[df['company'].isna()].drop_duplicates(), ['distribution_channel'])
summary

Since these columns contain numeric identifiers for each agent/company, a value of zero could be used to represent 'not specified' if it doesn't already exist.

Let's check if zero is already a category

In [None]:
print('Agent 0 exists:', (df['agent'] == 0).any())
print('Company 0 exists:', (df['company'] == 0).any())

Since neither feature has a category of zero, this category will be created and used when the agent/company is not specified.

In [None]:
df['agent'] = df['agent'].cat.add_categories([0]).fillna(0)
df['company'] = df['company'].cat.add_categories([0]).fillna(0)
summarise_missing_data()
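The category has to be added before filling because pandas rejects filling a categorical column with a value that is not one of its categories. A minimal sketch on a made-up `agent` column (values are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical agent IDs with one missing value, stored as a categorical
agent = pd.Series([9, np.nan, 240], dtype='Int64').astype('category')

# Register 0 as a valid category first, then use it to fill the missing value
filled = agent.cat.add_categories([0]).fillna(0)

print(filled.tolist())
print('Remaining missing values:', filled.isna().sum())
```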

---

## Handle Missing `country`

Since it is common for hotels not to know the correct nationality of guests until the moment of check-in, we will check whether missing values are associated with cancelled bookings.

In [None]:
summary = value_counts_and_percentages(df[df['country'].isna()].drop_duplicates(), ['reservation_status'])
summary

Only ~7% of missing values are associated with guests who did not check in, so next check whether the missing values come predominantly from one hotel.

In [None]:
summary = value_counts_and_percentages(df[df['country'].isna()].drop_duplicates(), ['hotel'])
summary

Most missing values (99%) are associated with the Resort Hotel, so let's analyse the nationality of guests at the Resort Hotel.

In [None]:
df_resort = df[df['hotel'] == 'Resort Hotel'].drop_duplicates()
summary = value_counts_and_percentages(df_resort, ['country'])
summary

Most of the guests at the resort hotel are from Portugal but they still constitute less than half of the total guests. Therefore, it is not immediately obvious which country the guests with missing data are from.

Since missing values constitute a very small percentage of the data, these records will be dropped.

**ACTION:** Drop

In [None]:
df = df.dropna(subset=['country'])  # reassign rather than modify a filtered frame in place
summarise_missing_data()

# Duplicate Records

## Overview

View extent of duplicated records

In [None]:
total_count = df.shape[0]

unique_count = df.drop_duplicates(keep=False).shape[0]
unique_percent = round(100*unique_count/total_count,1)

duplicated_count = df.duplicated(keep=False).sum()
duplicated_percent = round(100*duplicated_count/total_count,1)

print(f'Total Records: {total_count}')
print(f'Unique Records: {unique_count} ({unique_percent}%)')
print(f'Records with duplicates: {duplicated_count} ({duplicated_percent}%)')
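The `keep=False` argument is what makes these two counts complementary: `drop_duplicates(keep=False)` keeps only rows that never repeat, while `duplicated(keep=False)` marks every member of a repeated group. A small sketch on made-up rows (the `toy` data is hypothetical) shows the difference from the default behaviour:

```python
import pandas as pd

# Hypothetical bookings: one record appears three times, the rest once
toy = pd.DataFrame({
    'hotel': ['City', 'City', 'City', 'Resort', 'Resort', 'City'],
    'adr': [60, 60, 60, 80, 95, 70],
})

total = len(toy)

# Rows whose content appears exactly once in the data
never_repeated = len(toy.drop_duplicates(keep=False))

# Rows belonging to any group of identical records (all copies counted)
in_dup_groups = int(toy.duplicated(keep=False).sum())

# Default behaviour for comparison: one representative kept per group
distinct = len(toy.drop_duplicates())

print(total, never_repeated, in_dup_groups, distinct)
```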

Group duplicate records and count the number in each group

In [None]:
grouped_records = df.value_counts().reset_index(name="count")

unique_records = grouped_records[grouped_records['count']==1]
duplicate_groups = grouped_records[grouped_records['count']>1]

print('Number of unique records:', unique_records.shape[0])
print('Number of distinct duplicate record groups:', duplicate_groups.shape[0])

Show distribution for numbers of duplicates in each group

In [None]:
sns.histplot(data=duplicate_groups, x='count', binwidth=5)

View these as percentages for most common group sizes

In [None]:
summary = value_counts_and_percentages(duplicate_groups, ['count'])
summary.head(10)

Nearly all of the duplicated records have fewer than 30 exact duplicates, with the vast majority having between 1 and 5. It is not implausible that a small number of independent records could have identical information, especially if the hotels are large and bookings were made in response to advertised special offers. However, it is highly unlikely that more than 5 independent bookings would share exactly the same information, so these are more likely to be system errors.

However, the duplicates could be explained by the booking behaviours of travel agents (e.g. bulk reservations or group bookings) so further investigation is needed.

## Top Ten Duplicated Individual Records

View extent of duplication for individual records (see `count` - far right column).

In [None]:
duplicate_groups.head(10)

The following similarities are observed:
- same hotel (City)
- all were cancelled
- short stays (1 to 3 nights) predominantly mid-week
- no children or babies
- same meal type (BB)
- all booked through travel agents or corporate organisations
- none are repeated guests
- same reserved and assigned room type (A)
- no booking changes
- non refundable deposits
- three from the same agent (37)
- no special requests

These are most likely all group bookings or bulk reservations where each room is recorded as a separate record and the rooms were later cancelled in bulk. This theory is supported by the fact that all of the records above have **deposit_type='Non Refund'** - this is the minority class in the context of the whole dataset and suggests that the hotel may be expecting these bookings to later be cancelled.

However, only half of these records are labelled as 'Groups' under `market_segment`; the others are labelled as offline bookings from travel agents. Perhaps the offline bookings were also group bookings, or perhaps they represent travel agents making bulk reservations to secure the rooms.

Since no booking reference IDs are included in the dataset, it is impossible to know whether these duplicate records relate to the same booking instance. Further investigation is needed to determine whether the duplicates follow identifiable patterns or are simply random noise.

## Distributions for Different Duplication Size Bins

Create `dup_category` variable for comparing different levels of duplication

In [None]:
bins = [0, 1, 5, 10, 20, 50, 100, float('inf')]
labels = [
    'unique records (1)',
    'very low (2-5)',
    'low (6-10)',
    'medium (11-20)',
    'high (21-50)',
    'very high (51-100)',
    'extremely high (>100)'
]

# Assign each row to a duplication category
grouped_records['dup_category'] = pd.cut(grouped_records['count'], bins=bins, labels=labels)

# Show relative proportions
summary = value_counts_and_percentages(grouped_records, ['dup_category'])
summary

*Note the small sample sizes for 'very high' and 'extremely high' categories. Caution is needed when interpreting these groups.*
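Because `pd.cut` uses right-closed intervals by default, a count of 5 falls into 'very low (2-5)' and a count of 6 into 'low (6-10)'. A minimal sketch on made-up counts (the `toy_counts` values are illustrative only, not taken from the dataset) confirms how the bin edges map to the labels:

```python
import pandas as pd

# Same bin edges and labels as used for dup_category
bins = [0, 1, 5, 10, 20, 50, 100, float('inf')]
labels = [
    'unique records (1)',
    'very low (2-5)',
    'low (6-10)',
    'medium (11-20)',
    'high (21-50)',
    'very high (51-100)',
    'extremely high (>100)'
]

# Made-up counts chosen to sit on either side of the bin edges
toy_counts = pd.Series([1, 2, 5, 6, 20, 21, 100, 101])

# Default right=True means each bin includes its upper edge
toy_categories = pd.cut(toy_counts, bins=bins, labels=labels)

print(pd.DataFrame({'count': toy_counts, 'dup_category': toy_categories}))
```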

Define function to assist with generating comparison plots

In [None]:
def generate_group_percentage_plots(data, group_by, target_feature, save_file_path=None):
    """Plot the percentage of each target_feature value within each group_by level.

    If save_file_path is given, the figure is also saved to that path.
    """
    # Count combinations
    counts = (data.groupby([group_by, target_feature], observed=True).size().reset_index(name='count'))

    # Calculate percentages within each group_by level
    counts['percentage'] = counts.groupby(group_by, observed=True)['count'].transform(lambda x: x / x.sum() * 100)

    # Plot
    plt.figure(figsize=(10,6))
    sns.barplot(data=counts, x=group_by, y='percentage', hue=target_feature, palette='Set1')
    plt.xticks(rotation=45)
    plt.xlabel('Size of Duplicate Group')
    plt.ylabel('Percentage')
    plt.title(f"Percentage {target_feature} by duplicate group size")
    plt.legend(title=target_feature)
    plt.tight_layout()
    
    if save_file_path:
        plt.savefig(save_file_path, bbox_inches='tight')
        print(f'SUCCESS: Image saved at {save_file_path}')
    
    plt.show()

### Analyse `dup_category` by Market Segments

In [None]:
generate_group_percentage_plots(grouped_records, 'dup_category', 'market_segment')

'Groups' and 'Offline TA/TO' become more prominent in larger duplicate group sizes, which supports the theory that duplicate records are largely group bookings by travel agents.
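The within-category percentages shown in these plots can be cross-checked as a table. The sketch below uses `pd.crosstab` with `normalize='index'` on a small made-up frame (`toy` and its shortened category names are illustrative only); it is one possible numeric check, not part of the original analysis:

```python
import pandas as pd

# Made-up data: one row per record with its duplication category and segment
toy = pd.DataFrame({
    'dup_category': ['unique', 'unique', 'unique', 'unique',
                     'high', 'high', 'high', 'high'],
    'market_segment': ['Online TA', 'Online TA', 'Online TA', 'Groups',
                       'Groups', 'Groups', 'Groups', 'Online TA'],
})

# normalize='index' gives row-wise proportions, so each category sums to 100%
pct_table = pd.crosstab(toy['dup_category'], toy['market_segment'], normalize='index') * 100

print(pct_table.round(1))
```

Each row of `pct_table` corresponds to one cluster of bars in the plot.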

### Analyse `dup_category` by Distribution Channel

In [None]:
generate_group_percentage_plots(grouped_records, 'dup_category', 'distribution_channel')

Similar proportions for distribution channels are observed in each duplicate group category. There is perhaps a small increase in the proportion of bookings made by travel agents but the percentage of these bookings is already very high and differences may not be statistically significant.

### Duplicated records by `agent`

View top 10 agents by number of *different* duplicated records

In [None]:
duplicate_groups['agent'].value_counts().head(10)

View top 10 agents by total number of duplicated records

In [None]:
duplicate_groups.groupby('agent', observed=True)['count'].sum().sort_values(ascending=False).head(10)

Duplicate records are not distributed randomly between agents but are concentrated among a few high-volume agents (especially those with IDs of 1, 9, 6, 240 and 3). This suggests that the duplicates can be explained by the booking behaviours of these agents.
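The two rankings answer different questions: the first counts how many distinct duplicated records an agent has, while the second sums how many rows those records account for. A minimal sketch with hypothetical agent IDs and made-up counts (`toy_groups` stands in for `duplicate_groups`) shows how the two can disagree:

```python
import pandas as pd

# Made-up stand-in for duplicate_groups: one row per distinct duplicated
# record, with the number of identical copies in 'count'
toy_groups = pd.DataFrame({
    'agent': [101, 101, 101, 102, 102, 103],  # hypothetical agent IDs
    'count': [2, 3, 2, 40, 10, 5],
})

# Number of *different* duplicated records per agent
distinct_per_agent = toy_groups['agent'].value_counts()

# Total number of duplicated rows per agent
total_per_agent = toy_groups.groupby('agent')['count'].sum().sort_values(ascending=False)

print(distinct_per_agent)
print(total_per_agent)
```

Here agent 101 has the most distinct duplicated records, but agent 102 accounts for the most duplicated rows because of one very large group.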

### Duplicated Records by Cancellation Rates

Investigate relationship between cancellations and duplicate records

In [None]:
generate_group_percentage_plots(grouped_records, 'dup_category', 'reservation_status')

The cancellation rate increases with the level of duplication. This also suggests that the duplicates are not just random errors in the data collation process but likely relate to real booking behaviour and administrative workflows.
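The same trend can be summarised as a single cancellation rate per category. The sketch below assumes a boolean `is_canceled` column and uses made-up rows (`toy` is illustrative only), so the percentages are not results from the dataset:

```python
import pandas as pd

# Made-up data: four records in each of three duplication categories
toy = pd.DataFrame({
    'dup_category': pd.Categorical(
        ['unique records (1)'] * 4 + ['very low (2-5)'] * 4 + ['high (21-50)'] * 4,
        categories=['unique records (1)', 'very low (2-5)', 'high (21-50)'],
        ordered=True,
    ),
    'is_canceled': [False, False, False, True,
                    False, False, True, True,
                    False, True, True, True],
})

# The mean of a boolean column is the share of True values in each group
cancel_rate = toy.groupby('dup_category', observed=True)['is_canceled'].mean() * 100

print(cancel_rate)
```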

## Conclusions

Analysis of duplicate records suggests that they can most likely be explained by the booking behaviours of travel agents. It is likely that bookings were made by travel agents to secure availability and then later cancelled if not needed. Analysis of the market segments suggests that many of these could be explained by group bookings.

It is impossible to be sure without learning more about the administrative workflows used by the hotels and travel agents, and some of the duplicates may still be the result of errors when collating the data.

Since the level of duplication appears to correlate with cancellation rates, it would be useful to keep information about whether a record is unique or has many duplicates. However, keeping duplicated observations could cause data leakage when the data is later split into train and test sets for training the model, because identical rows could appear in both sets.

**ACTION:** Save two different versions of the cleaned data:
1. **cleaned_all_records.csv:** useful for analysis
2. **cleaned_deduplicated.csv:** useful for training the ML model

*The second of these files would include a new `record_count` feature.*
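A minimal sketch on a made-up frame of how a `record_count` column can be derived with `DataFrame.value_counts`. By default `value_counts` excludes rows containing missing values, so a useful sanity check is that the `record_count` values sum to the number of rows in the original data. The `toy` frame and its values are illustrative only:

```python
import numpy as np
import pandas as pd

# Made-up data: three identical rows plus two identical rows with a missing agent
toy = pd.DataFrame({
    'hotel': ['City', 'City', 'City', 'Resort', 'Resort'],
    'agent': [9, 9, 9, np.nan, np.nan],
})

# Default behaviour: rows containing NaN are excluded from the counts
dedup_default = toy.value_counts().reset_index(name='record_count')

# dropna=False keeps them, so every original row is accounted for
dedup_keep_na = toy.value_counts(dropna=False).reset_index(name='record_count')

print(dedup_keep_na)
print('Rows accounted for:', dedup_keep_na['record_count'].sum(), 'of', len(toy))
```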

In [None]:
df_cleaned_all = df.copy()
# Collapse identical rows; dropna=False keeps rows that contain missing values
df_cleaned_deduplicated = df.value_counts(dropna=False).reset_index(name="record_count")

Check df_cleaned_deduplicated looks correct

In [None]:
df_cleaned_deduplicated

# Push files to Repo

Set version number for outputs directory

In [None]:
version = 'v1'

Create output directory with that version number

In [None]:
outputs_v_dir = project_root / 'outputs' / version

if outputs_v_dir.is_dir():
    raise FileExistsError(
        f'Output directory already exists at: {outputs_v_dir}\n'
        'Please create a new version.'
    )
else:
    # Create outputs-version directory
    outputs_v_dir.mkdir(parents=True, exist_ok=False)
    print(f'SUCCESS: New directory created at: {outputs_v_dir}')


Save cleaned data

In [None]:
dataset_cleaned_dir = outputs_v_dir / 'datasets' / 'cleaned'

if dataset_cleaned_dir.is_dir():
    raise FileExistsError(
        f'Output directory already exists at: {dataset_cleaned_dir}\n'
        'Create a new version or delete the existing folder before rerunning.'
    )
else:
    # Create directory for cleaned datasets
    dataset_cleaned_dir.mkdir(parents=True, exist_ok=False)

    # Save cleaned datasets as csv files
    df_cleaned_all.to_csv(dataset_cleaned_dir / 'cleaned_all_records.csv', index=False)
    df_cleaned_deduplicated.to_csv(dataset_cleaned_dir / 'cleaned_deduplicated.csv', index=False)

    # Confirmation message
    print(f'SUCCESS: Cleaned datasets saved in {dataset_cleaned_dir}')

Create images directory for saving images to be used on the dashboard


In [None]:
images_dir = outputs_v_dir / 'images'
images_dir.mkdir(parents=True, exist_ok=True)

Save images

In [None]:
file_path = images_dir / 'market_segments_by_dup_category.png'
generate_group_percentage_plots(grouped_records, 'dup_category', 'market_segment', save_file_path=file_path)

In [None]:
file_path = images_dir / 'cancellations_by_dup_category.png'
generate_group_percentage_plots(grouped_records, 'dup_category', 'is_canceled', save_file_path=file_path)