# **CANCELLATIONS ANALYSIS WORKBOOK**

## Objectives

* Answer business requirement 1:
  - The client wants to understand patterns in booking cancellations and determine which variables are most associated/correlated with cancellations.

## Inputs

* **Cleaned Dataset with Duplicates Removed:** outputs/{version}/datasets/cleaned/cleaned_deduplicated.csv
* **Cleaned Dataset with All Records:** outputs/{version}/datasets/cleaned/cleaned_all_records.csv

## Outputs

* Code that answers business requirement 1 and can be used to build the Streamlit App

---

# Import Packages, Load Data and Initialise Useful Variables

Imports

In [None]:
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

Specify project version

In [None]:
version = 'v1'

Load data

In [None]:
from pathlib import Path

project_root = Path.cwd().parent
cleaned_datasets_dir = project_root / 'outputs' / version / 'datasets' / 'cleaned'

cleaned_all = cleaned_datasets_dir / 'cleaned_all_records.csv'
cleaned_deduplicated = cleaned_datasets_dir / 'cleaned_deduplicated.csv'

try:
    df_all = pd.read_csv(cleaned_all)
    df_deduplicated = pd.read_csv(cleaned_deduplicated)
except FileNotFoundError as e:
    print(f"ERROR: Could not find file for specified project version '{version}'\n{e}")
except Exception as e:
    print(f"ERROR: Unexpected error occurred. {e}")
else:
    print("SUCCESS: Datasets have been loaded")


`df_deduplicated` will mainly be used in the following analysis because it allows us to study patterns and drivers of cancellations (per unique booking type) without over-representing certain types of bookings.

In [None]:
df = df_deduplicated.copy()

---

# Change Data Types

View current data types

In [None]:
df.info()

Change data types

In [None]:
# Convert objects to type 'category' (for efficiency and performance)
obj_cols = df.select_dtypes(include='object').columns.to_list()
df[obj_cols] = df[obj_cols].astype('category')

# Convert ints that are really categories
int_to_cat_cols = ['agent', 'company']
df[int_to_cat_cols] = df[int_to_cat_cols].astype('category')

# Convert dates
df['reservation_status_date'] = pd.to_datetime(df['reservation_status_date'])

df.info()

---

# Useful Variables and Functions

Useful variables

In [None]:
target = 'is_canceled'

Useful functions

In [None]:
def plot_top_categories_pie(df, feature, top_n, title, facet_col=None):
    """
    Create a pie chart showing the distribution of the top N categories for a given feature,
    grouping all other categories into "Other". Optionally split the chart into facets by another column.
    """
    
    # Prevent changes to original dataframe
    df = df.copy()
    
    # Group categories outside the top N as 'Other'
    top_n_items = df[feature].value_counts().head(top_n).index.to_list()
    df['grouped_feature'] = df[feature].astype(str).where(df[feature].isin(top_n_items), 'Other')

    # Count values again for the grouped column
    if facet_col:
        grouped_counts = df.groupby(['grouped_feature', facet_col]).size().reset_index()
        grouped_counts.columns = [feature, facet_col, 'count']
    else: 
        grouped_counts = df.groupby(['grouped_feature']).size().reset_index()
        grouped_counts.columns = [feature, 'count']

    # Plot pie chart
    fig = px.pie(
        grouped_counts,
        names=feature,
        values='count',
        color=feature,
        color_discrete_sequence=px.colors.qualitative.Set3,
        title=title,
        facet_col=facet_col,
    )
    fig.show()


def get_percentage_cancelled(df, feature, target, min_total_bookings=0):
    """
    Calculate total bookings and percentage cancellations for each group of a specified feature.
    Optionally apply a minimum bookings threshold, and return a dataframe summarizing results
    sorted by cancellation percentage.

    This function assumes `target` is the column `is_canceled` with boolean categories
    (True = cancelled, False = not cancelled).
    """
    
    # Get counts
    counts = df.groupby([feature, target]).size().reset_index(name='count')
    counts = counts.pivot(index=feature, columns=target, values='count').fillna(0)
    counts.columns.name = None

    # Add total bookings and % cancelled
    summary = counts.copy()
    summary['Total Bookings'] = summary.sum(axis=1).astype('int')
    summary['% Cancelled'] = round(summary[True] / summary['Total Bookings'] * 100,1)

    # Filter for relevant columns
    summary = summary[['Total Bookings', '% Cancelled']]

    # Filter out groups with total bookings below the threshold
    condition = summary['Total Bookings'] >= min_total_bookings
    summary = summary[condition]

    # Sort by cancellation percentage (highest first) and return
    summary = summary.sort_values('% Cancelled', ascending=False)
    return summary

---

# Correlation Studies

We will explore whether any features correlate strongly with booking cancellations using the dataset with duplicate records removed.

We will analyse:
1. All record types (both unique and duplicated)
2. Unique vs duplicated records
3. City hotel vs resort hotel

It would not be appropriate to use the Pearson method for calculating correlations since most variables are NOT continuous numeric variables with normal distributions. Therefore, we will use Spearman correlation coefficients throughout this analysis.
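
To illustrate the difference between the two methods, here is a minimal sketch on synthetic data (the values are made up for illustration and are not taken from the bookings dataset). Spearman works on ranks, so a monotonic but non-linear relationship still scores as a perfect association, whereas Pearson is pulled down by the curvature.

```python
import numpy as np
import pandas as pd

# Synthetic, illustrative data only (not from the bookings dataset)
x = pd.Series(np.arange(1, 21), dtype='float')
y = np.exp(x / 3)  # monotonic but strongly non-linear

pearson = x.corr(y, method='pearson')
spearman = x.corr(y, method='spearman')

print(f"Pearson:  {pearson:.2f}")
print(f"Spearman: {spearman:.2f}")
```

Because the ranks of `x` and `y` are identical here, the Spearman coefficient is 1, while the Pearson coefficient is noticeably lower.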

## Prepare Data

In [None]:
df_corr = df.copy()

Drop `reservation_status` (since it overlaps strongly with `is_canceled`)

In [None]:
df_corr.drop('reservation_status', axis=1, inplace=True)

Encode `hotel` and boolean features as integers 

In [None]:
# Encode hotel type
hotel_map = {'City Hotel': 0, 'Resort Hotel': 1}
df_corr['hotel'] = df_corr['hotel'].map(hotel_map).astype('int')

# Convert booleans to integers
bool_cols = df_corr.select_dtypes('bool').columns.to_list()
df_corr[bool_cols] = df_corr[bool_cols].astype('int')

df_corr.head(3)

Get all numeric columns (including those encoded above) for use with correlation heatmaps

In [None]:
numeric_cols = df_corr.select_dtypes('number').columns.to_list()

# Move target to first column (for easier analysis)
numeric_cols.remove(target)
numeric_cols.insert(0, target)

Get categorical features and separate into those with
- low cardinality (3 to 14 categories): for one-hot encoding
- high cardinality (15+ categories): to exclude from correlation analysis

In [None]:
# Get all categorical features as a list
cat_cols = df_corr.select_dtypes(include=['object', 'category']).columns.to_list()

# Get low cardinality categorical features
low_card_cats = [col for col in cat_cols if (3 <= df_corr[col].nunique() < 15)]
print('Low cardinality features will be encoded:\n', low_card_cats)

# Get high cardinality categorical features and drop from dataframe
high_card_cats = [col for col in cat_cols if df_corr[col].nunique() >= 15]
print('\nHigh cardinality features will be dropped from dataframe:\n', high_card_cats)

One-hot encode categorical features with low cardinality

In [None]:
from feature_engine.encoding import OneHotEncoder

encoder = OneHotEncoder(variables=low_card_cats, drop_last=False)
df_corr = encoder.fit_transform(df_corr)
df_corr.head(3)

Drop high cardinality features

In [None]:
df_corr.drop(high_card_cats, axis=1, inplace=True)
print('Number of columns:', df_corr.shape[1])
df_corr.columns

## Useful Functions for Correlation Analysis

In [None]:
def generate_corr_heatmap(data, method='spearman', figsize=(20,10), lower_threshold=0):
    # Get correlation matrix
    corr = data.corr(method=method)

    # Create masks
    mask_threshold = np.abs(corr) < lower_threshold
    mask_upper = np.triu(np.ones_like(corr, dtype=bool))
    mask = mask_threshold | mask_upper

    # Plot heatmap
    plt.figure(figsize=figsize)
    sns.heatmap(corr, mask=mask, annot=True, cmap='coolwarm', linewidths=0.4, linecolor='#ddd')
    plt.show()

    return corr

def generate_corr_comparisons_table(data1, data2, col_headings, target='is_canceled', sort_by='Mean', method='spearman'):
    # Calculate correlation scores with the target for both datasets (dropping the target itself)
    corr1 = data1.corr(method=method)[target].drop(target).sort_values(key=abs, ascending=False)
    corr2 = data2.corr(method=method)[target].drop(target).sort_values(key=abs, ascending=False)

    # Concatenate scores into a single summary table
    summary = pd.concat([corr1, corr2], axis=1)
    summary.columns = col_headings

    # Add columns to show the mean and difference
    summary['Mean'] = summary[col_headings].mean(axis=1)
    summary['Diff'] = summary[col_headings[0]] - summary[col_headings[1]]

    # Display table, sorted descending by mean
    summary = summary.sort_values(by=sort_by, key=abs, ascending=False).round(2)
    return summary

def plot_corr_comparisons(summary, col_headings, figsize=(15,6)):
    summary[col_headings].plot(
        kind="bar", figsize=figsize, rot=90, title=f"Comparison of Correlation Coefficients: {col_headings[0]} vs {col_headings[1]}"
    )
    plt.ylabel("Correlation Coefficients")
    plt.show()

def compare_sample_sizes(df, cols, subset_by, target):
    for col in cols:
        counts = df.groupby([subset_by, col, target]).size().reset_index(name="Count")
        sns.catplot(counts, x=col, y="Count", hue=subset_by, col=target, kind="bar",sharey=False)
        plt.show()


## 1. All Records

### Analysis

Generate correlation heatmap using numeric features only

In [None]:
data = df_corr[numeric_cols]
corr = generate_corr_heatmap(data, lower_threshold=0)

View top ten features with strongest correlations to `is_canceled` (including one-hot encoded features)

In [None]:
corr = df_corr.corr(method='spearman')[target].drop(target).sort_values(key=abs, ascending=False)
corr.head(10)

### Conclusions

When considering all records in the deduplicated dataset, no features are strongly associated with `is_canceled` but some are weakly associated.

Based on the analysis of all bookings combined, cancellations are somewhat more likely for bookings:
- with **longer lead times**
- made through **Online Travel Agents**
- with a **non-refundable deposit**
- with a **higher average daily rate**
- booked via the **TA/TO distribution channel**
- made by **transient customers**

In contrast, bookings are less likely to be cancelled when they:
- require **more car parking spaces**
- have **no deposit required**
- are made through **Offline Travel Agents or Tour Operators**
- include **special requests**

## 2. Unique vs Duplicated Records

### Analysis

Compare results for unique records and those with duplicates

In [None]:
# Split df_corr for unique and duplicate records
df_unique = df_corr[df_corr['record_count']==1]
df_duplicates = df_corr[df_corr['record_count']>1]

print('Unique records:', df_unique.shape[0])
print('Duplicated records:', df_duplicates.shape[0])

# Generate comparison table
col_headings = ['Unique', 'Duplicates']
summary = generate_corr_comparisons_table(df_unique, df_duplicates, col_headings, sort_by='Mean')
summary.head(10)

Plot results to more easily compare differences

In [None]:
plot_corr_comparisons(summary, col_headings)

Check that large differences in correlation coefficients cannot be explained by very small sample sizes

In [None]:
df_diff = df_corr.copy()
df_diff['is_duplicated'] = df_diff['record_count'] > 1
high_diff_features = summary[abs(summary['Diff']) > 0.08].index.to_list()
compare_sample_sizes(df_diff, high_diff_features, 'is_duplicated', target)

### Conclusions

There are clearly significant differences in how strongly certain features correlate with cancellation rates in the two types of bookings. This further supports the assumption that duplicated records are by nature a different type of booking (e.g. bulk reservations or group bookings).

*NOTE: Similar differences in the associations of features are observed when the unique records subset is downsampled to the same size as the duplicated records subset. This confirms that the differences are not due to sample size but rather reflect different types of bookings.*
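
The downsampling check mentioned in the note is not shown in this workbook. A minimal sketch of how it could be done follows; the helper name `downsampled_corr`, the `random_state` value and the toy data are all hypothetical, and the larger subset is assumed to have at least as many rows as the smaller one.

```python
import numpy as np
import pandas as pd


def downsampled_corr(df_large, df_small, target='is_canceled', method='spearman', random_state=42):
    """Correlate features with the target after downsampling df_large to len(df_small)."""
    # Sample without replacement so both subsets have the same number of rows
    sample = df_large.sample(n=len(df_small), random_state=random_state)
    return sample.corr(method=method)[target].drop(target)


# Toy data standing in for the real subsets (illustrative values only)
rng = np.random.default_rng(0)
toy_large = pd.DataFrame({
    'is_canceled': rng.integers(0, 2, 1000),
    'lead_time': rng.integers(0, 365, 1000),
})
toy_small = toy_large.head(200)

corr_sampled = downsampled_corr(toy_large, toy_small)
print(corr_sampled)
```

With the real data, the same idea could be applied by passing `df_unique.sample(n=len(df_duplicates), random_state=42)` in place of `df_unique` when calling `generate_corr_comparisons_table`.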

Although the strengths of associations are different, the direction of these associations is nearly always the same in each group. Therefore, it is not the case that correlations are cancelled out between the two groups. One notable exception is the `market_segment` 'Groups' category, which shows a weakly positive association with cancellation in the duplicates subset but a weakly negative association in the unique records subset. This provides further evidence for the different nature of the bookings in each subset.

The most strongly associated features in the unique records subset are:
- `market_segment`: 'Online TA' shows a weak positive association (+0.23) but 'Offline TA/TO' shows a weak negative association (-0.15)
- `lead_time`: shows a weak positive association (+0.22)
- `required_car_parking_spaces`: shows a weak negative association (-0.18)
- `adr`: shows a weak positive association (+0.17)
- `customer_type`: 'Transient' shows a weak positive association (+0.14) but 'Transient Party' shows a weak negative association (-0.11)
- `previous_cancellations`: shows a weak positive association (+0.28)
- `total_of_special_requests`: shows a weak negative association (-0.12)
- `booking_changes`: shows a weak negative association (-0.11)

The most strongly associated features in the duplicates subset are:
- `deposit_type`: 'Non Refund' shows a moderately positive association (+0.39) but 'No Deposit' shows a moderately negative association (-0.39)
- `customer_type`: 'Transient' shows a weak positive association (+0.30) but 'Transient Party' shows a weak negative association (-0.29)
- `previous_cancellations`: shows a weak positive association (+0.28)
- `total_of_special_requests`: shows a weak negative association (-0.22)
- `lead_time`: shows a weak positive association (+0.22) - *this is the same as in the unique dataset*
- `record_count`: shows a weak positive association (+0.20)
- `booking_changes`: shows a weak negative association (-0.16)

Since some features are more strongly associated with cancellations in one subset but not the other, predictors do not have uniform predictive power across all booking types and need to be compared and contrasted.

**SIMILARITIES**

For both datasets, cancellations are somewhat **more likely** for bookings:
- for **transient customers**
- with **longer lead times**
- made by guests with a history of **previous cancellations**.

Cancellations are **less likely** for bookings that:
- involve **transient-party customers**
- include **special requests** or **booking changes**

**DIFFERENCES**

Although the direction (+/-) of each association is the same in both datasets, the magnitude (strength) of the associations differs between them.

Stronger Associations in **duplicated** dataset:
- **Deposit Type**: (CAUTION - small sample size)
  - bookings are more likely to cancel with **non-refundable deposit** and less likely to cancel with **no deposit** required
  - this likely reflects the fact that group/bulk bookings are more prone to cancellations and that the hotels correctly identify this and therefore require a deposit for these bookings 

Stronger Associations in **unique** dataset:
- **Market Segment** shows a stronger positive association for **Online TA** (more likely to cancel)
- **Number of Parking Spaces Required** shows a stronger negative association (less likely to cancel)

**IMPLICATIONS FOR HYPOTHESIS 3**

Hypothesis:
> "Bookings with higher average daily rates tend to cancel more."

The analysis supports this hypothesis to some extent. In both unique and duplicated bookings there is a very weak positive association between average daily rate (`adr`) and whether a booking will be cancelled. However, other features have stronger associations with cancellation rates, so `adr` should not be considered a main driver of cancellations.

**IMPLICATIONS FOR ML MODELS**

ML models may need to consider interactions with booking type (unique vs duplicated) or even train separate models for each subset.
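
As a rough sketch of what this could look like in practice (toy data only; the column names `is_duplicated` and `lead_time_x_duplicated` are hypothetical and not part of the cleaned dataset), a booking-type flag can be combined with another feature to form an interaction term, or used to split the data into one training set per booking type.

```python
import pandas as pd

# Toy data (illustrative only); column names mirror those used in this workbook
toy = pd.DataFrame({
    'lead_time': [10, 200, 35, 120],
    'record_count': [1, 4, 1, 2],
    'is_canceled': [0, 1, 0, 1],
})

# Flag duplicated bookings, then build an interaction term with lead_time
toy['is_duplicated'] = (toy['record_count'] > 1).astype('int')
toy['lead_time_x_duplicated'] = toy['lead_time'] * toy['is_duplicated']

# Alternatively, split into one training set per booking type
subsets = {name: group for name, group in toy.groupby('is_duplicated')}

print(toy)
```

Which of the two approaches is preferable would depend on the model family and on how many duplicated records are available for training.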


## 3. City Hotel vs Resort Hotel

### Analysis

Compare results for city and resort hotels

In [None]:
# Split df_corr into city hotel and resort hotel records
df_city = df_corr[df_corr['hotel']==0]
df_resort = df_corr[df_corr['hotel']==1]

print('City hotel records:', df_city.shape[0])
print('Resort hotel records:', df_resort.shape[0])

# Generate comparison table
col_headings = ['City', 'Resort']
summary = generate_corr_comparisons_table(df_city, df_resort, col_headings, sort_by='Mean')
summary.head(10)

In [None]:
diff_features = summary[abs(summary['Diff']) > 0.1].index.to_list()

Plot results to more easily compare differences

In [None]:
plot_corr_comparisons(summary, col_headings)

Check that large differences in correlation coefficients cannot be explained by very small sample sizes

In [None]:
df_diff = df_corr.copy()
high_diff_features = summary[abs(summary['Diff']) > 0.08].index.to_list()
compare_sample_sizes(df_diff, high_diff_features, 'hotel', target)

### Conclusions

Many of the same patterns observed earlier are seen again. There are no strong correlations but some features are weakly correlated.

Again, there are differences in how strongly certain features correlate with cancellation rates in the two types of hotel bookings. Although the strengths of associations are different, the direction of these associations is usually the same. The exceptions (meal types, room bookings) are very weakly associated with cancellation rates and not likely to be significant.
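
Features whose association changes direction between two subsets can also be picked out programmatically rather than by eye. The sketch below assumes a comparison table shaped like the one returned by `generate_corr_comparisons_table`; the helper name `find_sign_flips` and the toy values are hypothetical.

```python
import numpy as np
import pandas as pd


def find_sign_flips(summary, col_headings):
    """Return rows whose correlation sign differs between the two columns."""
    signs = np.sign(summary[col_headings])
    # A negative product means one coefficient is positive and the other negative
    flipped = signs[col_headings[0]] * signs[col_headings[1]] < 0
    return summary[flipped]


# Toy comparison table (illustrative values only, not real results)
toy_summary = pd.DataFrame(
    {'City': [0.22, -0.03, 0.05], 'Resort': [0.25, 0.02, 0.04]},
    index=['feature_a', 'feature_b', 'feature_c'],
)

flips = find_sign_flips(toy_summary, ['City', 'Resort'])
print(flips)
```

Since the comparison table is rounded to two decimal places, coefficients that round to zero are not flagged by this check.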

The most strongly associated features in the **City Hotel** subset are:
- `lead_time`: shows a weak positive association (+0.22)
- `deposit_type`: 'No Deposit' shows a weak negative association (-0.19) but 'Non Refund' shows a weak positive association (+0.19)
- `total_of_special_requests`: shows a weak negative association (-0.19)
- `market_segment`: 'Online TA' shows a weak positive association (+0.17) but 'Offline TA/TO' shows a weak negative association (-0.12)
- `customer_type`: 'Transient' shows a weak positive association (+0.13)
- `booking_changes`: shows a weak negative association (-0.13)
- `previous_cancellations`: shows a weak positive association (+0.13)
- `distribution_channel`: the 'TA/TO' category shows a weak positive association (+0.12)

The most strongly associated features in the **Resort Hotel** subset are:
- `lead_time`: shows a weak positive association (+0.25)
- `market_segment`: 'Online TA' shows a weak positive association (+0.25) but 'Offline TA/TO' shows a weak negative association (-0.15) and 'Direct' also shows a weak negative association (-0.12)
- `required_car_parking_spaces`: shows a weak negative association (-0.24)
- `adr`: shows a weak positive association (+0.18)
- `distribution_channel`: the 'TA/TO' category shows a weak positive association (+0.14)
- `customer_type`: 'Transient' shows a weak positive association (+0.14)
- `stays_in_week_nights`: shows a weak positive association (+0.13)
- `children`: shows a weak positive association (+0.12)

**Similarities**

In both hotels, bookings are more likely to be cancelled if they:
- have **longer lead times**
- are booked via **travel agents**, especially **online travel agents**
- are for **transient customers**
- are made by guests who have **previous cancellations**

Bookings are less likely to be cancelled if they:
- are booked through **offline travel agents or tour operators**
- have **more booking changes**

**Differences**

In the City Hotel, the following factors are more significant:
- **Deposit type**  (CAUTION - small sample sizes)
  - 'No Deposit' corresponds with lower cancellations while 'Non Refund' corresponds with more cancellations
  - This likely reflects existing hotel practices for identifying bookings at risk of cancellation
- **Special requests**
  - Both tend to correlate with fewer cancellations but the association is significantly stronger in the city hotel

In the Resort Hotel, the following factors are more significant:
- **Required car parking spaces**
  - More car parking spaces tends to correlate with fewer cancellations
- **Average Daily Rate**
  - Although higher rates are correlated with more cancellations in both hotels, the feature appears slightly more important for the resort hotel, where it ranks in the top 4 features, than for the city hotel, where it ranks 13th

**IMPLICATIONS FOR HYPOTHESIS 3**

Hypothesis:
> "Bookings with higher average daily rates tend to cancel more."

The analysis supports this hypothesis to some extent. In both hotels there is a very weak positive association between average daily rate (`adr`) and whether a booking will be cancelled. The correlation is stronger for the resort hotel than the city hotel but other features (e.g. lead times, market segment 'Online TA') have stronger associations with cancellation rates in both hotels. Therefore, this feature should not be considered a main driver of cancellations.


---

# Analysis by Country

Useful variables

In [None]:
top_n = 10
feature = 'country'

## Total Bookings

**QUESTION:** Which countries make the most bookings?

### Analysis

View proportions of bookings by country in deduplicated dataset

In [None]:
title = 'Proportions of Total Bookings by Country (Duplicates counted once)'
plot_top_categories_pie(df_deduplicated, feature=feature, top_n=top_n, title=title)

View proportions of bookings by country for all bookings (with all duplicates counted)

In [None]:
title = 'Proportions of Total Bookings by Country (All duplicates counted)'
plot_top_categories_pie(df_all, feature=feature, top_n=top_n, title=title)

View map to see where most bookings originate (deduplicated dataset)

In [None]:
# Prepare data
counts = df_deduplicated['country'].value_counts().reset_index()
counts.columns = ['Country', 'Total Bookings']
counts['Log_Bookings'] = np.log(counts['Total Bookings'])

# Create choropleth
fig = px.choropleth(
    counts,
    locations='Country',
    color='Log_Bookings',
    hover_name='Country',
    hover_data={'Total Bookings': True, 'Log_Bookings': True},
    color_continuous_scale='Viridis',
    title='Total Bookings by Country (duplicates counted once)',
    range_color=[0, counts['Log_Bookings'].max()],  # fix color scale
    width=1200,
    height=700
)

fig.show()

### Conclusions

Both pie charts show that the largest share of bookings comes from Portuguese guests. These far exceed the number of bookings from Great Britain which ranks in second place.

The first pie chart shows the proportions when duplicated records are counted once. This gives an indication of the number of bookings made without giving too much weight to group bookings and bulk reservations. The number of Portuguese bookings is roughly 3x greater than those from Great Britain.

The second pie chart shows the proportions when all duplicates are counted. This gives an indication of the actual number of room reservations made and a measure of the impact on hotel operations. The number of Portuguese bookings is roughly 4x greater than those from Great Britain, illustrating that most group/bulk bookings originate from Portugal.

On the map view, a logarithmic scale was used to limit the impact of the large number of bookings from Portugal and make it easier to compare the number of bookings from other countries. Unsurprisingly, most bookings are from European countries, followed by USA and Brazil.
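
The effect of the logarithmic scale can be seen in a small sketch with toy counts (illustrative values and placeholder country codes only). On a linear colour scale the smaller countries are squeezed close to zero, whereas on a log scale they are spread across the range.

```python
import numpy as np
import pandas as pd

# Toy counts (illustrative values only): one dominant country and several smaller ones
toy_counts = pd.Series({'AAA': 25000, 'BBB': 2000, 'CCC': 500, 'DDD': 20})

# Position of each country on a colour scale running from 0 to 1
scaled_linear = toy_counts / toy_counts.max()
scaled_log = np.log(toy_counts) / np.log(toy_counts.max())

print(pd.DataFrame({'linear': scaled_linear, 'log': scaled_log}).round(2))
```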

## Total Cancellations

**QUESTION:** Which countries make the most cancellations?

### Analysis

View proportions of bookings by country in deduplicated dataset

In [None]:
title = 'Proportions of Total Bookings by Country Faceted by Cancellation Status (Duplicates counted once)'
plot_top_categories_pie(df_deduplicated, feature=feature, top_n=top_n, title=title, facet_col=target)

View proportions of bookings by country for all bookings (with all duplicates counted)

In [None]:
title = 'Proportions of Total Bookings by Country Faceted by Cancellation Status (All duplicates counted)'
plot_top_categories_pie(df_all, feature=feature, top_n=top_n, title=title, facet_col=target)

Compare cancellations for top ten countries only

In [None]:
# Get sorted list of top ten countries
top_countries_deduplicated = df_deduplicated['country'].value_counts().sort_values(ascending=False).head(top_n).index.to_list()
top_countries_all = df_all['country'].value_counts().sort_values(ascending=False).head(top_n).index.to_list()

# Create figure with sub-plots
fig, axes = plt.subplots(ncols=2, figsize=(15,6))
fig.suptitle(f'Comparison of Top {top_n} Countries for Bookings', fontsize=16)

# Add subplots
sns.countplot(data=df_deduplicated, x='country', hue=target, order=top_countries_deduplicated, ax=axes[0])
sns.countplot(data=df_all, x='country', hue=target, order=top_countries_all, ax=axes[1])

# Set titles and labels for each sub-plot
axes[0].set_title('Duplicates Counted Once')
axes[1].set_title('All Duplicates Counted')
for ax in axes:
    ax.set_ylabel('Number of Bookings')
    ax.set_xlabel('Country')
    ax.legend(title='Cancelled')

# Improve layout and display figure
plt.tight_layout()
plt.show()


View total cancellations on a map

In [None]:
is_cancelled = df_deduplicated[target] == 1  # True/1 marks a cancelled booking
counts = df_deduplicated[is_cancelled]['country'].value_counts().reset_index(name='count')
counts.columns = ['Country', 'Total Cancellations']


# Create choropleth
fig = px.choropleth(
    counts,
    locations='Country',
    color='Total Cancellations',
    hover_name='Country',
    hover_data={'Total Cancellations': True, 'Country': False},
    color_continuous_scale='Viridis',
    title='Total Cancellations by Country (duplicates counted once)',
    width=1200,
    height=700
)

fig.show()

### Conclusions

The number of cancellations made by guests from Portugal is much higher than any other country. When all duplicates are counted, the number of cancellations even exceeds the number of completed bookings for Portuguese guests.

Unsurprisingly, most cancellations are from countries that also make the most bookings.

## Percentage Cancellations

**QUESTION:** Which countries have the highest percentage cancellations?

**HYPOTHESIS 1:** Local guests (from Portugal) tend to cancel more than guests from further afield.

### Analysis

Define minimum number of bookings required to be included in analysis

In [None]:
min_total_bookings = 100

Get percentage cancellations for deduplicated dataset

In [None]:
percent_deduplicated = get_percentage_cancelled(df_deduplicated, 'country', target, min_total_bookings=min_total_bookings)
percent_deduplicated.head(10)

Get percentage cancellations for dataset with all duplicates counted

In [None]:
percent_all = get_percentage_cancelled(df_all, 'country', target, min_total_bookings=min_total_bookings)
percent_all.head(10)

Plot countries with highest percentage cancellations in both datasets

In [None]:
# Get top countries in each dataset
top_countries_deduplicated = percent_deduplicated.head(top_n).index.to_list()
top_countries_all = percent_all.head(top_n).index.to_list()

# Assign common colour palette
all_countries = percent_all.index.to_list()
palette = sns.color_palette('tab20', n_colors=len(all_countries))
colour_map = dict(zip(all_countries, palette))

# Create figure with sub-plots
fig, axes = plt.subplots(ncols=2, figsize=(15, 6), sharey=True)
fig.suptitle(f'Comparison of Top {top_n} Countries for % Booking Cancellations', fontsize=16)

# Add subplots
sns.barplot(
    data=percent_deduplicated,
    x=percent_deduplicated.index,
    y='% Cancelled',
    order=top_countries_deduplicated,
    hue=percent_deduplicated.index,
    palette=colour_map,
    ax=axes[0],
)
sns.barplot(
    data=percent_all,
    x=percent_all.index,
    y='% Cancelled',
    order=top_countries_all,
    hue=percent_all.index,
    palette=colour_map,
    ax=axes[1],
)

# Set titles and labels for each sub-plot
axes[0].set_title('Duplicates Counted Once')
axes[1].set_title('All Duplicates Counted')
for ax in axes:
    ax.set_ylabel('% of Bookings Cancelled')
    ax.set_xlabel('Country')

# Improve layout and display figure
plt.tight_layout()
plt.show()

View percentage cancellations for deduplicated dataset on a map

In [None]:
# Reset index so 'country' is a column
percent_deduplicated_reset = percent_deduplicated.reset_index()

# Create choropleth
fig = px.choropleth(
    percent_deduplicated_reset,
    locations='country',
    color='% Cancelled',
    hover_name='country',
    hover_data={'country': False, '% Cancelled': True},
    color_continuous_scale='Viridis',
    title=f'Percentage Cancellations by Country (minimum {min_total_bookings} bookings, duplicates counted once)',
    width=1200,
    height=700
)

fig.show()

### Conclusions

In this analysis, only countries with a minimum of 100 total bookings were considered to avoid skewing the data.
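
A small sketch with toy data shows why the threshold matters (the country codes `XXX` and `YYY` and all values are illustrative only). A country with only two bookings, both cancelled, would top the ranking at 100% even though the figure says little about typical behaviour.

```python
import pandas as pd

# Toy data (illustrative only): 'XXX' has 2 bookings, 'YYY' has 200
toy = pd.DataFrame({
    'country': ['XXX'] * 2 + ['YYY'] * 200,
    'is_canceled': [True, True] + [True] * 60 + [False] * 140,
})

# Total bookings and percentage cancelled per country
rates = toy.groupby('country')['is_canceled'].agg(total='size', pct_cancelled='mean')
rates['pct_cancelled'] = (rates['pct_cancelled'] * 100).round(1)

# Apply a minimum-bookings threshold, as in the analysis above
filtered = rates[rates['total'] >= 100]

print(rates)
print(filtered)
```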

When all duplicate bookings are counted, Portuguese guests have the highest cancellation rate at 58.8%. However, when duplicates are counted only once, the cancellation rate for Portuguese bookings drops to 37.1%, and seven other countries exceed this figure.

## Differences by hotel

**QUESTION:** Which countries have the highest percentage cancellations for each hotel?

**HYPOTHESIS 1:** Local guests (from Portugal) tend to cancel more than guests from further afield.

### Analysis With Duplicates Counted Once

Subset for each hotel

In [None]:
df_city = df_deduplicated[df_deduplicated['hotel'] == 'City Hotel']
df_resort = df_deduplicated[df_deduplicated['hotel'] == 'Resort Hotel']

View top percentage cancellations for city hotel

In [None]:
percent_city = get_percentage_cancelled(df_city, 'country', target, min_total_bookings=min_total_bookings)
percent_city.head(10)

View top percentage cancellations for resort hotel

In [None]:
percent_resort = get_percentage_cancelled(df_resort, 'country', target, min_total_bookings=min_total_bookings)
percent_resort.head(10)

Plot countries with highest percentage cancellations in each hotel

In [None]:
# Get top countries in each dataset
top_countries_city = percent_city.head(top_n).index.to_list()
top_countries_resort = percent_resort.head(top_n).index.to_list()

# Assign common colour palette
all_countries = df_deduplicated['country'].value_counts().index.to_list()
palette = sns.color_palette('tab20', n_colors=len(all_countries))
colour_map = dict(zip(all_countries, palette))

# Create figure with sub-plots
fig, axes = plt.subplots(ncols=2, figsize=(15, 6), sharey=True)
fig.suptitle(f'Comparison of Top {top_n} Countries for % Booking Cancellations (duplicates counted once)', fontsize=16)

# Add subplots
sns.barplot(
    data=percent_city,
    x=percent_city.index,
    y='% Cancelled',
    order=top_countries_city,
    hue=percent_city.index,
    palette=colour_map,
    ax=axes[0],
)
sns.barplot(
    data=percent_resort,
    x=percent_resort.index,
    y='% Cancelled',
    order=top_countries_resort,
    hue=percent_resort.index,
    palette=colour_map,
    ax=axes[1],
)

# Set titles and labels for each sub-plot
axes[0].set_title('City Hotel')
axes[1].set_title('Resort Hotel')
for ax in axes:
    ax.set_ylabel('% of Bookings Cancelled')
    ax.set_xlabel('Country')

# Improve layout and display figure
plt.tight_layout()
plt.show()

### Analysis With All Duplicates Counted

Subset for each hotel

In [None]:
df_city = df_all[df_all['hotel'] == 'City Hotel']
df_resort = df_all[df_all['hotel'] == 'Resort Hotel']

View top percentage cancellations for city hotel

In [None]:
percent_city = get_percentage_cancelled(df_city, 'country', target, min_total_bookings=min_total_bookings)
percent_city.head(10)

View top percentage cancellations for resort hotel

In [None]:
percent_resort = get_percentage_cancelled(df_resort, 'country', target, min_total_bookings=min_total_bookings)
percent_resort.head(10)

Plot countries with highest percentage cancellations in each hotel

In [None]:
# Get top countries in each dataset
top_countries_city = percent_city.head(top_n).index.to_list()
top_countries_resort = percent_resort.head(top_n).index.to_list()

# Assign common colour palette
all_countries = df_deduplicated['country'].value_counts().index.to_list()
palette = sns.color_palette('tab20', n_colors=len(all_countries))
colour_map = dict(zip(all_countries, palette))

# Create figure with sub-plots
fig, axes = plt.subplots(ncols=2, figsize=(15, 6), sharey=True)
fig.suptitle(f'Comparison of Top {top_n} Countries for % Booking Cancellations (all duplicates counted)', fontsize=16)

# Add subplots
sns.barplot(
    data=percent_city,
    x=percent_city.index,
    y='% Cancelled',
    order=top_countries_city,
    hue=percent_city.index,
    palette=colour_map,
    ax=axes[0],
)
sns.barplot(
    data=percent_resort,
    x=percent_resort.index,
    y='% Cancelled',
    order=top_countries_resort,
    hue=percent_resort.index,
    palette=colour_map,
    ax=axes[1],
)

# Set titles and labels for each sub-plot
axes[0].set_title('City Hotel')
axes[1].set_title('Resort Hotel')
for ax in axes:
    ax.set_ylabel('% of Bookings Cancelled')
    ax.set_xlabel('Country')

# Improve layout and display figure
plt.tight_layout()
plt.show()

### Conclusions

In this analysis, only countries with a minimum of 100 total bookings were considered to avoid skewing the data.

When comparing the countries with the highest cancellation rates for each hotel, significant differences are seen in the order of the countries and the percentages of cancelled bookings. Countries listed with the highest cancellation rates in one hotel are often not listed for the other hotel. Percentage cancellations tend to be higher for the city hotel than the resort hotel.

For the resort hotel, Portugal is the country with the highest percentage cancellations, both when duplicates are counted once and when every duplicate reservation is counted.

For the city hotel, Portugal is lower down the list when duplicate bookings are counted once - nine other countries have higher percentage cancellations. When all duplicates are counted, Portugal shows the highest percentage cancellations, presumably due to the greater number of group bookings and bulk reservations that are made and then later cancelled.

## Does the Data Support Hypothesis 1?

The hypothesis states:
> *"Local guests (from Portugal) tend to cancel more than guests from further afield."*

This hypothesis is supported to some extent:
- For the resort hotel, Portugal ranks as the country with the highest percentage cancellations.
- When all duplicate reservations are counted, Portugal has the highest percentage cancellations overall (for both hotels).

However, the data does not support the hypothesis in the following ways:
- When duplicate reservations are counted once, Portuguese guests are not the most likely to cancel for the city hotel, although they remain among the top 10 countries with the highest cancellation rates.
- Many of the countries with the highest cancellation rates are very far from Portugal (such as China, Russia, South Korea and Colombia).

Overall, there is no clear linear relationship between distance from Portugal and percentage of cancellations.