# table of contents
1. [bivariate analysis](#bivariate-analysis)
2. [categories](#categories)
    1. [frequencies](#frequencies)
    2. [independence test](#independence-test)
3. [cancelations vs numerical variables](#cancelations-vs-numerical-variables)
   1. [distribution comparison](#distribution-comparison)
4. [numerical variables](#numerical-variables)
   1. [linearity test](#linearity-test) 
   2. [correlations](#correlations)
5. [hypothesis testing](#hypothesis-testing)
    1. [cancelation vs numerical variables](#cancelation-vs-numerical-variables)
6. [summary](#summary)

# bivariate analysis
This is the turning point where the analysis narrows down to a research question:
- **Can we predict canceled bookings?**

From here on, the focus is on the categorical variable **is_canceled** and its relationships with the other categorical variables and with the numerical variables.\
This notebook also analyzes the relationships among the continuous numerical variables.

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import scipy.stats as stats
from scipy.stats import chi2_contingency
from matplotlib.backends.backend_pdf import PdfPages

df = pd.read_csv("../data/02_cleaned_data.csv")
pd.set_option("display.max_columns", None)
cat = df[df.columns[df.nunique() < 5]]
num = df[df.columns[df.nunique() >= 5]]

# categories
## frequencies
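Each crosstab is normalized by column, so the percentages below are the share of canceled bookings among the bookings that belong to a category. These are the categories where that share is higher than among bookings outside the category: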
- booked_by_agent: 39%
- hotel_City: 42%
- meal_BB: 37%
- meal_FB: 60%
- market_segment_Groups: 61%
- distribution_channel_TA_TO: 41%
- reserved_room_type_A: 39%
- reserved_room_type_H: 41%
- deposit_type_Non_Refund: 99%
- customer_type_Transient: 41%

In [None]:
def crosstabs_with_highlights(cat):
    """
    Generates crosstabs for each categorical column in the DataFrame except 'is_canceled',
    highlighting the 'is_canceled' == 1 cell under category value 1 when its percentage exceeds
    the one under category value 0, and returns these crosstabs for further use or analysis.
    
    Parameters:
    cat (DataFrame): The DataFrame containing the categorical data.
    
    Returns:
    List[Styler]: A list of Pandas Styler objects, each representing a styled crosstab.
    """
    
    # Define a function to highlight the cell (1, 1) if its value is greater than the cell (1, 0)
    def highlight_max_canceled_1(s):
        if s[1] > s[0]:
            return ["", "background-color: darkred; color: white"]
        return ["", ""]
    
    # List all columns in the cat DataFrame except 'is_canceled'
    columns_to_compare = [col for col in cat.columns if col != "is_canceled"]

    # Initialize list to store crosstab Styler objects
    crosstab_results = []
    
    # Create and display crosstabs with highlighted specific values
    for idx, column in enumerate(columns_to_compare, 1):
        # Calculate the crosstab and normalize by column values to percentage
        crosstab = pd.crosstab(cat["is_canceled"], cat[column], normalize="columns") * 100
        
        # Apply styling to highlight the maximum values where 'is_canceled' == 1
        styled_crosstab = crosstab.style.apply(highlight_max_canceled_1, axis=1, subset=pd.IndexSlice[1, :]).format("{:.0f}%")
        
        # Store the styled crosstab in the list
        crosstab_results.append(styled_crosstab)
        
        # Optionally, display each styled crosstab (can comment out if not needed)
        display(styled_crosstab)
        print("-" * 40)  # Use print to properly display the divider in text

    return crosstab_results

# Usage
crosstabs_highlighted = crosstabs_with_highlights(cat)
crosstabs_highlighted
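
To make the highlighting rule concrete, here is a minimal plain-Python sketch of what a column-normalized crosstab reports for a single binary category. The `rows` data and the `canceled_share` helper are invented for illustration; they come neither from the dataset nor from pandas.

```python
from collections import Counter

# Hypothetical toy data: (is_canceled, in_category) pairs, not taken from the dataset
rows = [(1, 1), (1, 1), (0, 1), (0, 1), (0, 1),
        (1, 0), (0, 0), (0, 0), (0, 0), (0, 0)]

counts = Counter(rows)

def canceled_share(category_value):
    """Percentage of canceled bookings among rows with the given category value."""
    canceled = counts[(1, category_value)]
    total = canceled + counts[(0, category_value)]
    return 100 * canceled / total

share_in = canceled_share(1)   # bookings that belong to the category
share_out = canceled_share(0)  # bookings that do not belong to it

# Same comparison as highlight_max_canceled_1: cell (1, 1) against cell (1, 0)
highlight = share_in > share_out
print(f"in category: {share_in:.0f}%, outside: {share_out:.0f}%, highlighted: {highlight}")
```

A category can therefore be highlighted even when fewer than half of its bookings are canceled; it only needs a higher cancelation share than the bookings outside it.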

## independence test
- The **is_canceled** category was tested for independence from each of the other categories, starting with the null hypothesis:
  - H0: there is no significant association between **is_canceled** and the other category; they are independent variables,
  - with the significance level (alpha) set at 0.05.
- According to the **chi2 test**, there is a **significant association** between **is_canceled** and almost all the other categories.\
However, measured by **Cramér's V**, the majority (25/35) of associations are **weak**, 9 are moderate, and the only associations that are both significant and strong are between:
1. **is_canceled** and **deposit_type_No_Deposit**
2. **is_canceled** and **deposit_type_Non_Refund**

Time-permitting, we could eventually return to these tests, bin some numerical variables such as **lead_time** to convert them into categories, and test their association with **is_canceled**.

In [None]:
def chi2_cramersv(cat):
    """
    Performs Chi-Square tests of independence for each categorical column in the DataFrame,
    excluding 'is_canceled'. It also calculates Cramer's V for the strength of association.
    
    Parameters:
    cat (DataFrame): The DataFrame containing the categorical data.
    
    Returns:
    DataFrame: A DataFrame containing the Chi-Square test results and Cramer's V analysis.
    """
    # List all columns in the DataFrame except 'is_canceled'
    columns_to_compare = [col for col in cat.columns if col != "is_canceled"]

    # Initialize an empty list to store the results
    chi2_results = []

    # Perform a Chi-Square test of independence for each column
    for column in columns_to_compare:
        crosstab = pd.crosstab(cat["is_canceled"], cat[column])
        chi2, p, dof, expected = chi2_contingency(crosstab)
        
        # Total sample size
        n = crosstab.sum().sum()  
        
        # Calculate Cramer's V
        cramer_v = np.sqrt(chi2 / (n * (min(crosstab.shape) - 1)))
        
        # Diagnosis based on p-value
        p_diagnosis = "Significant" if p < 0.05 else "Not Significant"
        
        # Diagnosis based on Cramer's V value
        if cramer_v < 0.1:
            v_diagnosis = "Weak"
        elif cramer_v < 0.3:
            v_diagnosis = "Moderate"
        else:
            v_diagnosis = "Strong"
        
        # Append the results to the list
        chi2_results.append({
            "Variable Name": column,
            "Chi2 Value": chi2,
            "P-value": p,
            "Cramer's V": cramer_v,
            "P-value Diagnosis": p_diagnosis,
            "Cramer's V Diagnosis": v_diagnosis
        })

    # Convert the results to a DataFrame for nicer display
    return pd.DataFrame(chi2_results)

# Use:
chi2_cramersv_results = chi2_cramersv(cat)
chi2_cramersv_results
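
As a sanity check on the formula used in `chi2_cramersv`, V = sqrt(chi2 / (n * (k - 1))) with k the smaller table dimension, the sketch below computes chi2 and Cramér's V by hand for a made-up 2x2 table. It is a plain-Python illustration, not the notebook's method: the `cramers_v` helper and the table are hypothetical, and the sketch omits the continuity correction that `chi2_contingency` applies by default to 2x2 tables, so the function above can return slightly lower values.

```python
import math

def cramers_v(table):
    """Return (chi2, Cramér's V) for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)

    # Pearson chi-square: sum of (observed - expected)^2 / expected over all cells
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected

    k = min(len(table), len(table[0]))
    return chi2, math.sqrt(chi2 / (n * (k - 1)))

# Hypothetical counts: rows are is_canceled (0, 1), columns are a binary category (0, 1)
chi2, v = cramers_v([[30, 10], [10, 30]])
print(f"chi2 = {chi2:.1f}, Cramér's V = {v:.2f}")
```

With the thresholds used above (below 0.1 weak, below 0.3 moderate, otherwise strong), this toy table would be labeled as a strong association.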

# cancelations vs numerical variables
## distribution comparison
- **Boxplots show that canceled bookings have longer lead times**; the further ahead someone makes a booking, the more likely they are to cancel it.
- Variable **previous_cancelations** has a noticeable number of outliers for canceled bookings, indicating that guests who have previously canceled are more likely to cancel again.
- Variable **previous_bookings_not_canceled** has a wider spread and higher outliers, suggesting that guests with a history of not canceling tend to continue not canceling their bookings.
- No other factors seem to be predictive of whether a booking is likely to be canceled.

In [None]:
def generate_box_plots(num, is_canceled):
    """
    Generates box plots for each numerical column in the DataFrame, comparing distributions 
    between the 'is_canceled' categories, saves the plots to a PDF file, and returns the list of box plot figures.
    
    Parameters:
    num (DataFrame): The DataFrame containing the numerical data to plot.
    is_canceled (Series): A Pandas Series indicating whether each row in 'num' was canceled.
    
    Returns:
    list: A list of matplotlib figure objects for each box plot created during the plotting process.
    """
    # Define the path to save the PDF in the 'plots' folder
    pdf_path = "../plots/boxplots.pdf"
    
    # Initialize a list to store the box plot figure objects
    boxplot_figures = []

    # Loop through each numerical column in 'num' and create a box plot compared to 'is_canceled'
    with PdfPages(pdf_path) as pdf:
        for column in num.columns:
            fig, ax = plt.subplots(figsize=(10, 6))
            sns.boxplot(x=is_canceled, y=num[column], ax=ax)
            ax.set_title(f"Box Plot of {column} by is_canceled")
            ax.set_xlabel("is_canceled")  # Adding labels for clarity
            ax.set_ylabel(column)        # Adding labels for clarity
            pdf.savefig(fig)  # Save the current figure into the pdf
            boxplot_figures.append(fig)  # Append the figure object to the list
    
    print("All box plots have been successfully saved to the PDF.")
    
    return boxplot_figures

# Usage
boxplot_figures = generate_box_plots(num, df["is_canceled"]);
boxplot_figures
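
The outliers discussed above are the points a boxplot draws beyond its whiskers. The sketch below illustrates that rule, assuming the common default of whiskers reaching at most 1.5 times the interquartile range (IQR) past the quartiles. The `lead_times` values are invented, and the quartiles come from the standard library rather than from seaborn, so the exact fences may differ slightly from the plotted ones.

```python
import statistics

# Hypothetical lead_time values in days, invented for illustration
lead_times = [0, 3, 7, 10, 14, 21, 30, 45, 60, 365]

# method="inclusive" interpolates linearly between the sorted data points
q1, _, q3 = statistics.quantiles(lead_times, n=4, method="inclusive")
iqr = q3 - q1

# Points outside these fences are drawn individually as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [v for v in lead_times if v < lower_fence or v > upper_fence]

print(f"Q1 = {q1}, Q3 = {q3}, fences = ({lower_fence}, {upper_fence}), outliers = {outliers}")
```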

# numerical variables

## linearity test
A **visual check for linearity** between every possible pair of continuous variables was done by creating individual scatterplots.\
To avoid memory overload from the high number of plots (91 in this case), it is recommended not to display them.\
If you run the code below, once the message "All plots have been successfully saved to the PDF." appears, you can find the scatterplots saved in a single .pdf file titled **scatter_plots**, located in the folder titled **plots**.

The scatterplots show **no relationships** among continuous variables.

In [None]:
def generate_scatter_plots(data, df):
    """
    Generates scatter plots for each pair of numerical columns in the DataFrame, coloring by the 'is_canceled' status. 
    Saves the plots to a PDF file. This function is useful for examining relationships between pairs of features with respect to cancellation status.

    Parameters:
    data (DataFrame): The DataFrame containing the numerical data to plot.
    df (DataFrame): The main DataFrame that includes the 'is_canceled' column used for coloring the points in scatter plots.
    """
    # Define the path to save the PDF in the 'plots' folder
    pdf_path = "../plots/scatter_plots.pdf"
    # Set verbosity level
    verbose = 1

    # Total number of plots
    total_plots = len(data.columns) * (len(data.columns) - 1) // 2
    current_plot = 0

    # Iterate over each pair of columns to create a scatter plot
    with PdfPages(pdf_path) as pdf:
        for i in range(len(data.columns)):
            for j in range(i + 1, len(data.columns)):
                plt.figure(figsize=(8, 6))
                sns.scatterplot(x=data[data.columns[i]], y=data[data.columns[j]], hue=df["is_canceled"])
                plt.title(f"Scatter Plot of {data.columns[i]} vs {data.columns[j]}")
                plt.xlabel(data.columns[i])
                plt.ylabel(data.columns[j])
                plt.legend(title="is_canceled")
                pdf.savefig()  # Save the current figure into the pdf
                plt.close()  # Close the figure to free up memory
                
                # Increment and print the progress if verbose and current_plot is a multiple of 5
                current_plot += 1
                if verbose and current_plot % 5 == 0:
                    print(f"Completed {current_plot} of {total_plots} plots.")

        if verbose:
            print("All plots have been successfully saved to the PDF.")

# Usage
generate_scatter_plots(num, df)
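
The count of 91 plots follows from the number of unordered column pairs, which is what the nested loop in `generate_scatter_plots` iterates over. A minimal sketch, assuming 14 continuous variables (as the counts in the summary suggest) and using placeholder column names:

```python
from itertools import combinations

# Placeholder column names; the notebook itself uses the columns of `num`
columns = [f"var_{i}" for i in range(1, 15)]

# Every unordered pair once, equivalent to the i < j nested loop
pairs = list(combinations(columns, 2))

# Same formula as total_plots in generate_scatter_plots
expected_total = len(columns) * (len(columns) - 1) // 2
print(len(pairs), expected_total)
```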

## correlations
Due to the lack of normality and linearity, the choice is either to transform the data or to proceed with non-parametric tests.\
Given the time constraint, the current goal is to deliver a minimum viable product/project, so we'll first work with the data as-is, **with non-parametric tests**.\
Time-permitting, we could eventually return to this point, transform data, and run parametric tests.

The **correlation matrix and correlation heatmap** confirm what the scatterplots showed: very weak relationships between the continuous numerical variables.\
The only strong correlation is between arrival_date_week_number and arrival_date_month.\
However, this is an expected, logical relationship because months are composed of weeks.

In [None]:
# Calculate correlation matrix for numerical continuous columns
correlation_matrix_num = num.corr(method="spearman")

# Extract upper triangle of correlation matrix (excluding the diagonal)
upper_triangle = correlation_matrix_num.where(np.triu(np.ones(correlation_matrix_num.shape), k=1).astype(bool))

# Stack the upper triangle to get a Series of correlation values
correlation_values_num = upper_triangle.stack().sort_values(ascending=False)

# Convert correlation values to a DataFrame
correlation_df_num = correlation_values_num.reset_index()
correlation_df_num.columns = ["category_1", "category_2", "correlation"]

# Display the DataFrame
display(correlation_df_num.head(), correlation_df_num.tail())

In [None]:
# Create a heatmap for numerical continuous columns
plt.figure(figsize=(18, 15))
sns.heatmap(correlation_matrix_num, annot=True, annot_kws={"size": 10}, cmap="coolwarm")
plt.title("correlation heatmap for numerical continuous columns")
plt.show()
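
The correlation matrix above uses `method="spearman"`, which is the Pearson correlation computed on ranks. That is why it suits data that is monotonic but not linear. The plain-Python sketch below illustrates the idea on made-up values; the helper functions are hypothetical and only approximate what the library does internally.

```python
import math

def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Made-up monotonic but non-linear relationship
x = [1, 2, 3, 4, 5]
y = [v ** 3 for v in x]

rho = spearman(x, y)
r = pearson(x, y)
print(f"spearman = {rho:.3f}, pearson = {r:.3f}")
```

On this toy data Spearman reports a perfect monotonic relationship while Pearson stays below 1, because the relationship is not a straight line.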

# hypothesis testing

## cancelation vs numerical variables
The **Mann-Whitney U test** was used to compare the distribution of each continuous numerical variable between two independent samples: bookings categorized as **not canceled** and **canceled**.\
All the resulting p-values are below the threshold of 0.05, **indicating statistically significant differences between the two groups for all tested variables**.

In [None]:
def mann_whitney_u_tests(df, numerical_columns, target_column="is_canceled"):
    """
    Performs the Mann-Whitney U test for each numerical column in the DataFrame against a categorical target column.
    
    Parameters:
    df (DataFrame): The DataFrame containing the data.
    numerical_columns (list): List of column names of numerical data to test.
    target_column (str): The name of the binary target column to test against.
    
    Returns:
    a DataFrame with Mann-Whitney U test results including U statistic and p-values.
    """
    # Dictionary to store test statistics and p-values
    results = {}
    
    # Perform Mann-Whitney U test for each numerical variable against the categorical variable
    for column in numerical_columns:
        if column != target_column:  # Ensure we're not testing the target against itself
            # Split the data into two groups based on the categorical variable
            group1 = df[df[target_column] == 0][column]
            group2 = df[df[target_column] == 1][column]
            
            # Perform the test
            u_stat, p_value = stats.mannwhitneyu(group1, group2, alternative="two-sided")
            
            # Store the U statistic and p-value in the dictionary
            results[column] = {"U_stat": u_stat, "p_value": p_value}
    
    # Convert the results to a DataFrame for nicer display
    mann_whitney_results = pd.DataFrame.from_dict(results, orient="index")
    
    return mann_whitney_results

# Use:
mann_whitney_results = mann_whitney_u_tests(df, list(num.columns))
mann_whitney_results
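
For intuition about the U statistic reported above, the sketch below computes it by direct pair counting on made-up lead times: U for the first sample is the number of cross-group pairs in which its value is larger, with ties counted as one half. The `mann_whitney_u` helper and both samples are hypothetical, and the sketch returns no p-value. Dividing U by the number of pairs gives a rough sense of how large the difference is, which a p-value alone does not show on a large dataset.

```python
def mann_whitney_u(sample_a, sample_b):
    """U for sample_a: count of (a, b) pairs with a > b, ties counted as 0.5."""
    u = 0.0
    for a in sample_a:
        for b in sample_b:
            if a > b:
                u += 1.0
            elif a == b:
                u += 0.5
    return u

# Hypothetical lead times in days, invented for illustration
not_canceled = [2, 5, 7, 10]
canceled = [8, 30, 45, 60, 90]

u = mann_whitney_u(not_canceled, canceled)

# Share of pairs in which the non-canceled booking has the longer lead time
share = u / (len(not_canceled) * len(canceled))
print(f"U = {u}, share of pairs = {share:.2f}")
```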

# summary
- univariate analysis:
  - numerical variables:
    - **normality, visual check** (histogram): 11/14 variables show **non-normal distribution**, right-skewed, most data concentrated on the lower end
    - **normality test** (shapiro-wilk, kolmogorov-smirnov): **non-normal distribution** (all the p-values < 0.05)
    - **shape of distribution**:
      - skewness: 8/14 variables exhibit **highly asymmetric data**, 6/14 moderately asymmetric
      - kurtosis: 9/14 variables have **heavy tails or outliers**, 5/14 light tails or lack of outliers.

---
- bivariate analysis:
  - categorical variables:
    - **independence test** (chi2): **significant association between is_canceled and almost all the other categories**
    - **strength of association** (cramer's v): weak for majority (25/35) of categories, moderate for 9/35.
  - category (is_canceled) vs numerical variables:
    - **distribution comparison** (boxplot):
      - canceled bookings have longer lead time
      - previous cancelations indicate future cancelations
      - absence of previous cancelations indicates future adherence to the booking
  - numerical variables:
    - **linearity** (scatterplot): **no linear relationship** among continuous variables
    - **correlation**: **very weak relationships**.

---
- hypothesis testing:
  - **statistically significant differences** between canceled and non-canceled bookings for all numerical variables.

Next: notebook_04_machine_learning_01_intro