# Introduction
In Part I, we collected our data and performed a little data cleaning. We then exported the cleaned DataFrames as CSVs.

In this Part, here's what you will do:
<ol><li>Read the cleaned CSVs into separate DataFrames</li><li>Perform univariate analysis for H1 and H2</li><li>Perform bivariate analysis for H1 and H2</li></ol>

<strong>Note: we'll analyze the two CSVs separately first, before comparing them in the next Part. This Part is also fairly long because we'll be analyzing two different datasets, so take your time to understand the visualizations that you make.</strong>

<strong>Note 2: keep the publication handy, e.g., in a different tab or browser, so you can refer to the definitions of the columns.</strong>

Useful readings on visualization: 
<a href = "https://towardsdatascience.com/introduction-to-data-visualization-in-python-89a54c97fbed">Introduction to Data Visualization in Python</a> (run it in Incognito Mode if you face the paywall)

We highly recommend this reading if you haven't done much visualization before.

### Step 1: Import the following libraries
- pandas as pd
- matplotlib.pyplot as plt
- seaborn as sns
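
As a minimal sketch, the import cell with the aliases listed above could look like this:

```python
# Conventional aliases used throughout this notebook
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
```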

In [None]:
# Step 1: Import the libraries

### Step 2: Read the CSVs from Part I as DataFrames
Let's read the CSVs as DataFrames. We'll perform EDA for both of these datasets.
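
The sketch below shows the general shape of this step. Since the exact location of your exported files depends on what you did in Part I, it reads a tiny in-memory CSV instead; the rows and the variable name `h1_df` are made up for illustration.

```python
import io
import pandas as pd

# Stand-in for the cleaned file from Part I (hypothetical rows, not real data).
# In the notebook you would pass the file path to pd.read_csv instead.
sample_csv = io.StringIO(
    "IsCanceled,LeadTime,ArrivalDateYear\n"
    "0,342,2015\n"
    "1,85,2015\n"
    "0,7,2016\n"
)

h1_df = pd.read_csv(sample_csv)

# Quick sanity checks after loading: row/column counts and column types
print(h1_df.shape)
print(h1_df.dtypes)
```

Checking the shape and dtypes right after loading is a cheap way to confirm that the export from Part I round-tripped as expected.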

In [None]:
# Step 2a: Read the cleaned H1.csv into a DataFrame

In [None]:
# Step 2b: Read the cleaned H2.csv into a DataFrame

# H1.csv
## Univariate analysis (UA)
In this section, we'll first work with H1.csv and examine the individual features through visualization.

Univariate analysis (UA) analyzes features on their own through plots such as histograms and countplots so that we can understand the distribution of the data and identify possible outliers or errors.

Do refer to the research publication mentioned in Part I so that you can appreciate features better.

### Step 3: Perform UA on IsCanceled with a countplot
First, we want to see the proportion of cancelled bookings vs non-cancelled bookings. To do that, we can use a countplot from the seaborn library.

What do you notice about the cancellations for H1?
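
Here is a rough sketch of what a countplot call looks like, using a small made-up DataFrame (`h1_df` is a hypothetical stand-in for your own). The proportions printed at the end are an optional extra, not part of the step.

```python
import pandas as pd
import seaborn as sns

# Hypothetical stand-in for the H1 DataFrame (made-up values)
h1_df = pd.DataFrame({"IsCanceled": [0, 0, 0, 1, 0, 1, 0, 0]})

# One bar per class; bar height is the number of rows in that class
ax = sns.countplot(data=h1_df, x="IsCanceled")
ax.set_title("Bookings by cancellation status")

# The same information as proportions, useful for judging class imbalance
proportions = h1_df["IsCanceled"].value_counts(normalize=True)
print(proportions)
```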

In [None]:
# Step 3: Plot a countplot using IsCanceled

### Step 4: Perform UA on LeadTime with a histogram
Let's take a look at the LeadTime with a histogram.

In [None]:
# Step 4: Plot a histogram using LeadTime

### Step 5: Perform UA on ArrivalDateYear with a countplot
We'll take a look at the proportions of the ArrivalDateYear with a countplot, taking note of how many rows belong to which year. 

In [None]:
# Step 5: Plot a countplot using ArrivalDateYear

### Step 6: Perform UA on ArrivalDateMonth with a countplot
Next, we take a look at the ArrivalDateMonth to see what the breakdown of the bookings is.

Before you plot, hypothesize what kind of pattern you'd see!

<details>
    <summary><font color = 'green'>Click here once for a hint if your plot looks weird</font></summary>
    <div>
        <strong>Google "ordering axis of seaborn.countplot"</strong>
    </div>
    <div>
        <strong>Expanding the size of the plot helps too.</strong>
    </div>
</details>
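
If you followed the hints, one way to apply them is sketched below. It assumes the months are stored as full English names; the DataFrame is a made-up stand-in and the figure size is just a starting point.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical stand-in; assumes months are stored as full English names
h1_df = pd.DataFrame(
    {"ArrivalDateMonth": ["August", "July", "January", "August", "March", "July", "August"]}
)

month_order = [
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
]

# A wider figure keeps the twelve labels from overlapping
fig, ax = plt.subplots(figsize=(12, 4))

# order= fixes the left-to-right sequence of the categories
sns.countplot(data=h1_df, x="ArrivalDateMonth", order=month_order, ax=ax)
ax.tick_params(axis="x", labelrotation=45)
```

Without `order=`, the bars follow the order in which the categories appear in the data, which is why the plot can look jumbled.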

In [None]:
# Step 6: Plot a countplot using ArrivalDateMonth

### Step 7: Perform UA on ArrivalDateWeekNumber with a countplot
Similarly, we want to see if there's a pattern in the ArrivalDateWeekNumber as well. 

It should look similar, since it's a more granular form of ArrivalDateMonth. Let's take a look with a countplot.

In [None]:
# Step 7: Plot a countplot using ArrivalDateWeekNumber

### Step 8: Perform UA on Adults with a countplot/histogram
Let's take a look at the number of adults who book hotel rooms with a countplot. You can also try to use a histogram as well.
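
Alongside the plot, a numeric summary can help you spot odd values. This is a small sketch on made-up numbers (`h1_df` is a hypothetical stand-in):

```python
import pandas as pd

# Hypothetical stand-in for the Adults column (made-up values)
h1_df = pd.DataFrame({"Adults": [2, 2, 1, 2, 3, 2, 1, 2, 0, 2]})

# Count, mean, standard deviation, min, quartiles and max in one call
summary = h1_df["Adults"].describe()
print(summary)

# Exact counts per value complement the plot when a few values dominate
print(h1_df["Adults"].value_counts().sort_index())
```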

In [None]:
# Step 8: Plot a countplot/histogram using Adults

In [None]:
# Optional: Use the .describe method to get a summary statistic on 'Adults'

### Step 9: Perform UA on Children with a countplot/histogram
Similarly, let's take a look at the counts of children with a countplot to see what is the most common number of children brought by would-be customers.

In [None]:
# Step 9: Plot a countplot/histogram using Children

### Step 10: Perform UA on Babies with a countplot/histogram

In [None]:
# Step 10: Plot a countplot using Babies

### Step 11: Perform UA on Meal with a countplot
Meal records the type of meal booked. We should anticipate four categories, so let's use a countplot to visualize them.

In [None]:
# Step 11: Plot a countplot using Meal

### Step 12: Perform UA on Country with value counts
Retrieve the 'Country' column data, and we'll use the value_counts method to summarize the counts of the categorical values in order.

![CountryValueCounts.png](attachment:CountryValueCounts.png)

It's going to be a long list, so <strong>slice</strong> the first 20 items to get the top 20 countries present. 

Where do most of the visitors come from?

<details>
    <summary><font color = 'green'>Click here once for a hint</font></summary>
    <div>
        <strong>Google "value counts pandas"</strong>
    </div>
</details>
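
As a sketch of the idea, here is `value_counts` followed by a slice on a small made-up column (the country codes and `h1_df` are hypothetical):

```python
import pandas as pd

# Hypothetical stand-in; the rows here are made up for illustration
h1_df = pd.DataFrame(
    {"Country": ["PRT", "GBR", "PRT", "ESP", "PRT", "GBR", "IRL", "PRT"]}
)

# value_counts sorts from most to least frequent by default
country_counts = h1_df["Country"].value_counts()

# Slicing the first 20 entries keeps only the most frequent countries
top_20 = country_counts.iloc[:20]
print(top_20)
```

With fewer than 20 distinct values, as in this toy example, the slice simply returns everything.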

In [None]:
# Step 12: Use a value_counts on Country

### Step 13: Perform UA on MarketSegment with a countplot
We want to also see what kinds of people book rooms in the dataset. For this, we'll use a countplot on MarketSegment. 

In [None]:
# Step 13: Plot a countplot using MarketSegment

### Step 14: Perform UA on DistributionChannel with a countplot
Where <em>are</em> all these customers coming from? We can find out by examining the DistributionChannel with a countplot as well.

In [None]:
# Step 14: Plot a countplot using DistributionChannel

### Step 15: Perform UA on IsRepeatedGuest with a countplot
How prevalent are repeat customers in the dataset? We can find that answer in IsRepeatedGuest.

In [None]:
# Step 15: Plot a countplot using IsRepeatedGuest

### Step 16: Perform UA on PreviousCancellations with a countplot
This feature records the number of previous bookings that the customer cancelled prior to the current booking.

In [None]:
# Step 16: Plot a countplot using PreviousCancellations

### Step 17: Perform UA on PreviousBookingsNotCanceled with a countplot
This is kind of the opposite of Step 16, i.e. "Number of previous bookings not cancelled by the customer prior to the current booking".

In [None]:
# Step 17: Plot a countplot using PreviousBookingsNotCanceled

### Step 18: Perform UA on ReservedRoomType with a countplot
We look at the kinds of rooms reserved. 

As mentioned in the publication, the room types are anonymized but it'd be good to examine the distribution of the kinds of rooms nonetheless. 

In [None]:
# Step 18: Plot a countplot using ReservedRoomType

### Step 19: Perform UA on AssignedRoomType with a countplot
The rooms that eventually get assigned to customers may be different from the ReservedRoomType. 

Let's see if the distribution is similar to the one you see in Step 18.

In [None]:
# Step 19: Plot a countplot using AssignedRoomType

### Step 20: Perform UA on BookingChanges with a countplot/histogram
How many changes do customers make to their bookings? We can check this with a countplot/histogram on BookingChanges.

In [None]:
# Step 20: Plot a histogram using BookingChanges

### Step 21: Perform UA on DepositType with a countplot
How are the customers paying? We can find out by looking at the DepositType feature with a countplot.

In [None]:
# Step 21: Plot a countplot using DepositType

### Step 22: Get the top 20 agents with value counts
Some of the bookings are made through agents - let's see who the top 20 agents are.

You may see something interesting - what do you think it is?

<details>
    <summary><font color = 'green'>Click here once for a clue</font></summary>
    <div>
        <strong>What do you see at the second highest "Agent"?</strong>
    </div>
</details>

In [None]:
# Step 22: Take the top 20 from the value count

### Step 23: Perform UA on DaysInWaitingList with a histogram
What's the distribution of the number of days someone waits? Plot a histogram to find out! 

In [None]:
# Step 23: Plot a histogram using DaysInWaitingList

### Step 24: Perform UA on CustomerType with a countplot
What kinds of people are making the bookings? We can find out with a countplot on CustomerType.

In [None]:
# Step 24: Plot a countplot using CustomerType

### Step 25: Perform UA on ADR with a histogram
ADR, or average daily rate, is defined as a metric that is "calculated by dividing the sum of all lodging transactions by the total number of staying nights".

It's used commonly as a KPI for hotel performance. 

In [None]:
# Step 25: Plot a histogram using ADR

### Step 26: Perform UA on RequiredCarParkingSpaces with a countplot
This is a feature that indicates the number of parking spaces the customer needs. 

You can either use a countplot or go with a value count.

In [None]:
# Step 26: Plot a countplot using RequiredCarParkingSpaces

### Step 27: Perform UA on TotalOfSpecialRequests with a countplot
Are our customers okay, or fussy as hell? Find out with a countplot on TotalOfSpecialRequests. 

In [None]:
# Step 27: Plot a countplot using TotalOfSpecialRequests

### End of UA for H1.csv
We have examined quite a lot of columns! It's a long process, so well done for persevering.

It is completely necessary because it'd be hard to appreciate and understand your data otherwise. This is also helpful when/if we need to do feature engineering later on to derive, subtract, or modify features. 

## Bivariate analysis
Now that we're done with univariate analysis, let's come up with a few hypotheses on whether there's a correlation between certain features and cancellations.

### Step 28: Perform BA on LeadTime vs IsCanceled with a boxplot
<blockquote>Is a long lead time correlated to cancellation?</blockquote>

![LeadTimeBoxplot.png](attachment:LeadTimeBoxplot.png)

To test this hypothesis, we can plot a boxplot and visually inspect if there's a qualitative difference between the lead times of cancelled and normal bookings. 
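
A rough sketch of such a boxplot is shown below on made-up numbers (`h1_df` is a hypothetical stand-in). The group medians at the end are an optional numeric companion to the plot.

```python
import pandas as pd
import seaborn as sns

# Hypothetical stand-in for the H1 DataFrame (made-up values)
h1_df = pd.DataFrame({
    "IsCanceled": [0, 0, 0, 0, 1, 1, 1, 1],
    "LeadTime": [3, 20, 45, 60, 40, 120, 150, 300],
})

# One box per class: compare medians, interquartile ranges and outliers
ax = sns.boxplot(data=h1_df, x="IsCanceled", y="LeadTime")

# Group medians give a numeric companion to the visual comparison
medians = h1_df.groupby("IsCanceled")["LeadTime"].median()
print(medians)
```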

In [None]:
# Step 28: Plot a boxplot using LeadTime vs IsCanceled

### [Optional] Perform a Mann-Whitney U test 
Suppose the boxplot shows a visible difference in lead time between IsCanceled values of 0 and 1. How do we know whether that difference is statistically significant?

From Step 4, we have seen that the values for LeadTime are not normally distributed, so the normality assumption behind parametric tests such as t-tests is hard to justify.

As such, we'll use a Mann-Whitney U test. You can import the stats module from the scipy library.

Conduct this test on two lists: the first contains the lead times for when IsCanceled is 0, and the second contains the lead times for when IsCanceled is 1.

<details>
    <summary><font color = 'green'>Click here once for steps you'll need</font></summary>
    <div>
        <ol>
            <li>Filter DataFrame based on IsCanceled == 0</li>
            <li>Get the values in LeadTime and save them in a variable</li>
            <li>Filter DataFrame based on IsCanceled == 1</li>
            <li>Get the values in LeadTime and save them in another variable</li>
            <li>Import stats from scipy</li>
            <li>Use the .mannwhitneyu function, using the variables from steps 2 and 4</li>
            <li>Reject the null hypothesis that both samples come from the same distribution if p < 0.05</li>
        </ol>
    </div>
</details>
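
The steps in the hint can be sketched as follows. The data is randomly generated so that cancelled bookings have longer lead times; the variable names, the two-sided alternative, and the 0.05 threshold written as `alpha` are choices made for this illustration.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical stand-in data: skewed lead times, longer for cancelled bookings
rng = np.random.default_rng(42)
h1_df = pd.DataFrame({
    "IsCanceled": [0] * 300 + [1] * 300,
    "LeadTime": np.concatenate([
        rng.exponential(scale=30, size=300),
        rng.exponential(scale=120, size=300),
    ]),
})

# Steps 1-4: split LeadTime by cancellation status
lead_not_canceled = h1_df.loc[h1_df["IsCanceled"] == 0, "LeadTime"]
lead_canceled = h1_df.loc[h1_df["IsCanceled"] == 1, "LeadTime"]

# Steps 5-6: two-sided test of whether the two samples share a distribution
result = stats.mannwhitneyu(lead_not_canceled, lead_canceled, alternative="two-sided")
print(f"U = {result.statistic:.1f}, p = {result.pvalue:.3g}")

# Step 7: compare the p-value against the 0.05 threshold
alpha = 0.05
reject_null = result.pvalue < alpha
print("Reject null hypothesis:", reject_null)
```

A small p-value only tells you that the two groups differ; look back at the boxplot to judge how large the difference is in practice.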

In [None]:
# [Optional] Conduct a statistical test of difference using Mann-Whitney U test

### Step 29: Perform BA on DaysInWaitingList vs IsCanceled with a boxplot
<blockquote>What happens when you keep people waiting in line for their booking?</blockquote>

We can examine this by plotting a boxplot between DaysInWaitingList and IsCanceled.

In [None]:
# Step 29: Plot a boxplot using DaysInWaitingList vs IsCanceled

### [Optional] Conduct a Mann-Whitney U test
If you plot the boxplot, you'll see that it's hard to qualitatively tell the two boxes apart.

Conduct a Mann-Whitney U test instead and see if there's a difference.

In [None]:
# Optional: Conduct a Mann-Whitney U test

### Try other hypotheses
Be creative! See if there are any other relationships you can unearth between the features, not just between one feature and IsCanceled.
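
As one possible starting point, when both features are categorical you could compare proportions with a cross-tabulation. This is only a sketch: the pairing of DepositType with IsCanceled is an arbitrary example, and the values are made up.

```python
import pandas as pd

# Hypothetical stand-in; deposit categories and values are made up
h1_df = pd.DataFrame({
    "DepositType": ["No Deposit", "No Deposit", "Non Refund", "No Deposit",
                    "Non Refund", "No Deposit"],
    "IsCanceled": [0, 1, 1, 0, 1, 0],
})

# normalize="index" turns each row into proportions that sum to 1
rates = pd.crosstab(h1_df["DepositType"], h1_df["IsCanceled"], normalize="index")
print(rates)
```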

In [None]:
# Try other hypotheses here

# H2.csv
What a long session! Now that we're done for H1.csv, it's time for H2.csv. Don't groan, this is useful later on!

You're halfway there. Yes, there are a lot of visualizations ahead, but you can do it!

### Step 30: Repeat Steps 3 - 29 with H2.csv data
We don't like to bore you with the same instructions again and again, but by now you should know the drill. 

We will repeat Steps 3 to 29, and examine the same distributions that we did for H1.csv.
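
If retyping every plot feels tedious, a small helper can cut down the repetition. The function below is a hypothetical convenience, not something the project requires, and the DataFrame is a made-up stand-in for your H2 data.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

def plot_countplots(df, columns):
    """Draw one countplot per column, stacked vertically (hypothetical helper)."""
    fig, axes = plt.subplots(len(columns), 1, figsize=(8, 3 * len(columns)), squeeze=False)
    for ax, column in zip(axes[:, 0], columns):
        sns.countplot(data=df, x=column, ax=ax)
        ax.set_title(column)
    fig.tight_layout()
    return fig

# Hypothetical stand-in for the H2 DataFrame (made-up values)
h2_df = pd.DataFrame({
    "IsCanceled": [0, 1, 0, 0, 1],
    "Meal": ["BB", "BB", "HB", "SC", "BB"],
})

fig = plot_countplots(h2_df, ["IsCanceled", "Meal"])
```

It is still worth looking at each plot individually, since the point of this step is to notice how H2 differs from H1.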

In [None]:
# Step 30: Repeat Steps 3 - 29

In [None]:
# Step 30: Repeat Steps 3 - 29

In [None]:
# Step 30: Repeat Steps 3 - 29

In [None]:
# Step 30: Repeat Steps 3 - 29

In [None]:
# Step 30: Repeat Steps 3 - 29

### End of Part II
Well done for getting this far! It was long, but as you did the UA and BA for H2.csv data, we hope that you noticed the difference in the distributions. 

This is important because in Part III, we will directly compare some of the features between the two datasets, along with statistical tests.

Take a break, you deserve it!