## EDA: Hotel booking demand

https://www.kaggle.com/jessemostipak/hotel-booking-demand

A Kaggle dataset describing two hotels in Portugal: a city hotel (Lisbon) and a resort hotel (Algarve region).

The bookings cover the period from 1 July 2015 to 31 August 2017.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

# silence pandas' SettingWithCopyWarning for the column assignments below
pd.options.mode.chained_assignment = None

In [2]:
df = pd.read_csv("hotel_bookings.csv")

In [None]:
df

In [None]:
df.describe()

How many rows are there, what are the columns, and where are values missing?

In [None]:
df.info()
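
`df.info()` reports non-null counts, so the gaps have to be read off by subtraction. A more direct view is to count missing values per column. The sketch below uses a small made-up frame (the values and the name `toy` are assumptions for illustration, not taken from the dataset); the same two lines should apply to `df`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the bookings frame (values are made up for illustration).
toy = pd.DataFrame({
    "children": [0.0, np.nan, 2.0, 0.0],
    "country": ["PRT", None, "GBR", "PRT"],
    "agent": [9.0, np.nan, np.nan, 240.0],
    "adr": [75.0, 98.5, 120.0, 60.0],
})

# Count missing values per column and keep only the columns that have any.
missing = toy.isna().sum()
missing = missing[missing > 0].sort_values(ascending=False)
print(missing)
```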

<table>
<thead>
  <tr>
    <th>Variable </th>
    <th> Description </th>
  </tr>
</thead>
<tbody>
  <tr>
    <td> adr </td>
    <td> Average Daily Rate, calculated by dividing the sum of all lodging transactions by the total number of staying nights </td>
  </tr>
  <tr>
    <td> adults </td>
    <td> Number of adults </td>
  </tr>
  <tr>
    <td> agent </td>
    <td> ID of the travel agency that made the booking </td>
  </tr>
  <tr>
    <td> arrival_date_week_number </td>
    <td> Week number of the arrival date </td>
  </tr>
  <tr>
    <td> booking_changes </td>
    <td> Number of changes/amendments made to the booking from the moment the booking was entered on the Property Management System (PMS) until the moment of check-in or cancellation </td>
  </tr>
  <tr>
    <td> country </td>
    <td> Country of origin </td>
  </tr>
  <tr>
    <td> customer_type </td>
    <td> Type of booking, assuming one of four categories: Contract - when the booking has an allotment or other type of contract associated to it; Group - when the booking is associated to a group; Transient - when the booking is not part of a group or contract, and is not associated to other transient booking; Transient-party - when the booking is transient but is associated to at least another transient booking </td>
  </tr>
  <tr>
    <td> hotel </td>
    <td> Type of hotel </td>
  </tr>
  <tr>
    <td> lead_time </td>
    <td> Number of days that elapsed between the entering date of the booking into the PMS and the arrival date </td>
  </tr>
  <tr>
    <td> market_segment </td>
    <td> Market segment designation. In categories, the term "TA" means "Travel Agents" and "TO" means "Tour Operators" </td>
  </tr>
  <tr>
    <td> previous_bookings_not_canceled </td>
    <td> Number of previous bookings not canceled by the customer prior to the current booking </td>
  </tr>
  <tr>
    <td> previous_cancellations </td>
    <td> Number of previous bookings that were canceled by the customer prior to the current booking </td>
  </tr>
  <tr>
    <td> required_car_parking_spaces </td>
    <td> Number of car parking spaces required by the customer </td>
  </tr>
  <tr>
    <td> reserved_room_type </td>
    <td> Code of room type reserved. Code is presented instead of designation for anonymity reasons </td>
  </tr>
  <tr>
    <td> stays_in_week_nights </td>
    <td> Number of week nights (Monday to Friday) the guest stayed or booked to stay at the hotel </td>
  </tr>
  <tr>
    <td> stays_in_weekend_nights </td>
    <td> Number of weekend nights (Saturday or Sunday) the guest stayed or booked to stay at the hotel </td>
  </tr>
  <tr>
    <td> total_of_special_requests </td>
    <td> Number of special requests made by the customer (e.g. twin bed or high floor) </td>
  </tr>
</tbody>
</table>

In [7]:
# keys must match column names exactly, otherwise the column is left unfilled
nan_replacements = {"children": 0, "country": "Unknown", "agent": 0, "company": 0}
hotels_df = df.fillna(nan_replacements)
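
When `fillna` receives a dict, it fills each column named by a key with the corresponding value, and keys that do not match any column are skipped without an error. It is therefore worth checking afterwards that no gaps remain. A minimal sketch on made-up data (the frame `toy` and the misspelled key `"childrenn"` are hypothetical, used only to show the effect):

```python
import numpy as np
import pandas as pd

# Toy frame with gaps in two columns (made-up values).
toy = pd.DataFrame({
    "children": [0.0, np.nan, 2.0],
    "country": ["PRT", None, "GBR"],
})

# Correct keys: every gap is filled.
filled = toy.fillna({"children": 0, "country": "Unknown"})
remaining = int(filled.isna().sum().sum())

# Hypothetical misspelled key: it matches no column, so nothing is filled there.
typo_filled = toy.fillna({"childrenn": 0, "country": "Unknown"})
remaining_after_typo = int(typo_filled.isna().sum().sum())

print(remaining, remaining_after_typo)
```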

In [None]:
hotels_df[['lead_time', 'adr']].hist(figsize=(15,6), bins=80)
hotels_df[['stays_in_weekend_nights', 'stays_in_week_nights']].hist(figsize=(15,6), bins=10, range=(0,9))
hotels_df[['adults', 'children', 'babies']].hist(figsize=(15,12), bins=6, range=(0,5))
plt.show()

What are the column types?

In [None]:
# 1.
{name: pd.api.types.is_numeric_dtype(hotels_df[name]) for name in hotels_df.columns}

In [None]:
# how many numeric columns are there?
np.sum([pd.api.types.is_numeric_dtype(hotels_df[name]) for name in hotels_df.columns])


In [None]:
# 2.
hotels_df.select_dtypes(include=[np.number])

### Exercise
The `lead_time` variable gives the length of time between making the booking and the planned start of the hotel stay. Does it have any effect on cancellation?

Hint: Check the effect of lead time on cancellation; you can split the observations into 10 groups.
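
One possible way to approach the split into 10 groups is `pd.qcut`, which cuts a variable into equally populated bins, followed by the mean of `is_canceled` in each bin. The sketch below runs on synthetic data in which the cancellation probability is assumed to grow with lead time (that relationship, the seed, and the name `toy` are assumptions for illustration, not results from the dataset); on the real data `hotels_df` would take the place of `toy`.

```python
import numpy as np
import pandas as pd

# Synthetic data: cancellation probability rises with lead time (an assumption).
rng = np.random.default_rng(0)
lead_time = rng.integers(0, 400, size=1000)
p_cancel = 0.1 + 0.6 * lead_time / 400
toy = pd.DataFrame({
    "lead_time": lead_time,
    "is_canceled": (rng.random(1000) < p_cancel).astype(int),
})

# Split lead_time into (up to) 10 equally populated bins.
toy["lead_time_bin"] = pd.qcut(toy["lead_time"], q=10, duplicates="drop")

# Share of canceled bookings in each bin.
cancel_rate = toy.groupby("lead_time_bin", observed=True)["is_canceled"].mean()
print(cancel_rate)
```

Quantile bins keep the group sizes comparable, which makes the per-bin rates similarly reliable; `pd.cut` with equal-width bins would be an alternative if the bin edges themselves matter more.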

In [None]:
num_features = [name for name in hotels_df.columns if pd.api.types.is_numeric_dtype(hotels_df[name])]

plt.figure(figsize=(12, 10))
heatmap = sns.heatmap(hotels_df[num_features].corr(), annot=True, vmin=-1, vmax=1, cmap="BrBG")
plt.show()

### How often are bookings canceled?


In [None]:
ax = sns.countplot(data = hotels_df, x = "hotel", hue = "is_canceled")
for p in ax.patches:
        ax.annotate('{}'.format(p.get_height()), (p.get_x()+0.1, p.get_height()+100))

plt.title("Number of canceled and non canceled bookings", fontsize=16)
plt.xlabel("Hotel type", fontsize=16)
plt.ylabel("Number of bookings", fontsize=16)
plt.legend(title = "Booking status", labels = ["not canceled", "canceled"])
plt.show()

### Effect of deposit type on booking cancellation

In [None]:
deposit_cancel_data = hotels_df.groupby("deposit_type")["is_canceled"].describe()

plt.figure(figsize=(12, 8))
sns.barplot(x=deposit_cancel_data.index, y=deposit_cancel_data["mean"] * 100, color = "steelblue")
plt.title("Effect of deposit_type on cancelation", fontsize=16)
plt.xlabel("Deposit type", fontsize=16)
plt.ylabel("Cancelations [%]", fontsize=16)
plt.show()

In [None]:
hotels_df["deposit_type"].value_counts()

This variable can assume three categories:
- No Deposit – no deposit was made;
- Non Refund – a deposit was made in the value of the total stay cost;
- Refundable – a deposit was made with a value under the total cost of stay.


In [None]:
hotels_df[hotels_df["deposit_type"] == "Non Refund"].groupby(['hotel', 'is_canceled']).size().reset_index()

### How does the number of bookings change over the year?

In [15]:
# work on a copy so the original frame stays untouched
bookings_monthly = hotels_df[["hotel", "arrival_date_month", "arrival_date_year", "is_canceled", "adr"]].copy()
ordered_months = ["January", "February", "March", "April", "May", "June", 
                  "July", "August", "September", "October", "November", "December"]
# plain column assignment, so the column really becomes an ordered categorical
bookings_monthly["arrival_date_month"] = pd.Categorical(bookings_monthly["arrival_date_month"], categories=ordered_months, ordered=True)

In [None]:
bookings_monthly = bookings_monthly.groupby(["hotel", "arrival_date_month", "is_canceled"]).size().reset_index(name='counts')
# July and August occur in the data 3 times, the remaining months 2 times,
# so divide the counts accordingly to get a per-year average
is_july_or_august = bookings_monthly["arrival_date_month"].isin(["July", "August"])
bookings_monthly["counts"] = bookings_monthly["counts"] / np.where(is_july_or_august, 3, 2)

In [None]:
plt.figure(figsize=(12, 8))
plt.ylim(0, 2500)
sns.lineplot(data=bookings_monthly[bookings_monthly["hotel"] == "City Hotel"], x = "arrival_date_month", y="counts", hue="is_canceled")
plt.legend(title = "Booking status", labels = ["not canceled", "canceled"])
plt.title("Average number of bookings in City Hotel over the year", fontsize=16)
plt.xlabel("Arrival date month", fontsize=16)
plt.ylabel("Number of bookings", fontsize=16)
plt.show()

In [None]:
plt.figure(figsize=(12, 8))
plt.ylim(0, 1500)
sns.lineplot(data=bookings_monthly[bookings_monthly["hotel"] == "Resort Hotel"], x = "arrival_date_month", y="counts", hue="is_canceled")
plt.legend(title = "Booking status", labels = ["not canceled", "canceled"])
plt.title("Average number of bookings in Resort Hotel over the year", fontsize=16)
plt.xlabel("Arrival date month", fontsize=16)
plt.ylabel("Number of bookings", fontsize=16)
plt.show()
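
The per-year averages above rely on July and August appearing in three calendar years and every other month in two. Rather than hard-coding these divisors, they could be derived from the data by counting distinct years per month. A minimal sketch on made-up rows (the frame `toy` and its contents are assumptions for illustration):

```python
import pandas as pd

# Toy arrivals: July is seen in three years, January in two (made-up rows).
toy = pd.DataFrame({
    "arrival_date_year": [2015, 2015, 2016, 2016, 2017, 2017, 2016, 2017, 2017, 2017],
    "arrival_date_month": ["July"] * 6 + ["January"] * 4,
})

# Number of distinct years in which each month occurs.
years_per_month = toy.groupby("arrival_date_month")["arrival_date_year"].nunique()

# Average number of bookings per year for each month.
counts = toy.groupby("arrival_date_month").size()
avg_per_year = counts / years_per_month
print(avg_per_year)
```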

In [None]:
df.country.value_counts()

### Country of origin of hotel guests

In [None]:
plt.figure(figsize=(12, 8))
sns.countplot(data = hotels_df[hotels_df["hotel"] == "City Hotel"], x = "country", hue = "is_canceled", 
              order = hotels_df['country'].value_counts().iloc[:15].index)
plt.legend(title = "Booking status", labels = ["not canceled", "canceled"])
plt.title("Number of bookings in City Hotel by country of origin of guests", fontsize=16)
plt.xlabel("Country", fontsize=16)
plt.ylabel("Number of bookings", fontsize=16)
plt.show()

In [None]:
plt.figure(figsize=(12, 8))
sns.countplot(data = hotels_df[hotels_df["hotel"] == "Resort Hotel"], x = "country", hue = "is_canceled", 
              order = hotels_df['country'].value_counts().iloc[:15].index)
plt.legend(title = "Booking status", labels = ["not canceled", "canceled"])
plt.title("Number of bookings in Resort Hotel by country of origin of guests", fontsize=16)
plt.xlabel("Country", fontsize=16)
plt.ylabel("Number of bookings", fontsize=16)
plt.show()

### Length of stay and number of bookings

In [19]:
# copies, because a total_nights column is added to each frame below
res = hotels_df.loc[(hotels_df["hotel"] == "Resort Hotel")].copy()
city = hotels_df.loc[(hotels_df["hotel"] == "City Hotel")].copy()

In [None]:
res["total_nights"] = res["stays_in_weekend_nights"] + res["stays_in_week_nights"]
city["total_nights"] = city["stays_in_weekend_nights"] + city["stays_in_week_nights"]
res_plot = res.groupby(['total_nights', "is_canceled"]).size().reset_index()
city_plot = city.groupby(['total_nights', "is_canceled"]).size().reset_index()
res_plot

In [None]:
plt.figure(figsize=(16, 8))
sns.barplot(x = "total_nights", y = 0, hue="is_canceled", data=city_plot)
plt.title("Bookings and length of stay in City Hotel", fontsize=16)
plt.xlabel("Number of nights", fontsize=16)
plt.ylabel("Number of bookings", fontsize=16)
plt.xlim(0,22)
plt.show()

In [None]:
plt.figure(figsize=(16, 8))
sns.barplot(x = "total_nights", y = 0, hue="is_canceled", data=res_plot)
plt.title("Bookings and length of stay in Resort Hotel", fontsize=16)
plt.xlabel("Number of nights", fontsize=16)
plt.ylabel("Number of bookings", fontsize=16)
plt.xlim(0,22)
plt.show()

### Does being assigned a different room type significantly affect booking cancellation?

In [None]:
hotels_df.loc[(hotels_df["reserved_room_type"] != hotels_df["assigned_room_type"]), "is_canceled"].mean()

In [None]:
hotels_df.loc[(hotels_df["reserved_room_type"] == hotels_df["assigned_room_type"]), "is_canceled"].mean()

Not really.

### Are trips with children canceled more often?


In [None]:
hotels_df.loc[hotels_df["babies"] + hotels_df["children"] > 0, "is_canceled"].mean()

In [None]:
hotels_df.loc[hotels_df["babies"] + hotels_df["children"] == 0, "is_canceled"].mean()

No.

### Are repeat guests less likely to cancel a booking?

In [None]:
hotels_df.loc[hotels_df["is_repeated_guest"] == 1, "is_canceled"].mean()

In [None]:
hotels_df.loc[hotels_df["is_repeated_guest"] == 0, "is_canceled"].mean()

Yes.

### Exercise
Check whether the `previous_bookings_not_canceled` variable affects booking cancellation. Are there any other variables that may depend on `previous_bookings_not_canceled`?

Hint: Look at the correlation matrix.
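
Reading a single variable off a large heatmap can be tedious, so one option is to pull out just its column of the correlation matrix and sort it by absolute value. The sketch below does this on a small made-up frame (the values and the name `toy` are assumptions for illustration, so the printed correlations say nothing about the real data); with the notebook's data, `hotels_df[num_features]` would replace `toy`.

```python
import pandas as pd

# Toy numeric frame (made-up values, not taken from the dataset).
toy = pd.DataFrame({
    "previous_bookings_not_canceled": [0, 0, 1, 3, 5, 0, 2, 0],
    "is_repeated_guest":              [0, 0, 1, 1, 1, 0, 1, 0],
    "previous_cancellations":         [0, 1, 0, 0, 1, 0, 0, 2],
    "is_canceled":                    [1, 1, 0, 0, 0, 1, 0, 1],
})

target = "previous_bookings_not_canceled"

# Correlation of every other column with the target, strongest first.
corr_with_target = toy.corr()[target].drop(target)
corr_with_target = corr_with_target.sort_values(key=abs, ascending=False)
print(corr_with_target)
```

Pearson correlation only captures linear association, so a weak value here does not rule out a dependence; comparing cancellation rates between groups, as in the sections above, is a useful cross-check.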

## autoEDA: pandas-profiling

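`pip install ydata-profiling[notebook]`

The pandas-profiling package has been renamed to ydata-profiling, which is why the import below uses `ydata_profiling`.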

- https://github.com/pandas-profiling/pandas-profiling
- https://pandas-profiling.github.io/pandas-profiling/docs/master/rtd

In [None]:
from ydata_profiling import ProfileReport

In [34]:
profile = ProfileReport(df.sample(frac=0.1), title="Pandas Profiling Report", explorative=True)

In [None]:
profile.to_widgets()

In [None]:
profile.to_file("report1.html")

## Teaser: predictive modeling

`pip install lazypredict`

- https://github.com/shankarpandala/lazypredict
- https://lazypredict.readthedocs.io/en/latest

In [None]:
from lazypredict.Supervised import LazyClassifier

In [37]:
df_small = hotels_df.select_dtypes(include=[np.number]).sample(frac=0.1)

X = df_small.drop(["is_canceled"], axis=1)
y = df_small.is_canceled

In [38]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.25, random_state=123)

In [None]:
clf = LazyClassifier(verbose=0, ignore_warnings=True, custom_metric=None)

models, predictions = clf.fit(X_train, X_test, y_train, y_test)

In [None]:
print(models)