## Hotel bookings (mini-project)

**Importing the dataset from a CSV file in the working directory**

In [1]:
import pandas as pd

bookings = pd.read_csv('bookings.csv', sep=';')
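
The `sep=';'` argument matters because `read_csv` splits on commas by default. A minimal sketch of the difference, using a made-up two-row sample held in memory (the sample text is an assumption for illustration and is not taken from `bookings.csv`):

```python
import io
import pandas as pd

# Hypothetical sample that mimics a semicolon-delimited file
sample = "Hotel;Is Canceled;Lead Time\nResort Hotel;0;342\nCity Hotel;1;7\n"

# Default comma separator: each line ends up in a single column
wrong = pd.read_csv(io.StringIO(sample))

# Explicit semicolon separator: the three columns are split as intended
right = pd.read_csv(io.StringIO(sample), sep=';')

print(wrong.shape)  # (2, 1)
print(right.shape)  # (2, 3)
```

Checking `shape` right after loading, as the next cell does, is a quick way to catch a wrong separator.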

**Checking the number of rows and columns in the dataset**

In [2]:
bookings.shape

(119390, 21)

**Checking column names**

In [3]:
bookings.columns

Index(['Hotel', 'Is Canceled', 'Lead Time', 'arrival full date',
       'Arrival Date Year', 'Arrival Date Month', 'Arrival Date Week Number',
       'Arrival Date Day of Month', 'Stays in Weekend nights',
       'Stays in week nights', 'stays total nights', 'Adults', 'Children',
       'Babies', 'Meal', 'Country', 'Reserved Room Type', 'Assigned room type',
       'customer type', 'Reservation Status', 'Reservation status_date'],
      dtype='object')

**bookings.csv -- data description:**
* Hotel – hotel type (City Hotel or Resort Hotel)  
* Is canceled – whether the booking was canceled (1) or not (0)  
* Lead time – number of days between the booking date and the arrival date  
* Arrival full date – full arrival date  
* Arrival date year – year of arrival  
* Arrival date month – month of arrival  
* Arrival date week number – week number of arrival  
* Arrival date day of month – day of the month of arrival  
* Stays in weekend nights – number of weekend nights  
* Stays in week nights – number of weekday nights (Mon - Fri)  
* Stays total nights – total number of nights  
* Adults – number of adults  
* Children – number of children  
* Babies – number of babies  
* Meal – type of meal included with the booking  
* Country – visitor's country of residence  
* Reserved room type – room type reserved  
* Assigned room type – room type assigned  
* Customer type – type of customer  
* Reservation status – reservation status: Canceled, Check-Out, or No-Show  
* Reservation status date – date of the last status update  

**Checking column data types**

In [41]:
bookings.dtypes

Hotel                         object
Is Canceled                    int64
Lead Time                      int64
arrival full date             object
Arrival Date Year              int64
Arrival Date Month            object
Arrival Date Week Number       int64
Arrival Date Day of Month      int64
Stays in Weekend nights        int64
Stays in week nights           int64
stays total nights             int64
Adults                         int64
Children                     float64
Babies                         int64
Meal                          object
Country                       object
Reserved Room Type            object
Assigned room type            object
customer type                 object
Reservation Status            object
Reservation status_date       object
dtype: object

**Finding the most frequent data type**

In [5]:
bookings.dtypes.value_counts()

object     10
int64      10
float64     1
dtype: int64

**Checking the first seven rows of the dataset**

In [9]:
bookings.head(7)

,Hotel,Is Canceled,Lead Time,arrival full date,Arrival Date Year,Arrival Date Month,Arrival Date Week Number,Arrival Date Day of Month,Stays in Weekend nights,Stays in week nights,...,Adults,Children,Babies,Meal,Country,Reserved Room Type,Assigned room type,customer type,Reservation Status,Reservation status_date
0,Resort Hotel,0,342,2015-07-01,2015,July,27,1,0,0,...,2,0.0,0,BB,PRT,C,C,Transient,Check-Out,2015-07-01
1,Resort Hotel,0,737,2015-07-01,2015,July,27,1,0,0,...,2,0.0,0,BB,PRT,C,C,Transient,Check-Out,2015-07-01
2,Resort Hotel,0,7,2015-07-01,2015,July,27,1,0,1,...,1,0.0,0,BB,GBR,A,C,Transient,Check-Out,2015-07-02
3,Resort Hotel,0,13,2015-07-01,2015,July,27,1,0,1,...,1,0.0,0,BB,GBR,A,A,Transient,Check-Out,2015-07-02
4,Resort Hotel,0,14,2015-07-01,2015,July,27,1,0,2,...,2,0.0,0,BB,GBR,A,A,Transient,Check-Out,2015-07-03
5,Resort Hotel,0,14,2015-07-01,2015,July,27,1,0,2,...,2,0.0,0,BB,GBR,A,A,Transient,Check-Out,2015-07-03
6,Resort Hotel,0,0,2015-07-01,2015,July,27,1,0,2,...,2,0.0,0,BB,PRT,C,C,Transient,Check-Out,2015-07-03


**Printing a concise summary of the DataFrame**

In [7]:
bookings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 119390 entries, 0 to 119389
Data columns (total 21 columns):
 #   Column                     Non-Null Count   Dtype  
---  ------                     --------------   -----  
 0   Hotel                      119390 non-null  object 
 1   Is Canceled                119390 non-null  int64  
 2   Lead Time                  119390 non-null  int64  
 3   arrival full date          119390 non-null  object 
 4   Arrival Date Year          119390 non-null  int64  
 5   Arrival Date Month         119390 non-null  object 
 6   Arrival Date Week Number   119390 non-null  int64  
 7   Arrival Date Day of Month  119390 non-null  int64  
 8   Stays in Weekend nights    119390 non-null  int64  
 9   Stays in week nights       119390 non-null  int64  
 10  stays total nights         119390 non-null  int64  
 11  Adults                     119390 non-null  int64  
 12  Children                   119386 non-null  float64
 13  Babies                     11

**Replacing spaces with underscores in the column names and converting them to lowercase**

In [11]:
def name_upd(name):
    return name.replace(' ', '_').lower()

bookings = bookings.rename(columns=name_upd)

bookings

,hotel,is_canceled,lead_time,arrival_full_date,arrival_date_year,arrival_date_month,arrival_date_week_number,arrival_date_day_of_month,stays_in_weekend_nights,stays_in_week_nights,...,adults,children,babies,meal,country,reserved_room_type,assigned_room_type,customer_type,reservation_status,reservation_status_date
0,Resort Hotel,0,342,2015-07-01,2015,July,27,1,0,0,...,2,0.0,0,BB,PRT,C,C,Transient,Check-Out,2015-07-01
1,Resort Hotel,0,737,2015-07-01,2015,July,27,1,0,0,...,2,0.0,0,BB,PRT,C,C,Transient,Check-Out,2015-07-01
2,Resort Hotel,0,7,2015-07-01,2015,July,27,1,0,1,...,1,0.0,0,BB,GBR,A,C,Transient,Check-Out,2015-07-02
3,Resort Hotel,0,13,2015-07-01,2015,July,27,1,0,1,...,1,0.0,0,BB,GBR,A,A,Transient,Check-Out,2015-07-02
4,Resort Hotel,0,14,2015-07-01,2015,July,27,1,0,2,...,2,0.0,0,BB,GBR,A,A,Transient,Check-Out,2015-07-03
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
119385,City Hotel,0,23,2017-08-30,2017,August,35,30,2,5,...,2,0.0,0,BB,BEL,A,A,Transient,Check-Out,2017-09-06
119386,City Hotel,0,102,2017-08-31,2017,August,35,31,2,5,...,3,0.0,0,BB,FRA,E,E,Transient,Check-Out,2017-09-07
119387,City Hotel,0,34,2017-08-31,2017,August,35,31,2,5,...,2,0.0,0,BB,DEU,D,D,Transient,Check-Out,2017-09-07
119388,City Hotel,0,109,2017-08-31,2017,August,35,31,2,5,...,2,0.0,0,BB,GBR,A,A,Transient,Check-Out,2017-09-07
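
Passing a function to `rename(columns=...)` applies it to every column label. The sketch below repeats that idea on a small made-up frame (`toy` and its values are assumptions for illustration) and shows an equivalent formulation using the string methods of the column index:

```python
import pandas as pd

# Hypothetical two-column frame with untidy column names
toy = pd.DataFrame({'Lead Time': [342, 7],
                    'customer type': ['Transient', 'Group']})

def name_upd(name):
    # Replace spaces with underscores and lowercase the label
    return name.replace(' ', '_').lower()

# Option 1: rename() calls name_upd once per column label
renamed = toy.rename(columns=name_upd)

# Option 2: vectorized string methods on the column index
vectorized = toy.copy()
vectorized.columns = toy.columns.str.replace(' ', '_').str.lower()

print(list(renamed.columns))     # ['lead_time', 'customer_type']
print(list(vectorized.columns))  # ['lead_time', 'customer_type']
```

Both options produce names that can be used directly in `query()` strings and attribute access, which the following cells rely on.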


**Find the top 5 countries by number of successful (not canceled) bookings made by their residents**

In [21]:
successful_bookings_top5 = bookings.query('is_canceled == 0') \
    .groupby('country') \
    .agg({'is_canceled': 'count'}) \
    .sort_values('is_canceled', ascending=False).head() \
    .rename(columns={'is_canceled': 'successful bookings'})


successful_bookings_top5

country,successful bookings
PRT,21071
GBR,9676
FRA,8481
ESP,6391
DEU,6069
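
The same ranking can probably be obtained more compactly with `value_counts()`, which counts and sorts in one step. A minimal sketch on made-up data (the `toy` frame is an assumption, not a sample of the real dataset):

```python
import pandas as pd

# Hypothetical bookings: one canceled PRT booking should be ignored
toy = pd.DataFrame({
    'country':     ['PRT', 'PRT', 'PRT', 'GBR', 'GBR', 'FRA', 'PRT'],
    'is_canceled': [0,     0,     0,     0,     0,     0,     1],
})

# Keep non-canceled bookings, then count rows per country (sorted descending)
top = toy.loc[toy['is_canceled'] == 0, 'country'].value_counts().head(5)

print(top.index.tolist())  # ['PRT', 'GBR', 'FRA']
print(top.tolist())        # [3, 2, 1]
```

The result is a Series rather than a one-column DataFrame, so the `groupby`/`agg` version is still the better fit when a named column is wanted.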


**Calculate the average total number of nights per stay by hotel type**

In [23]:
bookings.groupby('hotel') \
    .agg({'stays_total_nights': 'mean'}) \
    .round(2)

hotel,stays_total_nights
City Hotel,2.98
Resort Hotel,4.32


**Count occasions when a reserved room type is different from an assigned room type**

In [31]:
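# 'is_canceled' has no missing values, so counting it counts the matching rows;
# it is renamed to 'occasions' only to label the result
occasions = bookings.rename(columns={'is_canceled': 'occasions'}) \
                    .query('assigned_room_type != reserved_room_type') \
                    .agg({'occasions': 'count'})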

occasions

occasions    14917
dtype: int64
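
Counting mismatches does not require renaming a column: comparing the two columns gives a boolean Series, and summing it counts the `True` values. A minimal sketch on made-up room types (the `toy` frame is an assumption for illustration):

```python
import pandas as pd

# Hypothetical reservations: rows 1 and 3 got a different room type
toy = pd.DataFrame({
    'reserved_room_type': ['A', 'A', 'C', 'D'],
    'assigned_room_type': ['A', 'C', 'C', 'E'],
})

# Element-wise comparison yields True where the types differ
mismatch = toy['reserved_room_type'] != toy['assigned_room_type']

# True counts as 1, so the sum is the number of mismatches
n_mismatch = int(mismatch.sum())

# The query-based count gives the same number
n_query = len(toy.query('assigned_room_type != reserved_room_type'))

print(n_mismatch, n_query)  # 2 2
```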

**Analyze the planned arrival dates. Which month was the most popular in 2016? In 2017?**

In [35]:
for year in [2016, 2017]:
    # Count non-canceled bookings per arrival month for the given year
    monthly_bookings = bookings.query("arrival_date_year == @year and is_canceled == 0") \
        .groupby(['arrival_date_year', 'arrival_date_month'], as_index=False) \
        .agg({'is_canceled': 'count'})
    # idxmax() returns the row label of the month with the most bookings
    top_month = monthly_bookings.loc[monthly_bookings.is_canceled.idxmax(), 'arrival_date_month']
    print(f"The most popular month in {year}: {top_month}")

The most popular month in 2016: October
The most popular month in 2017: May
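
The per-year loop can also be expressed as a single grouping followed by `idxmax()` within each year. The sketch below uses made-up rows (the `toy` frame and its counts are assumptions, so the printed months say nothing about the real data):

```python
import pandas as pd

# Hypothetical arrivals; the canceled June booking must not be counted
toy = pd.DataFrame({
    'arrival_date_year':  [2016, 2016, 2016, 2017, 2017, 2017],
    'arrival_date_month': ['October', 'October', 'May', 'May', 'May', 'June'],
    'is_canceled':        [0, 0, 0, 0, 0, 1],
})

# Number of non-canceled bookings for every (year, month) pair
counts = (toy[toy['is_canceled'] == 0]
          .groupby(['arrival_date_year', 'arrival_date_month'])
          .size())

# For each year, idxmax() returns the (year, month) label with the largest count
best_month = {}
for year, group in counts.groupby(level='arrival_date_year'):
    best_month[int(year)] = group.idxmax()[1]

print(best_month)  # {2016: 'October', 2017: 'May'}
```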


**Which month had the most City Hotel booking cancellations in 2015? 2016? 2017?**

In [43]:
for year in [2015, 2016, 2017]:
    # Count canceled City Hotel bookings per (year, month);
    # idxmax() returns the (year, month) label with the highest count
    max_canceled_month = bookings.query('hotel == "City Hotel" and is_canceled == 1 and arrival_date_year == @year') \
        .groupby(['arrival_date_year', 'arrival_date_month']) \
        .agg({'is_canceled': 'count'}) \
        .is_canceled \
        .idxmax()
    print(f'The most booking cancellations in {year} happened in {max_canceled_month[1]}.')

The most booking cancellations in 2015 happened in September.
The most booking cancellations in 2016 happened in October.
The most booking cancellations in 2017 happened in May.


**Which column has the largest mean value: adults, children, or babies?**

In [44]:
# mean of each of the three columns (missing values are skipped)
bookings[['adults', 'children', 'babies']].mean()

adults      1.856403
children    0.103890
babies      0.007949
dtype: float64
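
To get the name of the column with the largest mean programmatically, `idxmax()` can be chained after `mean()`. A minimal sketch on made-up values (the `toy` frame is an assumption); it also shows that `mean()` skips missing values, which matters because `children` contains a few nulls:

```python
import pandas as pd

# Hypothetical guests; one 'children' value is missing
toy = pd.DataFrame({
    'adults':   [2, 2, 1],
    'children': [0.0, 1.0, None],
    'babies':   [0, 0, 1],
})

# Column means; NaN is ignored, so 'children' averages over two rows only
means = toy[['adults', 'children', 'babies']].mean()

# Label of the column with the largest mean
largest = means.idxmax()

print(largest)  # adults
```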

**Create a new column 'total_kids' as the sum of the number of children and the number of babies. Which hotel type has the largest mean total_kids value?**

In [54]:
# Create the total_kids column by adding the children and babies columns.
# For which hotel type is the mean of this variable the largest?
bookings['total_kids'] = bookings.children + bookings.babies

# Note: .max() takes the maximum of each column independently,
# so the hotel name and the value are not guaranteed to come from the same row
bookings.groupby('hotel', as_index=False) \
    .agg({'total_kids' : 'mean'}) \
    .round(2) \
    .max()

hotel         Resort Hotel
total_kids            0.14
dtype: object
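
Calling `.max()` on a frame with a text column and a numeric column takes each column's maximum separately, so the largest hotel name alphabetically can end up paired with a value from another row. The sketch below shows the effect on made-up numbers (the `toy` values are assumptions chosen to expose the mismatch, and say nothing about whether the result above is wrong) and a row-wise alternative using `idxmax()`:

```python
import pandas as pd

# Hypothetical group means where City Hotel has the larger value
toy = pd.DataFrame({
    'hotel': ['City Hotel', 'Resort Hotel'],
    'total_kids': [0.30, 0.10],
})

# Column-wise maxima: 'Resort Hotel' (alphabetical) paired with 0.30
columnwise = toy.max()

# Row with the largest total_kids: hotel and value stay together
best_row = toy.loc[toy['total_kids'].idxmax()]

print(columnwise['hotel'], columnwise['total_kids'])
print(best_row['hotel'], best_row['total_kids'])
```

Selecting the row by `idxmax()` keeps the answer correct regardless of how the hotel names sort.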

**Churn rate** is the rate at which customers stop doing business with an entity. It is most commonly expressed as the percentage of service subscribers who discontinue their subscriptions within a given time period. It is also the rate at which employees leave their jobs within a certain period. For a company to expand its clientele, its growth rate (measured by the number of new customers) must exceed its churn rate. In this notebook, the churn rate is the percentage of bookings that were canceled.  
https://www.investopedia.com/terms/c/churnrate.asp

**Calculate the churn rate for customers with kids and for customers without kids**

In [53]:
# creating a new column
bookings['has_kids'] = bookings.total_kids > 0

# querying cancellation counts for visitors with and without kids
canceled_with_kids = bookings.query('has_kids == True and is_canceled == 1') \
    .is_canceled.value_counts()
canceled_no_kids = bookings.query('has_kids == False and is_canceled == 1') \
    .is_canceled.value_counts()

# final values used to calculate the churn rate
canceled_no_kids_int = canceled_no_kids[1]
canceled_with_kids_int = canceled_with_kids[1]
total_bookings_with_kids = len(bookings.query('has_kids == True'))
total_bookings_no_kids = len(bookings.query('has_kids == False'))

# churn rate as a percentage: canceled bookings divided by all bookings
def churn_rate(total, canceled):
    return (canceled / total) * 100


print(f'Customers having no kids churn rate: {round(churn_rate(total_bookings_no_kids, canceled_no_kids_int), 2)}')
print(f'Customers with kids churn rate: {round(churn_rate(total_bookings_with_kids, canceled_with_kids_int), 2)}')

Customers having no kids churn rate: 37.22
Customers with kids churn rate: 34.92
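
Because `is_canceled` is coded as 0 or 1, its mean within a group is already the canceled share, so the same rates can likely be computed with a single `groupby`. A minimal sketch on made-up bookings (the `toy` frame is an assumption; its rates are unrelated to the results above):

```python
import pandas as pd

# Hypothetical bookings: four without kids (one canceled), two with kids (one canceled)
toy = pd.DataFrame({
    'total_kids':  [0, 0, 0, 0, 1, 2],
    'is_canceled': [1, 0, 0, 0, 1, 0],
})
toy['has_kids'] = toy['total_kids'] > 0

# Mean of a 0/1 column is the fraction of ones; times 100 gives a percentage
churn = toy.groupby('has_kids')['is_canceled'].mean().mul(100).round(2)

print(churn.to_dict())  # {False: 25.0, True: 50.0}
```

This avoids the separate count variables, at the cost of being less explicit about the numerator and denominator.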
