# Tutorial Exercises

## Hotel stay data

In this example we will use data from 12,843 guests who stayed at a chain of luxury hotels over the past 5 years.
The data were collected by the Clarendon Luxury Hotel chain, and stored in their customer database.

These exercises will review some of the skills learned over the last three weeks. They will also prepare you for the first hand-in exercise: to produce a report for the CEO of the company, describing the main factors affecting how much guests spent, and the length of their hotel stay.

### Set up Python libraries

As usual, run the code cell below to import the relevant Python libraries.

In [1]:
# Set-up Python libraries - you need to run this but you don't need to change it
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats
import pandas 
import seaborn as sns
sns.set_theme()

### Load and inspect the data

In [2]:
hotelStays=pandas.read_csv('/Users/lhunt/Desktop/hotelStays.csv')
display(hotelStays)

Unnamed: 0,CHARGES,LOS,AGE,SEX,DISCOUNTCODE,SPAVOUCHER
0,4752.00,10,79.0,F,122.0,0.0
1,3941.00,6,34.0,F,122.0,0.0
2,3657.00,5,76.0,F,122.0,0.0
3,1481.00,2,80.0,F,122.0,0.0
4,1681.00,1,55.0,M,122.0,0.0
...,...,...,...,...,...,...
12839,22603.57,14,79.0,F,121.0,0.0
12840,,7,91.0,F,121.0,0.0
12841,14359.14,9,79.0,F,121.0,0.0
12842,12986.00,5,70.0,M,121.0,0.0


What data do we have for each guest?
<ul>
    <li> CHARGES is the cost in pounds of the guest's stay at the hotel (note that this varies considerably due to different room types and time of year, which are not included in this database)
    <li> LOS is Length of Stay (at the hotel) in nights
    <li> SPAVOUCHER is coded as 1 if the person used a voucher to access the hotel spa, 0 if they did not
    <li> DISCOUNTCODE is a discount code from the company's booking system
</ul>

### Evaluate missing and bad data values

How many missing values (NaNs) are there for each variable (column) in the dataset?
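
As a hint, here is a minimal sketch of one way to count NaNs per column. It uses a small made-up dataframe called `toy` (a hypothetical name with invented values, not the real `hotelStays` data), so you will need to adapt it to your own dataframe.

```python
import numpy as np
import pandas

# Toy dataframe with invented values (NOT the real hotelStays data)
toy = pandas.DataFrame({
    "CHARGES": [1200.0, np.nan, 950.0, 3100.0],
    "LOS": [3, 2, 7, 2],
    "AGE": [45.0, 62.0, np.nan, np.nan],
})

# isna() marks each missing cell as True; sum() then counts the Trues per column
missing_per_column = toy.isna().sum()
print(missing_per_column)
```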

In [1]:
# your code here!

Can you find any data points that look like outliers or misrecorded values?

You could try the following techniques:
<ul>
    <li> plot the data to see if outliers are obvious
    <li> sort the data using <tt>pandas.DataFrame.sort_values()</tt> to bring extreme values to the top (or bottom) of the dataframe, then display the sorted dataframe
    <li> obtain descriptive statistics and check the max and min value for each column of the dataframe
</ul>

For hotel guests with outlier values, you should decide whether to:

<ol>
    <li>replace individual datapoints with NaNs
    <li>replace the entire guest record with NaNs
    <li>remove the entire record from the dataset with <tt>pandas.DataFrame.drop()</tt>
    <li>retain the data as is, at least for now
</ol>
Think how you would justify your choice to a reader.
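
The techniques and options listed above could look something like the sketch below. The dataframe `toy`, its values and the cut-off of 365 nights are all invented for illustration. Which values count as outliers in the real data, and what to do about them, is your decision.

```python
import numpy as np
import pandas

# Toy data with one implausible length of stay (invented values)
toy = pandas.DataFrame({
    "CHARGES": [1200.0, 950.0, 3100.0, 800.0],
    "LOS": [3, 2, 999, 2],
    "AGE": [45.0, 62.0, 58.0, 71.0],
})

# Sort so the most extreme values appear at the top of the dataframe
print(toy.sort_values("LOS", ascending=False).head())

# Check the min and max of every column
print(toy.describe().loc[["min", "max"]])

# Hypothetical rule: a stay of more than 365 nights is treated as misrecorded
suspect = toy["LOS"] > 365

# Option 1: replace only the suspect datapoint with NaN
# (convert to float first, because an integer column cannot hold NaN)
option1 = toy.copy()
option1["LOS"] = option1["LOS"].astype(float)
option1.loc[suspect, "LOS"] = np.nan

# Option 3: remove the whole record from the dataset
option3 = toy.drop(index=toy.index[suspect])
```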


In [2]:
# your code here!

### Cost of hotel stay

The column <tt>CHARGES</tt> tells us how much the hotel stay cost in £.

Plot the distribution of charges using a suitable plot type. 

In [3]:
# Your code here

Describe the distribution of hotel stay costs in words, including some descriptive statistics. 

Part of the task here is to decide which descriptives are useful to give the reader a summary of the distribution of charges. 

Try to make a choice yourself, and then discuss with your tutor if unsure.
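
To help you compare candidates, the sketch below computes several descriptives for a made-up series of charges with one very large value. The numbers are invented. Whether the real <tt>CHARGES</tt> column behaves like this is something to check from your plot.

```python
import pandas

# Invented charges with a long right tail (NOT the real data)
charges = pandas.Series([800.0, 950.0, 1200.0, 1500.0, 2100.0, 22000.0])

summary = {
    "mean": charges.mean(),
    "sd": charges.std(),
    "median": charges.median(),
    "q1": charges.quantile(0.25),
    "q3": charges.quantile(0.75),
}
summary["iqr"] = summary["q3"] - summary["q1"]

# Print each descriptive so the mean/sd can be compared with the median/IQR
for name, value in summary.items():
    print(f"{name}: {value:.1f}")
```

In this toy example the single large value pulls the mean well above the median, which is one reason to report the median and interquartile range when a distribution is strongly skewed.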

In [4]:
# Your code here

### Length of hotel stay

The column <tt>LOS</tt> tells us how long each guest stayed in the hotel.

Plot the distribution of length of stay using a suitable plot type. 

In [5]:
# Your code here

Hm, there is an interesting feature in that data distribution - what is it?

Can you think what the origin of this feature is (what caused it)?

HINT: it may help to plot data separately for the different values of one of the categorical variables, using the argument <tt>hue</tt> in the plotting function. You will get a clearer result with a KDE plot than a histogram (try both and see why).

In [6]:
# Your code here

.......Your comment here......... double click to edit this text box!

### Association between cost and length of stay

Probably the biggest factor affecting the cost of the stay is the length of the stay.

Produce an appropriate plot and descriptive statistics to demonstrate the relationship between cost and length of stay.
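
One possible combination is a scatter plot with a correlation coefficient, sketched below on invented values. Treat it as a starting point only. Other plot types, or a rank-based correlation, may suit the real data better.

```python
import pandas
import seaborn as sns

# Invented values (NOT the real hotelStays data)
toy = pandas.DataFrame({
    "LOS": [1, 2, 3, 4, 5, 6],
    "CHARGES": [300.0, 420.0, 800.0, 950.0, 1300.0, 1400.0],
})

# Scatter plot of cost against length of stay
ax = sns.scatterplot(data=toy, x="LOS", y="CHARGES")

# Pearson correlation between the two columns
r = toy["LOS"].corr(toy["CHARGES"])
print(r)
```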

In [7]:
# Your code here

.......Your comment here......... double click to edit this text box!

You may remember from the exercises on covariance that the change in $y$ for a one-unit change in $x$ is given by the regression slope:

$$ b = \frac{s_{xy}}{s^2_x} $$

Apply the equation in Python to find out how much, on average, one extra night at the hotel costs.
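
As a sketch of how the equation translates into code, the example below uses invented data in which every extra night costs exactly £100, so you can check that the formula recovers the slope. Rows with a NaN in either column are dropped first, on the assumption that $s_{xy}$ and $s^2_x$ should be computed from the same set of guests.

```python
import pandas

# Invented data: CHARGES = 50 + 100 * LOS (NOT the real hotelStays data)
toy = pandas.DataFrame({
    "LOS": [1, 2, 3, 4, 5],
    "CHARGES": [150.0, 250.0, 350.0, 450.0, 550.0],
})

# Keep only rows where both variables are present
complete = toy.dropna(subset=["LOS", "CHARGES"])

s_xy = complete["LOS"].cov(complete["CHARGES"])  # sample covariance of x and y
s2_x = complete["LOS"].var()                     # sample variance of x

b = s_xy / s2_x  # regression slope: change in CHARGES per extra night
print(b)
```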

In [8]:
# Your code here

### Association between length of stay and age

Older people tend to stay longer at the hotel. Produce an appropriate plot and descriptive statistics to demonstrate the relationship between age and length of stay.

In [9]:
# Your code here

.......Your comment here......... double click to edit this text box!

### Sex difference in spa usage

A greater proportion of women than men used the spa.

Illustrate this association between sex and spa usage using <tt>sns.countplot()</tt>
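
A minimal sketch of the call is shown below on an invented dataframe, together with a cross-tabulation of proportions. The crosstab is an optional extra, not something the exercise asks for. It is included because a count plot shows counts, and the groups may differ in size.

```python
import pandas
import seaborn as sns

# Invented values (NOT the real hotelStays data)
toy = pandas.DataFrame({
    "SEX": ["F", "F", "F", "F", "M", "M", "M", "M"],
    "SPAVOUCHER": [1, 1, 1, 0, 1, 0, 0, 0],
})

# Bars are grouped by sex and coloured by spa voucher use
ax = sns.countplot(data=toy, x="SEX", hue="SPAVOUCHER")

# Proportion of voucher users within each sex (each row sums to 1)
proportions = pandas.crosstab(toy["SEX"], toy["SPAVOUCHER"], normalize="index")
print(proportions)
```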

In [10]:
# Your code here

### Age difference between sexes

Could the higher use of the spa by women be explained by their age?

Explore with appropriate plots and summarize your observations in words, including descriptive statistics.

In [11]:
# Your code here

.......Your comment here......... double click to edit this text box!

### What is DISCOUNTCODE?

The column DISCOUNTCODE is a discount code, which tells you something about what discount offer the guest had access to when they made their booking.

I do not know what the different codes mean, but you could try to find out by plotting some of the other variables broken down by DISCOUNTCODE (e.g. using the <tt>hue</tt> property of <tt>sns</tt> plotting functions). You should at least be able to work out what code 123 means.
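
Alongside plots, a grouped summary table can make differences between codes easier to spot. The sketch below uses invented rows with hypothetical codes 1, 2 and 3, so it says nothing about what the real codes mean.

```python
import pandas

# Invented rows with hypothetical codes (NOT the real hotelStays data)
toy = pandas.DataFrame({
    "DISCOUNTCODE": [1, 1, 2, 2, 2, 3],
    "AGE": [25.0, 35.0, 60.0, 70.0, 80.0, 45.0],
    "LOS": [2, 4, 7, 9, 8, 3],
})

# Count and median of each variable, computed separately for each code
by_code = toy.groupby("DISCOUNTCODE")[["AGE", "LOS"]].agg(["count", "median"])
print(by_code)
```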

In [12]:
# Your code here

.......Your comment here......... double click to edit this text box!