# VTK Hackathon 2021 - Students

The goal of the Data Science task is to extract valuable insights from raw hotel rates data. The data sample is fetched from Cloud Storage; the code for this is already provided below.

The data sample consists of the rates of about 7000 hotels for stay dates (the date on which you check in) ranging from November 9 to February 6 (a period of 90 days), as scraped on November 8. All rates are for single-night stays and are in EUR.



## Fetching data from Cloud Storage [Already implemented]


In [None]:
from google.cloud import storage
from google.colab import auth
import os

PROJECT_ID = 'vtkhackathon-2021'
BUCKET = 'students-public'
RATES_FILE = 'rates.csv'

# Download the sample only once; later runs reuse the local copy.
if not os.path.exists(RATES_FILE):
    auth.authenticate_user()  # interactive Google sign-in inside Colab
    storage_client = storage.Client(project=PROJECT_ID)
    bucket = storage_client.bucket(BUCKET)
    blob = bucket.blob(RATES_FILE)
    blob.download_to_filename(RATES_FILE)

In [None]:
import pandas as pd
raw_rates = pd.read_csv(RATES_FILE)

## Getting to know the data [Not graded]

Some useful Pandas functions are `DataFrame.info()`, `DataFrame.describe()`, and `Series.value_counts()`. See the [docs](https://pandas.pydata.org/docs/) for more information. Transform or clean the data where necessary.
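As a minimal sketch of such a first look, the cell below runs these functions on a tiny stand-in frame. Only `our_hotel_id` and `stay_date` are column names taken from this notebook; the `price` column and all values are assumptions made for illustration, so adapt them to what `raw_rates` actually contains.

```python
import pandas as pd

# Toy stand-in for raw_rates. The `price` column name and all values
# are assumptions; check the real columns with raw_rates.columns.
sample = pd.DataFrame({
    "our_hotel_id": [1, 1, 2, 2, 3],
    "stay_date": ["2021-11-09", "2021-11-10", "2021-11-09", "2021-11-10", "2021-11-09"],
    "price": [80.0, 85.0, 120.0, None, 60.0],
})

sample.info()                                            # dtypes and non-null counts per column
summary = sample.describe()                              # count, mean, std, quartiles of numeric columns
rows_per_hotel = sample["our_hotel_id"].value_counts()   # number of rows per hotel

# Typical cleaning steps: parse the dates and drop rows without a price.
sample["stay_date"] = pd.to_datetime(sample["stay_date"])
clean = sample.dropna(subset=["price"])
```

Whether dropping incomplete rows is the right choice depends on how many there are and why they are missing.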


## Task 1: Exceptional deals
As some hotel managers adjust prices manually, mistakes are bound to happen. Your task is to robustly identify these outliers (so that you could exploit them to get a cheap stay).

Simply taking the minimum over all price values will not surface these cases: €8 for a dormitory bed in Bulgaria might not be exceptional, while a €25 suite in a 5-star hotel in Brussels probably is. The prices should therefore be normalized to a `price index` that indicates how exceptional a price is.
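One possible normalization, shown here only as a sketch, divides each price by the median price of the same hotel, so that an index well below 1 marks an unusually cheap night. The `price` column name and the toy values are assumptions, and grouping by hotel alone is a simplification; grouping by hotel and offer may suit the real data better.

```python
import pandas as pd

# Toy data: hotel 1 is expensive with one suspiciously low rate,
# hotel 2 is cheap throughout. Column `price` is an assumed name.
rates = pd.DataFrame({
    "our_hotel_id": [1, 1, 1, 1, 2, 2, 2, 2],
    "stay_date": pd.to_datetime(
        ["2021-11-09", "2021-11-10", "2021-11-11", "2021-11-12"] * 2
    ),
    "price": [200.0, 210.0, 25.0, 205.0, 8.0, 9.0, 8.0, 10.0],
})

# transform("median") returns one value per row, aligned with the original
# frame, so the division is done row by row within each hotel.
hotel_median = rates.groupby("our_hotel_id")["price"].transform("median")
rates["price_index"] = rates["price"] / hotel_median
```

The median is used rather than the mean because a single mispriced rate barely moves it, which keeps the reference price robust against the very outliers being searched for.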

**Grading: Provide us with the `(our_hotel_id, stay_date)` pair of the most significant outlier.**

Hint 1: For aggregations, see the Pandas docs on [DataFrame.groupby()](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.groupby.html) and [named aggregation](https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html#named-aggregation).
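To illustrate the named-aggregation syntax from Hint 1, the sketch below computes a few per-hotel statistics in one `groupby` call. The `price` column and the output column names (`median_price`, `min_price`, `n_rates`) are assumptions chosen for this example.

```python
import pandas as pd

# Toy data; `price` is an assumed column name.
rates = pd.DataFrame({
    "our_hotel_id": [1, 1, 1, 2, 2],
    "price": [100.0, 110.0, 90.0, 40.0, 60.0],
})

# Named aggregation: output_column=(input_column, aggregation function).
per_hotel = rates.groupby("our_hotel_id").agg(
    median_price=("price", "median"),
    min_price=("price", "min"),
    n_rates=("price", "count"),
)
```

The result has one row per hotel, indexed by `our_hotel_id`, and can be merged back onto the rate rows if needed.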

Hint 2: When you have found an outlier, plot the price value of this offer over all stay dates to see if it is really a valuable outlier.
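A sketch of the check suggested in Hint 2, assuming a frame that already carries a `price_index` column (both that name and `price` are assumptions, as are the toy values): pick the row with the lowest index, then pull that hotel's prices over all stay dates for plotting.

```python
import pandas as pd

# Toy frame with a precomputed, assumed `price_index` column
# (here: price divided by the hotel's median price).
indexed = pd.DataFrame({
    "our_hotel_id": [1, 1, 1, 2, 2, 2],
    "stay_date": pd.to_datetime(["2021-11-09", "2021-11-10", "2021-11-11"] * 2),
    "price": [200.0, 25.0, 205.0, 8.0, 9.0, 8.0],
    "price_index": [1.0, 0.125, 1.025, 1.0, 1.125, 1.0],
})

# Row with the lowest price index, i.e. the strongest candidate outlier.
best = indexed.loc[indexed["price_index"].idxmin()]
outlier_key = (int(best["our_hotel_id"]), best["stay_date"].date())

# Prices of that hotel over all stay dates, ordered by date.
history = (
    indexed[indexed["our_hotel_id"] == best["our_hotel_id"]]
    .set_index("stay_date")["price"]
    .sort_index()
)
```

In a notebook, `history.plot()` draws the series (this requires matplotlib). A single sharp dip between otherwise stable prices suggests a genuine pricing mistake, whereas a long stretch of low prices is more likely a seasonal rate.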


## Task 2: Patterns per destination 

From the price indexes you calculated in the previous task, you can derive the pricing patterns per destination. For this task, calculate the business/leisure score (a score indicating whether the pricing behaviour is mostly business or leisure driven) per destination in the following steps:

1. Go from the price index per (hotel, offer, stay date) to a price index per (hotel, stay date).
2. Go from the price index per (hotel, stay date) to a price index per (destination, stay date).
3. Go from the price index per (destination, stay date) to a price index per (destination, day of week).
4. Create a business/leisure score from the price indexes on business days (Sunday through Thursday) and the price indexes on leisure days (Friday and Saturday).

For aggregations, you can use medians.
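The four steps can be sketched as a chain of median aggregations. Everything below runs on toy data: the `offer_id`, `destination_id`, and `price_index` column names are assumptions, and defining the score as the ratio of the leisure-day median to the business-day median is just one possible choice, not a prescribed formula.

```python
import pandas as pd

# Toy data: one week (Monday 2021-11-08 to Sunday 2021-11-14), two offers per
# hotel, one hotel per destination. All names except our_hotel_id and
# stay_date are assumed.
rows = []
for date in pd.date_range("2021-11-08", periods=7):
    is_leisure_day = date.dayofweek in (4, 5)  # 4 = Friday, 5 = Saturday
    for offer_id in ("a", "b"):
        rows.append((1, offer_id, "city", date, 0.8 if is_leisure_day else 1.1))
        rows.append((2, offer_id, "beach", date, 1.3 if is_leisure_day else 0.9))
offers = pd.DataFrame(
    rows,
    columns=["our_hotel_id", "offer_id", "destination_id", "stay_date", "price_index"],
)

# Step 1: (hotel, offer, stay date) -> (hotel, stay date).
per_hotel = offers.groupby(
    ["our_hotel_id", "destination_id", "stay_date"], as_index=False
)["price_index"].median()

# Step 2: (hotel, stay date) -> (destination, stay date).
per_destination = per_hotel.groupby(
    ["destination_id", "stay_date"], as_index=False
)["price_index"].median()

# Step 3: (destination, stay date) -> (destination, day of week),
# reshaped to one row per destination and one column per weekday.
per_destination["day_of_week"] = per_destination["stay_date"].dt.day_name()
per_weekday = (
    per_destination.groupby(["destination_id", "day_of_week"])["price_index"]
    .median()
    .unstack("day_of_week")
)

# Step 4: assumed score = median leisure-day index / median business-day index.
leisure_days = ["Friday", "Saturday"]
business_days = ["Sunday", "Monday", "Tuesday", "Wednesday", "Thursday"]
score = per_weekday[leisure_days].median(axis=1) / per_weekday[business_days].median(axis=1)
```

With this definition, a score above 1 means prices peak on Friday and Saturday (leisure driven), and a score below 1 means they peak on business days. `score.idxmin()` and `score.idxmax()` then give the two destinations to plot.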

**Grading: Plot the price index per (destination, day of week) for the destination with the lowest and the destination with the highest business/leisure score.**