# **DATA COLLECTION NOTEBOOK**

## Objectives

* Fetch data from Kaggle (using API) and save raw data.
* Inspect the data, fix any data quality issues, and save the result under outputs/datasets/collection.

## Inputs

* Kaggle authentication token - JSON file.

## Outputs

* Generate Dataset: outputs/datasets/collection/raw.csv

## Additional Comments

* The data is originally from the article [Hotel Booking Demand Datasets](https://www.sciencedirect.com/science/article/pii/S2352340918315191), written by Nuno Antonio, Ana Almeida, and Luis Nunes for Data in Brief, Volume 22, February 2019.
* The version hosted on Kaggle was cleaned by Thomas Mock and Antoine Bichat for [#TidyTuesday during the week of February 11th, 2020](https://github.com/rfordatascience/tidytuesday/blob/main/data/2020/2020-02-11/readme.md).
* The data carries an [Attribution license](https://creativecommons.org/licenses/by/4.0/) and is therefore suitable for hosting on a public repository as long as the dataset is attributed to the original authors.

---

# Fetch data from Kaggle

Manually copy and paste the kaggle.json file (the authentication token) into the root directory for this project.

**NOTE:** If you don't have an authentication token:
1. Log in to your account on the Kaggle website
2. From the main menu, navigate to the **Settings** page and the **API** section
3. Click **Expire Token**
4. Click **Create New Token**

Then run the cell below so that the token is recognised.

In [2]:
import os
from pathlib import Path
import platform

# Set KAGGLE_CONFIG_DIR
project_root = Path.cwd().parent
os.environ['KAGGLE_CONFIG_DIR'] = str(project_root)
print('KAGGLE_CONFIG_DIR has been set to', os.environ['KAGGLE_CONFIG_DIR'],'\n')

# Check kaggle.json found in directory
kaggle_file = project_root / 'kaggle.json'
if not kaggle_file.exists():
    print(
        f'ERROR: kaggle.json NOT found at {kaggle_file}\n',
        'Please ensure kaggle.json file is in the correct location and run this cell again.'
    )
else:
    print(f'SUCCESS: Found kaggle.json at {kaggle_file}\n')
    
    # Restrict the token to owner read/write - needed for Mac and Linux
    if platform.system() in ['Linux', 'Darwin']:
        # Double quotes protect the path in case it contains spaces
        !chmod 600 "{kaggle_file}"
        print('Set file permissions successfully')

KAGGLE_CONFIG_DIR has been set to c:\Users\Music\Desktop\Code Institute\P5 Hotel Bookings\hotel-bookings 

SUCCESS: Found kaggle.json at c:\Users\Music\Desktop\Code Institute\P5 Hotel Bookings\hotel-bookings\kaggle.json
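
The permission step can also be done without shelling out to `chmod`. The sketch below is one possible approach using only the standard library; the helper name `restrict_to_owner` is hypothetical and not part of this notebook or of the Kaggle tooling.

```python
import os
import stat
from pathlib import Path


def restrict_to_owner(token_path: Path) -> bool:
    """Set owner-only read/write (mode 600) on POSIX systems.

    Returns True if the permissions were changed, False if the step was skipped.
    `restrict_to_owner` is a hypothetical helper name used for illustration.
    """
    if os.name != "posix":
        # Windows does not use POSIX permission bits, so skip the step
        return False
    token_path.chmod(stat.S_IRUSR | stat.S_IWUSR)  # same effect as chmod 600
    return True
```

In the cell above, this would be called as `restrict_to_owner(kaggle_file)` once the file has been found. Because no shell command is involved, spaces in the project path need no quoting.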



If the authentication token was found, download the dataset.

In [3]:
KaggleDatasetPath = 'jessemostipak/hotel-booking-demand'
DestinationFolder = project_root / 'inputs' / 'datasets' / 'raw'  # pathlib library overloads '/' to make a path-join operator

# Make target directory if it doesn't exist 
os.makedirs(DestinationFolder, exist_ok=True)

# Download dataset as zip - double quotes used in case filepath contains spaces
! kaggle datasets download -d {KaggleDatasetPath} -p "{DestinationFolder}"

Dataset URL: https://www.kaggle.com/datasets/jessemostipak/hotel-booking-demand
License(s): Attribution 4.0 International (CC BY 4.0)
Downloading hotel-booking-demand.zip to c:\Users\Music\Desktop\Code Institute\P5 Hotel Bookings\hotel-bookings\inputs\datasets\raw




  0%|          | 0.00/1.25M [00:00<?, ?B/s]
100%|██████████| 1.25M/1.25M [00:00<00:00, 528MB/s]
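
Outside a notebook, where the `!` shell escape is unavailable, the same CLI call could be made with `subprocess`. This is a minimal sketch that assumes the `kaggle` command is on the PATH and reuses only the `-d` and `-p` flags shown above; the function names `build_download_command` and `download_dataset` are hypothetical.

```python
import subprocess
from pathlib import Path


def build_download_command(dataset: str, destination: Path) -> list[str]:
    """Build the argument list for the Kaggle CLI download call."""
    # Passing arguments as a list avoids quoting problems with spaces in paths
    return ["kaggle", "datasets", "download", "-d", dataset, "-p", str(destination)]


def download_dataset(dataset: str, destination: Path) -> None:
    """Create the destination folder and run the download (needs a valid token)."""
    destination.mkdir(parents=True, exist_ok=True)
    # check=True raises CalledProcessError if the CLI exits with an error
    subprocess.run(build_download_command(dataset, destination), check=True)
```

With the names used in this notebook, the call would be `download_dataset(KaggleDatasetPath, DestinationFolder)`.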


Unzip the dataset, delete the zip file, and delete kaggle.json.

*NOTE: This code should work across different operating systems, including Windows.*

In [4]:
import zipfile

# Unzip all zip files in DestinationFolder
destination = Path(DestinationFolder)
for zip_path in destination.glob('*.zip'):
    with zipfile.ZipFile(zip_path, 'r') as zip_ref:
        zip_ref.extractall(destination)
    os.remove(zip_path)

# Remove kaggle.json file
if kaggle_file.exists():
    kaggle_file.unlink()
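
To confirm what the archive contained, the extraction loop above can be wrapped in a function that returns the member names. This is a sketch under the same assumptions as the cell above (every `*.zip` in the folder should be extracted and then removed); the name `extract_and_list` is hypothetical.

```python
import zipfile
from pathlib import Path


def extract_and_list(folder: Path) -> list[str]:
    """Extract every zip archive in `folder`, delete it, and return the member names."""
    extracted = []
    for zip_path in sorted(folder.glob("*.zip")):
        with zipfile.ZipFile(zip_path, "r") as archive:
            archive.extractall(folder)
            extracted.extend(archive.namelist())
        # Reached only if extraction did not raise, so a bad archive is kept
        zip_path.unlink()
    return extracted
```

A quick check such as `'hotel_bookings.csv' in extract_and_list(DestinationFolder)` would then show whether the file loaded in the next section is present.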

# Inspect Data

Load Data

In [5]:
import pandas as pd
my_file = DestinationFolder / 'hotel_bookings.csv'
df = pd.read_csv(my_file)
df.head(3)

Unnamed: 0,hotel,is_canceled,lead_time,arrival_date_year,arrival_date_month,arrival_date_week_number,arrival_date_day_of_month,stays_in_weekend_nights,stays_in_week_nights,adults,...,deposit_type,agent,company,days_in_waiting_list,customer_type,adr,required_car_parking_spaces,total_of_special_requests,reservation_status,reservation_status_date
0,Resort Hotel,0,342,2015,July,27,1,0,0,2,...,No Deposit,,,0,Transient,0.0,0,0,Check-Out,2015-07-01
1,Resort Hotel,0,737,2015,July,27,1,0,0,2,...,No Deposit,,,0,Transient,0.0,0,0,Check-Out,2015-07-01
2,Resort Hotel,0,7,2015,July,27,1,0,1,1,...,No Deposit,,,0,Transient,75.0,0,0,Check-Out,2015-07-02


Assess size of dataset

In [6]:
df.shape

(119390, 32)

View summary info

In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 119390 entries, 0 to 119389
Data columns (total 32 columns):
 #   Column                          Non-Null Count   Dtype  
---  ------                          --------------   -----  
 0   hotel                           119390 non-null  object 
 1   is_canceled                     119390 non-null  int64  
 2   lead_time                       119390 non-null  int64  
 3   arrival_date_year               119390 non-null  int64  
 4   arrival_date_month              119390 non-null  object 
 5   arrival_date_week_number        119390 non-null  int64  
 6   arrival_date_day_of_month       119390 non-null  int64  
 7   stays_in_weekend_nights         119390 non-null  int64  
 8   stays_in_week_nights            119390 non-null  int64  
 9   adults                          119390 non-null  int64  
 10  children                        119386 non-null  float64
 11  babies                          119390 non-null  int64  
 12  meal            

Assess whether there are duplicate records

In [19]:
df[df.duplicated()]

Unnamed: 0,hotel,is_canceled,lead_time,arrival_date_year,arrival_date_month,arrival_date_week_number,arrival_date_day_of_month,stays_in_weekend_nights,stays_in_week_nights,adults,...,deposit_type,agent,company,days_in_waiting_list,customer_type,adr,required_car_parking_spaces,total_of_special_requests,reservation_status,reservation_status_date
5,Resort Hotel,0,14,2015,July,27,1,0,2,2,...,No Deposit,240.0,,0,Transient,98.00,0,1,Check-Out,2015-07-03
22,Resort Hotel,0,72,2015,July,27,1,2,4,2,...,No Deposit,250.0,,0,Transient,84.67,0,1,Check-Out,2015-07-07
43,Resort Hotel,0,70,2015,July,27,2,2,3,2,...,No Deposit,250.0,,0,Transient,137.00,0,1,Check-Out,2015-07-07
138,Resort Hotel,1,5,2015,July,28,5,1,0,2,...,No Deposit,240.0,,0,Transient,97.00,0,0,Canceled,2015-07-01
200,Resort Hotel,0,0,2015,July,28,7,0,1,1,...,No Deposit,240.0,,0,Transient,109.80,0,3,Check-Out,2015-07-08
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
119349,City Hotel,0,186,2017,August,35,31,0,3,2,...,No Deposit,9.0,,0,Transient,126.00,0,2,Check-Out,2017-09-03
119352,City Hotel,0,63,2017,August,35,31,0,3,3,...,No Deposit,9.0,,0,Transient-Party,195.33,0,2,Check-Out,2017-09-03
119353,City Hotel,0,63,2017,August,35,31,0,3,3,...,No Deposit,9.0,,0,Transient-Party,195.33,0,2,Check-Out,2017-09-03
119354,City Hotel,0,63,2017,August,35,31,0,3,3,...,No Deposit,9.0,,0,Transient-Party,195.33,0,2,Check-Out,2017-09-03
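
Viewing the duplicated rows shows which records repeat, but not how many there are relative to the whole dataset. The sketch below is one way to summarise that with pandas; `duplicate_summary` is a hypothetical helper, and the toy frame is illustrative data, not rows from the hotel dataset.

```python
import pandas as pd


def duplicate_summary(frame: pd.DataFrame) -> dict:
    """Count rows that repeat an earlier row and their share of the frame."""
    is_repeat = frame.duplicated()  # True for the second and later copies only
    return {
        "duplicate_rows": int(is_repeat.sum()),
        "share_of_rows": float(is_repeat.mean()),
        # keep=False also marks the first copy of each repeated row
        "rows_in_duplicate_groups": int(frame.duplicated(keep=False).sum()),
    }


# Toy data for illustration only
toy = pd.DataFrame({"hotel": ["A", "A", "B", "A"], "lead_time": [1, 1, 2, 1]})
summary = duplicate_summary(toy)
```

Calling `duplicate_summary(df)` on the loaded dataset would give the counts behind the table above. Whether identical rows are errors or legitimate separate bookings cannot be decided from the counts alone.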


View extent of missing data

In [11]:
df.isna().sum()

hotel                                  0
is_canceled                            0
lead_time                              0
arrival_date_year                      0
arrival_date_month                     0
arrival_date_week_number               0
arrival_date_day_of_month              0
stays_in_weekend_nights                0
stays_in_week_nights                   0
adults                                 0
children                               4
babies                                 0
meal                                   0
country                              488
market_segment                         0
distribution_channel                   0
is_repeated_guest                      0
previous_cancellations                 0
previous_bookings_not_canceled         0
reserved_room_type                     0
assigned_room_type                     0
booking_changes                        0
deposit_type                           0
agent                              16340
company         
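
Raw counts are easier to judge as a percentage of the rows. The sketch below is one possible way to report both for the columns that have missing values; `missing_report` is a hypothetical helper, and the toy frame is illustrative data only.

```python
import pandas as pd


def missing_report(frame: pd.DataFrame) -> pd.DataFrame:
    """Return count and percentage of missing values for columns that have any."""
    counts = frame.isna().sum()
    report = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(frame) * 100).round(2),
    })
    # Keep only columns with at least one missing value, largest first
    return report[report["missing"] > 0].sort_values("missing", ascending=False)


# Toy data for illustration only
toy = pd.DataFrame({
    "children": [0.0, None, 1.0, 2.0],
    "agent": [None, None, 9.0, None],
    "adults": [2, 2, 1, 3],
})
report = missing_report(toy)
```

Running `missing_report(df)` on the loaded dataset would list only the affected columns, which helps when deciding between imputing values and dropping a column.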

---

# Conclusions and Next Steps

An initial look at the data confirms that the dataset is large (119,390 rows and 32 columns) and requires some cleaning (e.g. handling missing values and changing data types). Approximately one quarter of the records are duplicates, which require further investigation to determine whether they are legitimate or erroneous.

Next steps:
- Carry out Exploratory Data Analysis (EDA) to become more familiar with the data
- Explore missing and duplicate values
- Save a cleaned dataset
- Investigate patterns in the data, especially correlations between features and the target variable