# **Data collection notebook**

## Objectives

* Fetch data from Kaggle and save it as raw data
* Inspect the data and save it under outputs/datasets/collection

## Inputs

* Kaggle JSON file - the authentication token

## Outputs

* Generate dataset: outputs/datasets/collection/hotel_bookings.csv


---

# Change working directory

We need to change the working directory from its current folder to its parent folder
* We access the current directory with os.getcwd()

In [None]:
import os
current_dir = os.getcwd()
current_dir

We want to make the parent of the current directory the new current directory.
* os.path.dirname() gets the parent directory
* os.chdir() sets the new current directory

In [None]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")
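
As a minimal illustration of how `os.path.dirname()` resolves the parent folder, the sketch below applies it to a made-up path. The folder names are hypothetical and do not refer to this project's actual layout.

```python
import os

# Hypothetical path, used only for illustration
notebook_dir = os.path.join("workspace", "project", "jupyter_notebooks")

# os.path.dirname() drops the last path component, which gives the parent folder
parent_dir = os.path.dirname(notebook_dir)
print(parent_dir)
```

Note that `os.path.dirname()` works purely on the string, so it does not check whether the folder exists.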

Confirm the new current directory

In [None]:
current_dir = os.getcwd()
current_dir

# Fetch data from Kaggle

Install the Kaggle package to fetch the data

In [None]:
%pip install kaggle==1.5.12

Run the cell below so that the token is recognised in the session

In [None]:
import os
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()
! chmod 600 kaggle.json

We are using the following dataset from Kaggle: [Kaggle URL](https://www.kaggle.com/datasets/jessemostipak/hotel-booking-demand)

Get the dataset path from the Kaggle URL
* When you are viewing the dataset on Kaggle, the dataset path is the part of the URL that follows https://www.kaggle.com/datasets/
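
The dataset path can also be extracted programmatically. The sketch below is one possible approach, assuming the URL has the form `https://www.kaggle.com/datasets/<owner>/<dataset>`; the helper name `kaggle_dataset_path` is hypothetical and not part of the Kaggle package.

```python
from urllib.parse import urlparse


def kaggle_dataset_path(url):
    """Return the '<owner>/<dataset>' part of a Kaggle dataset URL (hypothetical helper)."""
    # The URL path is expected to look like '/datasets/<owner>/<dataset>'
    parts = [part for part in urlparse(url).path.split("/") if part]
    if len(parts) < 3 or parts[0] != "datasets":
        raise ValueError(f"Not a Kaggle dataset URL: {url}")
    return "/".join(parts[1:3])


dataset_path = kaggle_dataset_path(
    "https://www.kaggle.com/datasets/jessemostipak/hotel-booking-demand"
)
print(dataset_path)  # jessemostipak/hotel-booking-demand
```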

Define the Kaggle dataset and destination folder and download it

In [None]:
KaggleDatasetPath = "jessemostipak/hotel-booking-demand"
DestinationFolder = "inputs/datasets/raw"
! kaggle datasets download -d {KaggleDatasetPath} -p {DestinationFolder}

Unzip the downloaded file, delete the zip file, and delete the kaggle.json file

In [None]:
! unzip {DestinationFolder}/*.zip -d {DestinationFolder} \
    && rm {DestinationFolder}/*.zip \
    && rm kaggle.json
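
If the `unzip` shell command is not available (for example on Windows), the extraction step could be done with the standard-library `zipfile` module instead. This is a sketch of an alternative, not the method used in this notebook; the helper name `extract_and_remove_zips` and the demo archive are made up for illustration, and the sketch does not delete kaggle.json.

```python
import tempfile
import zipfile
from pathlib import Path


def extract_and_remove_zips(folder):
    """Extract every .zip in `folder` into it, then delete the archives (hypothetical helper)."""
    extracted = []
    # Materialise the matches first so deleting files does not affect iteration
    for zip_path in sorted(Path(folder).glob("*.zip")):
        with zipfile.ZipFile(zip_path) as archive:
            archive.extractall(folder)
            extracted.extend(archive.namelist())
        zip_path.unlink()  # remove the archive once its contents are extracted
    return extracted


# Demonstration on a throwaway folder with a tiny, made-up archive
with tempfile.TemporaryDirectory() as tmp:
    with zipfile.ZipFile(Path(tmp) / "demo.zip", "w") as archive:
        archive.writestr("demo.csv", "a,b\n1,2\n")
    demo_files = extract_and_remove_zips(tmp)
    remaining = sorted(p.name for p in Path(tmp).iterdir())
    print(demo_files, remaining)
```

In the notebook itself, the call would presumably be `extract_and_remove_zips(DestinationFolder)`.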

---

# Load and inspect Kaggle data

## Dataframe description and summary

In [None]:
import pandas as pd

df = pd.read_csv("inputs/datasets/raw/hotel_bookings.csv")
df.head()

In [None]:
df.info()

In [None]:
df.isnull().sum()

### Dataset description:
| __Index__ | __Variable__ | __Description__ |
|   :---    |     :---     |       :---      |
| 0 | __hotel__ | Type of hotel (Resort Hotel, City Hotel) |
| 1 | __is_canceled__ | Reservation cancellation status (0 = not canceled, 1 = canceled) |
| 2 | __lead_time__ | Number of days between booking and arrival |
| 3 | __arrival_date_year__ | Year of arrival |
| 4 | __arrival_date_month__ | Month of arrival |
| 5 | __arrival_date_week_number__ | Week number of the year for arrival |
| 6 | __arrival_date_day_of_month__ | Day of the month of arrival |
| 7 | __stays_in_weekend_nights__ | Number of weekend nights (Saturday and Sunday) the guest stayed or booked |
| 8 | __stays_in_week_nights__ | Number of week nights (Monday to Friday) the guest stayed or booked |
| 9 | __adults__ | Number of adults |
| 10 | __children__ | Number of children |
| 11 | __babies__ | Number of babies |
| 12 | __meal__ | Type of meal booked (BB (Bed & Breakfast), FB (Full-Board), HB (Half-Board), SC/Undefined (No Meal)) |
| 13 | __country__ | Country of origin of the guest |
| 14 | __market_segment__ | Market segment designation |
| 15 | __distribution_channel__ | Booking distribution channel |
| 16 | __is_repeated_guest__ | If the guest is a repeat customer (0 = not repeated, 1 = repeated) |
| 17 | __previous_cancellations__ | Number of previous bookings that were canceled by the customer |
| 18 | __previous_bookings_not_canceled__ | Number of previous bookings that were not canceled by the customer |
| 19 | __reserved_room_type__ | Type of reserved room |
| 20 | __assigned_room_type__ | Type of assigned room |
| 21 | __booking_changes__ | Number of changes made to the booking |
| 22 | __deposit_type__ | Type of deposit made (No Deposit, Refundable, Non Refund) |
| 23 | __agent__ | ID of the travel agent responsible for the booking |
| 24 | __company__ | ID of the company responsible for the booking |
| 25 | __days_in_waiting_list__ | Number of days the booking was in the waiting list |
| 26 | __customer_type__ | Type of customer (Transient, Contract, Transient-Party, Group) |
| 27 | __adr__ | Average Daily Rate |
| 28 | __required_car_parking_spaces__ | Number of car parking spaces required |
| 29 | __total_of_special_requests__ | Number of special requests made |
| 30 | __reservation_status__ | Last reservation status (Check-Out, Canceled, No-Show) |
| 31 | __reservation_status_date__ | Date of the last reservation status |

### Dataset summary:
- **Number of entries/rows:** 119,390
- **Number of columns:** 32
- **Data types:** float64 (4 columns), int64 (16 columns), object (12 columns)
- **Missing values:**
    - **children:** 4 missing values
    - **country:** 488 missing values
    - **agent:** 16,340 missing values
    - **company:** 112,593 missing values
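
The figures in this summary come from `df.info()` and `df.isnull().sum()`. As a rough sketch of how they could be gathered in one place, the example below runs the same calls on a tiny made-up frame (the `sample` data is invented and only stands in for the real dataset).

```python
import pandas as pd

# Tiny made-up frame standing in for the real dataset
sample = pd.DataFrame({
    "children": [0.0, None, 2.0],
    "country": ["PRT", None, None],
    "adults": [2, 1, 2],
})

n_rows, n_cols = sample.shape                            # number of rows and columns
dtype_counts = sample.dtypes.astype(str).value_counts()  # number of columns per data type
missing = sample.isnull().sum()
missing = missing[missing > 0]                           # keep only columns with missing values

print(f"Rows: {n_rows}, columns: {n_cols}")
print(dtype_counts)
print(missing)
```

Replacing `sample` with `df` should reproduce the numbers listed above.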

---

# Push files to Repo

In [None]:
import os

# Create the output folder; exist_ok=True avoids an error if it already exists
os.makedirs("outputs/datasets/collection", exist_ok=True)

df.to_csv("outputs/datasets/collection/hotel_bookings.csv", index=False)
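
After saving, it can be worth confirming that the file reads back as expected. The sketch below shows one way to do this on a made-up frame written to a temporary folder; the `sample` data and the `sample.csv` file name are assumptions for illustration only.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Made-up frame standing in for the real dataset
sample = pd.DataFrame({"hotel": ["City Hotel", "Resort Hotel"], "adults": [2, 1]})

with tempfile.TemporaryDirectory() as tmp:
    out_dir = Path(tmp) / "outputs" / "datasets" / "collection"
    out_dir.mkdir(parents=True, exist_ok=True)  # same idea as os.makedirs(..., exist_ok=True)
    out_file = out_dir / "sample.csv"
    sample.to_csv(out_file, index=False)

    # Read the file back and compare its shape and column names with the original
    reloaded = pd.read_csv(out_file)
    same_shape = reloaded.shape == sample.shape
    same_columns = list(reloaded.columns) == list(sample.columns)

print(same_shape, same_columns)
```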