# **Data collection notebook**

## Objectives

* Fetch data from Kaggle and save it as raw data
* Inspect the data and save it under outputs/datasets/collection

## Inputs

* Kaggle JSON file - the authentication token

## Outputs

* Generate dataset: outputs/datasets/collection/hotel_bookings.csv


---

# Change working directory

We need to change the working directory from its current folder to its parent folder
* We access the current directory with os.getcwd()

In [None]:
import os
current_dir = os.getcwd()
current_dir

We want to make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() sets the new current directory

In [None]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

Confirm the new current directory

In [None]:
current_dir = os.getcwd()
current_dir
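
Re-running the `os.chdir` cell above moves up one more level each time it runs. A minimal sketch of a guard, assuming the notebook lives in a subfolder whose name you know. The folder name `jupyter_notebooks` and the helper `move_to_project_root` are hypothetical and not part of this project:

```python
import os


def move_to_project_root(notebook_folder="jupyter_notebooks"):
    """Move to the parent folder only while still inside the notebook folder.

    `notebook_folder` is an assumed folder name; adjust it to your repo layout.
    """
    current_dir = os.getcwd()
    if os.path.basename(current_dir) == notebook_folder:
        os.chdir(os.path.dirname(current_dir))
    return os.getcwd()
```

Calling `move_to_project_root()` a second time leaves the working directory unchanged, so the cell is safe to re-run.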

# Fetch data from Kaggle

Install the Kaggle package to fetch the data

In [None]:
%pip install kaggle==1.5.12

Run the cell below so that the token is recognised in the session. It assumes kaggle.json is in the current working directory.

In [None]:
import os
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()
! chmod 600 kaggle.json
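
The same setup can be done without the shell command. This is a minimal sketch, assuming kaggle.json sits in the folder passed in. The helper name `prepare_kaggle_token` is hypothetical:

```python
import os
import stat


def prepare_kaggle_token(config_dir):
    """Point Kaggle at `config_dir` and restrict kaggle.json to owner read/write."""
    token_path = os.path.join(config_dir, "kaggle.json")
    if not os.path.isfile(token_path):
        raise FileNotFoundError(f"kaggle.json not found in {config_dir}")
    os.environ["KAGGLE_CONFIG_DIR"] = config_dir
    # Equivalent to `chmod 600 kaggle.json` on POSIX systems
    os.chmod(token_path, stat.S_IRUSR | stat.S_IWUSR)
    return token_path
```

For example, `prepare_kaggle_token(os.getcwd())` mirrors the cell above and fails early with a clear message if the token file is missing.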

We are using the following dataset from Kaggle: [Kaggle URL](https://www.kaggle.com/datasets/jessemostipak/hotel-booking-demand)

Get the dataset path from the Kaggle URL
* When you are viewing the dataset on Kaggle, check what comes after https://www.kaggle.com/datasets/
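
For the URL above, the dataset path is the `<owner>/<dataset>` part of the address. A small sketch of extracting it in code, assuming the URL follows the `https://www.kaggle.com/datasets/<owner>/<dataset>` layout. The helper name `dataset_path_from_url` is hypothetical:

```python
from urllib.parse import urlparse


def dataset_path_from_url(url):
    """Return '<owner>/<dataset>' from a Kaggle dataset URL."""
    parts = [p for p in urlparse(url).path.split("/") if p]
    if len(parts) < 3 or parts[0] != "datasets":
        raise ValueError(f"Not a Kaggle dataset URL: {url}")
    return "/".join(parts[1:3])


print(dataset_path_from_url(
    "https://www.kaggle.com/datasets/jessemostipak/hotel-booking-demand"
))
# prints: jessemostipak/hotel-booking-demand
```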

Define the Kaggle dataset and destination folder and download it

In [None]:
KaggleDatasetPath = "jessemostipak/hotel-booking-demand"
DestinationFolder = "inputs/datasets/raw"
! kaggle datasets download -d {KaggleDatasetPath} -p {DestinationFolder}

Unzip the downloaded file, delete the zip file, and delete the kaggle.json file

In [None]:
! unzip {DestinationFolder}/*.zip -d {DestinationFolder} \
    && rm {DestinationFolder}/*.zip \
    && rm kaggle.json
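
If the `unzip` command is not available (for example on Windows), the same step can be sketched with the standard library. This is an alternative under stated assumptions, not the notebook's own method. The helper name `unzip_and_remove` is hypothetical:

```python
import glob
import os
import zipfile


def unzip_and_remove(folder):
    """Extract every .zip in `folder` into `folder`, then delete the archive."""
    extracted = []
    for zip_path in glob.glob(os.path.join(folder, "*.zip")):
        with zipfile.ZipFile(zip_path) as archive:
            archive.extractall(folder)
            extracted.extend(archive.namelist())
        # Remove the archive only after it has been extracted successfully
        os.remove(zip_path)
    return extracted
```

For example, `unzip_and_remove("inputs/datasets/raw")` covers the first two steps of the cell above. Deleting the token remains a separate step, such as `os.remove("kaggle.json")`.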

---

# Load and inspect Kaggle data

## Dataframe description and summary

In [None]:
import pandas as pd

df = pd.read_csv("inputs/datasets/raw/hotel_bookings.csv")
df.head()

Sample 8000 rows to get a smaller dataset and a smaller pkl file size to push to the repo. The fixed random_state keeps the sample reproducible.

In [None]:
df = df.sample(n=8000, random_state=0)

In [None]:
df.info()

---

# Push files to Repo

In [None]:
import os

# Create the output folder; exist_ok=True avoids an error if it already exists
os.makedirs(name="outputs/datasets/collection", exist_ok=True)

# Save the sampled dataset without the index column
df.to_csv("outputs/datasets/collection/hotel_bookings.csv", index=False)