# **Data Collection Notebook**

## Objectives

* Fetch data from Kaggle and save as raw data
* Inspect the data and save it under outputs/datasets/collection


## Inputs

* Kaggle JSON file - the authentication token

## Outputs

* Generate Dataset: outputs/datasets/collection/HousePrices.csv
* Generate Dataset: outputs/datasets/collection/InheritedHouses.csv

## Additional Comments

* Data collection is essential to meet the two business requirements.

* The client wants to explore how house features relate to sale price through data visualisations.

* A reliable dataset is also needed to train a model that predicts house prices in Ames, Iowa.


---

## Install Python packages in the notebooks

## Change working directory

In [None]:
import os
current_dir = os.getcwd()
current_dir

Change the working directory to the parent folder of the current directory

In [None]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

Confirm the new current directory

In [None]:
current_dir = os.getcwd()
current_dir

## Fetch data from Kaggle

Install the Kaggle package to fetch the data

In [None]:
%pip install kaggle==1.5.12

In [None]:
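import os

# Tell the Kaggle client to look for kaggle.json in the current directory
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()
# Restrict the token file to owner read/write
! chmod 600 kaggle.json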
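
The cell above assumes that `kaggle.json` already sits in the working directory. As a minimal sketch, the same setup can be done in pure Python with an explicit check for the token file. The helper name `prepare_kaggle_token` is hypothetical and is not part of the Kaggle package.

```python
import os
import stat
from pathlib import Path


def prepare_kaggle_token(config_dir: str) -> Path:
    """Check for kaggle.json in config_dir and restrict it to owner read/write.

    Hypothetical helper for illustration only.
    """
    token_path = Path(config_dir) / "kaggle.json"
    if not token_path.is_file():
        raise FileNotFoundError(f"Expected a Kaggle token at {token_path}")
    # Same effect as `chmod 600 kaggle.json` on POSIX systems
    token_path.chmod(stat.S_IRUSR | stat.S_IWUSR)
    # Point the Kaggle client at the folder that holds the token
    os.environ["KAGGLE_CONFIG_DIR"] = str(config_dir)
    return token_path
```

Calling `prepare_kaggle_token(os.getcwd())` would mirror the cell above, failing early with a clear message if the token is missing.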

Define the Kaggle dataset and the destination folder, then download the dataset.

In [None]:
KaggleDatasetPath = "codeinstitute/housing-prices-data"
DestinationFolder = "inputs/datasets/raw"
! kaggle datasets download -d {KaggleDatasetPath} -p {DestinationFolder}

Unzip the downloaded file, then delete the zip file and the kaggle.json file

In [None]:
! unzip {DestinationFolder}/*.zip -d {DestinationFolder} \
  && rm {DestinationFolder}/*.zip \
  && rm kaggle.json
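
The shell command above relies on `unzip` and `rm` being available. A portable alternative, sketched here with the standard-library `zipfile` module, extracts every archive in a folder and then removes it. The function name `extract_and_remove_zips` is an assumption made for this example, and the sketch leaves `kaggle.json` untouched.

```python
import zipfile
from pathlib import Path


def extract_and_remove_zips(folder: str) -> list:
    """Extract every .zip archive in folder into that folder, then delete the archive.

    Hypothetical helper; returns the names of all extracted members.
    """
    extracted = []
    for zip_path in sorted(Path(folder).glob("*.zip")):
        with zipfile.ZipFile(zip_path) as archive:
            archive.extractall(folder)
            extracted.extend(archive.namelist())
        # Remove the archive only after it has been extracted successfully
        zip_path.unlink()
    return extracted
```

With the variables defined earlier, `extract_and_remove_zips(DestinationFolder)` would cover the unzip and zip-removal steps.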

---

## Load and Inspect Kaggle data

In [None]:
import pandas as pd

# Folder created when the Kaggle archive is unzipped
raw_dir = "inputs/datasets/raw/house-price-20211124T154130Z-001/house-price"

df = pd.read_csv(f"{raw_dir}/house_prices_records.csv")
df_inherited = pd.read_csv(f"{raw_dir}/inherited_houses.csv")
df.head()

Display basic info about the main dataset

In [None]:
df.info()

Preview the first rows of the inherited houses dataset

In [None]:
df_inherited.head()

Check for duplicates

In [None]:
df.duplicated().sum()

In [None]:
df_inherited.duplicated().sum()

Inspect three important categorical features that may impact house prices:

In [None]:
df["BsmtExposure"].value_counts(dropna=False)

In [None]:
df["GarageFinish"].value_counts(dropna=False)

In [None]:
df["KitchenQual"].value_counts(dropna=False)
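
The `value_counts(dropna=False)` calls above show missing values one column at a time. As a hedged sketch, a small helper can summarise missing values across all columns at once. The name `summarise_missing` and the toy data in the usage note are illustrative assumptions, not part of the dataset or of pandas.

```python
import pandas as pd


def summarise_missing(frame: pd.DataFrame) -> pd.DataFrame:
    """Return the count and percentage of missing values per column.

    Hypothetical helper; only columns with at least one missing value are kept.
    """
    counts = frame.isna().sum()
    summary = pd.DataFrame(
        {
            "missing": counts,
            "percent": (counts / len(frame) * 100).round(2),
        }
    )
    # Keep affected columns only, worst first
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)
```

In this notebook the call would be `summarise_missing(df)`. The helper could also be tried on a toy frame such as `pd.DataFrame({"a": [1, None], "b": ["x", "y"]})`.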

---

## Push files to Repo

In [None]:
import os

# Create outputs/datasets/collection if it does not already exist
os.makedirs("outputs/datasets/collection", exist_ok=True)

df.to_csv("outputs/datasets/collection/HousePrices.csv", index=False)
df_inherited.to_csv("outputs/datasets/collection/InheritedHouses.csv", index=False)
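
To confirm that a file was written as expected, one option is to read it back and compare shapes. The sketch below is an assumption about how such a check could look; `save_and_check` is a hypothetical helper and not part of the original notebook.

```python
import os

import pandas as pd


def save_and_check(frame: pd.DataFrame, path: str) -> tuple:
    """Save frame as CSV at path, reload it, and confirm the shape is unchanged.

    Hypothetical helper for illustration only.
    """
    folder = os.path.dirname(path)
    if folder:
        # Create the target folder if it does not already exist
        os.makedirs(folder, exist_ok=True)
    frame.to_csv(path, index=False)
    reloaded = pd.read_csv(path)
    if reloaded.shape != frame.shape:
        raise ValueError(
            f"Shape changed after saving: {frame.shape} became {reloaded.shape}"
        )
    return reloaded.shape
```

For example, `save_and_check(df, "outputs/datasets/collection/HousePrices.csv")` would save the main dataset and return its shape if the round trip succeeds.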

## Summary and the next steps

**Summary** 
- Loaded two datasets: house prices and inherited houses.
- Inspected the structure and column data types.
- Verified that there are no duplicate records.
- Inspected the value counts, including missing values, of three categorical features (`BsmtExposure`, `GarageFinish`, `KitchenQual`).
- Saved both datasets to the `outputs/datasets/collection/` folder for future cleaning and analysis.


The next notebook covers the Data Study.