# **Data Collection Notebook**

## Objectives

* Fetch the data from Kaggle and save it as raw data.
* Inspect the data and save it under outputs/datasets/collection.

## Inputs

* Kaggle JSON File - this is the authentication token to grant access to the dataset.

## Outputs

* Generate the following datasets:
    
    * outputs/datasets/collection/HousePriceRecords.csv
    * outputs/datasets/collection/InheritedHouses.csv

## Additional Comments

* The HousePriceRecords.csv is the data to be used to build the machine learning models.
* The InheritedHouses.csv is the data with the inherited houses whose prices the client would like to predict. 


---

# Change working directory

We need to change the working directory from its current folder to its parent folder
* We access the current directory with os.getcwd()

In [1]:
import os
current_dir = os.getcwd()
current_dir

'/workspace/hertiage-housing/jupyter_notebooks'

We want to make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() defines the new current directory

In [2]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [3]:
current_dir = os.getcwd()
current_dir

'/workspace/hertiage-housing'
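Re-running the change-directory cell moves one level higher each time it is executed. A minimal sketch of a guard against that, assuming the notebooks live in a folder named `jupyter_notebooks` (as in the path printed above). The function name `move_to_parent_if_in` is hypothetical and not part of the original notebook:

```python
import os


def move_to_parent_if_in(folder_name: str = "jupyter_notebooks") -> str:
    """Change to the parent directory only when the current folder is `folder_name`.

    Returns the working directory after the (possible) change.
    """
    current = os.getcwd()
    # Only climb one level if we are still inside the notebooks folder,
    # so running the cell twice does not leave the project root.
    if os.path.basename(current) == folder_name:
        os.chdir(os.path.dirname(current))
    return os.getcwd()
```

Calling `move_to_parent_if_in()` a second time should leave the working directory unchanged, because the current folder no longer matches `folder_name`.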

---

# Fetch the data from Kaggle

**First we must install the Kaggle package in order to fetch the data:**

In [4]:
%pip install kaggle==1.5.12


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m23.0.1[0m[39;49m -> [0m[32;49m24.0[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m
Note: you may need to restart the kernel to use updated packages.


**If it is already installed, this step can be skipped.**

Next we will use the **kaggle.json** token to authenticate and access the data:

In [5]:
import os
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()
! chmod 600 kaggle.json
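The cell above assumes that **kaggle.json** is present in the working directory. A minimal sketch of the same setup in plain Python, with a check that the token exists and holds the `username` and `key` fields found in a Kaggle API token. The function name `prepare_kaggle_token` is hypothetical, and `os.chmod` with owner read/write is used as the equivalent of `chmod 600`:

```python
import json
import os
import stat


def prepare_kaggle_token(config_dir: str) -> str:
    """Validate kaggle.json in `config_dir`, restrict its permissions, and
    point KAGGLE_CONFIG_DIR at that folder. Returns the token path."""
    token_path = os.path.join(config_dir, "kaggle.json")
    if not os.path.isfile(token_path):
        raise FileNotFoundError(f"kaggle.json not found in {config_dir}")

    with open(token_path) as token_file:
        token = json.load(token_file)

    # Assumption: the token is a JSON object with "username" and "key" fields.
    missing = {"username", "key"} - set(token)
    if missing:
        raise ValueError(f"kaggle.json is missing fields: {sorted(missing)}")

    # Owner read/write only, the same permissions as `chmod 600`.
    os.chmod(token_path, stat.S_IRUSR | stat.S_IWUSR)
    os.environ["KAGGLE_CONFIG_DIR"] = config_dir
    return token_path
```

In the notebook this could be called as `prepare_kaggle_token(os.getcwd())` before the download cell.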

The client has provided the following dataset: [Housing Prices](https://www.kaggle.com/datasets/codeinstitute/housing-prices-data)

First, the *Dataset* needs to be downloaded to a destination folder:

In [6]:
KaggleDatasetPath = "codeinstitute/housing-prices-data"
DestinationFolder = "inputs/datasets/raw"   
! kaggle datasets download -d {KaggleDatasetPath} -p {DestinationFolder}

housing-prices-data.zip: Skipping, found more recently modified local copy (use --force to force download)
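The output above shows that the Kaggle CLI skips the download when a newer local copy exists, and that `--force` overrides this. A minimal sketch of building the same command as an argument list, so the `--force` choice is explicit. The helper name `build_download_command` is hypothetical, and only the flags that appear in the cell and its output are used:

```python
def build_download_command(dataset: str, destination: str, force: bool = False) -> list:
    """Return the `kaggle datasets download` command as a list of arguments."""
    command = ["kaggle", "datasets", "download", "-d", dataset, "-p", destination]
    if force:
        # Re-download even when a local copy already exists.
        command.append("--force")
    return command
```

The resulting list could be passed to `subprocess.run(command, check=True)`, assuming the `kaggle` executable is on the path.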


Finally, the downloaded file needs to be unzipped, and the **kaggle.json** file deleted:

In [7]:
# -o overwrites files left by an earlier run without prompting, which would otherwise stall the cell
! unzip -o {DestinationFolder}/*.zip -d {DestinationFolder} \
  && rm {DestinationFolder}/*.zip \
  && rm kaggle.json

Archive:  inputs/datasets/raw/housing-prices-data.zip
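The unzip step can also be done without a shell, which avoids interactive prompts when files already exist. A minimal sketch using the standard library `zipfile` module; the function name `extract_and_remove` is hypothetical, and existing files with the same names are overwritten:

```python
import zipfile
from pathlib import Path


def extract_and_remove(zip_path, destination) -> list:
    """Extract `zip_path` into `destination`, delete the archive, and
    return the names of the extracted members."""
    zip_path = Path(zip_path)
    with zipfile.ZipFile(zip_path) as archive:
        # extractall creates the destination folder if needed.
        archive.extractall(Path(destination))
        names = archive.namelist()
    zip_path.unlink()  # remove the archive once its contents are extracted
    return names
```

For this notebook the call would be along the lines of `extract_and_remove("inputs/datasets/raw/housing-prices-data.zip", "inputs/datasets/raw")`.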

---

# Load and Inspect House Price Records

Next we will load and inspect the **House Price Records** data:

In [None]:
import pandas as pd

# Load the raw house price records, then show the shape and the first rows
df_house_prices = pd.read_csv("inputs/datasets/raw/house-price-20211124T154130Z-001/house-price/house_prices_records.csv")
print(df_house_prices.shape)
df_house_prices.head()

ModuleNotFoundError: No module named 'pandas'
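The load cell hard-codes the folder name created by the archive. A minimal sketch of locating a CSV by file name under the raw data folder instead, so the cell does not depend on the exact folder layout. The function name `find_csv` is hypothetical; if several files share the name, the first in sorted order is returned:

```python
from pathlib import Path


def find_csv(root, filename: str) -> Path:
    """Search `root` recursively for `filename` and return the first match."""
    matches = sorted(Path(root).rglob(filename))
    if not matches:
        raise FileNotFoundError(f"{filename} not found under {root}")
    return matches[0]
```

The result can be passed straight to `pd.read_csv`, for example `pd.read_csv(find_csv("inputs/datasets/raw", "house_prices_records.csv"))`.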

Next we will look at the *DataFrame* summary:

In [None]:
df_house_prices.info()


Next we will check for duplicated data:

In [None]:
df_house_prices[df_house_prices.duplicated(subset=None)]


There is no duplicated data.
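To illustrate what the duplicate check reports, here is a minimal sketch on a toy DataFrame. The column names and values are hypothetical and are not taken from the dataset:

```python
import pandas as pd

# Toy data (hypothetical): the third row repeats the first row exactly.
df_demo = pd.DataFrame({"area": [850, 960, 850], "year_built": [2003, 1976, 2003]})

# duplicated() marks every repeat of an earlier row (keep="first" by default).
duplicate_mask = df_demo.duplicated(subset=None)
n_duplicates = int(duplicate_mask.sum())

print(n_duplicates)             # 1
print(df_demo[duplicate_mask])  # the repeated row only
```

An empty result from `df[df.duplicated(subset=None)]`, as reported above, corresponds to a count of zero.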

#### Findings:

* The *Dataset* contains **1460** entries of data and **24** columns.
* There is a mix of the following data types: <code>int64</code>, <code>float64</code> and <code>object</code>.

---

# Load and Inspect Inherited Houses

Next we will load and inspect the **Inherited Houses** data:

In [None]:
# Load the inherited houses data, then show the shape and the first rows
df_inherited_houses = pd.read_csv("inputs/datasets/raw/house-price-20211124T154130Z-001/house-price/inherited_houses.csv")
print(df_inherited_houses.shape)
df_inherited_houses.head()


Next we will look at the *DataFrame* summary:

In [None]:
df_inherited_houses.info()


Next we will check for duplicated data:

In [None]:
df_inherited_houses[df_inherited_houses.duplicated(subset=None)]


There is no duplicated data.

#### Findings:

* The *Dataset* contains **4** entries of data and **23** columns.
* There is a mix of the following data types: <code>int64</code>, <code>float64</code> and <code>object</code>.
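The two findings report 24 and 23 columns respectively. A minimal sketch for listing which columns appear in one DataFrame but not the other; the function name `column_differences` is hypothetical, and the toy columns in the usage note are illustrative only:

```python
import pandas as pd


def column_differences(df_a: pd.DataFrame, df_b: pd.DataFrame):
    """Return (columns only in df_a, columns only in df_b), keeping column order."""
    only_in_a = [col for col in df_a.columns if col not in df_b.columns]
    only_in_b = [col for col in df_b.columns if col not in df_a.columns]
    return only_in_a, only_in_b
```

For example, with `pd.DataFrame({"x": [1], "target": [2]})` and `pd.DataFrame({"x": [3]})`, the function returns `(["target"], [])`. In the notebook it could be called as `column_differences(df_house_prices, df_inherited_houses)`.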

---

# Push files to Repo

* If you do not need to push files to Repo, you may replace this section with "Conclusions and Next Steps" and state your conclusions and next steps.

In [None]:
import os

# Create the output folder; exist_ok=True makes the cell safe to re-run
os.makedirs("outputs/datasets/collection", exist_ok=True)

# Save both datasets without the DataFrame index
df_house_prices.to_csv("outputs/datasets/collection/HousePriceRecords.csv", index=False)
df_inherited_houses.to_csv("outputs/datasets/collection/InheritedHouses.csv", index=False)

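A minimal sketch of the save step wrapped in a helper that creates the folder if needed and returns the written path, which makes it easy to re-read the file and confirm that nothing was lost. The function name `save_dataset` is hypothetical:

```python
import os

import pandas as pd


def save_dataset(df: pd.DataFrame, folder: str, filename: str) -> str:
    """Write `df` to `folder/filename` as CSV (no index) and return the path."""
    os.makedirs(folder, exist_ok=True)  # no error if the folder already exists
    path = os.path.join(folder, filename)
    df.to_csv(path, index=False)
    return path
```

A quick check after saving could compare shapes, for example `pd.read_csv(path).shape == df.shape`.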