# **Data Collection Notebook**

## Objectives

* Fetch data from Kaggle and save as raw data
* Inspect the data and save it under outputs/datasets/collection.

## Inputs

* Kaggle JSON file - the authentication token.

## Outputs

* Generate and save dataset: outputs/datasets/collection/....


---

# Change working directory

We need to change the working directory from its current folder to its parent folder.
* We access the current directory with os.getcwd()

In [1]:
import os
current_dir = os.getcwd()
current_dir

'/workspace/heritage-housing-issues/jupyter_notebooks'

We want to make the parent of the current directory the new current directory.
* os.path.dirname() gets the parent directory
* os.chdir() defines the new current directory

In [2]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [3]:
current_dir = os.getcwd()
current_dir

'/workspace/heritage-housing-issues'
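
The same move to the parent folder can also be expressed with `pathlib`. The sketch below is only an illustration and is not part of the original notebook; the helper name `parent_directory` is hypothetical.

```python
from pathlib import Path


def parent_directory(path):
    """Return the parent folder of `path` as a string (hypothetical helper)."""
    return str(Path(path).parent)


# Example using the notebook folder shown in the output above
print(parent_directory("/workspace/heritage-housing-issues/jupyter_notebooks"))
```

Calling `os.chdir(parent_directory(os.getcwd()))` would then have the same effect as the `os.path.dirname()` cell above.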

# Fetch data from Kaggle

Install the Kaggle package, which provides the `kaggle` command-line tool used in the cells below.

In [None]:
%pip install kaggle==1.5.12

Please take the following steps to access the JSON file from Kaggle:
* Create and log in to your Kaggle account.
* At the top right, click on your profile picture, then select “Settings” from the dropdown menu.
* Scroll down to the API section.
* Click 'Expire API Token' to remove any previous tokens.
* Click 'Create New API Token' to generate a fresh authentication token; this downloads a kaggle.json file.
* Drag and drop the downloaded kaggle.json file into the root of the project in your file explorer (the current working directory set above) and make sure the file is named kaggle.json.
* Run the cell below so that the token is recognized in the session.

In [None]:
import os
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()
! chmod 600 kaggle.json
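
A pure-Python version of the cell above might look like the following sketch. It assumes the token file is a JSON object with `username` and `key` fields, and the helper name `prepare_kaggle_token` is hypothetical. It checks the file before restricting its permissions, which can catch a wrongly named or incomplete download early.

```python
import json
import os
import stat


def prepare_kaggle_token(token_path="kaggle.json"):
    """Validate the token file and restrict it to owner read/write (hypothetical helper)."""
    with open(token_path, encoding="utf-8") as f:
        token = json.load(f)
    # Assumption: a valid token holds both a "username" and a "key" field
    missing = {"username", "key"} - set(token)
    if missing:
        raise ValueError(f"kaggle.json is missing fields: {sorted(missing)}")
    # Equivalent of `chmod 600`: only the owner may read and write the file
    os.chmod(token_path, stat.S_IRUSR | stat.S_IWUSR)
    # Point the Kaggle client at the folder that contains the token
    os.environ["KAGGLE_CONFIG_DIR"] = os.path.dirname(os.path.abspath(token_path))
    return token["username"]
```

Running `prepare_kaggle_token()` from the project root would replace the three lines of the cell above.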

* The heritage housing dataset is located at [Kaggle URL](https://www.kaggle.com/datasets/codeinstitute/housing-prices-data)
* Define the kaggle dataset path as the path that comes after https://www.kaggle.com/datasets/
* Set the destination folder.
* Download the data.

In [None]:
KaggleDatasetPath = "codeinstitute/housing-prices-data"
DestinationFolder = "inputs/datasets/raw"   
! kaggle datasets download -d {KaggleDatasetPath} -p {DestinationFolder}
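
As a small illustration of how the dataset path relates to the dataset URL, the sketch below derives one from the other. The helper name `kaggle_dataset_path` is hypothetical and not part of the Kaggle package.

```python
from urllib.parse import urlparse


def kaggle_dataset_path(url):
    """Return the '<owner>/<dataset>' part of a Kaggle dataset URL (hypothetical helper)."""
    prefix = "/datasets/"
    path = urlparse(url).path
    if not path.startswith(prefix):
        raise ValueError(f"Not a Kaggle dataset URL: {url}")
    # Drop the '/datasets/' prefix and any trailing slash
    return path[len(prefix):].strip("/")


# prints codeinstitute/housing-prices-data
print(kaggle_dataset_path("https://www.kaggle.com/datasets/codeinstitute/housing-prices-data"))
```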

Unzip the downloaded file, delete the zip file and delete the kaggle.json file

In [None]:
! unzip {DestinationFolder}/*.zip -d {DestinationFolder} \
  && rm {DestinationFolder}/*.zip \
  && rm kaggle.json
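
Where the `unzip` and `rm` shell commands are not available, the extraction step could be sketched with the standard-library `zipfile` module instead. The helper name `extract_and_remove_zips` is hypothetical, and this sketch leaves kaggle.json untouched, so the token would still need to be deleted separately.

```python
import zipfile
from pathlib import Path


def extract_and_remove_zips(folder):
    """Extract every .zip in `folder` into that folder, then delete the archive."""
    extracted = []
    for zip_path in sorted(Path(folder).glob("*.zip")):
        with zipfile.ZipFile(zip_path) as archive:
            archive.extractall(folder)
            extracted.extend(archive.namelist())
        zip_path.unlink()  # remove the archive once its contents are extracted
    return extracted
```

In this notebook the call would be `extract_and_remove_zips(DestinationFolder)`, which returns the names of the extracted files.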

# Load and Inspect Kaggle data

In [5]:
import pandas as pd
df = pd.read_csv("inputs/datasets/raw/house-prices/house-price/house_prices_records.csv")
df.head()

ModuleNotFoundError: No module named 'pandas'

DataFrame Summary

In [None]:
df.info()

Check for missing data

In [None]:
df.isnull().sum()
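
The raw counts can be easier to read next to the share of rows they represent. The following is a minimal sketch, assuming pandas is installed; the helper name `missing_summary` and the toy data are illustrative only and are not taken from the housing dataset.

```python
import pandas as pd


def missing_summary(df):
    """Return count and percentage of missing values for columns that have any."""
    counts = df.isnull().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(df) * 100).round(2),
    })
    # Keep only columns with at least one missing value, largest count first
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)


# Toy data for illustration only, not the housing dataset
toy = pd.DataFrame({
    "a": [1, None, 3, None],
    "b": [1, 2, 3, 4],
    "c": [None, "x", "y", "z"],
})
print(missing_summary(toy))
```

Applied to the notebook's data, `missing_summary(df)` would list only the columns that contain missing values.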

Check for duplicated data

In [None]:
df[df.duplicated(subset=None)]

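Check the unique values in columns with a non-numeric data type, and count the unique values in numeric columns that have fewer than 11 of them

In [None]:
for col in df:
    if df[col].dtype == 'object':
        # Non-numeric column: list its distinct values
        print(col, '-', df[col].unique())
    elif df[col].nunique(dropna=False) < 11:
        # Numeric column with few distinct values (NaN counted): print the count
        print(col, '-', df[col].nunique(dropna=False))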

## Data Observations
* The dataset has 1460 rows and 24 columns.
* There is a mix of data types, namely integers, floats and objects.
* 9 columns have missing data to differing degrees.
* 4 columns contain categorical data.
* 3 further columns have only a small number of unique numerical entries, suggesting that they could also be converted to categorical data.

Further investigation and data cleaning are suggested for the next notebook.

---

# Push files to Repo

* Save the inspected dataset under outputs/datasets/collection so that it can be pushed to the repo.

In [None]:
import os

# Create the output folder; exist_ok=True avoids an error if it already exists
os.makedirs('outputs/datasets/collection', exist_ok=True)

df.to_csv("outputs/datasets/collection/house_prices_records.csv", index=False)
