# **01 - DataCollection**

## Objectives

* Fetch data from Kaggle and store it as raw data in inputs/datasets/raw
* Inspect data fetched via Kaggle
* Save data fetched under outputs/datasets/collection/

## Inputs

* My personal Kaggle JSON file to authenticate with Kaggle 

## Outputs

* inputs/datasets/raw/house-price-20211124T154130Z-001/house-price/house_prices_records.csv
* inputs/datasets/raw/house-price-20211124T154130Z-001/house-price/inherited_houses.csv
* inputs/datasets/raw/house-metadata.txt
* outputs/datasets/collection/HousePricesRecords.csv
* outputs/datasets/collection/InheritedHouses.csv

## Additional Comments

* The file house_prices_records.csv contains the features and the target variable for the business objective, serving as the foundation for the ML model. The file inherited_houses.csv contains the features of the houses for which the client seeks individual and total price predictions.


---

# Change working directory

* The notebooks are assumed to be stored in a subfolder, so when running a notebook in the editor you need to change the working directory.

We need to change the working directory from its current folder to its parent folder
* We access the current directory with os.getcwd()

In [None]:
import os
current_dir = os.getcwd()
current_dir

We want to make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() sets the new current directory

In [None]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

Confirm the new current directory

In [None]:
current_dir = os.getcwd()
current_dir
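Because the `os.chdir` cell moves up one level every time it runs, re-executing it can leave the notebook outside the project. A minimal sketch of a guarded version, assuming the notebooks live in a subfolder named `jupyter_notebooks` (a hypothetical name; substitute your own). `resolve_project_root` is likewise a hypothetical helper:

```python
import os

def resolve_project_root(cwd: str, notebook_dir: str = "jupyter_notebooks") -> str:
    """Return the parent of cwd only if cwd is the notebook subfolder.

    `notebook_dir` is an assumed folder name, not taken from the project.
    """
    normalized = os.path.normpath(cwd)
    if os.path.basename(normalized) == notebook_dir:
        return os.path.dirname(normalized)
    return cwd

# Safe to re-run: a second call leaves the directory unchanged.
os.chdir(resolve_project_root(os.getcwd()))
```

The check on the folder name is what makes the cell idempotent: once the working directory is the project root, the function returns it unchanged.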

# Download data from Kaggle

The Kaggle API was used to fetch the raw project data, requiring the installation of the Kaggle package.

To authenticate, the API requires a token stored in a kaggle.json file. The token was obtained as follows:

* Log into your Kaggle account.
* Go to Settings via the user profile menu.
* Locate the API section.
* Click Expire API Token to remove any existing tokens.
* Click Create New API Token to generate a new one.
* Download the kaggle.json file.
* Move kaggle.json to the project's root directory without renaming it or changing its extension.

This ensures proper authentication for accessing Kaggle data.

In [None]:
%pip install kaggle==1.5.12

This code configures Kaggle API authentication by setting the KAGGLE_CONFIG_DIR environment variable to the current working directory so that the API can locate the kaggle.json credentials file. It then executes a shell command to change the file's permissions (chmod 600 kaggle.json), ensuring that only the owner can read and write it.

In [None]:
import os
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()
! chmod 600 kaggle.json
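The `chmod 600` step can also be applied and verified from Python, which helps where the `!` shell escape is unavailable. A sketch, assuming a POSIX file system; `restrict_to_owner` is a hypothetical helper name:

```python
import os
import stat

def restrict_to_owner(path: str) -> str:
    """Set owner-only read/write (0o600) and return the resulting mode as octal text."""
    os.chmod(path, stat.S_IRUSR | stat.S_IWUSR)
    mode = stat.S_IMODE(os.stat(path).st_mode)
    return oct(mode)

# Apply only if the credentials file is present in the working directory.
if os.path.exists("kaggle.json"):
    print(restrict_to_owner("kaggle.json"))
```

Returning the mode read back from the file, rather than the mode requested, confirms that the change actually took effect.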

In [None]:
KaggleDatasetPath = "codeinstitute/housing-prices-data"
DestinationFolder = "inputs/datasets/raw"   
! kaggle datasets download -d {KaggleDatasetPath} -p {DestinationFolder}

In [None]:
! unzip {DestinationFolder}/*.zip -d {DestinationFolder} \
  && rm {DestinationFolder}/*.zip \
  && rm kaggle.json
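The unzip-and-clean-up step can also be sketched in pure Python with the standard `zipfile` module, which avoids depending on the `unzip` and `rm` commands. `extract_and_remove` is a hypothetical helper, and unlike the shell cell it leaves `kaggle.json` untouched:

```python
import glob
import os
import zipfile

def extract_and_remove(folder: str) -> list:
    """Extract every .zip archive in folder into folder, then delete the archive.

    Returns the names of the extracted members.
    """
    extracted = []
    for zip_path in glob.glob(os.path.join(folder, "*.zip")):
        with zipfile.ZipFile(zip_path) as archive:
            archive.extractall(folder)
            extracted.extend(archive.namelist())
        # Remove the archive only after it has been extracted successfully.
        os.remove(zip_path)
    return extracted
```

A call such as `extract_and_remove("inputs/datasets/raw")` would mirror the destination folder used above.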

Use the pandas library to load the two CSV files into DataFrames.

In [8]:
import pandas as pd
df = pd.read_csv("inputs/datasets/raw/house-price-20211124T154130Z-001/house-price/house_prices_records.csv")
df_inherited = pd.read_csv("inputs/datasets/raw/house-price-20211124T154130Z-001/house-price/inherited_houses.csv")

Display a concise summary of the df DataFrame: column names, non-null counts, and data types.

In [None]:
df.info()

Count the number of missing (null/NaN) values in each column of the df DataFrame.

In [None]:
df.isnull().sum()
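Raw counts are easier to judge next to the share of rows they represent. A minimal sketch that adds a percentage column; `missing_summary` is a hypothetical helper, and the toy DataFrame with made-up column names stands in for the real data:

```python
import pandas as pd

def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return missing-value counts and percentages for columns with at least one gap."""
    counts = frame.isnull().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(frame) * 100).round(2),
    })
    # Keep only columns with missing values, largest gap first.
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)

# Toy data for illustration only; these columns are not from the housing dataset.
toy = pd.DataFrame({"a": [1, None, 3, None], "b": [1, 2, 3, 4], "c": [None, 2, 3, 4]})
print(missing_summary(toy))
```

Calling `missing_summary(df)` on the loaded data would list only the columns that need attention during cleaning.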

Filter df, keeping only the rows that are duplicates. There are no duplicates.

In [None]:
df[df.duplicated(subset=None)]

Display a concise summary of the df_inherited DataFrame: column names, non-null counts, and data types.

In [None]:
df_inherited.info()

Count the number of missing (null/NaN) values in each column of the df_inherited DataFrame.

In [None]:
df_inherited.isnull().sum()

Filter df_inherited, keeping only the rows that are duplicates. There are no duplicates.

In [None]:
df_inherited[df_inherited.duplicated(subset=None)]
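When only the number of duplicates matters, a count is quicker to read than a filtered DataFrame. A small sketch; `count_duplicate_rows` is a hypothetical helper and the toy data is made up for illustration:

```python
import pandas as pd

def count_duplicate_rows(frame: pd.DataFrame) -> int:
    """Number of rows that exactly repeat an earlier row."""
    return int(frame.duplicated().sum())

# Toy data: the second row repeats the first.
toy = pd.DataFrame({"x": [1, 1, 2], "y": ["a", "a", "b"]})
print(count_duplicate_rows(toy))
```

The same function can be applied to both `df` and `df_inherited`; a result of 0 corresponds to the empty output of the cells above.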

Iterate over the columns of df and inspect their unique values: for object (text) columns, print the unique values; for other columns with fewer than 11 unique values, print how many there are.

In [None]:
for col in df.columns:
    unique_values = df[col].unique()
    
    if df[col].dtype == 'object':
        print(f"{col} - {unique_values}")
    elif len(unique_values) < 11:
        print(f"{col} - {len(unique_values)}")

Create a directory structure and then save the two dataframes as CSV files to outputs/datasets/collection/

In [None]:
import os

# exist_ok=True makes the cell safe to re-run when the folder already exists
os.makedirs('outputs/datasets/collection', exist_ok=True)

df.to_csv("outputs/datasets/collection/HousePricesRecords.csv", index=False)
df_inherited.to_csv("outputs/datasets/collection/InheritedHouses.csv", index=False)
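A quick read-back can confirm that a saved file matches the DataFrame in memory. A sketch under the assumption that comparing shapes is a sufficient check here; `save_and_verify` is a hypothetical helper, and the example writes to a temporary folder rather than the project outputs:

```python
import os
import tempfile

import pandas as pd

def save_and_verify(frame: pd.DataFrame, path: str) -> bool:
    """Write frame to CSV without the index and check the reloaded shape matches.

    `path` must include a directory component, which is created if missing.
    """
    os.makedirs(os.path.dirname(path), exist_ok=True)
    frame.to_csv(path, index=False)
    return pd.read_csv(path).shape == frame.shape

# Toy data for illustration only.
toy = pd.DataFrame({"a": [1, 2], "b": ["x", "y"]})
with tempfile.TemporaryDirectory() as tmp:
    print(save_and_verify(toy, os.path.join(tmp, "collection", "toy.csv")))
```

A shape comparison catches truncated writes and an accidentally saved index column, but it does not compare cell values.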

## Conclusions and Next Steps

* The dataset contains 1460 rows and 24 columns.
* The data types present include integers, floats, and objects.
* There are no duplicate rows in the dataset.
* 10 columns have missing values, with some columns having almost all of their entries missing.
* The columns BsmtExposure, BsmtFinType1, GarageFinish, and KitchenQual are categorical variables.
* Data cleaning and preprocessing are necessary to address the missing values and to properly handle the categorical variables.

The next steps are to examine the missing values in more detail, decide how to address them, and implement the data cleaning process.