# Data Collection & Initial Cleaning Notebook

## Objectives

* Fetch data from Kaggle and save it as raw data
* Inspect the data and save it under outputs/datasets/collection
* Perform data cleaning based on initial inspection

## Inputs

* kaggle.json file: authentication token used to access the Kaggle dataset

## Outputs

* Generated dataset: 'outputs/datasets/collection/house_prices_records.csv'


---

# Change working directory

We need to change the working directory from its current folder to its parent folder
* We access the current directory with os.getcwd()

In [None]:
import os
current_dir = os.getcwd()
current_dir

We want to make the parent of the current directory the new current directory.
* os.path.dirname() gets the parent directory
* os.chdir() sets the new current directory

In [None]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")
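To illustrate how `os.path.dirname()` yields the parent folder, the sketch below applies it to a hypothetical path. The folder name `jupyter_notebooks` is an assumption made for the example, not something this notebook states. The call only manipulates the string and does not touch the file system.

```python
import os

# Hypothetical notebook location; the real path depends on your environment.
example_dir = "/workspaces/heritage-housing/jupyter_notebooks"

# os.path.dirname() strips the last path component, giving the parent folder.
parent_dir = os.path.dirname(example_dir)
print(parent_dir)  # /workspaces/heritage-housing
```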



Confirm the new current directory

In [None]:
current_dir = os.getcwd()
current_dir

----------


# Install Python packages in the notebook

In [None]:
%pip install -r /workspaces/heritage-housing/requirements.txt

-------

# Fetch data from Kaggle

In [None]:
!pip install --user kaggle

In [None]:
import os
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()
! chmod 600 kaggle.json
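`chmod 600` restricts the token file to read and write access for its owner only. A rough Python equivalent is sketched below on a throwaway temporary file rather than the real `kaggle.json`. It assumes a POSIX system, and the file content is a placeholder.

```python
import os
import stat
import tempfile

# Work in a temporary folder so the real token is never touched.
with tempfile.TemporaryDirectory() as tmp_dir:
    token_path = os.path.join(tmp_dir, "kaggle.json")
    with open(token_path, "w") as f:
        f.write("{}")  # placeholder content, not a real token

    # Equivalent of `chmod 600`: the owner may read and write, nobody else.
    os.chmod(token_path, stat.S_IRUSR | stat.S_IWUSR)

    # Read back the permission bits to confirm the change.
    mode = stat.S_IMODE(os.stat(token_path).st_mode)
    print(oct(mode))
```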

In [None]:
KaggleDatasetPath = "codeinstitute/housing-prices-data"
DestinationFolder = "inputs/datasets/raw"   
! kaggle datasets download -d {KaggleDatasetPath} -p {DestinationFolder}

In [None]:
! unzip {DestinationFolder}/*.zip -d {DestinationFolder} \
  && rm {DestinationFolder}/*.zip \
  && rm kaggle.json
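If the `unzip` command is not available, the extract-then-delete step could be approximated with Python's standard `zipfile` module. This is only a sketch: it builds a tiny stand-in archive in a temporary folder, and the archive and file names are made up for the example.

```python
import os
import tempfile
import zipfile

with tempfile.TemporaryDirectory() as dest:
    # Build a tiny archive to stand in for the Kaggle download.
    zip_path = os.path.join(dest, "example.zip")
    with zipfile.ZipFile(zip_path, "w") as zf:
        zf.writestr("house-price/example.csv", "a,b\n1,2\n")

    # Extract everything next to the archive, then delete the archive.
    with zipfile.ZipFile(zip_path) as zf:
        zf.extractall(dest)
    os.remove(zip_path)

    extracted = sorted(os.listdir(dest))
    print(extracted)  # ['house-price']
```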

-------

# Load and Inspect Kaggle data

Create a DataFrame and store the imported data

In [None]:
import pandas as pd
import numpy as np
df = pd.read_csv("inputs/datasets/raw/house-price-20211124T154130Z-001/house-price/house_prices_records.csv")
pd.set_option('display.max_columns', None)
df.head(20)

In [None]:
df.info()

## Data Cleaning

Based on the initial inspection, the following cleaning and transformation steps make the data more consistent and simplify later analysis:

- Convert all square-footage columns to floats, so their values are directly comparable.
- Replace missing (NaN) values in GarageFinish with the string 'None', so that "no garage" is represented only as 'None'.
- Replace missing (NaN) values in BsmtFinType1 with the string 'None', so that "no basement" is represented only as 'None'.


In [None]:
# select columns with 'SF' or 'Area' in the name (the match is case-sensitive)
cols_to_convert = df.filter(regex=r'SF|Area', axis=1).columns

# convert to float
df[cols_to_convert] = df[cols_to_convert].astype(float)

# read_csv loads missing entries as real NaN values, not the string 'NaN',
# so fillna() is needed to replace them with 'None' in 'GarageFinish'
df['GarageFinish'] = df['GarageFinish'].fillna('None')

# same for missing values in 'BsmtFinType1'
df['BsmtFinType1'] = df['BsmtFinType1'].fillna('None')
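The cleaning steps listed above can be tried on a small toy DataFrame first. The sketch below assumes that missing entries were loaded as real NaN values, which is how `pd.read_csv` handles empty fields by default. The numbers are made up, and the column names other than `GarageFinish` are illustrative.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the real dataset; values are made up.
toy = pd.DataFrame({
    "1stFlrSF": [856, 1262],
    "GarageArea": [548, 460],
    "GarageFinish": ["RFn", np.nan],
})

# Select columns whose name contains 'SF' or 'Area' and cast them to float.
area_cols = toy.filter(regex=r"SF|Area", axis=1).columns
toy[area_cols] = toy[area_cols].astype(float)

# Missing entries are real NaN values, so fillna() is what reaches them;
# replacing the literal string 'NaN' would leave them untouched.
toy["GarageFinish"] = toy["GarageFinish"].fillna("None")

print(toy.dtypes)
print(toy["GarageFinish"].tolist())  # ['RFn', 'None']
```

Calling `toy.isna().sum()` afterwards is a quick way to confirm that no missing values remain in the filled column.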


In [None]:
df.info()


In [None]:
numerical_columns = df.select_dtypes(include='number').columns.tolist()


In [None]:
numerical_columns

-----

# Push files to Repo

In [None]:
import os

# create outputs/datasets/collection; exist_ok avoids an error if it already exists
os.makedirs('outputs/datasets/collection', exist_ok=True)

df.to_csv("outputs/datasets/collection/house_prices_records.csv", index=False)
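Re-running this notebook means the output folder may already exist. `os.makedirs` accepts `exist_ok=True` for that case. The sketch below demonstrates the behaviour in a temporary folder, so it leaves nothing behind. The temporary base path is an assumption of the example, not part of the project layout.

```python
import os
import tempfile

# Use a temporary base folder so the sketch leaves no files behind.
with tempfile.TemporaryDirectory() as base_dir:
    target = os.path.join(base_dir, "outputs", "datasets", "collection")

    os.makedirs(target, exist_ok=True)  # creates all missing parent folders
    os.makedirs(target, exist_ok=True)  # second call is a no-op, no error

    created = os.path.isdir(target)
    print(created)  # True
```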