# **Data Collection Notebook**

**Objectives**  
Fetch house price data from Kaggle and prepare it for processing by:
- Downloading and unzipping the raw dataset
- Inspecting key features and structure
- Saving standardised CSV files for both historical sales and inherited house records

**Inputs**
- Kaggle API key in `kaggle.json`

**Outputs**
- `outputs/datasets/collection/house_prices_records.csv`
- `outputs/datasets/collection/inherited_houses.csv`

**Additional Comments**
- The dataset is public with no usage restrictions.
- All outputs are standardised for use in the data cleaning and feature engineering stages.


---

# Change working directory

* The notebooks are stored in a subfolder, so when running a notebook in the editor you need to change the working directory to the project root.

We need to change the working directory from its current folder to its parent folder.
* `os.getcwd()` returns the current directory

In [1]:
import os
current_dir = os.getcwd()
current_dir

'/workspaces/heritage-house-price-predictor/jupyter_notebooks'

Make the parent of the current directory the new working directory.
* `os.path.dirname()` returns the parent directory
* `os.chdir()` sets the new working directory

In [2]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [3]:
current_dir = os.getcwd()
current_dir

'/workspaces/heritage-house-price-predictor'
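
Running the `os.chdir()` cell a second time moves up one more level, which breaks every relative path that follows. A minimal sketch of a guard against that is shown here. The helper name `resolve_project_root` and the constant `NOTEBOOK_FOLDER` are illustrative and not part of the original notebook; the folder name is taken from the output of the first cell.

```python
import os
from pathlib import Path

NOTEBOOK_FOLDER = "jupyter_notebooks"  # assumed name of the notebooks subfolder


def resolve_project_root(current_dir: str) -> Path:
    """Return the parent if we are inside the notebooks folder, else the path unchanged."""
    path = Path(current_dir)
    return path.parent if path.name == NOTEBOOK_FOLDER else path


# Safe to re-run: once we are at the project root this is a no-op
os.chdir(resolve_project_root(os.getcwd()))
```

With this guard the cell can be executed repeatedly without leaving the project root.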

# Fetch data from Kaggle

Install the Kaggle package, locate `kaggle.json` in the project root, then download and unzip the dataset.

In [4]:
%pip install kaggle==1.5.12

Note: you may need to restart the kernel to use updated packages.


In [5]:
import os

# List the contents of /workspaces to find the project folder name
!ls /workspaces

heritage-house-price-predictor


In [6]:
# Replace with the folder name shown by the previous command
os.chdir('/workspaces/heritage-house-price-predictor')

# Confirm you're now in the correct directory
print(os.getcwd())
!ls  # to confirm kaggle.json is listed

/workspaces/heritage-house-price-predictor
Procfile   inputs	      kaggle.json  requirements.txt
README.md  jupyter_notebooks  outputs	   setup.sh


In [7]:
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()
!chmod 600 kaggle.json
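
The cell above sets `KAGGLE_CONFIG_DIR` so the Kaggle client looks for `kaggle.json` in the project root, and `chmod 600` limits the key file to owner read/write. The sketch below checks the key file before the download is attempted. It assumes the usual `kaggle.json` layout with `username` and `key` fields; the helper name `kaggle_key_problems` is hypothetical and not part of the Kaggle package.

```python
import json
import stat
from pathlib import Path


def kaggle_key_problems(config_dir: str) -> list:
    """Return a list of problems found with kaggle.json (empty means it looks usable)."""
    key_file = Path(config_dir) / "kaggle.json"
    if not key_file.is_file():
        return [f"{key_file} not found"]
    try:
        credentials = json.loads(key_file.read_text())
    except json.JSONDecodeError:
        return [f"{key_file} is not valid JSON"]
    if not isinstance(credentials, dict):
        return [f"{key_file} does not contain a JSON object"]
    problems = []
    for field in ("username", "key"):
        if not credentials.get(field):
            problems.append(f"missing '{field}' field")
    # chmod 600 means owner read/write only: no group or other permission bits
    mode = stat.S_IMODE(key_file.stat().st_mode)
    if mode & (stat.S_IRWXG | stat.S_IRWXO):
        problems.append(f"permissions {oct(mode)} are broader than 600")
    return problems
```

Calling `kaggle_key_problems(os.getcwd())` after the `chmod` step should return an empty list when the key is in place.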

Define the Kaggle dataset and the destination folder, then download the dataset.

In [8]:
KaggleDatasetPath = "codeinstitute/housing-prices-data"
DestinationFolder = "inputs/datasets/raw"   
! kaggle datasets download -d {KaggleDatasetPath} -p {DestinationFolder}

Downloading housing-prices-data.zip to inputs/datasets/raw
  0%|                                               | 0.00/49.6k [00:00<?, ?B/s]
100%|███████████████████████████████████████| 49.6k/49.6k [00:00<00:00, 585kB/s]


Unzip the downloaded file, then delete the zip file and the `kaggle.json` file so the API key does not remain in the workspace.

In [9]:
! unzip {DestinationFolder}/*.zip -d {DestinationFolder} \
  && rm {DestinationFolder}/*.zip \
  && rm kaggle.json

Archive:  inputs/datasets/raw/housing-prices-data.zip
  inflating: inputs/datasets/raw/house-metadata.txt  
  inflating: inputs/datasets/raw/house-price-20211124T154130Z-001/house-price/house_prices_records.csv  
  inflating: inputs/datasets/raw/house-price-20211124T154130Z-001/house-price/inherited_houses.csv  
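
The shell command above relies on `unzip` and `rm` being available. As a possible alternative, the same extract-then-delete step can be sketched with Python's standard `zipfile` module. The function name `extract_and_remove_zips` is illustrative, and this sketch does not delete `kaggle.json`; that would still need a separate step.

```python
import zipfile
from pathlib import Path


def extract_and_remove_zips(folder: str) -> list:
    """Extract every .zip in `folder` into that folder, then delete the archive."""
    extracted = []
    for archive in sorted(Path(folder).glob("*.zip")):
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(folder)
            extracted.extend(zf.namelist())
        archive.unlink()  # only reached if extraction did not raise
    return extracted
```

In the notebook this would be called as `extract_and_remove_zips(DestinationFolder)`.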


---

# Load and Inspect Kaggle data

Historical house price dataset



In [10]:
import pandas as pd

# Historical sales records (includes the SalePrice target)
df_sales = pd.read_csv("inputs/datasets/raw/house-price-20211124T154130Z-001/house-price/house_prices_records.csv")
df_sales.head()

Unnamed: 0,1stFlrSF,2ndFlrSF,BedroomAbvGr,BsmtExposure,BsmtFinSF1,BsmtFinType1,BsmtUnfSF,EnclosedPorch,GarageArea,GarageFinish,...,LotFrontage,MasVnrArea,OpenPorchSF,OverallCond,OverallQual,TotalBsmtSF,WoodDeckSF,YearBuilt,YearRemodAdd,SalePrice
0,856,854.0,3.0,No,706,GLQ,150,0.0,548,RFn,...,65.0,196.0,61,5,7,856,0.0,2003,2003,208500
1,1262,0.0,3.0,Gd,978,ALQ,284,,460,RFn,...,80.0,0.0,0,8,6,1262,,1976,1976,181500
2,920,866.0,3.0,Mn,486,GLQ,434,0.0,608,RFn,...,68.0,162.0,42,5,7,920,,2001,2002,223500
3,961,,,No,216,ALQ,540,,642,Unf,...,60.0,0.0,35,5,7,756,,1915,1970,140000
4,1145,,4.0,Av,655,GLQ,490,0.0,836,RFn,...,84.0,350.0,84,5,8,1145,,2000,2000,250000


Inherited house dataset



In [11]:
# Inherited houses (same features, no SalePrice column)
df_inherited = pd.read_csv(
    "inputs/datasets/raw/house-price-20211124T154130Z-001/house-price/inherited_houses.csv")
df_inherited.head()

Unnamed: 0,1stFlrSF,2ndFlrSF,BedroomAbvGr,BsmtExposure,BsmtFinSF1,BsmtFinType1,BsmtUnfSF,EnclosedPorch,GarageArea,GarageFinish,...,LotArea,LotFrontage,MasVnrArea,OpenPorchSF,OverallCond,OverallQual,TotalBsmtSF,WoodDeckSF,YearBuilt,YearRemodAdd
0,896,0,2,No,468.0,Rec,270.0,0,730.0,Unf,...,11622,80.0,0.0,0,6,5,882.0,140,1961,1961
1,1329,0,3,No,923.0,ALQ,406.0,0,312.0,Unf,...,14267,81.0,108.0,36,6,6,1329.0,393,1958,1958
2,928,701,3,No,791.0,GLQ,137.0,0,482.0,Fin,...,13830,74.0,0.0,34,5,5,928.0,212,1997,1998
3,926,678,3,No,602.0,GLQ,324.0,0,470.0,Fin,...,9978,78.0,20.0,36,6,6,926.0,360,1998,1998


DataFrame Summary



In [12]:
df_sales.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 24 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   1stFlrSF       1460 non-null   int64  
 1   2ndFlrSF       1374 non-null   float64
 2   BedroomAbvGr   1361 non-null   float64
 3   BsmtExposure   1422 non-null   object 
 4   BsmtFinSF1     1460 non-null   int64  
 5   BsmtFinType1   1315 non-null   object 
 6   BsmtUnfSF      1460 non-null   int64  
 7   EnclosedPorch  136 non-null    float64
 8   GarageArea     1460 non-null   int64  
 9   GarageFinish   1225 non-null   object 
 10  GarageYrBlt    1379 non-null   float64
 11  GrLivArea      1460 non-null   int64  
 12  KitchenQual    1460 non-null   object 
 13  LotArea        1460 non-null   int64  
 14  LotFrontage    1201 non-null   float64
 15  MasVnrArea     1452 non-null   float64
 16  OpenPorchSF    1460 non-null   int64  
 17  OverallCond    1460 non-null   int64  
 18  OverallQ

Check for duplicated rows. The empty result below shows there are none.


In [13]:
df_sales[df_sales.duplicated()]


Unnamed: 0,1stFlrSF,2ndFlrSF,BedroomAbvGr,BsmtExposure,BsmtFinSF1,BsmtFinType1,BsmtUnfSF,EnclosedPorch,GarageArea,GarageFinish,...,LotFrontage,MasVnrArea,OpenPorchSF,OverallCond,OverallQual,TotalBsmtSF,WoodDeckSF,YearBuilt,YearRemodAdd,SalePrice
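
An empty table is easy to misread, so a count can make the duplicate check more explicit. The sketch below uses a small toy frame (the values are made up, not taken from the dataset) to show how `duplicated()` and `drop_duplicates()` behave.

```python
import pandas as pd

# Toy data (not from the dataset): the third row repeats the first
toy = pd.DataFrame({"LotArea": [8450, 9600, 8450], "YearBuilt": [2003, 1976, 2003]})

# duplicated() flags rows that are identical to an earlier row
n_duplicates = int(toy.duplicated().sum())

# drop_duplicates() keeps the first occurrence of each row
deduplicated = toy.drop_duplicates().reset_index(drop=True)
```

Applied to the notebook's data, `df_sales.duplicated().sum()` gives the same information as the filtered view as a single number.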


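Convert `BedroomAbvGr` to whole numbers (nullable `Int64`).
It is the number of bedrooms, so it should be a whole number.
We use the pandas nullable type `'Int64'` instead of NumPy's `'int64'` because `'int64'` cannot hold missing values, and this column has them (1361 non-null out of 1460 rows).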


In [14]:
df_sales['BedroomAbvGr'] = df_sales['BedroomAbvGr'].astype('Int64')
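
The difference between `'Int64'` and `'int64'` can be illustrated on a toy series; the values here are made up and not taken from the dataset. The nullable cast keeps the missing entry as `<NA>`, while the NumPy cast is expected to raise because it has no way to store it.

```python
import pandas as pd

# Toy values (not from the dataset): a bedroom count with one missing entry
bedrooms = pd.Series([3.0, None, 4.0], name="BedroomAbvGr")

# Nullable cast: the missing value becomes <NA>, the rest become integers
nullable = bedrooms.astype("Int64")

# NumPy cast: raises because NaN cannot be represented as an integer
try:
    bedrooms.astype("int64")
    numpy_cast_failed = False
except ValueError:
    numpy_cast_failed = True
```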


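Convert `GarageYrBlt` to whole numbers (nullable `Int64`).
It is the year the garage was built, so it should also be a whole number. The column has missing values (1379 non-null out of 1460 rows), so the nullable type is needed here too.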

In [15]:
df_sales['GarageYrBlt'] = df_sales['GarageYrBlt'].astype('Int64')

---


# Push files to Repo

Save both datasets to `outputs/datasets/collection/` so the data cleaning and feature engineering notebooks can load them.

In [16]:
import os

# Create the output folder; exist_ok=True avoids an error if it already exists
os.makedirs('outputs/datasets/collection', exist_ok=True)

# Save to standardised outputs folder for downstream notebooks
df_sales.to_csv("outputs/datasets/collection/house_prices_records.csv", index=False)
df_inherited.to_csv("outputs/datasets/collection/inherited_houses.csv", index=False)

print("Raw datasets saved successfully to outputs/datasets/collection/")


Raw datasets saved successfully to outputs/datasets/collection/
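
One caveat for the downstream notebooks: CSV files do not store column types, so the `Int64` conversions made earlier are not preserved on disk. The sketch below uses a toy frame and an in-memory buffer (both illustrative, not the real files) to show the round trip and one way to restore the type when reading.

```python
import io

import pandas as pd

# Toy frame (not from the dataset) with one missing bedroom count
toy = pd.DataFrame({
    "BedroomAbvGr": pd.array([3, None, 4], dtype="Int64"),
    "LotArea": [8450, 9600, 11250],
})

buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# Default read: pandas infers the type again and the missing value forces float64
buffer.seek(0)
default_read = pd.read_csv(buffer)

# Explicit dtype: the nullable integer type is restored
buffer.seek(0)
typed_read = pd.read_csv(buffer, dtype={"BedroomAbvGr": "Int64"})
```

A later notebook could pass the same `dtype` mapping for `BedroomAbvGr` and `GarageYrBlt` when loading `house_prices_records.csv`, assuming it needs the integer types.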
