# **Data Collection Notebook**

## Objectives

* Fetch data from Kaggle and save it as raw data.
* Inspect the data and save it under outputs/datasets/collection.

## Inputs

* Kaggle JSON file - the authentication token.

## Outputs

* Generate and save dataset: outputs/datasets/collection/....


---

# Change working directory

We need to change the working directory from its current folder to its parent folder.
* We access the current directory with os.getcwd()

In [1]:
import os
current_dir = os.getcwd()
current_dir

'/workspace/heritage-housing-issues/jupyter_notebooks'

We want to make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() defines the new current directory

In [2]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [3]:
current_dir = os.getcwd()
current_dir

'/workspace/heritage-housing-issues'
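
The same parent-folder lookup can be sketched with `pathlib` instead of `os.path`. This is an alternative for illustration, not the method the notebook uses, and the helper name `parent_of` is hypothetical.

```python
from pathlib import Path


def parent_of(directory):
    """Return the parent folder of `directory` as a string (hypothetical helper)."""
    return str(Path(directory).parent)


# Example with the notebook folder shown above.
# In a live session, os.chdir(parent_of(os.getcwd())) would apply the change.
print(parent_of("/workspace/heritage-housing-issues/jupyter_notebooks"))
```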

Install the Python packages listed in requirements.txt.

In [4]:
%pip install -r requirements.txt

Collecting numpy==1.18.5
  Using cached numpy-1.18.5-cp38-cp38-manylinux1_x86_64.whl (20.6 MB)
Collecting pandas==1.4.2
  Using cached pandas-1.4.2-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (11.7 MB)
Collecting matplotlib==3.3.1
  Using cached matplotlib-3.3.1-cp38-cp38-manylinux1_x86_64.whl (11.6 MB)
Collecting seaborn==0.11.0
  Using cached seaborn-0.11.0-py3-none-any.whl (283 kB)
Collecting ydata-profiling==4.4.0
  Using cached ydata_profiling-4.4.0-py2.py3-none-any.whl (356 kB)
Collecting plotly==4.12.0
  Using cached plotly-4.12.0-py2.py3-none-any.whl (13.1 MB)
Collecting ppscore==1.2.0
  Using cached ppscore-1.2.0-py2.py3-none-any.whl
Collecting streamlit==0.85.0
  Using cached streamlit-0.85.0-py2.py3-none-any.whl (7.9 MB)
Collecting feature-engine==1.0.2
  Using cached feature_engine-1.0.2-py2.py3-none-any.whl (152 kB)
Collecting imbalanced-learn==0.8.0
  Using cached imbalanced_learn-0.8.0-py3-none-any.whl (206 kB)
Collecting scikit-learn==0.24.2
  Using cached 

# Fetch data from Kaggle

In [5]:
%pip install kaggle==1.5.12


[notice] A new release of pip is available: 23.0.1 -> 24.0
[notice] To update, run: pip install --upgrade pip
Note: you may need to restart the kernel to use updated packages.


Take the following steps to access the JSON file from Kaggle:
* Create and log in to your Kaggle account.
* At the top right, click on your profile picture, then select “Settings” from the dropdown menu.
* Scroll down to the API section.
* Click 'Expire API Token' to remove any previous tokens.
* Click 'Create New API Token' to generate a fresh authentication token; this downloads a kaggle.json file.
* Drag and drop the downloaded kaggle.json file into the project's root folder in your file explorer and make sure it is named kaggle.json.
* Run the cell below so that the token is recognized in the session.

In [6]:
import os
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()
! chmod 600 kaggle.json

chmod: cannot access 'kaggle.json': No such file or directory
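
The `chmod` error above indicates that kaggle.json was not in the working directory when the cell ran. A minimal sketch of the same step that checks for the file first is shown below. The helper name `prepare_kaggle_token` is hypothetical and not part of the Kaggle package; `KAGGLE_CONFIG_DIR` is the environment variable the cell above already sets.

```python
import os
import stat


def prepare_kaggle_token(config_dir):
    """Point Kaggle at `config_dir` and restrict kaggle.json to its owner.

    Returns True if the token file was found, False otherwise.
    Hypothetical helper, not part of the Kaggle package.
    """
    token_path = os.path.join(config_dir, "kaggle.json")
    if not os.path.isfile(token_path):
        print(f"kaggle.json not found in {config_dir}; upload it and rerun.")
        return False
    os.environ["KAGGLE_CONFIG_DIR"] = config_dir
    # Same effect as `chmod 600`: read and write for the owner only.
    os.chmod(token_path, stat.S_IRUSR | stat.S_IWUSR)
    return True


prepare_kaggle_token(os.getcwd())
```

If the function returns False, the download cell that follows will likely fail to authenticate, so it is worth fixing the file location before continuing.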


* The heritage housing dataset is located at [Kaggle URL](https://www.kaggle.com/datasets/codeinstitute/housing-prices-data)
* Define the kaggle dataset path as the path that comes after https://www.kaggle.com/datasets/
* Set the destination folder.
* Download the data.

In [None]:
KaggleDatasetPath = "codeinstitute/housing-prices-data"
DestinationFolder = "inputs/datasets/raw"   
! kaggle datasets download -d {KaggleDatasetPath} -p {DestinationFolder}
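
To illustrate how the dataset path relates to the URL, the sketch below strips the `https://www.kaggle.com/datasets/` prefix from a dataset link. The function name `dataset_path_from_url` and the constant are hypothetical and only used here for illustration.

```python
KAGGLE_DATASETS_PREFIX = "https://www.kaggle.com/datasets/"


def dataset_path_from_url(url):
    """Return the part of a Kaggle dataset URL that follows the datasets prefix."""
    if not url.startswith(KAGGLE_DATASETS_PREFIX):
        raise ValueError(f"Not a Kaggle dataset URL: {url}")
    return url[len(KAGGLE_DATASETS_PREFIX):].strip("/")


print(dataset_path_from_url(
    "https://www.kaggle.com/datasets/codeinstitute/housing-prices-data"
))
```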

Unzip the downloaded file, delete the zip file, and delete the kaggle.json file.

In [None]:
! unzip {DestinationFolder}/*.zip -d {DestinationFolder} \
  && rm {DestinationFolder}/*.zip \
  && rm kaggle.json
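
Where shell commands such as `unzip` are not available, the extraction step could be done with Python's standard `zipfile` module instead. The sketch below is an assumption about how that might look, not the notebook's own approach; `extract_and_remove_zips` is a hypothetical helper name.

```python
import zipfile
from pathlib import Path


def extract_and_remove_zips(folder):
    """Extract every .zip file in `folder` into that folder, then delete the archive.

    Returns the names of the archives that were processed.
    """
    processed = []
    for zip_path in sorted(Path(folder).glob("*.zip")):
        with zipfile.ZipFile(zip_path) as archive:
            archive.extractall(folder)
        zip_path.unlink()  # remove the archive once its contents are extracted
        processed.append(zip_path.name)
    return processed
```

In this notebook it would be called as `extract_and_remove_zips("inputs/datasets/raw")`. Deleting the token afterwards remains a separate step, for example `Path("kaggle.json").unlink()`.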

# Load and Inspect Kaggle data

In [7]:
import pandas as pd
df = pd.read_csv("inputs/datasets/raw/house-prices/house-price/house_prices_records.csv")
df.head()

Unnamed: 0,1stFlrSF,2ndFlrSF,BedroomAbvGr,BsmtExposure,BsmtFinSF1,BsmtFinType1,BsmtUnfSF,EnclosedPorch,GarageArea,GarageFinish,...,LotFrontage,MasVnrArea,OpenPorchSF,OverallCond,OverallQual,TotalBsmtSF,WoodDeckSF,YearBuilt,YearRemodAdd,SalePrice
0,856,854.0,3.0,No,706,GLQ,150,0.0,548,RFn,...,65.0,196.0,61,5,7,856,0.0,2003,2003,208500
1,1262,0.0,3.0,Gd,978,ALQ,284,,460,RFn,...,80.0,0.0,0,8,6,1262,,1976,1976,181500
2,920,866.0,3.0,Mn,486,GLQ,434,0.0,608,RFn,...,68.0,162.0,42,5,7,920,,2001,2002,223500
3,961,,,No,216,ALQ,540,,642,Unf,...,60.0,0.0,35,5,7,756,,1915,1970,140000
4,1145,,4.0,Av,655,GLQ,490,0.0,836,RFn,...,84.0,350.0,84,5,8,1145,,2000,2000,250000


DataFrame Summary

In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 24 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   1stFlrSF       1460 non-null   int64  
 1   2ndFlrSF       1374 non-null   float64
 2   BedroomAbvGr   1361 non-null   float64
 3   BsmtExposure   1460 non-null   object 
 4   BsmtFinSF1     1460 non-null   int64  
 5   BsmtFinType1   1346 non-null   object 
 6   BsmtUnfSF      1460 non-null   int64  
 7   EnclosedPorch  136 non-null    float64
 8   GarageArea     1460 non-null   int64  
 9   GarageFinish   1298 non-null   object 
 10  GarageYrBlt    1379 non-null   float64
 11  GrLivArea      1460 non-null   int64  
 12  KitchenQual    1460 non-null   object 
 13  LotArea        1460 non-null   int64  
 14  LotFrontage    1201 non-null   float64
 15  MasVnrArea     1452 non-null   float64
 16  OpenPorchSF    1460 non-null   int64  
 17  OverallCond    1460 non-null   int64  
 18  OverallQ

Check for missing data

In [9]:
df.isnull().sum()

1stFlrSF            0
2ndFlrSF           86
BedroomAbvGr       99
BsmtExposure        0
BsmtFinSF1          0
BsmtFinType1      114
BsmtUnfSF           0
EnclosedPorch    1324
GarageArea          0
GarageFinish      162
GarageYrBlt        81
GrLivArea           0
KitchenQual         0
LotArea             0
LotFrontage       259
MasVnrArea          8
OpenPorchSF         0
OverallCond         0
OverallQual         0
TotalBsmtSF         0
WoodDeckSF       1305
YearBuilt           0
YearRemodAdd        0
SalePrice           0
dtype: int64
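
Raw counts are easier to judge as a share of the rows: with 1460 rows, the 1324 missing values in EnclosedPorch are roughly 91% of the column. The sketch below shows one way to report both figures. `missing_summary` is a hypothetical helper, and the toy frame exists only to make the example self-contained; in the notebook you would pass `df`.

```python
import pandas as pd


def missing_summary(frame):
    """Return count and percentage of missing values for columns that have any."""
    counts = frame.isnull().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(frame) * 100).round(2),
    })
    # Keep only columns with missing values, most affected first.
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)


# Toy data for illustration only.
example = pd.DataFrame({
    "a": [1, None, 3, None],
    "b": [1, 2, 3, 4],
    "c": [None, 2, 3, 4],
})
print(missing_summary(example))
```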

Check for duplicated data

In [10]:
df[df.duplicated(subset=None)]

Unnamed: 0,1stFlrSF,2ndFlrSF,BedroomAbvGr,BsmtExposure,BsmtFinSF1,BsmtFinType1,BsmtUnfSF,EnclosedPorch,GarageArea,GarageFinish,...,LotFrontage,MasVnrArea,OpenPorchSF,OverallCond,OverallQual,TotalBsmtSF,WoodDeckSF,YearBuilt,YearRemodAdd,SalePrice


Check the unique values in the columns with a non-numeric data type, and count the unique values in numeric columns that have fewer than 11 of them

In [11]:
for col in df:
    if df[col].dtypes=='object':
        print(col, '-', df[col].unique())
    elif df[col].unique().size < 11:
        print(col, '-', df[col].unique().size)

BedroomAbvGr - 9
BsmtExposure - ['No' 'Gd' 'Mn' 'Av' 'None']
BsmtFinType1 - ['GLQ' 'ALQ' 'Unf' 'Rec' nan 'BLQ' 'None' 'LwQ']
GarageFinish - ['RFn' 'Unf' nan 'Fin' 'None']
KitchenQual - ['Gd' 'TA' 'Ex' 'Fa']
OverallCond - 9
OverallQual - 10
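
The check above can also be wrapped in a small function that returns the two groups of columns instead of printing them. This is a sketch under assumptions: `categorical_candidates` and its `max_unique` parameter are hypothetical, and unlike the cell above it uses `nunique()`, which does not count NaN as a value.

```python
import pandas as pd


def categorical_candidates(frame, max_unique=10):
    """Split columns into non-numeric ones and numeric ones with few unique values."""
    non_numeric = [
        col for col in frame.columns
        if not pd.api.types.is_numeric_dtype(frame[col])
    ]
    low_cardinality = [
        col for col in frame.columns
        if col not in non_numeric and frame[col].nunique() <= max_unique
    ]
    return non_numeric, low_cardinality


# Toy data for illustration only; in the notebook you would pass `df`.
example = pd.DataFrame({
    "quality": ["Gd", "TA", "Gd", "Ex"],
    "rating": [5, 7, 5, 7],
    "area": [856.0, 1262.0, 920.0, 961.0],
})
print(categorical_candidates(example, max_unique=2))
```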


## Data Observations
* The dataset has 1460 rows and 24 columns.
* The data types are a mix of integers, floats and objects.
* 9 columns have missing data, ranging from 8 missing values (MasVnrArea) to 1324 (EnclosedPorch).
* 4 columns contain categorical data.
* 3 further columns (BedroomAbvGr, OverallCond, OverallQual) have only a small number of unique numerical entries, suggesting that they could also be converted to categorical data.

Further investigation and data cleaning are suggested for the next notebook.

---

# Push files to Repo

* Create the output folder if it does not already exist and save the inspected dataset there as a CSV file.

In [None]:
import os

# Create the output folder; exist_ok=True avoids an error if it already exists
os.makedirs('outputs/datasets/collection', exist_ok=True)

# Save the collected data without the DataFrame index
df.to_csv("outputs/datasets/collection/house_prices_records.csv", index=False)
