# **Data Collection Notebook**

## Objectives

* Fetch data from Kaggle and save it as raw data.
* Inspect the data and save it under outputs/datasets/collection.

## Inputs

* Kaggle JSON file - the authentication token.

## Outputs

* outputs/datasets/collection/heart.csv


---

# Change working directory

* Notebooks are stored in a subfolder, therefore when running the notebook in the editor, we need to change the working directory from the current folder to the parent folder.
* We access the current directory with os.getcwd()

In [None]:
import os
current_dir = os.getcwd()
current_dir

We want to make the parent of the current directory the new current directory:
* os.path.dirname() gets the parent directory
* os.chdir() defines the new current directory

In [None]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

Confirm the new current directory

In [None]:
current_dir = os.getcwd()
current_dir
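
Note that re-running the `os.chdir` cell moves the working directory up one more level each time. A minimal sketch of a rerun-safe alternative is shown here, assuming the project root can be recognised by a marker file. The `find_project_root` helper and the `README.md` marker are hypothetical and not part of the original notebook.

```python
import os

def find_project_root(start, marker="README.md"):
    """Walk upwards from `start` until a directory containing `marker` is found."""
    path = os.path.abspath(start)
    while True:
        if os.path.exists(os.path.join(path, marker)):
            return path
        parent = os.path.dirname(path)
        if parent == path:
            # Reached the filesystem root without finding the marker
            raise FileNotFoundError(f"No {marker!r} found above {start!r}")
        path = parent
```

It could be used as `os.chdir(find_project_root(os.getcwd()))`, which leaves the working directory unchanged if the cell is run a second time.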

# Fetch data from Kaggle

Install Kaggle package to fetch data

In [None]:
%pip install kaggle==1.5.12

Add the Kaggle authentication token to the root directory.

In [None]:
import os
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()
! chmod 600 kaggle.json
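
Before downloading, it can help to confirm that the token is actually in place. The sketch below assumes the standard token layout (a JSON object with `username` and `key` fields). The `check_kaggle_token` helper is hypothetical and only illustrates the idea.

```python
import json
import os

def check_kaggle_token(config_dir):
    """Return the path to kaggle.json after basic sanity checks."""
    token_path = os.path.join(config_dir, "kaggle.json")
    if not os.path.isfile(token_path):
        raise FileNotFoundError(f"kaggle.json not found in {config_dir}")
    with open(token_path) as f:
        token = json.load(f)
    # The token is expected to hold both a username and an API key
    missing = {"username", "key"} - set(token)
    if missing:
        raise ValueError(f"kaggle.json is missing fields: {sorted(missing)}")
    # Restrict permissions, as `chmod 600 kaggle.json` does on POSIX systems
    os.chmod(token_path, 0o600)
    return token_path
```

Calling `check_kaggle_token(os.getcwd())` after setting `KAGGLE_CONFIG_DIR` raises a clear error if the file is absent or incomplete, rather than failing later during the download.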

Define the Kaggle dataset and destination folder.
- The dataset is located at the following URL: https://www.kaggle.com/datasets/johnsmith88/heart-disease-dataset
- Check that the dataset is public or that you have permission to use it.
- Define the Kaggle path as the part of the URL that follows www.kaggle.com/datasets/.
- Set the destination folder.
- Download the data.

In [None]:
KaggleDatasetPath = "johnsmith88/heart-disease-dataset"
DestinationFolder = "inputs/datasets/raw"   
! kaggle datasets download -d {KaggleDatasetPath} -p {DestinationFolder}

Unzip the downloaded file

In [None]:
! unzip {DestinationFolder}/*.zip -d {DestinationFolder} \
  && rm {DestinationFolder}/*.zip \
  && rm kaggle.json
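
The shell command above relies on `unzip` and `rm` being available. On systems without them, a rough equivalent using only the Python standard library might look like the following sketch. The `extract_and_remove_zips` helper is hypothetical, and unlike the shell cell it does not delete kaggle.json.

```python
import glob
import os
import zipfile

def extract_and_remove_zips(folder):
    """Extract every .zip in `folder` into `folder`, then delete the archive."""
    extracted = []
    for zip_path in sorted(glob.glob(os.path.join(folder, "*.zip"))):
        with zipfile.ZipFile(zip_path) as archive:
            archive.extractall(folder)
            extracted.extend(archive.namelist())
        # Remove the archive only after it has been extracted successfully
        os.remove(zip_path)
    return extracted
```

For this notebook the call would be `extract_and_remove_zips(DestinationFolder)`, which returns the names of the extracted files.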

---

# Load and Inspect the Kaggle Data

In [None]:
import pandas as pd
df = pd.read_csv("inputs/datasets/raw/heart.csv")
df.head()

DataFrame Summary

In [None]:
df.info()

Check for missing values 

In [None]:
df.isnull().sum()

# Preliminary assessment of data

- Data has 1025 entries
- Data has the following data types: float64(1), int64(13)
- There is a total of 14 columns
- 4 columns (sex, fbs, exang, target) are encoded as binary, 0 and 1; cp and restecg are integer-coded categories with more than two levels.
- There are no missing values
- Further investigation is required in the following notebook
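
The points above can be checked programmatically instead of being read off the cell outputs. The sketch below assumes pandas is installed. The `summarise` helper is hypothetical, and it treats a column as binary only if its non-missing values are exactly 0 and 1.

```python
import pandas as pd

def summarise(df):
    """Collect the facts used in a preliminary assessment of a DataFrame."""
    binary_cols = [
        col for col in df.columns
        # Binary means exactly two distinct non-missing values, both in {0, 1}
        if df[col].nunique() == 2 and set(df[col].dropna().unique()) <= {0, 1}
    ]
    return {
        "rows": len(df),
        "columns": df.shape[1],
        "dtypes": df.dtypes.astype(str).value_counts().to_dict(),
        "missing": int(df.isnull().sum().sum()),
        "binary_columns": binary_cols,
    }
```

Running `summarise(df)` on the loaded data returns a dictionary that can be compared directly with the list above.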

---

# Push files to Repo

In [None]:
import os
# Create outputs/datasets/collection; exist_ok avoids an error if it already exists
os.makedirs('outputs/datasets/collection', exist_ok=True)

df.to_csv("outputs/datasets/collection/heart.csv", index=False)
