# **Data Collection**

## Objectives

* Fetch data from Kaggle and save it as raw data.
* Inspect the data and save it under outputs/datasets/collection.

## Inputs

* Kaggle JSON file - the authentication token.
* Kaggle dataset URL - fedesoriano/heart-failure-prediction/data

## Outputs

* outputs/datasets/collection/HeartDiseasePrediction.csv



---

# Change working directory

We need to change the working directory from the notebook's folder to its parent folder.
* We access the current directory with os.getcwd()

In [2]:
import os
current_dir = os.getcwd()
current_dir

'/home/jfpaliga/CVD-predictor/jupyter_notebooks'

We want to make the parent of the current directory the new current directory.
* os.path.dirname() gets the parent directory
* os.chdir() sets the new current directory

In [3]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [4]:
current_dir = os.getcwd()
current_dir

'/home/jfpaliga/CVD-predictor'
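As a small illustration of what os.path.dirname() does here, the sketch below applies it to the path string reported earlier in this notebook. It only manipulates the string and does not change the working directory; the variable names are chosen for illustration.

```python
import os

# Path reported by os.getcwd() in the notebook output above
notebook_dir = "/home/jfpaliga/CVD-predictor/jupyter_notebooks"

# os.path.dirname() drops the final path component, giving the parent folder
project_root = os.path.dirname(notebook_dir)
print(project_root)
```

Because the directory change is relative to wherever the notebook currently is, re-running both the os.getcwd() cell and the os.chdir() cell would move up one more level each time.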

# Import Dataset from Kaggle

Firstly, the Kaggle API must be installed before the data can be loaded.

A registered Kaggle account is required to obtain an API key (provided as a JSON file).

In [4]:
! pip install kaggle==1.6.14

Collecting kaggle==1.6.14
  Downloading kaggle-1.6.14.tar.gz (82 kB)
  Preparing metadata (setup.py) ... done
Collecting bleach
  Downloading bleach-6.1.0-py3-none-any.whl (162 kB)
Collecting certifi>=2023.7.22
  Downloading certifi-2024.2.2-py3-none-any.whl (163 kB)
Collecting python-slugify
  Downloading python_slugify-8.0.4-py2.py3-none-any.whl (10 kB)
Collecting requests
  Downloading requests-2.31.0-py3-none-any.whl (62 kB)
Collecting tqdm
  Downloading tqdm-4.66.4-py3-none-any.whl (

Next, the Kaggle config directory is set to the current working directory, and the permissions on kaggle.json are restricted to read/write for the owner only (600).

In [5]:
import os
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()
! chmod 600 kaggle.json
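For reference, the same permission change can be made and verified from Python. This is a minimal sketch assuming a POSIX system; it works on a stand-in file in a temporary directory rather than on the real kaggle.json.

```python
import os
import stat
import tempfile

# Stand-in for kaggle.json, created in a temporary directory (hypothetical file)
with tempfile.TemporaryDirectory() as tmp_dir:
    token_path = os.path.join(tmp_dir, "kaggle.json")
    with open(token_path, "w") as f:
        f.write("{}")

    # Equivalent of `chmod 600`: the owner may read and write, nobody else
    os.chmod(token_path, stat.S_IRUSR | stat.S_IWUSR)

    # Extract just the permission bits from the file mode
    mode = stat.S_IMODE(os.stat(token_path).st_mode)
    print(oct(mode))
```

Checking the mode after setting it is a cheap way to confirm that the token is not readable by other users.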

Then, define the Kaggle dataset path and the destination folder, and download the dataset to the 'inputs/datasets/raw' directory.

* The dataset path is taken from the Kaggle url, after 'https://www.kaggle.com/datasets/'

In [7]:
KaggleDatasetPath = "fedesoriano/heart-failure-prediction"
DestinationFolder = "inputs/datasets/raw"
! kaggle datasets download -d {KaggleDatasetPath} -p {DestinationFolder}

Dataset URL: https://www.kaggle.com/datasets/fedesoriano/heart-failure-prediction
License(s): ODbL-1.0
Downloading heart-failure-prediction.zip to inputs/datasets/raw
  0%|                                               | 0.00/8.56k [00:00<?, ?B/s]
100%|██████████████████████████████████████| 8.56k/8.56k [00:00<00:00, 40.2MB/s]
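The rule for deriving the dataset path from the URL can be sketched as a small helper. The function name kaggle_dataset_path is hypothetical and not part of the Kaggle API; the sketch assumes the URL layout 'https://www.kaggle.com/datasets/<owner>/<dataset>', optionally followed by further segments such as '/data'.

```python
from urllib.parse import urlparse


def kaggle_dataset_path(url: str) -> str:
    """Return the '<owner>/<dataset>' part of a Kaggle dataset URL."""
    parts = [p for p in urlparse(url).path.split("/") if p]
    # Expected layout: ['datasets', '<owner>', '<dataset>', ...]
    if len(parts) < 3 or parts[0] != "datasets":
        raise ValueError(f"Not a Kaggle dataset URL: {url}")
    return "/".join(parts[1:3])


url = "https://www.kaggle.com/datasets/fedesoriano/heart-failure-prediction/data"
print(kaggle_dataset_path(url))
```

Trailing segments such as '/data' are ignored, so the URL listed under Inputs and the shorter dataset URL give the same path.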


The downloaded file is then unzipped, and the zipped file and kaggle.json are both deleted.

In [9]:
! unzip {DestinationFolder}/*.zip -d {DestinationFolder} \
    && rm {DestinationFolder}/*.zip \
        && rm kaggle.json

Archive:  inputs/datasets/raw/heart-failure-prediction.zip
  inflating: inputs/datasets/raw/heart.csv  
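Where the shell commands are not available (for example on Windows), the unzip-and-delete step could be done with the standard library instead. The sketch below is one possible approach, shown on a throwaway archive; the helper name unzip_and_remove and the demo file contents are illustrative assumptions.

```python
import os
import tempfile
import zipfile


def unzip_and_remove(zip_path: str, destination: str) -> list:
    """Extract every member of zip_path into destination, then delete the archive."""
    with zipfile.ZipFile(zip_path) as archive:
        names = archive.namelist()
        archive.extractall(destination)
    os.remove(zip_path)
    return names


# Demonstration on a throwaway archive (toy content, not the Kaggle data)
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_zip = os.path.join(tmp_dir, "demo.zip")
    with zipfile.ZipFile(demo_zip, "w") as archive:
        archive.writestr("heart.csv", "Age,Sex\n40,M\n")

    extracted = unzip_and_remove(demo_zip, tmp_dir)
    csv_exists = os.path.exists(os.path.join(tmp_dir, "heart.csv"))
    zip_exists = os.path.exists(demo_zip)
    print(extracted, csv_exists, zip_exists)
```

The archive is deleted only after extraction has finished, mirroring the `&&` chaining in the shell version.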


---

# Load and Inspect the Kaggle Data

Using the pandas library, the dataset can be loaded as a dataframe and the data inspected.

In [5]:
import pandas as pd
df = pd.read_csv("inputs/datasets/raw/heart.csv")
df.head()

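|   | Age | Sex | ChestPainType | RestingBP | Cholesterol | FastingBS | RestingECG | MaxHR | ExerciseAngina | Oldpeak | ST_Slope | HeartDisease |
|---|-----|-----|---------------|-----------|-------------|-----------|------------|-------|----------------|---------|----------|--------------|
| 0 | 40 | M | ATA | 140 | 289 | 0 | Normal | 172 | N | 0.0 | Up | 0 |
| 1 | 49 | F | NAP | 160 | 180 | 0 | Normal | 156 | N | 1.0 | Flat | 1 |
| 2 | 37 | M | ATA | 130 | 283 | 0 | ST | 98 | N | 0.0 | Up | 0 |
| 3 | 48 | F | ASY | 138 | 214 | 0 | Normal | 108 | Y | 1.5 | Flat | 1 |
| 4 | 54 | M | NAP | 150 | 195 | 0 | Normal | 122 | N | 0.0 | Up | 0 |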


A summary of the dataframe columns, non-null counts and datatypes can be obtained.

There are 918 observations in the dataset, so the number of missing values in each column is 918 minus that column's non-null count.

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 918 entries, 0 to 917
Data columns (total 12 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Age             918 non-null    int64  
 1   Sex             918 non-null    object 
 2   ChestPainType   918 non-null    object 
 3   RestingBP       918 non-null    int64  
 4   Cholesterol     918 non-null    int64  
 5   FastingBS       918 non-null    int64  
 6   RestingECG      918 non-null    object 
 7   MaxHR           918 non-null    int64  
 8   ExerciseAngina  918 non-null    object 
 9   Oldpeak         918 non-null    float64
 10  ST_Slope        918 non-null    object 
 11  HeartDisease    918 non-null    int64  
dtypes: float64(1), int64(6), object(5)
memory usage: 86.2+ KB


Each column has a non-null count of 918, so there appears to be no missing data. It is, however, good practice to double-check.

If there is no missing data, we expect an empty list.

In [13]:
columns_with_nan = df.columns[df.isna().sum() > 0].to_list()
columns_with_nan

[]
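The two ways of counting missing values used above (row count minus non-null count, and isna().sum()) should always agree. The sketch below checks this on a small toy frame with one deliberately missing value; the data is illustrative and is not taken from the heart dataset.

```python
import pandas as pd

# Toy frame with one missing value (illustrative, not the heart dataset)
toy = pd.DataFrame({"Age": [40, None, 37], "Sex": ["M", "F", "M"]})

# Number of rows minus the non-null count, per column
missing_by_subtraction = len(toy) - toy.count()

# Direct count of NaN entries per column
missing_by_isna = toy.isna().sum()

# Columns with at least one missing value
columns_with_nan = toy.columns[missing_by_isna > 0].to_list()
print(columns_with_nan)
```

On data with gaps, the list names the affected columns, which is why an empty list for the heart dataset confirms there is nothing missing.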

There are no ID columns, so there is no need to check for any duplicate values.

The FastingBS feature is currently stored as a numerical datatype, but it represents a category, so it needs to be converted.

* As a reminder of the FastingBS feature: 1 = fasting blood sugar > 120 mg/dL, 0 = otherwise.
* A fasting blood sugar value of >125 mg/dL is typically indicative of diabetes, so we will map 1 to 'high diabetes risk' and 0 to 'low diabetes risk'. The feature's 120 mg/dL cut-off is slightly lower than this, so the labels are approximate.

In [12]:
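fastingbs_map = {0: "low diabetes risk", 1: "high diabetes risk"}
# Assign the result back; replace(inplace=True) on a selected column may not update df in newer pandas versions
df["FastingBS"] = df["FastingBS"].replace(fastingbs_map)
df.dtypes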

Age                 int64
Sex                object
ChestPainType      object
RestingBP           int64
Cholesterol         int64
FastingBS          object
RestingECG         object
MaxHR               int64
ExerciseAngina     object
Oldpeak           float64
ST_Slope           object
HeartDisease        int64
dtype: object
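As an alternative to replace(), Series.map() can apply the same dictionary, with the difference that any value missing from the dictionary becomes NaN, which makes unexpected codes easy to spot. The sketch below uses toy values standing in for the FastingBS column; it is an illustration, not the method used in this notebook.

```python
import pandas as pd

fastingbs_map = {0: "low diabetes risk", 1: "high diabetes risk"}

# Toy values standing in for the FastingBS column (illustrative only)
toy = pd.DataFrame({"FastingBS": [0, 1, 1, 0]})

# Assign the mapped column back instead of modifying it in place
toy["FastingBS"] = toy["FastingBS"].map(fastingbs_map)

# Any code missing from the dictionary would show up as NaN here
unmapped = int(toy["FastingBS"].isna().sum())
print(toy["FastingBS"].tolist(), unmapped)
```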

---

# Save the Dataset

The modified dataset is saved to the outputs directory.

In [14]:
import os
# exist_ok=True avoids an error if the directory already exists
os.makedirs("outputs/datasets/collection", exist_ok=True)

df.to_csv("outputs/datasets/collection/HeartDiseasePrediction.csv", index=False)
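Since notebooks are often re-run, it helps if the directory creation step is safe to repeat. The sketch below shows that os.makedirs() with exist_ok=True can be called twice on the same nested path without raising; it runs in a temporary directory so it does not touch the project's outputs folder.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    output_dir = os.path.join(tmp_dir, "outputs", "datasets", "collection")

    # exist_ok=True makes the call safe to repeat when the folder already exists
    os.makedirs(output_dir, exist_ok=True)
    os.makedirs(output_dir, exist_ok=True)

    created = os.path.isdir(output_dir)
    print(created)
```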

---

# Conclusions

In this notebook, the following was achieved:

* The dataset was successfully imported via the Kaggle API
* The dataset was inspected, and no missing values were found
* The datatype of the "FastingBS" feature was changed from integer to object, to reflect that it is a categorical feature
* The dataset was saved in the outputs directory

## Next Steps

In the next notebook, an exploratory data analysis will be carried out using Pandas profiling and correlation studies.

This work will make up the **'Data Understanding'** aspect of the CRISP-DM workflow, and will provide further understanding of the dataset and address business requirement 1.