# **This notebook performs ETL tasks**

## Objectives

* This notebook covers three main objectives:
  1. Extract the data from the source (the raw data is downloaded from Kaggle and stored in the Raw folder).
  2. Transform the data to fit operational needs.
  3. Load the data into a new CSV file saved in the Cleaned folder.

## Inputs

* The input is the diabetes_prediction_dataset.csv file located in the Raw folder.
* The input file is downloaded from Kaggle: https://www.kaggle.com/datasets/iammustafatz/diabetes-prediction-dataset
* The dataset contains medical and demographic data of patients along with their diabetes status.
* The 8 features in the dataset include:
  - age                     : Patient's age in years.
  - gender                  : Biological sex of the patient (e.g., Male, Female, Other)
  - body mass index (BMI)   : A measure of body fat based on height and weight (kg/m²).
  - hypertension            : Presence of high blood pressure (1 = Yes, 0 = No).
  - heart disease           : Presence of heart condition (1 = Yes, 0 = No).
  - smoking history         : Patient’s past or current smoking behavior (e.g., never, former, current)
  - HbA1c level             : Average blood sugar level over the past 2-3 months (%).
  - blood glucose level     : Current blood sugar level (usually measured in mg/dL).

* The target variable is:
  - diabetes                : Indicates whether the patient has diabetes (1 = Yes, 0 = No).
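
A quick schema check can confirm that a loaded file really carries these columns before any further step. The following is a minimal sketch, assuming the column names match the CSV header reported by `df_original.info()` later in this notebook; `EXPECTED_COLUMNS` and `missing_columns` are hypothetical names introduced here for illustration, and the toy frame is not taken from the dataset.

```python
import pandas as pd

# Column names as reported by df_original.info() in this notebook.
EXPECTED_COLUMNS = [
    "gender", "age", "hypertension", "heart_disease", "smoking_history",
    "bmi", "HbA1c_level", "blood_glucose_level", "diabetes",
]


def missing_columns(df: pd.DataFrame, expected=EXPECTED_COLUMNS) -> list:
    """Return the expected column names that are absent from df (hypothetical helper)."""
    return [col for col in expected if col not in df.columns]


# Toy frame for illustration only; it is not taken from the dataset.
toy = pd.DataFrame({"gender": ["Female"], "age": [40.0], "bmi": [22.5]})
print(missing_columns(toy))
```

An empty list means every expected column is present; in the notebook the helper could be called as `missing_columns(df_original)` right after loading.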

## Outputs

* The output is a cleaned CSV file named cleaned_diabetes_data.csv located in the Cleaned folder. 

## Additional Comments

* The original dataset is large, with 100K records.
* The target variable is imbalanced, with only 8.5% of records positive for diabetes. This may reflect real-world prevalence.
* For the purpose of this project, a subset of 10K records is extracted by randomly sampling 5K records from each target class, which yields a balanced subset.
* Subset extraction steps:  
  1. Load the original dataset into a pandas dataframe.
  2. Split the dataset into two for each target class (diabetes = 0 and diabetes = 1).
  3. Randomly sample 5k records from each class.
  4. Concatenate the two sampled datasets to create a balanced subset of 10k records.
  5. Save the sampled records into a new CSV file named sampled_diabetes.csv in the Raw folder for the rest of the ETL steps.
  6. The original large dataset is retained in the Raw folder for reference but added to the .gitignore file to avoid committing large files to the repo.

---

# Change working directory

* We assume the notebooks are stored in a subfolder, so when running the notebook in the editor you need to change the working directory.

We need to change the working directory from its current folder to its parent folder.
* We access the current directory with os.getcwd()

In [1]:
import os
current_dir = os.getcwd()
current_dir

'c:\\Users\\nived\\Desktop\\Nivya work learnings\\data analytics_ai\\Capstone\\Healthcare_Diabetes_Analysis\\jupyter_notebooks'

We want to make the parent of the current directory the new current directory.
* os.path.dirname() gets the parent directory
* os.chdir() sets the new current directory

In [2]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [3]:
current_dir = os.getcwd()
current_dir

'c:\\Users\\nived\\Desktop\\Nivya work learnings\\data analytics_ai\\Capstone\\Healthcare_Diabetes_Analysis'
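
The same parent-folder lookup can also be written with `pathlib`, which avoids string handling of paths. This is a minimal sketch, not the approach used in the cells above; `project_root` is a hypothetical helper name, and it assumes the notebook folder sits directly inside the project folder.

```python
from pathlib import Path


def project_root(notebook_dir) -> Path:
    """Return the parent of the notebook folder (hypothetical helper)."""
    return Path(notebook_dir).parent


# Show where the working directory would move to, without changing it.
print(project_root(Path.cwd()))

# Inside the notebook, the actual switch would be:
# import os
# os.chdir(project_root(Path.cwd()))
```

Running the switch more than once moves one level further up each time, so it is worth confirming the result with `os.getcwd()` as done above.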

# Extraction

* Importing needed libraries

In [4]:
import numpy as np
import pandas as pd

* Loading the original raw data into a pandas dataframe

In [5]:
df_original = pd.read_csv('Data/raw/diabetes_prediction_dataset.csv')
df_original.head()

,gender,age,hypertension,heart_disease,smoking_history,bmi,HbA1c_level,blood_glucose_level,diabetes
0,Female,80.0,0,1,never,25.19,6.6,140,0
1,Female,54.0,0,0,No Info,27.32,6.6,80,0
2,Male,28.0,0,0,never,27.32,5.7,158,0
3,Female,36.0,0,0,current,23.45,5.0,155,0
4,Male,76.0,1,1,current,20.14,4.8,155,0


* Checking the shape of the original dataframe

In [6]:
df_original.shape

(100000, 9)

* Checking basic info of the original dataframe

In [7]:
df_original.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 9 columns):
 #   Column               Non-Null Count   Dtype  
---  ------               --------------   -----  
 0   gender               100000 non-null  object 
 1   age                  100000 non-null  float64
 2   hypertension         100000 non-null  int64  
 3   heart_disease        100000 non-null  int64  
 4   smoking_history      100000 non-null  object 
 5   bmi                  100000 non-null  float64
 6   HbA1c_level          100000 non-null  float64
 7   blood_glucose_level  100000 non-null  int64  
 8   diabetes             100000 non-null  int64  
dtypes: float64(3), int64(4), object(2)
memory usage: 6.9+ MB


* Checking for class balance in the target variable by displaying value counts as percentages (0 = no diabetes, 1 = diabetes)

In [8]:
print("Diabetes value count as %:", df_original['diabetes'].value_counts(normalize=True) * 100)

Diabetes value count as %: diabetes
0    91.5
1     8.5
Name: proportion, dtype: float64
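
Passing a whole Series to `print` alongside a label produces the slightly awkward first line seen above. A small helper can return the percentages as a plain dictionary for cleaner reporting. This is a sketch only: `class_percentages` is a hypothetical name, and the toy labels are made up for illustration rather than taken from the dataset.

```python
import pandas as pd


def class_percentages(labels: pd.Series) -> dict:
    """Map each class label to its share of rows, in percent (hypothetical helper)."""
    shares = labels.value_counts(normalize=True).mul(100).round(1)
    return shares.to_dict()


# Toy labels for illustration only: three negatives and one positive.
toy_labels = pd.Series([0, 0, 0, 1], name="diabetes")
for label, pct in class_percentages(toy_labels).items():
    print(f"class {label}: {pct}%")
```

In the notebook this could be called as `class_percentages(df_original['diabetes'])`.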


* As the output above shows, the dataset is large (100K records) and the target variable is imbalanced, with only 8.5% of records positive for diabetes.
* Extracting only 10K records by randomly sampling 5K records from each target class to obtain a balanced subset.

In [9]:
# Split the dataset in two by target class and sample 5000 records
# from each class to avoid bias.

# Separate by class
df_original_no_diab = df_original[df_original['diabetes'] == 0]
df_original_diab = df_original[df_original['diabetes'] == 1]

# Check that each class has at least 5000 records, since we want 10k records
# in total (this check is purely optional).
# random_state is set for reproducibility.

min_samp_size = 5000
if len(df_original_no_diab) >= min_samp_size and len(df_original_diab) >= min_samp_size:
    # Randomly sample 5000 from each class
    df_no_diab_sampled = df_original_no_diab.sample(n=min_samp_size, random_state=42)
    df_diab_sampled = df_original_diab.sample(n=min_samp_size, random_state=42)

    # Combine and shuffle the final dataset
    df_raw = pd.concat([df_no_diab_sampled, df_diab_sampled]).sample(frac=1, random_state=42).reset_index(drop=True)

    print(df_raw.shape)
    print(df_raw['diabetes'].value_counts())
else:
    print("Not enough data to sample 5000 from each class.")

(10000, 9)
diabetes
1    5000
0    5000
Name: count, dtype: int64


* The stats from the sampled raw dataframe above show 10k records, and the value counts of the target variable confirm that the dataset is now balanced.
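
The split, sample and concatenate steps can also be expressed in one call with `DataFrameGroupBy.sample` (available from pandas 1.1). The following is a sketch of that alternative, not the method used in the cell above: `balanced_sample` is a hypothetical helper, the toy frame is made up for illustration, and the rows it draws are not guaranteed to match those drawn by the per-class approach even with the same seed.

```python
import pandas as pd


def balanced_sample(df, target, n_per_class, seed=42):
    """Draw n_per_class rows from every class of `target`, then shuffle (hypothetical helper)."""
    counts = df[target].value_counts()
    if counts.min() < n_per_class:
        raise ValueError(
            f"Smallest class has {counts.min()} rows; need {n_per_class}."
        )
    # Sample the same number of rows within each class.
    sampled = df.groupby(target).sample(n=n_per_class, random_state=seed)
    # Shuffle so the classes are interleaved, then rebuild a clean index.
    return sampled.sample(frac=1, random_state=seed).reset_index(drop=True)


# Toy frame for illustration only: seven negatives and three positives.
toy = pd.DataFrame({"bmi": range(10), "diabetes": [0] * 7 + [1] * 3})
subset = balanced_sample(toy, "diabetes", n_per_class=3)
print(subset["diabetes"].value_counts().to_dict())
```

Raising an error when a class is too small, instead of printing a message, stops later cells from running against a dataframe that was never created.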

* Save this sampled raw dataframe to a new csv file in the Raw folder for the rest of the ETL steps

In [10]:
# Write the balanced sample to the raw data folder, without the index column
df_raw.to_csv('Data/raw/sampled_diabetes.csv', index=False)
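
`to_csv` fails if the target folder does not exist, and it gives no confirmation of what was written. The sketch below creates the parent folder first and re-reads the file to confirm its shape. `save_csv_checked` is a hypothetical helper, and the temporary directory and toy frame are used only so the example runs anywhere.

```python
import tempfile
from pathlib import Path

import pandas as pd


def save_csv_checked(df, path):
    """Write df to path, creating parent folders, then re-read it to confirm the shape (hypothetical helper)."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    df.to_csv(path, index=False)
    reloaded = pd.read_csv(path)
    if reloaded.shape != df.shape:
        raise RuntimeError(
            f"Expected shape {df.shape}, found {reloaded.shape} in {path}"
        )
    return path


# Toy frame and temporary folder for illustration only.
toy = pd.DataFrame({"bmi": [25.19, 27.32], "diabetes": [0, 1]})
with tempfile.TemporaryDirectory() as tmp:
    saved = save_csv_checked(toy, Path(tmp) / "raw" / "toy_sample.csv")
    print(saved.name)
```

In the notebook the call would be `save_csv_checked(df_raw, 'Data/raw/sampled_diabetes.csv')`.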


---

# Push files to Repo

* In cases where you don't need to push files to Repo, you may replace this section with "Conclusions and Next Steps" and state your conclusions and next steps.

In [None]:
import os
try:
  # create your folder here
  # os.makedirs(name='')
  pass  # placeholder so the otherwise empty try block is valid Python
except Exception as e:
  print(e)
