# **This notebook handles Feature Engineering**

## Objectives

* The objective of this notebook is to perform feature engineering on the healthcare diabetes dataset to prepare it for predictive modeling.
* This includes creating new features, transforming existing ones, handling outliers, and scaling.

## Inputs

* The input to this notebook is the cleaned input file (cleaned_diabetes_data.csv) located in the data/cleaned directory.

## Outputs

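* A dashboard dataframe (df_dashboard) containing the engineered columns bmi_category, age_group, diabetes_risk and smoking_binned, to be saved later as a CSV file for dashboard visualizations.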


---

# Change working directory

* The notebooks are stored in a subfolder, so when running a notebook in the editor the working directory needs to be changed.

We need to change the working directory from its current folder to its parent folder.
* We access the current directory with os.getcwd()

In [1]:
import os
current_dir = os.getcwd()
current_dir

'c:\\Users\\nived\\Desktop\\Nivya work learnings\\data analytics_ai\\Capstone\\Healthcare_Diabetes_Analysis\\jupyter_notebooks'

We want to make the parent of the current directory the new current directory.
* os.path.dirname() gets the parent directory
* os.chdir() defines the new current directory

In [2]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [3]:
current_dir = os.getcwd()
current_dir

'c:\\Users\\nived\\Desktop\\Nivya work learnings\\data analytics_ai\\Capstone\\Healthcare_Diabetes_Analysis'
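Running the directory-change cell a second time moves one more level up. A minimal sketch of a guard against that, assuming the notebooks live in a folder named `jupyter_notebooks` as in the output above; the helper name `project_root_from` is hypothetical and not part of the original notebook:

```python
import os

def project_root_from(path, notebooks_dir="jupyter_notebooks"):
    """Return the parent of `path` if `path` is the notebooks folder, else `path` unchanged."""
    normalized = os.path.normpath(path)
    if os.path.basename(normalized) == notebooks_dir:
        return os.path.dirname(normalized)
    return normalized

# Only moves up when we are still inside the notebooks folder,
# so re-running the cell leaves the working directory where it is
target = project_root_from(os.getcwd())
os.chdir(target)
print(os.getcwd())
```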

# Features addition

* Here we add new features based on the EDA findings, and group and transform existing features.

* Import necessary libraries

In [4]:
import pandas as pd
import numpy as np

* Load the cleaned input file into a dataframe and display the first few rows

In [5]:
df = pd.read_csv('Data/cleaned/cleaned_diabetes_data.csv')
df.head()

Unnamed: 0,gender,age,hypertension,heart_disease,smoking_history,bmi,HbA1c_level,blood_glucose_level,diabetes
0,Male,80,0,0,former,26.4,8.2,126,1
1,Female,47,0,0,never,45.88,4.0,159,0
2,Female,26,0,0,not current,27.32,6.0,130,0
3,Male,80,0,0,No Info,27.32,7.5,160,1
4,Male,57,0,0,No Info,27.32,5.8,158,0


* Bin the BMI into categories as per WHO classification.
* I plan to use this column for dashboard visualizations and let the model use the actual BMI values.

In [6]:
# Define BMI bins based on WHO classification
bmi_bins = [0, 18.5, 25, 30, float('inf')]

# Define the labels for each range
bmi_labels = ['Underweight', 'Normal', 'Overweight', 'Obesity']

# Bin the data using pd.cut()
df['bmi_category'] = pd.cut(df['bmi'], bins=bmi_bins, labels=bmi_labels)

# checking for the value counts of the new bmi category
print(df['bmi_category'].value_counts())

# display the first few rows to verify the new column
df.head()

bmi_category
Overweight     4126
Obesity        3479
Normal         1356
Underweight      81
Name: count, dtype: int64


Unnamed: 0,gender,age,hypertension,heart_disease,smoking_history,bmi,HbA1c_level,blood_glucose_level,diabetes,bmi_category
0,Male,80,0,0,former,26.4,8.2,126,1,Overweight
1,Female,47,0,0,never,45.88,4.0,159,0,Obesity
2,Female,26,0,0,not current,27.32,6.0,130,0,Overweight
3,Male,80,0,0,No Info,27.32,7.5,160,1,Overweight
4,Male,57,0,0,No Info,27.32,5.8,158,0,Overweight


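* Below is the mapping for BMI categories:
  - Underweight: BMI < 18.5
  - Normal: 18.5 ≤ BMI < 25
  - Overweight: 25 ≤ BMI < 30
  - Obesity: BMI ≥ 30
* Note that pd.cut() uses right-closed intervals by default, so the cell above places a BMI of exactly 18.5, 25 or 30 in the lower category. Passing right=False would match the mapping exactly.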
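A minimal sketch of left-closed binning with `right=False`, which follows the ranges listed above at the boundary values. The sample BMI values are made up for illustration, and the default right-closed result is shown next to it for comparison:

```python
import pandas as pd

bmi_bins = [0, 18.5, 25, 30, float('inf')]
bmi_labels = ['Underweight', 'Normal', 'Overweight', 'Obesity']

# Boundary values chosen to show which side of each edge they land on
sample_bmi = pd.Series([18.4, 18.5, 24.9, 25.0, 29.9, 30.0])

# right=False makes each interval left-closed: [0, 18.5), [18.5, 25), [25, 30), [30, inf)
left_closed = pd.cut(sample_bmi, bins=bmi_bins, labels=bmi_labels, right=False)

# Default (right=True) for comparison: (0, 18.5], (18.5, 25], (25, 30], (30, inf]
right_closed = pd.cut(sample_bmi, bins=bmi_bins, labels=bmi_labels)

print(pd.DataFrame({'bmi': sample_bmi,
                    'right=False': left_closed,
                    'right=True': right_closed}))
```

Switching the notebook cell to `right=False` may shift a few rows between categories, so the value counts would need to be regenerated.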

* Group age into bins for better visualization in the dashboard.

In [7]:
# Define age bins into 4 categories
age_bins = [18, 30, 45, 60, 100]

# Define the labels for each range
age_labels = ['Young Adults (18–30)', 
              'Early Middle Age (31–45)', 
              'Late Middle Age (46–60)', 
              'Seniors (61+)']

# Bin the data using pd.cut()
df['age_group'] = pd.cut(
    df['age'],
    bins=age_bins,
    labels=age_labels,
    right=True,          # includes the right edge, so 30 is in first bin, 45 in second, etc.
    include_lowest=True  # ensures 18 is included in the first bin
)

# checking for the value counts of the new age group
print(df['age_group'].value_counts())

# display the first few rows to verify the new column
df.head()

age_group
Seniors (61+)               3757
Late Middle Age (46–60)     2627
Early Middle Age (31–45)    1645
Young Adults (18–30)        1013
Name: count, dtype: int64


Unnamed: 0,gender,age,hypertension,heart_disease,smoking_history,bmi,HbA1c_level,blood_glucose_level,diabetes,bmi_category,age_group
0,Male,80,0,0,former,26.4,8.2,126,1,Overweight,Seniors (61+)
1,Female,47,0,0,never,45.88,4.0,159,0,Obesity,Late Middle Age (46–60)
2,Female,26,0,0,not current,27.32,6.0,130,0,Overweight,Young Adults (18–30)
3,Male,80,0,0,No Info,27.32,7.5,160,1,Overweight,Seniors (61+)
4,Male,57,0,0,No Info,27.32,5.8,158,0,Overweight,Late Middle Age (46–60)
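The bins above only cover ages 18 to 100. A small sketch, using made-up ages, of what `pd.cut` does with values at and beyond the bin edges; the variable names `sample_age` and `groups` are illustrative only:

```python
import pandas as pd

age_bins = [18, 30, 45, 60, 100]
age_labels = ['Young Adults (18–30)',
              'Early Middle Age (31–45)',
              'Late Middle Age (46–60)',
              'Seniors (61+)']

# Hypothetical ages, including values outside the 18-100 range
sample_age = pd.Series([17, 18, 30, 31, 100, 101])

groups = pd.cut(sample_age, bins=age_bins, labels=age_labels,
                right=True, include_lowest=True)

# Ages below 18 or above 100 fall outside every bin and come back as NaN
print(pd.DataFrame({'age': sample_age, 'age_group': groups}))
print("rows without an age group:", groups.isna().sum())
```

A check like `df['age_group'].isna().sum()` after binning is one way to confirm that no rows were left without a group.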


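* Create a new feature called 'diabetes_risk' based on HbA1c levels and blood glucose levels.
* This feature sorts patients into four levels. The conditions are checked in order and the first match wins:
  - Very High Risk: HbA1c ≥ 6.5 or blood glucose ≥ 200
  - High Risk: 5.7 ≤ HbA1c ≤ 6.4 and 140 ≤ blood glucose ≤ 199
  - Moderate Risk: only one of the two is in its range above (5.7 ≤ HbA1c ≤ 6.4 or 140 ≤ blood glucose ≤ 199)
  - Low Risk: HbA1c < 5.7 or blood glucose < 140, with neither value in a higher range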

In [8]:
risk_conditions = [
    (df['HbA1c_level'] >= 6.5) | (df['blood_glucose_level'] >= 200),
    (df['HbA1c_level'].between(5.7, 6.4)) & (df['blood_glucose_level'].between(140, 199)),
    (df['HbA1c_level'].between(5.7, 6.4)) | (df['blood_glucose_level'].between(140, 199)),
    (df['HbA1c_level'] < 5.7) | (df['blood_glucose_level'] < 140)
]

risk_labels = [
    'Very High Risk',
    'High Risk',
    'Moderate Risk',
    'Low Risk'
]

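# An explicit string default keeps the result consistent with the string labels;
# newer NumPy versions can raise a TypeError when the implicit default 0 is mixed with strings
df['diabetes_risk'] = np.select(risk_conditions, risk_labels, default='Unknown')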

print(df['diabetes_risk'].value_counts())

# display the first few rows to verify the new column
df.head()

diabetes_risk
Very High Risk    4836
Moderate Risk     1941
High Risk         1496
Low Risk           769
Name: count, dtype: int64


Unnamed: 0,gender,age,hypertension,heart_disease,smoking_history,bmi,HbA1c_level,blood_glucose_level,diabetes,bmi_category,age_group,diabetes_risk
0,Male,80,0,0,former,26.4,8.2,126,1,Overweight,Seniors (61+),Very High Risk
1,Female,47,0,0,never,45.88,4.0,159,0,Obesity,Late Middle Age (46–60),Moderate Risk
2,Female,26,0,0,not current,27.32,6.0,130,0,Overweight,Young Adults (18–30),Moderate Risk
3,Male,80,0,0,No Info,27.32,7.5,160,1,Overweight,Seniors (61+),Very High Risk
4,Male,57,0,0,No Info,27.32,5.8,158,0,Overweight,Late Middle Age (46–60),High Risk
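The order of the conditions matters because `np.select` assigns the label of the first condition that is true for each row. A minimal sketch on four made-up readings, one per label; the `'Unknown'` fallback label is an assumption added for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical readings, one row per expected label
sample = pd.DataFrame({
    'HbA1c_level': [8.2, 5.8, 4.0, 5.0],
    'blood_glucose_level': [126, 158, 159, 100],
})

hba1c_pre = sample['HbA1c_level'].between(5.7, 6.4)
glucose_pre = sample['blood_glucose_level'].between(140, 199)

conditions = [
    (sample['HbA1c_level'] >= 6.5) | (sample['blood_glucose_level'] >= 200),
    hba1c_pre & glucose_pre,
    hba1c_pre | glucose_pre,
    (sample['HbA1c_level'] < 5.7) | (sample['blood_glucose_level'] < 140),
]
labels = ['Very High Risk', 'High Risk', 'Moderate Risk', 'Low Risk']

# The first True condition wins; default='Unknown' (assumed fallback)
# covers rows that match none of the conditions
sample['diabetes_risk'] = np.select(conditions, labels, default='Unknown')
print(sample)
```

The second row satisfies both the "and" and the "or" condition, and it is labelled 'High Risk' only because that condition comes first in the list.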


* Group the smoking history into just 3 categories: 'non-smoker', 'smoker' and 'unknown'.
* This reduces the number of categories and makes the data easier to read in the dashboard.

In [9]:
# grouping the smoking history
def bin_smoking(value):
    if value == 'never':
        return 'non-smoker'
    elif value in ['current', 'former', 'ever', 'not current']:
        return 'smoker'
    else:
        return 'unknown'

df['smoking_binned'] = df['smoking_history'].apply(bin_smoking)

# checking the value counts of the actual smoking history column and the new binned column
print("value counts smoking before: ", df['smoking_history'].value_counts())
print("value counts smoking after: ", df['smoking_binned'].value_counts())

# display the first few rows to verify the new column
df.head()

value counts smoking before:  smoking_history
never          3534
No Info        2026
former         1352
current         988
not current     685
ever            457
Name: count, dtype: int64
value counts smoking after:  smoking_binned
non-smoker    3534
smoker        3482
unknown       2026
Name: count, dtype: int64


Unnamed: 0,gender,age,hypertension,heart_disease,smoking_history,bmi,HbA1c_level,blood_glucose_level,diabetes,bmi_category,age_group,diabetes_risk,smoking_binned
0,Male,80,0,0,former,26.4,8.2,126,1,Overweight,Seniors (61+),Very High Risk,smoker
1,Female,47,0,0,never,45.88,4.0,159,0,Obesity,Late Middle Age (46–60),Moderate Risk,non-smoker
2,Female,26,0,0,not current,27.32,6.0,130,0,Overweight,Young Adults (18–30),Moderate Risk,smoker
3,Male,80,0,0,No Info,27.32,7.5,160,1,Overweight,Seniors (61+),Very High Risk,unknown
4,Male,57,0,0,No Info,27.32,5.8,158,0,Overweight,Late Middle Age (46–60),High Risk,unknown
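The same grouping can also be written as a lookup table. This is an alternative sketch, not the notebook's method: it uses `Series.map` with a dictionary and treats anything not listed (including missing values) as 'unknown'. The sample values are made up:

```python
import pandas as pd

# Same grouping as bin_smoking, expressed as a lookup table
smoking_map = {
    'never': 'non-smoker',
    'current': 'smoker',
    'former': 'smoker',
    'ever': 'smoker',
    'not current': 'smoker',
}

sample_history = pd.Series(['never', 'former', 'No Info', 'not current', None])

# Values missing from the mapping (including NaN) become NaN after map(),
# and fillna() then turns them into 'unknown'
sample_binned = sample_history.map(smoking_map).fillna('unknown')
print(sample_binned.tolist())
```

Keeping the categories in a dictionary makes the grouping easy to review in one place, for example if 'not current' should later be treated differently.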


* Copy this dataframe to a new dataframe, df_dashboard. We will later save df_dashboard as a csv file to be used in dashboard visualizations.

In [11]:
df_dashboard = df.copy()
df_dashboard.head()

Unnamed: 0,gender,age,hypertension,heart_disease,smoking_history,bmi,HbA1c_level,blood_glucose_level,diabetes,bmi_category,age_group,diabetes_risk,smoking_binned
0,Male,80,0,0,former,26.4,8.2,126,1,Overweight,Seniors (61+),Very High Risk,smoker
1,Female,47,0,0,never,45.88,4.0,159,0,Obesity,Late Middle Age (46–60),Moderate Risk,non-smoker
2,Female,26,0,0,not current,27.32,6.0,130,0,Overweight,Young Adults (18–30),Moderate Risk,smoker
3,Male,80,0,0,No Info,27.32,7.5,160,1,Overweight,Seniors (61+),Very High Risk,unknown
4,Male,57,0,0,No Info,27.32,5.8,158,0,Overweight,Late Middle Age (46–60),High Risk,unknown
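The 'Unnamed: 0' column in the tables above is what pandas typically shows when a CSV was written with its row index included. A hedged sketch of saving without the index, using a tiny stand-in frame and a temporary folder; the file name `dashboard_demo.csv` and the output location are hypothetical, not the project's real path:

```python
import os
import tempfile
import pandas as pd

# Tiny stand-in for df_dashboard; the real frame has more columns
demo = pd.DataFrame({'gender': ['Male', 'Female'], 'bmi': [26.4, 45.88]})

# Hypothetical output location: a temporary folder, not the project's real path
out_dir = tempfile.mkdtemp()
out_path = os.path.join(out_dir, 'dashboard_demo.csv')

# index=False leaves the row index out of the file, so reloading it
# does not produce an extra 'Unnamed: 0' column
demo.to_csv(out_path, index=False)

reloaded = pd.read_csv(out_path)
print(reloaded.columns.tolist())
```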


* **One-hot encode categorical variables**: Convert the categorical variables (gender and smoking_binned) into numeric indicator columns that ML algorithms can use for prediction.
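A minimal sketch of this step using `pd.get_dummies`, assuming that is the encoder of choice (the notebook does not show which one is used). The three sample rows are made up:

```python
import pandas as pd

# Hypothetical rows with the two categorical columns named in the text
sample = pd.DataFrame({
    'gender': ['Male', 'Female', 'Female'],
    'smoking_binned': ['smoker', 'non-smoker', 'unknown'],
    'bmi': [26.4, 45.88, 27.32],
})

# Each category becomes its own 0/1 column; numeric columns pass through unchanged
encoded = pd.get_dummies(sample, columns=['gender', 'smoking_binned'], dtype=int)
print(encoded.columns.tolist())
print(encoded)
```

For linear models, `drop_first=True` is an option that drops one indicator per variable to avoid redundant columns.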

---

# Push files to Repo

* In cases where you don't need to push files to Repo, you may replace this section with "Conclusions and Next Steps" and state your conclusions and next steps.

In [None]:
import os
try:
  # create your folder here
  # os.makedirs(name='')
  pass  # placeholder so the try block is valid until a folder name is filled in
except Exception as e:
  print(e)
