## Part 3. Data Cleaning Function
Based on the preceding data analysis, we define a data cleaning function that performs the following steps on the dataset:

- Step 1: Load the data.
- Step 2: Insert a `subject_id` column so that each row can be traced back to the original file.
- Step 3: Set a plausible range for each variable.
- Step 4: Remove samples that contain values outside the plausible ranges.
- Step 5: Normalize the feature columns (Min-Max scaling).
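
To make Step 5 concrete, here is a minimal pure-Python sketch of what Min-Max scaling computes: each value `x` becomes `(x - min) / (max - min)`, so the smallest value maps to 0 and the largest to 1. The helper name `min_max_scale` and the sample ages are hypothetical and only serve the illustration. Returning zeros for a constant column is a choice made for this sketch.

```python
def min_max_scale(values):
    """Rescale numbers linearly so the smallest maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Constant column: there is no spread to rescale, so this sketch returns zeros
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Made-up ages, for illustration only (not taken from the dataset)
ages = [21, 33, 81]
print(min_max_scale(ages))  # [0.0, 0.2, 1.0]
```

Because the minimum and maximum are taken from the data being scaled, the scaled values depend on which rows survive the range filter in Step 4.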

In [None]:
# Import libraries
import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler


def load_and_clean_data(csv_path):
    
    """
    Load, clean, and normalize the Pima Diabetes dataset.

    Steps performed:
    1. Load CSV into pandas DataFrame.
    2. Insert 'subject_id' column for traceability.
    3. Define a plausible range for each variable, so that physiologically impossible or extreme values can be identified.
    4. Remove samples that contain values outside plausible ranges.
    5. Normalize all feature columns using Min-Max scaling.

    Input:
    csv_path (str): Path to the CSV dataset.

    Returns:
    pd.DataFrame: Cleaned and normalized dataset with 'subject_id'.
    """
    
    # Step 1: Load data
    df = pd.read_csv(csv_path)

    # Step 2: Insert subject_id for traceability
    df.insert(0, 'subject_id', df.index)

    # Step 3: Set plausible ranges
    # These ranges were suggested by ChatGPT; adjust them if more appropriate limits are known.
    # !!! If you change a range, save a new version of the cleaned data
    # (uncomment the to_csv line near the end of this function) !!!
    plausible_ranges = {
        "Pregnancies": (0, np.inf),
        "Glucose": (40, 600),
        "BloodPressure": (30, 150),
        "SkinThickness": (5, 80),
        "Insulin": (2, 900),
        "BMI": (10, 80),
        "DiabetesPedigreeFunction": (0, 3.0),
        "Age": (21, np.inf)
    }

    # Step 4: Remove rows with any value outside its plausible range (bounds inclusive)
    for feature, (lower, upper) in plausible_ranges.items():
        df = df[(df[feature] >= lower) & (df[feature] <= upper)]
    # Renumber the index 0..n-1; subject_id still holds the original row number
    df = df.reset_index(drop=True)

    # Step 5: Min-Max scale the feature columns to [0, 1]
    # (subject_id and Outcome are left unchanged)
    feature_cols = list(plausible_ranges)
    df[feature_cols] = MinMaxScaler().fit_transform(df[feature_cols])

    # !!! Uncomment this line to save the cleaned dataset !!!
    # df.to_csv("diabetes_cleaned.csv", index=False)

    return df

In [None]:
load_and_clean_data('data/diabetes.csv')

,subject_id,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,3,0.058824,0.232394,0.450,0.285714,0.096154,0.202454,0.035118,0.000000,0
1,4,0.000000,0.570423,0.125,0.500000,0.185096,0.509202,0.943469,0.200000,1
2,6,0.176471,0.154930,0.250,0.446429,0.088942,0.261759,0.069807,0.083333,1
3,8,0.117647,0.992958,0.500,0.678571,0.635817,0.251534,0.031263,0.533333,1
4,13,0.058824,0.936620,0.375,0.285714,1.000000,0.243354,0.134047,0.633333,1
...,...,...,...,...,...,...,...,...,...,...
386,753,0.000000,0.880282,0.725,0.660714,0.596154,0.513292,0.058672,0.083333,1
387,755,0.058824,0.507042,0.725,0.571429,0.115385,0.374233,0.416274,0.266667,1
388,760,0.117647,0.225352,0.350,0.339286,0.002404,0.208589,0.291649,0.016667,0
389,763,0.588235,0.316901,0.575,0.732143,0.199519,0.300613,0.036831,0.700000,0
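
A quick sanity check on the returned frame can catch a skipped normalization step or a duplicated `subject_id`. The sketch below is one possible way to do this, not part of the original function. The helper `check_cleaned` and the three-column subset `FEATURE_COLS` are hypothetical, and the toy frame reuses the first three rows of the output shown above.

```python
import pandas as pd

# Hypothetical subset of feature columns, chosen to keep the example short
FEATURE_COLS = ["Glucose", "BMI", "Age"]


def check_cleaned(df, feature_cols):
    """Return a list of problems found; an empty list means all checks passed."""
    problems = []
    for col in feature_cols:
        # After Min-Max scaling, every feature value should lie in [0, 1]
        if df[col].min() < 0 or df[col].max() > 1:
            problems.append(f"{col} is outside [0, 1]")
    # subject_id must stay unique so each row maps back to one original row
    if df["subject_id"].duplicated().any():
        problems.append("subject_id contains duplicates")
    return problems


# First three rows of the output above, restricted to FEATURE_COLS
toy = pd.DataFrame({
    "subject_id": [3, 4, 6],
    "Glucose": [0.232394, 0.570423, 0.154930],
    "BMI": [0.202454, 0.509202, 0.261759],
    "Age": [0.000000, 0.200000, 0.083333],
})
print(check_cleaned(toy, FEATURE_COLS))  # []
```

On the full result, the same helper could be called with all eight feature names instead of the subset used here.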
