# 🧹 Data Preparation & Feature Engineering

This notebook is the **second stage** in the machine learning pipeline and builds directly upon the insights from our data exploration. Our goal here is to **prepare the dataset** for effective training by cleaning, transforming, and engineering features that enhance the learning signal.

---

### 📌 Notebook Objective

In this notebook, we aim to:
- Clean and normalize the raw dataset
- Convert categorical features to numerical format
- Handle missing values and ambiguous data entries
- Engineer useful features from existing columns
- Set up a reproducible ML preprocessing pipeline

This ensures the dataset is model-ready and consistent across experiments.

---

### 🔍 Why This Matters

Data quality and representation directly affect model performance and fairness. A well-prepared dataset:
- Improves generalization
- Prevents data leakage
- Enables fair comparison between models
- Helps downstream explainability efforts

---


## 1. Load Data & Initial Copy 📥

In [143]:
# Importing the libraries needed
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler, LabelEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.utils import resample  


# Setting the plot style
sns.set(style='whitegrid')


In [144]:
# Importing the dataset into a DataFrame
df = pd.read_csv('../data/diabetic_data.csv')
# Creating a copy to not modify the original dataset
df_copy = df.copy()

# Displaying the first 10 rows of the copied DataFrame
df_copy.head(10)


Unnamed: 0,encounter_id,patient_nbr,race,gender,age,weight,admission_type_id,discharge_disposition_id,admission_source_id,time_in_hospital,...,citoglipton,insulin,glyburide-metformin,glipizide-metformin,glimepiride-pioglitazone,metformin-rosiglitazone,metformin-pioglitazone,change,diabetesMed,readmitted
0,2278392,8222157,Caucasian,Female,[0-10),?,6,25,1,1,...,No,No,No,No,No,No,No,No,No,NO
1,149190,55629189,Caucasian,Female,[10-20),?,1,1,7,3,...,No,Up,No,No,No,No,No,Ch,Yes,>30
2,64410,86047875,AfricanAmerican,Female,[20-30),?,1,1,7,2,...,No,No,No,No,No,No,No,No,Yes,NO
3,500364,82442376,Caucasian,Male,[30-40),?,1,1,7,2,...,No,Up,No,No,No,No,No,Ch,Yes,NO
4,16680,42519267,Caucasian,Male,[40-50),?,1,1,7,1,...,No,Steady,No,No,No,No,No,Ch,Yes,NO
5,35754,82637451,Caucasian,Male,[50-60),?,2,1,2,3,...,No,Steady,No,No,No,No,No,No,Yes,>30
6,55842,84259809,Caucasian,Male,[60-70),?,3,1,2,4,...,No,Steady,No,No,No,No,No,Ch,Yes,NO
7,63768,114882984,Caucasian,Male,[70-80),?,1,1,7,5,...,No,No,No,No,No,No,No,No,Yes,>30
8,12522,48330783,Caucasian,Female,[80-90),?,2,1,4,13,...,No,Steady,No,No,No,No,No,Ch,Yes,NO
9,15738,63555939,Caucasian,Female,[90-100),?,3,3,4,12,...,No,Steady,No,No,No,No,No,Ch,Yes,NO


## 2. Handle Ambiguous & Missing Values ❓

#### Cleaning placeholder values and setting them as NaN values

In [145]:
# Determining placeholder values based on commonly used missing-value markers
placeholder_values = [
    'na', 'Na', 'NA',
    'NaN', 'nan', 'NAN',
    'n/a', 'N/A', r'N\A',   # raw string: '\A' is not a valid escape sequence
    'n.a.', 'N.A.', 'n.a', 'N.A',
    '?', '-', '--', '.', '*'
]

# Converting placeholder values to NaN
df_copy = df_copy.replace(placeholder_values, np.nan)

df_copy.isnull().sum().sort_values(ascending=False)

weight                      98569
max_glu_serum               96420
A1Cresult                   84748
medical_specialty           49949
payer_code                  40256
race                         2273
diag_3                       1423
diag_2                        358
diag_1                         21
encounter_id                    0
troglitazone                    0
tolbutamide                     0
pioglitazone                    0
rosiglitazone                   0
acarbose                        0
miglitol                        0
citoglipton                     0
tolazamide                      0
examide                         0
glipizide                       0
insulin                         0
glyburide-metformin             0
glipizide-metformin             0
glimepiride-pioglitazone        0
metformin-rosiglitazone         0
metformin-pioglitazone          0
change                          0
diabetesMed                     0
glyburide                       0
repaglinide   
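
As an alternative to replacing placeholders after loading, the same markers can be declared while parsing. The sketch below is illustrative only: the miniature CSV and the name `sample` are made up for the example, and only `'?'` is passed explicitly because pandas already treats tokens such as `NA`, `N/A` and `NaN` as missing by default.

```python
import io

import pandas as pd

# Hypothetical miniature CSV standing in for diabetic_data.csv.
raw_csv = io.StringIO(
    "race,weight,insulin\n"
    "Caucasian,?,No\n"
    "?,[75-100),Up\n"
    "AfricanAmerican,?,Steady\n"
)

# Treat '?' as missing at parse time, in addition to pandas' default NA tokens.
sample = pd.read_csv(raw_csv, na_values=["?"])

# Same check as in the cell above: missing values per column, largest first.
missing_per_column = sample.isnull().sum().sort_values(ascending=False)
print(missing_per_column)
```

Declaring the markers at read time keeps the cleaning rule next to the data source, at the cost of hiding which raw tokens were present in the file.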

#### Checking The DataSet For Information

In [146]:
df_copy.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 101766 entries, 0 to 101765
Data columns (total 50 columns):
 #   Column                    Non-Null Count   Dtype 
---  ------                    --------------   ----- 
 0   encounter_id              101766 non-null  int64 
 1   patient_nbr               101766 non-null  int64 
 2   race                      99493 non-null   object
 3   gender                    101766 non-null  object
 4   age                       101766 non-null  object
 5   weight                    3197 non-null    object
 6   admission_type_id         101766 non-null  int64 
 7   discharge_disposition_id  101766 non-null  int64 
 8   admission_source_id       101766 non-null  int64 
 9   time_in_hospital          101766 non-null  int64 
 10  payer_code                61510 non-null   object
 11  medical_specialty         51817 non-null   object
 12  num_lab_procedures        101766 non-null  int64 
 13  num_procedures            101766 non-null  int64 
 14  num_

#### Dropping Features That Add No Value
Based on `01_Data_exploration`, several features in the binary and low-cardinality subsets are redundant, have too little variance, or carry no useful information. Features with a single class or with too many missing values will also be dropped. Before that, we use `encounter_id` to check whether the dataset contains duplicate rows.

In [None]:
# Checking for duplicate rows in the DataFrame and removing them if found
if df_copy['encounter_id'].nunique() != df_copy.shape[0]:
    print("Warning: 'encounter_id' is not unique. This may cause issues in the analysis.")
    print('---'*10)
    print(f"Number of unique 'encounter_id': {df_copy['encounter_id'].nunique()}")
    print('Will drop duplicate row across entire dataset')
    # Drop duplicate rows based on 'encounter_id'
    df_copy.drop_duplicates(subset=['encounter_id'], keep='first', inplace=True)
    print('---'*10)
    print(f"Number of unique 'encounter_id' after dropping duplicates: {df_copy['encounter_id'].nunique()}")

else:
    print("All 'encounter_id' values are unique. No issues detected")

All 'encounter_id' values are unique. No issues detected


In [159]:
# Dropping the columns found in the EDA to be unnecessary or of no value
categories_to_drop = [
    'weight', 'payer_code', 'medical_specialty',
    'max_glu_serum', 'A1Cresult',                           # Dropped due to high number of null values
    'encounter_id',                                         # Dropped as a row identifier (uniqueness checked above)
    'examide', 'citoglipton',                               # Dropped due to low number of unique values/classes
    'acetohexamide', 'tolbutamide', 'troglitazone',         # --------- Binary values -----------
    'glipizide-metformin', 'glimepiride-pioglitazone',
    'metformin-rosiglitazone', 'metformin-pioglitazone',    # --- Dropped due to low variance/redundancy ---
    'repaglinide', 'nateglinide', 'chlorpropamide',         # --------- Low-cardinality values -----------
    'pioglitazone', 'rosiglitazone', 'acarbose',
    'miglitol', 'tolazamide', 'glyburide-metformin',
    'glimepiride',                                          # -- Dropped due to low variance and redundancy --
    'readmitted'                                            # Dropped due to being the target variable
]

# Columns reserved for feature engineering. Listed here for reference only; they are not dropped at this stage.
features_to_engineer = [
    'patient_nbr',                                              # for num_of_visits, prior_visit_flag
    'diag_1', 'diag_2', 'diag_3',                               # for mapped diagnosis groups
    'age',                                                      # for ordinal or midpoint conversion
    'insulin', 'metformin', 'glipizide', 'glyburide',           # for med activity and change flags
    'admission_type_id',                                        # for emergency flag
    'discharge_disposition_id',                                 # for AMA, death flags
    'admission_source_id',                                      # for ER/referral flag
    'num_lab_procedures', 'num_procedures', 'number_diagnoses'  # for complexity signals
]


all_colls_to_drop = categories_to_drop

# Dropping the columns from the DataFrame
df_copy = df_copy.drop(columns=all_colls_to_drop)

# Displaying the DataFrame after dropping the columns and its new shape
print(df_copy.shape)
df_copy.head(10)


(101766, 24)


Unnamed: 0,patient_nbr,race,gender,age,admission_type_id,discharge_disposition_id,admission_source_id,time_in_hospital,num_lab_procedures,num_procedures,...,diag_1,diag_2,diag_3,number_diagnoses,metformin,glipizide,glyburide,insulin,change,diabetesMed
0,8222157,Caucasian,Female,[0-10),6,25,1,1,41,0,...,250.83,,,1,No,No,No,No,No,No
1,55629189,Caucasian,Female,[10-20),1,1,7,3,59,0,...,276.0,250.01,255,9,No,No,No,Up,Ch,Yes
2,86047875,AfricanAmerican,Female,[20-30),1,1,7,2,11,5,...,648.0,250.0,V27,6,No,Steady,No,No,No,Yes
3,82442376,Caucasian,Male,[30-40),1,1,7,2,44,1,...,8.0,250.43,403,7,No,No,No,Up,Ch,Yes
4,42519267,Caucasian,Male,[40-50),1,1,7,1,51,0,...,197.0,157.0,250,5,No,Steady,No,Steady,Ch,Yes
5,82637451,Caucasian,Male,[50-60),2,1,2,3,31,6,...,414.0,411.0,250,9,No,No,No,Steady,No,Yes
6,84259809,Caucasian,Male,[60-70),3,1,2,4,70,1,...,414.0,411.0,V45,7,Steady,No,No,Steady,Ch,Yes
7,114882984,Caucasian,Male,[70-80),1,1,7,5,73,0,...,428.0,492.0,250,8,No,No,Steady,No,No,Yes
8,48330783,Caucasian,Female,[80-90),2,1,4,13,68,2,...,398.0,427.0,38,8,No,Steady,No,Steady,Ch,Yes
9,63555939,Caucasian,Female,[90-100),3,3,4,12,33,3,...,434.0,198.0,486,8,No,No,No,Steady,Ch,Yes


#### About the Dropped Features

Many features were dropped in this part, mainly due to a combination of high missingness, low variance, redundancy, or lack of useful information. In addition, some features (such as IDs and diagnosis codes) were set aside for later use in feature engineering (section 4).
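
The drop list above was compiled by hand from the EDA. A minimal sketch of how such candidates could be derived programmatically is shown below. The toy frame, the two cut-offs and every variable name are assumptions for illustration and are not the criteria used in `01_Data_exploration`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; column names mirror the dataset, values are made up.
toy = pd.DataFrame({
    "weight": [np.nan, np.nan, np.nan, "[75-100)"],
    "examide": ["No", "No", "No", "No"],
    "insulin": ["No", "Up", "Steady", "No"],
    "race": ["Caucasian", np.nan, "Asian", "Caucasian"],
})

# Assumed cut-offs, chosen only for this example.
MAX_MISSING_RATIO = 0.5
MAX_DOMINANT_RATIO = 0.95

# Share of missing values per column.
missing_ratio = toy.isnull().mean()
# Share of the most frequent non-null value per column.
dominant_ratio = toy.apply(lambda s: s.value_counts(normalize=True).iloc[0])

too_sparse = missing_ratio[missing_ratio > MAX_MISSING_RATIO].index.tolist()
near_constant = dominant_ratio[dominant_ratio > MAX_DOMINANT_RATIO].index.tolist()

drop_candidates = sorted(set(too_sparse) | set(near_constant))
print(drop_candidates)
```

A list produced this way is only a starting point; columns such as `readmitted` or identifiers still need to be handled by hand.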


#### Splitting the Data Into Numerical and Categorical Subsets
The splitting code is reused from `01_data_exploration.ipynb`.

In [163]:
# Getting the list of categorical and numerical features and storing in an array 
categorical_features = df_copy.select_dtypes(include=['object', 'bool', 'category']).columns.tolist().copy()
numerical_features = df_copy.select_dtypes(include=['int64', 'float64']).columns.tolist().copy()

# Features that have codes (categorical feature) but could be numerical in the data. from IDS_mapping.csv and 
# https://datasets.aim-ahead.net/dataset/p/UCI_DS_296
hidden_categorical_features = ['admission_type_id', 'discharge_disposition_id', 'admission_source_id']

for feature in hidden_categorical_features:
    if feature in numerical_features:
        numerical_features.remove(feature)
    if feature not in categorical_features:
        categorical_features.append(feature)

# Checking that every feature has been accounted for
num_of_splitted_features = len(categorical_features) + len(numerical_features) 

if num_of_splitted_features != len(df_copy.columns):
    accounted = set(numerical_features + categorical_features)
    missing = set(df_copy.columns) - accounted
    print(f'! WARNING ! Unaccounted features: {missing}. No new DataFrame created')
else:
    print('All features have been accounted for! New DataFrames created')
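    # .copy() makes the subsets independent of df_copy, so later in-place imputation
    # does not trigger pandas' SettingWithCopyWarning
    df_categorical = df_copy[categorical_features].copy()
    df_numerical = df_copy[numerical_features].copy()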


All features have been accounted for! New DataFrames created


In [164]:
df_categorical.head(10)

Unnamed: 0,race,gender,age,diag_1,diag_2,diag_3,metformin,glipizide,glyburide,insulin,change,diabetesMed,admission_type_id,discharge_disposition_id,admission_source_id
0,Caucasian,Female,[0-10),250.83,,,No,No,No,No,No,No,6,25,1
1,Caucasian,Female,[10-20),276.0,250.01,255,No,No,No,Up,Ch,Yes,1,1,7
2,AfricanAmerican,Female,[20-30),648.0,250.0,V27,No,Steady,No,No,No,Yes,1,1,7
3,Caucasian,Male,[30-40),8.0,250.43,403,No,No,No,Up,Ch,Yes,1,1,7
4,Caucasian,Male,[40-50),197.0,157.0,250,No,Steady,No,Steady,Ch,Yes,1,1,7
5,Caucasian,Male,[50-60),414.0,411.0,250,No,No,No,Steady,No,Yes,2,1,2
6,Caucasian,Male,[60-70),414.0,411.0,V45,Steady,No,No,Steady,Ch,Yes,3,1,2
7,Caucasian,Male,[70-80),428.0,492.0,250,No,No,Steady,No,No,Yes,1,1,7
8,Caucasian,Female,[80-90),398.0,427.0,38,No,Steady,No,Steady,Ch,Yes,2,1,4
9,Caucasian,Female,[90-100),434.0,198.0,486,No,No,No,Steady,Ch,Yes,3,3,4


In [165]:
df_numerical.head(10)

Unnamed: 0,patient_nbr,time_in_hospital,num_lab_procedures,num_procedures,num_medications,number_outpatient,number_emergency,number_inpatient,number_diagnoses
0,8222157,1,41,0,1,0,0,0,1
1,55629189,3,59,0,18,0,0,0,9
2,86047875,2,11,5,13,2,0,1,6
3,82442376,2,44,1,16,0,0,0,7
4,42519267,1,51,0,8,0,0,0,5
5,82637451,3,31,6,16,0,0,0,9
6,84259809,4,70,1,21,0,0,0,7
7,114882984,5,73,0,12,0,0,0,8
8,48330783,13,68,2,28,0,0,0,8
9,63555939,12,33,3,18,0,0,0,8


#### Imputing the Missing Values

Missing values were handled separately for categorical and numerical features. Categorical features were imputed using either the most frequent value or a constant placeholder ('missing'), depending on the number of unique categories and the distribution of missing data. Numerical features were imputed using the median, which is robust to outliers and preserves the central tendency of the data.
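
The three strategies described above can also be expressed declaratively. The following is a minimal sketch, assuming a toy frame and a fixed assignment of columns to strategies; the notebook itself chooses the strategy per column at run time, so this is an illustration rather than the method used below.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

# Hypothetical toy frame with one column per imputation strategy.
toy = pd.DataFrame({
    "race": ["Caucasian", np.nan, "Caucasian", "Asian"],
    "diag_1": ["250.83", "414", np.nan, "428"],
    "num_medications": [10.0, np.nan, 14.0, 30.0],
})

# Assumed column-to-strategy assignment for illustration.
imputer = ColumnTransformer(
    transformers=[
        ("frequent", SimpleImputer(strategy="most_frequent"), ["race"]),
        ("constant", SimpleImputer(strategy="constant", fill_value="missing"), ["diag_1"]),
        ("median", SimpleImputer(strategy="median"), ["num_medications"]),
    ]
)

# The transformer returns a NumPy array whose columns follow the order of the
# transformers list, so the column names are restored by hand.
imputed = pd.DataFrame(
    imputer.fit_transform(toy),
    columns=["race", "diag_1", "num_medications"],
)
print(imputed)
```

Wrapping the imputers in a `ColumnTransformer` makes it possible to fit them on a training split only and reuse the fitted values on unseen data.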


#### Imputing the Categorical Data

In [None]:
# Looping through the categorical features to check if they need imputation.
# If needed, then the loop will dynamically check which imputer to use based on the number of unique values.
for col in df_categorical.columns:
    if df_categorical[col].isnull().sum() > 0:

        if df_categorical[col].nunique() <= 2:
            print(f'Binary feature {col} with {df_categorical[col].nunique()} unique values')
            print('Using the most frequent imputer\n')
            imputer = SimpleImputer(strategy='most_frequent')

        elif df_categorical[col].nunique() < 10 and df_categorical[col].nunique() > 2:
            print(f'Categorical feature {col} with {df_categorical[col].nunique()} unique values')
            print('Checking spread of values to decide on imputer')
            print('...')

            # Check the spread of values and their ratio of missing values
            value_counts = df_categorical[col].value_counts(normalize=True)
            na_ratio = df_categorical[col].isna().mean()
            
            # Creating a decision tree for which imputer to use based on the spread of values and the ratio of missing values.
            if value_counts.max() > 0.6 and na_ratio <= 0.05:
                print('Using the most frequent imputer')
                imputer = SimpleImputer(strategy='most_frequent')
            else:
                print('Using the missing value imputer')
                imputer = SimpleImputer(strategy='constant', fill_value='missing')
        else:
            print(f'Categorical feature {col} with {df_categorical[col].nunique()} unique values')
            print('Using missing value imputer')
            imputer = SimpleImputer(strategy='constant', fill_value='missing')

        # Imputing the missing values
        df_categorical.loc[:, col] = imputer.fit_transform(df_categorical[[col]]).ravel()
        print('Imputation done\n')


Categorical feature race with 5 unique values
Checking spread of values to decide on imputer
...
Using the most frequent imputer
Imputation done

Continuous feature diag_1 with 716 unique values
Using missing value imputer
Imputation done

Continuous feature diag_2 with 748 unique values
Using missing value imputer
Imputation done

Continuous feature diag_3 with 789 unique values
Using missing value imputer
Imputation done



In [167]:
df_numerical.head(10)

Unnamed: 0,patient_nbr,time_in_hospital,num_lab_procedures,num_procedures,num_medications,number_outpatient,number_emergency,number_inpatient,number_diagnoses
0,8222157,1,41,0,1,0,0,0,1
1,55629189,3,59,0,18,0,0,0,9
2,86047875,2,11,5,13,2,0,1,6
3,82442376,2,44,1,16,0,0,0,7
4,42519267,1,51,0,8,0,0,0,5
5,82637451,3,31,6,16,0,0,0,9
6,84259809,4,70,1,21,0,0,0,7
7,114882984,5,73,0,12,0,0,0,8
8,48330783,13,68,2,28,0,0,0,8
9,63555939,12,33,3,18,0,0,0,8


#### Imputing the Numerical Data

In [168]:
# Check if imputation is done, in order to check if we should show the DataFrame or not.
counter = 0

for col in df_numerical.columns:
    if df_numerical[col].isnull().sum() > 0:
        print(f'Imputation needed for {col} with {df_numerical[col].nunique()} unique values')
        print('Using the median imputer')
        
        # Using the median imputer for the numerical features
        imputer = SimpleImputer(strategy='median')
        
        # Imputing the missing values
        df_numerical.loc[:, col] = imputer.fit_transform(df_numerical[[col]]).ravel()
        print('Imputation done\n')

        counter += 1
    else:
        print(f'No imputation needed for {col}.')

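# Checking if it is worth showing the DataFrame or not.
# A bare expression inside an if-block is not rendered by the notebook, so it is printed explicitly.
if counter != 0:
    print(df_numerical.head(10))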

No imputation needed for patient_nbr.
No imputation needed for time_in_hospital.
No imputation needed for num_lab_procedures.
No imputation needed for num_procedures.
No imputation needed for num_medications.
No imputation needed for number_outpatient.
No imputation needed for number_emergency.
No imputation needed for number_inpatient.
No imputation needed for number_diagnoses.


In [173]:
# Quick check that the imputation was done correctly
df_categorical.isnull().sum().sort_values(ascending=False), df_numerical.isnull().sum().sort_values(ascending=False)

(race                        0
 gender                      0
 age                         0
 diag_1                      0
 diag_2                      0
 diag_3                      0
 metformin                   0
 glipizide                   0
 glyburide                   0
 insulin                     0
 change                      0
 diabetesMed                 0
 admission_type_id           0
 discharge_disposition_id    0
 admission_source_id         0
 dtype: int64,
 patient_nbr           0
 time_in_hospital      0
 num_lab_procedures    0
 num_procedures        0
 num_medications       0
 number_outpatient     0
 number_emergency      0
 number_inpatient      0
 number_diagnoses      0
 dtype: int64)

#### Handling Logical Inconsistencies and Domain Anomalies

Certain values in the dataset may be logically inconsistent, medically unlikely, or outright impossible. These anomalies can introduce noise and hurt model performance. Identifying and addressing them is a relatively cheap way to improve data quality, especially in healthcare data where domain context matters. This step often requires domain expertise, which not all data scientists have, so we focus on a few key features where basic logic alone can highlight issues worth filtering or flagging.
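
Flagging, as opposed to filtering, keeps the rows and records which rule they violated. A minimal sketch of that idea follows. The toy frame, the rule names and the column `n_anomalies` are hypothetical, and the two rules are placeholders that would need domain review.

```python
import pandas as pd

# Hypothetical toy frame; column names mirror the dataset, values are made up.
toy = pd.DataFrame({
    "number_diagnoses": [0, 5, 9, 16],
    "time_in_hospital": [3, 0, 7, 14],
})

# Assumed plausibility rules: each entry is a boolean mask of violating rows.
rules = {
    "no_diagnosis": toy["number_diagnoses"] == 0,
    "zero_day_stay": toy["time_in_hospital"] == 0,
}

flags = pd.DataFrame(rules)

# Number of violated rules per row, kept as a feature instead of dropping the row.
toy["n_anomalies"] = flags.sum(axis=1)

# Share of rows violating each rule, useful for deciding between drop and flag.
anomaly_share = flags.mean()
print(anomaly_share)
```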


In [178]:
(df_copy['number_diagnoses'] == 0).sum()


np.int64(0)

In [None]:
# Function to check if we should drop the rows based on a threshold of 10%.
def should_drop(subset_df, total_df=None, threshold=0.1):
    """
    Returns True if subset_df is non-empty and its share of the rows in
    total_df (df_copy by default) is at or below the threshold.
    """
    total_df = df_copy if total_df is None else total_df
    percent = subset_df.shape[0] / total_df.shape[0]
    return 0 < percent <= threshold

# ------------  number_diagnoses      ----------------
# Checking if we should drop number_diagnoses = 0 (illogical or missed record)
zero_diagnoses = df_copy[df_copy['number_diagnoses'] == 0]
if should_drop(zero_diagnoses):
    print('Dropping rows with number_diagnoses = 0')
    print('-----'*10)
    print(f'Dropping {zero_diagnoses.shape[0]} rows with number_diagnoses = 0')

    df_copy = df_copy[df_copy['number_diagnoses'] != 0]
else:
    print('No rows dropped for number_diagnoses = 0')


# Checking if we should drop number_diagnoses > 10 (illogical or missed record)
many_diagnoses = df_copy[df_copy['number_diagnoses'] > 10]
if should_drop(many_diagnoses):
    print('Dropping rows with number_diagnoses > 10')
    print('-----'*10)
    print(f'Dropping {many_diagnoses.shape[0]} rows with number_diagnoses > 10')

    # Keep rows with exactly 10 diagnoses; only values above 10 are removed
    df_copy = df_copy[df_copy['number_diagnoses'] <= 10]
else:
    print('No rows dropped for number_diagnoses > 10')


# ------------  discharge_disposition_id      ----------------
# Checking if we should drop discharge_disposition_id = 1.
# Note: IDS_mapping.csv appears to list code 1 as 'Discharged to home', a common and
# valid value, so this check is not expected to drop anything at a 10% threshold.
home_discharge = df_copy[df_copy['discharge_disposition_id'] == 1]
if should_drop(home_discharge):
    print('Dropping rows with discharge_disposition_id = 1')
    print('-----'*10)
    print(f'Dropping {home_discharge.shape[0]} rows with discharge_disposition_id = 1')

    df_copy = df_copy[df_copy['discharge_disposition_id'] != 1]
else:
    print('No rows dropped for discharge_disposition_id = 1')

# ------------  time_in_hospital      ----------------
# Checking if we should correct time_in_hospital = 0 (illogical or missed record)
zero_stay = df_copy['time_in_hospital'] == 0
if should_drop(df_copy[zero_stay]):
    print('Instances of time_in_hospital = 0 found')
    print('...')
    print('Changing these to 1 (obviously wrong values since there are records)')
    print('-----'*10)
    print(f'Changing {zero_stay.sum()} rows with time_in_hospital = 0')
    df_copy.loc[zero_stay, 'time_in_hospital'] = 1
else:
    print('No rows changed for time_in_hospital = 0')

#### Final Preprocessing Checks
A final check to make sure that the imputation and cleaning result in a dataset that is free of missing values and of redundant, low-value features.


#### Final Check of the Categorical Data

In [169]:
# Some prints for manually checking how the imputation went
print('Exists NaN values:', df_categorical.isnull().sum().sum() > 0)
print('Shape of the categorical DataFrame: ', df_categorical.shape)
print('Other information about the dataFrame:\n', df_categorical.info())
print('Unique values per feature\n', df_categorical.nunique())


Exists NaN values: False
Shape of the categorical DataFrame:  (101766, 15)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 101766 entries, 0 to 101765
Data columns (total 15 columns):
 #   Column                    Non-Null Count   Dtype 
---  ------                    --------------   ----- 
 0   race                      101766 non-null  object
 1   gender                    101766 non-null  object
 2   age                       101766 non-null  object
 3   diag_1                    101766 non-null  object
 4   diag_2                    101766 non-null  object
 5   diag_3                    101766 non-null  object
 6   metformin                 101766 non-null  object
 7   glipizide                 101766 non-null  object
 8   glyburide                 101766 non-null  object
 9   insulin                   101766 non-null  object
 10  change                    101766 non-null  object
 11  diabetesMed               101766 non-null  object
 12  admission_type_id         101766 non-nu

#### Final Check of the Numerical Data

In [170]:
# Some prints for manually checking how the imputation went
print('Exists NaN values:', df_numerical.isnull().sum().sum() > 0)
print('Shape of the numerical DataFrame: ', df_numerical.shape)
print('Other information about the dataFrame:\n', df_numerical.info())
print('Unique values per feature\n', df_numerical.nunique())
df_numerical.describe()


Exists NaN values: False
Shape of the numerical DataFrame:  (101766, 9)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 101766 entries, 0 to 101765
Data columns (total 9 columns):
 #   Column              Non-Null Count   Dtype
---  ------              --------------   -----
 0   patient_nbr         101766 non-null  int64
 1   time_in_hospital    101766 non-null  int64
 2   num_lab_procedures  101766 non-null  int64
 3   num_procedures      101766 non-null  int64
 4   num_medications     101766 non-null  int64
 5   number_outpatient   101766 non-null  int64
 6   number_emergency    101766 non-null  int64
 7   number_inpatient    101766 non-null  int64
 8   number_diagnoses    101766 non-null  int64
dtypes: int64(9)
memory usage: 7.0 MB
Other information about the dataFrame:
 None
Unique values per feature
 patient_nbr           71518
time_in_hospital         14
num_lab_procedures      118
num_procedures            7
num_medications          75
number_outpatient        39
number_emerg

Unnamed: 0,patient_nbr,time_in_hospital,num_lab_procedures,num_procedures,num_medications,number_outpatient,number_emergency,number_inpatient,number_diagnoses
count,101766.0,101766.0,101766.0,101766.0,101766.0,101766.0,101766.0,101766.0,101766.0
mean,54330400.0,4.395987,43.095641,1.33973,16.021844,0.369357,0.197836,0.635566,7.422607
std,38696360.0,2.985108,19.674362,1.705807,8.127566,1.267265,0.930472,1.262863,1.9336
min,135.0,1.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0
25%,23413220.0,2.0,31.0,0.0,10.0,0.0,0.0,0.0,6.0
50%,45505140.0,4.0,44.0,1.0,15.0,0.0,0.0,0.0,8.0
75%,87545950.0,6.0,57.0,2.0,20.0,0.0,0.0,1.0,9.0
max,189502600.0,14.0,132.0,6.0,81.0,42.0,76.0,21.0,16.0


### Summary – Section 2

In this section, we cleaned and prepared the data for modeling in four steps. First, we converted commonly used placeholder values to `NaN`. Second, we dropped features that had low variance, were too imbalanced, or were redundant and carried no useful information. Third, we separated the data into categorical and numerical subsets. Finally, the features with missing values went through an imputation process. After dropping the features, only a few categorical columns (`race` and the three diagnosis columns) needed imputation, and none of the numerical columns did. After these steps, the categorical and numerical subsets contain no missing or placeholder values.


## 3. Target Variable Transformation 🎯

In [None]:
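# Checking for missing values
# Note: this requires `readmitted` to still be present in df_copy,
# so it must not have been dropped in section 2 before this cell runs.
print('Missing values in the Target Variable:', df_copy['readmitted'].isnull().sum())

# Checking the distribution of the target variable
df_copy['readmitted'].value_counts(normalize=True)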

Missing values in the Target Variable: 0


readmitted
NO     0.539119
>30    0.349282
<30    0.111599
Name: proportion, dtype: float64

In [133]:
# We are interested in 2 cases: If the patient was readmitted within 30 days or not.
# Thus, we can binarize the target variable into two classes: 'Yes' and 'No' (1 and 0).
df_copy['target'] = df_copy['readmitted'].map({'<30': 1, '>30': 0, 'NO': 0})

# Dropping the original target variable now that the binary `target` column exists
df_copy.drop(columns='readmitted', inplace=True)

# Checking the distribution of the new target variable
df_copy['target'].value_counts(normalize=True)


target
0    0.888401
1    0.111599
Name: proportion, dtype: float64

### Summary – Section 3

The target variable `readmitted` was binarized into a new column `target`, where:
- `1` indicates a readmission within 30 days (`<30`)
- `0` covers all other outcomes (`>30` and `NO`)

The original column was dropped, and class balance was inspected.
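
With only about 11% positive cases, a plain random split can shift the class ratio between training and test data. The sketch below shows a stratified split with `train_test_split`, which is imported at the top of this notebook. The synthetic frame, the 80/20 ratio and the seed are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic data with the 11% positive rate seen above (110 of 1000 rows).
rng = np.random.default_rng(42)
toy = pd.DataFrame({
    "time_in_hospital": rng.integers(1, 15, size=1000),
    "target": np.r_[np.ones(110, dtype=int), np.zeros(890, dtype=int)],
})

# stratify keeps the share of each class the same in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    toy.drop(columns="target"),
    toy["target"],
    test_size=0.2,
    stratify=toy["target"],
    random_state=42,
)

print(y_train.mean(), y_test.mean())
```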


## 4. Feature Engineering 🧪

#### Creating a new DataFrame for features to be engineered and checking the columns

In [None]:
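# To keep it clean we will use a new DataFrame for the engineered features before adding them to the subsets.
# Assumption: `categories_to_engineer` is not defined earlier in this notebook;
# the column list below is inferred from the output shown underneath.
categories_to_engineer = ['encounter_id', 'patient_nbr', 'diag_1', 'diag_2', 'diag_3']
df_to_be_engineered = df[categories_to_engineer].copy()

# Showing the first rows of the DataFrame and the features that will be engineered.
df_to_be_engineered.head(15)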

Unnamed: 0,encounter_id,patient_nbr,diag_1,diag_2,diag_3
0,2278392,8222157,250.83,?,?
1,149190,55629189,276.0,250.01,255
2,64410,86047875,648.0,250,V27
3,500364,82442376,8.0,250.43,403
4,16680,42519267,197.0,157,250
5,35754,82637451,414.0,411,250
6,55842,84259809,414.0,411,V45
7,63768,114882984,428.0,492,250
8,12522,48330783,398.0,427,38
9,15738,63555939,434.0,198,486


#### Checking for missing values and imputing where necessary

In [140]:
# Checking for NaN values 
df_to_be_engineered.isnull().sum().sort_values(ascending=False)

encounter_id    0
patient_nbr     0
diag_1          0
diag_2          0
diag_3          0
dtype: int64
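
The comments in `features_to_engineer` mention a midpoint conversion for `age` and the features `num_of_visits` and `prior_visit_flag` derived from `patient_nbr`. A minimal sketch of those three features is given below. The toy frame and the column name `age_midpoint` are hypothetical, and the prior-visit flag assumes that rows appear in chronological order, which has not been verified for this dataset.

```python
import pandas as pd

# Hypothetical toy frame; column names mirror the dataset, values are made up.
toy = pd.DataFrame({
    "patient_nbr": [101, 102, 101, 103, 101],
    "age": ["[0-10)", "[70-80)", "[0-10)", "[90-100)", "[10-20)"],
})

# Age bracket "[lo-hi)" -> numeric midpoint of the bracket.
bounds = toy["age"].str.extract(r"\[(\d+)-(\d+)\)").astype(int)
toy["age_midpoint"] = (bounds[0] + bounds[1]) / 2

# Total number of encounters recorded for each patient.
toy["num_of_visits"] = toy.groupby("patient_nbr")["patient_nbr"].transform("count")

# 1 if the same patient already appears in an earlier row, else 0
# (assumes rows are in chronological order).
toy["prior_visit_flag"] = (toy.groupby("patient_nbr").cumcount() > 0).astype(int)

print(toy)
```

Note that `num_of_visits` counts every encounter of a patient, including later ones, so it should be used with care if the goal is to avoid information from the future.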

## 5. Encode Categorical Variables 🔧

## 6. Scale Numerical Variables ⚖️

## 7. Build ML Pipeline 🧱

## 8. Export Cleaned Data 🧼