# Neural Networks

We will use a neural network to build a model that can classify the samples correctly.
Because the dataset is small (400 instances), we will follow a straightforward approach:
- We will use a shallow network, because deep networks need more examples.
- We will use **L2 Regularization (Weight Decay)** to limit overfitting and **Early Stopping** to stop training as soon as the model stops improving.
- We will apply light preprocessing, since neural networks need scaled inputs to converge reliably.
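
As a rough illustration of the three points above, the following sketch combines scaling, a single small hidden layer, an L2 penalty, and early stopping using scikit-learn's `MLPClassifier`. The data is synthetic (`make_classification`), and the layer size, `alpha`, and patience values are assumptions chosen for illustration, not the settings used later in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real data: 400 rows, binary target
X_demo, y_demo = make_classification(n_samples=400, n_features=20, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

sketch_model = make_pipeline(
    StandardScaler(),                # scaling helps the optimizer converge
    MLPClassifier(
        hidden_layer_sizes=(16,),    # one small hidden layer (assumed size)
        alpha=1e-2,                  # L2 penalty strength (assumed value)
        early_stopping=True,         # hold out part of the training data for validation
        validation_fraction=0.2,
        n_iter_no_change=10,         # stop after 10 epochs without improvement
        max_iter=500,
        random_state=42,
    ),
)
sketch_model.fit(X_tr, y_tr)
print(f"Test accuracy on synthetic data: {sketch_model.score(X_te, y_te):.2f}")
```

With `early_stopping=True`, scikit-learn sets aside `validation_fraction` of the training data and stops once the validation score has not improved for `n_iter_no_change` consecutive epochs.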

## Starting point

In [20]:
from pathlib import Path
import pandas as pd


dataset_imputed_path = Path('../../data/processed/dataset_imputed.csv')
df_orig = pd.read_csv(dataset_imputed_path)
df = df_orig.copy()
df['status'] = df['status'].map({'ckd': 1, 'notckd': 0})

df.head()

Unnamed: 0,age,bp,sg,al,su,rbc,pc,pcc,ba,bgr,...,pcv,wbcc,rbcc,htn,dm,cad,appet,pe,ane,status
0,48.0,80.0,3,1,0.0,1.0,0.0,0.0,0.0,121.0,...,44.0,7800.0,5.2,1.0,1.0,0.0,0.0,0.0,0,1
1,7.0,50.0,3,4,0.0,0.0,0.0,0.0,0.0,94.708381,...,38.0,6000.0,4.277045,0.0,0.0,0.0,0.0,0.0,0,1
2,62.0,80.0,1,2,3.0,0.0,0.0,0.0,0.0,423.0,...,31.0,7500.0,3.725814,0.0,1.0,0.0,1.0,0.0,1,1
3,48.0,70.0,0,4,0.0,0.0,1.0,1.0,0.0,117.0,...,32.0,6700.0,3.9,1.0,0.0,0.0,1.0,1.0,1,1
4,51.0,80.0,1,2,0.0,0.0,0.0,0.0,0.0,106.0,...,35.0,7300.0,4.6,0.0,0.0,0.0,0.0,0.0,0,1


In [21]:
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np


# Calculate correlation matrix
corr_matrix = df.corr()

# Plot heatmap
#plt.figure(figsize=(20, 15))
#sns.heatmap(corr_matrix, annot=True, fmt='.2f', cmap='coolwarm', vmin=-1, vmax=1)
#plt.title('Feature Correlation Matrix')
#plt.show()

# Quick check for high correlation
correlation_factor = 0.8
high_corr_pairs = np.where(np.abs(corr_matrix) > correlation_factor)
high_corr_pairs = [(corr_matrix.index[x], corr_matrix.columns[y])
                   for x, y in zip(*high_corr_pairs) if x != y and x < y]

print("Highly correlated pairs to consider dropping:", high_corr_pairs)


Highly correlated pairs to consider dropping: [('bu', 'sc'), ('hemo', 'pcv'), ('hemo', 'rbcc'), ('pcv', 'rbcc')]


Removing dependent features
----

We identified two distinct biological clusters of redundancy.
The first one is the **red blood cell cluster**, with the pairs ('hemo', 'pcv'), ('hemo', 'rbcc'), and ('pcv', 'rbcc'). These three features measure closely related quantities:
- Hemo (Hemoglobin): The protein that carries oxygen.
- PCV (Packed Cell Volume): The percentage of blood volume made up of cells.
- RBCC (Red Blood Cell Count): The actual number of cells.

It is also worth noting that in healthy patients PCV is approximately hemo × 3, so the two carry almost the same information.
We will therefore keep hemoglobin, as it is typically the most robust measurement and is the standard metric for diagnosing anemia in CKD patients.


The second one is the **kidney waste cluster**, with the pair ('bu', 'sc').
Both Blood Urea (bu) and Serum Creatinine (sc) are waste products filtered by the kidneys. As kidney function declines (CKD progresses), both of these numbers rise together.
In this case, we will keep sc, as it is a good marker for estimating the glomerular filtration rate [1] and for staging kidney disease.

[1] https://en.wikipedia.org/wiki/Glomerular_filtration_rate


In [22]:
cols_to_drop = ['pcv', 'rbcc', 'bu']

existing_cols_to_drop = [col for col in cols_to_drop if col in df.columns]

df.drop(columns=existing_cols_to_drop, inplace=True)

print(f"Dropped columns: {existing_cols_to_drop}")
print(f"Remaining columns: {df.shape[1]}")

Dropped columns: ['pcv', 'rbcc', 'bu']
Remaining columns: 22


Normalization
----

We will apply **standardization** to the numerical/continuous features. For the binary and categorical features (0/1), we could:
- Leave them as 0/1 or
- Scale them too.

In our case, we will scale everything.
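
For comparison, the first option (leaving the 0/1 features untouched) could be sketched with a `ColumnTransformer`. This is only an illustration of the alternative, not what this notebook does: the toy frame and the split into `continuous_cols` and `binary_cols` are assumptions made for the example.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# Toy frame; the column split below is an assumption for illustration
toy = pd.DataFrame({
    "age": [48.0, 7.0, 62.0, 51.0],
    "bp": [80.0, 50.0, 80.0, 80.0],
    "htn": [1.0, 0.0, 0.0, 0.0],
    "dm": [1.0, 0.0, 1.0, 0.0],
})
continuous_cols = ["age", "bp"]
binary_cols = ["htn", "dm"]

# Standardize only the continuous columns; pass the 0/1 columns through unchanged
selective_scaler = ColumnTransformer(
    transformers=[("num", StandardScaler(), continuous_cols)],
    remainder="passthrough",
)

# Output order: transformed columns first, then the passthrough columns
toy_scaled = pd.DataFrame(
    selective_scaler.fit_transform(toy),
    columns=continuous_cols + binary_cols,
)
print(toy_scaled)
```

Scaling everything, as done here, keeps the code simpler and puts all inputs on a comparable scale, at the cost of the 0/1 columns no longer being directly readable as flags.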



In [23]:
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

X = df.drop('status', axis=1)
y = df['status']


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

X_train_scaled = pd.DataFrame(X_train_scaled, columns=X.columns)
X_test_scaled = pd.DataFrame(X_test_scaled, columns=X.columns)

print("Mean of scaled features (approx 0):")
print(X_train_scaled.mean().head())

print("\nStandard Deviation of scaled features (approx 1):")
print(X_train_scaled.std().head())

Mean of scaled features (approx 0):
age    1.609823e-16
bp    -2.595146e-16
sg    -9.992007e-17
al    -2.359224e-17
su    -8.326673e-18
dtype: float64

Standard Deviation of scaled features (approx 1):
age    1.001566
bp     1.001566
sg     1.001566
al     1.001566
su     1.001566
dtype: float64


## Data preprocessing
Neural networks cannot handle raw categorical text or unscaled numbers effectively.

In [4]:
# Imports for Modeling
from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score


# We obtain X and y
df_copy = df.copy()
X = df_copy.drop('status', axis=1)
# 'status' was already mapped to binary (ckd = 1, notckd = 0) when the data was loaded;
# mapping it again would turn every value into NaN, so we only enforce the integer dtype
y = df_copy['status'].astype(int)

X.head()


Unnamed: 0,age,bp,sg,al,su,rbc,pc,pcc,ba,bgr,...,hemo,pcv,wbcc,rbcc,htn,dm,cad,appet,pe,ane
0,48.0,80.0,3,1,0.0,1.0,0.0,0.0,0.0,121.0,...,15.4,44.0,7800.0,5.2,1.0,1.0,0.0,0.0,0.0,0
1,7.0,50.0,3,4,0.0,0.0,0.0,0.0,0.0,94.708381,...,11.3,38.0,6000.0,4.277045,0.0,0.0,0.0,0.0,0.0,0
2,62.0,80.0,1,2,3.0,0.0,0.0,0.0,0.0,423.0,...,9.6,31.0,7500.0,3.725814,0.0,1.0,0.0,1.0,0.0,1
3,48.0,70.0,0,4,0.0,0.0,1.0,1.0,0.0,117.0,...,11.2,32.0,6700.0,3.9,1.0,0.0,0.0,1.0,1.0,1
4,51.0,80.0,1,2,0.0,0.0,0.0,0.0,0.0,106.0,...,11.6,35.0,7300.0,4.6,0.0,0.0,0.0,0.0,0.0,0
