## Pre-Processing

We read the data from `Dataset/diabetes.csv` into a pandas DataFrame. Next, we create a `SimpleImputer` with `strategy="mean"` and `missing_values=0`, so that zeros are treated as missing and replaced by the mean of the non-zero values in the same column. The imputer is applied to every column except `Outcome`. Note that this also replaces zeros in `Pregnancies`, where 0 can be a genuine value rather than a missing one.
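To make the effect of `missing_values=0` concrete, here is a minimal sketch on a made-up three-row table. The `toy` DataFrame and its values are purely illustrative and are not taken from `diabetes.csv`.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data: zeros stand in for missing measurements
toy = pd.DataFrame({
    "Glucose": [100.0, 0.0, 140.0],
    "BMI": [30.0, 20.0, 0.0],
    "Outcome": [0, 1, 0],
})
feature_columns = ["Glucose", "BMI"]

# Zeros are treated as missing and replaced by the mean of the
# remaining (non-zero) values in the same column
imputer = SimpleImputer(strategy="mean", missing_values=0)
imputed = imputer.fit_transform(toy[feature_columns])

print(imputed)
# Glucose: the 0 becomes (100 + 140) / 2 = 120
# BMI:     the 0 becomes (30 + 20) / 2 = 25
```

The zeros in `Outcome` are untouched here only because that column is left out of the imputer's input.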

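After imputation, we separate the input features from the output class and multiply the features considered significant by 2 to increase their relevance. We then create a `MinMaxScaler`, fit it to the input features, and use it to transform them into the [0, 1] range. Because min-max scaling rescales each column independently, it cancels the multiplication; to have an effect, the weighting would need to be applied after scaling.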
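It is worth checking how a per-column multiplication interacts with min-max scaling. The sketch below uses made-up values (the array `X` is hypothetical) and compares scaling with and without doubling the first column beforehand.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical feature matrix with two columns
X = np.array([[1.0, 10.0],
              [3.0, 20.0],
              [5.0, 40.0]])

# Double the first column before scaling
X_weighted = X.copy()
X_weighted[:, 0] *= 2

scaled_plain = MinMaxScaler().fit_transform(X)
scaled_weighted = MinMaxScaler().fit_transform(X_weighted)

# (2x - 2*min) / (2*max - 2*min) == (x - min) / (max - min),
# so both results are identical
same = np.allclose(scaled_plain, scaled_weighted)
print(same)  # True

# Applying the factor after scaling does change the feature's range
scaled_then_weighted = scaled_plain.copy()
scaled_then_weighted[:, 0] *= 2
print(scaled_then_weighted[:, 0].max())  # 2.0
```

Whether a larger range actually increases a feature's influence depends on the downstream model; this sketch only shows what happens to the values.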

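We then divide the data into training, validation and test sets with two calls to `train_test_split`: 30% of the data is held out as the test set, and 20% of the remainder becomes the validation set. We also compute balanced class weights from the training labels with `compute_class_weight` and store them in a dictionary keyed by class label.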
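As a rough illustration of what the `balanced` setting computes, the sketch below builds a hypothetical label array with 280 negatives and 149 positives (counts chosen to be consistent with the weights printed in the cell output) and compares `compute_class_weight` with the formula `n_samples / (n_classes * count_per_class)`.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical training labels: 280 samples of class 0, 149 of class 1
train_labels = np.array([0] * 280 + [1] * 149)

class_labels = np.unique(train_labels)
class_weights = compute_class_weight(
    class_weight="balanced", classes=class_labels, y=train_labels
)

# The same weights computed by hand
manual_weights = len(train_labels) / (len(class_labels) * np.bincount(train_labels))

class_weights_dict = dict(zip(class_labels, class_weights))
print(class_weights)   # approximately [0.766 1.440]
print(manual_weights)  # same values
```

The rarer class receives the larger weight, so errors on it count more during training.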

Finally, we save the processed data as `.npy` files in the `Pre_Processed_Data` folder.
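One detail to keep in mind when reading these files back: a dictionary passed to `np.save` is stored as a pickled object array, so it has to be loaded with `allow_pickle=True` and unwrapped with `.item()`. The sketch below uses a temporary directory and a hypothetical dictionary rather than the real `Pre_Processed_Data` folder.

```python
import os
import tempfile

import numpy as np

# Hypothetical class-weight dictionary, for illustration only
class_weights_dict = {0: 0.75, 1: 1.5}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "class_weights.npy")

    # The dict is wrapped in a 0-dimensional object array and pickled
    np.save(path, class_weights_dict)

    # allow_pickle=True is required to load object arrays;
    # .item() unwraps the 0-dimensional array back into a dict
    loaded = np.load(path, allow_pickle=True).item()

print(loaded)
```

Plain numeric arrays such as the feature matrices can be loaded with `np.load(path)` directly.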


In [1]:
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.utils.class_weight import compute_class_weight
from sklearn.impute import SimpleImputer

# Read the dataset
data = pd.read_csv('Dataset/diabetes.csv')

# Create a SimpleImputer object with the imputation strategy set to "mean" and the missing value set to 0
imputer = SimpleImputer(strategy='mean', missing_values=0)

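# Impute every column except "Outcome" (the last column). Assigning by column
# name replaces the integer columns with float columns instead of writing
# floats into them in place.
feature_columns = data.columns[:-1]
data[feature_columns] = imputer.fit_transform(data[feature_columns])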

# Separate input features from output class
features = data.drop(['Outcome'], axis=1)
labels = data['Outcome']

# Double the features considered significant. MinMaxScaler rescales each
# column independently, so this has no effect on the scaled output below.
significant_features = ['Pregnancies', 'Glucose', 'BMI', 'DiabetesPedigreeFunction', 'Age']
features[significant_features] *= 2

# Create a MinMaxScaler
scaler = MinMaxScaler()

# Fit the scaler to the input features
scaler.fit(features)

# Transforming input features with scaler
features_normalized = scaler.transform(features)

# Divide data into training, validation and test sets
train_features, test_features, train_labels, test_labels = train_test_split(features_normalized, labels, test_size=0.3, random_state=42)
train_features, val_features, train_labels, val_labels = train_test_split(train_features, train_labels, test_size=0.2, random_state=42)

# Compute balanced class weights from the training labels
class_labels = np.unique(train_labels)
class_weights = compute_class_weight(class_weight='balanced', classes=class_labels, y=train_labels)
print(class_weights)
# Map each class label to its weight
class_weights_dict = dict(zip(class_labels, class_weights))

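# Save the processed data. The class-weight dict is stored as a pickled object
# array; reload it with np.load(path, allow_pickle=True).item()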
np.save('Pre_Processed_Data/class_weights.npy', class_weights_dict)
np.save('Pre_Processed_Data/train_features.npy', train_features)
np.save('Pre_Processed_Data/train_labels.npy', train_labels)
np.save('Pre_Processed_Data/val_features.npy', val_features)
np.save('Pre_Processed_Data/val_labels.npy', val_labels)
np.save('Pre_Processed_Data/test_features.npy', test_features)
np.save('Pre_Processed_Data/test_labels.npy', test_labels)

print("train_features:", len(train_features))
print("test_features:", len(test_features))
print("val_features:", len(val_features))
print("total features:", len(features_normalized))

[0.76607143 1.43959732]
train_features: 429
test_features: 231
val_features: 108
total features: 768
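
The printed sizes can be reproduced without the dataset. The sketch below assumes that `train_test_split` rounds a fractional `test_size` up to the next whole sample, and uses placeholder arrays of 768 rows in place of the real features and labels.

```python
import math

import numpy as np
from sklearn.model_selection import train_test_split

n_samples = 768
X = np.zeros((n_samples, 8))    # placeholder features
y = np.arange(n_samples) % 2    # placeholder labels

# Same two-step split as in the cell above
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.2, random_state=42
)

# Expected sizes when the held-out portion is rounded up
expected_test = math.ceil(0.3 * n_samples)                    # 231
expected_val = math.ceil(0.2 * (n_samples - expected_test))   # 108

print(len(X_train), len(X_val), len(X_test))  # 429 108 231
```

The effective proportions are therefore roughly 56% training, 14% validation and 30% test.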
