## ProjF5 - Final Model

Use this document as a template to provide the evaluation of your final model. You are welcome to go in as much depth as needed.

Make sure you keep the sections specified in this template, but you are welcome to add more cells with your code or explanation as needed.

In [4]:
import numpy as np
import matplotlib.pyplot as plt

### 1. Load and Prepare Data

This should illustrate your code for loading the dataset and the split into training, validation and testing. You can add steps like pre-processing if needed.
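
For reference, the three-way split this section asks for can be sketched without any dependencies. The helper name `stratified_three_way_split`, the 70/15/15 fractions, and the toy labels are assumptions made for illustration and are not part of the original notebook. With scikit-learn, calling `train_test_split` twice with `stratify=y` would serve the same purpose.

```python
import random
from collections import defaultdict

def stratified_three_way_split(labels, val_frac=0.15, test_frac=0.15, seed=42):
    """Return (train_idx, val_idx, test_idx) with class proportions preserved.

    Hypothetical helper for illustration; the fractions are assumptions.
    """
    rng = random.Random(seed)

    # Group row indices by class label
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, val_idx, test_idx = [], [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_val = round(len(indices) * val_frac)
        n_test = round(len(indices) * test_frac)
        # Slice each class separately so every split keeps the class ratio
        val_idx += indices[:n_val]
        test_idx += indices[n_val:n_val + n_test]
        train_idx += indices[n_val + n_test:]
    return sorted(train_idx), sorted(val_idx), sorted(test_idx)

# Toy labels (hypothetical): 60 non-smokers (0) and 40 smokers (1)
toy_labels = [0] * 60 + [1] * 40
train_idx, val_idx, test_idx = stratified_three_way_split(toy_labels)
print(len(train_idx), len(val_idx), len(test_idx))  # prints: 70 15 15
```

Splitting per class keeps the smoker ratio the same in all three subsets, which matters when the classes are imbalanced.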

In [5]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import tensorflow as tf

# Load the dataset
file_path = 'datasets/train_dataset.csv'  # Update this to the path of your dataset file
data = pd.read_csv(file_path)

# Display the first few rows of the dataset
print(data.head())

# Check for missing values
print(data.isnull().sum())

# Describe the dataset to understand its distribution
print(data.describe())

# Convert all column names to lowercase
data.columns = data.columns.str.lower()

# Replace every non-alphanumeric character with a space and strip leading/trailing
# whitespace, then replace the remaining spaces with underscores
data.columns = data.columns.str.replace(r'[^a-zA-Z0-9]', ' ', regex=True).str.strip()
data.columns = data.columns.str.replace(' ', '_', regex=False)

# If you need to fill missing values, here's a simple way to do it (example)
# data.fillna(data.mean(), inplace=True)  # This fills missing values with the mean of each column

# Feature columns (names as cleaned above) and the target column;
# scaling for the neural network is applied later, right before training
xvars = ['age', 'height_cm', 'weight_kg', 'waist_cm', 'eyesight_left',
         'eyesight_right', 'hearing_left', 'hearing_right', 'systolic',
         'relaxation', 'fasting_blood_sugar', 'cholesterol', 'triglyceride',
         'hdl', 'ldl', 'hemoglobin', 'urine_protein', 'serum_creatinine', 'ast',
         'alt', 'gtp', 'dental_caries']
yvar = 'smoking'

# Select features and target
X = data[xvars]
y = data[yvar]  # Assuming this is binary (0 = non-smoker, 1 = smoker)

# # Split the dataset into training and test sets
# X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# # Normalize the data
# scaler = StandardScaler()
# X_train_scaled = scaler.fit_transform(X_train)
# X_test_scaled = scaler.transform(X_test)


   age  height(cm)  weight(kg)  waist(cm)  eyesight(left)  eyesight(right)  \
0   35         170          85       97.0             0.9              0.9   
1   20         175         110      110.0             0.7              0.9   
2   45         155          65       86.0             0.9              0.9   
3   45         165          80       94.0             0.8              0.7   
4   20         165          60       81.0             1.5              0.1   

   hearing(left)  hearing(right)  systolic  relaxation  ...  HDL  LDL  \
0              1               1       118          78  ...   70  142   
1              1               1       119          79  ...   71  114   
2              1               1       110          80  ...   57  112   
3              1               1       158          88  ...   46   91   
4              1               1       109          64  ...   47   92   

   hemoglobin  Urine protein  serum creatinine   AST   ALT  Gtp  \
0        19.8            
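
To make the renaming step concrete, the following minimal sketch applies the same three string operations to a few of the raw headers visible in the printed output. The helper name `clean_column_name` is an assumption for illustration; the notebook itself performs these steps with the pandas `.str` accessor.

```python
import re

def clean_column_name(name):
    """Apply the notebook's cleaning steps to a single column name."""
    name = name.lower()                                 # 1. lowercase
    name = re.sub(r'[^a-zA-Z0-9]', ' ', name).strip()   # 2. symbols -> spaces, trim ends
    return name.replace(' ', '_')                       # 3. spaces -> underscores

# Raw headers copied from the printed DataFrame
raw_names = ['height(cm)', 'eyesight(left)', 'Urine protein', 'serum creatinine', 'Gtp']
cleaned = {raw: clean_column_name(raw) for raw in raw_names}
for raw, clean in cleaned.items():
    print(f'{raw!r:>20} -> {clean!r}')
```

The cleaned names are the ones listed in `xvars`, for example `height_cm` and `urine_protein`.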

### 2. Prepare your Final Model

Include here the code that trains your model (e.g., if you are building it from scratch). These steps may require other packages or Python files; you can simply call them here and do not have to include them in your submission. Remember that we will be looking at the saved outputs in the notebook and will not run the entire notebook.

In [None]:
from sklearn.model_selection import train_test_split, StratifiedKFold, GridSearchCV
from sklearn.preprocessing import StandardScaler
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
# from tensorflow.keras.wrappers.scikit_learn import KerasClassifier
from scikeras.wrappers import KerasClassifier 
from tensorflow.keras.regularizers import l2
import tensorflow as tf

# Define a function to create the Keras model
def create_model(neurons=128, l2_rate=0.01):
    model = Sequential([
        Dense(neurons, activation='relu', input_dim=X.shape[1], kernel_regularizer=l2(l2_rate)),
        Dense(neurons, activation='relu', kernel_regularizer=l2(l2_rate)),
        Dense(1, activation='sigmoid')
    ])
    model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
    return model

# Wrap the model using KerasClassifier
# scikeras takes the build function via `model`; `build_fn` is its deprecated alias
model = KerasClassifier(model=create_model, verbose=0)

# Define the grid search parameters
param_grid = {
    'model__neurons': [64, 128],
    'model__l2_rate': [0.01, 0.02],
    'batch_size': [32, 64],
    'epochs': [20, 50]
}

# Setup cross-validation and grid search
kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
grid = GridSearchCV(estimator=model, param_grid=param_grid, n_jobs=-1, cv=kfold)

# Perform grid search with cross-validation
# Note: X is passed unscaled here. GridSearchCV does not standardize features
# unless a scaler is part of the estimator (e.g., inside a Pipeline).
grid_result = grid.fit(X, y)

# Summarize results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))
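
As a rough guide to the cost of this search, the sketch below enumerates the parameter grid with the standard library. The variable names `combinations` and `n_cv_fits` are introduced here for illustration; the grid values and the fold count are copied from the cell above.

```python
from itertools import product

# Same grid as in the search above
param_grid = {
    'model__neurons': [64, 128],
    'model__l2_rate': [0.01, 0.02],
    'batch_size': [32, 64],
    'epochs': [20, 50],
}
n_splits = 5  # matches StratifiedKFold(n_splits=5)

# Cartesian product of all parameter values
keys = list(param_grid)
combinations = [dict(zip(keys, values)) for values in product(*param_grid.values())]

n_cv_fits = len(combinations) * n_splits
print(f'{len(combinations)} combinations x {n_splits} folds = {n_cv_fits} cross-validation fits')
```

The grid therefore has 16 combinations and needs 80 cross-validation fits. With the default `refit=True`, `GridSearchCV` then trains the best combination once more on the full data.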

In [None]:
#Best: 0.731351 using {'batch_size': 32, 'epochs': 50, 'model__l2_rate': 0.01, 'model__neurons': 64}
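
The search above receives unscaled features, and the next cell fits `StandardScaler` on all rows before the folds are split. A leakage-free alternative is to compute the scaling statistics on the training rows of each fold only and reuse them for the held-out rows. The sketch below shows that idea with the standard library. The helper names and the toy rows are assumptions for illustration; in scikit-learn the same effect is usually obtained by placing `StandardScaler` and the estimator in a `Pipeline`.

```python
from statistics import fmean, pstdev

def fit_standardizer(rows):
    """Compute per-column mean and population std from the training rows only."""
    columns = list(zip(*rows))
    means = [fmean(col) for col in columns]
    stds = [pstdev(col) or 1.0 for col in columns]  # fall back to 1.0 for constant columns
    return means, stds

def apply_standardizer(rows, means, stds):
    """Scale rows with statistics that were computed elsewhere (the training fold)."""
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]

# Toy rows (hypothetical values): [age, weight_kg]
train_rows = [[20, 60.0], [40, 80.0], [60, 100.0]]
held_out_rows = [[40, 90.0]]

means, stds = fit_standardizer(train_rows)            # statistics from the training fold
train_scaled = apply_standardizer(train_rows, means, stds)
held_out_scaled = apply_standardizer(held_out_rows, means, stds)  # reuse, do not refit
print(means)
print(held_out_scaled)
```

The population standard deviation is used here because `StandardScaler` also divides by the standard deviation computed with zero degrees of freedom.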


In [34]:
# Imports for the final model: layers, regularizer, optimizer, callbacks and class weights
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, BatchNormalization, Activation
from tensorflow.keras.regularizers import l2
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint
from sklearn.utils import class_weight
import numpy as np

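# Standardize the features. Note: the scaler is fit on all rows before the folds
# are split, so the held-out folds also contribute to the scaling statistics.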
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Define the K-fold cross validator
kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Calculate class weights
class_weights = class_weight.compute_class_weight('balanced', classes=np.unique(y), y=y)

class_weights = dict(enumerate(class_weights))

def create_optimized_model():
    model = Sequential([
        Dense(128, kernel_regularizer=l2(0.001), input_shape=(X_scaled.shape[1],)),
        BatchNormalization(),
        Activation('relu'),
        Dropout(0.5),
        Dense(128, kernel_regularizer=l2(0.001)),
        BatchNormalization(),
        Activation('relu'),
        Dropout(0.5),
        Dense(64, kernel_regularizer=l2(0.001)),
        Activation('relu'),
        Dense(1, activation='sigmoid')
    ])
    optimizer = Adam(learning_rate=0.001)
    model.compile(optimizer=optimizer, loss='binary_crossentropy', metrics=['accuracy'])
    return model

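# Note: both callbacks monitor val_loss, but the fit() call in the loop passes no
# validation data (neither validation_split nor validation_data). val_loss is
# therefore never computed, and Keras skips early stopping and checkpointing with a warning.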
callbacks = [
    EarlyStopping(monitor='val_loss', patience=10, verbose=1, mode='min'),
    ModelCheckpoint('best_model.keras', monitor='val_loss', save_best_only=True, mode='min', verbose=1)
]
# List to store each fold's accuracy
accuracies = []

# K-fold Cross Validation model evaluation
for train, test in kfold.split(X_scaled, y):
    model = create_optimized_model()
    # Fit data to model
    history = model.fit(X_scaled[train], y[train], epochs=10, batch_size=32, callbacks=callbacks, class_weight=class_weights, verbose=1)
    # Evaluate the model
    _, accuracy = model.evaluate(X_scaled[test], y[test], verbose=0)
    accuracies.append(accuracy)

# Print out the average and the standard deviation of the accuracies
average_accuracy = np.mean(accuracies)
std_deviation_accuracy = np.std(accuracies)
print(f'Average accuracy: {average_accuracy:.2f}, with standard deviation: {std_deviation_accuracy:.2f}')

Epoch 1/10

  super().__init__(activity_regularizer=activity_regularizer, **kwargs)

975/975 ━━━━━━━━━━━━━━━━━━━━ 2s 1ms/step - accuracy: 0.6639 - loss: 0.8207
Epoch 2/10
148/975 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step - accuracy: 0.6971 - loss: 0.6771

  current = self.get_monitor_value(logs)
  self._save_model(epoch=epoch, batch=None, logs=logs)

975/975 ━━━━━━━━━━━━━━━━━━━━ 1s 1ms/step - accuracy: 0.7032 - loss: 0.6514
Epoch 3/10
975/975 ━━━━━━━━━━━━━━━━━━━━ 1s 1ms/step - accuracy: 0.7086 - loss: 0.5872
Epoch 4/10
975/975 ━━━━━━━━━━━━━━━━━━━━ 1s 1ms/step - accuracy: 0.7055 - loss: 0.5634
Epoch 5/10
975/975 ━━━━━━━━━━━━━━━━━━━━ 1s 1ms/step - accuracy: 0.7065 - loss: 0.5487
Epoch 6/10
975/975 ━━━━━━━━━━━━━━━━━━━━ 1s 1ms/step - accuracy: 0.7111 - loss: 0.5382
Epoch 7/10
975/975 ━━━━━━━━━━━━━━━━━━━━ 1s 1ms/step - accuracy: 0.7105 - loss: 0.5357
Epoch 8/10
975/975 ━━━━━━━━━━━━━━━━━━━━ 1s 1ms/step - accuracy: 0.7092 - loss: 0.5419
Epoch 9/10
975/975 ━━━━━━━━━━━━━━━━━━━━ 1s 1ms/step - accuracy: 0.7175 - loss: 0.5349
Epoch 10/10
975/975 ━━━━━━━━━━━━━━━━━━
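
The `class_weight` dictionary passed to `model.fit` in this section comes from `compute_class_weight('balanced', ...)`, which weights each class by `n_samples / (n_classes * class_count)`. The sketch below reproduces that formula with the standard library. The helper name `balanced_class_weights` and the toy labels are assumptions for illustration and do not reflect the actual class counts of the dataset.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Return {class: n_samples / (n_classes * count_of_class)}."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in sorted(counts.items())}

# Toy labels (hypothetical): 60 non-smokers (0) and 40 smokers (1)
toy_labels = [0] * 60 + [1] * 40
weights = balanced_class_weights(toy_labels)
print(weights)
```

The minority class receives the larger weight (1.25 versus roughly 0.83 in this toy case), so its errors contribute more to the loss.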

### 3. Model Performance

Make sure to include the following:
- Performance on the training set
- Performance on the test set
- Provide some screenshots of your output (e.g., pictures, text output, or a histogram of predicted values in the case of tabular data). Any visualization of the predictions are welcome.
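
A minimal sketch of how the items listed above could be computed from a model's predicted probabilities follows. The function names, the 0.5 threshold, and the toy labels and probabilities are assumptions for illustration and are not results from this project. In the notebook, `y_prob` would come from `model.predict` on the training or test rows, and `plt.hist` could replace the text histogram.

```python
def confusion_counts(y_true, y_prob, threshold=0.5):
    """Count true/false positives and negatives at the given threshold."""
    tp = fp = tn = fn = 0
    for truth, prob in zip(y_true, y_prob):
        pred = 1 if prob >= threshold else 0
        if pred == 1 and truth == 1:
            tp += 1
        elif pred == 1 and truth == 0:
            fp += 1
        elif pred == 0 and truth == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn

def probability_histogram(probs, n_bins=5):
    """Count predicted probabilities per equal-width bin on [0, 1]."""
    counts = [0] * n_bins
    for p in probs:
        counts[min(int(p * n_bins), n_bins - 1)] += 1
    return counts

# Toy values (hypothetical, not project results)
y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_prob = [0.10, 0.45, 0.85, 0.55, 0.30, 0.65, 0.90, 0.25]

tp, fp, tn, fn = confusion_counts(y_true, y_prob)
accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp) if (tp + fp) else 0.0
recall = tp / (tp + fn) if (tp + fn) else 0.0
print(f'accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}')

# Text histogram of the predicted probabilities
hist = probability_histogram(y_prob)
for i, count in enumerate(hist):
    print(f'[{i / 5:.1f}, {(i + 1) / 5:.1f}) ' + '#' * count)
```

Running the same functions on the training rows and on the test rows gives the two performance figures requested above and makes any gap between them visible.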

In [None]:
### YOUR CODE HERE