#Announcement

As with wrangling, I switched this notebook over to use the Titanic dataset for consistency. The video uses the Pima dataset.

The video ends rather abuptly. I hope the notebook is clear and what you need to do.

One thing I introduce is Upsampling. I promised I would do this way back in Chapter 2. I do not expect you to use it but thought it was worth demonstrating.

I also provide an optional make-up problem for you to get points back.

At the end of the notebook, I explore several ways of combining the four models. The most sophisticated is something called stacking. In essence we build a meta-model that we train with the output of the existing models. So this meta-model attempts to learn how to interpret the existing models' output.

<center>
<h1>Training and Tuning</h1>
</center>

<hr>

Once you are done here, you are ready to start playing with your server. Cool.

#How long does it take?

I think tuning time is the biggest issue for you now.
Using Pima data (training set = 614 rows) and what I consider an ok set of parameters to tune, this notebook takes me roughly 3 hours.

Take away is that as you tune each model, be aware that you might need to leave it running while you do something else.

The good news is that each model-tuning step is independent. Once you tune model X and save to GitHub, you are done with model X and can move on to model Y. The bad news is that if your dataset is larger, e.g., 5K rows, you can expect an increase in my times.

The further bad news is that there is not an easy way to get a progress bar with HalvingSearch. So if you wait 30 minutes, you don't know if you are almost done or will take another 4 hours.

Here are some strategies to consider:

1. Use incremental tuning. Tune some subset of params. Get best values and fix them. Then take on new subset using fixed values from past. You can use this strategy with both halving search and keras tuner.

2. I have factor=3. You could increase it to reduce wait time. But my experimentation tells me you may not gain that much.

3. For keras_tuner, it's easier. You can play around with `max_trials`. Set it small to start, e.g., 5. You can count on linearity here. If 5 trials takes 10 minutes, 50 likely to take 100 minutes.

##Set-up

First bring in your library.

In [None]:
github_name = 'kwoeser'
repo_name = 'CS423'
source_file = 'library.py'
url = f'https://raw.githubusercontent.com/{github_name}/{repo_name}/main/{source_file}'
!rm $source_file
!wget $url
%run -i $source_file

rm: cannot remove 'library.py': No such file or directory
--2025-06-13 23:35:05--  https://raw.githubusercontent.com/kwoeser/CS423/main/library.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 32297 (32K) [text/plain]
Saving to: ‘library.py’


2025-06-13 23:35:06 (3.05 MB/s) - ‘library.py’ saved [32297/32297]



##You need to change this url to point to your own dataset

And good idea to rename variables using "titanic" to something closer to your dataset.

In [None]:
url = 'https://raw.githubusercontent.com/kwoeser/CS423/main/final/data/heart_reduced.csv'


heart = pd.read_csv(url)
heart.head()


Unnamed: 0,Cholesterol,MaxHR,Age,RestingBP,Sex,ST_Slope,RestingECG,HeartDisease
0,0,134,62,120,M,Flat,LVH,1
1,318,160,60,102,F,Up,Normal,0
2,160,172,36,150,M,Up,Normal,0
3,248,170,47,135,F,Flat,Normal,1
4,256,113,58,160,M,Up,LVH,1


In [None]:
len(heart)

900

#Break out into features and labels



In [None]:
# Split into features and labels

features = heart.drop(columns='HeartDisease')
labels = heart['HeartDisease'].tolist()

In [None]:
labels.count(1)/len(labels)

0.5533333333333333

##Load pipeline from Wrangling notebook

You will be doing this exact same thing in the server.

In [None]:
import joblib

model_path = 'kwoeser/CS423/main/final/models/'
full_path = f'https://raw.githubusercontent.com/{model_path}final_fully_fitted_pipeline.pkl'
!rm 'final_fully_fitted_pipeline.pkl'
!wget $full_path
heart_transformer = joblib.load("final_fully_fitted_pipeline.pkl")


rm: cannot remove 'final_fully_fitted_pipeline.pkl': No such file or directory
--2025-06-13 23:35:15--  https://raw.githubusercontent.com/kwoeser/CS423/main/final/models/final_fully_fitted_pipeline.pkl
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 47954 (47K) [application/octet-stream]
Saving to: ‘final_fully_fitted_pipeline.pkl’


2025-06-13 23:35:15 (3.65 MB/s) - ‘final_fully_fitted_pipeline.pkl’ saved [47954/47954]



# Step I. Break into numpy datasets

In [None]:
%%capture
rs = 33 #what you computed in wrangling notebook
label_column = 'HeartDisease'  #change to name of your label column

x_train,  x_test, y_train,  y_test = dataset_setup(heart, label_column, heart_transformer, rs=rs)

In [None]:
len(x_train)

720

#II. Upsampling

In Chapter 2 I removed duplicates, giving us unique rows. However, it did shrink the table down to roughly 1000. I noted I would show you a way to build the table back up. I am going to use a popular method called SMOTE (Synthetic Minority Over-sampling Technique). I'll show you how to use it even though I do not expect you will need it here. But could come in handy later in your career.

Note that I am only applying it to the training data. I'd like to keep the test data pure: augment training, let test data stand.

You can find plenty of tutorials on SMOTE. Briefly, it generates new rows by
using existing rows as starting places and then interpolating values. So it
does not duplicate rows but tries to give you similar rows.

In [None]:
from imblearn.over_sampling import SMOTE

# Calculate target numbers for 3000 total samples
target_total = 3000
pos_count = np.sum(y_train == 1)/len(y_train)
neg_count = np.sum(y_train == 0)/len(y_train)
target_0 = int(neg_count * target_total)  # 1950 samples
target_1 = int(pos_count * target_total)  # 1050 samples

# Create SMOTE instance with specified sampling strategy
smote = SMOTE(sampling_strategy={0: target_0, 1: target_1}, random_state=42)
x_resampled, y_resampled = smote.fit_resample(x_train, y_train)  #requires transformed data - cannot handle categorical columns

# Verify the new distribution
print("New class distribution:")
print(f"Class 0: {sum(y_resampled == 0)} ({sum(y_resampled == 0)/len(y_resampled):.2%})")
print(f"Class 1: {sum(y_resampled == 1)} ({sum(y_resampled == 1)/len(y_resampled):.2%})")
print(f"Total samples: {len(y_resampled)}")

New class distribution:
Class 0: 1341 (44.71%)
Class 1: 1658 (55.29%)
Total samples: 2999


In [None]:
#Uncomment if you want to use upsampled data

x_train= x_resampled
y_train = y_resampled

# III. Setup Lime

Reminder: Lime will help us explain to the user why we come up with the predictions we do.

In [None]:
%%capture
!pip install lime

In [None]:
import lime
from lime import lime_tabular

In [None]:
feature_names = features.columns.to_list()
print(feature_names)

['Cholesterol', 'MaxHR', 'Age', 'RestingBP', 'Sex', 'ST_Slope', 'RestingECG']


###Set up the explainer before using

In [None]:
explainer = lime.lime_tabular.LimeTabularExplainer(x_train,
                    feature_names=feature_names,
                    training_labels=y_train,
                    class_names=[0,1], #label values
                    verbose=True,
                    mode='classification')



# IV. Write out to file

And move to GitHub.

In [None]:
!pip install dill
import dill as pickle
with open('lime_explainer.pkl', 'wb') as file:
    pickle.dump(explainer, file)

#read it back in just as a test
with open('lime_explainer.pkl', 'rb') as file:   #this will be in your webserver
    explainer2 = pickle.load(file)



#Minimal help from me with remainder of notebook

I can remind you of the steps you need for each model's tuning:

1. If using halving search, set up grid. If using Optuna, then set up model builder with hp code. With Optuna, will also need to define a validation set.

2. Get the best model found by tuning.

3. Run it on test set.

4. Produce threshold table.

5. Save both best model and threshold table out to GitHub so can load them back in with server.

I would avoid Run All here. Each notebook can be tuned separately, really in any order. But once you finish steps above for one model, you don't want to waste time and repeat them.

# V. KNN tuning



In [None]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import roc_auc_score


###Follow the steps

In [None]:
param_grid = {
    'n_neighbors': [3, 5, 7, 9, 11, 13, 15],
    'weights': ['uniform', 'distance']
}


In [None]:
knn_model = KNeighborsClassifier()
grid_result = halving_search(knn_model, param_grid, x_train, y_train)

n_iterations: 3
n_required_iterations: 3
n_possible_iterations: 3
min_resources_: 333
max_resources_: 2999
aggressive_elimination: False
factor: 3
----------
iter: 0
n_candidates: 14
n_resources: 333
Fitting 5 folds for each of 14 candidates, totalling 70 fits
----------
iter: 1
n_candidates: 5
n_resources: 999
Fitting 5 folds for each of 5 candidates, totalling 25 fits
----------
iter: 2
n_candidates: 2
n_resources: 2997
Fitting 5 folds for each of 2 candidates, totalling 10 fits


In [None]:
best_knn_model = grid_result.best_estimator_
grid_result.best_params_

y_pred_probs = best_knn_model.predict_proba(x_test)[:, 1]
thresholds = [i/100 for i in range(20, 81, 5)]
results_df, fancy_df = threshold_results(thresholds, y_test, y_pred_probs)


In [None]:
best_knn_model.score(x_test, y_test)

0.7444444444444445

In [None]:
fancy_df

Unnamed: 0,threshold,precision,recall,f1,auc,accuracy
0,0.2,0.72,0.91,0.8,0.8,0.75
1,0.25,0.72,0.89,0.79,0.8,0.74
2,0.3,0.75,0.86,0.8,0.8,0.76
3,0.35,0.76,0.86,0.81,0.8,0.77
4,0.4,0.76,0.84,0.8,0.8,0.76
5,0.45,0.75,0.8,0.78,0.8,0.74
6,0.5,0.75,0.8,0.78,0.8,0.74
7,0.55,0.75,0.76,0.76,0.8,0.73
8,0.6,0.77,0.73,0.75,0.8,0.73
9,0.65,0.76,0.68,0.72,0.8,0.71


In [None]:
# Save model and thresholds
joblib.dump(best_knn_model, 'final_knn_model.joblib')
results_df.to_csv('final_knn_thresholds.csv', index=False)

# VI. Logistic Regression tuning



In [None]:
from sklearn.linear_model import LogisticRegressionCV

###Follow the steps

In [None]:
logreg_model = LogisticRegressionCV(
    Cs=10,
    cv=5,
    penalty='l1',
    solver='saga',
    random_state=rs,
    max_iter=1000,
    n_jobs=-1
)


In [None]:
logreg_model.fit(x_train, y_train)
print(f"Best parameter: {logreg_model.C_[0]}")

Best parameter: 21.54434690031882


In [None]:
y_pred_probs = logreg_model.predict_proba(x_test)[:, 1]
results_df, fancy_df = threshold_results(thresholds, y_test, y_pred_probs)

In [None]:
logreg_model.score(x_test, y_test)

0.7777777777777778

In [None]:
fancy_df

Unnamed: 0,threshold,precision,recall,f1,auc,accuracy
0,0.2,0.75,0.95,0.84,0.89,0.79
1,0.25,0.78,0.91,0.84,0.89,0.81
2,0.3,0.79,0.89,0.84,0.89,0.81
3,0.35,0.81,0.87,0.84,0.89,0.81
4,0.4,0.81,0.83,0.82,0.89,0.8
5,0.45,0.82,0.81,0.81,0.89,0.79
6,0.5,0.82,0.77,0.79,0.89,0.78
7,0.55,0.83,0.77,0.8,0.89,0.78
8,0.6,0.84,0.77,0.8,0.89,0.79
9,0.65,0.84,0.75,0.79,0.89,0.78


In [None]:
joblib.dump(logreg_model, 'final_logreg_model.pkl')
results_df.to_csv('final_logreg_thresholds.csv', index=False)

# VII. LGB tuning



In [None]:
from lightgbm import LGBMClassifier

###Follow the steps

In [None]:
param_grid = {
    'n_estimators': [50, 100, 200],
    'learning_rate': [0.01, 0.1, 0.2],
    'num_leaves': [20, 31, 40],
    'max_depth': [-1, 5, 10]
}


In [None]:
lgb = LGBMClassifier(random_state=rs, verbose=-1)
grid_result = halving_search(lgb, param_grid, x_train, y_train)

n_iterations: 5
n_required_iterations: 5
n_possible_iterations: 5
min_resources_: 37
max_resources_: 2999
aggressive_elimination: False
factor: 3
----------
iter: 0
n_candidates: 81
n_resources: 37
Fitting 5 folds for each of 81 candidates, totalling 405 fits
----------
iter: 1
n_candidates: 27
n_resources: 111
Fitting 5 folds for each of 27 candidates, totalling 135 fits
----------
iter: 2
n_candidates: 9
n_resources: 333
Fitting 5 folds for each of 9 candidates, totalling 45 fits
----------
iter: 3
n_candidates: 3
n_resources: 999
Fitting 5 folds for each of 3 candidates, totalling 15 fits
----------
iter: 4
n_candidates: 1
n_resources: 2997
Fitting 5 folds for each of 1 candidates, totalling 5 fits




In [None]:
best_lgb = grid_result.best_estimator_
grid_result.best_params_

y_pred_probs = best_lgb.predict_proba(x_test)[:, 1]
results_df, fancy_df = threshold_results(thresholds, y_test, y_pred_probs)




In [None]:
best_lgb.score(x_test, y_test)



0.7833333333333333

In [None]:
fancy_df

Unnamed: 0,threshold,precision,recall,f1,auc,accuracy
0,0.2,0.76,0.88,0.81,0.85,0.78
1,0.25,0.77,0.87,0.82,0.85,0.78
2,0.3,0.77,0.86,0.82,0.85,0.78
3,0.35,0.77,0.85,0.81,0.85,0.78
4,0.4,0.78,0.84,0.81,0.85,0.78
5,0.45,0.78,0.84,0.81,0.85,0.78
6,0.5,0.79,0.84,0.81,0.85,0.78
7,0.55,0.79,0.84,0.81,0.85,0.78
8,0.6,0.79,0.84,0.81,0.85,0.78
9,0.65,0.79,0.84,0.81,0.85,0.78


In [None]:
joblib.dump(best_lgb, 'final_lgb_model.joblib')
results_df.to_csv('final_lgb_thresholds.csv', index=False)

# VIII. ANN tuning



In [None]:
!pip install keras-tuner -q
import keras_tuner

[?25l   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/129.1 kB[0m [31m?[0m eta [36m-:--:--[0m[2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[90m╺[0m[90m━[0m [32m122.9/129.1 kB[0m [31m4.2 MB/s[0m eta [36m0:00:01[0m[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m129.1/129.1 kB[0m [31m3.1 MB/s[0m eta [36m0:00:00[0m
[?25h

In [None]:
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense, Activation, Dropout, Input
import tensorflow as tf
from tensorflow import keras

In [None]:
tf.keras.utils.set_random_seed(1234)  #need this for replication
tf.config.experimental.enable_op_determinism()  #ditto - https://www.tensorflow.org/api_docs/python/tf/config/experimental/enable_op_determinism
tf.config.threading.set_inter_op_parallelism_threads(1)
tf.config.threading.set_intra_op_parallelism_threads(1)

In [None]:
import hashlib

def string_to_seed(string):
    # Create a hash of the string using SHA-256
    hash_object = hashlib.sha256(string.encode())
    # Convert first 8 bytes of hash to integer
    hash_int = int.from_bytes(hash_object.digest()[:8], 'big')
    return hash_int % (2**32 - 1)

In [None]:
early_stop_cb = tf.keras.callbacks.EarlyStopping(
    monitor='loss',
    min_delta=0,
    patience=10,
    verbose=0
)

###Follow the steps

In [72]:
def build_model(hp):
    model = Sequential()

    # Input layer
    model.add(Input(shape=(x_train.shape[1],), name="input_layer"))

    l2_reg = hp.Float('l2_regularization', min_value=0.001, max_value=0.05, step=0.01)

    num_layers = hp.Int("layers", min_value=1, max_value=4, step=1)
    for i in range(num_layers):
        layer_name = f"hidden_layer_{i+1}"  # Start at 1, not 0

        units = hp.Int(f"hidden_units{i}", min_value=8, max_value=64, step=8)

        model.add(Dense(
            name=layer_name + '_dense',
            kernel_regularizer=tf.keras.regularizers.l2(l2_reg),
            kernel_initializer=tf.keras.initializers.HeNormal(seed=string_to_seed(layer_name + '_dense')),
            units=units,
            activation='relu'
        ))

    # Output layer
    model.add(Dense(units=1, activation='sigmoid'))

    hp_learning_rate = hp.Choice('learning_rate', values=[1e-2, 1e-3, 1e-4])
    optimizer_choice = hp.Choice('optimizer', values=['adam', 'rmsprop'])

    smoothing_index = hp.Int("smoothing", min_value=0, max_value=2)
    smoothing_value = [0.0, 0.1, 0.2][smoothing_index]  # 0.0 = no smoothing, 0.1, 0.2 = smoothing

    if optimizer_choice == 'adam':
        optimizer = keras.optimizers.Adam(learning_rate=hp_learning_rate)
    else:
        optimizer = keras.optimizers.RMSprop(learning_rate=hp_learning_rate)

    model.compile(
        optimizer=optimizer,
        loss=tf.keras.losses.BinaryCrossentropy(label_smoothing=smoothing_value),
        metrics=['auc', 'accuracy']
    )
    return model

In [73]:
x_train_ann, x_val_ann, y_train_ann, y_val_ann = train_test_split(
    x_train, y_train, test_size=0.2, random_state=rs
)

In [None]:
# Set up tuner
tuner = keras_tuner.RandomSearch(
    build_model,
    objective='val_accuracy',
    max_trials=20,
    executions_per_trial=1,
    directory='ann_tuning',
    overwrite=True,
    seed=1234
)

print("Starting hyperparameter search...")
tuner.search(
    x_train_ann, y_train_ann,
    epochs=100,
    validation_data=(x_val_ann, y_val_ann),
    callbacks=[early_stop_cb],
    verbose=1
)


Trial 8 Complete [00h 00m 42s]
val_accuracy: 0.8766666650772095

Best val_accuracy So Far: 0.8966666460037231
Total elapsed time: 00h 06m 36s

Search: Running Trial #9

Value             |Best Value So Far |Hyperparameter
0.011             |0.001             |l2_regularization
2                 |1                 |layers
16                |48                |hidden_units0
0.0001            |0.001             |learning_rate
rmsprop           |adam              |optimizer
1                 |0                 |smoothing
32                |56                |hidden_units1
64                |8                 |hidden_units2
16                |48                |hidden_units3

Epoch 1/100
[1m75/75[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m3s[0m 11ms/step - accuracy: 0.4671 - auc: 0.4492 - loss: 2.1329 - val_accuracy: 0.4167 - val_auc: 0.4879 - val_loss: 2.0767
Epoch 2/100
[1m75/75[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 4ms/step - accuracy: 0.4671 - auc: 0.4596 - loss: 

In [None]:
# Get Best Hyperparameters
best_hp = tuner.get_best_hyperparameters()[0]
best_hp.values

{'l2_regularization': 0.001,
 'layer1_units': 40,
 'layer2_units': 32,
 'layer3_units': 8,
 'learning_rate': 0.001,
 'optimizer': 'adam',
 'smoothing': 1}

In [None]:
# Build model with best hyperparameters
ann_model = build_model(best_hp)
ann_model.summary()

In [None]:
# Train model
history = ann_model.fit(
    x_train, y_train,
    epochs=100,
    validation_split=0.2,
    callbacks=[early_stop_cb],
    verbose=1,
    batch_size=32
)

print(f"Total epochs trained: {len(history.history['loss'])}")
print(f"Final training accuracy: {history.history['accuracy'][-1]:.4f}")
print(f"Final validation accuracy: {history.history['val_accuracy'][-1]:.4f}")

Epoch 1/100
[1m75/75[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m6s[0m 17ms/step - accuracy: 0.6514 - auc: 0.6685 - loss: 0.8075 - val_accuracy: 0.7867 - val_auc: 0.0000e+00 - val_loss: 0.6713
Epoch 2/100
[1m75/75[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 4ms/step - accuracy: 0.7823 - auc: 0.8416 - loss: 0.6964 - val_accuracy: 0.8867 - val_auc: 0.0000e+00 - val_loss: 0.5679
Epoch 3/100
[1m75/75[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 4ms/step - accuracy: 0.8296 - auc: 0.8899 - loss: 0.6188 - val_accuracy: 0.9233 - val_auc: 0.0000e+00 - val_loss: 0.5070
Epoch 4/100
[1m75/75[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 4ms/step - accuracy: 0.8418 - auc: 0.9067 - loss: 0.5781 - val_accuracy: 0.9300 - val_auc: 0.0000e+00 - val_loss: 0.4827
Epoch 5/100
[1m75/75[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 4ms/step - accuracy: 0.8502 - auc: 0.9195 - loss: 0.5553 - val_accuracy: 0.9300 - val_auc: 0.0000e+00 - val_loss: 0.4674
Epoch 6/100
[

In [None]:
y_pred_probs = ann_model.predict(x_test, verbose=0).ravel()

results_df, fancy_df = threshold_results(thresholds, y_test, y_pred_probs)

test_loss, test_auc, test_accuracy = ann_model.evaluate(x_test, y_test, verbose=0)
print(f"Model Accuracy: {test_accuracy:.4f}")
print(f"Model AUC: {test_auc:.4f}")
print(f"Model Loss: {test_loss:.4f}")

Model Accuracy: 0.7944
Model AUC: 0.8583
Model Loss: 0.6235


In [None]:
fancy_df

Unnamed: 0,threshold,precision,recall,f1,auc,accuracy
0,0.2,0.73,0.93,0.82,0.86,0.77
1,0.25,0.75,0.93,0.83,0.86,0.79
2,0.3,0.75,0.91,0.82,0.86,0.78
3,0.35,0.76,0.9,0.83,0.86,0.79
4,0.4,0.77,0.89,0.83,0.86,0.79
5,0.45,0.79,0.88,0.83,0.86,0.8
6,0.5,0.79,0.85,0.82,0.86,0.79
7,0.55,0.8,0.84,0.82,0.86,0.79
8,0.6,0.81,0.81,0.81,0.86,0.79
9,0.65,0.82,0.8,0.81,0.86,0.79


In [None]:
ann_model.save('final_ann_model.keras')
results_df.to_csv('final_ann_thresholds.csv', index=False)

#You should eventually have these files on GitHub

* LIME explainer
* tuned KNN model and associated threshold table
* tuned logistic regression model and associated threshold table
* tuned light boosting model and associated threshold table
* tuned ANN model and associated threshold table

#Optional make-up: Random Forest model

I will give you credit for one homework assignment in terms of points if you elect to take on this problem.

You will need to do two things: (1) tune and save your threshold table and model below, and (2) add the model to your production notebook (your last notebook that is part of final.) The latter is the most tricky given you will actually have to change several places in the code I handed you for server. But it is doable if you get an early jump on it.

In [None]:
#From chapter 12
from sklearn.ensemble import RandomForestClassifier

###Follow the steps

##You still need to change the production notebook

Find the places where you are loading models and thresholds and add the RF results. Find places where you are doing predictions and add RF prediction. Find place where you are showing prediction results in html and add RF prediction. Also add threshold table.

This should not take long but will require you to pay attention to what you are doing to avoid screwing up what is already there.

#Just for your interest

There are several ways to combine the results of multiple models, four models in our case. We are using one of the ways in the server, but wanted to show you other options.

# IX. Voting - averaging binary

There are two ways I can see of voting when have 4 models producing results. The first is to convert their output to binary. Then simply look for majority of either 0s or 1s. I added a twist that I fall back on probabilities for ties.

In [None]:
lgb_raw = best_lgb.predict_proba(x_test)[:,1]
knn_raw = best_knn_model.predict_proba(x_test)[:,1]
logreg_raw = logreg_model.predict_proba(x_test)[:,1]
ann_raw = ann_model.predict(x_test)[:,0]

[1m6/6[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 5ms/step 




In [None]:
yvotes = []
for i in range(len(y_test)):
  the_vote = (lgb_raw[i]>=.5+logreg_raw[i]>=.5+knn_raw[i]>=.5+ann_raw[i]>=.5)
  if the_vote==2:
    #tie breaker - go to probabilities
    prob = (knn_yraw[i]+logreg_yraw[i]+xgb_yraw[i]+ann_yraw[i])/4
    the_winner = 1 if prob>=.5 else 0
  else:
    the_winner = 1 if the_vote>2 else 0
  yvotes.append(the_winner)

In [None]:
sum([1 if p>=.5 else 0 for p in ann_raw])/len(x_test)

0.5944444444444444

In [None]:
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, yvotes)
print(cm)


[[ 80   0]
 [100   0]]


In [None]:
(cm[0,0]+cm[1,1])/len(y_test)  #accuracy 0.5665399239543726

np.float64(0.4444444444444444)

Can now use it to compute precision and recall.

In [None]:
def precision_recall(y_test, y_pred):
    cm = confusion_matrix(y_test, y_pred)
    tp = cm[0,0]
    fp = cm[0,1]
    fn = cm[1,0]
    prec = tp / (tp+fp)
    rec = tp / (tp+fn)
    return prec, rec

precision, recall = precision_recall(y_test, yvotes)
print(f'Precision: {precision} Recall {recall}')

Precision: 1.0 Recall 0.4444444444444444


In [None]:
f1 = 2*(precision*recall)/(precision+recall)
f1

np.float64(0.6153846153846153)

# X. Prob averaging

The second voting approach is not actually voting. Instead, take average of 4 raw probabilities and use result as final probability. Can then run that through threshold table.

This is what the server is doing to get the "Ensemble" value.

In [None]:
avg_yraw = []
for i in range(len(y_test)):
  prob = (knn_raw[i]+logreg_raw[i]+lgb_raw[i]+ann_raw[i])/4
  avg_yraw.append(prob)

In [None]:
result_df, fancy_df = threshold_results(np.linspace(0,1,19,endpoint=True), y_test, avg_yraw)

# XI. Stacking

This is interesting in that it builds a whole separate model (a meta model) that takes the output of other base models, three in example below, and uses that as a row. So a row of 3 feature values, one from each of the base models.

I kind of like it. The meta model learns how to combine the outputs of base models, e.g., when to weight KNN higher than LGB, etc.


In [None]:
from sklearn.ensemble import StackingClassifier

estimators = [
     ('knn', KNeighborsClassifier(15, algorithm='ball_tree', p=1, weights='distance')),
    ('logreg', LogisticRegressionCV(Cs= 5, class_weight= None, cv= 5, max_iter= 500, solver= 'saga', penalty='l1', random_state=1234)),
    ('lgb', LGBMClassifier(boosting_type= 'gbdt',
                          class_weight= 'balanced',
                          learning_rate= 0.3,
                          max_depth= 5,
                          min_child_samples= 10,
                          n_estimators= 10,
                          num_leaves= 7,
                          random_state=1234),
    )
]
final_estimator = LogisticRegressionCV(random_state=1234)   #this is choice for meta model
clf = StackingClassifier(estimators=estimators, final_estimator=final_estimator)

In [None]:
%%capture
clf.fit(x_train, y_train)

In [None]:
yraw = clf.predict_proba(x_test)[:,1]
result_df, fancy_df = threshold_results(np.linspace(0,1,19,endpoint=True), y_test, yraw)
fancy_df



Unnamed: 0,threshold,precision,recall,f1,auc,accuracy
0,0.0,0.56,1.0,0.71,0.82,0.56
1,0.06,0.71,0.93,0.81,0.82,0.75
2,0.11,0.74,0.93,0.82,0.82,0.78
3,0.17,0.74,0.92,0.82,0.82,0.78
4,0.22,0.74,0.91,0.82,0.82,0.77
5,0.28,0.75,0.88,0.81,0.82,0.77
6,0.33,0.76,0.87,0.81,0.82,0.77
7,0.39,0.76,0.86,0.81,0.82,0.77
8,0.44,0.76,0.86,0.81,0.82,0.77
9,0.5,0.77,0.85,0.81,0.82,0.77


<img src='https://www.dropbox.com/scl/fi/zilmy2diy1lg1tva9vurx/Screenshot-2025-02-07-at-8.38.53-AM.png?rlkey=006szbv5t0daha005eotxt9k2&raw=1' height=400>