In [1]:
include("utils/preprocessing.jl")
include("utils/model_evaluation.jl")
include("utils/data_loader.jl")
include("utils/visualization.jl")
include("utils/ml1_utils.jl")

evaluateAndPrintMetricsRanking (generic function with 1 method)

### Analysis of data ###

In [15]:
using Random
Random.seed!(123)

data = DataLoader.load_data_for_analysis(joinpath("dataset", "star_classification.csv"));
#Visualization.entry_visualization(data);

### Preprocessing ###

In [2]:
using DataFrames

data = DataLoader.load_data(joinpath("dataset", "star_classification.csv"));

# Preprocess the data. `preprocess_data` does the following:
#   - Balances the classes by undersampling, if requested
#   - Parses the data: chooses the columns used as inputs and targets
#     (TODO: should this be done before balancing?)
#   - Splits the data into training and test sets using the hold-out method
#   - Normalizes the inputs (TODO: the zero-mean method is still to be implemented)
# preprocess_data(data, train_ratio, norm_method, balance, indices)
train_inputs, train_targets, test_inputs, test_targets = Preprocessing.preprocess_data(data, 0.98, "minmax", true, [4,5,6,7,8]);

# Note: preprocess_data no longer one-hot encodes the targets, because that is
# neither needed nor advised for kNN, DT and SVM; only the ANN requires it.
# The one-hot encoding for the ANN is done in modelCrossValidation,
# which is called by evaluateAndPrintMetricsRanking.

In [28]:
println(typeof(train_inputs), typeof(train_targets),typeof(test_inputs), typeof(test_targets))
println(size(train_inputs),size(train_targets),size(test_inputs),size(test_targets))
println(train_targets[1:10])

Matrix{Float32}Vector{Any}Matrix{Float32}Vector{Any}
(1138, 5)(1138,)(55745, 5)(55745,)
Any["STAR", "STAR", "QSO", "GALAXY", "QSO", "QSO", "STAR", "QSO", "GALAXY", "GALAXY"]
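The preprocessing step is described as a hold-out split followed by min-max normalization. A minimal sketch of those two steps is given here, assuming the normalization parameters are computed on the training rows only and then reused for the test rows. The helper names `holdout_indices`, `minmax_params` and `minmax_normalize` are hypothetical and are not the functions in `utils/preprocessing.jl`.

```julia
using Random

# Hypothetical helper: random hold-out split returning (train_idx, test_idx).
function holdout_indices(n::Int, train_ratio::Real; rng=Random.default_rng())
    perm = randperm(rng, n)
    n_train = round(Int, n * train_ratio)
    return perm[1:n_train], perm[n_train+1:end]
end

# Hypothetical helper: per-column minimum and maximum of the training inputs.
minmax_params(X::AbstractMatrix) = (minimum(X, dims=1), maximum(X, dims=1))

# Hypothetical helper: scale each column with previously computed parameters.
function minmax_normalize(X::AbstractMatrix, (mins, maxs))
    span = maxs .- mins
    # Constant columns would divide by zero; use a span of one for them.
    safe_span = ifelse.(span .== 0, one(eltype(span)), span)
    return (X .- mins) ./ safe_span
end

X = Float32[1 10; 2 20; 3 30; 4 40; 5 50]
train_idx, test_idx = holdout_indices(size(X, 1), 0.8; rng=MersenneTwister(123))
params = minmax_params(X[train_idx, :])
X_train = minmax_normalize(X[train_idx, :], params)
X_test = minmax_normalize(X[test_idx, :], params)   # reuses the training min/max
```

Reusing the training parameters keeps test information out of the fit; test values can then fall slightly outside the range [0, 1].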


In [29]:
# Check size of train and test sets
println("Train inputs: ", size(train_inputs))
println("Train targets: ", size(train_targets))
println("Test inputs: ", size(test_inputs))
println("Test targets: ", size(test_targets))

Train inputs: (1138, 5)
Train targets: (1138,)
Test inputs: (55745, 5)
Test targets: (55745,)
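The sizes printed above put 1138 of 56883 rows in the training set, about 2%, although `train_ratio` was passed as 0.98. A quick check of the realized fraction makes such a mismatch visible. Whether `preprocess_data` returns the partitions in the opposite order or interprets the ratio differently is an assumption that would need to be verified in the helper itself.

```julia
# Sizes copied from the printed output above.
n_train, n_test = 1138, 55745

# Fraction of all rows that ended up in the training set.
train_fraction = n_train / (n_train + n_test)
println("Training fraction: ", round(train_fraction, digits=3))

# Requested ratio, as passed to preprocess_data in this notebook.
requested_ratio = 0.98
if !isapprox(train_fraction, requested_ratio; atol=0.01)
    println("Warning: realized training fraction differs from the requested ratio")
end
```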


### Testing hyperparameters for each model ###

In [4]:
using ScikitLearn

@sk_import neural_network: MLPClassifier
@sk_import svm: SVC
@sk_import tree: DecisionTreeClassifier
@sk_import neighbors: KNeighborsClassifier
@sk_import ensemble: StackingClassifier  # For the stacking ensemble
@sk_import linear_model: LogisticRegression  # For the final estimator in stacking

PyObject <class 'sklearn.neighbors._classification.KNeighborsClassifier'>

In [5]:
# Set the fold indices for the k-fold cross-validation
# used to evaluate the different models
N = size(train_inputs, 1)
k = 5 # number of folds
kFoldIndices = crossvalidation(N, k);
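For readers without access to `crossvalidation(N, k)`, the following is a sketch of how such fold indices might be built. It assumes the function returns one fold label in `1:k` per pattern and does not stratify by class; the name `kfold_indices` is a hypothetical stand-in, not the helper used above.

```julia
using Random

# Hypothetical stand-in for `crossvalidation(N, k)`: assigns each of the N
# patterns a fold label in 1:k, with fold sizes differing by at most one.
function kfold_indices(N::Int, k::Int; rng=Random.default_rng())
    labels = repeat(1:k, ceil(Int, N / k))[1:N]
    return shuffle(rng, labels)
end

folds = kfold_indices(1138, 5; rng=MersenneTwister(123))
fold_sizes = [count(==(f), folds) for f in 1:5]
println("Fold sizes: ", fold_sizes)
```

Fixing the fold indices once and passing them to every model, as the notebook does with `kFoldIndices`, means all models are compared on the same partitions.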

#### Decision Tree ####

In [32]:
# Define an array of hyperparameter dictionaries for the Decision Tree model
dtree_hyperparameters_array = [
    Dict("max_depth" => 3),
    Dict("max_depth" => 5),
    Dict("max_depth" => 10),
    Dict("max_depth" => 20),
    Dict("max_depth" => 50),
    Dict("max_depth" => 100) # Deeper trees can capture more detail but risk overfitting
]

# Call the function to evaluate the model using different sets of hyperparameters and print the ranking of metrics.
evaluateAndPrintMetricsRanking(:DecisionTree,dtree_hyperparameters_array, train_inputs, train_targets, kFoldIndices)


Training with set of hyperparameters 1
Training with set of hyperparameters 2
Training with set of hyperparameters 3
Training with set of hyperparameters 4
Training with set of hyperparameters 5
Training with set of hyperparameters 6

----- acc -----
Set of hyperparameters 3 -> mean: 0.689 Std. Dev.: 0.021
Set of hyperparameters 5 -> mean: 0.689 Std. Dev.: 0.026
Set of hyperparameters 6 -> mean: 0.689 Std. Dev.: 0.026
Set of hyperparameters 4 -> mean: 0.688 Std. Dev.: 0.025
Set of hyperparameters 2 -> mean: 0.683 Std. Dev.: 0.032
Set of hyperparameters 1 -> mean: 0.622 Std. Dev.: 0.037

----- sensitivity -----
Set of hyperparameters 3 -> mean: 0.689 Std. Dev.: 0.021
Set of hyperparameters 5 -> mean: 0.689 Std. Dev.: 0.026
Set of hyperparameters 6 -> mean: 0.689 Std. Dev.: 0.026
Set of hyperparameters 4 -> mean: 0.688 Std. Dev.: 0.025
Set of hyperparameters 2 -> mean: 0.683 Std. Dev.: 0.032
Set of hyperparameters 1 -> mean: 0.622 Std. Dev.: 0.037

----- specificity -----
Set of hyperpar
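The accuracy and sensitivity rankings above are identical to three decimals. One possible explanation, assuming the helper averages per-class sensitivity weighted by class support (the notebook does not show this), is that a support-weighted recall is algebraically equal to overall accuracy: each class contributes (n_c / N) * (TP_c / n_c) = TP_c / N, and summing over classes gives the fraction of correct predictions. The function names below are hypothetical.

```julia
# Support-weighted average of per-class recall (sensitivity).
function weighted_recall(y_true, y_pred)
    total = 0.0
    for c in unique(y_true)
        support = count(==(c), y_true)
        true_pos = count(i -> y_true[i] == c && y_pred[i] == c, eachindex(y_true))
        total += (support / length(y_true)) * (true_pos / support)
    end
    return total
end

# Overall accuracy: fraction of predictions that match the true label.
accuracy(y_true, y_pred) = count(y_true .== y_pred) / length(y_true)

y_true = ["STAR", "STAR", "QSO", "GALAXY", "QSO", "GALAXY", "STAR"]
y_pred = ["STAR", "QSO", "QSO", "GALAXY", "STAR", "GALAXY", "STAR"]

println(weighted_recall(y_true, y_pred) ≈ accuracy(y_true, y_pred))
```

If this is what the helper does, the sensitivity column carries no information beyond accuracy, and a macro (unweighted) average would be needed to see per-class differences.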

#### KNN ####

In [33]:
# Define an array of hyperparameter dictionaries for the kNN model
knn_hyperparameters_array = [
    Dict("n_neighbors" => 5),
    Dict("n_neighbors" => 10),
    Dict("n_neighbors" => 15),
    Dict("n_neighbors" => 20),
    Dict("n_neighbors" => 50),
    Dict("n_neighbors" => 100) # Large neighborhoods smooth out predictions
]

# Call the function to evaluate the model using different sets of hyperparameters and print the ranking of metrics.
evaluateAndPrintMetricsRanking(:kNN,knn_hyperparameters_array, train_inputs, train_targets, kFoldIndices)

Training with set of hyperparameters 1
Training with set of hyperparameters 2
Training with set of hyperparameters 3
Training with set of hyperparameters 4
Training with set of hyperparameters 5
Training with set of hyperparameters 6

----- acc -----
Set of hyperparameters 1 -> mean: 0.742 Std. Dev.: 0.036
Set of hyperparameters 2 -> mean: 0.726 Std. Dev.: 0.029
Set of hyperparameters 3 -> mean: 0.712 Std. Dev.: 0.047
Set of hyperparameters 4 -> mean: 0.709 Std. Dev.: 0.041
Set of hyperparameters 5 -> mean: 0.699 Std. Dev.: 0.04
Set of hyperparameters 6 -> mean: 0.681 Std. Dev.: 0.028

----- sensitivity -----
Set of hyperparameters 1 -> mean: 0.742 Std. Dev.: 0.036
Set of hyperparameters 2 -> mean: 0.726 Std. Dev.: 0.029
Set of hyperparameters 3 -> mean: 0.712 Std. Dev.: 0.047
Set of hyperparameters 4 -> mean: 0.709 Std. Dev.: 0.041
Set of hyperparameters 5 -> mean: 0.699 Std. Dev.: 0.04
Set of hyperparameters 6 -> mean: 0.681 Std. Dev.: 0.028

----- specificity -----
Set of hyperparam
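In the results above, accuracy falls steadily as `n_neighbors` grows from 5 (0.742) to 100 (0.681). The toy sketch below illustrates the smoothing effect behind that trend: with a large neighborhood, the vote is dominated by the most common class even for a point that sits next to a different one. This is an illustrative majority-vote implementation with Euclidean distance, not the scikit-learn classifier used in the notebook, and `knn_predict` is a hypothetical name.

```julia
# Minimal k-nearest-neighbours majority vote (illustrative only).
function knn_predict(X_train::AbstractMatrix, y_train::AbstractVector,
                     x::AbstractVector, k::Int)
    # Squared Euclidean distance from x to every training row.
    dists = [sum(abs2, X_train[i, :] .- x) for i in axes(X_train, 1)]
    nearest = partialsortperm(dists, 1:k)
    votes = Dict{eltype(y_train),Int}()
    for label in y_train[nearest]
        votes[label] = get(votes, label, 0) + 1
    end
    # findmax on a Dict returns (largest value, its key).
    return findmax(votes)[2]
end

X_toy = [0.0 0.0; 0.0 1.0; 5.0 5.0; 6.0 5.0; 5.0 6.0]
y_toy = ["STAR", "STAR", "QSO", "QSO", "QSO"]
query = [0.2, 0.1]

println(knn_predict(X_toy, y_toy, query, 1))   # nearest neighbour only
println(knn_predict(X_toy, y_toy, query, 5))   # vote over the whole toy set
```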

#### ANN ####

In [34]:
# Define an array of hyperparameter dictionaries for the ANN model
ann_hyperparameters_array = [
    # Two-layer architecture, moderate neurons
    Dict("architecture" => [50, 30], "activation" => "relu", "learning_rate" => 0.01, "validation_ratio" => 0.1, "n_iter_no_change" => 80, "max_iter" => 1000, "repetitionsTraining" => 10),

    # One-layer architecture, fewer neurons
    Dict("architecture" => [30], "activation" => "relu", "learning_rate" => 0.01, "validation_ratio" => 0.1, "n_iter_no_change" => 80, "max_iter" => 1000, "repetitionsTraining" => 10),

    # Two-layer, different activation function
    Dict("architecture" => [50, 30], "activation" => "tanh", "learning_rate" => 0.01, "validation_ratio" => 0.1, "n_iter_no_change" => 80, "max_iter" => 1000, "repetitionsTraining" => 10),

    # One-layer, lower learning rate
    Dict("architecture" => [30], "activation" => "relu", "learning_rate" => 0.001, "validation_ratio" => 0.1, "n_iter_no_change" => 80, "max_iter" => 2000, "repetitionsTraining" => 10),

    # Two-layer, higher learning rate
    Dict("architecture" => [50, 30], "activation" => "relu", "learning_rate" => 0.05, "validation_ratio" => 0.1, "n_iter_no_change" => 80, "max_iter" => 1000, "repetitionsTraining" => 10),

    # One-layer, logistic activation
    Dict("architecture" => [30], "activation" => "logistic", "learning_rate" => 0.01, "validation_ratio" => 0.1, "n_iter_no_change" => 80, "max_iter" => 1000, "repetitionsTraining" => 10),

    # Two-layer, more neurons, different activation
    Dict("architecture" => [70, 40], "activation" => "tanh", "learning_rate" => 0.01, "validation_ratio" => 0.1, "n_iter_no_change" => 80, "max_iter" => 1000, "repetitionsTraining" => 10),

    # One-layer, more neurons
    Dict("architecture" => [50], "activation" => "relu", "learning_rate" => 0.01, "validation_ratio" => 0.1, "n_iter_no_change" => 80, "max_iter" => 1000, "repetitionsTraining" => 10)
]

# Call the function to evaluate the model using different sets of hyperparameters and print the ranking of metrics.
evaluateAndPrintMetricsRanking(:ANN, ann_hyperparameters_array, train_inputs, train_targets, kFoldIndices)

Training with set of hyperparameters 1
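The preprocessing note earlier in this notebook says the targets are one-hot encoded only for the ANN, inside `modelCrossValidation`. A minimal sketch of such an encoding follows, assuming a boolean matrix with one row per pattern and one column per class. The helper name `one_hot` is hypothetical and may differ from what the utility files actually do.

```julia
# Hypothetical helper: one column per class, `true` where the target matches.
function one_hot(targets::AbstractVector,
                 classes::AbstractVector=sort(unique(targets)))
    # Broadcasting an N-vector against a 1×C row yields an N×C BitMatrix.
    return targets .== permutedims(classes)
end

targets = Any["STAR", "QSO", "GALAXY", "QSO"]
encoded = one_hot(targets)   # columns ordered GALAXY, QSO, STAR
println(size(encoded))
```

Passing `classes` explicitly keeps the column order stable across folds, even if a fold happens to miss one class.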


#### SVM ####

In [11]:
svm_hyperparameters_array = [
    # 'rbf' kernel, default penalty (C=1.0), 'scale' gamma; 'degree' is only used by the 'poly' kernel
    Dict("kernel" => "rbf", "degree" => 3, "C" => 1.0, "gamma" => "scale"),
    
    # 'rbf' kernel, higher penalty (C=10.0), i.e. weaker regularization and a harder margin; 'auto' gamma = 1 / n_features
    Dict("kernel" => "rbf", "degree" => 3, "C" => 10.0, "gamma" => "auto"),
    
    # 'rbf' kernel, lower penalty (C=0.1) for a softer margin, 'scale' gamma
    Dict("kernel" => "rbf", "degree" => 3, "C" => 0.1, "gamma" => "scale"),

    # 'linear' kernel with a low penalty (C=0.1); 'gamma' is ignored by the linear kernel
    Dict("kernel" => "linear", "degree" => 3, "C" => 0.1, "gamma" => "auto"),
    
    # 'linear' kernel, with C=1.0 as a balance between margin width and misclassification
    Dict("kernel" => "linear", "degree" => 3, "C" => 1.0, "gamma" => "auto"),

    # 'linear' kernel, C=1.0; same model as the previous set because 'gamma' is ignored
    Dict("kernel" => "linear", "degree" => 3, "C" => 1.0, "gamma" => "scale"),

    # 'linear' kernel with a higher penalty (C=10.0)
    Dict("kernel" => "linear", "degree" => 3, "C" => 10.0, "gamma" => "scale"),
    
    # 'poly' kernel, degree 3, C=1.0, 'scale' gamma
    Dict("kernel" => "poly", "degree" => 3, "C" => 1.0, "gamma" => "scale"),
    
    # 'poly' kernel, higher polynomial degree (5) for more model complexity, 'auto' gamma
    Dict("kernel" => "poly", "degree" => 5, "C" => 1.0, "gamma" => "auto")
]


# Call the function to evaluate the model using different sets of hyperparameters and print the ranking of metrics.
evaluateAndPrintMetricsRanking(:SVM, svm_hyperparameters_array, train_inputs, train_targets, kFoldIndices)

Training with set of hyperparameters 1
Training with set of hyperparameters 2
Training with set of hyperparameters 3
Training with set of hyperparameters 4
Training with set of hyperparameters 5
Training with set of hyperparameters 6
Training with set of hyperparameters 7
Training with set of hyperparameters 8
Training with set of hyperparameters 9

----- acc -----
Set of hyperparameters 1 -> mean: 0.722 Std. Dev.: 0.039
Set of hyperparameters 8 -> mean: 0.714 Std. Dev.: 0.029
Set of hyperparameters 2 -> mean: 0.695 Std. Dev.: 0.053
Set of hyperparameters 7 -> mean: 0.684 Std. Dev.: 0.046
Set of hyperparameters 5 -> mean: 0.682 Std. Dev.: 0.049
Set of hyperparameters 6 -> mean: 0.682 Std. Dev.: 0.049
Set of hyperparameters 3 -> mean: 0.665 Std. Dev.: 0.036
Set of hyperparameters 4 -> mean: 0.647 Std. Dev.: 0.036
Set of hyperparameters 9 -> mean: 0.405 Std. Dev.: 0.045

----- sensitivity -----
Set of hyperparameters 1 -> mean: 0.722 Std. Dev.: 0.039
Set of hyperparameters 8 -> mean: 0.7
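The SVM sets above switch between `gamma = "scale"` and `gamma = "auto"`. According to the scikit-learn documentation, "scale" uses 1 / (n_features * X.var()) and "auto" uses 1 / n_features, and `gamma` only affects the 'rbf', 'poly' and 'sigmoid' kernels. The sketch below computes both values for a small matrix; the function names are hypothetical, and the population variance over all entries is used to mirror NumPy's default `X.var()`.

```julia
using Statistics

# gamma = "scale": 1 / (n_features * variance of all entries of X).
gamma_scale(X::AbstractMatrix) = 1 / (size(X, 2) * var(X; corrected=false))

# gamma = "auto": 1 / n_features.
gamma_auto(X::AbstractMatrix) = 1 / size(X, 2)

X_demo = [0.0 1.0; 1.0 0.0]   # 2 features, population variance 0.25
println("scale: ", gamma_scale(X_demo))
println("auto:  ", gamma_auto(X_demo))
```

With min-max normalized inputs the overall variance is small, so "scale" typically yields a larger gamma than "auto" for the same data. This may be one reason the two settings behave differently in the ranking above.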

In [None]:
#=
Disabled plotting cell. The values below appear to match the unit 6 table
further down, not the results obtained in this notebook.

using Plots

# Define the data for each model
ann_means = [0.947, 0.947, 0.925, 0.788, 0.948, 0.933, 0.8, 0.9]
ann_stds = [0.018, 0.04, 0.07, 0.097, 0.036, 0.039, 0.4, 0.3]
svm_means = [0.947, 0.947, 0.927, 0.953, 0.94, 0.4, 0.5]
svm_stds = [0.03, 0.038, 0.092, 0.051, 0.043, 0.082, 0.7]
dt_means = [0.927, 0.913, 0.913, 0.913, 0.913, 0.913]
dt_stds = [0.043, 0.045, 0.045, 0.045, 0.045, 0.045]
knn_means = [0.947, 0.947, 0.96, 0.94, 0.913, 0.507]
knn_stds = [0.038, 0.038, 0.015, 0.043, 0.104, 0.068]

# Create subplots for each model; the x range follows the number of sets per model
p1 = bar(eachindex(ann_means), ann_means, yerr=ann_stds, title="ANN", legend=false)
p2 = bar(eachindex(svm_means), svm_means, yerr=svm_stds, title="SVM", legend=false)
p3 = bar(eachindex(dt_means), dt_means, yerr=dt_stds, title="Decision Tree", legend=false)
p4 = bar(eachindex(knn_means), knn_means, yerr=knn_stds, title="KNN", legend=false)

# Customize the y-axis and labels
for p in [p1, p2, p3, p4]
    ylabel!(p, "Accuracy")
    xlabel!(p, "Set of Hyperparameters")
end

# Combine the plots into one figure
plot(p1, p2, p3, p4, layout=(2,2), size=(800,600))
=#

In [37]:
using MLBase
using JLD
using Statistics  # provides mean, used for the accuracy below

dt_model = DecisionTreeClassifier(max_depth=10)

# Fit the model on the training data
fit!(dt_model, train_inputs, train_targets)

# Predict the targets for the test data
predicted_targets = predict(dt_model, test_inputs)

# Calculate and print the accuracy
dt_model_accuracy = mean(predicted_targets .== test_targets)
println("Decision Tree model accuracy: $(dt_model_accuracy * 100) %")

# Save the model
#JLD.save("dt_model.jld", "model", dt_model)

Decision Tree model accuracy: 33.36083953717822 %


In [34]:
knn_model = KNeighborsClassifier(n_neighbors=10)

# Fit the model on the training data
fit!(knn_model, train_inputs, train_targets)

# Predict the targets for the test data
predicted_targets = predict(knn_model, test_inputs)

# Calculate and print the accuracy
knn_model_accuracy = mean(predicted_targets .== test_targets)
println("KNN model accuracy: $(knn_model_accuracy * 100) %")

# Save the model
#JLD.save("knn_model.jld", "model", knn_model)

KNN model accuracy: 33.303435285675846 %


In [31]:
ann_model = MLPClassifier(hidden_layer_sizes=(70, 40), activation="tanh", learning_rate_init=0.01, validation_fraction=0.1, n_iter_no_change=80, max_iter=1000)

# Fit the model on the training data
fit!(ann_model, train_inputs, train_targets)

# Predict the targets for the test data
predicted_targets = predict(ann_model, test_inputs)

# Calculate and print the accuracy
ann_model_accuracy = mean(predicted_targets .== test_targets)
println("ANN model accuracy: $(ann_model_accuracy * 100) %")

# Save the model
#JLD.save("ann_model.jld", "model", ann_model)

ANN model accuracy: 33.31958023141089 %




In [33]:
svm_model = SVC(kernel="rbf", degree=3, C=1.0, gamma="scale")

# Fit the model on the training data
fit!(svm_model, train_inputs, train_targets)

# Predict the targets for the test data
predicted_targets = predict(svm_model, test_inputs)

# Calculate and print the accuracy
svm_model_accuracy = mean(predicted_targets .== test_targets)
println("SVM model accuracy: $(svm_model_accuracy * 100) %")

# Save the model
#JLD.save("svm_model.jld", "model", svm_model)

SVM model accuracy: 34.38335276706431 %


In [35]:
# Define the base models with the chosen hyperparameters
dt_model = DecisionTreeClassifier(max_depth=10)
knn_model = KNeighborsClassifier(n_neighbors=10)
ann_model = MLPClassifier(hidden_layer_sizes=(70, 40), activation="tanh", learning_rate_init=0.01, validation_fraction=0.1, n_iter_no_change=80, max_iter=10000) # Increase max_iter from 1000 to ensure convergence
svm_model = SVC(kernel="rbf", degree=3, C=1.0, gamma="scale")

# Create a list of tuples (name, model) for the base models
base_models = [
    ("DecisionTree", dt_model),
    ("kNN", knn_model),
    ("ANN", ann_model),
    ("SVM", svm_model)
]

# Choose a final estimator for the stacking ensemble
# Logistic Regression is a common choice for combining predictions
final_estimator = LogisticRegression()

# Create the stacking ensemble
ensemble = StackingClassifier(estimators=base_models, final_estimator=final_estimator)

# Train the ensemble model
fit!(ensemble, train_inputs, train_targets)

# Evaluate the ensemble model
model_accuracy = score(ensemble, test_inputs, test_targets)
println("Ensemble model accuracy: $(model_accuracy * 100) %")

# Save the model
#JLD.save("ensemble.jld", "model", ensemble)


Ensemble model accuracy: 34.10888868956857 %
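The hold-out accuracies reported above (about 33% to 34%) are far below the cross-validation accuracies (about 0.69 to 0.74) and close to 1/3, which is what a constant prediction would score if the three classes are balanced in the test set. This may point to a problem in how the test split is built or normalized rather than in the models themselves; that is an assumption, not something the notebook confirms. The sketch below shows two diagnostics that could help: a majority-class baseline and a confusion matrix. Both function names are hypothetical.

```julia
# Accuracy obtained by always predicting the most frequent training label.
function majority_baseline_accuracy(train_targets, test_targets)
    counts = Dict{Any,Int}()
    for t in train_targets
        counts[t] = get(counts, t, 0) + 1
    end
    majority = findmax(counts)[2]   # key with the largest count
    return count(==(majority), test_targets) / length(test_targets)
end

# Confusion matrix: rows are true classes, columns are predicted classes.
function confusion_matrix(y_true, y_pred, classes)
    index = Dict(c => i for (i, c) in enumerate(classes))
    cm = zeros(Int, length(classes), length(classes))
    for (t, p) in zip(y_true, y_pred)
        cm[index[t], index[p]] += 1
    end
    return cm
end

classes = ["GALAXY", "QSO", "STAR"]
toy_train = ["STAR", "STAR", "QSO"]
toy_true = ["STAR", "QSO", "GALAXY", "STAR", "QSO", "GALAXY"]
toy_pred = fill("STAR", length(toy_true))   # a degenerate constant classifier

baseline = majority_baseline_accuracy(toy_train, toy_true)
cm = confusion_matrix(toy_true, toy_pred, classes)
println("Baseline accuracy: ", baseline)
println(cm)
```

A confusion matrix with nearly all mass in one column would indicate that a model collapses to a single class on the test inputs, which is a typical symptom of a scaling mismatch between training and test data.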


## Placeholder copied from unit 6; something similar, based on this notebook's results, would go here ##

### Best model configuration

Based on the hyperparameters provided for each model and the results we have, the best configurations for each model according to **accuracy** are:


- **Artificial Neural Network (ANN)**: The best-performing ANN model uses the architecture [100, 100, 100] with 'relu' activation, a learning rate of 0.01, a validation ratio of 0.1, and a maximum of 1000 iterations. This suggests that a more complex model with a higher number of neurons was able to capture the complexity of the data better than simpler models.

- **Support Vector Machine (SVM)**: The SVM model that performed best had the 'linear' kernel with a C value of 1.0 and 'auto' gamma setting. This indicates a model that balances margin and misclassification error, without the need for the complexity of a non-linear kernel.

- **Decision Tree**: The best decision tree model had a maximum depth of 3. This suggests that a simpler model, which avoids overfitting by not going too deep into the tree, was sufficient to capture the relevant patterns in the data.

- **K-Nearest Neighbors (KNN)**: The KNN model that yielded the best results had 15 neighbors. This points to an intermediate complexity that balances between smoothing out the noise and capturing sufficient detail from the dataset.

The following table summarizes their performance:


| Best Model Configuration | Accuracy      | Sensitivity   | Specificity   | PPV           | NPV           | F-score       | Error rate    |
|--------------------------|---------------|---------------|---------------|---------------|---------------|---------------|---------------|
| ANN (Model 5)            | 0.948 ± 0.036 | 0.948 ± 0.036 | 0.979 ± 0.013 | 0.959 ± 0.027 | 0.969 ± 0.025 | 0.948 ± 0.036 | 0.052 ± 0.036 |
| SVM (Model 4)            | 0.953 ± 0.051 | 0.953 ± 0.051 | 0.980 ± 0.022 | 0.964 ± 0.034 | 0.973 ± 0.033 | 0.953 ± 0.051 | 0.047 ± 0.051 |
| Decision Tree (Model 1)  | 0.927 ± 0.043 | 0.927 ± 0.043 | 0.967 ± 0.021 | 0.940 ± 0.032 | 0.955 ± 0.032 | 0.927 ± 0.043 | 0.073 ± 0.043 |
| KNN (Model 3)            | 0.960 ± 0.015 | 0.960 ± 0.015 | 0.980 ± 0.009 | 0.965 ± 0.011 | 0.978 ± 0.015 | 0.960 ± 0.015 | 0.040 ± 0.015 |

### Best performing model 

The KNN model has the highest mean accuracy (0.960) and the lowest standard deviation (0.015), which suggests that it offers the best compromise between bias and variance among the four models. The differences between models are small relative to their standard deviations, however, so the ranking could change with different data or a different context. We believe accuracy is a suitable metric for our use case, but it is not always the right criterion: when the consequences of false positives and false negatives differ significantly, metrics such as sensitivity or specificity may be more relevant for optimization.

# ---------------------------------------------- #