**1. Load Pipeline and Transformed Data**
The preprocessing pipeline and transformed datasets are loaded from saved .pkl files.
A helper function ensures that the data is converted into a proper numeric 2-D matrix, which is required by machine learning models.

In [0]:
import joblib
import numpy as np
import pandas as pd
from scipy.sparse import issparse
pipeline = joblib.load(
   "/Workspace/Users/gsc314@ensign.edu/csai382_lab_2_4_-GustavoC-/etl_pipeline/stedi_feature_pipeline.pkl"
)
X_train_transformed = joblib.load(
   "/Workspace/Users/gsc314@ensign.edu/csai382_lab_2_4_-GustavoC-/etl_pipeline/X_train_transformed.pkl"
)
X_test_transformed = joblib.load(
   "/Workspace/Users/gsc314@ensign.edu/csai382_lab_2_4_-GustavoC-/etl_pipeline/X_test_transformed.pkl"
)
def to_float_matrix(arr) -> np.ndarray:
   """
   Convert a loaded feature array (possibly sparse, object-dtype, or 0-d) to a 2-D float matrix.
   Saved feature arrays may have inconsistent shapes or types after transformation,
   and ML models require numeric 2-D arrays for training and prediction.
   """
   # A sparse matrix is densified directly
   if issparse(arr):
       return arr.toarray().astype(float)
   # Anything else (list, DataFrame, ndarray) is viewed as an ndarray first
   arr = np.asarray(arr)
   if arr.ndim == 0:
       # 0-d array wrapping a single object: unwrap it, then densify or convert
       inner = arr.item()
       if issparse(inner):
           return inner.toarray().astype(float)
       return np.asarray(inner, dtype=float)
   if arr.dtype == object:
       # Object array of rows: convert each row to float, then stack the rows
       rows = [
           x.toarray().astype(float) if issparse(x) else np.asarray(x, dtype=float)
           for x in arr
       ]
       return np.vstack(rows)
   # Plain numeric array: only the dtype needs to change
   return arr.astype(float)
X_train = to_float_matrix(X_train_transformed)
X_test = to_float_matrix(X_test_transformed)
y_train = joblib.load(
   "/Workspace/Users/gsc314@ensign.edu/csai382_lab_2_4_-GustavoC-/etl_pipeline/y_train.pkl"
)
y_test = joblib.load(
   "/Workspace/Users/gsc314@ensign.edu/csai382_lab_2_4_-GustavoC-/etl_pipeline/y_test.pkl"
)
y_train = np.ravel(y_train)
y_test = np.ravel(y_test)
X_train.shape, X_test.shape, y_train.shape, y_test.shape
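
To illustrate why the helper handles the 0-d case: a sparse matrix stored inside a NumPy object array has shape `()` and cannot be passed to a model directly. The sketch below is a minimal illustration on a made-up 2x3 matrix. The names `sp`, `wrapped`, and `dense` are hypothetical, and it is only an assumption that the saved STEDI features were wrapped this way.

```python
import numpy as np
from scipy import sparse

# Hypothetical 2x3 sparse feature matrix (illustration only, not the STEDI data)
sp = sparse.csr_matrix(np.array([[1.0, 0.0, 2.0], [0.0, 3.0, 0.0]]))

# Build a 0-d object array whose single element is the sparse matrix
holder = np.empty(1, dtype=object)
holder[0] = sp
wrapped = holder.reshape(())

# Unwrap with .item(), densify, and cast to float to get a 2-D matrix
dense = wrapped.item().toarray().astype(float)

print(wrapped.ndim, wrapped.dtype)
print(dense.shape, dense.dtype)
```

The wrapped array reports zero dimensions even though it contains a full matrix, which is why a shape check alone is not enough before training.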

**2. Logistic Regression with Grid Search**
A Logistic Regression model is tuned using GridSearchCV to find the best hyperparameters.

In [0]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

log_reg_params = {
    "C": [0.01, 0.1, 1, 10],
    "penalty": ["l2"],
    "solver": ["lbfgs", "liblinear"]
}

log_reg_grid = GridSearchCV(
    LogisticRegression(max_iter=300),
    log_reg_params,
    cv=3,
    scoring="accuracy"
)

log_reg_grid.fit(X_train, y_train)

log_reg_best_params = log_reg_grid.best_params_
log_reg_best_score = log_reg_grid.best_score_

log_reg_best_params, log_reg_best_score

**3. Random Forest with Grid Search**
A Random Forest classifier is also tuned using GridSearchCV.

In [0]:
from sklearn.ensemble import RandomForestClassifier

rf_params = {
    "n_estimators": [50, 100, 200],
    "max_depth": [None, 5, 10, 20],
    "min_samples_split": [2, 5],
    "min_samples_leaf": [1, 2]
}

rf_grid = GridSearchCV(
    # Fixed seed so repeated runs of the search give the same scores
    RandomForestClassifier(random_state=42),
    rf_params,
    cv=3,
    scoring="accuracy",
    n_jobs=-1
)

rf_grid.fit(X_train, y_train)

rf_best_params = rf_grid.best_params_
rf_best_score = rf_grid.best_score_

rf_best_params, rf_best_score
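
As a rough guide to the cost of the two searches, the sketch below counts how many parameter combinations each grid contains and how many cross-validation fits that implies with `cv=3`. The grids are copied from the cells above; `count_fits` is a hypothetical helper written for this illustration, not part of scikit-learn.

```python
from itertools import product

# Grids copied from the Logistic Regression and Random Forest cells
log_reg_params = {
    "C": [0.01, 0.1, 1, 10],
    "penalty": ["l2"],
    "solver": ["lbfgs", "liblinear"],
}
rf_params = {
    "n_estimators": [50, 100, 200],
    "max_depth": [None, 5, 10, 20],
    "min_samples_split": [2, 5],
    "min_samples_leaf": [1, 2],
}

def count_fits(param_grid: dict, cv: int) -> tuple:
    """Return (number of parameter combinations, number of cross-validation fits)."""
    n_combos = len(list(product(*param_grid.values())))
    return n_combos, n_combos * cv

log_reg_combos, log_reg_fits = count_fits(log_reg_params, cv=3)
rf_combos, rf_fits = count_fits(rf_params, cv=3)

print(log_reg_combos, log_reg_fits)  # 8 24
print(rf_combos, rf_fits)            # 48 144
```

With the default `refit=True`, each search also trains the winning combination once more on the full training set, so the totals are one fit higher than the counts printed here.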

**4. Model Selection**
The two models are compared using their best cross-validation accuracy scores.

In [0]:
# Choose the better model based on best_score_
if rf_best_score > log_reg_best_score:
    best_model = rf_grid.best_estimator_
    best_model_name = "Random Forest"
else:
    best_model = log_reg_grid.best_estimator_
    best_model_name = "Logistic Regression"

best_model_name, best_model
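
The comparison above uses cross-validation scores, which are computed on the training split only. One possible follow-up is to score the selected model once on the held-out test split. The sketch below shows the idea on synthetic data from `make_classification`, which stands in for the STEDI features. The small grids, the seeds, and the names `searches`, `best_name`, and `test_accuracy` are assumptions made for this illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (assumption: not the STEDI dataset)
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Deliberately small grids so the sketch runs quickly
searches = {
    "Logistic Regression": GridSearchCV(
        LogisticRegression(max_iter=300), {"C": [0.1, 1]}, cv=3, scoring="accuracy"
    ),
    "Random Forest": GridSearchCV(
        RandomForestClassifier(random_state=0),
        {"n_estimators": [20, 50]},
        cv=3,
        scoring="accuracy",
    ),
}
for search in searches.values():
    search.fit(X_tr, y_tr)

# Pick the search with the highest mean cross-validation accuracy
best_name = max(searches, key=lambda name: searches[name].best_score_)
best_estimator = searches[best_name].best_estimator_

# Score the refit winner once on data it has never seen
test_accuracy = accuracy_score(y_te, best_estimator.predict(X_te))
print(best_name, round(test_accuracy, 3))
```

On a tie, `max` returns the first entry, so Logistic Regression is kept, which matches the `if`/`else` in the cell above.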

**5. Save the Best Model**
The selected model is saved for deployment or future predictions.

In [0]:
# Save the model selected in step 4 (best_model) instead of hard-coding one of the grid searches
joblib.dump(best_model, "/Workspace/Users/gsc314@ensign.edu/csai382_lab_2_4_-GustavoC-/etl_pipeline/stedi_best_model.pkl")
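
A quick way to gain confidence in a saved model is to reload it and check that it reproduces the original predictions. The sketch below does this for a tiny made-up dataset and a temporary directory. The file name `demo_model.pkl` and the data are hypothetical and unrelated to the STEDI model.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression

# Tiny made-up dataset (illustration only)
X_demo = np.array([[0.0], [1.0], [2.0], [3.0]])
y_demo = np.array([0, 0, 1, 1])
model = LogisticRegression().fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "demo_model.pkl")  # hypothetical file name
    joblib.dump(model, model_path)
    reloaded = joblib.load(model_path)

# The reloaded model should give exactly the same predictions as the original
same_predictions = np.array_equal(model.predict(X_demo), reloaded.predict(X_demo))
print(same_predictions)
```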

**Model Evaluation Report and Ethics Reflection**

After tuning both models, Logistic Regression performed best based on cross-validation accuracy: its best mean score across the three folds was higher than the Random Forest's. These are validation scores computed on the training split, not results on the held-out test set. Accuracy was the comparison metric because both grid searches were configured with scoring="accuracy". The hyperparameter that improved the Logistic Regression model was C, the inverse of the regularization strength. The selected value, C=0.01, is the smallest in the grid and therefore applies the strongest regularization, which helped control overfitting and made the model more stable. One interesting result was that the simpler Logistic Regression model outperformed the more complex Random Forest model, suggesting that the dataset may have relatively simple patterns that do not require a highly complex model.

If more time were available, the next step would be to test additional hyperparameters, try different models such as Gradient Boosting or Support Vector Machines, and evaluate performance using other metrics like precision, recall, and F1-score. This would provide a more complete understanding of the model's behavior, especially if the dataset is imbalanced. However, hyperparameter tuning can accidentally introduce bias if the model becomes too optimized for patterns in the training data and performs poorly on underrepresented groups. This is why transparency is important. We should clearly report metrics, parameters, and model choices so others can understand and verify the results. Gospel principles also teach honesty and integrity. Just as we are taught to be truthful and fair in our actions, we should evaluate models honestly, report results accurately, and avoid hiding weaknesses or biases in our work.
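
As one illustration of why the additional metrics mentioned above matter on imbalanced data, the sketch below computes accuracy, precision, recall, and F1 in plain Python for two made-up sets of predictions. The labels and the helper `binary_metrics` are hypothetical and are not results from the STEDI models.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall, and F1 for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)

    # Guard against division by zero when a class is never predicted or never present
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {
        "accuracy": correct / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

# Made-up imbalanced labels: 8 negatives and 2 positives
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
always_negative = [0] * 10                    # never predicts the positive class
mixed = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]        # one true positive, one false positive

majority_scores = binary_metrics(y_true, always_negative)
mixed_scores = binary_metrics(y_true, mixed)
print(majority_scores)
print(mixed_scores)
```

Both sets of predictions reach 0.8 accuracy, but the first never finds a positive case (F1 of 0.0) while the second has an F1 of 0.5, a difference that accuracy alone hides.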