**Step 1: Load transformed data, labels, and prepare arrays**

In [0]:
import joblib
import numpy as np
import pandas as pd
from scipy.sparse import issparse
pipeline = joblib.load(
   "/Workspace/Users/gsc314@ensign.edu/csai382_lab_2_4_-GustavoC-/etl_pipeline/stedi_feature_pipeline.pkl"
)
X_train_transformed = joblib.load(
   "/Workspace/Users/gsc314@ensign.edu/csai382_lab_2_4_-GustavoC-/etl_pipeline/X_train_transformed.pkl"
)
X_test_transformed = joblib.load(
   "/Workspace/Users/gsc314@ensign.edu/csai382_lab_2_4_-GustavoC-/etl_pipeline/X_test_transformed.pkl"
)
def to_float_matrix(arr) -> np.ndarray:
   """
   Convert a saved feature array (possibly sparse, object-dtype, or 0-d) to a dense float matrix.
   Saved feature arrays can come back from disk with inconsistent shapes or dtypes,
   and ML models require numeric 2-D arrays for training and prediction.
   """
   # Sparse matrices are not ndarrays, so handle them before using ndarray attributes
   if issparse(arr):
       return arr.toarray().astype(float)
   arr = np.asarray(arr)
   if arr.ndim == 0:
       # A 0-d array wraps a single Python object (for example a sparse matrix); unwrap it
       inner = arr.item()
       return inner.toarray().astype(float) if issparse(inner) else np.asarray(inner, dtype=float)
   if arr.dtype == object:
       # Each element is one row (dense or sparse); densify each row and stack them
       return np.vstack([
           x.toarray() if issparse(x) else np.asarray(x, dtype=float)
           for x in arr
       ])
   return arr.astype(float)
X_train = to_float_matrix(X_train_transformed)
X_test = to_float_matrix(X_test_transformed)
y_train = joblib.load(
   "/Workspace/Users/gsc314@ensign.edu/csai382_lab_2_4_-GustavoC-/etl_pipeline/y_train.pkl"
)
y_test = joblib.load(
   "/Workspace/Users/gsc314@ensign.edu/csai382_lab_2_4_-GustavoC-/etl_pipeline/y_test.pkl"
)
y_train = np.ravel(y_train)
y_test = np.ravel(y_test)
X_train.shape, X_test.shape, y_train.shape, y_test.shape

**Step 2: Train and evaluate a Logistic Regression baseline model**

In [0]:
from sklearn.linear_model import LogisticRegression

log_reg = LogisticRegression(max_iter=300)
log_reg.fit(X_train, y_train)

log_reg_score = log_reg.score(X_test, y_test)
log_reg_score

**Step 3: Train and evaluate a Random Forest baseline model**

In [0]:
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(random_state=42)  # fixed seed so the baseline score is reproducible across runs
rf.fit(X_train, y_train)

rf_score = rf.score(X_test, y_test)
rf_score

**Step 4: Compare baseline model accuracy results**

In [0]:
results = {
    "Logistic Regression baseline": log_reg_score,
    "Random Forest baseline": rf_score
}
results
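
Accuracy alone can hide how each model behaves on each class. As a rough sketch, the comparison above could be extended with a confusion matrix per model. The data below is synthetic (`make_classification`) and only stands in for the STEDI features; names such as `demo_models` and `demo_report` are illustrative and not part of the original notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): this is NOT the STEDI dataset
X_demo, y_demo = make_classification(n_samples=400, n_features=8, random_state=0)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0, stratify=y_demo
)

demo_models = {
    "Logistic Regression baseline": LogisticRegression(max_iter=300),
    "Random Forest baseline": RandomForestClassifier(random_state=0),
}

demo_report = {}
for name, model in demo_models.items():
    model.fit(Xd_train, yd_train)
    preds = model.predict(Xd_test)
    demo_report[name] = {
        "accuracy": accuracy_score(yd_test, preds),
        # Rows are true classes, columns are predicted classes
        "confusion_matrix": confusion_matrix(yd_test, preds),
    }

for name, entry in demo_report.items():
    print(name, round(entry["accuracy"], 3))
    print(entry["confusion_matrix"])
```

With the real data, the same loop would run on `X_train`, `X_test`, `y_train`, and `y_test` from Step 1 instead of the synthetic split.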

**Step 5: Save trained models and accuracy metadata**

In [0]:
import os
import joblib
from datetime import datetime

# Create a unique folder name (prevents overwriting files)
run_id = datetime.now().strftime("%Y%m%d_%H%M%S")
base_dir = f"/Workspace/Users/gsc314@ensign.edu/csai382_lab_2_4_-GustavoC-/etl_pipeline/stedi_models/{run_id}"
os.makedirs(base_dir, exist_ok=True)

# Save trained models
joblib.dump(log_reg, f"{base_dir}/log_reg.joblib")
joblib.dump(rf, f"{base_dir}/random_forest.joblib")

# Save accuracy information (metadata)
metadata = {
    "run_id": run_id,
    "logistic_regression_accuracy": float(log_reg_score),
    "random_forest_accuracy": float(rf_score),
}

joblib.dump(metadata, f"{base_dir}/metadata.joblib")

base_dir
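
Before relying on saved artifacts, it can help to confirm that they load back correctly. The sketch below shows one possible round-trip check, assuming a tiny toy model and a temporary directory in place of the real `log_reg`, `rf`, and `base_dir`; the names `toy_model` and `same_predictions` are hypothetical.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression

# Tiny toy model (assumption): stands in for log_reg / rf from the notebook
X_toy = np.array([[0.0], [1.0], [2.0], [3.0]])
y_toy = np.array([0, 0, 1, 1])
toy_model = LogisticRegression(max_iter=300).fit(X_toy, y_toy)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "toy_model.joblib")
    joblib.dump(toy_model, model_path)

    # Reload and confirm the restored model predicts exactly like the original
    restored = joblib.load(model_path)
    same_predictions = bool(
        np.array_equal(toy_model.predict(X_toy), restored.predict(X_toy))
    )

print("Round trip OK:", same_predictions)
```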


**Step 6: Package saved models and metadata into a ZIP file**

In [0]:
import shutil

# Define the ZIP path (in the etl_pipeline folder, alongside the stedi_models directory)
zip_path = f"/Workspace/Users/gsc314@ensign.edu/csai382_lab_2_4_-GustavoC-/etl_pipeline/stedi_models_{run_id}.zip"

# Create the ZIP archive containing your saved models and metadata
shutil.make_archive(zip_path.replace(".zip", ""), 'zip', base_dir)

print(f"ZIP file created: {zip_path}")
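
A quick way to check an archive like this is to list its contents and run a CRC test. The following is a minimal sketch using only the standard library; it builds a throwaway folder with placeholder files (not real models) in a temporary directory, so `demo_dir` and `demo_models` are assumptions made for illustration.

```python
import os
import shutil
import tempfile
import zipfile

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical run folder with placeholder files (not real models)
    demo_dir = os.path.join(tmp_dir, "demo_run")
    os.makedirs(demo_dir)
    for file_name in ("log_reg.joblib", "random_forest.joblib", "metadata.joblib"):
        with open(os.path.join(demo_dir, file_name), "wb") as fh:
            fh.write(b"placeholder")

    # make_archive takes the archive path WITHOUT the extension
    # and returns the full path of the file it created
    archive_path = shutil.make_archive(
        os.path.join(tmp_dir, "demo_models"), "zip", demo_dir
    )

    with zipfile.ZipFile(archive_path) as zf:
        # Ignore directory entries, keep only file members
        archived_names = sorted(n for n in zf.namelist() if not n.endswith("/"))
        # testzip() returns None when every member passes its CRC check
        first_bad_file = zf.testzip()

print(archived_names, first_bad_file)
```

Using the path returned by `shutil.make_archive` avoids having to rebuild the `.zip` file name by hand.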

**Baseline Model Analysis**

The Random Forest model outperformed the Logistic Regression model, achieving higher accuracy on the test set. Random Forest also tends to be more robust to noisy sensor data: it averages the votes of many decision trees, so errors and small fluctuations in individual readings affect it less than they affect a single linear decision boundary. One question I still have is why the accuracy numbers differ and which features helped Random Forest learn better patterns. It is important to test a model before using it in real life because a wrong prediction can lead to bad decisions or failures in real systems. If a model is wrong, it can affect users, workers, and other people who depend on the system's results. Fairness matters in data science because models should treat everyone equally, and it matters in discipleship because we are taught to care for others and avoid causing harm through our choices.
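
One possible way to explore the open question about differing accuracy and feature influence is to compare cross-validated scores and inspect the forest's feature importances. The sketch below is only an illustration on synthetic data (`make_classification`), not the STEDI features, and the names `candidates`, `cv_summary`, and `ranked` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (assumption): this is NOT the STEDI dataset
X_demo, y_demo = make_classification(
    n_samples=300, n_features=6, n_informative=3, random_state=0
)

candidates = {
    "Logistic Regression": LogisticRegression(max_iter=300),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=0),
}

# 5-fold cross-validation shows the mean accuracy and its spread across folds,
# which is more informative than a single train/test split
cv_summary = {}
for name, model in candidates.items():
    scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring="accuracy")
    cv_summary[name] = (scores.mean(), scores.std())
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")

# Impurity-based importances: a rough indication of which features the forest used most
forest = candidates["Random Forest"].fit(X_demo, y_demo)
importances = forest.feature_importances_
ranked = sorted(enumerate(importances), key=lambda pair: pair[1], reverse=True)
for index, weight in ranked:
    print(f"feature {index}: {weight:.3f}")
```

Impurity-based importances are only a first look; they describe what this particular forest relied on and do not by themselves explain why one model scores higher than another.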