**1. Load Pipeline, Data, and Prepare Matrices**
The preprocessing pipeline, the transformed training and test feature matrices, and the labels are loaded from saved `.pkl` files. The feature matrices are then converted to dense 2-D float arrays.


In [0]:
import joblib
import numpy as np
import pandas as pd
from scipy.sparse import issparse
pipeline = joblib.load(
   "/Workspace/Users/gsc314@ensign.edu/csai382_lab_2_4_-GustavoC-/etl_pipeline/stedi_feature_pipeline.pkl"
)
X_train_transformed = joblib.load(
   "/Workspace/Users/gsc314@ensign.edu/csai382_lab_2_4_-GustavoC-/etl_pipeline/X_train_transformed.pkl"
)
X_test_transformed = joblib.load(
   "/Workspace/Users/gsc314@ensign.edu/csai382_lab_2_4_-GustavoC-/etl_pipeline/X_test_transformed.pkl"
)
def to_float_matrix(arr) -> np.ndarray:
   """
   Convert a loaded feature array (possibly sparse, object-dtype, or 0-d) to a dense 2-D float matrix.
   Saved feature arrays can come back from disk with inconsistent shapes or types,
   and ML models require numeric 2-D arrays for training and prediction.
   """
   if issparse(arr):
       # SciPy sparse matrix: densify directly
       return arr.toarray().astype(float)
   arr = np.asarray(arr)
   if arr.ndim == 0:
       # 0-d object array wrapping a single object: unwrap it first
       inner = arr.item()
       if issparse(inner):
           return inner.toarray().astype(float)
       return np.atleast_2d(np.array(inner, dtype=float))
   if arr.dtype == object:
       # Object array of rows: densify each row, then stack the rows into one matrix
       return np.vstack([
           x.toarray() if issparse(x) else np.array(x, dtype=float)
           for x in arr
       ]).astype(float)
   return arr.astype(float)
X_train = to_float_matrix(X_train_transformed)
X_test = to_float_matrix(X_test_transformed)
y_train = joblib.load(
   "/Workspace/Users/gsc314@ensign.edu/csai382_lab_2_4_-GustavoC-/etl_pipeline/y_train.pkl"
)
y_test = joblib.load(
   "/Workspace/Users/gsc314@ensign.edu/csai382_lab_2_4_-GustavoC-/etl_pipeline/y_test.pkl"
)
y_train = np.ravel(y_train)
y_test = np.ravel(y_test)
X_train.shape, X_test.shape, y_train.shape, y_test.shape
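
As a rough illustration of why the conversion helper above handles 0-d inputs: a sparse matrix stored inside a NumPy object array shows up as a single 0-d element that has to be unwrapped before it can be densified. The sketch below builds such a wrapper by hand. The toy matrix and variable names are made up for illustration, and whether the saved STEDI arrays actually take this form is an assumption.

```python
import numpy as np
from scipy.sparse import csr_matrix, issparse

# Toy sparse matrix standing in for a transformed feature matrix (illustrative values only)
sparse = csr_matrix(np.array([[1.0, 0.0], [0.0, 2.0]]))

# Build a 0-d object array whose single element is the sparse matrix
holder = np.empty(1, dtype=object)
holder[0] = sparse
wrapped = holder.reshape(())

# Unwrap with .item(), then densify to a regular 2-D float array
inner = wrapped.item()
dense = inner.toarray() if issparse(inner) else np.asarray(inner, dtype=float)

print(wrapped.ndim, dense.shape, dense.dtype)
```

Checking `ndim` before anything else matters here because iterating over or stacking a 0-d array fails, while `.item()` returns the wrapped object unchanged.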

**2. Load Best Model and Compute Feature Importances**
The saved best model is loaded from disk.

**Global Feature Importance Observations**

The most important feature by far is `num__distance_cm`, with an importance of 0.918. This makes sense because distance is directly related to whether a step occurs.

Other features, like `cat__sensor_type_accelerometer`, `cat__sensor_type_gyroscope`, `cat__sensor_type_ultraSonicSensor`, and various device IDs, have much smaller importance values (roughly 0.003–0.01 each). These features are less influential individually, which is reasonable since the main signal comes from the distance measurement.

Overall, the importance pattern looks logical. I would trust predictions made by this model because the top feature aligns with domain knowledge.


In [0]:
import numpy as np
model = joblib.load("/Workspace/Users/gsc314@ensign.edu/csai382_lab_2_4_-GustavoC-/etl_pipeline/stedi_best_model.pkl")
importances = model.feature_importances_
importance_order = np.argsort(importances)[::-1]

# Get feature names if available; fall back to generic names if the pipeline cannot provide them
try:
    feature_names = pipeline.named_steps["preprocess"].get_feature_names_out()
except Exception:
    feature_names = [f"feature_{i}" for i in range(X_train.shape[1])]

for idx in importance_order[:10]:
    print(feature_names[idx], ":", importances[idx])
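
Impurity-based importances such as `feature_importances_` are derived from the training splits, so an independent check can be useful before trusting a single dominant feature. One common option is permutation importance: shuffle one column at a time and measure how much accuracy drops. The sketch below hand-rolls the idea on synthetic data with a stand-in rule-based predictor (both hypothetical, not the STEDI model or data). For a fitted estimator, scikit-learn provides `sklearn.inspection.permutation_importance`.

```python
import numpy as np

# Synthetic data: only column 0 determines the label (illustrative assumption)
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = (X[:, 0] > 0).astype(int)

def predict(X):
    # Stand-in "model": a fixed rule that looks only at column 0
    return (X[:, 0] > 0).astype(int)

def permutation_importance_manual(predict_fn, X, y, n_repeats=5, seed=0):
    """Mean drop in accuracy after shuffling each column, averaged over n_repeats."""
    shuffle_rng = np.random.default_rng(seed)
    baseline = np.mean(predict_fn(X) == y)
    drops = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        for _ in range(n_repeats):
            X_perm = X.copy()
            X_perm[:, j] = shuffle_rng.permutation(X_perm[:, j])
            drops[j] += baseline - np.mean(predict_fn(X_perm) == y)
    return drops / n_repeats

drops = permutation_importance_manual(predict, X, y)
print(drops)
```

A feature the predictor ignores shows a drop of exactly zero, while the feature it depends on shows a large drop. If the two rankings disagree on a real model, that is a cue to look more closely before trusting either one.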

**3. Visualize Global Feature Importance**
A horizontal bar chart is created to visualize the top 10 most important features.

In [0]:
import matplotlib.pyplot as plt

plt.figure(figsize=(10, 5))
plt.barh([feature_names[i] for i in importance_order[:10]],
         importances[importance_order[:10]])
plt.xlabel("Importance")
plt.title("Top Global Feature Importance")
plt.gca().invert_yaxis()
# Save before plt.show(): once the figure has been shown, savefig would write a blank image
plt.savefig("/Workspace/Users/gsc314@ensign.edu/csai382_lab_2_4_-GustavoC-/etl_output/Global_Feature_Importance.png", bbox_inches='tight')
plt.show()

In [0]:
%pip install shap

**4. Initialize SHAP Explainer**
SHAP (SHapley Additive exPlanations) is used to explain model predictions.
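
To make the idea behind Shapley values concrete, the sketch below computes them exactly for a tiny hand-written model by enumerating every coalition of features. This is only an illustration of the definition: the toy model, the all-zeros baseline, and the function names are assumptions made for this example, and `TreeExplainer` uses a tree-specific algorithm rather than brute-force enumeration.

```python
from itertools import combinations
from math import factorial

def toy_model(z):
    # Hypothetical model: two additive terms plus an interaction between features 0 and 2
    return 3.0 * z[0] + 1.0 * z[1] + 2.0 * z[0] * z[2]

x = [1.0, 1.0, 1.0]         # instance to explain
baseline = [0.0, 0.0, 0.0]  # reference input used for "absent" features

def coalition_value(present):
    # Features in `present` take the instance's value; the rest take the baseline's
    z = [x[j] if j in present else baseline[j] for j in range(len(x))]
    return toy_model(z)

def exact_shapley(value_fn, n):
    """Weighted average of each feature's marginal contribution over all coalitions."""
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = frozenset(subset)
                phi[i] += weight * (value_fn(s | {i}) - value_fn(s))
    return phi

phi = exact_shapley(coalition_value, len(x))
base_value = coalition_value(frozenset())
prediction = toy_model(x)
print(phi, base_value + sum(phi), prediction)
```

The interaction term is split evenly between the two features involved, and the base value plus all contributions adds up to the model output. That additivity is the property the summary and force plots rely on.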

In [0]:
import shap
shap.initjs()

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test)

**5. SHAP Summary Plot (Global Explanation)**

In [0]:
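# Index 1 selects the SHAP values for the positive class ("step"), which is the class the observations discuss
shap.summary_plot(shap_values[...,1], X_test, feature_names=feature_names, rng=42, show=False)

# show=False keeps the figure open so it can be saved before it is displayed
plt.savefig("/Workspace/Users/gsc314@ensign.edu/csai382_lab_2_4_-GustavoC-/etl_output/SHAP_Summary_Plot.png", bbox_inches='tight')
plt.show()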

**SHAP Summary Plot Observations**

The SHAP summary plot confirms that `num__distance_cm` is the dominant feature influencing predictions, just like in the global feature importance. 

High values of `num__distance_cm` push predictions toward the "step" class, which is expected. Sensor type features and device ID features have smaller effects, slightly nudging predictions in either direction depending on their value.

No unexpected influences were observed. The model appears to rely on meaningful patterns, and the SHAP summary matches the global feature importance results.


**6. SHAP Force Plots (Local Explanations)**
Force plots explain individual predictions. This one explains the prediction for class 1 (the positive class, `step`).

**SHAP Force Plot Observation**

For this specific observation, `num__distance_cm` (feature value 0.5795) contributed strongly toward predicting a step. Interestingly, `cat__device_id_spotter-19.stedi.local` (feature value 1, meaning the reading came from that device) also contributed positively, showing that the model sometimes relies on specific device IDs for fine-grained prediction adjustments.

The explanation makes sense: the main feature driving the step prediction is distance, with minor contributions from device identifiers. Looking at the raw data would likely lead to the same conclusion. The model is basing its decision on relevant and understandable patterns.
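
A force plot can be read as simple arithmetic: the model output for the explained class equals the base value plus the sum of the per-feature contributions. The sketch below reproduces that bookkeeping with made-up numbers and placeholder feature names (none of them are taken from this notebook's output).

```python
# All values below are hypothetical and only illustrate how a force plot adds up
base_value = 0.40  # expected model output for the class before seeing any features
contributions = {  # per-feature SHAP values for one observation
    "distance": 0.45,
    "device_id_x": 0.05,
    "sensor_type_y": -0.02,
}

# The explained output is the base value plus every contribution
model_output = base_value + sum(contributions.values())

# Rank features by the size of their push, regardless of direction
ranked = sorted(contributions.items(), key=lambda kv: abs(kv[1]), reverse=True)
for name, value in ranked:
    direction = "toward" if value > 0 else "away from"
    print(f"{name}: {value:+.2f} (pushes {direction} the class)")
print(f"model output: {model_output:.2f}")
```

Positive contributions correspond to the bars pushing the output higher in the plot, and negative ones to the bars pushing it lower.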

In [0]:
i = 0  # index of the test observation to explain
import shap
shap.initjs()
shap.force_plot(explainer.expected_value[1],
                shap_values[...,1][i],
                X_test[i],
                feature_names=feature_names)


**7. SHAP Force Plots (Local Explanations)**
Force plots explain individual predictions. This one explains the prediction for class 0 (the negative class, `no_step`).

In [0]:
i = 0  # index of the test observation to explain
import shap
shap.initjs()
shap.force_plot(explainer.expected_value[0],
                shap_values[...,0][i],
                X_test[i],
                feature_names=feature_names)


**Reflection**

Global Insight:

The most important feature overall is `num__distance_cm`. This makes sense because the distance measurement directly reflects movement, which is critical for detecting whether a step occurred. Other features, such as sensor type and device ID, have much smaller contributions but may help adjust predictions slightly depending on context or hardware.

Local Insight:

The SHAP force plot for a single observation showed that `num__distance_cm` strongly pushed the prediction toward `step`, while `cat__device_id_spotter-19.stedi.local` also contributed positively. Features like sensor type and other device IDs had smaller, mixed effects. This shows that the model uses distance as the main signal, with minor adjustments from other features.

Human Intuition Check:

Yes, the model’s logic largely matches what a human would expect from the STEDI data. Distance is the most relevant indicator of a step, and it makes sense that sensor type or device ID only slightly influence the prediction. The model’s decisions are interpretable and aligned with real-world reasoning.

Dashboard Preparation:

For the Week 6 dashboard, I plan to include:

- The global feature importance bar chart for the top features.
- The SHAP summary plot to show overall feature influence.
- At least one SHAP force plot to demonstrate how the model explains individual predictions.

In [0]:
from sklearn.metrics import classification_report, confusion_matrix

y_pred = model.predict(X_test)

print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))

In [0]:
# Step 1: Import libraries
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix

# Step 2: Predict using your model
y_pred = model.predict(X_test)

# Step 3: Create confusion matrix
cm = confusion_matrix(y_test, y_pred, labels=['no_step', 'step'])  # specify class order

# Step 4: Plot heatmap
plt.figure(figsize=(6,5))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=['no_step', 'step'], yticklabels=['no_step', 'step'])
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.title('Confusion Matrix Heatmap')
# Save before plt.show(): once the figure has been shown, savefig would write a blank image
plt.savefig("/Workspace/Users/gsc314@ensign.edu/csai382_lab_2_4_-GustavoC-/etl_output/confusion_matrix.png", bbox_inches='tight')
plt.show()



In [0]:
import pandas as pd

# Create a dataframe with actual and predicted labels
df_cm = pd.DataFrame({
    "actual_label": y_test,
    "predicted_label": y_pred
})

# Convert to Spark dataframe
spark_df = spark.createDataFrame(df_cm)

# Save as a table
spark_df.write.mode("overwrite").saveAsTable("stedi_predictions")


In [0]:
import pandas as pd

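num_features = len(model.feature_importances_)
# Keep the names resolved from the preprocessing step; fall back to generic ones only if they do not line up
if len(feature_names) != num_features:
    feature_names = [f"feature_{i}" for i in range(num_features)]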

fi_df = pd.DataFrame({
    "feature_name": feature_names,
    "importance": model.feature_importances_
})

spark_df = spark.createDataFrame(fi_df)
spark_df.write.mode("overwrite").saveAsTable("feature_importance")


In [0]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
# Predict on test set
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, pos_label='step')
recall = recall_score(y_test, y_pred, pos_label='step')
f1 = f1_score(y_test, y_pred, pos_label='step')

# Optional: convert to percentage
accuracy_pct = round(accuracy * 100, 2)
precision_pct = round(precision * 100, 2)
recall_pct = round(recall * 100, 2)
f1_pct = round(f1 * 100, 2)

print(f"Accuracy: {accuracy_pct}%")
print(f"Precision: {precision_pct}%")
print(f"Recall: {recall_pct}%")
print(f"F1-score: {f1_pct}%")
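
For reference, the four metrics above all derive from the confusion-matrix counts for the positive class. The sketch below recomputes them by hand on a small made-up set of labels (the label lists are illustrative, not STEDI predictions), treating `step` as the positive label as in the cell above.

```python
# Made-up labels for illustration only
y_true = ["step", "step", "step", "no_step", "no_step", "no_step", "no_step", "step"]
y_pred = ["step", "step", "no_step", "no_step", "no_step", "step", "step", "step"]
positive = "step"

pairs = list(zip(y_true, y_pred))
tp = sum(t == positive and p == positive for t, p in pairs)  # steps correctly detected
fp = sum(t != positive and p == positive for t, p in pairs)  # non-steps flagged as steps
fn = sum(t == positive and p != positive for t, p in pairs)  # steps that were missed
tn = sum(t != positive and p != positive for t, p in pairs)  # non-steps correctly rejected

accuracy = (tp + tn) / len(pairs)
precision = tp / (tp + fp)  # of everything predicted "step", how much really was
recall = tp / (tp + fn)     # of all real steps, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(tp, fp, fn, tn)
print(accuracy, precision, recall, f1)
```

Precision and recall diverge whenever false positives and false negatives differ, which is why both are reported alongside accuracy.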


In [0]:
import pandas as pd

metrics_df = pd.DataFrame({
    "metric_name": ["Accuracy", "Precision", "Recall", "F1-score"],
    "value": [accuracy, precision, recall, f1]
})

# Convert to Spark DataFrame
spark_metrics = spark.createDataFrame(metrics_df)

# Save as table
spark_metrics.write.mode("overwrite").saveAsTable("model_performance")

In [0]:
from sklearn.metrics import roc_curve, auc
import matplotlib.pyplot as plt
import pandas as pd
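# For binary classification: look up the predict_proba column that corresponds to 'step'
step_col = list(model.classes_).index('step')
y_prob = model.predict_proba(X_test)[:, step_col]  # probability of 'step'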
fpr, tpr, thresholds = roc_curve(y_test, y_prob, pos_label='step')
roc_auc = auc(fpr, tpr)
print(f"ROC AUC: {roc_auc:.2f}")
plt.figure(figsize=(6,6))
plt.plot(fpr, tpr, color='blue', lw=2, label=f'ROC curve (AUC = {roc_auc:.2f})')
plt.plot([0,1], [0,1], color='gray', linestyle='--')  # random classifier
plt.xlim([0,1])
plt.ylim([0,1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend(loc='lower right')
# Save as image for dashboard (before plt.show(), otherwise the saved image is blank)
plt.savefig("/Workspace/Users/gsc314@ensign.edu/csai382_lab_2_4_-GustavoC-/etl_output/roc_curve.png", bbox_inches='tight')
plt.show()



In [0]:
roc_df = pd.DataFrame({
    "fpr": fpr,
    "tpr": tpr,
    "threshold": thresholds
})

spark_roc = spark.createDataFrame(roc_df)
spark_roc.write.mode("overwrite").saveAsTable("roc_curve")
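
One way to interpret the AUC value: it equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as half. The sketch below computes that pairwise version directly on a handful of made-up scores (illustrative values, not model output).

```python
# Made-up predicted probabilities of "step" for illustration
pos_scores = [0.9, 0.8, 0.4]  # scores of observations that really are steps
neg_scores = [0.7, 0.3, 0.2]  # scores of observations that are not steps

wins = 0.0
for p in pos_scores:
    for n in neg_scores:
        if p > n:
            wins += 1.0   # positive ranked above negative
        elif p == n:
            wins += 0.5   # ties count as half

auc_pairwise = wins / (len(pos_scores) * len(neg_scores))
print(auc_pairwise)
```

Here one positive (0.4) is outranked by one negative (0.7), so eight of the nine pairs are ordered correctly.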


In [0]:
# For tree-based models
import pandas as pd
import numpy as np

importances = model.feature_importances_  # array of importance values
num_features = len(importances)

# Reuse the real feature names when they line up; otherwise create generic ones
if len(feature_names) != num_features:
    feature_names = [f"feature_{i}" for i in range(num_features)]

# Create a DataFrame
fi_df = pd.DataFrame({
    "feature_name": feature_names,
    "importance": importances
})

# Sort by importance descending
fi_df = fi_df.sort_values(by="importance", ascending=False)
print(fi_df.head())  # a bare expression in the middle of a cell produces no output

import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(8,6))
sns.barplot(x="importance", y="feature_name", data=fi_df, palette="Blues_r")
plt.title("Feature Importance")
plt.xlabel("Importance")
plt.ylabel("Feature")
plt.tight_layout()
# Save before plt.show(): once the figure has been shown, savefig would write a blank image
plt.savefig("/Workspace/Users/gsc314@ensign.edu/csai382_lab_2_4_-GustavoC-/etl_output/feature_importance.png", bbox_inches='tight')
plt.show()


In [0]:
import shap
import matplotlib.pyplot as plt

# Step 1: Create TreeExplainer
explainer = shap.TreeExplainer(model)

# Step 2: Compute SHAP values for your test set
shap_values = explainer.shap_values(X_test)

# Select the positive class ('step'). Depending on the SHAP version, the result is either
# a list of per-class arrays or one array of shape (n_samples, n_features, n_classes).
if isinstance(shap_values, list):
    shap_values_for_step = shap_values[1]
elif np.ndim(shap_values) == 3:
    shap_values_for_step = shap_values[..., 1]
else:
    shap_values_for_step = shap_values

# Step 3: Plot SHAP summary (show=False keeps the figure open so it can be saved)
shap.summary_plot(shap_values_for_step, X_test, feature_names=feature_names, show=False)

# Step 4: Save plot for dashboard, then display it
plt.savefig("/Workspace/Users/gsc314@ensign.edu/csai382_lab_2_4_-GustavoC-/etl_output/shap_summary.png", bbox_inches='tight')
plt.show()


In [0]:
import shap
import matplotlib.pyplot as plt

# Initialize JS for interactive plots
shap.initjs()

# Create TreeExplainer
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test)

print(type(shap_values))
print(len(shap_values) if isinstance(shap_values, list) else shap_values.shape)
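
The type and shape check above matters because the layout of `shap_values` for classifiers depends on the installed SHAP version: it may be a list with one array per class, or a single array with the classes on the last axis. The helper below is a small sketch that normalizes both layouts to one `(n_samples, n_features)` array. The function name is hypothetical and the inputs are dummy arrays, so it runs without SHAP installed.

```python
import numpy as np

def select_class_shap(shap_values, class_index):
    """Return the (n_samples, n_features) SHAP matrix for one class, whatever the layout."""
    if isinstance(shap_values, list):
        # List layout: one (n_samples, n_features) array per class
        return np.asarray(shap_values[class_index])
    arr = np.asarray(shap_values)
    if arr.ndim == 3:
        # Array layout: (n_samples, n_features, n_classes)
        return arr[..., class_index]
    # Already 2-D: a single output, nothing to select
    return arr

# Dummy stand-ins for the two layouts (4 samples, 3 features, 2 classes)
as_list = [np.zeros((4, 3)), np.ones((4, 3))]
as_array = np.stack(as_list, axis=-1)

from_list = select_class_shap(as_list, 1)
from_array = select_class_shap(as_array, 1)
print(from_list.shape, from_array.shape)
```

Using a helper like this in one place avoids indexing expressions such as `shap_values[...,1]` failing when the notebook runs against a different SHAP version.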



In [0]:
i = 0  # choose any index you like
import shap
shap.initjs()
shap_html = shap.force_plot(explainer.expected_value[1],
                shap_values[...,1][i],
                X_test[i],
                feature_names=feature_names,
                matplotlib=False)

# Save as HTML
shap.save_html("/Workspace/Users/gsc314@ensign.edu/csai382_lab_2_4_-GustavoC-/etl_output/positive_shap_force_plot.html", shap_html)

In [0]:
i = 0  # first row of X_test
import shap
shap.initjs()
# For positive class; show=False keeps the matplotlib figure open so savefig can capture it
shap.force_plot(
    explainer.expected_value[1],
    shap_values[...,1][i],
    X_test[i],
    feature_names=feature_names,
    matplotlib=True,
    show=False
)

plt.savefig("/Workspace/Users/gsc314@ensign.edu/csai382_lab_2_4_-GustavoC-/etl_output/positive_shap_force_plot.png", bbox_inches="tight")
plt.show()

In [0]:
i = 0  # choose any index you like
import shap
shap.initjs()
shap_html = shap.force_plot(explainer.expected_value[0],
                shap_values[...,0][i],
                X_test[i],
                feature_names=feature_names,
                matplotlib=False)

# Save as HTML
shap.save_html("/Workspace/Users/gsc314@ensign.edu/csai382_lab_2_4_-GustavoC-/etl_output/negative_shap_force_plot.html", shap_html)

In [0]:
i = 0  # first row of X_test
import shap
shap.initjs()
# For negative class; show=False keeps the matplotlib figure open so savefig can capture it
shap.force_plot(
    explainer.expected_value[0],
    shap_values[...,0][i],
    X_test[i],
    feature_names=feature_names,
    matplotlib=True,
    show=False
)

plt.savefig("/Workspace/Users/gsc314@ensign.edu/csai382_lab_2_4_-GustavoC-/etl_output/negative_shap_force_plot.png", bbox_inches="tight")
plt.show()

In [0]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
import matplotlib.pyplot as plt
import pandas as pd

# Predict on test set
y_pred = model.predict(X_test)

# Compute metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, pos_label='step')
recall = recall_score(y_test, y_pred, pos_label='step')
f1 = f1_score(y_test, y_pred, pos_label='step')

# Convert to percentage
metrics_df = pd.DataFrame({
    "Metric": ["Accuracy", "Precision", "Recall", "F1-score"],
    "Value": [accuracy*100, precision*100, recall*100, f1*100]
})
print(metrics_df)  # a bare expression in the middle of a cell produces no output

plt.figure(figsize=(6,4))
plt.bar(metrics_df["Metric"], metrics_df["Value"], color='skyblue')
plt.ylim(0, 100)
plt.ylabel("Percentage (%)")
plt.title("Model Performance Metrics", pad=20)
for index, value in enumerate(metrics_df["Value"]):
    plt.text(index, value + 1, f"{value:.1f}%", ha='center', va='center')
plt.tight_layout(rect=[0, 0, 1, 0.95])
# Save before plt.show(): once the figure has been shown, savefig would write a blank image
plt.savefig("/Workspace/Users/gsc314@ensign.edu/csai382_lab_2_4_-GustavoC-/etl_output/model_metrics.png", bbox_inches='tight')
plt.show()

In [0]:
from sklearn import set_config
import matplotlib.pyplot as plt

# Set sklearn to display diagrams
set_config(display='diagram')

# Display your pipeline
pipeline  # your loaded pipeline object

#plt.savefig("/Workspace/Users/gsc314@ensign.edu/csai382_lab_2_4_-GustavoC-/etl_output/feature_pipeline_expanded3.png", bbox_inches='tight')

In [0]:
from sklearn import set_config
set_config(display='diagram')  # ensures diagram mode

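# Render the pipeline's plain-text repr (str(pipeline)) in a matplotlib figure so it can be saved as a PNG.
# The 'diagram' setting above only affects the notebook's HTML display, not str(pipeline).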
plt.figure(figsize=(12,6))
plt.text(0.5, 0.5, str(pipeline), ha='center', va='center', fontsize=12)
plt.axis('off')
plt.savefig("/Workspace/Users/gsc314@ensign.edu/csai382_lab_2_4_-GustavoC-/etl_output/feature_pipeline.png", bbox_inches='tight')
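
Because the interactive pipeline diagram is HTML rather than a matplotlib figure, `plt.savefig` cannot capture it. One possible alternative, sketched below, is to export the diagram's HTML with `sklearn.utils.estimator_html_repr` and write it to a file. The toy pipeline, file name, and temporary directory are placeholders for illustration; the loaded `pipeline` object and an output folder of your choice would take their place.

```python
import os
import tempfile

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.utils import estimator_html_repr

# Toy pipeline standing in for the loaded preprocessing pipeline (placeholder)
toy_pipeline = Pipeline([("scale", StandardScaler())])

# Produce the same HTML that the notebook renders as the diagram
html = estimator_html_repr(toy_pipeline)

# Write it to a temporary location (placeholder path for this sketch)
out_path = os.path.join(tempfile.mkdtemp(), "pipeline_diagram.html")
with open(out_path, "w", encoding="utf-8") as f:
    f.write(html)

print(out_path)
```

The resulting file can be opened in a browser or embedded in a dashboard that accepts HTML.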

In [0]:
shap_df = pd.DataFrame(shap_values[...,1], columns=feature_names)
spark.createDataFrame(shap_df).write.mode("overwrite").saveAsTable("shap_values")