# machine.ipynb

Machine learning analysis to predict student stress levels.

<br>

## Changes from FP-6

**Fixes from earlier session (correcting FP-6 mistakes):**
- Used 5-fold cross-validation instead of single train/test split
- Added hyperparameter tuning for both models
- Aggregated confusion matrices across all folds

**New improvements for FP-7 (preprocessing & dimensionality reduction):**
- Applied **StandardScaler** to standardize features before KNN (Lesson 13)
- Applied **PCA** to reduce dimensionality by combining correlated features into uncorrelated components (Lesson 14)
- Imported data from parse_data.ipynb instead of loading CSV directly

<br>

## Plan

**Research Question:** Can we predict a student's stress level from their daily habits?

**Data:**
- Target (y): stress (Low=1, Moderate=2, High=3)
- Features (X): studyhours, sleephours, socialhours, activityhours, Gender

**Preprocessing (new for FP-7):**
- StandardScaler: transforms each feature to mean=0, std=1
- PCA: reduces to components explaining 90%+ variance

**Models:** KNN and Decision Tree with 5-fold cross-validation

---

## Setup and Data Preparation

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import cross_val_score, cross_val_predict, cross_validate, KFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.metrics import confusion_matrix

%matplotlib inline

In [2]:
%%capture
%run 'parse_data.ipynb'

In [3]:
# Gender is not in parse_data.ipynb, so add it here (the assignment aligns on row index, so df and df0 must share the same index)
df0 = pd.read_csv('lifestylestudents.csv')
df['Gender'] = df0['Gender'].map({'Male': 0, 'Female': 1})

X = df[['studyhours', 'sleephours', 'socialhours', 'activityhours', 'Gender']]
y = df['stress']

kfold = KFold(n_splits=5, shuffle=True, random_state=42)

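The dataset contains 2000 students. Stress levels are distributed as:
- Low: ~24%
- Moderate: ~24%
- High: ~51%

The classes are moderately imbalanced: High is about twice as common as either other class, so always predicting High would already score ~51% accuracy.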
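
The class shares and the majority-class baseline can be computed directly from the label column. The sketch below uses a hypothetical `y_demo` series built to roughly match the proportions listed above; in the notebook itself the same two lines would be applied to `y`.

```python
import pandas as pd

# Hypothetical labels that approximate the reported proportions
# (1=Low, 2=Moderate, 3=High). Not the notebook's actual data.
y_demo = pd.Series([1] * 24 + [2] * 24 + [3] * 52)

# Share of each class, ordered by label value
class_share = y_demo.value_counts(normalize=True).sort_index()

# Accuracy obtained by always predicting the most common class
majority_baseline = class_share.max()

print(class_share)
print(f"Majority-class baseline: {majority_baseline:.0%}")
```

Any model accuracy reported later is easier to judge against this baseline than against 0%.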

---

## Preprocessing: Scaling

KNN uses Euclidean distance to find nearest neighbors. Without scaling, features with larger ranges dominate the distance calculation.

Our features have different ranges:
- studyhours: 5-10
- sleephours: 5-10
- socialhours: 0-6
- activityhours: 0-13
- Gender: 0-1

StandardScaler transforms each feature to mean=0 and std=1, so every feature enters the distance calculation on a comparable scale.

<br>

In [4]:
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

After scaling, all features have mean ≈ 0 and std = 1.
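
The mean and std claim can be checked numerically. The following is a minimal sketch on a hypothetical two-column array `X_demo` (not the notebook's `X`); it also shows that `StandardScaler` matches the manual formula z = (x - mean) / std when the population standard deviation (`ddof=0`) is used.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for X: two columns on very different scales.
rng = np.random.default_rng(0)
X_demo = np.column_stack([
    rng.uniform(0, 13, size=200),    # an hours-per-day style feature
    rng.integers(0, 2, size=200),    # a 0/1 encoded feature
]).astype(float)

X_demo_scaled = StandardScaler().fit_transform(X_demo)

# Manual z-score with the population std (NumPy's default, ddof=0)
manual = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

print(X_demo_scaled.mean(axis=0).round(6))  # per-column means after scaling
print(X_demo_scaled.std(axis=0).round(6))   # per-column stds after scaling
print(np.allclose(X_demo_scaled, manual))
```

Note that pandas' `DataFrame.std()` defaults to the sample standard deviation (`ddof=1`), so checking a scaled DataFrame that way gives values slightly above 1.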

---

## Dimensionality Reduction: PCA

PCA finds directions of maximum variance in the data. If features are correlated, fewer components can explain most of the variance.

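From FP-6, we know study hours and sleep hours dominate stress prediction. PCA cannot confirm this directly, because it is unsupervised and never sees the stress labels; what it can show is whether the five features are correlated enough to be summarized by fewer components.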

<br>

In [5]:
pca_full = PCA(n_components=5)
pca_full.fit(X_scaled)

cumulative_variance = np.cumsum(pca_full.explained_variance_ratio_)
np.round(cumulative_variance, 2)

array([0.26, 0.47, 0.66, 0.84, 1.0])

Cumulative variance explained:
- PC1: 26%
- PC1-2: 47%
- PC1-3: 66%
- PC1-4: 84%
- PC1-5: 100%

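Four components reach only 84%; hitting the 90%+ target takes all 5, which is no reduction at all. The variance is spread almost evenly across the components, which suggests the features in this dataset are not highly correlated with each other. (This is a separate question from the FP-6 feature importance results, which measure how useful each feature is for predicting stress, not how correlated the features are.)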
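
The component count for a given variance target can be read off the cumulative array programmatically. The sketch below reuses the rounded values printed above; `components_needed` is a hypothetical helper name, not part of the notebook.

```python
import numpy as np

# Cumulative explained variance, copied from the rounded output above.
cumulative_variance = np.array([0.26, 0.47, 0.66, 0.84, 1.0])

def components_needed(cumulative, threshold):
    """Smallest number of leading components whose cumulative variance >= threshold."""
    # searchsorted returns the first index where cumulative[i] >= threshold
    return int(np.searchsorted(cumulative, threshold) + 1)

print(components_needed(cumulative_variance, 0.80))  # 4
print(components_needed(cumulative_variance, 0.90))  # 5
```

scikit-learn's `PCA` can also take a fraction such as `n_components=0.90` (documented for `svd_solver='full'`) and pick the count itself; the explicit version here just makes the threshold logic visible.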

We'll test both the scaled data and a PCA-reduced version (4 components, 84% of variance) to see which works better.

<br>

In [6]:
pca = PCA(n_components=4)
X_pca = pca.fit_transform(X_scaled)

---

## Parameter Tuning

Testing different hyperparameters to find optimal settings, now using scaled data for KNN.

<br>

In [7]:
# KNN tuning with scaled features
knn_results = []
for k in [3, 5, 7, 9, 11, 15]:
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X_scaled, y, cv=kfold)
    knn_results.append({'n_neighbors': k, 'mean': scores.mean(), 'std': scores.std()})

knn_df = pd.DataFrame(knn_results)
best_knn_k = knn_df.loc[knn_df['mean'].idxmax(), 'n_neighbors']
display(knn_df[['n_neighbors', 'mean']].rename(columns={'mean': 'Mean Accuracy'}).style.format({'Mean Accuracy': '{:.2%}'}))

n_neighbors,Mean Accuracy
3,90.25%
5,91.10%
7,91.55%
9,91.85%
11,92.15%
15,91.90%


In [8]:
# Decision Tree tuning (doesn't need scaling)
tree_results = []
for depth in [3, 4, 5, 6]:
    scores = cross_val_score(DecisionTreeClassifier(max_depth=depth, random_state=42), X, y, cv=kfold)
    tree_results.append({'max_depth': depth, 'mean': scores.mean(), 'std': scores.std()})

tree_df = pd.DataFrame(tree_results)
best_tree_depth = int(tree_df.loc[tree_df['mean'].idxmax(), 'max_depth'])
display(tree_df[['max_depth', 'mean']].rename(columns={'mean': 'Mean Accuracy'}).style.format({'Mean Accuracy': '{:.2%}'}))

max_depth,Mean Accuracy
3,100.00%
4,100.00%
5,100.00%
6,100.00%


Best KNN: n_neighbors=11. Best Decision Tree: max_depth=3 (every depth scores 100%, and `idxmax` returns the first maximum, so the shallowest tree is selected).
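
One caveat with tuning on `X_scaled`: the scaler was fit on all rows before cross-validation, so each validation fold has already influenced the scaling statistics. A common way to avoid this is to put the scaler and the model in a `Pipeline` and tune that instead. The sketch below is an assumption about how this could look, run on synthetic data from `make_classification` rather than the notebook's `X` and `y`.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for (X, y); three classes, five features.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=5, n_informative=3, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, random_state=42,
)

pipe = Pipeline([
    ("scale", StandardScaler()),       # refit on each training fold only
    ("knn", KNeighborsClassifier()),
])

search = GridSearchCV(
    pipe,
    param_grid={"knn__n_neighbors": [3, 5, 7, 9, 11, 15]},  # same grid as above
    cv=KFold(n_splits=5, shuffle=True, random_state=42),
    scoring="accuracy",
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(f"Best mean CV accuracy: {search.best_score_:.2%}")
```

With only five features and 2000 rows the leakage effect is probably small, but the pipeline form keeps the evaluation honest at no extra cost.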

---

## Model Comparison: Effect of Preprocessing

Comparing KNN performance on raw vs preprocessed data.

<br>

In [9]:
# Use the tuned k for all three runs so that only the preprocessing differs
k = int(best_knn_k)
knn_raw = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=kfold)
knn_scaled = cross_val_score(KNeighborsClassifier(n_neighbors=k), X_scaled, y, cv=kfold)
knn_pca = cross_val_score(KNeighborsClassifier(n_neighbors=k), X_pca, y, cv=kfold)

comparison = pd.DataFrame({
    'Data': ['Raw (FP-6)', 'Scaled (FP-7)', 'PCA (4 comp)'],
    'Mean Accuracy': [f"{s.mean()*100:.2f}%" for s in [knn_raw, knn_scaled, knn_pca]],
    'Std': [f"±{s.std()*100:.2f}%" for s in [knn_raw, knn_scaled, knn_pca]]
})
display(comparison)

Data,Mean Accuracy,Std
Raw (FP-6),91.80%,±1.47%
Scaled (FP-7),92.15%,±1.32%
PCA (4 comp),91.95%,±1.28%


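Scaling raises mean KNN accuracy by ~0.35 percentage points (91.80% → 92.15%) and slightly lowers the fold-to-fold standard deviation (±1.47% → ±1.32%). The gain is well within one standard deviation, so it should be read as modest at best. One likely reason is that feature ranges in this dataset are already fairly similar (mostly 0-10); in datasets with more varied ranges, scaling would typically have a larger effect.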
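
To put the size of these differences in context, the gains can be set next to the fold-to-fold spread. This small sketch only reuses the numbers from the table above; the `results` and `gains` names are illustrative.

```python
# Mean accuracy and fold std (both in %), copied from the comparison table.
results = {
    "raw": (91.80, 1.47),
    "scaled": (92.15, 1.32),
    "pca": (91.95, 1.28),
}

raw_mean, raw_std = results["raw"]

# Gain over the raw baseline, in percentage points
gains = {
    name: round(mean - raw_mean, 2)
    for name, (mean, _) in results.items()
    if name != "raw"
}

for name, gain in gains.items():
    std = results[name][1]
    print(f"{name}: {gain:+.2f} pp vs raw (fold std about {std:.2f} pp)")
```

Both gains are several times smaller than the spread between folds, so a different `random_state` in `KFold` could plausibly reorder the three rows.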

---

## Feature Importance (Decision Tree)

In [10]:
tree_model = DecisionTreeClassifier(max_depth=best_tree_depth, random_state=42)
tree_model.fit(X, y)

importance_df = pd.DataFrame({
    'Feature': X.columns,
    'Importance': tree_model.feature_importances_
}).sort_values('Importance', ascending=False)

In [11]:
def plot_machine1():
    plt.figure(figsize=(10, 6))
    plt.barh(importance_df['Feature'], importance_df['Importance'], color='steelblue')
    plt.xlabel('Importance Score', fontsize=12)
    plt.ylabel('Lifestyle Factor', fontsize=12)
    plt.title('Feature Importance: Which Lifestyle Factors Predict Stress?', fontsize=14, fontweight='bold')
    plt.gca().invert_yaxis()
    for i, (feature, importance) in enumerate(zip(importance_df['Feature'], importance_df['Importance'])):
        plt.text(importance + 0.01, i, f'{importance*100:.1f}%', va='center', fontsize=10)
    plt.tight_layout()
    plt.show()

plot_machine1()

Study hours (72.4%) and sleep hours (27.6%) are the only features the Decision Tree uses; it ignores the other three through its own feature selection. PCA, by contrast, mixes all five features into every component, so it cannot drop the irrelevant ones for KNN. That is consistent with PCA not improving KNN accuracy.
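
The way a tree concentrates importance on the features that actually determine the label can be reproduced on synthetic data. In the sketch below, the hypothetical label `y_demo` is built from thresholds on the first two columns only, and the last two columns are pure noise; none of this is the notebook's data.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Four hypothetical features in the range 0-10
X_demo = rng.uniform(0, 10, size=(500, 4))

# Label depends only on columns 0 and 1 (how many thresholds are exceeded)
y_demo = (X_demo[:, 0] > 5).astype(int) + (X_demo[:, 1] > 5).astype(int)

tree = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X_demo, y_demo)
importances = tree.feature_importances_

# Importances are normalized to sum to 1; unused features get exactly 0
for i, value in enumerate(importances):
    print(f"feature {i}: {value:.3f}")
```

The two noise columns should receive little or no importance, which mirrors the pattern reported above for social hours, activity hours, and Gender.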

---

## Confusion Matrices

In [12]:
# using scaled features for KNN now
y_pred_knn = cross_val_predict(KNeighborsClassifier(n_neighbors=int(best_knn_k)), X_scaled, y, cv=kfold)
y_pred_tree = cross_val_predict(DecisionTreeClassifier(max_depth=best_tree_depth, random_state=42), X, y, cv=kfold)

cm_knn = confusion_matrix(y, y_pred_knn)
cm_tree = confusion_matrix(y, y_pred_tree)

knn_accuracy = cm_knn.trace() / cm_knn.sum() * 100
tree_accuracy = cm_tree.trace() / cm_tree.sum() * 100

In [13]:
def plot_machine2():
    fig, axes = plt.subplots(1, 2, figsize=(14, 5))
    
    sns.heatmap(cm_knn, annot=True, fmt='d', cmap='Blues', ax=axes[0],
                xticklabels=['Low', 'Moderate', 'High'],
                yticklabels=['Low', 'Moderate', 'High'])
    axes[0].set_title(f'KNN (Scaled) - Accuracy: {knn_accuracy:.1f}%', fontsize=13, fontweight='bold')
    axes[0].set_xlabel('Predicted')
    axes[0].set_ylabel('Actual')
    
    sns.heatmap(cm_tree, annot=True, fmt='d', cmap='Greens', ax=axes[1],
                xticklabels=['Low', 'Moderate', 'High'],
                yticklabels=['Low', 'Moderate', 'High'])
    axes[1].set_title(f'Decision Tree - Accuracy: {tree_accuracy:.1f}%', fontsize=13, fontweight='bold')
    axes[1].set_xlabel('Predicted')
    axes[1].set_ylabel('Actual')
    
    plt.tight_layout()
    plt.show()

plot_machine2()

**KNN with scaling:** ~92% accuracy, errors mostly between adjacent stress levels.

**Decision Tree:** 100% accuracy. The dataset has clear thresholds that perfectly separate the stress categories, and a tree's axis-aligned splits can match such thresholds exactly.
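
Statements such as "errors mostly between adjacent stress levels" can be quantified from the confusion matrix itself. The sketch below uses a hypothetical 3x3 matrix `cm` with made-up counts (not the notebook's results) to show the arithmetic: accuracy is the trace over the total, and adjacent-class errors sit on the first off-diagonals.

```python
import numpy as np

# Hypothetical confusion matrix (rows = actual, columns = predicted),
# ordered Low, Moderate, High. Counts are illustrative only.
cm = np.array([
    [40,  8,  2],
    [ 6, 38,  6],
    [ 1,  9, 90],
])

accuracy = cm.trace() / cm.sum()                   # correct / total
recall_per_class = cm.diagonal() / cm.sum(axis=1)  # correct / actual count per class

errors = cm.sum() - cm.trace()
# Low<->Moderate and Moderate<->High confusions are on the +1 and -1 diagonals
adjacent_errors = np.trace(cm, offset=1) + np.trace(cm, offset=-1)
adjacent_share = adjacent_errors / errors

print(f"Accuracy: {accuracy:.1%}")
print("Recall per class:", recall_per_class.round(2))
print(f"Share of errors between adjacent levels: {adjacent_share:.1%}")
```

Applied to `cm_knn`, the same lines would give a number to back the qualitative reading of the heatmap.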

---

## Summary: FP-6 vs FP-7

| Aspect | FP-6 | FP-7 |
|--------|------|------|
| Data loading | Direct CSV | Import from parse_data.ipynb |
| Preprocessing | None | StandardScaler |
| Dimensionality reduction | None | PCA tested |
| KNN accuracy | ~91.8% | ~92.2% |

The preprocessing improvements are modest for this dataset because feature ranges were already similar. The main benefit is demonstrating proper ML workflow as taught in Lessons 13-14.