# Exploring Machine Learning on Sensor Data: Engineer Lab (v2.3)

### 🧭 Welcome, engineer-explorer
Today’s journey will take you from the raw pulse of sensor data to the structured intelligence of a working machine learning system. You’ll clean, shape, and understand data, extract meaningful patterns, and teach algorithms to recognize behaviors, all while reflecting on the design choices that make ML both powerful and fragile.

By the end, you’ll not only have built a working classifier but also gained a deeper sense of what drives modern AI: the same foundations that underpin systems like ChatGPT.

Ready? Let’s build something that learns.

## ⚙️ Before you begin: running this notebook

This notebook is designed to work out-of-the-box on a typical developer laptop.  
You can run it **locally in VS Code** (recommended for full control) or **online in Google Colab** (no installation required).

---

### 🧩 Option 1: Run locally in **Visual Studio Code**

This gives you the best performance and flexibility if you already use VS Code.

#### Step 1: Install prerequisites
Make sure you have:
- [Visual Studio Code](https://code.visualstudio.com/)
- The **Python** and **Jupyter** extensions (search “Jupyter” in the Extensions view)
- [Python ≥ 3.10](https://www.python.org/downloads/) installed and available in your PATH  

✅ **Linux users:**  
Install the `venv` module (needed to create virtual environments):
```bash
sudo apt update
sudo apt install python3-venv
```

Verify Python:
```bash
python3 --version
```

#### Step 2: Prepare your project folder
Create a new folder and place these files inside:
- `Exploring_ML_Sensor_Data_Interactive_v2_4.ipynb` (this notebook)
- `requirements.txt`
- your dataset folder or `data.zip`

#### Step 3: Create and activate a virtual environment
Open a terminal inside VS Code (``Ctrl + ` ``) and run:
```bash
python3 -m venv .venv
```

Activate the environment:

- **Windows (PowerShell):**
  ```powershell
  .venv\Scripts\Activate
  ```
- **macOS / Linux:**
  ```bash
  source .venv/bin/activate
  ```

If the activation script is missing, install `python3-venv` as shown above and recreate the environment.

#### Step 4: Install dependencies
```bash
pip install --upgrade pip
pip install -r requirements.txt
```

#### Step 5: Tell VS Code to use this environment
Press **Ctrl + Shift + P** → “Python: Select Interpreter” → choose the one pointing to `.venv`.

#### Step 6: Open and run the notebook
Open the `.ipynb` file. You’ll see **Run All** or **▶ Run Cell** buttons above each cell.  
Run the first cell; it should print your Python, OS, NumPy, and Pandas versions.

---

### ☁️ Option 2: Run online in **Google Colab**

This requires no setup, which is perfect if you just want to explore.

1. Visit [Google Colab](https://colab.research.google.com/).  
2. Select *File → Upload notebook…* and open this file.  
3. Upload your data archive (e.g. `data.zip`) in the *Files* sidebar.  
4. In **Section 1**, set:
   ```python
   CONFIG["DATA_ROOT"] = "data.zip"
   ```
5. Run all cells. Colab already includes the required libraries.

---

**Tip:** Keep your notebook and dataset together.  
Relative paths will resolve automatically, making setup faster and troubleshooting easier.

## 0) Environment check: building on solid ground

Before we dive into code, let’s make sure your environment is ready and connected to the notebook.  
Machine learning depends on **reproducibility**, and even small version mismatches between libraries can produce different results.  
This step ensures that you’re using the correct Python kernel and that your tools are aligned.
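
One way to make version mismatches visible is to compare the installed packages against a pinned list. The sketch below uses the standard-library `importlib.metadata`; the helper name `check_versions` and the pins in `PINNED` are illustrative assumptions, and in practice the pins would come from your own `requirements.txt`.

```python
from importlib import metadata

# Illustrative pins only; replace with the versions your project expects.
PINNED = {"numpy": "2.3.4", "pandas": "2.3.3"}

def check_versions(pinned: dict[str, str]) -> dict[str, str]:
    """Return a status per package: 'ok', 'missing', or 'mismatch (installed X)'."""
    report = {}
    for name, wanted in pinned.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "missing"
            continue
        report[name] = "ok" if installed == wanted else f"mismatch (installed {installed})"
    return report

for pkg, status in check_versions(PINNED).items():
    print(f"{pkg}: {status}")
```

A check like this could run as the first cell of a notebook or as an early CI step, so that drift is reported before any results are compared.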

---

### ✅ Selecting the correct kernel (VS Code)

If this is your first time running a cell, VS Code will ask you to **“Select Kernel.”**  
When that popup appears:

1. Click **Python Environments…**
2. Wait a few seconds for VS Code to list interpreters.
3. Choose the one that ends with your project’s virtual environment, e.g.  
   ```
   .venv/bin/python
   ```
   or  
   ```
   Python 3.12 ('.venv': venv)
   ```
4. After selection, the top-right corner of VS Code should display something like:  
   ```
   Python 3.12 ('.venv': venv)
   ```
   If it takes a moment, VS Code may be installing the `ipykernel` package behind the scenes.

💡 **Tip:** If no environments appear, run this once in your activated terminal and retry:
```bash
pip install ipykernel
```

You can confirm everything is connected by running the cell below; it will print your Python, OS, and key library versions.

---

**Reflect & discuss**
- How would you ensure consistent environments across developers or CI pipelines?
- Why might reproducibility matter even for quick experiments?

In [2]:
import sys, platform
import numpy as np, pandas as pd, matplotlib.pyplot as plt

print("Python:", sys.version.split()[0])
print("OS:", platform.platform())
print("NumPy:", np.__version__)
print("Pandas:", pd.__version__)

# Optional: verify you're in the intended environment
import os
print("Python executable:", sys.executable)

Python: 3.12.3
OS: Linux-6.14.0-1015-oem-x86_64-with-glibc2.39
NumPy: 2.3.4
Pandas: 2.3.3
Python executable: /home/mattiaah/Github/machine_learning_exercise/.venv/bin/python


## 1) Configuration: your control panel
Every data science workflow starts with parameters: the dials that define what, where, and how we process data. By collecting these into one `CONFIG` dictionary, we make our experiments reproducible and tweakable. You can change any value and re-run downstream sections to explore cause and effect.

**Knobs to tweak**
- `WINDOW_SAMPLES` and `WINDOW_STRIDE`: define time-window size and overlap.
- `MODELS`: pick which algorithms to benchmark.
- `SORT_BY_TIMESTAMP`: should usually stay `True` for time-series data.

**Reflect & discuss**
- How might overlapping windows affect independence of samples?
- What trade-off exists between model performance and reproducibility?
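
To get a feel for the first question, it helps to count windows. The sketch below assumes windows start at offsets `0, stride, 2*stride, ...` and that a trailing partial window is discarded, which is how the windowing cell later in this notebook iterates. `count_windows` and `overlap_fraction` are hypothetical helper names, and the recording length is made up.

```python
def count_windows(n_samples: int, win: int, stride: int) -> int:
    """Number of full windows of length `win` taken every `stride` samples."""
    if n_samples < win:
        return 0
    return (n_samples - win) // stride + 1

def overlap_fraction(win: int, stride: int) -> float:
    """Fraction of each window shared with the next one (0.0 means no overlap)."""
    return max(0.0, 1.0 - stride / win)

n = 1000  # hypothetical recording length in samples
for stride in (128, 64, 32):
    print(f"stride={stride}: windows={count_windows(n, 128, stride)}, "
          f"overlap={overlap_fraction(128, stride):.0%}")
```

Halving the stride roughly doubles the number of windows, but neighbouring windows then share half of their samples, so they are no longer independent observations.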

In [3]:
from pathlib import Path
CONFIG = {
    "DATA_ROOT": Path("data/"),  # path to folder or .zip (e.g., "data.zip")
    "INCLUDE_GLOBS": ["**/*.csv","**/*.parquet","**/*.jsonl","**/*.json"],
    "SORT_BY_TIMESTAMP": True,
    "WINDOW_SAMPLES": 128,
    "WINDOW_STRIDE": 128,
    "N_SPLITS": 5,
    "RANDOM_SEED": 42,
    "MODELS": ["knn","logreg","linearsvm"],
}
import numpy as np
np.random.seed(CONFIG["RANDOM_SEED"])

## 2) Data ingestion: connecting to the real world
Machine learning begins with the raw, messy world of data. Here we unify files into a single table, attach their origins (`__source_path`), and prepare to trace results back to where they came from. This isn’t glamorous, but it’s where most ML engineering time is spent.

**Knobs to tweak**
- `CONFIG['DATA_ROOT']`: folder or zip archive containing your dataset.
- `CONFIG['INCLUDE_GLOBS']`: file extensions or wildcards to include.

**Reflect & discuss**
- Why is keeping track of source paths useful when debugging ML models?
- How does this compare to data lineage tracking in production systems?

In [4]:
import zipfile, tempfile
from pathlib import Path
import pandas as pd

def ensure_loaded_path(data_root):
    if data_root is None:
        return None, None
    p = Path(data_root)
    if p.suffix.lower()==".zip" and p.is_file():
        tmp = Path(tempfile.mkdtemp(prefix="data_zip_"))
        with zipfile.ZipFile(p,"r") as z: z.extractall(tmp)
        return tmp, "zip"
    elif p.exists() and p.is_dir():
        return p, "dir"
    else:
        print("[WARN] DATA_ROOT not found:", data_root); return None, None

def list_files(root: Path, include_globs):
    files=[]
    for pat in include_globs:
        files.extend(root.glob(pat))
    seen=set(); out=[]
    for f in files:
        if f.is_file() and f not in seen:
            out.append(f); seen.add(f)
    return out

DATA_PATH, DATA_KIND = ensure_loaded_path(CONFIG["DATA_ROOT"])
if DATA_PATH is None:
    print("No external data path provided; set CONFIG['DATA_ROOT'] to a folder or .zip and re-run.")

In [5]:
def load_csv_robust(fp: Path):
    import pandas as pd
    tries=[dict(),dict(sep=";"),dict(sep="\t"),dict(engine="python"),
           dict(engine="python",sep=";"),dict(engine="python",sep="\t")]
    last_e=None
    for kw in tries:
        try: return pd.read_csv(fp, **kw)
        except Exception as e: last_e=e
    raise last_e

def load_tabular_file(fp: Path):
    import pandas as pd
    suf=fp.suffix.lower()
    if suf==".csv": return load_csv_robust(fp)
    if suf==".parquet":
        try: return pd.read_parquet(fp)
        except Exception as e: print("[WARN] parquet failed", fp, e); return None
    if suf in [".json",".jsonl"]:
        for lines in [True, False]:
            try: return pd.read_json(fp, lines=lines)
            except Exception: pass
        return None
    return None

if DATA_PATH is not None:
    files = list_files(DATA_PATH, CONFIG["INCLUDE_GLOBS"])
    dfs=[]
    for f in files:
        tdf = load_tabular_file(f)
        if tdf is None or len(tdf)==0: continue
        tdf = tdf.copy()
        tdf["__source_path"] = str(f.relative_to(DATA_PATH))
        dfs.append(tdf)
    assert dfs, "No readable files. Verify formats."
    raw = pd.concat(dfs, ignore_index=True)
    display(raw.head())
    print("Raw shape:", raw.shape, "from", len(dfs), "files")

Unnamed: 0.1,Unnamed: 0,ax,ay,az,__source_path,mx,my,mz,Timestamp,X,Y,Z,gx,gy,gz,Milliseconds
0,501.0,1.501496,-1.19701,9.682856,cleaned/acc_mag/lie_acc1.csv,,,,,,,,,,,
1,502.0,1.508675,-1.244866,9.699606,cleaned/acc_mag/lie_acc1.csv,,,,,,,,,,,
2,503.0,1.487139,-1.225723,9.699606,cleaned/acc_mag/lie_acc1.csv,,,,,,,,,,,
3,504.0,1.537389,-1.259223,9.661321,cleaned/acc_mag/lie_acc1.csv,,,,,,,,,,,
4,505.0,1.54696,-1.283151,9.745069,cleaned/acc_mag/lie_acc1.csv,,,,,,,,,,,


Raw shape: (377069, 16) from 204 files


## 3) Label & session derivation: giving meaning to numbers
Our sensors record motion, but without context they’re just numbers. Here we infer *what* each recording represents (`activity`) and *which session* it belongs to. This is how we transform raw measurements into supervised learning examples.

We’ll detect activity names directly from file paths (like `walk_acc6.csv` → `walk`) and group related files using session numbers. Finally, we’ll sort each file chronologically by its `Timestamp`, preserving the real-world flow of data.

**Reflect & discuss**
- Why does chronological sorting matter for time-series problems?
- If the same user appears in multiple sessions, what could that mean for model generalization?
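
The path-based labeling described above can be tried on a few file names before running it on the full dataset. This is a minimal sketch: the token list and the "last number in the file stem" rule mirror the cell that follows, the function names are shortened for illustration, and every example path other than `walk_acc6.csv` is made up.

```python
import re
from pathlib import Path

# Longer tokens come first so "sit_to_stand" is not mistaken for "sit".
TOKENS = ["sit_to_stand", "stand_to_sit", "jump", "walk", "run", "sit", "stand"]

def activity_from_path(path: str):
    lowered = path.lower()
    for token in TOKENS:
        if token in lowered:
            return token
    return None  # no known activity name in the path

def session_from_path(path: str):
    numbers = re.findall(r"\d+", Path(path).stem)
    if numbers:
        return f"session_{numbers[-1]}"
    return Path(path).parent.name or "root"  # fall back to the folder name

# Hypothetical example paths (only walk_acc6.csv appears in the text above).
for p in ["walk_acc6.csv", "cleaned/sit_to_stand_gyro2.csv", "notes/readme.csv"]:
    print(p, "->", activity_from_path(p), "|", session_from_path(p))
```

Because matching is by substring over the whole path, a folder whose name happens to contain a token (for example `runs/`) would also be labeled, so it is worth spot-checking the derived labels.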

In [6]:
import re
from pathlib import Path
import pandas as pd

def derive_activity_from_path(p: str):
    s = p.lower()
    for token in ["sit_to_stand","stand_to_sit","jump","walk","run","sit","stand"]:
        if token in s: return token
    return None

def derive_session_from_path(p: str):
    stem = Path(p).stem
    m = re.findall(r"(\d+)", stem)
    if m: return f"session_{m[-1]}"
    return Path(p).parent.name or "root"

df = raw.copy()
df["activity"] = df["__source_path"].apply(derive_activity_from_path)
df["group_id"] = df["__source_path"].apply(derive_session_from_path)
before=len(df); df=df.dropna(subset=["activity"]); print("Dropped rows without activity:", before-len(df))

if "Unnamed: 0" in df.columns:
    df = df.drop(columns=["Unnamed: 0"])

if CONFIG["SORT_BY_TIMESTAMP"] and "Timestamp" in df.columns:
    def robust_parse(ts):
        try: return pd.to_datetime(ts, errors="coerce", dayfirst=True)
        except Exception: return pd.to_datetime(ts, errors="coerce")
    df["__ts"] = robust_parse(df["Timestamp"])
    if df["__ts"].isna().any():
        df["__ts_fallback"] = df.groupby("__source_path").cumcount()
        df["__ts"] = df["__ts"].fillna(pd.to_datetime(df["__ts_fallback"], unit="s"))
        df = df.drop(columns=["__ts_fallback"])
    df = df.sort_values(["__source_path","__ts"]).reset_index(drop=True)
    print("Chronological sorting applied by 'Timestamp'.")
else:
    print("No 'Timestamp' column found or sorting disabled.")

print("Activities:", sorted(df["activity"].unique().tolist()))
print("Sessions (groups):", df["group_id"].nunique())

Dropped rows without activity: 110846
Chronological sorting applied by 'Timestamp'.
Activities: ['jump', 'run', 'sit', 'sit_to_stand', 'stand', 'stand_to_sit', 'walk']
Sessions (groups): 6


## 4) Sanity summaries: knowing your battlefield
Before extracting features, we take a strategic pause to inspect the dataset’s balance and scale. A single glance at class and session counts can save hours of confusion later.

**Reflect & discuss**
- Which classes dominate? Which are rare?
- How might this imbalance skew accuracy compared to macro-F1?
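
The second question can be made concrete with a toy case. The sketch below uses made-up labels (90 windows of one class, 10 of another) and a "model" that always predicts the majority class. The F1 computation is hand-rolled here to expose the definition; it is not the library routine used later in the notebook.

```python
def f1_for_class(y_true, y_pred, cls):
    """F1 = 2*TP / (2*TP + FP + FN); 0.0 if the class occurs in neither list."""
    tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
    fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
    fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1, so rare classes count as much as common ones."""
    classes = sorted(set(y_true))
    return sum(f1_for_class(y_true, y_pred, c) for c in classes) / len(classes)

# Made-up, imbalanced labels: 90 "jump" windows and 10 "sit" windows.
y_true = ["jump"] * 90 + ["sit"] * 10
y_pred = ["jump"] * 100  # always predict the majority class

print("accuracy:", accuracy(y_true, y_pred))
print("macro-F1:", round(macro_f1(y_true, y_pred), 3))
```

Accuracy looks respectable because the majority class dominates the count, while macro-F1 drops sharply because the rare class is never predicted at all.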

In [7]:
print("Rows:", len(df))
print("Per-class counts (top 10):")
print(df["activity"].value_counts().head(10))
print("\nPer-session row counts (top 10):")
print(df["group_id"].value_counts().head(10))

Rows: 266223
Per-class counts (top 10):
activity
jump            105132
run              85623
walk             50557
sit              10453
stand            10190
sit_to_stand      2388
stand_to_sit      1880
Name: count, dtype: int64

Per-session row counts (top 10):
group_id
session_2    64729
session_1    63672
session_3    40316
session_6    35526
session_4    31252
session_5    30728
Name: count, dtype: int64


## 5) Windowed features: letting structure emerge

We convert long sensor sequences into fixed-length windows and describe each window with simple statistics (mean, standard deviation, and peak-to-peak range). Different files may contain different sensors (e.g., accelerometer only vs. accelerometer+gyroscope), so we automatically keep the signals that are present in most rows and compute the summaries robustly.
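
As a minimal sketch of what "describing a window" means, the snippet below computes the three statistics for one tiny, made-up window with two signals. It assumes NaN-aware NumPy reductions, which is also what the feature-extraction cell uses; the numbers are illustrative only.

```python
import numpy as np

# One made-up window: 3 samples (rows) x 2 signals (columns); one value is missing.
window = np.array([
    [1.0, 2.0],
    [3.0, 4.0],
    [5.0, np.nan],
])

mean = np.nanmean(window, axis=0)                             # per-signal average
std = np.nanstd(window, axis=0, ddof=1)                       # sample standard deviation
ptp = np.nanmax(window, axis=0) - np.nanmin(window, axis=0)   # peak-to-peak range

# Concatenating the statistics gives one feature row for this window.
features = np.concatenate([mean, std, ptp])
print(features)
```

Each window therefore collapses to a fixed-length vector of three numbers per signal, regardless of how many samples the window contains.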

**Knobs to tweak**
- `WINDOW_SAMPLES`, `WINDOW_STRIDE`: temporal context and overlap
- `MIN_PRESENCE`: minimum fraction of non-missing values a column must have to be used as a signal

**Reflect & discuss**
- How does changing `WINDOW_SAMPLES` affect separability of activities?
- What happens to metrics if you lower `MIN_PRESENCE` and include sparser sensors?
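
`MIN_PRESENCE` is measured across all rows, so it interacts with how files are combined: if each file carries only one sensor family, every sensor column is missing in the rows that came from the other files. The sketch below illustrates this with two made-up frames (`acc` and `gyro` and their values are hypothetical); it is not the notebook's data.

```python
import pandas as pd

# Two made-up recordings that carry different sensors.
acc = pd.DataFrame({"ax": [0.1, 0.2, 0.3], "ay": [1.0, 1.1, 1.2]})
gyro = pd.DataFrame({"gx": [5.0, 5.1, 5.2], "gy": [7.0, 7.1, 7.2]})
acc["__source_path"] = "walk_acc1.csv"
gyro["__source_path"] = "walk_gyro1.csv"

combined = pd.concat([acc, gyro], ignore_index=True)
present = combined.drop(columns="__source_path").notna()

# Presence over ALL rows: each sensor column is filled in only half of them.
overall = present.mean()
print(overall)

# Presence per source file: columns are complete in the files that contain them.
per_file = present.groupby(combined["__source_path"]).mean()
print(per_file)

MIN_PRESENCE = 0.85
kept = [c for c in overall.index if overall[c] >= MIN_PRESENCE]
print("Columns kept at 85%:", kept)
```

If a dataset behaves like this, lowering `MIN_PRESENCE` or building features separately per sensor family are two options worth exploring.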

In [9]:
import numpy as np, pandas as pd

# --- Config knobs for feature engineering ---
WINDOW_SAMPLES = CONFIG["WINDOW_SAMPLES"]
WINDOW_STRIDE  = CONFIG["WINDOW_STRIDE"]
MIN_PRESENCE   = 0.85  # keep columns present (non-missing) in at least 85% of rows

# --- Identify candidate sensor columns (exclude meta) ---
meta_cols = {"__source_path","group_id","activity","Timestamp","__ts"}
candidates = [c for c in df.columns if c not in meta_cols]

# Coerce candidates to numeric where possible (robust against parsing quirks)
def coerce_numeric(series: pd.Series) -> pd.Series:
    if pd.api.types.is_numeric_dtype(series):
        return series
    try:
        return pd.to_numeric(series, errors="coerce")
    except Exception:
        return series  # leave as-is; will be filtered out if not numeric

df_clean = df.copy()
for c in candidates:
    df_clean[c] = coerce_numeric(df_clean[c])

# Keep only numeric signals with sufficient presence across the dataset
numeric_candidates = [c for c in candidates if pd.api.types.is_numeric_dtype(df_clean[c])]
presence = df_clean[numeric_candidates].notna().mean().sort_values(ascending=False)
signal_cols = [c for c in presence.index if presence[c] >= MIN_PRESENCE]

print(f"Using {len(signal_cols)} signal columns (presence ≥ {MIN_PRESENCE:.0%}):", signal_cols[:12], "…")
dropped = [c for c in numeric_candidates if c not in signal_cols]
if dropped:
    print("Dropped (too sparse or non-numeric):", dropped[:12], "…")

# --- Window feature extraction ---
def make_windows(df_in: pd.DataFrame, signal_cols, label_col, group_col, win, stride):
    feats, labels, groups, sources = [], [], [], []
    lab_cat = pd.Categorical(df_in[label_col])

    for g, gdf in df_in.groupby(group_col, sort=False):
        for src, sdf in gdf.groupby("__source_path", sort=False):
            sdf = sdf.reset_index(drop=True)
            if len(sdf) < win or len(signal_cols) == 0:
                continue

            X = sdf[signal_cols].to_numpy(dtype=float)  # may include some NaNs
            y_codes = pd.Categorical(sdf[label_col], categories=lab_cat.categories).codes

            for start in range(0, len(sdf) - win + 1, stride):
                stop = start + win
                seg = X[start:stop]

                # Majority label in the window
                lab = pd.Series(y_codes[start:stop]).mode().iloc[0]

                # Summary statistics (nan-aware)
                mu  = np.nanmean(seg, axis=0)
                sd  = np.nanstd(seg, axis=0, ddof=1)
                ptp = np.nanmax(seg, axis=0) - np.nanmin(seg, axis=0)

                row = {}
                for c, v in zip(signal_cols, mu):  row[f"{c}_mean"] = v
                for c, v in zip(signal_cols, sd):  row[f"{c}_std"]  = v
                for c, v in zip(signal_cols, ptp): row[f"{c}_ptp"]  = v

                feats.append(row); labels.append(lab); groups.append(g); sources.append(src)

    Xf = pd.DataFrame(feats)
    y = np.asarray(labels)
    groups = np.asarray(groups)
    meta = pd.DataFrame({"group_id": groups, "__source_path": sources})
    return Xf, y, groups, meta, list(lab_cat.categories)

Xf, y, groups, meta, label_names = make_windows(
    df_clean, signal_cols, "activity", "group_id", WINDOW_SAMPLES, WINDOW_STRIDE
)

print("Feature table:", Xf.shape, "classes:", label_names, "groups:", len(set(groups)))
display(Xf.head())


Using 0 signal columns (presence ≥ 85%): [] …
Dropped (too sparse or non-numeric): ['ax', 'ay', 'az', 'mx', 'my', 'mz', 'X', 'Y', 'Z', 'gx', 'gy', 'gz'] …
Feature table: (0, 0) classes: ['jump', 'run', 'sit', 'sit_to_stand', 'stand', 'stand_to_sit', 'walk'] groups: 0


## 6) Evaluation protocol: testing without cheating

Before scoring models, we need to understand what “fair testing” means.  
In machine learning, *leakage* happens when information from the test set slips into training, giving an illusion of high accuracy.  
For time-series or session-based data, random row splits almost always leak, because adjacent samples are correlated.

`GroupKFold` solves this by ensuring that **entire groups** (sessions, subjects, devices, etc.) are kept together: either in training or in testing, never both.  
This mimics how we’d deploy a model: to data from a *new session* it has never seen before.

**Reflect & discuss**
- Why might random row-based splits exaggerate performance in your dataset?
- If each session came from a different user, what would `GroupKFold` protect you from?
- How is this idea similar to separating staging and production data?
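
To see the difference in numbers, the sketch below lists which groups end up on both sides of each split. It uses made-up data (12 samples in 3 contiguous sessions) and compares scikit-learn's plain `KFold` with `GroupKFold`; the helper `shared_groups` is a hypothetical name introduced here.

```python
import numpy as np
from sklearn.model_selection import GroupKFold, KFold

X = np.arange(12).reshape(-1, 1)  # made-up features
groups = np.repeat(["session_1", "session_2", "session_3"], 4)

def shared_groups(splitter, X, groups):
    """For each fold, list the groups that appear in BOTH train and test."""
    out = []
    for train_idx, test_idx in splitter.split(X, groups=groups):
        overlap = set(groups[train_idx]) & set(groups[test_idx])
        out.append(sorted(str(g) for g in overlap))
    return out

# KFold ignores `groups`, so sessions get cut in the middle.
kfold_overlap = shared_groups(KFold(n_splits=4), X, groups)
# GroupKFold keeps every session entirely on one side.
group_overlap = shared_groups(GroupKFold(n_splits=3), X, groups)

print("KFold:     ", kfold_overlap)
print("GroupKFold:", group_overlap)
```

Any non-empty list in the `KFold` row is a session the model has effectively already seen at test time.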

In [None]:
from sklearn.model_selection import GroupKFold
import numpy as np
import pandas as pd

# Toy example: 12 samples, 3 sessions (groups)
X_demo = np.arange(12).reshape(-1, 1)
y_demo = np.array(list("AAABBBCCCDD?"))[:12]   # arbitrary labels
groups_demo = np.repeat(["session_1", "session_2", "session_3"], 4)

gkf = GroupKFold(n_splits=3)
folds = []
for fold, (train_idx, test_idx) in enumerate(gkf.split(X_demo, y_demo, groups_demo), 1):
    folds.append({
        "fold": fold,
        "train_groups": np.unique(groups_demo[train_idx]).tolist(),
        "test_groups": np.unique(groups_demo[test_idx]).tolist()
    })
pd.DataFrame(folds)

## 7) Baseline algorithms & leaderboard: friendly competition
We’ll compare classic algorithms using a common pipeline (Impute → Scale → Model). The leaderboard ranks models by macro-F1 (fair to all classes) and accuracy. Simple models like KNN and Logistic Regression often surprise us when well-prepared data is fed to them.

**Knobs to tweak**
- Add or remove models in `CONFIG['MODELS']`.
- Tune hyperparameters (`n_neighbors`, `C`, etc.) for insight.

**Reflect & discuss**
- What trade-offs do you notice between speed, accuracy, and interpretability?
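
One reason to wrap Impute → Scale → Model in a single scikit-learn `Pipeline` is that the preprocessing statistics are then learned from the training rows only. The sketch below uses made-up numbers and leaves out the final classifier so the effect stays visible; `prep` is a name introduced for this illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Made-up data: the test row is an outlier that must not influence preprocessing.
X_train = np.array([[1.0], [3.0], [np.nan]])
X_test = np.array([[100.0]])

prep = Pipeline([
    ("impute", SimpleImputer(strategy="median")),  # NaN -> median of the training values
    ("scaler", StandardScaler()),                  # centre and scale with training stats
])
prep.fit(X_train)  # statistics are computed from X_train only

median = prep.named_steps["impute"].statistics_
mean = prep.named_steps["scaler"].mean_
scaled_test = prep.transform(X_test)

print("imputer median:", median)
print("scaler mean:   ", mean)
print("scaled test row:", scaled_test)
```

When such a pipeline is cloned and refit inside each cross-validation fold, the held-out session never contributes to the median or the scaling, which keeps the evaluation consistent with the leakage rules from the previous section.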

In [None]:
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.model_selection import GroupKFold
from sklearn.metrics import accuracy_score, f1_score
from sklearn.base import clone
import numpy as np, pandas as pd, time

def build_model(name: str):
    if name=="knn":       clf=KNeighborsClassifier(n_neighbors=5)
    elif name=="logreg":  clf=LogisticRegression(max_iter=2000)
    elif name=="linearsvm": clf=LinearSVC()
    else: raise ValueError(name)
    return Pipeline([("impute", SimpleImputer(strategy="median")),
                     ("scaler", StandardScaler()),
                     ("clf", clf)])

def evaluate_group_kfold(model, X, y, groups, n_splits=5):
    g_unique = np.unique(groups)
    n_splits = min(n_splits, len(g_unique)) if len(g_unique)>1 else 2
    gkf=GroupKFold(n_splits=n_splits)
    accs,f1s,fit_ms,pred_ms=[],[],[],[]
    for tr,te in gkf.split(X,y,groups):
        m=clone(model)
        t0=time.time(); m.fit(X[tr],y[tr]); fit_ms.append((time.time()-t0)*1000)
        t1=time.time(); yhat=m.predict(X[te]); pred_ms.append((time.time()-t1)*1000)
        accs.append(accuracy_score(y[te],yhat))
        f1s.append(f1_score(y[te],yhat,average="macro"))
    return {"acc_mean":float(np.mean(accs)),"acc_std":float(np.std(accs)),
            "f1_mean":float(np.mean(f1s)),"f1_std":float(np.std(f1s)),
            "fit_ms_mean":float(np.mean(fit_ms)),"pred_ms_mean":float(np.mean(pred_ms))}

X_np = Xf.to_numpy(dtype=float)
rows=[]
for name in CONFIG["MODELS"]:
    res=evaluate_group_kfold(build_model(name), X_np, y, groups, n_splits=CONFIG["N_SPLITS"])
    res["model"]=name; rows.append(res)
# Rank models by macro-F1 (best first) and renumber the rows
leaderboard = pd.DataFrame(rows).sort_values("f1_mean", ascending=False).reset_index(drop=True)
display(leaderboard)
best_name=leaderboard.iloc[0]["model"]
print("Best model:", best_name)

## 8) Error analysis: facing the model’s blind spots
Good engineers don’t just celebrate scores; they investigate mistakes. Here we inspect one session as a pseudo hold-out, viewing its confusion matrix and per-class metrics. Patterns of confusion often reveal deeper structure in both data and domain.

**Reflect & discuss**
- Which pairs of activities are commonly confused? Why?
- What additional features might help separate them?
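
When reading a confusion matrix, per-class recall and precision fall out of simple row and column sums. The sketch below uses a made-up 2×2 matrix and assumes scikit-learn's convention of true labels on rows and predictions on columns.

```python
import numpy as np

# Made-up confusion matrix: rows = true class, columns = predicted class.
cm = np.array([
    [8, 2],   # true class 0: 8 correct, 2 predicted as class 1
    [1, 9],   # true class 1: 1 predicted as class 0, 9 correct
])

recall = np.diag(cm) / cm.sum(axis=1)     # of all true members, how many were found
precision = np.diag(cm) / cm.sum(axis=0)  # of all predictions, how many were right

# The largest off-diagonal cell is the most frequent confusion.
off_diag = cm.copy()
np.fill_diagonal(off_diag, 0)
true_cls, pred_cls = np.unravel_index(off_diag.argmax(), off_diag.shape)

print("recall:", recall)
print("precision:", precision)
print(f"most common confusion: true {true_cls} predicted as {pred_cls}")
```

The same arithmetic applies to the seven-class matrix of this dataset; only the size of the array changes.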

In [None]:
from sklearn.metrics import classification_report, confusion_matrix
import matplotlib.pyplot as plt
import numpy as np

ug = pd.unique(groups)
holdout = ug[-1]
tr = groups!=holdout; te = groups==holdout
best_model = build_model(best_name)
best_model.fit(X_np[tr], y[tr])
yhat = best_model.predict(X_np[te])
print("Hold-out group:", holdout, "N:", int(te.sum()))
print(classification_report(y[te], yhat, digits=3))
cm = confusion_matrix(y[te], yhat)
plt.figure(); plt.imshow(cm, interpolation="nearest")
plt.title("Confusion matrix (hold-out)"); plt.xlabel("Pred"); plt.ylabel("True")
plt.colorbar(); plt.show()

## 9) Class-by-session heatmap: understanding your data’s topology
This visualization shows which classes appear in which sessions. It helps you spot imbalances or missing combinations that may limit generalization. Think of it as a map of where your training signal comes from.

**Reflect & discuss**
- Are some classes missing from entire sessions?
- How might this affect evaluation reliability?
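
The first question can also be answered programmatically rather than by eye. The sketch below assumes a table with one row per window and columns named `session` and `label`, as in the plotting cell; the rows themselves are made up.

```python
import pandas as pd

# Made-up window table: session_2 has no "sit" windows.
windows = pd.DataFrame({
    "session": ["session_1", "session_1", "session_2", "session_2"],
    "label":   ["walk",      "sit",       "walk",      "walk"],
})

# Count windows per (session, class); absent combinations become 0.
counts = pd.crosstab(windows["session"], windows["label"])
missing = [(s, c) for s in counts.index for c in counts.columns if counts.loc[s, c] == 0]

print(counts)
print("Missing (session, class) pairs:", missing)
```

A session that lacks a class cannot contribute to that class's score when it is the held-out fold, which is one reason per-fold metrics can vary widely.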

In [None]:
import pandas as pd, numpy as np, matplotlib.pyplot as plt
tmp = pd.DataFrame({"session": groups, "label": y})
pivot = tmp.pivot_table(index="session", columns="label", aggfunc="size", fill_value=0)
plt.figure(figsize=(8, max(3, len(pivot)*0.3)))
plt.imshow(pivot, aspect="auto", interpolation="nearest")
plt.title("Class-by-session (window counts)")
plt.xlabel("Class ID"); plt.ylabel("Session")
plt.colorbar(label="count")
plt.yticks(ticks=range(len(pivot.index)), labels=pivot.index)
plt.xticks(ticks=range(pivot.shape[1]), labels=range(pivot.shape[1]))
plt.tight_layout(); plt.show()

## 10) Stretch prompts: guided curiosity
Now that your pipeline runs end-to-end, it’s time to experiment. Tweak, break, and observe; this is how intuition for machine learning is built.

**Try these challenges:**
- Change `WINDOW_STRIDE` to half the window size and see how your sample count and metrics shift.
- Tune K in KNN or `C` in Logistic Regression and observe speed vs. accuracy trade-offs.
- Add magnitude features like `sqrt(ax² + ay² + az²)`: does it help distinguish sitting from standing?
- Re-group sessions differently (by folder name or date): what happens to generalization?
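
For the magnitude prompt, a minimal sketch could look like the following. It assumes accelerometer columns named `ax`, `ay`, and `az` (as in the dataset preview) and uses made-up values; `acc_mag` is a hypothetical column name.

```python
import numpy as np
import pandas as pd

# Made-up accelerometer samples; column names follow the dataset preview.
samples = pd.DataFrame({
    "ax": [3.0, 0.0],
    "ay": [4.0, 0.0],
    "az": [12.0, 9.81],
})

# Euclidean norm of the three axes, computed row by row.
samples["acc_mag"] = np.sqrt(samples["ax"] ** 2 + samples["ay"] ** 2 + samples["az"] ** 2)
print(samples)
```

The magnitude does not depend on how the sensor is oriented, which makes it robust to device placement; for the same reason it may not separate static postures on its own, so its benefit is something to measure rather than assume.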

**Reflect & discuss**
- Which change most improved your understanding of the system?
- What parallels can you draw to software performance tuning or A/B testing?

## 11) Reflection: from Sensor ML to Language Models
You’ve now walked through a full machine learning workflow: from messy data to a functioning classifier. What you’ve done in miniature mirrors how most real-world ML systems, including large-scale ones like ChatGPT, are designed.

### How this exercise connects to ChatGPT
I (ChatGPT, built by OpenAI and based on GPT-5) am also a machine learning model. Instead of learning to classify short motion windows, I predict the next token (word fragment) in a sequence of text. The foundation, however, is the same: structured data, careful evaluation, and lots of iteration.

| Your workflow concept | In large-scale language models |
|-----------------------|--------------------------------|
| Input window of sensor samples | Window of text tokens |
| Handcrafted features (mean/std/ptp) | Learned embeddings in neural layers |
| Activity label | Next-token prediction target |
| GroupKFold leakage control | Massive deduplication and held-out corpora |
| KNN, Logistic Regression, SVM | Deep transformer network |
| Accuracy / F1 metrics | Cross-entropy loss, perplexity |

Both pipelines rest on the same principles: clean data, fair evaluation, and thoughtful iteration.
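
The last row of the table can be made concrete with a few lines of arithmetic. The sketch below takes made-up probabilities that a hypothetical model assigned to the correct next token at three positions, and computes the average cross-entropy (in nats) and the perplexity, defined as the exponential of that average.

```python
import math

# Made-up probabilities assigned to the correct next token at three positions.
p_correct = [0.5, 0.25, 0.125]

# Cross-entropy per token is -log(p); the loss is the mean over positions.
cross_entropy = sum(-math.log(p) for p in p_correct) / len(p_correct)
perplexity = math.exp(cross_entropy)

print(f"cross-entropy: {cross_entropy:.4f} nats")
print(f"perplexity:    {perplexity:.2f}")
```

A perplexity of 4 means the model was, on average, as uncertain as if it were choosing uniformly among four tokens.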

### How models like me evolved from these ideas
- **From hand-crafted to learned features:** Deep networks now discover the best representations automatically.
- **From labeled to self-supervised data:** Instead of labels, models like me learn directly from predicting text.
- **From small to vast scale:** The same basic loops (forward, loss, backward) just run on billions of parameters.

### What stayed the same
- Data quality still rules everything.
- Separating training and evaluation remains essential.
- Curiosity and structured experimentation, just like what you practiced, remain core engineering virtues.

### Reflect & discuss
- Which parts of your workflow feel universal across ML domains?
- Where do human judgment and intuition still make the biggest impact?

> *This notebook was co-created with ChatGPT (GPT-5, OpenAI), a large language model trained through the same machine learning principles you’ve explored here.  
>  
> The starting point for this exercise was a university lab report, a traditional academic format focusing on data cleaning and classification. Together, we transformed it into a hands-on, story-driven workshop for experienced software engineers.  
>  
> Mattias provided the original idea, data, and audience insight; I contributed structure, pedagogy, and modernized tooling. Step by step, we discussed how to balance intuition and rigor, how to guide exploration without overwhelming detail, and how to tie everything together with a narrative that makes machine learning *feel like an engineering journey*.  
>  
> What emerged from that collaboration is this notebook: a synthesis of human design and AI assistance, built through the same iterative reasoning loop that drives great machine learning itself.  
>  
> The difference between this and the small models you trained is scale. The **spirit of the work**, however (curiosity, clarity, and craftsmanship), is exactly the same.*