## ⚙️ Before you begin — running this notebook

This notebook is designed to work out-of-the-box on a typical developer laptop.  
You can run it **locally in VS Code** (recommended) or **online in Google Colab**.

### 🧩 Option 1: Run locally in **Visual Studio Code**

**Step 1 — Install prerequisites**
- Visual Studio Code
- The Python and Jupyter extensions
- Python ≥ 3.10

**Linux (Debian/Ubuntu) users:** install the venv module:
```bash
sudo apt update
sudo apt install python3-venv
```

Verify Python:
```bash
python3 --version
```

**Step 2 — Prepare your project folder**
- Put this notebook, `requirements.txt`, and your dataset folder or `data.zip` in the same directory.

**Step 3 — Create and activate a virtual environment**
```bash
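python3 -m venv .venv   # on Windows, use `python` or `py` if `python3` is not found
# Then activate it: run only the line that matches your OS.
# Windows (PowerShell)
.venv\Scripts\Activate.ps1
# macOS / Linux
source .venv/bin/activate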
```

If activation fails because the `activate` script is missing, the environment was not created completely. Install `python3-venv`, delete the `.venv` folder, and create the environment again.

**Step 4 — Install dependencies**
```bash
pip install --upgrade pip
pip install -r requirements.txt
```

**Step 5 — Select the kernel in VS Code**
When prompted to **Select Kernel**, choose the interpreter that points to `.venv/bin/python` (`.venv\Scripts\python.exe` on Windows)  
(or press Ctrl+Shift+P → Python: Select Interpreter → pick `.venv`).  
If nothing appears, run `pip install ipykernel` in your activated env.

**Step 6 — Run the notebook**
Use **Run All** or run cells one by one.


# Exploring Machine Learning on Sensor Data — Engineer Lab (v2.4)

### 🧭 Welcome, engineer-explorer
Today’s journey will take you from the raw pulse of sensor data to the structured intelligence of a working machine learning system. You'll clean, shape, and understand data, extract meaningful patterns, and teach algorithms to recognize behaviors — all while reflecting on the design choices that make ML both powerful and fragile.

By the end, you’ll not only have built a working classifier, but also a deeper sense of what drives modern AI — the same foundations that underpin systems like ChatGPT.

Ready? Let’s build something that learns.


## 0) Environment check — building on solid ground

Before we dive into code, let’s make sure your environment is ready and connected to the notebook.  
This step confirms you’re using the intended Python kernel and that the core libraries are present.

### ✅ Selecting the correct kernel (VS Code)
If a popup asks you to **Select Kernel**:
1. Click **Python Environments…**
2. Choose the interpreter that ends with your project’s virtual environment, e.g. `.venv/bin/python`
3. The top-right status should show `Python 3.x ('.venv': venv)`

If no environments appear, run in your terminal:
```bash
pip install ipykernel
```

**Reflect & discuss**
- How would you ensure consistent environments across developers or CI pipelines?
- Why might reproducibility matter even for quick experiments?
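
One possible starting point for the first question is to record the exact package versions next to your results, so that a colleague or a CI job can compare them with their own. The sketch below uses only the standard library; the helper name `snapshot_versions` and the package list are assumptions for illustration, not part of this notebook.

```python
from importlib import metadata

def snapshot_versions(packages):
    """Return {package: installed version, or None if it is not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions

# The package list is an assumption; adapt it to your own requirements.txt.
snapshot = snapshot_versions(["numpy", "pandas", "scikit-learn", "matplotlib"])
for name, version in snapshot.items():
    print(f"{name}=={version}" if version else f"{name}: not installed")
```

Saving this output together with a leaderboard makes it easier to tell a real modeling difference from a library-version difference.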


In [None]:
import sys, platform
import numpy as np, pandas as pd, matplotlib.pyplot as plt

print("Python:", sys.version.split()[0])
print("OS:", platform.platform())
print("NumPy:", np.__version__)
print("Pandas:", pd.__version__)
print("Python executable:", sys.executable)

## 1) Configuration — your control panel

Centralize parameters so you can tweak dials and re-run downstream steps.

**Knobs to tweak**
- `TASK`: `'movement'` (default) uses accelerometer data to classify run/walk/jump/pushup. `'static'` focuses on lie/sit/stand (acc, optional mag).
- `WINDOW_SAMPLES`, `WINDOW_STRIDE`: temporal context & overlap
- `N_SPLITS`: number of GroupKFold folds (capped at the number of sessions); more folds mean more train/test rotations
- `SORT_BY_TIMESTAMP`: keep chronological order within each file
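
To get a feel for the two window knobs, the sketch below counts how many windows a recording yields and how much consecutive windows overlap. It mirrors the `range(0, len - win + 1, stride)` loop used later in section 5; the helper names and the 1000-sample recording are hypothetical.

```python
def count_windows(n_samples: int, window: int, stride: int) -> int:
    """Number of full windows produced by range(0, n_samples - window + 1, stride)."""
    if n_samples < window:
        return 0
    return (n_samples - window) // stride + 1

def overlap_fraction(window: int, stride: int) -> float:
    """Share of each window that is repeated in the next one (0.0 means no overlap)."""
    return max(0.0, 1.0 - stride / window)

# Hypothetical 10-second recording sampled at 100 Hz
n = 1000
print(count_windows(n, 200, 200), overlap_fraction(200, 200))  # 5 windows, 0.0 overlap
print(count_windows(n, 200, 100), overlap_fraction(200, 100))  # 9 windows, 0.5 overlap
```

Halving the stride roughly doubles the number of windows, but neighbouring windows then share half of their samples, so they are not independent observations.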


In [None]:
from pathlib import Path
CONFIG = {
    "DATA_ROOT": None,  # folder or .zip (e.g., "data.zip")
    "INCLUDE_GLOBS": ["**/*.csv"],
    "TASK": "movement",  # "movement" or "static"
    "USE_GYRO": False,   # optional extension; default off to keep schema tight
    "SORT_BY_TIMESTAMP": True,
    "WINDOW_SAMPLES": 200,    # ~2s @100Hz, aligned with the lab report's analysis
    "WINDOW_STRIDE": 200,
    "N_SPLITS": 5,
    "RANDOM_SEED": 42,
    "MODELS": ["knn","logreg","linearsvm"],
}
import numpy as np
np.random.seed(CONFIG["RANDOM_SEED"])

## 2) Data ingestion — connect to the real world

We unify files into a single table and keep their origins (`__source_path`).  
Files come from phone sensor recordings and have **two schemas**:
1) `Timestamp, Milliseconds, X, Y, Z` (raw export)
2) `ax, ay, az` (and similar `gx,gy,gz`, `mx,my,mz`) with an extra `Unnamed: 0` index column

We normalize these into consistent column names based on modality inferred from the filename (`*_acc*.csv`, `*_gyro*.csv`, `*_mag*.csv`).  
No cross-modality merging is done here — that avoids artificial NaNs.
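
To see what "artificial NaNs" means, here is a small sketch with two synthetic tables standing in for an accelerometer file and a magnetometer file (the values are made up). Stacking them row-wise leaves every row without the other modality's columns.

```python
import pandas as pd

# Two tiny synthetic tables standing in for one accelerometer and one magnetometer file
acc = pd.DataFrame({"ax": [0.1, 0.2], "ay": [0.0, 0.1], "az": [9.8, 9.7]})
mag = pd.DataFrame({"mx": [30.0, 31.0], "my": [-5.0, -4.0], "mz": [12.0, 12.5]})

# Row-wise stacking aligns on column names, so each row lacks the other modality's columns
stacked = pd.concat([acc, mag], ignore_index=True)
print(stacked)
print("NaN share per column:")
print(stacked.isna().mean())
```

Neither input contains a missing value, yet half of every column in the stacked table is NaN. Keeping one table per file postpones that decision until you know how the modalities should be aligned.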


In [None]:
import zipfile, tempfile, pandas as pd, re
from pathlib import Path

def ensure_loaded_path(data_root):
    if data_root is None:
        return None, None
    p = Path(data_root)
    if p.suffix.lower()==".zip" and p.is_file():
        tmp = Path(tempfile.mkdtemp(prefix="data_zip_"))
        with zipfile.ZipFile(p,"r") as z: z.extractall(tmp)
        return tmp, "zip"
    elif p.exists() and p.is_dir():
        return p, "dir"
    else:
        print("[WARN] DATA_ROOT not found:", data_root); return None, None

DATA_PATH, DATA_KIND = ensure_loaded_path(CONFIG["DATA_ROOT"])
if DATA_PATH is None:
    print("No external data path provided; set CONFIG['DATA_ROOT'] to a folder or .zip and re-run.")

In [None]:
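import pandas as pd
from pathlib import Path

def read_csv_any(fp: Path):
    """Try common delimiters; accept the first parse that yields more than one column."""
    tries=[dict(),dict(sep=';'),dict(sep='\t'),dict(sep=None,engine='python')]
    last=None
    for kw in tries:
        try:
            df = pd.read_csv(fp, **kw)
        except Exception as e:
            last=e
            continue
        # A wrong delimiter usually "succeeds" with every field packed into one column
        if df.shape[1] > 1:
            return df
        last=ValueError(f"only one column parsed with options {kw}")
    raise last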

def infer_modality_from_name(name: str):
    s = name.lower()
    if "gyro" in s: return "gyro"
    if "mag" in s:  return "mag"
    if "acc" in s:  return "acc"
    return None

def normalize_columns(df: pd.DataFrame, modality: str):
    df = df.copy()
    if "Unnamed: 0" in df.columns:
        df = df.drop(columns=["Unnamed: 0"])
    # Map XYZ to modality-specific names if needed
    if set(["X","Y","Z"]).issubset(df.columns):
        if modality=="acc":
            rename={"X":"ax","Y":"ay","Z":"az"}
        elif modality=="gyro":
            rename={"X":"gx","Y":"gy","Z":"gz"}
        elif modality=="mag":
            rename={"X":"mx","Y":"my","Z":"mz"}
        else:
            rename={}
        df = df.rename(columns=rename)
    return df

def load_manifest(root: Path, include_globs):
    files=[]
    for pat in include_globs:
        files.extend(root.glob(pat))
    out=[]
    for f in files:
        modality = infer_modality_from_name(f.name)
        if modality is None:
            continue  # skip files whose name does not reveal the sensor type
        try:
            tdf = read_csv_any(f)
        except Exception as e:
            print("[WARN] failed to read", f, e); continue
        tdf = normalize_columns(tdf, modality)
        tdf["__source_path"] = str(f.relative_to(root))
        tdf["__modality"] = modality
        out.append(tdf)
    assert out, "No readable files."
    return out

if DATA_PATH is not None:
    tables = load_manifest(DATA_PATH, CONFIG["INCLUDE_GLOBS"])
    print("Loaded tables:", len(tables))
    sample = tables[0]
    display(sample.head())


## 3) Label & session derivation — giving meaning to numbers

We infer **activity** and **session** from file names (e.g., `walk_acc6.csv` → activity=`walk`, session=`6`).  
We keep files relevant to the chosen `TASK` and **sort** each file chronologically if a `Timestamp` column is present.
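
As an illustration of the naming idea, the sketch below parses a file name with one regular expression. It is an alternative to the substring matching used in the next cell, not the notebook's own method, and it assumes every file follows `<activity>_<modality><session>.csv`; `FILENAME_RE` and `parse_filename` are hypothetical names.

```python
import re
from pathlib import Path

# Assumed naming scheme: <activity>_<modality><session>.csv, e.g. walk_acc6.csv
FILENAME_RE = re.compile(r"(?P<activity>[a-z_]+?)_(?P<modality>acc|gyro|mag)(?P<session>\d*)")

def parse_filename(path: str):
    """Return (activity, modality, session), or None if the name does not fit the scheme."""
    m = FILENAME_RE.fullmatch(Path(path).stem.lower())
    if m is None:
        return None
    return m["activity"], m["modality"], m["session"] or "0"

print(parse_filename("walk_acc6.csv"))                 # ('walk', 'acc', '6')
print(parse_filename("data/sit_to_stand_gyro12.csv"))  # ('sit_to_stand', 'gyro', '12')
print(parse_filename("notes.csv"))                     # None
```

Matching the whole stem makes unexpected file names fail loudly (`None`) instead of being assigned to whichever activity token happens to appear in them.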


In [None]:
import pandas as pd, re
from pathlib import Path

def parse_activity(pathstr: str):
    name = Path(pathstr).name.lower()
    for token in ["sit_to_stand","stand_to_sit","pushup","jump","walk","run","sit","stand","lie","all"]:
        if token in name: return token
    return None

def parse_session(pathstr: str):
    stem = Path(pathstr).stem
    m = re.findall(r"(\d+)", stem)
    return m[-1] if m else "0"

def task_filter(activity: str, modality: str, task: str, use_gyro: bool):
    if task=="movement":
        if activity in {"run","walk","jump","pushup"}:
            if use_gyro:
                return modality in {"acc","gyro"}
            else:
                return modality=="acc"
        return False
    else: # static
        if activity in {"lie","sit","stand"}:
            return modality in {"acc","mag"}
        return False

# Build one normalized long table matching the task
frames=[]
if DATA_PATH is not None:
    for t in tables:
        path = t["__source_path"].iloc[0]
        modality = t["__modality"].iloc[0]
        activity = parse_activity(path)
        if activity is None: 
            continue
        if not task_filter(activity, modality, CONFIG["TASK"], CONFIG["USE_GYRO"]):
            continue
        df = t.copy()
        df["activity"] = activity
        df["group_id"] = f"session_{parse_session(path)}"
        # parse timestamp if present
        if CONFIG["SORT_BY_TIMESTAMP"] and "Timestamp" in df.columns:
            df["__ts"] = pd.to_datetime(df["Timestamp"], errors="coerce", dayfirst=True)
            if df["__ts"].isna().any():
                df["__ts_fallback"] = df.groupby("__source_path").cumcount()
                df["__ts"] = df["__ts"].fillna(pd.to_datetime(df["__ts_fallback"], unit="s"))
                df = df.drop(columns=["__ts_fallback"])
            df = df.sort_values(["__source_path","__ts"]).reset_index(drop=True)
        frames.append(df)

assert frames, "No data matched the selected TASK (also check that CONFIG['DATA_ROOT'] is set)."
raw = pd.concat(frames, ignore_index=True)
print("Task:", CONFIG["TASK"], "Use gyro:", CONFIG["USE_GYRO"])
print("Rows:", len(raw), "Columns:", list(raw.columns)[:12])
print("Activities:", sorted(raw['activity'].unique().tolist()))
print("Groups:", raw['group_id'].nunique())
display(raw.head())

## 4) Sanity summaries — know your battlefield

Quick checks on class distribution and sessions help catch issues early.


In [None]:
print("Per-class rows:"); display(raw['activity'].value_counts())
print("Per-session rows:"); display(raw['group_id'].value_counts())

# Check numeric signals and NaN ratio
meta_cols = {"__source_path","__modality","activity","group_id","Timestamp","Milliseconds","__ts"}
num_cols = [c for c in raw.columns if c not in meta_cols and pd.api.types.is_numeric_dtype(raw[c])]
nan_ratio = raw[num_cols].isna().mean().sort_values(ascending=False)
print("Top 10 NaN ratios among numeric columns:"); display(nan_ratio.head(10))

## 5) Windowed features — letting structure emerge

We convert sensor sequences into fixed-length windows and describe each window with simple statistics (mean, standard deviation, and peak‑to‑peak range).  
For **movement** we default to accelerometer-only to keep a consistent schema across all recordings. (You can enable gyroscope later as an extension.)
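
A worked example on a single window may help before running the full extraction. The four samples below are invented; the three statistics are the ones named above, with the sample standard deviation (`ddof=1`) as in the next cell.

```python
import numpy as np

# One hypothetical window: 4 samples x 3 axes (ax, ay, az)
window = np.array([
    [0.0, 1.0, 9.0],
    [2.0, 1.0, 10.0],
    [4.0, 1.0, 11.0],
    [6.0, 1.0, 10.0],
])

mean = window.mean(axis=0)                     # per-axis average
std = window.std(axis=0, ddof=1)               # per-axis sample standard deviation
ptp = window.max(axis=0) - window.min(axis=0)  # per-axis peak-to-peak range

features = np.concatenate([mean, std, ptp])    # 3 statistics x 3 axes = 9 features
print(features.round(3))
```

The constant `ay` axis ends up with zero spread and zero range, while `ax` carries most of the variation. Each window is thus reduced from `window_length x 3` raw values to 9 numbers.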


In [None]:
import numpy as np, pandas as pd

WINDOW_SAMPLES = CONFIG["WINDOW_SAMPLES"]
WINDOW_STRIDE  = CONFIG["WINDOW_STRIDE"]

# Choose signal columns based on task and available columns
if CONFIG["TASK"]=="movement":
    signal_cols = [c for c in ["ax","ay","az"] if c in raw.columns]
else:
    signal_cols = [c for c in ["ax","ay","az","mx","my","mz"] if c in raw.columns]

assert signal_cols, "No signal columns found for the chosen task."
print("Using signals:", signal_cols)

def make_windows(df_in: pd.DataFrame, signal_cols, label_col, group_col, win, stride):
    feats, labels, groups, sources = [], [], [], []
    lab_cat = pd.Categorical(df_in[label_col])
    for (g, src), sdf in df_in.groupby([group_col,"__source_path"], sort=False):
        sdf = sdf.reset_index(drop=True)
        if len(sdf) < win: 
            continue
        X = sdf[signal_cols].to_numpy(dtype=float)
        y_codes = pd.Categorical(sdf[label_col], categories=lab_cat.categories).codes
        for start in range(0, len(sdf)-win+1, stride):
            stop = start + win
            seg = X[start:stop]
            lab = pd.Series(y_codes[start:stop]).mode().iloc[0]
            mu  = np.nanmean(seg, axis=0)
            sd  = np.nanstd(seg, axis=0, ddof=1)
            ptp = np.nanmax(seg, axis=0) - np.nanmin(seg, axis=0)
            row = {}
            for c, v in zip(signal_cols, mu):  row[f"{c}_mean"]=v
            for c, v in zip(signal_cols, sd):  row[f"{c}_std"] =v
            for c, v in zip(signal_cols, ptp): row[f"{c}_ptp"]=v
            feats.append(row); labels.append(lab); groups.append(g); sources.append(src)
    Xf = pd.DataFrame(feats)
    y = np.asarray(labels)
    groups = np.asarray(groups)
    meta = pd.DataFrame({"group_id": groups, "__source_path": sources})
    return Xf, y, groups, meta, list(lab_cat.categories)

Xf, y, groups, meta, label_names = make_windows(raw, signal_cols, "activity", "group_id", WINDOW_SAMPLES, WINDOW_STRIDE)
print("Feature table:", Xf.shape, "classes:", label_names, "groups:", len(set(groups)))
display(Xf.head())

## 6) Evaluation protocol — testing without cheating

Evaluating on data that’s too similar to training gives an illusion of success.  
`GroupKFold` keeps whole sessions together: either in training or testing, never both.

**Reflect & discuss**
- Why might random row-based splits exaggerate performance here?
- If each session came from a different user, what would `GroupKFold` protect you from?
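
As a hint for the first question, the following standard-library sketch compares a shuffled row split with a session-wise split on made-up session IDs. It only tracks which sessions land on each side; no model is trained.

```python
import random

# Hypothetical data: 3 sessions with 30 windows each
sessions = [f"s{i}" for i in (1, 2, 3) for _ in range(30)]
indices = list(range(len(sessions)))

# Row-based split: shuffle the windows and hold out 20 %
rng = random.Random(0)
shuffled = indices[:]
rng.shuffle(shuffled)
cut = int(0.8 * len(shuffled))
row_train, row_test = shuffled[:cut], shuffled[cut:]
row_shared = {sessions[i] for i in row_train} & {sessions[i] for i in row_test}

# Group-based split: hold out every window of one session
held_out = "s3"
grp_train = [i for i in indices if sessions[i] != held_out]
grp_test = [i for i in indices if sessions[i] == held_out]
grp_shared = {sessions[i] for i in grp_train} & {sessions[i] for i in grp_test}

print("Sessions on both sides, row split:  ", sorted(row_shared))
print("Sessions on both sides, group split:", sorted(grp_shared))  # []
```

With the row split, every test window has close neighbours from the same recording in the training set, so the score partly measures how well the model memorized a session rather than how well it generalizes to a new one.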


In [None]:
# Small demo to show how GroupKFold splits by session
from sklearn.model_selection import GroupKFold
import numpy as np, pandas as pd

demo_groups = np.array(["s1","s1","s1","s2","s2","s2","s3","s3","s3"])
demo_X = np.arange(len(demo_groups)).reshape(-1,1)
gkf = GroupKFold(n_splits=3)

folds=[]
for i,(tr,te) in enumerate(gkf.split(demo_X, groups=demo_groups),1):
    folds.append({"fold":i,
                  "train_groups":np.unique(demo_groups[tr]).tolist(),
                  "test_groups":np.unique(demo_groups[te]).tolist()})
pd.DataFrame(folds)

## 7) Baseline algorithms & leaderboard — friendly competition

We compare classic algorithms using a common pipeline (Impute → Scale → Model).  
Leaderboard ranks by macro‑F1 and accuracy.
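
Why rank by macro-F1 first? The sketch below computes accuracy and macro-F1 by hand for an invented, imbalanced case. The notebook itself uses scikit-learn's `f1_score(..., average="macro")`; the helper functions here are only for illustration.

```python
def f1_for_class(y_true, y_pred, cls):
    """F1 for one class; defined as 0.0 when there are no true positives."""
    tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
    fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
    fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over the classes seen in y_true or y_pred."""
    classes = sorted(set(y_true) | set(y_pred))
    return sum(f1_for_class(y_true, y_pred, c) for c in classes) / len(classes)

# Hypothetical imbalanced case: 8 "walk" windows, 2 "jump" windows, model always says "walk"
y_true = ["walk"] * 8 + ["jump"] * 2
y_pred = ["walk"] * 10
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"accuracy = {accuracy:.3f}")                   # 0.800
print(f"macro-F1 = {macro_f1(y_true, y_pred):.3f}")  # 0.444
```

Accuracy rewards the model for the majority class alone, while macro-F1 gives every class the same weight, so a class that is never recognized pulls the score down sharply.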


In [None]:
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.model_selection import GroupKFold
from sklearn.metrics import accuracy_score, f1_score
from sklearn.base import clone
import numpy as np, pandas as pd, time

def build_model(name: str):
    if name=="knn":       clf=KNeighborsClassifier(n_neighbors=5)
    elif name=="logreg":  clf=LogisticRegression(max_iter=2000)
    elif name=="linearsvm": clf=LinearSVC()
    else: raise ValueError(name)
    return Pipeline([("impute", SimpleImputer(strategy="median")),
                     ("scaler", StandardScaler()),
                     ("clf", clf)])

def evaluate_group_kfold(model, X, y, groups, n_splits=5):
    g_unique = np.unique(groups)
    if len(g_unique) < 2:
        raise ValueError(f"GroupKFold needs at least 2 distinct sessions; found {len(g_unique)}.")
    n_splits = min(n_splits, len(g_unique))
    gkf=GroupKFold(n_splits=n_splits)
    accs,f1s,fit_ms,pred_ms=[],[],[],[]
    for tr,te in gkf.split(X,y,groups):
        m=clone(model)
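        # perf_counter is a monotonic, high-resolution clock, better suited to short timings
        t0=time.perf_counter(); m.fit(X[tr],y[tr]); fit_ms.append((time.perf_counter()-t0)*1000)
        t1=time.perf_counter(); yhat=m.predict(X[te]); pred_ms.append((time.perf_counter()-t1)*1000)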
        accs.append(accuracy_score(y[te],yhat))
        f1s.append(f1_score(y[te],yhat,average="macro"))
    return {"acc_mean":float(np.mean(accs)),"acc_std":float(np.std(accs)),
            "f1_mean":float(np.mean(f1s)),"f1_std":float(np.std(f1s)),
            "fit_ms_mean":float(np.mean(fit_ms)),"pred_ms_mean":float(np.mean(pred_ms))}

X_np = Xf.to_numpy(dtype=float)
rows=[]
for name in CONFIG["MODELS"]:
    res=evaluate_group_kfold(build_model(name), X_np, y, groups, n_splits=CONFIG["N_SPLITS"])
    res["model"]=name; rows.append(res)
leaderboard=pd.DataFrame(rows).sort_values("f1_mean", ascending=False).reset_index(drop=True)
display(leaderboard)
best_name=leaderboard.iloc[0]["model"]
print("Best model:", best_name)

## 8) Error analysis — facing the model’s blind spots

We pick one session as a pseudo hold‑out for inspection and view per‑class errors.
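
When reading the confusion matrix, it can help to turn it into per-class recall and precision. The matrix below is invented for illustration and follows scikit-learn's convention of true classes in rows and predicted classes in columns.

```python
# Hypothetical confusion matrix: rows = true class, columns = predicted class
labels = ["jump", "run", "walk"]
cm = [
    [8, 1, 1],   # true jump
    [0, 6, 4],   # true run
    [0, 2, 8],   # true walk
]

per_class = {}
for i, name in enumerate(labels):
    row_total = sum(cm[i])                 # windows that truly belong to this class
    col_total = sum(row[i] for row in cm)  # windows predicted as this class
    recall = cm[i][i] / row_total if row_total else 0.0
    precision = cm[i][i] / col_total if col_total else 0.0
    per_class[name] = (recall, precision)
    print(f"{name:5s} recall={recall:.2f} precision={precision:.2f}")
```

In this made-up case most errors sit in the run/walk block: 4 of 10 run windows are labelled walk. Off-diagonal clusters like that point to the class pairs whose features overlap.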


In [None]:
from sklearn.metrics import classification_report, confusion_matrix
import matplotlib.pyplot as plt
import numpy as np, pandas as pd

ug = pd.unique(groups)
holdout = ug[-1]
tr = groups!=holdout; te = groups==holdout
best_model = build_model(best_name)
best_model.fit(X_np[tr], y[tr])
yhat = best_model.predict(X_np[te])
print("Hold-out group:", holdout, "N:", int(te.sum()))
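# Pass the full label set so rows/columns stay aligned even if the hold-out lacks a class
all_labels = np.arange(len(label_names))
print(classification_report(y[te], yhat, labels=all_labels, target_names=label_names, digits=3, zero_division=0))
cm = confusion_matrix(y[te], yhat, labels=all_labels)
plt.figure(); plt.imshow(cm, interpolation="nearest")
plt.xticks(all_labels, label_names, rotation=45); plt.yticks(all_labels, label_names)
plt.title("Confusion matrix (hold-out)"); plt.xlabel("Pred"); plt.ylabel("True")
plt.colorbar(); plt.show()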

## 9) Class‑by‑session heatmap — understand your data’s topology

See which classes appear in which sessions to reason about split difficulty and metric reliability.
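
The same question can be answered without a plot. The sketch below counts windows per (session, class) pair on invented data and lists the classes each session is missing; the variable names are hypothetical.

```python
from collections import Counter

# Hypothetical (session, label) pairs, one per window
windows = [("session_1", "walk"), ("session_1", "run"), ("session_1", "walk"),
           ("session_2", "walk"), ("session_2", "jump"),
           ("session_3", "run")]

counts = Counter(windows)  # (session, label) -> number of windows
all_labels = sorted({label for _, label in windows})
all_sessions = sorted({session for session, _ in windows})

# A held-out session cannot be scored on a class it does not contain
missing = {s: [l for l in all_labels if counts[(s, l)] == 0] for s in all_sessions}
for session, absent in missing.items():
    print(session, "is missing:", absent)
```

Sessions with missing classes make fold-level metrics noisier, because macro-F1 on such a fold is averaged over fewer (or unevenly represented) classes.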


In [None]:
import pandas as pd, numpy as np, matplotlib.pyplot as plt
tmp = pd.DataFrame({"session": groups, "label": y})
pivot = tmp.pivot_table(index="session", columns="label", aggfunc="size", fill_value=0)
plt.figure(figsize=(8, max(3, len(pivot)*0.3)))
plt.imshow(pivot, aspect="auto", interpolation="nearest")
plt.title("Class-by-session (window counts)")
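plt.xlabel("Class"); plt.ylabel("Session")
plt.colorbar(label="count")
plt.yticks(ticks=range(len(pivot.index)), labels=pivot.index)
# Take tick labels from the pivot's own columns so they stay correct if a class has no windows
plt.xticks(ticks=range(pivot.shape[1]), labels=[label_names[i] for i in pivot.columns], rotation=45)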
plt.tight_layout(); plt.show()

## 10) Stretch prompts — guided curiosity

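- Set `CONFIG["WINDOW_STRIDE"] = CONFIG["WINDOW_SAMPLES"] // 2` and re-run sections 5→9 (section 5 reads both values from `CONFIG`, so changing the bare variable is overwritten). What changed and why?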
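- Set `CONFIG["USE_GYRO"] = True`, re-run from section 3, and attempt feature fusion (accelerometer + gyroscope). Section 5 currently selects only `ax`, `ay`, `az` for movement, so you will need to extend `signal_cols` and align the two modalities yourself. How do the metrics move?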
- Add magnitude features: `sqrt(ax^2 + ay^2 + az^2)` — do they help with run vs walk?
- Re-group sessions (by parent folder) — what happens to generalization?
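
For the magnitude prompt, a minimal sketch could look like the following. The samples are invented, and the `amag_*` feature names are an assumption chosen to match the `<signal>_<statistic>` pattern from section 5.

```python
import numpy as np

# Hypothetical accelerometer samples, columns = (ax, ay, az)
acc = np.array([
    [3.0, 4.0, 0.0],
    [0.0, 0.0, 9.8],
    [1.0, 2.0, 2.0],
])

# Per-sample magnitude sqrt(ax^2 + ay^2 + az^2); it does not change if the sensor axes are rotated
magnitude = np.linalg.norm(acc, axis=1)
print(magnitude)  # approximately [5.0, 9.8, 3.0]

# The same window statistics as in section 5, applied to the new signal
mag_features = {
    "amag_mean": magnitude.mean(),
    "amag_std": magnitude.std(ddof=1),
    "amag_ptp": magnitude.max() - magnitude.min(),
}
print(mag_features)
```

Because the magnitude ignores how the phone is oriented, it may separate activities by intensity even when the per-axis features differ between sessions for unrelated reasons.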


## 11) Reflection — from Sensor ML to Language Models

You’ve now walked through a full machine learning workflow: from messy data to a functioning classifier. What you’ve done in miniature mirrors how most real-world ML systems — including ChatGPT — are designed.

### How this exercise connects to ChatGPT
I (ChatGPT, built by OpenAI and based on GPT‑5) am also a machine learning model. Instead of learning to classify motion windows, I predict the next token in a sequence of text. The foundation is the same: structured data, careful evaluation, and iteration.

| Your workflow concept | In large‑scale language models |
|---|---|
| Input window of sensor samples | Window of text tokens |
| Handcrafted features (mean/std/ptp) | Learned embeddings in neural layers |
| Activity label | Next‑token prediction target |
| GroupKFold leakage control | Deduplication and held‑out corpora |
| KNN, Logistic Regression, SVM | Deep transformer network |
| Accuracy / F1 metrics | Cross‑entropy loss, perplexity |
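
To make the last row concrete, here is a small standard-library sketch of cross-entropy and perplexity. The probabilities are invented; they stand for the probability a model assigned to the token that actually came next at four positions.

```python
import math

def cross_entropy(probs_of_correct_token):
    """Mean negative log-probability (in nats) assigned to the tokens that actually came next."""
    return -sum(math.log(p) for p in probs_of_correct_token) / len(probs_of_correct_token)

def perplexity(probs_of_correct_token):
    """exp(cross-entropy): roughly how many equally likely options the model is choosing between."""
    return math.exp(cross_entropy(probs_of_correct_token))

# Hypothetical probabilities given to the true next token at four positions
probs = [0.5, 0.25, 0.5, 0.25]
print(f"cross-entropy = {cross_entropy(probs):.3f} nats")  # 1.040
print(f"perplexity    = {perplexity(probs):.3f}")           # 2.828
```

A model that always spreads its probability evenly over four candidates would score a perplexity of exactly 4, and a perfect predictor would score 1, so lower is better, just as higher is better for accuracy and F1.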

### How models like me evolved from these ideas
- **From hand‑crafted to learned features:** neural networks discover useful representations automatically.  
- **From labeled to self‑supervised data:** train by predicting text itself.  
- **From small to vast scale:** same training loop, many more parameters.

### What stayed the same
- Data quality still rules everything.  
- Separation of training and evaluation remains essential.  
- Curiosity and structured experimentation — just like what you practiced — remain core engineering virtues.

> *This notebook was co‑created with ChatGPT (GPT‑5, OpenAI), inspired by an earlier university lab report from Mattias. We started from that academic exercise and, through your guidance, evolved it into a hands‑on, story‑driven workshop for experienced engineers: we modernized the tooling, clarified evaluation with GroupKFold, tightened the feature schema to avoid modality mixing (and the NaN cascade), and added a narrative thread that links classical ML to contemporary AI. The difference from the models you trained today is scale; the craftsmanship is shared.*
