# Exploring Movement Recognition: Engineer Lab (v2.5)

### 🧭 Welcome
In this lab you’ll build a simple, honest movement classifier from **already cleaned and split data**.  
We focus on understanding *how data turns into features* and *how models are evaluated fairly*, without heavy ML machinery.

This edition uses **`./data/cleaned/acc_mag_gyro/`** only. It contains **train/** and **test/** splits of movement data with **accelerometer, gyroscope, and magnetometer** readings.


## ⚙️ Before you begin: running this notebook

This notebook is designed to work **out-of-the-box** for engineers who want to explore machine learning hands-on.  
You can run it locally in **Visual Studio Code** (recommended) or online in **Google Colab**.

---

### 🧩 Option 1: Run locally in **Visual Studio Code**

#### **Step 1: Install prerequisites**
You’ll need:
- Visual Studio Code
- The **Python** and **Jupyter** extensions for VS Code
- **Python ≥ 3.10**

💡 **Linux note:** you may need to install the virtual-environment module first:
```bash
sudo apt update
sudo apt install python3-venv
```

Verify your installation:
```bash
python3 --version
```

---

#### **Step 2: Prepare your project folder**
Whether you clone the repository (preferred) or download the files, make sure your project folder contains the following:
```
Exploring_ML_Movement_v2_5.ipynb
requirements.txt
data/
└── cleaned/
    └── acc_mag_gyro/
        ├── train/
        └── test/
```

---

#### **Step 3: Create and activate a virtual environment**
From the terminal inside your project folder:

```bash
python3 -m venv .venv
```

Activate it:

- **macOS / Linux**
  ```bash
  source .venv/bin/activate
  ```
- **Windows (PowerShell)**
  ```powershell
  .venv\Scripts\Activate
  ```

If the `activate` script isn’t there, it usually means `python3-venv` wasn’t installed before creating the environment.  
Install it, delete `.venv/`, and recreate it using the commands above.

---

#### **Step 4: Install dependencies**
```bash
pip install --upgrade pip
pip install -r requirements.txt
```

---

#### **Step 5: Select the kernel in VS Code**
When prompted to **Select Kernel**, choose the one that points to your new environment:
```
Python 3.x ('.venv': venv)
```

If nothing appears, run this in your activated terminal:
```bash
pip install ipykernel
```
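Then restart VS Code and open the notebook again.  

💡 You can confirm you’re using the right environment by running this in a notebook cell:
```python
import sys
print(sys.executable)
```
It should print a path inside `.venv` (ending with `.venv/bin/python` on macOS/Linux, or `.venv\Scripts\python.exe` on Windows).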

---

#### **Step 6: Run the notebook**
You’re ready! Run cells one by one or choose **Run All**.  
The first few cells will check your setup and print Python, NumPy, and Pandas versions.

In [None]:
import sys, platform
import numpy as np, pandas as pd, matplotlib.pyplot as plt
print("Python:", sys.version.split()[0])
print("OS:", platform.platform())
print("NumPy:", np.__version__)
print("Pandas:", pd.__version__)

### 🧮 Option 2: Run in **Google Colab**
If you prefer the cloud:
1. Upload the notebook, `requirements.txt`, and the `data/cleaned/acc_mag_gyro/` folder to your Colab environment, keeping the same relative paths.  
2. Run the setup cell that installs the dependencies:
   ```bash
   !pip install -r requirements.txt
   ```
3. Proceed through the notebook as usual.

## 1) What data do we have? (movement-only, pre-split)

We work with **movement** recordings (e.g., `walk`, `run`, `jump`, `pushup`). The recordings are made with a cell phone in a person's pocket. 
Data is already cleaned and split into **train/** and **test/**. Each file includes some combination of:
- **Accelerometer** (`ax, ay, az`)
- **Gyroscope** (`gx, gy, gz`)
- **Magnetometer** (`mx, my, mz`)

We’ll derive the **activity label** from file names, keep consistent columns, and inspect class balance.


In [None]:
from pathlib import Path
import pandas as pd

CONFIG = {
    "DATA_ROOT": "./data/cleaned/acc_mag_gyro",  # fixed path for this edition of the lab
    "WINDOW_SAMPLES": 200,   # ~2s @100 Hz (adjust if needed)
    "WINDOW_STRIDE": 200,
    "RANDOM_SEED": 42,
    "MODELS": ["knn","logreg","linearsvm"],
}

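def read_csv_any(fp: Path):
    # Try common delimiters in turn. A wrong delimiter usually does not raise;
    # it yields a single-column frame, so accept a parse only if it has >1 column.
    last = None
    for sep in [",", ";", "\t"]:
        try:
            df = pd.read_csv(fp, sep=sep)
        except Exception as e:
            last = e
            continue
        if df.shape[1] > 1:
            return df
    # Last resort: let the Python engine sniff the delimiter
    try:
        return pd.read_csv(fp, sep=None, engine="python")
    except Exception as e:
        raise last or e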

def normalize_schema(df: pd.DataFrame):
    df = df.copy()
    if "Unnamed: 0" in df.columns:
        df = df.drop(columns=["Unnamed: 0"])
    # If generic XYZ present, map to accelerometer by default (these cleaned files should already be named, but be safe)
    if set(["X","Y","Z"]).issubset(df.columns):
        df = df.rename(columns={"X":"ax","Y":"ay","Z":"az"})
    return df

def derive_label_from_name(p: Path):
    s = p.name.lower()
    for token in ["sit_to_stand","stand_to_sit","pushup","jump","walk","run","sit","stand","lie"]:
        if token in s: return token
    # fallback: parent dir might be the label
    parent = p.parent.name.lower()
    for token in ["pushup","jump","walk","run","sit","stand","lie"]:
        if token in parent: return token
    return "unknown"

def load_split(split: str):
    root = Path(CONFIG["DATA_ROOT"])/split
    files = sorted(root.rglob("*.csv"))  # sorted for a reproducible load order
    tables=[]
    for f in files:
        try:
            df = read_csv_any(f)
        except Exception as e:
            print("[WARN] read failed:", f, e); continue
        df = normalize_schema(df)
        df["__source_path"] = str(f.relative_to(Path(CONFIG["DATA_ROOT"])))
        df["label"] = derive_label_from_name(f)
        tables.append(df)
    assert tables, f"No CSV files found under {root}"
    data = pd.concat(tables, ignore_index=True)
    return data

train_df = load_split("train")
test_df  = load_split("test")

print("Train rows:", len(train_df), "Test rows:", len(test_df))
print("Train cols:", list(train_df.columns)[:12])
print("Labels (train):", sorted(train_df['label'].unique()))


## 2) Explore labels and signals (visuals)

First, let’s check **class balance** in the train split, then **peek at a few signals** from a random file to build intuition.
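
Class balance matters because a skewed split makes plain accuracy misleading. As a rough illustration (the counts below are made up, not taken from this dataset), a "model" that always predicts the most common class already scores the majority share:

```python
from collections import Counter

# Hypothetical window labels; illustrative only, not the lab's real counts
labels = ["walk"] * 70 + ["run"] * 20 + ["jump"] * 10

counts = Counter(labels)
majority_label, majority_count = counts.most_common(1)[0]

# Accuracy of a "model" that ignores the sensors and always says the majority class
baseline_accuracy = majority_count / len(labels)

print("Counts:", dict(counts))
print("Majority class:", majority_label)
print(f"Always-predict-majority accuracy: {baseline_accuracy:.2f}")
```

Any real classifier should be judged against this kind of floor, which is one reason the lab reports macro-F1 alongside accuracy later on.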


In [None]:
import numpy as np, pandas as pd, matplotlib.pyplot as plt

# Class counts
train_counts = train_df['label'].value_counts().sort_index()
display(train_counts)

# Bar plot
plt.figure(figsize=(6,3))
train_counts.plot(kind='bar')
plt.title("Train class counts")
plt.xlabel("Class")
plt.ylabel("Rows")
plt.tight_layout()
plt.show()

# Quick look at available signal columns
meta_cols = {"__source_path","label","Timestamp","Milliseconds"}
num_cols = [c for c in train_df.columns if c not in meta_cols and pd.api.types.is_numeric_dtype(train_df[c])]
print("Numeric columns (sample):", num_cols[:12])

# Plot a sample recording's time series for intuition
import random
# Seeded so the same sample file is picked on every run
rng = random.Random(CONFIG["RANDOM_SEED"])
sample_path = rng.choice(sorted(train_df['__source_path'].unique()))
sample_df = train_df[train_df['__source_path']==sample_path].reset_index(drop=True)
print("Sample file:", sample_path, "Label:", sample_df['label'].iloc[0])

if set(["ax","ay","az"]).issubset(sample_df.columns):
    plt.figure(figsize=(8,3))  # create the figure only when there is something to plot
    sample_df[["ax","ay","az"]].plot(ax=plt.gca())
    plt.title("Accelerometer (ax, ay, az): sample recording")
    plt.xlabel("Sample index"); plt.ylabel("Acceleration")
    plt.tight_layout(); plt.show()

if set(["gx","gy","gz"]).issubset(sample_df.columns):
    plt.figure(figsize=(8,3))
    sample_df[["gx","gy","gz"]].plot(ax=plt.gca())
    plt.title("Gyroscope (gx, gy, gz): sample recording")
    plt.xlabel("Sample index"); plt.ylabel("Angular velocity")
    plt.tight_layout(); plt.show()

if set(["mx","my","mz"]).issubset(sample_df.columns):
    plt.figure(figsize=(8,3))
    sample_df[["mx","my","mz"]].plot(ax=plt.gca())
    plt.title("Magnetometer (mx, my, mz): sample recording")
    plt.xlabel("Sample index"); plt.ylabel("Magnetic field")
    plt.tight_layout(); plt.show()


## 3) Feature extraction: windows → summaries

We convert each recording into **fixed-length windows** (e.g., 2 seconds) and summarize each window with simple statistics.  
We’ll use **accelerometer + gyroscope** by default (magnetometer is optional) for better movement cues.

**Why windows?** Models work on fixed-size inputs; short windows capture short-lived patterns while keeping computation simple.
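
To make the windowing idea concrete, here is a minimal sketch on a made-up signal. The helper `window_starts` and the 100 Hz sampling rate are assumptions for illustration; the window and stride sizes mirror the `CONFIG` defaults.

```python
import numpy as np

def window_starts(n_samples: int, win: int, stride: int) -> list[int]:
    """Start indices of every full window that fits in a recording."""
    return list(range(0, n_samples - win + 1, stride))

# A hypothetical 10-second recording at an assumed 100 Hz
n_samples = 1000
signal = np.arange(n_samples, dtype=float)

starts_no_overlap = window_starts(n_samples, win=200, stride=200)
starts_half_overlap = window_starts(n_samples, win=200, stride=100)

# Each window is summarized by simple statistics, here mean and peak-to-peak
first = signal[starts_no_overlap[0]: starts_no_overlap[0] + 200]
summary = {"mean": first.mean(), "ptp": first.max() - first.min()}

print(len(starts_no_overlap), "windows without overlap")
print(len(starts_half_overlap), "windows with 50% overlap")
print(summary)
```

A stride equal to the window length gives non-overlapping windows; a smaller stride yields more (but more correlated) training rows from the same recording.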


In [None]:
import numpy as np, pandas as pd

WINDOW_SAMPLES = CONFIG["WINDOW_SAMPLES"]
WINDOW_STRIDE  = CONFIG["WINDOW_STRIDE"]

# Choose a consistent set of signals present in cleaned data
base_signals = [c for c in ["ax","ay","az","gx","gy","gz"] if c in train_df.columns]
opt_signals  = [c for c in ["mx","my","mz"] if c in train_df.columns]  # optional
signal_cols  = base_signals  # change to base_signals+opt_signals to include mag
print("Using signals:", signal_cols)

def make_windows_from_table(df_in: pd.DataFrame, signal_cols, label_col, group_col, win, stride, categories=None):
    """Slice each recording into fixed-length windows and summarize each with mean/std/ptp.

    Pass `categories` (the train label names) when windowing the test split so
    both splits share the same integer label codes.
    """
    feats, labels, sources = [], [], []
    if categories is None:
        categories = list(pd.Categorical(df_in[label_col]).categories)
    for src, sdf in df_in.groupby(group_col, sort=False):
        sdf = sdf.reset_index(drop=True)
        if len(sdf) < win:
            continue  # recording is shorter than one window
        X = sdf[signal_cols].to_numpy(dtype=float)
        # Labels not listed in `categories` are encoded as -1
        y_codes = pd.Categorical(sdf[label_col], categories=categories).codes
        for start in range(0, len(sdf)-win+1, stride):
            stop = start+win
            seg = X[start:stop]
            lab = pd.Series(y_codes[start:stop]).mode().iloc[0]  # majority label in the window
            mu  = np.nanmean(seg, axis=0)
            sd  = np.nanstd(seg, axis=0, ddof=1)
            ptp = np.nanmax(seg, axis=0) - np.nanmin(seg, axis=0)
            row = {}
            for c,v in zip(signal_cols, mu):  row[f"{c}_mean"]=v
            for c,v in zip(signal_cols, sd):  row[f"{c}_std"] =v
            for c,v in zip(signal_cols, ptp): row[f"{c}_ptp"]=v
            feats.append(row); labels.append(lab); sources.append(src)
    Xf = pd.DataFrame(feats)
    y = np.asarray(labels)
    meta = pd.DataFrame({"__source_path": sources})
    return Xf, y, meta, list(categories)

# Build windows from pre-split train/test independently
train_Xf, train_y, train_meta, label_names = make_windows_from_table(
    train_df, signal_cols, "label", "__source_path", WINDOW_SAMPLES, WINDOW_STRIDE
)
# Reuse the train label names so test codes match train codes even if a class is missing in test
test_Xf,  test_y,  test_meta, _           = make_windows_from_table(
    test_df,  signal_cols, "label", "__source_path", WINDOW_SAMPLES, WINDOW_STRIDE,
    categories=label_names
)

print("Train features:", train_Xf.shape, "Test features:", test_Xf.shape, "Classes:", label_names)
display(train_Xf.head(3))


## 4) Feature insights: correlations and a quick 2D view

Let’s look at the **relationships between features** and a **simple 2D projection** to build intuition.


In [None]:
import numpy as np, matplotlib.pyplot as plt

# Correlation heatmap (train only, to avoid leakage)
corr = train_Xf.corr(numeric_only=True)
plt.figure(figsize=(6,5))
plt.imshow(corr, aspect='auto', interpolation='nearest')
plt.title("Feature correlation (train)")
plt.xlabel("Features"); plt.ylabel("Features")
plt.colorbar()
plt.tight_layout(); plt.show()

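# Optional: quick 2D projection using PCA
# Impute and standardize first so that no feature dominates purely because of its scale
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
X_std = StandardScaler().fit_transform(SimpleImputer(strategy="median").fit_transform(train_Xf.values))
X_proj = PCA(n_components=2, random_state=CONFIG["RANDOM_SEED"]).fit_transform(X_std)
plt.figure(figsize=(5,4))
plt.scatter(X_proj[:,0], X_proj[:,1], s=6, c=train_y, cmap="tab10")  # color = class code
plt.title("PCA projection (train features)")
plt.xlabel("PC1"); plt.ylabel("PC2")
plt.tight_layout(); plt.show()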


## 5) Train on train → Evaluate on test (no leakage)

We respect the provided split: **fit on train windows**, **score on test windows**.  
We’ll compare a few baseline models and visualize a **confusion matrix** to see which classes confuse each other.
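
"No leakage" also applies to preprocessing: scaling statistics should be learned from the train split only and then reused on test. A scikit-learn `Pipeline` does this when it is fitted on train data. The NumPy sketch below shows the same idea by hand; the arrays are randomly generated stand-ins, not the lab's features.

```python
import numpy as np

rng = np.random.default_rng(42)
# Made-up feature matrices standing in for train/test windows
X_train = rng.normal(loc=5.0, scale=2.0, size=(100, 3))
X_test = rng.normal(loc=5.5, scale=2.0, size=(40, 3))

# Learn the scaling parameters from train only
mu = X_train.mean(axis=0)
sigma = X_train.std(axis=0)

# Apply the same train-derived parameters to both splits
X_train_scaled = (X_train - mu) / sigma
X_test_scaled = (X_test - mu) / sigma

print("Train mean after scaling:", X_train_scaled.mean(axis=0).round(6))
print("Test mean after scaling: ", X_test_scaled.mean(axis=0).round(3))
```

The scaled test means will generally not be exactly zero. That is expected: the test split is treated as unseen data, so it never contributes to `mu` or `sigma`.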


In [None]:
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score, f1_score, confusion_matrix, classification_report
import numpy as np, pandas as pd, matplotlib.pyplot as plt
from sklearn.base import clone

def build_model(name: str):
    if name=="knn":         clf=KNeighborsClassifier(n_neighbors=5)
    elif name=="logreg":    clf=LogisticRegression(max_iter=2000)
    elif name=="linearsvm": clf=LinearSVC()
    else: raise ValueError(name)
    return Pipeline([("impute", SimpleImputer(strategy="median")),
                     ("scaler", StandardScaler()),
                     ("clf", clf)])

Xtr = train_Xf.to_numpy(dtype=float)
Xte = test_Xf.to_numpy(dtype=float)

rows=[]
best=None; best_metric=-1
for name in CONFIG["MODELS"]:
    model = build_model(name)
    model.fit(Xtr, train_y)   # fit on train windows only
    yhat = model.predict(Xte)
    acc = accuracy_score(test_y, yhat)
    f1m = f1_score(test_y, yhat, average="macro")
    rows.append({"model":name, "acc":acc, "f1_macro":f1m})
    # Note: picking the winner by its test score makes that score slightly optimistic
    if f1m>best_metric:
        best_metric=f1m; best=(name, model, yhat)

leaderboard = pd.DataFrame(rows).sort_values("f1_macro", ascending=False).reset_index(drop=True)
display(leaderboard)

best_name, best_model, best_yhat = best
print("Best model:", best_name)
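# Pass labels explicitly so the report still works if a class is absent from the test split
print(classification_report(test_y, best_yhat, labels=list(range(len(label_names))),
                            target_names=[str(c) for c in label_names], zero_division=0))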

cm = confusion_matrix(test_y, best_yhat, labels=list(range(len(label_names))))
plt.figure(figsize=(5,4))
plt.imshow(cm, interpolation="nearest")
plt.title("Confusion matrix (test)")
plt.xlabel("Predicted"); plt.ylabel("True")
plt.colorbar(); plt.tight_layout(); plt.show()


## 6) Error analysis: where does the model struggle?

Per-class precision/recall highlights which movements are hardest. Use this to guide feature ideas (e.g., add angular features if `run` vs `walk` is confused).
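
As a worked example of how these per-class numbers fall out of a confusion matrix, the sketch below uses a hypothetical 3-class matrix (rows are true classes, columns are predictions, matching scikit-learn's convention). The counts are invented for illustration.

```python
import numpy as np

# Hypothetical 3-class confusion matrix: rows = true class, columns = predicted class
labels = ["jump", "run", "walk"]
cm = np.array([
    [8, 1, 1],   # true jump
    [0, 6, 4],   # true run
    [0, 2, 8],   # true walk
])

true_pos = np.diag(cm)
precision = true_pos / cm.sum(axis=0)  # of everything predicted as a class, how much was right
recall = true_pos / cm.sum(axis=1)     # of everything truly in a class, how much was found
f1 = 2 * precision * recall / (precision + recall)

for name, p, r, f in zip(labels, precision, recall, f1):
    print(f"{name:>5}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
print(f"macro-F1 = {f1.mean():.2f}")
```

In this made-up case `run` has the lowest recall because four of its ten windows are predicted as `walk`, which is the kind of pattern the per-class table is meant to surface.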


In [None]:
import numpy as np, pandas as pd, matplotlib.pyplot as plt
from sklearn.metrics import precision_recall_fscore_support

prec, rec, f1c, supp = precision_recall_fscore_support(test_y, best_yhat, labels=list(range(len(label_names))), zero_division=0)
df_metrics = pd.DataFrame({"label_id":range(len(label_names)), "label":[label_names[i] for i in range(len(label_names))],
                           "precision":prec, "recall":rec, "f1":f1c, "support":supp})
display(df_metrics)

plt.figure(figsize=(7,3))
plt.bar(df_metrics["label"], df_metrics["f1"])
plt.title("Per-class F1 (test)")
plt.xlabel("Class"); plt.ylabel("F1")
plt.xticks(rotation=20); plt.tight_layout(); plt.show()


## 7) Stretch: add simple features (magnitudes)

Try adding vector **magnitudes** within the window (e.g., `√(ax²+ay²+az²)` mean/std/ptp).  
Does this improve separation between `run` and `walk`? Between `jump` and the rest?
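
One plausible reason magnitudes could help (an assumption, not something measured in this lab) is that a phone sits at different angles in a pocket, and the vector magnitude does not change when the sensor axes are rotated. A small sketch with random data and a hypothetical `rotation_z` helper:

```python
import numpy as np

def rotation_z(theta: float) -> np.ndarray:
    """3x3 rotation matrix about the z-axis by theta radians."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

rng = np.random.default_rng(0)
acc = rng.normal(size=(200, 3))  # made-up window of (ax, ay, az) samples

# Simulate the same motion recorded with the phone turned 40 degrees in the pocket
acc_rotated = acc @ rotation_z(np.deg2rad(40)).T

mag = np.sqrt((acc ** 2).sum(axis=1))
mag_rotated = np.sqrt((acc_rotated ** 2).sum(axis=1))

print("Per-axis means:", acc.mean(axis=0).round(3), "->", acc_rotated.mean(axis=0).round(3))
print("Magnitudes match:", np.allclose(mag, mag_rotated))
```

The per-axis values shift with orientation while the magnitude stays the same, so magnitude-based features may generalize better across how the phone was carried.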


In [None]:
import numpy as np, pandas as pd

def add_magnitudes(df_in: pd.DataFrame) -> pd.DataFrame:
    df = df_in.copy()
    if set(["ax","ay","az"]).issubset(df.columns):
        df["a_mag"] = np.sqrt(df["ax"]**2 + df["ay"]**2 + df["az"]**2)
    if set(["gx","gy","gz"]).issubset(df.columns):
        df["g_mag"] = np.sqrt(df["gx"]**2 + df["gy"]**2 + df["gz"]**2)
    if set(["mx","my","mz"]).issubset(df.columns):
        df["m_mag"] = np.sqrt(df["mx"]**2 + df["my"]**2 + df["mz"]**2)
    return df

train_df_aug = add_magnitudes(train_df)
test_df_aug  = add_magnitudes(test_df)

# Re-window with magnitudes included
sig_aug = [c for c in ["ax","ay","az","gx","gy","gz","a_mag","g_mag"] if c in train_df_aug.columns]
print("Signals with magnitudes:", sig_aug)

def make_windows_table(df_in, signals, label_col, group_col, win, stride, categories):
    # `categories` fixes the label -> integer code mapping so train and test agree
    feats, labels = [], []
    for src, sdf in df_in.groupby(group_col, sort=False):
        sdf = sdf.reset_index(drop=True)
        if len(sdf) < win: continue
        X = sdf[signals].to_numpy(dtype=float)
        y_codes = pd.Categorical(sdf[label_col], categories=categories).codes
        for start in range(0, len(sdf)-win+1, stride):
            stop = start+win
            seg = X[start:stop]
            lab = pd.Series(y_codes[start:stop]).mode().iloc[0]
            mu=np.nanmean(seg,0); sd=np.nanstd(seg,0,ddof=1); ptp=np.nanmax(seg,0)-np.nanmin(seg,0)
            row={}
            for c,v in zip(signals, mu):  row[f"{c}_mean"]=v
            for c,v in zip(signals, sd):  row[f"{c}_std"]=v
            for c,v in zip(signals, ptp): row[f"{c}_ptp"]=v
            feats.append(row); labels.append(lab)
    return pd.DataFrame(feats), np.asarray(labels)

# Derive the label categories once, from train, and reuse them for test
label_names2 = sorted(train_df_aug["label"].unique())
train_Xf2, train_y2 = make_windows_table(train_df_aug, sig_aug, "label", "__source_path", CONFIG["WINDOW_SAMPLES"], CONFIG["WINDOW_STRIDE"], label_names2)
test_Xf2,  test_y2  = make_windows_table(test_df_aug,  sig_aug, "label", "__source_path", CONFIG["WINDOW_SAMPLES"], CONFIG["WINDOW_STRIDE"], label_names2)

from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

pipe = Pipeline([("impute", SimpleImputer(strategy="median")),
                 ("scale", StandardScaler()),
                 ("clf", LogisticRegression(max_iter=2000))])

pipe.fit(train_Xf2, train_y2)
yhat2 = pipe.predict(test_Xf2)
print("Macro-F1 with magnitudes:", f1_score(test_y2, yhat2, average="macro"))


## 8) Wrap-up: what you built

- Turned **clean, split sensor data** into windowed features.
- Compared a few **baseline models** and read a **confusion matrix**.
- Used visuals to understand **class balance**, **signal shapes**, **feature relationships**.
- Tried a small feature idea (magnitudes) and measured its impact.

> The same engineering mindset (clean inputs, fair evaluation, modest baselines, careful iteration) scales all the way up to modern AI systems.
