# Replication Plan: *Classifying Motorcyclist Behaviour with XGBoost Based on IMU Data* (Navratil & Giannopoulos)

**Objective:** Reproduce the paper’s results by building a feature-based XGBoost classifier on IMU data to recognize rider behaviours.


## 1) Data Overview
- **Source:** TU Wien motorcycle IMU dataset (accelerometer + gyroscope).
- **Sampling:** ~50 Hz (per authors).
- **Typical columns:** `timestamp`, `ax, ay, az` (m/s²), `gx, gy, gz` (rad/s), `label`.
- **Classes:** one per subdirectory of the training data, e.g. `cruise`, `fun`, `overtake`, `traffic`, `wait`.


In [None]:
from pathlib import Path

import pandas as pd

# Root folder with one subdirectory per class; adjust to the local machine.
base_path = Path(r"C:\Users\Studium\Documents\uni\ma_projekt_arbeit\code_msc_projektarbeit\data\TrainingData")

# Each subdirectory name is used as the class label.
class_dirs = sorted(d for d in base_path.iterdir() if d.is_dir())

dfs = []
for class_dir in class_dirs:
    label = class_dir.name  # e.g. "cruise", "fun", "overtake", "traffic", "wait"
    for file in sorted(class_dir.glob("*.csv")):
        df = pd.read_csv(file, delimiter=",")
        df["label"] = label
        df["source_file"] = file.name  # remember which recording each row came from
        dfs.append(df)

# One long table containing all recordings of all classes.
data = pd.concat(dfs, ignore_index=True)

print("Shape:", data.shape)
print("Columns:", data.columns.tolist())
print(data.head())

## 2) Preprocessing
- **Sync & sort:** Ensure per-file chronological order by `timestamp`.
- **Denoise (optional):** e.g. a light median or low-pass filter.

- **Windowing:** fixed-length sliding windows, e.g. `window = 2.0 s` (≈100 samples at ~50 Hz), `stride = 0.5 s`.
- Each window → one feature vector + majority `label` within the window.
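
The windowing step could be sketched as follows. This is only an illustration: it assumes a constant 50 Hz rate (so 2.0 s = 100 samples and 0.5 s = 25 samples), and the helper names `sliding_windows` and `majority_label` are hypothetical. On the real data the windows would be generated per `source_file` so that no window spans two recordings.

```python
import numpy as np
import pandas as pd


def sliding_windows(n_samples: int, window: int = 100, stride: int = 25):
    """Yield (start, stop) index pairs; incomplete trailing windows are dropped."""
    for start in range(0, n_samples - window + 1, stride):
        yield start, start + window


def majority_label(labels: pd.Series) -> str:
    """Most frequent label in the window (ties resolved by sorted order)."""
    return labels.mode().iloc[0]


# Toy recording: 250 samples (5 s at the assumed 50 Hz); label changes at sample 140.
toy = pd.DataFrame({
    "ax": np.arange(250, dtype=float),
    "label": ["cruise"] * 140 + ["wait"] * 110,
})

windows = list(sliding_windows(len(toy)))
window_labels = [majority_label(toy["label"].iloc[a:b]) for a, b in windows]

print(len(windows), "windows")
print(window_labels)
```

Dropping the incomplete last window keeps every feature vector based on the same number of samples; padding would be an alternative design choice.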


In [7]:
# Inspect the first raw timestamp. This cell needs the loading cell above to
# have been run in the same kernel; otherwise `data` is undefined (NameError).
print(data["SampleTimeFine"].iloc[0])

# Always sort by the time column before windowing, for example:
# df["timestamp"] = pd.to_datetime(df["Systemzeit"], unit="ms")
# df = df.sort_values("timestamp").reset_index(drop=True)

## 3) Feature Engineering (per axis & per sensor)
For each of `ax, ay, az, gx, gy, gz` within a window:
- **Stats:** mean, std, var, min, max, median, IQR, skew, kurtosis
- **Dynamics:** zero-crossing rate, slope (linear fit β₁), derivative stats
- **Energy:** sum of squares / window length
- **Magnitudes:** accel |a| = √(ax²+ay²+az²), gyro |g| = √(gx²+gy²+gz²) with same stats
- **Orientation cues (optional):** tilt/roll/pitch proxies if available
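
A minimal sketch of the per-window feature extraction listed above, using only NumPy and pandas. The function names and feature suffixes are hypothetical. Two choices here are assumptions rather than anything stated in the plan: the zero-crossing rate is computed on the mean-removed signal (a raw accelerometer axis carrying gravity may never cross zero), and skew/kurtosis use the bias-corrected pandas estimators.

```python
import numpy as np
import pandas as pd

AXES = ["ax", "ay", "az", "gx", "gy", "gz"]


def channel_features(x: np.ndarray, prefix: str) -> dict:
    """Statistical, dynamic and energy features for one 1-D signal."""
    x = np.asarray(x, dtype=float)
    s = pd.Series(x)
    dx = np.diff(x)                                   # first difference as derivative proxy
    signs = np.signbit(x - x.mean())                  # assumption: mean-removed signal
    q75, q25 = np.percentile(x, [75, 25])
    return {
        f"{prefix}_mean": x.mean(),
        f"{prefix}_std": x.std(),                     # population std (ddof=0)
        f"{prefix}_var": x.var(),
        f"{prefix}_min": x.min(),
        f"{prefix}_max": x.max(),
        f"{prefix}_median": np.median(x),
        f"{prefix}_iqr": q75 - q25,
        f"{prefix}_skew": s.skew(),
        f"{prefix}_kurt": s.kurt(),                   # excess kurtosis
        f"{prefix}_zcr": np.mean(signs[1:] != signs[:-1]),
        f"{prefix}_slope": np.polyfit(np.arange(len(x)), x, 1)[0],
        f"{prefix}_diff_mean": dx.mean(),
        f"{prefix}_diff_std": dx.std(),
        f"{prefix}_energy": np.mean(x ** 2),          # sum of squares / window length
    }


def window_features(win: pd.DataFrame) -> dict:
    """Features for all six axes plus accel and gyro magnitudes."""
    feats = {}
    for col in AXES:
        feats.update(channel_features(win[col].to_numpy(), col))
    acc_mag = np.sqrt(win["ax"] ** 2 + win["ay"] ** 2 + win["az"] ** 2)
    gyr_mag = np.sqrt(win["gx"] ** 2 + win["gy"] ** 2 + win["gz"] ** 2)
    feats.update(channel_features(acc_mag.to_numpy(), "acc_mag"))
    feats.update(channel_features(gyr_mag.to_numpy(), "gyr_mag"))
    return feats


# Synthetic 100-sample window with known properties, for a quick sanity check.
t = np.arange(100)
toy_window = pd.DataFrame({
    "ax": 2.0 * t + 1.0,                    # linear ramp, slope 2
    "ay": np.full(100, 3.0),
    "az": np.full(100, 4.0),
    "gx": np.where(t % 2 == 0, 1.0, -1.0),  # sign flips at every sample
    "gy": np.zeros(100),
    "gz": np.zeros(100),
})
feats = window_features(toy_window)
print(len(feats), "features")
```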

**Post-process**
- Handle NaNs/inf, **standardize** features (fit on train only).
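
One way to implement this post-processing with scikit-learn is sketched below. The median imputation is an assumption, not something the plan prescribes (XGBoost can also handle `NaN` natively); the key point is that both the imputer and the scaler are fitted on the training features only. The small arrays are made-up values.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler


def to_nan(X) -> np.ndarray:
    """Copy X as float and replace +/-inf with NaN so the imputer can handle them."""
    X = np.array(X, dtype=float)
    X[~np.isfinite(X)] = np.nan
    return X


# Made-up feature matrices containing NaN and inf entries.
X_train = np.array([[1.0, 10.0], [2.0, np.inf], [3.0, 30.0], [np.nan, 20.0]])
X_test = np.array([[2.0, -np.inf], [4.0, 40.0]])

imputer = SimpleImputer(strategy="median")   # assumption: median imputation
scaler = StandardScaler()

# Fit on the training set only, then reuse the fitted statistics for the test set.
X_train_std = scaler.fit_transform(imputer.fit_transform(to_nan(X_train)))
X_test_std = scaler.transform(imputer.transform(to_nan(X_test)))

print(X_train_std.mean(axis=0), X_train_std.std(axis=0))
```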

## 4) Train/Validation Protocol
- **Split:** 70/30 train/test. Prefer **grouped splits** (by ride/session) to avoid leakage.
- **CV:** 10-fold cross-validation on the training set (stratified or grouped).
- **Model:** XGBoost `XGBClassifier`
  - Starting point: `n_estimators=400`, `max_depth=6`, `learning_rate=0.05`,
    `subsample=0.8`, `colsample_bytree=0.8`, `reg_lambda=1.0`.
- **Tuning (optional):** small grid/Optuna over depth, learning rate, trees.

## 5) Metrics & Reporting
- **Primary:** Accuracy, macro-F1, per-class Precision/Recall/F1.
- **Diagnostics:** Confusion matrix, ROC/PR (per class if needed).
- **Explainability:**
  - XGBoost feature importance (gain/weight).
  - **SHAP**: summary plot + class-wise beeswarm to confirm salient axes/windows.
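
The primary metrics and the confusion matrix can be computed with scikit-learn as sketched here. The labels and predictions are made-up toy values for illustration only, not results; in the notebook they would be the held-out labels and the model's predictions.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, classification_report,
                             confusion_matrix, f1_score)

class_names = ["cruise", "overtake", "wait"]        # placeholder class mapping

# Toy labels and predictions (encoded as 0, 1, 2).
y_true = np.array([0, 0, 0, 0, 1, 1, 2, 2, 2, 2])
y_pred = np.array([0, 0, 0, 1, 1, 0, 2, 2, 2, 2])

acc = accuracy_score(y_true, y_pred)
macro_f1 = f1_score(y_true, y_pred, average="macro")   # unweighted mean of per-class F1
cm = confusion_matrix(y_true, y_pred, labels=[0, 1, 2])  # rows: true, columns: predicted

print(f"accuracy={acc:.3f}  macro-F1={macro_f1:.3f}")
print(cm)
print(classification_report(y_true, y_pred, target_names=class_names, digits=3))
```

Macro-F1 weights every class equally, which makes it more informative than accuracy when some behaviours are rare.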


## 6) Reproducibility Checklist
- `random_state` fixed (e.g., 42) for split and model.
- Persist train/test split indices and CV fold assignments (CSV/JSON).
- Log: feature list, window params, class mapping, model params.
- Environment: `python`, `xgboost`, `numpy`, `pandas`, `scikit-learn`, `shap` versions.
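
A small sketch of how the checklist items could be written to a single JSON file using only the standard library. The file name `run_config.json`, the dictionary layout, the index lists, and the class mapping are all illustrative placeholders; the temporary directory is used only so the example runs anywhere.

```python
import json
import platform
import tempfile
from importlib import metadata
from pathlib import Path


def package_versions(names) -> dict:
    """Installed versions of the given packages (None if a package is missing)."""
    versions = {"python": platform.python_version()}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


run_config = {
    "random_state": 42,
    "window_s": 2.0,
    "stride_s": 0.5,
    "class_mapping": {"cruise": 0, "wait": 1},      # placeholder mapping
    "model_params": {"n_estimators": 400, "max_depth": 6, "learning_rate": 0.05},
    "train_idx": [0, 1, 2, 5],                      # placeholder indices
    "test_idx": [3, 4],
    "versions": package_versions(
        ["xgboost", "numpy", "pandas", "scikit-learn", "shap"]
    ),
}

# Write the log and read it back to confirm it round-trips unchanged.
out_path = Path(tempfile.mkdtemp()) / "run_config.json"
out_path.write_text(json.dumps(run_config, indent=2))
restored = json.loads(out_path.read_text())
print(restored["versions"])
```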


## 7) Expected Outcome
- High overall accuracy / macro-F1 with gyro + accel features.
- Most informative features: gyro magnitude stats, accel variance/std during dynamic manoeuvres.
