
# Exploring Machine Learning through Sensor Data

This notebook is a conceptual, hands-on exploration of machine learning using sensor data (accelerometer & gyroscope).  
Follow the workflow: **load → explore → preprocess → feature analysis → model → reflect**.

## Setting up your environment

1. **Create and activate a Python virtual environment** (recommended):
   - macOS / Linux:
     ```bash
     python -m venv ml_env
     source ml_env/bin/activate
     ```
   - Windows (PowerShell):
     ```powershell
     python -m venv ml_env
     .\ml_env\Scripts\Activate.ps1
     ```

2. **Install dependencies** (from the folder containing `requirements.txt`):
   ```bash
   pip install -r requirements.txt
   ```

3. **Run the notebook**:
   ```bash
   jupyter notebook Exploring_ML_Sensor_Data.ipynb
   ```
   or:
   ```bash
   jupyter lab
   ```

4. **Using Visual Studio Code**:
   - Install the **Python** and **Jupyter** extensions.
   - Open the folder containing this notebook in VS Code, select the correct interpreter, open the `.ipynb`, then use the Run UI.

5. **Google Colab (web option)**:
   - Go to https://colab.research.google.com → Upload Notebook → choose this file.
   - Optionally upload `requirements.txt` to the Colab session and run `!pip install -r requirements.txt` in the first cell.


In [None]:

# Imports and configuration
import os, glob
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.metrics import classification_report, confusion_matrix, ConfusionMatrixDisplay, accuracy_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Dropout

sns.set(style='whitegrid')
plt.rcParams['figure.figsize'] = (10,4)
SEED = 42
np.random.seed(SEED)
tf.random.set_seed(SEED)  # also seed TensorFlow so the optional LSTM demo is more repeatable


In [None]:

# Environment check: print versions
import sys, numpy as _np, pandas as _pd, sklearn as _sk, seaborn as _sns, matplotlib as _mpl
try:
    import tensorflow as _tf
    tf_version = _tf.__version__
except Exception:
    tf_version = 'NOT INSTALLED'
print('Python:', sys.version.splitlines()[0])
print('NumPy:', _np.__version__)
print('Pandas:', _pd.__version__)
print('scikit-learn:', _sk.__version__)
print('Seaborn:', _sns.__version__)
print('Matplotlib:', _mpl.__version__)
print('TensorFlow:', tf_version)
print('\nIf any are missing, run: pip install -r requirements.txt')



## Overview

This notebook guides you through a practical ML pipeline for sensor data:
1. Load raw sensor data (CSV) or generate synthetic data
2. Explore raw signals visually
3. Preprocess: cleaning, smoothing, resampling, scaling
4. Compare before/after preprocessing visually
5. Explore and select features (manual selection encouraged)
6. Train and compare simple models (KNN, Softmax/Logistic, SVM, Random Forest, optional LSTM)
7. Reflect on results and robustness


## Synthetic data (optional)

Run this cell to create a synthetic dataset that mimics accelerometer + gyroscope readings.

In [None]:

# Create a synthetic dataset with X,Y,Z and GX,GY,GZ + labels and time
classes = ['walk','run','jump','pushup']
N = 500
rows = []
for cls in classes:
    if cls == 'walk':
        acc_base = [1.0, 1.2, 0.8]; gyr_base = [0.2, 0.1, 0.05]
    elif cls == 'run':
        acc_base = [3.0, 2.5, 3.5]; gyr_base = [0.6, 0.5, 0.4]
    elif cls == 'jump':
        acc_base = [5.0, 4.5, 6.0]; gyr_base = [0.8, 0.7, 0.6]
    else:
        acc_base = [0.5, 0.4, 0.6]; gyr_base = [0.15, 0.1, 0.05]
    acc = np.random.normal(loc=acc_base, scale=0.5, size=(N,3))
    gyr = np.random.normal(loc=gyr_base, scale=0.2, size=(N,3))
    dfc = pd.DataFrame(np.hstack([acc,gyr]), columns=['X','Y','Z','GX','GY','GZ'])
    dfc['label'] = cls
    dfc['time'] = pd.date_range('2021-01-01', periods=N, freq='100ms')
    dfc['source_file'] = f'synthetic_{cls}.csv'
    rows.append(dfc)
df_synthetic = pd.concat(rows, ignore_index=True)
print('Synthetic dataset ready — shape:', df_synthetic.shape)
df_synthetic.head()



## Loading your CSV data (raw)

Options:
- **Auto-detection**: search a folder for CSVs (mode='auto')
- **Manual**: provide a list of files (mode='manual')
- **Synthetic**: use the included synthetic data (mode='synthetic')

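The loader does **not** clean or transform the sensor values. It reads the raw CSVs, builds a `time` column, normalizes column names, and infers a label from each filename, so you can inspect the data and choose preprocessing steps yourself.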
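
Because labels come from filenames, the naming convention matters. The sketch below illustrates the idea with a hypothetical helper, `label_from_filename`, which is not part of the notebook; the example paths are made up. Names without an underscore are worth checking against the loader's output, since where the file extension ends up depends on how the split is done.

```python
import os

def label_from_filename(path):
    """Hypothetical helper: drop the extension, then keep the text before the first underscore."""
    stem = os.path.splitext(os.path.basename(path))[0]
    return stem.split('_')[0]

# Made-up example paths following a '<label>_<recording>.csv' convention
paths = ['./data/run_1.csv', './data/jump_1.csv', './data/walk.csv']
labels = [label_from_filename(p) for p in paths]
print(labels)
```

A convention such as `<label>_<recording>.csv` keeps several recordings of the same activity under one label.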


In [None]:

# Flexible data loader
import glob, os

mode = 'synthetic'   # 'auto', 'manual', or 'synthetic'
data_root = './data' # used for 'auto' mode
manual_files = []    # e.g. ['./data/run_1.csv', './data/jump_1.csv']

def load_sensor_csv(path):
    try:
        df = pd.read_csv(path)
    except Exception as e:
        print('Failed to read', path, e)
        return pd.DataFrame()
    # Build a unified time column; fall back to the row position if no timestamp is present
    if 'Timestamp' in df.columns and 'Milliseconds' in df.columns:
        df['time'] = pd.to_datetime(df['Timestamp']) + pd.to_timedelta(df['Milliseconds'], unit='ms')
    elif 'Timestamp' in df.columns:
        df['time'] = pd.to_datetime(df['Timestamp'])
    else:
        df['time'] = pd.RangeIndex(start=0, stop=len(df))
    # Normalize column names (accept common variants, case-insensitive)
    aliases = {'X': 'X', 'AX': 'X', 'Y': 'Y', 'AY': 'Y', 'Z': 'Z', 'AZ': 'Z',
               'GX': 'GX', 'GY': 'GY', 'GZ': 'GZ'}
    colmap = {c: aliases[c.strip().upper()] for c in df.columns
              if c.strip().upper() in aliases}
    df = df.rename(columns=colmap)
    # Infer the label from the filename: the text before the first underscore, extension removed
    basename = os.path.basename(path)
    label = os.path.splitext(basename)[0].split('_')[0]
    df['label'] = label
    df['source_file'] = basename
    return df

# Load according to mode
if mode == 'synthetic':
    df_raw = df_synthetic.copy()
    print('Using synthetic data (mode=synthetic)')
elif mode == 'auto':
    files = glob.glob(os.path.join(data_root, '**', '*.csv'), recursive=True)
    if not files:
        print('No CSVs found under', data_root, '- falling back to synthetic')
        df_raw = df_synthetic.copy()
    else:
        print('Found', len(files), 'CSV files. Loading...') 
        dfs = [load_sensor_csv(p) for p in files]
        df_raw = pd.concat(dfs, ignore_index=True)
        print('Combined rows:', len(df_raw))
else:  # manual
    if not manual_files:
        print('No files in manual_files — using synthetic data instead')
        df_raw = df_synthetic.copy()
    else:
        dfs = [load_sensor_csv(p) for p in manual_files]
        df_raw = pd.concat(dfs, ignore_index=True)
        print('Loaded manual files. Combined rows:', len(df_raw))

print('\nColumns available:', df_raw.columns.tolist())
print('Sample labels:', pd.Series(df_raw.get('label', [])).unique()[:10])
df_raw.head()



### Reflection — Observing Raw Data

Review the `df_raw.head()` output above, then the descriptive statistics printed in the next section. Look for:
- Uneven timestamps or irregular sampling intervals
- Missing values or NaNs
- Inconsistent column names or scales between files

Keep notes — these observations will inform your preprocessing choices.
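
The first two checks can be made concrete with a few lines of pandas. This is a minimal sketch on a made-up four-row recording, not on your data; `demo`, `intervals_ms`, and `missing_per_column` are illustrative names.

```python
import pandas as pd

# Made-up recording: nominal 100 ms spacing, one 300 ms gap, one missing reading
times = pd.Timestamp('2021-01-01') + pd.to_timedelta([0, 100, 200, 500], unit='ms')
demo = pd.DataFrame({'time': times, 'X': [1.0, None, 1.2, 0.9]})

# Interval between consecutive samples, in milliseconds
intervals_ms = demo['time'].diff().dt.total_seconds().mul(1000).round().dropna()

# Number of missing values in each column
missing_per_column = demo.isna().sum()

print(intervals_ms.value_counts())
print(missing_per_column)
```

If `intervals_ms` shows more than one distinct value, the sampling is irregular and resampling may be worth trying. The same two expressions should work on `df_raw` when its `time` column holds real timestamps, ideally applied per source file.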


## Visual exploration of raw signals

In [None]:

# Quick visual exploration
if 'df_raw' not in globals():
    print('Run the data loader cell first.')
else:
    df = df_raw.copy()
    if 'time' in df.columns:
        df['time'] = pd.to_datetime(df['time'], errors='coerce')
    print('Samples per label:'); display(df['label'].value_counts())
    # Choose a label to visualize
    lbl = df['label'].dropna().unique()[0] if 'label' in df.columns else None
    if lbl is not None:
        sample = df[df['label']==lbl].iloc[:1000]
        plot_cols = [c for c in ['X','Y','Z','GX','GY','GZ'] if c in sample.columns]
        if plot_cols:
            plt.figure(figsize=(12,4))
            for c in plot_cols:
                sns.lineplot(data=sample, x='time' if 'time' in sample.columns else sample.index, y=c, label=c)
            plt.title(f'Raw signals for label={lbl} (first samples)')
            plt.show()
    display(df.describe().T)


## Preprocessing: cleaning, smoothing, resampling, scaling

This pipeline is intentionally *editable*. Try toggling options to see how each step affects the data.
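
To build intuition before toggling, here is a small sketch of what gap filling and a 5-sample moving average do to a single spike. The series is made up; the window size matches the one used in the cell below.

```python
import numpy as np
import pandas as pd

# Made-up signal: one spike of 10 and two missing readings
signal = pd.Series([np.nan, 0.0, 10.0, np.nan, 0.0, 0.0])

# Forward fill, then backward fill for any gap at the very start
filled = signal.ffill().bfill()

# Trailing moving average; min_periods=1 keeps the first rows instead of producing NaN
smoothed = filled.rolling(window=5, min_periods=1).mean()

print(filled.tolist())
print(smoothed.round(2).tolist())
```

Two side effects are visible here: forward filling copies the spike into the gap that follows it, and the moving average lowers the peak while spreading it over later samples.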

In [None]:

# Preprocessing toggles (edit and re-run to experiment)
fill_missing = True
apply_smoothing = True
apply_standardization = True
resample_to_hz = None  # e.g., 20

if 'df_raw' not in globals():
    print('Run data loader first.')
else:
    df_processed = df_raw.copy()
    if fill_missing:
        # fillna(method=...) is deprecated in recent pandas; ffill()/bfill() are the direct equivalents
        df_processed = df_processed.ffill().bfill()
    if resample_to_hz and 'time' in df_processed.columns:
        out = []
        for lbl, g in df_processed.groupby('label'):
            g = g.set_index('time').sort_index()
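            # 'ms' replaces the deprecated 'L' alias; numeric_only skips text columns such as source_file
            g_res = g.resample(f'{int(1000/resample_to_hz)}ms').mean(numeric_only=True).interpolate()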
            g_res['label'] = lbl
            out.append(g_res.reset_index())
        df_processed = pd.concat(out, ignore_index=True)
        print(f'Resampled per label to {resample_to_hz} Hz')
    if apply_smoothing:
        cols = [c for c in ['X','Y','Z','GX','GY','GZ'] if c in df_processed.columns]
        # 5-sample moving average over the whole frame; a few rows at each label boundary get blended
        df_processed[cols] = df_processed[cols].rolling(window=5, min_periods=1).mean()
    if apply_standardization:
        num_cols = df_processed.select_dtypes(include=['number']).columns.tolist()
        num_cols = [c for c in num_cols if c not in ['Milliseconds']]
        # Fitted on all rows, which is fine for exploration; the modeling cell scales its own splits
        scaler = StandardScaler()
        df_processed[num_cols] = scaler.fit_transform(df_processed[num_cols])
    print('Preprocessing complete — processed shape:', df_processed.shape)
    display(df_processed.head())


## Visual comparison: before vs after preprocessing

In [None]:

# Compare raw vs processed visually (first 500 samples)
if 'df_raw' not in globals() or 'df_processed' not in globals():
    print('Run the loader and preprocessing cells first.')
else:
    cols = [c for c in ['X','Y','Z','GX','GY','GZ'] if c in df_raw.columns][:3]
    if not cols:
        cols = df_raw.select_dtypes('number').columns[:3].tolist()
    fig, axes = plt.subplots(len(cols), 1, figsize=(12, 3*len(cols)), sharex=True)
    for i, c in enumerate(cols):
        ax = axes[i] if len(cols)>1 else axes
        sns.lineplot(data=df_raw[c].iloc[:500], ax=ax, label='raw', linewidth=1)
        sns.lineplot(data=df_processed[c].iloc[:500], ax=ax, label='processed', linewidth=1)
        ax.set_ylabel(c); ax.legend()
    plt.xlabel('sample index (first 500)')
    plt.tight_layout(); plt.show()


## Feature exploration: manual selection

Edit the `selected_features` list to choose which features to analyze (pairplot + correlation heatmap).
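
When reading the heatmap, values near +1 or -1 mean two features carry almost the same information, while values near 0 mean they vary independently. The sketch below uses made-up data in which `Y` is nearly a scaled copy of `X` and `Z` is unrelated; it is not drawn from the notebook's dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x = rng.normal(size=500)

# Made-up features: Y is almost a scaled copy of X, Z is independent noise
demo = pd.DataFrame({
    'X': x,
    'Y': 2 * x + rng.normal(scale=0.1, size=500),
    'Z': rng.normal(size=500),
})

# Pearson correlation matrix, the same quantity the heatmap colors encode
corr = demo.corr()
print(corr.round(2))
```

A pair like `X` and `Y` here is a candidate for dropping one of the two features, since the second adds little that the first does not already provide.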

In [None]:

# Edit this list to choose features to explore
selected_features = ['X','Y','Z']  # change as needed based on your dataset
avail = df_processed.columns.tolist() if 'df_processed' in globals() else []
selected = [s for s in selected_features if s in avail]
if not selected:
    print('No valid selected features found in processed data. Update the list.')
else:
    print('Selected features:', selected)


In [None]:

if 'df_processed' in globals() and selected:
    sns.pairplot(df_processed[selected].sample(min(1000, len(df_processed)), random_state=SEED), corner=True)
    plt.suptitle('Pairplot of selected features', y=1.02)
    plt.show()
    plt.figure(figsize=(6,5))
    sns.heatmap(df_processed[selected].corr(), annot=True, cmap='coolwarm', center=0)
    plt.title('Correlation heatmap'); plt.show()


## From features to models

Train simple models and observe differences conceptually.
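
One detail worth keeping in mind when scaling features for a model: the scaling statistics should come from the training split only and then be reused on the test split, so that no information from the test set leaks into training. The sketch below shows the idea in plain NumPy on made-up data; `StandardScaler` computes the same per-column mean and population standard deviation in `fit` and applies them in `transform`.

```python
import numpy as np

rng = np.random.default_rng(42)

# Made-up feature matrices standing in for a train/test split
X_train = rng.normal(loc=5.0, scale=2.0, size=(200, 3))
X_test = rng.normal(loc=5.0, scale=2.0, size=(50, 3))

# "Fit": statistics come from the training rows only
mu = X_train.mean(axis=0)
sigma = X_train.std(axis=0)

# "Transform": the same statistics are applied to both splits
X_train_s = (X_train - mu) / sigma
X_test_s = (X_test - mu) / sigma

print('train mean:', X_train_s.mean(axis=0).round(3))
print('test mean: ', X_test_s.mean(axis=0).round(3))
```

The scaled test means come out close to zero but not exactly zero, which is expected: the test set is being measured with the training set's ruler.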

In [None]:

# Prepare numeric features for modeling
if 'df_processed' not in globals():
    print('Run preprocessing first.')
else:
    num_cols = [c for c in df_processed.columns if c not in ['label','time','source_file','Milliseconds'] and pd.api.types.is_numeric_dtype(df_processed[c])]
    if not num_cols:
        raise ValueError('No numeric features found for modeling')
    print('Numeric columns used for modeling:', num_cols[:12])
    X = df_processed[num_cols].fillna(0).values
    y = df_processed['label'].values
    le = LabelEncoder(); y_enc = le.fit_transform(y)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, stratify=y, random_state=SEED)
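    # Fit the scaler on the training split only, then apply the same transform to the test split
    model_scaler = StandardScaler().fit(X_train)
    X_train_s = model_scaler.transform(X_train)
    X_test_s = model_scaler.transform(X_test)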
    print('Prepared train/test datasets — shapes:', X_train_s.shape, X_test_s.shape)


In [None]:

# KNN
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train_s, y_train)
knn_pred = knn.predict(X_test_s)
print('KNN accuracy:', accuracy_score(y_test, knn_pred))
print(classification_report(y_test, knn_pred))
ConfusionMatrixDisplay(confusion_matrix(y_test, knn_pred), display_labels=le.classes_).plot(cmap='Blues')
plt.title('KNN Confusion Matrix'); plt.show()


In [None]:

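# Logistic Regression (Softmax)
# multi_class is deprecated in recent scikit-learn; saga already fits a multinomial (softmax) model for multiclass targets
lr = LogisticRegression(max_iter=1000, solver='saga', random_state=SEED)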
lr.fit(X_train_s, y_train)
lr_pred = lr.predict(X_test_s)
print('Logistic Regression accuracy:', accuracy_score(y_test, lr_pred))
print(classification_report(y_test, lr_pred))
ConfusionMatrixDisplay(confusion_matrix(y_test, lr_pred), display_labels=le.classes_).plot(cmap='Oranges')
plt.title('Logistic Regression Confusion Matrix'); plt.show()


In [None]:

# SVM
svm = SVC(kernel='rbf', C=1, gamma='scale')
svm.fit(X_train_s, y_train)
svm_pred = svm.predict(X_test_s)
print('SVM accuracy:', accuracy_score(y_test, svm_pred))
print(classification_report(y_test, svm_pred))
ConfusionMatrixDisplay(confusion_matrix(y_test, svm_pred), display_labels=le.classes_).plot(cmap='Purples')
plt.title('SVM Confusion Matrix'); plt.show()


In [None]:

# Random Forest
rf = RandomForestClassifier(n_estimators=100, random_state=SEED)
rf.fit(X_train, y_train)
rf_pred = rf.predict(X_test)
print('Random Forest accuracy:', accuracy_score(y_test, rf_pred))
print(classification_report(y_test, rf_pred))
importances = pd.Series(rf.feature_importances_, index=num_cols).sort_values(ascending=False)
plt.figure(figsize=(8,6))
importances.head(12).plot(kind='barh')
plt.title('Top feature importances (Random Forest)')
plt.gca().invert_yaxis()
plt.show()


## Optional: small LSTM demo (conceptual)

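This toy demo reshapes each sample's feature vector into a short pseudo-sequence of 3 steps when the feature count is divisible by 3. The steps are not consecutive sensor readings, so treat it as a conceptual illustration only.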
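
For contrast, sequence models are more commonly fed windows of consecutive readings. The sketch below shows one way to cut a recording into overlapping windows of shape `(n_windows, timesteps, n_channels)`. `make_windows` is a hypothetical helper, not part of the notebook, and the window length and stride are arbitrary choices.

```python
import numpy as np

def make_windows(signal, timesteps, stride):
    """Hypothetical helper: cut a (n_samples, n_channels) array into overlapping windows."""
    starts = range(0, len(signal) - timesteps + 1, stride)
    return np.stack([signal[s:s + timesteps] for s in starts])

# Made-up recording: 10 consecutive samples from 2 channels
signal = np.arange(20, dtype=float).reshape(10, 2)

windows = make_windows(signal, timesteps=4, stride=2)
print(windows.shape)
```

Each window would then need a single label (for example, the label of the recording it came from), and windows should be cut per recording so that none of them spans two activities.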

In [None]:

# LSTM demo (runs only when reshape is feasible)
n_features = X_train_s.shape[1]
if n_features >= 3 and n_features % 3 == 0:
    timesteps = 3
    step = n_features // timesteps
    X_seq = X_train_s.reshape((X_train_s.shape[0], timesteps, step))
    X_seq_test = X_test_s.reshape((X_test_s.shape[0], timesteps, step))
    y_seq = le.transform(y_train)
    y_seq_test = le.transform(y_test)
    model = Sequential([LSTM(32, input_shape=(timesteps, step)), Dense(len(le.classes_), activation='softmax')])
    model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
    model.fit(X_seq, y_seq, epochs=4, batch_size=32, validation_split=0.1, verbose=2)
    loss, acc = model.evaluate(X_seq_test, y_seq_test, verbose=0)
    print('LSTM demo accuracy:', acc)
else:
    print('Skipping LSTM demo — reshape not suitable for current feature count')



## Experiments & Reflection

Try experiments:
- Simulate pocket tightness by scaling accelerometer channels and re-run models.
- Remove gyroscope features and compare performance.
- Inject noise or drop segments and observe model robustness.
- Use cross-validation grouped by recording session (for example, by `source_file`) so that samples from one recording never appear in both the training and test folds.
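
The first three experiments can be prototyped with a small perturbation step applied before re-running the preprocessing and model cells. This is a minimal sketch: `perturb` is a hypothetical helper, the demo frame is made up, and the scale and noise values are arbitrary starting points rather than calibrated settings.

```python
import numpy as np
import pandas as pd

def perturb(df, acc_scale=1.0, noise_std=0.0, seed=42):
    """Hypothetical helper: scale accelerometer channels and add Gaussian noise to all sensor channels."""
    rng = np.random.default_rng(seed)
    out = df.copy()
    acc_cols = [c for c in ['X', 'Y', 'Z'] if c in out.columns]
    sensor_cols = [c for c in ['X', 'Y', 'Z', 'GX', 'GY', 'GZ'] if c in out.columns]
    out[acc_cols] = out[acc_cols] * acc_scale
    noise = rng.normal(0.0, noise_std, size=(len(out), len(sensor_cols)))
    out[sensor_cols] = out[sensor_cols] + noise
    return out

# Made-up two-row frame with the same column names as the notebook's data
demo = pd.DataFrame({'X': [1.0, 2.0], 'Y': [0.5, 0.5], 'Z': [0.0, 1.0],
                     'GX': [0.1, 0.2], 'label': ['walk', 'walk']})

scaled = perturb(demo, acc_scale=0.5)    # damped accelerometer amplitude
noisy = perturb(demo, noise_std=0.1)     # additive sensor noise
gyro_removed = demo.drop(columns=[c for c in ['GX', 'GY', 'GZ'] if c in demo.columns])

print(scaled)
print(gyro_removed.columns.tolist())
```

For a robustness check, one option is to train on the unperturbed data and evaluate on a perturbed copy of the test rows, then compare the scores with the baseline.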

**Reflection prompts**:
1. Which preprocessing step changed the signals most — and why?  
2. Which model performed best — and can you explain why based on the features?  
3. How would you adapt data collection or model design for a production system?
