# EEE4114F ML Project Code Part 1: Loading and Understanding the Data

The first step in building our human activity recognition (HAR) classifier was to determine which subset of the available data to use. The dataset comprises multiple files and sensor modalities, but it is neither efficient nor always beneficial to use all of them. Therefore, we evaluated which sensor signals were likely to contribute most meaningfully to classification accuracy, while balancing computational cost and potential overfitting.

The dataset is divided into three main folders:

- **(A) DeviceMotion_data:** Contains a rich combination of accelerometer, gyroscope, and orientation-related features — 12 features in total.

- **(B) Accelerometer_data:** Includes only the raw accelerometer readings (x, y, z) — 3 features.

- **(C) Gyroscope_data:** Includes only the raw gyroscope readings (x, y, z) — 3 features.


From these, **(A) DeviceMotion_data** provides the most comprehensive set of features, combining both linear and rotational data, as well as derived orientation estimates:

- **Attitude (roll, pitch, yaw)** showing device orientation (e.g., facing up or down)
- **Gravity (x, y, z)** showing static acceleration (the device's orientation with respect to gravity)
- **Rotation Rate (x, y, z)** showing angular velocity measured by the gyroscope
- **User Acceleration (x, y, z)** showing dynamic body acceleration caused by the user's motion


**Not confirmed**

We opted to use data from **(A) DeviceMotion_data**, as it contains the richer temporal and contextual signals needed for activity recognition. Rather than using all 12 features, we selected a subset that balances informativeness against model complexity: **Attitude (roll, pitch, yaw)** and **User Acceleration (x, y, z)**. Attitude gives the device's orientation, which helps detect postural changes, while user acceleration captures the dynamic body movement that is essential for distinguishing activities such as walking, sitting, or jogging.
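As a minimal sketch of this selection, assuming the DeviceMotion columns follow the `sensor.axis` naming used in the plotting cells of this notebook (e.g. `attitude.roll`, `userAcceleration.x`), the six chosen features can be picked out of a 12-column frame as shown here. The frame is synthetic, and the names `ALL_COLUMNS`, `SELECTED` and `frame` are illustrative only.

```python
import numpy as np
import pandas as pd

# All 12 DeviceMotion feature columns (naming assumed from the plotting cells)
ALL_COLUMNS = [
    "attitude.roll", "attitude.pitch", "attitude.yaw",
    "gravity.x", "gravity.y", "gravity.z",
    "rotationRate.x", "rotationRate.y", "rotationRate.z",
    "userAcceleration.x", "userAcceleration.y", "userAcceleration.z",
]

# The subset discussed above: orientation plus dynamic body acceleration
SELECTED = [c for c in ALL_COLUMNS if c.startswith(("attitude.", "userAcceleration."))]

# Synthetic stand-in for one recording (5 time steps), not real data
rng = np.random.default_rng(0)
frame = pd.DataFrame(rng.normal(size=(5, len(ALL_COLUMNS))), columns=ALL_COLUMNS)

subset = frame[SELECTED]
print(subset.shape)  # 5 rows, 6 selected feature columns
```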

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

### Useful functions
Here are some useful functions copied from motionsense.ipynb. We did not copy get_ds_infos() because we do not need the subject information. Instead, we edited the code so that it defaults to subjects 1 to 24 without reading the subject info file.

In [2]:
def set_data_types(data_types=["userAcceleration"]):
    """
    Select the sensors used to shape the final dataset.

    Args:
        data_types: A list of sensor data types chosen from: [attitude, gravity, rotationRate, userAcceleration]

    Returns:
        A list of columns to use for creating time-series from files.
    """
    dt_list = []
    for t in data_types:
        if t != "attitude":
            dt_list.append([t+".x",t+".y",t+".z"])
        else:
            dt_list.append([t+".roll", t+".pitch", t+".yaw"])
    return dt_list

ACT_LABELS = ["dws","ups", "wlk", "jog", "std", "sit"]
TRIAL_CODES = {
    ACT_LABELS[0]:[1,2,11],
    ACT_LABELS[1]:[3,4,12],
    ACT_LABELS[2]:[7,8,15],
    ACT_LABELS[3]:[9,16],
    ACT_LABELS[4]:[6,14],
    ACT_LABELS[5]:[5,13]
}
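To make the return value of `set_data_types` concrete, the short sketch below repeats the helper (with a tuple default instead of a list, which is our own small change) and calls it for the two sensors we intend to use. It returns one group of three column names per sensor.

```python
def set_data_types(data_types=("userAcceleration",)):
    """Map each sensor name to the three CSV column names it occupies."""
    dt_list = []
    for t in data_types:
        if t != "attitude":
            dt_list.append([t + ".x", t + ".y", t + ".z"])
        else:
            # Attitude is expressed as angles rather than x/y/z components
            dt_list.append([t + ".roll", t + ".pitch", t + ".yaw"])
    return dt_list

dt_list = set_data_types(["attitude", "userAcceleration"])
for axes in dt_list:
    print(axes)
# ['attitude.roll', 'attitude.pitch', 'attitude.yaw']
# ['userAcceleration.x', 'userAcceleration.y', 'userAcceleration.z']
```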

In [3]:
def create_time_series(dt_list, act_labels, trial_codes, subject_ids=None, mode="mag", labeled=True):
    """
    Defines what data to include for a given set, using selected sensors and subjects.

    Args:
        dt_list: List of sensor columns to include.
        act_labels: List of activity labels (e.g. ["dws", "ups", "wlk"...]).
        trial_codes: Dictionary mapping activity to trial numbers.
        subject_ids: List of subject IDs to include. Example: [1, 2, ..., 24]
        mode: "raw" = keep all sensor components; "mag" = magnitude only.
        labeled: True to include activity labels.

    Returns:
        A pandas DataFrame containing time-series sensor data.
    """
    if subject_ids is None:
        subject_ids = list(range(1, 25))  # Default: use all subjects

    num_data_cols = len(dt_list) if mode == "mag" else len(dt_list * 3)
    dataset = np.zeros((0, num_data_cols + 1)) if labeled else np.zeros((0, num_data_cols))

    print("[INFO] -- Creating Time-Series")
    for sub_id in subject_ids:
        for act_id, act in enumerate(act_labels):
            for trial in trial_codes[act]:
                fname = f'/Users/olivekschonfeldt/Library/CloudStorage/OneDrive-UniversityofCapeTown/EEE4114F DSP/ML Project 2025/motion-sense-master/data/A_DeviceMotion_data/{act}_{trial}/sub_{sub_id}.csv'
                try:
                    raw_data = pd.read_csv(fname)
                    raw_data = raw_data.drop(['Unnamed: 0'], axis=1)
                    vals = np.zeros((len(raw_data), num_data_cols))
                    for x_id, axes in enumerate(dt_list):
                        if mode == "mag":
                            vals[:, x_id] = (raw_data[axes] ** 2).sum(axis=1) ** 0.5
                        else:
                            vals[:, x_id * 3:(x_id + 1) * 3] = raw_data[axes].values
                        vals = vals[:, :num_data_cols]
                    if labeled:
                        lbls = np.array([[act_id]] * len(raw_data))
                        vals = np.concatenate((vals, lbls), axis=1)
                    dataset = np.append(dataset, vals, axis=0)
                except FileNotFoundError:
                    print(f"[WARNING] File not found: {fname}. Skipping.")
                    continue

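    cols = []
    for axes in dt_list:
        # In "mag" mode, name the column after the sensor itself,
        # e.g. "attitude.roll" -> "attitude" and "gravity.x" -> "gravity"
        cols += axes if mode == "raw" else [axes[0].split(".")[0]]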

    if labeled:
        cols += ["act"]

    dataset = pd.DataFrame(data=dataset, columns=cols)
    return dataset
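The difference between the two `mode` options of `create_time_series` can be sketched on a tiny synthetic frame: "raw" keeps the three components of a sensor as separate columns, while "mag" collapses them into a single Euclidean norm per sample. The values below are made up so that the magnitudes are easy to check by hand.

```python
import numpy as np
import pandas as pd

# Two synthetic samples of a three-axis sensor (not real data)
raw_data = pd.DataFrame({
    "userAcceleration.x": [3.0, 0.0],
    "userAcceleration.y": [4.0, 0.0],
    "userAcceleration.z": [0.0, 2.0],
})
axes = ["userAcceleration.x", "userAcceleration.y", "userAcceleration.z"]

# "raw" mode: keep the three components as separate columns
raw_vals = raw_data[axes].values  # shape (2, 3)

# "mag" mode: one value per sample, sqrt(x^2 + y^2 + z^2)
mag_vals = ((raw_data[axes] ** 2).sum(axis=1) ** 0.5).values  # shape (2,)

print(raw_vals.shape)
print(mag_vals)  # magnitudes are 5.0 and 2.0
```

Using "mag" reduces the feature count by a factor of three but discards direction, which is why this notebook loads the data with `mode="raw"`.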

### Loading the data
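Here we extract the data to obtain our new **dataset**. We load all four sensor types (attitude, gravity, rotationRate and userAcceleration) for all activity types so that each can be inspected in the plots below, although our intended feature subset is attitude (roll, pitch, yaw) and userAcceleration (x, y, z).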

In [None]:
# Set the parameters used to build a labeled time series from "(A) DeviceMotion_data"

sdt = ["attitude", "gravity", "rotationRate", "userAcceleration"]
print("[INFO] -- Selected sensor data types: "+str(sdt))
act_labels = ACT_LABELS  # includes all six activities
print("[INFO] -- Selected activities: "+str(act_labels))
trial_codes = {act: TRIAL_CODES[act] for act in act_labels}
dt_list = set_data_types(sdt)
dataset = create_time_series(dt_list, act_labels, trial_codes, mode="raw", labeled=True)
print("[INFO] -- Shape of time-series dataset: "+str(dataset.shape))
dataset.head()

[INFO] -- Selected sensor data types: ['attitude', 'gravity', 'rotationRate', 'userAcceleration']
[INFO] -- Selected activities: ['dws', 'ups', 'wlk', 'jog', 'std', 'sit']
[INFO] -- Creating Time-Series


### Understanding sample size
Here we gain an understanding of the sample size of each activity so that we can assess the balance of the dataset. This helps identify if any classes are underrepresented, which may affect model training and performance.


In [None]:
# Count the number of samples per activity label
activity_counts = dataset["act"].value_counts().sort_index()

# Convert numeric labels back to activity names
activity_names = act_labels  # this must match the order of label IDs: 0, 1, 2, ...
activity_labels = [activity_names[int(i)] for i in activity_counts.index]

# Plot
plt.figure(figsize=(8,5))
plt.bar(activity_labels, activity_counts.values, color='skyblue')
plt.xlabel("Activity")
plt.ylabel("Number of Samples")
plt.title("Number of Samples per Activity")
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.tight_layout()
plt.show()

This shows that the less-represented classes, such as downstairs, upstairs and jogging, will likely be harder for the model to recognise accurately.
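One common way to compensate for such imbalance is to weight each class inversely to its frequency during training. The sketch below is only an illustration of that heuristic, not something this notebook does yet; the label counts are synthetic and the names `labels` and `class_weights` are our own.

```python
import pandas as pd

# Synthetic label column (0=dws, 1=ups, 2=wlk, 3=jog, 4=std, 5=sit), not the real counts
labels = pd.Series([0] * 10 + [1] * 10 + [2] * 40 + [3] * 20 + [4] * 60 + [5] * 60, name="act")

counts = labels.value_counts().sort_index()
n_samples, n_classes = len(labels), len(counts)

# "Balanced" heuristic: weight_c = n_samples / (n_classes * count_c)
class_weights = (n_samples / (n_classes * counts)).to_dict()

for act_id, weight in class_weights.items():
    print(act_id, round(weight, 3))
```

Rare classes receive weights above 1 and frequent classes receive weights below 1, so errors on the rare classes cost more during training.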

Plotting the sensor signals for each activity was also beneficial because it helps us distinguish features such as:
- Dynamic activities such as climbing or descending stairs, walking and jogging show large peaks and dips, while static activities such as standing and sitting stay within a much flatter range of values.
- Walking and jogging have a regular 'cadence' to them.

However, we have a few concerns:
- There seems to be some 'noise' present in sitting (and possibly in other activities, which is still to be checked).
- Walking downstairs and walking upstairs look very similar even though they are two different directions of motion.
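The 'cadence' observation can be quantified by looking for the dominant frequency of a signal. The sketch below does this with a real-valued FFT on a synthetic, walking-like oscillation. The sampling rate `FS = 50.0` is an assumption that should be checked against the dataset documentation, and `dominant_frequency` is a hypothetical helper, not part of the original code.

```python
import numpy as np

FS = 50.0  # sampling rate in Hz (assumption; verify against the dataset documentation)

def dominant_frequency(signal, fs=FS):
    """Return the frequency (Hz) of the largest non-DC peak in the magnitude spectrum."""
    signal = np.asarray(signal, dtype=float)
    spectrum = np.abs(np.fft.rfft(signal - signal.mean()))
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fs)
    # Skip bin 0 (the DC component) when searching for the peak
    return freqs[1:][np.argmax(spectrum[1:])]

# Synthetic "walking-like" signal: a 2 Hz oscillation plus a little noise
rng = np.random.default_rng(0)
t = np.arange(600) / FS
walk_like = np.sin(2 * np.pi * 2.0 * t) + 0.1 * rng.normal(size=t.size)

print(dominant_frequency(walk_like))
```

A feature like this could also help with the second concern, since activities that look similar in the time domain may still differ in step frequency.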

### Visualisation of Attitude Data per Activity
Here we plot the first 600 samples per activity for the attitude data.

In [None]:
# Define sensor axes to plot
sensor_axes = ['attitude.roll', 'attitude.pitch', 'attitude.yaw']

for i, act_name in enumerate(act_labels):
    subset = dataset[dataset['act'] == i].head(600)
    plt.figure(figsize=(6, 0.5))
    plt.title(act_name)
    for axis in sensor_axes:
        plt.plot(subset.index, subset[axis], label=axis)
    plt.xticks([]) # turn off x labels
    plt.yticks([])  # turn off y labels
    plt.show()

### Visualisation of Gravity Data per Activity
Here we plot the first 600 samples per activity for the gravity data.

In [None]:
# Define sensor axes to plot
sensor_axes = ['gravity.x', 'gravity.y', 'gravity.z']

for i, act_name in enumerate(act_labels):
    subset = dataset[dataset['act'] == i].head(600)
    plt.figure(figsize=(6, 0.5))
    plt.title(act_name)
    for axis in sensor_axes:
        plt.plot(subset.index, subset[axis], label=axis)
    plt.xticks([]) # turn off x labels
    plt.yticks([])  # turn off y labels
    plt.show()
    


### Visualisation of Rotation Rate Data per Activity
Here we plot the first 600 samples per activity for the rotation rate data.

In [None]:
# Define sensor axes to plot
sensor_axes = ['rotationRate.x', 'rotationRate.y', 'rotationRate.z']

for i, act_name in enumerate(act_labels):
    subset = dataset[dataset['act'] == i].head(600)
    plt.figure(figsize=(6, 0.5))
    plt.title(act_name)
    for axis in sensor_axes:
        plt.plot(subset.index, subset[axis], label=axis)
    plt.xticks([]) # turn off x labels
    plt.yticks([])  # turn off y labels
    plt.show()

### Visualisation of Acceleration Data per Activity
Here we plot the first 600 samples per activity for the user acceleration data.

In [None]:
# Define sensor axes to plot
sensor_axes = ['userAcceleration.x', 'userAcceleration.y', 'userAcceleration.z']

for i, act_name in enumerate(act_labels):
    subset = dataset[dataset['act'] == i].head(600)
    plt.figure(figsize=(6, 0.5))
    plt.title(act_name)
    for axis in sensor_axes:
        plt.plot(subset.index, subset[axis], label=axis)
    plt.xticks([]) # turn off x labels
    plt.yticks([])  # turn off y labels
    plt.show()
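The four plotting cells above differ only in the `sensor_axes` list, so they could be folded into one helper. The following is a minimal sketch under that assumption; `plot_sensor_per_activity` is a hypothetical name, the demo frame is synthetic, and the slightly taller figure size is our own choice for readability.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

def plot_sensor_per_activity(dataset, sensor_axes, act_labels, n_samples=600):
    """Plot the first n_samples of each activity for the given sensor columns."""
    figures = []
    for i, act_name in enumerate(act_labels):
        subset = dataset[dataset["act"] == i].head(n_samples)
        fig, ax = plt.subplots(figsize=(6, 1.5))
        ax.set_title(act_name)
        for axis in sensor_axes:
            ax.plot(subset.index, subset[axis], label=axis)
        ax.set_xticks([])  # turn off x labels
        ax.set_yticks([])  # turn off y labels
        figures.append(fig)
    return figures

# Synthetic stand-in for `dataset`: two activities with 20 samples each
cols = ["userAcceleration.x", "userAcceleration.y", "userAcceleration.z"]
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.normal(size=(40, 3)), columns=cols)
demo["act"] = np.repeat([0, 1], 20)

figs = plot_sensor_per_activity(demo, cols, ["dws", "ups"], n_samples=10)
print(len(figs))  # one figure per activity
```

With the real data, each section would then reduce to a single call such as `plot_sensor_per_activity(dataset, sensor_axes, act_labels)`.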