## Fitness Tracker Analysis - Part 1

### 1 - Overview

Inspired by Mark Hoogendoorn's work on the Quantified Self, the availability of sample fitness data, and Dave Ebbelaar's guidance, I wanted to explore machine learning models to identify:

1) the type of movement, 
2) the number of repetitions of that movement, and 
3) the number of sets per workout. 

Such a use case would be helpful because I'm not a fan of writing down how many sets and reps I did for each workout. Instead, the algorithm can track all of that while I go by "feel" and exercise to near-failure, making real progress in strength training rather than stopping at an arbitrary, predetermined number of reps or sets.

### 2 - The Data

The sample data was recorded using the MetaMotionS, a wearable device that offers real-time and continuous monitoring of motion and environmental sensor data.

This data was collected by Dave Ebbelaar when tracking 5 participants as they performed various barbell exercises.

https://mbientlab.com/metamotions/

All the data is stored in the /data/raw/ folder as multiple .csv files. For each participant and exercise, two sets of data were collected:

| Accelerometer data (meters/second-squared) | Gyroscope data (degrees/second) |
| - | - |
| ![accelerometer.jpg](attachment:accelerometer.jpg) | ![gyroscope.jpg](attachment:gyroscope.jpg) |

All the raw data will be read and merged into a single pandas DataFrame containing both accelerometer and gyroscope data.

In [None]:
# Import both pandas and glob libraries
import pandas as pd
from glob import glob

# Set the path to the data files
data_path = "../data/raw/*.csv"
files = glob(data_path)

# Read all files using a function and combine them into 2 sets
def read_data_from_files(files):

    acc_df = pd.DataFrame()
    gyr_df = pd.DataFrame()

    acc_set = 1
    gyr_set = 1

    # Loop through all the files in the directory
    for f in files:

        # Create new features from filename - participant, label, category
        participant = f.split("-")[0].replace(data_path.rstrip("*.csv"), "")
        label = f.split("-")[1]
        category = f.split("-")[2].rstrip("123").rstrip("_MetaWear_2019")

        df = pd.read_csv(f)

        df["participant"] = participant
        df["label"] = label
        df["category"] = category

        # Organize data into two datasets - Accelerometer and Gyroscope
        if "Accelerometer" in f:
            df["set"] = acc_set
            acc_set += 1
            acc_df = pd.concat([acc_df, df])

        if "Gyroscope" in f:
            df["set"] = gyr_set
            gyr_set += 1
            gyr_df = pd.concat([gyr_df, df])

    # Use the epoch timestamps (in ms) as the datetime index of each dataset
    acc_df.index = pd.to_datetime(acc_df["epoch (ms)"], unit="ms")
    gyr_df.index = pd.to_datetime(gyr_df["epoch (ms)"], unit="ms")

    # Remove extra time-based columns (no longer needed)
    del acc_df["epoch (ms)"]
    del acc_df["time (01:00)"]
    del acc_df["elapsed (s)"]

    del gyr_df["epoch (ms)"]
    del gyr_df["time (01:00)"]
    del gyr_df["elapsed (s)"]

    # Return the combined datasets
    return acc_df, gyr_df

# Call the read_data_from_files() function
acc_df, gyr_df = read_data_from_files(files)
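
The filename parsing above relies on `str.rstrip`, which strips any trailing characters found in its argument rather than a literal suffix, so it only works as long as the remaining text does not end in one of those characters. A more defensive variant might look like the sketch below. The helper name `parse_filename` and the sample path are hypothetical; the only assumption carried over from the cell above is that file names follow a `participant-label-category...` pattern.

```python
import re
from pathlib import Path


def parse_filename(path):
    """Return (participant, label, category) parsed from a raw file path."""
    # Drop the directory so "../data/raw/" never leaks into the participant field
    parts = Path(path).name.split("-")
    participant = parts[0]
    label = parts[1]
    # Keep only the leading letters of the third field, e.g. "heavy2" -> "heavy"
    category = re.match(r"[A-Za-z]+", parts[2]).group()
    return participant, label, category


# Hypothetical file name, used only to illustrate the pattern
example = "../data/raw/A-bench-heavy2-rpe8_MetaWear_2019-01-14_Accelerometer.csv"
print(parse_filename(example))
```

Because the category is taken from the start of the field, the result does not depend on what follows it in the file name.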

In [None]:
# Merge the two datasets into a single DataFrame

data_merged = pd.concat([acc_df.iloc[:, :3], gyr_df], axis=1)

data_merged.columns = [
    "acc_x",
    "acc_y",
    "acc_z",
    "gyr_x",
    "gyr_y",
    "gyr_z",
    "participant",
    "label",
    "category",
    "set",
]

Upon inspection of the merged dataset using the ```.head()``` and ```.info()``` methods, we immediately see a problem. The accelerometer and gyroscope data were recorded at different timestamps, so most rows hold readings from one sensor and missing values for the other.

In [None]:
data_merged.info()
data_merged.head()
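
The effect is easy to reproduce on a toy example. The sketch below uses made-up values and timestamps (not the real recordings) to show how a column-wise `pd.concat` aligns on the index and fills unmatched timestamps with `NaN`.

```python
import pandas as pd

# Two toy sensors sampled at different instants (milliseconds since epoch)
acc = pd.DataFrame({"acc_x": [0.1, 0.2]}, index=pd.to_datetime([0, 80], unit="ms"))
gyr = pd.DataFrame(
    {"gyr_x": [1.0, 2.0, 3.0]}, index=pd.to_datetime([10, 50, 90], unit="ms")
)

# Column-wise concat aligns on the index, so unmatched timestamps become NaN
merged = pd.concat([acc, gyr], axis=1)
print(merged)
print(merged.isna().sum())
```

None of the five timestamps is shared, so every row has exactly one missing value.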

To address the timestamp mismatch, we'll resample the data into 200ms intervals, taking the mean of each sensor reading and keeping the last value of the ```participant```, ```label```, ```category```, and ```set``` features. This way, each 200ms interval should contain both accelerometer and gyroscope data.

In [None]:
sampling = {
    "acc_x": "mean",
    "acc_y": "mean",
    "acc_z": "mean",
    "gyr_x": "mean",
    "gyr_y": "mean",
    "gyr_z": "mean",
    "participant": "last",
    "label": "last",
    "category": "last",
    "set": "last",
}

data_merged[:1000].resample(rule="200ms").apply(sampling)
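
To see what the aggregation does to rows that hold only one sensor each, here is a minimal sketch on synthetic values. The frame `toy` and the dictionary `rules` are illustrative stand-ins for `data_merged` and `sampling`.

```python
import numpy as np
import pandas as pd

# Toy frame with the same kind of problem: each row has one sensor only
idx = pd.to_datetime([0, 100, 250, 300, 450], unit="ms")
toy = pd.DataFrame(
    {
        "acc_y": [1.0, 3.0, np.nan, 5.0, np.nan],
        "gyr_y": [np.nan, np.nan, 10.0, np.nan, 20.0],
        "set": [1, 1, 1, 1, 1],
    },
    index=idx,
)

# Mean for sensor readings (NaN values are skipped), last value for metadata
rules = {"acc_y": "mean", "gyr_y": "mean", "set": "last"}
resampled = toy.resample(rule="200ms").apply(rules)
print(resampled)

# Only bins that received readings from both sensors survive dropna()
print(resampled.dropna())
```

The first and last bins contain readings from a single sensor and stay incomplete, which is why a `dropna()` step is still useful after resampling.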


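Now that the merge and resample work on a sample, apply the transformation to the rest of the DataFrame. Resampling the whole time range at once would also create a row for every empty 200ms interval between recording sessions, so the data is first split by day and each day is resampled separately.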

In [None]:
# Split by day
days = [g for n, g in data_merged.groupby(pd.Grouper(freq="D"))]

# Apply sampling to each day for the rest of the dataset
data_resampled = pd.concat(
    [df.resample(rule="200ms").apply(sampling).dropna() for df in days]
)

# Column 'set' was found to have dtype float64 upon inspection, convert to int
data_resampled["set"] = data_resampled["set"].astype("int")
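
The reason for splitting by day can be illustrated with a small sketch. The dates and values below are invented for the example; the point is only the difference in row counts between resampling the whole range and resampling each day on its own.

```python
import pandas as pd

# Two short synthetic recordings, one day apart
idx = pd.to_datetime(
    [
        "2019-01-11 10:00:00.000",
        "2019-01-11 10:00:00.100",
        "2019-01-12 10:00:00.000",
        "2019-01-12 10:00:00.100",
    ]
)
toy = pd.DataFrame({"acc_y": [1.0, 2.0, 3.0, 4.0]}, index=idx)

# Resampling the whole range creates a bin for every 200 ms of the gap
whole = toy.resample(rule="200ms").mean()

# Resampling day by day only creates bins where data was recorded
per_day = pd.concat(
    [g.resample(rule="200ms").mean() for _, g in toy.groupby(pd.Grouper(freq="D"))]
)

print(len(whole), len(per_day))
```

The per-day version avoids allocating and then dropping a very large number of empty rows.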


Inspect the full resampled dataset to make sure there are no missing values for either accelerometer or gyroscope data at 200ms intervals.

In [None]:
data_resampled.info()

Save the resampled dataset in an interim folder:

In [None]:
data_resampled.to_pickle("../data/interim/01_data_processed.pkl")


### 3 - Exploratory Analysis and Visualization

Create visualizations to better understand the accelerometer and gyroscope data before running classification models. Exploratory analysis helps to determine what type of feature engineering or scaling is needed. 

The plots are saved in the `../reports/figures/` folder.

The file naming convention is: '{Label}-({participant}).png', where `{Label}` is the title-cased exercise label.

First, import the plotting libraries:

In [2]:
import matplotlib.pyplot as plt
import matplotlib as mpl
from IPython.display import display

To make the plots look better, adjust the plotting style and settings:

In [3]:
mpl.style.use("seaborn-v0_8-deep")
mpl.rcParams["figure.figsize"] = (20, 5)
mpl.rcParams["figure.dpi"] = 100

In [None]:
# Load the data from the pickle file
df = pd.read_pickle("../data/interim/01_data_processed.pkl")


#### 3.1 - Examine a Single Feature

Plot a single column from a single set and inspect one of the accelerometer features:

In [None]:
# Select a single subset
set_df = df[df["set"] == 1]
set_df

# Plot y-values over the duration of this set
plt.plot(set_df["acc_y"])
plt.show()

# Plot y-values over the number of samples in this set
plt.plot(set_df["acc_y"].reset_index(drop=True))
plt.show()

From the y-axis of the accelerometer, there seems to be a signal of 4 or 5 repetitions within this set. However, there is a lot of "noise" that will need to be smoothed out before training the classification models.
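
As a rough illustration of what smoothing does to such a signal, the sketch below applies a centered rolling mean to a synthetic noisy oscillation. The rolling mean is only a simple stand-in chosen for this example, and the window size is an assumption; the later parts of this project may well use a different filter.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for one set: a slow oscillation plus random noise
rng = np.random.default_rng(0)
t = np.arange(100)
clean = np.sin(2 * np.pi * t / 20)  # 5 cycles over 100 samples
noisy = pd.Series(clean + rng.normal(scale=0.3, size=t.size))

# Centered rolling mean over 5 samples (1 second at 200 ms per sample)
smoothed = noisy.rolling(window=5, center=True, min_periods=1).mean()

# Mean absolute error against the clean signal, before and after smoothing
error_noisy = np.abs(noisy - clean).mean()
error_smoothed = np.abs(smoothed - clean).mean()
print(error_noisy, error_smoothed)
```

A wider window removes more noise but also flattens the peaks that mark each repetition, so the window size is a trade-off.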

#### 3.2 - Examine Exercise Patterns

Next, plot all data for each exercise:

In [None]:
for label in df["label"].unique():
    subset = df[df["label"] == label]
    fig, ax = plt.subplots()
    plt.plot(subset["acc_y"].reset_index(drop=True), label=label)
    plt.legend()
    plt.show()

While there are clear patterns in each exercise over the entire dataset, it is useful to plot a small portion of each exercise to make the patterns clearer:

In [None]:
# Plot only the first 100 samples for each exercise
for label in df["label"].unique():
    subset = df[df["label"] == label]
    fig, ax = plt.subplots()
    plt.plot(subset[:100]["acc_y"].reset_index(drop=True), label=label)
    plt.legend()
    plt.show()

#### 3.3 - Examine Medium vs. Heavy Sets

Now we want to examine the ```category``` feature:

In [None]:
df["category"].unique()

For each set, we have two categories worth exploring: heavy and medium. We want to visualize comparisons between medium and heavy sets.

In [None]:
# Stacked queries to pull squat data for participant A
# Add reset_index to make sure the index is continuous (No. of samples)
category_df = df.query("label == 'squat'").query("participant == 'A'").reset_index()

# Group by category (heavy vs. medium) and plot the y-values
fig, ax = plt.subplots()
category_df.groupby(["category"])["acc_y"].plot()
ax.set_ylabel("acc_y")
ax.set_xlabel("samples")
plt.legend()

There is a clear difference between medium and heavy sets, with faster acceleration on the y-axis for medium sets and slower acceleration for heavy sets.

#### 3.4 - Examine Exercise Patterns Among Participants

In [None]:
# Pull bench data for all participants (all categories)
participant_df = df.query("label == 'bench'").sort_values("participant").reset_index()

# Group by participant and plot the y-values for benching
fig, ax = plt.subplots()
participant_df.groupby(["participant"])["acc_y"].plot()
ax.set_ylabel("acc_y")
ax.set_xlabel("samples")
plt.legend()

Exercise patterns among different participants seem similar for bench press and other exercises, so the classification algorithm should be able to generalize to unseen participants without engineering features based on differences between participants.
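
One way to test that expectation later would be to hold out all data from one participant when evaluating a model. The sketch below shows such a split on a tiny invented frame; `split_by_participant` is a hypothetical helper, not part of this notebook.

```python
import pandas as pd


def split_by_participant(df, held_out):
    """Return (train, test) where test holds every row of one participant."""
    test_mask = df["participant"] == held_out
    return df[~test_mask], df[test_mask]


# Invented rows that mimic the columns of the processed dataset
toy = pd.DataFrame(
    {
        "participant": list("AABBC"),
        "label": ["bench"] * 5,
        "acc_y": [0.1, 0.2, 0.3, 0.4, 0.5],
    }
)

train, test = split_by_participant(toy, held_out="A")
print(train["participant"].unique(), test["participant"].unique())
```

If accuracy on the held-out participant stays close to accuracy on the others, the visual impression from the plot is supported.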

#### 3.5 - Plot All Accelerometer Variables In the Same Graph

Examine all accelerometer features ```acc_x```, ```acc_y```, and ```acc_z``` in the same graph.

In [None]:
label = "squat"
participant = "A"
all_axis_df = (
    df.query(f"label == '{label}'")
    .query(f"participant == '{participant}'")
    .reset_index()
)

fig, ax = plt.subplots()
all_axis_df[["acc_x", "acc_y", "acc_z"]].plot(ax=ax)
ax.set_ylabel("acceleration")
ax.set_xlabel("samples")
plt.legend()


#### 3.6 - Plot All Combinations for Both Accelerometer and Gyroscope Sensors

For each exercise type and participant, plot accelerometer and gyroscope data, and save each plot to file in the `../reports/figures/` folder.

**Warning: Running this cell will create a large number of plots.**

In [None]:
labels = df["label"].unique()
participants = df["participant"].unique()

# Loop through all combinations of labels and participants
for label in labels:
    for participant in participants:
        combined_plot_df = (
            df.query(f"label == '{label}'")
            .query(f"participant == '{participant}'")
            .reset_index()
        )

        if len(combined_plot_df) > 0:
            fig, ax = plt.subplots(nrows=2, sharex=True, figsize=(20, 10))
            combined_plot_df[["acc_x", "acc_y", "acc_z"]].plot(ax=ax[0])
            combined_plot_df[["gyr_x", "gyr_y", "gyr_z"]].plot(ax=ax[1])

            ax[0].legend(
                loc="upper center",
                bbox_to_anchor=(0.5, 1.15),
                ncol=3,
                fancybox=True,
                shadow=True,
            )
            ax[1].legend(
                loc="upper center",
                bbox_to_anchor=(0.5, 1.15),
                ncol=3,
                fancybox=True,
                shadow=True,
            )
            ax[1].set_xlabel("samples")
            plt.savefig(f"../reports/figures/{label.title()}-({participant}).png")
            plt.close()
            
print("All plots have been saved to the ../reports/figures/ folder.")


## To be continued...

To handle outliers and feature engineering, continue with Part 2 of this notebook here:

/notebooks/nk-fitness-tracker-feature-engineering-2.ipynb