In [None]:
import matplotlib.pyplot as plt
from dotenv import load_dotenv
import numpy as np
import seaborn as sns


np.random.seed(1337)
load_dotenv()

# Data partitioning

The data is partitioned into 5 folds for cross-validation, each with a training and a validation set. By default, the split is made by session ID, so all segments from the same session land in the same fold. This prevents one source of data leakage between the training and validation sets: a person's movement within a session is likely to be similar across segments. Other sources of leakage remain, such as the same person making the same movement in different sessions, but with so little data available there is limited room to guard against them.
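
To make the session-level split concrete, here is a minimal sketch. It is not the logic used in `data_pipeline.py` (which also stratifies by class); it assumes each segment is a `(segment_id, session_id)` pair and uses a hypothetical helper, `assign_sessions_to_folds`. The point is only that the fold is chosen per session, so every segment inherits its session's fold.

```python
import random
from collections import defaultdict


def assign_sessions_to_folds(session_ids, k_folds=5, seed=1337):
    """Map each unique session ID to a fold index (hypothetical helper)."""
    unique_sessions = sorted(set(session_ids))
    rng = random.Random(seed)
    rng.shuffle(unique_sessions)
    # Deal sessions round-robin so fold sizes differ by at most one session.
    return {session: i % k_folds for i, session in enumerate(unique_sessions)}


# Toy data: 10 sessions with a varying number of segments each.
segments = [
    (f"seg_{s}_{n}", f"session_{s}")
    for s in range(10)
    for n in range(2 + s % 3)
]

fold_of_session = assign_sessions_to_folds([sid for _, sid in segments])

# Every segment inherits the fold of its session.
folds = defaultdict(list)
for segment_id, session_id in segments:
    folds[fold_of_session[session_id]].append((segment_id, session_id))

for fold_idx in sorted(folds):
    val = folds[fold_idx]
    sessions = sorted({sid for _, sid in val})
    print(f"Fold {fold_idx + 1} validation: {len(val)} segments from {sessions}")
```

Because sessions contain different numbers of segments, the folds in this toy example hold the same number of sessions but not the same number of segments.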

This notebook will analyse the partitioning made by the `data_pipeline.py` script. It will load the data from the partitions and check for overlapping session IDs between the training and validation sets. It will also look at the distribution of the classes in each fold to ensure that the data is partitioned correctly.

## Load data from Parquet

This notebook requires the data to be partitioned into Parquet files, which is done by the `data_pipeline.py` script. The script partitions the data into 5 folds and saves them to the `data/partitions` directory. You can run it with the following command:

```bash
python src/data_pipeline.py partitioning.k_folds=5
```

In [None]:
PARTITIONS_PATH = "../data/partitions"

from src.utils import get_partition_paths, get_partitioned_data

partitions_paths = get_partition_paths(PARTITIONS_PATH, k_folds=5)
data = get_partitioned_data(partitions_paths)
print(data.keys())

The data is loaded successfully. The keys in the data dictionary are the following:
- `folds`: Contains the training and validation sets for each fold.
- `train_all`: Contains all training data. With k-fold cross-validation, every training row appears in exactly one fold's validation set, so this equals the union of the validation sets across folds.
- `test`: Contains the test data. Serves as a holdout set to evaluate the model after training.
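
The relationship between `train_all` and the per-fold validation sets can be checked directly. The sketch below works on plain lists of IDs rather than the notebook's DataFrames. The helper name `validation_sets_partition` is hypothetical, and the toy IDs stand in for what would presumably be the `session_id` columns of `train_all` and of each fold's validation set.

```python
def validation_sets_partition(train_all_ids, val_ids_per_fold):
    """Return True if the validation sets are pairwise disjoint and cover train_all."""
    seen = set()
    for val_ids in val_ids_per_fold:
        val_ids = set(val_ids)
        if seen & val_ids:  # an ID appears in more than one validation set
            return False
        seen |= val_ids
    # The union of all validation sets must equal the full training set.
    return seen == set(train_all_ids)


# Toy session IDs (assumed, not taken from the dataset).
train_all_ids = ["s1", "s2", "s3", "s4", "s5", "s6"]
val_ids_per_fold = [["s1", "s2"], ["s3"], ["s4", "s5"], ["s6"]]

print(validation_sets_partition(train_all_ids, val_ids_per_fold))
```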

In [None]:
for i, fold in enumerate(data["folds"]):
    fold_dir, train_data, val_data = fold.values()
    print(f"Fold {i + 1} | Train: {train_data.shape} | Validation: {val_data.shape} | {fold_dir}")
    
print(f"Train all shape: {data['train_all'].shape}")
print(f"Test shape: {data['test'].shape}")

Looking at the shapes of the training and validation sets for each fold, we can see that the training set has more data than the validation set. As we split by session ID, the ratios between the training and validation sets are not perfectly even across folds, because sessions can have varying numbers of segments and lengths.

In [None]:
ratios = []
for i, fold in enumerate(data["folds"]):
    _, train_data, val_data = fold.values()
    total = train_data.shape[0] + val_data.shape[0]
    ratio = (train_data.shape[0] / total, val_data.shape[0] / total)
    ratios.append(ratio)
    
plt.figure(figsize=(12, 6))

barWidth = 0.3
r1 = np.arange(len(ratios))
r2 = [x + barWidth for x in r1]

plt.bar(r1, [r[0] for r in ratios], color='b', width=barWidth, edgecolor='grey', label='Train')
plt.bar(r2, [r[1] for r in ratios], color='r', width=barWidth, edgecolor='grey', label='Validation')

for i, r in enumerate(ratios):
    plt.text(i, r[0] / 2, f'{r[0]*100:.2f}%', ha='center', va='center', color='white')
    plt.text(i + barWidth, r[1] / 2, f'{r[1]*100:.2f}%', ha='center', va='center', color='white')

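# Centre one tick under each pair of bars and label folds from 1, as in the text.
plt.xticks(r1 + barWidth / 2, [str(i + 1) for i in range(len(ratios))])
plt.xlabel('Fold')
plt.ylabel('Proportion of rows')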
plt.title('Train vs Validation ratio for each fold')
plt.legend()
plt.show()

The plot shows the percentage ratio of the training and validation sets for each fold. The ratios are close to 80% training and 20% validation for each fold but as we split by session ID, the ratios can vary slightly.

## Data distribution by class for each partition

After looking at the partitioning, we can check the distribution of the classes in each fold. This is to ensure that the data is partitioned correctly and that the classes are distributed evenly across the training and validation sets. Stratification is part of the default partitioning process, so the classes should be distributed approximately the same across the training and validation sets.

In [None]:
def plot_fold_label_distribution(y_train, y_val, title):
    """Plot per-class counts for the training and validation sets side by side."""
    fig, axes = plt.subplots(1, 2, figsize=(12, 6))
    # Use the same class order in both panels so the bars line up.
    order = sorted(set(y_train['label']) | set(y_val['label']))

    for ax, df, name in zip(axes, (y_train, y_val), ('Train', 'Validation')):
        sns.countplot(x='label', data=df, ax=ax, order=order)
        ax.set_title(name)
        total = df.shape[0]
        # Annotate each bar with its share of the set.
        for p in ax.patches:
            percentage = f'{100 * p.get_height() / total:.1f}%'
            x = p.get_x() + p.get_width() / 2
            ax.annotate(percentage, (x, p.get_height()), ha='center', va='bottom')

    fig.suptitle(title)
    plt.show()

for i, fold in enumerate(data["folds"]):
    _, train_data, val_data = fold.values()
    plot_fold_label_distribution(train_data, val_data, f'Class distribution for fold {i + 1}')

The class distribution for each fold is plotted above. The classes are distributed approximately the same across the training and validation sets but there are some variations.
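
The plots can be complemented with a single number per fold. One option, sketched here as a suggestion rather than something the pipeline computes, is the total variation distance between the training and validation label distributions: half the sum of absolute differences in class proportions, ranging from 0 (identical) to 1 (disjoint). The labels below are toy values, not the dataset's classes.

```python
from collections import Counter


def label_proportions(labels):
    """Return {label: share of all labels}."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


def total_variation_distance(train_labels, val_labels):
    """Half the L1 distance between the two label distributions."""
    p = label_proportions(train_labels)
    q = label_proportions(val_labels)
    classes = set(p) | set(q)
    return 0.5 * sum(abs(p.get(c, 0.0) - q.get(c, 0.0)) for c in classes)


# Toy labels (assumed for illustration only).
train_labels = ["walk"] * 50 + ["run"] * 30 + ["sit"] * 20
val_labels = ["walk"] * 10 + ["run"] * 12 + ["sit"] * 3

tvd = total_variation_distance(train_labels, val_labels)
print(f"Total variation distance: {tvd:.3f}")
```

In the notebook, the inputs would be the `label` columns of a fold's training and validation sets, which makes it easy to rank folds by how far their distributions diverge.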

For example, in fold 5 the class distribution differs noticeably between the training and validation sets, mainly because there is not enough data to balance every fold. One way to reduce the imbalance, at the cost of less training data, is to truncate long sessions. By default, sessions are truncated to a maximum total length of 180 seconds; this can be changed via the `preprocessing.max_session_length_s` parameter in the `configs/preprocessing/default.yaml` file. A lower limit makes sessions contribute more evenly to the class distribution, but the model will have less data to train on.
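
As an illustration of what a session length limit does, the sketch below keeps only the rows that fall within the first `max_session_length_s` seconds of each session. This is an assumption about the behaviour, not the pipeline's code: the real implementation may choose a different window or handle the boundary differently, and the helper name `truncate_sessions` and the `(session_id, time_s)` row layout are made up for the example.

```python
from collections import Counter


def truncate_sessions(rows, max_session_length_s=180.0):
    """Keep rows within the first `max_session_length_s` seconds of each session.

    `rows` is an iterable of (session_id, time_s) tuples (assumed layout).
    """
    rows = list(rows)
    # Find the earliest timestamp of every session.
    start = {}
    for session_id, time_s in rows:
        start[session_id] = min(time_s, start.get(session_id, time_s))
    # Keep a row only if it lies inside the allowed window (boundary excluded).
    return [(sid, t) for sid, t in rows if t - start[sid] < max_session_length_s]


# Toy data: session "a" spans 570 s, session "b" spans 90 s.
rows = [("a", float(t)) for t in range(0, 600, 30)]
rows += [("b", float(t)) for t in range(100, 220, 30)]

kept = truncate_sessions(rows)
print("Before:", Counter(sid for sid, _ in rows))
print("After: ", Counter(sid for sid, _ in kept))
```

The long session loses most of its rows while the short one is untouched, which is how truncation evens out the contribution of each session.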

## Overlapping session IDs between training and validation sets

We will check for overlapping session IDs between the training and validation sets. This is to ensure that the partitioning is done correctly and that there is no data leakage between the training and validation sets.

In [None]:
def check_overlapping_session_ids(train_data, val_data):
    train_session_ids = set(train_data['session_id'])
    val_session_ids = set(val_data['session_id'])
    overlapping_session_ids = train_session_ids.intersection(val_session_ids)
    return overlapping_session_ids

for i, fold in enumerate(data["folds"]):
    _, train_data, val_data = fold.values()
    overlapping_session_ids = check_overlapping_session_ids(train_data, val_data)
    # Report the fold and the offending IDs if the check fails.
    assert not overlapping_session_ids, f"Fold {i + 1}: overlapping session IDs found: {overlapping_session_ids}"
    
print("No overlapping session IDs found")

The check for overlapping session IDs passed: no session ID appears in both the training and validation sets of any fold, so there is no session-level data leakage between them.
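
The same set-based idea extends to more than two partitions, for instance to include the holdout test set. The sketch below is a generalisation under assumptions: `find_overlaps` is a hypothetical helper, and it presumes every partition (including `test`) exposes session IDs, which this notebook has not verified.

```python
from itertools import combinations


def find_overlaps(partitions):
    """Return {(name_a, name_b): shared_ids} for every pair of partitions that overlaps."""
    overlaps = {}
    for (name_a, ids_a), (name_b, ids_b) in combinations(partitions.items(), 2):
        shared = set(ids_a) & set(ids_b)
        if shared:
            overlaps[(name_a, name_b)] = shared
    return overlaps


# Toy session IDs; in the notebook these would come from the "session_id" columns.
partitions = {
    "train": ["s1", "s2", "s3"],
    "val": ["s4", "s5"],
    "test": ["s6", "s3"],  # "s3" deliberately leaks into the test set
}

print(find_overlaps(partitions))
```

An empty dictionary means that no pair of partitions shares a session.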

## Looking at a random segment

Further analysis can be done by looking at a random segment from the training set. This will give us an idea of what the data looks like and how the features are distributed.

Let's plot the accelerometer data for a random segment from the training set.

In [None]:
fold = np.random.choice(data["folds"])
_, train_data, val_data = fold.values()

segment_id = train_data['segment_id'].sample(1, random_state=42).values[0]
segment = train_data[train_data['segment_id'] == segment_id]
segment = segment.sort_values(by='_time')

After sampling and sorting the segment data by time, we can print out some information about the segment.

In [None]:
print(f"Segment ID: {segment_id}")
print(f"Start time: {segment['_time'].min()}")
print(f"End time: {segment['_time'].max()}")
print(f"Segment duration: {(segment['_time'].max() - segment['_time'].min()).total_seconds()} seconds")
print(f"Label: {segment['label'].values[0]}")

The segment contains accelerometer data for 5 seconds. Let's plot the accelerometer data for this segment.

In [None]:
plt.figure(figsize=(12, 6))
plt.plot(segment['_time'], segment['accelerometer_x'])
plt.plot(segment['_time'], segment['accelerometer_y'])
plt.plot(segment['_time'], segment['accelerometer_z'])
plt.title('Accelerometer XYZ data for a random segment (label: {})'.format(segment['label'].values[0]))
plt.xlabel('Time')
plt.ylabel('Acceleration')
plt.legend(['X', 'Y', 'Z'])
plt.show()

The plot shows the accelerometer data for the segment. The data is noisy and the acceleration values are centered around 0. The data will be preprocessed and fed into the model for training.

This concludes the analysis of the data partitioning. The partitioning behaves as intended: no session ID is shared between the training and validation sets of any fold, and the classes are distributed approximately the same across them, with some variation between folds. The data is ready for training the model.