In [None]:
import math

import pandas as pd
from omegaconf import OmegaConf
from src.data_pipeline import prepare_time_series_segments

# Session Truncation

In this notebook, we will explore the effect of truncating sessions to a fixed length. Our dataset includes five different activities, each with unique physical demands, resulting in varying session lengths. For instance, running typically results in longer sessions compared to sitting due to its higher physical intensity.

This variation in session lengths poses a challenge when splitting the data into train and test sets, especially when done by session. For example, consider a 20-minute running session. If we divide this session into 5-second segments, we get 240 segments. When we split the data by session, all 240 segments from this running session will end up entirely in either the train or test set, leading to an imbalance.
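
The arithmetic above, and the imbalance it can cause, can be checked with a small sketch. The session names and durations are made up for illustration and do not come from the dataset; `segments_per_session` is a hypothetical helper.

```python
import math

segment_size_s = 5  # segment length used in this notebook


def segments_per_session(duration_s, segment_size_s):
    """Number of full, non-overlapping segments that fit into a session."""
    return math.floor(duration_s / segment_size_s)


# Hypothetical sessions (name -> duration in seconds), for illustration only.
toy_sessions = {"running_01": 20 * 60, "sitting_01": 60, "sitting_02": 45}

counts = {name: segments_per_session(d, segment_size_s)
          for name, d in toy_sessions.items()}
total = sum(counts.values())
# Share of all segments contributed by each session.
share = {name: c / total for name, c in counts.items()}

print(counts)
print(share)
```

In this toy setup the single running session contributes more than 90% of all segments, so whichever side of a session-wise split receives it ends up dominated by running data.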

To address this issue, we can truncate sessions to a fixed length, ensuring a more balanced and representative distribution of segments across both train and test sets. This method allows us to maintain the diversity of activities within each set and improve the robustness of our model.

## Load Data

We load the raw data from the cache and prepare the time series segments using the `prepare_time_series_segments` function. We then explore the distribution of session lengths before and after truncation to evaluate the effectiveness of the truncation process.

In [None]:
data = pd.read_parquet("../data/cache/raw_data_db_cache.parquet")
data.shape

Once the data is loaded, we run part of the `data_pipeline.py` script to prepare the time series segments. We will then explore the distribution of session lengths before and after truncation.

In [None]:
crop_start_s = 5
crop_end_s = 5
resample_rate_hz = 50
segment_size_s = 5
overlap_s = 0

cfg = {
    "preprocessing": {
        "crop": {
            "start_seconds": crop_start_s,
            "end_seconds": crop_end_s,
        },
        "resample_rate_hz": resample_rate_hz,
        "segment_size_seconds": segment_size_s,
        "overlap_seconds": overlap_s,
        "smoothing": {
            "type": "null"
        }
    }
}

cfg = OmegaConf.create(cfg)

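# Crop, resample, and segment the raw data according to the config above.
segments_df = prepare_time_series_segments(data, cfg)

import logging

# Silence all log output (up to and including CRITICAL) for the remaining cells.
logging.disable()

segments_df.shape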

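The preparation of the time series segments leaves us with a DataFrame of 1,092,101 rows. Note that the session-length calculation below treats each row as one resampled sample, so this figure appears to count samples rather than distinct segments. We will now explore the distribution of session lengths before and after truncation.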
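
Several rows can share a `segment_id` (the truncation code later in this notebook deduplicates them), so counting rows and counting segments are different things. The sketch below shows the distinction on a toy frame. It assumes the frame has `session_id` and `segment_id` columns and that segment IDs may restart in each session; the values are made up.

```python
import pandas as pd

# Toy frame: 2 sessions, 3 rows (samples) per segment. Values are illustrative.
toy_df = pd.DataFrame({
    "session_id": ["a"] * 6 + ["b"] * 3,
    "segment_id": [0, 0, 0, 1, 1, 1, 0, 0, 0],
})

n_rows = len(toy_df)
# Count distinct (session_id, segment_id) pairs, in case segment IDs
# are only unique within a session.
n_segments = len(toy_df[["session_id", "segment_id"]].drop_duplicates())
# Rows per segment; with the real config this would presumably be
# resample_rate_hz * segment_size_s.
rows_per_segment = toy_df.groupby(["session_id", "segment_id"]).size()

print(n_rows, n_segments)
```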

## Session Length Distribution

In [None]:
def get_session_lengths(df):
    return (df.groupby("session_id")
            .agg({"session_id": "count"})
            .rename(columns={"session_id": "count"}) / resample_rate_hz)


get_session_lengths(segments_df).plot.hist()
get_session_lengths(segments_df).describe()

The distribution of session lengths before truncation is wide, with a mean of 186 seconds and a standard deviation of 277 seconds: some sessions are very long, while others are relatively short. We will now truncate the sessions to a fixed length of 3 minutes (180 seconds) and explore the distribution of session lengths after truncation to verify its effectiveness.

We can calculate the number of segments that fit into a 180-second session to determine the maximum number of segments per session after truncation.

In [None]:
file_length_limit_s = 180
max_count_segments = math.floor(file_length_limit_s / segment_size_s)
max_count_segments

## Truncate Sessions

With a limit of 3 minutes (180 seconds) per session and 5-second segments, each session can hold at most 36 segments. We will now truncate the sessions to this limit and explore the distribution of session lengths after truncation. The truncation code works as follows:

1. **DataFrame Grouping:**
   ```python
   segments_truncated_df = segments_df.groupby('session_id').apply(
   ```
   This part of the code groups the `segments_df` DataFrame by the `session_id` column. Each group will contain all segments belonging to a particular session.

2. **Lambda Function:**
   ```python
   lambda x: x[x['segment_id'].isin(
   ```
   For each group (i.e., each session), a lambda function is applied. `x` represents each group (session) DataFrame. The lambda function is used to filter the segments within each session.

3. **Drop Duplicates and Sample Segments:**
   ```python
   x['segment_id'].drop_duplicates().sample(n=min(len(x['segment_id'].drop_duplicates()), max_count_segments))
   ```
   Within each session, duplicate `segment_id` values are dropped using `drop_duplicates()`, ensuring each segment is unique. Then, a sample of these unique segments is taken.

   - `x['segment_id'].drop_duplicates()`: Drops duplicate segment IDs within the session.
   - `.sample(n=...)`: Samples a number of segments. The number of segments to sample is determined by the `min` function:
     - `len(x['segment_id'].drop_duplicates())`: Total number of unique segments in the session.
     - `max_count_segments`: The maximum number of segments allowed per session (36 in this case).
   - Because the segments are sampled at random, the selection is not biased toward any particular part of a session. The kept segments are therefore not necessarily contiguous.

   The `min` function ensures that if a session has fewer than 36 segments, all of them are taken. If a session has more than 36 segments, only 36 are randomly sampled.

4. **Filter the Segments:**
   ```python
   x[x['segment_id'].isin(...)]
   ```
   The `isin` method is used to filter the original segments DataFrame (`x`) to include only the sampled segment IDs.

5. **Reset Index:**
   ```python
   ).reset_index(drop=True)
   ```
   After applying the lambda function and filtering the segments, the index is reset to avoid retaining the original index from the grouped DataFrame.

The final result is a new DataFrame `segments_truncated_df` where each session contains a maximum of 36 unique segments, or all segments if the session originally had fewer than 36 segments.
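
The same idea can be tried on toy data. The sketch below is an alternative formulation, not the code used in this notebook: `truncate_sessions` is a hypothetical helper that shuffles the unique (session, segment) pairs and keeps the first few per session, and it takes a `random_state` so the result is reproducible. It assumes only the `session_id` and `segment_id` columns.

```python
import pandas as pd


def truncate_sessions(df, max_segments, random_state=None):
    """Keep at most `max_segments` randomly chosen segments per session."""
    # One row per (session, segment) pair.
    unique_segments = df[["session_id", "segment_id"]].drop_duplicates()
    # Shuffle all pairs, then keep the first `max_segments` of each session.
    kept = (
        unique_segments.sample(frac=1, random_state=random_state)
        .groupby("session_id")
        .head(max_segments)
    )
    # The inner join keeps every row of the selected segments.
    return df.merge(kept, on=["session_id", "segment_id"], how="inner")


# Toy data: a session with 5 segments and one with 2, two rows per segment.
toy_df = pd.DataFrame({
    "session_id": ["long"] * 10 + ["short"] * 4,
    "segment_id": [0, 0, 1, 1, 2, 2, 3, 3, 4, 4, 0, 0, 1, 1],
    "value": list(range(14)),
})

truncated = truncate_sessions(toy_df, max_segments=3, random_state=0)
segments_per_session = truncated.groupby("session_id")["segment_id"].nunique()
print(segments_per_session)
```

Fixing the seed is a design choice: without it, each run selects different segments, so downstream train and test sets would change between runs.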

In [None]:
segments_truncated_df = segments_df.groupby('session_id').apply(
    lambda x: x[x['segment_id'].isin(
        x['segment_id'].drop_duplicates().sample(n=min(len(x['segment_id'].drop_duplicates()), max_count_segments))
    )]
).reset_index(drop=True)

In [None]:
get_session_lengths(segments_truncated_df).plot.hist()
get_session_lengths(segments_truncated_df).describe() 

After truncating the sessions to a maximum length of 3 minutes (180 seconds), the distribution of session lengths is much narrower, with a mean of 95 seconds and a standard deviation of 72 seconds. The truncation has reduced the variation in session lengths, giving a more balanced distribution of segments across sessions. The histogram shows that the sessions from the upper end of the original distribution now gather at the maximum length of 180 seconds.
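
Besides reading the histogram, the cap can be checked programmatically. The sketch below is a minimal illustration on toy data with made-up parameters (2 Hz, 15-second limit), assuming one row per resampled sample; `session_lengths_s` is a hypothetical helper, not part of `data_pipeline.py`.

```python
import pandas as pd


def session_lengths_s(df, resample_rate_hz):
    """Session length in seconds, assuming one row per resampled sample."""
    return df.groupby("session_id").size() / resample_rate_hz


# Illustrative parameters, deliberately smaller than the notebook's values.
resample_rate_hz = 2
segment_size_s = 5
file_length_limit_s = 15
rows_per_segment = resample_rate_hz * segment_size_s

# Toy "truncated" data: session a has 3 segments, session b has 1.
toy_truncated = pd.DataFrame({
    "session_id": ["a"] * (3 * rows_per_segment) + ["b"] * rows_per_segment,
})

lengths = session_lengths_s(toy_truncated, resample_rate_hz)
# Sessions that still exceed the limit; this should be empty after truncation.
too_long = lengths[lengths > file_length_limit_s]
print(lengths)
print(too_long)
```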