# Speed, Memory, and Disk Comparisons

In this notebook, we'll offer some rough comparisons of the computational performance implications of ESGPT vs. other competing pipelines. We'll focus these comparisons on several metrics:
  1. The time, runtime memory, and final disk space required to construct, pre-process, and store an ESGPT dataset relative to other pipelines, where applicable.
  2. The initialization time, iteration speed, and GPU memory costs for producing batches of data within the ESGPT framework vs. other systems.
  
In particular, we'll compare (or justify why they are inappropriate comparators) against the following pipelines:
  1. TemporAI
  2. OMOP-Learn
  3. FIDDLE
  4. MIMIC-Extract
  
We'll make these comparisons leveraging the synthetic data distributed with ESGPT's sample tutorial, but this code can also be ported to any other dataset to run these profiles locally.

In [1]:
%load_ext memory_profiler

import sys
sys.path.append('..')

In [29]:
import os
import numpy as np
import torch

from collections import defaultdict
from datetime import datetime, timedelta
from humanize import naturalsize, naturaldelta
from pathlib import Path
from sparklines import sparklines
from torch.utils.data import DataLoader, Dataset
from tqdm.auto import tqdm
from typing import Callable

from EventStream.data.dataset_polars import Dataset
from EventStream.data.config import PytorchDatasetConfig
from EventStream.data.types import PytorchBatch
from EventStream.data.pytorch_dataset import PytorchDataset

In [3]:
dataset_dir = Path(os.getcwd()) / "processed/sample"

First, let's check and see how much disk space the dataset uses, and in what components

In [4]:
total_dataset_size = sum(f.stat().st_size for f in dataset_dir.glob('**/*') if f.is_file())
DL_reps_size = sum(f.stat().st_size for f in (dataset_dir / "DL_reps").glob('**/*') if f.is_file())
just_dataset_size = total_dataset_size - DL_reps_size

if (dataset_dir / "flat_reps").is_dir():
    flat_reps_size = sum(f.stat().st_size for f in (dataset_dir / "flat_reps").glob('**/*') if f.is_file())
    just_dataset_size -= flat_reps_size
    flat_reps_lines = [f"  * {naturalsize(flat_reps_size)} for the flat representation dataframes."]
else:
    flat_reps_lines = []

lines = [
    f"The total dataset takes up {naturalsize(total_dataset_size)} on disk, which includes:",
    f"  * {naturalsize(just_dataset_size)} for the core dataset.",
    f"  * {naturalsize(DL_reps_size)} for the deep-learning representation dataframes.",
] + flat_reps_lines

print('\n'.join(lines))

The total dataset takes up 164.5 MB on disk, which includes:
  * 19.5 MB for the core dataset.
  * 11.7 MB for the deep-learning representation dataframes.
  * 133.2 MB for the flat representation dataframes.


First, we'll note that loading a dataset doesn't require much of either resource. This is because the data is loaded lazily, so complex dataframe elements aren't loaded until they are needed. 

In [5]:
%%time
%%memit

ESD = Dataset.load(dataset_dir)

peak memory: 346.93 MiB, increment: 1.95 MiB
CPU times: user 146 ms, sys: 16.3 ms, total: 162 ms
Wall time: 274 ms


In [6]:
%%time
%%memit

s_df = ESD.subjects_df
e_df = ESD.events_df
dm_df = ESD.dynamic_measurements_df

Loading subjects from /home/mmd/Projects/EventStreamGPT/sample_data/processed/sample/subjects_df.parquet...
Loading events from /home/mmd/Projects/EventStreamGPT/sample_data/processed/sample/events_df.parquet...
Loading dynamic_measurements from /home/mmd/Projects/EventStreamGPT/sample_data/processed/sample/dynamic_measurements_df.parquet...
peak memory: 508.26 MiB, increment: 161.22 MiB
CPU times: user 345 ms, sys: 107 ms, total: 453 ms
Wall time: 318 ms


## Pytorch Dataset Stats
Now let's load a pytorch dataset and examine iteration speed and GPU memory cost:

In [7]:
def summarize(arr: list[float], strify: Callable[float, str] = naturalsize) -> str:
    mean, std, mn, mx = np.mean(arr), np.std(arr), np.min(arr), np.max(arr)
    simple_summ = f"{strify(mean)} ± {strify(std)} ({strify(mn)}-{strify(mx)})"
    
    if len(arr) < 25: return simple_summ
    
    hist_vals, hist_bins = np.histogram(arr)
    lines = [simple_summ, "Histogram:"]
    sparkline = sparklines(hist_vals)
    
    lines.extend(sparkline)
    left_end = strify(hist_bins[0])
    right_end = strify(hist_bins[1])
    W = len(sparkline[0]) - len(left_end) - len(right_end)
    
    if W > 0:
        lines.append(f"{left_end}{'-'*W}{right_end}")
    else:
        lines.append(f"o {left_end} (left endpoint)")
        lines.append(f"{'-'*(len(sparkline[0])-1)}o {right_end} (right endpoint)")
    return '\n'.join(lines)

def summarize_times(arr: list[float, timedelta]):
    as_seconds = [x / timedelta(seconds=1) for x in arr]
    return summarize(as_seconds, strify=lambda x: str(timedelta(seconds=x)))

In [34]:
def profile_batch_iteration_speed_and_cost(
    batch_size: int,
    pyd: Dataset,
    n_iter_samples: int = 30,
    collate_fn: Callable | None = None,
):
    def make_dataloader():
        if collate_fn is None:
            return DataLoader(pyd, batch_size=batch_size, shuffle=True)
        return DataLoader(pyd, collate_fn=collate_fn, batch_size=batch_size, shuffle=True)

    dataloader = make_dataloader()
    batch_sizes = defaultdict(list)
    total_sizes = []
    for batch in tqdm(dataloader, leave=False):
        total_size = 0
        for k, v in batch.items():
            if v is None: continue
            el_size = v.element_size() * v.nelement()
            batch_sizes[k].append(el_size)
            total_size += el_size
        total_sizes.append(total_size)

    batch_iteration_times = []
    for samp in tqdm(list(range(n_iter_samples)), leave=False, desc="Sampling Dataloader Iteration Speed"):
        dataloader = make_dataloader()
        st = datetime.now()
        for batch in tqdm(dataloader, leave=False, desc="Sampling Batch"):
            pass
        batch_iteration_times.append((datetime.now() - st) / len(dataloader))

    print(
        f"Iterating through an entire dataloader of {len(dataloader)} batches of size {batch_size} "
        f"took the following time per batch:\n{summarize_times(batch_iteration_times)}\n\n"
        f"Total batch size:\n{summarize(total_sizes)}"
    )
    for k, v in batch_sizes.items():
        print(f"  Size of {k}:\n    {summarize(v)}")

In [55]:
%%time
%%memit
pyd_config = PytorchDatasetConfig(
    save_dir=ESD.config.save_dir,
    max_seq_len=1024,
)
pyd = PytorchDataset(config=pyd_config, split='train')

peak memory: 2706.13 MiB, increment: 213.12 MiB
CPU times: user 2.24 s, sys: 237 ms, total: 2.47 s
Wall time: 2.29 s


In [56]:
%%time
%%memit

profile_batch_iteration_speed_and_cost(batch_size=16, pyd=pyd, n_iter_samples=30, collate_fn=pyd.collate)

  0%|          | 0/5 [00:00<?, ?it/s]

Sampling Dataloader Iteration Speed:   0%|          | 0/30 [00:00<?, ?it/s]

Sampling Batch:   0%|          | 0/5 [00:00<?, ?it/s]

Sampling Batch:   0%|          | 0/5 [00:00<?, ?it/s]

Sampling Batch:   0%|          | 0/5 [00:00<?, ?it/s]

Sampling Batch:   0%|          | 0/5 [00:00<?, ?it/s]

Sampling Batch:   0%|          | 0/5 [00:00<?, ?it/s]

Sampling Batch:   0%|          | 0/5 [00:00<?, ?it/s]

Sampling Batch:   0%|          | 0/5 [00:00<?, ?it/s]

Sampling Batch:   0%|          | 0/5 [00:00<?, ?it/s]

Sampling Batch:   0%|          | 0/5 [00:00<?, ?it/s]

Sampling Batch:   0%|          | 0/5 [00:00<?, ?it/s]

Sampling Batch:   0%|          | 0/5 [00:00<?, ?it/s]

Sampling Batch:   0%|          | 0/5 [00:00<?, ?it/s]

Sampling Batch:   0%|          | 0/5 [00:00<?, ?it/s]

Sampling Batch:   0%|          | 0/5 [00:00<?, ?it/s]

Sampling Batch:   0%|          | 0/5 [00:00<?, ?it/s]

Sampling Batch:   0%|          | 0/5 [00:00<?, ?it/s]

Sampling Batch:   0%|          | 0/5 [00:00<?, ?it/s]

Sampling Batch:   0%|          | 0/5 [00:00<?, ?it/s]

Sampling Batch:   0%|          | 0/5 [00:00<?, ?it/s]

Sampling Batch:   0%|          | 0/5 [00:00<?, ?it/s]

Sampling Batch:   0%|          | 0/5 [00:00<?, ?it/s]

Sampling Batch:   0%|          | 0/5 [00:00<?, ?it/s]

Sampling Batch:   0%|          | 0/5 [00:00<?, ?it/s]

Sampling Batch:   0%|          | 0/5 [00:00<?, ?it/s]

Sampling Batch:   0%|          | 0/5 [00:00<?, ?it/s]

Sampling Batch:   0%|          | 0/5 [00:00<?, ?it/s]

Sampling Batch:   0%|          | 0/5 [00:00<?, ?it/s]

Sampling Batch:   0%|          | 0/5 [00:00<?, ?it/s]

Sampling Batch:   0%|          | 0/5 [00:00<?, ?it/s]

Sampling Batch:   0%|          | 0/5 [00:00<?, ?it/s]

Iterating through an entire dataloader of 5 batches of size 16 took the following time per batch:
0:00:00.734263 ± 0:00:00.078552 (0:00:00.586128-0:00:00.871448)
Histogram:
▄▁▁▂▄█▃▁▃▃
o 0:00:00.586128 (left endpoint)
---------o 0:00:00.614660 (right endpoint)

Total batch size:
8.3 MB ± 1.1 MB (6.6 MB-9.7 MB)
  Size of event_mask:
    16.4 kB ± 0 Bytes (16.4 kB-16.4 kB)
  Size of time_delta:
    65.5 kB ± 0 Bytes (65.5 kB-65.5 kB)
  Size of static_indices:
    128 Bytes ± 0 Bytes (128 Bytes-128 Bytes)
  Size of static_measurement_indices:
    128 Bytes ± 0 Bytes (128 Bytes-128 Bytes)
  Size of dynamic_indices:
    3.1 MB ± 422.7 kB (2.5 MB-3.7 MB)
  Size of dynamic_measurement_indices:
    3.1 MB ± 422.7 kB (2.5 MB-3.7 MB)
  Size of dynamic_values:
    1.6 MB ± 211.3 kB (1.2 MB-1.8 MB)
  Size of dynamic_values_mask:
    393.2 kB ± 52.8 kB (311.3 kB-458.8 kB)
peak memory: 2697.75 MiB, increment: 14.18 MiB
CPU times: user 3min 23s, sys: 1.49 s, total: 3min 25s
Wall time: 1min 53s


## Other Pipelines
### TemporAI Format

In [10]:
import pandas as pd
import polars as pl
import polars.selectors as cs

In [11]:
def ESD_to_temporai(ESD: Dataset) -> tuple[pd.DataFrame, pd.DataFrame]:
    """Converts an ESD data format into a TemporAI dataset format."""

    static_df = (
        ESD.subjects_df
        .select(
            'subject_id',
            *[pl.col(c) for c, cfg in ESD.measurement_configs.items() if cfg.temporality == 'static']
        )
        .to_pandas()
        .set_index("subject_id")
    )
    
    # For the time-series dataframe, as they need only one row per subject ID, timestamp, we need to use the wide
    # format of the flat representation. 
    
    flat_reps_dir = ESD.config.save_dir / "flat_reps" / "raw"
    if not flat_reps_dir.is_dir():
        raise FileNotFoundError(f"Must have pre-cached flat representations at {flat_reps_dir}!")
        
    time_series_df = (
        pl.scan_parquet(flat_reps_dir / "*" / "*.parquet")
        .select("subject_id", "timestamp", cs.starts_with("dynamic"))
        .collect()
        .to_pandas()
        .set_index(["subject_id", "timestamp"])
    )
    
    return static_df, time_series_df

In [12]:
%%time
%%memit
# We need to convert to a flat format prior to getting temporai representations.
# The performance #s here are not reliable as these files may be already generated.
ESD.cache_flat_representation(
    subjects_per_output_file=None,
    feature_inclusion_frequency=None,
    do_overwrite=False,
    do_update=True,
)

Flattening Splits:   0%|          | 0/3 [00:00<?, ?it/s]

Subject chunks:   0%|          | 0/1 [00:00<?, ?it/s]

Subject chunks:   0%|          | 0/1 [00:00<?, ?it/s]

Subject chunks:   0%|          | 0/1 [00:00<?, ?it/s]

peak memory: 784.77 MiB, increment: 0.18 MiB
CPU times: user 263 ms, sys: 47.9 ms, total: 311 ms
Wall time: 426 ms


In [13]:
%%time
%%memit

temporai_static, temporai_ts = ESD_to_temporai(ESD)

peak memory: 1893.27 MiB, increment: 1108.49 MiB
CPU times: user 1.31 s, sys: 687 ms, total: 2 s
Wall time: 942 ms


In [14]:
print(
    f"TemporAI uses two dataframes, a static dataframe of shape {temporai_static.shape} "
    f"and a time series dataframe of shape {temporai_ts.shape}."
)

TemporAI uses two dataframes, a static dataframe of shape (100, 1) and a time series dataframe of shape (530742, 160).


Let's save these dataframes to disk, so we can inspect their disk cost and the memory cost to re-load them from scratch.

In [15]:
save_dir = Path("./speed_comparisons/temporai/compressed")
save_dir.mkdir(parents=True, exist_ok=True)

temporai_static.to_parquet(save_dir / "static.parquet")
temporai_ts.to_parquet(save_dir / "ts.parquet")

uncompressed_save_dir = Path("./speed_comparisons/temporai/uncompressed")
uncompressed_save_dir.mkdir(parents=True, exist_ok=True)

temporai_static.to_parquet(uncompressed_save_dir / "static.parquet", compression=None)
temporai_ts.to_parquet(uncompressed_save_dir / "ts.parquet", compression=None)

compressed_temporai_size = sum(f.stat().st_size for f in save_dir.glob('**/*') if f.is_file())
uncompressed_temporai_size = sum(f.stat().st_size for f in uncompressed_save_dir.glob('**/*') if f.is_file())

print(
    f"The compressed data takes up {naturalsize(compressed_temporai_size)} on disk.\n"
    f"The uncompressed data takes up {naturalsize(uncompressed_temporai_size)} on disk "
    "(this is a good approximation of memory cost as it is uncompressed)."
)

The compressed data takes up 23.9 MB on disk.
The uncompressed data takes up 26.0 MB on disk (this is a good approximation of memory cost as it is uncompressed).


In [16]:
%%time
%%memit

temporai_static = pd.read_parquet(save_dir / "static.parquet")
temporai_ts = pd.read_parquet(save_dir / "ts.parquet")

peak memory: 2391.35 MiB, increment: 904.11 MiB
CPU times: user 917 ms, sys: 444 ms, total: 1.36 s
Wall time: 718 ms


TemporAI generally converts their timeseries data into a dense, 3D matrix across samples, timepoints, and features. For use in ML pipelines, this is then generally iterated through directly via simple numpy iteration. 

For example: 
  * Datasets are converted to 3D views here: https://github.com/vanderschaarlab/temporai/blob/main/src/tempor/plugins/prediction/one_off/classification/__init__.py#L59 and https://github.com/vanderschaarlab/temporai/blob/67ebd74dc24728163d9aec37f1771a83fc3346e2/src/tempor/data/utils.py#L49
  * Iteration through numpy arrays happens here: https://github.com/vanderschaarlab/temporai/blob/main/src/tempor/models/ddh.py#L155
  
Though a full comparison warrants use of their library (and will further depend on the exact model used (as each has different strategies for processing data), we can simulate that approach here quickly:

In [48]:
def no_categories(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    for c in df.columns:
        if pd.api.types.is_categorical_dtype(df[c]):
            df[c] = df[c].cat.codes
    return df

def to_3D_arr(df: pd.DataFrame, max_timesteps: int | None = None) -> np.ndarray:
    df = no_categories(df)
    samples = set(df.index.get_level_values(0))
    num_samples = len(samples)
    num_features = len(df.columns)
    num_timesteps_per_sample = df.groupby(level=0).size()
    max_actual_timesteps = num_timesteps_per_sample.max()
    max_timesteps = max_actual_timesteps if max_timesteps is None else max_timesteps
    array = np.full(shape=(num_samples, max_timesteps, num_features), fill_value=np.NaN)
    for i_sample, idx_sample in enumerate(samples):
        set_vals = df.loc[idx_sample, :, :].to_numpy()[:max_timesteps, :]  # pyright: ignore
        if i_sample == 0:
            array = array.astype(set_vals.dtype)  # Need to cast to the type matching source data.
        array[i_sample, : num_timesteps_per_sample[idx_sample], :] = set_vals  # pyright: ignore
    return array

In [59]:
class SimpleTemporAIStyleDataset(Dataset):
    def __init__(self, static: np.ndarray, ts: np.ndarray):
        self.static = static
        self.ts = ts
        
    def __len__(self) -> int: return self.ts.shape[0]
    
    def __getitem__(self, idx) -> dict[str, torch.Tensor]:
        return {'static': torch.Tensor(self.static[idx]), 'ts': torch.Tensor(self.ts[idx])}
    
def profile_temporai_dataset(
    temporai_static, temporai_ts, batch_size: int = 16,
    n_iter_samples: int = 30,
    max_seq_len: int = 32,
):
    static_as_np = no_categories(temporai_static).to_numpy()
    ts_as_np = to_3D_arr(temporai_ts, max_timesteps=max_seq_len)
    print(
        f"Yielded a static NP array of shape {static_as_np.shape} and a TS NP array "
        f"of shape {ts_as_np.shape}."
    )
    temporai_pyd = SimpleTemporAIStyleDataset(static_as_np, ts_as_np)

    profile_batch_iteration_speed_and_cost(
        batch_size=batch_size, pyd=temporai_pyd, n_iter_samples=n_iter_samples
    )

In [60]:
%%time
%%memit

profile_temporai_dataset(temporai_static, temporai_ts, batch_size=16, n_iter_samples=30, max_seq_len=1024)

Yielded a static NP array of shape (100, 1) and a TS NP array of shape (100, 1024, 160).


  0%|          | 0/7 [00:00<?, ?it/s]

Sampling Dataloader Iteration Speed:   0%|          | 0/30 [00:00<?, ?it/s]

Sampling Batch:   0%|          | 0/7 [00:00<?, ?it/s]

Sampling Batch:   0%|          | 0/7 [00:00<?, ?it/s]

Sampling Batch:   0%|          | 0/7 [00:00<?, ?it/s]

Sampling Batch:   0%|          | 0/7 [00:00<?, ?it/s]

Sampling Batch:   0%|          | 0/7 [00:00<?, ?it/s]

Sampling Batch:   0%|          | 0/7 [00:00<?, ?it/s]

Sampling Batch:   0%|          | 0/7 [00:00<?, ?it/s]

Sampling Batch:   0%|          | 0/7 [00:00<?, ?it/s]

Sampling Batch:   0%|          | 0/7 [00:00<?, ?it/s]

Sampling Batch:   0%|          | 0/7 [00:00<?, ?it/s]

Sampling Batch:   0%|          | 0/7 [00:00<?, ?it/s]

Sampling Batch:   0%|          | 0/7 [00:00<?, ?it/s]

Sampling Batch:   0%|          | 0/7 [00:00<?, ?it/s]

Sampling Batch:   0%|          | 0/7 [00:00<?, ?it/s]

Sampling Batch:   0%|          | 0/7 [00:00<?, ?it/s]

Sampling Batch:   0%|          | 0/7 [00:00<?, ?it/s]

Sampling Batch:   0%|          | 0/7 [00:00<?, ?it/s]

Sampling Batch:   0%|          | 0/7 [00:00<?, ?it/s]

Sampling Batch:   0%|          | 0/7 [00:00<?, ?it/s]

Sampling Batch:   0%|          | 0/7 [00:00<?, ?it/s]

Sampling Batch:   0%|          | 0/7 [00:00<?, ?it/s]

Sampling Batch:   0%|          | 0/7 [00:00<?, ?it/s]

Sampling Batch:   0%|          | 0/7 [00:00<?, ?it/s]

Sampling Batch:   0%|          | 0/7 [00:00<?, ?it/s]

Sampling Batch:   0%|          | 0/7 [00:00<?, ?it/s]

Sampling Batch:   0%|          | 0/7 [00:00<?, ?it/s]

Sampling Batch:   0%|          | 0/7 [00:00<?, ?it/s]

Sampling Batch:   0%|          | 0/7 [00:00<?, ?it/s]

Sampling Batch:   0%|          | 0/7 [00:00<?, ?it/s]

Sampling Batch:   0%|          | 0/7 [00:00<?, ?it/s]

Iterating through an entire dataloader of 7 batches of size 16 took the following time per batch:
0:00:00.007319 ± 0:00:00.001439 (0:00:00.006224-0:00:00.012835)
Histogram:
█▅▁▁▁▁▂▁▁▁
o 0:00:00.006224 (left endpoint)
---------o 0:00:00.006885 (right endpoint)

Total batch size:
9.4 MB ± 2.8 MB (2.6 MB-10.5 MB)
  Size of static:
    57 Bytes ± 16 Bytes (16 Bytes-64 Bytes)
  Size of ts:
    9.4 MB ± 2.8 MB (2.6 MB-10.5 MB)
peak memory: 3349.08 MiB, increment: 647.06 MiB
CPU times: user 7.28 s, sys: 410 ms, total: 7.69 s
Wall time: 3.09 s


As we can see, the strategy of featurizing and batching used in TemporAI results (on this synthetic dataset) in a significantly faster iteration speed and a marginally lower memory cost than does the strategy used in ESGPT (all formats are mean ± standard deviation (min - max)

TemporAI Speed: `0:00:00.007319 ± 0:00:00.001439 (0:00:00.006224-0:00:00.012835)`  
ESGPT Speed:    `0:00:00.734263 ± 0:00:00.078552 (0:00:00.586128-0:00:00.871448)`

TemporAI Memory: `9.4 MB ± 2.8 MB (2.6 MB-10.5 MB)`  
ESGPT Memory:    `8.3 MB ± 1.1 MB (6.6 MB-9.7 MB)`

In table form (using chatGPT for conversions, so may need to be double checked), where "Delta" means what % of TemporAI's resource cost does ESGPT _save_ (higher is better), we get the following:
|                      | TemporAI              | ESGPT                 | Delta (%)  |
|----------------------|-----------------------|-----------------------|------------|
| **Iteration time / batch (ms)** | 7.32 ± 1.44 (6.22 - 12.8) | 734 ± 78.6 (586 - 871) | -9943%           |
| **Memory (MB)**      | 9.4 ± 2.8 (2.6 - 10.5) | 8.3 ± 1.1 (6.6 - 9.7) | 11.7%             |

There are some biases in this format, on both sides:
  1. ESGPT samples different subsequences per item iteration, whereas TemporAI is limited to only using the first max subsequence samples. 
  2. This dataset has relatively few measurements, which will reduce the memory disparity between the two formats (this bias favors TemporAI).
  3. The strategy of flattening this dataset may induce too much memory overhead, as if multiple measurements are not common within an event, it will have extra columns that TemporAI does not need. Conversely, it may reduce a significant amount of data, as if there are many measurements than a simple count, sum, sum_sqd, min, and max representation will not fully capture the data, thereby reducing the burden on TemporAI. (This bias could favor either).
  
Ultimately, these numbers will only be truly reasonable when compared on real data.