This notebook demonstrates a prototype of a TRX format that leverages the parquet format.

In [9]:
from trx_parquet import trxparquet

The primary class is `TrxParquet`. It contains two key attributes, `header` and `data`. The attribute `header` is an attempt at storing a minimal amount of information necessary for processing and conversion to other formats. The `data` attribute is where most of the information is contained.

One quick way to get started is via `init_example_trxparquet`, which creates a `TrxParquet` object with the given characteristics and random data. For example, we can start with a representation of two streamlines, each having 3 points.

In [10]:
trxparquet.init_example_trxparquet(2, 3)

TrxParquet(header=TrxHeader(DIMENSIONS=array([20, 20, 20], dtype=uint16), VOXEL_TO_RASMM=array([[20.,  0.,  0.,  0.],
       [ 0., 20.,  0.,  0.],
       [ 0.,  0., 20.,  0.],
       [ 0.,  0.,  0.,  1.]], dtype=float32)), data=shape: (6, 4)
┌──────────────────────┬──────────────────────┬──────────────────────┬──────────────────────┐
│ protected_streamline ┆ protected_position_0 ┆ protected_position_1 ┆ protected_position_2 │
│ ---                  ┆ ---                  ┆ ---                  ┆ ---                  │
│ i64                  ┆ f64                  ┆ f64                  ┆ f64                  │
╞══════════════════════╪══════════════════════╪══════════════════════╪══════════════════════╡
│ 0                    ┆ 0.559016             ┆ 0.253217             ┆ 0.961                │
│ 0                    ┆ 0.110607             ┆ 0.247859             ┆ 0.989746             │
│ 0                    ┆ 0.309266             ┆ 0.266963             ┆ 0.505193             │
│ 1   

In the parquet file, the `header` is stored via frame-level metadata (e.g., of the kind readable by https://arrow.apache.org/docs/python/generated/pyarrow.parquet.read_metadata.html). The `header` will not be discussed further in this notebook. 

The `data` always contains at least four columns, each of which has the prefix "protected_". These columns are:
- An index for the streamline
- Three columns holding the coordinates of each point/vertex within each streamline

That is, rows in the data correspond to points or vertices.
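
As a rough illustration of this layout, the sketch below flattens two toy streamlines into rows of `(streamline index, x, y, z)` using only the standard library. The helper name `to_long_rows` and the coordinates are made up for this example; they are not part of `trx_parquet`.

```python
# Hypothetical helper, not part of trx_parquet: flatten streamlines into
# one row per point, mirroring the four "protected_" columns.
def to_long_rows(streamlines):
    rows = []
    for index, points in enumerate(streamlines):
        for x, y, z in points:
            rows.append((index, x, y, z))
    return rows


# Two toy streamlines with 3 points each (made-up coordinates).
streamlines = [
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)],
    [(0.0, 1.0, 0.0), (0.0, 2.0, 0.0), (0.0, 3.0, 0.0)],
]
rows = to_long_rows(streamlines)
print(len(rows))  # one row per point: 2 streamlines x 3 points
print(rows[3])    # first point of the second streamline
```

The row count matches the `shape: (6, 4)` reported for the two-streamline, three-point example above.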

Data that is associated with each streamline will be stored in a column that has the prefix "dps_". The function `init_example_trxparquet` can be used to create this kind of column. The label will include a random string.

In [11]:
trxparquet.init_example_trxparquet(2, 3, 1)

TrxParquet(header=TrxHeader(DIMENSIONS=array([20, 20, 20], dtype=uint16), VOXEL_TO_RASMM=array([[20.,  0.,  0.,  0.],
       [ 0., 20.,  0.,  0.],
       [ 0.,  0., 20.,  0.],
       [ 0.,  0.,  0.,  1.]], dtype=float32)), data=shape: (6, 5)
┌─────────────────────┬─────────────────────┬────────────────────┬────────────────────┬────────────┐
│ protected_streamlin ┆ protected_position_ ┆ protected_position ┆ protected_position ┆ dps_cntxha │
│ e                   ┆ 0                   ┆ _1                 ┆ _2                 ┆ ---        │
│ ---                 ┆ ---                 ┆ ---                ┆ ---                ┆ f64        │
│ i64                 ┆ f64                 ┆ f64                ┆ f64                ┆            │
╞═════════════════════╪═════════════════════╪════════════════════╪════════════════════╪════════════╡
│ 0                   ┆ 0.526893            ┆ 0.21765            ┆ 0.941686           ┆ 0.429893   │
│ 0                   ┆ 0.677424            ┆ 0.973

Analogously, data that is associated with individual points will have the prefix "dpv_".

In [12]:
trxparquet.init_example_trxparquet(2, 3, 1, 1)

TrxParquet(header=TrxHeader(DIMENSIONS=array([20, 20, 20], dtype=uint16), VOXEL_TO_RASMM=array([[20.,  0.,  0.,  0.],
       [ 0., 20.,  0.,  0.],
       [ 0.,  0., 20.,  0.],
       [ 0.,  0.,  0.,  1.]], dtype=float32)), data=shape: (6, 6)
┌──────────────────┬─────────────────┬─────────────────┬─────────────────┬────────────┬────────────┐
│ protected_stream ┆ protected_posit ┆ protected_posit ┆ protected_posit ┆ dps_vpthvz ┆ dpv_ycunvm │
│ line             ┆ ion_0           ┆ ion_1           ┆ ion_2           ┆ ---        ┆ ---        │
│ ---              ┆ ---             ┆ ---             ┆ ---             ┆ f64        ┆ f64        │
│ i64              ┆ f64             ┆ f64             ┆ f64             ┆            ┆            │
╞══════════════════╪═════════════════╪═════════════════╪═════════════════╪════════════╪════════════╡
│ 0                ┆ 0.405331        ┆ 0.358066        ┆ 0.856907        ┆ 0.187236   ┆ 0.873383   │
│ 0                ┆ 0.219783        ┆ 0.605258    
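
One way to picture the difference between the two prefixes is sketched below. It assumes that a per-streamline value is simply repeated on every row belonging to that streamline, which is an assumption about the layout that the truncated outputs above do not confirm. Under that assumption a "dps_" column is constant within a streamline, while a "dpv_" column may vary from row to row. The column names `dps_length` and `dpv_fa`, the helper `per_streamline`, and all values are hypothetical.

```python
# Toy long-format rows as dicts; column names and values are hypothetical.
rows = [
    {"protected_streamline": 0, "dps_length": 12.5, "dpv_fa": 0.41},
    {"protected_streamline": 0, "dps_length": 12.5, "dpv_fa": 0.52},
    {"protected_streamline": 1, "dps_length": 9.0, "dpv_fa": 0.33},
    {"protected_streamline": 1, "dps_length": 9.0, "dpv_fa": 0.38},
]


def per_streamline(rows, column):
    """Collapse a column to one value per streamline index.

    Raises ValueError if the column is not constant within a streamline.
    """
    values = {}
    for row in rows:
        index = row["protected_streamline"]
        # Assumption: every row of a streamline carries the same dps_ value.
        if values.setdefault(index, row[column]) != row[column]:
            raise ValueError(f"{column} varies within streamline {index}")
    return values


print(per_streamline(rows, "dps_length"))
```

Calling `per_streamline(rows, "dpv_fa")` raises a `ValueError` here, because the per-point values differ within each streamline.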

To view these data in a more familiar format, a `TrxParquet` object can be converted into a `StatefulTractogram` using the `to_stf()` method.

In [13]:
stf = trxparquet.init_example_trxparquet(2, 3, 1, 1).to_stf()
stf.streamlines

ArraySequence([array([[0.04492556, 0.90927129, 0.94313624],
       [0.63021004, 0.77843718, 0.77896697],
       [0.34047226, 0.36103235, 0.47849451]]), array([[0.73950012, 0.36383573, 0.47686887],
       [0.32364141, 0.34392213, 0.68513902],
       [0.1677373 , 0.2484212 , 0.75261113]])])
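
Conceptually, a conversion like this has to regroup the point-per-row table into one array of points per streamline. The sketch below shows that regrouping with the standard library only. It is not the implementation of `to_stf()`; the helper `group_points` and the coordinates are hypothetical, and it assumes that rows belonging to the same streamline are contiguous.

```python
from itertools import groupby
from operator import itemgetter

# Toy rows of (streamline index, x, y, z); values are made up.
rows = [
    (0, 0.1, 0.2, 0.3),
    (0, 0.4, 0.5, 0.6),
    (1, 0.7, 0.8, 0.9),
]


def group_points(rows):
    """Return one list of (x, y, z) points per streamline.

    Assumes rows of the same streamline are contiguous, because
    itertools.groupby only merges adjacent items with equal keys.
    """
    return [
        [row[1:] for row in group]
        for _, group in groupby(rows, key=itemgetter(0))
    ]


for points in group_points(rows):
    print(points)
```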

The assumption is that, by relying on the parquet format, we get to leverage all of the work that has gone into making it an efficient medium for analysis. For example, let's create a file that has 100000 streamlines, each with 100 points, and check the size of the file and how long it takes to load.

In [14]:
import tempfile
import time
from pathlib import Path


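def human_size(n_bytes, units=(" bytes", "KB", "MB", "GB", "TB", "PB", "EB")):
    """Return a human-readable string for a byte count (1024-based, rounded down)."""
    # A tuple default avoids a shared mutable list; n_bytes avoids shadowing the builtin.
    return str(n_bytes) + units[0] if n_bytes < 1024 else human_size(n_bytes >> 10, units[1:])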


with tempfile.NamedTemporaryFile(suffix=".parquet") as _f:
    f = Path(_f.name)
    trx = trxparquet.init_example_trxparquet(100000, 100)
    trx.to_file(f)
    size = human_size(f.stat().st_size)
    start = time.time()
    trx2 = trx.from_file(f, loadtype="memory_map")
    end = time.time()

print(f"size: {size}")
print(f"reading time: {end - start}")

trx2
del trx, trx2

size: 230MB
reading time: 0.11250829696655273
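
The start/stop timing pattern used in these cells can also be wrapped in a small helper. The sketch below is one possible way to do that; the name `timed` is hypothetical and not part of `trx_parquet`. It uses `time.perf_counter`, which is a monotonic clock intended for measuring short durations, and a toy workload stands in for the file read.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results):
    """Store the elapsed seconds of the wrapped block under results[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start


timings = {}
with timed("sum", timings):
    # Stand-in workload; in the notebook this would be the from_file call.
    total = sum(range(1_000_000))

print(f"sum: {timings['sum']:.6f} s")
```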


There are different ways of loading parquet files, each optimized for different purposes. The previous cell loaded the file as a memory map. If only some streamlines (or only some columns) need to be processed, then lazy loading has many advantages; see https://pola-rs.github.io/polars/user-guide/concepts/lazy-vs-eager/. The reading time for lazy loading is minimal, but we can still extract useful information from the result.

In [15]:
with tempfile.NamedTemporaryFile(suffix=".parquet") as _f:
    f = Path(_f.name)
    trx = trxparquet.init_example_trxparquet(100000, 100)
    trx.to_file(f)
    start = time.time()
    n_streamlines_in_file = trx.from_file(f, loadtype="lazy").n_streamlines
    end = time.time()

print(f"reading time: {end - start}")
print(f"{n_streamlines_in_file=}")
del trx

reading time: 0.04939889907836914
n_streamlines_in_file=100000


Even when files are loaded fully into memory, the process remains fast.

In [16]:
with tempfile.NamedTemporaryFile(suffix=".parquet") as _f:
    f = Path(_f.name)
    trx = trxparquet.init_example_trxparquet(100000, 100)
    trx.to_file(f)
    start = time.time()
    trx2 = trx.from_file(f, loadtype="memory")
    end = time.time()

object_size = trx2.data.estimated_size("mb")
print(f"{object_size=} MB")
print(f"reading time: {end - start}")
del trx, trx2

object_size=305.17578125 MB
reading time: 0.1168370246887207
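
The reported object size is consistent with a back-of-the-envelope calculation. The sketch below assumes that the frame holds only the four "protected_" columns shown earlier (one `i64` and three `f64`, each 8 bytes per value) and that the "mb" unit counts 2**20 bytes; both are assumptions made for this check rather than statements about the library.

```python
n_streamlines = 100_000
points_per_streamline = 100
n_columns = 4        # one i64 index column + three f64 coordinate columns
bytes_per_value = 8  # i64 and f64 are both 8 bytes wide

n_rows = n_streamlines * points_per_streamline
total_bytes = n_rows * n_columns * bytes_per_value

# Convert to units of 2**20 bytes (assumed meaning of "mb" here).
print(total_bytes / 2**20)  # 305.17578125
```

The result equals the `object_size` printed above, which suggests that the in-memory size is dominated by the raw column values.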


For a few additional examples, please see the tests.

Note that, at the time of writing, no group-level information has been incorporated into the prototype.