This notebook demonstrates a prototype of a TRX file that leverages the parquet format.

In [204]:
from trx_parquet import trxparquet

The primary class is `TrxParquet`. It has two key attributes, `header` and `data`. The `header` attribute stores the minimal information needed for processing and for conversion to other formats (e.g., to a `StatefulTractogram`). The `data` attribute contains most of the tractography information.

To get started, see `init_example_trxparquet`, which initializes in-memory representations of trxparquet files with given characteristics and random data. For example, we can initialize two streamlines, each having 3 points.

In [205]:
trxparquet.init_example_trxparquet(2, 3)

TrxParquet(header=TrxHeader(DIMENSIONS=array([20, 20, 20], dtype=uint16), VOXEL_TO_RASMM=array([[20.,  0.,  0.,  0.],
       [ 0., 20.,  0.,  0.],
       [ 0.,  0., 20.,  0.],
       [ 0.,  0.,  0.,  1.]], dtype=float32)), data=shape: (6, 4)
┌──────────────────────┬──────────────────────┬──────────────────────┬──────────────────────┐
│ protected_streamline ┆ protected_position_0 ┆ protected_position_1 ┆ protected_position_2 │
│ ---                  ┆ ---                  ┆ ---                  ┆ ---                  │
│ i64                  ┆ f64                  ┆ f64                  ┆ f64                  │
╞══════════════════════╪══════════════════════╪══════════════════════╪══════════════════════╡
│ 0                    ┆ 0.809122             ┆ 0.434179             ┆ 0.268171             │
│ 0                    ┆ 0.824715             ┆ 0.731883             ┆ 0.06977              │
│ 0                    ┆ 0.909729             ┆ 0.755341             ┆ 0.20684              │
│ 1   

In the trxparquet itself (on disk and in memory), the `header` is stored via frame-level metadata (e.g., of the kind readable by https://arrow.apache.org/docs/python/generated/pyarrow.parquet.read_metadata.html). 
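
Parquet key-value metadata holds plain string keys and values, so array-valued header fields such as `VOXEL_TO_RASMM` have to be serialized before they can be stored there. The sketch below shows one possible round trip using JSON. The encoding and the helper names `encode_header` and `decode_header` are assumptions made for illustration; they are not necessarily what the prototype does.

```python
import json


def encode_header(dimensions, voxel_to_rasmm):
    """Hypothetical helper: flatten header fields into str -> str pairs."""
    return {
        "DIMENSIONS": json.dumps(list(dimensions)),
        "VOXEL_TO_RASMM": json.dumps([list(row) for row in voxel_to_rasmm]),
    }


def decode_header(metadata):
    """Hypothetical helper: parse the str -> str pairs back into lists."""
    return {key: json.loads(value) for key, value in metadata.items()}


# Same header values as in the example output above.
affine = [
    [20.0, 0.0, 0.0, 0.0],
    [0.0, 20.0, 0.0, 0.0],
    [0.0, 0.0, 20.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]
metadata = encode_header([20, 20, 20], affine)
restored = decode_header(metadata)
```

Because every value is a string, a dictionary like `metadata` could be attached to a file's metadata and parsed again after reading, without touching the row data.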

The `data` attribute always contains at least four columns, each of which has the prefix "protected_". These columns are:
- One index column identifying the streamline
- Three columns holding the coordinates of each point/vertex within each streamline

That is, rows in the data correspond to points or vertices.
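
To make the one-row-per-point layout concrete, here is a minimal sketch in plain Python (no `trxparquet` or polars involved) that regroups such rows into one list of points per streamline. The row values are made up, and the sketch assumes the rows are already ordered by streamline index.

```python
from itertools import groupby
from operator import itemgetter

# (streamline index, x, y, z): one row per point, mirroring the four
# "protected_" columns. The values are illustrative only.
rows = [
    (0, 0.1, 0.2, 0.3),
    (0, 0.4, 0.5, 0.6),
    (0, 0.7, 0.8, 0.9),
    (1, 1.0, 1.1, 1.2),
    (1, 1.3, 1.4, 1.5),
    (1, 1.6, 1.7, 1.8),
]

# groupby only merges adjacent rows, hence the ordering assumption above.
streamlines = {
    index: [row[1:] for row in group]
    for index, group in groupby(rows, key=itemgetter(0))
}

for index, points in streamlines.items():
    print(index, len(points))
```

Each entry of `streamlines` then holds the ordered points of a single streamline, which is the shape most tractography tools expect.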

Data that is associated with each streamline is stored in columns that have the prefix "dps_". The function `init_example_trxparquet` can be used to create this kind of column. The column label includes a random string.

In [206]:
trxparquet.init_example_trxparquet(2, 3, 1)

TrxParquet(header=TrxHeader(DIMENSIONS=array([20, 20, 20], dtype=uint16), VOXEL_TO_RASMM=array([[20.,  0.,  0.,  0.],
       [ 0., 20.,  0.,  0.],
       [ 0.,  0., 20.,  0.],
       [ 0.,  0.,  0.,  1.]], dtype=float32)), data=shape: (6, 5)
┌─────────────────────┬─────────────────────┬────────────────────┬────────────────────┬────────────┐
│ protected_streamlin ┆ protected_position_ ┆ protected_position ┆ protected_position ┆ dps_urtupb │
│ e                   ┆ 0                   ┆ _1                 ┆ _2                 ┆ ---        │
│ ---                 ┆ ---                 ┆ ---                ┆ ---                ┆ f64        │
│ i64                 ┆ f64                 ┆ f64                ┆ f64                ┆            │
╞═════════════════════╪═════════════════════╪════════════════════╪════════════════════╪════════════╡
│ 0                   ┆ 0.321008            ┆ 0.503018           ┆ 0.214188           ┆ 0.530803   │
│ 0                   ┆ 0.093153            ┆ 0.977

Analogously, data that is associated with individual points will have the prefix "dpv_".

In [207]:
trx = trxparquet.init_example_trxparquet(2, 3, 1, 1)
trx

TrxParquet(header=TrxHeader(DIMENSIONS=array([20, 20, 20], dtype=uint16), VOXEL_TO_RASMM=array([[20.,  0.,  0.,  0.],
       [ 0., 20.,  0.,  0.],
       [ 0.,  0., 20.,  0.],
       [ 0.,  0.,  0.,  1.]], dtype=float32)), data=shape: (6, 6)
┌──────────────────┬─────────────────┬─────────────────┬─────────────────┬────────────┬────────────┐
│ protected_stream ┆ protected_posit ┆ protected_posit ┆ protected_posit ┆ dps_siwehr ┆ dpv_nfohed │
│ line             ┆ ion_0           ┆ ion_1           ┆ ion_2           ┆ ---        ┆ ---        │
│ ---              ┆ ---             ┆ ---             ┆ ---             ┆ f64        ┆ f64        │
│ i64              ┆ f64             ┆ f64             ┆ f64             ┆            ┆            │
╞══════════════════╪═════════════════╪═════════════════╪═════════════════╪════════════╪════════════╡
│ 0                ┆ 0.336253        ┆ 0.298366        ┆ 0.332231        ┆ 0.287658   ┆ 0.174158   │
│ 0                ┆ 0.672749        ┆ 0.622136    
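
Since the role of a column is carried by its prefix, column names alone are enough to tell the three kinds of data apart. The sketch below sorts a list of names into groups; `split_columns` is a hypothetical helper written for this illustration, not part of `trxparquet`, and the names are taken from the example output above.

```python
def split_columns(columns):
    """Hypothetical helper: group column names by their role prefix."""
    groups = {"protected": [], "dps": [], "dpv": [], "other": []}
    for name in columns:
        prefix = name.split("_", 1)[0]
        # Anything without a recognized prefix is collected under "other".
        groups[prefix if prefix in groups else "other"].append(name)
    return groups


columns = [
    "protected_streamline",
    "protected_position_0",
    "protected_position_1",
    "protected_position_2",
    "dps_siwehr",
    "dpv_nfohed",
]
groups = split_columns(columns)
print(groups["dps"], groups["dpv"])
```

A grouping like this could be a starting point for selecting only the per-streamline or per-point columns before further processing.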

These objects can also be converted into `StatefulTractogram`s using the `to_stf()` method.

In [208]:
stf = trxparquet.init_example_trxparquet(2, 3, 1, 1).to_stf()
stf.streamlines

ArraySequence([array([[0.89317422, 0.7977596 , 0.3516565 ],
       [0.83625061, 0.05822647, 0.53686733],
       [0.97521793, 0.29906299, 0.15221995]]), array([[0.07765206, 0.40298642, 0.15323251],
       [0.76250452, 0.51570536, 0.37552104],
       [0.4322841 , 0.82248218, 0.14803168]])])

The assumption is that, by relying on the parquet format, we get to leverage all of the work that has gone into making it an efficient medium for analysis. For example, let's create a file that has 1,000,000 streamlines, each with 100 points, and check the size of the file and how long it takes to load into memory.

In [209]:
import tempfile
import time
from pathlib import Path


with tempfile.NamedTemporaryFile(suffix=".parquet") as _f:
    f = Path(_f.name)
    trx = trxparquet.init_example_trxparquet(1000000, 100)
    trx.to_file(f)
    file_size = f.stat().st_size
    start = time.time()
    trx2 = trx.from_file(f, loadtype="memory")
    end = time.time()

object_size = trx2.data.estimated_size("mb")
print(f"{file_size=}")
print(f"{object_size=}")
print(f"reading time: {end - start}")

# Free the large objects before moving on to the next cell.
del trx, trx2

file_size=2435918725
object_size=3051.7578125
reading time: 3.426846981048584


There are different ways of loading parquet files, each optimized for different purposes. The previous cell loaded everything into memory. If only some streamlines (or only some columns) need to be processed, then lazy loading has many advantages (see https://pola-rs.github.io/polars/user-guide/concepts/lazy-vs-eager/). The reading time for lazy loading is minimal, but we can still extract useful information from the result.

In [210]:
with tempfile.NamedTemporaryFile(suffix=".parquet") as _f:
    f = Path(_f.name)
    trx = trxparquet.init_example_trxparquet(1000000, 100)
    trx.to_file(f)
    start = time.time()
    trx2 = trx.from_file(f, loadtype="lazy")
    n_streamlines_in_file = trx2.n_streamlines
    end = time.time()

print(f"reading time: {end - start}")
print(f"{n_streamlines_in_file=}")
del trx, trx2

reading time: 0.565061092376709
n_streamlines_in_file=1000000


Although unidimensional series (columns) have the best support, columns can store multi-dimensional arrays. For example, let's load data from `trx-python`.

In [211]:
import os
from trx import fetcher

with tempfile.TemporaryDirectory() as tmp_d:
    os.environ["TRX_HOME"] = str(tmp_d)
    data = {
        k: v
        for k, v in fetcher.get_testing_files_dict().items()
        if k == "gold_standard.zip"
    }
    fetcher.fetch_data(data)
    del os.environ["TRX_HOME"]

    out = trxparquet.TrxParquet.from_trx_file(
        tmp_d + "/gold_standard/gs_fldr.trx"
    )

out

TrxParquet(header=TrxHeader(DIMENSIONS=array([ 5, 10, 20], dtype=uint16), VOXEL_TO_RASMM=array([[ 3.9696155e+00, -2.4557561e-01,  7.5961235e-03,  1.2082228e+01],
       [ 4.9115121e-01,  1.9696155e+00, -1.2278780e-01,  2.2164438e+01],
       [ 3.0384494e-02,  2.4557561e-01,  9.9240386e-01,  3.7917774e+01],
       [ 0.0000000e+00,  0.0000000e+00,  0.0000000e+00,  1.0000000e+00]],
      dtype=float32)), data=shape: (104, 8)
┌────────────┬────────────┬────────────┬───────────┬───────────┬───────────┬───────────┬───────────┐
│ protected_ ┆ protected_ ┆ protected_ ┆ protected ┆ dps_rando ┆ dpv_color ┆ dpv_color ┆ dpv_color │
│ streamline ┆ position_0 ┆ position_1 ┆ _position ┆ m_coord   ┆ _y        ┆ _x        ┆ _z        │
│ ---        ┆ ---        ┆ ---        ┆ _2        ┆ ---       ┆ ---       ┆ ---       ┆ ---       │
│ i64        ┆ f32        ┆ f32        ┆ ---       ┆ array[f32 ┆ array[f32 ┆ array[f32 ┆ array[f32 │
│            ┆            ┆            ┆ f32       ┆ , 3]      ┆ , 1]

For a few additional examples, please see the tests.

Note that, at the time of writing, no group-level information has been incorporated into the prototype.