Converting tar archives into a reference filesystem

Zarr stores can strain the metadata servers of HPC systems because they may consist of millions of files. One way around this is to collect all files in a file container, e.g. in tar files, and to create a look-up table of the byte ranges at which the content of each file is stored within the container. Tar-ing zarr files also makes it easy to store and reuse data on tape archives.

tar_referencer creates these look-up tables that can be used with the preffs package.
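
To illustrate the idea of a byte-range look-up table, here is a minimal sketch that uses only Python's standard `tarfile` module. It is not tar_referencer's implementation, and the function names `index_tar` and `read_range` are hypothetical; the sketch only shows that the content of a file inside an uncompressed tar can be read back directly once its offset and size are known.

```python
import tarfile


def index_tar(tar_path):
    """Map each regular file in an uncompressed tar to (offset, size) of its data."""
    index = {}
    # Mode "r:" refuses compressed archives, where byte offsets would be meaningless.
    with tarfile.open(tar_path, "r:") as tar:
        for member in tar:
            if member.isfile():
                # offset_data points at the first data byte, after the member header.
                index[member.name] = (member.offset_data, member.size)
    return index


def read_range(tar_path, offset, size):
    """Read `size` bytes starting at `offset` without parsing the tar again."""
    with open(tar_path, "rb") as f:
        f.seek(offset)
        return f.read(size)
```

With such a table, `read_range(path, *index[name])` returns the bytes of a single zarr chunk without unpacking the archive, which is the access pattern a reference filesystem relies on.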

Usage

The package can be installed with

pip install git+https://github.com/observingClouds/tar_referencer.git

The look-up files (parquet reference files) are created with

tar_referencer -t file.*.tar -p file_index.preffs

If zarr files have been packed into tars and indexed with tar_referencer, the tars can be opened with:

import xarray as xr
# "prefix" is the directory that contains the tar files referenced by the index
storage_options = {"preffs": {"prefix": "/path/to/tar/files/"}}
ds = xr.open_zarr("preffs::file_index.preffs", storage_options=storage_options)

Creating tar files

Technically, all sorts of tar files can be referenced. However, tar_referencer currently only supports tar files that are split at the file level. Tar files that are split within a header or data block are not supported.

Warning: This does not work:

tar -cvf - big.tar | split --bytes=32000m --suffix-length=3 --numeric-suffix - part%03d.tar
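
Why a byte-level split is a problem can be sketched with the standard library alone. The example below builds a small tar in memory, cuts it at fixed byte positions and checks whether a piece can be opened on its own. The helper names `split_bytes` and `is_standalone_tar` are hypothetical and only serve the illustration.

```python
import io
import tarfile


def split_bytes(data, chunk_size):
    """Cut a byte string into fixed-size pieces, ignoring tar member boundaries."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


def is_standalone_tar(part):
    """Return True if `part` can be read as a complete tar archive by itself."""
    try:
        with tarfile.open(fileobj=io.BytesIO(part), mode="r:") as tar:
            tar.getmembers()
        return True
    except tarfile.ReadError:
        return False


# Build a tar with a single 4096-byte member in memory.
payload = b"x" * 4096
buffer = io.BytesIO()
with tarfile.open(fileobj=buffer, mode="w") as tar:
    info = tarfile.TarInfo("chunk")
    info.size = len(payload)
    tar.addfile(info, io.BytesIO(payload))

whole = buffer.getvalue()
parts = split_bytes(whole, 2048)

# parts[1] starts in the middle of the member's data block, so it has no
# valid tar header and cannot be indexed independently of the other parts.
second_part_ok = is_standalone_tar(parts[1])
```

Here `is_standalone_tar(whole)` succeeds while `second_part_ok` is false, because the second piece begins with raw data instead of a member header.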

To generate compatible tar files from zarr files or other directory structures, tar_referencer provides tar_creator:

tar_creator -i dataset.zarr -t dataset_part{:03d}.tar -s MAX_SIZE_BYTES

where MAX_SIZE_BYTES is the maximum size of a single tar file; once it is reached, further output is written to an additional archive.

To split already existing tar files, Splitar has been tested successfully:
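
The idea of splitting at the file level can be sketched as follows. This is not tar_creator's actual code: the function `pack_into_tars` is hypothetical, and the size accounting is an estimate that assumes one 512-byte header per file and ignores the end-of-archive padding, so the resulting archives can be slightly larger than the limit.

```python
import os
import tarfile

BLOCK = 512  # tar stores headers and data in 512-byte blocks


def pack_into_tars(src_dir, name_pattern, max_size_bytes):
    """Pack all files below src_dir into tars, switching archives only between files."""
    base = os.path.dirname(os.path.abspath(src_dir))
    archives = []
    tar = None
    used = 0
    for root, _dirs, files in sorted(os.walk(src_dir)):
        for fname in sorted(files):
            path = os.path.join(root, fname)
            size = os.path.getsize(path)
            # Estimated footprint: one header block plus the data padded to a full block.
            needed = BLOCK + -(-size // BLOCK) * BLOCK
            # Start a new archive if this file would push the current one over the limit.
            if tar is None or (used > 0 and used + needed > max_size_bytes):
                if tar is not None:
                    tar.close()
                archive = name_pattern.format(len(archives))
                tar = tarfile.open(archive, "w")
                archives.append(archive)
                used = 0
            # Keep the top-level directory name (e.g. dataset.zarr) in the member name.
            tar.add(path, arcname=os.path.relpath(path, base))
            used += needed
    if tar is not None:
        tar.close()
    return archives
```

Because a new archive is only ever started between two files, every resulting tar is a complete archive that can be opened and indexed on its own. A call such as `pack_into_tars("dataset.zarr", "dataset_part{:03d}.tar", 32 * 10**9)` mirrors the command line above.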

splitar -S 32000m big.tar part.tar-

Tips and tricks

For very large zarr datasets, especially those that contain several variables, it can be advisable to pack each variable subfolder of the zarr store into its own set of tars. The benefit of this approach is that only the tars containing the variable of interest need to be downloaded or retrieved. A separate look-up table can be generated for each of these sets, and the tables can then be merged into an overarching look-up table covering the entire dataset:

import pandas as pd
df_coords = pd.read_parquet("file_index.coords.preffs")
df_var1 = pd.read_parquet("file_index.var1.preffs")
df_var2 = pd.read_parquet("file_index.var2.preffs")
df_entire_dataset = pd.concat([df_coords, df_var1, df_var2]).sort_index()
df_entire_dataset.to_parquet("entire_dataset.preffs")
