This walkthrough assumes a MinIO bucket that already contains `30420.zarr`. Copy the store with s5cmd so that the hidden metadata files (such as `.zgroup` and `.zattrs`) are included.
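
The copy step can be scripted. The sketch below only builds the s5cmd command line; the endpoint URL and local directory are placeholders rather than values from this setup, and it assumes `s5cmd` is on your `PATH`. After copying, it is worth listing the bucket to confirm that the dot-prefixed metadata files arrived.

```python
import shlex


def build_s5cmd_copy(local_dir, bucket, endpoint_url):
    """Build an s5cmd command that copies a local Zarr store into a bucket."""
    store = local_dir.rstrip("/")
    name = store.split("/")[-1]
    return [
        "s5cmd",
        "--endpoint-url", endpoint_url,
        "cp",
        f"{store}/*",              # the wildcard is expanded by s5cmd, not by the shell
        f"s3://{bucket}/{name}/",
    ]


# Placeholder values: adjust them for your MinIO deployment.
cmd = build_s5cmd_copy("30420.zarr", "zarr-example", "http://localhost:9000")
print(shlex.join(cmd))
# To run the copy: import subprocess; subprocess.run(cmd, check=True)
```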

Set up lakeFS and the config

In [8]:
import s3fs
import zarr
import numpy as np
import xarray as xr
import dask.array as da
import lakefs_client
import lakefs as lf
from lakefs_client import models
from lakefs_client.client import LakeFSClient
from lakefs_spec import LakeFSFileSystem

fs = LakeFSFileSystem()

You might need to restart VS Code if the credentials have changed.

In [7]:
# Configure lakefs
lakefs_cred = {
    "key": "",
    "secret": "",
    "endpoint_url": "", 
}
configuration = lakefs_client.Configuration()
configuration.username = lakefs_cred['key']
configuration.password = lakefs_cred['secret']
configuration.host = lakefs_cred['endpoint_url']
client = LakeFSClient(configuration)
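
Rather than pasting secrets into the notebook, the credentials dictionary could be filled from environment variables. This is a minimal sketch: the variable names follow lakectl's naming and are an assumption about your setup, and `load_lakefs_credentials` is a hypothetical helper, not part of any lakeFS package.

```python
import os


def load_lakefs_credentials(env=None):
    """Read lakeFS credentials from environment variables.

    The variable names are an assumption (they follow lakectl's naming);
    change them to whatever your deployment uses.
    """
    env = os.environ if env is None else env
    names = {
        "key": "LAKECTL_CREDENTIALS_ACCESS_KEY_ID",
        "secret": "LAKECTL_CREDENTIALS_SECRET_ACCESS_KEY",
        "endpoint_url": "LAKECTL_SERVER_ENDPOINT_URL",
    }
    missing = [var for var in names.values() if not env.get(var)]
    if missing:
        raise KeyError(f"missing environment variables: {', '.join(missing)}")
    return {field: env[var] for field, var in names.items()}


# Example with an explicit mapping (dummy values) instead of the real environment.
example_env = {
    "LAKECTL_CREDENTIALS_ACCESS_KEY_ID": "example-key",
    "LAKECTL_CREDENTIALS_SECRET_ACCESS_KEY": "example-secret",
    "LAKECTL_SERVER_ENDPOINT_URL": "http://localhost:8000",
}
example_cred = load_lakefs_credentials(example_env)
print(sorted(example_cred))
```

The returned dictionary has the same `key`, `secret`, and `endpoint_url` fields as `lakefs_cred` above, so it can be used in its place.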

In [9]:
repo = "lakefs-zarr-test"
branch = "zarr-data"

In [11]:
repo = lf.Repository("lakefs-zarr-test").create(storage_namespace="s3://zarr-example")
print(repo)

{'id': 'lakefs-zarr-test', 'creation_date': 1719220965, 'default_branch': 'main', 'storage_namespace': 's3://zarr-example'}


If you check the LakeFS UI, you'll see that a repo has been created.

We now want to ingest our data from s3:

## Import data from bucket to LakeFS

The easiest way to get the data into lakeFS is through the UI. Click the green `Import` button and point it to your bucket, `s3://zarr-example`. Give the import a descriptive commit message.

You can also do this with the `lakectl` command-line tool.
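
The import can also be scripted from Python. A minimal sketch, assuming the `import_data(...).prefix(...).run()` chain of the high-level `lakefs` SDK (imported above as `lf`); `import_bucket_prefix` is a hypothetical helper, and the source and destination paths in the usage comment are guesses at this layout.

```python
def import_bucket_prefix(branch, source_uri, destination, message):
    """Import every object under source_uri into branch and commit the result.

    `branch` is expected to be a lakefs Branch, for example
    lf.repository("lakefs-zarr-test").branch("main").
    """
    importer = branch.import_data(commit_message=message)
    importer = importer.prefix(source_uri, destination=destination)
    return importer.run()


# Usage (needs a running lakeFS server, so it is left commented out):
# main = lf.repository("lakefs-zarr-test").branch("main")
# import_bucket_prefix(main, "s3://zarr-example/30420.zarr/", "30420.zarr/",
#                      "Import 30420.zarr from bucket")
```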

In [12]:
z = xr.open_zarr("lakefs://lakefs-zarr-test/main/30420.zarr")
z

Output (xarray Dataset repr; each of the four arrays shown has the same Dask summary):

| | Array | Chunk |
|---|---|---|
| Bytes | 117.19 kiB | 117.19 kiB |
| Shape | (30000,) | (30000,) |
| Dask graph | 1 chunks in 2 graph layers | |
| Data type | float32 numpy.ndarray | |

Create a branch to do some experimenting

In [14]:
branch1 = lf.repository("lakefs-zarr-test").branch("experiment1").create("main")

In [16]:
for branch in lf.repository("lakefs-zarr-test").branches():
    print(branch)

Branch(repository="lakefs-zarr-test", id="experiment1")
Branch(repository="lakefs-zarr-test", id="main")


## Add new group to the shot

Opening the store in append mode and modifying it writes the changes directly to the branch on lakeFS. The changes are listed as uncommitted until you commit them.

In [19]:
# Open the existing Zarr file
zarr_file_path = 'lakefs://lakefs-zarr-test/experiment1/30420.zarr'
zarr_file = zarr.open(zarr_file_path, mode='a')
# Create a new group
new_group = zarr_file.create_group('bar')

for diff in branch1.uncommitted():
    print(diff)

{'type': 'added', 'path': '30420.zarr/bar/.zgroup', 'path_type': 'object', 'size_bytes': 24}


Now commit the change. The commit then appears in the history of the `experiment1` branch in the UI.

In [20]:
ref = branch1.commit(message='Add new group', metadata={'using': 'python_sdk'})
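
The check-then-commit pattern used above can be wrapped in a small helper. This is a sketch, not part of the lakeFS SDK: `commit_if_changed` is a hypothetical name, and it relies only on the `uncommitted()` and `commit()` calls already used in this notebook.

```python
def commit_if_changed(branch, message, metadata=None):
    """Commit pending changes on branch; return None when nothing is pending."""
    pending = list(branch.uncommitted())
    if not pending:
        return None
    for change in pending:
        print(change)  # show what is about to be committed
    return branch.commit(message=message, metadata=metadata)


# Usage (needs a running lakeFS server, so it is left commented out):
# ref = commit_if_changed(branch1, "Add new group", {"using": "python_sdk"})
```

Returning `None` for a clean branch keeps the helper from attempting a commit when there is nothing to record.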

Then we can look at the differences, and merge it into main.

In [21]:
main = repo.branch("main")
for diff in main.diff(other_ref=branch1):
    print(diff)

{'type': 'added', 'path': '30420.zarr/bar/.zgroup', 'path_type': 'object', 'size_bytes': 24}


In [22]:
res = branch1.merge_into(main)

## Add new array to a group

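Now let's write an array to `bar`, the group we just created. Note that assigning an array directly to `zarr_file['bar']` replaces the group with an array of the same name, which is why the diff below shows `.zgroup` removed and `.zarray` added. To keep the group and put an array inside it, assign to a nested key instead, such as `zarr_file['bar/values']` (a hypothetical name).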

Create a branch for the changes

In [23]:
branch2 = lf.repository("lakefs-zarr-test").branch("experiment2").create("main")

Make the changes

In [30]:
# Open the existing Zarr file
zarr_file_path = 'lakefs://lakefs-zarr-test/experiment2/30420.zarr'
zarr_file = zarr.open(zarr_file_path, mode='a')
# Assigning to 'bar' overwrites the existing group with an array
zarr_file['bar'] = np.array([42, 3, 2, 4, 5])

In [31]:
for diff in branch2.uncommitted():
    print(diff)

{'type': 'added', 'path': '30420.zarr/bar/.zarray', 'path_type': 'object', 'size_bytes': 309}
{'type': 'removed', 'path': '30420.zarr/bar/.zgroup', 'path_type': 'object', 'size_bytes': 0}
{'type': 'added', 'path': '30420.zarr/bar/0', 'path_type': 'object', 'size_bytes': 56}
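
A diff like the one above, where a `.zgroup` is removed and a `.zarray` is added under the same path, means a Zarr v2 group was replaced by an array. The sketch below flags that pattern before committing or merging. `find_replaced_groups` is a hypothetical helper; it assumes the Zarr v2 metadata file names seen in this output and accepts either dictionaries or objects with `type` and `path` attributes.

```python
import posixpath


def find_replaced_groups(changes):
    """Return Zarr v2 paths where a group was replaced by an array."""

    def field(change, name):
        # Accept plain dicts as well as objects exposing attributes.
        return change[name] if isinstance(change, dict) else getattr(change, name)

    removed_groups, added_arrays = set(), set()
    for change in changes:
        parent, leaf = posixpath.split(field(change, "path"))
        if leaf == ".zgroup" and field(change, "type") == "removed":
            removed_groups.add(parent)
        elif leaf == ".zarray" and field(change, "type") == "added":
            added_arrays.add(parent)
    return sorted(removed_groups & added_arrays)


# The three changes printed above, reduced to the fields the helper needs.
changes = [
    {"type": "added", "path": "30420.zarr/bar/.zarray"},
    {"type": "removed", "path": "30420.zarr/bar/.zgroup"},
    {"type": "added", "path": "30420.zarr/bar/0"},
]
print(find_replaced_groups(changes))  # ['30420.zarr/bar']
```

With a live branch, the same call would be `find_replaced_groups(branch2.uncommitted())`.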


Commit the changes and merge them into main

In [32]:
ref = branch2.commit(message='Add array to group bar', metadata={'using': 'python_sdk'})

In [34]:
main = repo.branch("main")
for diff in main.diff(other_ref=branch2):
    print(diff)

# Merge branch into main
res = branch2.merge_into(main)

{'type': 'added', 'path': '30420.zarr/bar/.zarray', 'path_type': 'object', 'size_bytes': 309}
{'type': 'removed', 'path': '30420.zarr/bar/.zgroup', 'path_type': 'object', 'size_bytes': 24}
{'type': 'added', 'path': '30420.zarr/bar/0', 'path_type': 'object', 'size_bytes': 56}
