In [4]:
import pickle
from lakefs_spec import LakeFSFileSystem
import pandas as pd
import os
import lakefs
from datetime import date, time
from pyspark.sql.functions import col,isnan,when,count

You can create the repo with `lakectl` (the command line tool) or by running the cell below:

In [5]:
fs = LakeFSFileSystem()

REPO_NAME = "testing"
# exist_ok=True makes this cell safe to re-run. Without it, creating a repo that already
# exists raises ConflictException (409, 'error creating repository: not unique').
repo = lakefs.Repository(REPO_NAME, fs.client).create(
    storage_namespace="s3://example-data", exist_ok=True
)

If you check the LakeFS UI, you'll see that the repo has been created. Next, we want to ingest our data from S3:

## Import data from bucket to LakeFS

The easiest way to get the data into LakeFS is through the UI. Click the green `Import` button and point it at your bucket, `s3://example-data`.

You can also do this through the command line tool with:

```bash
lakectl import --from s3://example-data/ --to lakefs://testing/main/
```
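
The same import can probably also be scripted from Python. The sketch below assumes the high-level `lakefs` SDK exposes the `Branch.import_data(...).prefix(...).run()` chain; `import_bucket` is a hypothetical helper name, and the default commit message and the empty destination (meaning the repo root) are illustrative assumptions.

```python
def import_bucket(branch, source_uri, destination="", message="Import data from bucket"):
    """Import every object under `source_uri` into `branch` and commit the result.

    `branch` is expected to be a `lakefs.Branch`, for example
    `lakefs.Repository(REPO_NAME, fs.client).branch("main")`.
    """
    # import_data() returns an import manager bound to this branch
    importer = branch.import_data(commit_message=message)
    # prefix() registers a whole object-store prefix; `destination` is the
    # path inside the repo (assumed here: empty string for the repo root)
    importer = importer.prefix(source_uri, destination=destination)
    # run() starts the import and waits for it to finish
    return importer.run()


# Usage against a running lakeFS server (not executed here):
# main = lakefs.Repository(REPO_NAME, fs.client).branch("main")
# import_bucket(main, "s3://example-data/")
```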

Next, we create a new branch off `main`. This is where we will modify the data we pull from LakeFS:


In [8]:
NEW_BRANCH = lakefs.Branch(REPO_NAME, "transform-raw-data", client=fs.client)
# exist_ok=True avoids a ConflictException (409, 'branch already exists: not unique')
# when this cell is re-run.
NEW_BRANCH.create("main", exist_ok=True)

Let's now pull one of the files from the LakeFS repo and modify it:

In [6]:
# Read the parquet file from the main branch
my_data = pd.read_parquet(f"lakefs://{REPO_NAME}/main/30390.parquet")

# Add a placeholder column so the file differs from the version on main
my_data['new_column'] = 'new_values'

Great, let's write that to our branch and commit it:

In [9]:
with fs.transaction(REPO_NAME, NEW_BRANCH) as tx:
    my_data.to_parquet(f"lakefs://{REPO_NAME}/{tx.branch.id}/30390.parquet")
    tx.commit(message="Added some data to 30390")

No changes to commit on branch 'transaction-154969'.


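If we check the UI, we will see that in the `transform-raw-data` branch, the parquet file has been updated and committed. Note that the output above reports `No changes to commit`, which means this particular run did not create a new commit (for example, because the change had already been committed by an earlier run of the cell).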

Now suppose we want the older version of that file, not the updated one. We can look up the commit ID in code, as below, or find the commit we want through the UI:

In [11]:
repo = lakefs.Repository(REPO_NAME, fs.client)

# Get the previous commit with a lakeFS ref expression: as in git, a trailing `~` means "parent of".
previous_commit = repo.ref(f"{NEW_BRANCH.id}~").get_commit()
fixed_commit_id = previous_commit.id
print(fixed_commit_id)

3bc6f74d6ade4fded9d2e92d36a562a91a1eb486ed9dea7eb26cd25965a4500b
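
The `~` suffix used above can, as in git, take a count to step further back in history. A minimal sketch, assuming lakeFS accepts `ref~N` for the N-th ancestor; `ancestor_ref` is a hypothetical helper and not part of `lakefs` or `lakefs_spec`:

```python
def ancestor_ref(ref: str, generations: int = 1) -> str:
    """Build a ref expression pointing `generations` commits behind `ref`."""
    if generations < 0:
        raise ValueError("generations must be non-negative")
    # Zero generations is just the ref itself; otherwise append the `~N` suffix
    return ref if generations == 0 else f"{ref}~{generations}"


# Usage against a running lakeFS server (not executed here):
# two_back = repo.ref(ancestor_ref(NEW_BRANCH.id, 2)).get_commit()
```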


Great, let's use that to get the original version of the 30390 file:

In [13]:
orig_file = pd.read_parquet(f"lakefs://{REPO_NAME}/{fixed_commit_id}/30390.parquet")
orig_file

Unnamed: 0,dataset_item_uuid,description,file_name,format,label,mds_name,name,pass_,rank,shape,...,right,strobe,taps,top,trigger,vbin,view,width,_ARRAY_DIMENSIONS,INTERLACE_MODE
0,fe655d78-9c51-5974-b75c-f0038f64bcd5,Shot used for calibration (obsolete),,,Calibration Shot,\TOP.ANALYSED.ABM:CALIB_SHOT,ABM_CALIB_SHOT,0.0,1.0,[1],...,,,,,,,,,,
1,d8f2807e-a3c8-5d90-8f81-25318efb4bfc,"Failed = 0, OK = 1",,,channel_status,\TOP.ANALYSED.ABM.CHANNEL:STATUS,ABM_CHANNEL_STATUS,0.0,2.0,"[1, 32]",...,,,,,,,,,,
2,87598b5c-3805-5796-b918-33f65925c6e1,"Channel type (0 = poloidal, 1 = co-tangential,...",,,channel_type,\TOP.ANALYSED.ABM:CHANNEL_TYPE,ABM_CHANNEL_TYPE,0.0,2.0,"[1, 32]",...,,,,,,,,,,
3,15cba832-1046-58f8-975e-0ca2f2c6236d,Gain of pre-amplifiers,,,GAIN,\TOP.ANALYSED.ABM:GAIN,ABM_GAIN,0.0,2.0,"[1, 32]",...,,,,,,,,,,
4,ee332e1b-7404-5bcb-a933-4fd77a27e02e,Incident powers (x - channel),,,i-bol,\TOP.ANALYSED.ABM:I_BOL,ABM_I-BOL,0.0,2.0,"[7500, 32]",...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1042,d1a508cf-2c3b-5111-b22f-26e078afb051,Phantom colour camera,rco030390.ipx,IPX,,,RCO,-1.0,,,...,512.0,0.000000e+00,0.0,149.0,-0.1,0.0,HM11 - normal,512.0,,INTERLACE_PIXEL
1043,75db70bc-a66c-5cf1-aff0-676ab4a0e5c8,RBG 2D multi-colour visible bremsstrahlung camera,rgb030390.ipx,IPX,,,RGB,-1.0,,,...,640.0,0.000000e+00,1.0,1.0,-0.1,0.0,Sector9U,640.0,,
1044,a6e4aa24-84a7-5754-95ae-601b5b9567bf,RBG 2D multi-colour visible bremsstrahlung camera,rgc030390.ipx,IPX,,,RGC,-1.0,,,...,640.0,0.000000e+00,2.0,1.0,-0.1,0.0,Sector9U,640.0,,
1045,f273f736-ca87-5a22-a2f0-fcbba9ca00fb,Medium wavelength infrared camera,rir030390.ipx,IPX,,,RIR,-1.0,,,...,320.0,0.000000e+00,4.0,185.0,-0.1,0.0,Lower divertor view#6,320.0,,


No `new_column`!
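
The reads in this notebook differ only in the middle segment of the URI, which can be a branch name or a commit ID. A small sketch of that pattern; `lakefs_uri` is a hypothetical helper introduced here for illustration:

```python
def lakefs_uri(repo: str, ref: str, path: str) -> str:
    """Build a lakefs:// URI; `ref` can be a branch name or a commit ID."""
    # Strip a leading slash so the path is not separated by a double slash
    return f"lakefs://{repo}/{ref}/{path.lstrip('/')}"


# Usage with the variables from the cells above (not executed here):
# updated = pd.read_parquet(lakefs_uri(REPO_NAME, NEW_BRANCH.id, "30390.parquet"))
# original = pd.read_parquet(lakefs_uri(REPO_NAME, fixed_commit_id, "30390.parquet"))
# set(updated.columns) - set(original.columns)  # columns present only on the branch
```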

We haven't gone through merging branches here, but we can do that through the UI or through the command line tool.
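
A merge can likely be done from the notebook as well. The sketch below assumes the `lakefs` SDK's `Repository.branch()` and `merge_into()` calls; `merge_branch` is a hypothetical wrapper name, and merging into `main` is an assumption about the intended target.

```python
def merge_branch(repo, source: str, destination: str = "main"):
    """Merge branch `source` into `destination` and return the merge result.

    `repo` is expected to be a `lakefs.Repository`.
    """
    source_branch = repo.branch(source)
    destination_branch = repo.branch(destination)
    # merge_into() performs the merge on the server side
    return source_branch.merge_into(destination_branch)


# Usage against a running lakeFS server (not executed here):
# repo = lakefs.Repository(REPO_NAME, fs.client)
# merge_branch(repo, "transform-raw-data")
```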

Following [this lakeFS-spec tutorial](https://lakefs-spec.org/latest/tutorials/demo_data_science_project/), you can also tag commits with your own string. For example, we could tag the commit containing the updated file above as `first-file-change` and then use that tag to pull that version of the file.
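
As a rough sketch of that tagging step, assuming the `lakefs` SDK's `Repository.tag()` and `Tag.create()` calls (the linked tutorial may do this differently); `tag_ref` is a hypothetical helper name:

```python
def tag_ref(repo, tag_name: str, source_ref: str):
    """Create tag `tag_name` pointing at `source_ref` (a branch, commit ID or ref expression).

    `repo` is expected to be a `lakefs.Repository`.
    """
    # exist_ok=True keeps the call from failing if the tag was already created
    return repo.tag(tag_name).create(source_ref, exist_ok=True)


# Usage against a running lakeFS server (not executed here):
# repo = lakefs.Repository(REPO_NAME, fs.client)
# tag_ref(repo, "first-file-change", NEW_BRANCH.id)
# tagged = pd.read_parquet(f"lakefs://{REPO_NAME}/first-file-change/30390.parquet")
```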