<img src="https://store-images.s-microsoft.com/image/apps.22094.728e1f25-a784-458f-90e1-7729049edba2.144bf785-b784-41dd-bcef-c91792108c09.f0be1bc2-af8f-49fc-ac4c-dfd9d53d9e8d" alt="lakeFS logo" width=130/> 

# Using [Lua Hooks](https://docs.lakefs.io/howto/hooks/lua.html) in lakeFS (similar to GitHub Actions)

This notebook demonstrates how to create a pre-merge hook in lakeFS that validates metadata before data is merged into the production branch.

1. Define a hook configuration file and Lua scripts for metadata validation.
2. Perform an ETL process by creating an ingestion branch, uploading data files with metadata and atomically promoting the data to the production branch through a merge.
3. The pre-merge hook prevents the promotion due to metadata issues, resulting in a Precondition Failed error.
4. Attempt to change the metadata and promote it to production again. 

# Config

**_If you're not using the provided lakeFS server and MinIO storage then change these values to match your environment_**

### lakeFS endpoint and credentials

In [71]:
lakefsEndPoint = 'http://lakefs:8000' # e.g. 'https://username.aws_region_name.lakefscloud.io' 
lakefsExternalEndpoint = 'http://lakefs:28220'
lakefsAccessKey = 'V42FCGRVMK24JJ8DHUYG'
lakefsSecretKey = 'bKhWxVF3kQoLY9kFmt91l+tDrEoZjqnWXzY9Eza'

### Object Storage

In [35]:
storageNamespace = 's3://lakefs-demo-bucket' # e.g. "s3://bucket"

---

# Setup

**(you shouldn't need to change anything in this section, just run it)**

In [36]:
repo_name = "demo"

### Versioning Information

In [37]:
mainBranch = "main"
ingestionBranch = "dev"
fileName1 = "userdata1.parquet"
fileName2 = "userdata2.parquet"

### Some helper functions

In [38]:
from pyspark.sql.types import StructType, StructField, IntegerType, StringType
import os

def print_diff(diff):
    results = map(
        lambda n:[n.path,n.path_type,n.size_bytes,n.type],
        diff)

    from tabulate import tabulate
    print(tabulate(
        results,
        headers=['Path','Path Type','Size(Bytes)','Type']))

def print_commit(log):
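    from datetime import datetime, timezone
    from pprint import pprint

    print('Message:', log.message)
    print('ID:', log.id)
    print('Committer:', log.committer)
    # creation_date is a Unix timestamp; render it in UTC (utcfromtimestamp is deprecated since Python 3.12)
    print('Creation Date:', datetime.fromtimestamp(log.creation_date, tz=timezone.utc).strftime('%Y-%m-%d %H:%M:%S'))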
    print('Parents:', log.parents)
    print('Metadata:')
    pprint(log.metadata)

def lakefs_ui_endpoint(lakefsEndPoint):
    if lakefsEndPoint.startswith('http://host.docker.internal'):
        lakefsUIEndPoint = lakefsEndPoint.replace('host.docker.internal','127.0.0.1')
    elif lakefsEndPoint.startswith('http://lakefs'):
        lakefsUIEndPoint = lakefsEndPoint.replace('lakefs','127.0.0.1')
    else:
        lakefsUIEndPoint = lakefsEndPoint
        
    return lakefsUIEndPoint

### Import libraries

In [39]:
%xmode Minimal
import os
import lakefs
import lakefs_sdk
from lakefs_sdk.client import LakeFSClient
from lakefs_sdk import models
import yaml

Exception reporting mode: Minimal


### Set environment variables

In [40]:
os.environ["LAKECTL_SERVER_ENDPOINT_URL"] = lakefsEndPoint
os.environ["LAKECTL_CREDENTIALS_ACCESS_KEY_ID"] = lakefsAccessKey
os.environ["LAKECTL_CREDENTIALS_SECRET_ACCESS_KEY"] = lakefsSecretKey

### Verify lakeFS credentials by getting lakeFS version

In [41]:
print("Verifying lakeFS credentials…")
try:
    v=lakefs.client.Client().version
except Exception:
    print("🛑 failed to get lakeFS version")
else:
    print(f"…✅lakeFS credentials verified\n\nℹ️lakeFS version {v}")

Verifying lakeFS credentials…
…✅lakeFS credentials verified

ℹ️lakeFS version 1.43.0


### Working with the lakeFS Python client API

In [42]:
configuration = lakefs_sdk.Configuration(
    host=lakefsEndPoint,
    username=lakefsAccessKey,
    password=lakefsSecretKey,
)
lakefsClient = LakeFSClient(configuration)

### Define lakeFS Repository

In [43]:
repo = lakefs.Repository(repo_name).create(storage_namespace=f"{storageNamespace}/{repo_name}", default_branch=mainBranch, exist_ok=True)
branchMain = repo.branch(mainBranch)
print(repo)

{'id': 'demo', 'creation_date': 1732532388, 'default_branch': 'main', 'storage_namespace': 's3://lakefs-demo-bucket/demo'}


---

# Main demo starts here 🚦 👇🏻

## Setup and Configure Hooks

### Configure hooks in the repository

* Upload the [hooks config YAML file](./hooks/pre-merge-metadata-validation.yaml), which makes lakeFS check for mandatory metadata before data is merged into the main branch
* The hooks config file must be uploaded under the `_lakefs_actions/` prefix

In [44]:
hooks_config_yaml = "pre-merge-metadata-validation.yaml"
hooks_prefix = "_lakefs_actions"

# Read the local hooks config and close the file handle before uploading
with open(f'./hooks/{hooks_config_yaml}', 'r') as f:
    contentToUpload = f.read()
print(branchMain.object(f'{hooks_prefix}/{hooks_config_yaml}').upload(data=contentToUpload, mode='wb', pre_sign=False))

_lakefs_actions/pre-merge-metadata-validation.yaml
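The contents of the YAML file are not shown in this notebook. As a rough sketch of the shape such a pre-merge action definition usually takes, the snippet below builds the structure as a Python dict. The action name and the hook ids `check_commit_metadata` and `validate_datasets` are taken from the error messages that appear later in this notebook, and the script paths match the uploads in the following cells. The `args` block is an assumption inferred from those error messages, not a copy of the real file.

```python
import json

def build_action(branch="main"):
    """Return a dict mirroring a lakeFS pre-merge action definition (illustrative only)."""
    return {
        "name": "Validate Commit Metadata and Dataset Metadata Fields",
        # Run the hooks before any merge into the given branch
        "on": {"pre-merge": {"branches": [branch]}},
        "hooks": [
            {
                "id": "check_commit_metadata",
                "type": "lua",
                "properties": {
                    "script_path": "scripts/commit_metadata_validator.lua",
                    # Assumed arguments: one entry per mandatory field,
                    # with an optional pattern the value must match
                    "args": {
                        "notebook_url": {"pattern": "github.com/.*"},
                        "spark_version": {},
                    },
                },
            },
            {
                "id": "validate_datasets",
                "type": "lua",
                "properties": {
                    "script_path": "scripts/dataset_validator.lua",
                    # The real file probably passes validation rules here;
                    # they are omitted because the source does not show them
                },
            },
        ],
    }

action = build_action()
# JSON is used only to print the structure with the standard library;
# the file uploaded to lakeFS is YAML
print(json.dumps(action, indent=2))
```

Open `./hooks/pre-merge-metadata-validation.yaml` to see the definition this notebook actually uses.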


### Upload 1st script

##### The script [commit_metadata_validator.lua](./hooks/commit_metadata_validator.lua) checks the commit metadata to validate that the mandatory fields are present and that their values match the required patterns

In [45]:
lua_script_file_name = "commit_metadata_validator.lua"
lua_scripts_path = "scripts"

# Read the local Lua script and close the file handle before uploading
with open(f'./hooks/{lua_script_file_name}', 'r') as f:
    contentToUpload = f.read()
print(branchMain.object(f'{lua_scripts_path}/{lua_script_file_name}').upload(data=contentToUpload, mode='wb', pre_sign=False))

scripts/commit_metadata_validator.lua
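To make the check concrete, here is a minimal Python sketch of the kind of logic the commit metadata validator applies. It is not a translation of the Lua script: the function name and the `RULES` layout are hypothetical, the error wording is modeled on the messages shown later in this notebook, and Python's `re` stands in for Lua patterns, which are a different dialect. Unlike the hook, which stops at the first failure, this sketch collects every problem.

```python
import re

# Hypothetical rules: every key is mandatory; "pattern" is optional
RULES = {
    "notebook_url": {"pattern": "github.com/.*"},
    "spark_version": {},
}

def validate_commit_metadata(metadata, rules=RULES):
    """Return a list of error strings; an empty list means the metadata passes."""
    errors = []
    for field, rule in rules.items():
        if field not in metadata:
            errors.append(f"missing mandatory metadata field: {field}")
            continue
        pattern = rule.get("pattern")
        # re.search looks for the pattern anywhere in the value
        if pattern and not re.search(pattern, metadata[field]):
            errors.append(
                f"current value for commit metadata field {field} "
                f"does not match pattern: {pattern} - got: {metadata[field]}"
            )
    return errors

# The three merge attempts made later in this notebook, in order
attempts = [
    {"notebook_url": "https://github.com/treeverse/lakeFS-samples"},
    {"notebook_url": "https://github.ai/treeverse/lakeFS-samples", "spark_version": "3.3.2"},
    {"notebook_url": "https://github.com/treeverse/lakeFS-samples", "spark_version": "3.3.2"},
]
for metadata in attempts:
    print(validate_commit_metadata(metadata) or "ok")
```

Running a check like this locally before calling `merge_into` can save a round trip to the server, though the hook remains the authority on what is accepted.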


### Upload 2nd script

##### The script [dataset_validator.lua](./hooks/dataset_validator.lua) validates the existence of mandatory metadata describing a dataset

In [46]:
lua_script_file_name = "dataset_validator.lua"
lua_scripts_path = "scripts"

# Read the local Lua script and close the file handle before uploading
with open(f'./hooks/{lua_script_file_name}', 'r') as f:
    contentToUpload = f.read()
print(branchMain.object(f'{lua_scripts_path}/{lua_script_file_name}').upload(data=contentToUpload, mode='wb', pre_sign=False))

scripts/dataset_validator.lua
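The dataset validator's rules can be pictured with a similar Python sketch. The `SCHEMA` below is an assumption reconstructed from the error messages in this notebook (a boolean `contains_pii`, a required `approval_link` that must look like an HTTP(S) URL, and a `department` restricted to `hr`, `it`, or `other`). The real Lua script may enforce more or different rules, and the function name and schema layout are hypothetical.

```python
import re

# Hypothetical schema inferred from the hook's error messages
SCHEMA = {
    "contains_pii": {"type": bool},
    "approval_link": {"required": True, "type": str, "pattern": r"https?://.*"},
    "department": {"type": str, "choices": ["hr", "it", "other"]},
}

def validate_dataset_metadata(doc, schema=SCHEMA):
    """Return a list of error strings for a parsed dataset_metadata.yaml document."""
    errors = []
    for field, rule in schema.items():
        value = doc.get(field)
        if value is None:
            if rule.get("required"):
                errors.append(f"field '{field}' is required but no value given")
            continue
        expected = rule.get("type")
        if expected is not None and not isinstance(value, expected):
            errors.append(f"field '{field}' should be of type {expected.__name__}")
            continue
        pattern = rule.get("pattern")
        # re.match anchors the pattern at the start of the value
        if pattern and not re.match(pattern, value):
            errors.append(f"field '{field}' should match pattern '{pattern}'")
        choices = rule.get("choices")
        if choices and value not in choices:
            errors.append(f"field '{field}' should be one of '{', '.join(choices)}'")
    return errors

first_try = {"contains_pii": "yes", "rank": 1, "department": "finance"}
final_try = {
    "contains_pii": True,
    "approval_link": "https://example.com",
    "rank": 1,
    "department": "hr",
}
print(validate_dataset_metadata(first_try))
print(validate_dataset_metadata(final_try))
```

The first document fails on three counts at once, whereas the hook in this notebook reports one problem per merge attempt. That is why the walkthrough below fixes the metadata file one field at a time.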


### Commit changes to the lakeFS repo

In [47]:
ref = branchMain.commit(message='Added hooks config file and metadata validation scripts')
print_commit(ref.get_commit())

Message: Added hooks config file and metadata validation scripts
ID: a672fc9595fb01dd46586f82788bf300b1ff3af668c254a1e4f89f5a6c233151
Committer: quickstart
Creation Date: 2024-11-25 11:00:34
Parents: ['fbab0e2b947ddcaec1d5acf4bf74aa26c8449e2d3edecf45939119331536aef7']
Metadata:
{}


### Protect the main branch so that no one can write to it directly and all subsequent changes must arrive through a merge

In [48]:
lakefsClient.repositories_api.set_branch_protection_rules(
    repository=repo_name,
    branch_protection_rule=[models.BranchProtectionRule(
        pattern=mainBranch)])

# ETL Job Starts

## Create a new branch which will be used to ingest data

In [49]:
branchIngestion = repo.branch(ingestionBranch).create(source_reference=mainBranch, exist_ok=True)
print(f"{ingestionBranch} ref:", branchIngestion.get_commit().id)

dev ref: a672fc9595fb01dd46586f82788bf300b1ff3af668c254a1e4f89f5a6c233151


## Upload data files

In [50]:
obj = branchIngestion.object(path=f"datasets/{fileName1}")

with open(f"./data/{fileName1}", mode='rb') as reader, obj.writer(mode='wb') as writer:
    writer.write(reader.read())

In [51]:
obj = branchIngestion.object(path=f"datasets/{fileName2}")

with open(f"./data/{fileName2}", mode='rb') as reader, obj.writer(mode='wb') as writer:
    writer.write(reader.read())

## Upload metadata file

In [52]:
dataset_metadata_definition = {
   'contains_pii': 'yes',
   'rank': 1,
   'department': 'finance'
}

with branchIngestion.object(path='datasets/dataset_metadata.yaml').writer() as out:
   yaml.safe_dump(dataset_metadata_definition, out)

## Commit changes

In [54]:
ref = branchIngestion.commit(message='Added data and metadata files')
print_commit(ref.get_commit())

Message: Added data and metadata files
ID: 71dbb339f232ce7252113e18146ba8bda3d91985ac3d42ef6bd692a89bbb04ae
Committer: quickstart
Creation Date: 2024-11-25 11:04:44
Parents: ['a672fc9595fb01dd46586f82788bf300b1ff3af668c254a1e4f89f5a6c233151']
Metadata:
{}


## Promote the Data into production

#### Merging the ingestion branch with the current metadata to the production branch
#### üõëüõë Merge will fail because 'spark_version' metadata key is missing in the merge metadata.  Review the error message.

In [55]:
res = branchIngestion.merge_into(branchMain, 
        metadata={'notebook_url': 'https://github.com/treeverse/lakeFS-samples/blob/main/00_notebooks/hooks-metadata-validation.ipynb'})
print(res)

ServerException: code: 412, reason: Precondition Failed, body: {'message': '1 error occurred:\n\t* hook run id \'0000_0000\' failed on action \'Validate Commit Metadata and Dataset Metadata Fields\' hook \'check_commit_metadata\': runtime error: [string "lua"]:33: missing mandatory metadata field: spark_version\nstack traceback:\n\t[Go]: in function \'__index\'\n\t[string "lua"]:33: in main chunk\n\n'}
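The `message` in the error body is a bulleted list of failures followed, for Lua runtime errors, by a stack traceback. A small helper can pull out just the failure lines for logging or display. This is a sketch that works on the message string alone; `hook_failures` is a hypothetical name, and how the string is obtained from the exception object depends on the client version, so that part is left out.

```python
def hook_failures(message):
    """Extract the bulleted failure lines from a hook error message."""
    failures = []
    for line in message.splitlines():
        stripped = line.strip()
        # Failure entries are rendered as "\t* <text>"; traceback lines are skipped
        if stripped.startswith("* "):
            failures.append(stripped[2:])
    return failures

# Message text copied from one of the failed merges in this notebook
sample = (
    "1 error occurred:\n"
    "\t* validate_datasets: datasets/dataset_metadata.yaml: "
    "field 'contains_pii' should be of type boolean\n\n"
)
for failure in hook_failures(sample):
    print(failure)
```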

#### Add 'spark_version' metadata and try to merge again.
#### üõëüõë Merge will fail again because metadata field 'notebook_url' does not match the pattern: 'github.com/.*'.

In [57]:
res = branchIngestion.merge_into(branchMain, 
        metadata={'notebook_url': 'https://github.ai/treeverse/lakeFS-samples/blob/main/00_notebooks/hooks-metadata-validation.ipynb',
                 'spark_version': '3.3.2'})
print(res)

ServerException: code: 412, reason: Precondition Failed, body: {'message': '1 error occurred:\n\t* hook run id \'0000_0000\' failed on action \'Validate Commit Metadata and Dataset Metadata Fields\' hook \'check_commit_metadata\': runtime error: [string "lua"]:36: current value for commit metadata field notebook_url does not match pattern: github.com/.* - got: https://github.ai/treeverse/lakeFS-samples/blob/main/00_notebooks/hooks-metadata-validation.ipynb\nstack traceback:\n\t[Go]: in function \'for iterator\'\n\t[string "lua"]:36: in main chunk\n\n'}

#### Change 'github.ai' to 'github.com' in the value of 'notebook_url' metadata and try to merge again.
#### üõëüõë Merge will fail again because field 'contains_pii' in dataset_metadata.yaml file should be of type boolean.

In [58]:
res = branchIngestion.merge_into(branchMain, 
        metadata={'notebook_url': 'https://github.com/treeverse/lakeFS-samples/blob/main/00_notebooks/hooks-metadata-validation.ipynb',
                 'spark_version': '3.3.2'})
print(res)

ServerException: code: 412, reason: Precondition Failed, body: {'message': "1 error occurred:\n\t* validate_datasets: datasets/dataset_metadata.yaml: field 'contains_pii' should be of type boolean\n\n"}

#### Change value for the field 'contains_pii' in dataset_metadata.yaml file to 'True' and try to merge again.
#### üõëüõë Merge will fail again because field 'approval_link' is required in the dataset_metadata.yaml file.

In [59]:
dataset_metadata_definition = {
   'contains_pii': True,
   'rank': 1,
   'department': 'finance'
}

with branchIngestion.object(path='datasets/dataset_metadata.yaml').writer() as out:
   yaml.safe_dump(dataset_metadata_definition, out)

ref = branchIngestion.commit(message='Changed metadata file')
print_commit(ref.get_commit())

Message: Changed metadata file
ID: 18ee652360b8a44c666dd5e33563b6bcb9270bf0b2ecac89c6df0acc1fdf7c20
Committer: quickstart
Creation Date: 2024-11-25 11:07:29
Parents: ['71dbb339f232ce7252113e18146ba8bda3d91985ac3d42ef6bd692a89bbb04ae']
Metadata:
{}


In [60]:
res = branchIngestion.merge_into(branchMain, 
        metadata={'notebook_url': 'https://github.com/treeverse/lakeFS-samples/blob/main/00_notebooks/hooks-metadata-validation.ipynb',
                 'spark_version': '3.3.2'})
print(res)

ServerException: code: 412, reason: Precondition Failed, body: {'message': "1 error occurred:\n\t* validate_datasets: datasets/dataset_metadata.yaml: field 'approval_link' is required but no value given\n\n"}

#### Add field 'approval_link' in the dataset_metadata.yaml file and try to merge again.
#### üõëüõë Merge will fail again because value for field 'approval_link' should match the pattern 'https?:\\/\\/.*'.

In [61]:
dataset_metadata_definition = {
   'contains_pii': True,
   'approval_link': 'example.com',
   'rank': 1,
   'department': 'finance'
}

with branchIngestion.object(path='datasets/dataset_metadata.yaml').writer() as out:
   yaml.safe_dump(dataset_metadata_definition, out)

ref = branchIngestion.commit(message='Changed metadata file')
print_commit(ref.get_commit())

Message: Changed metadata file
ID: 8569cab26cfb27784303e009f31916f9e5a737b9ca501d4ae946ed1118f02baf
Committer: quickstart
Creation Date: 2024-11-25 11:08:48
Parents: ['18ee652360b8a44c666dd5e33563b6bcb9270bf0b2ecac89c6df0acc1fdf7c20']
Metadata:
{}


In [62]:
res = branchIngestion.merge_into(branchMain, 
        metadata={'notebook_url': 'https://github.com/treeverse/lakeFS-samples/blob/main/00_notebooks/hooks-metadata-validation.ipynb',
                 'spark_version': '3.3.2'})
print(res)

ServerException: code: 412, reason: Precondition Failed, body: {'message': "1 error occurred:\n\t* validate_datasets: datasets/dataset_metadata.yaml: field approval_link should match pattern 'https?:\\/\\/.*'\n\n"}

#### Change value for the field 'approval_link' from 'example.com' to 'https://example.com' and try to merge again.
#### üõëüõë Merge will fail again because value for the field 'department' should be one of 'hr, it, other'.

In [63]:
dataset_metadata_definition = {
   'contains_pii': True,
   'approval_link': 'https://example.com',
   'rank': 1,
   'department': 'finance'
}

with branchIngestion.object(path='datasets/dataset_metadata.yaml').writer() as out:
   yaml.safe_dump(dataset_metadata_definition, out)

ref = branchIngestion.commit(message='Changed metadata file')
print_commit(ref.get_commit())

Message: Changed metadata file
ID: 05f0a2769abc8dbe292f4966a58eeac39fc0c022297cacdd2f86f7a632a686e9
Committer: quickstart
Creation Date: 2024-11-25 11:09:11
Parents: ['8569cab26cfb27784303e009f31916f9e5a737b9ca501d4ae946ed1118f02baf']
Metadata:
{}


In [64]:
res = branchIngestion.merge_into(branchMain, 
        metadata={'notebook_url': 'https://github.com/treeverse/lakeFS-samples/blob/main/00_notebooks/hooks-metadata-validation.ipynb',
                 'spark_version': '3.3.2'})
print(res)

ServerException: code: 412, reason: Precondition Failed, body: {'message': "1 error occurred:\n\t* validate_datasets: datasets/dataset_metadata.yaml: field 'department' should be one of 'hr, it, other'\n\n"}

#### Change value for the field 'department' from 'finance' to 'hr' and try to merge again.
#### Merge will succeed this time.

In [65]:
dataset_metadata_definition = {
   'contains_pii': True,
   'approval_link': 'https://example.com',
   'rank': 1,
   'department': 'hr'
}

with branchIngestion.object(path='datasets/dataset_metadata.yaml').writer() as out:
   yaml.safe_dump(dataset_metadata_definition, out)

ref = branchIngestion.commit(message='Changed metadata file')
print_commit(ref.get_commit())

Message: Changed metadata file
ID: f557d8996434387502c5c3c83c72b33b9e08f36165757bdc84c446c9548fe7ca
Committer: quickstart
Creation Date: 2024-11-25 11:09:42
Parents: ['05f0a2769abc8dbe292f4966a58eeac39fc0c022297cacdd2f86f7a632a686e9']
Metadata:
{}


In [66]:
res = branchIngestion.merge_into(branchMain, 
        metadata={'notebook_url': 'https://github.com/treeverse/lakeFS-samples/blob/main/00_notebooks/hooks-metadata-validation.ipynb',
                 'spark_version': '3.3.2'})
print(res)

fa90c522501cb87e916880bdddbc8ae37670cfdcf6611b24cdce8dd899daec8a


## You can also review all Actions in the lakeFS UI

In [69]:
lakefsUIEndPoint = lakefs_ui_endpoint(lakefsEndPoint)
print(f"üëâüèª {lakefsUIEndPoint}/repositories/{repo_name}/actions")

üëâüèª http://127.0.0.1:28220/repositories/demo/actions
