<img src="./images/logo.svg" alt="lakeFS logo" width=300/> 

# Data Lineage with lakeFS

**Use Case**: Understand data transformations by using commits with metadata and "Blame" functionality

In this example, two data sets (employees and salaries) are ingested through two separate branches, merged together on a transformation branch, and finally promoted to the production branch.

At the very end of the process, the lakeFS "Blame" functionality (`log_commits`) is used to trace the origin of a specific file or dataset.

![](./images/data-lineage/CommitFlow.png)

## Config

**_If you're not using the provided lakeFS server and MinIO storage then change these values to match your environment_**

### lakeFS endpoint and credentials

In [1]:
lakefsEndPoint = 'http://lakefs:8000' # e.g. 'https://username.aws_region_name.lakefscloud.io' 
lakefsAccessKey = 'AKIAIOSFOLKFSSAMPLES'
lakefsSecretKey = 'wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY'

### Object Storage

In [2]:
storageNamespace = 's3://example' # e.g. "s3://bucket"
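
If hard-coding credentials in the notebook is not desirable, the same values could be read from the environment instead. The following is a minimal sketch under that assumption: the environment variable names and the `load_lakefs_config` helper are hypothetical and not part of the sample, and the defaults simply mirror the sample values above.

```python
import os


def load_lakefs_config(env=os.environ):
    """Read connection settings from a mapping, falling back to the sample values.

    The variable names below are illustrative assumptions, not names that
    lakeFS itself requires.
    """
    return {
        "endpoint": env.get("LAKEFS_ENDPOINT", "http://lakefs:8000"),
        "access_key": env.get("LAKEFS_ACCESS_KEY_ID", "AKIAIOSFOLKFSSAMPLES"),
        "secret_key": env.get(
            "LAKEFS_SECRET_ACCESS_KEY",
            "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY",
        ),
        "storage_namespace": env.get("STORAGE_NAMESPACE", "s3://example"),
    }


config = load_lakefs_config()
lakefsEndPoint = config["endpoint"]
lakefsAccessKey = config["access_key"]
lakefsSecretKey = config["secret_key"]
storageNamespace = config["storage_namespace"]
```

Passing the mapping as a parameter keeps the helper easy to exercise with a plain dict in tests.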

---

## Setup

**(you shouldn't need to change anything in this section, just run it)**

In [3]:
repo_name = "data-lineage"

### Create lakeFSClient

In [4]:
import lakefs_client
from lakefs_client.models import *
from lakefs_client.client import LakeFSClient

# lakeFS credentials and endpoint
configuration = lakefs_client.Configuration()
configuration.username = lakefsAccessKey
configuration.password = lakefsSecretKey
configuration.host = lakefsEndPoint

lakefs = LakeFSClient(configuration)

#### Verify lakeFS credentials by getting lakeFS version

In [5]:
print("Verifying lakeFS credentials…")
try:
    v=lakefs.config.get_config()
except Exception as e:
    print(f"🛑 failed to get lakeFS version: {e}")
else:
    print(f"…✅lakeFS credentials verified\n\nℹ️lakeFS version {v['version_config']['version']}")

Verifying lakeFS credentials…
…✅lakeFS credentials verified

ℹ️lakeFS version 0.104.0


### Define lakeFS Repository

In [6]:
import os  # needed for os._exit() in the error paths below
from lakefs_client.exceptions import NotFoundException

try:
    repo=lakefs.repositories.get_repository(repo_name)
    print(f"Found existing repo {repo.id} using storage namespace {repo.storage_namespace}")
except NotFoundException as f:
    print(f"Repository {repo_name} does not exist, so going to try and create it now.")
    try:
        repo=lakefs.repositories.create_repository(repository_creation=RepositoryCreation(name=repo_name,
                                                                                                storage_namespace=f"{storageNamespace}/{repo_name}"))
        print(f"Created new repo {repo.id} using storage namespace {repo.storage_namespace}")
    except lakefs_client.ApiException as e:
        print(f"Error creating repo {repo_name}. Error is {e}")
        os._exit(00)
except lakefs_client.ApiException as e:
    print(f"Error getting repo {repo_name}: {e}")
    os._exit(00)

Repository data-lineage does not exist, so going to try and create it now.
Created new repo data-lineage using storage namespace s3://example/data-lineage


### Set up Spark

In [7]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("lakeFS / Jupyter") \
        .config("spark.hadoop.fs.s3.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem") \
        .config("spark.hadoop.fs.s3a.endpoint", lakefsEndPoint) \
        .config("spark.hadoop.fs.s3a.path.style.access", "true") \
        .config("spark.hadoop.fs.s3a.access.key", lakefsAccessKey) \
        .config("spark.hadoop.fs.s3a.secret.key", lakefsSecretKey) \
        .getOrCreate()
spark.sparkContext.setLogLevel("INFO")

spark

## Versioning Information

In [8]:
productionBranch = "main"
ingestionBranch1 = "ingest1"
ingestionBranch2 = "ingest2"
transformationBranch = "transformation"
newPath = "partitioned_data"
fileName = "Employees.csv"

---

# Main demo starts here 🚦 👇🏻

## Ingest data into the first ingestion branch

In [9]:
lakefs.branches.create_branch(
    repository=repo.id,
    branch_creation=BranchCreation(
        name=ingestionBranch1,
        source=productionBranch))

'565ac7c45d1207c91d85aaea8714013116747d8c8cb684aea0077b9f3a816222'

In [10]:
import os
contentToUpload = open(f"/data/{fileName}", 'rb') # Only a single file per upload, which must be named "content"
lakefs.objects.upload_object(
    repository=repo.id,
    branch=ingestionBranch1,
    path=fileName, content=contentToUpload)

{'checksum': '4451cd251e4801764528483315b3d2b4',
 'content_type': 'text/csv',
 'mtime': 1689579651,
 'path': 'Employees.csv',
 'path_type': 'object',
 'physical_address': 's3://example/data-lineage/data/gmd1dg22tk3c76sjud40/ciqf10q2tk3c76sjudsg',
 'size_bytes': 771}

## Commit changes to first ingest branch and attach some metadata

In [11]:
lakefs.commits.commit(
    repository=repo.id,
    branch=ingestionBranch1,
    commit_creation=CommitCreation(
        message='Ingesting employees IDs',
        metadata={'using': 'python_api',
                  '::lakefs::codeVersion::url[url:ui]': 'https://github.com/treeverse/lakeFS-samples/blob/668c7d000b8c603b3f30789a8c10616086ef79c1/08-data-lineage/Data%20Lineage.ipynb',
                  'source': 'Employees.csv'}))

{'committer': 'everything-bagel',
 'creation_date': 1689579651,
 'id': '76021a4d6f0587783abbd6fe3886c9fd641da3d0f20a8c8b77b91f69a142aaa9',
 'message': 'Ingesting employees IDs',
 'meta_range_id': '',
 'metadata': {'::lakefs::codeVersion::url[url:ui]': 'https://github.com/treeverse/lakeFS-samples/blob/668c7d000b8c603b3f30789a8c10616086ef79c1/08-data-lineage/Data%20Lineage.ipynb',
              'source': 'Employees.csv',
              'using': 'python_api'},
 'parents': ['565ac7c45d1207c91d85aaea8714013116747d8c8cb684aea0077b9f3a816222']}
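
Because every commit in this walkthrough attaches the same kind of metadata, the dictionary could be built by a small helper. This is only a sketch: `build_commit_metadata` and `CODE_VERSION_KEY` are hypothetical names introduced here, while the keys and values are the ones used in the cell above.

```python
# Metadata key used in this notebook to link a commit to the code that produced it.
CODE_VERSION_KEY = "::lakefs::codeVersion::url[url:ui]"


def build_commit_metadata(source=None, code_url=None, using="python_api"):
    """Return a plain dict that can be passed as CommitCreation(metadata=...)."""
    metadata = {"using": using}
    if source is not None:
        metadata["source"] = source
    if code_url is not None:
        metadata[CODE_VERSION_KEY] = code_url
    return metadata


metadata = build_commit_metadata(
    source="Employees.csv",
    code_url="https://github.com/treeverse/lakeFS-samples/blob/668c7d000b8c603b3f30789a8c10616086ef79c1/08-data-lineage/Data%20Lineage.ipynb",
)
print(metadata)
```

Keeping the key names in one place makes it less likely that different ingestion jobs record lineage under slightly different keys.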

## Ingest data into the second ingestion branch

In [12]:
lakefs.branches.create_branch(
    repository=repo.id,
    branch_creation=BranchCreation(
        name=ingestionBranch2,
        source=productionBranch))

'565ac7c45d1207c91d85aaea8714013116747d8c8cb684aea0077b9f3a816222'

In [13]:
fileName = "Salaries.csv"

import os
contentToUpload = open(f"/data/{fileName}", 'rb') # Only a single file per upload, which must be named "content"
lakefs.objects.upload_object(
    repository=repo.id,
    branch=ingestionBranch2,
    path=fileName, content=contentToUpload)

{'checksum': '4399afd66bf99ea96717d711ff1624ea',
 'content_type': 'text/csv',
 'mtime': 1689579651,
 'path': 'Salaries.csv',
 'path_type': 'object',
 'physical_address': 's3://example/data-lineage/data/gmd1dg22tk3c76sjud40/ciqf10q2tk3c76sjudv0',
 'size_bytes': 836}

## Commit changes to second ingest branch with metadata

In [14]:
lakefs.commits.commit(
    repository=repo.id,
    branch=ingestionBranch2,
    commit_creation=CommitCreation(
        message='Ingesting Salaries',
        metadata={'using': 'python_api',
                  '::lakefs::codeVersion::url[url:ui]': 'https://github.com/treeverse/lakeFS-samples/blob/668c7d000b8c603b3f30789a8c10616086ef79c1/08-data-lineage/Data%20Lineage.ipynb',
                  'source': '/Salaries.csv'}))

{'committer': 'everything-bagel',
 'creation_date': 1689579652,
 'id': '34338a6dfc45a53bab344513eb09a6b2d3f4b62aae0550605b50844a397ca253',
 'message': 'Ingesting Salaries',
 'meta_range_id': '',
 'metadata': {'::lakefs::codeVersion::url[url:ui]': 'https://github.com/treeverse/lakeFS-samples/blob/668c7d000b8c603b3f30789a8c10616086ef79c1/08-data-lineage/Data%20Lineage.ipynb',
              'source': '/Salaries.csv',
              'using': 'python_api'},
 'parents': ['565ac7c45d1207c91d85aaea8714013116747d8c8cb684aea0077b9f3a816222']}

## Merge the lists in a transformation branch

In [15]:
lakefs.branches.create_branch(
    repository=repo.id,
    branch_creation=BranchCreation(
        name=transformationBranch,
        source=productionBranch))

'565ac7c45d1207c91d85aaea8714013116747d8c8cb684aea0077b9f3a816222'

In [16]:
lakefs.refs.merge_into_branch(
    repository=repo.id,
    source_ref=ingestionBranch1, 
    destination_branch=transformationBranch)

{'reference': 'ff6694ea7c30408aae2fa97cc614d52ecb0dcb2ee55cd0b624767d5b696ca4fe'}

In [17]:
lakefs.refs.merge_into_branch(
    repository=repo.id,
    source_ref=ingestionBranch2, 
    destination_branch=transformationBranch)

{'reference': 'bdabb4a97a4c7cfc0f266ff6fa8777a327164a959c24f77d0339a9e6b76674bb'}

In [18]:
employeeFile="Employees.csv"
SalariesFile="Salaries.csv"

In [19]:
dataPath = f"s3a://{repo.id}/{transformationBranch}/{employeeFile}"

df1 = spark.read.option("header", "true").csv(dataPath)
df1.show()


+---+--------+---+------+
| id|    name|age|gender|
+---+--------+---+------+
|101|    John| 32|  Male|
|102|    Jane| 28|Female|
|103|     Bob| 40|  Male|
|104|   Alice| 36|Female|
|105|    Mark| 44|  Male|
|106|   Julia| 29|Female|
|107|   David| 50|  Male|
|108|   Emily| 34|Female|
|109| Michael| 41|  Male|
|110|Samantha| 31|Female|
|111|   Chris| 45|  Male|
|112|   Megan| 27|Female|
|113|    Adam| 38|  Male|
|114|  Olivia| 33|Female|
|115|    Nick| 43|  Male|
|116|    Kate| 30|Female|
|117|     Max| 47|  Male|
|118|   Chloe| 25|Female|
|119|     Tom| 39|  Male|
|120|    Lisa| 35|Female|
+---+--------+---+------+
only showing top 20 rows
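
The `dataPath` above follows the pattern `s3a://<repository>/<branch>/<object path>`, so switching branches only changes the middle component. A minimal sketch of that pattern, where `lakefs_s3a_uri` is a hypothetical helper name:

```python
def lakefs_s3a_uri(repository, ref, path=""):
    """Build an s3a URI of the form s3a://<repository>/<ref>/<path>."""
    return f"s3a://{repository}/{ref}/{path.lstrip('/')}"


# Same object, addressed on two different branches of the sample repository.
print(lakefs_s3a_uri("data-lineage", "transformation", "Employees.csv"))
print(lakefs_s3a_uri("data-lineage", "main", "Employees.csv"))
```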



In [20]:
dataPath = f"s3a://{repo.id}/{transformationBranch}/{SalariesFile}"

df2 = spark.read.option("header", "true").csv(dataPath)
df2.show()

+---+---------------+------+
| id|     department|salary|
+---+---------------+------+
|101|          Sales| 60000|
|102|      Marketing| 55000|
|103|    Engineering| 70000|
|104|        Finance| 65000|
|105|Human Resources| 50000|
|106|          Sales| 62000|
|107|      Marketing| 57000|
|108|    Engineering| 72000|
|109|        Finance| 66000|
|110|Human Resources| 51000|
|111|          Sales| 63000|
|112|      Marketing| 58000|
|113|    Engineering| 73000|
|114|        Finance| 67000|
|115|Human Resources| 52000|
|116|          Sales| 64000|
|117|      Marketing| 59000|
|118|    Engineering| 74000|
|119|        Finance| 68000|
|120|Human Resources| 53000|
+---+---------------+------+
only showing top 20 rows



In [21]:
mergedDataset = df1.join(df2,["id"])
mergedDataset.show()

+---+--------+---+------+---------------+------+
| id|    name|age|gender|     department|salary|
+---+--------+---+------+---------------+------+
|101|    John| 32|  Male|          Sales| 60000|
|102|    Jane| 28|Female|      Marketing| 55000|
|103|     Bob| 40|  Male|    Engineering| 70000|
|104|   Alice| 36|Female|        Finance| 65000|
|105|    Mark| 44|  Male|Human Resources| 50000|
|106|   Julia| 29|Female|          Sales| 62000|
|107|   David| 50|  Male|      Marketing| 57000|
|108|   Emily| 34|Female|    Engineering| 72000|
|109| Michael| 41|  Male|        Finance| 66000|
|110|Samantha| 31|Female|Human Resources| 51000|
|111|   Chris| 45|  Male|          Sales| 63000|
|112|   Megan| 27|Female|      Marketing| 58000|
|113|    Adam| 38|  Male|    Engineering| 73000|
|114|  Olivia| 33|Female|        Finance| 67000|
|115|    Nick| 43|  Male|Human Resources| 52000|
|116|    Kate| 30|Female|          Sales| 64000|
|117|     Max| 47|  Male|      Marketing| 59000|
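|118|   Chloe| 25|Female|    Engineering| 74000|
|119|     Tom| 39|  Male|        Finance| 68000|
|120|    Lisa| 35|Female|Human Resources| 53000|
+---+--------+---+------+---------------+------+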
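
To make the join semantics concrete, here is a plain-Python sketch of an inner join on `id`, using a few rows from the tables above. It is an illustration only, not how Spark executes the join: `inner_join` is a hypothetical helper, it assumes `id` is unique in the right-hand table, and the row with id `999` is invented to show that unmatched rows are dropped. Values are strings because the CSV files are read without schema inference.

```python
employees = [
    {"id": "101", "name": "John", "age": "32", "gender": "Male"},
    {"id": "102", "name": "Jane", "age": "28", "gender": "Female"},
    {"id": "103", "name": "Bob", "age": "40", "gender": "Male"},
]
salaries = [
    {"id": "101", "department": "Sales", "salary": "60000"},
    {"id": "102", "department": "Marketing", "salary": "55000"},
    # Hypothetical row with no matching employee; an inner join drops it.
    {"id": "999", "department": "Engineering", "salary": "70000"},
]


def inner_join(left, right, key):
    """Inner-join two lists of dicts on `key` (assumes `key` is unique in `right`)."""
    right_by_key = {row[key]: row for row in right}
    return [
        {**row, **right_by_key[row[key]]}
        for row in left
        if row[key] in right_by_key
    ]


joined = inner_join(employees, salaries, "id")
for row in joined:
    print(row)
```

Employee 103 has no salary row in this reduced sample, so it is absent from the result, just as id 999 is.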

## Partition by department

In [22]:
newDataPath = f"s3a://{repo.id}/{transformationBranch}/{newPath}"

mergedDataset.write.partitionBy("department").csv(newDataPath)
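
`partitionBy("department")` writes one subdirectory per distinct value, named `department=<value>`, under the target path. That naming is what makes it possible to ask for the history of a single partition by prefix later in this notebook. A minimal sketch of the layout, where `partition_prefix` is a hypothetical helper:

```python
def partition_prefix(base_path, column, value):
    """Return the object prefix of one partition, e.g. base/column=value/."""
    return f"{base_path.rstrip('/')}/{column}={value}/"


# Prefix of the Engineering partition written by the cell above.
print(partition_prefix("partitioned_data", "department", "Engineering"))
```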

## Commit with metadata

In [23]:
lakefs.commits.commit(
    repository=repo.id,
    branch=transformationBranch,
    commit_creation=CommitCreation(
        message='Repartitioned by departments',
        metadata={'using': 'python_api',
                  '::lakefs::codeVersion::url[url:ui]': 'https://github.com/treeverse/lakeFS-samples/blob/668c7d000b8c603b3f30789a8c10616086ef79c1/08-data-lineage/Data%20Lineage.ipynb'}))

{'committer': 'everything-bagel',
 'creation_date': 1689579662,
 'id': 'a5553491b803755f6454b39327243f5e1c0d9cf1b8e2a0db0533798c288155d3',
 'message': 'Repartitioned by departments',
 'meta_range_id': '',
 'metadata': {'::lakefs::codeVersion::url[url:ui]': 'https://github.com/treeverse/lakeFS-samples/blob/668c7d000b8c603b3f30789a8c10616086ef79c1/08-data-lineage/Data%20Lineage.ipynb',
              'using': 'python_api'},
 'parents': ['bdabb4a97a4c7cfc0f266ff6fa8777a327164a959c24f77d0339a9e6b76674bb']}

## Atomically promote data to Production

In [24]:
lakefs.refs.merge_into_branch(
    repository=repo.id,
    source_ref=transformationBranch, 
    destination_branch=productionBranch)

{'reference': '1e90f483fbcd6fd8c7260cbae13badd9c429efeb61d829d536fc3ba0db8a68bd'}

## Where did a dataset come from?

In [25]:
commits = lakefs.refs.log_commits(repository=repo.id, ref='main', amount=1, limit=True, prefixes=['partitioned_data/department=Engineering/'])
print(commits.results)

[{'committer': 'everything-bagel',
 'creation_date': 1689579662,
 'id': 'a5553491b803755f6454b39327243f5e1c0d9cf1b8e2a0db0533798c288155d3',
 'message': 'Repartitioned by departments',
 'meta_range_id': '9e512c33f1136faa0660db932c6a78ee79eacd661a9100481d217c2d5b8bbdd1',
 'metadata': {'::lakefs::codeVersion::url[url:ui]': 'https://github.com/treeverse/lakeFS-samples/blob/668c7d000b8c603b3f30789a8c10616086ef79c1/08-data-lineage/Data%20Lineage.ipynb',
              'using': 'python_api'},
 'parents': ['bdabb4a97a4c7cfc0f266ff6fa8777a327164a959c24f77d0339a9e6b76674bb']}]


In [26]:
commits = lakefs.refs.log_commits(repository=repo.id, ref='main', amount=1, objects=['Employees.csv'])
print(commits.results)


[{'committer': 'everything-bagel',
 'creation_date': 1689579651,
 'id': '76021a4d6f0587783abbd6fe3886c9fd641da3d0f20a8c8b77b91f69a142aaa9',
 'message': 'Ingesting employees IDs',
 'meta_range_id': 'd5bfb714473c6c5c6da1f0ab2dd2df54f78ff94b4a674d88ec9c2283d0fe8e7a',
 'metadata': {'::lakefs::codeVersion::url[url:ui]': 'https://github.com/treeverse/lakeFS-samples/blob/668c7d000b8c603b3f30789a8c10616086ef79c1/08-data-lineage/Data%20Lineage.ipynb',
              'source': 'Employees.csv',
              'using': 'python_api'},
 'parents': ['565ac7c45d1207c91d85aaea8714013116747d8c8cb684aea0077b9f3a816222']}]
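
The raw commit records are verbose, so a one-line summary per commit can be easier to read when tracing lineage. The sketch below assumes each commit is available as a plain dict shaped like the output above (if the client returns model objects, they would need converting first); `summarize_commit` is a hypothetical helper, and the sample record is copied from the output of the previous cell.

```python
from datetime import datetime, timezone


def summarize_commit(commit):
    """Format a commit dict as: short id, UTC time, committer, message, source."""
    created = datetime.fromtimestamp(commit["creation_date"], tz=timezone.utc)
    source = commit.get("metadata", {}).get("source", "n/a")
    return (
        f"{commit['id'][:8]} {created.isoformat()} "
        f"{commit['committer']}: {commit['message']} (source: {source})"
    )


sample = {
    "committer": "everything-bagel",
    "creation_date": 1689579651,
    "id": "76021a4d6f0587783abbd6fe3886c9fd641da3d0f20a8c8b77b91f69a142aaa9",
    "message": "Ingesting employees IDs",
    "metadata": {"source": "Employees.csv", "using": "python_api"},
}
print(summarize_commit(sample))
# → 76021a4d 2023-07-17T07:40:51+00:00 everything-bagel: Ingesting employees IDs (source: Employees.csv)
```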


----

In [27]:
# The section below only works on lakeFS Cloud.
# This cell stops execution, which is useful if the notebook has been
# run from the top or is being run as part of automated testing.
import sys
print("ending notebook execution")
sys.exit(0)

ending notebook execution


SystemExit: 0



----

# Auditing (lakeFS Cloud only)

## Setup

### Creating an Engineering group

In [None]:
lakefs.auth.create_group(
    group_creation=GroupCreation(
        id='Engineering'))

### Creating an engineer1 User

In [None]:
lakefs.auth.create_user(
    user_creation=UserCreation(
        id='engineer1'))

### Adding the engineer1 User to the group

In [None]:
lakefs.auth.add_group_membership(
    group_id='Engineering',
    user_id='engineer1')

## Generating credentials and setting up a client for the Engineer1 User

In [28]:
credentials = lakefs.auth.create_credentials(user_id='engineer1')
print(credentials)
engineer1AccessKey = credentials.access_key_id
engineer1SecretKey = credentials.secret_access_key

NotFoundException: (404)
Reason: Not Found
HTTP response headers: HTTPHeaderDict({'Content-Type': 'application/json', 'X-Content-Type-Options': 'nosniff', 'X-Request-Id': '9686a272-3cd5-42e3-b25d-e36be8928c50', 'Date': 'Mon, 17 Jul 2023 07:41:02 GMT', 'Content-Length': '35'})
HTTP response body: {"message":"engineer1: not found"}



In [None]:
# lakeFS credentials and endpoint
configuration = lakefs_client.Configuration()
configuration.username = engineer1AccessKey
configuration.password = engineer1SecretKey
configuration.host = lakefsEndPoint

# Creating a client for engineer1
engineer1Client = LakeFSClient(configuration)
print("Created lakeFS client for engineer1.")

## Providing Engineers with Full Access to the Filesystem

In [None]:
lakefs.auth.attach_policy_to_group(
    group_id='Engineering',
    policy_id='FSFullAccess')

## Reading the Finance Department's Salary Data as engineer1

In [None]:
engineer1Client.objects.list_objects(
    repository=repo.id,
    ref='main',
    prefix='partitioned_data/department=Finance/'
)