

![Egeria Logo](https://raw.githubusercontent.com/odpi/egeria/main/assets/img/ODPi_Egeria_Logo_color.png)

### Egeria Workbook

# Publishing Open Lineage events to Marquez

## Introduction

[Marquez](https://marquezproject.ai/) is an open source data catalog that specializes in data observability.  It is particularly relevant to the open metadata ecosystem because it supports the visualization of [Open Lineage Events](https://egeria-project.org/features/lineage-management/overview/#the-open-lineage-standard) that track data flowing through different systems.

## Accessing Marquez

If you don't have Marquez running, `egeria_workspaces` offers a docker compose script that starts up Marquez and [Apache Airflow](https://airflow.apache.org/).  See **[airflow-marquez-compose](https://github.com/odpi/egeria-workspaces/tree/main/compose-configs/airflow-marquez-compose)**.  This will activate a Marquez server at `host.docker.internal:5050` and a web-based UI at `localhost:3000`.

This workbook creates an endpoint description of where Marquez is running and links it to the [Open Lineage API Publisher](https://egeria-project.org/features/lineage-management/overview/#egerias-open-lineage-support) connector.  As a result, each open lineage event that is either generated by Egeria or sent to Egeria is passed on to Marquez.

### Creating the Marquez endpoint

---

In [1]:
# Initialize pyegeria

%run ../../pyegeria/initialize-pyegeria.ipynb

In [2]:
# Initialise the client

egeria_client = EgeriaTech(view_server, url, user_id, user_pwd)
token = egeria_client.create_egeria_bearer_token()


In [3]:
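# Create endpoint - notice that Egeria connects to the Marquez API server on port 5050 rather than the UI on port 3000.
# The hostURL value must be resolvable from where Egeria runs (for example, http://host.docker.internal if Egeria runs in a Docker container).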

marquezEndpointTemplateGUID="9ea4bff4-d193-492f-bcad-6e68c07c6f9e"

body = {
    "templateGUID": marquezEndpointTemplateGUID,
    "isOwnAnchor": True,
    "placeholderPropertyValues": {
        "description" : "Link to Marquez",
        "serverName" : "Marquez",
        "hostURL" : "http://localhost",
        "portNumber" : "5050",
        "apiOperation" : "/api/v1/lineage"
    }
}
            
endpointGUID = egeria_client.create_element_from_template(body)
print("GUID of Marquez endpoint is: " + endpointGUID)


GUID of Marquez endpoint is: cf0bf3e8-ce40-4e53-8f18-a6c5862058bd


---
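
The placeholder values in the cell above describe where Marquez accepts events.  The following is a minimal sketch, using only the Python standard library, of how those values might combine into a single URL, together with an optional reachability check.  The helper names `build_lineage_url` and `marquez_is_reachable` are hypothetical, and the way the values are joined is an assumption made for illustration, not a description of how the endpoint template or the connector assembles the address.  The check relies on the Marquez `GET /api/v1/namespaces` operation.

```python
import urllib.request


def build_lineage_url(host_url: str, port_number: str, api_operation: str) -> str:
    """Join the placeholder values into one URL (assumed layout, for illustration only)."""
    return f"{host_url.rstrip('/')}:{port_number}/{api_operation.lstrip('/')}"


def marquez_is_reachable(host_url: str, port_number: str, timeout: float = 5.0) -> bool:
    """Return True if the Marquez API answers a request to list its namespaces."""
    url = f"{host_url.rstrip('/')}:{port_number}/api/v1/namespaces"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except OSError:
        # Covers connection refused, unknown host, timeouts and HTTP error statuses.
        return False


# Same values as the placeholderPropertyValues in the cell above.
lineage_url = build_lineage_url("http://localhost", "5050", "/api/v1/lineage")
print(lineage_url)
```

Calling `marquez_is_reachable("http://localhost", "5050")` from the machine where Egeria runs is a quick way to confirm that the host name and port are usable before adding the endpoint as a catalog target.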

Once the endpoint is defined, we can add it as a catalog target to the **OpenLineageAPIPublisher** integration connector.

----


In [4]:

OpenLineageAPIPublisherGUID="2156bc98-973a-4859-908d-4ccc96f53cc5"

egeria_client.add_catalog_target(OpenLineageAPIPublisherGUID, 
                                 endpointGUID, 
                                 "marquez",
                                 None,
                                 None,
                                 None,
                                 None)
                      


'3b41cdf6-4d69-4232-97a7-e48e0728d32f'

----

Use the `hey_egeria_ops show integrations status` command to check that the endpoint has been created.  You should see the endpoint configured with the `OpenLineageAPIPublisher` connector.

![Marquez endpoint running in qs-integration-daemon](images/marquez-endpoint-running.png)

### Testing the integration

A simple way to test that the integration between Egeria and Marquez is working is to run a [Governance Action Process](https://egeria-project.org/concepts/governance-action-process/) that causes Egeria to produce open lineage events.

The **DailyGovernanceActionProcess** is a simple process that outputs the day of the week.  The flow for a governance action process is stored in the open metadata repository.  It is possible to see what it does using `pyegeria` functions as follows. (Note: the **[Viewing Processes with Mermaid](../../governance-actions/viewing-processes/with_mermaid.ipynb)** notebook describes how this works in more detail.)

---

In [5]:

process_name = "Egeria:DailyGovernanceActionProcess"
process_guid = egeria_client.get_element_guid_by_unique_name(process_name)

mermaid_graph = generate_process_graph(process_guid)
render_mermaid(mermaid_graph)


----

When the process runs, it performs two steps.  It first determines the day of the week and then, depending on the result, runs a specific task for that day.

The code below runs the process.

----

In [6]:

egeria_client.initiate_gov_action_process(process_name, None, None, None, None, None, None)


'570b6690-960c-409f-89fd-2922f59f0e52'

----

It is possible to see the process running using the `hey_egeria_ops show engines activity` command in a Terminal window of this JupyterLab environment.


![Engine actions for the daily process](images/engine-actions-for-daily-process.png)


Each time the process is run, a record of the steps that ran is created.  The records are called *GovernanceActionProcessInstances*.
The code below extracts the list of GovernanceActionProcessInstances for DailyGovernanceActionProcess ...

---

In [7]:

processInstanceGUIDs = get_process_instances(egeria_client, process_name)


Process Instances:
 * Egeria:DailyGovernanceActionProcess@1742896958173:0df6407b-b191-46ca-a635-4287d361a6da [570b6690-960c-409f-89fd-2922f59f0e52]


----
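
The process instance line printed above appears to pack several values into one string.  The sketch below splits it apart with a regular expression.  It assumes, based only on the output shown in this workbook, that the layout is `processName@epochMillis:uuid [guid]` and that the number after `@` is a start time in milliseconds since the Unix epoch.  The function name `parse_process_instance` is hypothetical and is not part of `pyegeria`.

```python
import re
from datetime import datetime, timezone

# Assumed layout: <process name>@<epoch millis>:<uuid> [<guid>]
INSTANCE_PATTERN = re.compile(
    r"^(?P<process_name>.+)@(?P<epoch_millis>\d+):(?P<suffix>[0-9a-f-]{36})"
    r" \[(?P<guid>[0-9a-f-]{36})\]$"
)


def parse_process_instance(line: str) -> dict:
    """Split one printed process instance line into its parts."""
    text = line.strip().lstrip("*").strip()  # drop the leading bullet, if any
    match = INSTANCE_PATTERN.match(text)
    if match is None:
        raise ValueError(f"Unexpected process instance format: {line!r}")
    parts = match.groupdict()
    # Interpret the number as milliseconds since the epoch (an assumption).
    parts["started"] = datetime.fromtimestamp(
        int(parts["epoch_millis"]) / 1000, tz=timezone.utc
    )
    return parts


instance = parse_process_instance(
    " * Egeria:DailyGovernanceActionProcess@1742896958173:"
    "0df6407b-b191-46ca-a635-4287d361a6da [570b6690-960c-409f-89fd-2922f59f0e52]"
)
print(instance["process_name"], instance["guid"], instance["started"].isoformat())
```

The GUID in square brackets is the same value that `initiate_gov_action_process` returned earlier, which makes it easy to match a run to the call that started it.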

The code below renders the GovernanceActionProcessInstances as mermaid graphs.  Notice that the shape of the graph is different.  The graph of the process definition shown above shows all possible paths, whereas the GovernanceActionProcessInstances show the path that actually ran.

----

In [8]:

print_process_instances(egeria_client, process_name)


Process Instances:
 * Egeria:DailyGovernanceActionProcess@1742896958173:0df6407b-b191-46ca-a635-4287d361a6da [570b6690-960c-409f-89fd-2922f59f0e52]


---

Each time the process runs, Open Lineage events are sent to Marquez.

You can see the events for this process in the Marquez UI by opening a new browser tab and going to [`http://localhost:3000/events`](http://localhost:3000/events).  You should see something like this ...

![Open lineage events list](images/marquez-events-list.png)

If you click on one of the events, you can see the contents of the event.

![Open lineage event content](images/marquez-event-content.png)
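
To help read the event contents, the sketch below builds a minimal event with the core fields defined by the Open Lineage standard: `eventType`, `eventTime`, `run`, `job`, `inputs`, `outputs`, `producer` and `schemaURL`.  It is an illustration only and not the exact payload that Egeria emits.  The function name `make_run_event`, the job name, the producer URL and the schema version are assumptions chosen for the example.

```python
import json
import uuid
from datetime import datetime, timezone


def make_run_event(event_type: str, namespace: str, job_name: str,
                   run_id: str, producer: str) -> dict:
    """Build a minimal Open Lineage run event as a plain dictionary."""
    return {
        "eventType": event_type,  # for example START, RUNNING, COMPLETE, ABORT or FAIL
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "run": {"runId": run_id},
        "job": {"namespace": namespace, "name": job_name},
        "inputs": [],
        "outputs": [],
        "producer": producer,
        # Illustrative schema version; real events may declare a different one.
        "schemaURL": "https://openlineage.io/spec/2-0-2/OpenLineage.json#/$defs/RunEvent",
    }


event = make_run_event(
    event_type="COMPLETE",
    namespace="GovernanceActions",                       # namespace used in this workbook
    job_name="ExampleGovernanceAction",                  # hypothetical job name
    run_id=str(uuid.uuid4()),
    producer="https://example.org/workbook-sketch",      # hypothetical producer URI
)
print(json.dumps(event, indent=2))
```

The `job.namespace` value is what Marquez uses to group jobs, and the `run.runId` value ties together the events that belong to the same run.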

By clicking on the cogs icon on the left-hand menu, you switch to the **Jobs** display.  At the top right-hand corner, there is a drop-down to select the namespace (ns).  

![Marquez namespace selection](images/marquez-ns-selection.png)

Select **GovernanceActions** and the list of governance actions run by Egeria is displayed.

![Marquez job list](images/marquez-job-list.png)
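
The same job list can also be retrieved without the UI.  The sketch below assumes the Marquez API is available at `http://localhost:5050` and uses its `GET /api/v1/namespaces/{namespace}/jobs` operation, which is expected to return a JSON document containing a `jobs` array.  The helper names and the sample job names are hypothetical.

```python
import json
import urllib.request
from urllib.parse import quote


def jobs_url(base_url: str, namespace: str) -> str:
    """Build the URL that lists the jobs in one Marquez namespace."""
    return f"{base_url.rstrip('/')}/api/v1/namespaces/{quote(namespace, safe='')}/jobs"


def job_names(payload: dict) -> list:
    """Extract the job names from a Marquez jobs response."""
    return [job["name"] for job in payload.get("jobs", [])]


def fetch_job_names(base_url: str, namespace: str, timeout: float = 5.0) -> list:
    """Call Marquez and return the job names (requires a running Marquez server)."""
    with urllib.request.urlopen(jobs_url(base_url, namespace), timeout=timeout) as response:
        return job_names(json.load(response))


# Offline demonstration using a hand-written payload with hypothetical job names.
sample_payload = {"jobs": [{"name": "ExampleActionA"}, {"name": "ExampleActionB"}]}
print(jobs_url("http://localhost:5050", "GovernanceActions"))
print(job_names(sample_payload))
```

With Marquez running, `fetch_job_names("http://localhost:5050", "GovernanceActions")` should return the names shown in the **Jobs** display.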

Select one of the completed runs and a graph of the run is displayed.  Notice that it is a different shape from the mermaid graph of the process definition.  This is because the mermaid graph shows all possible paths (design lineage), whereas the flow in Marquez shows what actually happened.

![Marquez simple flow](images/marquez-simple-flow.png)

You can toggle the switches on the upper right to see more detail.

![Marquez full flow](images/marquez-full-flow.png)


## Where next?

So now you have Egeria publishing open lineage events to Marquez, what should you do next?  Here are some suggestions:

* Set up Egeria to receive open lineage events from other systems such as Apache Airflow, via the [Open Lineage Proxy Back-End](https://egeria-project.org/features/lineage-management/overview/#integrating-with-the-open-lineage-standard) and Apache Kafka [-> Link to notebook](../apache-kafka/kafka-open-lineage-events.ipynb).
* Run other governance action processes to survey and catalog other systems in order to generate more open lineage events [-> Link to Notebook](../cataloguing-and-surveys.ipynb).
