![Egeria Logo](https://raw.githubusercontent.com/odpi/egeria/master/assets/img/ODPi_Egeria_Logo_color.png)

### Egeria Hands-On Lab
# Welcome to the Open Lineage Lab

## Introduction

Egeria is an open source project that provides open standards and implementation libraries to connect tools, catalogs and platforms together so they can share information (called metadata) about data and the technology that supports it.

In this hands-on lab you will work with Egeria metadata and governance servers and learn how to manually create metadata that describes lineage for data movement processes. For this purpose we use **Open Lineage Services**, a governance server solution designed to capture and manage a historical warehouse of lineage information.
We will also show how you can use the general **Egeria UI** to search for data assets and visualize the lineage created earlier.

To read more about lineage concepts in Egeria, see https://egeria.odpi.org/open-metadata-publication/website/lineage/.

## The Scenario

The Egeria team use the personas and scenarios from the fictitious company called Coco Pharmaceuticals. (See https://opengovernance.odpi.org/coco-pharmaceuticals/ for more information.)

On their business transformation journey, after they successfully created a data catalog for the data lake, a new challenge emerged. Due to regulatory requirements, the business asked for improved data traceability. Introducing data lineage for critical data flows is the next level of maturity in their governance program.

In this lab we discover how to manually catalogue data assets in the data lake and describe the data movement for a simple data transformation process executed by their in-house ETL tool. Finally, users can find the data assets and visualize end-to-end lineage in the web UI.

Peter Profile and Erin Overview have been assigned to work on a solution to capture and report data lineage using Egeria.


## Setting up

Coco Pharmaceuticals make widespread use of Egeria for tracking and managing their data and related assets.
Figure 1 below shows their metadata servers and the Open Metadata and Governance (OMAG) Server Platforms that are hosting them.  Each metadata server supports a department in the organization.  The servers are distributed across the platform to even out the workload.  Servers can be moved to a different platform if needed.

![Figure 1](./images/coco-pharmaceuticals-systems-omag-server-platforms-metadata-server.png)
> **Figure 1:** Coco Pharmaceuticals' OMAG Server Platforms

The code below checks that the platforms are running.  It checks that the servers are configured and then if they are running on the platform.  If a server is configured, but not running, it will start it.

Look for the "Done." message.  This appears when `environment-check` has finished.


In [None]:
%run common/environment-check.ipynb

## Exercise 1

### Capturing lineage manually

In this first exercise, Peter and Erin need to understand the in-house data platform tools used to design the ETL processes that ingest data into the data lake. To start, they have chosen to manually catalogue data file assets from previous clinical trials. They will also need to create an asset for the process definition describing the data movement between the different data stores.

For use cases like this one, the Data Engine Access Service (OMAS) API is a good match. It enables external data platforms, tools or engines to interact with Egeria and share the metadata needed to construct a lineage graph.

----

The diagram below shows the external interaction, the internal processing in the metadata catalog server **cocoMDS1**, and the storage of lineage information in the Open Lineage Services governance server **cocoOLS1**.


![Figure 2](./images/data-engine-omas-open-lineage.png)
> **Figure 2:** Lineage processing flow

Once the lineage is stored, UI platforms and other UI solutions can query asset and lineage information and display visualizations such as end-to-end data lineage. The general **Egeria UI** currently supports this feature. (*See UI Lab reference here but first it needs update! ....*)

----

#### Check if assets are present in the catalog

Erin wants to confirm up front that the assets are not yet present in the catalog. She uses the Asset Catalog search option in the Egeria UI, but first she needs to log in. Access the Egeria UI at https://localhost:8080/
    
    username: erinoverview
    password: secret

![Erin Logon](./images/egeria-ui-erin-logon.png)
> **Figure 3:** Log on as user Erin Overview

Once logged on, she selects "Search" in the top navigation bar, which leads to the Asset Catalog search page.

![Navigation bar](./images/egeria-ui-nav-bar.png)
> **Figure 4:** Navigate to Asset Catalog search page

Erin already knows the name of the data file asset she is interested in, so she enters the text "OldMeasurementsArchive" in the search box and selects the type "Asset" from the list.

![Asset Catalog no results](./images/egeria-ui-asset-catalog-asset-not-found.png)

The UI responds with a message that no assets were found for the input provided. This is expected, since the assets have not been created yet.

#### Adding assets in the catalog

Peter is now ready to create the assets representing old clinical data from a previous period not covered by the regular ingestion process. He will create:

 - A `SoftwareServerCapability`, by registering the CocoETL tool they are using;
 - An asset of type DataFile describing a generic file store. This represents the original raw clinical data archive used as input for the TransformData process;
 - An asset of type Process, describing the definition of a simple ETL process that produces a new CSV file as its output;
 - An asset of type CSVFile, a more specific file data store created as the output of the transformation process designed in the CocoETL tool.
 
The level of asset detail shared with Egeria can vary depending on the need. Egeria supports lineage from the high-level process view down to the column level, capturing both data stores and processes. In this first exercise we focus on the high-level approach.

For the API calls, Peter uses the Data Lake Platform, which hosts the `cocoMDS1` repository server and the Data Engine Access Service (OMAS).

In [None]:
adminCommandURLRoot = dataLakePlatformURL
mdrServerName       = "cocoMDS1"
cocoETLEngineUser   = "cocoDEnpa1"
cocoETLName         = "CocoPharma/DataEngine/CocoETL"
filesRoot           = "file://secured/research/previous-clinical-trials/"
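
The Data Engine OMAS calls in this lab all share one URL pattern: the platform root, the server name, the calling user id and a final resource segment such as `registration` or `data-files`. As a minimal sketch, that pattern could be wrapped in a small helper. The function name `data_engine_url` and the platform address used in the example are assumptions made for illustration only; they are not part of the lab's helper notebooks.

```python
def data_engine_url(platform_url, server_name, user_id, resource):
    """Build a Data Engine OMAS endpoint URL following the pattern used in this lab."""
    return (
        platform_url.rstrip("/")
        + "/servers/" + server_name
        + "/open-metadata/access-services/data-engine/users/" + user_id
        + "/" + resource.lstrip("/")
    )

# Hypothetical platform address; in the lab this value comes from dataLakePlatformURL.
exampleURL = data_engine_url("https://localhost:9443", "cocoMDS1", "cocoDEnpa1", "registration")
print(exampleURL)
```

The cells in this lab build the same string inline with `+`, which works equally well; a helper only reduces repetition when many endpoints are called.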

1. Create a `SoftwareServerCapability` by registering CocoETL, the in-house tool they are using.

In [None]:

url = adminCommandURLRoot + '/servers/' + mdrServerName + '/open-metadata/access-services/data-engine/users/' + cocoETLEngineUser + '/registration'

requestBody = {
    "dataEngine":
        {
            "qualifiedName": cocoETLName,
            "displayName": "CocoETL",
            "description": "Requesting to register external data engine capability for Coco Pharmaceuticals in-house Data Platform ETL tool CocoETL.",
            "engineType": "DataEngine",
            "engineVersion": "1",
            "enginePatchLevel": "0",
            "vendor": "Coco Pharmaceuticals",
            "version": "1",
            "source": "CocoPharma"
        }
}


print(requestBody)

postAndPrintResult(url, json=requestBody, headers=jsonContentHeader)

2. Create an asset of type DataFile describing a generic file store. This represents the original raw clinical data archive used as input for the transformation process.

In [None]:

url = adminCommandURLRoot + '/servers/' + mdrServerName + '/open-metadata/access-services/data-engine/users/' + cocoETLEngineUser + '/data-files'
fileName1 = "OldMeasurementsArchive"
qualifiedFileName1 = filesRoot + fileName1 + "@" + cocoETLName

requestBodyFile1 = {
    "externalSourceName": cocoETLName,
    "file": {
        "fileType": "DataFile",
        "qualifiedName": qualifiedFileName1,
        "displayName": fileName1,
        "pathName": filesRoot + fileName1,
         "columns": [
            {
                "qualifiedName": "Id@" + fileName1,
                "displayName": "Id"
            },
            {
                "qualifiedName": "First@" + fileName1,
                "displayName": "First"
            },
            {
                "qualifiedName": "Last@" + fileName1,
                "displayName": "Last"
            },
            {
                "qualifiedName": "Location@" + fileName1,
                "displayName": "Location"
            }
        ]
    }
}

print(requestBodyFile1)

postAndPrintResult(url, json=requestBodyFile1, headers=jsonContentHeader)
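
The column entries in the request body follow a regular pattern: each qualified name is the column name, an `@`, and the file name. As a hedged sketch, the list could be generated rather than typed out. The helper name `build_columns` is hypothetical and only mirrors the structure shown in the cell above.

```python
def build_columns(column_names, file_name):
    """Return column entries whose qualified names follow the '<column>@<file>' pattern."""
    return [
        {"qualifiedName": name + "@" + file_name, "displayName": name}
        for name in column_names
    ]

# Same columns and file name as in the request body above.
exampleColumns = build_columns(["Id", "First", "Last", "Location"], "OldMeasurementsArchive")
for column in exampleColumns:
    print(column)
```

The resulting list can be placed under the `"columns"` key of the `"file"` element in place of the hand-written entries.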

In [None]:

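# Register the CSV file produced as output of the transformation process.
url = adminCommandURLRoot + '/servers/' + mdrServerName + '/open-metadata/access-services/data-engine/users/' + cocoETLEngineUser + '/data-files'
# Use cocoETLName (defined earlier); the name cocoETL is never defined in this notebook.
fileName2 = cocoETLName + "/home/files/emplname.csv"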

requestBodyFile2 = {
    "externalSourceName": cocoETLName,
    "file": {
        "fileType": "CSVFile",
        "qualifiedName": fileName2,
        "displayName": "emplname.csv",
        "pathName": "/home/files/emplname.csv",
         "columns": [
            {
                "qualifiedName": "EMPID@" + fileName2,
                "displayName": "EMPID"
            },
            {
                "qualifiedName": "FNAME@" + fileName2,
                "displayName": "FNAME"
            },
            {
                "qualifiedName": "LNAME@" + fileName2,
                "displayName": "LNAME"
            },
            {
                "qualifiedName": "LOC@" + fileName2,
                "displayName": "LOC"
            }
        ]
    }
}

print(requestBodyFile2)

postAndPrintResult(url, json=requestBodyFile2, headers=jsonContentHeader)

In [None]:
url = adminCommandURLRoot + '/servers/' + mdrServerName + '/open-metadata/access-services/data-engine/users/' + cocoETLEngineUser + '/processes'
copyColumnsProcess = cocoETLName + "/CopyColumnsProcess"

requestBodyCopyColumnsProcess = {
    "process":
        {
            "qualifiedName": copyColumnsProcess,
            "displayName": "CopyColumns",
            "name": "CopyColumnsETL.py",
            "description": "Process named 'CopyColumns' representing simple high level processing activity performed by CocoETL tool.",
            "owner": cocoETLEngineUser,
            "updateSemantic": "REPLACE"
        },
    "externalSourceName": cocoETLName
}

print(requestBodyCopyColumnsProcess)
postAndPrintResult(url, json=requestBodyCopyColumnsProcess, headers=jsonContentHeader)

In [None]:
url = adminCommandURLRoot + '/servers/' + mdrServerName + '/open-metadata/access-services/data-engine/users/' + cocoETLEngineUser + '/lineage-mappings'

requestBodyLineageMappings = {
    "lineageMappings": [
        {
            "sourceAttribute": qualifiedFileName1,
            "targetAttribute": copyColumnsProcess
        },
        {
            "sourceAttribute": copyColumnsProcess,
            "targetAttribute": fileName2
        }
    ],
    "externalSourceName": cocoETLName
}

print(requestBodyLineageMappings)
postAndPrintResult(url, json=requestBodyLineageMappings, headers=jsonContentHeader)
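
The lineage mappings above form a chain: input file to process, then process to output file. A minimal sketch of deriving such source/target pairs from an ordered list of qualified names is shown below. The helper name `chain_lineage_mappings` and the placeholder names in the example are assumptions for illustration; only the `sourceAttribute` and `targetAttribute` keys are taken from the request body above.

```python
def chain_lineage_mappings(qualified_names):
    """Pair each qualified name with the next one to form source/target mappings."""
    return [
        {"sourceAttribute": source, "targetAttribute": target}
        for source, target in zip(qualified_names, qualified_names[1:])
    ]

# Placeholder qualified names standing in for the input file, the process and the output file.
exampleMappings = chain_lineage_mappings(["input-file", "process", "output-file"])
for mapping in exampleMappings:
    print(mapping)
```

A list of N names yields N-1 mappings, so longer flows with several intermediate processes can be described the same way.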

#### Adding lineage mappings in the catalog
#### Finding assets in the UI and showing lineage