![Egeria Logo](https://raw.githubusercontent.com/odpi/egeria/master/assets/img/ODPi_Egeria_Logo_color.png)

### Egeria Hands-On Lab
# Welcome to the Open Lineage Lab

## Introduction

Egeria is an open source project that provides open standards and implementation libraries to connect tools, catalogs and platforms together so they can share information (called metadata) about data and the technology that supports it.

In this hands-on lab you will work with Egeria metadata and governance servers and learn how to manually create metadata that describes lineage for simple data movement processes. We will also show how to use the Egeria UI to search for data assets and visualize the lineage created earlier.

To read more about lineage concepts in Egeria, see https://egeria.odpi.org/open-metadata-publication/website/lineage.

## The Scenario

The Egeria team uses personas and scenarios from a fictitious company called Coco Pharmaceuticals (see https://opengovernance.odpi.org/coco-pharmaceuticals/ for more information).

On their business transformation journey, after they successfully created a data catalog for the data lake, a new challenge emerged. Due to regulatory requirements, the business asked for improved data traceability. Introducing data lineage for critical data flows is the next level of maturity in their governance program.

In this lab we discover how to manually catalogue data assets in the data lake and describe the data movement of a simple data transformation process executed by their in-house ETL tool. Finally, users can find the data assets and visualize end-to-end lineage in the web UI.

Peter Profile and Erin Overview have been assigned to work on a solution to visualize and report data lineage using Egeria. They are going to focus on the ETL tooling and make manual calls to the data catalog to register the relevant metadata about the assets and the connections between them that describe the data movement.


## Setting up

Coco Pharmaceuticals make widespread use of Egeria for tracking and managing their data and related assets.
Figure 1 below shows their metadata servers and the Open Metadata and Governance (OMAG) Server Platforms that are hosting them.  Each metadata server supports a department in the organization.  The servers are distributed across the platform to even out the workload.  Servers can be moved to a different platform if needed.

![Figure 1](./images/coco-pharmaceuticals-systems-omag-server-platforms-metadata-server.png)
> **Figure 1:** Coco Pharmaceuticals' OMAG Server Platforms

The code below checks that the platforms are running. It then checks that the servers are configured and whether they are running on the platform. If a server is configured but not running, the code starts it.

Look for the "Done." message.  This appears when `environment-check` has finished.


In [None]:
%run common/environment-check.ipynb

## Exercise 1
### Manually creating lineage

For this exercise, Peter and Erin came up with a list of the tasks needed:

- Identify the data assets and manually catalogue them;
- Identify the tool that is used for data transformation and register it in the data lake;
- Understand how to use the metadata provided by the tooling to create an asset for the data transformation process itself;
- Understand how to use the metadata provided by the tooling to create the mappings between the assets.



//TODO describe Egeria components involved

![Figure 2](./images/data-engine-omas-open-lineage.png)



### Create Metadata

In [None]:
adminCommandURLRoot = dataLakePlatformURL
#dataEngineUser = "bobnitter"
cocoETLEngineUser = "cocoDEnpa1"

In [None]:

url = adminCommandURLRoot + '/servers/' + mdrServerName + '/open-metadata/access-services/data-engine/users/' + cocoETLEngineUser + '/registration'
cocoETL = "CocoPharma/DataEngine/CocoETL"

requestBody = {
    "dataEngine":
        {
            "qualifiedName": cocoETL,
            "displayName": "CocoETL",
            "description": "Requesting to register external data engine capability for Coco Pharmaceuticals in-house Data Platform ETL tool.",
            "engineType": "DataEngine",
            "engineVersion": "1",
            "enginePatchLevel": "0",
            "vendor": "Coco Pharmaceuticals",
            "version": "1",
            "source": "CocoPharma"
        }
}


print(requestBody)

postAndPrintResult(url, json=requestBody, headers=jsonContentHeader)
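
The calls in this section all target the Data Engine access service under the same URL root and differ only in the last path segment (`registration`, `data-files`, `processes`, `lineage-mappings`). As a minimal sketch, a small helper could build these URLs in one place. The function name `data_engine_url` and the example platform URL and server name are hypothetical placeholders; in the lab the real values come from `environment-check`.

```python
def data_engine_url(platform_url, server_name, user_id, resource):
    """Build a Data Engine URL following the pattern used in this lab's cells."""
    return (
        platform_url.rstrip("/")
        + "/servers/" + server_name
        + "/open-metadata/access-services/data-engine/users/" + user_id
        + "/" + resource
    )

# Placeholder values for illustration only (not taken from the lab environment).
example_url = data_engine_url(
    "https://localhost:9443", "cocoMDS1", "cocoDEnpa1", "registration"
)
print(example_url)
```

With such a helper, each later cell would only need to change the final `resource` argument.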

In [None]:

url = adminCommandURLRoot + '/servers/' + mdrServerName + '/open-metadata/access-services/data-engine/users/' + cocoETLEngineUser + '/data-files'
fileName1 = cocoETL + "/home/files/names.csv"

requestBodyFile1 = {
    "externalSourceName": cocoETL,
    "file": {
        "fileType": "CSVFile",
        "qualifiedName": fileName1,
        "displayName": "names.csv",
        "pathName": "/home/files/names.csv",
        "columns": [
            {
                "qualifiedName": "Id@" + fileName1,
                "displayName": "Id"
            },
            {
                "qualifiedName": "First@" + fileName1,
                "displayName": "First"
            },
            {
                "qualifiedName": "Last@" + fileName1,
                "displayName": "Last"
            },
            {
                "qualifiedName": "Location@" + fileName1,
                "displayName": "Location"
            }
        ]
    }
}

print(requestBodyFile1)

postAndPrintResult(url, json=requestBodyFile1, headers=jsonContentHeader)
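
Each column entry in the request body follows the same convention: the qualified name is the column name, an `@`, and then the qualified name of the file. As a sketch of that convention, the column list could be generated from a list of names. The helper name `build_columns` is hypothetical and is not part of the lab's common notebooks.

```python
def build_columns(file_qualified_name, column_names):
    """Return column entries using the '<column>@<file qualified name>' convention."""
    return [
        {"qualifiedName": name + "@" + file_qualified_name, "displayName": name}
        for name in column_names
    ]

# Same file qualified name as the cell above, written out so the sketch is self-contained.
file_qualified_name = "CocoPharma/DataEngine/CocoETL" + "/home/files/names.csv"
columns = build_columns(file_qualified_name, ["Id", "First", "Last", "Location"])

for column in columns:
    print(column["qualifiedName"])
```

Generating the entries this way keeps the qualified name and display name of each column in step, which is easy to get wrong when the entries are typed by hand.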

In [None]:

url = adminCommandURLRoot + '/servers/' + mdrServerName + '/open-metadata/access-services/data-engine/users/' + cocoETLEngineUser + '/data-files'
fileName2 = cocoETL + "/home/files/emplname.csv"

requestBodyFile2 = {
    "externalSourceName": cocoETL,
    "file": {
        "fileType": "CSVFile",
        "qualifiedName": fileName2,
        "displayName": "emplname.csv",
        "pathName": "/home/files/emplname.csv",
        "columns": [
            {
                "qualifiedName": "EMPID@" + fileName2,
                "displayName": "EMPID"
            },
            {
                "qualifiedName": "FNAME@" + fileName2,
                "displayName": "FNAME"
            },
            {
                "qualifiedName": "LNAME@" + fileName2,
                "displayName": "LNAME"
            },
            {
                "qualifiedName": "LOC@" + fileName2,
                "displayName": "LOC"
            }
        ]
    }
}

print(requestBodyFile2)

postAndPrintResult(url, json=requestBodyFile2, headers=jsonContentHeader)

In [None]:
url = adminCommandURLRoot + '/servers/' + mdrServerName + '/open-metadata/access-services/data-engine/users/' + cocoETLEngineUser + '/processes'
copyColumnsProcess = cocoETL + "/CopyColumnsProcess"

requestBodyCopyColumnsProcess = {
    "process":
        {
            "qualifiedName": copyColumnsProcess,
            "displayName": "CopyColumns",
            "name": "CopyColumnsETL.py",
            "description": "Process named 'CopyColumns' representing simple high level processing activity performed by CocoETL tool.",
            "owner": cocoETLEngineUser,
            "updateSemantic": "REPLACE"
        },
    "externalSourceName": cocoETL
}

print(requestBodyCopyColumnsProcess)
postAndPrintResult(url, json=requestBodyCopyColumnsProcess, headers=jsonContentHeader)

In [None]:
url = adminCommandURLRoot + '/servers/' + mdrServerName + '/open-metadata/access-services/data-engine/users/' + cocoETLEngineUser + '/lineage-mappings'

requestBodyLineageMappings = {
    "lineageMappings": [
        {
            "sourceAttribute": fileName1,
            "targetAttribute": copyColumnsProcess
        },
        {
            "sourceAttribute": copyColumnsProcess,
            "targetAttribute": fileName2
        }
    ],
    "externalSourceName": cocoETL
}

print(requestBodyLineageMappings)
postAndPrintResult(url, json=requestBodyLineageMappings, headers=jsonContentHeader)
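
The lineage mappings above describe a chain: the source file feeds the process, and the process feeds the target file. As a sketch under that reading, the `lineageMappings` list can be derived from an ordered list of qualified names by pairing each element with its successor. The helper name `chain_lineage_mappings` is hypothetical; the qualified names are written out so the example runs on its own.

```python
def chain_lineage_mappings(qualified_names):
    """Pair each qualified name with the next one to form source/target mappings."""
    return [
        {"sourceAttribute": source, "targetAttribute": target}
        for source, target in zip(qualified_names, qualified_names[1:])
    ]

engine = "CocoPharma/DataEngine/CocoETL"
chain = [
    engine + "/home/files/names.csv",     # source file
    engine + "/CopyColumnsProcess",       # transformation process
    engine + "/home/files/emplname.csv",  # target file
]

mappings = chain_lineage_mappings(chain)
for mapping in mappings:
    print(mapping["sourceAttribute"], "->", mapping["targetAttribute"])
```

A chain of three names yields two mappings, matching the two entries in the request body above.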