![Egeria Logo](https://raw.githubusercontent.com/odpi/egeria/master/assets/img/ODPi_Egeria_Logo_color.png)

### Egeria Hands-On Lab
# Welcome to the Building a Data Catalog Lab

## Introduction

Egeria is an open source project that provides open standards and implementation libraries to connect tools, catalogs and platforms together so they can share information (called metadata) about data and the technology that supports it.

In this hands-on lab you will get a chance to work with three Egeria metadata servers to build a distributed catalog of data assets and then experiment with attaching feedback (comments) to the catalog entries from different servers.  We will also cover how governance zones can be used to group assets together and control who can discover them in the data catalog.

## The Scenario

The Egeria team use the personas and scenarios from the fictitious company called Coco Pharmaceuticals.  (See https://opengovernance.odpi.org/coco-pharmaceuticals/ for more information).

As part of the huge business transformation that Coco Pharmaceuticals has embarked on, they
have created a data lake for managing data for research, analytics and exchange between their internal organizations and business partners (such as hospitals).  As a result, the data lake has to be
designed to handle a wide variety of data, including some highly sensitive and regulated data.

In this lab we look at how data is catalogued in the data lake.  The two main characters engaged in the first part of this lab are Peter Profile and Erin Overview.

![Peter and Erin](../images/peter-and-erin.png)

Peter and Erin are cataloguing new data sets that have been received from a hospital.  These data sets are part of a clinical trial that the hospital is participating in.

## Setting up

Coco Pharmaceuticals make widespread use of Egeria for tracking and managing their data and related assets.
Figure 1 below shows their metadata servers and the Open Metadata and Governance (OMAG) Server Platforms that are hosting them.  Each metadata server supports a department in the organization.  The servers are distributed across the platform to even out the workload.  Servers can be moved to a different platform if needed.

![Figure 1](../images/coco-pharmaceuticals-systems-omag-server-platforms-metadata-server.png)
> **Figure 1:** Coco Pharmaceuticals' OMAG Server Platforms

The code below checks that the platforms are running.  It then checks that each server is configured and whether it is running on its platform.  If a server is configured but not running, the code starts it.

Look for the "Done." message.  This appears when `environment-check` has finished.

In [None]:
%run ../common/environment-check.ipynb

----
Peter is using the data lake operations metadata server called `cocoMDS1`. This server is hosted on the Data Lake OMAG Server Platform.

If any of the platforms are not running, follow [this link to set up and run the platform](https://egeria.odpi.org/open-metadata-resources/open-metadata-labs/).  If any server is reporting that it is not configured then
run the steps in the [Server Configuration](../egeria-server-config.ipynb) lab to configure
the servers.  Then re-run the previous step to ensure all of the servers are started.
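
The checks described above amount to a small decision table.  The sketch below is only an illustration of that logic; `next_step` is a hypothetical helper and is not part of `environment-check`.

```python
def next_step(platform_running, server_configured, server_running):
    """Return the action suggested by the lab text for one server (hypothetical helper)."""
    if not platform_running:
        return "set up and run the platform"
    if not server_configured:
        return "run the Server Configuration lab"
    if not server_running:
        # environment-check is described as starting configured servers itself
        return "start the server"
    return "nothing to do"

print(next_step(platform_running=True, server_configured=True, server_running=False))
```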

----
## Exercise 1

### Adding assets to the catalog

In the first exercise, Peter Profile is adding descriptions of some new data sets to the catalog. They are stored in the catalog as **Assets**.  An Asset represents a real resource of value that needs to be governed to ensure it is properly managed and used.

Every Asset identifies the owner of the resource.  This is either a person or a team.  The owner's role is to set up the Asset with the correct properties that define how the real resources (data sets in this case) should be managed.  This management is performed by tools, platforms and engines that host and/or work with the real resources.  If these technologies can connect to an open metadata repository, they can read these properties directly and ensure the correct actions are taken.  Some technologies do not support a direct connection to an open metadata repository.  Egeria also provides governance servers to actively push the Asset properties to these types of technologies using their native interfaces.

In either case, the owner's role in setting up the correct properties is an important one.

Peter will be acting as the owner of these new data sets. He uses the **Asset Owner** Open Metadata Access Service (OMAS) API to set up the Assets in the catalog.

----
Before adding the new Assets, Peter queries the current list of Clinical Trial Assets from cocoMDS1 to check that these data sets have not been added already.

In [None]:

assetOwnerPrintAssets(cocoMDS1Name, cocoMDS1PlatformName, cocoMDS1PlatformURL, petersUserId, ".*file.*")


----
We can see here that no assets are returned as the repository is empty.
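
The last argument of the query, `".*file.*"`, is a regular expression.  The sketch below shows how such a pattern behaves using Python's `re` module.  It assumes the pattern must match a whole property value; how the server evaluates the expression is not shown in this lab.  The sample values and the `matches` helper are hypothetical.

```python
import re

# Hypothetical property values, used only to illustrate the pattern.
candidate_values = [
    "file://example/folder/data.csv",
    "Quarterly report",
]

def matches(search_string, value):
    """Return True when the whole value matches the regular expression."""
    return re.fullmatch(search_string, value) is not None

for value in candidate_values:
    print(value, "->", matches(".*file.*", value))
```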

#### Adding weekly clinical trial assets


Peter is now going to create three weeks of clinical asset data. This data is stored in three data sets, one for each week.

He begins with week 1.  The Asset he creates includes the full path of the data set as well as some descriptive information.  This descriptive information helps others to locate and understand the data set.

In [None]:

displayName = "Week 1: Drop Foot Clinical Trial Measurements"
description = "One week's data covering foot angle, hip displacement and mobility measurements."
fullPath    = "file://secured/research/clinical-trials/drop-foot/DropFootMeasurementsWeek1.csv"

asset1guids = assetOwnerCreateCSVAsset(cocoMDS1Name, cocoMDS1PlatformName, cocoMDS1PlatformURL, petersUserId, displayName, description, fullPath)

print("Result of creating an asset is: ")
printGUIDList(asset1guids)


----
Notice the result is the list of unique identifiers (GUIDs) of the chain of assets for the folder structure and the file itself.

![Figure 2](../images/file-asset-hierarchy.png)
> **Figure 2:** Hierarchy of assets for a file
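
To see why a single file produces a chain of identifiers, the path can be broken into its folders.  This is a rough sketch of the decomposition only; `path_hierarchy` is a hypothetical helper and is not how Egeria itself builds the hierarchy.

```python
def path_hierarchy(full_path):
    """Split a path into the cumulative paths of each folder and the file (hypothetical helper)."""
    scheme, _, rest = full_path.partition("://")
    parts = rest.split("/")
    # One entry per folder level, ending with the file itself.
    return [scheme + "://" + "/".join(parts[:i]) for i in range(1, len(parts) + 1)]

fullPath = "file://secured/research/clinical-trials/drop-foot/DropFootMeasurementsWeek1.csv"

for level in path_hierarchy(fullPath):
    print(level)
```

Under this decomposition the path yields four folders followed by the file, so the file is the last entry in the chain.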

We need to save the file's unique identifier (the last one in the list) in a variable to use later.

In [None]:

asset1guid = getLastGUID(asset1guids)

print (" ")
print ("The GUID for asset 1 is: " + asset1guid)


----
Now let's take a look again at what assets are in the repository using the same get request we used earlier.


In [None]:

assetOwnerPrintAssets(cocoMDS1Name, cocoMDS1PlatformName, cocoMDS1PlatformURL, petersUserId, ".*file.*")


----

Notice that five assets are returned.  Four are folders and one is for the file.  The file system is not returned because strictly speaking, it is not an [Asset](https://egeria.odpi.org/open-metadata-publication/website/open-metadata-types/0010-Base-Model.html), it is a [SoftwareServerCapability](https://egeria.odpi.org/open-metadata-publication/website/open-metadata-types/0042-Software-Server-Capabilities.html).  This is part of a [SoftwareServer](https://egeria.odpi.org/open-metadata-publication/website/open-metadata-types/0040-Software-Servers.html) description.

Peter is now going to add the files for the next two weeks:

In [None]:
displayName = "Week 2: Drop Foot Clinical Trial Measurements"
description = "One week's data covering foot angle, hip displacement and mobility measurements."
fullPath    = "file://secured/research/clinical-trials/drop-foot/DropFootMeasurementsWeek2.csv"

asset2guids = assetOwnerCreateCSVAsset(cocoMDS1Name, cocoMDS1PlatformName, cocoMDS1PlatformURL, petersUserId, displayName, description, fullPath)
    
print ("\nRequest to create the week 2 Asset responded with: " )
printGUIDList(asset2guids)
asset2guid = getLastGUID(asset2guids)

displayName = "Week 3: Drop Foot Clinical Trial Measurements"
description = "One week's data covering foot angle, hip displacement and mobility measurements."
fullPath    = "file://secured/research/clinical-trials/drop-foot/DropFootMeasurementsWeek3.csv"

asset3guids = assetOwnerCreateCSVAsset(cocoMDS1Name, cocoMDS1PlatformName, cocoMDS1PlatformURL, petersUserId, displayName, description, fullPath)
    
print ("\nRequest to create the week 3 Asset responded with: " )
printGUIDList(asset3guids)
asset3guid = getLastGUID(asset3guids)

print (" ")
print ("Summary of the assets so far:")
print (' Asset 1 GUID is: ' + asset1guid)
print (' Asset 2 GUID is: ' + asset2guid)
print (' Asset 3 GUID is: ' + asset3guid)

----
Peter has successfully onboarded three file assets.  When we query the assets again, there are now seven assets.  All of the files are stored in the same folder on disk, so all of the Assets for these files are stored under the same FileFolder Asset in the metadata server.  So there are now four FileFolder Assets and three DataFile Assets.
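
The count of seven follows from the three files sharing their folders.  The sketch below reproduces the arithmetic by de-duplicating the folder paths; `path_hierarchy` is a hypothetical helper and only approximates how the hierarchy is formed.

```python
def path_hierarchy(full_path):
    """Split a path into the cumulative paths of each folder and the file (hypothetical helper)."""
    scheme, _, rest = full_path.partition("://")
    parts = rest.split("/")
    return [scheme + "://" + "/".join(parts[:i]) for i in range(1, len(parts) + 1)]

folder = "file://secured/research/clinical-trials/drop-foot/"
paths = [folder + "DropFootMeasurementsWeek" + str(week) + ".csv" for week in (1, 2, 3)]

# dict.fromkeys keeps the first occurrence of each path, so shared folders are counted once.
all_assets = list(dict.fromkeys(level for path in paths for level in path_hierarchy(path)))

files = [asset for asset in all_assets if asset.endswith(".csv")]
folders = [asset for asset in all_assets if not asset.endswith(".csv")]

print(len(folders), "folders +", len(files), "files =", len(all_assets), "assets")
```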

In [None]:

assetOwnerPrintAssets(cocoMDS1Name, cocoMDS1PlatformName, cocoMDS1PlatformURL, petersUserId, ".*file.*")
    

----
## Exercise 2 - Sharing the catalog and adding feedback

In this next exercise Erin is going to work with the assets that Peter created.  Erin is part of the governance team.  She is accessing
metadata using the `cocoMDS2` server.  It sits on the core OMAG Server Platform.

![Figure 1](../images/coco-pharmaceuticals-systems-omag-server-platforms-metadata-server.png)
> **Figure 1:** Coco Pharmaceuticals' OMAG Server Platforms (repeat)

So Erin is using a different server located on a different platform to Peter.

----
The metadata servers `cocoMDS1` and `cocoMDS2` are part of the same open metadata cohort called `cocoCohort`.  This means that they are actively sharing metadata.

![Figure 3](../images/coco-pharmaceuticals-systems-cohorts.png)
> **Figure 3:** Membership of Coco Pharmaceuticals' cohorts

----
Even though Erin is connected to a different server to Peter, she can see the same assets.  The search request below uses the Asset Consumer OMAS interface of cocoMDS2 to return the unique identifiers (GUIDs) of the assets for the three new files.

In [None]:
newFilesSearchString=".*Drop Foot Clinical Trial Measurements.*"

print("Current assets defined: ")
assetConsumerPrintAssets(cocoMDS2Name, cocoMDS2PlatformName, cocoMDS2PlatformURL, erinsUserId, newFilesSearchString)

----
These are the same GUIDs as the ones saved when Peter created the assets:

In [None]:
print (" ")
print ("Review of the assets so far:")
print (' Asset 1 GUID is: ' + asset1guid)
print (' Asset 2 GUID is: ' + asset2guid)
print (' Asset 3 GUID is: ' + asset3guid)

----
Erin looks at the new assets that Peter has defined and has a question.  She adds a comment to the first asset.

In [None]:
commentType = "QUESTION"
commentText = "This file has much less data than normal.  Did the hospital provide any additional information about this batch to explain it?"
isPublic    = True

commentGUID = addCommentToAsset(cocoMDS2Name, cocoMDS2PlatformName, cocoMDS2PlatformURL, erinsUserId, asset1guid, commentText, commentType, isPublic)

print (" ")
if commentGUID:
    print ('Erin\'s comment guid is: ' + commentGUID)

----
The comment is attached to the asset.  Peter can query an asset's comments as follows:

In [None]:
assetConsumerPrintAssetComments(cocoMDS2Name, cocoMDS2PlatformName, cocoMDS2PlatformURL, petersUserId, asset1guid)

----
He replies to Erin's question.

In [None]:
commentType = "ANSWER"
commentText = "I checked back with Bobbie Records and they had an air conditioning failure that caused them to cancel patient appointments for 2 days - hence less data.  They are working to catch up on their waiting list so expect increased data for the next few weeks."
isPublic    = True

print(asset1guid)
replyGUID = addReplyToAssetComment(cocoMDS1Name, cocoMDS1PlatformName, cocoMDS1PlatformURL, petersUserId, asset1guid, commentGUID, commentText, commentType, isPublic)

print (" ")
if replyGUID:
    print ('Peter\'s comment guid is: ' + replyGUID)


----
Erin views the reply.

In [None]:
assetConsumerPrintAssetCommentReplies(cocoMDS2Name, cocoMDS2PlatformName, cocoMDS2PlatformURL, erinsUserId, asset1guid, commentGUID)

----
This is the current information known about the first asset:

In [None]:
assetConsumerPrintAssetUniverse(cocoMDS2Name, cocoMDS2PlatformName, cocoMDS2PlatformURL, petersUserId, asset1guid)

---

## Summary of Exercises 1 and 2

In the first two exercises of this hands-on lab you have shown that two servers with their own repositories can share and extend the metadata contributed by the other.  It began with Peter creating three assets in cocoMDS1.  Erin then connected to cocoMDS2 and she could also see these assets.  Then Erin was able to attach a comment to one of those assets through cocoMDS2 and Peter was then able to respond through cocoMDS1.

Hence this is a truly distributed catalog.


![Figure 3](../images/distributed-asset-with-comments.png)
> **Figure 3:** Asset and Comments distributed across 2 servers
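
The distribution can be pictured with a small data model that records the server through which each element was created.  This is an illustrative sketch only: the `Asset` and `Comment` classes and the `home_servers` helper are hypothetical and do not reflect Egeria's type definitions.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Comment:
    text: str
    comment_type: str          # "QUESTION" or "ANSWER", as used in the exercise
    home_server: str           # server through which the comment was created
    replies: List["Comment"] = field(default_factory=list)

@dataclass
class Asset:
    display_name: str
    home_server: str
    comments: List[Comment] = field(default_factory=list)

def home_servers(asset):
    """Collect the servers that contributed the asset, its comments and their replies."""
    servers = {asset.home_server}
    pending = list(asset.comments)
    while pending:
        comment = pending.pop()
        servers.add(comment.home_server)
        pending.extend(comment.replies)
    return servers

question = Comment("Why is there less data?", "QUESTION", "cocoMDS2")
question.replies.append(Comment("Appointments were cancelled.", "ANSWER", "cocoMDS1"))
asset = Asset("Week 1: Drop Foot Clinical Trial Measurements", "cocoMDS1", [question])

print(sorted(home_servers(asset)))
```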


----
## Exercise 3 - controlling access to assets

In the next exercise we will consider how organizations control the visibility of assets.
Peter and Erin are joined by their colleague Callie Quartile, a data scientist working in the research team.

![Callie Quartile](https://raw.githubusercontent.com/odpi/data-governance/master/docs/coco-pharmaceuticals/personas/callie-quartile.png)

Callie has heard that the clinical trial files have arrived.  She is keen to start working on them as there was a delay in receiving the first two weeks' worth of data.

Since Callie works in the research team, she uses the `cocoMDS3` metadata server.  She tries a search for the files.

In [None]:
assetConsumerPrintAssets(cocoMDS3Name, cocoMDS3PlatformName, cocoMDS3PlatformURL, calliesUserId, newFilesSearchString)

----
Even though the assets are defined and being shared across the `cocoCohort`, Callie cannot see them because, by default, `cocoMDS1` is set up to create assets in what is called the `quarantine zone` and `cocoMDS3` cannot access assets in the `quarantine zone`.

Governance zones are groups of related assets.  Coco Pharmaceuticals have created the `quarantine zone` for assets that are only partially catalogued.  They can only be accessed through the data lake operations and governance servers.  Once Peter has completed setting up the Assets, they will be moved into the `data lake zone` and Callie will be able to see them.

![Figure 4](../images/asset-zones-for-building-catalog.png)
> **Figure 4:** Governance Zones affecting the building of the catalog
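
A simplified way to think about zone-based visibility is as a set intersection: a server shows an asset when the asset belongs to at least one zone that the server is allowed to use.  The sketch below models that rule.  The `is_visible` helper, the zone name `quarantine` and the list of zones for the research server are assumptions made for illustration, not the lab's actual configuration.

```python
def is_visible(asset_zones, supported_zones):
    """An asset is visible when it shares at least one zone with the server (simplified model)."""
    return bool(set(asset_zones) & set(supported_zones))

# Hypothetical list of zones that the research server is allowed to work with.
research_server_zones = ["data-lake", "research"]

zones_before_onboarding = ["quarantine"]
zones_after_onboarding = ["data-lake", "clinical-trials"]

print("Before:", is_visible(zones_before_onboarding, research_server_zones))
print("After: ", is_visible(zones_after_onboarding, research_server_zones))
```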


The next section completes the onboarding.

----


In [None]:

assetOwner = "tanyatidie"
ownerType  = "USER_ID"

addOwner(cocoMDS2Name, cocoMDS2PlatformName, cocoMDS2PlatformURL, erinsUserId, "Asset 1", asset1guid, assetOwner, ownerType)
addOwner(cocoMDS2Name, cocoMDS2PlatformName, cocoMDS2PlatformURL, erinsUserId, "Asset 2", asset2guid, assetOwner, ownerType)
addOwner(cocoMDS2Name, cocoMDS2PlatformName, cocoMDS2PlatformURL, erinsUserId, "Asset 3", asset3guid, assetOwner, ownerType)

addZones(cocoMDS2Name, cocoMDS2PlatformName, cocoMDS2PlatformURL, erinsUserId, "Asset 1", asset1guid, ["data-lake", "clinical-trials"])
addZones(cocoMDS2Name, cocoMDS2PlatformName, cocoMDS2PlatformURL, erinsUserId, "Asset 2", asset2guid, ["data-lake", "clinical-trials"])
addZones(cocoMDS2Name, cocoMDS2PlatformName, cocoMDS2PlatformURL, erinsUserId, "Asset 3", asset3guid, ["data-lake", "clinical-trials"])



In [None]:
assetOwnerPrintAssets(cocoMDS1Name, cocoMDS1PlatformName, cocoMDS1PlatformURL, petersUserId, ".*DropFootMeasurements.*")

----
Once these zones are set up, Callie can see the assets:


In [None]:
assetConsumerPrintAssets(cocoMDS3Name, cocoMDS3PlatformName, cocoMDS3PlatformURL, calliesUserId, newFilesSearchString)

In [None]:
assetConsumerPrintAssetUniverse(cocoMDS2Name, cocoMDS2PlatformName, cocoMDS2PlatformURL, petersUserId, asset1guid)

----
## Bonus material

This final section is an opportunity to dig a little deeper into the workings of Egeria.

The APIs used in the exercises above are from the access services - or Open Metadata Access Services (OMASs) to give them their formal name.  These APIs are domain specific - designed for use by tools, engines and platforms.

Underneath the access services are the repository services (Open Metadata Repository Services (OMRS)) and the platform services (Open Metadata and Governance (OMAG) Server Platform Services).

The repository services manage the exchange of metadata between servers.  The platform services provide a platform for running Egeria servers such as cocoMDS1 and cocoMDS2.


### Repository services

The repository services provide the ability for metadata to be accessed and exchanged from different servers.
Each server that has a repository (store) of metadata is assigned a **metadata collection id**.  This is a unique identifier that is associated with all metadata that originates from that repository.

The command below extracts the metadata collection id for cocoMDS1.

In [None]:
server1RepositoryServicesURL = cocoMDS1PlatformURL + '/servers/' + cocoMDS1Name + '/open-metadata/repository-services/users/' + adminUserId 
server1MetadataCollectionIdQuery = server1RepositoryServicesURL + '/metadata-collection-id'

print (" ")
print ("GET " + server1MetadataCollectionIdQuery)

response = requests.get(server1MetadataCollectionIdQuery)

print ("Returns:")
prettyResponse = json.dumps(response.json(), indent=4)
print (prettyResponse)
print (" ")

serverStatus = response.json().get('relatedHTTPCode')
if serverStatus == 200:
    cocoMDS1MetadataCollectionId = response.json().get('metadataCollectionId')
    print("Metadata collection id for " + cocoMDS1Name + " is " + cocoMDS1MetadataCollectionId)
else:
    print("Server " + cocoMDS1Name + " is not able to supply a metadata collection id")

----
Now we extract the metadata collection id for cocoMDS2.

In [None]:
server2RepositoryServicesURL = cocoMDS2PlatformURL + '/servers/' + cocoMDS2Name + '/open-metadata/repository-services/users/' + adminUserId 
server2MetadataCollectionIdQuery = server2RepositoryServicesURL + '/metadata-collection-id'

print (" ")
print ("GET " + server2MetadataCollectionIdQuery)

response = requests.get(server2MetadataCollectionIdQuery)

print ("Returns:")
prettyResponse = json.dumps(response.json(), indent=4)
print (prettyResponse)
print (" ")

serverStatus = response.json().get('relatedHTTPCode')
if serverStatus == 200:
    cocoMDS2MetadataCollectionId = response.json().get('metadataCollectionId')
    print("Metadata collection id for " + cocoMDS2Name + " is " + cocoMDS2MetadataCollectionId)
else:
    print("Server " + cocoMDS2Name + " is not able to supply a metadata collection id")

----

The metadata collection id is allocated when the server is first configured.  Once the server starts sharing metadata, the metadata collection id must never change as it is used in the metadata repository to identify where each piece of metadata came from.
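
The cells that query the metadata collection id repeat the same URL construction and response check.  As a sketch, the pattern can be factored into two small functions.  The URL layout and the `relatedHTTPCode` and `metadataCollectionId` fields are taken from the cells in this section; the function names and the sample values are hypothetical.

```python
def metadata_collection_id_url(platform_url, server_name, user_id):
    """Build the repository services URL used to query a server's metadata collection id."""
    return (platform_url + '/servers/' + server_name +
            '/open-metadata/repository-services/users/' + user_id +
            '/metadata-collection-id')

def extract_metadata_collection_id(response_body):
    """Return the metadata collection id, or None when the server could not supply one."""
    if response_body.get('relatedHTTPCode') == 200:
        return response_body.get('metadataCollectionId')
    return None

# Hypothetical values for illustration only.
print(metadata_collection_id_url("https://localhost:9443", "cocoMDS1", "exampleAdmin"))
print(extract_metadata_collection_id({"relatedHTTPCode": 200, "metadataCollectionId": "example-id"}))
```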

The cocoMDS4 server does not have a repository and uses federated queries to retrieve metadata from other servers.

In [None]:

server4RepositoryServicesURL = cocoMDS4PlatformURL + '/servers/' + cocoMDS4Name + '/open-metadata/repository-services/users/' + adminUserId 
server4MetadataCollectionIdQuery = server4RepositoryServicesURL + '/metadata-collection-id'

print (" ")
print ("GET " + server4MetadataCollectionIdQuery)

response = requests.get(server4MetadataCollectionIdQuery)

print ("Returns:")
prettyResponse = json.dumps(response.json(), indent=4)
print (prettyResponse)
print (" ")

serverStatus = response.json().get('relatedHTTPCode')
if serverStatus == 200:
    cocoMDS4MetadataCollectionId = response.json().get('metadataCollectionId')
    print("Metadata collection id for " + cocoMDS4Name + " is " + cocoMDS4MetadataCollectionId)
else:
    print("Server " + cocoMDS4Name + " is not able to supply a metadata collection id")

----
This result is also a demonstration of the error handling in Egeria. All errors consist of a message, system action and user response.
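
The three parts of an error can be pulled out of the response body for display.  The sketch below assumes the parts are carried in fields named `exceptionErrorMessage`, `exceptionSystemAction` and `exceptionUserAction`; check these names against the JSON printed above.  The `summarize_error` helper and the sample response are hypothetical.

```python
def summarize_error(response_body):
    """Pick out the message, system action and user response from an error body."""
    # Field names are an assumption; .get() returns None if a field is absent.
    return {
        "message": response_body.get("exceptionErrorMessage"),
        "system action": response_body.get("exceptionSystemAction"),
        "user response": response_body.get("exceptionUserAction"),
    }

# Made-up response used only to exercise the helper.
sample_response = {
    "relatedHTTPCode": 400,
    "exceptionErrorMessage": "Example message",
    "exceptionSystemAction": "Example system action",
    "exceptionUserAction": "Example user response",
}

for part, text in summarize_error(sample_response).items():
    print(part + ": " + str(text))
```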

----
Metadata instances such as the Assets and Comments that you were working with in Exercises 1 and 2 are stored in the repository as entities.  These entities are linked together with relationships (it is a logical graph model).

The command below uses the repository services to retrieve one of the assets created in Exercise 1.

In [None]:
server1RepositoryServicesURL = cocoMDS1PlatformURL + '/servers/' + cocoMDS1Name + '/open-metadata/repository-services/users/' + petersUserId 

server1AssetEntityQuery = server1RepositoryServicesURL + '/enterprise/instances/entity/' + asset1guid

print (" ")
print ("GET " + server1AssetEntityQuery)

response = requests.get(server1AssetEntityQuery)

print ("Returns:")
prettyResponse = json.dumps(response.json(), indent=4)
print (prettyResponse)
print (" ")

The entity includes its type definition and the properties of the asset.  Also notice the metadata collection id for cocoMDS1 around the middle of the structure.

Contrast the asset entity with the comment that Erin created.  Notice that the type information is different and that the metadata collection id is the one for cocoMDS2.

In [None]:
server2CommentEntityQuery = server2RepositoryServicesURL + '/enterprise/instances/entity/' + commentGUID

print (" ")
print ("GET " + server2CommentEntityQuery)

response = requests.get(server2CommentEntityQuery)

print ("Returns:")
prettyResponse = json.dumps(response.json(), indent=4)
print (prettyResponse)
print (" ")

----
Finally, consider the relationship between the asset and the comment.  It includes summary information about the two entities (called an **entity proxy**).  This is how it is possible to transmit and even store relationships independently of the entities.

In [None]:
server2AssetRelationshipQuery = server2RepositoryServicesURL + '/enterprise/instances/entity/' + asset1guid + '/relationships'

print (" ")
print ("POST " + server2AssetRelationshipQuery)

relationshipRequestBody={
	"class" : "TypeLimitedFindRequest",
	"startingFrom" : "0",
	"pageSize" : "100" 
}
jsonHeader = {'content-type':'application/json'}

response = requests.post(server2AssetRelationshipQuery, json=relationshipRequestBody, headers=jsonHeader)

print ("Returns:")
prettyResponse = json.dumps(response.json(), indent=4)
print (prettyResponse)
print (" ")



Which server was the relationship created in?
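
One way to answer this is to compare the metadata collection id on the relationship with the ids retrieved earlier.  The sketch below assumes that a relationship carries a `metadataCollectionId` field and two proxies named `entityOneProxy` and `entityTwoProxy`; verify these names against the JSON printed above.  The `origins` helper and all sample values are made up.

```python
def origins(relationship, known_collections):
    """Map the relationship and its two entity proxies to server names (hypothetical helper)."""
    def name_for(element):
        return known_collections.get(element.get("metadataCollectionId"), "unknown")

    return {
        "relationship": name_for(relationship),
        "entity one": name_for(relationship.get("entityOneProxy", {})),
        "entity two": name_for(relationship.get("entityTwoProxy", {})),
    }

# Made-up ids; in the lab, use cocoMDS1MetadataCollectionId and cocoMDS2MetadataCollectionId.
known_collections = {"id-A": "serverA", "id-B": "serverB"}

sample_relationship = {
    "metadataCollectionId": "id-B",
    "entityOneProxy": {"guid": "guid-1", "metadataCollectionId": "id-A"},
    "entityTwoProxy": {"guid": "guid-2", "metadataCollectionId": "id-B"},
}

print(origins(sample_relationship, known_collections))
```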

----
#### Open Metadata Cohorts

The metadata exchange between the servers is a peer-to-peer protocol.  Each server registers with one or more open metadata cohorts.  

Figure 3 shows which metadata servers belong to each cohort.

![Figure 3](../images/coco-pharmaceuticals-systems-cohorts.png)
> **Figure 3:** Membership of Coco Pharmaceuticals' cohorts (repeat)

----
The command below queries cocoMDS2's view of the cohorts.

In [None]:

printServerCohorts(cocoMDS2Name, cocoMDS2PlatformName, cocoMDS2PlatformURL)


----
There are more examples and explanation about the way that the cohorts work in the [**Understanding Cohorts**](../administration-labs/understanding-cohorts.ipynb) notebook.


----
### Metadata security

Security of metadata is extremely important.  Egeria has multiple levels of security so that access to individual metadata instances can be controlled.  The command below is a simple test of what happens when an unauthorized user tries to access one of Coco Pharmaceuticals' metadata servers.


In [None]:
unauthorizedUserQuery = cocoMDS2PlatformURL + '/servers/' + cocoMDS2Name + '/open-metadata/repository-services/users/evilEdna/metadata-collection-id'

print (" ")
print ("GET " + unauthorizedUserQuery)

response = requests.get(unauthorizedUserQuery)

print ("Returns:")
prettyResponse = json.dumps(response.json(), indent=4)
print (prettyResponse)
print (" ")

----
### Platform services

The platform services are for the infrastructure team running an Egeria service.  In the case of a cloud service, this may be a different organization to the metadata owners.  As a result, there is a separation of users able to work with the platform services versus the access and repository services.

This first command queries the servers running on a platform.

In [None]:
corePlatformServices = corePlatformURL + '/open-metadata/platform-services/users/' + adminUserId + '/server-platform'
corePlatformServers  = corePlatformServices + '/servers'

print (" ")
print ("CorePlatform's Servers ")
print ("GET " + corePlatformServers)

response = requests.get(corePlatformServers)

print ("Returns:")
prettyResponse = json.dumps(response.json(), indent=4)
print (prettyResponse)
print (" ")

dataLakePlatformServices = dataLakePlatformURL + '/open-metadata/platform-services/users/' + adminUserId + '/server-platform'
dataLakePlatformServers  = dataLakePlatformServices + '/servers'

print (" ")
print ("DataLakePlatform's Servers ")
print ("GET " + dataLakePlatformServers)

response = requests.get(dataLakePlatformServers)

print ("Returns:")
prettyResponse = json.dumps(response.json(), indent=4)
print (prettyResponse)
print (" ")

----
This last command queries the services active on server 1.

In [None]:
server1Services = dataLakePlatformServices + '/servers/' + cocoMDS1Name + '/services'

print (" ")
print (cocoMDS1Name + " services ")
print ("GET " + server1Services)

response = requests.get(server1Services)

print ("Returns:")
prettyResponse = json.dumps(response.json(), indent=4)
print (prettyResponse)
print (" ")

----
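
The request cells in this bonus section all follow the same pattern: print the URL, issue a GET and pretty-print the JSON that comes back.  As a sketch, that pattern could be wrapped in one helper.  `get_and_print` is a hypothetical name; the HTTP function is passed in so that the helper can be tried without a running platform.

```python
import json

def get_and_print(url, http_get):
    """Issue a GET with the supplied function, pretty-print the JSON body and return it."""
    print(" ")
    print("GET " + url)
    response = http_get(url)
    body = response.json()
    print("Returns:")
    print(json.dumps(body, indent=4))
    print(" ")
    return body

# Stand-in for an HTTP response, used only to demonstrate the helper offline.
class ExampleResponse:
    def json(self):
        return {"relatedHTTPCode": 200}

result = get_and_print("https://example.invalid/servers", lambda url: ExampleResponse())
```

In the notebook, the same helper could be called as `get_and_print(corePlatformServers, requests.get)`.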