![Egeria Logo](https://raw.githubusercontent.com/odpi/egeria/master/assets/img/ODPi_Egeria_Logo_color.png)
### Egeria Hands-On Lab
# Welcome to the Open Discovery Lab

**NOTE - This lab is under construction and is only partly completed**

## Introduction

Egeria is an open source project that provides open standards and implementation libraries to connect tools,
catalogs and platforms together so they can share information about data and technology (called metadata).

In this hands-on lab you will get a chance to run an Egeria metadata server, configure discovery services in a discovery engine and run the discovery engine in an Engine Host OMAG server.

## What is open discovery?

[Metadata discovery](https://egeria.odpi.org/open-metadata-publication/website/metadata-discovery/) is the
ability to automatically analyze and create metadata about assets.  Egeria provides an [Open Discovery Framework (ODF)](https://egeria.odpi.org/open-metadata-implementation/frameworks/open-discovery-framework/) that defines open interfaces for components that implement specific types of metadata discovery.   These components can then be called from tools offered by different vendors through the open APIs.
We call this ability to invoke metadata discovery components from many different vendor tools **open discovery**.

The Open Discovery Framework (ODF) provides standard interfaces for **discovery services**, which is the ODF
name for the metadata discovery components.  The ODF interfaces control how a discovery service is started and stopped, how it can access the existing metadata about an asset, and how it stores any additional information that it discovers about the asset.

Discovery services are specialist **governance services**. They are grouped together into a useful collection of capability called a **governance engine**. The same discovery service may be used in multiple governance engines.

Egeria provides a governance server called the **engine host server** that can host one or more governance engines.
The engine host server has APIs to call the discovery services in order to drive the analysis of a specific asset, and then to view the results.  The discovery services can also scan through all assets, running specific analysis on any they find.

Governance engines tend to be paired with, and deployed close to, the data platforms they are analyzing because the discovery services
tend to make many calls to access the content of the asset.  It is not uncommon for an organization to deploy multiple governance engines if its data is distributed.

A discovery service connects to a metadata server to retrieve and store metadata about the asset.
It uses the Discovery Engine OMAS APIs and events of the metadata server.
A single metadata server can support many governance engines.
The Governance Engine OMAS supports the
maintenance of the discovery services' and governance engines' definitions.

![Figure 1](../images/distributed-engine-services-config.png)
> **Figure 1:** governance engine deployments

A particular discovery engine may be assigned to run in multiple servers. This is useful if the type of
data it is able to analyze is distributed across different locations.

The exercises that follow take you through the process of defining discovery engines and services, verifying that
they are available in the engine host server and then running discovery requests against various assets.


## The scenario

Peter Profile is Coco Pharmaceuticals' Information Analyst.  He is experienced in managing and analyzing data.
In this lab, Peter is setting up automated metadata discovery services for use when new data sets are
sent to Coco Pharmaceuticals' data lake.  These data sets come from both internal systems and external partners
such as hospitals that are participating in clinical trials.

![Peter Profile](https://raw.githubusercontent.com/odpi/data-governance/master/docs/coco-pharmaceuticals/personas/peter-profile.png)

Peter's colleague, **Gary Geeke**, the IT Infrastructure leader at Coco Pharmaceuticals,
has already configured an engine host server called `governDL01` for Peter to use
(see the **[Server Configuration](../egeria-server-config.ipynb)** lab).

![Figure 2](../images/coco-pharmaceuticals-systems-omag-server-platforms-engine-host.png)
> **Figure 2:** Coco Pharmaceuticals' OMAG Server Platforms

The `governDL01` server is running on the Data Lake OMAG Server Platform, along with `cocoMDS1`,
which is the metadata server that `governDL01` will use to retrieve and store metadata.

The first step is to ensure all of the platforms and servers are running.

In [None]:

# Start up the metadata servers
%run ../common/environment-check.ipynb

print("Start up the Engine Host Server")
activatePlatform(dataLakePlatformName, dataLakePlatformURL, [governDL01Name])
print("Done. ")



----
You should see that both the metadata server `cocoMDS1` and the engine host server `governDL01` are started.
If any of the platforms are not running, follow [this link to set up and run the platform](https://egeria.odpi.org/open-metadata-resources/open-metadata-labs/).  If any server reports that it is not configured,
run the steps in the [Server Configuration](../egeria-server-config.ipynb) lab to configure
the servers.  Then re-run the previous step to ensure all of the servers are started.

----
The `governDL01` server has been configured to run the Asset Analysis Open Metadata Engine Service (OMES).  Asset Analysis OMES is able to host Open Discovery Framework (ODF) discovery engines.  It has been configured to host two discovery engines.  The command below lists the discovery engines and their status.

In [None]:

printGovernanceEngineStatuses(governDL01Name, governDL01PlatformName, governDL01PlatformURL, petersUserId)


The status code `ASSIGNED` means that the governance engine was listed in Engine Host's configuration
document (that is, the governance engine was assigned to this server) but Engine Host has not been
able to retrieve the configuration for the governance engine from the metadata server (`cocoMDS1`).

When the basic governance engine properties have been retrieved from the metadata server, the status code
becomes `CONFIGURING` and more descriptive information is returned with the status.

When governance services are registered with the governance engine, the status moves to `RUNNING` and it is possible to see the list of supported request types for the governance engine.
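
The three status codes above can be read as a simple progression. The sketch below is only an illustration of that reading; `EngineStatus` and `derive_engine_status` are hypothetical names invented for this example and are not part of Egeria or of the lab's helper functions.

```python
from enum import Enum


class EngineStatus(Enum):
    """The three governance engine status codes described in this lab."""
    ASSIGNED = "ASSIGNED"
    CONFIGURING = "CONFIGURING"
    RUNNING = "RUNNING"


def derive_engine_status(properties_retrieved: bool,
                         registered_service_count: int) -> EngineStatus:
    """Hypothetical helper: map what the engine host knows to a status code."""
    if not properties_retrieved:
        # Listed in the engine host's configuration document, but nothing
        # has been retrieved from the metadata server yet.
        return EngineStatus.ASSIGNED
    if registered_service_count == 0:
        # Engine properties are known, but no governance services are registered.
        return EngineStatus.CONFIGURING
    # At least one governance service is registered with the engine.
    return EngineStatus.RUNNING


print(derive_engine_status(False, 0).value)
print(derive_engine_status(True, 0).value)
print(derive_engine_status(True, 1).value)
```

The real engine host may track more detail than this; the sketch only captures the order in which the lab expects the statuses to appear.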

The next step in the lab is to add configuration for the discovery engine to `cocoMDS1` until the
`AssetDiscovery` discovery engine is running.

## Exercise 1 - Configuring the Governance Engine with Open Discovery Services

Figure 3 shows the structure of the configuration that needs to be stored in the metadata server for
a governance engine.

The discovery engine has a set of descriptive properties.  These are linked to a list of discovery request types.
The discovery request types are memorable names for the types of analysis that the users of the discovery
engines will want to run.  Each discovery request type also includes a default set of analysis parameters that can be overridden when
a specific discovery request is made.

Each discovery request type is further linked either to a discovery service or a **discovery pipeline**.
(A discovery pipeline is a discovery service that coordinates the execution of other discovery services.)

When a discovery request is made, it specifies a discovery request type. The discovery engine runs the
discovery service or discovery pipeline linked to the requested discovery request type.

![Figure 3](../images/discovery-engine-configuration.png)
> **Figure 3:** Structure of discovery engine configuration
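
To make the structure in Figure 3 more concrete, here is a minimal in-memory sketch of how a discovery request type might be resolved to a discovery service, with the default analysis parameters overridden by those supplied on the request. The class names, the `resolve` method and the `delimiter` and `sampleRows` parameters are assumptions made for illustration; they are not Egeria APIs or real parameters of any discovery service.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple


@dataclass
class RequestTypeRegistration:
    """Hypothetical link from a request type to a discovery service (or pipeline)."""
    service_name: str
    default_parameters: Dict[str, str] = field(default_factory=dict)


@dataclass
class DiscoveryEngineConfig:
    """Hypothetical model of a discovery engine and its request types."""
    name: str
    request_types: Dict[str, RequestTypeRegistration] = field(default_factory=dict)

    def resolve(self, request_type: str,
                overrides: Optional[Dict[str, str]] = None) -> Tuple[str, Dict[str, str]]:
        """Return the service to run and the effective analysis parameters."""
        registration = self.request_types.get(request_type)
        if registration is None:
            raise KeyError(f"Unknown discovery request type: {request_type}")
        # Parameters supplied on the request take precedence over the defaults.
        parameters = {**registration.default_parameters, **(overrides or {})}
        return registration.service_name, parameters


engine = DiscoveryEngineConfig(name="AssetDiscovery")
engine.request_types["small-csv"] = RequestTypeRegistration(
    service_name="csv-asset-discovery-service",
    default_parameters={"delimiter": ",", "sampleRows": "100"},  # illustrative only
)

service, parameters = engine.resolve("small-csv", {"sampleRows": "10"})
print(service, parameters)
```

The merge builds a new dictionary on each request, so the stored defaults are left untouched for the next caller.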

The discovery engine is configured using calls to the Governance Engine OMAS running in the metadata server `cocoMDS1`.  The first configuration call is to store the discovery engine properties.

In [None]:
assetDiscoveryEngineName = "AssetDiscovery"
assetDiscoveryEngineDisplayName = "Asset Discovery Engine"
assetDiscoveryEngineDescription = "Extracts metadata about an asset on request."

assetDiscoveryEngineGUID = createGovernanceEngine(cocoMDS1Name,
                                                  cocoMDS1PlatformName,
                                                  cocoMDS1PlatformURL,
                                                  petersUserId,
                                                  "OpenDiscoveryEngine",
                                                  assetDiscoveryEngineName,
                                                  assetDiscoveryEngineDisplayName,
                                                  assetDiscoveryEngineDescription)

print (" ")
print ("The guid for the " + assetDiscoveryEngineName + " discovery engine is: " + assetDiscoveryEngineGUID)
print (" ")


----
The properties for the discovery engine are now on `cocoMDS1`.  This configuration will eventually propagate to
the server `governDL01` through the Governance Engine OMAS events.  However, to propagate the
configuration immediately, there is a `refresh configuration` REST API call that can be made to the Asset Analysis
OMES to request that it calls the metadata server to retrieve its configuration.

In [None]:

refreshGovernanceEngineConfig(governDL01Name, governDL01PlatformName, governDL01PlatformURL, petersUserId, assetDiscoveryEngineName)


----

When the status of the discovery engines is requested, the AssetDiscovery discovery engine is now showing `CONFIGURING`.  This means the discovery engine is defined, but it does not have any discovery request types
defined and hence cannot run any discovery services.  It is effectively "empty".


In [None]:

printGovernanceEngineStatuses(governDL01Name, governDL01PlatformName, governDL01PlatformURL, petersUserId)


----
To complete the configuration of the discovery engine it needs at least one discovery service registered.

The next set of calls creates the definition for a discovery service and then registers it with the discovery
engine. The registration request is the point where the discovery
request types are linked to the discovery service, as shown in **Figure 3** above.

The definition of the discovery service is independent of the registration with the discovery engine because
discovery services can be reused in multiple discovery pipelines and engines.
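
The separation between defining a service and registering it can be pictured with a small in-memory model. This is a sketch under stated assumptions, not the Governance Engine OMAS API: `define_service`, `register_service`, the two dictionaries and the `csv-quality-check` request type are hypothetical names used only for illustration.

```python
import uuid
from typing import Dict

# Hypothetical stores: service definitions keyed by GUID, and per-engine
# registrations mapping a request type to a service GUID.
service_definitions: Dict[str, Dict[str, str]] = {}
engine_registrations: Dict[str, Dict[str, str]] = {}


def define_service(name: str, provider_class_name: str) -> str:
    """Store a service definition once and return its GUID."""
    guid = str(uuid.uuid4())
    service_definitions[guid] = {
        "name": name,
        "providerClassName": provider_class_name,
    }
    return guid


def register_service(engine_name: str, request_type: str, service_guid: str) -> None:
    """Link a request type on one engine to an already defined service."""
    if service_guid not in service_definitions:
        raise ValueError(f"No service definition with GUID {service_guid}")
    engine_registrations.setdefault(engine_name, {})[request_type] = service_guid


csv_service_guid = define_service(
    "csv-asset-discovery-service",
    "org.odpi.openmetadata.adapters.connectors.discoveryservices.CSVDiscoveryServiceProvider",
)

# The same definition is registered with two engines under different request types.
register_service("AssetDiscovery", "small-csv", csv_service_guid)
register_service("AssetQuality", "csv-quality-check", csv_service_guid)  # illustrative request type

print(len(service_definitions), "definition,", len(engine_registrations), "engines")
```

Because each registration holds only the GUID, a change to the service definition is picked up by every engine that references it.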


In [None]:
discoveryServiceName = "csv-asset-discovery-service"
discoveryServiceDisplayName = "CSV Asset Discovery Service"
discoveryServiceDescription = "Discovers columns for CSV Files."
discoveryServiceProviderClassName = "org.odpi.openmetadata.adapters.connectors.discoveryservices.CSVDiscoveryServiceProvider"
discoveryServiceRequestType = "small-csv"

discoveryServiceGUID = createGovernanceService(cocoMDS1Name,
                                               cocoMDS1PlatformName,
                                               cocoMDS1PlatformURL,
                                               petersUserId,
                                               "OpenDiscoveryService",
                                               discoveryServiceName,
                                               discoveryServiceDisplayName,
                                               discoveryServiceDescription,
                                               discoveryServiceProviderClassName,
                                               {})

if discoveryServiceGUID:
    registerGovernanceServiceWithEngine(cocoMDS1Name,
                                        cocoMDS1PlatformName,
                                        cocoMDS1PlatformURL,
                                        petersUserId,
                                        assetDiscoveryEngineGUID,
                                        discoveryServiceGUID,
                                        discoveryServiceRequestType)
    print (" ")
    print ("Service registered as: " + discoveryServiceGUID)
    print (" ")
    
print ("Done. ")

In [None]:
refreshGovernanceEngineConfig(governDL01Name, governDL01PlatformName, governDL01PlatformURL, petersUserId, assetDiscoveryEngineName)
print ("Done. ")

----
Now the discovery engine has sufficient configuration to offer a useful service to its callers.

In [None]:

printGovernanceEngineStatuses(governDL01Name, governDL01PlatformName, governDL01PlatformURL, petersUserId)


----
Asset Analysis OMES is ready to run automated discovery requests on the **AssetDiscovery** discovery engine.  The **AssetQuality** discovery engine will be configured in a later release of Egeria when the quality management function is enabled.

----
## Exercise 2 - Analysing Assets

The next exercise is to run a metadata discovery service.  It is a work in progress and will be added soon.
The commands below do not currently work because the discovery service is incomplete.

In [None]:

# reportGUID = runDiscoveryService(governDL01Name, governDL01PlatformName, governDL01PlatformURL, petersUserId, "AssetDiscovery", "small-csv", asset1guid)


This is how to query the result of a discovery request.

In [None]:

# Return the report header
#getDiscoveryReport(governDL01Name, governDL01PlatformName, governDL01PlatformURL, petersUserId, "AssetDiscovery", reportGUID)



# Return the annotations
#getDiscoveryReportAnnotations(governDL01Name, governDL01PlatformName, governDL01PlatformURL, petersUserId, "AssetDiscovery", reportGUID)



----
## Exercise 3 - Exploring Asset Contents

The next exercise is to run metadata discovery on a new asset to discover its schema (structure) and the
characteristics of its content.


 __Details coming soon ...__

----
## Exercise 4 - Assessing the quality of assets

The final exercise is to use metadata discovery to report on errors in the data from an asset and provide an assessment of its quality.


__Details coming soon ...__