![Egeria Logo](https://raw.githubusercontent.com/odpi/egeria/master/assets/img/ODPi_Egeria_Logo_color.png)
### ODPi Egeria Hands-On Lab
# Welcome to the Open Discovery Lab

**NOTE - This lab is under construction and is only partly completed**

## Introduction

ODPi Egeria is an open source project that provides open standards and implementation libraries to connect tools,
catalogs and platforms together so they can share information about data and technology (called metadata).

In this hands-on lab you will get a chance to run an Egeria metadata server, configure discovery services in a discovery engine and run the discovery engine in a discovery server.

## What is open discovery?

[Metadata discovery](https://egeria.odpi.org/open-metadata-publication/website/metadata-discovery/) is the
ability to automatically analyze and create metadata about assets.  ODPi Egeria provides an [Open Discovery Framework (ODF)](https://egeria.odpi.org/open-metadata-implementation/frameworks/open-discovery-framework/) that defines open interfaces for components that implement specific types of metadata discovery.   These components can then be called from tools offered by different vendors through the open APIs.
We call this ability to invoke metadata discovery components from many different vendor tools, **open discovery**.

The Open Discovery Framework (ODF) provides standard interfaces for **discovery services**.  This is the ODF
name for the metadata discovery components.  The ODF interfaces control how a discovery service is started and stopped, how it can access the existing metadata about an asset, and how it stores any additional information that it discovers about the asset.
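
To make the shape of such an interface concrete, here is a minimal Python sketch. It is not the ODF itself (the framework is defined by the Egeria project); the class names `DiscoveryContext`, `DiscoveryService` and `ColumnCountService` and their fields are hypothetical, chosen only to show the start/stop lifecycle, read access to existing metadata, and a place to store new findings.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field


@dataclass
class DiscoveryContext:
    """Hypothetical stand-in for what a discovery service is handed when it starts."""
    asset_guid: str
    existing_metadata: dict                           # metadata already known about the asset
    annotations: list = field(default_factory=list)  # new findings are stored here


class DiscoveryService(ABC):
    """Hypothetical rendering of a discovery service's lifecycle (not the real ODF API)."""

    @abstractmethod
    def start(self, context: DiscoveryContext) -> None:
        """Analyze the asset and record findings in context.annotations."""

    def stop(self) -> None:
        """Release any resources; nothing to do by default."""


class ColumnCountService(DiscoveryService):
    """Toy service: reports how many columns the known schema lists."""

    def start(self, context: DiscoveryContext) -> None:
        columns = context.existing_metadata.get("columns", [])
        context.annotations.append({"type": "column-count", "value": len(columns)})


context = DiscoveryContext("asset-001", {"columns": ["id", "name", "dose"]})
service = ColumnCountService()
service.start(context)
service.stop()
print(context.annotations)
```

The caller owns the context, so any tool that can build one can run the service. That is the property the open interfaces are meant to provide.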

Discovery services are grouped together into a useful collection of capability called a **discovery engine**. The same discovery service may be used in multiple discovery engines.

ODPi Egeria provides a governance server called the **discovery server** that can host one or more discovery engines.  The discovery server has APIs to call the discovery engines, and the services inside them, to drive the analysis of a specific asset and then to view the results.  The discovery server can also scan through all assets, running specific analysis on any it finds.

Discovery servers tend to be paired and deployed close to the data platforms they are analyzing because the discovery process makes many calls to access the content of the asset.  It is not uncommon for an organization to deploy multiple discovery servers if their data is distributed.

The discovery server connects to a metadata server to retrieve and store metadata about the asset.  It uses the Discovery Engine OMAS APIs and events of the metadata server.
A single metadata server can support many discovery servers.
The Discovery Engine OMAS also supports the
maintenance of the discovery services' and discovery engines' definitions.

![Figure 1](../images/discovery-servers.png)
> **Figure 1:** Discovery server deployments

A particular discovery engine may be assigned to run in multiple discovery servers. This is useful if the type of
data it is able to analyze is distributed across different locations.
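
The many-to-many relationship between discovery servers and discovery engines can be pictured with a small sketch. This is plain Python, not Egeria code. `findItDL01` and the three engine names are taken from later in this lab; the second server, `findItEU01`, is invented for illustration.

```python
from collections import defaultdict

# Hypothetical deployment: which discovery engines each discovery server hosts.
# "findItEU01" is an invented second server near a different data platform.
server_engines = {
    "findItDL01": ["AssetDiscovery", "AssetDeduplicator", "AssetQuality"],
    "findItEU01": ["AssetDiscovery"],
}


def servers_by_engine(assignments):
    """Invert server -> engines into engine -> sorted list of servers."""
    result = defaultdict(list)
    for server, engines in assignments.items():
        for engine in engines:
            result[engine].append(server)
    return {engine: sorted(servers) for engine, servers in result.items()}


engine_servers = servers_by_engine(server_engines)
print(engine_servers["AssetDiscovery"])
```

In this sketch the `AssetDiscovery` engine runs in both servers, so the same analysis is available next to each data platform.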

The exercises that follow take you through the process of defining discovery engines and services, verifying that
they are available in the discovery server and then running discovery requests against various assets.


## The scenario

Peter Profile is Coco Pharmaceuticals' Information Analyst.  He is experienced in managing and analyzing data.
In this lab, Peter is setting up automated metadata discovery services for use when new data sets are
sent to Coco Pharmaceuticals' data lake.  These data sets come from both internal systems and external partners
such as hospitals that are participating in clinical trials.

![Peter Profile](https://raw.githubusercontent.com/odpi/data-governance/master/docs/coco-pharmaceuticals/personas/peter-profile.png)

Peter's colleague, **Gary Geeke**, the IT Infrastructure leader at Coco Pharmaceuticals,
has already configured a discovery server called `findItDL01` for Peter to use.

In the **[Server Configuration](../egeria-server-config.ipynb)** lab, Gary configured the
Open Metadata and Governance (OMAG) Server Platforms shown in Figure 2 to host the Egeria servers in use by Coco Pharmaceuticals.

![Figure 2](../images/coco-pharmaceuticals-systems-omag-server-platforms-with-discovery.png)
> **Figure 2:** Coco Pharmaceuticals' OMAG Server Platforms

The discovery server `findItDL01` is running on the Data Lake OMAG Server Platform, along with `cocoMDS1`,
which is the metadata server that the discovery server will use to retrieve and store metadata.

The first step is to ensure all of the platforms and servers are running.

In [None]:
%run ../common/environment-check.ipynb

print("Start up the Discovery Server")
activatePlatform(dataLakePlatformName, dataLakePlatformURL, [findItDL01Name])
print("Done. ")



----
You should see that both the metadata server `cocoMDS1` and the discovery server `findItDL01` are started.
If any of the platforms are not running, follow [this link to set up and run the platform](https://egeria.odpi.org/open-metadata-resources/open-metadata-labs/).  If any server is reporting that it is not configured then
run the steps in the **[Server Configuration](../egeria-server-config.ipynb)** lab to configure
the servers.  Then re-run the previous step to ensure all of the servers are started.

----
The discovery server has been configured to host three discovery engines.  The command below lists the discovery engines
and their status.

In [None]:

getDiscoveryEngineStatuses(findItDL01Name, findItDL01PlatformName, findItDL01PlatformURL, petersUserId)


The status code `ASSIGNED` means that the discovery engine was listed in the discovery server's configuration
document (that is, the discovery engine was assigned to the discovery server) but the discovery server has not been
able to retrieve the configuration for the discovery engine from the metadata server (`cocoMDS1`).

When the basic discovery engine properties have been retrieved from the metadata server, the status code
becomes `CONFIGURING` and more descriptive information is returned with the status.  When discovery services are registered with the discovery engine, the status moves to `RUNNING` and it is possible to see the list of supported
discovery request types with the status.
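
The three status codes can be summarized as a small decision function. This is only an illustrative model of the rules described above, not the discovery server's implementation; the function name and its arguments are hypothetical.

```python
from typing import Optional


def engine_status(engine_properties: Optional[dict], request_types: list) -> str:
    """Derive the status code from what has been retrieved so far (illustrative model)."""
    if engine_properties is None:
        # Listed in the server's configuration, but nothing retrieved from the metadata server.
        return "ASSIGNED"
    if not request_types:
        # Properties retrieved, but no discovery service registered yet.
        return "CONFIGURING"
    # At least one discovery request type is supported.
    return "RUNNING"


print(engine_status(None, []))
print(engine_status({"displayName": "Asset Deduplicator Discovery Engine"}, []))
print(engine_status({"displayName": "Asset Deduplicator Discovery Engine"}, ["identify-duplicates"]))
```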

The next step in the lab is to add configuration for the discovery engine to `cocoMDS1` until the
`AssetDeduplicator` discovery engine is running.

## Exercise 1 - Configuring the Discovery Engine

Figure 3 shows the structure of the configuration that needs to be stored in the metadata server for
a discovery engine.

The discovery engine has a set of descriptive properties.  These are linked to a list of discovery request types.
The discovery request types are memorable names for the types of analysis that the users of the discovery
server will want to run.  It also includes a default set of analysis parameters that can be overridden when
a specific discovery request is made.

Each discovery request type is further linked either to a discovery service or a **discovery pipeline**.
(A discovery pipeline is a discovery service that manages the running of other discovery services.)

When a discovery request is made, it specifies a discovery request type. The discovery engine runs the
discovery service or discovery pipeline linked to the requested discovery request type.
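
As a rough sketch of this lookup, the snippet below maps a discovery request type to a service name and merges the caller's overrides on top of the default analysis parameters. It is plain Python rather than Egeria code. The request type `small-csv` and the service name are used later in this lab; the parameter names `sample_rows` and `delimiter` and the function `resolve_request` are invented for illustration.

```python
# Hypothetical in-memory picture of one engine's configuration.
engine_config = {
    "small-csv": {
        "service": "csv-asset-discovery-service",
        "default_parameters": {"sample_rows": 100, "delimiter": ","},  # invented parameters
    },
}


def resolve_request(config, request_type, overrides=None):
    """Return the service to run and the effective analysis parameters."""
    if request_type not in config:
        raise ValueError(f"Unsupported discovery request type: {request_type}")
    entry = config[request_type]
    # Values supplied with the request take precedence over the defaults.
    parameters = {**entry["default_parameters"], **(overrides or {})}
    return entry["service"], parameters


service_name, parameters = resolve_request(engine_config, "small-csv", {"sample_rows": 10})
print(service_name, parameters)
```

A request for a type that is not in the configuration fails immediately, which mirrors the idea that an engine only accepts the request types it has been configured with.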

![Figure 3](../images/discovery-engine-configuration.png)
> **Figure 3:** Structure of discovery engine configuration

The discovery engine is configured using calls to the Discovery Engine OMAS running in the metadata server `cocoMDS1`.  The first configuration call is to store the discovery engine properties.

In [None]:
assetDeduplicatorEngineName = "AssetDeduplicator"
assetDeduplicatorEngineDisplayName = "Asset Deduplicator Discovery Engine"
assetDeduplicatorEngineDescription = "Discovery engine for scanning the asset catalog and identifying which assets are duplicate definitions of the same physical asset."


assetDeduplicatorEngineGUID = createDiscoveryEngine(cocoMDS1Name,
                                                    cocoMDS1PlatformName,
                                                    cocoMDS1PlatformURL,
                                                    petersUserId,
                                                    assetDeduplicatorEngineName,
                                                    assetDeduplicatorEngineDisplayName,
                                                    assetDeduplicatorEngineDescription)

print (" ")
print ("The guid for the " + assetDeduplicatorEngineName + " discovery engine is: " + assetDeduplicatorEngineGUID)
print (" ")


----
The properties for the discovery engine are now on `cocoMDS1`.  This configuration will eventually propagate to
the discovery server `findItDL01` through the Discovery Engine OMAS events.  However, to propagate the
configuration immediately, there is a `refresh configuration` REST API call that can be made to the discovery
server to request that it calls the metadata server to retrieve its configuration.
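
The difference between waiting for propagation and asking for a refresh can be modeled with two small stub classes. This is a conceptual sketch only: `MetadataServerStub` and `DiscoveryServerStub` are hypothetical names, and the real servers exchange this information through the Discovery Engine OMAS rather than through shared Python objects.

```python
import copy


class MetadataServerStub:
    """Hypothetical stand-in for the metadata server's store of engine definitions."""

    def __init__(self):
        self.engine_definitions = {}


class DiscoveryServerStub:
    """Hypothetical stand-in for a discovery server that caches engine definitions."""

    def __init__(self, metadata_server):
        self._metadata_server = metadata_server
        self._cached = {}

    def refresh_configuration(self, engine_name):
        """Pull the current definition from the metadata server into the local cache."""
        definition = self._metadata_server.engine_definitions.get(engine_name)
        if definition is not None:
            self._cached[engine_name] = copy.deepcopy(definition)

    def cached_definition(self, engine_name):
        return self._cached.get(engine_name)


metadata_server = MetadataServerStub()
discovery_server = DiscoveryServerStub(metadata_server)

# The definition is stored in the metadata server first ...
metadata_server.engine_definitions["AssetDeduplicator"] = {
    "displayName": "Asset Deduplicator Discovery Engine"
}
before = discovery_server.cached_definition("AssetDeduplicator")

# ... and only reaches the discovery server once it refreshes.
discovery_server.refresh_configuration("AssetDeduplicator")
after = discovery_server.cached_definition("AssetDeduplicator")
print(before, after)
```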

In [None]:

refreshDiscoveryEngineConfig(findItDL01Name, findItDL01PlatformName, findItDL01PlatformURL, petersUserId, assetDeduplicatorEngineName)


----
The result shows that there is a problem with the discovery engine's configuration.

When the status of the discovery engines is requested, the AssetDeduplicator discovery engine is now showing `CONFIGURING`.  This means the discovery engine is defined, but it does not have any discovery request types
defined and hence cannot run any discovery services.  It is effectively "empty".


In [None]:

getDiscoveryEngineStatuses(findItDL01Name, findItDL01PlatformName, findItDL01PlatformURL, petersUserId)


----
To complete the configuration of the discovery engine it needs at least one discovery service registered.

The next set of calls creates the definition for a discovery service and then registers it with the discovery
engine. The registration request is the point where the discovery
request types are linked to the discovery service as shown in **Figure 3** above.

The definition of the discovery service is independent of the registration with the discovery engine because
discovery services can be reused in multiple discovery pipelines and engines.
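
The separation between defining a service and registering it can be sketched with two plain Python collections. This is an illustrative model, not the Discovery Engine OMAS API. The helper names are hypothetical, and the second engine (`NightlyCatalogScan`) and its request type are invented to show reuse; the service name, provider class name and `identify-duplicates` come from this lab.

```python
import uuid

service_definitions = {}  # service GUID -> definition (stored once)
registrations = []        # (engine name, service GUID, request types)


def create_service(name, provider_class_name):
    """Store a service definition and return its generated GUID."""
    guid = str(uuid.uuid4())
    service_definitions[guid] = {"name": name, "providerClassName": provider_class_name}
    return guid


def register_service(engine_name, service_guid, request_types):
    """Link an existing service definition to an engine under the given request types."""
    if service_guid not in service_definitions:
        raise KeyError(f"Unknown discovery service: {service_guid}")
    registrations.append((engine_name, service_guid, list(request_types)))


def supported_request_types(engine_name):
    """List the request types an engine supports, across all its registrations."""
    return sorted(rt for eng, _, rts in registrations if eng == engine_name for rt in rts)


guid = create_service(
    "duplicate-asset-identification-discovery-service",
    "org.odpi.openmetadata.adapters.connectors.discoveryservices.DuplicateSuspectDiscoveryProvider",
)
register_service("AssetDeduplicator", guid, ["identify-duplicates"])
register_service("NightlyCatalogScan", guid, ["nightly-duplicate-scan"])  # invented engine

print(len(service_definitions), supported_request_types("AssetDeduplicator"))
```

One definition serves both engines, and each engine is free to expose it under its own request type names.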


In [None]:

dupAssetIdentificationDiscoveryServiceName = "duplicate-asset-identification-discovery-service"
dupAssetIdentificationDiscoveryServiceDisplayName = "Duplicate Asset Identification Discovery Service"
dupAssetIdentificationDiscoveryServiceDescription = "Creates a report that lists the other assets that seem to describe the same physical asset."
dupAssetIdentificationDiscoveryServiceProviderClassName = "org.odpi.openmetadata.adapters.connectors.discoveryservices.DuplicateSuspectDiscoveryProvider"
dupAssetIdentificationDiscoveryServiceRequestTypes = [ "identify-duplicates" ]
dupAssetIdentificationDiscoveryServiceGUID = createDiscoveryService(cocoMDS1Name,
                                                                    cocoMDS1PlatformName,
                                                                    cocoMDS1PlatformURL,
                                                                    petersUserId,
                                                                    dupAssetIdentificationDiscoveryServiceName,
                                                                    dupAssetIdentificationDiscoveryServiceDisplayName,
                                                                    dupAssetIdentificationDiscoveryServiceDescription,
                                                                    dupAssetIdentificationDiscoveryServiceProviderClassName)

print (" ")
print ("The guid for the " + dupAssetIdentificationDiscoveryServiceName + " is: " + dupAssetIdentificationDiscoveryServiceGUID)
print (" ")

registerDiscoveryServiceWithEngine(cocoMDS1Name,
                                   cocoMDS1PlatformName,
                                   cocoMDS1PlatformURL,
                                   petersUserId,
                                   assetDeduplicatorEngineGUID,
                                   dupAssetIdentificationDiscoveryServiceGUID,
                                   dupAssetIdentificationDiscoveryServiceRequestTypes)


print (" ")
print ("Service registered")
print (" ")

refreshDiscoveryEngineConfig(findItDL01Name, findItDL01PlatformName, findItDL01PlatformURL, petersUserId, assetDeduplicatorEngineName)


----
Now the discovery engine has sufficient configuration to offer a useful service to its callers.

In [None]:

getDiscoveryEngineStatuses(findItDL01Name, findItDL01PlatformName, findItDL01PlatformURL, petersUserId)


----
The code below adds the configuration for the AssetDiscovery discovery engine and its services.


In [None]:

assetDiscoveryEngineName = "AssetDiscovery"
assetDiscoveryEngineDisplayName = "Asset Discovery Engine"
assetDiscoveryEngineDescription = "Extracts metadata about an asset on request."

assetDiscoveryEngineGUID = createDiscoveryEngine(cocoMDS1Name,
                                                 cocoMDS1PlatformName,
                                                 cocoMDS1PlatformURL,
                                                 petersUserId,
                                                 assetDiscoveryEngineName,
                                                 assetDiscoveryEngineDisplayName,
                                                 assetDiscoveryEngineDescription)

discoveryServiceName = "csv-asset-discovery-service"
discoveryServiceDisplayName = "CSV Asset Discovery Service"
discoveryServiceDescription = "Discovers columns for CSV Files."
discoveryServiceProviderClassName = "org.odpi.openmetadata.adapters.connectors.discoveryservices.CSVDiscoveryServiceProvider"
discoveryServiceRequestTypes = [ "small-csv" ]

discoveryServiceGUID = createDiscoveryService(cocoMDS1Name,
                                              cocoMDS1PlatformName,
                                              cocoMDS1PlatformURL,
                                              petersUserId,
                                              discoveryServiceName,
                                              discoveryServiceDisplayName,
                                              discoveryServiceDescription,
                                              discoveryServiceProviderClassName)

registerDiscoveryServiceWithEngine(cocoMDS1Name,
                                   cocoMDS1PlatformName,
                                   cocoMDS1PlatformURL,
                                   petersUserId,
                                   assetDiscoveryEngineGUID,
                                   discoveryServiceGUID,
                                   discoveryServiceRequestTypes)

refreshDiscoveryEngineConfig(findItDL01Name, findItDL01PlatformName, findItDL01PlatformURL, petersUserId, assetDiscoveryEngineName)


----
The configuration for the `findItDL01` discovery server now looks like this:


In [None]:

getDiscoveryEngineStatuses(findItDL01Name, findItDL01PlatformName, findItDL01PlatformURL, petersUserId)


----
The discovery server is ready to run automated discovery requests on both the **AssetDeduplicator** discovery engine and the **AssetDiscovery** discovery engine.  The **AssetQuality** discovery engine will be configured in a later release of Egeria when the quality management function is enabled.

----
## Exercise 2 - Discovering Duplicate Assets

The next exercise is to run a metadata discovery service on a selection of asset descriptions in the metadata repositories to determine if they each represent a unique real asset, or if there are duplicate descriptions.

Duplicate asset descriptions are inevitable when Egeria combines metadata from different tools and the users of these tools are working with the same physical assets.  Each tool will load their own private description of the
asset.  When the tools are linked together, and Egeria queries the combined 

----
## Exercise 3 - Exploring Asset Contents

The final exercise is to run metadata discovery on a new asset to discover its schema (structure) and the
characteristics of its content.