<!-- SPDX-License-Identifier: CC-BY-4.0 -->
<!-- Copyright Contributors to the ODPi Egeria project 2024. -->

![Egeria Logo](https://raw.githubusercontent.com/odpi/egeria/main/assets/img/ODPi_Egeria_Logo_color.png)

### Egeria Workbook

# Cataloguing and surveying files

## Introduction

This workbook explains how to survey and catalog files in a file system.  Files are used for many purposes in data management.  A single file may contain an entire database with many tables and columns, or it may represent a single row of data in a table.  The contents may also be encoded in different formats.  As a result, the business value of the data held in a file varies widely.

Egeria's file system survey service helps to identify where the most valuable files are located in your file systems.  It produces a report that shows the types of files that you have, classified in multiple ways, their size and an assessment of which files have been read, updated and deleted recently.
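
To make the shape of such a report more concrete, here is a minimal local sketch that uses only Python's standard library.  It is not Egeria's survey service and says nothing about how that service is implemented.  The function name `summarize_directory` and the 30-day window for "recent" are assumptions chosen for illustration, and the sketch only looks at modification times, not reads or deletions.

```python
import tempfile
import time
from collections import Counter
from pathlib import Path


def summarize_directory(root, recent_days=30):
    """Summarize the files under root by extension, size and recent modification."""
    cutoff = time.time() - recent_days * 24 * 60 * 60
    by_extension = Counter()
    total_bytes = 0
    recently_modified = 0
    for path in Path(root).rglob("*"):
        if not path.is_file():
            continue  # skip directories and other non-file entries
        stats = path.stat()
        by_extension[path.suffix.lower() or "(none)"] += 1
        total_bytes += stats.st_size
        if stats.st_mtime >= cutoff:
            recently_modified += 1
    return {
        "by_extension": dict(by_extension),
        "total_bytes": total_bytes,
        "recently_modified": recently_modified,
    }


# Try it on a small throwaway directory tree (hypothetical file names).
with tempfile.TemporaryDirectory() as demo_root:
    root = Path(demo_root)
    (root / "nested").mkdir()
    (root / "a.csv").write_bytes(b"id,name\n1,x\n")
    (root / "nested" / "b.csv").write_bytes(b"id\n2\n")
    (root / "c.json").write_bytes(b"{}")
    report = summarize_directory(root)

print(report)
```

A summary like this hints at where the bulk of the data sits, which is the kind of question the survey report is intended to answer at a much richer level of detail.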

Egeria's file system catalog service creates [asset](https://egeria-project.org/concepts/asset/) entries in open metadata, making it possible for data professionals to search for and locate files for their projects.  The catalog service works independently of the survey service and captures the same information about each file.  If you want to catalog all files, you can therefore use the catalog service without running the survey service first.  However, if you suspect that many files are of no interest to your data professionals, you can use the information from the survey service to configure the catalog service so that only potentially interesting files are catalogued.

The file system catalog service is able to catalog:

* The files in a specific directory (folder)
* The subdirectories nested under a specific directory
* The files and folders nested under a specific directory.
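
The three scopes in the list above can be illustrated locally with `pathlib`.  This is only a sketch of the distinction between them, not the catalog service itself.  The function names are hypothetical, and reading the first scope as "files directly inside the directory" is an assumption.

```python
import tempfile
from pathlib import Path


def files_in_directory(root):
    """Files that sit directly inside root (no recursion)."""
    return sorted(p.name for p in Path(root).iterdir() if p.is_file())


def nested_subdirectories(root):
    """Every directory nested at any depth under root."""
    return sorted(p.relative_to(root).as_posix()
                  for p in Path(root).rglob("*") if p.is_dir())


def nested_files_and_folders(root):
    """Every file and directory nested at any depth under root."""
    return sorted(p.relative_to(root).as_posix() for p in Path(root).rglob("*"))


# Build a small throwaway tree and list it in each of the three ways.
with tempfile.TemporaryDirectory() as demo_root:
    top = Path(demo_root)
    (top / "sub" / "deep").mkdir(parents=True)
    (top / "a.txt").write_text("top-level file")
    (top / "sub" / "b.txt").write_text("nested file")
    direct_files = files_in_directory(top)
    subdirectories = nested_subdirectories(top)
    everything = nested_files_and_folders(top)

print(direct_files)
print(subdirectories)
print(everything)
```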

Once a file is catalogued in open metadata, it is possible, for certain types of files, to survey their contents.

This workbook uses Egeria's Python library, *pyegeria*, to activate different types of surveys and cataloguing, and then to view the results.  The code below initializes pyegeria.

----

In [None]:
# Initialize pyegeria

%run ../../pyegeria/initialize-pyegeria.ipynb


----

The next cell creates a pyegeria client, *EgeriaTech*, which provides access to the functions designed for technical users.  It also requests an access token that is used for each call to Egeria's Open Metadata and Governance services.  The token times out after about an hour, so rerun this cell whenever you need a new token.

----

In [None]:

egeria_tech = EgeriaTech(view_server, url, user_id, user_pwd)
token = egeria_tech.create_egeria_bearer_token()


----
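
Because the token expires, a long-running session may want to track the token's age rather than rely on remembering to rerun the cell.  The following is a minimal sketch of that idea.  The `TokenCache` class is hypothetical and not part of pyegeria, and the 50-minute default is an assumption chosen to sit safely inside the roughly one-hour lifetime mentioned above.

```python
import time


class TokenCache:
    """Request a fresh token once the current one is older than max_age_seconds."""

    def __init__(self, fetch_token, max_age_seconds=50 * 60, clock=time.monotonic):
        self._fetch_token = fetch_token  # any zero-argument callable returning a token
        self._max_age = max_age_seconds
        self._clock = clock
        self._token = None
        self._fetched_at = None

    def get(self):
        """Return the cached token, fetching a new one if it is missing or too old."""
        now = self._clock()
        if self._token is None or now - self._fetched_at >= self._max_age:
            self._token = self._fetch_token()
            self._fetched_at = now
        return self._token
```

In this workbook the cache could wrap the call from the cell above, for example `TokenCache(egeria_tech.create_egeria_bearer_token)`.  Whether the client needs anything further to pick up a refreshed token is not covered here and would need checking against the pyegeria documentation.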

To find out the names of the file system services, you can use the *find_elements_by_property_value()* method.  The call below displays the help for this function.

----

In [None]:
help(EgeriaTech.find_elements_by_property_value)

----

The code below calls *find_elements_by_property_value()* to request details of the [Governance Action Processes](https://egeria-project.org/concepts/governance-action-process/) that work with file systems. 

----

In [None]:

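# Search for governance action processes whose name contains "FileDirectory".
elements = egeria_tech.find_elements_by_property_value(
    property_value="FileDirectory",
    property_names=['name'],
    open_metadata_type_name="GovernanceActionProcess")

if isinstance(elements, str):
    # A string is returned in place of a list; print it as-is.
    print(elements)
else:
    for element in elements:
        properties = element.get('properties') if element else None
        if properties:
            # Fall back to an empty string so a missing value does not raise an error.
            qualified_name = properties.get('qualifiedName') or ''
            description = properties.get('description') or ''
            print(f"* {qualified_name} - {description}")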

----

Governance action processes combine governance actions that are often used together into a flow that can be executed with a single command.  For example, the *FileDirectory:CreateAndSurveyGovernanceActionProcess* is a three-step process:

* It creates an asset entry to represent the top level directory to survey.
* It runs the survey.  The results are linked to the asset created in the first step.
* It creates a survey report markdown document based on the results of the survey.  This is stored in `/distribution-hub/surveys/survey-reports`.

The information needed to run the survey (such as which directory to start in) is listed in the process's specification.  The *supportedRequestParameters* identify the names of the values to supply in the request parameters that are passed to the process when it runs.
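
Since the request parameters are matched by name, a misspelt name is easy to miss.  The sketch below compares a set of supported names with the names actually supplied.  The helper `check_request_parameters` is hypothetical, the supported names are simply the ones used in the request later in this workbook rather than values read from the specification, and treating every supported name as required is an assumption.

```python
def check_request_parameters(supported_names, request_parameters):
    """Report supported names that were not supplied and supplied names that are not supported."""
    supported = set(supported_names)
    supplied = set(request_parameters)
    return {
        "missing": sorted(supported - supplied),
        "unrecognized": sorted(supplied - supported),
    }


# Illustrative names only, copied from the request used later in this workbook.
supported_names = [
    "fileSystemName",
    "directoryPathName",
    "directoryName",
    "versionIdentifier",
    "description",
]

# A draft request with one misspelt name and one name left out.
draft_request = {
    "fileSystemName": "Egeria Deployment",
    "directoryPath": ".",  # misspelt: should be directoryPathName
    "directoryName": "platform",
    "versionIdentifier": "1.0",
}

check = check_request_parameters(supported_names, draft_request)
print(check)
```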

----

In [None]:

createAndSurveyProcessName="FileDirectory:CreateAndSurveyGovernanceActionProcess"

process_guid = egeria_tech.get_element_guid_by_unique_name(createAndSurveyProcessName)

process_graph = egeria_tech.get_gov_action_process_graph(process_guid)
print_governance_action_process_graph(process_graph)


----

Below is the command that executes the survey.  It is set up to survey the files that are part of Egeria's deployment.  

You can survey a different directory by changing *directoryPathName* and *directoryName*.  The directory must be reachable from the Egeria runtime where the survey executes.

----

In [None]:
requestParameters = {
    "fileSystemName" : "Egeria Deployment",
    "directoryPathName" : ".",
    "directoryName" : "platform",
    "versionIdentifier" : "1.0",
    "description" : "Files used to deploy Egeria."
}

instance_guid = egeria_tech.initiate_gov_action_process(createAndSurveyProcessName, None, None, None, requestParameters, None, None)
print(instance_guid)

----

The command below lists the outcome of the survey request. 

----

In [None]:

display_engine_activity_c()
