![Egeria Logo](https://raw.githubusercontent.com/odpi/egeria/master/assets/img/ODPi_Egeria_Logo_color.png)

### ODPi Egeria and Palisade Hands-On Lab
# Welcome to the Restricting Data Access Lab

## Introduction

[ODPi Egeria](https://egeria.odpi.org/) is an open source project that provides open standards and implementation libraries to connect tools, catalogs and platforms together so they can share information about data and technology (called metadata).

[Palisade](https://github.com/gchq/Palisade) is a scalable data policy management and enforcement capability.

In this hands-on lab you will get a chance to explore different methods for protecting a data file that is cataloged in Egeria metadata. 

This lab includes three methods to illustrate different approaches to controlling access to data:

- **File Based Access Control** using a file's Access Control List (ACL)
- **Data Content Based Access Control** using metadata from Egeria to determine the data's sensitivity
- **Context Based Access Control** using Palisade to manage enforcement by taking the context of the query into consideration.

The three methods highlight different ways to protect data.

## The scenario

<img src="https://raw.githubusercontent.com/odpi/data-governance/master/docs/coco-pharmaceuticals/personas/callie-quartile.png" style="float:left">
Callie Quartile is a data scientist at Coco Pharmaceuticals. She is responsible for analyzing data for Human Resources (HR) and the Clinical Trials team. 

Callie has been asked to provide analytics for two different HR projects:

  * perform a staff salary analysis that identifies any pay biases in the salaries and bonuses of Coco Pharmaceutical employees.
  * identify staff eligible for a 5 year anniversary health screening project.

The data that Callie will access contains both sensitive and personal data which she is not normally authorized to view, such as Salary, Date of Birth and so on.

In this notebook you will learn how to redact data elements, so Callie can only view the data that is essential to each project in a way which does not provide her with an inappropriate level of data access.

For example, in the staff salary review project, it is inappropriate for Callie to view her colleagues' salary details along with their names, employee number or other items which identify them.  If the fields that identify individuals are redacted then Callie may see the salary data with no knowledge of who the data pertains to.

Similarly, when identifying the list of staff eligible for the health screening program, she needs to see identifying information such as names and email addresses along with their start date.  However, she does not need to see salary information.

## Inside the Employee file

Figure 1 shows the structure of an Avro file that contains Coco Pharmaceuticals employee data.  It includes a rich mixture of public, personal and financial information.  No individual would ever need access to all values in all records.  Instead, the access control processes need to filter out the data that is appropriate for an individual to see.  This filtering may remove whole records, remove specific values, or reduce the precision of a specific value.

![Figure 1](images/avro-employee-uml.png)
> **Figure 1:** Structure of the Employee data
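The three kinds of filtering described above can be illustrated with a minimal sketch. The records, field names and rounding choices below are assumptions made for illustration; they are not taken from the real Employee file.

```python
from datetime import date

# Hypothetical records; field names and values are illustrative only.
employees = [
    {"name": "A. Example", "department": "HR", "salary": 51234,
     "dateOfBirth": date(1985, 6, 14)},
    {"name": "B. Sample", "department": "Research", "salary": 67890,
     "dateOfBirth": date(1990, 11, 2)},
]

def filter_records(records, department):
    """Remove whole records: keep only the rows for one department."""
    return [r for r in records if r["department"] == department]

def redact_fields(record, hidden):
    """Remove specific values: return a copy without the named fields."""
    return {k: v for k, v in record.items() if k not in hidden}

def reduce_precision(record):
    """Reduce precision: keep only the birth year and round the salary
    to the nearest 10,000."""
    coarse = dict(record)
    coarse["dateOfBirth"] = record["dateOfBirth"].year
    coarse["salary"] = round(record["salary"], -4)
    return coarse

view = [reduce_precision(redact_fields(r, {"name"}))
        for r in filter_records(employees, "HR")]
print(view)
```

Each function returns a new structure and leaves the input untouched, so the same source records can be filtered differently for different requests.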

Callie's two projects (Salary Review and Health Screening) use overlapping sets of fields from the same data file.  The specific fields she needs to access for each project are shown in the following sections. 

**Salary Bias Review Project Data**

This project uses employee records restricted to a subset of fields. To perform her analysis, Callie needs access to Date of Birth, Hire Date, Salary, Bonus, Department, Manager, Sex, Nationality and Work Location. During this project Callie must NOT be allowed to see any fields that would enable her to identify an employee, so fields such as employee ID, Name and Address need to be redacted.  With these fields redacted, Callie can perform her salary analysis.

Figure 2 below shows the fields she needs.
Light blue colouring means the field is needed (see `dateOfBirth` in the `Salary Bias Review Project Data`); fields in white are not needed (see `name` in the `Salary Bias Review Project Data`). Dark blue highlighting on a table means that all fields of the linked instances can be viewed (see `Address` in the `Salary Bias Review Project Data`).

![Figure 2](images/avro-employee-uml-salary-bias-context.png)
> **Figure 2:** Employee data needed for the Salary Bias Review


**Health Screening Project Data**

In the Health Screening project Callie will need to see fields that identify each employee, including employee ID, Name, Address, Hiring Date and Work Location, but she should NOT be allowed to see any fields containing financial information, such as Salary, Bonus or Bank Details.

This is illustrated in Figure 3.

![Figure 3](images/avro-employee-uml-health-screening-context.png)
> **Figure 3:** Employee data needed for the Health Screening Project

Aside from the two projects above, there are two additional "purposes" that Callie may have for querying employee data.

**Default Use**

The default use of the employee file is as a company directory.  In this context, Callie should be able to see each employee's userId, name, department, manager, work location and work contact numbers (but not personal or emergency contact numbers).  This is shown in Figure 4.

![Figure 4](images/avro-employee-uml-default-context.png)
> **Figure 4:** Default use of Employee data used as a company directory

**Update My Profile Use**

The final purpose is for when Callie needs to **Update** her own employee record - in this case Callie should be able to see all fields and update selected fields that relate to her personally.

![Figure 5](images/avro-employee-uml-update-context.png)
> **Figure 5:** Employee updating their own record
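The field lists for each purpose can be written down as data and checked for consistency before any enforcement is attempted. The sketch below is one hedged way to do that. The purpose names and the camel-case field names are assumptions; the real names come from the Avro schema in Figure 1.

```python
# Illustrative field names; the real schema is shown in Figure 1.
IDENTIFYING = {"employeeId", "name", "address"}
FINANCIAL = {"salary", "bonus", "bankDetails"}

# Hypothetical purpose names mapped to the fields each purpose may read.
FIELDS_BY_PURPOSE = {
    "salary-bias-review": {"dateOfBirth", "hireDate", "salary", "bonus",
                           "department", "manager", "sex", "nationality",
                           "workLocation"},
    "health-screening": {"employeeId", "name", "address", "hireDate",
                         "workLocation"},
    "default": {"userId", "name", "department", "manager", "workLocation",
                "workContactNumber"},
}

def check_policy(fields_by_purpose):
    """Return a list of messages describing fields that break the rules
    stated for each project; an empty list means the table is consistent."""
    problems = []
    leaked = fields_by_purpose["salary-bias-review"] & IDENTIFYING
    if leaked:
        problems.append("salary-bias-review exposes identifying fields: "
                        + ", ".join(sorted(leaked)))
    leaked = fields_by_purpose["health-screening"] & FINANCIAL
    if leaked:
        problems.append("health-screening exposes financial fields: "
                        + ", ".join(sorted(leaked)))
    return problems

def project(record, purpose):
    """Keep only the fields allowed for the purpose; unknown purposes
    fall back to the company directory view."""
    allowed = fields_by_purpose_or_default(purpose)
    return {k: v for k, v in record.items() if k in allowed}

def fields_by_purpose_or_default(purpose):
    return FIELDS_BY_PURPOSE.get(purpose, FIELDS_BY_PURPOSE["default"])

print(check_policy(FIELDS_BY_PURPOSE))
```

Keeping the table separate from the code that applies it means the field lists can be reviewed by the asset owner without reading any enforcement logic.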


# Setting up

Coco Pharmaceuticals make widespread use of ODPi Egeria for tracking and managing their data and related assets.
Figure 6 below shows the metadata servers and the platforms that are hosting them.

![Figure 6](https://raw.githubusercontent.com/odpi/egeria/master/open-metadata-resources/open-metadata-labs/images/coco-pharmaceuticals-systems-omag-server-platforms.png)
> **Figure 6:** Coco Pharmaceuticals' OMAG Server Platforms

In [5]:
import os

corePlatformURL     = os.environ.get('corePlatformURL','http://localhost:18080') 
dataLakePlatformURL = os.environ.get('dataLakePlatformURL','http://localhost:18081') 
devPlatformURL      = os.environ.get('devPlatformURL','http://localhost:18082')

Callie is using the research team's metadata server called `cocoMDS3`. This server is hosted on the Core OMAG Server Platform.  Her userId is `calliequartile`.

In [6]:
calliesUserId = "calliequartile"
calliesServer = "cocoMDS3"
calliesServerPlatformURL = corePlatformURL

However, before Callie can begin to access the employee file, it needs to be cataloged by the data lake operations team, Peter Profile and Erin Overview.  Peter uses `cocoMDS1` and Erin uses `cocoMDS2`.

In [7]:
petersUserId = "peterprofile"
petersServer = "cocoMDS1"
petersServerPlatformURL = dataLakePlatformURL

erinsUserId = "erinoverview"
erinsServer = "cocoMDS2"
erinsServerPlatformURL = corePlatformURL

The following request checks that their servers are running.  

In [9]:
import requests
import pprint
import json

adminUserId = "garygeeke"

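def checkServer(serverName, platformURL):
    """Report whether the named server is active on the given OMAG Server Platform."""
    print("Checking server", serverName, "...")
    url = platformURL + "/open-metadata/platform-services/users/" + adminUserId + "/server-platform/servers/" + serverName + "/status"
    try:
        response = requests.get(url, timeout=10)
        serverStatus = response.json().get('active')
    except (requests.exceptions.RequestException, ValueError):
        # The platform could not be reached, or the reply was not JSON
        serverStatus = False
    if serverStatus:
        print("Server " + serverName + " is active - ready to begin")
    else:
        print("Server " + serverName + " is down - start it before proceeding")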


checkServer(calliesServer, calliesServerPlatformURL)
checkServer(petersServer, petersServerPlatformURL)
checkServer(erinsServer, erinsServerPlatformURL)


Checking server cocoMDS3 ...
Server cocoMDS3 is active - ready to begin
Checking server cocoMDS1 ...
Server cocoMDS1 is active - ready to begin
Checking server cocoMDS2 ...
Server cocoMDS2 is active - ready to begin


----
The next set of code sets up the asset - it is subject to change.

In [None]:
assetOwnerURL = petersServerPlatformURL + '/servers/' + petersServer + '/open-metadata/access-services/asset-owner/users/' + petersUserId 
createAssetURL = assetOwnerURL + '/assets/data-files/avro'

jsonHeader = {'content-type':'application/json'}
body = {
	"class" : "NewFileAssetRequestBody",
	"displayName" : "Coco Pharmaceuticals Employee Records",
	"description" : "Detailed Employee Records.",
	"fullPath" : "file://secured/hr/Employees.avro"
}

fileSystemGUID = "<Unknown>"
folder1GUID    = "<Unknown>"
folder2GUID    = "<Unknown>"
fileGUID       = "<Unknown>"

response=requests.post(createAssetURL, json=body, headers=jsonHeader)
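if response.status_code == 200:
    guids = response.json().get('guids')
    if guids is None:
        print ("No assets returned")
        prettyResponse = json.dumps(response.json(), indent=4)
        print ("Response: ")
        print (prettyResponse)
        print (" ")
    elif len(guids) == 4:
        # GUIDs are read in this order: file system, folder 1, folder 2, file
        fileSystemGUID, folder1GUID, folder2GUID, fileGUID = guids
    else:
        print ("Expected 4 GUIDs but received " + str(len(guids)))
else:
    print ("Request failed with HTTP status " + str(response.status_code))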

print ("File system GUID is: " + fileSystemGUID)
print ("Folder 1 GUID is:    " + folder1GUID)
print ("Folder 2 GUID is:    " + folder2GUID)
print ("File GUID is:        " + fileGUID)

print (" ")

body = {
	"class" : "OwnerRequestBody",
	"ownerType" : "USER_ID",
	"ownerId" : "faithbroker"
}

def addOwner(assetName, assetGUID):
    print ("Setting owner on " + assetName + " ...")
    addOwnerURL = assetOwnerURL + "/assets/" + assetGUID + "/owner"
    response=requests.post(addOwnerURL, json=body, headers=jsonHeader)
    if response.status_code != 200:
        prettyResponse = json.dumps(response.json(), indent=4)
        print ("Response: ")
        print (prettyResponse)
        print (" ")
    

addOwner("file", fileGUID)
addOwner("folder 2", folder2GUID)

governanceURL = erinsServerPlatformURL + '/servers/' + erinsServer + '/open-metadata/access-services/asset-owner/users/' + erinsUserId 

def addZones(assetName, assetGUID, zones):
    print ("Setting governance zones on " + assetName + " ...")
    addZonesURL = governanceURL + "/assets/" + assetGUID + "/governance-zones"
    response=requests.post(addZonesURL, json=zones, headers=jsonHeader)
    if response.status_code != 200:
        prettyResponse = json.dumps(response.json(), indent=4)
        print ("Response: ")
        print (prettyResponse)
        print (" ")
    

addZones("file", fileGUID, ["data-lake", "human-resources"])
addZones("folder 2", folder2GUID, ["data-lake", "human-resources"])
addZones("folder 1", folder1GUID, ["data-lake"])



----
The code below retrieves the assets.

In [None]:
findAssetsURL = assetOwnerURL + '/assets/by-search-string?startFrom=0&pageSize=50'
searchString=".*hr.*"

print (" ")
print ("POST " + findAssetsURL)
print ("{ " + searchString + " }")
print (" ")

response=requests.post(findAssetsURL, data=searchString)

print ("Returns:")
prettyResponse = json.dumps(response.json(), indent=4)
print (prettyResponse)
print (" ")

if response.json().get('assets'):
    if len(response.json().get('assets')) == 1:
        print ("1 asset found")
    else:
        print (str(len(response.json().get('assets'))) + " assets found")
else:
    print ("No assets found")

----
## Protecting the Employee file using Access Control Lists (ACLs)

An Access Control List (ACL) can be used to specify which users can access a file. Each user that is permitted to access the file can read the whole file - there is no ability to redact or mask sensitive fields. Where finer-grained access is required, organizations must create separate copies of the file, with different subsets of content and different access control permissions.
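As a rough illustration of the all-or-nothing nature of file-level control, the sketch below uses POSIX permission bits, which are a simpler relative of a full ACL. It assumes a POSIX file system. The file is a temporary placeholder, not the real `Employees.avro`.

```python
import os
import stat
import tempfile

def describe_access(path):
    """Report who may read the file according to its POSIX mode bits."""
    mode = os.stat(path).st_mode
    return {
        "owner_read": bool(mode & stat.S_IRUSR),
        "group_read": bool(mode & stat.S_IRGRP),
        "other_read": bool(mode & stat.S_IROTH),
    }

with tempfile.TemporaryDirectory() as folder:
    # Placeholder file standing in for the employee data
    path = os.path.join(folder, "Employees.avro")
    with open(path, "wb") as f:
        f.write(b"placeholder")
    # rw-r-----: the owner reads and writes, the group reads, others get nothing
    os.chmod(path, stat.S_IRUSR | stat.S_IWUSR | stat.S_IRGRP)
    access = describe_access(path)

print(access)
```

The permission applies to the file as a whole: a reader in the group sees every field of every record, which is why project-specific copies are needed under this approach.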

**_more to come_**


----
## Protecting the Employee file using consistent metadata definitions

Here policies based on the metadata catalog are used to control access. A policy may be based on the characteristics of a specific Asset - such as its attributes, owner, etc - or in a slightly more sophisticated implementation, it may be based on the characteristics of a Glossary Term that is associated with the Asset. 

Either of these enables finer-grained control of access to a file and enables redaction and/or masking of certain fields. In the simple policy-based access control example, access permission can be based on an individual user's identity, so Callie may be allowed to see more/less than her co-workers.

Policy-based access control is typically implemented using an Enforcement Point, such as Apache Ranger or Palisade, that is able to access the metadata to make a decision.
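A minimal sketch of such a decision is shown below. It assumes each field carries a sensitivity label that an enforcement point could look up in the metadata catalog, and that each user has a set of labels they are cleared to see. The labels, the clearance table and the mask value are all hypothetical; this is not the API of Apache Ranger, Palisade or Egeria.

```python
# Hypothetical sensitivity labels, as might be derived from glossary terms
# attached to each field in the catalog.
FIELD_SENSITIVITY = {
    "name": "personal",
    "dateOfBirth": "personal",
    "salary": "financial",
    "department": "public",
}

# Hypothetical clearances per user; real values would come from policy.
USER_CLEARANCES = {
    "calliequartile": {"public"},
    "faithbroker": {"public", "personal", "financial"},
}

MASK = "****"

def apply_policy(record, user_id):
    """Mask every value whose label the user is not cleared to see.
    Unlabelled fields and unknown users get the most restrictive result."""
    cleared = USER_CLEARANCES.get(user_id, set())
    result = {}
    for field, value in record.items():
        label = FIELD_SENSITIVITY.get(field, "restricted")
        result[field] = value if label in cleared else MASK
    return result

record = {"name": "A. Example", "salary": 51234, "department": "HR"}
print(apply_policy(record, "calliequartile"))
print(apply_policy(record, "faithbroker"))
```

Because the decision depends only on the labels and the user, every copy of the data that carries the same labels is protected in the same way.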

**_more to come_**

----  
## Protecting the Employee file using context with metadata definitions

Considering the context of a request in the access control decision provides finer-grained control of data access than just using the characteristics of the data itself because it enables Callie's access to be dynamically determined based on the context of her work. As Callie switches between two different projects, her access to various data fields is dynamically modified to suit the current project.
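The idea can be sketched as a set of rules selected by the purpose that accompanies each request. The `Request` class, the rule functions, the purpose names and the sample records below are all hypothetical; this is a sketch of the concept, not Palisade's API.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Request:
    user_id: str
    purpose: str   # the context supplied with the query

def hide(*fields):
    """Build a rule that removes the named fields from a record."""
    def rule(record, request):
        return {k: v for k, v in record.items() if k not in fields}
    return rule

def own_record_only(record, request):
    """Rule that drops any record not belonging to the requesting user."""
    return record if record.get("userId") == request.user_id else None

# Hypothetical rule sets, one per purpose.
RULES = {
    "salary-bias-review": [hide("userId", "name")],
    "health-screening": [hide("salary", "bonus")],
    "update-my-profile": [own_record_only],
}

def read(records, request):
    """Apply the rules for the request's purpose to each record.
    A request with an unrecognized purpose sees nothing."""
    rules = RULES.get(request.purpose)
    if rules is None:
        return []
    visible = []
    for record in records:
        current = record
        for rule in rules:
            current = rule(current, request)
            if current is None:
                break
        if current is not None:
            visible.append(current)
    return visible

records = [
    {"userId": "calliequartile", "name": "Callie Quartile", "salary": 1000, "bonus": 50},
    {"userId": "otheruser", "name": "Other User", "salary": 2000, "bonus": 75},
]
salary_view = read(records, Request("calliequartile", "salary-bias-review"))
profile_view = read(records, Request("calliequartile", "update-my-profile"))
print(salary_view)
print(profile_view)
```

The same user and the same data produce different views purely because the purpose changed, and a rule can also take the identity of the requester into account.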

**_more to come_**

## Conclusions

In all the examples in this notebook, the same user (Callie Quartile) is attempting to access the same file (employee records).  This file contains a mixture of fields that we want Callie to be allowed to read, and other fields that she should not be allowed to read. 

We showed that:
 * the simple approach using file-based access control can be used to control access to a whole file. If only a subset of the file should be visible, a new file needs to be created containing only that subset and then secured appropriately.  Although simple and widely supported, this approach can lead to a proliferation of project-specific copies of the same data with potentially different standards of security implemented in each copy. The team needs to keep track of the copies and remove/archive them once each project is complete.
 
 * the content based approach uses metadata definitions to ensure all data of the same type is secured consistently.  With this approach, an individual sees the same data, irrespective of which copy they are looking at. This is valuable in a data lake environment where there are many copies of data optimized on different platforms for different processing.  However, for the use case we have been working with above, Callie actually needs project-specific views of the same data.
   If a view-based access point is being used, these separate views can be defined.  While this avoids the duplication of data, it does require extra administration to manage the views.
 
 * the final approach with Palisade uses the context definition to restrict the access to data.  It is still based on metadata as in the content based approach, but the context identifier provided allows the view of data to be dynamically controlled by policy.  This avoids the technical administration and means that access policies can be changed immediately by the asset owners.  The asset owners are in full control.

----