![Egeria Logo](https://raw.githubusercontent.com/odpi/egeria/master/assets/img/ODPi_Egeria_Logo_color.png)

### ODPi Egeria Hands-On Lab
# Welcome to the Understanding Methods for Restricting Asset Access Lab

## Introduction

ODPi Egeria is an open source project that provides open standards and implementation libraries to connect tools, catalogs and platforms together so they can share information about data and technology (called metadata).

In this hands-on lab you will get a chance to explore different methods for protecting a data file that is cataloged in Egeria metadata. 

This lab includes three methods to illustrate different degrees of data access:

- Simple Access Control using an Access Control List (ACL)
- Policy Based Access Control (PBAC)
- Context Sensitive Policy Based Access Control using Palisade (see below)

The three methods highlight different ways to protect data. In all the examples in this notebook, the same user (Callie Quartile) accesses the same file (employee records), which contains a mixture of fields that she should be allowed to read and fields that she should not. Each method in turn provides finer-grained access control than the one before.

- **ACL**: An Access Control List can be used to specify which users can access a file. Each user that is permitted to access the file can read the whole file; there is no ability to redact or mask sensitive fields. Where finer-grained access is required, some organisations might be tempted to create separate copies of the file, with different subsets of content and different access control permissions. This is not a recommended approach: it does not scale well, it duplicates some of the information, and it becomes hard to keep track of which users have access to which fields across the multiple copies. Furthermore, if a file is copied there is a risk that the copy does not have the appropriate file access controls applied to it. The use of an ACL is included in this hands-on lab to illustrate the problem: Callie would be able to see all the data in the file.

The other two methods are based on a different approach in which the file is in a data lake and all access to the file is virtualized - allowing the introduction of an enforcement point between the user and the data. This allows the organisation to use policy based access control, based on the metadata representing the data file. This lab contains two policy-based options, described below:

- **PBAC**: Here policies based on the metadata catalog are used to control access. A policy may be based on the characteristics of a specific Asset (such as its attributes or owner) or, in a slightly more sophisticated implementation, on the characteristics of a Glossary Term that is associated with the Asset. Either of these enables finer-grained control of access to a file and enables redaction and/or masking of certain fields. In the simple policy-based access control example, access permission can be based on an individual user's identity, so Callie may be allowed to see more or less than her co-workers. Policy-based access control is typically implemented using an Enforcement Point, such as Apache Ranger or Palisade. This hands-on lab includes an example of owner-based policy control implemented using Apache Ranger. It does not include glossary-based policies; these are a logical extension of the owner-based policies, but in practice they require that the organisation has developed a Glossary with Glossary Terms assigned to Assets.
   
- **Context Sensitive PBAC**: This is finer grained than plain PBAC and enables Callie's access to be dynamically determined based on the context of her work. As Callie switches between two different projects, her access to various data fields is dynamically modified to suit the current project. This hands-on lab demonstrates how to achieve this with Egeria and Palisade. Palisade is an open source framework for Scalable Data Access Policy Management and Enforcement (https://github.com/gchq/Palisade).
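
The difference between an all-or-nothing ACL and field-level policy control can be sketched in a few lines of Python. This is only an illustration of the idea, not how Apache Ranger or Palisade implement it: the record, the field names, the `acl` set and the `policy` mapping are all invented for this example.

```python
# Hypothetical employee record; field names and values are invented for illustration.
record = {"employeeId": "E0001", "name": "A. N. Other", "department": "Research", "salary": 50000}

# ACL: simply the set of user ids that may read the file.
acl = {"calliequartile"}

def read_with_acl(user_id, record):
    """All-or-nothing: a listed user sees every field, anyone else sees nothing."""
    if user_id not in acl:
        raise PermissionError(user_id + " is not on the access control list")
    return dict(record)

# Policy: the fields each user may see; everything else is redacted.
policy = {"calliequartile": {"department", "salary"}}

def read_with_policy(user_id, record):
    """Field-level: fields outside the user's policy are replaced with a marker."""
    visible = policy.get(user_id, set())
    return {field: (value if field in visible else "REDACTED")
            for field, value in record.items()}

acl_view = read_with_acl("calliequartile", record)
policy_view = read_with_policy("calliequartile", record)

print(acl_view)     # every field is returned
print(policy_view)  # identifying fields are replaced with "REDACTED"
```

With the ACL, the only way to hide `name` from Callie is to deny her the whole file or to hand her a separate copy. With the policy, the same file serves every user and the enforcement point decides what each one sees.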


## The scenario

Callie Quartile is a data scientist at Coco Pharmaceuticals. She is responsible for analysing data for HR and the Clinical Trials team. 

Callie has been asked to provide analytics for two different projects:

- a staff salary review that identifies any pay biases
- a project to identify staff eligible for a 5-year anniversary health screening

The data that Callie will access contains sensitive and personal data that she is not authorised to view, such as Salary, Date of Birth, Employee Name and so on. In this notebook you will learn how to redact data elements so that Callie can view only the data that is essential to each project, without exceeding her level of data access.

There are many situations where data needs to be redacted for different members of staff, based on their role, access, security clearance and so on. In the first project it is inappropriate for Callie to view her colleagues' salary details along with their names, employee numbers or other items that identify them. If the identifying details are redacted, Callie can see the salaries with no knowledge of who the data pertains to. For the second project, Callie will see a different set of data, redacted under a different set of context-based rules.


![Callie Quartile](https://raw.githubusercontent.com/odpi/data-governance/master/docs/coco-pharmaceuticals/personas/callie-quartile.png)

Callie's userId is `calliequartile`.

In [4]:
calliesUserId = "calliequartile"

The two projects (Salary Review and Health Screening) use overlapping sets of fields from the same data file. The specific fields she needs to access for each project are as follows:

**Salary Bias Review Project Data** 
The data access for this project consists of employee records with a number of fields. In order to perform her analysis, Callie needs to be able to access Date of Birth, Hire Date, Salary, Bonus, Department, Manager, Sex, Nationality and Work Location. During this project Callie must NOT be allowed to see any fields that would enable her to identify the employee, so we need to redact employee ID, Name, Address, etc. With these fields redacted Callie can perform her salary analysis.

**Health Screening Project Data**
In this instance Callie will need to see fields that identify each employee, including employee ID, Name, Address, Hire Date and Work Location, but she should NOT be allowed to see any fields containing financial information, such as Salary, Bonus or Bank Details.

Note that Callie needs to see different fields for each of the two projects. The project she is working on is referred to as the Context within which she is working. This context-based access control is possible using Palisade.

Aside from the two projects above, there are two additional "purposes" that Callie may have for querying employee data:

- One purpose is the **Default** - under which Callie should be able to see employeeID, name, department, manager, work location and the work contact numbers (but not personal contact numbers).

- The other purpose is for when Callie needs to **Update** her own employee record - in this case Callie should be able to see all fields.
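
The four purposes described above amount to a lookup from purpose to the set of visible fields. The following is a minimal sketch of that lookup, not Palisade's actual policy format. The purpose labels, the field identifiers and the choice to treat an unknown purpose as `Default` are all assumptions made for illustration; the real column names in the employee file may differ.

```python
# Field identifiers are hypothetical; the real column names may differ.
ALL_FIELDS = [
    "employeeId", "name", "address", "dateOfBirth", "hireDate", "salary",
    "bonus", "bankDetails", "department", "manager", "sex", "nationality",
    "workLocation", "workContactNumber", "personalContactNumber",
]

# Fields visible under each purpose, following the lists in the text above.
# The purpose labels themselves are invented for this sketch.
VISIBLE_FIELDS = {
    "SalaryReview": {"dateOfBirth", "hireDate", "salary", "bonus", "department",
                     "manager", "sex", "nationality", "workLocation"},
    "HealthScreening": {"employeeId", "name", "address", "hireDate", "workLocation"},
    "Default": {"employeeId", "name", "department", "manager", "workLocation",
                "workContactNumber"},
    "Update": set(ALL_FIELDS),
}

def redact(record, purpose):
    """Return a copy of record with fields not visible under purpose set to None."""
    # Assumption: an unrecognised purpose is treated as the Default purpose.
    visible = VISIBLE_FIELDS.get(purpose, VISIBLE_FIELDS["Default"])
    return {field: (value if field in visible else None)
            for field, value in record.items()}

# A made-up record purely for demonstration.
sample = {field: "value-of-" + field for field in ALL_FIELDS}

salary_view = redact(sample, "SalaryReview")
health_view = redact(sample, "HealthScreening")

print(sorted(f for f, v in salary_view.items() if v is not None))
print(sorted(f for f, v in health_view.items() if v is not None))
```

The sketch does not check that a record viewed under the `Update` purpose actually belongs to the requesting user; a real enforcement point would need that check as well.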

# Setting up

Coco Pharmaceuticals make widespread use of ODPi Egeria for tracking and managing their data and related assets.
Figure 1 below shows the metadata servers and the platforms that are hosting them.

![Figure 1](../images/coco-pharmaceuticals-systems-omag-server-platforms.png)
> **Figure 1:** Coco Pharmaceuticals' OMAG Server Platforms

In [3]:
import os

corePlatformURL     = os.environ.get('corePlatformURL','http://localhost:8080') 
dataLakePlatformURL = os.environ.get('dataLakePlatformURL','http://localhost:8081') 
devPlatformURL      = os.environ.get('devPlatformURL','http://localhost:8082')

Callie is using the research team's metadata server called `cocoMDS3`. This server is hosted on the Core OMAG Server Platform.

In [None]:
server            = "cocoMDS3"
serverPlatformURL = corePlatformURL

The following request checks that this server is running.

In [None]:
import requests
import pprint
import json

adminUserId = "garygeeke"

isServerActiveURL = serverPlatformURL + "/open-metadata/platform-services/users/" + adminUserId + "/server-platform/servers/" + server + "/status"

print (" ")
print ("GET " + isServerActiveURL)
print (" ")

response = requests.get(isServerActiveURL)

print ("Returns:")
prettyResponse = json.dumps(response.json(), indent=4)
print (prettyResponse)
print (" ")

# The status response reports whether the server is running in its 'active' field.
serverStatus = response.json().get('active')
if serverStatus:
    print("Server " + server + " is active - ready to begin")
else:
    print("Server " + server + " is down - start it before proceeding")
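
The status check above assumes the platform is reachable and returns JSON. A more defensive variant might look like the sketch below. It uses the standard library's `urllib` rather than `requests`, only so that it runs without extra packages; the helper names `build_status_url` and `is_server_active` and the 5 second default timeout are choices made for this example, not part of Egeria.

```python
import json
import urllib.request

def build_status_url(platform_url, admin_user_id, server_name):
    """Build the same server status URL as the cell above."""
    return (platform_url + "/open-metadata/platform-services/users/" + admin_user_id
            + "/server-platform/servers/" + server_name + "/status")

def is_server_active(platform_url, admin_user_id, server_name, timeout=5):
    """Return True only if the platform answers and reports the server as active."""
    url = build_status_url(platform_url, admin_user_id, server_name)
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            payload = json.load(response)
    except (OSError, ValueError):
        # OSError covers connection failures, timeouts and HTTP error statuses;
        # ValueError covers a response body that is not valid JSON.
        return False
    return isinstance(payload, dict) and bool(payload.get("active"))
```

Calling `is_server_active(serverPlatformURL, adminUserId, server)` then returns `False` instead of raising an exception when the platform itself is not running.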


----
The next set of code sets up the asset - it is subject to change.

In [None]:
assetOwnerURL = serverPlatformURL + '/servers/' + server + '/open-metadata/access-services/asset-owner/users/' + calliesUserId 
createAssetURL = assetOwnerURL + '/assets/csv-files'
print (createAssetURL)

jsonHeader = {'content-type':'application/json'}
body = {
    "class" : "NewFileAssetRequestBody",
    "displayName" : "CoCo Pharmaceuticals Employee Records",
    "description" : "Detailed Employee Records.",
    "fullPath" : "file://secured/hr/Employees.avro"
}

response=requests.post(createAssetURL, json=body, headers=jsonHeader)

response.json()

In [None]:
# assetOwnerURL is defined in the asset creation cell above.
getAssetsURL = assetOwnerURL + '/assets/by-name?startFrom=0&pageSize=50'
searchString="*Patient*"

print (" ")
print ("POST " + getAssetsURL)
print ("{ " + searchString + " }")
print (" ")

# The search string is sent as the body of a POST request.
response=requests.post(getAssetsURL, data=searchString)

print ("Returns:")
prettyResponse = json.dumps(response.json(), indent=4)
print (prettyResponse)
print (" ")

if response.json().get('assets'):
    if len(response.json().get('assets')) == 1:
        print ("1 asset found")
    else:
        print (str(len(response.json().get('assets'))) + " assets found")
else:
    print ("No assets found")