# Provena Dataset Registration with Metadata
This notebook demonstrates how a Provena user can register a new dataset in the Provena Data Store.
The demonstration covers the prerequisites for registering a dataset, the registration process itself, and post-registration activities such as updating metadata with new, missed, or changed fields, releasing the dataset, creating new versions, changing access permissions, and more.

## Configuration Setup

In [1]:
# Standard library
import json

# Helpers for interacting with the data store
from helpers import datastore
# Helper for pretty-printing JSON
from utils import pprint_json
# get_auth manages authentication with Provena; data_store_endpoint is the Data Store endpoint
from env_setup import get_auth, data_store_endpoint

No storage or object provided, using default location: .tokens.json.
Using storage type: FILE.
Using DEVICE auth flow.
Attempting to generate authorisation tokens.

Looking for existing tokens in local storage.

Validating found tokens

Trying to use found tokens to refresh the access token.

Token refresh successful.



## Prerequisites for Dataset Registration
Dataset metadata references organisations, owners, and optionally other users (e.g. a Data Custodian). These entities must be registered in the Provena Data Store before they can be referenced in a dataset registration. This is generally a one-off activity and is therefore best performed using the web user interface. [A guide for registering entities is available](http://docs.provena.io/registry/registering_and_updating.html). Furthermore, you must ensure that you are registered as a Person Entity and that your user account is linked to this entity. More information follows below.

At a minimum, register the following entities before registering a dataset (the Data Custodian is optional):
* **Person Entity** of yourself (for Provena to automatically assign your person entity as the dataset entity owner)
* (Optional) **Person Entity** of the Dataset's Data Custodian
* **Organisation Entity** of the Dataset's Record Creator Organisation
* **Organisation Entity** of the Dataset's Publisher


In addition to registering a Person Entity of yourself, you must also then [link your account to this Person Entity](http://docs.provena.io/getting-started-is/linking-identity.html).


### Prerequisite entities

I pre-registered the following entities in the web user interface. Registration generated the handles below, which are used in the dataset metadata fields later:

| Entity Type and Purpose                       | Entity Handle + Link                                      |
|-----------------------------------|-------------------------------------------------|
| Person Entity of myself           | [10378.1/1764273](https://hdl.handle.net/10378.1/1764273)   |
| Person Entity of the Dataset's Data Custodian (Peter Baker)  | [10378.1/1758949](https://hdl.handle.net/10378.1/1758949)   |
| Organisation Entity of the Dataset's Record Creator Organisation (CSIRO)  | [10378.1/1764284](https://hdl.handle.net/10378.1/1764284)   |
| Organisation Entity of the Dataset's Publisher (CSIRO)  | [10378.1/1764284](https://hdl.handle.net/10378.1/1764284)   |



In [4]:
# TODO during demonstration: set these to the handles of your own pre-registered entities.
record_creator = "10378.1/1764284"
publisher = "10378.1/1764284"
data_custodian = "10378.1/1758949"
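
Before the handles above are injected into the dataset metadata, it can help to catch copy-paste mistakes such as stray whitespace. The following is a minimal sketch, not part of the Provena helpers: `is_plausible_handle` and `handle_url` are hypothetical names, and the pattern only assumes the `prefix/suffix` shape seen in the table above. The `https://hdl.handle.net/` base URL is the one used in the table links.

```python
import re

# Assumed shape: dot-separated numeric prefix, a slash, then a non-empty suffix
# without whitespace (e.g. "10378.1/1764284"). This is a sanity check only.
HANDLE_PATTERN = re.compile(r"\d+(\.\d+)*/\S+")


def is_plausible_handle(handle: str) -> bool:
    """Return True if the string looks like a prefix/suffix handle."""
    return HANDLE_PATTERN.fullmatch(handle) is not None


def handle_url(handle: str) -> str:
    """Build the resolver link for a handle, as used in the table above."""
    if not is_plausible_handle(handle):
        raise ValueError(f"Not a plausible handle: {handle!r}")
    return f"https://hdl.handle.net/{handle}"


# Handles copied from the prerequisite table.
references = {
    "record_creator": "10378.1/1764284",
    "publisher": "10378.1/1764284",
    "data_custodian": "10378.1/1758949",
}

for name, handle in references.items():
    print(f"{name}: {handle_url(handle)}")
```

This check does not confirm that an entity exists in the registry; it only guards against malformed strings.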

## Dataset Registration
With the prerequisites in place, the following sections demonstrate how to register a dataset in the Provena Data Store. You can check the expected JSON payload for each endpoint at https://data-api.dev.rrap-is.com/redoc. For registering a new dataset, see https://data-api.dev.rrap-is.com/redoc#tag/Register-dataset.

#### Load in Dataset Metadata

In [5]:
# Get path to file containing the dataset metadata

dataset_metadata_path = "configs/example_dataset_registration.json"

# load into dict
with open(dataset_metadata_path) as f:
    dataset_metadata = json.load(f)

# Inject references
dataset_metadata = datastore.inject_references(dataset_metadata, record_creator, publisher, data_custodian)

# Pretty Display
#pprint_json(dataset_metadata)
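
The implementation of `datastore.inject_references` is not shown in this notebook. As a rough illustration of what such a helper could do, the sketch below fills in the `associations.organisation_id`, `associations.data_custodian_id`, and `dataset_info.publisher_id` fields, which are the field names visible in the registration output further down. The function name `inject_references_sketch` is hypothetical, and mapping the record creator to `organisation_id` is an assumption.

```python
import copy
from typing import Optional


def inject_references_sketch(
    metadata: dict,
    record_creator: str,
    publisher: str,
    data_custodian: Optional[str] = None,
) -> dict:
    """Return a copy of the metadata with entity handles filled in.

    Assumption: the record creator organisation maps to
    associations.organisation_id. The real helper may differ.
    """
    updated = copy.deepcopy(metadata)  # leave the caller's dict untouched

    associations = updated.setdefault("associations", {})
    associations["organisation_id"] = record_creator
    if data_custodian is not None:  # the Data Custodian is optional
        associations["data_custodian_id"] = data_custodian

    updated.setdefault("dataset_info", {})["publisher_id"] = publisher
    return updated


example = inject_references_sketch(
    {"dataset_info": {"name": "Example"}},
    record_creator="10378.1/1764284",
    publisher="10378.1/1764284",
    data_custodian="10378.1/1758949",
)
print(example)
```

Returning a deep copy keeps the loaded JSON unchanged, so the cell can be re-run with different handles.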

#### Post Dataset Metadata to Provena for Creation of Dataset Entity

In [6]:
register_response = datastore.register_dataset(
    datastore_endpoint=data_store_endpoint,
    dataset_metadata=dataset_metadata,
    auth=get_auth(),
)
print(f"Registered dataset with id {register_response['handle']}")
pprint_json(register_response)

Registering dataset with metadata: {'associations': {'organisation_id': '10378.1/1764284', 'data_custodian_id': '10378.1/1758949', 'point_of_contact': 'Lazaros'}, 'approvals': {'ethics_registration': {'relevant': False, 'obtained': False}, 'ethics_access': {'relevant': False, 'obtained': False}, 'indigenous_knowledge': {'relevant': False, 'obtained': False}, 'export_controls': {'relevant': False, 'obtained': False}}, 'dataset_info': {'name': 'The Test Dataset: A Mirror to the Soul of the Software', 'description': 'This is a test dataset purposed for demonstrating registration via API endpoint.', 'access_info': {'reposited': True}, 'publisher_id': '10378.1/1764284', 'created_date': '2022-10-02', 'published_date': '2023-10-03', 'license': 'https://creativecommons.org/licenses/by/4.0/', 'purpose': "But why, you might ask, was the Test Dataset so important? Well, dear reader, it served as a mirror reflecting the very essence of the software, exposing its vulnerabilities and frailties. It w

Registered dataset with id 10378.1/1764777
{
  "status": {
    "success": true,
    "details": "Successfully seeded location - see location details."
  },
  "handle": "10378.1/1764777",
  "s3_location": {
    "bucket_name": "restored-dev-dev-rrap-storage-bucket-11102022-11102022",
    "path": "datasets/10378-1-1764777/",
    "s3_uri": "s3://restored-dev-dev-rrap-storage-bucket-11102022-11102022/datasets/10378-1-1764777/"
  },
  "register_create_activity_session_id": "6928323c-7087-4c4c-9b88-a1664a3a4448"
}
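
The response above carries a `status` block alongside the new handle and storage location. A small check like the following could guard later steps against a failed registration. This is a sketch based only on the response shape shown above; `extract_registration` is a hypothetical helper, and the structure of a failure response is an assumption.

```python
def extract_registration(response: dict) -> dict:
    """Return the handle and S3 URI, or raise if registration did not succeed."""
    status = response.get("status", {})
    if not status.get("success", False):
        # Assumption: failures also report a human-readable "details" string.
        details = status.get("details", "no details provided")
        raise RuntimeError(f"Dataset registration failed: {details}")
    return {
        "handle": response["handle"],
        "s3_uri": response["s3_location"]["s3_uri"],
    }


# Response copied from the output above.
example_response = {
    "status": {
        "success": True,
        "details": "Successfully seeded location - see location details.",
    },
    "handle": "10378.1/1764777",
    "s3_location": {
        "bucket_name": "restored-dev-dev-rrap-storage-bucket-11102022-11102022",
        "path": "datasets/10378-1-1764777/",
        "s3_uri": "s3://restored-dev-dev-rrap-storage-bucket-11102022-11102022/datasets/10378-1-1764777/",
    },
    "register_create_activity_session_id": "6928323c-7087-4c4c-9b88-a1664a3a4448",
}

summary = extract_registration(example_response)
print(summary["handle"])
print(summary["s3_uri"])
```

In the notebook itself, the same function could be called with `register_response` in place of `example_response`.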


In [7]:
print(register_response['handle'])

10378.1/1764777
