# Create Protocols and Protocol executions from file

With this notebook you can generate a basic data provenance with protocols and protocol executions. This particular notebook considers a 3-step provenance that starts with a subject and ends with a file (see figure).

<img src="img/simpleProvenance.png" width="800px" height="250px" align="center"/>

The steps in the notebook are as follows:
1. Create protocols from an Excel file
2. Extract linked subject states, tissue sample states and file bundles from the KG and use them in the protocol executions as input and output
3. Create protocol executions from file and link the generated protocols to it
4. Post the newly created instances to the KGE

To be able to run the script, you need the following:
- Python version >= 3.6
- openMINDS package (can be downloaded from https://pypi.org/project/openMINDS/)
- read and write permission to the KG via the API

A template file for the protocols and protocol executions is available as a macro-enabled Excel file. If you enable the macros, you can use this file to select multiple options in the dropdown menus. When you have filled out the template, save it as an .xlsx file to remove the VBA so that it can be imported in this notebook.

Information about the protocols should be stored in the .xlsx file with the following column names written in sheet 'P'. 
- protocolName
- protocolDescription
- technique (dropdown of controlled instances)

Information about the protocol executions should be stored in the same .xlsx file with the following column names written in sheet 'PE'.
- protocolExecutionName
- protocolExecutionDescription
- preparationType (dropdown menu of controlled instances)
- protocolUsed (dropdown menu from the protocol sheet)
- inputType (dropdown menu of controlled instances)
- input
- outputType (dropdown menu of controlled instances)
- output
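
Before importing the template, it can help to check that both sheets contain the expected columns. The sketch below is a minimal illustration using only the standard library. `missing_columns` and `REQUIRED_COLUMNS` are hypothetical names, not part of openMINDS or pandas; the column names are copied from the two lists above.

```python
# Column names expected in each sheet of the template (taken from the lists above)
REQUIRED_COLUMNS = {
    "P": ["protocolName", "protocolDescription", "technique"],
    "PE": ["protocolExecutionName", "protocolExecutionDescription",
           "preparationType", "protocolUsed", "inputType", "input",
           "outputType", "output"],
}


def missing_columns(sheet_name, actual_columns):
    """Return the required column names that are absent from a sheet (hypothetical helper)."""
    # Strip surrounding whitespace so that " protocolName " still counts as present
    present = {str(c).strip() for c in actual_columns}
    return [c for c in REQUIRED_COLUMNS[sheet_name] if c not in present]


# Example with a header row that lacks the 'technique' column
print(missing_columns("P", ["protocolName", "protocolDescription"]))
```

With pandas, the header row of a loaded sheet is available as `list(df.columns)`, which can be passed as `actual_columns`.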



In [None]:
# import important packages
import os
import json
import glob
import pandas as pd
from getpass import getpass
import requests
import openMINDS
import openMINDS.version_manager

# Initialise the openMINDS package version 3
openMINDS.version_manager.init()
openMINDS.version_manager.version_selection("v3")
helper = openMINDS.Helper()


First specify the dataset version for which you want to create a data provenance, and then give the name of the template file in which the information for the protocols and protocol executions is stored.

In [None]:
# Define Location of the files
cwd = os.getcwd()

answer = ""
while answer not in ["y", "n"]: 
    answer = input(f"Is this where your files are stored: {cwd}? yes (y) or no (n) " ) 

# Use the current working directory unless another path is given
if answer == "y":
    fpath = cwd
else:
    fpath = input("Please define your path: ")

dsv = input("What is the UUID of the dataset version? ")
print("The UUID of the dataset version is: " + dsv)
output_path = os.path.join(fpath, dsv)
print('The output folder is: ' + output_path)

# The file name is given without extension; '.xlsx' is appended below
protocol_file = input("What is the name of the file with the protocol information (without extension)? ")
protocols = pd.read_excel(os.path.join(fpath, protocol_file + '.xlsx'), sheet_name = 'P')
PEs = pd.read_excel(os.path.join(fpath, protocol_file + '.xlsx'), sheet_name = 'PE')

## Create Protocols

Using the information from the template file in sheet 'P', we can now create the protocols using the openMINDS package.
The generated instances are saved in a folder called 'protocol' and can be found in the output folder.

An overview of the generated instances with their UUIDs is stored in the output folder as well with the name "createdProtocols.csv". This information will be used to link the newly generated protocols to the protocol executions that will be created below.
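
To illustrate how the overview file can serve as a lookup between protocol names and UUIDs, here is a minimal sketch using only the standard library. `protocol_lookup` is a hypothetical helper, and the example rows and UUIDs are made up; only the column names (`name`, `protocolAtid`) follow the overview table written by this notebook.

```python
import csv
import io


def protocol_lookup(csv_text):
    """Map protocol name -> UUID from the text of an overview CSV (hypothetical helper)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row["name"]: row["protocolAtid"] for row in reader}


# Made-up example content; real files contain the UUIDs generated by openMINDS
example = (
    "type,name,description,technique,protocolAtid\n"
    "protocol,Perfusion,Example description A,techniqueA,0000-aaaa\n"
    "protocol,Sectioning,Example description B,techniqueB,0000-bbbb\n"
)
lookup = protocol_lookup(example)
print(lookup["Sectioning"])
```

For a file on disk, the same function could be fed with the file's text, for example `protocol_lookup(open(path, newline="").read())`.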

In [None]:
# Function to create protocol instances
def createProtocols(protocols, output_path):
    """
    Parameters
    ----------
    protocols : pandas DataFrame 
        Imported Excel sheet with information about the protocols
    output_path : string
        Folder in which the created instances and the overview file are saved

    Returns
    -------
    df : pandas DataFrame
        dataframe with relevant information about the created protocols
        
    """
        
    df = pd.DataFrame([])    
    protocol_dict = {} 
    for p in range(len(protocols)):

        mycol = helper.create_collection()
        
        # If more techniques were used, ensure that they are stored in the correct way
        techniques = []
        if protocols.technique[p].find(',') != -1:
            for t in protocols.technique[p].split(","):
                techniques.append({"@id": "https://openminds.ebrains.eu/instances/technique/" + str(t.strip())})
        else:
            techniques = {"@id": "https://openminds.ebrains.eu/instances/technique/" + str(protocols.technique[p].strip())}
            
        # Create a protocol instance
        protocol_dict[protocols.protocolName[p]] = mycol.add_core_protocol(name = protocols.protocolName[p],
                                                                description = protocols.protocolDescription[p],
                                                                technique = techniques
                                                                )
        
        # Create an overview table with the important information
        # (DataFrame.append was removed in pandas 2.0, so pd.concat is used instead)
        df = pd.concat([df, pd.DataFrame({"type" : "protocol",
                        "name" : protocols.protocolName[p],
                        "description" : protocols.protocolDescription[p],
                        "technique" : protocols.technique[p],
                        "protocolAtid" :  protocol_dict[protocols.protocolName[p]].split("/")[-1]},                
                    index=[0])],
                ignore_index=True)
        
        # Save the openMINDS instance
        mycol.save(os.path.join(output_path, "")) 

    print("Saving created instances...")
    
    # Store the information in an overview file
    filename = os.path.join(output_path, 'createdProtocols.csv')
    df.to_csv(filename, index = False, header=True)  

    print("Done")
    
    return df

In [None]:
# Execute the function to create protocols
createdProtocols = createProtocols(protocols, output_path)

# print the overview file to ensure it was successful
print(createdProtocols)

## Create Protocol Executions

To create the protocol executions, we need to perform a couple of steps to ensure that most of the information is filled in automatically.

### Extract information for input and outputs

We will first query the Knowledge Graph Editor to extract information that could be used as the input and output of protocol execution steps. To ensure that this works, it is important that specimen and file bundles are already in the system and that they are linked in the correct way, with "descendedFrom".

The query below will find the tissue sample collection of the dataset version that you specified earlier. It will also extract the subject and file bundle it is linked to, which allows you to create a data provenance as depicted in the image above.

**Note**: If your data provenance deviates from the above example, you may not be able to link the correct input and output to the protocol executions. Nevertheless, the protocol executions will still be generated and you can make manual edits in the Knowledge Graph Editor.
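
As a rough illustration of the linked structure described above, the sketch below flattens one tissue sample collection entry into a single row and tolerates missing subjects or file bundles. The shape of `example` is an assumption inferred from the property names used in the query (`studiedState`, `descendedFromSubject`, `descendedFromFile`), not a documented response format; `flatten_specimen` is a hypothetical helper and all identifiers are made up.

```python
def flatten_specimen(tsc):
    """Flatten one tissue sample collection entry into a row (hypothetical helper)."""
    # Fall back to an empty dictionary when a link is missing or empty
    state = (tsc.get("studiedState") or [{}])[0]
    subject = (state.get("descendedFromSubject") or [{}])[0]
    bundle = (state.get("descendedFromFile") or [{}])[0]

    def uuid(instance):
        # The UUID is the last path segment of the instance id
        return instance.get("id", "").split("/")[-1]

    return {
        "subject_state_name": subject.get("lookupLabel", ""),
        "subject_state_uuid": uuid(subject),
        "tsc_state_name": state.get("lookupLabel", ""),
        "tsc_state_uuid": uuid(state),
        "fileBundle_name": bundle.get("name", ""),
        "fileBundle_uuid": uuid(bundle),
    }


# Made-up entry without a linked file bundle
example = {
    "lookupLabel": "tsc-01",
    "studiedState": [{
        "id": "https://kg.ebrains.eu/api/instances/1111",
        "lookupLabel": "tsc-01-state",
        "descendedFromSubject": [{
            "id": "https://kg.ebrains.eu/api/instances/2222",
            "lookupLabel": "sub-01-state",
        }],
        "descendedFromFile": [],
    }],
}
print(flatten_specimen(example))
```

Empty strings in the file bundle fields indicate that no file bundle descends from the tissue sample collection state, which is the case the note above warns about.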

In [None]:
# This query extracts the tissue sample collections studied in the dataset version, together with their states, the linked subject states and the linked file bundles.
query = {
  "@context": {
    "@vocab": "https://core.kg.ebrains.eu/vocab/query/",
    "query": "https://schema.hbp.eu/myQuery/",
    "propertyName": {
      "@id": "propertyName",
      "@type": "@id"
    },
    "merge": {
      "@type": "@id",
      "@id": "merge"
    },
    "path": {
      "@id": "path",
      "@type": "@id"
    }
  },
  "meta": {
    "name": "get-dsv-specimen-fb",
    "responseVocab": "https://schema.hbp.eu/myQuery/",
    "type": "https://openminds.ebrains.eu/core/DatasetVersion"
  },
  "structure": [
    {
      "propertyName": "query:id",
      "path": "@id",
      "required": True,
      "filter": {
        "op": "CONTAINS",
        "value": dsv
      }
    },
    {
      "propertyName": "query:studiedSpecimen",
      "path": "https://openminds.ebrains.eu/vocab/studiedSpecimen",
      "required": True,
      "structure": [
        {
          "propertyName": "query:id",
          "path": "@id",
          "required": True
        },
        {
          "propertyName": "query:lookupLabel",
          "path": "https://openminds.ebrains.eu/vocab/lookupLabel",
          "required": True
        },
        {
          "propertyName": "query:internalIdentifier",
          "path": "https://openminds.ebrains.eu/vocab/internalIdentifier"
        },
        {
          "propertyName": "query:type",
          "path": "@type",
          "required": True,
          "filter": {
            "op": "CONTAINS",
            "value": "TissueSampleCollection"
          }
        },
        {
        "propertyName": "query:studiedState",
        "path": "https://openminds.ebrains.eu/vocab/studiedState",
        "structure": [
          {
            "propertyName": "query:id",
            "path": "@id"
          },
          {
            "propertyName": "query:lookupLabel",
            "path": "https://openminds.ebrains.eu/vocab/lookupLabel"
          },
          {
            "propertyName": "query:descendedFromFile",
            "path": {
              "@id": "https://openminds.ebrains.eu/vocab/descendedFrom",
              "reverse": True
            },
            "structure": [
              {
                "propertyName": "query:id",
                "path": "@id"
              },
              {
                "propertyName": "query:name",
                "path": "https://openminds.ebrains.eu/vocab/name"
              }
            ]
          },
          {
            "propertyName": "query:descendedFromSubject",
            "path": "https://openminds.ebrains.eu/vocab/descendedFrom",
            "structure": [
              {
              "propertyName": "query:id",
              "path": "@id"
              },
              {
              "propertyName": "query:lookupLabel",
              "path": "https://openminds.ebrains.eu/vocab/lookupLabel"
              }
            ]
          }
        ]
      }
      ]
    }
  ]
}

# Function to get the info based on what is defined in the query
def getInfo(token, stage="IN_PROGRESS"):
    """
    Parameters
    ----------
    token : string 
        Authentication token to access data and metadata in the KGE via the API
    stage : string
        Stage the data are in, e.g. "RELEASED", "IN_PROGRESS". Default is "IN_PROGRESS".

    Returns
    -------
    data : dictionary
        All data that was specified in the query
        
    """
    
    headers = {"accept": "*/*",
        "Authorization": "Bearer " + token
        }

    url = "https://core.kg.ebrains.eu/v3-beta/queries/?vocab=https://schema.hbp.eu/myQuery/&stage={}"
    response = requests.post(url.format(stage), json=query, headers=headers)
    
    if response.status_code == 200:
        print(response, "OK! Query was successful!" )
        data = response.json()
    elif response.status_code == 401:
        print(response, "Token not valid, authorisation not successful")
        return
    else:
        print(response)
        return
    
    return data

### Authentication

To interact with the API, you need an access token. Copy your token from the Knowledge Graph Editor or the Query Builder (if you do not have access, request it via support@ebrains.eu).


In [None]:
token = getpass(prompt='Please paste your token: ')

With the token, we will now execute the query and extract information from the KGE. The default release stage of the data is set to "IN_PROGRESS". You can change this to "RELEASED" if you want to.

**Note:** If you get a 401 error, your token is not valid and may have expired. Refresh the browser from which you copied the token, run the authentication cell again and paste the new token in the input cell.
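
The following sketch shows one way to assemble the request URL and headers while catching an empty token or a mistyped stage before any request is sent. `build_query_request` is a hypothetical helper; the endpoint and vocabulary URLs are the ones used in this notebook, and restricting `stage` to the two values mentioned above is an assumption.

```python
from urllib.parse import urlencode

QUERY_ENDPOINT = "https://core.kg.ebrains.eu/v3-beta/queries/"
RESPONSE_VOCAB = "https://schema.hbp.eu/myQuery/"


def build_query_request(token, stage="IN_PROGRESS"):
    """Return (url, headers) for posting the query (hypothetical helper)."""
    # Assumption: only the two stages named in this notebook are accepted
    if stage not in ("IN_PROGRESS", "RELEASED"):
        raise ValueError(f"Unknown stage: {stage!r}")
    token = token.strip()  # pasted tokens may carry stray whitespace
    if not token:
        raise ValueError("The token is empty; paste a fresh token first")
    # safe=":/" leaves the vocab URL unescaped, matching the URL used in this notebook
    params = urlencode({"vocab": RESPONSE_VOCAB, "stage": stage}, safe=":/")
    headers = {"accept": "*/*", "Authorization": "Bearer " + token}
    return QUERY_ENDPOINT + "?" + params, headers


url, headers = build_query_request("  example-token \n", stage="RELEASED")
print(url)
```

The returned pair could then be passed to `requests.post(url, json=query, headers=headers)`.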

In [None]:
data  = getInfo(token)

In [None]:
# Function to extract tissue sample collection information
def extractInfo(tsc_list):
    """
    Parameters
    ----------
    tsc_list : list 
        Nested list of tsc information

    Returns
    -------
    data : pandas DataFrame
        Overview table with extracted information
        
    """

    data = pd.DataFrame([])
    for tsc in tsc_list:
        if "descendedFromFile" not in tsc['studiedState'][0].keys():
            fileBundle_name = ""
            fileBundle_uuid = ""
        else:
            if tsc["studiedState"][0]["descendedFromFile"] == []:
                fileBundle_name = ""
                fileBundle_uuid = ""
            else:
                fileBundle_name = tsc["studiedState"][0]["descendedFromFile"][0]["name"]
                fileBundle_uuid = tsc["studiedState"][0]["descendedFromFile"][0]["id"].split("/")[-1]


        # pd.concat replaces DataFrame.append, which was removed in pandas 2.0
        data = pd.concat([data, pd.DataFrame({"subject_state_name" : tsc["studiedState"][0]["descendedFromSubject"][0]["lookupLabel"],
                        "subject_state_uuid" : tsc["studiedState"][0]["descendedFromSubject"][0]["id"].split("/")[-1],
                        "tsc_state_name" : tsc["studiedState"][0]["lookupLabel"],
                        "tsc_state_uuid" : tsc["studiedState"][0]["id"].split("/")[-1],
                        "fileBundle_name" : fileBundle_name,  
                        "fileBundle_uuid" : fileBundle_uuid},                
                    index=[0])], 
                ignore_index=True)

    return data

In [None]:
# From the query we only use the specimen to extract information from 
tsc_list = data["data"][0]["studiedSpecimen"]
extractedData = extractInfo(tsc_list)

# Print the extracted data so that you know if the query was successful and all the information is available.
print(extractedData)

### Create the instances for protocol executions

Using the information extracted from the KGE, the newly created protocol instances and the information you defined in the Excel file, we can now create the protocol execution instances.

The function below will try to find the correct input and output of each protocol execution based on the information that is available. If an input or output cannot be found, it will state this when you run the cell. If you think this is a mistake, please check whether the names in the extracted data table above match the names you entered in the Excel file.
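
When a name cannot be matched, a stray space or a typo is a common cause. The sketch below is one possible way to look up a UUID by name and to suggest similar names when no exact match exists. `find_uuid` is a hypothetical helper that is not used by the function below, and the names and UUIDs are made up.

```python
import difflib


def find_uuid(name, names, uuids):
    """Return (uuid, suggestions) for a name entered in the Excel file (hypothetical helper)."""
    wanted = str(name).strip()
    cleaned = [str(n).strip() for n in names]
    if wanted in cleaned:
        return uuids[cleaned.index(wanted)], []
    # No exact match: propose similar names to help spot typos
    return None, difflib.get_close_matches(wanted, cleaned, n=3, cutoff=0.6)


names = ["sub-01-state", "sub-02-state"]
uuids = ["1111", "2222"]
print(find_uuid("sub-02-state ", names, uuids))  # trailing space is ignored
print(find_uuid("sub-2-state", names, uuids))    # no exact match, suggestions returned
```

The `cutoff` of 0.6 is the default of `difflib.get_close_matches`; a higher value gives fewer but closer suggestions.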

In [None]:
# Making protocol executions based on newly made protocol instances
def makeProtocolExecutions(createdProtocols, extractedData, PEs, dsv, output_path):

    df = pd.DataFrame([])  
    protocolEx_dict = {} 
    protocolNames = createdProtocols.name.tolist()
    for p in range(len(PEs)):

        mycol = helper.create_collection()

        # If more than one protocol was used, ensure that it is formatted correctly
        protocolAtid = []
        protocolIDs = []
        if PEs.protocolUsed[p].find(',') != -1:
            for t in PEs.protocolUsed[p].split(","):
                t = t.strip()  # accept "a,b" as well as "a, b"
                if t in protocolNames:
                    protocol_atid = createdProtocols.protocolAtid[protocolNames.index(t)]
                    protocolAtid.append({"@id": "https://kg.ebrains.eu/api/instances/" + str(protocol_atid)})
                    protocolIDs.append(protocol_atid)
        else:
            if PEs.protocolUsed[p] in protocolNames:
                protocol_atid = createdProtocols.protocolAtid[protocolNames.index(PEs.protocolUsed[p])]
                protocolAtid = {"@id": "https://kg.ebrains.eu/api/instances/" + str(protocol_atid)}
                # Keep a list so that ", ".join(protocolIDs) below does not split the UUID into characters
                protocolIDs = [protocol_atid]
            
        
        # Find the corresponding input and output. Each type maps to the
        # (name, uuid) columns of the extracted data; types without a mapping
        # (e.g. "file") cannot be resolved automatically and are set to None.
        columns = {"subject state": ("subject_state_name", "subject_state_uuid"),
                   "tsc state": ("tsc_state_name", "tsc_state_uuid"),
                   "fileBundle": ("fileBundle_name", "fileBundle_uuid")}
        links = {}
        for role in ["input", "output"]:
            links[role] = None
            name_col, uuid_col = columns.get(PEs[role + "Type"][p], (None, None))
            if name_col is not None:
                names = extractedData[name_col].to_list()
                if PEs[role][p] in names:
                    links[role] = {"@id": "https://kg.ebrains.eu/api/instances/" +
                                   extractedData[uuid_col].to_list()[names.index(PEs[role][p])]}
            if links[role] is None:
                print(f"No {role} found for protocol execution {PEs.protocolExecutionName[p]}")
        input, output = links["input"], links["output"]
        
        
        # Create the protocol execution instances
        protocolEx_dict[PEs.protocolExecutionName[p]] = mycol.add_core_protocolExecution(input = input,
                                                                                            output = output,
                                                                                            protocol = protocolAtid,
                                                                                            isPartOf = {"@id": "https://kg.ebrains.eu/api/instances/" + dsv})
        
        mycol.get(protocolEx_dict[PEs.protocolExecutionName[p]]).preparationDesign = {"@id": "https://openminds.ebrains.eu/instances/preparationType/" + PEs.preparationType[p]}
        mycol.get(protocolEx_dict[PEs.protocolExecutionName[p]]).lookupLabel = PEs.protocolExecutionName[p]
        mycol.get(protocolEx_dict[PEs.protocolExecutionName[p]]).description = PEs.protocolExecutionDescription[p]
        
        # Save the instances
        mycol.save(os.path.join(output_path, "")) 

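        # Create an overview file (pd.concat replaces DataFrame.append, which was removed in pandas 2.0).
        # input and output are wrapped in lists so that each dictionary is stored as a single cell value.
        df = pd.concat([df, pd.DataFrame({"PE_name" : PEs.protocolExecutionName[p],
                "PE_uuid" : protocolEx_dict[PEs.protocolExecutionName[p]].split("/")[-1],
                "input" : [input],
                "output" : [output],
                "protocol" : PEs.protocolUsed[p],  
                "protocol_uuid" : ", ".join(protocolIDs)},                
            index=[0])], 
        ignore_index=True)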

    print("Saving created instances...")
    
    # Store information in an overview file
    filename = os.path.join(output_path, 'createdProtocolExecutions.csv')
    df.to_csv(filename, index = False, header=True)  

    print("Done")
    return df

In [None]:
# Execute the function to create protocol executions
df = makeProtocolExecutions(createdProtocols, extractedData, PEs, dsv, output_path)

## Upload instances to the KGE

When you have created all the instances you want to create, you can upload them to the Knowledge Graph Editor.

Ensure that the token is still valid. If your token has expired, you will receive a message asking you to update it. Go back to the authentication cell and run ONLY that cell again to refresh your token.
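
After an upload it can be useful to see at a glance which instances need another attempt. The sketch below groups UUIDs by outcome, using the three status codes that the upload function in this notebook distinguishes (200, 409 and 401). `summarise_upload` is a hypothetical helper and the status codes in the example are made up.

```python
def summarise_upload(status_codes):
    """Group instance UUIDs by upload outcome (hypothetical helper).

    status_codes maps each UUID to the HTTP status code of its POST request.
    """
    summary = {"uploaded": [], "already_exists": [], "unauthorised": [], "other": []}
    labels = {200: "uploaded", 409: "already_exists", 401: "unauthorised"}
    for uuid, code in status_codes.items():
        summary[labels.get(code, "other")].append(uuid)
    return summary


# Made-up status codes for three instances
summary = summarise_upload({"1111": 200, "2222": 409, "3333": 401})
if summary["unauthorised"]:
    print("Refresh the token and upload again:", summary["unauthorised"])
```

Given the dictionary of `requests` responses returned by the upload function, the input could be built with `{uuid: r.status_code for uuid, r in response.items()}`.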

In [None]:
# Function to upload the instances to the KGE
def upload(instances_fnames, token, space_name):
    """
    
    Parameters
    ----------
    instances_fnames : List 
        list of file paths to instances that need to be uploaded
    token : string
        Authorisation token to get access to the KGE
    space_name : string
        Space that the instances needs to be uploaded to, e.g. "dataset", "common", etc.

    Returns
    -------
    response : dictionary
        For each UUID, the response that indicates whether the upload 
        was successful

    """
    
    hed = {"accept": "*/*",
           "Authorization": "Bearer " + token,
           "Content-Type": "application/json"
           }
    
    # Prefix to upload to the right space
    url = "https://core.kg.ebrains.eu/v3-beta/instances/{}?space=" + space_name
    kg_prefix = "https://kg.ebrains.eu/api/instances/"
    
    new_instances = []
    for fname in instances_fnames:
        with open(fname, 'r') as f:  # the with-block closes the file automatically
            new_instances.append(json.load(f))
    
    # Correct the capitalisation in the openMINDS package
    for instance in new_instances:
        atid = kg_prefix + instance["@id"].split("/")[-1] #only take the UUID 
        instance["@id"] = atid
        if instance["@type"].endswith("Protocolexecution"):
            splittype = instance["@type"].split("/")[:-1]
            splittype.append("ProtocolExecution")
            instance["@type"] = "/".join(splittype)

    # Upload to the KGE
    print("\nUploading instances now:\n")
    
    count = 0
    response = {}    
    for instance in new_instances:
        count += 1
        print("Posting instance " + str(count) + "/" + str(len(new_instances)))
        atid = instance["@id"].split("/")[-1]  
        response[atid] = requests.post(url.format(atid), json=instance, headers=hed)
        if response[atid].status_code == 200:
            print(response[atid], "OK!" )
        elif response[atid].status_code == 409:
            print(response[atid], "Instance already exists")
        elif response[atid].status_code == 401:
            print(response[atid], "Token not valid, authorisation not successful")
        else:
            print(response[atid])
        
    return response  

In [None]:
# Upload instances to the KGE
answer = input("Would you like to upload the instances you created to the KGE? yes (y) or no (n) " ) 

if answer == "y":
    # Build the pattern with os.path.join so that it works on every operating system
    instances_fnames = glob.glob(os.path.join(output_path, "*", "*"))
    
    if token != "":
        if instances_fnames == []:
            print("No files found")
        else: 
            response = upload(instances_fnames, token, space_name = "dataset")  
        
elif answer == "n":
    print("\nDone!")