## Linking specimen to file bundles

With this notebook you can do the following:
1. Query file bundle and tissue sample collection metadata via the API for a particular dataset version
2. Link a specimen state to its corresponding file bundle
3. Save the information to a CSV file so that it can be used for image ingestion

**Note:** \
This script only works if:
- Tissue sample collections have been added to a dataset version
- Data has been uploaded to a container or bucket
- A regex pattern (file structure pattern) has been defined and linked to the repository so that file bundles can be created
- File bundles have been created
- The internal identifier of the tissue sample collection matches the file bundle name

**If the script fails, please check the points above.**

To run the script, you need the following:
- Python version >= 3.6
- Read and write permission to the KG via the API

In [None]:
# import relevant packages
from getpass import getpass
import requests
import pandas as pd
import os

### Authentication

To interact with the API, you need an access token. Copy your token from the Knowledge Graph Editor or Query Builder (if you do not have access, request it via support@ebrains.eu).

In [None]:
token = getpass(prompt='Please paste your token: ')

### Identify the dataset version

Enter the UUID of the dataset version from which you want to extract tissue sample collection information. The UUID is also used as the name of the output folder, which is created in the current working directory if it does not exist yet.

In [None]:
# First identify the dataset version

dsv = input("What is the UUID of the dataset version? ")
print(f"The UUID of the dataset is: {dsv}")
cwd = os.getcwd()
output_path = os.path.join(cwd, dsv)
print(f"The output folder is: {output_path}")

if os.path.isdir(output_path):
    print("\nOutput folder already exists")
else:
    print("\nOutput folder does not exist, making folder")        
    os.mkdir(output_path) 
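
The UUID entered above is later used in a `CONTAINS` filter, so an empty or mistyped value may match something other than the intended dataset version. A minimal sketch of checking the input first, using the standard-library `uuid` module; the helper name `is_valid_uuid` and the example UUID are hypothetical and not part of the original notebook:

```python
import uuid


def is_valid_uuid(value):
    """Return True if `value` parses as a UUID (hypothetical helper)."""
    try:
        uuid.UUID(value.strip())
    except (ValueError, AttributeError):
        # ValueError: malformed string; AttributeError: value is not a string
        return False
    return True


example = "123e4567-e89b-12d3-a456-426614174000"  # made-up UUID for illustration
print(is_valid_uuid(example))
print(is_valid_uuid(""))
```

In the notebook, the check could be applied to `dsv` right after the `input` call, before the output folder is created.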


### Query the Knowledge Graph and extract information

The following information will be extracted based on the dataset version UUID that you provided.
- repository information
- structure pattern (regex) used to create the file bundles for this dataset version
- file bundle name, UUID, and grouping type
- tissue sample collections (internal identifier and studied state)

In [None]:
# This query extracts the repository, its file bundles and structure pattern, and the tissue sample collections with their studied states.
query = {
  "@context": {
    "@vocab": "https://core.kg.ebrains.eu/vocab/query/",
    "query": "https://schema.hbp.eu/myQuery/",
    "propertyName": {
      "@id": "propertyName",
      "@type": "@id"
    },
    "path": {
      "@id": "path",
      "@type": "@id"
    }
  },
  "meta": {
    "type": "https://openminds.ebrains.eu/core/DatasetVersion",
    "responseVocab": "https://schema.hbp.eu/myQuery/"
  },
  "structure": [
    {
      "propertyName": "query:id",
      "path": "@id",
      "required": True,
      "filter": {
        "op": "CONTAINS",
        "value": dsv
      }
    },
    {
      "propertyName": "query:repository",
      "path": "https://openminds.ebrains.eu/vocab/repository",
      "structure": [
        {
          "propertyName": "query:name",
          "path": "https://openminds.ebrains.eu/vocab/name",
          "required": True
        },
        {
          "propertyName": "query:isPartOf",
          "path": {
            "@id": "https://openminds.ebrains.eu/vocab/isPartOf",
            "reverse": True
          },
          "structure": [
            {
              "propertyName": "query:id",
              "path": "@id",
              "required": True
            },
            {
              "propertyName": "query:name",
              "path": "https://openminds.ebrains.eu/vocab/name",
              "required": True
            },
            {
              "propertyName": "query:groupingType",
              "path": "https://openminds.ebrains.eu/vocab/groupingType",
              "structure": {
                "propertyName": "query:name",
                "path": "https://openminds.ebrains.eu/vocab/name",
                "required": True
              }
            }
          ]
        },
        {
          "propertyName": "query:IRI",
          "path": "https://openminds.ebrains.eu/vocab/IRI",
          "required": True
        },
        {
          "propertyName": "query:structurePattern",
          "path": "https://openminds.ebrains.eu/vocab/structurePattern",
          "structure": [
            {
              "propertyName": "query:lookupLabel",
              "path": "https://openminds.ebrains.eu/vocab/lookupLabel"
            },
            {
              "propertyName": "query:id",
              "path": "@id"
            },
            {
              "propertyName": "query:filePathPattern",
              "path": "https://openminds.ebrains.eu/vocab/filePathPattern",
              "structure": {
                "propertyName": "query:regex",
                "path": "https://openminds.ebrains.eu/vocab/regex",
                "required": True
              }
            }
          ]
        }
      ]
    },
    {
      "propertyName": "query:studiedSpecimen",
      "path": "https://openminds.ebrains.eu/vocab/studiedSpecimen",
      "structure": [
        {
          "propertyName": "query:id",
          "path": "@id",
          "required": True
        },
        {
          "propertyName": "query:internalIdentifier",
          "path": "https://openminds.ebrains.eu/vocab/internalIdentifier",
          "required": True
        },
        {
          "propertyName": "query:lookupLabel",
          "path": "https://openminds.ebrains.eu/vocab/lookupLabel"
        },
        {
          "propertyName": "query:type",
          "path": "@type",
          "required": True,
          "filter": {
            "op": "CONTAINS",
            "value": "TissueSampleCollection"
          }
        },
        {
          "propertyName": "query:studiedState",
          "path": "https://openminds.ebrains.eu/vocab/studiedState",
          "required": True,
          "structure": {
            "propertyName": "query:id",
            "path": "@id",
            "required": True
          }
        }
      ]
    }
  ]
}

# Function to get the info based on what is defined in the query
def getInfo(token, stage="IN_PROGRESS"):
    """
    Parameters
    ----------
    token : string 
        Authentication token to access data and metadata in the KGE via the API
    stage : string
        Stage the data are in, e.g. "RELEASED", "IN_PROGRESS". Default is "IN_PROGRESS".

    Returns
    -------
    data : dictionary
        All data that was specified in the query
        
    """
    
    headers = {
                "Authorization": "Bearer " + token
              }

    url = "https://core.kg.ebrains.eu/v3-beta/queries/?vocab=https://schema.hbp.eu/myQuery/&stage={}"
    response = requests.post(url.format(stage), json=query, headers=headers)

    if response.status_code == 200:
        print(response, "OK!" )
        data = response.json()
    elif response.status_code == 401:
        print(response, "Token not valid, authorisation not successful")
        return
    else:
        print(response)
        return
    
    return data

To execute the query, define the stage of release. If the data is under embargo, the stage is "IN_PROGRESS"; if the data has already been released, the stage is "RELEASED". The default is "IN_PROGRESS" because it finds both released metadata and metadata that is still being curated.
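
A small sketch of guarding against a mistyped stage before sending the request. It assumes that "IN_PROGRESS" and "RELEASED" (the two values named above) are the only ones you intend to use; `build_query_url` and `ALLOWED_STAGES` are hypothetical names, and the base URL is the one used in `getInfo`:

```python
from urllib.parse import urlencode

BASE_URL = "https://core.kg.ebrains.eu/v3-beta/queries/"
ALLOWED_STAGES = {"IN_PROGRESS", "RELEASED"}  # assumption: only these two are used


def build_query_url(stage="IN_PROGRESS"):
    """Return the query URL for a stage, rejecting unexpected values."""
    stage = stage.strip().upper()
    if stage not in ALLOWED_STAGES:
        raise ValueError(f"Unknown stage: {stage!r}")
    params = {"vocab": "https://schema.hbp.eu/myQuery/", "stage": stage}
    # urlencode percent-encodes the vocab URL, which is equivalent for the server
    return BASE_URL + "?" + urlencode(params)


print(build_query_url("released"))
```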


In [None]:
# Define the stage
stage = "IN_PROGRESS"

# Execute the getInfo function
result = getInfo(token, stage=stage)

if result:
    # extract file bundle info
    fb_list = result["data"][0]["repository"][0]["isPartOf"]
    tsc_list = result["data"][0]["studiedSpecimen"]

    # Regex patterns used for this dataset, joined into a single string
    regex = result["data"][0]["repository"][0]["structurePattern"][0]["filePathPattern"]
    regex_list = " \n".join(pattern["regex"] for pattern in regex)


    # Repository used in this dataset
    repoName = result["data"][0]["repository"][0]["name"]
    repoIRI = result["data"][0]["repository"][0]["IRI"]

    print(f"\nNumber of tissue sample collections in this dataset: {len(tsc_list)}")
    print(f"\nNumber of file bundles in this dataset: {len(fb_list)}")
else:
    print("\nRefresh token and run the cells again!")            
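
The indexing in the cell above assumes a particular response shape and raises an error if, for example, no dataset version matches the UUID. The sketch below uses a hand-written mock response (every value is invented, and only the keys read by the cell above are included) to show that shape and a more defensive way to read it. The helper `first_or_none` is hypothetical.

```python
# Hand-written mock of the response shape assumed above; every value is invented.
mock_result = {
    "data": [{
        "repository": [{
            "name": "example-bucket",
            "IRI": "https://example.org/example-bucket",
            "isPartOf": [{
                "id": "https://example.org/instances/fb-uuid-1",
                "name": "sample01",
                "groupingType": [{"name": "example grouping"}],
            }],
            "structurePattern": [{
                "filePathPattern": [{"regex": "^sample[0-9]+/"}],
            }],
        }],
        "studiedSpecimen": [{
            "id": "https://example.org/instances/tsc-uuid-1",
            "internalIdentifier": "sample01",
            "studiedState": [{"id": "https://example.org/instances/state-uuid-1"}],
        }],
    }]
}


def first_or_none(items):
    """Return the first element of a list, or None if it is empty or missing."""
    return items[0] if items else None


dataset = first_or_none(mock_result.get("data"))
repository = first_or_none(dataset.get("repository")) if dataset else None

if repository is None:
    print("No dataset version or repository found for this UUID")
else:
    bundles = repository.get("isPartOf", [])
    specimens = dataset.get("studiedSpecimen", [])
    print(f"{repository['name']}: {len(bundles)} file bundle(s), "
          f"{len(specimens)} tissue sample collection(s)")
```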

### Organise and save the metadata

The metadata extracted by the query will now be organised into an easier to read format so that it can eventually be saved as a CSV file. 

In [None]:
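# Function to extract file bundle information
def extractInfo(fb_list, regex_list, repoName, repoIRI):
    """
    Parameters
    ----------
    fb_list : list 
        Nested list of file bundle information
    regex_list : string
        Regex pattern(s) used to create the file bundles
    repoName : string
        Name of the repository
    repoIRI : string
        IRI of the repository

    Returns
    -------
    data : pandas DataFrame
        Overview table with extracted information
        
    """

    columns = ["name", "fileBundle_uuid", "groupedBy",
               "repositoryName", "fromRepository", "regexPatternUsed"]

    # Collect one row per file bundle and build the table once
    # (DataFrame.append was removed in pandas 2.0)
    rows = []
    for fb in fb_list:
        rows.append({"name": fb["name"],
                     "fileBundle_uuid": fb["id"].split("/")[-1],
                     "groupedBy": fb["groupingType"][0]["name"],
                     "repositoryName": repoName,
                     "fromRepository": repoIRI,
                     "regexPatternUsed": regex_list})

    return pd.DataFrame(rows, columns=columns)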

# Extract relevant tissue sample metadata and add to the file bundle information
def addInfo2fb(tsc_list, fb_data):

    """
    Parameters
    ----------
    tsc_list : list 
        Nested list of tissue sample collection information
    fb_data : dataframe
        DataFrame with file bundle information. Tissue sample information will be added to this DataFrame

    Returns
    -------
    fb_data : pandas DataFrame
        Overview table with extracted information
        
    """

    if 'linkedSpecimenState' not in fb_data.columns:
        fb_data.insert(0, 'linkedSpecimenState', '')
    for tsc in tsc_list:
        state_atid = tsc["studiedState"][0]["id"].split("/")[-1]
        tsc_id = tsc["internalIdentifier"]
        if tsc_id in fb_data.name.to_list():
            idx = fb_data.index[fb_data.name == tsc_id][0]
            fb_data.loc[idx, 'linkedSpecimenState'] = state_atid

    print("Specimen information has been extracted and added to the file bundle information overview!")
    
    return fb_data

Execute the two functions above to extract the metadata.

In [None]:
# Extract metadata for file bundles first and then add more metadata about the tissue sample
fb_data = extractInfo(fb_list, regex_list, repoName, repoIRI)
fb_data = addInfo2fb(tsc_list, fb_data)
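
Linking relies on the internal identifier of each tissue sample collection matching a file bundle name exactly, so it can help to list what did not match before going further. A small sketch on made-up data; `report_unmatched` is a hypothetical helper, and the dictionaries only contain the `internalIdentifier` key used here:

```python
def report_unmatched(bundle_names, tsc_list):
    """Compare file bundle names with tissue sample collection identifiers.

    Returns two sorted lists: file bundles without a matching tissue sample
    collection, and tissue sample collections without a matching file bundle.
    """
    bundle_names = set(bundle_names)
    tsc_ids = {tsc["internalIdentifier"] for tsc in tsc_list}
    return sorted(bundle_names - tsc_ids), sorted(tsc_ids - bundle_names)


# Invented example data
demo_names = ["sample01", "sample02"]
demo_tsc = [{"internalIdentifier": "sample01"}, {"internalIdentifier": "sample03"}]

no_specimen, no_bundle = report_unmatched(demo_names, demo_tsc)
print("File bundles without a tissue sample collection:", no_specimen)
print("Tissue sample collections without a file bundle:", no_bundle)
```

In the notebook, the call would be `report_unmatched(fb_data["name"], tsc_list)`.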

### Link specimen states to file bundles

Now you are interacting with the Knowledge Graph editor via the API.

In [None]:
# Link the specimen to their corresponding file bundles via the API
def linkSpecimen2fb(token, fb_data):
    kg_prefix = "https://kg.ebrains.eu/api/instances/"

    hed = {'Authorization': 'Bearer ' + token}
    url = "https://core.kg.ebrains.eu/v3-beta/instances/{}?space=dataset"

    response = {}
    for i in range(len(fb_data)):
        fb_atid = fb_data.fileBundle_uuid[i]
        fb_name = fb_data.name[i]
        state_atid = fb_data.linkedSpecimenState[i]

        # Skip file bundles without a matching specimen state,
        # otherwise a link to an empty identifier would be written
        if not state_atid:
            print(f"No specimen state found for file bundle {fb_name}, skipping")
            continue

        instance = {"@context": {"@vocab": "https://openminds.ebrains.eu/vocab/"},
                    "descendedFrom": [{"@id" : kg_prefix + str(state_atid)}]
                    }

        print(f"Linking specimen to file bundle {fb_name} with uuid: {fb_atid}")
        response[fb_atid] = requests.patch(url.format(fb_atid), json=instance, headers=hed)
        if response[fb_atid].status_code == 200:
            print(response[fb_atid], "OK!" )
        elif response[fb_atid].status_code == 401:
            print(response[fb_atid], "Token not valid, authorisation not successful")
            # Return the table unchanged so that it can still be saved afterwards
            return fb_data
        else:
            print(response[fb_atid])
            return fb_data

    print("Specimen are now linked to the file bundles")

    return fb_data
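
Because the PATCH requests modify instances in the Knowledge Graph, it may be worth previewing the request bodies before sending anything. The sketch below builds bodies in the same shape as `linkSpecimen2fb`, skipping file bundles without a linked specimen state, and makes no network call. `build_payloads` is a hypothetical helper and the identifiers are invented:

```python
KG_PREFIX = "https://kg.ebrains.eu/api/instances/"  # same prefix as in linkSpecimen2fb


def build_payloads(rows):
    """Build one request body per file bundle.

    rows : iterable of (fileBundle_uuid, linkedSpecimenState) pairs
    """
    payloads = {}
    for fb_uuid, state_uuid in rows:
        if not state_uuid:
            continue  # no specimen state matched this file bundle
        payloads[fb_uuid] = {
            "@context": {"@vocab": "https://openminds.ebrains.eu/vocab/"},
            "descendedFrom": [{"@id": KG_PREFIX + str(state_uuid)}],
        }
    return payloads


# Invented example rows
demo_rows = [("fb-uuid-1", "state-uuid-1"), ("fb-uuid-2", "")]
for fb_uuid, body in build_payloads(demo_rows).items():
    print(fb_uuid, body)
```

With the notebook's table, the pairs could be produced with `zip(fb_data["fileBundle_uuid"], fb_data["linkedSpecimenState"])`.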



After extracting and organising all the important metadata, you can link a tissue sample collection to its corresponding file bundle. Below you are asked whether you want to link the specimen or not. Type "y" if you want to create these links.

In [None]:
answer = input("Do you want to link the specimen states to the file bundles? Yes (y) or No (n): ")
if answer.strip().lower() == "y":
    fb_data = linkSpecimen2fb(token, fb_data)
else:
    print("Specimen were not linked to the file bundles")

### Save the output file

Your output file will now be saved in the output folder created earlier and can be used for future reference. The name of the CSV file is "fb_UUID-of-datasetVersion.csv".

In [None]:
filename = os.path.join(output_path, 'fb_' + dsv + '.csv')

# save the table in the output folder using the name "fb_UUID-of-datasetVersion.csv"
fb_data.to_csv(filename, index=False, header=True)

print("Done! File is saved")
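
To confirm that the file is usable for image ingestion, it can be read back and checked for file bundles that have no linked specimen state. A sketch with the standard-library `csv` module, run on a temporary file with invented rows so that it is self-contained; in the notebook you would open `filename` instead:

```python
import csv
import os
import tempfile

# Invented rows using a subset of the columns written by the notebook
rows = [
    {"linkedSpecimenState": "state-uuid-1", "name": "sample01", "fileBundle_uuid": "fb-uuid-1"},
    {"linkedSpecimenState": "", "name": "sample02", "fileBundle_uuid": "fb-uuid-2"},
]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "fb_demo.csv")

    # Write the demo file
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0]))
        writer.writeheader()
        writer.writerows(rows)

    # Read it back
    with open(path, newline="") as f:
        loaded = list(csv.DictReader(f))

unlinked = [row["name"] for row in loaded if not row["linkedSpecimenState"]]
print(f"{len(loaded)} file bundle(s) read, without a linked specimen state: {unlinked}")
```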