## Extracting important metadata of tissue sample collections

With this notebook you can do the following:
1. Query tissue sample collection metadata via the API for a particular dataset version
2. Save the information to a CSV file so that it can be used to create service links

To run the notebook, you need the following:
- Python version >= 3.6
- read and write permission to the KG via the API

In [16]:
# import relevant packages
from getpass import getpass
import requests
import pandas as pd

### Authentication

To interact with the API, you need an access token. To request a token, follow this link: https://nexus-iam.humanbrainproject.org/v0/oauth2/authorize or copy your token from the Knowledge Graph Editor (if you have access).

In [17]:
token = getpass(prompt='Please paste your token: ')

### Identify the dataset version

Fill in the UUID of the dataset version you want to extract the tissue sample collection information from.
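Because the query in this notebook filters on the UUID with a `CONTAINS` operator, a partial or mistyped value could match unintended entries. The following is a minimal sketch of a check you could run on the input before querying. It uses only the standard-library `uuid` module; the helper name `is_valid_uuid` is hypothetical and not part of the original notebook.

```python
import uuid


def is_valid_uuid(value):
    """Return True if value is a UUID in canonical hyphenated form.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    try:
        candidate = value.strip().lower()
        # uuid.UUID raises ValueError for malformed input; comparing the
        # round-tripped string rejects forms without hyphens or with braces.
        return str(uuid.UUID(candidate)) == candidate
    except (AttributeError, TypeError, ValueError):
        return False


print(is_valid_uuid("9677359c-73fa-4425-b8fa-3de794e9017a"))
print(is_valid_uuid("9677359c"))
```

If the check fails, it is probably safer to ask for the UUID again than to send the query.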

In [18]:
# First identify the dataset version
dsv = input("What is the UUID of the dataset version? ")
print("The UUID of the dataset is: " + dsv)

The UUID of the dataset is: 9677359c-73fa-4425-b8fa-3de794e9017a


### Query the Knowledge Graph and extract information

The following information will be extracted based on the dataset version UUID that you provided.
- Name, UUID, and internal identifier of each tissue sample collection (TSC)
- UUID of the studied state of each TSC
- Name and UUID of the linked file bundle
- DOI of the dataset version
- URL of the repository

In [19]:
# This query extracts the DOI and repository of the dataset version, its tissue sample collections, and the linked file bundles.
query = {
  "@context": {
    "@vocab": "https://core.kg.ebrains.eu/vocab/query/",
    "query": "https://schema.hbp.eu/myQuery/",
    "propertyName": {
      "@id": "propertyName",
      "@type": "@id"
    },
    "merge": {
      "@type": "@id",
      "@id": "merge"
    },
    "path": {
      "@id": "path",
      "@type": "@id"
    }
  },
  "meta": {
    "name": "get-info",
    "responseVocab": "https://schema.hbp.eu/myQuery/",
    "type": "https://openminds.ebrains.eu/core/DatasetVersion"
  },
  "structure": [
    {
      "propertyName": "query:id",
      "path": "@id",
      "required": True,
      "filter": {
        "op": "CONTAINS",
        "value": dsv
      }
    },
    {
      "propertyName": "query:digitalIdentifier",
      "path": "https://openminds.ebrains.eu/vocab/digitalIdentifier",
      "required": True,
      "structure": {
        "propertyName": "query:identifier",
        "path": "https://openminds.ebrains.eu/vocab/identifier",
        "required": True
      }
    },
    {
      "propertyName": "query:repository",
      "path": "https://openminds.ebrains.eu/vocab/repository",
      "structure": {
        "propertyName": "query:IRI",
        "path": "https://openminds.ebrains.eu/vocab/IRI"
      }
    },
    {
      "propertyName": "query:studiedSpecimen",
      "path": "https://openminds.ebrains.eu/vocab/studiedSpecimen",
      "required": True,
      "structure": [
        {
          "propertyName": "query:id",
          "path": "@id",
          "required": True
        },
        {
          "propertyName": "query:lookupLabel",
          "path": "https://openminds.ebrains.eu/vocab/lookupLabel",
          "required": True
        },
        {
          "propertyName": "query:type",
          "path": "@type",
          "required": True,
          "filter": {
            "op": "CONTAINS",
            "value": "TissueSampleCollection"
          }
        },
        {
          "propertyName": "query:internalIdentifier",
          "path": "https://openminds.ebrains.eu/vocab/internalIdentifier"
        },
        {
          "propertyName": "query:studiedState",
          "path": "https://openminds.ebrains.eu/vocab/studiedState",
          "required": True,
          "structure": [
            {
              "propertyName": "query:id",
              "path": "@id",
              "required": True
            },
            {
              "propertyName": "query:descendedFrom",
              "path": {
                "@id": "https://openminds.ebrains.eu/vocab/descendedFrom",
                "reverse": True
              },
              "required": True,
              "structure": [
                {
                  "propertyName": "query:id",
                  "path": "@id",
                  "required": True
                },
                {
                  "propertyName": "query:name",
                  "path": "https://openminds.ebrains.eu/vocab/name",
                  "required": True
                }
              ]
            }
          ]
        }
      ]
    }
  ]
}


In [20]:
# Function to get the info based on what is defined in the query
def getInfo(token, stage="IN_PROGRESS"):
    """
    Send the query defined above to the Knowledge Graph API.

    Parameters
    ----------
    token : string
        Authentication token to access data and metadata in the KG via the API
    stage : string
        Stage the data are in, e.g. "RELEASED", "IN_PROGRESS". Default is "IN_PROGRESS".

    Returns
    -------
    data : dictionary
        All data that was specified in the query

    """

    headers = {
        "accept": "*/*",
        "Authorization": "Bearer " + token
    }

    url = "https://core.kg.ebrains.eu/v3-beta/queries/?vocab=https://schema.hbp.eu/myQuery/&stage={}"
    response = requests.post(url.format(stage), json=query, headers=headers)

    if response.status_code == 200:
        print(response, "OK!")
    elif response.status_code == 401:
        print(response, "Token not valid, authorisation not successful")
    else:
        print(response)

    # Raise an exception for any error status instead of returning an error payload
    response.raise_for_status()

    # Parse the body only after the request is known to have succeeded
    return response.json()

To execute the query, define the release stage. If the data is under embargo, the stage is "IN_PROGRESS"; if the data has already been released, the stage is "RELEASED". The default is "IN_PROGRESS" because it finds both released metadata and metadata that is still being curated.

**Note:** the data should already be released, since file bundles are not created when datasets are under embargo.
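As a sketch of how the stage ends up in the request, the snippet below builds the query URL and rejects unexpected stage values before anything is sent. The URL template is the one used in this notebook. The helper `build_query_url`, the constant names, and the assumption that only the two stages named above are needed here are illustrative and not part of the original notebook.

```python
# Stages named in this notebook; treating them as the only valid values is an assumption.
VALID_STAGES = ("IN_PROGRESS", "RELEASED")

# URL template copied from the getInfo function in this notebook.
QUERY_URL = ("https://core.kg.ebrains.eu/v3-beta/queries/"
             "?vocab=https://schema.hbp.eu/myQuery/&stage={}")


def build_query_url(stage="IN_PROGRESS"):
    """Return the query URL for a stage (hypothetical helper)."""
    # Accept spellings such as "in progress" or "released"
    normalised = stage.strip().upper().replace(" ", "_")
    if normalised not in VALID_STAGES:
        raise ValueError("Unknown stage: " + repr(stage))
    return QUERY_URL.format(normalised)


print(build_query_url("RELEASED"))
```

Validating the stage locally gives a clearer error message than waiting for the API to reject the request.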

In [21]:
# Define the stage
stage = "IN_PROGRESS"

# Execute the getInfo function
result = getInfo(token, stage=stage)

# Save relevant metadata
tsc_list = result["data"][0]["studiedSpecimen"]
DOI =  result["data"][0]["digitalIdentifier"][0]["identifier"]
Repo = result["data"][0]["repository"][0]["IRI"]

print('\nNumber of tissue sample collections in this dataset: ' + str(len(tsc_list)))
print('\nDOI of this dataset: ' + DOI)
print('\nRepository of this dataset: ' + Repo)

<Response [200]> OK!

Number of tissue sample collections in this dataset: 17

DOI of this dataset: https://doi.org/10.25493/RYZ4-DB9

Repository of this dataset: https://object.cscs.ch/v1/AUTH_4791e0a3b3de43e2840fe46d9dc2b334/ext-d000072-Nr2f1-Thy1YFP-mice_pub


### Organise and save the metadata

The metadata extracted by the query is now organised into an easier-to-read table and saved as a CSV file. The name of the CSV file is "tsc_UUID-of-datasetVersion.csv".
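To make the reorganisation concrete, here is a minimal sketch of how one nested tissue sample collection record maps to one flat table row. The record `sample_tsc` is a hypothetical placeholder that only mimics the shape requested by the query (all names, IRIs, and the DOI are made up), and the helpers `last_segment` and `flatten_tsc` are illustrative names, not part of the original notebook.

```python
# Hypothetical record shaped like one entry of the "studiedSpecimen" list;
# every value is a placeholder, not real Knowledge Graph content.
sample_tsc = {
    "id": "https://example.org/instances/00000000-0000-0000-0000-000000000001",
    "lookupLabel": "tsc-example-01",
    "internalIdentifier": "example-01",
    "studiedState": [
        {
            "id": "https://example.org/instances/00000000-0000-0000-0000-000000000002",
            "descendedFrom": [
                {
                    "id": "https://example.org/instances/00000000-0000-0000-0000-000000000003",
                    "name": "example-file-bundle",
                }
            ],
        }
    ],
}


def last_segment(iri):
    """Return the part after the final slash, i.e. the UUID of an instance IRI."""
    return iri.split("/")[-1]


def flatten_tsc(tsc, doi, repo):
    """Turn one nested record into a flat dictionary (one table row)."""
    state = tsc["studiedState"][0]      # first studied state only
    bundle = state["descendedFrom"][0]  # first linked file bundle only
    return {
        "tsc_name": tsc["lookupLabel"],
        # The internal identifier is not marked as required in the query
        "tsc_internalID": tsc.get("internalIdentifier"),
        "tsc_uuid": last_segment(tsc["id"]),
        "tsc_state_uuid": last_segment(state["id"]),
        "fileBundle_name": bundle["name"],
        "fileBundle_uuid": last_segment(bundle["id"]),
        "DOI_dataset": doi,
        "repository": repo,
    }


row = flatten_tsc(sample_tsc, "https://doi.org/10.0000/EXAMPLE", "https://example.org/repo")
print(row["tsc_uuid"])
```

A list of such dictionaries can be passed directly to `pd.DataFrame` to obtain the overview table.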

In [22]:
# Function to extract tissue sample collection information
def extractInfo(tsc_list, DOI, Repo):
    """
    Parameters
    ----------
    tsc_list : list
        Nested list of tsc information
    DOI : string
        DOI of the dataset version
    Repo : string
        URL of the repository

    Returns
    -------
    data : pandas DataFrame
        Overview table with extracted information

    """

    # Collect one row per tissue sample collection and build the table once;
    # DataFrame.append was removed in pandas 2.0
    rows = []
    for tsc in tsc_list:
        state = tsc["studiedState"][0]
        bundle = state["descendedFrom"][0]
        rows.append({"tsc_name" : tsc["lookupLabel"],
                     # not marked as required in the query, so tolerate a missing key
                     "tsc_internalID" : tsc.get("internalIdentifier"),
                     "tsc_uuid" : tsc["id"].split("/")[-1],
                     "tsc_state_uuid" : state["id"].split("/")[-1],
                     "fileBundle_name" : bundle["name"],
                     "fileBundle_uuid" : bundle["id"].split("/")[-1],
                     "DOI_dataset" : DOI,
                     "repository" : Repo})

    data = pd.DataFrame(rows)

    return data

In [23]:
# Execute the extractInfo function
tsc_data = extractInfo(tsc_list, DOI, Repo)

# Save the table in the current folder using the name "tsc_UUID-of-datasetVersion.csv"
# (a plain relative file name works on Windows, macOS, and Linux)
tsc_data.to_csv('tsc_' + dsv + '.csv', index=False, header=True)

print("Done! File is saved")

Done! File is saved
