## Extracting important metadata of tissue sample collections

With this notebook you can do the following:
1. Query tissue sample collection metadata via the API for one or more dataset versions
2. Save the information to a CSV file so that it can be used to create service links

To run this notebook, you need the following:
- Python version >= 3.6
- read and write permission to the KG via the API

In [None]:
# import relevant packages
from getpass import getpass
import requests
import pandas as pd
import os
from datetime import datetime

### Authentication

To interact with the API, you need an access token. Copy your token from the Knowledge Graph Editor or Query Builder and paste it into the prompt below (if you do not have access, request it via support@ebrains.eu).

In [None]:
token = getpass(prompt='Please paste your token: ')
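
For orientation, the token is later sent as a bearer token in the `Authorization` header of the request. A minimal sketch of that step follows; `build_headers` is a hypothetical helper name that is not used elsewhere in this notebook, and the token shown is a placeholder.

```python
def build_headers(token):
    """Return request headers that carry the token as a bearer credential."""
    # Strip stray whitespace that is easily picked up when pasting a token.
    return {
        "accept": "*/*",
        "Authorization": "Bearer " + token.strip(),
    }


example_headers = build_headers("  example-token  ")  # placeholder, not a real token
print(example_headers["Authorization"])
```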

### Identify the dataset version

First choose whether you want to extract tissue sample collection information from one or from multiple dataset versions. If you only want to extract information from one dataset version, answer 1 to the first question and then enter the UUID of the dataset version you are interested in.
If you want to extract information from multiple dataset versions, answer 2 to the first question and then enter keywords that occur in the title of all the dataset versions you are interested in (the query below matches them against the `shortName` property). Make the keywords as specific as possible to ensure that only the dataset versions of interest are queried.

In [None]:
# First identify the dataset version(s)

# Ask whether to query one dataset version or several.
answer = input("Do you want to get the information from 1) one dataset version, or 2) multiple dataset versions? ").strip()
cwd = os.getcwd()
if answer == "1":
    dsv = input("What is the UUID of the dataset version? ").strip()
    print("The UUID of the dataset version is: " + dsv)
    # Name the output folder after the dataset version UUID.
    folder_name = dsv
elif answer == "2":
    keywords = input("Which keywords are in all titles of the dataset versions of interest? ")
    print("Querying all dataset versions that contain '" + str(keywords) + "' in the title")
    # Name the output folder after today's date (DDMMYYYY).
    folder_name = datetime.now().strftime("%d%m%Y")
else:
    raise ValueError("Please answer 1 or 2.")

# os.path.join builds a path that works on Windows, Linux and macOS.
output_path = os.path.join(cwd, folder_name)
print("The output folder is: " + output_path)

if os.path.isdir(output_path):
    print("\nOutput folder already exists")
else:
    print("\nOutput folder does not exist, making folder")
    os.mkdir(output_path)
os.chdir(output_path)
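
The folder handling above can also be written with `pathlib`, which avoids hard-coded path separators. This is a minimal sketch, not part of the original workflow; `make_output_folder` is a hypothetical helper and the folder name is a placeholder.

```python
import tempfile
from pathlib import Path


def make_output_folder(base_dir, folder_name):
    """Create base_dir/folder_name if needed and return it as a Path."""
    output_dir = Path(base_dir) / folder_name
    # exist_ok=True makes the call safe when the folder already exists.
    output_dir.mkdir(parents=True, exist_ok=True)
    return output_dir


# Demonstrate in a temporary directory so that nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    created = make_output_folder(tmp, "example-dataset-version")
    folder_was_created = created.is_dir()
    print(created.name, folder_was_created)
```

Calling the helper twice with the same arguments does not raise an error, which replaces the explicit `os.path.isdir` check.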

### Query the Knowledge Graph and extract information

The following information will be extracted based on the dataset version UUID or keywords that you provided.
- tsc name, tsc UUID, and internal identifier
- linked file bundle name and UUID (if file bundles have been generated yet)
- DOI of the dataset version
- The URL of the repository

In [None]:
# This query extracts the DOI and repository of the dataset version, its tissue sample collections, and the linked file bundles.
if answer == '1':
  query = {
    "@context": {
      "@vocab": "https://core.kg.ebrains.eu/vocab/query/",
      "query": "https://schema.hbp.eu/myQuery/",
      "propertyName": {
        "@id": "propertyName",
        "@type": "@id"
      },
      "merge": {
        "@type": "@id",
        "@id": "merge"
      },
      "path": {
        "@id": "path",
        "@type": "@id"
      }
    },
    "meta": {
      "name": "get-dsv-specimen-fb",
      "responseVocab": "https://schema.hbp.eu/myQuery/",
      "type": "https://openminds.ebrains.eu/core/DatasetVersion"
    },
    "structure": [
      {
        "propertyName": "query:id",
        "path": "@id",
        "required": True,
        "filter": {
          "op": "CONTAINS",
          "value": dsv
        }
      },
      {
        "propertyName": "query:digitalIdentifier",
        "path": "https://openminds.ebrains.eu/vocab/digitalIdentifier",
        "required": True,
        "structure": {
          "propertyName": "query:identifier",
          "path": "https://openminds.ebrains.eu/vocab/identifier",
          "required": True
        }
      },
      {
        "propertyName": "query:repository",
        "path": "https://openminds.ebrains.eu/vocab/repository",
        "required": True,
        "structure": {
          "propertyName": "query:IRI",
          "path": "https://openminds.ebrains.eu/vocab/IRI",
          "required": True
        }
      },
      {
        "propertyName": "query:studiedSpecimen",
        "path": "https://openminds.ebrains.eu/vocab/studiedSpecimen",
        "required": True,
        "structure": [
          {
            "propertyName": "query:id",
            "path": "@id",
            "required": True
          },
          {
            "propertyName": "query:lookupLabel",
            "path": "https://openminds.ebrains.eu/vocab/lookupLabel",
            "required": True
          },
          {
            "propertyName": "query:internalIdentifier",
            "path": "https://openminds.ebrains.eu/vocab/internalIdentifier"
          },
          {
            "propertyName": "query:type",
            "path": "@type",
            "required": True,
            "filter": {
              "op": "CONTAINS",
              "value": "TissueSampleCollection"
            }
          },
          {
            "propertyName": "query:studiedState",
            "path": "https://openminds.ebrains.eu/vocab/studiedState",
            "required": True,
            "structure": [
              {
                "propertyName": "query:id",
                "path": "@id",
                "required": True
              },
              {
                "propertyName": "query:descendedFrom",
                "path": {
                  "@id": "https://openminds.ebrains.eu/vocab/descendedFrom",
                  "reverse": True
                },
                "structure": [
                  {
                    "propertyName": "query:id",
                    "path": "@id",
                    "required": True
                  },
                  {
                    "propertyName": "query:name",
                    "path": "https://openminds.ebrains.eu/vocab/name",
                    "required": True
                  }
                ]
              }
            ]
          }
        ]
      }
    ]
  }
elif answer == '2':
  query = {
  "@context": {
    "@vocab": "https://core.kg.ebrains.eu/vocab/query/",
    "query": "https://schema.hbp.eu/myQuery/",
    "propertyName": {
      "@id": "propertyName",
      "@type": "@id"
    },
    "merge": {
      "@type": "@id",
      "@id": "merge"
    },
    "path": {
      "@id": "path",
      "@type": "@id"
    }
  },
  "meta": {
    "name": "get-dsv-specimen-fb",
    "responseVocab": "https://schema.hbp.eu/myQuery/",
    "type": "https://openminds.ebrains.eu/core/DatasetVersion"
  },
  "structure": [
    {
      "propertyName": "query:id",
      "path": "@id",
      "required": True
    },
    {
      "propertyName": "query:digitalIdentifier",
      "path": "https://openminds.ebrains.eu/vocab/digitalIdentifier",
      "required": True,
      "structure": {
        "propertyName": "query:identifier",
        "path": "https://openminds.ebrains.eu/vocab/identifier",
        "required": True
      }
    },
    {
      "propertyName": "query:repository",
      "path": "https://openminds.ebrains.eu/vocab/repository",
      "required": True,
      "structure": {
        "propertyName": "query:IRI",
        "path": "https://openminds.ebrains.eu/vocab/IRI",
        "required": True
      }
    },
    {
      "propertyName": "query:shortName",
      "path": "https://openminds.ebrains.eu/vocab/shortName",
      "filter": {
        "op": "CONTAINS",
        "value": keywords
      }
    },
    {
      "propertyName": "query:studiedSpecimen",
      "path": "https://openminds.ebrains.eu/vocab/studiedSpecimen",
      "required": True,
      "structure": [
        {
          "propertyName": "query:id",
          "path": "@id",
          "required": True
        },
        {
          "propertyName": "query:lookupLabel",
          "path": "https://openminds.ebrains.eu/vocab/lookupLabel",
          "required": True
        },
        {
          "propertyName": "query:internalIdentifier",
          "path": "https://openminds.ebrains.eu/vocab/internalIdentifier"
        },
        {
          "propertyName": "query:type",
          "path": "@type",
          "required": True,
          "filter": {
            "op": "CONTAINS",
            "value": "TissueSampleCollection"
          }
        },
        {
          "propertyName": "query:studiedState",
          "path": "https://openminds.ebrains.eu/vocab/studiedState",
          "required": True,
          "structure": [
            {
              "propertyName": "query:id",
              "path": "@id",
              "required": True
            },
            {
              "propertyName": "query:descendedFrom",
              "path": {
                "@id": "https://openminds.ebrains.eu/vocab/descendedFrom",
                "reverse": True
              },
              "structure": [
                {
                  "propertyName": "query:id",
                  "path": "@id",
                  "required": True
                },
                {
                  "propertyName": "query:name",
                  "path": "https://openminds.ebrains.eu/vocab/name",
                  "required": True
                }
              ]
            }
          ]
        }
      ]
    }
  ]
}
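
The two queries above are identical except for where the `CONTAINS` filter is placed: on `@id` for a single dataset version, and on `shortName` for a keyword search. The sketch below shows how such a filtered entry could be built by a small function; `contains_filter` is a hypothetical helper and the filter values are placeholders.

```python
def contains_filter(property_name, path, value, required=False):
    """Build one entry of the query "structure" list with a CONTAINS filter."""
    entry = {
        "propertyName": property_name,
        "path": path,
        "filter": {"op": "CONTAINS", "value": value},
    }
    # Only add the "required" key when requested, mirroring the queries above.
    if required:
        entry["required"] = True
    return entry


# Option 1: filter on the identifier of the dataset version.
id_entry = contains_filter("query:id", "@id", "example-uuid", required=True)

# Option 2: filter on the short name of the dataset version.
name_entry = contains_filter(
    "query:shortName",
    "https://openminds.ebrains.eu/vocab/shortName",
    "example keyword",
)
print(id_entry["filter"], name_entry["filter"])
```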

In [None]:
# Function to get the info based on what is defined in the query
def getInfo(token, stage="IN_PROGRESS"):
    """
    Send the query defined above (the global variable `query`) to the KG.

    Parameters
    ----------
    token : string
        Authentication token to access data and metadata in the KGE via the API
    stage : string
        Stage the data are in, e.g. "RELEASED", "IN_PROGRESS". Default is "IN_PROGRESS".

    Returns
    -------
    data : dictionary
        All data that was specified in the query

    """

    headers = {
        "accept": "*/*",
        "Authorization": "Bearer " + token,
    }

    url = "https://core.kg.ebrains.eu/v3-beta/queries/?vocab=https://schema.hbp.eu/myQuery/&stage={}"
    response = requests.post(url.format(stage), json=query, headers=headers)

    # Report the status first, so it is shown even if the body cannot be parsed.
    if response.status_code == 200:
        print(response, "OK!")
    elif response.status_code == 401:
        print(response, "Token not valid, authorisation not successful")
    else:
        print(response)

    data = response.json()
    return data

To execute the query, we need to define the stage of release. If the data is under embargo, the stage is "IN_PROGRESS"; if the data has already been released, the stage is "RELEASED". The default setting is "IN_PROGRESS", as it finds both released metadata and metadata that is still being curated.

**Note:** the data should already be released, since file bundles are not created when datasets are under embargo.
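
To make the role of the stage concrete, the sketch below builds the request URL for a given stage and rejects anything other than the two stages named above. Restricting the check to these two values is an assumption made for illustration, and `build_query_url` and the constants are hypothetical names; the endpoint and vocabulary are the ones used in `getInfo`.

```python
from urllib.parse import urlencode

QUERY_ENDPOINT = "https://core.kg.ebrains.eu/v3-beta/queries/"
RESPONSE_VOCAB = "https://schema.hbp.eu/myQuery/"
# Assumption: only the two stages mentioned in this notebook are accepted here.
KNOWN_STAGES = ("IN_PROGRESS", "RELEASED")


def build_query_url(stage="IN_PROGRESS"):
    """Return the query URL for the given stage, or raise on an unknown stage."""
    if stage not in KNOWN_STAGES:
        raise ValueError("Unknown stage: " + repr(stage))
    # safe=":/" keeps the vocabulary URL readable instead of percent-encoding it.
    params = urlencode({"vocab": RESPONSE_VOCAB, "stage": stage}, safe=":/")
    return QUERY_ENDPOINT + "?" + params


print(build_query_url("RELEASED"))
```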

In [None]:
# Define the stage
stage = "IN_PROGRESS"

# Execute the getInfo function
result = getInfo(token, stage=stage)

# Save relevant metadata: every tissue sample collection gets the DOI and
# repository of the dataset version it belongs to.
if answer == '1':
    # A UUID identifies a single dataset version, so use the first match only.
    dataset_versions = result["data"][:1]
elif answer == '2':
    dataset_versions = result["data"]
    print("Number of dataset versions found that match the keywords: " + str(len(dataset_versions)))

tsc_list = []
for dataset_version in dataset_versions:
    doi = dataset_version["digitalIdentifier"][0]["identifier"]
    repo = dataset_version["repository"][0]["IRI"]
    for tsc in dataset_version["studiedSpecimen"]:
        tsc['DOI'] = doi
        tsc['Repo'] = repo
        tsc_list.append(tsc)

if answer == '1' and tsc_list:
    print('\nNumber of tissue sample collections in this dataset: ' + str(len(tsc_list)))
    print('\nDOI of this dataset: ' + tsc_list[0]["DOI"])
    print('\nRepository of this dataset: ' + tsc_list[0]['Repo'])

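
The flattening step is easier to follow on a small example. The sketch below uses a hand-written mock response with the same nesting as the query result; all values are invented placeholders, and `flatten_collections` is a hypothetical helper name.

```python
def flatten_collections(result):
    """Return one dict per tissue sample collection, tagged with DOI and repository."""
    rows = []
    for dataset_version in result["data"]:
        doi = dataset_version["digitalIdentifier"][0]["identifier"]
        repo = dataset_version["repository"][0]["IRI"]
        for specimen in dataset_version["studiedSpecimen"]:
            # Copy the entry so that the original response is left unchanged.
            rows.append({**specimen, "DOI": doi, "Repo": repo})
    return rows


# Mock response for illustration only; none of these values are real.
mock_result = {
    "data": [
        {
            "digitalIdentifier": [{"identifier": "https://doi.org/10.0000/example"}],
            "repository": [{"IRI": "https://example.org/repository"}],
            "studiedSpecimen": [{"lookupLabel": "tsc-A"}, {"lookupLabel": "tsc-B"}],
        }
    ]
}

flat = flatten_collections(mock_result)
print(len(flat), flat[0]["DOI"])
```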


### Organise and save the metadata

The metadata extracted by the query is now organised into an easier-to-read table and saved as a CSV file. The name of the CSV file is "tsc_UUID-of-datasetVersion" or "tsc_list", depending on whether the information applies to one or to multiple dataset versions.

In [None]:
# Function to extract tissue sample collection information
def extractInfo(tsc_list):
    """
    Parameters
    ----------
    tsc_list : list
        List of dictionaries, one per tissue sample collection

    Returns
    -------
    data : pandas DataFrame
        Overview table with extracted information

    """

    rows = []
    for tsc in tsc_list:
        state = tsc["studiedState"][0]
        # A file bundle is linked only if "descendedFrom" is present and non-empty.
        file_bundles = state.get("descendedFrom") or []
        if file_bundles:
            fileBundle_name = file_bundles[0]["name"]
            fileBundle_uuid = file_bundles[0]["id"].split("/")[-1]
        else:
            fileBundle_name = ""
            fileBundle_uuid = ""

        rows.append({"sub_name": "",
                     "tsc_name": tsc["lookupLabel"],
                     # The internal identifier is not required by the query, so it may be missing.
                     "tsc_internalID": tsc.get("internalIdentifier"),
                     "tsc_uuid": tsc["id"].split("/")[-1],
                     "tsc_state_uuid": state["id"].split("/")[-1],
                     "fileBundle_name": fileBundle_name,
                     "fileBundle_uuid": fileBundle_uuid,
                     "DOI_dataset": tsc['DOI'],
                     "repository": tsc['Repo'],
                     "URL_link": ""})

    # Build the table in one step; DataFrame.append was removed in pandas 2.0.
    data = pd.DataFrame(rows)
    return data
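
The function above obtains each UUID by taking the last path segment of an instance identifier. A minimal sketch of that step on its own follows; `uuid_from_id` is a hypothetical helper, and the identifier is a made-up placeholder rather than a real KG instance.

```python
def uuid_from_id(instance_id):
    """Return the last path segment of an identifier URL."""
    # rstrip guards against a trailing slash, which would yield an empty string.
    return instance_id.rstrip("/").split("/")[-1]


example_id = "https://example.org/instances/00000000-placeholder"  # not a real instance
print(uuid_from_id(example_id))
```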

In [None]:
# Execute the extractInfo function
tsc_data = extractInfo(tsc_list)

# Name the file "tsc_UUID-of-datasetVersion" or "tsc_list"
if answer == '1':
    filename = 'tsc_' + dsv + '.csv'
elif answer == '2':
    filename = 'tsc_list.csv'

# Save the table locally in the current folder
tsc_data.to_csv(filename, index=False, header=True)

print("Done! File is saved as " + filename)
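
Before using the CSV file to create service links, it can be worth checking that it contains all expected columns. The sketch below does this with the standard `csv` module; `missing_columns` is a hypothetical helper, the column list is copied from `extractInfo`, and the demonstration file is a throwaway with a deliberately incomplete header.

```python
import csv
import os
import tempfile

EXPECTED_COLUMNS = [
    "sub_name", "tsc_name", "tsc_internalID", "tsc_uuid", "tsc_state_uuid",
    "fileBundle_name", "fileBundle_uuid", "DOI_dataset", "repository", "URL_link",
]


def missing_columns(csv_path, expected=EXPECTED_COLUMNS):
    """Return the expected column names that are absent from the CSV header."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        header = next(csv.reader(handle), [])
    return [column for column in expected if column not in header]


# Demonstrate on a temporary file whose header lacks the last column.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.csv")
    with open(demo_path, "w", newline="", encoding="utf-8") as handle:
        csv.writer(handle).writerow(EXPECTED_COLUMNS[:-1])
    missing = missing_columns(demo_path)

print(missing)
```

An empty list means that every expected column is present.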