## Extracting important metadata of tissue sample collections
*Latest version 15 August 2022*

With this notebook you can do the following:
1. Query tissue sample collection metadata via the API for one or more dataset versions, or for all dataset versions of a project
2. Save the information to a CSV file so that it can be used to create service links

To run the notebook, you need the following:
- Python version >= 3.6
- read and write permission to the KG via the API

In [None]:
# import relevant packages
from getpass import getpass
import requests
import pandas as pd
import os
from datetime import datetime
import json

### Authentication

To interact with the API, you need an access token. Copy your token from the Knowledge Graph Editor or Query Builder (if you do not have access, request it via support@ebrains.eu).

If your token has expired, rerun the cell below.

In [None]:
token = getpass(prompt='Please paste your token: ')
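Whether a token has expired can often be checked locally before calling the API. The sketch below assumes the token is a JSON Web Token (JWT) whose payload carries an `exp` claim, which this notebook does not state. `token_expiry` is a hypothetical helper and not part of any EBRAINS library, and it does not verify the signature.

```python
import base64
import json
from datetime import datetime, timezone


def token_expiry(token):
    """Return the expiry time of a JWT as an aware datetime, or None.

    Assumption: the token has three dot-separated parts and the middle
    part is a base64url-encoded JSON object with an "exp" claim.
    The signature is NOT verified here.
    """
    parts = token.split(".")
    if len(parts) != 3:
        return None
    payload_b64 = parts[1]
    # base64url payloads are usually unpadded, so restore the padding
    payload_b64 += "=" * (-len(payload_b64) % 4)
    try:
        payload = json.loads(base64.urlsafe_b64decode(payload_b64))
    except ValueError:
        return None
    if not isinstance(payload, dict) or "exp" not in payload:
        return None
    return datetime.fromtimestamp(payload["exp"], tz=timezone.utc)
```

If the returned time lies in the past (compare with `datetime.now(timezone.utc)`), paste a fresh token before running the query cells.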

### Define where you want to extract information from

First choose whether you want to extract tissue sample collection information from 1 or more dataset versions or from a project. 

1. Choose 1 to extract information from 1 dataset version. You will be asked to provide the UUID of the dataset version you are interested in.
2. Choose 2 to extract information from multiple dataset versions. You will be asked to provide keyword(s) that occur in the titles of all the dataset versions you are interested in. Make the keywords as specific as possible to ensure that only the dataset versions of interest are queried.
3. Choose 3 to extract information from all dataset versions associated with 1 project. You will be asked to provide the UUID of the project you are interested in.
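Options 1 and 3 expect a UUID, so a mistyped value only shows up later as an empty query result. As a minimal sketch, the value could be checked up front with the standard `uuid` module. `is_valid_uuid` is a hypothetical helper, and it assumes the UUID is entered in the canonical hyphenated form.

```python
import uuid


def is_valid_uuid(value):
    """Return True if value is a UUID in canonical hyphenated form."""
    try:
        # uuid.UUID accepts several spellings; comparing the canonical
        # string with the input keeps only the hyphenated form
        return str(uuid.UUID(value)) == value.lower()
    except (ValueError, AttributeError, TypeError):
        return False


print(is_valid_uuid("12345678-1234-5678-1234-567812345678"))
print(is_valid_uuid("my dataset"))
```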

In [None]:
# First identify the dataset version

cwd = os.getcwd()

# Ask which source the tissue sample collection information should be extracted from.
answer = input("Do you want to get the information from 1) one dataset version, 2) multiple dataset versions, or 3) all dataset versions of a project? ")
if answer == "1":
    dsv = input("What is the UUID of the dataset version? ")
    print(f"The UUID of the dataset is: {dsv}")
    output_path = os.path.join(cwd, dsv)

elif answer == "2":
    keywords = input("Which keywords are in all titles of the dataset versions of interest? ")
    print(f"Query all dataset versions that contain {keywords} in the title")
    now = datetime.now()
    output_path = os.path.join(cwd, now.strftime("%d%m%Y"))

elif answer == "3":
    project = input("What is the UUID of the project? ")
    print(f"The UUID of the project is: {project}")
    output_path = os.path.join(cwd, project)

else:
    # Any other answer would leave output_path undefined further down
    raise ValueError("Please answer with 1, 2 or 3 and run this cell again")
    
print(f"The output folder is: {output_path}")   

if os.path.isdir(output_path):
    print("\nOutput folder already exists")
else:
    print("\nOutput folder does not exist, making folder")        
    os.mkdir(output_path) 

### Query the Knowledge Graph and extract information

The following information will be extracted based on the dataset version UUID or keywords that you provided.
- tsc name, tsc UUID, and internal identifier
- linked file bundle name and UUID (if file bundles have already been generated)
- DOI of the dataset version
- The URL of the repository

In [None]:
# This query will extract important information, such as the DOI of the dataset, the tissue sample collections and the linked subjects.
if answer == '1':
  query = {
    "@context": {
      "@vocab": "https://core.kg.ebrains.eu/vocab/query/",
      "query": "https://schema.hbp.eu/myQuery/",
      "propertyName": {
        "@id": "propertyName",
        "@type": "@id"
      },
      "merge": {
        "@type": "@id",
        "@id": "merge"
      },
      "path": {
        "@id": "path",
        "@type": "@id"
      }
    },
    "meta": {
      "name": "get-dsv-specimen-fb",
      "responseVocab": "https://schema.hbp.eu/myQuery/",
      "type": "https://openminds.ebrains.eu/core/DatasetVersion"
    },
    "structure": [
      {
        "propertyName": "query:id",
        "path": "@id",
        "required": True,
        "filter": {
          "op": "CONTAINS",
          "value": dsv
        }
      },
      {
        "propertyName": "query:digitalIdentifier",
        "path": "https://openminds.ebrains.eu/vocab/digitalIdentifier",
        "structure": {
          "propertyName": "query:identifier",
          "path": "https://openminds.ebrains.eu/vocab/identifier"
        }
      },
      {
        "propertyName": "query:repository",
        "path": "https://openminds.ebrains.eu/vocab/repository",
        "structure": {
          "propertyName": "query:IRI",
          "path": "https://openminds.ebrains.eu/vocab/IRI"
        }
      },
      {
        "propertyName": "query:studiedSpecimen",
        "path": "https://openminds.ebrains.eu/vocab/studiedSpecimen",
        "required": True,
        "structure": [
          {
            "propertyName": "query:id",
            "path": "@id",
            "required": True
          },
          {
            "propertyName": "query:lookupLabel",
            "path": "https://openminds.ebrains.eu/vocab/lookupLabel",
            "required": True
          },
          {
            "propertyName": "query:internalIdentifier",
            "path": "https://openminds.ebrains.eu/vocab/internalIdentifier"
          },
          {
            "propertyName": "query:type",
            "path": "@type",
            "required": True,
            "filter": {
              "op": "CONTAINS",
              "value": "TissueSampleCollection"
            }
          },
          {
          "propertyName": "query:studiedState",
          "path": "https://openminds.ebrains.eu/vocab/studiedState",
          "structure": [
            {
              "propertyName": "query:id",
              "path": "@id"
            },
            {
              "propertyName": "query:lookupLabel",
              "path": "https://openminds.ebrains.eu/vocab/lookupLabel"
            },
            {
              "propertyName": "query:descendedFromFile",
              "path": {
                "@id": "https://openminds.ebrains.eu/vocab/descendedFrom",
                "reverse": True
              },
              "structure": [
                {
                  "propertyName": "query:id",
                  "path": "@id"
                },
                {
                  "propertyName": "query:name",
                  "path": "https://openminds.ebrains.eu/vocab/name"
                }
              ]
            },
            {
              "propertyName": "query:descendedFromSubject",
              "path": "https://openminds.ebrains.eu/vocab/descendedFrom",
              "structure": [
                {
                "propertyName": "query:id",
                "path": "@id"
                },
                {
                "propertyName": "query:lookupLabel",
                "path": "https://openminds.ebrains.eu/vocab/lookupLabel"
                },
                {
                "propertyName": "query:studiedState",
                "path": {
                  "@id": "https://openminds.ebrains.eu/vocab/studiedState",
                  "reverse": True
                },
                "structure": [
                  {
                    "propertyName": "query:id",
                    "path": "@id"
                  },
                  {
                    "propertyName": "query:internalIdentifier",
                    "path": "https://openminds.ebrains.eu/vocab/internalIdentifier"
                  }
                ]
                }
              ]
            }
          ]
        }
        ]
      }
    ]
  }
elif answer == '2':
  query = {
  "@context": {
    "@vocab": "https://core.kg.ebrains.eu/vocab/query/",
    "query": "https://schema.hbp.eu/myQuery/",
    "propertyName": {
      "@id": "propertyName",
      "@type": "@id"
    },
    "merge": {
      "@type": "@id",
      "@id": "merge"
    },
    "path": {
      "@id": "path",
      "@type": "@id"
    }
  },
  "meta": {
    "name": "get-dsv-specimen-fb",
    "responseVocab": "https://schema.hbp.eu/myQuery/",
    "type": "https://openminds.ebrains.eu/core/DatasetVersion"
  },
  "structure": [
    {
      "propertyName": "query:id",
      "path": "@id",
      "required": True
    },
    {
      "propertyName": "query:digitalIdentifier",
      "path": "https://openminds.ebrains.eu/vocab/digitalIdentifier",
      "required": True,
      "structure": {
        "propertyName": "query:identifier",
        "path": "https://openminds.ebrains.eu/vocab/identifier",
        "required": True
      }
    },
    {
      "propertyName": "query:repository",
      "path": "https://openminds.ebrains.eu/vocab/repository",
      "required": True,
      "structure": {
        "propertyName": "query:IRI",
        "path": "https://openminds.ebrains.eu/vocab/IRI",
        "required": True
      }
    },
    {
      "propertyName": "query:shortName",
      "path": "https://openminds.ebrains.eu/vocab/shortName",
      "filter": {
        "op": "CONTAINS",
        "value": keywords
      }
    },
    {
      "propertyName": "query:studiedSpecimen",
      "path": "https://openminds.ebrains.eu/vocab/studiedSpecimen",
      "required": True,
      "structure": [
        {
          "propertyName": "query:id",
          "path": "@id",
          "required": True
        },
        {
          "propertyName": "query:lookupLabel",
          "path": "https://openminds.ebrains.eu/vocab/lookupLabel",
          "required": True
        },
        {
          "propertyName": "query:internalIdentifier",
          "path": "https://openminds.ebrains.eu/vocab/internalIdentifier"
        },
        {
          "propertyName": "query:type",
          "path": "@type",
          "required": True,
          "filter": {
            "op": "CONTAINS",
            "value": "TissueSampleCollection"
          }
        },
        {
          "propertyName": "query:studiedState",
          "path": "https://openminds.ebrains.eu/vocab/studiedState",
          "structure": [
            {
              "propertyName": "query:id",
              "path": "@id"
            },
            {
              "propertyName": "query:lookupLabel",
              "path": "https://openminds.ebrains.eu/vocab/lookupLabel"
            },
            {
              "propertyName": "query:descendedFromFile",
              "path": {
                "@id": "https://openminds.ebrains.eu/vocab/descendedFrom",
                "reverse": True
              },
              "structure": [
                {
                  "propertyName": "query:id",
                  "path": "@id"
                },
                {
                  "propertyName": "query:name",
                  "path": "https://openminds.ebrains.eu/vocab/name"
                }
              ]
            },
            {
              "propertyName": "query:descendedFromSubject",
              "path": "https://openminds.ebrains.eu/vocab/descendedFrom",
              "structure": [
                {
                "propertyName": "query:id",
                "path": "@id"
                },
                {
                "propertyName": "query:lookupLabel",
                "path": "https://openminds.ebrains.eu/vocab/lookupLabel"
                },
                {
                "propertyName": "query:studiedState",
                "path": {
                  "@id": "https://openminds.ebrains.eu/vocab/studiedState",
                  "reverse": True
                },
                "structure": [
                  {
                    "propertyName": "query:id",
                    "path": "@id"
                  },
                  {
                    "propertyName": "query:internalIdentifier",
                    "path": "https://openminds.ebrains.eu/vocab/internalIdentifier"
                  }
                ]
                }
              ]
            }
          ]
        }
      ]
    }
  ]
}
elif answer == '3':
  query = {
  "@context": {
    "@vocab": "https://core.kg.ebrains.eu/vocab/query/",
    "query": "https://schema.hbp.eu/myQuery/",
    "propertyName": {
      "@id": "propertyName",
      "@type": "@id"
    },
    "path": {
      "@id": "path",
      "@type": "@id"
    }
  },
  "meta": {
    "type": "https://openminds.ebrains.eu/core/Project",
    "responseVocab": "https://schema.hbp.eu/myQuery/"
  },
  "structure": [
    {
      "propertyName": "query:fullName",
      "path": "https://openminds.ebrains.eu/vocab/fullName"
    },
    {
      "propertyName": "query:id",
      "path": "@id",
      "required": True,
      "filter": {
        "op": "CONTAINS",
        "value": project
      }
    },
    {
      "propertyName": "query:hasResearchProducts",
      "path": "https://openminds.ebrains.eu/vocab/hasResearchProducts",
      "structure": [
        {
          "propertyName": "query:id",
          "path": "@id"
        },
        {
          "propertyName": "query:shortName",
          "path": "https://openminds.ebrains.eu/vocab/shortName"
        },
        {
          "propertyName": "query:digitalIdentifier",
          "path": "https://openminds.ebrains.eu/vocab/digitalIdentifier",
          "structure": {
            "propertyName": "query:identifier",
            "path": "http://schema.org/identifier"
          }
        },
        {
          "propertyName": "query:repository",
          "path": "https://openminds.ebrains.eu/vocab/repository",
          "structure": {
            "propertyName": "query:IRI",
            "path": "https://openminds.ebrains.eu/vocab/IRI"
          }
        },
        {
          "propertyName": "query:studiedSpecimen",
          "path": "https://openminds.ebrains.eu/vocab/studiedSpecimen",
          "required": True,
          "structure": [
            {
              "propertyName": "query:id",
              "path": "@id",
              "required": True
            },
            {
              "propertyName": "query:lookupLabel",
              "path": "https://openminds.ebrains.eu/vocab/lookupLabel",
              "required": True
            },
            {
              "propertyName": "query:internalIdentifier",
              "path": "https://openminds.ebrains.eu/vocab/internalIdentifier"
            },
            {
              "propertyName": "query:type",
              "path": "@type",
              "required": True,
              "filter": {
                "op": "CONTAINS",
                "value": "TissueSampleCollection"
              }
            },
            {
            "propertyName": "query:studiedState",
            "path": "https://openminds.ebrains.eu/vocab/studiedState",
            "structure": [
              {
                "propertyName": "query:id",
                "path": "@id"
              },
              {
                "propertyName": "query:lookupLabel",
                "path": "https://openminds.ebrains.eu/vocab/lookupLabel"
              },
              {
                "propertyName": "query:descendedFromFile",
                "path": {
                  "@id": "https://openminds.ebrains.eu/vocab/descendedFrom",
                  "reverse": True
                },
                "structure": [
                  {
                    "propertyName": "query:id",
                    "path": "@id"
                  },
                  {
                    "propertyName": "query:name",
                    "path": "https://openminds.ebrains.eu/vocab/name"
                  }
                ]
              },
              {
                "propertyName": "query:descendedFromSubject",
                "path": "https://openminds.ebrains.eu/vocab/descendedFrom",
                "structure": [
                  {
                  "propertyName": "query:id",
                  "path": "@id"
                  },
                  {
                  "propertyName": "query:lookupLabel",
                  "path": "https://openminds.ebrains.eu/vocab/lookupLabel"
                  },
                  {
                  "propertyName": "query:studiedState",
                  "path": {
                    "@id": "https://openminds.ebrains.eu/vocab/studiedState",
                    "reverse": True
                  },
                  "structure": [
                    {
                      "propertyName": "query:id",
                      "path": "@id"
                    },
                    {
                      "propertyName": "query:internalIdentifier",
                      "path": "https://openminds.ebrains.eu/vocab/internalIdentifier"
                    }
                  ]
                  }
                ]
              }
            ]
          }
          ]
        }
      ]
    }
  ]
}

In [None]:
# Function to get the info based on what is defined in the query
def getInfo(token, stage="IN_PROGRESS"):
    """
    Parameters
    ----------
    token : string 
        Authentication token to access data and metadata in the KGE via the API
    stage : string
        Stage the data are in, e.g. "RELEASED", "IN_PROGRESS". Default is "IN_PROGRESS".

    Returns
    -------
    data : dictionary
        All data that was specified in the query
        
    """
    
    headers = {
                "Authorization": "Bearer " + token
                }

    url = "https://core.kg.ebrains.eu/v3-beta/queries/?vocab=https://schema.hbp.eu/myQuery/&stage={}"
    response = requests.post(url.format(stage), json=query, headers=headers)
    
    if response.status_code == 200:
        print(response, "OK!" )
        data = response.json()
    elif response.status_code == 401:
        print(response, "Token not valid, authorisation not successful")
        return
    else:
        print(response)
        return
    
    return data

To execute the query, we need to define the stage of release. If the data is under embargo, the stage is "IN_PROGRESS"; if the data has already been released, the stage is "RELEASED". The default is "IN_PROGRESS", as it finds both released metadata and metadata that is still being curated.

**Note:** the data should already be released, since file bundles are not created when datasets are under embargo.
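A typo in the stage name would only surface as a failed request. The sketch below validates the stage before building the request URL. The URL template is copied from `getInfo`; `build_query_url` is a hypothetical helper, and it assumes that only the two stages named in this notebook are valid, although the API may accept others.

```python
# URL template copied from the getInfo function in this notebook
QUERY_URL = "https://core.kg.ebrains.eu/v3-beta/queries/?vocab=https://schema.hbp.eu/myQuery/&stage={}"

# Assumption: only the two stages mentioned in this notebook are allowed
VALID_STAGES = ("IN_PROGRESS", "RELEASED")


def build_query_url(stage="IN_PROGRESS"):
    """Return the query URL for a stage, rejecting unknown stage names."""
    stage = stage.strip().upper()
    if stage not in VALID_STAGES:
        raise ValueError(f"stage must be one of {VALID_STAGES}, got {stage!r}")
    return QUERY_URL.format(stage)


print(build_query_url("released"))
```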

In [None]:
# Define the stage
stage = "IN_PROGRESS"

# Execute the getInfo function
result = getInfo(token, stage=stage)

if not result:  # getInfo returns None when the request fails
    print("No metadata extracted \nRefresh token and run the cells again!")   
elif len(result["data"]) == 0:
    print("No data could be extracted. The required metadata defined in the query does not match the information in the dataset versions or project. Check if the information is available (see list above)")
else:
    # Save relevant metadata
    if answer == '1':
        tsc_list = []
        for i in range(len(result["data"][0]["studiedSpecimen"])):
            tsc_list.append(result["data"][0]["studiedSpecimen"][i])
            if result["data"][0]["digitalIdentifier"] == []:
                tsc_list[i]['DOI'] = "N/A"
            else:
                tsc_list[i]['DOI'] =  result["data"][0]["digitalIdentifier"][0]["identifier"]
            if "repository" in result["data"][0]:
                tsc_list[i]['Repo'] = result["data"][0]["repository"][0]["IRI"]
            else:
                tsc_list[i]['Repo'] = "N/A"
            tsc_list[i]['dsv_uuid'] = result["data"][0]["id"].split("/")[-1]
        print('\nNumber of tissue sample collections in this dataset: ' + str(len(tsc_list)))
        print('\nDOI of this dataset: ' + tsc_list[0]["DOI"])
        print('\nRepository of this dataset: ' + tsc_list[0]['Repo'])
    elif answer == '2':
        print("Number of dataset versions found that match the keywords: " + str(len(result['data'])))
        tsc_list = []
        count = 0
        for i in range(len(result['data'])):
            
            if len(result["data"][i]["studiedSpecimen"]) > 1:
                for ii in range(len(result["data"][i]["studiedSpecimen"])):
                    tsc_list.append(result["data"][i]["studiedSpecimen"][ii])
                    if result["data"][i]["digitalIdentifier"] == []:
                        tsc_list[count]['DOI'] = "N/A"
                    else:
                        tsc_list[count]['DOI'] = result["data"][i]["digitalIdentifier"][0]["identifier"]
                    if "repository" in result["data"][i]:
                        tsc_list[count]['Repo'] = result["data"][i]["repository"][0]["IRI"]
                    else:
                        tsc_list[count]['Repo'] = "N/A"
                    tsc_list[count]['dsv_uuid'] = result["data"][i]["id"].split("/")[-1]
                    count += 1
            else:
                tsc_list.append(result["data"][i]["studiedSpecimen"][0])
                if result["data"][i]["digitalIdentifier"] == []:
                    tsc_list[count]['DOI'] = "N/A"
                else:
                    tsc_list[count]['DOI'] = result["data"][i]["digitalIdentifier"][0]["identifier"]
                if "repository" in result["data"][i]:
                    tsc_list[count]['Repo'] = result["data"][i]["repository"][0]["IRI"]
                else:
                    tsc_list[count]['Repo'] = "N/A"
                tsc_list[count]['dsv_uuid'] = result["data"][i]["id"].split("/")[-1]
                count += 1
    elif answer == '3':
        print("Number of dataset versions in this project: " + str(len(result["data"][0]["hasResearchProducts"])))
        tsc_list = []
        count = 0
        for i in range(len(result["data"][0]["hasResearchProducts"])):
            
            if len(result["data"][0]["hasResearchProducts"][i]["studiedSpecimen"]) > 1:
                for ii in range(len(result["data"][0]["hasResearchProducts"][i]["studiedSpecimen"])):
                    tsc_list.append(result["data"][0]["hasResearchProducts"][i]["studiedSpecimen"][ii])
                    if result["data"][0]["hasResearchProducts"][i]["digitalIdentifier"] == []:
                        tsc_list[count]['DOI'] = "N/A"
                    else:
                        tsc_list[count]['DOI'] = result["data"][0]["hasResearchProducts"][i]["digitalIdentifier"][0]["identifier"]
                    if "repository" in result["data"][0]["hasResearchProducts"][i]:
                        tsc_list[count]['Repo'] = result["data"][0]["hasResearchProducts"][i]["repository"][0]["IRI"]
                    else:
                        tsc_list[count]['Repo'] = "N/A"
                    tsc_list[count]['dsv_uuid'] = result["data"][0]["hasResearchProducts"][i]["id"].split("/")[-1]
                    count += 1
            else:
                tsc_list.append(result["data"][0]["hasResearchProducts"][i]["studiedSpecimen"][0])
                if result["data"][0]["hasResearchProducts"][i]["digitalIdentifier"] == []:
                    tsc_list[count]['DOI'] = "N/A"
                else:
                    tsc_list[count]['DOI'] = result["data"][0]["hasResearchProducts"][i]["digitalIdentifier"][0]["identifier"]
                if "repository" in result["data"][0]["hasResearchProducts"][i]:
                    tsc_list[count]['Repo'] = result["data"][0]["hasResearchProducts"][i]["repository"][0]["IRI"]
                else:
                    tsc_list[count]['Repo'] = "N/A"
                tsc_list[count]['dsv_uuid'] = result["data"][0]["hasResearchProducts"][i]["id"].split("/")[-1]
                count += 1
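The three branches above attach the same three fields (`DOI`, `Repo`, `dsv_uuid`) to every tissue sample collection of a dataset version. As a sketch, that step could live in one function. `attach_dsv_info` is a hypothetical helper, and the example record is invented for illustration; it is only shaped after the fields that the queries request.

```python
def attach_dsv_info(dsv_record):
    """Return the tissue sample collections of one dataset version,
    each annotated with the DOI, repository and UUID of that version."""
    dois = dsv_record.get("digitalIdentifier") or []
    repos = dsv_record.get("repository") or []
    doi = dois[0]["identifier"] if dois else "N/A"
    repo = repos[0]["IRI"] if repos else "N/A"
    dsv_uuid = dsv_record["id"].split("/")[-1]

    annotated = []
    for tsc in dsv_record.get("studiedSpecimen") or []:
        entry = dict(tsc)  # shallow copy so the API response is not modified
        entry.update({"DOI": doi, "Repo": repo, "dsv_uuid": dsv_uuid})
        annotated.append(entry)
    return annotated


# Invented record for illustration only, not real Knowledge Graph data
example = {
    "id": "https://example.org/instances/0000-example",
    "digitalIdentifier": [],
    "studiedSpecimen": [{"lookupLabel": "A102_tsc"}],
}
print(attach_dsv_info(example))
```

For option 2, the list could then be built with `[tsc for record in result["data"] for tsc in attach_dsv_info(record)]`, which also avoids mixing up the indices of different dataset versions.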
    

### Organise and save the metadata

The metadata extracted by the query is now organised into an easier-to-read format and saved as a CSV file. The name of the CSV file is "tsc_UUID-of-datasetVersion", "tsc_list", or "tsc_UUID-of-project", depending on whether you selected one dataset version, multiple dataset versions, or a project.

For the organisation it is important that particular naming conventions are used.
Each subject has a subjectID in the file name which needs to be used as the internal identifier of the subject. The internal identifier is used for the generation of file bundles and later also for the generation of regex patterns for the image service ingestion. The lookup label of the subject can be a user-defined name, but it is important that the lookup label for the tissue sample collection is an extension of this name. For example, if the subject is called "A102", the tissue sample collection should be called "A102_tsc". For the states of the subject and the tissue sample collection, the naming convention is "lookupLabel_state-01", so for the above example the state name is "A102_state-01" and "A102_tsc_state-01" for the subject and tissue sample collection, respectively. 

Note: Use the underscore to separate names and states, and dashes to separate elements within a name, e.g. sub-01, state-01
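The naming convention above can be written down as a small function, which may help when checking lookup labels by hand. `expected_names` is a hypothetical helper. The two-digit zero padding for states beyond `state-01` is an assumption extrapolated from the single example in the text.

```python
def expected_names(subject_label, state_number=1):
    """Return the lookup labels implied by the naming convention."""
    # Dashes separate elements within a name, underscores separate names
    state = f"state-{state_number:02d}"
    tsc_label = f"{subject_label}_tsc"
    return {
        "subject": subject_label,
        "subject_state": f"{subject_label}_{state}",
        "tsc": tsc_label,
        "tsc_state": f"{tsc_label}_{state}",
    }


print(expected_names("A102"))
```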

In [None]:
# Function to extract tissue sample collection information
def extractInfo(tsc_list, file_extension):
    """
    Parameters
    ----------
    tsc_list : list 
        Nested list of tsc information
    file_extension : string
        Extension of the files to ingest, e.g. ".tif". May be an empty string.

    Returns
    -------
    data : pandas DataFrame
        Overview table with extracted information
        
    """

    rows = []
    for tsc in tsc_list:
        state = tsc["studiedState"][0]

        # File bundle information is only available once file bundles have been generated
        file_bundles = state.get("descendedFromFile") or []
        if not file_bundles:
            fileBundle_name = ""
            fileBundle_uuid = ""
            regex_pattern = ""
        else:
            fileBundle_name = file_bundles[0]["name"]
            fileBundle_uuid = file_bundles[0]["id"].split("/")[-1]
            # Raw strings keep the regex escape \d intact
            if file_extension == "":
                print("No file extension defined, add extension to the regex pattern yourself")
                regex_pattern = fileBundle_name + r".*s[\d]{1,3}" + "\\"
            else:
                regex_pattern = fileBundle_name + r".*s[\d]{1,3}" + "\\" + file_extension

        # If only one DOI is available, use that one, otherwise concatenate all DOIs in one string
        if isinstance(tsc["DOI"], str):
            DOI = tsc['DOI']
        else:
            DOI = '\n'.join(tsc['DOI'])

        tsc_uuid = tsc["id"].split("/")[-1]
        rows.append({"sub_name" : state["descendedFromSubject"][0]["studiedState"][0]["internalIdentifier"],
                     # "sub_uuid" : state["descendedFromSubject"][0]["studiedState"][0]["id"],
                     "tsc_name" : tsc["lookupLabel"],
                     "tsc_internalID" : tsc["internalIdentifier"],
                     "tsc_uuid" : tsc_uuid,
                     "tsc_state_uuid" : state["id"].split("/")[-1],
                     "bucket_name" : "img-" + tsc_uuid,
                     "collab_name" : "Image chunks for tsc " + tsc["lookupLabel"] +  " for dataset version: " + tsc["dsv_uuid"],
                     "fileBundle_name" : fileBundle_name,
                     "fileBundle_uuid" : fileBundle_uuid,
                     "regex_pattern" : regex_pattern,
                     "DOI_dataset" : DOI,
                     "repository" : tsc['Repo'],
                     "viewer_link" : "Fill in the viewer link here"})

    # Build the table in one step; DataFrame.append was removed in pandas 2.0
    data = pd.DataFrame(rows)

    return data

If the dataset is well organised, we can extract the regex pattern for each tissue sample collection. The general idea is that each file name contains the subject name (which is used in the regex pattern to create file bundles, e.g. H109), a section number (e.g. s001) and a file extension (e.g. '.tif'). If the files in the folder do not follow this naming convention (e.g. H109...s001.tif), the regex pattern in the cell above needs to be changed. The cell below will ask you to define the file extension of the files you want to create a task for (typically '.tif' for 2D images).

In [None]:
# Execute the extractInfo function
file_extension = input("What is the file extension of the files that need to be ingested (e.g. '.tif'): ")
tsc_data = extractInfo(tsc_list, file_extension)

if answer == '1':
    filename = os.path.join(output_path, 'tsc_' + dsv + '.csv')
elif answer == '2':
    filename = os.path.join(output_path, 'tsc_list' + '.csv')
elif answer == '3':
    filename = os.path.join(output_path, 'tsc_' + project + '.csv')
# save the table locally in the current folder using the name "tsc_UUID-of-dataset"
tsc_data.to_csv(filename, index = False, header=True)

print("Done! File is saved")
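Before relying on the generated patterns, it can be useful to try one against a few file names. The sketch below rebuilds the pattern the same way `extractInfo` does and tests it with Python's `re.search`. The file names are invented for illustration, and whether the ingestion service applies the filter with the same search semantics as `re.search` is an assumption.

```python
import re


def build_regex(file_bundle_name, file_extension):
    """Mirror the pattern built in extractInfo: the file bundle name,
    any characters, 's' plus one to three digits, then the extension
    with its leading dot escaped."""
    return file_bundle_name + r".*s[\d]{1,3}" + "\\" + file_extension


pattern = build_regex("H109", ".tif")
print(pattern)

# Invented file names for illustration only
for name in ["H109_s001.tif", "H109_thumbnail.tif", "H110_s001.tif"]:
    print(name, bool(re.search(pattern, name)))
```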

## Create tasks

To ingest the images, we first need to create a task. The cells below use the extracted information to create a task for each tissue sample collection and save it as a JSON file named task_[tscName].json, where tscName is the name of the tissue sample collection.

In [None]:
# Function to build task instruction
def build_task(source_url: str, input_filter: str, collab_name: str):
    
    task_definition = {
                        "description": collab_name,
                        "definition": {
                            "type": "ingest",
                            "url": source_url,
                            "two_d": True,
                            "runtime_limit":"24h",
                            "filter": input_filter,
                            # "ingestion_parameters":{
                            #     "is_stack":False,
                            #     "type":"image",
                            #     "data_type":"uint8"
                            # }
                        },
                        # "bucket_name": bucket_name   
                    }
    
    return task_definition


In [None]:
# Create a task per tissue sample collection using the collab name (as description), regex filter and source location.
for task_num in range(len(tsc_data)):

    task_definition = build_task(tsc_data.repository[task_num], tsc_data.regex_pattern[task_num], tsc_data.collab_name[task_num])
    fname = os.path.join(output_path, "task_" + tsc_data.tsc_name[task_num] + ".json")
    with open(fname, 'w', encoding='utf-8') as fi:
        fi.write(json.dumps(task_definition, ensure_ascii=False, indent=4))

print("Task definitions have been created")
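Some rows can produce incomplete tasks: the repository is set to "N/A" when none was found, and the regex pattern is empty when no file bundle exists. As a sketch, each task definition could be checked before it is submitted. `check_task_definition` is a hypothetical helper, and it assumes that a filter which compiles as a Python regex is also acceptable to the ingestion service.

```python
import re


def check_task_definition(task):
    """Return a list of problems found in a task definition dict."""
    problems = []
    definition = task.get("definition", {})

    # These keys are the ones build_task fills from the extracted metadata
    for key in ("type", "url", "filter"):
        if not definition.get(key):
            problems.append(f"definition.{key} is missing or empty")

    if definition.get("url") == "N/A":
        problems.append("definition.url is the 'N/A' placeholder")

    try:
        re.compile(definition.get("filter") or "")
    except re.error as err:
        problems.append(f"definition.filter is not a valid regex: {err}")

    return problems


# Invented task for illustration only
incomplete_task = {
    "description": "Image chunks for tsc A102_tsc",
    "definition": {"type": "ingest", "url": "N/A", "filter": ""},
}
print(check_task_definition(incomplete_task))
```

A pattern that still ends in a backslash, as generated when no file extension was entered, is also reported, because it does not compile as a regex.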