# DerivaML Dataset

DerivaML is a class library built on the Deriva scientific asset management system. It is designed to simplify the basic operations involved in building and testing ML libraries based on common toolkits such as TensorFlow. This notebook reviews the basic features of the DerivaML library.

## Set up DerivaML for the test case

In [2]:
%load_ext autoreload
%autoreload 2

In [3]:
from deriva.core.utils.globus_auth_utils import GlobusNativeLogin
from deriva_ml.demo_catalog import create_demo_catalog, DemoML
from deriva_ml import MLVocab, DatasetBag, ExecutionConfiguration, Workflow, DerivaSystemColumns
import pandas as pd
from IPython.display import display, Markdown, HTML, JSON

Set the details for the catalog we want and authenticate to the server if needed.

In [4]:
hostname = 'dev.eye-ai.org'
domain_schema = 'demo-schema'

gnl = GlobusNativeLogin(host=hostname)
if gnl.is_logged_in([hostname]):
    print("You are already logged in.")
else:
    gnl.login([hostname], no_local_server=True, no_browser=True, refresh_tokens=True, update_bdbag_keychain=True)
    print("Login Successful")


You are already logged in.


Create a test catalog and get an instance of the DemoML class.

In [5]:
test_catalog = create_demo_catalog(hostname, domain_schema)
ml_instance = DemoML(hostname, test_catalog.catalog_id)

## Configure DerivaML Datasets

In Deriva-ML, a dataset is used to aggregate instances of entities. However, before we can create any datasets, we must configure Deriva-ML for the specifics of the datasets. The first step is to tell Deriva-ML what types of user-defined objects can be associated with a dataset.

Note that out of the box, Deriva-ML is configured to allow datasets to contain datasets (i.e., nested datasets), so we don't need to do anything for that specific configuration.

In [6]:
print(f"Current dataset_table element types: {[a.name for a in ml_instance.list_dataset_element_types()]}")
ml_instance.add_dataset_element_type("Subject")
ml_instance.add_dataset_element_type("Image")
print(f"New dataset_table element types {[a.name for a in ml_instance.list_dataset_element_types()]}")

Current dataset_table element types: ['Dataset']
New dataset_table element types ['Dataset', 'Subject', 'Image']
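
The role of element types can be pictured with a small stand-alone sketch. It does not use Deriva-ML: `DatasetRegistry` and its methods are hypothetical names invented for illustration, and the rejection of unregistered types is an assumption about the intent of this configuration step, not documented library behavior.

```python
class DatasetRegistry:
    """Toy model (hypothetical, not the Deriva-ML API) of dataset element types."""

    def __init__(self):
        # Nested datasets are allowed out of the box.
        self.element_types = ["Dataset"]
        self.members = {}  # dataset name -> list of (table, rid) pairs

    def add_element_type(self, table):
        # Register a table whose rows may be added to datasets.
        if table not in self.element_types:
            self.element_types.append(table)

    def add_member(self, dataset, table, rid):
        # Assumption: rows from unregistered tables are refused.
        if table not in self.element_types:
            raise ValueError(f"{table!r} is not a registered dataset element type")
        self.members.setdefault(dataset, []).append((table, rid))


registry = DatasetRegistry()
try:
    registry.add_member("demo", "Subject", "33M")
except ValueError as err:
    print(err)

registry.add_element_type("Subject")
registry.add_member("demo", "Subject", "33M")
print(registry.element_types)
```

In this toy model the first `add_member` call fails because `Subject` has not been registered yet, and the second succeeds once it has.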


Now that we have configured our datasets, we need to identify the dataset types so we can distinguish between them.

In [7]:
# Add the dataset type terms we will use to the dataset_type vocabulary.
ml_instance.add_term(MLVocab.dataset_type, "DemoSet", description="A test dataset_table")
ml_instance.add_term(MLVocab.dataset_type, 'Partitioned', description="A partitioned dataset_table for ML training.")
ml_instance.add_term(MLVocab.dataset_type, "Subject", description="A test dataset_table")
ml_instance.add_term(MLVocab.dataset_type, "Image", description="A test dataset_table")
ml_instance.add_term(MLVocab.dataset_type, "Training", description="Training dataset_table")
ml_instance.add_term(MLVocab.dataset_type, "Testing", description="Training dataset_table")
ml_instance.add_term(MLVocab.dataset_type, "Validation", description="Validation dataset_table")

ml_instance.list_vocabulary_terms(MLVocab.dataset_type)

[VocabularyTerm(name='DemoSet', synonyms=[], id='ml-test:3AW', uri='/id/3AW', description='A test dataset_table', rid='3AW'),
 VocabularyTerm(name='Partitioned', synonyms=[], id='ml-test:3AY', uri='/id/3AY', description='A partitioned dataset_table for ML training.', rid='3AY'),
 VocabularyTerm(name='Subject', synonyms=[], id='ml-test:3B0', uri='/id/3B0', description='A test dataset_table', rid='3B0'),
 VocabularyTerm(name='Image', synonyms=[], id='ml-test:3B2', uri='/id/3B2', description='A test dataset_table', rid='3B2'),
 VocabularyTerm(name='Training', synonyms=[], id='ml-test:3B4', uri='/id/3B4', description='Training dataset_table', rid='3B4'),
 VocabularyTerm(name='Testing', synonyms=[], id='ml-test:3B6', uri='/id/3B6', description='Training dataset_table', rid='3B6'),
 VocabularyTerm(name='Validation', synonyms=[], id='ml-test:3B8', uri='/id/3B8', description='Validation dataset_table', rid='3B8')]

Now create datasets and populate them with elements from the test catalog.

In [8]:
ml_instance.add_term(MLVocab.workflow_type, "Create Dataset Notebook", description="A Workflow that creates a new dataset_table")

# Now let's create a workflow and an execution configuration for our program.
api_workflow = Workflow(
    name="API Workflow",
    url="https://github.com/informatics-isi-edu/deriva-ml/blob/main/docs/Notebooks/DerivaML%20Dataset.ipynb",
    workflow_type="Create Dataset Notebook"
)

dataset_execution = ml_instance.create_execution(
    ExecutionConfiguration(
        workflow=api_workflow,
        description="Our Sample Workflow instance")
)

In [9]:
subject_dataset = dataset_execution.create_dataset(['DemoSet', 'Subject'], description="A subject dataset_table")
image_dataset = dataset_execution.create_dataset(['DemoSet', 'Image'], description="A image training dataset_table")
datasets = pd.DataFrame(ml_instance.find_datasets()).drop(columns=DerivaSystemColumns)
display(
    Markdown('## Datasets'),
    datasets)

## Datasets

| | RID | Description | Version | MLVocab.dataset_type |
|---|---|---|---|---|
| 0 | 3BG | A subject dataset_table | 3BR | [DemoSet, Subject] |
| 1 | 3BT | A image training dataset_table | 3C2 | [DemoSet, Image] |


And now that we have defined some datasets, we can add elements of the appropriate type to them.  We can see what is in our new datasets by listing the dataset members.

In [10]:
# Get list of subjects and images from the catalog using the DataPath API.
dp = ml_instance.domain_path  # Each call returns a new path instance, so only call once...
subject_rids = [i['RID'] for i in dp.tables['Subject'].entities().fetch()]
image_rids = [i['RID'] for i in dp.tables['Image'].entities().fetch()]

ml_instance.add_dataset_members(dataset_rid=subject_dataset, members=subject_rids)
ml_instance.add_dataset_members(dataset_rid=image_dataset, members=image_rids)

# List the contents of our datasets, and let's not include columns like modify time.
display(
    Markdown('## Subject Dataset'),
    pd.DataFrame(ml_instance.list_dataset_members(subject_dataset)['Subject']).drop(columns=DerivaSystemColumns),
    Markdown('## Image Dataset'),
    pd.DataFrame(ml_instance.list_dataset_members(image_dataset)['Image']).drop(columns=DerivaSystemColumns))

## Subject Dataset

| | RID | Name |
|---|---|---|
| 0 | 33M | Thing1 |
| 1 | 33P | Thing2 |
| 2 | 33R | Thing3 |
| 3 | 33T | Thing4 |
| 4 | 33W | Thing5 |
| 5 | 33Y | Thing6 |
| 6 | 340 | Thing7 |
| 7 | 342 | Thing8 |
| 8 | 344 | Thing9 |
| 9 | 346 | Thing10 |


## Image Dataset

| | RID | URL | Filename | Description | Length | MD5 | Name | Subject |
|---|---|---|---|---|---|---|---|---|
| 0 | 34W | /hatrac/Image/26c58aba708feb63cdb2e0e7dd7e3c39... | test_33M.txt | A test image | 31 | 26c58aba708feb63cdb2e0e7dd7e3c39 | | 33M |
| 1 | 34Y | /hatrac/Image/02d08f98e5c68b6e57ed7b9199e91f03... | test_33P.txt | A test image | 31 | 02d08f98e5c68b6e57ed7b9199e91f03 | | 33P |
| 2 | 350 | /hatrac/Image/7418d42e0a002ead932d349f6d96ae2e... | test_33R.txt | A test image | 32 | 7418d42e0a002ead932d349f6d96ae2e | | 33R |
| 3 | 352 | /hatrac/Image/53eb519da6de7509bfdeb6b920244396... | test_33T.txt | A test image | 31 | 53eb519da6de7509bfdeb6b920244396 | | 33T |
| 4 | 354 | /hatrac/Image/5d85d8527dcab04f3d8a3aeb6359e5e1... | test_33W.txt | A test image | 30 | 5d85d8527dcab04f3d8a3aeb6359e5e1 | | 33W |
| 5 | 356 | /hatrac/Image/d0d02cac8ec2740c0c43cd9b8e478bc6... | test_33Y.txt | A test image | 32 | d0d02cac8ec2740c0c43cd9b8e478bc6 | | 33Y |
| 6 | 358 | /hatrac/Image/9d7ef397078b3f2325f5614067bea21d... | test_340.txt | A test image | 31 | 9d7ef397078b3f2325f5614067bea21d | | 340 |
| 7 | 35A | /hatrac/Image/be24b72e0ce49f24bcb0730822c817cd... | test_342.txt | A test image | 32 | be24b72e0ce49f24bcb0730822c817cd | | 342 |
| 8 | 35C | /hatrac/Image/5422e4e0e69950821430ea62c3da60b9... | test_344.txt | A test image | 32 | 5422e4e0e69950821430ea62c3da60b9 | | 344 |
| 9 | 35E | /hatrac/Image/37cbe61e794c67225a36b3c422e2418d... | test_346.txt | A test image | 31 | 37cbe61e794c67225a36b3c422e2418d | | 346 |


## Create partitioned dataset

Now let's create some subsets of the original dataset based on subject-level metadata. We will download the subject dataset and look at its metadata to figure out how to partition the original data. Since we are not going to look at the images, we use the materialize=False option to save some time.

In [12]:
bag_path, bag_rid, bag_minid = ml_instance.download_dataset_bag(subject_dataset, materialize=False)
dataset_bag = DatasetBag(bag_rid)
print(f"Bag materialized to {bag_path}")

ValidationError: 2 validation errors for DatasetBag.__init__
1
  Input should be a valid string [type=string_type, input_value=PosixPath('/var/folders/0...bfbe9b9e9c/Dataset_3BG'), input_type=PosixPath]
    For further information visit https://errors.pydantic.dev/2.10/v/string_type
dbase
  Missing required argument [type=missing_argument, input_value=ArgsKwargs((<deriva_ml.da...be9b9e9c/Dataset_3BG'))), input_type=ArgsKwargs]
    For further information visit https://errors.pydantic.dev/2.10/v/missing_argument

The domain model has two objects, Subject and Image. Each image is associated with one subject, but a subject can have multiple images associated with it. Let's look at the subjects and partition them into training, testing, and validation datasets.

In [None]:
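# Get information about the subjects and their images.
subject_df = dataset_bag.get_table_as_dataframe('Subject')[['RID', 'Name']]
image_df = dataset_bag.get_table_as_dataframe('Image')[['RID', 'Subject', 'URL']]
# Pair each image with its subject by matching Image.Subject against Subject.RID.
# (DataFrame.join would align rows by index position rather than by subject.)
metadata_df = subject_df.merge(image_df, left_on='RID', right_on='Subject', suffixes=('_subject', '_image'))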
display(metadata_df)
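
The relationship between the two tables can be illustrated with a small pandas sketch on a few hand-written rows (the RIDs are copied from the listings above; the frames themselves are made up for illustration). It pairs each image with its subject by matching the image's Subject column against the subject's RID, so the pairing stays correct even when the rows are in a different order.

```python
import pandas as pd

subjects = pd.DataFrame({"RID": ["33M", "33P"], "Name": ["Thing1", "Thing2"]})
# Image rows are deliberately listed in a different order than the subjects.
images = pd.DataFrame({"RID": ["34Y", "34W"], "Subject": ["33P", "33M"]})

# Match on the foreign key; the shared column name RID gets the two suffixes.
linked = subjects.merge(
    images, left_on="RID", right_on="Subject", suffixes=("_subject", "_image")
)
print(linked[["Name", "RID_subject", "RID_image"]])
```

Because the match is made on key values, a subject with several images would simply appear in several rows of the result.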

For this example, let's partition the data based on the name of the subject. Of course, in real examples we would do a more complex analysis to decide which subset goes into each dataset.

In [None]:
def thing_number(name: pd.Series) -> pd.Series:
    return name.map(lambda n: int(n.replace('Thing','')))

training_rids = metadata_df.loc[lambda df: thing_number(df['Name']) % 3 == 0]['RID_image'].tolist()
testing_rids = metadata_df.loc[lambda df: thing_number(df['Name']) % 3 == 1]['RID_image'].tolist()
validation_rids = metadata_df.loc[lambda df: thing_number(df['Name']) % 3 == 2]['RID_image'].tolist()

print(f'Training images: {training_rids}')
print(f'Testing images: {testing_rids}')
print(f'Validation images: {validation_rids}')
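
To see how this rule distributes the ten demo subjects, here is a plain-Python sketch of the same modulo-3 split applied to the subject names alone. It needs no catalog; the `buckets` and `labels` names are ours, and the sketch groups names rather than image RIDs.

```python
names = [f"Thing{i}" for i in range(1, 11)]


def thing_number(name):
    # "Thing7" -> 7
    return int(name.replace("Thing", ""))


# Remainder 0 -> training, 1 -> testing, 2 -> validation, as in the cell above.
labels = ["training", "testing", "validation"]
buckets = {label: [] for label in labels}
for name in names:
    buckets[labels[thing_number(name) % 3]].append(name)

for label, group in buckets.items():
    print(label, group)
```

With ten subjects the split is uneven: three names land in training, four in testing, and three in validation.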

Now that we know what we want in each dataset, let's create datasets for each of our partitioned elements along with a nested dataset to track the entire collection.

In [None]:
nested_dataset = dataset_execution.create_dataset(['Partitioned', 'Image'], description='A nested dataset_table for machine learning')
training_dataset = dataset_execution.create_dataset('Training', description='An image dataset_table for training')
testing_dataset = dataset_execution.create_dataset('Testing', description='An image dataset_table for testing')
validation_dataset = dataset_execution.create_dataset('Validation', description='An image dataset_table for validation')
pd.DataFrame(ml_instance.find_datasets()).drop(columns=DerivaSystemColumns)

And then fill the datasets with the appropriate members.

In [None]:
ml_instance.add_dataset_members(dataset_rid=nested_dataset, members=[training_dataset, testing_dataset, validation_dataset])
ml_instance.add_dataset_members(dataset_rid=training_dataset, members=training_rids)
ml_instance.add_dataset_members(dataset_rid=testing_dataset, members=testing_rids)
ml_instance.add_dataset_members(dataset_rid=validation_dataset, members=validation_rids)
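
The resulting structure is a small tree: one parent dataset whose members are three child datasets, each holding images. The sketch below models that shape with plain dictionaries to show how parent lookup and a recursive member listing behave. All names and image labels here are made up for illustration, and this is not how Deriva-ML stores datasets.

```python
# Hypothetical in-memory model of the nested dataset created above.
datasets = {
    "nested": {"datasets": ["training", "testing", "validation"], "images": []},
    "training": {"datasets": [], "images": ["img-3", "img-6", "img-9"]},
    "testing": {"datasets": [], "images": ["img-1", "img-4"]},
    "validation": {"datasets": [], "images": ["img-2", "img-5"]},
}


def collect_images(name):
    """Return the images of a dataset and, recursively, of its children."""
    node = datasets[name]
    found = list(node["images"])
    for child in node["datasets"]:
        found.extend(collect_images(child))
    return found


def parents_of(name):
    """Return every dataset that lists `name` as a member."""
    return [parent for parent, node in datasets.items() if name in node["datasets"]]


print(parents_of("training"))
print(collect_images("nested"))
```

The parent dataset holds no images directly, so a non-recursive listing would show only its three children, while the recursive walk reaches every image.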

Ok, let's see what we have now.

In [None]:
display(
    Markdown('## Nested Dataset'),
    pd.DataFrame(ml_instance.list_dataset_members(nested_dataset)['Dataset']).drop(columns=DerivaSystemColumns),
    Markdown('## Training Dataset'),
    pd.DataFrame(ml_instance.list_dataset_members(training_dataset)['Image']).drop(columns=DerivaSystemColumns),
    Markdown('## Testing Dataset'),
    pd.DataFrame(ml_instance.list_dataset_members(testing_dataset)['Image']).drop(columns=DerivaSystemColumns),
    Markdown('## Validation Dataset'),
    pd.DataFrame(ml_instance.list_dataset_members(validation_dataset)['Image']).drop(columns=DerivaSystemColumns))

In [None]:
print(f'Dataset parents: {ml_instance.list_dataset_parents(training_dataset)}')
print(f'Dataset children: {ml_instance.list_dataset_children(nested_dataset)}')

Now let's get a PID that will allow us to share and cite the dataset that we just created.

In [None]:
dataset_citation = ml_instance.cite(nested_dataset)
display(
    HTML(f'Nested dataset_table citation: <a href="{dataset_citation}">{dataset_citation}</a>')
)

In [None]:
display(
    Markdown('## Nested Dataset -- Recursive Listing'),
    JSON(ml_instance.list_dataset_members(nested_dataset, recurse=True))
)

### Dataset Versions

Datasets have a version number that can be retrieved or incremented. We follow the equivalent of semantic versioning, but for data rather than code. Note that datasets are also versioned implicitly, because a dataset RID can include a catalog snapshot ID.

In [None]:
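# SemanticVersion is not imported earlier in this notebook; we assume here that
# it is exported by the deriva_ml package like the other names we use.
from deriva_ml import SemanticVersion

print(f'Current dataset_table version for training_dataset: {ml_instance.dataset_version(training_dataset)}')
next_version = ml_instance.increment_dataset_version(training_dataset, SemanticVersion.minor)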
print(f'Next dataset_table version for training_dataset: {next_version}')
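
For readers less familiar with semantic versioning, the sketch below shows the usual convention for incrementing one part of a `major.minor.patch` version: the chosen part goes up by one and every lower-order part resets to zero. `VersionPart` and `bump` are hypothetical names for this illustration only. We have not confirmed that Deriva-ML applies exactly this rule.

```python
from enum import Enum


class VersionPart(Enum):
    """Position of each component in a major.minor.patch string (hypothetical)."""
    major = 0
    minor = 1
    patch = 2


def bump(version, part):
    """Increment one component and reset the components after it to zero."""
    numbers = [int(n) for n in version.split(".")]
    numbers[part.value] += 1
    for i in range(part.value + 1, len(numbers)):
        numbers[i] = 0
    return ".".join(str(n) for n in numbers)


print(bump("1.2.3", VersionPart.minor))
print(bump("1.2.3", VersionPart.major))
```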

In [None]:
display(HTML(f'<a href="{ml_instance.chaise_url("Dataset")}">Browse Datasets</a>'))

In [None]:
test_catalog.delete_ermrest_catalog(really=True)