# DerivaML Dataset Example

DerivaML is a class library built on the Deriva scientific asset management system. It is designed to simplify the basic operations involved in building and testing ML libraries based on common toolkits such as TensorFlow. This notebook reviews the basic features of the DerivaML library.

## Set up DerivaML for the test case

In [18]:
from schema_setup.create_schema import define_table_dataset
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [19]:
import os
import pandas as pd
from deriva.core.utils.globus_auth_utils import GlobusNativeLogin
from deriva_ml.schema_setup.test_catalog import create_test_catalog, DemoML
from deriva_ml.deriva_ml_base import VocabularyTables as vt
from IPython.display import display, Markdown

Set the details for the catalog we want and authenticate to the server if needed.

In [20]:
hostname = 'dev.eye-ai.org'
domain_schema = 'demo-schema'

gnl = GlobusNativeLogin(host=hostname)
if gnl.is_logged_in([hostname]):
    print("You are already logged in.")
else:
    gnl.login([hostname], no_local_server=True, no_browser=True, refresh_tokens=True, update_bdbag_keychain=True)
    print("Login Successful")


2024-10-17 22:44:08,818 - INFO - Creating client of type <class 'globus_sdk.services.auth.client.native_client.NativeAppAuthClient'> for service "auth"
2024-10-17 22:44:08,819 - INFO - Finished initializing AuthLoginClient. client_id='8ef15ba9-2b4a-469c-a163-7fd910c9d111', type(authorizer)=<class 'globus_sdk.authorizers.base.NullAuthorizer'>


You are already logged in.


Create a test catalog and get an instance of the DemoML class.

In [21]:
test_catalog = create_test_catalog(hostname, domain_schema)
ml_instance = DemoML(hostname, test_catalog.catalog_id)

2024-10-17 22:44:08,848 - INFO - Creating client of type <class 'globus_sdk.services.auth.client.native_client.NativeAppAuthClient'> for service "auth"
2024-10-17 22:44:08,849 - INFO - Finished initializing AuthLoginClient. client_id='8ef15ba9-2b4a-469c-a163-7fd910c9d111', type(authorizer)=<class 'globus_sdk.authorizers.base.NullAuthorizer'>
2024-10-17 22:44:19,064 - INFO - Creating client of type <class 'globus_sdk.services.auth.client.native_client.NativeAppAuthClient'> for service "auth"
2024-10-17 22:44:19,065 - INFO - Finished initializing AuthLoginClient. client_id='8ef15ba9-2b4a-469c-a163-7fd910c9d111', type(authorizer)=<class 'globus_sdk.authorizers.base.NullAuthorizer'>
2024-10-17 22:44:29,290 - INFO - Creating client of type <class 'globus_sdk.services.auth.client.native_client.NativeAppAuthClient'> for service "auth"
2024-10-17 22:44:29,291 - INFO - Finished initializing AuthLoginClient. client_id='8ef15ba9-2b4a-469c-a163-7fd910c9d111', type(authorizer)=<class 'globus_sdk.au

## Configure DerivaML Datasets

In Deriva-ML a dataset is used to aggregate instances of entities. However, before we can create any datasets, we must configure Deriva-ML for the specifics of the datasets. The first step is to tell Deriva-ML what types of user-defined objects can be associated with a dataset.

Note that out of the box, Deriva-ML is configured to allow datasets to contain other datasets (i.e. nested datasets), so we don't need to do anything for that specific configuration.

In [22]:
print(f"Current dataset element types: {[a.name for a in ml_instance.list_dataset_element_types()]}")
ml_instance.add_dataset_element_type("Subject")
ml_instance.add_dataset_element_type("Image")
print(f"New dataset element types {[a.name for a in ml_instance.list_dataset_element_types()]}")

Current dataset element types: ['Dataset']
New dataset element types ['Dataset', 'Subject', 'Image']
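
The idea behind registering element types can be illustrated with a toy model. The sketch below is not DerivaML code: `ToyDatasetConfig` and its methods are hypothetical names invented for illustration, and the real library stores this configuration in the catalog rather than in memory. It only shows the rule that a table must be registered before its rows can become dataset members.

```python
from dataclasses import dataclass, field


@dataclass
class ToyDatasetConfig:
    """Hypothetical in-memory stand-in for the dataset configuration."""

    # 'Dataset' is allowed from the start, mirroring the nested-dataset default.
    element_types: set[str] = field(default_factory=lambda: {"Dataset"})

    def add_element_type(self, table_name: str) -> None:
        self.element_types.add(table_name)

    def check_member(self, table_name: str) -> None:
        if table_name not in self.element_types:
            raise ValueError(f"{table_name!r} is not a registered dataset element type")


config = ToyDatasetConfig()
config.add_element_type("Subject")
config.check_member("Subject")      # accepted: registered above
try:
    config.check_member("Image")    # rejected: not registered yet
except ValueError as err:
    print(err)
```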


Now that we have configured our datasets, we need to define the dataset types so we can distinguish between them.

In [23]:
# Add the dataset type terms to the controlled vocabulary
ml_instance.add_term(vt.dataset_type, "DemoSet", description="A test dataset")
ml_instance.add_term(vt.dataset_type, 'Partitioned', description="A partitioned dataset for ML training.")
ml_instance.add_term(vt.dataset_type, "Subject", description="A test dataset")
ml_instance.add_term(vt.dataset_type, "Image", description="A test dataset")
ml_instance.add_term(vt.dataset_type, "Training", description="Training dataset")
ml_instance.add_term(vt.dataset_type, "Testing", description="Testing dataset")
ml_instance.add_term(vt.dataset_type, "Validation", description="Validation dataset")

ml_instance.list_vocabulary_terms(vt.dataset_type)

[VocabularyTerm(name='DemoSet', synonyms=[], id='ml-test:36R', uri='/id/36R', description='A test dataset', rid='36R'),
 VocabularyTerm(name='Partitioned', synonyms=[], id='ml-test:36T', uri='/id/36T', description='A partitioned dataset for ML training.', rid='36T'),
 VocabularyTerm(name='Subject', synonyms=[], id='ml-test:36W', uri='/id/36W', description='A test dataset', rid='36W'),
 VocabularyTerm(name='Image', synonyms=[], id='ml-test:36Y', uri='/id/36Y', description='A test dataset', rid='36Y'),
 VocabularyTerm(name='Training', synonyms=[], id='ml-test:370', uri='/id/370', description='Training dataset', rid='370'),
 VocabularyTerm(name='Testing', synonyms=[], id='ml-test:372', uri='/id/372', description='Testing dataset', rid='372'),
 VocabularyTerm(name='Validation', synonyms=[], id='ml-test:374', uri='/id/374', description='Validation dataset', rid='374')]

Now create datasets and populate them with elements from the test catalog.

In [24]:
system_columns = ['RCT', 'RMT', 'RCB', 'RMB']

subject_dataset = ml_instance.create_dataset(['DemoSet', 'Subject'], description="A subject dataset")
image_dataset = ml_instance.create_dataset(['DemoSet', 'Image'], description="A image training dataset")
datasets = pd.DataFrame(ml_instance.find_datasets()).drop(columns=system_columns)
display(datasets)

Unnamed: 0,RID,Description,Dataset_Type
0,376,A subject dataset,"[DemoSet, Subject]"
1,37C,A image training dataset,"[DemoSet, Image]"


Now that we have defined some datasets, we can add elements of the appropriate type to them. We can see what is in our new datasets by listing the dataset members.

In [25]:
dp = ml_instance.domain_path  # Each call returns a new path instance, so only call once...
subject_rids = [i['RID'] for i in dp.tables['Subject'].entities().fetch()]
image_rids = [i['RID'] for i in dp.tables['Image'].entities().fetch()]

ml_instance.add_dataset_members(dataset_rid=subject_dataset, members=subject_rids)
ml_instance.add_dataset_members(dataset_rid=image_dataset, members=image_rids)

# List the contents of our datasets, leaving out system columns such as modify time.
display(pd.DataFrame(ml_instance.list_dataset_members(subject_dataset)['Subject']).drop(columns=system_columns))
display(pd.DataFrame(ml_instance.list_dataset_members(image_dataset)['Image']).drop(columns=system_columns))

Unnamed: 0,RID,Name
0,2ZG,Thing1
1,2ZJ,Thing2
2,2ZM,Thing3
3,2ZP,Thing4
4,2ZR,Thing5
5,2ZT,Thing6
6,2ZW,Thing7
7,2ZY,Thing8
8,300,Thing9
9,302,Thing10


Unnamed: 0,RID,URL,Filename,Description,Length,MD5,Name,Subject
0,30R,/hatrac/image_assetsb71bd70629d930854e611f4147...,test_2ZG.txt,A test image,32,b71bd70629d930854e611f41474da51e,,2ZG
1,30T,/hatrac/image_assets904777ec32f6a0144f18ad78d1...,test_2ZJ.txt,A test image,32,904777ec32f6a0144f18ad78d176c685,,2ZJ
2,30W,/hatrac/image_assetscfee07cf5b6351e526b1c2712f...,test_2ZM.txt,A test image,31,cfee07cf5b6351e526b1c2712f6b31cd,,2ZM
3,30Y,/hatrac/image_assetsfb2d55705ece42d11fd2c38775...,test_2ZP.txt,A test image,32,fb2d55705ece42d11fd2c387753bc95e,,2ZP
4,310,/hatrac/image_assets02c11010086e04e6869697a683...,test_2ZR.txt,A test image,32,02c11010086e04e6869697a683b44ba5,,2ZR
5,312,/hatrac/image_assets931083d1b85e4b82555f12a0dd...,test_2ZT.txt,A test image,32,931083d1b85e4b82555f12a0dd78822d,,2ZT
6,314,/hatrac/image_assetsae8262448684327c44f0f06e65...,test_2ZW.txt,A test image,31,ae8262448684327c44f0f06e65771246,,2ZW
7,316,/hatrac/image_assets11bc191fdba828cb80afe1f69b...,test_2ZY.txt,A test image,31,11bc191fdba828cb80afe1f69b331d2f,,2ZY
8,318,/hatrac/image_assets9fb797ffe42dd0d826bb480f22...,test_300.txt,A test image,31,9fb797ffe42dd0d826bb480f22a5fd20,,300
9,31A,/hatrac/image_assetsbfb5f4686e4ba384ad5482bfc6...,test_302.txt,A test image,31,bfb5f4686e4ba384ad5482bfc6654392,,302


## Create partitioned dataset

Now let's create some subsets of the original dataset based on the metadata values of the subjects. We will download the subject dataset and look at its metadata to figure out how to partition the original data. Since we are not going to look at the images, we use `download_dataset_bag` rather than `materialize_dataset_bag`.

In [26]:
bag_path, bag_rid = ml_instance.download_dataset_bag(subject_dataset)
ml_instance.materialize_dataset_bag(subject_dataset)
print(f"Bag materialized to {bag_path}")

2024-10-17 22:44:40,584 - INFO - Initializing downloader: GenericDownloader v1.7.4 [Python 3.12.3, macOS-15.0.1-x86_64-i386-64bit]
2024-10-17 22:44:40,586 - INFO - Creating client of type <class 'globus_sdk.services.auth.client.native_client.NativeAppAuthClient'> for service "auth"
2024-10-17 22:44:40,586 - INFO - Finished initializing AuthLoginClient. client_id='8ef15ba9-2b4a-469c-a163-7fd910c9d111', type(authorizer)=<class 'globus_sdk.authorizers.base.NullAuthorizer'>
2024-10-17 22:44:40,589 - INFO - Validating credentials for host: dev.eye-ai.org
2024-10-17 22:44:40,738 - INFO - Creating bag directory: /var/folders/0k/27qzm97x3t7g3j1m6ksf_9f40000gn/T/tmp28t2efvs/Dataset_376
2024-10-17 22:44:40,740 - INFO - Creating bag for directory /var/folders/0k/27qzm97x3t7g3j1m6ksf_9f40000gn/T/tmp28t2efvs/Dataset_376
2024-10-17 22:44:40,740 - INFO - Creating data directory
2024-10-17 22:44:40,741 - INFO - Moving /private/var/folders/0k/27qzm97x3t7g3j1m6ksf_9f40000gn/T/tmp28t2efvs/Dataset_376/tmp

Bag materialized to /private/var/folders/0k/27qzm97x3t7g3j1m6ksf_9f40000gn/T/tmp3qfll_e6/376_555b0f19b3d8c1d1520530a402efda9b86d81bc5d8cdd8e93191ae7a3a7f8979/Dataset_376
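
Once a bag is on disk, its tables are ordinary CSV files under the bag's `data` directory. The sketch below builds a tiny stand-in bag in a temporary directory and reads it with `pathlib` and `csv`, using absolute paths so the working directory never changes. The layout (`data/Subject/Subject.csv` with images under `data/Subject/Image/Image.csv`) is assumed from the paths used later in this notebook, and all rows are made up.

```python
import csv
import tempfile
from pathlib import Path


def read_rows(csv_path: Path) -> list[dict[str, str]]:
    """Read a CSV file into a list of dicts keyed by column name."""
    with csv_path.open(newline="") as f:
        return list(csv.DictReader(f))


# Build a tiny stand-in for a downloaded bag (made-up rows, assumed layout).
bag_root = Path(tempfile.mkdtemp()) / "Dataset_demo"
subject_dir = bag_root / "data" / "Subject"
(subject_dir / "Image").mkdir(parents=True)
(subject_dir / "Subject.csv").write_text("RID,Name\nS1,Thing1\nS2,Thing2\n")
(subject_dir / "Image" / "Image.csv").write_text("RID,URL,Subject\nI1,/u/1,S1\nI2,/u/2,S2\n")

# Absolute paths keep the process working directory untouched.
subjects = read_rows(subject_dir / "Subject.csv")
images = read_rows(subject_dir / "Image" / "Image.csv")
print([s["Name"] for s in subjects])
print([i["Subject"] for i in images])
```

Building paths from the bag root, rather than calling `os.chdir`, means later cells that use relative paths are not affected by the order in which cells are run.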


The domain model has two objects, Subject and Image. Each image is associated with one subject, but a subject can have multiple images associated with it. Let's look at the subjects and partition them into testing and training datasets.

In [27]:
print(f"Bag path is: {bag_path}")
os.chdir(bag_path / 'data/Subject')
%ls 

# Get information about the subjects and their images.
subject_df = pd.read_csv('Subject.csv', usecols=['RID', 'Name'])
image_df = pd.read_csv('Image/Image.csv', usecols=['RID', 'Subject', 'URL'])
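# Match each image to its subject by key rather than by row position.
metadata_df = subject_df.merge(image_df, left_on="RID", right_on="Subject", suffixes=("_subject", "_image"))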
display(metadata_df)

Bag path is: /private/var/folders/0k/27qzm97x3t7g3j1m6ksf_9f40000gn/T/tmp3qfll_e6/376_555b0f19b3d8c1d1520530a402efda9b86d81bc5d8cdd8e93191ae7a3a7f8979/Dataset_376
Image/       Subject.csv


Unnamed: 0,RID_subject,Name,RID_image,URL,Subject
0,2ZG,Thing1,30R,/hatrac/image_assetsb71bd70629d930854e611f4147...,2ZG
1,2ZJ,Thing2,30T,/hatrac/image_assets904777ec32f6a0144f18ad78d1...,2ZJ
2,2ZM,Thing3,30W,/hatrac/image_assetscfee07cf5b6351e526b1c2712f...,2ZM
3,2ZP,Thing4,30Y,/hatrac/image_assetsfb2d55705ece42d11fd2c38775...,2ZP
4,2ZR,Thing5,310,/hatrac/image_assets02c11010086e04e6869697a683...,2ZR
5,2ZT,Thing6,312,/hatrac/image_assets931083d1b85e4b82555f12a0dd...,2ZT
6,2ZW,Thing7,314,/hatrac/image_assetsae8262448684327c44f0f06e65...,2ZW
7,2ZY,Thing8,316,/hatrac/image_assets11bc191fdba828cb80afe1f69b...,2ZY
8,300,Thing9,318,/hatrac/image_assets9fb797ffe42dd0d826bb480f22...,300
9,302,Thing10,31A,/hatrac/image_assetsbfb5f4686e4ba384ad5482bfc6...,302
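
Pairing subjects with their images depends on how the two tables are combined. `DataFrame.join` aligns rows on the index, so it only gives the right answer when both tables happen to be in the same order, while `DataFrame.merge` can match on the `Subject` foreign key. The sketch below uses small made-up frames, with the image rows deliberately shuffled, to show the difference.

```python
import pandas as pd

subjects = pd.DataFrame({"RID": ["S1", "S2", "S3"], "Name": ["Thing1", "Thing2", "Thing3"]})
# Image rows deliberately listed in a different order than the subjects.
images = pd.DataFrame({"RID": ["I3", "I1", "I2"], "Subject": ["S3", "S1", "S2"]})

# Positional join: pairs row 0 with row 0, regardless of the Subject key.
by_position = subjects.join(images, lsuffix="_subject", rsuffix="_image")

# Key-based merge: pairs each image with the subject it references.
by_key = subjects.merge(images, left_on="RID", right_on="Subject", suffixes=("_subject", "_image"))

print(by_position[["Name", "Subject"]])
print(by_key[["Name", "Subject"]])
```

In the merged frame every row satisfies `RID_subject == Subject`. A merge also handles subjects with several images, which a positional join cannot.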


For this example, let's partition the data based on the name of the subject. In a real application, we would do a more careful analysis to decide which subset goes into each dataset.

In [28]:
def thing_number(name: pd.Series) -> pd.Series:
    return name.map(lambda n: int(n.replace('Thing','')))

training_rids = metadata_df.loc[lambda df: thing_number(df['Name']) % 3 == 0]['RID_image'].tolist()
testing_rids = metadata_df.loc[lambda df: thing_number(df['Name']) % 3 == 1]['RID_image'].tolist()
validation_rids = metadata_df.loc[lambda df: thing_number(df['Name']) % 3 == 2]['RID_image'].tolist()

print(f'Training images: {training_rids}')
print(f'Testing images: {testing_rids}')
print(f'Validation images: {validation_rids}')


Training images: ['30W', '312', '318', '31E', '31M', '31T']
Testing images: ['30R', '30Y', '314', '31A', '31G', '31P', '31W']
Validation images: ['30T', '310', '316', '31C', '31J', '31R', '31Y']
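
The modulo rule used above can be seen in isolation on plain Python lists. This is a minimal sketch with made-up names `Thing1` through `Thing10`; no catalog is involved, and the bucket labels simply mirror the remainders 0, 1 and 2 used in the cell above.

```python
def thing_number(name: str) -> int:
    """Extract the trailing integer from names like 'Thing7'."""
    return int(name.removeprefix("Thing"))


names = [f"Thing{i}" for i in range(1, 11)]      # Thing1 .. Thing10
labels = ["training", "testing", "validation"]   # remainder 0, 1, 2
buckets = {label: [] for label in labels}

for name in names:
    buckets[labels[thing_number(name) % 3]].append(name)

# Every name lands in exactly one bucket.
assigned = [n for members in buckets.values() for n in members]
assert sorted(assigned) == sorted(names)
print(buckets)
```

A rule like this is deterministic and keeps the partitions disjoint, but it does not balance the partitions on any property of the data.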


Now that we know what we want in each dataset, let's create a dataset for each partition, along with a nested dataset to track the entire collection.

In [29]:
nested_dataset = ml_instance.create_dataset(['Partitioned', 'Image'], description='A nested dataset for machine learning')
training_dataset = ml_instance.create_dataset('Training', description='An image dataset for training')
testing_dataset = ml_instance.create_dataset('Testing', description='A image dataset for testing')
validation_dataset = ml_instance.create_dataset('Validation', description='A image dataset for validation')
pd.DataFrame(ml_instance.find_datasets()).drop(columns=system_columns)

Unnamed: 0,RID,Description,Dataset_Type
0,376,A subject dataset,"[DemoSet, Subject]"
1,37C,A image training dataset,"[DemoSet, Image]"
2,3A2,A nested dataset for machine learning,"[Partitioned, Image]"
3,3A8,An image dataset for training,[Training]
4,3AC,A image dataset for testing,[Testing]
5,3AG,A image dataset for validation,[Validation]


And then fill the datasets with the appropriate members.

In [30]:

ml_instance.add_dataset_members(dataset_rid=nested_dataset, members=[training_dataset, testing_dataset, validation_dataset])
ml_instance.add_dataset_members(dataset_rid=training_dataset, members=training_rids)
ml_instance.add_dataset_members(dataset_rid=testing_dataset, members=testing_rids)
ml_instance.add_dataset_members(dataset_rid=validation_dataset, members=validation_rids)


'3AG'

OK, let's see what we have now.

In [31]:
display(Markdown('## Nested Dataset'))
display(pd.DataFrame(ml_instance.list_dataset_members(nested_dataset)['Dataset']).drop(columns=system_columns))
display(Markdown('## Training Dataset'))
display(pd.DataFrame(ml_instance.list_dataset_members(training_dataset)['Image']).drop(columns=system_columns))
display(Markdown('## Testing Dataset'))
display(pd.DataFrame(ml_instance.list_dataset_members(testing_dataset)['Image']).drop(columns=system_columns))
display(Markdown('## Validation Dataset'))
display(pd.DataFrame(ml_instance.list_dataset_members(validation_dataset)['Image']).drop(columns=system_columns))

## Nested Dataset

Unnamed: 0,RID,Description
0,3A2,A nested dataset for machine learning


## Training Dataset

Unnamed: 0,RID,URL,Filename,Description,Length,MD5,Name,Subject
0,30W,/hatrac/image_assetscfee07cf5b6351e526b1c2712f...,test_2ZM.txt,A test image,31,cfee07cf5b6351e526b1c2712f6b31cd,,2ZM
1,312,/hatrac/image_assets931083d1b85e4b82555f12a0dd...,test_2ZT.txt,A test image,32,931083d1b85e4b82555f12a0dd78822d,,2ZT
2,318,/hatrac/image_assets9fb797ffe42dd0d826bb480f22...,test_300.txt,A test image,31,9fb797ffe42dd0d826bb480f22a5fd20,,300
3,31E,/hatrac/image_assetsaa6ff2d033b6902a56b942ec6b...,test_306.txt,A test image,31,aa6ff2d033b6902a56b942ec6b71cb3b,,306
4,31M,/hatrac/image_assetsfc84e37bcf1de39e01b51ceae4...,test_30C.txt,A test image,30,fc84e37bcf1de39e01b51ceae43911a5,,30C
5,31T,/hatrac/image_assets7a924c67c2073a9268a6e7ce0a...,test_30J.txt,A test image,31,7a924c67c2073a9268a6e7ce0a603b30,,30J


## Testing Dataset

Unnamed: 0,RID,URL,Filename,Description,Length,MD5,Name,Subject
0,30R,/hatrac/image_assetsb71bd70629d930854e611f4147...,test_2ZG.txt,A test image,32,b71bd70629d930854e611f41474da51e,,2ZG
1,30Y,/hatrac/image_assetsfb2d55705ece42d11fd2c38775...,test_2ZP.txt,A test image,32,fb2d55705ece42d11fd2c387753bc95e,,2ZP
2,314,/hatrac/image_assetsae8262448684327c44f0f06e65...,test_2ZW.txt,A test image,31,ae8262448684327c44f0f06e65771246,,2ZW
3,31A,/hatrac/image_assetsbfb5f4686e4ba384ad5482bfc6...,test_302.txt,A test image,31,bfb5f4686e4ba384ad5482bfc6654392,,302
4,31G,/hatrac/image_assets37f3a0c54155407ca092c8b77c...,test_308.txt,A test image,31,37f3a0c54155407ca092c8b77c60fce3,,308
5,31P,/hatrac/image_assetsae4018e9a8c82d1f12d2e843ef...,test_30E.txt,A test image,32,ae4018e9a8c82d1f12d2e843ef179414,,30E
6,31W,/hatrac/image_assets694c841965032e137b61ecfbce...,test_30M.txt,A test image,31,694c841965032e137b61ecfbce1a3b65,,30M


## Validation Dataset

Unnamed: 0,RID,URL,Filename,Description,Length,MD5,Name,Subject
0,30T,/hatrac/image_assets904777ec32f6a0144f18ad78d1...,test_2ZJ.txt,A test image,32,904777ec32f6a0144f18ad78d176c685,,2ZJ
1,310,/hatrac/image_assets02c11010086e04e6869697a683...,test_2ZR.txt,A test image,32,02c11010086e04e6869697a683b44ba5,,2ZR
2,316,/hatrac/image_assets11bc191fdba828cb80afe1f69b...,test_2ZY.txt,A test image,31,11bc191fdba828cb80afe1f69b331d2f,,2ZY
3,31C,/hatrac/image_assetsbed7cb2db57317cbf792f78d72...,test_304.txt,A test image,31,bed7cb2db57317cbf792f78d7277285a,,304
4,31J,/hatrac/image_assets44e4c5cd1d8ce9fa66658c0be7...,test_30A.txt,A test image,31,44e4c5cd1d8ce9fa66658c0be728e85b,,30A
5,31R,/hatrac/image_assetsa4b9342ffeb1630803aec494ef...,test_30G.txt,A test image,31,a4b9342ffeb1630803aec494ef7a869d,,30G
6,31Y,/hatrac/image_assets7f64d569608a69df9a3ac7f5b5...,test_30P.txt,A test image,32,7f64d569608a69df9a3ac7f5b53035af,,30P
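
Because a dataset can contain other datasets, collecting everything reachable from a parent means walking the nesting recursively. The sketch below does this over a toy membership dictionary; the dataset names and `img-*` identifiers are made up, and `leaf_members` is a hypothetical helper, not part of DerivaML.

```python
# Toy membership map: dataset name -> direct members (made-up identifiers).
members = {
    "nested": ["training", "testing", "validation"],
    "training": ["img-3", "img-6"],
    "testing": ["img-1", "img-4"],
    "validation": ["img-2", "img-5"],
}


def leaf_members(dataset: str, seen: frozenset[str] = frozenset()) -> list[str]:
    """Collect non-dataset members reachable from `dataset`, depth first."""
    if dataset in seen:
        raise ValueError(f"cycle detected at {dataset!r}")
    leaves = []
    for member in members[dataset]:
        if member in members:            # member is itself a dataset
            leaves.extend(leaf_members(member, seen | {dataset}))
        else:
            leaves.append(member)
    return leaves


print(leaf_members("nested"))
```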


As our very last step, let's get a PID that will allow us to share and cite the dataset that we just created.

In [32]:
ml_instance.cite(nested_dataset)

'https://dev.eye-ai.org/id/423/3A2@329-EEDR-2B0P'
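
The citation URL can be taken apart with the standard library. The sketch below assumes, based only on the shape of the URL above, that the path reads `/id/<catalog>/<RID>@<snapshot>`; the field names in the returned dictionary and the helper `split_citation` are illustrative, not part of any Deriva API.

```python
from urllib.parse import urlparse


def split_citation(url: str) -> dict[str, str]:
    """Split a citation URL assumed to look like /id/<catalog>/<rid>@<snapshot>."""
    parts = urlparse(url)
    prefix, catalog, tail = parts.path.strip("/").split("/")
    if prefix != "id":
        raise ValueError(f"unexpected path: {parts.path}")
    # Everything after '@' is treated as the snapshot identifier (assumption).
    rid, _, snapshot = tail.partition("@")
    return {"host": parts.netloc, "catalog": catalog, "rid": rid, "snapshot": snapshot}


print(split_citation("https://dev.eye-ai.org/id/423/3A2@329-EEDR-2B0P"))
```

Under that reading, the part after `@` pins the citation to a particular state of the catalog, so the cited dataset stays the same even if the catalog changes later.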

In [33]:
test_catalog.delete_ermrest_catalog(really=True)

<Response [204]>