# DerivaML Dataset Example

DerivaML is a class library built on the Deriva Scientific Asset Management system. It is designed to simplify the basic operations associated with building and testing ML libraries based on common toolkits such as TensorFlow. This notebook reviews the basic features of the DerivaML library.

## Set up DerivaML for the test case

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import os
import pandas as pd
from deriva.core.utils.globus_auth_utils import GlobusNativeLogin
from deriva_ml.schema_setup.test_catalog import create_test_catalog, DemoML

Set the details for the catalog we want and authenticate to the server if needed.

In [3]:
hostname = 'dev.eye-ai.org'
domain_schema = 'demo-schema'

gnl = GlobusNativeLogin(host=hostname)
if gnl.is_logged_in([hostname]):
    print("You are already logged in.")
else:
    gnl.login([hostname], no_local_server=True, no_browser=True, refresh_tokens=True, update_bdbag_keychain=True)
    print("Login Successful")


You are already logged in.


Create a test catalog and get an instance of the DemoML class.

In [4]:
test_catalog = create_test_catalog(hostname, domain_schema)
ml_instance = DemoML(hostname, test_catalog.catalog_id)

2024-10-14 17:19:41,355 - INFO - Creating client of type <class 'globus_sdk.services.auth.client.native_client.NativeAppAuthClient'> for service "auth"
2024-10-14 17:19:41,356 - INFO - Finished initializing AuthLoginClient. client_id='8ef15ba9-2b4a-469c-a163-7fd910c9d111', type(authorizer)=<class 'globus_sdk.authorizers.base.NullAuthorizer'>


In [5]:
print(f"Current dataset element types: {[a.name for a in ml_instance.list_dataset_element_types()]}")
ml_instance.add_dataset_element_type("Subject")
ml_instance.add_dataset_element_type("Image")
print(f"New dataset element types {[a.name for a in ml_instance.list_dataset_element_types()]}")

Current dataset element types: ['Dataset']
New dataset element types ['Dataset', 'Subject', 'Image']


## Configure DerivaML Datasets

Create vocabulary terms for the dataset types.

In [6]:
# Register the vocabulary terms used to label dataset types
ml_instance.add_term("Dataset_Type", "DemoSet", description="A test dataset")
ml_instance.add_term('Dataset_Type', 'Partitioned', description="A partitioned dataset for ML training.")
ml_instance.add_term("Dataset_Type", "Subject", description="A test dataset")
ml_instance.add_term("Dataset_Type", "Image", description="A test dataset")
ml_instance.add_term("Dataset_Type", "Training", description="Training dataset")
ml_instance.add_term("Dataset_Type", "Testing", description="Testing dataset")
ml_instance.add_term("Dataset_Type", "Validation", description="Validation dataset")

VocabularyTerm(name='Validation', synonyms=[], id='ml-test:374', uri='/id/374', description='Validation dataset', rid='374')

Now create datasets and populate them with elements from the test catalog.

In [7]:
system_columns = ['RCT', 'RMT', 'RCB', 'RMB']

subject_dataset = ml_instance.create_dataset(['DemoSet', 'Subject'], description="A subject dataset")
image_dataset = ml_instance.create_dataset(['DemoSet', 'Image'], description="An image training dataset")
datasets = pd.DataFrame(ml_instance.find_datasets()).drop(columns=system_columns)
display(datasets)

| | RID | Description | Dataset_Type |
|---|---|---|---|
| 0 | 376 | A subject dataset | [DemoSet, Subject] |
| 1 | 37C | An image training dataset | [DemoSet, Image] |
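
The `system_columns` list names the ERMrest bookkeeping columns that record when and by whom a row was created or last modified (RCT, RMT, RCB, RMB). As a rough, catalog-free sketch of what dropping them does, the snippet below strips those keys from plain dictionaries. The helper `drop_system_columns` is hypothetical and the `None` values are placeholders, not real catalog data.

```python
SYSTEM_COLUMNS = ("RCT", "RMT", "RCB", "RMB")


def drop_system_columns(records, system_columns=SYSTEM_COLUMNS):
    """Return copies of the records without the system bookkeeping keys (hypothetical helper)."""
    return [
        {key: value for key, value in record.items() if key not in system_columns}
        for record in records
    ]


# Placeholder record: only RID and Description mirror the table above.
records = [
    {"RID": "376", "Description": "A subject dataset",
     "RCT": None, "RMT": None, "RCB": None, "RMB": None},
]
cleaned = drop_system_columns(records)
print(cleaned)
```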


In [8]:
dp = ml_instance.domain_path  # Each call returns a new path instance, so only call once...
subject_rids = [i['RID'] for i in dp.tables['Subject'].entities().fetch()]
image_rids = [i['RID'] for i in dp.tables['Image'].entities().fetch()]

ml_instance.add_dataset_members(dataset_rid=subject_dataset, members=subject_rids)
ml_instance.add_dataset_members(dataset_rid=image_dataset, members=image_rids)

display(pd.DataFrame(ml_instance.list_dataset_members(subject_dataset)['Subject']).drop(columns=system_columns))
display(pd.DataFrame(ml_instance.list_dataset_members(image_dataset)['Image']).drop(columns=system_columns))

| | RID | Name |
|---|---|---|
| 0 | 2ZG | Thing1 |
| 1 | 2ZJ | Thing2 |
| 2 | 2ZM | Thing3 |
| 3 | 2ZP | Thing4 |
| 4 | 2ZR | Thing5 |
| 5 | 2ZT | Thing6 |
| 6 | 2ZW | Thing7 |
| 7 | 2ZY | Thing8 |
| 8 | 300 | Thing9 |
| 9 | 302 | Thing10 |


| | RID | URL | Filename | Description | Length | MD5 | Name | Subject |
|---|---|---|---|---|---|---|---|---|
| 0 | 30R | /hatrac/image_assets3d762cd638ee14fd1778152881... | test_2ZG.txt | A test image | 32 | 3d762cd638ee14fd17781528810eb375 | | 2ZG |
| 1 | 30T | /hatrac/image_assetsc972aa455de02391560fd9c83b... | test_2ZJ.txt | A test image | 32 | c972aa455de02391560fd9c83b7a39bb | | 2ZJ |
| 2 | 30W | /hatrac/image_assets25d1c9ab9bdc672cc697bf8285... | test_2ZM.txt | A test image | 31 | 25d1c9ab9bdc672cc697bf82854e5dcc | | 2ZM |
| 3 | 30Y | /hatrac/image_assetsfaeeddf3382e84943054a74d8d... | test_2ZP.txt | A test image | 31 | faeeddf3382e84943054a74d8d841168 | | 2ZP |
| 4 | 310 | /hatrac/image_assetsdf50b50b23d47b83fea7ad8481... | test_2ZR.txt | A test image | 31 | df50b50b23d47b83fea7ad8481395232 | | 2ZR |
| 5 | 312 | /hatrac/image_assets306c9b49da2f2ccde8270bdf77... | test_2ZT.txt | A test image | 32 | 306c9b49da2f2ccde8270bdf77b9675d | | 2ZT |
| 6 | 314 | /hatrac/image_assetsd6344967e4eb42ce591e45814e... | test_2ZW.txt | A test image | 31 | d6344967e4eb42ce591e45814ea26af1 | | 2ZW |
| 7 | 316 | /hatrac/image_assets3e53309903336e89c39b965489... | test_2ZY.txt | A test image | 31 | 3e53309903336e89c39b965489f2133d | | 2ZY |
| 8 | 318 | /hatrac/image_assets5bafbe257a1d725e081fe5dcf7... | test_300.txt | A test image | 31 | 5bafbe257a1d725e081fe5dcf7fd8f6c | | 300 |
| 9 | 31A | /hatrac/image_assetsa663e84cad864d22e7ade7d73f... | test_302.txt | A test image | 31 | a663e84cad864d22e7ade7d73fb8418b | | 302 |


## Create partitioned dataset

Now let's create some subsets of the original dataset based on subject-level metadata. To do this, we download the subject dataset and look at its metadata to figure out how to partition the original data. Since we are not going to look at the images, we use download_dataset_bag rather than materialize_bdbag.

In [17]:
bag_path, bag_rid = ml_instance.download_dataset_bag(subject_dataset)
print(f"Bag downloaded to {bag_path}")

2024-10-14 17:34:56,752 - INFO - Initializing downloader: GenericDownloader v1.7.4 [Python 3.12.3, macOS-15.0.1-x86_64-i386-64bit]
2024-10-14 17:34:56,754 - INFO - Creating client of type <class 'globus_sdk.services.auth.client.native_client.NativeAppAuthClient'> for service "auth"
2024-10-14 17:34:56,754 - INFO - Finished initializing AuthLoginClient. client_id='8ef15ba9-2b4a-469c-a163-7fd910c9d111', type(authorizer)=<class 'globus_sdk.authorizers.base.NullAuthorizer'>
2024-10-14 17:34:56,757 - INFO - Validating credentials for host: dev.eye-ai.org
2024-10-14 17:34:56,904 - INFO - Creating bag directory: /var/folders/0k/27qzm97x3t7g3j1m6ksf_9f40000gn/T/tmpinpl3c34/Dataset_376
2024-10-14 17:34:56,905 - INFO - Creating bag for directory /var/folders/0k/27qzm97x3t7g3j1m6ksf_9f40000gn/T/tmpinpl3c34/Dataset_376
2024-10-14 17:34:56,906 - INFO - Creating data directory
2024-10-14 17:34:56,907 - INFO - Moving /private/var/folders/0k/27qzm97x3t7g3j1m6ksf_9f40000gn/T/tmpinpl3c34/Dataset_376/tmp

The domain model has two objects, Subject and Image. Each Image is associated with one Subject, but a Subject can have multiple Images associated with it. Let's look at the subjects and partition the images into training, testing, and validation datasets.

In [None]:
print(f"Bag path is: {bag_path}")
os.chdir(bag_path / 'data/Subject')
%ls

# Load the subject and image metadata from the CSV files in the bag.
subject_df = pd.read_csv('Subject.csv', usecols=['RID', 'Name'])
image_df = pd.read_csv('Image/Image.csv', usecols=['RID', 'Subject', 'URL'])
# Match each image to its subject by RID rather than by row position.
metadata_df = subject_df.merge(image_df, left_on='RID', right_on='Subject', suffixes=('_subject', '_image'))
display(metadata_df)
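
As a catalog-free illustration of the one-to-many relationship between Subject and Image, the following sketch groups image records by the RID of their subject using only the standard library. The records are copied from the first three rows of the tables shown earlier; the helper name `images_by_subject` is hypothetical and not part of DerivaML.

```python
from collections import defaultdict

# (RID, Name) subject rows and (RID, Subject) image rows copied from the tables above.
subjects = [("2ZG", "Thing1"), ("2ZJ", "Thing2"), ("2ZM", "Thing3")]
images = [("30R", "2ZG"), ("30T", "2ZJ"), ("30W", "2ZM")]


def images_by_subject(image_rows):
    """Group image RIDs under the RID of the subject they belong to (hypothetical helper)."""
    grouped = defaultdict(list)
    for image_rid, subject_rid in image_rows:
        grouped[subject_rid].append(image_rid)
    return dict(grouped)


grouped = images_by_subject(images)
for subject_rid, name in subjects:
    # A subject may own several images, so each lookup yields a list.
    print(name, grouped.get(subject_rid, []))
```

Matching on the subject RID rather than on row position keeps each image paired with the right subject even when the two CSV files list their rows in different orders.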

In [None]:
def thing_number(name: pd.Series) -> pd.Series:
    """Extract the integer N from subject names of the form 'ThingN'."""
    return name.map(lambda n: int(n.replace('Thing','')))

# Assign each image to a partition based on its subject's number modulo 3.
training_rids = metadata_df.loc[lambda df: thing_number(df['Name']) % 3 == 0]['RID_image'].tolist()
testing_rids = metadata_df.loc[lambda df: thing_number(df['Name']) % 3 == 1]['RID_image'].tolist()
validation_rids = metadata_df.loc[lambda df: thing_number(df['Name']) % 3 == 2]['RID_image'].tolist()
print(f'Training images: {training_rids}')
print(f'Testing images: {testing_rids}')
print(f'Validation images: {validation_rids}')
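
Because the cell above has no recorded output in this copy of the notebook, the following standalone sketch applies the same modulo-3 rule to the subject names and image RIDs listed in the earlier tables. It assumes every name has the form `Thing<N>`; the names `subject_images` and `partition_by_thing_number` are illustrative only, and a freshly created catalog will assign different RIDs.

```python
# Subject name -> image RID, copied from the Subject and Image tables above.
subject_images = {
    "Thing1": "30R", "Thing2": "30T", "Thing3": "30W", "Thing4": "30Y",
    "Thing5": "310", "Thing6": "312", "Thing7": "314", "Thing8": "316",
    "Thing9": "318", "Thing10": "31A",
}


def partition_by_thing_number(mapping):
    """Split image RIDs into three lists keyed by (subject number mod 3)."""
    buckets = {0: [], 1: [], 2: []}
    for name, image_rid in mapping.items():
        number = int(name.replace("Thing", ""))
        buckets[number % 3].append(image_rid)
    return buckets


buckets = partition_by_thing_number(subject_images)
print("Training images:", buckets[0])    # ['30W', '312', '318']
print("Testing images:", buckets[1])     # ['30R', '30Y', '314', '31A']
print("Validation images:", buckets[2])  # ['30T', '310', '316']
```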


In [None]:
nested_dataset = ml_instance.create_dataset(['Partitioned', 'Image'], description='A nested dataset for machine learning')
training_dataset = ml_instance.create_dataset('Training', description='An image dataset for training')
testing_dataset = ml_instance.create_dataset('Testing', description='An image dataset for testing')
validation_dataset = ml_instance.create_dataset('Validation', description='An image dataset for validation')
pd.DataFrame(ml_instance.find_datasets()).drop(columns=system_columns)

In [None]:

ml_instance.add_dataset_members(dataset_rid=nested_dataset, members=[training_dataset, testing_dataset, validation_dataset])
ml_instance.add_dataset_members(dataset_rid=training_dataset, members=training_rids)
ml_instance.add_dataset_members(dataset_rid=testing_dataset, members=testing_rids)
ml_instance.add_dataset_members(dataset_rid=validation_dataset, members=validation_rids)
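
A check like the following, run before the `add_dataset_members` calls above, can confirm that the three partitions are disjoint and together cover every image. It is a minimal sketch with a hypothetical helper, `check_partition`; it does not call DerivaML and works on plain lists of RIDs, here the ones shown in the Image table earlier in the notebook.

```python
def check_partition(all_rids, *parts):
    """Return True when the parts are pairwise disjoint and jointly cover all_rids."""
    seen = set()
    for part in parts:
        part_set = set(part)
        # An RID repeated within a part or shared between parts is a failure.
        if len(part_set) != len(part) or seen & part_set:
            return False
        seen |= part_set
    return seen == set(all_rids)


# Illustrative RIDs taken from the Image table earlier in the notebook.
all_images = ["30R", "30T", "30W", "30Y", "310", "312", "314", "316", "318", "31A"]
training = ["30W", "312", "318"]
testing = ["30R", "30Y", "314", "31A"]
validation = ["30T", "310", "316"]

partition_ok = check_partition(all_images, training, testing, validation)
print(partition_ok)
```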


OK, let's see what we have now.

In [None]:
display(pd.DataFrame(ml_instance.list_dataset_members(nested_dataset)['Dataset']).drop(columns=system_columns))
display(pd.DataFrame(ml_instance.list_dataset_members(training_dataset)['Image']).drop(columns=system_columns))
display(pd.DataFrame(ml_instance.list_dataset_members(testing_dataset)['Image']).drop(columns=system_columns))
display(pd.DataFrame(ml_instance.list_dataset_members(validation_dataset)['Image']).drop(columns=system_columns))

In [None]:
ml_instance.cite(nested_dataset)

In [None]:
test_catalog.delete_ermrest_catalog(really=True)