DerivaML is a class library built on the Deriva scientific asset management system. It is designed to simplify many of the basic operations associated with building and testing ML libraries based on common toolkits such as TensorFlow. This notebook reviews the basic features of the DerivaML library.

In [None]:
%load_ext autoreload
%autoreload 2

In [None]:
import pandas as pd
from deriva.core.utils.globus_auth_utils import GlobusNativeLogin
from deriva_ml.schema_setup.test_catalog import create_test_catalog, DemoML
from math import floor

Set the details for the catalog we want and authenticate to the server if needed.

In [None]:
hostname = 'dev.eye-ai.org'
domain_schema = 'demo-schema'

gnl = GlobusNativeLogin(host=hostname)
if gnl.is_logged_in([hostname]):
    print("You are already logged in.")
else:
    gnl.login([hostname], no_local_server=True, no_browser=True, refresh_tokens=True, update_bdbag_keychain=True)
    print("Login Successful")


Create a test catalog and get an instance of the DemoML class.

In [None]:
test_catalog = create_test_catalog(hostname, domain_schema)
ml_instance = DemoML(hostname, test_catalog.catalog_id)

In [None]:
print(f"Current dataset element types: {[a.name for a in ml_instance.list_dataset_element_types()]}")
ml_instance.add_dataset_element_type("Subject")
ml_instance.add_dataset_element_type("Image")
print(f"New dataset element types: {[a.name for a in ml_instance.list_dataset_element_types()]}")

In [None]:
# Create a new dataset
ml_instance.add_term("Dataset_Type", "DemoSet", description="A test dataset")
ml_instance.add_term('Dataset_Type', 'Partitioned', description="A partitioned dataset for ML training.")
ml_instance.add_term("Dataset_Type", "Subject", description="A test dataset")
ml_instance.add_term("Dataset_Type", "Image", description="A test dataset")
ml_instance.add_term("Dataset_Type", "Training", description="Training dataset")
ml_instance.add_term("Dataset_Type", "Testing", description="Testing dataset")
ml_instance.add_term("Dataset_Type", "Validation", description="Validation dataset")

subject_dataset = ml_instance.create_dataset(['DemoSet', 'Subject'], description="A subject dataset")
image_dataset = ml_instance.create_dataset(['DemoSet', 'Image'], description="An image training dataset")

dp = ml_instance.domain_path  # Each call returns a new path instance, so only call once...
subject_rids = [i['RID'] for i in dp.tables['Subject'].entities().fetch()]
image_rids = [i['RID'] for i in dp.tables['Image'].entities().fetch()]

ml_instance.add_dataset_members(dataset_rid=subject_dataset, members=subject_rids)
ml_instance.add_dataset_members(dataset_rid=image_dataset, members=image_rids)

In [None]:
def strip_system(d):
    return {k:v for k,v in d.items() if k not in ['RCT', 'RMT', 'RCB', 'RMB']}

display(pd.DataFrame([strip_system(d) for d in ml_instance.list_dataset_members(subject_dataset)['Subject']]))
display(pd.DataFrame([strip_system(d) for d in ml_instance.list_dataset_members(image_dataset)['Image']]))
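
To make the effect of `strip_system` concrete, here is a minimal standalone sketch that needs no catalog connection. The record below is hypothetical; real rows come from `list_dataset_members()`. `RCT`, `RMT`, `RCB`, and `RMB` are the ERMrest system columns for row creation time, modification time, creator, and last modifier, while `RID` is kept because it identifies the row.

```python
# Columns maintained by the catalog itself rather than by the user.
SYSTEM_COLUMNS = ('RCT', 'RMT', 'RCB', 'RMB')

def strip_system(d):
    """Return a copy of the record without the system-managed columns."""
    return {k: v for k, v in d.items() if k not in SYSTEM_COLUMNS}

# Hypothetical record for illustration only.
record = {
    'RID': '1-ABC',
    'Name': 'Thing1',
    'RCT': '2024-01-01T00:00:00',
    'RMT': '2024-01-02T00:00:00',
    'RCB': 'some-user',
    'RMB': 'some-user',
}

cleaned = strip_system(record)
print(cleaned)
```

Only the `RID` and `Name` keys remain, which keeps the displayed tables compact.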


Now let's create some subsets of the original dataset based on subject-level metadata. We will download the subject dataset and look at its metadata to figure out how to partition the original data.

In [None]:
bag_path, bag_rid = ml_instance.materialize_bdbag(subject_dataset)
print(f"Bag materialized to {bag_path}")

The domain model has two objects, Subject and Image: each Image is associated with one Subject, but a Subject can have multiple Images associated with it. Let's look at the subjects and partition them into training, testing, and validation datasets.

In [None]:
import os
import csv
print(f"Bag path is: {bag_path}")
os.chdir(bag_path / 'data/Subject')
%ls 

# Get information about the subjects.
with open('Subject.csv') as csvfile:
    subject_map = {s['RID']: {'Subject RID': s['RID'], 'Name': s['Name']} for s in csv.DictReader(csvfile)}

# Combine with the image metadata (let's assume that there is only one image per subject).
with open('Image/Image.csv') as csvfile:
    metadata = [subject_map[row['Subject']] | {'Image RID': row['RID'], 'URL': row['URL']} for row in csv.DictReader(csvfile)]

display(pd.DataFrame(metadata))
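
The join in the cell above can be tried without a materialized bag. The sketch below uses small in-memory CSV strings as stand-ins for `Subject.csv` and `Image.csv`; the RIDs, names, and URLs are invented for illustration. Because the join iterates over image rows, it yields one output row per image, so it also works if a subject happens to have several images.

```python
import csv
import io

# Hypothetical CSV contents standing in for the files inside the bag.
subject_csv = "RID,Name\nS1,Thing1\nS2,Thing2\n"
image_csv = "RID,Subject,URL\nI1,S1,/img/a\nI2,S1,/img/b\nI3,S2,/img/c\n"

# Index subjects by RID so each image row can look up its subject.
subject_map = {
    s['RID']: {'Subject RID': s['RID'], 'Name': s['Name']}
    for s in csv.DictReader(io.StringIO(subject_csv))
}

# Merge the subject fields into each image row (dict union needs Python 3.9+).
metadata = [
    subject_map[row['Subject']] | {'Image RID': row['RID'], 'URL': row['URL']}
    for row in csv.DictReader(io.StringIO(image_csv))
]

for row in metadata:
    print(row)
```

Subject `S1` appears twice in the result, once for each of its images.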

In [None]:
def thing_number(name: str) -> int:
    """Return the numeric suffix of a subject name, e.g. 'Thing3' -> 3."""
    return int(name.replace('Thing',''))
    
training_rids = [s['Image RID'] for s in metadata if thing_number(s['Name']) % 3 == 0]
testing_rids =  [s['Image RID'] for s in metadata if thing_number(s['Name']) % 3 == 1]
validation_rids = [s['Image RID'] for s in metadata if thing_number(s['Name']) % 3 == 2]
print(f'Training images: {training_rids}')
print(f'Testing images: {testing_rids}')
print(f'Validation images: {validation_rids}')
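
The modulo-based split above can be checked on its own with made-up data. This sketch assumes subject names of the form `Thing<N>`, as the cell above does; the `IMG-<N>` identifiers and the `SPLITS` tuple are hypothetical names introduced here. Using the remainder of the subject number gives a deterministic, roughly even three-way split, not a random one.

```python
def thing_number(name: str) -> int:
    """Return the numeric suffix of a subject name such as 'Thing3'."""
    return int(name.replace('Thing', ''))

# Remainder 0 -> Training, 1 -> Testing, 2 -> Validation (same rule as above).
SPLITS = ('Training', 'Testing', 'Validation')

# Hypothetical metadata rows for subjects Thing1 .. Thing7.
metadata = [{'Name': f'Thing{i}', 'Image RID': f'IMG-{i}'} for i in range(1, 8)]

partitions = {split: [] for split in SPLITS}
for row in metadata:
    split = SPLITS[thing_number(row['Name']) % 3]
    partitions[split].append(row['Image RID'])

for split, rids in partitions.items():
    print(f'{split} images: {rids}')
```

Every image lands in exactly one partition, so the three lists are disjoint and together cover the whole dataset.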


In [None]:
nested_dataset = ml_instance.create_dataset(['Partitioned', 'Image'], description='A nested dataset for machine learning')
training_dataset = ml_instance.create_dataset('Training', description='An image dataset for training')
testing_dataset = ml_instance.create_dataset('Testing', description='An image dataset for testing')
validation_dataset = ml_instance.create_dataset('Validation', description='An image dataset for validation')

ml_instance.add_dataset_members(dataset_rid=nested_dataset, members=[training_dataset, testing_dataset, validation_dataset])
ml_instance.add_dataset_members(dataset_rid=training_dataset, members=training_rids)
ml_instance.add_dataset_members(dataset_rid=testing_dataset, members=testing_rids)
ml_instance.add_dataset_members(dataset_rid=validation_dataset, members=validation_rids)


OK, let's see what we have now.

In [None]:
pd.DataFrame([strip_system(d) for d in ml_instance.find_datasets()])

In [None]:
display(pd.DataFrame([strip_system(d) for d in ml_instance.list_dataset_members(nested_dataset)['Dataset']]))
display(pd.DataFrame([strip_system(d) for d in ml_instance.list_dataset_members(training_dataset)['Image']]))
display(pd.DataFrame([strip_system(d) for d in ml_instance.list_dataset_members(testing_dataset)['Image']]))
display(pd.DataFrame([strip_system(d) for d in ml_instance.list_dataset_members(validation_dataset)['Image']]))


In [None]:
display(pd.DataFrame([strip_system(m) for m in ml_instance.find_datasets()]))

In [None]:
ml_instance.cite(nested_dataset)

Now let's download a dataset so that we can compute on it locally.

In [None]:
# Clean up: permanently delete the test catalog created for this demo.
test_catalog.delete_ermrest_catalog(really=True)