# DerivaML Dataset Example

DerivaML is a class library built on the Deriva scientific asset management system. It simplifies many of the basic operations involved in building and testing ML libraries based on common toolkits such as TensorFlow. This notebook reviews the basic features of the DerivaML library.

## Set up DerivaML for the test case

In [None]:
%load_ext autoreload
%autoreload 2

In [None]:
import os
import pandas as pd
from deriva.core.utils.globus_auth_utils import GlobusNativeLogin
from deriva_ml.schema_setup.test_catalog import create_test_catalog, DemoML

Set the details for the catalog we want and authenticate to the server if needed.

In [None]:
hostname = 'dev.eye-ai.org'
domain_schema = 'demo-schema'

gnl = GlobusNativeLogin(host=hostname)
if gnl.is_logged_in([hostname]):
    print("You are already logged in.")
else:
    gnl.login([hostname], no_local_server=True, no_browser=True, refresh_tokens=True, update_bdbag_keychain=True)
    print("Login Successful")


Create a test catalog and get an instance of the DemoML class.

In [None]:
test_catalog = create_test_catalog(hostname, domain_schema)
ml_instance = DemoML(hostname, test_catalog.catalog_id)

In [None]:
print(f"Current dataset element types: {[a.name for a in ml_instance.list_dataset_element_types()]}")
ml_instance.add_dataset_element_type("Subject")
ml_instance.add_dataset_element_type("Image")
print(f"New dataset element types: {[a.name for a in ml_instance.list_dataset_element_types()]}")

## Configure DerivaML Datasets

Create vocabulary terms for the dataset types.

In [None]:
# Add the vocabulary terms used to tag datasets
ml_instance.add_term("Dataset_Type", "DemoSet", description="A test dataset")
ml_instance.add_term('Dataset_Type', 'Partitioned', description="A partitioned dataset for ML training.")
ml_instance.add_term("Dataset_Type", "Subject", description="A test dataset")
ml_instance.add_term("Dataset_Type", "Image", description="A test dataset")
ml_instance.add_term("Dataset_Type", "Training", description="Training dataset")
ml_instance.add_term("Dataset_Type", "Testing", description="Testing dataset")
ml_instance.add_term("Dataset_Type", "Validation", description="Validation dataset")
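
The repeated calls above could also be driven from a mapping of term names to descriptions. The sketch below is only an illustration: `register_terms` is a hypothetical helper (not part of DerivaML), and `fake_add_term` is a stand-in recorder so the cell runs without a catalog. It assumes only the call shape used above, `add_term(table, name, description=...)`.

```python
from typing import Callable, List, Mapping


def register_terms(add_term: Callable[..., object], table: str,
                   terms: Mapping[str, str]) -> List[str]:
    """Call add_term(table, name, description=...) once per entry.

    Hypothetical helper: returns the names in the order they were added.
    """
    added = []
    for name, description in terms.items():
        add_term(table, name, description=description)
        added.append(name)
    return added


dataset_types = {
    "Training": "Training dataset",
    "Testing": "Testing dataset",
    "Validation": "Validation dataset",
}

# Stand-in for a catalog call: records its arguments instead of contacting a server.
calls = []


def fake_add_term(table, name, description=""):
    calls.append((table, name, description))


added = register_terms(fake_add_term, "Dataset_Type", dataset_types)
print(added)
```

With a live catalog, `ml_instance.add_term` would presumably take the place of `fake_add_term`. Keeping each name next to its description makes a mismatched description easier to spot.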

Now create datasets and populate them with elements from the test catalog.

In [None]:
system_columns = ['RCT', 'RMT', 'RCB', 'RMB']

subject_dataset = ml_instance.create_dataset(['DemoSet', 'Subject'], description="A subject dataset")
image_dataset = ml_instance.create_dataset(['DemoSet', 'Image'], description="An image training dataset")
datasets = pd.DataFrame(ml_instance.find_datasets()).drop(columns=system_columns)
display(datasets)
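
The pattern of building a DataFrame and dropping the system columns recurs throughout this notebook. A minimal sketch of a reusable helper follows; `drop_system_columns` and the sample rows are hypothetical and only stand in for the output of a catalog query. Passing `errors="ignore"` to `DataFrame.drop` avoids a `KeyError` when a result lacks one of the columns.

```python
import pandas as pd

SYSTEM_COLUMNS = ["RCT", "RMT", "RCB", "RMB"]


def drop_system_columns(records, system_columns=SYSTEM_COLUMNS):
    """Build a DataFrame from row dicts and drop bookkeeping columns.

    Columns that are not present are skipped rather than raising KeyError.
    """
    return pd.DataFrame(records).drop(columns=list(system_columns), errors="ignore")


# Hypothetical rows; note that RMT and RCB are absent here.
rows = [
    {"RID": "1-AAAA", "Description": "A subject dataset", "RCT": "2024-01-01", "RMB": "user"},
    {"RID": "1-AAAB", "Description": "An image dataset", "RCT": "2024-01-02", "RMB": "user"},
]
clean = drop_system_columns(rows)
print(clean.columns.tolist())
```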

In [None]:
dp = ml_instance.domain_path  # Each call returns a new path instance, so only call once...
subject_rids = [i['RID'] for i in dp.tables['Subject'].entities().fetch()]
image_rids = [i['RID'] for i in dp.tables['Image'].entities().fetch()]

ml_instance.add_dataset_members(dataset_rid=subject_dataset, members=subject_rids)
ml_instance.add_dataset_members(dataset_rid=image_dataset, members=image_rids)

display(pd.DataFrame(ml_instance.list_dataset_members(subject_dataset)['Subject']).drop(columns=system_columns))
display(pd.DataFrame(ml_instance.list_dataset_members(image_dataset)['Image']).drop(columns=system_columns))

## Create partitioned dataset

Now let's create some subsets of the original dataset based on subject-level metadata. To do so, we download the subject dataset and look at its metadata to figure out how to partition the original data. Since we are not going to look at the images, we use download_dataset_bag rather than materialize_bdbag.

In [None]:
bag_path, bag_rid = ml_instance.download_dataset_bag(subject_dataset)
ml_instance.materialize_bdbag(subject_dataset)
print(f"Bag materialized to {bag_path}")

The domain model has two objects, Subject and Image: each image is associated with one subject, but a subject can have multiple images. Let's look at the subjects and partition the images into training, testing, and validation datasets.

In [None]:
print(f"Bag path is: {bag_path}")
os.chdir(bag_path / 'data/Subject')
%ls 

# Get information about the subjects and their images.
subject_df = pd.read_csv('Subject.csv', usecols=['RID', 'Name'])
image_df = pd.read_csv('Image/Image.csv', usecols=['RID', 'Subject', 'URL'])
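# Match each image to its subject by key (assumes Image.Subject holds the subject RID); join() would pair rows by index position.
metadata_df = subject_df.merge(image_df, left_on='RID', right_on='Subject', suffixes=('_subject', '_image'))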
display(metadata_df)
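
The one-to-many relationship between subjects and images can be illustrated with a small, self-contained example. The frames below are made-up data, not catalog output. The sketch contrasts a key-based `merge`, which follows the `Subject` reference, with an index-based `join`, which pairs rows purely by position.

```python
import pandas as pd

# Hypothetical data: two subjects, three images, deliberately out of order.
subjects = pd.DataFrame({"RID": ["S1", "S2"], "Name": ["Thing1", "Thing2"]})
images = pd.DataFrame({
    "RID": ["I1", "I2", "I3"],
    "Subject": ["S2", "S1", "S2"],
    "URL": ["/img/1", "/img/2", "/img/3"],
})

# Key-based merge: one output row per image, matched on the subject RID.
# validate= raises if a subject RID appears more than once on the left side.
merged = subjects.merge(
    images,
    left_on="RID",
    right_on="Subject",
    suffixes=("_subject", "_image"),
    validate="one_to_many",
)

# Index-based join: row i of subjects is paired with row i of images.
joined = subjects.join(images, lsuffix="_subject", rsuffix="_image")

print(merged[["Name", "RID_image"]])
print(joined[["Name", "RID_image", "Subject"]])
```

In the merged frame every row satisfies `RID_subject == Subject`. The joined frame has only as many rows as there are subjects, and its pairing depends on row order.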

In [None]:
def thing_number(name: pd.Series) -> pd.Series:
    return name.map(lambda n: int(n.replace('Thing','')))

training_rids = metadata_df.loc[lambda df: thing_number(df['Name']) % 3 == 0]['RID_image'].tolist()
testing_rids = metadata_df.loc[lambda df: thing_number(df['Name']) % 3 == 1]['RID_image'].tolist()
validation_rids = metadata_df.loc[lambda df: thing_number(df['Name']) % 3 == 2]['RID_image'].tolist()
print(f'Training images: {training_rids}')
print(f'Testing images: {testing_rids}')
print(f'Validation images: {validation_rids}')
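
When partitioning by hand like this, it is worth checking that every image lands in exactly one partition. The sketch below generalizes the modulo split and adds that check. `thing_index` and `partition_by_modulo` are hypothetical helpers, and the demo frame is made-up data that assumes names of the form `Thing<N>`.

```python
import re

import pandas as pd


def thing_index(name: str) -> int:
    """Extract the trailing integer from a name such as 'Thing12'."""
    match = re.search(r"(\d+)$", name)
    if match is None:
        raise ValueError(f"no trailing number in {name!r}")
    return int(match.group(1))


def partition_by_modulo(df, name_col, rid_col, n_parts=3):
    """Return {remainder: [rids]} using the name's number modulo n_parts."""
    buckets = df[name_col].map(thing_index) % n_parts
    return {k: df.loc[buckets == k, rid_col].tolist() for k in range(n_parts)}


# Hypothetical metadata: six images, one per subject Thing1..Thing6.
demo = pd.DataFrame({
    "Name": [f"Thing{i}" for i in range(1, 7)],
    "RID_image": [f"I{i}" for i in range(1, 7)],
})
parts = partition_by_modulo(demo, "Name", "RID_image")

# Sanity checks: the partitions are disjoint and together cover every image.
all_rids = [rid for rids in parts.values() for rid in rids]
is_disjoint = len(all_rids) == len(set(all_rids))
is_complete = sorted(all_rids) == sorted(demo["RID_image"])
print(parts, is_disjoint, is_complete)
```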


In [None]:
nested_dataset = ml_instance.create_dataset(['Partitioned', 'Image'], description='A nested dataset for machine learning')
training_dataset = ml_instance.create_dataset('Training', description='An image dataset for training')
testing_dataset = ml_instance.create_dataset('Testing', description='An image dataset for testing')
validation_dataset = ml_instance.create_dataset('Validation', description='An image dataset for validation')
pd.DataFrame(ml_instance.find_datasets()).drop(columns=system_columns)

In [None]:

ml_instance.add_dataset_members(dataset_rid=nested_dataset, members=[training_dataset, testing_dataset, validation_dataset])
ml_instance.add_dataset_members(dataset_rid=training_dataset, members=training_rids)
ml_instance.add_dataset_members(dataset_rid=testing_dataset, members=testing_rids)
ml_instance.add_dataset_members(dataset_rid=validation_dataset, members=validation_rids)
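
Conceptually, the nested dataset is a small tree: its members are datasets, and their members are images. The sketch below models that with a plain dictionary and collects the leaf members recursively. It is an illustration of the structure only, not the way DerivaML stores or traverses datasets, and all identifiers in it are made up.

```python
from typing import Dict, List, Optional, Set


def leaf_members(dataset: str, members: Dict[str, List[str]],
                 seen: Optional[Set[str]] = None) -> List[str]:
    """Collect the non-dataset members reachable from `dataset`.

    Any member that is itself a key of `members` is treated as a nested dataset.
    """
    seen = set() if seen is None else seen
    if dataset in seen:  # guard against cyclic membership
        return []
    seen.add(dataset)
    leaves = []
    for member in members.get(dataset, []):
        if member in members:
            leaves.extend(leaf_members(member, members, seen))
        else:
            leaves.append(member)
    return leaves


# Hypothetical membership table mirroring the nesting built above.
membership = {
    "nested": ["training", "testing", "validation"],
    "training": ["I3", "I6"],
    "testing": ["I1", "I4"],
    "validation": ["I2", "I5"],
}
flat = leaf_members("nested", membership)
print(flat)
```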


OK, let's see what we have now.

In [None]:
display(pd.DataFrame(ml_instance.list_dataset_members(nested_dataset)['Dataset']).drop(columns=system_columns))
display(pd.DataFrame(ml_instance.list_dataset_members(training_dataset)['Image']).drop(columns=system_columns))
display(pd.DataFrame(ml_instance.list_dataset_members(testing_dataset)['Image']).drop(columns=system_columns))
display(pd.DataFrame(ml_instance.list_dataset_members(validation_dataset)['Image']).drop(columns=system_columns))

In [None]:
ml_instance.cite(nested_dataset)

In [None]:
test_catalog.delete_ermrest_catalog(really=True)