# DerivaML Dataset Example

DerivaML is a class library built on the Deriva Scientific Asset management system. It is designed to simplify a number of the basic operations associated with building and testing ML libraries based on common toolkits such as TensorFlow. This notebook reviews the basic features of the DerivaML library.

## Set up DerivaML for the test case

In [None]:
%load_ext autoreload
%autoreload 2

In [None]:
from deriva.core.utils.globus_auth_utils import GlobusNativeLogin
from deriva_ml.schema_setup.test_catalog import create_test_catalog, DemoML
from deriva_ml.deriva_ml_base import MLVocab as vt
from deriva_ml.dataset_bag import DatasetBag
import pandas as pd
from IPython.display import display, Markdown

Set the details for the catalog we want and authenticate to the server if needed.

In [None]:
hostname = 'dev.eye-ai.org'
domain_schema = 'demo-schema'

gnl = GlobusNativeLogin(host=hostname)
if gnl.is_logged_in([hostname]):
    print("You are already logged in.")
else:
    gnl.login([hostname], no_local_server=True, no_browser=True, refresh_tokens=True, update_bdbag_keychain=True)
    print("Login Successful")


Create a test catalog and get an instance of the DemoML class.

In [None]:
test_catalog = create_test_catalog(hostname, domain_schema)
ml_instance = DemoML(hostname, test_catalog.catalog_id)

## Configure DerivaML Datasets

In DerivaML, a dataset is used to aggregate instances of entities. However, before we can create any datasets, we must configure DerivaML for the specifics of the datasets. The first step is to tell DerivaML what types of user-defined objects can be associated with a dataset.

Note that out of the box, DerivaML is configured to allow datasets to contain other datasets (i.e., nested datasets), so we don't need to do anything for that specific configuration.

In [None]:
print(f"Current dataset element types: {[a.name for a in ml_instance.list_dataset_element_types()]}")
ml_instance.add_dataset_element_type("Subject")
ml_instance.add_dataset_element_type("Image")
print(f"New dataset element types {[a.name for a in ml_instance.list_dataset_element_types()]}")
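
To make the idea of an element type concrete, here is a minimal in-memory sketch. It is a toy model, not DerivaML's implementation: the class `ToyDatasetRegistry` and its methods are hypothetical names invented for illustration. It assumes only what the text above states, namely that a table must be registered as an element type before its rows can be added to a dataset, and that nested datasets are allowed by default.

```python
class ToyDatasetRegistry:
    """Toy stand-in (hypothetical) that illustrates element-type checks."""

    def __init__(self):
        # Nested datasets are allowed out of the box, so "Dataset" starts registered.
        self.element_types = {"Dataset"}
        self.members = {}

    def add_element_type(self, table):
        """Allow rows of `table` to be added to datasets."""
        self.element_types.add(table)

    def add_member(self, dataset, table, rid):
        """Add one row to a dataset, rejecting unregistered tables."""
        if table not in self.element_types:
            raise ValueError(f"{table!r} is not a registered dataset element type")
        self.members.setdefault(dataset, []).append((table, rid))


registry = ToyDatasetRegistry()

# Before registration, adding a Subject is rejected.
try:
    registry.add_member("demo", "Subject", "2ZG")
    rejected = False
except ValueError:
    rejected = True

# After registration, the same call succeeds.
registry.add_element_type("Subject")
registry.add_member("demo", "Subject", "2ZG")
```

In the real library the registration happens in the catalog rather than in memory, which is why it is done once up front.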

Now that we have configured our datasets, we need to identify the dataset types so we can distinguish between them.

In [None]:
# Add the vocabulary terms that will be used as dataset types
ml_instance.add_term(vt.dataset_type, "DemoSet", description="A test dataset")
ml_instance.add_term(vt.dataset_type, 'Partitioned', description="A partitioned dataset for ML training.")
ml_instance.add_term(vt.dataset_type, "Subject", description="A test dataset")
ml_instance.add_term(vt.dataset_type, "Image", description="A test dataset")
ml_instance.add_term(vt.dataset_type, "Training", description="Training dataset")
ml_instance.add_term(vt.dataset_type, "Testing", description="Testing dataset")
ml_instance.add_term(vt.dataset_type, "Validation", description="Validation dataset")

ml_instance.list_vocabulary_terms(vt.dataset_type)
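
When several terms are added at once, keeping the names and descriptions in one mapping can reduce copy-and-paste slips. The sketch below is one possible way to do that. The helper `register_terms` and the recorder `fake_add_term` are hypothetical, and the string `"Dataset_Type"` is only a placeholder for the vocabulary argument. The only assumption about the library is the call shape used in the cell above: `add_term(vocabulary, name, description=...)`.

```python
# Term name -> description, kept together so each pair is easy to review.
DATASET_TYPE_TERMS = {
    "Training": "Training dataset",
    "Testing": "Testing dataset",
    "Validation": "Validation dataset",
}


def register_terms(add_term, vocabulary, terms):
    """Call add_term once per (name, description) pair; return the names added."""
    added = []
    for name, description in terms.items():
        add_term(vocabulary, name, description=description)
        added.append(name)
    return added


# Stand-in recorder so the sketch runs without a catalog (hypothetical).
calls = []


def fake_add_term(vocabulary, name, description):
    calls.append((vocabulary, name, description))


added = register_terms(fake_add_term, "Dataset_Type", DATASET_TYPE_TERMS)
```

Against a live catalog, the equivalent call would presumably be `register_terms(ml_instance.add_term, vt.dataset_type, DATASET_TYPE_TERMS)`.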

Now create datasets that we will populate with elements from the test catalog.

In [None]:
system_columns = ['RCT', 'RMT', 'RCB', 'RMB']

subject_dataset = ml_instance.create_dataset(['DemoSet', 'Subject'], description="A subject dataset")
image_dataset = ml_instance.create_dataset(['DemoSet', 'Image'], description="A image training dataset")
datasets = pd.DataFrame(ml_instance.find_datasets()).drop(columns=system_columns)
display(Markdown('### All Datasets'), datasets)

Now that we have defined some datasets, we can add elements of the appropriate type to them. We can see what is in our new datasets by listing the dataset members.

In [None]:
dp = ml_instance.domain_path  # Each call returns a new path instance, so only call once...
subject_rids = [i['RID'] for i in dp.tables['Subject'].entities().fetch()]
image_rids = [i['RID'] for i in dp.tables['Image'].entities().fetch()]

ml_instance.add_dataset_members(dataset_rid=subject_dataset, members=subject_rids)
ml_instance.add_dataset_members(dataset_rid=image_dataset, members=image_rids)

# List the contents of our datasets, leaving out system columns such as modify time.
display(
    Markdown('### Subject Dataset'),
    pd.DataFrame(ml_instance.list_dataset_members(subject_dataset)['Subject']).drop(columns=system_columns))
display(
    Markdown('### Image Dataset'),
    pd.DataFrame(ml_instance.list_dataset_members(image_dataset)['Image']).drop(columns=system_columns))
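
The cell above drops the system columns with `DataFrame.drop`, which raises a `KeyError` if a listed column is absent (for example, when a member list is empty and the frame has no columns). A tolerant alternative is to filter the records before building the frame. This is a sketch under the assumption that `list_dataset_members` returns a mapping from table name to a list of dict-like records. The helper `strip_system_columns` is hypothetical, and the values for `RCT` and `RMB` in the sample are made up.

```python
SYSTEM_COLUMNS = ("RCT", "RMT", "RCB", "RMB")


def strip_system_columns(records, system_columns=SYSTEM_COLUMNS):
    """Return copies of the records without the system columns, if present."""
    hidden = set(system_columns)
    return [{k: v for k, v in record.items() if k not in hidden} for record in records]


# Sample shaped like the assumed return value; RCT/RMB values are invented.
members = {
    "Subject": [{"RID": "2ZG", "Name": "Thing1", "RCT": "2024-01-01", "RMB": "someone"}],
    "Image": [],  # an empty member list is handled without error
}

cleaned = {table: strip_system_columns(rows) for table, rows in members.items()}
```

Each cleaned list can then be passed to `pd.DataFrame` as before.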

## Create partitioned dataset

Now let's create some subsets of the original dataset based on subject-level metadata. We will download the subject dataset and look at its metadata to figure out how to partition the original data. Since we are not going to look at the images, we use download_dataset_bag rather than materialize_dataset_bag.

In [None]:
bag_path, bag_rid = ml_instance.download_dataset_bag(subject_dataset)
ml_instance.materialize_dataset_bag(subject_dataset)
dataset_bag = DatasetBag(bag_path)
print(f"Bag materialized to {bag_path}")

The domain model has two objects, Subject and Image. Each image is associated with one subject, but a subject can have multiple images associated with it. Let's look at the subjects and partition them into training, testing, and validation datasets.

In [10]:
# Get information about the subjects.....
subject_df = dataset_bag.get_table_as_dataframe('Subject')[['RID', 'Name']]
image_df = dataset_bag.get_table_as_dataframe('Image')[['RID', 'Subject', 'URL']]
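# Match each image to its subject by key (Image.Subject == Subject.RID), not by row position.
metadata_df = subject_df.merge(image_df, left_on='RID', right_on='Subject', suffixes=("_subject", "_image"))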
display(
    Markdown('### Subject Metadata'),
    metadata_df)

Unnamed: 0,RID_subject,Name,RID_image,Subject,URL
0,2ZG,Thing1,30R,2ZG,/hatrac/Image/d3e13bd089fd5b4015504c11d3a95128...
1,2ZJ,Thing2,30T,2ZJ,/hatrac/Image/d40ef06e08b522b68a927a575060924d...
2,2ZM,Thing3,30W,2ZM,/hatrac/Image/df9b37eaae70d40033abbfffbbf8a6f0...
3,2ZP,Thing4,30Y,2ZP,/hatrac/Image/fb06b64f66eec5c358db35eafd62f458...
4,2ZR,Thing5,310,2ZR,/hatrac/Image/940815243f4ad0772e1b832c95544f65...
5,2ZT,Thing6,312,2ZT,/hatrac/Image/6ec450059fd4963dce29e2859e4ab0d6...
6,2ZW,Thing7,314,2ZW,/hatrac/Image/ff0b37b515551238cdd585781137c50b...
7,2ZY,Thing8,316,2ZY,/hatrac/Image/7a54442ed3d101bbc9dcbe341a7305ea...
8,300,Thing9,318,300,/hatrac/Image/1bdd3dd956695cf5f5df3e7ac43175ca...
9,302,Thing10,31A,302,/hatrac/Image/95c2645713bbc8249d121a8bf11fdbfd...
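
When combining the Subject and Image tables, it matters whether rows are matched by position or by key. `DataFrame.join` aligns on the index by default, whereas `DataFrame.merge` with `left_on`/`right_on` matches on column values. The sketch below uses small made-up frames (the RIDs `S1`, `I1`, and so on are invented) in which the images are listed in a different order than the subjects and one subject has two images.

```python
import pandas as pd

subjects = pd.DataFrame({"RID": ["S1", "S2"], "Name": ["Thing1", "Thing2"]})

# Images in a different order than the subjects; S1 has two images.
images = pd.DataFrame({
    "RID": ["I3", "I1", "I2"],
    "Subject": ["S2", "S1", "S1"],
})

# Index-based join: row 0 is paired with row 0, regardless of the Subject column.
positional = subjects.join(images, lsuffix="_subject", rsuffix="_image")

# Key-based merge: each image is paired with the subject it references.
keyed = subjects.merge(
    images, left_on="RID", right_on="Subject", suffixes=("_subject", "_image")
)

positional_ok = bool((positional["RID_subject"] == positional["Subject"]).all())
keyed_ok = bool((keyed["RID_subject"] == keyed["Subject"]).all())
```

Both results carry the same column names (`RID_subject`, `Name`, `RID_image`, `Subject`), but only the key-based merge keeps every image and pairs it with the right subject.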


For this example, let's partition the data based on the name of the subject. Of course, in real examples, we would do a more complex analysis to decide
which subset goes into each dataset.

In [11]:
def thing_number(name: pd.Series) -> pd.Series:
    return name.map(lambda n: int(n.replace('Thing','')))

training_rids = metadata_df.loc[lambda df: thing_number(df['Name']) % 3 == 0]['RID_image'].tolist()
testing_rids =  metadata_df.loc[lambda df: thing_number(df['Name']) % 3 == 1]['RID_image'].tolist()
validation_rids = metadata_df.loc[lambda df: thing_number(df['Name']) % 3 == 2]['RID_image'].tolist()

print(f'Training images: {training_rids}')
print(f'Testing images: {testing_rids}')
print(f'Validation images: {validation_rids}')

Training images: ['30W', '312', '318', '31E', '31M', '31T']
Testing images: ['30R', '30Y', '314', '31A', '31G', '31P', '31W']
Validation images: ['30T', '310', '316', '31C', '31J', '31R', '31Y']
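
Because the split is decided from subject-level metadata, all images of one subject should land in the same partition. The sketch below restates the modulo rule as a standalone function and adds a simple check for that property. The helpers `split_by_subject` and `has_subject_leakage` are hypothetical, and the image `IMG-extra` is made up to show a subject with more than one image. The other RIDs are the first four rows of the metadata table above.

```python
from collections import defaultdict


def split_by_subject(images, subject_names, n_splits=3):
    """Assign each image to a split using its subject's number modulo n_splits."""
    splits = defaultdict(list)
    for image_rid, subject_rid in images:
        number = int(subject_names[subject_rid].replace("Thing", ""))
        splits[number % n_splits].append(image_rid)
    return dict(splits)


def has_subject_leakage(splits, image_to_subject):
    """Return True if any subject has images in more than one split."""
    first_split = {}
    for split, image_rids in splits.items():
        for image_rid in image_rids:
            subject = image_to_subject[image_rid]
            if first_split.setdefault(subject, split) != split:
                return True
    return False


subject_names = {"2ZG": "Thing1", "2ZJ": "Thing2", "2ZM": "Thing3", "2ZP": "Thing4"}
images = [
    ("30R", "2ZG"),
    ("30T", "2ZJ"),
    ("30W", "2ZM"),
    ("30Y", "2ZP"),
    ("IMG-extra", "2ZG"),  # made-up second image for subject 2ZG
]

splits = split_by_subject(images, subject_names)
leaks = has_subject_leakage(splits, dict(images))
```

With the same convention as the cell above, split 0 corresponds to training, 1 to testing, and 2 to validation.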


Now that we know what we want in each dataset, let's create datasets for each of our partitioned elements, along with a nested dataset to track the entire collection.

In [12]:
nested_dataset = ml_instance.create_dataset(['Partitioned', 'Image'], description='A nested dataset for machine learning')
training_dataset = ml_instance.create_dataset('Training', description='An image dataset for training')
testing_dataset = ml_instance.create_dataset('Testing', description='A image dataset for testing')
validation_dataset = ml_instance.create_dataset('Validation', description='A image dataset for validation')
pd.DataFrame(ml_instance.find_datasets()).drop(columns=system_columns)

Unnamed: 0,RID,Description,Dataset_Type
0,376,A subject dataset,"[DemoSet, Subject]"
1,37C,A image training dataset,"[DemoSet, Image]"
2,3A2,A nested dataset for machine learning,"[Partitioned, Image]"
3,3A8,An image dataset for training,[Training]
4,3AC,A image dataset for testing,[Testing]
5,3AG,A image dataset for validation,[Validation]


And then fill the datasets with the appropriate members.

In [13]:
ml_instance.add_dataset_members(dataset_rid=nested_dataset, members=[training_dataset, testing_dataset, validation_dataset])
ml_instance.add_dataset_members(dataset_rid=training_dataset, members=training_rids)
ml_instance.add_dataset_members(dataset_rid=testing_dataset, members=testing_rids)
ml_instance.add_dataset_members(dataset_rid=validation_dataset, members=validation_rids)

'3AG'
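
A nested dataset holds other datasets as members, so finding every image in the collection means walking down through the children. The sketch below shows that traversal on plain dictionaries. It is a toy model, not a DerivaML call: `collect_members` is a hypothetical helper. The dataset RIDs are the ones created above, and each member list is abbreviated to its first two images.

```python
def collect_members(dataset_rid, children, members, _seen=None):
    """Return member RIDs reachable from dataset_rid, following nested datasets."""
    seen = set() if _seen is None else _seen
    if dataset_rid in seen:  # guard against accidental cycles
        return []
    seen.add(dataset_rid)
    found = list(members.get(dataset_rid, []))
    for child in children.get(dataset_rid, []):
        found.extend(collect_members(child, children, members, seen))
    return found


# Nested dataset 3A2 contains the training, testing, and validation datasets.
children = {"3A2": ["3A8", "3AC", "3AG"]}

# Abbreviated member lists (first two images of each partition).
members = {
    "3A8": ["30W", "312"],
    "3AC": ["30R", "30Y"],
    "3AG": ["30T", "310"],
}

all_images = collect_members("3A2", children, members)
```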

OK, let's see what we have now.

In [14]:
display(
    Markdown('## Nested Dataset'),
    pd.DataFrame(ml_instance.list_dataset_members(nested_dataset)['Dataset']).drop(columns=system_columns),
    Markdown('## Training Dataset'),
    pd.DataFrame(ml_instance.list_dataset_members(training_dataset)['Image']).drop(columns=system_columns),
    Markdown('## Testing Dataset'),
    pd.DataFrame(ml_instance.list_dataset_members(testing_dataset)['Image']).drop(columns=system_columns),
    Markdown('## Validation Dataset'),
    pd.DataFrame(ml_instance.list_dataset_members(validation_dataset)['Image']).drop(columns=system_columns)
)

## Nested Dataset

Unnamed: 0,RID,Description
0,3A8,An image dataset for training
1,3AC,A image dataset for testing
2,3AG,A image dataset for validation


## Training Dataset

Unnamed: 0,RID,URL,Filename,Description,Length,MD5,Name,Subject
0,30W,/hatrac/Image/df9b37eaae70d40033abbfffbbf8a6f0...,test_2ZM.txt,A test image,31,df9b37eaae70d40033abbfffbbf8a6f0,,2ZM
1,312,/hatrac/Image/6ec450059fd4963dce29e2859e4ab0d6...,test_2ZT.txt,A test image,31,6ec450059fd4963dce29e2859e4ab0d6,,2ZT
2,318,/hatrac/Image/1bdd3dd956695cf5f5df3e7ac43175ca...,test_300.txt,A test image,31,1bdd3dd956695cf5f5df3e7ac43175ca,,300
3,31E,/hatrac/Image/7ea9b359393e35db6a2d33b73414a310...,test_306.txt,A test image,31,7ea9b359393e35db6a2d33b73414a310,,306
4,31M,/hatrac/Image/2b059b6154e42bdf3c738927ca66c12d...,test_30C.txt,A test image,32,2b059b6154e42bdf3c738927ca66c12d,,30C
5,31T,/hatrac/Image/944feb112c3dd4e52684f2d0fb0bcb63...,test_30J.txt,A test image,31,944feb112c3dd4e52684f2d0fb0bcb63,,30J


## Testing Dataset

Unnamed: 0,RID,URL,Filename,Description,Length,MD5,Name,Subject
0,30R,/hatrac/Image/d3e13bd089fd5b4015504c11d3a95128...,test_2ZG.txt,A test image,31,d3e13bd089fd5b4015504c11d3a95128,,2ZG
1,30Y,/hatrac/Image/fb06b64f66eec5c358db35eafd62f458...,test_2ZP.txt,A test image,31,fb06b64f66eec5c358db35eafd62f458,,2ZP
2,314,/hatrac/Image/ff0b37b515551238cdd585781137c50b...,test_2ZW.txt,A test image,32,ff0b37b515551238cdd585781137c50b,,2ZW
3,31A,/hatrac/Image/95c2645713bbc8249d121a8bf11fdbfd...,test_302.txt,A test image,32,95c2645713bbc8249d121a8bf11fdbfd,,302
4,31G,/hatrac/Image/9f46fa47c76d62fc3599620c64edaad0...,test_308.txt,A test image,31,9f46fa47c76d62fc3599620c64edaad0,,308
5,31P,/hatrac/Image/7dbd1b021d0688a71f8ce7b38141ecf4...,test_30E.txt,A test image,30,7dbd1b021d0688a71f8ce7b38141ecf4,,30E
6,31W,/hatrac/Image/2292701999c1af5783a95bb557c61af5...,test_30M.txt,A test image,31,2292701999c1af5783a95bb557c61af5,,30M


## Validation Dataset

Unnamed: 0,RID,URL,Filename,Description,Length,MD5,Name,Subject
0,30T,/hatrac/Image/d40ef06e08b522b68a927a575060924d...,test_2ZJ.txt,A test image,31,d40ef06e08b522b68a927a575060924d,,2ZJ
1,310,/hatrac/Image/940815243f4ad0772e1b832c95544f65...,test_2ZR.txt,A test image,32,940815243f4ad0772e1b832c95544f65,,2ZR
2,316,/hatrac/Image/7a54442ed3d101bbc9dcbe341a7305ea...,test_2ZY.txt,A test image,32,7a54442ed3d101bbc9dcbe341a7305ea,,2ZY
3,31C,/hatrac/Image/b9cc6a226c50bf30b16bb7e3d3b1a56f...,test_304.txt,A test image,32,b9cc6a226c50bf30b16bb7e3d3b1a56f,,304
4,31J,/hatrac/Image/020784e0b1b6d5afd216e9f442c3e06a...,test_30A.txt,A test image,31,020784e0b1b6d5afd216e9f442c3e06a,,30A
5,31R,/hatrac/Image/3bb42962c2d25cdf7f77c3dc8e42be89...,test_30G.txt,A test image,33,3bb42962c2d25cdf7f77c3dc8e42be89,,30G
6,31Y,/hatrac/Image/ef24814f412f3c9de2ffe15bfaff8d03...,test_30P.txt,A test image,32,ef24814f412f3c9de2ffe15bfaff8d03,,30P


As our very last step, let's get a PID that will allow us to share and cite the dataset that we just created.

In [15]:
ml_instance.cite(nested_dataset)

'https://dev.eye-ai.org/id/592/3A2@32E-QDV8-83E0'
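
The returned identifier can be taken apart with the standard library. This sketch assumes the path has the form `/id/<catalog>/<RID>@<snapshot>`, which is inferred from the example output above rather than taken from the DerivaML documentation. `parse_citation` is a hypothetical helper.

```python
from urllib.parse import urlparse


def parse_citation(url):
    """Split a citation URL of the assumed form /id/<catalog>/<RID>@<snapshot>."""
    parsed = urlparse(url)
    parts = parsed.path.strip("/").split("/")
    if len(parts) != 3 or parts[0] != "id":
        raise ValueError(f"unexpected citation path: {parsed.path!r}")
    rid, _, snapshot = parts[2].partition("@")
    return {
        "host": parsed.netloc,
        "catalog": parts[1],
        "rid": rid,
        "snapshot": snapshot or None,  # None when no @<snapshot> suffix is present
    }


citation = parse_citation("https://dev.eye-ai.org/id/592/3A2@32E-QDV8-83E0")
```

The `rid` field matches the RID of the nested dataset listed earlier. The suffix after `@` presumably pins the citation to the catalog state at the time it was issued.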

In [16]:
test_catalog.delete_ermrest_catalog(really=True)

<Response [204]>
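
If an earlier cell fails, the final cleanup cell never runs and the test catalog is left behind. Outside a notebook, one way to guard against that is a context manager that always deletes the catalog. This is a sketch: `temporary_catalog` and `FakeCatalog` are hypothetical, and the only library call assumed is `delete_ermrest_catalog(really=True)` as used in the cell above.

```python
from contextlib import contextmanager


@contextmanager
def temporary_catalog(create):
    """Create a catalog via create() and delete it on exit, even after an error."""
    catalog = create()
    try:
        yield catalog
    finally:
        catalog.delete_ermrest_catalog(really=True)


class FakeCatalog:
    """Stand-in so the sketch runs without a server (hypothetical)."""

    def __init__(self):
        self.deleted = False

    def delete_ermrest_catalog(self, really=False):
        self.deleted = really


fake = FakeCatalog()
try:
    with temporary_catalog(lambda: fake) as catalog:
        raise RuntimeError("simulated failure partway through the workflow")
except RuntimeError:
    pass  # the catalog has still been deleted at this point
```

With a real server, the factory would presumably be `lambda: create_test_catalog(hostname, domain_schema)`.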