# DerivaML Dataset Example

DerivaML is a class library built on the Deriva Scientific Asset management system. It is designed to simplify the basic operations involved in building and testing ML libraries based on common toolkits such as TensorFlow. This notebook reviews the basic features of the DerivaML library.

## Set up DerivaML for the test case

In [1]:
%load_ext autoreload
%autoreload 2

In [None]:
from deriva.core.utils.globus_auth_utils import GlobusNativeLogin
from deriva_ml.demo_catalog import create_demo_catalog, DemoML
from deriva_ml import MLVocab as vt, DatasetBag
import pandas as pd
from IPython.display import display, Markdown, HTML

Set the details for the catalog we want and authenticate to the server if needed.

In [3]:
hostname = 'dev.eye-ai.org'
domain_schema = 'demo-schema'

gnl = GlobusNativeLogin(host=hostname)
if gnl.is_logged_in([hostname]):
    print("You are already logged in.")
else:
    gnl.login([hostname], no_local_server=True, no_browser=True, refresh_tokens=True, update_bdbag_keychain=True)
    print("Login Successful")


You are already logged in.


Create a test catalog and get an instance of the DemoML class.

In [4]:
test_catalog = create_demo_catalog(hostname, domain_schema)
ml_instance = DemoML(hostname, test_catalog.catalog_id)

## Configure DerivaML Datasets

In Deriva-ML, a dataset is used to aggregate instances of entities. However, before we can create any datasets, we must configure Deriva-ML for the specifics of the datasets. The first step is to tell Deriva-ML what types of user-defined objects can be associated with a dataset.

Note that out of the box, Deriva-ML is configured to allow datasets to contain other datasets (i.e., nested datasets), so we don't need to do anything for that specific configuration.

In [5]:
print(f"Current dataset element types: {[a.name for a in ml_instance.list_dataset_element_types()]}")
ml_instance.add_dataset_element_type("Subject")
ml_instance.add_dataset_element_type("Image")
print(f"New dataset element types {[a.name for a in ml_instance.list_dataset_element_types()]}")



Current dataset element types: ['Dataset']
New dataset element types ['Dataset', 'Subject', 'Image']
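The idea of registering element types can be illustrated with a small stand-alone model. This is only a sketch: the `DatasetConfig` class and its methods are hypothetical and are not part of the DerivaML API, and whether DerivaML rejects unregistered types in exactly this way is an assumption.

```python
class DatasetConfig:
    """Toy stand-in for dataset element-type configuration (not the DerivaML API)."""

    def __init__(self) -> None:
        # Nested datasets are allowed from the start, mirroring the output above.
        self.element_types: list[str] = ["Dataset"]

    def add_element_type(self, table_name: str) -> None:
        # Registering the same table twice is a no-op.
        if table_name not in self.element_types:
            self.element_types.append(table_name)

    def check_member(self, table_name: str) -> None:
        # Reject members whose table was never registered.
        if table_name not in self.element_types:
            raise ValueError(f"{table_name!r} is not a registered dataset element type")


config = DatasetConfig()
config.add_element_type("Subject")
config.add_element_type("Image")
config.check_member("Image")  # accepted

try:
    config.check_member("Observation")  # hypothetical, unregistered table
except ValueError as err:
    print(err)
```

Keeping the list of allowed types explicit means a mistake such as adding a member from the wrong table can be caught when the member is added rather than when the dataset is used.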


Now that we have configured our datasets, we need to identify the dataset types so we can distinguish between them.

In [6]:
# Register the dataset type terms used in this notebook
ml_instance.add_term(vt.dataset_type, "DemoSet", description="A test dataset")
ml_instance.add_term(vt.dataset_type, 'Partitioned', description="A partitioned dataset for ML training.")
ml_instance.add_term(vt.dataset_type, "Subject", description="A test dataset")
ml_instance.add_term(vt.dataset_type, "Image", description="A test dataset")
ml_instance.add_term(vt.dataset_type, "Training", description="Training dataset")
ml_instance.add_term(vt.dataset_type, "Testing", description="Testing dataset")
ml_instance.add_term(vt.dataset_type, "Validation", description="Validation dataset")

ml_instance.list_vocabulary_terms(vt.dataset_type)

[VocabularyTerm(name='DemoSet', synonyms=[], id='ml-test:38J', uri='/id/38J', description='A test dataset', rid='38J'),
 VocabularyTerm(name='Partitioned', synonyms=[], id='ml-test:38M', uri='/id/38M', description='A partitioned dataset for ML training.', rid='38M'),
 VocabularyTerm(name='Subject', synonyms=[], id='ml-test:38P', uri='/id/38P', description='A test dataset', rid='38P'),
 VocabularyTerm(name='Image', synonyms=[], id='ml-test:38R', uri='/id/38R', description='A test dataset', rid='38R'),
 VocabularyTerm(name='Training', synonyms=[], id='ml-test:38T', uri='/id/38T', description='Training dataset', rid='38T'),
 VocabularyTerm(name='Testing', synonyms=[], id='ml-test:38W', uri='/id/38W', description='Testing dataset', rid='38W'),
 VocabularyTerm(name='Validation', synonyms=[], id='ml-test:38Y', uri='/id/38Y', description='Validation dataset', rid='38Y')]
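Each term in the listing carries a name, a description, and a list of synonyms. The sketch below shows one way a lookup over such terms could work. It is a stand-alone illustration, not the DerivaML implementation: the `Term` class, the `lookup` function, and the synonym `"Train"` are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Term:
    """Hypothetical, simplified stand-in for a vocabulary term."""
    name: str
    description: str
    synonyms: list[str] = field(default_factory=list)


def lookup(terms: list[Term], label: str) -> Term:
    # Accept either the canonical name or any registered synonym.
    for term in terms:
        if label == term.name or label in term.synonyms:
            return term
    raise KeyError(f"Unknown vocabulary term: {label!r}")


terms = [
    Term("Training", "Training dataset", synonyms=["Train"]),  # synonym is made up
    Term("Validation", "Validation dataset"),
]

print(lookup(terms, "Train").name)
print(lookup(terms, "Validation").description)
```

A controlled vocabulary of this kind keeps dataset types consistent: a label that is not in the vocabulary fails loudly instead of silently creating a new category.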

Now create datasets and populate them with elements from the test catalog.

In [7]:
system_columns = ['RCT', 'RMT', 'RCB', 'RMB']

subject_dataset = ml_instance.create_dataset(['DemoSet', 'Subject'], description="A subject dataset")
image_dataset = ml_instance.create_dataset(['DemoSet', 'Image'], description="A image training dataset")
datasets = pd.DataFrame(ml_instance.find_datasets()).drop(columns=system_columns)
display(
    Markdown('## Datasets'),
    datasets)

## Datasets

| | RID | Description | Dataset_Type |
|---|---|---|---|
| 0 | 390 | A subject dataset | [DemoSet, Subject] |
| 1 | 396 | A image training dataset | [DemoSet, Image] |
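The `system_columns` list names the bookkeeping columns that the catalog maintains on every row: creation time (`RCT`), modification time (`RMT`), creating user (`RCB`), and modifying user (`RMB`). As a rough sketch of what dropping them amounts to, here is the same filtering on plain dictionaries; the `drop_system_columns` helper and the placeholder values are illustrative only.

```python
SYSTEM_COLUMNS = {"RCT", "RMT", "RCB", "RMB"}


def drop_system_columns(records: list[dict]) -> list[dict]:
    # Keep every column except the row-provenance bookkeeping ones.
    return [
        {key: value for key, value in row.items() if key not in SYSTEM_COLUMNS}
        for row in records
    ]


rows = [
    {
        "RID": "390",
        "Description": "A subject dataset",
        "RCT": "<creation time>",      # placeholder value
        "RMT": "<modification time>",  # placeholder value
        "RCB": "<created by>",         # placeholder value
        "RMB": "<modified by>",        # placeholder value
    }
]

print(drop_system_columns(rows))
```

The `RID` column is kept because it is the identifier used to refer to the row everywhere else in the notebook.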


And now that we have defined some datasets, we can add elements of the appropriate type to them.  We can see what is in our new datasets by listing the dataset members.

In [8]:
dp = ml_instance.domain_path  # Each call returns a new path instance, so only call once...
subject_rids = [i['RID'] for i in dp.tables['Subject'].entities().fetch()]
image_rids = [i['RID'] for i in dp.tables['Image'].entities().fetch()]

ml_instance.add_dataset_members(dataset_rid=subject_dataset, members=subject_rids)
ml_instance.add_dataset_members(dataset_rid=image_dataset, members=image_rids)

# List the contents of our datasets, and let's not include columns like modify time.
display(
    Markdown('## Subject Dataset'),
    pd.DataFrame(ml_instance.list_dataset_members(subject_dataset)['Subject']).drop(columns=system_columns),
    Markdown('## Image Dataset'),
    pd.DataFrame(ml_instance.list_dataset_members(image_dataset)['Image']).drop(columns=system_columns))

## Subject Dataset

| | RID | Name |
|---|---|---|
| 0 | 31A | Thing1 |
| 1 | 31C | Thing2 |
| 2 | 31E | Thing3 |
| 3 | 31G | Thing4 |
| 4 | 31J | Thing5 |
| 5 | 31M | Thing6 |
| 6 | 31P | Thing7 |
| 7 | 31R | Thing8 |
| 8 | 31T | Thing9 |
| 9 | 31W | Thing10 |


## Image Dataset

| | RID | URL | Filename | Description | Length | MD5 | Name | Subject |
|---|---|---|---|---|---|---|---|---|
| 0 | 32J | /hatrac/Image/0eac5db56873bc2fe8ca4b512eb271be... | test_31A.txt | A test image | 32 | 0eac5db56873bc2fe8ca4b512eb271be | | 31A |
| 1 | 32M | /hatrac/Image/ef1fa620d471b3cbec88d06f2bfbcc5f... | test_31C.txt | A test image | 32 | ef1fa620d471b3cbec88d06f2bfbcc5f | | 31C |
| 2 | 32P | /hatrac/Image/5527620402b22cc6cd1e89432513853c... | test_31E.txt | A test image | 32 | 5527620402b22cc6cd1e89432513853c | | 31E |
| 3 | 32R | /hatrac/Image/663e946a6b6c2aa604023459d6ee6569... | test_31G.txt | A test image | 31 | 663e946a6b6c2aa604023459d6ee6569 | | 31G |
| 4 | 32T | /hatrac/Image/211f350dd878712239c1dcae74f341ff... | test_31J.txt | A test image | 31 | 211f350dd878712239c1dcae74f341ff | | 31J |
| 5 | 32W | /hatrac/Image/74b1f373d2f1f7816d944fa43a5ebee9... | test_31M.txt | A test image | 30 | 74b1f373d2f1f7816d944fa43a5ebee9 | | 31M |
| 6 | 32Y | /hatrac/Image/99e80a3eacd8fe0488b1039bf19142a0... | test_31P.txt | A test image | 31 | 99e80a3eacd8fe0488b1039bf19142a0 | | 31P |
| 7 | 330 | /hatrac/Image/b8c4bb05c1ba52d25465f2c84bf37e8a... | test_31R.txt | A test image | 32 | b8c4bb05c1ba52d25465f2c84bf37e8a | | 31R |
| 8 | 332 | /hatrac/Image/44fdd8b1efe2700ad703b02bc4cfadfe... | test_31T.txt | A test image | 32 | 44fdd8b1efe2700ad703b02bc4cfadfe | | 31T |
| 9 | 334 | /hatrac/Image/1ec4a45ec2ac4439b498d3155ee55c0c... | test_31W.txt | A test image | 31 | 1ec4a45ec2ac4439b498d3155ee55c0c | | 31W |
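Each image record carries a `Length` and an `MD5` value. Assuming these are the byte size and the hex MD5 digest of the stored file (an assumption, though the digest also appears in the URL), a downloaded file could be checked against its record roughly as follows. The `matches_record` helper and the payload are made up for illustration.

```python
import hashlib


def matches_record(data: bytes, record: dict) -> bool:
    # Compare the cheap size check first, then the content digest.
    if len(data) != record["Length"]:
        return False
    return hashlib.md5(data).hexdigest() == record["MD5"]


# Hypothetical payload; a real check would read the downloaded file's bytes.
payload = b"hypothetical file contents"
record = {"Length": len(payload), "MD5": hashlib.md5(payload).hexdigest()}

print(matches_record(payload, record))            # intact copy
print(matches_record(payload + b"!", record))     # corrupted copy
```

MD5 is used here only to detect accidental corruption or truncation; it is not suitable as a defense against deliberate tampering.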


## Create partitioned dataset

Now let's create some subsets of the original dataset based on subject-level metadata. We will download the subject dataset and look at its metadata to figure out how to partition the original data. Since we are not going to look at the images, we use download_dataset_bag rather than materialize_dataset_bag.

In [9]:
bag_path, bag_rid = ml_instance.download_dataset_bag(subject_dataset)
ml_instance.materialize_dataset_bag(subject_dataset)
dataset_bag = DatasetBag(bag_path)
print(f"Bag materialized to {bag_path}")

Bag materialized to /private/var/folders/0k/27qzm97x3t7g3j1m6ksf_9f40000gn/T/tmp1a0gbs_g/390_aaa4b9035a9b5265c459d5bb75bae4551a7696fcac4ada2bdd216ab34651c84b/Dataset_390


The domain model has two objects, Subject and Image. Each image is associated with one subject, but a subject can have multiple images associated with it. Let's look at the subjects and partition them into training, testing, and validation datasets.

In [10]:
# Get information about the subjects.....
subject_df = dataset_bag.get_table_as_dataframe('Subject')[['RID', 'Name']]
image_df = dataset_bag.get_table_as_dataframe('Image')[['RID', 'Subject', 'URL']]
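# Match each image to its subject on the Subject foreign key rather than on row position.
metadata_df = subject_df.merge(image_df, left_on='RID', right_on='Subject', suffixes=("_subject", "_image"))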
display(metadata_df)

| | RID_subject | Name | RID_image | Subject | URL |
|---|---|---|---|---|---|
| 0 | 31A | Thing1 | 32J | 31A | /hatrac/Image/0eac5db56873bc2fe8ca4b512eb271be... |
| 1 | 31C | Thing2 | 32M | 31C | /hatrac/Image/ef1fa620d471b3cbec88d06f2bfbcc5f... |
| 2 | 31E | Thing3 | 32P | 31E | /hatrac/Image/5527620402b22cc6cd1e89432513853c... |
| 3 | 31G | Thing4 | 32R | 31G | /hatrac/Image/663e946a6b6c2aa604023459d6ee6569... |
| 4 | 31J | Thing5 | 32T | 31J | /hatrac/Image/211f350dd878712239c1dcae74f341ff... |
| 5 | 31M | Thing6 | 32W | 31M | /hatrac/Image/74b1f373d2f1f7816d944fa43a5ebee9... |
| 6 | 31P | Thing7 | 32Y | 31P | /hatrac/Image/99e80a3eacd8fe0488b1039bf19142a0... |
| 7 | 31R | Thing8 | 330 | 31R | /hatrac/Image/b8c4bb05c1ba52d25465f2c84bf37e8a... |
| 8 | 31T | Thing9 | 332 | 31T | /hatrac/Image/44fdd8b1efe2700ad703b02bc4cfadfe... |
| 9 | 31W | Thing10 | 334 | 31W | /hatrac/Image/1ec4a45ec2ac4439b498d3155ee55c0c... |
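Because a subject can have several images, pairing subjects with images is a one-to-many join on the image's `Subject` column. The sketch below shows that join on plain dictionaries. The first two image RIDs come from the table above, while `X01` is a hypothetical extra image added to show the one-to-many case.

```python
from collections import defaultdict

subjects = [
    {"RID": "31A", "Name": "Thing1"},
    {"RID": "31C", "Name": "Thing2"},
]
images = [
    {"RID": "32M", "Subject": "31C"},
    {"RID": "32J", "Subject": "31A"},
    {"RID": "X01", "Subject": "31A"},  # hypothetical second image for 31A
]

# Index the images by the subject they point to.
images_by_subject = defaultdict(list)
for image in images:
    images_by_subject[image["Subject"]].append(image["RID"])

# Emit one row per (subject, image) pair, regardless of the order of either list.
joined = [
    {"RID_subject": subject["RID"], "Name": subject["Name"], "RID_image": image_rid}
    for subject in subjects
    for image_rid in images_by_subject[subject["RID"]]
]

for row in joined:
    print(row)
```

Joining on the key rather than on row position keeps the pairing correct even when the two tables are returned in different orders or have different lengths.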


For this example, let's partition the data based on the name of the subject. Of course, in real examples we would do a more complex analysis to decide what subset goes into each dataset.

In [11]:
def thing_number(name: pd.Series) -> pd.Series:
    return name.map(lambda n: int(n.replace('Thing','')))

training_rids = metadata_df.loc[lambda df: thing_number(df['Name']) % 3 == 0]['RID_image'].tolist()
testing_rids = metadata_df.loc[lambda df: thing_number(df['Name']) % 3 == 1]['RID_image'].tolist()
validation_rids = metadata_df.loc[lambda df: thing_number(df['Name']) % 3 == 2]['RID_image'].tolist()

print(f'Training images: {training_rids}')
print(f'Testing images: {testing_rids}')
print(f'Validation images: {validation_rids}')

Training images: ['32P', '32W', '332', '338', '33E', '33M']
Testing images: ['32J', '32R', '32Y', '334', '33A', '33G', '33P']
Validation images: ['32M', '32T', '330', '336', '33C', '33J', '33R']
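The split rule is the subject number modulo 3: remainder 0 goes to training, 1 to testing, and 2 to validation. A quick stand-alone check, sketched here on twenty generated names (matching the twenty image RIDs printed above), confirms that such a rule yields disjoint partitions that together cover every subject. The helper below is a plain-Python variant of `thing_number`, not the notebook's pandas version.

```python
import re


def thing_number(name: str) -> int:
    # Extract the trailing integer from names such as "Thing7".
    match = re.fullmatch(r"Thing(\d+)", name)
    if match is None:
        raise ValueError(f"Unexpected subject name: {name!r}")
    return int(match.group(1))


names = [f"Thing{i}" for i in range(1, 21)]
split_names = ("Training", "Testing", "Validation")  # remainders 0, 1, 2

partitions = {label: [] for label in split_names}
for name in names:
    partitions[split_names[thing_number(name) % 3]].append(name)

sizes = {label: len(members) for label, members in partitions.items()}
print(sizes)  # {'Training': 6, 'Testing': 7, 'Validation': 7}

# Every name lands in exactly one partition.
all_members = [name for members in partitions.values() for name in members]
assert sorted(all_members) == sorted(names)
assert len(set(all_members)) == len(all_members)
```

The 6/7/7 sizes agree with the lengths of the three RID lists above.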


Now that we know what we want in each dataset, let's create datasets for each of our partitioned elements, along with a nested dataset to track the entire collection.

In [12]:
nested_dataset = ml_instance.create_dataset(['Partitioned', 'Image'], description='A nested dataset for machine learning')
training_dataset = ml_instance.create_dataset('Training', description='An image dataset for training')
testing_dataset = ml_instance.create_dataset('Testing', description='A image dataset for testing')
validation_dataset = ml_instance.create_dataset('Validation', description='A image dataset for validation')
pd.DataFrame(ml_instance.find_datasets()).drop(columns=system_columns)

| | RID | Description | Dataset_Type |
|---|---|---|---|
| 0 | 390 | A subject dataset | [DemoSet, Subject] |
| 1 | 396 | A image training dataset | [DemoSet, Image] |
| 2 | 3BW | A nested dataset for machine learning | [Partitioned, Image] |
| 3 | 3C2 | An image dataset for training | [Training] |
| 4 | 3C6 | A image dataset for testing | [Testing] |
| 5 | 3CA | A image dataset for validation | [Validation] |


And then fill the datasets with the appropriate members.

In [13]:
ml_instance.add_dataset_members(dataset_rid=nested_dataset, members=[training_dataset, testing_dataset, validation_dataset])
ml_instance.add_dataset_members(dataset_rid=training_dataset, members=training_rids)
ml_instance.add_dataset_members(dataset_rid=testing_dataset, members=testing_rids)
ml_instance.add_dataset_members(dataset_rid=validation_dataset, members=validation_rids)

'3CA'

OK, let's see what we have now.

In [14]:
display(
    Markdown('## Nested Dataset'),
    pd.DataFrame(ml_instance.list_dataset_members(nested_dataset)['Dataset']).drop(columns=system_columns),
    Markdown('## Training Dataset'),
    pd.DataFrame(ml_instance.list_dataset_members(training_dataset)['Image']).drop(columns=system_columns),
    Markdown('## Testing Dataset'),
    pd.DataFrame(ml_instance.list_dataset_members(testing_dataset)['Image']).drop(columns=system_columns),
    Markdown('## Validation Dataset'),
    pd.DataFrame(ml_instance.list_dataset_members(validation_dataset)['Image']).drop(columns=system_columns))

## Nested Dataset

| | RID | Description |
|---|---|---|
| 0 | 3C2 | An image dataset for training |
| 1 | 3C6 | A image dataset for testing |
| 2 | 3CA | A image dataset for validation |


## Training Dataset

| | RID | URL | Filename | Description | Length | MD5 | Name | Subject |
|---|---|---|---|---|---|---|---|---|
| 0 | 32P | /hatrac/Image/5527620402b22cc6cd1e89432513853c... | test_31E.txt | A test image | 32 | 5527620402b22cc6cd1e89432513853c | | 31E |
| 1 | 32W | /hatrac/Image/74b1f373d2f1f7816d944fa43a5ebee9... | test_31M.txt | A test image | 30 | 74b1f373d2f1f7816d944fa43a5ebee9 | | 31M |
| 2 | 332 | /hatrac/Image/44fdd8b1efe2700ad703b02bc4cfadfe... | test_31T.txt | A test image | 32 | 44fdd8b1efe2700ad703b02bc4cfadfe | | 31T |
| 3 | 338 | /hatrac/Image/5566714c8fbab49e36dc0ad85fa8e372... | test_320.txt | A test image | 32 | 5566714c8fbab49e36dc0ad85fa8e372 | | 320 |
| 4 | 33E | /hatrac/Image/7a122143bffad139ea3c7d64408f7ead... | test_326.txt | A test image | 31 | 7a122143bffad139ea3c7d64408f7ead | | 326 |
| 5 | 33M | /hatrac/Image/158657c8de91ead4b0bbebf8487894e7... | test_32C.txt | A test image | 32 | 158657c8de91ead4b0bbebf8487894e7 | | 32C |


## Testing Dataset

| | RID | URL | Filename | Description | Length | MD5 | Name | Subject |
|---|---|---|---|---|---|---|---|---|
| 0 | 32J | /hatrac/Image/0eac5db56873bc2fe8ca4b512eb271be... | test_31A.txt | A test image | 32 | 0eac5db56873bc2fe8ca4b512eb271be | | 31A |
| 1 | 32R | /hatrac/Image/663e946a6b6c2aa604023459d6ee6569... | test_31G.txt | A test image | 31 | 663e946a6b6c2aa604023459d6ee6569 | | 31G |
| 2 | 32Y | /hatrac/Image/99e80a3eacd8fe0488b1039bf19142a0... | test_31P.txt | A test image | 31 | 99e80a3eacd8fe0488b1039bf19142a0 | | 31P |
| 3 | 334 | /hatrac/Image/1ec4a45ec2ac4439b498d3155ee55c0c... | test_31W.txt | A test image | 31 | 1ec4a45ec2ac4439b498d3155ee55c0c | | 31W |
| 4 | 33A | /hatrac/Image/3f094a6a2940eeebc05f2f8d7cca4544... | test_322.txt | A test image | 30 | 3f094a6a2940eeebc05f2f8d7cca4544 | | 322 |
| 5 | 33G | /hatrac/Image/ee95c9a72879cf23c86630a44f01345b... | test_328.txt | A test image | 31 | ee95c9a72879cf23c86630a44f01345b | | 328 |
| 6 | 33P | /hatrac/Image/8359e758b4b5f9ca5e54f43f661414dc... | test_32E.txt | A test image | 31 | 8359e758b4b5f9ca5e54f43f661414dc | | 32E |


## Validation Dataset

| | RID | URL | Filename | Description | Length | MD5 | Name | Subject |
|---|---|---|---|---|---|---|---|---|
| 0 | 32M | /hatrac/Image/ef1fa620d471b3cbec88d06f2bfbcc5f... | test_31C.txt | A test image | 32 | ef1fa620d471b3cbec88d06f2bfbcc5f | | 31C |
| 1 | 32T | /hatrac/Image/211f350dd878712239c1dcae74f341ff... | test_31J.txt | A test image | 31 | 211f350dd878712239c1dcae74f341ff | | 31J |
| 2 | 330 | /hatrac/Image/b8c4bb05c1ba52d25465f2c84bf37e8a... | test_31R.txt | A test image | 32 | b8c4bb05c1ba52d25465f2c84bf37e8a | | 31R |
| 3 | 336 | /hatrac/Image/d7348d17eaba35a57048de4f70cbeaf6... | test_31Y.txt | A test image | 31 | d7348d17eaba35a57048de4f70cbeaf6 | | 31Y |
| 4 | 33C | /hatrac/Image/92a10bfc46294bc6ec30defb4d8fb291... | test_324.txt | A test image | 31 | 92a10bfc46294bc6ec30defb4d8fb291 | | 324 |
| 5 | 33J | /hatrac/Image/a62175896d2709a2852b56bb3693e5bc... | test_32A.txt | A test image | 32 | a62175896d2709a2852b56bb3693e5bc | | 32A |
| 6 | 33R | /hatrac/Image/34571f0c2021afd83bb797451c1be5a0... | test_32G.txt | A test image | 31 | 34571f0c2021afd83bb797451c1be5a0 | | 32G |
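The nested dataset itself holds only the three child datasets, so collecting every image in the collection means walking the nesting. The sketch below does that recursively over two plain dictionaries. The `flatten` function and the dictionary layout are illustrative, not DerivaML calls, and the member lists are abbreviated from the tables above.

```python
def flatten(dataset_rid: str, children: dict, members: dict) -> list[str]:
    """Collect the member RIDs of a dataset and of every dataset nested inside it."""
    collected = list(members.get(dataset_rid, []))
    for child_rid in children.get(dataset_rid, []):
        collected.extend(flatten(child_rid, children, members))
    return collected


# Parent dataset RID -> RIDs of the datasets nested directly inside it.
children = {"3BW": ["3C2", "3C6", "3CA"]}

# Dataset RID -> RIDs of its non-dataset members (abbreviated).
members = {
    "3C2": ["32P", "32W"],
    "3C6": ["32J"],
    "3CA": ["32M"],
}

print(flatten("3BW", children, members))
```

The recursion assumes the nesting is acyclic; a dataset that contained itself, directly or indirectly, would need a visited set to avoid looping forever.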


As our very last step, let's get a PID that will allow us to share and cite the dataset that we just created.

In [19]:
# cite() is assumed here to return the citable URL for the dataset.
dataset_citation = ml_instance.cite(nested_dataset)
display(
    HTML(f'Nested dataset citation: <a href="{dataset_citation}">{dataset_citation}</a>')
)
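When a URL is interpolated into an HTML string, it is safer to quote the attribute and escape the value. A minimal sketch using the standard library follows; the `link` helper and the example URL are hypothetical.

```python
from html import escape


def link(url: str, text: str) -> str:
    # Quote the attribute and escape both parts so special characters stay literal.
    return f'<a href="{escape(url, quote=True)}">{escape(text)}</a>'


# Hypothetical URL, used only to show the escaping of "&".
print(link("https://example.org/id/3BW?a=1&b=2", "Nested dataset"))
# <a href="https://example.org/id/3BW?a=1&amp;b=2">Nested dataset</a>
```

The resulting string can be passed to `HTML(...)` in the same way as the hand-built one.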

In [20]:
display(HTML(f'<a href={ml_instance.chaise_url("Dataset")}>Browse Datasets</a>'))

In [None]:
test_catalog.delete_ermrest_catalog(really=True)