DerivaML is a class library built on the Deriva scientific asset management system. It is designed to simplify many of the basic operations involved in building and testing ML libraries based on common toolkits such as TensorFlow. This notebook reviews the basic features of the DerivaML library.

In [15]:
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [16]:
import builtins
from deriva.core.utils.globus_auth_utils import GlobusNativeLogin
from deriva_ml.deriva_ml_base import ColumnDefinition, BuiltinTypes
from deriva_ml.schema_setup.test_catalog import create_test_catalog, DemoML
from deriva_ml.execution_configuration import ExecutionConfiguration, Workflow, Execution, WorkflowTerm, Term
from pathlib import Path
import pandas as pd
import tempfile

Set the details for the catalog we want and authenticate to the server if needed.

In [17]:
hostname = 'dev.eye-ai.org'
domain_schema = 'demo-schema'

gnl = GlobusNativeLogin(host=hostname)
if gnl.is_logged_in([hostname]):
    print("You are already logged in.")
else:
    gnl.login([hostname], no_local_server=True, no_browser=True, refresh_tokens=True, update_bdbag_keychain=True)
    print("Login Successful")


2024-10-12 00:49:40,521 - DEBUG - on lookup, default setting: GLOBUS_SDK_ENVIRONMENT=production
2024-10-12 00:49:40,522 - INFO - Creating client of type <class 'globus_sdk.services.auth.client.native_client.NativeAppAuthClient'> for service "auth"
2024-10-12 00:49:40,522 - DEBUG - Service URL Lookup for "auth" under env "production"
2024-10-12 00:49:40,522 - DEBUG - Service URL Lookup Result: "auth" is at "https://auth.globus.org/"
2024-10-12 00:49:40,523 - DEBUG - on lookup, default setting: GLOBUS_SDK_VERIFY_SSL=True
2024-10-12 00:49:40,523 - DEBUG - on lookup, default setting: GLOBUS_SDK_HTTP_TIMEOUT=60.0
2024-10-12 00:49:40,524 - DEBUG - initialized transport of type <class 'globus_sdk.transport.requests.RequestsTransport'>
2024-10-12 00:49:40,524 - INFO - Finished initializing AuthLoginClient. client_id='8ef15ba9-2b4a-469c-a163-7fd910c9d111', type(authorizer)=<class 'globus_sdk.authorizers.base.NullAuthorizer'>
2024-10-12 00:49:40,525 - DEBUG - Using code handlers (<fair_research_

You are already logged in.


Create a test catalog and get an instance of the DerivaML class.

In [18]:
test_catalog = create_test_catalog(hostname, domain_schema)
ml_instance = DemoML(hostname, test_catalog.catalog_id)

2024-10-12 00:49:56,110 - DEBUG - on lookup, default setting: GLOBUS_SDK_ENVIRONMENT=production
2024-10-12 00:49:56,111 - INFO - Creating client of type <class 'globus_sdk.services.auth.client.native_client.NativeAppAuthClient'> for service "auth"
2024-10-12 00:49:56,112 - DEBUG - Service URL Lookup for "auth" under env "production"
2024-10-12 00:49:56,112 - DEBUG - Service URL Lookup Result: "auth" is at "https://auth.globus.org/"
2024-10-12 00:49:56,113 - DEBUG - on lookup, default setting: GLOBUS_SDK_VERIFY_SSL=True
2024-10-12 00:49:56,113 - DEBUG - on lookup, default setting: GLOBUS_SDK_HTTP_TIMEOUT=60.0
2024-10-12 00:49:56,114 - DEBUG - initialized transport of type <class 'globus_sdk.transport.requests.RequestsTransport'>
2024-10-12 00:49:56,114 - INFO - Finished initializing AuthLoginClient. client_id='8ef15ba9-2b4a-469c-a163-7fd910c9d111', type(authorizer)=<class 'globus_sdk.authorizers.base.NullAuthorizer'>
2024-10-12 00:49:56,115 - DEBUG - Using code handlers (<fair_research_

In [19]:
ml_instance.chaise_url("Subject")

'https://dev.eye-ai.org/chaise/recordset/#395/demo-schema:Subject'
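
The URL returned above follows the Chaise recordset pattern of host, catalog ID, schema, and table. The sketch below rebuilds that string by hand to make the structure visible. The helper name `recordset_url` is hypothetical and is not part of DerivaML; in practice `ml_instance.chaise_url` should be preferred.

```python
from urllib.parse import quote


def recordset_url(hostname, catalog_id, schema, table):
    """Compose a Chaise recordset URL (hypothetical helper, for illustration only)."""
    # Percent-encode schema and table names so unusual characters stay valid in the URL.
    return (
        f"https://{hostname}/chaise/recordset/"
        f"#{catalog_id}/{quote(schema, safe='')}:{quote(table, safe='')}"
    )


url = recordset_url("dev.eye-ai.org", 395, "demo-schema", "Subject")
print(url)
```

With the host, catalog ID, and schema used in this notebook, the sketch reproduces the URL shown in the cell output.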

In [20]:
[t.name for t in ml_instance.find_features('Image')]

[]

A feature is a set of values attached to a table in the DerivaML catalog. Instances of a feature are distinguished from one another by the ID of the execution that produced the feature value. The execution could be the run of a program, or it could be a manual process by which a person defines a set of values.

To create a new feature, we need to know the name of the feature, the table to which it is attached, and the set of values that make up the feature. The values can be terms from a controlled vocabulary, one or more file-based assets, or other values such as integers or strings. However, the use of strings outside of controlled vocabularies is discouraged.
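
As a rough mental model of the description above, the sketch below represents a feature definition and its values with plain dataclasses. This is not how DerivaML stores features: the class names `FeatureDefinition` and `FeatureValue` and the RIDs are made up for illustration. It only shows that a feature has a name, a target table, and a set of value columns, and that values for the same target are told apart by the execution that produced them.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Tuple


@dataclass(frozen=True)
class FeatureDefinition:
    """Hypothetical description of a feature and its value columns."""
    name: str
    target_table: str
    terms: Tuple[str, ...] = ()     # controlled-vocabulary columns
    assets: Tuple[str, ...] = ()    # file-based asset columns
    metadata: Tuple[str, ...] = ()  # other scalar columns, e.g. integers

    def value_columns(self) -> Tuple[str, ...]:
        return self.terms + self.assets + self.metadata


@dataclass
class FeatureValue:
    """Hypothetical feature value, keyed by target record and execution."""
    definition: FeatureDefinition
    target_rid: str
    execution_rid: str
    values: Dict[str, Any] = field(default_factory=dict)

    def __post_init__(self):
        # Every column named in the definition must be given a value.
        missing = set(self.definition.value_columns()) - set(self.values)
        if missing:
            raise ValueError(f"missing values for columns: {sorted(missing)}")


feature = FeatureDefinition(
    "Feature1", "Image",
    terms=("FeatureValue",), assets=("TestAsset",), metadata=("TestCol",))

# Two values for the same image, distinguished only by the execution (made-up RIDs).
v1 = FeatureValue(feature, "IMG-1", "EXEC-1",
                  {"FeatureValue": "V1", "TestAsset": "ASSET-1", "TestCol": 23})
v2 = FeatureValue(feature, "IMG-1", "EXEC-2",
                  {"FeatureValue": "V1", "TestAsset": "ASSET-2", "TestCol": 42})
by_execution = {v.execution_rid: v for v in (v1, v2)}
```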

In [21]:
# Prerequisites for our feature, which will include a CV term and an asset.

# Create a vocabulary and add a term to it to use in our features.
ml_instance.create_vocabulary("FeatureValue", "A vocab")
ml_instance.add_term("FeatureValue", "V1", description="A Feature Value")

feature_asset = ml_instance.create_asset("TestAsset", comment="An asset")

# Now let's create and upload a simple asset.
with tempfile.TemporaryDirectory() as tmpdirname:
    assetdir = ml_instance.working_dir / "TestAsset"
    assetdir.mkdir(parents=True, exist_ok=True)
    with builtins.open(assetdir / "test.txt", "w") as fp:
        fp.write("Hi there")
    ml_instance.upload_assets(assetdir)

2024-10-12 00:51:54,889 - DEBUG - Resetting dropped connection: dev.eye-ai.org
2024-10-12 00:51:58,913 - DEBUG - https://dev.eye-ai.org:443 "POST /ermrest/catalog/395/schema/demo-schema/table HTTP/11" 201 4778
2024-10-12 00:52:00,149 - DEBUG - https://dev.eye-ai.org:443 "GET /ermrest/catalog/395/schema HTTP/11" 200 103207
2024-10-12 00:52:00,474 - DEBUG - Inserting entities to path: /entity/demo-schema:FeatureValue?defaults=URI,RID,ID,RMB,RCB,RMT,RCT
2024-10-12 00:52:00,475 - DEBUG - yielding batch of 1/1 entities (0:1)
2024-10-12 00:52:01,475 - DEBUG - https://dev.eye-ai.org:443 "POST /ermrest/catalog/395/entity/demo-schema:FeatureValue?defaults=URI,RID,ID,RMB,RCB,RMT,RCT HTTP/11" 200 560
2024-10-12 00:52:01,489 - DEBUG - Fetched 1 entities
2024-10-12 00:52:02,463 - DEBUG - https://dev.eye-ai.org:443 "POST /ermrest/catalog/395/schema/demo-schema/table HTTP/11" 201 4467
2024-10-12 00:52:03,425 - DEBUG - https://dev.eye-ai.org:443 "GET /ermrest/catalog/395/schema HTTP/11" 200 107797
202

Now create a dataset with our new assets.

In [None]:
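# Placeholder cell: the dataset step is not filled in yet.
# It is expected to use add_dataset_element_type() and create_dataset;
# the arguments are not given in this notebook.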

In [22]:
# Our feature will have three values: a controlled vocabulary term from the
# vocabulary FeatureValue, a file asset, and an integer value which we will call "TestCol".
ml_instance.create_feature(
    "Feature1", "Image",
    terms=["FeatureValue"],
    assets=[feature_asset],
    metadata=[ColumnDefinition(name='TestCol', type=BuiltinTypes.int2)])

[f.name for f in ml_instance.find_features("Image")]

2024-10-12 00:52:12,603 - DEBUG - Resetting dropped connection: dev.eye-ai.org
2024-10-12 00:52:14,752 - DEBUG - https://dev.eye-ai.org:443 "GET /ermrest/catalog/395/schema HTTP/11" 200 108339
2024-10-12 00:52:15,125 - DEBUG - Inserting entities to path: /entity/deriva-ml:Feature_Name?defaults=URI,RID,ID,RMB,RCB,RMT,RCT
2024-10-12 00:52:15,126 - DEBUG - yielding batch of 1/1 entities (0:1)
2024-10-12 00:52:16,017 - DEBUG - https://dev.eye-ai.org:443 "POST /ermrest/catalog/395/entity/deriva-ml:Feature_Name?defaults=URI,RID,ID,RMB,RCB,RMT,RCT HTTP/11" 200 552
2024-10-12 00:52:16,018 - DEBUG - Fetched 1 entities
2024-10-12 00:52:17,049 - DEBUG - https://dev.eye-ai.org:443 "POST /ermrest/catalog/395/schema/demo-schema/table HTTP/11" 201 6797
2024-10-12 00:52:17,966 - DEBUG - https://dev.eye-ai.org:443 "PUT /ermrest/catalog/395/schema/demo-schema/table/Execution_Image_Feature1/column/Feature_Name HTTP/11" 200 226


['Execution_Image_Feature1']

Now we can add some feature values to our images. To streamline the creation of new feature values, we generate a class whose fields are the arguments required to create one.

In [23]:
TestFeatureClass = ml_instance.feature_record_class("Image", "Feature1")
TestFeatureClass.model_fields

{'Execution': FieldInfo(annotation=str, required=True),
 'Image': FieldInfo(annotation=str, required=True),
 'Feature_Name': FieldInfo(annotation=str, required=False, default='Feature1'),
 'TestAsset': FieldInfo(annotation=str, required=True),
 'FeatureValue': FieldInfo(annotation=str, required=True),
 'TestCol': FieldInfo(annotation=int, required=True)}
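
The `model_fields` output above suggests that the generated class is a Pydantic model built from the feature's columns. To illustrate the general idea of deriving a record class from a column list, here is a minimal sketch that uses the standard library's `dataclasses.make_dataclass` instead. The function `build_record_class` and the RIDs are hypothetical, and unlike Pydantic, a dataclass does not validate field types.

```python
from dataclasses import MISSING, field, fields, make_dataclass


def build_record_class(target_table, feature_name, value_columns):
    """Build a record class whose fields mirror a feature's columns.

    value_columns maps each column name to a Python type (hypothetical helper).
    """
    spec = [("Execution", str), (target_table, str)]
    spec += [(name, typ) for name, typ in value_columns.items()]
    # Fields with defaults must come last in a dataclass.
    spec.append(("Feature_Name", str, field(default=feature_name)))
    return make_dataclass(f"{target_table}{feature_name}Record", spec)


RecordClass = build_record_class(
    "Image", "Feature1",
    {"TestAsset": str, "FeatureValue": str, "TestCol": int})

# Made-up RIDs; Feature_Name falls back to its default.
record = RecordClass(Execution="EXEC-1", Image="IMG-1",
                     TestAsset="ASSET-1", FeatureValue="V1", TestCol=23)

# Fields without a default are the required arguments.
required = [f.name for f in fields(RecordClass) if f.default is MISSING]
print(required)
```

As in the output above, every field except `Feature_Name` is required, so omitting one raises an error when the record is constructed rather than when it is written to the catalog.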

Using TestFeatureClass, we can now create instances of the feature and add them to the catalog. We must have an execution_rid in order to define a feature value.

In [None]:
# Describe the execution and the workflow that it runs.
config = ExecutionConfiguration(
    execution=Execution(description="Sample Execution"),
    workflow=Workflow(
        name="Sample Workflow",
        url="https://github.com/informatics-isi-edu/deriva-ml/blob/main/pyproject.toml",
        workflow_type="Sample Workflow"),
    workflow_terms=[WorkflowTerm(term=Term.workflow, name="Sample Workflow", description="Sample Workflow Example")],
    description="Our Sample Workflow instance")
configuration_record = ml_instance.initialize_execution(config)
execution_rid = configuration_record.execution_rid

# Write an output file inside the execution context.
# The context variable is named `execution` to avoid shadowing the builtin `exec`.
with ml_instance.execution(configuration=configuration_record) as execution:
    output_dir = ml_instance.execution_assets_path / "Feature1"
    output_dir.mkdir(parents=True, exist_ok=True)
    with open(output_dir / "test.txt", "w+") as f:
        f.write("Hello there\n")

# Fetch the first Execution record from the catalog.
e = list(ml_instance.pathBuilder.deriva_ml.Execution.entities().fetch())[0]

In [None]:
upload_status = ml_instance.upload_execution(configuration=configuration_record)
e = (list(ml_instance.pathBuilder.deriva_ml.Execution.entities().fetch()))[0]

In [None]:
def strip_system(d):
    # Drop the system-managed columns (row creation/modification time and user).
    return {k: v for k, v in d.items() if k not in ['RCT', 'RMT', 'RCB', 'RMB']}

pd.DataFrame([strip_system(i) for i in ml_instance.list_feature("Image", "Feature1")])
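
To see what `strip_system` does without a catalog connection, the sketch below applies the same filter to a single hand-written record. The record contents are invented for illustration; only the four system column names come from the cell above.

```python
SYSTEM_COLUMNS = frozenset({"RCT", "RMT", "RCB", "RMB"})


def strip_system(record):
    """Return a copy of the record without the system-managed columns."""
    return {k: v for k, v in record.items() if k not in SYSTEM_COLUMNS}


# A made-up feature record; real records come from ml_instance.list_feature(...).
sample = {
    "RID": "1-ABCD",
    "RCT": "2024-10-12T00:52:16",
    "RMT": "2024-10-12T00:52:16",
    "RCB": "user-a",
    "RMB": "user-a",
    "Image": "IMG-1",
    "TestCol": 23,
}
cleaned = strip_system(sample)
print(cleaned)  # {'RID': '1-ABCD', 'Image': 'IMG-1', 'TestCol': 23}
```

Note that `RID` is kept, since it is not in the list of stripped columns, so each row in the resulting table can still be traced back to its catalog record.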

In [None]:
test_catalog.delete_ermrest_catalog(really=True)
