DerivaML is a class library built on the Deriva scientific asset management system. It is designed to simplify common operations involved in building and testing ML libraries based on toolkits such as TensorFlow. This notebook reviews the basic features of the DerivaML library.

In [None]:
%load_ext autoreload
%autoreload 2

In [None]:
import pandas as pd
from deriva.core import DerivaServer, ErmrestCatalog, get_credential
from deriva.core.utils.globus_auth_utils import GlobusNativeLogin
from deriva_ml.deriva_ml_base import DerivaML, DerivaMLException, ColumnDefinition, BuiltinTypes
from deriva_ml.schema_setup.create_schema import create_ml_schema
from deriva_ml.schema_setup.test_catalog import create_test_catalog
from deriva_ml.execution_configuration import ExecutionConfiguration

Set the details for the catalog we want and authenticate to the server if needed.

In [None]:
hostname = 'dev.eye-ai.org'
domain_schema = 'demo-schema'

gnl = GlobusNativeLogin(host=hostname)
if gnl.is_logged_in([hostname]):
    print("You are already logged in.")
else:
    gnl.login([hostname], no_local_server=True, no_browser=True, refresh_tokens=True, update_bdbag_keychain=True)
    print("Login Successful")


Create a test catalog and get an instance of the DerivaML class.

In [None]:
test_catalog = create_test_catalog(hostname, domain_schema)
ml_instance = DerivaML(hostname, test_catalog.catalog_id, domain_schema, None, None, "1")

In [None]:
ml_instance.chaise_url("Subject")

In [None]:
[t.name for t in ml_instance.find_features('Image')]

A feature is a set of values attached to a table in the DerivaML catalog. Instances of a feature are distinguished from one another by the ID of the execution that produced the feature value. The execution could be the run of a program, or it could be a manual process by which a person defines a set of values.

To create a new feature, we need to know the name of the feature, the table to which it is attached, and the set of values that make up the feature. The values can be terms from a controlled vocabulary, one or more file-based assets, or other values such as integers or strings. However, the use of strings outside of controlled vocabularies is discouraged.
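To make the idea concrete, here is a minimal sketch in plain Python that does not use DerivaML at all. The `FeatureValue` dataclass, the RID strings, and the term `V2` are hypothetical and exist only for illustration; the sketch just shows how two values of the same feature on the same record stay distinct because different executions produced them.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class FeatureValue:
    """One value of a feature (illustrative only, not a DerivaML class)."""
    feature_name: str   # name of the feature, e.g. "Feature1"
    target_rid: str     # RID of the record the value is attached to (made up)
    execution_rid: str  # RID of the execution that produced the value (made up)
    value: str          # here, a term from a controlled vocabulary


values = [
    FeatureValue("Feature1", "IMG-1", "EXE-A", "V1"),
    FeatureValue("Feature1", "IMG-1", "EXE-B", "V2"),  # same image, different execution
    FeatureValue("Feature1", "IMG-2", "EXE-A", "V1"),
]

# Group the values by target record, then by the execution that produced them.
by_target = defaultdict(dict)
for v in values:
    by_target[v.target_rid][v.execution_rid] = v.value

print(dict(by_target))
```

In this sketch, the record `IMG-1` carries two values of `Feature1`, one per execution, which is the property that lets a manual annotation and a model prediction coexist on the same record.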

In [None]:
# Let's create a feature called Feature1.  First we need to define the term for the feature name.
ml_instance.add_term("Feature_Name", "Feature1", description="A Feature Name")

# We are going to have three values in our feature: a controlled vocabulary term from the vocabulary FeatureValue,
# a file asset, and an integer value, which we will call "TestCol".
ml_instance.create_vocabulary("FeatureValue", "A vocab")
ml_instance.add_term("FeatureValue", "V1", description="A Feature Value")
feature_asset = ml_instance.create_asset("TestAsset", comment="An asset")


ml_instance.create_feature("Feature1", "Image",
                           terms=["FeatureValue"],
                           assets=[feature_asset],
                           metadata=[ColumnDefinition(name='TestCol', type=BuiltinTypes.int2)])

[f.name for f in ml_instance.find_features("Image")]

Now we can add some feature values to our images. To streamline the creation of new feature values, we create a class whose fields are specific to the arguments required by this feature.
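As a rough analogy for what such a feature-specific class provides, the following sketch builds a record class from a feature definition using only the standard library. The helper `make_feature_record_class` and the RID strings are hypothetical, and this is not a description of how DerivaML implements `feature_record_class`; it only illustrates the idea of a generated class with one field per feature column.

```python
from dataclasses import fields, make_dataclass


def make_feature_record_class(target_table, feature_name, value_columns):
    """Build a record class with one field per column of the feature.

    Hypothetical helper for illustration; not part of DerivaML.
    """
    columns = [(target_table, str), ("Execution", str)] + list(value_columns.items())
    return make_dataclass(f"{target_table}Feature{feature_name}", columns)


# Mirror the shape of Feature1 from this notebook: a term, an asset, and an integer.
SketchFeature = make_feature_record_class(
    "Image", "Feature1",
    {"FeatureValue": str, "TestAsset": str, "TestCol": int},
)

field_names = [f.name for f in fields(SketchFeature)]
record = SketchFeature(Image="IMG-1", Execution="EXE-A",
                       FeatureValue="V1", TestAsset="AST-1", TestCol=23)

print(field_names)
print(record)
```

Because the generated class requires every field, leaving out an argument such as `Execution` raises a `TypeError` when the record is constructed, so incomplete feature values are caught before anything is sent to a catalog.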

In [None]:
TestFeatureClass = ml_instance.feature_record_class("Image", "Feature1")
TestFeatureClass.model_fields

Now, using TestFeatureClass, we can create some instances of the feature and add them. We must have an execution_rid in order to define a feature value.

In [None]:
# Get some images to attach the feature value to.
image_rids = [i['RID'] for i in ml_instance.domain_path.tables['Image'].entities().fetch()]

# Make some assets.  We are cheating here by just adding elements to the asset table without actually uploading the assets.
asset_rid = ml_instance.domain_path.tables["TestAsset"].insert([{'Name': "foo", 'URL': "foo/bar", 'Length': 2, 'MD5': 4}])[0]['RID']

# Get an execution RID.
ml_instance.add_term("Workflow_Type", "TestWorkflow", description="A workflow")
workflow_rid = ml_instance.ml_path.tables['Workflow'].insert([{'Name': "Test Workflow", 'Workflow_Type': "TestWorkflow"}])[0]['RID']
execution_rid = ml_instance.ml_path.tables['Execution'].insert([{'Description': "Test execution", 'Workflow': workflow_rid}])[0]['RID']

# Now create a list of feature values using the class returned by feature_record_class.
feature_list = [TestFeatureClass(
    Image=i,
    Execution=execution_rid,
    FeatureValue="V1",
    TestAsset=asset_rid,
    TestCol=23) for i in image_rids]
ml_instance.add_features(feature_list)

In [None]:
def strip_system(d):
    # Drop the system-managed columns (creation/modification time and user) for display.
    return {k: v for k, v in d.items() if k not in ['RCT', 'RMT', 'RCB', 'RMB']}
    
pd.DataFrame([strip_system(i) for i in ml_instance.list_feature("Image", "Feature1")])

In [None]:
test_catalog.delete_ermrest_catalog(really=True)