DerivaML is a class library built on the Deriva scientific asset management system. It is designed to simplify the basic operations involved in building and testing ML libraries based on common toolkits such as TensorFlow. This notebook reviews the basic features of the DerivaML library.

In [1]:
%load_ext autoreload
%autoreload 2

In [24]:
import builtins
from deriva.core.utils.globus_auth_utils import GlobusNativeLogin
from deriva_ml.dataset_bag import DatasetBag
from deriva_ml import ColumnDefinition, BuiltinTypes
from deriva_ml.demo_catalog import create_demo_catalog, DemoML
from deriva_ml import ExecutionConfiguration, Workflow, Execution, MLVocab
import itertools
from IPython.display import display, Markdown
import pandas as pd
from pathlib import Path
import tempfile

Set the details for the catalog we want and authenticate to the server if needed.

In [4]:
hostname = 'tutorial.derivacloud.org'
domain_schema = 'demo-schema'

gnl = GlobusNativeLogin(host=hostname)
if gnl.is_logged_in([hostname]):
    print("You are already logged in.")
else:
    gnl.login([hostname], no_local_server=True, no_browser=True, refresh_tokens=True, update_bdbag_keychain=True)
    print("Login Successful")


You are already logged in.


Create a test catalog and get an instance of the DerivaML class.

In [10]:
test_catalog = create_demo_catalog(hostname, domain_schema, create_features=True, create_datasets=True)
ml_instance = DemoML(hostname, test_catalog.catalog_id)

In [11]:
display(
    [f'{f.target_table.name}:{f.feature_name}' for f in ml_instance.find_features("Subject")],
    [f'{f.target_table.name}:{f.feature_name}' for f in ml_instance.find_features("Image")]
)

['Subject:Health']

['Image:BoundingBox', 'Image:Quality']

In [23]:
system_columns = ['RCT', 'RMT', 'RCB', 'RMB']
datasets = pd.DataFrame(ml_instance.find_datasets()).drop(columns=system_columns)
training_dataset = [ds['RID'] for ds in ml_instance.find_datasets() if 'Training' in ds['Dataset_Type']][0]

display(
    Markdown(f'Training Dataset: {training_dataset}'),
    Markdown('## Datasets'),
    datasets)

Training Dataset: 3PM

## Datasets

|   | RID | Version | Description | Dataset_Type |
|---|-----|---------|-------------|--------------|
| 0 | 3PC |  | A nested dataset for machine learning | [Partitioned, Image] |
| 1 | 3PM |  | An image dataset for training | [Training] |
| 2 | 3PT |  | A image dataset for testing | [Testing] |
| 3 | 3Q0 |  | A image dataset for validation | [Validation] |
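
The list comprehension in the cell above indexes `[0]` directly, so it raises `IndexError` when no dataset carries the requested type. A minimal sketch of a more defensive lookup follows. `first_dataset_of_type` is a hypothetical helper, not part of DerivaML, and the sample records only mirror the `RID` and `Dataset_Type` columns of the table above.

```python
from typing import Optional

# Sample records mirroring the RID and Dataset_Type columns shown above.
sample_datasets = [
    {"RID": "3PC", "Dataset_Type": ["Partitioned", "Image"]},
    {"RID": "3PM", "Dataset_Type": ["Training"]},
    {"RID": "3PT", "Dataset_Type": ["Testing"]},
    {"RID": "3Q0", "Dataset_Type": ["Validation"]},
]


def first_dataset_of_type(datasets, dataset_type) -> Optional[str]:
    """Return the RID of the first dataset tagged with dataset_type, or None."""
    return next(
        (ds["RID"] for ds in datasets if dataset_type in ds["Dataset_Type"]),
        None,
    )


print(first_dataset_of_type(sample_datasets, "Training"))
```

Returning `None` instead of raising lets the caller decide whether a missing dataset type is an error.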


A feature is a set of values attached to a table in the DerivaML catalog. Instances of a feature are distinguished from one another by the ID of the execution that produced the feature value. The execution could be the result of a program, or it could be a manual process by which a person defines a set of values.

To create a new feature, we need to know the name of the feature, the table to which it is attached, and the set of values that make up the feature. The values could be terms from a controlled vocabulary, a set of one or more file-based assets, or other values such as integers or strings. However, the use of strings outside of controlled vocabularies is discouraged.
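
To make the idea concrete, here is a small, library-independent sketch of how feature values for the same row can coexist when they come from different executions. The `FeatureValue` dataclass, the `values_by_execution` helper, and all RIDs are invented for illustration. They are not DerivaML types or real catalog identifiers.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class FeatureValue:
    """One value of a feature, attached to one row by one execution."""
    target_rid: str      # RID of the row (e.g. an Image) the value describes
    feature_name: str    # name of the feature
    execution_rid: str   # execution (program run or manual process) that produced it
    value: str           # ideally a term from a controlled vocabulary


def values_by_execution(values):
    """Group feature values by the execution that produced them."""
    grouped = defaultdict(list)
    for v in values:
        grouped[v.execution_rid].append(v)
    return dict(grouped)


values = [
    FeatureValue("IMG-1", "Quality", "EXE-A", "Good"),
    FeatureValue("IMG-1", "Quality", "EXE-B", "Bad"),   # same image, different execution
    FeatureValue("IMG-2", "Quality", "EXE-A", "Good"),
]
grouped = values_by_execution(values)
print(sorted(grouped))
```

Because the execution ID is part of each value's identity, two executions can disagree about the same image without overwriting each other.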

In [21]:
# Now let's create a model configuration for our program.
with tempfile.TemporaryDirectory() as temp_dir:
    model_file = Path(temp_dir) / 'modelfile.txt'
    with builtins.open(model_file, "w") as fp:
        fp.write("My model")
    training_model = ml_instance.upload_execution_asset(model_file, 'API_Model')

# Now let's create and upload a simple asset.
# The asset is written under the instance's working directory, so no temporary directory is needed.
assetdir = ml_instance.working_dir / "TestAsset"
assetdir.mkdir(parents=True, exist_ok=True)
with builtins.open(assetdir / "test.txt", "w") as fp:
    fp.write("Hi there")
ml_instance.upload_assets(assetdir)

2024-10-12 00:51:54,889 - DEBUG - Resetting dropped connection: dev.eye-ai.org
2024-10-12 00:51:58,913 - DEBUG - https://dev.eye-ai.org:443 "POST /ermrest/catalog/395/schema/demo-schema/table HTTP/11" 201 4778
2024-10-12 00:52:00,149 - DEBUG - https://dev.eye-ai.org:443 "GET /ermrest/catalog/395/schema HTTP/11" 200 103207
2024-10-12 00:52:00,474 - DEBUG - Inserting entities to path: /entity/demo-schema:FeatureValue?defaults=URI,RID,ID,RMB,RCB,RMT,RCT
2024-10-12 00:52:00,475 - DEBUG - yielding batch of 1/1 entities (0:1)
2024-10-12 00:52:01,475 - DEBUG - https://dev.eye-ai.org:443 "POST /ermrest/catalog/395/entity/demo-schema:FeatureValue?defaults=URI,RID,ID,RMB,RCB,RMT,RCT HTTP/11" 200 560
2024-10-12 00:52:01,489 - DEBUG - Fetched 1 entities
2024-10-12 00:52:02,463 - DEBUG - https://dev.eye-ai.org:443 "POST /ermrest/catalog/395/schema/demo-schema/table HTTP/11" 201 4467
2024-10-12 00:52:03,425 - DEBUG - https://dev.eye-ai.org:443 "GET /ermrest/catalog/395/schema HTTP/11" 200 107797
202

Now create a dataset with our new assets.

Now we can add some features to our images. To streamline the creation of a new feature, we create a class that is specific to the arguments required to create it.

In [23]:
TestFeatureClass = ml_instance.feature_record_class("Image", "Feature1")
TestFeatureClass.model_fields

{'Execution': FieldInfo(annotation=str, required=True),
 'Image': FieldInfo(annotation=str, required=True),
 'Feature_Name': FieldInfo(annotation=str, required=False, default='Feature1'),
 'TestAsset': FieldInfo(annotation=str, required=True),
 'FeatureValue': FieldInfo(annotation=str, required=True),
 'TestCol': FieldInfo(annotation=int, required=True)}
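
The `FieldInfo` entries above suggest that the generated class is a pydantic model. As a rough illustration of the general technique of building a record class at run time, the sketch below uses the standard library's `dataclasses.make_dataclass` instead. It is not how DerivaML implements `feature_record_class`. The helper name `make_feature_record_class` and the sample RIDs are hypothetical, and unlike pydantic, a dataclass does not validate field types.

```python
from dataclasses import field, make_dataclass


def make_feature_record_class(feature_name, target_table, value_columns):
    """Build a record class with one field per column of the feature.

    Required fields come first; Feature_Name goes last because dataclass
    fields with defaults must follow fields without defaults.
    """
    fields = [
        ("Execution", str),
        (target_table, str),
        *value_columns.items(),
        ("Feature_Name", str, field(default=feature_name)),
    ]
    return make_dataclass(f"{target_table}{feature_name}Record", fields)


Feature1Record = make_feature_record_class(
    "Feature1",
    "Image",
    {"TestAsset": str, "FeatureValue": str, "TestCol": int},
)

record = Feature1Record(
    Execution="EXE-A", Image="IMG-1", TestAsset="AST-1", FeatureValue="Good", TestCol=3
)
print(record)
```

Omitting a required field such as `Execution` raises `TypeError` at construction time, which is the main benefit of a feature-specific class over a plain dictionary.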

Now, using TestFeatureClass, we can create some instances of the feature and add them. We must have an execution_rid in order to define the feature.

In [None]:
ml_instance.add_term(MLVocab.workflow_type, "API Workflow", description="A Workflow that uses Deriva ML API")
ml_instance.add_term(MLVocab.execution_asset_type, "API_Model", description="Model for our API workflow")

api_workflow = Workflow(
    name="API Workflow",
    url="https://github.com/informatics-isi-edu/deriva-ml/blob/main/pyproject.toml",
    workflow_type="API Workflow",
    description="A workflow that uses Deriva ML"
)

config = ExecutionConfiguration(
    datasets=[training_dataset],
    assets=[training_model],
    execution=Execution(description="Sample Execution"),
    workflow=api_workflow
)

ml_execution = ml_instance.initialize_execution(config)

In [None]:
with MLExecute(ml_execution):
    # Get the input datasets:
    dataset = DatasetBag(ml_execution.bag_paths[0])  # Input dataset

    # Get input files
    with open(ml_execution.asset_paths[0], 'rt') as model_file:
        model = model_file.read()
        print(f'Got model file: {model}')

    # Put your ML code here....
    pass

    # Write an output asset. The output directory must exist before we write into it.
    output_dir = ml_instance.execution_assets_path / "Fun Files"
    output_dir.mkdir(parents=True, exist_ok=True)
    with open(output_dir / "test.txt", "w+") as f:
        f.write("Hello there a new model;\n")

    # Create some new feature values.
    bb_csv_path, bb_asset_paths = ml_execution.feature_paths('Image', 'BoundingBox')

    # Write ten bounding-box files and collect their paths (each path is added exactly once).
    bounding_box_files = []
    for i in range(10):
        bounding_box_files.append(fn := bb_asset_paths['BoundingBox'] / f"box{i}.txt")
        with builtins.open(fn, "w") as fp:
            fp.write(f"Hi there {i}")

    # Record class for the Image:BoundingBox feature.
    ImageBoundingboxFeature = ml_instance.feature_record_class('Image', 'BoundingBox')

    # image_rids is assumed to hold the RIDs of the images to annotate; it is not defined earlier in this notebook.
    image_bounding_box_feature_list = [ImageBoundingboxFeature(Image=image_rid,
                                                               Execution=ml_execution.execution_rid,
                                                               BoundingBox=asset_rid)
                                       for image_rid, asset_rid in zip(image_rids, itertools.cycle(bounding_box_files))]

    ml_execution.write_feature_file(image_bounding_box_feature_list)
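
The cell above pairs every image with a bounding-box file using `zip` and `itertools.cycle`. The short, standalone sketch below shows how that pairing behaves when there are more images than files. The RIDs and file names are placeholders, not values from the catalog.

```python
import itertools

# Placeholder inputs: five images but only two bounding-box files.
image_rids = [f"IMG-{i}" for i in range(5)]
box_files = ["box0.txt", "box1.txt"]

# cycle() repeats the files endlessly; zip() stops when image_rids is exhausted.
pairs = list(zip(image_rids, itertools.cycle(box_files)))

for image_rid, box_file in pairs:
    print(image_rid, box_file)
```

Every image gets exactly one file and the files are reused in order. If the list of files is empty, `itertools.cycle` yields nothing and `zip` produces no pairs at all, so it is worth checking that the files were written before building the feature list.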

In [None]:
upload_status = ml_instance.upload_execution(configuration=ml_execution)
e = (list(ml_instance.pathBuilder.deriva_ml.Execution.entities().fetch()))[0]

In [None]:

pd.DataFrame(ml_instance.list_feature_values("Image", "Feature1"))

In [None]:
test_catalog.delete_ermrest_catalog(really=True)