# DerivaML Features Example

DerivaML is a class library built on the Deriva scientific asset management system. It is designed to simplify the basic operations involved in building and testing ML libraries based on common toolkits such as TensorFlow. This notebook demonstrates how to define and use features in DerivaML.


In DerivaML, "features" are the way we attach values to objects in the catalog. A feature could be a computed value that serves as input to an ML model, or it could be a label that results from running a model. A feature value can be a controlled vocabulary term, an asset, or a plain value such as a number.

Each feature value in the catalog is distinguished by the name of the feature, the identity of the object that the feature is attached to, and the RID of the execution that generated the value.
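
The three-part identity described above can be pictured as a composite key. The following is a minimal sketch in plain Python, not DerivaML's implementation: `FeatureKey`, the `store` dictionary, and the execution RIDs `EXEC-A` and `EXEC-B` are hypothetical names used only for illustration.

```python
from typing import NamedTuple


class FeatureKey(NamedTuple):
    """Hypothetical composite key: what identifies one feature value."""

    feature_name: str   # e.g. "Health"
    target_rid: str     # RID of the object the value is attached to
    execution_rid: str  # RID of the execution that produced the value


# A plain dict stands in for the catalog here (illustration only).
store: dict[FeatureKey, str] = {}

# Two executions can each attach a "Health" value to the same subject
# without overwriting each other, because the execution RID differs.
store[FeatureKey("Health", "2ZG", "EXEC-A")] = "Well"
store[FeatureKey("Health", "2ZG", "EXEC-B")] = "Sick"

values_for_subject = sorted(
    value for key, value in store.items() if key.target_rid == "2ZG"
)
print(values_for_subject)
```

Because the execution RID is part of the key, values produced by different runs (or by a person and a program) can coexist for the same object.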

## Set up Deriva for test case

In [1]:
# Automatically reload edited modules during development.
%load_ext autoreload
%autoreload 2

In [2]:
import builtins
import csv
from deriva.core.utils.globus_auth_utils import GlobusNativeLogin
from deriva_ml.deriva_ml_base import ColumnDefinition, BuiltinTypes, MLVocab
from deriva_ml.schema_setup.test_catalog import create_test_catalog, DemoML
from deriva_ml.execution_configuration import ExecutionConfiguration, Workflow, Execution
from IPython.display import display, Markdown
import itertools
import pandas as pd
import tempfile
import random

Set the details for the catalog we want and authenticate to the server if needed.

In [3]:
hostname = 'dev.eye-ai.org'
domain_schema = 'demo-schema'

gnl = GlobusNativeLogin(host=hostname)
if gnl.is_logged_in([hostname]):
    print("You are already logged in.")
else:
    gnl.login([hostname], no_local_server=True, no_browser=True, refresh_tokens=True, update_bdbag_keychain=True)
    print("Login Successful")


You are already logged in.


Create a test catalog and get an instance of the DerivaML class.

In [4]:
test_catalog = create_test_catalog(hostname, domain_schema)
ml_instance = DemoML(hostname, test_catalog.catalog_id)
display(f"Created demo catalog at {hostname}:{test_catalog.catalog_id}")

'Created demo catalog at dev.eye-ai.org:502'

A feature is a set of values attached to a table in the DerivaML catalog. Instances of a feature are distinguished from one another by the ID of the execution that produced the feature value. The execution could be a program run, or it could be a manual process in which a person defines a set of values.

To create a new feature, we need to know the name of the feature, the table to which it is attached, and the set of values that make up the feature. The values could be terms from a controlled vocabulary, one or more file-based assets, or other values such as integers or strings. However, the use of strings outside of controlled vocabularies is discouraged.

In [5]:
# Prerequisites for our features: controlled vocabulary (CV) terms and an asset table.

# Create a vocabulary and add a term to it to use in our features.
ml_instance.create_vocabulary("SubjectHealth", "Controlled vocabulary for subject health")
ml_instance.add_term("SubjectHealth", "Sick", description="The subject self reports that they are sick")
ml_instance.add_term("SubjectHealth", "Well", description="The subject self reports that they feel well")

ml_instance.create_vocabulary("ImageQuality", "Controlled vocabulary for image quality")
ml_instance.add_term("ImageQuality", "Good", description="The image is good")
ml_instance.add_term("ImageQuality", "Bad", description="The image is bad")

box_asset = ml_instance.create_asset("BoundingBox", comment="A file that contains a cropped version of an image")

In [6]:
# The Health feature on Subject has two values: a controlled vocabulary term from the
# SubjectHealth vocabulary and an integer metadata column called "Scale".
ml_instance.create_feature(
    "Health", "Subject",
    terms=["SubjectHealth"],
    metadata=[ColumnDefinition(name='Scale', type=BuiltinTypes.int2)])

ml_instance.create_feature("BoundingBox", "Image", assets=[box_asset])
ml_instance.create_feature("Quality", "Image", terms=["ImageQuality"])

deriva_ml.deriva_ml_base.ImageFeatureQuality

In [7]:
display(
    [f.name for f in ml_instance.find_features("Subject")],
    [f.name for f in ml_instance.find_features("Image")]
)

['Execution_Subject_Health']

['Execution_Image_BoundingBox', 'Execution_Image_Quality']
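
The names printed above appear to follow the pattern `Execution_<table>_<feature>`. This is inferred only from the output shown here, not from the library's documentation, so treat the sketch below as an assumption; `feature_table_name` is a hypothetical helper and not part of the DerivaML API.

```python
def feature_table_name(target_table: str, feature_name: str) -> str:
    """Hypothetical helper that mirrors the names printed in the output above."""
    return f"Execution_{target_table}_{feature_name}"


names = [
    feature_table_name("Subject", "Health"),
    feature_table_name("Image", "BoundingBox"),
    feature_table_name("Image", "Quality"),
]
print(names)
```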

Now we can add some feature values to our subjects and images. To streamline the creation of new feature values, we create a record class for each feature whose fields match the columns that the feature requires.

In [8]:
ImageQualityFeature = ml_instance.feature_record_class("Image", "Quality")
ImageBoundingboxFeature = ml_instance.feature_record_class("Image", "BoundingBox")
SubjectWellnessFeature = ml_instance.feature_record_class("Subject", "Health")

display(
    Markdown('### SubjectWellnessFeature'),
    SubjectWellnessFeature.columns,
    Markdown('### ImageQualityFeature'),
    ImageQualityFeature.columns,
    Markdown('### ImageBoundingboxFeature'),
    ImageBoundingboxFeature.columns
)

### SubjectWellnessFeature

['Execution', 'Subject', 'Feature_Name', 'SubjectHealth', 'Scale']

### ImageQualityFeature

['Execution', 'Image', 'Feature_Name', 'ImageQuality']

### ImageBoundingboxFeature

['Execution', 'Image', 'Feature_Name', 'BoundingBox']
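
To give a feel for what a per-feature record class provides, here is a minimal sketch that builds a class from a list of column names using the standard library. It is not how `feature_record_class` is implemented; `make_record_class` and `QualityRecord` are hypothetical names, every field is typed as a string for simplicity, and all fields are required here.

```python
from dataclasses import asdict, fields, make_dataclass


def make_record_class(class_name: str, column_names: list[str]):
    """Build a record class with one string field per column (hypothetical helper)."""
    return make_dataclass(class_name, [(name, str) for name in column_names])


# Column names copied from the ImageQualityFeature output above.
QualityRecord = make_record_class(
    "QualityRecord", ["Execution", "Image", "Feature_Name", "ImageQuality"]
)

record = QualityRecord(
    Execution="3H6", Image="30R", Feature_Name="Quality", ImageQuality="Good"
)
column_names = [f.name for f in fields(QualityRecord)]
print(column_names)
print(asdict(record))
```

A class generated this way rejects misspelled or missing column names at construction time, which is the main benefit of using a record class over a bare dictionary.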

Using these feature record classes, we can now create instances of each feature and add them to the catalog. We must have an execution RID in order to define a feature value.

In [9]:
ml_instance.add_term(MLVocab.workflow_type, "API Workflow", description="A Workflow that uses Deriva ML API")
api_workflow = Workflow(
    name="API Workflow", 
    url="https://github.com/informatics-isi-edu/deriva-ml/blob/main/pyproject.toml",
    workflow_type="API Workflow"
)

api_execution = ml_instance.initialize_execution(
    ExecutionConfiguration(
        execution=Execution(description="Sample Execution"),
        workflow=api_workflow,
        description="Our Sample Workflow instance",
    )
)

In [10]:
# Create and upload some simple assets to use as bounding boxes.
with tempfile.TemporaryDirectory() as temp_dir:
    assetdir = ml_instance.asset_directory('BoundingBox', prefix=temp_dir)
    for i in range(10):
        with builtins.open(assetdir / f"box{i}.txt", "w") as fp:
            fp.write(f"Hi there {i}")
    bounding_box_assets = ml_instance.upload_assets(assetdir)

# Get the RIDs of the subjects, images, and bounding boxes to attach feature values to.
subject_rids = [i['RID'] for i in ml_instance.domain_path.tables['Subject'].entities().fetch()]
image_rids = [i['RID'] for i in ml_instance.domain_path.tables['Image'].entities().fetch()]
bounding_box_rids = [i['RID'] for i in ml_instance.domain_path.tables['BoundingBox'].entities().fetch()]

In [11]:
# Now create a list of feature values using the record classes returned by feature_record_class.
subject_feature_list = [SubjectWellnessFeature(
    Subject=subject_rid,
    Execution=api_execution.execution_rid,
    SubjectHealth= ["Well", "Sick"][random.randint(0,1)],
    Scale=random.randint(1, 10)) for subject_rid in subject_rids]

image_quality_feature_list = [ImageQualityFeature(
    Image=image_rid,
    Execution=api_execution.execution_rid,
    ImageQuality= ["Good", "Bad"][random.randint(0,1)])
        for image_rid in image_rids]

image_bounding_box_feature_list = [ImageBoundingboxFeature(
    Image=image_rid,
    Execution=api_execution.execution_rid,
    BoundingBox=asset_rid)
        for image_rid, asset_rid in zip(image_rids, itertools.cycle(bounding_box_rids))]

ml_instance.add_features(subject_feature_list)
ml_instance.add_features(image_quality_feature_list)
ml_instance.add_features(image_bounding_box_feature_list)

20
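
The bounding-box list in the cell above pairs each image with an asset by combining `zip` with `itertools.cycle`. The standalone sketch below shows how that pairing behaves when there are fewer assets than images; the RIDs are sample values taken from the tables in this notebook.

```python
import itertools

image_rids = ["30R", "30T", "30W", "30Y", "310"]
asset_rids = ["3HA", "3HC"]

# cycle() repeats the assets endlessly; zip() stops when the images run out.
pairs = list(zip(image_rids, itertools.cycle(asset_rids)))
print(pairs)
```

Every image receives exactly one asset, and assets are reused in order once the shorter list is exhausted.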

In [12]:
system_columns = ['RCT', 'RMT', 'RCB', 'RMB', 'Feature_Name']

display(
    Markdown('### Wellness'),
    pd.DataFrame(ml_instance.list_feature("Subject", "Health")).drop(columns=system_columns),
    Markdown('### Image Quality'),
    pd.DataFrame(ml_instance.list_feature("Image", "Quality")).drop(columns=system_columns),
    Markdown('### BoundingBox'),
    pd.DataFrame(ml_instance.list_feature("Image", "BoundingBox")).drop(columns=system_columns)
)

### Wellness

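| | RID | Execution | Subject | SubjectHealth | Scale |
|---|---|---|---|---|---|
| 0 | 3HY | 3H6 | 2ZG | Well | 9 |
| 1 | 3J0 | 3H6 | 2ZJ | Sick | 2 |
| 2 | 3J2 | 3H6 | 2ZM | Sick | 6 |
| 3 | 3J4 | 3H6 | 2ZP | Sick | 10 |
| 4 | 3J6 | 3H6 | 2ZR | Sick | 6 |
| 5 | 3J8 | 3H6 | 2ZT | Sick | 3 |
| 6 | 3JA | 3H6 | 2ZW | Well | 1 |
| 7 | 3JC | 3H6 | 2ZY | Well | 2 |
| 8 | 3JE | 3H6 | 300 | Sick | 5 |
| 9 | 3JG | 3H6 | 302 | Sick | 5 |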


### Image Quality

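| | RID | Execution | Image | ImageQuality |
|---|---|---|---|---|
| 0 | 3K6 | 3H6 | 30R | Bad |
| 1 | 3K8 | 3H6 | 30T | Bad |
| 2 | 3KA | 3H6 | 30W | Bad |
| 3 | 3KC | 3H6 | 30Y | Good |
| 4 | 3KE | 3H6 | 310 | Bad |
| 5 | 3KG | 3H6 | 312 | Good |
| 6 | 3KJ | 3H6 | 314 | Bad |
| 7 | 3KM | 3H6 | 316 | Good |
| 8 | 3KP | 3H6 | 318 | Bad |
| 9 | 3KR | 3H6 | 31A | Bad |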


### BoundingBox

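| | RID | Execution | Image | BoundingBox |
|---|---|---|---|---|
| 0 | 3ME | 3H6 | 30R | 3HA |
| 1 | 3MG | 3H6 | 30T | 3HC |
| 2 | 3MJ | 3H6 | 30W | 3HE |
| 3 | 3MM | 3H6 | 30Y | 3HG |
| 4 | 3MP | 3H6 | 310 | 3HJ |
| 5 | 3MR | 3H6 | 312 | 3HM |
| 6 | 3MT | 3H6 | 314 | 3HP |
| 7 | 3MW | 3H6 | 316 | 3HR |
| 8 | 3MY | 3H6 | 318 | 3HT |
| 9 | 3N0 | 3H6 | 31A | 3HW |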


Now let's make some more features, but this time we will upload them from local files.

In [13]:
ml_instance.add_term(MLVocab.workflow_type, "File Workflow", description="A Workflow that loads features from file system")

fs_workflow = Workflow(
    name="File Workflow", 
    url="https://github.com/informatics-isi-edu/deriva-ml/blob/main/pyproject.toml",
    workflow_type="File Workflow"
)

fs_execution = ml_instance.initialize_execution(ExecutionConfiguration(
    execution=Execution(description="Sample Execution via filesystem"), 
    workflow=fs_workflow, 
    description="Our Sample Workflow instance")
)

In [14]:
# Create a new set of bounding box files. For fun, let's wrap this in an execution so we get status updates.

with ml_instance.execution(configuration=fs_execution) as exec:
    bb_csv_path, bb_asset_paths = ml_instance.feature_paths('Image', 'BoundingBox')
    # Build the list of file paths once, then write each file.
    bounding_box_files = [bb_asset_paths['BoundingBox'] / f"box{i}.txt" for i in range(10)]
    for i, fn in enumerate(bounding_box_files):
        with builtins.open(fn, "w") as fp:
            fp.write(f"Hi there {i}")

    image_bounding_box_feature_list = [ImageBoundingboxFeature(Image=image_rid,
        Execution=fs_execution.execution_rid,
        BoundingBox=asset_path)  # a local file path here, not an asset RID
            for image_rid, asset_path in zip(image_rids, itertools.cycle(bounding_box_files))]

    with open(bb_csv_path, 'w', newline='') as f:
        writer = csv.DictWriter(f, fieldnames=ImageBoundingboxFeature.columns)
        writer.writeheader()
        for bb in image_bounding_box_feature_list:
            writer.writerow(bb.dict())
    
    quality_csv_path, _ = ml_instance.feature_paths('Image', 'Quality')
    image_quality_feature_list = [ImageQualityFeature(
        Image=image_rid,
        Execution=fs_execution.execution_rid,
        ImageQuality= ["Good", "Bad"][random.randint(0,1)])
            for image_rid in image_rids]
    with open(quality_csv_path, 'w', newline='') as f:
        writer = csv.DictWriter(f, fieldnames=ImageQualityFeature.columns)
        writer.writeheader()
        for iq in image_quality_feature_list:
            writer.writerow(iq.dict())
            
    wellness_csv_path, _ = ml_instance.feature_paths('Subject', 'Health')
    subject_feature_list = [SubjectWellnessFeature(
        Subject=subject_rid,
        Execution=fs_execution.execution_rid,
        SubjectHealth= ["Well", "Sick"][random.randint(0,1)],
        Scale=random.randint(1, 10)) for subject_rid in subject_rids]
    with open(wellness_csv_path, 'w', newline='') as f:
        writer = csv.DictWriter(f, fieldnames=list(SubjectWellnessFeature.columns))
        writer.writeheader()
        for sw in subject_feature_list:
            writer.writerow(sw.dict())
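
The CSV-writing pattern used in the cell above can be tried in isolation. This sketch writes two sample records to an in-memory buffer and reads them back; the rows are illustrative values, and `io.StringIO` stands in for the real feature file. When writing to an actual file, the Python `csv` documentation recommends opening it with `newline=''` so that row terminators are not translated.

```python
import csv
import io

columns = ["Execution", "Image", "Feature_Name", "ImageQuality"]
rows = [
    {"Execution": "3H6", "Image": "30R", "Feature_Name": "Quality", "ImageQuality": "Bad"},
    {"Execution": "3H6", "Image": "30Y", "Feature_Name": "Quality", "ImageQuality": "Good"},
]

# Write a header row followed by one row per record.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=columns)
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()

# Read the text back to confirm that the round trip preserves every record.
round_trip = [dict(row) for row in csv.DictReader(io.StringIO(csv_text))]
print(csv_text)
```

Passing the record class's column list as `fieldnames` keeps the header order aligned with the feature's columns, and `DictWriter` raises an error if a record carries a key that is not in that list.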
        

In [15]:
ml_instance.upload_execution(fs_execution, clean_folder=False)

Uploading /var/folders/0k/27qzm97x3t7g3j1m6ksf_9f40000gn/T/tmpc3xhuvz7/DemoML_working/deriva-ml/demo-schema



/var/folders/0k/27qzm97x3t7g3j1m6ksf_9f40000gn/T/tmpc3xhuvz7/DemoML_working/deriva-ml/demo-schema/feature/Image/BoundingBox/BoundingBox.csv -- [DerivaUploadCatalogCreateError] [HTTPError] 409 Client Error: CONFLICT for url: [https://dev.eye-ai.org/ermrest/catalog/502/entity/demo-schema:Execution_Image_BoundingBox?defaults=RID,RCB,RMB,RCT,RMT] Details: b'Request conflicts with state of server. Detail: Input data violates model. ERROR:  insert or update on table "Execution_Image_BoundingBox" violates foreign key constraint "Execution_Image_BoundingBox_BoundingBox_fkey"\nDETAIL:  Key (BoundingBox)=(/var/folders/0k/27qzm97x3t7g3j1m6ksf_9f40000gn/T/tmpc3xhuvz7/DemoML_working/deriva-ml/demo-schema/feature/Image/BoundingBox/BoundingBox/box0.txt) is not present in table "BoundingBox".\n\n' - Server responded: Request conflicts with state of server. Detail: Input data violates model. ERROR:  insert or update on table "Execution_Image_BoundingBox" violates foreign key constraint "Execution_Imag
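
The error above reports a foreign key violation: the value in the `BoundingBox` column is a local file path, and that value is not present in the `BoundingBox` table. One way to catch this kind of mismatch before uploading is to compare each row's reference against the set of known RIDs. The sketch below uses plain Python; `find_dangling_references` is a hypothetical helper, not part of DerivaML, and the file path is a shortened stand-in.

```python
def find_dangling_references(rows, column, known_rids):
    """Return the rows whose `column` value is not a known RID (hypothetical helper)."""
    known = set(known_rids)
    return [row for row in rows if row[column] not in known]


# Sample RIDs taken from the BoundingBox table shown earlier in the notebook.
known_bounding_box_rids = ["3HA", "3HC", "3HE"]
rows = [
    {"Image": "30R", "BoundingBox": "3HA"},
    {"Image": "30T", "BoundingBox": "/tmp/feature/Image/BoundingBox/box0.txt"},
]

bad_rows = find_dangling_references(rows, "BoundingBox", known_bounding_box_rids)
print(bad_rows)
```

Any row returned by such a check would trigger the same constraint failure on the server, so it is cheaper to detect it locally first.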

In [16]:
display(
    Markdown('### Wellness'),
    pd.DataFrame(ml_instance.list_feature("Subject", "Health")).drop(columns=system_columns),
    Markdown('### Image Quality'),
    pd.DataFrame(ml_instance.list_feature("Image", "Quality")).drop(columns=system_columns),
    Markdown('### BoundingBox'),
    pd.DataFrame(ml_instance.list_feature("Image", "BoundingBox")).drop(columns=system_columns)
)

### Wellness

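| | RID | Execution | Subject | SubjectHealth | Scale |
|---|---|---|---|---|---|
| 0 | 3HY | 3H6 | 2ZG | Well | 9 |
| 1 | 3J0 | 3H6 | 2ZJ | Sick | 2 |
| 2 | 3J2 | 3H6 | 2ZM | Sick | 6 |
| 3 | 3J4 | 3H6 | 2ZP | Sick | 10 |
| 4 | 3J6 | 3H6 | 2ZR | Sick | 6 |
| 5 | 3J8 | 3H6 | 2ZT | Sick | 3 |
| 6 | 3JA | 3H6 | 2ZW | Well | 1 |
| 7 | 3JC | 3H6 | 2ZY | Well | 2 |
| 8 | 3JE | 3H6 | 300 | Sick | 5 |
| 9 | 3JG | 3H6 | 302 | Sick | 5 |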


### Image Quality

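| | RID | Execution | Image | ImageQuality |
|---|---|---|---|---|
| 0 | 3K6 | 3H6 | 30R | Bad |
| 1 | 3K8 | 3H6 | 30T | Bad |
| 2 | 3KA | 3H6 | 30W | Bad |
| 3 | 3KC | 3H6 | 30Y | Good |
| 4 | 3KE | 3H6 | 310 | Bad |
| 5 | 3KG | 3H6 | 312 | Good |
| 6 | 3KJ | 3H6 | 314 | Bad |
| 7 | 3KM | 3H6 | 316 | Good |
| 8 | 3KP | 3H6 | 318 | Bad |
| 9 | 3KR | 3H6 | 31A | Bad |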


### BoundingBox

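| | RID | Execution | Image | BoundingBox |
|---|---|---|---|---|
| 0 | 3ME | 3H6 | 30R | 3HA |
| 1 | 3MG | 3H6 | 30T | 3HC |
| 2 | 3MJ | 3H6 | 30W | 3HE |
| 3 | 3MM | 3H6 | 30Y | 3HG |
| 4 | 3MP | 3H6 | 310 | 3HJ |
| 5 | 3MR | 3H6 | 312 | 3HM |
| 6 | 3MT | 3H6 | 314 | 3HP |
| 7 | 3MW | 3H6 | 316 | 3HR |
| 8 | 3MY | 3H6 | 318 | 3HT |
| 9 | 3N0 | 3H6 | 31A | 3HW |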


In [17]:
#test_catalog.delete_ermrest_catalog(really=True)