# Biolink Model Subsetting
This notebook demonstrates two ways to subset the Biolink Model using its schema and a linkml-map transformation specification: first with a specification written by hand in YAML and loaded from the filesystem, and second with a specification built dynamically from a list of classes retrieved with SchemaView.

First, we import several LinkML helper packages, including SchemaView (https://linkml.io/linkml/developers/schemaview.html).
SchemaView is a LinkML schema introspection tool that retrieves model elements by name. It also supports navigating the ancestors and descendants of any model element, including classes, slots, types, and permissible values in enumerations.


In [41]:
from pathlib import Path

from linkml_map.datamodel.transformer_model import (
    ClassDerivation,
    CopyDirective,
    SlotDerivation,
    TransformationSpecification,
)
from linkml_map.inference.schema_mapper import SchemaMapper
from linkml_map.session import Session
from linkml_map.utils.loaders import load_specification
from linkml_runtime.dumpers import yaml_dumper
from linkml_runtime.utils.formatutils import camelcase, underscore
from linkml_runtime.utils.schemaview import SchemaView

REPO_ROOT = Path.cwd().parent.parent

SchemaView can be initialized from a variety of inputs, including a Path, the string representation of a path, or (as we do in this case) the URL of a raw LinkML schema.

In [42]:
schema_url = "https://raw.githubusercontent.com/biolink/biolink-model/master/biolink-model.yaml"
sv = SchemaView(schema_url)


### Creating a Transformation Specification Manually in YAML Format
In our first example, we develop a transformation specification for the Biolink Model using class and slot derivations, as defined by the linkml-map transformation language. More about that here: https://linkml.io/linkml-map/#TransformationSpecification/
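The contents of `biolink-example-profile.transform.yaml` are not reproduced in this notebook. As a rough illustration only, a specification built from class and slot derivations has a nested shape like the one below, written here as a plain Python dictionary rather than YAML. The classes and slots chosen are assumptions picked to resemble the output printed further down, not a copy of the actual file.

```python
import json

# Hypothetical sketch of a transformation specification's shape.
# The keys mirror the ClassDerivation / SlotDerivation fields used later in
# this notebook (populated_from, slot_derivations); the values are assumed.
example_spec = {
    "class_derivations": {
        "Gene": {
            "populated_from": "gene",
            "slot_derivations": {
                "id": {"populated_from": "id"},
                "symbol": {"populated_from": "symbol"},
            },
        },
        "Disease": {
            "populated_from": "disease",
            "slot_derivations": {
                "id": {"populated_from": "id"},
                "name": {"populated_from": "name"},
            },
        },
    }
}

# Print the nested structure so its shape is easy to inspect.
print(json.dumps(example_spec, indent=2))
```

Each top-level key under `class_derivations` names a class in the target schema, and `populated_from` points back at the source element it is derived from.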

In [43]:
transform_file = REPO_ROOT / "tests/input/examples/biolink/transform/biolink-example-profile.transform.yaml"
# Initialize the session
session = Session()

# Set the source schema in the session
session.set_source_schema(sv)

tr_spec = load_specification(transform_file)
mapper = SchemaMapper()
mapper.source_schemaview = sv

target_schema_obj = mapper.derive_schema(
    specification=tr_spec,
    target_schema_id="biolink-profile",
    target_schema_name="BiolinkProfile",
)

yaml_dumper.dump(target_schema_obj, "biolink-profile.yaml")

transformed_sv = SchemaView("biolink-profile.yaml")

for class_name in transformed_sv.all_classes():
    print(class_name)
print()
for slot_name in transformed_sv.all_slots():
    print(slot_name)

NamedThing
Gene
Disease
PhenotypicFeature
Association
GeneToPhenotypicFeatureAssociation

id
name
category
symbol
subject
predicate
object


In [44]:
!gen-pydantic biolink-profile.yaml

Exception: range: label type


Notes:
* Tracing provenance between a source schema and a destination schema is still in development. Right now there is no provenance.
* If a class, slot, enum, or type is not included in the derivation at all, it is not carried forward to the destination schema. An option to pull all non-specified components of the source model into the destination model is in development.
* You can also apply transformations within a derivation; see https://linkml.io/linkml-map/#examples/Tutorial/#using-expressions
* You can transform data as well as schemas, but this is currently at a "beta" level of development.
* Custom types are not carried forward; this is a known bug at the moment, and it is likely why `gen-pydantic` fails above on the range `label type`.
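The second note can be pictured with a toy model of derivation in plain Python. This is a simplified illustration under stated assumptions, not linkml-map's implementation: `toy_derive`, the dictionary layout, and the sample slot ranges are all hypothetical.

```python
def toy_derive(source_classes, derivations):
    """Copy only the classes and slots named in `derivations` (hypothetical helper)."""
    target = {}
    for target_name, derivation in derivations.items():
        source = source_classes[derivation["populated_from"]]
        # Slots that the derivation does not list are silently left behind.
        target[target_name] = {
            slot: source[slot] for slot in derivation["slots"] if slot in source
        }
    return target


# Assumed, heavily simplified source model: class name -> {slot name: range}.
source_classes = {
    "gene": {"id": "string", "symbol": "string", "xref": "uriorcurie"},
    "disease": {"id": "string", "name": "string"},
}

# Only "gene" is derived, and only two of its three slots are requested.
derivations = {"Gene": {"populated_from": "gene", "slots": ["id", "symbol"]}}

target = toy_derive(source_classes, derivations)
print(target)
```

Here `disease` and the `xref` slot never reach the target because nothing in the derivation mentions them, which is the behaviour the note describes.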

### Creating a TransformationSpecification Using SchemaView and an Existing Model

In our second example, we use the Biolink Model directly to derive classes and slots programmatically according to a simple list of "subset classes" that we want to extract from the main model in order to produce a subset model according to our specification.

First, we write a function that builds class and slot derivations from the Biolink Model using SchemaView.

In [36]:
def get_biolink_class_derivations(sv, subset_classes) -> dict:
    """
    Build class derivations (including slot derivations) for a subset of Biolink classes.

    :param sv: SchemaView of the source schema
    :param subset_classes: names of the source classes to include in the subset
    :return: dictionary mapping derived class names to ClassDerivation objects
    """
    # One ClassDerivation per requested class, with one SlotDerivation
    # for every slot the class has or inherits (its induced slots).
    class_derivations = {}
    for class_name in subset_classes:
        class_derivation = ClassDerivation(populated_from=class_name,
                                           name=camelcase(class_name))
        induced_slots = sv.class_induced_slots(class_name)
        for slot in induced_slots:
            slot_derivation = SlotDerivation(populated_from=slot.name, name=underscore(slot.name))
            class_derivation.slot_derivations[underscore(slot.name)] = slot_derivation
        class_derivations[camelcase(class_name)] = class_derivation
    return class_derivations
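The function above relies on `camelcase` and `underscore` to turn Biolink's space-separated names into identifier-friendly ones. The stand-ins below are simplified approximations written for illustration, not the `linkml_runtime` implementations; `to_camel_case` and `to_snake_case` are hypothetical names, and they only handle plain space-separated input.

```python
def to_camel_case(name: str) -> str:
    """Capitalize the first letter of each word and join without separators."""
    return "".join(word[:1].upper() + word[1:] for word in name.split())


def to_snake_case(name: str) -> str:
    """Join whitespace-separated words with underscores."""
    return "_".join(name.split())


class_name = to_camel_case("gene to phenotypic feature association")
slot_name = to_snake_case("in taxon")
print(class_name)
print(slot_name)
```

Class names become CamelCase and slot names become snake_case, which matches the naming seen in the derived class list printed earlier (for example `GeneToPhenotypicFeatureAssociation`).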


In [40]:
session = Session()

# Set the source schema in the session
session.set_source_schema(sv)

subset_classes = [
        "gene",
        "disease",
        "case to phenotypic feature association",
        "gene to disease association",
        "gene to phenotypic feature association",
        "case",
        "phenotypic feature",
    ]

class_derivations = get_biolink_class_derivations(sv, subset_classes)
copy_type_directives = {
    type_name: CopyDirective(element_name=type_name, copy_all=True)
    for type_name, type_def in sv.all_types().items()
}

ts = TransformationSpecification(class_derivations=class_derivations, copy_directives=copy_type_directives)

mapper = SchemaMapper()
mapper.source_schemaview = sv

target_schema_obj = mapper.derive_schema(
    specification=ts, target_schema_id="biolink-subset", target_schema_name="BiolinkSubset"
)

yaml_dumper.dump(target_schema_obj, "biolink-subset.yaml")

transformed_sv = SchemaView("biolink-subset.yaml")

for class_name in transformed_sv.all_classes():
    print("class derived: ", class_name)
for slot_name in transformed_sv.all_slots():
    print("slot derived: ", slot_name)
for type_name in transformed_sv.all_types():
    print("type copied: ", type_name)

AttributeError: 'NoneType' object has no attribute 'is_a'

In [28]:
# print the content of the new schema in LinkML YAML format to view here in the notebook
yaml_content = yaml_dumper.dumps(target_schema_obj)  # Serialize to a string
print(yaml_content)

name: BiolinkSubset
id: biolink-subset
imports:
- linkml:types
prefixes:
  AGRKB:
    prefix_prefix: AGRKB
    prefix_reference: https://www.alliancegenome.org/
  apollo:
    prefix_prefix: apollo
    prefix_reference: https://github.com/GMOD/Apollo
  AspGD:
    prefix_prefix: AspGD
    prefix_reference: http://www.aspergillusgenome.org/cgi-bin/locus.pl?dbid=
  biolink:
    prefix_prefix: biolink
    prefix_reference: https://w3id.org/biolink/vocab/
  bioschemas:
    prefix_prefix: bioschemas
    prefix_reference: https://bioschemas.org/
  linkml:
    prefix_prefix: linkml
    prefix_reference: https://w3id.org/linkml/
  CAID:
    prefix_prefix: CAID
    prefix_reference: http://reg.clinicalgenome.org/redmine/projects/registry/genboree_registry/by_caid?caid=
  CHADO:
    prefix_prefix: CHADO
    prefix_reference: http://gmod.org/wiki/Chado/
  ChemBank:
    prefix_prefix: ChemBank
    prefix_reference: http://chembank.broadinstitute.org/chemistry/viewMolecule.htm?cbid=
  CHEMBL.MECHANIS

Notes:
* Notice that if we leave a parent class out of the subset (e.g. "NamedThing"), the `is_a` reference in its descendant classes is dropped. This prevents unreachable-element errors.
* Notice that helpful defaults such as prefixes, descriptions, aliases, and mappings are brought in.
  * There will likely be cases where LinkML metamodel elements are not automatically transferred to the derived schema (as with all our generators, we are working towards feature parity).
* Notice that the transformation automatically turns what were independent `slot` definitions in the Biolink Model into `attributes`. These are more or less functionally equivalent in LinkML; however, if you want a slot that can be reused outside a particular class, it is still best practice to define it as a slot rather than an attribute, so that the definition is not repeated.
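The first note above can be sketched in plain Python. This is a toy model of the behaviour described, not linkml-map's code: `with_ancestors` and `prune_dangling_parents` are hypothetical helpers, and the three-class hierarchy is a simplified assumption rather than the full Biolink hierarchy.

```python
def with_ancestors(selected, parents):
    """Return the selected classes plus every is_a ancestor (hypothetical helper)."""
    closed = set()
    for name in selected:
        # Walk up the is_a chain until the root or an already-visited class.
        while name is not None and name not in closed:
            closed.add(name)
            name = parents.get(name)
    return closed


def prune_dangling_parents(selected, parents):
    """Map each selected class to its parent, or None if the parent was left out."""
    return {
        name: parents.get(name) if parents.get(name) in selected else None
        for name in selected
    }


# Assumed toy hierarchy: class name -> is_a parent (None marks the root).
parents = {
    "gene": "biological entity",
    "biological entity": "named thing",
    "named thing": None,
}

# Subsetting only "gene" leaves it without a parent in the target.
pruned = prune_dangling_parents({"gene"}, parents)

# Adding the ancestors first keeps the is_a chain intact.
closed = with_ancestors(["gene"], parents)
kept = prune_dangling_parents(closed, parents)
print(pruned)
print(kept)
```

If the inheritance path matters for the subset, one option is to expand the class list with its ancestors before building the derivations, as `with_ancestors` does here.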

### Use derived schema to generate different serializations of the derived model

Now that we have a YAML dump of our derived model, we can use standard LinkML tooling to:
* generate Python dataclasses and Pydantic models of our derived schema
* navigate our derived schema with a SchemaView instance
* create and deploy automated documentation with the derived schema (see https://github.com/linkml/linkml-project-cookiecutter for more details on using the derived schema in a standard setup)

In [39]:
!gen-pydantic biolink-subset.yaml

Exception: range: biological sequence
