# Data validation with custom plugins

Let us consider another scenario: we know our data is wrong, but the error cannot be detected by JSON Schema validation alone.

This happens when an RDF-oriented data model is translated into JSON Schema. Some semantics are lost in the translation, and the limitations of JSON Schema become apparent.
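
To make the gap concrete, here is a minimal sketch that uses only the Python standard library. The `structural_check` function is a hypothetical stand-in for the kind of shape-only constraint a generated JSON Schema might express for `category` (an array of strings). It is an assumption for illustration, not the schema that is actually generated from Biolink Model.

```python
def structural_check(obj: dict) -> bool:
    """Hypothetical shape-only check: `category` must be a list of strings.

    This is an illustrative assumption, not the real generated JSON Schema.
    """
    categories = obj.get("category")
    return isinstance(categories, list) and all(
        isinstance(c, str) for c in categories
    )


good = {"id": "HGNC:9399", "category": ["biolink:Gene"]}
bad = {"id": "HGNC:9399", "category": ["biolink:GeneEntity"]}

# Both objects have the same shape, so a shape-only check
# cannot tell the valid category from the invalid one.
good_ok = structural_check(good)
bad_ok = structural_check(bad)
print(good_ok, bad_ok)
```

Deciding whether a category string names a real class requires looking into the model itself, which is the job of the custom plugin in this notebook.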

## Schema

We will use Biolink Model v3.1.1.

In [1]:
schema_url = "https://raw.githubusercontent.com/biolink/biolink-model/v3.1.1/biolink-model.yaml"

## Data

And we have an object with an incorrect category:

In [2]:
data = {
    "id": "HGNC:9399",
    "name": "PRKCD",
    "category": [
        "biolink:GeneEntity"  # <-- Invalid: not a category defined in Biolink Model
    ],
    "provided_by": [
        "graph_nodes.tsv"
    ],
    "taxon": "NCBITaxon:9606"
}

## Define a custom validation plugin

In [3]:
from linkml_runtime.utils.schemaview import SchemaView

from linkml_validator.validator import Validator
from linkml_validator.plugins.base import BasePlugin
from linkml_validator.models import ValidationResult, ValidationMessage
from linkml_validator.utils import camelcase_to_sentencecase

class MyCustomPlugin(BasePlugin):
    """
    A plugin that checks if a given category of an object
    is valid and exists in Biolink Model.
    """
    NAME = "MyCustomPlugin"

    def __init__(self, schema: str, **kwargs) -> None:
        super().__init__(schema)
        self.schemaview = SchemaView(schema)

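    def process(self, obj: dict, **kwargs) -> ValidationResult:
        valid = True
        # Tolerate objects that have no 'category' field at all
        categories = obj.get('category', [])
        # Look up the class names defined in the schema once, up front
        class_names = self.schemaview.all_classes()
        validation_messages = []
        for category in categories:
            # Drop the CURIE prefix, then convert the CamelCase identifier
            # to the sentence-case form used for class names in the schema
            category_name = camelcase_to_sentencecase(category.split(':')[-1])
            if category_name not in class_names:
                valid = False
                validation_message = ValidationMessage(
                    severity='Error',
                    field='category',
                    value=category,
                    message=f'Category {category} not in the schema'
                )
                validation_messages.append(validation_message)
                # Stop at the first invalid category
                break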
        result = ValidationResult(
            plugin_name=self.NAME,
            valid=valid,
            validation_messages=validation_messages
        )
        return result
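
The core of the plugin is the mapping from a category CURIE to a class name that can be looked up in the schema. The sketch below illustrates that idea with the standard library only. `curie_to_class_name` is a hypothetical stand-in for the combination of `split(':')` and `camelcase_to_sentencecase` (its exact behavior may differ), and `known_classes` is a small hand-picked set used here in place of `SchemaView.all_classes()`.

```python
import re


def curie_to_class_name(curie: str) -> str:
    """Illustrative stand-in: strip the prefix and split CamelCase into
    lower-case words, e.g. 'biolink:NamedThing' -> 'named thing'."""
    local_id = curie.split(":", 1)[-1]
    # Insert a space before every capital letter except the first character.
    return re.sub(r"(?<!^)(?=[A-Z])", " ", local_id).lower()


# Hypothetical subset of class names; the real plugin asks SchemaView instead.
known_classes = {"gene", "named thing", "molecular entity"}

for curie in ["biolink:Gene", "biolink:NamedThing", "biolink:GeneEntity"]:
    name = curie_to_class_name(curie)
    print(curie, "->", name, "| known:", name in known_classes)
```

Note that this simple conversion lower-cases everything, so class names containing acronyms would need extra handling in a real implementation.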


## Validating data using a custom validation plugin

First we instantiate the Validator with the Biolink Model YAML and provide a list of plugins:

In [4]:
from linkml_validator.validator import Validator

plugins = [
    {
        "plugin_class": MyCustomPlugin,
        "args": {}
    }
]
validator = Validator(schema=schema_url, plugins=plugins)

Then we can validate our data against the Biolink Model:

In [5]:
report = validator.validate(obj=data, target_class='Gene')

if not report.valid:
    for result in report.validation_results:
        for message in result.validation_messages:
            print(f"[{result.plugin_name}] {message.message} for {report.object}")

[MyCustomPlugin] Category biolink:GeneEntity not in the schema for {'id': 'HGNC:9399', 'name': 'PRKCD', 'category': ['biolink:GeneEntity'], 'provided_by': ['graph_nodes.tsv'], 'taxon': 'NCBITaxon:9606'}
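
Printing the whole object for every message gets verbose quickly. As a rough sketch, the report can instead be condensed into a mapping from plugin name to the fields that failed. The `Message`, `Result`, and `Report` dataclasses below are hypothetical stand-ins that mirror only the attribute names used in this notebook; they are not the actual `linkml_validator` models.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Message:  # stand-in for ValidationMessage (assumed shape)
    severity: str
    field: str
    message: str


@dataclass
class Result:  # stand-in for ValidationResult (assumed shape)
    plugin_name: str
    valid: bool
    validation_messages: List[Message]


@dataclass
class Report:  # stand-in for the report returned by validate()
    valid: bool
    validation_results: List[Result]


def failing_fields(report: Report) -> Dict[str, List[str]]:
    """Map each failing plugin name to the fields it complained about."""
    summary = {}
    for result in report.validation_results:
        if result.valid:
            continue
        summary[result.plugin_name] = [m.field for m in result.validation_messages]
    return summary


example_report = Report(valid=False, validation_results=[
    Result("MyCustomPlugin", False, [
        Message("Error", "category", "Category biolink:GeneEntity not in the schema"),
    ]),
])
summary = failing_fields(example_report)
print(summary)
```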
