## DDE-compatible json converter

This script converts ingests the yml files and outputs jsonschema that should be mostly DDE-compatible. Note that there are exceptions as some missing logic is not yet in place. The DDE already has a copy of schema.org ingested; hence, it should not be necessary to include enums of all schema.org classes in the validation.

Note that the DDE intends to make schemas more human interpretable; hence, it does NOT like excessive nesting and circularity in the `$validation` portion of the jsonschema. While infinite, looping, references are allowed in schema.org, it can cause errors if enforced in the `$validation` portion. To bypass this, extensive nesting has been avoided by including simplified classes as objects in the definitions portion of the `$validation`. Although these simplified objects may not include ALL the properties allowable for that class, it should not raise errors if they are included during validation either.   

Already in place:
* If a property comes from bioschemas, it will need to be defined in the @graph
* If a property exists in the hierarchy and can be inherited, then it should not need to be defined. Instead, it should be defined in the `$validation`

If a property comes from schema.org it may or may not be need to be defined. This depends on whether or not the property exists in the hierarchy from which this class is derived. This is because the only real constraints provided by schema.org are class hierarchies and property<->class relationships. Hence, the DDE only allows the inheritance of properties that are within the class hierarchy. Marginality, cardinality, and other useful constraints (eg- ontologies) in the biomedical research space come from bioschemas.

Not yet in place:
* If a schema.org property does not exist in the hierarchy, it normally does not apply to this class,
* If this is the case, it should be created with the "sameAs" property

#### To Do:
Some of the yaml files are throwing errors the following errors:
`ScannerError: mapping values are not allowed here`
Currently, we log the error and ignore it

In [1]:
import os
import json
import pandas
import pathlib
import yaml
import requests

In [2]:
#script_path = pathlib.Path(__file__).parent.absolute()
tmp_dir = os.getcwd()
parent_dir = os.path.dirname(tmp_dir)
available = os.listdir(parent_dir)
outputdirectory = os.path.join(parent_dir,'specifications')
inputdirectory = os.path.join(parent_dir,'Bioschemas-Validator')
input_profiles = os.path.join(inputdirectory,'profile_json')
input_marginality = os.path.join(inputdirectory,'profile_marginality')
input_yml = os.path.join(inputdirectory,'profile_yml')

specifications = os.listdir(input_profiles)
datapath = 'results/'

In [3]:
#### Load the yaml file
def get_yml_dict(theymlfile):
    try:
        with open(theymlfile,'r',encoding="utf8") as ymlin:
            tmpyml = yaml.load_all(ymlin, Loader=yaml.FullLoader)
            for eachyml in tmpyml:
                if '<!DOCTYPE HTML>' in eachyml:
                    break
                ymldict = eachyml
    except:
        with open(theymlfile,'r',encoding="latin1") as ymlin:
            tmpyml = yaml.load_all(ymlin, Loader=yaml.FullLoader)
            for eachyml in tmpyml:
                if '<!DOCTYPE HTML>' in eachyml:
                    break
                ymldict = eachyml
    return(ymldict)


def test_yml_mapping(ymldict):
    if 'mapping' in ymldict.keys():
        mapping=True
    else:
        mapping=ymldict.keys()
    return(mapping)


#### Create the base class
def create_new_dict():
    newdict = {}
    newdict['@context'] = {
        "schema": "http://schema.org/",
        "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
        "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
        "dct": "https://dublincore.org/specifications/dublin-core/dcmi-terms/#",
        "bioschemas": "http://discovery.biothings.io/view/bioschemas/"
      } 
    return(newdict)


def get_schema_base():
    ## Note the DDE has been updated with the latest version
    schemabase = 'https://raw.githubusercontent.com/schemaorg/schemaorg/main/data/releases/13.0/schemaorg-all-http.jsonld'
    r = requests.get(schemabase)
    schematree = r.text
    return(schematree)


def grab_class_info(schematree,eachversion,ymldict):
    parentclass = ymldict['hierarchy'][-1]
    if parentclass in schematree:
        parenttype = "schema:"
    else:
        parenttype = "bioschemas:"
    try:
        description = ymldict['spec_info']['subtitle']+" "+ymldict['spec_info']['description']+" Version: "+ymldict['spec_info']['version']
    except:
        description = ymldict['spec_info']['subtitle']+" "+" Version: "+ymldict['spec_info']['version']
    classinfo = {
      "@id": "bioschemas:"+ymldict['name'],
      "@type": "rdfs:Class",
      "rdfs:comment": description,
      "schema:schemaVersion": ["https://bioschemas.org/"+ymldict['spec_type'].lower()+"s/"+ymldict['name']+"/"+eachversion.replace(".json","")],  
      "rdfs:label": ymldict['name'],
      "rdfs:subClassOf": {
        "@id": parenttype+parentclass
      }
    }
    return(classinfo)

In [4]:
#### Create the validation for the class

def load_dictionaries():
    from dde_reusable_objects import expected_type_dict
    from dde_reusable_objects import reusable_definitions
    return(expected_type_dict, reusable_definitions)


def generate_type(expected_type_dict, propertytype):
    if propertytype in expected_type_dict.keys():
        matched_type = expected_type_dict[propertytype]
    else:
        matched_type = False
    return(matched_type)


def generate_base_dict(expectedtype):
    base_dict = {
        "@type": expectedtype,
        "type": "object",
        "properties": {
          "name": {
            "type": "string"  
          }  
        },
        "required": []
    }
    return(base_dict)


def import_reusable_objects(reusable_definitions,rangelist):
    expected_types = [x["@id"].replace("bioschemas:","") for x in rangelist]
    all_expected_types = [x.replace("schema:","") for x in expected_types]
    definitionslist = [x for x in all_expected_types if x.lower() in reusable_definitions.keys()]
    definitiondict = {}
    referencelist = {}
    for eachdefinition in definitionslist:
        definitiondict[eachdefinition.lower()] = reusable_definitions[eachdefinition.lower()]
        referencelist[eachdefinition.lower()] = {"$ref":"#/definitions/"+eachdefinition.lower()}        
    return(definitiondict,referencelist)


def check_type(expected_type_dict, referencelist, definitiondict, propertytype_info):
    referencename = propertytype_info.replace("bioschemas:","").replace("schema:","").lower()
    matched_type = generate_type(expected_type_dict, propertytype_info)
    if matched_type != False:
        actualtype = matched_type
    else:
        if referencename in referencelist.keys():
            actualtype = referencelist[referencename]
        else:
            ### create reference and property
            actualtype = {"$ref":"#/definitions/"+referencename}
            referencelist[referencename] = actualtype
            definitiondict[referencename] = generate_base_dict(propertytype_info)    
    return(actualtype, referencelist, definitiondict)


def cardinality_check(expected_type_dict, referencelist, definitiondict, eachproperty):
    rangelist = get_rangelist(eachproperty)
    ## Generate the base object
    if "bsc_description" in eachproperty.keys():
        valpropdict = {"description":eachproperty['bsc_description']+" "+eachproperty['description']}
    else:
        valpropdict = {"description":eachproperty['description']}
    ## Check cardinality    
    if eachproperty['cardinality'] != "MANY": ## There can only be one expected value or cardinality not defined
        ## Check number of expected types
        if len(rangelist) == 1: ## There can only be one expected type
            propertytype = rangelist[0]
            actualtype, referencelist, definitiondict = check_type(expected_type_dict, referencelist, definitiondict, propertytype['@id'])
            valpropdict.update(actualtype)
        else: ## There are more than one expected type
            propertyelements = []
            for propertytype in rangelist:
                actualtype, referencelist, definitiondict = check_type(expected_type_dict, referencelist, definitiondict, propertytype['@id'])
                if actualtype not in propertyelements:
                    propertyelements.append(actualtype)
            valpropdict["oneOf"] = propertyelements
                
    else: ## each property can have many elements, ie- Cardinality == Many
        ## Check number of expected types
        if len(rangelist) == 1: ## If only one type expected, but many of it are allowed
            propertytype = rangelist[0]
            actualtype, referencelist, definitiondict = check_type(expected_type_dict, referencelist, definitiondict, propertytype['@id'])
            valpropdict["oneOf"] = [
                  actualtype,
                  {
                    "type": "array",
                    "items": actualtype
                  }
              ]  
        else: ## Many types are allowed, and many values are expected
            propertyelements = []
            for propertytype in rangelist:
                actualtype, referencelist, definitiondict = check_type(expected_type_dict, referencelist, definitiondict, propertytype['@id'])
                if actualtype not in propertyelements:
                    propertyelements.append(actualtype)
                manyactualtype = {"type":"array", "items":actualtype}
                if manyactualtype not in propertyelements:
                    propertyelements.append(manyactualtype)
            valpropdict["anyOf"] = propertyelements            
    return(valpropdict, referencelist, definitiondict)


#### Generate validation content

def generate_validation(ymldict):
    expected_type_dict, reusable_definitions = load_dictionaries()
    propertylist = ymldict['mapping']
    validationdict = {
      "$schema": "http://json-schema.org/draft-07/schema#",
      "type": "object",
      "properties":{},
      "required":[],
      "recommended":[],
      "optional":[],
      "definitions":{}
    }
    for eachproperty in propertylist:
        rangelist = get_rangelist(eachproperty)
        definitiondict,referencelist = import_reusable_objects(reusable_definitions,rangelist)
        valpropdict, referencelist, definitiondict = cardinality_check(expected_type_dict, referencelist, definitiondict, eachproperty)
        actualname, property_source = split_property_name(eachproperty["property"])
        validationdict["properties"][actualname]=valpropdict
        validationdict['definitions'].update(definitiondict)
        #### Include categorycode if propertyvalue is used
        if "propertyvalue" in validationdict['definitions'].keys():
            validationdict['definitions']['categorycode']=reusable_definitions["categorycode"]
        #### Include definedTermSet if definedTerm is used
        if "definedterm" in validationdict['definitions'].keys():
            validationdict['definitions']['definedtermset']=reusable_definitions["definedtermset"]
        #### Include creativework if person is used (but only if it isn't already included)
        if "person" in validationdict['definitions'].keys():
            if "creativework" not in validationdict['definitions'].keys():
                validationdict['definitions']['creativework']=reusable_definitions['creativework']
        try:
            marginality = eachproperty["marginality"]
        except:
            marginality = None
        if marginality == "Minimum":
            validationdict['required'].append(actualname)
        elif marginality == "Recommended":
            validationdict['recommended'].append(actualname)
        elif marginality == "Optional":
            validationdict['optional'].append(actualname)
    return(validationdict)
 

In [5]:
#### Define properties for the graph

def check_property_source(eachproperty): 
    if eachproperty['type']=="bioschemas":
        namespace = "bioschemas"
    elif eachproperty['type']=="external":
        namespace = "TBD"
    else:
        namespace = "schema"
    return(namespace)

def get_rangelist(eachproperty):
    typelist = eachproperty['expected_types']
    bioschemalist = [x for x in typelist if x in specifications]
    expected_bioschemas = ["bioschemas:"+x for x in bioschemalist]
    expected_schemas = ["schema:"+x for x in typelist if x not in bioschemalist]
    rangelist = [{"@id":x} for x in expected_bioschemas]
    for x in expected_schemas:
        rangelist.append({"@id":x})
    return(rangelist)

def get_domain(eachspec):
    domain = {"@id":"bioschemas:"+eachspec}
    return(domain)


def split_property_name(propertyname):
    try:
        propertynameinfo = propertyname.split(":")
        property_source = propertynameinfo[0]
        actualname = propertynameinfo[1]
    except:
        actualname = propertyname
        property_source = None
    return(actualname, property_source)


def create_bioschema_property(eachspec, eachproperty):
    domain = get_domain(eachspec)
    rangelist = get_rangelist(eachproperty)
    namespace = check_property_source(eachproperty)
    source4context = False
    if namespace=="bioschemas":
        ### Create property
        try:
            description = eachproperty["description"]+" "+eachproperty['bsc_description']
        except:
            description = eachproperty["description"]
        property_dict = {
            "@id": "bioschemas:"+eachproperty['property'],
            "rdfs:comment": description,
            "@type": "rdf:Property",
            "rdfs:label": eachproperty['property'],
            "schema:domainIncludes": domain,
            "schema:rangeIncludes": rangelist
        }
    elif namespace=="TBD":
        ### Create externally referencing property
        try:
            description = eachproperty["description"]+" "+eachproperty['bsc_description']
        except:
            description = eachproperty["description"]
        actualname, propertysource = split_property_name(eachproperty['property'])
        property_dict = {
            "@id": eachproperty['property'],
            "rdfs:comment": description,
            "@type": "rdf:Property",
            "rdfs:label": actualname,
            "schema:domainIncludes": domain,
            "schema:rangeIncludes": rangelist
        }
        if propertysource != None:
            source4context = {propertysource:eachproperty["type_url"].replace(actualname,'')}
    else:
        property_dict = False
    return(property_dict,source4context)


In [6]:
#### Assemble the jsonld file

def parse_spec_version(tmpinputyml,eachversion,eachspec,specifications):
    theymlfile = os.path.join(tmpinputyml,eachversion.replace('.json','.html'))
    graphlist = []
    tmpdict = create_new_dict()
    ymldict = get_yml_dict(theymlfile)
    schematree = get_schema_base()
    classinfo = grab_class_info(schematree,eachversion,ymldict)
    expected_type_dict, reusable_definitions = load_dictionaries()
    mapping = test_yml_mapping(ymldict)
    if mapping == True:
        propertylist = ymldict['mapping']
        validationdict = generate_validation(ymldict)
        classinfo['$validation']=validationdict
        graphlist.append(classinfo)
        for eachproperty in propertylist:
            bioschemaprop,source4context = create_bioschema_property(eachspec, eachproperty)
            if bioschemaprop != False:
                graphlist.append(bioschemaprop)
            if source4context != False:
                tmpdict['@context'].update(source4context)
        tmpdict['@graph']=graphlist
    else:
        print(mapping)
        tmpdict=False
    return(tmpdict)


#### Parse specifications
def parse_for_dde(input_profiles,datapath,test=False):
    specifications = os.listdir(input_profiles)
    failures = []
    if test==True:
        eachspec = specifications[-4] ##TaxonRank (-3), LabProtocol (14), Course (5)
        tmpinputprofilepath = os.path.join(input_profiles,eachspec)
        tmpinputyml = os.path.join(input_yml,eachspec)
        spec_profs = os.listdir(tmpinputprofilepath)
        eachversion = spec_profs[-1]
        tmpdict = parse_spec_version(tmpinputyml,eachversion,eachspec,specifications)
        if os.path.exists(os.path.join(datapath,str(eachspec)))==False:
            os.makedirs(os.path.join(datapath,str(eachspec)))
        outputpath = os.path.join(datapath,str(eachspec))
        with open(os.path.join(outputpath,str(eachspec)+'_v'+str(eachversion)),"w+") as outfile:
            outfile.write(json.dumps(tmpdict, indent=4, sort_keys=False))
    else:
        for eachspec in specifications:
            tmpinputprofilepath = os.path.join(input_profiles,eachspec)
            tmpinputyml = os.path.join(input_yml,eachspec)
            spec_profs = os.listdir(tmpinputprofilepath)
            for eachversion in spec_profs:
                try:
                    tmpdict = parse_spec_version(tmpinputyml,eachversion,eachspec,specifications)
                except:
                    tmpdict=False
                    failures.append("error parsing yml for: "+str(eachspec)+'_v'+str(eachversion))
                if tmpdict==False:
                    print("The specification: ",str(eachspec)+'_v'+str(eachversion),' does not have a mapping in the yaml')
                    failures.append("no mapping in yml for: "+str(eachspec)+'_v'+str(eachversion))
                else:
                    if os.path.exists(os.path.join(datapath,str(eachspec)))==False:
                        os.makedirs(os.path.join(datapath,str(eachspec)))
                    outputpath = os.path.join(datapath,str(eachspec))
                    with open(os.path.join(outputpath,str(eachspec)+'_v'+str(eachversion)),"w+") as outfile:
                        outfile.write(json.dumps(tmpdict, indent=4, sort_keys=False))
        with open(os.path.join(datapath,'failures.txt'),'w+') as failurelog:
            for eachitem in failures:
                failurelog.write(eachitem+'\n')

In [None]:
i=0
while i < len(specifications):
    print(i, specifications[i])
    i=i+1

In [None]:
## Inspect a dictionary parsed from the yaml file
#print(specifications)
eachspec = specifications[14] ##TaxonRank (-3), LabProtocol (14), Course (5)
tmpinputprofilepath = os.path.join(input_profiles,eachspec)
tmpinputmarginpath = os.path.join(input_marginality,eachspec)
tmpinputyml = os.path.join(input_yml,eachspec)
spec_profs = os.listdir(tmpinputprofilepath)
eachversion = spec_profs[-1]
thejsonfile = os.path.join(tmpinputprofilepath,eachversion)
themarginfile = os.path.join(tmpinputmarginpath,eachversion)
theymlfile = os.path.join(tmpinputyml,eachversion.replace('.json','.html'))
#print(os.listdir(tmpinputyml))
#print(theymlfile)

ymldict = get_yml_dict(theymlfile)
print(ymldict.keys())
i = 0
print(ymldict['mapping'][0].keys())
#mapping = test_yml_mapping(ymldict)
#print(mapping)

In [7]:
## Do a test run
parse_for_dde(input_profiles,datapath,test=True)

In [8]:
## Run through all specifications
parse_for_dde(input_profiles,datapath,test=False)

dict_keys(['name', 'previous_version', 'previous_release', 'status', 'spec_type', 'group', 'use_cases_url', 'cross_walk_url', 'gh_tasks', 'live_deploy', 'parent_type', 'hierarchy', 'spec_info'])
The specification:  Journal_v0.1-DRAFT-2019_01_29.json  does not have a mapping in the yaml
The specification:  Organization_v0.2-DRAFT-2019_07_17.json  does not have a mapping in the yaml
The specification:  Sample_v0.1-DRAFT-2018_02_25.json  does not have a mapping in the yaml
The specification:  Sample_v0.2-DRAFT-2018_11_09.json  does not have a mapping in the yaml
The specification:  Sample_v0.2-DRAFT-2018_11_10.json  does not have a mapping in the yaml
The specification:  Sample_v0.2-RELEASE-2018_11_10.json  does not have a mapping in the yaml
The specification:  Tool_v0.3-DRAFT-2018_11_21.json  does not have a mapping in the yaml
The specification:  Tool_v0.3-DRAFT-2019_07_18.json  does not have a mapping in the yaml
The specification:  TrainingMaterial_v0.7-DRAFT-2019_11_08.json  does no