## DDE-compatible json converter

This script converts ingests the yml files and outputs jsonschema that should be mostly DDE-compatible. Note that there are exceptions as some missing logic is not yet in place. The DDE already has a copy of schema.org ingested; hence, it should not be necessary to include enums of all schema.org classes in the validation.

Note that the DDE intends to make schemas more human interpretable; hence, it does NOT like excessive nesting and circularity in the `$validation` portion of the jsonschema. While infinite, looping, references are allowed in schema.org, it can cause errors if enforced in the `$validation` portion. To bypass this, extensive nesting has been avoided by including simplified classes as objects in the definitions portion of the `$validation`. Although these simplified objects may not include ALL the properties allowable for that class, it should not raise errors if they are included during validation either.   

Already in place:
* If a property comes from bioschemas, it will need to be defined in the @graph
* If a property exists in the hierarchy and can be inherited, then it should not need to be defined. Instead, it should be defined in the `$validation`

If a property comes from schema.org it may or may not be need to be defined. This depends on whether or not the property exists in the hierarchy from which this class is derived. This is because the only real constraints provided by schema.org are class hierarchies and property<->class relationships. Hence, the DDE only allows the inheritance of properties that are within the class hierarchy. Marginality, cardinality, and other useful constraints (eg- ontologies) in the biomedical research space come from bioschemas.

Not yet in place:
* If a schema.org property does not exist in the hierarchy, it normally does not apply to this class,
* If this is the case, it should be created with the "sameAs" property

#### To Do:
Some of the yaml files are throwing errors the following errors:
`ScannerError: mapping values are not allowed here`
Currently, we log the error and ignore it

In [2]:
import os
import json
import pandas
import pathlib
import yaml
import requests

In [3]:
#script_path = pathlib.Path(__file__).parent.absolute()
tmp_dir = os.getcwd()
parent_dir = os.path.dirname(tmp_dir)
available = os.listdir(parent_dir)
outputdirectory = os.path.join(parent_dir,'specifications')
inputdirectory = os.path.join(parent_dir,'Bioschemas-Validator')
input_profiles = os.path.join(inputdirectory,'profile_json')
input_marginality = os.path.join(inputdirectory,'profile_marginality')
input_yml = os.path.join(inputdirectory,'profile_yml')

specifications = os.listdir(input_profiles)
datapath = os.path.join('results','resulting json')

In [5]:
#### Load the yaml file
def get_yml_dict(theymlfile):
    try:
        with open(theymlfile,'r',encoding="utf8") as ymlin:
            tmpyml = yaml.load_all(ymlin, Loader=yaml.FullLoader)
            for eachyml in tmpyml:
                if '<!DOCTYPE HTML>' in eachyml:
                    break
                ymldict = eachyml
    except:
        with open(theymlfile,'r',encoding="latin1") as ymlin:
            tmpyml = yaml.load_all(ymlin, Loader=yaml.FullLoader)
            for eachyml in tmpyml:
                if '<!DOCTYPE HTML>' in eachyml:
                    break
                ymldict = eachyml
    return(ymldict)


def test_yml_mapping(ymldict):
    if 'mapping' in ymldict.keys():
        mapping=True
    else:
        mapping=ymldict.keys()
    return(mapping)


#### Create the base class
def create_new_dict():
    newdict = {}
    newdict['@context'] = {
        "schema": "http://schema.org/",
        "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
        "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
        "bioschemas": "https://discovery.biothings.io/view/bioschemas/"
      } 
    return(newdict)


def get_schema_base():
    ## The DDE does not appear to be using schema version 13. Since the version is not know, we'll use 12 for now
    schemabase = 'https://raw.githubusercontent.com/schemaorg/schemaorg/main/data/releases/12.0/schemaorg-all-http.jsonld'
    r = requests.get(schemabase)
    schematree = r.text
    return(schematree)


def grab_class_info(schematree,eachversion,ymldict):
    parentclass = ymldict['hierarchy'][-1]
    if parentclass in schematree:
        parenttype = "schema:"
    else:
        parenttype = "bioschemas:"
    try:
        description = ymldict['spec_info']['subtitle']+" "+ymldict['spec_info']['description']+" Version: "+ymldict['spec_info']['version']
    except:
        description = ymldict['spec_info']['subtitle']+" "+" Version: "+ymldict['spec_info']['version']
    classinfo = {
      "@id": "bioschemas:"+ymldict['name'],
      "@type": "rdfs:Class",
      "rdfs:comment": description,
      "schema:schemaVersion": ["https://bioschemas.org/"+ymldict['spec_type'].lower()+"s/"+ymldict['name']+"/"+eachversion.replace(".json","")],  
      "rdfs:label": ymldict['name'],
      "rdfs:subClassOf": {
        "@id": parenttype+parentclass
      }
    }
    return(classinfo)

In [6]:
#### Create the validation for the class

def load_dictionaries():
    from dde_reusable_objects import expected_type_dict
    from dde_reusable_objects import reusable_definitions
    return(expected_type_dict, reusable_definitions)


def generate_type(expected_type_dict, propertytype):
    if propertytype in expected_type_dict.keys():
        matched_type = expected_type_dict[propertytype]
    else:
        matched_type = False
    return(matched_type)


def generate_base_dict(expectedtype):
    base_dict = {
        "@type": expectedtype,
        "type": "object",
        "properties": {
          "name": {
            "type": "string"  
          }  
        },
        "required": []
    }
    return(base_dict)


def import_reusable_objects(reusable_definitions,rangelist):
    expected_types = [x["@id"].replace("bioschemas:","") for x in rangelist]
    all_expected_types = [x.replace("schema:","") for x in expected_types]
    definitionslist = [x for x in all_expected_types if x.lower() in reusable_definitions.keys()]
    definitiondict = {}
    referencelist = {}
    for eachdefinition in definitionslist:
        definitiondict[eachdefinition.lower()] = reusable_definitions[eachdefinition.lower()]
        referencelist[eachdefinition.lower()] = {"$ref":"#/definitions/"+eachdefinition.lower()}        
    return(definitiondict,referencelist)


def check_type(expected_type_dict, referencelist, definitiondict, propertytype_info):
    referencename = propertytype_info.replace("bioschemas:","").replace("schema:","").lower()
    matched_type = generate_type(expected_type_dict, propertytype_info)
    if matched_type != False:
        actualtype = matched_type
    else:
        if referencename in referencelist.keys():
            actualtype = referencelist[referencename]
        else:
            ### create reference and property
            actualtype = {"$ref":"#/definitions/"+referencename}
            referencelist[referencename] = actualtype
            definitiondict[referencename] = generate_base_dict(propertytype_info)    
    return(actualtype, referencelist, definitiondict)


def cardinality_check(expected_type_dict, referencelist, definitiondict, eachproperty):
    rangelist = get_rangelist(eachproperty)
    ## Generate the base object
    if "bsc_description" in eachproperty.keys():
        valpropdict = {"description":eachproperty['bsc_description']+" "+eachproperty['description']}
    else:
        valpropdict = {"description":eachproperty['description']}
    ## Check cardinality    
    if eachproperty['cardinality'] != "MANY": ## There can only be one expected value or cardinality not defined
        ## Check number of expected types
        if len(rangelist) == 1: ## There can only be one expected type
            propertytype = rangelist[0]
            actualtype, referencelist, definitiondict = check_type(expected_type_dict, referencelist, definitiondict, propertytype['@id'])
            valpropdict.update(actualtype)
        else: ## There are more than one expected type
            propertyelements = []
            for propertytype in rangelist:
                actualtype, referencelist, definitiondict = check_type(expected_type_dict, referencelist, definitiondict, propertytype['@id'])
                if actualtype not in propertyelements:
                    propertyelements.append(actualtype)
            valpropdict["oneOf"] = propertyelements
                
    else: ## each property can have many elements, ie- Cardinality == Many
        ## Check number of expected types
        if len(rangelist) == 1: ## If only one type expected, but many of it are allowed
            propertytype = rangelist[0]
            actualtype, referencelist, definitiondict = check_type(expected_type_dict, referencelist, definitiondict, propertytype['@id'])
            valpropdict["oneOf"] = [
                  actualtype,
                  {
                    "type": "array",
                    "items": actualtype
                  }
              ]  
        else: ## Many types are allowed, and many values are expected
            propertyelements = []
            for propertytype in rangelist:
                actualtype, referencelist, definitiondict = check_type(expected_type_dict, referencelist, definitiondict, propertytype['@id'])
                if actualtype not in propertyelements:
                    propertyelements.append(actualtype)
                manyactualtype = {"type":"array", "items":actualtype}
                if manyactualtype not in propertyelements:
                    propertyelements.append(manyactualtype)
            valpropdict["anyOf"] = propertyelements            
    return(valpropdict, referencelist, definitiondict)


#### Generate validation content

def generate_validation(ymldict):
    expected_type_dict, reusable_definitions = load_dictionaries()
    propertylist = ymldict['mapping']
    validationdict = {
      "$schema": "http://json-schema.org/draft-07/schema#",
      "type": "object",
      "properties":{},
      "required":[],
      "recommended":[],
      "optional":[],
      "definitions":{}
    }
    for eachproperty in propertylist:
        rangelist = get_rangelist(eachproperty)
        definitiondict,referencelist = import_reusable_objects(reusable_definitions,rangelist)
        valpropdict, referencelist, definitiondict = cardinality_check(expected_type_dict, referencelist, definitiondict, eachproperty)
        actualname, property_source = split_property_name(eachproperty["property"])
        validationdict["properties"][actualname]=valpropdict
        validationdict['definitions'].update(definitiondict)
        #### Include categorycode if propertyvalue is used
        if "propertyvalue" in validationdict['definitions'].keys():
            validationdict['definitions']['categorycode']=reusable_definitions["categorycode"]
        #### Include definedTermSet if definedTerm is used
        if "definedterm" in validationdict['definitions'].keys():
            validationdict['definitions']['definedtermset']=reusable_definitions["definedtermset"]
        #### Include creativework if person is used (but only if it isn't already included)
        if "person" in validationdict['definitions'].keys():
            if "creativework" not in validationdict['definitions'].keys():
                validationdict['definitions']['creativework']=reusable_definitions['creativework']
        try:
            marginality = eachproperty["marginality"]
        except:
            marginality = None
        if marginality == "Minimum":
            validationdict['required'].append(actualname)
        elif marginality == "Recommended":
            validationdict['recommended'].append(actualname)
        elif marginality == "Optional":
            validationdict['optional'].append(actualname)
    return(validationdict)
 

In [5]:
#### Define properties for the graph

def check_property_source(eachproperty): 
    if eachproperty['type']=="bioschemas":
        namespace = "bioschemas"
    elif eachproperty['type']=="external":
        namespace = "TBD"
    else:
        namespace = "schema"
    return(namespace)

def get_rangelist(eachproperty):
    typelist = eachproperty['expected_types']
    bioschemalist = [x for x in typelist if x in specifications]
    expected_bioschemas = ["bioschemas:"+x for x in bioschemalist]
    expected_schemas = ["schema:"+x for x in typelist if x not in bioschemalist]
    rangelist = [{"@id":x} for x in expected_bioschemas]
    for x in expected_schemas:
        rangelist.append({"@id":x})
    return(rangelist)

def get_domain(eachspec):
    domain = {"@id":"bioschemas:"+eachspec}
    return(domain)


def split_property_name(propertyname):
    try:
        propertynameinfo = propertyname.split(":")
        property_source = propertynameinfo[0]
        actualname = propertynameinfo[1]
    except:
        actualname = propertyname
        property_source = None
    return(actualname, property_source)


def create_bioschema_property(eachspec, eachproperty):
    domain = get_domain(eachspec)
    rangelist = get_rangelist(eachproperty)
    namespace = check_property_source(eachproperty)
    source4context = False
    if namespace=="bioschemas":
        ### Create property
        try:
            description = eachproperty["description"]+" "+eachproperty['bsc_description']
        except:
            description = eachproperty["description"]
        property_dict = {
            "@id": "bioschemas:"+eachproperty['property'],
            "rdfs:comment": description,
            "@type": "rdf:Property",
            "rdfs:label": eachproperty['property'],
            "schema:domainIncludes": domain,
            "schema:rangeIncludes": rangelist
        }
    elif namespace=="TBD":
        ### Create externally referencing property
        try:
            description = eachproperty["description"]+" "+eachproperty['bsc_description']
        except:
            description = eachproperty["description"]
        actualname, propertysource = split_property_name(eachproperty['property'])
        property_dict = {
            "@id": eachproperty['property'],
            "rdfs:comment": description,
            "@type": "rdf:Property",
            "rdfs:label": actualname,
            "schema:domainIncludes": domain,
            "schema:rangeIncludes": rangelist
        }
        if propertysource != None:
            source4context = {propertysource:eachproperty["type_url"].replace(actualname,'')}
    else:
        property_dict = False
    return(property_dict,source4context)


In [7]:
#### Assemble the jsonld file

def parse_spec_version(tmpinputyml,eachversion,eachspec,specifications):
    theymlfile = os.path.join(tmpinputyml,eachversion.replace('.json','.html'))
    graphlist = []
    tmpdict = create_new_dict()
    ymldict = get_yml_dict(theymlfile)
    schematree = get_schema_base()
    classinfo = grab_class_info(schematree,eachversion,ymldict)
    expected_type_dict, reusable_definitions = load_dictionaries()
    mapping = test_yml_mapping(ymldict)
    if mapping == True:
        propertylist = ymldict['mapping']
        validationdict = generate_validation(ymldict)
        classinfo['$validation']=validationdict
        graphlist.append(classinfo)
        for eachproperty in propertylist:
            bioschemaprop,source4context = create_bioschema_property(eachspec, eachproperty)
            if bioschemaprop != False:
                graphlist.append(bioschemaprop)
            if source4context != False:
                tmpdict['@context'].update(source4context)
        tmpdict['@graph']=graphlist
    else:
        print(mapping)
        tmpdict=False
    return(tmpdict)


#### Parse specifications
def parse_for_dde(input_profiles,datapath,test=False):
    specifications = os.listdir(input_profiles)
    failures = []
    if test==True:
        eachspec = specifications[-4] ##TaxonRank (-3), LabProtocol (14), Course (5)
        tmpinputprofilepath = os.path.join(input_profiles,eachspec)
        tmpinputyml = os.path.join(input_yml,eachspec)
        spec_profs = os.listdir(tmpinputprofilepath)
        eachversion = spec_profs[-1]
        tmpdict = parse_spec_version(tmpinputyml,eachversion,eachspec,specifications)
        if os.path.exists(os.path.join(datapath,str(eachspec)))==False:
            os.makedirs(os.path.join(datapath,str(eachspec)))
        outputpath = os.path.join(datapath,str(eachspec),'jsonld')
        with open(os.path.join(outputpath,str(eachspec)+'_v'+str(eachversion)),"w+") as outfile:
            outfile.write(json.dumps(tmpdict, indent=4, sort_keys=False))
    else:
        for eachspec in specifications:
            tmpinputprofilepath = os.path.join(input_profiles,eachspec)
            tmpinputyml = os.path.join(input_yml,eachspec)
            spec_profs = os.listdir(tmpinputprofilepath)
            for eachversion in spec_profs:
                try:
                    tmpdict = parse_spec_version(tmpinputyml,eachversion,eachspec,specifications)
                except:
                    tmpdict=False
                    failures.append("error parsing yml for: "+str(eachspec)+'_v'+str(eachversion))
                if tmpdict==False:
                    print("The specification: ",str(eachspec)+'_v'+str(eachversion),' does not have a mapping in the yaml')
                    failures.append("no mapping in yml for: "+str(eachspec)+'_v'+str(eachversion))
                else:
                    if os.path.exists(os.path.join(datapath,str(eachspec)))==False:
                        os.makedirs(os.path.join(datapath,str(eachspec)))
                    outputpath = os.path.join(datapath,str(eachspec))
                    with open(os.path.join(outputpath,str(eachspec)+'_v'+str(eachversion)),"w+") as outfile:
                        outfile.write(json.dumps(tmpdict, indent=4, sort_keys=False))
        with open(os.path.join(datapath,'failures.txt'),'w+') as failurelog:
            for eachitem in failures:
                failurelog.write(eachitem+'\n')

In [8]:
i=0
while i < len(specifications):
    print(i, specifications[i])
    i=i+1

0 Beacon
1 BioSample
2 ChemicalSubstance
3 ComputationalTool
4 ComputationalWorkflow
5 Course
6 CourseInstance
7 DataCatalog
8 DataRecord
9 Dataset
10 Event
11 FormalParameter
12 Gene
13 Journal
14 LabProtocol
15 MolecularEntity
16 Organization
17 Person
18 Phenotype
19 Protein
20 ProteinAnnotation
21 ProteinStructure
22 PublicationIssue
23 PublicationVolume
24 RNA
25 Sample
26 ScholarlyArticle
27 SemanticTextAnnotation
28 Study
29 Taxon
30 TaxonName
31 Tool
32 TrainingMaterial


In [9]:
## Inspect a dictionary parsed from the yaml file
#print(specifications)
eachspec = specifications[14] ##TaxonRank (-3), LabProtocol (14), Course (5)
tmpinputprofilepath = os.path.join(input_profiles,eachspec)
tmpinputmarginpath = os.path.join(input_marginality,eachspec)
tmpinputyml = os.path.join(input_yml,eachspec)
spec_profs = os.listdir(tmpinputprofilepath)
eachversion = spec_profs[-1]
thejsonfile = os.path.join(tmpinputprofilepath,eachversion)
themarginfile = os.path.join(tmpinputmarginpath,eachversion)
theymlfile = os.path.join(tmpinputyml,eachversion.replace('.json','.html'))
#print(os.listdir(tmpinputyml))
#print(theymlfile)

ymldict = get_yml_dict(theymlfile)
#print(ymldict.keys())
i = 0
print(ymldict['mapping'][0].keys())
print(ymldict['mapping'][0]['property'])
print(ymldict['mapping'][1].keys())
#mapping = test_yml_mapping(ymldict)
#print(mapping)

dict_keys(['property', 'expected_types', 'description', 'type', 'type_url', 'bsc_description', 'marginality', 'cardinality', 'controlled_vocab', 'example'])
author
dict_keys(['property', 'expected_types', 'description', 'type', 'type_url', 'bsc_description', 'marginality', 'cardinality', 'controlled_vocab', 'example'])


In [8]:
## Do a test run
parse_for_dde(input_profiles,datapath,test=True)

In [11]:
## Run through all specifications
parse_for_dde(input_profiles,datapath,test=False)

dict_keys(['name', 'previous_version', 'previous_release', 'status', 'spec_type', 'group', 'use_cases_url', 'cross_walk_url', 'gh_tasks', 'live_deploy', 'parent_type', 'hierarchy', 'spec_info'])
The specification:  Journal_v0.1-DRAFT-2019_01_29.json  does not have a mapping in the yaml
The specification:  Organization_v0.2-DRAFT-2019_07_17.json  does not have a mapping in the yaml
The specification:  Sample_v0.1-DRAFT-2018_02_25.json  does not have a mapping in the yaml
The specification:  Sample_v0.2-DRAFT-2018_11_09.json  does not have a mapping in the yaml
The specification:  Sample_v0.2-DRAFT-2018_11_10.json  does not have a mapping in the yaml
The specification:  Sample_v0.2-RELEASE-2018_11_10.json  does not have a mapping in the yaml
The specification:  Tool_v0.3-DRAFT-2018_11_21.json  does not have a mapping in the yaml
The specification:  Tool_v0.3-DRAFT-2019_07_18.json  does not have a mapping in the yaml
The specification:  TrainingMaterial_v0.7-DRAFT-2019_11_08.json  does no

# Generating list of most common properties and expected types

The DDE is undergoing many improvements and getting refactored on the back-end. To ensure that the DDE can be used by as many people as possible with little/no knowledge of the JSON schema, the most common bioschemas properties and their expected types are being evaluated for inclusions as default options.

In [4]:
datapath = os.path.join('results','frequency')

In [55]:
def clean_illegal_chars(theymlfile):
    with open(theymlfile,'r',encoding="utf8") as ymlin:
        theymlfile.replace("#x0080","\#x0080")
    return(theymlfile)

In [56]:
%%time

## Get frequency of expected types
allproplist = []
for eachspec in specifications:
    tmpinputprofilepath = os.path.join(input_profiles,eachspec)
    tmpinputmarginpath = os.path.join(input_marginality,eachspec)
    tmpinputyml = os.path.join(input_yml,eachspec)
    spec_profs = os.listdir(tmpinputprofilepath)
    eachversion = spec_profs[-1]
    thejsonfile = os.path.join(tmpinputprofilepath,eachversion)
    themarginfile = os.path.join(tmpinputmarginpath,eachversion)
    theymlfile = os.path.join(tmpinputyml,eachversion.replace('.json','.html'))
    try:
        ymldict = get_yml_dict(theymlfile)
    except:
        ymlin = clean_illegal_chars(theymlfile)
        tmpyml = yaml.load_all(ymlin, Loader=yaml.FullLoader)
        for eachyml in tmpyml:
            if '<!DOCTYPE HTML>' in eachyml:
                break
            ymldict = eachyml
    try:
        mappingdict = ymldict['mapping']
        for eachdict in mappingdict:
            tmpdict = {'specification':eachspec,'property':eachdict['property'],'expected_type':eachdict['expected_types'],'marginality':eachdict['marginality']}
            allproplist.append(tmpdict)
    except:
        print(eachspec)

allpropdf = pandas.DataFrame(allproplist)
print(allpropdf.head(n=2))

Sample
  specification    property  expected_type  marginality
0        Beacon  aggregator      [Boolean]  Recommended
1        Beacon     dataset  [DataCatalog]      Minimum
Wall time: 1.38 s


In [57]:
## Get frequency of combinations of expected_type
allpropdf['expected_types_raw'] = allpropdf['expected_type'].astype(str)
prop_freq = allpropdf.groupby(['property','expected_types_raw']).size().reset_index(name='counts')
combi_exp_freq = allpropdf.groupby('expected_types_raw').size().reset_index(name='counts')

prop_freq.sort_values('counts',ascending=False,inplace=True)
combi_exp_freq.sort_values('counts',ascending=False,inplace=True)

prop_freq.to_csv(os.path.join(datapath,'property_frequency.tsv'),sep='\t',header=True)
combi_exp_freq.to_csv(os.path.join(datapath,'raw_expected_types_frequency.tsv'),sep='\t',header=True)

In [13]:
## Get frequency of base expected type
expected_types = allpropdf.explode('expected_type')
expected_freq = expected_types.groupby('expected_type').size().reset_index(name='counts')
expected_freq.sort_values('counts',ascending=False,inplace=True)
expected_freq.to_csv(os.path.join(datapath,'expected_types_frequency.tsv'),sep='\t',header=True)
print(expected_freq)

         expected_type  counts
74                Text     262
77                 URL     237
60       PropertyValue      71
19        CreativeWork      66
27         DefinedTerm      54
..                 ...     ...
38        HowToSection       1
1   AdministrativeArea       1
46       MedicalEntity       1
47     MolecularEntity       1
78     VirtualLocation       1

[79 rows x 2 columns]


In [58]:
def create_base_object(schematype):
    basedict = {"@type": schematype,"type": "object","properties":{},"required":[]}
    return basedict

def replace_json_validation(expected_dict):
    expected_dict["text"]={"type":"string"}
    expected_dict["url"]={"type":"string","format":"uri"}
    expected_dict["number"] = {"type":"number"}
    expected_dict["integer"] = {"type":"integer"}
    expected_dict["boolean"] = {"type":"boolean"}
    expected_dict["date"]={"type":"string","format":"date"}
    return expected_dict

def translate(typestring):
    if typestring == "Text":
        newtype = {"type":"string"}
    elif typestring == "URL":
        newtype = {"type":"string","format":"uri"}
    elif typestring == "Number":
        newtype = {"type":"number"}
    elif typestring == "Integer":
        newtype = {"type":"integer"}
    elif typestring == "Boolean":
        newtype = {"type":"boolean"}
    elif typestring == "Date":
        newtype = {"type":"string","format":"date"}
    else:
        newtype = {"type":"object","@type":typestring}
    return newtype

def generate_base_bs_rules(allpropdf):
    ## Get required properties for each class
    required_props = allpropdf.loc[allpropdf['marginality']=="Minimum"].copy()
    required_props.to_csv(os.path.join(datapath,'required_props.tsv'),sep='\t',header=True)
    required_props['expected_count'] = required_props.apply(lambda row: count_expected_types(row['expected_type']), axis=1)
    required_props['single_expected'] = [x for x in required_props['expected_type']]
    required_props = required_props.explode('single_expected')
    bioschema_list = required_props['specification'].unique().tolist()
    validation_dict = {}
    for eachspec in bioschema_list:
        tmpdf = required_props.loc[required_props['specification']==eachspec]
        property_list = tmpdf['property'].unique().tolist()
        property_dict = {}
        for eachprop in property_list:
            propdf = tmpdf.loc[tmpdf['property']==eachprop].copy()
            if len(propdf)<2:
                ## property has a single expected type
                property_dict[eachprop] = translate(propdf.iloc[0]['single_expected'])
            if len(propdf)>1:
                propdf['translated'] = propdf.apply(lambda row: translate(row['single_expected']), axis=1)
                property_dict[eachprop] = {"oneOf": propdf['translated'].tolist()}
        validation_dict[eachspec.lower()] = {"@type":eachspec,"type":"object","properties":property_dict,"required":property_list}
    return validation_dict

def count_expected_types(expected_type):
    type_count = len(expected_type)
    return type_count



In [59]:
## convert base expected type into json dictionary for expected types
validation_dict = generate_base_bs_rules(allpropdf)

expected_objects = pandas.read_csv(os.path.join(datapath,'expected_types_frequency.tsv'),delimiter='\t',header=0,index_col=0)

expected_dict = {}
for i in range(len(expected_objects)):
    schematype = expected_objects.iloc[i]['expected_type']
    expected_dict[schematype.lower()]=create_base_object(schematype)

expected_dict = replace_json_validation(expected_dict)
expected_dict.update(validation_dict)

with open(os.path.join(datapath,'base_types.json'),'w') as outfile:
    outfile.write(json.dumps(expected_dict, indent=2, sort_keys=False))

In [129]:
print(expected_dict['beacon'])

{'@type': 'Beacon', 'type': 'object', 'properties': {'dataset': {'type': 'object', '@type': 'DataCatalog'}, 'name': {'type': 'string'}, 'potentialAction': {'type': 'object', '@type': 'Action'}, 'provider': {'oneOf': [{'type': 'object', '@type': 'Organization'}, {'type': 'object', '@type': 'Person'}]}, 'rdf:type': {'type': 'string', 'format': 'uri'}, 'supportedRefs': {'type': 'string'}, 'url': {'type': 'string', 'format': 'uri'}}, 'required': ['dataset', 'name', 'potentialAction', 'provider', 'rdf:type', 'supportedRefs', 'url']}


### Defining rules for a validation builder

This section is just to test the logic for creating combined JSON schema validation rules
For example: how to convert one or many rules for single or multiple types of json validation rules

Example cases:
* ONE of a single rule type: ONE type: Text (string)
* Many of a single rule type: Many of type: Text (string)
* One of two potential rule types: Text (string) or Person (object)
* One of two potential rule types: Text (string) or URL (Formatted Text)
* Many of two potential rule types: Text (string) or Person (object)
* Many of two potential rules types: Text (string) or URL (formatted string)

In [80]:
def get_single_rule(expected_dict,rule_key):
    rule_dict = expected_dict[rule_key]
    return rule_dict

def get_one_many_rule(expected_dict,rule_list):
    if len(set(rule_list))==1:
        ## This is a list of a single rule, treat as such
        rule_dict = get_single_rule(expected_dict,rule_list[0])
    else:
        rule_dict = {}
        rule_val_list = []
        for each_rule in rule_list:
            rule_val_list.append(expected_dict[each_rule])
        rule_dict["oneOf"] = rule_val_list
    return rule_dict

def get_many_single_rule(expected_dict,rule_key):
    rule_dict = {}
    rule_val_list = []
    rule_val_list.append(expected_dict[rule_key])
    rule_val_list.append({"type":"array","items":expected_dict[rule_key]})
    rule_dict["oneOf"] = rule_val_list
    return rule_dict

def get_many_many_rules(expected_dict,rule_list):
    rule_dict = {}
    ### check if they are all of the same json schema types
    rule_set = set(rule_list)
    if len(rule_set) == 1 and len(rule_list) == 1:
        ## This is actually just a single rule placed in a list
        rule_dict = get_many_single_rule(expected_dict,rule_list[0])
    
    elif len(rule_set) == 1 and len(rule_list) > 1:
        ## This is actually just a multiples of a single rule in a list, treat as above
        rule_dict = get_many_single_rule(expected_dict,rule_list[0])
        
    elif len(rule_set)> 1:
        ## The options are mixed between types, use "anyOf" for the array
        rule_val_list = []
        for each_rule in rule_list:
            rule_val_list.append(expected_dict[each_rule])
            rule_val_list.append({"type":"array","items":expected_dict[each_rule]})
        rule_dict["anyOf"] = rule_val_list
        
    return rule_dict


def get_rules(expected_dict,rule_list,cardinality="one"):
    if isinstance(rule_list,str) == True:
        if cardinality.lower() == "one":
            rule_dict = get_single_rule(expected_dict,rule_list)
        if cardinality.lower() == "many":
            rule_dict = get_many_single_rule(expected_dict,rule_list)
    if isinstance(rule_list,list):
        if cardinality.lower() == "one":
            rule_dict = get_one_many_rule(expected_dict,rule_list)
        if cardinality.lower() == "many":
            rule_dict = get_many_many_rules(expected_dict,rule_list)
    return rule_dict

In [82]:
## Test the above functions
cardinality = "one"

with open(os.path.join(datapath,'base_types.json'),'r') as infile:
    expected_dict = json.load(infile, encoding="UTF-8")

In [84]:
## ONE of a single rule type: ONE type: Text (string)
cardinality = "One"
rule_dict = get_rules(expected_dict,"text",cardinality)
print(rule_dict,'\n')

## Many of a single rule type: Many of type: Text (string)
cardinality = "MANY"
rule_dict = get_rules(expected_dict,"text",cardinality)
print(rule_dict,'\n')

## One of two potential rule types: Text (string) or Person (object)
cardinality = "one"
rule_dict = get_rules(expected_dict,["text","person"],cardinality)
print(rule_dict,'\n')

## One of two potential rule types: Text (string) or URL (Formatted Text)
cardinality = "one"
rule_dict = get_rules(expected_dict,["text","url"],cardinality)
print(rule_dict,'\n')

## Many of two potential rule types: Text (string) or Person (object)
cardinality = "Many"
rule_dict = get_rules(expected_dict,["text","person"],cardinality)
print(rule_dict,'\n')

## Many of two potential rules types: Text (string) or URL (formatted string)
cardinality = "Many"
rule_dict = get_rules(expected_dict,["text","url"],cardinality)
print(rule_dict,'\n')


{'type': 'string'} 

{'oneOf': [{'type': 'string'}, {'type': 'array', 'items': {'type': 'string'}}]} 

{'oneOf': [{'type': 'string'}, {'@type': 'Person', 'type': 'object', 'properties': {'description': {'type': 'string'}, 'mainEntityOfPage': {'oneOf': [{'type': 'object', '@type': 'CreativeWork'}, {'type': 'string', 'format': 'uri'}]}, 'name': {'type': 'string'}}, 'required': ['description', 'mainEntityOfPage', 'name']}]} 

{'oneOf': [{'type': 'string'}, {'type': 'string', 'format': 'uri'}]} 

{'anyOf': [{'type': 'string'}, {'type': 'array', 'items': {'type': 'string'}}, {'@type': 'Person', 'type': 'object', 'properties': {'description': {'type': 'string'}, 'mainEntityOfPage': {'oneOf': [{'type': 'object', '@type': 'CreativeWork'}, {'type': 'string', 'format': 'uri'}]}, 'name': {'type': 'string'}}, 'required': ['description', 'mainEntityOfPage', 'name']}, {'type': 'array', 'items': {'@type': 'Person', 'type': 'object', 'properties': {'description': {'type': 'string'}, 'mainEntityOfPage'

### generate property to rule mappings

In [87]:
## Generate mapping of properties with the same expected type
from random import sample

def clean_prop_list(stringproplist):
    type_list = stringproplist.strip('[').strip(']').split(',')
    clean_list = [x.strip(' ').replace("'","") for x in type_list]
    rule_list = [x.lower() for x in clean_list]
    return clean_list, rule_list

def generate_id_from_proplist(stringproplist,propcount):
    clean_list, rule_list = clean_prop_list(stringproplist)
    idbase = [x[0] for x in clean_list]
    letters = ['a','b','c','t','u','v','w','x','y','z']
    randbase = sample(letters, k=5)
    idhash = "".join(idbase)+'_'+str(propcount)+'_'+''.join(randbase)
    return idhash
    
def generate_prop_rule_maps(datapath,expected_dict,prop_freq):
    onemapdict = {}
    manymapdict = {}
    onerulemap = {}
    manyrulemap = {}
    onepropmap = {}
    manypropmap = {}
    oneproprule = {}
    manyproprule = {}
    prop_freq_multi = prop_freq.loc[prop_freq['counts']>1].copy() ## filter out properties that appear only once
    for each_expected_type in prop_freq_multi['expected_types_raw'].unique().tolist():
        propcount = prop_freq_multi.loc[prop_freq_multi['expected_types_raw']==each_expected_type]['counts'].sum()
        proplist = prop_freq['property'].loc[prop_freq['expected_types_raw']==each_expected_type].unique().tolist()
        clean_list, rule_list = clean_prop_list(each_expected_type)
        idhash = generate_id_from_proplist(each_expected_type,propcount)
        idhash_many = idhash+'_many'
        onemapdict[idhash]=proplist
        manymapdict[idhash_many]=proplist
        oneruledict = get_rules(expected_dict,rule_list,"one")
        onerulemap[idhash]=oneruledict
        manyruledict = get_rules(expected_dict,rule_list,"many")
        manyrulemap[idhash_many]=manyruledict
        for eachprop in proplist:
            onepropmap[eachprop]=idhash
            manypropmap[eachprop]=idhash_many
            oneproprule[eachprop] = oneruledict
            manyproprule[eachprop] = manyruledict
    filedict = {"one_map.txt":onemapdict,"many_map.txt":manymapdict,"one_rule_map.txt":onerulemap,
                "many_rule_map.txt":manyrulemap,"one_prop_map.txt":onepropmap,"many_prop_map.txt":manypropmap,
                "one_prop_rule.txt":oneproprule,"many_prop_rule.txt":manyproprule}
    for eachkey in list(filedict.keys()):
        with open(os.path.join(datapath,eachkey),'w+') as outfile:
            outfile.write(json.dumps(filedict[eachkey]))
    

In [88]:
datapath = os.path.join('results','mappings')
generate_prop_rule_maps(datapath,expected_dict,prop_freq)

### Convert Bioschemas validation rules to be DDE-validator default-compatible

Format can be seen here https://github.com/biothings/discovery-app/blob/vue3-app/nuxt-app/store/modules/editor_options/validation_options.js

  {
    _id: "01233bio",
    title: "text",
    color: "#097969",
    validation: {
      type: "string",
    },
    belongs_to: "bioschemas",
  },

In [4]:
import random
datapath = os.path.join('results','mappings')
one_prop = os.path.join(datapath,'one_prop_rule.txt')
many_prop = os.path.join(datapath,'many_prop_rule.txt')

one_prop_json = json.load(open(one_prop,'rb'))
many_prop_json = json.load(open(many_prop,'rb'))

prop_keys = list(one_prop_json.keys())

dde_list = []

idlist = set()

for eachkey in prop_keys:
    tmpidlist = []
    while len(tmpidlist)<3:
        tmpid = eachkey[0:2]+random.choice('abcdefgh')+random.choice('pqrstuvwx')+str(random.randint(100,999))+'bio'
        if tmpid not in idlist:
            tmpidlist.append(tmpid)
            idlist.add(tmpid)
    dde_one = {}
    dde_one['_id'] = tmpidlist[0]
    dde_one['title'] = eachkey
    dde_one['color'] = '#097969'
    dde_one['validation'] = one_prop_json[eachkey]
    dde_one['belongs_to'] = 'bioschemas'
    dde_list.append(dde_one)
    dde_many = dde_one.copy()
    dde_many['_id'] = tmpidlist[1]
    dde_many['validation'] = many_prop_json[eachkey]
    dde_list.append(dde_many)

print(dde_list)

[{'_id': 'ideu553bio', 'title': 'identifier', 'color': '#097969', 'validation': {'oneOf': [{'@type': 'PropertyValue', 'type': 'object', 'properties': {}, 'required': []}, {'type': 'string'}, {'type': 'string', 'format': 'uri'}]}, 'belongs_to': 'bioschemas'}, {'_id': 'idfr404bio', 'title': 'identifier', 'color': '#097969', 'validation': {'anyOf': [{'@type': 'PropertyValue', 'type': 'object', 'properties': {}, 'required': []}, {'type': 'array', 'items': {'@type': 'PropertyValue', 'type': 'object', 'properties': {}, 'required': []}}, {'type': 'string'}, {'type': 'array', 'items': {'type': 'string'}}, {'type': 'string', 'format': 'uri'}, {'type': 'array', 'items': {'type': 'string', 'format': 'uri'}}]}, 'belongs_to': 'bioschemas'}, {'_id': 'hadq186bio', 'title': 'hasRepresentation', 'color': '#097969', 'validation': {'@type': 'PropertyValue orText orURL', 'type': 'object', 'properties': {}, 'required': []}, 'belongs_to': 'bioschemas'}, {'_id': 'haaq642bio', 'title': 'hasRepresentation', 'c

#### generate property list based on expected type

In [5]:
exp_type_file = os.path.join('results','frequency_tables','base_types.json')
expected_type = json.load(open(exp_type_file,'rb'))
exprop_names = list(expected_type.keys())

## Clean up bad entries
for eachprop in exprop_names:
    if ' ' in eachprop:
        exprop_names.remove(eachprop)
        
fixprops = {'text':'Text','url':'URL','number':'Number','integer':'Integer',
            'boolean':'Boolean','date':'Date','datetime':'DateTime'}
exp_type_list = []

for eachkey in exprop_names:
    tmpdict = expected_type[eachkey]
    exp_one = {}
    tmpidlist=[]
    while len(tmpidlist)<3:
        tmpid = 'exty'+random.choice('vwxyzabcdef')+random.choice('ijklmnopq')+str(random.randint(100,999))+'bio'
        if tmpid not in idlist:
            idlist.add(tmpid)
            tmpidlist.append(tmpid)
    exp_one['_id'] = tmpidlist[0]
    if '@type' in tmpdict.keys():
        exp_one['title'] = tmpdict['@type']
    elif eachkey in fixprops.keys():
        exp_one['title'] = fixprops[eachkey]
    else:
        exp_one['title'] = eachkey
    exp_one['color'] = '#097969'
    exp_one['validation'] = expected_type[eachkey]
    exp_one['belongs_to'] = 'bioschemas'
    exp_type_list.append(exp_one)
    exp_many = exp_one.copy()
    exp_many['_id'] = tmpidlist[1]
    exp_many['validation']  = {"oneOf": [
          tmpdict,
          {
            "type": "array",
            "items": tmpdict
          }
        ]}
    exp_type_list.append(exp_many)

In [6]:
print(exp_type_list)

[{'_id': 'extyaq553bio', 'title': 'Text', 'color': '#097969', 'validation': {'type': 'string'}, 'belongs_to': 'bioschemas'}, {'_id': 'extybq929bio', 'title': 'Text', 'color': '#097969', 'validation': {'oneOf': [{'type': 'string'}, {'type': 'array', 'items': {'type': 'string'}}]}, 'belongs_to': 'bioschemas'}, {'_id': 'extyxl513bio', 'title': 'URL', 'color': '#097969', 'validation': {'type': 'string', 'format': 'uri'}, 'belongs_to': 'bioschemas'}, {'_id': 'extydi371bio', 'title': 'URL', 'color': '#097969', 'validation': {'oneOf': [{'type': 'string', 'format': 'uri'}, {'type': 'array', 'items': {'type': 'string', 'format': 'uri'}}]}, 'belongs_to': 'bioschemas'}, {'_id': 'extyyo198bio', 'title': 'PropertyValue', 'color': '#097969', 'validation': {'@type': 'PropertyValue', 'type': 'object', 'properties': {}, 'required': []}, 'belongs_to': 'bioschemas'}, {'_id': 'extyxj593bio', 'title': 'PropertyValue', 'color': '#097969', 'validation': {'oneOf': [{'@type': 'PropertyValue', 'type': 'object',

#### export the resulting files

In [7]:
export_path = os.path.join(datapath,'for_dde_defaults')
with open(os.path.join(export_path,'by_property.txt'),'w') as f:
    f.write(json.dumps(dde_list, indent=2))

with open(os.path.join(export_path,'by_expectedType.txt'),'w') as f2:
    f2.write(json.dumps(exp_type_list, indent=2))

### Check schema compatibility

In [2]:
from biothings_schema import Schema

script_path = ''
url = "https://raw.githubusercontent.com/NIAID-Data-Ecosystem/nde-schemas/main/nde-mini-combined.jsonld"
#bioschemasfile = os.path.join(script_path,'bioschemas.json')
#with open(bioschemasfile,'r') as infile:
#    url = json.load(infile)

sc = Schema(url, base_schema=["schema.org"])
sc.validation

SchemaValidationError: Class "https://discovery.biothings.io/view/nde/Dataset" has no path to the root "schema:Thing" class