## Data Validity check

This notebook pulls sample JSON records from each source so that their metadata can be checked for validity as schema.org-compliant JSON-LD objects.

1. Pull the records by source and ensure decent coverage of the metadata properties (by using the _exists_ filter)
2. Validate the JSON files against the JSON-LD dump of schema.org using the jsonschema library
  * Note: the library is very loose with the schema.org JSON-LD file, so everything validates fine. This is likely because the file is a vocabulary dump (its top-level keys are @context and @graph) rather than a JSON Schema document, and jsonschema ignores keywords it does not recognize. In comparison, the same files give errors when checked against https://validator.schema.org/
  * There does not appear to be an API for https://validator.schema.org/, so that check can't be automated
  * There does not appear to be a decent path forward for validating the JSON-LD exports from the NDE against schema.org JSON-LD in an automated fashion
  * Instead, pull JSON files programmatically and systematically, then check them manually against https://validator.schema.org/
    * Pull sample JSON files
    * Validate manually
    * Copy/paste the table for downstream processing
3. Organize the validation errors into a table

In [1]:
import os
import json
import requests
import pandas as pd
import jsonschema

In [8]:
script_path = os.getcwd()
sourcelistfile = os.path.join(script_path,'data','sourcelist.txt')
sample_path = os.path.join(script_path,'data','sample_data')
sourcelist = []
with open(sourcelistfile,'r') as srcfile:
    for line in srcfile:
        sourcelist.append(line.strip())

print(sourcelist)

['Zenodo', 'AccessClinicalData@NIAID', 'NCBI+SRA', 'ClinEpiDB', 'ImmPort', 'VEuPathDB', 'LINCS', 'Data+Discovery+Engine', 'Dryad+Digital+Repository', 'Vivli', 'Harvard+Dataverse', 'HuBMAP', 'NCBI+GEO', 'Omics+Discovery+Index+(OmicsDI)', 'Mendeley', 'MicrobiomeDB', 'NICHD+DASH', 'Qiita', 'ReframeDB', 'VDJServer', 'MassIVE', 'MalariaGEN', 'Human+Cell+Atlas', 'Figshare', 'biotools']
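The source names above already have spaces encoded as `+`, which is what allows them to be dropped directly into a query URL. As a minimal sketch, assuming the entries in sourcelist.txt correspond to display names such as 'NCBI SRA' (an assumption; the file may simply have been written by hand), the standard library can produce the same encoding. The helper name `encode_source_name` and the choice to leave `@`, `(` and `)` unescaped are illustrative, picked only to match the list shown.

```python
from urllib.parse import quote_plus

def encode_source_name(display_name):
    """Encode a display name for use in a query string (hypothetical helper)."""
    # safe='@()' keeps those characters literal, matching the list above;
    # spaces are turned into '+'.
    return quote_plus(display_name, safe='@()')

encoded = [encode_source_name(name) for name in
           ['NCBI SRA', 'AccessClinicalData@NIAID', 'Omics Discovery Index (OmicsDI)']]
print(encoded)
```

Whether the API also accepts percent-encoded `@` and parentheses is not established here, which is why they are kept literal.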


In [17]:
r = requests.get('https://schema.org/version/latest/schemaorg-current-https.jsonld')
latest_schema = json.loads(r.text)
print(latest_schema.keys())

dict_keys(['@context', '@graph'])
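The two top-level keys suggest that this file is a JSON-LD vocabulary graph rather than a JSON Schema. Below is a minimal sketch of what can be read from such a graph, assuming each `@graph` entry carries an `@id` such as `schema:name` and an `@type` of `rdf:Property` or `rdfs:Class`. The two sample entries are abbreviated, hand-written stand-ins, and `property_names` and `unknown_keys` are hypothetical helpers. The check only flags record keys that are not schema.org property names; it does not check value types and is not a substitute for https://validator.schema.org/.

```python
def property_names(graph):
    """Collect property names from a list of @graph entries."""
    names = set()
    for node in graph:
        node_type = node.get('@type')
        # '@type' may be a single string or a list of strings
        types = node_type if isinstance(node_type, list) else [node_type]
        if 'rdf:Property' in types:
            # 'schema:name' -> 'name'
            names.add(node['@id'].split(':', 1)[-1])
    return names

def unknown_keys(record, names):
    """Return record keys that are neither JSON-LD keywords nor known properties."""
    return sorted(k for k in record if not k.startswith('@') and k not in names)

# Abbreviated stand-ins for real graph entries (assumed structure).
sample_graph = [
    {'@id': 'schema:name', '@type': 'rdf:Property'},
    {'@id': 'schema:Dataset', '@type': 'rdfs:Class'},
]
sample_record = {'@type': 'Dataset', 'name': 'example', '_id': 'zenodo_163968'}
print(unknown_keys(sample_record, property_names(sample_graph)))
```

In the notebook, the same helper could be called as `property_names(latest_schema['@graph'])`, provided the downloaded file follows the assumed structure.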


In [5]:
r = requests.get("https://api-staging.data.niaid.nih.gov/v1/query?q=includedInDataCatalog.name:Zenodo")
#r = requests.get("https://api-staging.data.niaid.nih.gov/v1/query?q=includedInDataCatalog.name:Zenodo+AND+_exists_:infectiousAgent.name")
print(r.status_code)
rjson = json.loads(r.text)
test = rjson['hits'][0]
print(test)

200


IndexError: list index out of range
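The IndexError above arises when a query returns status 200 but an empty `hits` list, so indexing `[0]` fails. A minimal sketch of a guard, assuming the response body is a dict with a `hits` list as in the cell above (`first_hit` is a hypothetical helper name):

```python
def first_hit(payload):
    """Return the first record in an API response, or None if there are no hits."""
    hits = payload.get('hits', [])
    return hits[0] if hits else None

print(first_hit({'hits': []}))
print(first_hit({'hits': [{'_id': 'zenodo_163968'}]}))
```

Returning `None` lets the caller decide whether an empty result is an error or simply a source/feature pair with no examples.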

In [21]:
# jsonschema.validate takes (instance, schema): the record is the instance
print(jsonschema.validate(test, latest_schema))

None
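A `None` result only means that no validation error was raised. The sketch below illustrates why that is weak evidence when the "schema" is a vocabulary dump. It assumes the dump behaves like a dict holding only JSON-LD keys; `vocab_like` and `strict` are illustrative stand-ins, not taken from schema.org.

```python
import jsonschema

# Stand-in for the schema.org dump: only JSON-LD keys, no JSON Schema keywords.
vocab_like = {'@context': {}, '@graph': []}

# Every instance passes, because there are no constraints to violate.
for instance in ({'name': 'example'}, 123, None):
    jsonschema.validate(instance=instance, schema=vocab_like)

# A real JSON Schema (illustrative) does reject input that breaks its rules.
strict = {'type': 'object', 'required': ['name']}
try:
    jsonschema.validate(instance={}, schema=strict)
    rejected = False
except jsonschema.ValidationError:
    rejected = True
print(rejected)
```

This is consistent with the observation that the same records still produce errors on https://validator.schema.org/.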


In [6]:
features = ['infectiousAgent.name','species.name','topicCategory.name','measurementTechnique.name',
            'citation.pmid','funding.funder.name','citedBy','isBasedOn','isBasisFor','isPartOf',
            'hasPart','isRelatedTo','spatialCoverage','temporalCoverage']


In [11]:
%%time
## Pull records from each source and save the JSON-LD files
notfound = []    # source/feature pairs for which the query returned no hits
found_ids = []   # _ids of records already written to disk
redundant = []   # one row per hit, recording whether the record was dumped
os.makedirs(sample_path, exist_ok=True)  # make sure the output folder exists
for eachsource in sourcelist:
    baseurl = f'https://api-staging.data.niaid.nih.gov/v1/query?q=includedInDataCatalog.name:{eachsource}'
    for eachfeature in features:
        # _exists_ restricts the query to records that have this property populated
        queryurl = f'{baseurl}+AND+_exists_:{eachfeature}'
        filename = f'{eachsource}_{eachfeature}_example.json'
        r = requests.get(queryurl)
        rjson = json.loads(r.text)
        if len(rjson['hits']) > 0:
            temp = rjson['hits'][0]
            if temp['_id'] in found_ids:
                # same record was already saved for another feature: log it, skip the dump
                redundant.append({"source":eachsource,"feature":eachfeature,"_id":temp['_id'],"dumped":"no"})
            else:
                found_ids.append(temp['_id'])
                redundant.append({"source":eachsource,"feature":eachfeature,"_id":temp['_id'],"dumped":"yes"})
                with open(os.path.join(sample_path,filename),'w') as outwrite:
                    outwrite.write(json.dumps(temp, indent=4))
        else:
            notfound.append({"source":eachsource,"feature":eachfeature,"query_url":queryurl})
faildf = pd.DataFrame(notfound)
redundantdf = pd.DataFrame(redundant)
print(faildf.head(n=2))
print(redundantdf.head(n=2))

   source               feature  \
0  Zenodo  infectiousAgent.name   
1  Zenodo          species.name   

                                           query_url  
0  https://api-staging.data.niaid.nih.gov/v1/quer...  
1  https://api-staging.data.niaid.nih.gov/v1/quer...  
   source              feature            _id dumped
0  Zenodo   topicCategory.name  zenodo_163968    yes
1  Zenodo  funding.funder.name   zenodo_12374    yes
CPU times: total: 19.9 s
Wall time: 1min 56s
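To judge coverage, it can help to count, per source, how many features yielded an example and how many did not. A minimal sketch using only the standard library, assuming rows shaped like the entries appended to `redundant` and `notfound` above (`coverage_by_source` is a hypothetical helper; the sample rows are copied from the printed output and are not a full run):

```python
from collections import Counter

def coverage_by_source(found_rows, missing_rows):
    """Count features with and without an example record, per source."""
    found = Counter(row['source'] for row in found_rows)
    missing = Counter(row['source'] for row in missing_rows)
    return {src: {'with_example': found[src], 'without_example': missing[src]}
            for src in sorted(set(found) | set(missing))}

found_rows = [
    {'source': 'Zenodo', 'feature': 'topicCategory.name', '_id': 'zenodo_163968', 'dumped': 'yes'},
    {'source': 'Zenodo', 'feature': 'funding.funder.name', '_id': 'zenodo_12374', 'dumped': 'yes'},
]
missing_rows = [
    {'source': 'Zenodo', 'feature': 'infectiousAgent.name'},
    {'source': 'Zenodo', 'feature': 'species.name'},
]
print(coverage_by_source(found_rows, missing_rows))
```

In the notebook, the call would be `coverage_by_source(redundant, notfound)`.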


In [17]:
results_path = os.path.join(script_path,'data','sampling_results')
os.makedirs(results_path, exist_ok=True)  # create the output folder if it is missing
faildf.to_csv(os.path.join(results_path,'no_examples.tsv'),sep='\t',header=True)
redundantdf.to_csv(os.path.join(results_path,'sample_info.tsv'),sep='\t',header=True)