# Converting Scraped Data into the IDP Knowledge Graph

__Authors:__  
Alasdair J G Gray ([ORCID:0000-0002-5711-4872](http://orcid.org/0000-0002-5711-4872)), _Heriot-Watt University, Edinburgh, UK_

Petros Papadopoulos ([ORCID:0000-0002-8110-7576](https://orcid.org/0000-0002-8110-7576)), _Heriot-Watt University, Edinburgh, UK_

Ivan Mičetić ([ORCID:0000-0003-1691-8425](https://orcid.org/0000-0003-1691-8425)), _University of Padua, Italy_

András Hatos ([ORCID:0000-0001-9224-9820](https://orcid.org/0000-0001-9224-9820)), _University of Padua, Italy_

Imran Asif ([ORCID:0000-0002-1144-6265](https://orcid.org/0000-0002-1144-6265)), _Heriot-Watt University, Edinburgh, UK_

__License:__ Apache 2.0

__Acknowledgements:__ This notebook builds upon the work conducted during the Virtual BioHackathon-Europe 2020 reported in [BioHackrXiv](https://biohackrxiv.org/v3jct/).

## Introduction

IDPCentral is a proposed central registry of proteins that are known to be intrinsically disordered (IDPs).

We aim to populate the content of the registry with Bioschemas markup that has been scraped using the BMUSE tool.

This notebook goes through the steps of converting the scraped content into a Knowledge Graph of IDP data that can be used as a source of data for the IDPCentral registry.

### Input

Scraped data from the following data sources:
- [DisProt](https://www.disprot.org/)
- [MobiDB](https://mobidb.bio.unipd.it/)
- [Protein Ensemble Database](https://proteinensemble.org/) (PED)

The data files can be found in this [GitHub directory](https://github.com/elixir-europe/BioHackathon-projects-2020/tree/master/projects/24/IDPCentral/scraped-data). The code in this notebook uses a relative path to retrieve the data files, i.e. `../scraped-data`.

### Output

The generated knowledge graph is written to two files in the same directory as this notebook:
- `IDPKG.jsonld`: JSON-LD serialisation of all named graphs
- `IDPKG.nq`: N-Quads serialisation of all named graphs

## Code

The following code converts the data scraped using BMUSE into the desired knowledge graph. It uses the RDFLib Python library and its ability to run SPARQL queries over its internal data model.

The generated knowledge graph uses named graphs to track the provenance of individual statements, i.e. where each statement was retrieved from.

### Library Imports

In [1]:
# Import and configure logging library
from datetime import datetime
import logging
logging.basicConfig(
    filename='idpETL.log', 
    filemode='w', 
    format='%(levelname)s:%(message)s', 
    level=logging.INFO)
logging.info('Starting processing at %s' % datetime.now().time())

In [2]:
# Imports from RDFlib
from rdflib import ConjunctiveGraph, Dataset, Graph, RDF, URIRef

In [3]:
# Import template library for templating queries
from string import Template

In [4]:
# Import iPython widgets for user interactions
import ipywidgets as widgets
from IPython.display import HTML, clear_output, display

In [5]:
# Import operating system utilities
import os

In [6]:
# Import functions for interacting with file directory
from glob import glob

### Template Queries

#### Provenance Query

Extracts metadata about where the data has originated and when it was scraped.

In [7]:
# Query to extract the graph and its metadata
provenanceQuery = """
PREFIX pav: <http://purl.org/pav/>
PREFIX schema: <https://schema.org/>
PREFIX void: <http://rdfs.org/ns/void#>

CONSTRUCT {
    ?g pav:retrievedFrom ?source ;
        pav:retrievedOn ?date .
    ?g pav:createdWith ?createdWith .
    ?g void:inDataset ?dataset .
}
WHERE {
    ?g pav:retrievedFrom ?source ;
        pav:retrievedOn ?date .
    OPTIONAL {?g pav:createdWith ?createdWith}
    GRAPH ?g {
        OPTIONAL {?s schema:includedInDataset ?dataset}
    }
}
"""

#### Dataset Query
The following query extracts the Dataset data.

In [8]:
datasetQuery = """
PREFIX schema: <https://schema.org/>
CONSTRUCT {
    ?s a schema:Dataset ;
        schema:description ?description ;
        schema:identifier ?identifier ;
        schema:keywords ?keywords ;
        schema:license ?license ;
        schema:name ?name ;
        schema:url ?url ;
        schema:alternateName ?altName ;
        schema:citation ?citation ;
        schema:creator ?creator ;
        schema:distribution ?distribution ;
        schema:includedInDataCatalog ?dataCatalog ;
        schema:isBasedOn ?study ;
        schema:measurementTechnique ?measurement ;
        schema:variableMeasured ?variable ;
        schema:version ?version ;
        schema:dateCreated ?dateCreated ;
        schema:dateModified ?dateModified ;
        schema:datePublished ?datePublished ;
        schema:hasPart ?part ;
        schema:isAccessibleForFree ?free ;
        schema:isPartOf ?parent ;
        schema:maintainer ?maintainer ;
        schema:publisher ?publisher ;
        schema:sameAs ?sameAs .
}
WHERE {
  graph ?g {
    ?s a schema:Dataset .
# Minimum
    OPTIONAL {?s schema:description ?description}
    OPTIONAL {?s schema:identifier ?identifier}
    OPTIONAL {?s schema:keywords ?keywords}
    OPTIONAL {?s schema:license ?license}
    OPTIONAL {?s schema:name ?name}
    OPTIONAL {?s schema:url ?url}
# Recommended
    OPTIONAL {?s schema:alternateName ?altName}
    OPTIONAL {?s schema:citation ?citation}
    OPTIONAL {?s schema:creator ?creator}
    OPTIONAL {?s schema:distribution ?distribution}
    OPTIONAL {?s schema:includedInDataCatalog ?dataCatalog}
    OPTIONAL {?s schema:isBasedOn ?study}
    OPTIONAL {?s schema:measurementTechnique ?measurement}
    OPTIONAL {?s schema:variableMeasured ?variable}
    OPTIONAL {?s schema:version ?version}
# Optional
    OPTIONAL {?s schema:dateCreated ?dateCreated}
    OPTIONAL {?s schema:dateModified ?dateModified}
    OPTIONAL {?s schema:datePublished ?datePublished}
    OPTIONAL {?s schema:hasPart ?part}
    OPTIONAL {?s schema:isAccessibleForFree ?free}
    OPTIONAL {?s schema:isPartOf ?parent}
    OPTIONAL {?s schema:maintainer ?maintainer}
    OPTIONAL {?s schema:publisher ?publisher}
    OPTIONAL {?s schema:sameAs ?sameAs}
  }
}
"""

#### DataCatalog Query
The following query extracts the DataCatalog data.

In [9]:
dataCatalogQuery = """
PREFIX schema: <https://schema.org/>
CONSTRUCT {
    ?s a schema:DataCatalog ;
        schema:description ?description ;
        schema:keywords ?keywords ;
        schema:name ?name ;
        schema:provider ?provider ;
        schema:url ?url ;
        schema:about ?about ;
        schema:alternateName ?altName ;
        schema:citation ?citation ;
        schema:dataset ?dataset ;
        schema:dateCreated ?dateCreated ;
        schema:identifier ?identifier ;
        schema:license ?license ;
        schema:sourceOrganization ?srcOrg ;
        schema:dateModified ?dateModified ;
        schema:encodingFormat ?format ;
        schema:datePublished ?datePublished ;
        schema:sameAs ?sameAs .
}
WHERE {
  graph ?g {
    ?s a schema:DataCatalog .
# Minimum
    OPTIONAL {?s schema:description ?description}
    OPTIONAL {?s schema:keywords ?keywords}
    OPTIONAL {?s schema:name ?name}
    OPTIONAL {?s schema:provider ?provider}
    OPTIONAL {?s schema:url ?url}
# Recommended
    OPTIONAL {?s schema:about ?about}
    OPTIONAL {?s schema:alternateName ?altName}
    OPTIONAL {?s schema:citation ?citation}
    OPTIONAL {?s schema:dataset ?dataset}
    OPTIONAL {?s schema:dateCreated ?dateCreated}
    OPTIONAL {?s schema:identifier ?identifier}
    OPTIONAL {?s schema:license ?license}
    OPTIONAL {?s schema:sourceOrganization ?srcOrg}
# Optional
    OPTIONAL {?s schema:dateModified ?dateModified}    
    OPTIONAL {?s schema:encodingFormat ?format}
# Extras
    OPTIONAL {?s schema:datePublished ?datePublished}
    OPTIONAL {?s schema:sameAs ?sameAs}
  }
}
"""

#### Protein Query

The following query extracts the protein data from the scraped named graph and re-identifies the protein by its UniProt accession number, using the `idpc:` namespace declared in the query.

The query uses `OPTIONAL` clauses throughout because not all sources were found to comply with the minimal properties of the Bioschemas Protein Profile.

In [10]:
# Templated query for creating the direct properties for a protein entity
proteinQuery = Template("""
# Query to convert Protein scraped data to a merged named graph
# Defensive query: assumes that data does not conform to Protein profile

PREFIX idpc: <https://idpcentral.org/id/> 
PREFIX owl: <http://www.w3.org/2002/07/owl#> 
PREFIX pav: <http://purl.org/pav/> 
PREFIX schema: <https://schema.org/> 

CONSTRUCT {
    idpc:${bsAccession} a schema:Protein ;
        schema:identifier ?identifier ;
        schema:name ?name ;
        schema:associatedDisease ?associatedDisease ;
        schema:description ?description ;
        schema:hasSequenceAnnotation ?annotation ;
        schema:isEncodedByBioChemEntity ?encodedBy ;
        schema:taxonomicRange ?taxonomicRange ;
        schema:url ?url ;
        schema:alternateName ?alternateName ;
        schema:bioChemInteraction ?bioChemInteraction ;
        schema:bioChemSimilarity ?bioChemSimilarity ;
        schema:hasBioChemEntityPart ?bioChemEntity ;
        schema:hasBioPolymerSequence ?sequence ;
        schema:hasMolecularFunction ?molFunction ;
        schema:hasRepresentation ?representation ;
        schema:image ?image ;
        schema:isInvolvedInBiologicalProcess ?process ;
        schema:isLocatedInSubcellularLocation ?cellularLocation ;
        schema:isPartOfBioChemEntity ?parentEntity ;
        schema:sameAs ?sameAs , ?s ;
        owl:sameAs ?sameAs .
}
WHERE {
    GRAPH ?g {
# Bioschemas Minimal Properties
        ?s a schema:Protein .
        OPTIONAL {?s schema:identifier ?identifier }
        OPTIONAL {?s schema:name ?name }
## Bioschemas Recommended properties
        OPTIONAL {?s schema:associatedDisease ?associatedDisease}
        OPTIONAL {?s schema:description ?description}
        OPTIONAL {?s schema:hasSequenceAnnotation ?annotation }
        OPTIONAL {?s schema:isEncodedByBioChemEntity ?encodedBy}
        OPTIONAL {?s schema:taxonomicRange ?taxonomicRange }
        OPTIONAL {?s schema:url ?url}
## Bioschemas Optional properties
        OPTIONAL {?s schema:alternateName ?alternateName}
        OPTIONAL {?s schema:bioChemInteraction ?bioChemInteraction}
        OPTIONAL {?s schema:bioChemSimilarity ?bioChemSimilarity}
        OPTIONAL {?s schema:hasBioChemEntityPart ?bioChemEntity}
        OPTIONAL {?s schema:hasBioPolymerSequence ?sequence}
        OPTIONAL {?s schema:hasMolecularFunction ?molFunction}
        OPTIONAL {?s schema:hasRepresentation ?representation }
        OPTIONAL {?s schema:image ?image}
        OPTIONAL {?s schema:isInvolvedInBiologicalProcess ?process}
        OPTIONAL {?s schema:isLocatedInSubcellularLocation ?cellularLocation}
        OPTIONAL {?s schema:isPartOfBioChemEntity ?parentEntity}
        OPTIONAL {?s schema:sameAs ?sameAs }
    }
}
""")
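
The templating mechanism itself comes from Python's standard `string.Template`. The following sketch, with a shortened query and placeholder values that are assumptions for illustration, shows how `substitute` fills `${bsAccession}`. Keyword arguments that have no matching placeholder are ignored, while a missing value for a placeholder raises `KeyError`.

```python
from string import Template

# Shortened, illustrative template; not the full protein query
query_template = Template("""
PREFIX idpc: <https://idpcentral.org/id/>
PREFIX schema: <https://schema.org/>
CONSTRUCT { idpc:${bsAccession} a schema:Protein }
WHERE { ?s a schema:Protein }
""")

# Placeholder values, for illustration only
query = query_template.substitute(
    bsAccession="P12345",
    proteinIRI="https://example.org/protein/1",  # no ${proteinIRI} in the template: ignored
)

# A placeholder without a value raises KeyError
try:
    query_template.substitute()
    missing_value_raised = False
except KeyError:
    missing_value_raised = True
```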

#### SequenceAnnotation Query

Query to extract sequence annotations.

In [11]:
sequenceAnnotationsQuery = """
PREFIX schema: <https://schema.org/>
CONSTRUCT {
  ?s a schema:SequenceAnnotation ;
        schema:additionalProperty ?addProp ;
        schema:citation ?citation ;
        schema:creationMethod ?method ;
        schema:description ?description ;
        schema:editor ?editor ;
        schema:isPartOfBioChemEntity ?bioChemEntity ;
        schema:sequenceLocation ?seqLoc .
  ?s schema:subjectOf ?pubmedID .
  ?pubmedID a ?pubMedType .
}
WHERE {
  graph ?g {
    ?s a schema:SequenceAnnotation .
    OPTIONAL {?s schema:additionalProperty ?addProp }
    OPTIONAL {?s schema:citation ?citation }
    OPTIONAL {?s schema:creationMethod ?method }
    OPTIONAL {?s schema:description ?description }
    OPTIONAL {?s schema:editor ?editor }
    OPTIONAL {?s schema:isPartOfBioChemEntity ?bioChemEntity }
    OPTIONAL {?s schema:sequenceLocation ?seqLoc }
    OPTIONAL {?s schema:subjectOf ?pubmedID .
                ?pubmedID a ?pubMedType }
  }
}
"""

#### PropertyValue Query

Query to extract PropertyValue data.

In [12]:
propertyValueQuery = """
PREFIX schema: <https://schema.org/>
CONSTRUCT {
    ?s a schema:PropertyValue ;
        schema:name ?name ;
        schema:value ?value .
    ?value a schema:DefinedTerm ;
        schema:inDefinedTermSet ?termSet ;
        schema:name ?termName ;
        schema:termCode ?termCode .
}
where {
    graph ?g {
        ?s a schema:PropertyValue .
        OPTIONAL {?s schema:name ?name }
        OPTIONAL {?s schema:value ?value }
        OPTIONAL {?value a schema:DefinedTerm }
        OPTIONAL {?value schema:inDefinedTermSet ?termSet }
        OPTIONAL {?value schema:name ?termName }
        OPTIONAL {?value schema:termCode ?termCode }
    }
}
"""

#### SequenceRange Query

Query to extract SequenceRange data.

In [13]:
sequenceRangeQuery = """
PREFIX schema: <https://schema.org/>
CONSTRUCT {
    ?s a schema:SequenceRange ;
        schema:rangeStart ?start ;
        schema:rangeEnd ?end .
}
where {
    graph ?g {
        ?s a schema:SequenceRange .
        OPTIONAL {?s schema:rangeStart ?start }
        OPTIONAL {?s schema:rangeEnd ?end}
    }
}
"""

#### UniProt ID Query

The following query extracts the UniProt IRI, which the source data declares using `schema:sameAs`. Different sources use different patterns for the UniProt IRI; the `FILTER` clause matches both of the following:
- `https://www.uniprot.org/uniprot/`
- `http://purl.uniprot.org/uniprot/`

In [14]:
# Query to extract UniProt IRI
idQuery = """
PREFIX schema: <https://schema.org/>
SELECT ?proteinIRI ?uniprot
WHERE {
    GRAPH ?g {
        ?proteinIRI a schema:Protein ;
            schema:sameAs ?uniprot .
        FILTER regex(str(?uniprot), "^(https://www|http://purl)[.]uniprot[.]org/uniprot/")
    }
}
"""
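
The same matching can be tried outside SPARQL. The sketch below assumes Python's `re` module, which should behave like the SPARQL `regex` function for a pattern this simple, and escapes the dots so that they match literally. The helper name `uniprot_accession` and the sample IRIs are made up for this illustration. It also takes the last path segment as the accession.

```python
import re

# Accept the two UniProt IRI patterns listed above
UNIPROT_IRI = re.compile(r"^(https://www|http://purl)\.uniprot\.org/uniprot/")

def uniprot_accession(iri):
    """Return the last path segment of a UniProt IRI, or None if the IRI does not match."""
    if not UNIPROT_IRI.match(iri):
        return None
    return iri.rsplit("/", 1)[1]

# Placeholder IRIs, for illustration only
from_www = uniprot_accession("https://www.uniprot.org/uniprot/P12345")
from_purl = uniprot_accession("http://purl.uniprot.org/uniprot/P12345")
from_other = uniprot_accession("https://example.org/uniprot/P12345")
```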

### Methods

#### Create Knowledge Graph Protein Entity 

The `createProteinEntity` method uses the queries above to extract the content from the scraped data and transform it into data that uses the common identifier scheme.

In [15]:
# Inputs:
# - g: scraped data
# - ds_g: ContextGraph in the KG where the Protein is to be inserted
# - protein: IRI of the protein concept
# - uniprot: Equivalent UniProt IRI for the protein
# - accession: The UniProt accession number
def createProteinEntity(g, ds_g, protein, uniprot, accession):
    # Parameterise the protein query with the accession; the template has no
    # ${proteinIRI} placeholder, so that argument is currently ignored
    query = proteinQuery.substitute(proteinIRI=protein,bsAccession=accession)
    logging.debug('Query: %s' % query)
    # Retrieve protein entity
    result = g.query(query)
    logging.debug("\tconvert query has %s statements." % len(result))
    # Add protein entity to Dataset
    ds_g += result
    # Process SequenceAnnotations, their property values and ranges
    logging.debug('SequenceAnnotation Query: %s' % sequenceAnnotationsQuery)
    ds_g += g.query(sequenceAnnotationsQuery)
    logging.debug('PropertyValue Query: %s' % propertyValueQuery)
    ds_g += g.query(propertyValueQuery)
    logging.debug('SequenceRange Query: %s' % sequenceRangeQuery)
    ds_g += g.query(sequenceRangeQuery)

#### Process Source Data Files

The following method processes the files in the given directory and calls the methods to do the data extraction.

In [16]:
def processDataFiles(idpKG, directoryLocation):
    processed = 0
    for file in glob(os.path.join(directoryLocation, "*.nq")):
        logging.info("\tProcessing file: %s" % file)
        g = ConjunctiveGraph()
        g.parse(file, format="nquads")
        logging.info("\tSource has %s statements." % len(g))
        # Ignore files with 4 or fewer statements; these correspond to empty scrapes
        if len(g) <= 4: 
            logging.info("\tSkip processing of file: %s – it contains no data." % file)
            continue

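        # Retrieve provenance of crawl and add to default graph
        result = g.query(provenanceQuery)
        context = None
        # Insert provenance into default context
        for s, p, o in result:
            idpKG.add((s, p, o))
            # Store context of crawl
            context = s
        # Without provenance there is no named graph to attach the data to
        if context is None:
            logging.warning("\tSkip processing of file: %s - no provenance found." % file)
            continue
        logging.info('Context %s' % context)
        # Create context in Dataset for the crawled entity
        ds_g = idpKG.graph(URIRef(context))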

        # Add Dataset information to the generated KG
        logging.debug('Dataset Query: %s' % datasetQuery)
        ds_g += g.query(datasetQuery)
        logging.debug('Dataset query added %d statements' % len(ds_g))

        # Add DataCatalog information to the generated KG
        logging.debug('DataCatalog Query: %s' % dataCatalogQuery)
        ds_g += g.query(dataCatalogQuery)
        logging.debug('Number of statements after DataCatalog query %d' % len(ds_g))
        logging.info("\tIDPKG has %s statements." % len(idpKG))

        # Extract data source and UniProt IRIs
        results = g.query(idQuery)
        logging.info("\tID query result has %s statements." % len(results))
        # Convert to IDP KG model
        for result in results:
            proteinIRI = result['proteinIRI']
            uniprotIRI = result['uniprot']
            logging.debug("\tProtein: %s\n\tUniProt: %s" % (proteinIRI, uniprotIRI))
            
            # Extract UniProt accession to use as an identifier in the idpc: namespace
            uniprotAccession = uniprotIRI[uniprotIRI.rindex('/')+1:]
            logging.info('Accession: %s' % uniprotAccession)
            
            # Create entity for named graph KG approach
            createProteinEntity(g, ds_g, proteinIRI, uniprotIRI, uniprotAccession)
            logging.info("\tIDPKG has %s statements." % len(idpKG))
        processed += 1
    return processed
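
The file discovery and the "skip empty scrapes" rule can be tried without RDFLib. The sketch below is an approximation under stated assumptions: it counts non-blank lines instead of parsed statements (N-Quads holds one statement per line, but duplicate lines would be counted twice), and the helper names `count_statements` and `files_with_data` are invented for this illustration.

```python
import os
import tempfile
from glob import glob

def count_statements(path):
    """Count non-blank, non-comment lines as an approximation of the statement count."""
    with open(path, encoding="utf-8") as handle:
        return sum(1 for line in handle
                   if line.strip() and not line.lstrip().startswith("#"))

def files_with_data(directory, minimum=5):
    """Return the .nq files in directory that hold at least `minimum` statements."""
    pattern = os.path.join(directory, "*.nq")
    return sorted(p for p in glob(pattern) if count_statements(p) >= minimum)

# Build two throwaway files: one near-empty scrape and one with data
with tempfile.TemporaryDirectory() as tmp:
    quad = '<https://example.org/s{0}> <https://example.org/p> "o" <https://example.org/g> .\n'
    with open(os.path.join(tmp, "empty.nq"), "w", encoding="utf-8") as f:
        f.writelines(quad.format(i) for i in range(2))
    with open(os.path.join(tmp, "full.nq"), "w", encoding="utf-8") as f:
        f.writelines(quad.format(i) for i in range(6))
    kept = [os.path.basename(p) for p in files_with_data(tmp)]
```

Using `os.path.join` for the pattern means the directory argument works with or without a trailing separator.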

#### Process Sample Data
The following method processes the sample data files provided in the GitHub repository.

In [17]:
def processSampleData(idpKG):
    totalProcessed = 0

    # Process the DisProt, MobiDB and PED sample files in turn
    sources = [
        ("DisProt", "../scraped-data/sample/disprot/"),
        ("MobiDB", "../scraped-data/sample/mobidb/"),
        ("PED", "../scraped-data/sample/ped/"),
    ]
    for name, directory in sources:
        print("Processing %s..." % name, end='')
        logging.info("Processing %s..." % name)
        numberOfFiles = processDataFiles(idpKG, directory)
        print("%d files processed" % numberOfFiles)
        logging.info("%d files processed" % numberOfFiles)
        totalProcessed += numberOfFiles

    return totalProcessed

#### Control Processing

This method controls the processing based on the scraped dataset chosen from the dropdown.

In [18]:
def controlProcessing(selected, output):
    validate=False
    print('Processing %r' % selected)
    logging.info('Processing %r' % selected)
    if selected=='../scraped-data/sample/':
        # Process the sample data
        totalProcessed = processSampleData(idpKG)
        validate=True
    else:
        # Process Dump
        totalProcessed = processDataFiles(idpKG, selected)

    # Output IDP KG
    idpKG.serialize('IDPKG.nq', format='nquads')
    idpKG.serialize('IDPKG.jsonld', format='json-ld')

    print('Processed %d files' % totalProcessed)
    logging.info('Processed %d files' % totalProcessed)

    # Validation only for the sample dataset
    if validate:
        assert (totalProcessed == 9), "Expected 9 data files but processed %r" % totalProcessed

    numberOfContexts = sum(1 for _ in idpKG.contexts())
    print('IDP KG has %d statements.' % len(idpKG))
    logging.info('IDP KG has %d statements.' % len(idpKG))
    print('IDP KG has %d contexts.' % numberOfContexts)
    logging.info('IDP KG has %d contexts.' % numberOfContexts)
    if validate:
        # Context created for each file, plus the default context
        assert (numberOfContexts == totalProcessed + 1), \
            "Expect the number of contexts (%d) to be one more than the number of files containing data (%d)" % \
            (numberOfContexts, totalProcessed)
    logging.debug('IDP contexts:')
    for c in idpKG.contexts():
        logging.debug('\t%s' % c)
        logging.debug('\tNumber of statements %d' % len(c))
    print('\nIDP ETL process finished successfully!')
    logging.info('IDP ETL process finished successfully!')

### Main Method

Processes the N-Quads data files and converts them into the knowledge graph.

In [19]:
# Main control flow of the program

# Instantiate Knowledge Graphs
idpKG = Dataset()

directories = glob("../scraped-data/*" + os.path.sep)
selected = widgets.Dropdown(
    options=directories,
    value=None, #'../scraped-data/sample/',
    description='Datasets:',
    disabled=False,
)

output = widgets.Output()

#On change event
def on_change(change):
    with output:
        clear_output(True)
        controlProcessing(change.new, output)    
selected.observe(on_change, names='value')

print("Please select a scraped dataset to process. We suggest starting with the '../scraped-data/sample/' dataset.")
display(selected, output)
