# Converting Scraped Data into the IDP Knowledge Graph

__Authors:__
Alasdair J G Gray ([ORCID:0000-0002-5711-4872](http://orcid.org/0000-0002-5711-4872)), _Heriot-Watt University, Edinburgh, UK_

Petros Papadopoulos ([ORCID:0000-0002-8110-7576](https://orcid.org/0000-0002-8110-7576)), _Heriot-Watt University, Edinburgh, UK_

Ivan Mičetić ([ORCID:0000-0003-1691-8425](https://orcid.org/0000-0003-1691-8425)), _University of Padua, Italy_

András Hatos ([ORCID:0000-0001-9224-9820](https://orcid.org/0000-0001-9224-9820)), _University of Padua, Italy_

__License:__ Apache 2.0

__Acknowledgements:__ This notebook builds upon the work conducted during the Virtual BioHackathon-Europe 2020 reported in [BioHackrXiv](https://biohackrxiv.org/v3jct/).

## Introduction

IDPCentral is a proposed central registry of proteins that are known to be intrinsically disordered (IDPs).

We aim to populate the content of the registry with Bioschemas markup that has been scraped using the BMUSE tool.

This notebook goes through the steps of converting the scraped content into a Knowledge Graph of IDP data that can be used as a source of data for the IDPCentral registry.

### Input

Scraped data from the following data sources:
- [DisProt](https://www.disprot.org/)
- [MobiDB](https://mobidb.bio.unipd.it/)
- [Protein Ensemble Database](https://proteinensemble.org/) (PED)

The data files can be found in this [GitHub directory](https://github.com/elixir-europe/BioHackathon-projects-2020/tree/master/projects/24/IDPCentral/scraped-data). The code in this notebook uses a relative link to retrieve the data files, i.e. `../scraped-data`.

### Output

The generated knowledge graph is written to two files in the same directory as this notebook:
- `IDPKG.jsonld`: JSON-LD serialisation of all named graphs
- `IDPKG.nq`: N-Quads serialisation of all named graphs

## Code

The following code converts the data scraped using BMUSE into the desired knowledge graph using the RDFLib Python library and its ability to run SPARQL queries over its internal data model.

The generated knowledge graph uses named graphs to track the provenance of individual statements, i.e. where each statement came from.

### Library Imports

In [1]:
# Import and configure logging library
from datetime import datetime
import logging
logging.basicConfig(
    filename='idpETL.log', 
    filemode='w', 
    format='%(levelname)s:%(message)s', 
    level=logging.INFO)
logging.info('Starting processing at %s' % datetime.now().time())

In [2]:
# Imports from RDFlib
from rdflib import ConjunctiveGraph, Dataset, Graph, RDF, URIRef

In [3]:
# Import template library for templating queries
from string import Template

In [4]:
# Import functions for interacting with file directory
from glob import glob

### Template Queries

#### Provenance Query

This query extracts metadata about where the data originated and when it was scraped.

In [5]:
# Query to extract the graph and its metadata
provenanceQuery = """
PREFIX pav: <http://purl.org/pav/>
PREFIX prov: <http://www.w3.org/ns/prov#>
CONSTRUCT {
    ?g pav:retrievedFrom ?source ;
        pav:retrievedOn ?date .
}
WHERE {
    ?g pav:retrievedFrom ?source ;
        pav:retrievedOn ?date .
}
"""

#### Protein Information Query

The following query extracts protein data from the scraped named graph and unifies the identifier to the UniProt accession number in the Bioschemas namespace.

The query uses `OPTIONAL` clauses throughout because not all sources comply with the minimal properties of the Bioschemas Protein Profile.

In [6]:
# Templated query for creating the direct properties for a protein entity
proteinQuery = Template("""
# Query to convert Protein scraped data to a merged named graph
# Defensive query: assumes that data does not conform to Protein profile

PREFIX bs: <https://bioschemas.org/entity/>
PREFIX pav: <http://purl.org/pav/>
PREFIX schema: <https://schema.org/>

CONSTRUCT {
    bs:${bsAccession} a schema:Protein ;
        schema:identifier ?identifier ;
        schema:name ?name ;
        schema:associatedDisease ?associatedDisease ;
        schema:description ?description ;
        schema:hasSequenceAnnotation ?annotation ;
        schema:isEncodedByBioChemEntity ?encodedBy ;
        schema:taxonomicRange ?taxonomicRange ;
        schema:url ?url ;
        schema:alternateName ?alternateName ;
        schema:bioChemInteraction ?bioChemInteraction ;
        schema:bioChemSimilarity ?bioChemSimilarity ;
        schema:hasBioChemEntityPart ?bioChemEntity ;
        schema:hasBioPolymerSequence ?sequence ;
        schema:hasMolecularFunction ?molFunction ;
        schema:hasRepresentation ?representation ;
        schema:image ?image ;
        schema:isInvolvedInBiologicalProcess ?process ;
        schema:isLocatedInSubcellularLocation ?cellularLocation ;
        schema:isPartOfBioChemEntity ?parentEntity ;
        schema:sameAs ?sameAs , ?s .
}
WHERE {
    GRAPH ?g {
# Bioschemas Minimal Properties
        ?s a schema:Protein .
        OPTIONAL {?s schema:identifier ?identifier }
        OPTIONAL {?s schema:name ?name }
## Bioschemas Recommended properties
        OPTIONAL {?s schema:associatedDisease ?associatedDisease}
        OPTIONAL {?s schema:description ?description}
        OPTIONAL {?s schema:hasSequenceAnnotation ?annotation }
        OPTIONAL {?s schema:isEncodedByBioChemEntity ?encodedBy}
        OPTIONAL {?s schema:taxonomicRange ?taxonomicRange }
        OPTIONAL {?s schema:url ?url}
## Bioschemas Optional properties
        OPTIONAL {?s schema:alternateName ?alternateName}
        OPTIONAL {?s schema:bioChemInteraction ?bioChemInteraction}
        OPTIONAL {?s schema:bioChemSimilarity ?bioChemSimilarity}
        OPTIONAL {?s schema:hasBioChemEntityPart ?bioChemEntity}
        OPTIONAL {?s schema:hasBioPolymerSequence ?sequence}
        OPTIONAL {?s schema:hasMolecularFunction ?molFunction}
        OPTIONAL {?s schema:hasRepresentation ?representation }
        OPTIONAL {?s schema:image ?image}
        OPTIONAL {?s schema:isInvolvedInBiologicalProcess ?process}
        OPTIONAL {?s schema:isLocatedInSubcellularLocation ?cellularLocation}
        OPTIONAL {?s schema:isPartOfBioChemEntity ?parentEntity}
        OPTIONAL {?s schema:sameAs ?sameAs }
    }
}
""")
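
To illustrate how the `${bsAccession}` placeholder is filled, here is a small sketch using a cut-down stand-in for the templated query. `miniQuery` is a hypothetical name, and the accession is just an example value.

```python
from string import Template

# A cut-down stand-in for the templated query; not the full proteinQuery
miniQuery = Template("""
PREFIX bs: <https://bioschemas.org/entity/>
PREFIX schema: <https://schema.org/>
CONSTRUCT { bs:${bsAccession} a schema:Protein }
WHERE { ?s a schema:Protein }
""")

# substitute() fills the placeholder; unused keyword arguments are ignored
filled = miniQuery.substitute(bsAccession="Q12959", proteinIRI="unused")

# A missing placeholder value raises KeyError, which catches typos early
try:
    miniQuery.substitute()
    missingRaises = False
except KeyError:
    missingRaises = True
```

Because `substitute()` ignores extra keyword arguments, passing a value for which the template has no placeholder has no effect on the generated query.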

#### SequenceAnnotation Query

Query to extract sequence annotations.

In [7]:
sequenceAnnotationsQuery = """
PREFIX schema: <https://schema.org/>
CONSTRUCT {
  ?s a schema:SequenceAnnotation ;
        schema:additionalProperty ?addProp ;
        schema:citation ?citation ;
        schema:creationMethod ?method ;
        schema:description ?description ;
        schema:editor ?editor ;
        schema:isPartOfBioChemEntity ?bioChemEntity ;
        schema:sequenceLocation ?seqLoc .
}
WHERE {
  GRAPH ?g {
    ?s a schema:SequenceAnnotation .
    OPTIONAL {?s schema:additionalProperty ?addProp }
    OPTIONAL {?s schema:citation ?citation }
    OPTIONAL {?s schema:creationMethod ?method }
    OPTIONAL {?s schema:description ?description }
    OPTIONAL {?s schema:editor ?editor }
    OPTIONAL {?s schema:isPartOfBioChemEntity ?bioChemEntity }
    OPTIONAL {?s schema:sequenceLocation ?seqLoc }
  }
}
"""

#### PropertyValue Query

Query to extract PropertyValue data.

In [8]:
propertyValueQuery = """
PREFIX schema: <https://schema.org/>
CONSTRUCT {
    ?s a schema:PropertyValue ;
        schema:name ?name ;
        schema:value ?value .
}
WHERE {
    GRAPH ?g {
        ?s a schema:PropertyValue .
        OPTIONAL {?s schema:name ?name }
        OPTIONAL {?s schema:value ?value }
    }
}
"""

#### SequenceRange Query

Query to extract SequenceRange data.

In [9]:
sequenceRangeQuery = """
PREFIX schema: <https://schema.org/>
CONSTRUCT {
    ?s a schema:SequenceRange ;
        schema:rangeStart ?start ;
        schema:rangeEnd ?end .
}
WHERE {
    GRAPH ?g {
        ?s a schema:SequenceRange .
        OPTIONAL {?s schema:rangeStart ?start }
        OPTIONAL {?s schema:rangeEnd ?end}
    }
}
"""

#### UniProt ID Query

The following query extracts the UniProt IRI, which the source data declares using `schema:sameAs`. Note that different sources use alternative patterns for the UniProt IRI; the `FILTER` clause matches the following patterns:
- `https://www.uniprot.org/uniprot/`
- `http://purl.uniprot.org/uniprot/`

In [10]:
# Query to extract UniProt IRI
idQuery = """
PREFIX schema: <https://schema.org/>
SELECT ?proteinIRI ?uniprot
WHERE {
    GRAPH ?g {
        ?proteinIRI a schema:Protein ;
            schema:sameAs ?uniprot .
        FILTER regex(str(?uniprot), "^(https://www|http://purl)[.]uniprot[.]org/uniprot/")
    }
}
"""
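
The effect of the filter can be checked outside SPARQL with Python's `re` module. This is only a sketch: the candidate IRIs are illustrative values, and the third one is included purely as a non-matching example.

```python
import re

# Same alternation as the SPARQL FILTER, with the dots escaped so that
# they match a literal "." rather than any character
uniprotPattern = re.compile(r"^(https://www|http://purl)\.uniprot\.org/uniprot/")

# Illustrative IRIs; the accession is only an example value
candidates = [
    "https://www.uniprot.org/uniprot/Q12959",
    "http://purl.uniprot.org/uniprot/Q12959",
    "https://identifiers.org/uniprot:Q12959",
]
matches = [iri for iri in candidates if uniprotPattern.search(iri)]
```

Only the first two IRIs are kept, so a source that links to UniProt through any other IRI pattern would yield no results from the ID query.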

### Methods

#### Create Knowledge Graph Protein Entity 

The `createKGEntity` method uses the queries above to extract the content from the scraped data and transform it to use the common identifier scheme.

In [11]:
def createKGEntity(g, ds, protein, uniprot, accession):
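    # Retrieve provenance of crawl and add to default graph
    result = g.query(provenanceQuery)
    context = None
    # Insert provenance into default context
    for s, p, o in result:
        ds.add((s, p, o))
        # Store context of crawl (the IRI of the source named graph)
        context = s
    if context is None:
        # Without provenance there is no named graph to write into
        logging.warning('No provenance found for %s; skipping' % protein)
        return
    logging.debug('Context %s' % (context))
    # Parameterise the query with the accession; proteinIRI is passed too,
    # but the template only defines the ${bsAccession} placeholder
    query = proteinQuery.substitute(proteinIRI=protein,bsAccession=accession)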
    logging.debug('Query: %s' % query)
    # Create context in Dataset for the crawled entity
    ds_g = ds.graph(URIRef(context))
    # Retrieve crawled entity
    result = g.query(query)
    logging.debug("\tconvert query has %s statements." % len(result))
    # Add crawled entity to Dataset
    ds_g += result
    logging.debug('SequenceAnnotation Query: %s' % sequenceAnnotationsQuery)
    ds_g += g.query(sequenceAnnotationsQuery)
    logging.debug('PropertyValue Query: %s' % propertyValueQuery)
    ds_g += g.query(propertyValueQuery)
    logging.debug('SequenceRange Query: %s' % sequenceRangeQuery)
    ds_g += g.query(sequenceRangeQuery)

#### Process Source Data Files

The following method parses each N-Quads file in the given directory and calls `createKGEntity` for every protein that has a UniProt IRI.

In [12]:
def processDataFiles(idpKG, directoryLocation):
    processed = 0
    for file in glob(directoryLocation + "*.nq"):
        logging.info("\tProcessing file: %s" % file)
        g = ConjunctiveGraph()
        g.parse(file, format="nquads")
        logging.info("\tSource has %s statements." % len(g))
        # Extract data source and UniProt IRIs
        results = g.query(idQuery)
        logging.info("\tID query result has %s statements." % len(results))
        # Convert to IDP KG model
        for result in results:
            proteinIRI = result['proteinIRI']
            uniprotIRI = result['uniprot']
            logging.debug("\tProtein: %s\n\tUniProt: %s" % (proteinIRI, uniprotIRI))
            
            # Extract UniProt accession to use as an identifier in the Bioschemas namespace
            uniprotAccession = uniprotIRI[uniprotIRI.rindex('/')+1:]
            logging.info('Accession: %s' % uniprotAccession)
            
            # Create entity for named graph KG approach
            createKGEntity(g, idpKG, proteinIRI, uniprotIRI, uniprotAccession)
            logging.info("\tIDPKG has %s statements." % len(idpKG))
        processed += 1
    return processed
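
The accession is taken as the text after the last `/` of the UniProt IRI. The `rindex` call used above raises `ValueError` when the string contains no `/`. That cannot happen for IRIs that passed the `FILTER`, but a more tolerant variant is sketched below. `accessionFromIRI` is a hypothetical helper, not part of the notebook, and the accession is an example value.

```python
def accessionFromIRI(iri):
    """Return the text after the final '/' of an IRI (hypothetical helper)."""
    # rpartition never raises: with no '/', the whole string is returned
    return str(iri).rpartition('/')[2]

accession = accessionFromIRI("https://www.uniprot.org/uniprot/Q12959")

# The accession becomes the local name in the Bioschemas entity namespace
bsIRI = "https://bioschemas.org/entity/" + accession
```

Both UniProt IRI patterns end in the accession, so the same extraction gives a single identifier for a protein regardless of which source described it.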

### Main Method

Processes the N-Quads data files and converts them into the knowledge graph.

In [13]:
# Main control flow of the program

# Instantiate Knowledge Graphs
idpKG = Dataset()
totalProcessed = 0

# Process DisProt files
print("Processing DisProt...", end='')
numberOfFiles = processDataFiles(idpKG, "../scraped-data/disprot/")
print("%d files processed" % numberOfFiles)
totalProcessed += numberOfFiles

# Process MobiDB files
print("Processing MobiDB...", end='')
numberOfFiles = processDataFiles(idpKG, "../scraped-data/mobidb/")
print("%d files processed" % numberOfFiles)
totalProcessed += numberOfFiles

# Process PED files
print("Processing PED...", end='')
numberOfFiles = processDataFiles(idpKG, "../scraped-data/ped/")
print("%d files processed" % numberOfFiles)
totalProcessed += numberOfFiles

# Output IDP KG
idpKG.serialize('IDPKG.nq', format='nquads')
idpKG.serialize('IDPKG.jsonld', format='json-ld')

logging.info('Processed %d files' % totalProcessed)

assert (totalProcessed == 8), "Expected 8 data files but processed %r" % totalProcessed

numberOfContexts = sum(1 for _ in idpKG.contexts())
print('IDP KG has %d statements.' % len(idpKG))
print('IDP KG has %d contexts.' % numberOfContexts)
# Context created for each file, plus the default context
assert (numberOfContexts == totalProcessed + 1), \
    "Expect the number of contexts (%d) to be one more than the number of files containing data (%d)" % \
    (numberOfContexts, totalProcessed)
print('IDP contexts:', '')
for c in idpKG.contexts():
    print('\t%s' % c)
    print('\tNumber of statements %d' % len(c))
print('\nIDP ETL process finished successfully!')

Processing DisProt...3 files processed
Processing MobiDB...2 files processed
Processing PED...3 files processed
IDP KG has 479 statements.
IDP KG has 9 contexts.
IDP contexts: 
	<https://bioschemas.org/crawl/v1/disprot/DP00005/20210722/51> a rdfg:Graph;rdflib:storage [a rdflib:Store;rdfs:label 'Memory'].
	Number of statements 167
	<https://bioschemas.org/crawl/v1/proteinensemble/PED00001/20210722/54> a rdfg:Graph;rdflib:storage [a rdflib:Store;rdfs:label 'Memory'].
	Number of statements 60
	<https://bioschemas.org/crawl/v1/proteinensemble/PED00148/20210722/55> a rdfg:Graph;rdflib:storage [a rdflib:Store;rdfs:label 'Memory'].
	Number of statements 36
	<urn:x-rdflib:default> a rdfg:Graph;rdflib:storage [a rdflib:Store;rdfs:label 'Memory'].
	Number of statements 16
	<https://bioschemas.org/crawl/v1/mobidb/Q12959/20210722/53> a rdfg:Graph;rdflib:storage [a rdflib:Store;rdfs:label 'Memory'].
	Number of statements 18
	<https://bioschemas.org/crawl/v1/proteinensemble/PED00174/20210722/56> a r