# Analysis of the IDP Knowledge Graph

__Authors:__  
Alasdair J G Gray ([ORCID:0000-0002-5711-4872](http://orcid.org/0000-0002-5711-4872)), _Heriot-Watt University, Edinburgh, UK_

Petros Papadopoulos ([ORCID:0000-0002-8110-7576](https://orcid.org/0000-0002-8110-7576)), _Heriot-Watt University, Edinburgh, UK_

Ivan Mičetić ([ORCID:0000-0003-1691-8425](https://orcid.org/0000-0003-1691-8425)), _University of Padua, Italy_

Andras Hatos ([ORCID:0000-0001-9224-9820](https://orcid.org/0000-0001-9224-9820)), _University of Padua, Italy_

Imran Asif ([ORCID:0000-0002-1144-6265](https://orcid.org/0000-0002-1144-6265)), _Heriot-Watt University, Edinburgh, UK_


__License:__ Apache 2.0

__Acknowledgements:__ This notebook was created during the Virtual BioHackathon-Europe 2020.

## Introduction

This notebook contains SPARQL queries that analyse the Intrinsically Disordered Protein (IDP) Knowledge Graph. The IDP knowledge graph was constructed from Bioschemas markup embedded in DisProt, MobiDB, and the Protein Ensemble Database (PED). The markup was harvested using the Bioschemas Markup Scraper and Extractor and converted into a knowledge graph using the process in this [notebook](https://github.com/elixir-europe/BioHackathon-projects-2020/blob/master/projects/24/IDPCentral/notebooks/ETLProcess.ipynb).

### Library Imports

In [1]:
# Import and configure logging library
from datetime import datetime
import logging
logging.basicConfig(
    filename='idpQuery.log', 
    filemode='w', 
    format='%(levelname)s:%(message)s', 
    level=logging.INFO)
logging.info('Starting processing at %s', datetime.now().time())

In [2]:
# Imports from RDFlib
from rdflib import ConjunctiveGraph

### Result Display Function

The following function takes the results of a SPARQL `SELECT` query and displays them as an HTML table for human viewing.

In [3]:
def displayResults(queryResult):
    """Display the results of a SPARQL SELECT query as an HTML table."""
    # display and HTML are imported from IPython.display;
    # importing display from IPython.core.display is deprecated
    from IPython.display import display, HTML
    HTMLResult = '<p>Number of results: ' + str(len(queryResult)) + '</p>'
    HTMLResult = HTMLResult + '<table><tr style="color:white;background-color:#43BFC7;font-weight:bold">'
    # Build the header row from the query variable names
    for varName in queryResult.vars:
        HTMLResult = HTMLResult + '<td>' + str(varName) + '</td>'
    HTMLResult = HTMLResult + '</tr>'

    # Build one table row per result row
    for row in queryResult:
        HTMLResult = HTMLResult + '<tr>'
        for column in row:
            # Unbound variables are returned as None; show them (and empty values) as N/A
            if column is not None and str(column) != "":
                HTMLResult = HTMLResult + '<td>' + str(column) + '</td>'
            else:
                HTMLResult = HTMLResult + '<td>N/A</td>'
        HTMLResult = HTMLResult + '</tr>'
    HTMLResult = HTMLResult + '</table>'
    display(HTML(HTMLResult))
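
For use outside a notebook, or for testing, the table-building logic can be separated from the call to `display`. The following is a minimal sketch and not part of the original notebook: `buildResultsTable` is a hypothetical helper that takes the variable names and rows as plain Python sequences, escapes each cell with the standard library's `html.escape`, and returns the HTML as a string.

```python
from html import escape


def buildResultsTable(varNames, rows):
    """Return an HTML table for the given header names and rows (hypothetical helper)."""
    parts = ['<p>Number of results: %d</p>' % len(rows)]
    parts.append('<table><tr style="color:white;background-color:#43BFC7;font-weight:bold">')
    # Header row built from the variable names
    parts.extend('<td>%s</td>' % escape(str(name)) for name in varNames)
    parts.append('</tr>')
    # One table row per result; missing values (None or "") are shown as N/A
    for row in rows:
        parts.append('<tr>')
        for column in row:
            text = 'N/A' if column is None or column == '' else escape(str(column))
            parts.append('<td>%s</td>' % text)
        parts.append('</tr>')
    parts.append('</table>')
    return ''.join(parts)


# Example call with one row whose second value is missing
table = buildResultsTable(
    ['protein', 'name'],
    [('https://bioschemas.org/entity/P03265', None)],
)
```

In a notebook, the returned string could then be passed to `display(HTML(table))`, as `displayResults` does.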

## Loading IDP-KG

The data is read from an N-Quads file (`IDPKG.nq`). The data is expected to be spread over multiple named graphs, one per page from which data was extracted, with provenance data about those graphs in the default graph.

In [4]:
idpKG = ConjunctiveGraph()
idpKG.parse("IDPKG.nq", format="nquads")
logging.info("\tIDP-KG has %s statements.", len(idpKG))
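
The layout described above can be pictured with plain Python tuples. This sketch is illustrative only and does not use rdflib: each quad is a `(subject, predicate, object, graph)` tuple, the default graph is represented by `None`, and the graph name `ex:graph1` is a made-up placeholder rather than an IRI from the IDP-KG.

```python
from collections import defaultdict

# Hypothetical quads: (subject, predicate, object, graph); graph None = default graph
quads = [
    ("https://bioschemas.org/entity/P03265", "rdf:type", "schema:Protein", "ex:graph1"),
    ("https://bioschemas.org/entity/P03265", "schema:name", "DNA-binding protein", "ex:graph1"),
    # Provenance about the named graph itself sits in the default graph
    ("ex:graph1", "pav:retrievedFrom", "https://disprot.org/DP00003", None),
    ("ex:graph1", "void:inDataset", "https://disprot.org/#2020-12", None),
]

# Group the triples by the graph they belong to
byGraph = defaultdict(list)
for s, p, o, g in quads:
    byGraph[g].append((s, p, o))

namedGraphs = [g for g in byGraph if g is not None]
provenance = byGraph[None]
```

The queries in this notebook follow the same split: patterns inside `GRAPH ?g { ... }` match the extracted data, while `?g pav:retrievedFrom ?source` and `?g void:inDataset ?dataset` describe the graph itself.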

## Knowledge Graph Statistics

This section reports various statistics about the IDP-KG. The choice of statistics was inspired by the [HCLS Dataset Description Community Profile](https://www.w3.org/TR/hcls-dataset/#s6_6).

### Number of Triples

In [5]:
displayResults(idpKG.query("""
SELECT (COUNT(*) AS ?triples) 
WHERE {
    GRAPH ?g {
        ?s ?p ?o 
    }
}
"""))

| triples |
|---|
| 650 |


### Number of Typed Entities

Note that we use the `DISTINCT` keyword in the query since the same entity can appear in multiple named graphs.

In [6]:
displayResults(idpKG.query("""
SELECT (COUNT(DISTINCT ?s) AS ?entities) 
WHERE { 
    GRAPH ?g { 
        ?s a [] 
    }
}
"""))

| entities |
|---|
| 147 |
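
Why `DISTINCT` matters for this count can be seen in a small plain-Python sketch. The graph names (`ex:g1`, `ex:g2`) are invented for illustration; only the idea that one entity may be typed in two named graphs is taken from the text.

```python
# Hypothetical typing statements: (subject, class, graph)
typed = [
    ("https://bioschemas.org/entity/P03265", "schema:Protein", "ex:g1"),
    ("https://bioschemas.org/entity/P03265", "schema:Protein", "ex:g2"),  # same entity, second graph
    ("https://bioschemas.org/entity/Q12959", "schema:Protein", "ex:g2"),
]

# COUNT(?s): one per matching statement, so the shared entity is counted twice
withoutDistinct = len([s for s, _, _ in typed])

# COUNT(DISTINCT ?s): each entity is counted once
withDistinct = len({s for s, _, _ in typed})
```

Here `withoutDistinct` is 3 and `withDistinct` is 2.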


### Number of Unique Subjects

In [7]:
displayResults(idpKG.query("""
SELECT (COUNT(DISTINCT ?s) AS ?subjects) 
WHERE { 
    GRAPH ?g { 
        ?s ?p ?o
    }
}
"""))

| subjects |
|---|
| 155 |


### Number of Unique Properties

In [8]:
displayResults(idpKG.query("""
SELECT (COUNT(DISTINCT ?p) AS ?properties) 
WHERE { 
    GRAPH ?g { 
        ?s ?p ?o 
    }
}
"""))

| properties |
|---|
| 20 |


### Number of Unique Objects

In [9]:
displayResults(idpKG.query("""
SELECT (COUNT(DISTINCT ?o) AS ?objects) 
WHERE { 
    GRAPH ?g { 
        ?s ?p ?o
    }
    FILTER(!isLiteral(?o))
}
"""))

| objects |
|---|
| 174 |


### Number of Unique Classes

In [10]:
displayResults(idpKG.query("""
SELECT (COUNT(DISTINCT ?o) AS ?classes) 
WHERE { 
    GRAPH ?g { 
        ?s a ?o 
    }
}
"""))

| classes |
|---|
| 5 |


### Number of Unique Literals

In [11]:
displayResults(idpKG.query("""
SELECT (COUNT(DISTINCT ?o) AS ?objects) 
WHERE { 
    GRAPH ?g { 
        ?s ?p ?o 
    }
    FILTER(isLiteral(?o))
}
"""))

| objects |
|---|
| 115 |


### Number of Graphs

In [12]:
displayResults(idpKG.query("""
SELECT (COUNT(DISTINCT ?g) AS ?graphs) 
WHERE { 
  GRAPH ?g 
    { ?s ?p ?o }
}
"""))

| graphs |
|---|
| 9 |


### Instances per Class

In [13]:
displayResults(idpKG.query("""
PREFIX schema: <https://schema.org/>
PREFIX pav: <http://purl.org/pav/>
SELECT ?Class (COUNT(DISTINCT ?s) AS ?distinctInstances) 
WHERE {
    GRAPH ?g {
        ?s a ?Class
    }
} 
GROUP BY ?Class
ORDER BY ?Class
"""))

| Class | distinctInstances |
|---|---|
| https://schema.org/DefinedTerm | 27 |
| https://schema.org/PropertyValue | 54 |
| https://schema.org/Protein | 8 |
| https://schema.org/SequenceAnnotation | 29 |
| https://schema.org/SequenceRange | 29 |


### Properties and their Occurrence

In [14]:
displayResults(idpKG.query("""
PREFIX schema: <https://schema.org/>
PREFIX pav: <http://purl.org/pav/>
SELECT ?p (COUNT(?p) AS ?triples) 
WHERE {
    GRAPH ?g {
        ?s ?p ?o
    }
} 
GROUP BY ?p
ORDER BY ?p
"""))

| p | triples |
|---|---|
| http://purl.org/pav/createdWith | 8 |
| http://purl.org/pav/retrievedFrom | 8 |
| http://purl.org/pav/retrievedOn | 8 |
| http://rdfs.org/ns/void#inDataset | 8 |
| http://www.w3.org/1999/02/22-rdf-syntax-ns#type | 156 |
| http://www.w3.org/2002/07/owl#sameAs | 11 |
| https://schema.org/additionalProperty | 54 |
| https://schema.org/description | 3 |
| https://schema.org/hasBioPolymerSequence | 13 |


### Property, number of unique typed subjects, and triples

In [15]:
displayResults(idpKG.query("""
PREFIX schema: <https://schema.org/>
PREFIX pav: <http://purl.org/pav/>
SELECT (COUNT(DISTINCT ?s) AS ?scount) ?stype ?p (COUNT(?p) AS ?triples) 
WHERE {
    GRAPH ?g {
        ?s ?p ?o .
        ?s a ?stype 
    }
} 
GROUP BY ?p ?stype
ORDER BY ?stype ?p
"""))

| scount | stype | p | triples |
|---|---|---|---|
| 27 | https://schema.org/DefinedTerm | http://www.w3.org/1999/02/22-rdf-syntax-ns#type | 35 |
| 27 | https://schema.org/DefinedTerm | https://schema.org/inDefinedTermSet | 35 |
| 27 | https://schema.org/DefinedTerm | https://schema.org/name | 35 |
| 27 | https://schema.org/DefinedTerm | https://schema.org/termCode | 35 |
| 54 | https://schema.org/PropertyValue | http://www.w3.org/1999/02/22-rdf-syntax-ns#type | 54 |
| 54 | https://schema.org/PropertyValue | https://schema.org/name | 54 |
| 54 | https://schema.org/PropertyValue | https://schema.org/value | 54 |
| 8 | https://schema.org/Protein | http://www.w3.org/1999/02/22-rdf-syntax-ns#type | 9 |
| 8 | https://schema.org/Protein | http://www.w3.org/2002/07/owl#sameAs | 11 |


### Number of Unique Typed Objects Linked to a Property

In [16]:
displayResults(idpKG.query("""
PREFIX schema: <https://schema.org/>
PREFIX pav: <http://purl.org/pav/>
SELECT ?p (COUNT(?p) AS ?triples) ?otype (COUNT(DISTINCT ?o) AS ?count)
WHERE {
    GRAPH ?g {
        ?s ?p ?o .
        ?o a ?otype
    }
} 
GROUP BY ?p ?otype
ORDER BY ?p
"""))

| p | triples | otype | count |
|---|---|---|---|
| https://schema.org/additionalProperty | 54 | https://schema.org/PropertyValue | 54 |
| https://schema.org/hasSequenceAnnotation | 32 | https://schema.org/SequenceAnnotation | 29 |
| https://schema.org/sequenceLocation | 29 | https://schema.org/SequenceRange | 29 |
| https://schema.org/value | 54 | https://schema.org/DefinedTerm | 27 |


### Triples and Number of Unique Literals Related to a Property

In [17]:
displayResults(idpKG.query("""
PREFIX schema: <https://schema.org/>
PREFIX pav: <http://purl.org/pav/>
SELECT ?p (COUNT(?p) AS ?triples) (COUNT(DISTINCT ?o) AS ?literals)
WHERE {
    GRAPH ?g {
        ?s ?p ?o
    }
    FILTER (isLiteral(?o))
} 
GROUP BY ?p
ORDER BY ?p
"""))

| p | triples | literals |
|---|---|---|
| http://purl.org/pav/retrievedOn | 8 | 8 |
| http://rdfs.org/ns/void#inDataset | 6 | 2 |
| https://schema.org/description | 3 | 1 |
| https://schema.org/hasBioPolymerSequence | 13 | 9 |
| https://schema.org/identifier | 11 | 9 |
| https://schema.org/name | 98 | 34 |
| https://schema.org/rangeEnd | 29 | 15 |
| https://schema.org/rangeStart | 29 | 11 |
| https://schema.org/termCode | 35 | 27 |


### Number of Unique Subject Types Linked to Unique Object Types

In [18]:
displayResults(idpKG.query("""
PREFIX schema: <https://schema.org/>
PREFIX pav: <http://purl.org/pav/>
SELECT (COUNT(DISTINCT ?s) AS ?scount) ?stype ?p ?otype (COUNT(DISTINCT ?o) AS ?ocount)
WHERE {
    GRAPH ?g {
        ?s ?p ?o .
        ?s a ?stype .
        ?o a ?otype .
    }
} 
GROUP BY ?p ?stype ?otype
ORDER BY ?p
"""))

| scount | stype | p | otype | ocount |
|---|---|---|---|---|
| 29 | https://schema.org/SequenceAnnotation | https://schema.org/additionalProperty | https://schema.org/PropertyValue | 54 |
| 8 | https://schema.org/Protein | https://schema.org/hasSequenceAnnotation | https://schema.org/SequenceAnnotation | 29 |
| 29 | https://schema.org/SequenceAnnotation | https://schema.org/sequenceLocation | https://schema.org/SequenceRange | 29 |
| 54 | https://schema.org/PropertyValue | https://schema.org/value | https://schema.org/DefinedTerm | 27 |


## Data Content Statistics

The previous section gave generic dataset statistics. We will now focus on information about the data content that is of interest to the IDP community.

### Number of Distinct Proteins
Retrieve the number of distinct proteins in the IDP-KG.

_Note that a protein can be present in multiple datasets._

In [19]:
displayResults(idpKG.query("""
PREFIX schema: <https://schema.org/>
PREFIX pav: <http://purl.org/pav/>
SELECT (COUNT(DISTINCT ?s) AS ?Proteins) 
WHERE {
    GRAPH ?g {
        ?s a schema:Protein
    }
} 
"""))

| Proteins |
|---|
| 8 |


## Analysis of Proteins

The queries in this section focus on the proteins contained in the Knowledge Graph.

### Proteins per Dataset

Display the number of proteins per dataset.

In [20]:
displayResults(idpKG.query("""
PREFIX schema: <https://schema.org/>
PREFIX pav: <http://purl.org/pav/>
PREFIX void: <http://rdfs.org/ns/void#>

SELECT ?dataset (COUNT(DISTINCT ?s) AS ?Proteins) 
WHERE {
    GRAPH ?g {
        ?s a schema:Protein
    }
    ?g void:inDataset ?dataset
} 
GROUP BY ?dataset
"""))

| dataset | Proteins |
|---|---|
| https://disprot.org/#2020-12 | 3 |
| https://mobidb.org/#2020-09 | 2 |
| https://proteinensemble.org/#2021-02-12 | 4 |


### Proteins from Multiple Datasets

A protein comes from multiple datasets if the triple declaring it a `schema:Protein` is found in multiple named graphs. The number of named graphs containing that triple indicates the number of sources that describe the protein.

In [21]:
displayResults(idpKG.query("""
PREFIX schema: <https://schema.org/>
PREFIX pav: <http://purl.org/pav/>
PREFIX void: <http://rdfs.org/ns/void#>

SELECT ?protein (COUNT(?g) as ?numDatasets) (GROUP_CONCAT(?dataset;SEPARATOR=", ") AS ?datasets)
WHERE {
    GRAPH ?g {
        ?protein a schema:Protein .
    }
    ?g void:inDataset ?dataset .
}
GROUP BY ?protein
HAVING (COUNT(*) > 1)
ORDER BY ?numDatasets
"""))

| protein | numDatasets | datasets |
|---|---|---|
| https://bioschemas.org/entity/P03265 | 2 | https://disprot.org/#2020-12, https://mobidb.org/#2020-09 |
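
The grouping performed by this query (`GROUP BY ?protein` with `HAVING (COUNT(*) > 1)`) can be mirrored in plain Python. This is a sketch over hand-written `(protein, dataset)` pairs taken from results shown in this notebook, not a replacement for the SPARQL query.

```python
from collections import defaultdict

# (protein, dataset) pairs, one per named graph that types the protein
pairs = [
    ("https://bioschemas.org/entity/P49913", "https://disprot.org/#2020-12"),
    ("https://bioschemas.org/entity/P03265", "https://disprot.org/#2020-12"),
    ("https://bioschemas.org/entity/P03265", "https://mobidb.org/#2020-09"),
    ("https://bioschemas.org/entity/Q12959", "https://mobidb.org/#2020-09"),
]

# GROUP BY ?protein
datasetsByProtein = defaultdict(list)
for protein, dataset in pairs:
    datasetsByProtein[protein].append(dataset)

# HAVING (COUNT(*) > 1), with GROUP_CONCAT rendered as a comma-separated string
multi = {
    protein: ", ".join(datasets)
    for protein, datasets in datasetsByProtein.items()
    if len(datasets) > 1
}
```

As in the query, the count is taken over named graphs rather than distinct datasets, so a protein described on two pages of the same dataset would also pass the filter, with that dataset listed twice.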


### Proteins from Multiple Pages

A protein comes from multiple pages (sources) if the triple declaring it a `schema:Protein` is found in multiple named graphs. The number of named graphs containing that triple indicates the number of pages that describe the protein.

_Note that a protein can come from multiple pages within the same dataset._

In [22]:
displayResults(idpKG.query("""
PREFIX schema: <https://schema.org/>
PREFIX pav: <http://purl.org/pav/>
SELECT ?protein (COUNT(?g) as ?numSources) (GROUP_CONCAT(?source;SEPARATOR=", ") AS ?sources)
WHERE {
    GRAPH ?g {
        ?protein a schema:Protein .
    }
    ?g pav:retrievedFrom ?source .
}
GROUP BY ?protein
HAVING (COUNT(*) > 1)
ORDER BY ?numSources
"""))

| protein | numSources | sources |
|---|---|---|
| https://bioschemas.org/entity/P03265 | 2 | https://disprot.org/DP00003, https://mobidb.org/P03265 |


### Minimal Protein Information

Retrieve a minimal amount of information about the proteins.

In [23]:
displayResults(idpKG.query("""
PREFIX schema: <https://schema.org/>
PREFIX pav: <http://purl.org/pav/>
PREFIX void: <http://rdfs.org/ns/void#>

SELECT  ?s ?name ?description
    (GROUP_CONCAT(DISTINCT ?identifier;SEPARATOR=',<br/>') AS ?identifiers)
    ?associatedDisease
    ?encodedBy
    ?taxonomicRange
    (GROUP_CONCAT(DISTINCT ?sameAs;SEPARATOR=',<br/>') AS ?sameAs)
    (GROUP_CONCAT(DISTINCT ?source;SEPARATOR=',<br/>') AS ?sources)
    (GROUP_CONCAT(DISTINCT ?dataset;SEPARATOR=',<br/>') AS ?datasets)
WHERE {
    GRAPH ?g {
# Bioschemas Minimal Properties
        ?s a schema:Protein .
        OPTIONAL {?s schema:identifier ?identifier }
        OPTIONAL {?s schema:name ?name }
## Bioschemas Recommended properties
        OPTIONAL {?s schema:associatedDisease ?associatedDisease}
        OPTIONAL {?s schema:description ?description}
        OPTIONAL {?s schema:isEncodedByBioChemEntity ?encodedBy}
        OPTIONAL {?s schema:taxonomicRange ?taxonomicRange }
        OPTIONAL {?s schema:url ?url}
        OPTIONAL {?s schema:sameAs ?sameAs }
    }
    ?g pav:retrievedFrom ?source
    OPTIONAL {?g void:inDataset ?dataset}
}
GROUP BY ?s
"""))

s,name,description,identifiers,associatedDisease,encodedBy,taxonomicRange,sameAs,sources,datasets
https://bioschemas.org/entity/P49913,Cathelicidin antimicrobial peptide,,https://identifiers.org/disprot:DP00004,,,https://bioschemas.org/crawl/v1/disprot/DP00004/20210813/8/disprot.org/DP00004/1514688226,"https://disprot.org/DP00004, http://purl.uniprot.org/uniprot/P49913",https://disprot.org/DP00004,https://disprot.org/#2020-12
https://bioschemas.org/entity/P03265,DNA-binding protein,,"https://identifiers.org/disprot:DP00003, https://identifiers.org/mobidb:P03265",,,https://bioschemas.org/crawl/v1/disprot/DP00003/20210813/7/disprot.org/DP00003/47171403,"http://purl.uniprot.org/uniprot/P03265, https://disprot.org/DP00003, https://mobidb.org/P03265","https://disprot.org/DP00003, https://mobidb.org/P03265","https://disprot.org/#2020-12, https://mobidb.org/#2020-09"
https://bioschemas.org/entity/P38634,Protein SIC1,,https://identifiers.org/uniprot:P38634,,,,"http://purl.uniprot.org/uniprot/P38634, https://proteinensemble.org/PED00001#P38634_A_1",https://proteinensemble.org/PED00001,https://proteinensemble.org/#2021-02-12
https://bioschemas.org/entity/Q12959,Disks large homolog 1,,https://identifiers.org/mobidb:Q12959,,,https://identifiers.org/taxonomy:9606,"http://purl.uniprot.org/uniprot/Q12959, https://mobidb.org/Q12959",https://mobidb.org/Q12959,https://mobidb.org/#2020-09
https://bioschemas.org/entity/P06400,Retinoblastoma-associated protein,,"https://identifiers.org/uniprot:P06400, https://identifiers.org/uniprot:P03255",,,,"https://proteinensemble.org/PED00174#P06400_A_1, https://proteinensemble.org/PED00174#P03255_B_1, http://purl.uniprot.org/uniprot/P03255, http://purl.uniprot.org/uniprot/P06400, https://proteinensemble.org/PED00174#P06400_A_0",https://proteinensemble.org/PED00174,https://proteinensemble.org/#2021-02-12
https://bioschemas.org/entity/P03255,Retinoblastoma-associated protein,,"https://identifiers.org/uniprot:P06400, https://identifiers.org/uniprot:P03255",,,,"http://purl.uniprot.org/uniprot/P03255, https://proteinensemble.org/PED00174#P06400_A_0, https://proteinensemble.org/PED00174#P06400_A_1, https://proteinensemble.org/PED00174#P03255_B_1, http://purl.uniprot.org/uniprot/P06400",https://proteinensemble.org/PED00174,https://proteinensemble.org/#2021-02-12
https://bioschemas.org/entity/P52292,Importin subunit alpha-1,,https://identifiers.org/uniprot:P52292,,,,"https://proteinensemble.org/PED00148#P52292_A_0, http://purl.uniprot.org/uniprot/P52292",https://proteinensemble.org/PED00148,https://proteinensemble.org/#2021-02-12
https://bioschemas.org/entity/P03045,Antitermination protein N,,https://identifiers.org/disprot:DP00005,,,https://bioschemas.org/crawl/v1/disprot/DP00005/20210813/9/disprot.org/DP00005/2033189699,"https://disprot.org/DP00005, http://purl.uniprot.org/uniprot/P03045",https://disprot.org/DP00005,https://disprot.org/#2020-12


### Full Protein Information

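Retrieve the full set of Bioschemas properties (minimal, recommended, and optional) for the proteins in the knowledge graph. Note that `schema:hasSequenceAnnotation` is matched as a required pattern in this query, so proteins without sequence annotations are not returned.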

In [24]:
displayResults(idpKG.query("""
PREFIX schema: <https://schema.org/>
PREFIX pav: <http://purl.org/pav/>
PREFIX void: <http://rdfs.org/ns/void#>

SELECT  ?s ?name ?description
    (GROUP_CONCAT(DISTINCT ?identifier;SEPARATOR=',<br/>') AS ?identifiers)
    ?associatedDisease
    (GROUP_CONCAT(DISTINCT ?annotation;SEPARATOR=',<br/>') AS ?annotations)
    ?encodedBy
    ?taxonomicRange
    ?url
    ?alternateName
    ?bioChemInteraction
    ?bioChemSimilarity
    ?bioChemEntity
    (GROUP_CONCAT(DISTINCT ?sequence;SEPARATOR=',<br/>') AS ?sequences)
    ?molFunction
    ?representation
    ?image
    ?process
    ?cellularLocation
    ?parentEntity
    (GROUP_CONCAT(DISTINCT ?sameAs;SEPARATOR=',<br/>') AS ?sameAs)
    (GROUP_CONCAT(DISTINCT ?source;SEPARATOR=',<br/>') AS ?sources)
    (GROUP_CONCAT(DISTINCT ?dataset;SEPARATOR=',<br/>') AS ?datasets)
WHERE {
    GRAPH ?g {
# Bioschemas Minimal Properties
        ?s a schema:Protein .
        OPTIONAL {?s schema:identifier ?identifier }
        OPTIONAL {?s schema:name ?name }
## Bioschemas Recommended properties
        OPTIONAL {?s schema:associatedDisease ?associatedDisease}
        OPTIONAL {?s schema:description ?description}
        #OPTIONAL 
        {?s schema:hasSequenceAnnotation ?annotation }
        OPTIONAL {?s schema:isEncodedByBioChemEntity ?encodedBy}
        OPTIONAL {?s schema:taxonomicRange ?taxonomicRange }
        OPTIONAL {?s schema:url ?url}
## Bioschemas Optional properties
        OPTIONAL {?s schema:alternateName ?alternateName}
        OPTIONAL {?s schema:bioChemInteraction ?bioChemInteraction}
        OPTIONAL {?s schema:bioChemSimilarity ?bioChemSimilarity}
        OPTIONAL {?s schema:hasBioChemEntityPart ?bioChemEntity}
        OPTIONAL {?s schema:hasBioPolymerSequence ?sequence}
        OPTIONAL {?s schema:hasMolecularFunction ?molFunction}
        OPTIONAL {?s schema:hasRepresentation ?representation }
        OPTIONAL {?s schema:image ?image}
        OPTIONAL {?s schema:isInvolvedInBiologicalProcess ?process}
        OPTIONAL {?s schema:isLocatedInSubcellularLocation ?cellularLocation}
        OPTIONAL {?s schema:isPartOfBioChemEntity ?parentEntity}
        OPTIONAL {?s schema:sameAs ?sameAs }
    }
    ?g pav:retrievedFrom ?source .
    OPTIONAL {?g void:inDataset ?dataset}
}
GROUP BY ?s
"""))

s,name,description,identifiers,associatedDisease,annotations,encodedBy,taxonomicRange,url,alternateName,bioChemInteraction,bioChemSimilarity,bioChemEntity,sequences,molFunction,representation,image,process,cellularLocation,parentEntity,sameAs,sources,datasets
https://bioschemas.org/entity/P49913,Cathelicidin antimicrobial peptide,,https://identifiers.org/disprot:DP00004,,"https://disprot.org/DP00004r001, https://disprot.org/DP00004r002, https://disprot.org/DP00004r004",,https://bioschemas.org/crawl/v1/disprot/DP00004/20210813/8/disprot.org/DP00004/1514688226,,,,,,MKTQRDGHSLGRWSLVLLLLGLVMPLAIIAQVLSYKEAVLRAIDGINQRSSDANLYRLLDLDPRPTMDGDPDTPKPVSFTVKETVCPRTTQQSPEDCDFKKDGLVKRCMGTVTLNQARGSFDISCDKDNKRFALLGDFFRKSKEKIGKEFKRIVQRIKDFLRNLVPRTES,,,,,,,"https://disprot.org/DP00004, http://purl.uniprot.org/uniprot/P49913",https://disprot.org/DP00004,https://disprot.org/#2020-12
https://bioschemas.org/entity/P03265,DNA-binding protein,,"https://identifiers.org/disprot:DP00003, https://identifiers.org/mobidb:P03265",,"https://disprot.org/DP00003r002, https://disprot.org/DP00003r004, https://mobidb.org/P03265#prediction-disorder-mobidb_lite.1_108, https://mobidb.org/P03265#prediction-disorder-mobidb_lite.125_166",,https://bioschemas.org/crawl/v1/disprot/DP00003/20210813/7/disprot.org/DP00003/47171403,,,,,,MASREEEQRETTPERGRGAARRPPTMEDVSSPSPSPPPPRAPPKKRMRRRIESEDEEDSSQDALVPRTPSPRPSTSAADLAIAPKKKKKRPSPKPERPPSPEVIVDSEEEREDVALQMVGFSNPPVLIKHGKGGKRTVRRLNEDDPVARGMRTQEEEEEPSEAESEITVMNPLSVPIVSAWEKGMEAARALMDKYHVDNDLKANFKLLPDQVEALAAVCKTWLNEEHRGLQLTFTSKKTFVTMMGRFLQAYLQSFAEVTYKHHEPTGCALWLHRCAEIEGELKCLHGSIMINKEHVIEMDVTSENGQRALKEQSSKAKIVKNRWGRNVVQISNTDARCCVHDAACPANQFSGKSCGMFFSEGAKAQVAFKQIKAFMQALYPNAQTGHGHLLMPLRCECNSKPGHAPFLGRQLPKLTPFALSNAEDLDADLISDKSVLASVHHPALIVFQCCNPVYRNSRAQGGGPNCDFKISAPDLLNALVMVRSLWSENFTELPRMVVPEFKWSTKHQYRNVSLPVAHSDARQNPFDF,,,,,,,"http://purl.uniprot.org/uniprot/P03265, https://disprot.org/DP00003, https://mobidb.org/P03265","https://disprot.org/DP00003, https://mobidb.org/P03265","https://disprot.org/#2020-12, https://mobidb.org/#2020-09"
https://bioschemas.org/entity/P38634,Protein SIC1,,https://identifiers.org/uniprot:P38634,,https://proteinensemble.org/PED00001#P38634_A_1_1_90,,,,,,,,MTPSTPPRSRGTRYLAQPSGNTSSSALMQGQKTPQKPSQNLVPVTPSTTKSFKNAPLLAPPNSNMGMTSPFNGLTSPQRSPFPKSSVKRT,,,,,,,"http://purl.uniprot.org/uniprot/P38634, https://proteinensemble.org/PED00001#P38634_A_1",https://proteinensemble.org/PED00001,https://proteinensemble.org/#2021-02-12
https://bioschemas.org/entity/Q12959,Disks large homolog 1,,https://identifiers.org/mobidb:Q12959,,https://mobidb.org/Q12959#prediction-disorder-mobidb_lite.662_696,,https://identifiers.org/taxonomy:9606,,,,,,MPVRKQDTQRALHLLEEYRSKLSQTEDRQLRSSIERVINIFQSNLFQALIDIQEFYEVTLLDNPKCIDRSKPSEPIQPVNTWEISSLPSSTVTSETLPSSLSPSVEKYRYQDEDTPPQEHISPQITNEVIGPELVHVSEKNLSEIENVHGFVSHSHISPIKPTEAVLPSPPTVPVIPVLPVPAENTVILPTIPQANPPPVLVNTDSLETPTYVNGTDADYEYEEITLERGNSGLGFSIAGGTDNPHIGDDSSIFITKIITGGAAAQDGRLRVNDCILRVNEVDVRDVTHSKAVEALKEAGSIVRLYVKRRKPVSEKIMEIKLIKGPKGLGFSIAGGVGNQHIPGDNSIYVTKIIEGGAAHKDGKLQIGDKLLAVNNVCLEEVTHEEAVTALKNTSDFVYLKVAKPTSMYMNDGYAPPDITNSSSQPVDNHVSPSSFLGQTPASPARYSPVSKAVLGDDEITREPRKVVLHRGSTGLGFNIVGGEDGEGIFISFILAGGPADLSGELRKGDRIISVNSVDLRAASHEQAAAALKNAGQAVTIVAQYRPEEYSRFEAKIHDLREQMMNSSISSGSGSLRTSQKRSLYVRALFDYDKTKDSGLPSQGLNFKFGDILHVINASDDEWWQARQVTPDGESDEVGVIPSKRRVEKKERARLKTVKFNSKTRDKGEIPDDMGSKGLKHVTSNASDSESSYRGQEEYVLSYEPVNQQEVNYTRPVIILGPMKDRINDDLISEFPDKFGSCVPHTTRPKRDYEVDGRDYHFVTSREQMEKDIQEHKFIEAGQYNNHLYGTSVQSVREVAEKGKHCILDVSGNAIKRLQIAQLYPISIFIKPKSMENIMEMNKRLTEEQARKTFERAMKLEQEFTEHFTAIVQGDTLEDIYNQVKQIIEEQSGSYIWVPAKEKL,,,,,,,"http://purl.uniprot.org/uniprot/Q12959, https://mobidb.org/Q12959",https://mobidb.org/Q12959,https://mobidb.org/#2020-09
https://bioschemas.org/entity/P06400,Retinoblastoma-associated protein,,"https://identifiers.org/uniprot:P06400, https://identifiers.org/uniprot:P03255",,"https://proteinensemble.org/PED00174#P06400_A_0_372_581, https://proteinensemble.org/PED00174#P06400_A_1_643_787, https://proteinensemble.org/PED00174#P03255_B_1_36_146",,,,,,,,"SHFEPPTLHELYDLDVTAPEDPNEEAVSQIFPDSVMLAVQEGIDLLTFPPAPGSPEPPHLSRQPEQPEQRALGPVSMPNLVPEVIDLTCHEAGFPPSDDEDEEGEEFVLDY, KSTSLSLFYKKVYRLAYLRLNTLCERLLSEHPELEHIIWTLFQHTLQNEYELMRDRHLDQIMMCSMYGICKVKNIDLKFKIIVTAYKDLPHAVQETFKRVLIKEEEYDSIIVFYNSVFMQRLKTNILQYASTRPPTLSPIPHIPR, HTPVRTVMNTIQQLMMILNSASDQPSENLISYFNNCTVNPKESILKRVKDIGYIFKEKFAKAVGQGCVEIGSQRYKLGVRLYYRVMESMLKSEEERLSIQNFSKLLNDNIFHMSLLACALEVVMATYSRSTSQNLDSGTDLSFPWILNVLNLKAFDFYKVIESFIKAEGNLTREMIKHLERCEHRIMESLAWLSDSPLFDLIKQSKDREG",,,,,,,"https://proteinensemble.org/PED00174#P06400_A_1, https://proteinensemble.org/PED00174#P03255_B_1, http://purl.uniprot.org/uniprot/P03255, http://purl.uniprot.org/uniprot/P06400, https://proteinensemble.org/PED00174#P06400_A_0",https://proteinensemble.org/PED00174,https://proteinensemble.org/#2021-02-12
https://bioschemas.org/entity/P03255,Retinoblastoma-associated protein,,"https://identifiers.org/uniprot:P06400, https://identifiers.org/uniprot:P03255",,"https://proteinensemble.org/PED00174#P03255_B_1_36_146, https://proteinensemble.org/PED00174#P06400_A_0_372_581, https://proteinensemble.org/PED00174#P06400_A_1_643_787",,,,,,,,"SHFEPPTLHELYDLDVTAPEDPNEEAVSQIFPDSVMLAVQEGIDLLTFPPAPGSPEPPHLSRQPEQPEQRALGPVSMPNLVPEVIDLTCHEAGFPPSDDEDEEGEEFVLDY, KSTSLSLFYKKVYRLAYLRLNTLCERLLSEHPELEHIIWTLFQHTLQNEYELMRDRHLDQIMMCSMYGICKVKNIDLKFKIIVTAYKDLPHAVQETFKRVLIKEEEYDSIIVFYNSVFMQRLKTNILQYASTRPPTLSPIPHIPR, HTPVRTVMNTIQQLMMILNSASDQPSENLISYFNNCTVNPKESILKRVKDIGYIFKEKFAKAVGQGCVEIGSQRYKLGVRLYYRVMESMLKSEEERLSIQNFSKLLNDNIFHMSLLACALEVVMATYSRSTSQNLDSGTDLSFPWILNVLNLKAFDFYKVIESFIKAEGNLTREMIKHLERCEHRIMESLAWLSDSPLFDLIKQSKDREG",,,,,,,"http://purl.uniprot.org/uniprot/P03255, https://proteinensemble.org/PED00174#P06400_A_0, https://proteinensemble.org/PED00174#P06400_A_1, https://proteinensemble.org/PED00174#P03255_B_1, http://purl.uniprot.org/uniprot/P06400",https://proteinensemble.org/PED00174,https://proteinensemble.org/#2021-02-12
https://bioschemas.org/entity/P52292,Importin subunit alpha-1,,https://identifiers.org/uniprot:P52292,,https://proteinensemble.org/PED00148#P52292_A_0_3_97,,,,,,,,TNENANTPAARLHRFKNKGKDSTEMRRRRIEVNVELRKAKKDDQMLKRRNVSSFPDDATSPLQENRNNQGTVNWSVDDIVKGINSSNVENQLQAT,,,,,,,"https://proteinensemble.org/PED00148#P52292_A_0, http://purl.uniprot.org/uniprot/P52292",https://proteinensemble.org/PED00148,https://proteinensemble.org/#2021-02-12
https://bioschemas.org/entity/P03045,Antitermination protein N,,https://identifiers.org/disprot:DP00005,,"https://disprot.org/DP00005r006, https://disprot.org/DP00005r005, https://disprot.org/DP00005r013, https://disprot.org/DP00005r001, https://disprot.org/DP00005r012, https://disprot.org/DP00005r016, https://disprot.org/DP00005r008, https://disprot.org/DP00005r015, https://disprot.org/DP00005r010, https://disprot.org/DP00005r018, https://disprot.org/DP00005r007, https://disprot.org/DP00005r014, https://disprot.org/DP00005r011, https://disprot.org/DP00005r009, https://disprot.org/DP00005r004, https://disprot.org/DP00005r017",,https://bioschemas.org/crawl/v1/disprot/DP00005/20210813/9/disprot.org/DP00005/2033189699,,,,,,MDAQTRRRERRAEKQAQWKAANPLLVGVSAKPVNRPILSLNRKPKSRVESALNPIDLTVLAEYHKQIESNLQRIERKNQRTWYSKPGERGITCSGRQKIKGKSIPLI,,,,,,,"https://disprot.org/DP00005, http://purl.uniprot.org/uniprot/P03045",https://disprot.org/DP00005,https://disprot.org/#2020-12


## Analysis of Sequence Annotations

### Sequence Annotations per Dataset

Display the number of sequence annotations per dataset.

In [25]:
displayResults(idpKG.query("""
PREFIX schema: <https://schema.org/>
PREFIX pav: <http://purl.org/pav/>
PREFIX void: <http://rdfs.org/ns/void#>

SELECT ?dataset (COUNT(DISTINCT ?s) AS ?annotations) 
WHERE {
    GRAPH ?g {
        ?s a schema:SequenceAnnotation
    }
    ?g void:inDataset ?dataset
} 
GROUP BY ?dataset
"""))

| dataset | annotations |
|---|---|
| https://disprot.org/#2020-12 | 21 |
| https://mobidb.org/#2020-09 | 3 |
| https://proteinensemble.org/#2021-02-12 | 5 |


### Sequence Annotations from Multiple Datasets

Display the number of sequence annotations that come from multiple datasets.

_Note that sequence annotations are not merged based on any feature so we would not expect any sequence annotations to match the criteria in this query._

In [26]:
displayResults(idpKG.query("""
PREFIX schema: <https://schema.org/>
PREFIX pav: <http://purl.org/pav/>
PREFIX void: <http://rdfs.org/ns/void#>

SELECT ?annotation (COUNT(?g) as ?numDatasets) (GROUP_CONCAT(?dataset;SEPARATOR=", ") AS ?datasets)
WHERE {
    GRAPH ?g {
        ?annotation a schema:SequenceAnnotation .
    }
    ?g void:inDataset ?dataset .
}
GROUP BY ?annotation
HAVING (COUNT(*) > 1)
ORDER BY ?numDatasets
"""))

| annotation | numDatasets | datasets |
|---|---|---|


### Sequence Annotations from Multiple Pages

Display the number of sequence annotations that come from multiple pages. It is conceivable that the same annotation comes from different pages in the same source, e.g. PED. However, as annotations are not combined, we would not expect any answers to the following query.

In [27]:
displayResults(idpKG.query("""
PREFIX schema: <https://schema.org/>
PREFIX pav: <http://purl.org/pav/>
SELECT ?annotation (COUNT(?g) as ?numSources) (GROUP_CONCAT(?source;SEPARATOR=", ") AS ?sources)
WHERE {
    GRAPH ?g {
        ?annotation a schema:SequenceAnnotation .
    }
    ?g pav:retrievedFrom ?source .
}
GROUP BY ?annotation
HAVING (COUNT(*) > 1)
ORDER BY ?numSources
"""))

| annotation | numSources | sources |
|---|---|---|


### Sequence Annotation Information

Return information known about each sequence annotation.

In [28]:
displayResults(idpKG.query("""
PREFIX schema: <https://schema.org/>

SELECT ?s ?annotation ?start ?end ?termCode ?termName
WHERE {
    graph ?g {
        ?s a schema:Protein;
           schema:hasSequenceAnnotation ?annotation .
        ?annotation schema:additionalProperty/schema:value ?term;
            schema:sequenceLocation ?range .
        ?range schema:rangeStart ?start ;
               schema:rangeEnd ?end .
        ?term schema:termCode ?termCode ;
            schema:name ?termName .
    }
}    
ORDER BY ?s ?start ?end

"""))

| s | annotation | start | end | termCode | termName |
|---|---|---|---|---|---|
| https://bioschemas.org/entity/P03045 | https://disprot.org/DP00005r006 | 1 | 107 | IDPO:00066 | RNA binding |
| https://bioschemas.org/entity/P03045 | https://disprot.org/DP00005r005 | 1 | 107 | IDPO:00076 | Disorder |
| https://bioschemas.org/entity/P03045 | https://disprot.org/DP00005r001 | 1 | 107 | IDPO:00076 | Disorder |
| https://bioschemas.org/entity/P03045 | https://disprot.org/DP00005r008 | 1 | 107 | IDPO:00021 | Activator |
| https://bioschemas.org/entity/P03045 | https://disprot.org/DP00005r010 | 1 | 107 | IDPO:00017 | Molecular recognition effector |
| https://bioschemas.org/entity/P03045 | https://disprot.org/DP00005r007 | 1 | 107 | IDPO:00076 | Disorder |
| https://bioschemas.org/entity/P03045 | https://disprot.org/DP00005r011 | 1 | 107 | IDPO:00008 | Molecular recognition assembler |
| https://bioschemas.org/entity/P03045 | https://disprot.org/DP00005r009 | 1 | 107 | IDPO:00021 | Activator |
| https://bioschemas.org/entity/P03045 | https://disprot.org/DP00005r004 | 1 | 107 | IDPO:00076 | Disorder |
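
The chain of properties followed by this query can be sketched with nested dictionaries. The protein, annotation, range, and term values are copied from the first result row above; the identifiers of the intermediate nodes (`_:pv1`, `_:range1`, `_:term1`) are placeholders, since the query does not return them.

```python
# Hypothetical in-memory view of one annotation; keys are schema.org property names
nodes = {
    "https://bioschemas.org/entity/P03045": {
        "hasSequenceAnnotation": ["https://disprot.org/DP00005r006"],
    },
    "https://disprot.org/DP00005r006": {
        "additionalProperty": ["_:pv1"],
        "sequenceLocation": ["_:range1"],
    },
    "_:pv1": {"value": ["_:term1"]},
    "_:range1": {"rangeStart": [1], "rangeEnd": [107]},
    "_:term1": {"termCode": ["IDPO:00066"], "name": ["RNA binding"]},
}


def follow(starts, prop):
    """Return every value reached from the start nodes via one property."""
    return [value for node in starts for value in nodes.get(node, {}).get(prop, [])]


annotations = follow(["https://bioschemas.org/entity/P03045"], "hasSequenceAnnotation")
# schema:additionalProperty/schema:value is a two-step property path
terms = follow(follow(annotations, "additionalProperty"), "value")
ranges = follow(annotations, "sequenceLocation")
row = (
    follow(ranges, "rangeStart")[0],
    follow(ranges, "rangeEnd")[0],
    follow(terms, "termCode")[0],
    follow(terms, "name")[0],
)
```

The resulting `row` holds the start, end, term code, and term name for the annotation, mirroring the `?start ?end ?termCode ?termName` columns of the query.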


## Find proteins with annotations in multiple datasets

We are looking for annotations where the protein is common but the annotation is different across the datasets.

### Proteins with Annotations in Multiple Datasets

In [29]:
displayResults(idpKG.query("""
PREFIX pav: <http://purl.org/pav/>
PREFIX schema: <https://schema.org/>
PREFIX void: <http://rdfs.org/ns/void#>

SELECT ?protein (SAMPLE(?proteinName) AS ?name) (COUNT(distinct ?annotation) AS ?annotationCount) (COUNT(distinct ?dataset) AS ?datasets)
WHERE {
    {
        SELECT DISTINCT ?protein ?proteinName
        WHERE {
		    GRAPH ?g {
        		?protein a schema:Protein .
		        OPTIONAL {?protein schema:name ?proteinName .}
		    }
        }
    }
    {
	    SELECT ?annotation ?dataset ?protein
    	WHERE {
        	GRAPH ?g {
            	?protein schema:hasSequenceAnnotation ?annotation
	        }
    	    ?g void:inDataset ?dataset .
	    }
    }
} 
GROUP BY ?protein
HAVING (COUNT(distinct ?dataset) > 1)
ORDER BY DESC(?annotationCount)
"""))

| protein | name | annotationCount | datasets |
|---|---|---|---|
| https://bioschemas.org/entity/P03265 | DNA-binding protein | 4 | 2 |
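
The logic of this query, joining each protein's annotations with the dataset of the named graph they were found in and keeping only proteins seen in more than one dataset, can be sketched in plain Python. The tuples below are written by hand from annotation IRIs shown elsewhere in this notebook; the dataset assigned to each annotation is an assumption inferred from the annotation's IRI.

```python
from collections import defaultdict

# (protein, annotation, dataset), one per hasSequenceAnnotation statement
rows = [
    ("https://bioschemas.org/entity/P03265", "https://disprot.org/DP00003r002",
     "https://disprot.org/#2020-12"),
    ("https://bioschemas.org/entity/P03265", "https://disprot.org/DP00003r004",
     "https://disprot.org/#2020-12"),
    ("https://bioschemas.org/entity/P03265",
     "https://mobidb.org/P03265#prediction-disorder-mobidb_lite.1_108",
     "https://mobidb.org/#2020-09"),
    ("https://bioschemas.org/entity/P03265",
     "https://mobidb.org/P03265#prediction-disorder-mobidb_lite.125_166",
     "https://mobidb.org/#2020-09"),
    ("https://bioschemas.org/entity/Q12959",
     "https://mobidb.org/Q12959#prediction-disorder-mobidb_lite.662_696",
     "https://mobidb.org/#2020-09"),
]

# GROUP BY ?protein, collecting distinct annotations and distinct datasets
annotations = defaultdict(set)
datasets = defaultdict(set)
for protein, annotation, dataset in rows:
    annotations[protein].add(annotation)
    datasets[protein].add(dataset)

# HAVING (COUNT(DISTINCT ?dataset) > 1)
result = {
    protein: (len(annotations[protein]), len(datasets[protein]))
    for protein in annotations
    if len(datasets[protein]) > 1
}
```

Only the protein annotated in both DisProt and MobiDB remains, with four annotations across two datasets, in line with the query result above.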


### Proteins with Annotations in Multiple Pages

As sources such as PED can have the same protein detailed on multiple pages, it is also interesting to look at this at the page level.

The following query finds for each protein, its name (if known), a count of the number of sequence annotations, and a count of the number of sources from which the data has been extracted. Results are only returned if there are annotations from more than one source.

In [30]:
displayResults(idpKG.query("""
PREFIX pav: <http://purl.org/pav/>
PREFIX schema: <https://schema.org/>
SELECT ?protein (SAMPLE(?proteinName) AS ?name) (COUNT(distinct ?annotation) AS ?annotationCount) (COUNT(distinct ?source) AS ?sourceCount)
WHERE {
    {
        SELECT DISTINCT ?protein ?proteinName
        WHERE {
		    GRAPH ?g {
        		?protein a schema:Protein .
		        OPTIONAL {?protein schema:name ?proteinName .}
		    }
        }
    }
    {
	    SELECT ?annotation ?source ?protein
    	WHERE {
        	GRAPH ?g {
            	?protein schema:hasSequenceAnnotation ?annotation
	        }
    	    ?g pav:retrievedFrom ?source .
	    }
    }
} 
GROUP BY ?protein
HAVING (COUNT(distinct ?source) > 1)
ORDER BY DESC(?annotationCount)
"""))

| protein | name | annotationCount | sourceCount |
|---|---|---|---|
| https://bioschemas.org/entity/P03265 | DNA-binding protein | 4 | 2 |


The following variant of the query lists each annotation together with the source from which it was extracted.

In [31]:
displayResults(idpKG.query("""
PREFIX pav: <http://purl.org/pav/>
PREFIX schema: <https://schema.org/>
SELECT ?protein ?proteinName ?annotation ?source
WHERE {
    {
        SELECT DISTINCT ?protein ?proteinName
        WHERE {
		    GRAPH ?g {
        		?protein a schema:Protein .
		        OPTIONAL {?protein schema:name ?proteinName .}
		    }
        }
    }
    {
        SELECT ?annotation ?source ?protein
        WHERE {
            GRAPH ?g {
                ?protein schema:hasSequenceAnnotation ?annotation
            }
            ?g pav:retrievedFrom ?source .
        }
    }
} 
ORDER BY ?protein ?annotation
"""))

| protein | proteinName | annotation | source |
|---|---|---|---|
| https://bioschemas.org/entity/P03045 | Antitermination protein N | https://disprot.org/DP00005r001 | https://disprot.org/DP00005 |
| https://bioschemas.org/entity/P03045 | Antitermination protein N | https://disprot.org/DP00005r004 | https://disprot.org/DP00005 |
| https://bioschemas.org/entity/P03045 | Antitermination protein N | https://disprot.org/DP00005r005 | https://disprot.org/DP00005 |
| https://bioschemas.org/entity/P03045 | Antitermination protein N | https://disprot.org/DP00005r006 | https://disprot.org/DP00005 |
| https://bioschemas.org/entity/P03045 | Antitermination protein N | https://disprot.org/DP00005r007 | https://disprot.org/DP00005 |
| https://bioschemas.org/entity/P03045 | Antitermination protein N | https://disprot.org/DP00005r008 | https://disprot.org/DP00005 |
| https://bioschemas.org/entity/P03045 | Antitermination protein N | https://disprot.org/DP00005r009 | https://disprot.org/DP00005 |
| https://bioschemas.org/entity/P03045 | Antitermination protein N | https://disprot.org/DP00005r010 | https://disprot.org/DP00005 |
| https://bioschemas.org/entity/P03045 | Antitermination protein N | https://disprot.org/DP00005r011 | https://disprot.org/DP00005 |
