# Analysis of the IDP Knowledge Graph

__Authors:__  
Alasdair J G Gray ([ORCID:0000-0002-5711-4872](http://orcid.org/0000-0002-5711-4872)), _Heriot-Watt University, Edinburgh, UK_

Petros Papadopoulos ([ORCID:0000-0002-8110-7576](https://orcid.org/0000-0002-8110-7576)), _Heriot-Watt University, Edinburgh, UK_

Ivan Mičetić ([ORCID:0000-0003-1691-8425](https://orcid.org/0000-0003-1691-8425)), _University of Padua, Italy_

Andras Hatos ([ORCID:0000-0001-9224-9820](https://orcid.org/0000-0001-9224-9820)), _University of Padua, Italy_

Imran Asif ([ORCID:0000-0002-1144-6265](https://orcid.org/0000-0002-1144-6265)), _Heriot-Watt University, Edinburgh, UK_


__License:__ Apache 2.0

__Acknowledgements:__ This notebook was created during the Virtual BioHackathon-Europe 2020.

## Introduction

This notebook contains SPARQL queries that analyse the Intrinsically Disordered Protein (IDP) Knowledge Graph (IDP-KG). The knowledge graph was constructed from Bioschemas markup embedded in DisProt, MobiDB, and the Protein Ensemble Database (PED). The markup was harvested using the Bioschemas Markup Scraper and Extractor and converted into a knowledge graph using the process in this [notebook](https://github.com/elixir-europe/BioHackathon-projects-2020/blob/master/projects/24/IDPCentral/notebooks/ETLProcess.ipynb).

### Library Imports

In [390]:
# Import and configure logging library
from datetime import datetime
import logging
logging.basicConfig(
    filename='idpQuery.log', 
    filemode='w', 
    format='%(levelname)s:%(message)s', 
    level=logging.INFO)
logging.info('Starting processing at %s' % datetime.now().time())

import ipywidgets as widgets
from ipywidgets import Layout
from IPython.display import display, HTML, clear_output
from SPARQLWrapper import SPARQLWrapper, JSON
import json
import glob
import html

In [391]:
# Imports from RDFlib
import rdflib
from rdflib import ConjunctiveGraph, plugin
from rdflib.serializer import Serializer

### Result Display Function

The following function takes the results of a SPARQL `SELECT` query and displays them as an HTML table for human viewing.

In [392]:
def displayResults(queryResult):
    HTMLResult = '<div style="width:90% !important;overflow-x:auto;"><p>Number of results: ' + str(len(queryResult['results']['bindings'])) + '</p>'
    HTMLResult = HTMLResult + '<table><tr style="color:white;background-color:#43BFC7;font-weight:bold;word-wrap: break-word;">'
    # print variable names and build header:
    for varName in queryResult['head']['vars']:
        HTMLResult = HTMLResult + '<td>' + varName + '</td>'
    HTMLResult = HTMLResult + '</tr>'
    
    # print values from each row and build table of results
    for row in queryResult['results']['bindings']:
        HTMLResult = HTMLResult + '<tr>' 
        for column in queryResult['head']['vars']:
            # OPTIONAL variables may be unbound in a row, so check before indexing
            if column in row:
                HTMLResult = HTMLResult + '<td>' + str(row[column]['value']) + '</td>'
            else:
                HTMLResult = HTMLResult + '<td>' + "N/A" + '</td>'
        HTMLResult = HTMLResult + '</tr>'
    HTMLResult = HTMLResult + '</table></div>'
    display(HTML(HTMLResult))
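
The function relies on the layout of the SPARQL 1.1 Query Results JSON format: `head.vars` lists the projected variables, and each entry of `results.bindings` maps a variable name to a dictionary holding its `type` and `value`, leaving out any variable that is unbound in that row. The sketch below illustrates that layout with a hand-written result. The sample values and the helper `rows_from_results` are hypothetical and are not part of the notebook.

```python
# Hand-written sample in the SPARQL 1.1 JSON results layout (values are illustrative).
sample_result = {
    "head": {"vars": ["protein", "name"]},
    "results": {
        "bindings": [
            {
                "protein": {"type": "uri", "value": "https://example.org/id/P1"},
                "name": {"type": "literal", "value": "Example protein"},
            },
            # An OPTIONAL pattern that did not match leaves the variable out of the row.
            {"protein": {"type": "uri", "value": "https://example.org/id/P2"}},
        ]
    },
}


def rows_from_results(query_result, missing="N/A"):
    """Flatten a SELECT result into a header list and a list of value rows."""
    header = query_result["head"]["vars"]
    rows = []
    for binding in query_result["results"]["bindings"]:
        rows.append([binding[var]["value"] if var in binding else missing for var in header])
    return header, rows


header, rows = rows_from_results(sample_result)
print(header)
for row in rows:
    print(row)
```

The second row has no `name` binding, so the placeholder is used in its place.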

## Loading IDP-KG

The data is either loaded into memory from an N-Quads file (for example `IDPKG-Full.nq`) or queried through a SPARQL endpoint. The data is expected to be in multiple named graphs, based on where the data was extracted from, with provenance data in the default graph.
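
As a rough illustration of this layout (not the real data), quads can be pictured as `(subject, predicate, object, graph)` tuples, with `None` standing for the default graph. The provenance predicates `pav:retrievedFrom` and `void:inDataset` are the ones used by the queries later in this notebook. Every `example.org` IRI below is made up.

```python
from collections import defaultdict

PAV_RETRIEVED_FROM = "http://purl.org/pav/retrievedFrom"
VOID_IN_DATASET = "http://rdfs.org/ns/void#inDataset"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

# (subject, predicate, object, graph); graph None stands for the default graph.
quads = [
    ("https://example.org/id/P1", RDF_TYPE, "https://schema.org/Protein", "https://example.org/g/1"),
    ("https://example.org/id/P1", RDF_TYPE, "https://schema.org/Protein", "https://example.org/g/2"),
    # Provenance about each named graph lives in the default graph.
    ("https://example.org/g/1", PAV_RETRIEVED_FROM, "https://example.org/page/A", None),
    ("https://example.org/g/1", VOID_IN_DATASET, "https://example.org/dataset/X", None),
    ("https://example.org/g/2", PAV_RETRIEVED_FROM, "https://example.org/page/B", None),
    ("https://example.org/g/2", VOID_IN_DATASET, "https://example.org/dataset/Y", None),
]

triples_by_graph = defaultdict(list)
for s, p, o, g in quads:
    triples_by_graph[g].append((s, p, o))

# Look up the dataset of each named graph from the default-graph provenance.
dataset_of = {s: o for s, p, o in triples_by_graph[None] if p == VOID_IN_DATASET}

for graph in sorted(g for g in triples_by_graph if g is not None):
    print(graph, "->", dataset_of[graph], f"({len(triples_by_graph[graph])} data triple(s))")
```

In the SPARQL queries, the same lookup appears as a `GRAPH ?g { ... }` pattern joined with `?g void:inDataset ?dataset` outside the `GRAPH` block.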

In [401]:
idpKG = None
opt = ''  #selection option
query_options = [] # use in dropdown list

queryOrder = { #All queries must be entered here.
			  1: 'hcls-stats/number-triples.rq', 
     		  2: 'hcls-stats/typed-entities.rq', 
			  3: 'hcls-stats/number-subjects.rq', 
			  4: 'hcls-stats/number-properties.rq', 
			  5: 'hcls-stats/number-objects.rq',
			  6: 'hcls-stats/number-classes.rq',
			  7: 'hcls-stats/number-literals.rq',
			  8: 'hcls-stats/number-graphs.rq',
			  9: 'hcls-stats/class-count.rq',
			  10: 'hcls-stats/properties-ccurence.rq',
			  11: 'hcls-stats/property-subjects-triples.rq',
			  12: 'hcls-stats/number-typed-objects-linked-property.rq',
			  13: 'hcls-stats/triples-literals-related-property.rq',
			  14: 'hcls-stats/number-subject-types-object-types.rq',
			  15: 'proteins/protein-count.rq',
			  16: 'proteins/protein-per-dataset.rq',
			  17: 'proteins/protein-multi-datasets.rq',
			  18: 'proteins/proteins-by-dataset-groupings.rq',
			  19: 'proteins/protein-multi-pages.rq',
			  20: 'proteins/protein-information-minimal.rq',
		  	  21: 'proteins/protein-information.rq',
			  22: 'annotations/annotation-per-dataset.rq',
			  23: 'annotations/annotations-multi-datasets.rq',
			  24: 'annotations/annotations-multi-pages.rq',
			  25: 'annotations/annotation-details.rq',
			  26: 'annotations/annotation-scholarly-articles.rq',
			  27: 'annotations/annotations-per-article.rq',
			  30: 'annotations/annotations-by-term-code.rq',
			  31: 'annotations/protein-annotations-multi-datasets.rq',
		      32: 'annotations/protein-annotation-count.rq',
		      33: 'annotations/list-annotations.rq'
			 }

# Sort the queries by their numeric key; this is the order used when running all queries
queryOrder = {k: queryOrder[k] for k in sorted(queryOrder)}

query_options.append(('All Queries', 'all'))
for key in queryOrder:
    text = queryOrder[key].split("/")[1]
    query_options.append((text, queryOrder[key]))
    
# Set the global graph variable to a SPARQL endpoint wrapper or a local in-memory graph
def set_variable(loadingOpt, endpoint):
    global idpKG
    if loadingOpt == 'sparql':
        idpKG = SPARQLWrapper(endpoint)
        idpKG.setReturnFormat(JSON)
        logging.info("SPARQL Endpoint: %s" % endpoint)
    else:
        idpKG = ConjunctiveGraph()
        idpKG.parse(endpoint, format="nquads")
        #idpKG.serialize(format="json-ld") 
        logging.info("\tIDP-KG has %s statements." % len(idpKG))

def query_idpkg(query, loadingOpt):
    if loadingOpt == 'sparql':
        idpKG.setQuery(query)
        results = idpKG.queryAndConvert()
        logging.info("Number of Results: %s" % str(len(results['results']['bindings'])))
        return results
    else:
        results = idpKG.query(query)
        results = json.loads(results.serialize(format="json"))
        logging.info("Number of Results: %s" % str(len(results['results']['bindings'])))
        return results
    
def runQuery(queryFile):
    with open('../queries/'+queryFile) as f:
        query = f.read()
        first_line = query.partition('\n')[0]
        
        if '#' in first_line:
            query = query.split("\n",1)[1]
        else:
            first_line = ''
            
        display(HTML('<hr />'))
        print('File: /queries/' + queryFile)
        print('Query:')
        display(HTML('<div style="width:90%;overflow-x:auto;"><b><u>'+first_line.replace('#','')+'</u></b><br /><br /><pre>'+html.escape(query)+'</pre></div>'))
        logging.debug('File: /queries/' + queryFile)
        logging.debug('Query:\n' + query)
        try:
            displayResults(query_idpkg(query, opt))
        except Exception as e:
            print(str(e))
            
#############################################
#Create Selection GUI
rdo1 = widgets.RadioButtons(
    options=['SPARQL Endpoint:', 'Test-8', 'Sample-25', 'IDPKG-Full.nq'],
    name = 'select',
    disabled=False,
    layout=Layout(width='20%')
)
    
txt = widgets.Text(
    value='http://137.195.27.15:7200/repositories/IDPKG-Full',
    placeholder='Enter endpoint',
    disabled=False,
    layout=Layout(width='80%', height='5px')
)

dropdown = widgets.Dropdown(
            options=query_options,
            value='all'
        )
    
btn = widgets.Button(
    description='Execute',
    disabled=False,
    button_style='', # 'success', 'info', 'warning', 'danger' or ''
    tooltip='Execute',
    icon='check'
)
output = widgets.Output()

def createSelectionGUI():
    btn.on_click(on_button_clicked)
    display(widgets.HBox([rdo1, txt]), dropdown, widgets.VBox([btn]), output)

def on_button_clicked(e):
    with output:
        global opt
        if 'sparql' in rdo1.value.lower():
            if txt.value == '':
                display(HTML('<span style="color:red">Please enter SPARQL endpoint.</span>'))
            else:
                set_variable('sparql', txt.value)
                opt = 'sparql'
        else:
            nqFile = ''
            if 'test-8' in rdo1.value.lower():
                nqFile = 'IDPKG-Sample8.nq'
            elif 'sample-25' in rdo1.value.lower():
                nqFile = 'IDPKG-Sample25.nq'
            elif 'idpkg-full' in rdo1.value.lower():
                nqFile = 'IDPKG-Full.nq'
                
            set_variable('local', nqFile)
            opt = 'local'
            
        clear_output(True)
        
        #Execute query
        if dropdown.value == 'all':
            for key in queryOrder:
                runQuery(queryOrder[key])
        else:
            runQuery(dropdown.value)

In [402]:
createSelectionGUI()

## Knowledge Graph Statistics

This section reports various statistics about the IDP-KG. The choice of statistics was inspired by the [HCLS Dataset Description Community Profile](https://www.w3.org/TR/hcls-dataset/#s6_6).

### Number of Triples

In [7]:
logging.info(' Number of Triples - Query Started.')
runQuery('hcls-stats/number-triples.rq')
logging.info('Query Completed.')

File: /queries/hcls-stats/number-triples.rq
Query:
## Number of Triples

SELECT (COUNT(*) AS ?triples)
WHERE {
    GRAPH ?g {
        ?s ?p ?o
    }
}



| triples |
| --- |
| 7709 |


### Number of Typed Entities

Note that we use the `DISTINCT` keyword in the query since the same entity can appear in multiple named graphs.
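
A small sketch of why `DISTINCT` matters here, using made-up `(entity, named graph)` pairs in place of the matches of `?s a []`: without it, an entity typed in two named graphs would be counted twice.

```python
# (entity, named graph) pairs, one per `?s a []` match; IRIs are made up for illustration.
typed_matches = [
    ("https://example.org/id/P1", "https://example.org/g/1"),
    ("https://example.org/id/P1", "https://example.org/g/2"),  # same entity, second graph
    ("https://example.org/id/P2", "https://example.org/g/2"),
]

# Comparable to COUNT(?s): every match is counted.
matches = len(typed_matches)
# Comparable to COUNT(DISTINCT ?s): each entity is counted once.
distinct_entities = len({entity for entity, _ in typed_matches})

print(matches, distinct_entities)
```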

In [None]:
logging.info(' Number of Typed Entities - Query Started.')
displayResults(query_idpkg("""
SELECT (COUNT(DISTINCT ?s) AS ?entities) 
WHERE { 
    GRAPH ?g { 
        ?s a [] 
    }
}
""", opt))
logging.info('Query Completed.')

### Number of Unique Subjects

In [None]:
logging.info(' Number of Unique Subjects - Query Started.')
displayResults(query_idpkg("""
SELECT (COUNT(DISTINCT ?s) AS ?subjects) 
WHERE { 
    GRAPH ?g { 
        ?s ?p ?o
    }
}
""", opt))
logging.info('Query Completed.')

### Number of Unique Properties

In [None]:
logging.info(' Number of Unique Properties - Query Started.')
displayResults(query_idpkg("""
SELECT (COUNT(DISTINCT ?p) AS ?properties) 
WHERE { 
    GRAPH ?g { 
        ?s ?p ?o 
    }
}
""", opt))
logging.info('Query Completed.')

### Number of Unique Objects

In [None]:
logging.info(' Number of Unique Objects - Query Started.')
displayResults(query_idpkg("""
SELECT (COUNT(DISTINCT ?o) AS ?objects) 
WHERE { 
    GRAPH ?g { 
        ?s ?p ?o
    }
    FILTER(!isLiteral(?o))
}
""", opt))
logging.info('Query Completed.')

### Number of Unique Classes

In [None]:
logging.info(' Number of Unique Object Classes - Query Started.')
displayResults(query_idpkg("""
SELECT (COUNT(DISTINCT ?o) AS ?classes) 
WHERE { 
    GRAPH ?g { 
        ?s a ?o 
    }
}
""", opt))
logging.info('Query Completed.')

### Number of Unique Literals

In [None]:
logging.info(' Number of Unique Literals - Query Started.')
displayResults(query_idpkg("""
SELECT (COUNT(DISTINCT ?o) AS ?objects) 
WHERE { 
    GRAPH ?g { 
        ?s ?p ?o 
    }
    FILTER(isLiteral(?o))
}
""", opt))
logging.info('Query Completed.')

### Number of Graphs

In [None]:
logging.info(' Number of Unique Graphs - Query Started.')
displayResults(query_idpkg("""
SELECT (COUNT(DISTINCT ?g) AS ?graphs) 
WHERE { 
  GRAPH ?g 
    { ?s ?p ?o }
}
""", opt))
logging.info('Query Completed.')

### Instances per Class

In [None]:
logging.info(' Classes & Distinct Instances - Query Started.')
displayResults(query_idpkg("""
PREFIX schema: <https://schema.org/>
PREFIX pav: <http://purl.org/pav/>
SELECT ?Class (COUNT(DISTINCT ?s) AS ?distinctInstances) 
WHERE {
    GRAPH ?g {
        ?s a ?Class
    }
} 
GROUP BY ?Class
ORDER BY ?Class
""", opt))
logging.info('Query Completed.')

### Properties and their Occurrence

In [None]:
logging.info(' Number of Unique Predicates - Query Started.')
displayResults(query_idpkg("""
PREFIX schema: <https://schema.org/>
PREFIX pav: <http://purl.org/pav/>
SELECT ?p (COUNT(?p) AS ?triples) 
WHERE {
    GRAPH ?g {
        ?s ?p ?o
    }
} 
GROUP BY ?p
ORDER BY ?p
""", opt))
logging.info('Query Completed.')

### Property, number of unique typed subjects, and triples

In [None]:
logging.info(' Property, Typed Subjects & Triples - Query Started.')
displayResults(query_idpkg("""
PREFIX schema: <https://schema.org/>
PREFIX pav: <http://purl.org/pav/>
SELECT (COUNT(DISTINCT ?s) AS ?scount) ?stype ?p (COUNT(?p) AS ?triples) 
WHERE {
    GRAPH ?g {
        ?s ?p ?o .
        ?s a ?stype 
    }
} 
GROUP BY ?p ?stype
ORDER BY ?stype ?p
""", opt))
logging.info('Query Completed.')

### Number of Unique Typed Objects Linked to a Property

In [None]:
logging.info(' Query Started.')
displayResults(query_idpkg("""
PREFIX schema: <https://schema.org/>
PREFIX pav: <http://purl.org/pav/>
SELECT ?p (COUNT(?p) AS ?triples) ?otype (COUNT(DISTINCT ?o) AS ?count)
WHERE {
    GRAPH ?g {
        ?s ?p ?o .
        ?o a ?otype
    }
} 
GROUP BY ?p ?otype
ORDER BY ?p
""", opt))
logging.info('Query Completed.')

### Triples and Number of Unique Literals Related to a Property

In [None]:
logging.info(' Query Started.')
displayResults(query_idpkg("""
PREFIX schema: <https://schema.org/>
PREFIX pav: <http://purl.org/pav/>
SELECT ?p (COUNT(?p) AS ?triples) (COUNT(DISTINCT ?o) AS ?literals)
WHERE {
    GRAPH ?g {
        ?s ?p ?o
    }
    FILTER (isLiteral(?o))
} 
GROUP BY ?p
ORDER BY ?p
""", opt))
logging.info('Query Completed.')

### Number of Unique Subject Types Linked to Unique Object Types

In [None]:
logging.info(' Query Started.')
displayResults(query_idpkg("""
PREFIX schema: <https://schema.org/>
PREFIX pav: <http://purl.org/pav/>
SELECT (COUNT(DISTINCT ?s) AS ?scount) ?stype ?p ?otype (COUNT(DISTINCT ?o) AS ?ocount)
WHERE {
    GRAPH ?g {
        ?s ?p ?o .
        ?s a ?stype .
        ?o a ?otype .
    }
} 
GROUP BY ?p ?stype ?otype
ORDER BY ?p
""", opt))
logging.info('Query Completed.')

## Data Content Statistics

The previous section gave generic dataset statistics. We will now focus on information about the data content that is of interest to the IDP community.

### Number of Distinct Proteins
Retrieve the number of distinct proteins in the IDP-KG.

_Note that a protein can be present in multiple datasets._

In [None]:
logging.info(' Query Started.')
displayResults(query_idpkg("""
PREFIX schema: <https://schema.org/>
PREFIX pav: <http://purl.org/pav/>
SELECT (COUNT(DISTINCT ?s) AS ?Proteins) 
WHERE {
    GRAPH ?g {
        ?s a schema:Protein
    }
} 
""", opt))
logging.info('Query Completed.')

## Analysis of Proteins

The queries in this section focus on the proteins contained in the Knowledge Graph.

### Proteins per Dataset

Display the number of proteins per dataset.

In [8]:
logging.info(' Proteins per Dataset - Query Started.')
displayResults(query_idpkg("""
PREFIX schema: <https://schema.org/>
PREFIX pav: <http://purl.org/pav/>
PREFIX void: <http://rdfs.org/ns/void#>

SELECT ?dataset (COUNT(DISTINCT ?s) AS ?Proteins) 
WHERE {
    GRAPH ?g {
        ?s a schema:Protein
    }
    ?g void:inDataset ?dataset
} 
GROUP BY ?dataset
""", opt))
logging.info('Query Completed.')

| dataset | Proteins |
| --- | --- |
| https://mobidb.org/#2020-09 | 28 |
| https://proteinensemble.org/#2021-02-12 | 20 |
| https://disprot.org/#2020-12 | 26 |


### Proteins from Multiple Datasets

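A protein comes from multiple sources if its typing triple (`?protein a schema:Protein`) is found in more than one named graph. The query counts these named graphs, so a protein described in several named graphs of the same dataset is also listed, with that dataset repeated in the `datasets` column.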
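
The difference between counting named graphs and counting distinct datasets can be sketched in plain Python. The rows below are invented and stand in for the join of the `GRAPH` pattern with `?g void:inDataset ?dataset`.

```python
from collections import defaultdict

# (protein, named graph, dataset) rows; all identifiers are made up.
rows = [
    ("https://example.org/id/P1", "g1", "datasetX"),
    ("https://example.org/id/P1", "g2", "datasetY"),
    ("https://example.org/id/P2", "g3", "datasetX"),
    ("https://example.org/id/P2", "g4", "datasetX"),  # second graph of the same dataset
    ("https://example.org/id/P3", "g5", "datasetY"),
]

graphs = defaultdict(set)
datasets = defaultdict(set)
for protein, graph, dataset in rows:
    graphs[protein].add(graph)
    datasets[protein].add(dataset)

# Counting named graphs versus counting distinct datasets.
in_multiple_graphs = sorted(p for p in graphs if len(graphs[p]) > 1)
in_multiple_datasets = sorted(p for p in datasets if len(datasets[p]) > 1)

print(in_multiple_graphs)
print(in_multiple_datasets)
```

Here `P2` appears in two named graphs but only one dataset, so it is reported by the graph-based count and not by the dataset-based one.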

In [9]:
logging.info(' Proteins Dataset & Number of Datasets - Query Started.')
displayResults(query_idpkg("""
PREFIX schema: <https://schema.org/>
PREFIX pav: <http://purl.org/pav/>
PREFIX void: <http://rdfs.org/ns/void#>

SELECT ?protein (COUNT(?g) as ?numDatasets) (GROUP_CONCAT(?dataset;SEPARATOR=", ") AS ?datasets)
WHERE {
    GRAPH ?g {
        ?protein a schema:Protein .
    }
    ?g void:inDataset ?dataset .
}
GROUP BY ?protein
HAVING (COUNT(*) > 1)
ORDER BY ?numDatasets
""", opt))
logging.info('Query Completed.')

| protein | numDatasets | datasets |
| --- | --- | --- |
| https://idpcentral.org/id/P42212 | 2 | https://proteinensemble.org/#2021-02-12, https://proteinensemble.org/#2021-02-12 |
| https://idpcentral.org/id/P09525-1 | 2 | https://proteinensemble.org/#2021-02-12, https://proteinensemble.org/#2021-02-12 |
| https://idpcentral.org/id/P03265 | 2 | https://mobidb.org/#2020-09, https://disprot.org/#2020-12 |
| https://idpcentral.org/id/Q5L4K5 | 2 | https://proteinensemble.org/#2021-02-12, https://proteinensemble.org/#2021-02-12 |
| https://idpcentral.org/id/P37840 | 2 | https://proteinensemble.org/#2021-02-12, https://proteinensemble.org/#2021-02-12 |
| https://idpcentral.org/id/Q16143 | 2 | https://proteinensemble.org/#2021-02-12, https://proteinensemble.org/#2021-02-12 |
| https://idpcentral.org/id/P12296 | 4 | https://proteinensemble.org/#2021-02-12, https://proteinensemble.org/#2021-02-12, https://proteinensemble.org/#2021-02-12, https://proteinensemble.org/#2021-02-12 |
| https://idpcentral.org/id/P38634 | 4 | https://proteinensemble.org/#2021-02-12, https://proteinensemble.org/#2021-02-12, https://proteinensemble.org/#2021-02-12, https://proteinensemble.org/#2021-02-12 |
| https://idpcentral.org/id/O14558 | 5 | https://proteinensemble.org/#2021-02-12, https://proteinensemble.org/#2021-02-12, https://proteinensemble.org/#2021-02-12, https://proteinensemble.org/#2021-02-12, https://proteinensemble.org/#2021-02-12 |


### Venn Analysis of Proteins by Dataset

In [10]:
logging.info(' Venn Analysis of Proteins by Dataset - Query Started.')
runQuery('proteins/proteins-by-dataset-groupings.rq')
logging.info('Query Completed.')

File: /queries/proteins/proteins-by-dataset-groupings.rq
Query:
# Query to analyse the number of proteins by dataset groups

PREFIX schema: <https://schema.org/>
PREFIX pav: <http://purl.org/pav/>
PREFIX void: <http://rdfs.org/ns/void#>

SELECT ?description ?count
WHERE {
    {
        {
            SELECT (COUNT(DISTINCT ?s) AS ?c)
            WHERE {
                GRAPH ?g1 {
                    ?s a schema:Protein
                }
            }
        }
        BIND("Distinct Proteins (Union)" AS ?d)
    }
    UNION
    {
        {
            SELECT (COUNT(DISTINCT ?s) AS ?c)
            WHERE {
                GRAPH ?g1 {
                    ?s a schema:Protein
                }
                ?g1 void:inDataset "https://disprot.org/#2020-12"
            }
        }
        BIND("DisProt Proteins" AS ?d)
    }
    UNION
    {
        {
            SELECT (COUNT(DISTINCT ?s) AS ?c)
            WHERE {
                GRAPH ?g1 {
                    ?s a schema:Protein
        

| description | count |
| --- | --- |
| 73 | Distinct Proteins (Union) |
| 26 | DisProt Proteins |
| 28 | MobiDB Proteins |
| 20 | PED Proteins |
| 25 | DisProt \ (MobiDB U PED) |
| 27 | MobiDB \ (DisProt U PED) |
| 20 | PED \ (DisProt U MobiDB) |
| 53 | (DisProt U MobiDB) |
| 46 | (DisProt U PED) |
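
The groupings reported by this query correspond to ordinary set operations. A minimal sketch with Python sets follows; the labels mirror the descriptions in the output, while the protein identifiers are hypothetical and do not reflect the real content of IDP-KG.

```python
# Hypothetical protein identifiers per dataset; not the real IDP-KG content.
disprot = {"P1", "P2", "P3"}
mobidb = {"P3", "P4"}
ped = {"P5"}

groupings = {
    "Distinct Proteins (Union)": disprot | mobidb | ped,
    "DisProt \\ (MobiDB U PED)": disprot - (mobidb | ped),
    "MobiDB \\ (DisProt U PED)": mobidb - (disprot | ped),
    "PED \\ (DisProt U MobiDB)": ped - (disprot | mobidb),
    "(DisProt U MobiDB)": disprot | mobidb,
    "(DisProt U PED)": disprot | ped,
}

for description, proteins in groupings.items():
    print(f"{len(proteins):>2}  {description}")
```

The set difference (`-`) gives the proteins found only in one dataset, and the union (`|`) gives the proteins found in at least one of the datasets named.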


### Proteins from Multiple Pages

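A protein comes from multiple pages (sources) if its typing triple (`?protein a schema:Protein`) is found in more than one named graph, where each named graph records the page it was retrieved from (`pav:retrievedFrom`). The number of such named graphs indicates the number of pages describing the protein.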

_Note that a protein can come from multiple pages within the same dataset._

In [None]:
logging.info(' Proteins Sources & Number of sources - Query Started.')
displayResults(query_idpkg("""
PREFIX schema: <https://schema.org/>
PREFIX pav: <http://purl.org/pav/>
SELECT ?protein (COUNT(?g) as ?numSources) (GROUP_CONCAT(?source;SEPARATOR=", ") AS ?sources)
WHERE {
    GRAPH ?g {
        ?protein a schema:Protein .
    }
    ?g pav:retrievedFrom ?source .
}
GROUP BY ?protein
HAVING (COUNT(*) > 1)
ORDER BY ?numSources
""", opt))
logging.info('Query Completed.')

### Minimal Protein Information

Retrieve a minimal amount of information about the proteins.

In [None]:
logging.info(' Query Started.')
displayResults(query_idpkg("""
PREFIX schema: <https://schema.org/>
PREFIX pav: <http://purl.org/pav/>
PREFIX void: <http://rdfs.org/ns/void#>

SELECT  ?s ?name ?description
    (GROUP_CONCAT(DISTINCT ?identifier;SEPARATOR=',<br/>') AS ?identifiers)
    ?associatedDisease
    ?encodedBy
    ?taxonomicRange
    (GROUP_CONCAT(DISTINCT ?sameAs;SEPARATOR=',<br/>') AS ?sameAsLinks)
    (GROUP_CONCAT(DISTINCT ?source;SEPARATOR=',<br/>') AS ?sources)
    (GROUP_CONCAT(DISTINCT ?dataset;SEPARATOR=',<br/>') AS ?datasets)
WHERE {
    GRAPH ?g {
# Bioschemas Minimal Properties
        ?s a schema:Protein .
        OPTIONAL {?s schema:identifier ?identifier }
        OPTIONAL {?s schema:name ?name }
## Bioschemas Recommended properties
        OPTIONAL {?s schema:associatedDisease ?associatedDisease}
        OPTIONAL {?s schema:description ?description}
        OPTIONAL {?s schema:isEncodedByBioChemEntity ?encodedBy}
        OPTIONAL {?s schema:taxonomicRange ?taxonomicRange }
        OPTIONAL {?s schema:url ?url}
        OPTIONAL {?s schema:sameAs ?sameAs }
    }
    ?g pav:retrievedFrom ?source
    OPTIONAL {?g void:inDataset ?dataset}
}
GROUP BY ?s
""", opt))
logging.info(' Query Completed.')

### Full Protein Information

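Retrieve the full information held about each protein in the knowledge graph, covering the Bioschemas minimal, recommended, and optional properties. Note that only proteins with at least one sequence annotation are returned, because the `schema:hasSequenceAnnotation` pattern in the query is not optional.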

In [None]:
logging.info(' Full Protein Information - Query Started.')
displayResults(query_idpkg("""
PREFIX schema: <https://schema.org/>
PREFIX pav: <http://purl.org/pav/>
PREFIX void: <http://rdfs.org/ns/void#>

SELECT  ?s ?name ?description
    (GROUP_CONCAT(DISTINCT ?identifier;SEPARATOR=',<br/>') AS ?identifiers)
    ?associatedDisease
    (GROUP_CONCAT(DISTINCT ?annotation;SEPARATOR=',<br/>') AS ?annotations)
    ?encodedBy
    ?taxonomicRange
    ?url
    ?alternateName
    ?bioChemInteraction
    ?bioChemSimilarity
    ?bioChemEntity
    (GROUP_CONCAT(DISTINCT ?sequence;SEPARATOR=',<br/>') AS ?sequences)
    ?molFunction
    ?representation
    ?image
    ?process
    ?cellularLocation
    ?parentEntity
    (GROUP_CONCAT(DISTINCT ?sameAs;SEPARATOR=',<br/>') AS ?sameAsLinks)
    (GROUP_CONCAT(DISTINCT ?source;SEPARATOR=',<br/>') AS ?sources)
    (GROUP_CONCAT(DISTINCT ?dataset;SEPARATOR=',<br/>') AS ?datasets)
WHERE {
    GRAPH ?g {
# Bioschemas Minimal Properties
        ?s a schema:Protein .
        OPTIONAL {?s schema:identifier ?identifier }
        OPTIONAL {?s schema:name ?name }
## Bioschemas Recommended properties
        OPTIONAL {?s schema:associatedDisease ?associatedDisease}
        OPTIONAL {?s schema:description ?description}
        #OPTIONAL 
        {?s schema:hasSequenceAnnotation ?annotation }
        OPTIONAL {?s schema:isEncodedByBioChemEntity ?encodedBy}
        OPTIONAL {?s schema:taxonomicRange ?taxonomicRange }
        OPTIONAL {?s schema:url ?url}
## Bioschemas Optional properties
        OPTIONAL {?s schema:alternateName ?alternateName}
        OPTIONAL {?s schema:bioChemInteraction ?bioChemInteraction}
        OPTIONAL {?s schema:bioChemSimilarity ?bioChemSimilarity}
        OPTIONAL {?s schema:hasBioChemEntityPart ?bioChemEntity}
        OPTIONAL {?s schema:hasBioPolymerSequence ?sequence}
        OPTIONAL {?s schema:hasMolecularFunction ?molFunction}
        OPTIONAL {?s schema:hasRepresentation ?representation }
        OPTIONAL {?s schema:image ?image}
        OPTIONAL {?s schema:isInvolvedInBiologicalProcess ?process}
        OPTIONAL {?s schema:isLocatedInSubcellularLocation ?cellularLocation}
        OPTIONAL {?s schema:isPartOfBioChemEntity ?parentEntity}
        OPTIONAL {?s schema:sameAs ?sameAs }
    }
    ?g pav:retrievedFrom ?source .
    OPTIONAL {?g void:inDataset ?dataset}
}
GROUP BY ?s
""", opt))
logging.info('Query Completed.')

## Analysis of Sequence Annotations

### Sequence Annotations per Dataset

Display the number of sequence annotations per dataset.

In [None]:
logging.info('Sequence Annotations per Dataset - Query Started.')
displayResults(query_idpkg("""
PREFIX schema: <https://schema.org/>
PREFIX pav: <http://purl.org/pav/>
PREFIX void: <http://rdfs.org/ns/void#>

SELECT ?dataset (COUNT(DISTINCT ?s) AS ?annotations) 
WHERE {
    GRAPH ?g {
        ?s a schema:SequenceAnnotation
    }
    ?g void:inDataset ?dataset
} 
GROUP BY ?dataset
""", opt))
logging.info('Query Completed.')

### Sequence Annotations from Multiple Datasets

Display the number of sequence annotations that come from multiple datasets.

_Note that sequence annotations are not merged based on any feature so we would not expect any sequence annotations to match the criteria in this query._

In [None]:
logging.info(' Sequence Annotations from Multiple Datasets - Query Started.')
displayResults(query_idpkg("""
PREFIX schema: <https://schema.org/>
PREFIX pav: <http://purl.org/pav/>
PREFIX void: <http://rdfs.org/ns/void#>

SELECT ?annotation (COUNT(?g) as ?numDatasets) (GROUP_CONCAT(?dataset;SEPARATOR=", ") AS ?datasets)
WHERE {
    GRAPH ?g {
        ?annotation a schema:SequenceAnnotation .
    }
    ?g void:inDataset ?dataset .
}
GROUP BY ?annotation
HAVING (COUNT(*) > 1)
ORDER BY ?numDatasets
""", opt))
logging.info('Query Completed.')

### Sequence Annotations from Multiple Pages

Display the number of sequence annotations that come from multiple pages. It is conceivable that the same annotation comes from different pages in the same source, e.g. PED. However, as annotations are not combined, we would not expect any answers to the following query.

In [None]:
logging.info(' Sequence Annotations from Multiple Pages - Query Started.')
displayResults(query_idpkg("""
PREFIX schema: <https://schema.org/>
PREFIX pav: <http://purl.org/pav/>
SELECT ?annotation (COUNT(?g) as ?numSources) (GROUP_CONCAT(?source;SEPARATOR=", ") AS ?sources)
WHERE {
    GRAPH ?g {
        ?annotation a schema:SequenceAnnotation .
    }
    ?g pav:retrievedFrom ?source .
}
GROUP BY ?annotation
HAVING (COUNT(*) > 1)
ORDER BY ?numSources
""", opt))
logging.info('Query Completed.')

### Sequence Annotation Information

Return information known about each sequence annotation.

In [None]:
logging.info(' Sequence Annotation Information - Query Started.')
displayResults(query_idpkg("""
PREFIX schema: <https://schema.org/>

SELECT ?s ?annotation ?start ?end ?termCode ?termName ?pubmedID
WHERE {
    graph ?g {
        ?s a schema:Protein;
           schema:hasSequenceAnnotation ?annotation .
        ?annotation schema:additionalProperty/schema:value ?term;
            schema:sequenceLocation ?range .
        ?range schema:rangeStart ?start ;
               schema:rangeEnd ?end .
        ?term schema:termCode ?termCode ;
            schema:name ?termName .
        OPTIONAL { ?annotation schema:subjectOf ?pubmedID }
    }
}    
ORDER BY ?s ?start ?end

""", opt))
logging.info('Query Completed.')

### Details of Scholarly Articles with respect to Annotations

Number of articles per annotation.

In [None]:
logging.info(' Details of Scholarly Articles with respect to Annotations - Query Started.')
displayResults(query_idpkg("""
PREFIX schema: <https://schema.org/>
PREFIX void: <http://rdfs.org/ns/void#>

SELECT ?annotation (COUNT (?pubmedID) AS ?numArticles)
WHERE {
    graph ?g {
        ?annotation a schema:SequenceAnnotation;
            schema:subjectOf ?pubmedID
    }
}    
GROUP BY ?annotation
ORDER BY DESC(?numArticles)
""", opt))
logging.info('Query Completed.')

Number of annotations per article.

In [None]:
logging.info(' Number of annotations per article - Query Started.')
displayResults(query_idpkg("""
PREFIX schema: <https://schema.org/>
PREFIX void: <http://rdfs.org/ns/void#>

SELECT ?pubmedID (COUNT (?pubmedID) AS ?numAnnotations)
WHERE {
    graph ?g {
        ?annotation a schema:SequenceAnnotation;
            schema:subjectOf ?pubmedID
    }
}    
GROUP BY ?pubmedID
ORDER BY DESC(?numAnnotations)
""", opt))
logging.info('Query Completed.')

### Number of annotations by term code

For each term code, return the number of annotations using that code.
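
The grouping performed by this query can be pictured with `collections.Counter`. This is only a sketch: the rows, term codes, and term names below are placeholders and are not taken from IDP-KG.

```python
from collections import Counter

# (annotation, term code, term name) rows; the codes and names are placeholders.
rows = [
    ("ann-1", "TERM:0001", "term one"),
    ("ann-2", "TERM:0001", "term one"),
    ("ann-3", "TERM:0002", "term two"),
]

# Group by (term code, term name) and count the annotations in each group.
counts = Counter((code, name) for _, code, name in rows)

# most_common() orders the groups by descending count, like ORDER BY DESC(?numAnnotations).
for (code, name), num_annotations in counts.most_common():
    print(code, name, num_annotations)
```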

In [None]:
logging.info(' Number of annotations by term code - Query Started.')
displayResults(query_idpkg("""
PREFIX schema: <https://schema.org/>
PREFIX void: <http://rdfs.org/ns/void#>

SELECT ?termCode ?termName (COUNT (?annotation) AS ?numAnnotations)
WHERE {
    graph ?g {
        ?annotation schema:additionalProperty/schema:value ?term .
        ?term schema:termCode ?termCode ;
            schema:name ?termName .
    }
}    
GROUP BY ?termCode ?termName
ORDER BY DESC(?numAnnotations)
""", opt))
logging.info('Query Completed.')

## Find proteins with annotations in multiple datasets

We are looking for annotations where the protein is common but the annotation is different across the datasets.

### Proteins with Annotations in Multiple Datasets

In [None]:
logging.info(' Proteins with Annotations in Multiple Datasets - Query Started.')
displayResults(query_idpkg("""
PREFIX pav: <http://purl.org/pav/>
PREFIX schema: <https://schema.org/>
PREFIX void: <http://rdfs.org/ns/void#>

SELECT ?protein (SAMPLE(?proteinName) AS ?name) (COUNT(distinct ?annotation) AS ?annotationCount) (COUNT(distinct ?dataset) AS ?datasets)
WHERE {
    {
        SELECT DISTINCT ?protein ?proteinName
        WHERE {
		    GRAPH ?g {
        		?protein a schema:Protein .
		        OPTIONAL {?protein schema:name ?proteinName .}
		    }
        }
    }
    {
	    SELECT ?annotation ?dataset ?protein
    	WHERE {
        	GRAPH ?g {
            	?protein schema:hasSequenceAnnotation ?annotation
	        }
    	    ?g void:inDataset ?dataset .
	    }
    }
} 
GROUP BY ?protein
HAVING (COUNT(distinct ?dataset) > 1)
ORDER BY DESC(?annotationCount)
""", opt))
logging.info('Query Completed.')

### Proteins with Annotations in Multiple Pages

As sources such as PED can have the same protein detailed on multiple pages, it is also interesting to look at this at the page level.

The following query finds for each protein, its name (if known), a count of the number of sequence annotations, and a count of the number of sources from which the data has been extracted. Results are only returned if there are annotations from more than one source.
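
The same aggregation can be sketched in plain Python to make the grouping and filtering explicit. The rows and names are invented for illustration only.

```python
from collections import defaultdict

# (protein, annotation, source page) rows; identifiers are invented for illustration.
rows = [
    ("P1", "ann-1", "https://example.org/page/A"),
    ("P1", "ann-2", "https://example.org/page/B"),
    ("P2", "ann-3", "https://example.org/page/A"),
    ("P2", "ann-4", "https://example.org/page/A"),
]
names = {"P1": "Example protein one"}  # the name is optional, so P2 has none

annotations = defaultdict(set)
sources = defaultdict(set)
for protein, annotation, source in rows:
    annotations[protein].add(annotation)
    sources[protein].add(source)

# Keep proteins whose annotations come from more than one source,
# ordered by annotation count (descending), like HAVING ... ORDER BY DESC(...).
report = sorted(
    (
        (protein, names.get(protein), len(annotations[protein]), len(sources[protein]))
        for protein in annotations
        if len(sources[protein]) > 1
    ),
    key=lambda item: item[2],
    reverse=True,
)
print(report)
```

`P2` has two annotations but both come from the same page, so it is filtered out.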

In [None]:
logging.info(' Proteins with Annotations in Multiple Pages - Query Started.')
displayResults(query_idpkg("""
PREFIX pav: <http://purl.org/pav/>
PREFIX schema: <https://schema.org/>
SELECT ?protein (SAMPLE(?proteinName) AS ?name) (COUNT(distinct ?annotation) AS ?annotationCount) (COUNT(distinct ?source) AS ?sourceCount)
WHERE {
    {
        SELECT DISTINCT ?protein ?proteinName
        WHERE {
		    GRAPH ?g {
        		?protein a schema:Protein .
		        OPTIONAL {?protein schema:name ?proteinName .}
		    }
        }
    }
    {
	    SELECT ?annotation ?source ?protein
    	WHERE {
        	GRAPH ?g {
            	?protein schema:hasSequenceAnnotation ?annotation
	        }
    	    ?g pav:retrievedFrom ?source .
	    }
    }
} 
GROUP BY ?protein
HAVING (COUNT(distinct ?source) > 1)
ORDER BY DESC(?annotationCount)
""", opt))
logging.info('Query Completed.')

The following variant of the query lists each annotation together with the source from which it came.

In [358]:
logging.info(' list the annotations - Query Started.')
displayResults(query_idpkg("""
PREFIX pav: <http://purl.org/pav/>
PREFIX schema: <https://schema.org/>
SELECT ?protein ?proteinName ?annotation ?source
WHERE {
    {
        SELECT DISTINCT ?protein ?proteinName
        WHERE {
		    GRAPH ?g {
        		?protein a schema:Protein .
		        OPTIONAL {?protein schema:name ?proteinName .}
		    }
        }
    }
    {
        SELECT ?annotation ?source ?protein
        WHERE {
            GRAPH ?g {
                ?protein schema:hasSequenceAnnotation ?annotation
            }
            ?g pav:retrievedFrom ?source .
        }
    }
} 
ORDER BY ?protein ?annotation
""", opt))
logging.info('Query Completed.')

| protein | proteinName | annotation | source |
| --- | --- | --- | --- |
| https://idpcentral.org/id/P03045 | Antitermination protein N | https://disprot.org/DP00005r001 | https://disprot.org/DP00005 |
| https://idpcentral.org/id/P03045 | Antitermination protein N | https://disprot.org/DP00005r004 | https://disprot.org/DP00005 |
| https://idpcentral.org/id/P03045 | Antitermination protein N | https://disprot.org/DP00005r005 | https://disprot.org/DP00005 |
| https://idpcentral.org/id/P03045 | Antitermination protein N | https://disprot.org/DP00005r006 | https://disprot.org/DP00005 |
| https://idpcentral.org/id/P03045 | Antitermination protein N | https://disprot.org/DP00005r007 | https://disprot.org/DP00005 |
| https://idpcentral.org/id/P03045 | Antitermination protein N | https://disprot.org/DP00005r008 | https://disprot.org/DP00005 |
| https://idpcentral.org/id/P03045 | Antitermination protein N | https://disprot.org/DP00005r009 | https://disprot.org/DP00005 |
| https://idpcentral.org/id/P03045 | Antitermination protein N | https://disprot.org/DP00005r010 | https://disprot.org/DP00005 |
| https://idpcentral.org/id/P03045 | Antitermination protein N | https://disprot.org/DP00005r011 | https://disprot.org/DP00005 |
