# Assignment 5 answers:

## Question 1: How many protein records are in UniProt? 

Source: https://www.uniprot.org/core/#http://purl.uniprot.org/core/

SPARQL endpoint: https://sparql.uniprot.org/sparql/

In [2]:
%endpoint http://sparql.uniprot.org/sparql
%format JSON

PREFIX up: <http://purl.uniprot.org/core/>

SELECT (count(?protein) as ?numberProteins)

WHERE {  
   ?protein a up:Protein .
}

numberProteins
360157660


## Question 2: How many Arabidopsis thaliana protein records are in UniProt? 

In [3]:
%endpoint http://sparql.uniprot.org/sparql
%format JSON

PREFIX up: <http://purl.uniprot.org/core/>

SELECT (count(?protein) as ?numberProteins)

WHERE {
    ?protein a up:Protein ;
               up:organism ?taxon .
    ?taxon up:scientificName "Arabidopsis thaliana"
}

numberProteins
136782


## Question 3: Retrieve pictures of Arabidopsis thaliana from UniProt

FOAF for images: http://xmlns.com/foaf/spec/

In [4]:
%endpoint http://sparql.uniprot.org/sparql
%format JSON

PREFIX up: <http://purl.uniprot.org/core/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>

SELECT distinct ?image

WHERE {
    ?arabidopsisThaliana a up:Taxon ;
                           up:scientificName "Arabidopsis thaliana" ;
                           foaf:depiction ?image
}

image
https://upload.wikimedia.org/wikipedia/commons/3/39/Arabidopsis.jpg
https://upload.wikimedia.org/wikipedia/commons/thumb/6/60/Arabidopsis_thaliana_inflorescencias.jpg/800px-Arabidopsis_thaliana_inflorescencias.jpg


## Question 4: What is the description of the enzyme activity of UniProt Protein Q9SZZ8?

In [5]:
%endpoint http://sparql.uniprot.org/sparql
%format JSON

PREFIX up: <http://purl.uniprot.org/core/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX uniprotkb: <http://purl.uniprot.org/uniprot/>

SELECT distinct ?description

WHERE {
    uniprotkb:Q9SZZ8 a up:Protein ;
                       up:enzyme ?enzyme .
    ?enzyme up:activity ?activity .
    ?activity rdfs:label ?description
}

description
Beta-carotene + 4 reduced ferredoxin [iron-sulfur] cluster + 2 H(+) + 2 O(2) = zeaxanthin + 4 oxidized ferredoxin [iron-sulfur] cluster + 2 H(2)O.


## Question 5: Retrieve the protein IDs and dates of submission for proteins that have been added to UniProt this year (HINT: Google for “SPARQL FILTER by date”)

In [1]:
%endpoint http://sparql.uniprot.org/sparql
%format JSON

PREFIX up: <http://purl.uniprot.org/core/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

SELECT distinct ?id ?date

WHERE {
    ?protein a up:Protein ;
               up:mnemonic ?id ;
               up:created ?date .
    FILTER (?date >= "2021-01-01"^^xsd:date) # >= so that entries created on 1 January are included
} LIMIT 21 # only 21 of the proteins submitted in 2021 are returned, since retrieving all of them takes a very long time

id,date
A0A1H7ADE3_PAEPO,2021-06-02
A0A1V1AIL4_ACIBA,2021-06-02
A0A2Z0L603_ACIBA,2021-06-02
A0A4J5GG53_STREE,2021-04-07
A0A6G8SU52_AERHY,2021-02-10
A0A6G8SU69_AERHY,2021-02-10
A0A7C9JLR7_9BACT,2021-02-10
A0A7C9JMZ7_9BACT,2021-02-10
A0A7C9KUQ4_9RHIZ,2021-02-10
A0A7D4HP61_NEIMU,2021-02-10
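
The cell above hard-codes 2021 as "this year". As a minimal sketch, assuming the query text is generated outside the notebook, the helper below (the name `created_since_query` is hypothetical) builds the same query for any year using only the Python standard library. It only produces the query string and does not contact the endpoint.

```python
from datetime import date


def created_since_query(year: int, limit: int = 21) -> str:
    """Return a SPARQL query for proteins created on or after 1 January of `year`.

    Hypothetical helper: it only builds the query text and does not send it.
    """
    first_day = date(year, 1, 1).isoformat()  # e.g. "2021-01-01"
    return f"""PREFIX up: <http://purl.uniprot.org/core/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

SELECT DISTINCT ?id ?date
WHERE {{
    ?protein a up:Protein ;
             up:mnemonic ?id ;
             up:created ?date .
    FILTER (?date >= "{first_day}"^^xsd:date)
}} LIMIT {limit}"""


# Build the query for the year in which the code is run.
query = created_since_query(date.today().year)
print(query)
```

The printed text can be pasted into a cell after the `%endpoint` and `%format` lines.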


## Question 6: How many species are in the UniProt taxonomy?

In [2]:
%endpoint http://sparql.uniprot.org/sparql
%format JSON

PREFIX up: <http://purl.uniprot.org/core/>

SELECT (count(distinct ?species) as ?numberSpecies)

WHERE {
    ?species a up:Taxon ;
               up:rank up:Species .
}

numberSpecies
2029846


## Question 7: How many species have at least one protein record? (this might take a long time to execute, so do this one last!)

In [2]:
%endpoint http://sparql.uniprot.org/sparql
%format JSON

PREFIX up: <http://purl.uniprot.org/core/>

SELECT (count(distinct ?species) as ?numberSpecies)

WHERE {
    ?protein a up:Protein ;
               up:organism ?species .
    ?species a up:Taxon ;
               up:rank up:Species .
}

numberSpecies
1057158


## Question 8: Find the AGI codes and gene names for all Arabidopsis thaliana proteins that have a protein function annotation description that mentions “pattern formation”

Regex in SPARQL: https://www.w3.org/TR/rdf-sparql-query/#funcex-regex

In [3]:
%endpoint http://sparql.uniprot.org/sparql
%format JSON

PREFIX up: <http://purl.uniprot.org/core/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>

SELECT distinct ?AGI ?geneName

WHERE {
    ?protein a up:Protein ;
               up:annotation ?annotation ;
               up:encodedBy ?gene ;
               up:organism ?taxon .
    ?taxon up:scientificName "Arabidopsis thaliana" .
    ?annotation a up:Function_Annotation ;
                  rdfs:comment ?functionAnnotation .
    FILTER regex(?functionAnnotation, "pattern formation", "i")
    ?gene up:locusName ?AGI ;
          skos:prefLabel ?geneName # from the UniProt SPARQL tutorial, query no. 8
} 

AGI,geneName
At3g54220,SCR
At4g21750,ATML1
At1g13980,GN
At5g40260,SWEET8
At1g69670,CUL3B
At1g63700,YDA
At2g46710,ROPGAP3
At1g26830,CUL3A
At3g09090,DEX1
At4g37650,SHR
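
SPARQL's `regex(?text, "pattern formation", "i")` succeeds when the pattern occurs anywhere in the text, ignoring case. As a rough illustration, the same test can be written with Python's `re` module. The annotation strings below are made up for the example and are not UniProt data.

```python
import re

# Made-up annotation comments, for illustration only (not UniProt data).
comments = [
    "Involved in embryonic pattern formation.",
    "Pattern Formation regulator in the root.",
    "Catalyzes the hydrolysis of ATP.",
]

# re.IGNORECASE plays the role of the "i" flag in SPARQL's regex().
pattern = re.compile(r"pattern formation", re.IGNORECASE)

# search() matches anywhere in the string, like SPARQL's regex().
matches = [text for text in comments if pattern.search(text)]
print(matches)
```

Without the `"i"` flag, the second string would not match, because of its capital letters.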


## Question 9: What is the MetaNetX Reaction identifier (starts with “mnxr”) for the UniProt Protein uniprotkb:Q18A79?

Source: https://www.metanetx.org/cgi-bin/mnxget/mnxref/MetaNetX_RDF_schema.pdf


In [12]:
%endpoint https://rdf.metanetx.org/sparql
%format JSON

PREFIX mnx: <https://rdf.metanetx.org/schema/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX uniprotkb: <http://purl.uniprot.org/uniprot/>

SELECT distinct ?ReactionIdentifiers

WHERE {
    ?reac a mnx:REAC ;
            rdfs:label ?ReactionIdentifiers .
    ?gpr a mnx:GPR ;
           mnx:cata ?cata ;
           mnx:reac ?reac .
    ?cata a mnx:CATA ;
            mnx:pept ?Protein .
    ?Protein a mnx:PEPT ;
               mnx:peptXref uniprotkb:Q18A79 .
} 


ReactionIdentifiers
mnxr165934
mnxr145046c3


## Question 10: What is the official Gene ID (UniProt calls this a “mnemonic”) and the MetaNetX Reaction identifier (mnxr…..) for the protein that has “Starch synthase” catalytic activity in Clostridium difficile (taxon 272563)?

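A federated query needs the SERVICE keyword: the pattern inside the SERVICE block is evaluated at the MetaNetX endpoint, and its results are joined with the UniProt results on the shared variable ?Protein.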


In [1]:
%endpoint http://sparql.uniprot.org/sparql
%format JSON

PREFIX taxon: <http://purl.uniprot.org/taxonomy/>
PREFIX up: <http://purl.uniprot.org/core/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX mnx: <https://rdf.metanetx.org/schema/>

SELECT distinct ?geneId ?ReactionIdentifiers

WHERE {
  ?Protein a up:Protein ;
           up:organism taxon:272563 ;
           up:enzyme ?enzyme ;
           up:mnemonic ?geneId .
  ?enzyme up:activity ?activity ;
          rdfs:comment ?functionAnnotation .
  FILTER regex(?functionAnnotation, "Starch synthase", "i")
  
  SERVICE <https://rdf.metanetx.org/sparql> {
    ?reac a mnx:REAC ;
            rdfs:label ?ReactionIdentifiers .
    ?gpr a mnx:GPR ;
           mnx:cata ?cata ;
           mnx:reac ?reac .
    ?cata a mnx:CATA ;
            mnx:pept ?pept .
    ?pept a mnx:PEPT ;
               mnx:peptXref ?Protein.
  }
}

geneId,ReactionIdentifiers
GLGA_CLOD6,mnxr165934
GLGA_CLOD6,mnxr145046c3
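
Conceptually, the SERVICE block yields a second table of variable bindings that is joined with the UniProt bindings on the shared variable `?Protein`. The sketch below imitates that join in plain Python. It is not how either endpoint is implemented, and the rows are illustrative: they reuse the identifiers from the outputs of Questions 9 and 10, assume that Q18A79 is the accession behind GLGA_CLOD6, and add one made-up row (`P00000`) that has no partner. The function name `join_on` is hypothetical.

```python
# Bindings as they might come back from the UniProt part of the query.
uniprot_rows = [
    {"Protein": "uniprotkb:Q18A79", "geneId": "GLGA_CLOD6"},
]

# Bindings as they might come back from the MetaNetX SERVICE block.
metanetx_rows = [
    {"Protein": "uniprotkb:Q18A79", "ReactionIdentifiers": "mnxr165934"},
    {"Protein": "uniprotkb:Q18A79", "ReactionIdentifiers": "mnxr145046c3"},
    {"Protein": "uniprotkb:P00000", "ReactionIdentifiers": "mnxr0"},  # made-up row
]


def join_on(left, right, key):
    """Inner-join two lists of binding dicts on a shared variable."""
    index = {}
    for right_row in right:
        index.setdefault(right_row[key], []).append(right_row)
    return [
        {**left_row, **right_row}
        for left_row in left
        for right_row in index.get(left_row[key], [])
    ]


joined = join_on(uniprot_rows, metanetx_rows, "Protein")
for row in joined:
    print(row["geneId"], row["ReactionIdentifiers"])
```

Only rows whose `Protein` value appears on both sides survive, which is why the made-up `P00000` row is dropped.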
