## Assignment 5 SPARQL queries

Create the SPARQL query that will answer each of these questions.

In [5]:
%endpoint http://sparql.uniprot.org/sparql
%format JSON
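
Outside the SPARQL notebook kernel, the same endpoint can be queried over plain HTTP. The sketch below is one possible way to do that with Python's standard library, assuming the endpoint accepts the standard SPARQL protocol `query` parameter and returns SPARQL JSON results when asked through the `Accept` header. The helper names `build_request` and `run_query` are made up for illustration, and `run_query` needs network access.

```python
import json
import urllib.parse
import urllib.request

# Endpoint taken from the %endpoint line above.
ENDPOINT = "http://sparql.uniprot.org/sparql"


def build_request(query, endpoint=ENDPOINT):
    """Build a GET request carrying the query in the standard `query` parameter."""
    url = endpoint + "?" + urllib.parse.urlencode({"query": query})
    # Ask for SPARQL JSON results, mirroring `%format JSON`.
    return urllib.request.Request(
        url, headers={"Accept": "application/sparql-results+json"}
    )


def run_query(query, endpoint=ENDPOINT):
    """Send the query (requires network access) and return the result rows."""
    with urllib.request.urlopen(build_request(query, endpoint)) as response:
        data = json.load(response)
    # SPARQL JSON results keep the rows under results -> bindings.
    return data["results"]["bindings"]
```

Each row returned by `run_query` is a dictionary keyed by variable name, so a count such as `?protcount` would be read from `row["protcount"]["value"]`.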

### Q1: 1 POINT: How many protein records are in UniProt?

In [3]:
PREFIX up: <http://purl.uniprot.org/core/>

SELECT (COUNT (?protein) AS ?protcount)
# Counting DISTINCT proteins would be more correct, but the query takes too long to run:
# SELECT (COUNT(DISTINCT ?protein) AS ?protcount)

WHERE
{
    ?protein a up:Protein .
}

protcount
360157660


### Q2: 1 POINT: How many Arabidopsis thaliana protein records are in UniProt?

In [4]:
PREFIX up: <http://purl.uniprot.org/core/> 
PREFIX taxon: <http://purl.uniprot.org/taxonomy/> 

SELECT (COUNT(DISTINCT ?protein) AS ?proteincount)
WHERE 
{
    ?protein a up:Protein .
    ?protein up:organism taxon:3702 .
}

proteincount
136782


### Q3: 1 POINT: Retrieve pictures of Arabidopsis thaliana from UniProt.

In [5]:
PREFIX taxon: <http://purl.uniprot.org/taxonomy/> 
PREFIX foaf: <http://xmlns.com/foaf/0.1/>

SELECT ?image
WHERE
{
    taxon:3702  foaf:depiction  ?image .
}

image
https://upload.wikimedia.org/wikipedia/commons/3/39/Arabidopsis.jpg
https://upload.wikimedia.org/wikipedia/commons/thumb/6/60/Arabidopsis_thaliana_inflorescencias.jpg/800px-Arabidopsis_thaliana_inflorescencias.jpg


### Q4: 1 POINT: What is the description of the enzyme activity of UniProt protein Q9SZZ8?

In [6]:
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX uniprotkb: <http://purl.uniprot.org/uniprot/>  
PREFIX up: <http://purl.uniprot.org/core/> 

SELECT DISTINCT ?description
WHERE
{
  uniprotkb:Q9SZZ8 a up:Protein ;
                   up:annotation ?annotation .
  
  ?annotation a up:Function_Annotation ;
               rdfs:comment ?description .
}

description
Nonheme diiron monooxygenase involved in the biosynthesis of xanthophylls. Specific for beta-ring hydroxylations of beta-carotene. Has also a low activity toward the beta- and epsilon-rings of alpha-carotene. No activity with acyclic carotenoids such as lycopene and neurosporene. Uses ferredoxin as an electron donor.


### Q5: 1 POINT: Retrieve the protein IDs and dates of submission for proteins that have been added to UniProt this year (HINT: Google for “SPARQL FILTER by date”).

In [None]:
PREFIX up:<http://purl.uniprot.org/core/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

SELECT ?id ?date
WHERE
{
  ?protein a up:Protein ;
           up:mnemonic ?id ;
           up:created ?date .
  # "This year" is hard-coded as 2021; >= keeps records created on 1 January as well.
  FILTER(?date >= "2021-01-01"^^xsd:date) .
}
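
The cutoff date in the filter is hard-coded. As a rough sketch, the filter line could instead be generated for whatever the current year is. The function name `year_start_filter` is hypothetical, and the `>=` comparison is an assumption that records created on 1 January should count as "this year".

```python
from datetime import date


def year_start_filter(today=None):
    """Return a SPARQL FILTER line keeping dates from 1 January of today's year."""
    today = today or date.today()
    year_start = date(today.year, 1, 1)
    # isoformat() yields YYYY-MM-DD, the lexical form of xsd:date.
    return f'FILTER(?date >= "{year_start.isoformat()}"^^xsd:date)'


print(year_start_filter(date(2021, 6, 15)))
# prints: FILTER(?date >= "2021-01-01"^^xsd:date)
```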

### Q6: 1 POINT: How many species are in the UniProt taxonomy?

In [6]:
PREFIX up:<http://purl.uniprot.org/core/> 

SELECT (COUNT (DISTINCT ?species) AS ?specount)
WHERE
{
    ?species a up:Taxon ;
             up:rank up:Species .
}

specount
2029846


### Q7: 2 POINTS: How many species have at least one protein record? (This might take a long time to execute, so do this one last!)

In [6]:
PREFIX up:<http://purl.uniprot.org/core/> 

SELECT (COUNT(DISTINCT ?species) AS ?speciesnum)
WHERE 
{
    ?protein a up:Protein .
    ?protein up:organism ?species .
    ?species a up:Taxon .
    ?species up:rank up:Species .
}

speciesnum
1057158


### Q8: 3 POINTS: Find the AGI codes and gene names for all Arabidopsis thaliana proteins whose protein function annotation description mentions “pattern formation”.

In [5]:
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX taxon: <http://purl.uniprot.org/taxonomy/>
PREFIX up: <http://purl.uniprot.org/core/>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>

SELECT ?gene_name ?agi_code
WHERE
{
    ?protein a up:Protein ;
             up:organism taxon:3702 ;
             up:recommendedName ?n ;
             up:encodedBy ?gene ;
             up:annotation ?annotation .
    ?n up:shortName ?gene_name . 
    ?gene up:locusName ?agi_code .
    ?annotation a up:Function_Annotation ;
                rdfs:comment ?comment .
    FILTER regex( ?comment, "pattern formation","i")
}

gene_name,agi_code
AtSCR,At3g54220
AtCUL3b,At1g69670
AtSWEET8,At5g40260
AtCUL3a,At1g26830
AtSHR,At4g37650
AtRopGEF7,At5g02010
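
The `"i"` flag in the `regex` filter makes the match case-insensitive, and the pattern may occur anywhere in the comment. A small Python analogue of that behaviour is sketched below; the function name and the two sample comments are invented for illustration and are not UniProt data.

```python
import re


def mentions_pattern_formation(comment):
    """Case-insensitive search anywhere in the text, like regex(..., "i")."""
    return re.search("pattern formation", comment, re.IGNORECASE) is not None


# Invented sample comments, not taken from UniProt.
print(mentions_pattern_formation("Involved in Pattern Formation of the root."))  # True
print(mentions_pattern_formation("Involved in starch biosynthesis."))            # False
```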


### Q9: 4 POINTS: What is the MetaNetX Reaction identifier (starts with “mnxr”) for the UniProt protein uniprotkb:Q18A79?

In [1]:
%endpoint https://rdf.metanetx.org/sparql

In [3]:
PREFIX mnx: <https://rdf.metanetx.org/schema/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX uniprotkb: <http://purl.uniprot.org/uniprot/>

SELECT DISTINCT ?reac_identifier
WHERE{
    ?pept mnx:peptXref uniprotkb:Q18A79 .
    ?cata mnx:pept ?pept .
    ?gpr mnx:cata ?cata ;
         mnx:reac ?reac .
    ?reac rdfs:label ?reac_identifier .
}

reac_identifier
mnxr165934
mnxr145046c3


### Q10: 5 POINTS: What are the official Gene ID (UniProt calls this a “mnemonic”) and the MetaNetX Reaction identifier (mnxr…) for the protein that has “Starch synthase” catalytic activity in Clostridium difficile (taxon 272563)?

In [3]:
PREFIX mnx: <https://rdf.metanetx.org/schema/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX up: <http://purl.uniprot.org/core/>
PREFIX taxon: <http://purl.uniprot.org/taxonomy/>

SELECT DISTINCT ?mnemonic ?reac_label
WHERE
{
  SERVICE <http://sparql.uniprot.org/sparql> {
    ?protein a up:Protein ;
             up:organism taxon:272563 ;
             up:mnemonic ?mnemonic ;
             up:classifiedWith ?goTerm .
    ?goTerm rdfs:label ?activity .
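    FILTER CONTAINS(?activity, "starch synthase")
    # SUBSTR is 1-based: position 33 is the first character after the
    # 32-character prefix "http://purl.uniprot.org/uniprot/", i.e. the accession.
    BIND (SUBSTR(STR(?protein), 33) AS ?ac)
    # Rebuild the protein IRI in the form that mnx:peptXref points to (as in Q9).
    BIND (IRI(CONCAT("http://purl.uniprot.org/uniprot/", ?ac)) AS ?proteinRef)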
  }
  SERVICE <https://rdf.metanetx.org/sparql> {
    ?pept mnx:peptXref ?proteinRef .
    ?cata mnx:pept ?pept .
    ?gpr mnx:cata ?cata ;
         mnx:reac ?reac .
    ?reac rdfs:label ?reac_label .
  }
}

mnemonic,reac_label
GLGA_CLOD6,mnxr165934
GLGA_CLOD6,mnxr145046c3
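
The offset 33 passed to `substr` in the federated query depends on the length of the UniProt protein IRI prefix. The sketch below checks that arithmetic in Python; the helper name `accession_from_iri` is hypothetical, and Q18A79 (the accession from Q9) is used only as a sample input.

```python
PREFIX = "http://purl.uniprot.org/uniprot/"


def accession_from_iri(iri):
    """Mimic SUBSTR(STR(?protein), 33): drop the prefix, keep the accession."""
    if not iri.startswith(PREFIX):
        raise ValueError(f"not a UniProt protein IRI: {iri}")
    # SPARQL SUBSTR counts from 1, Python slices from 0, so 33 becomes 32.
    return iri[len(PREFIX):]


print(len(PREFIX))                            # 32
print(accession_from_iri(PREFIX + "Q18A79"))  # Q18A79
```

Deriving the offset from the prefix length, rather than writing 33 directly, would keep the extraction correct if the prefix ever changed.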
