## UniProt SPARQL Endpoint

Here we set the endpoint for the following questions to UniProt and the output format to JSON.

In [15]:
%endpoint https://sparql.uniprot.org/sparql
%format JSON  
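
Outside the SPARQL kernel, these two settings roughly correspond to an HTTP request that carries the query in a `query` parameter and asks for `application/sparql-results+json`. The following is a minimal sketch using only the Python standard library. `build_request` and `rows_from_json` are hypothetical helper names, and `SAMPLE` is a hand-written document in the standard SPARQL JSON results layout, not real endpoint output.

```python
import json
import urllib.parse
import urllib.request

ENDPOINT = "https://sparql.uniprot.org/sparql"  # same URL as the %endpoint magic


def build_request(query, endpoint=ENDPOINT):
    """Prepare (but do not send) a GET request asking for JSON results."""
    url = endpoint + "?" + urllib.parse.urlencode({"query": query})
    return urllib.request.Request(
        url, headers={"Accept": "application/sparql-results+json"}
    )


def rows_from_json(text):
    """Flatten a SPARQL JSON result into a list of {variable: value} dicts."""
    doc = json.loads(text)
    return [
        {var: cell["value"] for var, cell in binding.items()}
        for binding in doc["results"]["bindings"]
    ]


# Hand-written sample in the standard results layout (assumption, not real data).
SAMPLE = """
{"head": {"vars": ["s"]},
 "results": {"bindings": [{"s": {"type": "uri", "value": "http://example.org/a"}}]}}
"""

request = build_request("SELECT ?s WHERE { ?s ?p ?o } LIMIT 1")
rows = rows_from_json(SAMPLE)
print(request.full_url)
print(rows)
```

Passing `request` to `urllib.request.urlopen` would send it; the sketch stops short of that so it runs without network access.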

**1. How many protein records are in UniProt?**

We use COUNT to obtain the number of resources typed as up:Protein.

In [18]:
PREFIX up: <http://purl.uniprot.org/core/>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
SELECT (COUNT (?protein) AS ?many_prots) 
WHERE 
{
    ?protein rdf:type up:Protein . 
}

many_prots
378979161


**2. How many Arabidopsis thaliana protein records are in UniProt?**

Reusing the previous query, we add up:organism taxon:3702 to keep only the protein records that belong to Arabidopsis thaliana (3702 is the taxonomy id of this species).

In [3]:
PREFIX up: <http://purl.uniprot.org/core/>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX taxon: <http://purl.uniprot.org/taxonomy/>

SELECT (COUNT (DISTINCT ?protein) AS ?how_many_prots)
WHERE 
{
    ?protein rdf:type up:Protein ; 
             up:organism taxon:3702 .                                 
}

how_many_prots
136447


**3. Retrieve pictures of Arabidopsis thaliana from UniProt.**

Using the taxon id for A. thaliana again, foaf:depiction lets us retrieve its pictures.

In [4]:
PREFIX taxon: <http://purl.uniprot.org/taxonomy/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>

SELECT ?picture 
WHERE {
  taxon:3702 foaf:depiction ?picture . 
}

picture
https://upload.wikimedia.org/wikipedia/commons/3/39/Arabidopsis.jpg
https://upload.wikimedia.org/wikipedia/commons/thumb/6/60/Arabidopsis_thaliana_inflorescencias.jpg/800px-Arabidopsis_thaliana_inflorescencias.jpg


**4. What is the description of the enzyme activity of UniProt Protein Q9SZZ8?**

Here ?just_enzyme is bound to the enzyme record linked to the protein, and ?activity_des to the label of that enzyme's activity. We use the uniprotkb prefix to restrict the query to the desired protein (Q9SZZ8).

In [5]:
PREFIX uniprotkb: <http://purl.uniprot.org/uniprot/>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX up: <http://purl.uniprot.org/core/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT ?just_enzyme ?activity_des 
WHERE 
{
    uniprotkb:Q9SZZ8 rdf:type up:Protein ; 
                     up:enzyme ?just_enzyme . 
    ?just_enzyme up:activity ?enzyme_activity . 
    ?enzyme_activity rdfs:label ?activity_des . 
}

just_enzyme,activity_des
http://purl.uniprot.org/enzyme/1.14.15.24,all-trans-beta-carotene + 4 H(+) + 2 O2 + 4 reduced [2Fe-2S]-[ferredoxin] = all-trans-zeaxanthin + 2 H2O + 4 oxidized [2Fe-2S]-[ferredoxin].


**5. Retrieve the protein ids, and date of submission, for 5 proteins that have been added to UniProt this year.**

Here we extract two things, the id and the date. We retrieve ?protein_id in mnemonic format (the UniProt entry name) together with its ?adition_date. The ?adition_date is then used in the FILTER to keep only the records created in 2022 (strictly after 2022-01-01 and before 2023-01-01). Finally, LIMIT 5 keeps just 5 records; since the query has no ORDER BY, these are the first five matches the endpoint returns, not necessarily the earliest records of 2022.





In [6]:
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX up: <http://purl.uniprot.org/core/>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>

SELECT ?protein_id ?adition_date
WHERE
{
  ?protein rdf:type up:Protein ;
           up:mnemonic ?protein_id ; 
           up:created ?adition_date . 
  FILTER( ?adition_date > xsd:date("2022-01-01") && ?adition_date < xsd:date("2023-01-01")) .
}
LIMIT 5


protein_id,adition_date
A0A8E0N8L5_ECOLX,2022-01-19
A0A8F9CQZ7_ECOLX,2022-01-19
A0A8F9ICG9_ECOLX,2022-01-19
A0A8F8WH98_PSEAI,2022-01-19
A0A8F9NZK3_PSEAI,2022-01-19
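
Two details of the filter are easy to miss: `>` excludes records created on 1 January itself, and `LIMIT` without `ORDER BY` makes no promise about which five rows come back. The sketch below builds a variant of the query as a string, with an inclusive lower bound and an explicit sort. `created_in_year_query` is a hypothetical helper, the variant has not been run against the endpoint, and sorting the full protein set may be slow there.

```python
from datetime import date


def created_in_year_query(year, limit=5):
    """Return a SPARQL query for proteins created in `year` (half-open range)."""
    start = date(year, 1, 1).isoformat()      # inclusive lower bound
    end = date(year + 1, 1, 1).isoformat()    # exclusive upper bound
    return f"""PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX up: <http://purl.uniprot.org/core/>

SELECT ?protein_id ?adition_date
WHERE {{
  ?protein a up:Protein ;
           up:mnemonic ?protein_id ;
           up:created ?adition_date .
  FILTER(?adition_date >= "{start}"^^xsd:date && ?adition_date < "{end}"^^xsd:date)
}}
ORDER BY ?adition_date
LIMIT {limit}"""


query = created_in_year_query(2022)
print(query)
```

The half-open range (`>=` start, `<` next start) covers every day of the year exactly once, so the same helper works for any year.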


**6. How many species are in the UniProt taxonomy?**

We use COUNT again to obtain the number of species (?num_species): first we require ?species to be of type up:Taxon (with rdf:type), then we require its rank to be up:Species.

In [7]:
PREFIX up: <http://purl.uniprot.org/core/>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>

SELECT (COUNT(DISTINCT ?species) AS ?num_species)
WHERE {
 ?species rdf:type up:Taxon ; 
          up:rank up:Species . 
}

num_species
1995728


**7. How many species have at least one protein record?**

We reuse the species count from question 6 and add, before it, the patterns that require ?protein to be of rdf:type up:Protein and its organism to be that species.





In [8]:
PREFIX up: <http://purl.uniprot.org/core/>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>

SELECT (COUNT(DISTINCT ?species) AS ?num_species)
WHERE {
 ?protein rdf:type up:Protein ; 
          up:organism ?species . 
  ?species rdf:type up:Taxon ; 
          up:rank up:Species . 
}

num_species
1078469
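
Combining this count with the one from question 6 gives the share of species in the UniProt taxonomy that have at least one protein record. This is a small arithmetic check using the two numbers reported in this notebook; they reflect the release queried at the time and will change.

```python
species_total = 1995728         # question 6: taxa with rank Species
species_with_protein = 1078469  # question 7: species with at least one protein

share = species_with_protein / species_total
species_without_protein = species_total - species_with_protein

print(f"{share:.1%} of species have at least one protein record")
print(f"{species_without_protein} species have none")
```

So slightly more than half of the species in the taxonomy are backed by at least one protein entry.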


**8. Find the AGI codes and gene names for all Arabidopsis thaliana proteins that have a protein function annotation description that mentions “pattern formation”**

First of all, we filter by the taxon id for A. thaliana as before. We then use up:annotation to reach the annotation text and a FILTER to keep only the annotations whose comment contains "pattern formation". Note that up:annotation matches every annotation type, not only function annotations, and that CONTAINS is case-sensitive.

In [9]:
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX up: <http://purl.uniprot.org/core/>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX taxon: <http://purl.uniprot.org/taxonomy/>

SELECT ?gene_name ?AGI_code 
WHERE {
  ?protein rdf:type up:Protein ; 
           up:organism taxon:3702 ; 
           up:annotation ?annotate ; 
           up:encodedBy ?gene . 
  ?annotate rdfs:comment ?function . 
  ?gene up:locusName ?AGI_code ; 
        skos:prefLabel ?gene_name . 
  FILTER (CONTAINS (?function, "pattern formation")) 
}

gene_name,AGI_code
GN,At1g13980
RPK2,At3g02130
RPK1,At1g69270
SEC23A,At4g01810
RSL1,At5g37800
CUL3A,At1g26830
MED13,At1g55325
RHD6,At1g66470
DEX1,At3g09090
DEX1,At3g09090
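
The DEX1 row appears twice, most likely because more than one solution (for example two matching annotations, or two protein entries) leads to the same gene. Adding `DISTINCT` to the `SELECT` would be the usual fix inside SPARQL. As a sketch of the same idea on the client side, the rows above can be deduplicated while keeping their order:

```python
rows = [
    ("GN", "At1g13980"),
    ("RPK2", "At3g02130"),
    ("RPK1", "At1g69270"),
    ("SEC23A", "At4g01810"),
    ("RSL1", "At5g37800"),
    ("CUL3A", "At1g26830"),
    ("MED13", "At1g55325"),
    ("RHD6", "At1g66470"),
    ("DEX1", "At3g09090"),
    ("DEX1", "At3g09090"),
]

# dict.fromkeys keeps the first occurrence of each key and preserves order.
unique_rows = list(dict.fromkeys(rows))

for gene_name, agi_code in unique_rows:
    print(gene_name, agi_code)
```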


## MetaNetX SPARQL Endpoint

Here we set the endpoint for the following questions to MetaNetX and the output format to JSON.


In [10]:
%endpoint https://rdf.metanetx.org/sparql
%format JSON  

**9. What is the MetaNetX Reaction identifier (starts with “mnxr”) for the UniProt Protein uniprotkb:Q18A79?**

First we select the peptide entry (mnx:PEPT) that cross-references the protein asked about in the question. We then find the catalyst related to that peptide, which leads us to the reaction and, through its label, to the reaction identifier in MetaNetX.

In [11]:
PREFIX mnx: <https://rdf.metanetx.org/schema/>
PREFIX up: <http://purl.uniprot.org/core/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX uniprotkb: <http://purl.uniprot.org/uniprot/>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>

SELECT DISTINCT ?reaction_id 
WHERE{
    ?protein rdf:type mnx:PEPT ; 
             mnx:peptXref uniprotkb:Q18A79 .
    ?catalyst rdf:type mnx:CATA ; 
              mnx:pept ?protein .
    ?gen_prot_react mnx:cata ?catalyst ;
                    mnx:reac ?reaction .
    ?reaction rdfs:label ?reaction_id .
}



reaction_id
mnxr165934
mnxr145046c3


## FEDERATED QUERY - UniProt and MetaNetX

**10. What is the official locus name, and the MetaNetX Reaction identifier (mnxr…..) for the protein that has “glycine reductase” catalytic activity in Clostridium difficile (taxon 272563)? (this must be executed on the https://rdf.metanetx.org/sparql endpoint)**

To complete this task we use a federated query, which lets a single SPARQL query reach remote SPARQL endpoints through SERVICE clauses: in this case we use both of the previous endpoints (UniProt and MetaNetX).

First we query the UniProt service. Here we filter by taxon (as we did previously) and find the gene that encodes the protein to obtain the locus name. We also follow up:classifiedWith to the classification term (for example a GO term) and read its label, which we filter to keep only the terms whose label contains "glycine reductase".
In the MetaNetX service we reuse the query from the previous question, applied to the proteins selected by the UniProt service.

In [12]:
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX up: <http://purl.uniprot.org/core/>
PREFIX taxon: <http://purl.uniprot.org/taxonomy/>
PREFIX mnx: <https://rdf.metanetx.org/schema/>

SELECT ?locus_name ?reaction_id
WHERE {
    SERVICE <http://sparql.uniprot.org/sparql> {
        SELECT DISTINCT ?locus_name ?protein
        WHERE {
            ?protein rdf:type up:Protein ;
                     up:organism taxon:272563 ;
                     up:encodedBy ?gene ;
                     up:classifiedWith ?g_term .
            ?gene up:locusName ?locus_name .
            ?g_term rdfs:label ?info .
            FILTER (CONTAINS (?info, "glycine reductase"))
        }
    }
    SERVICE <https://rdf.metanetx.org/sparql> {
        SELECT DISTINCT ?protein ?reaction_id
        WHERE {
            #?prot rdf:type mnx:PEPT ;
            ?prot mnx:peptXref ?protein .
            ?catalyst rdf:type mnx:CATA ;
                      mnx:pept ?prot .
            ?gen_prot_react mnx:cata ?catalyst ;
                            mnx:reac ?reaction .
            ?reaction rdfs:label ?reaction_id .
        }
    }
}

locus_name,reaction_id
CD630_23490,mnxr157884c3
CD630_23490,mnxr162774c3
CD630_23520,mnxr157884c3
CD630_23520,mnxr162774c3
CD630_23510,mnxr157884c3
CD630_23510,mnxr162774c3
CD630_23540,mnxr157884c3
CD630_23480,mnxr157884c3
CD630_23480,mnxr162774c3

