# Bioinformatics Programming Challenges
# Exercise 5
### Nicole Frontero
### Due: January 18, 2022

Here we are using **SPARQL Kernel for Jupyter Notebooks**.  The documentation for the "magic instructions" can be found [here](https://github.com/paulovn/sparql-kernel).

In [6]:
# Set the endpoint and format
%endpoint https://sparql.uniprot.org/sparql
%format JSON

### Q1: 1 POINT How many protein records are in UniProt?

In [5]:
PREFIX up: <http://purl.uniprot.org/core/>

SELECT (COUNT (?protein) AS ?num_protein_records) 
WHERE {
    ?protein a up:Protein .
}

num_protein_records
378979161


### Q2: 1 POINT How many Arabidopsis thaliana protein records are in UniProt?

In [4]:
PREFIX up: <http://purl.uniprot.org/core/>

SELECT (COUNT(?protein) AS ?Arabidopsis_thaliana_records)

WHERE {
    ?protein a up:Protein .
    ?protein up:organism ?species .
    ?species up:scientificName "Arabidopsis thaliana" .
    
}

Arabidopsis_thaliana_records
136447
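
The name match above can also be written against the taxon IRI directly, which avoids a string comparison. This is a minimal sketch, assuming that `taxon:3702` (the identifier used for Arabidopsis thaliana in Q8) denotes the same taxon as the one matched by name. It has not been re-run here, so no count is shown.

```sparql
PREFIX up: <http://purl.uniprot.org/core/>
PREFIX taxon: <http://purl.uniprot.org/taxonomy/>

SELECT (COUNT(?protein) AS ?Arabidopsis_thaliana_records)
WHERE {
    ?protein a up:Protein .
    # Assumption: taxon:3702 is Arabidopsis thaliana, as in Q8
    ?protein up:organism taxon:3702 .
}
```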


### Q3: 1 POINT retrieve pictures of Arabidopsis thaliana from UniProt

In [27]:
PREFIX up: <http://purl.uniprot.org/core/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>

SELECT ?image

WHERE
{
  ?taxon a up:Taxon .
  ?taxon up:scientificName "Arabidopsis thaliana" . 
  ?taxon foaf:depiction ?image .
  ?image a foaf:Image .
}

image
https://upload.wikimedia.org/wikipedia/commons/3/39/Arabidopsis.jpg
https://upload.wikimedia.org/wikipedia/commons/thumb/6/60/Arabidopsis_thaliana_inflorescencias.jpg/800px-Arabidopsis_thaliana_inflorescencias.jpg


### Q4: 1 POINT:  What is the description of the enzyme activity of UniProt Protein Q9SZZ8

In [5]:
PREFIX up:<http://purl.uniprot.org/core/>
PREFIX uniprotkb:<http://purl.uniprot.org/uniprot/>
PREFIX rdfs:<http://www.w3.org/2000/01/rdf-schema#> 

SELECT ?enzyme_activity

WHERE { 
    # The object of up:enzyme is an enzyme record, not a protein
    uniprotkb:Q9SZZ8 up:enzyme ?enzyme .
    ?enzyme up:activity ?activity .
    ?activity rdfs:label ?enzyme_activity .
}

enzyme_activity
all-trans-beta-carotene + 4 H(+) + 2 O2 + 4 reduced [2Fe-2S]-[ferredoxin] = all-trans-zeaxanthin + 2 H2O + 4 oxidized [2Fe-2S]-[ferredoxin].


This result matches the UniProt entry page for Q9SZZ8: https://www.uniprot.org/uniprotkb/Q9SZZ8/entry

### Q5: 1 POINT: Retrieve the protein IDs and date of submission for 5 proteins that have been added to UniProt this year (HINT: Google for “SPARQL FILTER by date”)

Running this query with 2023 as the year returned no results, so I set the start date to 2022-01-01 instead.

In [39]:
PREFIX up:<http://purl.uniprot.org/core/> 
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

SELECT ?id ?date 
WHERE
{
    ?protein a up:Protein . 
    ?protein up:created ?date .
    # From 2022-01-01 (inclusive) up to, but not including, 2023-01-01
    FILTER(?date >= xsd:date("2022-01-01") && ?date < xsd:date("2023-01-01")) . 
    BIND (REPLACE(STR(?protein), "http://purl.uniprot.org/uniprot/", "") AS ?id) .
} 
ORDER BY ?date 
LIMIT 5

id,date
A0A8D9QAN0,2022-01-19
A0A8D9QMM1,2022-01-19
A0A8E2P2A0,2022-01-19
A0A8F8DND0,2022-01-19
A0A8F9HBZ8,2022-01-19
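
The `REPLACE` call in the query above treats its second argument as a regular expression, so the dots in the namespace are matched as wildcards. A possible alternative is `STRAFTER`, which does a plain string match. The sketch below uses one protein IRI from the results above as inline data, so it does not depend on the UniProt contents and should return the single id `A0A8D9QAN0`.

```sparql
SELECT ?id
WHERE {
    # Inline data: one protein IRI taken from the results above
    VALUES ?protein { <http://purl.uniprot.org/uniprot/A0A8D9QAN0> }
    # STRAFTER returns the part of the string that follows the given prefix
    BIND (STRAFTER(STR(?protein), "http://purl.uniprot.org/uniprot/") AS ?id)
}
```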


### Q6: 1 POINT How  many species are in the UniProt taxonomy?

It is not clear whether the question asks for distinct species, but counting distinct species makes more sense, so I use `DISTINCT` here.

In [15]:
PREFIX up:<http://purl.uniprot.org/core/> 
SELECT (COUNT (DISTINCT ?species) AS ?total)
WHERE {
    ?species a up:Taxon .
    ?species up:rank up:Species .
}

total
1995728


### Q7: 2 POINT How many species have at least one protein record? 
(this might take a long time to execute, so do this one last!)

In [9]:
PREFIX up:<http://purl.uniprot.org/core/> 

SELECT (COUNT(DISTINCT ?taxon) AS ?atleast_one_protein_record)
WHERE 
{
    ?protein a up:Protein ;
             up:organism ?taxon .
    ?taxon a up:Taxon ;
           up:rank up:Species .
}

atleast_one_protein_record
1078469
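
`DISTINCT` matters in this query because each species appears once per matching protein. The sketch below illustrates the difference on a small, entirely hypothetical data set (the `urn:example:` IRIs are made up). It should report 3 rows but only 2 species.

```sparql
SELECT (COUNT(?taxon) AS ?rows) (COUNT(DISTINCT ?taxon) AS ?species)
WHERE {
    # Hypothetical data: two proteins from species A, one from species B
    VALUES (?protein ?taxon) {
        (<urn:example:p1> <urn:example:speciesA>)
        (<urn:example:p2> <urn:example:speciesA>)
        (<urn:example:p3> <urn:example:speciesB>)
    }
}
```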


### Q8: 3 points:  find the AGI codes and gene names for all Arabidopsis thaliana  proteins that have a protein function annotation description that mentions “pattern formation”

In [3]:
PREFIX up: <http://purl.uniprot.org/core/>
PREFIX taxon:<http://purl.uniprot.org/taxonomy/> 
PREFIX rdfs:<http://www.w3.org/2000/01/rdf-schema#>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>

SELECT ?agi ?gene_name

WHERE
{
    ?protein a up:Protein .
    # Restrict to Arabidopsis thaliana (taxon 3702)
    ?protein up:organism taxon:3702 .
    # Gene that encodes the protein
    ?protein up:encodedBy ?gene . 
    # Annotations attached to the protein
    ?protein up:annotation ?annotation . 
    # AGI locus code of the gene
    ?gene up:locusName ?agi . 
    # Preferred label (gene name) of the gene
    ?gene skos:prefLabel ?gene_name . 
    # Keep only function annotations and read their text
    ?annotation a up:Function_Annotation . 
    ?annotation rdfs:comment ?text .
    # Keep only annotations whose text mentions "pattern formation"
    FILTER CONTAINS(?text, "pattern formation") . 
}

agi,gene_name
At1g13980,GN
At3g02130,RPK2
At1g69270,RPK1
At5g37800,RSL1
At1g26830,CUL3A
At1g66470,RHD6
At3g09090,DEX1
At5g55250,IAMT1
At1g63700,YDA
At4g21750,ATML1
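
`CONTAINS` is case-sensitive, so an annotation that writes "Pattern formation" with a capital letter would be missed by the query above. One way to guard against that, shown here as a sketch on hypothetical annotation texts rather than real UniProt data, is to lower-case the text before matching. The first two texts should be returned and the third filtered out.

```sparql
SELECT ?text
WHERE {
    # Hypothetical annotation texts, not taken from UniProt
    VALUES ?text {
        "Involved in Pattern formation."
        "Required for embryonic pattern formation."
        "Unrelated function."
    }
    # Lower-case the text first so capitalisation does not matter
    FILTER CONTAINS(LCASE(?text), "pattern formation")
}
```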


### Q9: 4 POINTS: what is the MetaNetX Reaction identifier (starts with “mnxr”) for the UniProt Protein uniprotkb: Q18A79

From the MetaNetX metabolic networks for metagenomics database
SPARQL Endpoint: https://rdf.metanetx.org/sparql
(this slide deck will make it much easier for you!
https://www.metanetx.org/cgi-bin/mnxget/mnxref/MetaNetX_RDF_schema.pdf)

In [22]:
# Set new endpoint
%endpoint https://rdf.metanetx.org/sparql

In [25]:
PREFIX uniprotkb: <http://purl.uniprot.org/uniprot/>
PREFIX mnx: <https://rdf.metanetx.org/schema/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT DISTINCT ?mnxrlabel

WHERE {
    # Getting the mnxpept of the UniProtKB protein Q18A79
    ?mnxpept mnx:peptXref uniprotkb:Q18A79 . 
    # Getting the mnxcata from the previous mnxpept
    ?mnxcata mnx:pept ?mnxpept . 
    # Getting mnxgpr using mnxcata
    ?mnxgpr mnx:cata ?mnxcata . 
    # Getting mnxreac
    ?mnxgpr mnx:reac ?mnxreac . 
    # Getting the label of that mnxreac (mnxrlabel)
    ?mnxreac rdfs:label ?mnxrlabel . 
}

mnxrlabel
mnxr165934
mnxr145046c3
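
The chain of triple patterns above can also be collapsed into a single SPARQL 1.1 property path. This is a sketch of the same traversal, assuming the MetaNetX endpoint supports property paths. It has not been re-run here, so no output is shown.

```sparql
PREFIX uniprotkb: <http://purl.uniprot.org/uniprot/>
PREFIX mnx: <https://rdf.metanetx.org/schema/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT DISTINCT ?mnxrlabel
WHERE {
    # Walk reaction <- gpr -> cata -> pept -> UniProt cross-reference.
    # ^mnx:reac follows mnx:reac backwards, from the reaction to its gpr.
    ?mnxreac ^mnx:reac/mnx:cata/mnx:pept/mnx:peptXref uniprotkb:Q18A79 ;
             rdfs:label ?mnxrlabel .
}
```

The intermediate variables (`?mnxpept`, `?mnxcata`, `?mnxgpr`) disappear in this form, which is convenient when only the reaction label is needed.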


## FEDERATED QUERY - UniProt and MetaNetX 
### Q10: 5 POINTS: What is the official locus name, and the MetaNetX Reaction identifier (mnxr.....) for the protein that has “glycine reductase” catalytic activity in Clostridium difficile (taxon 272563). (this must be executed on the https://rdf.metanetx.org/sparql endpoint)

In [12]:
%endpoint https://rdf.metanetx.org/sparql