# Bioinformatics programming challenges. Pablo Catarecha. Assignment 5.

Question 1.- How many protein records are in UniProt?

In [1]:
%endpoint https://sparql.uniprot.org/sparql
%format JSON

PREFIX up: <http://purl.uniprot.org/core/>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>

select (count (?s) as ?no_of_proteins) where {
    ?s a up:Protein.
}


no_of_proteins
378979161


Question 2.- How many Arabidopsis thaliana protein records are in UniProt?

In [2]:
%endpoint https://sparql.uniprot.org/sparql
%format JSON

PREFIX up: <http://purl.uniprot.org/core/>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>

select (count (?s) as ?no_of_Arathprots) where {
    ?s a up:Protein;
        up:organism ?taxon.
    ?taxon up:scientificName 'Arabidopsis thaliana'.
}

no_of_Arathprots
136447
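
The query above finds the organism through its scientific-name string. As an alternative sketch, the organism can be referenced directly by its taxonomy IRI: NCBI taxon 3702 is Arabidopsis thaliana, and the `taxon:` prefix is the same one used in question 10. This variant has not been run here, so no count is shown.

```sparql
%endpoint https://sparql.uniprot.org/sparql
%format JSON

PREFIX up: <http://purl.uniprot.org/core/>
PREFIX taxon: <http://purl.uniprot.org/taxonomy/>

# Count proteins whose organism is taxon 3702, given by IRI
# instead of being looked up through up:scientificName.
select (count(?s) as ?no_of_Arathprots) where {
    ?s a up:Protein;
        up:organism taxon:3702.
}
```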


Question 3.- Retrieve pictures of Arabidopsis thaliana from UniProt.

In [3]:
%endpoint https://sparql.uniprot.org/sparql
%format JSON

PREFIX up: <http://purl.uniprot.org/core/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>

select ?Arath_images where {
    ?taxon foaf:depiction ?Arath_images;
        up:scientificName 'Arabidopsis thaliana'.
}

Arath_images
https://upload.wikimedia.org/wikipedia/commons/3/39/Arabidopsis.jpg
https://upload.wikimedia.org/wikipedia/commons/thumb/6/60/Arabidopsis_thaliana_inflorescencias.jpg/800px-Arabidopsis_thaliana_inflorescencias.jpg


Question 4.- What is the description of the enzyme activity of UniProt Protein Q9SZZ8?

In [4]:
%endpoint https://sparql.uniprot.org/sparql
%format JSON

PREFIX up: <http://purl.uniprot.org/core/>
PREFIX uniprotkb: <http://purl.uniprot.org/uniprot/>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>

select ?enzyme_activity ?enzyme_act_description where {
    uniprotkb:Q9SZZ8 up:enzyme ?enzyme_activity.
    ?enzyme_activity skos:prefLabel ?enzyme_act_description.
}

enzyme_activity,enzyme_act_description
http://purl.uniprot.org/enzyme/1.14.15.24,beta-carotene 3-hydroxylase


Question 5.- Retrieve the protein IDs and dates of submission for proteins that have been added to UniProt this year.

In [None]:
# This query takes surprisingly long to run in the Jupyter notebook.
# Run directly on the SPARQL endpoint, it returns in seconds.
# I could not find the reason for the difference.
# The query itself is correct.

%endpoint https://sparql.uniprot.org/sparql
%format JSON

PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX up: <http://purl.uniprot.org/core/>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>

select ?id ?date  where {
    ?id a up:Protein;
        up:created ?date.
    filter (?date > "2021-12-30"^^xsd:date)
}
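
The query above returns one row per matching protein, so the result set can be large. It may be the transfer and rendering of those rows that slows the notebook down, although that is an assumption and not something confirmed by the original run. A minimal sketch that counts the matching records instead of listing them, using the same triple pattern with the date typed as `xsd:date` (the variable name `?no_of_new_proteins` is made up for illustration):

```sparql
%endpoint https://sparql.uniprot.org/sparql
%format JSON

PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX up: <http://purl.uniprot.org/core/>

# Return a single count instead of one row per protein.
select (count(?id) as ?no_of_new_proteins) where {
    ?id a up:Protein;
        up:created ?date.
    filter (?date > "2021-12-30"^^xsd:date)
}
```

If this returns quickly in the notebook, the delay is more likely in handling the rows than in evaluating the query.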

Question 6.- How many species are in the UniProt taxonomy?

In [5]:
%endpoint https://sparql.uniprot.org/sparql
%format JSON

PREFIX up: <http://purl.uniprot.org/core/>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>

select (count (?taxon) as ?species) where {
    ?taxon a up:Taxon;
        up:rank up:Species.
}

species
1995728


Question 7.- How many species have at least one protein record? (this might take a long time
to execute, so do this one last!).

In [10]:
%endpoint https://sparql.uniprot.org/sparql
%format JSON

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX up: <http://purl.uniprot.org/core/>

select (count(distinct ?taxon) as ?species) where {
    ?p a up:Protein;
        up:organism ?taxon.
    ?taxon a up:Taxon;
        up:rank up:Species.
}

species
1078469


Question 8.- Find the AGI codes and gene names for all Arabidopsis thaliana proteins that
have a protein function annotation description that mentions “pattern formation”.

In [7]:
%endpoint https://sparql.uniprot.org/sparql
%format JSON

PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX up: <http://purl.uniprot.org/core/>

select ?agi ?name ?comment where {
    ?s a up:Protein;
        up:organism ?taxon;
        up:annotation ?a;
        up:encodedBy ?g.
    ?a a up:Function_Annotation;
        rdfs:comment ?comment.
    ?g a up:Gene;
        skos:prefLabel ?name;
        up:locusName ?agi.
    ?taxon up:scientificName 'Arabidopsis thaliana'.
    filter(contains(?comment,'pattern formation'))
}

agi,name,comment
At1g13980,GN,Activates the ARF proteins by exchanging bound GDP for free GTP. Plays a role in vesicular protein sorting. Acts as the major regulator of endosomal vesicle trafficking but is also involved in the endocytosis process. Could function redundantly with GNL1 in the retrograde Golgi to endoplasmic reticulum trafficking. Regulates vesicle trafficking required for the coordinated polar localization of auxin efflux carriers which in turn determines the direction of auxin flow. Mediates the sorting of PIN1 from endosomal compartments to the basal plasma membrane and the polarization of PIN3 to the bottom side of hypocotyl endodermal cells. Involved in the specification of apical-basal pattern formation in the early embryo and during root formation. Required for correct cell wall organization leading to normal cell adhesion during seedling development. Also plays an essential role in hydrotropism of seedling roots.
At3g02130,RPK2,"Key regulator of anther development (e.g. lignification pattern), including tapetum degradation during pollen maturation (e.g. germination capacity). Together with RPK1, required for pattern formation along the radial axis (e.g. the apical embryonic domain cell types that generate cotyledon primordia), and the apical-basal axis (e.g. differentiation of the basal pole during early embryogenesis)."
At1g69270,RPK1,"Involved in the main abscisic acid-mediated (ABA) signaling pathway and in early ABA perception. Together with RPK2, required for pattern formation along the radial axis (e.g. the apical embryonic domain cell types that generate cotyledon primordia), and the apical-basal axis (e.g. differentiation of the basal pole during early embryogenesis)."
At5g37800,RSL1,"Transcription factor that is specifically required for the development of root hairs (PubMed:17556585). Acts with RHD6 to positively regulate root hair development (PubMed:17556585). Acts downstream of genes that regulate epidermal pattern formation, such as GL2 (PubMed:17556585). Acts with RHD6 as transcription factor that integrates a jasmonate (JA) signaling pathway that stimulates root hair growth (PubMed:31988260)."
At1g26830,CUL3A,"Component of the cullin-RING ubiquitin ligases (CRL), or CUL3-RBX1-BTB protein E3 ligase complexes which mediate the ubiquitination and subsequent proteasomal degradation of target proteins. The functional specificity of the CRL complex depends on the BTB domain-containing protein as the susbstrate recognition component. Involved in embryo pattern formation and endosperm development. Required for the normal division and organization of the root stem cells and columella root cap cells. Regulates primary root growth by an unknown pathway, but in an ethylene-dependent manner. Functions in distal root patterning, by an ethylene-independent mechanism. Functionally redundant with CUL3B."
At1g66470,RHD6,"Transcription factor that is specifically required for the development of root hairs (PubMed:17556585). Acts with RSL1 to positively regulate root hair development (PubMed:17556585). Acts downstream of genes that regulate epidermal pattern formation, such as GL2 (PubMed:17556585). Targets directly RSL4, another transcription factor involved in the regulation of root hair elongation (PubMed:20139979). Acts with RSL1 as transcription factor that integrates a jasmonate (JA) signaling pathway that stimulates root hair growth (PubMed:31988260)."
At3g09090,DEX1,"Required for exine pattern formation during pollen development, especially for primexine deposition."
At5g55250,IAMT1,Catalyzes the methylation of the free carboxyl end of the plant hormone indole-3-acetic acid (IAA). Converts IAA to IAA methyl ester (MeIAA). Regulates IAA activities by IAA methylation. Methylation of IAA plays an important role in regulating plant development and auxin homeostasis. Required for correct leaf pattern formation. MeIAA seems to be an inactive form of IAA.
At1g63700,YDA,"Functions in a MAP kinase cascade that acts as a molecular switch to regulate the first cell fate decisions in the zygote and the early embryo. Promotes elongation of the zygote and development of its basal daughter cell into the extra-embryonic suspensor. In stomatal development, acts downstream of the LRR receptor TMM, but upstream of the MKK4/MKK5-MPK3/MPK6 module to regulate stomatal cell fate before the guard mother cell (GMC) is specified. Plays a central role in both guard cell identity and pattern formation. This MAPK cascade also functions downstream of the ER receptor in regulating coordinated local cell proliferation, which shapes the morphology of plant organs. Upon brassinosteroid signaling, is inhibited by phosphorylation of its auto-inhibitory N-terminal domain by the GSK3-like kinase ASK7."
At4g21750,ATML1,"Probable transcription factor involved in cell specification and pattern formation during embryogenesis. Binds to the L1 box DNA sequence 5'-TAAATG[CT]A-3'. Plays a role in maintaining the identity of L1 cells, possibly by interacting with their L1 box or other target-gene promoters; binds to the LIP1 gene promoter and stimulates its expression upon imbibition (PubMed:24989044). Acts as a positive regulator of gibberellins (GAs)-regulated epidermal gene expression (e.g. LIP1, LIP2, LTP1, FDH and PDF1) (PubMed:24989044). Functionally redundant to PDF2 (PubMed:24989044). Seems to promote cell differentiation (PubMed:25564655)."
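
`contains` compares strings case-sensitively, so an annotation that starts a sentence with "Pattern formation" would not be matched by the query above. A sketch of a case-insensitive variant, using `regex` with the `i` flag on the same graph pattern. It has not been run here, so it may or may not return additional rows.

```sparql
%endpoint https://sparql.uniprot.org/sparql
%format JSON

PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX up: <http://purl.uniprot.org/core/>

select ?agi ?name ?comment where {
    ?s a up:Protein;
        up:organism ?taxon;
        up:annotation ?a;
        up:encodedBy ?g.
    ?a a up:Function_Annotation;
        rdfs:comment ?comment.
    ?g a up:Gene;
        skos:prefLabel ?name;
        up:locusName ?agi.
    ?taxon up:scientificName 'Arabidopsis thaliana'.
    # The 'i' flag makes the match ignore letter case.
    filter(regex(?comment, 'pattern formation', 'i'))
}
```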


Question 9.- What is the MetaNetX Reaction identifier (starts with “mnxr”) for the UniProt
Protein uniprotkb:Q18A79?

In [8]:
%endpoint https://rdf.metanetx.org/sparql
%format JSON

PREFIX mnx: <https://rdf.metanetx.org/schema/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX up: <http://purl.uniprot.org/uniprot/>

select distinct ?UniProt_pept ?pept_comment ?reaction_identifier where{
    ?UniProt_pept mnx:peptXref up:Q18A79;
        rdfs:comment ?pept_comment.
    ?cata mnx:pept ?UniProt_pept.
    ?gpr mnx:cata ?cata;
        mnx:reac ?reac.
    ?reac mnx:mnxr ?rxn.
    ?rxn mnx:reacRefer ?reaction_identifier.
}

UniProt_pept,pept_comment,reaction_identifier
https://rdf.metanetx.org/pept/GLGA_CLOD6,Glycogen synthase (EC 2.4.1.21) (Starch [bacterial glycogen] synthase),http://bigg.ucsd.edu/universal/reactions/GLCS1
https://rdf.metanetx.org/pept/GLGA_CLOD6,Glycogen synthase (EC 2.4.1.21) (Starch [bacterial glycogen] synthase),http://rdf.rhea-db.org/18189
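
The reaction_identifier column above holds the external cross-references (BiGG and Rhea) reached through `mnx:reacRefer`. In the same graph pattern, the node that `mnx:mnxr` points to is the MetaNetX reaction itself, so a variant that also returns it may answer the question more directly. This is a sketch, assuming the IRI of that node carries the MNXR identifier. It has not been run here, and the column name `?mnxr` is chosen for illustration.

```sparql
%endpoint https://rdf.metanetx.org/sparql
%format JSON

PREFIX mnx: <https://rdf.metanetx.org/schema/>
PREFIX up: <http://purl.uniprot.org/uniprot/>

# Same graph pattern as above; ?mnxr is the node reached through mnx:mnxr.
select distinct ?UniProt_pept ?mnxr ?reaction_identifier where {
    ?UniProt_pept mnx:peptXref up:Q18A79.
    ?cata mnx:pept ?UniProt_pept.
    ?gpr mnx:cata ?cata;
        mnx:reac ?reac.
    ?reac mnx:mnxr ?mnxr.
    ?mnxr mnx:reacRefer ?reaction_identifier.
}
```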


Question 10.- What is the official Gene ID (UniProt calls this a “mnemonic”) and the MetaNetX
Reaction identifier (mnxr.....) for the protein that has “Starch synthase” catalytic activity in Clostridium
difficile (taxon 272563)?

In [9]:
%endpoint https://rdf.metanetx.org/sparql
%format JSON

PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX up: <http://purl.uniprot.org/core/>
PREFIX mnx: <https://rdf.metanetx.org/schema/>
PREFIX taxon: <http://purl.uniprot.org/taxonomy/>

select distinct ?id ?reaction_identifier where{
    service <https://sparql.uniprot.org/sparql>{
        select ?UPquery ?id  where {
            ?UPquery a up:Protein;
                up:organism taxon:272563;
                up:mnemonic ?id;
                up:enzyme ?act.
            ?act skos:prefLabel 'starch synthase'.
        }
    }
    ?pept mnx:peptXref ?UPquery. # This ended up being the same protein as above.
    ?cata mnx:pept ?pept.
    ?gpr mnx:cata ?cata;
        mnx:reac ?reac.
    ?reac mnx:mnxr ?rxn.
    ?rxn mnx:reacRefer ?reaction_identifier.
}

id,reaction_identifier
GLGA_CLOD6,http://bigg.ucsd.edu/universal/reactions/GLCS1
GLGA_CLOD6,http://rdf.rhea-db.org/18189
