# 01. Uniprot API

The [UniProt](https://www.uniprot.org/) knowledgebase is a large resource of protein sequences and associated detailed annotation.
At the time this tutorial was written, it contained close to 200 million sequences,
of which more than half a million had been curated by experts who critically review experimental and predicted data for each protein. [1]

## How to search this database

UniProt provides a text search in which you describe the kind of data you are looking for in the form of queries. The search bar is shown below.

![](img/uniprot-search-bar.png)

If the search bar is submitted empty, UniProt returns a list of all available sequences in the database. With the advanced dropdown menu, it is possible to search on specific fields.

### Example 1.1: Search on website for all human proteins

Open the advanced dropdown menu, select the field "Organism", enter the value human, and use the autocompletion to get what is shown below.

![](img/uniprot-advanced-search.png)

This gives the following value in the search bar,

![](img/uniprot-search-human.png)

and the following results.

![](img/uniprot-search-human-results.png)

We can use extra fields to further refine this search. 
For example, let's say we are only interested in proteins that have a 3D structure available and are longer than 1000 amino acids.

![](img/uniprot-search-human-big-structure.png)

### Exercise 1.1.a: Search on website for all E. coli (strain K12) proteins with a signal peptide

(Click dots for solution)

```
annotation:(type:signal) AND organism:"Escherichia coli (strain K12) [83333]"
```

### Exercise 1.1.b: Search on website for the protein with id P0AFL3

(Click dots for solution)

```
id:P0AFL3
```

## Use Uniprot API to download files

When you perform a query on the Uniprot website,
you can download the results in different formats from the web page with the following button.

![](img/uniprot-download.png)

Simple, right? So why would we need to automate this simple task with Python?
The thing is that if you want to download many different files,
the task of filling in the query on the website and clicking the download button gets very repetitive.
Let's say you want to download a list of protein identifiers for every protein that contains a signal peptide,
and you want to do that for 250 different organisms.
Can you imagine yourself refilling the text search 250 times,
pushing the download button 250 times,
selecting the list format 250 times,
choosing the destination on your computer 250 times ...
You get the idea.
It gets even worse:
if after one week you realize that having a signal peptide is not enough for your research and the proteins also need to have a length of at least 200 amino acids,
you will have to redo all those steps another 250 times.
A Python script could solve this problem in fewer than 10 lines of code.
Additionally, you have your data collection method written down, 
which you could pass to other researchers if they want to recreate your dataset.
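
The repetitive task described above can be sketched as a short loop. This is only an illustration: the helper name `build_signal_peptide_jobs` and the file-naming scheme are assumptions, the three organisms stand in for the full list of 250, and nothing is downloaded here. The query syntax is the one used in the website examples above.

```python
# Organisms as written in the UniProt text search; a stand-in for a longer list
ORGANISMS = [
    "Homo sapiens (Human) [9606]",
    "Mus musculus (Mouse) [10090]",
    "Escherichia coli (strain K12) [83333]",
]

def build_signal_peptide_jobs(organisms, min_length=None):
    """Return (filename, query) pairs, one per organism (hypothetical helper)."""
    jobs = []
    for organism in organisms:
        query = f'annotation:(type:signal) AND organism:"{organism}"'
        if min_length is not None:
            # Extra constraint, added in one place for every organism
            query = f"length:[{min_length} TO *] AND " + query
        # Use the taxonomy identifier between the square brackets in the file name
        taxon_id = organism.split("[")[-1].rstrip("]")
        jobs.append((f"signalPeptides_{taxon_id}.list", query))
    return jobs

for filename, query in build_signal_peptide_jobs(ORGANISMS, min_length=200):
    print(filename, "<-", query)
```

Changing the length requirement now means changing one argument instead of repeating every manual step. Each pair could then be passed to a download function such as the `uniprotDownload` function defined later in this tutorial.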

So how does it work?
Simple: UniProt expects a specific URL format that encodes which data you want, and the data can then be downloaded from that URL.
More information about the ins and outs can be found on the [UniProt help page on API queries](https://www.uniprot.org/help/api%5Fqueries).
Below I have written some functions that generate a URL based on the given parameters and download the requested file to the current working directory.
If you are interested, you can take a look at how they work, but this is not necessary.
You can also just run the cell and skip towards the examples.
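
To make the idea of "a specific format of URL" concrete, here is a minimal sketch of how a query and a file format can be combined into one URL. It is not the function used in the rest of this tutorial: `build_query_url` is a hypothetical helper, and it relies on `urllib.parse.urlencode` from the standard library to escape special characters. The base URL is the one used by the functions below.

```python
from urllib.parse import urlencode

# Base URL taken from the uniprotDownload function in this tutorial
BASE_URL = "https://www.uniprot.org/uniprot/?"

def build_query_url(query, fmt="list"):
    """Return a search URL for a query and file format (hypothetical helper)."""
    params = {"query": query, "format": fmt}
    # urlencode joins the parameters with "&" and escapes special characters
    return BASE_URL + urlencode(params)

example_url = build_query_url("id:P0AFL3")
print(example_url)
```

The parameters end up as `name=value` pairs separated by `&`, which is the same structure the functions below produce by hand.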

In [2]:
import os
import requests

def downloadFile(url,fileName):
    """
    Downloads a file from the internet with a given url.
    The function deletes any existing file with the given filename.
    It will then download and name the new file.
    The function is designed to also work with very big files.
    
    Parameters
    ----------
    url : str
        url that is used to download the file.
    fileName : str
        Name of the new file
    
    Returns
    -------
    fileName : str
        returns the name of the new file
    """
    # Delete an existing file with the same filename, if there is one
    try:
        os.remove(fileName)
    except FileNotFoundError:
        pass

    # Use requests to download the file.
    # The response is streamed in chunks, so large files can be downloaded
    # without needing a large amount of memory.
    with requests.get(url, stream=True) as r:
        r.raise_for_status()
        with open(fileName, 'wb') as f:
            for chunk in r.iter_content(chunk_size=8192): 
                if chunk:
                    f.write(chunk)
    return fileName

def uniprotDownload(fileName, query="",format="list",columns="",include="no",compress="no",limit=0,offset=0):
    """Downloads file from uniprot for given parameters
    
    If no parameters are given, the function will download a list of all
    protein IDs. More information about how the URL should be constructed can
    be found on: 
    https://www.uniprot.org/help/api%5Fqueries
    
    Parameters
    ----------
    fileName : str
        name for the downloaded file
    query : str (Default='')
        Query as you would enter it in the web interface on
        https://www.uniprot.org/. If no query is provided, all protein entries
        are selected. 
    format : str (Default='list')
        File format you want to retrieve from uniprot. Available formats are:
        html | tab | xls | fasta | gff | txt | xml | rdf | list | rss
    columns : str (Default='')
        Column information you want to know for each entry in the query 
        when format tab or xls is selected.
    include : str (Default='no')
        Include isoform sequences when the format parameter is set to fasta.
        Include description of referenced data when the format parameter is set to rdf.
        This parameter is ignored for all other values of the format parameter.
    compress : str (Default='no')
        download file in gzipped compression format.
    limit : int (Default=0)
        Limit the number of results that are returned. 0 means you download all.
    offset : int (Default=0)
        When you limit the number of results, offset determines where to start.
        
    Returns
    -------
    fileName : str
        Name of the downloaded file.
    """
    def generateURL(baseURL, query="",format="list",columns="",include="no",compress="no",limit="0",offset="0"):
        """Generate URL with given parameters"""
        def glueParameters(**kwargs):
            gluedParameters = ""
            for parameter, value in kwargs.items():
                gluedParameters+=parameter + "=" + str(value) + "&"
            return gluedParameters.replace(" ","+")[:-1] # Last "&" is removed, spaces are replaced by "+"
        return baseURL + glueParameters(query=query,
                                        format=format,
                                        columns=columns,
                                        include=include,
                                        compress=compress,
                                        limit=limit,
                                        offset=offset)
    URL = generateURL("https://www.uniprot.org/uniprot/?",
               query=query,
               format=format,
               columns=columns,
               include=include,
               compress=compress,
               limit=limit,
               offset=offset)
    return downloadFile(URL, fileName)
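
The `limit` and `offset` parameters make it possible to fetch a large result set in pieces. Below is a minimal sketch of how the pairs could be generated, assuming the total number of results is already known. `page_windows` is a hypothetical helper and not part of the functions above.

```python
def page_windows(total, page_size):
    """Yield (limit, offset) pairs that together cover `total` results."""
    for offset in range(0, total, page_size):
        # The last page may be shorter than page_size
        yield min(page_size, total - offset), offset

print(list(page_windows(total=25, page_size=10)))
# → [(10, 0), (10, 10), (5, 20)]
```

Each pair could be passed as `limit=` and `offset=` to `uniprotDownload`, with a different file name per page.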

### Example 1.2: Download in list format

In this example we will download a list file for all human proteins with a length of at least 4000 amino acids.
The list format is just a plain text file of all the protein identifiers that match the search query.
Each protein identifier is unique,
so it can always be mapped back to the database.

In [4]:
# Query as you give it in the textsearch
QUERY='length:[4000 TO *] AND organism:"Homo sapiens (Human) [9606]"' 
FORMAT = 'list'                               
filename = 'humanProteins.list'

uniprotDownload(filename,format=FORMAT, query=QUERY)

'humanProteins.list'

(In the left panel, you can click on the file to check out the content.)
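
Because a list file holds one identifier per line, it is easy to load into Python. The sketch below assumes exactly that layout. `read_identifier_list` is a hypothetical helper, and the file `demo.list` is a small stand-in written by the sketch itself so that it runs without a download.

```python
from pathlib import Path

def read_identifier_list(path):
    """Return the non-empty lines of a list file as a Python list."""
    with open(path) as handle:
        return [line.strip() for line in handle if line.strip()]

# Stand-in file so the sketch runs without a download;
# the two identifiers are the ones used elsewhere in this tutorial
demo_path = Path("demo.list")
demo_path.write_text("P0AFL3\nP59632\n")

identifiers = read_identifier_list(demo_path)
print(len(identifiers), identifiers)
```

The same function could be pointed at `humanProteins.list` once that file has been downloaded.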

### Exercise 1.2.a

Download a list of all mouse proteins that are annotated to have a disulfide bond.
(Hint: use the text search on the website to find out what the query should look like.)

In [None]:
QUERY='' 
FORMAT = ''                               
filename = ''

uniprotDownload(filename,format=FORMAT, query=QUERY)

(click dots for solution)

In [5]:
QUERY='annotation:(type:disulfid) AND organism:"Mus musculus (Mouse) [10090]"' 
FORMAT = 'list'                               
filename = 'mouseProteinsDisulfideBond.list'

uniprotDownload(filename,format=FORMAT, query=QUERY)

'mouseProteinsDisulfideBond.list'

### Exercise 1.2.b

Download a list of all E. coli (K12 strain) proteins that are annotated to be DNA binding.

In [None]:
QUERY='' 
FORMAT = ''                               
filename = ''

uniprotDownload(filename,format=FORMAT, query=QUERY)

(click dots for solution)

In [6]:
QUERY='annotation:(type:dna_bind) AND organism:"Escherichia coli (strain K12) [83333]"' 
FORMAT = 'list'                               
filename = 'EcoliDnaBinding.list'

uniprotDownload(filename,format=FORMAT, query=QUERY)

'EcoliDnaBinding.list'

### Example 1.3: Download in fasta format

The list format is very useful if you want to keep a list of all proteins that match a specific query.
However, you may want to know more about the proteins you have identified, such as their sequence.
Retrieving the primary sequence is a good starting point for further analysis.
For example, it can be used to:

* perform a multiple sequence alignment ([Clustal Omega](02.clustalOmega-API.ipynb))
* find homologues in a database ([BLAST+](09.BLAST-API.ipynb))
* predict biophysical features
    - [DynaMine](06.dynamine-API.ipynb)
    - [EFoldMine](07.EFoldMine-API.ipynb)
    - [DisoMine](08.DisoMine-API.ipynb)

Below is an example of how we can use the API to download all the sequences of E. coli (K12) that have been reviewed (manually curated).

In [3]:
QUERY='reviewed:yes AND organism:"Escherichia coli (strain K12) [83333]"' 
FORMAT = 'fasta'                               
filename = 'EcoliReviewed.fasta'

uniprotDownload(filename,format=FORMAT, query=QUERY)

'EcoliReviewed.fasta'

(By clicking on the file in the left panel, you can have a look at it)
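
A fasta file alternates header lines, which start with `>`, and sequence lines. As a rough sketch of how such a file could be read with plain Python, the hypothetical helper `read_fasta` below collects each sequence under its header. The two records are made-up toy data, not real UniProt entries.

```python
def read_fasta(lines):
    """Return a dict mapping each header (without '>') to its sequence."""
    records = {}
    header = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            records[header] = ""
        elif header is not None:
            # Sequences may be wrapped over several lines
            records[header] += line
    return records

# Toy records, not real UniProt entries
toy_fasta = """>toy1 first made-up record
MKTAYIAK
QRQISFVK
>toy2 second made-up record
MSDNE
"""

sequences = read_fasta(toy_fasta.splitlines())
for header, sequence in sequences.items():
    print(header, len(sequence))
```

For a downloaded file, an open file handle can be passed instead of the list of lines, for example `read_fasta(open('EcoliReviewed.fasta'))`.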

### Exercise 1.3.a

Download a fasta file of the proteins that contain "ppiA" in the gene name, but limit it to the taxon Gammaproteobacteria.

In [None]:
QUERY='' 
FORMAT = ''                               
filename = ''

uniprotDownload(filename,format=FORMAT, query=QUERY)

Click dots for solution

In [4]:
QUERY='gene:ppia taxonomy:"Gammaproteobacteria [1236]"' 
FORMAT = 'fasta'                               
filename = 'GammaProteoBacteria_ppiA.fasta'

uniprotDownload(filename,format=FORMAT, query=QUERY)

'GammaProteoBacteria_ppiA.fasta'

### Exercise 1.3.b

The coronavirus behind the 2020 pandemic is SARS-CoV-2, a close relative of the SARS coronavirus (SARS-CoV). In this exercise we search with the organism term **sars-cov**.
Retrieve a fasta file of all the known proteins that match this organism.

In [None]:
QUERY='' 
FORMAT = ''                               
filename = ''

uniprotDownload(filename,format=FORMAT, query=QUERY)

Click dots for solution

In [5]:
QUERY='organism:sars-cov' 
FORMAT = 'fasta'                               
filename = 'coronaVirus.fasta'

uniprotDownload(filename,format=FORMAT, query=QUERY)

'coronaVirus.fasta'

### Example 1.4: Download in XML format

Until now, we have only downloaded protein identifiers and sequence information.
However, if you look at a protein page on UniProt (e.g. [P59632](https://www.uniprot.org/uniprot/P59632), protein 3a of the human SARS coronavirus),
you see there is much more information available.
To access this information computationally, we will download it in xml format.
More information about the structure of xml files and how to access them with python can be found in [this tutorial](01.a.XML.ipynb).

Below, the code is shown to download information in xml format about protein **P59632**.

In [6]:
QUERY='id:P59632' 
FORMAT = 'xml'                               
filename = 'coronaVirusP59632.xml'

uniprotDownload(filename,format=FORMAT, query=QUERY)

'coronaVirusP59632.xml'

Click on the file in the left panel to take a look at it.
Try to map the information on the webpage to the corresponding information in the xml file.
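
As a first impression of how such a file can be read in Python, the sketch below parses a heavily simplified toy document with `xml.etree.ElementTree` from the standard library. The namespace and the element names (`entry`, `accession`, `sequence`) are assumptions about the layout of UniProt XML files, so check them against the file you downloaded. The toy entry is made up.

```python
import xml.etree.ElementTree as ET

# Namespace assumed for UniProt XML files; check the root tag of your own file
NS = {"up": "http://uniprot.org/uniprot"}

# Simplified toy document, not a real UniProt entry
toy_xml = """<uniprot xmlns="http://uniprot.org/uniprot">
  <entry>
    <accession>TOY001</accession>
    <name>TOY_ENTRY</name>
    <sequence length="5">MSDNE</sequence>
  </entry>
</uniprot>"""

root = ET.fromstring(toy_xml)
for entry in root.findall("up:entry", NS):
    accession = entry.find("up:accession", NS).text
    sequence = entry.find("up:sequence", NS)
    # Attributes such as "length" are read with .get(), element text with .text
    print(accession, sequence.get("length"), sequence.text)
```

For a file on disk, `ET.parse(filename).getroot()` gives the same kind of root element.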

### Exercise 1.4.a 

Download an xml file that contains information about all the coronavirus proteins that have been reviewed.

In [None]:
QUERY='' 
FORMAT = ''                               
filename = ''

uniprotDownload(filename,format=FORMAT, query=QUERY)

Click dots for solution

In [7]:
QUERY='organism:sars-cov AND reviewed:yes' 
FORMAT = 'xml'                               
filename = 'coronaVirusReviewed.xml'

uniprotDownload(filename,format=FORMAT, query=QUERY)

'coronaVirusReviewed.xml'

### Exercise 1.4.b

Other coronaviruses belong to the same family, the **Coronaviridae**.
Let's say we want to compare **protein 3a** across the different organisms within this family.
Download an xml file that contains all the proteins with the name **"protein 3a"** within the **Coronaviridae** family.

In [None]:
QUERY='' 
FORMAT = ''                               
filename = ''

uniprotDownload(filename,format=FORMAT, query=QUERY)

Click dots for solution

In [8]:
QUERY='taxonomy:"Coronaviridae [11118]" name:"protein 3a"' 
FORMAT = 'xml'                               
filename = 'coronaFamilyProtein3a.xml'

uniprotDownload(filename,format=FORMAT, query=QUERY)

'coronaFamilyProtein3a.xml'

### Example 1.5: Download in tab format

XML files are useful because they contain a lot of information.
However, sometimes we are only interested in a couple of features for a big list of proteins.
Even though you can find these features in the xml file,
you would first have to download a very big file, of which you would only use a limited number of features.
Therefore, UniProt provides the **tab** format.
This is a plain text file where every column depicts one feature and each row is an entry.
You can easily parse these files with libraries like [pandas](01.b.pandas.ipynb),
or even open them with Excel.

Let's say we want a list of all human proteins that have been reviewed,
but for each entry, we only want to know the 

* id 
* protein length 
* protein name 
* gene name 
* protein localization.

As in previous examples, we have to provide a search query,
but in addition we also have to provide a list of the columns we are interested in.
The columns you can choose from are listed on the [UniProtKB column names help page](https://www.uniprot.org/help/uniprotkb_column_names).

In [10]:
QUERY='reviewed:yes AND organism:"Homo sapiens (Human) [9606]"'
COLUMNS='id,length,entry name,genes,comment(SUBCELLULAR LOCATION)'
FORMAT = 'tab'                               
filename = 'humanProteins.tab'

uniprotDownload(filename,format=FORMAT, query=QUERY, columns=COLUMNS)

'humanProteins.tab'
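
A tab file can also be read with the standard library alone. The sketch below uses `csv.DictReader` with a tab delimiter on a small in-memory table. The column headers and rows are made-up toy data, so the headers in a real download may differ, and `read_tab` is a hypothetical helper.

```python
import csv
import io

# Toy table with made-up column headers and rows; a real file is read the same way
toy_tab = "Entry\tLength\tGene names\nTOY001\t120\tgeneA\nTOY002\t4500\tgeneB\n"

def read_tab(handle):
    """Return the rows of a tab-separated file as a list of dictionaries."""
    return list(csv.DictReader(handle, delimiter="\t"))

rows = read_tab(io.StringIO(toy_tab))
# All values are read as text, so convert the length before comparing
long_entries = [row["Entry"] for row in rows if int(row["Length"]) >= 4000]
print(long_entries)
```

For a downloaded file, an open file handle such as `open('humanProteins.tab', newline='')` can be passed instead of the in-memory text.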

### Exercise 1.5.a

Download all proteins of the organism **sars-cov** in **tab** format.
Display the **id, protein name, gene name,** and **Mapped PubMed ID**.

In [None]:
QUERY=''
COLUMNS=''
FORMAT = ''                               
filename = ''

uniprotDownload(filename,format=FORMAT, query=QUERY, columns=COLUMNS)

click dots for solution

In [11]:
QUERY='organism:"Human SARS coronavirus (SARS-CoV) (Severe acute respiratory syndrome coronavirus) [694009]"'
COLUMNS='id,entry name,genes,citationmapping'
FORMAT = 'tab'                               
filename = 'coronaVirusCitations.tab'

uniprotDownload(filename,format=FORMAT, query=QUERY, columns=COLUMNS)

'coronaVirusCitations.tab'

### Exercise 1.5.b

Download all **E. coli (strain K12)** proteins and display the **id, gene name,** and whether or not there is a **signal peptide**.

In [None]:
QUERY=''
COLUMNS=''
FORMAT = ''                               
filename = ''

uniprotDownload(filename,format=FORMAT, query=QUERY, columns=COLUMNS)

click dots for solution

In [12]:
QUERY='organism:"Escherichia coli (strain K12) [83333]"'
COLUMNS='id,length,entry name,genes,feature(SIGNAL)'
FORMAT = 'tab'                               
filename = 'EcoliSignalPeptide.tab'

uniprotDownload(filename,format=FORMAT, query=QUERY, columns=COLUMNS)

'EcoliSignalPeptide.tab'

# References

1. [UniProt: the universal protein knowledgebase. Nucleic acids research, 2017, 45.D1: D158-D169.](https://academic.oup.com/nar/article/45/D1/D158/2605721)
2. [UniProt website](https://www.uniprot.org/)
3. [UniProt help page on API queries](https://www.uniprot.org/help/api%5Fqueries)
4. [UniProt help page on column names](https://www.uniprot.org/help/uniprotkb_column_names)