### Demo notebook: UniProt ID to gene name mapping
The code below is a simple example of using SHEPHARD to build an interactive Proteome object that can be used to map UniProt IDs to gene names.

In [1]:
from shephard.apis import uniprot
from shephard import interfaces

In [2]:
# name of a FASTA file from UniProt. The example here uses the cleaned human proteome,
# i.e., the human proteome restricted to proteins that lack non-standard amino acids, but
# this could be any FASTA file generated from UniProt (e.g., the mouse proteome)
filename = '../shprd_data/human_proteome_validated.fasta'

In [3]:
# read in FASTA file from uniprot
human_proteome = uniprot.uniprot_fasta_to_proteome(filename)

### Extract gene names from UniProt headers
The code below takes advantage of the fact that, by default, a [proteome](https://www.uniprot.org/proteomes/UP000005640) downloaded from UniProt includes the standard gene name in each FASTA header record, in the `GN=` field. Because the `shephard.apis.uniprot` API makes the full FASTA header available as `protein.name`, we can parse the gene names out of the header and assign them as protein attributes.

In [4]:
# excise and assign gene names as attributes
for protein in human_proteome:
    name_string = protein.name
    
    # first try and get gene name from the GN= entry in the FASTA header
    try:
        gene_name = name_string.split('GN=')[1].split()[0]
    except IndexError:
        # if this fails, fall back to the UniProt entry name, which is
        # typically the gene name (or close to it) followed by a species
        # identifier, e.g. P53_HUMAN instead of TP53. We keep the _HUMAN so
        # records parsed in this way can be easily identified (although the
        # _HUMAN suffix could be stripped)
        gene_name = name_string.split()[0].split('|')[2]
        
    protein.add_attribute('gene_name', gene_name)
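
The same parsing logic can be tried out on plain strings, without loading a proteome. The block below is a minimal sketch, not part of SHEPHARD: `gene_name_from_header` is a hypothetical helper, it uses a regular expression rather than the `split`-based approach in the cell above, and both header strings are illustrative (the second one is made up to show the fallback).

```python
import re

# Matches the GN= field and captures everything up to the next whitespace
GN_PATTERN = re.compile(r'\bGN=(\S+)')


def gene_name_from_header(header):
    """Return a gene name for a UniProt-style FASTA header (no leading '>').

    Hypothetical helper for illustration; not a SHEPHARD function.
    """
    match = GN_PATTERN.search(header)
    if match:
        return match.group(1)
    # Fallback: third '|'-separated field of the first token (the entry name)
    return header.split()[0].split('|')[2]


# Illustrative header in the UniProt FASTA layout
with_gn = ('sp|P04637|P53_HUMAN Cellular tumor antigen p53 '
           'OS=Homo sapiens OX=9606 GN=TP53 PE=1 SV=4')

# Made-up header with no GN= field, to exercise the fallback
without_gn = ('tr|X0X0X0|X0X0X0_HUMAN Uncharacterized protein '
              'OS=Homo sapiens OX=9606 PE=4 SV=1')

print(gene_name_from_header(with_gn))
print(gene_name_from_header(without_gn))
```

The first call prints `TP53`; the second has no `GN=` field, so it falls back to the entry name `X0X0X0_HUMAN`.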
    

#### Using the annotated human proteome object
Having run the code above, you have an annotated human proteome object that can be used to look up the gene name for a UniProt ID as follows:

In [5]:
# let's check out p53...
uid_of_interest = 'P04637'

print(human_proteome.protein(uid_of_interest).attribute('gene_name'))

TP53
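
If the reverse lookup (gene name to UniProt ID) is also useful, one option is to build a plain dictionary from the annotated proteome. The sketch below is an assumption-laden illustration rather than a SHEPHARD feature: `build_gene_index` is a hypothetical helper that works on `(uniprot_id, gene_name)` pairs, and the example pairs are hard-coded (the `X00001` accession is made up) so the block runs without SHEPHARD installed.

```python
from collections import defaultdict


def build_gene_index(uid_gene_pairs):
    """Map each gene name to the list of UniProt IDs annotated with it.

    Hypothetical helper for illustration; not a SHEPHARD function.
    """
    index = defaultdict(list)
    for uid, gene_name in uid_gene_pairs:
        index[gene_name].append(uid)
    return dict(index)


# With the annotated proteome, the pairs could be generated along the lines of
#   pairs = ((p.unique_ID, p.attribute('gene_name')) for p in human_proteome)
# (this assumes each Protein object exposes its UniProt ID as `unique_ID`).
# Hard-coded pairs are used here; 'X00001' is a made-up accession.
pairs = [('P04637', 'TP53'), ('P38398', 'BRCA1'), ('X00001', 'TP53')]

gene_index = build_gene_index(pairs)
print(gene_index['TP53'])
print(gene_index['BRCA1'])
```

A list is stored per gene name because more than one record can carry the same gene name, so the mapping is not guaranteed to be one-to-one.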


#### Saving this annotation
To be fancy, you could save this annotation to a file, so that later you only need to reload it rather than re-parse the FASTA headers.


In [6]:
# the line here writes the gene name annotations out to a SHEPHARD protein attributes file
interfaces.si_protein_attributes.write_protein_attributes(human_proteome,'shprd_prot_atts_gene_names.tsv')

In [7]:
# the two lines below load a new (unannotated) proteome and then annotate it using the attributes we wrote.
# This is just an example use case, but it illustrates that, strictly speaking, you only need to do the
# parsing once
new_proteome = uniprot.uniprot_fasta_to_proteome(filename)
interfaces.si_protein_attributes.add_protein_attributes_from_file(new_proteome,'shprd_prot_atts_gene_names.tsv')
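
The "parse once, reload afterwards" idea can be wrapped in a small helper that checks whether the attributes file already exists. This is a minimal sketch under stated assumptions: `annotate_with_cache` and the two stand-in callables are hypothetical and not part of SHEPHARD; they only write and read a placeholder file so the block runs on its own.

```python
import os
import tempfile


def annotate_with_cache(attribute_file, parse_and_write, load_from_file):
    """Reload annotations if the file exists; otherwise parse and write them.

    Returns 'loaded' or 'parsed' so the caller can see which path was taken.
    Hypothetical helper for illustration; not a SHEPHARD function.
    """
    if os.path.exists(attribute_file):
        load_from_file(attribute_file)
        return 'loaded'
    parse_and_write(attribute_file)
    return 'parsed'


# Stand-in callables so the sketch runs without SHEPHARD installed
def fake_parse_and_write(path):
    with open(path, 'w') as fh:
        fh.write('placeholder\n')


def fake_load(path):
    with open(path) as fh:
        fh.read()


with tempfile.TemporaryDirectory() as tmpdir:
    cache = os.path.join(tmpdir, 'shprd_prot_atts_gene_names.tsv')
    first = annotate_with_cache(cache, fake_parse_and_write, fake_load)
    second = annotate_with_cache(cache, fake_parse_and_write, fake_load)

print(first, second)
```

In this notebook, `parse_and_write` would wrap the header-parsing loop plus the `write_protein_attributes` call, and `load_from_file` would wrap the `add_protein_attributes_from_file` call.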