# Example: parsing FASTA records into custom unique IDs
*Last updated 2022-06-07*

This short notebook shows how to parse a FASTA file when you want to build each protein's unique ID from information in its FASTA header.

In [11]:
from shephard.apis import fasta

### Define custom function for parsing FASTA header
The custom FASTA header function must take the FASTA header as a string and return a new string, which is used as the unique ID. Note that if the function returns the same ID for more than one record, `fasta_to_proteome` will raise a `ProteomeException`.

In [12]:
def parse_header(header_string):
    """
    This custom header parser returns the 
    uniprot ID followed by the organism taxon ID, with
    an underscore connecting the two.
    
    Parameters
    -------------
    header_string : str
        Will be the input string for any given FASTA record
        
    Returns
    --------------
    str
        Returns a record in the format <uniprotID>_<taxonID>
    
    """
    
    # gets uniprot ID e.g. from ">sp|P12532|KCRU_HUMAN..." extracts P12532
    uid = header_string.split('|')[1]
    
    # gets organism taxonID e.g. from .... "OS=Homo sapiens OX=9606 GN=CKMT1A" extracts 9606
    taxon_ID = header_string.split('OX=')[1].split()[0]
    
    return f"{uid}_{taxon_ID}"
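
The parser above assumes every header has both a `|`-delimited accession and an `OX=` field; a header missing either would fail with an `IndexError`. As a minimal sketch of a more defensive variant, the hypothetical `parse_header_checked` below applies the same splitting logic but raises a `ValueError` naming the offending header. The function name and the example header are illustrative assumptions, not part of SHEPHARD; the header is written in UniProt style to match the fragments quoted in the comments above.

```python
def parse_header_checked(header_string):
    """Return '<uniprotID>_<taxonID>', raising ValueError on malformed headers.

    Hypothetical defensive variant of parse_header (not part of SHEPHARD).
    """
    # Expect at least 'db|accession|entry_name ...'
    fields = header_string.split('|')
    if len(fields) < 3:
        raise ValueError(f"No '|'-delimited accession in header: {header_string!r}")
    uid = fields[1]

    # The taxon ID is the text after 'OX=' up to the next whitespace
    _, separator, after_ox = header_string.partition('OX=')
    if not separator or not after_ox.split():
        raise ValueError(f"No 'OX=' taxon field in header: {header_string!r}")
    taxon_id = after_ox.split()[0]

    return f"{uid}_{taxon_id}"


# Illustrative UniProt-style header (assumed for this example)
example_header = (
    "sp|P12532|KCRU_HUMAN Creatine kinase U-type, mitochondrial "
    "OS=Homo sapiens OX=9606 GN=CKMT1A PE=1 SV=1"
)
print(parse_header_checked(example_header))  # P12532_9606
```

Because it takes a string and returns a string, a function like this could be passed as `build_unique_ID` in the same way as `parse_header`.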
    

    

In [13]:
def bad_parse_header(header_string):
    """
    Example of a parser that will fail because the
    returned string will not be unique!
    """
    
    return 'non_unique_uid'

    

In [15]:
# this should run without issue...
test_proteome = fasta.fasta_to_proteome('seqs.fasta', build_unique_ID=parse_header)

In [16]:
# and we can verify it worked by printing the unique IDs for our proteins
for protein in test_proteome:
    print(protein.unique_ID)

P41956_6239
P12532_9606
P25809_10116
Q924S5_10116
O54937_10116
Q9P0J1_9606
O88483_10116
Q12511_559292
P25646_559292
Q9VL76_7227
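
Since each ID has the form `<uniprotID>_<taxonID>`, the taxon can be recovered by splitting on the last underscore. The sketch below does this with plain Python on the IDs printed above, copied into a list so that it runs without SHEPHARD; in the notebook itself one could presumably build the same list from `protein.unique_ID`.

```python
from collections import Counter

# Unique IDs as printed above, copied by hand for a self-contained example
unique_ids = [
    "P41956_6239", "P12532_9606", "P25809_10116", "Q924S5_10116",
    "O54937_10116", "Q9P0J1_9606", "O88483_10116", "Q12511_559292",
    "P25646_559292", "Q9VL76_7227",
]

# rsplit on the last underscore so the taxon ID is always the final piece
taxon_counts = Counter(uid.rsplit('_', 1)[1] for uid in unique_ids)

for taxon_id, count in taxon_counts.items():
    print(f"taxon {taxon_id}: {count} protein(s)")
```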


In [19]:
# this should raise an exception, which in the code block below is caught by the try/except block
# so that the error message is printed

try:
    test_proteome = fasta.fasta_to_proteome('seqs.fasta', build_unique_ID=bad_parse_header)
except Exception as e:
    print('This failed with the following exception:\n')
    print(e)


This failed with the following exception:

Non-unique unique_ID passed [non_unique_uid]
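
To find out which IDs collide before building a proteome, a candidate parser can be run over the header strings and the results counted. The helper below is a minimal sketch in plain Python; `find_duplicate_ids` and the toy headers are hypothetical and not part of SHEPHARD, and reading the headers out of a real FASTA file is left out.

```python
from collections import Counter


def find_duplicate_ids(headers, build_unique_ID):
    """Return the sorted IDs that the parser produces for more than one header."""
    counts = Counter(build_unique_ID(header) for header in headers)
    return sorted(uid for uid, n in counts.items() if n > 1)


# Toy headers, made up for illustration
headers = [
    "sp|A00001|PROT1_TEST OS=Test organism OX=1 GN=G1",
    "sp|A00002|PROT2_TEST OS=Test organism OX=1 GN=G2",
    "sp|A00003|PROT3_TEST OS=Other organism OX=2 GN=G3",
]

# A parser that uses the accession gives unique IDs: no duplicates reported
print(find_duplicate_ids(headers, lambda h: h.split('|')[1]))  # []

# A parser that returns a constant collides on every record
print(find_duplicate_ids(headers, lambda h: 'non_unique_uid'))  # ['non_unique_uid']
```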
