# Creating a Custom FASTA File

The following code reads a specified FASTA database (e.g. a UniProt download or a custom search space), iterates through each protein, truncates the protein sequence at every possible position, and appends a custom peptide (the linker) to the new C-terminus of each truncated sequence.

This is a 'brute force' method of creating all possible protein sequences that may exist when the location of the inserted peptide sequence is not specific. A protein of length n contributes n+1 entries (the original sequence plus one entry per truncation), so the output FASTA is large. Each new protein sequence in the output file is given a unique identifier so that the sequences can be distinguished during database search.

For example:
>"sp|ProteinAccession_151|Protein_Name_151"       indicates that the protein has been truncated after amino acid 150 of the original sequence, so the linker sequence begins at position 151 (the code below labels each entry with the number of retained residues plus one)

>"sp|ProteinAccession_150|Protein_Name_150"       is the same protein as shown above but truncated after amino acid 149, so the linker sequence begins at position 150

To see this in action, run cells 1-4 to produce an example FASTA database with the linker specified below. To see the FASTA used within our manuscript, run cells...
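
Because the position is written as a suffix on both the accession and the protein name, it can be read back from an identifier after the database search. The following is a minimal sketch and not part of the original workflow: `split_custom_id` is a hypothetical helper, and it assumes the `db|accession_N|name_N` layout shown in the example identifiers above.

```python
def split_custom_id(header: str):
    """Split 'db|ACC_N|NAME_N description' into (accession, name, N).

    Hypothetical helper: only meant for entries that carry the numeric
    suffix; unmodified entries raise ValueError.
    """
    # keep only the identifier (first whitespace-delimited token)
    identifier = header.lstrip('>').split()[0]
    _db, acc_field, name_field = identifier.split('|')

    # the suffix follows the last underscore of each field
    accession, _, acc_pos = acc_field.rpartition('_')
    name, _, name_pos = name_field.rpartition('_')
    if not accession or not acc_pos.isdigit() or acc_pos != name_pos:
        raise ValueError(f'No position suffix found in {header!r}')
    return accession, name, int(acc_pos)


print(split_custom_id('sp|ProteinAccession_151|Protein_Name_151'))
# ('ProteinAccession', 'Protein_Name', 151)
```

Splitting on the last underscore keeps protein names that already contain underscores intact.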

In [1]:
import re
import os
import ntpath
from pyteomics import fasta

In [2]:
# below are the custom functions used to produce the custom fasta file 

def customize(s: str, new_text: str, start_pos: int, stop_pos=None):
    '''
    This function keeps the first 'start_pos' characters of 's' and
    inserts 'new_text' after them. It is used both to truncate a protein
    sequence and append the linker, and to replace the accession/name
    inside a fasta header.

    :param s (str): full protein sequence or header name from fasta
    :param new_text (str): linker sequence or new text to be inserted;
                           specified by user
    :param start_pos (int): amino acid position or string index that specifies
                            where the sequence will be terminated
    :param stop_pos (int, optional): string index at which the original text
                            resumes after 'new_text'; if None, everything
                            after 'start_pos' is discarded

    returns: truncated protein sequence with an added linker sequence, or
             's' with s[start_pos:stop_pos] replaced by 'new_text'
    '''
    pref = s[:start_pos]
    if stop_pos is not None:
        return pref + new_text + s[stop_pos:]
    return pref + new_text

def grab_header_info(s: str):
    '''
    Function used to grab the protein accession and name from the 
    fasta header.

    :param s (str): string containing a full fasta header

    returns: the protein name/accession, the starting string index,
             the ending string index
    '''
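    # matches '|accession|name' followed by one whitespace character
    pat = re.compile(r'\|\w+\|\w+\s')
    match = pat.search(s)
    if not match:
        raise ValueError(f'No regex match found in {s}')
    # drop the leading '|' and the trailing whitespace from the matched span
    start, end = match.start() + 1, match.end() - 1
    res = s[start:end]
    return res, start, end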

def make_new_header(s: str, i: int):
    '''
    Function that creates a new header that denotes the amino acid position
    where the custom linker sequence has been inserted.

    :param s (str): the header information extracted from func(grab_header_info)
    :param i (int): the amino acid position used to label where the linker
                    has been inserted

    returns: a new, usable header
    '''
    parts = s.split('|')
    parts = [part + '_' + str(i) for part in parts]
    return '|'.join(parts)
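
The pattern used in `grab_header_info` expects a header of the form `db|accession|name` followed by whitespace, where accession and name consist only of letters, digits and underscores. The quick check below is a sketch using made-up headers (none are taken from the input files) to show which shapes the pattern accepts; `header_matches` is a hypothetical helper.

```python
import re

# same pattern as in grab_header_info
PATTERN = re.compile(r'\|\w+\|\w+\s')


def header_matches(header: str) -> bool:
    """Return True if grab_header_info would find a match in this header."""
    return PATTERN.search(header) is not None


# hypothetical headers, for illustration only
examples = [
    'sp|P12345|PROT_HUMAN Example protein OS=Homo sapiens',  # typical layout
    'sp|P12345-2|PROT_HUMAN Isoform 2 of example protein',   # hyphen in accession
    'sp|P12345|PROT_HUMAN',                                  # no description after the name
    'my_custom_protein some description',                    # no pipe-delimited fields
]
for header in examples:
    print(header_matches(header), header)
```

Only the first header matches. For the others `grab_header_info` raises, so custom search spaces with a different header layout would need an adapted pattern.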

In [3]:
# file = r"C:\Users\graha\Downloads\E.coli_proteome.fasta"
INPUT_FASTA = r'.\RNaseB.fasta'
LINKER = "ANDHHHHHHD"

# read in the original fasta file and instantiate list for new data
f = fasta.read(INPUT_FASTA)
new_data = []

# iterate through each protein and add original sequence to new data
for header, seq in f:
    new_data.append((header, seq))

    # pull the header information so we can write unique protein names in the output fasta
    name, start, stop = grab_header_info(header)

    # go through each protein, sequentially remove the terminal amino acid, add the linker;
    # the first pass (j == len(seq)) keeps the full sequence and only appends the linker
    for j in range(len(seq), 0, -1):
        # seq[:j] keeps the first j residues; the header is labelled with j+1
        new_name = make_new_header(name, j+1)
        new_name = customize(header, new_name, start, stop)
        new_seq = customize(seq, LINKER, j)
        new_data.append((new_name, new_seq))

# write the data to a new fasta file with a familiar filename
fasta.write(new_data, output='custom_'+ntpath.basename(INPUT_FASTA))

# write the data as a txt file, just in case
# (mode 'a' appends, so remove any previous output before re-running this cell)
with open('custom_'+ntpath.basename(INPUT_FASTA)+'.txt', 'a') as f:
    for (header, seq) in new_data:
        new_string = '\n'.join([header, seq])
        f.write(new_string+'\n')
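
Since the database search relies on every entry having a unique identifier, a check along these lines could be run on `new_data` before the files are used. This is a sketch under the assumption that the identifier is the first whitespace-delimited token of each header; `duplicated_ids` is a hypothetical helper and the toy entries are made up.

```python
from collections import Counter


def duplicated_ids(entries):
    """Return the identifiers that occur more than once.

    entries: iterable of (header, sequence) tuples, e.g. new_data
    """
    counts = Counter(header.split()[0] for header, _seq in entries)
    return sorted(id_ for id_, n in counts.items() if n > 1)


# toy entries for illustration; in the notebook one would pass new_data
toy_entries = [
    ('sp|P12345|PROT_HUMAN Example protein', 'MKT'),
    ('sp|P12345_4|PROT_HUMAN_4 Example protein', 'MKTANDHHHHHHD'),
    ('sp|P12345_4|PROT_HUMAN_4 Example protein', 'MKTANDHHHHHHD'),
]
print(duplicated_ids(toy_entries))
```

An empty list means all identifiers are unique. A non-empty list can indicate, for example, that the input FASTA itself contained a repeated entry.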

## Run the real data here
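
Before running on a full proteome it can be useful to estimate how large the output will be. Each protein of length n contributes its original entry plus n truncated entries, and the total number of residues grows roughly with n squared. The following is a rough sketch: `estimate_output` is a hypothetical helper and the sequence lengths are made up.

```python
def estimate_output(seq_lengths, linker_len):
    """Return (number of entries, total residues) of the brute-force FASTA.

    seq_lengths: iterable of protein lengths in the input FASTA
    linker_len:  length of the linker appended to every truncation
    """
    n_entries = 0
    n_residues = 0
    for n in seq_lengths:
        # the original entry plus one truncation for j = n, n-1, ..., 1
        n_entries += n + 1
        # original (n) + truncated parts (1 + 2 + ... + n) + one linker per truncation
        n_residues += n + n * (n + 1) // 2 + n * linker_len
    return n_entries, n_residues


lengths = [124, 300, 450]  # made-up protein lengths
entries, residues = estimate_output(lengths, linker_len=len('ANDHHHHHHD'))
print(f'{entries} entries, {residues} residues')
```

In practice the lengths could be collected from the input file with a first pass over `fasta.read(INPUT_FASTA)`.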

In [4]:

INPUT_FASTA = r'.\E.coli_proteome.fasta'
LINKER = "ANDHHHHHHD"

# read in the original fasta file and instantiate list for new data
f = fasta.read(INPUT_FASTA)
new_data = []

# iterate through each protein and add original sequence to new data
for header, seq in f:
    new_data.append((header, seq))

    # pull the header information so we can write unique protein names in the output fasta
    name, start, stop = grab_header_info(header)

    # go through each protein, sequentially remove the terminal amino acid, add the linker;
    # the first pass (j == len(seq)) keeps the full sequence and only appends the linker
    for j in range(len(seq), 0, -1):
        # seq[:j] keeps the first j residues; the header is labelled with j+1
        new_name = make_new_header(name, j+1)
        new_name = customize(header, new_name, start, stop)
        new_seq = customize(seq, LINKER, j)
        new_data.append((new_name, new_seq))

# write the data to a new fasta file with a familiar filename
fasta.write(new_data, output='custom_'+ntpath.basename(INPUT_FASTA))

# write the data as a txt file, just in case
# (mode 'a' appends, so remove any previous output before re-running this cell)
with open('custom_'+ntpath.basename(INPUT_FASTA)+'.txt', 'a') as f:
    for (header, seq) in new_data:
        new_string = '\n'.join([header, seq])
        f.write(new_string+'\n')
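
The example and real-data cells run the same loop on different inputs and collect every entry in a list before writing. For a whole proteome, one alternative is to wrap the loop in a generator so entries are produced one at a time. The following is a sketch under stated assumptions, not the code used to produce the manuscript FASTA: `iter_truncated_entries` and its `rename` argument are hypothetical, and the toy entry is made up.

```python
def iter_truncated_entries(entries, linker, rename):
    """Yield each original entry followed by its linker-terminated truncations.

    entries: iterable of (header, sequence) tuples
    linker:  sequence appended to every truncated protein
    rename:  hypothetical callable (header, position) -> new header
    """
    for header, seq in entries:
        yield header, seq
        # same order and labelling as the loops above:
        # longest truncation first, header labelled with j + 1
        for j in range(len(seq), 0, -1):
            yield rename(header, j + 1), seq[:j] + linker


# toy demonstration with a made-up entry and a simplistic rename function
toy = [('sp|P12345|PROT_HUMAN Example protein', 'MKT')]
for header, seq in iter_truncated_entries(toy, 'ANDHHHHHHD',
                                          lambda h, p: f'{h} pos={p}'):
    print(header, seq)
```

In the notebook, `rename` could wrap `grab_header_info`, `make_new_header` and `customize`. The pyteomics documentation describes `fasta.write` as taking an iterable of (description, sequence) tuples, so the generator could presumably be passed to it directly, though that has not been tested here.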