## Working with multiple sequences in `localcider`
###### Last updated 2022-09-18

localCIDER was not designed as a FASTA parser. However, [protfasta](https://protfasta.readthedocs.io/en/latest/installation.html) was, and the two can be used together to work with multiple sequences.

### Installation
Both localCIDER and protfasta can be installed using pip into your conda environment; e.g.

    pip install localcider protfasta
    
This should work out of the box.

Once both packages are installed, the scenarios below show how to calculate the FCR (fraction of charged residues) for every sequence in a FASTA file.

### NB: upgrade required

If you previously had protfasta installed, you may need to upgrade to version 0.1.10 (released Sept. 2022):

    pip install --upgrade protfasta
    
This provides support for the `convert-remove` keyword described below.    
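A quick way to check whether an upgrade is needed is to compare the installed version against 0.1.10. The sketch below uses only the standard library. `version_tuple` and `needs_upgrade` are hypothetical helper names, and the comparison assumes purely numeric, dot-separated version strings.

```python
from importlib.metadata import PackageNotFoundError, version


def version_tuple(v):
    # "0.1.10" -> (0, 1, 10); assumes purely numeric, dot-separated parts
    return tuple(int(part) for part in v.split("."))


def needs_upgrade(installed, minimum="0.1.10"):
    # Compare numerically: as plain strings, "0.1.9" would sort after "0.1.10"
    return version_tuple(installed) < version_tuple(minimum)


try:
    installed = version("protfasta")
except PackageNotFoundError:
    print("protfasta is not installed")
else:
    try:
        print(installed, "-> upgrade needed:", needs_upgrade(installed))
    except ValueError:
        # Raised if the version string has non-numeric parts
        print("Could not parse version string:", installed)
```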

### Scenario 1: Standard situation
The code below walks through how things normally work if the protein sequences in your FASTA file contain only standard amino acids.

In [1]:

    import protfasta
    from localcider.sequenceParameters import SequenceParameters

In [2]:

    # read_fasta() returns a dictionary that maps FASTA headers to sequences
    seqs = protfasta.read_fasta('demo.fasta')

    all_FCRs = []
    for idx in seqs:
        s = seqs[idx]

        # compute the FCR for this sequence
        local_FCR = SequenceParameters(s).get_FCR()

        all_FCRs.append(local_FCR)

    print('This list now has the FCR for each sequence in the FASTA file:\n')
    print(all_FCRs)

This list now has the FCR for each sequence in the FASTA file:

[0.32678132678132676, 0.4004282655246253, 0.29310344827586204, 0.297029702970297, 0.2599118942731278, 0.2396694214876033, 0.14437367303609341, 0.18316831683168316, 0.16101694915254236, 0.07207207207207207, 0.05426356589147287, 0.05063291139240506, 0.3815028901734104, 0.3978494623655914, 0.43333333333333335]
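The loop above discards the FASTA headers. If you want to keep track of which value belongs to which sequence, one option is to build a dictionary keyed by header. The sketch below uses a small hand-written dictionary in place of the `read_fasta()` output and a hypothetical `simple_fcr` helper in place of `SequenceParameters(s).get_FCR()`, so it runs without either package. It assumes FCR is the fraction of D, E, K and R residues, which may differ in detail from localCIDER's own calculation.

```python
def simple_fcr(sequence):
    # Hypothetical stand-in for SequenceParameters(sequence).get_FCR():
    # assumes FCR = fraction of residues that are D, E, K or R
    charged = sum(1 for residue in sequence if residue in "DEKR")
    return charged / len(sequence)


# Stand-in for the dictionary returned by protfasta.read_fasta():
# keys are FASTA headers, values are sequences (made-up examples)
seqs = {
    "example_1": "DEKRAAAA",
    "example_2": "GGGGGGGGGK",
}

# Map each header to its FCR so values stay linked to their sequences
fcr_by_header = {header: simple_fcr(seq) for header, seq in seqs.items()}

for header, value in fcr_by_header.items():
    print(f"{header}\t{value:.3f}")
```

With the real libraries, `simple_fcr(seq)` would be replaced by `SequenceParameters(seq).get_FCR()` and `seqs` by the output of `protfasta.read_fasta()`.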


### Scenario 1: If non-standard amino acids are found in the sequence
Sometimes your FASTA file may contain one or more non-standard amino acids that will cause localCIDER to crash. protfasta lets you handle these residues in several ways using the `invalid_sequence_action` argument.

Options available are:

* `ignore` - invalid sequences are completely ignored, i.e. kept as they are (this may let through sequences that will cause localCIDER to throw an exception)
* `fail` - an invalid sequence causes parsing to fail and throw an exception (this is the default)
* `remove` - invalid sequences are removed
* `convert` - invalid sequences are converted using the standard convention for mapping non-standard to standard amino acids:
    * B -> N
    * U -> C
    * X -> G
    * Z -> Q
    * " " -> empty string 
    * \* -> empty string
    * \- -> empty string
    
    Note that if non-standard amino acids remain after conversion, an exception is thrown.
    
* `convert-ignore` - invalid sequences are converted to valid sequences and any remaining invalid residues are ignored
* `convert-remove` - invalid sequences are converted to valid sequences where possible, and any remaining sequences with invalid residues are removed

Note that only `remove` and `convert-remove` guarantee that the returned sequences contain only valid amino acids, but they may of course lead to sequences being removed from the input dataset.
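To make the difference between these options concrete, the sketch below re-implements the conversion table and the `convert-remove` behaviour in plain Python. It is an illustration only, not protfasta's actual code: the names `STANDARD`, `CONVERSION`, `convert` and `convert_remove` are hypothetical, and it assumes the 20 standard one-letter codes are the only valid residues.

```python
STANDARD = set("ACDEFGHIKLMNPQRSTVWY")

# Conversion table as listed above
CONVERSION = {"B": "N", "U": "C", "X": "G", "Z": "Q", " ": "", "*": "", "-": ""}


def convert(sequence):
    # Replace each convertible character; leave everything else untouched
    return "".join(CONVERSION.get(residue, residue) for residue in sequence)


def convert_remove(seqs):
    # Convert every sequence, then drop any that still contain invalid residues
    converted = {header: convert(seq) for header, seq in seqs.items()}
    return {h: s for h, s in converted.items() if set(s) <= STANDARD}


seqs = {
    "ok": "MKDE",
    "fixable": "MB-XZ*U",   # becomes MNGQC after conversion
    "unfixable": "MKJO",    # J and O have no conversion here, so it is removed
}
print(convert_remove(seqs))
```

In this toy dataset the `unfixable` entry is dropped, which is the trade-off noted above: the output is guaranteed to be valid, but it may contain fewer sequences than the input.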

In [3]:

    seqs = protfasta.read_fasta('demo.fasta', invalid_sequence_action='convert-remove')

    all_FCRs = []
    for idx in seqs:
        s = seqs[idx]
        local_FCR = SequenceParameters(s).get_FCR()

        all_FCRs.append(local_FCR)

    print('This list now has the FCR for each sequence in the FASTA file:\n')
    print(all_FCRs)

This list now has the FCR for each sequence in the FASTA file:

[0.32678132678132676, 0.4004282655246253, 0.29310344827586204, 0.297029702970297, 0.2599118942731278, 0.2396694214876033, 0.14437367303609341, 0.18316831683168316, 0.16101694915254236, 0.07207207207207207, 0.05426356589147287, 0.05063291139240506, 0.3815028901734104, 0.3978494623655914, 0.43333333333333335]


### Scenario 2: Non-unique FASTA headers
In general, FASTA headers within a file are unique, although this is not actually a requirement of the FASTA specification. This means you could have a FASTA file in which multiple sequences share the same header. If this is the case, you need to pass `expect_unique_header=False` and `return_list=True`, so that `read_fasta()` returns a list rather than a dictionary, where each list element holds a FASTA header and its sequence.

In [8]:

    seqs = protfasta.read_fasta('demo.fasta', expect_unique_header=False, return_list=True)

    all_FCRs = []

    # note now we're iterating through a LIST not a dictionary
    for entry in seqs:

        # each entry has
        # entry[0] = FASTA header
        # entry[1] = sequence
        s = entry[1]
        local_FCR = SequenceParameters(s).get_FCR()

        all_FCRs.append(local_FCR)

    print('This list now has the FCR for each sequence in the FASTA file:\n')
    print(all_FCRs)

This list now has the FCR for each sequence in the FASTA file:

[0.32678132678132676, 0.4004282655246253, 0.29310344827586204, 0.297029702970297, 0.2599118942731278, 0.2396694214876033, 0.14437367303609341, 0.18316831683168316, 0.16101694915254236, 0.07207207207207207, 0.05426356589147287, 0.05063291139240506, 0.3815028901734104, 0.3978494623655914, 0.43333333333333335]
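When headers repeat, it can be handy to group the values by header rather than keep a flat list. A minimal sketch follows, using a hand-written list in the header-then-sequence layout described above, and sequence length as a stand-in for the FCR calculation so that it runs without either package. All names and data are made up for illustration.

```python
from collections import defaultdict

# Stand-in for read_fasta(..., return_list=True): [header, sequence] pairs
entries = [
    ["protein_A", "MKDE"],
    ["protein_B", "GGGGSS"],
    ["protein_A", "MKDEAAAA"],   # same header as the first entry
]

# Collect one list of values per header; duplicates are appended, not overwritten
values_by_header = defaultdict(list)
for header, sequence in entries:
    # len() is a placeholder; with localCIDER installed this would be
    # SequenceParameters(sequence).get_FCR()
    values_by_header[header].append(len(sequence))

print(dict(values_by_header))
```

Using a list per header means no value is silently lost when two records share a name, which is the main risk of storing duplicated headers in an ordinary dictionary.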
