# **Multiple Sequence Alignment**

This notebook covers the work done on multiple sequence alignment (MSA) and the strategies used to improve the alignments.
## **Goals**
1. Develop a Multiple Sequence Alignment pipeline
2. Decide which tools to use for MSA and visualization
3. Conduct Multiple Sequence Alignment on the generated FASTA format sequences
4. Visualize the results using an assortment of tools
5. Trim and develop a clean, suitable data set for phylogenetic analysis:  
5.1. Exclude unaligned sequences (non-homologous)  
5.2. Exclude short sequences
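Goal 5.2 can be illustrated with a minimal sketch, assuming the alignment is held as aligned FASTA text with `-` as the gap character. The function names and the `min_length` cut-off are hypothetical and are not taken from the pipeline scripts.

```python
from typing import Dict


def parse_fasta(text: str) -> Dict[str, str]:
    """Parse FASTA text into {header: sequence}, joining wrapped lines."""
    records: Dict[str, str] = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            records[header] = ""
        elif header is not None:
            records[header] += line
    return records


def drop_short_sequences(records: Dict[str, str], min_length: int = 500) -> Dict[str, str]:
    """Keep records with at least min_length residues, gaps excluded.

    min_length=500 is an illustrative default, not a value from this project.
    """
    return {
        name: seq
        for name, seq in records.items()
        if len(seq.replace("-", "")) >= min_length
    }
```

Counting residues rather than alignment columns matters here: after alignment every sequence has the same padded length, so only the ungapped length distinguishes short sequences.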


## **Tasks**
1. [Installing the bioinformatics tools for MSA](./03.01.MSA_tools_installation.ipynb)
2. Setting up the tools: writing the scripts, setting parameters and testing on samples from the test data
3. Evaluating the test data results from the various tools to decide which tools to use for what purpose
4. Conduct MSA on the subsets of data using the best selected tools
5. Visualize the aligned sequences and trim them to columns within the 658 bp 5' region of the COI-5P barcode
6. Finally develop a suitable data set for phylogenetic analysis
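The column trimming in task 5 can be sketched as follows, assuming the alignment is a dictionary of equal-length aligned sequences and that the start and end columns of the barcode region have already been read off the alignment in a viewer. `trim_columns` and its arguments are hypothetical names used only for illustration.

```python
from typing import Dict


def trim_columns(alignment: Dict[str, str], start: int, end: int) -> Dict[str, str]:
    """Slice every aligned sequence to columns [start, end).

    Columns are 0-based and end is exclusive, as in Python slicing.
    """
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) > 1:
        raise ValueError("sequences are not aligned to a common length")
    return {name: seq[start:end] for name, seq in alignment.items()}
```

For example, `trim_columns({"a": "--ACGT--", "b": "TTACGTAA"}, 2, 6)` keeps only the four shared middle columns of both sequences.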

### **Multiple Sequence Alignment tools**
1. [MUSCLE](http://www.drive5.com/muscle/).
It is problematic to align a large number of sequences using the global alignment algorithms used by MUSCLE, as explained in [Very large alignments are usually a bad idea](http://www.drive5.com/muscle/manual/bigalignments.html). Clustering highly identical sequences (95% or 90% identity) helps reduce the number of sequences and the challenges faced.
2. [T-Coffee (Tree based Consistency Objective Function For AlignmEnt Evaluation)](https://github.com/cbcrg/tcoffee). [The regressive mode of T-Coffee](https://github.com/cbcrg/tcoffee/blob/master/docs/tcoffee_quickstart_regressive.rst) is [described as most suitable for large datasets](https://www.biorxiv.org/content/10.1101/490235v1.full) by E. G. Nogales et al. (2018).
3. [MAFFT Version 7](https://mafft.cbrc.jp/alignment/software/). For large datasets, see [Tips for handling a large dataset](https://mafft.cbrc.jp/alignment/software/tips.html). More details are published by [T. Nakamura et al. (2018)](https://academic.oup.com/bioinformatics/article/34/14/2490/4916099).
4. [SATé (Simultaneous Alignment and Tree Estimation)](http://sysbio.oxfordjournals.org/content/61/1/90.abstract?sid=58895a54-2686-4b58-a676-3cc4d73a3b76): from the GitHub [source code](https://github.com/sate-dev/sate-core), using the [sate-tools-linux](https://github.com/sate-dev/sate-tools-linux) tools.
5. [PASTA (Practical Alignment using SATé and TrAnsitivity)](https://www.liebertpub.com/doi/full/10.1089/cmb.2014.0156): from the GitHub [source code](https://github.com/smirarab/pasta) [(Tutorial)](https://github.com/smirarab/pasta/blob/master/pasta-doc/pasta-tutorial.md).
6. Other tools: [SEPP](https://github.com/smirarab/sepp), [UPP](https://github.com/smirarab/sepp/blob/master/README.UPP.md) and [HMMER](http://hmmer.org/).
7. Visualization: the multiple sequence alignments were viewed in [Jalview](http://www.jalview.org/download) for datasets of up to a few thousand sequences and in [Seaview](http://doua.prabi.fr/software/seaview) by [Gouy et al.](https://academic.oup.com/mbe/article/27/2/221/970247) for bigger datasets. Seaview uses the FLTK project (installed separately) for its user interface.
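As one illustration of driving an aligner from this list inside the notebook, the sketch below wraps a MAFFT run in Python. It assumes `mafft` is on the `PATH`. The `--auto` and `--thread` options are standard MAFFT options, while the function names and file names are hypothetical. This is not the script used in the pipeline.

```python
import subprocess
from typing import List


def build_mafft_command(fasta_in: str, threads: int = 1) -> List[str]:
    """Build a MAFFT command line; --auto lets MAFFT pick a strategy by dataset size."""
    return ["mafft", "--auto", "--thread", str(threads), fasta_in]


def run_mafft(fasta_in: str, fasta_out: str, threads: int = 1) -> None:
    """Run MAFFT and write the alignment it prints on stdout to fasta_out."""
    with open(fasta_out, "w") as handle:
        subprocess.run(
            build_mafft_command(fasta_in, threads),
            stdout=handle,
            check=True,  # raise CalledProcessError if MAFFT exits with an error
        )
```

A call such as `run_mafft("subset.fasta", "subset.aln.fasta", threads=4)` (hypothetical file names) would leave the aligned FASTA in the second file. Keeping the command builder separate makes the exact command easy to inspect before running it.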

#### MSA evaluation methods used in T-Coffee:
We used the following [sequence based methods](https://tcoffee.readthedocs.io/en/latest/tcoffee_main_documentation.html#sequence-based-methods) to evaluate our MSAs:
1. [Computing the CORE index of any alignment](https://tcoffee.readthedocs.io/en/latest/tcoffee_main_documentation.html#computing-the-local-core-index).
2. Evaluating the [Transitive Consistency Score (TCS)](https://tcoffee.readthedocs.io/en/latest/tcoffee_main_documentation.html#transitive-consistency-score-tcs) of an MSA. The scores generated here are useful for filtering our sequences and for phylogenetic inference based on heterogeneous site evolutionary rates.
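Conceptually, filtering an alignment by a per-column reliability score amounts to the sketch below. It assumes the column scores have already been extracted into a list of numbers in which higher means more reliable. The function name, the input layout and the `min_score` threshold are hypothetical, and the code does not parse any T-Coffee output format.

```python
from typing import Dict, List


def filter_columns_by_score(
    alignment: Dict[str, str],
    column_scores: List[float],
    min_score: float = 4,
) -> Dict[str, str]:
    """Keep only the alignment columns whose score is at least min_score.

    min_score=4 is an illustrative threshold, not a recommended value.
    """
    lengths = {len(seq) for seq in alignment.values()}
    if lengths != {len(column_scores)}:
        raise ValueError("need exactly one score per alignment column")
    keep = [i for i, score in enumerate(column_scores) if score >= min_score]
    return {
        name: "".join(seq[i] for i in keep)
        for name, seq in alignment.items()
    }
```

The same column indices are applied to every sequence, so the filtered sequences stay aligned to each other.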


### **MSA Tools Installation**
**MUSCLE, T-Coffee and MAFFT** are all available from the bioconda channel and are easily installed into our Anaconda3 analysis environment (`coi_env`):

In [None]:
%%bash
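# List the available conda environments to confirm that coi_env exists
conda env list
# Uncomment to install the three aligners into coi_env from bioconda
#conda install -n coi_env -c bioconda t-coffee muscle mafft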

The up-to-date source code for **SATé** and **PASTA** is available on GitHub and was installed as described below.  
SATé, as a standalone tool\*, was not compatible with Python 3, and some of its dependencies were no longer available for Python 2. I tried to upgrade it to accommodate Python 3 but gave up after a few days.  
>\*As a standalone tool: SATé is a crucial part of PASTA, and by extension of UPP/SEPP, which won't operate without some built-in SATé modules.

In [None]:
%%bash
# PASTA
#cd ~/bioinformatics/github/co1_metaanalysis/code/tools/pasta_code/
#git clone https://github.com/smirarab/pasta.git
#git clone https://github.com/sate-dev/sate-tools-linux.git
cd pasta/
sudo python3 setup.py develop
chmod +x run_pasta.py
chmod +x run_pasta_gui.py
chmod +x run_seqtools.py

# SATé
#cd ~/bioinformatics/github/co1_metaanalysis/code/tools/sate/
#git clone https://github.com/sate-dev/sate-core.git
#git clone https://github.com/sate-dev/sate-tools-linux.git
# Assumes the working directory is the cloned sate-core/ checkout (the cd above is commented out)
sudo python3 setup.py develop
chmod +x run_sate.py
chmod +x run_sate_gui.py

For the **other tools**:
1. SEPP (SATe-enabled Phylogenetic Placement): phylogenetic placement of short reads into reference alignments and trees.
2. UPP (Ultra-large alignments using Phylogeny-aware Profiles): alignment of very large datasets, potentially containing fragmentary data.
3. HMMER (I have **NOT** used it on its own for alignment so far): uses probabilistic models called profile hidden Markov models (profile HMMs) for sequence alignment, among other functions.

The SEPP and UPP source code is available on GitHub as a single package and was installed as follows:

In [None]:
%%bash
# SEPP
#cd ~/bioinformatics/github/co1_metaanalysis/code/tools/sepp
#git clone https://github.com/smirarab/sepp.git
# Configure SEPP, then install it (run from inside the cloned sepp/ directory)
python3 setup.py config -c
python3 setup.py install

### **Visualization Tools**
1. **Seaview**: the most suitable, especially for the bigger datasets
2. **Jalview**: good for fewer sequences
3. **SuiteMSA**

Jalview and SuiteMSA are Java-based programs, while Seaview is a C++ program built on FLTK. All were installed from their source code.