# Part 7  
---

## Multiple sequence aligners + Phylogenetics

For more information about the theory behind multiple sequence alignment, please see these resources:

- https://academic.oup.com/bib/article/17/6/1009/2606431
- https://www.hindawi.com/journals/isrn/2013/615630/

Note: There is a user-friendly tool called MEGA (https://www.megasoftware.net/) that is likely sufficient for analyzing closely related species (within a phylum). However, it lacks some of the more sophisticated phylogenetic methods.

### Aligners

Here is a non-exhaustive list of recommended aligners. Unfortunately, most do not have Windows packages, but some of them do have web servers.  
- [MAFFT](https://mafft.cbrc.jp/alignment/software/) 
- [T-COFFEE](http://www.tcoffee.org/Projects/tcoffee/workshops/tcoffeetutorials/installation.html)
- [MUSCLE](https://www.drive5.com/muscle/) -- has a Windows distribution!
- [HMMER](http://hmmer.org/) *great for aligning distantly related proteins*
  - Profile alignment: guides the alignment of your sequences using a well-curated template (profile) file.

### Alignment viewers

The most popular viewer on the market is [Jalview](https://www.jalview.org/), but there are older alternatives such as [SeaView](http://doua.prabi.fr/software/seaview).

### Masking alignments

If you want to proceed to phylogenetic analysis you can treat your alignment with a 'masking' method. This is a way to 'mask' or 'hide' regions of the alignment that might be misaligned (and therefore non-homologous). Historically, this was done by eye (yep..) but now we have tools to help us out! My favourites are: 

- [BMGE](https://bmcecolevol.biomedcentral.com/articles/10.1186/1471-2148-10-210)
- [trimAl](http://trimal.cgenomics.org/) -- has a Windows distribution
- [Divvier](https://github.com/simonwhelan/Divvier) -- for advanced users
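
To make the idea of masking concrete, here is a minimal Python sketch that drops alignment columns in which the fraction of gaps exceeds a threshold. It only illustrates the general idea and is not the algorithm used by BMGE, trimAl or Divvier. The function name `mask_gappy_columns`, the toy alignment and the 0.5 threshold are assumptions made for this example.

```python
def mask_gappy_columns(alignment, max_gap_fraction=0.5):
    """Remove columns whose gap fraction exceeds max_gap_fraction.

    alignment: dict mapping sequence ID -> aligned sequence (equal lengths).
    Returns a new dict with the same IDs and the masked sequences.
    """
    seqs = list(alignment.values())
    if not seqs:
        return {}
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("All aligned sequences must have the same length")

    # Collect the indices of the columns we want to keep
    keep = []
    for col in range(length):
        gaps = sum(1 for s in seqs if s[col] == "-")
        if gaps / len(seqs) <= max_gap_fraction:
            keep.append(col)

    return {name: "".join(seq[i] for i in keep) for name, seq in alignment.items()}


# Toy alignment (hypothetical sequences): the third column is mostly gaps
toy = {
    "seqA": "MK-LV",
    "seqB": "MK-LV",
    "seqC": "MKALV",
    "seqD": "M--LI",
}
masked = mask_gappy_columns(toy)
for name, seq in masked.items():
    print(name, seq)
```

Real masking tools use more careful criteria than a single gap threshold, so treat this as a way to build intuition rather than a replacement for them.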

### Phylogenetic Analyses

Finally, for phylogenetic analyses, I recommend maximum-likelihood or Bayesian methods:

- [IQTREE](http://www.iqtree.org/)
- [RAxML](https://cme.h-its.org/exelixis/web/software/raxml/)
- [PhyloBayes](http://www.atgc-montpellier.fr/phylobayes/)

### Tree viewers

- [FIGTREE](http://tree.bio.ed.ac.uk/software/figtree/)
- [iTOL](https://itol.embl.de/)



Since MUSCLE has a Windows distribution, let's try to get that working. Mac/Unix/Colab users, head to the next section.

### Windows

Windows folks, head to these links and download the executables (tar.gz again):

https://www.drive5.com/muscle/downloads.htm

http://trimal.cgenomics.org/downloads

http://www.iqtree.org/#download


Note the path where you have downloaded these, including the \bin\ folder.


In [None]:
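# Add each package's folder to PATH. A "!set PATH=..." line only affects
# its own temporary shell, so change the notebook's environment instead.
import os
os.environ["PATH"] += os.pathsep + r"C:\your\path\here"
!muscle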

In [None]:
### MAC/UNIX/COLAB

!conda install --yes -c bioconda muscle
!conda install -c bioconda --yes trimal
!conda install --yes -c bioconda iqtree


In [None]:
!muscle 
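
If a tool does not respond, a quick way to check whether it is visible on your PATH is `shutil.which` from the Python standard library. The sketch below is only a convenience check. The helper name `find_tools` is made up for this example, and the executable names are assumptions: depending on the version you installed, a tool may be named differently (for example, IQ-TREE 2 is often installed as `iqtree2`).

```python
import shutil


def find_tools(names):
    """Return a dict mapping each tool name to its full path, or None if not found."""
    return {name: shutil.which(name) for name in names}


# Executable names are assumptions; adjust them to match your installation
tools = find_tools(["muscle", "trimal", "iqtree"])
for name, path in tools.items():
    status = path if path else "NOT FOUND on PATH"
    print(f"{name}: {status}")
```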

In [None]:
## Run the aligner using MUSCLE
## It will save to the file SDHA_ncbi.mus.fasta
## (this is MUSCLE v3 syntax; MUSCLE v5 uses -align and -output instead of -in and -out)

!muscle -quiet -in SDHA_ncbi.fasta -out SDHA_ncbi.mus.fasta 

## Run the 'masking' or 'trimming' tool

!trimal -gappyout -in SDHA_ncbi.mus.fasta  -out SDHA_ncbi.mus.go.fasta 


In [None]:
## Run the tree making tool 

!iqtree -s SDHA_ncbi.mus.go.fasta --mset LG,WAG --mrate G,I -bb 1000 -pre SDHA_ncbi.mus.go.mfp -quiet
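
The `!` prefix only works inside a notebook. In a standalone script, one option is to run the same three commands through the standard-library `subprocess` module. The sketch below copies the flags from the cells above; the function names `build_commands` and `run_pipeline` and the `dry_run` switch are assumptions made for this example. With `dry_run=True` it only prints the commands, so it can be tried without the tools installed.

```python
import subprocess


def build_commands(fasta, prefix):
    """Build the align, trim and tree commands as argument lists."""
    aligned = f"{prefix}.mus.fasta"
    trimmed = f"{prefix}.mus.go.fasta"
    return [
        ["muscle", "-quiet", "-in", fasta, "-out", aligned],
        ["trimal", "-gappyout", "-in", aligned, "-out", trimmed],
        ["iqtree", "-s", trimmed, "--mset", "LG,WAG", "--mrate", "G,I",
         "-bb", "1000", "-pre", f"{prefix}.mus.go.mfp", "-quiet"],
    ]


def run_pipeline(fasta, prefix, dry_run=True):
    """Print each command and, unless dry_run is set, execute it in order."""
    for cmd in build_commands(fasta, prefix):
        print(" ".join(cmd))
        if not dry_run:
            # check=True stops the pipeline if a tool exits with an error
            subprocess.run(cmd, check=True)


run_pipeline("SDHA_ncbi.fasta", "SDHA_ncbi", dry_run=True)
```

Passing the command as a list (rather than one string) avoids shell quoting problems when file names contain spaces.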


## Making a standalone python script

Save the following in a file called myScript.py

In [None]:
#!/path/to/python

"""
Description:
 multiline description, all this text will be displayed as it is
 so please don't mess this up.
 You can also add as many sections as needed

Dependencies:
  module1
  module2
  module3

Examples:
 (1) python script_name -i input_file.ext
       The script will automatically create a file based on input file
 (2) python script_name -i input_file.ext -o output_file

Author:
 YOU

"""
################## IMPORTS ################################

import argparse





################ DEFINITIONS ################################

################ ARGUMENTS ##################################

parser = argparse.ArgumentParser(description='Template script: reports the input and output file names')
parser.add_argument("-f", "--fasta", required=True, help="input FASTA file")
parser.add_argument("-o", "--output", required=True, help="output file")
args = parser.parse_args()

### Main ### 

infile = args.fasta
outfile = args.output
print("The fasta file is %s" % (infile))
print("The output file is %s" % (outfile))


## Project


Project - from sequence to annotation. Combine what you have learned. Write a SCRIPT to help you annotate an unknown protein locally - that is, do not simply run your sequence through BLAST online. Your workflow should include at least 3 of the following outputs.


- The protein sequence in a fasta file: "myProtein.fasta"
- A small FASTA file of the most related sequences from the swissprot database (try -evalue 0.01)
   - hint, you can use the IDs to search against NCBI with the e-utilities!
- A table describing each of the top 30-50 related sequences. For example:
   - Description of the sequence (annotation)
   - Length in amino acids
   - Organism, other taxonomy?
   - Other fun facts? GRAVY?
- A filtered table with ONLY eukaryotes. 
   - hint: df[df['Domain'] == "Eukaryota"]
- Multiple sequence alignment of your mystery sequence and the top 30-50 proteins
    - Challenge: before making the alignment, think about what you would like in your header names. Could taxonomy be helpful? Beware: tools like to split headers on whitespace, so your sequence IDs will likely go from >myACCESSION myDescription... to >myACCESSION.
- Phylogenetic tree of these proteins (keep the model simple, see example)
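
One way to approach the header challenge above is to replace whitespace in each FASTA header with underscores before aligning, so the description survives tools that split on whitespace. This is a minimal sketch, not the only approach. The function names `sanitize_header` and `sanitize_fasta` and the example header are made up for illustration. Characters such as brackets, colons and commas can also cause trouble in tree files, and this sketch does not handle them.

```python
def sanitize_header(header):
    """Replace runs of whitespace in a FASTA header with single underscores."""
    return ">" + "_".join(header.lstrip(">").split())


def sanitize_fasta(lines):
    """Yield FASTA lines with whitespace-free headers; sequence lines are unchanged."""
    for line in lines:
        line = line.rstrip("\n")
        yield sanitize_header(line) if line.startswith(">") else line


# Hypothetical record, for illustration only
example = [
    ">myACCESSION my description Genus species\n",
    "MKLVAAG\n",
]
cleaned = list(sanitize_fasta(example))
print("\n".join(cleaned))
```

To apply it to a file, pass an open file handle instead of the list, since iterating over a file yields its lines.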

Practice building the workflow in a Jupyter notebook, but then convert it to a stand-alone script using argparse so you can run it like this: 

python myScript.py --input (accessionnumber) --output_taxonomy XXX --output_myProtein --output_table ...

Too many outputs? Try: 
python myScript.py --input (accessionnumber) --output_prefix XXX 
Then, in your script, you can build each output file name from the prefix 'XXX' followed by a suffix. We'll show some examples in class.  
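
As a rough sketch of the prefix idea, the snippet below derives every output file name from a single `--output_prefix` argument. The suffixes, the `SUFFIXES` dictionary and the helper names `output_paths` and `parse_args` are assumptions chosen for this example; pick whatever names suit your workflow.

```python
import argparse

# Hypothetical suffixes, one per output of the workflow
SUFFIXES = {
    "protein": ".myProtein.fasta",
    "table": ".table.tsv",
    "alignment": ".aln.fasta",
    "tree": ".treefile",
}


def output_paths(prefix):
    """Return a dict mapping each output type to prefix + suffix."""
    return {key: prefix + suffix for key, suffix in SUFFIXES.items()}


def parse_args(argv=None):
    """Parse command-line arguments; argv=None means read from sys.argv."""
    parser = argparse.ArgumentParser(description="Annotate a protein (sketch)")
    parser.add_argument("--input", required=True, help="accession number")
    parser.add_argument("--output_prefix", required=True,
                        help="prefix used for all output files")
    return parser.parse_args(argv)


# An explicit list is passed here so the example runs anywhere;
# in a real script, call parse_args() with no argument.
args = parse_args(["--input", "ACCESSION", "--output_prefix", "XXX"])
paths = output_paths(args.output_prefix)
for key, path in paths.items():
    print(key, "->", path)
```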

Your nucleotide sequence is below. 

