# **Multiple Sequence Alignment Pipeline**
This notebook documents the configuration of the MSA tools selected for our work:
1. MUSCLE
2. MAFFT v7
3. T_COFFEE
4. PASTA
5. SATé

After a thorough literature review, the following functionalities of the above tools were selected and tested:
### **1. MUSCLE:**
1. **Large data alignment:** Does not give the best output of the tools tested; see the bash function below

In [None]:
%%bash
muscle_large() { #MUSCLE alignment of large datasets; long execution times are an issue
        #-maxiters 2: iterations beyond 2 attempt refinement, which often yields at most a small improvement
        usage $@
        echo "muscle starting alignment..."

        for i in $@
        do
                if [ ! -f $i ]
                then
                        echo "input error: file $i is non-existent!"
                elif [[ ( -f $i ) && ( `basename $i` =~ .*\.(afa|fasta|fa) ) ]]
                then
                        rename
                        echo -e "\nproceeding with file `basename $i`..."
                        muscle -in $i -fastaout ${muscle_dest}./aligned/${output_filename}.afa -clwout ${muscle_dest}./aligned/${output_filename}.aln -maxiters 2
                else
                        echo "input file error in `basename $i`: input file should be a .fasta file format"
                        continue
                fi
        done
}
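
The functions in this notebook rely on helpers from the sourced script (`usage`, `rename`) that are not shown here. The extension test they use, `=~ .*\.(afa|fasta|fa)`, is not anchored at the end of the name, so a file such as `notes.fa.txt` would also pass. The following is a minimal sketch of an anchored check and of deriving an output name from the input file. `has_fasta_ext` and `strip_ext` are hypothetical names used for illustration only and are not taken from the script.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for illustration; they are not part of the sourced script.

# Succeeds only when the file name ENDS in one of the accepted extensions.
has_fasta_ext() {
    local name
    name=$(basename -- "$1")
    [[ $name =~ \.(afa|fasta|fa)$ ]]
}

# Prints the file name without its directory and without its final extension.
strip_ext() {
    local name
    name=$(basename -- "$1")
    printf '%s\n' "${name%.*}"
}

for f in ../data/COI_testa00_data.fasta notes.fa.txt; do
    if has_fasta_ext "$f"; then
        echo "$f -> $(strip_ext "$f")"
    else
        echo "$f rejected"
    fi
done
```

The trailing `$` in the pattern is what rejects `notes.fa.txt`; quoting `"$1"` also keeps paths containing spaces intact.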

Test run:

In [1]:
%%bash
pwd

/home/kibet/bioinformatics/github/co1_metaanalysis/code


In [None]:
%%bash
source ./align.sh
muscle_large ../data/input/test_data/COI_testa00_data.fasta

2. **MSA alignment refinement:** Proved useful in refining a large alignment, especially one from PASTA

In [None]:
%%bash
muscle_refine() {
        #attempt to improve an existing alignment. Input is an existing MSA
        usage $@
        echo "starting refinement of existing MSAs..."

        for i in $@
        do
                if [ ! -f $i ]
                then
                        echo "input error: file $i is non-existent!"
                elif [[ ( -f $i ) && ( `basename $i` =~ .*\.(afa|fasta|fa|aln) ) ]]
                then
                        rename
                        echo -e "\nproceeding with file `basename $i`..."
                        muscle -in $i -fastaout ${muscle_dest}./refined/r${output_filename}.afa -clwout ${muscle_dest}./refined/r${output_filename}.aln -refine
                else
                        echo "input file error in `basename $i`: input file should be a .afa file format"
                        continue
                fi
        done
}

Test run (using a PASTA alignment output):

In [None]:
%%bash
source ./align.sh
muscle_refine ../data/output/alignment/pasta_output/aligned/COI_testa00_data.aln

3. **MSA profile to profile alignment:** Useful in merging two alignment files of homologous sequences

In [None]:
%%bash
muscle_p2p() {
        #takes two existing MSAs ("profiles") and aligns them to each other, keeping the columns in each MSA intact. The final alignment is made by inserting columns of gaps into the MSAs as needed. The alignments of sequences in each input MSA are thus fully preserved in the output alignment.
        usage $@
        #unset takes the variable name, not its value
        unset in1_outname && echo "good: output filename var in1_outname is empty"
        unset in2_outname && echo "good: output filename var in2_outname is empty"


        for i in $@
        do
                if [ $# -ne 2 ]
                then
                        echo "input file error: only two input files are allowed!"
                        break
                elif [[ ( -f $i ) && ( `basename $i` =~ .*\.(afa|fasta|fa|aln) ) ]]
                then
                        if [ -z $in1_outname ]
                        then
                                rename $i
                                in1_outname=$output_filename
                        else
                                rename $i
                                in2_outname=$output_filename
                        fi
                        continue
                elif [ ! -f $i ]
                then
                        echo "input error: file $i is non-existent!"
                        break
                else
                        echo "input file error in `basename $i`: input file should be a .afa file format"
                        break
                fi
        done

        if [[ ( -n $in1_outname ) && ( -n $in2_outname  )  ]]
        then
                echo -e "\nproceeding with file `basename $1` and `basename $2`..."
                muscle -profile -in1 $1 -in2 $2 -fastaout ${muscle_dest}./merged/${in1_outname}_${in2_outname}.afa -clwout ${muscle_dest}./merged/${in1_outname}_${in2_outname}.aln
        else
                echo "error with output filenames: in1_outname and/or in2_outname is not set"
        fi
}

Test run: Using two PASTA alignment output files

In [None]:
%%bash
source ./align.sh
muscle_p2p ../data/output/alignment/pasta_output/aligned/COI_testa00_data.aln ../data/output/alignment/pasta_output/aligned/COI_testa01_data.aln

### **2. MAFFT:**
MAFFT version 7 was used and the following functionalities tested:
1. **G-INS-1 option (with the --large flag) for large data alignment**: Just as accurate as the standard MAFFT global alignment option G-INS-1

In [None]:
%%bash
###mafft :highly similar ∼50,000 – ∼100,000 sequences × ∼5,000 sites incl. gaps (2016/Jul)

##G-INS-1 option is applicable to large data, when huge RAM and a large number of CPU cores are available (at most 26.0 GB)
#With the new flag, --large, the G-INS-1 option has become applicable to large data without using huge RAM. This option uses files, instead of RAM, to store temporary data. The default location of temporary files is $HOME/maffttmp/ (linux, mac and cygwin) or %TMP% (windows) and can be changed by setting the MAFFT_TMPDIR environment variable.

#syntax:        mafft --large --globalpair --thread n in > out

mafft_GlINS1() {
        usage $@

        for i in $@
        do
                if [ ! -f $i ]
                then
                        echo "input error: file $i is non-existent!"
                elif [[ ( -f $i ) && ( `basename $i` =~ .*\.(afa|fasta|fa) ) ]]
                then
                        rename
                        echo -e "\nmafft G-large-INS-1 MSA: proceeding with file `basename $i`..."
                        printf "Choose from the following output formats: \n"
                        select output_formats in fasta_output_format clustal_output_format none_exit
                        do
                                case $output_formats in
                                        fasta_output_format)
                                                echo -e "\nGenerating .fasta output\n"
                                                mafft --large --globalpair --thread -1 --reorder $i > ${mafft_dest}aligned/${output_filename}.fasta
                                                break
                                                ;;
                                        clustal_output_format)
                                                echo -e "\nGenerating a clustal format output\n"
                                                mafft --large --globalpair --thread -1 --reorder --clustalout $i > ${mafft_dest}aligned/${output_filename}.aln
                                                break
                                                ;;
                                        none_exit)
                                                break
                                                ;;
                                        *) echo "error: Invalid selection!"
                                esac
                        done
                else
                        echo "input file error in `basename $i`: input file should be a .fasta file format"
                        continue
                fi
        done
}

Test run:

In [None]:
%%bash 
source ./align.sh
mafft_GlINS1 ../data/input/test_data/COI_testa00_data.fasta << EOF
1
EOF
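
The here-document in the test run works because bash's `select` prints its menu on standard error and reads the reply from standard input. The following is a minimal sketch of that mechanism. `choose_format` is a hypothetical stand-in for the menus used in the functions above and does not call MAFFT.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for the output-format menus; it only echoes the choice.
choose_format() {
    local choice
    # select writes the numbered menu and prompt to stderr and reads from stdin.
    select choice in fasta_output_format clustal_output_format none_exit; do
        case $choice in
            fasta_output_format|clustal_output_format|none_exit)
                echo "$choice"
                break
                ;;
            *)
                # An unrecognised reply leaves $choice empty; the menu is shown again.
                echo "error: Invalid selection!" >&2
                ;;
        esac
    done
}

# Feeding "2" on stdin picks the second menu entry without a terminal.
choose_format 2>/dev/null << EOF
2
EOF
```

When several input files are passed to one call, each file triggers its own menu, so the here-document would need one reply line per file.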

2. **L-INS-I option for large data alignment:** More accurate than the G-INS-I option in some cases but less suitable for large data; in this case we use the MPI (Message Passing Interface) version

In [None]:
%%bash
#MPI version of high-accuracy progressive options, [GLE]-large-INS-1; Two environmental variables, MAFFT_N_THREADS_PER_PROCESS and MAFFT_MPIRUN, have to be set:
#The number of threads to run in a process: Set "1" unless using a MPI/Pthreads hybrid mode.
#       export MAFFT_N_THREADS_PER_PROCESS="1"
#Location of mpirun/mpiexec and options: mpirun or mpiexec must be from the same library as mpicc that was used in compiling
#       export MAFFT_MPIRUN="/somewhere/bin/mpirun -n 160 -npernode 16 -bind-to none ..." (for OpenMPI)
#OR     export MAFFT_MPIRUN="/somewhere/bin/mpirun -n 160 -perhost  16 -binding none ..." (for MPICH)

#mpi command: Add "--mpi --large" to the normal command of G-INS-1, L-INS-1 or E-INS-1
#       mafft --mpi --large --localpair --thread 16 input

#mafft L-INS-I command:
#mafft --localpair --maxiterate 1000 input_file > output_file

mafft_local() {
        usage $@

        for i in $@
        do
                if [ ! -f $i ]
                then
                        echo "input error: file $i is non-existent!"
                elif [[ ( -f $i ) && ( `basename $i` =~ .*\.(afa|fasta|fa) ) ]]
                then
                        rename
                        echo -e "\nmafft L-large-INS-1 MSA: proceeding with file `basename $i`..."
                        printf "Choose from the following output formats: \n"
                        select output_formats in fasta_output_format clustal_output_format none_exit
                        do
                                case $output_formats in
                                        fasta_output_format)
                                                echo -e "\nGenerating .fasta output\n"
                                                bash $mpionly
                                                mafft --mpi --large --localpair --thread -1 --reorder $i > ${mafft_dest}aligned/${output_filename}_l.fasta
                                                break
                                                ;;
                                        clustal_output_format)
                                                echo -e "\nGenerating a clustal format output\n"
                                                bash $mpionly
                                                mafft --mpi --large --localpair --thread -1 --reorder --clustalout $i > ${mafft_dest}aligned/${output_filename}_l.aln
                                                break
                                                ;;
                                        none_exit)
                                                break
                                                ;;
                                        *) echo "error: Invalid selection!"
                                esac
                        done
                else
                        echo "input file error in `basename $i`: input file should be a .fasta file format"
                        continue
                fi
        done
}
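
The function above runs `bash $mpionly` before the MPI call. The contents of that script are not shown in this notebook. If it is meant to export `MAFFT_N_THREADS_PER_PROCESS` and `MAFFT_MPIRUN`, note that variables exported by a script run with `bash` exist only in that child process, whereas `source` sets them in the current shell. A small sketch of the difference follows; the temporary settings file and its value are placeholders, not the real `$mpionly`.

```shell
#!/usr/bin/env bash
# Sketch only: the settings file below is a hypothetical placeholder.
settings=$(mktemp)
cat > "$settings" << 'EOF'
export MAFFT_N_THREADS_PER_PROCESS="1"
EOF

unset MAFFT_N_THREADS_PER_PROCESS

# Running the file with bash sets the variable in a child process only.
bash "$settings"
after_bash=${MAFFT_N_THREADS_PER_PROCESS:-unset}

# Sourcing the file sets the variable in the current shell.
source "$settings"
after_source=${MAFFT_N_THREADS_PER_PROCESS:-unset}

echo "after bash:   $after_bash"
echo "after source: $after_source"
rm -f "$settings"
```

If `$mpionly` does something else, for example launching a helper process, running it with `bash` may be exactly what is wanted.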

Test run:

In [None]:
%%bash
source ./align.sh
mafft_local ../data/input/test_data/COI_testb01_data.aln << EOF
1
EOF

3. **MSA merging option**: Just like MUSCLE merging option, it merges two MSA profiles, but with a different strategy.

In [None]:
%%bash
##Merge multiple sub-MSAs into a single MSA
#Each sub-MSA is forced to form a monophyletic cluster in version 7.043 and higher (2013/May/26).

#syntax:        cat subMSA1 subMSA2 > input
#               ruby makemergetable.rb subMSA1 subMSA2 > subMSAtable
#               mafft --merge subMSAtable input > output

mafft_merge() {
        echo -e "\nwarning: Each sub-MSA is forced to form a monophyletic cluster in version 7.043 and higher (2013/May/26)."
        printf "Enter [Yes] to continue or [No] to exit: "
        read choice
        case $choice in
                [yY][eE][sS] | [yY] )
                        usage $@ #testing the arguments
                        RUBY_EXEC=$( which ruby )

                        inputfiletest $@ #assessing the validity of the input files

                        outputfilename $@ # generating output file name

                        cat $@ > ${mafft_dest}merged/input.fasta
                        ${RUBY_EXEC} ${makemergetable} $@ > ${mafft_dest}merged/subMSAtable
                        mafft --merge ${mafft_dest}merged/subMSAtable ${mafft_dest}merged/input.fasta > ${mafft_dest}merged/${outname}.fasta
                        ;;
                [nN][oO] | [nN] )
                        echo "exiting the operation"
                        ;;
                *)
                        echo "Invalid input: please enter [Yes] or [No]"
                        ;;
        esac

}
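
To make the role of `subMSAtable` concrete, the sketch below builds a comparable table in plain bash. It assumes, without having checked the `makemergetable.rb` source, that the table holds one line per sub-MSA listing the 1-based positions of that sub-MSA's sequences in the concatenated input. `make_merge_table` is a hypothetical name and is not a replacement for the Ruby script.

```shell
#!/usr/bin/env bash
# Rough sketch under the assumption stated above; not part of MAFFT.
make_merge_table() {
    local start=1 n f k line
    for f in "$@"; do
        # Number of sequences in this sub-MSA (FASTA header lines).
        n=$(grep -c '^>' "$f")
        line=""
        for ((k = start; k < start + n; k++)); do
            line+="$k "
        done
        printf '%s\n' "${line% }"
        start=$((start + n))
    done
}

# Demonstration with two tiny placeholder alignments.
tmpdir=$(mktemp -d)
printf '>a\nAC-T\n>b\nACGT\n' > "$tmpdir/sub1.fasta"
printf '>c\nA-GT\n>d\nACGT\n>e\nAC-T\n' > "$tmpdir/sub2.fasta"
make_merge_table "$tmpdir/sub1.fasta" "$tmpdir/sub2.fasta"
rm -r "$tmpdir"
```

The files must be listed in the same order in which they are concatenated with `cat`, otherwise the positions would point at the wrong sequences.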

Test run: uses two PASTA alignment output files

In [None]:
%%bash
source ./align.sh
mafft_merge ../data/output/alignment/pasta_output/aligned/COI_testa00_data.aln ../data/output/alignment/pasta_output/aligned/COI_testa01_data.aln << EOF
yes
EOF

4. **MAFFT --add option:** For adding unaligned full-length sequence(s) into an existing alignment

In [None]:
%%bash
#new_sequences and existing_alignment are each a single multi-FASTA format file. Gaps in existing_alignment are preserved, but the alignment length may change under the default settings. If the --keeplength option is given, the alignment length is unchanged and insertions in the new sequences are deleted. --reorder rearranges the sequence order.

#syntax:        % mafft --add new_sequences --reorder existing_alignment > output

mafft_add() {
        echo -e "\n \$1 should be the new_sequences: unaligned full-length sequence(s) to be added into the existing alignment (\$2) "
        printf "\nEnter [Yes] to continue or [No] to exit: "
        read choice
        case $choice in
                [yY][eE][sS] | [yY] )
                        usage $@

                        inputfiletest $@

                        outputfilename $@

                        mafft --add $1 --reorder $2 > ${mafft_dest}addseq/${outname}.fasta
                        ;;
                [nN][oO] | [nN] )
                        echo "exiting the --add sequences operation"
                        ;;
                *)
                        echo "Invalid input: please enter [Yes] or [No]"
                        ;;
        esac
}

Test run: Uses one PASTA alignment output and an unaligned sample sequence file

In [None]:
%%bash
source ./align.sh
mafft_add ../data/input/test_data/COI_testb01_data.aln ../data/output/alignment/pasta_output/aligned/COI_testa00_data.aln << EOF
yes
EOF

5. **MAFFT --addfragments option:** For adding unaligned fragmentary sequence(s) into an existing alignment

In [None]:
%%bash
## --addfragments: Adding unaligned fragmentary sequence(s) into an existing alignment

#syntax:        Accurate option
#syntax:        % mafft --addfragments fragments --reorder --thread -1 existing_alignment > output
#               Fast option (accurate enough for highly similar sequences):
#               % mafft --addfragments fragments --reorder --6merpair --thread -1 existing_alignment > output

mafft_addfragments() {
        echo -e "\n \$1 (fragments) and \$2 (existing_alignment) should each be a single multi-FASTA format file "
        printf "\nEnter [Yes] to continue or [No] to exit:  "
        read choice
        case $choice in
                [yY][eE][sS] | [yY] )
                        usage $@

                        inputfiletest $@

                        outputfilename $@

                        mafft --addfragments $1 --reorder --thread -1 $2 > ${mafft_dest}addfragments/${outname}.fasta
                        ;;
                [nN][oO] | [nN] )
                        echo "exiting the --addfragments operation"
                        ;;
                *)
                        echo "Invalid input: please enter [Yes] or [No]"
                        ;;
        esac
}

Test run: adding samples with fewer than 500 nucleotides to an alignment of samples with 650 to 660 nucleotides

In [None]:
%%bash
source ./align.sh
mafft_addfragments ../data/input/test_data/COI_testd06_data.fasta ../data/output/alignment/pasta_output/aligned/COI_testc04_data.aln << EOF
yes
EOF

### **3. T_Coffee:**
1. **T_Coffee regressive mode for very large sequence alignment:** Uses a regressive algorithm as opposed to the common progressive algorithm.

In [None]:
%%bash
# t_coffee: the regressive mode of T-Coffee is meant to align very large datasets with high accuracy.
#It starts by aligning the most distantly related sequences first.
#It uses a guide tree to extract the N most diverse sequences.
#In this first intermediate MSA, each sequence is either a leaf or the representative of a subtree.
#The algorithm is re-applied recursively onto every representative sequence until all sequences have been incorporated in an intermediate MSA of max size N.
#The final MSA is then obtained by merging all the intermediate MSAs.

#Fast and accurate: the regressive alignment is used to align the sequences in FASTA format. The tree is estimated using the mbed method of Clustal Omega (-reg_tree=mbed), the size of the groups is 100 (-reg_nseq=100) and the method used to align the groups is Clustal Omega:

#syntax:        $ t_coffee -reg -seq proteases_large.fasta -reg_nseq 100 -reg_tree mbed -reg_method clustalo_msa -outfile proteases_large.aln -outtree proteases_large.mbed

#-seq           :provide sequences. must be in FASTA format
#-reg_tree      :defines the method to be used to estimate the tree
#-outtree       :defines the name of newly computed out tree. mbed method of Clustal Omega is used.
#-outfile**     :defines the name of output file of the MSA
#-reg_nseq      :sets the max size of the subsequence alignments; the groups is 100
#-reg_thread    :sets max threads to be used
#-reg_method**  :defines the method to be used to estimate MSA: Clustal Omega
#-multi_core    :Specifies whether T-Coffee should be multithreaded; by default all relevant steps are parallelized; DEFAULT: templates_jobs_relax_msa_evaluate OR templates_jobs_relax_msa_evaluate (when flag set)
#-n_core        :Number of cores to be used by machine [default=0 => all those defined in the environment]

tcoffee_large() {
        usage $@
        echo "t-coffee starting alignment..."

        for i in $@
        do
                if [ ! -f $i ]
                then
                        echo "input error: file $i is non-existent!"
                elif [[ ( -f $i ) && ( `basename $i` =~ .*\.(afa|fasta|fa) ) ]]
                then
                        rename
                        echo -e "\nproceeding with file `basename $i`..."
                        t_coffee -reg -multi_core -n_core=32 -seq $i -reg_nseq 100 -reg_tree mbed -reg_method clustalo_msa -outfile ${tcoffee_dest}aligned/${output_filename}.fasta -newtree ${tcoffee_dest}trees/${output_filename}.mbed
                else
                        echo "input file error in `basename $i`: input file should be a .fasta file format"
                        continue

                fi
        done

}

Test Run:

In [None]:
%%bash
source ./align.sh
tcoffee_large ../data/input/test_data/COI_testa00_data.fasta

2. **T_Coffee MSA evaluation option CORE index**:  
Useful in scoring the various results from different alignment algorithms and choosing the most accurate/suitable algorithm

In [None]:
%%bash
##Evaluating your alignment: most T-Coffee evaluation methods are designed for protein sequences (notably the structure-based methods); however, the sequence-based methods (TCS and CORE index) also make it possible to evaluate DNA alignments

#The CORE index, which underlies T-Coffee's evaluation, is an estimate of the consistency between your alignment and the computed library (by default a list of pairs of residues that align in possible global and 10 best local pairwise alignments). The higher the consistency, the better the alignment.
#Computing the CORE index of any alignment: to evaluate any existing alignment with the CORE index, provide that alignment with the -infile flag and specify that you want to evaluate it

#syntax:        $ t_coffee -infile=proteases_small_g10.aln -output=html -score

COREindex() { #Evaluating an existing alignment with the CORE index
        usage $@
        echo "t_coffee starting MSA alignment evaluation using CORE index... "

        for i in $@
        do
                if [ ! -f $i ]
                then
                        echo "input error: file $i is non-existent!"
                elif [[ ( -f $i ) && ( `basename $i` =~ .*\.(afa|fasta|fa|aln) ) ]]
                then
                        rename
                        outfile_dest
                        echo -e "\nproceeding with `basename $i` alignment file evaluation..."
                        t_coffee -infile=$i -multi_core -n_core=32 -output=html -score -outfile ${output_dest}scores/coreindex/${output_filename}.html
                else
                        echo "input file error in `basename $i`: input file should be a *.aln file format"
                        continue
                fi
        done
}

**Test run**

In [None]:
%%bash
source ./align.sh
COREindex ../data/output/alignment/pasta_output/aligned/COI_testa00_data.aln ../data/output/alignment/pasta_output/aligned/COI_testa01_data.aln

3. **T_Coffee MSA evaluation Transitive Consistency Score (TCS) option**:   
Like the CORE index above, it is used to score the different alignments and choose the most accurate/suitable algorithm. Its output can also be used in phylogenetic inference to give different weights to the alignment columns

In [None]:
%%bash
# However, to evaluate an alignment, the use of the Transitive Consistency Score (TCS) procedure is recommended. TCS is an alignment evaluation score that makes it possible to identify the most correct positions in an MSA. It has been shown that these positions are the most likely to be structurally correct and also the most informative when estimating phylogenetic trees.
#Evaluating an existing MSA with Transitive Consistency Score (TCS): most informative when used to identify low-scoring portions within an MSA. *.score_ascii file displays the score of the MSA, the sequences and the residues. *.score_html file displays a colored version score of the MSA, the sequences and the residues

#syntax:        $ t_coffee -infile sample_seq1.aln -evaluate -output=score_ascii,aln,score_html

TCSeval() { #Evaluating an existing alignment with the TCS
        usage $@
        echo "t_coffee starting MSA alignment evaluation using TCS... "

        for i in $@
        do
                if [ ! -f $i ]
                then
                        echo "input error: file $i is non-existent!"
                elif [[ ( -f $i ) && ( `basename $i` =~ .*\.(afa|fasta|fa|aln) ) ]]
                then
                        rename
                        outfile_dest
                        if [ -z "$output_dest" ]
                        then
                                echo -e "\noutput destination folder not set: input can only be sourced from: \n$muscle_dest; \n$mafft_dest; \n$tcoffee_dest; \n$pasta_dest; and \n$sate_dest"
                                continue
                        else
                                echo -e "\nproceeding with `basename $i` alignment file evaluation..."
                                select TCS_library_estimation_format in proba_pair fast_mafft_kalign_muscle_combo none_exit
                                do
                                        case $TCS_library_estimation_format in
                                                proba_pair)
                                                        echo -e "\nTCS evaluation using default aligner proba_pair"
                                                        t_coffee -multi_core -n_core=32 -infile $i -evaluate -method proba_pair -output=score_ascii,html -outfile ${output_dest}scores/tcs/${output_filename}_score
                                                        break
                                                        ;;
                                                fast_mafft_kalign_muscle_combo)
                                                        echo -e "\nTCS evaluation using a series of fast multiple aligners; mafft_msa,kalign_msa,muscle_msa. \nThis option is not accurate and can not be relied on in filtering sequences"
                                                        t_coffee -multi_core -n_core=32 -infile $i -evaluate -method mafft_msa,kalign_msa,muscle_msa -output=score_ascii,html -outfile ${output_dest}scores/tcs/${output_filename}_fastscore
                                                        break
                                                        ;;
                                                none_exit)
                                                        break
                                                        ;;
                                                *)
                                                        echo "error: Invalid selection!"
                                        esac
                                done
                        fi
                else
                        echo "input file error in `basename $i`: input file should be a .fasta file format"
                        continue
                fi
        done
}

**Test run**

In [None]:
%%bash
source ./align.sh
TCSeval ../data/output/alignment/pasta_output/aligned/COI_testa00_data.aln ../data/output/alignment/pasta_output/aligned/COI_testa01_data.aln

Other T_Coffee functionalities to look at

In [None]:
%%bash
#=====================================================================================
#Filtering unreliable MSA positions: columns
#TCS allows you to filter out from your alignment regions that appear unreliable according to the consistency score; the filtering can be made at the residue level or the column level:
#        t_coffee -infile sample_seq1.aln -evaluate -output tcs_residue_filter3,tcs_column_filter3,tcs_residue_lower4

#sample_seq1.tcs_residue_filter3 :All residues with a TCS score lower than 3 are filtered out
#sample_seq1.tcs_column_filter3 :All columns with a TCS score lower than 3 are filtered out
#sample_seq1.tcs_residue_lower4 :All residues with a TCS score lower than 4 are in lower case

#t_coffee -infile sample_seq1.aln -evaluate -output tcs_residue_filter1,tcs_column_filter1

#=====================================================================================

##Estimating the diversity in your alignment:

# The "-other_pg" flag: call a collection of tools that perform other operations: reformatting, evaluating results, comparing methods. After the flag -other_pg, the common T-Coffee flags are not recognized. "-seq_reformat" flag: calls one of several tools to reformat/trim/clean/select your input data but also your output results, from a very powerful reformatting utility named seq_reformat

# "-output" option of "seq_reformat", will output all the pairwise identities, as well as the average level of identity between each sequence and the others:
#               "-output sim_idscore" realign your sequences pairwise so it can accept unaligned or aligned sequences alike. "-output sim" computes the identity using the sequences as they are in your input file so it is only suited for MSAs

#Syntax:        $ t_coffee -other_pg seq_reformat -in sample_seq1.aln -output sim

#               "-in" and "-in2" flags: define the input file(s)
#               "-output" flag: defines output format*
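
To illustrate what a pairwise identity over an existing alignment means, the sketch below compares two aligned sequences column by column. It assumes identity is the share of identical residues among columns where neither sequence has a gap. T-Coffee's exact definition may differ, so the numbers are not guaranteed to match `-output sim`. `pairwise_identity` is a hypothetical helper, not a T-Coffee command.

```shell
#!/usr/bin/env bash
# Hypothetical helper: percent identity of two ALIGNED sequences given as strings.
# Columns where either sequence has a gap ("-") are skipped (an assumption).
pairwise_identity() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        n = length(a) < length(b) ? length(a) : length(b)
        for (i = 1; i <= n; i++) {
            x = toupper(substr(a, i, 1))
            y = toupper(substr(b, i, 1))
            if (x == "-" || y == "-") continue
            compared++
            if (x == y) same++
        }
        if (compared == 0) { print "NA"; exit }
        printf "%.1f\n", 100 * same / compared
    }'
}

# 7 gap-free columns, 6 of them identical.
pairwise_identity "ACGT-ACG" "ACGTTACC"
```

Because the comparison is column-wise, it only makes sense for sequences taken from the same MSA, which is the same restriction noted above for `-output sim`.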


### **4. PASTA:**
This was the most important tool and was used extensively in the actual analysis. It is an iterative algorithm that divides the data set into subsets of at most 200 sequences; aligns the subsets using one of several third-party tools (MAFFT L-INS-i); and merges the subset alignments by transitivity using a third-party tool (OPAL).
1. **PASTA MSA alignment using MAFFT, OPAL and FastTree**:  
Estimates a starting tree using HMMs, subsets the data, aligns the subsets using MAFFT L-INS-I, merges them using OPAL and estimates a tree using FastTree; this tree is then used to subset the data in the next iteration. Set to 3 iterations (the default).

In [None]:
%%bash
#Usage:         $run_pasta.py [options] <settings_file1> <settings_file2> ...
#syntax:        $run_pasta.py -i <input_fasta_file> -j <job_name> --temporaries <TEMP_DIR> -o <output_dir>

pasta_aln() { #MSA alignment using pasta
        usage $@
        echo "PASTA starting alignment..."

        PYTHON3_EXEC=$( which python3 )
        runpasta=${co1_path}code/tools/pasta_code/pasta/run_pasta.py

        for i in $@
        do
                if [ ! -f $i ]
                then
                        echo "input error: file $i is non-existent!"
                elif [[ ( -f $i ) && ( `basename $i` =~ .*\.(afa|fasta|fa|aln) ) ]]
                then
                        echo -e "\tProceeding with `basename $i`" 
                        echo -e "\tPlease select the mafft alignment method;\n\tlocal[mafft_linsi] or global[mafft_ginsi]:"
                        select type_of_alignment in mafft_linsi mafft_ginsi mafft_linsi_with_starting_tree mafft_ginsi_with_starting_tree none_exit
                        do
                                case $type_of_alignment in
                                        mafft_linsi)
                                                rename
                                                echo -e "\nDoing local alignment of `basename $i`..."
                                                ${PYTHON3_EXEC} ${runpasta}  --num-cpus=32 --aligner=mafft -i $i -j ${output_filename} --temporaries=${pasta_dest}temporaries/ -o ${pasta_dest}\jobs/
                                                cp ${pasta_dest}\jobs/*.${output_filename}.aln ${pasta_dest}aligned/ && mv ${pasta_dest}aligned/{*.${output_filename}.aln,${output_filename}.aln}
                                                cp ${pasta_dest}\jobs/${output_filename}.tre ${pasta_dest}aligned/${output_filename}.tre
                                                break
                                                ;;
                                        mafft_ginsi)
                                                rename
                                                echo -e "\nDoing global alignment of `basename $i`..."
                                                ${PYTHON3_EXEC} ${runpasta}  --num-cpus=32 --aligner=ginsi -i $i -j ${output_filename} --temporaries=${pasta_dest}temporaries/ -o ${pasta_dest}\jobs/
                                                cp ${pasta_dest}\jobs/*.${output_filename}.aln ${pasta_dest}aligned/ && mv ${pasta_dest}aligned/{*.${output_filename}.aln,${output_filename}.aln}
                                                cp ${pasta_dest}\jobs/${output_filename}*.tre ${pasta_dest}aligned/${output_filename}.tre
                                                break
                                                ;;
                                        mafft_linsi_with_starting_tree)
                                                rename
                                                unset start_tree
                                                echo -e "\nDoing local alignment of `basename $i` using a starting tree..."
                                                until [[ ( -f "$start_tree" ) && ( `basename -- "$start_tree"` =~ .*\.(tre) ) ]]
                                                do
                                                        echo -e "\n\tFor the starting tree provide the full path to the file, the filename included."
                                                        read -p "Please enter the file to be used as the starting tree: " start_tree
                                                done
                                                ${PYTHON3_EXEC} ${runpasta} --num-cpus=32 --aligner=mafft -i $i -t $start_tree -j ${output_filename} --temporaries=${pasta_dest}temporaries/ -o ${pasta_dest}jobs/
                                                cp ${pasta_dest}jobs/*.${output_filename}.aln ${pasta_dest}aligned/ && mv ${pasta_dest}aligned/{*.${output_filename}.aln,${output_filename}.aln}
                                                cp ${pasta_dest}jobs/${output_filename}.tre ${pasta_dest}aligned/${output_filename}.tre
                                                break
                                                ;;
                                        mafft_ginsi_with_starting_tree)
                                                rename
                                                unset start_tree
                                                echo -e "\nDoing global alignment of `basename $i` using a starting tree..."
                                                until [[ ( -f "$start_tree" ) && ( `basename -- "$start_tree"` =~ .*\.(tre) ) ]]
                                                do
                                                        echo -e "\n\tFor the starting tree provide the full path to the file, the filename included."
                                                        read -p "Please enter the file to be used as the starting tree: " start_tree
                                                done
                                                #pass the starting tree collected above to PASTA with -t, as in the linsi branch
                                                ${PYTHON3_EXEC} ${runpasta} --num-cpus=32 --aligner=ginsi -i $i -t $start_tree -j ${output_filename} --temporaries=${pasta_dest}temporaries/ -o ${pasta_dest}jobs/
                                                cp ${pasta_dest}jobs/*.${output_filename}.aln ${pasta_dest}aligned/ && mv ${pasta_dest}aligned/{*.${output_filename}.aln,${output_filename}.aln}
                                                cp ${pasta_dest}jobs/${output_filename}*.tre ${pasta_dest}aligned/${output_filename}.tre
                                                break
                                                ;;
                                        none_exit)
                                                break
                                                ;;
                                        *)
                                                echo "error: Invalid selection!"
                                esac
                        done
                else
                        echo "input file error in `basename $i`: input file should be in .fasta format"
                        continue
                fi
        done
}
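
Every branch of the menu above ends with the same two steps: copy the alignment and the tree from the jobs directory into `aligned/` under the job name. A minimal sketch of how that step could be factored out is shown here. The helpers `first_match` and `collect_pasta_output` are hypothetical and not part of `align.sh`; the file patterns are taken from the `cp` commands above.

```bash
# Hypothetical helper: print the first argument that is an existing file.
first_match() {
    local f
    for f in "$@"; do
        [ -f "$f" ] && { printf '%s\n' "$f"; return 0; }
    done
    return 1
}

# Hypothetical helper: copy one job's alignment and tree into the aligned directory.
# Arguments: <jobs_dir> <aligned_dir> <job_name>
collect_pasta_output() {
    local jobs_dir=$1 aligned_dir=$2 job=$3 aln tre
    # Same patterns as the cp commands in pasta_aln: *.<job>.aln and <job>*.tre
    aln=$(first_match "$jobs_dir"/*."$job".aln) || { echo "no alignment found for $job" >&2; return 1; }
    tre=$(first_match "$jobs_dir/$job"*.tre) || { echo "no tree found for $job" >&2; return 1; }
    cp "$aln" "$aligned_dir/$job.aln" && cp "$tre" "$aligned_dir/$job.tre"
}
```

Under these assumptions, each branch could call `collect_pasta_output ${pasta_dest}jobs ${pasta_dest}aligned ${output_filename}` after PASTA finishes, and a missing output file would be reported instead of passing silently.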

**Test run:**

In [None]:
%%bash
source ./align.sh
pasta_aln ../data/output/alignment/pasta_output/aligned/COI_testa00_data.aln ../data/output/alignment/pasta_output/aligned/COI_testb01_data.fasta ../data/output/alignment/pasta_output/aligned/COI_testc04_data.fasta ../data/output/alignment/pasta_output/aligned/COI_teste07_data.fasta << EOF
1
2
1
1
EOF
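
The here-document in the test run works because bash's `select` reads its reply from standard input, so each input file consumes one line of the here-document in order. The sketch below shows the mechanism with a toy menu. The function `choose_mode` and its option names are illustrative only and do not reproduce the actual menu in `pasta_aln`.

```bash
# Toy stand-in for the menu in pasta_aln; option names are illustrative only.
choose_mode() {
    local f mode
    for f in "$@"; do
        select mode in local_mode global_mode none_exit; do
            case $mode in
                local_mode|global_mode) echo "$f: $mode"; break ;;
                none_exit) break ;;
                *) echo "error: Invalid selection!" ;;
            esac
        done 2>/dev/null   # select writes its menu and prompt to stderr; hidden here
    done
}

# One reply line per file, consumed in order.
choose_mode a.fasta b.fasta << EOF
1
2
EOF
# prints:
# a.fasta: local_mode
# b.fasta: global_mode
```

Any `read -p` prompt inside a branch draws from the same input stream, so a branch that asks for a starting tree needs an extra here-document line holding the path to that tree.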

### **5. UPP: Ultra-large alignments using Phylogeny-aware Profiles**  
"addresses the problem of alignment of very large datasets, potentially containing fragmentary data. UPP can align datasets with up to 1,000,000 sequences"  
Depends on Python, PASTA and SEPP (SATé-Enabled Phylogenetic Placement).
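
Because UPP depends on an external interpreter and on the SEPP scripts, a quick pre-flight check can catch a broken setup before a long run starts. The following is a minimal sketch under stated assumptions: `check_upp_setup` is a hypothetical helper that is not part of `align.sh`, and it only verifies that the interpreter is on `PATH` and that the script file exists, not that PASTA or SEPP are correctly installed.

```bash
# Hypothetical pre-flight check; name and messages are not part of align.sh.
# Arguments: <interpreter> <path to run_upp.py>
check_upp_setup() {
    local interpreter=$1 script=$2
    if ! command -v "$interpreter" >/dev/null 2>&1; then
        echo "setup error: $interpreter not found on PATH" >&2
        return 1
    fi
    if [ ! -f "$script" ]; then
        echo "setup error: $script does not exist" >&2
        return 1
    fi
}
```

With the paths used in this notebook, the call would look like `check_upp_setup python3 ${co1_path}code/tools/sepp/run_upp.py || return 1` at the top of the alignment function.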

In [None]:
%%bash
upp_align() { #UPP stands for Ultra-large alignments using Phylogeny-aware Profiles. A modification of SEPP, SATé-Enabled Phylogenetic Placement, for performing alignments of ultra-large and fragmentary datasets.
        #Usage: $ python <bin>/run_upp.py -s <unaligned_sequences>
        #To run UPP with a pre-computed backbone alignment and tree, run
        #       $ python <bin>/run_upp.py -s input.fas -a <alignment_file> -t <tree_file>
        #To run the parallelized version of UPP, run
        #       $ python <bin>/run_upp.py -s input.fas -x <cpus>
        usage $@
        echo "UPP starting alignment..."

        PYTHON3_EXEC=$( which python3 )
        run_upp=${co1_path}code/tools/sepp/run_upp.py

        for i in $@
        do
                if [ ! -f $i ]
                then
                        echo "input error: file $i is non-existent!"
                elif [[ ( -f $i ) && ( `basename $i` =~ .*\.(afa|fasta|fa|aln) ) ]]
                then
                        echo -e "\tProceeding with `basename $i`"
                        echo -e "\tPlease select the type of alignment method;\n\tUsing unaligned sequences only[using_sequences_only] or using a backbone[using_precomputed_backbone]:"
                        select type_of_alignment in using_sequences_only using_precomputed_backbone none_exit
                        do
                                case $type_of_alignment in
                                        using_sequences_only)
                                                rename
                                                echo -e "\nDoing Multiple Sequence Alignment of `basename $i` based on the fragmentary sequences alone"
                                                ${PYTHON3_EXEC} ${run_upp} -s $i -o ${output_filename} --tempdir ${pasta_dest}temporaries/sepp/ -d ${pasta_dest}jobs_upp/ #-x 32
                                                #UPP writes <output>_alignment.fasta into the directory given with -d (jobs_upp/)
                                                cp ${pasta_dest}jobs_upp/${output_filename}_alignment.fasta ${pasta_dest}aligned/
                                                break
                                                ;;
                                        using_precomputed_backbone)
                                                rename
                                                unset backbone
                                                unset start_tree
                                                echo -e "\nDoing Multiple Sequence Alignment of `basename $i` using a backbone alignment and a starting tree..."

                                                until [[ ( -f "$start_tree" ) && ( `basename -- "$start_tree"` =~ .*\.(tre) ) ]]
                                                do
                                                        echo -e "\n\tFor the starting tree provide the full path to the file, the filename included."
                                                        read -p "Please enter the file to be used as the starting tree: " start_tree
                                                done

                                                until [[ ( -f "$backbone" ) && ( `basename -- "$backbone"` =~ .*\.(aln|fasta|fa|afa) ) ]]
                                                do
                                                        echo -e "\n\tFor the backbone alignment provide the full path to the file, the filename included."
                                                        read -p "Please enter the file to be used as the backbone alignment: " backbone
                                                done

                                                ${PYTHON3_EXEC} ${run_upp} -s $i -a ${backbone} -t ${start_tree} -o ${output_filename} --tempdir ${pasta_dest}temporaries/sepp/ -d ${pasta_dest}jobs_upp/ #-x 32
                                                #UPP writes <output>_alignment.fasta into the directory given with -d (jobs_upp/)
                                                cp ${pasta_dest}jobs_upp/${output_filename}_alignment.fasta ${pasta_dest}aligned/
                                                break
                                                ;;
                                        none_exit)
                                                break
                                                ;;
                                        *)
                                                echo "error: Invalid selection!"
                                esac
                        done
                else
                        echo "input file error in `basename $i`: input file should be a FASTA or alignment file (.afa, .fasta, .fa or .aln)"
                        continue
                fi
        done

}
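
The extension test in `upp_align` uses a regular expression that is not anchored at the end of the name, so a file such as `notes.fasta.bak` also passes it. One possible stricter check is sketched below with a `case` pattern, which must match the whole file name. The helper `has_alignment_extension` is hypothetical and not part of `align.sh`.

```bash
# Hypothetical helper, not part of align.sh: succeed only if the file name
# ends in one of the extensions that upp_align accepts.
has_alignment_extension() {
    case ${1##*/} in                      # strip any leading directories
        *.afa|*.fasta|*.fa|*.aln) return 0 ;;
        *) return 1 ;;
    esac
}

for f in COI_testa00_data.aln notes.fasta.bak; do
    if has_alignment_extension "$f"; then
        echo "$f: accepted"
    else
        echo "$f: rejected"
    fi
done
# prints:
# COI_testa00_data.aln: accepted
# notes.fasta.bak: rejected
```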

**Test run:**

In [None]:
%%bash
source ./align.sh
upp_align ../data/output/alignment/pasta_output/aligned/COI_testa00_data.aln