# **Multiple Sequence Alignment Workflow**
This notebook explains the multiple sequence alignment (MSA) analysis workflow.
## **Goals**
1. Conduct an MSA on the combined African Insecta data sets listed below:
>1. **enafroCOI_Under500_data.fasta: 6,715 sequences**  
>2. **enafroCOI_Over700_data.fasta: 1,607 sequences**  
>3. **enafroCOI_650to660_data.fasta: 99,698 sequences**  
>4. **enafroCOI_500to700_data-650to660.fasta: 85,157 sequences**  
2. For each of the data sets listed above, subset and align iteratively until a good-quality sequence alignment is generated.  
3. Extract from these alignments only the sequences that fit the locus and length of the 658 bp 5' region of the cytochrome c oxidase subunit 1 gene (COI-5P). Each sequence must extend to both ends of the region, with or without gaps within the sequence, and may have at most ten gaps, '-', and ten unidentified nucleotides, 'N', at either terminal (3' or 5').
4. Conduct the final alignment of the combined, cleaned-up data from all four data sets above as one set.

## **Tasks**
1. Perform an MSA on eafroCOI_650to660_data.fasta (*24,475 sequences*), the East African insect sequences, and visualise it in SeaView to assess the quality of the alignment. This alignment will be used to define the 3' and 5' ends of the COI-5P sequences; trimming and comparison with the other data sets will be based on it.  
Refine the alignment if necessary, or use fewer sequences, but enough to accurately represent all possible COI-5P lengths.
2. Conduct an MSA on enafroCOI_Under500_data.fasta and refine it as needed to obtain a good-quality alignment. Compare the alignment to the reference eafroCOI_650to660_data.fasta alignment and trim at the determined 3' and 5' positions. From the output, delete sequences that have more than ten end gaps ("-") in the aligned nucleotide blocks/columns, excluding largely gappy columns from the count of ten gaps. Save the output.
3. Carry out task 2 on enafroCOI_Over700_data.fasta.
4. Carry out task 2 on enafroCOI_650to660_data.fasta.
5. Finally, carry out task 2 on enafroCOI_500to700_data-650to660.fasta. This set contains many sequences and possibly many impurities, so it will require a lot of subsetting and iteration.
6. Concatenate all the outputs from 2, 3, 4, and 5; then align them again and refine the alignment further.

### **1. MSA on eafroCOI_650to660_data.fasta (24,475 sequences)**

### **2. MSA on enafroCOI_Under500_data.fasta (6,715 sequences)**

### **3. MSA on enafroCOI_Over700_data.fasta (1,607 sequences)**

### **4. MSA on enafroCOI_650to660_data.fasta (99,698 sequences)**

### **5. MSA on enafroCOI_500to700_data-650to660.fasta (85,157 sequences)**

### **6. Concatenate all the outputs from 2, 3, 4, and 5;** then align them again and refine the alignment further.

#### **6.1.**

#### **6._**

In [2]:
%%bash
cd ../data/output/alignment/pasta_output/aligned/
wc -l enafroCOI_all_clean*

   203551 enafroCOI_all_clean5.aln
   196916 enafroCOI_all_clean5_sN10-eN10.aln
        2 enafroCOI_all_clean5.tre
   203551 enafroCOI_all_clean.aln
   203551 enafroCOI_all_clean.fasta
   198217 enafroCOI_all_clean_sN10-eN10.aln
    46706 enafroCOI_all_clean_sN10-eN10_eafro.aln
    17384 enafroCOI_all_clean_sN10-eN10_eafro_genera.aln
    29322 enafroCOI_all_clean_sN10-eN10_eafro_generaNA.aln
   151511 enafroCOI_all_clean_sN10-eN10_eafroNA.aln
    68401 enafroCOI_all_clean_sN10-eN10_genera.aln
   129815 enafroCOI_all_clean_sN10-eN10_generaNA.aln
        6 enafroCOI_all_clean_sN10-eN10_undesired.fasta
        2 enafroCOI_all_clean.tre
  1448935 total


In [None]:
%%bash
cd ../data/output/alignment/pasta_output/aligned/
source ../../../../../code/process_all_input_files.sh
delete_shortseqs_N enafroCOI_all_clean.aln << EOF
10
10
EOF
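
`delete_shortseqs_N` is defined in process_all_input_files.sh and is not shown in this notebook. Judging by the `sN10-eN10` suffix of its output, the two `10` values are presumably the maximum run of gaps or Ns tolerated at the start and at the end of each aligned sequence. The sketch below illustrates that kind of filter under those assumptions; `filter_terminal_runs` is a hypothetical name, and it treats `-` and `N` as one combined terminal run, which may differ from the real function.

```bash
# Hypothetical sketch, not the author's delete_shortseqs_N.
# Assumes an aligned FASTA with each sequence on a single line.
# $1 = alignment file, $2 = max leading run of -/N, $3 = max trailing run of -/N
filter_terminal_runs() {
    awk -v smax="$2" -v emax="$3" '
        /^>/ { header = $0; next }          # remember the header of the record
        {
            seq = toupper($0)
            lead = 0; trail = 0
            if (match(seq, /^[-N]+/)) lead = RLENGTH    # run at the 5 prime end
            if (match(seq, /[-N]+$/)) trail = RLENGTH   # run at the 3 prime end
            if (lead <= smax && trail <= emax) { print header; print $0 }
        }' "$1"
}

# Example call (writes the kept records to a new file):
# filter_terminal_runs enafroCOI_all_clean.aln 10 10 > kept.aln
```

Records that exceed either threshold are simply not printed, so the input file is left untouched.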

Visualise the output file "enafroCOI_all_clean_sN10-eN10.aln" in SeaView and remove the gap-only sites, which may have been created by the removal of some sequences in the step above. Save the output file.

In [None]:
%%bash
cd ../data/output/alignment/pasta_output/aligned/
seaview enafroCOI_all_clean_sN10-eN10.aln
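
For alignments too large to edit comfortably by hand, gap-only sites can also be removed non-interactively. The following is a minimal sketch, assuming each aligned sequence sits on a single line; `remove_gap_only_columns` is a hypothetical helper and is not part of process_all_input_files.sh.

```bash
# Hypothetical helper: drop alignment columns that hold '-' in every sequence.
remove_gap_only_columns() {
    # The file is read twice: pass 1 marks the columns that contain at least
    # one non-gap character, pass 2 prints only those columns.
    awk '
        NR == FNR {
            if ($0 !~ /^>/) {
                n = length($0)
                for (i = 1; i <= n; i++)
                    if (substr($0, i, 1) != "-") keep[i] = 1
            }
            next
        }
        /^>/ { print; next }                # headers pass through unchanged
        {
            out = ""
            n = length($0)
            for (i = 1; i <= n; i++)
                if (i in keep) out = out substr($0, i, 1)
            print out
        }
    ' "$1" "$1"
}

# Example call:
# remove_gap_only_columns enafroCOI_all_clean_sN10-eN10.aln > degapped.aln
```

This character-by-character loop is slow on very large alignments, so it is meant as an illustration of the idea rather than a tuned tool.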

Delete the outgroups (Crustacea, Arachnida and Chilopoda), which were initially included in the enafroCOI_all_clean_sN10-eN10.aln alignment.  
See the code below:

In [None]:
%%bash
cd ../data/output/alignment/pasta_output/aligned/
source ../../../../../code/process_all_input_files.sh
delete_unwanted enafroCOI_all_clean_sN10-eN10.aln << EOF
1
Crustacea
1
Arachnida
1
Chilopoda
2
EOF
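# Rename the file of removed records; its name is assumed to follow the
# <input>_undesired.fasta pattern seen in the other cells of this notebook
mv enafroCOI_all_clean_sN10-eN10_undesired.fasta outgroups.aln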
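
`delete_unwanted` is an interactive function from process_all_input_files.sh whose source is not shown here; from the way it is used in this notebook, it appears to move records whose headers match the supplied patterns into an `_undesired` file. The general idea can be sketched with awk as follows. `split_by_header` and the output file names in the example call are hypothetical.

```bash
# Hypothetical sketch of a header-based split, not the author's delete_unwanted.
# $1 = input FASTA, $2 = extended regular expression tested against headers,
# $3 = file for matching records, $4 = file for all remaining records
split_by_header() {
    awk -v pat="$2" -v hit="$3" -v rest="$4" '
        /^>/ { out = ($0 ~ pat) ? hit : rest }   # choose the target per record
        { print > out }                          # header and sequence lines follow it
    ' "$1"
}

# Example call for the outgroups:
# split_by_header enafroCOI_all_clean_sN10-eN10.aln \
#     'Crustacea|Arachnida|Chilopoda' outgroups.aln ingroup.aln
```

Unlike the in-place workflow used in the cells here, this variant leaves the input untouched and writes both halves to new files. An output file is only created if at least one record is routed to it.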

Then split "enafroCOI_all_clean_sN10-eN10.aln" into two files:  
**1. enafroCOI_all_clean_sN10-eN10_genera.aln** - contains African COI sequences with genus names  
**2. enafroCOI_all_clean_sN10-eN10_generaNA.aln** - Contains African COI sequences without genus names  

In [None]:
%%bash
cd ../data/output/alignment/pasta_output/aligned/
source ../../../../../code/process_all_input_files.sh
cp enafroCOI_all_clean_sN10-eN10.aln enafroCOI_all_clean_sN10-eN10.aln2
delete_unwanted enafroCOI_all_clean_sN10-eN10.aln2 << EOF
1
gs-NA
2
EOF
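# Rename: the filtered copy keeps the records with genus names; the removed
# records (written to <input>_undesired.fasta, as in the other cells) lack them
mv enafroCOI_all_clean_sN10-eN10.aln2 enafroCOI_all_clean_sN10-eN10_genera.aln
mv enafroCOI_all_clean_sN10-eN10_undesired.fasta enafroCOI_all_clean_sN10-eN10_generaNA.aln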

Then split enafroCOI_all_clean_sN10-eN10.aln by country into two files:  
**1. enafroCOI_all_clean_sN10-eN10_eafro.aln** - Contains East African sequences (Kenya, Uganda, Tanzania, Rwanda, Burundi, South Sudan and Ethiopia)  
**2. enafroCOI_all_clean_sN10-eN10_eafroNA.aln** - Contains non-East African sequences  


In [None]:
%%bash
cd ../data/output/alignment/pasta_output/aligned/
source ../../../../../code/process_all_input_files.sh
cp enafroCOI_all_clean_sN10-eN10.aln enafroCOI_all_clean_sN10-eN10.aln2
delete_unwanted enafroCOI_all_clean_sN10-eN10.aln2 << EOF
1
Kenya
1
Tanzania
1
Uganda
1
Rwanda
1
Burundi
1
South_Sudan
1
Ethiopia
2
EOF
#Renaming
mv enafroCOI_all_clean_sN10-eN10_undesired.fasta enafroCOI_all_clean_sN10-eN10_eafro.aln
mv enafroCOI_all_clean_sN10-eN10.aln2 enafroCOI_all_clean_sN10-eN10_eafroNA.aln

Then split enafroCOI_all_clean_sN10-eN10_eafro.aln into two files:  
**1. enafroCOI_all_clean_sN10-eN10_eafro_genera.aln** - Contains East African COI records with genus names  
**2. enafroCOI_all_clean_sN10-eN10_eafro_generaNA.aln** - Contains East African COI records without genus names  
Records without genus names also lack species names; however, records with genus names may still lack species names.

In [None]:
%%bash
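cd ../data/output/alignment/pasta_output/aligned/
# Each %%bash cell starts a fresh shell, so the helper functions must be sourced again
source ../../../../../code/process_all_input_files.sh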
cp enafroCOI_all_clean_sN10-eN10_eafro.aln enafroCOI_all_clean_sN10-eN10_eafro.aln2
delete_unwanted enafroCOI_all_clean_sN10-eN10_eafro.aln2 << EOF
1
gs-NA
2
EOF
#Renaming
mv enafroCOI_all_clean_sN10-eN10_eafro.aln2 enafroCOI_all_clean_sN10-eN10_eafro_genera.aln
mv enafroCOI_all_clean_sN10-eN10_eafro_undesired.fasta enafroCOI_all_clean_sN10-eN10_eafro_generaNA.aln


In [51]:
%%bash
cd ../data/output/alignment/pasta_output/aligned/
ls enafroCOI_all_clean*sN10-eN10*.aln
#cat $(ls enafroCOI_all_clean*sN10-eN10*.aln2) | head -5

enafroCOI_all_clean5_sN10-eN10.aln
enafroCOI_all_clean_sN10-eN10.aln
enafroCOI_all_clean_sN10-eN10_eafro.aln
enafroCOI_all_clean_sN10-eN10_eafro_genera.aln
enafroCOI_all_clean_sN10-eN10_eafro_generaNA.aln
enafroCOI_all_clean_sN10-eN10_eafroNA.aln
enafroCOI_all_clean_sN10-eN10_genera.aln
enafroCOI_all_clean_sN10-eN10_generaNA.aln
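
Each split should partition its parent file. In the `wc -l` listing at the top of this section, for example, the line counts of the `_eafro.aln` and `_eafroNA.aln` files (46706 + 151511) add up to the 198217 lines of enafroCOI_all_clean_sN10-eN10.aln. Counting header lines is a more direct check than counting lines; a minimal sketch, where `check_partition` is a hypothetical helper:

```bash
# Hypothetical helper: succeed only if the two parts together hold as many
# FASTA records (header lines) as the parent file.
# $1 = parent file, $2 and $3 = the two files it was split into
check_partition() {
    parent=$(grep -c '^>' "$1")
    part_a=$(grep -c '^>' "$2")
    part_b=$(grep -c '^>' "$3")
    echo "parent=$parent parts=$((part_a + part_b))"
    [ "$parent" -eq "$((part_a + part_b))" ]
}

# Example call:
# check_partition enafroCOI_all_clean_sN10-eN10.aln \
#     enafroCOI_all_clean_sN10-eN10_eafro.aln enafroCOI_all_clean_sN10-eN10_eafroNA.aln
```

The function prints both counts and returns a non-zero status when they differ, so it can gate later steps in a script.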


#### **Replacing illegal characters**
**Illegal characters in taxon names are: tabulators, carriage returns, spaces, ":", ",", ")", "(", ";", "]", "\[", "'"**. They affect how RAxML interprets the names, so they are removed or replaced below.  
This has to be done; otherwise RAxML will report an error and exit.

In [69]:
%%bash
cd ../data/output/alignment/pasta_output/aligned/
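# Loop over the alignment files via the glob itself rather than parsing ls output
for i in enafroCOI_all_clean*sN10-eN10*.aln; do vim "$i" -n << EOF
:%s/\[//g
:%s/\]//g
:%s/ /_/g
:%s/://g
:%s/;//g
:%s/,/__/g
:%s/(/__/g
:%s/)//g
:%s/'//g
:wq
EOF
# Report progress once the file has been rewritten in place
echo "$(basename -- "$i") completed"
done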

<roCOI_all_clean5_sN10-eN10.aln" 196916L, 93274261C
>PYRG096-09|Lepidoptera|gs-NA|sp-NA|subsp-NA|country-Kenya|exactsite-NA|lat_3.3||
lon_39.983|elev-20|l-658
---AACATTATA---TTTTAT-TTTT--GGAA-TTTGA-AGAGGA-----AT-----A-----GTAG-GA---AC------
-----TTCATTAAGATTACTGATTCGTG--CTGAAT-----TA--GG-T--A---ACCC---TGG------ATCTTTAA--
--T------T------GGA---GA--TGATCAAATTTATAATA-CAATCGTAACAGCACATGCA-TTTATTATAA--TTTT
TTTTCATAGTAATACCTATTATAATTGGAGGATTTGGTAATTGATTAGTACCTTTAATATTAGGAGCTCCTGATATAGCAA
TTCCCCCGAATAAACAATATAAGATTTTGATTATT-ACCC-CCATCATTAACCTTA-CTAATTTCTAGAAGAATCGTTGAA
AAATGGAGCTGGTACTGGATGAACAGTTTA-TCC-CCCACTCTCATCCAATATTGCACATGGAG-GAAGATCTGTAGAC--
TTAGCTATTTTTTCTCTTCATTTAGCTGGAATTTCTTCTATTTTAGGAGCAATTAATTTTATTACAACTATTATT-AACAA
TACGAA---TC---AAT-----------G---GATTATCATTTGATCAAA-TACCTTTATTTGTATGAGCTGTTGGAATTT
ACAGCTTTATTATTACTTCTTTCTTTACC-TGTATTAGCTGG-AGCTATTACTATACTTTTAACTG-ATCGAAACTTAAAA
T-AC-ATCT-TTTTTCGACC-CAGCA--GGA----GGGGGAGAT-
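
Because vim is a full-screen editor, driving it from a here-document leaves terminal control output like the above in the notebook. The same substitutions can be expressed with a stream editor instead. The sketch below is an alternative and not the method used in this notebook: `sanitize_headers` is a hypothetical helper that applies the replacements to header lines only and writes the result to standard output. Like the vim commands, it does not handle tabulators or carriage returns.

```bash
# Hypothetical alternative to the vim loop: clean FASTA header lines with sed.
# $1 = alignment file; the cleaned alignment is written to standard output.
sanitize_headers() {
    # 1st command: delete ] [ ; : ) and '
    # 2nd command: turn spaces into underscores
    # 3rd command: turn , and ( into double underscores
    sed "/^>/ {
        s/[][;:)']//g
        s/ /_/g
        s/[,(]/__/g
    }" "$1"
}

# Example call:
# sanitize_headers enafroCOI_all_clean_sN10-eN10.aln > enafroCOI_all_clean_sN10-eN10.raxml.aln
```

Restricting the edits to lines that start with `>` leaves the sequence lines untouched by construction, and writing to a new file keeps the original alignment available for comparison.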



Copy the cleaned alignment files to the remote server (hpc01.icipe.org) so that it is up to date and ready for execution.

In [66]:
%%bash
cd ../data/output/alignment/pasta_output/aligned/
scp ./enafroCOI_all_clean*  gilbert@hpc01.icipe.org:/home/gilbert/bioinformatics/github/co1_metaanalysis/data/output/alignment/pasta_output/aligned/ << EOF
<password>
EOF

ssh_askpass: exec(/usr/bin/ssh-askpass): No such file or directory
Permission denied, please try again.
ssh_askpass: exec(/usr/bin/ssh-askpass): No such file or directory
Permission denied, please try again.
ssh_askpass: exec(/usr/bin/ssh-askpass): No such file or directory
gilbert@hpc01.icipe.org: Permission denied (publickey,gssapi-keyex,gssapi-with-mic,password).
lost connection


CalledProcessError: Command 'b'cd ../data/output/alignment/pasta_output/aligned/\nscp ./enafroCOI_all_clean*  gilbert@hpc01.icipe.org:/home/gilbert/bioinformatics/github/co1_metaanalysis/data/output/alignment/pasta_output/aligned/ << EOF\n<password>\nEOF\n'' returned non-zero exit status 1.
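
The transfer fails because ssh and scp do not read passwords from standard input, so the here-document has no effect. Without a terminal they fall back to an askpass helper, which is missing on this machine (hence the `ssh_askpass` messages). One common way around this is key-based authentication, set up once from an interactive terminal with `ssh-keygen` and `ssh-copy-id`. The sketch below assumes that has been done; `push_alignments` and the `DRY_RUN` switch are hypothetical names introduced here for illustration.

```bash
# Hypothetical helper: copy files to a remote directory without any password prompt.
# Assumes key-based authentication is already configured for the remote account.
# $1 = remote target (user@host:/path/), remaining arguments = local files
push_alignments() {
    remote="$1"; shift
    # BatchMode=yes makes ssh fail at once instead of asking for a password
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo scp -o BatchMode=yes "$@" "$remote"   # only show the command
    else
        scp -o BatchMode=yes "$@" "$remote"
    fi
}

# Example call, using the same target directory as the cell above:
# push_alignments gilbert@hpc01.icipe.org:/home/gilbert/bioinformatics/github/co1_metaanalysis/data/output/alignment/pasta_output/aligned/ ./enafroCOI_all_clean*
```

Setting `DRY_RUN=1` prints the command without running it, which is a convenient way to confirm that the file glob expands as expected before anything is transferred.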