![rmotr](https://user-images.githubusercontent.com/7065401/52071918-bda15380-2562-11e9-828c-7f95297e4a82.png)
<hr style="margin-bottom: 40px;">

![filo_virion](https://user-images.githubusercontent.com/22747792/73687685-7111bc00-467f-11ea-906e-e16132529840.png)

# Python for Genomics 
## Section 7: Multiple Sequence Alignments

Multiple sequence alignments (MSAs) are a common component of genomic workflows; they align three or more homologous sequences.

This lets us see what is conserved and what varies between sequences, species, or organisms.

We can then use the results for comparative structure and function analyses, using either DNA or protein sequences.

### Such analyses are the starting point for:
* tracking mutations (or variations) for evolutionary analysis, e.g. where has the virus been, based on its relation to other viruses
* designing primers for conserved regions that we want to study
* observing conserved regions that may indicate important functional properties of the resulting protein, e.g. residues that facilitate cell entry for infection, and therefore which conserved sites could help with developing effective therapeutics
* etc.

---

### MSAs are run by external programs, so we have a couple of options. We can:

1. run the MSA on a web server/locally installed app and import the sequence alignment files for analysis in our notebooks; or 
2. use Biopython to run the applications from here, inside the notebook.

Personally, even though I will run the aligner through the command line, I like doing it from a notebook because the notebook keeps a record of our workflow without lots of extra effort.
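
If you would rather launch the aligner from Python than from a `!` shell line, one possible approach is to build the argument list yourself and hand it to the standard library's `subprocess` module. The sketch below is only an illustration: `build_clustalo_command` is a hypothetical helper name, and the flags and paths are the ones used for the early outbreak alignment later in this section.

```python
import shlex


def build_clustalo_command(in_fasta, out_path, seqtype="DNA", outfmt="clu"):
    """Return the clustalo argument list (hypothetical helper, not part of Biopython)."""
    return [
        "clustalo",
        "-i", in_fasta,          # input FASTA file
        "-o", out_path,          # where the alignment is written
        f"--outfmt={outfmt}",    # clu = Clustal format
        f"--seqtype={seqtype}",  # tell clustalo what the sequences are
        "--resno",               # print residue numbers at the end of each line
        "--verbose",
    ]


cmd = build_clustalo_command(
    "data/early_outbreak_data/noN_PRJNA257197_GPgenes.fasta",
    "data/early_outbreak_data/output/early_outbreak_GPgenes.clustal",
)
print(shlex.join(cmd))

# To actually run it (requires clustalo to be installed and on the PATH):
# import subprocess
# subprocess.run(cmd, check=True)
```

Passing a list rather than one long string means paths containing spaces need no extra quoting.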

![purple-divider](https://user-images.githubusercontent.com/7065401/52071927-c1cd7100-2562-11e9-908a-dde91ba14e59.png)


👉🏼 Remember it's good to ask ourselves *why* we are performing the alignment; the answer tells us whether to use a protein alignment or a DNA alignment in our workflow. For example: are you studying structural changes (both DNA and protein alignments would be useful here)? Do you need primers (only a DNA alignment)?
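
To see why the choice matters, here is a small sketch showing that a DNA change can be invisible at the protein level. The three short sequences are toy examples made up for illustration, and the tiny `translate` function is a stand-in written with the standard genetic code; in a real workflow Biopython's `Seq.translate()` would normally do this job.

```python
from itertools import product

BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code: codons enumerated in TCAG order, paired with their amino acids
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate an in-frame DNA string; codons not in the table become 'X'."""
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3  # ignore a trailing partial codon
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, usable, 3))


reference = "ATGGGTGTTACAGGAATATTG"  # toy sequence
silent = "ATGGGCGTTACAGGAATATTG"     # GGT -> GGC, still glycine
missense = "ATGGGTGCTACAGGAATATTG"   # GTT -> GCT, valine -> alanine

for name, dna in [("reference", reference), ("silent", silent), ("missense", missense)]:
    print(name, translate(dna))
```

The silent change would show up in a DNA alignment but not in a protein alignment, which is why primer design needs the DNA view while structural questions can benefit from both.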

---


We're going to use Clustal Omega to perform a global MSA. A global alignment makes sense here because these are all genes from the same organism, sampled during about the same time period, so they are likely to be very similar:

https://pypi.org/project/clustalo/

Remember, you can always ask for `-h` (help)

[or visit the FAQ to get answers on inputs and outputs etc.](https://www.ebi.ac.uk/seqdb/confluence/display/THD/Help+-+Clustal+Omega+FAQ#Help-ClustalOmegaFAQ-WhatoutputsdoesClustalOmegaproduce?)

In [12]:
! clustalo -h

Clustal Omega - 1.2.4 (AndreaGiacomo)

If you like Clustal-Omega please cite:
 Sievers F, Wilm A, Dineen D, Gibson TJ, Karplus K, Li W, Lopez R, McWilliam H, Remmert M, Söding J, Thompson JD, Higgins DG.
 Fast, scalable generation of high-quality protein multiple sequence alignments using Clustal Omega.
 Mol Syst Biol. 2011 Oct 11;7:539. doi: 10.1038/msb.2011.75. PMID: 21988835.
If you don't like Clustal-Omega, please let us know why (and cite us anyway).

Check http://www.clustal.org for more information and updates.

Usage: clustalo [-hv] [-i {<file>,-}] [--hmm-in=<file>]... [--hmm-batch=<file>] [--dealign] [--profile1=<file>] [--profile2=<file>] [--is-profile] [-t {Protein, RNA, DNA}] [--infmt={a2m=fa[sta],clu[stal],msf,phy[lip],selex,st[ockholm],vie[nna]}] [--distmat-in=<file>] [--distmat-out=<file>] [--guidetree-in=<file>] [--guidetree-out=<file>] [--pileup] [--full] [--full-iter] [--cluster-size=<n>] [--clustering-out=<file>] [--trans=<n>] [--posterior-out=<file>] [--use-kimura] 

![green-divider](https://user-images.githubusercontent.com/7065401/52071924-c003ad80-2562-11e9-8297-1c6595f8a7ff.png)

## Aligning the early outbreak GP sequences

We will align the GP genes that we extracted from the 249 sequences using the SeqFeatures script.

Remember, those sequences came from an entire BioProject: PRJNA257197

I have copied that file (`PRJNA257197_GPgenes.fasta`) and it sits in `data/early_outbreak_data/`

The set, however, contains sequences that have ambiguous bases in them. For some purposes that's OK, but let's just throw out any sequence that contains an `N` so we get a cleaner alignment built from canonical bases (A, T, C, G).

In [13]:
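from Bio import SeqIO

# Stream the GP gene records extracted from BioProject PRJNA257197
early_outbreak = SeqIO.parse('data/early_outbreak_data/PRJNA257197_GPgenes.fasta', 'fasta')

# Keep only the records whose sequence contains no 'N' (ambiguous base)
no_N = []

for fasta in early_outbreak:
    num_of_Ns = fasta.seq.count('N')
    if num_of_Ns == 0:
        no_N.append(fasta)

# SeqIO.write returns the number of records it wrote
SeqIO.write(no_N, 'data/early_outbreak_data/noN_PRJNA257197_GPgenes.fasta', 'fasta')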

205
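
Note that the filter above only looks for an uppercase `N`. Other IUPAC ambiguity codes (for example `R` or `Y`) or lowercase letters would pass through it. If you need to guarantee canonical bases only, a stricter check could look like the sketch below; `has_only_canonical_bases` is a hypothetical helper name and the sequences are toy strings, but the same function accepts a Biopython `Seq` because it converts its input with `str()`.

```python
CANONICAL_BASES = set("ACGT")


def has_only_canonical_bases(sequence):
    """True when every character is A, C, G or T (case-insensitive)."""
    return set(str(sequence).upper()) <= CANONICAL_BASES


toy_sequences = {
    "clean": "GATGAAGATT",
    "with_N": "GATGNAGATT",
    "with_R": "GATGRAGATT",  # R = A or G, missed by a plain count('N')
    "lower": "gatgaagatt",
}

kept = [name for name, seq in toy_sequences.items() if has_only_canonical_bases(seq)]
print(kept)
```

With real records the list comprehension would become something like `[rec for rec in records if has_only_canonical_bases(rec.seq)]`. An empty sequence passes this check, so filter on length separately if that matters.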

In [15]:
! clustalo -i data/early_outbreak_data/noN_PRJNA257197_GPgenes.fasta -o data/early_outbreak_data/output/early_outbreak_GPgenes.clustal --outfmt=clu --verbose --seqtype=DNA --resno

Using 2 threads
Read 205 sequences (type: DNA) from data/early_outbreak_data/noN_PRJNA257197_GPgenes.fasta
Using 58 seeds (chosen with constant stride from length sorted seqs) for mBed (from a total of 205 sequences)
Calculating pairwise ktuple-distances...
Ktuple-distance calculation progress done. CPU time: 53.27u 0.02s 00:00:53.29 Elapsed: 00:00:54
mBed created 2 cluster/s (with a minimum of 1 and a soft maximum of 100 sequences each)
Distance calculation within sub-clusters done. CPU time: 81.34u 0.04s 00:01:21.38 Elapsed: 00:01:23
Guide-tree computation (mBed) done.
Progressive alignment progress done. CPU time: 117.02u 0.54s 00:01:57.56 Elapsed: 00:02:00
Alignment written to data/early_outbreak_data/output/early_outbreak_GPgenes.clustal


In [16]:
! head -260 data/early_outbreak_data/output/early_outbreak_GPgenes.clustal

CLUSTAL O(1.2.4) multiple sequence alignment


KR105345.1      GATGAAGATTAAGCCGACAGTGAGCGTAATCTTCATCTCTCTTAGATTATTTGTCTTCCA	60
KR105328.1      GATGAAGATTAAGCCGACAGTGAGCGTAATCTTCATCTCTCTTAGATTATTTGTCTTCCA	60
KR105323.1      GATGAAGATTAAGCCGACAGTGAGCGTAATCTTCATCTCTCTTAGATTATTTGTCTTCCA	60
KR105294.1      GATGAAGATTAAGCCGACAGTGAGCGTAATCTTCATCTCTCTTAGATTATTTGTCTTCCA	60
KR105282.1      GATGAAGATTAAGCCGACAGTGAGCGTAATCTTCATCTCTCTTAGATTATTTGTCTTCCA	60
KR105266.1      GATGAAGATTAAGCCGACAGTGAGCGTAATCTTCATCTCTCTTAGATTATTTGTCTTCCA	60
KR105263.1      GATGAAGATTAAGCCGACAGTGAGCGTAATCTTCATCTCTCTTAGATTATTTGTCTTCCA	60
KR105349.1      GATGAAGATTAAGCCGACAGTGAGCGTAATCTTCATCTCTCTTAGATTATTTGTCTTCCA	60
KR105346.1      GATGAAGATTAAGCCGACAGTGAGCGTAATCTTCATCTCTCTTAGATTATTTGTCTTCCA	60
KR105344.1      GATGAAGATTAAGCCGACAGTGAGCGTAATCTTCATCTCTCTTAGATTATTTGTCTTCCA	60
KR105343.1      GATGAAGATTAAGCCGACAGTGAGCGTAATCTTCATCTCTCTTAGATTATTTGTCTTCCA	60
KR105342.1      GATGAAGATTAAGCCGACAGTGAGCGTAATCTTCATCTCTCTTAGATTATTTGTCTT
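
The first block of the alignment looks identical across every row, so scanning by eye is not a practical way to find differences. A minimal sketch of how the variable columns could be located programmatically follows. It assumes the alignment rows are available as equal-length strings; `variable_columns` and `load_aligned_strings` are hypothetical helper names, and the three rows are toy data rather than real output. `AlignIO.read(path, "clustal")` is Biopython's reader for Clustal files.

```python
def variable_columns(aligned_seqs):
    """Return 0-based column indexes where the aligned rows are not all identical."""
    if not aligned_seqs:
        return []
    length = len(aligned_seqs[0])
    if any(len(seq) != length for seq in aligned_seqs):
        raise ValueError("aligned sequences must all have the same length")
    return [i for i, column in enumerate(zip(*aligned_seqs)) if len(set(column)) > 1]


def load_aligned_strings(path, fmt="clustal"):
    """Read an alignment file with Biopython and return its rows as plain strings."""
    from Bio import AlignIO  # imported here so the toy example runs without Biopython
    return [str(record.seq) for record in AlignIO.read(path, fmt)]


# Toy alignment: one substitution (column 3) and one gap (column 8)
toy = [
    "GATGAAGATT",
    "GATGAAGATT",
    "GATCAAGA-T",
]
print(variable_columns(toy))
```

For the real data you would call `variable_columns(load_aligned_strings('data/early_outbreak_data/output/early_outbreak_GPgenes.clustal'))`. Gaps count as differences here; whether that is what you want depends on the question you are asking.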

![green-divider](https://user-images.githubusercontent.com/7065401/52071924-c003ad80-2562-11e9-8297-1c6595f8a7ff.png)

## And last, we'll align the 174 sequences from the late outbreak and save the output.

Now remember, when we used the Excel file to download the records from NCBI, we got full GenBank records.

We just want to align the GP sequences, so I'm going to run that script again to extract the GP genes (features) and save them to a new FASTA file that we can run through clustalo.

In [17]:
from Bio.SeqRecord import SeqRecord
from Bio import SeqIO


# Full GenBank records downloaded from NCBI
late_outbreak_records = SeqIO.parse('../06 Retreiving Sequences from NCBI/data/gen_div_ebola.gb', 'genbank')
late_outbreak_GPlist = []

for record in late_outbreak_records:
    for feature in record.features:
        # .get() avoids a KeyError on gene features that lack a 'gene' qualifier
        if feature.type == 'gene' and feature.qualifiers.get('gene') == ['GP']:
            # Slice the GP gene out of the full record sequence
            GP_gene_seq = feature.extract(record.seq)
            new_record = SeqRecord(GP_gene_seq,
                                  id=record.id, name=record.name,
                                  description=record.description)
            late_outbreak_GPlist.append(new_record)
            
len(late_outbreak_GPlist)

174

In addition, we're going to remove all the sequences that contain N's just like we did above:

In [18]:
from Bio import SeqIO

no_N = []

for fasta in late_outbreak_GPlist:
    num_of_Ns = fasta.seq.count('N')
    if num_of_Ns == 0:
        no_N.append(fasta)
        

SeqIO.write(no_N, 'data/late_outbreak_data/noN_late_outbreak_GPgenes.fasta', 'fasta')

174

In [20]:
! clustalo -i data/late_outbreak_data/noN_late_outbreak_GPgenes.fasta -o data/late_outbreak_data/output/late_outbreak_GPgenes.clustal --outfmt=clu --verbose --seqtype=DNA --resno

Using 2 threads
Read 174 sequences (type: DNA) from data/late_outbreak_data/noN_late_outbreak_GPgenes.fasta
Using 55 seeds (chosen with constant stride from length sorted seqs) for mBed (from a total of 174 sequences)
Calculating pairwise ktuple-distances...
Ktuple-distance calculation progress done. CPU time: 42.13u 0.01s 00:00:42.14 Elapsed: 00:00:43
mBed created 3 cluster/s (with a minimum of 1 and a soft maximum of 100 sequences each)
Distance calculation within sub-clusters done. CPU time: 33.23u 0.01s 00:00:33.23 Elapsed: 00:00:34
Guide-tree computation (mBed) done.
Progressive alignment progress done. CPU time: 99.77u 12.57s 00:01:52.34 Elapsed: 00:01:56
Alignment written to data/late_outbreak_data/output/late_outbreak_GPgenes.clustal


In [21]:
! head -180 data/late_outbreak_data/output/late_outbreak_GPgenes.clustal

CLUSTAL O(1.2.4) multiple sequence alignment


KP759636.1      GATGAAGATTAAGCCGACAGTGAGCGTAATCTTCATCTCTCTTAGATTATTTGTCTTCCA	60
KP759640.1      GATGAAGATTAAGCCGACAGTGAGCGTAATCTTCATCTCTCTTAGATTATTTGTCTTCCA	60
KP759651.1      GATGAAGATTAAGCCGACAGTGAGCGTAATCTTCATCTCTCTTAGATTATTTGTCTTCCA	60
KP759668.1      GATGAAGATTAAGCCGACAGTGAGCGTAATCTTCATCTCTCTTAGATTATTTGTCTTCCA	60
KP759628.1      GATGAAGATTAAGCCGACAGTGAGCGTAATCTTCATCTCTCTTAGATTATTTGTCTTCCA	60
KP759630.1      GATGAAGATTAAGCCGACAGTGAGCGTAATCTTCATCTCTCTTAGATTATTTGTCTTCCA	60
KP759631.1      GATGAAGATTAAGCCGACAGTGAGCGTAATCTTCATCTCTCTTAGATTATTTGTCTTCCA	60
KP759718.1      GATGAAGATTAAGCCGACAGTGAGCGTAATCTTCATCTCTCTTAGATTATTTGTCTTCCA	60
KP759734.1      GATGAAGATTAAGCCGACAGTGAGCGTAATCTTCATCTCTCTTAGATTATTTGTCTTCCA	60
KP759639.1      GATGAAGATTAAGCCGACAGTGAGCGTAATCTTCATCTCTCTTAGATTATTTGTCTTCCA	60
KP759641.1      GATGAAGATTAAGCCGACAGTGAGCGTAATCTTCATCTCTCTTAGATTATTTGTCTTCCA	60
KP759642.1      GATGAAGATTAAGCCGACAGTGAGCGTAATCTTCATCTCTCTTAGATTATTTGTCTT
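
One of the uses listed at the start of this section was designing primers for conserved regions. As a rough sketch of a first step, the function below reports stretches of columns where every row carries the same, ungapped character. `conserved_runs` is a hypothetical helper, the rows are toy strings, and the default `min_length=18` is only an assumed lower bound for a primer-sized window, not a recommendation taken from this notebook.

```python
def conserved_runs(aligned_seqs, min_length=18):
    """Return (start, end) half-open column ranges that are identical and ungapped in every row."""
    columns = list(zip(*aligned_seqs))
    runs, start = [], None
    # Iterate one position past the end so a run that reaches the last column is closed
    for i in range(len(columns) + 1):
        conserved = (
            i < len(columns)
            and len(set(columns[i])) == 1
            and columns[i][0] != "-"
        )
        if conserved and start is None:
            start = i
        elif not conserved and start is not None:
            if i - start >= min_length:
                runs.append((start, i))
            start = None
    return runs


toy = [
    "GATGAAGATT",
    "GATGAAGATT",
    "GATCAAGA-T",
]
print(conserved_runs(toy, min_length=3))
```

Real primer design also has to consider melting temperature, GC content, and secondary structure, so treat the reported ranges as candidates only.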