# Microbial Genomics: Lab 3
## Topic: Local and global sequence alignment
#### Tools Used: Biopython, BLAST, ClustalW, MUSCLE

## Part A: Lab exercises

### Exercise 1: Dot Plots
Dot Plots are a simple graphical approach for the visual comparison of two sequences (see Maizel and Lenk 1981 and references). They involve placing one sequence along a vertical axis of a 2D grid, a second sequence on the horizontal axis, and looking to see where the two sequences match. 

More complex dot plots use additional parameters, including 'sliding windows' composed of multiple characters, and a threshold value (i.e. match stringency) that two windows must reach to be considered a match.
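The sliding window and threshold described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the code behind the linked app: the function names `dot_plot` and `render`, the parameter names, and the text rendering are all assumptions made for this example.

```python
def dot_plot(seq_a, seq_b, window=1, threshold=1):
    """Return a grid of booleans (hypothetical helper, not a library function).

    grid[i][j] is True when the windows starting at seq_a[i] and seq_b[j]
    share at least `threshold` matching positions out of `window`.
    seq_a runs down the vertical axis, seq_b along the horizontal axis.
    """
    rows = len(seq_a) - window + 1
    cols = len(seq_b) - window + 1
    grid = []
    for i in range(rows):
        row = []
        for j in range(cols):
            # Count identical characters at the same offset within both windows
            matches = sum(seq_a[i + k] == seq_b[j + k] for k in range(window))
            row.append(matches >= threshold)
        grid.append(row)
    return grid


def render(grid):
    """Draw the grid as text: '*' for a dot, '.' for no dot."""
    return "\n".join("".join("*" if cell else "." for cell in row) for row in grid)


# Comparing a sequence to itself with a strict 3-character window
print(render(dot_plot("GATTACA", "GATTACA", window=3, threshold=3)))
```

Try lowering `threshold` or setting `window=1` on the same input to see how quickly background dots appear.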

**[This link](https://bioboot.shinyapps.io/dotplot/), created by Dr. Barry Grant at UCSD, gives a great visual demonstration of how dot plots work. Open it up, play around with the sliding bars, and answer the questions below as comments:**
1. Why does the DNA sequence have more dots than the protein sequence plot?
2. What does a 'Match Stringency' larger than 'Window Size' yield and why?
3. What would off-diagonal runs of dots represent?
4. What are the major weaknesses of this approach?

In [1]:
# Exercise 1

### Exercise 2: BLAST 
BLAST is a tool that performs local alignments between two sequences of interest, or between a sequence and a database of sequences that contains potential matches. BLAST can be run on nucleotide or protein sequences; comparing a nucleotide sequence to a protein sequence requires one of the translated programs (e.g. blastx or tblastn).

Below are two alkane monooxygenase sequences; we are interested in comparing their similarity:
>WP_007626901.1 alkane 1-monooxygenase [Dietzia cinnamea]
MSSTEYIRPTDGADEHQAPHAHHDHHGHDHHGHDHADVEPYAWTDAKRYLWLLGVIPAMGLFLSMPFVAGFNALGWEIPATIAWFLLPFLVYVA
IPLGDLAIGADGENPPDEVMDKLEADPFYRWCTYLYIPFQYASLIAACYLWTADDLSWLGYDGGLGVAASIGVAWTVAITGGIGINTAHELGHK
IAGSEKWLSKVALATTGYGHFFIEHNRGHHARVATPEDPASSRFGESFWAFLPRSVVGSLRSAWSLESERLGRLGKSPWTLRNDNLNAWLMTVV
LFGALIAIFGWEVAPWLIVQAIFGFSLLEVVNYLEHYGLLRQKTSAGRYQRCRPEHSWNSDHLVTNIFLYHLQRHSDHHANPMRRYQMLRSFEQ
APQLPSGYATMMVVAYIPPLWRKVMDKRVLDHYDGDITRANIQPSKREKILARYGAGSTAVAERIIADTDIAADQTSPTGEYVCPNCGNHYSEA
AGLPREGFPPGTPWSAIPDSWQCSDCGVRDKVDFLPVK

>WP_106297665.1 alkane 1-monooxygenase [Knoellia remsis]
MTANAGTDTGANATVPQGSTQQWTDKKRYLWLIGLVVPSLAFLGIGMYELTGWKVWFWLGPIVVLGIVPAIDLVAGLDRSNPPDDVIEALEKDR
YYRWITYAYLPIQYAGFVGAMWIIGTDAISGLTVLDKVGLAVSIGCIGGIGINTAHELGHKREANERWLSKIALAQSFYGHFYIEHNRGHHVRV
ATPEDPASSRVGENFYQFWPRTVWGSLKSAWGLEARRIARRKQHPFRLSNDVLNAWLMSAVLWGALLLWLGWGILPYLVIQAVVGFTLLEVVNY
MEHYGMLRQRVAYGEKSRYERVDPSHSWNSNNIATNVLLYHLQRHSDHHANPTRRYQTLRDFEESPVLPTGYAGMIVLALVPFVWRRVMDPRVL
RHFDGDLSRANLSPRKRERLLAQYPPPVRSLVGAGPGEGGYAGAPTVEEILAARCPGCGYTYDVVAGEEREGFAAGTAWSQIPDDWCCPDCGVR
EKVDFVAVDPQVA

Navigate to online BLASTP [here](https://blast.ncbi.nlm.nih.gov/Blast.cgi?PAGE=Proteins&PROGRAM=blastp&PAGE_TYPE=BlastSearch&BLAST_SPEC=blast2seq). Copy and paste the first sequence into the Query Sequence box, and the second sequence into the Subject Sequence box. Click BLAST, and answer the questions below using comments:
1. Do these two proteins perform the same function? Why or why not? 
2. How else do you interpret the similarity of the two sequences based on the alignment metrics? 
3. Examine the dot plot - what do the observed gaps mean?
4. Click on Edit Search at the top, and scroll down to expand the additional Algorithm Parameters by clicking the +. Change at least two of these parameters, and re-run the BLAST. Describe how the changes impacted the results, and why.

In [2]:
# Exercise 2

### Exercise 3: Biopython Alignments
Biopython has tools to align sequences on the command line. This allows us to run alignments without relying on NCBI servers (which can be slow) and in a more programmatic way.
* `NCBIWWW.qblast` is the primary Biopython function for running BLAST searches over the web
* The first argument is the BLAST program to use for the search, as a lower case string. The options and descriptions of the programs are available [here](https://blast.ncbi.nlm.nih.gov/Blast.cgi). Currently qblast only works with blastn, blastp, blastx, tblastn and tblastx
* The second argument specifies the databases to search against. The options for this are available in the document HowTo_BLAST.pdf, included in the Labs folder
* The third argument is a string containing your query sequence. This can either be the sequence itself, the sequence in fasta format, or an identifier like a GI number

There are a number of very good resources on the web that describe everything from the underlying BLAST algorithm to example pipelines and workflows. Two of the most important ones are below:
* The NCBI BLAST command line arguments and values can be found [here](https://www.ncbi.nlm.nih.gov/books/NBK279684/table/appendices.T.options_common_to_all_blast/). This contains all the possible arguments to BLAST, as well as useful references for output formats, etc.
* Biopython also has a BLAST command-line reference, found [here](https://biopython.org/docs/1.75/api/Bio.Blast.NCBIWWW.html). Note that the Biopython BLAST is not always exactly the same as Web UI or non-python BLAST; when troubleshooting, make sure you're using the correct reference.

Below, we'll walk through some different examples of running BLAST through Biopython.

In [3]:
# Import relevant BLAST modules
from Bio import SearchIO
from Bio import SeqIO
from Bio.Blast import NCBIWWW
from Bio.Blast.Applications import NcbiblastpCommandline
from Bio.Blast.Applications import NcbiblastnCommandline
from Bio.Blast import NCBIXML
from io import StringIO

In [None]:
## Running BLAST via Biopython: remote blastn
nucl_fasta = SeqIO.read("lab3/katG_EC.fasta", format="fasta")
result_handle_nuc = NCBIWWW.qblast("blastn", "nt", nucl_fasta.seq, hitlist_size=30)
blast_file = "lab3/remote_blast_nuc.xml"
with open(blast_file, "w") as out_handle:
    out_handle.write(result_handle_nuc.read())

In [None]:
## Running BLAST via Biopython: remote blastp (to run nucleotide blast, use blastn and 'nt')
prot_fasta = SeqIO.read("lab3/alkB.fasta", format="fasta")
result_handle_prot = NCBIWWW.qblast("blastp", "nr", prot_fasta.seq, hitlist_size=30)
blast_file = "lab3/remote_blast_prot.xml"
with open(blast_file, "w") as out_handle:
    out_handle.write(result_handle_prot.read())

In [None]:
## Running BLAST via Biopython: local blastn
from Bio.Blast.Applications import NcbiblastnCommandline
# outfmt=5 writes XML, which is the format NCBIXML.parse expects in the next cell
blast_local = NcbiblastnCommandline(query="lab3/katG_EC.fasta", subject="lab3/katG_SE.fasta",
                                    evalue=0.001, outfmt=5, out="lab3/local_blast_output.xml")
stdout, stderr = blast_local()

In [None]:
# Read in blast file and parse the records
blast_file = "lab3/local_blast_output.xml"
with open(blast_file,'r') as f:
    blast_records = list(NCBIXML.parse(f))[0]
    
    # collect list of accession ids to download
#     accession_list = []
#     # determine which accession id meets criteria, and store in list
#     for alignment in blast_records.alignments:
#         pct_id =  alignment.hsps[0].identities/alignment.hsps[0].align_length*100
#         e_val = alignment.hsps[0].expect
#         if e_val < 1e-3 and pct_id > 40:
#             accession_list.append(alignment.accession)


**Use the examples above (and any other reference/documentation you'd like) to perform the exercises below:**
1. Use the BLAST web interface to search katG_EC.fasta against the nucleotide (nt) database, and do the same with command-line BLAST (the cell above does this for you). Once both finish, open up the first xml file (remote_blast_nuc.xml) and compare the .xml file with the online results. Is there any information missing? Write your answer as a comment below.
2. Try re-running the BLAST command from Biopython, but change the output format to try to get it as close as possible to the online results. Use the references listed above to explore different output formats. Which one do you think is most informative?
3. Investigate the structure of the blast_records variable that was created in the above cell. We want to store only the accession IDs for sequences that meet our desired BLAST criteria: e-value < 0.003, percent identity > 40. Use a for loop to parse the xml output and store accession IDs for sequences that meet these thresholds. Note that for each alignment in the BLAST output, we can calculate % identity and e-value using the following:
`pct_id =  alignment.hsps[0].identities/alignment.hsps[0].align_length*100`
and
`e_val = alignment.hsps[0].expect`
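To see how the two expressions above behave before applying them to real results, here is a minimal sketch that runs the same thresholds on mock data. Everything in it other than the attribute names (`hsps`, `identities`, `align_length`, `expect`, `accession`) is an assumption for illustration: the helper names, the `SimpleNamespace` stand-ins, and the accession strings are hypothetical and do not come from a real BLAST run.

```python
from types import SimpleNamespace


def passing_accessions(alignments, max_evalue=0.003, min_pct_id=40):
    """Return accessions whose top HSP passes both thresholds (hypothetical helper)."""
    accessions = []
    for alignment in alignments:
        top_hsp = alignment.hsps[0]
        pct_id = top_hsp.identities / top_hsp.align_length * 100
        if top_hsp.expect < max_evalue and pct_id > min_pct_id:
            accessions.append(alignment.accession)
    return accessions


def fake_alignment(accession, identities, align_length, expect):
    """Build a stand-in object with the same attribute names as a parsed alignment."""
    hsp = SimpleNamespace(identities=identities, align_length=align_length, expect=expect)
    return SimpleNamespace(accession=accession, hsps=[hsp])


# Made-up values chosen so that only the first entry passes both thresholds
mock_alignments = [
    fake_alignment("HYPOTHETICAL_1", identities=90, align_length=100, expect=1e-50),
    fake_alignment("HYPOTHETICAL_2", identities=30, align_length=100, expect=1e-10),
    fake_alignment("HYPOTHETICAL_3", identities=80, align_length=100, expect=0.5),
]
print(passing_accessions(mock_alignments))  # ['HYPOTHETICAL_1']
```

The same function should work on `blast_records.alignments`, assuming the parsed objects expose these attributes as the expressions above suggest.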

In [None]:
# Exercise 3

### Exercise 4: ClustalW 

ClustalW (not to be confused with its successor, Clustal Omega) is a popular command line tool for pair-wise or multiple sequence alignments. It can be accessed via Biopython's Bio.Align.Applications module. Below, we'll align a multi-fasta file using ClustalW and Biopython.

In [None]:
# Import clustalW
from Bio.Align.Applications import ClustalwCommandline
from Bio import AlignIO
# help(ClustalwCommandline)

# build the clustalW command
cline = ClustalwCommandline("clustalw2", infile="lab3/fused-rds_subset.fasta")

# run the command
stdout, stderr = cline()

# read our alignment back in and view it
align = AlignIO.read("lab3/fused-rds_subset.aln", "clustal")
print(align)

Note that there are a number of different aligners; each has strengths and weaknesses. Later in the course, you will have more autonomy in choosing which tool to use; make sure you have an understanding of which tool to use when! This takes some research, some knowledge about the outcome you're trying to achieve, and some experience in trying different tools out. To read about one study comparing different aligners, see [this article](https://www.frontiersin.org/articles/10.3389/fpls.2021.657240/full).

## Part B: Homework questions

#### Question 1: As discussed in lecture, the Needleman-Wunsch algorithm is a means of calculating an alignment between two sequences. 
Use this algorithm to calculate the optimal alignment for the sequences `MEANLY` and `PLEASANTLY`, using the [BLOSUM62 matrix](https://www.ncbi.nlm.nih.gov/Class/FieldGuide/BLOSUM62.txt) to calculate match and mismatch costs between pairs of letters, and a gap penalty of -10. Draw (or type) the distance and back-tracking matrices and include them in your answer. Make sure to include the final alignment in your submission.
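As a way to check your understanding of the fill and back-tracking steps, here is a minimal sketch of global alignment on a toy DNA example. It is not a solution to this question: it assumes a simple scoring scheme (match +1, mismatch -1, gap -2) instead of BLOSUM62, and the function name `needleman_wunsch` and its parameters are chosen for this illustration only.

```python
def needleman_wunsch(seq_a, seq_b, match=1, mismatch=-1, gap=-2):
    """Global alignment with a linear gap penalty (toy scoring, assumed values).

    Returns (score, aligned_a, aligned_b).
    """
    n, m = len(seq_a), len(seq_b)

    # Score matrix: first row and column hold cumulative gap penalties
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    # Fill: each cell is the best of diagonal (substitution), up, or left (gap)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,
                score[i - 1][j] + gap,
                score[i][j - 1] + gap,
            )

    # Back-track from the bottom-right corner to the top-left corner
    aligned_a, aligned_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and seq_a[i - 1] == seq_b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            aligned_a.append(seq_a[i - 1])
            aligned_b.append(seq_b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            aligned_a.append(seq_a[i - 1])
            aligned_b.append("-")
            i -= 1
        else:
            aligned_a.append("-")
            aligned_b.append(seq_b[j - 1])
            j -= 1

    return score[n][m], "".join(reversed(aligned_a)), "".join(reversed(aligned_b))


print(needleman_wunsch("ACGT", "AGT"))  # (1, 'ACGT', 'A-GT')
```

When several paths tie, this sketch prefers the diagonal move, so other equally good alignments may exist.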

In [None]:
# Question 1

#### Question 2: Repeat Question 1, but use local alignment. Compare the two results: how are they the same or different?
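For reference, local alignment is commonly computed with the Smith-Waterman algorithm, which floors every cell at zero and back-tracks from the highest-scoring cell rather than the corner. The sketch below illustrates this on a toy DNA example; the scoring values (match +1, mismatch -1, gap -2) and the function name `smith_waterman` are assumptions for this illustration, not the BLOSUM62 setup the question asks for.

```python
def smith_waterman(seq_a, seq_b, match=1, mismatch=-1, gap=-2):
    """Local alignment with a linear gap penalty (toy scoring, assumed values).

    Returns (best_score, aligned_a, aligned_b) for one best local alignment.
    """
    n, m = len(seq_a), len(seq_b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)

    # Fill: same recurrence as global alignment, but never below zero
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            score[i][j] = max(
                0,
                score[i - 1][j - 1] + sub,
                score[i - 1][j] + gap,
                score[i][j - 1] + gap,
            )
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Back-track from the best cell until a zero-scoring cell is reached
    aligned_a, aligned_b = [], []
    i, j = best_pos
    while i > 0 and j > 0 and score[i][j] > 0:
        sub = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + sub:
            aligned_a.append(seq_a[i - 1])
            aligned_b.append(seq_b[j - 1])
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            aligned_a.append(seq_a[i - 1])
            aligned_b.append("-")
            i -= 1
        else:
            aligned_a.append("-")
            aligned_b.append(seq_b[j - 1])
            j -= 1

    return best, "".join(reversed(aligned_a)), "".join(reversed(aligned_b))


print(smith_waterman("TTTACGTTT", "GGACGGG"))  # (3, 'ACG', 'ACG')
```
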

In [None]:
# Question 2

#### Question 3: Use [this online tool](https://bioinfo.lifl.fr/yass/yass.php) to create a dot-plot of the sequences contained in `katG_EC.fasta` and `katG_SE.fasta`. Upload the sequences and click "select" on each, and then click "Run YASS" to use the default parameters. Use the result to answer the following questions below:
1. How closely related are these two sequences? Do you see any indels?
2. What do the small off-diagonal segments mean?
3. What can you conclude about the relationship of these sequences based on the color of the lines?
4. Re-run the analysis but this time, use "+1, -5" as the scoring matrix. What does the resulting plot look like? Why? Feel free to change some other parameters and comment on what happens.

In [None]:
# Question 3

#### Question 4: Often, we have a single sequence of interest that comes from an unknown species within a subset of possible candidates. In these cases, we may want to use BLAST to align the query sequence across several genomes, extract the best hits from each, and perform some analysis to decide which result is best suited to downstream analyses. Using this general workflow, do the following:
1. Use Nucleotide BLAST to align `katG_EC.fasta` against the genomes contained in `small_genome_db.fasta`
2. Loop through the top hit of each alignment and extract the corresponding sequence
3. Combine all hit sequences into a single multi-fasta
4. Use MUSCLE (see Biopython Cookbook for examples) to align the resulting multi-fasta
5. Visualize the alignment using Jalview and answer the following questions:
    * What can you tell about the different genomes that we used BLAST against?
    * Are there any genomes you would initially rule out based on the alignment results?
    * How much do the different genomes affect our MUSCLE alignment?

In [None]:
# Question 4