# Microbial Genomics: Lab 2
## Topic: Working with databases and Biopython
#### Tools used: Biopython

## Part A: Lab Exercises

### Exercise 1: Navigating NCBI on the Web

NCBI, or the National Center for Biotechnology Information, is part of the NIH. It provides tools, databases, and analysis services to researchers across the world, and has made much of modern bioinformatics possible. To start this lab, we're going to perform a common workflow: searching for a known gene and retrieving some information about it.

Start by navigating to the GenBank database on NCBI: https://www.ncbi.nlm.nih.gov/genbank/. Perform the following steps:
1. Change the database dropdown menu to 'Gene'
2. Type acrA in the top search bar and press Enter
3. Click on *Salmonella enterica* on the right
4. Click on the gene name *acrA*

Answer the following questions in the markdown cell below, based on the information on this page and the pages it links to:
1. What does *acrA* do?
2. Scroll to NCBI RefSeq, click on FASTA, and then click on GenBank. What is the difference between this page and the previous one?
3. How many nucleotides does this *acrA* have?
4. Click around to see if you can get to the protein sequence. How many amino acids does it have?
5. Click on the protein ID if you haven’t already and then look at the FASTA sequence. How does it differ from the nucleotide sequence?
6. What database are we in?
7. Go back to the GenBank *acrA* page and find the accession ID of the reference sequence that the gene came from. This should start with NC. What’s the difference?
8. Describe at least two other accession IDs you see, and what they link to.

[Your response goes here]

### Exercise 2: Navigating NCBI on the command line
Biopython is a Python suite of packages that allows for easy navigation of NCBI and other common public databases. It also includes tools to process common formats and file types that we encounter with genomic sequencing data. 
* To work with Biopython tools, it must first be imported. The module is called Bio when importing: `import Bio`
* The entire Bio library is large; to only import specific modules, the syntax is `from Bio import X`
* The [Biopython tutorial](http://biopython.org/DIST/docs/tutorial/Tutorial.html) is indispensable when working with any of these tools. Bookmark it and use it often!
* Another incredibly useful resource is the [Biopython API documentation](https://biopython.org/docs/1.75/api/index.html). This has very granular detail on exactly what makes up each and every module and object within Biopython.
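
As a quick illustration of the two import styles described above, the sketch below checks the installed version and then imports a single class. The DNA string is a made-up example, not data from this lab, and `Seq` is just one of many pieces you could import this way.

```python
# Import the top-level package only to check which version is installed
import Bio
print("Biopython version:", Bio.__version__)

# Import just the piece that is needed instead of the whole library
from Bio.Seq import Seq

# A Seq object behaves much like a string, with sequence-aware methods added
dna = Seq("ATGGCCATTGTAATGGGCCGCTGA")  # made-up sequence for illustration
print(len(dna))
print(dna.reverse_complement())
print(dna.translate())
```

If the first line fails with `ModuleNotFoundError`, Biopython is probably not installed in the environment that the notebook is running in.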

In this part of the lab, we will step through common Biopython commands and features. In the take-home lab report, you will use Biopython to analyze your own DNA sequence.

In [None]:
# First, let's import and check the help() function for a few pieces of Biopython
# The SeqIO module from Biopython provides functions to easily parse nucleotide or amino acid sequence files
from Bio import SeqIO
# The help function provides useful documentation on how to use SeqIO
##help(SeqIO)
# One of the most useful tools within SeqIO is parse
help(SeqIO.parse)

Now, let's try to read, write, and parse some common file formats.

The file `acrA.fasta` contains 3 *acrA* nucleotide sequences in the commonly used multi-FASTA file format. We can use `SeqIO.parse()` to separate the records and evaluate them individually.
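
To see what a multi-FASTA file looks like before opening the real one, here is a minimal sketch that parses a made-up two-record FASTA string held in memory. The record names and sequences are invented placeholders, and `StringIO` simply stands in for a file on disk.

```python
from io import StringIO
from Bio import SeqIO

# Made-up two-record multi-FASTA text; each record starts with a ">" header line
fasta_text = """>seq1 Example organism one
ATGAAACGT
>seq2 Example organism two
ATGCCC
TTTGGG
"""

# SeqIO.parse accepts a file path or an open handle and yields one SeqRecord at a time
records = list(SeqIO.parse(StringIO(fasta_text), "fasta"))

for record in records:
    # id is the header text up to the first space; description is the whole header line
    print(record.id, "|", record.description, "|", len(record.seq))
```

Note that the sequence of the second record spans two lines in the text, and the parser joins them into one sequence.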

In [None]:
# Count the number of records in acrA.fasta
filename = "lab2/acrA.fasta"
count = 0
for record in SeqIO.parse(filename,"fasta"):
    count = count + 1
print("There are " + str(count) + " records in file " + filename)

In [None]:
# Store the species name for each record (2 examples)
species = []
for record in SeqIO.parse(filename,"fasta"):
    species.append(record.description.split(' ',1)[1])
print(species)

species = []
for record in SeqIO.parse(filename,"fasta"):
    species.append(' '.join(record.description.split(' ')[1:3]))
print(species)

In [None]:
# Examining the sequences
for record in SeqIO.parse(filename, "fasta"):
    start_seq = record.seq[:10] # first 10 letters
    end_seq = record.seq[-10:] # last 10 letters
    print(f"{record.id} {start_seq}...{end_seq}")

Some types of bioinformatic data include one or more annotations called features. These often contain the information of most interest, such as gene function, location, and homologies. Let's look at one such example.
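
As a warm-up before reading a real file, the sketch below builds a tiny record by hand and attaches one feature to it, to show the parts a feature is made of. Everything in it (the sequence, the ID `demo1`, the gene name `demoA`) is a made-up placeholder, not taken from `acrA.gb`.

```python
from Bio.Seq import Seq
from Bio.SeqFeature import FeatureLocation, SeqFeature
from Bio.SeqRecord import SeqRecord

# Made-up 12-base record with one hand-built CDS feature (all values are placeholders)
record = SeqRecord(Seq("GGATGAAATAGC"), id="demo1", description="toy record")
cds = SeqFeature(
    FeatureLocation(2, 11, strand=1),  # 0-based start, end-exclusive, forward strand
    type="CDS",
    qualifiers={"gene": ["demoA"], "product": ["hypothetical protein"]},
)
record.features.append(cds)

for feature in record.features:
    print(feature.type)                 # kind of feature, e.g. CDS, gene, source
    print(feature.location)             # where it sits on the sequence
    print(feature.qualifiers["gene"])   # qualifier values are stored as lists
    print(feature.extract(record.seq))  # the bases covered by the feature
```

Features read from a GenBank file have the same three parts: a `type`, a `location`, and a dictionary of `qualifiers`.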

In [None]:
# Unlike FASTA files, GenBank files usually contain only a single
# record, so we can use the SeqIO.read function
from Bio import SeqIO
record = SeqIO.read("lab2/acrA.gb","genbank")
# print(record.id)
# print(record.seq)
print(record.features)

We can also use Biopython to download sequences from NCBI, which is extremely useful, especially when we have lists of hundreds or thousands of sequences to fetch.

The `Entrez.efetch` function is what you use when you want to retrieve a full record from Entrez. Below we only go through one example of a database and file type; the other valid combinations are described on the main EFetch help page:
* https://www.ncbi.nlm.nih.gov/books/NBK25499/#chapter4.EFetch 
* https://www.ncbi.nlm.nih.gov/books/NBK25499/table/chapter4.T._valid_values_of__retmode_and/?report=objectonly
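
To illustrate how the return type can be changed, here is a minimal sketch that requests FASTA instead of GenBank by passing `rettype="fasta"`. The helper name `fetch_fasta_record` and the email address are hypothetical placeholders, and the function needs internet access, so the example call is left commented out.

```python
from Bio import Entrez, SeqIO

Entrez.email = "your.name@example.org"  # placeholder; replace with your own address


def fetch_fasta_record(accession):
    """Fetch one nucleotide record from NCBI in FASTA format (needs internet access)."""
    handle = Entrez.efetch(db="nucleotide", id=accession, rettype="fasta", retmode="text")
    try:
        return SeqIO.read(handle, "fasta")
    finally:
        handle.close()  # close the connection even if parsing fails


# Example call, left commented out so nothing contacts NCBI until you are ready:
# record = fetch_fasta_record("EU490707")
# print(record.id, len(record.seq))
```

The format name given to `SeqIO.read` has to match the `rettype` that was requested, since the parser only sees the text that NCBI sends back.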

In [None]:
# Download the sequence with accession ID EU490707 from the nucleotide database in GenBank format
from Bio import SeqIO
from Bio import Entrez
Entrez.email = "pac@columbia.edu"  # Always tell NCBI who you are
handle = Entrez.efetch(db="nucleotide", id="EU490707", rettype="gbwithparts", retmode="text")
record = SeqIO.read(handle, "genbank")
handle.close()
print(record.id)

In [None]:
# Write downloaded genbank information into a file
output_filename = "lab2/EU490707.gb"
SeqIO.write(record, output_filename, "genbank")

Now it's your turn! **Using the analysis from above, and the Biopython documentation, do the following in the cell below:**
1. Print the first feature in `acrA.gb`
2. Determine and print the organism that the first feature of `acrA.gb` came from
3. Determine and print the length of each *acrA* sequence in the `acrA.fasta` file
4. Write a piece of code to calculate and print the total length of all sequences in the file

In [None]:
# Exercise 2

## Part B: Homework

### Question 1
Download and store GenBank (`.gb`) files for each genome in the `accession_IDs` list defined below. Make sure to use your own email address!

In [None]:
# Question 1
accession_IDs = ["NZ_CP018816.1","NZ_CP020061.1","NZ_CP020071.1"] 

### Question 2
Read each GenBank file from Question 1 and search for the *acrA* gene. Concatenate all *acrA* genes into a multi-FASTA file using any method, then save the file.

In [None]:
# Question 2

### Question 3
Read in the FASTA file from Question 2 and find the genome with the shortest *acrA* sequence. If more than one has the same length, choose any. Print the accession ID of this genome.


In [None]:
# Question 3

### Question 4
Choose any GenBank file that you downloaded above and convert it into a FASTA file.

In [None]:
# Question 4

### Question 5
In the markdown cell below, answer the following:
1. Describe what a genome accession ID and a gene accession ID are, and how they differ.
2. What type of data does the BioProject database on NCBI contain?
3. Choose one of the *Klebsiella* GenBank files from above, and open it on your computer. What information is not contained in this file that could be found in a different database?
4. Given a nucleotide FASTA file, is there enough information to generate the corresponding amino acid FASTA file? How about in the reverse direction (i.e. protein to nucleotide)?
5. What is the difference between the FASTA and FASTQ file formats?

[Your response goes here]