#  Project 1:  Genbank Annotation Files

##### Due Friday Apr 15 

If two or more people are working on this project enter each person's name here:
* Name:
* Name:
* Name:

For each section of the project enter your answers in the appropriate notebook cells.  Save the notebook and submit the completed assignment through Canvas.

### Preparation 

The goal for this project is to explore data available from the National Center for Biotechnology Information (NCBI).  You will download the complete genome sequence for a species of bacteria and use Python to search for data in the genome.

You will need the [Biopython](http://biopython.org) library.  It's easy to install with `pip`:
```
pip install biopython
```
(Mac/Linux users: if you're not using Anaconda type `pip3` to install for `python3`)

You will also need a shell command that downloads a file. In this document I use `wget` but you can use `curl` or any other utility you have installed on your system.

### Warmup 

Before you start this project take some time to familiarize yourself with the NCBI web site.  It's a huge site, but it's worth learning where to find documentation and "how-tos" in addition to the main data sources.

The home page is [http://www.ncbi.nlm.nih.gov](http://www.ncbi.nlm.nih.gov).

To see what sorts of full-genome data are available click Genome from the list of "popular resources" on the right side of the page, then click Microbes and explore a little bit.  The terminology will be overwhelming to non-experts but you should get a sense of how the site is organized.

**Note:** &nbsp; Later in the term, if your group needs to download data from NCBI, you'll want to learn about "eutils", the web service API that allows you to access data programmatically.  Here's a link to a "how to" guide that gives a quick overview:  [How to: Download a large, custom set of records from NCBI](https://www.ncbi.nlm.nih.gov/guide/howto/dwn-records/)

### Part 1: &nbsp; Data 

For this project you'll need the complete genome sequence for _Escherichia coli_, a bacterium that lives in our lower intestines.  The most common strains play an essential role in digestion, but _E. coli_ periodically shows up in the news since other strains are associated with colitis, Crohn's Disease, and other intestinal disorders.

To learn about _E. coli_ select the **Genome** database from the pull-down menu at the top of the NCBI home page.  Enter this text in the search box and click Search:
```
Escherichia coli[ORGN] 
```

To find the complete genome sequence for the K-12 strain, select the **Nucleotide** database from the pull-down menu and add the strain ID to the search box, so it now reads
```
Escherichia coli[ORGN] K-12[STRN] 
```

There will be a lot of search results (this is a very widely studied organism).  Look for the Genbank identifier ("GI number") of one of the strains and copy it.

To fetch the genome file the simplest method is to create a URL and use it with `wget` to download the file.  If you want, you can use the Entrez library in Biopython to construct the URL and download the data; see the optional project at the end of this notebook.

Use the GI number to create a URL that looks like this:
```
http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&rettype=gbwithparts&retmode=text&id=NNN
```
where NNN is replaced by your GI number.

Now use the URL with `wget` or a similar command to fetch the genome file and save it as a file named `ecoli.gbk` on your system.  Here's the shell command I typed (with NNN instead of my GI number):
```
$ wget 'http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&rettype=gbwithparts&retmode=text&id=NNN' -O ecoli.gbk
```
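If you would rather stay inside the notebook, the same download can be sketched in Python with only the standard library. The helper names `build_efetch_url` and `download_genome` are hypothetical (they are not part of the assignment or of Biopython), and the sketch uses the `https` form of the same efetch endpoint.

```python
from urllib.parse import urlencode
from urllib.request import urlretrieve

# Same efetch endpoint as the wget command, written with https.
EFETCH_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def build_efetch_url(gi_number):
    """Return the efetch URL for a full GenBank record (hypothetical helper)."""
    params = {
        "db": "nucleotide",
        "rettype": "gbwithparts",
        "retmode": "text",
        "id": str(gi_number),
    }
    return EFETCH_BASE + "?" + urlencode(params)


def download_genome(gi_number, filename="ecoli.gbk"):
    """Fetch the record over the network and save it to `filename`."""
    urlretrieve(build_efetch_url(gi_number), filename)
    return filename


# Example (requires network access), with NNN replaced by a GI number:
# download_genome("NNN")
```

Building the query string with `urlencode` avoids quoting problems with the `&` characters that the shell command has to protect with quotes.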

Edit the markdown cell below to answer the following questions about your genome file:
* what GI number did you use to download the file?
* how big is your file (in MB)?
* how many lines are in the file? 
* what is the first line in the file? 
* how many lines have the string `"tRNA"`? 
* how many of these lines are the start of "tRNA features"? 

* GI #: 985000614
* file size: 10.79 MB

In [3]:
! ls -lh ecoli.gbk

-rw-r--r--  1 hrnmy  staff    11M Apr 11 14:22 ecoli.gbk


In [4]:
! wc -l ecoli.gbk

  170278 ecoli.gbk


In [5]:
! head -n 1 ecoli.gbk

LOCUS       CP014225             4659625 bp    DNA     circular BCT 02-FEB-2016


In [6]:
! grep -c 'tRNA' ecoli.gbk

431


In [7]:
! grep -c 'tRNA features' ecoli.gbk

0
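
The literal string `tRNA features` does not occur in a GenBank file, so the last `grep` reports 0. A tRNA feature starts on a feature-table line whose key is `tRNA`. The sketch below counts such lines in Python, assuming the standard flat-file layout (five leading spaces, then the feature key in column 6). The function name `count_feature_lines` and the sample lines are made up for illustration.

```python
def count_feature_lines(lines, key="tRNA"):
    """Count feature-table lines whose feature key equals `key`.

    Assumes the standard GenBank flat-file layout: five leading spaces
    followed by the feature key starting in column 6. Qualifier lines are
    indented further, so their sixth character is a space.
    """
    count = 0
    for line in lines:
        starts_feature = (
            line.startswith("     ") and len(line) > 5 and line[5] != " "
        )
        if starts_feature and line.split()[0] == key:
            count += 1
    return count


# Small made-up excerpt in GenBank feature-table layout.
sample = [
    "     gene            225381..225457",
    '                     /gene="ileV"',
    "     tRNA            225381..225457",
    '                     /product="tRNA-Ile"',
    "     CDS             complement(1000..2000)",
]
print(count_feature_lines(sample))  # prints 1

# On the real file (not run here):
# with open("ecoli.gbk") as handle:
#     print(count_feature_lines(handle))
```

The result for the real file should agree with the `'tRNA'` count in the feature-type dictionary built in Part 2.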


### Extra Credit

Use Biopython's `Entrez.esearch` to find IDs for the _E. coli_ genome, then use `Entrez.efetch` to download a genome to a `.gbk` file.

In [16]:
from Bio import Entrez

In [41]:
Entrez.email = 'hduvvuri@uoregon.edu'
handle = Entrez.esearch(db = 'nucleotide', term = 'escherichia coli[ORGN] k-12[STRN]')
record = Entrez.read(handle)

In [42]:
record['IdList']

['1015632154', '1000950200', '985533865', '985000614', '983515101', '939732440', '939731527', '937526251', '937521852', '937517453', '958167895', '958167893', '958167892', '958167890', '958167888', '958167886', '3434983', '938151182', '938151181', '938149557']

In [45]:
# Request the record as plain GenBank text and save it to disk.
ecoli_handle = Entrez.efetch(db = 'nucleotide', id = '985533865', rettype = 'gb', retmode = 'text')
with open('ecoli_2.gbk', 'w') as ecoli_file:
    ecoli_file.write(ecoli_handle.read())
ecoli_handle.close()

### Part 2: &nbsp; Exploring the Genome 

Below are a series of questions about the _E. coli_ genome that can be answered using Biopython.  Load the Genbank file with a call to `SeqIO.read`.  You can then use methods of the resulting `SeqRecord` object to answer the questions.

There are three notebook cells for each question: (1) a markdown cell with the question, (2) an empty markdown cell for you to explain how you answered the question, and (3) an empty code cell where you should enter the Python expression you used to answer the question.

Each time you work on this project make sure you execute this code cell to import the Biopython modules you need (feel free to add more imports to this cell if you want):

In [2]:
from Bio.Seq import Seq
from Bio import SeqIO

In [3]:
ecoli = SeqIO.read('ecoli.gbk', 'genbank')

##### Sequence Name

What is the name of the sequence?

Use the `name` attribute of the `SeqRecord` object returned by `SeqIO.read` to find the name of the genome.

In [8]:
ecoli.name

'CP014225'

##### Sequence Size

What is the total size in base pairs (bp) of the chromosome?

Use the `len()` function on the record to find the number of base pairs in `ecoli`.

In [9]:
len(ecoli)

4659625

##### Feature List

Use the `features` attribute of the `SeqRecord` object and Python's `len` function to count the features in the genome.

In [10]:
len(ecoli.features)

9243

##### Pseudogenes

How many of the features are pseudogenes?  Hint: find features that have a `pseudo` qualifier.

A list comprehension keeps the features whose qualifiers include the key `pseudo`, and `len()` then gives the number of those features.

In [7]:
len([i for i in ecoli.features if 'pseudo' in i.qualifiers])

164

##### Feature Types

Create a dictionary that contains each type of feature found in this genome and the number of times that feature occurs.  Your dictionary should look something like this:
```
{'CDS': 4492,
 'gene': 4619,
 ...
}
```

Loop over the features in `ecoli`.  If a feature's type is not yet in the dictionary, create a new key for it with a count of 1; otherwise add 1 to its count.

In [12]:
features_dict = {}

for feature in ecoli.features:
    # First occurrence of this type starts at 1; later ones increment.
    if feature.type not in features_dict:
        features_dict[feature.type] = 1
    else:
        features_dict[feature.type] += 1

features_dict

{'CDS': 4489,
 'gene': 4617,
 'misc_binding': 6,
 'ncRNA': 18,
 'rRNA': 22,
 'repeat_region': 2,
 'source': 1,
 'tRNA': 87,
 'tmRNA': 1}
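
The same tally can also be sketched with `collections.Counter` from the standard library. The function name `count_feature_types` is hypothetical, and the demo uses stand-in objects with a `type` attribute rather than real `SeqFeature` objects so that it runs without a GenBank file.

```python
from collections import Counter
from types import SimpleNamespace


def count_feature_types(features):
    """Tally the `type` attribute of each feature into a plain dict."""
    return dict(Counter(feature.type for feature in features))


# Stand-in objects with a `type` attribute, used here instead of real
# SeqFeature objects so the sketch runs without a GenBank file.
demo = [SimpleNamespace(type=t) for t in ("gene", "CDS", "gene", "tRNA")]
print(count_feature_types(demo))  # {'gene': 2, 'CDS': 1, 'tRNA': 1}
```

With the genome loaded, the call would be `count_feature_types(ecoli.features)`.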

### Part 3: &nbsp; tRNA Sequence File 

Write a Python function called `extract_trnas` that writes a FASTA file containing all the tRNA sequences in a genome.  The function should take two arguments: the name of the input file (which you can assume is a `.gbk` file) and the name of the output file.

The defline for each output sequence should have the name of the genome followed by the name of the tRNA.  To make the tRNA name use the "product" qualifier of the tRNA feature, which is a string of the form "tRNA-XXX", where XXX is the 3-letter abbreviation of the amino acid transported by this tRNA.  Because there are multiple tRNAs for each amino acid you'll need to attach a number to each ID, starting with 1.  For example, the first tRNA for Arg will be called "tRNA-Arg-1", the second will be "tRNA-Arg-2", and so on.

Here is the first part of the output from my program (with the genome name replaced with XXX):
```
>XXX tRNA-Arg-1
GCGCCCTTAGCTCAGTTGGATAGAGCAACGACCTTCTAAGTCGTGGGCCGCAGGTTCGAATCCTGCAGGGCGCGCCA
>XXX tRNA-Thr-1
GCTCAAGTAGTTAAAAATGCATTAACATCGCATTCGTAATGCGAAGGTCGTAGGTTCGACTCCTATTATCGGCACCA
...
```
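
The numbering rule on its own can be sketched as follows, with a per-product counter that starts at 1. The function name `number_products` is hypothetical and the input is just a list of product strings, not real features.

```python
from collections import defaultdict


def number_products(products):
    """Append a per-product running number, starting at 1, to each name."""
    counts = defaultdict(int)
    numbered = []
    for product in products:
        counts[product] += 1
        numbered.append("{}-{}".format(product, counts[product]))
    return numbered


print(number_products(["tRNA-Arg", "tRNA-Thr", "tRNA-Arg"]))
# ['tRNA-Arg-1', 'tRNA-Thr-1', 'tRNA-Arg-2']
```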

Use the following markdown cell to describe your function and explain how you tested it.

`extract_trnas` reads the input file into a `SeqRecord` object, opens the output file with `open()`, and loops over the features in the record, writing only those whose `type` is `tRNA` to the FASTA file. For each tRNA feature it writes a header line and a sequence line; the sequence is taken from the full genome sequence using the location stored in the feature. The number in each name is found by appending the tRNA's product name to a `seen` list and using `count()` to find how many times that name has been seen so far.

I tested the function with print statements in the loop for each line written to the file, and by writing out a file of the tRNAs in the _E. coli_ genome.

Edit the body of the function in this code cell so it contains your solution to this problem:

In [15]:
def extract_trnas(input_file, output_file):
    # Load the single-record GenBank file as a SeqRecord.
    genome = SeqIO.read(input_file, 'genbank')
    seen = []  # product names seen so far, used to number each tRNA

    with open(output_file, 'w') as file_out:
        for feature in genome.features:
            if feature.type == 'tRNA':
                # extract() applies the feature location (0-based, end-exclusive)
                # and reverse-complements features on the minus strand.
                seq = feature.extract(genome.seq)
                product = feature.qualifiers['product'][0]
                seen.append(product)
                index = seen.count(product)

                file_out.write('>{} {}-{}\n'.format(genome.name, product, index))
                file_out.write('{}\n'.format(str(seq)))

In [48]:
extract_trnas('ecoli.gbk', 'trnas_2.fasta')
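
One way to sanity-check the output file is to read it back and count the records. The sketch below uses plain Python rather than Biopython. The names `read_fasta` and `check_trna_fasta` are hypothetical, and the check assumes the sequences contain only the letters A, C, G, T and N.

```python
def read_fasta(path):
    """Return a list of (defline, sequence) pairs from a FASTA file."""
    records = []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                # Start a new record; the sequence is filled in below.
                records.append([line[1:], ""])
            elif records:
                records[-1][1] += line
    return [(name, seq) for name, seq in records]


def check_trna_fasta(path):
    """Return the record count after checking names and sequence letters."""
    records = read_fasta(path)
    names = [name for name, _ in records]
    assert len(names) == len(set(names)), "deflines should be unique"
    for name, seq in records:
        # Assumption: only unambiguous bases (plus N) are expected.
        assert seq and set(seq.upper()) <= set("ACGTN"), name
    return len(records)


# Example (needs the file written above):
# check_trna_fasta('trnas_2.fasta')
```

The returned count can be compared with the `'tRNA'` entry in the feature-type dictionary from Part 2.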