![filo_virion](https://user-images.githubusercontent.com/22747792/73687685-7111bc00-467f-11ea-906e-e16132529840.png)

# Python for Genomics 
## Section 9: Final Project Part 1 - Building a Genomic Features Summary Calculator

Welcome to your final project! I'm so glad you made it 👊🏼 🎉

Now that you've learned some Python tools for working with genetic data, let's build something together.

We'll be pulling in all the knowledge that we've learned so far, so feel free to look at past notebooks for ideas.

### Here is the brief:
Let's say we had to build a diagnostic test that detects Ebola virus DNA. (Ebola's genome is actually RNA, but for testing we can speak in terms of DNA, since that is how the sequences are represented in GenBank.) For that, we need the most conserved gene. What is the most conserved gene across all the past Ebola outbreaks? Where should we focus our studies?

The first thing we'll do is build a small script that gives us an overview of the Ebola virus's genome. 

### Build a Viral Genome Statistics Summary Calculator. 

#### The input for this calculator will be a GenBank file containing a single record. 
We will be building this small calculator using the reference genome for Ebola: KM034562. 

#### The output will be a text summary containing:


    1. the number of genes in the genome;
    2. the total length of the genome;
    and for every gene, provide/calculate:
    3. gene name;
    4. its length;
    5. its GC content;
    6. no. of A's;
    7. T's;
    8. C's;
    9. G's;
    10. N's;

#### The best way to go about this is to split up your tasks into chunks. 

I see two main chunks. The first one gives an overview of the entire genome. That can be accomplished by referencing attributes of the SeqRecord object.

The second chunk is based on SeqFeatures, so we'll have to get into each feature and perform some specific tasks there.
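To make the two chunks concrete, here is a minimal sketch on a tiny made-up record built in memory. The sequence, the `TOY0001.1` id, and the gene names `geneA` and `geneB` are invented for illustration and are not Ebola data. The real record you read from the GenBank file should expose the same attributes.

```python
from Bio.Seq import Seq
from Bio.SeqFeature import FeatureLocation, SeqFeature
from Bio.SeqRecord import SeqRecord

# Made-up 24-base "genome" with two made-up genes (illustration only).
toy_record = SeqRecord(
    Seq("ATGCGTACGTTAGCNNATGCCGTA"),
    id="TOY0001.1",
    description="Toy virus, complete genome",
    features=[
        SeqFeature(FeatureLocation(0, 24, strand=1), type="source"),
        SeqFeature(FeatureLocation(0, 9, strand=1), type="gene",
                   qualifiers={"gene": ["geneA"]}),
        SeqFeature(FeatureLocation(12, 24, strand=1), type="gene",
                   qualifiers={"gene": ["geneB"]}),
    ],
)

# Chunk 1: genome-level information sits directly on the SeqRecord.
genome_id = toy_record.id
genome_description = toy_record.description
genome_length = len(toy_record.seq)

# Chunk 2: gene-level information sits in the SeqFeature objects.
feature_types = [feature.type for feature in toy_record.features]

print(genome_id, genome_description, genome_length)
print(feature_types)
```

Notice that not every feature is a gene, which is why the second chunk needs a filter on `feature.type`.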

![purple-divider](https://user-images.githubusercontent.com/7065401/52071927-c1cd7100-2562-11e9-908a-dde91ba14e59.png)

Alright, let's do our imports up here. I'll fill them in for us so go ahead and just run the cell.

In [6]:
from Bio import Seq, SeqIO, SeqUtils
from Bio.SeqRecord import SeqRecord

Now for the data. I'm going to show you how I get the data for this, but you can just use what I put in the data folder. We're going to be building our calculator using a reference sequence for Ebola first. The reference sequence is accession KM034562.

We can retrieve it like this:

In [1]:
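from Bio import Entrez

# NCBI asks for a contact email with every Entrez request
Entrez.email = 'your email here'

# Fetch the GenBank record as plain text
efetch_handle = Entrez.efetch(db='nucleotide', id='KM034562', rettype='gb', retmode='text')
downloaded_record_text = efetch_handle.read()
efetch_handle.close()

# Write it to disk; the with-block closes the file so the data is flushed
with open('data/KM034562.gb', 'w') as new_file_handle:
    new_file_handle.write(downloaded_record_text)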

Now we have our data saved in our folder. Time to design our calculator.

The first order of business is to find all the genes. 

So first, read in your file, then loop through all the feature objects contained within the SeqRecord's `features` attribute list. Then just print all the features that you find that are of the type "gene".

In [25]:
# your code goes here...
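If you get stuck, here is one possible sketch of the filtering step, shown on a made-up in-memory record rather than the real file. The helper names `find_gene_features` and `gene_name` are my own, and I'm assuming each gene feature carries a `gene` qualifier, which is worth checking on your record. With the real data you would start from `SeqIO.read('data/KM034562.gb', 'genbank')` instead of the toy record.

```python
from Bio.Seq import Seq
from Bio.SeqFeature import FeatureLocation, SeqFeature
from Bio.SeqRecord import SeqRecord


def find_gene_features(record):
    """Return only the features of a SeqRecord whose type is 'gene'."""
    return [feature for feature in record.features if feature.type == "gene"]


def gene_name(feature):
    """Return the first 'gene' qualifier, or a placeholder if it is missing."""
    return feature.qualifiers.get("gene", ["unnamed"])[0]


# Made-up record for illustration only; not Ebola data.
toy_record = SeqRecord(
    Seq("ATGCGTACGTTAGCNNATGCCGTA"),
    id="TOY0001.1",
    description="Toy virus, complete genome",
    features=[
        SeqFeature(FeatureLocation(0, 24, strand=1), type="source"),
        SeqFeature(FeatureLocation(0, 9, strand=1), type="gene",
                   qualifiers={"gene": ["geneA"]}),
        SeqFeature(FeatureLocation(12, 24, strand=1), type="gene",
                   qualifiers={"gene": ["geneB"]}),
    ],
)

gene_features = find_gene_features(toy_record)
gene_names = [gene_name(feature) for feature in gene_features]

for feature in gene_features:
    print(feature.type, gene_name(feature), feature.location)
```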

### Now we are going to tidy this information up into a summary for action item 1.
Fix the output of your little script so that it prints a statement like this:

---
GENOMIC SUMMARY for:
<br>
(insert genome description here)
<br>
(insert genome accession here)


The total length of the genome is (insert number here) bases.
<br>
There are (insert number here) genes.
<br>
The gene names are: 
<br>
(insert list of gene names here)
---
You can go ahead and just provide the gene names as a Python list. It's not pretty, but it will work for our little calculator.

In [None]:
# your code goes here...
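As a rough sketch of the formatting step, the hypothetical helper below takes plain values and returns the summary text. In the calculator, I'd expect those values to come from `record.description`, `record.id`, `len(record.seq)`, and your list of gene names. The values passed in here are made up.

```python
def format_genome_summary(description, accession, genome_length, gene_names):
    """Build the genome-level summary as one multi-line string."""
    lines = [
        "GENOMIC SUMMARY for:",
        description,
        accession,
        "",
        f"The total length of the genome is {genome_length} bases.",
        f"There are {len(gene_names)} genes.",
        "The gene names are:",
        str(gene_names),
    ]
    return "\n".join(lines)


# Made-up inputs for illustration only.
summary = format_genome_summary(
    "Toy virus, complete genome", "TOY0001.1", 24, ["geneA", "geneB"]
)
print(summary)
```

Returning a string instead of printing inside the function makes it easy to reuse the summary later, for example to write it to a file.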

### OK! We have the initial part of our summary, now let's start to gather the statistics for every gene. 

Remember we want these bits of info for every gene:
* gene name;
* its length;
* its GC content;
* no. of A's;
* T's;
* C's;
* G's;
* N's;

So the first order of business is to get inside each feature and run some calculations.

In [47]:
# your code goes here...
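Here is a small sketch of the per-gene calculation, assuming you have already pulled the gene's sequence out of the genome, for example with `str(feature.extract(record.seq))`. The function name `base_composition` and the test sequence are made up. Note one assumption: GC content is computed here over the full length, N's included, and library helpers may treat ambiguous bases differently.

```python
def base_composition(sequence):
    """Count A, T, C, G and N, and compute GC content as a percentage.

    Assumption: the denominator is the full sequence length, including N's.
    """
    sequence = str(sequence).upper()
    counts = {base: sequence.count(base) for base in "ATCGN"}
    length = len(sequence)
    gc_percent = 100 * (counts["G"] + counts["C"]) / length if length else 0.0
    return {"length": length, "gc_percent": gc_percent, **counts}


# Made-up sequence for illustration only.
stats = base_composition("ATGCGTACGNNA")
print(stats)
```

Because the function calls `str()` on its input, it should work with a plain string or with a Biopython `Seq` object.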

### Ok, now that we have the two components, let's just put them all together inside one function that takes one argument - the GenBank record.

That way, if we ever have to use it, we can come back to this notebook and just change around the input file.

You can present the data however you feel is nice and tidy up the code if you like. 

In [4]:
# Your code here ...

from Bio import SeqIO
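# gc_fraction returns a fraction (0 to 1); Biopython releases before 1.80 offer GC (a percentage) instead
from Bio.SeqUtils import gc_fraction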

def genome_stat_calc(genbank_record):
    pass



There are many ways to accomplish this, and you can definitely tighten up the code to make it more concise.
That only pays off if you will still be able to understand it later, though. I tend to err on the side of being more verbose, or explanatory, so that when I come back to the code in a few months I know what the heck I was doing. 

But isn't that cool? We just built a custom calculator using Biopython. 

### Great job!! 🥳