# Using Python to analyze genome data

We will be making heavy use of Biopython, a set of bioinformatics tools designed to make it as easy as possible to use Python for bioinformatics. Its `Bio` package will enable us to collect sequences from online servers, parse them, align them, and analyze them.

## What is a genome and why analyze it?
DNA is the molecule that contains the instructions needed to develop and direct the activities of organisms. Each DNA strand is made of chemical units called nucleotides, whose four bases (A, C, G, and T) form a genetic alphabet and pair with the bases on the complementary strand. The order of the A's, C's, T's, and G's determines the meaning of the information encoded by the strand of DNA. A genome is an organism's complete set of DNA. Scientists are often most concerned with the regions of the DNA that code for proteins. Proteins make up body structures like organs and tissue, control chemical reactions, and carry signals between cells. When the DNA of a cell becomes damaged or mutated, a defective protein may be produced. These defective proteins can disrupt normal processes in the body and can lead to disease. By studying the genetic code we can study the mutations that give rise to abnormal proteins.

Genomics is not limited to the study of disease. DNA can be used to draw comparisons between individuals or entire species. The fields of comparative and evolutionary genomics often use DNA or protein sequences to infer evolutionary relationships between organisms of different taxa, in order to study the biological similarities and differences between them.

## How do we study genomes?
Genomes are first sequenced to determine the exact order of their base pairs. The resulting sequence fragments are then assembled like a jigsaw puzzle to determine their linear order. Once the sequences are assembled we can start collecting information about particular genes. Often, the goal is to gather many homologous sequences (sequences shared between individuals or species because they derive from a common ancestor) and compare them by looking for substitutions, deletions, or additional base pairs. These differences between sequences can reveal variations in the functionality of proteins or elucidate evolutionary relationships between taxa. The necessary steps for these types of analyses are usually as follows:

1. Determine a sequence of interest (this could be a specific gene responsible for some type of disease or illness)
2. Find homologous (similar/related) sequences by performing a BLAST search of a database containing sequence data, and download them as FASTA files
3. Align these sequences (called a multiple sequence alignment, or MSA)
4. Create a phylogenetic tree from the MSA
5. Make estimates and measurements based on this tree
6. Other interesting things...
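
To make the comparison step more concrete, here is a minimal sketch of how substitutions, deletions, and insertions could be picked out of two sequences that have already been aligned. The function name `classify_differences` and the two short DNA fragments are made up for illustration; they are not real BRCA1 data and not part of Biopython.

```python
def classify_differences(ref, other, gap="-"):
    """Compare two aligned sequences of equal length, column by column.

    Returns a list of (position, ref_base, other_base, kind) tuples,
    with positions counted from 1.
    """
    if len(ref) != len(other):
        raise ValueError("aligned sequences must have the same length")
    differences = []
    for position, (a, b) in enumerate(zip(ref, other), start=1):
        if a == b:
            continue  # identical column (or gap in both): nothing to report
        if a == gap:
            kind = "insertion"     # base present only in `other`
        elif b == gap:
            kind = "deletion"      # base missing from `other`
        else:
            kind = "substitution"  # different base in each sequence
        differences.append((position, a, b, kind))
    return differences


# Two short, made-up aligned DNA fragments (hypothetical example data)
reference = "ATGGATTTATCTGC"
variant = "ATGGCTTTA-CTGC"

for position, a, b, kind in classify_differences(reference, variant):
    print(position, a, b, kind)
```

Real analyses work on full multiple sequence alignments rather than on a single pair, but the column-by-column idea is the same.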



## Step 1: Search and download sequences from NCBI database

### Import Entrez from Biopython and HTTPError from urllib
This will let us automatically search NCBI databases and download results into FASTA files. 

In [1]:
import time  # used to pause between download retries
from urllib.error import HTTPError

from Bio import Entrez

### Set up our search query

The most important step in the analysis is determining what gene or protein we want to study. For this project, let's look at a well-known gene, BRCA1. The BRCA1 gene encodes a nuclear phosphoprotein that plays a role in maintaining genomic stability and acts as a tumor suppressor. Mutations in this gene account for approximately 40% of inherited breast cancers and more than 80% of inherited breast and ovarian cancers [RefSeq, May 2009].
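
A bare gene name matches any record that mentions it, so it can help to narrow the query with Entrez field tags. The sketch below builds such a query string in plain Python. The helper name `build_entrez_term` is hypothetical, and restricting the search to human sequences is only an illustrative assumption; `[Gene Name]` and `[Organism]` are standard Entrez field tags.

```python
def build_entrez_term(gene, organism=None, extra=None):
    """Combine field-tagged clauses into a single Entrez search string."""
    clauses = ["{}[Gene Name]".format(gene)]
    if organism:
        # Quote the organism so a multi-word name is treated as one phrase
        clauses.append('"{}"[Organism]'.format(organism))
    if extra:
        clauses.append(extra)  # any additional clause, passed through unchanged
    return " AND ".join(clauses)


print(build_entrez_term("BRCA1"))
print(build_entrez_term("BRCA1", organism="Homo sapiens"))
```

The resulting string could be used as the search term in place of the plain gene name.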

#### Set search terms
We can use the `Entrez.esearch` function to run our query against the Entrez nucleotide database, and then `Entrez.efetch` to download the sequences we find into a .fasta file for later analysis.

In [2]:
search_term = 'BRCA1'
Entrez.email = 'tuc12093@temple.edu'  # NCBI asks for a contact address with each request

# Search the nucleotide database and keep the results on NCBI's history server
search_handle = Entrez.esearch(db="nucleotide",
                               term=search_term,
                               usehistory="y",
                               idtype="acc")
search_results = Entrez.read(search_handle)
search_handle.close()

acc_list = search_results["IdList"]  # accessions returned by the search
count = 10  # download only the first 10 records to keep the example small

# WebEnv and QueryKey identify the stored search so efetch can page through it
webenv = search_results["WebEnv"]
query_key = search_results["QueryKey"]

batch_size = 3
outfile = search_term + ".fasta"
with open(outfile, "w") as out_handle:
    for start in range(0, count, batch_size):
        end = min(count, start + batch_size)
        print("Going to download record {} to {}".format(start+1, end))
        attempt = 0
        while attempt < 3:
            attempt += 1
            try:
                # retmax is end - start so the last batch does not run past count
                fetch_handle = Entrez.efetch(db="nucleotide",
                                             rettype="fasta", retmode="text",
                                             retstart=start, retmax=end - start,
                                             webenv=webenv, query_key=query_key,
                                             idtype="acc")
                break  # success: stop retrying
            except HTTPError as err:
                # Retry server-side (5xx) errors; re-raise anything else,
                # and re-raise once the last attempt has failed
                if 500 <= err.code <= 599 and attempt < 3:
                    print("Received error from server {}".format(err))
                    print("Attempt {} of 3".format(attempt))
                    time.sleep(15)
                else:
                    raise

        data = fetch_handle.read()
        fetch_handle.close()
        out_handle.write(data)

Going to download record 1 to 3
Going to download record 4 to 6
Going to download record 7 to 9
Going to download record 10 to 10
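
Before moving on, it is worth checking what actually ended up in the downloaded file. The sketch below is a minimal plain-Python FASTA reader, shown on a small made-up string so that it runs without a download. The function name `read_fasta` and the example records are illustrative only; in practice Biopython's `Bio.SeqIO.parse` does the same job more robustly.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs from an open FASTA handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)  # emit the previous record
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence lines may wrap across several rows
    if header is not None:
        yield header, "".join(chunks)  # emit the final record


# Made-up two-record example standing in for a real FASTA file
example = io.StringIO(">seq1 first\nATGC\nGGTA\n\n>seq2 second\nTTAA\n")
records = list(read_fasta(example))
for header, sequence in records:
    print(header.split()[0], len(sequence))
```

To inspect the real download, the same function could be applied to an open file handle, for example `with open("BRCA1.fasta") as handle: records = list(read_fasta(handle))`.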


### Perform multiple sequence alignment (MSA) on the FASTA file
Now that we have our sequences downloaded, we can use Biopython to perform a multiple sequence alignment. But first, what is an MSA?

#### MSA
MSAs are alignments of three or more sequences of nucleotides or amino acids. In most cases, the sequences used as inputs to the MSA are assumed to be related (homologous). From the resulting MSA we can infer homology and perform phylogenetic analysis. In essence, we take all of the input sequences and find out which positions are the same across the whole group and which positions in a particular sequence differ from the rest.

Example using amino acids:
seq 1 = M Q P I L L P   ; aligned ==> M Q P I L L P
seq 2 = M L R L P       ; aligned ==> M L R - L - P
seq 3 = M P V I L K P   ; aligned ==> M P V I L K P
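
The idea of finding which positions agree across the whole group can be sketched in a few lines of plain Python, using the three aligned example sequences above. The function name `conservation_line` and the choice of `*` and `.` as markers are assumptions made for illustration.

```python
# The aligned example sequences from above, written without spaces
aligned = {
    "seq 1": "MQPILLP",
    "seq 2": "MLR-L-P",
    "seq 3": "MPVILKP",
}


def conservation_line(sequences, gap="-"):
    """Mark each column '*' if every sequence has the same residue, else '.'."""
    if len({len(seq) for seq in sequences}) != 1:
        raise ValueError("aligned sequences must all have the same length")
    marks = []
    for column in zip(*sequences):  # iterate over alignment columns
        residues = set(column)
        conserved = len(residues) == 1 and gap not in residues
        marks.append("*" if conserved else ".")
    return "".join(marks)


for name, seq in aligned.items():
    print("{:<6} {}".format(name, seq))
print("{:<6} {}".format("", conservation_line(list(aligned.values()))))
```

Here columns 1, 5, and 7 (M, L, and P) are identical in all three sequences, while the remaining columns contain a substitution or a gap.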

In broad terms, MSA programs build candidate alignments and score them based on how well the aligned residues match. The alignment with the best score is then returned as the output MSA. Algorithms and scoring matrices for MSAs can get complicated, so we'll ignore the details for now. Just know that we need to align the raw sequences for further analysis and that programs exist to do just that.
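
To give a feel for what scoring an alignment means, here is a toy sum-of-pairs score: every pair of residues in every column earns a reward for a match and a penalty for a mismatch or a gap. The scoring values (+1, -1, -2) and the function name `sum_of_pairs` are assumptions chosen for illustration. Real programs use substitution matrices and more elaborate gap penalties, and this is not necessarily how any particular program computes its score.

```python
from itertools import combinations


def sum_of_pairs(sequences, match=1, mismatch=-1, gap_penalty=-2, gap="-"):
    """Score an alignment by summing over all residue pairs in each column."""
    total = 0
    for column in zip(*sequences):
        for a, b in combinations(column, 2):
            if a == gap and b == gap:
                continue              # two gaps facing each other are ignored
            elif a == gap or b == gap:
                total += gap_penalty  # residue aligned against a gap
            elif a == b:
                total += match
            else:
                total += mismatch
    return total


# Two candidate alignments of the same three example sequences
alignment_a = ["MQPILLP", "MLR-L-P", "MPVILKP"]  # gaps placed inside seq 2
alignment_b = ["MQPILLP", "MLRLP--", "MPVILKP"]  # gaps pushed to the end

print(sum_of_pairs(alignment_a))
print(sum_of_pairs(alignment_b))
```

Under this toy scheme the first alignment scores higher than the second, so it would be the one kept.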

#### Using MUSCLE for multiple sequence alignment
MUSCLE is just one of many programs that can perform MSA. It is a standalone executable, so it is typically downloaded and then run from the command line like so (this is the MUSCLE 3 syntax; newer MUSCLE releases use different option names):
`muscle -in raw_fasta_files.fa -out aligned_fasta_files.afa`

But we can also use Python to call MUSCLE and perform the alignment within our script:

In [6]:
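from Bio.Align.Applications import MuscleCommandline

# Build the MUSCLE command line; this only constructs the command, it does not run it
command = MuscleCommandline(input='BRCA1.fasta', out="aligned.afa")
print(command)

# Calling the object runs MUSCLE, which must be installed and on the PATH
stdout, stderr = command()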

muscle -in BRCA1.fasta -out aligned.afa
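
Newer Biopython releases deprecate the command-line wrappers in `Bio.Align.Applications`, so an alternative is to call the executable through Python's standard `subprocess` module. The sketch below is one way this might look. The helper names `muscle_command` and `run_muscle` are hypothetical, the MUSCLE 3 options come from the command shown above, and the MUSCLE 5 options (`-align`, `-output`) are given on the assumption that version 5 is installed; check `muscle -h` for your own install.

```python
import shutil
import subprocess


def muscle_command(infile, outfile, major_version=3, executable="muscle"):
    """Build the argument list for a MUSCLE run (hypothetical helper)."""
    if major_version >= 5:
        # Assumed MUSCLE 5 syntax; verify against your installed version
        return [executable, "-align", infile, "-output", outfile]
    # MUSCLE 3 syntax, as used on the command line in this tutorial
    return [executable, "-in", infile, "-out", outfile]


def run_muscle(infile, outfile, major_version=3):
    """Run MUSCLE, failing early if the executable cannot be found."""
    args = muscle_command(infile, outfile, major_version)
    if shutil.which(args[0]) is None:
        raise FileNotFoundError("muscle executable not found on PATH")
    subprocess.run(args, check=True)  # raises CalledProcessError on failure


# Show the command that would be run, without running it
print(" ".join(muscle_command("BRCA1.fasta", "aligned.afa")))
```

Calling `run_muscle("BRCA1.fasta", "aligned.afa")` would then perform the alignment, provided MUSCLE is installed.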
