# Big Data for Biologists: Decoding Genomic Function- Class 3

## How can we predict the protein product of a gene? More specifically, how can we predict the amino acid sequence of a protein made from an mRNA sequence transcribed from a gene?

## What are genomic coordinates and how are they used to designate the position of a gene in the human genome? 

##  Learning Objectives
***Students should be able to***
 <ol>
 <li><a href=#Aminoacidsequence>Use the genetic code to determine the amino acid sequence of the protein product for an mRNA protein coding sequence </a></li>
 <li><a href=#PythonDictionary>Make a Python dictionary for the genetic code (also called a look up table)</a> </li>
 <li><a href=#DefineFunction>Define and call a function in a Python script </a></li>
 <li><a href=#PredictProteinSequence>Predict a protein sequence from a processed mRNA sequence using a Python dictionary</a></li>
 <li><a href=#SaveFunction>Save functions to a .py file so they can be used in other programs </a></li> 
  <li><a href=#ReferenceGenome>Explain what a reference genome is </a></li>
 <li><a href=#GenomicCoordinates>Explain how genomic coordinates are used to designate the position of a gene or feature in the human reference genome </a></li>
 <li><a href=#BEDformat>Use genomic coordinates to make a file in BED format </a></li>
 <li><a href=#makeFASTAfromBED>Use the genomic analysis package BEDtools to obtain a protein coding sequence from the human reference genome</a></li>
 </ol>



## How can we use the genetic code to determine the amino acid sequence of the protein product made from an mRNA sequence? <a name='Aminoacidsequence' />

In the last class, we wrote code to transcribe DNA to pre-mRNA and concluded by finding the start and stop codons in an mRNA sequence. Today we are going to look at the next step in gene expression, the translation of an mRNA sequence into protein. 

<img src="../Images/3-Translation.png" style="width: 70%; height: 75%" align="center"/>


Looking in a little more detail at the mRNA sequence, the region that gets translated into protein is called the **coding sequence**. The coding sequence is flanked by the 5' untranslated region at one end and the 3' untranslated region at the other. 

<img src="../Images/3-CDS.png" style="width: 80%; height: 90%" align="center"/>


As a reminder, during translation, every three bases in an mRNA sequence, beginning with the start codon, code for one amino acid. These three-base sequences are called codons. 

<img src="../Images/3-Translation Codons.png" style="width: 80%; height: 90%" align="center"/>
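As a small illustration of reading a sequence codon by codon, the sketch below splits a short, made-up mRNA string into three-base pieces using slicing. The sequence and the variable names are hypothetical and are chosen only for this example.

```python
# A made-up mRNA fragment (not the insulin sequence), used only for illustration
example_mrna = 'AUGAAAGGGUAA'

# Step through the string three bases at a time and collect each codon
codons = []
for i in range(0, len(example_mrna), 3):
    codons.append(example_mrna[i:i+3])

print(codons)
```

Each slice `example_mrna[i:i+3]` starts at position i and stops just before position i+3, so it always covers exactly one codon.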

The start codon, as we saw previously, is AUG (ATG in the DNA sequence), which codes for the amino acid methionine. Below is the **genetic code** for the rest of the amino acids as well as the three stop codons. 

<img src="../Images/3-Genetic Code.png" style="width: 60%; height: 70%" align="center"/>

In the final step of our last class we found three possible combinations of start and stop codons: 

* start codon: 60 stop codon: 390 orf length: 330
* start codon: 72 stop codon: 390 orf length: 318
* start codon: 442 stop codon: 448 orf length: 6

The actual combination of start and stop codons is often the combination that results in the longest sequence, but the true start and stop codons need to be validated experimentally. 

In this case, the experimentally validated start codon for human insulin, which is reported in the NCBI database [here](https://www.ncbi.nlm.nih.gov/nuccore/NM_000207.2), is at position 60. 

Here are the first twelve bases of the insulin mRNA sequence from the previous class, starting at position 60: 

AUGGCCCUGUGG 

As an exercise, write out the amino acid sequence corresponding to the mRNA sequence. 

The amino acid sequence should be: 


We are now going to learn additional Python tools to help us create the code to write out the amino acid sequence of a protein product that is made from an mRNA sequence.

## Making a Python dictionary for the genetic code <a name='PythonDictionary' />

The Python code we will be writing today has some similarities to scripts that we looked at in the last class. 

Last time, when we wrote out the complementary DNA sequence we made four substitutions and when we wrote out the mRNA sequence we made one substitution. 

To write out the amino acid sequence that will be produced from an mRNA sequence, we will need to make sixty-four substitutions! 

To simplify the code, we will use a Python tool known as a dictionary, or look-up table. 

Python dictionaries let you define a number of substitutions in one line rather than as a series of lines. 

There are a few different ways to define a dictionary in Python. If you are interested, this  [link](https://docs.python.org/2/library/stdtypes.html#typesmapping) gives the complete syntax options. 

Before we define the dictionary for protein translation, we are going to practice writing a dictionary that we could use to write out a complementary DNA sequence.

We'll use the same code as last time, but we will use a Python dictionary to do the substitutions instead of using if statements. 

The syntax for creating the dictionary is: 

DNAdict={'A':'T','T':'A','G':'C','C':'G'}

DNAdict is the name of the dictionary.

The letters before the colons are known as 'keys' and the letters after the colons are known as 'values'. 

In general, a dictionary is defined as follows: 

 a={key1:value1,key2:value2,key3:value3}
 
To look up a value in a dictionary, put its key in square brackets after the name of the dictionary: 

 a[key1] returns value1

 a[key2] returns value2

 a[key3] returns value3
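To make the key and value idea concrete, here is a minimal sketch that uses a made-up dictionary. The dictionary `base_names` and its contents are hypothetical and are not used elsewhere in the class.

```python
# Hypothetical dictionary mapping one-letter base symbols to base names
base_names = {'A': 'adenine', 'C': 'cytosine', 'G': 'guanine', 'T': 'thymine'}

# Look up a value by putting its key in square brackets
print(base_names['G'])

# Check whether a key is present in the dictionary
print('U' in base_names)

# Add a new key:value pair to the existing dictionary
base_names['U'] = 'uracil'
print(len(base_names))
```

Looking up a key that is not in the dictionary raises a KeyError, which is why it can be useful to check with `in` first.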
 
Given the information above:

What is DNAdict['A']?


In [None]:
#Write out the complementary sequence for a DNA sequence using a look-up table
#open the FASTA file; the with statement closes the file when the block ends
with open('../class_01_Gene_Sequences/data/Human-Insulin-NG_007114.1.txt','r') as FASTAgenesequence:
    genesequence=FASTAgenesequence.readlines()[1:] #skips the header line
genesequence=''.join(genesequence)
genesequence=genesequence.replace('\n','') #this line removes the \n values from the genesequence

#This line defines the substitutions that will be made when the dictionary is used
DNAdict={'A':'T','T':'A','G':'C','C':'G'}

complementarysequence='' #this defines the variable 'complementarysequence'
for i in genesequence:
    #this line looks up the base i in the dictionary and adds its complement to complementarysequence
    complementarysequence=complementarysequence+str(DNAdict[i])

#[::-1] reverses the complementary sequence, so this line prints the reverse complement
print (complementarysequence[::-1])
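The loop above raises a KeyError if the sequence contains any character that is not a key in the dictionary, such as a lower-case base or an N. One possible way to handle this, shown as a sketch only, is to convert the sequence to upper case and use the dictionary's `get` method with a default value. The function name `complement_with_lookup` and the toy sequence are hypothetical.

```python
# Same look-up table as in the cell above, repeated so the sketch stands alone
DNAdict = {'A': 'T', 'T': 'A', 'G': 'C', 'C': 'G'}

def complement_with_lookup(sequence):
    # Hypothetical helper: convert to upper case, then look up each base;
    # any character that is not A, T, G or C is written as 'N'
    return ''.join(DNAdict.get(base, 'N') for base in sequence.upper())

toy_sequence = 'ATGcNG'  # made-up sequence with a lower-case base and an N
toy_complement = complement_with_lookup(toy_sequence)
print(toy_complement)        # complement
print(toy_complement[::-1])  # reverse complement
```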

Using the space below, start writing a Python dictionary for the genetic code, starting with the four entries in the upper left-hand corner of the table in the <a href=#GeneticCode>figure above </a>. 

In [None]:
#Start writing a Python Dictionary for the first four entries in the upper left hand corner of the Genetic Code 
###BEGIN SOLUTION
###END SOLUTION

We will be using a complete dictionary for the genetic code below when we write out the protein sequence for an mRNA sequence, but first we will be covering one other tool for writing more complex scripts, creating functions. 

## Defining and calling functions in Python scripts <a name='DefineFunction' />

As you start to write more complex Python scripts, it is very helpful to define a **function** for any set of commands that accomplishes a particular task and is used repeatedly, either within a script or across scripts. 

Once the function is defined the series of commands can be run with one line of code rather than multiple lines. 

An example of a task that we could put into a function is reading in a nucleotide sequence from a FASTA file. 

We wrote the code to read in the nucleotides from a FASTA file in the first class. In this class we are going to define it as a function. 

Functions are defined using the **def** command followed by the name of the function and then any necessary inputs separated by commas. 

In the example below the following code: 

> <font color=green>def</font><font color=blue> read_nt_from_fastasequence(FASTAsequence)</font>:
 
defines a function called "read_nt_from_fastasequence". 

And the code: 

> insulin_DNA_sequence=read_nt_from_fastasequence('../class_01_Gene_Sequences/data/Human-Insulin NM_000207.2.txt')

calls the function "read_nt_from_fastasequence". 

Within the function, the input '../class_01_Gene_Sequences/data/Human-Insulin NM_000207.2.txt' is assigned the name 'FASTAsequence'.  

At the end, the "return" command defines the output of the function that is saved for the rest of the program. 

Note that the variable names defined within the function are not available outside of the function. We will look at this in more detail later in this section. 

In [None]:
#read in a nucleotide (DNA or RNA) sequence from a FASTA sequence

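def read_nt_from_fastasequence(FASTAsequence):
    #open the file; the with statement closes it automatically at the end of the block
    with open(FASTAsequence,'r') as fasta_file:
        #skip the first line, which is the FASTA header
        nt_sequence=fasta_file.readlines()[1:]
    nt_sequence=''.join(nt_sequence)
    nt_sequence=nt_sequence.replace('\n','')
    return(nt_sequence)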

insulin_DNA_sequence=read_nt_from_fastasequence('../class_01_Gene_Sequences/data/Human-Insulin NM_000207.2.txt')
print(insulin_DNA_sequence)
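Functions can take more than one input. As a minimal sketch of the def, input and return pattern described above, the hypothetical function below takes two inputs, a sequence and a base, and returns how many times the base occurs. The function name and the toy sequence are made up for this example.

```python
# Hypothetical function with two inputs, separated by a comma in the def line
def count_base(nt_sequence, base):
    # count how many times the chosen base occurs in the sequence
    number_of_bases = nt_sequence.count(base)
    return number_of_bases

# Call the function on a made-up sequence and store the returned value
toy_sequence = 'ATGGCCCTGTGG'
number_of_Gs = count_base(toy_sequence, 'G')
print(number_of_Gs)
```

The value after `return` is what gets stored in `number_of_Gs` when the function is called.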

An example of another task that we could put into a function is changing a DNA sequence to an RNA sequence by substituting uracils (Us) for thymines (Ts). 

We wrote the code to substitute uracils for thymines in the last class. Now we can define that as a function.

In [None]:
#write an RNA sequence from a DNA sequence

def write_RNA_from_DNA(DNAsequence):
    RNAsequence=DNAsequence.replace('T','U')
    return(RNAsequence)

#call the function to write an RNA sequence from a DNA sequence and print the sequence

###BEGIN SOLUTION
###END SOLUTION

Let's look carefully now at the variables in the function that we wrote above. In the example above, we defined the variable name RNAsequence within the function, and we used the variable insulin_RNA_sequence to store the result when we called the function. Try running the code in the box below.      

In [None]:
print(RNAsequence)

Why does the print(RNAsequence) command lead to an error message?

RNAsequence is what is known as a **local variable**. It is defined only within the write_RNA_from_DNA function. Once outside of the function the variable name is no longer stored. 

In contrast, insulin_DNA_sequence and insulin_RNA_sequence are examples of **global variables**. They are defined in the main body of the code and are stored in the notebook until they are deleted or changed. 

In [None]:
print(insulin_RNA_sequence)
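The difference between local and global variables can also be seen in a stand-alone sketch. The names below (`write_RNA_from_DNA_demo`, `RNAsequence_demo`, `toy_RNA`) are hypothetical and are used so that the example does not depend on the cells above. The try and except lines catch the error so that the cell finishes running.

```python
def write_RNA_from_DNA_demo(DNAsequence):
    # RNAsequence_demo is a local variable: it exists only while the function runs
    RNAsequence_demo = DNAsequence.replace('T', 'U')
    return RNAsequence_demo

# toy_RNA is a global variable: it stores the value returned by the function
toy_RNA = write_RNA_from_DNA_demo('ATGGCC')
print(toy_RNA)

# Trying to use the local name outside the function raises a NameError
try:
    print(RNAsequence_demo)
except NameError as error:
    print('NameError:', error)
```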

## Predict a protein sequence from a processed RNA sequence using a Python dictionary<a name='PredictProteinSequence' />

We are now ready to write a third function to predict a protein sequence from a processed mRNA sequence, using the genetic code dictionary. 

Note that if we were starting from scratch, we would first need to call the functions defined above. 

The function has two main sections. 
1. Define a complete Python dictionary for the three-letter genetic code
2. Iterate over the RNAsequence three bases at a time and convert each codon to the corresponding amino acid defined in the dictionary. 

Complete the code below to write out the protein sequence for an mRNA coding sequence. 
 

In [None]:
#Write out the protein amino acid sequence from an mRNA sequence

def write_protein_from_RNA(RNAsequence):
                           
#defines the python dictionary for the three letter genetic code 
    geneticcode3let={'UUU':'Phe','UUC':'Phe','UUA':'Leu','UUG':'Leu',
     'CUU':'Leu','CUC':'Leu','CUA':'Leu','CUG':'Leu',
     'AUU':'Ile','AUC':'Ile','AUA':'Ile','AUG':'Met',
     'GUU':'Val','GUC':'Val','GUA':'Val','GUG':'Val',
     'UCU':'Ser','UCC':'Ser','UCA':'Ser','UCG':'Ser',
     'CCU':'Pro','CCC':'Pro','CCA':'Pro','CCG':'Pro',
     'ACU':'Thr','ACC':'Thr','ACA':'Thr','ACG':'Thr',
     'GCU':'Ala','GCC':'Ala','GCA':'Ala','GCG':'Ala',
     'UAU':'Tyr','UAC':'Tyr','UAA':'Stop','UAG':'Stop',
     'CAU':'His','CAC':'His','CAA':'Gln','CAG':'Gln',
     'AAU':'Asn','AAC':'Asn','AAA':'Lys','AAG':'Lys',
     'GAU':'Asp','GAC':'Asp','GAA':'Glu','GAG':'Glu',
     'UGU':'Cys','UGC':'Cys','UGA':'Stop','UGG':'Trp',
     'CGU':'Arg','CGC':'Arg','CGA':'Arg','CGG':'Arg',
     'AGU':'Ser','AGC':'Ser','AGA':'Arg','AGG':'Arg',
     'GGU':'Gly','GGC':'Gly','GGA':'Gly','GGG':'Gly'}

#translates the RNAsequence into protein 

#defines the string variable proteinseq
    proteinseq=''

#range command (start,stop(not included),step)

    for i in range(1,len(RNAsequence),3): 

Is the start of the for loop in the code above correct? Why or why not?

The code above writes out the protein sequence from the entire RNA sequence but does not take into account where the open reading frame is. 

The actual start site, based on what is reported in the NCBI database, is base 60 (1-based numbering). 

The end of the last codon (including the stop codon) is base 392 (also 1-based numbering).   

How could we change the code above to write out the protein sequence only from the start codon to the stop codon?  


In [None]:
#Write out the protein amino acid sequence from an mRNA sequence

def write_protein_from_RNA(RNAsequence):
                           
#defines the python dictionary for the three letter genetic code 
    geneticcode3let={'UUU':'Phe','UUC':'Phe','UUA':'Leu','UUG':'Leu',
     'CUU':'Leu','CUC':'Leu','CUA':'Leu','CUG':'Leu',
     'AUU':'Ile','AUC':'Ile','AUA':'Ile','AUG':'Met',
     'GUU':'Val','GUC':'Val','GUA':'Val','GUG':'Val',
     'UCU':'Ser','UCC':'Ser','UCA':'Ser','UCG':'Ser',
     'CCU':'Pro','CCC':'Pro','CCA':'Pro','CCG':'Pro',
     'ACU':'Thr','ACC':'Thr','ACA':'Thr','ACG':'Thr',
     'GCU':'Ala','GCC':'Ala','GCA':'Ala','GCG':'Ala',
     'UAU':'Tyr','UAC':'Tyr','UAA':'Stop','UAG':'Stop',
     'CAU':'His','CAC':'His','CAA':'Gln','CAG':'Gln',
     'AAU':'Asn','AAC':'Asn','AAA':'Lys','AAG':'Lys',
     'GAU':'Asp','GAC':'Asp','GAA':'Glu','GAG':'Glu',
     'UGU':'Cys','UGC':'Cys','UGA':'Stop','UGG':'Trp',
     'CGU':'Arg','CGC':'Arg','CGA':'Arg','CGG':'Arg',
     'AGU':'Ser','AGC':'Ser','AGA':'Arg','AGG':'Arg',
     'GGU':'Gly','GGC':'Gly','GGA':'Gly','GGG':'Gly'}

###BEGINSOLUTION fill in the brackets to write out the correct protein sequence
#translates the RNAsequence into protein 

#defines the string variable proteinseq
    proteinseq=''

#range command (start,stop(not included),step)

    for i in range(0,len(RNAsequence),3): 
 
        proteinseq=proteinseq+str(geneticcode3let[RNAsequence[]])

    return (proteinseq)

#call the write_protein_from RNA sequence
insulin_protein_sequence=write_protein_from_RNA(insulin_RNA_sequence[])

#print the protein sequence
print(insulin_protein_sequence)
###ENDSOLUTION

In the code below, we have started a script to write out the same protein sequence using one-letter amino acid symbols. Add in the missing lines to write out the protein sequence using the one-letter amino acid symbols. 

This time change the numbers to leave off the stop codon. 

In [None]:
#Write out the protein 1-letter amino acid from an mRNA sequence

def write_protein_1_letter_aa_from_RNA(RNAsequence):

#defines the python dictionary for the one letter genetic code 
    geneticcode1let={'UUU':'F','UUC':'F','UUA':'L','UUG':'L',
     'CUU':'L','CUC':'L','CUA':'L','CUG':'L',
     'AUU':'I','AUC':'I','AUA':'I','AUG':'M',
     'GUU':'V','GUC':'V','GUA':'V','GUG':'V',
     'UCU':'S','UCC':'S','UCA':'S','UCG':'S',
     'CCU':'P','CCC':'P','CCA':'P','CCG':'P',
     'ACU':'T','ACC':'T','ACA':'T','ACG':'T',
     'GCU':'A','GCC':'A','GCA':'A','GCG':'A',
     'UAU':'Y','UAC':'Y','UAA':'*','UAG':'*',
     'CAU':'H','CAC':'H','CAA':'Q','CAG':'Q',
     'AAU':'N','AAC':'N','AAA':'K','AAG':'K',
     'GAU':'D','GAC':'D','GAA':'E','GAG':'E',
     'UGU':'C','UGC':'C','UGA':'*','UGG':'W',
     'CGU':'R','CGC':'R','CGA':'R','CGG':'R',
     'AGU':'S','AGC':'S','AGA':'R','AGG':'R',
     'GGU':'G','GGC':'G','GGA':'G','GGG':'G'}

###BEGINSOLUTION edit the code to write out the 1 letter amino acid sequence
#defines the string variable proteinseq
    proteinseq=''

#range command (start,stop(not included),step)

    for i in range(0,len(RNAsequence),3):  
        proteinseq=proteinseq+str(geneticcode1let[RNAsequence[i:i+3]])
    return (proteinseq)
###END SOLUTION


###BEGINSOLUTION
#call the write_protein_1_letter_aa_from_RNA function and print the 1-letter amino acid sequence
insulin_1letter_protein_sequence=write_protein_1_letter_aa_from_RNA(insulin_RNA_sequence[59:389])
print(insulin_1letter_protein_sequence)
###END SOLUTION 
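Instead of typing in the position of the stop codon, a function could stop translating as soon as it reaches a stop codon. The sketch below shows one possible way to do this. The function name `translate_until_stop` is hypothetical, and `toy_code` is a deliberately partial look-up table that contains only the codons needed for the made-up sequence.

```python
# Partial one-letter look-up table: only the codons needed for this toy example
toy_code = {'AUG': 'M', 'AAA': 'K', 'GGG': 'G', 'CCC': 'P', 'UAA': '*'}

def translate_until_stop(RNAsequence, codon_table):
    # Hypothetical helper: translate codon by codon and stop at the first stop codon
    proteinseq = ''
    # stopping the range at len-2 means an incomplete codon at the end is never looked up
    for i in range(0, len(RNAsequence) - 2, 3):
        amino_acid = codon_table[RNAsequence[i:i+3]]
        if amino_acid == '*':
            break
        proteinseq = proteinseq + amino_acid
    return proteinseq

toy_protein = translate_until_stop('AUGAAAGGGUAACCC', toy_code)
print(toy_protein)
```

With a complete dictionary such as geneticcode1let, the same loop would only need the position of the start codon.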

## Save functions to a .py file that can be imported into other programs<a name='SaveFunction' />

Once you have written a function or set of functions, it is helpful to be able to save the function(s) in a format that you can use in other scripts. 

In Python, a file with a set of functions is called a **module**. Module files are saved with the extension .py, and their functions can be used in other Python scripts with the import command. 
 
In the course of this class, you will learn about the vast resources of .py files that are publicly available and that you can use to view or analyze sequences or data without having to write algorithms from scratch. 

Here, you are going to make your own module. 

We will save three functions, read_nt_from_fastasequence, write_RNA_from_DNA and write_protein_1_letter_aa_from_RNA, to a .py file called central_dogma_helpers.py.  

The first line defines the name of the .py file and provides the instructions to write the contents of the box to a file. We are going to store the file in a directory called helpers that we will use for saving modules throughout the class. 

We already copied the first two functions, read_nt_from_fastasequence and write_RNA_from_DNA, for you into the box. 

Fill in the code for the third function, write_protein_1_letter_aa_from_RNA. You will need the code that you wrote above as well as your knowledge about how to format a function. 

In [None]:
%%writefile ../helpers/central_dogma_helpers.py

#read in a nucleotide (DNA or RNA) sequence from a FASTA sequence

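def read_nt_from_fastasequence(FASTAsequence):
    #open the file; the with statement closes it automatically at the end of the block
    with open(FASTAsequence,'r') as fasta_file:
        #skip the first line, which is the FASTA header
        nt_sequence=fasta_file.readlines()[1:]
    nt_sequence=''.join(nt_sequence)
    nt_sequence=nt_sequence.replace('\n','')
    return(nt_sequence)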

#write an RNA sequence from a DNA sequence
def write_RNA_from_DNA(DNAsequence):
    RNAsequence=DNAsequence.replace('T','U')
    return(RNAsequence)

#Write out the protein 1-letter amino acid from an mRNA sequence
def write_protein_1_letter_aa_from_RNA(RNAsequence):

#defines the python dictionary for the one letter genetic code 
    geneticcode1let={'UUU':'F','UUC':'F','UUA':'L','UUG':'L',
     'CUU':'L','CUC':'L','CUA':'L','CUG':'L',
     'AUU':'I','AUC':'I','AUA':'I','AUG':'M',
     'GUU':'V','GUC':'V','GUA':'V','GUG':'V',
     'UCU':'S','UCC':'S','UCA':'S','UCG':'S',
     'CCU':'P','CCC':'P','CCA':'P','CCG':'P',
     'ACU':'T','ACC':'T','ACA':'T','ACG':'T',
     'GCU':'A','GCC':'A','GCA':'A','GCG':'A',
     'UAU':'Y','UAC':'Y','UAA':'*','UAG':'*',
     'CAU':'H','CAC':'H','CAA':'Q','CAG':'Q',
     'AAU':'N','AAC':'N','AAA':'K','AAG':'K',
     'GAU':'D','GAC':'D','GAA':'E','GAG':'E',
     'UGU':'C','UGC':'C','UGA':'*','UGG':'W',
     'CGU':'R','CGC':'R','CGA':'R','CGG':'R',
     'AGU':'S','AGC':'S','AGA':'R','AGG':'R',
     'GGU':'G','GGC':'G','GGA':'G','GGG':'G'}

#defines the string variable proteinseq
    proteinseq=''

#range command (start,stop(not included),step)

    for i in range(0,len(RNAsequence),3): 
        proteinseq=proteinseq+str(geneticcode1let[RNAsequence[i:i+3]])
    return (proteinseq)

You should now be able to see a ../helpers/central_dogma_helpers.py file. Take a look at the file and make sure you see what you would expect. 

In [None]:
ls ../helpers/
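To show how a saved module can be used from another script, here is a self-contained sketch. It writes a tiny stand-in module to a temporary directory and then imports a function from it. The file name `toy_dogma_helpers.py` is hypothetical and is used only so that the example runs anywhere.

```python
import os
import sys
import tempfile

# Write a tiny stand-in module to a temporary directory (hypothetical file name)
module_directory = tempfile.mkdtemp()
module_path = os.path.join(module_directory, 'toy_dogma_helpers.py')
with open(module_path, 'w') as module_file:
    module_file.write("def write_RNA_from_DNA(DNAsequence):\n")
    module_file.write("    return DNAsequence.replace('T','U')\n")

# Tell Python where to look for the module, then import the function from it
sys.path.append(module_directory)
from toy_dogma_helpers import write_RNA_from_DNA

toy_RNA = write_RNA_from_DNA('ATGGCC')
print(toy_RNA)
```

For the file saved above, the same pattern would presumably be `sys.path.append('../helpers')` followed by `from central_dogma_helpers import write_RNA_from_DNA`, assuming the script is run from a directory that sits next to the helpers directory.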

## What is a reference genome? <a name='ReferenceGenome' />
 
Now that you have learned some tools for how to work with single DNA, RNA and protein sequences, we are going to start to learn about genomics data. 

We will be working with data from the Human Genome Project as well as larger scale sequencing projects such as the 1000 Genomes Project.  

The Human Genome Project produced what is called a human **reference genome**, a publicly available, mostly complete sequence of the human genome that the scientific community agreed to use as a basis for comparison for new sequencing information. 

The sequence is "mostly complete" because some regions are difficult to sequence. The human genome still has some gaps.  

The initial reference genome was made from the sequences of a small number of individuals. Now, with more individuals having been sequenced, the reference genome captures more, but still not all, of human genetic diversity. 

Researchers are still updating the reference genome. It is maintained by a consortium. You can find out more details [here](https://www.ncbi.nlm.nih.gov/grc/human).  

As an introduction to working with data from the Human Genome, we are going to look at how to find the sequence for human insulin that we have been looking at in the human reference genome.


## What are genomic coordinates and how are they used to designate the position of a gene in the human genome?<a name='GenomicCoordinates' />

The positions of genes in the human reference genome are specified by their **genomic coordinates**. 

The genomic coordinates for a gene include the chromosome, the start position and the end position. 

Whenever you are using genomic coordinates, it's important to also keep track of the version of the reference genome, because these numbers change as the reference genome sequence is updated. 

Human genomes typically have 23 pairs of chromosomes. Reference genomes are typically haploid, meaning that they have sequence information for one copy of each chromosome, but not for separate pairs. 

<img src="../Images/3-Human Chromosomes.jpg" style="width: 40%; height: 50%" align="center"/>
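One way to keep the genome version together with the coordinates is to store them in a single record. The sketch below uses a namedtuple from the Python standard library. The record type `GenomicPosition`, its field names and the coordinates are all made up for illustration.

```python
from collections import namedtuple

# Hypothetical record type: the field names are chosen for this example only
GenomicPosition = namedtuple('GenomicPosition', ['assembly', 'chromosome', 'start', 'end'])

# Made-up coordinates, used only to show the structure
toy_feature = GenomicPosition(assembly='hg19', chromosome='chr1', start=1000, end=1500)

print(toy_feature.chromosome, toy_feature.start, toy_feature.end)

# The same numbers paired with a different assembly describe a different position,
# so two records are equal only when the assembly matches as well
same_numbers_other_assembly = toy_feature._replace(assembly='hg38')
print(toy_feature == same_numbers_other_assembly)
```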


## Use genomic coordinates to make a file in BED format<a name='BEDformat' />


The format for specifying genomic coordinates depends on the particular programs that you plan to use. There are several formats including BED, VCF, GFF/GTF. 

For this class, we will primarily be using a suite of programs called BEDtools, so we will be using the BED format for genomic coordinates. 

BED files are text files that contain genomic coordinate information, and are typically given the .bed extension. The format of a BED file is specified [here](https://genome.ucsc.edu/FAQ/FAQformat.html#format1). 

Only the first three columns are mandatory for BED files, but we've listed the first six columns here. They contain the following information:

Columns in a BED file:

- Column 1: **chromosome** (written with a chr prefix in the files used here, e.g., chr11; the sex chromosomes are chrX and chrY)  
<br>
- Column 2: **start** position (the beginning of the first base is indicated by the start position "0"; the beginning of the 5th base is indicated by the start position "4")  
<br>
- Column 3: **end** position (the end of the first base is indicated by the end position "1"; the end of the 5th base is indicated by the end position "5")  
<br>
- Column 4: **name** defines the BED feature, e.g., the exon number  
<br>
- Column 5: **score**, e.g., a quality score for sequencing data. We'll hear more about this later.  
<br>
- Column 6: **strand**, which can be either '+' or '-'

People are often confused by the fact that the same "base" is referred to by a different number depending on whether you are referring to the start or the end. A simple way to understand this is to realize that the positions are not referring to the numbering of the bases themselves, but to the boundary between bases, as illustrated in the figure below:

<img src="./array_slice_indexing.png">

As a reminder from our previous class, this convention is also consistent with how slicing in python works, as illustrated below:

In [None]:
dna_string = "ACCTG"
print(dna_string[0:4])
print(dna_string[1:5])
print(dna_string[0:5])

We can use Python to make a file in BED format. 

For this example, we are going to make a .bed file that allows us to print the coding sequence for human insulin from a file containing the whole human genome in FASTA format! 

Originally, the coding sequence would be determined experimentally, but we can obtain it now from genomics resources such as this [link](https://www.ncbi.nlm.nih.gov/variation/view/) from NCBI.

The coding sequence boundaries for the human insulin gene, in 1-based numbering, in the version of the human genome that we will be using (hg19, also called GRCh37) are: 

* 2,181,082 to 2,181,227
* 2,182,015 to 2,182,201

Insulin is on the (-), or reverse, strand. 
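Converting 1-based coordinates to BED coordinates only changes the start position: the start is reduced by one and the end stays the same. The sketch below shows this conversion for made-up coordinates. The function name `one_based_to_bed_line` and the feature name are hypothetical.

```python
def one_based_to_bed_line(chromosome, first_base, last_base, name, score, strand):
    # Hypothetical helper. In 1-based numbering both ends are included, so
    # the BED start is first_base - 1 and the BED end is last_base, unchanged
    bed_start = first_base - 1
    bed_end = last_base
    fields = [chromosome, str(bed_start), str(bed_end), name, str(score), strand]
    # BED columns are separated by tabs
    return '\t'.join(fields)

# Made-up coordinates: bases 5 through 10 of a chromosome called chr1
toy_line = one_based_to_bed_line('chr1', 5, 10, 'toy_feature', 0, '+')
print(toy_line)
```

Bases 5 through 10 are six bases, and the BED end minus the BED start (10 - 4) gives the same length.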

In [None]:
#Note that the exon locations have been adjusted to the zero-based numbering system described above
#\t is a tab character, and \n is a newline character
file1 = open('human_insulin_cds_boundaries.bed', 'w') #creates a file with the name "human_insulin_cds_boundaries.bed"

#Coordinates are for GRCh37 reference genome 
file1.write("chr11\t2181081\t2181227\tcds\t1\t-\n") 
file1.write("chr11\t2182014\t2182201\tcds\t1\t-\n")

file1.close()

Let's view the lines in the created file:

In [None]:
#the with statement closes the file automatically after it has been read
with open('human_insulin_cds_boundaries.bed', 'r') as file1:
    file1_contents=file1.read()
print(file1_contents)
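As a quick check on the coordinates, the sketch below splits each BED line into its columns and adds up the lengths of the two features. To keep the example self-contained, the two lines written to the file above are held in a string rather than read from the file; the variable names are made up for this example.

```python
# The two lines written to the BED file above, held in a string for this sketch
bed_text = "chr11\t2181081\t2181227\tcds\t1\t-\nchr11\t2182014\t2182201\tcds\t1\t-\n"

total_length = 0
for line in bed_text.splitlines():
    # split each line into its tab-separated columns
    chromosome, start, end, name, score, strand = line.split('\t')
    # with BED numbering, the length of a feature is simply end - start
    feature_length = int(end) - int(start)
    total_length = total_length + feature_length
    print(chromosome, start, end, strand, feature_length)

print(total_length)
```

The total is 333 bases, or 111 codons, which matches the number of bases from position 60 through position 392 of the mRNA (the coding sequence including the stop codon).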

## Use the genomic analysis package BEDtools to obtain a protein coding sequence from the human reference genome<a name='makeFASTAfromBED' />


We are now going to use the BEDtools toolbox to obtain the sequence of the coding regions of the human insulin gene from the human reference genome. 

We can use the output of BEDtools to make the coding sequence by pasting the exon sequences together using Python. 

BEDtools provides the getfasta command to extract the FASTA sequence for a specific set of chromosome coordinates.

The headers in the input FASTA file must contain the chromosome names that are used in the BED file.  

For details on the syntax of the command see ["getfasta"](http://bedtools.readthedocs.io/en/latest/content/tools/getfasta.html)

The syntax for the full command is:
!bedtools getfasta 

A shortcut is: 
!fastaFromBed

The command requires an input FASTA file, a BED file containing your regions of interest, and an output FASTA file name. The reference file in our case is hg19.genome.fa, which contains all DNA bases in the hg19 version of the human genome. The hg19 sequence can be downloaded from the [UCSC Genome Browser](https://genome.ucsc.edu/). In our environment, the file is stored in the /data directory:

In [None]:
!ls /data 

In [None]:
!cat  /data/hg19.genome.fa | head -n250

In [None]:
## first, the default behavior: 
!bedtools getfasta -fi /data/hg19.genome.fa -bed human_insulin_cds_boundaries.bed -fo human_insulin_cds.fa.out  
#examine the output
!cat human_insulin_cds.fa.out

We can use the startswith string method in Python to take out the header lines, which start with '>'. 
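As a minimal sketch of how startswith can be used for this, the code below filters a made-up list of FASTA lines. The list `toy_fasta_lines` stands in for lines read from a file and does not contain real sequence data.

```python
# Made-up FASTA text with two records, standing in for lines read from a file
toy_fasta_lines = ['>chr1:4-10\n', 'ACGTAC\n', '>chr1:20-24\n', 'GGCC\n']

sequence_lines = []
for line in toy_fasta_lines:
    # keep only the lines that are not headers
    if not line.startswith('>'):
        # strip() removes the newline character at the end of the line
        sequence_lines.append(line.strip())

print(sequence_lines)
```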

Optional, if we have time: Try combining what you have learned to write a script that uses the output above to write a single FASTA sequence.  

In [None]:
!cat human_insulin_cds.fa.out

In [None]:
#opens the FASTA sequences output from the bedtools command

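#the with statement closes the file automatically after it has been read
with open('human_insulin_cds.fa.out','r') as file1:
    file1_contents=file1.readlines()
#line 0 is the first header line, so line 1 is the first sequence
print(file1_contents[1])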

###BEGINSOLUTION
###END SOLUTION

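If you compare the sequence you made above to the mRNA sequence that we got from NCBI in the last class, it should be the same as the coding sequence portion of that mRNA, from the start codon through the stop codon. Because insulin is on the reverse strand, this requires joining the exon sequences in the right order and taking the reverse complement.  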
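By default, bedtools getfasta returns the forward-strand sequence for each interval (the `-s` option asks it to take the strand column into account instead). For a reverse-strand gene, the forward-strand pieces therefore need to be joined and reverse-complemented before they read like the mRNA. The sketch below shows the idea on made-up pieces; the sequences and the function name `reverse_complement` are hypothetical.

```python
DNAdict = {'A': 'T', 'T': 'A', 'G': 'C', 'C': 'G'}

def reverse_complement(sequence):
    # complement each base with the look-up table, then reverse the result
    complement = ''.join(DNAdict[base] for base in sequence)
    return complement[::-1]

# Made-up forward-strand pieces, listed in increasing genomic position
forward_strand_pieces = ['TTACCC', 'GGGCAT']

# Joining first and reverse-complementing once puts the piece with the highest
# coordinates first, which is the reading direction for a reverse-strand gene
joined = ''.join(forward_strand_pieces)
toy_coding_sequence = reverse_complement(joined)
print(toy_coding_sequence)
```

In this toy example, the result starts with ATG and ends with the stop codon TAA, which is a quick sanity check that the order and the strand are right.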