# Big Data for Biologists: Decoding Genomic Function- Class 1

## What is a gene and how can we read a DNA sequence into Python?

##  Learning Objectives
 ***Students should be able to***
 <ol>
 <li><a href=#dna>Describe the structure of DNA</a></li>
 <li><a href=#whatisagene>Explain what a gene is</a></li>
 <li><a href=#commandline>Identify the command line in a Jupyter notebook and enter commands</a></li>
 <li><a href=#workingdirectory>Find and set a working directory </a></li>
 <li><a href=#path>Determine the absolute and relative address (path) of a directory or file</a></li>
 <li><a href=#genesequence>Download a gene sequence from a genome database</a></li>
 <li><a href=#sequencelength>Read a gene sequence using the Python programming language</a></li>
 <li><a href=#sequencelength>Determine the length of the gene sequence using Python</a></li>
 </ol>
 
**Note:** For additional background on DNA and the Central Dogma, see [Khan Academy video on DNA](https://www.khanacademy.org/science/biology/classical-genetics/molecular-basis-of-genetics-tutorial/v/dna-deoxyribonucleic-acid) or [Khan Academy video on Central Dogma](https://www.khanacademy.org/test-prep/mcat/biomolecules/amino-acids-and-proteins1/v/central-dogma-of-molecular-biology-2). 

Slides for lecture 1 of the course:

In [1]:
#DAY 1 SLIDES POSTED SEPARATELY
from IPython.display import IFrame

IFrame("https://docs.google.com/presentation/d/e/2PACX-1vSewhZ-q8odgDLaYDglBTyHpbUDU_82bHEEEUbeiortznYFiKyFrjmIPl8CEuPgn1dqUpmjTfjK7qz_/embed?start=false&loop=false",height=749,width=960)

## What is DNA? <a name='dna' />

If you open a science news website, there's a good chance that you'll find an article about the latest study involving genes. 
 
But what is a gene? 

To define the term gene, we first need to define Deoxyribonucleic acid (DNA). 

**DNA** is the molecule in cells that enables the transmission of genetic information from one generation to the next. 

Pioneering work by James Watson, Francis Crick and Rosalind Franklin in the 1950s revealed the structure of DNA and laid the foundations for elucidating the mechanism by which DNA can be reliably replicated when one cell divides to become two. 

DNA is made up of units (or monomers) called nucleotides. Each nucleotide consists of a sugar, a phosphate group and one of four bases: 

* **Adenine (A)**   
* **Cytosine (C)** 
* **Guanine (G)** 
* **Thymine (T)**   


The figure below shows the structure of a nucleotide (top) and how the base pairs come together (bottom). The base pairs are held together by a type of bond called hydrogen bonds.  

* **Adenine (A) pairs with Thymine (T)**
* **Guanine (G) pairs with Cytosine (C)**

<br>

<img src="../Images/1-DNAMonomer.png" style="width: 65%; height: 20%" align="center"/>


<br>
<br>
<br>

Inside a cell, the DNA nucleotides link together in two strands and form a double helix (see figures below). 

Note that the end of a DNA strand with the phosphate group is called the 5' end and the end with the OH group is called the 3' end. We'll talk more about the directionality of the DNA strands in later classes. 

<br>

<img src="../Images/1-DNABuildingBlocks.png" style="width: 85%; height: 20%" align="center"/>


<img src="../Images/1-DNAHelix.png" style="width: 75%; height: 20%" align="center"/>


In [2]:
#Press SHIFT+ENTER to view a cool video about the structure of DNA!
from IPython.display import HTML

HTML('<iframe width="560" height="315" \
       src="https://www.youtube.com/embed/o_-6JXLYS-k" frameborder="0" allowfullscreen></iframe>')




## **What is a gene**? <a name='whatisagene' />

**Genes** are segments of DNA that code for RNA and proteins, which are critical molecules for carrying out cellular function. 

RNA that is transcribed from DNA but does not get translated into protein is referred to as **non-coding RNA**.  

There are approximately 21,000 protein-coding genes in the human genome and an estimated 15,000-21,000 non-protein coding genes ([Willyard, Nature 558, 354-355 (2018)](https://www.nature.com/articles/d41586-018-05462-w)).

Unraveling the function of non-coding DNA and RNA is an active area of research. You will learn a lot more about non-coding DNA later in this class. 


<img src="../Images/1-GeneExpression.png" style="width: 80%; height: 80%" align="center">


## **Using the command line** <a name='commandline' />

Being able to work with DNA sequences in a computer program is a core skill that you will learn in this class.  

Before we can start using a computer to work with DNA and gene sequences, we first need to get started with a programming language. For this course we will primarily use **Python**, which is one of the programming languages commonly used by biologists. 

We will also teach you **Unix** commands, which can be very helpful for navigating between directories (also called folders). 

Python is what is known as a "scripting language", which means that it is ready to use as soon as it is installed: you can start entering commands one at a time, or you can write a set of commands in what is known as **"a script"**. 

A lot of the principles we are introducing will also help you if you need to use other types of programs that are commonly used by biologists and in biomedical research such as **R or MATLAB**. 

Unlike the windows or graphical user interface (GUI) environment that you may be used to, Python and other scripting languages have what is known as a "command line".  

The **"command line"** is a location where you can enter code to give a computer program an instruction. 

When you are first starting, it is often helpful to use the command line. As you become more comfortable, you can assemble your code into scripts, as mentioned above, which run a series of commands all at once without the user having to enter each one. 

In programming classes one of the first commands that you often learn is to ask the computer to write out the phrase "Hello World". For example: 


In [1]:
print ('Hello World')

Hello World


Since this is a biology course, we are going to have the computer print out a short statement of the Central Dogma instead: DNA makes RNA makes Protein. 

Try running the command in the next box by clicking on the box and pressing Enter while holding down the Shift key. 


In [2]:
print(DNA makes RNA makes Protein)

SyntaxError: invalid syntax (<ipython-input-2-28b11cef8a3d>, line 1)

Why didn't that work? 

One of the first things that beginners at coding learn is that computers are very literal. Details such as quotation marks, spaces, and upper case versus lower case letters all matter. 

The first lesson from this exercise is to pay attention to quotes, spaces and caps when you are coding. 

Fortunately, many coding interfaces help give you feedback (or error messages) to help you debug your code. 

In the box below, try editing the line from above to fix the errors. 


In [3]:
###BEGIN SOLUTION
###END SOLUTION

DNA makes RNA makes Protein
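
As a further illustration of how literal Python is, here is a minimal sketch using made-up strings (the variable names are only for illustration). It shows that text must be wrapped in matching quotes, that single and double quotes are interchangeable, and that upper case and lower case letters are treated as different characters.

```python
# Text must be wrapped in matching quotes; single and double quotes both work
bases_single = 'A, C, G, T'
bases_double = "A, C, G, T"

# The two strings hold the same text, so Python treats them as equal
same_text = (bases_single == bases_double)
print(same_text)

# Capitalization matters: 'DNA' and 'dna' are different strings
same_case = ('DNA' == 'dna')
print(same_case)
```

The first comparison prints True and the second prints False.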


## How do I find my working directory? <a name='workingdirectory' />

Scripts or computer programs can get very complex, but at its most basic, a computer program reads an input, does something with the input and then writes an output.

<img src="../Images/1-Working Directories.png" style="width: 55%; height: 30%" align="center"/>

The question in red in the figure above represents a key part of writing a functioning program. 

How you organize your input and output files is also essential for building well-organized functioning projects. We'll have some more tips on that later.  

You can think of your computer storage as a hierarchy with many branches, somewhat like this: 

<img src="../Images/FileHierarchy2.jpg" style="width: 50%; height: 33%" align="center"/>

When you start Python, the directory where you start the program becomes, by default, what is known as your working directory. 

The **working directory** is where a computer program will look for inputs and write outputs unless you instruct it otherwise. 

It is often helpful, or necessary, to identify the working directory. In Python, you can find the working directory that you are using with the series of commands below. 

In the commands, import os sets up a way to use functions that may depend on your operating system (i.e., whether you are using a Mac, a PC or some other environment). 

The getcwd command is "get current working directory". 

In [4]:
#import os allows the Jupyter notebook to interact with the operating system.

import os

#gets the current working directory
os.getcwd()

'/home/jovyan/humbio51/class_01_Gene_Sequences'

Note that the concept of a working directory is not unique to Python. In Unix you can get your working directory with the command pwd. To run Unix commands in a Python notebook you need to start the line with an !. 

In [5]:
!pwd

/home/jovyan/humbio51/class_01_Gene_Sequences


## How can I set my working directory using an absolute or relative path?<a name='path' />

The output of the os.getcwd() command gives what is known as the **absolute path** of the current working directory, or the exact location of the current working directory on the computer.  

When you are telling a program where to find inputs or to write outputs it is essential to be able to tell the computer where they are. You can do this by telling the computer either the absolute path or relative path. 

The **relative path** indicates where a file or directory is with respect to the working directory. 
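
To make the difference concrete, here is a minimal sketch using Python's os.path functions. The name 'data' is used only as an example of a relative path; the directory does not need to exist for this code to run.

```python
import os

# The working directory, reported as an absolute path
cwd = os.getcwd()

# An example relative path (it does not need to exist for this sketch)
relative_path = 'data'

# abspath turns a relative path into an absolute one using the working directory
absolute_path = os.path.abspath(relative_path)

print(os.path.isabs(relative_path))    # False: it only has meaning relative to cwd
print(os.path.isabs(absolute_path))    # True: it starts from the top of the file system
print(absolute_path == os.path.join(cwd, relative_path))
```

The last line confirms that the absolute path is simply the working directory followed by the relative path.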

A helpful shortcut for the path that goes up one directory is "..", which stands for the parent directory. 

For example, if you want to change your working directory to:

'/home/jovyan/humbio51/'

You could use the command with the absolute path: 
os.chdir('/home/jovyan/humbio51/')

Or, you can run the command below.    

In [6]:
import os

#changes directory back to the parent directory
#two periods('..') stands for the parent directory

os.chdir('..') 

print(os.getcwd())

/home/jovyan/humbio51


What do you think the command would be to go up two directories? 

If you want to go back down to a subdirectory, you can just give the name of the directory. 

For example to go from: 

'/home/jovyan/humbio51'

Back to: 

'/home/jovyan/humbio51/class_01_Gene_Sequences'

In [7]:
import os

os.chdir('class_01_Gene_Sequences')
print(os.getcwd())


/home/jovyan/humbio51/class_01_Gene_Sequences


It can also be helpful to list the names of the files that are in the working directory.

Try it below and see what's there. 

In [8]:
import os

#lists files in the current directory
#a single period (.)  stands for the current directory 

os.listdir('.')


['.DS_Store',
 'sequences',
 'Introduction_to_Jupyter_notebooks.ipynb',
 '1-Reading in a gene sequence.ipynb',
 'data',
 '_DS_Store',
 '.ipynb_checkpoints']
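
The list returned by os.listdir mixes files and directories together. As a small sketch (not part of the original exercise), the os.path.isdir function can be used to label each entry; the variable names here are only for illustration.

```python
import os

# Get the entries in the working directory, sorted alphabetically
entries = sorted(os.listdir('.'))

# Label each entry as a directory or a file
for name in entries:
    if os.path.isdir(name):
        print('directory:', name)
    else:
        print('file:', name)
```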

What command could you use if you want to change the directory to: 

/home/jovyan/humbio51/class_01_Gene_Sequences/data

Without looking below, try writing the code using both the absolute and relative paths in the boxes below. 

We've given you some guidance in comment lines which are denoted by a #. You'll see comment lines used a lot throughout the class. Adding comments is a helpful way to help you and others follow your code! 

In [9]:
#Write the commands to change the current working directory using relative paths
#Print the working directory so you can check your work. 
###BEGIN SOLUTION
###END SOLUTION

Now change your working directory using absolute paths back to /home/jovyan/humbio51/class_01_Gene_Sequences

In [10]:
#Write the commands to change the current working directory using absolute paths
#Print the working directory so you can check your work. 
###BEGIN SOLUTION
###END SOLUTION 

The basic navigation commands that we have introduced above also have similar commands in Unix. For example: 

* Change to a parent directory: cd .. 
* Change to sub-directory class_01_Gene_Sequences: cd class_01_Gene_Sequences  
* List files in directory: ls 

With these basic navigation commands we are now ready to apply what we've learned to an example from biology. 

For the rest of the class we are going to apply what we've covered to this point and show how to use it to read a gene sequence into a program and calculate the length of the sequence.

<img src="../Images/1-ReadWriteGeneSequence.png" style="width: 50%; height:70%" align="center"/>


## **How can I download a gene sequence from a genome database**?<a name='genesequence' />


The first step for our project today is finding a gene sequence. 

Many, many gene sequences have been collected in publicly available on-line databases from genome sequencing research projects or smaller scale research projects to determine the sequence of single genes or sets of genes.  

Three commonly used databases to obtain gene or genome sequence information are:
   
   [NCBI Gene](https://www.ncbi.nlm.nih.gov/gene)  
   [Ensembl](http://uswest.ensembl.org/index.html)  
   [UCSC Genome Browser](https://genome.ucsc.edu/)  

We aren't going to go into detail now about the differences between these three sites, but an important point is that there are several large scale collaborative efforts that have created very organized sites to collect genome sequences from the research community. 

These sites include genome sequences not only from humans (or, more technically, *Homo sapiens*) but also from organisms ranging from bacteria like *E. coli* to large animals like elephants (*Loxodonta africana*). 

To see what a gene sequence looks like in a genome browser, visit the entry in the NCBI Gene database for [human insulin](https://www.ncbi.nlm.nih.gov/nuccore/NG_007114.1?from=4986&to=6416&report=fasta)

If you have extra time, look for any gene of interest to see what you find. 

Computer programs may need gene sequences in a particular format.  You will want to make sure that the format for your input file matches the format that you need for the program. 

Today we will use "FASTA" format. You can download sequences in FASTA format directly from the NCBI database.   

In **FASTA format** the first line of the file starts with a > followed by an identifier describing the nucleotide sequence. You can see an example of what the FASTA sequence for a gene in the NCBI database looks like below. 

<img src="../Images/1-Gene Sequence.png" style="width: 60%; height: 40%" align="center"/>
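
The layout of a FASTA record can also be sketched in a few lines of Python. The record below is made up for illustration (it is not a real database entry): the first line starts with > and holds the identifier, and the remaining lines hold the nucleotides.

```python
# A tiny made-up FASTA record (not a real database entry)
fasta_text = """>example_sequence a made-up identifier line
ACGTACGTAC
GGCCTTAA
"""

# Split the text into separate lines
lines = fasta_text.splitlines()

header = lines[0]            # the first line holds the identifier
sequence_lines = lines[1:]   # the remaining lines hold the nucleotides

print(header.startswith('>'))
print(sequence_lines)
```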

To get the gene of interest into a file, you can copy it into a text file using a text editor and save the file. As you become more advanced and more comfortable with programming, you can also write code to directly instruct a computer program to access the web and "scrape" information from a website to load into a computer.

For this activity, we already did the following steps. 

1. Made a directory in the class_01_Gene_Sequences working directory called data. 
2. Visited the entry in the NCBI Gene database for [human insulin](https://www.ncbi.nlm.nih.gov/nuccore/NG_007114.1?from=4986&to=6416&report=fasta) 
3. Opened up a text editor and pasted the sequence into a file. 
4. Saved the human insulin sequence into a file called Human-Insulin-NG_007114.1.txt in the data directory. 
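
Steps 3 and 4 were done by hand in a text editor, but saving text to a file can also be done from Python. The sketch below is only an illustration under stated assumptions: it uses a made-up record and a hypothetical file name, and it writes into a temporary directory so that nothing in the course folder is changed.

```python
import os
import tempfile

# A made-up FASTA record standing in for the pasted sequence (not the real insulin entry)
record = '>example_sequence a made-up identifier line\nACGTACGTAC\nGGCCTTAA\n'

# Work inside a throwaway temporary directory
with tempfile.TemporaryDirectory() as scratch:
    file_path = os.path.join(scratch, 'example-sequence.txt')   # hypothetical file name

    # 'w' opens the file for writing; the with block closes it automatically
    with open(file_path, 'w') as output_file:
        output_file.write(record)

    # Check that the file was saved
    saved_files = os.listdir(scratch)
    print(saved_files)
```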

For practice go ahead and make a new directory with a name of your choice in the box below. 

In [11]:
#make a subdirectory in the working directory 
###BEGIN SOLUTION
###END SOLUTION

FileExistsError: [Errno 17] File exists: 'sequences'
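
A FileExistsError like the one above appears when the directory you are trying to create is already there. The sketch below reproduces the error safely inside a temporary directory and shows one possible way to avoid it, using the exist_ok option of os.makedirs. The directory name is only an example.

```python
import os
import tempfile

# Work inside a throwaway temporary directory so nothing in the course folder changes
with tempfile.TemporaryDirectory() as scratch:
    new_dir = os.path.join(scratch, 'sequences')   # example directory name

    os.mkdir(new_dir)          # the first call creates the directory
    try:
        os.mkdir(new_dir)      # the second call fails because it already exists
    except FileExistsError as error:
        print('Could not create directory:', error)

    # exist_ok=True tells makedirs not to raise an error if the directory is already there
    os.makedirs(new_dir, exist_ok=True)

    created = os.path.isdir(new_dir)
    print(created)
```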

To see the list of directories and files in the folder you can use the os.listdir command

In [12]:
#lists the files and subdirectories in the working directory 

os.listdir()

['.DS_Store',
 'sequences',
 'Introduction_to_Jupyter_notebooks.ipynb',
 '1-Reading in a gene sequence.ipynb',
 'data',
 '_DS_Store',
 '.ipynb_checkpoints']

To see the list of directories and files in a subdirectory, you can use the os.listdir command and specify the subdirectory as in the example below.  

In [13]:
#lists the files and subdirectories in the data subdirectory 

os.listdir('data')

['.DS_Store', 'Human-Insulin-NG_007114.1.txt', 'Human-Insulin NM_000207.2.txt']

## How can I read a gene sequence into Python and determine the length of the sequence?<a name='sequencelength' />

Now that we are set up with the data that we need as input, we are ready to start writing the program to read the sequence into Python and calculate the length of the sequence. 

Think for a moment about how you might set this up and then we'll look at the example we've provided below. 

In [14]:
#Open a file and create a file object 
#The 'r' opens the file for reading. 'w' would open the file for writing.  
FASTAgenesequence=open('data/Human-Insulin-NG_007114.1.txt','r')

#Read the sequence file contents and print them 
print(FASTAgenesequence.read())

>NG_007114.1:4986-6416 Homo sapiens insulin (INS), RefSeqGene on chromosome 11
AGCCCTCCAGGACAGGCTGCATCAGAAGAGGCCATCAAGCAGGTCTGTTCCAAGGGCCTTTGCGTCAGGT
GGGCTCAGGATTCCAGGGTGGCTGGACCCCAGGCCCCAGCTCTGCAGCAGGGAGGACGTGGCTGGGCTCG
TGAAGCATGTGGGGGTGAGCCCAGGGGCCCCAAGGCAGGGCACCTGGCCTTCAGCCTGCCTCAGCCCTGC
CTGTCTCCCAGATCACTGTCCTTCTGCCATGGCCCTGTGGATGCGCCTCCTGCCCCTGCTGGCGCTGCTG
GCCCTCTGGGGACCTGACCCAGCCGCAGCCTTTGTGAACCAACACCTGTGCGGCTCACACCTGGTGGAAG
CTCTCTACCTAGTGTGCGGGGAACGAGGCTTCTTCTACACACCCAAGACCCGCCGGGAGGCAGAGGACCT
GCAGGGTGAGCCAACTGCCCATTGCTGCCCCTGGCCGCCCCCAGCCACCCCCTGCTCCTGGCGCTCCCAC
CCAGCATGGGCAGAAGGGGGCAGGAGGCTGCCACCCAGCAGGGGGTCAGGTGCACTTTTTTAAAAAGAAG
TTCTCTTGGTCACGTCCTAAAAGTGACCAGCTCCCTGTGGCCCAGTCAGAATCTCAGCCTGAGGACGGTG
TTGGCTTCGGCAGCCCCGAGATACATCAGAGGGTGGGCACGCTCCTCCCTCCACTCGCCCCTCAAACAAA
TGCCCCGCAGCCCATTTCTCCACCCTCATTTGATGACCGCAGATTCAAGTGTTTTGTTAAGTAAAGTCCT
GGGTGACCTGGGGTCACAGGGTGCCCCACGCTGCCTGCCTCTGGGCGAACACCCCATCACGCCCGGAGGA
GGGCGTGGCTGCCTGCCTGAGTGGGCCAGACCCCTGTCGCCAGGCCTCACGGCAGCTCCATAGTCAGGA

Give yourself a chance to think about the code above. 
What do you think the first line did? What do you think the second line did?

If you want to calculate the length of the sequence what do you think you will need to do next?

There is a function called "len" in Python that returns the length of a string, which is a variable made up of characters like the gene sequence that we read in. 

However, if you used the length command on the sequence variable as is, what would be one of the problems?


To calculate the number of base pairs in the file, we will need to take out the first line. 
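
To see why the first line matters, here is a small sketch with a made-up FASTA record (not the insulin sequence). The len function counts every character in the text, so the identifier line inflates the result well beyond the number of bases.

```python
# A made-up FASTA record (not a real database entry)
fasta_text = '>example_sequence\nACGT\nGG\n'

# len counts every character in the text, including the identifier line
total_characters = len(fasta_text)

# The record only contains six bases: ACGT on one line and GG on the next
number_of_bases = len('ACGT') + len('GG')

print(total_characters)
print(number_of_bases)
```

Here the whole text is 26 characters long even though the record holds only 6 bases.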

You will see a lot of examples in this class when it may be helpful to look at only part of a file. 

Fortunately, Python (and other scripting languages) can help as you'll see in the next example. 

This time, rather than reading in the whole file as one piece of text using the read command, we will use the readlines command, which returns a list with one entry per line. 

In [15]:
#Open a file and create a file object 
FASTAgenesequence=open('data/Human-Insulin-NG_007114.1.txt','r')

#Read the lines of the sequence and trim the first line
#The numbering of lines or characters in Python starts with 0, so the first line is line 0. 

genesequence=(FASTAgenesequence.readlines()[1:])

print(genesequence)

['AGCCCTCCAGGACAGGCTGCATCAGAAGAGGCCATCAAGCAGGTCTGTTCCAAGGGCCTTTGCGTCAGGT\n', 'GGGCTCAGGATTCCAGGGTGGCTGGACCCCAGGCCCCAGCTCTGCAGCAGGGAGGACGTGGCTGGGCTCG\n', 'TGAAGCATGTGGGGGTGAGCCCAGGGGCCCCAAGGCAGGGCACCTGGCCTTCAGCCTGCCTCAGCCCTGC\n', 'CTGTCTCCCAGATCACTGTCCTTCTGCCATGGCCCTGTGGATGCGCCTCCTGCCCCTGCTGGCGCTGCTG\n', 'GCCCTCTGGGGACCTGACCCAGCCGCAGCCTTTGTGAACCAACACCTGTGCGGCTCACACCTGGTGGAAG\n', 'CTCTCTACCTAGTGTGCGGGGAACGAGGCTTCTTCTACACACCCAAGACCCGCCGGGAGGCAGAGGACCT\n', 'GCAGGGTGAGCCAACTGCCCATTGCTGCCCCTGGCCGCCCCCAGCCACCCCCTGCTCCTGGCGCTCCCAC\n', 'CCAGCATGGGCAGAAGGGGGCAGGAGGCTGCCACCCAGCAGGGGGTCAGGTGCACTTTTTTAAAAAGAAG\n', 'TTCTCTTGGTCACGTCCTAAAAGTGACCAGCTCCCTGTGGCCCAGTCAGAATCTCAGCCTGAGGACGGTG\n', 'TTGGCTTCGGCAGCCCCGAGATACATCAGAGGGTGGGCACGCTCCTCCCTCCACTCGCCCCTCAAACAAA\n', 'TGCCCCGCAGCCCATTTCTCCACCCTCATTTGATGACCGCAGATTCAAGTGTTTTGTTAAGTAAAGTCCT\n', 'GGGTGACCTGGGGTCACAGGGTGCCCCACGCTGCCTGCCTCTGGGCGAACACCCCATCACGCCCGGAGGA\n', 'GGGCGTGGCTGCCTGCCTGAGTGGGCCAGACCCCTGTCGCCAGGCCTCACGGCAGCTCCATAGTCAGGAG\n', 'ATGGGGAAGA

What are the '\n' values that you see?

The \n characters that you see are line breaks. Each line returned by the readlines command keeps the line break at its end.

The line breaks will also get in the way of calculating the length of the gene sequence because they count as characters themselves. 

We can get rid of them, however, using the join and replace commands. 

We first will join all the lines together using the join command. 

We then will use the replace command to substitute '\n' with nothing or ''. 
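
As a small illustration of these two steps, the sketch below applies join and replace to a made-up list of two short lines (not the insulin sequence) and compares the lengths before and after the line breaks are removed.

```python
# A made-up list of lines, in the same form that readlines returns
lines = ['ACGT\n', 'GG\n']

# join glues the list entries together into a single string
joined = ''.join(lines)

# replace swaps every line break for an empty string
cleaned = joined.replace('\n', '')

print(len(joined))    # counts the 6 bases plus the 2 line breaks
print(len(cleaned))   # counts only the 6 bases
```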

In [2]:
#Read in the sequence and trim the first line
FASTAgenesequence=open('data/Human-Insulin-NG_007114.1.txt','r')
genesequence=(FASTAgenesequence.readlines()[1:])

#joins the lines in genesequence into a single string
genesequence=''.join(genesequence)

#prints the joined sequence (the linebreaks are still present)
print(genesequence)

AGCCCTCCAGGACAGGCTGCATCAGAAGAGGCCATCAAGCAGGTCTGTTCCAAGGGCCTTTGCGTCAGGT
GGGCTCAGGATTCCAGGGTGGCTGGACCCCAGGCCCCAGCTCTGCAGCAGGGAGGACGTGGCTGGGCTCG
TGAAGCATGTGGGGGTGAGCCCAGGGGCCCCAAGGCAGGGCACCTGGCCTTCAGCCTGCCTCAGCCCTGC
CTGTCTCCCAGATCACTGTCCTTCTGCCATGGCCCTGTGGATGCGCCTCCTGCCCCTGCTGGCGCTGCTG
GCCCTCTGGGGACCTGACCCAGCCGCAGCCTTTGTGAACCAACACCTGTGCGGCTCACACCTGGTGGAAG
CTCTCTACCTAGTGTGCGGGGAACGAGGCTTCTTCTACACACCCAAGACCCGCCGGGAGGCAGAGGACCT
GCAGGGTGAGCCAACTGCCCATTGCTGCCCCTGGCCGCCCCCAGCCACCCCCTGCTCCTGGCGCTCCCAC
CCAGCATGGGCAGAAGGGGGCAGGAGGCTGCCACCCAGCAGGGGGTCAGGTGCACTTTTTTAAAAAGAAG
TTCTCTTGGTCACGTCCTAAAAGTGACCAGCTCCCTGTGGCCCAGTCAGAATCTCAGCCTGAGGACGGTG
TTGGCTTCGGCAGCCCCGAGATACATCAGAGGGTGGGCACGCTCCTCCCTCCACTCGCCCCTCAAACAAA
TGCCCCGCAGCCCATTTCTCCACCCTCATTTGATGACCGCAGATTCAAGTGTTTTGTTAAGTAAAGTCCT
GGGTGACCTGGGGTCACAGGGTGCCCCACGCTGCCTGCCTCTGGGCGAACACCCCATCACGCCCGGAGGA
GGGCGTGGCTGCCTGCCTGAGTGGGCCAGACCCCTGTCGCCAGGCCTCACGGCAGCTCCATAGTCAGGAG
ATGGGGAAGATGCTGGGGACAGGCCCTGGGGAGAAGTACTGGGATCACCTGTTCAGGCTCCCACTGTGAC
GCTGCC

In [17]:
#Read in the sequence and trim the first line
FASTAgenesequence=open('data/Human-Insulin-NG_007114.1.txt','r')
genesequence=(FASTAgenesequence.readlines()[1:])

#joins the lines in genesequence into a single string
genesequence=''.join(genesequence)

#removes the linebreaks
genesequence=genesequence.replace('\n','')

#calculates the length of the genesequence
print(len(genesequence))

1431
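
The steps above can also be collected into a single reusable function. The sketch below is one possible way to do it, not the only one: the function name read_fasta_sequence is made up for illustration, it assumes the file holds a single FASTA record, and it uses a with block so that the file is closed automatically. It is tried out here on a made-up record written to a temporary file.

```python
import os
import tempfile

def read_fasta_sequence(file_path):
    """Return the sequence from a single-record FASTA file as one string."""
    # The with block closes the file automatically when the block ends
    with open(file_path, 'r') as fasta_file:
        lines = fasta_file.readlines()

    # Skip the identifier line (line 0), strip the line breaks, and join the rest
    return ''.join(line.strip() for line in lines[1:])

# Try the function on a made-up record written to a temporary file
with tempfile.TemporaryDirectory() as scratch:
    example_path = os.path.join(scratch, 'example.txt')
    with open(example_path, 'w') as example_file:
        example_file.write('>example_sequence\nACGT\nGG\n')

    example_sequence = read_fasta_sequence(example_path)

print(example_sequence)
print(len(example_sequence))
```

Assuming the course file contains only the one record shown earlier, calling len(read_fasta_sequence('data/Human-Insulin-NG_007114.1.txt')) should give the same length as the cell above.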


Congratulations! We've covered a lot today, but you'll get a lot more practice with everything we've covered as the quarter continues. 

See you next time, when we'll start looking at transcription, the next step in gene expression!  