# Notebook 5: Transcription factor binding sites in *Escherichia coli*

### by Justin B. Kinney

In this example, we will practice downloading a small biological dataset from the internet and analyzing it. 

Specifically, we will analyze the DNA sequences of CRP binding sites. CRP is a transcription factor (TF) of the bacterium *Escherichia coli*. It is one of the most pleiotropic *E. coli* transcription factors, with over 350 functional binding sites throughout the bacterium's 4.6 Mb genome.

The most comprehensive database of transcriptional regulation in *E. coli* is RegulonDB, available at http://regulondb.ccg.unam.mx/. Let's go to this site to find out how we can access all CRP binding sites...
    
OK, looks like the information we want can be found in the file http://regulondb.ccg.unam.mx/menu/download/datasets/files/BindingSiteSet.txt. First let's check how big the file is, to see whether simply downloading it will work.

In [None]:
# Always put this first
%matplotlib inline
from __future__ import division
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [None]:
# Get information about remote file
import urllib
db_remote_file = "http://regulondb.ccg.unam.mx/menu/download/datasets/files/BindingSiteSet.txt"
d = urllib.urlopen(db_remote_file)
print d.info()

Ah. Looks like the file is both current (April 2016) and small (~0.7 MB). Let's download it.
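
The cells in this notebook use Python 2 (`urllib.urlopen`, `print` statements). As a rough sketch, the same header check in Python 3 might look like the following. `urllib.request.urlopen` and `Request(..., method="HEAD")` are standard library calls; the helper names `summarize_headers` and `fetch_headers` and the sample header values are made up for illustration, and the network call is left commented out so the example runs offline.

```python
from urllib.request import Request, urlopen

def summarize_headers(headers):
    """Return (size_in_MB, last_modified) from a mapping of HTTP headers."""
    size_bytes = int(headers.get("Content-Length", 0))
    return size_bytes / 1e6, headers.get("Last-Modified", "unknown")

def fetch_headers(url):
    """Ask the server for headers only (HEAD request), without the body."""
    with urlopen(Request(url, method="HEAD")) as response:
        return response.headers

# Hypothetical header values, used here so the example runs offline.
sample_headers = {"Content-Length": "700000",
                  "Last-Modified": "Fri, 01 Apr 2016 00:00:00 GMT"}
size_mb, modified = summarize_headers(sample_headers)
print("%.1f MB, last modified %s" % (size_mb, modified))

# With network access one could instead call:
# size_mb, modified = summarize_headers(fetch_headers(db_remote_file))
```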

In [None]:
# Download remote file
db_local_file = "binding_site_db.txt"
urllib.urlretrieve(db_remote_file, db_local_file)

In [None]:
# Check that file downloaded
# WARNING: might not work on Windows machines
!ls -lah
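
Since `!ls` may not work on Windows, one portable alternative is to ask Python itself about the file. The sketch below uses only the standard `os` and `tempfile` modules; the helper name `file_size_kb` and the throwaway demo file are assumptions for illustration, standing in for `binding_site_db.txt`.

```python
import os
import tempfile

def file_size_kb(path):
    """Return the size of a file in kilobytes, or None if it does not exist."""
    if not os.path.isfile(path):
        return None
    return os.path.getsize(path) / 1024.0

# Demonstrate on a throwaway file (a stand-in for binding_site_db.txt).
demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, "demo.txt")
with open(demo_path, "w") as handle:
    handle.write("x" * 2048)

print(file_size_kb(demo_path))
print(file_size_kb(os.path.join(demo_dir, "missing.txt")))
```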

Now that we have the file, let's open it to see what it looks like.

In [None]:
# Open file with external program
# WARNING: might not work on Windows machines
!open binding_site_db.txt

Ah, looks like we want lines that have "CRP" in the second column. First we need to know what the delimiter is. Let's load all lines and look at the last one.

In [None]:
# Open file and load all lines into list
with open(db_local_file) as f:
    lines = f.readlines()
print len(lines)
lines[-1]

Ah, the file uses tabs. So the quickest thing that might work is to keep only lines that contain "CRP" with a tab character on either side. Let's try that.

In [None]:
# Get lines for CRP
string_to_match = '\tCRP\t'
lines_we_want = [l for l in lines if string_to_match in l]
len(lines_we_want)
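
The substring test above keeps any line where "CRP" fills *some* tab-delimited field, not necessarily the second one. A stricter variant, sketched below, splits each line and compares one column exactly. The function name `rows_for_tf`, the toy rows, and the assumption that comment lines start with `#` are all illustrative and not taken from the actual file.

```python
def rows_for_tf(lines, tf_name, tf_column=1):
    """Keep tab-delimited rows whose tf_column field equals tf_name exactly."""
    selected = []
    for line in lines:
        if line.startswith('#'):      # assumed comment/header convention
            continue
        fields = line.rstrip('\n').split('\t')
        if len(fields) > tf_column and fields[tf_column] == tf_name:
            selected.append(fields)
    return selected

# Made-up rows that only mimic a tab-delimited layout.
toy_lines = [
    "# comment line\n",
    "id1\tCRP\tsite1\t100\t121\n",
    "id2\tFNR\tsite2\t200\t221\tCRP\tnote\n",   # 'CRP' present, but not in column 1
    "id3\tCRP\tsite3\t300\t321\n",
]
crp_rows = rows_for_tf(toy_lines, "CRP")
substring_hits = [l for l in toy_lines if '\tCRP\t' in l]
print(len(crp_rows), len(substring_hits))
```

On the toy rows the column-based filter keeps two lines while the substring filter keeps three, which is the kind of difference worth checking for on the real file.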

Now we have a list of lines. Here are the first and last lines.

In [None]:
# Check first line and the last line
print lines_we_want[0]
print lines_we_want[-1]

In [None]:
# Split the last line into its tab-separated fields to see which ones hold coordinates
x = lines_we_want[-1]
x.split('\t')
starts = [int(x.split('\t')[3]) for x in lines_we_want]
starts

Looks like fields 3 and 4 contain the start and stop positions of sites. Let's get the site coordinates.

In [None]:
# Get start,stop coordinate pairs for all sites
starts = [int(x.split('\t')[3]) for x in lines_we_want]
stops = [int(x.split('\t')[4]) for x in lines_we_want]
coords = zip(starts,stops)
coords

In [None]:
# Oddly, there are some sites missing, and others might have variable length. Let's find these lengths
site_lengths = [c[1]-c[0]+1 for c in coords]
site_lengths

In [None]:
# Let's figure out how many sites there are of each length
length_dict = {}
for l in site_lengths:
    if l in length_dict:
        length_dict[l] += 1
    else:
        length_dict[l] = 1
print length_dict
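
The same tally can be done with `collections.Counter` from the standard library. This is just an alternative sketch; `toy_lengths` is a made-up stand-in for `site_lengths`.

```python
from collections import Counter

toy_lengths = [22, 22, 1, 22, 16, 1, 22]   # made-up stand-in for site_lengths
toy_length_counts = Counter(toy_lengths)

# most_common() lists (length, count) pairs from most to least frequent
for length, count in toy_length_counts.most_common():
    print(length, count)
```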

In [None]:
# Grab coordinates corresponding to sites of length 22
L = 22
good_coords = [c for c in coords if c[1]-c[0]+1==L]
print len(good_coords)

In [None]:
# Now we have to load the E. coli genome, so that we can extract sites
with open("MG1655.fa") as f:
    genome_lines = f.readlines()
genome_lines[:10]

In [None]:
# Concatenate genome into one string
genome = ''.join([l.strip() for l in genome_lines[1:]])
print len(genome)
print genome[:100]
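
The join above assumes the file holds one header line followed by sequence lines. A slightly more defensive reader is sketched below; the function name `read_single_fasta` and the toy record are hypothetical, and the uppercase conversion is a design choice of this sketch rather than something the notebook does.

```python
import io

def read_single_fasta(handle):
    """Return (header, sequence) for a FASTA file with a single record."""
    header = None
    chunks = []
    for line in handle:
        line = line.strip()
        if not line:
            continue                      # ignore blank lines
        if line.startswith('>'):
            if header is not None:
                raise ValueError("expected exactly one FASTA record")
            header = line[1:]
        else:
            chunks.append(line.upper())
    return header, ''.join(chunks)

# Made-up record with mixed case and a blank line.
toy_fasta = io.StringIO(u">toy_genome example\nACGTAC\ngtacgt\n\nAA\n")
toy_header, toy_genome = read_single_fasta(toy_fasta)
print(toy_header)
print(toy_genome)
```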

In [None]:
# Now grab all 22bp sites from the genome
sites = [genome[start-1:stop] for start,stop in good_coords]
print sites[:10]
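
The slice `genome[start-1:stop]` treats the coordinates as 1-based and inclusive, which matches the `stop - start + 1` length used earlier. A small sketch on a made-up sequence shows the index arithmetic; `extract_site` and `toy_seq` are illustrative names.

```python
def extract_site(genome, start, stop):
    """Return the bases from start to stop, using 1-based inclusive coordinates."""
    if not (1 <= start <= stop <= len(genome)):
        raise ValueError("coordinates out of range: %d-%d" % (start, stop))
    # Python slices are 0-based and exclude the end index, hence start - 1.
    return genome[start - 1:stop]

toy_seq = "ACGTTGCAAT"                 # made-up 10 bp sequence
print(extract_site(toy_seq, 1, 4))     # first four bases
print(extract_site(toy_seq, 8, 10))    # last three bases
```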

Now let's count how many times each base occurs at each position in this list. After a little thought we conclude that we want to do something like this:

In [None]:
counts_matrix = np.zeros([L,4])
bases = 'ACGT'
for s in sites:
    for i in range(L):
        for b, base in enumerate(bases):
            counts_matrix[i,b] += (s[i] == base)
print counts_matrix
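
The triple loop is easy to read; for comparison, here is one possible vectorized version using NumPy array comparisons. It is only a sketch: `count_bases` and `toy_sites` are made-up names, and it assumes all sites have the same length.

```python
import numpy as np

def count_bases(sites, bases='ACGT'):
    """Return a (site length x number of bases) matrix of base counts."""
    # Turn the list of equal-length strings into a 2D array of characters.
    chars = np.array([list(s) for s in sites])
    # For each base, count matches down each column (i.e. at each position).
    return np.stack([(chars == b).sum(axis=0) for b in bases], axis=1)

toy_sites = ["ACGT", "ACGA", "TCGA"]    # made-up sites of length 4
toy_counts = count_bases(toy_sites)
print(toy_counts)
```

Each row corresponds to a position and each column to a base, the same layout as `counts_matrix`, so every row should sum to the number of sites.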

Looks reasonable, but it would be nice to have a graphical representation of this.

In [None]:
plt.imshow(counts_matrix)
plt.show()

Hard to see. Flip this thing on its side and make it bigger. Also, which colors mean what?

In [None]:
plt.figure(figsize=[12,2])
plt.imshow(counts_matrix.T,interpolation='nearest')
plt.colorbar()
plt.show()

That's better, but it will still take some playing around to make it pretty. It's all blurry, the y-axis labels are meaningless, etc.

You just have to play around for a while until you get something that looks presentable. Here's what I came up with.

In [None]:
# Compute occurrence frequency of each base at each position, not the total counts
num_sites = len(sites)
freq_matrix = counts_matrix.T/num_sites

# Set plotting parameters
figure_size = [12,3]
label_size = 16
title_size = 24
colormap = plt.get_cmap('Greens')

# Specify figure of proper size
plt.figure(figsize=figure_size)

# Show matrix without any smoothing
plt.imshow(freq_matrix, interpolation='nearest', cmap=colormap)

# Put interpretable letters on y-axis
plt.yticks(range(4),['A','C','G','T'], fontsize=label_size)

# Label positions symmetrically
positions = np.arange(L)-(L/2)+1
indices = np.arange(0,L,5)
plt.xticks(indices+.5, positions[indices], fontsize=label_size)

# Fix colorbar
plt.clim([0, 1])
cbar = plt.colorbar(ticks=np.linspace(0,1,5))
cbar.ax.tick_params(labelsize=label_size) 

# Draw a title
plt.title('CRP base frequency matrix', fontsize=title_size)

# Fix spacing in plot
plt.tight_layout()

# Save the figure
picture_file = 'crp_matrix.png'
plt.savefig(picture_file)

# Draw the plot
plt.show()



Let's check to make sure this actually worked.

In [None]:
# WARNING: Might not work in Windows
!open $picture_file

We should also save a text file that has the base counts at each position, so we can remake plots like this whenever we want.

In [None]:
counts_matrix_file = 'crp_counts_matrix.txt'
np.savetxt(counts_matrix_file,counts_matrix)

In [None]:
# WARNING: Might not work in Windows
!cat $counts_matrix_file

This is hard for us humans to read. Let's change the text format so that we can eyeball this when we need to.

In [None]:
np.savetxt(counts_matrix_file,counts_matrix, fmt='%d', delimiter='\t')
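
To remake plots later, the saved counts need to be read back in. The sketch below round-trips a made-up matrix through `np.savetxt` and `np.loadtxt` and recomputes per-position frequencies; the temporary file path and the `toy_` names are assumptions for illustration.

```python
import os
import tempfile
import numpy as np

toy_counts = np.array([[2, 0, 0, 1],
                       [0, 3, 0, 0],
                       [0, 0, 3, 0]])   # made-up counts: 3 positions x 4 bases

toy_path = os.path.join(tempfile.mkdtemp(), "toy_counts_matrix.txt")
np.savetxt(toy_path, toy_counts, fmt='%d', delimiter='\t')

# Reload the matrix and convert counts to per-position frequencies.
reloaded = np.loadtxt(toy_path, delimiter='\t')
toy_freqs = reloaded / reloaded.sum(axis=1, keepdims=True)
print(toy_freqs)
```

Note that `np.loadtxt` returns floats by default, so the reloaded counts compare equal to the integers that were saved but have a floating-point dtype.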

In [None]:
# WARNING: Might not work in Windows
!open $counts_matrix_file

In [None]:
# Clean up files
!rm binding_site_db.txt crp_matrix.png crp_counts_matrix.txt
!ls

We're done!