% Bein tutorial: ChipSeq on an influenza genome % Fred Ross % May 20, 2011
This is a toy problem created to illustrate writing a script for Bein for the May 25, 2011 SyBit tech day. The user begins with a FASTA file containing the genome of an influenza isolate, and a set of reads from a ChipSeq experiment on that virus. I chose influenza for this purpose since its genome consists of multiple, linear, single stranded RNA molecules, which simplifies the problem.
The user begins with the genome in FASTA format in influenza.fa and 40,000 reads1 from a simulated ChipSeq experiment in reads.txt, one per line. The goal of the exercise is to produce a list of binding positions on the genome.
The simulated ChipSeq experiment consists of crosslinking all proteins to the RNA of the virus, fragmenting the RNA, pulling down a particular protein species, then extracting the RNA fragments that were attached to those proteins. Only the 5' end of the fragments are sequenced, since the RNA is single stranded, so you should have reads on average half a mean fragment length 5'-ward of the protein's binding site.
The exercise goes through the following steps:
-
Align the reads to the genome with
align.py. The reads contain no errors or mismatches, soalign.pyonly finds perfect matches. It produces a CSV file with the fields chromosome and position. Position is measured in 0-based coordinates from the 5' of the chromosome. -
At each position in the genome, count the number of reads whose 5'-most base fall at that position. The script
pileup.pytakes takes the genome and the alignment fromalign.pyand produces this pileup as a CSV file with fields chromosome, position, and count. -
Threshold the pileup to find regions and calculate the most likely insertion site from those regions. The script
calculate_threshold.pytakes the pileup and calculates a threshold with a given probability of type I error by assuming that the background noise is Poisson distributed, and that sites with 1 or 2 reads are mostly background. Given that threshold,threshold_pileup.pytakes the pileup and produces a list of estimated binding sites in a CSV file with fields chromosome and position.
The threshold_pileup.py script does several things: it finds all regions above the given threshold, then eliminates those which are too short to plausibly be binding sites (the width of such regions is approximately known since the fragment lengths follow a geometric distribution). It takes the center of the region and adds half the mean fragment length to get an estimate for the binding site, which is what it reports.
The shell commands to do all this, from generating reads to finding binding sites, are
python generate_reads.py -n 40000 influenza.fa centers.txt > reads.txt
python align.py influenza.fa reads.txt > aligned.txt
python pileup.py influenza.fa aligned.txt > piledup.txt
python threshold_pileup.py `python calculate_threshold.py -a 0.01 piledup.txt` 100 piledup.txt
To do this in Bein, we would first bind the scripts we use.
@program
def align(genome, reads, output=None):
if output == None:
output = unique_filename_in()
return {'arguments': ['python', '../align.py',
'-o', output,
genome, reads],
'return_value': output}
@program
def pileup(genome, alignment, output=None):
if output == None:
output = unique_filename_in()
return {'arguments': ['python', '../pileup.py',
'-o', output, genome, alignment],
'return_value': output}
@program
def calculate_threshold(pileup, alpha=0.01):
def parse(program_result):
return int(''.join(program_result.stdout))
return {'arguments': ['python', '../calculate_threshold.py',
'-a', str(alpha), pileup],
'return_value': parse}
@program
def threshold(threshold, fragment_size, piledup, output=None):
if output == None:
output = unique_filename_in()
return {'arguments': ['python', '../threshold_pileup.py',
'-o', output,
str(threshold), str(fragment_size),
piledup],
'return_value': output}
Then we attach to the MiniLIMS and put our initial files in it. The commands besides the first need to only be run one, or you'll get errors about non-unique aliases.
M = MiniLIMS("toy_chipseq")
reads_id = M.import_file("reads.txt", description="ChipSeq reads")
genome_id = M.import_file("influenza.fa", description="Influenza genome (FASTA)")
M.add_alias(genome_id, "influenza genome")
The analysis is run in an execution with
with execution(M, description="Find binding sites") as ex:
reads = ex.use(reads_id)
genome = ex.path_to_file("influenza genome")
aligned = align(ex, genome, reads)
piledup = pileup(ex, genome, aligned)
t = calculate_threshold(ex, piledup, alpha=0.005)
regions = threshold(ex, t, 100, piledup)
ex.add(regions, description="CSV of binding sites")
Producing this script, and looking in the htminilims to see its results, should be the goal of the tutorial.
Footnotes
-
The reads were generated by the command
python generate_reads.py -o reads.txt -n 40000 influenza.fa centers.txt. ↩