GitHub - madhadron/bein-chipseq-tutorial: A toy example for teaching Bein

% Bein tutorial: ChipSeq on an influenza genome % Fred Ross % May 20, 2011

This is a toy problem created to illustrate writing a script for Bein for the May 25, 2011 SyBit tech day. The user begins with a FASTA file containing the genome of an influenza isolate, and a set of reads from a ChipSeq experiment on that virus. I chose influenza for this purpose since its genome consists of multiple, linear, single stranded RNA molecules, which simplifies the problem.

The user begins with the genome in FASTA format in influenza.fa and 40,000 reads¹ from a simulated ChipSeq experiment in reads.txt, one per line. The goal of the exercise is to produce a list of binding positions on the genome.

The simulated ChipSeq experiment consists of crosslinking all proteins to the RNA of the virus, fragmenting the RNA, pulling down a particular protein species, then extracting the RNA fragments that were attached to those proteins. Only the 5' end of the fragments are sequenced, since the RNA is single stranded, so you should have reads on average half a mean fragment length 5'-ward of the protein's binding site.

The exercise goes through the following steps:

Align the reads to the genome with align.py. The reads contain no errors or mismatches, so align.py only finds perfect matches. It produces a CSV file with the fields chromosome and position. Position is measured in 0-based coordinates from the 5' of the chromosome.
At each position in the genome, count the number of reads whose 5'-most base fall at that position. The script pileup.py takes takes the genome and the alignment from align.py and produces this pileup as a CSV file with fields chromosome, position, and count.
Threshold the pileup to find regions and calculate the most likely insertion site from those regions. The script calculate_threshold.py takes the pileup and calculates a threshold with a given probability of type I error by assuming that the background noise is Poisson distributed, and that sites with 1 or 2 reads are mostly background. Given that threshold, threshold_pileup.py takes the pileup and produces a list of estimated binding sites in a CSV file with fields chromosome and position.

The threshold_pileup.py script does several things: it finds all regions above the given threshold, then eliminates those which are too short to plausibly be binding sites (the width of such regions is approximately known since the fragment lengths follow a geometric distribution). It takes the center of the region and adds half the mean fragment length to get an estimate for the binding site, which is what it reports.

The shell commands to do all this, from generating reads to finding binding sites, are

python generate_reads.py -n 40000 influenza.fa centers.txt > reads.txt
python align.py influenza.fa reads.txt > aligned.txt
python pileup.py influenza.fa aligned.txt > piledup.txt
python threshold_pileup.py `python calculate_threshold.py -a 0.01 piledup.txt` 100 piledup.txt

To do this in Bein, we would first bind the scripts we use.

@program
def align(genome, reads, output=None):
    if output == None:
        output = unique_filename_in()
    return {'arguments': ['python', '../align.py',
                          '-o', output,
                          genome, reads],
            'return_value': output}

@program
def pileup(genome, alignment, output=None):
    if output == None:
        output = unique_filename_in()
    return {'arguments': ['python', '../pileup.py',
                          '-o', output, genome, alignment],
            'return_value': output}

@program
def calculate_threshold(pileup, alpha=0.01):
    def parse(program_result):
        return int(''.join(program_result.stdout))
    return {'arguments': ['python', '../calculate_threshold.py',
                          '-a', str(alpha), pileup],
            'return_value': parse}

@program
def threshold(threshold, fragment_size, piledup, output=None):
    if output == None:
        output = unique_filename_in()
    return {'arguments': ['python', '../threshold_pileup.py',
                          '-o', output,
                          str(threshold), str(fragment_size),
                          piledup],
            'return_value': output}

Then we attach to the MiniLIMS and put our initial files in it. The commands besides the first need to only be run one, or you'll get errors about non-unique aliases.

M = MiniLIMS("toy_chipseq")

reads_id = M.import_file("reads.txt", description="ChipSeq reads")
genome_id = M.import_file("influenza.fa", description="Influenza genome (FASTA)")
M.add_alias(genome_id, "influenza genome")

The analysis is run in an execution with

with execution(M, description="Find binding sites") as ex:
    reads = ex.use(reads_id)
    genome = ex.path_to_file("influenza genome")
    aligned = align(ex, genome, reads)
    piledup = pileup(ex, genome, aligned)
    t = calculate_threshold(ex, piledup, alpha=0.005)
    regions = threshold(ex, t, 100, piledup)
    ex.add(regions, description="CSV of binding sites")

Producing this script, and looking in the htminilims to see its results, should be the goal of the tutorial.

The reads were generated by the command python generate_reads.py -o reads.txt -n 40000 influenza.fa centers.txt. ↩

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 16 Commits
.gitignore		.gitignore
README.markdown		README.markdown
align.py		align.py
aligned.txt		aligned.txt
calculate_threshold.py		calculate_threshold.py
centers.txt		centers.txt
generate_reads.py		generate_reads.py
influenza.fa		influenza.fa
piledup.txt		piledup.txt
pileup.py		pileup.py
plan.org		plan.org
reads.txt		reads.txt
threshold_pileup.py		threshold_pileup.py

Folders and files

Latest commit

History

Repository files navigation

Footnotes

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages