New "Goldilocks" synthetic datasets for automated tests #100

hollybik · 2012-02-27T19:31:22Z

Datasets where taxa are neither too closely related (e.g. E. coli / Shigella) nor too far apart (no close relatives in the database). These will replace the older, hard datasets we're currently using for testing.

The goal is to measure things like linkage accuracy.

koadman · 2012-03-09T04:30:40Z

Grinder is a nice tool for making synthetic datasets:
http://sourceforge.net/projects/biogrinder/files/biogrinder/Grinder-0.4.5/

It takes a FASTA file of genomes as input and can sample reads under various sampling strategies. Parameters of relevance to us could be:
grinder -read_dist 105 -insert_dist 400 normal 50 -md poly4 3e-3 3.3e-8 -mr 95 5 -genome_file <your genome file> -total_reads XXXXX
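
For scripting test runs, the command above could be assembled as an argument list. This is only a sketch: the function name `build_grinder_command` is hypothetical, the flags and values are copied verbatim from the command line above, and the comments on what each flag means are my reading of them rather than anything stated in this thread.

```python
import shlex


def build_grinder_command(genome_file, total_reads):
    """Return the Grinder invocation quoted above as an argv list.

    Only the flags and values from the comment are used; nothing is added.
    """
    return [
        "grinder",
        "-read_dist", "105",                     # read length (assumed meaning)
        "-insert_dist", "400", "normal", "50",   # insert size distribution (assumed meaning)
        "-md", "poly4", "3e-3", "3.3e-8",        # error model parameters, as given
        "-mr", "95", "5",                        # as given in the comment
        "-genome_file", str(genome_file),
        "-total_reads", str(total_reads),
    ]


if __name__ == "__main__":
    # Placeholder inputs; print the command instead of running it.
    print(shlex.join(build_grinder_command("genomes.fasta", 1000)))
```

Passing the list to `subprocess.run` avoids shell quoting problems with genome file paths, assuming `grinder` is on the PATH.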

hollybik · 2012-03-10T18:19:32Z

Working on the Eukaryotes now, and sitting down with Jenna on Wednesday to go through the Bacterial and Archaeal tree and pick out a range of non-nucleated genomes for compiling this dataset. After we compile a list of target taxa, we will move forward with Grinder to subsample genomes.

gjospin · 2012-04-05T22:09:11Z

The Kangaroo was born today and is knocking out genomes!!!!

I am working on a new branch called 040512_sims that will add a new PS mode called sim.
There are still a few things to tweak but the idea is there.
So far it picks the top X taxa with the highest PD from our concatenated tree and X random taxa from that tree, and generates Y Illumina paired-end reads in a PS_temp/Sims directory.
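
The selection step might look roughly like the sketch below. It is not the code on the 040512_sims branch: `select_taxa` and the `pd_scores` mapping (taxon name to a precomputed PD value) are hypothetical, and drawing the random taxa only from those not already in the top X is an assumption, since the comment does not say whether overlap is allowed.

```python
import random


def select_taxa(pd_scores, x, seed=None):
    """Pick the x taxa with the highest PD plus up to x random other taxa.

    pd_scores: dict mapping taxon name -> PD value (assumed precomputed).
    Returns (top_taxa, random_taxa) as two lists.
    """
    rng = random.Random(seed)
    # Sort by descending PD; break ties by name so the result is deterministic.
    ranked = sorted(pd_scores, key=lambda taxon: (-pd_scores[taxon], taxon))
    top_taxa = ranked[:x]
    remaining = ranked[x:]
    # Assumption: the random picks exclude taxa already chosen for high PD.
    random_taxa = rng.sample(remaining, min(x, len(remaining)))
    return top_taxa, random_taxa


if __name__ == "__main__":
    scores = {"taxonA": 5.0, "taxonB": 3.0, "taxonC": 4.0, "taxonD": 1.0}
    print(select_taxa(scores, 1, seed=42))
```

Fixing the seed keeps a simulated community reproducible between test runs.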

To do: Have a user-specified output directory so multiple instances of simulations don't write on top of each other.
To do: Generate other reads using more flexible parameter inputs.
To do: Think of shorter file naming conventions that are still easy to understand. Grinder adds its own junk at the end, which can make the file names really long.
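
For the first to-do item, one possible approach is to create the user-specified directory and refuse to reuse an existing one, so two simulation runs can never share output. This is a sketch of that idea only; `prepare_output_dir` is a hypothetical helper, not part of the sim mode.

```python
import tempfile
from pathlib import Path


def prepare_output_dir(output_dir):
    """Create output_dir for one simulation run and return it as a Path.

    Raises FileExistsError if the directory already exists, so an earlier
    run's files are never overwritten.
    """
    path = Path(output_dir)
    path.mkdir(parents=True, exist_ok=False)
    return path


if __name__ == "__main__":
    # Demonstrate inside a throwaway directory.
    with tempfile.TemporaryDirectory() as base:
        run_dir = prepare_output_dir(Path(base) / "Sims" / "run1")
        print(run_dir.is_dir())
```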

gjospin · 2012-04-10T19:07:50Z

Added a randomly generated abundance for all taxa selected.
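
As an illustration of what a randomly generated abundance could mean, the sketch below draws a random weight per taxon and normalizes the weights to sum to 1. The distribution actually used on the branch, and how the values are handed to Grinder, are not stated here, so `random_abundances` and the uniform draw are assumptions.

```python
import random


def random_abundances(taxa, seed=None):
    """Assign each taxon a random relative abundance; values sum to 1.

    Uses uniform random weights (an assumption, not the branch's method).
    """
    taxa = list(taxa)
    if not taxa:
        raise ValueError("need at least one taxon")
    rng = random.Random(seed)
    # 1 - random() lies in (0, 1], so no taxon gets a zero weight.
    weights = {taxon: 1.0 - rng.random() for taxon in taxa}
    total = sum(weights.values())
    return {taxon: weight / total for taxon, weight in weights.items()}


if __name__ == "__main__":
    for taxon, abundance in random_abundances(["taxonA", "taxonB", "taxonC"], seed=7).items():
        print(taxon, round(abundance, 4))
```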
Ready to merge to devel, waiting for an appropriate time to do the merging.

ghost assigned hollybik Mar 1, 2012

ghost assigned gjospin Apr 3, 2012

gjospin closed this as completed Apr 10, 2012