New "Goldilocks" synthetic datasets for automated tests #100

Closed
hollybik opened this issue Feb 27, 2012 · 4 comments

@hollybik
Collaborator

We need datasets in which the taxa are neither too closely related (e.g. E. coli / Shigella) nor too far apart (no close relatives in the database). These will replace the older, hard datasets we're currently using for testing.

They will be used to measure things like linkage accuracy.

@ghost ghost assigned hollybik Mar 1, 2012
@koadman
Collaborator

koadman commented Mar 9, 2012

grinder is a nice tool for making synthetic datasets:
http://sourceforge.net/projects/biogrinder/files/biogrinder/Grinder-0.4.5/

It takes a FASTA file of genomes as input and can sample reads under various sampling strategies. Parameters relevant to us could be:
grinder -read_dist 105 -insert_dist 400 normal 50 -md poly4 3e-3 3.3e-8 -mr 95 5 -genome_file <your genome file> -total_reads XXXXX
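
For scripting this from the test harness, the call above could be assembled as an argument list. This is only a sketch: the flags and values are copied verbatim from the command quoted here, while the function name, the input file name `genomes.fasta`, and the read count `100000` are hypothetical placeholders standing in for `<your genome file>` and `XXXXX`.

```python
import shlex


def build_grinder_command(genome_file, total_reads):
    """Return the grinder invocation quoted above as an argument list.

    Only the genome file and the total read count vary; every other
    flag and value is taken as-is from the command in this comment.
    """
    return [
        "grinder",
        "-read_dist", "105",
        "-insert_dist", "400", "normal", "50",
        "-md", "poly4", "3e-3", "3.3e-8",
        "-mr", "95", "5",
        "-genome_file", str(genome_file),
        "-total_reads", str(total_reads),
    ]


# Placeholder inputs (hypothetical): substitute the real FASTA path and read count.
cmd = build_grinder_command("genomes.fasta", 100000)
print(shlex.join(cmd))
```

The resulting list could be handed to `subprocess.run(cmd, check=True)` once grinder is installed and on the PATH.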

@hollybik
Collaborator Author

I'm working on the eukaryotes now, and I'm sitting down with Jenna on Wednesday to go through the bacterial and archaeal tree and pick out a range of non-nucleated genomes for this dataset. Once we have compiled a list of target taxa, we will move forward with grinder to subsample the genomes.

@ghost ghost assigned gjospin Apr 3, 2012
@gjospin
Owner

gjospin commented Apr 5, 2012

The Kangaroo was born today and is knocking out genomes!!!!

I am working on a new branch called 040512_sims that will add a new PS mode called sim.
There are still a few things to tweak but the idea is there.
So far it picks the top X taxa with the highest PD from our concatenated tree plus X random taxa from that tree, and generates Y Illumina paired-end reads in a PS_temp/Sims directory.
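
The selection step described above could look roughly like the sketch below. It is not the code on the branch: it assumes the PD value of each taxon is already available in a dictionary (`pd_by_taxon`, a hypothetical name), and it assumes the random taxa are drawn from those not already picked for their PD, which the description above does not specify.

```python
import random


def pick_taxa(pd_by_taxon, x, seed=None):
    """Pick the x taxa with the highest PD plus x random other taxa.

    pd_by_taxon maps a taxon name to its PD score (hypothetical input).
    Random taxa are sampled without replacement from the taxa that were
    not selected for high PD (an assumption of this sketch).
    """
    rng = random.Random(seed)
    ranked = sorted(pd_by_taxon, key=pd_by_taxon.get, reverse=True)
    top = ranked[:x]
    rest = ranked[x:]
    randoms = rng.sample(rest, min(x, len(rest)))
    return top, randoms


# Made-up scores, for illustration only.
pd_scores = {
    "taxon_a": 0.9,
    "taxon_b": 0.1,
    "taxon_c": 0.5,
    "taxon_d": 0.3,
    "taxon_e": 0.7,
}
top, randoms = pick_taxa(pd_scores, 2, seed=1)
print(top, randoms)
```

Passing a fixed `seed` keeps a simulated dataset reproducible between test runs.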

To do: Add a user-specified output directory so that multiple simulation instances don't overwrite each other's files.
To do: Generate other kinds of reads using more flexible parameter inputs.
To do: Think of a shorter file naming convention that is still easy to understand. Grinder adds its own junk at the end, which can make the file names really long.

@gjospin
Owner

gjospin commented Apr 10, 2012

Added a randomly generated abundance for all taxa selected.
Ready to merge into devel; waiting for an appropriate time to do the merge.
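
One way to generate random abundances for the selected taxa is sketched below. The comment above does not say which distribution is used or whether the values are normalized, so the uniform draw and the normalization to a total of 1 are both assumptions, and the function and taxon names are hypothetical.

```python
import random


def random_abundances(taxa, seed=None):
    """Assign each taxon a random relative abundance; values sum to 1.

    Weights are drawn uniformly from (0, 1] and then normalized
    (both choices are assumptions of this sketch).
    """
    rng = random.Random(seed)
    # 1.0 - random() lies in (0, 1], so no taxon gets a zero weight.
    weights = [1.0 - rng.random() for _ in taxa]
    total = sum(weights)
    return {taxon: w / total for taxon, w in zip(taxa, weights)}


abundances = random_abundances(["taxon_a", "taxon_e", "taxon_c"], seed=0)
for taxon, fraction in abundances.items():
    print(f"{taxon}\t{fraction:.4f}")
```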

@gjospin gjospin closed this as completed Apr 10, 2012