availability of the artificial genomes dataset #3
Comments
Hi Chiara!
Thanks for your interest in the tool and data :-) We didn't realize that the simulated genomes would be useful in themselves, but you're already the second person asking for them. Is there an aspect of the (different types of) simulations that you're particularly interested in, or do you mainly want the genomes with defined levels of contamination and a "shorn" MAG-like contig size distribution?
We (that is, @Askarbek-orakov) are working on cleaning that data to release it asap via https://grp-bork.embl-community.io/gunc/datasets . Likewise, we plan to release the Python code that was used to generate them, but that also still requires some work first (mundane stuff like removing hard-coded file paths etc). So watch this space for updates!
Sebastian
Hi,
Let me know if you have any questions/issues!
Thank you for making these data sets available! Is there any accompanying metadata besides the fasta headers and folder names (e.g. manifests for the genomes the contigs came from, or the dominant taxonomy in each genome, etc.)? If not, can you provide a key for how to interpret the fasta headers to back-infer this information?
Dear Taylor,
When simulating chimeric genomes, I encoded the information about each genome in its name. For example, the first file in
The filename, delimited by underscores, contains this info for type3a genomes:
So, the contributing genome IDs can be derived as described above, and their taxonomy is in the attached table, which is for the proGenomes2.0 database. Currently, the proGenomes website provides a taxonomy table for v2.1, but that one is missing some genomes that were used for the simulations.
And finally, contig headers contain:
example
Cheers,
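For anyone wanting to script this, here is a minimal Python sketch of the mechanics only: splitting a genome file name into its underscore-delimited fields and collecting the contig headers from a FASTA file. The function names `filename_fields` and `contig_headers` are hypothetical and not part of GUNC, the list of file extensions is an assumption, and the sketch deliberately does not assign a meaning to any field, since that depends on the key given for each simulation type.

```python
from pathlib import Path


def filename_fields(path):
    """Split a genome file name into its underscore-delimited fields.

    Hypothetical helper: the meaning of each field depends on the
    simulation type and is not assumed here.
    """
    name = Path(path).name
    # Assumption: files end in one of these common FASTA extensions,
    # optionally gzip-compressed.
    for ext in (".gz", ".fasta", ".fa", ".fna"):
        if name.endswith(ext):
            name = name[: -len(ext)]
    return name.split("_")


def contig_headers(lines):
    """Return FASTA header lines with the leading '>' removed."""
    return [line[1:].strip() for line in lines if line.startswith(">")]
```

`contig_headers` accepts any iterable of lines, so an open file handle (for example from `open(...)` or `gzip.open(..., "rt")`) can be passed directly.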
Hi!
Thank you for the very cool and useful tool!
I would like to know if it is possible to have access to the synthetic data you generated to benchmark GUNC on the different chimerism scenarios.
Thanks again!
Chiara