availability of the artificial genomes dataset #3

ChiaraVanni · 2020-12-21T09:05:30Z

Hi!
Thank you for the very cool and useful tool!
I would like to know if it is possible to have access to the synthetic data you generated to benchmark GUNC on the different chimerism scenarios.

Thanks again!

Chiara

defleury · 2020-12-21T09:59:34Z

Hi Chiara!

Thanks for your interest in the tool and data :-)

We didn't realize that the simulated genomes would be useful in themselves, but you're already the second person asking for them. Is there an aspect of the (different types of) simulations that you're particularly interested in, or do you mainly want the genomes with defined levels of contamination and a "shorn", MAG-like contig size distribution?

We (that is, @Askarbek-orakov) are working on cleaning that data to release it asap via https://grp-bork.embl-community.io/gunc/datasets . Likewise, we plan to release the Python code that was used to generate them, but that also still requires some work first (mundane stuff like removing hard-coded file paths etc).

So watch this space for updates!

Sebastian

fullama · 2021-02-03T13:14:18Z

Hi,
@Askarbek-orakov has put the data together and we have made it available on https://grp-bork.embl-community.io/gunc/datasets

Let me know if you have any questions/issues!

taylorreiter · 2022-06-23T17:29:26Z

Thank you for making these data sets available! Is there any accompanying metadata besides the fasta headers and folder names (e.g. manifests for the genomes the contigs came from, or the dominant taxonomy in each genome, etc.)? If not, can you provide a key for how to interpret the fasta headers to back-infer this information?

Askarbek-orakov · 2022-06-24T10:00:52Z

Dear Taylor,

When simulating the chimeric genomes, I encoded the information about each genome in its file name. For example, the first file in type3a.genomes.tar.gz is:
type3a.genomes/10/class/type3a_class_0000_1388475.SAMN02325599_1.0_2893517_1121425.SAMN02745218_0.1_2881643_.fa

For type3a genomes, the underscore-delimited filename contains these fields:
type3a - simulation scenario
class - divergence level between chimera sources
0000 - id for genomes with the same simulation parameters
1388475.SAMN02325599 - acceptor genome id
1.0 - acceptor genome portion contribution
2893517 - acceptor genome size in bps
1121425.SAMN02745218 - donor genome id
0.1 - donor genome portion contribution
2881643 - donor genome size in bps
The scheme varies slightly for each type, so please let me know if you need more info on the others.
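
The field list above can be turned into a small parser. This is only a minimal sketch for type3a names: the function and field names are made up for illustration (they are not part of GUNC), and it assumes that genome ids never contain underscores, which holds for the example but is not guaranteed for every file.

```python
from pathlib import PurePosixPath
from typing import NamedTuple


class Type3aName(NamedTuple):
    # Field names are hypothetical labels for the nine fields listed above.
    scenario: str            # e.g. "type3a"
    level: str               # divergence level between chimera sources, e.g. "class"
    sim_id: str              # id for genomes with the same simulation parameters
    acceptor_id: str
    acceptor_portion: float
    acceptor_size_bp: int
    donor_id: str
    donor_portion: float
    donor_size_bp: int


def parse_type3a_name(path: str) -> Type3aName:
    """Split a type3a file name into its underscore-delimited fields."""
    name = PurePosixPath(path).name
    if name.endswith(".fa"):
        name = name[: -len(".fa")]
    # The example name ends in "_.fa", so drop the trailing underscore too.
    fields = name.rstrip("_").split("_")
    if len(fields) != 9:
        raise ValueError(f"expected 9 fields, got {len(fields)}: {name!r}")
    scenario, level, sim_id, a_id, a_part, a_size, d_id, d_part, d_size = fields
    return Type3aName(
        scenario, level, sim_id,
        a_id, float(a_part), int(a_size),
        d_id, float(d_part), int(d_size),
    )


example = (
    "type3a.genomes/10/class/type3a_class_0000_1388475.SAMN02325599_1.0_"
    "2893517_1121425.SAMN02745218_0.1_2881643_.fa"
)
info = parse_type3a_name(example)
print(info.acceptor_id, info.donor_id, info.donor_portion)
```

The field-count check makes the function fail loudly on names from other simulation types, whose layout differs, instead of silently returning misaligned values.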

So the ids of the contributing genomes can be derived as described above, and their taxonomy is in the attached table, which is for the proGenomes2.0 database. The proGenomes website currently provides a taxonomy table for v2.1, but that one is missing some genomes that were used for the simulations.

And finally, the contig headers encode the following fields, for example >1388475.SAMN02325599.KI969747_0-51549:
1388475.SAMN02325599 - genome id
KI969747 - contig id
0-51549 - bp range of the original contig that ended up in the simulated chimeric genome.
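
A contig header in this layout could be split as in the sketch below. The function name is hypothetical, and the sketch assumes that the genome id always has the two-part form `<number>.<sample id>` seen in the example, so everything after the second dot (up to the last underscore) is treated as the contig id.

```python
from typing import Tuple


def parse_contig_header(header: str) -> Tuple[str, str, int, int]:
    """Return (genome_id, contig_id, start, end) from a simulated contig header."""
    # Drop the leading ">" and anything after the first whitespace.
    body = header.lstrip(">").split()[0]
    # The bp range follows the last underscore; contig ids may contain underscores.
    prefix, span = body.rsplit("_", 1)
    start, end = (int(x) for x in span.split("-"))
    # Assumption: the genome id is the first two dot-separated parts.
    first, second, contig_id = prefix.split(".", 2)
    return f"{first}.{second}", contig_id, start, end


print(parse_contig_header(">1388475.SAMN02325599.KI969747_0-51549"))
```

Splitting the range off from the right and the genome id off from the left keeps the parser tolerant of contig ids that themselves contain dots or underscores.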

Cheers,
Askarbek

proGenomes2.genome_taxonomy.csv

fullama assigned defleury Dec 21, 2020

fullama closed this as completed Feb 3, 2021
