availability of the artificial genomes dataset #3

Closed
ChiaraVanni opened this issue Dec 21, 2020 · 4 comments

@ChiaraVanni

Hi!
Thank you for the very cool and useful tool!
I would like to know if it is possible to have access to the synthetic data you generated to benchmark GUNC on the different chimerism scenarios.

Thanks again!

Chiara

@defleury

Hi Chiara!

Thanks for your interest in the tool and data :-)

We didn't realize that the simulated genomes would be useful in themselves, but you're already the second person asking for them. Is there an aspect of the (different types of) simulations that you're particularly interested in, or do you mainly want the genomes with defined levels of contamination and a "shorn" MAG-like contig size distribution?

We (that is, @Askarbek-orakov) are working on cleaning that data to release it asap via https://grp-bork.embl-community.io/gunc/datasets . Likewise, we plan to release the Python code that was used to generate the genomes, but that also still requires some work first (mundane stuff like removing hard-coded file paths, etc.).

So watch this space for updates!

Sebastian

@fullama
Contributor

fullama commented Feb 3, 2021

Hi,
@Askarbek-orakov has put the data together and we have made it available on https://grp-bork.embl-community.io/gunc/datasets

Let me know if you have any questions/issues!

@fullama closed this as completed Feb 3, 2021

@taylorreiter

Thank you for making these data sets available! Is there any accompanying metadata besides the fasta headers and folder names (e.g. manifests for the genomes the contigs came from, or the dominant taxonomy in each genome, etc.)? If not, can you provide a key for how to interpret the fasta headers to back-infer this information?

@Askarbek-orakov

Dear Taylor,

When simulating chimeric genomes, I encoded the information about each genome in its filename. For example, the first file in type3a.genomes.tar.gz is:
type3a.genomes/10/class/type3a_class_0000_1388475.SAMN02325599_1.0_2893517_1121425.SAMN02745218_0.1_2881643_.fa

Split on underscores, the filename contains the following fields for type3a genomes:
type3a - simulation scenario
class - divergence level between chimera sources
0000 - id for genomes with the same simulation parameters
1388475.SAMN02325599 - acceptor genome id
1.0 - acceptor genome portion contribution
2893517 - acceptor genome size in bps
1121425.SAMN02745218 - donor genome id
0.1 - donor genome portion contribution
2881643 - donor genome size in bps
The layout varies slightly for each type, so please let me know if you need more info on the others.
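
The field list above could be read back out of a type3a filename roughly as follows. This is only a minimal sketch: `Type3aName` and `parse_type3a_filename` are hypothetical names (not part of GUNC), and it assumes that genome ids never contain underscores and that the filename ends in `_.fa` as in the example. Other simulation types would need their own field order.

```python
from pathlib import PurePosixPath
from typing import NamedTuple


class Type3aName(NamedTuple):
    """Fields encoded in a type3a filename (hypothetical helper type)."""
    scenario: str            # simulation scenario, e.g. "type3a"
    divergence_level: str    # divergence level between chimera sources
    sim_id: str              # id for genomes with the same parameters
    acceptor_id: str
    acceptor_portion: float
    acceptor_size_bp: int
    donor_id: str
    donor_portion: float
    donor_size_bp: int


def parse_type3a_filename(path: str) -> Type3aName:
    """Split a type3a genome filename into its nine underscore-delimited fields."""
    name = PurePosixPath(path).name
    if not name.endswith(".fa"):
        raise ValueError(f"expected a .fa file, got {name!r}")
    # Drop the extension and the trailing underscore seen in the example name.
    fields = name[: -len(".fa")].rstrip("_").split("_")
    if len(fields) != 9:
        # Assumption: ids contain no underscores, so exactly 9 fields remain.
        raise ValueError(f"expected 9 fields, got {len(fields)} in {name!r}")
    (scenario, level, sim_id,
     acc_id, acc_portion, acc_size,
     don_id, don_portion, don_size) = fields
    return Type3aName(
        scenario, level, sim_id,
        acc_id, float(acc_portion), int(acc_size),
        don_id, float(don_portion), int(don_size),
    )


example = (
    "type3a.genomes/10/class/"
    "type3a_class_0000_1388475.SAMN02325599_1.0_2893517_"
    "1121425.SAMN02745218_0.1_2881643_.fa"
)
info = parse_type3a_filename(example)
print(info.acceptor_id, info.donor_id, info.donor_portion)
```

The acceptor and donor ids returned here are the keys one would then look up in the taxonomy table.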

So, the contributing genome ids can be derived as described above, and their taxonomy is in the attached table, which is for the proGenomes2.0 database. The proGenomes website currently provides a taxonomy table for v2.1, but that one is missing some genomes that were used for the simulations.

And finally, the contig headers encode the following, for example >1388475.SAMN02325599.KI969747_0-51549:
1388475.SAMN02325599 - genome id
KI969747 - contig id
0-51549 - bp range of the original contig that ended up in the simulated chimeric genome.
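
A contig header of this shape could be split along the same lines. Again a sketch under stated assumptions: `ContigHeader` and `parse_contig_header` are hypothetical names, the genome id is assumed to be exactly two dot-separated tokens (as in the example), and the bp range is taken to follow the last underscore. Whether the range is 0-based or end-inclusive is not stated above, so the numbers are returned as-is.

```python
from typing import NamedTuple


class ContigHeader(NamedTuple):
    """Fields encoded in a simulated contig header (hypothetical helper type)."""
    genome_id: str   # e.g. "1388475.SAMN02325599"
    contig_id: str   # e.g. "KI969747"
    start: int       # first number of the bp range, as written
    end: int         # second number of the bp range, as written


def parse_contig_header(header: str) -> ContigHeader:
    """Split '>genome_id.contig_id_start-end' into its parts."""
    text = header.strip().lstrip(">")
    # The bp range follows the LAST underscore, so contig ids that
    # themselves contain underscores are still handled.
    source, sep_us, bp_range = text.rpartition("_")
    start, sep_dash, end = bp_range.partition("-")
    # Assumption: the genome id is the first two dot-separated tokens;
    # everything after the second dot is the contig id.
    parts = source.split(".", 2)
    if not sep_us or not sep_dash or len(parts) != 3:
        raise ValueError(f"unexpected contig header: {header!r}")
    first, second, contig_id = parts
    return ContigHeader(f"{first}.{second}", contig_id, int(start), int(end))


hdr = parse_contig_header(">1388475.SAMN02325599.KI969747_0-51549")
print(hdr.genome_id, hdr.contig_id, hdr.start, hdr.end)
```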

Cheers,
Askarbek

proGenomes2.genome_taxonomy.csv
