Groot parser implementation #37
Comments
I figure that because it's dependent on what data you create the db with, we can't parse too tightly for groot.
Maybe @will-rowe can be of assistance here? :)
Heya. This looks like a great and much needed project! Groot hasn't received much love recently. What do you need? Sounds like @fmaguire is right though: as users can change the input DB, it is going to be hard to write a generic parser. Happy to make updates to groot if needed.
Hey @will-rowe! Thanks for joining the discussion! Could you please clarify what is in groot's report? Maybe adding some headers to the tsv file would be a nice addition. :P I've had quite a bit of trouble mapping it to our AMR spec. (Warning: this is very much WIP!) Both
Sorry @cimendes - dropped the ball here. The report is a 4-column tsv where you have:
I thought this was in the docs but I can't find it - sorry! Will add it. The ARG name is just lifted from whatever input was used for indexing. So in your linked example, that is just the header from the CARD-3.0.4 multifasta. This does need improving, and groot seems to still be going strong, so I need to work on this. Open to suggestions on how though. One way to do it could be to have a flag provided to the report subcommand which you can use to sanitise the report based on a database (CARD/resfinder). So it could look up the multifasta header against your AMR spec for CARD/resfinder. Is there a way to do this already? This also means that if a user didn't use CARD/resfinder, it would fall back to the old behaviour of just using the multifasta header. I'd update the report format to have consistent fields regardless though (possibly just duplicating
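To make the lookup-with-fallback idea concrete, here is a minimal sketch. It is not groot's API: `CARD_LOOKUP`, `sanitise` and the field names are all hypothetical, and the single table entry just reuses the header discussed in this issue. The point is only that the output carries the same fields whether or not the header is found.

```python
from typing import Dict

# Hypothetical lookup table: raw multifasta header -> cleaned-up fields.
# In practice this would be built from the CARD/resfinder metadata.
CARD_LOOKUP: Dict[str, Dict[str, str]] = {
    "OqxA.3003470.EU370913.4407527-4408202.4553": {
        "gene_symbol": "OqxA",
        "reference_accession": "EU370913",
    },
}


def sanitise(header: str, lookup: Dict[str, Dict[str, str]]) -> Dict[str, str]:
    """Return consistent fields for a header, falling back to the raw header."""
    fields = lookup.get(header)
    if fields is None:
        # Unknown database: old behaviour, the header is duplicated into each field.
        return {"gene_symbol": header, "reference_accession": header}
    return dict(fields)
```

With a header that is not in the table, `sanitise("some_custom_header", CARD_LOOKUP)` returns the raw header in both fields, so downstream parsers never have to branch on which database was used.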
With the following single-entry output this is currently what is being parsed:
The metadata passed is the following:
metadata = {"analysis_software_version": "0.0.1", "reference_database_version": "2019-Jul-28", "input_file_name": "Dummy", 'reference_database_id': "argannot"}
This is the current output:
As you can see, the `gene_symbol`, `gene_name` and `reference_accession` fields are all storing the same information.

I have a bit of trouble mapping the fields in `OqxA.3003470.EU370913.4407527-4408202.4553` to the spec. The `gene_name` is technically not present in the report. Should we store the same value as `gene_symbol` or keep it as `None`?

For the `reference_accession`, shouldn't we keep just the `EU370913` value? I'm unsure of what `3003470` represents, as well as the `4407527-4408202.4553`.

Any input is welcome!
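For discussion, here is a rough sketch of how the header could be split positionally. It assumes the header is always five dot-separated pieces, which is only based on this one example; `split_groot_header` and the `unmapped` key are hypothetical names, and the pieces whose meaning is unclear are kept together rather than guessed at.

```python
from typing import Dict, Optional


def split_groot_header(header: str) -> Dict[str, Optional[str]]:
    """Split a dot-delimited reference header into positional pieces.

    Assumes a five-part layout like the example in this issue; anything
    else is returned unchanged as the gene symbol.
    """
    parts = header.split(".")
    if len(parts) != 5:
        # Headers from other databases will not match this layout.
        return {"gene_symbol": header, "reference_accession": None, "unmapped": None}
    symbol, second, accession, span, last = parts
    return {
        "gene_symbol": symbol,
        "reference_accession": accession,
        # Pieces with no confirmed meaning are preserved, not interpreted.
        "unmapped": ".".join([second, span, last]),
    }


print(split_groot_header("OqxA.3003470.EU370913.4407527-4408202.4553"))
```

This would give `OqxA` as the symbol and `EU370913` as the accession, while leaving `3003470` and `4407527-4408202.4553` untouched until someone confirms what they represent.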