Groot parser implementation #37

cimendes · 2020-11-11T13:25:04Z

With the following single-entry output this is currently what is being parsed:

OqxA.3003470.EU370913.4407527-4408202.4553 266 657 3D648M6D

The metadata passed is the following:
metadata = {"analysis_software_version": "0.0.1", "reference_database_version": "2019-Jul-28", "input_file_name": "Dummy", 'reference_database_id': "argannot"}

This is the current output:

    assert result.input_file_name == 'Dummy'
    assert result.gene_symbol == 'OqxA'
    assert result.gene_name == 'OqxA.3003470.EU370913' 
    assert result.reference_database_id == 'argannot'
    assert result.reference_database_version == '2019-Jul-28'
    assert result.reference_accession == 'OqxA.3003470.EU370913.4407527-4408202.4553' 
    assert result.analysis_software_name == 'groot'
    assert result.analysis_software_version == '0.0.1'
    assert result.reference_gene_length == 657
    assert result.coverage_depth == 266

As you can see, the gene_symbol, gene_name and reference_accession fields are all storing the same information.

I've a bit of trouble mapping the fields in OqxA.3003470.EU370913.4407527-4408202.4553 to the spec. The gene_name technically this is not present in the report. Should we store the same value as gene_symbol or keep is as None?
For the reference_accession, shouldn't we keep just the EU370913 value? I'm unsure of what 3003470 represents, as well as the 4407527-4408202.4553.

Any input is welcomed!

The text was updated successfully, but these errors were encountered:

fmaguire · 2020-11-12T01:13:33Z

I figure because its dependent on what data you create the db with we can't parse too tightly for groot.

gene_name, gene_symbol, reference_accession are mandatory, but as we can't guarantee input formatting we should probably just sling the same thing into gene_symbol, gene_name and reference_accession

3003470 is the ARO accession and I think the other numbers are related to indexed locations in the variation graph and clusters.

cimendes · 2020-11-16T12:17:32Z

Maybe @will-rowe can be of assistance here? :)

will-rowe · 2020-11-18T09:11:23Z

Heya. This looks like a great and much needed project! Groot hasn't received much love recently. What do you need? Sounds like @fmaguire is right though - as users can change the input DB, it is going to be hard to write a generic parser? Happy to make updates to groot if needed

cimendes · 2020-11-20T10:53:32Z

Hey @will-rowe! Thanks for joining the discussion! Could you please clarify what is in the groot's report? Maybe adding some headers to the tsv file would be a nice addition.. :P I've quite a bit of trouble mapping it to our AMR spec. (Warning: this is very much WIP!). Both gene symbol and gene name are mandatory fields, and having duplicated information there feel a little bit to me like "cheating". :P I would like to avoid that if possible.

will-rowe · 2020-11-30T09:38:57Z

Sorry @cimendes - dropped the ball here.

The report is 4 column tsv where you have:

ARG name
mapped read count
ARG reference length
CIGAR to describe reference coverage

I thought this was in the docs but I can't find it - sorry! Will add it. The ARG name is just lifted from whatever input was used for indexing. So in your linked example, that is just the header from the CARD-3.0.4 multifasta.

This does need improving and groot seems to still be going strong so I need to work on this. Open to suggestions on how though. One way to do it could be to have a flag provided to the report subcommand which you can use to sanitise the report based on a database (CARD/resfinder). So it could lookup the multifasta header against your AMR spec for CARD/resfinder. Is there a way to do this already? This also means that if a user didn't use CARD/resfinder, it would fall back to the old behaviour of just using the multifasta header. I'd update the report format to have consistent fields regardless though (possibly just duplicating gene symbol and gene name if sanitisation wasn't possible/requested

cimendes mentioned this issue Nov 11, 2020

Add sanity testing for hAMRonization #40

Merged

fmaguire added the discussion label Mar 23, 2021

cimendes added the p3 label Feb 24, 2022

fmaguire closed this as completed Aug 21, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Groot parser implementation #37

Groot parser implementation #37

cimendes commented Nov 11, 2020

fmaguire commented Nov 12, 2020

cimendes commented Nov 16, 2020

will-rowe commented Nov 18, 2020

cimendes commented Nov 20, 2020

will-rowe commented Nov 30, 2020 •

edited

Loading

Groot parser implementation #37

Groot parser implementation #37

Comments

cimendes commented Nov 11, 2020

fmaguire commented Nov 12, 2020

cimendes commented Nov 16, 2020

will-rowe commented Nov 18, 2020

cimendes commented Nov 20, 2020

will-rowe commented Nov 30, 2020 • edited Loading

will-rowe commented Nov 30, 2020 •

edited

Loading