bug in get_proteins_by_id affecting pfam annotator #78

LuisFF · 2022-09-06T22:57:31Z

First of all thanks for developing DeepBGC and making it available to the community.

I came across a bug in HmmscanPfamRecordAnnotator when generating the proteins_by_id dictionary. The util function get_proteins_by_id is currently looping through all the potential protein ids of a feature (e.g. unique_protein_id, protein_id and locus_tag) and this can cause features with id based on protein_id qualifier to be overwritten by another feature that shares the same protein_id but it was deduplicated using the unique_protein_id. This is causing PFAM_domain features to be incorrectly placed in the genomic sequence because protein_id used in hmmscan output file will match a different feature and pick the incorrect feature location.

The text was updated successfully, but these errors were encountered:

prihoda · 2022-09-12T09:49:25Z

Hi @LuisFF, thanks for reporting this. I'll need to think for a bit since it's been a while since I wrote the code :)

I would be careful around changing the logic of get_proteins_by_id, it's used in multiple places and not sure if it's safe to be able to only access the feature using its first protein ID, as proposed in your PR. Currently, it allows mapping from any of the potential protein IDs.

If I understand it correctly, this could also be solved by removing the protein_id qualifier when entering the unique_protein_id, correct? Here:

deepbgc/deepbgc/util.py

Lines 503 to 505 in c836f08

    
           if protein_id != new_protein_id: 
        
               logging.warning('Setting new unique_protein_id %s for CDS %s', new_protein_id, protein_id) 
        
               feature.qualifiers['unique_protein_id'] = [new_protein_id]

LuisFF · 2022-09-12T12:27:03Z

Hi @prihoda, thanks for looking into this.

Yes, removing the protein_id qualifier from subsequent CDS features with that same qualifier value is another way of solving this issue. That avoids any side effects I may overlooked in #79.

prihoda · 2022-09-19T07:46:37Z

Great, would you be up to implementing that behavior in your PR? @LuisFF

LuisFF · 2022-10-04T12:19:17Z

Hi @prihoda , I pushed the suggested changes.

Please let me know if there's anything else needed.

LuisFF mentioned this issue Sep 6, 2022

fix get_proteins_by_id to avoid overwriting protein_id features #79

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

bug in get_proteins_by_id affecting pfam annotator #78

bug in get_proteins_by_id affecting pfam annotator #78

LuisFF commented Sep 6, 2022

prihoda commented Sep 12, 2022

LuisFF commented Sep 12, 2022

prihoda commented Sep 19, 2022 •

edited

Loading

LuisFF commented Oct 4, 2022 •

edited

Loading

bug in get_proteins_by_id affecting pfam annotator #78

bug in get_proteins_by_id affecting pfam annotator #78

Comments

LuisFF commented Sep 6, 2022

prihoda commented Sep 12, 2022

LuisFF commented Sep 12, 2022

prihoda commented Sep 19, 2022 • edited Loading

LuisFF commented Oct 4, 2022 • edited Loading

prihoda commented Sep 19, 2022 •

edited

Loading

LuisFF commented Oct 4, 2022 •

edited

Loading