Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bug in get_proteins_by_id affecting pfam annotator #78

Open
LuisFF opened this issue Sep 6, 2022 · 4 comments
Open

bug in get_proteins_by_id affecting pfam annotator #78

LuisFF opened this issue Sep 6, 2022 · 4 comments

Comments

@LuisFF
Copy link

LuisFF commented Sep 6, 2022

First of all thanks for developing DeepBGC and making it available to the community.

I came across a bug in HmmscanPfamRecordAnnotator when generating the proteins_by_id dictionary. The util function get_proteins_by_id is currently looping through all the potential protein ids of a feature (e.g. unique_protein_id, protein_id and locus_tag) and this can cause features with id based on protein_id qualifier to be overwritten by another feature that shares the same protein_id but it was deduplicated using the unique_protein_id. This is causing PFAM_domain features to be incorrectly placed in the genomic sequence because protein_id used in hmmscan output file will match a different feature and pick the incorrect feature location.

@prihoda
Copy link
Collaborator

prihoda commented Sep 12, 2022

Hi @LuisFF, thanks for reporting this. I'll need to think for a bit since it's been a while since I wrote the code :)

I would be careful around changing the logic of get_proteins_by_id, it's used in multiple places and not sure if it's safe to be able to only access the feature using its first protein ID, as proposed in your PR. Currently, it allows mapping from any of the potential protein IDs.

If I understand it correctly, this could also be solved by removing the protein_id qualifier when entering the unique_protein_id, correct? Here:

deepbgc/deepbgc/util.py

Lines 503 to 505 in c836f08

if protein_id != new_protein_id:
logging.warning('Setting new unique_protein_id %s for CDS %s', new_protein_id, protein_id)
feature.qualifiers['unique_protein_id'] = [new_protein_id]

@LuisFF
Copy link
Author

LuisFF commented Sep 12, 2022

Hi @prihoda, thanks for looking into this.

Yes, removing the protein_id qualifier from subsequent CDS features with that same qualifier value is another way of solving this issue. That avoids any side effects I may overlooked in #79.

@prihoda
Copy link
Collaborator

prihoda commented Sep 19, 2022

Great, would you be up to implementing that behavior in your PR? @LuisFF

@LuisFF
Copy link
Author

LuisFF commented Oct 4, 2022

Hi @prihoda , I pushed the suggested changes.

Please let me know if there's anything else needed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants