Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Inaccurate output for Klebsiella pneumonia dataset #56

Closed
sumit-walia opened this issue Jun 5, 2023 · 6 comments · Fixed by #57
Closed

Inaccurate output for Klebsiella pneumonia dataset #56

sumit-walia opened this issue Jun 5, 2023 · 6 comments · Fixed by #57
Assignees
Labels
bug Something isn't working

Comments

@sumit-walia
Copy link

Hi All,

We attempted to construct a Pangraph using the Klebsiella pneumonia dataset. The raw sequences and Pangraph output JSON file are available here. However, we encountered an issue in the output file which is as follows:

In the klebs_100 PanGraph JSON file, we believe that the following five sequences are not represented correctly: NZ_CP013985.1 NZ_LR607362.1 NZ_CAKACX010000001.1 NZ_QIXX01000100.1 NZ_JARAMW010000001.1
We noticed that the unique string of eight nucleotides (TGCTTTTT or its reverse complement AAAAAGCA) is missing from these sequences, despite being present in the raw sequence. For example, in the sequence, 'NZ_CP013985.1', this string should be present in the block 'STETJDHNZS' (the 199th block on the path). This block is represented by the reverse strand. We manually reconstructed this block using the mutations in the PanGraph and found that the string of 8 nucleotides (TGCTTTTT) is missing from the reverse complement. We recommend that you do the same to verify. We believe the block is missing an 'insertion' at position 2775 for the sequence 'NZ_CP013985.1'. The nucleotides of the insertion should be AAAAAGCA.
We have used the following command to build the Pangraph: pangraph --circular klebs_100.fa

Thanks,
Sumit

@mmolari
Copy link
Collaborator

mmolari commented Jun 6, 2023

Dear Sumit,
thanks again for testing PanGraph and reporting issues, this is very precious for us. I looked into the .json file attached and indeed, it looks like the 8-nucleotide insertion present in some of the block occurrences is not in the output file. The rest of the blocks are accurately reconstructed.
I also checked the file and I think you did not use the --circular flag (the circular option is false for all the path in the json file). Could you confirm that the command you used to produce the graph is: pangraph build klebs_100.fa > klebs_100.json ?
I will try to debug this on my end. In the meantime could you try and see if this can be reproduced on your end with a smaller subset of the isolates? And if it is reproduced consistently in the same way? This would make the debugging easier... Could you also check which version of pangraph you are running? It should be sufficient to type pangraph version.
Thanks!
Marco

@mmolari mmolari self-assigned this Jun 6, 2023
@mmolari mmolari added the bug Something isn't working label Jun 6, 2023
@sumit-walia
Copy link
Author

sumit-walia commented Jun 6, 2023

Hi Marco,

Thank you for your response. I would like to correct and confirm that we have used the following command: pangraph build klebs_100.fa > klebs_100.json. We are running the latest version of Pangraph (v0.6.3).

We attempted to replicate comparable issues on a subset of this dataset but couldn't. This is the smallest dataset where we observed the inconsistency.

Thanks,
Sumit

@sumit-walia
Copy link
Author

sumit-walia commented Jun 9, 2023

Hi Marco,

We found a similar issue on the E.Coli dataset (files are added in the same google drive folder). This might help you to debug as the file sizes are smaller.

In the ecoli_50.json PanGraph output, the following five sequence sequences are not represented correctly: NZ_CP006027.1 NZ_CP007136.1 NZ_CP006262.1 NZ_CP051001.1 NZ_CP010371.1.
We investigated the issue and made the following observation for sequence NZ_CP006027.1: Since the sequence is circular, we found that the block MBEUQKOMBV is the block that wraps around the end and has a few nucleotides at the beginning of the sequence. Since the starting point of the next block is 14, we know that the last 13 nucleotides of block MBEUQKOMBV appear at the beginning of the sequence. We manually computed what the last 13 nucleotides of this sequence would be and found that they are 'TTAAAAAAAGAGT'. This does not match the actual sequence. We made one more observation that will help you debug the issue. The actual sequence has this string of nucleotides starting at the 45th position and if the sequence was rotated by an additional offset of 44, then a large number of nucleotides of the actual sequence would match the sequence represented by PanGraph. What we suspect the issue is (and are quite certain about) is that an insertion of 44 characters is missing from the sequence. If this insertion had taken place, 44 more characters from the end of the sequence would be rotated towards the beginning.

Thanks.

@sumit-walia sumit-walia reopened this Jun 9, 2023
@mmolari
Copy link
Collaborator

mmolari commented Jun 10, 2023

Dear Sumit,

thank you once more! Having a smaller dataset with the issue is really helpful.
A small status report: I am currently in the process of locating the source of the error in the code. I have had some difficulties due to the fact that the error is not reproduced exactly in the same way every time, because the current behavior of PanGraph is non-deterministic. This in turn is due to some technicalities with the way random numbers are generated by threads, and some non-deterministic ordering of items in a dictionary. As a first step we therefore decided to correct this, and make PanGraph deterministic and reproducible. This is currently done in this branch, and will help in locating the bug more precisely. I'll keep working on it and report back as soon as I have more progress.

In the meantime thank you once more for all of your careful testing! It's really precious for us!

Best,
Marco

@mmolari mmolari linked a pull request Jun 19, 2023 that will close this issue
@mmolari
Copy link
Collaborator

mmolari commented Jun 21, 2023

Hi Sumit,
just wanted to give you an update. The bug was caused by an edge-case in the merging of two alignments, that now should be handled correctly by the code. Branch fix/issue-56 contains the fix. I tested it on both of the datasets you provided and genomes are correctly reconstructed now. I'll perform some more checks before merging to master, but in the meantime if you need it or want to test it too you can pull the code from that branch.
Thanks again for all of your feedback!
Best,
Marco

@sumit-walia
Copy link
Author

Hi Marco,

Thank you for your prompt response. I used the code on the aforementioned branch. It does resolve the issue and generate the correct fasta for both the datasets.

Best,
Sumit

mmolari added a commit that referenced this issue Jun 27, 2023
- fixed issue #56 
- pangraph `build` command is now deterministic, a random seed can be set with the `-r` option.
- the `build` and `merge` commands now have a `-t` flag. When set sanity checks are performed on the graph.
- fasta input files are checked for duplicated record names, and white lines between records are tolerated
@mmolari mmolari mentioned this issue Jun 27, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants