Inaccurate output for Klebsiella pneumonia dataset #56

sumit-walia · 2023-06-05T22:06:38Z

Hi All,

We attempted to construct a Pangraph using the Klebsiella pneumonia dataset. The raw sequences and Pangraph output JSON file are available here. However, we encountered an issue in the output file which is as follows:

In the klebs_100 PanGraph JSON file, we believe that the following five sequences are not represented correctly: NZ_CP013985.1 NZ_LR607362.1 NZ_CAKACX010000001.1 NZ_QIXX01000100.1 NZ_JARAMW010000001.1
We noticed that the unique string of eight nucleotides (TGCTTTTT or its reverse complement AAAAAGCA) is missing from these sequences, despite being present in the raw sequence. For example, in the sequence, 'NZ_CP013985.1', this string should be present in the block 'STETJDHNZS' (the 199th block on the path). This block is represented by the reverse strand. We manually reconstructed this block using the mutations in the PanGraph and found that the string of 8 nucleotides (TGCTTTTT) is missing from the reverse complement. We recommend that you do the same to verify. We believe the block is missing an 'insertion' at position 2775 for the sequence 'NZ_CP013985.1'. The nucleotides of the insertion should be AAAAAGCA.
We have used the following command to build the Pangraph: pangraph --circular klebs_100.fa

Thanks,
Sumit

The text was updated successfully, but these errors were encountered:

mmolari · 2023-06-06T14:39:24Z

Dear Sumit,
thanks again for testing PanGraph and reporting issues, this is very precious for us. I looked into the .json file attached and indeed, it looks like the 8-nucleotide insertion present in some of the block occurrences is not in the output file. The rest of the blocks are accurately reconstructed.
I also checked the file and I think you did not use the --circular flag (the circular option is false for all the path in the json file). Could you confirm that the command you used to produce the graph is: pangraph build klebs_100.fa > klebs_100.json ?
I will try to debug this on my end. In the meantime could you try and see if this can be reproduced on your end with a smaller subset of the isolates? And if it is reproduced consistently in the same way? This would make the debugging easier... Could you also check which version of pangraph you are running? It should be sufficient to type pangraph version.
Thanks!
Marco

sumit-walia · 2023-06-06T16:45:29Z

Hi Marco,

Thank you for your response. I would like to correct and confirm that we have used the following command: pangraph build klebs_100.fa > klebs_100.json. We are running the latest version of Pangraph (v0.6.3).

We attempted to replicate comparable issues on a subset of this dataset but couldn't. This is the smallest dataset where we observed the inconsistency.

Thanks,
Sumit

sumit-walia · 2023-06-09T22:14:49Z

Hi Marco,

We found a similar issue on the E.Coli dataset (files are added in the same google drive folder). This might help you to debug as the file sizes are smaller.

In the ecoli_50.json PanGraph output, the following five sequence sequences are not represented correctly: NZ_CP006027.1 NZ_CP007136.1 NZ_CP006262.1 NZ_CP051001.1 NZ_CP010371.1.
We investigated the issue and made the following observation for sequence NZ_CP006027.1: Since the sequence is circular, we found that the block MBEUQKOMBV is the block that wraps around the end and has a few nucleotides at the beginning of the sequence. Since the starting point of the next block is 14, we know that the last 13 nucleotides of block MBEUQKOMBV appear at the beginning of the sequence. We manually computed what the last 13 nucleotides of this sequence would be and found that they are 'TTAAAAAAAGAGT'. This does not match the actual sequence. We made one more observation that will help you debug the issue. The actual sequence has this string of nucleotides starting at the 45th position and if the sequence was rotated by an additional offset of 44, then a large number of nucleotides of the actual sequence would match the sequence represented by PanGraph. What we suspect the issue is (and are quite certain about) is that an insertion of 44 characters is missing from the sequence. If this insertion had taken place, 44 more characters from the end of the sequence would be rotated towards the beginning.

Thanks.

mmolari · 2023-06-10T15:23:32Z

Dear Sumit,

thank you once more! Having a smaller dataset with the issue is really helpful.
A small status report: I am currently in the process of locating the source of the error in the code. I have had some difficulties due to the fact that the error is not reproduced exactly in the same way every time, because the current behavior of PanGraph is non-deterministic. This in turn is due to some technicalities with the way random numbers are generated by threads, and some non-deterministic ordering of items in a dictionary. As a first step we therefore decided to correct this, and make PanGraph deterministic and reproducible. This is currently done in this branch, and will help in locating the bug more precisely. I'll keep working on it and report back as soon as I have more progress.

In the meantime thank you once more for all of your careful testing! It's really precious for us!

Best,
Marco

mmolari · 2023-06-21T13:29:26Z

Hi Sumit,
just wanted to give you an update. The bug was caused by an edge-case in the merging of two alignments, that now should be handled correctly by the code. Branch fix/issue-56 contains the fix. I tested it on both of the datasets you provided and genomes are correctly reconstructed now. I'll perform some more checks before merging to master, but in the meantime if you need it or want to test it too you can pull the code from that branch.
Thanks again for all of your feedback!
Best,
Marco

sumit-walia · 2023-06-26T18:00:40Z

Hi Marco,

Thank you for your prompt response. I used the code on the aforementioned branch. It does resolve the issue and generate the correct fasta for both the datasets.

Best,
Sumit

- fixed issue #56 - pangraph `build` command is now deterministic, a random seed can be set with the `-r` option. - the `build` and `merge` commands now have a `-t` flag. When set sanity checks are performed on the graph. - fasta input files are checked for duplicated record names, and white lines between records are tolerated

mmolari self-assigned this Jun 6, 2023

mmolari added the bug Something isn't working label Jun 6, 2023

sumit-walia closed this as completed Jun 9, 2023

sumit-walia reopened this Jun 9, 2023

mmolari linked a pull request Jun 19, 2023 that will close this issue

Fix/issue 56 #57

Merged

mmolari closed this as completed in #57 Jun 27, 2023

mmolari mentioned this issue Jun 27, 2023

Fix/issue 56 #57

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Inaccurate output for Klebsiella pneumonia dataset #56

Inaccurate output for Klebsiella pneumonia dataset #56

sumit-walia commented Jun 5, 2023

mmolari commented Jun 6, 2023

sumit-walia commented Jun 6, 2023 •

edited

sumit-walia commented Jun 9, 2023 •

edited

mmolari commented Jun 10, 2023

mmolari commented Jun 21, 2023

sumit-walia commented Jun 26, 2023

Inaccurate output for Klebsiella pneumonia dataset #56

Inaccurate output for Klebsiella pneumonia dataset #56

Comments

sumit-walia commented Jun 5, 2023

mmolari commented Jun 6, 2023

sumit-walia commented Jun 6, 2023 • edited

sumit-walia commented Jun 9, 2023 • edited

mmolari commented Jun 10, 2023

mmolari commented Jun 21, 2023

sumit-walia commented Jun 26, 2023

sumit-walia commented Jun 6, 2023 •

edited

sumit-walia commented Jun 9, 2023 •

edited