Inaccurate output for Klebsiella pneumonia dataset #56
Comments
Dear Sumit,
Hi Marco, Thank you for your response. I would like to correct and confirm that we have used the following command: We attempted to replicate comparable issues on a subset of this dataset but couldn't. This is the smallest dataset where we observed the inconsistency. Thanks,
Hi Marco, We found a similar issue on the E. coli dataset (files are added in the same Google Drive folder). This might help you debug, as the file sizes are smaller. In the ecoli_50.json PanGraph output, the following five sequences are not represented correctly: NZ_CP006027.1 NZ_CP007136.1 NZ_CP006262.1 NZ_CP051001.1 NZ_CP010371.1. Thanks.
Dear Sumit, thank you once more! Having a smaller dataset with the issue is really helpful. In the meantime thank you once more for all of your careful testing! It's really precious for us! Best, |
Hi Sumit,
Hi Marco, Thank you for your prompt response. I used the code on the aforementioned branch. It does resolve the issue and generates the correct FASTA for both datasets. Best,
- fixed issue #56
- pangraph `build` command is now deterministic; a random seed can be set with the `-r` option.
- the `build` and `merge` commands now have a `-t` flag. When set, sanity checks are performed on the graph.
- FASTA input files are checked for duplicated record names, and blank lines between records are tolerated.
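For anyone who wants to pre-check their input before running `build`, the duplicated-record-name check mentioned above can be approximated outside of pangraph. The following is only a minimal sketch, not pangraph's own implementation: the function name `duplicated_record_names` is hypothetical, and it assumes the record name is the first whitespace-delimited token after `>`.

```python
from collections import Counter


def duplicated_record_names(fasta_text):
    """Return the sorted record names that occur more than once.

    Assumption: the record name is the first token of the header line.
    Blank lines between records are skipped rather than treated as errors.
    """
    names = []
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        if line.startswith(">"):
            tokens = line[1:].split()
            names.append(tokens[0] if tokens else "")
    counts = Counter(names)
    return sorted(name for name, n in counts.items() if n > 1)


if __name__ == "__main__":
    # Toy input, not taken from the datasets in this issue.
    toy = ">seq_a first copy\nACGT\n\n>seq_b\nGGCC\n>seq_a second copy\nTTAA\n"
    print(duplicated_record_names(toy))
```

An empty list means no record name is repeated under that assumption.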
Hi All,
We attempted to construct a Pangraph using the Klebsiella pneumoniae dataset. The raw sequences and the Pangraph output JSON file are available here. However, we encountered the following issue in the output file:
In the klebs_100 PanGraph JSON file, we believe that the following five sequences are not represented correctly: NZ_CP013985.1 NZ_LR607362.1 NZ_CAKACX010000001.1 NZ_QIXX01000100.1 NZ_JARAMW010000001.1
We noticed that the unique string of eight nucleotides (TGCTTTTT or its reverse complement AAAAAGCA) is missing from these sequences, despite being present in the raw sequence. For example, in the sequence, 'NZ_CP013985.1', this string should be present in the block 'STETJDHNZS' (the 199th block on the path). This block is represented by the reverse strand. We manually reconstructed this block using the mutations in the PanGraph and found that the string of 8 nucleotides (TGCTTTTT) is missing from the reverse complement. We recommend that you do the same to verify. We believe the block is missing an 'insertion' at position 2775 for the sequence 'NZ_CP013985.1'. The nucleotides of the insertion should be AAAAAGCA.
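The check described above can be sketched in a few lines of Python. This is an illustrative helper rather than part of pangraph: the function names are hypothetical, and it assumes the raw sequence and the sequence reconstructed from the graph are both available as plain strings.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA string (A, C, G, T only)."""
    complement = str.maketrans("ACGTacgt", "TGCAtgca")
    return seq.translate(complement)[::-1]


def motif_positions(seq, motif):
    """Return 0-based start positions where the motif or its reverse
    complement occurs in seq (case-insensitive)."""
    seq = seq.upper()
    targets = {motif.upper(), reverse_complement(motif.upper())}
    k = len(motif)
    return [i for i in range(len(seq) - k + 1) if seq[i:i + k] in targets]


if __name__ == "__main__":
    # The two 8-mers named in this report are reverse complements of each other.
    print(reverse_complement("TGCTTTTT"))
    # Toy sequence, not taken from the dataset.
    print(motif_positions("GGAAAAAGCACC", "TGCTTTTT"))
```

Running `motif_positions` on both the raw sequence and the reconstructed one, and comparing the two results, would show whether the 8-mer was lost during reconstruction.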
We have used the following command to build the Pangraph:
pangraph --circular klebs_100.fa
Thanks,
Sumit