Bugfix related to newscan.x and creation of SLP by oma219 · Pull Request #5 · maxrossi91/moni

oma219 · 2024-02-07T16:28:52Z

Hi Max,

There are two separate issues that occur when a reference text begins with a trigger string, i.e. when the hash of its first w-mer satisfies hash mod p == 0. Here is a brief description of each.

1. When using multiple threads, moni uses newscan.x to create the *.parse and *.dict files, but newscan.x has an issue where it does not report the first trigger string as a phrase. This leads to pfp_thresholds creating an incorrect BWT.
   - Solution/Workaround: Do not use multi-threading, since newscanNT.x does not have the same issue. It appears that newscan.x might require more testing to ensure that its output does not differ from the single-threaded version.
2. When the first w-mer is a trigger string, compress_dictionary (downstream) writes the compressed length of each phrase after stripping the trigger string, and for the first phrase that length is 0. This causes an issue later on: when procdic combines the grammars of the parse and the dictionary, it tries to create a rule for the first phrase, but there is no text in the *.dicz to build a grammar from, so all the rules are shifted off by one. Long story short, it creates an incorrect SLP, so the matching statistics are incorrect because the random access to the text in the second pass is not correct.
   - Solution/Workaround: Remove that first phrase from *.dicz and *.dicz.len, and decrement all phrase IDs in the *.parse file, so that the *.dicz and *.parse files have the same number of phrases and correspond to each other.
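
The second workaround can be sketched in memory as follows. This is only an illustration, not the code from this PR: the `ParsedText` struct and `drop_empty_first_phrase` are hypothetical names, the phrase IDs are assumed to be 1-based with the empty phrase holding ID 1, and dropping parse entries that point at the removed phrase is likewise an assumption.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical in-memory view of the files named above.
struct ParsedText {
    std::vector<uint32_t> phrase_lens; // one length per phrase, as in *.dicz.len
    std::vector<uint32_t> parse;       // phrase IDs, as in *.parse (assumed 1-based)
};

// If the first phrase is empty (length 0 after stripping the trigger string),
// remove it and shift every remaining phrase ID down by one.
// Returns true when a phrase was removed.
bool drop_empty_first_phrase(ParsedText& t) {
    if (t.phrase_lens.empty() || t.phrase_lens.front() != 0) {
        return false; // nothing to fix
    }
    t.phrase_lens.erase(t.phrase_lens.begin());

    std::vector<uint32_t> fixed;
    fixed.reserve(t.parse.size());
    for (uint32_t id : t.parse) {
        if (id <= 1) {
            continue; // assumption: entries for the removed phrase are dropped
        }
        fixed.push_back(id - 1); // keep IDs aligned with the shortened dictionary
    }
    t.parse.swap(fixed);
    return true;
}
```

For example, lengths `{0, 3, 5}` with parse `{1, 2, 3, 2}` would become lengths `{3, 5}` with parse `{1, 2, 1}`, so every remaining ID again points at a non-empty dictionary entry.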

Here is an example text that causes this issue for testing:

>example_reference
AAATATATATAAATATATATATAAGTAAATATAAATATTATATAGATATATAAATATATAAATATATATATAATATATAA
TAAATATATATATAAATAAATATAAATATATATAAATATATATAAATAAATATATATAAATATATATAAATATATAAATA
TATATATATATATATATATAGTACTGCCTGCTGGAGAGTCCATTCTAGATAACCAACAAACCCTCAAACTTAGTTCACTA
TGTTTACTCCCCAAATCTGCTTCCCTTCTCATGTGATGCCAGAAAAGATATTAAACATTCATCCCCTTGCCAAGGCTGGG

oma219 · 2024-02-21T14:35:30Z

Just to follow up: I found a separate bug in ms_rle_string.hpp that has to do with comparing unsigned chars to signed chars. This bug converts any unsigned char > 127 to 1 when the BWT is loaded from file into an r-index object. I suspect this issue has rarely been encountered, since most text files do not contain bytes > 127.
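
A minimal sketch of how this class of bug can arise is shown below. It is not the code from ms_rle_string.hpp: the function names are hypothetical, and the terminator value 1 is assumed from the "converted to 1" behavior described above. It also assumes the usual two's-complement conversion, where byte 200 becomes -56 in a signed char.

```cpp
// Hypothetical terminator value, assumed from the description above.
constexpr unsigned char kTerminator = 1;

// Buggy variant: the byte is held in a signed type before the comparison,
// so any byte > 127 turns negative and compares <= kTerminator.
unsigned char clamp_buggy(unsigned char byte) {
    signed char c = static_cast<signed char>(byte);
    if (c <= static_cast<signed char>(kTerminator)) {
        return kTerminator; // bytes 128..255 wrongly end up here
    }
    return static_cast<unsigned char>(c);
}

// Fixed variant: compare as unsigned so only bytes 0 and 1 are clamped.
unsigned char clamp_fixed(unsigned char byte) {
    if (byte <= kTerminator) {
        return kTerminator;
    }
    return byte;
}
```

With these definitions, `clamp_buggy(200)` yields 1 while `clamp_fixed(200)` yields 200; ASCII input such as `'A'` is unaffected by either variant, which fits the observation that the bug rarely shows up.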

maxrossi91

Thanks for the fix

oma219 added 3 commits February 6, 2024 17:05

avoid using newscan.x until it can be updated

28cd093

add code to handle case where first w-mer is a trigger string, in ord…

8dbbb19

…er to generate correct SLP downstream.

fix bug in ms_rle_string constructor that errorneously compares signe…

dd1d256

…d to unsigned chars

fixed issue with previous commit, make sure array indexes are all pos…

8ebdb1f

…itive

maxrossi91 approved these changes Feb 25, 2024

maxrossi91 merged commit dee7c88 into maxrossi91:main Feb 25, 2024

maxrossi91 merged 4 commits into maxrossi91:main from oma219:bugfix
