Skip to content

Bugfix related to newscan.x and creation of SLP#5

Merged
maxrossi91 merged 4 commits intomaxrossi91:mainfrom
oma219:bugfix
Feb 25, 2024
Merged

Bugfix related to newscan.x and creation of SLP#5
maxrossi91 merged 4 commits intomaxrossi91:mainfrom
oma219:bugfix

Conversation

@oma219
Copy link
Copy Markdown
Contributor

@oma219 oma219 commented Feb 7, 2024

Hi Max,

There are two separate issues that occur when a reference text begins with a trigger string, where mod p == 0. Here is a brief description of the issues.

  1. When using multiple threads, moni will use newscan.x to create the *.parse and *.dict file but newscan.x has a issue where it will not report the first trigger string as a phrase and this leads to pfp_thresholds creating an incorrect BWT.

    • Solution/Workaround: Not use any multi-threading since newscanNT.x does not have that same issue. It appears like newscan.x might require some more testing to ensure that there are no differences with the single-threaded version.
  2. When the first w-mer is a trigger string, downstream, compress_dictionary will write the compressed length of each phrase after stripping the trigger string but that will be 0. This causes an issue later on because when procdic combines the grammar of the parse and dictionary, it will try to create a rule for the first phrase but there is no text in the *.dicz to create a grammar for so it pushes all the rules off by one. Long story short, it creates an incorrect SLP so matching statistics are incorrect because the random-access to the text in the second pass it not correct.

    • Solution/Workaround: Remove that first phrase from *.dicz and *.dicz.len and decrement all phrase ID in the *.parse file to ensure the *.dicz and *.parse file have the same number of phrases and they correspond to each other.

Here is an example text that causes this issue for testing:

>example_reference
AAATATATATAAATATATATATAAGTAAATATAAATATTATATAGATATATAAATATATAAATATATATATAATATATAA
TAAATATATATATAAATAAATATAAATATATATAAATATATATAAATAAATATATATAAATATATATAAATATATAAATA
TATATATATATATATATATAGTACTGCCTGCTGGAGAGTCCATTCTAGATAACCAACAAACCCTCAAACTTAGTTCACTA
TGTTTACTCCCCAAATCTGCTTCCCTTCTCATGTGATGCCAGAAAAGATATTAAACATTCATCCCCTTGCCAAGGCTGGG

@oma219
Copy link
Copy Markdown
Contributor Author

oma219 commented Feb 21, 2024

Just to followup, I found a separate bug in the ms_rle_string.hpp, it has to do with comparing unsigned chars to signed chars. This particular bug would convert any unsigned chars > 127 to 1 when BWT is loaded from file into an r-index object. I suspect this issue has not been encountered much at all since most text files would not use an unsigned char > 127.

Copy link
Copy Markdown
Owner

@maxrossi91 maxrossi91 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the fix

@maxrossi91 maxrossi91 merged commit dee7c88 into maxrossi91:main Feb 25, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants