
Some variables inappropriately using int32? #8

Closed
JohnMMa opened this issue Jan 19, 2024 · 3 comments
Labels
bug Something isn't working

Comments

JohnMMa commented Jan 19, 2024

Hi,

I was running splitcode v1.29 with the following arguments:

splitcode -c tags.txt --x-only --assign --gzip -C 1 -N 2 -t 27 --outb=output/final_barcodes.fastq.gz --mapping=output/mapping.txt sample_S1_R1_001.fastq.gz sample_S1_R2_001.fastq.gz

Based on my knowledge of the issue, the important detail is that the input FASTQ files contain about 8 billion read pairs, so they are too large to share here.

The error happens at a certain point of the stdout output:

[...]
2132M reads processed (100.0% assigned)
2158M reads processed (-98.9% assigned)
[...]

All lines after that point report negative assignment percentages. The output files appear normal.

This means that between these two lines, the number of reads processed crossed a boundary that causes an integer overflow if int32 is used for the numerator, since int32 has the range [-2147483648 : 2147483647].

The output here relates to the following lines of ReadProcessor::processBuffer():

    if (numreads > 0 && numreads % 1000000 == 0 && mp.verbose) { 
        numreads = 0; // reset counter
        int nummapped = mp.sc.getNumMapped();

        std::cerr << '\r' << (mp.numreads/1000000) << "M reads processed";
        if (!mp.sc.always_assign) {
          std::cerr << " (" 
            << std::fixed << std::setw( 3 ) << std::setprecision( 1 ) << ((100.0*nummapped)/double(mp.numreads))
            << "% assigned)";
        } else {
          std::cerr << "         ";
        }
        std::cerr.flush();
    }

I suspect the problem has something to do with nummapped, which overflows when more than 2,147,483,647 read pairs get mapped. Since NovaSeq 6000 instruments can generate 20b clusters per run, and NovaSeq X instruments can generate 50b, I can foresee this particular overflow eventually becoming rather common if the code isn't changed. I wonder if there are more places in the code affected by this?

Yenaled (Collaborator) commented Jan 19, 2024

Thanks! Yeah, I actually noticed that strange output a while back (I was processing some large datasets myself) but hadn't had a chance to investigate further (I suspected it was an int overflow somewhere). I will fix this in the next release (in the next week or so).

Other parts of the code should be fine (and the functionality itself will be unaffected). It really only affects whenever we're counting the reads in a FASTQ file one-by-one for the entire duration of the run (i.e. numreads and nummapped -- which are just used for statistics / progress); otherwise, ints should never reach such a high value. But I'll go through the codebase and see where the 32-bit ints should be changed to 64-bit ints.

Yenaled added the "bug" label Jan 19, 2024
JohnMMa (Author) commented Jan 19, 2024

Just modified the OP to clarify that the FASTQs in question have ~8b pairs, instead of ~33b (the latter is the number of lines in the input).

Yenaled (Collaborator) commented Jan 25, 2024

Should be fixed in the latest (v0.29.2) release!

Yenaled closed this as completed Jan 25, 2024