
Some variables inappropriately using int32? #8

Closed
JohnMMa opened this issue Jan 19, 2024 · 3 comments
Labels
bug Something isn't working

Comments

JohnMMa commented Jan 19, 2024

Hi,

I was running splitcode v1.29 with the following arguments:

splitcode -c tags.txt --x-only --assign --gzip -C 1 -N 2 -t 27 --outb=output/final_barcodes.fastq.gz --mapping=output/mapping.txt sample_S1_R1_001.fastq.gz sample_S1_R2_001.fastq.gz

Based on my knowledge of the issue, the important detail is that the input FASTQ files contain about 8 billion read pairs, so they are too large to share here.

The error happens at a certain point of the stdout output:

[...]
2132M reads processed (100.0% assigned)
2158M reads processed (-98.9% assigned)
[...]

All lines after that point report negative assignment percentages. The output files appear normal.

This means that between these two lines, the number of reads processed crossed a boundary that causes an integer overflow if int32 is used for the numerator, since int32 has the range [-2147483648 : 2147483647].

The output here relates to the following lines of ReadProcessor::processBuffer():

    if (numreads > 0 && numreads % 1000000 == 0 && mp.verbose) { 
        numreads = 0; // reset counter
        int nummapped = mp.sc.getNumMapped();

        std::cerr << '\r' << (mp.numreads/1000000) << "M reads processed";
        if (!mp.sc.always_assign) {
          std::cerr << " (" 
            << std::fixed << std::setw( 3 ) << std::setprecision( 1 ) << ((100.0*nummapped)/double(mp.numreads))
            << "% assigned)";
        } else {
          std::cerr << "         ";
        }
        std::cerr.flush();
    }

I suspect the problem has something to do with nummapped, which overflows when more than 2,147,483,647 read pairs get mapped. Since NovaSeq 6000 instruments can generate 20b clusters per run, and NovaSeq X instruments can generate 50b, I can foresee this particular overflow eventually becoming rather common if the code isn't changed. I wonder if there are more places in the code affected by this?

Yenaled (Collaborator) commented Jan 19, 2024

Thanks! Yeah, I actually noticed that strange output a while back (I was processing some large datasets myself) but hadn't had a chance to investigate further (I suspected it was an int overflow somewhere). I will fix this in the next release (in the next week or so).

Other parts of the code should be fine (and the functionality itself will be unaffected). It really only affects whenever we're counting the reads in a FASTQ file one-by-one for the entire duration of the run (i.e. numreads and nummapped -- which are just used for statistics / progress); otherwise, ints should never reach such a high value. But I'll go through the codebase and see where the 32-bit ints should be changed to 64-bit ints.

Yenaled added the "bug" label Jan 19, 2024
JohnMMa (Author) commented Jan 19, 2024

Just modified the OP to clarify that the FASTQs in question have ~8b pairs, instead of ~33b (the latter is the number of lines in the input).

Yenaled (Collaborator) commented Jan 25, 2024

Should be fixed in the latest (v0.29.2) release!

Yenaled closed this as completed Jan 25, 2024