BUG: Fix segfault in read_csv with extremely large exponents #63098

ssam18 · 2025-11-12T18:09:03Z

Description

This PR fixes a segmentation fault that occurs when reading CSV files containing numbers with extremely large exponents in scientific notation (e.g., 4e492493924924).

Root Cause

The issue was an integer overflow in the xstrtod() function in pandas/_libs/src/parser/tokenizer.c. When parsing the exponent portion of scientific notation, the code accumulated digits into an int variable without bounds checking:

int n = 0;
while (isdigit_ascii(*p)) {
    n = n * 10 + (*p - '0');  // Integer overflow with large exponents
    num_digits++;
    p++;
}

With an exponent like 492493924924, the variable n would overflow, causing undefined behavior that manifests as a segmentation fault.
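
To make the overflow concrete, here is a minimal sketch that replays the same accumulation with a pre-check against INT_MAX. It reports where a plain int would overflow without actually triggering undefined behavior. The helper name first_overflow_digit is hypothetical and is not part of tokenizer.c.

```c
#include <limits.h>

/* Hypothetical helper, not part of pandas: returns the 1-based index of the
 * first digit at which `n = n * 10 + digit` would exceed INT_MAX, or 0 if
 * the whole digit run fits in an int. */
static int first_overflow_digit(const char *p) {
    int n = 0;
    int index = 0;
    while (*p >= '0' && *p <= '9') {
        int digit = *p - '0';
        index++;
        /* Check before multiplying, so the signed overflow never happens. */
        if (n > (INT_MAX - digit) / 10) {
            return index;
        }
        n = n * 10 + digit;
        p++;
    }
    return 0;
}
```

Assuming a 32-bit int, first_overflow_digit("492493924924") returns 10: the first nine digits fit, and the tenth pushes the value past INT_MAX. An exponent such as "308" returns 0.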

Solution

I added a maximum digit cap (MAX_EXPONENT_DIGITS = 4) when accumulating the exponent value:

Only the first 4 digits are used for the actual exponent value (allowing up to 9999)
Remaining digits are still consumed to maintain correct parsing position
This is sufficient since the decimal exponents of representable double-precision values are limited to roughly -324 to +308 anyway
The existing range check (DBL_MIN_EXP to DBL_MAX_EXP) will properly handle out-of-range values
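
The capping described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the actual patch. MAX_EXPONENT_DIGITS = 4 comes from the description; the helper name parse_exponent_capped, its signature, and the local is_ascii_digit stand-in for isdigit_ascii are hypothetical.

```c
#define MAX_EXPONENT_DIGITS 4 /* value taken from the PR description */

/* Stand-in for the tokenizer's isdigit_ascii(); assumed to match '0'..'9'. */
static int is_ascii_digit(char c) { return c >= '0' && c <= '9'; }

/* Hypothetical helper: accumulates at most MAX_EXPONENT_DIGITS digits into
 * the returned value, but advances *pp past every digit so the caller's
 * parsing position stays correct. The total digit count goes to *num_digits. */
static int parse_exponent_capped(const char **pp, int *num_digits) {
    const char *p = *pp;
    int n = 0;
    int count = 0;
    while (is_ascii_digit(*p)) {
        if (count < MAX_EXPONENT_DIGITS) {
            n = n * 10 + (*p - '0'); /* at most 9999, so no int overflow */
        }
        count++;
        p++;
    }
    *pp = p;
    *num_digits = count;
    return n;
}
```

For the input "492493924924" this sketch returns 4924 and leaves the pointer just past the twelfth digit. That value is above DBL_MAX_EXP, so the range check mentioned above would still see it as out of range. One caveat of the sketch as written: because it counts digit characters and not significant digits, an exponent with leading zeros such as "00005" would be read as 0. Whether the actual patch handles that case is not stated here.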

Testing

Added test_issue_63089.py with test cases covering:

The exact case from the issue report
Various edge cases with extremely large positive and negative exponents
Numbers with decimal points and large exponents

The fix prevents the overflow while maintaining correct parsing behavior for valid scientific notation.

Checklist:

Closes BUG: read_csv results in SIGSEGV #63089
Added test case
All existing tests should pass (haven't run full suite locally due to build dependencies)

Fixes pandas-dev#63089

When parsing scientific notation in CSV files, extremely large exponent values (e.g., '4e492493924924') caused integer overflow in the exponent accumulation loop, leading to undefined behavior and segmentation faults.

The issue occurred in xstrtod() at pandas/_libs/src/parser/tokenizer.c where exponent digits were accumulated without bounds checking:

    int n = 0;
    while (isdigit_ascii(*p)) {
        n = n * 10 + (*p - '0');  // Overflow here with large exponents
        ...
    }

Solution:
- Add a maximum exponent digits cap (MAX_EXPONENT_DIGITS = 4) to prevent overflow while still allowing valid scientific notation
- Continue consuming remaining digits to maintain correct parsing position
- The capped value (up to 9999) is sufficient since the subsequent range check (DBL_MIN_EXP to DBL_MAX_EXP) will catch invalid exponents

This fix prevents the overflow while maintaining correct parsing behavior for both valid and invalid exponent values.

Signed-off-by: Samaresh Kumar Singh <ssam3003@gmail.com>

ssam18 added 2 commits November 12, 2025 12:08

Apply code formatting (ruff, isort, clang-format)

643d12c

jbrockmendel added the "AI Slop" label (Suspected of being AI-generated, which is not welcome.) Nov 12, 2025
