CLN: use integer parsing functions from stdlib #62658
base: main
Conversation
I think this is much more difficult and less performant than it needs to be because we have a suboptimal structure for the parser.

Instead of storing a `char **words` member, is there a way that we can define a struct like:

struct PandasParsedString {
  const char *str;
  size_t len;
  int64_t is_separator[2];
};

and then make the member in the parser a `PandasParsedString *words` array. As you parse each word you can assign these struct members once, instead of having to query them after the fact.

I'm assuming the `is_separator` member is a bit-packed field signifying whether a particular character is a separator or not (see other comment about a 128 character limit). We could implement this a few other ways, but I figure this is the most memory efficient.

With that information up front you can avoid having to do a character search for separators here long after the fact; if there are no separators you can avoid looping altogether, but if there are, you can move regions of characters in batches to the local buffer rather than going char by char.
#define ERROR_NO_DIGITS 1
#define ERROR_OVERFLOW 2
#define ERROR_INVALID_CHARS 3
#define ERROR_NO_MEMORY 4
There's already a PARSER_OUT_OF_MEMORY define in this header - does this provide anything different?
I missed it. Since we will avoid using dynamic memory, I will delete it soon and the memory error won't be applicable.
pandas/_libs/src/parser/tokenizer.c
Outdated
  }
}

char *start = malloc((chars_to_copy + 1) * sizeof(char));
For performance reasons we should avoid using the heap as much as possible. Given this only applies to numeric digits, I think it would be better to just stack allocate an array of a certain size.
Trying to be forward looking, we can probably get away with a local buffer of 128 bytes, since that covers the maximum character length of Arrow's Decimal256 (76), rounded up to a power of 2. Anything longer than that can reasonably not be parsed as numeric.
> For performance reasons we should avoid using the heap as much as possible. Given this only applies to numeric digits, I think it would be better to just stack allocate an array of a certain size.

Question: The approach of using a stack will not be thread safe. Is this a problem?

Another approach that I was thinking of was to use a do-while loop, verifying if it stopped in `tsep`. It will complicate the code a little bit, but will be thread safe and won't need dynamic memory.
Every thread gets its own stack, so the stack solution is thread safe while the heap is not. Perhaps you are mixing stack/local variables with the concept of `static`?
Of course we aren't doing any multithreaded parsing here, but that's a great question and thing to consider
> Every thread gets its own stack, so the stack solution is thread safe while the heap is not.

Thanks, I didn't know that.

> Perhaps you are mixing stack/local variables with the concept of static?

I first thought that I should make the array a global variable; that's the reason for the confusion.

I created a `buffer` of size 128 on the stack to remove `tsep`. With these changes, I removed the `malloc` call and the memory error.
pandas/_libs/src/parser/tokenizer.c
Outdated
 * @param p_item Pointer to verify
 * @return Non-zero integer indicating that it has a digit, 0 otherwise.
 */
static inline int has_digit_int(const char *str) {
- static inline int has_digit_int(const char *str) {
+ static inline bool has_digit_int(const char *str) {
We are far enough removed from C89 that we can use the bool type :-)
Done in 87789e6
pandas/_libs/src/parser/tokenizer.c
Outdated
/* Copy a string without `char_to_remove` into `output`,
 * while ensuring it's null terminated.
 */
static void copy_string_without_char(char *output, const char *str,
- static void copy_string_without_char(char *output, const char *str,
+ static void copy_string_without_char(char output[PROCESSED_WORD_CAPACITY], const char *str,
Don't think you need to pass this capacity as a separate argument, since it is always the same value
Done in 2287944
pandas/_libs/src/parser/tokenizer.c
Outdated
    output[i++] = *src;
  }
}
if (i < output_size) {
This is pretty wasteful to continually write null bytes. You could either memset the buffer before you send it, or ideally short circuit when you've processed all the necessary bytes
> continually write null bytes

Actually, it wasn't continually writing null bytes, just once after copying `src`. Anyway, I changed it to use `memset`.
pandas/_libs/src/parser/tokenizer.c
Outdated
static void copy_string_without_char(char *output, const char *str,
                                     char char_to_remove, size_t output_size) {
  size_t i = 0;
  for (const char *src = str; *src != '\0' && i < output_size; src++) {
It would be great to avoid character-by-character writes if we can, especially since this is a relatively sparse character to search for.
You might be limited with what you can do without some of the larger changes I suggested, but maybe even just finding a region and writing multiple bytes at once will be more performant.
Perhaps something like:

size_t pos = 0;
const char *oldptr = p_item;
const char *newptr;
while ((newptr = strchr(oldptr, tsep)) != NULL) {
  size_t len = newptr - oldptr;
  memcpy(&output[pos], oldptr, len);
  oldptr = newptr + 1;
  pos += len;
}
// copy the tail after the last separator and terminate
size_t tail = strlen(oldptr);
memcpy(&output[pos], oldptr, tail);
pos += tail;
output[pos] = '\0';

Might be an off by one and probably worth bounds checking. You might also want to use strchrnul instead of strchr to avoid a buffer overrun.

Just some brief unchecked thoughts - hope they help
I couldn't make `strchr` work, because I couldn't know where the null terminator would be. I tried your suggestion in 2bea3c2.

I don't understand the errors on macOS, but they are definitely related to the changes here. They weren't happening when I was using malloc. I will try a different solution that doesn't rely on transforming the string.
pandas/_libs/src/parser/tokenizer.c
Outdated
    return 0;
  }
}
static inline int64_t add_int_check_overflow(int64_t lhs, int64_t rhs,
There's no need to implement custom overflow detection. We already have macros that defer to compiler builtins (see the np_datetime.c module). You can refactor those if needed to make available here.
FWIW C23 also standardizes checked overflow functions. I'm not sure how far along MSVC is in implementing that, but for other platforms we can likely use the standard on newer compilers.
The pinned meson-python version doesn't allow compiling with `std=c23`. I don't know why stdckdint.h worked on my system with c11.

I will use the macro in `np_datetime.c` to check for overflow, but I will move it to the header.
I am still using my implementation for `uint` overflow, because the NumPy implementation seems specific to `int64` (at least for Windows).
> The pinned meson-python version doesn't allow compiling with `std=c23`. I don't know why stdckdint.h worked on my system with c11.

You can specify multiple standards with Meson, and it will pick the first that the compiler supports. So you can just set the option c_std=c23,c11.

> I don't know why stdckdint.h worked on my system with c11

While stdckdint.h may not have been standardized until after c11, that doesn't prevent your compiler from providing it if it has already implemented it.

> I am still using my implementation for `uint` overflow, because the NumPy implementation seems specific to `int64` (at least for Windows).

It's unfortunate the MSVC implementation doesn't use a generic macro like the gcc/clang versions do to make these work regardless of size. However, there's the ULongLongAdd function you can use instead of your own implementation:

https://learn.microsoft.com/en-us/windows/win32/api/intsafe/nf-intsafe-ulonglongadd

Generally opt for the compiler built-ins when available; they are almost assuredly faster, and mean less code we have to maintain :-)
> You can specify multiple standards with Meson, and it will pick the first that the compiler supports. So you can just set the option c_std=c23,c11

I got this error:

../../meson.build:2:0: ERROR: Value "c11,c23" (of type "string") for combo option "C language standard to use" is not one of the choices. Possible choices are (as string): "none", "c89", "c99", "c11", "c17", "c18", "c2x", "gnu89", "gnu99", "gnu11", "gnu17", "gnu18", "gnu2x".

It needs meson>=1.4
#define NPY_NO_DEPRECATED_API NPY_1_7_API_VERSION
#endif // NPY_NO_DEPRECATED_API

#if defined(_WIN32)
It's probably best to move these to the include/pandas/portable.h header.
_Static_assert(0,
    "Overflow checking not detected; please try a newer compiler");
#endif
// __has_builtin was added in gcc 10, but our muslinux_1_1 build environment
Not sure why this is a separate branch down here, but I think you can combine with the macros on lines 51/52
pandas/_libs/src/parser/tokenizer.c
Outdated
  return true;
case '+':
case '-':
  return isdigit_ascii(str[1]);
This will violate bounds checking if you pass a string of `+` or `-`
pandas/_libs/src/parser/tokenizer.c
Outdated
  return *str == '\0';
}

static int power_int(int base, int exponent) {
Why is int the right return type for this?
Mainly because I expect the result to be a small power of 10, up to 10**6
pandas/_libs/src/parser/tokenizer.c
Outdated
  exponent /= 2;
}

return result * base;
This looks like it could overflow easily
    ((number == pre_max) && (d - '0' <= dig_pre_max))) {
  number = number * 10 + (d - '0');
  d = *++p;
errno = 0;
Do we need to check that the `errno` isn't set already before just clearing it?
It doesn't seem necessary, this is just parsing a new number and error handling is done in parsers.pyx
pandas/_libs/src/parser/tokenizer.c
Outdated
d = *++p;
while (errno == 0 && tsep != '\0' && *endptr == tsep) {
  // Skip multiple consecutive tsep
  while (*endptr == tsep) {
Maybe I am misreading but its strange to see this in the invariant here and directly preceding it. Can we consolidate that logic somehow?
Another option is to move `endptr` to the next character and make sure it's a digit; if it's not, exit the loop.
That will probably read easier
pandas/_libs/src/parser/tokenizer.c
Outdated
    return 0;
  }
}
char *new_end = NULL;
No need to assign NULL
pandas/_libs/src/parser/tokenizer.c
Outdated
while (isspace_ascii(*p)) {
  ++p;
ptrdiff_t digits = new_end - endptr;
int64_t mul_result = power_int(10, (int)digits);
Ah OK - I think you are assembling this in pieces. However, I'd still suggest just getting rid of the separators in a local buffer and then just calling `strtoll` (or whatever function) on the whole buffer. There's a lot of nuance to range/value handling that's not worth reimplementing.
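The suggestion boils down to something like this (a sketch, not the PR's implementation; the buffer size and helper name are illustrative):

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

enum { BUFSIZE = 128 }; /* illustrative capacity, not the pandas define */

/* Copy digits into a local buffer, dropping tsep, then let strtoll do all
 * of the range/value handling. */
static int64_t parse_int64_with_tsep(const char *str, char tsep, int *error) {
  char buf[BUFSIZE];
  size_t n = 0;
  for (const char *p = str; *p != '\0'; p++) {
    if (*p == tsep)
      continue;
    if (n + 1 >= BUFSIZE) { /* leave room for the terminator */
      *error = 1;
      return 0;
    }
    buf[n++] = *p;
  }
  buf[n] = '\0';

  errno = 0;
  char *end;
  int64_t value = strtoll(buf, &end, 10);
  /* overflow, no digits, or trailing junk all surface here */
  *error = (errno == ERANGE || end == buf || *end != '\0');
  return value;
}
```

With this shape there is no hand-rolled digit accumulation at all; the only custom logic is the separator strip and the bounds check on the local buffer.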
The problem with the local buffer is that there were failing tests on macOS that I couldn't reproduce. The malloc solution worked, but the local buffer on the stack was failing.
That sounds like you were hitting some undefined behavior, but that would come from whatever the implementation was, not the general solution. We should definitely go back to that route and just see where we got stuck
Ok. I will address some of your comments here, commit, and then revert to that state.
pandas/_libs/src/parser/tokenizer.c
Outdated
  number = number * 10 + (d - '0');
  d = *++p;
errno = 0;
char *endptr = NULL;
Don't assign NULL
pandas/_libs/src/parser/tokenizer.c
Outdated
                                     const char *str, char char_to_remove) {
  char *dst = output;
  const char *src = str;
  // last character is reserved for null terminator.
There's no need to place a null terminator all the way at the end here - just place one where the string actually stops
This is a safeguard to prevent writing to the last position in the array and to guarantee the output is null terminated.
Try using this as a template for your implementation:

#include <stdio.h>
#include <string.h>

#define BUFSIZE 128

int main() {
  char buffer[BUFSIZE];
  const char *foo = "123,456,789";
  size_t len = strlen(foo);
  const char *endPtr = foo + len;

  size_t bytes_read = 0;
  const char *cursor = foo;
  while (1) {
    const char *next = strchrnul(cursor, ',');
    const size_t nbytes = next - cursor;
    if (bytes_read + nbytes >= BUFSIZE) {
      // do something here to handle overrun
    }
    // nbytes == 0 just skips consecutive separators - or should we error?
    memcpy(&buffer[bytes_read], cursor, nbytes);
    bytes_read += nbytes;
    if (next == endPtr)
      break;
    cursor = next + 1;
  }

  buffer[bytes_read] = '\0';
  printf("My string is: %s\n", buffer);
  return 0;
}
I adapted your template, and also created some tests with the thousands separator covering several edge cases.

The tests fail against the Python engine, because I am allowing multiple consecutive thousands separators, which are permitted on main:
pandas/pandas/_libs/src/parser/tokenizer.c
Lines 1868 to 1872 in 066a4f7
if (tsep != '\0') {
  while (1) {
    if (d == tsep) {
      d = *++p;
      continue;
Should I change the old behavior, or xfail the test for the Python engine?
pandas/_libs/src/parser/tokenizer.c
Outdated
 * so it won't null terminate the result.
 */
static void copy_string_without_char(char output[PROCESSED_WORD_CAPACITY],
                                     const char *str, char char_to_remove) {
I would still advise passing the length of the string as an argument here. That way you can short circuit on an issue where somehow the length of the string exceeds the buffer capacity.

It's probably better to have a return code for this function too, rather than void.
pandas/_libs/src/parser/tokenizer.c
Outdated
size_t len = next - src;
if (dst + len > end) {
  // Can't write here, str is too big
  errno = ERANGE;
Related to my previous comment about return codes, I think in user code it's better to return a code when you can rather than setting the global errno. errno is a static global, so it's not thread safe, and you can be mangling codes by doing this. I think it's also more of an artifact of the language than a modern error messaging solution.
pandas/_libs/src/parser/tokenizer.c
Outdated
while (*src != '\0' && dst < end) {
  const char *next = src;
  // find EOS or char_to_remove
  while (*next != '\0' && *next != char_to_remove) {
Similar comment to last review - its much easier to hold one invariant than to nest loops like this
I rewrote this function, and I think it addresses the rest of your comments.

- It receives the length of the string (had to use `strlen`).
- It uses the length of the string to avoid copying anything if there isn't space for it.
- The loop is a sliding window.
- It returns a status code, with 0 on success and `ERROR_WORD2BIG` if there isn't enough space.
pandas/_libs/src/parser/tokenizer.c
Outdated
// go to next available location to write
dst += len;
// Move past char to remove
src = *next == char_to_remove ? next + 1 : next;
Why wouldn't this always be jumping by `next + 1`?
`next` could be `'\0'` or `tsep`; if it's `tsep` we go to `next + 1`, if it's `'\0'` we stop at `next`.
pandas/_libs/src/parser/tokenizer.c
Outdated
}
errno = 0;
if (tsep != '\0' && strchr(p_item, tsep) != NULL) {
  char buffer[PROCESSED_WORD_CAPACITY];
I think this is undefined behavior, at least according to https://stackoverflow.com/a/79330647/621736. You may need to move the buffer declaration outside of this branch (can run ASAN/UBSAN to check)
It fixes it.
- Returns status code
- Simplify loop for a more common sliding window
- Use string length
nice work - headed in the right direction
left = right + 1;

// Exit after processing the entire string
if (left >= end_ptr) {
Do we need this with the main loop invariant?
It reaches there if the word ends with `char_to_remove`. I put it there to avoid `uint` underflow. I can put `left < end_ptr` in the loop invariant if you prefer.
/* Copy a string without `char_to_remove` into `output`.
 */
static int copy_string_without_char(char output[PROCESSED_WORD_CAPACITY],
I think for now just return -1 for error. It's a nice idea to use the define, but it's inconsistent with all of the other error handling we have in our code base (and CPython standards). We don't want every function/module to implement its own custom error handling.
pandas/_libs/src/parser/tokenizer.c
Outdated
const char *end_ptr = str + str_len;
size_t bytes_read = 0;

while ((right = memchr(left, char_to_remove, end_ptr - left)) != NULL) {
What's the reason for the strchr/memchr change? Also can you remind me again why strchrnul isn't working?
> What's the reason for the strchr/memchr change?

I prefer `memchr` because I can specify how many bytes it should read, so I can ensure it won't read past `end_ptr`, while `strchr` and `strchrnul` don't have this flexibility.

> Also can you remind me again why strchrnul isn't working?

I also was afraid that `strchrnul` wouldn't compile on Windows, mainly because I couldn't find a reference for it on cppreference.
Ah ok - yeah, it's a GNU extension. My mistake.

I like the reasoning here - let's use it in both instances.
  *error = ERROR_INVALID_CHARS;
  return 0;
  result = 0;
} else if (errno == ERANGE || result > int_max || result < int_min) {
I suppose this is unrelated to your PR, but why does this function accept `int_max` and `int_min` as arguments? Assuming those are actually set to `INT64_MAX` and `INT64_MIN`, this is a no-op, and potentially a waste of cycles if the compiler can't optimize it away.
Indeed, it's unnecessary. It's set to `INT64_MAX` and `INT64_MIN`:
pandas/pandas/_libs/parsers.pyx
Lines 1930 to 1931 in 066a4f7
data[i] = str_to_int64(word, INT64_MIN, INT64_MAX,
                       &error, parser.thousands)
Should I just check `errno`, or should I also change the function signature?
Changing the function signature is fine (let's do as a separate PR)
Co-authored-by: William Ayd <william.ayd@icloud.com>
Will have to do another pass tomorrow but generally this looks great - thanks for sticking with all of the feedback.
Have you benchmarked in its current state?
  d = *++p;
errno = 0;
char buffer[PROCESSED_WORD_CAPACITY];
if (tsep != '\0' && strchr(p_item, tsep) != NULL) {
Can you use memchr here for consistency?
#include "pandas/portable.h"
#include "pandas/vendored/klib/khash.h" // for kh_int64_t, kh_destroy_int64

#define PROCESSED_WORD_CAPACITY 128
Can you add a comment here that we choose 128 because the Arrow256 allows up to 76 decimal digits, and this is the next highest power of 2?
This was a suggestion from @WillAyd in #62623 (comment)
This simplifies the current parsing logic by utilizing existing stdlib functions to parse integers (`strtoll` and `strtoull`) instead of using our own. One disadvantage of this approach is that when there is a thousands separator (`tsep`), we allocate memory on the heap and process the existing string to remove the separator.