Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[stdlib] Fix atol parsing for prefixed integer literals with leading underscores #3180

Closed
wants to merge 1 commit into from

Conversation

jjvraw
Copy link
Contributor

@jjvraw jjvraw commented Jul 5, 2024

This addresses #3182 to correctly parse prefixed string literals. Changes include:

  • Bug-fix
  • Added appropriate test cases to verify the new behaviour

When the base 2, 8, 16 are set or inferred (base 0), for a string prefixed appropriately, leading underscores should be allowed as per Python's integer literal grammar:

integer      ::=  decinteger | bininteger | octinteger | hexinteger
decinteger   ::=  nonzerodigit (["_"] digit)* | "0"+ (["_"] "0")*
bininteger   ::=  "0" ("b" | "B") (["_"] bindigit)+
octinteger   ::=  "0" ("o" | "O") (["_"] octdigit)+
hexinteger   ::=  "0" ("x" | "X") (["_"] hexdigit)+
nonzerodigit ::=  "1"..."9"
digit        ::=  "0"..."9"
bindigit     ::=  "0" | "1"
octdigit     ::=  "0"..."7"
hexdigit     ::=  digit | "a"..."f" | "A"..."F"

Currently, this such a string passed to atol raises. The changes in this PR fixes this issue.

I should note that I have another PR (#3207) to refactor, and clean up, atol for reusability to introduce stol as per request #2639. If this PR is merged, I will update #3207, which will require a minor tweak.

@jjvraw jjvraw requested a review from a team as a code owner July 5, 2024 21:58
@martinvuyk
Copy link
Contributor

Thanks for taking the time to do this, could you add some of the suggestions and tests for the other PR here? I see the same behavior missing (underscore and 0 at the beginning of non base 10 numbers)

@jjvraw jjvraw changed the title [stdlib] Refactor atol function and improve documentation [stdlib] Add leading underscore support and refactor code structure Jul 5, 2024
@jjvraw jjvraw force-pushed the refactor/atol branch 2 times, most recently from b457b4a to 98e3a69 Compare July 5, 2024 23:07
Copy link
Contributor

@martinvuyk martinvuyk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!
btw. loved this refactor

    var start = pos
    if start + 1 < str_len:
        var prefix_char = str_ref[start + 1]
        if str_ref[start] == "0" and (
            (base == 2 and (prefix_char == "b" or prefix_char == "B"))
            or (base == 8 and (prefix_char == "o" or prefix_char == "O"))
            or (base == 16 and (prefix_char == "x" or prefix_char == "X"))
        ):
            start += 2
    return start

@jjvraw jjvraw changed the title [stdlib] Add leading underscore support and refactor code structure [stdlib] Add leading underscore support and refactor code structure for atol Jul 5, 2024
@jjvraw jjvraw force-pushed the refactor/atol branch 6 times, most recently from 9e923ed to e2dc1e6 Compare July 6, 2024 15:32
stdlib/src/builtin/string.mojo Outdated Show resolved Hide resolved
stdlib/src/builtin/string.mojo Outdated Show resolved Hide resolved
stdlib/src/builtin/string.mojo Outdated Show resolved Hide resolved
stdlib/src/builtin/string.mojo Outdated Show resolved Hide resolved
@jjvraw
Copy link
Contributor Author

jjvraw commented Jul 6, 2024

Thanks @martinvuyk :) But single newline seems to be consistent with other docstring, no?

fn ord(s: String) -> Int:
"""Returns an integer that represents the given one-character string.
Given a string representing one character, return an integer
representing the code point of that character. For example, `ord("a")`
returns the integer `97`. This is the inverse of the `chr()` function.
Args:
s: The input string, which must contain only a single character.
Returns:
An integer representing the code point of the given character.
"""

@jjvraw jjvraw changed the title [stdlib] Add leading underscore support and refactor code structure for atol [stdlib] Fix atol parsing for prefixed integer literals with leading underscores Jul 7, 2024
@jjvraw
Copy link
Contributor Author

jjvraw commented Jul 7, 2024

I have repurposed this PR to focus on the mentioned issue.

@jjvraw jjvraw force-pushed the refactor/atol branch 4 times, most recently from ba3a8a5 to 6a0b673 Compare July 9, 2024 21:29
# starting "was_last_digit_undescore" to true such that
# if the first digit is an undesrcore an error is raised
var was_last_digit_undescore = True
var was_last_digit_undescore = real_base == 10
Copy link
Contributor Author

@jjvraw jjvraw Jul 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We want was_last_underscore to be false only when an accepted prefix is present, which is not necessarily when real_base != 10, but rather only if real_base is 2,8,16 and it contains a prefix, according to the lexical syntax of Python.

Thoughts @soraros, @martinvuyk

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right, I hadn't seen "Base must be >= 2 and <= 36, or 0." so that means this could receive other bases that have no prefix. So yeah totally, leading underscore should fail
Python:

print(int("_123", 30))
ValueError: invalid literal for int() with base 30: '_123'

so it should be

real_base in (2, 8, 16)

nice catch, maybe we could add some tests with assert raises (and also passing ones) with bases other than the classics

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will for sure add some test cases with the next commit. We will have to add a check for whether there exists a prefix as well though.

So something along the lines of real_base in (2, 8, 16) and has_prefix, as we could have atol("_ff", 16) which should raise as invalid.

Was playing around with (real_base & 0b11010) != real_base and has_prefix which would've have been fun... 10 and 18 breaks it though.

Copy link
Contributor

@martinvuyk martinvuyk Jul 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So something along the lines of real_base in (2, 8, 16) and has_prefix, as we could have atol("_ff", 16) which should raise as invalid.

yeah I guess str_ref[start] == "0" can be taken out of the if to a has_prefix var, nice catch again :D

Was playing around with (real_base & 0b11010) != real_base and has_prefix which would've have been fun...

Niiice, maybe this will work?

from bit import is_power_of_two
...
# first character as underscore is only valid for bases 2, 8, 16 with prefix
var was_last_digit_undescore = not (is_power_of_two(real_base) and has_prefix and real_base != 32)
...

Copy link
Contributor Author

@jjvraw jjvraw Jul 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the concept! Will have to catch 4 explicitly there as well.

We can use a mask, bitmask = (1 << 2) | (1 << 8) | (1 << 16) and then not ((1 << real_base) & bitmask) and has_prefix.

This seems to work. But it might be overcomplicated for a simple value range check.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah I think this was a fun exercise lol but let's just stick with tuple contains

I agree. At least we can defer further optimisation to a later PR should we find it necessary.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@martinvuyk, I couldn't get it to run. Cannot call cast on a Boolean.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then it will probably work by just putting int(real_base < 17 and has_prefix)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same issue @martinvuyk, but now for Int.

Copy link
Contributor

@martinvuyk martinvuyk Jul 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah but removing the .cast since it is a SIMD method, but never mind it's not important

docs/changelog.md Outdated Show resolved Hide resolved
@JoeLoser
Copy link
Collaborator

!sync

- Support leading underscores (bug fix) for prefixed literal
- Added relevant tests

Signed-off-by: Joshua James Venter <venter.joshua@gmail.com>
@JoeLoser
Copy link
Collaborator

!sync

@modularbot modularbot added the imported-internally Signals that a given pull request has been imported internally. label Jul 17, 2024
@modularbot
Copy link
Collaborator

✅🟣 This contribution has been merged 🟣✅

Your pull request has been merged to the internal upstream Mojo sources. It will be reflected here in the Mojo repository on the nightly branch during the next Mojo nightly release, typically within the next 24-48 hours.

We use Copybara to merge external contributions, click here to learn more.

@modularbot modularbot added merged-internally Indicates that this pull request has been merged internally merged-externally Merged externally in public mojo repo labels Jul 17, 2024
modularbot added a commit that referenced this pull request Jul 18, 2024
…th leading underscores (#43607)

[External] [stdlib] Fix atol parsing for prefixed integer literals with
leading underscores

This addresses #3182 to
correctly parse prefixed string literals. Changes include:
- Bug-fix
- Added appropriate test cases to verify the new behavior

When the base 2, 8, 16 are set or inferred (base 0), for a string
prefixed appropriately, leading underscores should be allowed as per
Python's integer literal grammar:

```
integer      ::=  decinteger | bininteger | octinteger | hexinteger
decinteger   ::=  nonzerodigit (["_"] digit)* | "0"+ (["_"] "0")*
bininteger   ::=  "0" ("b" | "B") (["_"] bindigit)+
octinteger   ::=  "0" ("o" | "O") (["_"] octdigit)+
hexinteger   ::=  "0" ("x" | "X") (["_"] hexdigit)+
nonzerodigit ::=  "1"..."9"
digit        ::=  "0"..."9"
bindigit     ::=  "0" | "1"
octdigit     ::=  "0"..."7"
hexdigit     ::=  digit | "a"..."f" | "A"..."F"
```

Currently, this such a string passed to `atol` raises. The changes in
this PR fixes this issue.

ORIGINAL_AUTHOR=Joshua James Venter
<67124214+jjvraw@users.noreply.github.com>
PUBLIC_PR_LINK=#3180

Co-authored-by: Joshua James Venter <67124214+jjvraw@users.noreply.github.com>
Closes #3180
MODULAR_ORIG_COMMIT_REV_ID: 71bba679c2eeeea09f0bd4393611da50bc6570a2
@modularbot
Copy link
Collaborator

Landed in 2aa4060! Thank you for your contribution 🎉

@modularbot modularbot closed this Jul 18, 2024
modularbot added a commit that referenced this pull request Sep 13, 2024
…th leading underscores (#43607)

[External] [stdlib] Fix atol parsing for prefixed integer literals with
leading underscores

This addresses #3182 to
correctly parse prefixed string literals. Changes include:
- Bug-fix
- Added appropriate test cases to verify the new behavior

When the base 2, 8, 16 are set or inferred (base 0), for a string
prefixed appropriately, leading underscores should be allowed as per
Python's integer literal grammar:

```
integer      ::=  decinteger | bininteger | octinteger | hexinteger
decinteger   ::=  nonzerodigit (["_"] digit)* | "0"+ (["_"] "0")*
bininteger   ::=  "0" ("b" | "B") (["_"] bindigit)+
octinteger   ::=  "0" ("o" | "O") (["_"] octdigit)+
hexinteger   ::=  "0" ("x" | "X") (["_"] hexdigit)+
nonzerodigit ::=  "1"..."9"
digit        ::=  "0"..."9"
bindigit     ::=  "0" | "1"
octdigit     ::=  "0"..."7"
hexdigit     ::=  digit | "a"..."f" | "A"..."F"
```

Currently, this such a string passed to `atol` raises. The changes in
this PR fixes this issue.

ORIGINAL_AUTHOR=Joshua James Venter
<67124214+jjvraw@users.noreply.github.com>
PUBLIC_PR_LINK=#3180

Co-authored-by: Joshua James Venter <67124214+jjvraw@users.noreply.github.com>
Closes #3180
MODULAR_ORIG_COMMIT_REV_ID: 71bba679c2eeeea09f0bd4393611da50bc6570a2
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
imported-internally Signals that a given pull request has been imported internally. merged-externally Merged externally in public mojo repo merged-internally Indicates that this pull request has been merged internally
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants