Fix python.number pattern #1028

Closed
wants to merge 25 commits into from

0dminnimda
Contributor

Python doesn't accept numeric literals with a leading or trailing _, or with more than one consecutive _ in the places where a single _ is allowed:

>>> 69420
69420
>>> 69_420
69420
>>> 69__420
  File "<stdin>", line 1
    69__420
      ^
SyntaxError: invalid decimal literal
>>> 69_420_
  File "<stdin>", line 1
    69_420_
          ^
SyntaxError: invalid decimal literal
>>> _69_420
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name '_69_420' is not defined
>>> 03.1415
3.1415
>>> 0_3.14_15
3.1415
>>> 0__3.14_15
  File "<stdin>", line 1
    0__3.14_15
     ^
SyntaxError: invalid decimal literal
>>> 0_3.14__15
  File "<stdin>", line 1
    0_3.14__15
          ^
SyntaxError: invalid decimal literal
>>> 0_3.14_15_
  File "<stdin>", line 1
    0_3.14_15_
             ^
SyntaxError: invalid decimal literal
>>> 0_3._14_15
  File "<stdin>", line 1
    0_3._14_15
       ^
SyntaxError: invalid decimal literal
>>> 0_3_.14_15
  File "<stdin>", line 1
    0_3_.14_15
       ^
SyntaxError: invalid decimal literal
>>> _0_3.14_15
  File "<stdin>", line 1
    _0_3.14_15
    ^^^^^^^^^^
SyntaxError: invalid syntax. Perhaps you forgot a comma?

The same goes for complex numbers. And yes, Python treats _xxx (where each x is a digit) as a name, but that's still not a number, so it doesn't affect us.

The current implementation only rejects numbers with a leading _, so here's the fix for the other cases.


Hopefully, we still can make backward-incompatible changes, so it's fine to change IMAG_NUMBER to COMPLEX_NUMBER

I also tested \d(?:_?\d+)* for DEC_NUMBER but haven't seen any significant performance changes (everything is within the normal range, considering that I was not using a stable environment).
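As a rough illustration of the rule described above, here is a minimal sketch using Python's standard re module rather than Lark. It compares \d(?:_?\d)* with the \d(?:_?\d+)* variant on the integer cases from the REPL session. The helper name is_dec_number is made up for this example, and these patterns are looser than CPython in one respect: they accept leading zeros such as 0_7, which CPython rejects as an integer literal.

```python
import re

# Two equivalent ways to allow at most one "_" between digits.
DEC_NUMBER = re.compile(r"\d(?:_?\d)*")
DEC_NUMBER_ALT = re.compile(r"\d(?:_?\d+)*")


def is_dec_number(text: str, pattern: re.Pattern = DEC_NUMBER) -> bool:
    """Return True if the whole string matches the decimal-digits pattern."""
    return pattern.fullmatch(text) is not None


if __name__ == "__main__":
    # Integer cases taken from the REPL session above.
    for text in ["69420", "69_420", "69__420", "69_420_", "_69_420"]:
        print(text, is_dec_number(text), is_dec_number(text, DEC_NUMBER_ALT))
```

Both patterns should agree on every input; the difference is only in how the regex engine groups runs of digits.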

@erezsh
Member

erezsh commented Nov 4, 2021

Thanks! I'll have a look.

Hopefully, we still can make backward-incompatible changes

We don't need to be backward-compatible with bugs :)

@0dminnimda
Contributor Author

0dminnimda commented Nov 4, 2021

This pattern

FLOAT_NUMBER.2: (DEC_NUMBER "." DEC_NUMBER? | "." DEC_NUMBER) (("E" | "e") ["+" | "-"] DEC_NUMBER)?
              | DEC_NUMBER (("E" | "e") ["+" | "-"] DEC_NUMBER)?

cannot parse 5.0_8e-6: it recognizes 5.0 as a float, then sees _ and tries to start parsing a new terminal(?)/rule, but none starts with _, so it fails.

If I change it to

FLOAT_NUMBER.2: (DEC_NUMBER "." DEC_NUMBER? | "." DEC_NUMBER)
              | (DEC_NUMBER "." DEC_NUMBER? | "." DEC_NUMBER) (("E" | "e") ["+" | "-"] DEC_NUMBER)
              | DEC_NUMBER (("E" | "e") ["+" | "-"] DEC_NUMBER)?

then everything works fine.

The cause seems to be the regexp, because if I use

DIGIT: "0".."9"

DEC_NUMBER: DIGIT ("_"? DIGIT)*

instead of previously used

DEC_NUMBER: /\d(?:_?\d)*/i

then both patterns for FLOAT_NUMBER work as intended.


This seems unintentional and therefore worth investigation.
@MegaIng @erezsh should I create an issue for that?
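For reference, the match the FLOAT_NUMBER pattern is meant to produce can be sketched with plain re. This is a hand translation of the grammar above, not Lark's own terminal compilation, so it does not reproduce the failure; it only shows that 5.0_8e-6 should be consumed as a single token. One assumption differs from the grammar text: the exponent is made mandatory in the second branch so that bare integers are not treated as floats. The names DEC, EXP, FLOAT and is_float are illustrative.

```python
import re

DEC = r"\d(?:_?\d)*"            # digits with single underscores between them
EXP = rf"[eE][+-]?{DEC}"        # exponent part, e.g. e-6 or E+1_0

# (DEC "." DEC? | "." DEC) EXP?   |   DEC EXP
FLOAT = re.compile(rf"(?:{DEC}\.(?:{DEC})?|\.{DEC})(?:{EXP})?|{DEC}{EXP}")


def is_float(text: str) -> bool:
    """Return True if the whole string is one float literal under this sketch."""
    return FLOAT.fullmatch(text) is not None


if __name__ == "__main__":
    for text in ["5.0_8e-6", "314.", ".1_4e1", "1e1_0", "1_.4", "1._4", "1.4_e1"]:
        print(text, is_float(text))
```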

@0dminnimda
Contributor Author

0dminnimda commented Nov 4, 2021

OK, so I adapted tests from CPython.

Here are the valid cases:

0_0_0
4_2
1_0000_0000
0b1001_0100
0xffff_ffff
0o5_7_7
1_00_00.5
1_00_00.5e5
1_00_00e5_1
1e1_0
.1_4
.1_4e1
0b_0
0x_f
0o_5
1_00_00j
1_00_00.5j
1_00_00e5_1j
.1_4j
1_2.5+3_3j
.5_6j

0
0xffffffffffffffff
0Xffffffffffffffff
0o77777777777777777
0O77777777777777777
123456789012345678901234567890
0b100000000000000000000000000000000000000000000000000000000000000000000
0B111111111111111111111111111111111111111111111111111111111111111111111

0j
123456789012345678901234567890j

3.14
314.
0.314
000.314
.314
3e14
3E14
3e-14
3e+14
3.e14
.3e14
3.1e4

3.14j
314.j
0.314j
000.314j
.314j
3e14j
3E14j
3e-14j
3e+14j
3.e14j
.3e14j
3.1e4j

And here are the invalid ones (each should raise an exception):

0_
42_
1.4j_
0x_
0b1_
0xf_
0o5_
1_Else
0_b0
0_xf
0_o5
0_7
09_99
4_______2
0.1__4
0.1__4j
0b1001__0100
0xffff__ffff
0x___
0o5__77
1e1__0
1e1__0j
1_.4
1_.4j
1._4
1._4j
._5
._5j
1.0e+_1
1.0e+_1j
1.4_j
1.4e5_j
1_e1
1.4_e1
1.4_e1j
1e_1
1.4e_1
1.4e_1j
1+1.5_j_
1+1.5_j

_0
_42
_1.4j
_0x
_0b1
_0xf
_0o5
_1_Else
_0_b0
_0_xf
_0_o5
_0_7
_09_99
_4_______2
_0.1__4
_0.1__4j
_0b1001__0100
_0xffff__ffff
_0x__
_0o5__77
_1e1__0
_1e1__0j
_1_.4
_1_.4j
_1._4
_1._4j
_._5
_._5j
_1.0e+_1
_1.0e+_1j
_1.4_j
_1.4e5_j
_1_e1
_1.4_e1
_1.4_e1j
_1e_1
_1.4e_1
_1.4e_1j
_1+1.5_j
_1+1.5_j
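One way to cross-check lists like these is to ask CPython itself. The sketch below uses the standard ast module; the helper name is_number_literal is made up for this example. It reports True only when the whole input parses as a single numeric constant, so expressions such as 1_2.5+3_3j (a sum of two literals) and names such as _42 come back False even though they are syntactically valid Python.

```python
import ast


def is_number_literal(source: str) -> bool:
    """Return True if CPython parses `source` as exactly one numeric literal."""
    try:
        tree = ast.parse(source, mode="eval")
    except SyntaxError:
        # Covers "invalid decimal literal" and similar tokenizer errors.
        return False
    node = tree.body
    return (
        isinstance(node, ast.Constant)
        and isinstance(node.value, (int, float, complex))
        and not isinstance(node.value, bool)  # True/False are ints in Python
    )


if __name__ == "__main__":
    for text in ["0b_0", "1_00_00.5e5", ".1_4j", "0_7", "1e_1", "_42"]:
        print(text, is_number_literal(text))
```

Running the valid and invalid lists through a helper like this before using them as grammar tests guards against transcription mistakes in the lists themselves.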

@0dminnimda
Contributor Author

OK, the PR is now tested and ready for review and, if everything is fine, ready to merge!

If anyone spots an untested case, just post it here; I'll add it to this comment and test it.

@MegaIng
Member

MegaIng commented Nov 5, 2021

Mind creating a new test_python_grammar with unittest inside the tests folder? If you (and maybe we as well) put so much effort into it, we should also test it to make sure it doesn't break.

If you don't want to create the unittest, I will do it.

Also, the reason I didn't add those tests till now is because I didn't manage to find CPython's tests for their parser. Where did you find them?

@0dminnimda
Contributor Author

Also, the reason I didn't add those tests till now is because I didn't manage to find CPython's tests for their parser. Where did you find them?

Well, Python currently uses a PEG parser generator, so you may be interested in checking its tests; they include things like tests for match cases (I know you're trying to merge a PR with them).

CPython also has its own tests, and all of them live in Lib/test. I took the cases from test_int.py and test_grammar.py.
(It's definitely not the most rigorous testing; for example, there are also test_float.py and test_complex.py, and I should probably add test cases from there...)

@0dminnimda
Contributor Author

Mind creating a new test_python_grammar with unittest inside the tests folder?

Oh, OK, python.lark doesn't define a start rule. I was just worried that if I used, for example, file_input* as the start, nothing would work...

@MegaIng
Member

MegaIng commented Nov 5, 2021

No start also matches the behavior of the actual python grammar: Depending on what exactly you want to parse, use file_input (without *), eval_input or single_input

@0dminnimda
Contributor Author

No start also matches the behavior of the actual python grammar: Depending on what exactly you want to parse, use file_input (without *), eval_input or single_input

There's a bug: if I use file_input and pass only one stmt (like 5), the parser breaks...

It can't even parse

if x:
    5

and even if it could, I still wouldn't rely on that; I'd test each rule/terminal one by one

@MegaIng
Member

MegaIng commented Nov 5, 2021

Are you putting a newline at the end?

and even if it could, I still wouldn't rely on that; I'd test each rule/terminal one by one

That's probably better anyway.

@0dminnimda
Contributor Author

0dminnimda commented Nov 5, 2021

Are you putting a newline at the end?

I shouldn't have to; it's weird to oblige me to do this.

Since exec input can consist of several evals, file_input should include everything that eval_input and single_input accept; it's a superset.

@MegaIng
Member

MegaIng commented Nov 5, 2021

It's the same weird obligation that actual python grammar also has. The point is that any higher level parsing function can just always append a newline with no loss.

The problem is that it becomes annoying to express the correct behavior when you can't rely on every line ending in a newline.
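The "always append a newline" idea can be sketched as a tiny wrapper. The helper name ensure_trailing_newline and the parser object in the commented call site are hypothetical; nothing here is part of Lark's API.

```python
def ensure_trailing_newline(source: str) -> str:
    """Return source unchanged if it already ends with a newline, else append one."""
    return source if source.endswith("\n") else source + "\n"


# Hypothetical call site: `parser` stands for any parser built with
# start="file_input"; the wrapper is applied once, just before parsing.
# tree = parser.parse(ensure_trailing_newline("if x:\n    5"))
```

Because the helper is idempotent, callers don't need to know whether the input was already newline-terminated.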

@erezsh
Member

erezsh commented Nov 5, 2021

@0dminnimda It's not ideal, but making the last newline optional requires support for the EOF feature, which we're keeping back because the existing implementation has issues. It would also make the parser a little heavier, so it's not all positive, even if it worked.

@0dminnimda
Contributor Author

@erezsh ok, it looks like we are now ready to merge, I'm waiting for you.

@erezsh
Member

erezsh commented Nov 7, 2021

I wonder why you chose to test by specifying a different start for each number format, instead of just starting from number and making sure the result is the right token type?

The latter approach also has the advantage of testing for collision in token regexp/priority.
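The idea of checking which terminal a number lands on, and catching collisions along the way, can be sketched without Lark. The regexes below are a hand-written approximation of Python's numeric literals using the standard re module. The terminal names follow the grammar's naming, but the patterns and the helpers matching_terminals and terminal_of are assumptions made for this example, not the contents of python.lark.

```python
import re

DIGITS = r"\d(?:_?\d)*"
EXPONENT = rf"[eE][+-]?{DIGITS}"
POINT_FLOAT = rf"(?:(?:{DIGITS})?\.{DIGITS}|{DIGITS}\.)"
FLOAT = rf"(?:(?:{POINT_FLOAT}|{DIGITS}){EXPONENT}|{POINT_FLOAT})"

TERMINALS = {
    "DEC_NUMBER": r"(?:[1-9](?:_?\d)*|0(?:_?0)*)",  # no leading zeros
    "HEX_NUMBER": r"0[xX](?:_?[0-9a-fA-F])+",
    "OCT_NUMBER": r"0[oO](?:_?[0-7])+",
    "BIN_NUMBER": r"0[bB](?:_?[01])+",
    "FLOAT_NUMBER": FLOAT,
    "IMAG_NUMBER": rf"(?:{FLOAT}|{DIGITS})[jJ]",
}
COMPILED = {name: re.compile(pattern) for name, pattern in TERMINALS.items()}


def matching_terminals(text: str) -> list:
    """Return the names of all terminals that match the whole string."""
    return [name for name, rx in COMPILED.items() if rx.fullmatch(text)]


def terminal_of(text: str) -> str:
    """Return the single matching terminal; raise if there is none or a collision."""
    matches = matching_terminals(text)
    if len(matches) != 1:
        raise ValueError(f"{text!r} matched {matches or 'no terminal'}")
    return matches[0]


if __name__ == "__main__":
    for text in ["42", "0xffff_ffff", "1_00_00.5e5", "1_00_00e5_1j", "0_7"]:
        print(text, matching_terminals(text))
```

A test written this way asserts both that a valid literal is recognized and that it is recognized as the expected type, so an overly greedy pattern in one terminal shows up as a collision or a wrong name instead of passing silently.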

@0dminnimda
Contributor Author

I wonder why you chose to test by specifying a different start for each number format, instead of just starting from number and making sure the result is the right token type?

I didn't think about this approach.
It seems better, gonna implement it

@0dminnimda
Contributor Author

@erezsh I updated tests, any other comments?

@0dminnimda
Contributor Author

@erezsh could you please review and possibly merge this PR?

@erezsh erezsh requested a review from MegaIng November 9, 2021 09:15
@erezsh
Member

erezsh commented Nov 9, 2021

@0dminnimda Looks fine. But please rebase it. It should be at most 3 commits, instead of 23.

(I can rebase it for you, if you prefer)

Member

@MegaIng MegaIng left a comment

Looks good for the most part, only one small thing left.

@0dminnimda
Contributor Author

(I can rebase it for you, if you prefer)

Why not squash and merge?

@erezsh
Member

erezsh commented Nov 9, 2021

@0dminnimda Squash is done by rebasing, no? Either way, whatever gets it down to a few commits.

@0dminnimda
Contributor Author

Squash is done by rebasing, no?

No and that's the point.

Either way, whatever gets it down to a few commits.

Well yeah, probably one commit is the best option

@0dminnimda
Contributor Author

Ok, @MegaIng do you have any other comments?

@0dminnimda
Contributor Author

Ok, now we're probably ready to merge, @erezsh

@erezsh
Member

erezsh commented Nov 9, 2021

Rebased and merged: 2ec9636

@erezsh erezsh closed this Nov 9, 2021
@erezsh
Member

erezsh commented Nov 9, 2021

Thanks for the PR!

@0dminnimda
Contributor Author

Rebased and merged: 2ec9636

Yeah, that's why I was talking about squash and merge: the commit that ends up in the repo would then be attributed to me, you know, fair credit/blame distribution.

@erezsh
Member

erezsh commented Nov 9, 2021

Sorry, I do the merges via the shell, because I want to run extra tests on the way, and fix up any remaining issues.

The commits are still signed to your name. So you will still get both credit and blame 😉

@0dminnimda
Contributor Author

The commits are still signed to your name. So you will still get both credit and blame

Huh, that's true. I'm sorry, I was confused by 2ec9636, but apparently there are 518bd3a and 553eb41, so everything is fine, I guess :)

@0dminnimda 0dminnimda deleted the Fix-python.number branch November 9, 2021 12:43
@erezsh
Member

erezsh commented Nov 9, 2021

No worries. I'll check if there's a way to connect the merge to a PR from the shell, for next time. After all, I want contributors to feel appreciated and rewarded. (as much as open source allows :)
