Fix `python.number` pattern #1028

0dminnimda · 2021-11-03T22:33:36Z

Python doesn't accept numbers with a _ at the beginning or end, or numbers with more than one consecutive _ in the allowed places:

>>> 69420
69420
>>> 69_420
69420
>>> 69__420
  File "<stdin>", line 1
    69__420
      ^
SyntaxError: invalid decimal literal
>>> 69_420_
  File "<stdin>", line 1
    69_420_
          ^
SyntaxError: invalid decimal literal
>>> _69_420
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name '_69_420' is not defined

>>> 03.1415
3.1415
>>> 0_3.14_15
3.1415
>>> 0__3.14_15
  File "<stdin>", line 1
    0__3.14_15
     ^
SyntaxError: invalid decimal literal
>>> 0_3.14__15
  File "<stdin>", line 1
    0_3.14__15
          ^
SyntaxError: invalid decimal literal
>>> 0_3.14_15_
  File "<stdin>", line 1
    0_3.14_15_
             ^
SyntaxError: invalid decimal literal
>>> 0_3._14_15
  File "<stdin>", line 1
    0_3._14_15
       ^
SyntaxError: invalid decimal literal
>>> 0_3_.14_15
  File "<stdin>", line 1
    0_3_.14_15
       ^
SyntaxError: invalid decimal literal
>>> _0_3.14_15
  File "<stdin>", line 1
    _0_3.14_15
    ^^^^^^^^^^
SyntaxError: invalid syntax. Perhaps you forgot a comma?

The same goes for complex numbers. And yes, Python recognizes _xxx as a name even when each x is a digit, but it's still not a number, so this doesn't affect us.

The current implementation only filters out numbers with _ at the beginning, so here's the fix for the other cases.

Hopefully, we still can make backward-incompatible changes, so it's fine to change IMAG_NUMBER to COMPLEX_NUMBER.

I also tested \d(?:_?\d+)* for DEC_NUMBER but haven't seen any significant performance changes (everything is within the normal range, considering that I was not using a stable environment).
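
As a rough illustration of the underscore rule described above, here is a minimal sketch using Python's `re` module and `\d(?:_?\d)*`, a close variant of the pattern mentioned in this comment. The helper name `is_digit_run` is made up for this example, and the pattern covers only the digit-run rule, not Python's full literal grammar (it does not reject leading zeros such as 0_7, for instance).

```python
import re

# Digits with at most one underscore between neighbouring digits.
# This is only the digit-run rule, not the complete Python literal grammar.
DIGIT_RUN = re.compile(r"\d(?:_?\d)*")


def is_digit_run(text: str) -> bool:
    """Return True if the whole string is a well-formed run of digits."""
    return DIGIT_RUN.fullmatch(text) is not None


for sample in ["69420", "69_420", "69__420", "69_420_", "_69_420"]:
    print(f"{sample!r}: {is_digit_run(sample)}")
```

Using `fullmatch` rather than `match` matters here: a trailing or doubled underscore leaves part of the input unmatched, which is how those cases get rejected.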

erezsh · 2021-11-04T05:08:09Z

Thanks! I'll have a look.

> Hopefully, we still can make backward-incompatible changes

We don't need to be backward-compatible with bugs :)

lark/grammars/python.lark

0dminnimda · 2021-11-04T17:48:10Z

This pattern

FLOAT_NUMBER.2: (DEC_NUMBER "." DEC_NUMBER? | "." DEC_NUMBER) (("E" | "e") ["+" | "-"] DEC_NUMBER)?
              | DEC_NUMBER (("E" | "e") ["+" | "-"] DEC_NUMBER)?

cannot parse 5.0_8e-6: it recognizes 5.0 as a float, then sees _ and tries to start a new terminal(?)/rule, but none starts with _, so it fails.

If I change it to

FLOAT_NUMBER.2: (DEC_NUMBER "." DEC_NUMBER? | "." DEC_NUMBER)
              | (DEC_NUMBER "." DEC_NUMBER? | "." DEC_NUMBER) (("E" | "e") ["+" | "-"] DEC_NUMBER)
              | DEC_NUMBER (("E" | "e") ["+" | "-"] DEC_NUMBER)?

then everything works fine.

The cause seems to be the regexp, because if I use

DIGIT: "0".."9"

DEC_NUMBER: DIGIT ("_"? DIGIT)*

instead of the previously used

DEC_NUMBER: /\d(?:_?\d)*/i

then both patterns for FLOAT_NUMBER work as intended.

This seems unintentional and therefore worth investigation.
@MegaIng @erezsh should I create an issue for that?
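
For comparison only, here is a hand translation of the first FLOAT_NUMBER pattern into a single Python `re` expression. This is a sketch under the assumption that the translation is faithful; it says nothing about how Lark itself compiles or combines terminals, which is the open question in this comment. The names `DEC`, `EXP` and `FLOAT` are chosen for this example.

```python
import re

# Hand translation of the grammar snippet into one regular expression.
DEC = r"\d(?:_?\d)*"          # DEC_NUMBER as a regexp
EXP = rf"[eE][+-]?{DEC}"      # ("E" | "e") ["+" | "-"] DEC_NUMBER
FLOAT = re.compile(
    rf"(?:{DEC}\.(?:{DEC})?|\.{DEC})(?:{EXP})?"  # first alternative
    rf"|{DEC}(?:{EXP})?"                         # second alternative
)

match = FLOAT.match("5.0_8e-6")
print(match.group() if match else None)
```

As in the quoted grammar, the second alternative has an optional exponent, so this expression on its own also matches plain integers; in the grammar that overlap is left to terminal priorities.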

lark/grammars/python.lark

0dminnimda · 2021-11-04T21:17:45Z

OK, so I adapted the tests from CPython.

Here are the valid cases:

0_0_0
4_2
1_0000_0000
0b1001_0100
0xffff_ffff
0o5_7_7
1_00_00.5
1_00_00.5e5
1_00_00e5_1
1e1_0
.1_4
.1_4e1
0b_0
0x_f
0o_5
1_00_00j
1_00_00.5j
1_00_00e5_1j
.1_4j
1_2.5+3_3j
.5_6j

0
0xffffffffffffffff
0Xffffffffffffffff
0o77777777777777777
0O77777777777777777
123456789012345678901234567890
0b100000000000000000000000000000000000000000000000000000000000000000000
0B111111111111111111111111111111111111111111111111111111111111111111111

0j
123456789012345678901234567890j

3.14
314.
0.314
000.314
.314
3e14
3E14
3e-14
3e+14
3.e14
.3e14
3.1e4

3.14j
314.j
0.314j
000.314j
.314j
3e14j
3E14j
3e-14j
3e+14j
3.e14j
.3e14j
3.1e4j

And the invalid ones (each should raise an exception):

0_
42_
1.4j_
0x_
0b1_
0xf_
0o5_
1_Else
0_b0
0_xf
0_o5
0_7
09_99
4_______2
0.1__4
0.1__4j
0b1001__0100
0xffff__ffff
0x___
0o5__77
1e1__0
1e1__0j
1_.4
1_.4j
1._4
1._4j
._5
._5j
1.0e+_1
1.0e+_1j
1.4_j
1.4e5_j
1_e1
1.4_e1
1.4_e1j
1e_1
1.4e_1
1.4e_1j
1+1.5_j_
1+1.5_j

_0
_42
_1.4j
_0x
_0b1
_0xf
_0o5
_1_Else
_0_b0
_0_xf
_0_o5
_0_7
_09_99
_4_______2
_0.1__4
_0.1__4j
_0b1001__0100
_0xffff__ffff
_0x__
_0o5__77
_1e1__0
_1e1__0j
_1_.4
_1_.4j
_1._4
_1._4j
_._5
_._5j
_1.0e+_1
_1.0e+_1j
_1.4_j
_1.4e5_j
_1_e1
_1.4_e1
_1.4_e1j
_1e_1
_1.4e_1
_1.4e_1j
_1+1.5_j
_1+1.5_j
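
One way to cross-check lists like these against the running interpreter is to ask Python's own parser. The sketch below uses the standard `ast` module; the helper name `is_number_literal` is made up for this example, and the result reflects whichever Python version runs it rather than the Lark grammar.

```python
import ast


def is_number_literal(source: str) -> bool:
    """Return True if `source` parses as exactly one numeric literal."""
    try:
        tree = ast.parse(source, mode="eval")
    except SyntaxError:
        # Covers cases such as "42_" or "4_______2".
        return False
    node = tree.body
    return (
        isinstance(node, ast.Constant)
        and isinstance(node.value, (int, float, complex))
        and not isinstance(node.value, bool)  # True/False are not numbers here
    )


for sample in ["4_2", "0x_f", ".1_4e1", "42_", "0_7", "_42", "1_2.5+3_3j"]:
    print(f"{sample!r}: {is_number_literal(sample)}")
```

Note that an entry such as 1_2.5+3_3j is a valid expression but not a single literal, and names such as _42 parse successfully as identifiers, so both are reported as non-literals by this check.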

0dminnimda · 2021-11-04T21:49:08Z

OK, now the PR is tested and ready for review and, if everything is fine, ready to merge!

If anyone spots an untested case, just write it here; I'll add it to this comment and test it.

lark/grammars/python.lark

MegaIng · 2021-11-05T08:00:51Z

Mind creating a new test_python_grammar with unittest inside the tests folder? If you (and maybe we as well) put so much effort into it, we should also test it to make sure it doesn't break.

If you don't want to create the unittest, I will do it.

Also, the reason I didn't add those tests till now is because I didn't manage to find CPython's tests for their parser. Where did you find them?

0dminnimda · 2021-11-05T08:58:34Z

> Also, the reason I didn't add those tests till now is because I didn't manage to find CPython's tests for their parser. Where did you find them?

Well, Python currently uses a PEG parser generator, so you may be interested in checking its tests; there are things like tests for match cases (I know you're trying to merge a PR with them).

CPython also has its own tests, and all of them live in Lib/test. I took my cases from test_int.py and test_grammar.py.
(It's definitely not the most rigorous testing; for example, there's test_float.py and test_complex.py, and I should probably add test cases from there...)

0dminnimda · 2021-11-05T09:04:15Z

> Mind creating a new test_python_grammar with unittest inside the tests folder?

Oh, OK. python.lark doesn't have a start rule, so I was just worried that if I simply used, for example, file_input* as the start, nothing would work...

MegaIng · 2021-11-05T09:14:40Z

No start also matches the behavior of the actual python grammar: Depending on what exactly you want to parse, use file_input (without *), eval_input or single_input

0dminnimda · 2021-11-05T10:03:40Z

> No start also matches the behavior of the actual python grammar: Depending on what exactly you want to parse, use file_input (without *), eval_input or single_input

There's a bug: if I use file_input and give it only one statement (like 5), the parser breaks...

It can't even parse

if x:
    5

And even if it could, I still wouldn't rely on that; I'd test each rule/terminal one by one.

MegaIng · 2021-11-05T10:04:27Z

Are you putting a newline at the end?

> And even if it could, I still wouldn't rely on that; I'd test each rule/terminal one by one.

That's probably better anyway.

0dminnimda · 2021-11-05T10:22:33Z

> Are you putting a newline at the end?

I shouldn't have to; it's weird to oblige me to do this.

Just as exec can consist of several evals, file_input should cover everything eval_input and single_input accept; it's a superset.

MegaIng · 2021-11-05T10:32:26Z

It's the same weird obligation that actual python grammar also has. The point is that any higher level parsing function can just always append a newline with no loss.

The problem is that it becomes annoying to express the correct behavior when you can't rely on every line ending in a newline.
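
The "always append a newline" idea can be sketched as a small wrapper that a higher-level parsing function would call before handing text to the parser. The helper name `with_trailing_newline` is hypothetical, and the actual parser call is deliberately left out.

```python
def with_trailing_newline(source: str) -> str:
    """Return `source` unchanged if it already ends in a newline, else append one."""
    return source if source.endswith("\n") else source + "\n"


# The snippet from the discussion, written without a final newline.
snippet = "if x:\n    5"
prepared = with_trailing_newline(snippet)
print(repr(prepared))
```

Because the helper leaves already-terminated input untouched, calling it unconditionally loses nothing, which is the point made above.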

erezsh · 2021-11-05T11:43:59Z

@0dminnimda It's not ideal, but making the last newline optional requires support for the EOF feature, which we're keeping back because the existing implementation has issues. It would also make the parser a little heavier, so it's not all positive, even if it worked.

0dminnimda · 2021-11-07T10:18:35Z

@erezsh ok, it looks like we are now ready to merge, I'm waiting for you.

erezsh · 2021-11-07T10:25:09Z

I wonder why you chose to test by specifying a different start for each number format, instead of just starting from number and making sure the result is the right token type?

The latter approach also has the advantage of testing for collision in token regexp/priority.
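
The idea of feeding every input through one entry point and asserting on the resulting token type can be sketched without Lark by using plain regular expressions. Everything below is an assumption made for illustration: the terminal names mirror those mentioned in this thread, but the patterns are hand-written stand-ins and not the contents of python.lark.

```python
import re

DEC = r"\d(?:_?\d)*"
FLOAT = rf"(?:{DEC}\.(?:{DEC})?|\.{DEC})(?:[eE][+-]?{DEC})?|{DEC}[eE][+-]?{DEC}"

# Hypothetical terminal table; the patterns are illustrative only.
TERMINALS = {
    "DEC_NUMBER": DEC,
    "HEX_NUMBER": r"0[xX](?:_?[0-9a-fA-F])+",
    "OCT_NUMBER": r"0[oO](?:_?[0-7])+",
    "BIN_NUMBER": r"0[bB](?:_?[01])+",
    "FLOAT_NUMBER": FLOAT,
    "IMAG_NUMBER": rf"(?:{FLOAT}|{DEC})[jJ]",
}


def matching_terminals(text: str) -> list:
    """Return the sorted names of all terminals that match the whole input."""
    return sorted(
        name for name, pattern in TERMINALS.items() if re.fullmatch(pattern, text)
    )


for sample in ["0xffff_ffff", "3.14", "3e14j", "4_2", "0b1_", "1_.4"]:
    print(f"{sample!r}: {matching_terminals(sample)}")
```

A result with exactly one name means the input was classified unambiguously, an empty list means it was rejected, and more than one name would point at the kind of collision this comment is concerned with.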

0dminnimda · 2021-11-07T11:05:36Z

> I wonder why you chose to test by specifying a different start for each number format, instead of just starting from number and making sure the result is the right token type?

I didn't think about this approach.
It seems better, gonna implement it

0dminnimda · 2021-11-07T12:07:09Z

@erezsh I updated tests, any other comments?

0dminnimda · 2021-11-09T08:53:36Z

@erezsh could you please review and possibly merge this PR?

erezsh · 2021-11-09T09:20:23Z

@0dminnimda Looks fine. But please rebase it. It should be at most 3 commits, instead of 23.

(I can rebase it for you, if you prefer)

MegaIng

Looks good for the most part, only one small thing left.

tests/test_python_grammar.py

0dminnimda · 2021-11-09T09:24:10Z

> (I can rebase it for you, if you prefer)

Why not squash and merge?

erezsh · 2021-11-09T09:26:51Z

@0dminnimda Squash is done by rebasing, no? Either way, whatever gets it down to a few commits.

0dminnimda · 2021-11-09T09:36:13Z

> Squash is done by rebasing, no?

No and that's the point.

> Either way, whatever gets it down to a few commits.

Well yeah, probably one commit is the best option

0dminnimda · 2021-11-09T09:37:10Z

Ok, @MegaIng do you have any other comments?

0dminnimda · 2021-11-09T09:39:22Z

Ok, now we're probably ready to merge, @erezsh

erezsh · 2021-11-09T10:06:47Z

Rebased and merged: 2ec9636

erezsh · 2021-11-09T10:07:55Z

Thanks for the PR!

0dminnimda · 2021-11-09T10:45:30Z

> Rebased and merged: 2ec9636

Yeah, that's why I was talking about squash and merge: the commit that ends up in the repo would be attributed to me. You know, fair credit/blame distribution.

erezsh · 2021-11-09T11:42:23Z

Sorry, I do the merges via the shell, because I want to run extra tests on the way, and fix up any remaining issues.

The commits are still signed to your name. So you will still get both credit and blame 😉

0dminnimda · 2021-11-09T12:43:30Z

> The commits are still signed to your name. So you will still get both credit and blame

Huh, that's true. I'm sorry, I was confused by 2ec9636, but apparently there are 518bd3a and 553eb41, so everything is fine, I guess :)

erezsh · 2021-11-09T13:02:22Z

No worries. I'll check if there's a way to connect the merge to a PR from the shell, for next time. After all, I want contributors to feel appreciated and rewarded. (as much as open source allows :)

0dminnimda added 2 commits November 3, 2021 23:48

Create python.lark

9f287a5

Update python.lark

b7a5533

MegaIng suggested changes Nov 4, 2021

lark/grammars/python.lark (outdated, resolved)

lark/grammars/python.lark (outdated, resolved)

fix hex, oct, bin, COMPLEX_NUMBER -> IMAG_NUMBER

8a67efb

0dminnimda commented Nov 4, 2021

lark/grammars/python.lark (outdated, resolved)

0dminnimda added 6 commits November 4, 2021 21:35

Stop using regexp for dec, hex, ocx, bin

b1efb37

fix hex, oct, bin

47b0ca8

fix dec

51b1e03

fully fix dec

9ca28cf

fix FLOAT_NUMBER and IMAG_NUMBER

79177d3

fix DEC_NUMBER

fcb078c

0dminnimda added 2 commits November 5, 2021 00:24

fix dec, hex, oct, bin

7d83ea9

fix DEC_NUMBER

2c064b4

beautify, remove _DEC_END

7422eed

0dminnimda commented Nov 4, 2021

lark/grammars/python.lark (resolved)

0dminnimda added 2 commits November 7, 2021 12:48

Create test_python_grammar.py

7c062e9

test: add commas and cases for dec

68f1f95

test_python_grammar.py: fix parser

78ffc78

0dminnimda added 4 commits November 7, 2021 14:51

test_python_grammar.py: enhance tests

3f9b81b

Update test_python_grammar.py

b932b27

try to fix test_invalid_number

7a81f7d

fix test_invalid_number

42c272a

erezsh requested a review from MegaIng November 9, 2021 09:15

MegaIng approved these changes Nov 9, 2021

tests/test_python_grammar.py (outdated, resolved)

move python_parser to setUpClass

5f749a9

MegaIng approved these changes Nov 9, 2021

format

a78bbe2

erezsh closed this Nov 9, 2021

0dminnimda deleted the Fix-python.number branch November 9, 2021 12:43

0dminnimda mentioned this pull request Nov 9, 2021

[BUG] regexp works differently compared to the same lark pattern #1031

Closed
