Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Regenerate Lexer.hs with latest Alex (fix #8892) #8896

Merged
merged 2 commits into from
Apr 26, 2023

Conversation

hsyl20
Copy link
Collaborator

@hsyl20 hsyl20 commented Apr 4, 2023

Regenerate Lexer.hs with Alex 3.2.7.2 to fix issue ##8892 (out-of-bound access due to haskell/alex#223).


Please include the following checklist in your PR:

Please also shortly describe how you tested your change. Bonus points for added tests!

@hsyl20 hsyl20 force-pushed the hsyl20/fix-lexer branch 2 times, most recently from 779f6dc to b5717e6 Compare April 4, 2023 08:14
Copy link
Collaborator

@ulysses4ever ulysses4ever left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! Sounds like something worth backporting to 3.10, @Mikolaj ?

Copy link
Member

@Mikolaj Mikolaj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, definitely, let's backport.

I guess some old test results require a bit of tweaking, too.

@Mikolaj
Copy link
Member

Mikolaj commented Apr 4, 2023

@mergify backport 3.10

@mergify
Copy link
Contributor

mergify bot commented Apr 4, 2023

backport 3.10

✅ Backports have been created

@hsyl20
Copy link
Collaborator Author

hsyl20 commented Apr 14, 2023

I've updated to require Alex 3.2.7.3 which doesn't generate files that require PatternGuards (see upstream ticket)

@ulysses4ever
Copy link
Collaborator

@hsyl20 when you have time, could you look into the test failure? I guess, just updating the reference should be enough? (I haven't looked too close though) Updating (via --accept) is described in cabal-testsuite/README.md.

@hsyl20
Copy link
Collaborator Author

hsyl20 commented Apr 17, 2023

I've updated the test. However why did we have to update this? It could be an Alex bug: previously, two unbreakable whitespaces were correctly reported at column 3; now they are reported at column 5, which makes no sense except if column is expressed in bytes instead of characters.

@hsyl20
Copy link
Collaborator Author

hsyl20 commented Apr 17, 2023

The new failure seems to confirm this.

hsyl20 added a commit to hsyl20/cabal that referenced this pull request Apr 17, 2023
We use a Latin1 generated parser with Alex, but we also parses Unicode
BOM, unbreakable spaces, etc. In recent Alex, the reported column isn't
expressed in Unicode chars anymore but in bytes/ASCII chars (probably
due to
haskell/alex@ae525e3
but I haven't checked), which broke our tests (see
haskell#8896).

To work around this we report indentation warnings at token start position,
instead of token end position (i.e. always 1). Otherwise position makes
no sense anymore for the user.
@hsyl20
Copy link
Collaborator Author

hsyl20 commented Apr 17, 2023

New commit with message:

Lexer: report indentation warnings at column 1

We use a Latin1 generated parser with Alex, but we also parses Unicode BOM, unbreakable spaces, etc. In recent Alex, the reported column isn't expressed in Unicode chars anymore but in bytes/ASCII chars (probably due to haskell/alex@ae525e3 but I haven't checked), which broke our tests (see #8896).

To work around this we report indentation warnings at token start position, instead of token end position (i.e. always 1). Otherwise position makes no sense anymore for the user.

I've tried to switch to a utf8 lexer instead of latin1 but it makes many more tests fail. I don't even know if .cabal files are supposed to be UTF-8 though.

@ulysses4ever
Copy link
Collaborator

This all sounds a bit worrying. @andreasabel, as the author of the corresponding alex commit, maybe you have a feeling for what's going on here?

For the sake of discussion, maybe we should preserve links to failing workflows. I know they will die at some point, but at least it gives us some time to contemplate on different approaches.

new_s = if GTE(offset,0#) && EQ(check,ord_c)
new_s = if GTE(offset,ILIT(0))
&& let check = alexIndexInt16OffAddr alex_check offset
in EQ(check,ord_c)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comes from this Alex commit: haskell/alex@e96c347 (#224)

I do not see how this should have introduced the problems we are seeing.
GTE ILIT and EQ are macros, yes, but they do not have side effects, so moving the check binding into a local let should not be a semantic change except making things lazier.

@andreasabel
Copy link
Member

This all sounds a bit worrying. @andreasabel, as the author of the corresponding alex commit, maybe you have a feeling for what's going on here?

I currently do not have time to dig deeper, but I think the latest alex commit is not to be blamed, see my comment: https://github.com/haskell/cabal/pull/8896/files#r1168715008

@hsyl20
Copy link
Collaborator Author

hsyl20 commented Apr 17, 2023

Yes I also think it's just the result of this change: haskell/alex@ae525e3#diff-007f894e1221eb8cafde8fdf0ee317bd32859704c27652c29cbb5417a9d5c37dR179-R185

Which has been introduced in Alex 3.2.7

hsyl20 added a commit to hsyl20/cabal that referenced this pull request Apr 17, 2023
We use a Latin1 generated parser with Alex, but we also parses Unicode
BOM, unbreakable spaces, etc. In recent Alex, the reported column isn't
expressed in Unicode chars anymore but in bytes/ASCII chars (probably
due to
haskell/alex@ae525e3
but I haven't checked), which broke our tests (see
haskell#8896).

To work around this we report indentation warnings at token start position,
instead of token end position (i.e. always 1). Otherwise position makes
no sense anymore for the user.
@andreasabel
Copy link
Member

Ok, this was 3 years ago, I don't remember much of it.
If I introduced a bug in Alex, please report it.
But maybe the source here needs some adjustment to work with Alex 3.2.7. If this is the case, the Alex violates the PVP, please also report it then.

@ulysses4ever
Copy link
Collaborator

Technically, this has two approvals and can be merged: the author is expected to add either merge-me or merge+squash label to the PR. But the alex issue is concerning. Indeed, may be worth reporting...

@hsyl20
Copy link
Collaborator Author

hsyl20 commented Apr 18, 2023

I've opened an upstream ticket: haskell/alex#227

@hsyl20 hsyl20 added the merge me Tell Mergify Bot to merge label Apr 18, 2023
@mergify mergify bot added the merge delay passed Applied (usually by Mergify) when PR approved and received no updates for 2 days label Apr 20, 2023
Regenerate Lexer.hs with Alex 3.2.7.2 to fix issue #haskell#8892 (out-of-bound
access due to haskell/alex#223).
We use a Latin1 generated parser with Alex, but we also parses Unicode
BOM, unbreakable spaces, etc. In recent Alex, the reported column isn't
expressed in Unicode chars anymore but in bytes/ASCII chars (probably
due to
haskell/alex@ae525e3
but I haven't checked), which broke our tests (see
haskell#8896).

To work around this we report indentation warnings at token start position,
instead of token end position (i.e. always 1). Otherwise position makes
no sense anymore for the user.
@mergify mergify bot merged commit 3af1731 into haskell:master Apr 26, 2023
mergify bot pushed a commit that referenced this pull request Apr 26, 2023
We use a Latin1 generated parser with Alex, but we also parses Unicode
BOM, unbreakable spaces, etc. In recent Alex, the reported column isn't
expressed in Unicode chars anymore but in bytes/ASCII chars (probably
due to
haskell/alex@ae525e3
but I haven't checked), which broke our tests (see
#8896).

To work around this we report indentation warnings at token start position,
instead of token end position (i.e. always 1). Otherwise position makes
no sense anymore for the user.

(cherry picked from commit 5f72880)
mergify bot added a commit that referenced this pull request Apr 26, 2023
Regenerate Lexer.hs with latest Alex (fix #8892) (backport #8896)
@andreabedini
Copy link
Collaborator

@hsyl20

I don't even know if .cabal files are supposed to be UTF-8 though.

The documentation confirms they are UTF-8.

The package description file must have a name ending in “.cabal”. It must be a Unicode text file encoded using valid UTF-8.

@hsyl20
Copy link
Collaborator Author

hsyl20 commented Apr 27, 2023

So in theory we could use Alex unicode support. But since then I've read in https://amelia.how/posts/parsing-layout.html that:

Nobody uses the Unicode support, not even GHC: The community wisdom is to trick Alex into reading Unicode by compressing Unicode classes down into high byte characters. Yeah, yikes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
merge delay passed Applied (usually by Mergify) when PR approved and received no updates for 2 days merge me Tell Mergify Bot to merge
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants