
Refactor logos-derive from tree to graph #94

Merged
merged 101 commits into from
Mar 30, 2020

Conversation

maciejhirsz
Owner

@maciejhirsz commented Mar 22, 2020

Progress tracking:

  • new Graph primitives
  • parsing #[token] declarations to Graph
  • parsing #[regex] declarations to Graph
    • byte concatenations
    • zero-or-one (?) groups
    • zero-or-more (*) groups
    • one-or-more (+) groups
    • alternations (a|b)
    • group concatenation
    • reimplement full unicode handling with utf8-ranges
  • merging non-looping branches (foo with foobar)
  • merging looping branches ([0-9]+ with [0-9]+\.[0-9]+)
  • merging looping branches with non-looping branches ([a-z]+ with foobar)
  • token disambiguation
  • handle trivia
  • code generation, remove lexicon from the Logos trait
  • code optimization: parse multiple bytes at a time
  • callbacks and extras
  • code optimization: turn complex forks into jump tables

I now believe that most of the issues the current implementation has (#87, #81, #80, #79, #78, #70, and probably more) are due to the fact that trying to construct a tree is simply not the right way to approach the problem.

What I think is the solution is a complete rewrite of logos-derive from a tree to a graph, which can represent loops and arbitrary state jumps much more adequately, without the explosive nature of building up all possible permutations in a tree. All the nodes of the graph are going to be stored in a single Vec-based struct (called Graph) and referenced by their index in that Vec. The nodes are going to be immutable, so any permutations (merging forks) will have to create a new node with a new id.
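
Roughly the shape I have in mind, sketched here with made-up names (illustrative only, not the actual implementation):

// Illustrative only: nodes live in one Vec and are referenced by index;
// "mutating" a node really means pushing a new one and getting a new id.
type NodeId = usize;

enum Node {
    // Leaf producing a token, e.g. "IDENT".
    Leaf(&'static str),
    // Fork: byte-range arms jumping to other nodes, plus an optional miss edge.
    Fork {
        arms: Vec<(std::ops::RangeInclusive<u8>, NodeId)>,
        miss: Option<NodeId>,
    },
}

struct Graph {
    nodes: Vec<Node>,
}

impl Graph {
    // Nodes are immutable once pushed; merging forks produces a new node/id.
    fn push(&mut self, node: Node) -> NodeId {
        self.nodes.push(node);
        self.nodes.len() - 1
    }
}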

Here is the current (custom) debug print of what I imagine a simple [a-z]* regex should look like in the graph:

[
    :0 "IDENT",
    :1 [
        [a-z] ⇒ :1,
        ⤷ :0,
    ],
]

Node :0 is a token; node :1 is a fork with a single arm matching one byte against the range [a-z]. On a match we navigate back to node :1 (creating a loop); on a miss we navigate to node :0 and return a token. Generating Rust code out of this should be pretty straightforward: we can make every node a function definition and every jump a function call (loops shouldn't lose performance thanks to tail call recursion). There is going to be room for optimization in code generation, although LLVM is probably going to do a better job at figuring out how and when to inline things than I ever will.
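
As a hand-written sketch of that "every node is a function, every jump is a call" idea for the [a-z]* graph above (Lexer and Token are placeholder names, not the real generated code):

// Hand-written sketch, not generated output.
struct Lexer<'a> {
    source: &'a [u8],
    pos: usize,
}

enum Token {
    Ident,
}

impl<'a> Lexer<'a> {
    // Node :1 — one arm matching [a-z] that loops back to :1; miss (⤷) goes to :0.
    fn node_1(&mut self) -> Token {
        match self.source.get(self.pos).copied() {
            Some(b'a'..=b'z') => {
                self.pos += 1;
                self.node_1() // the "loop" is just a tail call back to :1
            }
            _ => self.node_0(),
        }
    }

    // Node :0 — leaf returning the token.
    fn node_0(&mut self) -> Token {
        Token::Ident
    }
}

In release builds LLVM should turn that self-call into a plain loop, which is why the tail-call shape shouldn't cost anything.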

This removes the need to mark forks as Plain | Maybe | Repeat, and it should also remove the need for the fallback on branches, which really was just a hack to make identifiers work alongside named keywords.

Going to leave this draft open for comments (CC #88).

@maciejhirsz
Owner Author

maciejhirsz commented Mar 22, 2020

While I'm at it, the derive will now report more than one error at a time, and the errors will be spanned to where the problem is:

error: `Token::End` has a discriminant value set. This is not allowed for Tokens.
 --> tests/tests/dev.rs:8:5
  |
8 | /     #[end]
9 | |     End = 1,
  | |___________^

error: Only one #[error] variant can be declared.
  --> tests/tests/dev.rs:14:5
   |
14 | /     #[token = "hello!"]
15 | |     #[error]
16 | |     Hello,
   | |_________^

error: Previously declared #[error]:
 --> tests/tests/dev.rs:5:5
  |
5 | /     #[error]
6 | |     Error,
  | |_________^

error: aborting due to 3 previous errors

@wirelyre

wirelyre commented Mar 22, 2020

I also started exploring this in wirelyre/logos@8db611da674deccdf9dbb2df4d37cc4d0f16a6d8.

My design was to not rely on tail recursion by using labeled loops:

let mut token = Error;

'n0: loop { match src.next() {
    'a' => 'n1: loop { match src.next() {
        'a' => { token = A; continue 'n1; }
        'b' => continue 'n0,
        _ => break 'n0, // miss: bail out with the last token set
    }},
    'c' => 'n2: loop { /* etc. */ },
    _ => break 'n0,
}}

return token;

This works whenever nodes are nested (no jumps from 'n1 to 'n2), which in practice is almost always the case. In other cases you have to duplicate part of the graph inside the loops.

I have some notes on how well LLVM tends to optimize that; I'm not sure how well it will inline the tail-recursive nodes, but it's bound to be less predictable.

I found it useful to:

  • Remove Regex and match only one character at a time
  • Replace Pattern with a bitset

Contrary to the code I wrote, I planned to:

  • Construct an NFA from all patterns at once, then generate the graph directly from that NFA
  • Remove the graph library since the algorithms aren't useful

@wirelyre

wirelyre commented Mar 22, 2020

Also I was going to use this disambiguation strategy:

  1. Take the longest sequence of bytes that match, and consider all tokens or regexes that match those bytes.
  2. If both a token and regex match, return the token.
  3. If two regexes match, return the first one. (But show a warning at compile time.)

This correctly deals with identifiers and keywords (let vs. letter), and with integers and floats (123 vs. 123.45).
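
In pseudo-Rust, the rule boils down to something like this (the Candidate type and its fields are made up for illustration):

// Illustrative only: pick a winner among everything that matched the input.
struct Candidate {
    name: &'static str,
    matched_len: usize,
    is_token: bool,    // true for #[token], false for #[regex]
    decl_order: usize, // order of declaration, used as the final tie-breaker
}

fn disambiguate(mut candidates: Vec<Candidate>) -> Option<Candidate> {
    // Rule 1: only the longest matches survive.
    let longest = candidates.iter().map(|c| c.matched_len).max()?;
    candidates.retain(|c| c.matched_len == longest);

    // Rule 2: a #[token] wins over a #[regex] of the same length.
    // Rule 3: among regexes, the first declared one wins (warn at compile time).
    candidates.sort_by_key(|c| (!c.is_token, c.decl_order));
    candidates.into_iter().next()
}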

@maciejhirsz
Owner Author

maciejhirsz commented Mar 23, 2020

  • Remove Regex and match only one character at a time.

There is an optimization in place where, for any branch, and for any fork with branches that have length > 1, you can read multiple bytes at once and avoid doing bounds checking on each one. This is especially useful when a branch is a byte sequence of 4 or 8 bytes: you can load a [u8; 4] or [u8; 8] and LLVM can optimize those compares into 32/64-bit integer instructions.
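
As a hand-written illustration (not the generated code; "else" is just an example keyword and remainder is a stand-in for the unread input):

// Illustrative only: one bounds check, then a 4-byte comparison that LLVM can
// lower to a single 32-bit integer compare.
fn branch_matches_else(remainder: &[u8]) -> bool {
    match remainder.get(..4) {
        Some(chunk) => chunk == b"else",
        None => false,
    }
}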

Pattern is definitely a mess; in this rewrite I've already opted not to use it. All my sequences are just bytes, while the fork uses a table 256 ids long, so I just splatter all the ranges onto it and never have to worry about finding where the sets overlap.
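
Roughly the shape of that, with invented names (not the real internals):

// Illustrative only: a fork keeps one entry per possible byte value, so
// overlapping ranges from different branches just get written per byte.
struct ForkTable {
    // None means "miss"; Some(id) is the node to jump to for that byte.
    targets: [Option<usize>; 256],
}

impl ForkTable {
    fn new() -> Self {
        ForkTable { targets: [None; 256] }
    }

    // Splat a byte range onto the table, pointing every byte in it at `node`.
    fn add_range(&mut self, range: std::ops::RangeInclusive<u8>, node: usize) {
        for byte in range {
            self.targets[byte as usize] = Some(node);
        }
    }
}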

@wirelyre

wirelyre commented Mar 23, 2020

you can read multiple bytes at once

Yes, sorry, I meant only "matching" single bytes in the graph, not actually in the generated code.

That seemed promising because it simplified the graph structure. And then any optimizations could logically be done on the graph as a complete data structure, rather than making the edges more complex, if that makes sense.

For instance this would seem to leave room for:

// tokens "aaaa", "aabb"; regex "[a-z]+"

match chunk_of_4 {
    b"aaaa" => { /* ... */ }
    b"aabb" => { /* ... */ }
    _ => { if regex(b[0]) && regex(b[1]) { /* ... */ } }
}

because even though all three patterns share the first three nodes, they'll end up in separate Branches; but if we have the whole graph we can still recognize the larger pattern.

Although writing this it's now clear that loops are a big tradeoff, because in this case you'd need to duplicate the regex state like eight times.

@maciejhirsz
Owner Author

Debugging took a while, but we are all green ✔️. There might still be some edge cases; it turns out the graph makes everything simpler and easier, but it's hard to produce a canonical structure for every possible permutation of regex.

With loop unwinding and simple lookup tables for multiple ranges in place, the numbers are getting there:

test identifiers                       ... bench:         813 ns/iter (+/- 56) = 958 MB/s
test keywords_operators_and_punctators ... bench:       2,563 ns/iter (+/- 105) = 831 MB/s

Next I need to add jump tables for expensive match branches and read multiple bytes at a time more aggressively when possible, and then I reckon this thing is home.
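
For what it's worth, the jump-table idea usually boils down to something like this (hand-written sketch with made-up branch indices, not the generated output):

// Illustrative only: map each byte to a small branch index via a 256-entry
// table, then match on the index instead of on a pile of byte-range arms.
static BRANCH: [u8; 256] = {
    let mut table = [0u8; 256]; // 0 = miss
    let mut byte = b'a';
    while byte <= b'z' {
        table[byte as usize] = 1; // 1 = identifier branch
        byte += 1;
    }
    table[b'_' as usize] = 1;
    table[b'"' as usize] = 2; // 2 = string branch, and so on
    table
};

fn dispatch(byte: u8) -> u8 {
    BRANCH[byte as usize]
}

The generated match then only needs a handful of arms, one per branch index, rather than one per byte range.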

@maciejhirsz
Owner Author

Optimized branching :)

test identifiers                       ... bench:         681 ns/iter (+/- 54) = 1143 MB/s
test keywords_operators_and_punctators ... bench:       2,315 ns/iter (+/- 162) = 920 MB/s

@ratmice
Contributor

ratmice commented Mar 30, 2020

So, I tried this out on a branch of a project of mine.

I had to fix a bug in my lexer (mystery solved!), but it went green. After fixing that bug, though, 0.10-rc2, which had been green, broke, and 0.9.7 is still red.

So this is the only branch/release currently passing my CI tests. There are a few minor differences in error messages (relative to the pre-mystery state) that I'll have to look into.
previously passing
this branch

@maciejhirsz
Owner Author

@ratmice great to hear. If you can either submit a PR or point me to whatever edge case regex you have that is breaking 0.10-rc2 or 0.9.7, I'm happy to have it running as a test here to ensure we don't get regressions.

@ratmice
Contributor

ratmice commented Mar 30, 2020

@maciejhirsz I don't know exactly what's going on. I had an errant #[token = r"\p{Whitespace}"] which should have been a regex rather than a token; I haven't dug into it to see what's happening yet.

The 0.9.7 cases are instances fixed by #53 (multi-byte reads in errors). I've also tried a second parser, and that one worked fine.

@maciejhirsz mentioned this pull request Mar 30, 2020
@maciejhirsz marked this pull request as ready for review March 30, 2020 21:33
@maciejhirsz
Owner Author

There are still things I know are suboptimal in the generated code, so some more fine-tuning is coming. For now, though, I think things are not bad at all.

test identifiers                       ... bench:         701 ns/iter (+/- 30) = 1111 MB/s
test keywords_operators_and_punctators ... bench:       2,006 ns/iter (+/- 70) = 1062 MB/s

Going to publish this as 0.10.0-rc3, to see if the regressions people reported are gone.
