Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Very poor performance in some cases #5

Open
matthew-dean opened this issue Aug 23, 2023 · 7 comments
Open

Very poor performance in some cases #5

matthew-dean opened this issue Aug 23, 2023 · 7 comments

Comments

@matthew-dean
Copy link

matthew-dean commented Aug 23, 2023

One thing I've noticed when using this lookahead strategy is that while the successful parses are very fast, the failures are orders of magnitude slower than Chevrotain's default. I thought maybe it was during error logging, but I built a custom error message provider, and that had no effect.

In some of my debugging, I think this is because the alternations that are built can contain hundreds of entries by the time it's sent to buildNoViableAltMessage, whereas Chevrotain's default lookahead strategy might produce 10 for the same OR block.

There's obviously no benefit for an error message handler to receive hundreds or thousands of possible paths as there's no error message that could include all of them that would be useful in that circumstance, so could that be scaled back to a max limit of paths? (Maybe configurable?)

Addendum

So, I've found another side effect of this problem which is backtracking. I know Chevrotain recommends against backtracking if at all possible, but because I had a custom error message provider, I kept seeing buildNoViableAltMessage called but then no error thrown. I figured out that error messages are built even if they aren't used. So, backtracking, which is already warned may have poor performance, is exacerbated greatly when using backtracking + chevrotain-allstar. It takes Chevrotain a long time to recover from a backtracking rule because the thread is locked by this package building expected paths.

@matthew-dean matthew-dean changed the title Poor error performance Poor performance in error handling Aug 27, 2023
@matthew-dean
Copy link
Author

So, I think part of the performance hit was actually (possibly?) related to: #6. I'm not sure what I did, but after some refactoring, I'm not getting such a massive performance drain when building errors? I still feel like somehow the number of alternations should be capped somehow 🤔

@matthew-dean matthew-dean changed the title Poor performance in error handling Very poor performance in some cases Aug 29, 2023
@matthew-dean
Copy link
Author

@msujew Okay, I've narrowed this down quite a bit, and I've discovered it's not just a performance issue upon errors, but in some particular parsing paths cause hugely degraded performance in this package. (I'll happily concede that it may actually be a mistake in the grammar, but the effect of that is that no errors / validation problems are thrown, and this occurs when a file is parsed correctly.)

I've frozen a branch here.

If you check it out and do a pnpm install in the root, and then build css-parser (run pnpm build in the css-parser folder, you can go into the less-parser branch and run pnpm test.

I've narrowed down to a single test-case. It's a .less file that's actually very short. However, by shadowing the LLStarLookaheadStrategy class and logging functions, I've discovered that chevrotain-allstar can spend up to 12 seconds calling the function returned by buildLookaheadForAlternation. This results in 20 seconds (on an M1 Mac) parsing a file, the contents of which are here.

Any idea what's going on here?

@matthew-dean
Copy link
Author

I'm going to try to take a look tomorrow to see if I can set some breakpoints and do some logging and figure out what exactly is happening in that function that is taking 12 seconds.

@msujew
Copy link
Member

msujew commented Aug 30, 2023

@matthew-dean I cannot reproduce the performance degradation you're experiencing (on a few years old Intel). I wrote a small test for the file in question:

describe.only('can parse colors.less file quickly', () => {
    console.time('colors.less');
    const result = fs.readFileSync(__dirname + '/css/colors.less', 'utf8')
    const { cst, lexerResult, parser } = cssParser.parse(result)
    expect(lexerResult.errors.length).toBe(0)
    expect(parser.errors.length).toBe(0)
    console.timeEnd('colors.less');
})

And received:

colors.less: 63.231ms

Seems pretty slow, but that's mostly due to reading from disk and some warmup in the lookahead. The parser can probably be optimized quite a bit. Reading the file a second time right after the first call yields:

colors.less: 5.938ms

Note that I had to change the comment terminal for a successful parse to also tokenize // single line comments correctly:

-['comment', '\\/\\*[^*]*\\*+(?:[^/*][^*]*\\*+)*\\/'],
+['comment', '(\\/\\*[^*]*\\*+(?:[^/*][^*]*\\*+)*\\/|\\/\\/[^\\n\\r]*)'],

@matthew-dean
Copy link
Author

matthew-dean commented Aug 30, 2023

@msujew That's the wrong parser. The CSS Parser doesn't demonstrate that slow-down. It was the Less Parser. That's why comments were not parsing in colors.less, as CSS doesn't support line comments.

Here's what I've discovered so far.

This package really grinds to a halt with recursion, because it will predict paths per alt that encompass the remainder of the file.

Originally, I had this, which matches the CSS spec exactly:

$.RULE('declarationList', () => {
    $.OR({
      DEF: [
        {
          ALT: () => {
            $.OPTION(() => $.SUBRULE($.declaration))
            $.OPTION2(() => {
              $.CONSUME(T.Semi)
              $.SUBRULE3($.declarationList)
            })
          }
        },
        {
          ALT: () => {
            $.SUBRULE($.innerAtRule)
            $.SUBRULE($.declarationList)
          }
        },
        {
          ALT: () => {
            $.SUBRULE($.qualifiedRule, { ARGS: [{ inner: true }] })
            $.SUBRULE2($.declarationList)
          }
        }
      ]
    })
  })

This mirrors the railroad diagrams for a declaration list.

This didn't cause a problem until extending into the Less Parser. Less / Sass allow identifiers for inner qualified rules, which means long-chain ambiguous rule starts (a:b d e f g h i j k l m n o p is either an entire declaration, or the start of an inner qualified rule in Less/Sass).

The combination of those two things is what caused chevrotain-allstar to bork entirely and take 20 seconds to parse a file.

I intuited what this package was doing when I had logging turned on and the "ambiguous alternatives" warning was not only long, but contained all tokens for the remainder of the file, which it termed as prefixes.

I re-tooled the above declarationList to look like this, which has the exact same parsing outcome, but eliminates recursion:

$.RULE('declarationList', () => {
    let needsSemi = false
    $.MANY({
      GATE: () => !needsSemi || (needsSemi && $.LA(1).tokenType === T.Semi),
      DEF: () => {
        $.OR([
          {
            ALT: () => {
              $.SUBRULE($.declaration)
              needsSemi = true
            }
          },
          {
            ALT: () => {
              $.OR2([
                { ALT: () => $.SUBRULE($.innerAtRule) },
                { ALT: () => $.SUBRULE2($.qualifiedRule, { ARGS: [{ inner: true }] }) },
                { ALT: () => $.CONSUME(T.Semi) }
              ])
              needsSemi = false
            }
          }
        ])
      }
    })
  })

I guess I sort of mistakenly figured that this package would handle ambiguity, because it handles longer-path prediction, but anytime ambiguity happens, chevrotain-allstar shows a noticeable uptick in processing time. When it's ambiguity + recusion, then it really falls apart.

The good news is that with that refactoring, speed isn't as much as an issue. Parsing of colors.less is, right now, around 58ms. I still have weird ambiguity errors where there isn't any ambiguity, AFAICT, but I can discuss in a separate issue. (Derp, nope, there was ambiguity lol.)

I appreciate your help!

@msujew
Copy link
Member

msujew commented Aug 30, 2023

I see, good to know 👍

Yeah, as mentioned in the other issue, having ambiguities present (especially those with long prefixes) in your grammar can be quite the performance bottleneck. While chevrotain-allstar fixes some of the ambiguity issues appearing in the LL(k) strategy, there are some true ambiguities that just cannot be resolved until they are fully evaluated.

I assume with the recursion case, you actually hit the theoretical worse case boundary of the ALL(*) algorithm. Normally, ALL(*) is supposed to operate in O(n), but the worst case makes it explode into O(n^4). It's outlined in the original paper.

@matthew-dean
Copy link
Author

@msujew

While chevrotain-allstar fixes some of the ambiguity issues appearing in the LL(k) strategy, there are some true ambiguities that just cannot be resolved until they are fully evaluated.

This is really good to know! Thank you!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants