Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implements Scanner type for tokenizing nginx configs #80

Draft
wants to merge 4 commits into
base: main
Choose a base branch
from

Conversation

ornj
Copy link
Member

@ornj ornj commented Jan 26, 2024

Proposed changes

Implemented crossplane.Scanner that follows the example of other "scanner" types implemented in the Go stdlib. The existing Lex uses concurrency to make tokens available to the caller while managing "state". I think this design queue was taken from Rob Pike's 2011 talk on Lexical Scanning in Go. If you look at examples from the Go stdlib-- such as bufio.Scanner that Lex depends on-- you'd find that this isn't the strategy being employed and instead there is a struct that manages the state of the scanner and a method that used by the caller to advance the scanner to obtain tokens.

After a bit of Internet archeology, I found this post on golang-nuts from Rob Pike himself:

That talk was about a lexer, but the deeper purpose was to demonstrate how concurrency can make programs nice even without obvious parallelism in the problem. And like many such uses of concurrency, the code is pretty but not necessarily fast.

I think it's a fine approach to a lexer if you don't care about performance. It is significantly slower than some other approaches but is very easy to adapt. I used it in ivy, for example, but just so you know, I'm probably going to replace the one in ivy with a more traditional model to avoid some issues with the lexer accessing global state. You don't care about that for your application, I'm sure.

So: It's pretty and nice to work on, but you'd probably not choose that approach for a production compiler.

An implementation of a "scanner" using the more "traditional" model-- much of the logic is the same or very close to Lex-- seems to support the above statement.

$ go test -benchmem -run=^$ -bench "^BenchmarkScan|BenchmarkLex$" github.com/nginxinc/nginx-go-crossplane -count=1 -v
goos: darwin
goarch: arm64
pkg: github.com/nginxinc/nginx-go-crossplane
BenchmarkLex
BenchmarkLex/simple
BenchmarkLex/simple-10             70982             16581 ns/op          102857 B/op         37 allocs/op
BenchmarkLex/with-comments
BenchmarkLex/with-comments-10      64125             18366 ns/op          102921 B/op         43 allocs/op
BenchmarkLex/messy
BenchmarkLex/messy-10              28171             42697 ns/op          104208 B/op        166 allocs/op
BenchmarkLex/quote-behavior
BenchmarkLex/quote-behavior-10     83667             14154 ns/op          102768 B/op         24 allocs/op
BenchmarkLex/quoted-right-brace
BenchmarkLex/quoted-right-brace-10                 48022             24799 ns/op          103369 B/op         52 allocs/op
BenchmarkScan
BenchmarkScan/simple
BenchmarkScan/simple-10                           179712              6660 ns/op            4544 B/op         34 allocs/op
BenchmarkScan/with-comments
BenchmarkScan/with-comments-10                    133178              7628 ns/op            4608 B/op         40 allocs/op
BenchmarkScan/messy
BenchmarkScan/messy-10                             49251             24106 ns/op            5896 B/op        163 allocs/op
BenchmarkScan/quote-behavior
BenchmarkScan/quote-behavior-10                   240026              4854 ns/op            4456 B/op         21 allocs/op
BenchmarkScan/quoted-right-brace
BenchmarkScan/quoted-right-brace-10                87468             13534 ns/op            5056 B/op         49 allocs/op
PASS
ok      github.com/nginxinc/nginx-go-crossplane 13.676s

This alternative to Lex is probably a micro-optimization for many use cases. As the size and number of NGINX configurations that need to be analyzed grows, optimization can be a good thing as well as an API that feels familiar to Go developers who might use this tool for their own purposes.

Next steps:

  • Use Scanner to "parse" NGINX configurations. I think this should be done in place so that the existing API works as is, but we should also expose a way to allow the caller to provide the scanner.
  • Deprecate Lex in favor of Scanner. If we leave Lex in place then I don't think we would need a v2 of the crossplane package (yet).

Checklist

Before creating a PR, run through this checklist and mark each as complete.

  • I have read the CONTRIBUTING document
  • If applicable, I have added tests that prove my fix is effective or that my feature works
  • If applicable, I have checked that any relevant tests pass after adding my changes
  • I have updated any relevant documentation (README.md)

@ornj ornj force-pushed the scanner branch 3 times, most recently from 7f9daf9 to a103f17 Compare January 26, 2024 19:40
scanner.go Outdated Show resolved Hide resolved
ornj added 3 commits July 5, 2024 14:58
Implemented `crossplane.Scanner` that follows the example of other
"scanner" types implemented in the Go stdlib. The existing `Lex` uses
concurrency to make tokens available to the caller while managing
"state". I think this design queue was taken from Rob Pike's 2011 talk
on [Lexical Scanning in Go](https://go.dev/talks/2011/lex.slide). If you
look at examples from the Go stdlib-- such as `bufio.Scanner` that `Lex`
depends on-- you'd find that this isn't the strategy being employed and
instead there is a struct that manages the state of the scanner and a
method that used by the caller to advance the scanner to obtain tokens.

After a bit of Internet archeology, I found [this](https://groups.google.com/g/golang-nuts/c/q--5t2cxv78/m/Vkr9bNuhP5sJ)
post on `golang-nuts` from Rob Pike himself:

> That talk was about a lexer, but the deeper purpose was to demonstrate
> how concurrency can make programs nice even without obvious parallelism
> in the problem. And like many such uses of concurrency, the code is
> pretty but not necessarily fast.
>
> I think it's a fine approach to a lexer if you don't care about
> performance. It is significantly slower than some other approaches but
> is very easy to adapt. I used it in ivy, for example, but just so you
> know, I'm probably going to replace the one in ivy with a more
> traditional model to avoid some issues with the lexer accessing global
> state. You don't care about that for your application, I'm sure.

> So: It's pretty and nice to work on, but you'd probably not choose that
> approach for a production compiler.

An implementation of a "scanner" using the more "traditional" model--
much of the logic is the same or very close to `Lex`-- seems to support
the above statement.

```
go test -benchmem -run=^$ -bench "^BenchmarkScan|BenchmarkLex$" github.com/nginxinc/nginx-go-crossplane -count=1 -v
goos: darwin
goarch: arm64
pkg: github.com/nginxinc/nginx-go-crossplane
BenchmarkLex
BenchmarkLex/simple
BenchmarkLex/simple-10             70982             16581 ns/op          102857 B/op         37 allocs/op
BenchmarkLex/with-comments
BenchmarkLex/with-comments-10      64125             18366 ns/op          102921 B/op         43 allocs/op
BenchmarkLex/messy
BenchmarkLex/messy-10              28171             42697 ns/op          104208 B/op        166 allocs/op
BenchmarkLex/quote-behavior
BenchmarkLex/quote-behavior-10     83667             14154 ns/op          102768 B/op         24 allocs/op
BenchmarkLex/quoted-right-brace
BenchmarkLex/quoted-right-brace-10                 48022             24799 ns/op          103369 B/op         52 allocs/op
BenchmarkScan
BenchmarkScan/simple
BenchmarkScan/simple-10                           179712              6660 ns/op            4544 B/op         34 allocs/op
BenchmarkScan/with-comments
BenchmarkScan/with-comments-10                    133178              7628 ns/op            4608 B/op         40 allocs/op
BenchmarkScan/messy
BenchmarkScan/messy-10                             49251             24106 ns/op            5896 B/op        163 allocs/op
BenchmarkScan/quote-behavior
BenchmarkScan/quote-behavior-10                   240026              4854 ns/op            4456 B/op         21 allocs/op
BenchmarkScan/quoted-right-brace
BenchmarkScan/quoted-right-brace-10                87468             13534 ns/op            5056 B/op         49 allocs/op
PASS
ok      github.com/nginxinc/nginx-go-crossplane 13.676s
```

This alternative to `Lex` is probably a micro-optimization for many use
cases. As the size and number of NGINX configurations that need to be
analyzed grows, optimization can be a good thing as well as an API that
feels familiar to Go developers who might use this tool for their own
purposes.

Next steps:

- Use `Scanner` to "parse" NGINX configurations. I think this should be
  done in place so that the existing API works as is, but we should also
  expose a way to allow the caller to provide the scanner.
- Deprecate `Lex` in favor of `Scanner`. If we leave `Lex` in place then
  I don't think we would need a `v2` of the crossplane package (yet).
Stores the first error encountered by Scan() and checks it to make sure
scanning stops unrecoverably. The Err() method can use used to fetch the
last non-EOF error.
Fixed bug where the quoted token did not have `IsQuoted` set to `true`.
I added an additional lex fixture which shows both the existing lexer
and new scanner handle the case correctly.
@ornj
Copy link
Member Author

ornj commented Jul 5, 2024

Now supports @xynicole's changes to enable tokenizing Lua.

Benchmarks:

❯ go test -benchmem -run=^$ -bench "^(BenchmarkLex|BenchmarkLexWithLua|BenchmarkScanner|BenchmarkScannerWithLua)$" github.com/nginxinc/nginx-go-crossplane -count=1
goos: darwin
goarch: arm64
pkg: github.com/nginxinc/nginx-go-crossplane
BenchmarkLex/simple-10             57963             17756 ns/op          103049 B/op         39 allocs/op
BenchmarkLex/with-comments-10      60025             20067 ns/op          103112 B/op         45 allocs/op
BenchmarkLex/messy-10              26170             47822 ns/op          104400 B/op        168 allocs/op
BenchmarkLex/quote-behavior-10             74510             17693 ns/op          102961 B/op         26 allocs/op
BenchmarkLex/quoted-right-brace-10         43134             27752 ns/op          103560 B/op         54 allocs/op
BenchmarkLex/comments-between-args-10      78271             14866 ns/op          102937 B/op         27 allocs/op
BenchmarkLexWithLua/lua-basic-10           46273             26012 ns/op          105499 B/op         53 allocs/op
BenchmarkLexWithLua/lua-block-simple-10                    22514             54149 ns/op          108556 B/op        143 allocs/op
BenchmarkLexWithLua/lua-block-larger-10                    25983             46605 ns/op          108403 B/op         59 allocs/op
BenchmarkLexWithLua/lua-block-tricky-10                    33756             35067 ns/op          106684 B/op         66 allocs/op
BenchmarkScanner/simple-10                                163138              7084 ns/op            4648 B/op         36 allocs/op
BenchmarkScanner/with-comments-10                         144558              8100 ns/op            4712 B/op         42 allocs/op
BenchmarkScanner/messy-10                                  47570             25026 ns/op            6000 B/op        165 allocs/op
BenchmarkScanner/quote-behavior-10                        222280              5083 ns/op            4560 B/op         23 allocs/op
BenchmarkScanner/quoted-right-brace-10                     82656             14281 ns/op            5160 B/op         51 allocs/op
BenchmarkScanner/comments-between-args-10                 225475              4872 ns/op            4536 B/op         24 allocs/op
BenchmarkScannerWithLua/lua-basic-10                       93081             12833 ns/op            7866 B/op         66 allocs/op
BenchmarkScannerWithLua/lua-block-simple-10                31426             37989 ns/op           10924 B/op        156 allocs/op
BenchmarkScannerWithLua/lua-block-larger-10                37148             30723 ns/op           10770 B/op         72 allocs/op
BenchmarkScannerWithLua/lua-block-tricky-10                54890             22383 ns/op            9050 B/op         79 allocs/op
PASS
ok      github.com/nginxinc/nginx-go-crossplane 29.969s

Fixed up the Scanner logic to mirror changes made to support Lua
extension in Lex. Added a compat layer so that the existing Lua type can
be used with `Scanner` vs trying to refactor the implementation to
remove the channel. Doing so I think would result in further gains.

Benchmarks:

```
❯ go test -benchmem -run=^$ -bench "^(BenchmarkLex|BenchmarkLexWithLua|BenchmarkScanner|BenchmarkScannerWithLua)$" github.com/nginxinc/nginx-go-crossplane -count=1
goos: darwin
goarch: arm64
pkg: github.com/nginxinc/nginx-go-crossplane
BenchmarkLex/simple-10             57963             17756 ns/op          103049 B/op         39 allocs/op
BenchmarkLex/with-comments-10      60025             20067 ns/op          103112 B/op         45 allocs/op
BenchmarkLex/messy-10              26170             47822 ns/op          104400 B/op        168 allocs/op
BenchmarkLex/quote-behavior-10             74510             17693 ns/op          102961 B/op         26 allocs/op
BenchmarkLex/quoted-right-brace-10         43134             27752 ns/op          103560 B/op         54 allocs/op
BenchmarkLex/comments-between-args-10      78271             14866 ns/op          102937 B/op         27 allocs/op
BenchmarkLexWithLua/lua-basic-10           46273             26012 ns/op          105499 B/op         53 allocs/op
BenchmarkLexWithLua/lua-block-simple-10                    22514             54149 ns/op          108556 B/op        143 allocs/op
BenchmarkLexWithLua/lua-block-larger-10                    25983             46605 ns/op          108403 B/op         59 allocs/op
BenchmarkLexWithLua/lua-block-tricky-10                    33756             35067 ns/op          106684 B/op         66 allocs/op
BenchmarkScanner/simple-10                                163138              7084 ns/op            4648 B/op         36 allocs/op
BenchmarkScanner/with-comments-10                         144558              8100 ns/op            4712 B/op         42 allocs/op
BenchmarkScanner/messy-10                                  47570             25026 ns/op            6000 B/op        165 allocs/op
BenchmarkScanner/quote-behavior-10                        222280              5083 ns/op            4560 B/op         23 allocs/op
BenchmarkScanner/quoted-right-brace-10                     82656             14281 ns/op            5160 B/op         51 allocs/op
BenchmarkScanner/comments-between-args-10                 225475              4872 ns/op            4536 B/op         24 allocs/op
BenchmarkScannerWithLua/lua-basic-10                       93081             12833 ns/op            7866 B/op         66 allocs/op
BenchmarkScannerWithLua/lua-block-simple-10                31426             37989 ns/op           10924 B/op        156 allocs/op
BenchmarkScannerWithLua/lua-block-larger-10                37148             30723 ns/op           10770 B/op         72 allocs/op
BenchmarkScannerWithLua/lua-block-tricky-10                54890             22383 ns/op            9050 B/op         79 allocs/op
PASS
ok      github.com/nginxinc/nginx-go-crossplane 29.969s
```
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants