Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] Syntax highlighting is slow and inefficient. #4685

Closed
amano-kenji opened this issue Jul 23, 2022 · 12 comments
Closed

[BUG] Syntax highlighting is slow and inefficient. #4685

amano-kenji opened this issue Jul 23, 2022 · 12 comments
Labels

Comments

@amano-kenji
Copy link

amano-kenji commented Jul 23, 2022

Version of Kakoune

v2021.11.08

Reproducer

  • Try to open any file with 20,000 or more lines.
  • Try multi selection editing in the file.

Outcome

For files that have 20,000 lines or more, syntax highlighting is slow.

The file size is still less than 1 megabytes with a bit more than 20,000 lines.

Multi selection editing is also slow due to syntax highlighting. Perhaps, syntax highlighting is updated as I type each letter in editing mode.

neovim doesn't seem to apply syntax highlighting to the whole file. It seems to apply syntax highlighting to just visible parts. But, neovim's syntax highlighting is buggy because it applies syntax highlighting to visible parts only.

3D engines are efficient because they render only what they have to. Recent 3D engines even tweak level of detail dynamically to make rendering more efficient.

Expectations

Syntax highlighting should be efficient.

Additional information

No response

@mawww
Copy link
Owner

mawww commented Jul 24, 2022

Would you have an example source file to reproduce this on ? Kakoune does only apply highlighting to the currently displayed part of the buffer and tries to aggressively cache data to avoid reparsing. The only highlighting logic that uses the full buffer is region handling but this should only parse the whole buffer once, then only apply to modified lines between highlighting updates.

@amano-kenji
Copy link
Author

amano-kenji commented Jul 24, 2022

example.journal.txt

This is an example ledger journal file.

  • Try to select Yeti with %sYeti[Enter] and edit it.

With ledger filetype, doing this takes a long time even with no hook(\). With plain filetype, doing this is very fast.

@mawww
Copy link
Owner

mawww commented Jul 31, 2022

Took a look, this seems to be an issue in ledger.kak that defines highly costly highlighting (it seems to define regions that match every single character of the buffer)

@asb
Copy link

asb commented Aug 8, 2022

I'm seeing the same issue with a very large Markdown file (I keep all my notes in one big file). The ~8.5MiB file takes > 10s to display in Kakoune, though displays instantly if I rename it to .txt.

@mawww
Copy link
Owner

mawww commented Aug 8, 2022

@asb would you have an example file you could provide for the markdown case ?

@asb
Copy link

asb commented Aug 8, 2022

It looks like there's nothing in particular special about my notes file. The following script generates a ~5MiB file that takes >5 seconds to open with Kakoune.

#!/bin/sh

for i in $(seq 10000); do
  cat <<EOF >> big_markdown.md
# Title
## Subtitle
Some notes about things.

Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor
incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis
nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.
Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore
eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt
in culpa qui officia deserunt mollit anim id est laborum

* A list
  * With a sub-item
* Item 2
* Item 3
* Item 4
* Item 5
  * Item 5.1

EOF
done

krobelus added a commit to krobelus/kakoune that referenced this issue Aug 16, 2022
As reported in
mawww#4685 (comment)
ledger.kak defines a region end that matches every character of
the buffer. This causes performance issues for large buffers.
Since the affected regions are only ever filled with a single color,
just use a regex highlighter instead of a region highlighter.
This should fix the most serious performance issues.

In general, ledger.kak could use some improvements to use more
idiomatic highlighters.

Speedup on [example.journal.txt](mawww#4685 (comment))

	$ HOME=$PWD hyperfine -w 1 'git checkout HEAD'{~,}' -- :/rc/filetype/ledger.kak && ./kak.opt example.journal.txt -e "modeline-
	parse; hook global NormalIdle .* quit" -ui dummy'
	Benchmark 1: git checkout HEAD~ -- :/rc/filetype/ledger.kak && ./kak.opt example.journal.txt -e "modeline-parse; hook global NormalIdle .* quit" -ui dummy
	  Time (mean ± σ):     947.8 ms ±  35.1 ms    [User: 880.0 ms, System: 72.4 ms]
	  Range (min … max):   895.2 ms … 994.3 ms    10 runs
	 
	Benchmark 2: git checkout HEAD -- :/rc/filetype/ledger.kak && ./kak.opt example.journal.txt -e "modeline-parse; hook global NormalIdle .* quit" -ui dummy
	  Time (mean ± σ):     364.7 ms ±  13.7 ms    [User: 349.0 ms, System: 21.7 ms]
	  Range (min … max):   344.6 ms … 391.9 ms    10 runs
	 
	Summary
	  'git checkout HEAD -- :/rc/filetype/ledger.kak && ./kak.opt example.journal.txt -e "modeline-parse; hook global NormalIdle .* quit" -ui dummy' ran
	    2.60 ± 0.14 times faster than 'git checkout HEAD~ -- :/rc/filetype/ledger.kak && ./kak.opt example.journal.txt -e "modeline-parse; hook global NormalIdle .* quit" -ui dummy'

(The exact numbers depend on other optimizations I'm using.)
krobelus added a commit to krobelus/kakoune that referenced this issue Aug 16, 2022
As reported in
mawww#4685 (comment)
ledger.kak defines a region end that matches every character of
the buffer. This causes performance issues for large buffers.
Since the affected regions are only ever filled with a single color,
just use a regex highlighter instead of a region highlighter.
This fixes the most serious performance issues when loading
the file for the first time (slow updating of many selections will
be fixed by a different change).

In general, ledger.kak could use some improvements to use more
idiomatic highlighters.

Speedup on [example.journal.txt](mawww#4685 (comment))

	$ HOME=$PWD hyperfine -w 1 'git checkout HEAD'{~,}' -- :/rc/filetype/ledger.kak && ./kak.opt example.journal.txt -e "modeline-
	parse; hook global NormalIdle .* quit" -ui dummy'
	Benchmark 1: git checkout HEAD~ -- :/rc/filetype/ledger.kak && ./kak.opt example.journal.txt -e "modeline-parse; hook global NormalIdle .* quit" -ui dummy
	  Time (mean ± σ):     947.8 ms ±  35.1 ms    [User: 880.0 ms, System: 72.4 ms]
	  Range (min … max):   895.2 ms … 994.3 ms    10 runs
	 
	Benchmark 2: git checkout HEAD -- :/rc/filetype/ledger.kak && ./kak.opt example.journal.txt -e "modeline-parse; hook global NormalIdle .* quit" -ui dummy
	  Time (mean ± σ):     364.7 ms ±  13.7 ms    [User: 349.0 ms, System: 21.7 ms]
	  Range (min … max):   344.6 ms … 391.9 ms    10 runs
	 
	Summary
	  'git checkout HEAD -- :/rc/filetype/ledger.kak && ./kak.opt example.journal.txt -e "modeline-parse; hook global NormalIdle .* quit" -ui dummy' ran
	    2.60 ± 0.14 times faster than 'git checkout HEAD~ -- :/rc/filetype/ledger.kak && ./kak.opt example.journal.txt -e "modeline-parse; hook global NormalIdle .* quit" -ui dummy'

(The exact numbers depend on other optimizations I'm using.)
krobelus added a commit to krobelus/kakoune that referenced this issue Aug 16, 2022
… buffer

There have been proposals to add more language aliases to markdown.kak
(mawww#4592) and allow users to add their own aliases (mawww#4489).

To recap: various markdown implementations allow specifying aliases
for languages. For example, here is a code block that should be
highlighted as filetype "haskell" but isn't

	```hs
	-- highlight as haskell
	```

There are lots of aliases out in the wild - "pygmentize -L"lists some
but I don't think there is a canonical list.

Today we have a hardcoded list of supported filetypes. This is hard
to extend and it can also impact performance.
This patch simply attempts to load the module "hs" and the shared
highlighter "hs". This means that users can use this (obvious?) snippet
to add their own aliases:

	provide-module hs %{
		require-module haskell
		add-highlighter shared/hs ref haskell
	}

Untrusted Markdown files can load arbitrary modules, but that was
already true before, and modules are assumed to be trusted anyway.

Since language highlighters are now loaded *after* the generic
code-block highlighter, we need to make sure the language highlighters
take precedence. Do this by making them sub-regions of the generic one.

Closes mawww#4489

This improves performance on the [5MB Markdown
file](mawww#4685 (comment)).

	$ HOME=$PWD hyperfine -w 1 'git checkout HEAD'{~,}' -- :/rc/filetype/markdown.kak && ./kak.opt big_markdown.md -e "hook global NormalIdle .* quit" -ui dummy'
	Benchmark 1: git checkout HEAD~ -- :/rc/filetype/markdown.kak && ./kak.opt big_markdown.md -e "hook global NormalIdle .* quit" -ui dummy
	  Time (mean ± σ):      1.122 s ±  0.023 s    [User: 1.099 s, System: 0.025 s]
	  Range (min … max):    1.075 s …  1.157 s    10 runs

	Benchmark 2: git checkout HEAD -- :/rc/filetype/markdown.kak && ./kak.opt big_markdown.md -e "hook global NormalIdle .* quit" -ui dummy
	  Time (mean ± σ):     544.6 ms ±  13.5 ms    [User: 525.5 ms, System: 22.8 ms]
	  Range (min … max):   525.9 ms … 564.5 ms    10 runs

	Summary
	  'git checkout HEAD -- :/rc/filetype/markdown.kak && ./kak.opt big_markdown.md -e "hook global NormalIdle .* quit" -ui dummy' ran
	    2.06 ± 0.07 times faster than 'git checkout HEAD~ -- :/rc/filetype/markdown.kak && ./kak.opt big_markdown.md -e "hook global NormalIdle .* quit" -ui dummy'

(These numbers depend on other optimizations I'm using.)
krobelus added a commit to krobelus/kakoune that referenced this issue Aug 16, 2022
Some of our highlighters abuse the regions highlighter by using one
of these patterns as region end:

	'\K'
	'.\K'
	(?!\\).(?=>)|\n
	(?!\\).(?=\))|\s
	\b

These highlighters should be regex highlighters (because they have
no children) but are probably required to be region highlighters to
compete with other region highlighters.

A region highlighter works by first computing *all* matches of both
region start and region end (which faciliates caching).  Unfortunately
this means that these oddball region-end-patterns create loads of
matches, most of which are useless.

Let's standardize these hacks by allowing the region end to be empty.
This allows us to match the entire region in the region begin pattern,
and do no extra work if the region end is empty.

It's still hacky (the region highlighter feels like the wrong tool
for some of these cases) but I don't think we have a good alternative
today.

On the [5MB Markdown file] this gives a decent speedup for the initial
highlighting pass:

	$ HOME=$PWD hyperfine -w 1 'git checkout HEAD'{~,}' -- :/rc/filetype/markdown.kak && ./kak.opt big_markdown.md -e "hook global NormalIdle .* quit" -ui dummy'
	Benchmark 1: git checkout HEAD~ -- :/rc/filetype/markdown.kak && ./kak.opt big_markdown.md -e "hook global NormalIdle .* quit" -ui dummy
	  Time (mean ± σ):      1.969 s ±  0.033 s    [User: 1.935 s, System: 0.035 s]
	  Range (min … max):    1.906 s …  2.036 s    10 runs
	 
	Benchmark 2: git checkout HEAD -- :/rc/filetype/markdown.kak && ./kak.opt big_markdown.md -e "hook global NormalIdle .* quit" -ui dummy
	  Time (mean ± σ):      1.111 s ±  0.022 s    [User: 1.093 s, System: 0.020 s]
	  Range (min … max):    1.079 s …  1.145 s    10 runs
	 
	Summary
	  'git checkout HEAD -- :/rc/filetype/markdown.kak && ./kak.opt big_markdown.md -e "hook global NormalIdle .* quit" -ui dummy' ran
	    1.77 ± 0.05 times faster than 'git checkout HEAD~ -- :/rc/filetype/markdown.kak && ./kak.opt big_markdown.md -e "hook global NormalIdle .* quit" -ui dummy'

(The numbers depend on a other optimizations I'm using.)

---

Regarding markdown.kak :

To avoid the empty-region-end hack I considered using regex
highlighters (instead of region highlighters) for Markdown
URLs.  However, we probably want to keep highlighting bare
URLs (without enclosing <>).  This violates the [CommonMark
spec](https://spec.commonmark.org/0.29/#example-607) but both GitHub
Markdown and editors like VSCode and Emacs highlight bare URLs here.

The problem with highlighting bare URLs is that
they may contain Markdown metacharacters, see
https://spec.commonmark.org/dingus/?text=https%3A%2F%2Fexample.com%2F(1)_2_(3)

So unless we keep using a region highlighter, it will be very difficult
to avoid spurious Markdown highlighting in bare URLs.

So for the above performance improvement we need the empty-region-end hack.
This means we need to match the entire URL in the region-start pattern.
This is slightly challenging because a URL can contain balanced
parentheses but we must not match a trailing ")".
CommonMark [mandates](https://spec.commonmark.org/0.30/#link-destination)
support for at least 3 levels of nested parentheses, so let's just
unroll the regex to implement that.

---

[5MB Markdown file]: mawww#4685 (comment)
krobelus added a commit to krobelus/kakoune that referenced this issue Aug 16, 2022
Running %sYeti<ret>casdf on the file
[example.journal.txt](mawww#4685 (comment))
causes noticeable lag.
It creates 6000 selections. Whenever text is inserted/deleted in all
selections, it takes Kakoune 300 milliseconds to update the range
highlighters. This is because the runtime is quadratic in the number
of selections:
for each selection, we call on_new_range(), which calls add_matches()
which calls std::rotate(), which needs needs linear time.

Fix the quadratic behavior by first computing all new ranges and then
adding all matches in one go. This seems to fix the lag.
krobelus added a commit to krobelus/kakoune that referenced this issue Aug 16, 2022
Our regex engine uses the virtual machine approach, see
https://swtch.com/~rsc/regexp/regexp2.html
It is very robust but still does quite a bit of work if the pattern
prefix matches in many places, like for this regex that tries to
match the start of a URL:

	[a-z]+://

For every lowercase word in the buffer, we spawn a bunch of VM threads
only to throw most of them away later, since most words don't contain
"://".

Since the region highlighter works line-by-line, we can skip lines
that do not match the highlighter regex.  Most highlighters have some
literal parts. Extract literal parts that must always match. Before
performing a full regex match, check if the line contains that literal
part.  This is much faster in practice. Of course it's a heuristic,
so it can also cause slow downs.  Note that this optimization only
applies to the region highlighter; it would be less effective for
searches that are not limited to a single line.

The actual regex used by markdown.kak is this one:

	add-highlighter shared/markdown/inline/url region -recurse \( ([a-z]+://|(mailto|magnet|xmpp):) (?!\\).(?=\))|\s fill link

which takes almost 0.5 seconds to apply to the [5MB Markdown file].
With this optimization it only takes 0.005 seconds. The speedup on
the entire file is also solid:

	$ HOME=$PWD hyperfine -w 1 ./kak.opt.{old,new}' big_markdown.md -e "hook global NormalIdle .* quit" -ui dummy'
	Benchmark 1: ./kak.opt.old big_markdown.md -e "hook global NormalIdle .* quit" -ui dummy
	  Time (mean ± σ):      6.478 s ±  0.096 s    [User: 6.442 s, System: 0.030 s]
	  Range (min … max):    6.297 s …  6.570 s    10 runs
	 
	Benchmark 2: ./kak.opt.new big_markdown.md -e "hook global NormalIdle .* quit" -ui dummy
	  Time (mean ± σ):      1.966 s ±  0.038 s    [User: 1.940 s, System: 0.029 s]
	  Range (min … max):    1.915 s …  2.021 s    10 runs
	 
	Summary
	  './kak.opt.new big_markdown.md -e "hook global NormalIdle .* quit" -ui dummy' ran
	    3.30 ± 0.08 times faster than './kak.opt.old big_markdown.md -e "hook global NormalIdle .* quit" -ui dummy'

[5MB Markdown file]: mawww#4685 (comment)
Here is the script to create that file:

	: >big_markdown.md
	for i in $(seq 10000); do
	  cat <<EOF >> big_markdown.md
	# Title
	## Subtitle
	Some notes about things.

	Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor
	incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis
	nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.
	Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore
	eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt
	in culpa qui officia deserunt mollit anim id est laborum

	* A list
	  * With a sub-item
	* Item 2
	* Item 3
	* Item 4
	* Item 5
	  * Item 5.1

	EOF
	done
krobelus added a commit to krobelus/kakoune that referenced this issue Aug 16, 2022
Some of our highlighters abuse the regions highlighter by using one
of these patterns as region end:

	'\K'
	'.\K'
	(?!\\).(?=>)|\n
	(?!\\).(?=\))|\s
	\b

These highlighters should be regex highlighters (because they have
no children) but are probably required to be region highlighters to
compete with other region highlighters.

A region highlighter works by first computing *all* matches of both
region start and region end (which faciliates caching).  Unfortunately
this means that these oddball region-end-patterns create loads of
matches, most of which are useless.

Let's standardize these hacks by allowing the region end to be empty.
This allows us to match the entire region in the region begin pattern,
and do no extra work if the region end is empty.

It's still hacky (the region highlighter feels like the wrong tool
for some of these cases) but I don't think we have a good alternative
today.

On the [5MB Markdown file] this gives a decent speedup for the initial
highlighting pass:

	$ HOME=$PWD hyperfine -w 1 'git checkout HEAD'{~,}' -- :/rc/filetype/markdown.kak && ./kak.opt big_markdown.md -e "hook global NormalIdle .* quit" -ui dummy'
	Benchmark 1: git checkout HEAD~ -- :/rc/filetype/markdown.kak && ./kak.opt big_markdown.md -e "hook global NormalIdle .* quit" -ui dummy
	  Time (mean ± σ):      1.969 s ±  0.033 s    [User: 1.935 s, System: 0.035 s]
	  Range (min … max):    1.906 s …  2.036 s    10 runs
	 
	Benchmark 2: git checkout HEAD -- :/rc/filetype/markdown.kak && ./kak.opt big_markdown.md -e "hook global NormalIdle .* quit" -ui dummy
	  Time (mean ± σ):      1.111 s ±  0.022 s    [User: 1.093 s, System: 0.020 s]
	  Range (min … max):    1.079 s …  1.145 s    10 runs
	 
	Summary
	  'git checkout HEAD -- :/rc/filetype/markdown.kak && ./kak.opt big_markdown.md -e "hook global NormalIdle .* quit" -ui dummy' ran
	    1.77 ± 0.05 times faster than 'git checkout HEAD~ -- :/rc/filetype/markdown.kak && ./kak.opt big_markdown.md -e "hook global NormalIdle .* quit" -ui dummy'

(The numbers depend on a other optimizations I'm using.)

---

Regarding markdown.kak :

To avoid the empty-region-end hack I considered using regex
highlighters (instead of region highlighters) for Markdown
URLs.  However, we probably want to keep highlighting bare
URLs (without enclosing <>).  This violates the [CommonMark
spec](https://spec.commonmark.org/0.29/#example-607) but both GitHub
Markdown and editors like VSCode and Emacs highlight bare URLs here.

The problem with highlighting bare URLs is that
they may contain Markdown metacharacters, see
https://spec.commonmark.org/dingus/?text=https%3A%2F%2Fexample.com%2F(1)_2_(3)

So unless we keep using a region highlighter, it will be very difficult
to avoid spurious Markdown highlighting in bare URLs.

So for the above performance improvement we need the empty-region-end hack.
This means we need to match the entire URL in the region-start pattern.
This is slightly challenging because a URL can contain balanced
parentheses but we must not match a trailing ")".
CommonMark [mandates](https://spec.commonmark.org/0.30/#link-destination)
support for at least 3 levels of nested parentheses, so let's just
unroll the regex to implement that.

---

[5MB Markdown file]: mawww#4685 (comment)
krobelus added a commit to krobelus/kakoune that referenced this issue Aug 16, 2022
… buffer

There have been proposals to add more language aliases to markdown.kak
(mawww#4592) and allow users to add their own aliases (mawww#4489).

To recap: various markdown implementations allow specifying aliases
for languages. For example, here is a code block that should be
highlighted as filetype "haskell" but isn't

	```hs
	-- highlight as haskell
	```

There are lots of aliases out in the wild - "pygmentize -L"lists some
but I don't think there is a canonical list.

Today we have a hardcoded list of supported filetypes. This is hard
to extend and it can also impact performance.
This patch simply attempts to load the module "hs" and the shared
highlighter "hs". This means that users can use this (obvious?) snippet
to add their own aliases:

	provide-module hs %{
		require-module haskell
		add-highlighter shared/hs ref haskell
	}

Untrusted Markdown files can load arbitrary modules, but that was
already true before, and modules are assumed to be trusted anyway.

Since language highlighters are now loaded *after* the generic
code-block highlighter, we need to make sure the language highlighters
take precedence. Do this by making them sub-regions of the generic one.

Closes mawww#4489

This improves performance on the [5MB Markdown
file](mawww#4685 (comment)).

	$ HOME=$PWD hyperfine -w 1 'git checkout HEAD'{~,}' -- :/rc/filetype/markdown.kak && ./kak.opt big_markdown.md -e "hook global NormalIdle .* quit" -ui dummy'
	Benchmark 1: git checkout HEAD~ -- :/rc/filetype/markdown.kak && ./kak.opt big_markdown.md -e "hook global NormalIdle .* quit" -ui dummy
	  Time (mean ± σ):      1.122 s ±  0.023 s    [User: 1.099 s, System: 0.025 s]
	  Range (min … max):    1.075 s …  1.157 s    10 runs

	Benchmark 2: git checkout HEAD -- :/rc/filetype/markdown.kak && ./kak.opt big_markdown.md -e "hook global NormalIdle .* quit" -ui dummy
	  Time (mean ± σ):     544.6 ms ±  13.5 ms    [User: 525.5 ms, System: 22.8 ms]
	  Range (min … max):   525.9 ms … 564.5 ms    10 runs

	Summary
	  'git checkout HEAD -- :/rc/filetype/markdown.kak && ./kak.opt big_markdown.md -e "hook global NormalIdle .* quit" -ui dummy' ran
	    2.06 ± 0.07 times faster than 'git checkout HEAD~ -- :/rc/filetype/markdown.kak && ./kak.opt big_markdown.md -e "hook global NormalIdle .* quit" -ui dummy'

(These numbers depend on other optimizations I'm using.)
krobelus added a commit to krobelus/kakoune that referenced this issue Aug 16, 2022
Running %sYeti<ret>casdf on the file
[example.journal.txt](mawww#4685 (comment))
causes noticeable lag.
It creates 6000 selections. Whenever text is inserted/deleted in all
selections, it takes Kakoune 300 milliseconds to update the range
highlighters. This is because the runtime is quadratic in the number
of selections:
for each selection, we call on_new_range(), which calls add_matches()
which calls std::rotate(), which needs needs linear time.

Fix the quadratic behavior by first computing all new ranges and then
adding all matches in one go. This seems to fix the lag.
krobelus added a commit to krobelus/kakoune that referenced this issue Aug 16, 2022
@krobelus
Copy link
Contributor

krobelus commented Aug 16, 2022

I've attempted some optimizations, you can try my optimize-highlighters branch which has all of them.
This should make both the Markdown and the Ledger files usable. There are definitely more optimizations we can add.

krobelus added a commit to krobelus/kakoune that referenced this issue Aug 17, 2022
Some of our highlighters abuse the regions highlighter by using one
of these patterns as region end:

	'\K'
	'.\K'
	(?!\\).(?=>)|\n
	(?!\\).(?=\))|\s
	\b

These highlighters should be regex highlighters (because they have
no children) but are probably required to be region highlighters to
compete with other region highlighters.

A region highlighter works by first computing *all* matches of both
region start and region end (which faciliates caching).  Unfortunately
this means that these oddball region-end-patterns create loads of
matches, most of which are useless.

Let's standardize these hacks by allowing the region end to be empty.
This allows us to match the entire region in the region begin pattern,
and do no extra work if the region end is empty.

It's still hacky (the region highlighter feels like the wrong tool
for some of these cases) but I don't think we have a good alternative
today.

On the [5MB Markdown file] this gives a decent speedup for the initial
highlighting pass:

	$ HOME=$PWD hyperfine -w 1 'git checkout HEAD'{~,}' -- :/rc/filetype/markdown.kak && ./kak.opt big_markdown.md -e "hook global NormalIdle .* quit" -ui dummy'
	Benchmark 1: git checkout HEAD~ -- :/rc/filetype/markdown.kak && ./kak.opt big_markdown.md -e "hook global NormalIdle .* quit" -ui dummy
	  Time (mean ± σ):      1.969 s ±  0.033 s    [User: 1.935 s, System: 0.035 s]
	  Range (min … max):    1.906 s …  2.036 s    10 runs
	 
	Benchmark 2: git checkout HEAD -- :/rc/filetype/markdown.kak && ./kak.opt big_markdown.md -e "hook global NormalIdle .* quit" -ui dummy
	  Time (mean ± σ):      1.111 s ±  0.022 s    [User: 1.093 s, System: 0.020 s]
	  Range (min … max):    1.079 s …  1.145 s    10 runs
	 
	Summary
	  'git checkout HEAD -- :/rc/filetype/markdown.kak && ./kak.opt big_markdown.md -e "hook global NormalIdle .* quit" -ui dummy' ran
	    1.77 ± 0.05 times faster than 'git checkout HEAD~ -- :/rc/filetype/markdown.kak && ./kak.opt big_markdown.md -e "hook global NormalIdle .* quit" -ui dummy'

(The numbers depend on a other optimizations I'm using.)

---

Regarding markdown.kak :

To avoid the empty-region-end hack I considered using regex
highlighters (instead of region highlighters) for Markdown
URLs.  However, we probably want to keep highlighting bare
URLs (without enclosing <>).  This violates the [CommonMark
spec](https://spec.commonmark.org/0.29/#example-607) but both GitHub
Markdown and editors like VSCode and Emacs highlight bare URLs here.

The problem with highlighting bare URLs is that
they may contain Markdown metacharacters, see
https://spec.commonmark.org/dingus/?text=https%3A%2F%2Fexample.com%2F(1)_2_(3)

So unless we keep using a region highlighter, it will be very difficult
to avoid spurious Markdown highlighting in bare URLs.

So for the above performance improvement we need the empty-region-end hack.
This means we need to match the entire URL in the region-start pattern.
This is slightly challenging because a URL can contain balanced
parentheses but we must not match a trailing ")".
CommonMark [mandates](https://spec.commonmark.org/0.30/#link-destination)
support for at least 3 levels of nested parentheses, so let's just
unroll the regex to implement that.

---

[5MB Markdown file]: mawww#4685 (comment)
@amano-kenji
Copy link
Author

amano-kenji commented Aug 17, 2022

I just tested optimize-highlighters branch with example.journal.txt briefly. Now, syntax highlighting seems to be a lot faster.

But, you mention hacky in your commit. I don't like hacks. Wouldn't the proper fix be to fix hacky highlighter code in ledger.kak?

@krobelus
Copy link
Contributor

I think you are talking about #4716 which optimizes an existing hack.
That one is actually not relevant to ledger.kak; the most important one is #4717. Apart from #4714 I don't see anything really wrong in ledger.kak. The example.journal.txt takes roughly 200ms to highlight initially, updates are much faster because of the cache. Almost all of those 200ms are spent on matching regexes that contain either ^ or $ which we could trivially optimize. Though there's probably a bigger fish to fry first, see the WIP commit 55a76b3

mawww added a commit that referenced this issue Aug 20, 2022
Instead of storing regexes in each regions, move them to the core
highlighter in a hash map so that shared regexes between different
regions are only applied once per update instead of once per region

Also change iteration logic to apply all regex together to each
changed lines to improve memory locality on big buffers.

For the big_markdown.md file described in #4685 this reduces
initial display time from 3.55s to 2.41s on my machine.
krobelus added a commit to krobelus/kakoune that referenced this issue Aug 28, 2022
As reported in
mawww#4685 (comment)
ledger.kak defines a region end that matches every character of
the buffer. This causes performance issues for large buffers.
Since the affected regions are only ever filled with a single color,
just use a regex highlighter instead of a region highlighter.
This improves performance when loading the file for the first time.

Speedup on [example.journal.txt](mawww#4685 (comment))

	$ HOME=$PWD hyperfine -w 1 'git checkout HEAD'{~,}' -- :/rc/filetype/ledger.kak && ./kak.opt example.journal.txt -e "modeline-parse; hook global NormalIdle .* quit" -ui dummy'
	Benchmark 1: git checkout HEAD~ -- :/rc/filetype/ledger.kak && ./kak.opt example.journal.txt -e "modeline-parse; hook global NormalIdle .* quit" -ui dummy
	  Time (mean ± σ):     362.1 ms ±   5.1 ms    [User: 336.6 ms, System: 30.2 ms]
	  Range (min … max):   352.6 ms … 369.1 ms    10 runs
	 
	Benchmark 2: git checkout HEAD -- :/rc/filetype/ledger.kak && ./kak.opt example.journal.txt -e "modeline-parse; hook global NormalIdle .* quit" -ui dummy
	  Time (mean ± σ):     271.2 ms ±  16.7 ms    [User: 252.8 ms, System: 24.0 ms]
	  Range (min … max):   253.9 ms … 305.0 ms    10 runs
	 
	Summary
	  'git checkout HEAD -- :/rc/filetype/ledger.kak && ./kak.opt example.journal.txt -e "modeline-parse; hook global NormalIdle .* quit" -ui dummy' ran
	    1.33 ± 0.08 times faster than 'git checkout HEAD~ -- :/rc/filetype/ledger.kak && ./kak.opt example.journal.txt -e "modeline-parse; hook global NormalIdle .* quit" -ui dummy'
krobelus added a commit to krobelus/kakoune that referenced this issue Aug 28, 2022
… buffer

There have been proposals to add more language aliases to markdown.kak
(mawww#4592) and allow users to add their own aliases (mawww#4489).

To recap: various markdown implementations allow specifying aliases
for languages. For example, here is a code block that should be
highlighted as filetype "haskell" but isn't:

	```hs
	-- highlight as haskell
	```

There are lots of aliases out in the wild - "pygmentize -L" lists
some but I don't think there is a canonical list.

Today we have a hardcoded list of supported filetypes. This is hard
to mainta, extend, and it can impact performance.
This patch simply attempts to load the module "hs" and the shared
highlighter "hs". This means that users can use this (obvious?) snippet
to add their own aliases:

	provide-module hs %{
		require-module haskell
		add-highlighter shared/hs ref haskell
	}

Untrusted Markdown files can load arbitrary modules, but that was
already true before, and modules are assumed to be trusted anyway.

Since language highlighters are now loaded *after* the generic
code-block highlighter, we need to make sure the language highlighters
take precedence. Do this by making them sub-regions of the generic one.

Closes mawww#4489

This improves performance on the [5MB Markdown
file](mawww#4685 (comment)).

	$ HOME=$PWD hyperfine -w 1 'git checkout HEAD'{~,}' -- :/rc/filetype/markdown.kak && ./kak.opt big_markdown.md -e "hook global NormalIdle .* quit" -ui dummy'
	Benchmark 1: git checkout HEAD~ -- :/rc/filetype/markdown.kak && ./kak.opt big_markdown.md -e "hook global NormalIdle .* quit" -ui dummy
	  Time (mean ± σ):      3.225 s ±  0.074 s    [User: 3.199 s, System: 0.027 s]
	  Range (min … max):    3.099 s …  3.362 s    10 runs
	 
	Benchmark 2: git checkout HEAD -- :/rc/filetype/markdown.kak && ./kak.opt big_markdown.md -e "hook global NormalIdle .* quit" -ui dummy
	  Time (mean ± σ):      1.181 s ±  0.030 s    [User: 1.162 s, System: 0.021 s]
	  Range (min … max):    1.149 s …  1.234 s    10 runs
	 
	Summary
	  'git checkout HEAD -- :/rc/filetype/markdown.kak && ./kak.opt big_markdown.md -e "hook global NormalIdle .* quit" -ui dummy' ran
	    2.73 ± 0.09 times faster than 'git checkout HEAD~ -- :/rc/filetype/markdown.kak && ./kak.opt big_markdown.md -e "hook global NormalIdle .* quit" -ui dummy'

(These numbers depend on another optimization.)
krobelus added a commit to krobelus/kakoune that referenced this issue Aug 28, 2022
Some of our highlighters abuse the regions highlighter by using one
of these patterns as region end:

	'\K'
	'.\K'
	(?!\\).(?=>)|\n
	(?!\\).(?=\))|\s
	\b

These highlighters should be regex highlighters (because they have
no children) but are probably required to be region highlighters to
compete with other region highlighters.

A region highlighter works by first computing *all* matches of both
region start and region end (which faciliates caching).  Unfortunately
this means that these oddball region-end-patterns create loads of
matches, most of which are useless.

Let's standardize these hacks by allowing the region end to be empty.
This allows us to match the entire region in the region begin pattern,
and do no extra work if the region end is empty.

It's still hacky (the region highlighter feels like the wrong tool
for some of these cases) but I don't think we have a good alternative
today.

On the [5MB Markdown file] this gives a decent speedup for the initial
highlighting pass:

	$ HOME=$PWD hyperfine -w 1 'git checkout HEAD'{~,}' -- :/rc/filetype/markdown.kak && ./kak.opt big_markdown.md -e "hook global NormalIdle .* quit" -ui dummy'
	Benchmark 1: git checkout HEAD~ -- :/rc/filetype/markdown.kak && ./kak.opt big_markdown.md -e "hook global NormalIdle .* quit" -ui dummy
	  Time (mean ± σ):      4.792 s ±  0.119 s    [User: 4.742 s, System: 0.046 s]
	  Range (min … max):    4.640 s …  4.991 s    10 runs
	 
	Benchmark 2: git checkout HEAD -- :/rc/filetype/markdown.kak && ./kak.opt big_markdown.md -e "hook global NormalIdle .* quit" -ui dummy
	  Time (mean ± σ):      3.788 s ±  0.105 s    [User: 3.759 s, System: 0.026 s]
	  Range (min … max):    3.650 s …  3.938 s    10 runs
	 
	Summary
	  'git checkout HEAD -- :/rc/filetype/markdown.kak && ./kak.opt big_markdown.md -e "hook global NormalIdle .* quit" -ui dummy' ran
	    1.27 ± 0.05 times faster than 'git checkout HEAD~ -- :/rc/filetype/markdown.kak && ./kak.opt big_markdown.md -e "hook global NormalIdle .* quit" -ui dummy'

(The numbers depend on other optimizations I'm using.)

---

Regarding markdown.kak :

To avoid the empty-region-end hack I considered using regex
highlighters (instead of region highlighters) for Markdown
URLs.  However, we probably want to keep highlighting bare
URLs (without enclosing <>).  This violates the [CommonMark
spec](https://spec.commonmark.org/0.29/#example-607) but both GitHub
Markdown and editors like VSCode and Emacs highlight bare URLs here.

The problem with highlighting bare URLs is that
they may contain Markdown metacharacters, see
https://spec.commonmark.org/dingus/?text=https%3A%2F%2Fexample.com%2F(1)_2_(3)

So unless we keep using a region highlighter, it will be very difficult
to avoid spurious Markdown highlighting in bare URLs.

So for the above performance improvement we need the empty-region-end hack.
This means we need to match the entire URL in the region-start pattern.
This is slightly challenging because a URL can contain balanced
parentheses but it must not include a trailing ")".
CommonMark [mandates](https://spec.commonmark.org/0.30/#link-destination)
support for at least 3 levels of nested parentheses, so let's just
unroll the regex to implement that.

---

[5MB Markdown file]: mawww#4685 (comment)
krobelus added a commit to krobelus/kakoune that referenced this issue Aug 28, 2022
Some of our highlighters abuse the regions highlighter by using one
of these patterns as region end:

	'\K'
	'.\K'
	(?!\\).(?=>)|\n
	(?!\\).(?=\))|\s
	\b

These highlighters should be regex highlighters (because they have
no children) but are probably required to be region highlighters to
compete with other region highlighters.

A region highlighter works by first computing *all* matches of both
region start and region end (which faciliates caching).  Unfortunately
this means that these oddball region-end-patterns create loads of
matches, most of which are useless.

Let's standardize these hacks by allowing the region end to be empty.
This allows us to match the entire region in the region begin pattern,
and do no extra work if the region end is empty.

It's still hacky (the region highlighter feels like the wrong tool
for some of these cases) but I don't think we have a good alternative
today.

On the [5MB Markdown file] this gives a decent speedup for the initial
highlighting pass:

	$ HOME=$PWD hyperfine -w 1 'git checkout HEAD'{~,}' -- :/rc/filetype/markdown.kak && ./kak.opt big_markdown.md -e "hook global NormalIdle .* quit" -ui dummy'
	Benchmark 1: git checkout HEAD~ -- :/rc/filetype/markdown.kak && ./kak.opt big_markdown.md -e "hook global NormalIdle .* quit" -ui dummy
	  Time (mean ± σ):      4.792 s ±  0.119 s    [User: 4.742 s, System: 0.046 s]
	  Range (min … max):    4.640 s …  4.991 s    10 runs
	 
	Benchmark 2: git checkout HEAD -- :/rc/filetype/markdown.kak && ./kak.opt big_markdown.md -e "hook global NormalIdle .* quit" -ui dummy
	  Time (mean ± σ):      3.788 s ±  0.105 s    [User: 3.759 s, System: 0.026 s]
	  Range (min … max):    3.650 s …  3.938 s    10 runs
	 
	Summary
	  'git checkout HEAD -- :/rc/filetype/markdown.kak && ./kak.opt big_markdown.md -e "hook global NormalIdle .* quit" -ui dummy' ran
	    1.27 ± 0.05 times faster than 'git checkout HEAD~ -- :/rc/filetype/markdown.kak && ./kak.opt big_markdown.md -e "hook global NormalIdle .* quit" -ui dummy'

(The numbers depend on other optimizations I'm using.)

---

Regarding markdown.kak :

To avoid the empty-region-end hack I considered using regex
highlighters (instead of region highlighters) for Markdown
URLs.  However, we probably want to keep highlighting bare
URLs (without enclosing <>).  This violates the [CommonMark
spec](https://spec.commonmark.org/0.29/#example-607) but both GitHub
Markdown and editors like VSCode and Emacs highlight bare URLs here.

The problem with highlighting bare URLs is that
they may contain Markdown metacharacters, see
https://spec.commonmark.org/dingus/?text=https%3A%2F%2Fexample.com%2F(1)_2_(3)

So unless we keep using a region highlighter, it will be very difficult
to avoid spurious Markdown highlighting in bare URLs.

So for the above performance improvement we need the empty-region-end hack.
This means we need to match the entire URL in the region-start pattern.
This is slightly challenging because a URL can contain balanced
parentheses but it must not include a trailing ")".
CommonMark [mandates](https://spec.commonmark.org/0.30/#link-destination)
support for at least 3 levels of nested parentheses, so let's just
unroll the regex to implement that.

---

[5MB Markdown file]: mawww#4685 (comment)
krobelus added a commit to krobelus/kakoune that referenced this issue Aug 28, 2022
Running %sYeti<ret>casdf on file
[example.journal.txt](mawww#4685 (comment))
can cause noticeable lag.  This is because we insert text at 6000
selections, which means we need to update highlighters in those lines.
The runtime for updating range highlighters is quadratic in the
number of selections: for each selection, we call on_new_range(),
which calls add_matches(), which calls std::rotate(), which needs
needs linear time.

Fix the quadratic runtime by updating all ranges in the same loop.
This means that we no longer need to use std::rotate() for every
single range; instead we can just use one call to std::inplace_merge()
for all ranges (since ranges are sorted).

The implementation uses C++20 coroutines to implement a generator
that supplies the ranges to the above loop.  One alternative is to
allocate an intermediate Vector<LineRange>, which is actually a few
percent faster but less fun. Another alternative is to replace the
generator with an iterator with a custom destructor.

I used this script to benchmark the improvements.
(In hindsight I could have just used "-ui json" instead of tmux).

	#!/bin/sh
	set -ex
	N=${1:-100}
	kak=${2:-./kak.opt}
	for i in $(seq "$N")
	do
		echo -n "\
	2022-02-06 * Earth
	    expense:electronics:audio    116.7 USD
	    liability:card              -116.7 USD

	2022-02-06 * Blue Yeti USB Microphone
	    expense:electronics:audio    116.7 USD
	    liability:card              -116.7 USD

	"
	done > big-journal.ledger
	echo > .empty-tmux.conf 'set -sg escape-time 5'
	test_tmux() {
		tmux -S .tmux-socket -f .empty-tmux.conf "$@"
	}
	test_tmux new-session -d "$kak" big-journal.ledger
	test_tmux send-keys '%sYeti' Enter c 1234567890
	sleep .2
	test_tmux send-keys Escape
	while ! test_tmux capture-pane -p | grep 123
	do
		sleep .1
	done
	test_tmux send-keys ':wq' Enter
	while test_tmux ls
	do
		sleep .1
	done
	rm -f .tmux-socket .empty-tmux.conf


The average runtimes for this script show an improvement as the input
file grows:

	         kak.old  kak.new
	N=10000    1.142    0.897
	N=20000    2.879    1.400

Detailed results:

	$ hyperfine -w 1 './bench.sh 10000 ./kak.opt.'{old,new}
	Benchmark 1: ./bench.sh 10000 ./kak.opt.old
	  Time (mean ± σ):      1.142 s ±  0.072 s    [User: 0.252 s, System: 0.059 s]
	  Range (min … max):    1.060 s …  1.242 s    10 runs
	 
	Benchmark 2: ./bench.sh 10000 ./kak.opt.new
	  Time (mean ± σ):     897.2 ms ±  19.3 ms    [User: 241.6 ms, System: 57.4 ms]
	  Range (min … max):   853.9 ms … 923.6 ms    10 runs
	 
	Summary
	  './bench.sh 10000 ./kak.opt.new' ran
	    1.27 ± 0.09 times faster than './bench.sh 10000 ./kak.opt.old'
	$ hyperfine -w 1 './bench.sh 20000 ./kak.opt.'{old,new}
	Benchmark 1: ./bench.sh 20000 ./kak.opt.old
	  Time (mean ± σ):      2.879 s ±  0.065 s    [User: 0.553 s, System: 0.126 s]
	  Range (min … max):    2.768 s …  2.963 s    10 runs
	 
	Benchmark 2: ./bench.sh 20000 ./kak.opt.new
	  Time (mean ± σ):      1.400 s ±  0.018 s    [User: 0.428 s, System: 0.083 s]
	  Range (min … max):    1.374 s …  1.429 s    10 runs
	 
	Summary
	  './bench.sh 20000 ./kak.opt.new' ran
	    2.06 ± 0.05 times faster than '../repro.sh 20000 ./kak.opt.old'
krobelus added a commit to krobelus/kakoune that referenced this issue Aug 28, 2022
Running %sYeti<ret>casdf on file
[example.journal.txt](mawww#4685 (comment))
can cause noticeable lag.  This is because we insert text at 6000
selections, which means we need to update highlighters in those lines.
The runtime for updating range highlighters is quadratic in the
number of selections: for each selection, we call on_new_range(),
which calls add_matches(), which calls std::rotate(), which needs
needs linear time.

Fix the quadratic runtime by updating all ranges in the same loop.
This means that we no longer need to use std::rotate() for every
single range; instead we can just use one call to std::inplace_merge()
for all ranges (since ranges are sorted).

The implementation uses C++20 coroutines to implement a generator
that supplies the ranges to the above loop.  One alternative is to
allocate an intermediate Vector<LineRange>, which is actually a few
percent faster but less fun. Another alternative is to replace the
generator with an iterator with a custom destructor.

I used this script to benchmark the improvements.
(In hindsight I could have just used "-ui json" instead of tmux).

	#!/bin/sh
	set -ex
	N=${1:-100}
	kak=${2:-./kak.opt}
	for i in $(seq "$N")
	do
		echo -n "\
	2022-02-06 * Earth
	    expense:electronics:audio    116.7 USD
	    liability:card              -116.7 USD

	2022-02-06 * Blue Yeti USB Microphone
	    expense:electronics:audio    116.7 USD
	    liability:card              -116.7 USD

	"
	done > big-journal.ledger
	echo > .empty-tmux.conf 'set -sg escape-time 5'
	test_tmux() {
		tmux -S .tmux-socket -f .empty-tmux.conf "$@"
	}
	test_tmux new-session -d "$kak" big-journal.ledger
	test_tmux send-keys '%sYeti' Enter c 1234567890
	sleep .2
	test_tmux send-keys Escape
	while ! test_tmux capture-pane -p | grep 123
	do
		sleep .1
	done
	test_tmux send-keys ':wq' Enter
	while test_tmux ls
	do
		sleep .1
	done
	rm -f .tmux-socket .empty-tmux.conf


The average runtimes for this script show an improvement as the input
file grows:

	         kak.old  kak.new
	N=10000    1.142    0.897
	N=20000    2.879    1.400

Detailed results:

	$ hyperfine -w 1 './bench.sh 10000 ./kak.opt.'{old,new}
	Benchmark 1: ./bench.sh 10000 ./kak.opt.old
	  Time (mean ± σ):      1.142 s ±  0.072 s    [User: 0.252 s, System: 0.059 s]
	  Range (min … max):    1.060 s …  1.242 s    10 runs
	 
	Benchmark 2: ./bench.sh 10000 ./kak.opt.new
	  Time (mean ± σ):     897.2 ms ±  19.3 ms    [User: 241.6 ms, System: 57.4 ms]
	  Range (min … max):   853.9 ms … 923.6 ms    10 runs
	 
	Summary
	  './bench.sh 10000 ./kak.opt.new' ran
	    1.27 ± 0.09 times faster than './bench.sh 10000 ./kak.opt.old'
	$ hyperfine -w 1 './bench.sh 20000 ./kak.opt.'{old,new}
	Benchmark 1: ./bench.sh 20000 ./kak.opt.old
	  Time (mean ± σ):      2.879 s ±  0.065 s    [User: 0.553 s, System: 0.126 s]
	  Range (min … max):    2.768 s …  2.963 s    10 runs
	 
	Benchmark 2: ./bench.sh 20000 ./kak.opt.new
	  Time (mean ± σ):      1.400 s ±  0.018 s    [User: 0.428 s, System: 0.083 s]
	  Range (min … max):    1.374 s …  1.429 s    10 runs
	 
	Summary
	  './bench.sh 20000 ./kak.opt.new' ran
	    2.06 ± 0.05 times faster than '../repro.sh 20000 ./kak.opt.old'
krobelus added a commit to krobelus/kakoune that referenced this issue Aug 28, 2022
Running %sYeti<ret>casdf on file
[example.journal.txt](mawww#4685 (comment))
can cause noticeable lag.  This is because we insert text at 6000
selections, which means we need to update highlighters in those lines.
The runtime for updating range highlighters is quadratic in the
number of selections: for each selection, we call on_new_range(),
which calls add_matches(), which calls std::rotate(), which needs
needs linear time.

Fix the quadratic runtime by updating all ranges in the same loop.
This means that we no longer need to use std::rotate() for every
single range; instead we can just use one call to std::inplace_merge()
for all ranges (since ranges are sorted).

The implementation uses C++20 coroutines to implement a generator
that supplies the ranges to the above loop.  One alternative is to
allocate an intermediate Vector<LineRange>, which is actually a few
percent faster but less fun. Another alternative is to replace the
generator with an iterator with a custom destructor.

I used this script to benchmark the improvements.
(In hindsight I could have just used "-ui json" instead of tmux).

	#!/bin/sh
	set -ex
	N=${1:-100}
	kak=${2:-./kak.opt}
	for i in $(seq "$N")
	do
		echo -n "\
	2022-02-06 * Earth
	    expense:electronics:audio    116.7 USD
	    liability:card              -116.7 USD

	2022-02-06 * Blue Yeti USB Microphone
	    expense:electronics:audio    116.7 USD
	    liability:card              -116.7 USD

	"
	done > big-journal.ledger
	echo > .empty-tmux.conf 'set -sg escape-time 5'
	test_tmux() {
		tmux -S .tmux-socket -f .empty-tmux.conf "$@"
	}
	test_tmux new-session -d "$kak" big-journal.ledger
	test_tmux send-keys '%sYeti' Enter c 1234567890
	sleep .2
	test_tmux send-keys Escape
	while ! test_tmux capture-pane -p | grep 123
	do
		sleep .1
	done
	test_tmux send-keys ':wq' Enter
	while test_tmux ls
	do
		sleep .1
	done
	rm -f .tmux-socket .empty-tmux.conf


The average runtimes for this script show an improvement as the input
file grows:

	         kak.old  kak.new
	N=10000    1.142    0.897
	N=20000    2.879    1.400

Detailed results:

	$ hyperfine -w 1 './bench.sh 10000 ./kak.opt.'{old,new}
	Benchmark 1: ./bench.sh 10000 ./kak.opt.old
	  Time (mean ± σ):      1.142 s ±  0.072 s    [User: 0.252 s, System: 0.059 s]
	  Range (min … max):    1.060 s …  1.242 s    10 runs
	 
	Benchmark 2: ./bench.sh 10000 ./kak.opt.new
	  Time (mean ± σ):     897.2 ms ±  19.3 ms    [User: 241.6 ms, System: 57.4 ms]
	  Range (min … max):   853.9 ms … 923.6 ms    10 runs
	 
	Summary
	  './bench.sh 10000 ./kak.opt.new' ran
	    1.27 ± 0.09 times faster than './bench.sh 10000 ./kak.opt.old'
	$ hyperfine -w 1 './bench.sh 20000 ./kak.opt.'{old,new}
	Benchmark 1: ./bench.sh 20000 ./kak.opt.old
	  Time (mean ± σ):      2.879 s ±  0.065 s    [User: 0.553 s, System: 0.126 s]
	  Range (min … max):    2.768 s …  2.963 s    10 runs
	 
	Benchmark 2: ./bench.sh 20000 ./kak.opt.new
	  Time (mean ± σ):      1.400 s ±  0.018 s    [User: 0.428 s, System: 0.083 s]
	  Range (min … max):    1.374 s …  1.429 s    10 runs
	 
	Summary
	  './bench.sh 20000 ./kak.opt.new' ran
	    2.06 ± 0.05 times faster than '../repro.sh 20000 ./kak.opt.old'
krobelus added a commit to krobelus/kakoune that referenced this issue Aug 28, 2022
Running %sYeti<ret>casdf on file
[example.journal.txt](mawww#4685 (comment))
can cause noticeable lag.  This is because we insert text at 6000
selections, which means we need to update highlighters in those lines.
The runtime for updating range highlighters is quadratic in the
number of selections: for each selection, we call on_new_range(),
which calls add_matches(), which calls std::rotate(), which needs
needs linear time.

Fix the quadratic runtime by updating all ranges in the same loop.
This means that we no longer need to use std::rotate() for every
single range; instead we can just use one call to std::inplace_merge()
for all ranges (since ranges are already sorted).

The most natural implementation would use coroutines to implement a
generator that supplies the ranges to the above loop.  Unfortunately
clang < 14 does not support coroutines when using -stdlib=libstdc++
(there is only experimental support with -stdlib=libc++). Use a
temporary vector for now.  Another alternative is to an iterator with
a custom destructor.

I used this script to benchmark the improvements.
(In hindsight I could have just used "-ui json" instead of tmux).

	#!/bin/sh
	set -ex
	N=${1:-100}
	kak=${2:-./kak.opt}
	for i in $(seq "$N")
	do
		echo -n "\
	2022-02-06 * Earth
	    expense:electronics:audio    116.7 USD
	    liability:card              -116.7 USD

	2022-02-06 * Blue Yeti USB Microphone
	    expense:electronics:audio    116.7 USD
	    liability:card              -116.7 USD

	"
	done > big-journal.ledger
	echo > .empty-tmux.conf 'set -sg escape-time 5'
	test_tmux() {
		tmux -S .tmux-socket -f .empty-tmux.conf "$@"
	}
	test_tmux new-session -d "$kak" big-journal.ledger
	test_tmux send-keys '%sYeti' Enter c 1234567890
	sleep .2
	test_tmux send-keys Escape
	while ! test_tmux capture-pane -p | grep 123
	do
		sleep .1
	done
	test_tmux send-keys ':wq' Enter
	while test_tmux ls
	do
		sleep .1
	done
	rm -f .tmux-socket .empty-tmux.conf


The average runtimes for this script show an improvement as the input
file grows:

	         kak.old  kak.new
	N=10000    1.142    0.897
	N=20000    2.879    1.400

Detailed results:

	$ hyperfine -w 1 './bench.sh 10000 ./kak.opt.'{old,new}
	Benchmark 1: ./bench.sh 10000 ./kak.opt.old
	  Time (mean ± σ):      1.142 s ±  0.072 s    [User: 0.252 s, System: 0.059 s]
	  Range (min … max):    1.060 s …  1.242 s    10 runs
	 
	Benchmark 2: ./bench.sh 10000 ./kak.opt.new
	  Time (mean ± σ):     897.2 ms ±  19.3 ms    [User: 241.6 ms, System: 57.4 ms]
	  Range (min … max):   853.9 ms … 923.6 ms    10 runs
	 
	Summary
	  './bench.sh 10000 ./kak.opt.new' ran
	    1.27 ± 0.09 times faster than './bench.sh 10000 ./kak.opt.old'
	$ hyperfine -w 1 './bench.sh 20000 ./kak.opt.'{old,new}
	Benchmark 1: ./bench.sh 20000 ./kak.opt.old
	  Time (mean ± σ):      2.879 s ±  0.065 s    [User: 0.553 s, System: 0.126 s]
	  Range (min … max):    2.768 s …  2.963 s    10 runs
	 
	Benchmark 2: ./bench.sh 20000 ./kak.opt.new
	  Time (mean ± σ):      1.400 s ±  0.018 s    [User: 0.428 s, System: 0.083 s]
	  Range (min … max):    1.374 s …  1.429 s    10 runs
	 
	Summary
	  './bench.sh 20000 ./kak.opt.new' ran
	    2.06 ± 0.05 times faster than '../repro.sh 20000 ./kak.opt.old'
krobelus added a commit to krobelus/kakoune that referenced this issue Sep 10, 2022
… buffer

There have been proposals to add more language aliases to markdown.kak
(mawww#4592) and allow users to add their own aliases (mawww#4489).

To recap: various markdown implementations allow specifying aliases
for languages. For example, here is a code block that should be
highlighted as filetype "haskell" but isn't:

	```hs
	-- highlight as haskell
	```

There are lots of aliases out in the wild - "pygmentize -L" lists
some but I don't think there is a canonical list.

Today we have a hardcoded list of supported filetypes. This is hard
to mainta, extend, and it can impact performance.
This patch simply attempts to load the module "hs" and the shared
highlighter "hs". This means that users can use this (obvious?) snippet
to add their own aliases:

	provide-module hs %{
		require-module haskell
		add-highlighter shared/hs ref haskell
	}

Untrusted Markdown files can load arbitrary modules, but that was
already true before, and modules are assumed to be trusted anyway.

Since language highlighters are now loaded *after* the generic
code-block highlighter, we need to make sure the language highlighters
take precedence. Do this by making them sub-regions of the generic one.

Closes mawww#4489

This improves performance on the [5MB Markdown
file](mawww#4685 (comment)).

	$ HOME=$PWD hyperfine -w 1 'git checkout HEAD'{~,}' -- :/rc/filetype/markdown.kak && ./kak.opt big_markdown.md -e "hook global NormalIdle .* quit" -ui dummy'
	Benchmark 1: git checkout HEAD~ -- :/rc/filetype/markdown.kak && ./kak.opt big_markdown.md -e "hook global NormalIdle .* quit" -ui dummy
	  Time (mean ± σ):      3.225 s ±  0.074 s    [User: 3.199 s, System: 0.027 s]
	  Range (min … max):    3.099 s …  3.362 s    10 runs
	 
	Benchmark 2: git checkout HEAD -- :/rc/filetype/markdown.kak && ./kak.opt big_markdown.md -e "hook global NormalIdle .* quit" -ui dummy
	  Time (mean ± σ):      1.181 s ±  0.030 s    [User: 1.162 s, System: 0.021 s]
	  Range (min … max):    1.149 s …  1.234 s    10 runs
	 
	Summary
	  'git checkout HEAD -- :/rc/filetype/markdown.kak && ./kak.opt big_markdown.md -e "hook global NormalIdle .* quit" -ui dummy' ran
	    2.73 ± 0.09 times faster than 'git checkout HEAD~ -- :/rc/filetype/markdown.kak && ./kak.opt big_markdown.md -e "hook global NormalIdle .* quit" -ui dummy'

(These numbers depend on another optimization.)
krobelus added a commit to krobelus/kakoune that referenced this issue Sep 10, 2022
Running %sYeti<ret>casdf on file
[example.journal.txt](mawww#4685 (comment))
can cause noticeable lag.  This is because we insert text at 6000
selections, which means we need to update highlighters in those lines.
The runtime for updating range highlighters is quadratic in the
number of selections: for each selection, we call on_new_range(),
which calls add_matches(), which calls std::rotate(), which needs
needs linear time.

Fix the quadratic runtime by updating all ranges in the same loop.
This means that we no longer need to use std::rotate() for every
single range; instead we can just use one call to std::inplace_merge()
for all ranges (since ranges are already sorted).

The most natural implementation would use coroutines to implement a
generator that supplies the ranges to the above loop.  Unfortunately
clang < 14 does not support coroutines when using -stdlib=libstdc++
(there is only experimental support with -stdlib=libc++). Use a
temporary vector for now.  Another alternative is to an iterator with
a custom destructor.

I used this script to benchmark the improvements.
(In hindsight I could have just used "-ui json" instead of tmux).

	#!/bin/sh
	set -ex
	N=${1:-100}
	kak=${2:-./kak.opt}
	for i in $(seq "$N")
	do
		echo -n "\
	2022-02-06 * Earth
	    expense:electronics:audio    116.7 USD
	    liability:card              -116.7 USD

	2022-02-06 * Blue Yeti USB Microphone
	    expense:electronics:audio    116.7 USD
	    liability:card              -116.7 USD

	"
	done > big-journal.ledger
	echo > .empty-tmux.conf 'set -sg escape-time 5'
	test_tmux() {
		tmux -S .tmux-socket -f .empty-tmux.conf "$@"
	}
	test_tmux new-session -d "$kak" big-journal.ledger
	test_tmux send-keys '%sYeti' Enter c 1234567890
	sleep .2
	test_tmux send-keys Escape
	while ! test_tmux capture-pane -p | grep 123
	do
		sleep .1
	done
	test_tmux send-keys ':wq' Enter
	while test_tmux ls
	do
		sleep .1
	done
	rm -f .tmux-socket .empty-tmux.conf

The average runtimes for this script show an improvement as the input
file grows:

	         kak.old  kak.new
	N=10000    1.142    0.897
	N=20000    2.879    1.400

Detailed results:

	$ hyperfine -w 1 './bench.sh 10000 ./kak.opt.'{old,new}
	Benchmark 1: ./bench.sh 10000 ./kak.opt.old
	  Time (mean ± σ):      1.142 s ±  0.072 s    [User: 0.252 s, System: 0.059 s]
	  Range (min … max):    1.060 s …  1.242 s    10 runs

	Benchmark 2: ./bench.sh 10000 ./kak.opt.new
	  Time (mean ± σ):     897.2 ms ±  19.3 ms    [User: 241.6 ms, System: 57.4 ms]
	  Range (min … max):   853.9 ms … 923.6 ms    10 runs

	Summary
	  './bench.sh 10000 ./kak.opt.new' ran
	    1.27 ± 0.09 times faster than './bench.sh 10000 ./kak.opt.old'
	$ hyperfine -w 1 './bench.sh 20000 ./kak.opt.'{old,new}
	Benchmark 1: ./bench.sh 20000 ./kak.opt.old
	  Time (mean ± σ):      2.879 s ±  0.065 s    [User: 0.553 s, System: 0.126 s]
	  Range (min … max):    2.768 s …  2.963 s    10 runs

	Benchmark 2: ./bench.sh 20000 ./kak.opt.new
	  Time (mean ± σ):      1.400 s ±  0.018 s    [User: 0.428 s, System: 0.083 s]
	  Range (min … max):    1.374 s …  1.429 s    10 runs

	Summary
	  './bench.sh 20000 ./kak.opt.new' ran
	    2.06 ± 0.05 times faster than '../repro.sh 20000 ./kak.opt.old'
krobelus added a commit to krobelus/kakoune that referenced this issue Sep 16, 2022
Running %sYeti<ret>casdf on file
[example.journal.txt](mawww#4685 (comment))
can cause noticeable lag.  This is because we insert text at 6000
selections, which means we need to update highlighters in those lines.
The runtime for updating range highlighters is quadratic in the
number of selections: for each selection, we call on_new_range(),
which calls add_matches(), which calls std::rotate(), which needs
needs linear time.

Fix the quadratic runtime by updating all ranges in the same loop.
This means that we no longer need to use std::rotate() for every
single range; instead we can just use one call to std::inplace_merge()
for all ranges (since ranges are already sorted).

The most natural implementation would use coroutines to implement a
generator that supplies the ranges to the above loop.  Unfortunately
clang < 14 does not support coroutines when using -stdlib=libstdc++
(there is only experimental support with -stdlib=libc++). Use a
temporary vector for now.  Another alternative is to an iterator with
a custom destructor.

I used this script to benchmark the improvements.
(In hindsight I could have just used "-ui json" instead of tmux).

	#!/bin/sh
	set -ex
	N=${1:-100}
	kak=${2:-./kak.opt}
	for i in $(seq "$N")
	do
		echo -n "\
	2022-02-06 * Earth
	    expense:electronics:audio    116.7 USD
	    liability:card              -116.7 USD

	2022-02-06 * Blue Yeti USB Microphone
	    expense:electronics:audio    116.7 USD
	    liability:card              -116.7 USD

	"
	done > big-journal.ledger
	echo > .empty-tmux.conf 'set -sg escape-time 5'
	test_tmux() {
		tmux -S .tmux-socket -f .empty-tmux.conf "$@"
	}
	test_tmux new-session -d "$kak" big-journal.ledger
	test_tmux send-keys '%sYeti' Enter c 1234567890
	sleep .2
	test_tmux send-keys Escape
	while ! test_tmux capture-pane -p | grep 123
	do
		sleep .1
	done
	test_tmux send-keys ':wq' Enter
	while test_tmux ls
	do
		sleep .1
	done
	rm -f .tmux-socket .empty-tmux.conf

The average runtimes for this script show an improvement as the input
file grows:

	         kak.old  kak.new
	N=10000    1.142    0.897
	N=20000    2.879    1.400

Detailed results:

	$ hyperfine -w 1 './bench.sh 10000 ./kak.opt.'{old,new}
	Benchmark 1: ./bench.sh 10000 ./kak.opt.old
	  Time (mean ± σ):      1.142 s ±  0.072 s    [User: 0.252 s, System: 0.059 s]
	  Range (min … max):    1.060 s …  1.242 s    10 runs

	Benchmark 2: ./bench.sh 10000 ./kak.opt.new
	  Time (mean ± σ):     897.2 ms ±  19.3 ms    [User: 241.6 ms, System: 57.4 ms]
	  Range (min … max):   853.9 ms … 923.6 ms    10 runs

	Summary
	  './bench.sh 10000 ./kak.opt.new' ran
	    1.27 ± 0.09 times faster than './bench.sh 10000 ./kak.opt.old'
	$ hyperfine -w 1 './bench.sh 20000 ./kak.opt.'{old,new}
	Benchmark 1: ./bench.sh 20000 ./kak.opt.old
	  Time (mean ± σ):      2.879 s ±  0.065 s    [User: 0.553 s, System: 0.126 s]
	  Range (min … max):    2.768 s …  2.963 s    10 runs

	Benchmark 2: ./bench.sh 20000 ./kak.opt.new
	  Time (mean ± σ):      1.400 s ±  0.018 s    [User: 0.428 s, System: 0.083 s]
	  Range (min … max):    1.374 s …  1.429 s    10 runs

	Summary
	  './bench.sh 20000 ./kak.opt.new' ran
	    2.06 ± 0.05 times faster than '../repro.sh 20000 ./kak.opt.old'
krobelus added a commit to krobelus/kakoune that referenced this issue Sep 17, 2022
Running %sYeti<ret>casdf on file
[example.journal.txt](mawww#4685 (comment))
can cause noticeable lag.  This is because we insert text at 6000
selections, which means we need to update highlighters in those lines.
The runtime for updating range highlighters is quadratic in the
number of selections: for each selection, we call on_new_range(),
which calls add_matches(), which calls std::rotate(), which needs
needs linear time.

Fix the quadratic runtime by calling std::inplace_merge() once instead
of repeatedly calling std::rotate().  This is works because ranges
are already sorted.

I used this script to benchmark the improvements.
(In hindsight I could have just used "-ui json" instead of tmux).

	#!/bin/sh
	set -ex
	N=${1:-100}
	kak=${2:-./kak.opt}
	for i in $(seq "$N")
	do
		echo -n "\
	2022-02-06 * Earth
	    expense:electronics:audio    116.7 USD
	    liability:card              -116.7 USD

	2022-02-06 * Blue Yeti USB Microphone
	    expense:electronics:audio    116.7 USD
	    liability:card              -116.7 USD

	"
	done > big-journal.ledger
	echo > .empty-tmux.conf 'set -sg escape-time 5'
	test_tmux() {
		tmux -S .tmux-socket -f .empty-tmux.conf "$@"
	}
	test_tmux new-session -d "$kak" big-journal.ledger
	test_tmux send-keys '%sYeti' Enter c 1234567890
	sleep .2
	test_tmux send-keys Escape
	while ! test_tmux capture-pane -p | grep 123
	do
		sleep .1
	done
	test_tmux send-keys ':wq' Enter
	while test_tmux ls
	do
		sleep .1
	done
	rm -f .tmux-socket .empty-tmux.conf

This script's runtime used to grow super-linearly but now it grows
linearly:

	         kak.old  kak.new
	N=10000    1.142    0.897
	N=20000    2.879    1.400

Detailed results:

	$ hyperfine -w 1 './bench.sh 10000 ./kak.opt.'{old,new}
	Benchmark 1: ./bench.sh 10000 ./kak.opt.old
	  Time (mean ± σ):      1.142 s ±  0.072 s    [User: 0.252 s, System: 0.059 s]
	  Range (min … max):    1.060 s …  1.242 s    10 runs

	Benchmark 2: ./bench.sh 10000 ./kak.opt.new
	  Time (mean ± σ):     897.2 ms ±  19.3 ms    [User: 241.6 ms, System: 57.4 ms]
	  Range (min … max):   853.9 ms … 923.6 ms    10 runs

	Summary
	  './bench.sh 10000 ./kak.opt.new' ran
	    1.27 ± 0.09 times faster than './bench.sh 10000 ./kak.opt.old'
	$ hyperfine -w 1 './bench.sh 20000 ./kak.opt.'{old,new}
	Benchmark 1: ./bench.sh 20000 ./kak.opt.old
	  Time (mean ± σ):      2.879 s ±  0.065 s    [User: 0.553 s, System: 0.126 s]
	  Range (min … max):    2.768 s …  2.963 s    10 runs

	Benchmark 2: ./bench.sh 20000 ./kak.opt.new
	  Time (mean ± σ):      1.400 s ±  0.018 s    [User: 0.428 s, System: 0.083 s]
	  Range (min … max):    1.374 s …  1.429 s    10 runs

	Summary
	  './bench.sh 20000 ./kak.opt.new' ran
	    2.06 ± 0.05 times faster than '../repro.sh 20000 ./kak.opt.old'
@mawww
Copy link
Owner

mawww commented Sep 18, 2022

Latest master should be much faster for the two use cases described here, can you confirm and close ?

krobelus added a commit to krobelus/kakoune that referenced this issue Sep 18, 2022
Some of our highlighters abuse the regions highlighter by using one
of these patterns as region end:

	'\K'
	'.\K'
	(?!\\).(?=>)|\n
	(?!\\).(?=\))|\s
	\b

These highlighters should be regex highlighters (because they have
no children) but are probably required to be region highlighters to
compete with other region highlighters.

A region highlighter works by first computing *all* matches of both
region start and region end (which faciliates caching).  Unfortunately
this means that these oddball region-end-patterns create loads of
matches, most of which are useless.

Let's standardize these hacks by allowing the region end to be empty.
This allows us to match the entire region in the region begin pattern,
and do no extra work if the region end is empty.

It's still hacky (the region highlighter feels like the wrong tool
for some of these cases) but I don't think we have a good alternative
today.

On the [5MB Markdown file] this gives a decent speedup for the initial
highlighting pass:

	$ HOME=$PWD hyperfine -w 1 'git checkout HEAD'{~,}' -- :/rc/filetype/markdown.kak && ./kak.opt big_markdown.md -e "hook global NormalIdle .* quit" -ui dummy'
	Benchmark 1: git checkout HEAD~ -- :/rc/filetype/markdown.kak && ./kak.opt big_markdown.md -e "hook global NormalIdle .* quit" -ui dummy
	Error: Command terminated with non-zero exit code: 127. Use the '-i'/'--ignore-failure' option if you want to ignore this. Alternatively, use the '--show-output' option to debug what went wrong.
	$ cd src
	$ HOME=$PWD hyperfine -w 1 'git checkout HEAD'{~,}' -- :/rc/filetype/markdown.kak && ./kak.opt big_markdown.md -e "hook global NormalIdle .* quit" -ui dummy'
	Benchmark 1: git checkout HEAD~ -- :/rc/filetype/markdown.kak && ./kak.opt big_markdown.md -e "hook global NormalIdle .* quit" -ui dummy
	  Time (mean ± σ):      1.754 s ±  0.028 s    [User: 1.720 s, System: 0.034 s]
	  Range (min … max):    1.725 s …  1.804 s    10 runs
	 
	Benchmark 2: git checkout HEAD -- :/rc/filetype/markdown.kak && ./kak.opt big_markdown.md -e "hook global NormalIdle .* quit" -ui dummy
	  Time (mean ± σ):     988.6 ms ±  10.1 ms    [User: 974.3 ms, System: 17.5 ms]
	  Range (min … max):   980.6 ms … 1011.6 ms    10 runs
	 
	Summary
	  'git checkout HEAD -- :/rc/filetype/markdown.kak && ./kak.opt big_markdown.md -e "hook global NormalIdle .* quit" -ui dummy' ran
	    1.77 ± 0.03 times faster than 'git checkout HEAD~ -- :/rc/filetype/markdown.kak && ./kak.opt big_markdown.md -e "hook global NormalIdle .* quit" -ui dummy'

(The numbers depend on other optimizations I'm using.)

---

Regarding markdown.kak :

To avoid the empty-region-end hack I considered using regex
highlighters (instead of region highlighters) for Markdown
URLs.  However, we probably want to keep highlighting bare
URLs (without enclosing <>).  This violates the [CommonMark
spec](https://spec.commonmark.org/0.29/#example-607) but both GitHub
Markdown and editors like VSCode and Emacs highlight bare URLs here.

The problem with highlighting bare URLs is that
they may contain Markdown metacharacters, see
https://spec.commonmark.org/dingus/?text=https%3A%2F%2Fexample.com%2F(1)_2_(3)

So unless we keep using a region highlighter, it will be very difficult
to avoid spurious Markdown highlighting in bare URLs.

So for the above performance improvement we need the empty-region-end hack.
This means we need to match the entire URL in the region-start pattern.
This is slightly challenging because a URL can contain balanced
parentheses but it must not include a trailing ")".
CommonMark [mandates](https://spec.commonmark.org/0.30/#link-destination)
support for at least 3 levels of nested parentheses, so let's just
unroll the regex to implement that.

---

[5MB Markdown file]: mawww#4685 (comment)
@amano-kenji
Copy link
Author

I just tested it. It's fixed.

@amano-kenji
Copy link
Author

Not everything from krobelus has been merged, but it seems fast already.

krobelus added a commit to krobelus/kakoune that referenced this issue Apr 16, 2023
…ighlighter

Some of our highlighters abuse the regions highlighter by using one
of these patterns as region end:

	'\K'
	'.\K'
	(?!\\).(?=>)|\n
	(?!\\).(?=\))|\s
	\b

These highlighters should be regex highlighters (because they have
no children) but are probably required to be region highlighters to
compete with other region highlighters.

A region highlighter works by first computing *all* matches of both
region start and region end (which faciliates caching).  Unfortunately
this means that these oddball region-end-patterns create loads of
matches, most of which are useless.

Let's standardize these hacks by allowing the region end to be empty.
This allows us to match the entire region in the region begin pattern,
and do no extra work if the region end is empty.

It's still hacky (the region highlighter feels like the wrong tool
for some of these cases) but I don't think we have a good alternative
today.

On the [5MB Markdown file] this gives a decent speedup for the initial
highlighting pass:

	$ HOME=$PWD hyperfine -w 1 'git checkout HEAD'{~,}' -- :/rc/filetype/markdown.kak && ./kak.opt big_markdown.md -e "hook global NormalIdle .* quit" -ui dummy'
	Benchmark 1: git checkout HEAD~ -- :/rc/filetype/markdown.kak && ./kak.opt big_markdown.md -e "hook global NormalIdle .* quit" -ui dummy
	Error: Command terminated with non-zero exit code: 127. Use the '-i'/'--ignore-failure' option if you want to ignore this. Alternatively, use the '--show-output' option to debug what went wrong.
	$ cd src
	$ HOME=$PWD hyperfine -w 1 'git checkout HEAD'{~,}' -- :/rc/filetype/markdown.kak && ./kak.opt big_markdown.md -e "hook global NormalIdle .* quit" -ui dummy'
	Benchmark 1: git checkout HEAD~ -- :/rc/filetype/markdown.kak && ./kak.opt big_markdown.md -e "hook global NormalIdle .* quit" -ui dummy
	  Time (mean ± σ):      1.754 s ±  0.028 s    [User: 1.720 s, System: 0.034 s]
	  Range (min … max):    1.725 s …  1.804 s    10 runs

	Benchmark 2: git checkout HEAD -- :/rc/filetype/markdown.kak && ./kak.opt big_markdown.md -e "hook global NormalIdle .* quit" -ui dummy
	  Time (mean ± σ):     988.6 ms ±  10.1 ms    [User: 974.3 ms, System: 17.5 ms]
	  Range (min … max):   980.6 ms … 1011.6 ms    10 runs

	Summary
	  'git checkout HEAD -- :/rc/filetype/markdown.kak && ./kak.opt big_markdown.md -e "hook global NormalIdle .* quit" -ui dummy' ran
	    1.77 ± 0.03 times faster than 'git checkout HEAD~ -- :/rc/filetype/markdown.kak && ./kak.opt big_markdown.md -e "hook global NormalIdle .* quit" -ui dummy'

(The numbers depend on other optimizations I'm using.)

---

Regarding markdown.kak :

To avoid the empty-region-end hack I considered using regex
highlighters (instead of region highlighters) for Markdown
URLs.  However, we probably want to keep highlighting bare
URLs (without enclosing <>).  This violates the [CommonMark
spec](https://spec.commonmark.org/0.29/#example-607) but both GitHub
Markdown and editors like VSCode and Emacs highlight bare URLs here.

The problem with highlighting bare URLs is that
they may contain Markdown metacharacters, see
https://spec.commonmark.org/dingus/?text=https%3A%2F%2Fexample.com%2F(1)_2_(3)

So unless we keep using a region highlighter, it will be very difficult
to avoid spurious Markdown highlighting in bare URLs.

So for the above performance improvement we need the empty-region-end hack.
This means we need to match the entire URL in the region-start pattern.
This is slightly challenging because a URL can contain balanced
parentheses but it must not include a trailing ")".
CommonMark [mandates](https://spec.commonmark.org/0.30/#link-destination)
support for at least 3 levels of nested parentheses, so let's just
unroll the regex to implement that.

---

[5MB Markdown file]: mawww#4685 (comment)
krobelus added a commit to krobelus/kakoune that referenced this issue Apr 16, 2023
…low-empty-region-end:generic-language-highlighting:optimize-ledger:faster-update-matches] Consolidated optimizations

Part of mawww#4685
krobelus added a commit to krobelus/kakoune that referenced this issue May 13, 2023
On huge files [1], initial highlighting can take a while.
The canonical solution is to disable highlighting for such files.
However we can improve the experience if the user forgot to do so

Allow regex highlighters to be interrupted with Control-C.  When that
happens, mark the affected buffer and never again try to highlight
it with regex highlighters.

This commit is experimental because the behavior is difficult to
predict (it only affects some highlighters) and I'm not sure it's
really needed in practice.

[1]: mawww#4685 (comment)
krobelus added a commit to krobelus/kakoune that referenced this issue May 13, 2023
On huge files [1], initial highlighting can take a while.
The canonical solution is to disable highlighting for such files.
However we can improve the experience if the user forgot to do so

Allow regex highlighters to be interrupted with Control-C.  When that
happens, mark the affected buffer and never again try to highlight
it with regex highlighters.

This commit is experimental because the behavior is difficult to
predict (it only affects some highlighters) and I'm not sure it's
really needed in practice.

[1]: mawww#4685 (comment)
krobelus added a commit to krobelus/kakoune that referenced this issue May 18, 2023
On huge files [1], initial highlighting can take a while.
The canonical solution is to disable highlighting for such files.
However we can improve the experience if the user forgot to do so

Allow regex highlighters to be interrupted with the interrupt key.
When that happens, mark the affected buffer and never again try to
highlight it with regex highlighters.

This commit is experimental because the behavior is difficult to
predict (it only affects some highlighters) and I'm not sure it's
really needed in practice.

[1]: mawww#4685 (comment)
krobelus added a commit to krobelus/kakoune that referenced this issue May 19, 2023
…ighlighter

Some of our highlighters abuse the regions highlighter by using one
of these patterns as region end:

	'\K'
	'.\K'
	(?!\\).(?=>)|\n
	(?!\\).(?=\))|\s
	\b

These highlighters should be regex highlighters (because they have
no children) but are probably required to be region highlighters to
compete with other region highlighters.

A region highlighter works by first computing *all* matches of both
region start and region end (which faciliates caching).  Unfortunately
this means that these oddball region-end-patterns create loads of
matches, most of which are useless.

Let's standardize these hacks by allowing the region end to be empty.
This allows us to match the entire region in the region begin pattern,
and do no extra work if the region end is empty.

It's still hacky (the region highlighter feels like the wrong tool
for some of these cases) but I don't think we have a good alternative
today.

On the [5MB Markdown file] this gives a decent speedup for the initial
highlighting pass:

	$ HOME=$PWD hyperfine -w 1 'git checkout HEAD'{~,}' -- :/rc/filetype/markdown.kak && ./kak.opt big_markdown.md -e "hook global NormalIdle .* quit" -ui dummy'
	Benchmark 1: git checkout HEAD~ -- :/rc/filetype/markdown.kak && ./kak.opt big_markdown.md -e "hook global NormalIdle .* quit" -ui dummy
	Error: Command terminated with non-zero exit code: 127. Use the '-i'/'--ignore-failure' option if you want to ignore this. Alternatively, use the '--show-output' option to debug what went wrong.
	$ cd src
	$ HOME=$PWD hyperfine -w 1 'git checkout HEAD'{~,}' -- :/rc/filetype/markdown.kak && ./kak.opt big_markdown.md -e "hook global NormalIdle .* quit" -ui dummy'
	Benchmark 1: git checkout HEAD~ -- :/rc/filetype/markdown.kak && ./kak.opt big_markdown.md -e "hook global NormalIdle .* quit" -ui dummy
	  Time (mean ± σ):      1.754 s ±  0.028 s    [User: 1.720 s, System: 0.034 s]
	  Range (min … max):    1.725 s …  1.804 s    10 runs

	Benchmark 2: git checkout HEAD -- :/rc/filetype/markdown.kak && ./kak.opt big_markdown.md -e "hook global NormalIdle .* quit" -ui dummy
	  Time (mean ± σ):     988.6 ms ±  10.1 ms    [User: 974.3 ms, System: 17.5 ms]
	  Range (min … max):   980.6 ms … 1011.6 ms    10 runs

	Summary
	  'git checkout HEAD -- :/rc/filetype/markdown.kak && ./kak.opt big_markdown.md -e "hook global NormalIdle .* quit" -ui dummy' ran
	    1.77 ± 0.03 times faster than 'git checkout HEAD~ -- :/rc/filetype/markdown.kak && ./kak.opt big_markdown.md -e "hook global NormalIdle .* quit" -ui dummy'

(The numbers depend on other optimizations I'm using.)

---

Regarding markdown.kak :

To avoid the empty-region-end hack I considered using regex
highlighters (instead of region highlighters) for Markdown
URLs.  However, we probably want to keep highlighting bare
URLs (without enclosing <>).  This violates the [CommonMark
spec](https://spec.commonmark.org/0.29/#example-607) but both GitHub
Markdown and editors like VSCode and Emacs highlight bare URLs here.

The problem with highlighting bare URLs is that
they may contain Markdown metacharacters, see
https://spec.commonmark.org/dingus/?text=https%3A%2F%2Fexample.com%2F(1)_2_(3)

So unless we keep using a region highlighter, it will be very difficult
to avoid spurious Markdown highlighting in bare URLs.

So for the above performance improvement we need the empty-region-end hack.
This means we need to match the entire URL in the region-start pattern.
This is slightly challenging because a URL can contain balanced
parentheses but it must not include a trailing ")".
CommonMark [mandates](https://spec.commonmark.org/0.30/#link-destination)
support for at least 3 levels of nested parentheses, so let's just
unroll the regex to implement that.

---

[5MB Markdown file]: mawww#4685 (comment)
krobelus added a commit to krobelus/kakoune that referenced this issue May 19, 2023
…low-empty-region-end:generic-language-highlighting:optimize-ledger:faster-update-matches] Consolidated optimizations

Part of mawww#4685
krobelus added a commit to krobelus/kakoune that referenced this issue May 19, 2023
On huge files [1], initial highlighting can take a while.
The canonical solution is to disable highlighting for such files.
However we can improve the experience if the user forgot to do so

Allow regex highlighters to be interrupted with the interrupt key.
When that happens, mark the affected buffer and never again try to
highlight it with regex highlighters.

This commit is experimental because the behavior is difficult to
predict (it only affects some highlighters) and I'm not sure it's
really needed in practice.

[1]: mawww#4685 (comment)
krobelus added a commit to krobelus/kakoune that referenced this issue May 21, 2023
On huge files [1], initial highlighting can take a while.
The canonical solution is to disable highlighting for such files.
However we can improve the experience if the user forgot to do so

Allow regex highlighters to be interrupted with the interrupt key.
When that happens, mark the affected buffer and never again try to
highlight it with regex highlighters.

This commit is experimental because the behavior is difficult to
predict (it only affects some highlighters) but it could be quite
helpful for users who forgot to disable highlighting.

[1]: mawww#4685 (comment)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

4 participants