release: runtime/grammars is quite large #6187

Closed
ghost opened this issue Mar 5, 2023 · 17 comments
Labels
C-discussion Category: Discussion or questions that doesn't represent real issues

Comments

@ghost

ghost commented Mar 5, 2023

Currently I use Vim, but I have been thinking about switching to Helix. I prefer minimal programs, so the first test I did was extracted size. First Vim:

https://github.com/vim/vim-win32-installer/releases/download/v9.0.1380/gvim_9.0.1380_x64.zip

where the extracted size is 51 MB. Then Helix:

https://github.com/helix-editor/helix/releases/download/22.12/helix-22.12-x86_64-windows.zip

where the extracted size is 111 MB, over double. I was curious what takes up the space, so I checked. Currently the runtime/grammars folder is 95.6 MB, or 86% of the total extracted size. More detail:

  Length Name
  ------ ----
18198528 verilog.dll
15670784 lean.dll
 4071424 c-sharp.dll
 3690496 kotlin.dll
 3278336 perl.dll
 2999808 haskell.dll
 2756608 ocaml.dll
 2615296 d.dll
 2310656 cpp.dll

For example, the Verilog grammar is currently 17.3 MB, or 16% of the total Helix extracted size. I have never used Verilog, and don't even know what it is. Would it be possible to have a Helix download without the grammars, so that users can download whatever grammars they need separately?
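For anyone who wants to reproduce this kind of breakdown, here is a minimal Python sketch. It is not part of Helix; the function name `size_report` is made up for illustration, and it simply walks a folder and reports each file's share of the total.

```python
from pathlib import Path


def size_report(root, top=10):
    """Return (total_bytes, rows) for all files under `root`.

    Each row is (file_name, size_in_bytes, percent_of_total), largest first.
    `size_report` is a hypothetical helper, not a Helix API.
    """
    files = [(p, p.stat().st_size) for p in Path(root).rglob("*") if p.is_file()]
    total = sum(size for _, size in files)
    largest = sorted(files, key=lambda item: item[1], reverse=True)[:top]
    rows = [
        (p.name, size, 100.0 * size / total if total else 0.0)
        for p, size in largest
    ]
    return total, rows


def format_report(total, rows):
    """Render the report roughly in the style of the listing above."""
    lines = [f"total: {total / 2**20:.1f} MiB"]
    for name, size, pct in rows:
        lines.append(f"{size:>10} {pct:5.1f}%  {name}")
    return "\n".join(lines)
```

Calling `print(format_report(*size_report("runtime/grammars")))` from the extracted release folder would list the largest grammars, assuming that is where the release was unpacked.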

@dead10ck
Member

dead10ck commented Mar 5, 2023

@ghost
Author

ghost commented Mar 5, 2023

Ideally I would like the releases here to be smaller:

https://github.com/helix-editor/helix/releases

Whether that means stripping the DLLs, or omitting them and offering them as a separate download, I am not particular about the solution. For more data, currently Vim uses three files for Verilog:

https://github.com/vim/vim/blob/master/runtime/ftplugin/verilog.vim
https://github.com/vim/vim/blob/master/runtime/syntax/verilog.vim
https://github.com/vim/vim/blob/master/runtime/indent/verilog.vim

combined size is 14.85 KB. so the Helix Verilog DLL is currently over 1,000 times larger than that.

@dead10ck
Member

dead10ck commented Mar 5, 2023

> combined size is 14.85 KB. so the Helix Verilog DLL is currently over 1,000 times larger than that.

These grammars are 1. compiled and 2. do quite a bit more than provide some regex patterns for syntax highlighting.

Generally speaking, these releases are meant to provide a ready-to-use package for helix, with batteries included. This is part of helix's vision. Minimalism for the sake of minimalism is not. And providing extra "stripped down" variants of our releases on an ongoing basis, for something that isn't a problem for most people, probably isn't something the maintainers have an appetite for.

If you'd like to reduce the disk footprint, it's quite easy to delete these files after installing.
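As a rough illustration of that cleanup step, here is a small Python sketch. It is an assumption-laden helper, not something Helix ships: `prune_grammars` is a hypothetical name, and it assumes compiled grammars sit directly in one folder as `.dll`, `.so`, or `.dylib` files named after the language.

```python
from pathlib import Path


def prune_grammars(grammar_dir, keep, dry_run=True):
    """Delete compiled grammars whose base name is not in `keep`.

    Returns the number of bytes freed (or that would be freed when
    dry_run=True, in which case nothing is deleted).
    Hypothetical helper; the file layout is an assumption.
    """
    freed = 0
    for path in sorted(Path(grammar_dir).iterdir()):
        # Only consider compiled grammar libraries; leave other files alone.
        if not path.is_file() or path.suffix not in {".dll", ".so", ".dylib"}:
            continue
        if path.stem in keep:
            continue
        freed += path.stat().st_size
        if not dry_run:
            path.unlink()
    return freed
```

Running it first with the default `dry_run=True`, for example `prune_grammars("runtime/grammars", {"rust", "c", "cpp"})`, shows how much space would be reclaimed before anything is removed.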

@archseer
Member

archseer commented Mar 5, 2023

If you want a minimal release, handpicking the grammars is what you want. A release is 10MB which is still in the acceptable range for me. In the future we'll probably stop shipping as many grammars and compile some on demand.

@archseer
Member

archseer commented Mar 5, 2023

As also mentioned: we use tree-sitter grammars, which are compiled libraries that do more than just simple highlighting. They're actual parsers that let us understand the syntax a lot more deeply, determine indentation, and make other decisions (jump to definition, unused references, etc.).

@archseer
Member

archseer commented Mar 5, 2023

@dead10ck does maintain various areas of this editor. He's the sole maintainer of auto pairs.

> Extracting 100 MB, only to delete 80 MB, is not something I have an interest in.

If you have custom requirements you're always welcome to build from source to your specifications.

@archseer
Member

archseer commented Mar 5, 2023

I agree that Verilog could be an optional grammar that's not included in the default release, though.

@kirawi
Member

kirawi commented Mar 5, 2023

It's understandable that you feel irked about the previous responses, but let's try to move on from it so other users can benefit from this issue :)

@archseer
Member

archseer commented Mar 5, 2023

> OK, well perhaps you could actually add him to the organization?

He is a part of the org. Either way, it shouldn't matter in a discussion.

> This is about the users who might agree with me. if you take the entire ftplugin, indent and syntax folders from the Vim release above, you get 7.31 MB, or 14% of the extracted release size. As previously mentioned, runtime/grammars for Helix is currently 86% of the total release size.

And as I've previously stated, that's not a fair comparison because you're comparing vimscript code containing a bunch of highlight regexes vs libraries (binary code) that are full-blown parsers. A more fair comparison would be versus nvim+nvim-treesitter (after you've compiled grammars).

Releases are there for convenience (prefer to use an official package when you can; it'll actually set up the runtime path for you correctly), so we try to include all possible grammars for now. If we don't, then we have to go down the rabbit hole of deciding which popular languages to include (upsetting users when their preferred language doesn't work out of the box). More importantly, if we drop Verilog from the build now, Verilog users have to manually compile the grammar, which causes a lot more headaches than you downloading a slightly larger tarball and excluding some grammars on unpack (grammars compress quite well).
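Excluding grammars on unpack could look something like the following Python sketch for the Windows zip. This is only an illustration: `extract_without` is a hypothetical helper, and it assumes the archive stores compiled grammars as files inside a folder named `grammars`, named after the language.

```python
import zipfile
from pathlib import PurePosixPath


def extract_without(archive, dest, excluded):
    """Extract `archive` into `dest`, skipping grammar libraries in `excluded`.

    `excluded` is a set of grammar base names such as {"verilog", "lean"}.
    Returns the list of archive member names that were skipped.
    Hypothetical helper; the archive layout is an assumption.
    """
    skipped = []
    with zipfile.ZipFile(archive) as zf:
        for info in zf.infolist():
            # Zip member names always use forward slashes.
            member = PurePosixPath(info.filename)
            in_grammars = "grammars" in member.parts[:-1]
            if in_grammars and not info.is_dir() and member.stem in excluded:
                skipped.append(info.filename)
                continue
            zf.extract(info, dest)
    return skipped
```

For example, `extract_without("helix-22.12-x86_64-windows.zip", "helix", {"verilog", "lean"})` would unpack everything except those two grammars, so the large files never touch the disk.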

As an example, here's the official Alpine package: https://pkgs.alpinelinux.org/package/edge/community/x86_64/helix

It comes with no grammars installed and it's up to the user to install what they need (https://pkgs.alpinelinux.org/packages?name=tree-sitter-*&branch=edge&repo=&arch=x86_64&maintainer=) and the grammars can be shared with other editors. But this is dependent on upstream packaging decisions.

We do listen to feedback in this area: In older versions helix actually bundled all the grammars into a single binary, which was a lot more inefficient and didn't allow excluding grammars or building custom grammars without a full editor recompile.

> but we might want to ask ourselves why a single language DLL is larger than the editor executable itself.

Yeah, I've stuck with the Verilog example because it's an outlier; the rest of the grammars are a lot more reasonably sized. It's partly how the grammars are structured, but very complex grammars also produce a lot of complex parsing states: https://github.com/tree-sitter/tree-sitter-verilog/blob/master/grammar.js

@archseer
Member

archseer commented Mar 5, 2023

That said, I think binary size isn't the best metric for judging whether a program is minimal. I'm sure if you compared the source code of the two you'd find helix is a lot slimmer. (Our package is only slightly larger than neovim's, yet it provides support for lots of languages out of the box and has almost no build dependencies.)

@dryya

dryya commented Mar 8, 2023

@4cq2 I don't mean to spam this issue, but as mentioned in another comment in that thread, would

[editor.cursor-shape]
insert = "bar"
normal = "block"
select = "underline"

satisfy your requirements?

(FWIW: I also thought always being in block cursor mode was kind of weird, but now I think it makes more sense given that you always have a selection that is maintained through entering/exiting insert mode. In my case, I was only bothered by it until I set editor.color-modes = true, when I realized that I was having a bit of an XY problem which was mostly just having a really hard time differentiating modes with the default settings.)

@kirawi kirawi added the C-discussion Category: Discussion or questions that doesn't represent real issues label Mar 9, 2023
@Zapeth

Zapeth commented Mar 12, 2023

FWIW this issue is also something that irks me about helix, even though it's actually a tree-sitter issue (i.e. tree-sitter/tree-sitter#1799).

Reading some other tree-sitter related discussions I got the impression that sometimes the grammar is just written in a way that disproportionately increases the generated code size and can be mitigated by writing it slightly differently, but that might be language specific.

Packaging the editor without the grammars would be nicer if you only work with some specific languages, but that would also make the update process more tedious.

I guess it depends on how often the grammars need to be updated, or are they always tied to the release they are packaged with?

@EriKWDev

EriKWDev commented Mar 21, 2023

I don't mind the binary being 100M, but having single grammar sources be 100M feels extreme

du -h . | sort -h | tail
34M	./grammars/sources/scala
41M	./grammars/sources/ponylang
56M	./grammars/sources/verilog
96M	./grammars/sources/lean

ponylang??

I understand that we want batteries included with helix, but where do we draw the line?

@ghost
Author

ghost commented Mar 21, 2023

I say keep it in single digits. Pick the 9 most popular languages, by whatever metric you want. Then put the rest of them in a separate download, either each language as a separate download, or one bulk "other languages" download. I hope others can agree that the current system is not ideal, and is only going to get worse.

@Likeny3

Likeny3 commented Oct 18, 2023

I agree, the distribution size is frankly just ridiculous.
I'd suggest using something like UPX which compresses the grammars folder down to ~9MB for me on compression level 9 and doesn't take long at all.
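To get a rough feel for how compressible a grammars folder is without installing UPX, a sketch like the one below can help. Note the swap: it uses Python's standard `zlib` module as a stand-in, not UPX, so the numbers will differ from what UPX reports, and unlike UPX the output is not a loadable library. The name `compression_report` is hypothetical.

```python
import zlib
from pathlib import Path


def compression_report(grammar_dir, level=9):
    """Return (raw_bytes, compressed_bytes) summed over files in grammar_dir.

    Each file is compressed on its own with zlib at the given level.
    This only estimates compressibility; it is not UPX and writes nothing.
    """
    raw = packed = 0
    for path in Path(grammar_dir).iterdir():
        if path.is_file():
            data = path.read_bytes()
            raw += len(data)
            packed += len(zlib.compress(data, level))
    return raw, packed
```

Something like `raw, packed = compression_report("runtime/grammars")` followed by printing `packed / raw` gives the overall ratio for whatever folder it is pointed at.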

@dead10ck
Member

This is not really something we're interested in tackling right now. If you'd like a smaller package, you are welcome to make one that fits your needs.

@dead10ck dead10ck closed this as not planned Oct 18, 2023
@helix-editor helix-editor locked and limited conversation to collaborators Oct 18, 2023
@archseer
Member

Grammar sources also aren't meant to be packaged and distributed, and are usually excluded by packagers. They're simply there for development since re-cloning the grammar every single time from scratch would take a ton of time.

7 participants