Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

replace submodules with flags for fetching and building grammars #1659

Merged
merged 18 commits into from
Mar 10, 2022

Conversation

the-mikedavis
Copy link
Member

@the-mikedavis the-mikedavis commented Feb 13, 2022

This PR adds a non-interactive --grammar {fetch|build} (aliased to -g) flag to the hx binary which replaces the current mechanisms for fetching tree-sitter grammars (git submodule ..) and building grammars (a compile-time step).

This is an alternative to the scripts1 approach in #1560.
Closes #1549.

Here's an outline of what changes:

💥 user-facing

You no longer need to install all grammars! Add a use-grammars key to the top of your languages.toml:

# Note: this must come **before** the [[language]] sections
use-grammars = { only = [ "rust", "c", "cpp" ] }
# or
use-grammars = { except = [ "json", "yaml" ] }
# or omit the key to use _all_ grammars.

Now running

hx --grammar build     # clones all 'use-grammars' in languags.toml
hx --grammar fetch     # compiles all 'use-grammars' in languages.toml

will fetch and build only the grammars allowed by use-grammars.

However now note that it is now possible to build or package helix without any grammars. You may need to invoke these new commands yourself if syntax highlighting is not working.

Are you building from source and don't like how long it takes to fetch the submodules? Now they're gone, and hx --grammar fetch is much faster: it takes me 5 seconds to fetch the entire set of grammars on a 10 Mbps connection ⚡

🏗️ development

We can now customize the set of grammars we'll build in CI for example. This makes the CI much faster: this branch consistently takes less than 1 minute for Check and Lint, 1-2 minutes for Doc, and 1-5 minutes for the Test workflow.

Grammar developers like myself rejoice! You can now point to an in-progress grammar along-side its [[language]] entry in the languages.toml like so:

[[grammar]]
name = "erlang"
source = { path = "~/trees/tree-sitter-erlang" }
# .. other config

Develop your grammar and then run hx --grammar build to compile it. Launch hx on a file of your language and viola, interactive grammar development! When you're ready to add the language upstream, specify the revision and remote of the grammar.

[[grammar]]
name = "erlang"
source = { git = "https://github.com/the-mikedavis/tree-sitter-erlang", rev = "86985bde399c5f40b00bc75f7ab70a6c69a5f9c3" }

Also, now all configuration for a grammar lives in languages.toml rather than in the build code, which affects the grammars for

  • ocaml(-interface)
  • typescript
  • tsx
  • jsx (not yet added)

All grammars that need subdirectories within their repositories to be built. This can be configured now like so:

[[language]]
name = "tsx"
# .. other config

[[grammar]]
name = "tsx"
source = { git = "..", rev = "..", subpath = "tsx" }

📦 packaging

The biggest change is that there are no more submodules, which has been a pain-point for at least nix (#1448).

Most packagers will not see a difference outside of the submodules change: grammars are fetched and compiled automatically in the helix-term build phase. Nix (and other packagers that sandbox network connectivity) may struggle with this. Set a HELIX_DISABLE_AUTO_GRAMMAR_BUILD enviroment variable to disable the automatic fetching and building of grammars.

If a packager would like to package the grammars themselves, you may parse the languages.toml for [[grammar]] entries and invoke a C compiler on the grammars.

❄️ nix packaging

This PR takes packaging with nix into special consideration. Because of sandboxing, nix builds should disable automatic grammar fetching with by setting a HELIX_DISABLE_AUTO_GRAMMAR_BUILD environment variable. Instead Nix should parse languages.toml for the [[grammar]] entries, download the grammars, and put them into the build's runtime/grammars/sources directory under the grammar's name key as each directory name. I am using an overlay in my dotfiles based on the derivation in nixpkgs which does this.

Note though that there was a bug in nix's built-in toml parser (builtins.fromTOML) that was only fixed in v2.6. To work around this, nix builds should convert languages.toml into JSON and use builtins.fromJSON. (See this commit for an example.)

You may also run over the tree-sitter grammars with a C(++) compiler with Nix, in order to fully control their packaging. ./grammars.nix from this PR provides an example.

🔮 future work

  • Streamline grammar downloads from GitHub by downloading a tarball at an exact revision (Nix uses this trick in fetchFromGitHub)
    • e.g. https://github.com/helix-editor/helix/archive/<rev>.zip
  • Enhance health-check information to include status of grammars (see Add --health command for troubleshooting #1669)

Footnotes

  1. Why not scripts? Scripts are difficult because (1) you need to maintain separate scripts for windows and unix, (2) it can be difficult to parse configuration data like build paths within grammar directories, and (3) its makes an arguably unergonomic and bespoke CLI for people to learn.

@dead10ck
Copy link
Member

dead10ck commented Feb 13, 2022

I really like this idea personally. Much tighter integration

  • build and revision info are right there in the language config with the rest of it
  • we can put runtime sanity checks at helix's startup time, like show a warning when grammars aren't built, or have a different revision, or a dirty repo (mimicking submodule warnings)
  • it could check for updates and print a message at startup, or maybe just have a dedicated command line switch to check for updates. This could be built into the CI as well. Could maybe even use toml_edit to update the languages.toml without editing the file by hand

This is a huge improvement in my opinion. Great idea.

@the-mikedavis the-mikedavis force-pushed the md-grammar-flags branch 2 times, most recently from f391fc8 to 70a78d3 Compare February 14, 2022 00:32
@the-mikedavis the-mikedavis force-pushed the md-grammar-flags branch 5 times, most recently from 91f458d to fc61878 Compare February 15, 2022 03:01
@the-mikedavis the-mikedavis marked this pull request as ready for review February 15, 2022 03:02
@the-mikedavis
Copy link
Member Author

There's a snag with updating the flake.nix: this line breaks builtins.fromTOML on nix < 2.6 (prior versions break spec by disallowing arrays with mixed tables and strings). We can get around that by converting the file to json (with something like yj) but that seems hacky. For now I left the grammars out of the package entirely.

@sudormrfbin
Copy link
Member

Do we need two flags to have separate fetch and build steps ? IMO replacing them both with a single flag to do the whole thing in one go would be more ergonomic (and more common).

@the-mikedavis
Copy link
Member Author

The separation is mostly for the sake of packagers. For example, nix should fetch the grammar repositories itself and then call out to helix to build them during packaging. I would be open to adding a flag that does both in addition to these flags though.

book/src/guides/adding_languages.md Outdated Show resolved Hide resolved
helix-term/src/grammars.rs Outdated Show resolved Hide resolved
helix-term/src/grammars.rs Outdated Show resolved Hide resolved
helix-term/src/grammars.rs Outdated Show resolved Hide resolved
helix-term/src/grammars.rs Outdated Show resolved Hide resolved
helix-term/src/grammars.rs Outdated Show resolved Hide resolved
helix-term/src/grammars.rs Outdated Show resolved Hide resolved
helix-term/src/grammars.rs Outdated Show resolved Hide resolved
helix-term/src/grammars.rs Outdated Show resolved Hide resolved
helix-term/src/grammars.rs Outdated Show resolved Hide resolved
@the-mikedavis the-mikedavis force-pushed the md-grammar-flags branch 2 times, most recently from 0295eea to 157cd4b Compare February 15, 2022 14:52
@the-mikedavis
Copy link
Member Author

the-mikedavis commented Feb 15, 2022

Bringing this back to draft status while I port the commands to git2

See following comment:

@the-mikedavis the-mikedavis marked this pull request as draft February 15, 2022 17:29
@the-mikedavis
Copy link
Member Author

the-mikedavis commented Feb 15, 2022

I started porting the Command stuff to git2 but fetching fails because we disable the default features (including the https/ssh stuff). Adding those features back in gives me a compilation error (can't find openssl - I assume because I'm on NixOS). the-mikedavis/helix@md-grammar-flags...md-grammar-flags-git2

So I'll try to cleaning up the typing with Command and we'll see how close that gets us.

Edit: with the Results now being passed up, I think this is now much cleaner 🎉

@the-mikedavis the-mikedavis force-pushed the md-grammar-flags branch 2 times, most recently from ec7a0fe to ca785c4 Compare February 15, 2022 21:02
@the-mikedavis the-mikedavis marked this pull request as ready for review February 15, 2022 21:02
@archseer
Copy link
Member

(can't find openssl - I assume because I'm on NixOS)

Yeah I'd like to avoid an openssl dependency, it's nice that there's no build time requirements right now (git2 bundles libgit2 if needed).

Here's an explainer on the Rust + NixOS + OpenSSL issue: https://nixos.wiki/wiki/Rust#Building_Rust_crates_that_require_external_system_libraries

helix-term/src/grammars.rs Outdated Show resolved Hide resolved
helix-term/src/grammars.rs Outdated Show resolved Hide resolved
helix-term/src/grammars.rs Outdated Show resolved Hide resolved
helix-term/src/grammars.rs Outdated Show resolved Hide resolved
helix-term/src/grammars.rs Outdated Show resolved Hide resolved
helix-term/src/grammars.rs Outdated Show resolved Hide resolved
helix-term/src/grammars.rs Outdated Show resolved Hide resolved
helix-term/src/grammars.rs Outdated Show resolved Hide resolved
helix-term/src/grammars.rs Outdated Show resolved Hide resolved
helix-term/src/grammars.rs Outdated Show resolved Hide resolved
@dead10ck
Copy link
Member

@archseer

Yeah I'd like to avoid an openssl dependency, it's nice that there's no build time requirements right now (git2 bundles libgit2 if needed).

Don't we have a build time dependency in git (and thus openssl) now? The build host and target host both need git the CLI tool, so they should already have openssl. And wouldn't we need that anyway for downloading zip files?

Sticking with shelling out isn't a bad choice, since these are relatively simple commands, but it seems odd to decide against libgit2 because NixOS makes easy things really hard.

@archseer
Copy link
Member

For the git2 crate:

Currently this library requires libgit2 1.4.0. The source for libgit2 is included in the libgit2-sys crate so there's no need to pre-install the libgit2 library, the libgit2-sys crate will figure that and/or build that for you.

It's not a decision based on NixOS, I've tried to avoid external dependencies because it simplifies the build process. It feels wrong to pull in openssl just for fetching grammars.

@the-mikedavis the-mikedavis force-pushed the md-grammar-flags branch 2 times, most recently from 77598dd to 963e7bc Compare March 9, 2022 15:49
the-mikedavis and others added 16 commits March 9, 2022 14:17
The submodules system is being replaced with a command-line flag

    hx --fetch-grammars

Which shallow-clones grammar repositories at the given revision and

    hx --build-grammars

For building grammars separate of the initial compilation of helix.

Why remove submodules?

* Cloning helix in general takes a long time because of the submodules,
  especially when the submodules are not fetched as shallow
* Packaging is consistently painful no matter the package-manager
* It is quite difficult to devise a scheme where users can declare
  a desired set of grammars and implement it with submodules

This commit fully removes the existing tree-sitter submodules from
the tree (as well as the .gitmodules file which is no longer used).
Here we add syntax to the languages.toml languge

    [[grammar]]
    name = "<name>"
    source = { .. }

Which can be used to specify a tree-sitter grammar separately of
the language that defines it, and we make this distinction for
two reasons:

* In later commits, we will separate this code from helix-core
  and bring it to a new helix-loader crate. Using separate schemas
  for language and grammar configurations allows for a nice divide
  between the types needed to be declared in helix-loader and in
  helix-core/syntax

* Two different languages may use the same grammar. This is currently
  the case with llvm-mir-yaml and yaml. We could accomplish a config
  that works for this with just `[[languages]]`, but it gets a bit
  dicey with languages depending on one another. If you enable
  llvm-mir-yaml and disable yaml, does helix still need to fetch and
  build tree-sitter-yaml? It could be a matter of interpretation.
helix-syntax mostly existed for the sake of the build task which
checks and compiles the submodules. Since we won't be relying on
that process anymore, it doesn't end up making much sense to have
a very thin crate just for some functions that we could port to
helix-core.

The remaining build-related code is moved to helix-term which will
be able to provide grammar builds through the --build-grammars CLI
flag.
This is not strictly speaking necessary. tree_sitter_library was used by
just one grammar: llvm-mir-yaml, which uses the yaml grammar. This will
make the language more consistent, though. Each language can explicitly
say that they use Some(grammar), defaulting when None to the grammar that
has a grammar_id matching the language's language_id.
build_grammars adapts the functionality that previously came from
helix-syntax to be used at runtime from the command line flags.

fetch_grammars wraps command-line git to perform the same actions
previously done in the scripts in helix-editor#1560.
The vision with 'use-grammars' is to allow the long-requested feature
of being able to declare your own set of grammars that you would like.
A simple schema with only/except grammar names controls the list
of grammars that is fetched and built. It does not (yet) control which
grammars may be loaded at runtime if they already exist.
This is a rather large refactor that moves most of the code for
loading, fetching, and building grammars into a new helix-loader
module. This works well with the [[grammars]] syntax for
languages.toml defined earlier: we only have to depend on the types
for GrammarConfiguration in helix-loader and can leave all the
[[language]] entries for helix-core.
This is a bit of a micro-optimization: in the current setup we waste
a thread in the pool for a local grammar only to println! a message
saying we're skipping fetching because it's a local grammar.
This restores much of the behavior that existed before this PR:
helix will build the grammars when compiling. The difference is that
now fetching is also done during the build phase and is done much
more quickly - both shallow and in parallel.
This commit replaces the out-of-date builder in the flake which relied
on submodules for fetching and the compiler for building. Now we
disable fetching and building explicitly with the environment variable
and then use builtins.fetchGit and a derivation mostly derived from
upstream to compile the grammars.

Anecdotally, this is still quite slow as builtins.fetchGit does not
seem to do shallow clones. I'm not sure I see a way around it though
without recording sha256s, which seems cumbersome.
The old flags were a bit long. --grammar is also aliased to -g to make
it even easier.
Here we perform a shallow fetch using builtins.fetchTree. In order
to make this work, we need to specify the `ref' for any repository
that doesn't have `master' as its default branch (I'm not sure why
this limitation exists since we don't need this when performing
the shallow fetch in `--grammar build')

This `ref' field is ignored by helix, so I have left it undocumented
for now, but I could be open to documenting it.
Looks like this was rebased a few hours ago and now the 789a171
revision no longer exists.
@archseer
Copy link
Member

Great work! 👍🏻

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

investigate replacing the submodules system for grammars
6 participants