Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Disable ropey unicode_lines feature #50

Merged
merged 1 commit into from
Dec 15, 2023
Merged

Disable ropey unicode_lines feature #50

merged 1 commit into from
Dec 15, 2023

Conversation

rojas-diego
Copy link
Contributor

Hi! First of all, thanks a lot for open-sourcing this.

I was working on an internal fork of the project and noticed a potential issue. I have little experience with the LSP and Ropey so this might not be relevant. Anyway, here's the issue:

With the current configuration, Ropey recognises more EOL sequences than the Language Server Protocol. This mismatch can lead to errors when trying to maintain a mirror of the user's documents as the llm-ls' representation might have more lines.

See: https://docs.rs/ropey/1.6.0/ropey/index.html#a-note-about-line-breaks
See: https://microsoft.github.io/language-server-protocol/specifications/lsp/3.17/specification/#textDocuments

With the current configuration, Ropey recognises more EOL sequences than the Language Server Protocol. This mismatch can lead to errors when trying to maintain a mirror of the user's documents as the llm-ls might have more lines.

See: https://docs.rs/ropey/1.6.0/ropey/index.html#a-note-about-line-breaks
See: https://microsoft.github.io/language-server-protocol/specifications/lsp/3.17/specification/#textDocuments
@McPatate
Copy link
Member

McPatate commented Dec 2, 2023

I think the fix here should rather be to attempt to set the encoding to utf8 in the initialization call where server and client exchange capabilities.
We could issue a warning to the client saying there might be some offsetting errors when utf-16 is the only supported format.
It could look something like this:

    async fn initialize(&self, params: InitializeParams) -> LspResult<InitializeResult> {
        *self.workspace_folders.write().await = params.workspace_folders;
+      let position_encoding = params.capabilities.general.and_then(|general_cap| {
+           general_cap.position_encodings.and_then(|encodings| {
+               if encodings.contains(&PositionEncodingKind::UTF8) {
+                   Some(PositionEncodingKind::UTF8)
+               } else {
+                    self.client.show_message(MessageType::WARNING, "utf8 is not supported, defaulting to utf16 which may result in offset errors").await;
+                   None
+               }
+           })
+       });
        Ok(InitializeResult {
            server_info: Some(ServerInfo {
                name: "llm-ls".to_owned(),
                version: Some(VERSION.to_owned()),
            }),
            capabilities: ServerCapabilities {
                text_document_sync: Some(TextDocumentSyncCapability::Kind(
                    TextDocumentSyncKind::INCREMENTAL,
                )),
+               position_encoding,
                ..Default::default()
            },
        })
    }

@rojas-diego
Copy link
Contributor Author

rojas-diego commented Dec 3, 2023

Hi! Thanks for the quick response.

Ropey Line Breaks

I think this is a separate issue from position encoding negotiation. Enforcing UTF8 position encodings will not prevent Ropey's line count from diverging from Tree Sitter's or VSCode's, rendering lsp::Ranges out-of-sync with the Document.

Issues with Encoding

Regarding the PositionEncodingKind, we had to issue a fix for this as well in our fork.

After looking at llm-ls's implementation, we noticed the mirror of the user's workspace goes out of sync and/or the server crashes when the user's document contains certain graphemes as the current implementation doesn't translate well between:

  • LSP UTF-16 character offsets
  • TreeSitter's byte offsets
  • Ropey's characters offsets (read Unicode code points)

Here's a video showcasing 1) a crash and 2) llm-ls' mirror going out of sync due to some unicode characters.

LLM-LS.Bug.Report.mov

Here's a test case that illustrates this:

Long Test Case for `Document::change`

Case:

mod test {
    use tower_lsp::lsp_types::Position;
    use tree_sitter::Node;

    #[allow(unused_imports)]
    use super::*;

    #[tokio::test]
    async fn test_document_change_tree_consistency_medium() {
        let a = "let a = '🥸 你好';\rfunction helloWorld() { return '🤲🏿'; }\nlet b = 'Hi, 😊';";

        let mut document = Document::open("javascript", a).await.unwrap();

        document
            .change(Range::new(Position::new(0, 14), Position::new(2, 13)), ",")
            .await
            .unwrap();

        let b = "let a = '🥸 你好,😊';";

        assert_eq!(document.text.to_string(), b);

        let mut parser = Parser::new();

        parser
            .set_language(tree_sitter_javascript::language())
            .unwrap();

        let b_tree = parser.parse(b, None).unwrap();

        assert!(nodes_are_equal_recursive(
            &document.tree.unwrap().root_node(),
            &b_tree.root_node()
        ));
    }

    #[allow(dead_code)]
    fn nodes_are_equal_recursive(node1: &Node, node2: &Node) -> bool {
        if node1.kind() != node2.kind() {
            return false;
        }

        if node1.start_byte() != node2.start_byte() {
            return false;
        }

        if node1.end_byte() != node2.end_byte() {
            return false;
        }

        if node1.start_position() != node2.start_position() {
            return false;
        }

        if node1.end_position() != node2.end_position() {
            return false;
        }

        if node1.child_count() != node2.child_count() {
            return false;
        }

        for i in 0..node1.child_count() {
            let child1 = node1.child(i).unwrap();
            let child2 = node2.child(i).unwrap();

            if !nodes_are_equal_recursive(&child1, &child2) {
                return false;
            }
        }

        true
    }
}

Output:

running 1 test
test document::test::test_document_change_tree_consistency_medium ... FAILED

failures:

---- document::test::test_document_change_tree_consistency_medium stdout ----
thread 'document::test::test_document_change_tree_consistency_medium' panicked at 'assertion failed: `(left == right)`
  left: `"let a = '🥸 你好',😊';"`,
 right: `"let a = '🥸 你好,😊';"`', crates/llm-ls/src/document.rs:293:9
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

Enforcing UTF-8 Position Encoding

I think the fix here should rather be to attempt to set the encoding to utf8 in the initialization call where server and client exchange capabilities.

This is by far the most convenient option, and we tried to go that route as well. Unfortunately, position encoding kind negotiation is a relatively new feature of the Language Server Protocol. It was introduced in 3.17.0, in 2022. As of writing, UTF-16 is the default, mandatory encoding of the protocol and must always be supported by servers.

I thought it was still worth a try but it turns out that the VSCode on my machine, the latest version, does not even support anything other than UTF-16.

[Info  - 09:40:14] Available position encodings: Some([PositionEncodingKind("utf-16")])

In our case, we deemed it was necessary to stick to UTF-16 for the following reasons:

  • We put all this effort into maintaining a perfect replica of the user's workspace to get accurate syntax trees to provide relevant completions at key points of the user's file. If we allow the user's file and our representation to become out of sync, we loose this ability and we would be better off with using regexes in the end.
  • Looking at VSCode's language client code, it doesn't seem to support UTF-16, concurring the experience I had on my machine. We didn't want to fight against the protocol and assumed many other clients would lack support for UTF-8 too.

If you managed to successfully negotiate UTF-8 position encodings, we'd love to hear more!


Once again, I might be wrong about some of the details of this. It's based off my current understanding of LSP and library implementations which is quite modest.

Anyway, all this Unicode handling caused quite the headaches on our side so we'd be happy to upstream our tests and implementation which handles the translation between the different encodings if you want. We're also looking at other approaches to simplify this complexity and if you're interested, I'm happy to collaborate on this together!

Bon Dimanche 🤗

@McPatate
Copy link
Member

McPatate commented Dec 3, 2023

Thanks for the detailed response. I'd be happy to take a look at your code and more than happy to have you contribute to llm-ls.

My main concern is that the rope crate only supports utf8, so I'll have to check if there are other rope crates that do support utf16.

I'm also not sure how other editors fair regarding Unicode encoding, I'll take a look.

For now let's merge the current PR.

@rojas-diego
Copy link
Contributor Author

Awesome! I'll kickstart a PR today or tomorrow.

I see the CI is not passing, it seems it's failing on main too, is that an issue?

@McPatate
Copy link
Member

McPatate commented Dec 4, 2023

I see the CI is not passing, it seems it's failing on main too, is that an issue?

I'm not sure where it's coming from, just updated the CI's secret, let's see if that was the issue.

@rojas-diego
Copy link
Contributor Author

Hi @McPatate, hope you're doing good!

I was working on upstreaming our document syncing implementation but two things happened:

1. Didn't get around to thoroughly test it

I'm pretty sure our current implementation is correct but I also wouldn't bet my hand on it. Currently we have a few unit tests that ensure the documents are kept in sync by simulating different kind of edits but it's hard to know whether it will stand the test of thousand of user files/actions once in production.

I'd feel bad about sharing under-tested code hence I'm currently looking around the web to try to find suitable edit traces that could help us simulate real-world, long-lived, complex editing sessions in our tests and ensure that everything matches in the end.

2. Uncovered additional issues this time relating to tower-lsp

There are other inherent design decisions of tower-lsp that could lead in our case and yours to out-of-sync-documents or completions based on outdated context1 in rare and less rare cases so that ate up quite a bit of our time as we are looking to mitigate that too ☹️


Given all this, I've created a Gist and just dumped our document.rs implementation for reference.

  • If it looks good to you, I'll make a PR to integrate it
  • If you feel like it would benefit from more testing, I'll circle back once we found a scalable way to test it

LMK. Cheers!

Footnotes

  1. See tower-lsp#284 deno#10437

@McPatate
Copy link
Member

McPatate commented Dec 15, 2023

Hey @rojas-diego, thanks for the detailed message.

I'm going to merge the PR, just ran testbed locally and lgtm. Don't hesitate to run testbed yourself in future PRs, this is the way I test llm-ls.

Regarding 1., I think the tests you provided should cover most use cases. Maybe try to find unicode characters encoded with different sizes if it's not already the case. People will report bugs when they found them and I would assume that most code out there is ASCII only, with some occasional exception.

Regarding 2., I wouldn't worry too much about it for now. I do recall seeing strange behaviour that may be linked to such a thing, but cannot confirm as we have an encoding issue currently :)

Feel free to create a new PR and we can discuss the issues in more detail over there.

Thanks again for the effort!

@McPatate McPatate merged commit 2a433cd into huggingface:main Dec 15, 2023
1 of 14 checks passed
@McPatate
Copy link
Member

Hey @rojas-diego, do you still want to contribute your gist to the project?

This pull request was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants