-
Notifications
You must be signed in to change notification settings - Fork 15
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
HCL Parsing fails on valid UTF-8 identifier #350
Comments
Hello @TeamDman. First of all, thank you very much for the very detailed bug report and reproducers, that's very helpful. This is a serious bug in the parser and the current implementation is even unsound because the bug is triggered in an After digging around a bit I was able to find the root cause for this. The parser is accepting a These two function are the culpit here: https://github.com/martinohmann/hcl-rs/blob/main/crates/hcl-edit/src/parser/string.rs#L228-L236 They work on single bytes, checking if these are ID start/continue. This obviously excludes UTF-8 chars from the mix and the handling for these is broken. I completely overlooked this in the implementation phase. There are now multiple alternatives for fixing this:
Alternative 1 is the easiest to implement, but i'd like to avoid harming parser performance too much if it can be avoided, so i'll also try to first explore variant 2, which is a bit more complex to get right, but should have better performance. Here's a minimal reproducer comparing the current mod issue_350_minimal {
use winnow::{
stream::AsChar,
token::{one_of, take_while},
PResult, Parser,
};
fn ident_bytes<'a>(input: &mut &'a [u8]) -> PResult<&'a str> {
(
one_of(|b: u8| hcl_primitives::ident::is_id_start(b.as_char())),
take_while(0.., |b: u8| {
hcl_primitives::ident::is_id_continue(b.as_char())
}),
)
.recognize()
.map(|s: &[u8]| {
print_bytes("parsed bytes with `u8` stream", s);
std::str::from_utf8(s)
.expect("`is_id_start` and `is_id_continue` filter out non-utf8")
})
.parse_next(input)
}
fn ident_str<'a>(input: &mut &'a str) -> PResult<&'a str> {
(
one_of(hcl_primitives::ident::is_id_start),
take_while(0.., hcl_primitives::ident::is_id_continue),
)
.recognize()
.map(|s: &str| {
print_bytes("parsed bytes with `char` stream", s.as_bytes());
s
})
.parse_next(input)
}
fn print_bytes(desc: &str, buf: &[u8]) {
print!("{desc}: ");
for b in buf {
print!("0x{b:02x} ");
}
println!();
}
#[test]
fn parse_ident() {
let mut input = "ééé";
let mut input_bytes = input.as_bytes();
print_bytes("input bytes", input_bytes);
let parsed = ident_str.parse(&mut input).unwrap();
assert_eq!(parsed, "ééé");
let parsed = ident_bytes.parse(&mut input_bytes).unwrap();
assert_eq!(parsed, "ééé");
}
} Output:
|
Btw: the reason why the following does not have the same issue is that it works on a let ident = "ééé";
ident.parse::<hcl::edit::Ident>().unwrap(); |
Fixes #350 This fixes an unsoundness around handling of UTF-8 identifiers in the parser at the expense of decreasing the performance by around 20%. Correctness goes over performance. I may switch back to a `&[u8]` based stream once I figured out how to handle UTF-8 identifiers with it safely.
Fixes #350 This fixes an unsoundness around handling of UTF-8 identifiers in the parser at the expense of decreasing the performance by around 20-30%. I may switch back to a `&[u8]` based stream once I figured out how to handle UTF-8 identifiers with it safely without adding a truck load of complexity, but for now correctness is more important than performance.
Fixes #350 This fixes an unsoundness around handling of UTF-8 identifiers in the parser at the expense of decreasing the performance by around 20-30%. I may switch back to a `&[u8]` based stream once I figured out how to handle UTF-8 identifiers with it safely without adding a truck load of complexity, but for now correctness is more important than performance.
@TeamDman thanks again for the bug report! I just released the fix for this via Sadly, the fix decreased the parser performance by 20-30%, but I'm planning to iterate on that to bring performance levels back closer to where they were before. For now, a correct parser is more important than performance. Funny enough, I also found another bug related to unicode in the error reporting and fixed that along the way. Let me know if you spot other issues around unicode handling. |
Woohoo! Props for the quick fix 🎉🐇 I'm using import blocks to generate hundreds of thousands of lines of HCL, so I definitely appreciate the care being placed on the performance Thank you! |
tl;dr at bottom section
backtrace
Terraform and OpenTofu support labels beginning with utf-8
When I use
terraform plan -generate-config-out="generated.tf"
followed byterraform init
andterraform apply
it works with the special characters in the identifier, so Terraform supports it but hcl-rs doesn't D:also tried
crates:
relevant:
hcl-rs/crates/hcl-edit/src/parser/structure.rs
Line 63 in 48de0e7
hcl-rs/crates/hcl-edit/src/parser/string.rs
Lines 220 to 226 in 48de0e7
hcl-rs/crates/hcl-edit/src/parser/string.rs
Lines 193 to 200 in 48de0e7
hcl-rs/crates/hcl-primitives/src/ident.rs
Lines 277 to 309 in 48de0e7
tangent
The Hashicorp spec says
We can search the repo for ID_START
https://github.com/search?q=repo%3Ahashicorp%2Fhcl%20ID_START&type=code
which reveals the truth for what is supported
https://github.com/hashicorp/hcl/blob/212a40e528766634a1aa6dd1e820d7936762196e/hclsyntax/unicode_derived.rl
preview:
Looks like the
unicode_ident
crate used byhcl-rs
is also targetting the UAX #31 spec.I tried asking ChatGPT if "é" is a valid ID_START or ID_CONTINUE character and it said both yes and no when I restarted the chat, so I'm not sure where the spec deviates, it's either Terraform+OpenTofu or the
unicode_ident
crate....
I asked again and it gave the following
https://chatgpt.com/share/e910f089-fee1-47ad-bdbe-e988cde57aee
TL;DR - parsing fails for some reason and I have no idea why
backtrace
The text was updated successfully, but these errors were encountered: