Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Escape backspace etc graphemes in string.inspect #600

Closed
Michael-Mark-Edu opened this issue May 22, 2024 · 15 comments · Fixed by #615
Closed

Escape backspace etc graphemes in string.inspect #600

Michael-Mark-Edu opened this issue May 22, 2024 · 15 comments · Fixed by #615
Labels
bug Something isn't working good first issue Good for newcomers help wanted Extra attention is needed

Comments

@Michael-Mark-Edu
Copy link
Contributor

Michael-Mark-Edu commented May 22, 2024

Fixed by #602

Edited 5/24, 8:00 PM PST

I was having issues related to accidentally interpreting binary data as text, resulting in the following error output from gleeunit:
image

Eventually, I figured out that there's a 0x08 character in there, which is causing the first " to get deleted, then a 0x40 (@) immediately following it takes its place, resulting in strange behavior that looked just close enough to intentional to stump me for a bit.

Ideally, 0x08 should not be doing anything here, since it can easily destroy the output and make it unreadable. Other ASCII escape character like 0x0B (escape) have similar effects here. This could be done by having io.print-like functions replace these sequences with hex (0x08) or escape codes (\b, \u{0008}).

This is most severe for gleeunit, where the developer depends on its output to properly debug their program, but I was also having this happen with io.debug and io.print as well (where this behavior might be intentional).

Theoretically, to recreate this you just need to write 0x08 to a file, load it (such as with the file_streams package), then get it to print to screen (either by io.debug or using gleeunit should.equal to get it to print on test fail) and weird things will happen. I imagine having several 0x08s in the string will cause even more damage to the output.

@lpil lpil changed the title Console output affected by ASCII escape codes when it shouldn't Escape backspace etc graphemes in string.inspect May 22, 2024
@lpil
Copy link
Member

lpil commented May 22, 2024

Thank you

@lpil lpil added bug Something isn't working good first issue Good for newcomers help wanted Extra attention is needed labels May 22, 2024
@Michael-Mark-Edu
Copy link
Contributor Author

Here are all the escape codes that are relevant to this:
gleam_issue600

(code 127 is in the "invisible" category, codes 128+ are UTF-8 and shouldn't have special functionality)

@Michael-Mark-Edu
Copy link
Contributor Author

This also seems to be an Erlang-specific issue, since Javascript just turns everything into unicode
image

@mooreryan
Copy link
Contributor

mooreryan commented May 22, 2024

Shouldn't form feed also be green in your list? It seems to escape fine:

import gleam/io
import gleam/list
import gleam/string

fn cp(n: Int) -> UtfCodepoint {
  let assert Ok(cp) = string.utf_codepoint(n)
  cp
}

pub fn main() {
  [34, 92, 13, 10, 9, 12]
  |> list.map(cp)
  |> string.from_utf_codepoints
  |> string.inspect
  |> io.println
}

// prints: "\"\\\r\n\t\f"

See the erlang code here:

inspect_maybe_utf8_string(Binary, Acc) ->

@mooreryan
Copy link
Contributor

I suppose fixing may be as simple as adding more control characters that should potentially be escaped there in that function.

@Michael-Mark-Edu
Copy link
Contributor Author

I opened a PR that should hopefully fix this issue. Null characters affecting string comparison is untouched because it's arguably intentional behavior.

@Michael-Mark-Edu
Copy link
Contributor Author

@mooreryan The \f patch is unreleased, and I originally did testing on the public build. I tested it lightly in the unreleased build and the \f patch does seem to work.

@mooreryan
Copy link
Contributor

mooreryan commented May 23, 2024

I'm not sure what you mean by the \f patch. The pull request you link doesn't change behavior with the form feed character, as it is already properly escaped in the code.

Edit: oh, I think you mean this patch.

@mooreryan
Copy link
Contributor

mooreryan commented May 24, 2024

This comment (#602 (comment)) has me thinking, should string.inspect handle more of the first 32 non-printable ascii characters?

The pull request #602 adds handling for \b \v and \e, however, it may be useful to also show the Gleam escape syntax for other non-printable characters. Going back to the original motivating example, if more of the non-printable characters were handled by string.inspect, then the diff would look something like this:

expected: Ok("abc123")
     got: Ok("\u{0008}abc123")

which seems more helpful.

@lpil
Copy link
Member

lpil commented May 24, 2024

Oh that's clever. We could use that syntax for any invisible grapheme- is that what you're saying? How would we identify which ones are invisible?

@Hyperion-21
Copy link

I made a chart with the invisible ones earlier in this thread. Theoretically, anything not being converted and has a value <32 is invisible (and 127). The conversion list just does the first enter found (or maybe not, I don't know Erlang well), so maybe we can just add a conversion rule to <32 at the end of the list. And 127 too.

I prefer \b over \u{08} when possible, but I don't think a lot of these invisible ones have a well defined C escape so it may be inevitable.

@mooreryan
Copy link
Contributor

mooreryan commented May 24, 2024

Oh that's clever. We could use that syntax for any invisible grapheme- is that what you're saying? How would we identify which ones are invisible?

Yes that would be the general idea. The "simplest" solution may be as @Hyperion-21 says, and just convert the beginning of the ascii table. However, the point of identifying which are invisible is trickier than just taking ascii values < 32. For one thing, there are many "non printing" things outside of that range when you consider unicode...check it out:

import gleeunit
import gleeunit/should

pub fn main() {
  gleeunit.main()
}

pub fn a_test() {
  let x = "a b"
  let y = "a\u{0020}b"

  should.equal(x, y)
}

pub fn b_test() {
  let x = "a b"
  let y = "a\u{00A0}b"

  should.equal(x, y)
}

pub fn c_test() {
  let x = "a\u{0020}b"
  let y = "a\u{00A0}b"

  should.equal(x, y)
}

which yields:

Failures:

  1) invisible_chars_test.b_test
     Values were not equal
     expected: "a b"
          got: "a b"
     output: 

  2) invisible_chars_test.c_test
     Values were not equal
     expected: "a b"
          got: "a b"
     output: 

Those all look like spaces, but they're not the same. So, the "ideal" string.inspect function may somehow account for that. But it is getting trickier, and so maybe should be left to some 3rd party library? (not sure about that).

Second, you could imagine going beyond "invisible" characters. Check out this classic example:

pub fn e_accent_test() {
  let e1 = "\u{00E9}"
  let e2 = "\u{0065}\u{0301}"

  should.equal(e1, e2)
}

and that yields this:

  3) invisible_chars_test.e_accent_test
     Values were not equal
     expected: "é"
          got: "é"
     output: 

Which both look like the same e with accute accent.

Both of the outputs shown in those three failures could be considered pretty unhelpful, and worth treating, but, it is complicated, so I'm not sure how complex the string.inspect should be. It should probably be examined what some common other languages do.

My point the semantics of string.inspect could get tricky, and I'm not sure how far the escaping should be taken, though it could be potentially useful.


I prefer \b over \u{08} when possible, but I don't think a lot of these invisible ones have a well defined C escape so it may be inevitable.

It's true that \b is nice, but it is not valid gleam syntax.

@Michael-Mark-Edu
Copy link
Contributor Author

Is there anything stopping Gleam from supporting an extended set of escape codes? I found an old line in the compiler's changelog saying "Gleam now only supports \r, \n, \t, \", and \\ string escapes" which makes me think this is an intentional decision... but why? It seems like an arbitrary decision.

I'll update #602 momentarily to match the \u syntax. I'll also see if I can get it to show the invisible graphemes.

@Michael-Mark-Edu
Copy link
Contributor Author

Alright, that's done.

@Michael-Mark-Edu
Copy link
Contributor Author

Edited the parent post of this thread to better represent the current state of the issue/pr.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working good first issue Good for newcomers help wanted Extra attention is needed
Projects
None yet
4 participants