[j8] Make '' an alias for u''
Only in data, not in code.

I think we will discourage this for a while.
Andy C committed Jan 5, 2024
1 parent dad3d9d commit c505aac
Showing 4 changed files with 51 additions and 25 deletions.
36 changes: 16 additions & 20 deletions data_lang/pyj8.py
@@ -33,7 +33,7 @@ def EncodeString(s, options):


# similar to frontend/consts.py
_JSON_ESCAPES = {
_COMMON_ESCAPES = {
# Notes:
# - we don't escape \/
# - \' and \" are decided dynamically, based on the quote
@@ -54,7 +54,7 @@ def _EscapeUnprintable(s, buf, is_j8=False):
\\u{1f} for J8 - these are "u6 escapes"
"""
for ch in s:
escaped = _JSON_ESCAPES.get(ch)
escaped = _COMMON_ESCAPES.get(ch)
if escaped is not None:
buf.write(escaped)
continue
@@ -173,24 +173,21 @@ def WriteString(s, options, buf):
return 0


def PartIsUtf8(s, start, end):
# type: (str, int, int) -> bool
part = s[start:end]
try:
part.decode('utf-8')
except UnicodeDecodeError as e:
return False
return True


class LexerDecoder(object):
"""J8 lexer and string decoder.
Similar interface as SimpleLexer2, except we return an optional decoded
Similar interface as SimpleLexer, except we return an optional decoded
string
TODO: Combine
match.J8Lexer
match.J8StrLexer
When you hit "" b'' u''
1. Start the string lexer
2. decode it in place
3. validate utf-8 on the Id.Char_Literals tokens -- these are the only ones
that can be arbitrary strings
4. return decoded string
"""

def __init__(self, s):
@@ -274,13 +271,12 @@ def _DecodeString(self, left_id, str_pos):

if tok_id == Id.Char_Literals: # JSON and J8
part = self.s[str_pos:str_end]
try:
part.decode('utf-8')
except UnicodeDecodeError as e:
if not PartIsUtf8(self.s, str_pos, str_end):
# Syntax error because JSON must be valid UTF-8
# Limit context to 20 chars arbitrarily
snippet = self.s[str_pos:str_pos+20]
raise self._Error(
'Invalid UTF-8 in JSON string literal: %r' % part[:20],
'Invalid UTF-8 in JSON string literal: %r' % snippet,
str_end)

# TODO: would be nice to avoid allocation in all these cases.
14 changes: 14 additions & 0 deletions doc/j8-notation.md
@@ -287,6 +287,19 @@ TODO: example
This is exactly the confusion that J8 notation sets out to fix, so we choose to
be ultra **explicit** and different.

### Why have both `u''` and `b''` strings, if only `b''` are needed?

Oils doesn't have a string/bytes distinction (on the "interior"), but many
languages like Python and Rust do. Certain apps could make use of the
distinction.

Round-tripping arbitrary JSON strings also involves crazy hacks like WTF-8.
Our `u''` strings don't require WTF-8 because they can't represent surrogate
halves.

`u''` strings add trivial weight to the spec, since they just remove `\yff`
from the valid escapes.
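
As a rough illustration of the two points above, here is a minimal Python 3 sketch. It is not part of Oils; the helper names `is_utf8` and `can_encode_utf8` are made up for this example. It shows that JSON can carry a lone surrogate half that has no UTF-8 encoding, and that the byte `0xff` (what `\yff` denotes in a `b''` string) is not valid UTF-8 on its own.

```python
import json


def is_utf8(data: bytes) -> bool:
    """Hypothetical helper: True if the bytes decode as strict UTF-8."""
    try:
        data.decode('utf-8')
    except UnicodeDecodeError:
        return False
    return True


def can_encode_utf8(text: str) -> bool:
    """Hypothetical helper: True if the string has a strict UTF-8 encoding."""
    try:
        text.encode('utf-8')
    except UnicodeEncodeError:
        return False
    return True


# An ordinary code point round-trips through UTF-8.
mu = json.loads('"\\u03bc"')

# Python's json module accepts a lone high surrogate in a JSON string,
# but the resulting str cannot be encoded as strict UTF-8.
lone_surrogate = json.loads('"\\ud83d"')

print(can_encode_utf8(mu))              # True
print(can_encode_utf8(lone_surrogate))  # False
print(is_utf8(b'\xff'))                 # False
```

A format that forbids both cases, as `u''` strings do, never needs an encoding like WTF-8 to round-trip its strings.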

### How Do I Write a J8 Encoder or Decoder?

The list of errors at [ref/chap-errors.html](ref/chap-errors.html) may be a
@@ -299,3 +312,4 @@ We could have an SEXP8 format:
- Concrete syntax trees
- with location information
- Textual IRs like WebAssembly

6 changes: 2 additions & 4 deletions frontend/lexer_def.py
@@ -527,6 +527,7 @@ def R(pat, tok_type):
J8_DEF = [
C('"', Id.Left_DoubleQuote), # JSON string
C("u'", Id.Left_USingleQuote), # unicode string
C("'", Id.Left_USingleQuote), # '' is alias for u'' in data, not in code
C("b'", Id.Left_BSingleQuote), # byte string
C('[', Id.J8_LBracket),
C(']', Id.J8_RBracket),
@@ -543,10 +544,7 @@ def R(pat, tok_type):
# TODO: emit Id.Ignored_Newline to count lines for error messages?
R(r'[ \r\n\t]+', Id.Ignored_Space),

# TODO: AnyString, UString, and BString will also
# - additionally validate utf-8
# - decode
# I guess this takes 2 passes?
# This will reject ASCII control chars
R(r'[^\0]', Id.Unknown_Tok),
]
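
The effect of the new `C("'", Id.Left_USingleQuote)` rule can be sketched in isolation. The snippet below is a simplified stand-in and not the Oils lexer: `LEFT_QUOTES` and `classify_left_quote` are hypothetical names, and token ids are plain strings. It only shows that a bare `'` and `u'` map to the same token id, while `b'` stays distinct.

```python
# Hypothetical table mirroring the four string-opening rules in J8_DEF.
LEFT_QUOTES = [
    ('"', 'Left_DoubleQuote'),    # JSON string
    ("u'", 'Left_USingleQuote'),  # unicode string
    ("'", 'Left_USingleQuote'),   # '' is an alias for u'' (data only)
    ("b'", 'Left_BSingleQuote'),  # byte string
]


def classify_left_quote(s, pos=0):
    """Return (token id, position after the opening quote).

    Returns (None, pos) if no string starts at pos.
    """
    for prefix, tok_id in LEFT_QUOTES:
        if s.startswith(prefix, pos):
            return tok_id, pos + len(prefix)
    return None, pos


print(classify_left_quote("'abc'"))   # ('Left_USingleQuote', 1)
print(classify_left_quote("u'abc'"))  # ('Left_USingleQuote', 2)
print(classify_left_quote("b'abc'"))  # ('Left_BSingleQuote', 2)
```

Because both spellings yield the same token id, any later decoding step keyed on that id would treat `''` exactly like `u''`.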

20 changes: 19 additions & 1 deletion spec/ysh-json.test.sh
@@ -1,4 +1,4 @@
## oils_failures_allowed: 7
## oils_failures_allowed: 8
## tags: dev-minimal

#### usage errors
@@ -534,3 +534,21 @@ echo status=$?

## STDOUT:
## END

#### '' means the same thing as u''

echo "''" | json8 read
pp line (_reply)

echo "'\u{3bc}'" | json8 read
pp line (_reply)

# TODO: syntax error
echo "'\yff'" | json8 read
pp line (_reply)

## STDOUT:
(Str) ""
(Str) "μ"
(Str) b'\yff'
## END
