Skip to content

Commit

Permalink
Unicode correctness
Browse files Browse the repository at this point in the history
Correctly handle UTF-16 surrogate pairs in `String`s.

All prior tests pass with no modifications. Add a few new tests.

Non-breaking changes
====================

Add primitive parsers `anyCodePoint` and `satisfyCodePoint` for parsing
`CodePoint`s.

Add the `match` combinator.

Move `updatePosString` to the `Text.Parsing.Parser.String` module and don't
export it.

Breaking changes
================

Change the definition of `whiteSpace` and `skipSpaces` to
`Data.CodePoint.Unicode.isSpace`.

Move the character class parsers from `Text.Parsing.Parser.Token` module into
the `Text.Parsing.Parser.String` module.

To make this library handle Unicode correctly, it is necessary to
either alter the `StringLike` class or delete it.
We decided to delete it. The `String` module will now operate only
on inputs of the concrete `String` type.
`StringLike` has no laws, and during the five years of its life,
no-one on Github has ever written another instance of `StringLike`.
https://github.com/search?l=&q=StringLike+language%3APureScript&type=code
The last time someone tried to alter `StringLike`, this is what
happened:
purescript-contrib#62

Breaking changes which won’t be caught by the compiler
======================================================

Fundamentally, we change the way we consume the next input character from
`Data.String.CodeUnits.uncons` to `Data.String.CodePoints.uncons`.

`anyChar` will no longer always succeed. It will only succeed on a Basic
Multilingual Plane character. The new parser `anyCodePoint` will always succeed.

We are not quite “making the default `CodePoint`”, as was discussed in
purescript-contrib#76 (comment) .
Rather we are keeping most of the current API and making it work
properly with astral Unicode.

We keep the `Char` parsers for backward compatibility.
We also keep the `Char` parsers for ergonomic reasons. For example
the parser `char :: forall s m. Monad m => Char -> ParserT s m Char`.
This parser is usually called with a literal like `char 'a'`. It would
be annoying to call this parser with `char (codePointFromChar 'a')`.

Benchmarks
==========

For Unicode correctness, we're now consuming characters with
`Data.String.CodePoints.uncons` instead of
`Data.String.CodeUnits.uncons`. If that were going to effect
performance, then the effect would show up in the `runParser parse23`
benchmark, but it doesn’t.

Before
------

```
runParser parse23
mean   = 43.36 ms
stddev = 6.75 ms
min    = 41.12 ms
max    = 124.65 ms

runParser parseSkidoo
mean   = 22.53 ms
stddev = 3.86 ms
min    = 21.40 ms
max    = 61.76 ms
```

After
-----

```
runParser parse23
mean   = 42.90 ms
stddev = 6.01 ms
min    = 40.97 ms
max    = 115.74 ms

runParser parseSkidoo
mean   = 22.03 ms
stddev = 2.79 ms
min    = 20.78 ms
max    = 53.34 ms
```
  • Loading branch information
jamesdbrock committed Sep 24, 2021
1 parent cf4578b commit 307549e
Show file tree
Hide file tree
Showing 7 changed files with 197 additions and 134 deletions.
3 changes: 1 addition & 2 deletions bench/Main.purs
Original file line number Diff line number Diff line change
Expand Up @@ -26,8 +26,7 @@ import Effect.Exception (throw)
import Effect.Unsafe (unsafePerformEffect)
import Performance.Minibench (benchWith)
import Text.Parsing.Parser (Parser, runParser)
import Text.Parsing.Parser.Token (digit)
import Text.Parsing.Parser.String (string)
import Text.Parsing.Parser.String (digit, string)

string23 :: String
string23 = "23"
Expand Down
1 change: 1 addition & 0 deletions spago.dhall
Original file line number Diff line number Diff line change
Expand Up @@ -16,6 +16,7 @@
, "transformers"
, "tuples"
, "unicode"
, "unsafe-coerce"
]
, packages = ./packages.dhall
, sources = [ "src/**/*.purs" ]
Expand Down
4 changes: 2 additions & 2 deletions src/Text/Parsing/Parser/Language.purs
Original file line number Diff line number Diff line change
Expand Up @@ -12,8 +12,8 @@ import Prelude

import Control.Alt ((<|>))
import Text.Parsing.Parser (ParserT)
import Text.Parsing.Parser.String (char, oneOf)
import Text.Parsing.Parser.Token (LanguageDef, TokenParser, GenLanguageDef(..), unGenLanguageDef, makeTokenParser, alphaNum, letter)
import Text.Parsing.Parser.String (char, oneOf, alphaNum, letter)
import Text.Parsing.Parser.Token (LanguageDef, TokenParser, GenLanguageDef(..), unGenLanguageDef, makeTokenParser)

-----------------------------------------------------------
-- Styles: haskellStyle, javaStyle
Expand Down
14 changes: 1 addition & 13 deletions src/Text/Parsing/Parser/Pos.purs
Original file line number Diff line number Diff line change
@@ -1,10 +1,8 @@
module Text.Parsing.Parser.Pos where

import Prelude

import Data.Generic.Rep (class Generic)
import Data.Foldable (foldl)
import Data.Newtype (wrap)
import Data.String (split)

-- | `Position` represents the position of the parser in the input.
-- |
Expand All @@ -27,13 +25,3 @@ derive instance ordPosition :: Ord Position
-- | The `Position` before any input has been parsed.
initialPos :: Position
initialPos = Position { line: 1, column: 1 }

-- | Updates a `Position` by adding the columns and lines in `String`.
updatePosString :: Position -> String -> Position
updatePosString pos' str = foldl updatePosChar pos' (split (wrap "") str)
where
updatePosChar (Position pos) c = case c of
"\n" -> Position { line: pos.line + 1, column: 1 }
"\r" -> Position { line: pos.line + 1, column: 1 }
"\t" -> Position { line: pos.line, column: pos.column + 8 - ((pos.column - 1) `mod` 8) }
_ -> Position { line: pos.line, column: pos.column + 1 }
222 changes: 164 additions & 58 deletions src/Text/Parsing/Parser/String.purs
Original file line number Diff line number Diff line change
@@ -1,92 +1,198 @@
-- | Primitive parsers for working with an input stream of type `String`.

module Text.Parsing.Parser.String where
-- | In most JavaScript runtime environments, the `String` is encoded
-- | as little-endian [UTF-16](https://en.wikipedia.org/wiki/UTF-16), but
-- | these primitive parsers should work with any runtime encoding.
-- |
-- | The primitive parsers which return `Char` will only succeed when the character
-- | being parsed is a code point in the
-- | [Basic Multilingual Plane](https://en.wikipedia.org/wiki/Plane_(Unicode)#Basic_Multilingual_Plane)
-- | (the “BMP”). These parsers can be convenient because of the good support
-- | that PureScript has for writing `Char` literals like `'あ', 'β', 'C'`.
-- |
-- | The other primitive parsers, which return `CodePoint` and `String` types,
-- | can parse the full Unicode character set. All of the primitive parsers
-- | in this module can be used together.
module Text.Parsing.Parser.String
( string
, eof
, anyChar
, anyCodePoint
, satisfy
, satisfyCodePoint
, char
, whiteSpace
, skipSpaces
, oneOf
, noneOf
, match
, digit
, hexDigit
, octDigit
, upper
, space
, letter
, alphaNum
)
where

import Prelude hiding (between)

import Control.Monad.State (gets, modify_)
import Data.Array (many)
import Data.Foldable (elem, notElem)
import Control.Monad.State (get, put)
import Data.Array (notElem)
import Data.Char (fromCharCode)
import Data.CodePoint.Unicode (isAlpha, isAlphaNum, isDecDigit, isHexDigit, isOctDigit, isSpace, isUpper)
import Data.Foldable (elem)
import Data.Maybe (Maybe(..))
import Data.Newtype (wrap)
import Data.String (Pattern)
import Data.String as S
import Data.String (CodePoint, Pattern(..), codePointFromChar, null, stripPrefix, uncons)
import Data.String.CodeUnits as SCU
import Data.Tuple (Tuple(..), fst)
import Text.Parsing.Parser (ParseState(..), ParserT, fail)
import Text.Parsing.Parser.Combinators (tryRethrow, (<?>))
import Text.Parsing.Parser.Pos (updatePosString)

-- | This class exists to abstract over streams which support the string-like
-- | operations which this modules needs.
class StringLike s where
drop :: Int -> s -> s
stripPrefix :: Pattern -> s -> Maybe s
null :: s -> Boolean
uncons :: s -> Maybe { head :: Char, tail :: s }

instance stringLikeString :: StringLike String where
uncons = SCU.uncons
drop = S.drop
stripPrefix = S.stripPrefix
null = S.null
import Text.Parsing.Parser.Combinators (skipMany, tryRethrow, (<?>))
import Text.Parsing.Parser.Pos (Position(..))
import Unsafe.Coerce (unsafeCoerce)

-- | Match end-of-file.
eof :: forall s m. StringLike s => Monad m => ParserT s m Unit
eof :: forall m. Monad m => ParserT String m Unit
eof = do
input <- gets \(ParseState input _ _) -> input
ParseState input _ _ <- get
unless (null input) (fail "Expected EOF")

-- | Match the specified string.
string :: forall s m. StringLike s => Monad m => String -> ParserT s m String
string :: forall m. Monad m => String -> ParserT String m String
string str = do
input <- gets \(ParseState input _ _) -> input
case stripPrefix (wrap str) input of
ParseState input position _ <- get
case stripPrefix (Pattern str) input of
Just remainder -> do
modify_ \(ParseState _ position _) ->
ParseState remainder
(updatePosString position str)
true
put $ ParseState remainder (updatePosString position str) true
pure str
_ -> fail ("Expected " <> show str)

-- | Match any character.
anyChar :: forall s m. StringLike s => Monad m => ParserT s m Char
anyChar = do
input <- gets \(ParseState input _ _) -> input
-- | Match any BMP `Char`.
-- | Parser will fail if the character is not in the Basic Multilingual Plane.
anyChar :: forall m. Monad m => ParserT String m Char
anyChar = tryRethrow do
cp :: Int <- deconstructCodePoint <$> anyCodePoint
-- the `fromCharCode` function doesn't check if this is beyond the
-- BMP, so we check that ourselves.
-- https://github.com/purescript/purescript-strings/issues/153
if cp > 65535 -- BMP
then fail "Not a Char"
else case fromCharCode cp of
Nothing -> fail "Not a Char"
Just c -> pure c

-- | Match any Unicode character.
-- | Always succeeds.
anyCodePoint :: forall m. Monad m => ParserT String m CodePoint
anyCodePoint = do
ParseState input position _ <- get
case uncons input of
Nothing -> fail "Unexpected EOF"
Just { head, tail } -> do
modify_ \(ParseState _ position _) ->
ParseState tail
(updatePosString position (SCU.singleton head))
true
put $ ParseState tail (updatePosSingle position head) true
pure head

-- | Match a character satisfying the specified predicate.
satisfy :: forall s m. StringLike s => Monad m => (Char -> Boolean) -> ParserT s m Char
-- | Match a BMP `Char` satisfying the predicate.
satisfy :: forall m. Monad m => (Char -> Boolean) -> ParserT String m Char
satisfy f = tryRethrow do
c <- anyChar
if f c then pure c
else fail $ "Character '" <> SCU.singleton c <> "' did not satisfy predicate"
if f c
then pure c
else fail "Predicate unsatisfied"

-- | Match the specified character
char :: forall s m. StringLike s => Monad m => Char -> ParserT s m Char
-- | Match a Unicode character satisfying the predicate.
satisfyCodePoint :: forall m. Monad m => (CodePoint -> Boolean) -> ParserT String m CodePoint
satisfyCodePoint f = tryRethrow do
c <- anyCodePoint
if f c
then pure c
else fail "Predicate unsatisfied"

-- | Match the specified BMP `Char`.
char :: forall m. Monad m => Char -> ParserT String m Char
char c = satisfy (_ == c) <?> show c

-- | Match zero or more whitespace characters.
whiteSpace :: forall s m. StringLike s => Monad m => ParserT s m String
whiteSpace = do
cs <- many $ satisfy \c -> c == '\n' || c == '\r' || c == ' ' || c == '\t'
pure $ SCU.fromCharArray cs
-- | Match zero or more whitespace characters satisfying
-- | `Data.CodePoint.Unicode.isSpace`.
whiteSpace :: forall m. Monad m => ParserT String m String
whiteSpace = fst <$> match skipSpaces

-- | Skip whitespace characters.
skipSpaces :: forall s m. StringLike s => Monad m => ParserT s m Unit
skipSpaces = void whiteSpace
skipSpaces :: forall m. Monad m => ParserT String m Unit
skipSpaces = skipMany (satisfyCodePoint isSpace)

-- | Match one of the characters in the array.
oneOf :: forall s m. StringLike s => Monad m => Array Char -> ParserT s m Char
-- | Match one of the BMP `Char`s in the array.
oneOf :: forall m. Monad m => Array Char -> ParserT String m Char
oneOf ss = satisfy (flip elem ss) <?> ("one of " <> show ss)

-- | Match any character not in the array.
noneOf :: forall s m. StringLike s => Monad m => Array Char -> ParserT s m Char
-- | Match any BMP `Char` not in the array.
noneOf :: forall m. Monad m => Array Char -> ParserT String m Char
noneOf ss = satisfy (flip notElem ss) <?> ("none of " <> show ss)

-- | Updates a `Position` by adding the columns and lines in `String`.
updatePosString :: Position -> String -> Position
updatePosString pos str = case uncons str of
Nothing -> pos
Just {head,tail} -> updatePosString (updatePosSingle pos head) tail -- tail recursive

-- | Updates a `Position` by adding the columns and lines in a
-- | single `CodePoint`.
updatePosSingle :: Position -> CodePoint -> Position
updatePosSingle (Position {line,column}) cp = case deconstructCodePoint cp of
10 -> Position { line: line + 1, column: 1 } -- "\n"
13 -> Position { line: line + 1, column: 1 } -- "\r"
9 -> Position { line, column: column + 8 - ((column - 1) `mod` 8) } -- "\t" Who says that one tab is 8 columns?
_ -> Position { line, column: column + 1 }

-- | Combinator which returns both the result of a parse and the portion of
-- | the input that was consumed while it was being parsed.
match :: forall m a. Monad m => ParserT String m a -> ParserT String m (Tuple String a)
match p = do
ParseState input1 _ _ <- get
x <- p
ParseState input2 _ _ <- get
-- We use the `SCU.length`, which is in units of “code units”
-- instead of `Data.String.length`. which is in units of “code points”.
-- This is more efficient, and it will be correct as long as we can assume
-- the invariant that the `ParseState input` always begins on a code point
-- boundary.
pure $ Tuple (SCU.take (SCU.length input1 - SCU.length input2) input1) x

-- Helper function.
satisfyCP :: forall m . Monad m => (CodePoint -> Boolean) -> ParserT String m Char
satisfyCP p = satisfy (p <<< codePointFromChar)

-- | Parse a digit. Matches any BMP `Char` that satisfies `Data.CodePoint.Unicode.isDecDigit`.
digit :: forall m . Monad m => ParserT String m Char
digit = satisfyCP isDecDigit <?> "digit"

-- | Parse a hex digit. Matches any BMP `Char` that satisfies `Data.CodePoint.Unicode.isHexDigit`.
hexDigit :: forall m . Monad m => ParserT String m Char
hexDigit = satisfyCP isHexDigit <?> "hex digit"

-- | Parse an octal digit. Matches any BMP `Char` that satisfies `Data.CodePoint.Unicode.isOctDigit`.
octDigit :: forall m . Monad m => ParserT String m Char
octDigit = satisfyCP isOctDigit <?> "oct digit"

-- | Parse an uppercase letter. Matches any BMP `Char` that satisfies `Data.CodePoint.Unicode.isUpper`.
upper :: forall m . Monad m => ParserT String m Char
upper = satisfyCP isUpper <?> "uppercase letter"

-- | Parse a space character. Matches any BMP `Char` that satisfies `Data.CodePoint.Unicode.isSpace`.
space :: forall m . Monad m => ParserT String m Char
space = satisfyCP isSpace <?> "space"

-- | Parse an alphabetical character. Matches any BMP `Char` that satisfies `Data.CodePoint.Unicode.isAlpha`.
letter :: forall m . Monad m => ParserT String m Char
letter = satisfyCP isAlpha <?> "letter"

-- | Parse an alphabetical or numerical character.
-- | Matches any BMP `Char` that satisfies `Data.CodePoint.Unicode.isAlphaNum`.
alphaNum :: forall m . Monad m => ParserT String m Char
alphaNum = satisfyCP isAlphaNum <?> "letter or digit"

-- The CodePoint newtype constructor is not exported, so here's a helper.
-- This will break at runtime if the definition of CodePoint ever changes
-- to something other than `newtype CodePoint = CodePoint Int`.
deconstructCodePoint :: CodePoint -> Int
deconstructCodePoint = unsafeCoerce

0 comments on commit 307549e

Please sign in to comment.