forked from purescript-contrib/purescript-parsing
-
Notifications
You must be signed in to change notification settings - Fork 0
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Correctly handle UTF-16 surrogate pairs in `String`s. All prior tests pass with no modifications. Add a few new tests. Non-breaking changes ==================== Add primitive parsers `anyCodePoint` and `satisfyCodePoint` for parsing `CodePoint`s. Add the `match` combinator. Move `updatePosString` to the `Text.Parsing.Parser.String` module and don't export it. Breaking changes ================ Change the definition of `whiteSpace` and `skipSpaces` to `Data.CodePoint.Unicode.isSpace`. Move the character class parsers from `Text.Parsing.Parser.Token` module into the `Text.Parsing.Parser.String` module. To make this library handle Unicode correctly, it is necessary to either alter the `StringLike` class or delete it. We decided to delete it. The `String` module will now operate only on inputs of the concrete `String` type. `StringLike` has no laws, and during the five years of its life, no-one on Github has ever written another instance of `StringLike`. https://github.com/search?l=&q=StringLike+language%3APureScript&type=code The last time someone tried to alter `StringLike`, this is what happened: purescript-contrib#62 Breaking changes which won’t be caught by the compiler ====================================================== Fundamentally, we change the way we consume the next input character from `Data.String.CodeUnits.uncons` to `Data.String.CodePoints.uncons`. `anyChar` will no longer always succeed. It will only succeed on a Basic Multilingual Plane character. The new parser `anyCodePoint` will always succeed. We are not quite “making the default `CodePoint`”, as was discussed in purescript-contrib#76 (comment) . Rather we are keeping most of the current API and making it work properly with astral Unicode. We keep the `Char` parsers for backward compatibility. We also keep the `Char` parsers for ergonomic reasons. For example the parser `char :: forall s m. Monad m => Char -> ParserT s m Char`. This parser is usually called with a literal like `char 'a'`. It would be annoying to call this parser with `char (codePointFromChar 'a')`. Benchmarks ========== For Unicode correctness, we're now consuming characters with `Data.String.CodePoints.uncons` instead of `Data.String.CodeUnits.uncons`. If that were going to effect performance, then the effect would show up in the `runParser parse23` benchmark, but it doesn’t. Before ------ ``` runParser parse23 mean = 43.36 ms stddev = 6.75 ms min = 41.12 ms max = 124.65 ms runParser parseSkidoo mean = 22.53 ms stddev = 3.86 ms min = 21.40 ms max = 61.76 ms ``` After ----- ``` runParser parse23 mean = 42.90 ms stddev = 6.01 ms min = 40.97 ms max = 115.74 ms runParser parseSkidoo mean = 22.03 ms stddev = 2.79 ms min = 20.78 ms max = 53.34 ms ```
- Loading branch information
1 parent
cf4578b
commit 7795887
Showing
7 changed files
with
199 additions
and
134 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,92 +1,200 @@ | ||
-- | Primitive parsers for working with an input stream of type `String`. | ||
|
||
module Text.Parsing.Parser.String where | ||
-- | | ||
-- | The behavior of these primitive parsers is based on the behavior of the | ||
-- | `Data.String` module in the __strings__ package. | ||
-- | In most JavaScript runtime environments, the `String` | ||
-- | is little-endian [UTF-16](https://en.wikipedia.org/wiki/UTF-16). | ||
-- | | ||
-- | The primitive parsers which return `Char` will only succeed when the character | ||
-- | being parsed is a code point in the | ||
-- | [Basic Multilingual Plane](https://en.wikipedia.org/wiki/Plane_(Unicode)#Basic_Multilingual_Plane) | ||
-- | (the “BMP”). These parsers can be convenient because of the good support | ||
-- | that PureScript has for writing `Char` literals like `'あ'`, `'β'`, `'C'`. | ||
-- | | ||
-- | The other primitive parsers, which return `CodePoint` and `String` types, | ||
-- | can parse the full Unicode character set. All of the primitive parsers | ||
-- | in this module can be used together. | ||
module Text.Parsing.Parser.String | ||
( string | ||
, eof | ||
, anyChar | ||
, anyCodePoint | ||
, satisfy | ||
, satisfyCodePoint | ||
, char | ||
, whiteSpace | ||
, skipSpaces | ||
, oneOf | ||
, noneOf | ||
, match | ||
, digit | ||
, hexDigit | ||
, octDigit | ||
, upper | ||
, space | ||
, letter | ||
, alphaNum | ||
) | ||
where | ||
|
||
import Prelude hiding (between) | ||
|
||
import Control.Monad.State (gets, modify_) | ||
import Data.Array (many) | ||
import Data.Foldable (elem, notElem) | ||
import Control.Monad.State (get, put) | ||
import Data.Array (notElem) | ||
import Data.Char (fromCharCode) | ||
import Data.CodePoint.Unicode (isAlpha, isAlphaNum, isDecDigit, isHexDigit, isOctDigit, isSpace, isUpper) | ||
import Data.Foldable (elem) | ||
import Data.Maybe (Maybe(..)) | ||
import Data.Newtype (wrap) | ||
import Data.String (Pattern) | ||
import Data.String as S | ||
import Data.String (CodePoint, Pattern(..), codePointFromChar, null, stripPrefix, uncons) | ||
import Data.String.CodeUnits as SCU | ||
import Data.Tuple (Tuple(..), fst) | ||
import Text.Parsing.Parser (ParseState(..), ParserT, fail) | ||
import Text.Parsing.Parser.Combinators (tryRethrow, (<?>)) | ||
import Text.Parsing.Parser.Pos (updatePosString) | ||
|
||
-- | This class exists to abstract over streams which support the string-like | ||
-- | operations which this modules needs. | ||
class StringLike s where | ||
drop :: Int -> s -> s | ||
stripPrefix :: Pattern -> s -> Maybe s | ||
null :: s -> Boolean | ||
uncons :: s -> Maybe { head :: Char, tail :: s } | ||
|
||
instance stringLikeString :: StringLike String where | ||
uncons = SCU.uncons | ||
drop = S.drop | ||
stripPrefix = S.stripPrefix | ||
null = S.null | ||
import Text.Parsing.Parser.Combinators (skipMany, tryRethrow, (<?>)) | ||
import Text.Parsing.Parser.Pos (Position(..)) | ||
import Unsafe.Coerce (unsafeCoerce) | ||
|
||
-- | Match end-of-file. | ||
eof :: forall s m. StringLike s => Monad m => ParserT s m Unit | ||
eof :: forall m. Monad m => ParserT String m Unit | ||
eof = do | ||
input <- gets \(ParseState input _ _) -> input | ||
ParseState input _ _ <- get | ||
unless (null input) (fail "Expected EOF") | ||
|
||
-- | Match the specified string. | ||
string :: forall s m. StringLike s => Monad m => String -> ParserT s m String | ||
string :: forall m. Monad m => String -> ParserT String m String | ||
string str = do | ||
input <- gets \(ParseState input _ _) -> input | ||
case stripPrefix (wrap str) input of | ||
ParseState input position _ <- get | ||
case stripPrefix (Pattern str) input of | ||
Just remainder -> do | ||
modify_ \(ParseState _ position _) -> | ||
ParseState remainder | ||
(updatePosString position str) | ||
true | ||
put $ ParseState remainder (updatePosString position str) true | ||
pure str | ||
_ -> fail ("Expected " <> show str) | ||
|
||
-- | Match any character. | ||
anyChar :: forall s m. StringLike s => Monad m => ParserT s m Char | ||
anyChar = do | ||
input <- gets \(ParseState input _ _) -> input | ||
-- | Match any BMP `Char`. | ||
-- | Parser will fail if the character is not in the Basic Multilingual Plane. | ||
anyChar :: forall m. Monad m => ParserT String m Char | ||
anyChar = tryRethrow do | ||
cp :: Int <- deconstructCodePoint <$> anyCodePoint | ||
-- the `fromCharCode` function doesn't check if this is beyond the | ||
-- BMP, so we check that ourselves. | ||
-- https://github.com/purescript/purescript-strings/issues/153 | ||
if cp > 65535 -- BMP | ||
then fail "Not a Char" | ||
else case fromCharCode cp of | ||
Nothing -> fail "Not a Char" | ||
Just c -> pure c | ||
|
||
-- | Match any Unicode character. | ||
-- | Always succeeds. | ||
anyCodePoint :: forall m. Monad m => ParserT String m CodePoint | ||
anyCodePoint = do | ||
ParseState input position _ <- get | ||
case uncons input of | ||
Nothing -> fail "Unexpected EOF" | ||
Just { head, tail } -> do | ||
modify_ \(ParseState _ position _) -> | ||
ParseState tail | ||
(updatePosString position (SCU.singleton head)) | ||
true | ||
put $ ParseState tail (updatePosSingle position head) true | ||
pure head | ||
|
||
-- | Match a character satisfying the specified predicate. | ||
satisfy :: forall s m. StringLike s => Monad m => (Char -> Boolean) -> ParserT s m Char | ||
-- | Match a BMP `Char` satisfying the predicate. | ||
satisfy :: forall m. Monad m => (Char -> Boolean) -> ParserT String m Char | ||
satisfy f = tryRethrow do | ||
c <- anyChar | ||
if f c then pure c | ||
else fail $ "Character '" <> SCU.singleton c <> "' did not satisfy predicate" | ||
if f c | ||
then pure c | ||
else fail "Predicate unsatisfied" | ||
|
||
-- | Match the specified character | ||
char :: forall s m. StringLike s => Monad m => Char -> ParserT s m Char | ||
-- | Match a Unicode character satisfying the predicate. | ||
satisfyCodePoint :: forall m. Monad m => (CodePoint -> Boolean) -> ParserT String m CodePoint | ||
satisfyCodePoint f = tryRethrow do | ||
c <- anyCodePoint | ||
if f c | ||
then pure c | ||
else fail "Predicate unsatisfied" | ||
|
||
-- | Match the specified BMP `Char`. | ||
char :: forall m. Monad m => Char -> ParserT String m Char | ||
char c = satisfy (_ == c) <?> show c | ||
|
||
-- | Match zero or more whitespace characters. | ||
whiteSpace :: forall s m. StringLike s => Monad m => ParserT s m String | ||
whiteSpace = do | ||
cs <- many $ satisfy \c -> c == '\n' || c == '\r' || c == ' ' || c == '\t' | ||
pure $ SCU.fromCharArray cs | ||
-- | Match zero or more whitespace characters satisfying | ||
-- | `Data.CodePoint.Unicode.isSpace`. | ||
whiteSpace :: forall m. Monad m => ParserT String m String | ||
whiteSpace = fst <$> match skipSpaces | ||
|
||
-- | Skip whitespace characters. | ||
skipSpaces :: forall s m. StringLike s => Monad m => ParserT s m Unit | ||
skipSpaces = void whiteSpace | ||
skipSpaces :: forall m. Monad m => ParserT String m Unit | ||
skipSpaces = skipMany (satisfyCodePoint isSpace) | ||
|
||
-- | Match one of the characters in the array. | ||
oneOf :: forall s m. StringLike s => Monad m => Array Char -> ParserT s m Char | ||
-- | Match one of the BMP `Char`s in the array. | ||
oneOf :: forall m. Monad m => Array Char -> ParserT String m Char | ||
oneOf ss = satisfy (flip elem ss) <?> ("one of " <> show ss) | ||
|
||
-- | Match any character not in the array. | ||
noneOf :: forall s m. StringLike s => Monad m => Array Char -> ParserT s m Char | ||
-- | Match any BMP `Char` not in the array. | ||
noneOf :: forall m. Monad m => Array Char -> ParserT String m Char | ||
noneOf ss = satisfy (flip notElem ss) <?> ("none of " <> show ss) | ||
|
||
-- | Updates a `Position` by adding the columns and lines in `String`. | ||
updatePosString :: Position -> String -> Position | ||
updatePosString pos str = case uncons str of | ||
Nothing -> pos | ||
Just {head,tail} -> updatePosString (updatePosSingle pos head) tail -- tail recursive | ||
|
||
-- | Updates a `Position` by adding the columns and lines in a | ||
-- | single `CodePoint`. | ||
updatePosSingle :: Position -> CodePoint -> Position | ||
updatePosSingle (Position {line,column}) cp = case deconstructCodePoint cp of | ||
10 -> Position { line: line + 1, column: 1 } -- "\n" | ||
13 -> Position { line: line + 1, column: 1 } -- "\r" | ||
9 -> Position { line, column: column + 8 - ((column - 1) `mod` 8) } -- "\t" Who says that one tab is 8 columns? | ||
_ -> Position { line, column: column + 1 } | ||
|
||
-- | Combinator which returns both the result of a parse and the portion of | ||
-- | the input that was consumed while it was being parsed. | ||
match :: forall m a. Monad m => ParserT String m a -> ParserT String m (Tuple String a) | ||
match p = do | ||
ParseState input1 _ _ <- get | ||
x <- p | ||
ParseState input2 _ _ <- get | ||
-- We use the `SCU.length`, which is in units of “code units” | ||
-- instead of `Data.String.length`. which is in units of “code points”. | ||
-- This is more efficient, and it will be correct as long as we can assume | ||
-- the invariant that the `ParseState input` always begins on a code point | ||
-- boundary. | ||
pure $ Tuple (SCU.take (SCU.length input1 - SCU.length input2) input1) x | ||
|
||
-- Helper function. | ||
satisfyCP :: forall m . Monad m => (CodePoint -> Boolean) -> ParserT String m Char | ||
satisfyCP p = satisfy (p <<< codePointFromChar) | ||
|
||
-- | Parse a digit. Matches any BMP `Char` that satisfies `Data.CodePoint.Unicode.isDecDigit`. | ||
digit :: forall m . Monad m => ParserT String m Char | ||
digit = satisfyCP isDecDigit <?> "digit" | ||
|
||
-- | Parse a hex digit. Matches any BMP `Char` that satisfies `Data.CodePoint.Unicode.isHexDigit`. | ||
hexDigit :: forall m . Monad m => ParserT String m Char | ||
hexDigit = satisfyCP isHexDigit <?> "hex digit" | ||
|
||
-- | Parse an octal digit. Matches any BMP `Char` that satisfies `Data.CodePoint.Unicode.isOctDigit`. | ||
octDigit :: forall m . Monad m => ParserT String m Char | ||
octDigit = satisfyCP isOctDigit <?> "oct digit" | ||
|
||
-- | Parse an uppercase letter. Matches any BMP `Char` that satisfies `Data.CodePoint.Unicode.isUpper`. | ||
upper :: forall m . Monad m => ParserT String m Char | ||
upper = satisfyCP isUpper <?> "uppercase letter" | ||
|
||
-- | Parse a space character. Matches any BMP `Char` that satisfies `Data.CodePoint.Unicode.isSpace`. | ||
space :: forall m . Monad m => ParserT String m Char | ||
space = satisfyCP isSpace <?> "space" | ||
|
||
-- | Parse an alphabetical character. Matches any BMP `Char` that satisfies `Data.CodePoint.Unicode.isAlpha`. | ||
letter :: forall m . Monad m => ParserT String m Char | ||
letter = satisfyCP isAlpha <?> "letter" | ||
|
||
-- | Parse an alphabetical or numerical character. | ||
-- | Matches any BMP `Char` that satisfies `Data.CodePoint.Unicode.isAlphaNum`. | ||
alphaNum :: forall m . Monad m => ParserT String m Char | ||
alphaNum = satisfyCP isAlphaNum <?> "letter or digit" | ||
|
||
-- The CodePoint newtype constructor is not exported, so here's a helper. | ||
-- This will break at runtime if the definition of CodePoint ever changes | ||
-- to something other than `newtype CodePoint = CodePoint Int`. | ||
deconstructCodePoint :: CodePoint -> Int | ||
deconstructCodePoint = unsafeCoerce |
Oops, something went wrong.