Unicode correctness

Correctly handle UTF-16 surrogate pairs in `String`s. We keep all of the API, but we change the primitive parsers so that instead of succeeding and incorrectly returning half of a surrogate pair, they will fail. All prior tests pass with no modifications. Add a few new tests. Non-breaking changes ==================== Add primitive parsers `anyCodePoint` and `satisfyCodePoint` for parsing `CodePoint`s. Add the `match` combinator. Move `updatePosString` to the `Text.Parsing.Parser.String` module and don't export it. Split dev dependencies into spago-dev.dhall. Add benchmark suite. Add astral UTF-16 test. Breaking changes ================ Change the definition of `whiteSpace` and `skipSpaces` to `Data.CodePoint.Unicode.isSpace`. To make this library handle Unicode correctly, it is necessary to either alter the `StringLike` class or delete it. We decided to delete it. The `String` module will now operate only on inputs of the concrete `String` type. `StringLike` has no laws, and during the five years of its life, no-one on Github has ever written another instance of `StringLike`. https://github.com/search?l=&q=StringLike+language%3APureScript&type=code The last time someone tried to alter `StringLike`, this is what happened: purescript-contrib#62 Breaking changes which won’t be caught by the compiler ====================================================== Fundamentally, we change the way we consume the next input character from `Data.String.CodeUnits.uncons` to `Data.String.CodePoints.uncons`. `anyChar` will no longer always succeed. It will only succeed on a Basic Multilingual Plane character. The new parser `anyCodePoint` will always succeed. We are not quite “making the default `CodePoint`”, as was discussed in purescript-contrib#76 (comment) . Rather we are keeping most of the current API and making it work properly with astral Unicode. We keep the `Char` parsers for backward compatibility. We also keep the `Char` parsers for ergonomic reasons. For example the parser `char :: forall s m. Monad m => Char -> ParserT s m Char`. This parser is usually called with a literal like `char 'a'`. It would be annoying to call this parser with `char (codePointFromChar 'a')`. Benchmarks ========== For Unicode correctness, we're now consuming characters with `Data.String.CodePoints.uncons` instead of `Data.String.CodeUnits.uncons`. If that were going to effect performance, then the effect would show up in the `runParser parse23` benchmark, but it doesn’t. Before ------ ``` runParser parse23 mean = 43.36 ms stddev = 6.75 ms min = 41.12 ms max = 124.65 ms runParser parseSkidoo mean = 22.53 ms stddev = 3.86 ms min = 21.40 ms max = 61.76 ms ``` After ----- ``` runParser parse23 mean = 42.90 ms stddev = 6.01 ms min = 40.97 ms max = 115.74 ms runParser parseSkidoo mean = 22.03 ms stddev = 2.79 ms min = 20.78 ms max = 53.34 ms ```
jamesdbrock · Sep 29, 2021 · ab104c3 · ab104c3
1 parent 1d0aaa1
commit ab104c3
Show file tree

Hide file tree

Showing 11 changed files with 354 additions and 94 deletions.
diff --git a/.github/workflows/ci.yml b/.github/workflows/ci.yml
@@ -25,10 +25,10 @@ jobs:
             output
 
       - name: Install dependencies
-        run: spago install
+        run: spago -x spago-dev.dhall install
 
       - name: Build source
-        run: spago build --no-install --purs-args '--censor-lib --strict --censor-codes='UserDefinedWarning''
+        run: spago -x spago-dev.dhall build --no-install --purs-args '--censor-lib --strict --censor-codes='UserDefinedWarning''
 
       - name: Run tests
-        run: spago test --no-install
+        run: spago -x spago-dev.dhall test --no-install
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -6,10 +6,28 @@ Notable changes to this project are documented in this file. The format is based
 
 Breaking changes:
 
+- `anyChar` will no longer always succeed. It will only succeed on a Basic
+  Multilingual Plane character. The new parser `anyCodePoint` will always
+  succeed. (#119 by @jamesdbrock)
+- Delete the `StringLike` typeclass. Users must delete all `StringLike`
+  constraints. (#119 by @jamesdbrock)
+- Move `updatePosString` to the `String` module and don’t
+  export it. (#119 by @jamesdbrock)
+- Change the definition of `whiteSpace` and `skipSpaces` to
+  `Data.CodePoint.Unicode.isSpace`. (#119 by @jamesdbrock)
+
 New features:
 
+- Add primitive parsers `anyCodePoint` and `satisfyCodePoint` for parsing
+  `CodePoint`s. (#119 by @jamesdbrock)
+- Add `match` combinator (#119 by @jamesdbrock)
+- Add benchmark suite (#119 by @jamesdbrock)
+- Split the dev dependencies out into `spago-dev.dhall`.
+
 Bugfixes:
 
+- Unicode correctness (#119 by @jamesdbrock)
+
 Other improvements:
 
 ## [v6.0.2](https://github.com/purescript-contrib/purescript-parsing/releases/tag/v6.0.2) - 2021-05-09
@@ -26,12 +44,12 @@ Other improvements:
 ## [v6.0.0](https://github.com/purescript-contrib/purescript-parsing/releases/tag/v6.0.0) - 2021-02-26
 
 Breaking changes:
-- Improved performance of `string` and update `StringLike` to have `stripPrefix` as a class member instead of `indexOf` (#93) 
+- Improved performance of `string` and update `StringLike` to have `stripPrefix` as a class member instead of `indexOf` (#93)
 - Non-empty combinators now return `NonEmptyList` (#102)
 - Added support for PureScript 0.14 and dropped support for all previous versions (#101)
 
 New features:
-- Derived `Generic` instance of Position (#87) 
+- Derived `Generic` instance of Position (#87)
 
 Bugfixes:
 

diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
@@ -3,3 +3,16 @@
 Thanks for your interest in contributing to `parsing`! We welcome new contributions regardless of your level of experience or familiarity with PureScript.
 
 Every library in the Contributors organization shares a simple handbook that helps new contributors get started. With that in mind, please [read the short contributing guide on purescript-contrib/governance](https://github.com/purescript-contrib/governance/blob/main/contributing.md) before contributing to this library.
+
+# Development
+
+This package includes a `spago-dev.dhall` which provides the dependencies
+for development and testing.
+
+## Testing
+
+To run the test suite:
+
+```
+spago -x spago-dev.dhall test
+```
diff --git a/bench/Main.purs b/bench/Main.purs
@@ -0,0 +1,134 @@
+-- | # Benchmarking
+-- |
+-- |     spago -x spago-dev.dhall run --main Bench.Main
+-- |
+-- | This benchmark suite is intended to guide changes to this package so that
+-- | we can compare the benchmarks of different commits.
+-- |
+-- | This benchmark suite also compares parsers to equivalent Regex. This
+-- | provides an answer to the common question “How much slower is this package
+-- | than Regex?” Answer: approximately 100×. The Regex benchmarks also give
+-- | us a rough way to calibrate benchmarks run on different platforms.
+-- |
+-- | # Profiling
+-- |
+-- | https://nodejs.org/en/docs/guides/simple-profiling/
+-- | https://nodesource.com/blog/diagnostics-in-NodeJS-2
+-- |
+-- |     spago -x spago-dev.dhall build --source-maps
+-- |     purs bundle output/**/*.js --source-maps --output ./index.bundle.js
+-- |
+-- |
+-- |     spago -x spago-dev.dhall build --source-maps --purs-args '--codegen corefn,sourcemaps'
+-- |     zephyr Bench.Main.main --codegen sourcemaps,js
+-- |     purs bundle dce-output/**/*.js --source-maps --module Bench.Main --main Bench.Main --output ./index.dce.bundle.js
+-- |     node index.dce.bundle.js
+-- |
+-- |     spago -x spago-dev.dhall build --source-maps --purs-args '--codegen corefn,sourcemaps'
+-- |     purs bundle output/**/*.js --source-maps --module Bench.Main --main Bench.Main --output ./index.bundle.js
+-- |     node index.bundle.js
+-- |     node --prof --enable-source-maps ./index.bundle.js
+-- |     node --prof-process --source-map ./index.bundle.js.map isolate--.log > prof.txt
+-- |
+-- |     node --prof --enable-source-maps -e 'require("./output/Bench.Main/index.js").main()'
+-- |     node --prof-process isolate--.log
+-- |
+-- |     spago -x spago-dev.dhall build
+-- |     node --prof -e 'require("./output/Bench.Main/index.js").main()'
+-- |     node --prof-process isolate--.log > prof.txt
+
+module Bench.Main where
+
+import Prelude
+
+import Data.Array (fold, replicate)
+import Data.Either (either)
+import Data.List (manyRec)
+import Data.List.Types (List)
+import Data.String.Regex (Regex, regex)
+import Data.String.Regex as Regex
+import Data.String.Regex.Flags (RegexFlags(..))
+import Effect (Effect)
+import Effect.Console (log)
+import Effect.Exception (throw)
+import Effect.Unsafe (unsafePerformEffect)
+import Performance.Minibench (benchWith)
+import Text.Parsing.Parser (Parser, runParser)
+import Text.Parsing.Parser.String (string)
+import Text.Parsing.Parser.Token (digit)
+import Text.Parsing.StringParser as StringParser
+import Text.Parsing.StringParser.CodePoints as StringParser.CodePoints
+import Text.Parsing.StringParser.CodeUnits as StringParser.CodeUnits
+
+string23 :: String
+string23 = "23"
+string23_2 :: String
+string23_2 = fold $ replicate 2 string23
+string23_10000 :: String
+string23_10000 = fold $ replicate 10000 string23
+
+stringSkidoo :: String
+stringSkidoo = "skidoo"
+stringSkidoo_2 :: String
+stringSkidoo_2 = fold $ replicate 2 stringSkidoo
+stringSkidoo_10000 :: String
+stringSkidoo_10000 = fold $ replicate 10000 stringSkidoo
+
+parse23 :: Parser String (List Char)
+parse23 = manyRec digit
+
+parse23Points :: StringParser.Parser (List Char)
+parse23Points = manyRec StringParser.CodePoints.anyDigit
+
+parse23Units :: StringParser.Parser (List Char)
+parse23Units = manyRec StringParser.CodeUnits.anyDigit
+
+pattern23 :: Regex
+pattern23 = either (unsafePerformEffect <<< throw) identity $
+  regex "\\d" $ RegexFlags
+    { dotAll: true
+    , global: true
+    , ignoreCase: false
+    , multiline: true
+    , sticky: false
+    , unicode: true
+    }
+
+parseSkidoo :: Parser String (List String)
+parseSkidoo = manyRec $ string "skidoo"
+
+patternSkidoo :: Regex
+patternSkidoo = either (unsafePerformEffect <<< throw) identity $
+  regex "skidoo" $ RegexFlags
+    { dotAll: true
+    , global: true
+    , ignoreCase: false
+    , multiline: true
+    , sticky: false
+    , unicode: true
+    }
+
+main :: Effect Unit
+main = do
+  -- log $ show $ runParser string23_2 parse23
+  -- log $ show $ Regex.match pattern23 string23_2
+  -- log $ show $ runParser stringSkidoo_2 parseSkidoo
+  -- log $ show $ Regex.match patternSkidoo stringSkidoo_2
+  log "runParser parse23"
+  benchWith 200
+    $ \_ -> runParser string23_10000 parse23
+  log "StringParser.runParser parse23Points"
+  benchWith 20
+    $ \_ -> StringParser.runParser parse23Points string23_10000
+  log "StringParser.runParser parse23Units"
+  benchWith 200
+    $ \_ -> StringParser.runParser parse23Units string23_10000
+  log "Regex.match pattern23"
+  benchWith 200
+    $ \_ -> Regex.match pattern23 string23_10000
+  log "runParser parseSkidoo"
+  benchWith 200
+    $ \_ -> runParser stringSkidoo_10000 parseSkidoo
+  log "Regex.match patternSkidoo"
+  benchWith 200
+    $ \_ -> Regex.match patternSkidoo stringSkidoo_10000
diff --git a/spago-dev.dhall b/spago-dev.dhall
@@ -0,0 +1,22 @@
+-- Spago configuration for testing, benchmarking, development.
+--
+-- See:
+-- * ./CONTRIBUTING.md
+-- * https://github.com/purescript/spago#devdependencies-testdependencies-or-in-general-a-situation-with-many-configurations
+--
+
+let conf = ./spago.dhall
+
+in conf //
+{ sources = [ "src/**/*.purs", "test/**/*.purs", "bench/**/*.purs" ]
+, dependencies = conf.dependencies #
+  [ "assert"
+  , "console"
+  , "effect"
+  , "psci-support"
+  , "minibench"
+  , "exceptions"
+  , "string-parsers"
+  ]
+, packages = ./packages.dhall
+}
diff --git a/spago.dhall b/spago.dhall
@@ -1,10 +1,7 @@
 { name = "parsing"
 , dependencies =
   [ "arrays"
-  , "assert"
-  , "console"
   , "control"
-  , "effect"
   , "either"
   , "foldable-traversable"
   , "identity"
@@ -14,13 +11,13 @@
   , "maybe"
   , "newtype"
   , "prelude"
-  , "psci-support"
   , "strings"
   , "tailrec"
   , "transformers"
   , "tuples"
   , "unicode"
+  , "unsafe-coerce"
   ]
 , packages = ./packages.dhall
-, sources = [ "src/**/*.purs", "test/**/*.purs" ]
+, sources = [ "src/**/*.purs" ]
 }
diff --git a/src/Text/Parsing/Parser/Language.purs b/src/Text/Parsing/Parser/Language.purs
@@ -13,7 +13,7 @@ import Prelude
 import Control.Alt ((<|>))
 import Text.Parsing.Parser (ParserT)
 import Text.Parsing.Parser.String (char, oneOf)
-import Text.Parsing.Parser.Token (LanguageDef, TokenParser, GenLanguageDef(..), unGenLanguageDef, makeTokenParser, alphaNum, letter)
+import Text.Parsing.Parser.Token (GenLanguageDef(..), LanguageDef, TokenParser, alphaNum, letter, makeTokenParser, unGenLanguageDef)
 
 -----------------------------------------------------------
 -- Styles: haskellStyle, javaStyle

diff --git a/src/Text/Parsing/Parser/Pos.purs b/src/Text/Parsing/Parser/Pos.purs
@@ -1,10 +1,8 @@
 module Text.Parsing.Parser.Pos where
 
 import Prelude
+
 import Data.Generic.Rep (class Generic)
-import Data.Foldable (foldl)
-import Data.Newtype (wrap)
-import Data.String (split)
 
 -- | `Position` represents the position of the parser in the input.
 -- |
@@ -27,13 +25,3 @@ derive instance ordPosition :: Ord Position
 -- | The `Position` before any input has been parsed.
 initialPos :: Position
 initialPos = Position { line: 1, column: 1 }
-
--- | Updates a `Position` by adding the columns and lines in `String`.
-updatePosString :: Position -> String -> Position
-updatePosString pos' str = foldl updatePosChar pos' (split (wrap "") str)
-  where
-  updatePosChar (Position pos) c = case c of
-    "\n" -> Position { line: pos.line + 1, column: 1 }
-    "\r" -> Position { line: pos.line + 1, column: 1 }
-    "\t" -> Position { line: pos.line,     column: pos.column + 8 - ((pos.column - 1) `mod` 8) }
-    _    -> Position { line: pos.line,     column: pos.column + 1 }