forked from purescript-contrib/purescript-parsing
Correctly handle UTF-16 surrogate pairs in `String`s.

We keep all of the API, but we change the primitive parsers so that instead of succeeding and incorrectly returning half of a surrogate pair, they will fail.

All prior tests pass with no modifications. Add a few new tests.

Non-breaking changes
====================

Add primitive parsers `anyCodePoint` and `satisfyCodePoint` for parsing `CodePoint`s.

Add the `match` combinator.

Move `updatePosString` to the `Text.Parsing.Parser.String` module and don't export it.

Split dev dependencies into spago-dev.dhall.

Add benchmark suite.

Add astral UTF-16 test.

Breaking changes
================

Change the definition of `whiteSpace` and `skipSpaces` to use `Data.CodePoint.Unicode.isSpace`.

To make this library handle Unicode correctly, it is necessary to either alter the `StringLike` class or delete it. We decided to delete it. The `String` module will now operate only on inputs of the concrete `String` type. `StringLike` has no laws, and during the five years of its life, no one on GitHub has ever written another instance of `StringLike`.

https://github.com/search?l=&q=StringLike+language%3APureScript&type=code

The last time someone tried to alter `StringLike`, this is what happened: purescript-contrib#62

Breaking changes which won’t be caught by the compiler
======================================================

Fundamentally, we change the way we consume the next input character from `Data.String.CodeUnits.uncons` to `Data.String.CodePoints.uncons`.

`anyChar` will no longer always succeed. It will only succeed on a Basic Multilingual Plane character. The new parser `anyCodePoint` will always succeed.

We are not quite “making the default `CodePoint`”, as was discussed in purescript-contrib#76 (comment). Rather, we are keeping most of the current API and making it work properly with astral Unicode.

We keep the `Char` parsers for backward compatibility. We also keep the `Char` parsers for ergonomic reasons. For example, the parser `char :: forall s m. Monad m => Char -> ParserT s m Char` is usually called with a literal like `char 'a'`. It would be annoying to call this parser with `char (codePointFromChar 'a')`.

Benchmarks
==========

For Unicode correctness, we're now consuming characters with `Data.String.CodePoints.uncons` instead of `Data.String.CodeUnits.uncons`. If that were going to affect performance, then the effect would show up in the `runParser parse23` benchmark, but it doesn’t.

Before
------

```
runParser parse23
mean   = 43.36 ms
stddev = 6.75 ms
min    = 41.12 ms
max    = 124.65 ms
runParser parseSkidoo
mean   = 22.53 ms
stddev = 3.86 ms
min    = 21.40 ms
max    = 61.76 ms
```

After
-----

```
runParser parse23
mean   = 42.90 ms
stddev = 6.01 ms
min    = 40.97 ms
max    = 115.74 ms
runParser parseSkidoo
mean   = 22.03 ms
stddev = 2.79 ms
min    = 20.78 ms
max    = 53.34 ms
```
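
The difference between `Data.String.CodeUnits.uncons` and `Data.String.CodePoints.uncons` described in this commit message can be seen on a one-character astral string. The following is a minimal sketch, not code from this commit: the module name `Example.Uncons` and the names `gClef`, `headUnit`, and `headPoint` are hypothetical, and it uses only functions from the `strings` package.

```purescript
module Example.Uncons where

import Prelude

import Data.Maybe (Maybe)
import Data.String.CodePoints (CodePoint)
import Data.String.CodePoints as CodePoints
import Data.String.CodeUnits as CodeUnits
import Effect (Effect)
import Effect.Console (logShow)

-- A single astral character, U+1D11E MUSICAL SYMBOL G CLEF.
-- UTF-16 encodes it as the surrogate pair 0xD834 0xDD1E.
gClef :: String
gClef = "𝄞"

-- Code-unit view: the head is only the high surrogate, which is
-- half of a character. This is the value the old parsers returned.
headUnit :: Maybe Char
headUnit = _.head <$> CodeUnits.uncons gClef

-- Code-point view: the head is the whole character.
headPoint :: Maybe CodePoint
headPoint = _.head <$> CodePoints.uncons gClef

main :: Effect Unit
main = do
  logShow (CodeUnits.length gClef) -- 2 (UTF-16 code units)
  logShow (CodePoints.length gClef) -- 1 (code point)
  -- After taking one code point, nothing is left over.
  logShow (map _.tail (CodePoints.uncons gClef))
```

Because a `Char` holds a single UTF-16 code unit, it cannot hold `𝄞`, which is why a `Char` parser such as `anyChar` has to fail on this input rather than return `headUnit`.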
1 parent 1d0aaa1, commit ab104c3
Showing 11 changed files with 354 additions and 94 deletions.
@@ -0,0 +1,134 @@
-- | # Benchmarking
-- |
-- |     spago -x spago-dev.dhall run --main Bench.Main
-- |
-- | This benchmark suite is intended to guide changes to this package so that
-- | we can compare the benchmarks of different commits.
-- |
-- | This benchmark suite also compares parsers to equivalent Regex. This
-- | provides an answer to the common question “How much slower is this package
-- | than Regex?” Answer: approximately 100×. The Regex benchmarks also give
-- | us a rough way to calibrate benchmarks run on different platforms.
-- |
-- | # Profiling
-- |
-- | https://nodejs.org/en/docs/guides/simple-profiling/
-- | https://nodesource.com/blog/diagnostics-in-NodeJS-2
-- |
-- |     spago -x spago-dev.dhall build --source-maps
-- |     purs bundle output/**/*.js --source-maps --output ./index.bundle.js
-- |
-- |
-- |     spago -x spago-dev.dhall build --source-maps --purs-args '--codegen corefn,sourcemaps'
-- |     zephyr Bench.Main.main --codegen sourcemaps,js
-- |     purs bundle dce-output/**/*.js --source-maps --module Bench.Main --main Bench.Main --output ./index.dce.bundle.js
-- |     node index.dce.bundle.js
-- |
-- |     spago -x spago-dev.dhall build --source-maps --purs-args '--codegen corefn,sourcemaps'
-- |     purs bundle output/**/*.js --source-maps --module Bench.Main --main Bench.Main --output ./index.bundle.js
-- |     node index.bundle.js
-- |     node --prof --enable-source-maps ./index.bundle.js
-- |     node --prof-process --source-map ./index.bundle.js.map isolate--.log > prof.txt
-- |
-- |     node --prof --enable-source-maps -e 'require("./output/Bench.Main/index.js").main()'
-- |     node --prof-process isolate--.log
-- |
-- |     spago -x spago-dev.dhall build
-- |     node --prof -e 'require("./output/Bench.Main/index.js").main()'
-- |     node --prof-process isolate--.log > prof.txt

module Bench.Main where

import Prelude

import Data.Array (fold, replicate)
import Data.Either (either)
import Data.List (manyRec)
import Data.List.Types (List)
import Data.String.Regex (Regex, regex)
import Data.String.Regex as Regex
import Data.String.Regex.Flags (RegexFlags(..))
import Effect (Effect)
import Effect.Console (log)
import Effect.Exception (throw)
import Effect.Unsafe (unsafePerformEffect)
import Performance.Minibench (benchWith)
import Text.Parsing.Parser (Parser, runParser)
import Text.Parsing.Parser.String (string)
import Text.Parsing.Parser.Token (digit)
import Text.Parsing.StringParser as StringParser
import Text.Parsing.StringParser.CodePoints as StringParser.CodePoints
import Text.Parsing.StringParser.CodeUnits as StringParser.CodeUnits

string23 :: String
string23 = "23"
string23_2 :: String
string23_2 = fold $ replicate 2 string23
string23_10000 :: String
string23_10000 = fold $ replicate 10000 string23

stringSkidoo :: String
stringSkidoo = "skidoo"
stringSkidoo_2 :: String
stringSkidoo_2 = fold $ replicate 2 stringSkidoo
stringSkidoo_10000 :: String
stringSkidoo_10000 = fold $ replicate 10000 stringSkidoo

parse23 :: Parser String (List Char)
parse23 = manyRec digit

parse23Points :: StringParser.Parser (List Char)
parse23Points = manyRec StringParser.CodePoints.anyDigit

parse23Units :: StringParser.Parser (List Char)
parse23Units = manyRec StringParser.CodeUnits.anyDigit

pattern23 :: Regex
pattern23 = either (unsafePerformEffect <<< throw) identity $
  regex "\\d" $ RegexFlags
    { dotAll: true
    , global: true
    , ignoreCase: false
    , multiline: true
    , sticky: false
    , unicode: true
    }

parseSkidoo :: Parser String (List String)
parseSkidoo = manyRec $ string "skidoo"

patternSkidoo :: Regex
patternSkidoo = either (unsafePerformEffect <<< throw) identity $
  regex "skidoo" $ RegexFlags
    { dotAll: true
    , global: true
    , ignoreCase: false
    , multiline: true
    , sticky: false
    , unicode: true
    }

main :: Effect Unit
main = do
  -- log $ show $ runParser string23_2 parse23
  -- log $ show $ Regex.match pattern23 string23_2
  -- log $ show $ runParser stringSkidoo_2 parseSkidoo
  -- log $ show $ Regex.match patternSkidoo stringSkidoo_2
  log "runParser parse23"
  benchWith 200
    $ \_ -> runParser string23_10000 parse23
  log "StringParser.runParser parse23Points"
  benchWith 20
    $ \_ -> StringParser.runParser parse23Points string23_10000
  log "StringParser.runParser parse23Units"
  benchWith 200
    $ \_ -> StringParser.runParser parse23Units string23_10000
  log "Regex.match pattern23"
  benchWith 200
    $ \_ -> Regex.match pattern23 string23_10000
  log "runParser parseSkidoo"
  benchWith 200
    $ \_ -> runParser stringSkidoo_10000 parseSkidoo
  log "Regex.match patternSkidoo"
  benchWith 200
    $ \_ -> Regex.match patternSkidoo stringSkidoo_10000
@@ -0,0 +1,22 @@
-- Spago configuration for testing, benchmarking, development.
--
-- See:
-- * ./CONTRIBUTING.md
-- * https://github.com/purescript/spago#devdependencies-testdependencies-or-in-general-a-situation-with-many-configurations
--

let conf = ./spago.dhall

in conf //
  { sources = [ "src/**/*.purs", "test/**/*.purs", "bench/**/*.purs" ]
  , dependencies = conf.dependencies #
    [ "assert"
    , "console"
    , "effect"
    , "psci-support"
    , "minibench"
    , "exceptions"
    , "string-parsers"
    ]
  , packages = ./packages.dhall
  }