Skip to content

Commit

Permalink
Unicode correctness
Browse files Browse the repository at this point in the history
Correctly handle UTF-16 surrogate pairs in `String`s.
We keep all of the API, but we change the primitive parsers so that instead
of succeeding and incorrectly returning half of a surrogate pair, they will fail.

All prior tests pass with no modifications. Add a few new tests.

Non-breaking changes
====================

Add primitive parsers `anyCodePoint` and `satisfyCodePoint` for parsing
`CodePoint`s.

Add the `match` combinator.

Move `updatePosString` to the `Text.Parsing.Parser.String` module and don't
export it.

Split dev dependencies into spago-dev.dhall.

Add benchmark suite.

Add astral UTF-16 test.

Breaking changes
================

Change the definition of `whiteSpace` and `skipSpaces` to
`Data.CodePoint.Unicode.isSpace`.

To make this library handle Unicode correctly, it is necessary to
either alter the `StringLike` class or delete it.
We decided to delete it. The `String` module will now operate only
on inputs of the concrete `String` type.
`StringLike` has no laws, and during the five years of its life,
no-one on Github has ever written another instance of `StringLike`.
https://github.com/search?l=&q=StringLike+language%3APureScript&type=code
The last time someone tried to alter `StringLike`, this is what
happened:
purescript-contrib#62

Breaking changes which won’t be caught by the compiler
======================================================

Fundamentally, we change the way we consume the next input character from
`Data.String.CodeUnits.uncons` to `Data.String.CodePoints.uncons`.

`anyChar` will no longer always succeed. It will only succeed on a Basic
Multilingual Plane character. The new parser `anyCodePoint` will always succeed.

We are not quite “making the default `CodePoint`”, as was discussed in
purescript-contrib#76 (comment) .
Rather we are keeping most of the current API and making it work
properly with astral Unicode.

We keep the `Char` parsers for backward compatibility.
We also keep the `Char` parsers for ergonomic reasons. For example
the parser `char :: forall s m. Monad m => Char -> ParserT s m Char`.
This parser is usually called with a literal like `char 'a'`. It would
be annoying to call this parser with `char (codePointFromChar 'a')`.

Benchmarks
==========

For Unicode correctness, we're now consuming characters with
`Data.String.CodePoints.uncons` instead of
`Data.String.CodeUnits.uncons`. If that were going to effect
performance, then the effect would show up in the `runParser parse23`
benchmark, but it doesn’t.

Before
------

```
runParser parse23
mean   = 43.36 ms
stddev = 6.75 ms
min    = 41.12 ms
max    = 124.65 ms

runParser parseSkidoo
mean   = 22.53 ms
stddev = 3.86 ms
min    = 21.40 ms
max    = 61.76 ms
```

After
-----

```
runParser parse23
mean   = 42.90 ms
stddev = 6.01 ms
min    = 40.97 ms
max    = 115.74 ms

runParser parseSkidoo
mean   = 22.03 ms
stddev = 2.79 ms
min    = 20.78 ms
max    = 53.34 ms
```
  • Loading branch information
jamesdbrock committed Sep 29, 2021
1 parent 1d0aaa1 commit ab104c3
Show file tree
Hide file tree
Showing 11 changed files with 354 additions and 94 deletions.
6 changes: 3 additions & 3 deletions .github/workflows/ci.yml
Original file line number Diff line number Diff line change
Expand Up @@ -25,10 +25,10 @@ jobs:
output
- name: Install dependencies
run: spago install
run: spago -x spago-dev.dhall install

- name: Build source
run: spago build --no-install --purs-args '--censor-lib --strict --censor-codes='UserDefinedWarning''
run: spago -x spago-dev.dhall build --no-install --purs-args '--censor-lib --strict --censor-codes='UserDefinedWarning''

- name: Run tests
run: spago test --no-install
run: spago -x spago-dev.dhall test --no-install
22 changes: 20 additions & 2 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -6,10 +6,28 @@ Notable changes to this project are documented in this file. The format is based

Breaking changes:

- `anyChar` will no longer always succeed. It will only succeed on a Basic
Multilingual Plane character. The new parser `anyCodePoint` will always
succeed. (#119 by @jamesdbrock)
- Delete the `StringLike` typeclass. Users must delete all `StringLike`
constraints. (#119 by @jamesdbrock)
- Move `updatePosString` to the `String` module and don’t
export it. (#119 by @jamesdbrock)
- Change the definition of `whiteSpace` and `skipSpaces` to
`Data.CodePoint.Unicode.isSpace`. (#119 by @jamesdbrock)

New features:

- Add primitive parsers `anyCodePoint` and `satisfyCodePoint` for parsing
`CodePoint`s. (#119 by @jamesdbrock)
- Add `match` combinator (#119 by @jamesdbrock)
- Add benchmark suite (#119 by @jamesdbrock)
- Split the dev dependencies out into `spago-dev.dhall`.

Bugfixes:

- Unicode correctness (#119 by @jamesdbrock)

Other improvements:

## [v6.0.2](https://github.com/purescript-contrib/purescript-parsing/releases/tag/v6.0.2) - 2021-05-09
Expand All @@ -26,12 +44,12 @@ Other improvements:
## [v6.0.0](https://github.com/purescript-contrib/purescript-parsing/releases/tag/v6.0.0) - 2021-02-26

Breaking changes:
- Improved performance of `string` and update `StringLike` to have `stripPrefix` as a class member instead of `indexOf` (#93)
- Improved performance of `string` and update `StringLike` to have `stripPrefix` as a class member instead of `indexOf` (#93)
- Non-empty combinators now return `NonEmptyList` (#102)
- Added support for PureScript 0.14 and dropped support for all previous versions (#101)

New features:
- Derived `Generic` instance of Position (#87)
- Derived `Generic` instance of Position (#87)

Bugfixes:

Expand Down
13 changes: 13 additions & 0 deletions CONTRIBUTING.md
Original file line number Diff line number Diff line change
Expand Up @@ -3,3 +3,16 @@
Thanks for your interest in contributing to `parsing`! We welcome new contributions regardless of your level of experience or familiarity with PureScript.

Every library in the Contributors organization shares a simple handbook that helps new contributors get started. With that in mind, please [read the short contributing guide on purescript-contrib/governance](https://github.com/purescript-contrib/governance/blob/main/contributing.md) before contributing to this library.

# Development

This package includes a `spago-dev.dhall` which provides the dependencies
for development and testing.

## Testing

To run the test suite:

```
spago -x spago-dev.dhall test
```
134 changes: 134 additions & 0 deletions bench/Main.purs
Original file line number Diff line number Diff line change
@@ -0,0 +1,134 @@
-- | # Benchmarking
-- |
-- | spago -x spago-dev.dhall run --main Bench.Main
-- |
-- | This benchmark suite is intended to guide changes to this package so that
-- | we can compare the benchmarks of different commits.
-- |
-- | This benchmark suite also compares parsers to equivalent Regex. This
-- | provides an answer to the common question “How much slower is this package
-- | than Regex?” Answer: approximately 100×. The Regex benchmarks also give
-- | us a rough way to calibrate benchmarks run on different platforms.
-- |
-- | # Profiling
-- |
-- | https://nodejs.org/en/docs/guides/simple-profiling/
-- | https://nodesource.com/blog/diagnostics-in-NodeJS-2
-- |
-- | spago -x spago-dev.dhall build --source-maps
-- | purs bundle output/**/*.js --source-maps --output ./index.bundle.js
-- |
-- |
-- | spago -x spago-dev.dhall build --source-maps --purs-args '--codegen corefn,sourcemaps'
-- | zephyr Bench.Main.main --codegen sourcemaps,js
-- | purs bundle dce-output/**/*.js --source-maps --module Bench.Main --main Bench.Main --output ./index.dce.bundle.js
-- | node index.dce.bundle.js
-- |
-- | spago -x spago-dev.dhall build --source-maps --purs-args '--codegen corefn,sourcemaps'
-- | purs bundle output/**/*.js --source-maps --module Bench.Main --main Bench.Main --output ./index.bundle.js
-- | node index.bundle.js
-- | node --prof --enable-source-maps ./index.bundle.js
-- | node --prof-process --source-map ./index.bundle.js.map isolate--.log > prof.txt
-- |
-- | node --prof --enable-source-maps -e 'require("./output/Bench.Main/index.js").main()'
-- | node --prof-process isolate--.log
-- |
-- | spago -x spago-dev.dhall build
-- | node --prof -e 'require("./output/Bench.Main/index.js").main()'
-- | node --prof-process isolate--.log > prof.txt

module Bench.Main where

import Prelude

import Data.Array (fold, replicate)
import Data.Either (either)
import Data.List (manyRec)
import Data.List.Types (List)
import Data.String.Regex (Regex, regex)
import Data.String.Regex as Regex
import Data.String.Regex.Flags (RegexFlags(..))
import Effect (Effect)
import Effect.Console (log)
import Effect.Exception (throw)
import Effect.Unsafe (unsafePerformEffect)
import Performance.Minibench (benchWith)
import Text.Parsing.Parser (Parser, runParser)
import Text.Parsing.Parser.String (string)
import Text.Parsing.Parser.Token (digit)
import Text.Parsing.StringParser as StringParser
import Text.Parsing.StringParser.CodePoints as StringParser.CodePoints
import Text.Parsing.StringParser.CodeUnits as StringParser.CodeUnits

string23 :: String
string23 = "23"
string23_2 :: String
string23_2 = fold $ replicate 2 string23
string23_10000 :: String
string23_10000 = fold $ replicate 10000 string23

stringSkidoo :: String
stringSkidoo = "skidoo"
stringSkidoo_2 :: String
stringSkidoo_2 = fold $ replicate 2 stringSkidoo
stringSkidoo_10000 :: String
stringSkidoo_10000 = fold $ replicate 10000 stringSkidoo

parse23 :: Parser String (List Char)
parse23 = manyRec digit

parse23Points :: StringParser.Parser (List Char)
parse23Points = manyRec StringParser.CodePoints.anyDigit

parse23Units :: StringParser.Parser (List Char)
parse23Units = manyRec StringParser.CodeUnits.anyDigit

pattern23 :: Regex
pattern23 = either (unsafePerformEffect <<< throw) identity $
regex "\\d" $ RegexFlags
{ dotAll: true
, global: true
, ignoreCase: false
, multiline: true
, sticky: false
, unicode: true
}

parseSkidoo :: Parser String (List String)
parseSkidoo = manyRec $ string "skidoo"

patternSkidoo :: Regex
patternSkidoo = either (unsafePerformEffect <<< throw) identity $
regex "skidoo" $ RegexFlags
{ dotAll: true
, global: true
, ignoreCase: false
, multiline: true
, sticky: false
, unicode: true
}

main :: Effect Unit
main = do
-- log $ show $ runParser string23_2 parse23
-- log $ show $ Regex.match pattern23 string23_2
-- log $ show $ runParser stringSkidoo_2 parseSkidoo
-- log $ show $ Regex.match patternSkidoo stringSkidoo_2
log "runParser parse23"
benchWith 200
$ \_ -> runParser string23_10000 parse23
log "StringParser.runParser parse23Points"
benchWith 20
$ \_ -> StringParser.runParser parse23Points string23_10000
log "StringParser.runParser parse23Units"
benchWith 200
$ \_ -> StringParser.runParser parse23Units string23_10000
log "Regex.match pattern23"
benchWith 200
$ \_ -> Regex.match pattern23 string23_10000
log "runParser parseSkidoo"
benchWith 200
$ \_ -> runParser stringSkidoo_10000 parseSkidoo
log "Regex.match patternSkidoo"
benchWith 200
$ \_ -> Regex.match patternSkidoo stringSkidoo_10000
22 changes: 22 additions & 0 deletions spago-dev.dhall
Original file line number Diff line number Diff line change
@@ -0,0 +1,22 @@
-- Spago configuration for testing, benchmarking, development.
--
-- See:
-- * ./CONTRIBUTING.md
-- * https://github.com/purescript/spago#devdependencies-testdependencies-or-in-general-a-situation-with-many-configurations
--

let conf = ./spago.dhall

in conf //
{ sources = [ "src/**/*.purs", "test/**/*.purs", "bench/**/*.purs" ]
, dependencies = conf.dependencies #
[ "assert"
, "console"
, "effect"
, "psci-support"
, "minibench"
, "exceptions"
, "string-parsers"
]
, packages = ./packages.dhall
}
7 changes: 2 additions & 5 deletions spago.dhall
Original file line number Diff line number Diff line change
@@ -1,10 +1,7 @@
{ name = "parsing"
, dependencies =
[ "arrays"
, "assert"
, "console"
, "control"
, "effect"
, "either"
, "foldable-traversable"
, "identity"
Expand All @@ -14,13 +11,13 @@
, "maybe"
, "newtype"
, "prelude"
, "psci-support"
, "strings"
, "tailrec"
, "transformers"
, "tuples"
, "unicode"
, "unsafe-coerce"
]
, packages = ./packages.dhall
, sources = [ "src/**/*.purs", "test/**/*.purs" ]
, sources = [ "src/**/*.purs" ]
}
2 changes: 1 addition & 1 deletion src/Text/Parsing/Parser/Language.purs
Original file line number Diff line number Diff line change
Expand Up @@ -13,7 +13,7 @@ import Prelude
import Control.Alt ((<|>))
import Text.Parsing.Parser (ParserT)
import Text.Parsing.Parser.String (char, oneOf)
import Text.Parsing.Parser.Token (LanguageDef, TokenParser, GenLanguageDef(..), unGenLanguageDef, makeTokenParser, alphaNum, letter)
import Text.Parsing.Parser.Token (GenLanguageDef(..), LanguageDef, TokenParser, alphaNum, letter, makeTokenParser, unGenLanguageDef)

-----------------------------------------------------------
-- Styles: haskellStyle, javaStyle
Expand Down
14 changes: 1 addition & 13 deletions src/Text/Parsing/Parser/Pos.purs
Original file line number Diff line number Diff line change
@@ -1,10 +1,8 @@
module Text.Parsing.Parser.Pos where

import Prelude

import Data.Generic.Rep (class Generic)
import Data.Foldable (foldl)
import Data.Newtype (wrap)
import Data.String (split)

-- | `Position` represents the position of the parser in the input.
-- |
Expand All @@ -27,13 +25,3 @@ derive instance ordPosition :: Ord Position
-- | The `Position` before any input has been parsed.
initialPos :: Position
initialPos = Position { line: 1, column: 1 }

-- | Updates a `Position` by adding the columns and lines in `String`.
updatePosString :: Position -> String -> Position
updatePosString pos' str = foldl updatePosChar pos' (split (wrap "") str)
where
updatePosChar (Position pos) c = case c of
"\n" -> Position { line: pos.line + 1, column: 1 }
"\r" -> Position { line: pos.line + 1, column: 1 }
"\t" -> Position { line: pos.line, column: pos.column + 8 - ((pos.column - 1) `mod` 8) }
_ -> Position { line: pos.line, column: pos.column + 1 }
Loading

0 comments on commit ab104c3

Please sign in to comment.