Fast codepointOffset #451

axman6 · 2022-07-01T08:07:31Z

Implements codepointOffset with code from the FreeBSD project.

I'm planning to explore a vectorised implementation of the search for 2-, 3- and 4-byte codepoints, but will leave that out of the first iteration.

This may be relevant to #369, by eliminating the need to decode codepoints via Haskell.

axman6 · 2022-07-01T10:04:32Z

I'm not sure why older GHCs are unable to infer the types for the tests I've added, since the types should all be trivially known (Text and Char).

Bodigrim · 2022-07-01T20:29:58Z

Thanks @axman6! I suggest we start with splitOnChar / breakOnChar in a separate PR. First naive implementation, tests and benchmarks, then make it fast with whatever it takes. Tackling both splitOnChar and memmem in one go feels a bit overwhelming.

axman6 · 2022-07-02T00:34:09Z

Yeah, I've been working on rewriting the C to avoid going via memmem; removing twoway_memmem would significantly reduce the amount of code to maintain. I would guess there are faster memmem implementations out there, hopefully under permissive licenses too. I'll get the changes working and push those today.

Bodigrim · 2022-07-02T15:09:02Z

I have a suspicion that breakOnChar / splitOnChar do not require any additional C code at all. It might be enough to memchr the least significant byte of the UTF-8 encoding and then check manually that the other bytes match.
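A minimal sketch of that idea, under stated assumptions rather than as the PR's implementation: it reads "least significant byte" as the last byte of the character's UTF-8 encoding, works on a strict `ByteString` assumed to hold valid UTF-8, and relies on `Data.ByteString.elemIndex`, which bytestring implements with `memchr`. The name `findCharUtf8` is hypothetical and not part of `text`.

```haskell
import qualified Data.ByteString as B
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE

-- | Hypothetical helper (not part of the text API): byte offset of the first
-- occurrence of a character in a buffer assumed to hold valid UTF-8.
findCharUtf8 :: Char -> B.ByteString -> Maybe Int
findCharUtf8 c hay = go (n - 1)
  where
    needle = TE.encodeUtf8 (T.singleton c)  -- 1 to 4 bytes, never empty
    n      = B.length needle
    lastB  = B.last needle

    -- 'from' is the first index at which the last byte of a match could sit.
    go from =
      case B.elemIndex lastB (B.drop from hay) of  -- the memchr step
        Nothing -> Nothing
        Just i
          | needle `B.isPrefixOf` B.drop start hay -> Just start
          | otherwise                              -> go (end + 1)
          where
            end   = from + i       -- position of the candidate last byte
            start = end - (n - 1)  -- where the code point would begin
```

For the UTF-8 encoding of "aéb", searching for 'é' gives `Just 1`, since 'a' occupies one byte. Because UTF-8 is self-synchronising, a full byte-wise match in valid input can only occur at a code point boundary, so the manual check of the remaining bytes is sufficient.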

Anyway, let's separate concerns. From my perspective the first task is to add breakOnChar / splitOnChar with a naive, pure-Haskell implementation. Once that is done and merged, we can discuss optimizations in a separate PR.
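As a rough illustration of what such a naive version could look like, here is a sketch built only on the existing `Data.Text.break` and `Data.Text.drop`. The signatures are assumptions; the thread does not spell out the intended types or edge-case behaviour of breakOnChar / splitOnChar.

```haskell
import qualified Data.Text as T

-- | Assumed signature: split at the first occurrence of the character.
-- The matched character, if any, stays at the head of the second component.
breakOnChar :: Char -> T.Text -> (T.Text, T.Text)
breakOnChar c = T.break (== c)

-- | Assumed signature: split on every occurrence of the character,
-- dropping the separators and keeping empty fields.
splitOnChar :: Char -> T.Text -> [T.Text]
splitOnChar c t =
  case breakOnChar c t of
    (pre, rest)
      | T.null rest -> [pre]                                -- no separator left
      | otherwise   -> pre : splitOnChar c (T.drop 1 rest)  -- skip the separator
```

With these definitions `splitOnChar c` agrees with `T.split (== c)`, which makes the existing function a convenient reference for tests and a baseline for benchmarks.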

axman6 · 2022-07-16T06:23:10Z

I'll try to find some time to write a Haskell-only version, and then we can think about making a faster C one later. I wonder if it's worth having both, and only moving to the C call when there's enough data to justify it.

axman6 added 8 commits July 1, 2022 17:53

Add C bits

9a0ee2e

Initial Text functions and tests

8aa7155

Fix and add tests

672d2d7

Add haddocs for splitOnChar

13bba0d

Add haddocs for breakOnChar

868d97a

Add temporary docs for splitOn'

277e1bd

Add haddocs for codepointOffset

bef46db

Remove Debug.Trace

f885b48

axman6 added 3 commits July 1, 2022 21:50

Add type signatures to new tests

15a111a

less cryptic naming in C functions

27adcd9

Remove use of <> for older GHCs

b25b93c

axman6 marked this pull request as ready for review July 1, 2022 12:32

axman6 added 5 commits July 2, 2022 12:48

Remove all the memmem code, inline calls to Nbyte_memmem.

f3ce9dd

Make types consistent (thanks to godbolt.org)

842f951

Export splitOnChar

94eb8e8

Clean up

880bdaf

Remove memmem import

5f07e3a

Bodigrim marked this pull request as draft April 11, 2024 19:57
