Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RFC: consider splitting out shortbytestring #444

Closed
hasufell opened this issue Dec 1, 2021 · 26 comments · Fixed by #471
Closed

RFC: consider splitting out shortbytestring #444

hasufell opened this issue Dec 1, 2021 · 26 comments · Fixed by #471

Comments

@hasufell
Copy link
Member

hasufell commented Dec 1, 2021

As part of my efforts of the abstract filepath proposal, I have added a lot of missing functions from the ShortByteString modules and published it as the shortbytestring package.

Since I'm now maintainer of filepath I will move the AFPP forward inside filepath via new modules (that will require patches to unix and Win32). I believe it makes sense to move all ShortByteString stuff into the new shortbytestring package, because:

  1. the size has now grown quite a lot and I implemented all the missing functionality by hand
  2. shortbytestring might have to evolve at a different (quicker) pace than bytestring as part of the whole AFPP effort
  3. this decouples major PVP bumps from bytestring, which is good
  4. it's a separate type anyway with very different performance semantics

For backwards compat, bytestring could/should re-export the current API, no more, no less. That also means shortbytestring can't depend on it and the functions toShort and fromShort will have to live in bytestring for the foreseeable future. This can be added to the module documentation to avoid confusion.

The code sharing between both packages is otherwise negligible, imo.

This also means that shortbytestring will have to become a boot package (and be properly reviewed/audited beforehand).

@Bodigrim @bgamari

@sjakobi
Copy link
Member

sjakobi commented Dec 1, 2021

@hasufell If you want to have a ByteArray#-based type with a lot of new functionality, why don't you simply make your own, or at least a newtype of ShortByteString? I think it will be confusing for many users when both bytestring and shortbytestring provide Data.ByteString.Short and Data.ByteString.Short.Internal. If you'd use a different name, that would prevent a lot of confusion and compatibility issues.

@hasufell
Copy link
Member Author

hasufell commented Dec 1, 2021

@sjakobi I don't think I want a new type, because that would even be more maintenance burden. The new functionality is useful for anyone using ShortByteString for whatever purpose.

Also see #83

@sjakobi
Copy link
Member

sjakobi commented Dec 1, 2021

Hmm, well I doubt that we'll ever have two boot packages that both expose Data.ByteString.Short. I also feel that it's suboptimal to share responsibilities for the ShortByteString type between two packages. At that point it would be better to fully remove Data.ByteString.Short from bytestring. At which point I guess shortbytestring will have its own type anyway.

To simplify the migration, maybe we can re-create the existing Data.ByteString.Short API in base's new Data.Array.Byte. Then we can remove Data.ByteString.Short from bytestring and let shortbytestring have the module name to itself. Existing users of Data.ByteString.Short can consider adopting shortbytestring or moving to Data.Byte.Array or even primitive.

@sjakobi
Copy link
Member

sjakobi commented Dec 1, 2021

4. it's a separate type anyway with very different performance semantics

What does this mean? I thought shortbytestring was using the ShortByteString type from bytestring?!

@hasufell
Copy link
Member Author

hasufell commented Dec 1, 2021

What does this mean? I thought shortbytestring was using the ShortByteString type from bytestring?!

Yes it does. What I mean is that ByteString and ShortByteString are sufficiently different to warrant two packages.


Hmm, well I doubt that we'll ever have two boot packages that both expose Data.ByteString.Short

The module names don't have to be the same. shortbytestring can expose Data.ShortByteString instead.

I don't mind breaking backwards compat, but given the recent discussion on the CLC issue tracker I thought maybe try to not break it. Literally every package using Data.ByteString.Short will break otherwise.

Also, I'd want to avoid pulling in more core libraries into the equation for now, otherwise the scope goes out of hand and I will burn out from too many open PRs and discussions.

@sjakobi
Copy link
Member

sjakobi commented Dec 2, 2021

What does this mean? I thought shortbytestring was using the ShortByteString type from bytestring?!

Yes it does. What I mean is that ByteString and ShortByteString are sufficiently different to warrant two packages.

Right, thanks for clarifying. Do note that ByteString itself may eventually be based on ByteArray#. See #193.

Hmm, well I doubt that we'll ever have two boot packages that both expose Data.ByteString.Short

The module names don't have to be the same. shortbytestring can expose Data.ShortByteString instead.

Good! ShortByteString might also be a good module name that would fit in the current trend of naming modules after the package name.

@Bodigrim
Copy link
Contributor

Bodigrim commented Dec 3, 2021

@sjakobi it was me who suggested @hasufell that we can split ShortByteString out and make bytestring depend on it. Ideally I would like to have less bytearraish types around, not more. But indeed migration issues could be overwhelming.

One more thing to consider is that base-4.17 will provide Data.Array.Byte.

shortbytestring might have to evolve at a different (quicker) pace than bytestring as part of the whole AFPP effort

All boot packages evolve at the same pace with GHC. You cannot deliver a new version sooner than it is bundled with a new GHC release. And you cannot expect to fix bugs by putting a new version onto Hackage and deprecating an older one, instead users must switch to a newer GHC. Essentially it means that you want a very high level of stability for any boot package.

Another way around is to integrate all encoding-independent functions from shortbytestring into bytestring, and then make shortbytestring16 for UTF16 routines. Given that the implementation is done and tested, I think the first part is pretty uncontroversial. @sjakobi how do you feel about it?

@hasufell I'm really sorry, I'm very busy with bytestring-0.11.2.0 and text-2.0 at the moment, but I'll have more time to discuss afterwards.

@sjakobi
Copy link
Member

sjakobi commented Dec 4, 2021

All boot packages evolve at the same pace with GHC. You cannot deliver a new version sooner than it is bundled with a new GHC release. And you cannot expect to fix bugs by putting a new version onto Hackage and deprecating an older one, instead users must switch to a newer GHC. Essentially it means that you want a very high level of stability for any boot package.

That sounds like a fairly gross simplification to me. At least with cabal and stack it is easy to use different versions of boot packages than those that are bundled with GHC. In my brief experience this is more tricky with Nix. Nevertheless stability is of course quite important in a boot package.

Another way around is to integrate all encoding-independent functions from shortbytestring into bytestring

Does this roughly amount to providing the same API that we have for ByteString and lazy ByteString but for ShortByteString? That would sound like a good direction to me, but I haven't thought very deeply about.

@Bodigrim
Copy link
Contributor

Bodigrim commented Dec 4, 2021

Does this roughly amount to providing the same API that we have for ByteString and lazy ByteString but for ShortByteString?

It roughly amounts to adopting https://hackage.haskell.org/package/shortbytestring-0.2.0.0/docs/Data-ByteString-Short.html into bytestring.

@sjakobi
Copy link
Member

sjakobi commented Dec 4, 2021

It roughly amounts to adopting https://hackage.haskell.org/package/shortbytestring-0.2.0.0/docs/Data-ByteString-Short.html into bytestring.

Looks good. 👍

@hasufell
Copy link
Member Author

hasufell commented Dec 6, 2021

I don't see a clear verdict here. Do we split out or do we not? If not, what's the exact course of action?

@Bodigrim
Copy link
Contributor

Bodigrim commented Dec 8, 2021

@hasufell Let's merge (encoding-agnostic) https://hackage.haskell.org/package/shortbytestring-0.2.0.0/docs/Data-ByteString-Short.html into bytestring proper. Does it work for your purposes?

@hasufell
Copy link
Member Author

hasufell commented Dec 8, 2021

@hasufell Let's merge (encoding-agnostic) https://hackage.haskell.org/package/shortbytestring-0.2.0.0/docs/Data-ByteString-Short.html into bytestring proper. Does it work for your purposes?

I'd personally prefer a split, but the primary goal is to get more eyes on the codebase (including the UTF-16 variant), so that things like
hasufell/shortbytestring#5 are less likely.

In that sense: whatever the strategy, I need reviews for the entire codebase.

Also: merging only the encoding agnostic part won't save me the trouble of adding a new boot library, would it? And what if ShortByteString as a type becomes obsolete in the future? Will we drop shortbytestring16 as a boot library again? Could that cause problems?

Would a split not make it easier for bytestring to change its memory semantics, because it can simply drop the ShortByteString modules with relatively little effort and provide a simple migration guide for legacy codebases?

But I don't have a strong opinion.

@Bodigrim
Copy link
Contributor

Bodigrim commented Dec 8, 2021

Also: merging only the encoding agnostic part won't save me the trouble of adding a new boot library, would it?

You have to add a new boot library anyway, but without a split there is less impact and churn. In fact with a split you'd likely end up with two new boot libraries, because text-dependent stuff must go below bytestring.

The trouble is that if bytestring re-exports a data type from shortbytestring, then in order to provide a stable interface (including type class instances) it must constrain a minor version of shortbytestring. This gets fragile pretty soon. If we were in position just to shave off ShortByteString and do not re-export, it could work out (and indeed provide a better separation of concerns), but breaking changes of such magnitude are currently out of scope.

And what if ShortByteString as a type becomes obsolete in the future?

Why would it become obsolete?
Dropping boot libraries is painless, adding them (especially above bytestring) - very painful.

Would a split not make it easier for bytestring to change its memory semantics

Are you talking about bytestring using unpinned memory? It is not happening any time soon, but anyways I don't quite see how it is connected to ShortByteString story.

hasufell added a commit to hasufell/bytestring that referenced this issue Jan 21, 2022
hasufell added a commit to hasufell/bytestring that referenced this issue Jan 21, 2022
@hasufell
Copy link
Member Author

I opened #471 ...good luck 😅

hasufell added a commit to hasufell/bytestring that referenced this issue Jan 21, 2022
hasufell added a commit to hasufell/bytestring that referenced this issue Jan 22, 2022
sjakobi pushed a commit that referenced this issue Feb 15, 2022
* Merge `shortbytestring` package back into `bytestring` wrt #444

* Fix build on ARM

Reusing compareByteArrays and avoiding
excessive pointer arithmetic.

* Speed up reverse by using byteSwap64 tricks

* Remove phase control from inlines

* Improve performance of elemIndex

* Use setByteArray in replicate

* Implement intercalate manually

* Annotate partial functions with HasCallStack

* Fix build on base < 4.12.0.0

* Add uncons/unsnoc

* Correct complexities

* Exclude reverse optimization path from ARM

It seems to cause segfaults on armv7, suggesting
there are issues with 'indexWord8ArrayAsWord64#'.

All other platforms are fine and tests pass.

* Add benchmarks for ShortByteString

* Improve inlining

* Adjust haddock identifiers

* Get rid of writeCharArray#

* Haddock fixes

* Clean up tests

* Use -fexpose-all-unfoldings

* Improve reverse

* Cleanup 'reverse'

* Fix possible GC race with foreign imports

For more information, see
  #471 (comment)

* Disable asserts in shortbytestring.c

* Remove redundant import

* Add documentation about partial functions

* Fold ShortByteString prop tests into ByteString

* Restore previous INLINEs

* Improve naming of bindings

* Consolidate error handling functions

* Remove trailing whitespace

* Fix uncons in documentation

* Rename indexWord64Array to indexWord8ArrayAsWord64

* Improve error message

* Clean up incorrect documentation

* Use div/mod instead of quot/rem

* Simplify branching in reverse

* Move asserts to Haskell

* Prefix C functions

* Fix return type of c_elem_index

* Fix documentation in unfoldrN

* Make unfoldrN more efficient

* Fix maintainer field

* Fix formatting

* Implement takeEnd, dropeEnd and splitAt manually

* Fix some haddock identifiers

* Fix unfoldrN doc

* Add a primops bounds-checking job to CI

* Document and clean up createAndTrim

* Rename errorEmptyList to errorEmptySBS

* Improve documentation for findFromEndUntil

* Improve documentation and naming

* Optimize out quotRem

* Document compareByteArraysOff

* Simplify findIndexOrLength and findFromEndUntil

* Use c_count for count

* Simplify elemIndex

* Remove use of 'mempty'

* Make sure breakSubstring is inlined into isInfixOf

* Simplify stripSuffix and stripPrefix

* Fix redundant import warnings

* Improve 'take'

* Use existing bounnds check in 'drop'

* Avoid 'create' when bytestring is empty

* Optimize filter

* Remove redundant INLINABLE

* Use shorter 'createAndTrim' in 'filter'

* Simplify 'take'

* Simplify 'drop'

* Better formatting

* Add comment to explain DNDEBUG

* Refactor elemIndex

* Optimize 'partition'

* Optimize hot loop in 'partition'
@Bodigrim Bodigrim added this to the 0.11.3.0 milestone Feb 15, 2022
@Bodigrim Bodigrim linked a pull request Feb 15, 2022 that will close this issue
sjakobi pushed a commit that referenced this issue Feb 15, 2022
* Merge `shortbytestring` package back into `bytestring` wrt #444

* Fix build on ARM

Reusing compareByteArrays and avoiding
excessive pointer arithmetic.

* Speed up reverse by using byteSwap64 tricks

* Remove phase control from inlines

* Improve performance of elemIndex

* Use setByteArray in replicate

* Implement intercalate manually

* Annotate partial functions with HasCallStack

* Fix build on base < 4.12.0.0

* Add uncons/unsnoc

* Correct complexities

* Exclude reverse optimization path from ARM

It seems to cause segfaults on armv7, suggesting
there are issues with 'indexWord8ArrayAsWord64#'.

All other platforms are fine and tests pass.

* Add benchmarks for ShortByteString

* Improve inlining

* Adjust haddock identifiers

* Get rid of writeCharArray#

* Haddock fixes

* Clean up tests

* Use -fexpose-all-unfoldings

* Improve reverse

* Cleanup 'reverse'

* Fix possible GC race with foreign imports

For more information, see
  #471 (comment)

* Disable asserts in shortbytestring.c

* Remove redundant import

* Add documentation about partial functions

* Fold ShortByteString prop tests into ByteString

* Restore previous INLINEs

* Improve naming of bindings

* Consolidate error handling functions

* Remove trailing whitespace

* Fix uncons in documentation

* Rename indexWord64Array to indexWord8ArrayAsWord64

* Improve error message

* Clean up incorrect documentation

* Use div/mod instead of quot/rem

* Simplify branching in reverse

* Move asserts to Haskell

* Prefix C functions

* Fix return type of c_elem_index

* Fix documentation in unfoldrN

* Make unfoldrN more efficient

* Fix maintainer field

* Fix formatting

* Implement takeEnd, dropeEnd and splitAt manually

* Fix some haddock identifiers

* Fix unfoldrN doc

* Add a primops bounds-checking job to CI

* Document and clean up createAndTrim

* Rename errorEmptyList to errorEmptySBS

* Improve documentation for findFromEndUntil

* Improve documentation and naming

* Optimize out quotRem

* Document compareByteArraysOff

* Simplify findIndexOrLength and findFromEndUntil

* Use c_count for count

* Simplify elemIndex

* Remove use of 'mempty'

* Make sure breakSubstring is inlined into isInfixOf

* Simplify stripSuffix and stripPrefix

* Fix redundant import warnings

* Improve 'take'

* Use existing bounnds check in 'drop'

* Avoid 'create' when bytestring is empty

* Optimize filter

* Remove redundant INLINABLE

* Use shorter 'createAndTrim' in 'filter'

* Simplify 'take'

* Simplify 'drop'

* Better formatting

* Add comment to explain DNDEBUG

* Refactor elemIndex

* Optimize 'partition'

* Optimize hot loop in 'partition'

(cherry picked from commit 731caea)
Bodigrim pushed a commit that referenced this issue Feb 15, 2022
* Merge `shortbytestring` package back into `bytestring` wrt #444

* Fix build on ARM

Reusing compareByteArrays and avoiding
excessive pointer arithmetic.

* Speed up reverse by using byteSwap64 tricks

* Remove phase control from inlines

* Improve performance of elemIndex

* Use setByteArray in replicate

* Implement intercalate manually

* Annotate partial functions with HasCallStack

* Fix build on base < 4.12.0.0

* Add uncons/unsnoc

* Correct complexities

* Exclude reverse optimization path from ARM

It seems to cause segfaults on armv7, suggesting
there are issues with 'indexWord8ArrayAsWord64#'.

All other platforms are fine and tests pass.

* Add benchmarks for ShortByteString

* Improve inlining

* Adjust haddock identifiers

* Get rid of writeCharArray#

* Haddock fixes

* Clean up tests

* Use -fexpose-all-unfoldings

* Improve reverse

* Cleanup 'reverse'

* Fix possible GC race with foreign imports

For more information, see
  #471 (comment)

* Disable asserts in shortbytestring.c

* Remove redundant import

* Add documentation about partial functions

* Fold ShortByteString prop tests into ByteString

* Restore previous INLINEs

* Improve naming of bindings

* Consolidate error handling functions

* Remove trailing whitespace

* Fix uncons in documentation

* Rename indexWord64Array to indexWord8ArrayAsWord64

* Improve error message

* Clean up incorrect documentation

* Use div/mod instead of quot/rem

* Simplify branching in reverse

* Move asserts to Haskell

* Prefix C functions

* Fix return type of c_elem_index

* Fix documentation in unfoldrN

* Make unfoldrN more efficient

* Fix maintainer field

* Fix formatting

* Implement takeEnd, dropeEnd and splitAt manually

* Fix some haddock identifiers

* Fix unfoldrN doc

* Add a primops bounds-checking job to CI

* Document and clean up createAndTrim

* Rename errorEmptyList to errorEmptySBS

* Improve documentation for findFromEndUntil

* Improve documentation and naming

* Optimize out quotRem

* Document compareByteArraysOff

* Simplify findIndexOrLength and findFromEndUntil

* Use c_count for count

* Simplify elemIndex

* Remove use of 'mempty'

* Make sure breakSubstring is inlined into isInfixOf

* Simplify stripSuffix and stripPrefix

* Fix redundant import warnings

* Improve 'take'

* Use existing bounnds check in 'drop'

* Avoid 'create' when bytestring is empty

* Optimize filter

* Remove redundant INLINABLE

* Use shorter 'createAndTrim' in 'filter'

* Simplify 'take'

* Simplify 'drop'

* Better formatting

* Add comment to explain DNDEBUG

* Refactor elemIndex

* Optimize 'partition'

* Optimize hot loop in 'partition'

(cherry picked from commit 731caea)
@Bodigrim
Copy link
Contributor

@hasufell can we close this now?

@hasufell
Copy link
Member Author

@hasufell can we close this now?

Two more questions:

  1. Is there a reason data ShortByteString = SBS ByteArray# is not a newtype?
  2. How to proceed with the Word16 variant. Include it in filepath?

@sjakobi
Copy link
Member

sjakobi commented Feb 16, 2022

  1. Is there a reason data ShortByteString = SBS ByteArray# is not a newtype?

Do you mean a newtype of Data.Array.Byte.ByteArray? #410 attempts this, but it's a bit tricky because the constructor is public.

  1. How to proceed with the Word16 variant. Include it in filepath?

Maybe put it into a new package word16-string?!

@sjakobi sjakobi closed this as completed Feb 16, 2022
@hasufell
Copy link
Member Author

Maybe put it into a new package word16-string?!

Well, then that would require a new boot library.

@Bodigrim
Copy link
Contributor

Word16 variant certainly does not fit into bytestring. Either filepath or a separate library, IMHO.

@Ericson2314
Copy link
Contributor

I would have preferred another package too. ByteArray# makes ShortByteString have a much more defined memory management than bytestring using ForeignPtr. It seems thus like a short-bytestring package could be fine ancillary to bytestring and vector alike.

@ulysses4ever
Copy link

Should this new package be based off the recently added base type of the same nature? https://gitlab.haskell.org/ghc/ghc/-/issues/20044

@sjakobi
Copy link
Member

sjakobi commented Jun 22, 2022

Where is this need for a new package coming from? We already have Data.ByteString.Short in bytestring, Data.Array.Byte in base and Data.Primitive.ByteArray in primitive. What's the missing functionality that can't be included in one of these packages?

@Ericson2314
Copy link
Contributor

@sjakobi you can always put more things in the same package. It's better to flip things around: If I can use ShortByteString without using (any sort of) ByteString, why must they live in the same library?

More libraries frees us up to get more creative with the bootstraping. E.g. we have a looming template-haskell <-> filepath dependency cycle (https://gitlab.haskell.org/ghc/ghc/-/issues/21738). Splitting template-haskell would fix the issue, but splitting bytestring would also fix the issue so long as shortbytestring doesn't have an TH, which I don't believe it does.

@Bodigrim
Copy link
Contributor

Splitting template-haskell would fix the issue, but splitting bytestring would also fix the issue so long as shortbytestring doesn't have an TH, which I don't believe it does.

You would not believe it ;)

instance TH.Lift ShortByteString where
#if MIN_VERSION_template_haskell(2,16,0)
lift sbs = [| unsafePackLenLiteral |]
`TH.appE` TH.litE (TH.integerL (fromIntegral len))
`TH.appE` TH.litE (TH.BytesPrimL $ TH.Bytes ptr 0 (fromIntegral len))
where
BS ptr len = fromShort sbs


Just to be clear: the decision to expand Data.ByteString.Short instead of separating it into an independent library has already been made, please refer to the discussion above for the motivation.

@Ericson2314
Copy link
Contributor

That instance could move to template-haskell, no? Define it with Lift itself rather than ShortByteString.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants