Move baked data into parley#477

Closed
robertbastian wants to merge 4 commits into linebender:main from robertbastian:dg

@robertbastian
Contributor

Having a semver boundary in what is essentially an implementation detail might complicate improvements in the future. ICU4X baked data in particular does not have stability guarantees across even minor versions. Moving the data to the parley crate also simplifies some code.

@robertbastian force-pushed the dg branch 4 times, most recently from 89fd57b to a0fd6e9 (December 9, 2025 13:36)

@nicoburns
Collaborator

I might be missing something, but I don't understand why this makes a difference from a semver perspective:

  • I believe the data is codegened, so just updating the icu4x dependency wouldn't update it.
  • I believe the data is not publicly exposed from parley. So parley could freely update it without bumping a semver breaking version anyway.

Is it that the version of icu4x that Parley uses to decode the baked data needs to match the version of icu4x that was used to encode it? If so, then I think Parley might have to pin to an exact version of icu4x anyway (or generate the baked data at build time).

@robertbastian
Contributor Author

The problem is that the interface between unicode_data and parley is public. This interface includes the ICU4X baked data, which is a bit iffy, as well as the custom baked properties. In the end it's fine for this interface to be public; however, you wouldn't choose to make it public if it were all inside parley, because it really is an implementation detail. Having it in different crates also means making breaking changes whenever you want to change this interface (e.g. if you replace the CodePointTrie with packtab, you might want to make API changes).
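
The visibility argument can be sketched as follows. This is a minimal illustration, assuming a hypothetical module layout and hypothetical function names; none of it is Parley's or unicode_data's actual API, and the lookup is a toy stand-in rather than real generated data.

```rust
// Hypothetical: the data lives in a private module of the consuming crate
// instead of in a separately published crate.
mod unicode_data {
    // `pub(crate)` keeps this lookup out of the crate's public API, so its
    // signature or backing structure can change without a breaking release.
    pub(crate) fn is_hard_break(c: char) -> bool {
        // Toy stand-in for a generated table lookup (not real Unicode data).
        matches!(c, '\n' | '\u{2028}' | '\u{2029}')
    }
}

/// The only public surface: callers never see how the data is stored.
pub fn count_hard_breaks(text: &str) -> usize {
    text.chars()
        .filter(|&c| unicode_data::is_hard_break(c))
        .count()
}
```

With the data in a separate crate, `is_hard_break` would have to be fully `pub` for the other crate to call it, which is what turns an implementation detail into a semver-relevant interface.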

> I think Parley might have to pin to an exact version of icu4x anyway

Yes, I think if you want to continue to use custom data you'll have to (however, I think you should migrate to compiled data anyway).

Decoupling data versioning and code versioning (by having two crates which Cargo can resolve somewhat independently) also has the potential for confusing behaviour when some user's Cargo decides to resolve a combination that you've never tested. We've been burnt by this in ICU4X.

@nicoburns
Collaborator

nicoburns commented Dec 9, 2025

I think Parley could avoid these issues by simply bumping the major version of the unicode_data crate every time it changes. End-users won't be depending on that crate directly anyway, so they wouldn't be affected. And bumping the version of unicode_data wouldn't require a major version bump of Parley, because Parley doesn't publicly expose the unicode_data types in its own API (I don't think). Cargo can't resolve versions you don't expect if the versions are using semver-incompatible version numbers.
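
A rough model of the compatibility rule this relies on, as a sketch rather than Cargo's actual resolver: under Cargo's default (caret) requirements, two versions count as compatible when their left-most non-zero component among major, minor and patch is the same. The `Version` type and helper names below are hypothetical, and pre-release and build metadata are ignored.

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct Version {
    major: u64,
    minor: u64,
    patch: u64,
}

// Convenience constructor (hypothetical helper).
const fn v(major: u64, minor: u64, patch: u64) -> Version {
    Version { major, minor, patch }
}

// Reduce a version to the part that decides compatibility:
// everything up to and including the left-most non-zero component.
fn compat_key(ver: Version) -> (u64, u64, u64) {
    if ver.major > 0 {
        (ver.major, 0, 0)
    } else if ver.minor > 0 {
        (0, ver.minor, 0)
    } else {
        (0, 0, ver.patch)
    }
}

// Two versions may be substituted for one another under a default
// requirement only if their compatibility keys match.
fn semver_compatible(a: Version, b: Version) -> bool {
    compat_key(a) == compat_key(b)
}
```

Under this model, a dependent that declares `0.3.0` can be resolved to `0.3.5` but never to `0.4.0`, so a release that bumps the left-most non-zero component on every data change cannot be picked up by an older Parley release.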

@robertbastian
Contributor Author

Yes that is a way to avoid these issues, but you have to be careful with this. I don't see the advantage of having the data in a separate crate.
