Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Patches carried by conda-forge for packaging sentencepiece #876

Open
h-vetinari opened this issue Jun 6, 2023 · 3 comments
Open

Patches carried by conda-forge for packaging sentencepiece #876

h-vetinari opened this issue Jun 6, 2023 · 3 comments

Comments

@h-vetinari
Copy link
Contributor

h-vetinari commented Jun 6, 2023

Hi there - I help package many a package for conda-forge, which is a pretty big ecosystem of packages especially in data science and ML/AI. It started out mainly focussed on Python, but has come to include many other languages as well.

One strong point (especially in comparison to pip) is that we are able to sanely redistribute non-Python artefacts. We can track the relevant ABI of a package and rebuild its descendant packages where necessary. This allows us to (almost) exclusively distribute shared libraries, which lowers package footprints, and has several other benefits.

We've been shipping sentencepiece for about 3 years through the so-called feedstock, and packages can be installed like:

conda install -c conda-forge sentencepiece  # or mamba instead of conda

We have builds for linux-{64, aarch64, ppc64le}, osx-{64, arm64}, win-64, multiplied by CPython 3.8-3.11 as well as PyPy 3.8 & 3.9.

In the context of wanting to package torchtext (which depends on sentencepiece -- but really only the C++ bits) I wanted to split off a separate "output" / "package" called libsentencepiece that torchtext can build against, without having to spuriously rebuild it per python version (it would literally be the same in each case).

Also, crucially, we absolutely need to rely on our own abseil & protobuf builds, which became a pretty big sticking point. I finally managed to pull off this split last year (there's references there to my very long-winded previous tries; eventually I gave up on trying to build a shared library on windows because it was getting too insane1).

However, at the time, I ran out of steam to raise an issue here, and discuss how - if at all - it would be possible to upstream some of these changes. That's all the more the case because some of the patches I came up with are definitely not ready for primetime here. I got quite exasperated with how the code in thirdparty/absl was neither third-party nor abseil proper2, so I ended up replacing that wholesale. Of course that doesn't mean I'm suggesting to do this here verbatim.

However, the abseil situation is perhaps one of the biggest sticking points. There's a variable called SPM_USE_EXTERNAL_ABSL, but it really doesn't do what it says on the tin, because all the include paths still point to third_party/absl, rather than abseil proper. This issue has also been encountered by other people: #869.

Since I came up with my initial patches, sentencepiece is not forcing static linkage (/MT) of the MSVC runtime anymore (yay!). Other list of changes that were necessary for us:

  • allow switching off the building of static libraries (we really don't want them anywhere)
  • use external protobuf & abseil (including potentially shared builds, which requires stuff like setting PROTOBUF_USE_DLLS3)
  • enforce C++17 (we need this because abseil's ABI depends on the standard version, and we need to be consistent everywhere)
  • respect CMAKE_INSTALL_PREFIX
  • add cmake & pkg-config metadata for the library
  • allow building python bindings against an installed libsentencepiece (in our case, this will always match the underlying version 1:1, though the ABI-guarantees - or lack thereof - of sentencepiece would be good to know)

I've rebased my patches (branch) against 0.1.99 and am currently re-building our redistribution recipe against both protobuf 3.21 & 4.23 (this also runs the test suite afterwards to catch regressions).

Would be happy if we could find a way that lets some of these improvements flow back upstream. 🙃
(I fully understand that some changes are specific to our needs, and it's fine if we keep carrying some patches; but any reduction helps, and some of the changes would also benefit other people I believe)

I'm happy to prepare dedicated PRs for individual changes, but wanted to discuss first which bits are interesting to the maintainers here, and how we could make them palatable for merging here (e.g. with switches based on symbols or CMake resp. environment variables).

Footnotes

  1. by that point I was already patching the output generated by protobuf for some silly constexpr errors

  2. as force-replacing it with actual abseil lead to build errors because -- to the best of my understanding -- it contains sentencepiece-specific glue-code, but in the absl namespace.

  3. this at least should be getting easier as of protobuf 23, because autotools is gone, and we have better metadata for CMake / pkgconfig.

@h-vetinari
Copy link
Contributor Author

@taku910, did you have a moment to have a look at this issue already? Would there be a way to collaborate on this, e.g. starting with the abseil situation?

@h-vetinari
Copy link
Contributor Author

Gentle ping @taku910 - is there a way I could contribute e.g. some of the abseil changes (to truly build against an external abseil) in a way that it would be acceptable to you?

@h-vetinari
Copy link
Contributor Author

I've rebased my patches (branch) against 0.1.99 and am currently re-building our redistribution recipe against both protobuf 3.21 & 4.23 (this also runs the test suite afterwards to catch regressions).

I've turned the patches against 0.1.99 into a tag, and rebased the branch of our patches against current master. With all the recent changes, the patch set has shrunken a lot, which is great! Thanks for your efforts to solve some of these tricky problems! 🙏

Of the remaining patches, the main thing that would be interesting to us to upstream in some form would be:

  • don't mix builds of static and shared libraries; either SPM_ENABLE_SHARED should only build static libs, or there should be a similar SPM_DISABLE_STATIC option
  • install metadata for CMake & pkg-config about libsentencepiece.so

Is this something that you would be interested in @taku910?

Long-term it would be great to also support builds against shared abseil & protobuf (resp. shared sentencepiece builds on windows).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant