Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use PCRE for our fallback regex engine when regcomp_l is unavailable #4935

Merged
merged 19 commits into from
May 21, 2019

Conversation

ethomson
Copy link
Member

This removes the regex implementation that we took from glibc and replaces it with PCRE. PCRE contains a POSIX regex compatibility layer and ignores the currently running locale settings by default, which ensures that LC_COLLATE will not alter our functionality.

Now we will always use regcomp_l on systems where it is available, and when it is not, we will fallback to our builtin PCRE implementation. This means that we will never call the system regcomp which could be influenced by environment variables.

This adds the older PCRE 8.42 library, instead of the newer PCRE2 library. The original PCRE has a smaller footprint, which is nice for us, and is still supported and receiving bugfixes. Minor modifications have been made to remove the CLI components included with PCRE and to prefix the names of the POSIX compatibility functions and structures with pcre_ so that they do not collide with the system's library (if it has one).

I've also cherry-picked the tests from #4560.

@ethomson ethomson force-pushed the ethomson/pcre branch 4 times, most recently from 589ed7b to 5cbf617 Compare January 13, 2019 11:02
@ethomson
Copy link
Member Author

Aaaaaand we've reached that point in a C project where our circular #includes have all collapsed in on their weight.

@ethomson ethomson force-pushed the ethomson/pcre branch 2 times, most recently from 232611c to 3883800 Compare January 13, 2019 21:20
@ethomson
Copy link
Member Author

ethomson commented Jan 14, 2019

I broke some things out into their own header files. I'm not exactly thrilled with this in f89ab66, but 🤷‍♂️

@csware
Copy link
Contributor

csware commented Jan 16, 2019

Please consider pcre2 instead of pcre.

@ethomson
Copy link
Member Author

Please consider pcre2 instead of pcre.

I did consider it, as I mentioned in this PR description and commit messages. What's the benefit to pcre2?

@tiennou
Copy link
Contributor

tiennou commented Jan 17, 2019

From a cursory look (FWIW PHP RFC), it seems the supported Unicode version is 10 (vs. 7), and there might be some minor behavioral differences.
There's also this thread from the git mailing-list, which seems to imply that git makes use of PCRE, so I'd be tempted to go for consistency and use that as well.

@ethomson
Copy link
Member Author

@pks-t since you went deep on understanding the locale problems with regular expressions in #4560, I'd like your input here when you have a moment.

Copy link
Member

@pks-t pks-t left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think using an external library like pcre or pcre2 is a good solution to our problem. Does the syntax match the normal POSIX syntax, though? From documentation of PCRE2 1:

When PCRE2 is called via these functions, it is only the API that is POSIX-like in style. The syntax and semantics of the regular expressions themselves are still those of Perl, subject to the setting of various PCRE2 options, as described below. "POSIX-like in style" means that the API approximates to the POSIX definition; it is not fully POSIX-compatible, and in multi-unit encoding domains it is probably even less compatible.

So to me, it doesn't sound like they really are compatible. I don't know whether the same applies to libpcre(1), too.

Otherwise, I'm not a big fan of including the dependency directly. I'd rather say that we should try to use system provided dependencies, in the following order: regcomp_l, libpcre/libpcre2, regcomp, builtin regcomp. The last two obviously still have the notable shortcomings, but developers are actually able to work around them by calling setlocale at the start of their program. And given that we didn't have any reports of that problem for such a long time, I expect that this is really only a quite limited problem. If a developer hits it, then he may just make sure to also distribute the required system libraries.

* a Linking Exception. For full terms see the included COPYING file.
*/
#ifndef INCLUDE_posix_regex_h__
#define INCLUDE_posix_regex_h__
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we use a separate header for this? Couldn't it simply be part of "posix.h"? I guess it'll get clearer in the following commits

#define p_regexec regexec
#define p_regfree regfree

#ifdef GIT_USE_REGCOMP_L
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is broken, right? If you directly included "posix_regex.h", this define might never be set due to the header missing an include of "common.h"->"features.h"

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This has been fixed 👍

tests/core/posix.c Outdated Show resolved Hide resolved
@@ -0,0 +1,137 @@
INCLUDE(CheckIncludeFile)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Honestly, I'm not much of a fan of including another dependency into our code which we have to keep up-to-date. I'd rather make this a compile-time option, in the following order:

  • libc's regcomp_l
  • system libpcre
  • system libpcre2
  • libc's regcomp, with the known shortcomings
  • built-in regcomp, with the known shortcomings

src/posix_regex.h Show resolved Hide resolved
tests/core/posix.c Outdated Show resolved Hide resolved
tests/core/posix.c Outdated Show resolved Hide resolved
tests/core/posix.c Outdated Show resolved Hide resolved
@ethomson
Copy link
Member Author

The last two obviously still have the notable shortcomings, but developers are actually able to work around them by calling setlocale at the start of their program.

I don't feel like it's reasonable to expect developers to have to change the locale settings in order for compiled-in parts of libgit2 to work, though. I thought that was the whole point of making these changes.

Otherwise, I'm not a big fan of including the dependency directly.

We have to include some regex dependency. Today it's code from glibc that always misbehaves in some situations. If we change that dependency to be PCRE instead, we can potentially fix that issue.

Note that PCRE (v1, not PCRE2) is also git's fallback. So even if there are slight incompatibilities between GNU and PCRE, they're the same incompatibilities that git has, meaning paradoxically, that PCRE's incompatibilities keep us more compatible with git.

@pks-t
Copy link
Member

pks-t commented Jan 25, 2019

The last two obviously still have the notable shortcomings, but developers are actually able to work around them by calling setlocale at the start of their program.

I don't feel like it's reasonable to expect developers to have to change the locale settings in order for compiled-in parts of libgit2 to work, though. I thought that was the whole point of making these changes.

Well, the point is giving developers at least some way of working around the issues, agreed. In some cases, they might just change locales. If that's not possible, they may decide to use this new route of using PCRE as a dependency instead by packaging it along with their application.

Otherwise, I'm not a big fan of including the dependency directly.

We have to include some regex dependency. Today it's code from glibc that always misbehaves in some situations. If we change that dependency to be PCRE instead, we can potentially fix that issue.

That we need to use some regex dependency is obvious. My point is only whether we want to bundle the dependency into our own codebase or make use of a system-provided PCRE dependency. I'm more inclined to use a system-provided one instead of bundling it with us, as otherwise we are responsible to notice and update whenever important bugfixes happen in bundled dependencies.

Note that PCRE (v1, not PCRE2) is also git's fallback. So even if there are slight incompatibilities between GNU and PCRE, they're the same incompatibilities that git has, meaning paradoxically, that PCRE's incompatibilities keep us more compatible with git.

Ah, good to know. Indeed, git recently gained the ability to also use PCRE2 instead of PCRE1.

@ethomson
Copy link
Member Author

My point is only whether we want to bundle the dependency into our own codebase or make use of a system-provided PCRE dependency.

Hmm, okay. So what you're saying is ship the current regcomp dependency in our codebase, not ship PCRE as a dependency in our codebase instead.

You gave this list of priorities:

libc's regcomp_l
system libpcre
system libpcre2
libc's regcomp, with the known shortcomings
built-in regcomp, with the known shortcomings

It sounds like you're saying that you want to keep shipping the glibc regcomp. I guess I don't understand why shipping one dependency (the current glibc regcomp) is superior to shipping a different dependency (PCRE). We have the same problem where we are responsible for noticing and updating when important bugfixes happen, but there's no opportunity for us to fix #4528 on systems where we use our built-in regcomp (eg, Windows).

@ethomson ethomson mentioned this pull request Jan 25, 2019
6 tasks
@pks-t
Copy link
Member

pks-t commented Jan 28, 2019 via email

@ethomson
Copy link
Member Author

  • We should have a compile-time switch that allows users to
    explicitly choose which backend to use. If no preference is
    given, then we should pick: system regcomp_l, system PCRE,
    builtin PCRE, in this order. Plain regcomp should not
    automatically be chosen due to the reported problems, but it
    might be useful to allow users to explicitly request it.

  • The ability to choose between either system-provided libpcre or
    builtin libpcre. On Linux-based distros, I expect that
    distributors would rather like to link against libpcre than
    using the bundled one.

This seems like a fine plan to me.

@pks-t
Copy link
Member

pks-t commented Feb 15, 2019

You need any help with regards to my proposal? Happy to help, as it's essentially my request :)

@ethomson
Copy link
Member Author

ethomson commented Feb 17, 2019

OK, I updated this to take a -DREGEX=... option during the cmake. Options are regcomp_l (for the system regcomp_l), pcre (for a system PCRE), regcomp(for the systemregcomp) or builtin` (for the bundled PCRE).

If that is not specified during the cmake, then we will now prefer system regcomp_l, then system PCRE, and finally the builtin PCRE. If users want regcomp they will need to enable it themselves.

src/CMakeLists.txt Outdated Show resolved Hide resolved
src/posix_regex.h Show resolved Hide resolved
tests/core/posix.c Outdated Show resolved Hide resolved
@@ -47,6 +47,7 @@
# include <ws2tcpip.h>
# include "win32/msvc-compat.h"
# include "win32/mingw-compat.h"
# include "win32/w32_common.h"
# include "win32/win32-compat.h"
# include "win32/error.h"
# include "win32/version.h"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might make sense to have a "win32/win32.h" header that automatically handles all win32-specific includes. This is not in the scope of this PR, though

@@ -15,11 +15,15 @@
#include "posix.h"
#include "userdiff.h"

#if LC_ALL > 0
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a bit weird. Does Windows define LC_ALL to 0 or doesn't it define it at all? If the latter's the case, then we should definitely use #ifndef instead. I'll also upload a PR that enables -Wundef in the short-term future to generate a warning for such uses

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Windows defines LC_ALL as 0.

src/CMakeLists.txt Outdated Show resolved Hide resolved
src/CMakeLists.txt Show resolved Hide resolved
src/CMakeLists.txt Show resolved Hide resolved
Users can now select which regex implementation they want to use: one of
the system `regcomp_l`, the system PCRE, the builtin PCRE or the
system's `regcomp`.

By default the system `regcomp_l` will be used if it exists, otherwise
the system PCRE will be used.  If neither of those exist, then the
builtin PCRE implementation will be used.

The system's `regcomp` is not used by default due to problems with
locales.
Attempt to locate a system-installed version of PCRE and use its POSIX
compatibility layer, if possible.
Use PCRE2 and its POSIX compatibility layer if requested by the user.
Although PCRE2 is adequate for our needs, the PCRE2 POSIX layer as
installed on Debian and Ubuntu systems is broken, so we do not opt-in to
it by default to avoid breaking users on those platforms.
@ethomson
Copy link
Member Author

OK, updated this and rebased it.

#define p_regexec regexec
#define p_regfree regfree

#ifdef GIT_USE_REGCOMP_L
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This has been fixed 👍

CMakeLists.txt Outdated
@@ -65,6 +65,7 @@ OPTION(DEBUG_POOL "Enable debug pool allocator" OFF)
OPTION(ENABLE_WERROR "Enable compilation with -Werror" OFF)
OPTION(USE_BUNDLED_ZLIB "Use the bundled version of zlib" OFF)
OPTION(DEPRECATE_HARD "Do not include deprecated functions in the library" OFF)
SET(REGEX "" CACHE STRING "Regular expression implementation. One of regcomp_l, pcre, regcomp, or builtin.")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This variable name may be a bit problematic, as it is a keyword in CMake. We might want to rename it. "REGEX_BACKEND" would come to my mind, as there's precedent for this already, even though I seem to recall that you don't like the name "backend"

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm fine with regex backend (my avoidance of "backend" comes from using it without qualification. That is I'm fine with having a regex backend but to talk about a "backend" as something concrete sounds odd to me.)

@@ -15,7 +15,7 @@
* compatible implementation.
*/

#ifdef GIT_REGEX_PCRE
#ifdef GIT_REGEX_BUILTIN
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Huh? Where do we define p_regex_ stuff for GIT_REGEX_PCRE now?

Edit: never mind, the following commit clears that up

@@ -307,6 +307,18 @@ ENDIF()
IF(REGEX STREQUAL "regcomp_l")
ADD_FEATURE_INFO(regex ON "using system regcomp_l")
SET(GIT_REGEX_REGCOMP_L 1)
ELSEIF(REGEX STREQUAL "pcre2")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any reason why we shouldn't auto-detect the PCRE2 library IF(REGEX STREQUAL "")?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, I tried to mention this in a comment, but it looks like it got buried. (It is mentioned in the commit that introduces PCRE2.)

Basically, the POSIX compatibility layer for PCRE2 is completely broken on Ubuntu and Debian. If you look at PCRE, you'll see that it has symbols of pcre_regcomp, and then pcreposix.h sets up that indirection as #define regcomp pcre_regcomp. This avoids conflicting with the regex functions that are in libc.

The PCRE2 packages only do half of this, they have symbols for PCRE2regcomp but they forgot to set up the indirection in the header files. (They just declare regcomp.) So you end up using the regular libc regex bits. Worse, you use the pcre2 header files and their #defines for constants, so you end up passing the wrong constants to libc, and things go very wrong.

So we allow people to set PCRE2 manually, in case they have a working installation, but we don't autodetect it to avoid breaking Debian/Ubuntu users.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm. Couldn't we just use CMake to detect whether the system PCRE2 is broken or not with CHECK_FUNCTION_EXISTS?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, maybe yes, but since they haven't fixed it yet, I don't even have a test case that I can verify that we're getting it correct (yet).

My preference would be to hold off until a (shipping) version of Debian/Ubuntu has this fixed and we can test it...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair enough. By holding off you probably just refer to the proper detection of non-broken PCRE2, right?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, keep it as-is where people can explicitly opt-in to PCRE2 support but not enable PCRE2 by default (yet).

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the record, pcre2 has added compatibility aliases (#define regexec pcre2_regexec) in 10.33 (in r1064).

@ethomson
Copy link
Member Author

I'll update REGEX in the cmake to REGEX_BACKEND. Are you happy with me 🚢 this once I do?

@pks-t
Copy link
Member

pks-t commented May 21, 2019

I'm still a bit hesitant with regards to the PCRE2 bits, but this is really nothing that should hold back this PR as we do not loose any functionality at all. So: go for it!

@ethomson
Copy link
Member Author

Cool. I want to get this merged so that we can stomp on the warnings in Windows (a lot of them come from the existing regex implementation). We can iterate on the PCRE2 support.

This avoids any misunderstanding with the REGEX keyword in cmake.
@ethomson
Copy link
Member Author

Updated to use REGEX_BACKEND as the option.

@pks-t
Copy link
Member

pks-t commented May 21, 2019

By the way, just came to my mind: should we update Azure for this? Like, having at least one build explicitly using bundled PCRE and regcomp_l, respectively?

@tiennou
Copy link
Contributor

tiennou commented May 21, 2019

(relatedly, macOS also has both the POSIX2008.smth regcomp_l and pcre1/2 (in Homebrew), so it might be somewhat easier to test the build system)

Explicitly enable the `builtin` regex backend and the PCRE backend for
some Linux builds.
@ethomson
Copy link
Member Author

OK, I've added PCRE to the trusty docker image, and set some explicit regex configuration during builds. (One Linux build explicitly sets PCRE, one explicitly sets builtin, and now the macOS built explicitly sets regcomp_l.)

@pks-t
Copy link
Member

pks-t commented May 21, 2019

OK, I've added PCRE to the trusty docker image, and set some explicit regex configuration during builds. (One Linux build explicitly sets PCRE, one explicitly sets builtin, and now the macOS built explicitly sets regcomp_l.)

Perfect, thanks! No more remarks from my side now, I promise.

@ethomson
Copy link
Member Author

Perfect, thanks! No more remarks from my side now, I promise.

In that case, 🚢 . Thanks for the review everyone!

@ethomson ethomson merged commit 3d9e82f into master May 21, 2019
@ethomson ethomson deleted the ethomson/pcre branch February 2, 2020 13:08
orivej added a commit to orivej/nixpkgs that referenced this pull request Apr 11, 2020
libgit2 has bundled pcre (not pcre2) in
libgit2/libgit2#4935

Before this change configure printed "regex, using bundled PCRE",
after this change it prints "regex, using system PCRE".
@orivej orivej mentioned this pull request Apr 12, 2020
10 tasks
orivej-nixos pushed a commit to NixOS/nixpkgs that referenced this pull request Apr 12, 2020
libgit2 has bundled pcre (not pcre2) in
libgit2/libgit2#4935

Before this change configure printed "regex, using bundled PCRE",
after this change it prints "regex, using system PCRE".
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants