HPCC-31739 Document changes in moving Unicode regex from ICU to PCRE2 #18677

JamesDeFabia · 2024-05-17T14:39:11Z

Type of change:

This change is a bug fix (non-breaking change which fixes an issue).
This change is a new feature (non-breaking change which adds functionality).
This change improves the code (refactor or other change that does not change the functionality)
This change fixes warnings (the fix does not alter the functionality or the generated code)
This change is a breaking change (fix or feature that will cause existing behavior to change).
This change alters the query API (existing queries will have to be recompiled)

Checklist:

Smoketest:

Send notifications about my Pull Request position in Smoketest queue.
Test my draft Pull Request.

Testing:

github-actions · 2024-05-17T15:00:45Z

Jira Issue: https://hpccsystems.atlassian.net//browse/HPCC-31739

Jirabot Action Result:
Workflow Transition: Merge Pending
Updated PR

dcamper · 2024-05-17T17:31:46Z

docs/EN_US/ECLLanguageReference/ECLR_mods/BltInFunc-REGEXFIND.xml

-  We use version 2.6 which should support all listed features.</para>
+  </para>
+
+  <para>For unicode <emphasis>text</emphasis>, <ulink


We have settled on using only one third-party library for regex support (PCRE2), no matter what data type is used. PCRE2 always supports the syntax outlined in pcre2pattern.html, even for Unicode (UNICODE or UTF8). When searching Unicode data, PCRE2 also supports the syntax outlined in pcre2unicode.html.

To avoid confusion, the description here might want to lead with that.

Background info: Reading what I just wrote, it may seem that the latter is a superset of the former. That is not the case. It all comes down to things like "what is a letter" or "what is a digit," and which character set is supported. Example: [[:digit:]] is a pattern that defines the numeric characters 0-9 in ASCII (our STRING datatype), but Unicode contains many different sets of numeric digits. [[:digit:]] will match that same set of 0-9 characters, even in UNICODE or UTF8, but it will not match the other numeric characters; to do that you need to use \p{Nd} instead (which will match them all, including the [[:digit:]] set).

The above applies to all three REGEX ECL functions.
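
To make the [[:digit:]] versus \p{Nd} distinction concrete, here is a rough sketch in Python rather than ECL. It does not call PCRE2 at all; it only uses the standard `unicodedata` module to show which characters fall in the ASCII 0-9 range and which carry the Unicode general category Nd. The helper names and the sample string are made up for illustration.

```python
import unicodedata


def ascii_digits(text):
    """Characters in the ASCII range 0-9 (the set [[:digit:]] is described as matching)."""
    return [ch for ch in text if "0" <= ch <= "9"]


def unicode_decimal_digits(text):
    """Characters whose Unicode general category is Nd (the set \\p{Nd} refers to)."""
    return [ch for ch in text if unicodedata.category(ch) == "Nd"]


# Hypothetical sample: ASCII 1 and 2, Arabic-Indic 3 and 4, Devanagari 5.
sample = "12\u0663\u0664\u096b"

print(ascii_digits(sample))            # only the two ASCII digits
print(unicode_decimal_digits(sample))  # all five characters
```

The ASCII set is a strict subset of the Nd set here, which mirrors the point that \p{Nd} also covers what [[:digit:]] matches, while the reverse does not hold.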

Can the description be reworded?

Signed-off-by: Jim DeFabia <jamesdefabia@lexisnexis.com>

dcamper

Looks good.

JamesDeFabia · 2024-06-03T13:11:46Z

@ghalliday This should be ready to merge

JamesDeFabia requested a review from dcamper May 17, 2024 14:39

dcamper reviewed May 17, 2024


HPCC-31739 Document changes in moving Unicode regex from ICU to PCRE2

f880b7a

Signed-off-by: Jim DeFabia <jamesdefabia@lexisnexis.com>

JamesDeFabia force-pushed the HPCC-31739Regex2 branch from bdc322f to f880b7a Compare May 17, 2024 19:33

JamesDeFabia requested a review from dcamper May 17, 2024 19:34

dcamper approved these changes May 20, 2024


ghalliday merged commit 8460bd6 into hpcc-systems:master Jun 7, 2024
22 of 28 checks passed
