Bridge PPL_REX_MAX_MATCH_LIMIT into UnifiedQueryContext on the unified query path#5418
Conversation
PR Reviewer Guide 🔍(Review updated until commit aa0ae82)Here are some key observations to aid the review process:
|
PR Code Suggestions ✨Latest suggestions up to aa0ae82 Explore these optional code suggestions:
Previous suggestionsSuggestions up to commit c655e4a
Suggestions up to commit 6d0dc2c
Suggestions up to commit e46eaa0
Suggestions up to commit 62d617d
Suggestions up to commit 0f1c8ce
|
95521c9 to
0f1c8ce
Compare
|
Persistent review updated to latest commit 0f1c8ce |
…dge-only
Onboards the PPL `rex` command's `mode=sed` surface — the part that lowers to
standard Calcite library operators and bridges through Substrait to DataFusion's
native UDFs. Three sed sub-variants covered:
* `rex field=f mode=sed "s/old/new/"` (no flags) → SqlLibraryOperators.REGEXP_REPLACE_3
(already mapped via the PPL `replace` onboarding from opensearch-project#21527 — no-op here).
* `rex field=f mode=sed "s/old/new/g"` / `/i` / `/gi` → SqlLibraryOperators.REGEXP_REPLACE_PG_4
(4-arg with flags string). New bridge in this PR. DataFusion's regexp_replace
natively accepts 4-arg `(str, pat, repl, flags)` per its substrait UDF binding.
* `rex field=f mode=sed "y/from/to/"` (transliteration) → SqlLibraryOperators.TRANSLATE3.
New bridge in this PR. Resolves to DataFusion's `translate` UDF
(datafusion-functions/src/unicode/translate.rs).
## Why an adapter extension is necessary
The 4-arg `REGEXP_REPLACE_PG_4` carries the same Java-regex syntax baggage as the
3-arg form: `\Q…\E` quoted-literal blocks (Rust regex rejects them) and bare `$N`
backreferences in the replacement (Rust's identifier-greedy parser
mis-resolves them). RegexpReplaceAdapter, introduced for the 3-arg form in
opensearch-project#21527, is extended here to recognize 3 OR 4 operands. Pattern is at position 1
and replacement at position 2 in both signatures — the rewrite logic doesn't
change. Operands beyond position 2 (the flags string in the 4-arg form) pass
through verbatim. Two new RegexpReplaceAdapterTests cover the 4-arg path.
`TRANSLATE3` doesn't need an adapter — its arguments are character classes, not
regex syntax.
## Out of scope (deferred to Part 2)
* Rex extract mode (`rex field=f "(?<g>...)"`) — uses the SQL plugin's custom
Java UDFs `REX_EXTRACT`, `REX_EXTRACT_MULTI`, `REX_OFFSET`, which have no
native DataFusion equivalent. Slated for a follow-up PR that adds Rust-side
UDF implementations, similar to the convert_tz precedent (opensearch-project#21476).
* Sed with occurrence flag (`s/.../.../<N>`) — emits 5-arg
`REGEXP_REPLACE_5`, which DataFusion's native `regexp_replace` does not
support (max 4 args). Also Part 2.
## Test results
* `RegexpReplaceAdapterTests` — 21/21 (19 from opensearch-project#21527 + 2 new for the 4-arg path).
* `RexCommandIT` (new self-contained QA IT, calcs dataset) — 9/9. Covers all sed
sub-variants: literal (no flags), `/g` global, `/i` case-insensitive, `/gi`
combined, backreferences via `$N`, transliteration `y/from/to/` and
no-match passthrough.
* `./gradlew check -p sandbox -Dsandbox.enabled=true` — green.
## Companion PR
The unified-path NPE caused by a missing PPL_REX_MAX_MATCH_LIMIT default is fixed
in opensearch-project/sql#5418 — required for any rex query (sed or extract) to
reach the planner via /_analytics/ppl. This PR's Test results assume opensearch-project#5418 is
applied. Pre-fix: every query NPEs in `AstBuilder.visitRexCommand`. Post-fix:
9/9 RexCommandIT pass.
Signed-off-by: Jialiang Liang <jiallian@amazon.com>
…dge-only
Onboards the PPL `rex` command's `mode=sed` surface — the part that lowers to
standard Calcite library operators and bridges through Substrait to DataFusion's
native UDFs. Three sed sub-variants covered:
* `rex field=f mode=sed "s/old/new/"` (no flags) → SqlLibraryOperators.REGEXP_REPLACE_3
(already mapped via the PPL `replace` onboarding from opensearch-project#21527 — no-op here).
* `rex field=f mode=sed "s/old/new/g"` / `/i` / `/gi` → SqlLibraryOperators.REGEXP_REPLACE_PG_4
(4-arg with flags string). New bridge in this PR. DataFusion's regexp_replace
natively accepts 4-arg `(str, pat, repl, flags)` per its substrait UDF binding.
* `rex field=f mode=sed "y/from/to/"` (transliteration) → SqlLibraryOperators.TRANSLATE3.
New bridge in this PR. Resolves to DataFusion's `translate` UDF
(datafusion-functions/src/unicode/translate.rs).
## Why an adapter extension is necessary
The 4-arg `REGEXP_REPLACE_PG_4` carries the same Java-regex syntax baggage as the
3-arg form: `\Q…\E` quoted-literal blocks (Rust regex rejects them) and bare `$N`
backreferences in the replacement (Rust's identifier-greedy parser
mis-resolves them). RegexpReplaceAdapter, introduced for the 3-arg form in
opensearch-project#21527, is extended here to recognize 3 OR 4 operands. Pattern is at position 1
and replacement at position 2 in both signatures — the rewrite logic doesn't
change. Operands beyond position 2 (the flags string in the 4-arg form) pass
through verbatim. Two new RegexpReplaceAdapterTests cover the 4-arg path.
`TRANSLATE3` doesn't need an adapter — its arguments are character classes, not
regex syntax.
## Out of scope (deferred to Part 2)
* Rex extract mode (`rex field=f "(?<g>...)"`) — uses the SQL plugin's custom
Java UDFs `REX_EXTRACT`, `REX_EXTRACT_MULTI`, `REX_OFFSET`, which have no
native DataFusion equivalent. Slated for a follow-up PR that adds Rust-side
UDF implementations, similar to the convert_tz precedent (opensearch-project#21476).
* Sed with occurrence flag (`s/.../.../<N>`) — emits 5-arg
`REGEXP_REPLACE_5`, which DataFusion's native `regexp_replace` does not
support (max 4 args). Also Part 2.
## Test results
* `RegexpReplaceAdapterTests` — 21/21 (19 from opensearch-project#21527 + 2 new for the 4-arg path).
* `RexCommandIT` (new self-contained QA IT, calcs dataset) — 9/9. Covers all sed
sub-variants: literal (no flags), `/g` global, `/i` case-insensitive, `/gi`
combined, backreferences via `$N`, transliteration `y/from/to/` and
no-match passthrough.
* `./gradlew check -p sandbox -Dsandbox.enabled=true` — green.
## Companion PR
The unified-path NPE caused by a missing PPL_REX_MAX_MATCH_LIMIT default is fixed
in opensearch-project/sql#5418 — required for any rex query (sed or extract) to
reach the planner via /_analytics/ppl. This PR's Test results assume opensearch-project#5418 is
applied. Pre-fix: every query NPEs in `AstBuilder.visitRexCommand`. Post-fix:
9/9 RexCommandIT pass.
Signed-off-by: Jialiang Liang <jiallian@amazon.com>
|
Persistent review updated to latest commit 62d617d |
|
Persistent review updated to latest commit e46eaa0 |
The PPL `rex` command's AstBuilder reads `Settings.Key.PPL_REX_MAX_MATCH_LIMIT`
unconditionally and unboxes the result to `int`:
int maxMatchLimit =
(settings != null) ? settings.getSettingValue(...) : 10;
The `(settings != null)` guard only protects against the Settings instance
being null — not against `getSettingValue` returning null for a key that the
caller never registered. On the unified query path, `UnifiedQueryContext`
builds its `Settings` map with only a small whitelist of keys (e.g.
`QUERY_SIZE_LIMIT`, `CALCITE_ENGINE_ENABLED` per opensearch-project#5413). For any unregistered
key, `getSettingValue` returns null, and the auto-unbox NPEs the planner
before any operator-level capability check runs. Every `rex` query through
`/_analytics/ppl` (the analytics-engine route's REST front-end) hits this
NPE today.
Default `PPL_REX_MAX_MATCH_LIMIT=10` in `buildSettings()` so unified-path
behavior matches the cluster-side default registered by
`OpenSearchSettings.PPL_REX_MAX_MATCH_LIMIT_SETTING` — making the v2 path and
the analytics-engine path agree on the same fallback value when neither has
an explicit cluster override. Mirrors the precedent Kai introduced for
`CALCITE_ENGINE_ENABLED` in opensearch-project#5413.
Companion to the OpenSearch core PR onboarding PPL `rex mode=sed` to the
analytics-engine route via DataFusion (Part 1 — sed-mode bridge only;
extract-mode Rust UDFs deferred to Part 2).
Signed-off-by: Jialiang Liang <jiallian@amazon.com>
e46eaa0 to
6d0dc2c
Compare
|
Persistent review updated to latest commit 6d0dc2c |
6d0dc2c to
c655e4a
Compare
|
Persistent review updated to latest commit c655e4a |
| if (settings != null) { | ||
| Object liveRexLimit = settings.getSettingValue(PPL_REX_MAX_MATCH_LIMIT); | ||
| if (liveRexLimit != null) { | ||
| defaults.put(PPL_REX_MAX_MATCH_LIMIT, liveRexLimit); | ||
| } | ||
| } |
There was a problem hiding this comment.
Is this required since PPL_REX_MAX_MATCH_LIMIT already in the defaults?
There was a problem hiding this comment.
Good catch — removed in latest push. The REST handler now writes through the existing setting(String, Object) API directly, which overwrites the default-10 entry in the same map when the operator has a cluster value.
| * @param settings the cluster's live {@code OpenSearchSettings} instance | ||
| * @return this builder instance | ||
| */ | ||
| public Builder settings(Settings settings) { |
There was a problem hiding this comment.
If possible, can we avoid this new API because of Settings in argument? We want to keep it internal and may decouple from it later.
There was a problem hiding this comment.
Agreed — dropped settings(Settings) entirely. REST handler routes the cluster value through the existing setting(String, Object) API:
Object rexLimit = pluginSettings.getSettingValue(PPL_REX_MAX_MATCH_LIMIT);
if (rexLimit != null) {
builder.setting(PPL_REX_MAX_MATCH_LIMIT.getKeyValue(), rexLimit);
}
UnifiedQueryContext stays decoupled from any specific Settings impl. Also makes @penghuo's earlier "merge logic in buildSettings()" comment moot since there's no merge logic anymore.
| } | ||
|
|
||
| @Test | ||
| public void testLiveSettingsAbsentFallsBackToStaticDefault() { |
There was a problem hiding this comment.
Test default value in existing testContextCreationWithDefaults?
There was a problem hiding this comment.
Done — folded into existing tests:
testContextCreationWithDefaultsnow also asserts PPL_REX_MAX_MATCH_LIMIT defaults to 10.testContextCreationWithCustomConfignow also asserts that.setting("plugins.ppl.rex.max_match.limit", 5)resolves to 5.
Dropped the four standalone tests — they exercised the removed API.
… setting() API
The previous commit defaulted `PPL_REX_MAX_MATCH_LIMIT=10` in
`UnifiedQueryContext.Builder.settings` to fix the NPE in
`AstBuilder.visitRexCommand` on the unified path. The default is correct, but
it doesn't respect mid-run cluster overrides — every key in the static map
returns its hardcoded value regardless of `_cluster/settings` updates. This
breaks `CalciteRexCommandIT.testRexMaxMatchConfigurableLimit`, which sets the
cluster-side limit to 5 and asserts `max_match=0` caps at 5; on the unified
path it stayed at 10.
Rather than introducing a new `Settings`-typed Builder API, the REST handler
reads the live cluster value itself and routes it through the existing
`Builder.setting(String, Object)` method — keeping `UnifiedQueryContext`
decoupled from any specific `Settings` implementation:
RestUnifiedQueryAction.applyClusterOverrides(builder)
└── pluginSettings.getSettingValue(PPL_REX_MAX_MATCH_LIMIT)
└── builder.setting("plugins.ppl.rex.max_match.limit", value)
`RestUnifiedQueryAction` gains a `pluginSettings` field (the same
`OpenSearchSettings` instance bound in the Guice module). Both
construction sites — `SQLPlugin.createSqlAnalyticsRouter` and
`TransportPPLQueryAction.<init>` — pass it through.
`RestUnifiedQueryActionTest` updated to pass a `mock(Settings.class)` for the
new constructor parameter.
## Why scoped to PPL_REX_MAX_MATCH_LIMIT only
The same architectural gap exists for every key in the static defaults map
(`QUERY_SIZE_LIMIT`, `PPL_SUBSEARCH_MAXOUT`, `PPL_JOIN_SUBSEARCH_MAXOUT`,
`CALCITE_ENGINE_ENABLED`). For three of those, the static defaults are fine
in practice (no test overrides them mid-run; `head N` covers
`QUERY_SIZE_LIMIT` per-query). `CALCITE_ENGINE_ENABLED` is intentionally
pinned to `true` for the unified path. So this PR widens only the one key
that demonstrably needs it; widening the snapshot to the rest is a future
scope decision tied to whichever new IT first depends on it.
Signed-off-by: Jialiang Liang <jiallian@amazon.com>
Pins two behaviors the previous commits introduced:
* `testContextCreationWithDefaults` now asserts that
`PPL_REX_MAX_MATCH_LIMIT` resolves to its static default of 10 — the
fallback value `AstBuilder.visitRexCommand` reads when no cluster-side
override is present.
* `testContextCreationWithCustomConfig` now asserts that
`setting("plugins.ppl.rex.max_match.limit", 5)` reaches
`getSettingValue(PPL_REX_MAX_MATCH_LIMIT)` — the path the REST handler
uses to forward an operator-configured cluster value into the
unified-path settings map.
Folds the two assertions into the existing default / custom-config tests
rather than adding new test methods, per review feedback.
Signed-off-by: Jialiang Liang <jiallian@amazon.com>
c655e4a to
aa0ae82
Compare
|
Persistent review updated to latest commit aa0ae82 |
…dge-only
Onboards the PPL `rex` command's `mode=sed` surface — the part that lowers to
standard Calcite library operators and bridges through Substrait to DataFusion's
native UDFs. Three sed sub-variants covered:
* `rex field=f mode=sed "s/old/new/"` (no flags) → SqlLibraryOperators.REGEXP_REPLACE_3
(already mapped via the PPL `replace` onboarding from opensearch-project#21527 — no-op here).
* `rex field=f mode=sed "s/old/new/g"` / `/i` / `/gi` → SqlLibraryOperators.REGEXP_REPLACE_PG_4
(4-arg with flags string). New bridge in this PR. DataFusion's regexp_replace
natively accepts 4-arg `(str, pat, repl, flags)` per its substrait UDF binding.
* `rex field=f mode=sed "y/from/to/"` (transliteration) → SqlLibraryOperators.TRANSLATE3.
New bridge in this PR. Resolves to DataFusion's `translate` UDF
(datafusion-functions/src/unicode/translate.rs).
The 4-arg `REGEXP_REPLACE_PG_4` carries the same Java-regex syntax baggage as the
3-arg form: `\Q…\E` quoted-literal blocks (Rust regex rejects them) and bare `$N`
backreferences in the replacement (Rust's identifier-greedy parser
mis-resolves them). RegexpReplaceAdapter, introduced for the 3-arg form in
and replacement at position 2 in both signatures — the rewrite logic doesn't
change. Operands beyond position 2 (the flags string in the 4-arg form) pass
through verbatim. Two new RegexpReplaceAdapterTests cover the 4-arg path.
`TRANSLATE3` doesn't need an adapter — its arguments are character classes, not
regex syntax.
* Rex extract mode (`rex field=f "(?<g>...)"`) — uses the SQL plugin's custom
Java UDFs `REX_EXTRACT`, `REX_EXTRACT_MULTI`, `REX_OFFSET`, which have no
native DataFusion equivalent. Slated for a follow-up PR that adds Rust-side
UDF implementations, similar to the convert_tz precedent (opensearch-project#21476).
* Sed with occurrence flag (`s/.../.../<N>`) — emits 5-arg
`REGEXP_REPLACE_5`, which DataFusion's native `regexp_replace` does not
support (max 4 args). Also Part 2.
* `RegexpReplaceAdapterTests` — 21/21 (19 from opensearch-project#21527 + 2 new for the 4-arg path).
* `RexCommandIT` (new self-contained QA IT, calcs dataset) — 9/9. Covers all sed
sub-variants: literal (no flags), `/g` global, `/i` case-insensitive, `/gi`
combined, backreferences via `$N`, transliteration `y/from/to/` and
no-match passthrough.
* `./gradlew check -p sandbox -Dsandbox.enabled=true` — green.
The unified-path NPE caused by a missing PPL_REX_MAX_MATCH_LIMIT default is fixed
in opensearch-project/sql#5418 — required for any rex query (sed or extract) to
reach the planner via /_analytics/ppl. This PR's Test results assume opensearch-project#5418 is
applied. Pre-fix: every query NPEs in `AstBuilder.visitRexCommand`. Post-fix:
9/9 RexCommandIT pass.
Signed-off-by: Jialiang Liang <jiallian@amazon.com>
Description
The PPL
rexcommand'sAstBuilderreadsSettings.Key.PPL_REX_MAX_MATCH_LIMITunconditionally and unboxes the result toint. On the unified query path,UnifiedQueryContextbuilds itsSettingsmap with only a small whitelist of planning-required keys; for any unregistered key,getSettingValuereturns null and the auto-unbox NPEs the planner before any operator-level capability check runs. Everyrexquery through/_analytics/pplhits this NPE today.This PR ships two changes that together let the unified path execute
rexcorrectly:1. Default
PPL_REX_MAX_MATCH_LIMIT=10inUnifiedQueryContextAdds the key to the static settings map so
AstBuilder.visitRexCommandno longer NPEs. The value mirrors the cluster-side default of10registered byOpenSearchSettings.PPL_REX_MAX_MATCH_LIMIT_SETTING, so unified-path behavior matches v2-path behavior when neither has an explicit cluster override. Mirrors the precedent Kai introduced forCALCITE_ENGINE_ENABLEDin #5413.2. Bridge live cluster settings for
PPL_REX_MAX_MATCH_LIMITonlyWithout this, every key in the static map resolves to its hardcoded default and
_cluster/settingsupdates are invisible to the unified path.CalciteRexCommandIT.testRexMaxMatchConfigurableLimitexercises this: it sets the cluster-side limit to 5 and asserts thatmax_match=0caps at 5, but on the unified path the static10keeps winning.Adds a
Builder.liveSettings(Settings)hook so the REST handler can inject the cluster's liveOpenSearchSettingsinstance. Atbuild()time the Builder snapshots the live value ofPPL_REX_MAX_MATCH_LIMITinto the static map, overriding the hardcoded default when the operator has set a cluster value. Snapshot-at-build matches the per-HTTP-request lifecycle ofUnifiedQueryContextand avoids per-call lookup overhead.RestUnifiedQueryActiongains apluginSettingsfield (the sameOpenSearchSettingsinstance bound in the Guice module) and forwards it to the Builder in bothbuildContextandbuildParsingContext. Both construction sites —SQLPlugin.createSqlAnalyticsRouterandTransportPPLQueryAction.<init>— are updated.Why scoped to
PPL_REX_MAX_MATCH_LIMITonlyThe same architectural gap exists for every key in the static map (
QUERY_SIZE_LIMIT,PPL_SUBSEARCH_MAXOUT,PPL_JOIN_SUBSEARCH_MAXOUT,CALCITE_ENGINE_ENABLED). For three of those, the static defaults are fine in practice (no test overrides them mid-run;head NcoversQUERY_SIZE_LIMITper-query).CALCITE_ENGINE_ENABLEDis intentionally pinned totruefor the unified path — a cluster override toggling it off would defeat the point of routing here. So this PR widens only the one key that demonstrably needs it; widening the snapshot to the rest is a future scope decision tied to whichever new IT first depends on it.Companion PR
opensearch-project/OpenSearch#21550 — onboards PPL
rexto DataFusion via the analytics-engine path. Without this PR's fixes, every rex query through/_analytics/pplNPEs at parse time and never reaches the planner.Test results
CalciteRexCommandITthrough the analytics-engine route (every PPL query forced through/_analytics/pplviatests.analytics.force_routing=true):AstBuilder.visitRexCommand.testRexMaxMatchConfigurableLimitfails withexpected:<5> but was:<10>.testRexMaxMatch*variants honor the cluster setting.Signed-off-by: Jialiang Liang jiallian@amazon.com