Bridge PPL_REX_MAX_MATCH_LIMIT into UnifiedQueryContext on the unified query path #5418

Open
RyanL1997 wants to merge 3 commits into opensearch-project:feature/mustang-ppl-integration from RyanL1997:mustang-rex-unified-default

Collaborator

@RyanL1997 RyanL1997 commented May 7, 2026

Description

The PPL rex command's AstBuilder reads Settings.Key.PPL_REX_MAX_MATCH_LIMIT unconditionally and unboxes the result to int. On the unified query path, UnifiedQueryContext builds its Settings map with only a small whitelist of planning-required keys; for any unregistered key, getSettingValue returns null and the auto-unbox NPEs the planner before any operator-level capability check runs. Every rex query through /_analytics/ppl hits this NPE today.
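
The failure described in this paragraph is plain Java auto-unboxing of a `null`. A minimal sketch, assuming a map-backed settings object with a generic getter; `UnboxNpeSketch`, its methods, and the key strings are hypothetical stand-ins, not the plugin's classes:

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for a settings object backed by a whitelist map. */
final class UnboxNpeSketch {
  private final Map<String, Object> values = new HashMap<>();

  UnboxNpeSketch register(String key, Object value) {
    values.put(key, value);
    return this;
  }

  /** Returns null for any key that was never registered. */
  @SuppressWarnings("unchecked")
  <T> T getSettingValue(String key) {
    return (T) values.get(key);
  }

  /** Mirrors an unconditional read: assigning to int forces an unboxing conversion. */
  static int readUnsafe(UnboxNpeSketch settings, String key) {
    int limit = settings.getSettingValue(key); // NullPointerException if the key is absent
    return limit;
  }

  /** Null-tolerant variant: keep the boxed value and fall back explicitly. */
  static int readSafe(UnboxNpeSketch settings, String key, int fallback) {
    Integer limit = settings.getSettingValue(key);
    return limit != null ? limit : fallback;
  }

  public static void main(String[] args) {
    // Only one key is registered, so the rex key (illustrative name) is missing.
    UnboxNpeSketch settings = new UnboxNpeSketch().register("query.size_limit", 10000);
    System.out.println(readSafe(settings, "rex.max_match.limit", 10));
    try {
      readUnsafe(settings, "rex.max_match.limit");
    } catch (NullPointerException e) {
      System.out.println("NPE on auto-unbox of a missing key");
    }
  }
}
```

`readSafe` shows one null-tolerant alternative at the read site; this PR instead registers a default so the existing read path never sees `null`.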

This PR ships two changes that together let the unified path execute rex correctly:

1. Default PPL_REX_MAX_MATCH_LIMIT=10 in UnifiedQueryContext

Adds the key to the static settings map so AstBuilder.visitRexCommand no longer NPEs. The value mirrors the cluster-side default of 10 registered by OpenSearchSettings.PPL_REX_MAX_MATCH_LIMIT_SETTING, so unified-path behavior matches v2-path behavior when neither has an explicit cluster override. Mirrors the precedent Kai introduced for CALCITE_ENGINE_ENABLED in #5413.

2. Bridge live cluster settings for PPL_REX_MAX_MATCH_LIMIT only

Without this, every key in the static map resolves to its hardcoded default and _cluster/settings updates are invisible to the unified path. CalciteRexCommandIT.testRexMaxMatchConfigurableLimit exercises this: it sets the cluster-side limit to 5 and asserts that max_match=0 caps at 5, but on the unified path the static 10 keeps winning.

Adds a Builder.liveSettings(Settings) hook so the REST handler can inject the cluster's live OpenSearchSettings instance. At build() time the Builder snapshots the live value of PPL_REX_MAX_MATCH_LIMIT into the static map, overriding the hardcoded default when the operator has set a cluster value. Snapshot-at-build matches the per-HTTP-request lifecycle of UnifiedQueryContext and avoids per-call lookup overhead.

RestUnifiedQueryAction gains a pluginSettings field (the same OpenSearchSettings instance bound in the Guice module) and forwards it to the Builder in both buildContext and buildParsingContext. Both construction sites — SQLPlugin.createSqlAnalyticsRouter and TransportPPLQueryAction.<init> — are updated.

Why scoped to PPL_REX_MAX_MATCH_LIMIT only

The same architectural gap exists for every key in the static map (QUERY_SIZE_LIMIT, PPL_SUBSEARCH_MAXOUT, PPL_JOIN_SUBSEARCH_MAXOUT, CALCITE_ENGINE_ENABLED). For three of those, the static defaults are fine in practice (no test overrides them mid-run; head N covers QUERY_SIZE_LIMIT per-query). CALCITE_ENGINE_ENABLED is intentionally pinned to true for the unified path — a cluster override toggling it off would defeat the point of routing here. So this PR widens only the one key that demonstrably needs it; widening the snapshot to the rest is a future scope decision tied to whichever new IT first depends on it.

Companion PR

opensearch-project/OpenSearch#21550 — onboards PPL rex to DataFusion via the analytics-engine path. Without this PR's fixes, every rex query through /_analytics/ppl NPEs at parse time and never reaches the planner.

Test results

CalciteRexCommandIT through the analytics-engine route (every PPL query forced through /_analytics/ppl via tests.analytics.force_routing=true):

  • Without this PR: 0/18 — every test NPEs in AstBuilder.visitRexCommand.
  • With change 1 only (default 10): 17/18 — testRexMaxMatchConfigurableLimit fails with expected:<5> but was:<10>.
  • With both changes: 18/18, 100% — all testRexMaxMatch* variants honor the cluster setting.

Signed-off-by: Jialiang Liang <jiallian@amazon.com>

Contributor

github-actions Bot commented May 7, 2026

PR Reviewer Guide 🔍

(Review updated until commit aa0ae82)

Here are some key observations to aid the review process:

🧪 No relevant tests
🔒 No security concerns identified
✅ No TODO sections
🔀 No multiple PR themes
⚡ Recommended focus areas for review

Possible Issue

buildParsingContext now calls applyClusterOverrides, which reads from pluginSettings. If buildParsingContext is invoked before pluginSettings is initialized (e.g., during early initialization or from a code path that doesn't set it), accessing pluginSettings.getSettingValue will throw a NullPointerException. The old implementation was static and had no such dependency.

private UnifiedQueryContext buildParsingContext(QueryType queryType) {
  return applyClusterOverrides(UnifiedQueryContext.builder().language(queryType)).build();
}

Contributor

github-actions Bot commented May 7, 2026

PR Code Suggestions ✨

Latest suggestions up to aa0ae82

Explore these optional code suggestions:

Category | Suggestion | Impact
Possible issue
Validate type before using setting value

The method retrieves rexLimit as Object but doesn't validate its type before passing
it to builder.setting(). If pluginSettings returns an unexpected type (e.g., String
instead of Integer), this could cause runtime failures downstream when the value is
unboxed. Add type validation or casting to ensure rexLimit is an Integer before
using it.

plugin/src/main/java/org/opensearch/sql/plugin/rest/RestUnifiedQueryAction.java [184-194]

 private UnifiedQueryContext.Builder applyClusterOverrides(UnifiedQueryContext.Builder builder) {
   Object rexLimit =
       pluginSettings.getSettingValue(
           org.opensearch.sql.common.setting.Settings.Key.PPL_REX_MAX_MATCH_LIMIT);
-  if (rexLimit != null) {
+  if (rexLimit instanceof Integer) {
     builder.setting(
         org.opensearch.sql.common.setting.Settings.Key.PPL_REX_MAX_MATCH_LIMIT.getKeyValue(),
         rexLimit);
   }
   return builder;
 }
Suggestion importance[1-10]: 7

Why: The suggestion correctly identifies a potential type safety issue where rexLimit is retrieved as Object and could be an unexpected type. However, the improved code changes the behavior from passing through any non-null value to only passing Integer instances, which may be overly restrictive if the setting system guarantees type consistency. The score reflects a valid defensive programming improvement with moderate impact.

Medium
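
The trade-off behind this suggestion can be seen in isolation. The sketch below is hypothetical (the class, method names, and map-backed store are illustrative assumptions, not code from the PR): a `!= null` guard forwards a wrongly typed value and fails later at the read site, while an `instanceof Integer` guard silently keeps the default.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch comparing a null-only guard with a type guard. */
final class TypeGuardSketch {
  static final String REX_LIMIT_KEY = "plugins.ppl.rex.max_match.limit";

  /** Fresh settings map holding only the static default of 10. */
  static Map<String, Object> defaults() {
    Map<String, Object> settings = new HashMap<>();
    settings.put(REX_LIMIT_KEY, 10);
    return settings;
  }

  /** Null-only guard: any non-null value overwrites the default. */
  static void applyIfNonNull(Map<String, Object> settings, Object live) {
    if (live != null) {
      settings.put(REX_LIMIT_KEY, live);
    }
  }

  /** Type guard: only an Integer overwrites the default; other types are ignored. */
  static void applyIfInteger(Map<String, Object> settings, Object live) {
    if (live instanceof Integer) {
      settings.put(REX_LIMIT_KEY, live);
    }
  }

  /** Reads the value the way a consumer expecting an int would. */
  static int readLimit(Map<String, Object> settings) {
    return (Integer) settings.get(REX_LIMIT_KEY); // ClassCastException for a Long
  }

  public static void main(String[] args) {
    Map<String, Object> guarded = defaults();
    applyIfInteger(guarded, 5L); // the Long is ignored, so the default survives
    System.out.println(readLimit(guarded));

    Map<String, Object> unguarded = defaults();
    applyIfNonNull(unguarded, 5L); // the Long is stored as-is
    try {
      readLimit(unguarded);
    } catch (ClassCastException e) {
      System.out.println("ClassCastException when reading a Long as Integer");
    }
  }
}
```

Neither guard reports the mismatch to the operator; which failure mode is preferable depends on whether the settings layer already guarantees the value's type.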

Previous suggestions

Suggestions up to commit c655e4a
Category | Suggestion | Impact
Possible issue
Add type safety for setting value

The code retrieves liveRexLimit as Object but PPL_REX_MAX_MATCH_LIMIT expects an
Integer. If the settings return a different numeric type (e.g., Long), this could
cause a ClassCastException when getSettingValue unboxes to int in
AstBuilder.visitRexCommand. Add type validation or safe casting to ensure the value
is an Integer before inserting it into defaults.

api/src/main/java/org/opensearch/sql/api/UnifiedQueryContext.java [277-282]

 if (settings != null) {
   Object liveRexLimit = settings.getSettingValue(PPL_REX_MAX_MATCH_LIMIT);
-  if (liveRexLimit != null) {
+  if (liveRexLimit instanceof Integer) {
     defaults.put(PPL_REX_MAX_MATCH_LIMIT, liveRexLimit);
   }
 }
Suggestion importance[1-10]: 7

Why: The suggestion correctly identifies a potential type safety issue where liveRexLimit is retrieved as Object and could cause a ClassCastException if it's not an Integer. However, this is primarily a defensive programming improvement rather than a critical bug, as the settings system likely returns the correct type. The score reflects its value as a safety enhancement.

Medium
Suggestions up to commit 6d0dc2c
Category | Suggestion | Impact
Possible issue
Add type validation for setting value

The code retrieves liveRexLimit as Object but doesn't validate its type before
inserting into the settings map. If getSettingValue returns an unexpected type
(e.g., String instead of Integer), this could cause runtime failures when the value
is later unboxed to int in AstBuilder.visitRexCommand. Add type validation or safe
casting to ensure the value is an Integer before insertion.

api/src/main/java/org/opensearch/sql/api/UnifiedQueryContext.java [261-266]

 if (liveSettings != null) {
   Object liveRexLimit = liveSettings.getSettingValue(PPL_REX_MAX_MATCH_LIMIT);
-  if (liveRexLimit != null) {
+  if (liveRexLimit instanceof Integer) {
     settings.put(PPL_REX_MAX_MATCH_LIMIT, liveRexLimit);
   }
 }
Suggestion importance[1-10]: 7

Why: The suggestion correctly identifies a potential type safety issue where liveRexLimit is retrieved as Object and could be an unexpected type. Adding instanceof Integer check prevents runtime failures during unboxing. However, the PR context shows this is a controlled setting value that should already be typed correctly, making this a defensive improvement rather than fixing a critical bug.

Medium
General
Validate pluginSettings parameter for null

The constructor accepts pluginSettings but doesn't validate it for null. Since
buildParsingContext and buildContext methods call liveSettings(pluginSettings)
without null checks, a null pluginSettings parameter could lead to unexpected
behavior. Add null validation in the constructor to fail fast with a clear error
message.

plugin/src/main/java/org/opensearch/sql/plugin/rest/RestUnifiedQueryAction.java [64-73]

 public RestUnifiedQueryAction(
     NodeClient client,
     ClusterService clusterService,
     QueryPlanExecutor<RelNode, Iterable<Object[]>> planExecutor,
     org.opensearch.sql.common.setting.Settings pluginSettings) {
   this.client = client;
   this.clusterService = clusterService;
   this.analyticsEngine = new AnalyticsExecutionEngine(planExecutor);
-  this.pluginSettings = pluginSettings;
+  this.pluginSettings = Objects.requireNonNull(pluginSettings, "pluginSettings cannot be null");
 }
Suggestion importance[1-10]: 6

Why: The suggestion adds defensive null checking for pluginSettings parameter. While this improves robustness, the PR shows pluginSettings is always instantiated before being passed to the constructor (in SQLPlugin.java and TransportPPLQueryAction.java), making null unlikely. The null check in build() method (line 261) already handles null liveSettings gracefully, so this is a minor defensive improvement.

Low
Suggestions up to commit e46eaa0
Category | Suggestion | Impact
Possible issue
Add type validation before storing setting

The code retrieves liveRexLimit as Object and puts it directly into the settings map
without type validation. If getSettingValue returns an unexpected type (e.g., String
instead of Integer), this could cause runtime ClassCastException when the value is
later unboxed to int in AstBuilder.visitRexCommand.

api/src/main/java/org/opensearch/sql/api/UnifiedQueryContext.java [251-256]

 if (liveSettings != null) {
   Object liveRexLimit = liveSettings.getSettingValue(PPL_REX_MAX_MATCH_LIMIT);
-  if (liveRexLimit != null) {
+  if (liveRexLimit instanceof Integer) {
     settings.put(PPL_REX_MAX_MATCH_LIMIT, liveRexLimit);
   }
 }
Suggestion importance[1-10]: 7

Why: Valid concern about type safety. The code retrieves liveRexLimit as Object and stores it without validation. Adding an instanceof Integer check prevents potential ClassCastException when the value is later unboxed to int in AstBuilder.visitRexCommand. However, this is defensive programming rather than fixing a critical bug, as the Settings implementation should return the correct type.

Medium
Suggestions up to commit 62d617d
Category | Suggestion | Impact
Possible issue
Validate type before storing setting value

The code retrieves liveRexLimit as Object but doesn't validate its type before
putting it into the settings map. If getSettingValue returns an unexpected type,
downstream code expecting an int (as mentioned in the javadoc about unboxing) could
fail with a ClassCastException.

api/src/main/java/org/opensearch/sql/api/UnifiedQueryContext.java [251-256]

 if (liveSettings != null) {
   Object liveRexLimit = liveSettings.getSettingValue(PPL_REX_MAX_MATCH_LIMIT);
-  if (liveRexLimit != null) {
+  if (liveRexLimit instanceof Integer) {
     settings.put(PPL_REX_MAX_MATCH_LIMIT, liveRexLimit);
   }
 }
Suggestion importance[1-10]: 7

Why: The suggestion correctly identifies a potential type safety issue where liveRexLimit is retrieved as Object without type validation. While the javadoc mentions unboxing to int, adding an instanceof Integer check would prevent potential ClassCastException downstream. However, this is defensive programming rather than fixing a critical bug, as the setting system likely returns the correct type.

Medium
Suggestions up to commit 0f1c8ce
Category | Suggestion | Impact
General
Extract hardcoded default to constant

The hardcoded value 10 for PPL_REX_MAX_MATCH_LIMIT creates a maintenance burden and
potential inconsistency if the default changes elsewhere. Consider extracting this
default value to a constant in SysLimit or the settings definition class, similar to
how other limits are accessed via SysLimit.DEFAULT.

api/src/main/java/org/opensearch/sql/api/UnifiedQueryContext.java [135-142]

 private final Map<Settings.Key, Object> settings =
     new HashMap<Settings.Key, Object>(
         Map.of(
             QUERY_SIZE_LIMIT, SysLimit.DEFAULT.querySizeLimit(),
             PPL_SUBSEARCH_MAXOUT, SysLimit.DEFAULT.subsearchLimit(),
             PPL_JOIN_SUBSEARCH_MAXOUT, SysLimit.DEFAULT.joinSubsearchLimit(),
             CALCITE_ENGINE_ENABLED, true,
-            PPL_REX_MAX_MATCH_LIMIT, 10));
+            PPL_REX_MAX_MATCH_LIMIT, SysLimit.DEFAULT.rexMaxMatchLimit()));
Suggestion importance[1-10]: 7

Why: The suggestion correctly identifies that hardcoding 10 for PPL_REX_MAX_MATCH_LIMIT creates maintenance burden. Extracting to a constant like SysLimit.DEFAULT.rexMaxMatchLimit() would improve consistency with other limits and make future changes easier. However, this is a code quality improvement rather than a critical bug fix.

Medium

Contributor

github-actions Bot commented May 7, 2026

Persistent review updated to latest commit 0f1c8ce

RyanL1997 added a commit to RyanL1997/OpenSearch that referenced this pull request May 7, 2026
…dge-only

Onboards the PPL `rex` command's `mode=sed` surface — the part that lowers to
standard Calcite library operators and bridges through Substrait to DataFusion's
native UDFs. Three sed sub-variants covered:

  * `rex field=f mode=sed "s/old/new/"` (no flags) → SqlLibraryOperators.REGEXP_REPLACE_3
    (already mapped via the PPL `replace` onboarding from opensearch-project#21527 — no-op here).

  * `rex field=f mode=sed "s/old/new/g"` / `/i` / `/gi` → SqlLibraryOperators.REGEXP_REPLACE_PG_4
    (4-arg with flags string). New bridge in this PR. DataFusion's regexp_replace
    natively accepts 4-arg `(str, pat, repl, flags)` per its substrait UDF binding.

  * `rex field=f mode=sed "y/from/to/"` (transliteration) → SqlLibraryOperators.TRANSLATE3.
    New bridge in this PR. Resolves to DataFusion's `translate` UDF
    (datafusion-functions/src/unicode/translate.rs).
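
As a rough illustration of the mapping in the list above, the following sketch classifies a sed expression by its trailing flags. `SedVariantSketch` and its naive slash-based parsing are assumptions for illustration only; the real lowering is not shown in this commit message. The enum names echo the Calcite operator names quoted above, plus `REGEXP_REPLACE_5` for the occurrence-flag case that this message lists as out of scope.

```java
/** Hypothetical classifier for the sed sub-variants named in this commit message. */
final class SedVariantSketch {
  /** Operator names as quoted in the commit message. */
  enum Target { REGEXP_REPLACE_3, REGEXP_REPLACE_PG_4, TRANSLATE3, REGEXP_REPLACE_5 }

  /**
   * Naive classification: ignores escaped slashes and assumes the expression
   * ends with a slash followed by optional flags.
   */
  static Target classify(String sedExpr) {
    if (sedExpr.startsWith("y/")) {
      return Target.TRANSLATE3; // transliteration
    }
    if (!sedExpr.startsWith("s/")) {
      throw new IllegalArgumentException("not a sed expression: " + sedExpr);
    }
    String flags = sedExpr.substring(sedExpr.lastIndexOf('/') + 1);
    if (flags.isEmpty()) {
      return Target.REGEXP_REPLACE_3; // no flags: 3-arg form
    }
    if (flags.chars().allMatch(Character::isDigit)) {
      return Target.REGEXP_REPLACE_5; // occurrence flag: 5-arg form (deferred)
    }
    if (flags.chars().allMatch(c -> c == 'g' || c == 'i')) {
      return Target.REGEXP_REPLACE_PG_4; // g, i, gi: 4-arg form with a flags string
    }
    throw new IllegalArgumentException("unsupported flags: " + flags);
  }

  public static void main(String[] args) {
    System.out.println(classify("s/old/new/"));
    System.out.println(classify("s/old/new/gi"));
    System.out.println(classify("y/from/to/"));
  }
}
```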

## Why an adapter extension is necessary

The 4-arg `REGEXP_REPLACE_PG_4` carries the same Java-regex syntax baggage as the
3-arg form: `\Q…\E` quoted-literal blocks (Rust regex rejects them) and bare `$N`
backreferences in the replacement (Rust's identifier-greedy parser
mis-resolves them). RegexpReplaceAdapter, introduced for the 3-arg form in
opensearch-project#21527, is extended here to recognize 3 OR 4 operands. Pattern is at position 1
and replacement at position 2 in both signatures — the rewrite logic doesn't
change. Operands beyond position 2 (the flags string in the 4-arg form) pass
through verbatim. Two new RegexpReplaceAdapterTests cover the 4-arg path.

`TRANSLATE3` doesn't need an adapter — its arguments are character classes, not
regex syntax.
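
To make the two rewrites concrete, here is a minimal sketch of the kind of transformation described, not the actual `RegexpReplaceAdapter` code. The class and method names are hypothetical, only terminated `\Q…\E` blocks are handled, and rewriting `$N` to `${N}` is an assumption about how the backreference ambiguity could be avoided on the Rust side.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch of Java-regex to Rust-regex syntax rewrites. */
final class RegexRewriteSketch {
  // Matches a terminated \Q...\E block; group 1 is the quoted literal.
  private static final Pattern QUOTED = Pattern.compile("\\\\Q(.*?)\\\\E", Pattern.DOTALL);
  // Matches a bare numeric group reference such as $1 in a replacement string.
  private static final Pattern BARE_GROUP_REF = Pattern.compile("\\$(\\d+)");
  // Characters escaped so that a quoted literal matches itself.
  private static final String META = "\\.^$|?*+()[]{}";

  /** Backslash-escapes every metacharacter in a literal. */
  static String escapeLiteral(String literal) {
    StringBuilder sb = new StringBuilder();
    for (char c : literal.toCharArray()) {
      if (META.indexOf(c) >= 0) {
        sb.append('\\');
      }
      sb.append(c);
    }
    return sb.toString();
  }

  /** Replaces each \Q...\E block in the pattern with its escaped literal. */
  static String unquote(String pattern) {
    Matcher m = QUOTED.matcher(pattern);
    StringBuilder out = new StringBuilder();
    while (m.find()) {
      // quoteReplacement keeps backslashes and dollars in the literal intact.
      m.appendReplacement(out, Matcher.quoteReplacement(escapeLiteral(m.group(1))));
    }
    m.appendTail(out);
    return out.toString();
  }

  /** Rewrites $N to ${N} so trailing identifier characters are not absorbed. */
  static String braceGroupRefs(String replacement) {
    // In the replacement string, \$ emits a literal dollar and $1 is the captured digits.
    return BARE_GROUP_REF.matcher(replacement).replaceAll("\\${$1}");
  }

  public static void main(String[] args) {
    System.out.println(unquote("a\\Q.+\\Eb"));
    System.out.println(braceGroupRefs("$1_x"));
  }
}
```

Both helpers leave operands other than the pattern and replacement untouched, which matches the note that the flags string passes through verbatim. The sketch does not handle an escaped `\$` in the replacement.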

## Out of scope (deferred to Part 2)

  * Rex extract mode (`rex field=f "(?<g>...)"`) — uses the SQL plugin's custom
    Java UDFs `REX_EXTRACT`, `REX_EXTRACT_MULTI`, `REX_OFFSET`, which have no
    native DataFusion equivalent. Slated for a follow-up PR that adds Rust-side
    UDF implementations, similar to the convert_tz precedent (opensearch-project#21476).

  * Sed with occurrence flag (`s/.../.../<N>`) — emits 5-arg
    `REGEXP_REPLACE_5`, which DataFusion's native `regexp_replace` does not
    support (max 4 args). Also Part 2.

## Test results

  * `RegexpReplaceAdapterTests` — 21/21 (19 from opensearch-project#21527 + 2 new for the 4-arg path).
  * `RexCommandIT` (new self-contained QA IT, calcs dataset) — 9/9. Covers all sed
    sub-variants: literal (no flags), `/g` global, `/i` case-insensitive, `/gi`
    combined, backreferences via `$N`, transliteration `y/from/to/` and
    no-match passthrough.
  * `./gradlew check -p sandbox -Dsandbox.enabled=true` — green.

## Companion PR

The unified-path NPE caused by a missing PPL_REX_MAX_MATCH_LIMIT default is fixed
in opensearch-project/sql#5418 — required for any rex query (sed or extract) to
reach the planner via /_analytics/ppl. This PR's Test results assume opensearch-project#5418 is
applied. Pre-fix: every query NPEs in `AstBuilder.visitRexCommand`. Post-fix:
9/9 RexCommandIT pass.

Signed-off-by: Jialiang Liang <jiallian@amazon.com>
RyanL1997 added a commit to RyanL1997/OpenSearch that referenced this pull request May 8, 2026
Contributor

github-actions Bot commented May 8, 2026

Persistent review updated to latest commit 62d617d

@RyanL1997 RyanL1997 changed the title from "Default plugins.ppl.rex.max_match.limit=10 on the unified query path" to "Bridge PPL_REX_MAX_MATCH_LIMIT into UnifiedQueryContext on the unified query path" May 8, 2026
Contributor

github-actions Bot commented May 8, 2026

Persistent review updated to latest commit e46eaa0

The PPL `rex` command's AstBuilder reads `Settings.Key.PPL_REX_MAX_MATCH_LIMIT`
unconditionally and unboxes the result to `int`:

  int maxMatchLimit =
      (settings != null) ? settings.getSettingValue(...) : 10;

The `(settings != null)` guard only protects against the Settings instance
being null — not against `getSettingValue` returning null for a key that the
caller never registered. On the unified query path, `UnifiedQueryContext`
builds its `Settings` map with only a small whitelist of keys (e.g.
`QUERY_SIZE_LIMIT`, `CALCITE_ENGINE_ENABLED` per opensearch-project#5413). For any unregistered
key, `getSettingValue` returns null, and the auto-unbox NPEs the planner
before any operator-level capability check runs. Every `rex` query through
`/_analytics/ppl` (the analytics-engine route's REST front-end) hits this
NPE today.

Default `PPL_REX_MAX_MATCH_LIMIT=10` in `buildSettings()` so unified-path
behavior matches the cluster-side default registered by
`OpenSearchSettings.PPL_REX_MAX_MATCH_LIMIT_SETTING` — making the v2 path and
the analytics-engine path agree on the same fallback value when neither has
an explicit cluster override. Mirrors the precedent Kai introduced for
`CALCITE_ENGINE_ENABLED` in opensearch-project#5413.

Companion to the OpenSearch core PR onboarding PPL `rex mode=sed` to the
analytics-engine route via DataFusion (Part 1 — sed-mode bridge only;
extract-mode Rust UDFs deferred to Part 2).

Signed-off-by: Jialiang Liang <jiallian@amazon.com>
@RyanL1997 RyanL1997 force-pushed the mustang-rex-unified-default branch from e46eaa0 to 6d0dc2c Compare May 8, 2026 02:51
Contributor

github-actions Bot commented May 8, 2026

Persistent review updated to latest commit 6d0dc2c

Comment thread api/src/main/java/org/opensearch/sql/api/UnifiedQueryContext.java Outdated
Comment thread api/src/main/java/org/opensearch/sql/api/UnifiedQueryContext.java Outdated
@RyanL1997 RyanL1997 force-pushed the mustang-rex-unified-default branch from 6d0dc2c to c655e4a Compare May 8, 2026 03:31
@RyanL1997 RyanL1997 added the enhancement (New feature or request) label and removed the feature label May 8, 2026
Contributor

github-actions Bot commented May 8, 2026

Persistent review updated to latest commit c655e4a

Comment on lines +277 to +282
if (settings != null) {
  Object liveRexLimit = settings.getSettingValue(PPL_REX_MAX_MATCH_LIMIT);
  if (liveRexLimit != null) {
    defaults.put(PPL_REX_MAX_MATCH_LIMIT, liveRexLimit);
  }
}
Collaborator

Is this required since PPL_REX_MAX_MATCH_LIMIT already in the defaults?

Collaborator Author

Good catch — removed in latest push. The REST handler now writes through the existing setting(String, Object) API directly, which overwrites the default-10 entry in the same map when the operator has a cluster value.

* @param settings the cluster's live {@code OpenSearchSettings} instance
* @return this builder instance
*/
public Builder settings(Settings settings) {
Collaborator

If possible, can we avoid this new API because of Settings in argument? We want to keep it internal and may decouple from it later.

Collaborator Author

Agreed — dropped settings(Settings) entirely. REST handler routes the cluster value through the existing setting(String, Object) API:

  Object rexLimit = pluginSettings.getSettingValue(PPL_REX_MAX_MATCH_LIMIT);
  if (rexLimit != null) {
    builder.setting(PPL_REX_MAX_MATCH_LIMIT.getKeyValue(), rexLimit);
  }

UnifiedQueryContext stays decoupled from any specific Settings impl. Also makes @penghuo's earlier "merge logic in buildSettings()" comment moot since there's no merge logic anymore.
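
The write-through behavior described here can be sketched with a small stand-in builder. `ContextBuilderSketch` and the `Function`-based lookup are hypothetical; only the key string and the `setting(String, Object)` shape come from this thread.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical builder: a static default that setting(String, Object) can overwrite. */
final class ContextBuilderSketch {
  static final String REX_LIMIT_KEY = "plugins.ppl.rex.max_match.limit";

  private final Map<String, Object> settings = new HashMap<>();

  ContextBuilderSketch() {
    settings.put(REX_LIMIT_KEY, 10); // static default, mirrors the cluster-side default
  }

  /** Overwrites any existing entry for the key, including the static default. */
  ContextBuilderSketch setting(String key, Object value) {
    settings.put(key, value);
    return this;
  }

  /** Snapshots the settings; later cluster updates do not affect the result. */
  Map<String, Object> build() {
    return Map.copyOf(settings);
  }

  /**
   * Stand-in for the REST handler step: read the live value through a lookup
   * (an assumption replacing the real settings object) and forward it if set.
   */
  static ContextBuilderSketch applyClusterOverrides(
      ContextBuilderSketch builder, Function<String, Object> liveLookup) {
    Object rexLimit = liveLookup.apply(REX_LIMIT_KEY);
    if (rexLimit != null) {
      builder.setting(REX_LIMIT_KEY, rexLimit);
    }
    return builder;
  }

  public static void main(String[] args) {
    Map<String, Object> noOverride =
        applyClusterOverrides(new ContextBuilderSketch(), key -> null).build();
    Map<String, Object> withOverride =
        applyClusterOverrides(new ContextBuilderSketch(), key -> 5).build();
    System.out.println(noOverride.get(REX_LIMIT_KEY));
    System.out.println(withOverride.get(REX_LIMIT_KEY));
  }
}
```

Because the builder only sees a string key and an `Object`, it needs no reference to any particular settings implementation, which is the decoupling the reviewer asked for.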

}

@Test
public void testLiveSettingsAbsentFallsBackToStaticDefault() {
Collaborator

Test default value in existing testContextCreationWithDefaults?

Collaborator Author

@RyanL1997 RyanL1997 May 8, 2026

Done — folded into existing tests:

  • testContextCreationWithDefaults now also asserts PPL_REX_MAX_MATCH_LIMIT defaults to 10.
  • testContextCreationWithCustomConfig now also asserts that .setting("plugins.ppl.rex.max_match.limit", 5) resolves to 5.

Dropped the four standalone tests — they exercised the removed API.

RyanL1997 added 2 commits May 7, 2026 22:23
… setting() API

The previous commit defaulted `PPL_REX_MAX_MATCH_LIMIT=10` in
`UnifiedQueryContext.Builder.settings` to fix the NPE in
`AstBuilder.visitRexCommand` on the unified path. The default is correct, but
it doesn't respect mid-run cluster overrides — every key in the static map
returns its hardcoded value regardless of `_cluster/settings` updates. This
breaks `CalciteRexCommandIT.testRexMaxMatchConfigurableLimit`, which sets the
cluster-side limit to 5 and asserts `max_match=0` caps at 5; on the unified
path it stayed at 10.

Rather than introducing a new `Settings`-typed Builder API, the REST handler
reads the live cluster value itself and routes it through the existing
`Builder.setting(String, Object)` method — keeping `UnifiedQueryContext`
decoupled from any specific `Settings` implementation:

  RestUnifiedQueryAction.applyClusterOverrides(builder)
    └── pluginSettings.getSettingValue(PPL_REX_MAX_MATCH_LIMIT)
          └── builder.setting("plugins.ppl.rex.max_match.limit", value)

`RestUnifiedQueryAction` gains a `pluginSettings` field (the same
`OpenSearchSettings` instance bound in the Guice module). Both
construction sites — `SQLPlugin.createSqlAnalyticsRouter` and
`TransportPPLQueryAction.<init>` — pass it through.

`RestUnifiedQueryActionTest` updated to pass a `mock(Settings.class)` for the
new constructor parameter.

## Why scoped to PPL_REX_MAX_MATCH_LIMIT only

The same architectural gap exists for every key in the static defaults map
(`QUERY_SIZE_LIMIT`, `PPL_SUBSEARCH_MAXOUT`, `PPL_JOIN_SUBSEARCH_MAXOUT`,
`CALCITE_ENGINE_ENABLED`). For three of those, the static defaults are fine
in practice (no test overrides them mid-run; `head N` covers
`QUERY_SIZE_LIMIT` per-query). `CALCITE_ENGINE_ENABLED` is intentionally
pinned to `true` for the unified path. So this PR widens only the one key
that demonstrably needs it; widening the snapshot to the rest is a future
scope decision tied to whichever new IT first depends on it.

Signed-off-by: Jialiang Liang <jiallian@amazon.com>

Pins two behaviors the previous commits introduced:

  * `testContextCreationWithDefaults` now asserts that
    `PPL_REX_MAX_MATCH_LIMIT` resolves to its static default of 10 — the
    fallback value `AstBuilder.visitRexCommand` reads when no cluster-side
    override is present.

  * `testContextCreationWithCustomConfig` now asserts that
    `setting("plugins.ppl.rex.max_match.limit", 5)` reaches
    `getSettingValue(PPL_REX_MAX_MATCH_LIMIT)` — the path the REST handler
    uses to forward an operator-configured cluster value into the
    unified-path settings map.

Folds the two assertions into the existing default / custom-config tests
rather than adding new test methods, per review feedback.

Signed-off-by: Jialiang Liang <jiallian@amazon.com>
@RyanL1997 RyanL1997 force-pushed the mustang-rex-unified-default branch from c655e4a to aa0ae82 Compare May 8, 2026 05:23
Contributor

github-actions Bot commented May 8, 2026

Persistent review updated to latest commit aa0ae82

RyanL1997 added a commit to RyanL1997/OpenSearch that referenced this pull request May 8, 2026
Labels

enhancement (New feature or request), PPL (Piped processing language)

4 participants