Query Information
PPL Command/Query:
# This succeeds
source=ocsf-1.1.0-4003 | dedup accountid, activity_name
# This fails with an error
source=ocsf-1.1.0-4003 | dedup accountid, src_endpoint.ip
Expected Result:
The query should return deduplicated results based on both accountid and src_endpoint.ip fields, similar to how the first query successfully deduplicates on flat fields.
Actual Result:
The query fails with an internal server error:
{
"error": {
"reason": "There was internal problem at backend",
"details": "java.sql.SQLException: Error while preparing plan ...",
"type": "RuntimeException"
},
"status": 500
}
Root Cause (from logs):
Caused by: org.codehaus.commons.compiler.CompileException: Line 3, Column 26: ';' expected instead of '.'
Dataset Information
Dataset/Schema Type
Index Mapping
{
"mappings": {
"properties": {
"account_id": {
"type": "keyword"
},
"endpoint": {
"properties": {
"ip": {
"type": "ip"
},
"port": {
"type": "integer"
}
}
}
}
}
}
Sample Data
{"account_id":"acc1","endpoint":{"ip":"10.0.0.1","port":8080}}
{"account_id":"acc1","endpoint":{"ip":"10.0.0.1","port":8081}}
{"account_id":"acc1","endpoint":{"ip":"10.0.0.2","port":8080}}
{"account_id":"acc2","endpoint":{"ip":"10.0.0.1","port":8080}}
Bug Description
Issue Summary:
The dedup command fails with a Java error when deduplicating on struct/object field paths (e.g., endpoint.ip). The error occurs during physical plan generation when Calcite attempts to compile Java code for the window function used internally by dedup.
Steps to Reproduce:
# 1. Create test index with struct field
curl -X PUT "http://localhost:9200/test-dedup-struct" -H 'Content-Type: application/json' -d '{
"mappings": {
"properties": {
"account_id": {"type": "keyword"},
"endpoint": {
"properties": {
"ip": {"type": "ip"},
"port": {"type": "integer"}
}
}
}
}
}'
# 2. Insert test data
curl -X POST "http://localhost:9200/test-dedup-struct/_bulk" -H 'Content-Type: application/json' -d '
{"index":{}}
{"account_id":"acc1","endpoint":{"ip":"10.0.0.1","port":8080}}
{"index":{}}
{"account_id":"acc1","endpoint":{"ip":"10.0.0.1","port":8081}}
{"index":{}}
{"account_id":"acc1","endpoint":{"ip":"10.0.0.2","port":8080}}
{"index":{}}
{"account_id":"acc2","endpoint":{"ip":"10.0.0.1","port":8080}}
'
curl -X POST "http://localhost:9200/test-dedup-struct/_refresh"
# 3. Test dedup with flat field (works)
curl -X POST "http://localhost:9200/_plugins/_ppl" -H 'Content-Type: application/json' -d '{
"query": "source=test-dedup-struct | dedup account_id"
}'
# Returns 2 results
# 4. Test dedup with struct field (fails)
curl -X POST "http://localhost:9200/_plugins/_ppl" -H 'Content-Type: application/json' -d '{
"query": "source=test-dedup-struct | dedup account_id, endpoint.ip"
}'
# Returns compilation error
Impact:
This bug prevents users from deduplicating on any struct/object field paths, which is critical for OCSF, OTEL, and other structured observability data that heavily uses nested structures.
Environment Information
OpenSearch Version:
OpenSearch 3.4.0-SNAPSHOT (also affects 3.3.0)
Additional Details:
- Affects Calcite query engine (V3)
- Issue occurs with all struct field types (ip, keyword, integer, etc.)
- Not specific to nested type - affects regular object/struct fields
Root Cause Analysis
Tentative Root Cause:
Note: This is a preliminary analysis and requires further investigation.
The dedup command internally uses Apache Calcite's window functions (ROW_NUMBER() OVER PARTITION BY) to implement deduplication logic. When struct field paths like endpoint.ip are used in the PARTITION BY clause, Calcite generates Java code that uses the field name directly as a variable identifier.
Code Flow:
-
Logical Plan Generation (CalciteRelNodeVisitor.java:1409-1428):
List<RexNode> dedupeFields =
node.getFields().stream().map(f -> rexVisitor.analyze(f, context)).toList();
// ...
context.relBuilder.aggregateCall(SqlStdOperatorTable.ROW_NUMBER)
.over()
.partitionBy(dedupeFields) // Contains RexNode for "endpoint.ip"
.orderBy(dedupeFields)
-
Field Resolution (QualifiedNameResolver.java:250-254):
- Struct field paths are resolved using
INTERNAL_ITEM function
- Creates field references like
$1 (for endpoint.ip)
- Field names retain dots in metadata
-
Physical Plan Generation:
- Calcite generates Java code for window function execution
- Uses field names directly from row type metadata
- Generates invalid Java:
int endpoint.ip = ...
-
Compilation Failure:
- Janino compiler sees the dot as field access operator
- Expects semicolon, finds dot instead
- Throws:
';' expected instead of '.'
Verification:
The logical plan shows correct structure:
LogicalProject(...)
LogicalFilter(condition=[<=($10, 1)])
LogicalProject(..., _row_number_dedup_=[ROW_NUMBER() OVER (PARTITION BY $3, $1 ORDER BY $3, $1)])
LogicalFilter(condition=[AND(IS NOT NULL($3), IS NOT NULL($1))])
CalciteLogicalIndexScan(table=[[OpenSearch, test-dedup-struct]])
But physical plan generation fails because $1 maps to field name endpoint.ip which is invalid in generated Java code.
Workaround
No workaround available. Users cannot use dedup with struct field paths. The only alternative is to:
Related Issues
Query Information
PPL Command/Query:
Expected Result:
The query should return deduplicated results based on both
accountidandsrc_endpoint.ipfields, similar to how the first query successfully deduplicates on flat fields.Actual Result:
The query fails with an internal server error:
{ "error": { "reason": "There was internal problem at backend", "details": "java.sql.SQLException: Error while preparing plan ...", "type": "RuntimeException" }, "status": 500 }Root Cause (from logs):
Dataset Information
Dataset/Schema Type
Index Mapping
{ "mappings": { "properties": { "account_id": { "type": "keyword" }, "endpoint": { "properties": { "ip": { "type": "ip" }, "port": { "type": "integer" } } } } } }Sample Data
{"account_id":"acc1","endpoint":{"ip":"10.0.0.1","port":8080}} {"account_id":"acc1","endpoint":{"ip":"10.0.0.1","port":8081}} {"account_id":"acc1","endpoint":{"ip":"10.0.0.2","port":8080}} {"account_id":"acc2","endpoint":{"ip":"10.0.0.1","port":8080}}Bug Description
Issue Summary:
The
dedupcommand fails with a Java error when deduplicating on struct/object field paths (e.g.,endpoint.ip). The error occurs during physical plan generation when Calcite attempts to compile Java code for the window function used internally by dedup.Steps to Reproduce:
Impact:
This bug prevents users from deduplicating on any struct/object field paths, which is critical for OCSF, OTEL, and other structured observability data that heavily uses nested structures.
Environment Information
OpenSearch Version:
OpenSearch 3.4.0-SNAPSHOT (also affects 3.3.0)
Additional Details:
Root Cause Analysis
Tentative Root Cause:
The
dedupcommand internally uses Apache Calcite's window functions (ROW_NUMBER() OVER PARTITION BY) to implement deduplication logic. When struct field paths likeendpoint.ipare used in thePARTITION BYclause, Calcite generates Java code that uses the field name directly as a variable identifier.Code Flow:
Logical Plan Generation (
CalciteRelNodeVisitor.java:1409-1428):Field Resolution (
QualifiedNameResolver.java:250-254):INTERNAL_ITEMfunction$1(forendpoint.ip)Physical Plan Generation:
int endpoint.ip = ...Compilation Failure:
';' expected instead of '.'Verification:
The logical plan shows correct structure:
But physical plan generation fails because
$1maps to field nameendpoint.ipwhich is invalid in generated Java code.Workaround
No workaround available. Users cannot use
dedupwith struct field paths. The only alternative is to:Related Issues