Support aggregation/window commands with dynamic fields #4743
33bab3f to 6b2e491
Updated to utilize type coercion.
if (!context.fieldBuilder.isFieldSpecificType(byFieldName)) {
  throw new IllegalArgumentException(
      String.format(
          "By field `%s` needs to be specific type. Please cast explicitly.", byFieldName));
}
There was a problem hiding this comment.
Can we cast the groupBy field to string?
There was a problem hiding this comment.
I realized timechart requires a bigger change because of the type assigned to the span function, which prevents automatic type coercion from working properly.
Let me address this in a separate PR.
There was a problem hiding this comment.
Found a simpler way to solve the problem and included the change in this PR.
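For readers following the groupBy discussion, here is a minimal sketch of the idea of treating a dynamic by-field as a string before grouping. It is not the PR's implementation: the class `DynamicGroupBySketch` and its methods are hypothetical, and rows are modeled as plain maps standing in for the `_MAP` column.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; names are not taken from the PR.
final class DynamicGroupBySketch {

  /** Reads a dynamic field and renders it as a string, or returns null when the field is absent. */
  static String dynamicFieldAsString(Map<String, Object> dynamicFields, String fieldName) {
    Object value = dynamicFields.get(fieldName);
    return value == null ? null : String.valueOf(value);
  }

  /** Counts rows per group key, where the key is the dynamic field rendered as a string. */
  static Map<String, Long> countBy(List<Map<String, Object>> rows, String byField) {
    Map<String, Long> counts = new LinkedHashMap<>();
    for (Map<String, Object> dynamicFields : rows) {
      counts.merge(dynamicFieldAsString(dynamicFields, byField), 1L, Long::sum);
    }
    return counts;
  }
}
```

Because the key is always a string, the integer 200 and the string "200" fall into the same group in this sketch, which is the practical effect of casting the by-field instead of rejecting it.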
44e2d10 to 0e2d036
projectDynamicFieldAsString(node.getBinExpression(), context);
projectDynamicFieldAsString(node.getByField(), context);
There was a problem hiding this comment.
Is this required for all visitors?
There was a problem hiding this comment.
Could you add a test in CalciteDynamicFieldsTimechartIT to help clarify the corresponding logical plan / SQL?
There was a problem hiding this comment.
Added CalcitePPLDynamicFieldsTest.java for Spark SQL. Added explain verification in the IT.
Signed-off-by: Tomoyuki Morita <moritato@amazon.com>
d6acee2 to d552010
verifyLogical(root, expectedLogical);

String expectedSparkSql =
    "SELECT `id`, `name`, `_MAP`\n"
There was a problem hiding this comment.
Does the output always include the _MAP column?
@dai-chen does it work with unified PPL in Spark?
There was a problem hiding this comment.
It contains _MAP when the query does not explicitly select fields, since it should output all the dynamic fields along with the static fields. (You can refer to the test case testProjectStaticFields.)
There was a problem hiding this comment.
As I understand it, if we submit such a SQL query on an S3 table to Spark directly, the changes include at least:
- Add _MAP to the Spark table schema
- Add result-expanding logic similar to DynamicFieldsResultProcessor.expandDynamicFields()
Do you have an example of writing _MAP? I want to check whether more changes are required.
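To make the "result expanding" step concrete, here is a rough sketch of what expanding a _MAP column into top-level columns could look like. This is an assumption about the behavior, not the code of DynamicFieldsResultProcessor.expandDynamicFields(); the class `ExpandDynamicFieldsSketch` is hypothetical, and the rule that static fields win on a name collision is also an assumption.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration only; not the actual DynamicFieldsResultProcessor.
final class ExpandDynamicFieldsSketch {
  static final String MAP_COLUMN = "_MAP";

  /** Returns a copy of the row in which the entries of _MAP become top-level columns. */
  static Map<String, Object> expand(Map<String, Object> row) {
    Map<String, Object> expanded = new LinkedHashMap<>();
    Object dynamic = null;
    for (Map.Entry<String, Object> entry : row.entrySet()) {
      if (MAP_COLUMN.equals(entry.getKey())) {
        dynamic = entry.getValue(); // hold back the map column itself
      } else {
        expanded.put(entry.getKey(), entry.getValue()); // static fields are copied as-is
      }
    }
    if (dynamic instanceof Map) {
      for (Map.Entry<?, ?> entry : ((Map<?, ?>) dynamic).entrySet()) {
        // Assumption: a static field with the same name takes precedence.
        expanded.putIfAbsent(String.valueOf(entry.getKey()), entry.getValue());
      }
    }
    return expanded;
  }
}
```

A Spark-side equivalent would need the same kind of post-processing on the result set, since the generated SQL only returns _MAP as a single map-typed column.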
There was a problem hiding this comment.
@ykmr1224 I just want to make sure I’m understanding this correctly.
- Case 1: For _MAP generated from a table, do we need to update the Spark catalog to add it when permissive mode is enabled? When you say "automatically added to the table", you mean the current OpenSearch schema, right?
- Case 2: For _MAP generated dynamically by a command like spath, could you share a concrete example, including:
  - the PPL query, and
  - the Spark SQL query generated?
Since our approach is to transpile PPL into Spark SQL, I’d like to ensure that all required semantics are encoded in the SQL we generate. Otherwise, we’ll need to estimate the effort for any changes required in the Spark SQL engine.
There was a problem hiding this comment.
@dai-chen
Case 1: Yes, it is added to the OpenSearch schema (specifically to the metadata fields). I am not sure how the Spark catalog works, but I suppose we need to add _MAP to the catalog schema.
Case 2: Here is the sample SQL for the PPL query source=EMP | fields ENAME | spath input=ENAME:
SELECT `mvappend`(`ENAME`, `JSON_EXTRACT_ALL`(`ENAME`)['ENAME']) `ENAME`, `MAP_REMOVE`(`JSON_EXTRACT_ALL`(`ENAME`), ARRAY ('ENAME')) `_MAP`
FROM `scott`.`EMP`
`MAP_REMOVE`(`JSON_EXTRACT_ALL`(`ENAME`), ARRAY ('ENAME')) `_MAP` is where _MAP is assigned. (MAP_REMOVE removes keys that already exist as static fields, so they are not duplicated in _MAP.)
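The dedupe step can be pictured with a small sketch. It assumes MAP_REMOVE returns a copy of the map without the listed keys, which is how the comment above describes its purpose; the class `MapRemoveSketch` is hypothetical and is not the function's real implementation.

```java
import java.util.Collection;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration of the assumed MAP_REMOVE semantics.
final class MapRemoveSketch {

  /** Returns a copy of source without the given keys; the input map is left untouched. */
  static Map<String, Object> mapRemove(Map<String, Object> source, Collection<String> keys) {
    Map<String, Object> result = new LinkedHashMap<>(source);
    result.keySet().removeAll(keys);
    return result;
  }
}
```

In the query above, the keys to remove are the static field names (here only ENAME), so a value extracted under that name is merged into the static column through mvappend and does not appear a second time in _MAP.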
There was a problem hiding this comment.
Yes, case 1 may need some changes, so we can focus on case 2. Below is the Spark SQL query I would expect to be generated, as I understand it.
# Test data
search source=test_events;
25/11/19 11:10:06 WARN UnifiedQueryParser: PPL translated to Spark SQL:
SELECT *
FROM `spark_catalog`.`default`.`test_events`
@timestamp host packets message
2025-09-08 10:00:00 server1 60 {"category":1, "resource":"A"}
2025-09-08 10:01:00 server1 120 {"category":2, "resource":"B"}
2025-09-08 10:02:00 server1 60 {"category":3, "resource":"C"}
2025-09-08 10:02:30 server2 180 {"category":4, "resource":"D"}
# PPL query
# source=test_events | spath input=message | eval cat = abs(category) * 10
# Spark SQL query expected
spark-sql (default)>
> SELECT
> ABS(TRY_CAST(`_MAP`['category'] AS INT) * 10) AS `cat`
> FROM (
> SELECT `JSON_EXTRACT_ALL`(`message`) AS `_MAP`
> FROM `test_events`
> );
line 2:14 missing ')' at '('
cat
10
20
30
40
If this is correct, the only open question is the expansion logic in DynamicFieldsResultProcessor.expandDynamicFields().
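As a sanity check on the expected query, the expression ABS(TRY_CAST(_MAP['category'] AS INT) * 10) can be mirrored in plain Java. This is only a sketch: `TryCastSketch` is hypothetical, it assumes the extracted value arrives as either a number or a numeric string, and it covers only the property that TRY_CAST yields NULL rather than an error when the value cannot be converted.

```java
import java.util.Map;

// Hypothetical illustration of the expression used in the expected Spark SQL.
final class TryCastSketch {

  /** Converts a value to an Integer, returning null when it is absent or not an integer. */
  static Integer tryCastInt(Object value) {
    if (value == null) {
      return null;
    }
    if (value instanceof Number) {
      return ((Number) value).intValue();
    }
    try {
      return Integer.valueOf(value.toString().trim());
    } catch (NumberFormatException e) {
      return null; // a failed cast yields null, never an exception
    }
  }

  /** Mirrors ABS(TRY_CAST(_MAP['category'] AS INT) * 10), propagating null. */
  static Integer cat(Map<String, Object> dynamicFields) {
    Integer category = tryCastInt(dynamicFields.get("category"));
    return category == null ? null : Math.abs(category * 10);
  }
}
```

With the four test rows above (category 1 through 4), this yields 10, 20, 30 and 40, matching the cat column in the session output.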
JSONObject result = executeQuery(query);

assertExplainYaml(
There was a problem hiding this comment.
Can we assert only the part we're interested in?
There was a problem hiding this comment.
I added this at @penghuo's request for explain verification, and I think it is better to keep the whole plan so that we detect when the plan changes.
I will move it to a separate file once this change is merged and permissive mode is enabled in the main branch. (It is currently enabled only in integration tests, so it cannot use the same test base class.)
990346a into opensearch-project:feature/permissive
This PR is for feature branch feature/permissive
Description
Related Issues
Permissive mode RFC: #4349
Dynamic fields RFC: #4433
Check List
--signoff or -s.
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.