Improved patterns command with new algorithm#3263
Improved patterns command with new algorithm#3263LantaoJin merged 16 commits intoopensearch-project:mainfrom
Conversation
Signed-off-by: Songkan Tang <songkant@amazon.com>
|
Thanks for the initiative!! |
penghuo
left a comment
There was a problem hiding this comment.
In general,
- add user facing doc to explain new BRAIN alg with example will be helpfull.
- add UT could help reviwer understand the logic also. @dai-chen did a good example. https://github.com/opensearch-project/sql/blob/main/core/src/main/java/org/opensearch/sql/expression/window/frame/PeerRowsWindowFrame.java
- add benchmark to compare BRAIN and SIMPLE also help a lot.
| new Function( | ||
| ctx.pattern_method != null | ||
| ? ctx.pattern_method.getText().toLowerCase(Locale.ROOT) | ||
| : BuiltinFunctionName.BRAIN.name(), // By default, use new algorithm |
There was a problem hiding this comment.
Could we introduce a setting to control the default pattern algorithms?
There was a problem hiding this comment.
Ok, I see. Like a java property to specify default algorithm name? I can add that.
| ((NamedArgumentExpression) expression).getValue().valueOf().stringValue()) | ||
| .findFirst() | ||
| .orElse(""); | ||
| return new StreamPatternRowWindowFrame( |
There was a problem hiding this comment.
do we need StreamPatternRowWindowFrame? could we just use CurrentRowWindowFrame instead? patternExpression could be member of StreamPatternWindowFunction?
There was a problem hiding this comment.
Sure, I will remove the need of StreamPatternRowWindowFrame in the next revision.
| } | ||
|
|
||
| private boolean isSamePartition(ExprValue next) { | ||
| protected boolean isSamePartition(ExprValue next) { |
There was a problem hiding this comment.
revert unnecessary change.
| return value.stringValue(); | ||
| }) | ||
| .toList(); | ||
| this.preprocessedMessages.addAll(logParser.preprocessAllLogs(logMessages)); |
There was a problem hiding this comment.
BufferPatternRowsWindowFrame should have it own spec, doet it means over all rows? @dai-chen WindowFrame definition and How to use WIndowFrame should be seperate, right?
There was a problem hiding this comment.
For simplicity, I set the spec to empty partition by and empty sort by during AST tree parsing unresolved WindowFunction, which treats the window frame range is all rows on coordinator node. Because I haven't seen requirements on sorting and partitioning on other columns.
I think we can add partition by and sort by syntax if we see values in case users want to specify them. Thoughts?
There was a problem hiding this comment.
Yes, I think users are free to use pattern functions added with any window frame definition.
There was a problem hiding this comment.
Tests with non-empty partition by and non-empty sort by are still required in PR.
There was a problem hiding this comment.
Added non-empty partition by and sort by unit tests.
Signed-off-by: Songkan Tang <songkant@amazon.com>
| return value.stringValue(); | ||
| }) | ||
| .toList(); | ||
| this.preprocessedMessages.addAll(logParser.preprocessAllLogs(logMessages)); |
There was a problem hiding this comment.
Yes, I think users are free to use pattern functions added with any window frame definition.
| repository.register(brain()); | ||
| repository.register(simplePattern()); |
There was a problem hiding this comment.
I'm thinking are these new algorithm really window function? If users specify order by or partition by, does it still generate meaningful result?
| UnresolvedExpression sourceField, | ||
| String alias, | ||
| java.util.Map<String, Literal> arguments) { | ||
| return new Pattern( |
There was a problem hiding this comment.
Just wonder if pattern is mostly syntax sugar for pattern window function, is new logical operator still required?
There was a problem hiding this comment.
After the integration with LogicalWindow, I see they are quite similar. Yeah, I think Pattern operator is probably not needed.
There was a problem hiding this comment.
I don't find the Window unresolvedPlan. So replace Pattern with Window instead.
| import org.opensearch.sql.expression.window.frame.WindowFrame; | ||
|
|
||
| @EqualsAndHashCode(callSuper = true) | ||
| public class BufferPatternWindowFunction extends FunctionExpression |
There was a problem hiding this comment.
order by expression in window definition decides who're peers right? What's the order by expression for buffer (offline) pattern function? I didn't find it and may miss it in design doc:
Project[message#1, patterns_field#2]
+- Window[brain(message#1), windowsSpec(partitionBy=null, frame=PeerRowsWindowFrame)]
+- OpenSearchIndexScan
There was a problem hiding this comment.
A high-level question about the long-term approach: Currently, the pattern command always generates a new field, which offers flexibility as users can perform aggregation (stats) or other operations on it afterward. However, for large datasets, I assume most users would prefer functionality similar to CloudWatch Logs Query Syntax - Pattern.
Do you see any potential issues with the current implementation, which is based on SQL window functions combined with aggregate functions, as we scale this feature in the future?
@dai-chen It's a good question. Actually, it's one of my previous thought to use a new specific operator to directly return grouped logs with sample count or grouped samples. It's more like a compound aggregation operator with a default As to |
|
To scale the pattern function, I think other issues are pending investigation.
|
| return value.stringValue(); | ||
| }) | ||
| .toList(); | ||
| this.preprocessedMessages.addAll(logParser.preprocessAllLogs(logMessages)); |
There was a problem hiding this comment.
Tests with non-empty partition by and non-empty sort by are still required in PR.
|
|
||
| Simple Pattern | ||
| ============ | ||
| patterns [new_field=<new-field-name>] [pattern=<pattern>] <field> SIMPLE_PATTERN |
There was a problem hiding this comment.
The seems a breaking change, why not make SIMPLE_PATTERN optional for compatibility.
There was a problem hiding this comment.
It's still compatible. If we omit SIMPLE_PATTERN, it will use BRAIN method by default. Peng suggested we can add a configuration to decide which is the default behavior.
There was a problem hiding this comment.
It's still compatible. If we omit
SIMPLE_PATTERN, it will use BRAIN method by default. Peng suggested we can add a configuration to decide which is the default behavior.
We don't call it compatible if it brings breaking change with original command syntax. I doubt about using the new BRAIN algorithm as default output before it has been running stably for a while. @penghuo any thoughts here?
There was a problem hiding this comment.
Peng suggested we can add a configuration to decide which is the default behavior.
Did we add setting already?
If not, we should default to SIMPLE_PATTERN.
There was a problem hiding this comment.
Add a new setting now. It will default to SIMPLE_PATTERN. Please review the change. Let me know if it makes sense.
| patternMethod | ||
| : SIMPLE_PATTERN | ||
| | BRAIN |
There was a problem hiding this comment.
patternMethod should be added to keywordsCanBeId as well.
| patternsMethod | ||
| : PUNCT | ||
| | REGEX |
There was a problem hiding this comment.
patternsMethod should be added to keywordsCanBeId
|
@songkant-aws could u resolve conflict? |
Signed-off-by: Songkan Tang <songkant@amazon.com>
Signed-off-by: Songkan Tang <songkant@amazon.com>
Signed-off-by: Songkan Tang <songkant@amazon.com>
Signed-off-by: Songkan Tang <songkant@amazon.com>
|
@penghuo @dai-chen @LantaoJin I have addressed several comments and added a new setting |
|
@songkant-aws please fix failed UT coverage. then we are ready to merge. |
Signed-off-by: Songkan Tang <songkant@amazon.com>
Signed-off-by: Songkan Tang <songkant@amazon.com>
Signed-off-by: Songkan Tang <songkant@amazon.com>
Signed-off-by: Songkan Tang <songkant@amazon.com>
Signed-off-by: Songkan Tang <songkant@amazon.com>
|
@penghuo Fixed all of CI checks with more test coverage and cases. |
| ImmutableMap.Builder<String, Literal> builder = ImmutableMap.builder(); | ||
| List<UnresolvedExpression> unresolvedArguments = new ArrayList<>(); | ||
| unresolvedArguments.add(sourceField); | ||
| AtomicReference<String> alias = new AtomicReference<>("patterns_field"); |
There was a problem hiding this comment.
AstBuilder is created per query. No necessary to wrap with AtomicReference
There was a problem hiding this comment.
Seems like IDE has compilation error when using a temp String variable in lambda expression. So I use AtomicReference as a workaround
There was a problem hiding this comment.
Ok, I see, the String alias must be a "final" variable from the enclosing scope.
| if (thresholdPercentage < 0.0f || thresholdPercentage > 1.0f) { | ||
| throw new IllegalArgumentException("Threshold percentage must be between 0.0 and 1.0"); | ||
| } |
There was a problem hiding this comment.
What if the thresholdPercentage = 0 or thresholdPercentage = 1
There was a problem hiding this comment.
It's valid, which means 0 or the highest frequency.
| import org.opensearch.sql.expression.window.frame.WindowFrame; | ||
| import org.opensearch.sql.utils.FunctionUtils; | ||
|
|
||
| @EqualsAndHashCode(callSuper = true) |
There was a problem hiding this comment.
I override this function's toString specifically.
| import org.opensearch.sql.expression.window.frame.WindowFrame; | ||
| import org.opensearch.sql.utils.FunctionUtils; | ||
|
|
||
| @EqualsAndHashCode(callSuper = true) |
|
The backport to To backport manually, run these commands in your terminal: # Navigate to the root of your repository
cd $(git rev-parse --show-toplevel)
# Fetch latest updates from GitHub
git fetch
# Create a new working tree
git worktree add ../.worktrees/sql/backport-2.x 2.x
# Navigate to the new working tree
pushd ../.worktrees/sql/backport-2.x
# Create a new branch
git switch --create backport/backport-3263-to-2.x
# Cherry-pick the merged commit of this pull request and resolve the conflicts
git cherry-pick -x --mainline 1 44ff520f08b606257240dd009758788638f24acb
# Push it to GitHub
git push --set-upstream origin backport/backport-3263-to-2.x
# Go back to the original working tree
popd
# Delete the working tree
git worktree remove ../.worktrees/sql/backport-2.xThen, create a pull request where the |
* Improved patterns command with new algorithm Signed-off-by: Songkan Tang <songkant@amazon.com> * Minor change log parser default configurations Signed-off-by: Songkan Tang <songkant@amazon.com> * Refactor a bit and add partial unit tests Signed-off-by: Songkan Tang <songkant@amazon.com> * Add more unit tests Signed-off-by: Songkan Tang <songkant@amazon.com> * Amend patterns user facing doc Signed-off-by: Songkan Tang <songkant@amazon.com> * Add average time benchmark for patterns window functions Signed-off-by: Songkan Tang <songkant@amazon.com> * Add new default pattern method setting to allow change Signed-off-by: Songkan Tang <songkant@amazon.com> * Update unit tests per injected setting Signed-off-by: Songkan Tang <songkant@amazon.com> * Update patterns command user facing doc after introducing setting Signed-off-by: Songkan Tang <songkant@amazon.com> * Complement more unit test cases Signed-off-by: Songkan Tang <songkant@amazon.com> * Adjust patterns.rst file format Signed-off-by: Songkan Tang <songkant@amazon.com> * Handle null like ExprValue cases and fix additional doctest Signed-off-by: Songkan Tang <songkant@amazon.com> * Fix spotless style check Signed-off-by: Songkan Tang <songkant@amazon.com> * Minor doctest fix style Signed-off-by: Songkan Tang <songkant@amazon.com> --------- Signed-off-by: Songkan Tang <songkant@amazon.com>
* Improved patterns command with new algorithm Signed-off-by: Songkan Tang <songkant@amazon.com> * Minor change log parser default configurations Signed-off-by: Songkan Tang <songkant@amazon.com> * Refactor a bit and add partial unit tests Signed-off-by: Songkan Tang <songkant@amazon.com> * Add more unit tests Signed-off-by: Songkan Tang <songkant@amazon.com> * Amend patterns user facing doc Signed-off-by: Songkan Tang <songkant@amazon.com> * Add average time benchmark for patterns window functions Signed-off-by: Songkan Tang <songkant@amazon.com> * Add new default pattern method setting to allow change Signed-off-by: Songkan Tang <songkant@amazon.com> * Update unit tests per injected setting Signed-off-by: Songkan Tang <songkant@amazon.com> * Update patterns command user facing doc after introducing setting Signed-off-by: Songkan Tang <songkant@amazon.com> * Complement more unit test cases Signed-off-by: Songkan Tang <songkant@amazon.com> * Adjust patterns.rst file format Signed-off-by: Songkan Tang <songkant@amazon.com> * Handle null like ExprValue cases and fix additional doctest Signed-off-by: Songkan Tang <songkant@amazon.com> * Fix spotless style check Signed-off-by: Songkan Tang <songkant@amazon.com> * Minor doctest fix style Signed-off-by: Songkan Tang <songkant@amazon.com> --------- Signed-off-by: Songkan Tang <songkant@amazon.com>
Description
This PR introduces enhancement on original patterns command. It keeps the patterns command naming and rebuild it with specific window functions to parse log messages with different pattern method(log parser algorithms). See this RFC: #3251
The sample query input and output will be like:
Related Issues
Resolves #3251
Check List
--signoff.By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.