
Improved patterns command with new algorithm #3263

Merged
LantaoJin merged 16 commits into opensearch-project:main from songkant-aws:log-pattern-command
Feb 18, 2025

songkant-aws (Contributor) commented Jan 24, 2025

Description

This PR enhances the original patterns command. It keeps the patterns command name and rebuilds it on dedicated window functions that parse log messages with different pattern methods (log parser algorithms). See the RFC: #3251

The sample query input and output will be like:

  • New brain algorithm log parser:
Screenshot 2025-02-14 at 14 22 01
  • Original regex based patterns command:
Screenshot 2025-02-14 at 14 22 21
  • After changing default pattern method to brain, the default behavior is changed:
Screenshot 2025-02-14 at 14 23 15

Related Issues

Resolves #3251

Check List

  • New functionality includes testing.
  • New functionality has been documented.
  • New functionality has javadoc added.
  • New functionality has a user manual doc added.
  • API changes companion pull request created.
  • Commits are signed per the DCO using --signoff.
  • Public documentation issue/PR created.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Signed-off-by: Songkan Tang <songkant@amazon.com>
YANG-DB (Member) commented Jan 24, 2025

Thanks for the initiative!!

penghuo (Collaborator) left a comment

In general,

new Function(
ctx.pattern_method != null
? ctx.pattern_method.getText().toLowerCase(Locale.ROOT)
: BuiltinFunctionName.BRAIN.name(), // By default, use new algorithm
Collaborator

Could we introduce a setting to control the default pattern algorithms?

songkant-aws (Contributor Author), Jan 27, 2025

OK, I see. Something like a Java property to specify the default algorithm name? I can add that.

((NamedArgumentExpression) expression).getValue().valueOf().stringValue())
.findFirst()
.orElse("");
return new StreamPatternRowWindowFrame(
Collaborator

Do we need StreamPatternRowWindowFrame? Could we just use CurrentRowWindowFrame instead? patternExpression could be a member of StreamPatternWindowFunction.

Contributor Author

Sure, I will remove the need for StreamPatternRowWindowFrame in the next revision.

}

private boolean isSamePartition(ExprValue next) {
protected boolean isSamePartition(ExprValue next) {
Collaborator

Revert unnecessary change.

Contributor Author

Done

return value.stringValue();
})
.toList();
this.preprocessedMessages.addAll(logParser.preprocessAllLogs(logMessages));
Collaborator

BufferPatternRowsWindowFrame should have its own spec. Does it mean over all rows? @dai-chen The WindowFrame definition and how to use the WindowFrame should be separate, right?

Contributor Author

For simplicity, I set the spec to an empty partition by and an empty sort by when parsing the unresolved WindowFunction in the AST, so the window frame range covers all rows on the coordinator node. I haven't seen requirements for sorting or partitioning on other columns.

I think we can add partition by and sort by syntax if we see value in letting users specify them. Thoughts?

Collaborator

Yes, I think users are free to use pattern functions added with any window frame definition.

Member

Tests with non-empty partition by and non-empty sort by are still required in PR.

Contributor Author

Added non-empty partition by and sort by unit tests.

@penghuo penghuo added the enhancement New feature or request label Jan 25, 2025
Comment on lines +41 to +42
repository.register(brain());
repository.register(simplePattern());
dai-chen (Collaborator), Jan 27, 2025

I'm wondering whether these new algorithms are really window functions. If users specify order by or partition by, do they still generate meaningful results?

UnresolvedExpression sourceField,
String alias,
java.util.Map<String, Literal> arguments) {
return new Pattern(
Collaborator

Just wondering: if pattern is mostly syntactic sugar for the pattern window function, is a new logical operator still required?

Contributor Author

After the integration with LogicalWindow, I see they are quite similar. Yeah, I think Pattern operator is probably not needed.

Contributor Author

I don't find the Window unresolvedPlan. So replace Pattern with Window instead.

import org.opensearch.sql.expression.window.frame.WindowFrame;

@EqualsAndHashCode(callSuper = true)
public class BufferPatternWindowFunction extends FunctionExpression
dai-chen (Collaborator), Jan 27, 2025

The order by expression in the window definition decides which rows are peers, right? What's the order by expression for the buffer (offline) pattern function? I didn't find it and may have missed it in the design doc:

Project[message#1, patterns_field#2]
+- Window[brain(message#1), windowsSpec(partitionBy=null, frame=PeerRowsWindowFrame)]
   +- OpenSearchIndexScan

dai-chen (Collaborator) left a comment

A high-level question about the long-term approach: Currently, the pattern command always generates a new field, which offers flexibility as users can perform aggregation (stats) or other operations on it afterward. However, for large datasets, I assume most users would prefer functionality similar to CloudWatch Logs Query Syntax - Pattern.

Do you see any potential issues with the current implementation, which is based on SQL window functions combined with aggregate functions, as we scale this feature in the future?

songkant-aws (Contributor Author) commented Jan 28, 2025

Do you see any potential issues with the current implementation, which is based on SQL window functions combined with aggregate functions, as we scale this feature in the future?

@dai-chen It's a good question. Actually, one of my earlier ideas was to use a new dedicated operator that directly returns grouped logs with a sample count or grouped samples.

It's more like a compound aggregation operator with a default group by. But the current aggregation abstraction only supports row-by-row iteration, while the log pattern algorithms cover both streaming and buffering computing paradigms. That is why I found this closer to the window operator's abstraction. To achieve what CloudWatch has done, I think we could extend the pattern functions from AggregateWindowFunction with a special aggregator state structure.

As for the partition by and sort by specification, my initial plan is to deliver a simple version without them because that's probably enough.
partition by is probably more useful than sort by. Imagine users who want to see different log patterns for INFO, WARN, and ERROR partitions.
I haven't seen a use case for sort by yet. Maybe they need to see the latest logs sorted by date?
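
To make the partition by idea concrete, here is a minimal plain-Java sketch, independent of the SQL plugin's classes. Everything in it is an assumption for illustration: `LogLine`, `toPattern`, and `patternCountsByLevel` are hypothetical names, and `toPattern` is only a stand-in that strips letters and digits (in the spirit of a regex-based pattern method). It is not the Brain algorithm and not the plugin's window operator.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class PartitionedPatterns {
  /** Hypothetical log record: a level used as the partition key, plus the raw message. */
  static final class LogLine {
    final String level;
    final String message;

    LogLine(String level, String message) {
      this.level = level;
      this.message = message;
    }
  }

  // Stand-in pattern function (an assumption): drop letters and digits, keep the rest.
  static String toPattern(String message) {
    return message.replaceAll("[a-zA-Z\\d]", "");
  }

  // Partition by level, then count how often each pattern occurs inside each partition.
  static Map<String, Map<String, Long>> patternCountsByLevel(List<LogLine> logs) {
    Map<String, Map<String, Long>> result = new TreeMap<>();
    for (LogLine line : logs) {
      result
          .computeIfAbsent(line.level, level -> new TreeMap<>())
          .merge(toPattern(line.message), 1L, Long::sum);
    }
    return result;
  }

  public static void main(String[] args) {
    List<LogLine> logs =
        List.of(
            new LogLine("INFO", "login user=alice ip=10.0.0.1"),
            new LogLine("INFO", "login user=bob ip=10.0.0.2"),
            new LogLine("ERROR", "timeout after 30s"));
    // Each level gets its own pattern table, which is what a partition by on level would give.
    System.out.println(patternCountsByLevel(logs));
  }
}
```

A buffering algorithm would differ from this stand-in in that its output for one row can depend on every other row in the same partition, which is why the partition boundaries matter more there than for a stateless regex.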

songkant-aws (Contributor Author) commented Jan 28, 2025

To scale the pattern function, I think several other issues are pending investigation.

  1. How do we push the operator down to OpenSearch data nodes? Is a Painless script one option?
  2. OpenSearch doesn't have a shuffle mechanism for now, so how do we perform aggregation-like operations on large datasets with simple map-reduce? If the data volume is quite large, spilling additional data to disk is probably required during execution.
  3. For log patterns, streaming algorithms are probably more suitable for large-scale datasets. We may need to introduce more of them by then.


Simple Pattern
============
patterns [new_field=<new-field-name>] [pattern=<pattern>] <field> SIMPLE_PATTERN
Member

This seems to be a breaking change. Why not make SIMPLE_PATTERN optional for compatibility?

songkant-aws (Contributor Author), Feb 10, 2025

It's still compatible. If we omit SIMPLE_PATTERN, it will use BRAIN method by default. Peng suggested we can add a configuration to decide which is the default behavior.

Member

It's still compatible. If we omit SIMPLE_PATTERN, it will use BRAIN method by default. Peng suggested we can add a configuration to decide which is the default behavior.

We don't call it compatible if it introduces a breaking change to the original command syntax. I have doubts about using the new BRAIN algorithm as the default output before it has been running stably for a while. @penghuo any thoughts here?

penghuo (Collaborator), Feb 13, 2025

Peng suggested we can add a configuration to decide which is the default behavior.

Did we add setting already?
If not, we should default to SIMPLE_PATTERN.

Contributor Author

Added a new setting now. It defaults to SIMPLE_PATTERN. Please review the change and let me know if it makes sense.

Comment on lines +134 to +136
patternMethod
: SIMPLE_PATTERN
| BRAIN
Member

patternMethod should be added to keywordsCanBeId as well.

Contributor Author

Done

Comment on lines +118 to +120
patternsMethod
: PUNCT
| REGEX
Member

patternsMethod should be added to keywordsCanBeId

Contributor Author

Done

penghuo (Collaborator) commented Feb 13, 2025

@songkant-aws could you resolve the conflict?

Signed-off-by: Songkan Tang <songkant@amazon.com>
Signed-off-by: Songkan Tang <songkant@amazon.com>
Signed-off-by: Songkan Tang <songkant@amazon.com>
Signed-off-by: Songkan Tang <songkant@amazon.com>
songkant-aws (Contributor Author)

@penghuo @dai-chen @LantaoJin I have addressed several comments and added a new setting, plugins.ppl.default.pattern.method, to allow overriding the default pattern method. For now, the default pattern method is still the regex-based SIMPLE_PATTERN. Please let me know if the current change is reasonable. Thanks!
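
The resolution order discussed in this thread (explicit method in the query, then the setting, then SIMPLE_PATTERN) could look roughly like the sketch below. Only the setting key and the two method names come from the conversation; `PatternMethodResolver`, its `resolve` method, and the use of a plain `Map` in place of the plugin's settings object are assumptions made for illustration.

```java
import java.util.Locale;
import java.util.Map;

final class PatternMethodResolver {
  // Setting key quoted in the comment above; how the plugin actually reads it is not shown here.
  static final String SETTING_KEY = "plugins.ppl.default.pattern.method";
  // Fallback agreed in review: keep the regex-based method as the default.
  static final String FALLBACK = "simple_pattern";

  // An explicit method in the query wins, then the setting, then the fallback.
  static String resolve(String methodFromQuery, Map<String, String> settings) {
    if (methodFromQuery != null) {
      return methodFromQuery.toLowerCase(Locale.ROOT);
    }
    return settings.getOrDefault(SETTING_KEY, FALLBACK).toLowerCase(Locale.ROOT);
  }

  public static void main(String[] args) {
    System.out.println(resolve(null, Map.of()));                       // simple_pattern
    System.out.println(resolve(null, Map.of(SETTING_KEY, "BRAIN")));   // brain
    System.out.println(resolve("SIMPLE_PATTERN", Map.of(SETTING_KEY, "BRAIN"))); // simple_pattern
  }
}
```

With this ordering, existing queries that omit the method keep their regex-based output unless an operator opts in through the setting.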

penghuo (Collaborator) commented Feb 14, 2025

@songkant-aws please fix the failed UT coverage. Then we are ready to merge.

Signed-off-by: Songkan Tang <songkant@amazon.com>
Signed-off-by: Songkan Tang <songkant@amazon.com>
Signed-off-by: Songkan Tang <songkant@amazon.com>
Signed-off-by: Songkan Tang <songkant@amazon.com>
Signed-off-by: Songkan Tang <songkant@amazon.com>
songkant-aws (Contributor Author)

@penghuo Fixed all of CI checks with more test coverage and cases.

penghuo (Collaborator) left a comment

Thx!

ImmutableMap.Builder<String, Literal> builder = ImmutableMap.builder();
List<UnresolvedExpression> unresolvedArguments = new ArrayList<>();
unresolvedArguments.add(sourceField);
AtomicReference<String> alias = new AtomicReference<>("patterns_field");
Member

AstBuilder is created per query. It's not necessary to wrap this in an AtomicReference.

Contributor Author

It seems the IDE reports a compilation error when a temporary String variable is assigned in a lambda expression, so I use AtomicReference as a workaround.

Member

Ok, I see, the String alias must be a "final" variable from the enclosing scope.
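
For readers unfamiliar with the rule being discussed: a Java lambda may only capture local variables that are final or effectively final, so a plain `String alias` cannot be reassigned inside it. The sketch below shows the holder workaround and a stream-based alternative. The `LambdaCaptureDemo` class and the `key=value` argument format are hypothetical; only the `patterns_field` default and the `new_field` argument name come from the PR.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicReference;

final class LambdaCaptureDemo {
  // Workaround from the thread: mutate a holder object instead of the captured local.
  static String resolveAlias(List<String> args) {
    // `String alias = "patterns_field"; args.forEach(a -> alias = ...);` would not compile,
    // because locals captured by a lambda must be final or effectively final.
    AtomicReference<String> alias = new AtomicReference<>("patterns_field");
    args.forEach(
        arg -> {
          String[] kv = arg.split("=", 2);
          if (kv.length == 2 && kv[0].equals("new_field")) {
            alias.set(kv[1]); // the reference `alias` itself never changes
          }
        });
    return alias.get();
  }

  // Alternative without a mutable holder: let the stream produce the value.
  static String resolveAliasWithStream(List<String> args) {
    return args.stream()
        .map(arg -> arg.split("=", 2))
        .filter(kv -> kv.length == 2 && kv[0].equals("new_field"))
        .map(kv -> kv[1])
        .reduce((first, second) -> second) // keep the last match, like the loop above
        .orElse("patterns_field");
  }

  public static void main(String[] args) {
    List<String> arguments = List.of("pattern=[0-9]", "new_field=no_numbers");
    System.out.println(resolveAlias(arguments));
    System.out.println(resolveAliasWithStream(arguments));
  }
}
```

The AtomicReference is used here purely as a mutable box; its atomicity is not needed when the builder is single-threaded, which is the reviewer's point.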

Comment on lines +131 to +133
if (thresholdPercentage < 0.0f || thresholdPercentage > 1.0f) {
throw new IllegalArgumentException("Threshold percentage must be between 0.0 and 1.0");
}
Member

What if thresholdPercentage = 0 or thresholdPercentage = 1?

Contributor Author

It's valid, which means 0 or the highest frequency.
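
The boundary behaviour can be checked with a small standalone sketch. The condition and message are copied from the snippet under review; the `ThresholdCheck` wrapper class and the `isAccepted` helper are hypothetical and exist only to exercise the check.

```java
final class ThresholdCheck {
  // Same condition as the reviewed snippet: both endpoints, 0.0 and 1.0, pass.
  static float validate(float thresholdPercentage) {
    if (thresholdPercentage < 0.0f || thresholdPercentage > 1.0f) {
      throw new IllegalArgumentException("Threshold percentage must be between 0.0 and 1.0");
    }
    return thresholdPercentage;
  }

  // Helper for the demo: true if validate() does not throw.
  static boolean isAccepted(float value) {
    try {
      validate(value);
      return true;
    } catch (IllegalArgumentException e) {
      return false;
    }
  }

  public static void main(String[] args) {
    for (float v : new float[] {0.0f, 1.0f, -0.1f, 1.1f, Float.NaN}) {
      System.out.println(v + " accepted: " + isAccepted(v));
    }
  }
}
```

One side effect worth knowing: with this form of the check, `Float.NaN` is also accepted, because every ordered comparison with NaN is false. Writing the condition as `!(x >= 0.0f && x <= 1.0f)` would reject it, if that matters for the caller.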

import org.opensearch.sql.expression.window.frame.WindowFrame;
import org.opensearch.sql.utils.FunctionUtils;

@EqualsAndHashCode(callSuper = true)
Member

minor: @ToString

Contributor Author

I override this function's toString specifically.

import org.opensearch.sql.expression.window.frame.WindowFrame;
import org.opensearch.sql.utils.FunctionUtils;

@EqualsAndHashCode(callSuper = true)
Member

ditto

LantaoJin (Member) left a comment

No blocker any more.

@LantaoJin LantaoJin merged commit 44ff520 into opensearch-project:main Feb 18, 2025
16 checks passed
opensearch-trigger-bot (Contributor)

The backport to 2.x failed:

The process '/usr/bin/git' failed with exit code 128

To backport manually, run these commands in your terminal:

# Navigate to the root of your repository
cd $(git rev-parse --show-toplevel)
# Fetch latest updates from GitHub
git fetch
# Create a new working tree
git worktree add ../.worktrees/sql/backport-2.x 2.x
# Navigate to the new working tree
pushd ../.worktrees/sql/backport-2.x
# Create a new branch
git switch --create backport/backport-3263-to-2.x
# Cherry-pick the merged commit of this pull request and resolve the conflicts
git cherry-pick -x --mainline 1 44ff520f08b606257240dd009758788638f24acb
# Push it to GitHub
git push --set-upstream origin backport/backport-3263-to-2.x
# Go back to the original working tree
popd
# Delete the working tree
git worktree remove ../.worktrees/sql/backport-2.x

Then, create a pull request where the base branch is 2.x and the compare/head branch is backport/backport-3263-to-2.x.

songkant-aws added a commit to songkant-aws/sql that referenced this pull request Feb 18, 2025
* Improved patterns command with new algorithm

Signed-off-by: Songkan Tang <songkant@amazon.com>

* Minor change log parser default configurations

Signed-off-by: Songkan Tang <songkant@amazon.com>

* Refactor a bit and add partial unit tests

Signed-off-by: Songkan Tang <songkant@amazon.com>

* Add more unit tests

Signed-off-by: Songkan Tang <songkant@amazon.com>

* Amend patterns user facing doc

Signed-off-by: Songkan Tang <songkant@amazon.com>

* Add average time benchmark for patterns window functions

Signed-off-by: Songkan Tang <songkant@amazon.com>

* Add new default pattern method setting to allow change

Signed-off-by: Songkan Tang <songkant@amazon.com>

* Update unit tests per injected setting

Signed-off-by: Songkan Tang <songkant@amazon.com>

* Update patterns command user facing doc after introducing setting

Signed-off-by: Songkan Tang <songkant@amazon.com>

* Complement more unit test cases

Signed-off-by: Songkan Tang <songkant@amazon.com>

* Adjust patterns.rst file format

Signed-off-by: Songkan Tang <songkant@amazon.com>

* Handle null like ExprValue cases and fix additional doctest

Signed-off-by: Songkan Tang <songkant@amazon.com>

* Fix spotless style check

Signed-off-by: Songkan Tang <songkant@amazon.com>

* Minor doctest fix style

Signed-off-by: Songkan Tang <songkant@amazon.com>

---------

Signed-off-by: Songkan Tang <songkant@amazon.com>
@songkant-aws songkant-aws mentioned this pull request Feb 19, 2025
7 tasks
songkant-aws added a commit to songkant-aws/sql that referenced this pull request Feb 19, 2025
penghuo pushed a commit that referenced this pull request Feb 19, 2025
Signed-off-by: Songkan Tang <songkant@amazon.com>
penghuo pushed a commit that referenced this pull request Jun 16, 2025
Signed-off-by: Songkan Tang <songkant@amazon.com>
Signed-off-by: xinyual <xinyual@amazon.com>
@songkant-aws songkant-aws deleted the log-pattern-command branch October 24, 2025 07:30