Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

LogQL: Introduce distinct #8662

Merged
merged 22 commits into from Apr 28, 2023
Merged

LogQL: Introduce distinct #8662

merged 22 commits into from Apr 28, 2023

Conversation

liguozhong
Copy link
Contributor

@liguozhong liguozhong commented Mar 1, 2023

What this PR does / why we need it:
Introduces distinct to LogQL. Usage:
{job="varlogs"} | distinct filename
It is similar to distinct in SQL and especially useful for sampling. Similar syntax exists in other log query languages.

Example
For the following log lines:

{"event": "access", "id": "1", "time": "2023-02-28 15:12:11"}
{"event": "access", "id": "1", "time": "2023-02-28 15:13:11"}
{"event": "access", "id": "2", "time": "2023-02-28 15:14:11"}
{"event": "access", "id": "2", "time": "2023-02-28 15:15:11"}

The query below:

{app="order"} | json | distinct id

Will return:

{"event": "access", `"id": "1",` "time": "2023-02-28 15:13:11"}
{"event": "access", `"id": "2", `"time": "2023-02-28 15:15:11"}

Example with multiple labels

{app="order"} | json | distinct id,time

Which issue(s) this PR fixes:
Fixes #8649

Special notes for your reviewer:
{job="varlogs"} |distinct filename
The test passed in a simple local development environment
snapshot.
image

Checklist

  • Reviewed the CONTRIBUTING.md guide (required)
  • Documentation added
  • Tests updated
  • CHANGELOG.md updated
  • Changes that require user attention or interaction to upgrade are documented in docs/sources/upgrading/_index.md

@liguozhong liguozhong requested a review from a team as a code owner March 1, 2023 08:52
@pull-request-size pull-request-size bot added size/L and removed size/M labels Mar 1, 2023
@github-actions github-actions bot added the type/docs Issues related to technical documentation; the Docs Squad uses this label across many repositories label Mar 1, 2023
@rkonfj
Copy link

rkonfj commented Mar 1, 2023

@liguozhong Thank you very much for solving the problem that is bothering me, I just want to say cool!

Two questions:

  1. Can multiple labels be specified?(Doesn't matter if this feature exists or not) | distinct label1, label2
  2. Are the deduplicated log lines the latest? I don't know if LogQL needs to implement a deduplication strategy, at least the strategy I need is to keep only the latest log line.

Copy link
Contributor

@JStickler JStickler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Docs squad] One small comment, otherwise the docs LGTM.

docs/sources/logql/log_queries/_index.md Outdated Show resolved Hide resolved
@liguozhong
Copy link
Contributor Author

  • Can multiple labels be specified?(Doesn't matter if this feature exists or not) | distinct label1, label2

nice, I submitted a commit to complete this feature.

{app="order"} | json | distinct id,time

@liguozhong
Copy link
Contributor Author

2. Are the deduplicated log lines the latest? I don't know if LogQL needs to implement a deduplication strategy, at least the strategy I need is to keep only the latest log line.

No, currently the earliest log is kept. because the log query is stateless now, so aggregation cannot be achieved.
If we want to realize this feature, we need to make a lot of changes, and we need to introduce a state for the log query. According to my analysis, this requirement is difficult to achieve.

@liguozhong
Copy link
Contributor Author

The mysql's distinct was no keep the last one too, maybe we can keep the same logic with mysql's SQL now.

@rkonfj
Copy link

rkonfj commented Mar 2, 2023

@liguozhong

nice, I submitted a commit to complete this feature.

Thanks.

No, currently the earliest log is kept. because the log query is stateless now, so aggregation cannot be achieved.
If we want to realize this feature, we need to make a lot of changes, and we need to introduce a state for the log query. According to my analysis, this requirement is difficult to achieve.

You are right, pipeline filtering is not possible, which conflicts with streaming log lines, since the latest log line is in the future.

However, when the date range and query conditions are determined, the log lines is no longer a stream, but a determined lines. Is it possible to reevaluate the log lines to be returned(if there is a suitable place to do that)?

@rkonfj
Copy link

rkonfj commented Mar 2, 2023

The mysql's distinct was no keep the last one too, maybe we can keep the same logic with mysql's SQL now.

The result of the MySQL query is the column values, it doesn't matter whether it is earliest or latest. But for logs, the log content itself is important, the deduplication strategy become more important, right?

@liguozhong
Copy link
Contributor Author

Metric query can do agg and do a commit after all the data is processed. But there is no such callback func in log query now.

And there is an important question we have to consider.
The log query always has a limit, which means that we don't need to pull all the logs to end the query, but if we promise that distinct can return the last line of logs, we will have to pull all the s3 files to end this query.

I'm afraid it's hard for loki to pull the last log.But I also think that the last line of log is the most important log. Maybe you can modify the start time continuously.

@liguozhong
Copy link
Contributor Author

Anyway, your issue is very valuable, I asked our developers, they also want a feature in their daily oncall work.

if !ok {
return line, false
}
_, ok = r.datas[label][val]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens when r.datas[label] returns nil? E.g. | logfmt would add new labels that have not been seens when r.data was initialized.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the label does not exist, it will be discarded, and the new label is based on the order of the pipeline. Such as the following example

log
{"a":"1", "b":2}

logql 1 :
{app="buy2"} | json | distinct a
logql 2 :
{app="buy2"} | distinct a | json

logql 1 can get log. logql 2 will not get log.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, but you'd still have a nil pointer right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good point.This code does look dangerous.
No, each label has been initialized in the NewDistinctFilter () function.

func NewDistinctFilter(labels []string) (Stage, error) {
	datas := make(map[string]map[string]int, 0)
	for _, label := range labels {
		datas[label] = make(map[string]int, 0)
	}
	...
}

If distinctFilter struct initialized through NewDistinctFilter(), the distinctFilter struct will never get nil point. But if we directly build a distinctFilter struct with &distinctFilter{}, it is possible that the nil point you mentioned will appear.

another way.
image
this for performance, it will not need to make a judgment like this for each log.

 _, ok = r.datas[label]
if !ok {
r.datas[label] = make(map[string]int, 0)
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is NewDestinctFilter called for each line? How is preserving the state?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NewDestinctFilter is generated for each logql, not for each line. So the state in NewDestinctFilter can be maintained continuously in one logql. The screenshot in the PR can also show that this state is valid

image

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your patience. I still don't understand how data[label] can be not nil if the label not known when NewDestinctFilter is called. I'll try out.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, no I get it. The labels are not the labels seen in the line but the ones passed to distinct 🙈

liguozhong and others added 8 commits April 21, 2023 09:41
Co-authored-by: Dylan Guedes <djmgguedes@gmail.com>
Co-authored-by: Dylan Guedes <djmgguedes@gmail.com>
Co-authored-by: Dylan Guedes <djmgguedes@gmail.com>
Co-authored-by: Dylan Guedes <djmgguedes@gmail.com>
@liguozhong
Copy link
Contributor Author

thanks, ready for review.

Comment on lines +363 to +369
{
`{app="foo"} | logfmt | distinct id`,
[]*labels.Matcher{
mustNewMatcher(labels.MatchEqual, "app", "foo"),
},
[]linecheck{{"id=foo", true}, {"id=foo", false}, {"id=bar", true}},
},
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@liguozhong I hope you don't mind me adding this test.

Copy link
Contributor

@jeschkies jeschkies left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@DylanGuedes I'm fine with this change. What do you say?

@liguozhong
Copy link
Contributor Author

thanks

@DylanGuedes
Copy link
Contributor

@DylanGuedes I'm fine with this change. What do you say?

same 😄

@liguozhong thank you again for another awesome contribution ❤️

@DylanGuedes DylanGuedes changed the title [new feature] logql : introduce distinct syntax LogQL: Introduce distinct Apr 28, 2023
@DylanGuedes DylanGuedes merged commit 38b298c into grafana:main Apr 28, 2023
4 checks passed
@xkkker
Copy link

xkkker commented Aug 17, 2023

Hi,what's the milestone of this feature? The lastest version 2.8.4 don't support distinct for now.

kavirajk added a commit that referenced this pull request Aug 25, 2023
kavirajk added a commit that referenced this pull request Aug 25, 2023
kavirajk added a commit that referenced this pull request Aug 25, 2023
**What this PR does / why we need it**:
This reverts commit 38b298c.

We are removing this PR #8662 because [known issue tracked
here](#9594) and some issues we
facing internally with query splitting and sharding work(hard to
parallelize, merging two `distinct` sets can have duplicates, hard to
make it work with metric queries, etc).

We decided best to remove it before make it to public release.

**Which issue(s) this PR fixes**:
Fixes #<issue number>

**Special notes for your reviewer**:

**Checklist**
- [x] Reviewed the
[`CONTRIBUTING.md`](https://github.com/grafana/loki/blob/main/CONTRIBUTING.md)
guide (**required**)
- [x] Documentation added
- [x] Tests updated
- [x] `CHANGELOG.md` updated
- [ ] If the change is worth mentioning in the release notes, add
`add-to-release-notes` label
- [ ] Changes that require user attention or interaction to upgrade are
documented in `docs/sources/setup/upgrade/_index.md`
- [ ] For Helm chart changes bump the Helm chart version in
`production/helm/loki/Chart.yaml` and update
`production/helm/loki/CHANGELOG.md` and
`production/helm/loki/README.md`. [Example
PR](d10549e)

---------

Signed-off-by: Kaviraj <kavirajkanagaraj@gmail.com>
grafanabot pushed a commit that referenced this pull request Aug 25, 2023
**What this PR does / why we need it**:
This reverts commit 38b298c.

We are removing this PR #8662 because [known issue tracked
here](#9594) and some issues we
facing internally with query splitting and sharding work(hard to
parallelize, merging two `distinct` sets can have duplicates, hard to
make it work with metric queries, etc).

We decided best to remove it before make it to public release.

**Which issue(s) this PR fixes**:
Fixes #<issue number>

**Special notes for your reviewer**:

**Checklist**
- [x] Reviewed the
[`CONTRIBUTING.md`](https://github.com/grafana/loki/blob/main/CONTRIBUTING.md)
guide (**required**)
- [x] Documentation added
- [x] Tests updated
- [x] `CHANGELOG.md` updated
- [ ] If the change is worth mentioning in the release notes, add
`add-to-release-notes` label
- [ ] Changes that require user attention or interaction to upgrade are
documented in `docs/sources/setup/upgrade/_index.md`
- [ ] For Helm chart changes bump the Helm chart version in
`production/helm/loki/Chart.yaml` and update
`production/helm/loki/CHANGELOG.md` and
`production/helm/loki/README.md`. [Example
PR](d10549e)

---------

Signed-off-by: Kaviraj <kavirajkanagaraj@gmail.com>
(cherry picked from commit c030217)
kavirajk added a commit that referenced this pull request Aug 25, 2023
Backport c030217 from #10356

---

**What this PR does / why we need it**:
This reverts commit 38b298c.

We are removing this PR #8662 because [known issue tracked
here](#9594) and some issues we
facing internally with query splitting and sharding work(hard to
parallelize, merging two `distinct` sets can have duplicates, hard to
make it work with metric queries, etc).

We decided best to remove it before make it to public release.

**Which issue(s) this PR fixes**:
Fixes #<issue number>

**Special notes for your reviewer**:

**Checklist**
- [x] Reviewed the
[`CONTRIBUTING.md`](https://github.com/grafana/loki/blob/main/CONTRIBUTING.md)
guide (**required**)
- [x] Documentation added
- [x] Tests updated
- [x] `CHANGELOG.md` updated
- [ ] If the change is worth mentioning in the release notes, add
`add-to-release-notes` label
- [ ] Changes that require user attention or interaction to upgrade are
documented in `docs/sources/setup/upgrade/_index.md`
- [ ] For Helm chart changes bump the Helm chart version in
`production/helm/loki/Chart.yaml` and update
`production/helm/loki/CHANGELOG.md` and
`production/helm/loki/README.md`. [Example
PR](d10549e)

Co-authored-by: Kaviraj Kanagaraj <kavirajkanagaraj@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
size/L type/docs Issues related to technical documentation; the Docs Squad uses this label across many repositories
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Feature Request: [LogQL ] should have a way to filter out duplicate label values
6 participants