
[new feature] add Remote read API #7324

Open · wants to merge 57 commits into main

Conversation

liguozhong
Contributor

@liguozhong liguozhong commented Oct 3, 2022

What this PR does / why we need it:


[new feature] add Remote read API.

Remote read reuses the existing client in pkg/logcli/client/client.go; the Loki server does not add a new HTTP handler.

type Client interface {
	Query(queryStr string, limit int, time time.Time, direction logproto.Direction, quiet bool) (*loghttp.QueryResponse, error)
	QueryRange(queryStr string, limit int, start, end time.Time, direction logproto.Direction, step, interval time.Duration, quiet bool) (*loghttp.QueryResponse, error)
	ListLabelNames(quiet bool, start, end time.Time) (*loghttp.LabelResponse, error)
	ListLabelValues(name string, quiet bool, start, end time.Time) (*loghttp.LabelResponse, error)
	Series(matchers []string, start, end time.Time, quiet bool) (*loghttp.SeriesResponse, error)
	LiveTailQueryConn(queryStr string, delayFor time.Duration, limit int, start time.Time, quiet bool) (*websocket.Conn, error)
	GetOrgID() string
}
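
For illustration only, here is a rough sketch (not necessarily the exact code in this PR) of how a remote read path could forward one query through this client and turn the response back into the iterator the query engine consumes; the helper name selectLogsRemote is an assumption.

package remoteread

import (
	"fmt"
	"time"

	"github.com/grafana/loki/pkg/iter"
	"github.com/grafana/loki/pkg/logcli/client"
	"github.com/grafana/loki/pkg/loghttp"
	"github.com/grafana/loki/pkg/logproto"
)

// selectLogsRemote forwards a single log query to a remote Loki through the
// logcli client and converts the HTTP response into an entry iterator.
// Illustrative sketch only; the real PR code may differ.
func selectLogsRemote(c client.Client, query string, limit int, start, end time.Time, direction logproto.Direction) (iter.EntryIterator, error) {
	resp, err := c.QueryRange(query, limit, start, end, direction, 0, 0, true)
	if err != nil {
		return nil, fmt.Errorf("remote read: query range: %w", err)
	}
	streams, ok := resp.Data.Result.(loghttp.Streams)
	if !ok {
		return nil, fmt.Errorf("remote read: unexpected result type %T", resp.Data.Result)
	}
	return iter.NewStreamsIterator(streams.ToProto(), direction), nil
}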

loki.yaml config example:

remote_read:
  - url: http://loki_us.svc:3100
    name: loki_us
    orgID: buy
  - url: http://loki_eu.svc:3100
    name: loki_eu
    orgID: carts

Prometheus remote read documentation: https://prometheus.io/docs/prometheus/latest/configuration/configuration/#remote_read
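
For reference, a hedged sketch of one way this YAML block could be modeled in Go; the struct and field names below are assumptions, not necessarily what the PR defines.

package remoteread

// RemoteReadConfig is an illustrative shape for the remote_read YAML shown
// above; the actual struct names and tags in this PR may differ.
type RemoteReadConfig struct {
	Remotes []RemoteConfig `yaml:"remote_read"`
}

type RemoteConfig struct {
	URL   string `yaml:"url"`   // base URL of the remote Loki, e.g. http://loki_us.svc:3100
	Name  string `yaml:"name"`  // identifier for the remote cluster
	OrgID string `yaml:"orgID"` // tenant (X-Scope-OrgID) to query on the remote
}
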
Which issue(s) this PR fixes:
Fixes #7306 and #1866

Special notes for your reviewer:
This PR is already quite large, so it contains only the implementation: the basic remote read framework, the remote read configuration, and a SelectLogs test for the read.go interface.

Checklist

  • Reviewed the CONTRIBUTING.md guide
  • Documentation added
  • Tests updated
  • CHANGELOG.md updated
  • Changes that require user attention or interaction to upgrade are documented in docs/sources/upgrading/_index.md

@liguozhong liguozhong requested a review from a team as a code owner October 3, 2022 17:58
@grafanabot
Collaborator

./tools/diff_coverage.sh ../loki-main/test_results.txt test_results.txt ingester,distributor,querier,querier/queryrange,iter,storage,chunkenc,logql,loki

Change in test coverage per package. Green indicates 0 or positive change, red indicates that test coverage for a package fell.

+           ingester	0%
+        distributor	0%
+            querier	0%
+ querier/queryrange	0%
+               iter	0%
+            storage	0%
+           chunkenc	0%
+              logql	0%
-               loki	-0.4%

@grafanabot
Collaborator

./tools/diff_coverage.sh ../loki-main/test_results.txt test_results.txt ingester,distributor,querier,querier/queryrange,iter,storage,chunkenc,logql,loki

Change in test coverage per package. Green indicates 0 or positive change, red indicates that test coverage for a package fell.

+           ingester	0%
+        distributor	0%
+            querier	0%
- querier/queryrange	-0.1%
+               iter	0%
+            storage	0%
+           chunkenc	0%
+              logql	0%
-               loki	-0.4%

# Conflicts:
#	pkg/loki/loki.go
@grafanabot
Collaborator

./tools/diff_coverage.sh ../loki-main/test_results.txt test_results.txt ingester,distributor,querier,querier/queryrange,iter,storage,chunkenc,logql,loki

Change in test coverage per package. Green indicates 0 or positive change, red indicates that test coverage for a package fell.

+           ingester	0%
+        distributor	0%
+            querier	0%
+ querier/queryrange	0%
+               iter	0%
+            storage	0%
+           chunkenc	0%
+              logql	0%
-               loki	-0.4%

Contributor

@jeschkies jeschkies left a comment

We were planning a remote read API as well. I'll reject it just so we have time to review it, not because I think our solution would be better.

I'm super happy there's interest in it.

@jeschkies
Contributor

Is there an issue or something similar where this is discussed in more detail?

Contributor

@jeschkies jeschkies left a comment

Nice work. I meant to introduce a new remote read API that would stream the results so we would only transfer the data we need.

if !ok {
	return nil, errors.New("remote read Querier selectLogs failed: value cast to loghttp.Streams failed")
}
return iter.NewStreamsIterator(streams.ToProto(), params.Direction), nil
Contributor

I think we should inject the remote ID like we did with the tenant ID for multi-tenant queries.
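
For illustration, a minimal sketch of that idea using plain context values; the key type and helper names are hypothetical, and Loki's actual tenant-injection helpers may look different.

package remoteread

import "context"

// remoteIDKey is a hypothetical context key for carrying the remote cluster
// name, by analogy with how a tenant ID is injected for multi-tenant queries.
type remoteIDKey struct{}

// InjectRemoteID returns a child context carrying the remote cluster name.
func InjectRemoteID(ctx context.Context, remote string) context.Context {
	return context.WithValue(ctx, remoteIDKey{}, remote)
}

// RemoteIDFromContext extracts the remote cluster name, if present.
func RemoteIDFromContext(ctx context.Context) (string, bool) {
	remote, ok := ctx.Value(remoteIDKey{}).(string)
	return remote, ok
}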

Contributor Author

Agreed, you are right.

Contributor Author

done.

Contributor

Sorry. I cannot find it 🙈

@liguozhong
Contributor Author

Is there an issue or something similar where this is discussed in more detail?

Hi @jeschkies, as far as I know there has been no detailed discussion about remote read yet, but I have some prior experience with the remote read module in the Prometheus project.

This PR is only meant to provide a template for the community; we can discuss it in this PR.

I'm very sorry for submitting this PR directly without discussing it in detail first; trust me, I have no ill intentions. I just want to make Loki a more mature and complete logging solution faster.

@grafanabot
Collaborator

./tools/diff_coverage.sh ../loki-main/test_results.txt test_results.txt ingester,distributor,querier,querier/queryrange,iter,storage,chunkenc,logql,loki

Change in test coverage per package. Green indicates 0 or positive change, red indicates that test coverage for a package fell.

+           ingester	0%
+        distributor	0%
+            querier	0%
+ querier/queryrange	0%
+               iter	0%
+            storage	0%
+           chunkenc	0%
+              logql	0%
-               loki	-0.4%

@jeschkies
Contributor

I'm very sorry for submitting this PR directly without discussing it in detail first; trust me, I have no ill intentions. I just want to make Loki a more mature and complete logging solution faster.

@liguozhong no worries. I didn't mean to criticize. I was out for some time and thought I missed some info.

I'm super happy you took this initiative. I have a few ideas that I'll add later here.

Member

@owen-d owen-d left a comment

I'm worried about performance of this at scale, which often coincides with users that want this feature. Right now, no aggregation is done remotely, meaning we're going to ship potentially a lot of data back to the host cluster on each query. This can quickly become prohibitively expensive (data egress costs).

Some other thoughts

  • (mentioned above) How do we minimize bandwidth between clusters? Ideally we'd need to do more sophisticated query planning (aggregation push-down).
  • What happens when the ideal shard factor should differ on different clusters during query parallelization?

I'm inclined to say we don't want this feature because at scale, this will be both costly and slow :( .

@liguozhong
Contributor Author

I'm worried about performance of this at scale, which often coincides with users that want this feature. Right now, no aggregation is done remotely, meaning we're going to ship potentially a lot of data back to the host cluster on each query. This can quickly become prohibitively expensive (data egress costs).

Some other thoughts

  • (mentioned above) How do we minimize bandwidth between clusters? Ideally we'd need to do more sophisticated query planning (aggregation push-down).
  • What happens when the ideal shard factor should differ on different clusters during query parallelization?

I'm inclined to say we don't want this feature because at scale, this will be both costly and slow :( .

Agreed, introducing this feature will cause a lot of complaints.

@SuperQ

SuperQ commented Mar 28, 2023

@owen-d / @liguozhong This kind of remote-read federation is very much needed.

For example, we have multiple compute/cloud service providers in multiple geo regions. I would rather ship data per query than ship all logs over our regional links. Most log output goes un-queried, so keeping logs inside a regional or provider boundary is desirable.

We would also like a circuit-breaker pattern, so that if a single geo region/provider is unavailable we can continue to have access to the other regions/providers.

Without remote read / distributed query, Loki would be a SPoF in our observability architecture.

Having a remote read / federated model is why we went with Thanos over Mimir.

@liguozhong
Contributor Author

liguozhong commented Mar 28, 2023

@owen-d / @liguozhong This kind of remote-read federation is very much needed.

For example, we have multiple compute/cloud service providers in multiple geo regions. I would rather ship data per query than ship all logs over our regional links. Most log output goes un-queried, so keeping logs inside a regional or provider boundary is desirable.

We would also like a circuit-breaker pattern, so that if a single geo region/provider is unavailable we can continue to have access to the other regions/providers.

Without remote read / distributed query, Loki would be a SPoF in our observability architecture.

Having a remote read / federated model is why we went with Thanos over Mimir.

Hi, I understand your situation very well. At present I also operate 22 Loki clusters.
When I need to query a traceID log, I have to click through 22 clusters each time. This is very uncomfortable.
I am willing to complete this PR to eliminate this unproductive duplication of work,
but it needs to be discussed by more people.

# Conflicts:
#	docs/sources/configuration/_index.md
#	pkg/querier/querier.go
@SuperQ

SuperQ commented Mar 28, 2023

Having this as an option, and then improving on things like query push-down, is probably the best incremental approach. Documenting the downsides like fan-out and lack of push-down is fine IMO.

For example, in Thanos we had lots of issues with fan-out (we have 2000+ Prometheus instances), so we introduced a label enforcement proxy to make sure users selected which Prometheus instances they needed per query. This is slightly less intuitive, but a lot less cumbersome than maintaining 2000+ data sources. :)

@liguozhong
Contributor Author

I am really looking forward to launching an alpha version so that more developers can participate in optimizing the code.
If there are 2000 Loki clusters, each configured with a label, then filtering data sources based on that label would provide a very cool federated query. Great idea.

Waiting for more eyes.

@jeschkies
Contributor

What do you think about blocking queries that do not perform well without remote aggregation?

@liguozhong
Contributor Author

liguozhong commented Apr 27, 2023

What do you think about blocking queries that do not perform well without remote aggregation?

Sorry, I missed your message.
Today our dev team asked me for this feature again, and I found your message in this PR today.

sum(count_over_time({app="buy2"}[1m]))

As far as I know, when the querier queries the ingester, it executes this expression (count_over_time({app="buy2"}[1m])) in the ingester instead of pulling all of the original data back to the querier for execution. What the ingester returns to the querier is a time series result, and this amount of data is very small, so there will be no performance problems.

The remote read API is the same: the performance cost is mainly in S3 file downloads and | json computation, not in the sum.

Maybe my understanding is wrong.
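
To make the push-down point concrete, a hedged sketch: each remote evaluates the metric query itself through the logcli client, so only a small matrix crosses the network and the host cluster merges those results. The function name and the naive merge are illustrative assumptions, not code from this PR.

package remoteread

import (
	"fmt"
	"time"

	"github.com/grafana/loki/pkg/logcli/client"
	"github.com/grafana/loki/pkg/loghttp"
	"github.com/grafana/loki/pkg/logproto"
)

// queryRemotesRange runs a metric query such as
//   sum(count_over_time({app="buy2"}[1m]))
// on each remote and merges the small matrix results locally, instead of
// pulling raw log lines back. Illustrative sketch only.
func queryRemotesRange(remotes []client.Client, query string, start, end time.Time, step time.Duration) (loghttp.Matrix, error) {
	var merged loghttp.Matrix
	for _, c := range remotes {
		resp, err := c.QueryRange(query, 0, start, end, logproto.BACKWARD, step, 0, true)
		if err != nil {
			return nil, fmt.Errorf("remote read: %w", err)
		}
		m, ok := resp.Data.Result.(loghttp.Matrix)
		if !ok {
			return nil, fmt.Errorf("remote read: expected matrix, got %T", resp.Data.Result)
		}
		// Naive merge: concatenate series. A real implementation would
		// re-apply the outer aggregation (e.g. sum) across remotes.
		merged = append(merged, m...)
	}
	return merged, nil
}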

@jeschkies
Contributor

What the ingester returns to the querier is a time series result, and this amount of data is very small, so there will be no performance problems.

@liguozhong, that's because the ingester covers a small time range; if you query over days it's an issue. If I'm not mistaken, we want to use something like the shard_summer here that fans out to each remote. Say you have one thousand pods and use their IDs as labels: you'll have over a thousand streams returned. However, if you sum by, say, the namespace, the streams are reduced.
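
A hedged example of that cardinality difference; the label names below are assumptions, not ones used in this PR.

// Illustrative only: a per-pod grouping keeps ~1000 series crossing the
// network, while a coarser grouping collapses the result before it is shipped.
const (
	perPodQuery       = `sum by (pod) (count_over_time({cluster="us"}[5m]))`
	perNamespaceQuery = `sum by (namespace) (count_over_time({cluster="us"}[5m]))`
)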

@liguozhong
Contributor Author

What the ingester returns to the querier is a time series result, and this amount of data is very small, so there will be no performance problems.

@liguozhong, that's because the ingester covers a small time range; if you query over days it's an issue. If I'm not mistaken, we want to use something like the shard_summer here that fans out to each remote. Say you have one thousand pods and use their IDs as labels: you'll have over a thousand streams returned. However, if you sum by, say, the namespace, the streams are reduced.

Sorry, I missed this review tip. I will try to make it right. Thanks; "over a thousand streams returned" is really dangerous.

# Conflicts:
#	pkg/logcli/client/client.go
#	pkg/logcli/client/file.go
#	pkg/logcli/query/query_test.go
#	pkg/querier/querier_test.go
# Conflicts:
#	pkg/logcli/client/client.go
@JStickler
Contributor

@liguozhong This PR is over a year old. Can we close it? Or are you still hoping to get this merged?

@paulojmdias

Any updates on this PR?

Labels: size/XXL, type/docs

Successfully merging this pull request may close these issues:

Feature request: Remote read API

8 participants