
[new feature] Index: Loki support for Apache Calcite Avatica index storage #5692

Closed
wants to merge 38 commits

Conversation

@liguozhong (Contributor) commented Mar 22, 2022

What this PR does / why we need it:
There is no Google Bigtable, AWS DynamoDB, or hosted Cassandra service in China.
Operating Loki therefore also means operating a distributed NoSQL database, which puts a lot of pressure on our Loki operations team.
We hoped to find a managed, hosted NoSQL service available in China.

There is a product on Alibaba Cloud, Lindorm (https://lindorm.console.aliyun.com/), that can solve this problem.

Lindorm's Go client is Apache Calcite Avatica (https://calcite.apache.org/avatica/), so this PR also applies to other servers that speak the Avatica protocol, such as Kylin and Phoenix.
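
For reference, the Avatica Go driver registers itself with database/sql, so an index client can open a connection roughly like this (a minimal sketch; the DSN is illustrative and credentials are omitted):

package main

import (
	"database/sql"
	"log"

	// Importing the driver registers it under the name "avatica".
	_ "github.com/apache/calcite-avatica-go/v5"
)

func main() {
	// Hypothetical Avatica HTTP endpoint, e.g. a Lindorm proxy address.
	db, err := sql.Open("avatica", "http://ld-xxx-proxy-lindorm.lindorm.rds.aliyuncs.com:30060")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Verify connectivity before wiring the pool into the index client.
	if err := db.Ping(); err != nil {
		log.Fatal(err)
	}
}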

This PR attempts to provide an index storage option that lets Loki run at large scale in China.

Our cluster config:

schema_config:
  configs:
    - from: "2020-05-15"
      index:
        period: 168h
        prefix: loki_avatica_table2_index_
      object_store: s3
      schema: v11
      store: avatica
storage_config:
  max_chunk_batch_size: 500
  avatica:
    username: root
    password: xxxx
    addresses: http://ld-xxx-proxy-lindorm.lindorm.rds.aliyuncs.com:30060
    database: indexlokiavatica

Which issue(s) this PR fixes:
Fixes #5667

Special notes for your reviewer:
The tests have all passed in my dev environment.

[screenshot: Prometheus monitor snapshot, write status code == 200]

https://github.com/dlmiddlecote/sqlstats
https://github.com/prometheus/client_golang/blob/main/prometheus/collectors/dbstats_collector.go

SQL stats metrics:

| Name | Description | Labels | Go Version |
| --- | --- | --- | --- |
| go_sql_stats_connections_max_open | Maximum number of open connections to the database. | db_name | 1.11+ |
| go_sql_stats_connections_open | The number of established connections, both in use and idle. | db_name | 1.11+ |
| go_sql_stats_connections_in_use | The number of connections currently in use. | db_name | 1.11+ |
| go_sql_stats_connections_idle | The number of idle connections. | db_name | 1.11+ |
| go_sql_stats_connections_waited_for | The total number of connections waited for. | db_name | 1.11+ |
| go_sql_stats_connections_blocked_seconds | The total time blocked waiting for a new connection. | db_name | 1.11+ |
| go_sql_stats_connections_closed_max_idle | The total number of connections closed due to SetMaxIdleConns. | db_name | 1.11+ |
| go_sql_stats_connections_closed_max_lifetime | The total number of connections closed due to SetConnMaxLifetime. | db_name | 1.11+ |
| go_sql_stats_connections_closed_max_idle_time | The total number of connections closed due to SetConnMaxIdleTime. | db_name | 1.15+ |
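
For context, these metrics come from handing the connection pool's sql.DBStats to a Prometheus collector; a minimal sketch of wiring it up (the database name used for the db_name label is illustrative):

package main

import (
	"database/sql"

	"github.com/dlmiddlecote/sqlstats"
	"github.com/prometheus/client_golang/prometheus"
)

// registerPoolMetrics exposes the go_sql_stats_connections_* series
// for db, labeled with the given db_name value.
func registerPoolMetrics(db *sql.DB, name string) {
	prometheus.MustRegister(sqlstats.NewStatsCollector(name, db))
}

The linked client_golang file provides an equivalent built-in, collectors.NewDBStatsCollector(db, name), if you prefer not to add the sqlstats dependency.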

Checklist

  • Documentation added
  • Tests updated
  • Add an entry in the CHANGELOG.md about the changes.

liguozhong requested a review from a team as a code owner March 22, 2022 07:52
@sandeepsukhani (Contributor)

Thanks for the PR @liguozhong! Have you considered using boltdb-shipper for index storage?

@liguozhong (Contributor, Author) commented Mar 22, 2022

As far as I know, boltdb-shipper can only be deployed as a single Loki instance, but we have deployed a very large distributed Loki.

@sandeepsukhani (Contributor)

Sorry, not sure what gave you the impression that boltdb-shipper can only be used with a single Loki. Actually, boltdb-shipper is meant to run Loki in distributed mode at large scale without having to use expensive and complicated NoSQL stores. We have been running all our Loki deployments with boltdb-shipper for more than a year now.

@liguozhong (Contributor, Author)

boltdb-shipper

Hi @sandeepsukhani, thanks a lot for your suggestion; it taught me something new.

I read the documentation carefully (https://grafana.com/docs/loki/latest/operations/storage/boltdb-shipper/). The implementation of boltdb-shipper is really impressive, and the idea is a perfect replacement for Cassandra. I really want to use this new boltdb-shipper mode to operate Loki; it will be a very interesting challenge.

But I am a little worried. At present our Loki runs in Kubernetes, and neither the ingester nor the querier uses a persistent volume; they consume only memory and CPU. We do this because we regard a disk as a database, and databases are hard: handling disk failure, expanding when disk space runs out, migrating the data on a disk, mmap, and so on. Some of the hardest problems in computer science live on the disk.

Our team is a monitoring team, not a database team, so our development skills are mainly in monitoring. Because of the concerns above, we chose hosted Cassandra and hosted S3 to hand the disk and database problems to a professional team, which has given our Loki higher availability.

We still hope to keep persistent storage in a database like Lindorm / Cassandra / Bigtable / DynamoDB.

That said, we will try boltdb-shipper in some small Loki clusters, and once we have built up the skills we will try to move all of our production Loki deployments to it. I think this will be difficult, since it requires database-team skills, but we will try.

@sandeepsukhani (Contributor)

I think you should give it a try, since adding disks nowadays is very easy, with most platforms providing Kubernetes as a service.
If you are using jsonnet for deployment, you can even configure the size of the PV.

@liguozhong (Contributor, Author)

OK, I will try using boltdb-shipper to store index data in a dev cluster soon.

liguozhong closed this Mar 30, 2022
@liguozhong (Contributor, Author) commented Mar 30, 2022

I think you should give it a try, since adding disks nowadays is very easy, with most platforms providing Kubernetes as a service. If you are using jsonnet for deployment, you can even configure the size of the PV.

@sandeepsukhani hi, I need your help.
I see that boltdb-shipper's index data and the chunks live in the same bucket.
Can /index/ and chunks be separated into two different buckets? The ruler component already uses an independent S3 bucket.
The performance of our Chinese S3 storage is not stable when everything shares one bucket.
Thanks for your help.

Ruler YAML:

ruler:
  storage:
    type: s3
    s3:
      s3: s3://s3.aliyuncs.com:9053/loki-rule
      s3forcepathstyle: true

Shipper YAML:

storage_config:
  aws:
    bucketnames: loki-logs
    endpoint: s3.com
  boltdb_shipper:
    active_index_directory: /loki/index
    shared_store: s3
    cache_location: /loki/boltdb-cache
compactor:
  working_directory: /loki/boltdb-shipper-compactor
  shared_store: aws

liguozhong reopened this Mar 30, 2022
panic: runtime error: index out of range [0] with length 0

goroutine 305779 [running]:
github.com/apache/calcite-avatica-go/v5.(*rows).Columns(0xc08910fa5eb1c1b8)
	/Users/fuling/go/pkg/mod/github.com/apache/calcite-avatica-go/v5@v5.0.0/rows.go:53 +0xfe
database/sql.(*Rows).nextLocked(0xc06b0a4080)
	/usr/local/go/src/database/sql/sql.go:2964 +0xb0
@owen-d (Member) commented Apr 4, 2022

I think it's unlikely we'll accept another index client (sorry!) because we're trying to deprecate usage of alternative index stores and standardize on using object-storage for the index via boltdb-shipper (or TSDB in the future).

As for your question about independent buckets for the index, it's not currently a supported option like it is for the ruler.

@liguozhong (Contributor, Author)

I think it's unlikely we'll accept another index client (sorry!) because we're trying to deprecate usage of alternative index stores and standardize on using object-storage for the index via boltdb-shipper (or TSDB in the future).

As for your question about independent buckets for the index, it's not currently a supported option like it is for the ruler.

OK. I had to do this work because hosted Cassandra in China went offline.
I'm already trying to use boltdb-shipper to replace the original Cassandra.
Will the Loki community consider splitting chunks and index into two separate buckets in the future?

@liguozhong (Contributor, Author) commented Apr 8, 2022

I think it's unlikely we'll accept another index client (sorry!) because we're trying to deprecate usage of alternative index stores and standardize on using object-storage for the index via boltdb-shipper (or TSDB in the future).

As for your question about independent buckets for the index, it's not currently a supported option like it is for the ruler.

@sandeepsukhani The write performance of boltdb-shipper is too poor for us: where Cassandra needs 3ms per write, boltdb needs 300ms.
At present the slow writes cause the S3 write queue to back up, which eventually causes the ingester to OOM.
Based on my judgment so far, boltdb is not usable in our production environment.
@owen-d, hi, is the TSDB engine (#5428) you are working on meant to solve boltdb's low write performance?

Since our Cassandra is going offline, should I wait for the TSDB engine to be finished, or continue with this PR for a NoSQL index client usable in China?

[screenshot: boltdb write duration]
[screenshot: cassandra write duration]

@sandeepsukhani (Contributor)

@liguozhong thanks for giving it a try and getting back with your findings!
What is your QPS on index writes? Are you using a node disk or a PV? If a PV, is it SSD-backed?

@liguozhong (Contributor, Author)

@liguozhong thanks for giving it a try and getting back with your findings! What is your QPS on index writes? Are you using a node disk or a PV? If a PV, is it SSD-backed?

I have just started reading the boltdb code, looking for lock-related code and checking whether any blocks can be optimized.
This PR is a result of what I saw during that reading.
I'm still trying to find optimization points for writes.
Switching from Cassandra to boltdb-shipper is a huge change for us, and we want our team to be more familiar with the boltdb-shipper code before we make the switch.

@liguozhong (Contributor, Author)

@liguozhong thanks for giving it a try and getting back with your findings! What is your QPS on index writes? Are you using a node disk or a PV? If a PV, is it SSD-backed?

Our Cassandra does qps=7000 with duration=3ms.
With boltdb-shipper it is qps=1000 with duration=300ms.

We are using a 50GB SSD.

@sandeepsukhani (Contributor) commented Apr 8, 2022

Our Cassandra does qps=7000 with duration=3ms. With boltdb-shipper it is qps=1000 with duration=300ms.

boltdb-shipper actually measures batch writes, while Cassandra measures individual index entry writes.
boltdb-shipper also does many more index writes than Cassandra because we disable index deduplication: ingesters write the index locally before uploading the index files to object storage, so deduplicating would risk losing index data if we lose an ingester.

We are using a 50GB SSD

Not sure you need that much; you can check the disk usage.

I would like to point out that there are tradeoffs in both systems: you either pay for running Cassandra, or you throw a little more resources at your Loki cluster when using boltdb-shipper, since it now has to do more work to manage the index as well.
We run all our clusters with boltdb-shipper, which will soon be replaced by TSDB. We will continue to invest our time in improving object-storage-only Loki, since it is cheaper and easier to run.

@liguozhong (Contributor, Author)

We are using a 50GB SSD

Not sure you need that much; you can check the disk usage.

On Alibaba Cloud, a PV of the SSD type must be at least 50GB; otherwise the error "InvalidDiskSize.NotSupported" is reported.
But a disk is relatively cheap compared to buying a hosted Cassandra cluster.

ErrorCode: InvalidDiskSize.NotSupported
Recommend: https://troubleshoot.api.aliyun.com?q=InvalidDiskSize.NotSupported&product=Ecs
RequestId: 9A5F223A-BB9C-5BD4-88BB-5C5AC0C51B57
Message: disk size is not supported.
recommand: Please adjust the size of the cloud disk;
reference:https://help.aliyun.com/document_detail/25412.html

@splitice (Contributor)

Splitting S3 across multiple buckets might help you with performance. It's currently possible with comma-separated buckets in the config; see the sketch below.

Migration is difficult, however, as it's not done at the schema level.
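
For illustration, the comma-separated form looks roughly like this (bucket names and endpoint are hypothetical); Loki then distributes objects across the listed buckets:

storage_config:
  aws:
    # Hypothetical bucket names; Loki spreads chunks over the list.
    bucketnames: loki-chunks-0,loki-chunks-1,loki-chunks-2
    endpoint: s3.aliyuncs.com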

stale bot commented Jul 10, 2022

Hi! This issue has been automatically marked as stale because it has not had any
activity in the past 30 days.

We use a stalebot among other tools to help manage the state of issues in this project.
A stalebot can be very useful in closing issues in a number of cases; the most common
is closing issues or PRs where the original reporter has not responded.

Stalebots are also emotionless and cruel and can close issues which are still very relevant.

If this issue is important to you, please add a comment to keep it open. More importantly, please add a thumbs-up to the original issue entry.

We regularly sort for closed issues which have a stale label sorted by thumbs up.

We may also:

  • Mark issues as revivable if we think it's a valid issue but isn't something we are likely
    to prioritize in the future (the issue will still remain closed).
  • Add a keepalive label to silence the stalebot if the issue is very common/popular/important.

We are doing our best to respond, organize, and prioritize all issues but it can be a challenging task,
our sincere apologies if you find yourself at the mercy of the stalebot.

stale bot added the stale label Jul 10, 2022
stale bot closed this Aug 13, 2022
Labels: size/XXL, stale