
create smaller unique files from boltdb shipper and other code refactorings #2261

Merged
merged 19 commits into grafana:master from boltdb-shipper-overhaul on Jul 28, 2020

Conversation

@sandeepsukhani (Contributor) commented on Jun 24, 2020

What this PR does / why we need it:
This PR does a major overhaul of the boltdb shipper code. Most of the functionality remains the same, except that index files are now sharded by creating a new file every 15 minutes.
The overhaul also splits the upload and download code into separate sub-packages.
This PR still needs some final touches and tests.
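
A minimal sketch of what creating a new index file every 15 minutes can look like (illustrative only; the helper name and file-name format are assumptions, not the PR's actual code):

```go
package main

import (
	"fmt"
	"time"
)

// shardPeriod is the assumed width of one index-file shard.
const shardPeriod = 15 * time.Minute

// indexFileName derives a per-period file name by truncating the current time
// to the shard period, so all writes within the same 15 minutes land in the
// same boltdb file and a new file starts for the next period.
func indexFileName(tableName string, now time.Time) string {
	periodStart := now.Truncate(shardPeriod)
	return fmt.Sprintf("%s-%d", tableName, periodStart.Unix())
}

func main() {
	fmt.Println(indexFileName("index_18456", time.Now()))
}
```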

Checklist

  • Tests updated

@sandeepsukhani force-pushed the boltdb-shipper-overhaul branch 5 times, most recently from d1a2af1 to 16ce8df on June 27, 2020 09:57
@codecov-commenter commented on Jun 27, 2020

Codecov Report

Merging #2261 into master will increase coverage by 1.15%.
The diff coverage is 60.25%.


@@            Coverage Diff             @@
##           master    #2261      +/-   ##
==========================================
+ Coverage   61.55%   62.70%   +1.15%     
==========================================
  Files         160      162       +2     
  Lines       13612    13847     +235     
==========================================
+ Hits         8379     8683     +304     
+ Misses       4607     4488     -119     
- Partials      626      676      +50     
Impacted Files Coverage Δ
pkg/loki/loki.go 0.00% <0.00%> (ø)
pkg/storage/stores/shipper/metrics.go 0.00% <0.00%> (ø)
pkg/storage/stores/shipper/shipper_index_client.go 0.00% <0.00%> (ø)
pkg/storage/stores/shipper/table_client.go 41.93% <ø> (ø)
pkg/loki/modules.go 10.86% <44.44%> (ø)
pkg/storage/store.go 62.02% <60.00%> (-0.48%) ⬇️
pkg/storage/stores/shipper/downloads/table.go 65.00% <65.00%> (ø)
...kg/storage/stores/shipper/uploads/table_manager.go 67.52% <67.52%> (ø)
pkg/storage/stores/shipper/uploads/table.go 68.62% <68.62%> (ø)
.../storage/stores/shipper/downloads/table_manager.go 69.76% <69.76%> (ø)
... and 11 more

@@ -34,7 +34,8 @@ import (
 	"github.com/grafana/loki/pkg/querier"
 	"github.com/grafana/loki/pkg/querier/queryrange"
 	loki_storage "github.com/grafana/loki/pkg/storage"
-	"github.com/grafana/loki/pkg/storage/stores/local"
+	"github.com/grafana/loki/pkg/storage/stores/shipper"
+	shipper_uploads "github.com/grafana/loki/pkg/storage/stores/shipper/uploads"
Collaborator:

Is there a collision that requires this? (I didn't look super close, just saw it in the diff.)
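
An aside on the aliasing question (a standard-library illustration, not the Loki imports themselves): Go only requires an import alias when two imported packages would otherwise share the same package name, or when a clearer local name is wanted.

```go
package main

import (
	"fmt"

	crand "crypto/rand" // package name: rand
	mrand "math/rand"   // also package name: rand; an alias is needed to use both
)

func main() {
	b := make([]byte, 4)
	_, _ = crand.Read(b)           // crypto/rand via its alias
	fmt.Println(b, mrand.Intn(10)) // math/rand via its alias
}
```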

@sandeepsukhani force-pushed the boltdb-shipper-overhaul branch 5 times, most recently from 2a21e61 to 89ec425 on July 8, 2020 13:33
@sandeepsukhani (Contributor, Author) commented:

@cyriltovena Thanks for the feedback! I have addressed them all.
Please feel free to point out any problems that you see, if any.

return t.err
}

t.dbsMtx.RLock()
Contributor:

I think the code for waiting for initialization should come before this. I know that is already the case, but it should be included in this function and not in the table manager, because if I use Table without the table manager I can get deadlocked here.

Another option is to make Table not exported in this package.

Contributor:

It may be easier to just rename Table to table.

Contributor (Author):

I made the methods public to guarantee that their usage is concurrency safe. Table is not meant to be used without the table manager, and having public methods (for concurrency guarantees) on a private type would look weird. Given that we want to support using Table without the table manager, it would be better to move the readiness check here from the table manager.
What do you think?

Contributor:

Either you make Table => table or you move the readiness check here. Up to you!

Contributor (Author):

I have pushed the code. Thanks!
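
A rough sketch of the shape being discussed, i.e. having the table itself wait for initialization before taking the read lock so that callers without a table manager cannot deadlock (package, type, and field names are illustrative, not the PR's actual code):

```go
package downloads // hypothetical package layout

import (
	"context"
	"sync"
)

// table guards its set of dbs with a mutex and exposes a readiness signal so
// callers cannot deadlock by reading before initialization has finished.
type table struct {
	dbsMtx sync.RWMutex
	dbs    map[string]struct{} // stands in for the real boltdb handles
	ready  chan struct{}       // closed once initialization is done
	err    error
}

// Query waits for initialization before taking the read lock, so it is safe
// to call even without a table manager coordinating readiness.
func (t *table) Query(ctx context.Context) error {
	select {
	case <-t.ready:
	case <-ctx.Done():
		return ctx.Err()
	}

	if t.err != nil {
		return t.err
	}

	t.dbsMtx.RLock()
	defer t.dbsMtx.RUnlock()

	// ... iterate over t.dbs and run the query ...
	return nil
}
```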

	for {
		select {
		case <-syncTicker.C:
			err := tm.uploadTables(context.Background())
Contributor:

I don't think you should be using context.Background here.

Contributor:

What is the issue when you cancel this?

Contributor (Author):

I don't think we would ever want to cancel that context, not even while stopping, so that we stop the service only after uploading the files. We could give the context a minimum 10-minute timeout to avoid waiting on it forever. What do you think? Is that what you are looking for?

Contributor:

My problem is mostly that when using context.Background you're blocking the waitgroup, and you'll likely be killed without knowing when. I'm wondering if this could be a problem? Basically, being stopped before stopping gracefully.

First, I think you should cancel the loop, because the upload happens again anyway after stop. Am I wrong?

Second, a 10-minute timeout would just get shut down by Kubernetes, see https://cloud.google.com/blog/products/gcp/kubernetes-best-practices-terminating-with-grace:

At this point, Kubernetes waits for a specified time called the termination grace period. By default, this is 30 seconds. It's important to note that this happens in parallel to the preStop hook and the SIGTERM signal. Kubernetes does not wait for the preStop hook to finish.

Basically, you're going to get SIGKILL after 30s or more depending on what we have configured, certainly not 10 minutes.

Let's change the Close function to give a 1-minute timeout, and log whether it was cancelled or succeeded, so we can see if we get killed before finishing.

Contributor (Author):

If I am not wrong, our ingesters don't get killed within 30s of receiving SIGTERM, because we do chunk transfers and those can take a couple of minutes. I checked the config and it seems we set the termination grace period to 80 minutes, see:

deployment.mixin.spec.template.spec.withTerminationGracePeriodSeconds(4800),

@slim-bean (Collaborator) commented on Jul 23, 2020:

I think this context should have a timeout configured to a value less than the termination grace period, and we should have explicit info-level logging for when a table upload starts and completes.

I don't like the idea of hiding a table upload in the background without any visibility. Many people run Loki with much more aggressive shutdown requirements, and it should be clear from the logs whether table uploads succeeded or the process was killed before they completed. Having a timeout less than the grace period would ensure we log that the upload failed before we gave up and shut down.

Collaborator:

After some more discussion with @sandeepsukhani: since we already try to flush chunks forever on shutdown, it makes sense to try to upload tables forever on shutdown as well.

I would like to see explicit logging of both started and succeeded (as well as errors) so that someone can determine from their logs whether all tables were uploaded.
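
A minimal sketch of the agreed-upon shutdown behaviour, i.e. keep trying to upload on stop (no timeout) and log both the start and the outcome (the function and field names are illustrative, not the PR's actual code):

```go
package uploads // hypothetical package layout

import (
	"context"
	"sync"

	"github.com/go-kit/kit/log"
	"github.com/go-kit/kit/log/level"
)

// TableManager is reduced here to the pieces needed to sketch the shutdown path.
type TableManager struct {
	logger log.Logger
	quit   chan struct{}
	wg     sync.WaitGroup
}

// uploadTables stands in for the real upload of all dirty tables.
func (tm *TableManager) uploadTables(ctx context.Context) error { return nil }

// Stop stops the periodic sync loop and then attempts one final upload with
// no deadline, logging when the upload starts and whether it succeeded.
func (tm *TableManager) Stop() {
	close(tm.quit) // signals the ticker loop to exit
	tm.wg.Wait()

	level.Info(tm.logger).Log("msg", "uploading tables before shutdown")

	// context.Background: keep trying until the upload finishes, mirroring how
	// chunks are flushed "forever" on shutdown.
	if err := tm.uploadTables(context.Background()); err != nil {
		level.Error(tm.logger).Log("msg", "failed to upload tables during shutdown", "err", err)
		return
	}
	level.Info(tm.logger).Log("msg", "uploaded all tables before shutdown")
}
```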

@sandeepsukhani marked this pull request as ready for review on July 23, 2020 13:02
@slim-bean (Collaborator) left a review:

Looks Good To Get Merged To Me!

@sandeepsukhani merged commit e59adcc into grafana:master on Jul 28, 2020