
WIP: Size based retention #7927

Open
wants to merge 74 commits into main

Conversation


@Abuelodelanada Abuelodelanada commented Dec 13, 2022

What this PR does / why we need it:

This work-in-progress PR is a first approach to solving #6876.
I'm opening the PR at this stage to share the progress we have made and, above all, to gather suggestions about it!

Which issue(s) this PR fixes:

Fixes #6876

Special notes for your reviewer:

To activate the size-based retention policy, add the size_based_retention_percentage setting to the storage.filesystem section:

common:
  path_prefix: /tmp/loki
  storage:
    filesystem:
      chunks_directory: /tmp/loki/chunks
      rules_directory: /tmp/loki/rules
      size_based_retention_percentage: 2

In addition, retention_enabled must be set to true in the compactor section, for instance:

compactor:
  working_directory: /tmp/loki/retention
  shared_store: filesystem
  compaction_interval: 30s
  retention_enabled: true
  retention_delete_delay: 1m
  retention_delete_worker_count: 150

How to test:

  1. Build Loki: make loki
  2. Assuming Loki will store chunks on a filesystem whose disk usage is already near 2%, use a Loki config like this:
auth_enabled: false

server:
  http_listen_port: 3100
  grpc_listen_port: 9096

common:
  instance_addr: 127.0.0.1
  path_prefix: /tmp/loki
  storage:
    filesystem:
      chunks_directory: /tmp/loki/chunks
      rules_directory: /tmp/loki/rules
      size_based_retention_percentage: 2
  replication_factor: 1
  ring:
    kvstore:
      store: inmemory


storage_config:
  boltdb_shipper:
    active_index_directory: /tmp/loki/boltdb-shipper-active
    shared_store: filesystem
  filesystem:
    directory: /tmp/loki/chunks


schema_config:
  configs:
    - from: 2020-10-24
      store: boltdb-shipper
      object_store: filesystem
      schema: v11
      index:
        prefix: index_
        period: 24h

ingester:
  wal:
    enabled: true
    dir: /tmp/loki/chunks/wal
    flush_on_shutdown: true

compactor:
  working_directory: /tmp/loki/retention
  shared_store: filesystem
  compaction_interval: 30s
  retention_enabled: true
  retention_delete_delay: 1m
  retention_delete_worker_count: 150

limits_config:
  ingestion_rate_mb: 10
  3. Run Loki: ./cmd/loki/loki -config.file=cmd/loki/loki-local-config.yaml
  4. Send logs to Loki using, for instance, Promtail.
  5. Check that Loki actually receives the logs, for instance using Grafana.

  6. When disk usage reaches (or exceeds) 2%, Loki will start deleting chunk files and you will see in the logs:
level=info ts=2023-04-14T17:17:59.614606507Z caller=retention.go:135 msg="Detected disk usage percentage" diskUsage=52.61%
level=info ts=2023-04-14T17:17:59.614636609Z caller=retention.go:156 msg="Block size retention exceeded, removing chunks from file" filepath=miguelito-1681483439590834149-1681491600.gz
level=info ts=2023-04-14T17:18:59.611874456Z caller=retention.go:135 msg="Detected disk usage percentage" diskUsage=52.60%
level=info ts=2023-04-14T17:18:59.614467894Z caller=retention.go:135 msg="Detected disk usage percentage" diskUsage=52.60%
level=info ts=2023-04-14T17:18:59.614573707Z caller=retention.go:156 msg="Block size retention exceeded, removing chunks from file" filepath=compactor-1681492684.gz
  7. Loki will check the disk usage every minute (one way such a check could be implemented is sketched below).
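
For reference, here is a minimal, self-contained sketch of one way a disk usage percentage like the one in those log lines could be computed on Linux. It is purely illustrative: the helper name, the use of syscall.Statfs, and the hard-coded path are assumptions, not necessarily what this PR implements.

package main

import (
	"fmt"
	"syscall"
)

// diskUsagePercent returns the percentage of used space on the filesystem
// that contains path (Linux-only, via statfs).
func diskUsagePercent(path string) (float64, error) {
	var fs syscall.Statfs_t
	if err := syscall.Statfs(path, &fs); err != nil {
		return 0, err
	}
	total := fs.Blocks * uint64(fs.Bsize)
	available := fs.Bavail * uint64(fs.Bsize)
	used := total - available
	return 100 * float64(used) / float64(total), nil
}

func main() {
	usage, err := diskUsagePercent("/tmp/loki/chunks")
	if err != nil {
		panic(err)
	}
	fmt.Printf("diskUsage=%.2f%%\n", usage)
}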

Doubts and considerations

Settings

Although it is working, the solution could be improved a lot; for instance, we currently need two settings in the config file:

  • size_based_retention_percentage in the storage.filesystem section
  • retention_enabled: true in the compactor section.

I would love to avoid the need for retention_enabled: true....

About the author

Since I'm mostly a Python/PHP software engineer writing my first lines of Go, and this is my first contact with the Loki codebase, there are surely many things that can be improved. Please feel free to comment! 😄

Checklist

  • Reviewed the CONTRIBUTING.md guide
  • Documentation added
  • Tests updated
  • CHANGELOG.md updated
  • Changes that require user attention or interaction to upgrade are documented in docs/sources/upgrading/_index.md

@Abuelodelanada Abuelodelanada requested a review from a team as a code owner December 13, 2022 13:44

CLAassistant commented Dec 13, 2022

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you all sign our Contributor License Agreement before we can accept your contribution.
2 out of 3 committers have signed the CLA.

✅ MasslessParticle
✅ Abuelodelanada
❌ rbarry82

@Abuelodelanada
Author

Hi @MasslessParticle

After some weeks of radio silence (we were working hard), we are returning to this PR.

We are at a point where this is working roughly as expected, using a config file like this one:

auth_enabled: false

server:
  http_listen_port: 3100
  grpc_listen_port: 9096

common:
  instance_addr: 127.0.0.1
  path_prefix: /tmp/loki
  storage:
    filesystem:
      chunks_directory: /tmp/loki/chunks
      rules_directory: /tmp/loki/rules
      size_based_retention_percentage: 2
  replication_factor: 1
  ring:
    kvstore:
      store: inmemory


storage_config:
  boltdb_shipper:
    active_index_directory: /tmp/loki/boltdb-shipper-active
    shared_store: filesystem
  filesystem:
    directory: /tmp/loki/chunks


schema_config:
  configs:
    - from: 2020-10-24
      store: boltdb-shipper
      object_store: filesystem
      schema: v11
      index:
        prefix: index_
        period: 24h

ingester:
  wal:
    enabled: true
    dir: /tmp/loki/chunks/wal
    flush_on_shutdown: true

compactor:
  working_directory: /tmp/loki/retention
  shared_store: filesystem
  compaction_interval: 30s
  retention_enabled: true
  retention_delete_delay: 1m
  retention_delete_worker_count: 150

limits_config:
  ingestion_rate_mb: 10

Note the size_based_retention_percentage under the common section and retention_enabled: true under compactor.

Could you give it a try and let us know your thoughts?

@MasslessParticle
Contributor

@Abuelodelanada 🎉

I'm buried at the moment but I will test when I get the chance!


simskij commented May 31, 2023

@Abuelodelanada 🎉

I'm buried at the moment but I will test when I get the chance!

@MasslessParticle

Hey mate :)

Sorry for being a pain in the ass. We're really keen on getting this functionality into shape. Would you mind giving it another review?

Thanks,
Simme

Contributor

@MasslessParticle MasslessParticle left a comment

Sorry again for the delay in looking at this. I can't run it because things have drifted, but I have done a review.

This is looking good and makes sense. I've added comments about a couple of concerns. I'm also concerned that there's a lot of logic here and very few tests. I notice that you merged my original suggestions. That's fair to avoid rework, but those were illustrative and should probably be tested.

pkg/loki/modules.go
return nil
}

func (c *Compactor) sizeBasedCompactionInterval(ctx context.Context) error {
if exceeded, err := c.sizeBasedRetention.ThresholdExceeded(); !exceeded {
Contributor

This doesn't handle the error when exceeded == true

Author

Hi @MasslessParticle

We are not handling this situation because the function ThresholdExceeded() never returns the combination true, err. The possible returns are:

  • false, err
  • false, nil
  • true, nil

Do you think we should modify this anyway?
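
For reference, a shape consistent with those three combinations looks roughly like this. It is only an illustration: the type, its fields, and the diskUsagePercent helper are assumptions rather than the PR's exact code.

// Illustrative only: the error path can only return false, and the success
// path always returns a nil error, so (true, err) never occurs.
type SizeBasedRetention struct {
	chunksDirectory     string
	retentionPercentage float64
}

func (s *SizeBasedRetention) ThresholdExceeded() (bool, error) {
	usage, err := diskUsagePercent(s.chunksDirectory) // hypothetical helper
	if err != nil {
		return false, err
	}
	return usage >= s.retentionPercentage, nil
}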

Contributor

For uniform error handling, I think it would be best to modify this to handle true, err. This guards against future changes to ThresholdExceeded.

Author

Reviewing the code again, it seems there is no point in checking all the combinations.

We need just two guards:

  • log a message and return nil if there is an error, or
  • return nil if the threshold is not exceeded.

In any other situation, we continue with the execution.

I'm pushing these changes with less nesting; let me know your thoughts.
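
For illustration, the guard structure described above could look roughly like this (a sketch only; the logger and the runSizeBasedRetention call are assumptions and may not match the exact code that was pushed):

func (c *Compactor) sizeBasedCompactionInterval(ctx context.Context) error {
	exceeded, err := c.sizeBasedRetention.ThresholdExceeded()
	if err != nil {
		// Guard 1: log and return nil so a failed check does not abort the compactor.
		level.Error(util_log.Logger).Log("msg", "failed to check size-based retention threshold", "err", err)
		return nil
	}
	if !exceeded {
		// Guard 2: nothing to do until the disk usage threshold is reached.
		return nil
	}
	// Otherwise, continue with size-based retention.
	return c.runSizeBasedRetention(ctx)
}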

Contributor

This makes sense. Do we return nil on error to not exit compactions? That makes sense, but we should probably add a metric here so an alert can be raised on successive failures.
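
For example, such a metric could take roughly this shape (the metric name and where it is registered are assumptions, not part of this PR):

import (
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

// Registered once, e.g. in the compactor's init, against its prometheus.Registerer.
func newSizeBasedRetentionCheckFailures(r prometheus.Registerer) prometheus.Counter {
	return promauto.With(r).NewCounter(prometheus.CounterOpts{
		Namespace: "loki",
		Name:      "compactor_size_based_retention_check_failures_total",
		Help:      "Total number of failed size-based retention threshold checks.",
	})
}

Incrementing that counter on the error path in sizeBasedCompactionInterval would make successive failures visible to an alert.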

pkg/storage/stores/indexshipper/compactor/compactor.go
@@ -90,7 +227,17 @@ type Marker struct {
markTimeout time.Duration
}

var mu sync.Mutex
Contributor

I don't think we need the mutex here. The compactor should probably own the marker metrics and pass them to the markers.

Author

I did not implement this.
This section of the code was written by @rbarry82, who sadly passed away a month ago.

Do you mean something like this?

func NewMarker(workingDirectory string, expiration ExpirationChecker, markTimeout time.Duration, chunkClient client.Client, metrics *markerMetrics) (*Marker, error) {
	return &Marker{
		workingDirectory: workingDirectory,
		expiration:       expiration,
		markerMetrics:    metrics,
		chunkClient:      chunkClient,
		markTimeout:      markTimeout,
	}, nil
}

Contributor

I'm sorry to hear about the loss of your colleague 🙁

Metrics can only be registered once; if we try to register them a second time, a runtime panic occurs. The code as-is prevents registering a second time, but it's a little awkward.

I think we could change

func NewMarker(workingDirectory string, expiration ExpirationChecker, markTimeout time.Duration, chunkClient client.Client, r prometheus.Registerer) (*Marker, error)

to take metrics like this

func NewMarker(workingDirectory string, expiration ExpirationChecker, markTimeout time.Duration, chunkClient client.Client, metrics *markerMetrics) (*Marker, error)

But then we need a place to instantiate the metrics. We could probably do it in the compactor's init function and make it a field on the compactor.

If we do this, the mutex can be deleted altogether.

Author

About the mutex: now I remember that Ryan added the following to the Compactor struct:

	// Size based compaction means that two compactions might try to happen at the same time.
	// Use this to ensure size-based and normal compaction can't step on eachother.
	compactionMtx sync.Mutex

Does it make sense, or was he wrong?
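
For context, a minimal sketch of how such a shared mutex could keep the two compaction paths from overlapping (the method names are illustrative, not the PR's exact code):

func (c *Compactor) runCompaction(ctx context.Context) error {
	c.compactionMtx.Lock()
	defer c.compactionMtx.Unlock()
	// ... normal, interval-based compaction ...
	return nil
}

func (c *Compactor) runSizeBasedRetention(ctx context.Context) error {
	c.compactionMtx.Lock()
	defer c.compactionMtx.Unlock()
	// ... size-based chunk removal ...
	return nil
}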

About your suggestion, I have this draft right now (I haven't pushed it yet)... Is it close to what you suggest?

modified   pkg/storage/stores/indexshipper/compactor/compactor.go
@@ -182,6 +182,7 @@ type Compactor struct {
 	// Size based compaction means that two compactions might try to happen at the same time.
 	// Use this to ensure size-based and normal compaction can't step on eachother.
 	compactionMtx sync.Mutex
+	markerMetrics *retention.markerMetrics
 
 	// one for each object store
 	storeContainers map[string]storeContainer
@@ -310,6 +311,8 @@ func (c *Compactor) init(objectStoreClients map[string]client.ObjectClient, sche
 		}
 	}
 
+	c.markerMetrics = retention.newMarkerMetrics(r)
+
 	c.storeContainers = make(map[string]storeContainer, len(objectStoreClients))
 	for objectStoreType, objectClient := range objectStoreClients {
 		var sc storeContainer
@@ -338,7 +341,7 @@ func (c *Compactor) init(objectStoreClients map[string]client.ObjectClient, sche
 				return fmt.Errorf("failed to init sweeper: %w", err)
 			}
 
-			sc.tableMarker, err = retention.NewMarker(retentionWorkDir, c.expirationChecker, c.cfg.RetentionTableTimeout, chunkClient, r)
+			sc.tableMarker, err = retention.NewMarker(retentionWorkDir, c.expirationChecker, c.cfg.RetentionTableTimeout, chunkClient, c.markerMetrics)
 			if err != nil {
 				return fmt.Errorf("failed to init table marker: %w", err)
 			}
modified   pkg/storage/stores/indexshipper/compactor/retention/retention.go
@@ -5,9 +5,9 @@ import (
 	"context"
 	"errors"
 	"fmt"
-	"sync"
 	"os"
 	"path/filepath"
+	"sync"
 
 	"time"
 
@@ -227,21 +227,11 @@ type Marker struct {
 	markTimeout      time.Duration
 }
 
-var mu sync.Mutex
-var metrics *markerMetrics
-
-func NewMarker(workingDirectory string, expiration ExpirationChecker, markTimeout time.Duration, chunkClient client.Client, r prometheus.Registerer) (*Marker, error) {
-	mu.Lock()
-	defer mu.Unlock()
-
-	if metrics == nil {
-		metrics = newMarkerMetrics(r)
-	}
-
+func NewMarker(workingDirectory string, expiration ExpirationChecker, markTimeout time.Duration, chunkClient client.Client, metrics *markerMetrics) (*Marker, error) {
 	return &Marker{
 		workingDirectory: workingDirectory,
 		expiration:       expiration,
-		markerMetrics:    newMarkerMetrics(r),
+		markerMetrics:    metrics,
 		chunkClient:      chunkClient,
 		markTimeout:      markTimeout,
 	}, nil

@Abuelodelanada
Author

Sorry again for the delay in looking at this. I can't run it because things have drifted, but I have done a review.

This is looking good and makes sense. I've added comments about a couple of concerns. I'm also concerned that there's a lot of logic here and very few tests. I notice that you merged my original suggestions. That's fair to avoid rework, but those were illustrative and should probably be tested.

I'm working on the test right now.


simskij commented Jul 10, 2023

@MasslessParticle

This is taking a lot longer than we expected, as we've bumped into quite a lot of issues with the delta between the point where we originally forked and the current state. In combination with @rbarry82's unfortunate passing, things are going slower than usual. Work is, however, still ongoing, and we're still looking to get this across the finish line.

@VenkateswaranJ

Any update on this?


simskij commented Dec 12, 2023

Any update on this?

Some, but nothing substantial. Loki moves quite rapidly, and as this feature touches multiple parts of the code base, it's been somewhat of a moving target. I'll get back to you all at the beginning of next year with more details.

@mkjpryor

We would absolutely love this feature. Any progress so far this year?

Labels
size/XL, type/docs

Development
Successfully merging this pull request may close these issues:
Loki Block Disk Space Cleanup Proposal

9 participants