
storage.tsdb.retention.size not being respected >15d #5213

Closed
TimSimmons opened this Issue Feb 13, 2019 · 5 comments

3 participants
TimSimmons commented Feb 13, 2019

What did you do?

Started an existing Prometheus instance (previously with 15d retention on v2.3.1 with --storage.tsdb.retention.time=15d) on v2.7.1 with --storage.tsdb.retention.size 473436089549B

What did you expect to see?

Retention slowly creep upwards until it used that many bytes of disk.

What did you see instead? Under which circumstances?

Retention inched up past 15d and then went back down, never exceeding 16 days.

Environment

  • System information:

    Linux 4.4.0-112-generic x86_64

Prometheus version:

prometheus, version 2.7.1 (branch: HEAD, revision: 62e591f928ddf6b3468308b7ac1de1c63aa7fcf3)
  build user:       root@f9f82868fc43
  build date:       20190131-11:16:59
  go version:       go1.11.5

Prometheus configuration file:

global:
  scrape_interval: 5m
  scrape_timeout: 30s
  evaluation_interval: 1m
alerting:
  alertmanagers:
  - static_configs:
    - targets:
      - <not important>
    scheme: http
    timeout: 10s
rule_files:
- /opt/prometheus/rules/*
- /opt/prometheus/staticrules/*
scrape_configs:
- job_name: <not important>
  scrape_interval: 5m
  scrape_timeout: 30s
  metrics_path: /metrics
  scheme: http
  file_sd_configs:
  - files:
    - /opt/prometheus/services/<not important>.json
    refresh_interval: 5m
- job_name: <not important>
  scrape_interval: 5m
  scrape_timeout: 30s
  metrics_path: /metrics
  scheme: http
  file_sd_configs:
  - files:
    - /opt/prometheus/services/<not important>.json
    refresh_interval: 5m

Logs:

Nothing out of the ordinary

February 13th 2019, 15:05:26.749	compact blocks
February 13th 2019, 15:03:26.172	WAL checkpoint complete
February 13th 2019, 15:02:51.239	head GC completed
February 13th 2019, 15:02:28.742	write block
February 13th 2019, 13:03:39.905	WAL checkpoint complete
February 13th 2019, 13:03:04.751	head GC completed
February 13th 2019, 13:02:40.954	write block
February 13th 2019, 11:03:21.124	WAL checkpoint complete
February 13th 2019, 11:02:48.191	head GC completed
February 13th 2019, 11:02:27.659	write block
February 13th 2019, 09:05:09.337	compact blocks
February 13th 2019, 09:03:15.393	WAL checkpoint complete
February 13th 2019, 09:02:41.952	head GC completed
February 13th 2019, 09:02:21.303	write block
February 13th 2019, 07:03:19.826	WAL checkpoint complete
February 13th 2019, 07:02:46.056	head GC completed
February 13th 2019, 07:02:25.234	write block
February 13th 2019, 05:03:23.665	WAL checkpoint complete
February 13th 2019, 05:02:49.790	head GC completed
February 13th 2019, 05:02:28.875	write block

Other Information:

$ sudo ./tsdb ls /data/prometheus/
BLOCK ULID                  MIN TIME       MAX TIME       NUM SAMPLES  NUM CHUNKS  NUM SERIES
01D2DQ3CJSPJ9WAHC5ZMFK645A  1548720000000  1548784800000  2033912841   85100495    9995467
01D2FMYPYDZE3X9J6KX98BXJEZ  1548784800000  1548849600000  2035487860   85107889    9938513
01D2HJPKEV3QQ9FXYNJF7X8FB1  1548849600000  1548914400000  2036307983   85333785    10151497
01D2KGFMC804DMYXCQ5ZSHBW4G  1548914400000  1548979200000  2035639475   85238746    10066309
01D2NE8Y5FFE9DQQWG4Z6RNMZ3  1548979200000  1549044000000  2034538485   85162125    10029674
01D2QC0H4JJRWP91J0R7WKBB9Q  1549044000000  1549108800000  2031040667   85160175    10181466
01D2S9SRHWRMBAM849C0KFZZTY  1549108800000  1549173600000  2037159680   85474252    10259224
01D2V7KDP0CH18RCP53FWB9500  1549173600000  1549238400000  2029881220   85086840    10147053
01D2X5C9MY5NYS8SX7DHJ5V3HK  1549238400000  1549303200000  2023600687   84871180    10163861
01D2Z35YV1S9GREGKSGEN3EHMB  1549303200000  1549368000000  2025406021   85005889    10223034
01D310Z7FTWZ5EXR88FAKA3DDZ  1549368000000  1549432800000  2024534227   84926831    10179083
01D32YS40V5ZP6PXPE7BAGGAVW  1549432800000  1549497600000  2026813092   85034921    10230400
01D34WKEBMA62773ATKDAZF9SK  1549497600000  1549562400000  2035551468   85420781    10265665
01D36TCDKJ9J63BED52QBC1EY9  1549562400000  1549627200000  2034484437   85200475    10091533
01D38R5D3Q89THQRNA7KG6X6PZ  1549627200000  1549692000000  2035339787   85298286    10156386
01D3ANZ8X3QVBGDMXYZH9GEZQ3  1549692000000  1549756800000  2033399177   85122753    10012874
01D3CKSHRGWAQ77KWGAPV4VR6W  1549756800000  1549821600000  2034492551   85143116    10042069
01D3EHJS9E23ARD8Q2DY8D9Q31  1549821600000  1549886400000  2035898560   85122306    9953228
01D3GF93CAS4RHCZ18RQ91J0DT  1549886400000  1549951200000  2025739689   84998617    9928608
01D3JD21NDRMJY9P5WD2DBYAC3  1549951200000  1550016000000  2033888680   85085903    9985246
01D3K1H9H34Q4WVW6HNHY7VB66  1550016000000  1550037600000  677435398    28323940    9569766
01D3KP4TRBMDS0KH64H187AHW0  1550037600000  1550059200000  677758634    28342090    9574730
01D3KNYAC8R72FA7MR455ACRJK  1550059200000  1550066400000  226214865    9475804     9475804
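The MIN TIME / MAX TIME columns in the `tsdb ls` output are Unix timestamps in milliseconds. Converting the oldest block's MIN TIME and the newest block's MAX TIME (an illustrative check, not part of the original report) shows the retained window topping out just under 16 days, matching the behaviour described above:

```python
from datetime import datetime, timezone

# MIN/MAX TIME in the `tsdb ls` output are Unix timestamps in milliseconds.
oldest_min_ms = 1548720000000  # first block's MIN TIME
newest_max_ms = 1550066400000  # last block's MAX TIME

span_days = (newest_max_ms - oldest_min_ms) / 86_400_000  # ms per day
start = datetime.fromtimestamp(oldest_min_ms / 1000, tz=timezone.utc)
end = datetime.fromtimestamp(newest_max_ms / 1000, tz=timezone.utc)

print(f"{start:%Y-%m-%d %H:%M} -> {end:%Y-%m-%d %H:%M} ({span_days:.2f} days)")
# -> 2019-01-29 00:00 -> 2019-02-13 14:00 (15.58 days)
```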

[screenshot: graph of retention creeping up past 15d and dropping back]

$ ps aux | grep prometheus
prometh+  1537  230 35.9 163097032 23710144 ?  Ssl  16:26  27:57 
/opt/prometheus/bin/prometheus 
--config.file /opt/prometheus/prometheus.yml 
--web.enable-lifecycle 
--web.enable-admin-api 
--log.level info 
--log.format json 
--storage.tsdb.path /data/prometheus 
--storage.tsdb.retention.size 473436089549B 
--query.max-samples 5000000 
--query.max-concurrency 16
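The `--storage.tsdb.retention.size` flag takes a plain byte count. As a quick unit conversion (my arithmetic, not part of the original report), the value above works out to roughly 441 GiB, comfortably within the disk shown by `df` below:

```python
size_bytes = 473_436_089_549  # value passed to --storage.tsdb.retention.size

print(f"{size_bytes / 2**30:.1f} GiB")  # binary units -> 440.9 GiB
print(f"{size_bytes / 10**9:.1f} GB")   # decimal units -> 473.4 GB
```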

[screenshot of the /flags page]

$ df -h
Filesystem      Size  Used Avail Use% Mounted on
udev             32G   12K   32G   1% /dev
tmpfs           6.3G  336K  6.3G   1% /run
/dev/vda1       1.3T  145G  1.1T  12% /
none            4.0K     0  4.0K   0% /sys/fs/cgroup
none            5.0M     0  5.0M   0% /run/lock
none             32G     0   32G   0% /run/shm
none            100M     0  100M   0% /run/user
/dev/vda15      105M  3.2M  102M   4% /boot/efi

The only thing that seemed odd to me is that the time-based retention parameter on the flags page still showed 15d, so maybe that was still in effect. Or maybe upgrading from 2.3.1 broke something?

Anything else?

Thank you for taking the time to read my bug report, any time you spend on this is greatly appreciated. Open source maintainership is hard and you all are doing a great job 👍

gouthamve (Member) commented Feb 13, 2019

Ah, you need to set the time retention too. When that flag is not set, it defaults to 15d, and we respect that value as well. I thought I had made this clear in the docs; will make it clearer.

And, thanks for this super clear bug report!

TimSimmons (Author) commented Feb 13, 2019

oooooooooooooh I see. So to be clear, I should set the retention time to some amount, but if I were going to use more storage than the retention size, it would respect the retention size rather than retention time. But I'd imagine retention time still informs the size of the blocks, so I should endeavor to set that somewhat correctly?

gouthamve (Member) commented Feb 13, 2019

Yes, we check both size and time, and if we cross either limit we start deleting data. We have limits on block sizes now, so it should be safe to set retention to 100 years :)
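The rule described above, that data is deleted once either limit is crossed, can be sketched roughly like this (a simplified illustration, not Prometheus's actual TSDB code; all names and block values here are hypothetical):

```python
def blocks_to_delete(blocks, now_ms, retention_time_ms=0, retention_size_bytes=0):
    """blocks: (max_time_ms, size_bytes) pairs. A zero limit means 'unset'.

    Walking from newest to oldest, a block becomes deletable once its data
    is older than the time limit OR the running size total exceeds the
    size limit -- whichever is crossed first.
    """
    deletable, total = [], 0
    for max_time_ms, size in sorted(blocks, key=lambda b: b[0], reverse=True):
        total += size
        too_old = retention_time_ms and now_ms - max_time_ms > retention_time_ms
        too_big = retention_size_bytes and total > retention_size_bytes
        if too_old or too_big:
            deletable.append(max_time_ms)
    return deletable
```

With both limits set, whichever one a block crosses first wins, which is why size retention alone could not extend the data past the implicit 15d time default in this issue.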

TimSimmons (Author) commented Feb 13, 2019

awesome, I probably will do that. When I ran the query above, some of these servers were looking at ~15 days of data, others more like 10 years. Not sure how that will actually work in practice, but I'm excited to try :)

krasi-georgiev (Member) commented Feb 14, 2019

@TimSimmons just submitted a PR that removes the confusion: the default time retention will be used only when neither the time nor the size flag is set.

krasi-georgiev added a commit that referenced this issue Feb 19, 2019

use the default time retention value only when no size retention is set (#5216)

fixes #5213

Now that we have both time-based and size-based retention, the time-based retention should not have a default value. A default is set only when neither the time nor the size flag is set.

This change will not affect current installations that rely on the default time-based value, and it avoids confusion when only the size retention is set and the default time-based setting is expected to no longer apply.

Signed-off-by: Krasi Georgiev <kgeorgie@redhat.com>
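The new default rule the commit describes amounts to something like this (a hypothetical sketch of the behaviour, not the actual PR code; the function name and return shape are my own):

```python
DEFAULT_RETENTION_TIME_MS = 15 * 24 * 3600 * 1000  # the old 15d default

def effective_retention(time_ms=None, size_bytes=None):
    """Apply the 15d time default only when NEITHER flag was set.

    Returns (time_limit_ms, size_limit_bytes); 0 means 'no limit'.
    """
    if time_ms is None and size_bytes is None:
        return DEFAULT_RETENTION_TIME_MS, 0
    return time_ms or 0, size_bytes or 0
```

Under this rule, setting only `--storage.tsdb.retention.size` no longer leaves the implicit 15d time cap in place, which was the behaviour that surprised the reporter.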