clean_tombstone not removing old job labels #3728

Closed
keithf4 opened this Issue Jan 23, 2018 · 12 comments

Trying out the new clean_tombstones API in 2.1 to get rid of old data, it only seems to be partially working. Referencing another ticket (#3584), I thought this new clean method would get rid of old job labels if they contained no data. I've run through the steps below several times, and while it does appear to clean up the data, the old job label is not removed even if no data exists for it. Not sure if I have to wait longer or something, but I was hoping to get back the same functionality that existed pre-2.0 for cleaning up old jobs, where a single run of a script could clean everything up quickly.
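To be concrete, the kind of one-shot script I have in mind is just the two admin calls chained together, something like the sketch below (illustrative only; it uses the same delete_series and clean_tombstones endpoints exercised with curl further down, and assumes the admin API is enabled):

// cleanup_job.go -- sketch of a one-shot cleanup: delete every series for a
// job, then clean the tombstones. Illustrative only, not part of Prometheus.
package main

import (
    "fmt"
    "log"
    "net/http"
    "net/url"
    "os"
)

func post(u string) {
    resp, err := http.Post(u, "", nil)
    if err != nil {
        log.Fatal(err)
    }
    resp.Body.Close()
    fmt.Println(u, "->", resp.Status)
}

func main() {
    if len(os.Args) != 2 {
        log.Fatalf("usage: %s <job-name>", os.Args[0])
    }
    job := os.Args[1]

    q := url.Values{}
    q.Set("match[]", fmt.Sprintf(`{job=%q}`, job))

    post("http://localhost:9090/api/v1/admin/tsdb/delete_series?" + q.Encode())
    post("http://localhost:9090/api/v1/admin/tsdb/clean_tombstones")
}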

Prometheus data directory size before tombstone cleanup:
315MB/322048KB

I've got two jobs (Prod & Replica).

curl http://localhost:9090/api/v1/label/job/values

{"status":"success","data":["Prod","Replica"]}
curl http://localhost:9090/api/v1/targets

{
   "status":"success",
   "data":{
      "activeTargets":[
         {
            "discoveredLabels":{
               "__address__":"127.0.0.1:9100",
               "__meta_filepath":"/etc/prometheus/auto.d/ProductionDB.yml",
               "__metrics_path__":"/metrics",
               "__scheme__":"http",
               "job":"Prod"
            },
            "labels":{
               "instance":"127.0.0.1:9100",
               "job":"Prod"
            },
            "scrapeUrl":"http://127.0.0.1:9100/metrics",
            "lastError":"",
            "lastScrape":"2018-01-23T11:36:54.423586172-05:00",
            "health":"up"
         },
         {
            "discoveredLabels":{
               "__address__":"127.0.0.1:9187",
               "__meta_filepath":"/etc/prometheus/auto.d/ProductionDB.yml",
               "__metrics_path__":"/metrics",
               "__scheme__":"http",
               "job":"Prod"
            },
            "labels":{
               "instance":"127.0.0.1:9187",
               "job":"Prod"
            },
            "scrapeUrl":"http://127.0.0.1:9187/metrics",
            "lastError":"",
            "lastScrape":"2018-01-23T11:37:01.988456947-05:00",
            "health":"up"
         },
         {
            "discoveredLabels":{
               "__address__":"127.0.0.1:9188",
               "__meta_filepath":"/etc/prometheus/auto.d/Replica.yml",
               "__metrics_path__":"/metrics",
               "__scheme__":"http",
               "job":"Replica"
            },
            "labels":{
               "instance":"127.0.0.1:9188",
               "job":"Replica"
            },
            "scrapeUrl":"http://127.0.0.1:9188/metrics",
            "lastError":"",
            "lastScrape":"2018-01-23T11:36:47.319521388-05:00",
            "health":"up"
         }
      ]
   }
}

Query to show that Replica data does exist

curl -g 'http://localhost:9090/api/v1/query_range?query=ccp_connection_stats_active{job="Replica"}&start=2018-01-22T16:30:30.781Z&end=2018-01-22T20:30:00.781Z&step=15s'

{"status":"success","data":{"resultType":"matrix","result":[{"metric":{"__name__":"ccp_connection_stats_active","instance":"127.0.0.1:9188","job":"Replica"},"values":[[1516652175.781,"1"],[1516652190.781,"1"],[1516652205.781,"1"],[1516652220.781,"1"],[1516652235.781,"1"],[1516652250.781,"1"],[1516652265.781,"1"],[1516652280.781,"1"],[1516652295.781,"1"],[1516652310.781,"1"],[1516652325.781,"1"],[1516652340.781,"1"],[1516652355.781,"1"],[1516652370.781,"1"],[1516652385.781,"1"],[1516652400.781,"1"],[1516652415.781,"1"],[1516652430.781,"1"],[1516652445.781,"1"],[1516652460.781,"1"],[1516652475.781,"1"],[1516652490.781,"1"],[1516652505.781,"1"],[1516652520.781,"1"],[1516652535.781,"1"],[1516652550.781,"1"],[1516652565.781,"1"],[1516652580.781,"1"],[1516652595.781,"1"],[1516652610.781,"1"],[1516652625.781,"1"],[1516652640.781,"1"],[1516652655.781,"1"],[1516652670.781,"1"],[1516652685.781,"1"],[1516652700.781,"1"],[1516652715.781,"1"],[1516652730.781,"1"],[1516652745.781,"1"],[1516652760.781,"1"],[1516652775.781,"1"],[1516652790.781,"1"],[1516652805.781,"1"],[1516652820.781,"1"],[1516652835.781,"1"],[1516652850.781,"1"],[1516652865.781,"1"],[1516652880.781,"1"],[1516652895.781,"1"],[1516652910.781,"1"],[1516652925.781,"1"],[1516652940.781,"1"],[1516652955.781,"1"],[1516652970.781,"1"],[1516652985.781,"1"],[1516653000.781,"1"]]}]}}

This shows that I've removed the Replica target from Prometheus, so no further data will be collected for it.

curl http://localhost:9090/api/v1/targets

{
   "status":"success",
   "data":{
      "activeTargets":[
         {
            "discoveredLabels":{
               "__address__":"127.0.0.1:9100",
               "__meta_filepath":"/etc/prometheus/auto.d/ProductionDB.yml",
               "__metrics_path__":"/metrics",
               "__scheme__":"http",
               "job":"Prod"
            },
            "labels":{
               "instance":"127.0.0.1:9100",
               "job":"Prod"
            },
            "scrapeUrl":"http://127.0.0.1:9100/metrics",
            "lastError":"",
            "lastScrape":"2018-01-23T11:44:24.423177292-05:00",
            "health":"up"
         },
         {
            "discoveredLabels":{
               "__address__":"127.0.0.1:9187",
               "__meta_filepath":"/etc/prometheus/auto.d/ProductionDB.yml",
               "__metrics_path__":"/metrics",
               "__scheme__":"http",
               "job":"Prod"
            },
            "labels":{
               "instance":"127.0.0.1:9187",
               "job":"Prod"
            },
            "scrapeUrl":"http://127.0.0.1:9187/metrics",
            "lastError":"",
            "lastScrape":"2018-01-23T11:44:31.988114694-05:00",
            "health":"up"
         }
      ]
   }
}

Delete all Replica data

curl -X POST -i -g 'http://localhost:9090/api/v1/admin/tsdb/delete_series?match[]={job="Replica"}'

HTTP/1.1 204 No Content
Access-Control-Allow-Headers: Accept, Authorization, Content-Type, Origin
Access-Control-Allow-Methods: GET, OPTIONS
Access-Control-Allow-Origin: *
Access-Control-Expose-Headers: Date
Date: Tue, 23 Jan 2018 16:45:46 GMT

Same query above now returns no data

curl -g 'http://localhost:9090/api/v1/query_range?query=ccp_connection_stats_active{job="Replica"}&start=2018-01-22T16:30:30.781Z&end=2018-01-22T20:30:00.781Z&step=15s'

{"status":"success","data":{"resultType":"matrix","result":[]}}

Query looking for any Replica data returns nothing

curl -i -g 'http://localhost:9090/api/v1/series?match[]={job="Replica"}'

HTTP/1.1 200 OK
Access-Control-Allow-Headers: Accept, Authorization, Content-Type, Origin
Access-Control-Allow-Methods: GET, OPTIONS
Access-Control-Allow-Origin: *
Access-Control-Expose-Headers: Date
Content-Type: application/json
Date: Tue, 23 Jan 2018 16:46:44 GMT
Content-Length: 30

clean_tombstones returns successfully

curl -X POST -i http://localhost:9090/api/v1/admin/tsdb/clean_tombstones
HTTP/1.1 204 No Content
Access-Control-Allow-Headers: Accept, Authorization, Content-Type, Origin
Access-Control-Allow-Methods: GET, OPTIONS
Access-Control-Allow-Origin: *
Access-Control-Expose-Headers: Date
Date: Tue, 23 Jan 2018 16:49:06 GMT

Prometheus data directory size after tombstone cleanup is definitely smaller, so I'm assuming data was cleared from disk:
307MB/313560KB

But the label is still there:

curl -i http://localhost:9090/api/v1/label/job/values
HTTP/1.1 200 OK
Access-Control-Allow-Headers: Accept, Authorization, Content-Type, Origin
Access-Control-Allow-Methods: GET, OPTIONS
Access-Control-Allow-Origin: *
Access-Control-Expose-Headers: Date
Content-Type: application/json
Date: Tue, 23 Jan 2018 16:52:39 GMT
Content-Length: 46

Waited a while to ensure no new data, ran the clean again, and the label is still there:

curl -i -g 'http://localhost:9090/api/v1/series?match[]={job="Replica"}'
HTTP/1.1 200 OK
Access-Control-Allow-Headers: Accept, Authorization, Content-Type, Origin
Access-Control-Allow-Methods: GET, OPTIONS
Access-Control-Allow-Origin: *
Access-Control-Expose-Headers: Date
Content-Type: application/json
Date: Tue, 23 Jan 2018 18:14:18 GMT
Content-Length: 30

curl -X POST -i http://localhost:9090/api/v1/admin/tsdb/clean_tombstones
HTTP/1.1 204 No Content
Access-Control-Allow-Headers: Accept, Authorization, Content-Type, Origin
Access-Control-Allow-Methods: GET, OPTIONS
Access-Control-Allow-Origin: *
Access-Control-Expose-Headers: Date
Date: Tue, 23 Jan 2018 18:14:57 GMT

curl -i http://localhost:9090/api/v1/label/job/values
HTTP/1.1 200 OK
Access-Control-Allow-Headers: Accept, Authorization, Content-Type, Origin
Access-Control-Allow-Methods: GET, OPTIONS
Access-Control-Allow-Origin: *
Access-Control-Expose-Headers: Date
Content-Type: application/json
Date: Tue, 23 Jan 2018 18:15:22 GMT
Content-Length: 46

{"status":"success","data":["Prod","Replica"]}
gouthamve (Member) commented Jan 25, 2018

Thanks for the detailed report. I can replicate the issue, and I think the reason is that we are not currently cleaning up data from the in-mem block and hence are still returning it. Without diving too much into the code, I think removing the data and calling headblock.gc() should fix this.
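To make that concrete, here is a toy model of what appears to be happening (my own sketch, not Prometheus/tsdb code; every type and method name below is invented for illustration): deleting a series only records a tombstone in the head, while the index behind /api/v1/label/job/values keeps walking the tombstoned series until something like a head GC actually drops them.

// why_labels_linger.go -- toy model, NOT Prometheus/tsdb code.
package main

import "fmt"

type series struct {
    lbls map[string]string
}

type head struct {
    series     []*series
    tombstones map[int]bool // index into series -> marked deleted
}

// delete_series analogue: only records tombstones, leaves the series in place.
func (h *head) deleteJob(job string) {
    for i, s := range h.series {
        if s.lbls["job"] == job {
            h.tombstones[i] = true
        }
    }
}

// label-values analogue: walks every series the head still knows about,
// including tombstoned ones that were never garbage-collected.
func (h *head) labelValues(name string) []string {
    seen := map[string]bool{}
    vals := []string{}
    for _, s := range h.series {
        if v, ok := s.lbls[name]; ok && !seen[v] {
            seen[v] = true
            vals = append(vals, v)
        }
    }
    return vals
}

// headblock.gc() analogue: actually drops tombstoned series, so the label
// index no longer sees them.
func (h *head) gc() {
    kept := []*series{}
    for i, s := range h.series {
        if !h.tombstones[i] {
            kept = append(kept, s)
        }
    }
    h.series = kept
    h.tombstones = map[int]bool{}
}

func main() {
    h := &head{
        series: []*series{
            {lbls: map[string]string{"job": "Prod", "instance": "127.0.0.1:9100"}},
            {lbls: map[string]string{"job": "Prod", "instance": "127.0.0.1:9187"}},
            {lbls: map[string]string{"job": "Replica", "instance": "127.0.0.1:9188"}},
        },
        tombstones: map[int]bool{},
    }

    h.deleteJob("Replica")
    fmt.Println(h.labelValues("job")) // [Prod Replica] -- still listed after delete

    h.gc()
    fmt.Println(h.labelValues("job")) // [Prod] -- gone only after the head GC
}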

@codesome You can look into this for the next issue.

codesome (Member) commented Jan 25, 2018

Sure @gouthamve, I am on it now.

codesome (Member) commented Jan 27, 2018

@fabxc @gouthamve
I could confirm with some tests that it is indeed because the data is not removed from the in-mem block.

I came up with 2 solutions. When clearing tombstones, if the number of tombstones in memory > 0:

  1. Persist the in-mem block, which will remove the tombstones in the process (tried and tested, working).
  2. Remove the data from memory itself and don't persist.

Please let me know if solution (1) is fine. As I don't expect clearing tombstones to be very frequent, persisting might be fine, as it will be compacted later on anyway.

brian-brazil (Member) commented Jan 27, 2018

Clean tombstones should have no semantic impact; do we need to be considering the tombstones in more places?

gouthamve (Member) commented Jan 28, 2018

@codesome I think it should be 2.

@brian-brazil I am not sure what you mean by semantic impact... It should be removing data from headblock and part of that is the series info, and that is not happening.

codesome (Member) commented Jan 28, 2018

@gouthamve I guess (1) has semantic impact. It is doing more than it needs to do: flushing the in-mem block, which is not desired from /clean_tombstones.

PR for (2) soon

brian-brazil (Member) commented Jan 28, 2018

> It should be removing data from headblock and part of that is the series info, and that is not happening.

If /clean_tombstones is changing the results of any of the HTTP query API, that's a semantic effect.

keithf4 (Author) commented Feb 26, 2018

Not sure if this was supposed to be fixed in 2.2.0, but I tried the release candidate anyway, and it's still leaving these tombstones behind.

codesome (Member) commented Feb 27, 2018

@keithf4 it has been kept on hold till 2.2.0 is released, hence it won't be fixed in 2.2.0.

Discussion on this

keithf4 (Author) commented Nov 6, 2018

I understand this is "not-as-easy-as-it-looks", but it has now been 3 minor version releases with no updates or fixes for this. Any word on when we will finally be able to clear out old job names for things like Grafana, or anything else that scrapes that info to build dynamic interfaces?

codesome (Member) commented Nov 6, 2018

@keithf4
prometheus/tsdb#270 has been active again recently; this will be fixed once it is merged.
On a side note, for now, the old label values should be cleared once the head is persisted, if not immediately.
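In the meantime, a small throwaway checker like the one below can be used to see whether a stale value has finally dropped out of the label index; it just queries the same /api/v1/label/job/values endpoint used earlier in this issue (illustrative only, not part of Prometheus):

// checklabel.go -- reports whether a given job label value is still returned
// by /api/v1/label/job/values. Throwaway helper, illustrative only.
package main

import (
    "encoding/json"
    "fmt"
    "log"
    "net/http"
    "os"
)

func main() {
    if len(os.Args) != 2 {
        log.Fatalf("usage: %s <job-label-value>", os.Args[0])
    }
    target := os.Args[1]

    resp, err := http.Get("http://localhost:9090/api/v1/label/job/values")
    if err != nil {
        log.Fatal(err)
    }
    defer resp.Body.Close()

    var body struct {
        Status string   `json:"status"`
        Data   []string `json:"data"`
    }
    if err := json.NewDecoder(resp.Body).Decode(&body); err != nil {
        log.Fatal(err)
    }

    for _, v := range body.Data {
        if v == target {
            fmt.Printf("job=%q is still present in the label index\n", target)
            return
        }
    }
    fmt.Printf("job=%q is gone from the label index\n", target)
}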

krasi-georgiev (Member) commented Feb 8, 2019

Fixed in prometheus/tsdb#270; it will be included in the next Prometheus release.
