Release 2.8+ remote storage doesn't work on ext4 bare metal, running RH7 #5424
Comments
tomwilkie added the component/remote storage label Apr 1, 2019
Hi @aned, sorry you're having issues with this. Can you post the logs from the startup of your Prometheus please? Also, can you post screenshots of the following queries? I suspect this might be the same issue as #5389.
Can you share a
Yep,
That adds up anyway.
This looks the same as #5389. What are you thinking, @brian-brazil?
@aned can you get a goroutine dump please? Go to /debug/pprof/goroutine?debug=2 in the browser.
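If a browser isn't handy on the box, the same dump can be pulled with a few lines of Go. This is only a sketch: the helper name `goroutineDumpURL` is made up for illustration, and the base address `http://localhost:7320` is assumed from the pprof command used later in this thread, so adjust it to wherever your Prometheus listens.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/url"
	"os"
	"time"
)

// goroutineDumpURL (hypothetical helper) builds the full goroutine-dump URL
// for a given Prometheus base address. debug=2 asks pprof for full stack
// traces of every goroutine in plain text.
func goroutineDumpURL(base string) (string, error) {
	u, err := url.Parse(base)
	if err != nil {
		return "", err
	}
	u.Path = "/debug/pprof/goroutine"
	u.RawQuery = "debug=2"
	return u.String(), nil
}

func main() {
	// Assumed address; not necessarily the one your instance uses.
	target, err := goroutineDumpURL("http://localhost:7320")
	if err != nil {
		fmt.Fprintln(os.Stderr, "bad base URL:", err)
		return
	}

	client := &http.Client{Timeout: 30 * time.Second}
	resp, err := client.Get(target)
	if err != nil {
		fmt.Fprintln(os.Stderr, "fetch failed:", err)
		return
	}
	defer resp.Body.Close()

	// Stream the dump to stdout so it can be redirected into a file.
	if _, err := io.Copy(os.Stdout, resp.Body); err != nil {
		fmt.Fprintln(os.Stderr, "read failed:", err)
	}
}
```

Redirect the output to a file (for example `go run dump.go > goroutines.txt`) and attach it to the issue. Note that the helper overwrites any path in the base URL, so it would need adjusting if Prometheus is served under a route prefix.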
From the graphs it looks like you downgraded back to 2.7 shortly after trying 2.8 - is that correct?
I was checking there wasn't anything crazy in the WAL, file-wise.
I could do with a rubber ducking session with you and @cstyan to chat about this TBH, want to try and get some mitigation / fix in place for 2.9. Got some time this week?
This week is fairly busy for me, I was planning on digging around today.
Can you indicate if you're using the Delete api?
brian-brazil added a commit that referenced this issue Apr 2, 2019
I don't think m3coordinator deletes any series via api.
vsliouniaev referenced this issue Apr 11, 2019
Merged: [stable/prometheus-operator] Updated Prometheus Operator versions #12971
Delete your WAL, and try 2.9.0 again.
2.9.1, deleted WAL, the same issue.
@aned can I see the logs from your 2.9 instance? Can you also post these graphs again from the instance that you're running with a new WAL? I can look into this tomorrow.
Hmm, it looks like everything is being dropped.
Given the lack of logs about dropped series, this is possibly due to write_relabel_configs dropping everything. Can you share the remote_write part of your config?
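To illustrate how a single relabel rule can quietly drop most series before they reach the remote endpoint, here is a rough Go simulation of a `keep` action matched against the metric name. It is a sketch, not Prometheus's own code: the `keepRule` type, the `job:.*` pattern and the metric names are all hypothetical, and the reporter's actual config is not reproduced in this thread. The one detail taken from Prometheus itself is that relabel regexes are anchored at both ends.

```go
package main

import (
	"fmt"
	"regexp"
)

// keepRule is a hypothetical stand-in for one write_relabel_configs entry
// with action "keep" and source label __name__.
type keepRule struct {
	re *regexp.Regexp
}

// newKeepRule anchors the pattern at both ends, as Prometheus does for
// relabel regexes, so a partial match is not enough to keep a series.
func newKeepRule(pattern string) keepRule {
	return keepRule{re: regexp.MustCompile("^(?:" + pattern + ")$")}
}

// filter returns the metric names that survive the rule and counts the rest.
func (r keepRule) filter(names []string) (kept []string, dropped int) {
	for _, n := range names {
		if r.re.MatchString(n) {
			kept = append(kept, n)
		} else {
			dropped++
		}
	}
	return kept, dropped
}

func main() {
	// Hypothetical rule: only forward recording-rule style names.
	rule := newKeepRule("job:.*")

	// Hypothetical series names.
	names := []string{"job:http_requests:rate5m", "node_cpu_seconds_total", "up"}

	kept, dropped := rule.filter(names)
	fmt.Printf("kept=%v dropped=%d\n", kept, dropped)
}
```

With a rule like this, every raw scraped series is dropped and only the outputs of recording rules are forwarded, so a high drop rate on its own is not evidence of a remote write bug.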
That's going to drop a lot of time series, so this result is as expected. Are there signs that data isn't making it to your remote write endpoint?
Based on that data my first thought is that relabelling was broken back in 2.7.2 for remote write. What does If rule evaluations are failing, that'd also decrease the amount of data that'd be sent to remote write.
The count is 1162458.
Okay, so we'd expect to see about 38k successes per second when everything is working correctly.
Can you share with me what the Rules Status page is showing, preferably when there are some failures?
I'd need to reproduce it tomorrow if really needed.
At this point I suspect that remote_write is working fine, however there's something up with your rule evaluations.
That'd be great.
Nm, I'll dm you the page in a few.
A snapshot of the same page from 2.7.2 would also be handy for comparison.
Ok, reverting back, will dm you as well. Anything else that might be useful from 2.9.1 before I revert?
That's it offhand.
Given how Go GC works, 75% is the fullest you can safely be on average. You need to either cut CPU usage, or get a bigger machine.
This doesn't look like the same thing based on the graphs, but you might want to try the settings at #5166 (comment)
Hmm, what's your GOMAXPROCS? It'll be on the Runtime Information page (should be the same for both versions).
GOMAXPROCS is 24 in both cases. Tried 2.9.1 with
the same results.
Removed remote write completely, tested 2.9.1, the same performance degradation (read: unusable), so the remote write is not related here I guess.
It's odd that you're hitting a 12 CPU limit on a 24 core box. What's your
GOMAXPROCS showing as and are those real cores or hyperthreads?
On Thu 18 Apr 2019, 23:41 Artem Nedoshepa wrote:
Some snapshots without remote write, 2.7.2 vs 2.9.1
[image: image]
<https://user-images.githubusercontent.com/2947691/56392937-b267d680-61e7-11e9-94d1-c41a0abc521f.png>
2.9.1 has 10k less samples, probably dropping them due to high CPU
[image: image]
<https://user-images.githubusercontent.com/2947691/56393000-e2af7500-61e7-11e9-9c10-b7ef1c7de3f5.png>
[image: image]
<https://user-images.githubusercontent.com/2947691/56393051-08d51500-61e8-11e9-8594-1ae025735a07.png>
12 real cores (24 hyperthreads).
Okay, so really a 12 core box. You could try setting the GOMAXPROCS
environment variable to 12 to reduce contention but likely you need a
bigger box.
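For context on why GOMAXPROCS shows 24 on a 12 core box: by default, and absent container CPU limits, the Go runtime sizes GOMAXPROCS from the logical CPUs it sees, and hyperthreads count as logical CPUs. A small sketch to confirm what a Go process on the machine would pick up (the `schedulerInfo` helper name is made up here):

```go
package main

import (
	"fmt"
	"runtime"
)

// schedulerInfo (hypothetical helper) reports the number of logical CPUs
// visible to the process and the current GOMAXPROCS setting.
func schedulerInfo() (logicalCPUs, maxProcs int) {
	// Passing 0 to runtime.GOMAXPROCS queries the value without changing it.
	return runtime.NumCPU(), runtime.GOMAXPROCS(0)
}

func main() {
	cpus, procs := schedulerInfo()
	fmt.Printf("logical CPUs: %d, GOMAXPROCS: %d\n", cpus, procs)
}
```

Running it once plainly and once with the environment variable set (for example `GOMAXPROCS=12 go run main.go`) should show the second number change while the first stays the same. The same environment variable applies when starting the Prometheus binary.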
Why does everything work just fine on 2.7.2, but when switching to 2.9 I need a "bigger box"?
Changes in resource usage are not unexpected between versions. If you can get a CPU profile we can see which change might have caused it.
CPU profile, on 2.7.2 and 2.9.1
Looks like it's garbage collection, everything else is faster on 2.9.1. Can you share a heap profile with the
go tool pprof -alloc_space --svg http://localhost:7320/debug/pprof/heap > prof-heap.svg
Looks like the allocation in mergedPostings.Seek is the issue.

aned commented Apr 1, 2019 (edited)
Remote write doesn't work on fs type ext4 on bare metal running RH7 after "[ENHANCEMENT] Use the WAL for remote_write API. #4588".
Getting this in logs:
Remote read works fine on 2.8+
Downgraded to 2.7.2 - no issues with remote write.
Relevant config:
100% drop rate: