
2.3.0 - WAL log samples: log series: /wal/007913: file already closed #4303

Closed
VR6Pete opened this Issue Jun 22, 2018 · 20 comments


VR6Pete commented Jun 22, 2018

Upgraded to 2.3.0 yesterday and I'm now getting these errors in the "targets" screen and journalctl.

If it makes any difference, I configured blackbox at the same time to scrape around 200 URLs.

WAL log samples: log series: write /data/prometheus/metrics2/wal/007913: file already closed

om:9182/metrics msg="append failed" err="WAL log samples: log series: write /data/prometheus/metrics2/wal/007913: file already clo
.com:9182/metrics msg="append failed" err="WAL log samples: log series: write /data/prometheus/metrics2/wal/007913: file already clo
oud.com:9182/metrics msg="append failed" err="WAL log samples: log series: write /data/prometheus/metrics2/wal/007913: file already
oud.com:9182/metrics msg="append failed" err="WAL log samples: log series: write /data/prometheus/metrics2/wal/007913: file already
.com:9182/metrics msg="append failed" err="WAL log samples: log series: write /data/prometheus/metrics2/wal/007913: file already clo
.com:9182/metrics msg="append failed" err="WAL log samples: log series: write /data/prometheus/metrics2/wal/007913: file already clo
be?module=http_2xx&target=https%3A%2F%2FXXXX.XXXX-XXX.com%2XXXXCareMobile%XXXXX" msg="append failed" err="WAL log samp

Performing a restart works but then quickly falls over again with the same error.

krasi-georgiev (Member) commented Jun 22, 2018

Can you show the full logs? These are cut off.

I have seen this error caused by low disk space once, so that's worth checking.

VR6Pete (Author) commented Jun 22, 2018

WAL log samples: log series: write /data/prometheus/metrics2/wal/007938: file already closed

Just fell over again.

dev/xvdf 493G 293G 179G 63% /data

Plenty of storage available.

Jun 22 15:55:52 ukasprom01 prometheus[1845]: level=warn ts=2018-06-22T15:55:52.98659825Z caller=scrape.go:717 component="scrape manager" scrape_pool="RC - App" target=http://XXXXXX:9182/metrics msg="append failed" err="WAL log samples: log series: write /data/prometheus/metrics2/wal/007938: file already closed"
Jun 22 15:55:53 ukasprom01 prometheus[1845]: level=warn ts=2018-06-22T15:55:53.002817307Z caller=scrape.go:713 component="scrape manager" scrape_pool="AWS (Windows) - eu-west-2" target=http://172.18.146.172:9182/metrics msg="append failed" err="WAL log samples: log series: write /data/prometheus/metrics2/wal/007938: file already closed"
Jun 22 15:55:53 ukasprom01 prometheus[1845]: level=error ts=2018-06-22T15:55:53.009252657Z caller=db.go:272 component=tsdb msg="retention cutoff failed" err="open directory: open /data/prometheus/metrics2: too many open files"
Jun 22 15:55:53 ukasprom01 prometheus[1845]: level=error ts=2018-06-22T15:55:53.009425621Z caller=compact.go:421 component=tsdb msg="removed tmp folder after failed compaction" err="open /data/prometheus/metrics2/01CGM3B0SHXQGJHK25YZF5DQGT.tmp: too many open files"
Jun 22 15:55:53 ukasprom01 prometheus[1845]: level=error ts=2018-06-22T15:55:53.009470747Z caller=db.go:277 component=tsdb msg="compaction failed" err="persist head block: open chunk writer: open /data/prometheus/metrics2/01CGM3B0SHXQGJHK25YZF5DQGT.tmp/chunks: too many open files"
Jun 22 15:55:53 ukasprom01 prometheus[1845]: level=warn ts=2018-06-22T15:55:53.210921751Z caller=scrape.go:713 component="scrape manager" scrape_pool="AWS (Windows) - eu-west-2" target=http://XXXXX:9182/metrics msg="append failed" err="WAL log samples: log series: write /data/prometheus/metrics2/wal/007938: file already closed"
Jun 22 15:55:53 ukasprom01 prometheus[1845]: level=error ts=2018-06-22T15:55:53.66182587Z caller=wal.go:713 component=tsdb msg="sync failed" err="flush buffer: write /data/prometheus/metrics2/wal/007938: file already closed"
Jun 22 15:55:54 ukasprom01 prometheus[1845]: level=warn ts=2018-06-22T15:55:54.107974047Z caller=scrape.go:713 component="scrape manager" scrape_pool="RC - App" target=http://XXXXXXXX:9182/metrics msg="append failed" err="log series: write /data/prometheus/metrics2/wal/007938: file already closed"
Jun 22 15:55:54 ukasprom01 prometheus[1845]: level=warn ts=2018-06-22T15:55:54.108384735Z caller=scrape.go:717 component="scrape manager" scrape_pool="RC - App" target=http://XXXXXXX:9182/metrics msg="append failed" err="WAL log samples: log series: write /data/prometheus/metrics2/wal/007938: file already closed"

Jun 22 15:56:03 ukasprom01 prometheus[1845]: level=error ts=2018-06-22T15:56:03.661810827Z caller=wal.go:713 component=tsdb msg="sync failed" err="flush buffer: write /data/prometheus/metrics2/wal/007938: file already closed"

ARGS="--storage.tsdb.retention=365d --storage.tsdb.max-block-duration=1d --storage.tsdb.path="/data/prometheus/

brian-brazil (Member) commented Jun 22, 2018

You're running out of file descriptors; make sure you have a high ulimit.
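
A quick way to check the effective limit and current usage on Linux (a minimal sketch, assuming a single running process named prometheus; adjust the pgrep pattern to match your setup):

    # effective open-file limits for the running Prometheus process
    cat /proc/$(pgrep -x prometheus)/limits | grep "open files"
    # number of file descriptors currently held by that process
    sudo ls /proc/$(pgrep -x prometheus)/fd | wc -l

If the usage count sits close to the soft limit, scrapes and WAL writes will start failing with "too many open files".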

VR6Pete (Author) commented Jun 22, 2018

How do I set this?

VR6Pete (Author) commented Jun 22, 2018

Cheers, I've tweaked this - will monitor over the weekend.

krasi-georgiev (Member) commented Jun 22, 2018

Sure thing.

Closing now, but feel free to reopen if it happens again.

Thanks!

VR6Pete (Author) commented Jun 25, 2018

Unfortunately the same thing happened again this morning, same symptoms...

VR6Pete (Author) commented Jun 25, 2018

epair.go:39 component=tsdb msg="found healthy block" mint=1529892000000 maxt=1529899200000 ulid=01CGTN085E67CH51V4YYNMS6MV
6450696Z caller=head.go:320 component=tsdb msg="unknown series references in WAL samples" count=1767059
9413381Z caller=main.go:524 msg="TSDB started"
0375554Z caller=main.go:603 msg="Loading configuration file" filename=/etc/prometheus/prometheus.yml
1799841Z caller=main.go:500 msg="Server is ready to receive web requests."

VR6Pete (Author) commented Jun 25, 2018

onaws.com on 172.18.91.149:53: dial udp 172.18.91.149:53: socket: too many open files"
zonaws.com on 172.18.91.149:53: dial udp 172.18.91.149:53: socket: too many open files"

simonpasquier (Member) commented Jun 25, 2018

@VR6Pete please paste the first lines of the log file to check the effective fd limits as it's sometimes tricky to set them properly. They should be reported like this:

level=info ts=2018-06-25T08:49:51.099871354Z caller=main.go:225 fd_limits="(soft=1024, hard=4096)"
VR6Pete (Author) commented Jun 25, 2018

ubuntu@ukasprom01:/etc/prometheus$ journalctl -u prometheus.service | grep soft
Jun 25 08:36:09 ukasprom01 prometheus[1272]: level=info ts=2018-06-25T08:36:09.886889925Z caller=main.go:225 fd_limits="(soft=1024, hard=4096)"

Yes, it doesn't look like it's taken effect.

/etc/security/limits.conf

    * soft nofile 64000
    * hard nofile 64000
    root soft nofile 64000
    root hard nofile 64000

I thought that Prometheus would pick this change up.

simonpasquier (Member) commented Jun 25, 2018

It depends on how your Prometheus service is managed. Do you use systemd?

VR6Pete (Author) commented Jun 25, 2018

Yes - Ubuntu system.

simonpasquier (Member) commented Jun 25, 2018
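
With systemd, /etc/security/limits.conf is generally not applied to services; the limit has to be raised in the unit itself. A minimal sketch, assuming the unit lives at /etc/systemd/system/prometheus.service (or a drop-in under /etc/systemd/system/prometheus.service.d/):

    [Service]
    LimitNOFILE=65536

followed by a reload and restart so the new limit takes effect:

    sudo systemctl daemon-reload
    sudo systemctl restart prometheus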

VR6Pete (Author) commented Jun 25, 2018

OK, that has now been read in.

prometheus[1269]: level=info ts=2018-06-25T10:55:42.910572336Z caller=main.go:225 fd_limits="(soft=65536, hard=65536)"

I have noticed that I have the following message on starting Prometheus - is there any action to be taken?

ler=head.go:320 component=tsdb msg="unknown series references in WAL samples" count=55361

simonpasquier (Member) commented Jun 25, 2018

Regarding the "unknown series references in WAL samples" message, I would assume that it will go away after the first compaction and a clean stop/start of the service. @krasi-georgiev?

krasi-georgiev (Member) commented Jun 25, 2018

Haven't looked at that part of the code yet, but I assume it should clear those once past the time range.

linkbug commented Feb 21, 2019

The old WAL has been closed and a new WAL is created, but the new WAL file's size is 0. I think writing "WALMagic" failed in func createSegmentFile:

[screenshot of createSegmentFile]

krasi-georgiev (Member) commented Feb 21, 2019

@linkbug A lot has changed since 2.3; would you mind opening a new issue with steps to reproduce?
