Binlog files preventing searchd from starting #275

Closed
jiru opened this issue Nov 11, 2019 · 7 comments
Labels
bug

Comments

jiru commented Nov 11, 2019

Environment

Manticore Search version: Manticore 3.1.0

OS version: Debian 9.11

Build version: 445e806e@190716

Problem

Sometimes, after Manticore crashes, it cannot be restarted because of an error related to the binlog files. After removing the binlog files (binlog.meta and binlog.lock), Manticore starts normally. The problem is that this manual intervention (removing the files) is required before searchd can be restarted. That is a problem in a production environment, where we want the daemon to be restarted automatically after a crash. A Google search suggests that the issue has been around for years.

Steps to reproduce: I don’t know how to trigger this bug.

Messages from log files: When the daemon refuses to start, the error message is: FATAL: binlog: log open error: failed to open /var/lib/manticore/data/binlog.001: No such file or directory. See also the relevant part of searchd.log before and after the crash.

tomatolog (Contributor) commented Nov 11, 2019

Could you also post a log with more binlog messages?

What do the binlog messages look like during a normal replay?

jiru (Author) commented Nov 11, 2019

githubmanticore added the bug label Nov 12, 2019

tomatolog (Contributor) commented Nov 12, 2019

Could you provide the binlog files (.meta and .xyz) so I can reproduce the issue locally?

If you have already deleted them, could you post the binlog files the next time the issue pops up?

jiru (Author) commented Nov 12, 2019

jiru (Author) commented Nov 14, 2019

I did some more investigation.

Regarding the searchd crashes: it seems that our recent crashes were related to a lack of disk space. They always occurred when a cron job triggered a disk-space-consuming task. What's more, the stack traces all point to UpdateAttributes, and the SphinxAPI request dumps (in base64) always show a request to update an MVA attribute (lists_id).

While I haven't been able to reproduce the crash, my analysis is that, since UpdateAttributes increases the size of binlog.001, searchd crashes because of the lack of free disk space and/or some other condition. searchd normally recovers from such a crash and replays the binlog on the next restart, but in some situations binlog.001 is missing (presumably because of the lack of disk space), and searchd then refuses to start.

I was, however, able to reproduce the startup failure simply by removing binlog.001 before a normal restart. You can try starting searchd with this binlog to reproduce the issue locally.
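On builds affected by this bug, the inconsistent state described above (binlog.meta present, numbered binlog file gone) could be detected before an automatic restart. The Python below is a minimal sketch under stated assumptions, not part of Manticore: the helper names are hypothetical, binlog.meta is not parsed (its format is not assumed here), and only files named like binlog.001 are looked for. Instead of deleting binlog.meta and binlog.lock as in the manual workaround, it moves them aside so they can still be attached to a report.

```python
import re
import shutil
from pathlib import Path

# Numbered binlog files are assumed to look like binlog.001, binlog.002, ...
# (naming taken from the FATAL message in this report).
BINLOG_FILE_RE = re.compile(r"^binlog\.\d+$")


def binlog_meta_is_orphaned(data_dir: str) -> bool:
    """Hypothetical helper: True if binlog.meta exists but no binlog.NNN file does.

    Coarse heuristic: binlog.meta is NOT parsed, so a meta file that points at
    one missing file while other binlog.NNN files exist is not detected.
    """
    root = Path(data_dir)
    if not (root / "binlog.meta").is_file():
        return False
    return not any(BINLOG_FILE_RE.match(p.name) for p in root.iterdir())


def quarantine_binlog_state(data_dir: str, backup_dir: str) -> list[str]:
    """Hypothetical helper: move binlog.meta and binlog.lock into backup_dir
    (instead of deleting them) and return the names of the files moved."""
    root = Path(data_dir)
    dest = Path(backup_dir)
    dest.mkdir(parents=True, exist_ok=True)
    moved = []
    for name in ("binlog.meta", "binlog.lock"):
        src = root / name
        if src.exists():
            shutil.move(str(src), str(dest / name))
            moved.append(name)
    return moved


if __name__ == "__main__":
    data_dir = "/var/lib/manticore/data"  # path from the FATAL message above
    if binlog_meta_is_orphaned(data_dir):
        print("binlog.meta present but no binlog.NNN file; searchd may refuse to start")
```

A restart wrapper could call `binlog_meta_is_orphaned()` and, only if it returns True, `quarantine_binlog_state()` before launching searchd. Note that searchd then starts without replaying any binlog, so updates recorded only in the missing binlog file are not applied.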

tomatolog (Contributor) commented Nov 19, 2019

I've just fixed the issue of the binlog being left in an invalid state after a "no space left on disk" error, in 795520a.

tomatolog closed this Nov 19, 2019

jiru (Author) commented Nov 20, 2019
