
OUT_OF_SYNC #1045

Open
sirkubax opened this issue Jan 23, 2017 · 6 comments

Comments

@sirkubax
Contributor

sirkubax commented Jan 23, 2017

Hello

v 1.9.4

We have already experienced the OUT_OF_SYNC error 3 times, caused by a high write load on one of the ssdb master-master nodes.
The first problem we noticed:

  • replication sync can take up to 2 minutes - it is not real-time any more.

The biggest problem:

  • OUT_OF_SYNC does not fix itself - you need to perform the whole procedure of deleting the meta folders :/ SSDB new node in OUT_OF_SYNC / INIT state #975

It is becoming unacceptable to have to stop your database for a few hours to recreate the synchronization. Could you consider some re-sync mechanism that would recover from a faulty sync? It is almost certain that the sync lag will happen again, and unless it can fix itself, the whole concept becomes useless for production :(

It would be great if we did not have to stop all ssdb instances in case of OUT_OF_SYNC.
Even if we still have to replicate the whole data set to the instance that is out of sync, it would be acceptable if this could be done semi-automatically, with some 'ssd-rebuild' command.
Would you consider introducing logic that would re-sync the faulty data sets?

Technical problems that we see:

  • LevelDB is optimized for latency rather than throughput, which seems to be a problem for the requirement of reasonably fast bulk uploads. We are not sure how much SSDB contributes to this problem, but since SSDB locks the LevelDB files we can't insert directly (into LevelDB) anyway without shutting down SSDB (and that would also break the replication).
  • The replication model of SSDB basically breaks the whole bulk-upload idea (multi_hsets are replicated as single hsets). We are not sure how much work it would be to change this (increasing binlog.capacity wouldn't really solve the problem); for now the only workaround we see is to pace the bulk load on the client side (see the sketch after this list).
  • increasing binlog.capacity would give us more time before going into the OUT_OF_SYNC state, which is hard to recover from, so we would consider increasing it anyway
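
A minimal sketch of that client-side pacing, assuming SSDB listens on 127.0.0.1:8888 and speaks its standard length-prefixed wire protocol; the hash name, batch size and rate below are only examples, not values from our setup:

import socket
import time

def ssdb_request(sock, *args):
    # Each token is sent as a length line followed by the data line;
    # a blank line ends the packet.  The short status reply is read back
    # until the terminating blank line (adequate for simple replies).
    tokens = [str(a).encode() for a in args]
    packet = b''.join(b'%d\n%s\n' % (len(t), t) for t in tokens) + b'\n'
    sock.sendall(packet)
    reply = b''
    while not reply.endswith(b'\n\n'):
        reply += sock.recv(4096)
    return reply

def bulk_hset(items, batch_size=500, max_ops_per_sec=5000):
    # Batch writes with multi_hset and sleep between batches so the
    # replication stream (which replays them as single hsets) can keep up.
    sock = socket.create_connection(('127.0.0.1', 8888))
    batch = []
    for key, value in items:
        batch += [key, value]
        if len(batch) >= 2 * batch_size:
            ssdb_request(sock, 'multi_hset', 'bulk', *batch)
            batch = []
            time.sleep(batch_size / max_ops_per_sec)   # crude rate limit
    if batch:
        ssdb_request(sock, 'multi_hset', 'bulk', *batch)
    sock.close()

if __name__ == '__main__':
    bulk_hset(('k%d' % i, 'v%d' % i) for i in range(10000))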

The list of related bug reports (mostly in Chinese - I do not get the context):
https://github.com/ideawu/ssdb/issues?utf8=%E2%9C%93&q=is%3Aissue%20OUT_OF_SYNC

@ideawu
Owner

ideawu commented Feb 3, 2017

As you mentioned, the simplest and recommended way to avoid OUT_OF_SYNC is to increase the binlog capacity, which can be configured in ssdb.conf (it is a hidden configuration item, not included in the ssdb.conf template):

replication:
	binlog: yes
		capacity: 100000000
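
To see whether the slave is keeping up after you change this, check the binlogs (capacity, min_seq, max_seq) and replication sections reported by the `info` command. A minimal sketch, assuming SSDB on 127.0.0.1:8888 and its length-prefixed wire protocol; it only sends `info` and prints the raw reply:

import socket

def read_reply(sock):
    # Read one SSDB reply: blocks of a length line followed by a data
    # line, terminated by a single blank line.
    buf, blocks = b'', []
    while True:
        while b'\n' not in buf:
            buf += sock.recv(4096)
        line, buf = buf.split(b'\n', 1)
        if line.strip() == b'':            # blank line: end of reply
            return blocks
        size = int(line)
        while len(buf) < size + 1:         # data plus its trailing newline
            buf += sock.recv(4096)
        blocks.append(buf[:size])
        buf = buf[size + 1:]

def ssdb_command(sock, *args):
    # Send one command using the same length-prefixed framing.
    tokens = [a.encode() for a in args]
    sock.sendall(b''.join(b'%d\n%s\n' % (len(t), t) for t in tokens) + b'\n')
    return read_reply(sock)

if __name__ == '__main__':
    sock = socket.create_connection(('127.0.0.1', 8888))
    reply = ssdb_command(sock, 'info')     # reply[0] is the status ('ok')
    sock.close()
    print(b'\n'.join(reply).decode())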

@saveriocastellano

Hello,
I'm having the same problem... I cannot get two masters with sync=mirror to fully sync.
Every time the sync gets up to 70%-80% and then they go OUT_OF_SYNC. This happens all the time.
The size of the database is 80GB, and the database is also receiving heavy writes during the sync...
Is this the problem?

I'm already using this:

replication:
	binlog: yes
		capacity: 100000000

But I still get OUT_OF_SYNC.

Would it help if I further increase the capacity, let's say to 200000000?

@ideawu
Owner

ideawu commented Jan 17, 2020

@saveriocastellano Heavy writes may cause the OUT_OF_SYNC problem, especially when the network bandwidth is not large enough. To solve this problem, one way is to increase binlog.capacity; another way is to improve the network conditions (increase the bandwidth for your server).
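
Roughly speaking, the binlog keeps the last `capacity` write operations, so the sustained write rate determines how long a slave may lag behind before the entries it still needs are dropped and it goes OUT_OF_SYNC. A back-of-the-envelope sketch (the write rate below is an assumed example, not a number from your setup):

# Illustrative only: how long a slave can lag before the binlog no longer
# covers its position.  capacity mirrors replication.binlog.capacity;
# the write rate is an assumption, not a measured value.
capacity = 100_000_000     # binlog entries kept by the master
write_rate = 50_000        # sustained writes per second (assumed)

window = capacity / write_rate
print("slave may lag up to ~%d s (~%d min) before OUT_OF_SYNC" % (window, window / 60))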

@saveriocastellano

OK, I'm pretty sure it is not because of a lack of bandwidth.
Do you think I could try raising the binlog capacity from 100000000 to 200000000?
How does that affect memory?

@ideawu
Owner

ideawu commented Jan 17, 2020

binlog.capacity does not impact memory, it only impacts disk usage, so you can increase it without worrying about memory.

But 100000000 is normally already a large number.

You should try it and tell us the result later.

@saveriocastellano

By the way, in my case doubling the binlog capacity solved the issue, and now the two ssdb instances manage to sync.
