
Cut-over should wait for heartbeat lag to be low enough to succeed #14

Merged · 9 commits · Feb 7, 2021

Conversation


@ccoffey ccoffey commented Feb 3, 2021

Description

Related issue: github#799

In the above issue, we see migrations which fail at the cut-over phase with ERROR Timeout while waiting for events up to lock. These migrations fail cut-over many times and eventually exhaust all retries.

Root cause

Lag experienced by an external replica is not the same as lag experienced by gh-ost while processing the binlog.

  • Lag (external replica lag): normally computed by executing SHOW SLAVE STATUS against an external replica and extracting the value of Seconds_Behind_Master.
  • gh-ost lag: how long it takes gh-ost to intercept binlog events for updates (for example, heartbeats) that it performed itself.

For example: Imagine that both of these lags were ~0 seconds. Then imagine that you throttle gh-ost for N minutes. At this point the external replica's lag will still be ~0 seconds, but gh-ost's lag will be N minutes.

This is important because it's gh-ost's lag (not the external replica's lag) that determines whether cut-over succeeds or times out.

More Detail

During cut-over:

  • A token AllEventsUpToLockProcessed:time.Now() is inserted into the changelog table
  • The table being migrated is locked
  • Gh-ost waits up to --cut-over-lock-timeout-seconds (default: 3 seconds) for this token to appear on the binlog
  • If the token is not read within the timeout, gh-ost aborts the cut-over

Problem: It's possible to enter this cut-over phase when gh-ost is so far behind on processing the binlog that it could not possibly catch up during the timeout.
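
To make the failure mode concrete, here is a minimal Go sketch of the timeout-bounded wait described above. It is not gh-ost's actual code; the function and channel names are hypothetical.

package main

import (
	"errors"
	"fmt"
	"time"
)

// waitForEventsUpToLock is a hypothetical sketch of the wait described above:
// after writing the AllEventsUpToLockProcessed token, the migrator waits a
// bounded amount of time for the binlog reader to report that it has
// intercepted that token.
func waitForEventsUpToLock(tokenSeen <-chan struct{}, lockTimeout time.Duration) error {
	timer := time.NewTimer(lockTimeout) // e.g. --cut-over-lock-timeout-seconds
	defer timer.Stop()

	select {
	case <-tokenSeen:
		// The binlog reader caught up to the token in time; cut-over proceeds.
		return nil
	case <-timer.C:
		// The reader is too far behind; this cut-over attempt is aborted.
		return errors.New("Timeout while waiting for events up to lock")
	}
}

func main() {
	tokenSeen := make(chan struct{})
	close(tokenSeen) // simulate the reader having already intercepted the token
	fmt.Println(waitForEventsUpToLock(tokenSeen, 3*time.Second)) // prints <nil>
}

If the reader is already tens of seconds behind when the token is written, the timer fires first every time, which is exactly the repeated "Timeout while waiting for events up to lock" failure.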

What this PR proposes

  • Maintain a new value CurrentHeartbeatLag in the MigrationContext
  • Update CurrentHeartbeatLag every time we intercept a binlog event for the changelog table of type heartbeat
  • Before cut-over, wait until CurrentHeartbeatLag is less than --max-lag-millis (a rough sketch of this follows below)

Note: This PR is best-reviewed commit by commit
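
The snippet below is a rough Go sketch of that approach, not the PR's actual diff; the type, field, and function names are illustrative stand-ins for gh-ost's MigrationContext and its binlog event handling.

package main

import (
	"fmt"
	"sync/atomic"
	"time"
)

// migrationContext is an illustrative stand-in for gh-ost's MigrationContext.
type migrationContext struct {
	currentHeartbeatLagNanos int64 // updated atomically by the binlog reader
	maxLagMillis             int64 // --max-lag-millis
}

// onHeartbeat would be called when a heartbeat changelog event is intercepted;
// the event payload carries the time at which the heartbeat row was written.
func (c *migrationContext) onHeartbeat(heartbeatTime time.Time) {
	atomic.StoreInt64(&c.currentHeartbeatLagNanos, int64(time.Since(heartbeatTime)))
}

// waitForLowHeartbeatLag blocks cut-over until the reader has nearly caught up.
func (c *migrationContext) waitForLowHeartbeatLag() {
	maxLag := time.Duration(c.maxLagMillis) * time.Millisecond
	for {
		lag := time.Duration(atomic.LoadInt64(&c.currentHeartbeatLagNanos))
		if lag <= maxLag {
			fmt.Printf("Heartbeat lag is low enough (%v), proceeding\n", lag)
			return
		}
		fmt.Printf("current HeartbeatLag (%v) is too high, waiting\n", lag)
		time.Sleep(time.Second)
	}
}

func main() {
	ctx := &migrationContext{maxLagMillis: 1500}
	ctx.onHeartbeat(time.Now().Add(-100 * time.Second)) // simulate a 100s-old heartbeat
	go func() {
		time.Sleep(2 * time.Second)
		ctx.onHeartbeat(time.Now()) // the reader catches up
	}()
	ctx.waitForLowHeartbeatLag()
}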

An example

It's best to demonstrate the value of this change by example.

I am able to reliably reproduce the cut-over problem (40+ failed cut-over attempts) when running gh-ost against an Amazon RDS Aurora DB.

Test setup:

  • Create an Aurora cluster with a single writer and no readers
  • Create two "large" tables (200 million rows & 100 million rows respectively)

Test process:

  • Run two table migrations in parallel using gh-ost

Both migrations are run using the following params:

gh-ost \
  --host='<REDACTED>' \
  --user='<REDACTED>' \
  --password='<REDACTED>' \
  --database='<REDACTED>' \
  --table='<TABLE>' \
  --alter='<NOT_IMPORTANT>' \
  --allow-on-master \
  --max-load=Threads_running=30 \
  --assume-rbr \
  --skip-foreign-key-checks \
  --chunk-size=500 \
  --cut-over=default \
  --serve-socket-file=/tmp/gh-ost.<TABLE>.sock \
  --replica-server-id=<UNIQUE_ID> \
  --verbose \
  --execute

Note: <TABLE> and <UNIQUE_ID> must be different per migration.

The following logs came from one of the many experiments I ran.

This log was output by the smaller of the two migrations when its row copy reached 13.9%:

I, [2021-02-01T15:20:19.572790 #1056173]  INFO -- : GhostWorker: Copy: 321000/2308807 13.9%; Applied: 0; Backlog: 0/1000; Time: 43s(total), 36s(copy); streamer: mysql-bin-changelog.046679:64796762; Lag: 0.01s, HeartbeatLag: 17.92s, State: migrating; ETA: 3m42s

Important: Notice that Lag is 0.01s but HeartbeatLag is 17.92s. The value of Lag is actually meaningless here because we are running with --allow-on-master, so Lag is computed by reading a heartbeat row directly from the same table we wrote it to. This explains the extremely low value of 0.01s.

A few minutes later, when row copy completed, Lag was 0.01s and HeartbeatLag was 100.79s:

I, [2021-02-01T15:24:18.335164 #1056173]  INFO -- : GhostWorker: 2021-02-01 15:24:18 INFO Row copy complete
I, [2021-02-01T15:24:18.359951 #1056173]  INFO -- : GhostWorker: Copy: 2401160/2401160 100.0%; Applied: 0; Backlog: 0/1000; Time: 4m42s(total), 4m34s(copy); streamer: mysql-bin-changelog.046681:84235524; Lag: 0.01s, HeartbeatLag: 100.79s, State: migrating; ETA: due

This PR causes gh-ost to wait until the heartbeat lag is less than --max-lag-millis before continuing with the cut-over.

I, [2021-02-01T15:24:18.360111 #1056173]  INFO -- : GhostWorker: 2021-02-01 15:24:18 INFO Waiting for heartbeat lag to be low enough to proceed
I, [2021-02-01T15:24:18.360140 #1056173]  INFO -- : GhostWorker: 2021-02-01 15:24:18 INFO current HeartbeatLag (100.79s) is too high, it needs to be less than --max-lag-millis (1.50s)
I, [2021-02-01T15:24:19.360063 #1056173]  INFO -- : GhostWorker: 2021-02-01 15:24:19 INFO current HeartbeatLag (100.93s) is too high, it needs to be less than --max-lag-millis (1.50s)
I, [2021-02-01T15:24:20.360161 #1056173]  INFO -- : GhostWorker: 2021-02-01 15:24:20 INFO current HeartbeatLag (101.07s) is too high, it needs to be less than --max-lag-millis (1.50s)

Note: If we had tried to cut-over during this period where HeartbeatLag was greater than 100 seconds then we would have failed many times.

The heartbeat lag only started to reduce (a few minutes later) when the larger migration's row copy completed.

I, [2021-02-01T15:27:41.472574 #1056173]  INFO -- : GhostWorker: 2021-02-01 15:27:41 INFO current HeartbeatLag (90.48s) is too high, it needs to be less than --max-lag-millis (1.50s) to continue
I, [2021-02-01T15:27:42.473628 #1056173]  INFO -- : GhostWorker: 2021-02-01 15:27:42 INFO current HeartbeatLag (67.26s) is too high, it needs to be less than --max-lag-millis (1.50s) to continue
I, [2021-02-01T15:27:43.472813 #1056173]  INFO -- : GhostWorker: 2021-02-01 15:27:43 INFO current HeartbeatLag (46.01s) is too high, it needs to be less than --max-lag-millis (1.50s) to continue
I, [2021-02-01T15:27:44.472932 #1056173]  INFO -- : GhostWorker: 2021-02-01 15:27:44 INFO current HeartbeatLag (22.47s) is too high, it needs to be less than --max-lag-millis (1.50s) to continue

At this point the following message was output:

I, [2021-02-01T15:27:45.473009 #1056173]  INFO -- : GhostWorker: 2021-02-01 15:27:45 INFO Heartbeat lag is low enough, proceeding

And then the table cut-over succeeded in a single attempt:

I, [2021-02-01T15:27:45.473642 #1056173]  INFO -- : GhostWorker: 2021-02-01 15:27:45 INFO Grabbing voluntary lock: gh-ost.70930.lock
I, [2021-02-01T15:27:45.473918 #1056173]  INFO -- : GhostWorker: 2021-02-01 15:27:45 INFO Setting LOCK timeout as 6 seconds
I, [2021-02-01T15:27:45.474239 #1056173]  INFO -- : GhostWorker: 2021-02-01 15:27:45 INFO Looking for magic cut-over table
I, [2021-02-01T15:27:45.474734 #1056173]  INFO -- : GhostWorker: 2021-02-01 15:27:45 INFO Creating magic cut-over table `***`.`_***_del`
I, [2021-02-01T15:27:45.517945 #1056173]  INFO -- : GhostWorker: 2021-02-01 15:27:45 INFO Magic cut-over table created
I, [2021-02-01T15:27:45.518084 #1056173]  INFO -- : GhostWorker: 2021-02-01 15:27:45 INFO Locking `***`.`***`, `***`.`_***_del`
I, [2021-02-01T15:27:45.518736 #1056173]  INFO -- : GhostWorker: 2021-02-01 15:27:45 INFO Tables locked
I, [2021-02-01T15:27:45.518807 #1056173]  INFO -- : GhostWorker: 2021-02-01 15:27:45 INFO Session locking original & magic tables is 70930
I, [2021-02-01T15:27:45.519266 #1056173]  INFO -- : GhostWorker: 2021-02-01 15:27:45 INFO Writing changelog state: AllEventsUpToLockProcessed:1612193265518649636
I, [2021-02-01T15:27:45.527817 #1056173]  INFO -- : GhostWorker: 2021-02-01 15:27:45 INFO Intercepted changelog state AllEventsUpToLockProcessed
I, [2021-02-01T15:27:45.527900 #1056173]  INFO -- : GhostWorker: 2021-02-01 15:27:45 INFO Handled changelog state AllEventsUpToLockProcessed
I, [2021-02-01T15:27:45.531266 #1056173]  INFO -- : GhostWorker: 2021-02-01 15:27:45 INFO Waiting for events up to lock
I, [2021-02-01T15:27:46.419921 #1056173]  INFO -- : GhostWorker: 2021-02-01 15:27:46 INFO Waiting for events up to lock: got AllEventsUpToLockProcessed:1612193265518649636
I, [2021-02-01T15:27:46.420076 #1056173]  INFO -- : GhostWorker: 2021-02-01 15:27:46 INFO Done waiting for events up to lock; duration=901.073705ms
I, [2021-02-01T15:27:46.420108 #1056173]  INFO -- : GhostWorker: # Migrating `***`.`***`; Ghost table is `***`.`_***_gho`
I, [2021-02-01T15:27:46.420131 #1056173]  INFO -- : GhostWorker: # Migrating ***; inspecting ***; executing on ***
I, [2021-02-01T15:27:46.420151 #1056173]  INFO -- : GhostWorker: # Migration started at Mon Feb 01 15:19:36 +0000 2021
I, [2021-02-01T15:27:46.420172 #1056173]  INFO -- : GhostWorker: # chunk-size: 500; max-lag-millis: 1500ms; dml-batch-size: 10; max-load: Threads_running=30; critical-load: ; nice-ratio: 0.000000
I, [2021-02-01T15:27:46.420192 #1056173]  INFO -- : GhostWorker: # throttle-additional-flag-file: /tmp/gh-ost.throttle
I, [2021-02-01T15:27:46.420213 #1056173]  INFO -- : GhostWorker: # Serving on unix socket: /tmp/gh-ost.***.***.sock
I, [2021-02-01T15:27:46.423808 #1056173]  INFO -- : GhostWorker: Copy: 2401160/2401160 100.0%; Applied: 0; Backlog: 0/1000; Time: 8m10s(total), 4m34s(copy); streamer: mysql-bin-changelog.046683:133419216; Lag: 0.01s, HeartbeatLag: 0.00s, State: migrating; ETA: due
I, [2021-02-01T15:27:46.424814 #1056173]  INFO -- : GhostWorker: 2021-02-01 15:27:46 INFO Setting RENAME timeout as 3 seconds
I, [2021-02-01T15:27:46.424897 #1056173]  INFO -- : GhostWorker: 2021-02-01 15:27:46 INFO Session renaming tables is 70935
I, [2021-02-01T15:27:46.425570 #1056173]  INFO -- : GhostWorker: 2021-02-01 15:27:46 INFO Issuing and expecting this to block: rename /* gh-ost */ table `***`.`***` to `***`.`_***_del`, `***`.`_***_gho` to `***`.`***`
I, [2021-02-01T15:27:46.560987 #1056173]  INFO -- : GhostWorker: Copy: 2401160/2401160 100.0%; Applied: 0; Backlog: 0/1000; Time: 8m10s(total), 4m34s(copy); streamer: mysql-bin-changelog.046683:133420865; Lag: 0.01s, HeartbeatLag: 0.00s, State: migrating; ETA: due 
I, [2021-02-01T15:27:47.426402 #1056173]  INFO -- : GhostWorker: 2021-02-01 15:27:47 INFO Found atomic RENAME to be blocking, as expected. Double checking the lock is still in place (though I don't strictly have to)
I, [2021-02-01T15:27:47.426543 #1056173]  INFO -- : GhostWorker: 2021-02-01 15:27:47 INFO Checking session lock: gh-ost.70930.lock
I, [2021-02-01T15:27:47.427267 #1056173]  INFO -- : GhostWorker: 2021-02-01 15:27:47 INFO Connection holding lock on original table still exists
I, [2021-02-01T15:27:47.427353 #1056173]  INFO -- : GhostWorker: 2021-02-01 15:27:47 INFO Will now proceed to drop magic table and unlock tables
I, [2021-02-01T15:27:47.427377 #1056173]  INFO -- : GhostWorker: 2021-02-01 15:27:47 INFO Dropping magic cut-over table
I, [2021-02-01T15:27:47.440892 #1056173]  INFO -- : GhostWorker: 2021-02-01 15:27:47 INFO Releasing lock from `***`.`***`, `***`.`_***_del`
I, [2021-02-01T15:27:47.441353 #1056173]  INFO -- : GhostWorker: 2021-02-01 15:27:47 INFO Tables unlocked
I, [2021-02-01T15:27:47.476893 #1056173]  INFO -- : GhostWorker: 2021-02-01 15:27:47 INFO Tables renamed
I, [2021-02-01T15:27:47.477039 #1056173]  INFO -- : GhostWorker: 2021-02-01 15:27:47 INFO Lock & rename duration: 1.958935784s. During this time, queries on `***` were blocked
I, [2021-02-01T15:27:47.477066 #1056173]  INFO -- : GhostWorker: [2021/02/01 15:27:47] [info] binlogsyncer.go:164 syncer is closing...
I, [2021-02-01T15:27:47.547745 #1056173]  INFO -- : GhostWorker: 2021-02-01 15:27:47 INFO Closed streamer connection. err=<nil>
I, [2021-02-01T15:27:47.547891 #1056173]  INFO -- : GhostWorker: 2021-02-01 15:27:47 INFO Dropping table `***`.`_***_ghc`
I, [2021-02-01T15:27:47.547917 #1056173]  INFO -- : GhostWorker: [2021/02/01 15:27:47] [error] binlogstreamer.go:77 close sync with err: sync is been closing...
I, [2021-02-01T15:27:47.547938 #1056173]  INFO -- : GhostWorker: [2021/02/01 15:27:47] [info] binlogsyncer.go:179 syncer is closed
I, [2021-02-01T15:27:47.562784 #1056173]  INFO -- : GhostWorker: 2021-02-01 15:27:47 INFO Table dropped
I, [2021-02-01T15:27:47.562898 #1056173]  INFO -- : GhostWorker: 2021-02-01 15:27:47 INFO Am not dropping old table because I want this operation to be as live as possible. If you insist I should do it, please add `--ok-to-drop-table` next time. But I prefer you do not. To drop the old table, issue: ***
I, [2021-02-01T15:27:47.562939 #1056173]  INFO -- : GhostWorker: 2021-02-01 15:27:47 INFO -- drop table `***`.`_***_del`
I, [2021-02-01T15:27:47.563005 #1056173]  INFO -- : GhostWorker: 2021-02-01 15:27:47 INFO Done migrating `***`.`***`
I, [2021-02-01T15:27:47.563025 #1056173]  INFO -- : GhostWorker: 2021-02-01 15:27:47 INFO Removing socket file: /tmp/gh-ost.***.***.sock
I, [2021-02-01T15:27:47.563044 #1056173]  INFO -- : GhostWorker: 2021-02-01 15:27:47 INFO Tearing down inspector
I, [2021-02-01T15:27:47.563066 #1056173]  INFO -- : GhostWorker: 2021-02-01 15:27:47 INFO Tearing down applier
I, [2021-02-01T15:27:47.563085 #1056173]  INFO -- : GhostWorker: 2021-02-01 15:27:47 INFO Tearing down streamer
I, [2021-02-01T15:27:47.563102 #1056173]  INFO -- : GhostWorker: 2021-02-01 15:27:47 INFO Tearing down throttler
I, [2021-02-01T15:27:47.563120 #1056173]  INFO -- : GhostWorker: # Done

Final Thoughts

  1. This problem is likely exacerbated by Aurora because readers in Aurora do not use binlog replication. This means there is no external replica lag that gh-ost can use to throttle itself so gh-ost ends up backing up the binlog. If gh-ost is the only consumer of the binlog (typical for applications that use Aurora) then only gh-ost's cut-over would suffer. Any observer looking at key health metrics on Aurora's dashboard would conclude that the DB was completely healthy.

  2. The specific problem of backing up the binlog could be solved by getting gh-ost to copy rows much more slowly. However, gh-ost would still be susceptible to this problem in other ways, for example, if gh-ost was throttled heavily just before cut-over. Also, why should we artificially slow down if the DB is perfectly capable of handling the extra load without hurting our SLAs?

@ccoffey

ccoffey commented Feb 3, 2021

Another day passed by and we learned something new about this problem. I added details to the original issue github#799 (comment) about how the Aurora setting aurora_binlog_replication_max_yield_seconds is responsible for the abnormally high HeartbeatLag.

My team is not convinced that it's safe for us to change the Aurora default value of aurora_binlog_replication_max_yield_seconds in our production environment, so we still believe this PR is a good change to gh-ost that will make it safer by default.


@shlomi-noach shlomi-noach left a comment


Reviewed upstream: github#921 (review)

Will merge here. I have no ownership to merge upstream.

@myusuf3

myusuf3 commented Jan 23, 2022

Another day passed by and we learned something new about this problem. I added details to the original issue github#799 (comment) about how the Aurora setting aurora_binlog_replication_max_yield_seconds is responsible for the abnormally high HeartbeatLag.

It looks like this setting (aurora_binlog_replication_max_yield_seconds) has since been fixed by AWS, with a default of zero.

I am currently experiencing something similar to what you described in this issue: github#1081

I read your comment here

The specific problem of backing up the binlog could be solved by getting gh-ost to copy rows much slower. However, gh-ost would still be susceptible to this problem in other ways, for example, if gh-ost was throttled heavily just before cut-over.

Does this mean that HeartbeatLag is being caused by the copy itself? Meaning, if I process the copy itself faster, will that result in higher HeartbeatLag and longer wait times after the copy is complete? I had to kill a large migration in a test environment because lag had reached over 70,000 seconds.

I would appreciate any insight you have here @ccoffey

@myusuf3

myusuf3 commented Jan 23, 2022

I have even forcibly throttled myself and HeartbeatLag continues to grow. I am using gh-ost version 1.1.2.

myusuf3@tee:~$ echo "sup" | nc -U /tmp/gh-ost.b5prod.vault_items.sock
Copy: 126480000/904947480 14.0%; Applied: 0; Backlog: 0/1000; Time: 3h45m20s(total), 3h45m20s(copy); streamer: mysql-bin-changelog.012031:41829585; Lag: 318.42s, HeartbeatLag: 2229.29s, State: throttled, commanded by user; ETA: 23h6m56s
myusuf3@tee:~$ echo "sup" | nc -U /tmp/gh-ost.b5prod.vault_items.sock
Copy: 126480000/904947480 14.0%; Applied: 0; Backlog: 0/1000; Time: 3h45m22s(total), 3h45m21s(copy); streamer: mysql-bin-changelog.012031:47634096; Lag: 319.92s, HeartbeatLag: 2230.75s, State: throttled, commanded by user; ETA: 23h7m5s
myusuf3@tee:~$ echo "sup" | nc -U /tmp/gh-ost.b5prod.vault_items.sock
Copy: 126480000/904947480 14.0%; Applied: 0; Backlog: 0/1000; Time: 3h45m23s(total), 3h45m22s(copy); streamer: mysql-bin-changelog.012031:50747420; Lag: 320.72s, HeartbeatLag: 2231.53s, State: throttled, commanded by user; ETA: 23h7m10s
myusuf3@tee:~$ echo "sup" | nc -U /tmp/gh-ost.b5prod.vault_items.sock
Copy: 126480000/904947480 14.0%; Applied: 0; Backlog: 0/1000; Time: 3h45m23s(total), 3h45m23s(copy); streamer: mysql-bin-changelog.012031:53188812; Lag: 321.32s, HeartbeatLag: 2232.15s, State: throttled, commanded by user; ETA: 23h7m14s
myusuf3@tee:~$ echo "sup" | nc -U /tmp/gh-ost.b5prod.vault_items.sock
Copy: 126480000/904947480 14.0%; Applied: 0; Backlog: 0/1000; Time: 3h45m24s(total), 3h45m23s(copy); streamer: mysql-bin-changelog.012031:55407229; Lag: 321.82s, HeartbeatLag: 2232.71s, State: throttled, commanded by user; ETA: 23h7m17s
myusuf3@tee:~$ echo "sup" | nc -U /tmp/gh-ost.b5prod.vault_items.sock
Copy: 126480000/904947480 14.0%; Applied: 0; Backlog: 0/1000; Time: 3h45m24s(total), 3h45m24s(copy); streamer: mysql-bin-changelog.012031:57378604; Lag: 322.32s, HeartbeatLag: 2232.41s, State: throttled, commanded by user; ETA: 23h7m20s
myusuf3@tee:~$ echo "sup" | nc -U /tmp/gh-ost.b5prod.vault_items.sock
Copy: 126480000/904947480 14.0%; Applied: 0; Backlog: 0/1000; Time: 3h45m25s(total), 3h45m24s(copy); streamer: mysql-bin-changelog.012031:59364269; Lag: 322.82s, HeartbeatLag: 2232.91s, State: throttled, commanded by user; ETA: 23h7m23s
myusuf3@tee:~$ echo "sup" | nc -U /tmp/gh-ost.b5prod.vault_items.sock
Copy: 126480000/904947480 14.0%; Applied: 0; Backlog: 0/1000; Time: 3h45m25s(total), 3h45m25s(copy); streamer: mysql-bin-changelog.012031:61323482; Lag: 323.32s, HeartbeatLag: 2233.40s, State: throttled, commanded by user; ETA: 23h7m26s

@myusuf3

myusuf3 commented Jan 26, 2022

Bumping again.

@ccoffey

ccoffey commented Jan 26, 2022

Does this mean that HeartbeatLag is being caused by the copy itself?

It's possible. At my company we figured this out empirically: we made a copy of our production DB and experimented with it.

  • When we ran a single gh-ost migration, there was no HeartbeatLag.
  • When we ran two gh-ost migrations at the same time, there was HeartbeatLag.
  • When we ran a single gh-ost migration and we simulated production traffic, there was HeartbeatLag.

Meaning if I process this faster (doing the copy itself) that will result in higher HeartbeatLag and longer wait times after the copy is complete?

This makes sense: the writes done by the copy operation appear on the binlog, so at some copy speed you are going to experience HeartbeatLag.

In my post here I mentioned the following:

On our clusters, the parameter aurora_binlog_replication_max_yield_seconds defaulted to 60 seconds. By changing this value to 0 (on a test cluster), we were able to completely remove the HeartbeatLag that we were seeing.

Warning: We have not changed this parameter in our production environment as it may not be safe for us to do so. We are actually leaning towards keeping this value at its default of 60 seconds and instead relying on the safety added to gh-ost by this PR.

This is no longer true. We rolled this setting change out to all of our production DBs and we no longer experience significant HeartbeatLag. This was basically a silver bullet for us.

We later followed up with another improvement that you might want to try.

We noticed that certain migrations (on our largest table) still caused increased Aurora replication latency, sometimes reaching multiple seconds, which was unacceptable. My colleague solved this by running gh-ost with the following:

--throttle-query='select count(*) from mysql.ro_replica_status where Replica_lag_in_msec >= 100'

The reason this works is that gh-ost provides this option to throttle its activity based on the result of a query executed against the DB (see the documentation), and Aurora exposes the replication latency (and other stats) of the entire cluster in the mysql.ro_replica_status table.
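
If you want to sanity-check what such a throttle query would return before handing it to gh-ost, a small Go sketch like the one below works; the DSN is a placeholder and the driver import is an assumption about your setup.

package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/go-sql-driver/mysql" // MySQL driver, assumed available
)

func main() {
	// Placeholder DSN: point this at your Aurora cluster endpoint.
	db, err := sql.Open("mysql", "user:password@tcp(aurora-writer:3306)/mysql")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// The same query used with --throttle-query above: gh-ost throttles for as
	// long as the query returns a non-zero value.
	var laggingReplicas int
	err = db.QueryRow(
		"select count(*) from mysql.ro_replica_status where Replica_lag_in_msec >= 100",
	).Scan(&laggingReplicas)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("replicas lagging >= 100ms: %d (gh-ost throttles while this is > 0)\n", laggingReplicas)
}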

Best of luck!
