Clear replication errors when any replication progress has been made #421
Currently, when any temporary error occurs during initial replication, the error persists until replication completes. As a result, the dashboard (or anything else using the diagnostics API) keeps displaying the error even after replication has resumed.
This is confusing for users: it leads them to believe there is an issue with replication, so they often re-deploy, or even stop and restart the instance, which further delays sync rule processing.
This changes the logic to clear the replication error on any flush that made progress, instead of only on the final commit. As background, we previously cleared the error only on commit, so that it would not be cleared while replication kept running into the same error on every retry. However, since the change to skip existing rows during snapshots (#150), any successful flush does indicate progress, making it safe to clear the error.
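To illustrate the intended behaviour, here is a minimal sketch of the clearing rule. All names (`ReplicationStatus`, `FlushResult`, `opsPersisted`, `onFlush`, `onCommit`) are hypothetical and chosen for illustration only; they are not the actual classes or fields touched by this PR.

```typescript
// Hypothetical sketch; names do not correspond to the real implementation.
interface FlushResult {
  // Assumed field: number of operations persisted by this flush.
  opsPersisted: number;
}

class ReplicationStatus {
  private lastError: string | null = null;

  // Record a replication error so it shows up in diagnostics.
  reportError(message: string): void {
    this.lastError = message;
  }

  getError(): string | null {
    return this.lastError;
  }

  // Old rule: the error was only cleared on the final commit.
  onCommit(): void {
    this.lastError = null;
  }

  // New rule: any flush that made progress also clears the error.
  // A flush that persisted nothing leaves the error in place.
  onFlush(result: FlushResult): void {
    if (result.opsPersisted > 0) {
      this.lastError = null;
    }
  }
}
```

With this rule, an error reported before a retry stays visible only until the first flush that actually persists data, rather than until the whole snapshot commits.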
While this primarily affects initial replication of new sync rules, it could also affect replication of the active sync rules in some edge cases with large replication delays. If the delay is large enough to be an actual issue, we have separate checks that would report it.
Additionally, this adds the timestamp of when an error occurred to the diagnostics API. For anything other than these replication errors, it would be the current time. For replication errors, it is the time the error last occurred. We only record the time with MongoDB storage currently. Postgres storage would require a migration to add this, so we can do that in the next minor release.
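The timestamp rules above could look roughly like the sketch below. The type and function names (`StoredReplicationError`, `DiagnosticsError`, `lastErrorAt`, `ts`) are assumptions for illustration, not the actual diagnostics API schema. In particular, returning `null` when the storage has not recorded a time is an assumption of this sketch; the description does not say what is reported in that case.

```typescript
// Hypothetical shapes; the real diagnostics API fields may differ.
interface StoredReplicationError {
  message: string;
  // Only present when the storage records it (MongoDB storage, per the description).
  lastErrorAt?: Date;
}

interface DiagnosticsError {
  message: string;
  // ISO 8601 timestamp, or null when no time was recorded (assumption).
  ts: string | null;
}

// Replication errors report the time the error last occurred.
function replicationErrorEntry(stored: StoredReplicationError): DiagnosticsError {
  return {
    message: stored.message,
    ts: stored.lastErrorAt ? stored.lastErrorAt.toISOString() : null,
  };
}

// Any other error is reported with the current time.
function otherErrorEntry(message: string, now: Date = new Date()): DiagnosticsError {
  return { message, ts: now.toISOString() };
}
```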