
Handling transient errors when backing up to Azure URL #50

Closed
m60freeman opened this issue May 14, 2018 · 17 comments

Comments

@m60freeman

m60freeman commented May 14, 2018

I occasionally see failed backups in one of my instances in an Azure VM that uses the backup to Azure BLOB storage feature. The errors generally look like this in the ERRORLOG:

Error: 18210, Severity: 16, State: 1.
BackupVirtualDeviceFile::RequestDurableMedia: Flush failure on backup device 'https://.blob.core.windows.net//_LOG_20180509_003000.trn'. Operating system error Backup to URL received an exception from the remote endpoint. Exception Message: The client could not finish the operation within specified timeout..

Working with Microsoft Premier Support, they had me add some trace flags that produce BackupToUrl log files. The one that matches the error above (allowing for timezone differences) has the following entries:

5/9/2018 12:35:02 AM: An unexpected exception occurred during communication on VDI Channel.
5/9/2018 12:35:02 AM: Exception Info: The client could not finish the operation within specified timeout.
5/9/2018 12:35:02 AM: Stack: at Microsoft.SqlServer.VdiInterface.VDI.AsyncIOCompletion(BlobRequestOptions options, List`1 asyncResults, CloudPageBlob pageBlob, Boolean onFlush)
at Microsoft.SqlServer.VdiInterface.VDI.PerformPageDataTransfer(CloudPageBlob pageBlob, AccessCondition leaseCondition, Boolean forBackup)
5/9/2018 12:35:02 AM: The Active queue had 0 requests until we got a clearerror
5/9/2018 12:35:02 AM: A fatal error occurred during Engine Communication, exception information follows
5/9/2018 12:35:02 AM: Exception Info: The client could not finish the operation within specified timeout.
5/9/2018 12:35:02 AM: Stack: at Microsoft.SqlServer.VdiInterface.VDI.PerformPageDataTransfer(CloudPageBlob pageBlob, AccessCondition leaseCondition, Boolean forBackup)
at BackupToUrl.Program.MainInternal(String[] args)

After consulting with their internal groups, the Support Engineer replied to me today with the following (the bolding reflects highlighting in the email from Microsoft):

I was unexpectedly out of office (OOF) for most of the day Friday, and could not wrap up my conversations with the product group (PG) on this. I have been working with them on yours as well as another case with similar symptoms – the common factor being the use of the backup script found at https://ola.hallengren.com/ . Based on the analysis and conversations, here is the current state:
• Cause/Solution: As noted in the link below, when there are sudden spikes in request load for the storage account, it can result in a few request timeouts. The solution would be to retry the failed operation (in this case backups). This is documented below:

https://docs.microsoft.com/en-us/azure/storage/common/storage-performance-checklist?toc=%2fazure%2fstorage%2fblobs%2ftoc.json#subheading14

Throttling/ServerBusy
In some cases, the storage service may throttle your application or may simply be unable to serve the request due to some transient condition and return a "503 Server busy" message or "500 Timeout". This can happen if your application is approaching any of the scalability targets, or if the system is rebalancing your partitioned data to allow for higher throughput. The client application should typically retry the operation that causes such an error: attempting the same request later can succeed. However, if the storage service is throttling your application because it is exceeding scalability targets, or even if the service was unable to serve the request for some other reason, aggressive retries usually make the problem worse. For this reason, you should use an exponential back off (the client libraries default to this behavior). For example, your application may retry after 2 seconds, then 4 seconds, then 10 seconds, then 30 seconds, and then give up completely. This behavior results in your application significantly reducing its load on the service rather than exacerbating any problems.
Note that connectivity errors can be retried immediately, because they are not the result of throttling and are expected to be transient.
So, please check to see if the script has some sort of a flag to retry the backup operations that fail due to transient conditions with storage accounts.
• Long term: The product group is looking at implementing changes in code to retry the backup failures, but they are weighing the pros and cons, and it will likely take a few more months before the fix (if approved) is in place.

We will also be working on documentation changes to reflect the learnings from these cases so that customers are better informed about these issues.

The DatabaseBackup stored procedure does not seem to have any mechanism to retry on error. While I would think that Microsoft should have built this into the BACKUP command code when they added the backup to URL feature, I'm not optimistic about that happening any time soon.

I found this: https://support.microsoft.com/en-us/help/4023679/fix-timeout-when-you-back-up-a-large-database-to-url-in-sql-server-201, which claims the issue was fixed in SQL Server 2012 SP3 CU10, but we are running 2012 SP4 and the problem obviously still exists. We've had the problem when backing up a 16 MB ldf, so their assertion that this happens only on "large" databases doesn't seem right. Or they fixed a different problem from the one I'm seeing. Regardless, I'm hoping that DatabaseBackup can be enhanced to retry on the appropriate errors.
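
For what it's worth, the kind of retry I'm hoping for is roughly the following. This is a sketch only; nothing like it exists in DatabaseBackup today, and the attempt count, delays, and parameter values are just examples:

-- Rough sketch of a retry wrapper with exponential backoff around DatabaseBackup.
-- DatabaseBackup has no retry parameter today, so this wraps the call in the job step.
DECLARE @Attempt int = 1;
DECLARE @MaxAttempts int = 4;
DECLARE @Delay varchar(8);

WHILE @Attempt <= @MaxAttempts
BEGIN
    BEGIN TRY
        EXECUTE dbo.DatabaseBackup
            @Databases = 'USER_DATABASES',
            @URL = 'https://myaccount.blob.core.windows.net/backups',
            @BackupType = 'LOG';
        BREAK; -- success, stop retrying
    END TRY
    BEGIN CATCH
        IF @Attempt >= @MaxAttempts
        BEGIN
            THROW; -- out of attempts, let the job step fail
        END;

        -- Exponential backoff: wait 2, 4, 8 seconds between attempts.
        SET @Delay = '00:00:0' + CAST(POWER(2, @Attempt) AS varchar(1));
        WAITFOR DELAY @Delay;
        SET @Attempt += 1;
    END CATCH;
END;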

@olahallengren
Owner

There is currently no retry mechanism in DatabaseBackup, but it is a very good idea.

I would like to check some things. Could you send me the output file from a failed backup job?

@m60freeman
Author

m60freeman commented May 14, 2018 via email

@olahallengren
Owner

Unfortunately I don't see any good way to implement this.

If I were going to do this, it should only be for certain errors, where a retry makes sense.

The backup command returns two errors. The first error is the interesting one, and the second error is a generic backup error. The problem is that you can only get the second error in T-SQL.
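
For example, a rough sketch of why this is awkward: T-SQL error handling only ever sees the last error raised, so a failed backup to URL surfaces the generic 3013 here, not the underlying 18210 flush failure:

BEGIN TRY
    BACKUP LOG [MyDatabase]
    TO URL = 'https://myaccount.blob.core.windows.net/backups/MyDatabase.trn';
END TRY
BEGIN CATCH
    -- Only the last error is available here: typically 3013,
    -- "BACKUP LOG is terminating abnormally."
    SELECT ERROR_NUMBER()  AS ErrorNumber,
           ERROR_MESSAGE() AS ErrorMessage;
END CATCH;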

@olahallengren
Owner

Yes, this is the output file I mean. Please set it to append, or use the SQL Server Agent tokens for date and time:

$(ESCAPE_SQUOTE(SQLLOGDIR))\DatabaseBackup_FULL_$(ESCAPE_SQUOTE(JOBID))_$(ESCAPE_SQUOTE(STEPID))_$(ESCAPE_SQUOTE(STRTDT))_$(ESCAPE_SQUOTE(STRTTM)).txt

@m60freeman
Author

m60freeman commented May 14, 2018 via email

@m60freeman
Author

m60freeman commented May 14, 2018 via email

@olahallengren
Owner

I checked the CommandLog, and the backup shows an ErrorNumber of 3013, but a NULL ErrorMessage.

Yes, this is the same issue. I am using the error variable, which only gives me the last error for the logging, and no error message. However, I will still get both errors in the output file.

Here is an article about this issue:
https://blogs.msdn.microsoft.com/sqlprogrammability/2006/04/03/server-side-error-handling-part-2-errors-and-error-messages/

@GeorgePalacios

Just to chime in here, we are experiencing the same issue and are in the process of dealing with MS support.

If any error files are needed (or anything else), I will be more than happy to provide them.

@allenwux

We are looking at this issue now. Will update once we have a solution.

@GeorgePalacios

I have confirmation from a Microsoft Support Engineer that the next CU for SQL Server 2014 will ship with some retry logic for the BackupToURL functionality.

@cfendrick

cfendrick commented Aug 15, 2018

We currently experience the same random errors you are seeing, and we are backing up exclusively to blob storage from an Azure D15 VM running Microsoft SQL Server 2016 (SP2-CU2) 13.0.5153.0 with 800-plus databases.

Retry logic built into the BackupToURL functionality would be extremely beneficial, and I would welcome the addition in all SQL Server versions. As a workaround, you could create a job that looks at the CommandLog and generates the syntax to reattempt the failed backups.
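
Something along these lines could work (an untested sketch; it simply re-executes the command text that the maintenance solution logged for recent failed backups, and the one-hour window is arbitrary):

DECLARE @Command nvarchar(max);

DECLARE RetryCursor CURSOR LOCAL FAST_FORWARD FOR
    SELECT Command
    FROM dbo.CommandLog
    WHERE CommandType IN ('BACKUP_DATABASE', 'BACKUP_LOG')
      AND ErrorNumber <> 0
      AND StartTime >= DATEADD(HOUR, -1, SYSDATETIME());

OPEN RetryCursor;
FETCH NEXT FROM RetryCursor INTO @Command;

WHILE @@FETCH_STATUS = 0
BEGIN
    EXECUTE (@Command); -- re-run the exact BACKUP command that failed
    FETCH NEXT FROM RetryCursor INTO @Command;
END;

CLOSE RetryCursor;
DEALLOCATE RetryCursor;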


@philosophicles

I'm also experiencing what seems to be the same problem (on 2012 SP4).
Having found this thread, I'm experimenting over the weekend with the built-in retry functionality in SQL Server Agent job steps, as I run DatabaseBackup from a job anyway.
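
Concretely, I'm just adding retry settings to the existing job step, along these lines (the job and step names here are only examples):

-- Add automatic retries to an existing DatabaseBackup job step.
EXECUTE msdb.dbo.sp_update_jobstep
    @job_name = N'DatabaseBackup - USER_DATABASES - LOG',
    @step_id = 1,
    @retry_attempts = 3,  -- retry the step up to 3 times
    @retry_interval = 5;  -- minutes to wait between attempts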

@GeorgePalacios

GeorgePalacios commented Sep 10, 2018 via email

@m60freeman
Author

@GeorgePalacios: Nice! Last I heard, they were not going to provide this fix for 2012, so I'm glad to hear this.

@olahallengren
Owner

olahallengren commented Sep 14, 2018

Microsoft has now released a fix for SQL Server 2014, SQL Server 2016, and SQL Server 2017.
https://support.microsoft.com/en-us/help/4463320/fix-intermittent-failures-when-you-run-backups-to-azure-storage-from-s

@durilai

durilai commented Jan 14, 2020

@olahallengren I found this topic and am experiencing very similar behavior on SQL Server 2017 EE, CU17. We back up dozens of databases, ranging from a few GB to 5 TB, to Azure blob storage using your solution.

It works great most of the time, but I get the occasional 3013. The backup typically succeeds on the next attempt, but occasionally multi-attempt failures do occur. I have set the transfer and block sizes, I back up to 64 files, and the failures are usually not on the largest of the databases.

I do not see any details in the log; error messages are NULL in the CommandLog table. Any guidance on how to achieve more stability would be great. Thanks for your time.
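
For reference, the calls look roughly like this (the values are illustrative rather than exact, and the URL/credential setup is omitted):

EXECUTE dbo.DatabaseBackup
    @Databases = 'USER_DATABASES',
    @URL = 'https://myaccount.blob.core.windows.net/backups',
    @BackupType = 'FULL',
    @Compress = 'Y',
    @MaxTransferSize = 4194304, -- 4 MB
    @BlockSize = 65536,
    @NumberOfFiles = 64;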

@olahallengren
Owner

@durilai, what would you like to have? A retry parameter?
