Improve sync error handling #698

Open: wants to merge 1 commit into base: master

@GRMrGecko GRMrGecko commented Feb 7, 2024

The company I work for has encountered a problem where a failed sync ends up deleting all destination files. I looked into the issue and found that someone else had reported the same behavior in issue #695.

While testing different failure cases, I found the following by running with the trace log level and adding an extra log statement to the source object retrieval section.

The test environment is a Ceph Object Gateway holding some test files and folders, which lets me quickly sync a test data set. I also have a PHP script on my web server that reproduces an error response I saw when the Ceph object store was failing.

Test PHP file:

$ cat index.php
<?php
// Return a gateway error for every request except HEAD.
if ($_SERVER['REQUEST_METHOD'] != "HEAD") {
    header("HTTP/1.1 504 Gateway Timeout");
}
?>
<html>
<head>
<title>test</title>
</head>
<body>
This is a test of failures on object store.
</body>
</html>

First, I tested network-related issues by rebooting my web server while s5cmd was in a retry loop, and I got the following error:

DEBUG retryable error: RequestError: send request failed
caused by: Get "https://objects.gec.im/test-jcoleman?list-type=2&prefix=": dial tcp 136.53.94.174:443: connect: connection refused
Error at source object: RequestError: send request failed
caused by: Get "https://objects.gec.im/test-jcoleman?list-type=2&prefix=": dial tcp 136.53.94.174:443: connect: connection refused
ERROR "sync --delete=true --no-follow-symlinks=true s3://test-jcoleman/* /home/grmrgecko/test/": RequestError: send request failed caused by: Get "https://objects.gec.im/test-jcoleman?list-type=2&prefix=": dial tcp 136.53.94.174:443: connect: connection refused
Run delete plan.
Plan done.
rm /home/grmrgecko/test/test-dir-1/test-file-11
rm /home/grmrgecko/test/test-file-6
rm /home/grmrgecko/test/test-dir-2/test-file-3
rm /home/grmrgecko/test/test-dir-2/test-file-4
rm /home/grmrgecko/test/test-dir-2/test-file-5

I added RequestError to the list of awsErr codes in shouldStopSync to make this error stop the sync.
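
The log above shows the delete plan running right after the source listing failed. One way to read that, sketched below under assumptions, is that the files to delete are whatever exists at the destination but not in the source listing, so a listing that fails and yields nothing makes every destination file look extra. The function name `extraAtDestination` and the key lists are hypothetical and are not taken from the s5cmd code.

```go
package main

import (
	"fmt"
	"sort"
)

// extraAtDestination returns the destination keys that are absent from the
// source listing. With --delete, these are the files that would be removed.
// Hypothetical helper for illustration only.
func extraAtDestination(source, destination []string) []string {
	seen := make(map[string]struct{}, len(source))
	for _, key := range source {
		seen[key] = struct{}{}
	}
	var extra []string
	for _, key := range destination {
		if _, ok := seen[key]; !ok {
			extra = append(extra, key)
		}
	}
	sort.Strings(extra)
	return extra
}

func main() {
	destination := []string{"test-file-6", "test-dir-2/test-file-3"}

	// Healthy listing: source and destination match, nothing is extra.
	fmt.Println(len(extraAtDestination(destination, destination)))

	// Failed listing modelled as an empty source: every destination file
	// is reported as extra and would be deleted.
	fmt.Println(extraAtDestination(nil, destination))
}
```

Under this reading, the safe behavior is to abort before the delete plan whenever the source listing returned an error, which is what stopping the sync on `RequestError` achieves.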

My next test uses the PHP script above to reproduce the following error, which we got when Ceph was in a bad state:

ERROR "sync --delete=true --no-follow-symlinks=true s3://pkgrepo-webroot/* /var/www/repo": SerializationError: failed to unmarshal error message status code: 504, request id: , host id: caused by: UnmarshalError: failed to unmarshal error message 00000000 3c 68 74 6d 6c 3e 3c 62 6f 64 79 3e 3c 68 31 3e |<html><body><h1>| 00000010 35 30 34 20 47 61 74 65 77 61 79 20 54 69 6d 65 |504 Gateway Time| 00000020 2d 6f 75 74 3c 2f 68 31 3e 0a 54 68 65 20 73 65 |-out</h1>.The se| 00000030 72 76 65 72 20 64 69 64 6e 27 74 20 72 65 73 70 |rver didn't resp| 00000040 6f 6e 64 20 69 6e 20 74 69 6d 65 2e 0a 3c 2f 62 |ond in time..</b| 00000050 6f 64 79 3e 3c 2f 68 74 6d 6c 3e 0a |ody></html>.| caused by: expected element type <Error> but have <html>

Testing with my debug code, I get the following:

Error at source object: SerializationError: failed to unmarshal error message
        status code: 504, request id: , host id: 
caused by: UnmarshalError: failed to unmarshal error message
        00000000  3c 68 74 6d 6c 3e 0a 3c  68 65 61 64 3e 0a 3c 74  |<html>.<head>.<t|
00000010  69 74 6c 65 3e 74 65 73  74 3c 2f 74 69 74 6c 65  |itle>test</title|
00000020  3e 0a 3c 2f 68 65 61 64  3e 0a 3c 62 6f 64 79 3e  |>.</head>.<body>|
00000030  0a 54 68 69 73 20 69 73  20 61 20 74 65 73 74 20  |.This is a test |
00000040  6f 66 20 66 61 69 6c 75  72 65 73 20 6f 6e 20 6f  |of failures on o|
00000050  62 6a 65 63 74 20 73 74  6f 72 65 2e 0a 3c 2f 62  |bject store..</b|
00000060  6f 64 79 3e 0a 3c 2f 68  74 6d 6c 3e 0a           |ody>.</html>.|

caused by: expected element type <Error> but have <html>
DEBUG: Response s3/ListObjectsV2 Details:
---[ RESPONSE ]--------------------------------------
HTTP/2.0 504 Gateway Timeout
Connection: close
Content-Type: text/html; charset=UTF-8
Date: Wed, 07 Feb 2024 17:56:42 GMT
Server: nginx/1.24.0
X-Powered-By: PHP/7.4.33


-----------------------------------------------------
DEBUG retryable error: SerializationError: failed to unmarshal error message
        status code: 504, request id: , host id: 
caused by: UnmarshalError: failed to unmarshal error message
        00000000  3c 68 74 6d 6c 3e 0a 3c  68 65 61 64 3e 0a 3c 74  |<html>.<head>.<t|
00000010  69 74 6c 65 3e 74 65 73  74 3c 2f 74 69 74 6c 65  |itle>test</title|
00000020  3e 0a 3c 2f 68 65 61 64  3e 0a 3c 62 6f 64 79 3e  |>.</head>.<body>|
00000030  0a 54 68 69 73 20 69 73  20 61 20 74 65 73 74 20  |.This is a test |
00000040  6f 66 20 66 61 69 6c 75  72 65 73 20 6f 6e 20 6f  |of failures on o|
00000050  62 6a 65 63 74 20 73 74  6f 72 65 2e 0a 3c 2f 62  |bject store..</b|
00000060  6f 64 79 3e 0a 3c 2f 68  74 6d 6c 3e 0a           |ody>.</html>.|

caused by: expected element type <Error> but have <html>
ERROR "sync --delete=true --no-follow-symlinks=true s3://test-jcoleman/* /home/grmrgecko/test/": SerializationError: failed to unmarshal error message status code: 504, request id: , host id: caused by: UnmarshalError: failed to unmarshal error message 00000000 3c 68 74 6d 6c 3e 0a 3c 68 65 61 64 3e 0a 3c 74 |<html>.<head>.<t| 00000010 69 74 6c 65 3e 74 65 73 74 3c 2f 74 69 74 6c 65 |itle>test</title| 00000020 3e 0a 3c 2f 68 65 61 64 3e 0a 3c 62 6f 64 79 3e |>.</head>.<body>| 00000030 0a 54 68 69 73 20 69 73 20 61 20 74 65 73 74 20 |.This is a test | 00000040 6f 66 20 66 61 69 6c 75 72 65 73 20 6f 6e 20 6f |of failures on o| 00000050 62 6a 65 63 74 20 73 74 6f 72 65 2e 0a 3c 2f 62 |bject store..</b| 00000060 6f 64 79 3e 0a 3c 2f 68 74 6d 6c 3e 0a      |ody>.</html>.| caused by: expected element type <Error> but have <html>
Run delete plan.
Plan done.
rm /home/grmrgecko/test/test-dir-1/test-file-12
rm /home/grmrgecko/test/test-dir-1/test-file-10
rm /home/grmrgecko/test/test-dir-1/test-file-11
rm /home/grmrgecko/test/test-dir-2/test-file-13
rm /home/grmrgecko/test/test-dir-1/test-file-13
rm /home/grmrgecko/test/test-dir-1/test-file-14
rm /home/grmrgecko/test/test-dir-1/test-file-1
rm /home/grmrgecko/test/test-dir-1/test-file-4

With the above, I have added SerializationError to the list in shouldStopSync.
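
As a rough illustration of the check described here, the sketch below matches an error's code against the two codes this PR adds. It is not the s5cmd source: the `codedError` interface stands in for `awserr.Error` from aws-sdk-go so the example runs without the SDK, `fakeAWSError` is a hypothetical test double, the `exitOnError` parameter is an assumption, and any codes already in the real list are left out.

```go
package main

import (
	"errors"
	"fmt"
)

// codedError mirrors the Code() method of awserr.Error so that this sketch
// compiles without aws-sdk-go. It is a stand-in, not the SDK interface.
type codedError interface {
	error
	Code() string
}

// fakeAWSError is a hypothetical test double for an SDK error.
type fakeAWSError struct{ code, msg string }

func (e fakeAWSError) Error() string { return e.code + ": " + e.msg }
func (e fakeAWSError) Code() string  { return e.code }

// shouldStopSync reports whether err must abort the sync before the delete
// plan is built. Only the two codes added in this PR are listed here.
func shouldStopSync(err error, exitOnError bool) bool {
	var ce codedError
	if errors.As(err, &ce) {
		switch ce.Code() {
		case "RequestError", "SerializationError":
			return true
		}
	}
	// Assumption: other errors stop the sync only when the caller asks for it.
	return exitOnError
}

func main() {
	network := fakeAWSError{"RequestError", "send request failed"}
	gateway := fakeAWSError{"SerializationError", "failed to unmarshal error message"}
	other := fakeAWSError{"SlowDown", "please retry"}

	fmt.Println(shouldStopSync(network, false))
	fmt.Println(shouldStopSync(gateway, false))
	fmt.Println(shouldStopSync(other, false))
}
```

Matching on the code string rather than on the message keeps the check independent of the response body, which in the 504 case is an arbitrary HTML page.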

To ensure that cancelling the sync also cancels the plan run, I passed the context as a parameter and added a context error check so that the plan stops when the context is cancelled.

Now to test that these changes actually fix the problem. Testing the first failure case, a network-related issue:

-----------------------------------------------------
DEBUG retryable error: RequestError: send request failed
caused by: Get "https://objects.gec.im/test-jcoleman?list-type=2&prefix=": dial tcp 136.53.94.174:443: i/o timeout
ERROR "sync --delete=true --no-follow-symlinks=true s3://test-jcoleman/* /home/grmrgecko/test/": RequestError: send request failed caused by: Get "https://objects.gec.im/test-jcoleman?list-type=2&prefix=": dial tcp 136.53.94.174:443: i/o timeout
ERROR "sync --delete=true --no-follow-symlinks=true s3://test-jcoleman/* /home/grmrgecko/test/": RequestError: send request failed caused by: Get "https://objects.gec.im/test-jcoleman?list-type=2&prefix=": dial tcp 136.53.94.174:443: i/o timeout

$ ls -lah ~/test/                                                                                                                      
total 0
drwxr-xr-x 1 grmrgecko grmrgecko  192 Feb  7 13:20 .
drwx--x--- 1 grmrgecko      1001 5.3K Feb  7 13:20 ..
drwxr-xr-x 1 grmrgecko grmrgecko  510 Feb  7 13:20 test-dir-1
drwxr-xr-x 1 grmrgecko grmrgecko  510 Feb  7 13:20 test-dir-2
drwxr-xr-x 1 grmrgecko grmrgecko  510 Feb  7 13:20 test-dir-3
-rw-r--r-- 1 grmrgecko grmrgecko    0 Feb  7 13:20 test-file-1
-rw-r--r-- 1 grmrgecko grmrgecko    0 Feb  7 13:20 test-file-2
-rw-r--r-- 1 grmrgecko grmrgecko    0 Feb  7 13:20 test-file-3
-rw-r--r-- 1 grmrgecko grmrgecko    0 Feb  7 13:20 test-file-4
-rw-r--r-- 1 grmrgecko grmrgecko    0 Feb  7 13:20 test-file-5
-rw-r--r-- 1 grmrgecko grmrgecko    0 Feb  7 13:20 test-file-6

Finally, the second failure case, an S3 backend failure:

-----------------------------------------------------
DEBUG: Response s3/ListObjectsV2 Details:
---[ RESPONSE ]--------------------------------------
HTTP/2.0 504 Gateway Timeout
Connection: close
Content-Type: text/html; charset=UTF-8
Date: Wed, 07 Feb 2024 19:31:35 GMT
Server: nginx/1.24.0
X-Powered-By: PHP/7.4.33


-----------------------------------------------------
DEBUG retryable error: SerializationError: failed to unmarshal error message
        status code: 504, request id: , host id: 
caused by: UnmarshalError: failed to unmarshal error message
        00000000  3c 68 74 6d 6c 3e 0a 3c  68 65 61 64 3e 0a 3c 74  |<html>.<head>.<t|
00000010  69 74 6c 65 3e 74 65 73  74 3c 2f 74 69 74 6c 65  |itle>test</title|
00000020  3e 0a 3c 2f 68 65 61 64  3e 0a 3c 62 6f 64 79 3e  |>.</head>.<body>|
00000030  0a 54 68 69 73 20 69 73  20 61 20 74 65 73 74 20  |.This is a test |
00000040  6f 66 20 66 61 69 6c 75  72 65 73 20 6f 6e 20 6f  |of failures on o|
00000050  62 6a 65 63 74 20 73 74  6f 72 65 2e 0a 3c 2f 62  |bject store..</b|
00000060  6f 64 79 3e 0a 3c 2f 68  74 6d 6c 3e 0a           |ody>.</html>.|

caused by: expected element type <Error> but have <html>
ERROR "sync --delete=true --no-follow-symlinks=true s3://test-jcoleman/* /home/grmrgecko/test/": SerializationError: failed to unmarshal error message status code: 504, request id: , host id: caused by: UnmarshalError: failed to unmarshal error message 00000000 3c 68 74 6d 6c 3e 0a 3c 68 65 61 64 3e 0a 3c 74 |<html>.<head>.<t| 00000010 69 74 6c 65 3e 74 65 73 74 3c 2f 74 69 74 6c 65 |itle>test</title| 00000020 3e 0a 3c 2f 68 65 61 64 3e 0a 3c 62 6f 64 79 3e |>.</head>.<body>| 00000030 0a 54 68 69 73 20 69 73 20 61 20 74 65 73 74 20 |.This is a test | 00000040 6f 66 20 66 61 69 6c 75 72 65 73 20 6f 6e 20 6f |of failures on o| 00000050 62 6a 65 63 74 20 73 74 6f 72 65 2e 0a 3c 2f 62 |bject store..</b| 00000060 6f 64 79 3e 0a 3c 2f 68 74 6d 6c 3e 0a      |ody>.</html>.| caused by: expected element type <Error> but have <html>
ERROR "sync --delete=true --no-follow-symlinks=true s3://test-jcoleman/* /home/grmrgecko/test/": SerializationError: failed to unmarshal error message status code: 504, request id: , host id: caused by: UnmarshalError: failed to unmarshal error message 00000000 3c 68 74 6d 6c 3e 0a 3c 68 65 61 64 3e 0a 3c 74 |<html>.<head>.<t| 00000010 69 74 6c 65 3e 74 65 73 74 3c 2f 74 69 74 6c 65 |itle>test</title| 00000020 3e 0a 3c 2f 68 65 61 64 3e 0a 3c 62 6f 64 79 3e |>.</head>.<body>| 00000030 0a 54 68 69 73 20 69 73 20 61 20 74 65 73 74 20 |.This is a test | 00000040 6f 66 20 66 61 69 6c 75 72 65 73 20 6f 6e 20 6f |of failures on o| 00000050 62 6a 65 63 74 20 73 74 6f 72 65 2e 0a 3c 2f 62 |bject store..</b| 00000060 6f 64 79 3e 0a 3c 2f 68 74 6d 6c 3e 0a      |ody>.</html>.| caused by: expected element type <Error> but have <html>

$ ls -lah ~/test/                       
total 0
drwxr-xr-x 1 grmrgecko grmrgecko  192 Feb  7 13:20 .
drwx--x--- 1 grmrgecko      1001 5.3K Feb  7 13:20 ..
drwxr-xr-x 1 grmrgecko grmrgecko  510 Feb  7 13:20 test-dir-1
drwxr-xr-x 1 grmrgecko grmrgecko  510 Feb  7 13:20 test-dir-2
drwxr-xr-x 1 grmrgecko grmrgecko  510 Feb  7 13:20 test-dir-3
-rw-r--r-- 1 grmrgecko grmrgecko    0 Feb  7 13:20 test-file-1
-rw-r--r-- 1 grmrgecko grmrgecko    0 Feb  7 13:20 test-file-2
-rw-r--r-- 1 grmrgecko grmrgecko    0 Feb  7 13:20 test-file-3
-rw-r--r-- 1 grmrgecko grmrgecko    0 Feb  7 13:20 test-file-4
-rw-r--r-- 1 grmrgecko grmrgecko    0 Feb  7 13:20 test-file-5
-rw-r--r-- 1 grmrgecko grmrgecko    0 Feb  7 13:20 test-file-6

@GRMrGecko GRMrGecko requested a review from a team as a code owner February 7, 2024 19:38
@GRMrGecko GRMrGecko requested review from igungor and seruman and removed request for a team February 7, 2024 19:38
k0ste commented May 15, 2024

@adrienyhuel does this resolve #695 for you?

adrienyhuel commented

> @adrienyhuel does this resolve #695 for you?

I don't know, I'm waiting for a release :)

@ilkinulas ilkinulas added the sync label Jun 23, 2024