
box: multipart upload problems #2054

Closed
jostasche opened this issue Feb 6, 2018 · 14 comments

Comments

@jostasche

rclone fails to upload some of the larger files to Box. My observation is that it typically happens for the exact same files (1-13 GB in size), but the uploads seem to succeed eventually when the sync is restarted over and over. I am unsure whether that is because a certain amount of time (days/weeks) has passed.

I ran four synchronisations in debugging mode today and attached all log files.
The errors slightly differ, but all mention "refresh token" or "multipart upload" errors:

After many

commit multipart upload failed

messages, the following errors occur:
ERROR : box root 'Jos/0.85': Token refresh failed: couldn't list files: Get https://api.box.com/2.0/folders/450***84/items?fields=type%2Cid%2Csequence_id%2Cetag%2Csha1%2Cname%2Csize%2Ccreated_at%2Cmodified_at%2Ccontent_created_at%2Ccontent_modified_at%2Citem_status&limit=1000&offset=0: oauth2: cannot fetch token: 400 Bad Request

Response: {"error":"invalid_grant","error_description":"Refresh token has expired"}

I used to encounter the "range_overlaps_existing_part" error, but I didn't manage to reproduce it today in debugging mode.
ERROR : ******.trr: Failed to copy: multipart upload failed to upload part: Error "range_overlaps_existing_part" (416): Part overlaps with previously uploaded part: {id: "15AE83DA", "offset": 6106906624, "size": 8388608}

rclone.log.1.txt
rclone.log.2.txt
rclone.log.3.txt
rclone.log.4.beta.txt

My details:

  • rclone v1.39 os/arch: linux/amd64 go version: go1.9.2 (used the beta version for the fourth attempt, see logs)
  • x86_64 GNU/Linux
  • Box.com
  • ./rclone-v1.39-linux-amd64/rclone sync -v --max-size 15G --stats 1m /data/ box:data --exclude ".*/**"
@ncw ncw added the bug label Feb 7, 2018
@ncw
Member

ncw commented Feb 7, 2018

I think there is more than one thing going on here...

In rclone.log.1.txt we see

2018/02/06 08:39:04 DEBUG : traj.trr: Cancelling multipart upload: multipart upload failed to finalize: too many tries to commit multipart upload - increase --low-level-retries
2018/02/06 08:39:04 ERROR : traj.trr: Failed to copy: multipart upload failed to finalize: too many tries to commit multipart upload - increase --low-level-retries

Which means that rclone didn't wait long enough for the file to be finalised. It is likely that if rclone had waited longer it would have worked.

On the retry we see

2018/02/06 08:44:48 DEBUG : traj.trr: Cancelling multipart upload: multipart upload failed to upload part: Put https://upload.box.com/api/2.0/files/upload_sessions/978436C11B373F25048523C8EA6E0C93: oauth2: cannot fetch token: 400 Bad Request
Response: {"error":"invalid_grant","error_description":"Refresh token has expired"}

Which I think must be the token expired but multiple threads refreshed it. The box refresh token is only valid once though so the remaining threads got this error. That is my guess as to what is going on.

This explanation doesn't quite add up, as we don't see a "Saved new token in config file" DEBUG message, so the refresh never actually succeeded.

In rclone.log.2.txt we see the same thing with rclone not waiting long enough.

In rclone.log.3.txt we see the token expiring and the refresh not working, again without a "Saved new token in config file" message

2018/02/06 12:06:45 ERROR : box root 'Jos/hamilton/MD-AB-surface/octacube/1.0/400K/0.85': Token refresh failed: couldn't list files: Get https://api.box.com/2.0/folders/45065466784/items?fields=type%2Cid%2Csequence_id%2Cetag%2Csha1%2Cname%2Csize%2Ccreated_at%2Cmodified_at%2Ccontent_created_at%2Ccontent_modified_at%2Citem_status&limit=1000&offset=0: oauth2: cannot fetch token: 400 Bad Request
Response: {"error":"invalid_grant","error_description":"Refresh token has expired"}
2018/02/06 12:06:48 DEBUG : combined.trr: Cancelling multipart upload: multipart upload failed to upload part: Put https://upload.box.com/api/2.0/files/upload_sessions/C1093086DA25A1101D0F81ABF4A1377C: oauth2: cannot fetch token: 400 Bad Request

rclone.log.4.beta.txt shows the not-waiting-long-enough problem again...

Conclusion

So, I think for big files transferred to box increasing --low-level-retries is necessary. Can you try increasing it from 10 to 100 (say) and then send a log. Hopefully it will succeed. I can then use that to tune the time rclone waits for a file to be re-assembled.

I used to encounter the "range_overlaps_existing_part" error, but I didn't manage to reproduce it today in debugging mode.

It might be that rclone needs to retry if that happens as it does if it gets a parts_mismatch error. Interestingly that error isn't in the docs!

If you can send a log with that happening that would be very helpful :-)

@ncw ncw added this to the v1.40 milestone Feb 7, 2018
@ncw ncw changed the title Error with multipart upload with BOX (connected to token issue?) box: multipart upload problems Feb 7, 2018
@ncw ncw modified the milestones: v1.40, v1.41 Mar 19, 2018
@ncw
Member

ncw commented Apr 21, 2018

Can you have an experiment with different values of --low-level-retries? I can't progress on this without some experimental data!

@ncw ncw modified the milestones: v1.41, Known Problem Apr 21, 2018
@guruevi

guruevi commented May 11, 2018

Just an FYI, I've seen the same issue. It happens more often when Box is either busy or perhaps scaling down for the night - generally some time in the evening; around 8pm I've had retries go up to 15.

I have ~6TB in Box right now, streaming at ~400Mbps; night time slows it down to ~160Mbps (we have 10Gbps to Box.com, so there seems to be some single-client limit). I'm uploading 50TB in the next couple of days; I'll try to log what I can, but upping retries does work thus far.

The number of threads doesn't seem to matter. For large numbers of large files you can use --transfers=16; for small files (4kB), transfers above 12 result in throttling issues (probably because you're hammering the API with many requests).

2018/05/10 19:57:37 DEBUG : Backup/RCBI/RCBI/auto-20180508.0300-4w/1525992712.591292: commit multipart upload failed 15/100 - trying again in 3 seconds (not ready yet)
2018/05/10 19:57:38 INFO  : 
Transferred:    12 GBytes (19.481 MBytes/s)
Errors:                 0
Checks:                 0
Transferred:            0
Elapsed time:    10m30.7s
Transferring:
 *   .../auto-20180508.0300-4w/1525992712.591292: 100% /12G, 0/s, 0s

2018/05/10 19:57:41 INFO  : Backup/RCBI/RCBI/auto-20180508.0300-4w/1525992712.591292: Copied (new)
2018/05/10 19:57:42 INFO  : Backup/RCBI/RCBI/auto-20180508.0300-4w/1525992712.591292: Deleted
2018/05/10 19:57:42 INFO  : 
Transferred:    12 GBytes (19.376 MBytes/s)
Errors:                 0
Checks:                 1
Transferred:            1
Elapsed time:    10m34.1s

2018/05/10 19:57:42 DEBUG : 6 go routines active
2018/05/10 19:57:42 DEBUG : rclone: Version "v1.41" finishing with parameters ["rclone" "-vvvv" "--stats=30s" "--transfers=16" "--low-level-retries=100" "--exclude=tapedev" "--exclude=lastsnap" "--buffer-size=64M" "move" "/mnt/Backup/tapes" "URBox:/ZFS-SEND/"]
root@zombie:/mnt/Backup/ZFS2rclone # 

@jmfrank63

Hi Nick, my name is Johannes, I work for Box. I found this issue via Google after not being able to find any errors on the Box side, although the uploads still failed for certain larger files. I followed your recommendation and increased the low-level retries to 100, and that seems to have fixed the issue.
I also wrote you a message on LinkedIn; maybe you'll have time to read it.

@ncw
Member

ncw commented Jul 4, 2018

Nice to hear from you Johannes @jmfrank63 .

The default for --low-level-retries is 10, which isn't quite enough time to be sure box has re-assembled the parts.

I'd like to find the minimum value of --low-level-retries which produces a reliable result - could you help with that?

Then I can tune rclone's retries accordingly.

@jmfrank63

jmfrank63 commented Jul 4, 2018 via email

@guruevi

guruevi commented Jul 6, 2018

Here is a spreadsheet with the number of retries for each file and time intervals it says it will retry it at:
https://rochester.box.com/s/83xba8qt303uezjj846fkh46sgwf87vt

You can try making some histograms and predictions.

The maximum number I got was 13, no failures. This was a 5.85 TB transfer for 456 items.

@ncw
Member

ncw commented Jul 6, 2018

@guruevi that is really useful thank you.

At the moment the number of retries is --low-level-retries, which is 10 by default. I was thinking I'd multiply that by 2, so the default number of retries becomes 20, but the user still has some control over it by adjusting --low-level-retries.

20 seems like a safe margin over 14.

What does everybody think about this approach? I don't really want to make yet another parameter for rclone if I don't have to!

@jmfrank63

jmfrank63 commented Jul 6, 2018 via email

@ncw ncw added the Remote: Box label Jul 9, 2018
@ncw ncw modified the milestones: Known Problem, v1.43 Jul 9, 2018
ncw added a commit that referenced this issue Jul 9, 2018
Empirically it was discovered that 20 tries is enough, so make this
the minimum value, with larger values settable with --low-level-retries

Fixes #2054
@ncw
Member

ncw commented Jul 9, 2018

OK, what I've decided to do is make the minimum value 20, but if you set --low-level-retries larger than that, then it will use that value instead.

I've done this in

https://beta.rclone.org/branch/v1.42-029-g74ab37f7-fix-2054-box-upload/ (uploaded in 15-30 mins)

Can you give it a quick test and if it looks good I'll merge it.

@jmfrank63

Hi Nick,

just bumped into a file a little over 4GB that did not upload with 20 retries. To make sure it was 20 that was failing, I added the parameter and set the value explicitly to 20, and it still failed.
I tried to upload with 30 retries and it still failed, but it succeeded with 100. I did not time it in seconds, however. Do you know what a retry is worth in time?
Additionally, I noticed the help text still shows 10 as the default, as does rclone browser (might not be your problem though).

With 100 I would like to do a little more investigation why this is taking so long and what the possible problem might be. As I work in User Services I would like to open a ticket on your behalf. May I use your support email address given for rclone?
Thanks Johannes

@ncw
Member

ncw commented Jul 10, 2018

@jmfrank63 Thanks for testing.

just bumped into a file a little over 4GB that did not upload with 20 retries. To make sure it was 20 that was failing, I added the parameter and set the value explicitly to 20, and it still failed.

Hmm...

I tried to upload with 30 retries and it still failed, but it succeeded with 100. I did not time it in seconds, however. Do you know what a retry is worth in time?

The server sends a time the client should retry so it depends... If the server doesn't send a time then each retry takes 10s.

If you run with -vv you can see the retries happening.

Additionally, I noticed the help text still shows 10 as the default, as does rclone browser (might not be your problem though).

I haven't changed the value for --low-level-retries, if that's what you mean? That controls a whole lot of other retry things. The retries value for the box upload finalization doesn't have its own parameter (though it could).

With 100 I would like to do a little more investigation why this is taking so long and what the possible problem might be. As I work in User Services I would like to open a ticket on your behalf. May I use your support email address given for rclone?

Yes by all means use nick@craig-wood.com to open a ticket.

I'm wondering whether we need something similar to the flag we put in for Amazon Drive:

  --acd-upload-wait-per-gb duration     Additional time per GB to wait after a failed complete upload to see if it appears. (default 3m0s)

I think this is the same problem - the assembly of all the parts of a big file takes some time, the bigger the file, the more time.

It might be that I should set the number of retries really large (say 100). The max retries protects against things getting in a loop, but if box is still responding sensibly to the requests then there is an argument to say rclone should carry on trying.

@ncw
Member

ncw commented Aug 11, 2018

I think what I'll do is make a new variable that can be set independently of --low-level-retries say --box-commit-retries and default this to 100. This should fix the problem and allow it to be tweaked if necessary.

@ncw ncw closed this as completed in 751bfd4 Aug 11, 2018
@ncw
Member

ncw commented Aug 11, 2018

I've made the new flag and defaulted it to 100 here

https://beta.rclone.org/v1.42-099-g751bfd45/ (uploaded in 15-30 mins)

This will be in the latest beta and in v1.43
