vagrant-cloud post-processor fails to upload 5GB box #6615
How does it fail? |
It takes about 2 hours to build the box, so I don't have logs handy. It's this message: But I can't remember |
It would be great to have a log, at least of the post-processor part. |
@double16 I've run into variations of this bug many, many times. If you compile your own packer version, try this sed command: sed -i -e "s/common.Retry(10, 10, 3/common.Retry(10, 10, 20/g" post-processor/vagrant-cloud/step_upload.go which will modify packer to retry the upload 20 times before failing. With this change, you might run into this error:
That appears to be synonymous with a successful upload on the previous attempt, where packer simply didn't receive the correct status code within the allotted time. From what I can tell, there is a delay between when the upload finishes, and sends the HTTP |
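For context, the retry-with-growing-pause behavior that the sed tweak above adjusts can be sketched in shell. This is an illustrative sketch only: the `retry` function and the backoff numbers below are assumptions for demonstration, not packer's actual Go implementation.

```shell
#!/bin/bash
# Illustrative retry-with-backoff sketch; retry() and its backoff numbers are
# hypothetical, not packer's real code.
retry() {
  local attempts="$1" delay="$2" max_delay="$3"; shift 3
  local n=1
  while true; do
    "$@" && return 0                      # success, stop retrying
    [ "$n" -ge "$attempts" ] && return 1  # out of attempts, give up
    sleep "$delay"
    # grow the pause between attempts, capped at max_delay
    delay=$(( delay * 2 > max_delay ? max_delay : delay * 2 ))
    n=$(( n + 1 ))
  done
}
# e.g.: retry 20 10 300 curl --request PUT --upload-file box.box "$UPLOAD_PATH"
```

The sed command in the comment above effectively raises only the attempt count; a sketch like this also shows where the inter-attempt pause could be tuned.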
@ladar if the upload is successful but validation fails, does it re-upload entirely the next try around, or does it just see that it's uploaded already and succeed? |
@SwampDragons I believe the box gets uploaded during the retries, if the previous attempt fails... until the 503 errors appear, at which point I don't believe it actually uploads the data. I just finished a full build and wiped all my logs, or I'd send you the raw data. I originally thought this might be a timeout problem, until I noticed the issue happens more frequently when I have all 5 robots building boxes. So I started looking into a rate/size limit with Vagrant Cloud, and opened a ticket with HashiCorp tech support. They said the quota wasn't the problem, but it took them a while to get back, and by then I was pegged to my v1.2.4 fork (waiting on the split function merge). So I haven't looked into this issue too deeply as of late. With the recent merge, I'm once again trying to re-sync with HEAD, so I'll probably spend more time investigating, and type up a full bug report when I have the time. It's worth noting, I recently wrote a crude bash script to upload the boxes that fail under packer, and that got me thinking it's a timeout issue (again). At least a timeout problem when it comes to the 503 errors. I have no theories on why the uploads sometimes fail before I get the 503. To date, whenever I've gotten a 503, then gone and downloaded the box and compared the checksums, they've matched. Whereas if the upload fails before I start getting the 503 errors, the box doesn't appear to be available for download. It may also be worth mentioning that this bug first appeared when I switched from the Atlas post-processor to the Vagrant Cloud post-processor. |
Interesting. Might be worth looking at whether the code for the client we used in Atlas uploads had some kind of error handling that the vagrant cloud post-processor doesn't. |
Nothing really jumps out at me as different enough to account for this; I wonder if our vagrant cloud servers are doing something differently. |
It's been so long that I could be remembering it wrong, but as I recall the Atlas uploader used a different set of URLs and API semantics, which made them difficult to compare, at least from the perspective of the log files. I don't think I ever tried comparing the code. |
I've reached out to the vagrant cloud folks to see if they have any insights. |
As it turns out, I still had one log file left from my robox run last week, and it contained a textbook example of the error. Note, this log/error was using packer v1.2.4, and like I said, I haven't done very much testing against HEAD. In the debug log, I've redacted most URL/cookie strings, just in case they contain something I shouldn't be posting on GitHub. Anywhere you see This packer binary was built with a retry count of 8, and an increasing pause of up to 300 seconds... the max allowed by the retry code, by running First, an abbreviated, reformatted log with the most relevant lines, as the debug output is hard to follow. The first 6 attempts failed, while the last 2 attempts returned a 503 error.
If you dig into the debug output below, you'll see the upload failures show a response of "<nil>" from the server, which suggests either the connection fizzled out, or the server took too long to process the upload and the client gave up waiting. The relevant debug lines:
Based on what I know of the Vagrant Cloud architecture, boxes appear to get uploaded to a HashiCorp server before being transferred to Amazon S3 for permanent storage. If the server waits until the latter finishes before responding, that would explain the timeout. The full debug log file follows, for your reading pleasure:
|
@double16 can you confirm that you're seeing the same error/problem? As you can see from my logs, adding additional retries with that sed find/replace, aka code tweak, will increase the likelihood an upload attempt will eventually make it through. If you're not set up to compile packer yourself, I can point you to a script I wrote which will do it for you automatically, or I can post binaries here. |
@ladar Sorry for the delay, had issues rebuilding the box. Recent packer (1.3.2-dev) with your https://gist.github.com/double16/28919b72b8840803fe69bc24bbe456e8#file-packer2-log |
@double16 after getting the 503... if you wait a reasonable period, say an hour, is the box available from the Vagrant Cloud? |
If not, try the
If you have the time, record the script execution as an animated GIF and post it here. I've been meaning to do that, but haven't found the time. I believe the duration of the post-upload pause would be illuminating, as I believe it is that pause which is causing the problem.

#!/bin/bash
ORG="$1"
NAME="$2"
PROVIDER="$3"
VERSION="$4"
FILE="$5"
CURL=/opt/vagrant/embedded/bin/curl
LD_PRELOAD="/opt/vagrant/embedded/lib/libcrypto.so:/opt/vagrant/embedded/lib/libssl.so"
# Cross platform scripting directory plus munchie madness.
pushd `dirname $0` > /dev/null
BASE=`pwd -P`
popd > /dev/null
# The jq tool is needed to parse JSON responses.
if [ ! -f /usr/bin/jq ]; then
tput setaf 1; printf "\n\nThe 'jq' utility is not installed.\n\n\n"; tput sgr0
exit 1
fi
# Ensure the credentials file is available.
if [ -f $BASE/../../.credentialsrc ]; then
source $BASE/../../.credentialsrc
else
tput setaf 1; printf "\nError. The credentials file is missing.\n\n"; tput sgr0
exit 2
fi
if [ -z ${VAGRANT_CLOUD_TOKEN} ]; then
tput setaf 1; printf "\nError. The vagrant cloud token is missing. Add it to the credentials file.\n\n"; tput sgr0
exit 2
fi
printf "\n\n"
# Assume the position, while you create the version.
${CURL} \
--tlsv1.2 \
--header "Content-Type: application/json" \
--header "Authorization: Bearer $VAGRANT_CLOUD_TOKEN" \
"https://app.vagrantup.com/api/v1/box/$ORG/$NAME/versions" \
--data "
{
\"version\": {
\"version\": \"$VERSION\",
\"description\": \"A build environment for use in cross platform development.\"
}
}
"
printf "\n\n"
# Create the provider, while you become one with your inner child.
${CURL} \
--tlsv1.2 \
--header "Content-Type: application/json" \
--header "Authorization: Bearer $VAGRANT_CLOUD_TOKEN" \
https://app.vagrantup.com/api/v1/box/$ORG/$NAME/version/$VERSION/providers \
--data "{ \"provider\": { \"name\": \"$PROVIDER\" } }"
printf "\n\n"
# Prepare an upload path, and then extract that upload path from the JSON
# response using the jq command.
UPLOAD_PATH=`${CURL} \
--tlsv1.2 \
--silent \
--header "Authorization: Bearer $VAGRANT_CLOUD_TOKEN" \
https://app.vagrantup.com/api/v1/box/$ORG/$NAME/version/$VERSION/provider/$PROVIDER/upload | jq -r .upload_path`
# Perform the upload, and see the bits boil.
${CURL} --tlsv1.2 --include --max-time 7200 --expect100-timeout 7200 --request PUT --output "$FILE.upload.log.txt" --upload-file "$FILE" "$UPLOAD_PATH"
printf "\n-----------------------------------------------------\n"
tput setaf 5
cat "$FILE.upload.log.txt"
tput sgr0
printf -- "-----------------------------------------------------\n\n"
# Release the version, and watch the party rage.
${CURL} \
--tlsv1.2 \
--silent \
--header "Authorization: Bearer $VAGRANT_CLOUD_TOKEN" \
https://app.vagrantup.com/api/v1/box/$ORG/$NAME/version/$VERSION/release \
--request PUT | jq '.status,.version,.providers[]' | grep -vE "hosted|hosted_token|original_url|created_at|updated_at|\}|\{"
printf "\n\n" |
@double16 the GIF captures what I've been seeing. Should anybody investigate this further, note the "hang" at the end of the upload. In this case the pause was relatively short, but I've seen it last several minutes. I suspect when it stretches long enough (say more than 60 seconds), packer returns the "" error we've been seeing. I've also noticed that newer versions of packer seem to be deleting failed uploads (aka the 503 no longer indicates the box is already uploaded). I don't know if this is a packer change, or a cloud change. |
I captured more data on the issue. I believe the reason a 503 no longer results in the box being available is because This screenshot shows the box available on the Vagrant Cloud website while packer is getting the 503 errors: Yet, when It's hard to see because of the zoom level, but at the end of the log, on the right side of the image, it says "Cleaning up provider" ... followed by "Deleting provider: libvirt" ... that logic must be new. @SwampDragons can you ask the Vagrant Cloud programmers what is triggering the 503 errors? |
This is a higher quality, accelerated GIF showing the upload process. Note the 1 min pause after the file is uploaded, before the server acknowledges it. As I said previously, I think the timeouts are a result of that pause lasting more than 1 minute. While creating the |
I managed to conduct a test, where I downloaded the box while packer was showing a 503 error, and compared the hashes with the local box file, and as suspected they matched. The question is, can

$ curl --location --user-agent "${AGENT}" "https://app.vagrantup.com/lineageos/boxes/lineage/versions/1.8.38/providers/libvirt.box" | sha256sum ; sha256sum lineageos-libvirt-1.8.38.box
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 300 0 300 0 0 438 0 --:--:-- --:--:-- --:--:-- 438
100 470 100 470 0 0 492 0 --:--:-- --:--:-- --:--:-- 2447
100 1551M 100 1551M 0 0 32.1M 0 0:00:48 0:00:48 --:--:-- 33.5M
49101f52e4708b4b18534b725186a07989094a4f8329f6c09d8fa980c8991f18 -
49101f52e4708b4b18534b725186a07989094a4f8329f6c09d8fa980c8991f18 lineageos-libvirt-1.8.38.box
$ curl --location --user-agent "${AGENT}" "https://app.vagrantup.com/lineageos/boxes/lineage/versions/1.8.38/providers/libvirt.box" | sha256sum ; sha256sum lineageos-libvirt-1.8.38.box
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 40 0 40 0 0 112 0 --:--:-- --:--:-- --:--:-- 112
a50aa56a19be12e2329e0b217e25d33a0eb81c206c31c6949f32a7214f75456c -
49101f52e4708b4b18534b725186a07989094a4f8329f6c09d8fa980c8991f18 lineageos-libvirt-1.8.38.box |
I finally gave up looking for a packer based fix, and removed the post-processor upload step. I now handle the uploads using a If anyone on the HashiCorp side wants to troubleshoot this issue, let me know. I can trigger the failures at will. All I have to do is use my shell script to start uploading boxes in parallel. The more boxes in parallel, and the larger the box files, the easier it is to trigger the bug, suggesting a bottleneck on the backend. I've even found that if I upload large boxes sequentially, over a fast enough connection (gigabit speeds), I'll trigger the bug. |
Sorry for dropping the ball on this one; I do intend to get back to it but I'm glad you have a workaround in the meantime. |
@SwampDragons I was trying to say the upload errors appear to be an issue with the Vagrant Cloud backend. You might be able to find a workaround for the timeout issue, but ultimately, if the boxes are too large, or there are too many simultaneous uploads, the cloud will have problems. If anybody at HashiCorp wants to troubleshoot, have them reach out to me. As I said, I can trigger the issue at will. That said, as @double16 mentioned, it appears that packer will delete any uploaded artifact from the cloud if a 503 error is returned. I think you should revert this change. As I've mentioned, when a 503 error is returned following an upload attempt, the uploaded artifact usually becomes available for download after a few minutes. If you were to make any change, it should be retrying an upload until it succeeds, or a 503 is returned. If a 503 is detected, abort the upload with a warning message saying the upload couldn't be verified, but may become available in a few minutes. I've never seen a 503 returned, and then seen packer successfully upload the box again, presumably because it exists already. The real question, to me, is whether a 503 should also cause the release step to be skipped? |
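The policy described in that comment (retry until success, but treat a 503 as a soft failure that leaves the artifact in place) could be sketched like this. Everything here is an illustrative assumption: the function name, the `${CURL}` indirection, and the status handling are not packer's actual behavior.

```shell
#!/bin/bash
# Sketch of the proposed policy: retry the PUT until it succeeds, but if a
# 503 comes back, stop retrying and warn instead of deleting the artifact.
# CURL defaults to the system curl; names and flow here are illustrative.
CURL="${CURL:-curl}"
upload_with_503_policy() {
  local url="$1" file="$2" attempts="$3" n code
  for (( n = 1; n <= attempts; n++ )); do
    code=$(${CURL} --silent --output /dev/null --write-out '%{http_code}' \
                   --request PUT --upload-file "$file" "$url")
    [ "$code" = "200" ] && return 0
    if [ "$code" = "503" ]; then
      echo "warning: got a 503; the upload could not be verified," \
           "but the box may become available in a few minutes" >&2
      return 0   # soft failure: leave the uploaded artifact alone
    fi
    sleep 10     # hard failure: pause, then resend the file
  done
  return 1
}
```

The key design point is that a 503 short-circuits the loop without signaling a hard failure, so nothing downstream is tempted to delete the already-uploaded box.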
I should have said recent packer versions delete the artifact upon getting a 503. The old logic, where the artifact was left intact, was better. If your concern is corrupted uploads, then perhaps a new post-processor, which downloads the upload and verifies its integrity via a hash, would be ideal. Even better if the Vagrant Cloud API offered a way of checking a box hash without downloading it (I'm not aware of a way to do this currently). If downloading seems extreme, and a hash can't be retrieved via the API, a less extreme option would be comparing the size of the uploaded object. |
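A minimal sketch of the hash check suggested above: compare the SHA-256 digest of a downloaded copy of the box against the local file, as was done manually with curl and sha256sum earlier in this thread. The `verify_box` helper is a hypothetical name, not part of packer or the Vagrant Cloud API.

```shell
#!/bin/bash
# Compare the SHA-256 of a downloaded copy of the box against the original
# local file. verify_box is an illustrative helper, not an existing tool.
verify_box() {
  local downloaded="$1" original="$2"
  local a b
  a=$(sha256sum "$downloaded" | awk '{print $1}')
  b=$(sha256sum "$original" | awk '{print $1}')
  [ "$a" = "$b" ]   # success only when the digests match
}
```

Fed with the output of a `curl --location ... --output downloaded.box` fetch, this reproduces the manual checksum comparison shown in the transcripts above.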
I'll take a look at better handling for the 503 when I get back in January. |
Hello! I will try to see what I can do to help here. So first, I've read that bumping the upload timeout seems to help; so I have set the vagrant client timeout to 120min, like you suggested @ladar. Here's a bunch of binaries containing the patch, can you please tell me if that helps? The Cleanup part of the vagrant build hasn't changed in ~4 years, so this is probably a change on the Vagrant side. From your comments, my wild guess is that there's a single-task S3 uploader, and in moments of rush it takes more time than usual to catch up. |
Hi @azr ... I've played around with the timeouts rather extensively. As I left things, I was using this patch: https://github.com/lavabit/robox/blob/master/res/providers/packer-upload-timeout.patch I also used variations of this
The version above retried uploads 10 times, increasing the delay between attempts from 10 seconds up to 5 minutes. I think 5 minutes was the maximum allowed by the retry class, but it's been a very long time since I looked at that code. Either way, the print statement doesn't print the actual delay (a minor bug, which doesn't matter when the delay is fixed at 10 seconds). I believe tweaking the code to use a minimum delay of 10 seconds, growing to a max of say 60 seconds, might be a good thing to try. I believe you're correct that there is a bottleneck on the server side. I don't think casual users will experience it, but when I had 3+ robots building and uploading boxes, the odds of 2+ robots trying to upload large box files at the same time were high enough to cause major delays/problems. I found that my robots would get stuck for 30+ minutes (re)-trying to upload box files. About 50% of the time, the box would eventually get through (the percentages dropped dramatically the more robots I used). This 50% was using the higher timeouts, longer delays, and 10 or 20 retries. The failed boxes would keep failing with a 4xx error, and retry, or get a 503, and simply fail. I never saw This started happening often enough that I eventually gave up on using When using Long story short, I can trigger the issue at will. If someone at HashiCorp would like to coordinate with me, I can set up a time to upload box files, so they can try to debug things on the server. What I found was that when I had 3-6 boxes uploading at once, I plan to create a meta upload function, using the I've rambled on long enough. Let me know if you need more info. |
I've nudged upstream folks and I guess they've got a repro but need to find time to dig in, so we're gonna have to bump this again. Sorry, all. |
I just finished uploading v1.9.4. Uploading all of the boxes in single file took ~8 hours, and out of the 453 box uploads, 31 retries had to be performed. A retry means resending the file, so it's a lot of wasted bandwidth. More importantly, the inability to upload boxes in parallel hurts. When I run 8 upload jobs at a time, the overall speed goes up. Based on my experiments, I'd estimate the overall time could be cut to ~2 or ~3 hours, if it worked. Unfortunately, the number of errors/retries goes way, way up, with some boxes failing to succeed even after 10 retries, so I try to keep the job number limited. After several failed attempts, I did this release in single file, and it finally worked. If anybody is curious, here is a gist with the upload script output. It shows the HTTP error codes, frequency of errors, |
So I ran @ladar's script above (because vagrant cloud started failing again for me), and this is what I got from my curl response when it uploaded the vagrant box.
Don't know if that helps at all or not. For a while vagrant cloud was uploading my big kali boxes (~4.2G) no problem, but within the past week it stopped uploading properly. You can check my CI logs from when it started happening for me here: elreydetoda/packer-kali_linux#10 (comment) |
@elreydetoda you might want to try using a version of
I've been tweaking my upload script, and have gotten it to work reliably. My current version automatically retries failed uploads till it works. But if I kick off more than 1 upload at a time, it inevitably runs into errors and has to re-upload the box. The more jobs I kick off, the more retries, so I've resorted to single file uploads (a pain when you have 478 boxes to upload). The latest version of my upload script is here. Unlike the version above though, this script parses the box params from the file name, and looks for a |
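Since that comment mentions parsing box parameters from the file name, here is a hedged sketch of what such parsing could look like, assuming a NAME-PROVIDER-VERSION.box convention like the lineageos-libvirt-1.8.38.box example earlier in this thread. The real script may use a different convention; the function and variable names are illustrative.

```shell
#!/bin/bash
# Illustrative parser for a NAME-PROVIDER-VERSION.box filename convention
# (e.g. lineageos-libvirt-1.8.38.box); this is an assumption, not the
# actual script's logic.
parse_box_filename() {
  local base="${1##*/}"    # strip any leading directory
  base="${base%.box}"      # strip the .box extension
  VERSION="${base##*-}"    # last dash-separated field is the version
  base="${base%-*}"
  PROVIDER="${base##*-}"   # next field is the provider
  NAME="${base%-*}"        # everything before that is the box name
}
```

Deriving the upload parameters from the filename means the script only needs the box file as input, instead of the five positional arguments the earlier version took.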
@ladar, thank you so much! I really appreciate it! I will dig into it, and also you can pick pieces from my adaptation if you want. I added a help function in case people just try to execute it and don't read the code, and a create-new-box section in case the box hasn't been created before. Going to merge it to master for my repo soon, but still working on integrating it fully. |
@elreydetoda you're welcome, but it looks like you based your uploader off my All of my recent improvements are in the If desired, there is a separate |
Just to tell you I had the same problem on my loops. I tried everything to solve it, using the The thing is... ignoring the failure on upload causes my test to fail; apparently the box isn't available for download. But I noticed that a few minutes after the upload fails, the box appears on Vagrant Cloud, so I just added a Check what I'm saying here: |
packer version: 1.2.5
host platform: centos 7.5 on Azure
The vagrant-cloud post-processor times out when uploading a box ~5GB in size. Using curl to upload the same box is successful. This happens every time for me. I'm using an Azure VM which has ~30MB/s bandwidth. May be related to #5360