
add some retry logic for pulling from remotes #878

Closed · dustymabe opened this issue May 19, 2017 · 38 comments · Labels: bug

@dustymabe (Contributor) commented May 19, 2017

There have been a lot of intermittent network failures lately when pulling from our ostree remotes. While much of this is probably related to the backend Fedora infra, I'd like to at least brainstorm how we could make our code more robust against poor networks and/or poor remote servers.

Here are a few examples, but I've seen it myself quite a bit:

command output:
error: Socket I/O timed out

2017-05-19 16:55:42,082: command returned failure (1)
2017-05-19 16:55:42,082: template command error in /mnt/koji/compose/branched/Fedora-26-20170519.n.0/work/x86_64/Atomic/lorax_templates/atomic-installer/lorax-embed-repo.tmpl:
2017-05-19 16:55:42,083:   runcmd ostree --repo=/var/tmp/lorax.vgsxgpyx/installtree/ostree/repo pull --mirror fedora-atomic fedora/26/x86_64/atomic-host
2017-05-19 16:55:42,086:   subprocess.CalledProcessError: Command '['ostree', '--repo=/var/tmp/lorax.vgsxgpyx/installtree/ostree/repo', 'pull', '--mirror', 'fedora-atomic', 'fedora/26/x86_64/atomic-host']' returned non-zero exit status 1.
Traceback (most recent call last):
  File "/usr/sbin/lorax", line 273, in <module>
    main()
  File "/usr/sbin/lorax", line 133, in main
    remove_temp=True, verify=opts.verify)
  File "/usr/lib/python3.6/site-packages/pylorax/__init__.py", line 279, in run
    rb.install()
  File "/usr/lib/python3.6/site-packages/pylorax/treebuilder.py", line 124, in install
    self._runner.run(tmpl, **self.add_template_vars)
  File "/usr/lib/python3.6/site-packages/pylorax/ltmpl.py", line 220, in run
    self._run(commands)
  File "/usr/lib/python3.6/site-packages/pylorax/ltmpl.py", line 239, in _run
    f(*args)
  File "/usr/lib/python3.6/site-packages/pylorax/ltmpl.py", line 515, in runcmd
    stdout = runcmd_output(cmd)
  File "/usr/lib/python3.6/site-packages/pylorax/executils.py", line 347, in runcmd_output
    return execWithCapture(cmd[0], cmd[1:], **kwargs)
  File "/usr/lib/python3.6/site-packages/pylorax/executils.py", line 249, in execWithCapture
    reset_handlers=reset_handlers, reset_lang=reset_lang)[1]
  File "/usr/lib/python3.6/site-packages/pylorax/executils.py", line 201, in _run_program
    raise subprocess.CalledProcessError(proc.returncode, argv, output)
subprocess.CalledProcessError: Command '['ostree', '--repo=/var/tmp/lorax.vgsxgpyx/installtree/ostree/repo', 'pull', '--mirror', 'fedora-atomic', 'fedora/26/x86_64/atomic-host']' returned non-zero exit status 1.
@cgwalters (Member)

This is likely to be a bug of some sort caused by libsoup; that error message only occurs in GLib when a socket timeout is set (and they aren't set by default). libsoup, however, does enable socket timeouts; it looks like 60 seconds by default. That's a pretty long time, but likely shorter than the operating system's timeouts.

The code path is going to be entirely different here for the libcurl backend.
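For reference, a minimal sketch of where that timeout lives, assuming the libsoup 2.x API: the 60-second default is the SoupSession "timeout" property mentioned above, and the 300-second value below is purely illustrative.

#include <libsoup/soup.h>

/* Build a session with a longer socket timeout than libsoup's default
 * (the "timeout" property is in seconds). */
static SoupSession *
make_patient_session (void)
{
  return g_object_new (SOUP_TYPE_SESSION,
                       "timeout", 300,  /* seconds; 0 disables the timeout */
                       NULL);
}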

cgwalters added the bug label May 23, 2017
@dustymabe (Contributor, author) commented Jun 1, 2017

happened again last night for rawhide: task

@cgwalters (Member)

@dustymabe The task link there is self-referential 😄

@dustymabe (Contributor, author)

sigh - I have to work on copy/paste skills - and sleep - https://koji.fedoraproject.org/koji/taskinfo?taskID=19781624

@cgwalters (Member)

That one is a 502 Bad Gateway. Not a socket timeout. My guess is that it's something like a squid proxy being restarted or a backend host being rebooted while we're pulling.

Do we want to retry in that scenario? And for how long?

@dustymabe (Contributor, author)

Do you think it would be reasonable to retry? I'd like to say yes, but I don't want to seem unreasonable.
As far as how long, I'm going to take a wild guess and say more than a minute and less than five.

Other than 502 Bad Gateway, what other types of network failures can we reasonably retry?
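In the meantime, a crude workaround sketch: wrap the pull in a bounded retry loop. The repo path, attempt count, and sleep interval below are illustrative guesses, not values anyone has agreed on:

# retry the pull a few times before giving up
for i in 1 2 3; do
    ostree --repo=/path/to/repo pull --mirror fedora-atomic fedora/26/x86_64/atomic-host && break
    echo "pull attempt $i failed; sleeping before retry" >&2
    sleep 60
done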

@dustymabe (Contributor, author)

Another case where something like this might help: https://lists.projectatomic.io/projectatomic-archives/atomic-devel/2017-June/msg00030.html

I know in this case it is probably more about anaconda than ostree, but it's possible that whatever we do for this issue could prevent the user's reported problem from happening.

@dustymabe (Contributor, author)

Hit this again, pulling to a local machine running Fedora 26 Atomic Host.

@dustymabe (Contributor, author)

Hit this again last night for F25AH compose: https://koji.fedoraproject.org/koji/taskinfo?taskID=20110591

@cgwalters (Member)

I feel the pain here, but I still think the correct fix is in Fedora infra: make the webservers more reliable. I don't want to trivialize running infrastructure, but on the other hand we're just talking about serving static files.

Also, ostree should have behavior similar to librepo (libdnf), and for that matter docker/skopeo; otherwise we aren't solving the problem fully. And AFAICS, there are no such retries in librepo.

@cgwalters (Member)

Now, as far as the Cisco issue goes, Anaconda does do retries for RPM installs; see e.g. https://github.com/rhinstaller/anaconda/blob/cff864a08fc78ae30bcb299cc83762b12da78c8b/pyanaconda/payload/__init__.py#L631

Also in RHEL7 with the yum payload: https://github.com/rhinstaller/anaconda/blob/2377e2ed95f4c0caccec578fa906ec22a94839b3/pyanaconda/packaging/yumpayload.py#L691
However, I can't find similar logic for the dnfpayload.

@dustymabe (Contributor, author)

I feel the pain here, but I still feel the correct fix here is in Fedora infra to make the webservers more reliable. I don't want to trivialize running infrastructure, but on the other hand we're just talking about serving static files.

Yep, just saying that if there is anything we can do, we should do it. Assuming ostree becomes more popular and gets used in places other than Atomic Host, there is going to be crappy server infra out there. Do we want those issues opened here, or handled by the software?

If we can't do anything reasonable, then let's close this issue. If there is something we can do, then let's do it and save ourselves future headaches.

@dustymabe (Contributor, author)

Don't know if this is related or not, but since we are now on libcurl, I noticed this error once:


[root@vanilla-f26atomic ~]# rpm -q rpm-ostree ostree
rpm-ostree-2017.6-3.fc26.x86_64
ostree-2017.8-3.fc26.x86_64
[root@vanilla-f26atomic ~]# rpm-ostree upgrade
  Signature made Wed 26 Jul 2017 05:54:32 PM UTC using RSA key ID 812A6B4B64DAB85D
  Good signature from "Fedora 26 Primary <fedora-26-primary@fedoraproject.org>"


Receiving objects: 98% (171/173) 357.0 kB/s 55.3 MB                                                                                                                                                               
error: [18] Transferred a partial file

@dustymabe (Contributor, author)

Also just saw this:

[root@vanilla-f26atomic ~]# rpm-ostree upgrade
  Signature made Wed 26 Jul 2017 05:54:32 PM UTC using RSA key ID 812A6B4B64DAB85D
  Good signature from "Fedora 26 Primary <fedora-26-primary@fedoraproject.org>"


Receiving objects: 50% (1/2) 212.1 kB/s 37.3 MB                                                                                                                                                                                                                 
error: [28] Timeout was reached

@alexlarsson (Member)

I've also seen

error: While pulling runtime/org.freedesktop.Sdk.Debug/x86_64/1.6 from remote flathub: [28] Timeout was reached

Several times with that (large) ref pull in particular.

@csoriano1618

I can reproduce this with at least two people. Can we do something to help debug it?

@csoriano1618

Setting a breakpoint in timer_cb, I get this trace:

#0  0x00007ffff7582f80 in timer_cb () at /lib64/libostree-1.so.1
#1  0x00007ffff674461d in g_timeout_dispatch () at /lib64/libglib-2.0.so.0
#2  0x00007ffff6743bb7 in g_main_context_dispatch () at /lib64/libglib-2.0.so.0
#3  0x00007ffff6743f60 in g_main_context_iterate.isra () at /lib64/libglib-2.0.so.0
#4  0x00007ffff6743fec in g_main_context_iteration () at /lib64/libglib-2.0.so.0
#5  0x00007ffff7553207 in ostree_repo_pull_with_options () at /lib64/libostree-1.so.1
#6  0x0000555555598094 in repo_pull_one_dir ()
#7  0x00005555555acf27 in flatpak_dir_pull ()
#8  0x00005555555af4c4 in flatpak_dir_install ()
#9  0x0000555555571fd5 in flatpak_transaction_run ()
#10 0x0000555555576818 in flatpak_builtin_install ()
#11 0x000055555556f20b in main ()

Doesn't look too suspicious, though.
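(For anyone else digging in, a session along these lines should reproduce that trace; the flatpak invocation is an assumption, borrowed from the example in the next comment.)

gdb --args flatpak install flathub org.gnome.Sdk/x86_64/3.26
(gdb) break timer_cb
(gdb) run
(gdb) backtrace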

@cgwalters (Member)

What specific remotes and refs are people hitting? In #878 (comment) Alex mentioned flathub; are other people hitting this with e.g. flatpak install flathub org.gnome.Sdk/x86_64/3.26? And let's gather rpm -q ostree output etc.

@cgwalters (Member)

Also, libostree logs some information to the journal; try e.g.:
journalctl -b -t flatpak

@alexlarsson (Member)

I finally figured out how to reproduce this. First, download eos-select-bandwidth.py from https://gist.github.com/ramcq/fcfc6cc2d192d8f391fb9fb6e606509b, then run it like eos-select-bandwidth.py -n 3G-good eth0 (or whatever iface you have).

Then:

$ ostree --repo=testrepo init --mode=bare-user
$ ostree --repo=testrepo remote add --no-gpg-verify flathub https://flathub.org/repo
$ ostree --repo=testrepo pull -v flathub runtime/org.gnome.Sdk/x86_64/3.26
Output:
Receiving metadata objects: 0/(estimating) 48.5 kB/s 1.3 MB
OT: starting fetch of fdc1925f2a87bc5d166c00bd10cb67a347a0fe4bc0053f6ef7a61c02904940ad.file
OT: starting fetch of 38ebd6ff829fe426e40840022ef77f4cb23a2471ed00ff1b493868e3c81f8045.file
OT: starting fetch of 5e679cf5de1ebff0230dfa406c47837a5b414137dc0f01aa51e0745529fa02e6.file
OT: starting fetch of 1b581b9613c49911e35af56c58d469e1a2b846b61291e7853813993a88f80b8c.file
OT: starting fetch of 488ad75fd115326308fe7c8e314590026eb3d847a1ca290e6accf3d53100932d.file
OT: starting fetch of 8feb79be030e94c1493a8711a6b096133be3a2897eb10dc9cd99508d08d4dd73.file
OT: starting fetch of e573d5f5d14daf3210bb4657ba505f058bf303fb35cefa05f7b27264d44fe34e.file
OT: queuing fetch of 544df55308ef36e765bad62e5cae9de295373fa961a8c1026720365a84803cc7.file
OT: queuing fetch of 2c111766ba7e7aef31c00b46ce9dc45b70af7b4c9f4e77c5b4f0c447c5e1caa2.file
OT: queuing fetch of b8ef0032cc4e6c92f1dc525fb8fd5a7e34c73e2d6a31e4416f34a84b339921f4.file
OT: queuing fetch of 15ea348911013e54ef6f6f1e11de94020b7e1c8a6ad6b4174f57e9bfaf9d1eae.file
OT: queuing fetch of 6995c20c1a09b19acf03df5064311edc78461f55b35b8c1f0b190ffa2b0fd394.file
OT: queuing fetch of a11c088f40582cb534c19e0f84457dd416068725ee478b2d5e050066446daaee.file
OT: queuing fetch of be63d3f006572f7265ce922bdb69935bb71af5178bbd1defe1c919131bfbecdc.file
OT: queuing fetch of 8b5f606d27414ff66046ae77a76d72780420b064ec80b6313f5c5be3c9f34401.file
OT: queuing fetch of 17352717c431fad4e0f26cb9cc03d18c66abb4accbf42ada3c46897488709214.file
OT: queuing fetch of 522fb69cabcfc17a1e7e256ad131f147028fef6e4150fe395f618eddf5a7e0f9.file
OT: queuing fetch of 9f41d40894dad01c55b93f4e5a3dc7b8ccfff978ae52d96cc17264b0fecbea43.file
OT: queuing fetch of fdb657aa44f0b4114b2b40c5ccac3381bf72c7b98f94c20246e5563d06fec92e.file
OT: queuing fetch of 7c39524b348d00b231891bd3d99e625a780ad61a70b3781c15f87f67baeade13.file
OT: queuing fetch of b93d7a5b961f68049c1cee4994c0c2d3f726e64ab390b2a0f132093761e33731.file
OT: queuing fetch of 9ff303a4cf4eab7d62f645d836e957d7b95cc12a6088d7fcca29c68820b24c65.file
OT: queuing fetch of 9fece54fa98877d1bb0be993e286deea9a5ab07bfa6c3d2e54d9382c934198b1.file
OT: queuing fetch of c5ac9e19e97d1d56f45f1722c097f6a16c87813fe198b7f9e2784dbf05eeffe5.file
OT: queuing fetch of 75fc5c15b4a997d85d04c07addac3c5867c27349b652c5def0cab67d7a9fd26c.file
OT: queuing fetch of 8cb61ea301ac5ce6dfcdc38277b80caf6c2dfc5bffaffaacd88f64c90c890a3f.file
OT: queuing fetch of 62d119a1f51f848beafd349a8a963a9e6e43589a911a8d66d21acd776bfee446.file
OT: queuing fetch of d3685b8cde91969785cb166a1c45049440dcbf4206ca267257a11e5fbcbf9043.file
OT: queuing fetch of ab25ca71265485e8070741b0e646f5fea286e8312b1fad11c9ae704077858d76.file
OT: queuing fetch of 6879442b3ac60f7290ddb0e42bee6546ba728250ec1d22f376473b5e8e25ffd6.file
OT: queuing fetch of 329067e4431c77e2792f718047aebb0d48a41d7f237e54afb0f0dd5441ae04bc.file
OT: queuing fetch of 757e0cf594bdbb4b52c123ff6d701e85bfc5d111d60ac63354f6924619bb9028.file
OT: queuing fetch of bc7ea7f6db11c4326b776c996602589f574a401bd7585833b8206055a794a53a.file
OT: queuing fetch of 5094b036fc71c68a81b67d99f16438c44f4f6905a18909354f0c5803fdb565f4.file
OT: queuing fetch of 33be10b9b40d62c9a37eeda7dc86c6e52d9f6c51df3bb4799a202ae26517d2a7.file
OT: queuing fetch of 9696c6cc1a7cf2320af7d07c0cc865c15d319f593723177dd2930eb493092d48.file
OT: queuing fetch of ce24c87644fda6ab86c5d4aed246adc190034535d2daa4bc702b17312f69ad89.file
OT: queuing fetch of e7ea3fc26a0b168f06292ea6a31053a613dc4ba9fc2450cb0fefa7f4a621ea01.file
OT: queuing fetch of 7e4ae47ba9f0cfa80c387a9d5b5941ca32ae43b8d84c8247a259fa64fc1a1814.file
OT: queuing fetch of 4aa250e2778479e6069e26cb84a419caf06ee7149e6f722fa421c1ad029c93f3.file
OT: queuing fetch of 2d72f7e4dbfa8e02a626544cef9014ecfd0bb65cf50a84079772eddcefa90666.file
OT: queuing fetch of 386cfcf29d12828276bebc1033381869203d2aaff0446b2f22eb284a3732745b.file
OT: queuing fetch of b08b5ff4391929a3f838aca691c3c1e0638656ec756499a9b06f5d101144ec82.file
OT: queuing fetch of 908260fc17e3a91fdff94eb08126bd9a7269e8952a831cc09d0e7002fdaa8e55.file
OT: queuing fetch of 098836336cb4987f9cd79617b0be60740dd9a87e5399590471da9bea09c38956.file
OT: queuing fetch of 3e4b35caa3c3bd93ab9163bea709b1a8642226c8f160a5f4ba6b96c0f37f65c4.file
OT: queuing fetch of e91cd2fe498e564f396747658f6a8c5704492227dfbfa8fd375d304bceb149eb.file
OT: queuing fetch of 29db6e04731cda7233610894d16ac63520f7a49c599b4086fa80070c76c39546.file
OT: queuing fetch of c1c8589cfdc622670d33d5690b4fb912b4520b21d1f05d63d301df7a7faab3ac.file
OT: queuing fetch of d7ccfd19a0d907599069dfc0d477aa5f571eea40122a3847c32b19764122341a.file
OT: queuing fetch of dd0df6e4e585d7d41bf0e548bfe8ff221826b32d02719218a061191e75f7f984.file
OT: queuing fetch of e63e7c2ec4ace7a2bc60a42c1d1496dc3863ff397bf66f45091cb7d70e114ae8.file
OT: queuing fetch of 99f32e3cd54896bcee55906285fcdcab36917067498811d24326267168826659.file
OT: queuing fetch of 177a0782feffb89c875f19fd10efbc666dc5e828ff10a5a1394d1e8a424e0d44.file
OT: queuing fetch of 2090c7e0946e79b821325e0fc9defdd09c75d1f7ca5e0df331886374aebff6e4.file
OT: queuing fetch of dd868b0f6f4ba387f19f0fa0ac11f32d5970ff1560eae091f928704358b4e0c1.file
OT: queuing fetch of 7b8f4c700c68047e940a5df521e3262111276c8770e86907832a8e29c01d4cd2.file
OT: queuing fetch of 1a1c14e5b282127833900cc9222eec866aed8b8bd9b5774a98dca9adf7b4be2d.file
OT: queuing fetch of c6b3f0dad5da3bffe43f10b39bd5654b0f7108058b5f9f040fee1df7e8765118.file
OT: queuing fetch of ed799cb5cc2dcba5c58a7d694e05170989518c32aafe5b92de0824bb6c62472b.file
OT: queuing fetch of 81b147855c4196a5cec1fd3fcba183bc92b0bd95d0dab51db7341519e9e4cbb1.file
OT: queuing fetch of e321344f19b4f4f8a6d2b1f16847a285296b65b721703d3e411fc1553e07e275.file
OT: queuing fetch of dc2874fc456902d060784917569e966e398ebdffe56cedeeb8e656895b8b400a.file
OT: queuing fetch of 39778557d0e069c5a49c482bad82bb09dc3888dca4ce832eec3757103507ae3d.file
OT: queuing fetch of 272283737215fad8b966aafd6a91c122a74982d48f596d3bdcec342c33dfa8b6.file
OT: queuing fetch of f8a446ddda7eb2a7ade46ada8a51bf2faaebfc9b7a24e2c6be3af4ab14d4bc97.file
OT: queuing fetch of 143d9f86a01e072c95bba577ee79c517527e2d9264d35d216ab8abbd7e0ae845.file
OT: queuing fetch of 4f8c3160064a996734bca3f11da28067c57b84454bcb5b459950e87bae1f19e5.file
OT: queuing fetch of 0d5f468783141fcc67c89849e4bfc824736607bbcac02e5963cc8c84f74b3284.file
OT: queuing fetch of dea5aadf1f0b443d0a796eda540ea6c891e06e42f5091eb8b951ca9a0e9545b9.file
OT: queuing fetch of 330a12493bc41118160df9183e2970e912aaa04bd5683d7315aa6f65a3211a31.file
OT: queuing fetch of af0b0c92cae6791fcd2c66851cad9192b1b255775fc0da7a3fe5a3ceb4ce9e79.file
OT: queuing fetch of 8234259dc0458abff918f44500adb4be9c9cace95d80fdf8b27fbd955ccb0b5c.file
OT: queuing fetch of ba0e0e1045c1e5858712d2b96c4fb24acddb806b8135f98d80e6cb77b8ba4f65.file
OT: queuing fetch of 0fd3ee51d06cfd86977819d7577c542df313e48830f006627a4a373b759315e3.file
OT: queuing fetch of 5f2f8e4c7ca597beca1d0b842b5827acc70106e3941e8634315ccb5be28a19ab.file
OT: queuing fetch of af64031087bad79e6558d2b35a0ca1c0078ac11bcfef8db73103ee88e89c1ffa.file
OT: queuing fetch of f4f57b5703855238498b62a32ee905d276c9b98a57957d9976caf1fd57e55615.file
OT: queuing fetch of 11d91d3d8cad1a4ad7a261263c38b4ba5d7ad0f9d239e0bdc6c71000d4e1b165.file
OT: queuing fetch of 91cdbc7e6c403bc9f13f57109bb0e6727a41743cac8c316f8440520c80b5b1e8.file
OT: queuing fetch of 3bd881bdc1647f4973618adb31ccbf789646338ef6ff03dce5d2ce231d75c9c8.file
OT: queuing fetch of cb76b8336714dc2abae3a5630635763a23a9ab5b947d74d52e32c72f82b0e181.file
OT: queuing fetch of 659486ed0fde3aeca1fedc229f9cad3a828fa679c64e53680917c6680fe1a28e.file
OT: queuing fetch of fd43044fcb986a373c9200603499352f743a20921935eec28d6bbfe0c5621b48.file
OT: queuing fetch of 1e26991f02d8b778a4ca02fee497accbac75a99bf2eec74e843187e1f04b35f7.file
OT: queuing fetch of d214ea2a75338e664f8167f703dc3e7e254ae24912cdc30e77b3aad4711c4a28.file
OT: queuing fetch of 97901d1a9984ef2c4538ec953e5ca029a644d099ed98f9fe4701b8655cc416d2.file
OT: queuing fetch of ba319197bd12c9fbce90faa7defbdaf3cc7b726f0f83a863f455c1d3a764a36c.file
OT: queuing fetch of cb27355c6ef3c79f107cc49c8ea8488931dbd49e5ad0099acfbcdf312632e934.file
OT: queuing fetch of 3a2b433c1f538579b2200dc17720df84ef4e6def2485cd1763aa0996274f25e8.file
OT: queuing fetch of 496aade6a42792aa8a4524c1949eacf4a79a7895017e37a941587e539fa1eb82.file
OT: queuing fetch of 997f798998a5685827a3164d317e468f40320d647662f77c533731ffedcdc794.file
OT: queuing fetch of 7966fe3c69dfb259f3efd2c69467c0d5de9b111404c508b49013c3b07584ece9.file
OT: queuing fetch of 279509d41cdf6e267a5fde1f055972778c29049d123b8889c1ecf28d07cd44b8.file
OT: queuing fetch of e655e76c2644c393e138a8d2ca55dc25a55d277bb32a0774608a1f1f31aacad4.file
OT: queuing fetch of 1af3576442af8258c1b5a213954c09ffce3d94e184d88abb5206a89a7c77e5ba.file
OT: queuing fetch of bae7b575641cd2499ac9dce42c23b71b0e631e6a154bce12d1444902840f2d6b.file
OT: queuing fetch of 9520de4bf4168eae9c13483c860a8f650a01c84ed08e7d52de7c5edf1a149b18.file
OT: queuing fetch of 74571e37a623e79d8513abc52e2841b287e14a61638fc2e8dd5a753fa61e2de2.file
OT: queuing fetch of 6f7ee2f99867fefadba7ed7241211c2a9ea7d787e79fab7a373513ff1114d3a7.file
OT: queuing fetch of 50cc577d2663b91682c0fff7299002adef6c485afffe29389087c4b20d6dcb72.file
OT: queuing fetch of 777c2528ee815580a0556fc7e3abe769309e49cdbc76ee46edef99703e5a5476.file
OT: queuing fetch of fe8131b27a080a3342bf9592a67a5c769d36eb852563dac00054c592cd05dbee.file
OT: queuing fetch of f61accc1655a46419d21febf1d1025271e419f6dee50b6b05f28c88b9573690d.file
OT: queuing fetch of e83244ba2ed3c2bb4be29ed7b4e5b853bafc09247b71754fab1632ba0ad791bc.file
OT: queuing fetch of 81e4540cf8ca21c599d6fd480aa12138984486e68c65012436090811760c8c17.file
OT: queuing fetch of faa47e4c8881cc58090823a052735b146a815b730a7d2845ff89540510dab3a9.file
OT: queuing fetch of fd1356b70394b2141f3df6b6673aef08d35c71dbb486db48210c837735f122f6.file
OT: queuing fetch of b6c6811e9a48781dd96238e8017fcdf63a19ef0e4f59395d3e21442c121b38f5.file
OT: queuing fetch of 0547ebabc223606d1512f84ba0fb750682f0b954b9b98f37d5884e66f65b0da7.file
OT: queuing fetch of 72a453fd84250c22075426818e369477619bb8455a01dfac684d05bad7d95019.file
OT: queuing fetch of ee555ef3f9441944af5428147877c8c9feac9635c35afcc368d2d416544a8186.file
OT: queuing fetch of fdb5c31a9e614ef35450b32410961a315771326a86fa8d7f62eda12b9160d733.file
OT: queuing fetch of 306333e319e3264245a1442bb377203690d219cf397e87c4666fe9b28014a9da.file
OT: queuing fetch of e7772457219dd1c12a14e67b6c941c9e4e9627f6cff1570e5dc19955e454e557.file
OT: queuing fetch of a4dec80587c21c0251275d84cec8f9c78d91125c0877652dce41f69b1bca6186.file
OT: queuing fetch of 4c8e646870744de5e89845977b047777b0b5c56e1796848df19177615aeaa57b.file
OT: queuing fetch of 3ed1395c6f70ceda0ac11c9ed08f6a4231a34d22330417e25b2ec9613753a262.file
OT: queuing fetch of 8b2986bb89dc40b0b0ce6fb91d6821a9681cb4ffbec2c5b5dfc1dc76c538c24a.file
OT: queuing fetch of 1fcd3cf122833b4b05b2fb9dc28c0973dd2b8f6a1ca626497f067b6362c36f10.file
OT: queuing fetch of 381983fc345627bac5376cad0cc2fccefbcd20c19c2739627feb51a11f76e41a.file
OT: queuing fetch of 4615d9b115ad02c141a0bd1eca998e7e7606219e463186987c86b889c1f84edf.file
OT: queuing fetch of f8dd40e539d5eeacbd9a3ca47f0a1dc8777eb651e604778205684808edf35789.file
OT: queuing fetch of 837a992316d26f436d290fe0f0994c9550f5b813c7cd087d5685e0f0e1f8d90a.file
OT: queuing fetch of 66563c7d45af95d6e1bec943f0bfa9dec8401feeb0f3dc29148fffbcae1818dd.file
OT: queuing fetch of 30cb855c83e7a726ca4c727df6819421a7b1f1cd8c24f4126fa21cff78051adc.file
OT: queuing fetch of e60606a1d95662b9e605dde0c75a3e734b46cac9e458e1a68060122aaa22780d.file
OT: queuing fetch of 5d42d604319a54fe88e32420aa75b4dd44554972c07981d8a1f93324e543795d.file
OT: queuing fetch of 4d5f50ffe3c676e92d1ffa22dc2788635ade03c059b323eaa8236f54d64a2365.file
OT: queuing fetch of 9bbf934692385429a9af0a952eeea7c92b13b23c0ea1708385333021e3876074.file
OT: queuing fetch of ed6fd7491fe486a368ec41f7f9dffd81bb73f1e52cbca136869d578d41718d50.file
OT: queuing fetch of e8ea5b1f15c4f35806a67a4d0351c63887f69136d084007fb721af84ecdaef15.file
OT: queuing fetch of 8bbba35ff61c742778f40c030364efad02e8571c13e283fcb1b2f3b0cfe786af.file
OT: queuing fetch of e6a155de19baa673437d767186ff9a8996ee2e752fe924cfffa1f13c78044a1c.file
OT: queuing fetch of f963376e321f591e2cebfd2607f6ab0e1b9768b2bc593a0e912482e8f5952012.file
OT: queuing fetch of 1c6a7756b7b7174e153afa8ab153f27d8c5d09e6e7650f496cb27ff3dfd4900f.file
OT: queuing fetch of 9b2b2f199672b28a0159fae3d790fc16f236d5c8f442e490e7f934ad8a68bcbb.file
OT: queuing fetch of 144daae61d42e189fc29fdf7f8c727cfe4fa7f4b3a7feef4b713eb3361259c06.file
OT: queuing fetch of 3d7743bfc5f9fb26b4ecd4d58f684394d073c24f979fb865d3ad524232c7c39d.file
OT: queuing fetch of d305ee7170b80b4c57ed215f3328f56964b9e872cafdfabd3aeb51905583ec84.file
OT: queuing fetch of 73979bfd19d2c0feca036637ea596334814e9cce1690f3bd1a876dfe5e32c1fd.file
OT: queuing fetch of 6cd547d151b8245bcbca845235428dd6324c88bd7a7a8a650d64409db9381e4a.file
OT: queuing fetch of b95cf90546f51181074702e330e604eea6b6140e4d36fc665699be1fd5040e06.file
OT: queuing fetch of 72c113c2e20819a23939915446d6511836fb0b0538ab9d80f17fecd48e1b4076.file
OT: queuing fetch of 0217a9b0b9f0b8743294d0027ec72f509cd4fd24540a2ed551338247e68b5433.file
OT: queuing fetch of 5e032e08e01053e1af725e92ad8cfd9a7da1ef3acbda4e2ee5819065e2462f03.file
OT: queuing fetch of e21d4e0ed6bea9a5039ea3f2e88533d5514b1234a2a45467d873593686e5c96a.file
OT: queuing fetch of 8f0489482fb85d14cdd3b87f4f21e7a4e289d0c056619b81431024ae13a76186.file
OT: queuing fetch of 35eda84f40f74b64350802b279005b9ea7aae8831a42b6de58b28255d61c9321.file
OT: queuing fetch of 1ab0b40dccaf8f85f67dbbf3c3dea59f81d6a83a1a58b2e47be10e744799a9bc.file
OT: queuing fetch of 20b85b8c296cdf0121df7b01e29fb249956fe7e50b971d995a7b46430786f014.file
OT: queuing fetch of cd7c03e8fe50914f88a38b5259b8abb3bbe65453110b9ff999fa9a7980a93e68.file
OT: queuing fetch of fcc6b60bde87678be156cdc71c88d0cd444ce1ae1027f1a0fe9f27ebe6b6445d.file
OT: queuing fetch of e1d517b09b6a2cd3550e9521a01a1f6b40a8a0ebb1192bfac69013d33cdf40b0.file
OT: queuing fetch of 84615fcff342222af0127c098c370d62f99f431de9464e7b20340fe913b640d5.file
OT: write of 805d07e990e9149e9fb47e92068c86a3507c0717cf02a291ea0c375e44a9b1b2.commit complete
OT: queuing fetch of 805d07e990e9149e9fb47e92068c86a3507c0717cf02a291ea0c375e44a9b1b2.commit (detached)
Receiving delta parts: 0/173 0 bytes/355.1 MB 79.1 kB/s 1 hours 14 minutes 46 seconds remaining
OT: fetch of fdc1925f2a87bc5d166c00bd10cb67a347a0fe4bc0053f6ef7a61c02904940ad.file complete
OT: starting fetch of 805d07e990e9149e9fb47e92068c86a3507c0717cf02a291ea0c375e44a9b1b2.commit (detached)
Receiving delta parts: 0/173 0 bytes/355.1 MB 79.9 kB/s 1 hours 14 minutes 2 seconds remaining
OT: write of fdc1925f2a87bc5d166c00bd10cb67a347a0fe4bc0053f6ef7a61c02904940ad.file complete
Receiving delta parts: 1/173 0 bytes/355.1 MB 81.9 kB/s 1 hours 12 minutes 13 seconds remaining
OT: fetch of 8feb79be030e94c1493a8711a6b096133be3a2897eb10dc9cd99508d08d4dd73.file complete
OT: write of 8feb79be030e94c1493a8711a6b096133be3a2897eb10dc9cd99508d08d4dd73.file complete
Receiving delta parts: 2/173 0 bytes/355.1 MB 83.3 kB/s 1 hours 11 minutes 1 seconds remaining
OT: fetch of 38ebd6ff829fe426e40840022ef77f4cb23a2471ed00ff1b493868e3c81f8045.file complete
OT: write of 38ebd6ff829fe426e40840022ef77f4cb23a2471ed00ff1b493868e3c81f8045.file complete
Receiving delta parts: 3/173 0 bytes/355.1 MB 84.6 kB/s 1 hours 9 minutes 56 seconds remaining
OT: fetch of 5e679cf5de1ebff0230dfa406c47837a5b414137dc0f01aa51e0745529fa02e6.file complete
OT: write of 5e679cf5de1ebff0230dfa406c47837a5b414137dc0f01aa51e0745529fa02e6.file complete
Receiving delta parts: 4/173 0 bytes/355.1 MB 84.9 kB/s 1 hours 9 minutes 40 seconds remaining
OT: fetch of 1b581b9613c49911e35af56c58d469e1a2b846b61291e7853813993a88f80b8c.file complete
OT: write of 1b581b9613c49911e35af56c58d469e1a2b846b61291e7853813993a88f80b8c.file complete
Receiving delta parts: 5/173 0 bytes/355.1 MB 85.1 kB/s 1 hours 9 minutes 33 seconds remaining
OT: fetch of e573d5f5d14daf3210bb4657ba505f058bf303fb35cefa05f7b27264d44fe34e.file complete
OT: write of e573d5f5d14daf3210bb4657ba505f058bf303fb35cefa05f7b27264d44fe34e.file complete
OT: fetch of 805d07e990e9149e9fb47e92068c86a3507c0717cf02a291ea0c375e44a9b1b2.commit (detached) complete
Receiving delta parts: 6/173 0 bytes/355.1 MB 88.4 kB/s 1 hours 6 minutes 58 seconds remaining
OT: fetch of 488ad75fd115326308fe7c8e314590026eb3d847a1ca290e6accf3d53100932d.file complete
OT: write of 488ad75fd115326308fe7c8e314590026eb3d847a1ca290e6accf3d53100932d.file complete
Receiving delta parts: 7/173 0 bytes/355.1 MB 93.2 kB/s 1 hours 3 minutes 28 seconds remaining
OT: fetch static delta part 7a1721cfcd35d533af76e3e3c59ec98c94477ca74e503e531c6b7236374446f9 complete
Receiving delta parts: 8/173 0 bytes/355.1 MB 92.7 kB/s 1 hours 3 minutes 50 seconds remaining
OT: execute static delta part 7a1721cfcd35d533af76e3e3c59ec98c94477ca74e503e531c6b7236374446f9 complete
Receiving delta parts: 8/173 0 bytes/355.1 MB 94.9 kB/s 1 hours 2 minutes 22 seconds remaining
OT: fetch static delta part e9a43874076e81752864a353f9758347dc9e417e9ce9d383cbdf34446ee99da0 complete
Receiving delta parts: 9/173 0 bytes/355.1 MB 94.5 kB/s 1 hours 2 minutes 36 seconds remaining
OT: execute static delta part e9a43874076e81752864a353f9758347dc9e417e9ce9d383cbdf34446ee99da0 complete
Receiving delta parts: 9/173 0 bytes/355.1 MB 95.3 kB/s 1 hours 2 minutes 8 seconds remaining
OT: fetch static delta part 203280ea071e5cea332fbaa5f3b7aae8fde62f78bb9bf507d2378245dc0adc10 complete
Caught error, waiting for outstanding tasks   
OT: fetch static delta part 0eddd5837a9acca213e4446e415d9ebb474347c5c4fb29a336be3ee574767da5 complete
Caught error, waiting for outstanding tasks
OT: execute static delta part 0eddd5837a9acca213e4446e415d9ebb474347c5c4fb29a336be3ee574767da5 complete
OT: pull: idle, exiting mainloop

error: [28] Timeout was reached

@alexlarsson (Member)

I discovered OSTREE_DEBUG_HTTP, so here is a new log with that:
pull.log

@alexlarsson (Member)

So, here is a completely random guess. Maybe we're starting too many parallel requests on the same HTTP/2 connection, and the server has some limit on how many are actually executed in parallel while the others are put in a queue. Now, if the initial operations take more than 30 seconds, the not-yet-handled requests time out because they haven't gotten a response yet.

@alexlarsson (Member)

I did some more testing, and what seems to happen is that at some point a delta fetch is done and we start a new one, and then we don't get any data back on that in 30 sec (probably because we have a lot of other outstanding requests in the pipeline and the transfer rate is slow), which triggers these options that ostree sets:

  /* Abort the transfer if it averages below 1 byte/sec... */
  curl_easy_setopt (req->easy, CURLOPT_LOW_SPEED_LIMIT, 1L);
  /* ...for longer than 30 consecutive seconds. */
  curl_easy_setopt (req->easy, CURLOPT_LOW_SPEED_TIME, 30L);

If I remove these two, then everything works fine.

@alexlarsson (Member)

I wonder if we should just remove these two lines for now, including in the Fedora packages. We can research the details later, but let's make it work first.

@dustymabe (Contributor, author)

I have a tool that checks to make sure the refs in our repos are signed. I happened to see this in the tool this morning.

Testing 26 fedora/26/x86_64/testing/atomic-host at https://kojipkgs.fedoraproject.org/atomic/26/

error: [16] Error in the HTTP2 framing layer

Not sure if anyone has seen the HTTP2 framing layer error before, but figured I'd post it here anyway.

@csoriano1618

Got this error from some students too, while installing GNOME Builder from flathub; not sure if it's related:
"Error reading data from TLS socket: A TLS packet with unexpected length was received"

@dustymabe (Contributor, author)

Saw another HTTP2 framing layer error in last night's f26 atomic run for building the Vagrant boxes:

08:57:09,750 NOTICE system-python:libostree HTTP error from remote fedora-atomic for <https://kojipkgs.fedoraproject.org/atomic/26/objects/be/6f95d977b33bb6aeee8fd88aebea4c23aab2fd5ece9a2e0f5ea12aae157448.filez>: Stream error in the HTTP/2 framing layer
08:57:09,776 ERR anaconda:Failed to pull from repository: g-io-error-quark: [92] Stream error in the HTTP/2 framing layer (0)

@cgwalters (Member)

Ah yes, the low-speed option conflicting with HTTP2 feels like a libcurl bug, but I'm totally in favor of just removing it for now.

cgwalters added a commit to cgwalters/ostree that referenced this issue Nov 27, 2017
They don't play nicely currently with HTTP2 where we may
have lots of requests queued.

ostreedev#878 (comment)

In practice anyways I think issues here are better solved on a higher level -
e.g. apps today can use an overall timeout on pulls and if they exceed the limit
set the cancellable.
@cgwalters (Member)

PR in #1349

rh-atomic-bot pushed a commit that referenced this issue Nov 27, 2017
They don't play nicely currently with HTTP2 where we may
have lots of requests queued.

#878 (comment)

In practice anyways I think issues here are better solved on a higher level -
e.g. apps today can use an overall timeout on pulls and if they exceed the limit
set the cancellable.

Closes: #1349
Approved by: alexlarsson
rh-atomic-bot pushed a commit that referenced this issue Nov 27, 2017
They don't play nicely currently with HTTP2 where we may
have lots of requests queued.

#878 (comment)

In practice anyways I think issues here are better solved on a higher level -
e.g. apps today can use an overall timeout on pulls and if they exceed the limit
set the cancellable.

Closes: #1349
Approved by: jlebon
@cgwalters (Member)

A quick web search for "curl HTTP 2 framing error" turns up
curl/curl#1618
which links to: http://mailman.nginx.org/pipermail/nginx-announce/2017/000204.html

@cgwalters (Member)

One thing we could probably do easily enough is add a remote config option to turn off HTTP/2, to let people work around this.

@puiterwijk

Just for information, I have done further investigation and reported my findings on curl/curl#1618 (comment) (tl;dr: Curl keeps trying to read after the server asks it to cancel this connection and open a new one).
I have a curl patch I'm planning to submit upstream.

@parasense

Was experiencing this issue, or something similar.
I had several koji image-build tasks running concurrently, all pulling from the same remote ostree repo. My tasks were sometimes stalling out in libcurl, eventually resulting in timeout failures. Sometimes I could run 'koji free-task' on the task-id to get a second chance on another koji image builder. Anyway, I successfully worked around the situation by using NFS instead of HTTP: simply NFS-mount the ostree repo in the Anaconda %pre execution env, and use --url='file:///path/to/ostree/repo' in ostreesetup.
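Roughly, as kickstart fragments (the NFS server, export path, osname, and ref below are placeholders):

%pre
# mount the ostree repo exported over NFS
mkdir -p /mnt/ostree-repo
mount -t nfs nfs.example.com:/exports/ostree-repo /mnt/ostree-repo
%end

# then have anaconda pull from the local mount instead of over HTTP
ostreesetup --osname=fedora-atomic --remote=fedora-atomic --url=file:///mnt/ostree-repo --ref=fedora/26/x86_64/atomic-host --nogpg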

@pwithnall (Member) commented Jun 5, 2018

I added support for retries in #1594. Is there anything else specific to be done here? (#894 is open about caching the results of DNS lookups, which would be a further improvement.)
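For API callers, using the retries presumably looks something like this through the options of ostree_repo_pull_with_options(); the "n-network-retries" key is assumed to be the option #1594 adds, and the remote name, ref, and retry count are placeholders:

#include <ostree.h>

/* Pull a ref, asking libostree to retry transient network failures. */
static gboolean
pull_with_retries (OstreeRepo *repo, GError **error)
{
  const char *refs[] = { "fedora/26/x86_64/atomic-host", NULL };
  GVariantBuilder builder;

  g_variant_builder_init (&builder, G_VARIANT_TYPE ("a{sv}"));
  g_variant_builder_add (&builder, "{s@v}", "refs",
                         g_variant_new_variant (g_variant_new_strv (refs, -1)));
  /* Assumed option name from #1594; the count of 5 is illustrative. */
  g_variant_builder_add (&builder, "{s@v}", "n-network-retries",
                         g_variant_new_variant (g_variant_new_uint32 (5)));

  g_autoptr(GVariant) options = g_variant_ref_sink (g_variant_builder_end (&builder));
  return ostree_repo_pull_with_options (repo, "fedora-atomic", options,
                                        NULL /* progress */, NULL /* cancellable */,
                                        error);
}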

Perhaps optionally disabling HTTP/2 is something that should be split out into a separate issue?

@cgwalters (Member)

We already support disabling HTTP/2 since #1373
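A hedged example, assuming the remote config key added there is spelled "http2":

# disable HTTP/2 for a remote at creation time
ostree remote add --set=http2=false fedora-atomic https://kojipkgs.fedoraproject.org/atomic/26/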

Yep, agreed we can close this for now; we can always reopen more targeted issues for further enhancements. Thanks a ton for implementing this!

@Ruedii commented Nov 16, 2018

NOT FIXED.

The current (new) retry logic fails to consider congestion as a cause of dropped packets. This especially happens with TLS handshake packets on certain routers and ISPs.

Recommended solutions, all of which bring behavior in line with standard TCP and HTTP practice. Most FTP servers presume HTTP-like behavior regarding retries and parallel downloads. (All "sane" values should be tunable in advanced configuration, and servers should be able to override them.)

Limit parallel connections to any single remote host to a sane value based on protocol.
Limit in-progress handshakes to any single remote host to a sane value.
Limit total parallel connections from the application to a sane value (see the sketch at the end of this comment).
Limit the number of in-progress handshakes to a sane value.

Furthermore, in the case of dropped connections, on the third retry to a host with parallel connections the client should wait until a parallel connection finishes its download before restarting, locking out all new connections to the host at that point.

It should then either hand off that connection to the new download (preferable), or close the connection and open a new one after a fixed grace period.

As a final note, the best long-term option is to redo the network code to be request-based instead of connection-based: open a sane number of parallel connections and feed requests through from a queue, just as a modern web browser does when downloading the contents of a web page. This would be a much simpler model, friendlier to multi-threading, and more adaptable in all aspects, including network congestion and unreliable networks.
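A minimal libcurl sketch of those per-host and total connection caps, using existing curl multi options (the specific limits are illustrative, not values anyone in this thread agreed on):

#include <curl/curl.h>

/* Build a curl multi handle that caps parallel connections. */
static CURLM *
make_limited_multi (void)
{
  CURLM *multi = curl_multi_init ();
  /* At most 6 simultaneous connections to any single host. */
  curl_multi_setopt (multi, CURLMOPT_MAX_HOST_CONNECTIONS, 6L);
  /* At most 16 simultaneous connections overall. */
  curl_multi_setopt (multi, CURLMOPT_MAX_TOTAL_CONNECTIONS, 16L);
  return multi;
}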

@pwithnall (Member) commented Nov 16, 2018

Probably best to open a new issue with your suggestions rather than try to resurrect an issue from half a year ago, thanks.

@Ruedii commented Nov 18, 2018

Probably best to open a new issue with your suggestions rather than try to resurrect an issue from half a year ago, thanks.

Thanks
