Network and Connection Errors

kwrodarmer edited this page Jul 19, 2018 · 2 revisions

The SRA Toolkit will occasionally run into network errors. If you're reading this page, they might be more than occasional in your case. This page will give a brief walkthrough of what happens in typical connections between an SRA Toolkit tool and our servers.

First, most if not all of our connections use HTTPS, which is simply HTTP over TLS (the successor to SSL). This matters because the lowest-level piece of networking software in our tools is the TLS module, and it is involved in nearly everything network-related. This is why so many of the complaints in the error log come from mbedtls_ssl. These errors do not necessarily indicate a cryptographic problem; they can just as well indicate a problem establishing a connection or simply getting data across it.

Second, our tools cannot control the behavior of the Internet, nor can they predict all of the devices that handle communications between the tools and our servers, which may be on the other side of the world from you. We do try to understand and compensate for problems to the best of our abilities, but there are limits to what we can do to eliminate them.

The general behavior of the SRA Toolkit (or any software making use of VDB) with regard to the network is that errors are reported to the log, and then VDB retries several times until it either succeeds or finally gives up. On the one hand, this is great because it makes our tools stubborn, insistent, and resilient to transient network failures. On the other hand, it is confusing because while we're trying again and again, the error log is filling up with reports of failures. At the time a failure occurs, we don't know whether it will be the smoking gun needed to diagnose a problem, so it gets reported. As a result, users find it difficult to tell whether such errors mean that their operation failed or whether they should just be taken as warnings.
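The retry-and-report pattern described above can be illustrated with a minimal Python sketch. This is not the toolkit's or VDB's actual code: the function name `with_retries`, the attempt count, and the delay are all hypothetical, chosen only to show why a log can contain several error reports for an operation that ultimately succeeded.

```python
import logging
import time

log = logging.getLogger("retry_sketch")


def with_retries(operation, attempts=3, delay_seconds=0.0):
    """Call operation() until it succeeds or the attempts run out.

    Every failure is logged when it happens, because at that moment
    there is no way to know whether a later attempt will succeed.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except OSError as error:  # Python's network errors derive from OSError
            last_error = error
            log.error("attempt %d of %d failed: %s", attempt, attempts, error)
            if attempt < attempts:
                time.sleep(delay_seconds)
    # Only reaching this point means the operation as a whole failed.
    raise last_error
```

In this sketch, an operation that fails twice and then succeeds leaves two error lines in the log but still returns its result. Only the exception raised after the final attempt signals a real failure.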

Establish Connection

The process starts with establishing a TCP connection, generally on port 443. This can fail for many reasons, but the main reason is the existence of firewall rules that prevent outgoing connections to our servers. Such rules are common (and good practice) in many environments, and it may be necessary to let your IT group know how they would need to set up firewalls in order to permit SRA activity. See https://github.com/ncbi/sra-tools/wiki/Firewall-and-Routing-Information.
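To check whether an outgoing TCP connection is possible at all, independently of the toolkit, a small probe like the following Python sketch can help. The helper name `can_connect` and the timeout value are hypothetical; the hosts you actually need to reach are the ones listed on the firewall page linked above.

```python
import socket


def can_connect(host, port=443, timeout_seconds=5.0):
    """Return (True, "") if a TCP connection opens, else (False, reason)."""
    try:
        # create_connection resolves the host name and completes the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout_seconds):
            return True, ""
    except OSError as error:
        # Covers refused connections, timeouts, and name resolution failures.
        return False, str(error)
```

For example, `can_connect("www.ncbi.nlm.nih.gov")` (an example host name only) tests port 443. A refusal or timeout here suggests that a firewall rule, rather than the toolkit, is blocking the connection. If a proxy is required in your environment, a direct probe like this may fail even though proxied traffic works.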

Using a Proxy

Proxies are often used to gather HTTP communications and run them through a common pinch-point. They are especially common on compute clusters, but also appear in most enterprises and large organizations. They function as a sort of HTTP gateway and have traditionally performed caching services. If a proxy is required in your environment, the SRA Toolkit will detect it through standard environment variables, or it can be configured to use one. See https://github.com/ncbi/sra-tools/wiki/Toolkit-Configuration.
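To see which proxy-related environment variables are set in your shell or job environment, a short Python sketch such as the one below may be useful. The variable names in `PROXY_VARIABLES` are the conventional ones used by many tools; which of them the toolkit actually consults is an assumption here, and the configuration page linked above remains the authoritative source.

```python
import os

# Conventional proxy variable names (checked in lower and upper case).
# Assumption: this page does not state which names the toolkit reads.
PROXY_VARIABLES = ("http_proxy", "https_proxy", "all_proxy", "no_proxy")


def proxy_settings(environ=None):
    """Return {name: value} for each conventional proxy variable that is set."""
    environ = os.environ if environ is None else environ
    found = {}
    for name in PROXY_VARIABLES:
        for candidate in (name, name.upper()):
            value = environ.get(candidate)
            if value:
                found[candidate] = value
    return found
```

Calling `proxy_settings()` with no arguments inspects the current process environment. An empty result means none of these variables is set, so any proxy would have to come from explicit configuration.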

Perform TLS Handshake

Once a connection is made, the mbedtls_ssl subsystem begins reading from and writing to the socket and attempts to perform the TLS handshake. This process involves an exchange of protocol elements, one of which is the receipt and validation of the server's certificate.

You may be observing messages from the tools such as:

connection failed while opening file within cryptographic module - mbedtls_ssl_handshake returned -76 ( NET - Reading information from the socket failed )

If you see a message like this, it means that there was a problem just reading data from the server. The exact reasons are difficult to know. We address them by attempting the read multiple times until we succeed or finally give up.
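When collecting such messages for a bug report, it can help to pull out the function name and error code. The Python sketch below parses a line shaped like the example above; the regular expression is an assumption based only on that one example, and other messages may be formatted differently. mbed TLS documents its error codes in hexadecimal, so the sketch also renders the code that way: -76 is -0x004C, which mbed TLS defines as MBEDTLS_ERR_NET_RECV_FAILED.

```python
import re

# Matches "<mbedtls function> returned <code> ( <description> )".
PATTERN = re.compile(r"(mbedtls_\w+) returned (-?\d+) \(\s*(.*?)\s*\)")


def parse_mbedtls_error(line):
    """Extract function, numeric code, hex code, and description from a log line."""
    match = PATTERN.search(line)
    if match is None:
        return None
    function, code_text, description = match.groups()
    code = int(code_text)
    sign = "-" if code < 0 else ""
    return {
        "function": function,
        "code": code,
        "hex": f"{sign}0x{abs(code):04X}",  # the form used in mbed TLS headers
        "description": description,
    }
```

Applied to the message shown above, this yields the function `mbedtls_ssl_handshake`, the code -76, and the hex form `-0x004C`. Lines that do not match the pattern return `None`.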

The handshake can also fail for cryptographic reasons. Namely, the certificate may fail verification by mbedtls_ssl, in which case the connection is dropped. This is not exactly a common occurrence, but we do see it happening. The most likely reason is that your organization is using a deep packet inspection firewall or proxy that examines all HTTPS traffic. Doing so requires decryption of your TLS connection, and this is done by supplying forged SSL certificates, usually generated on demand. SRA Toolkit tools are pretty good at detecting forged certificates, and so you may run into this problem within your enterprise. If your IT department wasn't responsible for forging the certificate, you may be experiencing a MITM (Man In The Middle) attack.
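One way to check for an inspecting proxy is to look at who issued the certificate your machine actually receives. The Python sketch below uses the standard `ssl` module, not mbedtls, so it only approximates what the toolkit sees: Python verifies against the system trust store, which may differ from the certificates the toolkit trusts. The function names are hypothetical.

```python
import socket
import ssl


def issuer_fields(cert):
    """Flatten the 'issuer' entry of SSLSocket.getpeercert() into a dict."""
    fields = {}
    for rdn in cert.get("issuer", ()):
        for key, value in rdn:
            fields[key] = value
    return fields


def fetch_issuer(host, port=443, timeout_seconds=10.0):
    """Complete a verified TLS handshake and return the certificate issuer.

    Raises ssl.SSLCertVerificationError if Python cannot verify the chain.
    """
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout_seconds) as raw:
        with context.wrap_socket(raw, server_hostname=host) as tls:
            return issuer_fields(tls.getpeercert())
```

If `fetch_issuer` reports an organization name belonging to your employer or a security appliance vendor rather than a public certificate authority, TLS inspection is probably in place. A verification error from Python points the same way, though it can also have other causes such as an outdated trust store.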

Download SRA Data

The remainder of a connection is dedicated to accessing SRA data. We do this in small chunks, typically 128K in size (see https://github.com/ncbi/sra-tools/wiki/Download-On-Demand), and we cache data already downloaded within the VDB cache area you can establish via configuration (see https://github.com/ncbi/sra-tools/wiki/Toolkit-Configuration).
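To make the chunking concrete, the following Python sketch splits a file of a given size into 128K (131072-byte) pieces. It is an illustration only: the function names are hypothetical, and expressing each chunk as an HTTP Range header is an assumption about how such requests could be made, not a description of the toolkit's wire protocol.

```python
CHUNK_SIZE = 128 * 1024  # 131072 bytes, the typical size mentioned above


def chunk_ranges(total_size, chunk_size=CHUNK_SIZE):
    """Yield (first_byte, last_byte) pairs, inclusive, covering total_size bytes."""
    start = 0
    while start < total_size:
        end = min(start + chunk_size, total_size) - 1
        yield start, end
        start = end + 1


def range_header(first_byte, last_byte):
    """Format an inclusive byte range in standard HTTP Range syntax."""
    return f"bytes={first_byte}-{last_byte}"
```

A 300000-byte file, for instance, splits into two full chunks and one short final chunk. Because each chunk is a separate request, any one of them can time out and be retried on its own.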

During this phase of access, the most common problems we observe are timeout errors. These occur when we send a request for data but get no response from the server after a reasonable period of time. You will see error reports but, as always, we will retry until we succeed or until we have exhausted our attempts and give up.