
Commit

docs updated
joelb123 committed Feb 2, 2024
1 parent f16e394 commit 2ad248c
Showing 2 changed files with 88 additions and 101 deletions.
187 changes: 87 additions & 100 deletions README.md

[logo license]: https://raw.githubusercontent.com/hydrationdynamics/flardl/main/LICENSE.logo.txt

## Towards Sustainable Downloading

Small amounts of time spent waiting on downloads add
up over thousands of uses, both in human terms and
in terms of energy usage. If we are to respond to the
challenge of climate change, it's important to consider
the efficiency and sustainability of the computations we
launch. In computational science, downloads may consume
only a tiny fraction of cycles, but they often consume
a noticeable fraction of wall-clock time.

While the download bit rate of one's local WAN link is the limit
that matters most, downloading times are also governed by time
spent waiting on handshaking to start transfers or to acknowledge
data received. **Synchronous downloads are highly energy-inefficient**
because hardware still consumes energy during waits. A more sustainable
approach is to arrange the computational graph to carry out transfers
asynchronously over multiple simultaneous connections.
The situation is made more complicated when downloads can be
launched from anywhere in the world to a federated set of servers,
possibly involving content delivery networks. Optimal download
performance in that situation depends on adapting to network
conditions and server loads, typically with no information
other than the last download times of files.
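
As a sketch of why asynchrony helps, the toy below uses Python's
asyncio to overlap simulated transfers. The names and timings are
illustrative only, and a real downloader would await an HTTP client
rather than a sleep:

```python
import asyncio
import time

async def fetch(name: str, transfer_s: float) -> str:
    # Simulate one download; the await yields control, so other
    # transfers proceed during this one's network wait.
    await asyncio.sleep(transfer_s)
    return name

async def fetch_all(files: dict[str, float]) -> list[str]:
    # Launch every transfer at once: wall-clock time approaches the
    # longest single transfer rather than the sum of all of them.
    return list(await asyncio.gather(*(fetch(n, t) for n, t in files.items())))

start = time.perf_counter()
names = asyncio.run(fetch_all({"a": 0.05, "b": 0.05, "c": 0.05}))
elapsed = time.perf_counter() - start  # well under the ~0.15 s a serial loop takes
```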

_Flardl_ downloads lists of files using an approach that
adapts to local conditions and is elastic with respect
to changes in network performance and server loads.
_Flardl_ achieves download rates **typically more than
300X higher** than synchronous utilities such as _curl_,
while allowing use of multiple servers to provide superior
protection against blacklisting. Download rates depend
on network bandwidth, latencies, list length, file sizes,
and HTTP protocol used, but using _flardl_, even a single
server on another continent can usually saturate a gigabit
cable connection after about 50 files.

## Queueing on Long Tails

Typically, one doesn't know much about the list of files to
be downloaded, nor about the state of the servers one is going
to use to download them. Once the first file request has been
made to a given server, download software has only one means of
control: whether to launch another download or to wait. Making
that decision well depends on making good guesses about likely
return times.

Collections of files generated by natural or human activity such
as natural-language writing, protein structure determination,
problems because **mean values are neither stable nor
characteristic of the distribution**. For example, as can be
seen in the fits above, the mean and standard deviation
of samples drawn from a long-tail distribution tend to grow
with increasing sample size. The fit of a normal distribution
to a sample of 5% of the data (dashed line) gives a markedly
lower mean and standard deviation than the fit to all points
(dotted line), and both fits are poor. The mean tends to grow
with sample size because larger samples are more likely to
include a huge file that dominates the average value.
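
This instability is easy to reproduce. The sketch below draws
simulated file sizes from a Pareto distribution — a stand-in
assumption, since any real collection's distribution will differ —
and shows the tail pulling the mean far above the bulk of the data:

```python
import random
import statistics

def simulated_sizes(n: int, alpha: float = 1.1, seed: int = 0) -> list[float]:
    # Pareto with alpha near 1 is heavy-tailed: rare huge values
    # dominate sums, so the sample mean keeps drifting upward.
    rng = random.Random(seed)
    return [rng.paretovariate(alpha) for _ in range(n)]

sizes = simulated_sizes(50_000)
mean_size = sum(sizes) / len(sizes)
median_size = statistics.median(sizes)  # stable, near the bulk of the data
```

With 50,000 draws, the largest value is typically hundreds of times
the median, and the mean sits well above the median as a result.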

Algorithms that employ average per-file rates or times as the
primary means of control will launch requests too slowly most
of the time while letting queues run too deep when big downloads
are encountered. While the _mean_ per-file download time isn't a
good statistic, **control based on _modal_ per-file statistics
can be more consistent**. For example, the modal per-file download
time $\tilde{\tau}$ (where the tilde indicates a modal value) is
fairly consistent across sample sizes, and transfer algorithms based
on that statistic will perform consistently, at least on timescales
over which network and server performance are stable.
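
One hedged way to estimate such a modal statistic is a capped
histogram, sketched below; the bin count and percentile cap are
arbitrary choices for illustration, not _flardl_'s actual method:

```python
import statistics

def modal_time(times: list[float], bins: int = 20) -> float:
    # Histogram between the minimum and the 90th percentile; tail
    # outliers land in the top bin and cannot drag the estimate
    # the way they drag the mean.
    lo = min(times)
    hi = statistics.quantiles(times, n=10)[8]  # 90th-percentile cap
    width = (hi - lo) / bins or 1e-9
    counts = [0] * bins
    for t in times:
        counts[min(int((t - lo) / width), bins - 1)] += 1
    best = counts.index(max(counts))
    return lo + (best + 0.5) * width  # midpoint of the fullest bin
```

For times like fifty 1.0 s transfers, thirty 1.1 s transfers, and
two huge ones (50 s and 200 s), the mean is above 4 s while this
modal estimate stays near 1 s.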

### Queue Depth is Toxic

At first glance, running at high queue depths seems attractive.
One of the simplest queueing algorithms would be to simply put every
job in a queue at startup and let the server(s) handle requests
in parallel up to their individual critical queue depths, above
which they serialize requests as best they can. But such
non-adaptive, non-elastic algorithms
give poor real-world performance for multiple reasons. First, if
there is more than one server queue, differing file sizes and
transfer rates will result in the queueing equivalent of
[Amdahl's law](https://en.wikipedia.org/wiki/Amdahl%27s_law),
by **creating an overhang** where one server still has multiple
files queued up to serve while others have completed all requests.
The server with the overhang is also not guaranteed to be the
fastest one.
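
A minimal simulation of the overhang effect, assuming naive
round-robin assignment and fixed per-server rates (the numbers are
illustrative only):

```python
def finish_times(file_sizes: list[float], rates: list[float]) -> list[float]:
    # Naive scheme: pre-assign files round-robin to per-server queues
    # and report when each server would finish its share.
    totals = [0.0] * len(rates)
    for i, size in enumerate(file_sizes):
        j = i % len(rates)
        totals[j] += size / rates[j]
    return totals

# Twenty equal files split over a fast and a slow server: the slow
# server's overhang sets the wall-clock time for the whole batch.
done = finish_times([1.0] * 20, [4.0, 1.0])  # -> [2.5, 10.0]
```

The fast server idles for three-quarters of the run, which is the
queueing analogue of Amdahl's law described above.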

If a server decides you are abusing its queue policies,
it may take action that hurts your current and future downloads.
Most public-facing servers have policies to recognize and defend
against Denial-Of-Service (DOS) attacks and a large number of
of the server to get removed from. Blacklisting might not even be your
personal fault, but a collective problem. I have seen a practical class
of 20 students brought to a complete halt by a 24-hour blacklisting
of the institution's public IP address by a government site. Until methods
are developed for servers to publish their "play-friendly" values and to
whitelist known-friendly clients, the highest priority for downloading
algorithms must be to **avoid blacklisting by a server by minimizing
queue depth**. At the other extreme, the absolute minimum queue depth is
fishing. At first, you have a single fishing rod and you go
fishing at a series of local lakes where your catch consists
of **small fishes called "crappies"**. Your records reveal
that while the rate of catching fishes can vary from day to
day--fish might be hungry or not--the average size of a crappie
from a given pond is pretty stable. Bigger ponds tend to have
bigger crappies in them, and it might take slightly longer to
reel in a bigger crappie than a small one, but the rate of
catching crappies averages out pretty quickly.

You love fishing so much that one day you drive
to the coast and charter a fishing boat. On that boat,
you can set out as many lines as you want (up to some limit)
and fish in parallel. At first, you catch mostly small fishes
that are the ocean-going equivalent of crappies. But
eventually you hook a small whale. Not only does it take a lot of
your time and attention to reel in the whale, but landing
it totally skews the average weight and catch rate. You and
your crew can only effectively reel in so many hooked lines at
once. Putting out more lines than that effective limit of hooked
plus waiting-to-be-hooked lines only results in more time spent
waiting in the ocean.

Our theory of fishing says to **put out lines at the usual rate
of catching crappies but limit the number of lines to deal with
whales**. The most probable rate of catching modal-sized fish
is an optimistic estimate, but you can delay putting out more lines
if you reach the maximum number of lines your boat can handle. Once
you catch enough to be able to estimate how the fish are biting, you
can back off the number of lines to the number that you and your
crew can handle at a time that day.
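
The fishing strategy can be sketched as a paced launcher with a
concurrency cap. This is an illustration of the idea, not _flardl_'s
implementation, and the job, rate, and cap values are made up:

```python
import asyncio

async def launch_at_modal_rate(jobs, modal_rate: float, max_lines: int):
    # Pace launches at the modal ("crappie") rate, but cap the number
    # of lines in the water so a whale cannot pile up queue depth.
    lines = asyncio.Semaphore(max_lines)

    async def reel_in(job):
        async with lines:
            return await job()

    tasks = []
    for job in jobs:
        tasks.append(asyncio.create_task(reel_in(job)))
        await asyncio.sleep(1.0 / modal_rate)  # optimistic pacing
    return await asyncio.gather(*tasks)

async def _catch():
    await asyncio.sleep(0.001)  # stand-in for one transfer
    return 1

results = asyncio.run(launch_at_modal_rate([_catch] * 20, modal_rate=500.0, max_lines=4))
```

If catches slow down (a whale on the line), the semaphore fills and
launches stall automatically until a line frees up.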

### Adaptilastic Queueing

_Flardl_ implements a method I call "adaptilastic"
queueing to deliver robust performance in real situations.
Adaptilastic queueing uses timing on transfers from an initial
period—launched using optimistic assumptions—to
optimize later transfers by using the minimum total depth over
all queues that will plateau the download bit rate while avoiding
excess queue depth on any given server. _Flardl_ distinguishes
among four different operating regimes:

- **Naive**, where no transfers have ever been completed,
- **Informed**, where information from a previous run
is available,
- **Arriving**, where information from at least one transfer
to at least one server has occurred,
- **Updated**, where a sufficient number of transfers has
occurred that file transfers may be characterized for at
least one server.
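
These regimes might be represented as a simple state classifier; the
threshold and function below are hypothetical, and _flardl_'s actual
bookkeeping is per-server and richer than this sketch:

```python
from enum import Enum, auto

class Regime(Enum):
    NAIVE = auto()     # no transfers ever completed
    INFORMED = auto()  # statistics saved from a previous run
    ARRIVING = auto()  # first transfers of this session have arrived
    UPDATED = auto()   # enough transfers to characterize a server

def classify(have_previous_stats: bool, arrived: int, enough: int = 25) -> Regime:
    # Hypothetical threshold; the point is only the precedence of regimes.
    if arrived >= enough:
        return Regime.UPDATED
    if arrived > 0:
        return Regime.ARRIVING
    return Regime.INFORMED if have_previous_stats else Regime.NAIVE
```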

The optimistic rate at which _flardl_ launches requests for
a given server $j$ is given by the expectation rates for
modal-sized files with small queue depths as

$`
\begin{equation}
\cdots
\end{equation}

where

- $\tilde{S}$ is the modal file size for the collection
(an input parameter),
- $B_{\rm max}$ is the maximum permitted download rate
(an input parameter),
- $D_j$ is the server queue depth at launch,
- $\tilde{\tau}_{\rm prev}$ is the modal file arrival rate
for the previous session,
- $I_{\rm first}$ is the initiation time for the first
transfer to arrive,
- and $\tilde{\tau}_j$ is the modal file transfer rate
for the current session with the server.

After waiting an exponentially-distributed stochastic period
given by the applicable value for $k_j$, testing is done
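
Such a wait can be drawn with the standard library's `expovariate`;
the rate value below is illustrative only:

```python
import random

def launch_wait(k: float, rng: random.Random) -> float:
    # Exponential wait with rate k: mean 1/k, with enough jitter
    # that launches do not arrive at the server in lock step.
    return rng.expovariate(k)

rng = random.Random(42)
waits = [launch_wait(8.0, rng) for _ in range(10_000)]
mean_wait = sum(waits) / len(waits)  # close to 1/8 s
```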
2 changes: 1 addition & 1 deletion pyproject.toml
@@ -1,6 +1,6 @@
[project]
name = "flardl"
version = "0.0.8.1"
version = "0.0.8.2"
description = "Adaptive Elastic Multi-Site Downloading"
authors = [
{name = "Joel Berendzen", email = "joel@generisbio.com"},
