
Commit

docs updated
joelb123 committed Feb 2, 2024
1 parent f16e394 commit 2ad248c
Showing 2 changed files with 88 additions and 101 deletions.
187 changes: 87 additions & 100 deletions README.md

[logo license]: https://raw.githubusercontent.com/hydrationdynamics/flardl/main/LICENSE.logo.txt

## Towards Sustainable Downloading

Small amounts of time spent waiting on downloads add
up over thousands of uses, both in human terms and
in terms of energy usage. If we are to respond to the
challenge of climate change, it's important to consider
the efficiency and sustainability of the computations we
launch. In computational science, downloads may consume
only a tiny fraction of cycles, but they often consume
a noticeable fraction of wall-clock time.

While the download bit rate of one's local WAN link is the limit
that matters most, downloading times are also governed by time
spent waiting on handshaking to start transfers or to acknowledge
data received. **Synchronous downloads are highly energy-inefficient**
because hardware still consumes energy during waits. A more sustainable
approach is to arrange the computational graph to carry out transfers
asynchronously over multiple simultaneous connections.
The situation is made more complicated when downloads can be
launched from anywhere in the world to a federated set of servers,
possibly involving content delivery networks. Optimal download
performance in that situation depends on adapting to network
conditions and server loads, typically with no information
other than the last download times of files.
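
As a sketch of why asynchrony helps, the toy below uses Python's
asyncio to overlap simulated transfers. The names and timings are
illustrative only, and a real downloader would await an HTTP client
rather than a sleep:

```python
import asyncio
import time

async def fetch(name: str, transfer_s: float) -> str:
    # Simulate one download; the await yields control, so other
    # transfers proceed during this one's network wait.
    await asyncio.sleep(transfer_s)
    return name

async def fetch_all(files: dict[str, float]) -> list[str]:
    # Launch every transfer at once: wall-clock time approaches the
    # longest single transfer rather than the sum of all of them.
    return list(await asyncio.gather(*(fetch(n, t) for n, t in files.items())))

start = time.perf_counter()
names = asyncio.run(fetch_all({"a": 0.05, "b": 0.05, "c": 0.05}))
elapsed = time.perf_counter() - start  # well under the ~0.15 s a serial loop takes
```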

_Flardl_ downloads lists of files using an approach that
adapts to local conditions and is elastic with respect
to changes in network performance and server loads.
_Flardl_ achieves download rates **typically more than
300X higher** than synchronous utilities such as _curl_,
while allowing use of multiple servers to provide superior
protection against blacklisting. Download rates depend
on network bandwidth, latencies, list length, file sizes,
and HTTP protocol used, but using _flardl_, even a single
server on another continent can usually saturate a gigabit
cable connection after about 50 files.

## Queueing on Long Tails

Typically, one doesn't know much about the list of files to
be downloaded, nor about the state of the servers one is going
to use to download them. Once the first file request has been
made to a given server, download software has only one means of
control: whether to launch another download or to wait. Making
that decision well depends on making good guesses about likely
return times.

Collections of files generated by natural or human activity such
as natural-language writing, protein structure determination,
problems because **mean values are neither stable nor
characteristic of the distribution**. For example, as can be
seen in the fits above, the mean and standard deviation
of samples drawn from a long-tail distribution tend to grow
with increasing sample size. The fit of a normal distribution
to a sample of 5% of the data (dashed line) gives a markedly
lower mean and standard deviation than the fit to all points
(dotted line), and both fits are poor. The mean tends to grow
with sample size because larger samples are more likely to
include a huge file that dominates the average value.
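
This instability is easy to reproduce. The sketch below draws
simulated file sizes from a Pareto distribution — a stand-in
assumption, since any real collection's distribution will differ —
and shows the tail pulling the mean far above the bulk of the data:

```python
import random
import statistics

def simulated_sizes(n: int, alpha: float = 1.1, seed: int = 0) -> list[float]:
    # Pareto with alpha near 1 is heavy-tailed: rare huge values
    # dominate sums, so the sample mean keeps drifting upward.
    rng = random.Random(seed)
    return [rng.paretovariate(alpha) for _ in range(n)]

sizes = simulated_sizes(50_000)
mean_size = sum(sizes) / len(sizes)
median_size = statistics.median(sizes)  # stable, near the bulk of the data
```

With 50,000 draws, the largest value is typically hundreds of times
the median, and the mean sits well above the median as a result.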

Algorithms that employ average per-file rates or times as the
primary means of control will launch requests too slowly most
of the time while letting queues run too deep when big downloads
are encountered. While the _mean_ per-file download time isn't a
good statistic, **control based on _modal_ per-file statistics
can be more consistent**. For example, the modal per-file download
time $\tilde{\tau}$ (where the tilde indicates a modal value) is
fairly consistent across sample sizes, and transfer algorithms based
on that statistic will perform consistently, at least on timescales
over which network and server performance are stable.
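
One hedged way to estimate such a modal statistic is a capped
histogram, sketched below; the bin count and percentile cap are
arbitrary choices for illustration, not _flardl_'s actual method:

```python
import statistics

def modal_time(times: list[float], bins: int = 20) -> float:
    # Histogram between the minimum and the 90th percentile; tail
    # outliers land in the top bin and cannot drag the estimate
    # the way they drag the mean.
    lo = min(times)
    hi = statistics.quantiles(times, n=10)[8]  # 90th-percentile cap
    width = (hi - lo) / bins or 1e-9
    counts = [0] * bins
    for t in times:
        counts[min(int((t - lo) / width), bins - 1)] += 1
    best = counts.index(max(counts))
    return lo + (best + 0.5) * width  # midpoint of the fullest bin
```

For times like fifty 1.0 s transfers, thirty 1.1 s transfers, and
two huge ones (50 s and 200 s), the mean is above 4 s while this
modal estimate stays near 1 s.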

### Queue Depth is Toxic

At first glance, running at high queue depths seems attractive.
One of the simplest queueing algorithms would be to simply put every
job in a queue at startup and let the server(s) handle requests
in parallel up to their individual critical queue depths, above
which they serialize requests as best they can. But such
non-adaptive, non-elastic algorithms
give poor real-world performance for multiple reasons. First, if
there is more than one server queue, differing file sizes and
transfer rates will result in the queueing equivalent of
[Amdahl's law](https://en.wikipedia.org/wiki/Amdahl%27s_law),
by **creating an overhang** where one server still has multiple
files queued up to serve while others have completed all requests.
The server with the overhang is also not guaranteed to be the
fastest one.
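
A minimal simulation of the overhang effect, assuming naive
round-robin assignment and fixed per-server rates (the numbers are
illustrative only):

```python
def finish_times(file_sizes: list[float], rates: list[float]) -> list[float]:
    # Naive scheme: pre-assign files round-robin to per-server queues
    # and report when each server would finish its share.
    totals = [0.0] * len(rates)
    for i, size in enumerate(file_sizes):
        j = i % len(rates)
        totals[j] += size / rates[j]
    return totals

# Twenty equal files split over a fast and a slow server: the slow
# server's overhang sets the wall-clock time for the whole batch.
done = finish_times([1.0] * 20, [4.0, 1.0])  # -> [2.5, 10.0]
```

The fast server idles for three-quarters of the run, which is the
queueing analogue of Amdahl's law described above.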

If a server decides you are abusing its queue policies,
it may take action that hurts your current and future downloads.
Most public-facing servers have policies to recognize and defend
against Denial-Of-Service (DOS) attacks and a large number of
of the server to get removed from. Blacklisting might not even be your
personal fault, but a collective problem. I have seen a practical class
of 20 students brought to a complete halt by a 24-hour blacklisting
of the institution's public IP address by a government site. Until methods
are developed for servers to publish their "play-friendly" values and to
whitelist known-friendly clients, the highest priority for downloading
algorithms must be to **avoid blacklisting by a server by minimizing
queue depth**. At the other extreme, the absolute minimum queue depth is
fishing. At first, you have a single fishing rod and you go
fishing at a series of local lakes where your catch consists
of **small fishes called "crappies"**. Your records reveal
that while the rate of catching fishes can vary from day to
day--fish might be hungry or not--the average size of a crappie
from a given pond is pretty stable. Bigger ponds tend to have
bigger crappies in them, and it might take slightly longer to
reel in a bigger crappie than a small one, but the rate of
catching crappies averages out pretty quickly.

You love fishing so much that one day you drive
to the coast and charter a fishing boat. On that boat,
you can set out as many lines as you want (up to some limit)
and fish in parallel. At first, you catch mostly small fishes
that are the ocean-going equivalent of crappies. But
eventually you hook a small whale. Not only does it take a lot of
your time and attention to reel in the whale, but landing
it totally skews the average weight and catch rate. You and
your crew can only effectively reel in so many hooked lines at
once. Putting out more lines than that effective limit of hooked
plus waiting-to-be-hooked lines only results in more time spent
waiting in the ocean.

Our theory of fishing says to **put out lines at the usual rate
of catching crappies but limit the number of lines to deal with
whales**. The most probable rate of catching modal-sized fish
is an optimistic estimate, but you can delay putting out more lines
if you reach the maximum number of lines your boat can handle. Once
you catch enough to be able to estimate how the fish are biting, you
can back off the number of lines to the number that you and your
crew can handle at a time that day.
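
The fishing strategy can be sketched as a paced launcher with a
concurrency cap. This is an illustration of the idea, not _flardl_'s
implementation, and the job, rate, and cap values are made up:

```python
import asyncio

async def launch_at_modal_rate(jobs, modal_rate: float, max_lines: int):
    # Pace launches at the modal ("crappie") rate, but cap the number
    # of lines in the water so a whale cannot pile up queue depth.
    lines = asyncio.Semaphore(max_lines)

    async def reel_in(job):
        async with lines:
            return await job()

    tasks = []
    for job in jobs:
        tasks.append(asyncio.create_task(reel_in(job)))
        await asyncio.sleep(1.0 / modal_rate)  # optimistic pacing
    return await asyncio.gather(*tasks)

async def _catch():
    await asyncio.sleep(0.001)  # stand-in for one transfer
    return 1

results = asyncio.run(launch_at_modal_rate([_catch] * 20, modal_rate=500.0, max_lines=4))
```

If catches slow down (a whale on the line), the semaphore fills and
launches stall automatically until a line frees up.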

### Adaptilastic Queueing

_Flardl_ implements a method I call "adaptilastic"
queueing to deliver robust performance in real situations.
Adaptilastic queueing uses timing on transfers from an initial
period—launched using optimistic assumptions—to
optimize later transfers by using the minimum total depth over
all queues that will plateau the download bit rate while avoiding
excess queue depth on any given server. _Flardl_ distinguishes
among four different operating regimes:

- **Naive**, where no transfers have ever been completed,
- **Informed**, where information from a previous run
is available,
- **Arriving**, where information from at least one transfer
to at least one server has occurred,
- **Updated**, where a sufficient number of transfers has
occurred that file transfers may be characterized for at
least one server.
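
These regimes might be represented as a simple state classifier; the
threshold and function below are hypothetical, and _flardl_'s actual
bookkeeping is per-server and richer than this sketch:

```python
from enum import Enum, auto

class Regime(Enum):
    NAIVE = auto()     # no transfers ever completed
    INFORMED = auto()  # statistics saved from a previous run
    ARRIVING = auto()  # first transfers of this session have arrived
    UPDATED = auto()   # enough transfers to characterize a server

def classify(have_previous_stats: bool, arrived: int, enough: int = 25) -> Regime:
    # Hypothetical threshold; the point is only the precedence of regimes.
    if arrived >= enough:
        return Regime.UPDATED
    if arrived > 0:
        return Regime.ARRIVING
    return Regime.INFORMED if have_previous_stats else Regime.NAIVE
```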

The optimistic rate at which _flardl_ launches requests for
a given server $j$ is given by the expectation rates for
modal-sized files with small queue depths as

$`
\begin{equation}
\cdots
\end{equation}

where

- $\tilde{S}$ is the modal file size for the collection
(an input parameter),
- $B_{\rm max}$ is the maximum permitted download rate
(an input parameter),
- $D_j$ is the server queue depth at launch,
- $\tilde{\tau}_{\rm prev}$ is the modal file arrival rate
for the previous session,
- $I_{\rm first}$ is the initiation time for the first
transfer to arrive,
- and $\tilde{\tau}_j$ is the modal file transfer rate
for the current session with the server.

After waiting an exponentially-distributed stochastic period
given by the applicable value for $k_j$, testing is done
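
Such a wait can be drawn with the standard library's `expovariate`;
the rate value below is illustrative only:

```python
import random

def launch_wait(k: float, rng: random.Random) -> float:
    # Exponential wait with rate k: mean 1/k, with enough jitter
    # that launches do not arrive at the server in lock step.
    return rng.expovariate(k)

rng = random.Random(42)
waits = [launch_wait(8.0, rng) for _ in range(10_000)]
mean_wait = sum(waits) / len(waits)  # close to 1/8 s
```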
2 changes: 1 addition & 1 deletion pyproject.toml
@@ -1,6 +1,6 @@
[project]
name = "flardl"
version = "0.0.8.1"
version = "0.0.8.2"
description = "Adaptive Elastic Multi-Site Downloading"
authors = [
{name = "Joel Berendzen", email = "joel@generisbio.com"},
