Commit

restructured documentation
bmcfee committed Aug 25, 2017
1 parent fd57ba6 commit bd9b60d
Showing 5 changed files with 30 additions and 23 deletions.
31 changes: 18 additions & 13 deletions docs/example1.rst
@@ -1,18 +1,20 @@
.. _example1:

-Basic example
-=============
+Streaming data
+==============

-This document will walk through the basics of using pescador to stream samples from a generator.
+This example will walk through the basics of using pescador to stream samples from a generator.

Our running example will be learning from an infinite stream of stochastically perturbed samples from the Iris dataset.


Sample generators
-----------------
-Streamers are intended to transparently pass data without modifying it. However, Pescador assumes that Streamers produce output in
-a particular format. Specifically, each sample is expected to be a Python dictionary in which each value is an `np.ndarray`. For unsupervised learning (e.g., sklearn's `MiniBatchKMeans`), the sample might contain only one
-key: `X`. For supervised learning (e.g., `SGDClassifier`), a valid sample would contain both `X` and `Y` keys, with values of equal length.
+Streamers are intended to transparently pass data without modifying it.
+However, Pescador assumes that Streamers produce output in a particular format.
+Specifically, each sample is expected to be a Python dictionary in which each value is an `np.ndarray`.
+For unsupervised learning (e.g., sklearn's `MiniBatchKMeans`), the sample might contain only one key: `X`.
+For supervised learning (e.g., `SGDClassifier`), a valid sample would contain both `X` and `Y` keys, with values of equal length.
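For concreteness, a single valid sample under this convention might look like the following sketch (values are illustrative):

    import numpy as np

    # One sample in pescador's expected format: a dict of np.ndarray values
    sample = {'X': np.array([5.1, 3.5, 1.4, 0.2]),  # feature vector, shape (4,)
              'Y': np.array(0)}                     # scalar label, shape ()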

Here's a simple example generator that draws random samples of data from the Iris dataset, and adds Gaussian noise to the features.

@@ -43,7 +45,6 @@ Here's a simple example generator that draws random samples of data from the Iri
sample['Y'] is a scalar `np.ndarray` of shape `()`
'''
n, d = X.shape
while True:
@@ -53,16 +54,20 @@ Here's a simple example generator that draws random samples of data from the Iri
yield dict(X=X[i] + noise, Y=Y[i])
-In the code above, `noisy_samples` is a generator that can be sampled indefinitely because `noisy_samples` contains an infinite loop. Each iterate of `noisy_samples` will be a dictionary containing the sample's features and labels.
+In the code above, `noisy_samples` is a generator that can be sampled indefinitely because `noisy_samples` contains an infinite loop.
+Each iterate of `noisy_samples` will be a dictionary containing the sample's features and labels.
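Assembled from the fragments above, the complete generator might read as follows; the index sampling, the noise construction, and the `sigma` noise-scale parameter are assumptions of this sketch:

    import numpy as np

    def noisy_samples(X, Y, sigma=1.0):
        '''Indefinitely yield random Iris examples with additive Gaussian noise.'''
        n, d = X.shape
        while True:
            i = np.random.randint(0, n)          # pick a random example
            noise = sigma * np.random.randn(d)   # perturb its features
            yield dict(X=X[i] + noise, Y=Y[i])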


Streamers
---------
-Generators in Python have a couple of limitations for common stream-learning pipelines. First, once instantiated, a generator cannot be "restarted". Second, an instantiated generator cannot be serialized
-directly, so generators are difficult to use in distributed computation environments.
+Generators in Python have a couple of limitations for common stream-learning pipelines.
+First, once instantiated, a generator cannot be "restarted".
+Second, an instantiated generator cannot be serialized directly, so generators are difficult to use in distributed computation environments.

-Pescador provides the `Streamer` class to circumvent these issues. `Streamer` simply provides an object container for an uninstantiated generator (and its parameters), and an access method `generate()`. Calling `generate()` multiple times on a `Streamer` object is equivalent to restarting the generator, and can therefore be used to implement multiple-pass streams. Similarly, because a `Streamer` can be serialized, it is simple to pass a streamer object to a separate process for parallel computation.
+Pescador provides the `Streamer` class to circumvent these issues.
+`Streamer` simply provides an object container for an uninstantiated generator (and its parameters), and an access method `generate()`.
+Calling `generate()` multiple times on a `Streamer` object is equivalent to restarting the generator, and can therefore be used to implement multiple-pass streams.
+Similarly, because a `Streamer` can be serialized, it is simple to pass a streamer object to a separate process for parallel computation.

Here's a simple example, using the generator from the previous section.
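A minimal usage sketch, assuming the `noisy_samples` generator above and Iris data loaded as arrays `X` and `Y`, might look like this:

    from itertools import islice
    import pickle

    import pescador
    from sklearn.datasets import load_iris

    data = load_iris()
    X, Y = data.data, data.target

    # Wrap the uninstantiated generator and its arguments
    streamer = pescador.Streamer(noisy_samples, X, Y)

    # generate() instantiates the generator; islice bounds the infinite stream
    for sample in islice(streamer.generate(), 5):
        print(sample['X'].shape, sample['Y'])

    # Unlike a raw generator, the Streamer object itself can be serialized
    blob = pickle.dumps(streamer)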

8 changes: 5 additions & 3 deletions docs/example2.rst
@@ -1,13 +1,14 @@
.. _example2:

-This document will walk through some advanced usage of pescador.
+This example demonstrates how to re-use and multiplex streamers.

We will assume a working understanding of the simple example in the previous section.

Stream re-use and multiplexing
==============================

-The `Mux` streamer provides a powerful interface for randomly interleaving samples from multiple input streams. `Mux` can also dynamically activate and deactivate individual `Streamers`, which allows it to operate on a bounded subset of streams at any given time.
+The `Mux` streamer provides a powerful interface for randomly interleaving samples from multiple input streams.
+`Mux` can also dynamically activate and deactivate individual `Streamers`, which allows it to operate on a bounded subset of streams at any given time.

As a concrete example, we can simulate a mixture of noisy streams with differing variances.

@@ -66,7 +67,8 @@ As a concrete example, we can simulate a mixture of noisy streams with differing
print('Test accuracy: {:.3f}'.format(accuracy_score(Y[test], Ypred)))
-In the above example, each `Streamer` in `streams` can produce infinitely many samples. The `rate=64` argument to `Mux` says that each stream should produce some `n` samples, where `n` is sampled from a Poisson distribution of rate `rate`. When a stream exceeds its bound, it is deactivated, and a new streamer is activated to fill its place.
+In the above example, each `Streamer` in `streams` can produce infinitely many samples. The `rate=64` argument to `Mux` says that each stream should produce some `n` samples, where `n` is sampled from a Poisson distribution of rate `rate`.
+When a stream exceeds its bound, it is deactivated, and a new streamer is activated to fill its place.

Setting `rate=None` disables the random stream bounding, and `mux()` simply runs each active stream until exhaustion.
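A sketch of this setup, assuming that `noisy_samples` accepts a `sigma` noise-scale parameter and that `Mux` takes the number of simultaneously active streams as its second argument (both are assumptions here):

    from itertools import islice

    import pescador

    # One streamer per noise level, all over the same data
    streams = [pescador.Streamer(noisy_samples, X, Y, sigma=sigma)
               for sigma in [0.5, 1.0, 2.0]]

    # Keep 2 streams active; each active stream yields ~Poisson(rate=64)
    # samples before being deactivated and replaced
    mux = pescador.Mux(streams, 2, rate=64)

    for sample in islice(mux.generate(), 1000):
        pass  # e.g., feed sample['X'], sample['Y'] to SGDClassifier.partial_fit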

7 changes: 5 additions & 2 deletions docs/example3.rst
@@ -3,7 +3,11 @@
Sampling from disk
==================

-A common use case for `pescador` is to sample data from a large collection of existing archives. As a concrete example, consider the problem of fitting a statistical model to a large corpus of musical recordings. When the corpus is sufficiently large, it is impossible to fit the entire set in memory while estimating the model parameters. Instead, one can pre-process each song to store pre-computed features (and, optionally, target labels) in a *numpy zip* `NPZ` archive. The problem then becomes sampling data from a collection of `NPZ` archives.
+A common use case for `pescador` is to sample data from a large collection of existing archives.
+As a concrete example, consider the problem of fitting a statistical model to a large corpus of musical recordings.
+When the corpus is sufficiently large, it is impossible to fit the entire set in memory while estimating the model parameters.
+Instead, one can pre-process each song to store pre-computed features (and, optionally, target labels) in a *numpy zip* `NPZ` archive.
+The problem then becomes sampling data from a collection of `NPZ` archives.

Here, we will assume that the pre-processing has already been done so that each `NPZ` file contains a numpy array of features `X` and labels `Y`.
We will define infinite samplers that pull `n` examples per iterate.
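A sampler along these lines might look like the following sketch (the function name is hypothetical; the `X` and `Y` key names follow the convention stated above):

    import numpy as np

    def npz_sampler(npz_file, n):
        '''Indefinitely yield batches of n random examples from one NPZ archive.'''
        with np.load(npz_file) as data:
            X, Y = data['X'], data['Y']   # loads both arrays into memory
        while True:
            idx = np.random.randint(0, len(X), size=n)
            yield dict(X=X[idx], Y=Y[idx])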
@@ -86,7 +90,6 @@ Alternatively, *memory-mapping* can be used to only load data as needed, but req
yield dict(X=X[idx:idx + n],
Y=Y[idx:idx + n])
# Using this streamer is similar to the first example, but now you need a separate
# NPY file for each X and Y
npy_x_files = #LIST OF PRE-COMPUTED NPY FILES (X)
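For reference, a memory-mapped variant matching the fragment above might be sketched as follows (the `mmap_sampler` name and file arguments are hypothetical):

    import numpy as np

    def mmap_sampler(npy_x, npy_y, n):
        '''Yield contiguous n-example slices; data stays on disk until sliced.'''
        X = np.load(npy_x, mmap_mode='r')   # memory-mapped, not read into RAM
        Y = np.load(npy_y, mmap_mode='r')
        while True:
            idx = np.random.randint(0, len(X) - n)
            yield dict(X=X[idx:idx + n],
                       Y=Y[idx:idx + n])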
2 changes: 2 additions & 0 deletions docs/index.rst
@@ -3,6 +3,8 @@
You can adapt this file completely to your liking, but it should at least
contain the root `toctree` directive.
+.. _pescador:

########
Pescador
########
5 changes: 0 additions & 5 deletions docs/intro.rst
@@ -1,10 +1,5 @@
.. _intro:

-************
-Introduction
-************


Definitions
-----------

