Documentation update for 1.1 [in progress]. #95
Conversation
…ud Maps/Transformers instead of BufferedStreamer
docs/example1.rst
Outdated
Streamers are intended to transparently pass data without modifying them. However, Pescador assumes that Streamers produce output in
a particular format. Specifically, a data is expected to be a python dictionary where each value contains a `np.ndarray`. For an unsupervised learning (e.g., SKLearn/`MiniBatchKMeans`), the data might contain only one
key: `X`. For supervised learning (e.g., SGDClassifier), valid batches must contain both `X` and `Y` keys,
both of equal length.
technically, Streamers place no requirements on the format of the items in their stream; only buffer_stream does this (maybe it should be called buffer_datastream?)
Yes, I had this thought as I was writing it, but at the time I was trying to more or less match up the old statement with how it actually works now.
Is it not worth saying anything about that here? Maybe just in buffer_stream?
`a data` -> `data`
But yeah, I think it's fine to only mention requirements when they're needed. Streamers don't generally care, but buffering does and ZMQ does (ndarrays are required to determine header payloads).
docs/bufferedstreaming.rst
Outdated
.. code-block:: python
   :linenos:

   batch_streamer = pescador.Stream(buffered_sample_gen)
I think you mean Streamer here?
drpity
docs/bufferedstreaming.rst
Outdated
@@ -3,7 +3,7 @@
 Buffered Streaming
 ==================

-In a machine learning setting, it is common to train a model with multiple input datapoints simultaneously, in what are commonly referred to as "minibatches". To achieve this, pescador provides the :ref:`BufferedStreamer`, which will "buffer" your batches into fixed batch sizes.
+In a machine learning setting, it is common to train a model with multiple input datapoints simultaneously, in what are commonly referred to as "minibatches". To achieve this, pescador provides the :ref:`buffer_stream` map transformer, which will "buffer" your batches into fixed batch sizes.
how's: will "buffer" a data stream into fixed batch sizes.
✔️
docs/bufferedstreaming.rst
Outdated
- A consequence of this is that you must make sure that your generators yield batches such that every key contains arrays shaped (N, ...), where N is the number of batches generated.
- :ref:`bufer_stream` will concatenate your arrays, adding a new sample dimension such that the first dimension contains the number of batches (`minibatch_size` in the above example.
unclosed paren; an e.g. might be helpful here? like: if your samples are shaped `(4, 5)`, a batch size of 10 will produce arrays shaped `(10, 4, 5)`
Good call ✅
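[To illustrate the reviewer's suggested example: the shape change from buffering can be sketched with plain numpy, independent of pescador's actual `buffer_stream` implementation. The array contents here are hypothetical.]

```python
import numpy as np

# Hypothetical per-sample arrays, each shaped (4, 5).
samples = [np.random.randn(4, 5) for _ in range(10)]

# Buffering 10 such samples stacks them along a new leading
# batch dimension, yielding an array shaped (10, 4, 5).
batch = np.stack(samples)
print(batch.shape)  # (10, 4, 5)
```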
@@ -37,8 +30,7 @@ We will define infinite samplers that pull `n` examples per iterate.
 yield dict(X=data['X'][idx:idx + n],
Do we want to make mention of memcopies here? i.e. `np.array(data['X'][idx:idx + n])`? This is a chance to educate users on hanging ids and memory leaks.
paging @bmcfee, I forget our consensus on this and unsure where to start digging.
I... would feel okay about someone else dealing with that. I don't think I have a complete enough understanding to write it up right yet.
Ugh. I also don't remember our consensus. But slicing shouldn't invoke a copy, and that's all this example is doing, right? The issue only comes up when there are dangling references to the buffer.
I think it's okay to skip the copy issue here.
I interpret this to mean I don't need to do anything here. Please correct me if I'm misinterpreting.
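[The view-vs-copy distinction discussed in this thread can be checked directly: basic slicing returns a view that keeps the parent buffer alive, while `np.array(...)` forces a copy. A minimal sketch with hypothetical data:]

```python
import numpy as np

data = {'X': np.arange(20).reshape(10, 2)}
idx, n = 2, 3

# Basic slicing returns a *view*: no copy, but it holds a
# reference to the full parent buffer (the "dangling reference" issue).
view = data['X'][idx:idx + n]

# An explicit copy detaches from the parent buffer.
copy = np.array(data['X'][idx:idx + n])

print(np.shares_memory(view, data['X']))  # True
print(np.shares_memory(copy, data['X']))  # False
```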
docs/bufferedstreaming.rst
Outdated
# Generate batches as a streamer:
for batch in batch_streamer:
    # batch['X'].shape == (minibatch_size, ...)
    # batch['Y'].shape == (minibatch_size,)
`(minibatch_size, 1)`?
☑️
docs/example1.rst
Outdated
Streamers are intended to transparently pass data without modifying them. However, Pescador assumes that Streamers produce output in
a particular format. Specifically, a data is expected to be a python dictionary where each value contains a `np.ndarray`. For an unsupervised learning (e.g., SKLearn/`MiniBatchKMeans`), the data might contain only one
key: `X`. For supervised learning (e.g., SGDClassifier), valid data would contain both `X` and `Y` keys, both of equal length.
is SGDClassifier the right point of reference here?
What would you prefer? I agree that is not optimal.
docs/example1.rst
Outdated
batch[Y'] is an `np.ndarray` of shape `(batch_size,)`
sample[Y'] is a scalar `np.ndarray` of shape `(,)`
sample['Y']
fixed
docs/example1.rst
Outdated
`generate()` multiple times on a streamer object is equivalent to restarting the generator, and can therefore
be used to simply implement multiple pass streams. Similarly, because `Streamer` can be serialized, it is
simple to pass a streamer object to a separate process for parallel computation.
Pescador provides the `Streamer` object to circumvent these issues. `Streamer` simply provides an object container for an uninstantiated generator (and its parameters), and an access method `generate()`. Calling `generate()` multiple times on a `Streamer` object is equivalent to restarting the generator, and can therefore be used to simply implement multiple pass streams. Similarly, because `Streamer` can be serialized, it is simple to pass a streamer object to a separate process for parallel computation.
`Streamer` object -> `Streamer` class
☑️
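[The restart semantics described in the quoted text can be sketched with a toy container in plain Python. This is an illustration of the idea only, not pescador's actual `Streamer` implementation; the class and function names are hypothetical.]

```python
def count_up(n):
    """A generator function that yields 0..n-1."""
    for i in range(n):
        yield i

class ToyStreamer:
    """Toy container for an uninstantiated generator and its parameters."""
    def __init__(self, gen_func, *args, **kwargs):
        self.gen_func = gen_func
        self.args = args
        self.kwargs = kwargs

    def generate(self):
        # Each call re-instantiates the generator, i.e. restarts it
        # from the beginning rather than resuming an exhausted one.
        return self.gen_func(*self.args, **self.kwargs)

stream = ToyStreamer(count_up, 3)
print(list(stream.generate()))  # [0, 1, 2]
print(list(stream.generate()))  # [0, 1, 2] -- restarted, not exhausted
```

Because the container stores only the function and its arguments (not a live generator), it also stays serializable, which is what makes handing it to a separate process straightforward.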
estimator.partial_fit(batch['X'], batch['Y'], classes=classes)
# Fit the model to the stream, use at most 5000 samples
for sample in mux_stream(max_iter=5000):
    estimator.partial_fit(sample['X'], sample['Y'], classes=classes)
Does this example actually work? I think you need a buffer in here for the indexing to be correct.
No... but that wasn't the reason.
(fixed now)
docs/index.rst
Outdated
Multiplexing Data Streams
-------------------------
1. Pescador defines an object called a `Mux` for the purposes of stochastically multiplexing streams of data.
not necessarily stochastic going forward
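[For readers unfamiliar with the term: stochastic multiplexing just means that each emitted item is drawn from a randomly chosen source stream. A minimal toy sketch (not pescador's `Mux`; all names here are hypothetical):]

```python
import random

def toy_mux(streams, rng, max_iter):
    """Toy stochastic multiplexer: at each step, draw the next
    item from a randomly selected source stream."""
    iterators = [iter(s) for s in streams]
    for _ in range(max_iter):
        yield next(rng.choice(iterators))

# Two hypothetical labeled source streams.
a = (('a', i) for i in range(100))
b = (('b', i) for i in range(100))

rng = random.Random(0)  # seeded for reproducibility
out = list(toy_mux([a, b], rng, max_iter=6))
print(len(out))  # 6
```

As the comment notes, a real `Mux` need not be purely stochastic: the same interleaving loop could select streams round-robin or by weights instead of uniformly at random.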
pescador/buffered.py
Outdated
-"""Buffers a stream into batches of examples
+"""Deprecated in 1.1. Will be removed in 2.0.
+
+Buffers a stream into batches of examples.
Should this be a `.. warning::`? Or do we want to use something fancy like a deprecation decorator?
pescador/maps.py
Outdated
Important note: map functions return a *generator*, not another streamer
Streamer, so if you need it to behave like a Streamer, you have to wrap
the function in a streamer again.
streamer Streamer
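[The generator-vs-streamer distinction in the quoted note can be sketched in plain Python: a map transformer is just a generator function over another stream, so its output is one-shot until you wrap it back in a restartable container. All names below are hypothetical stand-ins, not pescador's API.]

```python
def tuples_to_dicts(stream):
    """Toy 'map' transformer: a plain generator over another stream."""
    for x, y in stream:
        yield {'X': x, 'Y': y}

class ToyStreamer:
    """Minimal re-wrapping container (not pescador's actual Streamer)."""
    def __init__(self, gen_func, *args):
        self.gen_func = gen_func
        self.args = args

    def __iter__(self):
        # Re-instantiate the generator on every iteration pass.
        return self.gen_func(*self.args)

pairs = [(1, 2), (3, 4)]

# tuples_to_dicts(pairs) alone is a one-shot generator; wrapping the
# *function* (not its output) restores restartable, streamer-like behavior:
stream = ToyStreamer(tuples_to_dicts, pairs)
print(list(stream))  # [{'X': 1, 'Y': 2}, {'X': 3, 'Y': 4}]
print(list(stream))  # same again -- iteration restarts
```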
A few comments below, but looking good!
Nevermind, I took the liberty of pushing it up. And fixing most of the other build issues. The
Okay - that should be all the requested changes. Are we done here?
It looks like there are still several unresolved comments in my review -- can you take a look through them?
I think I got everything
This PR (when complete) will finish the documentation fixes for #83 and #91.
TODO: