
Documentation update for 1.1 [in progress]. #95

Merged
merged 7 commits into from Jul 5, 2017

Conversation

@cjacoby (Collaborator) commented May 6, 2017

This PR (when complete) will finish the documentation fixes for #83 and #91.

TODO:

  • Fix/update semantics to match the discussion / decisions in Language conventions #75
    • samples vs batches
    • iterators, iterables, and generators
  • BufferedStreamer => buffer_stream
  • buffer_batch => buffer_stream
  • New buffer_stream behavior (vstack, adding dimension)
  • Minor mux semantics changes
  • pescador.maps
  • Changes from ZMQStreamer default time #91 for ZMQ

@cjacoby cjacoby added this to the 1.1.0 milestone May 6, 2017
@cjacoby cjacoby self-assigned this May 6, 2017
@cjacoby (Collaborator, Author) commented Jun 29, 2017

@bmcfee - I'd appreciate a look-over. More or less done, except for the ZMQ thing from #91. I can't get the ZMQ portion of the API Docs to generate on my machine, but there is no obvious reason why.

Streamers are intended to transparently pass data without modifying them. However, Pescador assumes that Streamers produce output in a particular format. Specifically, a data is expected to be a python dictionary where each value contains a `np.ndarray`. For an unsupervised learning (e.g., SKLearn/`MiniBatchKMeans`), the data might contain only one key: `X`. For supervised learning (e.g., SGDClassifier), valid batches must contain both `X` and `Y` keys, both of equal length.
Collaborator:

technically, Streamers place no requirements on the format of the items in their streams; only buffer_stream does this (maybe it should be called buffer_datastream)

Collaborator Author:

Yes, I had this thought as I was writing it, but at the time I was trying to more or less match up the old statement with how it actually works now.

Is it not worth saying anything about that here? Maybe just in buffer_stream?

Collaborator:

a data -> data

But yeah, I think it's fine to only mention requirements when they're needed. Streamers don't generally care, but buffering does and ZMQ does (ndarrays are required to determine header payloads).

.. code-block:: python
:linenos:

batch_streamer = pescador.Stream(buffered_sample_gen)
Collaborator:

I think you mean Streamer here?

Collaborator Author:

drpity

@@ -3,7 +3,7 @@
Buffered Streaming
==================

In a machine learning setting, it is common to train a model with multiple input datapoints simultaneously, in what are commonly referred to as "minibatches". To achieve this, pescador provides the :ref:`BufferedStreamer`, which will "buffer" your batches into fixed batch sizes.
In a machine learning setting, it is common to train a model with multiple input datapoints simultaneously, in what are commonly referred to as "minibatches". To achieve this, pescador provides the :ref:`buffer_stream` map transformer, which will "buffer" your batches into fixed batch sizes.
Collaborator:

how's

will "buffer" a data stream into fixed batch sizes.

Collaborator Author:

✔️
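The buffering behavior discussed above can be sketched in plain Python. This is an illustrative approximation only, not pescador's actual `buffer_stream` implementation; the function name is reused purely for clarity:

```python
import numpy as np

def buffer_stream(stream, buffer_size):
    # Sketch: collect `buffer_size` samples from a stream, then stack
    # each key's arrays along a new leading batch dimension.
    buf = []
    for sample in stream:
        buf.append(sample)
        if len(buf) == buffer_size:
            yield {k: np.stack([s[k] for s in buf]) for k in buf[0]}
            buf = []

# Samples shaped (4, 5) buffered with size 10 yield arrays shaped (10, 4, 5).
samples = ({'X': np.zeros((4, 5))} for _ in range(20))
batch = next(buffer_stream(samples, 10))
```

This matches the reviewer's example below: per-sample shape `(4, 5)` with a buffer of 10 produces `(10, 4, 5)`.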


- A consequence of this is that you must make sure that your generators yield batches such that every key contains arrays shaped (N, ...), where N is the number of batches generated.
- :ref:`bufer_stream` will concatenate your arrays, adding a new sample dimension such that the first dimension contains the number of batches (`minibatch_size` in the above example.
Collaborator:

unclosed paren; an e.g. might be helpful here? like

if your samples are shaped (4, 5), a batch size of 10 will produce arrays shaped (10, 4, 5)

Collaborator Author:

Good call ✅

@@ -37,8 +30,7 @@ We will define infinite samplers that pull `n` examples per iterate.
yield dict(X=data['X'][idx:idx + n],
Collaborator:

Do we want to make mention of memcopies here? i.e. np.array(data['X'][idx:idx + n])? This is a chance to educate users on hanging ids and memory leaks.

paging @bmcfee, I forget our consensus on this and unsure where to start digging.

Collaborator Author:

I... would feel okay about someone else dealing with that. I don't think I have a complete enough understanding to write it up right yet.

Collaborator:

Ugh. I also don't remember our consensus. But slicing shouldn't invoke a copy, and that's all this example is doing, right? The issue only comes up when there are dangling references to the buffer.

I think it's okay to skip the copy issue here.

Collaborator Author:

I interpret this to mean I don't need to do anything here. Please correct me if I'm misinterpreting.
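For context, the slicing pattern under discussion looks roughly like this (a hypothetical reconstruction, since only one line of the hunk is quoted). As noted above, NumPy basic slicing returns a view rather than a copy:

```python
import numpy as np

def sampler(data, n):
    # Infinite sampler: pull `n` consecutive examples per iterate.
    # The slice data['X'][idx:idx + n] is a view into the underlying
    # buffer, not a copy -- which is what the memcopy discussion is about.
    while True:
        idx = np.random.randint(0, len(data['X']) - n + 1)
        yield dict(X=data['X'][idx:idx + n],
                   Y=data['Y'][idx:idx + n])

data = {'X': np.arange(40).reshape(20, 2), 'Y': np.arange(20)}
sample = next(sampler(data, n=4))
```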

# Generate batches as a streamer:
for batch in batch_streamer:
# batch['X'].shape == (minibatch_size, ...)
# batch['Y'].shape == (minibatch_size,)
Collaborator:

(minibatch_size, 1)?

Collaborator Author:

☑️
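On the `(minibatch_size,)` vs `(minibatch_size, 1)` question: which shape comes out depends on the per-sample shape of `Y`, assuming buffering stacks samples along a new leading axis. A quick NumPy check:

```python
import numpy as np

# Scalar labels (per-sample shape ()) stack to a 1-D batch of shape (n,):
scalar_batch = np.stack([np.asarray(1.0) for _ in range(8)])

# Length-1 vector labels (per-sample shape (1,)) stack to shape (n, 1):
vector_batch = np.stack([np.zeros(1) for _ in range(8)])
```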

Streamers are intended to transparently pass data without modifying them. However, Pescador assumes that Streamers produce output in a particular format. Specifically, a data is expected to be a python dictionary where each value contains a `np.ndarray`. For an unsupervised learning (e.g., SKLearn/`MiniBatchKMeans`), the data might contain only one key: `X`. For supervised learning (e.g., SGDClassifier), valid data would contain both `X` and `Y` keys, both of equal length.
Collaborator:

is SGDClassifier the right point of reference here?

Collaborator Author:

What would you prefer? I agree that is not optimal.


batch[Y'] is an `np.ndarray` of shape `(batch_size,)`
sample[Y'] is a scalar `np.ndarray` of shape `(,)`
Collaborator:

sample['Y']

Collaborator Author:

fixed

Pescador provides the `Streamer` object to circumvent these issues. `Streamer` simply provides an object container for an uninstantiated generator (and its parameters), and an access method `generate()`. Calling `generate()` multiple times on a `Streamer` object is equivalent to restarting the generator, and can therefore be used to simply implement multiple pass streams. Similarly, because `Streamer` can be serialized, it is simple to pass a streamer object to a separate process for parallel computation.
Collaborator:

Streamer object -> Streamer class

Collaborator Author:

☑️
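The idea in the quoted paragraph can be illustrated with a toy container. This is a minimal sketch of the concept, not pescador's actual `Streamer` class:

```python
def counter(n):
    for i in range(n):
        yield i

class Streamer:
    # Sketch: store the generator function and its arguments, so a fresh
    # generator can be instantiated on every pass -- and the object itself
    # remains serializable, unlike an already-running generator.
    def __init__(self, func, *args, **kwargs):
        self.func, self.args, self.kwargs = func, args, kwargs

    def __iter__(self):
        return self.func(*self.args, **self.kwargs)

s = Streamer(counter, 3)
first_pass = list(s)
second_pass = list(s)  # restarts cleanly; a bare generator would be exhausted
```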

estimator.partial_fit(batch['X'], batch['Y'], classes=classes)
# Fit the model to the stream, use at most 5000 samples
for sample in mux_stream(max_iter=5000):
estimator.partial_fit(sample['X'], sample['Y'], classes=classes)
Collaborator:

Does this example actually work? I think you need a buffer in here for the indexing to be correct.

Collaborator Author:

No... but that wasn't the reason.

Collaborator Author:

(fixed now)
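The `max_iter=5000` cap in the quoted example can be mimicked with stdlib `itertools.islice`; the infinite `stream` generator here is hypothetical, standing in for `mux_stream`:

```python
from itertools import islice

def stream():
    # Hypothetical infinite stream of samples.
    i = 0
    while True:
        yield {'X': [i], 'Y': i % 2}
        i += 1

# Equivalent in spirit to `for sample in mux_stream(max_iter=5)`:
labels = [sample['Y'] for sample in islice(stream(), 5)]
```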

docs/index.rst Outdated

Multiplexing Data Streams
-------------------------
1. Pescador defines an object called a `Mux` for the purposes of stochastically multiplexing streams of data.
Collaborator:

not necessarily stochastic going forward
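For intuition, stochastic multiplexing can be sketched as choosing randomly among active streams at each step. This toy version is not pescador's `Mux`:

```python
import itertools
import random

def mux(streams, n_samples, seed=0):
    # Toy stochastic multiplexer: at each step, draw the next item
    # from one randomly chosen active stream.
    rng = random.Random(seed)
    active = [iter(s) for s in streams]
    for _ in range(n_samples):
        yield next(rng.choice(active))

streams = [itertools.repeat('a'), itertools.repeat('b')]
out = list(mux(streams, 6))
```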

"""Buffers a stream into batches of examples
"""Deprecated in 1.1. Will be removed in 2.0.

Buffers a stream into batches of examples.
Collaborator:

Should this be a `.. warning::`? Or do we want to use something fancy like a deprecation decorator?
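A deprecation decorator along the lines suggested could look like the sketch below; the names `deprecated` and the `buffer_batch` stub are illustrative, not pescador's actual API:

```python
import functools
import warnings

def deprecated(version, removal):
    # Sketch of a deprecation decorator: emit a DeprecationWarning
    # whenever the wrapped function is called.
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            warnings.warn(
                '{} is deprecated in {} and will be removed in {}.'.format(
                    func.__name__, version, removal),
                DeprecationWarning, stacklevel=2)
            return func(*args, **kwargs)
        return wrapper
    return decorator

@deprecated('1.1', '2.0')
def buffer_batch(stream, size):
    pass
```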

pescador/maps.py Outdated

Important note: map functions return a *generator*, not another streamer
Streamer, so if you need it to behave like a Streamer, you have to wrap
the function in a streamer again.
Collaborator:

streamer Streamer

@bmcfee (Collaborator) left a comment:

A few comments below, but looking good!

@bmcfee (Collaborator) commented Jun 30, 2017

I can't get the ZMQ portion of the API Docs to generate on my machine, but there is no obvious reason why.

I fixed this locally -- any objection to me pushing up some changes to fix the build?

Nevermind, I took the liberty of pushing it up. And fixing most of the other build issues. The :ref:`buffer_stream` bits still don't work for some reason. Chalk it up to sphinxiness.

@cjacoby (Collaborator, Author) commented Jul 1, 2017

Okay - that should be all the requested changes. Are we done here?

@bmcfee (Collaborator) commented Jul 5, 2017

Okay - that should be all the requested changes. Are we done here?

It looks like there are still several unresolved comments in my review -- can you take a look through them?

@cjacoby (Collaborator, Author) left a comment:

I think I got everything


@cjacoby cjacoby merged commit f56d23f into master Jul 5, 2017
@cjacoby cjacoby deleted the docs/1.1 branch July 5, 2017 19:23