Documentation update for 1.1 [in progress]. #95
Conversation
…ud Maps/Transformers instead of BufferedStreamer
docs/example1.rst
Outdated
Streamers are intended to transparently pass data without modifying them. However, Pescador assumes that Streamers produce output in
a particular format. Specifically, a data is expected to be a python dictionary where each value contains a `np.ndarray`. For an unsupervised learning (e.g., SKLearn/`MiniBatchKMeans`), the data might contain only one
key: `X`. For supervised learning (e.g., SGDClassifier), valid batches must contain both `X` and `Y` keys,
both of equal length.
technically, Streamers place no requirements on the format of the items in their stream; only buffer_stream does this (maybe it should be called buffer_datastream?)
Yes, I had this thought as I was writing it, but at the time I was trying to more or less match up the old statement with how it actually works now.
Is it not worth saying anything about that here? Maybe just in buffer_stream?
`a data` -> `data`
But yeah, I think it's fine to only mention requirements when they're needed. Streamers don't generally care, but buffering does and ZMQ does (ndarrays are required to determine header payloads).
docs/bufferedstreaming.rst
Outdated
.. code-block:: python
   :linenos:

   batch_streamer = pescador.Stream(buffered_sample_gen)
I think you mean Streamer here?
drpity
docs/bufferedstreaming.rst
Outdated
@@ -3,7 +3,7 @@
 Buffered Streaming
 ==================

-In a machine learning setting, it is common to train a model with multiple input datapoints simultaneously, in what are commonly referred to as "minibatches". To achieve this, pescador provides the :ref:`BufferedStreamer`, which will "buffer" your batches into fixed batch sizes.
+In a machine learning setting, it is common to train a model with multiple input datapoints simultaneously, in what are commonly referred to as "minibatches". To achieve this, pescador provides the :ref:`buffer_stream` map transformer, which will "buffer" your batches into fixed batch sizes.
how's: will "buffer" a data stream into fixed batch sizes.
✔️
docs/bufferedstreaming.rst
Outdated
- A consequence of this is that you must make sure that your generators yield batches such that every key contains arrays shaped (N, ...), where N is the number of batches generated.
- :ref:`bufer_stream` will concatenate your arrays, adding a new sample dimension such that the first dimension contains the number of batches (`minibatch_size` in the above example.
unclosed paren; an e.g. might be helpful here? like: if your samples are shaped `(4, 5)`, a batch size of 10 will produce arrays shaped `(10, 4, 5)`
Good call ✅
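[To illustrate the reviewer's suggested example: the shape change from buffering can be sketched with plain numpy, independent of pescador's actual `buffer_stream` implementation. The array contents here are hypothetical.]

```python
import numpy as np

# Hypothetical per-sample arrays, each shaped (4, 5).
samples = [np.random.randn(4, 5) for _ in range(10)]

# Buffering 10 such samples stacks them along a new leading
# batch dimension, yielding an array shaped (10, 4, 5).
batch = np.stack(samples)
print(batch.shape)  # (10, 4, 5)
```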
@@ -37,8 +30,7 @@ We will define infinite samplers that pull `n` examples per iterate.
 yield dict(X=data['X'][idx:idx + n],
Do we want to make mention of memcopies here? i.e. `np.array(data['X'][idx:idx + n])`? This is a chance to educate users on hanging ids and memory leaks.
paging @bmcfee, I forget our consensus on this and unsure where to start digging.
I... would feel okay about someone else dealing with that. I don't think I have a complete enough understanding to write it up right yet.
Ugh. I also don't remember our consensus. But slicing shouldn't invoke a copy, and that's all this example is doing, right? The issue only comes up when there are dangling references to the buffer.
I think it's okay to skip the copy issue here.
I interpret this to mean I don't need to do anything here. Please correct me if I'm misinterpreting.
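[The view-vs-copy distinction discussed in this thread can be checked directly: basic slicing returns a view that keeps the parent buffer alive, while `np.array(...)` forces a copy. A minimal sketch with hypothetical data:]

```python
import numpy as np

data = {'X': np.arange(20).reshape(10, 2)}
idx, n = 2, 3

# Basic slicing returns a *view*: no copy, but it holds a
# reference to the full parent buffer (the "dangling reference" issue).
view = data['X'][idx:idx + n]

# An explicit copy detaches from the parent buffer.
copy = np.array(data['X'][idx:idx + n])

print(np.shares_memory(view, data['X']))  # True
print(np.shares_memory(copy, data['X']))  # False
```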
docs/bufferedstreaming.rst
Outdated
# Generate batches as a streamer:
for batch in batch_streamer:
    # batch['X'].shape == (minibatch_size, ...)
    # batch['Y'].shape == (minibatch_size,)
`(minibatch_size, 1)`?
☑️
docs/example1.rst
Outdated
Streamers are intended to transparently pass data without modifying them. However, Pescador assumes that Streamers produce output in
a particular format. Specifically, a data is expected to be a python dictionary where each value contains a `np.ndarray`. For an unsupervised learning (e.g., SKLearn/`MiniBatchKMeans`), the data might contain only one
key: `X`. For supervised learning (e.g., SGDClassifier), valid data would contain both `X` and `Y` keys, both of equal length.
is SGDClassifier the right point of reference here?
What would you prefer? I agree that is not optimal.
docs/example1.rst
Outdated
batch[Y'] is an `np.ndarray` of shape `(batch_size,)`
sample[Y'] is a scalar `np.ndarray` of shape `(,)`
sample['Y']
fixed
docs/example1.rst
Outdated
`generate()` multiple times on a streamer object is equivalent to restarting the generator, and can therefore
be used to simply implement multiple pass streams. Similarly, because `Streamer` can be serialized, it is
simple to pass a streamer object to a separate process for parallel computation.
Pescador provides the `Streamer` object to circumvent these issues. `Streamer` simply provides an object container for an uninstantiated generator (and its parameters), and an access method `generate()`. Calling `generate()` multiple times on a `Streamer` object is equivalent to restarting the generator, and can therefore be used to simply implement multiple pass streams. Similarly, because `Streamer` can be serialized, it is simple to pass a streamer object to a separate process for parallel computation.
`Streamer` object -> `Streamer` class
☑️
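[The restart semantics described in the quoted text can be sketched with a toy container in plain Python. This is an illustration of the idea only, not pescador's actual `Streamer` implementation; the class and function names are hypothetical.]

```python
def count_up(n):
    """A generator function that yields 0..n-1."""
    for i in range(n):
        yield i

class ToyStreamer:
    """Toy container for an uninstantiated generator and its parameters."""
    def __init__(self, gen_func, *args, **kwargs):
        self.gen_func = gen_func
        self.args = args
        self.kwargs = kwargs

    def generate(self):
        # Each call re-instantiates the generator, i.e. restarts it
        # from the beginning rather than resuming an exhausted one.
        return self.gen_func(*self.args, **self.kwargs)

stream = ToyStreamer(count_up, 3)
print(list(stream.generate()))  # [0, 1, 2]
print(list(stream.generate()))  # [0, 1, 2] -- restarted, not exhausted
```

Because the container stores only the function and its arguments (not a live generator), it also stays serializable, which is what makes handing it to a separate process straightforward.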
estimator.partial_fit(batch['X'], batch['Y'], classes=classes)
# Fit the model to the stream, use at most 5000 samples
for sample in mux_stream(max_iter=5000):
    estimator.partial_fit(sample['X'], sample['Y'], classes=classes)
Does this example actually work? I think you need a buffer in here for the indexing to be correct.
No... but that wasn't the reason.
(fixed now)
docs/index.rst
Outdated
Multiplexing Data Streams
-------------------------
1. Pescador defines an object called a `Mux` for the purposes of stochastically multiplexing streams of data.
not necessarily stochastic going forward
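[For readers unfamiliar with the term: stochastic multiplexing just means that each emitted item is drawn from a randomly chosen source stream. A minimal toy sketch (not pescador's `Mux`; all names here are hypothetical):]

```python
import random

def toy_mux(streams, rng, max_iter):
    """Toy stochastic multiplexer: at each step, draw the next
    item from a randomly selected source stream."""
    iterators = [iter(s) for s in streams]
    for _ in range(max_iter):
        yield next(rng.choice(iterators))

# Two hypothetical labeled source streams.
a = (('a', i) for i in range(100))
b = (('b', i) for i in range(100))

rng = random.Random(0)  # seeded for reproducibility
out = list(toy_mux([a, b], rng, max_iter=6))
print(len(out))  # 6
```

As the comment notes, a real `Mux` need not be purely stochastic: the same interleaving loop could select streams round-robin or by weights instead of uniformly at random.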
pescador/buffered.py
Outdated
-"""Buffers a stream into batches of examples
+"""Deprecated in 1.1. Will be removed in 2.0.
+
+Buffers a stream into batches of examples.
Should this be a `.. warning::`? Or do we want to use something fancy like a deprecation decorator?
pescador/maps.py
Outdated
Important note: map functions return a *generator*, not another streamer
Streamer, so if you need it to behave like a Streamer, you have to wrap
the function in a streamer again.
streamer Streamer
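[The generator-vs-streamer distinction in the quoted note can be sketched in plain Python: a map transformer is just a generator function over another stream, so its output is one-shot until you wrap it back in a restartable container. All names below are hypothetical stand-ins, not pescador's API.]

```python
def tuples_to_dicts(stream):
    """Toy 'map' transformer: a plain generator over another stream."""
    for x, y in stream:
        yield {'X': x, 'Y': y}

class ToyStreamer:
    """Minimal re-wrapping container (not pescador's actual Streamer)."""
    def __init__(self, gen_func, *args):
        self.gen_func = gen_func
        self.args = args

    def __iter__(self):
        # Re-instantiate the generator on every iteration pass.
        return self.gen_func(*self.args)

pairs = [(1, 2), (3, 4)]

# tuples_to_dicts(pairs) alone is a one-shot generator; wrapping the
# *function* (not its output) restores restartable, streamer-like behavior:
stream = ToyStreamer(tuples_to_dicts, pairs)
print(list(stream))  # [{'X': 1, 'Y': 2}, {'X': 3, 'Y': 4}]
print(list(stream))  # same again -- iteration restarts
```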
A few comments below, but looking good!
Nevermind, I took the liberty of pushing it up. And fixing most of the other build issues. The
Okay - that should be all the requested changes. Are we done here?
It looks like there are still several unresolved comments in my review -- can you take a look through them?
I think I got everything
This PR (when complete) will finish the documentation fixes for #83 and #91.
TODO: