Transformers should have batch- and example-specific methods #27

Closed
bartvm opened this issue Feb 27, 2015 · 11 comments
Comments

@bartvm
Member

bartvm commented Feb 27, 2015

I was just wondering whether we should make this a kind of policy: It's okay (and expected) for transformers to only act on examples, not on batches. There are basically two arguments that I can think of:

Pro

It makes code a lot simpler. This is n-grams for batches (and it's actually still not complete):

        # Accumulate n-gram features and targets until the request is filled.
        features, targets = [], []
        for _, sentence in enumerate(self.cache[0]):
            features.append(list(
                sliding_window(self.ngram_order,
                               sentence[:-1]))[:request - len(features)])
            targets.append(
                sentence[self.ngram_order:][:request - len(targets)])
            self.cache[0][0] = self.cache[0][0][request:]
            if not self.cache[0][0]:
                self.cache[0].pop(0)
                if not self.cache[0]:
                    self._cache()
            if len(features) == request:
                break
        return tuple(numpy.asarray(data) for data in (features, targets))

and this is it for examples:

        while not self.index < len(self.sentence) - self.ngram_order:
            self.sentence, = next(self.child_epoch_iterator)
            self.index = 0
        ngram = self.sentence[self.index:self.index + self.ngram_order]
        target = self.sentence[self.index + self.ngram_order]
        self.index += 1
        return (ngram, target)

If NGramStream had to deal with both batches and examples, the code would be very long for such a simple operation. This goes for many, many cases. Hence, I'd prefer transformers to work on examples, and expect the user to add a BatchStream at the end.

Con

Speed. Performing operations on batches can often be faster.

My take on it is that we can aim at one of two things:

We can try to make Fuel as efficient as possible. That means quite a bit of code to make sure that we handle large batches efficiently, and it might limit our ability to easily add new transformers (because they need all this logic coded up).

Alternatively, we can just say that our primary goal is the easy creation of processing pipelines. We will care more about prototyping, e.g. testing dozens of different combinations of transformers to see which one works best, and about making it very easy to add new ones. This means that the pipelines might not be as fast as they could be, but I think (hope) not so slow as to be prohibitive. Once you have found your optimal pre-processing pipeline and really need the speed, it should be easy to code up a single, specialized transformer that does everything you want more efficiently on batches/in Cython/using the GPU/etc.

@vdumoulin
Contributor

I think we should benchmark what performance hit we're looking at if we choose to do examplewise preprocessing.

It probably won't make much difference for large models (especially if we have good multithreading support to do the preprocessing in parallel), but for small models it may have a big impact.

It's not clear to me yet why batchwise and examplewise preprocessing should be mutually exclusive. I haven't looked at the code long enough to get a good high-level feel of how things fit together, so the following suggestion may not be applicable, but would it be possible to require that both batchwise and examplewise preprocessing are supported and have a default batchwise implementation that simply concatenates a bunch of examplewise calls?

@bartvm
Member Author

bartvm commented Feb 27, 2015

They're not mutually exclusive per se, although there is no good way of checking whether you received a single example or a batch right now, besides just checking the shape of things or something. This can get quite messy: you end up with code like "if it's a list but the first element is also a list, then I'm going to assume it's a batch", while some transformers should in principle work for lists, tuples, NumPy arrays, etc.

I'm not too keen on the idea of hard-coding an is_batch flag, although that would make it easier to implement transformers that deal with both. Transformers could then have a get_example and a get_batch method instead of the current get_data, and get_batch would default to get_example(example) for example in batch.
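
Roughly something like this (just a sketch to illustrate the default; the class and argument names are made up, not the actual API):

    class Transformer(object):
        """Hypothetical sketch: example-wise transformers only override
        get_example; get_batch falls back to looping over the batch."""
        def __init__(self, data_stream):
            self.data_stream = data_stream

        def get_example(self, example):
            raise NotImplementedError

        def get_batch(self, batch):
            # Default batch behaviour: apply the example-wise method to
            # each example and return the transformed examples as a list.
            return [self.get_example(example) for example in batch]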

My current proposal is simply to make most transformers example-only by default. For cases where the speed-up is significant and the demand is high, we could implement a second, batch-wise version, e.g. a Whiten transformer as well as a BatchWhiten transformer.

@vdumoulin
Contributor

I still need to read the code more carefully, but I think I understand what you're getting at.

Depending on the number of useful batch transformations, we may end up having lots of Transform/BatchTransform pairs, though.

@bartvm
Member Author

bartvm commented Feb 27, 2015

Mm, rather than having separate transformers, or automatically trying to deduce whether something is a batch, maybe we can introduce a flag batch=True which transformers can optionally support? Transformers that don't support it just act on examples, and those that do support it implement two methods, and use one or the other based on the value of the batch flag.

@vdumoulin
Contributor

That would seem reasonable to me.

@rizar
Contributor

rizar commented Feb 27, 2015

I fully agree that processing example-wise should be the predominant way of writing transformers. That will save lots of time for people writing and using them.

The idea of an optionally supported "batch mode" that can be switched on seems very reasonable.

@bartvm
Member Author

bartvm commented Feb 27, 2015

So here's an idea in slightly more detail:

  • get_data becomes get_example and get_batch
  • Each transformer takes a keyword argument batch which defaults to False. A transformer which only supports batches sets batch = True as a class attribute.
  • If a transformer doesn't have a get_batch method, but batch=True was passed, no child_epoch_iterator will be set by the get_epoch_iterator method. Instead, the DataIterator will call batch = next(self.data_stream.data_stream) to retrieve the next batch and set self.data_stream.child_epoch_iterator to iter_(batch), iterating over the examples. It will then return [self.data_stream.get_example() for _ in range(len(batch))]. (A simplified sketch of this fallback follows at the end of this comment.)

This has the following limitations, but they seem sensible:

  • An example-transformer can't be applied to a batch if it needs an iteration scheme (because it's not clear whether each request applies to the entire batch, or if there should be one per example).
  • NumPy ndarrays will end up being converted to lists of arrays. I'm not sure whether to special-case this (just calling numpy.asarray on batches that came in as ndarrays), or to just expect the user to add a kind of AsNumpyArray transformer at the end.
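
To make the third point a bit more concrete, here is a simplified, hypothetical sketch of that fallback (single source assumed; the names are illustrative, not the actual implementation):

    class DataIterator(object):
        def __init__(self, transformer, wrapped_stream_iterator):
            # `transformer` only implements get_example(), which pulls one
            # example from transformer.child_epoch_iterator and transforms it.
            self.transformer = transformer
            self.wrapped_stream_iterator = wrapped_stream_iterator

        def __next__(self):
            # Fallback when a batch is requested but the transformer is
            # example-only: fetch a batch from the wrapped stream, iterate
            # over its examples, and reassemble the transformed examples.
            batch = next(self.wrapped_stream_iterator)
            self.transformer.child_epoch_iterator = iter(batch)
            return [self.transformer.get_example() for _ in range(len(batch))]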

@bartvm
Member Author

bartvm commented Feb 28, 2015

Thought about it a bit more, and I'm now wondering whether we should try to handle batches intelligently at all. I can think of quite a few issues:

  • Imagine a transformer which filters examples (rejecting them based on some sort of criterion). If we feed it a batch, should the size of the batch be maintained? Or should it just filter the given batch? And if so, what do we do if it filters out every example in the batch?
  • Likewise for padding: it doesn't make sense to apply the Padding stream example-wise (see the sketch below).
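
For instance, padding needs to see the whole batch to know how long the padded sequences should be. A toy illustration (not Fuel's actual Padding transformer):

    import numpy

    def pad_batch(sequences):
        # Toy example: pad variable-length sequences to the length of the
        # longest one and return a mask. The maximum length depends on the
        # whole batch, so it can't be computed one example at a time.
        max_length = max(len(sequence) for sequence in sequences)
        padded = numpy.zeros((len(sequences), max_length))
        mask = numpy.zeros((len(sequences), max_length))
        for i, sequence in enumerate(sequences):
            padded[i, :len(sequence)] = sequence
            mask[i, :len(sequence)] = 1
        return padded, mask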

So perhaps the simplest solution is the best: Transformers can implement two methods (get_example and get_batch). Which one is used depends on the default of the Transformer or, in case both are supported, on whether the batch=True flag is set.
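
In code, the dispatch could look roughly like this (a sketch of the idea only, not the actual implementation):

    class Transformer(object):
        # Default mode; a batch-only transformer would set batch = True,
        # and a transformer supporting both accepts batch as a keyword.
        batch = False

        def __init__(self, data_stream, batch=None):
            self.data_stream = data_stream
            if batch is not None:
                self.batch = batch

        def get_data(self, request=None):
            # Dispatch to whichever method the subclass implements; raise
            # if it doesn't support the requested mode.
            if self.batch:
                if not hasattr(self, 'get_batch'):
                    raise ValueError("this transformer only works example-wise")
                return self.get_batch(request)
            if not hasattr(self, 'get_example'):
                raise ValueError("this transformer only works batch-wise")
            return self.get_example(request)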

@rizar
Contributor

rizar commented Mar 2, 2015

In the first case I would not support batch input at all. I think it is okay for some transformers to be example-only or batch-only, like your second example.

Your final proposal sounds good.

@bartvm bartvm added the CCW label Mar 2, 2015
@bartvm bartvm changed the title from "Transformers (wrappers) working on batches or examples?" to "Transformers should have batch- and example-specific methods" Mar 2, 2015
@bartvm bartvm mentioned this issue Mar 2, 2015
@bartvm
Member Author

bartvm commented Mar 8, 2015

Being addressed in #40

@bartvm
Member Author

bartvm commented Mar 28, 2015

Closed via #45 (rebase of #40).

@bartvm bartvm closed this as completed Mar 28, 2015