added sort stream #336

Merged: 11 commits, merged into mila-iqia:master on Feb 25, 2015

Conversation

@pbrakel (Contributor) commented Feb 23, 2015

Any suggestions for additional tests are welcome.

@bartvm (Member) commented Feb 23, 2015

Any chance you could add support for the reverse keyword? The problem is that lambda functions can't be pickled, so something like lambda x: -x[0] won't work when using checkpointing. Instead, it would be nice if the user could use key=operator.itemgetter(0), reverse=True.
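(For a quick illustration of the pickling issue, here is a minimal sketch using the standard pickle and operator modules; the data is made up, and the itemgetter round trip assumes a reasonably recent Python 3.)

import operator
import pickle

# A lambda cannot be pickled, which is what breaks checkpointing.
try:
    pickle.dumps(lambda x: -x[0])
except (pickle.PicklingError, AttributeError, TypeError) as error:
    print("lambda does not pickle:", error)

# operator.itemgetter(0) is a named, picklable callable, so it survives
# a pickle round trip and therefore checkpointing.
key = pickle.loads(pickle.dumps(operator.itemgetter(0)))
print(key((3, 'a')))  # -> 3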

"""
def __init__(self, data_stream, key=None):

def mapping(x):
bartvm (Member) commented on the diff above:

Mm, nested functions aren't supported by pickle either (sorry, it's quite a nightmare sometimes).

What you could do is to create a callable Mapping class instead, which takes key and reverse as arguments to its __init__ method. If you define this class in the global namespace, everything should be fine.
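As a rough illustration of that suggestion (class and argument names here are illustrative, not the ones the PR ended up using):

class BatchSorter(object):
    """Picklable callable that sorts the examples of a batch.

    Defined at module level so pickle can find it by name, unlike a
    lambda or a function nested inside __init__.
    """
    def __init__(self, key=None, reverse=False):
        self.key = key
        self.reverse = reverse

    def __call__(self, batch):
        # zip(*batch) yields one tuple per example, one entry per source.
        return sorted(zip(*batch), key=self.key, reverse=self.reverse)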

pbrakel (Contributor, Author) replied:

Thanks for the suggestions! I added a reverse argument and a separate class SortMapper that takes a sorting key and the reverse flag as arguments. I still used a lambda expression for one of the test cases but implemented the others with the operator.itemgetter alternative. I guess operator.neg(operator.itemgetter(0)) would also do the trick but using a reverse argument seems fairly standard.

edit: Of course the negation method wouldn't work for sorting things that are not numbers so the reverse argument is definitely the way to go.

Diff context:

indices = [i for (v, i) in
           sorted(((v, i) for (i, v) in enumerate(x[0])),
                  key=self.key)]
if self.reverse:
bartvm (Member) commented on the diff above:

sorted takes reverse as an argument, so maybe you can just pass it there?

pbrakel (Contributor, Author) replied:

You're right. I changed it.


Diff context:

def __call__(self, x):
    indices = [i for (v, i) in
               sorted(((v, i) for (i, v) in enumerate(x[0])),
bartvm (Member) commented on the diff above:

Slightly confused here. If you do sorted(((v, i) for (i, v) in enumerate(x[0])), key=self.key), the key is applied to (v, i)? I would expect it to be applied to v directly. So if I want to sort a batch by length, I would expect to pass len as key, but in this case I would need to pass something like lambda x: len(x[0]), right?
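To make the concern concrete (toy data, not from the PR):

sentences = [['a', 'b', 'c'], ['d'], ['e', 'f']]

# With the snippet above, the key sees (value, index) pairs, so sorting
# by sentence length needs an indirection through the pair:
pairs = ((v, i) for (i, v) in enumerate(sentences))
print(sorted(pairs, key=lambda pair: len(pair[0])))
# -> [(['d'], 1), (['e', 'f'], 2), (['a', 'b', 'c'], 0)]

# whereas one would expect to pass key=len and have it applied to each
# value directly:
print(sorted(sentences, key=len))
# -> [['d'], ['e', 'f'], ['a', 'b', 'c']]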

pbrakel (Contributor, Author) replied:

Ah, this line is wrong. I have to rethink this and write a test that has an example with multiple sources because I'm sure this version wouldn't pass that.

@pbrakel (Contributor, Author) commented Feb 23, 2015

The batch is now zipped first. Subsequently, the key is applied to that sequence/iterator. The values returned by the key are passed to sorted to generate indices. This way the key can also select the source to sort on.
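Roughly, the index-based approach described here could look like the following sketch (the function name and data are made up):

def sort_batch(batch, key, reverse=False):
    # One key value per example; zip(*batch) yields one tuple per example,
    # with one entry per source, so the key can pick the source to sort on.
    values = [key(example) for example in zip(*batch)]
    indices = sorted(range(len(values)), key=values.__getitem__,
                     reverse=reverse)
    # Apply the same permutation to every source.
    return tuple([source[i] for i in indices] for source in batch)

# A hypothetical two-source batch: sentences and labels.
batch = ([['a', 'b', 'c'], ['d'], ['e', 'f']], [0, 1, 2])
print(sort_batch(batch, key=lambda example: len(example[0])))
# -> ([['d'], ['e', 'f'], ['a', 'b', 'c']], [1, 2, 0])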


Diff context:

def __call__(self, x):
    values = [self.key(i) for i in zip(*x)]
    indices = [i for (v, i) in
rizar (Contributor) commented on the diff above:

Sorry for being picky, but sorting indices is not really Pythonic. In Python, copying a reference to an object, i.e. a = b, is always a lightweight operation, regardless of whether the object is an int or a huge matrix. Therefore, I would prefer something like this:

def __call__(self, batch):
    with_keys = [(example, self.key(example)) for example in zip(*batch)]
    with_keys.sort(key=lambda pair: pair[1])
    return [example for example, _ in with_keys]

Another thing is precomputing keys. In general it is a nice thing to do, but I quickly checked and it seems like sorted is smart enough to do it on its own. I could not find it in the docs explicitly, but I would trust that such a function must be optimized. That simplifies our job to the following:

def __call__(self, batch):
    return list(sorted(zip(*batch), key=self.key, reverse=self.reverse))

Did I miss anything?

@pbrakel (Contributor, Author) commented Feb 24, 2015

I guess my code got a bit more complicated because I first tried to do it without using zip and later added it anyway, making the key usage and automatic sorting of all sources together trivial. I'll change it later today.

@pbrakel (Contributor, Author) commented Feb 24, 2015

I changed the code as suggested by @rizar plus a line that undoes the zip operation. I'm now assuming the output format for batches should always be a tuple of lists. If this is not required, I can simplify it further to get a tuple of tuples by using just return zip(*output) instead.
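Concretely, "undoing the zip" amounts to something like this (a sketch assuming a tuple of lists is the desired batch format; the data is made up):

batch = ([[1, 2, 3], [4], [5, 6]], ['a', 'b', 'c'])

# Sort the examples, then unzip them back into one list per source.
output = sorted(zip(*batch), key=lambda example: len(example[0]))
as_tuple_of_lists = tuple(list(source) for source in zip(*output))
print(as_tuple_of_lists)
# -> ([[4], [5, 6], [1, 2, 3]], ['b', 'c', 'a'])

# Returning zip(*output) directly would instead give one tuple per source
# (and, on Python 3, a lazy iterator rather than a sequence).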

@bartvm (Member) commented Feb 24, 2015

Mm... ideally you would want to keep the kind of container you had, I guess. The reason for this is that if batch contains NumPy arrays (which is very common, of course), this operation destroys the original matrix and returns a list of vectors instead.

Maybe we should special-case NumPy arrays and do numpy.asarray on all the inputs that are isinstance(numpy.ndarray) at the beginning?

@pbrakel (Contributor, Author) commented Feb 25, 2015

This seems like a very common problem that would affect many types of streams. I think it would be great if the objects themselves could describe how to be split or merged by operations like zip, but if 99% of the use cases are numpy arrays or lists, it might be sufficient to apply the isinstance(numpy.ndarray) check you suggested...

@rizar (Contributor) commented Feb 25, 2015

Considering that BatchDataStream currently produces numpy arrays, I would suggest always packing as an array here. I do not see much harm in that: the interface a 1-dimensional numpy array of objects provides is richer than that of a list, except for fast appends, but that does not seem important.

@bartvm (Member) commented Feb 25, 2015

It's not just fast appends though, it's also fast reading; NumPy arrays of objects are significantly slower than lists. I don't think the comparison is that simple either: other streams could expect lists for good reasons, e.g. removing the last word from all sentences (if I don't want to train with periods) is a simple [sentence.pop() for sentence in batch] for lists, but becomes a very different story if sentence is a NumPy array. Likewise, it determines whether I need to use extend or concatenate, whether insert exists, etc.
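For illustration, the same "drop the last word" operation on both container types (toy data):

import numpy

list_batch = [['the', 'cat', 'sat', '.'], ['a', 'dog', '.']]
# With lists this is an in-place one-liner:
for sentence in list_batch:
    sentence.pop()

array_batch = [numpy.array(['the', 'cat', 'sat', '.']),
               numpy.array(['a', 'dog', '.'])]
# NumPy arrays have no pop(); new arrays have to be built instead,
# e.g. by slicing off the last element:
array_batch = [sentence[:-1] for sentence in array_batch]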

All in all, I don't think it's a good idea to blindly cast everything from lists to NumPy arrays, potentially casting lists of integers to lists of vectors, completely changing the methods each example supports.

Although I'm normally a big fan of NumPy arrays, I think that for our data streams we might be better off using vanilla Python data structures (or at least not casting from the latter to the former). Considering we don't set any restrictions on the data passed around, there might be all sorts of data structures we're destroying. I think BatchDataStream should probably just return a list of examples, and we can always add a simple AsNumpyArray wrapper.

@rizar (Contributor) commented Feb 25, 2015

Okay, let's finish this with the isinstance(numpy.ndarray) approach. Your arguments make sense. I think you should create an issue in Fuel for switching to lists in BatchDataStream.

@pbrakel (Contributor, Author) commented Feb 25, 2015

I have now made it such that inputs that are isinstance(numpy.ndarray) get numpy.asarray applied, while the others are turned into lists instead. I also added a test in which the dataset is a list of tuples of arrays.
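The packing-back logic presumably looks something like this sketch (not the exact merged code; the helper name is made up):

import numpy

def pack_back(original_batch, sorted_examples):
    """Restore the per-source containers after sorting the zipped examples."""
    output = []
    for original, transposed in zip(original_batch, zip(*sorted_examples)):
        if isinstance(original, numpy.ndarray):
            # Sources that came in as arrays go back out as arrays.
            output.append(numpy.asarray(transposed))
        else:
            # Everything else becomes a plain list.
            output.append(list(transposed))
    return tuple(output)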

@rizar (Contributor) commented Feb 25, 2015

Okay, I think it looks nice, and we have already made you wait too long. Merging it!

I guess we will later factor out the logic of "packing back" the batches and put it into util.

rizar added a commit that referenced this pull request on Feb 25, 2015.
@rizar merged commit 23ab187 into mila-iqia:master on Feb 25, 2015.
@bartvm mentioned this pull request on Mar 2, 2015.