added sort stream #336

Merged: 11 commits, merged into mila-iqia:master on Feb 25, 2015

Conversation

@pbrakel (Contributor) commented Feb 23, 2015

Any suggestions for additional tests are welcome.

@bartvm (Member) commented Feb 23, 2015

Any chance you could add support for the reverse keyword? The problem is that lambda functions can't be pickled, so something like lambda x: -x[0] won't work when using checkpointing. Instead, it would be nice if the user could use key=operator.itemgetter(0), reverse=True.
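(For a quick illustration of the pickling issue, here is a minimal sketch using the standard pickle and operator modules; the data is made up, and the itemgetter round trip assumes a reasonably recent Python 3.)

import operator
import pickle

# A lambda cannot be pickled, which is what breaks checkpointing.
try:
    pickle.dumps(lambda x: -x[0])
except (pickle.PicklingError, AttributeError, TypeError) as error:
    print("lambda does not pickle:", error)

# operator.itemgetter(0) is a named, picklable callable, so it survives
# a pickle round trip and therefore checkpointing.
key = pickle.loads(pickle.dumps(operator.itemgetter(0)))
print(key((3, 'a')))  # -> 3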

"""
def __init__(self, data_stream, key=None):

def mapping(x):
bartvm (Member) commented on the diff above:

Mm, nested functions aren't supported by pickle either (sorry, it's quite a nightmare sometimes).

What you could do is to create a callable Mapping class instead, which takes key and reverse as arguments to its __init__ method. If you define this class in the global namespace, everything should be fine.
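As a rough illustration of that suggestion (class and argument names here are illustrative, not the ones the PR ended up using):

class BatchSorter(object):
    """Picklable callable that sorts the examples of a batch.

    Defined at module level so pickle can find it by name, unlike a
    lambda or a function nested inside __init__.
    """
    def __init__(self, key=None, reverse=False):
        self.key = key
        self.reverse = reverse

    def __call__(self, batch):
        # zip(*batch) yields one tuple per example, one entry per source.
        return sorted(zip(*batch), key=self.key, reverse=self.reverse)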

pbrakel (Contributor, Author) replied:

Thanks for the suggestions! I added a reverse argument and a separate class SortMapper that takes a sorting key and the reverse flag as arguments. I still used a lambda expression for one of the test cases but implemented the others with the operator.itemgetter alternative. I guess operator.neg(operator.itemgetter(0)) would also do the trick but using a reverse argument seems fairly standard.

edit: Of course the negation method wouldn't work for sorting things that are not numbers so the reverse argument is definitely the way to go.

Diff context:

indices = [i for (v, i) in
           sorted(((v, i) for (i, v) in enumerate(x[0])),
                  key=self.key)]
if self.reverse:
bartvm (Member) commented on the diff above:

sorted takes reverse as an argument, so maybe you can just pass it there?

pbrakel (Contributor, Author) replied:

You're right. I changed it.


Diff context:

def __call__(self, x):
    indices = [i for (v, i) in
               sorted(((v, i) for (i, v) in enumerate(x[0])),
bartvm (Member) commented on the diff above:

Slightly confused here. If you do sorted(((v, i) for (i, v) in enumerate(x[0])), key=self.key), the key is applied to (v, i)? I would expect it to be applied to v directly. So if I want to sort a batch by length, I would expect to pass len as key, but in this case I would need to pass something like lambda x: len(x[0]), right?
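To make the concern concrete (toy data, not from the PR):

sentences = [['a', 'b', 'c'], ['d'], ['e', 'f']]

# With the snippet above, the key sees (value, index) pairs, so sorting
# by sentence length needs an indirection through the pair:
pairs = ((v, i) for (i, v) in enumerate(sentences))
print(sorted(pairs, key=lambda pair: len(pair[0])))
# -> [(['d'], 1), (['e', 'f'], 2), (['a', 'b', 'c'], 0)]

# whereas one would expect to pass key=len and have it applied to each
# value directly:
print(sorted(sentences, key=len))
# -> [['d'], ['e', 'f'], ['a', 'b', 'c']]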

pbrakel (Contributor, Author) replied:

Ah, this line is wrong. I have to rethink this and write a test that has an example with multiple sources because I'm sure this version wouldn't pass that.

@pbrakel (Contributor, Author) commented Feb 23, 2015

The batch is now zipped first. Subsequently, the key is applied to that sequence/iterator. The values returned by the key are passed to sorted to generate indices. This way the key can also select the source to sort on.
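Roughly, the index-based approach described here could look like the following sketch (the function name and data are made up):

def sort_batch(batch, key, reverse=False):
    # One key value per example; zip(*batch) yields one tuple per example,
    # with one entry per source, so the key can pick the source to sort on.
    values = [key(example) for example in zip(*batch)]
    indices = sorted(range(len(values)), key=values.__getitem__,
                     reverse=reverse)
    # Apply the same permutation to every source.
    return tuple([source[i] for i in indices] for source in batch)

# A hypothetical two-source batch: sentences and labels.
batch = ([['a', 'b', 'c'], ['d'], ['e', 'f']], [0, 1, 2])
print(sort_batch(batch, key=lambda example: len(example[0])))
# -> ([['d'], ['e', 'f'], ['a', 'b', 'c']], [1, 2, 0])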


Diff context:

def __call__(self, x):
    values = [self.key(i) for i in zip(*x)]
    indices = [i for (v, i) in
rizar (Contributor) commented on the diff above:

Sorry for being picky, but sorting indices is not really Pythonic. In Python, copying a reference to an object, i.e. a = b, is always a lightweight operation, regardless of whether the object is an int or a huge matrix. Therefore, I would prefer something like this:

def __call__(self, batch):
    with_keys = [(example, self.key(example)) for example in zip(*batch)]
    with_keys.sort(key=lambda pair: pair[1])
    return [example for example, _ in with_keys]

Another thing is precomputing keys. In general it is a nice thing to do, but I quickly checked and it seems like sorted is smart enough to do it on its own. I could not find it in the docs explicitly, but I would trust that such a function must be optimized. That simplifies our job to the following:

def __call__(self, batch):
    return list(sorted(zip(*batch), key=self.key, reverse=self.reverse))

Did I miss anything?

@pbrakel (Contributor, Author) commented Feb 24, 2015

I guess my code got a bit more complicated because I first tried to do it without using zip and later added it anyway, making the key usage and automatic sorting of all sources together trivial. I'll change it later today.

@pbrakel (Contributor, Author) commented Feb 24, 2015

I changed the code as suggested by @rizar plus a line that undoes the zip operation. I'm now assuming the output format for batches should always be a tuple of lists. If this is not required, I can simplify it further to get a tuple of tuples by using just return zip(*output) instead.
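Concretely, "undoing the zip" amounts to something like this (a sketch assuming a tuple of lists is the desired batch format; the data is made up):

batch = ([[1, 2, 3], [4], [5, 6]], ['a', 'b', 'c'])

# Sort the examples, then unzip them back into one list per source.
output = sorted(zip(*batch), key=lambda example: len(example[0]))
as_tuple_of_lists = tuple(list(source) for source in zip(*output))
print(as_tuple_of_lists)
# -> ([[4], [5, 6], [1, 2, 3]], ['b', 'c', 'a'])

# Returning zip(*output) directly would instead give one tuple per source
# (and, on Python 3, a lazy iterator rather than a sequence).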

@bartvm (Member) commented Feb 24, 2015

Mm... ideally you would want to keep the kind of container you had, I guess. The reason for this is that if batch contains NumPy arrays (which is very common, of course), this operation destroys the original matrix and returns a list of vectors instead.

Maybe we should special-case NumPy arrays and do numpy.asarray on all the inputs that are isinstance(numpy.ndarray) at the beginning?

@pbrakel (Contributor, Author) commented Feb 25, 2015

This seems like a very common problem that would affect many types of streams. I think it would be great if the objects themselves could describe how to be split or merged by operations like zip, but if 99% of the use cases are numpy arrays or lists, it might be sufficient to apply the isinstance(numpy.ndarray) check you suggested...

@rizar (Contributor) commented Feb 25, 2015

Considering that BatchDataStream currently produces numpy arrays, I would suggest always packing as an array here. I do not see much harm in that: the interface a 1-dimensional numpy array of objects provides is richer than that of a list, except for fast appends, but that does not seem important.

@bartvm (Member) commented Feb 25, 2015

It's not just fast appends though, it's also fast reading; NumPy arrays of objects are significantly slower than lists. I don't think the comparison is that simple either: other streams could expect lists for good reasons, e.g. removing the last word from all sentences (if I don't want to train with periods) is a simple [sentence.pop() for sentence in batch] for lists, but becomes a very different story if sentence is a NumPy array. Likewise, it determines whether I need to use extend or concatenate, whether insert exists, etc.
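For illustration, the same "drop the last word" operation on both container types (toy data):

import numpy

list_batch = [['the', 'cat', 'sat', '.'], ['a', 'dog', '.']]
# With lists this is an in-place one-liner:
for sentence in list_batch:
    sentence.pop()

array_batch = [numpy.array(['the', 'cat', 'sat', '.']),
               numpy.array(['a', 'dog', '.'])]
# NumPy arrays have no pop(); new arrays have to be built instead,
# e.g. by slicing off the last element:
array_batch = [sentence[:-1] for sentence in array_batch]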

All in all, I don't think it's a good idea to blindly cast everything from lists to NumPy arrays, potentially casting lists of integers to lists of vectors, completely changing the methods each example supports.

Although I'm normally a big fan of NumPy arrays, I think that for our data streams we might be better off using vanilla Python data structures (or at least not casting from the latter to the former). Considering we don't set any restrictions on the data passed around, there might be all sorts of data structures we're destroying. I think BatchDataStream should probably just return a list of examples, and we can always add a simple AsNumpyArray wrapper.

@rizar (Contributor) commented Feb 25, 2015

Okay, let's finish this with the isinstance(numpy.ndarray) approach. Your arguments make sense. I think you should create an issue in Fuel for switching to lists in BatchDataStream.

@pbrakel (Contributor, Author) commented Feb 25, 2015

I have now made it such that inputs that are isinstance(numpy.ndarray) get numpy.asarray applied, while the others are turned into lists instead. I also added a test in which the dataset is a list of tuples of arrays.
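The packing-back logic presumably looks something like this sketch (not the exact merged code; the helper name is made up):

import numpy

def pack_back(original_batch, sorted_examples):
    """Restore the per-source containers after sorting the zipped examples."""
    output = []
    for original, transposed in zip(original_batch, zip(*sorted_examples)):
        if isinstance(original, numpy.ndarray):
            # Sources that came in as arrays go back out as arrays.
            output.append(numpy.asarray(transposed))
        else:
            # Everything else becomes a plain list.
            output.append(list(transposed))
    return tuple(output)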

@rizar (Contributor) commented Feb 25, 2015

Okay, I think it looks nice, and we have already made you wait too long. Merging it!

I guess we will later factor out the logic of "packing back" the batches and put it into util.

rizar added a commit that referenced this pull request on Feb 25, 2015.
@rizar merged commit 23ab187 into mila-iqia:master on Feb 25, 2015.
@bartvm mentioned this pull request on Mar 2, 2015.