-
Notifications
You must be signed in to change notification settings - Fork 352
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
added sort stream #336
added sort stream #336
Conversation
Any chance you could add support for the |
""" | ||
def __init__(self, data_stream, key=None): | ||
|
||
def mapping(x): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mm, nested functions aren't supported either by pickle (sorry, it's quite a nightmare sometimes).
What you could do is to create a callable Mapping
class instead, which takes key
and reverse
as arguments to its __init__
method. If you define this class in the global namespace, everything should be fine.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the suggestions! I added a reverse
argument and a separate class SortMapper that takes a sorting key and the reverse flag as arguments. I still used a lambda expression for one of the test cases but implemented the others with the operator.itemgetter
alternative. I guess operator.neg(operator.itemgetter(0))
would also do the trick but using a reverse
argument seems fairly standard.
edit: Of course the negation method wouldn't work for sorting things that are not numbers so the reverse
argument is definitely the way to go.
indices = [i for (v, i) in | ||
sorted(((v, i) for (i, v) in enumerate(x[0])), | ||
key=self.key)] | ||
if self.reverse: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sorted
takes reverse
as an argument, so maybe you can just pass it there?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right. I changed it.
|
||
def __call__(self, x): | ||
indices = [i for (v, i) in | ||
sorted(((v, i) for (i, v) in enumerate(x[0])), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Slightly confused here. If you do sorted(((v, i) for (i, v) in enumerate(x[0])), key=self.key)
, the key is applied to (v, i)
? I would expect it to be applied to v
directly. So if I want to sort a batch by length, I would expect to pass len
as key, but in this case I would need to pass something like lambda x: len(x[0])
, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, this line is wrong. I have to rethink this and write a test that has an example with multiple sources because I'm sure this version wouldn't pass that.
The batch is now zipped first. Subsequently, the key is applied to that sequence/iterator. The values returned by the key are passed to |
|
||
def __call__(self, x): | ||
values = [self.key(i) for i in zip(*x)] | ||
indices = [i for (v, i) in |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry of being picky, but sorting indices is not really Pythonic. In Python copying a reference to an object, i.e. a=b
, is always a lightweight operation regardless whether the object is an int
or a huge matrix. Therefore, I would prefer something like this:
def __call__(self, batch):
with_keys = [(example, self.key(example)) for example in zip(*batch)]
with_keys.sort(key=lambda _, key : key)
return [example for example, _ in with_keys]
Another thing is precomputing keys. In general it is a nice thing to do, but I quickly checked and seems like sorted
is smart enough to do it on its own. I could not find it in the docs explicitly, but I would trust that such a function must be optimized. That simplifies our job to the following:
def __call__(self, batch):
return list(sorted(zip(*batch), key=self.key, reverse=self.reverse))
Did I miss anything?
I guess my code got a bit more complicated because I first tried to do it without using zip and later added it anyway, making the key usage and automatic sorting of all sources together trivial. I'll change it later today. |
I changed the code as suggested by @rizar plus a line that undoes the zip operation. I'm now assuming the output format for batches should always be a tuple of lists. If this is not required, I can simplify it further to get a tuple of tuples by using just |
Mm... Ideally you would want to keep the kind of container you had I guess. The reason for this is that if Maybe we should special case NumPy arrays and do |
This seems a very common problem that would affect many types of streams. I think that it would be great if the objects themselves could describe how to be split or merged by operations like zip but if 99% of the use cases are numpy arrays or lists it might be sufficient to apply the |
Considering the fact that |
It's not just fast appends though, it's also fast reading; NumPy arrays of objects are significantly slower than lists. I don't think the comparison is that simple, other streams could expect lists for good reasons e.g. removing the last word from all sentences (if I don't want to train with periods) is a simple All in all, I don't think it's a good idea to blindly cast everything from lists to NumPy arrays, potentially casting lists of integers to lists of vectors, completely changing the methods each example supports. Although I'm normally a big fan of NumPy arrays, I think that for our data streams, we might be better of using vanilla Python data structures (or at least, don't cast from the latter to the former). Considering we don't set any restrictions on the data passed around, there might be all sorts of data structures we're destroying. I think |
Okay, let's have this finished by |
I now made it such that inputs that |
Okay, I think it looks nice and that we already made you wait too much. Merging it! Guess when will later factor out the logic of "packing back" the batches and put in to util. |
Any suggestions for additional tests are welcome.