_.zip'ing different length arrays. #1237

Closed
jashkenas opened this Issue Jul 30, 2013 · 13 comments

Projects

None yet

9 participants

@jashkenas
Owner

Zip functions take in arrays, and give you back tuples of values from each array, grouped by index.

What to do when the incoming arrays have different lengths is done differently in different languages and libraries. It's recently been brought to my attention by Jesse Pollak that:

  • Python and Haskell will only return tuples for indexes where all lists have values (the minimum length), and discard the unmatching values.
  • F# will throw an error if you try to zip differently-length'd arrays.
  • Ruby will use the length of the first array you zip, discarding values or padding the others with null to fit.
  • Underscore will try not to destroy any data, and will use the maximum length of the incoming arrays.

(I think we may have talked about this before.)

Which behavior is best?

@jessepollak
Contributor

Just to add some more examples to the discussion:

  • In Scheme, there is no zip, but you can define one easily and it follows the min pattern as seen in Haskell and Python.
(define (zip p q) (map list p q))
  • In Racket, there is no zip and you can define it the exact same way as scheme but it follows the F# pattern of throwing an exception if there are different length arrays.

These two examples lead me to believe that the core of the issue may actually be in how we think about map (if we assume that we're deriving zip from map).

@michaelficarra
Collaborator

I prefer the Haskell / python / scheme behaviour. If we don't know what value should be passed to the zipping function we shouldn't be passing anything. Otherwise, how do you distinguish zipping [1, 2, 3, 4, 5] with [some, values, null, null, null] from [some, values]?

@tashemi
tashemi commented Jul 30, 2013

I guess it is better to avoid data destruction (python / haskell way) and use maximum length of the incoming arrays. All missed values from shorter array could be replaced with null / undefined

@Rayraz
Rayraz commented Jul 30, 2013

I like the current approach, because it guarantees that no data is lost. If I zip a few arrays and unzip them again, I would expect to be right back where I started. If any values from any of the arrays are rejected in the process, this would no longer be the case.

However, I'm interested to hear if there are any strong arguments (strong use cases etc.) for the particular behaviours offered by the various languages mentioned.
In particular:

  • Why would it be preferable to take the minimum length?
  • If it is indeed simply not right to pass arrays of different lengths, why not throw an error the way F# does rather than taking minimum length?
  • If passing arrays of different lengths is acceptable, why should the first array be privileged the way Ruby does?

In general, I instinctively dislike the idea of destroying data without any clear / obvious reason and without warning.

@jessepollak
Contributor

Just to add more to the discussion, I shot an email to Raymond Hettinger (one of the Python core contributors) asking about Python's zip and this is what he said.

The history of Python's zip() is documented here: http://www.python.org/dev/peps/pep-0201/
Barry is an Emacs guy, so ELisp is part of his core vocabulary.
The roots of all zips trace back to Lisp where zipping stops
with the shortest input list: http://jtra.cz/stuff/lisp/sclr/mapcar.html
Stopping with the shortest is a useful behavior because it allows
infinite input streams to be combined with finite streams
(a reasonably common functional programming technique).

I think there's something to the infinite input stream logic. This also makes me kind of lean towards the Ruby implementation because then you could technically have the best of both worlds (although it adds complication).

Looking at the PEP, the Python zip comes from the map, which follows the same logic as underscore.

  >>> c = (4, 5, 6, 7)
  >>> map(None, a, c)
  [(1, 4), (2, 5), (3, 6), (None, 7)]

And they implemented zip to move away from that. Another interesting quote from the PEP:

Optional padding. An earlier version of this PEP proposed an
optional `pad' keyword argument, which would be used when the
argument sequences were not the same length. This is similar
behavior to the map(None, ...) semantics except that the user
would be able to specify pad object. This has been rejected by
the BDFL in favor of always truncating to the shortest sequence,
because of the KISS principle. If there's a true need, it is
easier to add later. If it is not needed, it would still be
impossible to delete it in the future.

No matter what we decide is best, I'd be concerned about switching the functionality of the zip in underscore because it could break things. That being said, it's exciting to explore.

@Rayraz
Rayraz commented Jul 31, 2013

@jessepollak, very interesting stuff! Thanks for spreading the knowledge. I didn't think of infinite input streams at all... It might be a bit special-case though? That is, I don't think I've ever seen an infinite array or something like that in JavaScript...

@martinblech

@jessepollak, there's no such thing as infinite arrays in JavaScript right now, but we might start running into stuff like that more and more if generators catch on.

Are there any plans on zip() supporting iterators and generators, or will it stick to Arrays only?

@iros
iros commented Jul 31, 2013

How about adding an optional second argument to control this behavior? I am tempted to say that it should pad with null values (to avoid data loss as others have suggested) but maybe add a truncate flag?

_.zip(..., { truncate: true }) -> will default to using the shortest array length.

@domenic
domenic commented Jul 31, 2013

Just popping in: I've only used _.zip a few times, but each time, if my arrays had been different lengths, that would have been a bug in my code. So, throwing would have been best in my case.

This is not a strong opinion---indeed, this whole question is not something that I have strong opinions about---so as such I won't watch the thread for follow-up. But I saw @jashkenas's request for comments on Twitter and didn't see this POV represented, so here I am.

@jamesfzhang

I like @iros's comment on adding a truncate flag.

@jessepollak
Contributor

@martinblech yeah, I think generators definitely have the potential to catch on; especially given how successful they've been in Python thus far.

Also, while we don't have infinite arrays in JS right now, there's definitely the potential for an infinite-ish situation (although I've never experienced it) where you have one Array that's much bigger than the others and you want to merge them. For instance, if you had an array containing all of the names of the people in a marathon and then an array containing the prize money values for the top 5 and wanted the tuples of names and prize money. Just a thought.

I'm not sure about the truncate flag...I'm kind of with Guido on the "Keep It Simple Stupid" principle.

@Rayraz
Rayraz commented Jul 31, 2013

So, IMO, I think either we should decide to support the possibility of infinite arrays using generators at some point in the future and go for the pythonesque approach.

Or, we should decide to "support arrays of different sizes / not to care about generators / decide infinite arrays are too edge-case / keep things as they are for the sake of backward compatibility" and leave things unchanged.

I don't particularly like the idea of adding padding options, truncate flags or even Ruby's privileged first argument. They add unnecessary complexity. If a particular use-case truly demands equal-length arrays, or a particular length it's trivial to enforce these requirements before passing arrays into _.zip.

@jashkenas
Owner

Thanks for the conversation, folks. Without any super strong arguments in any particular direction (infinite lists are unconvincing in JavaScript, because _.zip has to be eager), I think we should leave this as-is. Not destroying any of the incoming data is a nice feature, and you can always stop iterating when you see undefined values, or compact out undefined parts of your result.

@jashkenas jashkenas closed this Aug 1, 2013
@megawac megawac referenced this issue in megawac/lodash Jun 22, 2014
@megawac megawac Breaking loops style 74f58c1
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment