should start/next/done iterate over rows or cols? #48
Comments
Having iteration for AbstractDataFrames work over rows does seem like the right approach, but there's definitely a use for easy-to-use tools to perform column iteration. So much behavior on DataFrames follows a design pattern where you perform operations column-wise and then combine the results. The mean of a DataFrame, for example, is a mapping where you aggregate the means of each column. I imagine all of that can be achieved with
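To make the column-wise pattern concrete, here is a minimal sketch in Python rather than Julia, so it runs without the package. The dict-of-lists table and the `column_means` helper are hypothetical stand-ins, not part of the DataFrames API.

```python
from statistics import mean

# Hypothetical toy table: column name -> list of values.
columns = {
    "height": [1.0, 2.0, 3.0],
    "weight": [10.0, 20.0, 60.0],
}

def column_means(cols):
    """Apply the same aggregate to each column, then combine the results."""
    return {name: mean(values) for name, values in cols.items()}

means = column_means(columns)
```

This only works blindly because every column here is numeric; a table with mixed column types would need per-column handling.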
Yes, I agree. It'd be good to be able to easily iterate over a subset of columns too, of course.
This one is debatable. I don't think there are any functions that depend on this. It's easy enough to fix if I'm forgetting something. The reason I defaulted to iterating over columns is to keep a DataFrame in line with Julia's Associative methods and also sort of to match the equivalent of R's Doing anything with a subset of columns is easy: just index it. That operation is cheap.
Hm. I just don't think of a DataFrame as an Associative structure as I think something like the solution outlined above, where the default
I'm fine with that.
One thing to consider: iteration and indexing should probably be consistent. That is, doing
Hm, perhaps. We do have reference dispatch set up so that The So, we keep iteration of dfs as columns by default for DataFrames (and by rows necessarily for DataStreams), and define
I dunno. Just giving points of reference. This is your call :-)
A reasonable compromise allowing for future changes would be to define both
Ok, I find consistency very compelling in general. But I think there's a very strong argument here for iteration over rows rather than entries or columns: rows are the smallest unit of a DataFrame whose values are of a consistent type. Column A may have a different type from Column B, so you can't blindly loop over the columns and apply consistent processing. But each row consists of exactly the same number of columns, and the type of the entry at location i is the same in every row. So you can blindly treat rows as equivalent. Also, in statistical theory the row is almost always the fundamental object. That's the thing you usually assume is IID: the columns are almost never independent of each other and the entries are definitely not IID in any interesting model. (If they were, you'd have a vector, not a matrix or DataFrame.) That leaves me with the idea that we should do iteration by rows or that there should be no iteration at all using
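The type argument above can be checked on a toy table. This is a minimal sketch in Python, not the DataFrames implementation; the dict-of-lists layout and the helpers `iter_rows` and `row_signature` are assumptions made for illustration.

```python
# Hypothetical toy table: column name -> list of values.
columns = {
    "name": ["a", "b", "c"],      # a string column
    "score": [1.5, 2.5, 3.0],     # a float column
}

def iter_rows(cols):
    """Yield one tuple per row, with fields in column order."""
    return zip(*cols.values())

def row_signature(row):
    """Return the tuple of field types for a single row."""
    return tuple(type(value) for value in row)

# Every row has the same signature, so rows can be processed uniformly.
row_types = {row_signature(row) for row in iter_rows(columns)}

# Columns differ in element type, so one function cannot be applied to all.
col_types = {type(values[0]) for values in columns.values()}
```

Here `row_types` collapses to a single signature while `col_types` holds two distinct types, which is the asymmetry the argument relies on.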
That's a pretty solid argument, John.
We should make some decisions here.
I remain OK with the final consensus suggestion, of DataVector and DataArray iteration should definitely follow standard Julia vector and array behavior... |
Ok. I'll write some tests for that. I believe those already work.
Turns out that
About to e-mail the main Julia list about iterating over rows and columns. For now, this is closed by a137666
Currently, start/next/done iterates over columns of an AbstractDataFrame. It seems to me that they should instead iterate over rows, as do these functions for DataStreams. The next() return value should presumably be a 1-row SubDataFrame.
Are there any current functions that depend on the current behavior?
@tshort, GitHub says I should blame you for this. :)
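For reference, the proposal above (row-wise start/next/done, with each step returning a one-row sub-frame) can be sketched as follows. This is written in Python with a hypothetical `ToyFrame` class, not the actual AbstractDataFrame or SubDataFrame types; the method names mirror the protocol only by analogy.

```python
class ToyFrame:
    """Hypothetical column-oriented table standing in for a data frame."""

    def __init__(self, columns):
        lengths = {len(values) for values in columns.values()}
        if len(lengths) > 1:
            raise ValueError("all columns must have the same length")
        self.columns = columns
        self.nrows = lengths.pop() if lengths else 0

    def row_view(self, i):
        # One-row sub-frame. Slicing copies here; a real sub-frame type
        # could reference the parent instead (assumption).
        return ToyFrame({name: values[i:i + 1]
                         for name, values in self.columns.items()})

    # Analogues of start/next/done, iterating over rows.
    def start(self):
        return 0

    def done(self, state):
        return state >= self.nrows

    def next(self, state):
        return self.row_view(state), state + 1

    def __iter__(self):
        # Drive the three-function protocol the way a for loop would.
        state = self.start()
        while not self.done(state):
            row, state = self.next(state)
            yield row


df = ToyFrame({"x": [1, 2, 3], "y": ["a", "b", "c"]})
rows = list(df)
```

Each element of `rows` is itself a one-row `ToyFrame`, so code written for a frame also works on a single row.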