should start/next/done iterate over rows or cols? #48

Closed
HarlanH opened this issue Aug 8, 2012 · 16 comments

@HarlanH
Contributor

HarlanH commented Aug 8, 2012

Currently, start/next/done iterates over columns of an AbstractDataFrame. It seems to me that they should instead iterate over rows, as do these functions for DataStreams. The next() return value should presumably be a 1-row SubDataFrame.

Are there any current functions that depend on the current behavior?

@tshort, GitHub says I should blame you for this. :)

@johnmyleswhite
Contributor

Having iteration for AbstractDataFrames work over rows does seem like the right approach, but there's definitely a use for easy-to-use tools to perform column iteration. So much behavior on DataFrames follows a design pattern where you perform operations column-wise and then combine results. The mean of a DataFrame, for example, is a mapping where you aggregate the means of each column. I imagine all of that can be achieved with map, but maybe there's something customized to be added here eventually.
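
A minimal sketch of the column-wise pattern described above, assuming a NamedTuple of equal-length vectors as a hypothetical stand-in for a DataFrame (this is not the package's API). It uses only `map` and the `Statistics` standard library in Julia 1.x:

```julia
using Statistics  # standard library; provides mean

# Hypothetical stand-in for a DataFrame: a NamedTuple of equal-length columns.
df = (a = [1.0, 2.0, 3.0], b = [10.0, 20.0, 30.0])

# Apply a function to each column and keep the results keyed by column name.
colmeans = map(mean, df)

println(colmeans)
```

Because `map` over a NamedTuple returns a NamedTuple, the per-column results stay attached to their column names.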

@HarlanH
Contributor Author

HarlanH commented Aug 8, 2012

Yes, I agree. colwise is a good start, of course. Maybe we need another wrapper type that implicitly flips the axes, so you can do [f(col) for col in flip(df)] or something. Or maybe it should be called itcols(). Dunno.

It'd be good to be able to easily iterate over a subset of columns too, of course. Especially as we're currently sans row names.

@tshort
Contributor

tshort commented Aug 8, 2012

This one is debatable. I don't think there are any functions that depend on this. It's easy enough to fix if I'm forgetting something.

The reason I defaulted to iterating over columns is to keep a DataFrame in line with Julia's Associative methods and also sort-of to match the equivalent of R's lapply. Which way you normally iterate really depends on the type of analysis you have. For me, I'd say iterating over columns is more common (especially for zoo-like operations). Array comprehension syntax is one I'd rather have work over columns.
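
To illustrate the Associative analogy, here is a small sketch using a plain `Dict` of columns as a hypothetical stand-in for a DataFrame. In Julia 1.x the abstract type is called `AbstractDict` rather than `Associative`, and iterating a `Dict` yields key/value pairs, so a comprehension naturally visits one column at a time:

```julia
# Hypothetical stand-in: column name => column values.
cols = Dict(:a => [1, 2, 3], :b => [4, 5, 6])

# Iterating a Dict yields (name, column) pairs, so this works column by column.
totals = Dict(name => sum(col) for (name, col) in cols)

println(totals)
```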

Doing anything with a subset of columns is easy: just index it. That operation is cheap.

@HarlanH
Contributor Author

HarlanH commented Aug 8, 2012

Hm. I just don't think of a DataFrame as an Associative structure as strongly as you do. The types of data I work with on a day-to-day basis simply don't make sense to iterate over columnwise, or at least not all of the columns. I tend to more frequently deal with relational data, not sequential data. I also don't like the behavior of lapply on data.frames, at all. Hadley's adply(df, 1, ...) is the more useful (to me) operation, although I don't like having to give axes by numerical index.

I think something like the solution outlined above, where the default iterator over a DF is row-wise (for my use cases), but there's some relatively easy way to change it to col-wise (for your use cases), is the way we should go.

@tshort
Contributor

tshort commented Aug 8, 2012

I'm fine with that.

@StefanKarpinski
Member

One thing to consider: iteration and indexing should probably be consistent. That is, doing for x in df should probably give you the same thing as successively indexing into df. You can provide other iteration schemes with a wrapper, e.g. an EachRow type that you can use as for r in EachRow(df), or possibly for r in each_row(df) (see EachLine/each_line for comparison).
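
A minimal sketch of the wrapper idea, written against the Julia 1.x iteration protocol (`Base.iterate`, which replaced start/next/done). `ToyFrame`, `RowIterator`, `nrows`, and this `each_row` are hypothetical names for illustration only, not the package's types:

```julia
# Hypothetical toy table (not the real DataFrame type): equal-length columns.
struct ToyFrame
    names::Vector{Symbol}
    cols::Vector{Vector{Any}}
end

nrows(df::ToyFrame) = isempty(df.cols) ? 0 : length(df.cols[1])

# Wrapper type that selects row-wise iteration without changing ToyFrame itself.
struct RowIterator
    df::ToyFrame
end

each_row(df::ToyFrame) = RowIterator(df)

# Return (row, next_state), or nothing once every row has been visited.
function Base.iterate(r::RowIterator, i::Int = 1)
    i > nrows(r.df) && return nothing
    row = Any[col[i] for col in r.df.cols]
    return (row, i + 1)
end

Base.length(r::RowIterator) = nrows(r.df)

df = ToyFrame([:a, :b], [Any[1, 2, 3], Any["x", "y", "z"]])
rows = collect(each_row(df))
```

The wrapper leaves whatever `for x in df` means untouched, so the default can stay consistent with indexing while row iteration remains one call away.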

@HarlanH
Contributor Author

HarlanH commented Aug 8, 2012

Hm, perhaps. We do have reference dispatch set up so that df[1:2] gets the first two columns, but df[1:2,1] gets the first two rows of the first column. So a single index into a df does return one or more columns.
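
The dispatch described above can be sketched roughly as follows. `IndexedFrame` is a hypothetical toy type with numeric columns only, and these two `getindex` methods are an assumption about the shape of the behavior, not the package's actual definitions:

```julia
# Hypothetical toy table (not the real DataFrame type).
struct IndexedFrame
    names::Vector{Symbol}
    cols::Vector{Vector{Float64}}
end

# A single index selects columns and returns a new table.
Base.getindex(df::IndexedFrame, c::AbstractVector{Int}) =
    IndexedFrame(df.names[c], df.cols[c])

# Two indices select rows of one column and return a plain vector.
Base.getindex(df::IndexedFrame, r::AbstractVector{Int}, c::Int) = df.cols[c][r]

df = IndexedFrame([:a, :b, :c],
                  [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]])

first_two_cols = df[1:2]          # columns :a and :b
first_two_rows_of_a = df[1:2, 1]  # first two rows of column :a
```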

The each_line/EachLine technique does seem applicable here.

So, we keep iteration over columns as the default for DataFrames (and, necessarily, over rows for DataStreams), and define each_row/EachRow for DataFrames? It feels like a bit more typing for me, and not entirely consistent, but reasonable.

@StefanKarpinski
Member

I dunno. Just giving points of reference. This is your call :-)

@HarlanH
Contributor Author

HarlanH commented Aug 8, 2012

A reasonable compromise allowing for future changes would be to define both each_row and each_col, make one of them a no-op, and declare in the documentation that the default (unwrapped) iterator behavior might change in the future, and the safe thing to do would be to use the appropriate function.

@johnmyleswhite
Contributor

Ok, I find consistency very compelling in general.

But I think there's a very strong argument here for iteration over rows rather than entries or columns: rows are the smallest unit of a DataFrame whose values are of a consistent type. Column A may have a different type from Column B, so you can't blindly loop over the columns and apply consistent processing. But each row consists of exactly the same number of columns, and the type of the entry at location i is the same in every row. So you can blindly treat rows as equivalent.
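
A small sketch of this type argument, again assuming a NamedTuple of equal-length vectors as a hypothetical stand-in for a DataFrame. The columns have different element types, while every row built from them has the same tuple type:

```julia
# Hypothetical stand-in for a DataFrame: a NamedTuple of equal-length columns.
df = (id = [1, 2, 3], name = ["a", "b", "c"], score = [0.5, 1.5, 2.5])

# Element types differ from column to column.
coltypes = map(eltype, df)

# Zipping the columns yields one tuple per row.
rows = collect(zip(df...))

# Every row has the same type, so one function can process all of them.
rowtypes = unique(map(typeof, rows))
```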

Also, in statistical theory the row is almost always the fundamental object. That's the thing you usually assume is IID: the columns are almost never independent from each other and the entries are definitely not IID in any interesting model. (If they were, you'd have a vector, not a matrix or DataFrame.)

That leaves me with the idea that we should do iteration by rows or that there should be no iteration at all using start, next or done. I'm totally happy with that latter approach: create EachRow and EachCol methods or insist that people use a DataStream, not a DataFrame.

@StefanKarpinski
Member

That's a pretty solid argument, John.

@johnmyleswhite
Contributor

We should make some decisions here.

@HarlanH
Contributor Author

HarlanH commented Jan 2, 2013

I remain OK with the final consensus suggestion: for row in EachRow(df) or for col in EachCol(df), with for x in df throwing an error.
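
One way this consensus could look under the Julia 1.x iteration protocol (`Base.iterate`, which replaced start/next/done). This is a sketch only: `StrictFrame` is a hypothetical toy type, and the `EachRow` and `EachCol` defined here are illustrative wrappers, not existing Base or package types:

```julia
# Hypothetical toy table with numeric columns only (not the real DataFrame type).
struct StrictFrame
    cols::Vector{Vector{Float64}}
end

struct EachRow
    df::StrictFrame
end

struct EachCol
    df::StrictFrame
end

# Bare iteration is rejected so callers must name the axis explicitly.
Base.iterate(::StrictFrame, state...) =
    throw(ArgumentError("use EachRow(df) or EachCol(df)"))

function Base.iterate(r::EachRow, i::Int = 1)
    cols = r.df.cols
    (isempty(cols) || i > length(cols[1])) && return nothing
    return ([col[i] for col in cols], i + 1)
end

function Base.iterate(c::EachCol, j::Int = 1)
    j > length(c.df.cols) && return nothing
    return (c.df.cols[j], j + 1)
end

df = StrictFrame([[1.0, 2.0], [3.0, 4.0]])

row_sums = Float64[]
for row in EachRow(df)
    push!(row_sums, sum(row))
end

col_sums = Float64[]
for col in EachCol(df)
    push!(col_sums, sum(col))
end

# Record whether iterating the bare table raises the expected error.
bare_iteration_rejected = try
    for x in df
    end
    false
catch err
    err isa ArgumentError
end
```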

DataVector and DataArray iteration should definitely follow standard Julia vector and array behavior...

@johnmyleswhite
Contributor

Ok. I'll write some tests for that. I believe those already work.

@johnmyleswhite
Contributor

Turns out that EachRow and EachCol don't exist. I'm a little hesitant to create them, because they'd occupy a space that should arguably be left to Base to fill. I'll make drafts for now, but we should discuss this issue on the main Julia mailing list.

@johnmyleswhite
Contributor

About to e-mail the main Julia list about iterating over rows and columns. For now, this is closed by a137666
