Data vectors as columns vs rows #30
My $2-e2: stick with the sklearn model because it generalizes gracefully to multi-dimensional examples. It's annoying to be inconsistent with the literature, but it does make for simpler code.
Ah I see, because when data points are n-dimensional, we would have to repeat
Exactly. In memory, it makes sense to have the data index be the least-frequently varying index, so that data for one example is contiguous in memory. It also generalizes better to ragged data, so that A could actually be a list (not an ndarray), and each A[i] can be a $whatever.
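A small sketch of the ragged-data point, assuming NumPy. The variable names and shapes here are made up for illustration and are not taken from nntools:

```python
import numpy as np

# Dense case: examples along the first axis, shape (n_data, n_features).
A_dense = np.zeros((3, 4))

# Ragged case: a plain list in which each example has its own length.
A_ragged = [np.zeros(2), np.zeros(5), np.zeros(3)]

# The same indexing picks out example 0 in both cases.
first_dense = A_dense[0]
first_ragged = A_ragged[0]

# Per-example lengths are only meaningful when the example index comes first.
lengths = [len(a) for a in A_ragged]
```

With the example index last there is no equivalent list-of-examples form, since a list can only be indexed along its first axis.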
OK, that's convincing enough.
Another reason is that it's pretty much the Theano default (for the softmax function, for example). Cuda-convnet unfortunately puts the batch size last, but I have some ideas about how to deal with that (also talked to @f0k about it). I'll try to get that done soon. By the way, you can technically do A[..., 0] to always get example 0 if the batch size is last :) It's not a very commonly used feature though.
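For anyone who hasn't used the `...` (Ellipsis) index, here is a rough NumPy illustration. The array shapes are hypothetical and the batch-last layout is only meant to mimic the convention described above, not any particular library's API:

```python
import numpy as np

# Hypothetical batch of 2-D examples with the batch axis last:
# shape (height, width, n_data).
A_last = np.arange(2 * 3 * 5).reshape(2, 3, 5)
example0_last = A_last[..., 0]        # example 0, shape (2, 3)

# The same data with the batch axis first: shape (n_data, height, width).
A_first = np.moveaxis(A_last, -1, 0)
example0_first = A_first[0]           # example 0, shape (2, 3)

# Both ways of indexing return the same example.
same = np.array_equal(example0_last, example0_first)
```

`A[..., 0]` works regardless of how many dimensions each example has, which is the same property `A[0]` gives you when the batch axis is first.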
+1 for data points in rows! To elaborate on one of the arguments, as @bmcfee said, memory layout is an important point. Fortran and Matlab use column-major layout, so it makes sense to have data vectors as columns. C and numpy use row-major layout by default, so it makes sense to have data vectors as rows -- you will waste a lot of performance due to frequent cache misses when putting data vectors in columns in numpy, unless you take special care to keep your matrices in column-major layout throughout your code. (Always assuming that you mostly want to iterate over the data points, not over the features.)
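The layout difference can be checked directly in NumPy. This is just a sketch with arbitrary sizes, assuming the default `float64` dtype (8 bytes per element):

```python
import numpy as np

n_data, n_features = 4, 3
A = np.zeros((n_data, n_features))   # C order (row-major) by default

# Strides are the byte steps along each axis: moving to the next example
# skips a whole row, moving to the next feature skips one element.
c_strides = A.strides                # (24, 8)

row_contig = A[0].flags['C_CONTIGUOUS']      # one example: contiguous
col_contig = A[:, 0].flags['C_CONTIGUOUS']   # one feature: strided

# A column-major (Fortran-order) copy flips which axis is contiguous.
A_f = np.asfortranarray(A)
f_strides = A_f.strides              # (8, 32)
f_col_contig = A_f[:, 0].flags['F_CONTIGUOUS']
```

So with the default layout a single example `A[0]` sits in one contiguous block, whereas a column `A[:, 0]` is spread out with a stride of one full row.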
After using `nntools` a little bit I quickly noticed that data vectors are being represented as rows. That is, if `A` is a data matrix, then `A.shape = (n_data, n_features)`. I have always represented data vectors as columns (`A.shape = (n_features, n_data)`). I considered not bringing this up because I don't expect to change your mind, but because design choices are still being made I figured I'd start a discussion. Here's how I see it -

Arguments for data vectors as rows

- `A[0]` has the intuitive meaning of "the first data vector"
- `sklearn` uses this convention

Arguments for data vectors as columns

cc @bmcfee who has strong feelings about this.