
Data vectors as columns vs rows #30

Closed
craffel opened this issue Sep 16, 2014 · 6 comments

Comments

craffel (Member) commented Sep 16, 2014

After using nntools a little bit, I quickly noticed that data vectors are being represented as rows. That is, if A is a data matrix, then A.shape = (n_data, n_features). I have always represented data vectors as columns (A.shape = (n_features, n_data)). I considered not bringing this up because I don't expect to change your mind, but since design choices are still being made, I figured I'd start a discussion. Here's how I see it:

Arguments for data vectors as rows

  • In Python, A[0] has the intuitive meaning of "the first data vector"
  • It's what sklearn uses

Arguments for data vectors as columns

  • This is a pretty global convention for machine learning, particularly outside of code
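
For concreteness, here's a minimal numpy sketch of the two conventions (array contents and shapes are made up for illustration):

```python
import numpy as np

# Rows convention (sklearn-style): shape (n_data, n_features)
A_rows = np.arange(12).reshape(4, 3)   # 4 examples, 3 features each
first = A_rows[0]                      # first data vector, shape (3,)

# Columns convention (common in the literature): shape (n_features, n_data)
A_cols = A_rows.T                      # 3 features, 4 examples
first_col = A_cols[:, 0]               # first data vector, shape (3,)

assert np.array_equal(first, first_col)
```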

cc @bmcfee who has strong feelings about this.

bmcfee (Contributor) commented Sep 16, 2014

My $2e-2: stick with the sklearn model, because it generalizes gracefully to multi-dimensional examples. It's annoying to be inconsistent with the literature, but it does make for simpler code.

craffel (Member, Author) commented Sep 16, 2014

Ah, I see: because when data points are n-dimensional, we would have to repeat : for each dimension to retrieve the first data point (e.g. A[:, :, 0] for 2-d data) in my representation, right? As opposed to A[0] regardless of the number of dimensions.
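
A quick sketch of that indexing difference with 2-d data points (shapes chosen arbitrarily):

```python
import numpy as np

# Examples first: shape (n_data, h, w); indexing is uniform.
A = np.zeros((10, 28, 28))
first = A[0]            # shape (28, 28), regardless of data dimensionality

# Examples last: shape (h, w, n_data); one ':' per data dimension is needed.
B = np.zeros((28, 28, 10))
first_b = B[:, :, 0]    # shape (28, 28)
```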

bmcfee (Contributor) commented Sep 16, 2014

Exactly. In memory, it makes sense to have the data index be the least-frequently varying index, so that data for one example is contiguous in memory.

It also generalizes better to ragged data, so that A could actually be a list (not an ndarray), and each A[i] can be a $whatever.
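
For instance (a sketch with made-up lengths), a ragged dataset can simply be a list of arrays, and A[i] keeps the same meaning:

```python
import numpy as np

# Variable-length examples: no rectangular ndarray exists, but a list of
# arrays still supports A[i] meaning "the i-th example".
A = [np.ones(5), np.ones(7), np.ones(3)]
lengths = [a.shape[0] for a in A]   # [5, 7, 3]
```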

craffel (Member, Author) commented Sep 16, 2014

OK, that's convincing enough.

craffel closed this as completed Sep 16, 2014
benanne (Member) commented Sep 16, 2014

Another reason is that it's pretty much the Theano default (for the softmax function, for example).

Cuda-convnet unfortunately puts the batch size last, but I have some ideas about how to deal with that (also talked to @f0k about it). I'll try to get that done soon.

By the way, you can technically do A[..., 0] to always get example 0 if the batch size is last :) It's not a very commonly used feature though.
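
A small sketch of that Ellipsis trick (shapes are illustrative):

```python
import numpy as np

# With the batch axis last, '...' absorbs however many leading feature
# axes there are, so A[..., 0] always means "example 0".
A2 = np.zeros((3, 10))        # (n_features, n_data)
A3 = np.zeros((28, 28, 10))   # (h, w, n_data)

assert A2[..., 0].shape == (3,)
assert A3[..., 0].shape == (28, 28)
```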

f0k (Member) commented Sep 17, 2014

+1 for data points in rows!

To elaborate on one of the arguments, as @bmcfee said, memory layout is an important point. Fortran and Matlab use column-major layout, so it makes sense to have data vectors as columns. C and numpy use row-major layout by default, so it makes sense to have data vectors as rows -- you will waste a lot of performance due to frequent cache misses when putting data vectors in columns in numpy, unless you take special care to keep your matrices in column-major layout throughout your code. (Always assuming that you mostly want to iterate over the data points, not over the features.)
On the GPU, things look a little different again... leaning towards Fortran, cuBLAS assumes column-major layout, but Theano prefers to have matrices in row-major layout and just tells cuBLAS to transpose matrices as needed (every BLAS function involving matrices takes a flag for each matrix argument specifying whether it's to be transposed).
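
The memory-layout point can be checked directly in numpy (shapes here are arbitrary):

```python
import numpy as np

# Rows convention in numpy's default C (row-major) order:
# each example occupies a contiguous block of memory.
A = np.zeros((1000, 100))     # (n_data, n_features)
assert A[17].flags['C_CONTIGUOUS']

# Columns convention in the same default order: each example (a column)
# is strided across memory, so iterating over examples hops around in RAM.
B = np.zeros((100, 1000))     # (n_features, n_data)
assert not B[:, 17].flags['C_CONTIGUOUS']

# ...unless the matrix is kept in Fortran (column-major) order throughout.
F = np.zeros((100, 1000), order='F')
assert F[:, 17].flags['C_CONTIGUOUS']
```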
