
Data vectors as columns vs rows #30

Closed
craffel opened this issue Sep 16, 2014 · 6 comments

Comments

craffel (Member) commented Sep 16, 2014

After using nntools a little bit, I quickly noticed that data vectors are being represented as rows. That is, if A is a data matrix, then A.shape = (n_data, n_features). I have always represented data vectors as columns (A.shape = (n_features, n_data)). I considered not bringing this up because I don't expect to change your mind, but since design choices are still being made, I figured I'd start a discussion. Here's how I see it:

Arguments for data vectors as rows

  • In Python, A[0] has the intuitive meaning of "the first data vector"
  • It's what sklearn uses

Arguments for data vectors as columns

  • This is a pretty global convention for machine learning, particularly outside of code
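
For concreteness, here's a minimal numpy sketch of the two conventions (array contents and shapes are made up for illustration):

```python
import numpy as np

# Rows convention (sklearn-style): shape (n_data, n_features)
A_rows = np.arange(12).reshape(4, 3)   # 4 examples, 3 features each
first = A_rows[0]                      # first data vector, shape (3,)

# Columns convention (common in the literature): shape (n_features, n_data)
A_cols = A_rows.T                      # 3 features, 4 examples
first_col = A_cols[:, 0]               # first data vector, shape (3,)

assert np.array_equal(first, first_col)
```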

cc @bmcfee who has strong feelings about this.

bmcfee (Contributor) commented Sep 16, 2014

My $2e-2: stick with the sklearn model, because it generalizes gracefully to multi-dimensional examples. It's annoying to be inconsistent with the literature, but it does make for simpler code.

craffel (Member, Author) commented Sep 16, 2014

Ah, I see: because when data points are n-dimensional, we would have to repeat : for each dimension to retrieve the first data point (e.g. A[:, :, 0] for 2-d data) in my representation, right? As opposed to A[0] regardless of the number of dimensions.
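
A quick sketch of that indexing difference with 2-d data points (shapes chosen arbitrarily):

```python
import numpy as np

# Examples first: shape (n_data, h, w); indexing is uniform.
A = np.zeros((10, 28, 28))
first = A[0]            # shape (28, 28), regardless of data dimensionality

# Examples last: shape (h, w, n_data); one ':' per data dimension is needed.
B = np.zeros((28, 28, 10))
first_b = B[:, :, 0]    # shape (28, 28)
```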

bmcfee (Contributor) commented Sep 16, 2014

Exactly. In memory, it makes sense to have the data index be the least-frequently varying index, so that data for one example is contiguous in memory.

It also generalizes better to ragged data, so that A could actually be a list (not an ndarray), and each A[i] can be a $whatever.
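
For instance (a sketch with made-up lengths), a ragged dataset can simply be a list of arrays, and A[i] keeps the same meaning:

```python
import numpy as np

# Variable-length examples: no rectangular ndarray exists, but a list of
# arrays still supports A[i] meaning "the i-th example".
A = [np.ones(5), np.ones(7), np.ones(3)]
lengths = [a.shape[0] for a in A]   # [5, 7, 3]
```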

craffel (Member, Author) commented Sep 16, 2014

OK, that's convincing enough.

craffel closed this as completed Sep 16, 2014
benanne (Member) commented Sep 16, 2014

Another reason is that it's pretty much the Theano default (for the softmax function, for example).

Cuda-convnet unfortunately puts the batch size last, but I have some ideas about how to deal with that (also talked to @f0k about it). I'll try to get that done soon.

By the way, you can technically do A[..., 0] to always get example 0 if the batch size is last :) It's not a very commonly used feature though.
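
A small sketch of that Ellipsis trick (shapes are illustrative):

```python
import numpy as np

# With the batch axis last, '...' absorbs however many leading feature
# axes there are, so A[..., 0] always means "example 0".
A2 = np.zeros((3, 10))        # (n_features, n_data)
A3 = np.zeros((28, 28, 10))   # (h, w, n_data)

assert A2[..., 0].shape == (3,)
assert A3[..., 0].shape == (28, 28)
```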

f0k (Member) commented Sep 17, 2014

+1 for data points in rows!

To elaborate on one of the arguments, as @bmcfee said, memory layout is an important point. Fortran and Matlab use column-major layout, so it makes sense to have data vectors as columns. C and numpy use row-major layout by default, so it makes sense to have data vectors as rows -- you will waste a lot of performance due to frequent cache misses when putting data vectors in columns in numpy, unless you take special care to keep your matrices in column-major layout throughout your code. (Always assuming that you mostly want to iterate over the data points, not over the features.)
On the GPU, things look a little different again... leaning towards Fortran, cuBLAS assumes column-major layout, but Theano prefers to have matrices in row-major layout and just tells cuBLAS to transpose matrices as needed (every BLAS function involving matrices takes a flag for each matrix argument specifying whether it's to be transposed).
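
The memory-layout point can be checked directly in numpy (shapes here are arbitrary):

```python
import numpy as np

# Rows convention in numpy's default C (row-major) order:
# each example occupies a contiguous block of memory.
A = np.zeros((1000, 100))     # (n_data, n_features)
assert A[17].flags['C_CONTIGUOUS']

# Columns convention in the same default order: each example (a column)
# is strided across memory, so iterating over examples hops around in RAM.
B = np.zeros((100, 1000))     # (n_features, n_data)
assert not B[:, 17].flags['C_CONTIGUOUS']

# ...unless the matrix is kept in Fortran (column-major) order throughout.
F = np.zeros((100, 1000), order='F')
assert F[:, 17].flags['C_CONTIGUOUS']
```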
