Default column names for `read_csv` and for data frames created from other data #2034

Closed
bmu opened this issue Oct 7, 2012 · 8 comments

Comments

@bmu
commented Oct 7, 2012

If you read data from a file with read_csv the default column names of the resulting data frame are set to X.1 to X.N (and to X1 to XN for versions >= 0.9), which are strings.

If you create a data frame from existing arrays, lists, or similar data, the column names default to 0 to N-1 and are integers.

Should these default names be the same? I think this would be more consistent and would avoid confusion (I'm not sure how this is handled in R, if that is part of this design decision).

See also this question and the answers on stackoverflow.

@wesm

Member

commented Nov 2, 2012

We made the decision for the default names from read_csv to be X0, X1, ... X_{N-1}. I'd thought about 0 through N - 1 also, but the X# was a convention from R that users seemed comfortable with.

@wesm wesm closed this Nov 2, 2012

@bmu

Author

commented Nov 30, 2012

@wesm It's about the difference between the default names in these two cases:

In [39]: df = pd.DataFrame(np.arange(100).reshape(10, 10))

In [40]: df
Out[40]: 
    0   1   2   3   4   5   6   7   8   9
0   0   1   2   3   4   5   6   7   8   9
1  10  11  12  13  14  15  16  17  18  19
2  20  21  22  23  24  25  26  27  28  29
3  30  31  32  33  34  35  36  37  38  39
4  40  41  42  43  44  45  46  47  48  49
5  50  51  52  53  54  55  56  57  58  59
6  60  61  62  63  64  65  66  67  68  69
7  70  71  72  73  74  75  76  77  78  79
8  80  81  82  83  84  85  86  87  88  89
9  90  91  92  93  94  95  96  97  98  99

In [41]: df.to_csv('df.txt', header=False)

In [42]: df = pd.DataFrame(np.arange(100).reshape(10, 10))

In [43]: df.columns
Out[43]: Index([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype=int64)

In [44]: df.to_csv('df.txt', header=False)

In [45]: df = pd.read_csv('df.txt', index_col=0, header=None)

In [46]: df.columns
Out[46]: Index([X1, X2, X3, X4, X5, X6, X7, X8, X9, X10], dtype=object)

It would be more consistent if creating a DataFrame from an array gave the same index (X1 ... XN). This would be less confusing when indexing a DataFrame by column names.

@wesm

Member

commented Nov 30, 2012

Well, I often think the column names should just be integers 0 through N-1. Making default DataFrame column names X0 through X{N-1} is not going to happen, but perhaps the parser should be consistent with the range(N) behavior. I'd feel like a real jerk changing this again given that I just changed it. =/

@wesm

Member

commented Nov 30, 2012

reopening for 0.10 and further consideration

@wesm wesm reopened this Nov 30, 2012

@wesm

Member

commented Dec 1, 2012

@jseabold I hate to bring this up again and break APIs again, but I'm thinking it might be simpler on everyone to just make the default column names range(N) vs. the X0, ... I don't really know how disruptive this would be.

@bmu

Author

commented Dec 1, 2012

As far as I can see, a DataFrame should also be dict-like (and you need default keys when no keys are given), so there is nothing wrong with giving default names for the keys, and 'X0', ... would be an option. You can still access it in a list-like way using the index, so from this point of view it wouldn't be "not pythonic". Furthermore, attribute access would be better with 'X0', ...

So what's wrong with using 'X0', ... as the default for every initialization of a DataFrame without column names given?

@wesm

Member

commented Dec 2, 2012

It would be very inconsistent with default indexing on other axes; plus it would probably break 90% of people's code, whereas changing X0, X1, ... to 0, 1, ... would break less than 1% of people's code in practice. Indeed, one of the reasons we chose the X0, ... was to facilitate attribute access.
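To make the trade-off concrete, here is a minimal sketch (not from the thread) of how integer defaults line up across both axes, and how string labels such as 'X0' enable attribute access. The variable names and the use of `add_prefix` to produce the 'X' labels are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# A DataFrame built from an array gets integer labels on both axes.
df = pd.DataFrame(np.arange(6).reshape(2, 3))
row_labels = list(df.index)      # integer row labels
col_labels = list(df.columns)    # integer column labels

# Integer column labels must be looked up with brackets.
first_col = df[0]

# String labels such as 'X0' can also be reached as attributes.
# add_prefix is used here only to illustrate the 'X' naming scheme.
named = df.add_prefix("X")
same_col = named.X0
```

With integer labels, `df.0` is not valid Python syntax, which is why the 'X' prefix helped attribute access.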

@wesm wesm closed this in 1e5c9d0 Dec 9, 2012

@wesm

Member

commented Dec 9, 2012

Resolution: range(n) by default and a new prefix option. So to get the old behavior, do read_csv(..., header=None, prefix='X'). I think this is a reasonable compromise.
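A minimal sketch of the resolved behavior, assuming a small in-memory CSV (`csv_text` is made up for illustration). As far as I know, later pandas releases deprecated and then dropped the `prefix` keyword of `read_csv`, so the sketch reproduces the prefixed names with `DataFrame.add_prefix` instead, which should work whether or not `prefix` is available.

```python
import io

import pandas as pd

# Hypothetical header-less CSV content, used only for illustration.
csv_text = "1,2,3\n4,5,6\n"

# With header=None the parser assigns integer column labels, like range(n).
df = pd.read_csv(io.StringIO(csv_text), header=None)
default_labels = list(df.columns)

# Assumption: add_prefix stands in for read_csv(..., prefix='X'),
# which may not be accepted by the installed pandas version.
prefixed = df.add_prefix("X")
prefixed_labels = list(prefixed.columns)
```

On versions that still accept it, `pd.read_csv(..., header=None, prefix='X')` is the form given in the resolution above.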
