Assignments: numerics, strings and .loc #6171

Closed
aldanor opened this issue Jan 29, 2014 · 20 comments · Fixed by #6172

Comments

@aldanor
Contributor

commented Jan 29, 2014

Why would assigning an entire column to an array of values work differently with numbers vs strings?

Assigning numeric values:

>>> df = pd.DataFrame(columns=['x', 'y'])
>>> df['x'] = [1, 2]
>>> df
   x    y
0  1  NaN
1  2  NaN

Assigning string values:

>>> df = pd.DataFrame(columns=['x', 'y'])
>>> df['x'] = ['1', '2']
ValueError: could not broadcast input array from shape (2) into shape (0)

Btw according to latest docs .loc can append, but can it append more than one value at once?

Setting multiple via .loc:

>>> df = pd.DataFrame(columns=['x', 'y'])
>>> df.loc[:, 'x'] = [1, 2]
>>> df
Empty DataFrame
Columns: [x, y]
Index: []
>>> df.loc[[0, 1], 'x'] = [1, 2]
>>> df
Empty DataFrame
Columns: [x, y]
Index: []
>>> df.loc[0:2, 'x'] = [1, 2]
>>> df
Empty DataFrame
Columns: [x, y]
Index: []

Setting single via .loc: (this works ofc)

>>> df = pd.DataFrame(columns=['x', 'y'])
>>> df.loc[0, 'x'] = 1
>>> df
   x    y
0  1  NaN
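
For setting several rows at once, one workaround that does not rely on `.loc` enlargement is to create the row labels first and then assign. This is only a sketch, assuming a recent pandas release rather than v0.13.0; the labels `0` and `1` are illustrative:

```python
import pandas as pd

# Start from the same empty frame used in the examples above.
df = pd.DataFrame(columns=["x", "y"])

# Create the row labels first; the new rows are filled with NaN.
df = df.reindex([0, 1])

# The labels now exist, so this is an ordinary assignment, not an append.
df.loc[[0, 1], "x"] = [1, 2]
print(df)
```

Both columns typically stay object dtype here, so a later `df.infer_objects()` call may be needed to get numeric dtypes.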
@jreback

Contributor

commented Jan 29, 2014

This is not valid python code (what you pasted)

>>> df = pd.DataFrame(columns='x', 'y')
SyntaxError: non-keyword arg after keyword arg

@TomAugspurger

Contributor

commented Jan 29, 2014

Presumably missing square brackets.

@aldanor

Contributor Author

commented Jan 29, 2014

Yea I retyped it manually, a typo.

@jreback

Contributor

commented Jan 29, 2014

can you fix the examples pls

@aldanor

Contributor Author

commented Jan 29, 2014

Fixed typos, should work now; sry had to type it manually.

Huh, more weird stuff, check this out:

>>> df = pd.DataFrame(columns=['x', 'y'])
>>> df.x = [1, 2]  # no error?
>>> df.y  # expect [nan, nan]?
ValueError: Wrong number of items passed 2, indices imply 0

Btw: this is v0.13.0, not the head rev; not sure if it has been addressed or not already but I skimmed through recent changes/issues and haven't found anything related.

@jreback

Contributor

commented Jan 29, 2014

Assigning to columns via setattr is currently an open issue (whether it should actually raise an error is hard to say); see #5904.

@aldanor

Contributor Author

commented Jan 29, 2014

@jreback Still crashes with __setitem__ as well.

@jreback

Contributor

commented Jan 29, 2014

@aldanor what are you referring to with "Still crashes with __setitem__ as well"?

@aldanor

Contributor Author

commented Jan 29, 2014

@jreback Maybe I got it wrong; did you mean df.x = ... kind of access by "assigning to columns via setattr", as opposed to df['x']? The latter behaves the same in all the above cases.

@jreback

Contributor

commented Jan 29, 2014

@aldanor pls roll the setting with multiple .loc into a new issue (ref this issue though); I am not sure what to do with them; I think they should raise actually, pushing off to 0.14 in any event

@jreback

Contributor

commented Jan 29, 2014

df.x = something works, but it is just an attribute assignment; it blows up if x is a column (see the referenced issue, which is about guarding against assigning to a method, but it's the same idea)
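
The difference between attribute assignment and `__setitem__` can be sketched as follows. This assumes a recent pandas release, where behavior may differ from v0.13.0: assigning a list to an attribute name that is not already a column only sets a plain Python attribute (and emits a `UserWarning`). The names `z` and `w` are hypothetical:

```python
import warnings

import pandas as pd

df = pd.DataFrame({"x": [1, 2]})

with warnings.catch_warnings():
    # Recent pandas warns that columns cannot be created this way.
    warnings.simplefilter("ignore")
    df.z = [10, 20]  # plain attribute on the object, not a column

df["w"] = [10, 20]  # __setitem__ creates a real column

print(list(df.columns))
```

Using `df['name'] = ...` consistently avoids this ambiguity.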

@jreback

Contributor

commented Jan 29, 2014

@aldanor FYI, I fixed the assignments to an empty frame; if you are finding bugs, great!

but I would not use this in practice as you will almost always get object dtype

so not that useful
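
One alternative to filling an empty frame column by column is to construct the frame from complete data, so that dtypes are inferred up front. This is a minimal sketch assuming a recent pandas release; the column values are made up for illustration:

```python
import pandas as pd
from pandas.api.types import is_float_dtype, is_integer_dtype

# Build the frame in one step instead of assigning into an empty one.
df = pd.DataFrame({"x": [1, 2], "y": [0.5, 1.5]})

# Each column gets a dtype inferred from its own values.
print(df.dtypes)
print(is_integer_dtype(df["x"]), is_float_dtype(df["y"]))
```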

@aldanor

Contributor Author

commented Jan 29, 2014

Yea I was just using the code above for cheap and dirty unit tests and was surprised to see it blows up (in many different ways). What do you mean by "always get the object dtype" though? df.x = [1, 2] would result in np.int64 wouldn't it? Does for me at least.

Thanks for the quick PR; will move the .loc stuff to another issue.

@aldanor

Contributor Author

commented Jan 29, 2014

Opened #6173 for .loc bug.

@jreback

Contributor

commented Jan 29, 2014

df.x = [1,2] is actually ok since you defined the column name already!

I am talking about

df = pd.DataFrame(columns=['x','y'])
df['x'] = [1,2]
df.dtypes

@aldanor

Contributor Author

commented Jan 29, 2014

Hmm...

>>> df = pd.DataFrame(columns=['x', 'y'])
>>> df.dtypes
x   NaN
y   NaN
dtype: float64
>>> (df.x.dtype, df.y.dtype)
(dtype('O'), dtype('O'))
>>> df['x'] = [1, 2]
>>> df.dtypes
x     int64
y    object

So x gets promoted to int64 and y stays object, which looks ok? Btw, why is the dtype of the empty dataframe reported as float64 when both columns are object-type?

@jreback

Contributor

commented Jan 29, 2014

that's right...was thinking of something else.

an empty (index wise) frame by definition will have object dtypes.

as I said, you should normally not do this; you can get into all sorts of weird things.

a data frame does not have a dtype at all (it's a Series property)

I am not sure what you are printing out

pd.DataFrame(columns=['x', 'y']).dtypes
x     object
y     object
dtype: object

@aldanor

Contributor Author

commented Jan 29, 2014

Yea, I know data frame has no dtype of its own, it's more of a cosmetic thing in this case; weird though -- at least on v0.13.0 / numpy 1.8.0 -- I get:

pd.DataFrame(columns=['x', 'y']).dtypes
x   NaN
y   NaN
dtype: float64

Does this behave differently on master?

@jreback

Contributor

commented Jan 29, 2014

yes, IIRC I did fix this after 0.13

can you test things out on master if you suspect bugs? I really do appreciate the reports!

sometimes I don't remember fixing things (and issues get confused), though we are pretty organized on that.

@jreback

Contributor

commented Jan 29, 2014

and you are seeing the dtype of the resultant Series that .dtypes returns
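
To illustrate the point: `.dtypes` returns a Series indexed by column name, and the trailing `dtype:` line in its printout describes that Series itself, not the frame's columns. This is a small sketch assuming a recent pandas release, with made-up column values:

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 2], "y": [0.5, 1.5]})

col_dtypes = df.dtypes  # a Series: one entry per column

print(type(col_dtypes).__name__)  # the container is a Series
print(col_dtypes.dtype)           # dtype of that Series itself
print(col_dtypes["x"])            # dtype of column x
```

To check a single column, `df['x'].dtype` is less ambiguous than reading the footer of the `.dtypes` output.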
