Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Assign to df with repeated column fails #6120

Closed
dbew opened this issue Jan 27, 2014 · 3 comments · Fixed by #6122

Comments

@dbew
Copy link
Contributor

commented Jan 27, 2014

If you have a DataFrame with a repeated or non-unique column, then some assignments fail.

df = pd.DataFrame(np.random.randn(10,2), columns=['that', 'that'])

df2
Out[10]: 
   that  that
0     1     1
1     1     1
2     1     1
3     1     1
4     1     1
5     1     1
6     1     1
7     1     1
8     1     1
9     1     1

[10 rows x 2 columns]

This is float data and the following works:

df['that'] = 1.0

However, this fails with an error and breaks the dataframe (e.g. a subsequent repr will also fail.)

df2['that'] = 1
Traceback (most recent call last):
  File "/users/is/dbew/pyenvs/timeseries/lib/python2.7/site-packages/ipython-1.1.0_1_ahl1-py2.7.egg/IPython/core/interactiveshell.py", line 2830, in run_code
    exec code_obj in self.user_global_ns, self.user_ns
  File "<ipython-input-11-8701f5b0efe4>", line 1, in <module>
    df2['that'] = 1
  File "/users/is/dbew/pyenvs/timeseries/lib/python2.7/site-packages/pandas-0.13.0_292_g4dcecb0-py2.7-linux-x86_64.egg/pandas/core/frame.py", line 1879, in __setitem__
    self._set_item(key, value)
  File "/users/is/dbew/pyenvs/timeseries/lib/python2.7/site-packages/pandas-0.13.0_292_g4dcecb0-py2.7-linux-x86_64.egg/pandas/core/frame.py", line 1960, in _set_item
    NDFrame._set_item(self, key, value)
  File "/users/is/dbew/pyenvs/timeseries/lib/python2.7/site-packages/pandas-0.13.0_292_g4dcecb0-py2.7-linux-x86_64.egg/pandas/core/generic.py", line 1057, in _set_item
    self._data.set(key, value)
  File "/users/is/dbew/pyenvs/timeseries/lib/python2.7/site-packages/pandas-0.13.0_292_g4dcecb0-py2.7-linux-x86_64.egg/pandas/core/internals.py", line 2968, in set
    _set_item(item, arr[None, :])
  File "/users/is/dbew/pyenvs/timeseries/lib/python2.7/site-packages/pandas-0.13.0_292_g4dcecb0-py2.7-linux-x86_64.egg/pandas/core/internals.py", line 2927, in _set_item
    self._add_new_block(item, arr, loc=None)
  File "/users/is/dbew/pyenvs/timeseries/lib/python2.7/site-packages/pandas-0.13.0_292_g4dcecb0-py2.7-linux-x86_64.egg/pandas/core/internals.py", line 3108, in _add_new_block
    new_block = make_block(value, self.items[loc:loc + 1].copy(),
TypeError: unsupported operand type(s) for +: 'slice' and 'int'

I stepped through the code and it looked like most places handle repeated columns ok except the code that reallocates arrays when the dtype changes.

I've tested this against pandas 0.13.0 and the latest master. Here's the output of installed versions when running on the master:

commit: None
python: 2.7.3.final.0
python-bits: 64
OS: Linux
OS-release: 2.6.18-308.el5
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_GB

pandas: 0.13.0-292-g4dcecb0
Cython: 0.16
numpy: 1.7.1
scipy: 0.9.0
statsmodels: None
patsy: None
scikits.timeseries: None
dateutil: 1.5
pytz: None
bottleneck: 0.6.0
tables: 2.3.1-1
numexpr: 2.0.1
matplotlib: 1.1.1
openpyxl: None
xlrd: 0.8.0
xlwt: None
xlsxwriter: None
sqlalchemy: None
lxml: 2.3.6
bs4: None
html5lib: None
bq: None
apiclient: None

@jreback

This comment has been minimized.

Copy link
Contributor

commented Jan 27, 2014

looks like an untested case; i'll take look

@jreback

This comment has been minimized.

Copy link
Contributor

commented Jan 27, 2014

was an untested / missing case...thanks

@dbew

This comment has been minimized.

Copy link
Contributor Author

commented Jan 27, 2014

Wow, that was fast! Thanks very much.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
2 participants
You can’t perform that action at this time.