BUG: DataFrame.insert with allow_duplicates=True fails when already duplicates present #14291

Closed
mbochk opened this Issue Sep 23, 2016 · 3 comments

mbochk commented Sep 23, 2016

With DataFrame.insert, the allow_duplicates option works, but only once.
When I already have 2 columns with the same name, adding a third throws:

ValueError: Wrong number of items passed 2, placement implies 1

Code Sample, a copy-pastable example if possible

import pandas as pd

a = pd.DataFrame()
a.insert(0, "qwe", [1, 2, 3, 4], allow_duplicates=True)
a.insert(0, "qwe", [1, 2, 3, 4], allow_duplicates=True)
a.insert(0, "qwe", [1, 2, 3, 4], allow_duplicates=True)  # raises ValueError

Expected Output

   qwe  qwe  qwe
0    1    1    1
1    2    2    2
2    3    3    3
3    4    4    4

output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.12.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 58 Stepping 9, GenuineIntel
byteorder: little
LC_ALL: None
LANG: ru_RU

pandas: 0.18.1
nose: 1.3.7
pip: 8.1.2
setuptools: 25.1.6
Cython: 0.24.1
numpy: 1.11.1
scipy: 0.18.0
statsmodels: 0.6.1
xarray: None
IPython: 5.1.0
sphinx: 1.4.6
patsy: 0.4.1
dateutil: 2.5.3
pytz: 2016.6.1
blosc: None
bottleneck: 1.0.0
tables: 3.2.2
numexpr: 2.6.1
matplotlib: 1.5.3
openpyxl: 2.3.2
xlrd: 1.0.0
xlwt: 1.1.2
xlsxwriter: 0.9.2
lxml: 3.6.4
bs4: 4.4.1
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.0.13
pymysql: 0.7.6.None
psycopg2: None
jinja2: 2.8
boto: 2.40.0
pandas_datareader: None

Member

jorisvandenbossche commented Sep 23, 2016

@mbochk That looks like a bug indeed. Thanks for reporting.

jorisvandenbossche changed the title from "allow_duplicates doesn't work while several duplicates present" to "BUG: DataFrame.insert with allow_duplicates=True fails when already duplicates present" Sep 23, 2016

Contributor

shawnheide commented Oct 4, 2016

I looked into this and discovered that the problem is in frame.py, in the _sanitize_column method. Here's the relevant code:

# broadcast across multiple columns if necessary
if key in self.columns and value.ndim == 1:
    if (not self.columns.is_unique or
            isinstance(self.columns, MultiIndex)):
        existing_piece = self[key]
        if isinstance(existing_piece, DataFrame):
            value = np.tile(value, (len(existing_piece.columns), 1))

On the third call to insert, existing_piece is a DataFrame holding the two existing columns, so value gets tiled into a 2-D array. I'm not sure how to fix this, though, as I don't understand why the values are being broadcast in the first place. Any thoughts?
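
The shape change caused by the np.tile call above can be reproduced with NumPy alone. This is only a sketch: the names value, n_existing and tiled are chosen for illustration, and n_existing stands in for len(existing_piece.columns) on the third insert.

```python
import numpy as np

# The 1-D column passed to insert (same values as in the report).
value = np.array([1, 2, 3, 4])

# Assumption for illustration: two "qwe" columns already exist,
# so len(existing_piece.columns) would be 2.
n_existing = 2

# Same call as in the quoted snippet: repeat the column once per existing column.
tiled = np.tile(value, (n_existing, 1))

print(value.shape)  # 1-D input
print(tiled.shape)  # 2-D result, one row per existing column
```

A 2-row array is what a plain column assignment over two same-named columns needs, but insert only adds a single column, which lines up with the "passed 2, placement implies 1" wording of the error.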

Member

jorisvandenbossche commented Oct 4, 2016

What happens here is needed when you are setting a certain column (e.g. df[key] = value). If key is a duplicated column name, the value has to be broadcast to fill all of the columns with that name.
But of course this part of _sanitize_column is not needed for an insert operation.

jreback added this to the Next Major Release milestone Oct 5, 2016

jreback modified the milestones: 0.19.1, Next Major Release Oct 19, 2016

jreback closed this in 2e77536 Oct 24, 2016

jorisvandenbossche added a commit to jorisvandenbossche/pandas that referenced this issue Nov 2, 2016

[Backport #14431] BUG: Fix issue with inserting duplicate columns in a dataframe

closes #14291
closes #14431

(cherry picked from commit 2e77536)

amolkahat added a commit to amolkahat/pandas that referenced this issue Nov 26, 2016
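
To check whether an installed pandas includes the fix, the reproduction from the report can be rerun. This is a sketch, assuming a release at or after the 0.19.1 milestone mentioned above; the different values per column are an addition for illustration, so the final column order is visible.

```python
import pandas as pd

a = pd.DataFrame()

# Each insert goes to position 0, so later inserts end up further left.
a.insert(0, "qwe", [1, 2, 3, 4], allow_duplicates=True)
a.insert(0, "qwe", [10, 20, 30, 40], allow_duplicates=True)
a.insert(0, "qwe", [100, 200, 300, 400], allow_duplicates=True)  # failed in 0.18.1

columns = list(a.columns)
first_row = a.iloc[0].tolist()

print(a)
```

If the third insert still raises ValueError, the installed version predates the fix.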
