BUG: DataFrame.insert with allow_duplicates=True fails when already duplicates present #14291

Closed
mbochk opened this Issue Sep 23, 2016 · 3 comments


mbochk commented Sep 23, 2016 edited by jorisvandenbossche

With DataFrame.insert, the allow_duplicates option works, but only once.
When I already have 2 columns with the same name, adding a third throws

ValueError: Wrong number of items passed 2, placement implies 1

Code Sample, a copy-pastable example if possible

import pandas as pd

a = pd.DataFrame()
a.insert(0, "qwe", [1,2,3,4], allow_duplicates=True)
a.insert(0, "qwe", [1,2,3,4], allow_duplicates=True)
a.insert(0, "qwe", [1,2,3,4], allow_duplicates=True)

Expected Output

   qwe  qwe  qwe
0    1    1    1
1    2    2    2
2    3    3    3
3    4    4    4
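
A small script along these lines can confirm whether a given pandas installation is affected. It is only a sketch: the `outcome` variable is introduced here for illustration, and the report itself only establishes that 0.18.1 fails on the third insert.

```python
import pandas as pd

a = pd.DataFrame()
try:
    # Insert the same column name three times, as in the report.
    for _ in range(3):
        a.insert(0, "qwe", [1, 2, 3, 4], allow_duplicates=True)
    outcome = "ok"
except ValueError as exc:
    # Affected versions (0.18.1 in the report) raise on the third insert.
    outcome = "ValueError: {}".format(exc)

print(pd.__version__, outcome)
print(list(a.columns))
```

On a version that includes the fix, all three inserts should succeed and the frame should end up with three identically named columns.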

output of pd.show_versions()

INSTALLED VERSIONS
------------------

commit: None
python: 2.7.12.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 58 Stepping 9, GenuineIntel
byteorder: little
LC_ALL: None
LANG: ru_RU

pandas: 0.18.1
nose: 1.3.7
pip: 8.1.2
setuptools: 25.1.6
Cython: 0.24.1
numpy: 1.11.1
scipy: 0.18.0
statsmodels: 0.6.1
xarray: None
IPython: 5.1.0
sphinx: 1.4.6
patsy: 0.4.1
dateutil: 2.5.3
pytz: 2016.6.1
blosc: None
bottleneck: 1.0.0
tables: 3.2.2
numexpr: 2.6.1
matplotlib: 1.5.3
openpyxl: 2.3.2
xlrd: 1.0.0
xlwt: 1.1.2
xlsxwriter: 0.9.2
lxml: 3.6.4
bs4: 4.4.1
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.0.13
pymysql: 0.7.6.None
psycopg2: None
jinja2: 2.8
boto: 2.40.0
pandas_datareader: None

@mbochk That looks like a bug indeed. Thanks for reporting.

jorisvandenbossche changed the title from "allow_duplicates doesn't work while several duplicates present" to "BUG: DataFrame.insert with allow_duplicates=True fails when already duplicates present" Sep 23, 2016


shawnheide commented Oct 4, 2016

I looked into this and discovered that the problem is in frame.py, in the _sanitize_column method. Here's the relevant code:

# broadcast across multiple columns if necessary
if key in self.columns and value.ndim == 1:
    if (not self.columns.is_unique or
            isinstance(self.columns, MultiIndex)):
        existing_piece = self[key]
        if isinstance(existing_piece, DataFrame):
            value = np.tile(value, (len(existing_piece.columns), 1))

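On the third call to insert, key is already duplicated, so existing_piece is a DataFrame holding the two existing columns, and value is tiled into a 2-D array with one copy per existing column. That no longer fits the single column being inserted, hence the ValueError. I'm not sure how to fix this, though, as I don't understand why the values are being broadcast in the first place. Any thoughts?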
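
To make the shape mismatch concrete, here is a minimal NumPy-only sketch of the tiling step from the snippet above. The name `n_existing` is a hypothetical stand-in for `len(existing_piece.columns)` at the time of the third insert; it does not appear in pandas.

```python
import numpy as np

value = np.array([1, 2, 3, 4])

# Hypothetical stand-in for len(existing_piece.columns):
# two "qwe" columns already exist before the third insert.
n_existing = 2

# Same call shape as in the quoted snippet: one copy of value per existing column.
tiled = np.tile(value, (n_existing, 1))

print(value.shape)  # (4,)
print(tiled.shape)  # (2, 4)
```

The tiled array carries two copies of the data, while insert only has room for one new column, which matches the "passed 2, placement implies 1" wording of the error.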

What happens here is needed when you are setting a column (e.g. df[key] = value). If key is a duplicated column name, the value has to be broadcast to fill all of the columns with that name.
But of course this part of _sanitize_column is not needed for an insert operation, which always adds exactly one new column.
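
As a rough illustration of that broadcasting, the sketch below assigns one list to a duplicated column name. The frame and its values are made up for the example, and it assumes a reasonably recent pandas release rather than 0.18.1 specifically. Three rows are used so the list cannot be read as one value per duplicated column.

```python
import pandas as pd

# Made-up frame with a duplicated column name "a" and a unique column "b".
df = pd.DataFrame(
    [[1, 2, 3], [4, 5, 6], [7, 8, 9]],
    columns=["a", "a", "b"],
)

# A single 1-D value is assigned to a key that matches two columns,
# so it has to be broadcast to both of them.
df["a"] = [10, 20, 30]

print(df)
```

Both "a" columns should now hold the assigned values while "b" is left alone, which is the setitem case the broadcasting branch exists for.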

jreback added this to the Next Major Release milestone Oct 5, 2016

jreback modified the milestone: 0.19.1, Next Major Release Oct 19, 2016

jreback closed this in 2e77536 Oct 24, 2016

jorisvandenbossche added a commit to jorisvandenbossche/pandas that referenced this issue Nov 2, 2016

paul-mannino + jorisvandenbossche [Backport #14431] BUG: Fix issue with inserting duplicate columns in a dataframe

closes #14291
closes #14431

(cherry picked from commit 2e77536)
f77c108

amolkahat added a commit to amolkahat/pandas that referenced this issue Nov 26, 2016

paul-mannino + amolkahat BUG: Fix issue with inserting duplicate columns in a dataframe
closes #14291
closes #14431
a49baeb