BUG: DataFrame.insert with allow_duplicates=True fails when already duplicates present #14291

Closed
mbochk opened this issue Sep 23, 2016 · 3 comments
Labels
Bug Reshaping Concat, Merge/Join, Stack/Unstack, Explode

mbochk commented Sep 23, 2016

With DataFrame.insert, the allow_duplicates option works, but only once.
When I already have 2 columns with the same name, adding a third throws

ValueError: Wrong number of items passed 2, placement implies 1

Code Sample, a copy-pastable example if possible

import pandas as pd

a = pd.DataFrame()
a.insert(0, "qwe", [1, 2, 3, 4], allow_duplicates=True)
a.insert(0, "qwe", [1, 2, 3, 4], allow_duplicates=True)
a.insert(0, "qwe", [1, 2, 3, 4], allow_duplicates=True)  # raises ValueError

Expected Output

   qwe  qwe  qwe
0    1    1    1
1    2    2    2
2    3    3    3
3    4    4    4

output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 2.7.12.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 58 Stepping 9, GenuineIntel
byteorder: little
LC_ALL: None
LANG: ru_RU

pandas: 0.18.1
nose: 1.3.7
pip: 8.1.2
setuptools: 25.1.6
Cython: 0.24.1
numpy: 1.11.1
scipy: 0.18.0
statsmodels: 0.6.1
xarray: None
IPython: 5.1.0
sphinx: 1.4.6
patsy: 0.4.1
dateutil: 2.5.3
pytz: 2016.6.1
blosc: None
bottleneck: 1.0.0
tables: 3.2.2
numexpr: 2.6.1
matplotlib: 1.5.3
openpyxl: 2.3.2
xlrd: 1.0.0
xlwt: 1.1.2
xlsxwriter: 0.9.2
lxml: 3.6.4
bs4: 4.4.1
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.0.13
pymysql: 0.7.6.None
psycopg2: None
jinja2: 2.8
boto: 2.40.0
pandas_datareader: None

@jorisvandenbossche
Member

@mbochk That does indeed look like a bug. Thanks for reporting.

@jorisvandenbossche changed the title from "allow_duplicates doesn't work while several duplicates present" to "BUG: DataFrame.insert with allow_duplicates=True fails when already duplicates present" on Sep 23, 2016
@shawnheide
Contributor

I looked into this and discovered that the problem is in frame.py, in the _sanitize_column method. Here's the relevant code:

# broadcast across multiple columns if necessary
if key in self.columns and value.ndim == 1:
    if (not self.columns.is_unique or
            isinstance(self.columns, MultiIndex)):
        existing_piece = self[key]
        if isinstance(existing_piece, DataFrame):
            value = np.tile(value, (len(existing_piece.columns), 1))

The third time insert is called, existing_piece is a two-column DataFrame holding the previous values, so value gets tiled into a 2-D array. I'm not sure how to fix this, though, as I don't understand why the values are being broadcast in the first place. Any thoughts?
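
To make the shape mismatch concrete, here is a minimal sketch of what the np.tile call in the excerpt does to a 1-D value. It uses NumPy only; n_existing is a hypothetical stand-in for len(existing_piece.columns) on the third insert, and this is an illustration of the tiling step, not the pandas code path itself.

```python
import numpy as np

# The 1-D column passed to insert in the report.
value = np.array([1, 2, 3, 4])

# Assumption for illustration: two "qwe" columns already exist,
# so len(existing_piece.columns) would be 2 on the third insert.
n_existing = 2

# Same call as in the excerpt, with the column count substituted.
tiled = np.tile(value, (n_existing, 1))

print(value.shape)  # (4,)
print(tiled.shape)  # (2, 4)
```

A block of two rows is consistent with the "Wrong number of items passed 2, placement implies 1" message: insert only has room for a single new column.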

@jorisvandenbossche
Member

What happens here is needed when you are setting a certain column (e.g. df[key] = value). If key is a duplicate column name, the value has to be broadcast to fit into those multiple columns.
But of course this part of _sanitize_column is not needed for an insert operation.

@jreback added the Reshaping (Concat, Merge/Join, Stack/Unstack, Explode) and Difficulty Intermediate labels on Oct 5, 2016
@jreback added this to the Next Major Release milestone on Oct 5, 2016
@jreback modified the milestones: 0.19.1, Next Major Release on Oct 19, 2016
jorisvandenbossche pushed a commit to jorisvandenbossche/pandas that referenced this issue Nov 2, 2016