Error converting DataFrame with duplicate columns to ndarray #2236

Closed
jpindi opened this issue Nov 12, 2012 · 5 comments

jpindi commented Nov 12, 2012

I installed the latest version of pandas (0.9.0) in case this was an error in an older release.
I'm trying to read an Excel file. That part seems OK.
Originally I was using iteritems() to go through each row of the pandas DataFrame, because id_company had to be verified against a MySQL database (code not included). That gave the same or a similar error message as converting the rows to tuples (code is below). The error message follows.

Note there is a .reindex() call, but the code failed without it, too. The reindex() was kind of a Hail Mary.

As a workaround, I'm probably going to simply import from my target SQL database and do a join. I'm concerned about that because of the size of the datasets.

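import pandas as pd

def runNow():
    # Identify the workbook and sheet (raw string keeps the backslashes literal)
    source = r'C:\Users\jlalonde\Desktop\startup_geno\startupgenome_w_id_xl_20121109.xlsx'
    xls_file = pd.ExcelFile(source)
    sd = xls_file.parse('Sheet1')
    # Keep the first row for each id_company (pandas 0.9.0 keyword names)
    source_u = sd.drop_duplicates(cols='id_company', take_last=False)
    # Note: 'description' is listed twice, so source_r has duplicate column labels
    source_r = source_u[['id_company','id_good','description', 'website','keyword', 'company_name','founded_month', 'founded_year', 'description']]
    source_i = source_r.reindex() #hail mary
    # Convert each row to a tuple; accessing .values is what raises
    tup_r = [tuple(x) for x in source_i.values]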

Traceback (most recent call last):
  File "<pyshell#10>", line 1, in <module>
    sg_sql_2.runNow()
  File "sg_sql_2.py", line 31, in runNow
    tup_r = [tuple(x) for x in source_r.values]
  File "C:\Python27\lib\site-packages\pandas\core\frame.py", line 1443, in as_matrix
    return self._data.as_matrix(columns).T
  File "C:\Python27\lib\site-packages\pandas\core\internals.py", line 723, in as_matrix
    mat = self._interleave(self.items)
  File "C:\Python27\lib\site-packages\pandas\core\internals.py", line 743, in _interleave
    indexer = items.get_indexer(block.items)
  File "C:\Python27\lib\site-packages\pandas\core\index.py", line 748, in get_indexer
    raise Exception('Reindexing only valid with uniquely valued Index '
Exception: Reindexing only valid with uniquely valued Index objects

wesm commented Nov 13, 2012

Any chance you could e-mail me the excel file? wesmckinn at gmail

jpindi commented Nov 13, 2012

Done! Just a note that I am using Python 2.7 and the Excel file was created using Excel 2010 on Windows. Thx.

wesm commented Nov 13, 2012

The reindexing error is a red herring. I updated the title to reflect the underlying bug and I'll take a look now.
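
For readers following along, here is a minimal sketch of the condition behind the report, not the original spreadsheet. The small DataFrame and its values are hypothetical stand-ins; only the column names come from the code above. It assumes a recent pandas release, where `Index.is_unique` and `Index.duplicated()` are available. Selecting the same column twice produces duplicate column labels, which is the state in which `.values` raised on 0.9.0.

```python
import pandas as pd

# Hypothetical stand-in for the Excel sheet; values are made up for illustration.
df = pd.DataFrame({
    "id_company": [1, 2],
    "description": ["alpha", "beta"],
    "website": ["a.example", "b.example"],
})

# Listing 'description' twice, as in the report, yields duplicate column labels.
selected = df[["id_company", "description", "website", "description"]]

# Detect the duplicates before converting the frame to an ndarray.
has_duplicates = not selected.columns.is_unique
duplicated_labels = sorted(set(selected.columns[selected.columns.duplicated()]))
print(has_duplicates, duplicated_labels)
```

A check like this makes it easier to tell a duplicate-label problem apart from a genuine reindexing problem.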

wesm closed this as completed in d56d0e6 on Nov 13, 2012

jpindi commented Nov 14, 2012

You rock, Wes. Do I have to reinstall or do anything else? Just say the word...

wesm commented Nov 14, 2012

Yeah, you'll have to install the dev version or wait until 0.9.1 is cut this week. There are various workarounds you could use; basically you just have to make sure the columns don't have duplicate entries.
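
One possible workaround along those lines, as a sketch rather than the fix that went into pandas: drop the repeated column labels before asking for `.values`. The helper name `drop_duplicate_columns` and the sample data are hypothetical, and the code assumes a recent pandas release; `Index.duplicated()` may not be available on 0.9.0, where simply not listing a column twice in the selection has the same effect.

```python
import pandas as pd


def drop_duplicate_columns(frame):
    """Return a frame that keeps only the first occurrence of each column label.

    Hypothetical helper for illustration; not part of pandas.
    """
    return frame.loc[:, ~frame.columns.duplicated()]


# Hypothetical data standing in for the Excel sheet from the report.
df = pd.DataFrame({"id_company": [1, 2], "description": ["alpha", "beta"]})

# Selecting 'description' twice reproduces the duplicate-label situation.
selected = df[["id_company", "description", "description"]]

# With unique labels, converting each row to a tuple is straightforward.
unique = drop_duplicate_columns(selected)
rows = [tuple(x) for x in unique.values]
```

Keeping the first occurrence is an arbitrary choice here; if the repeated columns hold different data, renaming them is the safer route.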

yarikoptic added a commit to neurodebian/pandas that referenced this issue Nov 15, 2012
* commit 'v0.9.1rc1-27-ge374f0f': (52 commits)
  BUG: axes.color_cycle from mpl rcParams should not be joined as single string
  BUG: icol duplicate columns with integer sequence failure. close pandas-dev#2228
  TST: unit test for pandas-dev#2214
  BUG: coerce ndarray dtype to object when comparing series
  ENH: make vbench_suite/run_suite executable
  ENH: Use __file__ to determine REPO_PATH in vb_suite/suite.py
  BUG: 1 ** NA issue in computing new fill value in SparseSeries. close pandas-dev#2220
  BUG: make inplace semantics of DataFrame.where consistent. pandas-dev#2230
  BUG: fix internal error in constructing DataFrame.values with duplicate column names. close pandas-dev#2236
  added back mask method that does condition inversion added condition testing to where that raised ValueError on an invalid condition (e.g. not an ndarray like object) added tests for same
  in core/frame.py
  TST: getting column from and applying op to a df should commute
  TST: add dual ( x op y <-> y op x ) tests for arith operators
  BUG: Incorrect error message due to zero based levels. close pandas-dev#2226
  fixed file modes for core/frame.py, test/test_frame.py
  relaxed __setitem__ restriction on boolean indexing a frame on an equal sized frame
  in core/frame.py
  ENH: warn user when invoking to_dict() on df with non-unique columns
  BUG: modify df.iteritems to support duplicate column labels pandas-dev#2219
  TST: df.iteritems() should yield Series even with non-unique column labels
  ...