Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

KeyError from pandas DataFrame groupby for Windows based csv files #16690

Closed
suntong opened this issue Jun 13, 2017 · 8 comments
Closed

KeyError from pandas DataFrame groupby for Windows based csv files #16690

suntong opened this issue Jun 13, 2017 · 8 comments
Labels
IO CSV read_csv, to_csv

Comments

@suntong
Copy link

suntong commented Jun 13, 2017

Code Sample, a copy-pastable example if possible

# Your code here
df = pd.read_csv('test.csv')
df.tail(5)
df.info()
df.columns
df.groupby(['Id'])

Problem description

The df.groupby(['Id']) threw an exception:

KeyErrorTraceback (most recent call last)
<ipython-input-24-bba5c2dc5f75> in <module>()
----> 1 df.groupby(['Id'])

/usr/local/lib/python2.7/dist-packages/pandas/core/generic.pyc in groupby(self, by, axis, level, as_index, sort, group_keys, squeeze, **kwargs)
   3776         return groupby(self, by=by, axis=axis, level=level, as_index=as_index,
   3777                        sort=sort, group_keys=group_keys, squeeze=squeeze,
-> 3778                        **kwargs)
...
/usr/local/lib/python2.7/dist-packages/pandas/core/internals.pyc in get(self, item, fastpath)
   3288 
   3289             if not isnull(item):
-> 3290                 loc = self.items.get_loc(item)
   3291             else:
   3292                 indexer = np.arange(len(self.items))[isnull(self.items)]

/usr/local/lib/python2.7/dist-packages/pandas/indexes/base.pyc in get_loc(self, key, method, tolerance)
   1945                 return self._engine.get_loc(key)
   1946             except KeyError:
-> 1947                 return self._engine.get_loc(self._maybe_cast_indexer(key))
   1948 
   1949         indexer = self.get_indexer([key], method=method, tolerance=tolerance)

pandas/index.pyx in pandas.index.IndexEngine.get_loc (pandas/index.c:4154)()

pandas/index.pyx in pandas.index.IndexEngine.get_loc (pandas/index.c:4018)()

pandas/hashtable.pyx in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:12368)()

pandas/hashtable.pyx in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:12322)()

KeyError: 'Id'

Expected Output

No exception. Returns pandas.core.groupby.DataFrameGroupBy object.

Output of pd.show_versions()

# Paste the output here pd.show_versions() here

INSTALLED VERSIONS

commit: None
python: 2.7.12.final.0
python-bits: 64
OS: Linux
OS-release: 4.4.0-57-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: en_US.UTF-8
LANG: en_US.UTF-8

pandas: 0.18.1
nose: 1.3.4
pip: 8.1.2
setuptools: None
Cython: 0.19.2
numpy: 1.11.1
scipy: 0.17.1
statsmodels: 0.8.0
xarray: None
IPython: 5.0.0
sphinx: None
patsy: 0.4.1
dateutil: 2.5.3
pytz: 2016.4
blosc: None
bottleneck: None
tables: 3.0.0
numexpr: 2.2.2
matplotlib: 2.0.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 0.999
httplib2: None
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: 2.4.5 (dt dec mx pq3 ext)
jinja2: 2.8
boto: None
pandas_datareader: None

Further explanation

There are time-formatted string in the csv file, but that's not the cause:

$ head test.csv
Id,Time,Val
T5115,00:00:00,20.819
T5115,00:00:03,16.926
T5115,00:00:08,11.343
T5115,00:00:17,6.975
T5115,00:00:29,13.553
T5115,00:00:35,7.082
T5115,00:00:49,8.147
T5115,00:01:03,12.317
T5115,00:01:07,6.171

I am certain that the real problem is the format of the file -- the "test.csv" is Windows based, and is output from SQL Server SSMS.

Using file test.csv under Linux shows:

test.csv: UTF-8 Unicode (with BOM) text, with CRLF line terminators

Here are the top several bytes from the file:

0000000 ef bb bf 49 64 2c 54 69 - 6d 65 2c 56 61 6c 0d 0a  Id,Time,Val..
0000020 54 35 31 31 35 2c 30 30 - 3a 30 30 3a 30 30 2c 32  T5115,00:00:00,2
0000040 30 2e 38 31 39 0d 0a 54 - 35 31 31 35 2c 30 30 3a  0.819..T5115,00:
0000060 30 30 3a 30 33 2c 31 36 - 2e 39 32 36 0d 0a 54 35  00:03,16.926..T5
0000100 31 31 35 2c 30 30 3a 30 - 30 3a 30 38 2c 31 31 2e  115,00:00:08,11.
0000120 33 34 33 0d 0a 54 35 31 - 31 35 2c 30 30 3a 30 30  343..T5115,00:00
0000140 3a 31 37 2c 36 2e 39 37 - 35 0d 0a 54 35 31 31 35  :17,6.975..T5115
0000160 2c 30 30 3a 30 30 3a 32 - 39 2c 31 33 2e 35 35 33  ,00:00:29,13.553
0000200 0d 0a 54 35 31 31 35 2c - 30 30 3a 30 30 3a 33 35  ..T5115,00:00:35

This is very important and the root cause. Proofs:

  1. I opened, copied & saved the exactly content using Notepad++, and there won't be such problem with the newly saved file.
  2. If I convert it using dos2unix under Linux, then try the above same code, it would work. The groupby will not threw exception any more.
$ file test.csv 
test.csv: ASCII text

Here are the top several bytes from the working file:

0000000 49 64 2c 54 69 6d 65 2c - 56 61 6c 0a 54 35 31 31  Id,Time,Val.T511
0000020 35 2c 30 30 3a 30 30 3a - 30 30 2c 32 30 2e 38 31  5,00:00:00,20.81
0000040 39 0a 54 35 31 31 35 2c - 30 30 3a 30 30 3a 30 33  9.T5115,00:00:03
0000060 2c 31 36 2e 39 32 36 0a - 54 35 31 31 35 2c 30 30  ,16.926.T5115,00
0000100 3a 30 30 3a 30 38 2c 31 - 31 2e 33 34 33 0a 54 35  :00:08,11.343.T5

@TomAugspurger
Copy link
Contributor

Can you paste the output of df.columns? And try on a newer version of pandas, I suspect this was fixed in #13885

@TomAugspurger
Copy link
Contributor

You may need encoding='utf-8-sig'

@TomAugspurger TomAugspurger added the IO CSV read_csv, to_csv label Jun 13, 2017
@suntong
Copy link
Author

suntong commented Jun 13, 2017

@TomAugspurger, oh, sorry, the full question/info is from https://stackoverflow.com/questions/44508502/. Here are the part you wanted;

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 165 entries, 0 to 164
Data columns (total 3 columns):
Id     165 non-null object
Time    165 non-null object
Val     165 non-null float64
dtypes: float64(1), object(2)
memory usage: 3.9+ KB

df.columns
Index([u'Id', u'Time', u'Val'], dtype='object')

Note that, using df.columns = df.columns.map(str.strip) as suggested doesn't make any different -- I'm still getting the exact same output from df.columns and error as above:

df.columns = df.columns.map(str.strip)
df.columns
Out[38]:
Index([u'Id', u'Time', u'Val'], dtype='object')

@suntong
Copy link
Author

suntong commented Jun 13, 2017

OK, I'll try with newer version of Python (v3), but for pandas, I'm not sure, because I'm using IBM's datascientistworkbench.com, which updates their tool chain quite often, but out of my control.

So how can I use the encoding='utf-8-sig' @TomAugspurger? I'm not a Python programmer, so please bear with me. :)

@suntong
Copy link
Author

suntong commented Jun 13, 2017

Is pandas: 0.18.1 not recently enough? That bug was nearly a year ago.

@TomAugspurger
Copy link
Contributor

The root problem is that you have a BOM (U+FEFF) at the start of the file. Older versions of pandas failed to strip this properly, but that's been fixed.

Note that, using df.columns = df.columns.map(str.strip) as suggested doesn't make any different

I don't think BOMs are considered whitespace by python, so they won't be stripped.
You could do df.columns = [u'Id', u'Time', u'Val'].

So how can I use the encoding='utf-8-sig'

pass it to pd.read_csv. I can't recall if it's necessary, though I think it is.

Is pandas: 0.18.1 not recently enough? That bug was nearly a year ago.

That fix was in 0.19 (check the milestone on the pull request).

@TomAugspurger
Copy link
Contributor

And the problem being a BOM at the start of the file is just a guess. I could be wrong.

@suntong
Copy link
Author

suntong commented Jun 13, 2017

Hi @TomAugspurger, you are absolutely right. Using encoding='utf-8-sig' solve the problem (I've double checked). Thanks!

@suntong suntong closed this as completed Jun 13, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
IO CSV read_csv, to_csv
Projects
None yet
Development

No branches or pull requests

2 participants