BUG: read_csv(dtype='category') raises with many categories #18186

Closed
adbull opened this Issue Nov 9, 2017 · 5 comments

adbull commented Nov 9, 2017

Code Sample, a copy-pastable example

import io
import pandas as pd

# One column whose values are all distinct, read directly as categorical:
csv = io.StringIO('\n'.join(map(str, range(10**6))))
df = pd.read_csv(csv, dtype='category')

results in

  File "bug.py", line 5, in <module>
    df = pd.read_csv(csv, dtype='category')
  File "lib/python3.6/site-packages/pandas/io/parsers.py", line 705, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "lib/python3.6/site-packages/pandas/io/parsers.py", line 451, in _read
    data = parser.read(nrows)
  File "lib/python3.6/site-packages/pandas/io/parsers.py", line 1065, in read
    ret = self._engine.read(nrows)
  File "lib/python3.6/site-packages/pandas/io/parsers.py", line 1828, in read
    data = self._reader.read(nrows)
  File "pandas/_libs/parsers.pyx", line 894, in pandas._libs.parsers.TextReader.read
  File "pandas/_libs/parsers.pyx", line 944, in pandas._libs.parsers.TextReader._read_low_memory
  File "pandas/_libs/parsers.pyx", line 2218, in pandas._libs.parsers._concatenate_chunks
  File "lib/python3.6/site-packages/numpy/core/numerictypes.py", line 1016, in find_common_type
    array_types = [dtype(x) for x in array_types]
  File "lib/python3.6/site-packages/numpy/core/numerictypes.py", line 1016, in <listcomp>
    array_types = [dtype(x) for x in array_types]
TypeError: data type not understood

Problem description

read_csv now raises when reading a column with many unique values as a category. This appears to be a regression in 0.21.0, due to the introduction of CategoricalDtype: judging from the traceback, the low-memory C parser reads the file in several chunks, each chunk infers its own CategoricalDtype, and _concatenate_chunks then hands those non-numpy dtypes to np.find_common_type, which only accepts numpy dtypes.
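
The dependence on input size can be seen directly: inputs small enough to fit in a single parser chunk still work. A minimal illustration (the exact chunk threshold is internal to the C parser, so where "small" ends is an assumption):

import io
import pandas as pd

# Fits in a single low-memory chunk: only one CategoricalDtype is inferred,
# so _concatenate_chunks never has to unify dtypes, and this succeeds.
small = io.StringIO('\n'.join(map(str, range(10))))
pd.read_csv(small, dtype='category')

# Spans several chunks, each with its own CategoricalDtype; unifying them via
# np.find_common_type raises TypeError on 0.21.0.
big = io.StringIO('\n'.join(map(str, range(10**6))))
pd.read_csv(big, dtype='category')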

Expected Output

No exception.

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.3.final.0
python-bits: 64
OS: Linux
OS-release: 4.11.11-300.fc26.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: C
LANG: C
LOCALE: None.None

pandas: 0.21.0
pytest: 3.2.1
pip: 9.0.1
setuptools: 36.5.0.post20170921
Cython: 0.26.1
numpy: 1.13.3
scipy: 0.19.1
pyarrow: None
xarray: None
IPython: 4.2.1
sphinx: None
patsy: 0.4.1
dateutil: 2.6.1
pytz: 2017.2
blosc: None
bottleneck: 1.2.1
tables: 3.4.2
numexpr: 2.6.2
feather: None
matplotlib: 2.1.0
openpyxl: None
xlrd: 1.1.0
xlwt: None
xlsxwriter: None
lxml: None
bs4: 4.6.0
html5lib: 1.0b10
sqlalchemy: 1.1.15
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
fastparquet: 0.1.3
pandas_gbq: None
pandas_datareader: None

TomAugspurger (Contributor) commented Nov 9, 2017

Thanks for the report.

Something like

diff --git a/pandas/_libs/parsers.pyx b/pandas/_libs/parsers.pyx
index 85857c158..e3a2a2186 100644
--- a/pandas/_libs/parsers.pyx
+++ b/pandas/_libs/parsers.pyx
@@ -2228,8 +2228,9 @@ def _concatenate_chunks(list chunks):
         arrs = [chunk.pop(name) for chunk in chunks]
         # Check each arr for consistent types.
         dtypes = set([a.dtype for a in arrs])
-        if len(dtypes) > 1:
-            common_type = np.find_common_type(dtypes, [])
+        numpy_dtypes = {x for x in dtypes if not is_categorical_dtype(x)}
+        if len(numpy_dtypes) > 1:
+            common_type = np.find_common_type(numpy_dtypes, [])
             if common_type == np.object:
                 warning_columns.append(str(name))

is what we want, though that's specific to Categoricals. We would want to avoid sending any of our extension dtypes to np.find_common_type.
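
For context, a minimal reproduction of the underlying failure outside the parser (my own illustration; assumes numpy-1.13-era np.find_common_type, which coerces every argument through np.dtype, as the traceback shows):

import numpy as np
from pandas.api.types import CategoricalDtype

# CategoricalDtype is a pandas extension dtype, not a numpy dtype, so
# np.dtype(CategoricalDtype()) raises "TypeError: data type not understood"
# -- the same error as in the traceback above.
np.find_common_type([np.dtype('int64'), CategoricalDtype()], [])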


jreback (Contributor) commented Nov 9, 2017

use pandas_dtype here or find_common_type
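
For reference, a sketch of what that suggestion points at (pandas.core.dtypes.cast.find_common_type is an internal helper, so its import path and exact fallback behavior here are assumptions on my part):

import numpy as np
from pandas.api.types import CategoricalDtype
from pandas.core.dtypes.cast import find_common_type

# Unlike np.find_common_type, the pandas helper understands extension dtypes:
find_common_type([np.dtype('int64'), np.dtype('float64')])  # float64
find_common_type([CategoricalDtype(), np.dtype('int64')])   # falls back to object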

TomAugspurger (Contributor) commented Nov 9, 2017

@adbull do you have time to submit a PR with a fix like that, along with some tests?

tomanizer commented Nov 13, 2017

Is there a workaround that would let us run existing code on 0.21.0, or should users downgrade to 0.20.3?

TomAugspurger (Contributor) commented Nov 13, 2017

@tomanizer not sure off the top of my head. Reading the columns as strings and then converting to categorical, I suppose, though that may not be an option depending on memory usage. You could presumably use chunksize=..., but that may introduce problems as well.

We're doing a bugfix release Wednesday or Thursday. If you have a chance to submit a PR before then, we can get it merged.
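
A sketch of the string-then-convert workaround described above (the low_memory suggestion is my own inference from the traceback, which goes through _read_low_memory, and is not confirmed in this thread):

import io
import pandas as pd

csv = io.StringIO('\n'.join(map(str, range(10**6))))

# Workaround: parse as plain strings, then convert afterwards. Peak memory is
# higher than dtype='category' because the full object column is built first.
df = pd.read_csv(csv, dtype=str).astype('category')

# Possibly also sufficient: skip the chunked low-memory read path entirely
# (untested inference, since _concatenate_chunks is only hit in that path).
csv.seek(0)
df2 = pd.read_csv(csv, dtype='category', low_memory=False)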

sam-cohan referenced this issue in a merged PR, Nov 21, 2017: Read csv category fix #18402

sam-cohan added commits to sam-cohan/pandas referencing this issue, Nov 21–22, 2017
