BUG: na_values dict form not working on index column #57547

Closed
2 of 3 tasks
anna-intellegens opened this issue Feb 21, 2024 · 6 comments · Fixed by #57965
Labels: Bug, IO CSV (read_csv, to_csv)

Comments

@anna-intellegens

anna-intellegens commented Feb 21, 2024

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

from io import StringIO

from pandas._libs.parsers import STR_NA_VALUES
import pandas as pd

file_contents = """,x,y
MA,1,2
NA,2,1
OA,,3
"""

default_nan_values = STR_NA_VALUES | {"squid"}
names = [None, "x", "y"]
nan_mapping = {name: default_nan_values for name in names}
dtype = {0: "object", "x": "float32", "y": "float32"}

pd.read_csv(
    StringIO(file_contents),
    index_col=0,
    header=0,
    engine="c",
    dtype=dtype,
    names=names,
    na_values=nan_mapping,
    keep_default_na=False,
)

Issue Description

I'm trying to find a way to read in an index column as exact strings, but read in the rest of the columns as NaN-able numbers or strings. The dict form of na_values seems to be the only way implied in the documentation to allow this. However, when I try it, read_csv raises the following error:

Traceback (most recent call last):
  File ".../test.py", line 17, in <module>
    pd.read_csv(
  File ".../venv/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 1024, in read_csv
    return _read(filepath_or_buffer, kwds)
  File ".../venv/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 624, in _read
    return parser.read(nrows)
  File ".../venv/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 1921, in read
    ) = self._engine.read(  # type: ignore[attr-defined]
  File ".../venv/lib/python3.10/site-packages/pandas/io/parsers/c_parser_wrapper.py", line 333, in read
    index, column_names = self._make_index(date_data, alldata, names)
  File ".../venv/lib/python3.10/site-packages/pandas/io/parsers/base_parser.py", line 372, in _make_index
    index = self._agg_index(simple_index)
  File ".../venv/lib/python3.10/site-packages/pandas/io/parsers/base_parser.py", line 504, in _agg_index
    arr, _ = self._infer_types(
  File ".../venv/lib/python3.10/site-packages/pandas/io/parsers/base_parser.py", line 744, in _infer_types
    na_count = parsers.sanitize_objects(values, na_values)
TypeError: Argument 'na_values' has incorrect type (expected set, got dict)

This is unhelpful, as the docs imply this should work, and I can't find any other way to turn off NaN detection in the index column without also disabling it in the rest of the table (where keeping it is a hard requirement).

Expected Behavior

The pandas table should be read without error, leading to a pandas table a bit like the following:

       x    y
MA   1.0  2.0
NA   2.0  1.0
OA   NaN  3.0
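
For readers on an affected version, a table like the one above can probably be obtained without passing index_col at all. The following is a minimal sketch, not something proposed in this thread: it assumes the first column is read as an ordinary column, that the dict only lists the data columns, and that the set of NA markers (data_na_values, a hypothetical name) is chosen by the caller. The index is set afterwards with set_index.

```python
from io import StringIO

import pandas as pd

file_contents = """,x,y
MA,1,2
NA,2,1
OA,,3
"""

# Hypothetical NA markers, applied to the data columns only.
# The empty string is included so that empty fields become NaN.
data_na_values = {"", "NA", "squid"}

df = pd.read_csv(
    StringIO(file_contents),
    header=0,
    dtype={"x": "float32", "y": "float32"},
    # The first column is deliberately absent from the dict; with
    # keep_default_na=False it then gets no NA markers at all.
    na_values={"x": data_na_values, "y": data_na_values},
    keep_default_na=False,
)

# Promote the first column to the index after parsing and drop its name.
df = df.set_index(df.columns[0])
df.index.name = None
print(df)
```

Because the index is built after parsing, this sketch never reaches the index-handling code path shown in the traceback.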

Installed Versions

This has been tested on three versions of pandas (v1.5.2, v2.0.2, and v2.2.0), all with similar results.

INSTALLED VERSIONS
------------------
commit : fd3f571
python : 3.10.11.final.0
python-bits : 64
OS : Linux
OS-release : 6.5.0-18-generic
Version : #18~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Wed Feb 7 11:40:03 UTC 2
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_GB.UTF-8
LOCALE : en_GB.UTF-8

pandas : 2.2.0
numpy : 1.26.3
pytz : 2023.3.post1
dateutil : 2.8.2
setuptools : 69.0.3
pip : 23.2.1
Cython : None
pytest : 7.4.4
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : 1.1
pymysql : None
psycopg2 : 2.9.9
jinja2 : 3.1.3
IPython : None
pandas_datareader : None
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : None
bottleneck : None
dataframe-api-compat : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : None
numba : 0.58.1
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyreadstat : None
python-calamine : None
pyxlsb : None
s3fs : None
scipy : 1.11.4
sqlalchemy : None
tables : None
tabulate : 0.9.0
xarray : None
xlrd : None
zstandard : None
tzdata : 2024.1
qtpy : None
pyqt5 : None

anna-intellegens added the Bug and Needs Triage labels Feb 21, 2024
@monsterkaran

import io
import pandas as pd

file_contents = """
,x,y
MA,1,2
NA,2,1
OA,,3
"""

default_nan_values = set(["NA", "squid"])
names = [None, "x", "y"]
nan_mapping = {name: default_nan_values for name in names}
dtype = {0: "object", "x": "float32", "y": "float32"}

try:
    df = pd.read_csv(
        io.StringIO(file_contents),
        index_col=0,
        header=0,
        engine="c",
        dtype=dtype,
        names=names,
        na_values=nan_mapping,
        keep_default_na=True,
    )
    print(df)
except Exception as e:
    # Report the failure instead of letting the traceback propagate.
    print(f"Error occurred: {e}")

@rhshadrach
Member

Thanks for the report, confirmed on main. Further investigations and PRs to fix are welcome!

rhshadrach added the IO CSV (read_csv, to_csv) label and removed the Needs Triage label Feb 27, 2024
@tomhoq
Contributor

tomhoq commented Mar 5, 2024

take

@asishm
Contributor

asishm commented Mar 5, 2024

Replacing the None in names with any string works fine.
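
A minimal sketch of that workaround, under a few assumptions that are not stated in the comment: the placeholder name "idx" is hypothetical, the index column is left out of the na_values dict so that "NA" survives as a string, and the NA markers are spelled out explicitly rather than taken from the private STR_NA_VALUES.

```python
from io import StringIO

import pandas as pd

file_contents = """,x,y
MA,1,2
NA,2,1
OA,,3
"""

# Hypothetical NA markers for the data columns; "" covers empty fields.
data_na_values = {"", "NA", "squid"}

# "idx" is a placeholder string used in place of None.
names = ["idx", "x", "y"]

df = pd.read_csv(
    StringIO(file_contents),
    index_col=0,
    header=0,
    engine="c",
    names=names,
    dtype={"idx": "object", "x": "float32", "y": "float32"},
    # Only the data columns are listed, so the index gets no NA markers
    # when keep_default_na=False.
    na_values={"x": data_na_values, "y": data_na_values},
    keep_default_na=False,
)

# Optionally drop the placeholder name to match the unnamed index.
df.index.name = None
print(df)
```

Listing "idx" in the dict as well would turn the "NA" row label into NaN, which is the opposite of what the issue asks for.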

@tomhoq
Contributor

tomhoq commented Mar 17, 2024

@thomas-intellegens Sorry to bother, but in the issue post you mention that

The dict form of na_values seems to be the only way implied in the documentation to allow having no na values on a specific column

In case you might remember, was the documentation this one?

Otherwise, I cannot find where in the docs such a property is mentioned.

Thank you

@anna-intellegens
Author

In case you might remember, was the documentation this one?

Yeah, this was the section I was reading. Many thanks for taking a look at this

mroeschke pushed a commit that referenced this issue Apr 9, 2024
BUG: Na_values dict not working on index column (#57547)

* fix base_parser not setting col_na_values when na_values is a dict containing None

* fix python_parser applying na_values in a column None

* add unit test to test_na_values.py;

* update whatsnew.