read_csv from Google Cloud Storage ignores encoding #32392

Closed
EgorBEremeev opened this issue Mar 2, 2020 · 3 comments · Fixed by #35681
Labels
IO CSV read_csv, to_csv

EgorBEremeev commented Mar 2, 2020

Code Sample, a copy-pastable example if possible

    import pandas as pd

    dataframe = pd.read_csv('gs://mybucket/my_file', encoding='cp1251')

Problem description

Reading CSV files with an encoding other than UTF-8, such as cp1251, from Google Cloud Storage fails with the error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc4 in position 0: invalid continuation byte

The stack trace from the Google Cloud Functions environment:

Traceback (most recent call last):
  File "/env/local/lib/python3.7/site-packages/google/cloud/functions/worker.py", line 383, in run_background_function
    _function_handler.invoke_user_function(event_object)
  File "/env/local/lib/python3.7/site-packages/google/cloud/functions/worker.py", line 217, in invoke_user_function
    return call_user_function(request_or_event)
  File "/env/local/lib/python3.7/site-packages/google/cloud/functions/worker.py", line 214, in call_user_function
    event_context.Context(**request_or_event.context))
  File "/user_code/main.py", line 60, in load_csv_to_bq
    na_filter=False)
  File "/env/local/lib/python3.7/site-packages/pandas/io/parsers.py", line 676, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/env/local/lib/python3.7/site-packages/pandas/io/parsers.py", line 448, in _read
    parser = TextFileReader(fp_or_buf, **kwds)
  File "/env/local/lib/python3.7/site-packages/pandas/io/parsers.py", line 880, in __init__
    self._make_engine(self.engine)
  File "/env/local/lib/python3.7/site-packages/pandas/io/parsers.py", line 1114, in _make_engine
    self._engine = CParserWrapper(self.f, **self.options)
  File "/env/local/lib/python3.7/site-packages/pandas/io/parsers.py", line 1891, in __init__
    self._reader = parsers.TextReader(src, **kwds)
  File "pandas/_libs/parsers.pyx", line 529, in pandas._libs.parsers.TextReader.__cinit__
  File "pandas/_libs/parsers.pyx", line 748, in pandas._libs.parsers.TextReader._get_header
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc4 in position 0: invalid continuation byte
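
The error on the last line can probably be reproduced without GCS or pandas. The sketch below is only an illustration: the sample word is an assumption (the real file contents are not shown in this report), chosen because its first cp1251 byte is 0xc4, the byte named in the error.

```python
# Assumed sample text; the actual header of the reporter's file is unknown.
raw = "Дата".encode("cp1251")  # first byte is 0xc4 in cp1251

# Decoding cp1251 bytes as UTF-8 fails, as in the traceback above.
try:
    raw.decode("utf-8")
    error = None
except UnicodeDecodeError as exc:
    error = exc  # keep a reference; `exc` is cleared when the block ends

# Decoding with the encoding the file was written in succeeds.
decoded = raw.decode("cp1251")

print(error)
print(decoded)
```

This suggests the bytes reach the parser undecoded and are then interpreted as UTF-8, which is consistent with the encoding argument being dropped somewhere along the way.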

It looks like pandas ignores the encoding parameter because pandas.io.gcs.get_filepath_or_buffer passes mode = 'rb' to the call to GCSFileSystem.open(filepath_or_buffer, mode).

Tracing back to where the mode parameter is first set, we stop at this signature:

pandas/io/common.py

    def get_filepath_or_buffer(
        filepath_or_buffer, encoding=None, compression=None, mode=None
    ):

This matters because in the call to get_filepath_or_buffer() made from here

pandas/pandas/io/parsers.py

Lines 430 to 432 in 29d6b02

    fp_or_buf, _, compression, should_close = get_filepath_or_buffer(
        filepath_or_buffer, encoding, compression
    )

we do not pass a value for mode, so the default mode=None applies.

But in the current gcsfs master, GCSFileSystem.open() has been removed and fsspec.AbstractFileSystem.open() is used instead. There, applying the passed encoding for text reading and writing is now implemented:

        if "b" not in mode:
            mode = mode.replace("t", "") + "b"

            text_kwargs = {
                k: kwargs.pop(k)
                for k in ["encoding", "errors", "newline"]
                if k in kwargs
            }
            return io.TextIOWrapper(
                self.open(path, mode, block_size, **kwargs), **text_kwargs
            )
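
To make the effect of that branch concrete, here is a minimal stdlib-only sketch. `open_with_mode` is a hypothetical helper written for this illustration (it is not part of fsspec or gcsfs), and an in-memory `io.BytesIO` stands in for the remote file.

```python
import io


def open_with_mode(raw_bytes, mode="rb", **kwargs):
    """Hypothetical stand-in for the text/binary branch quoted above."""
    binary_file = io.BytesIO(raw_bytes)  # plays the role of the remote file
    if "b" not in mode:
        # Text mode: pull out the text-related options and apply them.
        text_kwargs = {
            k: kwargs.pop(k)
            for k in ["encoding", "errors", "newline"]
            if k in kwargs
        }
        return io.TextIOWrapper(binary_file, **text_kwargs)
    # Binary mode: any encoding passed in kwargs is simply never used.
    return binary_file


# Assumed sample data; the reporter's real file is not shown.
payload = "Дата;Сумма\n01.03.2020;100\n".encode("cp1251")

binary_handle = open_with_mode(payload, "rb", encoding="cp1251")
text_handle = open_with_mode(payload, "r", encoding="cp1251")

first_binary_line = binary_handle.readline()  # raw cp1251 bytes
first_text_line = text_handle.readline()      # decoded str
```

With mode "rb" the caller gets undecoded bytes even though an encoding was supplied, which matches the behaviour described in this report; with mode "r" the encoding is honoured.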

Expected Output

The encoding value passed to pd.read_csv() is applied when reading from GCS, and the CSV files are read successfully.

My suggestion is that for read_csv() we need to pass mode='r', and for to_csv() (see #26124) mode='w', in the call to get_filepath_or_buffer(). But I'm not sure where in the code it is best to implement this change.

Output of pd.show_versions()

INSTALLED VERSIONS

commit : None
python : 3.7.6.final.0
python-bits : 64
OS : Linux
OS-release : 4.4.0
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : None
LOCALE : en_US.UTF-8

pandas : 1.0.1
numpy : 1.18.1
pytz : 2019.3
dateutil : 2.8.1
pip : 20.0.2
setuptools : 45.2.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.10
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : 0.6.0
lxml.etree : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : 0.13.1
pyarrow : None
pytables : None
pytest : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
xlsxwriter : None
numba : None

@TomAugspurger
Contributor

Would adding GCSFile (or perhaps fsspec.file.AbstractFile) to

pandas/pandas/io/common.py

Lines 377 to 382 in f25ed6f

    try:
        from s3fs import S3File

        need_text_wrapping = (BufferedIOBase, RawIOBase, S3File)
    except ImportError:
        need_text_wrapping = (BufferedIOBase, RawIOBase)
help?


EgorBEremeev commented Mar 2, 2020

Hi, @TomAugspurger
As I understand the code below

pandas/pandas/io/common.py

Lines 453 to 457 in 78c1a74

    # Convert BytesIO or file objects passed with an encoding
    if is_text and (compression or isinstance(f, need_text_wrapping)):
        from io import TextIOWrapper

        g = TextIOWrapper(f, encoding=encoding, newline="")

the intent is to check whether an encoding is passed and whether path_or_buf is an instance of one of the types listed in need_text_wrapping. If so, it is wrapped with TextIOWrapper().

I think adding GCSFile to need_text_wrapping is the right approach.

I am just confused, because I do not see where get_handle() is called during read_csv().
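
A rough sketch of the need_text_wrapping idea discussed above, using only the standard library. `HypotheticalRemoteFile` and `maybe_wrap` are made-up names for illustration; they are not gcsfs, fsspec, or pandas APIs, and the real check in pandas also looks at is_text and compression.

```python
import io


class HypotheticalRemoteFile(io.IOBase):
    """Stand-in for a remote binary file that is neither BufferedIOBase nor RawIOBase."""

    def __init__(self, payload):
        self._inner = io.BytesIO(payload)

    def readable(self):
        return True

    def read(self, size=-1):
        return self._inner.read(size)


def maybe_wrap(f, encoding, need_text_wrapping):
    """Wrap f in a TextIOWrapper only if its type is in need_text_wrapping."""
    if encoding is not None and isinstance(f, need_text_wrapping):
        return io.TextIOWrapper(f, encoding=encoding, newline="")
    return f


payload = "Дата\n".encode("cp1251")  # assumed sample data

# Without the remote file type in the tuple, the handle stays binary.
without_type = (io.BufferedIOBase, io.RawIOBase)
unwrapped = maybe_wrap(HypotheticalRemoteFile(payload), "cp1251", without_type)

# With the type added, the handle is wrapped and decoded with the given encoding.
with_type = (io.BufferedIOBase, io.RawIOBase, HypotheticalRemoteFile)
wrapped = maybe_wrap(HypotheticalRemoteFile(payload), "cp1251", with_type)
text = wrapped.read()
```

The point of the sketch is only that a file class outside the tuple slips past the isinstance check, so the encoding is never applied to it.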


TomAugspurger commented Mar 3, 2020 via email

@jbrockmendel jbrockmendel added the IO CSV read_csv, to_csv label Jun 5, 2020
@jreback jreback added this to the 1.2 milestone Aug 14, 2020