
PandasCursor interfering with pandas.DataFrame writing to s3 #465

Closed · louis-vines opened this issue Jul 19, 2023 · 9 comments · Fixed by #539
@louis-vines

I've found that importing PandasCursor interferes with pandas using s3fs/fsspec to write to s3.

Minimal reproducible example:

import pandas as pd
from pyathena.pandas.cursor import PandasCursor

my_df = pd.DataFrame(dict(a=[1], b=[1]))
s3_bucket = ""  # add a valid s3 bucket
output_path = f"s3://{s3_bucket}/my_prefix/file.csv"
my_df.to_csv(output_path)

and I get the following error:

Traceback (most recent call last):
  File "/app/pipelines/bug_example.py", line 12, in <module>
    my_df.to_csv(output_path)
  File "/usr/local/lib/python3.10/site-packages/pandas/core/generic.py", line 3772, in to_csv
    return DataFrameRenderer(formatter).to_csv(
  File "/usr/local/lib/python3.10/site-packages/pandas/io/formats/format.py", line 1186, in to_csv
    csv_formatter.save()
  File "/usr/local/lib/python3.10/site-packages/pandas/io/formats/csvs.py", line 240, in save
    with get_handle(
  File "/usr/local/lib/python3.10/site-packages/pandas/io/common.py", line 138, in __exit__
    self.close()
  File "/usr/local/lib/python3.10/site-packages/pandas/io/common.py", line 130, in close
    handle.close()
  File "/usr/local/lib/python3.10/site-packages/pyathena/filesystem/s3.py", line 518, in close
    super(S3File, self).close()
  File "/usr/local/lib/python3.10/site-packages/fsspec/spec.py", line 1876, in close
    self.flush(force=True)
  File "/usr/local/lib/python3.10/site-packages/fsspec/spec.py", line 1742, in flush
    self._initiate_upload()
  File "/usr/local/lib/python3.10/site-packages/pyathena/filesystem/s3.py", line 522, in _initiate_upload
    raise NotImplementedError  # pragma: no cover

If you remove the PandasCursor import, this works fine:

import pandas as pd
# from pyathena.pandas.cursor import PandasCursor

my_df = pd.DataFrame(dict(a=[1], b=[1]))
s3_bucket = ""  # add a valid s3 bucket
output_path = f"s3://{s3_bucket}/my_prefix/file.csv"
my_df.to_csv(output_path)
@louis-vines louis-vines changed the title PandasCursor interfering with pandas.Dataframe writing to s3 Bug: PandasCursor interfering with pandas.Dataframe writing to s3 Jul 19, 2023
@louis-vines louis-vines changed the title Bug: PandasCursor interfering with pandas.Dataframe writing to s3 Bug: PandasCursor interfering with pandas.DataFrame writing to s3 Jul 19, 2023
@laughingman7743
Owner

https://github.com/laughingman7743/PyAthena/blob/master/pyathena/filesystem/s3.py#L403-L416
I do not currently support writing DataFrames to S3 with PandasCursor.
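The linked class leaves fsspec's write hooks unimplemented. A paraphrased sketch, reconstructed from the traceback above rather than copied from the source:

from fsspec.spec import AbstractBufferedFile

class S3File(AbstractBufferedFile):
    # Sketch only; the real class in pyathena/filesystem/s3.py differs.
    def _initiate_upload(self):
        # Called by fsspec's flush() when a file opened for writing is
        # closed -- which is why to_csv fails in the traceback above.
        raise NotImplementedError

    def _upload_chunk(self, final=False):
        raise NotImplementedError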

@louis-vines
Author

Thanks for the swift response. I totally understand that your package doesn't support writing to s3. The issue is that once we import PandasCursor, it interferes with how pandas DataFrames are written to s3. Notice in my example that my_df.to_csv(output_path) has nothing to do with PyAthena. From the stack trace, it seems that importing PandasCursor interferes with how pandas interacts with s3fs and fsspec. This is an issue because it means we can't use pyathena and then write our resulting DataFrame back to s3 using pandas.

@laughingman7743
Owner

https://github.com/laughingman7743/PyAthena/blob/master/pyathena/pandas/__init__.py#L4-L5
Importing PandasCursor executes this code, which overrides the s3fs implementation registered with fsspec.
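In effect, the import does something like this (a rough sketch; the exact call is in the linked file):

import fsspec
from pyathena.filesystem.s3 import S3FileSystem

# Registering a handler for the "s3" protocol replaces whatever
# implementation (e.g. s3fs) fsspec would otherwise use for s3:// URLs.
fsspec.register_implementation("s3", S3FileSystem, clobber=True)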

@louis-vines
Author

louis-vines commented Jul 19, 2023

Hmmm, this seems like quite a problematic requirement for using the pandas cursor: it means you can't really use pyathena for data pipelines, since you can't read data from athena and then write back to s3. I've thought of a workaround: being explicit about which s3 filesystem I want when writing to s3:

import s3fs

s3 = s3fs.S3FileSystem()

with s3.open(f"{s3_bucket}/my_prefix/file.csv", "w") as file_out:
    my_df.to_csv(file_out, index=False)

But would addressing this issue be something you're willing to think about? Either by using s3fs, by not registering your s3 implementation globally with fsspec (not sure if that's possible), or by extending your implementation to support writing?
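For what it's worth, one untested idea for a user-side escape hatch, using fsspec's public register_implementation to put s3fs back after the import:

import fsspec
import s3fs
from pyathena.pandas.cursor import PandasCursor  # this swaps out the "s3" handler

# Restore s3fs as the handler for s3:// URLs so pandas writes work again.
fsspec.register_implementation("s3", s3fs.S3FileSystem, clobber=True)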

@laughingman7743
Owner

I used to use S3fs, but stopped because S3fs depends on aiobotocore, and aiobotocore pins botocore to specific versions, which makes dependency installation difficult.
https://github.com/aio-libs/aiobotocore/blob/master/setup.py#L10
I welcome pull requests, you know.

@laughingman7743 laughingman7743 changed the title Bug: PandasCursor interfering with pandas.DataFrame writing to s3 PandasCursor interfering with pandas.DataFrame writing to s3 Jul 19, 2023
@louis-vines
Author

louis-vines commented Jul 21, 2023

> I used to use S3fs, but stopped because S3fs depends on aiobotocore, and aiobotocore pins botocore to specific versions, which makes dependency installation difficult.

Yes this is fair enough - it's also a problem we've been encountering in projects.

> I welcome pull requests, you know.

Touché. Given that you don't want to include s3fs as a dependency, do you think the best fix is to implement an s3 writer in this package? That seems like a slightly odd thing for this package to be implementing. The alternative would be to avoid registering your s3 implementation globally with fsspec. I'll admit I haven't looked at the fsspec code at all and don't know whether things can work without the global registration.

@laughingman7743
Owner

I am not sure what is odd about it.
If you want to save the DataFrame as an Athena table, you can also use the to_sql method.
https://github.com/laughingman7743/PyAthena#to-sql
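Roughly, following the README (a sketch; see the link for the exact arguments):

import pandas as pd
from pyathena import connect
from pyathena.pandas.util import to_sql

conn = connect(s3_staging_dir="s3://YOUR_S3_BUCKET/path/to/",
               region_name="us-west-2")
df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
# Uploads the DataFrame to S3 and creates the corresponding Athena table.
to_sql(df, "my_table", conn, "s3://YOUR_S3_BUCKET/path/to/table/",
       schema="default", index=False, if_exists="replace")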

@graydenshand

As a workaround, you can convert query results to pandas yourself instead of using the as_pandas function or PandasCursor.

E.g. this succeeds:

from pyathena import connect
import pandas as pd

with connect(s3_staging_dir="s3://...") as conn:
    with conn.cursor() as cursor:
        cursor.execute("select * from my_table limit 10")        
        df = pd.DataFrame(cursor, columns=[c[0] for c in cursor.description])
        # later...
        df.to_csv("s3://...")

I'll suggest that it would be preferable for this library not to support pandas conversion at all, rather than to support it in a way that breaks typical pandas usage in other contexts.

@gvogel-hh

Many users will need to solve the dependency problems surrounding s3fs/aiobotocore anyway, so I would not consider this a sufficient reason to take the decision out of the user's hands. The usual way of handling such a choice would be to offer an extra like pyathena[replace_s3fs] that pulls in your s3 filesystem implementation as a separate package.
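For illustration, such an extra could be declared roughly like this in setup.py (the extra and package names here are hypothetical):

from setuptools import setup

setup(
    name="pyathena",
    extras_require={
        # Hypothetical split-out package carrying the fsspec s3 override
        "replace_s3fs": ["pyathena-s3-filesystem"],
    },
)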

@laughingman7743 laughingman7743 linked a pull request Apr 30, 2024 that will close this issue
laughingman7743 added a commit that referenced this issue May 2, 2024
* Support for writing with s3 filesystem (fix #465)

* Fix test cases

* Apply fmt

* Remove unnecessary vars

* Fix fixture args

* Add test cases for file writing

* Add type hints