
[KED-2639] Cannot read csv in chunks with pandas #598

Closed

noklam opened this issue Nov 4, 2020 · 18 comments
Labels
Issue: Bug Report 🐞 Bug that needs to be fixed

Comments

@noklam
Contributor

noklam commented Nov 4, 2020

Description

Cannot read a CSV in chunks with the Kedro data catalog, even though plain pandas supports it:

```python
df = pd.read_csv(csv, chunksize=1000)
df.get_chunk()
```

Context

How has this bug affected you? What were you trying to accomplish?

Steps to Reproduce

```yaml
train_dataset:
  type: pandas.CSVDataSet
  filepath: 'mycsv.csv'
  load_args:
    chunksize: 50000
```

```python
df = catalog.load("train_dataset")
df.get_chunk()
# ValueError: I/O operation on closed file.

df
# <pandas.io.parsers.TextFileReader at 0x7fde97a82450>
```

Expected Result

I should be able to loop over the reader.

Actual Result

```
ValueError: I/O operation on closed file.
```



## Your Environment
Include as many relevant details about the environment in which you experienced the bug:

* Kedro version used (`pip show kedro` or `kedro -V`): 0.16.6
* Python version used (`python -V`): 3.7.5
* Operating system and version: Ubuntu
@noklam noklam added the Issue: Bug Report 🐞 Bug that needs to be fixed label Nov 4, 2020
@WaylonWalker
Contributor

It's been a while since I have used chunksize. If I remember correctly, it returns a generator.

```python
chunks = catalog.load("train_dataset")

for chunk in chunks:
    # chunk is a DataFrame; do what you need with it
    process(chunk)
```

@noklam
Contributor Author

noklam commented Nov 6, 2020

@WaylonWalker Thanks for jumping in. I have read your blog about Kedro before; it helped me understand some concepts better.

When I iterate over it, it throws an error saying the file is already closed.

@WaylonWalker
Contributor

I was able to replicate it. I set up a pipeline with a csv and a catalog entry just as you did, and I run into the same error whether I `kedro run` or `catalog.load` it. I am not able to replicate the issue just loading with pandas, even when I use fsspec like `pandas.CSVDataSet` does. Someone with a deeper understanding of the internals may need to take a look.

I posted my replica of the issue here: https://github.com/WaylonWalker/kedro_chunked.

> I have read your blog about Kedro before; it helped me understand some concepts better.

That is awesome!!! And potentially motivating to keep making more content.

@noklam
Contributor Author

noklam commented Nov 6, 2020

@WaylonWalker I did the same check to see whether the problem is in fsspec; it seems not.
catalog.load() first calls fsspec, then it also calls the transformer, so I suspect the transformer reads that generator and closes the file.

But I haven't dug into the transformers yet; it would be great if someone with more knowledge could jump in.

@carlosbertoncelli

I'm facing the same issue. Does anyone have updates on this problem?

@Skalwalker

Have we got a solution for this? I have been having a rough time trying to integrate big data with Kedro.

@noklam
Contributor Author

noklam commented Mar 16, 2021

Looking for a solution too, still bugging me.

@Skalwalker

@noklam Did you find a workaround?

@carlosbertoncelli

carlosbertoncelli commented Mar 16, 2021

I solved the problem by creating a custom dataset class. It loads the file using fsspec (like CSVDataSet does) and saves it to a temp file, and I pass that file reference out through the load function; inside my pipeline functions I just delete it after use (if I forget, no problem, because it's created using tempfile). I forgot to mention that the file reference is basically an iterator over the file chunks. A sketch of this approach is below.
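A minimal sketch of that approach, assuming kedro 0.16/0.17-era APIs (the class name and details here are hypothetical, not the exact code from that project):

```python
import tempfile
from copy import deepcopy
from typing import Any, Dict

import fsspec
import pandas as pd
from kedro.io import AbstractDataSet


class ChunkedCSVDataSet(AbstractDataSet):
    """Copies a CSV via fsspec to a temp file and returns a chunked reader."""

    def __init__(self, filepath: str, load_args: Dict[str, Any] = None):
        self._filepath = filepath
        self._load_args = deepcopy(load_args) if load_args else {}

    def _load(self) -> pd.io.parsers.TextFileReader:
        # Copy the (possibly remote) source into a local temp file that
        # outlives this method, so the chunk iterator stays readable.
        tmp = tempfile.NamedTemporaryFile(suffix=".csv", delete=False)
        with fsspec.open(self._filepath, mode="rb") as src:
            tmp.write(src.read())
        tmp.close()
        # The caller deletes tmp.name after consuming the chunks.
        return pd.read_csv(tmp.name, **self._load_args)

    def _save(self, data) -> None:
        raise NotImplementedError("read-only dataset")

    def _describe(self) -> Dict[str, Any]:
        return dict(filepath=self._filepath, load_args=self._load_args)
```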

@noklam
Contributor Author

noklam commented Mar 16, 2021

My solution is simply to give up on using a dataset; I load the file in a node via the typical pandas.read_csv.

@stale

stale bot commented May 15, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label May 15, 2021
@noklam
Contributor Author

noklam commented May 16, 2021

It is still a relevant bug; I don't think it should be closed.

@merelcht
Member

Thanks for reflagging this @noklam. I've added this as a bug ticket to our backlog, but of course we still very much welcome a PR fix on this if you have one.

@stale stale bot removed the stale label May 17, 2021
@merelcht merelcht changed the title Cannot read csv in chunks with pandas [KED-2639] Cannot read csv in chunks with pandas May 17, 2021
@antonymilne
Contributor

I believe the problem here is that the context manager that is used in catalog.load for a csv file closes the file:
https://github.com/quantumblacklabs/kedro/blob/e17a5e44e6d1ec1335b4cb69011babd7f38cad9b/kedro/extras/datasets/pandas/csv_dataset.py#L157

Since pandas added fsspec support to its API in version 1.1.0, we are in the process of converting this code (and other datasets like JSONDataSet) to use pd.read_* directly, without the need for the context manager. This should fix the bug, but it won't be out until kedro 0.18.

In the meantime, I think you should be able to fix it easily just by removing the context manager, giving the following (I tried this out briefly and it seemed to work, but use at your own risk...):

```python
def _load(self) -> pd.DataFrame:
    load_path = get_filepath_str(self._get_load_path(), self._protocol)
    return pd.read_csv(load_path, **self._load_args)
```
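This works because, per the pandas fsspec support mentioned above, pd.read_csv can resolve protocol-prefixed paths itself. An illustrative call (the bucket name is made up, and the matching fsspec backend, such as s3fs, must be installed):

```python
import pandas as pd

# pandas >= 1.1.0 dispatches non-local URLs to fsspec under the hood
reader = pd.read_csv("s3://some-bucket/mycsv.csv", chunksize=50_000)
```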

Note also that since pandas 1.2 TextFileReader (which is what is returned when specifying chunksize) is now a context manager - see pandas-dev/pandas#38225. It's still iterable, so correct usage would now be:

```python
with catalog.load("train_dataset") as chunks:
    for chunk in chunks:
        process(chunk)
```

@noklam
Contributor Author

noklam commented Jun 21, 2021

For anyone who is looking for a hotfix: thanks to the dynamic nature of Python, we can fix it without touching the source code.

Alternatively, you can create a custom dataset that inherits from CSVDataSet and simply overrides the _load() method (see the sketch after the hotfix below).

```python
import pandas as pd
from kedro.extras.datasets.pandas import CSVDataSet
from kedro.io.core import get_filepath_str


def _load(self) -> pd.DataFrame:
    load_path = get_filepath_str(self._get_load_path(), self._protocol)
    return pd.read_csv(load_path, **self._load_args)


# Monkey-patch the shipped class so every pandas.CSVDataSet entry in the
# catalog uses this context-manager-free loader.
CSVDataSet._load = _load
```
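A subclass version of the same workaround might look like this (a minimal sketch; the class name is hypothetical, and the catalog entry's `type` would point at its import path instead of `pandas.CSVDataSet`):

```python
import pandas as pd
from kedro.extras.datasets.pandas import CSVDataSet
from kedro.io.core import get_filepath_str


class ChunkableCSVDataSet(CSVDataSet):
    def _load(self) -> pd.DataFrame:
        # Same as the upstream _load, minus the fsspec context manager, so
        # the returned TextFileReader keeps a usable handle on the file.
        load_path = get_filepath_str(self._get_load_path(), self._protocol)
        return pd.read_csv(load_path, **self._load_args)
```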

@antonymilne
Contributor

This is a great point, thanks @noklam.

@stale

stale bot commented Aug 20, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Aug 20, 2021
@stale stale bot closed this as completed Aug 27, 2021
@datajoely datajoely reopened this Aug 28, 2021
@stale stale bot removed the stale label Aug 28, 2021
@antonymilne
Contributor

I can confirm that this will be fixed in 0.18 - see 4f5f9c1. The fix should work for both pandas.CSVDataSet and others that currently use a context manager.
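Once that fix lands, a node consuming the chunked dataset could look like this (a sketch; the function name and aggregation are illustrative, not part of the fix):

```python
import pandas as pd


def process_in_chunks(chunks: pd.io.parsers.TextFileReader) -> pd.DataFrame:
    # With pandas >= 1.2, TextFileReader is also a context manager, which
    # guarantees the file handle is released once all chunks are consumed.
    partial_sums = []
    with chunks:
        for chunk in chunks:
            partial_sums.append(chunk.sum(numeric_only=True))
    return pd.DataFrame(partial_sums)
```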
