
[KED-2639] Cannot read csv in chunks with pandas #598

Closed

noklam opened this issue Nov 4, 2020 · 18 comments
Labels
Issue: Bug Report 🐞 Bug that needs to be fixed

Comments

@noklam
Contributor

noklam commented Nov 4, 2020

Description

Cannot read a CSV in chunks with the Kedro data catalog, even though plain pandas supports it:

```python
df = pd.read_csv(csv, chunksize=1000)
df.get_chunk()
```

Context

How has this bug affected you? What were you trying to accomplish?

Steps to Reproduce

```yaml
train_dataset:
  type: pandas.CSVDataSet
  filepath: 'mycsv.csv'
  load_args:
    chunksize: 50000
```

```python
df = catalog.load("train_dataset")
df.get_chunk()
# ValueError: I/O operation on closed file.

df
# <pandas.io.parsers.TextFileReader at 0x7fde97a82450>
```

Expected Result

I should be able to loop over the reader.

Actual Result

```
ValueError: I/O operation on closed file.
```



## Your Environment
Include as many relevant details about the environment in which you experienced the bug:

* Kedro version used (`pip show kedro` or `kedro -V`): 0.16.6
* Python version used (`python -V`): 3.7.5
* Operating system and version: Ubuntu
@noklam noklam added the Issue: Bug Report 🐞 Bug that needs to be fixed label Nov 4, 2020
@WaylonWalker
Contributor

It's been a while since I have used chunksize. If I remember correctly, it returns a generator.

```python
chunks = catalog.load("train_dataset")

for chunk in chunks:
    # chunk is a DataFrame; do what you need with it
    process(chunk)
```

@noklam
Contributor Author

noklam commented Nov 6, 2020

@WaylonWalker Thanks for jumping in. I have read your blog about Kedro before; it helped me understand some concepts better.

When I iterate over it, it throws an error saying the file is already closed.

@WaylonWalker
Contributor

I was able to replicate it. I set up a pipeline with a csv and a catalog entry just as you did, and I run into the same error whether I `kedro run` or `catalog.load` it. I am not able to replicate the issue just loading with pandas, even when I use fsspec like `pandas.CSVDataSet` does. Someone with a deeper understanding of the internals may need to take a look.

I posted my replica of the issue here: https://github.com/WaylonWalker/kedro_chunked.

> I have read your blog about Kedro before; it helped me understand some concepts better.

That is awesome!!! And potentially motivating to keep making more content.

@noklam
Contributor Author

noklam commented Nov 6, 2020

@WaylonWalker I did the same check to see whether the problem is in fsspec; it seems not.
catalog.load() first calls fsspec, then it also calls the transformer, so I suspect the transformer reads that generator and closes the file.

But I haven't dug into the transformers yet; it would be great if someone with more knowledge could jump in.

@carlosbertoncelli

I'm facing the same issue. Does anyone have updates on this problem?

@Skalwalker

Have we got a solution for this? I have been having a rough time trying to integrate big data with Kedro.

@noklam
Contributor Author

noklam commented Mar 16, 2021

Looking for a solution too, still bugging me.

@Skalwalker

@noklam Did you find a workaround?

@carlosbertoncelli

carlosbertoncelli commented Mar 16, 2021

I solved the problem by creating a custom dataset class. It loads the file using fsspec (like CSVDataSet does) and saves it to a temp file, and I pass that file reference out through the load function; inside my pipeline functions I just delete it after use (if I forget, no problem, because it's created using tempfile). I forgot to mention that the file reference is basically an iterator over the file chunks. A sketch of this approach is below.
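A minimal sketch of that approach, assuming kedro 0.16/0.17-era APIs (the class name and details here are hypothetical, not the exact code from that project):

```python
import tempfile
from copy import deepcopy
from typing import Any, Dict

import fsspec
import pandas as pd
from kedro.io import AbstractDataSet


class ChunkedCSVDataSet(AbstractDataSet):
    """Copies a CSV via fsspec to a temp file and returns a chunked reader."""

    def __init__(self, filepath: str, load_args: Dict[str, Any] = None):
        self._filepath = filepath
        self._load_args = deepcopy(load_args) if load_args else {}

    def _load(self) -> pd.io.parsers.TextFileReader:
        # Copy the (possibly remote) source into a local temp file that
        # outlives this method, so the chunk iterator stays readable.
        tmp = tempfile.NamedTemporaryFile(suffix=".csv", delete=False)
        with fsspec.open(self._filepath, mode="rb") as src:
            tmp.write(src.read())
        tmp.close()
        # The caller deletes tmp.name after consuming the chunks.
        return pd.read_csv(tmp.name, **self._load_args)

    def _save(self, data) -> None:
        raise NotImplementedError("read-only dataset")

    def _describe(self) -> Dict[str, Any]:
        return dict(filepath=self._filepath, load_args=self._load_args)
```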

@noklam
Contributor Author

noklam commented Mar 16, 2021

My solution is simply to give up on using a dataset; I load the file in a node via the typical pandas.read_csv.

@stale

stale bot commented May 15, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label May 15, 2021
@noklam
Contributor Author

noklam commented May 16, 2021

It is still a relevant bug; I don't think it should be closed.

@merelcht
Member

Thanks for reflagging this @noklam. I've added this as a bug ticket to our backlog, but of course we still very much welcome a PR fix on this if you have one.

@stale stale bot removed the stale label May 17, 2021
@merelcht merelcht changed the title Cannot read csv in chunks with pandas [KED-2639] Cannot read csv in chunks with pandas May 17, 2021
@antonymilne
Contributor

I believe the problem here is that the context manager that is used in catalog.load for a csv file closes the file:
https://github.com/quantumblacklabs/kedro/blob/e17a5e44e6d1ec1335b4cb69011babd7f38cad9b/kedro/extras/datasets/pandas/csv_dataset.py#L157

Since pandas added fsspec support to its API in version 1.1.0, we are in the process of converting this code (and other datasets like JSONDataSet) to use pd.read_* directly, without the need for the context manager. This should fix the bug, but it won't be out until kedro 0.18.

In the meantime, I think you should be able to fix it easily just by removing the context manager, giving the following (I tried this out briefly and it seemed to work, but use at your own risk...):

```python
def _load(self) -> pd.DataFrame:
    load_path = get_filepath_str(self._get_load_path(), self._protocol)
    return pd.read_csv(load_path, **self._load_args)
```
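This works because, per the pandas fsspec support mentioned above, pd.read_csv can resolve protocol-prefixed paths itself. An illustrative call (the bucket name is made up, and the matching fsspec backend, such as s3fs, must be installed):

```python
import pandas as pd

# pandas >= 1.1.0 dispatches non-local URLs to fsspec under the hood
reader = pd.read_csv("s3://some-bucket/mycsv.csv", chunksize=50_000)
```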

Note also that since pandas 1.2 TextFileReader (which is what is returned when specifying chunksize) is now a context manager - see pandas-dev/pandas#38225. It's still iterable, so correct usage would now be:

```python
with catalog.load("train_dataset") as chunks:
    for chunk in chunks:
        process(chunk)
```

@noklam
Contributor Author

noklam commented Jun 21, 2021

For anyone who is looking for a hotfix: thanks to the dynamic nature of Python, we can fix it without touching the source code.

Alternatively, you can create a custom dataset that inherits from CSVDataSet and simply overrides the _load() method (see the sketch after the hotfix below).

```python
import pandas as pd
from kedro.extras.datasets.pandas import CSVDataSet
from kedro.io.core import get_filepath_str


def _load(self) -> pd.DataFrame:
    load_path = get_filepath_str(self._get_load_path(), self._protocol)
    return pd.read_csv(load_path, **self._load_args)


# Monkey-patch the shipped class so every pandas.CSVDataSet entry in the
# catalog uses this context-manager-free loader.
CSVDataSet._load = _load
```
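A subclass version of the same workaround might look like this (a minimal sketch; the class name is hypothetical, and the catalog entry's `type` would point at its import path instead of `pandas.CSVDataSet`):

```python
import pandas as pd
from kedro.extras.datasets.pandas import CSVDataSet
from kedro.io.core import get_filepath_str


class ChunkableCSVDataSet(CSVDataSet):
    def _load(self) -> pd.DataFrame:
        # Same as the upstream _load, minus the fsspec context manager, so
        # the returned TextFileReader keeps a usable handle on the file.
        load_path = get_filepath_str(self._get_load_path(), self._protocol)
        return pd.read_csv(load_path, **self._load_args)
```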

@antonymilne
Contributor

This is a great point, thanks @noklam.

@stale

stale bot commented Aug 20, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Aug 20, 2021
@stale stale bot closed this as completed Aug 27, 2021
@datajoely datajoely reopened this Aug 28, 2021
@stale stale bot removed the stale label Aug 28, 2021
@antonymilne
Contributor

I can confirm that this will be fixed in 0.18 - see 4f5f9c1. The fix should work for both pandas.CSVDataSet and others that currently use a context manager.
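Once that fix lands, a node consuming the chunked dataset could look like this (a sketch; the function name and aggregation are illustrative, not part of the fix):

```python
import pandas as pd


def process_in_chunks(chunks: pd.io.parsers.TextFileReader) -> pd.DataFrame:
    # With pandas >= 1.2, TextFileReader is also a context manager, which
    # guarantees the file handle is released once all chunks are consumed.
    partial_sums = []
    with chunks:
        for chunk in chunks:
            partial_sums.append(chunk.sum(numeric_only=True))
    return pd.DataFrame(partial_sums)
```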
