ENH: support reading from several files for read_* functions #39435

anmyachev · 2021-01-27T11:50:26Z

Is your feature request related to a problem?

Supporting multiple files directly in the read_* functions would simplify their use and reduce boilerplate code.
At the same time, it should not make those functions much harder to maintain in pandas.

Current reading approach (from Pandas docs):

  import glob
  import pandas as pd
  files = glob.glob('file_*.csv')
  result = pd.concat([pd.read_csv(f) for f in files], ignore_index=True)

Describe the solution you'd like

We can make reading several files an out-of-the-box feature of pandas by accepting a wildcard in the path:

  result = pd.read_csv('file_*.csv')

API breaking implications

In one of the two proposed solutions, filepath_or_buffer can also be of type list[str].
The changes do not break backward compatibility.

Describe alternatives you've considered

Another possible option is to pass a list of files to the read_* call:

  import glob
  result = pd.read_csv(glob.glob('file_*.csv'))
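
Both variants (a wildcard string and a list of paths) could share one path-expansion step in front of the existing reader. The following is only a minimal sketch of that idea: `expand_paths` is a hypothetical helper name, not a pandas function, and treating `*`, `?` and `[` as the wildcard characters is an assumption of this sketch.

```python
import glob
import os


def expand_paths(filepath_or_buffer):
    """Hypothetical helper (not a pandas API): return a list of inputs to read."""
    # The list[str] variant: a list or tuple of paths is taken as given.
    if isinstance(filepath_or_buffer, (list, tuple)):
        return [os.fspath(p) for p in filepath_or_buffer]

    # Buffers and other non-path objects pass through unchanged.
    if not isinstance(filepath_or_buffer, (str, os.PathLike)):
        return [filepath_or_buffer]

    path = os.fspath(filepath_or_buffer)
    # The wildcard variant: expand the pattern, sorted for a stable order.
    if any(ch in path for ch in "*?["):
        return sorted(glob.glob(path))

    # A plain path keeps the current single-file behavior.
    return [path]
```

With such a helper, a wildcard call would reduce internally to something like `pd.concat([pd.read_csv(p) for p in expand_paths('file_*.csv')], ignore_index=True)`. Note that in this sketch an existing file whose name literally contains `*`, `?` or `[` would be treated as a pattern, which is the one case where behavior could differ from today.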


lhoestq · 2024-02-12T15:40:34Z

Hi! Is there any news on this? This would be greatly appreciated, especially when dealing with fsspec-compatible URLs.

Right now the boilerplate code involves instantiating a FileSystem, doing the glob, re-creating the full paths, and concatenating:

  import pandas as pd
  from huggingface_hub import HfFileSystem

  path = "hf://datasets/Anthropic/hh-rlhf"
  splits = {"train": "**/*/train.jsonl.gz", "test": "**/*/test.jsonl.gz"}

  # glob() returns paths without the protocol, so re-add the "hf://" prefix
  files = ["hf://" + p for p in HfFileSystem().glob(f"{path}/{splits['train']}")]

  df = pd.concat(pd.read_json(file, lines=True) for file in files)

instead of simply:

  import pandas as pd

  path = "hf://datasets/Anthropic/hh-rlhf"
  splits = {"train": "**/*/train.jsonl.gz", "test": "**/*/test.jsonl.gz"}
  df = pd.read_json(f"{path}/{splits['train']}", lines=True)
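
Until something like this exists, the filesystem-specific steps above can be written once for any fsspec-compatible URL. This is a rough sketch, not an existing pandas or fsspec feature: `expand_url_glob` is a hypothetical name, and it assumes the protocol in the URL (for example `hf://`, which needs huggingface_hub installed) is registered with fsspec.

```python
from typing import List

from fsspec.core import url_to_fs


def expand_url_glob(url_pattern: str) -> List[str]:
    """Hypothetical helper: expand a glob inside an fsspec URL to full URLs."""
    # url_to_fs selects the filesystem from the protocol and returns the
    # path with the protocol stripped.
    fs, path = url_to_fs(url_pattern)
    # fs.glob yields protocol-less paths; unstrip_protocol restores the prefix.
    return sorted(fs.unstrip_protocol(p) for p in fs.glob(path))
```

The concatenation would then read `pd.concat(pd.read_json(u, lines=True) for u in expand_url_glob(f"{path}/{splits['train']}"))`, with no filesystem class named in user code.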

anmyachev added the Enhancement and Needs Triage labels on Jan 27, 2021

anmyachev mentioned this issue on Jan 27, 2021 in: Feature request: read_csv/read_table/read_fwf - read multiple files with the same structure, applying the same parameters (skiprows, skipfooter, nrows) #12618 (closed)

jbrockmendel added the IO Data label and removed the Needs Triage label on Jun 6, 2021

mroeschke added the Needs Discussion label on Aug 15, 2021
