ENH: support reading from several files for read_* functions #39435

Open
anmyachev opened this issue Jan 27, 2021 · 1 comment
Labels
Enhancement; IO Data (IO issues that don't fit into a more specific label); Needs Discussion (requires discussion from core team before further action)

Comments

@anmyachev
Contributor

Is your feature request related to a problem?

Implementing this idea would simplify the use of the reading functions and reduce boilerplate code.
At the same time, it should not make those functions much harder to maintain in pandas.

Current reading approach (from Pandas docs):

  import glob
  import pandas as pd
  files = glob.glob('file_*.csv')
  result = pd.concat([pd.read_csv(f) for f in files], ignore_index=True)

Describe the solution you'd like

Reading several files could become an out-of-the-box feature of pandas by accepting a wildcard:

  result = pd.read_csv('file_*.csv')

API breaking implications

In one of the two proposed solutions, filepath_or_buffer could also be of type list[str].
The changes do not break backward compatibility.

Describe alternatives you've considered

Another possible option (passing a list of files to the read_* call):

  import glob
  result = pd.read_csv(glob.glob('file_*.csv'))
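
Until something like this exists in pandas, both proposed call styles (a wildcard string or a list of paths) can be approximated with a small wrapper. This is only a sketch: `read_csv_multi` is a hypothetical name, not a pandas function, and it assumes local files that share the same columns.

```python
import glob
from typing import Sequence, Union

import pandas as pd


def read_csv_multi(paths: Union[str, Sequence[str]], **kwargs) -> pd.DataFrame:
    """Hypothetical helper (not part of pandas): read several CSV files into one frame.

    `paths` is either a glob pattern such as 'file_*.csv' or an explicit list of paths.
    Extra keyword arguments are forwarded to pd.read_csv for every file.
    """
    if isinstance(paths, str):
        # Sort so the row order does not depend on directory listing order.
        files = sorted(glob.glob(paths))
    else:
        files = list(paths)
    if not files:
        raise FileNotFoundError(f"no files match {paths!r}")
    frames = [pd.read_csv(f, **kwargs) for f in files]
    return pd.concat(frames, ignore_index=True)
```

With this wrapper, `read_csv_multi('file_*.csv')` and `read_csv_multi(glob.glob('file_*.csv'))` cover the two variants discussed above. A plain single path still works, since a pattern without wildcards matches only that file.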
@anmyachev anmyachev added Enhancement Needs Triage Issue that has not been reviewed by a pandas team member labels Jan 27, 2021
@jbrockmendel jbrockmendel added IO Data IO issues that don't fit into a more specific label and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Jun 6, 2021
@mroeschke mroeschke added the Needs Discussion Requires discussion from core team before further action label Aug 15, 2021
@lhoestq

lhoestq commented Feb 12, 2024

Hi! Is there any news on this? This would be greatly appreciated, especially when dealing with fsspec-compatible URLs.

Right now the boilerplate code involves instantiating a FileSystem, doing the glob, re-creating the full paths, and concatenating:

import pandas as pd
from huggingface_hub import HfFileSystem

path = "hf://datasets/Anthropic/hh-rlhf"
splits = {"train": "**/*/train.jsonl.gz", "test": "**/*/test.jsonl.gz"}

# glob() returns paths without the protocol, so "hf://" is prepended again
files = ["hf://" + p for p in HfFileSystem().glob(f"{path}/{splits['train']}")]

df = pd.concat(pd.read_json(file, lines=True) for file in files)

instead of

import pandas as pd

path = "hf://datasets/Anthropic/hh-rlhf"
splits = {"train": "**/*/train.jsonl.gz", "test": "**/*/test.jsonl.gz"}
df = pd.read_json(f"{path}/{splits['train']}", lines=True)
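
In the meantime, a filesystem-agnostic version of the boilerplate can be sketched with `fsspec.open_files`, which expands glob patterns itself. `read_json_glob` is a hypothetical name, not a pandas or fsspec function, and the sketch assumes that `fsspec` is installed, that the matching filesystem implementation is available for the URL's protocol, and that every matched file is JSON Lines.

```python
import fsspec
import pandas as pd


def read_json_glob(urlpath: str, **kwargs) -> pd.DataFrame:
    """Hypothetical helper: read every JSON Lines file matching an fsspec URL pattern."""
    # open_files expands the glob; compression="infer" picks gzip for *.gz names.
    open_files = fsspec.open_files(urlpath, mode="rt", compression="infer")
    if not open_files:
        raise FileNotFoundError(f"no files match {urlpath!r}")
    frames = []
    for open_file in open_files:
        # Each OpenFile is a context manager yielding a text-mode file object.
        with open_file as handle:
            frames.append(pd.read_json(handle, lines=True, **kwargs))
    return pd.concat(frames, ignore_index=True)
```

The call would then look like `read_json_glob(f"{path}/{splits['train']}")`, which is close to the one-liner asked for above but lives outside pandas.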
