
Feature to read csv from hdfs:// URL #18199

Open
AbdealiLoKo opened this issue Nov 9, 2017 · 20 comments
Labels
Enhancement IO CSV read_csv, to_csv IO Network Local or Cloud (AWS, GCS, etc.) IO Issues

Comments

@AbdealiLoKo
Contributor

When running pandas in AWS, the following works perfectly fine:

pd.read_csv("s3://mybucket/data.csv")

But running the following, does not:

pd.read_csv("hdfs:///tmp/data.csv")

It would be a good user experience to allow the hdfs:// scheme too, similar to how http, ftp, s3, and file are valid schemes right now.
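For context, read_csv already accepts any file-like object, so the missing piece is only the URL dispatch for the hdfs:// scheme. The stream-reading half can be illustrated with io.StringIO standing in for a stream an HDFS client would return (a sketch, not pandas internals):

```python
import io

import pandas as pd

# Any object with a .read() method works; an HDFS client's open() call
# would hand pandas a stream just like this in-memory stand-in.
buf = io.StringIO("a,b\n1,2\n3,4\n")
df = pd.read_csv(buf)
print(df.shape)  # (2, 2)
```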

@TomAugspurger
Contributor

Based on the limited bit I know, dealing with authentication and all that can be a rabbit hole.

If you want to put together a prototype based around http://hdfs3.readthedocs.io/en/latest/, I think we'd add it.

@TomAugspurger TomAugspurger added API Design IO Data IO issues that don't fit into a more specific label Difficulty Intermediate Enhancement labels Nov 9, 2017
@TomAugspurger TomAugspurger added this to the Next Major Release milestone Nov 9, 2017
@AbdealiLoKo
Contributor Author

How should this be implemented? Should there also be a read_hdfs, like the existing read_s3?

@TomAugspurger
Contributor

See https://github.com/pandas-dev/pandas/blob/dc4b0708f36b971f71890bfdf830d9a5dc019c7b/pandas/io/s3.py and the _is_s3_url function, then search for uses of _is_s3_url.

@jreback
Contributor

jreback commented Nov 9, 2017

i believe we can use hdfs3 (similar to s3fs) and/or pyarrow for this; it would work similarly to the way we handle s3 atm


@AbdealiLoKo
Contributor Author

So, here is a quick comparison:

  • hdfs: package for connecting to WebHDFS and HttpFS, the REST protocols for accessing HDFS data
  • hdfs3: wrapper around the libhdfs3 library, which must be installed independently
  • pyarrow: supports both engines, the native libhdfs and the separately installed libhdfs3
  • cyhdfs: Cython wrapper for the native libhdfs

These seem to be the active options (each had its latest release in 2017).

As pandas already has a pyarrow engine for parquet, using pyarrow with the native libhdfs looks like the most universal option.

@AbdealiLoKo AbdealiLoKo mentioned this issue Nov 29, 2017
4 tasks
@jbrockmendel jbrockmendel added IO CSV read_csv, to_csv and removed IO Data IO issues that don't fit into a more specific label labels Dec 1, 2019
@jbrockmendel jbrockmendel added the IO Network Local or Cloud (AWS, GCS, etc.) IO Issues label Dec 11, 2019
@sergei3000
Contributor

This would be a great feature to have in Pandas. Is it still being worked on?

@jreback
Contributor

jreback commented Jan 21, 2020

@sergei3000 you are welcome to submit a PR; pandas is an all-volunteer effort

@DavidKatz-il

Hi @jreback, I want to work on this PR.
The changes proposed in the previous PR are no longer relevant since pandas now uses fsspec.
Do you have any suggestions on how to start?

@jreback
Contributor

jreback commented Sep 25, 2020

i am not sure what the appropriate library for reading from hdfs is nowadays - so we need to figure that out. note that pyarrow does support this, i believe - so that's an option as well

otherwise this would be similar to how we implement other readers, e.g. gcs

@jreback
Contributor

jreback commented Sep 25, 2020

tests can be written in a similar manner to here: https://github.com/pandas-dev/pandas/blob/master/pandas/tests/io/test_gcs.py
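A mocked test in that spirit (a sketch; read_hdfs_csv and the opener injection are hypothetical, not pandas code) replaces the HDFS client's open() with an in-memory buffer, so no cluster is needed:

```python
import io
from unittest import mock

import pandas as pd


def read_hdfs_csv(path, opener=open):
    """Hypothetical reader; `opener` stands in for an HDFS client's open()."""
    with opener(path) as f:
        return pd.read_csv(f)


def test_read_hdfs_csv_mocked():
    # The fake open() returns an in-memory CSV instead of touching a cluster.
    fake_open = mock.MagicMock(return_value=io.StringIO("a,b\n1,2\n"))
    df = read_hdfs_csv("hdfs:///tmp/data.csv", opener=fake_open)
    fake_open.assert_called_once_with("hdfs:///tmp/data.csv")
    assert df.shape == (1, 2)


test_read_hdfs_csv_mocked()
```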

@sergei3000
Contributor

This is how I managed to read from hdfs:

import os

import pandas as pd
import pydoop.hdfs as hd

os.environ['HADOOP_CONF_DIR'] = "/usr/hdp/2.6.4.0-91/hadoop/conf"

with hd.open("/share/bla/bla/bla/filename.csv") as f:
    df = pd.read_csv(f)

@DavidKatz-il

DavidKatz-il commented Sep 26, 2020

This is how I managed to read from hdfs:

import os

import pandas as pd
import pydoop.hdfs as hd

os.environ['HADOOP_CONF_DIR'] = "/usr/hdp/2.6.4.0-91/hadoop/conf"

with hd.open("/share/bla/bla/bla/filename.csv") as f:
    df = pd.read_csv(f)

pandas uses fsspec for reading s3 / gcs files, and fsspec also supports reading from hdfs via pyarrow.
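The fsspec dispatch can be exercised without a cluster by using its built-in in-memory filesystem; memory:// is a real fsspec protocol, and with pyarrow installed the hdfs:// scheme dispatches the same way (the file contents here are made up):

```python
import fsspec
import pandas as pd

# Put a small CSV into fsspec's in-memory filesystem.
mem = fsspec.filesystem("memory")
with mem.open("/data.csv", "w") as f:
    f.write("a,b\n1,2\n3,4\n")

# pandas hands URL-like paths to fsspec, which picks a backend by scheme;
# swap "memory://" for "hdfs://namenode/..." against a real cluster.
df = pd.read_csv("memory://data.csv")
```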

@sergei3000
Contributor

Here you can find some code samples of using pyarrow written by Wes McKinney.

https://wesmckinney.com/blog/python-hdfs-interfaces/

@DavidKatz-il

Hey @jreback
I just checked reading / writing hdfs files (csv and parquet) with pandas, and as I guessed it works fine. It probably works because the project started using fsspec for reading s3 / gcs files.
So I will continue working on adding tests.

@martindurant
Contributor

Note that pyarrow's HDFS interface will be deprecated at some point. I guess the "legacy" interface will be around for a while, but fsspec will need its shim rewritten against pyarrow's newer filesystem once that is stable. Hopefully this shouldn't affect users.

@DavidKatz-il

Hi @jreback
Should the test_hdfs tests run in a separate docker container (similar to the dask docker setup) or in the pandas docker image (which would require additional installations)?

@jreback
Contributor

jreback commented Oct 13, 2020

we don't run any containers as part of the CI

this is just mocked, which i think is fine

if we really want full testing then we would need to set up a new azure job for this (not against it, but a bit overkill)

that said, if you want to, go ahead

@martindurant
Contributor

Note that dask does test its read_csv from HDFS: https://github.com/dask/dask/blob/master/dask/bytes/tests/test_hdfs.py#L131

@mroeschke mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022