
Feature to read csv from hdfs:// URL #18199

Open
AbdealiLoKo opened this issue Nov 9, 2017 · 20 comments
Labels
Enhancement IO CSV read_csv, to_csv IO Network Local or Cloud (AWS, GCS, etc.) IO Issues

Comments

@AbdealiLoKo
Contributor

When running pandas in AWS, the following works perfectly fine:

pd.read_csv("s3://mybucket/data.csv")

But running the following, does not:

pd.read_csv("hdfs:///tmp/data.csv")

It would be a good user experience to allow the hdfs:// scheme too, similar to how http, ftp, s3, and file are valid schemes right now.
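For context, read_csv already accepts any file-like object, so the missing piece is only the URL dispatch for the hdfs:// scheme. The stream-reading half can be illustrated with io.StringIO standing in for a stream an HDFS client would return (a sketch, not pandas internals):

```python
import io

import pandas as pd

# Any object with a .read() method works; an HDFS client's open() call
# would hand pandas a stream just like this in-memory stand-in.
buf = io.StringIO("a,b\n1,2\n3,4\n")
df = pd.read_csv(buf)
print(df.shape)  # (2, 2)
```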

@TomAugspurger
Contributor

Based on the limited bit I know, dealing with authentication and all that can be a rabbit hole.

If you want to put together a prototype based around http://hdfs3.readthedocs.io/en/latest/, I think we'd add it.

@TomAugspurger TomAugspurger added API Design IO Data IO issues that don't fit into a more specific label Difficulty Intermediate Enhancement labels Nov 9, 2017
@TomAugspurger TomAugspurger added this to the Next Major Release milestone Nov 9, 2017
@AbdealiLoKo
Contributor Author

How should this be implemented? Should there also be a read_hdfs, like the existing read_s3?

@TomAugspurger
Contributor

See https://github.com/pandas-dev/pandas/blob/dc4b0708f36b971f71890bfdf830d9a5dc019c7b/pandas/io/s3.py and the _is_s3_url function, then search for uses of _is_s3_url.

@jreback
Contributor

jreback commented Nov 9, 2017

i believe we can use hdfs3 (similar to s3fs) and/or pyarrow for this; it would work similarly to the way we handle s3 atm


@AbdealiLoKo
Contributor Author

So, here is a quick comparison:

  • hdfs: package for connecting to WebHDFS and HttpFS, the REST protocols for accessing HDFS data
  • hdfs3: wrapper around the libhdfs3 library, which must be installed independently
  • pyarrow: supports both engines, the native libhdfs and the separately installed libhdfs3
  • cyhdfs: Cython wrapper for the native libhdfs

These seem to be the active options (each had its latest release in 2017).

As pandas already has a pyarrow engine for parquet, using pyarrow with the native libhdfs looks like the most universal option.

@AbdealiLoKo AbdealiLoKo mentioned this issue Nov 29, 2017
4 tasks
@jbrockmendel jbrockmendel added IO CSV read_csv, to_csv and removed IO Data IO issues that don't fit into a more specific label labels Dec 1, 2019
@jbrockmendel jbrockmendel added the IO Network Local or Cloud (AWS, GCS, etc.) IO Issues label Dec 11, 2019
@sergei3000
Contributor

This would be a great feature to have in Pandas. Is it still being worked on?

@jreback
Contributor

jreback commented Jan 21, 2020

@sergei3000 you are welcome to submit a PR; pandas is an all-volunteer effort

@DavidKatz-il

Hi @jreback, I want to work on this PR.
The changes proposed in the previous PR are no longer relevant since pandas now uses fsspec.
Do you have any suggestions on how to start?

@jreback
Contributor

jreback commented Sep 25, 2020

i am not sure what the appropriate library for reading from hdfs is nowadays - so we need to figure that out. note that pyarrow does support this, i believe - so that's an option as well

otherwise this would be similar to how we implement other readers, e.g. gcs

@jreback
Contributor

jreback commented Sep 25, 2020

tests can be written in a similar manner to here: https://github.com/pandas-dev/pandas/blob/master/pandas/tests/io/test_gcs.py
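A mocked test in that spirit (a sketch; read_hdfs_csv and the opener injection are hypothetical, not pandas code) replaces the HDFS client's open() with an in-memory buffer, so no cluster is needed:

```python
import io
from unittest import mock

import pandas as pd


def read_hdfs_csv(path, opener=open):
    """Hypothetical reader; `opener` stands in for an HDFS client's open()."""
    with opener(path) as f:
        return pd.read_csv(f)


def test_read_hdfs_csv_mocked():
    # The fake open() returns an in-memory CSV instead of touching a cluster.
    fake_open = mock.MagicMock(return_value=io.StringIO("a,b\n1,2\n"))
    df = read_hdfs_csv("hdfs:///tmp/data.csv", opener=fake_open)
    fake_open.assert_called_once_with("hdfs:///tmp/data.csv")
    assert df.shape == (1, 2)


test_read_hdfs_csv_mocked()
```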

@sergei3000
Contributor

This is how I managed to read from hdfs:

import os

import pandas as pd
import pydoop.hdfs as hd

os.environ['HADOOP_CONF_DIR'] = "/usr/hdp/2.6.4.0-91/hadoop/conf"

with hd.open("/share/bla/bla/bla/filename.csv") as f:
    df = pd.read_csv(f)

@DavidKatz-il

DavidKatz-il commented Sep 26, 2020

This is how I managed to read from hdfs:

import os

import pandas as pd
import pydoop.hdfs as hd

os.environ['HADOOP_CONF_DIR'] = "/usr/hdp/2.6.4.0-91/hadoop/conf"

with hd.open("/share/bla/bla/bla/filename.csv") as f:
    df = pd.read_csv(f)

pandas uses fsspec for reading s3 / gcs files, and fsspec also supports reading from hdfs via pyarrow.
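The fsspec dispatch can be exercised without a cluster by using its built-in in-memory filesystem; memory:// is a real fsspec protocol, and with pyarrow installed the hdfs:// scheme dispatches the same way (the file contents here are made up):

```python
import fsspec
import pandas as pd

# Put a small CSV into fsspec's in-memory filesystem.
mem = fsspec.filesystem("memory")
with mem.open("/data.csv", "w") as f:
    f.write("a,b\n1,2\n3,4\n")

# pandas hands URL-like paths to fsspec, which picks a backend by scheme;
# swap "memory://" for "hdfs://namenode/..." against a real cluster.
df = pd.read_csv("memory://data.csv")
```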

@sergei3000
Contributor

Here you can find some code samples of using pyarrow written by Wes McKinney.

https://wesmckinney.com/blog/python-hdfs-interfaces/

@DavidKatz-il

Hey @jreback
I just checked reading / writing hdfs files (csv and parquet) with pandas, and as I guessed it works fine. It probably works because the project started using fsspec for reading s3 / gcs files.
So I will continue working on adding tests.

@martindurant
Contributor

Note that pyarrow's HDFS interface will be deprecated at some point. I guess the "legacy" interface will be around for a while, but fsspec will need its shim rewritten against pyarrow's newer filesystem once that is stable. Hopefully this shouldn't affect users.

@DavidKatz-il

Hi @jreback
Should the test_hdfs tests run in a separate docker container (similar to the dask docker setup) or in the pandas docker image (which would require additional installations)?

@jreback
Contributor

jreback commented Oct 13, 2020

we don't run any containers as part of the CI

this is just mocked, which i think is fine

if we really want full testing then we would need to set up a new azure job for this (not against it, but a bit overkill)

that said, if you want to, go ahead

@martindurant
Contributor

Note that dask does test its read_csv from HDFS: https://github.com/dask/dask/blob/master/dask/bytes/tests/test_hdfs.py#L131

@mroeschke mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022