Set 0 to `pyarrow.fs.HadoopFileSystem` port in v2 API to make it consistent to the v1 API behavior #225

belltailjp · 2021-10-15T05:44:37Z

With the v2 API, connecting to HDFS fails, whereas the v1 (legacy) API works fine.

>>> pfio.v2.from_url('hdfs:///')
...
Traceback (most recent call last):
  File "<string>", line 2, in <module>
  File "/usr/local/lib/python3.8/contextlib.py", line 113, in __enter__
    return next(self.gen)
  File "/usr/local/lib/python3.8/site-packages/pfio/v2/fs.py", line 285, in open_url
    with from_url(dirname) as fs:
  File "/usr/local/lib/python3.8/site-packages/pfio/v2/fs.py", line 314, in from_url
    fs = Hdfs(dirname, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/pfio/v2/hdfs.py", line 185, in __init__
    self._fs = _create_fs()
  File "/usr/local/lib/python3.8/site-packages/pfio/v2/hdfs.py", line 167, in _create_fs
    return HadoopFileSystem(nameservice)
  File "pyarrow/_hdfs.pyx", line 83, in pyarrow._hdfs.HadoopFileSystem.__init__
  File "pyarrow/error.pxi", line 141, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 112, in pyarrow.lib.check_status
OSError: HDFS connection failed

The v1 API uses the deprecated pyarrow function pyarrow.hdfs.connect, while v2 uses the newer, recommended pyarrow.fs.HadoopFileSystem.
In both APIs, pfio does not specify a port number explicitly. However, hdfs.connect uses 0 as the default port, while fs.HadoopFileSystem uses 8020. I found that this mismatch is why pfio+HDFS works with v1 but not with v2.

If I modify the v2 API to explicitly pass 0 as the port argument when instantiating pyarrow.fs.HadoopFileSystem, it seems to work as expected.

This PR proposes to introduce this change.
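
A minimal sketch of the idea, not the actual pfio code: the helper names hdfs_connect_args and create_fs are hypothetical, and "default" as the host is assumed to mean the fs.defaultFS entry from the Hadoop configuration. The only point is that port=0 is passed explicitly instead of relying on pyarrow's default of 8020.

```python
# Sketch only: hdfs_connect_args and create_fs are hypothetical names,
# not functions that exist in pfio.

def hdfs_connect_args(nameservice="default", port=0):
    """Build the arguments passed to pyarrow.fs.HadoopFileSystem.

    port defaults to 0 here (the value the legacy pyarrow.hdfs.connect used),
    overriding HadoopFileSystem's own default of 8020.
    """
    return {"host": nameservice, "port": port}


def create_fs(nameservice="default"):
    # Imported lazily so hdfs_connect_args can be used without pyarrow installed.
    from pyarrow.fs import HadoopFileSystem

    # Needs a working Hadoop client environment (libhdfs, $HADOOP_CONF_DIR).
    return HadoopFileSystem(**hdfs_connect_args(nameservice))
```

Calling create_fs() only succeeds on a machine with a configured Hadoop client, so hdfs_connect_args is kept separate to make the chosen port easy to check on its own.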

I suspect this behavior depends on how the storage environment is set up, so if there is a better way to configure the port, please let me know.

kuenishi · 2021-10-18T07:33:50Z

/test

pfn-ci-bot · 2021-10-18T07:33:53Z

Successfully created a job for commit 19a1935:

Dashboard for commit 19a1935

kuenishi · 2021-10-18T07:38:38Z

Thank you, and nice catch. The documentation of the old API pyarrow.hdfs.connect describes the parameter as "port (NameNode's port. Set to 0 for default or logical (HA) nodes.)", which is reasonable for our configuration.

kuenishi

We could add an option for specifying the port number to from_url() and so on, but name services in Hadoop are often configured as HA, and many more configuration knobs rely on $HADOOP_CONF_DIR. Thus, a backward-compatible change like this should be sufficient.

Set 0 to HDFS port

19a1935

kuenishi added the cat:bug Bug report or fix. label Oct 18, 2021

kuenishi added this to the 2.0.0 milestone Oct 18, 2021

kuenishi approved these changes Oct 18, 2021


kuenishi merged commit 90c9c30 into pfnet:master Oct 18, 2021
