Set 0 to pyarrow.fs.HadoopFileSystem port in v2 API to make it consistent to the v1 API behavior #225

Merged: 1 commit, Oct 18, 2021

@belltailjp belltailjp commented Oct 15, 2021

With the v2 API, connecting to HDFS fails, whereas the v1 (legacy) API works fine.

>>> pfio.v2.from_url('hdfs:///')
...
Traceback (most recent call last):
  File "<string>", line 2, in <module>
  File "/usr/local/lib/python3.8/contextlib.py", line 113, in __enter__
    return next(self.gen)
  File "/usr/local/lib/python3.8/site-packages/pfio/v2/fs.py", line 285, in open_url
    with from_url(dirname) as fs:
  File "/usr/local/lib/python3.8/site-packages/pfio/v2/fs.py", line 314, in from_url
    fs = Hdfs(dirname, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/pfio/v2/hdfs.py", line 185, in __init__
    self._fs = _create_fs()
  File "/usr/local/lib/python3.8/site-packages/pfio/v2/hdfs.py", line 167, in _create_fs
    return HadoopFileSystem(nameservice)
  File "pyarrow/_hdfs.pyx", line 83, in pyarrow._hdfs.HadoopFileSystem.__init__
  File "pyarrow/error.pxi", line 141, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 112, in pyarrow.lib.check_status
OSError: HDFS connection failed

The v1 API uses the deprecated pyarrow function pyarrow.hdfs.connect, while v2 uses the newer, recommended pyarrow.fs.HadoopFileSystem.
In both APIs, pfio doesn't explicitly specify a port number. However, hdfs.connect uses 0 as the default port, whereas fs.HadoopFileSystem defaults to 8020. I found that this mismatch is why pfio+HDFS works with v1 but doesn't work with v2.

If I modify the v2 API to explicitly pass 0 as the port argument when instantiating pyarrow.fs.HadoopFileSystem, it seems to work as expected.

This PR proposes to introduce this change.
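
For illustration, the idea can be sketched roughly as below. This is a minimal sketch, not the actual diff of this PR: the helper names `hdfs_connect_args` and `create_fs` are hypothetical, and the `"default"` host (which asks libhdfs to use `fs.defaultFS` from the Hadoop configuration) is an assumption about the environment rather than how pfio resolves the name service.

```python
def hdfs_connect_args(nameservice="default", port=0):
    """Build the host/port arguments for pyarrow.fs.HadoopFileSystem.

    Hypothetical helper for illustration. port=0 mirrors the default of the
    legacy pyarrow.hdfs.connect, instead of HadoopFileSystem's default of 8020.
    """
    return {"host": nameservice, "port": port}


def create_fs(nameservice="default"):
    # Imported lazily so that hdfs_connect_args() can be used (and tested)
    # on machines without pyarrow or libhdfs installed.
    from pyarrow.fs import HadoopFileSystem

    args = hdfs_connect_args(nameservice)
    # Pass the port explicitly; omitting it would fall back to 8020.
    return HadoopFileSystem(args["host"], port=args["port"])
```

Calling `create_fs()` requires a working Hadoop client setup (libhdfs, `$HADOOP_CONF_DIR`, and so on), so only `hdfs_connect_args()` can be checked without a cluster.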

I suspect this behavior depends on how the storage environment is set up, so if there's a better way to configure the port, please let me know.

@kuenishi kuenishi added the cat:bug Bug report or fix. label Oct 18, 2021
@kuenishi

/test

@pfn-ci-bot

Successfully created a job for commit 19a1935:

@kuenishi kuenishi added this to the 2.0.0 milestone Oct 18, 2021
@kuenishi

Thank you, and nice catch. The documentation for the old API pyarrow.hdfs.connect describes the parameter as "port (NameNode's port. Set to 0 for default or logical (HA) nodes.)", which is reasonable for our configuration.


@kuenishi kuenishi left a comment

We could add an option for specifying the port number to from_url() and so on, but name services in Hadoop are often configured as HA, and many more configuration knobs rely on $HADOOP_CONF_DIR. Thus, a backward-compatible change like this should be sufficient.
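
For reference, the alternative mentioned above (letting the caller specify a port) could look roughly like the following. This is only a sketch under assumptions: `hdfs_host_port` is a hypothetical helper, not part of pfio, and it assumes the port would be carried in the URL. When the URL has no port, it falls back to 0 so that the Hadoop configuration decides.

```python
from urllib.parse import urlparse


def hdfs_host_port(url, default_host="default", default_port=0):
    """Extract (host, port) from an hdfs:// URL (hypothetical helper).

    Falls back to default_host/default_port when the URL omits them,
    e.g. 'hdfs:///' or a logical (HA) name service without a port.
    """
    parsed = urlparse(url)
    if parsed.scheme != "hdfs":
        raise ValueError("not an hdfs URL: {!r}".format(url))
    host = parsed.hostname or default_host
    port = parsed.port if parsed.port is not None else default_port
    return host, port
```

With this sketch, `hdfs_host_port('hdfs:///')` yields `('default', 0)`, while a URL such as `hdfs://nn1:8020/data` yields `('nn1', 8020)`.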

@kuenishi kuenishi merged commit 90c9c30 into pfnet:master Oct 18, 2021