
TypeError: pyarrow._hdfs.HadoopFileSystem.__init__() takes at least 1 positional argument (0 given) #657

Closed
reoono opened this issue Mar 24, 2022 · 1 comment · Fixed by #658

Comments

@reoono
Contributor

reoono commented Mar 24, 2022

🐛 Bug

In papermill==2.3.4, when I try to use HadoopFileSystem, I get the following error message.

$ papermill Untitled.ipynb hdfs://myhost/tmp.ipynb
Executing:   0%|                                                                     | 0/9 [00:00<?, ?cell/s]
Traceback (most recent call last):
  File "/opt/conda/bin/papermill", line 8, in <module>
    sys.exit(papermill())
  File "/opt/conda/lib/python3.7/site-packages/click/core.py", line 1128, in __call__
    return self.main(*args, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/click/core.py", line 1053, in main
    rv = self.invoke(ctx)
  File "/opt/conda/lib/python3.7/site-packages/click/core.py", line 1395, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/opt/conda/lib/python3.7/site-packages/click/core.py", line 754, in invoke
    return __callback(*args, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/click/decorators.py", line 26, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/papermill/cli.py", line 267, in papermill
    execution_timeout=execution_timeout,
  File "/opt/conda/lib/python3.7/site-packages/papermill/execute.py", line 118, in execute_notebook
    **engine_kwargs
  File "/opt/conda/lib/python3.7/site-packages/papermill/engines.py", line 49, in execute_notebook_with_engine
    return self.get_engine(engine_name).execute_notebook(nb, kernel_name, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/papermill/engines.py", line 357, in execute_notebook
    nb_man.notebook_start()
  File "/opt/conda/lib/python3.7/site-packages/papermill/engines.py", line 69, in wrapper
    return func(self, *args, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/papermill/engines.py", line 198, in notebook_start
    self.save()
  File "/opt/conda/lib/python3.7/site-packages/papermill/engines.py", line 69, in wrapper
    return func(self, *args, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/papermill/engines.py", line 139, in save
    write_ipynb(self.nb, self.output_path)
  File "/opt/conda/lib/python3.7/site-packages/papermill/iorw.py", line 447, in write_ipynb
    papermill_io.write(nbformat.writes(nb), path)
  File "/opt/conda/lib/python3.7/site-packages/papermill/iorw.py", line 143, in write
    return self.get_handler(path).write(buf, path)
  File "/opt/conda/lib/python3.7/site-packages/papermill/iorw.py", line 372, in write
    with self._get_client().open(path, 'wb') as f:
  File "/opt/conda/lib/python3.7/site-packages/papermill/iorw.py", line 361, in _get_client
    self._client = HadoopFileSystem()
  File "pyarrow/_hdfs.pyx", line 55, in pyarrow._hdfs.HadoopFileSystem.__init__
TypeError: __init__() takes at least 1 positional argument (0 given)

Steps to reproduce the behavior

  1. Install the latest version of papermill by either of the following steps:
  • pip install papermill==2.3.4
  • Install according to CONTRIBUTING.md
  2. Run papermill, specifying a URL starting with hdfs:// for the output path,
    e.g.: $ papermill untitled.ipynb hdfs://myhost/tmp.ipynb

Analysis

I found that the HDFSHandler in iorw.py was not updated, even though the HadoopFileSystem API changed with the filesystem migration (#615).
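For illustration, the failing call can be reproduced outside papermill. This is a minimal sketch, assuming pyarrow >= 2.0 and a local Hadoop client configuration:

from pyarrow.fs import HadoopFileSystem

# Fails: the new pyarrow.fs class requires at least a host argument.
# TypeError: __init__() takes at least 1 positional argument (0 given)
# fs = HadoopFileSystem()

# Works: host="default" resolves the namenode from the local Hadoop
# configuration (fs.defaultFS), matching the old no-argument behavior.
fs = HadoopFileSystem(host="default")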

Proposed amendment

As described above, the function interface has changed, and I would like to fix HDFSHandler to match the new pyarrow.fs.HadoopFileSystem API.

Note that in that case, I think we should also remove the import of the legacy pyarrow.HadoopFileSystem and update requirements/hdfs.txt.
(If we keep supporting the previous filesystem as well, the handler would need to detect which class was imported and call the appropriate API.
However, I personally don't think supporting the deprecated class is worth the future maintenance effort.
Would you mind sharing your thoughts?)

@reoono
Contributor Author

reoono commented Mar 24, 2022

An example implementation for adapting to the new API is shown below.
https://arrow.apache.org/docs/python/generated/pyarrow.fs.HadoopFileSystem.html

Please check and let me know if you have any comments.
If there are no problems, I will create a PR.

Parts that need to be fixed

HDFSHandler

__init__()

  no change needed

_get_client(self)

  before: HadoopFileSystem()
  after:  HadoopFileSystem(host="default")

read(self, path)

  before: self._get_client().open(path, 'rb')
  after:  self._get_client().open_input_stream(path)

listdir(self, path)

  before: self._get_client().ls(path)
  after:  [f.path for f in self._get_client().get_file_info(FileSelector(path))]

  https://arrow.apache.org/docs/python/filesystems.html#listing-files

write(self, buf, path)

  before: self._get_client().open(path, 'wb') as f:
  after:  self._get_client().open_output_stream(path) as f:

pretty_path(self, path)

  no change needed

requirements/hdfs.txt

  before: pyarrow
  after:  pyarrow >= 2.0

(Because pyarrow.hdfs.HadoopFileSystem has been deprecated since pyarrow 2.0:
https://arrow.apache.org/docs/_modules/pyarrow/hdfs.html#HadoopFileSystem)
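Putting the pieces together, here is a minimal sketch of what the updated handler could look like. This is my illustration under the assumptions above, not a final patch; it mirrors the structure of HDFSHandler in iorw.py and assumes pyarrow >= 2.0 with a reachable HDFS cluster:

from pyarrow.fs import FileSelector, HadoopFileSystem

class HDFSHandler:
    def __init__(self):
        self._client = None

    def _get_client(self):
        if self._client is None:
            # host="default" reads the namenode from the local Hadoop
            # configuration (fs.defaultFS).
            self._client = HadoopFileSystem(host="default")
        return self._client

    def read(self, path):
        with self._get_client().open_input_stream(path) as f:
            return f.read()

    def listdir(self, path):
        return [info.path for info in self._get_client().get_file_info(FileSelector(path))]

    def write(self, buf, path):
        # nbformat.writes() hands us a str; the output stream expects bytes.
        with self._get_client().open_output_stream(path) as f:
            return f.write(str.encode(buf))

    def pretty_path(self, path):
        return path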
