
TypeError: pyarrow._hdfs.HadoopFileSystem.__init__() takes at least 1 positional argument (0 given) #657

Closed
reoono opened this issue Mar 24, 2022 · 1 comment · Fixed by #658

Comments

@reoono
Contributor

reoono commented Mar 24, 2022

🐛 Bug

In papermill==2.3.4, when I try to use HadoopFileSystem, I get the following error message.

$ papermill Untitled.ipynb hdfs://myhost/tmp.ipynb
Executing:   0%|                                                                     | 0/9 [00:00<?, ?cell/s]
Traceback (most recent call last):
  File "/opt/conda/bin/papermill", line 8, in <module>
    sys.exit(papermill())
  File "/opt/conda/lib/python3.7/site-packages/click/core.py", line 1128, in __call__
    return self.main(*args, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/click/core.py", line 1053, in main
    rv = self.invoke(ctx)
  File "/opt/conda/lib/python3.7/site-packages/click/core.py", line 1395, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/opt/conda/lib/python3.7/site-packages/click/core.py", line 754, in invoke
    return __callback(*args, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/click/decorators.py", line 26, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/papermill/cli.py", line 267, in papermill
    execution_timeout=execution_timeout,
  File "/opt/conda/lib/python3.7/site-packages/papermill/execute.py", line 118, in execute_notebook
    **engine_kwargs
  File "/opt/conda/lib/python3.7/site-packages/papermill/engines.py", line 49, in execute_notebook_with_engine
    return self.get_engine(engine_name).execute_notebook(nb, kernel_name, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/papermill/engines.py", line 357, in execute_notebook
    nb_man.notebook_start()
  File "/opt/conda/lib/python3.7/site-packages/papermill/engines.py", line 69, in wrapper
    return func(self, *args, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/papermill/engines.py", line 198, in notebook_start
    self.save()
  File "/opt/conda/lib/python3.7/site-packages/papermill/engines.py", line 69, in wrapper
    return func(self, *args, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/papermill/engines.py", line 139, in save
    write_ipynb(self.nb, self.output_path)
  File "/opt/conda/lib/python3.7/site-packages/papermill/iorw.py", line 447, in write_ipynb
    papermill_io.write(nbformat.writes(nb), path)
  File "/opt/conda/lib/python3.7/site-packages/papermill/iorw.py", line 143, in write
    return self.get_handler(path).write(buf, path)
  File "/opt/conda/lib/python3.7/site-packages/papermill/iorw.py", line 372, in write
    with self._get_client().open(path, 'wb') as f:
  File "/opt/conda/lib/python3.7/site-packages/papermill/iorw.py", line 361, in _get_client
    self._client = HadoopFileSystem()
  File "pyarrow/_hdfs.pyx", line 55, in pyarrow._hdfs.HadoopFileSystem.__init__
TypeError: __init__() takes at least 1 positional argument (0 given)

Steps to reproduce the behavior

  1. Install the latest version of papermill by either of the following steps:
  • pip install papermill==2.3.4
  • Install according to CONTRIBUTING.md
  2. Run papermill, specifying a URL starting with hdfs:// for the output path,
    e.g.: $ papermill untitled.ipynb hdfs://myhost/tmp.ipynb

Analysis

I found that the HDFSHandler in iorw.py was not updated, even though the HadoopFileSystem API changed with the filesystem migration (#615).
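For illustration, the failing call can be reproduced outside papermill. This is a minimal sketch, assuming pyarrow >= 2.0 and a local Hadoop client configuration:

from pyarrow.fs import HadoopFileSystem

# Fails: the new pyarrow.fs class requires at least a host argument.
# TypeError: __init__() takes at least 1 positional argument (0 given)
# fs = HadoopFileSystem()

# Works: host="default" resolves the namenode from the local Hadoop
# configuration (fs.defaultFS), matching the old no-argument behavior.
fs = HadoopFileSystem(host="default")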

Proposed amendment

As described above, the function interface has changed, and I would like to fix HDFSHandler to match the new pyarrow.fs.HadoopFileSystem API.

Note that in that case, I think we should also remove the import of the legacy pyarrow.HadoopFileSystem and update requirements/hdfs.txt.
(If we keep supporting the previous filesystem as well, the handler would need to detect which class was imported and call the appropriate API.
However, I personally don't think supporting the deprecated class is worth the future maintenance effort.
Would you mind sharing your thoughts?)

@reoono
Contributor Author

reoono commented Mar 24, 2022

An example implementation for adapting to the new API is shown below.
https://arrow.apache.org/docs/python/generated/pyarrow.fs.HadoopFileSystem.html

Please check and let me know if you have any comments.
If there are no problems, I will create a PR.

Parts that need to be fixed

HDFSHandler

__init__()

  no change needed

_get_client(self)

  before: HadoopFileSystem()
  after:  HadoopFileSystem(host="default")

read(self, path)

  before: self._get_client().open(path, 'rb')
  after:  self._get_client().open_input_stream(path)

listdir(self, path)

  before: self._get_client().ls(path)
  after:  [f.path for f in self._get_client().get_file_info(FileSelector(path))]

  https://arrow.apache.org/docs/python/filesystems.html#listing-files

write(self, buf, path)

  before: self._get_client().open(path, 'wb') as f:
  after:  self._get_client().open_output_stream(path) as f:

pretty_path(self, path)

  no change needed

requirements/hdfs.txt

  before: pyarrow
  after:  pyarrow >= 2.0

(Because pyarrow.hdfs.HadoopFileSystem has been deprecated since pyarrow 2.0:
https://arrow.apache.org/docs/_modules/pyarrow/hdfs.html#HadoopFileSystem)
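Putting the pieces together, here is a minimal sketch of what the updated handler could look like. This is my illustration under the assumptions above, not a final patch; it mirrors the structure of HDFSHandler in iorw.py and assumes pyarrow >= 2.0 with a reachable HDFS cluster:

from pyarrow.fs import FileSelector, HadoopFileSystem

class HDFSHandler:
    def __init__(self):
        self._client = None

    def _get_client(self):
        if self._client is None:
            # host="default" reads the namenode from the local Hadoop
            # configuration (fs.defaultFS).
            self._client = HadoopFileSystem(host="default")
        return self._client

    def read(self, path):
        with self._get_client().open_input_stream(path) as f:
            return f.read()

    def listdir(self, path):
        return [info.path for info in self._get_client().get_file_info(FileSelector(path))]

    def write(self, buf, path):
        # nbformat.writes() hands us a str; the output stream expects bytes.
        with self._get_client().open_output_stream(path) as f:
            return f.write(str.encode(buf))

    def pretty_path(self, path):
        return path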
