Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HdfsFileSystem.stat to return a compatible object to os.stat #114

Merged
merged 8 commits into from
May 1, 2020

Conversation

belltailjp
Copy link
Member

@belltailjp belltailjp commented Mar 26, 2020

I am using ChainerIO 0.1.1.
Currently HdfsFileSystem.stat returns a tiny dict such as {'size': 5, 'kind': 'file'},
while PosixFileSystem.stat, which is equivalent to os.stat, returns a os.stat_result object.
Users who try to transparently handle different type of filesystems by using ChainerIO filesystem handler suffer from this interface difference.

In this PR, 2 independent problems are handled.

  • stat for both xxxxFileSystem return the same type of object (and I used os.stat_result)
  • In HDFS (or probably pyarrow issue), stat returns poor information as mentioned above, so I use info instead. At least size and kind are included in the info too, so I guess it's a safe modification, unless there are some system dependent differences.

There are a few issues remained.

  • st_ino, st_dev, st_nlink are left as blank. I think it's affordable.
  • owner, group are not id (int) but string in this PR. If there is an efficient way of conversion from string to uid/gid, I'd like to implement it.
  • st_ctime is not available in HDFS. I set the same value as st_mtime

It's just a PoC PR so I'm not confident the approach I implement is a good direction (specifically, using os.stat_result as a common return type). I'd like to hear your opinion.

None, # st_ino
None, # st_dev
None, # st_nlink
info['owner'], # st_uid
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the info['owner'] and info['group'] are strings instead of uids

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes they are! In contrast what os.stat returns are int (uid/gid). It's fortunate (?) that os.stat_result also accepts strings, but if a user expects the normal behavior (having int in st_uid and st_gid attributes), he or she will suffer.

In [4]: os.stat('pytest.ini')       
Out[4]: os.stat_result(st_mode=33204, st_ino=27665508, st_dev=2049, st_nlink=1, st_uid=1000, st_gid=1000, st_size=51, st_atime=1585040937, st_mtime=1583388582, st_ctime=1583388582)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I did not know os.stat_result also accepts strings, I thought it just a wrapper of struct stat. There is actually a way to convert the username to uid, and I used that when writing the hdfs fuse client. However, we still have the problem when the user or group does not exist on the local machine.
https://stackoverflow.com/questions/826082/python-finding-uid-gid-for-a-given-username-groupname-for-os-chown

Copy link
Member

@belldandyxtq belldandyxtq left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add tests in case we are missing some fields?

stat.kind = info['kind']
stat.mode = info['permissions']
stat.size = info['size']
stat.owner = info['owner']
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you leave some comments to mention that the stat.owner and stat.group can be string in hdfs case?
https://github.com/chainer/chainerio/blob/faf6d8bb5ad10526436ede13375646f842e7890c/chainerio/io.py#L134
can be a good place

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for point it out! I have written docstrings of FileStatss to clarify these things.

chainerio/filesystems/posix.py Outdated Show resolved Hide resolved
chainerio/filesystems/posix.py Outdated Show resolved Hide resolved
chainerio/filesystems/hdfs.py Outdated Show resolved Hide resolved
@belldandyxtq belldandyxtq added the cat:feature Implementation that introduces new interfaces. label Apr 9, 2020
@belltailjp
Copy link
Member Author

belltailjp commented Apr 27, 2020

Sorry for the late update!
I have refined the implementation and wrote tests.
I confirmed all tests pass with Python 3.5.7 and 3.7.2.

Summary of the change is:

  • Previously stat returned different types, dict, os.stat, zipinfo
  • I inserted one abstraction FileStat, and defined filesystem specific stat class (e.g., HdfsFileStat) on top of it
  • To keep maximum compatibility, common attributes such as mode (permissions) or last_modified have identical behavior
    • permission: integer including regular-file / directory flag
    • last_modified: UNIX timestamp

@belldandyxtq
Copy link
Member

Can you resolve the conflict?

Copy link
Member

@belldandyxtq belldandyxtq left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you very much for you contribution, I left some concerns.

stat = handler.stat(test_file_name)
self.assertIsInstance(stat, pfio.containers.zip.ZipFileStat)
self.assertTrue(stat.filename.endswith(test_file_name))
self.assertEqual(stat.size, 22)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about we retrieve the data from the zip_info so that these tests are more robust to changes?

self.assertIsInstance(stat, pfio.containers.zip.ZipFileStat)
self.assertTrue(stat.filename.endswith(test_file_name))
self.assertEqual(stat.size, 22)
self.assertEqual(stat.mode, 0o100664)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto

stat = handler.stat(test_dir_name)
self.assertIsInstance(stat, pfio.containers.zip.ZipFileStat)
self.assertTrue(stat.filename.endswith(test_dir_name))
self.assertEqual(stat.size, 0)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto

self.assertIsInstance(stat, pfio.containers.zip.ZipFileStat)
self.assertTrue(stat.filename.endswith(test_dir_name))
self.assertEqual(stat.size, 0)
self.assertEqual(stat.mode, 0o40775)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto

self.assertEqual(stat.mode, 0o100664)
self.assertFalse(stat.isdir())
self.assertTrue(stat.last_modified)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please also test the rest of ZipFileStat

stat = handler.stat(test_dir_name)
self.assertIsInstance(stat, HdfsFileStat)
self.assertTrue(stat.filename.endswith(test_dir_name))
self.assertEqual(stat.size, 0)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto

self.assertTrue(stat.filename.endswith(test_dir_name))
self.assertEqual(stat.size, 0)
self.assertTrue(stat.isdir())
self.assertEqual(stat.mode, 0o40755)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto

stat = handler.stat(test_file_name)
self.assertIsInstance(stat, PosixFileStat)
self.assertTrue(stat.filename.endswith(test_file_name))
self.assertEqual(stat.size, 6)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto


stat = handler.stat(test_dir_name)
self.assertIsInstance(stat, PosixFileStat)
self.assertTrue(stat.filename.endswith(test_dir_name))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto

self.group = info['group']
self.last_modified = info['last_modified']
self.last_accessed = info['last_accessed']

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think it would be useful to add other fields from the hdfs.info, like "replication" and "block_size"

@belltailjp
Copy link
Member Author

belltailjp commented Apr 28, 2020

I have fixed the review points!

  • Provide all the public attributes each filesystem provides
  • Get expected values for test cases through direct use of each filesystem API

Also, it seems I forgot to push some modifications that I made before the review. Now they are already available.

  • Make sure to have float in FileStat.last_modified for compatibility (zip and hdfs has no sub-second precision, though)
  • docstrings to clarify each attributes, as well as notation on behavior difference among filesystems (e.g. group and user in HDFS and POSIX, time resolution difference in last_modified)

What I have checked is:

  • Test all-green (Py 3.5.7, 3.7.2)
  • flake8 (pfio and tests directory)

I haven't been able to check document generation yet.

Copy link
Member

@belldandyxtq belldandyxtq left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a few nits in docstring

pfio/io.py Outdated

`stat` of filesystem/container handlers return an object of
subclass of `FileStat`.
In addition to the common attribute that the `FileStat` abstract provides,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

attributes

pfio/io.py Outdated
In addition to the common attribute that the `FileStat` abstract provides,
each `FileStat` subclass implements some additional attributes depending on
what information the corresponding filesystem or container can handle.
The common attributes behave almost the same in spite of filesystem or
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

almost is kind of ambiguous

pfio/io.py Outdated
size = None

def isdir(self):
"""Distinguish whether the target is directory, based on the permission flag
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about return

Return whether the target is directory

@belldandyxtq
Copy link
Member

/test

@pfn-ci-bot
Copy link

Successfully created a job for commit eaeb20c:

@belltailjp
Copy link
Member Author

I have updated docstrings! (fixed the review points, also applied grammarly)

@belldandyxtq
Copy link
Member

/test

@pfn-ci-bot
Copy link

Successfully created a job for commit e8469f6:

@belldandyxtq
Copy link
Member

@belltailjp Can you please fix the style?

@belltailjp
Copy link
Member Author

Oops! Sorry I have fixed it right now.

@belldandyxtq
Copy link
Member

/test

@pfn-ci-bot
Copy link

Successfully created a job for commit 07dc703:

@belltailjp
Copy link
Member Author

Sorry, give me just a moment.
I noticed links are not properly made in docstrings.
image

@belltailjp
Copy link
Member Author

Done Sphinx fix! 🙇‍♂️

@belldandyxtq
Copy link
Member

/test

@pfn-ci-bot
Copy link

Successfully created a job for commit cdbd40a:

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cat:feature Implementation that introduces new interfaces.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants