File access V2 API #151

Merged: 14 commits, Mar 5, 2021
5 changes: 5 additions & 0 deletions chainerio/__init__.py
@@ -1,4 +1,9 @@
import sys
import warnings

warnings.warn("package 'chainerio' is deprecated and will be removed."
              " Please use 'pfio' instead.",
              DeprecationWarning)

# make sure pfio is in sys.modules
import pfio # NOQA
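
Python's default warning filters ignore a `DeprecationWarning` unless it is triggered by code in `__main__`, so a warning raised at import time inside a package may not be visible to users. The sketch below shows one way a caller or a test could confirm that such a shim warns. It uses only the standard library; `legacy_import` is a hypothetical stand-in for `import chainerio`, not part of this PR.

```python
import warnings


def legacy_import():
    # Hypothetical stand-in for `import chainerio`; emits the same kind of warning.
    warnings.warn("package 'chainerio' is deprecated and will be removed."
                  " Please use 'pfio' instead.",
                  DeprecationWarning)


with warnings.catch_warnings(record=True) as caught:
    # Record every warning instead of applying the default filters.
    warnings.simplefilter("always")
    legacy_import()

# Keep only the deprecation messages that were recorded.
messages = [str(w.message) for w in caught
            if issubclass(w.category, DeprecationWarning)]
print(messages)
```

Running the interpreter with `-W default` is another way to surface these warnings without changing any code.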
2 changes: 2 additions & 0 deletions docs/source/index.rst
@@ -12,6 +12,8 @@ Welcome to PFIO's documentation!

   design
   reference
   v2


Indices and tables
==================
75 changes: 75 additions & 0 deletions docs/source/v2.rst
@@ -0,0 +1,75 @@
.. module:: pfio.v2

V2 API
======

.. note:: This API is still in an experimental phase.


The PFIO v2 API tries to resolve the impedance mismatch between local
filesystems, NFS, and object storage systems, with much simpler and
cleaner code.

Several top-level functions that seemed less important have been
removed. They turned out to introduce more complexity than originally
intended, because they require a global context. Thus, functions that
depend on the global context, such as ``open()`` and ``set_root()``,
have been removed in the v2 API.

Instead, the v2 API provides only two top-level functions that enable
direct resource access with a full URL: ``open_url()`` and
``from_url()``. The former opens a file and returns a FileObject. The
latter creates an ``fs.FS`` object that enables resource access under
the URL. The new class ``fs.FS`` is close to the handler object
in the version 1 API. ``fs.FS`` is intended to be as compatible as
possible; however, it has several differences.
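
To illustrate the idea of choosing a backend from a full URL, here is a minimal sketch that uses only ``urllib.parse``. The function ``pick_backend`` and the scheme table are hypothetical and are not taken from the pfio source; the real ``from_url()`` may parse and dispatch differently.

```python
from urllib.parse import urlparse

# Hypothetical mapping from URL scheme to a backend name.
BACKENDS = {"": "Local", "file": "Local", "hdfs": "Hdfs", "s3": "S3"}


def pick_backend(url):
    """Return (backend name, path inside the backend) for a URL."""
    parsed = urlparse(url)
    try:
        backend = BACKENDS[parsed.scheme]
    except KeyError:
        raise ValueError("unsupported scheme: {!r}".format(parsed.scheme))
    if parsed.scheme == "s3":
        # For object storage, treat the network location as the bucket name.
        return backend, parsed.netloc + parsed.path
    return backend, parsed.path


local_choice = pick_backend("/tmp/data.txt")
hdfs_choice = pick_backend("hdfs:///user/me/data")
s3_choice = pick_backend("s3://bucket/dir/file.txt")
```

Because the scheme alone selects the backend in this sketch, no global state such as a default root is needed, which is the motivation the text gives for dropping ``open()`` and ``set_root()``.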

One notable difference is that it has a virtual concept of a current
working directory, and thus provides a ``subfs()`` method. ``subfs()``
behaves like ``chroot(1)`` or ``os.chdir()``, but it does not change
the current working directory of the process; instead, it
returns a *new* ``fs.FS`` object that has a different working
directory. All resource access through the object automatically
prepends the working directory.
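
The virtual working directory can be pictured with a small standard-library sketch. ``TinyFS`` is a hypothetical stand-in written for this illustration, not the pfio class, and it only models a local directory.

```python
import os
import tempfile


class TinyFS:
    """Hypothetical stand-in for a local ``fs.FS``; not the pfio class."""

    def __init__(self, cwd):
        self.cwd = cwd

    def subfs(self, rel):
        # Return a *new* object; the process-wide cwd is never touched.
        return TinyFS(os.path.join(self.cwd, rel))

    def open(self, path, mode="r"):
        # Every access is resolved against this object's working directory.
        return open(os.path.join(self.cwd, path), mode)


root_dir = tempfile.mkdtemp()
os.mkdir(os.path.join(root_dir, "data"))
process_cwd = os.getcwd()

root = TinyFS(root_dir)
sub = root.subfs("data")

# Write through the sub-filesystem, read back through the root one.
with sub.open("a.txt", "w") as f:
    f.write("hello")
with root.open(os.path.join("data", "a.txt")) as f:
    text = f.read()
```

Since ``subfs()`` returns a new object in this sketch, ``root`` keeps its own working directory and both objects can be used side by side.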

The v2 API no longer provides lazy resource initialization. Instead,
it provides a simple wrapper, ``lazify()``, which recreates the ``fs.FS``
object every time the object experiences ``fork(2)``. ``Hdfs`` and
``Zip`` can be wrapped with it to become fork-tolerant objects.
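
One common way to build such a fork-tolerant wrapper is to remember the process ID at creation time and rebuild the wrapped object when the ID changes. The sketch below assumes that approach; ``ForkAwareProxy`` and its ``getpid`` parameter are hypothetical names, and this is not the implementation of ``lazify()``.

```python
import os


class ForkAwareProxy:
    """Hypothetical sketch of a fork-tolerant wrapper; not pfio's lazify()."""

    def __init__(self, factory, getpid=os.getpid):
        self._factory = factory  # zero-argument callable that builds the object
        self._getpid = getpid    # injectable so the behaviour can be tested
        self._pid = None
        self._obj = None

    def _target(self):
        pid = self._getpid()
        if self._obj is None or pid != self._pid:
            # First use, or we are now in a different process: rebuild.
            self._obj = self._factory()
            self._pid = pid
        return self._obj

    def __getattr__(self, name):
        # Called only for attributes not found on the proxy itself.
        return getattr(self._target(), name)


created = []


def make_resource():
    # Stand-in for opening a connection or an archive.
    created.append(object())
    return {"id": len(created)}


fake_pid = [100]
proxy = ForkAwareProxy(make_resource, getpid=lambda: fake_pid[0])
first = proxy.get("id")       # builds the object in "process" 100
again = proxy.get("id")       # same process: the object is reused
fake_pid[0] = 200             # pretend a fork happened
after_fork = proxy.get("id")  # different pid: the object is rebuilt
```

The fake process ID keeps the example portable; in a real child process ``os.getpid()`` would change after ``fork(2)`` and trigger the same rebuild.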



Reference
---------

.. autofunction:: open_url
.. autofunction:: from_url
.. autofunction:: lazify


.. autoclass:: pfio.v2.fs.FS
   :members:

Local file system
~~~~~~~~~~~~~~~~~

.. autoclass:: Local
   :members:

HDFS (Hadoop File System)
~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: Hdfs
   :members:

S3 (AWS S3)
~~~~~~~~~~~

.. autoclass:: S3
   :members:

Zip Archive
~~~~~~~~~~~

.. autoclass:: Zip
   :members:
6 changes: 6 additions & 0 deletions pfio/chainer_extensions/__init__.py
@@ -1 +1,7 @@
import warnings

from pfio.chainer_extensions.snapshot import load_snapshot # NOQA

warnings.warn("Chainer extensions are deprecated and "
              "will be removed. Please use 'pfio' instead.",
              DeprecationWarning)
49 changes: 1 addition & 48 deletions pfio/io.py
@@ -1,5 +1,4 @@
import abc
import stat
from abc import abstractmethod
from importlib import import_module
from io import IOBase
@@ -8,6 +7,7 @@

from pfio._typing import Optional
from pfio.profiler import IOProfiler
from pfio.v2.fs import FileStat


def open_wrapper(func):
@@ -25,53 +25,6 @@ def wrapper(self, file_path: str, mode: str = 'rb',
    return wrapper


class FileStat(abc.ABC):
    """Detailed file or directory information abstraction

    :meth:`pfio.IO.stat` of filesystem/container handlers return an object of
    subclass of ``FileStat``.
    In addition to the common attributes that the ``FileStat`` abstract
    provides, each ``FileStat`` subclass implements some additional
    attributes depending on what information the corresponding filesystem or
    container can handle.
    The common attributes have the same behavior despite filesystem or
    container type difference.

    Attributes:
        filename (str):
            Filename in the filesystem or container.
        last_modified (float):
            UNIX timestamp of mtime. Note that some
            filesystems or containers do not have sub-second precision.
        mode (int):
            Permission with file type flag (regular file or directory).
            You can make a human-readable interpretation by
            `stat.filemode <https://docs.python.org/3/library/stat.html#stat.filemode>`_.
        size (int):
            Size in bytes. Note that directories may have different
            sizes depending on the filesystem or container type.
    """  # NOQA
    filename = None
    last_modified = None
    mode = None
    size = None

    def isdir(self):
        """Returns whether the target is a directory, based on the permission flag

        Returns:
            `True` if directory, `False` otherwise.
        """
        return bool(self.mode & 0o40000)

    def __str__(self):
        return '<{} filename="{}" mode="{}">'.format(
            type(self).__name__, self.filename, stat.filemode(self.mode))

    def __repr__(self):
        return str(self.__str__())
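
A side note on the `isdir()` test shown above: `0o40000` equals `stat.S_IFDIR`, but a plain bitwise AND also matches other file types whose type code contains that bit, such as block devices. The sketch below compares the bitmask test with `stat.S_ISDIR` using only the standard library. `isdir_bitmask` is a hypothetical helper written for this comparison, and the difference only matters if a backend ever reports such modes.

```python
import stat


def isdir_bitmask(mode):
    # Same test as the FileStat.isdir() shown above.
    return bool(mode & 0o40000)


dir_mode = stat.S_IFDIR | 0o755
file_mode = stat.S_IFREG | 0o644
blk_mode = stat.S_IFBLK | 0o660  # block device: type code 0o060000

# Each entry is (bitmask result, stat.S_ISDIR result).
results = {
    "dir": (isdir_bitmask(dir_mode), stat.S_ISDIR(dir_mode)),
    "file": (isdir_bitmask(file_mode), stat.S_ISDIR(file_mode)),
    "blk": (isdir_bitmask(blk_mode), stat.S_ISDIR(blk_mode)),
}
```

The two tests agree for regular files and directories and differ only for the block-device mode.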


class IO(abc.ABC):
    def __init__(self, io_profiler: Optional[IOProfiler] = None,
                 root: str = ""):
63 changes: 63 additions & 0 deletions pfio/testing/__init__.py
@@ -0,0 +1,63 @@
import os
import random
import string
from zipfile import ZipFile


class ZipForTest:
    def __init__(self, destfile, data=None):
        if data is None:
            self.data = dict(
                file=b"foo",
                dir=dict(
                    f=b"bar"
                )
            )
        else:
            self.data = data

        self._make_zip(destfile)
        self.destfile = destfile

    def content(self, path):
        d = self.data

        for node in path.split(os.path.sep):
            d = d.get(node)
            if not isinstance(d, dict):
                return d

    def _make_zip(self, destfile):
        with ZipFile(destfile, "w") as z:
            stack = []
            self._write_zip_contents(z, stack, self.data)

    def _write_zip_contents(self, z, stack, data):
        for k in data:
            if isinstance(data[k], dict):
                self._write_zip_contents(z, stack+[k], data[k])
            else:
                path = os.path.join(*stack, k)
                with z.open(path, 'w') as fp:
                    fp.write(data[k])
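
The nested-dict layout used by `ZipForTest` can be tried out without touching the disk. The following sketch writes the same default data into an in-memory archive and reads it back. `write_tree` is a hypothetical helper written for this illustration, not part of the PR, and it joins names with `posixpath` because zip entry names use forward slashes.

```python
import io
import posixpath
from zipfile import ZipFile


def write_tree(z, data, prefix=""):
    """Write a nested dict of bytes into an open ZipFile (hypothetical helper)."""
    for name, value in data.items():
        path = posixpath.join(prefix, name)
        if isinstance(value, dict):
            # A dict is a directory: recurse with the extended prefix.
            write_tree(z, value, path)
        else:
            # Anything else is treated as file content.
            z.writestr(path, value)


buf = io.BytesIO()
with ZipFile(buf, "w") as z:
    write_tree(z, {"file": b"foo", "dir": {"f": b"bar"}})

# Reopen the same buffer for reading and inspect the entries.
with ZipFile(buf) as z:
    names = sorted(z.namelist())
    content = z.read("dir/f")
```

As in `ZipForTest`, only files become archive entries in this sketch; directories are implied by the entry names.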


def make_zip(zipfilename, root_dir, base_dir):
    pwd = os.getcwd()
    with ZipFile(zipfilename, "w") as f:
        try:
            os.chdir(root_dir)
            for root, dirs, filenames in os.walk(base_dir):
                for _dir in dirs:
                    path = os.path.normpath(os.path.join(root, _dir))
                    f.write(path)
                for _file in filenames:
                    path = os.path.normpath(os.path.join(root, _file))
                    f.write(path)
        finally:
            os.chdir(pwd)


def make_random_str(n):
    return ''.join([random.choice(string.ascii_letters + string.digits)
                    for i in range(n)])
36 changes: 36 additions & 0 deletions pfio/v2/__init__.py
@@ -0,0 +1,36 @@
'''
fs.FS interface
  implementations:
    open_fs(URI, container=None|zip) => Local/HDFS/S3/Zip, etc
    - Local
      - subfs() -> Local
      - open_zip() -> Zip

Review comment (Member): By using this name, are we not going to support containers other than zip?

Reply (Member Author): Do we have any other container format in scope? I would like to be as specific as we need.

      - open() -> FileObject
    - HDFS
      - subfs() -> HDFS
      - open_zip() -> Zip
      - open() -> FileObject
    - Zip
      - subfs() -> Zip
      - open_zip() -> Zip
      - open() -> FileObject
    - S3 (TBD)
    - GS (TBD)

For example, to switch backend file systems globally::

    from pfio.v2 import local as pfio

Or::

    from pfio.v2 import Hdfs
    pfio = Hdfs()

'''
from .fs import from_url, lazify, open_url # NOQA
from .hdfs import Hdfs, HdfsFileStat # NOQA
from .local import Local, LocalFileStat # NOQA
from .s3 import S3 # NOQA
from .zip import Zip, ZipFileStat # NOQA

local = Local()