Merge pull request #140 from martindurant/cache_docs
update docs, mainly for caching [skip ci]
martindurant committed Sep 17, 2019
2 parents bb5043e + 4e39342 commit 3e8c37d
Showing 8 changed files with 143 additions and 64 deletions.
27 changes: 27 additions & 0 deletions docs/source/api.rst
@@ -30,6 +30,7 @@ Base Classes
fsspec.spec.AbstractBufferedFile
fsspec.FSMap
fsspec.core.OpenFile
fsspec.core.BaseCache

.. autoclass:: fsspec.spec.AbstractFileSystem
:members:
@@ -46,6 +47,9 @@ Base Classes
.. autoclass:: fsspec.core.OpenFile
:members:

.. autoclass:: fsspec.core.BaseCache
:members:


.. _implementations:

@@ -62,6 +66,7 @@ Built-in Implementations
fsspec.implementations.webhdfs.WebHDFS
fsspec.implementations.zip.ZipFileSystem
fsspec.implementations.cached.CachingFileSystem
fsspec.implementations.cached.WholeFileCacheFileSystem

.. autoclass:: fsspec.implementations.ftp.FTPFileSystem
:members: __init__
@@ -88,3 +93,25 @@ Built-in Implementations

.. autoclass:: fsspec.implementations.cached.CachingFileSystem
:members: __init__

.. autoclass:: fsspec.implementations.cached.WholeFileCacheFileSystem

.. _readbuffering:

Read Buffering
--------------

.. autosummary::

fsspec.core.ReadAheadCache
fsspec.core.BytesCache
fsspec.core.MMapCache

.. autoclass:: fsspec.core.ReadAheadCache
:members:

.. autoclass:: fsspec.core.BytesCache
:members:

.. autoclass:: fsspec.core.MMapCache
:members:
52 changes: 52 additions & 0 deletions docs/source/features.rst
@@ -165,3 +165,55 @@ Since files can hold on to write caches and read buffers,
the instance cache may cause excessive memory usage in some situations; but normally, files
are closed and their data discarded. This is only likely to become a problem when there is also an
unfinalised transaction or a captured traceback.

File Buffering
--------------

Most implementations create file objects which derive from ``fsspec.spec.AbstractBufferedFile``, and
have many behaviours in common. These files offer buffering of both read and write operations, so that
communication with the remote resource is limited. The size of the buffer is generally configured
with the ``blocksize=`` keyword argument at open time, although the implementation may have some minimum or
maximum sizes that need to be respected.

For reading, a number of buffering schemes are available, listed in ``fsspec.core.caches``
(see :ref:`readbuffering`); pass "none" for no buffering at all. For example, to use a simple
read-ahead buffer:

.. code-block:: python

    import fsspec

    fs = fsspec.filesystem(...)
    with fs.open(path, mode='rb', cache_type='readahead') as f:
        use_for_something(f)

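
To make the idea of a read-ahead buffer concrete, here is a minimal standard-library sketch. It is not the code of ``fsspec.core.ReadAheadCache``; the class name ``ReadAheadSketch`` and the helper ``fetch`` are hypothetical, and only the ``blocksize``, ``fetcher`` and ``size`` parameters follow the ``BaseCache`` docstring in this commit. On a miss it fetches the requested range plus one extra block, so a following sequential read can be served from memory.

```python
class ReadAheadSketch:
    """Toy read-ahead buffer (illustrative only, not fsspec's implementation)."""

    def __init__(self, blocksize, fetcher, size):
        self.blocksize = blocksize  # how far to read ahead, in bytes
        self.fetcher = fetcher      # callable f(start, end) -> bytes
        self.size = size            # total size of the "remote" file
        self.start = 0              # range currently held in the buffer
        self.end = 0
        self.buffer = b""

    def read(self, start, end):
        end = min(end, self.size)
        if start >= end:
            return b""
        if not (self.start <= start and end <= self.end):
            # Miss: fetch the request plus one extra block beyond it.
            self.start = start
            self.end = min(end + self.blocksize, self.size)
            self.buffer = self.fetcher(self.start, self.end)
        offset = start - self.start
        return self.buffer[offset:offset + (end - start)]


data = bytes(range(256)) * 4  # stands in for a remote file
calls = []                    # records every "remote" request


def fetch(start, end):
    # Hypothetical fetcher: a real one would issue a ranged remote read.
    calls.append((start, end))
    return data[start:end]


cache = ReadAheadSketch(blocksize=100, fetcher=fetch, size=len(data))
first = cache.read(0, 10)    # miss: triggers one fetch
second = cache.read(10, 60)  # served from the buffer filled by the first read
```

Both reads here cost a single call to ``fetch``, which is the saving that buffering is meant to provide when each remote request is slow.
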
Caching Files Locally
---------------------

The purpose of ``fsspec`` is to give you access to data on remote file systems. However, such
access can often be rather slow compared to local storage, so as well as buffering (see above), the
option exists to copy files locally when you first access them, and thereafter to use the local data.
This local cache of data might be temporary (i.e., attached to the process and discarded when the
process ends) or at some specific location in your local storage.

Two mechanisms are provided, and both involve wrapping a ``target`` filesystem. The following example
creates a file-based cache.

.. code-block:: python

    import fsspec

    fs = fsspec.filesystem("filecache", target_protocol='s3', target_options={'anon': True},
                           cache_storage='/tmp/files/')

The first time you open a given remote file on S3, it is copied to the local cache directory, and
all further access uses the local file. Since we specify
a particular local location, the files will persist and can be reused from future sessions, although
you can also set policies to have cached files expire after some time, or to check the remote file system
on each open, to see if the target file has changed since it was copied.
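
The copy-on-first-open behaviour with an expiry policy can be sketched with the standard library alone. This is not how ``fsspec.implementations.cached`` is written; the function ``cached_copy``, its hashed file naming and the use of a local file as the "remote" source are all assumptions made for illustration. Only the ``expiry_time`` idea (seconds after which a local copy is considered stale, falsy to disable) follows the docstring in this commit.

```python
import hashlib
import os
import shutil
import tempfile
import time


def cached_copy(remote_path, cache_dir, expiry_time=None):
    """Return a local path for remote_path, copying it on first use.

    Hypothetical helper: a local file copy stands in for a remote download.
    """
    os.makedirs(cache_dir, exist_ok=True)
    # Derive a stable local name from the remote path.
    name = hashlib.sha256(remote_path.encode()).hexdigest()
    local_path = os.path.join(cache_dir, name)
    if os.path.exists(local_path):
        age = time.time() - os.path.getmtime(local_path)
        if not expiry_time or age < expiry_time:
            return local_path  # cached copy is still considered fresh
    shutil.copyfile(remote_path, local_path)
    return local_path


remote_dir = tempfile.mkdtemp()  # pretend this directory is remote storage
cache_dir = tempfile.mkdtemp()   # plays the role of cache_storage
remote_file = os.path.join(remote_dir, "data.bin")
with open(remote_file, "wb") as f:
    f.write(b"hello")

local_a = cached_copy(remote_file, cache_dir, expiry_time=3600)  # copies
local_b = cached_copy(remote_file, cache_dir, expiry_time=3600)  # reuses
```

Note that this sketch never consults the remote file once a fresh copy exists, so a change on the remote side goes unnoticed until the copy expires. That is the trade-off the "check on each open" policy mentioned above is meant to address.
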

With the "blockcache" variant, data is downloaded block-wise: only the specific parts of the remote file
which are accessed. This means that the local copy of the file might end up being much smaller than the
remote one, if only certain parts of it are required.
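
As a rough illustration of block-wise caching, the sketch below stores only the blocks that a read actually touches. It is not the "blockcache" implementation: the class ``BlockCacheSketch`` and the helper ``fetch`` are hypothetical, and an in-memory dictionary stands in for the sparse local file that the real mechanism relies on.

```python
class BlockCacheSketch:
    """Toy block-wise cache (illustrative only, not fsspec's implementation)."""

    def __init__(self, blocksize, fetcher, size):
        self.blocksize = blocksize
        self.fetcher = fetcher  # callable f(start, end) -> bytes
        self.size = size
        self.blocks = {}        # block index -> bytes; stand-in for a sparse file

    def read(self, start, end):
        end = min(end, self.size)
        if start >= end:
            return b""
        first = start // self.blocksize
        last = (end - 1) // self.blocksize
        parts = []
        for i in range(first, last + 1):
            if i not in self.blocks:
                # Fetch only blocks that have not been seen before.
                lo = i * self.blocksize
                hi = min(lo + self.blocksize, self.size)
                self.blocks[i] = self.fetcher(lo, hi)
            parts.append(self.blocks[i])
        joined = b"".join(parts)
        offset = start - first * self.blocksize
        return joined[offset:offset + (end - start)]


data = bytes(i % 251 for i in range(1000))  # stands in for a remote file
calls = []                                  # records every "remote" request


def fetch(start, end):
    # Hypothetical fetcher: a real one would issue a ranged remote read.
    calls.append((start, end))
    return data[start:end]


cache = BlockCacheSketch(blocksize=100, fetcher=fetch, size=len(data))
chunk = cache.read(250, 260)       # touches block 2 only
straddle = cache.read(295, 305)    # needs blocks 2 and 3; only 3 is fetched
stored = sum(len(b) for b in cache.blocks.values())
```

After these two reads the cache holds 200 of the 1000 bytes, which mirrors the point above that the local copy can be much smaller than the remote file.
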

Whereas "filecache" works for all file system implementations, and provides a real local file for other
libraries to use, "blockcache" has restrictions: that you have a storage/OS combination which supports
sparse files, that the backend implementation uses files which derive from ``AbstractBufferedFile``,
and that the library you pass the resultant object to accepts generic python file-like objects. You
should not mix block- and file-caches in the same directory.
76 changes: 38 additions & 38 deletions fsspec/core.py
@@ -22,20 +22,20 @@ class OpenFile(object):
Parameters
----------
fs : FileSystem
fs: FileSystem
The file system to use for opening the file. Should match the interface
of ``dask.bytes.local.LocalFileSystem``.
path : str
path: str
Location to open
mode : str like 'rb', optional
mode: str like 'rb', optional
Mode of the opened file
compression : str or None, optional
compression: str or None, optional
Compression to apply
encoding : str or None, optional
encoding: str or None, optional
The encoding to use if opened in text mode.
errors : str or None, optional
errors: str or None, optional
How to handle encoding errors if opened in text mode.
newline : None or str
newline: None or str
Passed to TextIOWrapper in text mode, how to handle line endings.
"""
def __init__(self, fs, path, mode='rb', compression=None, encoding=None,
@@ -114,31 +114,31 @@ def open_files(urlpath, mode='rb', compression=None, encoding='utf8',
Parameters
----------
urlpath : string or list
urlpath: string or list
Absolute or relative filepath(s). Prefix with a protocol like ``s3://``
to read from alternative filesystems. To read from multiple files you
can pass a globstring or a list of paths, with the caveat that they
must all have the same protocol.
mode : 'rb', 'wt', etc.
compression : string
mode: 'rb', 'wt', etc.
compression: string
Compression to use. See ``dask.bytes.compression.files`` for options.
encoding : str
encoding: str
For text mode only
errors : None or str
errors: None or str
Passed to TextIOWrapper in text mode
name_function : function or None
name_function: function or None
if opening a set of files for writing, those files do not yet exist,
so we need to generate their names by formatting the urlpath for
each sequence number
num : int [1]
num: int [1]
if writing mode, number of files we expect to create (passed to
name_function)
protocol : str or None
protocol: str or None
If given, overrides the protocol found in the URL.
newline : bytes or None
newline: bytes or None
Used for line terminator in text mode. If None, uses system default;
if blank, uses no translation.
**kwargs : dict
**kwargs: dict
Extra options that make sense to a particular storage connection, e.g.
host, port, username, password, etc.
@@ -166,23 +166,23 @@ def open(urlpath, mode='rb', compression=None, encoding='utf8',
Parameters
----------
urlpath : string or list
urlpath: string or list
Absolute or relative filepath. Prefix with a protocol like ``s3://``
to read from alternative filesystems. Should not include glob
character(s).
mode : 'rb', 'wt', etc.
compression : string
mode: 'rb', 'wt', etc.
compression: string
Compression to use. See ``dask.bytes.compression.files`` for options.
encoding : str
encoding: str
For text mode only
errors : None or str
errors: None or str
Passed to TextIOWrapper in text mode
protocol : str or None
protocol: str or None
If given, overrides the protocol found in the URL.
newline : bytes or None
newline: bytes or None
Used for line terminator in text mode. If None, uses system default;
if blank, uses no translation.
**kwargs : dict
**kwargs: dict
Extra options that make sense to a particular storage connection, e.g.
host, port, username, password, etc.
@@ -231,12 +231,12 @@ def expand_paths_if_needed(paths, mode, num, fs, name_function):
"""Expand paths if they have a ``*`` in them.
:param paths: list of paths
mode : str
mode: str
Mode in which to open files.
num : int
num: int
If opening in writing mode, number of files we expect to create.
fs : filesystem object
name_function : callable
fs: filesystem object
name_function: callable
If opening in writing mode, this callable is used to generate path
names. Names are generated for each partition by
``urlpath.replace('*', name_function(partition_index))``.
@@ -272,18 +272,18 @@ def get_fs_token_paths(urlpath, mode='rb', num=1, name_function=None,
Parameters
----------
urlpath : string or iterable
urlpath: string or iterable
Absolute or relative filepath, URL (may include protocols like
``s3://``), or globstring pointing to data.
mode : str, optional
mode: str, optional
Mode in which to open files.
num : int, optional
num: int, optional
If opening in writing mode, number of files we expect to create.
name_function : callable, optional
name_function: callable, optional
If opening in writing mode, this callable is used to generate path
names. Names are generated for each partition by
``urlpath.replace('*', name_function(partition_index))``.
storage_options : dict, optional
storage_options: dict, optional
Additional keywords to pass to the filesystem class.
protocol: str or None
To override the protocol specifier in the URL
@@ -363,12 +363,12 @@ class BaseCache(object):
Parameters
----------
blocksize : int
blocksize: int
How far to read ahead in numbers of bytes
fetcher : func
fetcher: func
Function of the form f(start, end) which gets bytes from remote as
specified
size : int
size: int
How big this file is
"""
def __init__(self, blocksize, fetcher, size, **kwargs):
@@ -491,7 +491,7 @@ class BytesCache(BaseCache):
Parameters
----------
trim : bool
trim: bool
As we read more data, whether to discard the start of the buffer when
we are more than a blocksize ahead of it.
"""
10 changes: 5 additions & 5 deletions fsspec/fuse.py
@@ -125,19 +125,19 @@ def run(fs, path, mount_point, foreground=True, threads=False):
Parameters
----------
fs : file-system instance
fs: file-system instance
From one of the compatible implementations
path : str
path: str
Location on that file-system to regard as the root directory to
mount. Note that you typically should include the terminating "/"
character.
mount_point : str
mount_point: str
An empty directory on the local file-system where the contents of
the remote path will appear
foreground : bool
foreground: bool
Whether or not calling this function will block. Operation will
typically be more stable if True.
threads : bool
threads: bool
Whether or not to create threads when responding to file operations
within the mounter directory. Operation will typically be more
stable if False.
12 changes: 6 additions & 6 deletions fsspec/implementations/cached.py
@@ -38,25 +38,25 @@ def __init__(self, target_protocol=None, cache_storage='TMP',
Parameters
----------
target_protocol : str
target_protocol: str
Target filesystem protocol
cache_storage : str or list(str)
cache_storage: str or list(str)
Location to store files. If "TMP", this is a temporary directory,
and will be cleaned up by the OS when this process ends (or later).
If a list, each location will be tried in the order given, but
only the last will be considered writable.
cache_check : int
cache_check: int
Number of seconds between reload of cache metadata
check_files : bool
check_files: bool
Whether to explicitly see if the UID of the remote file matches
the stored one before using. Warning: some file systems such as
HTTP cannot reliably give a unique hash of the contents of some
path, so be sure to set this option to False.
expiry_time : int
expiry_time: int
The time in seconds after which a local copy is considered useless.
Set to falsy to prevent expiry. The default is equivalent to one
week.
target_options : dict or None
target_options: dict or None
Passed to the instantiation of the FS, if fs is None.
"""
if cache_storage == "TMP":
6 changes: 3 additions & 3 deletions fsspec/mapping.py
@@ -12,10 +12,10 @@ class FSMap(MutableMapping):
Parameters
----------
root : string
root: string
prefix for all the files
fs : FileSystem instance
check : bool (=True)
fs: FileSystem instance
check: bool (=True)
performs a touch at the location, to check for write access.
Examples