[Feature] Loading objects from different backends and dumping objects to different backends (#1330)

* [Feature] Choose storage backend by the prefix of filepath
* refactor FileClient and add unittest
* support loading from different backends
* polish docstring
* fix unittest
* rename attribute str_like_obj to is_str_like_obj
* add infer_client method
* add check_exist method
* rename var client to file_client
* polish docstring
* add join_paths method
* remove join_paths and add _format_path
* enhance unittest
* refactor unittest
* singleton pattern
* fix test_clientio.py
* deprecate CephBackend
* enhance docstring
* refactor unittest for petrel
* refactor unittest for disk backend
* update io.md
* add concat_paths method
* improve docstring
* improve docstring
* add isdir and copyfile for file backend
* delete copyfile and add get_local_path
* remove isdir method of petrel
* fix typo
* add comment and polish docstring
* polish docstring
* rename _path_mapping to _map_path
* polish docstring and fix typo
* refactor get_local_path
* add list_dir_or_file for FileClient
* add list_dir_or_file for PetrelBackend
* fix windows ci
* Add return docstring
* polish docstring
* fix typo
* fix typo
* deprecate the conversion from Path to str
* add docs for loading checkpoints with FileClient
* refactor map_path
* add _ensure_methods to ensure methods have been implemented
* fix list_dir_or_file
* rename _ensure_method_implemented to has_method

open-mmlab · Nov 3, 2021 · 01bc35e
1 parent 9b3cffd
commit 01bc35e
Showing 13 changed files with 1,860 additions and 91 deletions.
diff --git a/docs/understand_mmcv/io.md b/docs/understand_mmcv/io.md
@@ -2,11 +2,17 @@
 
 This module provides two universal APIs to load and dump files of different formats.
 
+```{note}
+Since v1.3.16, the IO module supports loading data from, and dumping data to, different backends. More details are in PR [#1330](https://github.com/open-mmlab/mmcv/pull/1330).
+```
+
 ### Load and dump data
 
 `mmcv` provides a universal API for loading and dumping data. The currently
 supported formats are json, yaml and pickle.
 
+#### Load from disk or dump to disk
+
 ```python
 import mmcv
 
@@ -29,6 +35,20 @@ with open('test.yaml', 'w') as f:
     mmcv.dump(data, f, file_format='yaml')
 ```
 
+#### Load from other backends or dump to other backends
+
+```python
+import mmcv
+
+# load data from a file
+data = mmcv.load('s3://bucket-name/test.json')
+data = mmcv.load('s3://bucket-name/test.yaml')
+data = mmcv.load('s3://bucket-name/test.pkl')
+
+# dump data to a file with a filename (infer format from file extension)
+mmcv.dump(data, 's3://bucket-name/out.pkl')
+```
+
 It is also very convenient to extend the api to support more file formats.
 All you need to do is to write a file handler inherited from `BaseFileHandler`
 and register it with one or several file formats.
@@ -92,7 +112,9 @@ d
 e
 ```
 
-Then use `list_from_file` to load the list from a.txt.
+#### Load from disk
+
+Use `list_from_file` to load the list from `a.txt`.
 
 ```python
 >>> mmcv.list_from_file('a.txt')
@@ -113,11 +135,113 @@ For example `b.txt` is a text file with 3 lines.
 3 panda
 ```
 
-Then use `dict_from_file` to load the dict from `b.txt` .
+Then use `dict_from_file` to load the dict from `b.txt`.
 
 ```python
 >>> mmcv.dict_from_file('b.txt')
 {'1': 'cat', '2': ['dog', 'cow'], '3': 'panda'}
 >>> mmcv.dict_from_file('b.txt', key_type=int)
 {1: 'cat', 2: ['dog', 'cow'], 3: 'panda'}
 ```
+
+#### Load from other backends
+
+Use `list_from_file` to load the list from `s3://bucket-name/a.txt`.
+
+```python
+>>> mmcv.list_from_file('s3://bucket-name/a.txt')
+['a', 'b', 'c', 'd', 'e']
+>>> mmcv.list_from_file('s3://bucket-name/a.txt', offset=2)
+['c', 'd', 'e']
+>>> mmcv.list_from_file('s3://bucket-name/a.txt', max_num=2)
+['a', 'b']
+>>> mmcv.list_from_file('s3://bucket-name/a.txt', prefix='/mnt/')
+['/mnt/a', '/mnt/b', '/mnt/c', '/mnt/d', '/mnt/e']
+```
+
+Use `dict_from_file` to load the dict from `s3://bucket-name/b.txt`.
+
+```python
+>>> mmcv.dict_from_file('s3://bucket-name/b.txt')
+{'1': 'cat', '2': ['dog', 'cow'], '3': 'panda'}
+>>> mmcv.dict_from_file('s3://bucket-name/b.txt', key_type=int)
+{1: 'cat', 2: ['dog', 'cow'], 3: 'panda'}
+```
+
+### Load and dump checkpoints
+
+#### Load checkpoints from disk or save to disk
+
+We can read checkpoints from disk or save them to disk as follows.
+
+```python
+import torch
+
+filepath1 = '/path/of/your/checkpoint1.pth'
+filepath2 = '/path/of/your/checkpoint2.pth'
+# read from filepath1
+checkpoint = torch.load(filepath1)
+# save to filepath2
+torch.save(checkpoint, filepath2)
+```
+
+MMCV provides many backends. `HardDiskBackend` is one of them and we can use it to read or save checkpoints.
+
+```python
+import io
+from mmcv.fileio.file_client import HardDiskBackend
+
+disk_backend = HardDiskBackend()
+with io.BytesIO(disk_backend.get(filepath1)) as buffer:
+    checkpoint = torch.load(buffer)
+with io.BytesIO() as buffer:
+    torch.save(checkpoint, buffer)
+    disk_backend.put(buffer.getvalue(), filepath2)
+```
+
+If we want to implement an interface that automatically selects the corresponding
+backend based on the file path, we can use `FileClient`.
+For example, suppose we want to implement two functions, one for reading checkpoints and one for saving them,
+which must support different types of file paths, such as disk paths, network paths or other paths.
+
+```python
+from mmcv.fileio.file_client import FileClient
+
+def load_checkpoint(path):
+    file_client = FileClient.infer_client(uri=path)
+    with io.BytesIO(file_client.get(path)) as buffer:
+        checkpoint = torch.load(buffer)
+    return checkpoint
+
+def save_checkpoint(checkpoint, path):
+    file_client = FileClient.infer_client(uri=path)
+    with io.BytesIO() as buffer:
+        torch.save(checkpoint, buffer)
+        file_client.put(buffer.getvalue(), path)
+
+checkpoint = load_checkpoint(filepath1)
+save_checkpoint(checkpoint, filepath2)
+```
+
+#### Load checkpoints from the Internet
+
+```{note}
+Currently, checkpoints can only be read from the Internet; saving checkpoints to the Internet is not supported.
+```
+
+```python
+import io
+import torch
+from mmcv.fileio.file_client import HTTPBackend, FileClient
+
+filepath = 'http://path/of/your/checkpoint.pth'
+checkpoint = torch.utils.model_zoo.load_url(filepath)
+
+http_backend = HTTPBackend()
+with io.BytesIO(http_backend.get(filepath)) as buffer:
+    checkpoint = torch.load(buffer)
+
+file_client = FileClient.infer_client(uri=filepath)
+with io.BytesIO(file_client.get(filepath)) as buffer:
+    checkpoint = torch.load(buffer)
+```
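
The checkpoint examples added to `io.md` rely on two ideas: choosing a backend from the path prefix, and passing checkpoints through `io.BytesIO` so that a backend only has to move bytes. A minimal, self-contained sketch of that pattern follows. It is not mmcv's implementation: `InMemoryBackend`, `_BACKENDS`, `infer_backend`, `save_object` and `load_object` are hypothetical names, and `pickle` stands in for `torch.save`/`torch.load` so the block runs without PyTorch.

```python
import io
import pickle


class InMemoryBackend:
    """Hypothetical stand-in for a storage backend; not part of mmcv."""

    def __init__(self):
        self._store = {}

    def get(self, filepath):
        # Return the raw bytes stored under filepath.
        return self._store[filepath]

    def put(self, obj, filepath):
        # Store raw bytes under filepath.
        self._store[filepath] = obj


# Hypothetical prefix table (assumption): None means "no prefix", i.e. a plain path.
_BACKENDS = {'s3': InMemoryBackend(), None: InMemoryBackend()}


def infer_backend(uri):
    """Pick a backend from the part of the path before '://', if any."""
    prefix = uri.split('://', 1)[0] if '://' in uri else None
    if prefix not in _BACKENDS:
        raise ValueError(f'no backend registered for prefix {prefix!r}')
    return _BACKENDS[prefix]


def save_object(obj, path):
    backend = infer_backend(path)
    with io.BytesIO() as buffer:
        pickle.dump(obj, buffer)  # the docs use torch.save(obj, buffer) here
        backend.put(buffer.getvalue(), path)


def load_object(path):
    backend = infer_backend(path)
    with io.BytesIO(backend.get(path)) as buffer:
        return pickle.load(buffer)  # the docs use torch.load(buffer) here


save_object({'epoch': 3}, 's3://bucket-name/ckpt.pkl')
save_object({'epoch': 4}, '/tmp/ckpt.pkl')
restored_remote = load_object('s3://bucket-name/ckpt.pkl')
restored_local = load_object('/tmp/ckpt.pkl')
```

Because both helpers infer the backend from the path they are given, neither depends on a module-level client, which is the same reason the documented `save_checkpoint` is safer when it infers its own `file_client`.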
diff --git a/docs_zh_CN/understand_mmcv/io.md b/docs_zh_CN/understand_mmcv/io.md
@@ -2,10 +2,16 @@
 
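 The file IO module provides two universal APIs for loading and saving files of different formats.
 
+```{note}
+Since v1.3.16, the IO module supports loading data from, and saving data to, different backends. For more details, see PR [#1330](https://github.com/open-mmlab/mmcv/pull/1330).
+```
+
 ### Load and save data
 
 `mmcv` provides a universal API for loading and saving data. The currently supported formats are json, yaml and pickle.
 
+#### Load data from disk or save data to disk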
+
 ```python
 import mmcv
 
@@ -28,6 +34,20 @@ with open('test.yaml', 'w') as f:
     mmcv.dump(data, f, file_format='yaml')
 ```
 
+#### Load from other backends or save to other backends
+
+```python
+import mmcv
+
+# load data from an s3 file
+data = mmcv.load('s3://bucket-name/test.json')
+data = mmcv.load('s3://bucket-name/test.yaml')
+data = mmcv.load('s3://bucket-name/test.pkl')
+
+# save data to an s3 file (infer the format from the filename extension)
+mmcv.dump(data, 's3://bucket-name/out.pkl')
+```
+
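 We provide an easy way to extend the API to support more file formats. All we need to do is write a file handler class
 that inherits from `BaseFileHandler` and register it with `mmcv`. The handler must override at least three methods.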
 
@@ -49,7 +69,7 @@ class TxtHandler1(mmcv.BaseFileHandler):
         return str(obj)
 ```
 
-Take `PickleHandler` for example.
+Take `PickleHandler` as an example
 
 ```python
 import pickle
@@ -87,8 +107,9 @@ c
 d
 e
 ```
+#### Load from disk
 
-Use `list_from_file` to read `a.txt` .
+Use `list_from_file` to read `a.txt`
 
 ```python
 >>> mmcv.list_from_file('a.txt')
@@ -101,19 +122,119 @@ e
 ['/mnt/a', '/mnt/b', '/mnt/c', '/mnt/d', '/mnt/e']
 ```
 
-Similarly, `b.txt` is also a text file, with 3 lines in total.
+Similarly, `b.txt` is also a text file, with 3 lines in total
 
 ```
 1 cat
 2 dog cow
 3 panda
 ```
 
-Use `dict_from_file` to read `b.txt` .
+Use `dict_from_file` to read `b.txt`
 
 ```python
 >>> mmcv.dict_from_file('b.txt')
 {'1': 'cat', '2': ['dog', 'cow'], '3': 'panda'}
 >>> mmcv.dict_from_file('b.txt', key_type=int)
 {1: 'cat', 2: ['dog', 'cow'], 3: 'panda'}
 ```
+
+#### Load from other backends
+
+Use `list_from_file` to read `s3://bucket-name/a.txt`.
+
+```python
+>>> mmcv.list_from_file('s3://bucket-name/a.txt')
+['a', 'b', 'c', 'd', 'e']
+>>> mmcv.list_from_file('s3://bucket-name/a.txt', offset=2)
+['c', 'd', 'e']
+>>> mmcv.list_from_file('s3://bucket-name/a.txt', max_num=2)
+['a', 'b']
+>>> mmcv.list_from_file('s3://bucket-name/a.txt', prefix='/mnt/')
+['/mnt/a', '/mnt/b', '/mnt/c', '/mnt/d', '/mnt/e']
+```
+
+Use `dict_from_file` to read `s3://bucket-name/b.txt`.
+
+```python
+>>> mmcv.dict_from_file('s3://bucket-name/b.txt')
+{'1': 'cat', '2': ['dog', 'cow'], '3': 'panda'}
+>>> mmcv.dict_from_file('s3://bucket-name/b.txt', key_type=int)
+{1: 'cat', 2: ['dog', 'cow'], 3: 'panda'}
+```
+
+### Load and save checkpoints
+
+#### Load checkpoints from disk or save checkpoints to disk
+
+We can load checkpoints from disk or save them to disk as follows.
+
+```python
+import torch
+
+filepath1 = '/path/of/your/checkpoint1.pth'
+filepath2 = '/path/of/your/checkpoint2.pth'
+# load the checkpoint from filepath1
+checkpoint = torch.load(filepath1)
+# save the checkpoint to filepath2
+torch.save(checkpoint, filepath2)
+```
+
+MMCV provides many backends. `HardDiskBackend` is one of them, and we can use it to load or save checkpoints.
+
+```python
+import io
+from mmcv.fileio.file_client import HardDiskBackend
+
+disk_backend = HardDiskBackend()
+with io.BytesIO(disk_backend.get(filepath1)) as buffer:
+    checkpoint = torch.load(buffer)
+with io.BytesIO() as buffer:
+    torch.save(checkpoint, buffer)
+    disk_backend.put(buffer.getvalue(), filepath2)
+```
+
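+If we want to implement an interface that automatically selects the corresponding backend based on the file path, we can use `FileClient`.
+For example, suppose we want to implement two functions, one for loading checkpoints and one for saving them, which must support different types of file paths, such as disk paths, network paths or other paths.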
+
+```python
+from mmcv.fileio.file_client import FileClient
+
+def load_checkpoint(path):
+    file_client = FileClient.infer_client(uri=path)
+    with io.BytesIO(file_client.get(path)) as buffer:
+        checkpoint = torch.load(buffer)
+    return checkpoint
+
+def save_checkpoint(checkpoint, path):
+    file_client = FileClient.infer_client(uri=path)
+    with io.BytesIO() as buffer:
+        torch.save(checkpoint, buffer)
+        file_client.put(buffer.getvalue(), path)
+
+checkpoint = load_checkpoint(filepath1)
+save_checkpoint(checkpoint, filepath2)
+```
+
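+#### Load checkpoints from the Internet
+
+```{note}
+Currently, checkpoints can only be read from the Internet; saving checkpoints to the Internet is not supported.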
+```
+
+```python
+import io
+import torch
+from mmcv.fileio.file_client import HTTPBackend, FileClient
+
+filepath = 'http://path/of/your/checkpoint.pth'
+checkpoint = torch.utils.model_zoo.load_url(filepath)
+
+http_backend = HTTPBackend()
+with io.BytesIO(http_backend.get(filepath)) as buffer:
+    checkpoint = torch.load(buffer)
+
+file_client = FileClient.infer_client(uri=filepath)
+with io.BytesIO(file_client.get(filepath)) as buffer:
+    checkpoint = torch.load(buffer)
+```
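
The `list_from_file` and `dict_from_file` examples in both doc files show the same results whether the path is local or starts with `s3://`, which suggests the parsing step is independent of where the text comes from. The sketch below reproduces only the documented outputs on lines that have already been read. `parse_list` and `parse_dict` are hypothetical helpers written for illustration, not mmcv functions, and treating `max_num=0` as "no limit" is an assumption.

```python
def parse_list(lines, prefix='', offset=0, max_num=0):
    """Hypothetical helper: skip `offset` lines, keep at most `max_num`
    items (0 is assumed to mean no limit) and prepend `prefix` to each."""
    items = []
    for line in lines[offset:]:
        if 0 < max_num <= len(items):
            break
        items.append(prefix + line.rstrip('\n\r'))
    return items


def parse_dict(lines, key_type=str):
    """Hypothetical helper: the first column is the key; the remaining
    columns are the value (a single string, or a list if several)."""
    mapping = {}
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            raise ValueError(f'expected a key and at least one value: {line!r}')
        key = key_type(fields[0])
        values = fields[1:]
        mapping[key] = values[0] if len(values) == 1 else values
    return mapping


# The contents of a.txt and b.txt from the docs, as lists of lines.
a_lines = ['a\n', 'b\n', 'c\n', 'd\n', 'e\n']
b_lines = ['1 cat\n', '2 dog cow\n', '3 panda\n']

tail = parse_list(a_lines, offset=2)
head = parse_list(a_lines, max_num=2)
prefixed = parse_list(a_lines, prefix='/mnt/')
by_int = parse_dict(b_lines, key_type=int)
```

With a backend in front of it, only the step that produces the lines changes: the bytes returned by the backend are decoded and split, and the parsing above stays the same.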