[Feature] Loading objects from different backends and dumping objects to different backends (#1330)

* [Feature] Choose storage backend by the prefix of filepath

* refactor FileClient and add unittest

* support loading from different backends

* polish docstring

* fix unittest

* rename attribute str_like_obj to is_str_like_obj

* add infer_client method

* add check_exist method

* rename var client to file_client

* polish docstring

* add join_paths method

* remove join_paths and add _format_path

* enhance unittest

* refactor unittest

* singleton pattern

* fix test_clientio.py

* deprecate CephBackend

* enhance docstring

* refactor unittest for petrel

* refactor unittest for disk backend

* update io.md

* add concat_paths method

* improve docstring

* improve docstring

* add isdir and copyfile for file backend

* delete copyfile and add get_local_path

* remove isdir method of petrel

* fix typo

* add comment and polish docstring

* polish docstring

* rename _path_mapping to _map_path

* polish docstring and fix typo

* refactor get_local_path

* add list_dir_or_file for FileClient

* add list_dir_or_file for PetrelBackend

* fix windows ci

* Add return docstring

* polish docstring

* fix typo

* fix typo

* deprecate the conversion from Path to str

* add docs for loading checkpoints with FileClient

* refactor map_path

* add _ensure_methods to ensure methods have been implemented

* fix list_dir_or_file

* rename _ensure_method_implemented to has_method
zhouzaida committed Nov 3, 2021
1 parent 9b3cffd commit 01bc35e
Showing 13 changed files with 1,860 additions and 91 deletions.
128 changes: 126 additions & 2 deletions docs/understand_mmcv/io.md
@@ -2,11 +2,17 @@

This module provides two universal APIs to load and dump files of different formats.

```{note}
Since v1.3.16, the IO module supports loading data from and dumping data to different backends. More details are in PR [#1330](https://github.com/open-mmlab/mmcv/pull/1330).
```

### Load and dump data

`mmcv` provides a universal API for loading and dumping data. The currently
supported formats are json, yaml and pickle.

#### Load from disk or dump to disk

```python
import mmcv

@@ -29,6 +35,20 @@ with open('test.yaml', 'w') as f:
    data = mmcv.dump(data, f, file_format='yaml')
```

#### Load from other backends or dump to other backends

```python
import mmcv

# load data from a file
data = mmcv.load('s3://bucket-name/test.json')
data = mmcv.load('s3://bucket-name/test.yaml')
data = mmcv.load('s3://bucket-name/test.pkl')

# dump data to a file with a filename (infer format from file extension)
mmcv.dump(data, 's3://bucket-name/out.pkl')
```
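
The commit selects the storage backend from the prefix of the file path. The idea can be sketched as follows. This is a minimal illustration, not `mmcv`'s implementation: `PREFIX_TO_BACKEND`, `parse_prefix` and `pick_backend` are hypothetical names, and the prefix table is an assumption made for this sketch.

```python
# Hypothetical prefix table; the backend names are labels for this sketch only.
PREFIX_TO_BACKEND = {
    's3': 'petrel',
    'http': 'http',
    'https': 'http',
}


def parse_prefix(uri):
    """Return the text before '://', or None for a plain disk path."""
    uri = str(uri)
    if '://' not in uri:
        return None
    prefix, _ = uri.split('://', 1)
    return prefix


def pick_backend(uri):
    """Map a path to a backend label, falling back to 'disk'."""
    return PREFIX_TO_BACKEND.get(parse_prefix(uri), 'disk')
```

With this table, a path such as `s3://bucket-name/test.json` resolves to the `petrel` label, while a path without a prefix falls back to the disk label.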

It is also very convenient to extend the api to support more file formats.
All you need to do is to write a file handler inherited from `BaseFileHandler`
and register it with one or several file formats.
@@ -92,7 +112,9 @@ d
e
```

#### Load from disk

Use `list_from_file` to load the list from `a.txt`.

```python
>>> mmcv.list_from_file('a.txt')
@@ -113,11 +135,113 @@ For example `b.txt` is a text file with 3 lines.
3 panda
```

Then use `dict_from_file` to load the dict from `b.txt`.

```python
>>> mmcv.dict_from_file('b.txt')
{'1': 'cat', '2': ['dog', 'cow'], '3': 'panda'}
>>> mmcv.dict_from_file('b.txt', key_type=int)
{1: 'cat', 2: ['dog', 'cow'], 3: 'panda'}
```
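
The parsing rule implied by the outputs above (first token is the key, one remaining token stays a string, several become a list) can be sketched in plain Python. `parse_dict_lines` is a hypothetical helper written for this illustration; it works on a string rather than a file and is not `mmcv`'s code.

```python
def parse_dict_lines(text, key_type=str):
    """Parse 'key value [value ...]' lines into a dict.

    Hypothetical helper: mirrors the outputs shown above, not mmcv's code.
    """
    mapping = {}
    for line in text.splitlines():
        items = line.split()
        if len(items) < 2:
            continue  # skip blank or key-only lines in this sketch
        key = key_type(items[0])
        # one value stays a string, several values become a list
        mapping[key] = items[1] if len(items) == 2 else items[1:]
    return mapping


b_txt = "1 cat\n2 dog cow\n3 panda\n"
result = parse_dict_lines(b_txt, key_type=int)
```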

#### Load from other backends

Use `list_from_file` to load the list from `s3://bucket-name/a.txt`.

```python
>>> mmcv.list_from_file('s3://bucket-name/a.txt')
['a', 'b', 'c', 'd', 'e']
>>> mmcv.list_from_file('s3://bucket-name/a.txt', offset=2)
['c', 'd', 'e']
>>> mmcv.list_from_file('s3://bucket-name/a.txt', max_num=2)
['a', 'b']
>>> mmcv.list_from_file('s3://bucket-name/a.txt', prefix='/mnt/')
['/mnt/a', '/mnt/b', '/mnt/c', '/mnt/d', '/mnt/e']
```
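
The effect of `offset`, `max_num` and `prefix` in the calls above can be sketched without any backend. `parse_list_lines` is a hypothetical helper that reproduces the outputs shown; it takes the file content as a string and is not `mmcv`'s implementation.

```python
def parse_list_lines(text, prefix='', offset=0, max_num=0):
    """Hypothetical helper mirroring the outputs above, not mmcv's code."""
    items = []
    for index, line in enumerate(text.splitlines()):
        if index < offset:
            continue  # skip the first `offset` lines
        if 0 < max_num <= len(items):
            break  # stop once `max_num` items are collected (0 means no limit)
        items.append(prefix + line)
    return items


a_txt = "a\nb\nc\nd\ne\n"
```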

Use `dict_from_file` to load the dict from `s3://bucket-name/b.txt`.

```python
>>> mmcv.dict_from_file('s3://bucket-name/b.txt')
{'1': 'cat', '2': ['dog', 'cow'], '3': 'panda'}
>>> mmcv.dict_from_file('s3://bucket-name/b.txt', key_type=int)
{1: 'cat', 2: ['dog', 'cow'], 3: 'panda'}
```

### Load and dump checkpoints

#### Load checkpoints from disk or save to disk

We can read a checkpoint from disk or save one to disk in the following way.

```python
import torch

filepath1 = '/path/of/your/checkpoint1.pth'
filepath2 = '/path/of/your/checkpoint2.pth'
# read from filepath1
checkpoint = torch.load(filepath1)
# save to filepath2
torch.save(checkpoint, filepath2)
```

MMCV provides many backends. `HardDiskBackend` is one of them and we can use it to read or save checkpoints.

```python
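import io

import torch
from mmcv.fileio.file_client import HardDiskBackend

# filepath1 and filepath2 are defined in the previous snippet
disk_backend = HardDiskBackend()
# read the raw bytes from disk, then deserialize them from memory
with io.BytesIO(disk_backend.get(filepath1)) as buffer:
    checkpoint = torch.load(buffer)
# serialize into memory, then write the raw bytes to disk
with io.BytesIO() as buffer:
    torch.save(checkpoint, buffer)
    disk_backend.put(buffer.getvalue(), filepath2)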
```

If we want to implement an interface that automatically selects the corresponding
backend based on the file path, we can use `FileClient`.
For example, suppose we want to implement two methods, one for reading checkpoints and one for saving them,
both of which need to support different types of file paths: disk paths, network paths or other paths.

```python
import io

import torch
from mmcv.fileio.file_client import FileClient

def load_checkpoint(path):
    # infer the backend from the path, then deserialize from memory
    file_client = FileClient.infer_client(uri=path)
    with io.BytesIO(file_client.get(path)) as buffer:
        checkpoint = torch.load(buffer)
    return checkpoint

def save_checkpoint(checkpoint, path):
    # infer the backend from the target path, which may differ from the source
    file_client = FileClient.infer_client(uri=path)
    with io.BytesIO() as buffer:
        torch.save(checkpoint, buffer)
        file_client.put(buffer.getvalue(), path)

# filepath1 and filepath2 are defined in the first snippet of this section
checkpoint = load_checkpoint(filepath1)
save_checkpoint(checkpoint, filepath2)
```
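
The pattern above only relies on a backend exposing `get` (returning bytes) and `put` (accepting bytes). A minimal sketch of that contract follows, under stated assumptions: `InMemoryBackend`, `save_object` and `load_object` are hypothetical names, the `mem://` path is made up, and `pickle` stands in for `torch.save` / `torch.load` so the example runs without PyTorch.

```python
import io
import pickle


class InMemoryBackend:
    """Hypothetical backend that keeps bytes in a dict instead of on disk."""

    def __init__(self):
        self._store = {}

    def get(self, filepath):
        return self._store[str(filepath)]

    def put(self, obj, filepath):
        self._store[str(filepath)] = obj


def save_object(backend, obj, path):
    # serialize into an in-memory buffer, then hand the bytes to the backend
    with io.BytesIO() as buffer:
        pickle.dump(obj, buffer)
        backend.put(buffer.getvalue(), path)


def load_object(backend, path):
    # fetch the bytes from the backend, then deserialize from memory
    with io.BytesIO(backend.get(path)) as buffer:
        return pickle.load(buffer)


backend = InMemoryBackend()
save_object(backend, {'epoch': 3}, 'mem://checkpoint.pkl')
restored = load_object(backend, 'mem://checkpoint.pkl')
```

Because serialization goes through a bytes buffer, the same two helpers would work with any backend that implements `get` and `put`, which is the reason for routing I/O through the buffer instead of a file path.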

#### Load checkpoints from the Internet

```{note}
Currently, checkpoints can only be read from the Internet; saving checkpoints to the Internet is not supported.
```

```python
import io
import torch
from mmcv.fileio.file_client import HTTPBackend, FileClient

filepath = 'http://path/of/your/checkpoint.pth'
# option 1: let PyTorch download the checkpoint
checkpoint = torch.utils.model_zoo.load_url(filepath)

# option 2: fetch the bytes with HTTPBackend
http_backend = HTTPBackend()
with io.BytesIO(http_backend.get(filepath)) as buffer:
    checkpoint = torch.load(buffer)

# option 3: let FileClient infer the backend from the path
file_client = FileClient.infer_client(uri=filepath)
with io.BytesIO(file_client.get(filepath)) as buffer:
    checkpoint = torch.load(buffer)
```
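
The read-only restriction in the note above can be pictured as a backend whose `get` works and whose `put` is deliberately unsupported. The sketch below is an assumption-laden illustration, not `mmcv`'s `HTTPBackend`: `ReadOnlyURLBackend` is a hypothetical class built on the standard library, and it reads a local `file://` URL so that it runs without network access.

```python
import tempfile
from pathlib import Path
from urllib.request import urlopen


class ReadOnlyURLBackend:
    """Hypothetical read-only backend: `get` works, `put` is unsupported."""

    def get(self, url):
        # return the raw bytes behind the URL
        with urlopen(url) as response:
            return response.read()

    def put(self, obj, url):
        raise NotImplementedError('saving to a URL is not supported')


# Usage with a local file:// URL so the sketch runs without network access.
with tempfile.TemporaryDirectory() as tmp_dir:
    local_file = Path(tmp_dir) / 'checkpoint.bin'
    local_file.write_bytes(b'weights')
    payload = ReadOnlyURLBackend().get(local_file.as_uri())
```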
129 changes: 125 additions & 4 deletions docs_zh_CN/understand_mmcv/io.md
@@ -2,10 +2,16 @@

The file I/O module provides two universal APIs for loading and dumping files of different formats.

```{note}
Since v1.3.16, the IO module supports loading data from different backends and dumping data to different backends. For more details, see PR [#1330](https://github.com/open-mmlab/mmcv/pull/1330).
```

### Load and dump data

`mmcv` provides a universal API for loading and dumping data. The currently supported formats are json, yaml and pickle.

#### Load from disk or dump to disk

```python
import mmcv

@@ -28,6 +34,20 @@ with open('test.yaml', 'w') as f:
    data = mmcv.dump(data, f, file_format='yaml')
```

#### Load from other backends or dump to other backends

```python
import mmcv

# load data from an s3 file
data = mmcv.load('s3://bucket-name/test.json')
data = mmcv.load('s3://bucket-name/test.yaml')
data = mmcv.load('s3://bucket-name/test.pkl')

# dump data to an s3 file (infer format from file extension)
mmcv.dump(data, 's3://bucket-name/out.pkl')
```

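It is also easy to extend the API to support more file formats. All we need to do is write a file handler class
that inherits from `BaseFileHandler` and register it with `mmcv`. The handler class must override at least three methods.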

@@ -49,7 +69,7 @@ class TxtHandler1(mmcv.BaseFileHandler):
        return str(obj)
```

Take `PickleHandler` as an example.

```python
import pickle
@@ -87,8 +107,9 @@ c
d
e
```
#### Load from disk

Use `list_from_file` to load the list from `a.txt`.

```python
>>> mmcv.list_from_file('a.txt')
@@ -101,19 +122,119 @@ e
['/mnt/a', '/mnt/b', '/mnt/c', '/mnt/d', '/mnt/e']
```

Similarly, `b.txt` is also a text file with 3 lines.

```
1 cat
2 dog cow
3 panda
```

Use `dict_from_file` to load the dict from `b.txt`.

```python
>>> mmcv.dict_from_file('b.txt')
{'1': 'cat', '2': ['dog', 'cow'], '3': 'panda'}
>>> mmcv.dict_from_file('b.txt', key_type=int)
{1: 'cat', 2: ['dog', 'cow'], 3: 'panda'}
```

#### Load from other backends

Use `list_from_file` to load the list from `s3://bucket-name/a.txt`.

```python
>>> mmcv.list_from_file('s3://bucket-name/a.txt')
['a', 'b', 'c', 'd', 'e']
>>> mmcv.list_from_file('s3://bucket-name/a.txt', offset=2)
['c', 'd', 'e']
>>> mmcv.list_from_file('s3://bucket-name/a.txt', max_num=2)
['a', 'b']
>>> mmcv.list_from_file('s3://bucket-name/a.txt', prefix='/mnt/')
['/mnt/a', '/mnt/b', '/mnt/c', '/mnt/d', '/mnt/e']
```

Use `dict_from_file` to load the dict from `s3://bucket-name/b.txt`.

```python
>>> mmcv.dict_from_file('s3://bucket-name/b.txt')
{'1': 'cat', '2': ['dog', 'cow'], '3': 'panda'}
>>> mmcv.dict_from_file('s3://bucket-name/b.txt', key_type=int)
{1: 'cat', 2: ['dog', 'cow'], 3: 'panda'}
```

### Load and dump checkpoints

#### Load checkpoints from disk or save to disk

We can read a checkpoint from disk or save one to disk in the following way.

```python
import torch

filepath1 = '/path/of/your/checkpoint1.pth'
filepath2 = '/path/of/your/checkpoint2.pth'
# read from filepath1
checkpoint = torch.load(filepath1)
# save to filepath2
torch.save(checkpoint, filepath2)
```

MMCV provides many backends. `HardDiskBackend` is one of them, and we can use it to read or save checkpoints.

```python
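import io

import torch
from mmcv.fileio.file_client import HardDiskBackend

# filepath1 and filepath2 are defined in the previous snippet
disk_backend = HardDiskBackend()
# read the raw bytes from disk, then deserialize them from memory
with io.BytesIO(disk_backend.get(filepath1)) as buffer:
    checkpoint = torch.load(buffer)
# serialize into memory, then write the raw bytes to disk
with io.BytesIO() as buffer:
    torch.save(checkpoint, buffer)
    disk_backend.put(buffer.getvalue(), filepath2)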
```

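If we want to implement an interface that automatically selects the corresponding backend based on the file path, we can use `FileClient`.
For example, suppose we want to implement two methods, one for reading checkpoints and one for saving them, both of which need to support different types of file paths: disk paths, network paths or other paths.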

```python
import io

import torch
from mmcv.fileio.file_client import FileClient

def load_checkpoint(path):
    # infer the backend from the path, then deserialize from memory
    file_client = FileClient.infer_client(uri=path)
    with io.BytesIO(file_client.get(path)) as buffer:
        checkpoint = torch.load(buffer)
    return checkpoint

def save_checkpoint(checkpoint, path):
    # infer the backend from the target path, which may differ from the source
    file_client = FileClient.infer_client(uri=path)
    with io.BytesIO() as buffer:
        torch.save(checkpoint, buffer)
        file_client.put(buffer.getvalue(), path)

# filepath1 and filepath2 are defined in the first snippet of this section
checkpoint = load_checkpoint(filepath1)
save_checkpoint(checkpoint, filepath2)
```

#### Load checkpoints from the Internet

```{note}
Currently, checkpoints can only be read from the Internet; saving checkpoints to the Internet is not supported.
```

```python
import io
import torch
from mmcv.fileio.file_client import HTTPBackend, FileClient

filepath = 'http://path/of/your/checkpoint.pth'
# option 1: let PyTorch download the checkpoint
checkpoint = torch.utils.model_zoo.load_url(filepath)

# option 2: fetch the bytes with HTTPBackend
http_backend = HTTPBackend()
with io.BytesIO(http_backend.get(filepath)) as buffer:
    checkpoint = torch.load(buffer)

# option 3: let FileClient infer the backend from the path
file_client = FileClient.infer_client(uri=filepath)
with io.BytesIO(file_client.get(filepath)) as buffer:
    checkpoint = torch.load(buffer)
```