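# Data loading in PyTorch
Data loading in PyTorch is convenient, and prefetching data with multiple worker processes is easy to set up. In my experience it is considerably easier to program than TF, and more flexible too.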

Data loading in PyTorch is built around three classes:

1. Dataset
2. DataLoader
3. DataLoaderIter

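The three wrap each other roughly in order: a Dataset (1) is wrapped by a DataLoader (2), which in turn is wrapped by a DataLoaderIter (3).

A DataLoader is essentially an iterable (like Python's built-in list). It can use multiple processes to speed up the preparation of batches, and it yields one batch at a time so that memory use stays bounded.

The typical workflow:

① Create a Dataset object
② Create a DataLoader object
③ Loop over the DataLoader and feed each (img, label) batch to the model for training

Under the hood, the loop makes the DataLoader create its iterator and then calls next() on it once per batch.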
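
The three steps and the iterator mechanics can be sketched as follows. This is a minimal illustration rather than code from the original notebook: `ToyDataset` and its contents are made up for the example.

```python
import torch
from torch.utils.data import Dataset, DataLoader


class ToyDataset(Dataset):
    """Hypothetical dataset: 10 samples, each a 3-element feature vector."""

    def __init__(self):
        self.features = torch.arange(30, dtype=torch.float32).reshape(10, 3)
        self.labels = torch.arange(10) % 2

    def __getitem__(self, index):
        return self.features[index], self.labels[index]

    def __len__(self):
        return len(self.features)


dataset = ToyDataset()                                     # step 1
loader = DataLoader(dataset, batch_size=4, shuffle=False)  # step 2

# Step 3: the for loop calls iter(loader) once, then next() for every batch
batch_shapes = [tuple(x.shape) for x, y in loader]

# The same mechanics done by hand
batch_iterator = iter(loader)
first_x, first_y = next(batch_iterator)
```

With 10 samples and `batch_size=4`, the last batch holds only the 2 remaining samples.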

# torch.utils.data

## Dataset

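An abstract class representing a dataset. All other datasets should subclass it. Every subclass should override `__len__`, which returns the size of the dataset, and `__getitem__`, which supports integer indexing from 0 to len(self) - 1.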

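```python
class Dataset(object):
    # Subclasses must override __getitem__ and __len__;
    # the base versions simply raise NotImplementedError.

    # Takes a sample index and returns the data and label at that index.
    def __getitem__(self, index):
        raise NotImplementedError

    # Returns the number of samples in the dataset.
    def __len__(self):
        raise NotImplementedError

    # dataset_a + dataset_b concatenates the two datasets.
    def __add__(self, other):
        return ConcatDataset([self, other])
```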
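
To see what `__add__` does in practice, here is a small sketch using the built-in `TensorDataset`. The tensors are made-up stand-ins for real data.

```python
import torch
from torch.utils.data import ConcatDataset, TensorDataset

# Two toy datasets with 3 and 5 samples (contents are illustrative only)
first = TensorDataset(torch.zeros(3, 2))
second = TensorDataset(torch.ones(5, 2))

# The + operator goes through Dataset.__add__ and yields a ConcatDataset
combined = first + second

# Index 3 is the first sample of `second`; TensorDataset returns a 1-tuple
sample = combined[3][0]
```

Indices 0 to 2 map to `first` and indices 3 to 7 map to `second`.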

### Subset

`class torch.utils.data.Subset(dataset, indices)`

Selects the subset of a dataset at the given indices. `dataset`: the full dataset; `indices`: the indices of the samples to keep.
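
A short sketch of `Subset` on a toy dataset; the data and the chosen indices are assumptions made for the example.

```python
import torch
from torch.utils.data import Subset, TensorDataset

# Toy dataset whose i-th sample is simply the number i
full = TensorDataset(torch.arange(10))

# Keep only the samples at even positions
evens = Subset(full, [0, 2, 4, 6, 8])

# Index 1 of the subset maps to index 2 of the full dataset
second_value = evens[1][0].item()
```

The subset does not copy any data; it only remaps indices into the original dataset.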

### random_split

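`torch.utils.data.random_split(dataset, lengths)`

A function (not a class) that randomly splits a dataset into non-overlapping subsets. `dataset`: the dataset to split; `lengths`: a list of subset lengths, e.g. [7, 3], which must satisfy 7 + 3 == len(dataset).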
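
A minimal sketch of `random_split` on a toy dataset. Passing a seeded `torch.Generator` is an addition for reproducibility and is not part of the original notes.

```python
import torch
from torch.utils.data import TensorDataset, random_split

full = TensorDataset(torch.arange(10))

# A seeded generator makes the split repeatable from run to run
generator = torch.Generator().manual_seed(0)
train_part, val_part = random_split(full, [7, 3], generator=generator)

# Each part is a Subset; its .indices list points into the full dataset
train_ids = set(train_part.indices)
val_ids = set(val_part.indices)
```

The two index sets are disjoint and together cover every sample exactly once.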

## Splitting into train, validation, and test sets

```python
import torch
from torch.utils import data
import random

master = data.Dataset( ... )  # your "master" dataset
n = len(master)  # how many total elements you have
n_test = int( n * .05 )  # number of test/val elements
n_train = n - 2 * n_test
idx = list(range(n))  # indices to all elements
random.shuffle(idx)  # in-place shuffle the indices to facilitate random splitting
train_idx = idx[:n_train]
val_idx = idx[n_train:(n_train + n_test)]
test_idx = idx[(n_train + n_test):]

train_set = data.Subset(master, train_idx)
val_set = data.Subset(master, val_idx)
test_set = data.Subset(master, test_idx)
```

This can also be achieved using data.random_split:

```python
# the validation and test sets each hold n_test elements
train_set, val_set, test_set = data.random_split(master, (n_train, n_test, n_test))
```

```python
import sys
sys.path.append("..")  # dl_utils is imported from the parent directory
import dl_utils
import torch
import torchvision

train_iter, test_iter = dl_utils.load_data_fashion_mnist(batch_size=256)
```

```python
import random

# Split the training data into k parts: k-1 parts for training, 1 part for validation
root = '~/Datasets/FashionMNIST'

trans = []
trans.append(torchvision.transforms.ToTensor())
transform = torchvision.transforms.Compose(trans)

mnist_train = torchvision.datasets.FashionMNIST(root=root, train=True, download=True, transform=transform)
mnist_test = torchvision.datasets.FashionMNIST(root=root, train=False, download=True, transform=transform)

print('mnist_train  len {}'.format(mnist_train))

print('mnist_test  len {}'.format(mnist_test))

def validation_dataset(dataset, k):
    n = len(dataset)
    idx = list(range(n))  # indices to all elements
    random.shuffle(idx)   # shuffle in place to get a random ordering of the indices
    n_val = n // k        # one k-th of the samples goes to validation
    val_index = idx[:n_val]
    train_index = idx[n_val:]
    val_set = torch.utils.data.Subset(dataset, val_index)
    train_set = torch.utils.data.Subset(dataset, train_index)
    return train_set, val_set

train_set, val_set = validation_dataset(mnist_train, 5)

print(len(train_set))
print(len(val_set))
```

Output:

```
mnist_train  len Dataset FashionMNIST
    Number of datapoints: 60000
    Root location: C:\Users\Jarvis/Datasets/FashionMNIST
    Split: Train
mnist_test  len Dataset FashionMNIST
    Number of datapoints: 10000
    Root location: C:\Users\Jarvis/Datasets/FashionMNIST
    Split: Test
48000
12000
```


```python
def random_split_validation(train_data, k):
    n = len(train_data)
    n_val = n // k
    n_train = n - n_val
    return torch.utils.data.random_split(train_data, (n_train, n_val))

train_set2, val_set2 = random_split_validation(mnist_train, 5)
print(len(train_set2))
print(len(val_set2))
```

Output:

```
48000
12000
```


```python
import pandas as pd

data = pd.read_csv('../test_files.csv')
print(len(data))
print(data.iloc[0, 0])
print(data.index)
print(data.columns)
```

Output:

```
908
D:\PointNet\PointNet-PyTorch-master\data\ModelNet10_\bathtub\test\bathtub_0107.txt
RangeIndex(start=0, stop=908, step=1)
Index(['Path', ' Class'], dtype='object')
```


## An example

The Data/toy-points-dataset directory contains data from three classes.


```python
import os

data_set_path = 'D:\\PointNet\\PointNet-PyTorch-master\\data\\ModelNet10_'
# One sub-directory per class; sort so the name-to-index mapping is stable
names = sorted(os.listdir(data_set_path))
class_num_dict = {name: index for index, name in enumerate(names)}
print(class_num_dict)
```

Output:

```
{'bathtub': 0, 'bed': 1, 'chair': 2, 'desk': 3, 'dresser': 4, 'monitor': 5, 'night_stand': 6, 'sofa': 7, 'table': 8, 'toilet': 9}
```


```python
from torch.utils.data.dataset import Dataset
import os

# class_num_dict = {'bed': 0, 'sofa': 1, 'desk': 2, 'bathtub': 3}
def get_path_pair(data_set_path, is_train=True):
    """Collect [file path, class name] pairs for the train or test split."""
    class_names = os.listdir(data_set_path)
    split = 'train' if is_train else 'test'
    all_files = [['Path', ' Class']]  # header row of the CSV written below
    for class_name in class_names:
        file_path = os.path.join(data_set_path, class_name, split)
        files = os.listdir(file_path)
        files_path = [[os.path.join(os.getcwd(), file_path, file), class_name] for file in files]
        # += extends the list with the pairs; append would add one nested list instead
        all_files += files_path
    return all_files

modelnet10_path = 'D:\\PointNet\\PointNet-PyTorch-master\\data\\ModelNet10_'
train_files_path = get_path_pair(modelnet10_path, is_train=True)
test_files_path = get_path_pair(modelnet10_path, is_train=False)
print(test_files_path[1])

with open('../train_files.csv', 'w') as f:
    for path in train_files_path:
        f.write(path[0] + ',' + path[1] + '\n')

with open('../test_files.csv', 'w') as f:
    for path in test_files_path:
        f.write(path[0] + ',' + path[1] + '\n')
```

Output:

```
['D:\\PointNet\\PointNet-PyTorch-master\\data\\ModelNet10_\\bathtub\\test\\bathtub_0107.txt', 'bathtub']
```
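
Joining fields by hand with `','` breaks as soon as a path contains a comma. As an alternative sketch (not what the notebook does), the standard `csv` module quotes such fields automatically. The rows and the temporary file location below are made up for the example.

```python
import csv
import os
import tempfile

# Hypothetical rows: a header plus two [path, class name] pairs
rows = [
    ["Path", "Class"],
    ["data/bathtub/test/bathtub_0107.txt", "bathtub"],
    ["data/bed,old/test/bed_0533.txt", "bed"],  # note the comma inside the path
]

csv_path = os.path.join(tempfile.mkdtemp(), "test_files.csv")

# newline="" is what the csv module expects for files it writes and reads
with open(csv_path, "w", newline="") as f:
    csv.writer(f).writerows(rows)

with open(csv_path, newline="") as f:
    loaded = list(csv.reader(f))
```

The path with the embedded comma survives the round trip because the writer wraps it in quotes.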


## Assume we already have path files for the training and test sets


References:

https://www.pytorchtutorial.com/pytorch-custom-dataset-examples/

https://pandas.pydata.org/pandas-docs/stable/getting_started/10min.html

https://likewind.top/2019/02/01/Pytorch-dataprocess/

https://pytorch.org/tutorials/beginner/data_loading_tutorial.html

https://www.jianshu.com/p/8ea7fba72673



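The path file is a csv or txt file whose rows look like this:
['D:\\OneDrive\\Desktop\\2Learning-Pytorch-Geometric\\Learning_Pytorch\\Data\\toy-points-dataset\\bed\\test\\bed_0533.points', 'bed']

The `__getitem__` function:

Given an index, it looks up the corresponding path and reads the data from that path, for example with f.read or cv.imread.

It then returns the data together with its label.

The `__len__` function must also be implemented; it returns the size of the dataset.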

```python
import pandas as pd
from torchvision import transforms
from torch.utils.data.dataset import Dataset
import numpy as np
import torch

class ReadDataFromFloder(Dataset):
    def __init__(self, data_set_path, is_train=True):
        # CSV with one row per sample: file path, class name
        self.data_path = pd.read_csv(data_set_path)

        # Transforms may differ per stage, e.g. noise or rotations only during training
        self.transformations = {'train': transforms.Compose([transforms.ToTensor()]),
                                'test': transforms.Compose([transforms.ToTensor()])}
        self.is_train = is_train

    # This parser depends on the data type: each kind of data needs its own reading logic
    def txt_PointsCloud_parser(self, path_to_txt_file):
        # Read the text file: one point per line, coordinates separated by spaces
        with open(path_to_txt_file, 'r') as f:
            contents = f.readlines()
        # Convert all the vertex lines to a list of lists
        vertex_list = [list(map(float, line.strip().split(' '))) for line in contents]
        # Return the vertices as an N x 3 numpy array
        return np.array(vertex_list)

    def augment_data(self, vertices):
        # Random rotation about the Y-axis; rand() returns a scalar angle factor
        theta = 2 * np.pi * np.random.rand()
        # np.float was removed in NumPy 1.24; the built-in float is equivalent
        Ry = np.array([[np.cos(theta), 0, np.sin(theta)],
                       [0, 1, 0],
                       [-np.sin(theta), 0, np.cos(theta)]], dtype=float)
        vertices = np.dot(vertices, Ry)
        # Add Gaussian noise with a standard deviation of 0.02
        vertices += np.random.normal(scale=0.02, size=vertices.shape)
        return vertices

    def __getitem__(self, index):
        # Look up the file path for this index
        path = self.data_path.iloc[index, 0]

        # Read the data from that path (this step could be optimized, e.g. with the h5 file format)
        data = self.txt_PointsCloud_parser(path)

        # The return value must be a tensor for the network to consume,
        # so convert it by hand or through a transform
        if self.is_train:
            data = self.augment_data(data)
            data = self.transformations['train'](data)
        else:
            data = self.transformations['test'](data)

        label = self.data_path.iloc[index, 1]

        return torch.squeeze(data), label

    def __len__(self):
        return len(self.data_path)
```
    


```python
import time

point_data_set = ReadDataFromFloder('train_files.csv')
point_data_loader = torch.utils.data.DataLoader(point_data_set, batch_size=24, shuffle=False)

start = time.time()
i = 0  # number of batches
for data, label in point_data_loader:
    i += 1
print(i)
print('use %f s' % (time.time() - start))
```

Output:

```
167
use 28.844467 s
```


# The HDF5 format (h5py)

https://geektutu.com/post/tensorflow-make-npy-hdf5-data-set.html#%E5%88%B6%E4%BD%9CHDF5%E6%A0%BC%E5%BC%8F%E7%9A%84%E6%95%B0%E6%8D%AE%E9%9B%86

https://www.neusncp.com/user/blog?id=97

https://zhuanlan.zhihu.com/p/34405536

https://github.com/Lyken17/Efficient-PyTorch

https://www.cnblogs.com/nwpuxuezha/p/7846751.html

https://towardsdatascience.com/hdf5-datasets-for-pytorch-631ff1d750f5

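As the timing above shows, one pass over the DataLoader is slow: about 28.5 s in a single process. Instead, we can iterate over the dataset once with batch_size = 1, append every sample and its label to a list, and finally save everything with h5py, so that the whole dataset lives in a single .h5 file.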
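
Collecting every sample in a Python list first keeps the whole dataset in memory. A variation, sketched here under made-up sizes and with stand-in data, pre-allocates the HDF5 datasets and writes one sample at a time. The dataset names `points` and `label` follow the notebook; everything else is an assumption.

```python
import os
import tempfile

import h5py
import numpy as np

num_samples, num_points = 5, 8  # hypothetical sizes for the sketch
h5_path = os.path.join(tempfile.mkdtemp(), "toy.h5")

with h5py.File(h5_path, "w") as f:
    # Allocate the full arrays on disk up front
    points = f.create_dataset("points", shape=(num_samples, num_points, 3), dtype="float32")
    labels = f.create_dataset("label", shape=(num_samples,), dtype="int64")
    for i in range(num_samples):
        # Stand-in for one parsed point cloud and its integer label
        points[i] = np.full((num_points, 3), i, dtype=np.float32)
        labels[i] = i % 2

# Read everything back as numpy arrays
with h5py.File(h5_path, "r") as f:
    loaded_points, loaded_labels = f["points"][()], f["label"][()]
```

This requires every sample to have the same shape, which also holds for the list-based approach.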


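## Reading the h5py file

1. After reading the file, build a dataset from the arrays with H5Dataset. h5py returns numpy arrays, so the dataset constructor checks whether its input is a numpy array and, if so, converts it to a tensor first.

One pass over a Dataset built this way takes only about 0.146 s, roughly 200 times faster.

Ratio of the plain-file reading time to the HDF5 reading time: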

```python
28.5688/0.146
```

Output:

```
195.67671232876714
```

```python
import h5py

data_h5py, label_h5py = [], []

# Iterate over the dataset once, collect every sample in a list, then write the lists to disk
point_data_loader = torch.utils.data.DataLoader(point_data_set, batch_size=1, shuffle=False)
for data, label in point_data_loader:
    data_h5py.append(torch.squeeze(data).numpy())
    # label is a batch holding one class name; map it to its integer index
    label_h5py.append(class_num_dict[label[0]])

with h5py.File('data.h5', 'w') as f:
    f.create_dataset('points', data=data_h5py)
    f.create_dataset('label', data=label_h5py)
```

```python
with h5py.File('data.h5', 'r') as f:
    x, y = f['points'][()], f['label'][()]

print(x.shape, y.shape)
```

Output:

```
(3991, 2000, 3) (3991,)
```


```python
import numpy as np
import torch
from torch.utils.data import Dataset

class H5Dataset(Dataset):
    """Dataset wrapping data and target tensors.

    Each sample will be retrieved by indexing both tensors along the first
    dimension.

    Arguments:
        data_tensor (Tensor or ndarray): contains sample data.
        target_tensor (Tensor or ndarray): contains sample targets (labels).
    """

    def __init__(self, data_tensor, target_tensor):
        assert data_tensor.shape[0] == target_tensor.shape[0]
        # Check the argument itself, not the global x, before converting
        if isinstance(data_tensor, np.ndarray):
            self.data_tensor = torch.from_numpy(data_tensor)
            self.target_tensor = torch.from_numpy(target_tensor)
        else:
            self.data_tensor = data_tensor
            self.target_tensor = target_tensor

    def __getitem__(self, index):
        return self.data_tensor[index], self.target_tensor[index]

    def __len__(self):
        return self.data_tensor.shape[0]

h5_point_set = H5Dataset(x, y)
```
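
For arrays that already fit in memory, the built-in `TensorDataset` offers the same index-both-tensors behavior as the hand-written class. The sketch below uses small stand-in arrays (`toy_x`, `toy_y`) instead of the real data loaded from `data.h5`.

```python
import numpy as np
import torch
from torch.utils.data import TensorDataset

# Stand-ins for the arrays read from the HDF5 file
toy_x = np.zeros((5, 8, 3), dtype=np.float32)
toy_y = np.arange(5)

# TensorDataset expects tensors, so convert the numpy arrays first
tensor_set = TensorDataset(torch.from_numpy(toy_x), torch.from_numpy(toy_y))

# Indexing returns one (points, label) pair
sample_points, sample_label = tensor_set[2]
```

A custom class is still useful when extra logic, such as augmentation, has to run per sample.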

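## h5py files do not support num_workers > 0
Reading them in parallel is not supported out of the box, and adapting the code for it is cumbersome.
Thanks to @Tio from the comments section for sharing the following:

Background: the Torch framework expects input data in its own formats. Image recognition work commonly uses files with the .t7 extension, and there is also the h5 format, which is similar to t7. Both can be used with Torch, but reading h5 files triggers a known bug.
Problem: when .h5 data is read through a DataLoader with num_workers > 0, memory fills up and an error is raised.
Solution: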
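
The original note does not spell out the fix. One commonly used workaround, shown here as an assumption rather than as the fix the comment had in mind, is to open the HDF5 file lazily inside `__getitem__`, so that no open file handle is shared between worker processes. `LazyH5Dataset` and the toy file are hypothetical.

```python
import os
import tempfile

import h5py
import numpy as np
import torch
from torch.utils.data import DataLoader, Dataset


class LazyH5Dataset(Dataset):
    """Hypothetical dataset that opens the HDF5 file on first access."""

    def __init__(self, h5_path):
        self.h5_path = h5_path
        self._file = None  # opened lazily, once per process
        with h5py.File(h5_path, "r") as f:
            self._length = f["label"].shape[0]

    def __getitem__(self, index):
        if self._file is None:
            self._file = h5py.File(self.h5_path, "r")
        points = torch.from_numpy(self._file["points"][index])
        label = int(self._file["label"][index])
        return points, label

    def __len__(self):
        return self._length


# Tiny stand-in file: 6 clouds of 4 points each
toy_path = os.path.join(tempfile.mkdtemp(), "toy.h5")
with h5py.File(toy_path, "w") as f:
    f.create_dataset("points", data=np.zeros((6, 4, 3), dtype=np.float32))
    f.create_dataset("label", data=np.arange(6))

lazy_set = LazyH5Dataset(toy_path)
# num_workers=0 keeps the sketch portable; with workers, each opens its own handle
lazy_loader = DataLoader(lazy_set, batch_size=4, num_workers=0)
lazy_shapes = [tuple(points.shape) for points, labels in lazy_loader]
```

Whether this avoids the memory problem described above depends on the h5py and PyTorch versions in use, so it is worth timing and monitoring before relying on it.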

```python
import time

point_data_loader2 = torch.utils.data.DataLoader(h5_point_set, batch_size=24, shuffle=False, num_workers=0)
start = time.time()
i = 0  # number of batches
X = torch.zeros(2000, 3)  # placeholder cloud for the commented-out draw call below
for batch_points, batch_labels in point_data_loader2:
    i += 1
print(i)
print(time.time() - start)

def draw_Point_Cloud(Points, Labels, axis=True, **kwargs):
    import matplotlib.pyplot as plt
    from mpl_toolkits.mplot3d import Axes3D  # registers the 3d projection on older Matplotlib
    x_axis = Points[:, 0]
    y_axis = Points[:, 1]
    z_axis = Points[:, 2]
    fig = plt.figure()
    # Axes3D(fig) no longer attaches itself to the figure in recent Matplotlib
    ax = fig.add_subplot(projection='3d')

    ax.scatter(x_axis, y_axis, z_axis, c=Labels)
    # Label the axes and set the viewing angle
    ax.set_xlabel('x')
    ax.set_ylabel('y')
    ax.set_zlabel('z')
    ax.view_init(elev=10, azim=235)
    if not axis:
        # Hide the axes
        plt.axis('off')

    plt.show()
# draw_Point_Cloud(X.numpy(), Labels=None)
```

Output:

```
167
0.1466062068939209
```


```python
# Splitting into training and validation sets works the same way for the h5 dataset
train_set2, val_set2 = random_split_validation(h5_point_set, 6)
print(len(train_set2))
print(len(val_set2))
```

Output:

```
3326
665
```


The `with` blocks above close the HDF5 file automatically, so no explicit `f.close()` is needed.

If h5py is not installed yet, install it from within the notebook:

```python
!pip install h5py
```
