## 1. Introduction to the Dataset base class
- PyTorch provides a base class for datasets, torch.utils.data.Dataset. By subclassing it, we can implement data loading very quickly.

In [None]:
class Dataset(object):
    r"""An abstract class representing a :class:`Dataset`.

    All datasets that represent a map from keys to data samples should subclass
    it. All subclasses should overwrite :meth:`__getitem__`, supporting fetching a
    data sample for a given key. Subclasses could also optionally overwrite
    :meth:`__len__`, which is expected to return the size of the dataset by many
    :class:`~torch.utils.data.Sampler` implementations and the default options
    of :class:`~torch.utils.data.DataLoader`.

    .. note::
      :class:`~torch.utils.data.DataLoader` by default constructs a index
      sampler that yields integral indices.  To make it work with a map-style
      dataset with non-integral indices/keys, a custom sampler must be provided.
    """

    def __getitem__(self, index) -> T_co:
        raise NotImplementedError

    def __add__(self, other: 'Dataset[T_co]') -> 'ConcatDataset[T_co]':
        return ConcatDataset([self, other])  # merge the two datasets into one

    # No `def __len__(self)` default?
    # See NOTE [ Lack of Default `__len__` in Python Abstract Base Classes ]
    # in pytorch/torch/utils/data/sampler.py

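#### As the source shows, a custom dataset class needs to inherit from Dataset and implement two methods
- 1. The `__len__` method, so that the built-in len() returns the number of samples in the dataset
- 2. The `__getitem__` method, so that a sample can be fetched by index, e.g. dataset[i] returns the i-th sample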
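
Both methods are ordinary Python special methods, so the protocol itself can be tried without PyTorch. The following is a minimal sketch, assuming a hypothetical class `ToySamples` that stands in for a Dataset subclass. It deliberately does not inherit from Dataset, so it only illustrates how len() and indexing are dispatched, not the full PyTorch behaviour.

```python
class ToySamples:
    """Hypothetical stand-in for a Dataset subclass (plain Python, no PyTorch)."""

    def __init__(self, samples):
        # Keep the samples in memory as a list
        self.samples = list(samples)

    def __len__(self):
        # Called by the built-in len()
        return len(self.samples)

    def __getitem__(self, index):
        # Called by the square-bracket syntax, e.g. toy[1]
        return self.samples[index]


toy = ToySamples(["a", "b", "c"])
print(len(toy))  # 3
print(toy[1])    # b
```

A real dataset class would inherit from Dataset in the same way and keep these two method bodies unchanged.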

### Loading the dataset

In [10]:
import torch
from torch.utils.data import Dataset, DataLoader

data_path = r'./dataset/SMSSpam/SMSSpamCollection'

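## Define a custom dataset
class MyDataset(Dataset):
    def __init__(self):
        # Read all lines up front; the with-block closes the file afterwards
        with open(data_path, encoding='utf-8') as f:
            self.lines = f.readlines()
    
    def __getitem__(self, index):
        # Fetch the sample at the given index
        # return self.lines[index].strip()   # strip(): removes leading and trailing whitespace
        cur_line = self.lines[index].strip()
        label = cur_line[:4].strip()    # the label at the start of the line ('ham' or 'spam')
        content = cur_line[4:].strip()  # the message text after the label
        return label,content
    
    def __len__(self):
        # Return the total number of samples
        return len(self.lines)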
        
    
if __name__ =='__main__':
    my_dataset = MyDataset()
    print(my_dataset[0])
    print(len(my_dataset))

('ham', 'Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...')
5574
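
The slicing in `__getitem__` works here because both labels, ham and spam, fit in the first four characters of a line. A more general sketch, assuming each line has the form label, tab, message (the helper `parse_line` and the sample line are hypothetical and not part of the original code), splits at the first tab instead:

```python
def parse_line(line):
    """Split one raw line into (label, content) at the first tab."""
    # partition() returns (before, separator, after); if no tab is present,
    # the whole line becomes the label and the content is empty.
    label, _, content = line.strip().partition("\t")
    return label.strip(), content.strip()


# Made-up line in the assumed "label<TAB>message" layout
print(parse_line("ham\tSee you at five\n"))  # ('ham', 'See you at five')
```

With this helper, `__getitem__` could simply return `parse_line(self.lines[index])`, and labels of any length would be handled.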


### The data loader
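
Conceptually, a loader with batch_size=2 and shuffle=True visits the sample indices in random order, groups them in pairs, and turns each group of (label, text) samples into one tuple of labels and one tuple of texts. The pure-Python sketch below illustrates that idea only. It is a simplification and not PyTorch's implementation; the function `toy_batches`, its `seed` parameter, and the sample data are all hypothetical.

```python
import random


def toy_batches(samples, batch_size, shuffle=True, seed=None):
    """Yield (labels, texts) tuples, batch_size samples at a time."""
    indices = list(range(len(samples)))
    if shuffle:
        # Visit the samples in a random order
        random.Random(seed).shuffle(indices)
    for start in range(0, len(indices), batch_size):
        chunk = [samples[i] for i in indices[start:start + batch_size]]
        # Transpose [(label, text), ...] into (labels, texts)
        labels, texts = zip(*chunk)
        yield labels, texts


# Made-up samples in the same (label, text) shape the dataset returns
samples = [("ham", "hello"), ("spam", "win a prize"), ("ham", "see you soon")]
for i, (labels, texts) in enumerate(toy_batches(samples, batch_size=2, shuffle=False)):
    print(i, labels, texts)
# 0 ('ham', 'spam') ('hello', 'win a prize')
# 1 ('ham',) ('see you soon',)
```

The last batch is smaller when the number of samples is not a multiple of the batch size, which is why the second line of output holds a single sample.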

In [16]:
import torch
from torch.utils.data import Dataset, DataLoader

data_path = r'./dataset/SMSSpam/SMSSpamCollection'

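## Define a custom dataset
class MyDataset(Dataset):
    def __init__(self):
        # Read all lines up front; the with-block closes the file afterwards
        with open(data_path, encoding='utf-8') as f:
            self.lines = f.readlines()
    
    def __getitem__(self, index):
        # Fetch the sample at the given index
        # return self.lines[index].strip()   # strip(): removes leading and trailing whitespace
        cur_line = self.lines[index].strip()
        label = cur_line[:4].strip()    # the label at the start of the line ('ham' or 'spam')
        content = cur_line[4:].strip()  # the message text after the label
        return label,content
    
    def __len__(self):
        # Return the total number of samples
        return len(self.lines)
    

# Data loader: batches of 2 samples, drawn in shuffled order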
my_dataset = MyDataset()
data_loader = DataLoader(dataset=my_dataset,batch_size=2,shuffle=True)

for i,(label, text) in enumerate(data_loader):
    print(i,label,text)

0 ('ham', 'ham') ('1) Go to write msg 2) Put on Dictionary mode 3)Cover the screen with hand, 4)Press  &lt;#&gt; . 5)Gently remove Ur hand.. Its interesting..:)', 'You always make things bigger than they are')
1 ('ham', 'ham') ('Hmm .. Bits and pieces lol ... *sighs* ...', 'I want to tel u one thing u should not mistake me k THIS IS THE MESSAGE THAT YOU SENT:)')
2 ('ham', 'ham') ('Detroit. The home of snow. Enjoy it.', "Pls tell nelson that the bb's are no longer comin. The money i was expecting aint coming")
3 ('ham', 'ham') ('Do u knw dis no. &lt;#&gt; ?', "Ok that's great thanx a lot.")
4 ('ham', 'spam') ('Oh yah... We never cancel leh... Haha', 'we tried to contact you re your response to our offer of a new nokia fone and camcorder hit reply or call 08000930705 for delivery')
5 ('ham', 'spam') ('Good morning, my Love ... I go to sleep now and wish you a great day full of feeling better and opportunity ... You are my last thought babe, I LOVE YOU *kiss*', 'Text PASS to 69669 to coll