-
Notifications
You must be signed in to change notification settings - Fork 45
/
datasets.md
59 lines (46 loc) · 2.54 KB
/
datasets.md
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
# Datasets
Hezar provides both dataset class implementations and ready-to-use data files for the community.
## Hub Datasets
Hezar datasets are all hosted on the Hugging Face Hub and can be loaded just like any dataset on the Hub.
### Load using Hugging Face datasets
```python
from datasets import load_dataset
sentiment_dataset = load_dataset("hezarai/sentiment-dksf")
lscp_dataset = load_dataset("hezarai/lscp-pos-500k")
xlsum_dataset = load_dataset("hezarai/xlsum-fa")
...
```
### Load using Hezar Dataset
```python
from hezar.data import Dataset
sentiment_dataset = Dataset.load("hezarai/sentiment-dksf") # A TextClassificationDataset instance
lscp_dataset = Dataset.load("hezarai/lscp-pos-500k") # A SequenceLabelingDataset instance
xlsum_dataset = Dataset.load("hezarai/xlsum-fa") # A TextSummarizationDataset instance
...
```
The difference between using Hezar vs Hugging Face datasets is the output class. In Hezar when you load
a dataset using the `Dataset` class, it automatically finds the proper class for that dataset and creates a
PyTorch `Dataset` instance so that it can be easily passed to a PyTorch `DataLoader` class.
```python
from torch.utils.data import DataLoader
from hezar.data.datasets import Dataset
dataset = Dataset.load(
"hezarai/lscp-pos-500k",
tokenizer_path="hezarai/distilbert-base-fa", # tokenizer_path is necessary for data collator
)
loader = DataLoader(dataset, batch_size=16, shuffle=True, collate_fn=dataset.data_collator)
itr = iter(loader)
print(next(itr))
```
But when loading using Hugging Face datasets, the output is an HF Dataset instance.
So in a nutshell, any Hezar dataset can be loaded using HF datasets but not vise-versa!
(Because Hezar looks out for a `dataset_config.yaml` file in any dataset repo so non-Hezar datasets cannot be
loaded using Hezar `Dataset` class.)
## Dataset classes
Hezar categorizes datasets based on their target task. The dataset classes all inherit from the base `Dataset` class
which is a PyTorch Dataset subclass. (hence having `__getitem__` and `__len__` methods.)
Some examples of the dataset classes are `TextClassificationDataset`, `TextSummarizationDataset`, `SequenceLabelingDataset`, etc.
## Dataset Templates
We try to have a simple yet practical pattern for all datasets on the Hub. Every dataset on the Hub needs to have
a dataset loading script. Some ready to use templates are located in the [templates/dataset_scripts](https://github.com/hezarai/hezar/tree/main/templates/dataset_scripts) folder.
To add a new Hezar compatible dataset to the Hub you can follow the guide provided there.