-
Notifications
You must be signed in to change notification settings - Fork 45
/
quick_tour.md
214 lines (187 loc) · 7.28 KB
/
quick_tour.md
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
# Quick Tour
## Models
There's a bunch of ready to use trained models for different tasks on the Hub!
**🤗Hugging Face Hub Page**: [https://huggingface.co/hezarai](https://huggingface.co/hezarai)
Let's walk you through some examples!
- **Text Classification (sentiment analysis, categorization, etc)**
```python
from hezar.models import Model
example = ["هزار، کتابخانهای کامل برای به کارگیری آسان هوش مصنوعی"]
model = Model.load("hezarai/bert-fa-sentiment-dksf")
outputs = model.predict(example)
print(outputs)
```
```
[[{'label': 'positive', 'score': 0.812910258769989}]]
```
- **Sequence Labeling (POS, NER, etc.)**
```python
from hezar.models import Model
pos_model = Model.load("hezarai/bert-fa-pos-lscp-500k") # Part-of-speech
ner_model = Model.load("hezarai/bert-fa-ner-arman") # Named entity recognition
inputs = ["شرکت هوش مصنوعی هزار"]
pos_outputs = pos_model.predict(inputs)
ner_outputs = ner_model.predict(inputs)
print(f"POS: {pos_outputs}")
print(f"NER: {ner_outputs}")
```
```
POS: [[{'token': 'شرکت', 'label': 'Ne'}, {'token': 'هوش', 'label': 'Ne'}, {'token': 'مصنوعی', 'label': 'AJe'}, {'token': 'هزار', 'label': 'NUM'}]]
NER: [[{'token': 'شرکت', 'label': 'B-org'}, {'token': 'هوش', 'label': 'I-org'}, {'token': 'مصنوعی', 'label': 'I-org'}, {'token': 'هزار', 'label': 'I-org'}]]
```
- **Mask Filling**
```python
from hezar.models import Model
model = Model.load("hezarai/roberta-fa-mask-filling")
inputs = ["سلام بچه ها حالتون <mask>"]
outputs = model.predict(inputs, top_k=1)
print(outputs)
```
```
[[{'token': 'چطوره', 'sequence': 'سلام بچه ها حالتون چطوره', 'token_id': 34505, 'score': 0.2230483442544937}]]
```
- **Speech Recognition**
```python
from hezar.models import Model
model = Model.load("hezarai/whisper-small-fa")
transcripts = model.predict("examples/assets/speech_example.mp3")
print(transcripts)
```
```
[{'text': 'و این تنها محدود به محیط کار نیست'}]
```
- **Image to Text (OCR)**
```python
from hezar.models import Model
# OCR with TrOCR
model = Model.load("hezarai/trocr-base-fa-v2")
texts = model.predict(["examples/assets/ocr_example.jpg"])
print(f"TrOCR Output: {texts}")
# OCR with CRNN
model = Model.load("hezarai/crnn-fa-printed-96-long")
texts = model.predict("examples/assets/ocr_example.jpg")
print(f"CRNN Output: {texts}")
```
```
TrOCR Output: [{'text': 'چه میشه کرد، باید صبر کنیم'}]
CRNN Output: [{'text': 'چه میشه کرد، باید صبر کنیم'}]
```
![](https://raw.githubusercontent.com/hezarai/hezar/main/examples/assets/ocr_example.jpg)
- **Image to Text (License Plate Recognition)**
```python
from hezar.models import Model
model = Model.load("hezarai/crnn-fa-64x256-license-plate-recognition")
plate_text = model.predict("assets/license_plate_ocr_example.jpg")
print(plate_text) # Persian text of mixed numbers and characters might not show correctly in the console
```
```
[{'text': '۵۷س۷۷۹۷۷'}]
```
![](https://raw.githubusercontent.com/hezarai/hezar/main/examples/assets/license_plate_ocr_example.jpg)
- **Image to Text (Image Captioning)**
```python
from hezar.models import Model
model = Model.load("hezarai/vit-roberta-fa-image-captioning-flickr30k")
texts = model.predict("examples/assets/image_captioning_example.jpg")
print(texts)
```
```
[{'text': 'سگی با توپ تنیس در دهانش می دود.'}]
```
![](https://raw.githubusercontent.com/hezarai/hezar/main/examples/assets/image_captioning_example.jpg)
We constantly keep working on adding and training new models and this section will hopefully be expanding over time ;)
## Word Embeddings
- **FastText**
```python
from hezar.embeddings import Embedding
fasttext = Embedding.load("hezarai/fasttext-fa-300")
most_similar = fasttext.most_similar("هزار")
print(most_similar)
```
```
[{'score': 0.7579, 'word': 'میلیون'},
{'score': 0.6943, 'word': '21هزار'},
{'score': 0.6861, 'word': 'میلیارد'},
{'score': 0.6825, 'word': '26هزار'},
{'score': 0.6803, 'word': '٣هزار'}]
```
- **Word2Vec (Skip-gram)**
```python
from hezar.embeddings import Embedding
word2vec = Embedding.load("hezarai/word2vec-skipgram-fa-wikipedia")
most_similar = word2vec.most_similar("هزار")
print(most_similar)
```
```
[{'score': 0.7885, 'word': 'چهارهزار'},
{'score': 0.7788, 'word': '۱۰هزار'},
{'score': 0.7727, 'word': 'دویست'},
{'score': 0.7679, 'word': 'میلیون'},
{'score': 0.7602, 'word': 'پانصد'}]
```
- **Word2Vec (CBOW)**
```python
from hezar.embeddings import Embedding
word2vec = Embedding.load("hezarai/word2vec-cbow-fa-wikipedia")
most_similar = word2vec.most_similar("هزار")
print(most_similar)
```
```
[{'score': 0.7407, 'word': 'دویست'},
{'score': 0.7400, 'word': 'میلیون'},
{'score': 0.7326, 'word': 'صد'},
{'score': 0.7276, 'word': 'پانصد'},
{'score': 0.7011, 'word': 'سیصد'}]
```
For a full guide on the embeddings module, see the [embeddings tutorial](https://hezarai.github.io/hezar/tutorial/embeddings.html).
## Datasets
You can load any of the datasets on the [Hub](https://huggingface.co/hezarai) like below:
```python
from hezar.data import Dataset
sentiment_dataset = Dataset.load("hezarai/sentiment-dksf") # A TextClassificationDataset instance
lscp_dataset = Dataset.load("hezarai/lscp-pos-500k") # A SequenceLabelingDataset instance
xlsum_dataset = Dataset.load("hezarai/xlsum-fa") # A TextSummarizationDataset instance
alpr_ocr_dataset = Dataset.load("hezarai/persian-license-plate-v1") # An OCRDataset instance
...
```
The returned dataset objects from `load()` are PyTorch Dataset wrappers for specific tasks and can be used by a data loader out-of-the-box!
You can also load Hezar's datasets using 🤗Datasets:
```python
from datasets import load_dataset
dataset = load_dataset("hezarai/sentiment-dksf")
```
For a full guide on Hezar's datasets, see the [datasets tutorial](https://hezarai.github.io/hezar/tutorial/datasets.html).
## Training
Hezar makes it super easy to train models using out-of-the-box models and datasets provided in the library.
```python
from hezar.models import BertSequenceLabeling, BertSequenceLabelingConfig
from hezar.data import Dataset
from hezar.trainer import Trainer, TrainerConfig
from hezar.preprocessors import Preprocessor
base_model_path = "hezarai/bert-base-fa"
dataset_path = "hezarai/lscp-pos-500k"
train_dataset = Dataset.load(dataset_path, split="train", tokenizer_path=base_model_path)
eval_dataset = Dataset.load(dataset_path, split="test", tokenizer_path=base_model_path)
model = BertSequenceLabeling(BertSequenceLabelingConfig(id2label=train_dataset.config.id2label))
preprocessor = Preprocessor.load(base_model_path)
train_config = TrainerConfig(
output_dir="bert-fa-pos-lscp-500k",
task="sequence_labeling",
device="cuda",
init_weights_from=base_model_path,
batch_size=8,
num_epochs=5,
metrics=["seqeval"],
)
trainer = Trainer(
config=train_config,
model=model,
train_dataset=train_dataset,
eval_dataset=eval_dataset,
data_collator=train_dataset.data_collator,
preprocessor=preprocessor,
)
trainer.train()
trainer.push_to_hub("bert-fa-pos-lscp-500k") # push model, config, preprocessor, trainer files and configs
```
Want to go deeper? Check out the [guides](../guide/index.md).