<a href="https://colab.research.google.com/github/hezarai/notebooks/blob/main/hezar/00_quick_start.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Quick Start
We built Hezar to bring together the best AI work developed by the Persian community and make it easier to find and use, so there's no need to reinvent the wheel for every AI use case. In this notebook we show you the basics in a few cells of code!

First things first, let's install Hezar. This notebook goes through several different parts of the library, so it's best to install the full version.

In [None]:
%pip install hezar[all]
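
If you want to confirm the install worked before moving on, a small check like the one below can help. It is only a sketch that uses Python's standard `importlib.metadata`; the helper name `installed_version` is made up for this example and is not part of Hezar.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of `package`, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Prints a version string if the install succeeded, otherwise None
print(installed_version("hezar"))
```

In Colab you may need to restart the runtime after installing before the package becomes importable.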

Now that we have everything ready, let's start with loading a ready-to-use model from the Hub!

## Models
There are plenty of ready-to-use trained models for different tasks on the Hub!

**🤗Hugging Face Hub Page**: [https://huggingface.co/hezarai](https://huggingface.co/hezarai)

Let's walk you through some examples!

### Text Classification (sentiment analysis, categorization, etc.)

In [None]:
from hezar.models import Model

example = ["هزار، کتابخانه‌ای کامل برای به کارگیری آسان هوش مصنوعی"]
model = Model.load("hezarai/bert-fa-sentiment-dksf")
outputs = model.predict(example)
print(outputs)

### Sequence Labeling (POS, NER, etc.)

In [None]:
from hezar.models import Model

pos_model = Model.load("hezarai/bert-fa-pos-lscp-500k")  # Part-of-speech
ner_model = Model.load("hezarai/bert-fa-ner-arman")  # Named entity recognition
inputs = ["شرکت هوش مصنوعی هزار"]
pos_outputs = pos_model.predict(inputs)
ner_outputs = ner_model.predict(inputs)
print(f"POS: {pos_outputs}")
print(f"NER: {ner_outputs}")

### Language Modeling (Mask Filling)

In [None]:
from hezar.models import Model

roberta_mlm = Model.load("hezarai/roberta-fa-mlm")
inputs = ["سلام بچه ها حالتون <mask>"]
outputs = roberta_mlm.predict(inputs, top_k=1)
print(outputs)

### Speech Recognition

In [None]:
from hezar.models import Model

whisper = Model.load("hezarai/whisper-small-fa")
transcripts = whisper.predict("examples/assets/speech_example.mp3")
print(transcripts)

### Image to Text (OCR)

In [None]:
from hezar.models import Model

# OCR with TrOCR
model = Model.load("hezarai/trocr-base-fa-v2")
texts = model.predict(["examples/assets/ocr_example.jpg"])
print(f"TrOCR Output: {texts}")

# OCR with CRNN
model = Model.load("hezarai/crnn-fa-printed-96-long")
texts = model.predict("examples/assets/ocr_example.jpg")
print(f"CRNN Output: {texts}")

### Image to Text (License Plate Recognition)

In [None]:
from hezar.models import Model

model = Model.load("hezarai/crnn-fa-64x256-license-plate-recognition")
plate_text = model.predict("examples/assets/license_plate_ocr_example.jpg")
print(plate_text)  # Persian text of mixed numbers and characters might not show correctly in the console
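
If the printed plate looks scrambled, the display is usually the problem rather than the string itself. The sketch below shows two generic ways to inspect mixed right-to-left text using only standard Unicode features. The helper names are hypothetical and not part of Hezar, and whether the isolate characters help depends on how well your terminal supports bidirectional text.

```python
RLI = "\u2067"  # RIGHT-TO-LEFT ISOLATE
PDI = "\u2069"  # POP DIRECTIONAL ISOLATE


def isolate_rtl(text):
    """Wrap text in a right-to-left isolate so a bidi-aware display lays it out as one RTL run."""
    return f"{RLI}{text}{PDI}"


def code_points(text):
    """List the code points in logical (stored) order, independent of how the text is rendered."""
    return [f"U+{ord(char):04X}" for char in text]


sample = "\u06f1\u06f2"  # two Persian digits, standing in for a model output
print(isolate_rtl(sample))
print(code_points(sample))
```

Comparing the code point list against the expected plate is a reliable way to tell a rendering issue from a genuine recognition error.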

### Image to Text (Image Captioning)

In [None]:
from hezar.models import Model

model = Model.load("hezarai/vit-roberta-fa-image-captioning-flickr30k")
texts = model.predict("examples/assets/image_captioning_example.jpg")
print(texts)

## Word Embeddings

Hezar supports word embeddings (Word2Vec, FastText, etc.) too.

### FastText

In [None]:
from hezar.embeddings import Embedding

fasttext = Embedding.load("hezarai/fasttext-fa-300")
most_similar = fasttext.most_similar("هزار")
print(most_similar)

### Word2Vec (Skipgram)

In [None]:
from hezar.embeddings import Embedding

word2vec = Embedding.load("hezarai/word2vec-skipgram-fa-wikipedia")
most_similar = word2vec.most_similar("هزار")
print(most_similar)

### Word2Vec (CBOW)

In [None]:
from hezar.embeddings import Embedding

word2vec = Embedding.load("hezarai/word2vec-cbow-fa-wikipedia")
most_similar = word2vec.most_similar("هزار")
print(most_similar)
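
To get a feel for what a `most_similar` query does, here is a toy version in plain Python. It assumes the ranking is based on cosine similarity, which is the usual choice for word2vec-style embeddings but is not confirmed by this notebook. The vectors and English words below are made up for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def most_similar(word, vectors, top_n=2):
    """Rank every other word by cosine similarity to `word` and keep the best `top_n`."""
    query = vectors[word]
    scored = [
        (other, cosine_similarity(query, vector))
        for other, vector in vectors.items()
        if other != word
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_n]


# Made-up 3-dimensional vectors; real embeddings have hundreds of dimensions
toy_vectors = {
    "thousand": [1.0, 0.0, 0.2],
    "hundred": [0.9, 0.1, 0.3],
    "million": [0.8, 0.0, 0.5],
    "cat": [0.0, 1.0, 0.0],
}
ranked = most_similar("thousand", toy_vectors)
print(ranked)
```

With these toy vectors, "hundred" ranks first and "million" second, while "cat" is dropped because its vector points in an unrelated direction.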

## Datasets

Hezar is also home to ready-to-use datasets, all of which are hosted on the 🤗Hub. The cool thing about Hezar's datasets is that
you can load them in two ways: with the regular `load_dataset` function in 🤗Datasets, or from the Hub straight into a ready-to-use
Hezar `Dataset` class, which is a PyTorch-compatible dataset wrapper that can be fed directly to a data loader.

### Load using Hugging Face datasets

In [None]:
from datasets import load_dataset

sentiment_dataset = load_dataset("hezarai/sentiment-dksf")
lscp_dataset = load_dataset("hezarai/lscp-pos-500k")
xlsum_dataset = load_dataset("hezarai/xlsum-fa")
...

### Load using Hezar Dataset

In [None]:
from hezar.data import Dataset

sentiment_dataset = Dataset.load("hezarai/sentiment-dksf")  # A TextClassificationDataset instance
lscp_dataset = Dataset.load("hezarai/lscp-pos-500k")  # A SequenceLabelingDataset instance
xlsum_dataset = Dataset.load("hezarai/xlsum-fa")  # A TextSummarizationDataset instance
...

The difference between loading with Hezar and loading with Hugging Face datasets is the output class. When you load
a dataset using Hezar's `Dataset` class, it automatically finds the proper class for that dataset and creates a
PyTorch `Dataset` instance that can be passed directly to a PyTorch `DataLoader`.

In [None]:
from torch.utils.data import DataLoader

from hezar.data import Dataset

dataset = Dataset.load(
    "hezarai/lscp-pos-500k",
    tokenizer_path="hezarai/distilbert-base-fa",  # tokenizer_path is necessary for data collator
)

loader = DataLoader(dataset, batch_size=16, shuffle=True, collate_fn=dataset.data_collator)
itr = iter(loader)
print(next(itr))
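
To make the role of the data collator more concrete, here is a rough, dependency-free sketch of what a collator typically does: it takes a list of samples of different lengths and pads them into one rectangular batch. This is not Hezar's actual collator. The class, the function, and the field names (`token_ids`, `attention_mask`) are hypothetical and chosen only for illustration.

```python
class ToyDataset:
    """A map-style dataset: it only needs __len__ and __getitem__ to work with a data loader."""

    def __init__(self, samples):
        self.samples = samples

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, index):
        return self.samples[index]


def pad_collate(batch, pad_id=0):
    """Pad every `token_ids` list to the longest one in the batch and build a matching mask."""
    max_len = max(len(sample["token_ids"]) for sample in batch)
    token_ids, attention_mask = [], []
    for sample in batch:
        length = len(sample["token_ids"])
        padding = max_len - length
        token_ids.append(sample["token_ids"] + [pad_id] * padding)
        attention_mask.append([1] * length + [0] * padding)
    return {"token_ids": token_ids, "attention_mask": attention_mask}


dataset = ToyDataset([{"token_ids": [5, 6, 7]}, {"token_ids": [8]}])
batch = pad_collate([dataset[i] for i in range(len(dataset))])
print(batch)
```

A PyTorch `DataLoader` calls its `collate_fn` in the same way, once per batch, with a list of samples fetched through `__getitem__`.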

When you load with Hugging Face datasets instead, the output is an HF `Dataset` instance.

So in a nutshell, any Hezar dataset can be loaded using HF datasets, but not vice versa!
(Hezar looks for a `dataset_config.yaml` file in the dataset repo, so non-Hezar datasets cannot be
loaded using the Hezar `Dataset` class.)
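
If you are unsure whether a dataset repo can be loaded with Hezar, one way to check is to look for that config file yourself. The sketch below assumes the file sits at the top level of the repo, which the text above does not state explicitly, and the helper names are hypothetical. The check Hezar performs internally may differ. `list_repo_files` is a function from the `huggingface_hub` package and needs network access.

```python
CONFIG_FILENAME = "dataset_config.yaml"  # file name taken from the note above


def has_hezar_config(repo_files):
    """Return True if a repo file listing contains a top-level dataset_config.yaml."""
    return CONFIG_FILENAME in repo_files


def list_dataset_repo_files(repo_id):
    """Fetch the file listing of a dataset repo from the Hub (needs network access)."""
    # Imported lazily so the offline check above works without huggingface_hub installed
    from huggingface_hub import list_repo_files

    return list_repo_files(repo_id, repo_type="dataset")


# Offline demonstration with made-up file listings
print(has_hezar_config(["README.md", "dataset_config.yaml", "train.csv"]))  # True
print(has_hezar_config(["README.md", "data/train.parquet"]))  # False
```

With network access, `has_hezar_config(list_dataset_repo_files("hezarai/sentiment-dksf"))` runs the same check against a real repo.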

## Training

Hezar also makes it super easy to train or fine-tune models on different datasets on the Hub using the `Trainer` class.

In [None]:
from hezar.models import BertSequenceLabeling, BertSequenceLabelingConfig
from hezar.data import Dataset
from hezar.trainer import Trainer, TrainerConfig
from hezar.preprocessors import Preprocessor

base_model_path = "hezarai/bert-base-fa"
dataset_path = "hezarai/lscp-pos-500k"

train_dataset = Dataset.load(dataset_path, split="train", tokenizer_path=base_model_path)
eval_dataset = Dataset.load(dataset_path, split="test", tokenizer_path=base_model_path)

model = BertSequenceLabeling(BertSequenceLabelingConfig(id2label=train_dataset.config.id2label))
preprocessor = Preprocessor.load(base_model_path)

train_config = TrainerConfig(
    output_dir="bert-fa-pos-lscp-500k",
    task="sequence_labeling",
    device="cuda",
    init_weights_from=base_model_path,
    batch_size=8,
    num_epochs=5,
    metrics=["seqeval"],
)

trainer = Trainer(
    config=train_config,
    model=model,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    data_collator=train_dataset.data_collator,
    preprocessor=preprocessor,
)
trainer.train()

trainer.push_to_hub("bert-fa-pos-lscp-500k")  # push model, config, preprocessor, trainer files and configs

You can also use custom models and datasets with the `Trainer` by making some minor tweaks to your model and dataset classes. Refer to the in-depth guides to find out how!