Specify Tajik VOCAB for training

Hi, and thanks for the fantastic job!

I am planning to add support for the Tajik language, whose alphabet overlaps about 90% with the Cyrillic alphabet. I have a few questions:
1) Do I have to update VOCABS for training, or is it used only at inference?
2) Is it possible to take a pre-trained model that supports the Cyrillic alphabet and fine-tune it on the Tajik alphabet, given that the alphabets are almost identical? If so, how do I select a specific pre-trained model for fine-tuning during training?
3) What dataset sizes would you recommend for training and for fine-tuning to get good results?
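
To make question 1 concrete, here is a minimal sketch of the kind of vocab string I have in mind. It is plain Python and does not call the library: the names `RUSSIAN_LOWER`, `TAJIK_ONLY` and `build_vocab`, and the choice of digits and punctuation, are my own assumptions. How such a string should be registered in VOCABS is exactly what I am asking about.

```python
# Hypothetical sketch: extend a Russian Cyrillic character set with the
# six letters that Tajik adds. None of these names come from the library.

# Lowercase Russian Cyrillic alphabet.
RUSSIAN_LOWER = "абвгдеёжзийклмнопрстуфхцчшщъыьэюя"

# Tajik-specific lowercase letters, written as escapes to avoid any
# ambiguity between precomposed and combining forms: ғ ӣ қ ӯ ҳ ҷ
TAJIK_ONLY = "\u0493\u04e3\u049b\u04ef\u04b3\u04b7"

# Assumed extra symbols; adjust to whatever the dataset actually contains.
DIGITS_AND_PUNCT = "0123456789 .,!?-"


def build_vocab(lower: str, extra: str = "") -> str:
    """Return lowercase + uppercase letters plus extra symbols, deduplicated
    while preserving first-seen order."""
    chars = lower + lower.upper() + extra
    return "".join(dict.fromkeys(chars))


# Superset vocab: every Russian letter plus the Tajik additions.
tajik_vocab = build_vocab(RUSSIAN_LOWER + TAJIK_ONLY, DIGITS_AND_PUNCT)

# Characters a Russian-only vocab would be missing.
missing_from_russian = [c for c in TAJIK_ONLY if c not in RUSSIAN_LOWER]

if __name__ == "__main__":
    print(tajik_vocab)
    print("Letters to add:", " ".join(missing_from_russian))
```

Keeping the full Russian set in the vocab is a deliberate simplification here, so that the result is a strict superset of an existing Cyrillic character set.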

Thanks in advance!

Specify Tajik VOCAB for training #1899
