Specify Tajik VOCAB for training #1899

@sultanovazamat

Hi, and thanks for the fantastic work!

I am planning to add support for the Tajik language, whose alphabet has a 90% overlap with the Cyrillic alphabet. I have a few questions and would appreciate your answers:

  1. Do I have to update VOCABS for training, or are they used only at inference?
  2. Is it possible to take a pre-trained model that supports the Cyrillic alphabet and fine-tune it on the Tajik alphabet, given that the two alphabets are almost identical? If so, how do I select a specific pre-trained model as the starting point for fine-tuning?
  3. What dataset sizes would you recommend for training and fine-tuning for good results?
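
To make question 1 concrete, here is a minimal sketch of the character set I have in mind. It assumes the Tajik alphabet is the Russian Cyrillic letters without ц, щ, ы, ь, plus the six letters ғ, ӣ, қ, ӯ, ҳ, ҷ. The names `RUSSIAN_LOWER`, `TAJIK_EXTRA`, `NOT_IN_TAJIK`, and `build_tajik_vocab` are hypothetical and not part of the library; digits and punctuation would still have to be appended.

```python
# Hypothetical sketch: these names are illustrative, not a library API.
RUSSIAN_LOWER = "абвгдеёжзийклмнопрстуфхцчшщъыьэюя"
# Tajik-specific letters, written as escapes to avoid combining-character forms:
# ғ (U+0493), ӣ (U+04E3), қ (U+049B), ӯ (U+04EF), ҳ (U+04B3), ҷ (U+04B7)
TAJIK_EXTRA = "\u0493\u04e3\u049b\u04ef\u04b3\u04b7"
# Russian letters assumed to be absent from the Tajik alphabet
NOT_IN_TAJIK = set("цщыь")


def build_tajik_vocab() -> str:
    """Return lowercase followed by uppercase Tajik letters as one string."""
    lower = "".join(ch for ch in RUSSIAN_LOWER if ch not in NOT_IN_TAJIK)
    lower += TAJIK_EXTRA
    return lower + lower.upper()


tajik_vocab = build_tajik_vocab()
```

If a string like this is the right input, my understanding is that it would be passed wherever the training scripts expect a vocab, but that is exactly what I would like to confirm.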

Thanks in advance!
