
Questions about modifying prepare.sh for training ASR model on custom data #1636

daocunyang opened this issue May 23, 2024 · 2 comments



daocunyang commented May 23, 2024

Hi, I'm opening a new issue since the old one has been closed.

We are currently writing our own prepare.sh to train an ASR model on our own Chinese audio data, following the example of aishell's prepare.sh. Given our lack of experience, however, we are unsure about some of its contents. Here are our questions:

  1. What role does vocab_sizes play, and how do we decide what value to assign to it? Do we need it at all? (See the first sketch after this list.)

  2. Looking at stages 5 through 8 of aishell's prepare.sh, from what I can tell we need to replace aishell_transcript_v0.8.txt (line 151) with our own text file, correct? Other than that, is there anything else we need to modify in these stages to prepare our own data?

  3. We currently have only a few hundred audio files for training (not that many). How do you suggest we split the data into training and test sets? I'm thinking of using most, or perhaps all, of them for training, and few or even none for the test set (see the second sketch after this list).

  4. Just to confirm: we can remove the part related to Whisper large-v3 at the end of prepare.sh, since we are not using Whisper, correct?

  5. We plan to use the lexicon.txt file from aishell, but we noticed that certain words which matter to us are missing from it. For example, we want to add the word "对的" ("correct"). But is it actually necessary to add it to lexicon.txt? I noticed that aishell's lexicon.txt already contains the following entries, which cover the individual characters making up "对的" (the third sketch after this list shows what adding the whole word would look like):

对 d ui4
的 d e5
的 d i2
的 d i4
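
For reference (question 1), here is a minimal sketch of how vocab_sizes is typically consumed in icefall prepare.sh scripts, assuming a BPE-style recipe; the loop and the local/train_bpe_model.py helper are modeled on the librispeech recipe, so the paths and stage layout in aishell may differ, and a purely character-based lang_char setup would not need it at all:

    # Sketch only: loop modeled on the librispeech recipe; paths and
    # the helper script location are assumptions.
    vocab_sizes=(
      500   # one sentencepiece/BPE model is trained per entry
    )

    for vocab_size in ${vocab_sizes[@]}; do
      lang_dir=data/lang_bpe_${vocab_size}
      mkdir -p $lang_dir
      # fits a BPE model of the given size on the training transcripts;
      # the resulting tokens define the model's output vocabulary
      ./local/train_bpe_model.py \
        --lang-dir $lang_dir \
        --vocab-size $vocab_size \
        --transcript $lang_dir/transcript_words.txt
    done

In other words, each entry fixes the size of the token inventory the ASR model predicts over, and whichever value is used here must match the lang directory passed to train.py later.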
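For reference (question 3), a minimal sketch of a simple 90/10 split using plain shell tools; the corpus/wavs path and file names are made up. Even with only a few hundred files, keeping a small held-out set of a few dozen utterances gives a rough sanity check on error rates during training:

    # Sketch only: shuffle the file list, then cut off ~10% for testing.
    find corpus/wavs -name '*.wav' | shuf > all_wavs.txt
    n_total=$(wc -l < all_wavs.txt)
    n_test=$(( n_total / 10 ))
    head -n "$n_test" all_wavs.txt > test_wavs.txt
    tail -n +"$(( n_test + 1 ))" all_wavs.txt > train_wavs.txt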
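For reference (question 5), whether "对的" needs its own entry depends on how the transcripts are tokenized: if they are segmented into words and "对的" appears as a single token, the lexicon needs a matching line, whereas if everything is modeled per character, the existing single-character entries are enough. A minimal sketch of adding the word by concatenating the per-character pinyin from the entries quoted above (the data/lang_char path is an assumption):

    # Sketch only: 的 in 对的 takes the neutral-tone reading d e5.
    echo "对的 d ui4 d e5" >> data/lang_char/lexicon.txt
    sort -u -o data/lang_char/lexicon.txt data/lang_char/lexicon.txt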

Thanks in advance.


daocunyang commented May 23, 2024

Sorry for asking so many (possibly silly) questions, but here is another one regarding training:

We are trying to fine-tune the model sherpa-onnx-streaming-zipformer-bilingual-zh-en-2023-02-20, which we found was converted from here. The latter mentions the following training command: ./pruned_transducer_stateless7_streaming/train.py, which I assume is equivalent to this file.

We hope to continue training from the existing pretrained.pt file or epoch-99.pt in here; how can we do that? From this section of the doc, it seems we can specify --start-epoch 100 to resume training from epoch-99.pt. Is that correct?
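
A minimal sketch of what we have in mind, assuming the usual icefall train.py flags (values are illustrative; please check the script's --help):

    # Sketch only: the script should pick up ${exp_dir}/epoch-99.pt
    # when --start-epoch 100 is given.
    exp_dir=pruned_transducer_stateless7_streaming/exp
    cp /path/to/epoch-99.pt $exp_dir/epoch-99.pt

    ./pruned_transducer_stateless7_streaming/train.py \
      --world-size 1 \
      --exp-dir $exp_dir \
      --start-epoch 100 \
      --num-epochs 110 \
      --max-duration 300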

@JinZr Could you take another look when you get a chance? Thanks a lot.

marcoyang1998 (Collaborator) commented

If you want to continue training your model on your own data, I would recommend using finetune.py.
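
A minimal sketch of such a fine-tuning run, assuming flags like those in other icefall finetune.py recipes; --finetune-ckpt in particular is an assumption, so verify everything against the script's --help:

    # Sketch only: flag names borrowed from other icefall finetune.py
    # recipes and not confirmed for this exact one.
    ./pruned_transducer_stateless7_streaming/finetune.py \
      --world-size 1 \
      --exp-dir pruned_transducer_stateless7_streaming/exp_finetune \
      --finetune-ckpt /path/to/pretrained.pt \
      --num-epochs 20 \
      --max-duration 300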
