Possible fixes from fork #35

Open
nmfisher opened this issue Apr 5, 2023 · 3 comments

@nmfisher

nmfisher commented Apr 5, 2023

Thanks for this great package! I forked the repo to tweak a few things to help my use case, and some of them might be useful to merge back into the master branch. I haven't submitted a PR because some of them might not be appropriate/desirable to merge, so I figured you could tell me which ones you want and I could clean up the code/add some tests if necessary and submit a PR then.

Fork is at https://github.com/nmfisher/charsiu

Changes are:

  1. don't require the sampling rate to be explicitly provided, as librosa can resample to 16000 Hz when loading a file
  2. re-instate punctuation and insert the punctuation token, rather than silence, into the phone list
  3. downweight silence to minimize erroneous insertion of silence in the middle of a word (this should probably be a parameter rather than a hardcoded 0.1)
  4. ignore silence where the left and right phones are identical (to completely avoid inserting silence frames in the middle of consecutive frames for a single phone). This works for me right now but needs a bit more thought: if phones are intentionally repeated (e.g. "ai ai"), this folds the silence between them into the left phone, so "ai [SIL] ai" will always become "ai ai". The solution is probably just to pass a parameter for a minimum silence duration (so if a silence is longer than X, it's preserved; otherwise it's folded into the left phone).
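
To make point 3 concrete, here is a minimal sketch of what a configurable silence weight could look like. It is not the code in the fork: the function name `downweight_silence`, the `sil_index` argument, and the assumption that the aligner exposes per-frame phone probabilities as plain lists are all hypothetical.

```python
def downweight_silence(frame_probs, sil_index, weight=0.1):
    """Scale the silence probability in every frame by `weight`, then renormalize.

    frame_probs: list of per-frame probability lists (one entry per phone class).
    sil_index:   position of the silence class in each frame (hypothetical).
    weight:      multiplier in (0, 1]; 0.1 mirrors the hardcoded value mentioned above.
    """
    if not 0.0 < weight <= 1.0:
        raise ValueError("weight must be in (0, 1]")
    adjusted = []
    for probs in frame_probs:
        scaled = list(probs)
        scaled[sil_index] *= weight
        total = sum(scaled)
        # Renormalize so each frame is still a probability distribution.
        adjusted.append([p / total for p in scaled])
    return adjusted


frames = [[0.5, 0.5], [0.9, 0.1]]  # column 0 is silence in this toy example
adjusted = downweight_silence(frames, sil_index=0, weight=0.1)
```

Exposing `weight` as an argument lets callers pass `1.0` to switch the behaviour off entirely.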
@lingjzhu
Owner

lingjzhu commented Apr 7, 2023

Thank you for your help! I have gone through many changes in my life, so I haven't been updating this repo regularly. So these changes are highly appreciated!

I think 2 is really helpful for some applications but not for others. For example, sometimes people only want to work with phonemes, so punctuation is not necessary for them. Could you make it optional?
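
One possible shape for an optional switch is sketched below. The function name `phones_with_punctuation`, the `keep_punctuation` flag, and the token-list representation are hypothetical and not taken from the repo or the fork; the sketch also assumes that punctuation is otherwise mapped to a silence token, as change 2 implies.

```python
import string


def phones_with_punctuation(tokens, keep_punctuation=False, sil_token="[SIL]"):
    """Return the phone list, keeping punctuation tokens only if requested.

    tokens: phones and punctuation marks in utterance order (hypothetical format).
    With keep_punctuation=False, each punctuation token is replaced by sil_token.
    """
    out = []
    for tok in tokens:
        # A token counts as punctuation if every character is ASCII punctuation.
        is_punct = bool(tok) and all(ch in string.punctuation for ch in tok)
        if is_punct and not keep_punctuation:
            out.append(sil_token)
        else:
            out.append(tok)
    return out


tokens = ["HH", "AH", ",", "W", "ER"]
phoneme_only = phones_with_punctuation(tokens)
with_punct = phones_with_punctuation(tokens, keep_punctuation=True)
```

Defaulting the flag to `False` would leave phoneme-only workflows unchanged.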

3 and 4 are really, really helpful! Thank you so much!!!
Let me know if I can help in any way. I am working on an improved model, so hopefully I can also incorporate your features in the new models. But that might take a few months to complete :)

@phliulei

Firstly, I would like to express my gratitude for the development of such an excellent tool. During my testing with L2 Mandarin speech, it became clear that these speakers tend to speak more slowly, which often results in false [SIL] insertions. The modified script has been shown to produce better results with this speech, but I am curious whether there is a way to completely avoid inserting [SIL], particularly in the middle of a single Chinese character, given that this is a rare occurrence in Mandarin.

@nmfisher
Author

@phliulei I think the best way is to specify a minimum silence duration, so anything shorter is ignored/treated as part of the previous phone. I mentioned this in point (4) above but I haven't had a chance to implement it yet.
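
As a rough sketch of that idea (not implemented in the fork), a post-processing pass over the aligned segments could look like the following. The `(start, end, label)` tuple format, the function name `fold_short_silences`, and the `min_sil_dur` parameter are assumptions for illustration only.

```python
def fold_short_silences(segments, min_sil_dur, sil_label="[SIL]"):
    """Fold silences shorter than min_sil_dur (seconds) into the previous phone.

    segments: list of (start_sec, end_sec, label) tuples in time order
              (hypothetical format). Silences at least min_sil_dur long, and
              any silence with no preceding segment, are kept unchanged.
    """
    folded = []
    for start, end, label in segments:
        is_short_sil = label == sil_label and (end - start) < min_sil_dur
        if is_short_sil and folded:
            # Extend the previous segment to cover the short silence.
            prev_start, _, prev_label = folded[-1]
            folded[-1] = (prev_start, end, prev_label)
        else:
            folded.append((start, end, label))
    return folded


segments = [
    (0.0, 0.2, "a"),
    (0.2, 0.25, "[SIL]"),  # short: folded into "a"
    (0.25, 0.5, "i"),
    (0.5, 1.0, "[SIL]"),   # long: preserved
    (1.0, 1.2, "a"),
]
cleaned = fold_short_silences(segments, min_sil_dur=0.1)
```

With a threshold like this, an intentional pause in "ai [SIL] ai" survives as long as it exceeds `min_sil_dur`, while a spurious [SIL] inside a syllable is absorbed by the phone to its left.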
