Can't currently support long audio? #25

Open
wxbool opened this issue Jun 19, 2022 · 3 comments

@wxbool

wxbool commented Jun 19, 2022

“Charsiu works the best when your files are shorter than 15 ms. Test whether your files are longer than 15ms”

I saw this hint in the description and tested it.

When I run forced alignment on long audio, the following error appears:

Traceback (most recent call last):
  File "test.py", line 31, in <module>
    charsiu.align(audio=audio, text=text)
  File "E:\***/python/charsiu/charsiu/src\Charsiu.py", line 157, in align
    pred_words = self.charsiu_processor.align_words(pred_phones,phones,words)
  File "E:\***/python/charsiu/charsiu/src\processors.py", line 417, in align_words
    word_dur.append((dur,words_rep[count])) #((start,end,phone),word)
IndexError: list index out of range

@lingjzhu
Owner

Hi, I think I made a typo there. What I meant is that it tends to make more mistakes when the audio is longer than 15 seconds, but I typed 15 ms instead. The training data consist of sentences shorter than 15 seconds, so the model might not generalize well to long audio.

It's hard for me to diagnose the problem in your code from incomplete information. Does the example we give work for you? Could you share the actual input text and audio?
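To check whether a file exceeds the 15-second range described above, a minimal sketch using only Python's standard `wave` module could look like this. It assumes an uncompressed PCM WAV file; the names `MAX_SECONDS`, `wav_duration_seconds`, and `is_too_long` are made up for illustration and are not part of Charsiu.

```python
import wave

# Threshold taken from the comment above. It is a rule of thumb about the
# training data, not a limit that Charsiu itself enforces.
MAX_SECONDS = 15.0


def wav_duration_seconds(path):
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def is_too_long(path, max_seconds=MAX_SECONDS):
    """Return True when the file is longer than max_seconds."""
    return wav_duration_seconds(path) > max_seconds
```

For example, `is_too_long('./charsiu/local/4.wav')` would report whether the file from this thread falls outside that range.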

@wxbool
Author

wxbool commented Jun 19, 2022

Test code:

# add the path to charsiu's src directory so the imports below resolve
import sys

sys.path.append('E:/ViggoSpace/GoWork/Home/video-srt-pro/client/test/python/charsiu/charsiu/src')

# import the aligner classes from Charsiu
from Charsiu import charsiu_predictive_aligner
from Charsiu import charsiu_forced_aligner

audio = './charsiu/local/4.wav'
text = "据国际学生评估项目小组统计数据显示,中国14个小时,波兰6.6小时,爱尔兰7.3小时,意大利和俄罗斯为8.7小时,作业量较小的为芬兰和韩国,分别为2.8小时和2.9小时,该调查还指出,富裕家庭的学生平均每周比经济弱势的学生多花1.6小时写作业,中国中小学写作业压力报告更是显示在中国91.2%的家长有过陪孩子写作业的经历,其中每天陪写的家长高达78%,有75.79%的家庭曾经写作业,发生过亲子矛盾事,多一写作业就要干这个干那个,磨磨蹭蹭半天写不出几道题来,太笨了,怎么教也不会不少受调查的家长提到陪写作业,成为幸福感下降的原因之一。"
save_to = './charsiu/local/4.TextGrid'

# Chinese
# initialize model
charsiu = charsiu_forced_aligner(aligner='charsiu/zh_xlsr_fc_10ms', lang='zh')
charsiu.align(audio=audio, text=text)
charsiu.serve(audio=audio, text=text,
              save_to=save_to)

Audio file link: https://file.viggo.site/temp/4.wav

Thank you for your reply.
Is it currently possible to run forced alignment on long audio?
I have also tried Montreal Forced Aligner. In my tests it produced obviously abnormal alignments on long audio and was very slow.

@lingjzhu
Owner

Thank you for sharing. I think there are two problems here.
First, since this is the code for a research paper, I haven't implemented any methods to normalize numbers, so the conversion to Pinyin will fail on numbers. This is a potential source of mistakes.
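As a rough illustration of what number normalization could look like before the text is passed to the aligner, here is a minimal sketch that rewrites Arabic numerals as Chinese numerals. It is not part of Charsiu. The helper names (`int_to_chinese`, `normalize_numbers`) are hypothetical, it only handles integer parts below 10000, optional decimals, and a trailing `%`, and the reading it picks may differ from what the speaker actually said (for example 两 versus 二).

```python
import re

DIGITS = "零一二三四五六七八九"
UNITS = ["", "十", "百", "千"]  # ones, tens, hundreds, thousands


def int_to_chinese(n):
    """Read an integer in 0..9999 as Chinese numerals."""
    if not 0 <= n < 10000:
        raise ValueError("this sketch only handles 0..9999")
    if n == 0:
        return DIGITS[0]
    digits = [int(ch) for ch in str(n)]
    out = []
    pending_zero = False
    for pos, d in enumerate(digits):
        unit = UNITS[len(digits) - 1 - pos]
        if d == 0:
            pending_zero = True  # emit a single 零 only if a digit follows
            continue
        if pending_zero:
            out.append(DIGITS[0])
            pending_zero = False
        out.append(DIGITS[d] + unit)
    text = "".join(out)
    # 10..19 are read 十, 十一, ... rather than 一十, 一十一, ...
    if text.startswith("一十"):
        text = text[1:]
    return text


NUMBER = re.compile(r"([0-9]+)(?:\.([0-9]+))?(%)?")


def normalize_numbers(text):
    """Replace numbers such as 14, 6.6, or 91.2% with a spoken Chinese form."""
    def read(match):
        whole, frac, percent = match.groups()
        spoken = int_to_chinese(int(whole))
        if frac:
            # digits after the decimal point are read one by one
            spoken += "点" + "".join(DIGITS[int(d)] for d in frac)
        return "百分之" + spoken if percent else spoken

    return NUMBER.sub(read, text)
```

Calling `normalize_numbers(text)` on the transcript before `charsiu.align(audio=audio, text=text)` would remove the digits that the Pinyin conversion cannot handle; whether this alone resolves the `IndexError` above has not been verified here.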

Second, the audio is too long. Both the speech model and the alignment algorithm have a complexity of O(n^2) in the sequence length, so it is very difficult for them to handle very long audio. Processing long audio is still a challenge in speech research, so I do not have a good solution. Segmenting the audio into shorter clips will help.
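One way to segment the audio, sketched here with only the standard `wave` module, is to cut the file into consecutive clips of at most 15 seconds. This is an illustration under simplifying assumptions, not something Charsiu provides: `chunk_bounds` and `split_wav` are hypothetical names, the cuts are made at fixed positions and may fall in the middle of a word (cutting at pauses would be better), and the transcript still has to be divided by hand so that each clip is aligned against its own text.

```python
import wave


def chunk_bounds(n_frames, frame_rate, max_seconds=15.0):
    """Return (start, end) frame ranges covering the file in clips of at most max_seconds."""
    step = int(frame_rate * max_seconds)
    if step <= 0:
        raise ValueError("max_seconds is too small for this frame rate")
    return [(start, min(start + step, n_frames))
            for start in range(0, n_frames, step)]


def split_wav(path, out_prefix, max_seconds=15.0):
    """Write consecutive clips of a PCM WAV file and return their paths."""
    clip_paths = []
    with wave.open(path, "rb") as src:
        params = src.getparams()
        bounds = chunk_bounds(src.getnframes(), src.getframerate(), max_seconds)
        for index, (start, end) in enumerate(bounds):
            # frames are read sequentially, so each read continues where the last ended
            frames = src.readframes(end - start)
            clip_path = f"{out_prefix}_{index:03d}.wav"
            with wave.open(clip_path, "wb") as dst:
                dst.setparams(params)   # same channels, sample width, and rate
                dst.writeframes(frames)  # the frame count in the header is fixed on close
            clip_paths.append(clip_path)
    return clip_paths
```

Each resulting clip could then be passed to `charsiu.align` together with the matching part of the transcript.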
