Can't currently support long audio ? #25

wxbool · 2022-06-19T06:46:56Z

“Charsiu works the best when your files are shorter than 15 ms. Test whether your files are longer than 15ms”

I saw this hint in the description and tested it.

Running forced alignment on a long audio file fails with the following error:

Traceback (most recent call last):
  File "test.py", line 31, in <module>
    charsiu.align(audio=audio, text=text)
  File "E:\***/python/charsiu/charsiu/src\Charsiu.py", line 157, in align
    pred_words = self.charsiu_processor.align_words(pred_phones,phones,words)
  File "E:\***/python/charsiu/charsiu/src\processors.py", line 417, in align_words
    word_dur.append((dur,words_rep[count])) #((start,end,phone),word)
IndexError: list index out of range

lingjzhu · 2022-06-19T14:38:05Z

Hi, I think I made a typo there. I meant that the model tends to make more mistakes when the audio is longer than 15 seconds, but I typed 15 ms instead. The training data are sentences shorter than 15 seconds, so the model might not generalize well to long audio.
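
To check a file against that 15-second guideline before aligning it, something like the following could be used. This is only a sketch, not part of Charsiu: the function names are made up for illustration, and it assumes the input is an uncompressed PCM WAV file that Python's standard `wave` module can read.

```python
import wave

MAX_SECONDS = 15.0  # guideline from the comment above, not a hard limit


def wav_duration_seconds(path):
    """Return the duration of a PCM WAV file in seconds (hypothetical helper)."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def exceeds_limit(path, limit=MAX_SECONDS):
    """True if the file is longer than the given limit in seconds."""
    return wav_duration_seconds(path) > limit
```

For example, `exceeds_limit('./charsiu/local/4.wav')` would tell you whether the file from the report is past the guideline.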

It's hard for me to pinpoint the problem without more information. Does the example we provide work for you? Could you share the actual input text and audio?

wxbool · 2022-06-19T15:08:29Z

Test code:

# if there are errors importing, uncomment the following lines and add path to charsiu
import sys

sys.path.append('E:/ViggoSpace/GoWork/Home/video-srt-pro/client/test/python/charsiu/charsiu/src')

# import the aligner classes from Charsiu
from Charsiu import charsiu_predictive_aligner
from Charsiu import charsiu_forced_aligner

audio = './charsiu/local/4.wav'
text = "据国际学生评估项目小组统计数据显示，中国14个小时，波兰6.6小时，爱尔兰7.3小时，意大利和俄罗斯为8.7小时，作业量较小的为芬兰和韩国，分别为2.8小时和2.9小时，该调查还指出，富裕家庭的学生平均每周比经济弱势的学生多花1.6小时写作业，中国中小学写作业压力报告更是显示在中国91.2%的家长有过陪孩子写作业的经历，其中每天陪写的家长高达78%，有75.79%的家庭曾经写作业，发生过亲子矛盾事，多一写作业就要干这个干那个，磨磨蹭蹭半天写不出几道题来，太笨了，怎么教也不会不少受调查的家长提到陪写作业，成为幸福感下降的原因之一。"
save_to = './charsiu/local/4.TextGrid'

# Chinese
# initialize model
charsiu = charsiu_forced_aligner(aligner='charsiu/zh_xlsr_fc_10ms', lang='zh')
charsiu.align(audio=audio, text=text)
charsiu.serve(audio=audio, text=text,
              save_to=save_to)

Audio file link: https://file.viggo.site/temp/4.wav

Thank you for your reply.
Is it currently possible to run forced alignment on long audio?
I have also tried Montreal Forced Aligner. With long audio its alignments were clearly wrong, and it ran very slowly.

lingjzhu · 2022-06-21T14:05:41Z

Thank you for sharing. I think there are two problems here.
First, since this is the code for a research paper, I haven't implemented any number normalization, so the conversion to Pinyin will fail on numerals such as "14" or "91.2%". This is a potential source of errors.
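
One possible workaround is to spell out the numerals in the text before passing it to the aligner. The sketch below is not part of Charsiu, and all names in it are hypothetical. It assumes the speaker reads integers as ordinary cardinals (14 as 十四), decimals digit by digit after 点, and percentages with 百分之. Integers with more than four digits are simply read digit by digit, and context-dependent readings such as 两 or years are not handled.

```python
import re

DIGITS = "零一二三四五六七八九"
UNITS = ["", "十", "百", "千"]
# integer part, optional decimal part, optional percent sign
NUMBER = re.compile(r"([0-9]+)(?:\.([0-9]+))?([%％])?")


def int_to_chinese(n):
    """Spell out an integer 0 <= n <= 9999 as a Chinese cardinal."""
    if not 0 <= n <= 9999:
        raise ValueError("this sketch only handles 0..9999")
    if n == 0:
        return DIGITS[0]
    digits = str(n)
    out = ""
    pending_zero = False
    for i, ch in enumerate(digits):
        d = int(ch)
        if d == 0:
            pending_zero = True  # a run of inner zeros yields a single 零
            continue
        if pending_zero:
            out += DIGITS[0]
            pending_zero = False
        out += DIGITS[d] + UNITS[len(digits) - 1 - i]
    # 10..19 are read 十, 十一, ... rather than 一十, 一十一, ...
    if out.startswith("一十"):
        out = out[1:]
    return out


def _spell(match):
    whole, frac, percent = match.groups()
    if len(whole) <= 4:
        text = int_to_chinese(int(whole))
    else:
        # fallback for large numbers: read digit by digit
        text = "".join(DIGITS[int(c)] for c in whole)
    if frac:
        text += "点" + "".join(DIGITS[int(c)] for c in frac)
    if percent:
        text = "百分之" + text
    return text


def normalize_numbers(text):
    """Replace Arabic numerals in text with Chinese characters."""
    return NUMBER.sub(_spell, text)
```

The normalized string can then be passed as `text` in place of the original. Whether the spelled-out form matches what is actually said in the recording still has to be checked by ear.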

Second, the audio is too long. Both the speech model and the alignment algorithm have quadratic complexity, O(n^2) in the input length, so it is very difficult for them to handle very long audio. Processing long audio is still a challenge in speech research, so I do not have a good solution. Segmenting the audio into shorter clips will help.
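
As a rough illustration of the segmenting step, the sketch below cuts a WAV file into consecutive clips of at most 15 seconds using only the standard library. It is not part of Charsiu: `split_wav` and its parameters are hypothetical, it assumes uncompressed PCM WAV input, and it cuts at fixed offsets, so a cut can land in the middle of a word. In practice you would want to move the cuts to pauses and split the transcript by hand so that each clip has its matching text.

```python
import os
import wave


def split_wav(path, out_dir, max_seconds=15.0):
    """Split a PCM WAV file into consecutive clips of at most max_seconds.

    Returns a list of (clip_path, start_seconds, end_seconds).
    """
    os.makedirs(out_dir, exist_ok=True)
    stem = os.path.splitext(os.path.basename(path))[0]
    clips = []
    with wave.open(path, "rb") as src:
        params = src.getparams()
        rate = src.getframerate()
        total = src.getnframes()
        frames_per_clip = int(max_seconds * rate)
        if frames_per_clip <= 0:
            raise ValueError("max_seconds is too small for this sample rate")
        bytes_per_frame = params.sampwidth * params.nchannels
        start = 0
        while start < total:
            frames = src.readframes(frames_per_clip)
            n = len(frames) // bytes_per_frame
            if n == 0:
                break  # header promised more frames than the file holds
            clip_path = os.path.join(out_dir, f"{stem}_{len(clips):03d}.wav")
            with wave.open(clip_path, "wb") as dst:
                dst.setparams(params)
                dst.writeframes(frames)  # frame count in header is fixed on close
            clips.append((clip_path, start / rate, (start + n) / rate))
            start += n
    return clips
```

Each clip can then be aligned separately with its own portion of the text, and the returned start offset added back to that clip's timestamps to recover positions in the original recording.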
