
Feature Request - Speech / Phonetics automatic generation / alignment #49

Open
merlin2v opened this issue Oct 1, 2018 · 94 comments
@merlin2v

merlin2v commented Oct 1, 2018

I've been wondering why the text file had to be used. Couldn't you separate the sound via phonetics?
This would be better, as it would help match things more accurately than the text alone.
Take the following example:

I do like

This could be said as:

adɪ̈ lik (IPA)

vs. someone using a different pronunciation:

ai dɵ lik (IPA)

These two end up using different mouth movements, and because of that a text-only breakdown can make some of the mouth movements off.

@morevnaproject
Collaborator

I was thinking about that too, and started to investigate. This is what I found - https://cmusphinx.github.io/wiki/phonemerecognition/

> Frequently, people want to use Sphinx to do phoneme recognition. In other words, they would like to convert speech to a stream of phonemes rather than words. This is possible, although the results can be disappointing. The reason is that automatic speech recognition relies heavily on contextual constraints (i.e. language modeling) to guide the search algorithm.

For now, I think integrating with RhubarbLipSync (#44) is a way to go.

@morevnaproject
Collaborator

We've just merged the Rhubarb feature - #50 ^__^

@Hunanbean

Montreal Forced Aligner may be something to look into, but that would be more for automatic alignment from the text, rather than the full shebang.

@steveway
Collaborator

As mentioned, we currently have Rhubarb integrated.
But I just found an interesting project for this called Allosaurus.
It seems to be pretty easy to use, and here on Windows 10 it was very easy to pip install.
The only problem is that it outputs IPA phonemes and does not provide any timestamps (yet).
We should be able to create a mapping from those phonemes to the ones we already support.
And for the timestamps there is already an open issue: xinjli/allosaurus#24
But even without timestamps it might already be usable with some conversion of the phonemes.

@steveway
Collaborator

steveway commented May 19, 2021

Here is a very simple conversion dict from IPA to CMU:
{ "b": "B", "ʧ": "CH", "d": "D", "ð": "DH", "f": "F", "g": "G", "h": "HH", "ʤ": "JH", "k": "K", "l": "L", "m": "M", "n": "N", "ŋ": "NG", "p": "P", "r": "R", "s": "S", "ʃ": "SH", "t": "T", "θ": "TH", "v": "V", "w": "W", "j": "Y", "z": "Z", "ʒ": "ZH", "ɑ": "AA2", "æ": "AE2", "ə": "AH0", "ʌ": "AH2", "ɔ": "AO2", "ɛ": "EH2", "ɚ": "ER0", "ɝ": "ER2", "ɪ": "IH2", "i": "IY2", "ʊ": "UH2", "u": "UW2", "aʊ": "AW2", "aɪ": "AY2", "eɪ": "EY2", "oʊ": "OW2", "ɔɪ": "OY2" }
It's based on this mapping from CMU to IPA.
https://github.com/margonaut/CMU-to-IPA-Converter/blob/master/cmu_ipa_mapping.rb
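For illustration, a minimal sketch of applying such a mapping to Allosaurus-style output (one space-separated symbol per phoneme); the function name is only an example, not Papagayo-NG code:

```python
# Minimal sketch: map space-separated IPA phonemes (as Allosaurus emits them)
# to CMU phonemes with the dictionary above. Unknown symbols pass through
# unchanged so nothing is silently dropped.
IPA_TO_CMU = {
    "b": "B", "ʧ": "CH", "d": "D", "ð": "DH", "f": "F", "g": "G", "h": "HH",
    "ʤ": "JH", "k": "K", "l": "L", "m": "M", "n": "N", "ŋ": "NG", "p": "P",
    "r": "R", "s": "S", "ʃ": "SH", "t": "T", "θ": "TH", "v": "V", "w": "W",
    "j": "Y", "z": "Z", "ʒ": "ZH", "ɑ": "AA2", "æ": "AE2", "ə": "AH0",
    "ʌ": "AH2", "ɔ": "AO2", "ɛ": "EH2", "ɚ": "ER0", "ɝ": "ER2", "ɪ": "IH2",
    "i": "IY2", "ʊ": "UH2", "u": "UW2", "aʊ": "AW2", "aɪ": "AY2",
    "eɪ": "EY2", "oʊ": "OW2", "ɔɪ": "OY2",
}

def ipa_to_cmu(ipa_string):
    """Convert a space-separated IPA phoneme string to a list of CMU phonemes."""
    return [IPA_TO_CMU.get(symbol, symbol) for symbol in ipa_string.split()]

print(ipa_to_cmu("h ə l oʊ"))  # ['HH', 'AH0', 'L', 'OW2']
```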

@Hunanbean

I must have underestimated what Rhubarb actually does. I will take a look at it now.
In the CMU phoneme set I did, I purposely simplified it to remove the specific variants, so that AO1 and AO2 both become just AO. But I am pretty sure I still have the full set from before I truncated it, if you want me to post that on my git. Given the imperceptible differences between AO, AO1, and AO2 in action, though, perhaps it makes more sense to just have the conversion dictionary truncate to the existing set of 39.

@steveway
Collaborator

Yes, Rhubarb is quite nice; it would be awesome if it could also output text besides phonemes.
With our language dictionaries we could try to convert phonemes back to words for that.
The results from Rhubarb are not as exact as our manual methods, but I guess for most animations it's enough.
I don't think we need the untruncated list for CMU.
I just quickly generated that list above based on that little converter from @margonaut.
If we really want to integrate Allosaurus, then we should make a fitting conversion table for our phoneme list.
We can use that information to create a new phoneme_set and phoneme_conversion dictionary for IPA.
And we should add some code to use these to convert between different phoneme sets; that should already be possible in a limited way.
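As a rough sketch of what such a conversion step could look like (hypothetical dictionary and function names, not Papagayo-NG's actual data structures), converting through a CMU39 pivot:

```python
# Hypothetical sketch of converting between phoneme sets via a CMU39 pivot.
# The tiny dictionaries below are only examples; real conversion dicts would
# cover the full phoneme sets.
ipa_to_cmu39 = {"ə": "AH", "m": "M", "b": "B"}
cmu39_to_preston_blair = {"AH": "AI", "M": "MBP", "B": "MBP"}

def convert_phoneme(phoneme, source_to_cmu, cmu_to_target, fallback="rest"):
    """Map a phoneme from a source set to a target set, falling back to 'rest'."""
    cmu = source_to_cmu.get(phoneme)
    if cmu is None:
        return fallback              # unknown in the source set
    return cmu_to_target.get(cmu, fallback)

print(convert_phoneme("m", ipa_to_cmu39, cmu39_to_preston_blair))  # MBP
```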

@steveway
Collaborator

steveway commented Jun 2, 2021

I now have an allosaurus branch: https://github.com/steveway/papagayo-ng/tree/allosaurus
This currently uses pydub to prepare the sound files for allosaurus.
This works very well with our tutorial files, even the Spanish ones.
The results seem to be better than what Rhubarb provides.
Here is a quick test showing the result for running it on the lame.wav file:
https://youtu.be/4hqHaEXo9xU
The phonemes are partially overlapping, so some pruning needs to be done for animation purposes.
But as you can see the results are quite good.
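For anyone who wants to try something similar outside of that branch, a rough sketch of such a pipeline, assuming the pydub and allosaurus packages and a recent Allosaurus release with the timestamp flag (timestamp support was the subject of xinjli/allosaurus#24); the exact options the branch uses may differ:

```python
# Rough sketch: convert any audio file pydub/FFmpeg can read into 16 kHz mono
# WAV and feed it to Allosaurus' universal phoneme recognizer.
from pydub import AudioSegment
from allosaurus.app import read_recognizer

def recognize_phonemes(input_path, prepared_path="prepared.wav", lang="eng"):
    sound = AudioSegment.from_file(input_path)
    sound = sound.set_channels(1).set_frame_rate(16000)
    sound.export(prepared_path, format="wav")

    model = read_recognizer()
    # timestamp=True returns "start duration phoneme" lines instead of a plain
    # phoneme string.
    return model.recognize(prepared_path, lang, timestamp=True)

print(recognize_phonemes("lame.wav"))
```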

@Hunanbean

Hunanbean commented Jun 2, 2021 via email

@steveway
Collaborator

Alright, I made some more progress.
Based on the time between phonemes from the automatic recognition, I chunk the phonemes together into single "words" (a simplified sketch follows below).
This should make editing the results a bit easier.
Also, I added some simple logic to convert between different phoneme sets, but for that we likely need some good hand-made phoneme conversion dicts.
Also, at the moment I convert first to CMU39 and then from that to the desired set.
For the best results we should create conversions between each set manually.

I think the next step would be to add a GUI option to assign the selected phonemes/words/phrases to a different voice.
That way, after the auto recognition has done its work, we just need to separate the parts into the different voices, if there are any, and then it's pretty much done.

And for the future some automatic speaker diarization would be awesome, that way we could automate almost everything.
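A simplified sketch of the chunking idea described above (not the actual code from the branch): split the timestamped phonemes into word-like groups wherever the gap to the next phoneme is well above average.

```python
# Simplified sketch: group timestamped phonemes into word-like chunks by
# splitting at unusually large gaps between phoneme start times.
def chunk_phonemes(phonemes, gap_factor=2.0):
    """phonemes: list of (start_seconds, phoneme) tuples, sorted by start time."""
    if len(phonemes) < 2:
        return [phonemes]
    gaps = [b[0] - a[0] for a, b in zip(phonemes, phonemes[1:])]
    threshold = gap_factor * (sum(gaps) / len(gaps))  # split at gaps well above average
    words, current = [], [phonemes[0]]
    for gap, phoneme in zip(gaps, phonemes[1:]):
        if gap > threshold:
            words.append(current)
            current = []
        current.append(phoneme)
    words.append(current)  # keep the trailing chunk
    return words

example = [(0.0, "HH"), (0.1, "AH0"), (0.2, "L"), (0.3, "OW2"),
           (1.0, "W"), (1.1, "ER2"), (1.25, "L"), (1.4, "D")]
print(chunk_phonemes(example))  # two word-like groups: "hello" and "world"
```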

@steveway
Collaborator

steveway commented Jul 2, 2021

Ok, it's now available.
I've created a first pull request for this: #94
The results are pretty good; I would recommend that everyone test this.
You should first download FFmpeg and the Allosaurus AI model using the actions at the top of Papagayo-NG, then restart Papagayo-NG and it should all work.

@Hunanbean

It is working and Very impressive! Thank you very much!

@aziagiles

@steveway Everything is working except the 'Convert Phonemes' function, which doesn't work well yet. I think it still needs some fixing. I'm on a Windows 10 platform.

@aziagiles

@steveway Also, for the default 'cmu_39' automatic breakdown, the breakdown goes well for the earlier parts of the input audio, but it doesn't break down the end of the audio.

@steveway
Collaborator

steveway commented Jul 3, 2021

Yes, the conversion needs some work, as I mentioned:

> Also, I added some simple logic to convert between different phoneme sets, but for that we likely need some good hand-made phoneme conversion dicts.
> Also, at the moment I convert first to CMU39 and then from that to the desired set.
> For the best results we should create conversions between each set manually.

Can you show which files it does not break down all the way?
The test files worked pretty well. The automatic breakdown is done by https://github.com/xinjli/allosaurus so there might not be much we can do, depending on the cause.

@Hunanbean

Yes, there appears to be a problem with it truncating roughly the last 0.5 seconds of the audio file. I will see if it works if I add some empty time at the end of the audio file.

@aziagiles

@steveway Ok. I made a video of the two issues I had: the conversion from CMU39 to Preston Blair not working properly, as a lot of missing phonemes are reported, and the last phrase or words in the audio not being broken down.

phoneme.conversion.mp4

@Hunanbean

Hunanbean commented Jul 3, 2021

Ok, after I added 1 second of silence to the end of the audio file, it now picks up the last phrase.

Edit: I was mistaken, it still truncates the end. It was just the audio lining up with the end, not the actual conversion. The last word/words are still truncated.

@aziagiles

@Hunanbean I just added a second of silence at the end of my audio, and after the breakdown it ended exactly where my audio ended, but unfortunately it still did not pick up the last phrase.

@Hunanbean

Hunanbean commented Jul 3, 2021

Hmm. Perhaps it just cannot recognize the last phrase, or more silence needs to be added?

Edit: My mistake. You are correct. It is still truncating the end. It was just the audio now finishing at the correct spot

@Hunanbean

Yes, I verified that any file I try, regardless of added silence at the end, still truncates the last phrase.

@steveway
Collaborator

steveway commented Jul 5, 2021

I think I found the cause.
There is of course still the possibility that allosaurus can't recognize all the phonemes.
But my code did accidentally skip the last few phonemes in some cases.
The reason was the logic I used to chunk phonemes into possible "words": I use peak detection on the time between phonemes to decide where to split between possible words.
If that result was uneven, the loop over it would skip the last one.
I changed this a bit now: steveway@fb94377
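A minimal, hypothetical illustration of that kind of off-by-one (not the actual branch code): pairing split boundaries two at a time loses the trailing phonemes whenever the number of boundaries is odd.

```python
# Hypothetical illustration of the off-by-one described above: pairing up the
# split boundaries two at a time drops the trailing chunk when their count is odd.
boundaries = [0, 4, 9, 13, 15]                    # indices where new "words" start
pairs = list(zip(boundaries[::2], boundaries[1::2]))
print(pairs)  # [(0, 4), (9, 13)] -> the chunk starting at 15 is lost
# One fix: always close the final chunk explicitly, e.g. by appending the total
# number of phonemes as the last boundary before pairing.
```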

@Hunanbean

With the files I added one second of silence to, this now picks up that last phrase. However, the problem still remains on the same files without the added silence.

Thank you

@steveway
Collaborator

steveway commented Jul 6, 2021

Allosaurus is likely not picking up that part at all.
Can you send a file which has this problem?
Then I can test whether it is really Allosaurus or something we do.

@Hunanbean

Here is an example. The jenna file says the same words and is recognized without silence added. The salli file says the same words, but the last portion is only recognized with silence added. For the salli voice, both versions, with and without silence, are included.
ZippityDoDa.zip

Thank you

@steveway
Collaborator

steveway commented Jul 7, 2021

I now add half a second of silence at the end for Allosaurus; if you then increase the emission to about 1.4, it recognizes your file to the end.
It's a bit strange: without raising the emission it doesn't work, even if I add 10 seconds of silence.
You can test this by re-downloading and installing this release: https://github.com/steveway/papagayo-ng/releases/tag/1.6.1
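For reference, roughly the same workaround outside the GUI, assuming pydub and the Allosaurus Python API (the emit argument trades precision for recall, so a higher value makes quiet trailing phonemes more likely to appear):

```python
# Sketch of the workaround described above: pad the end of the clip with silence
# and raise the emission so the trailing phonemes are still emitted.
from pydub import AudioSegment
from allosaurus.app import read_recognizer

sound = AudioSegment.from_file("salli.wav").set_channels(1).set_frame_rate(16000)
padded = sound + AudioSegment.silent(duration=500, frame_rate=16000)  # +0.5 s of silence
padded.export("salli_padded.wav", format="wav")

model = read_recognizer()
# emit > 1.0 tells Allosaurus to emit more phonemes (higher recall, more noise).
print(model.recognize("salli_padded.wav", "eng", emit=1.4, timestamp=True))
```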

@Hunanbean

That is working well. Thank you

@steveway
Collaborator

I see, the setting was sometimes not loading correctly because QSettings does not handle bools correctly when loading from .ini files.
I fixed that.
I also changed this one setting and split it into two.
One will display the rest frame between every phoneme and the other will only display rest frames between words.
rest_frames_settings
The descriptions of the settings are also a bit clearer this way.
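For anyone running into the same thing, a small sketch of the QSettings pitfall being described, assuming PySide2 (the exact fix in Papagayo-NG may differ): booleans written to an .ini file come back as the strings "true"/"false", and bool("false") is True.

```python
# Sketch of the QSettings/.ini bool pitfall: values round-trip as strings, so a
# plain truthiness check reads a stored False as True.
from PySide2.QtCore import QSettings

settings = QSettings("papagayo.ini", QSettings.IniFormat)
settings.setValue("rest_after_words", False)
settings.sync()                                    # flush to the .ini file

reloaded = QSettings("papagayo.ini", QSettings.IniFormat)
raw = reloaded.value("rest_after_words")           # "false" (a string)
buggy = bool(raw)                                  # True  -> the bug
fixed = str(raw).lower() in ("true", "1")          # False -> one safe way to read it
```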

@aziagiles

@steveway Thank you very much for the prompt fix. Hope you're having a nice day. I have one last recommendation for the Papagayo-NG software. It's still related to holding back phonemes. I was wondering if the code could be modified so that a user can choose whether the phonemes should be held or not when exporting the file in .dat or other formats. In the animation I'm currently working on, there are many cases where I'll need them not to be held back. I know in your Grease Pencil importer addon there is a place to check that, but for those of us using the Lip Sync Importer addon, we can't modify it at that stage.

@aziagiles

@steveway I just downloaded the recent master branch, and thanks so much for considering the above recommendation. I believe the "Show Rest Frames after Words" function together with the "Apply Rest Frame settings on Export" function were actually it. But I think the "Show Rest Frames after Phonemes" function can be taken off, as I don't believe anyone will ever use it.

@aziagiles

@steveway Just tested the software again, and it seems the "Apply Rest Frame settings on Export" function didn't work. I think it needs fixing.

@steveway
Collaborator

Yes, it's not doing anything yet. 3dcc3b7
It's a little more difficult to add this to the export methods.
But I already have an idea how to add it without changing the code too much.

@steveway
Collaborator

Alright, I've added some functionality.
For now it only adds rest frames between words on export to MOHO.
And it will only add the rest frame if there is a free frame in between.
The whole thing was a bit more confusing than it needed to be, because it seems that MOHO starts at frame 1 while most other software starts at 0.
I'm not sure if we want to add this functionality to the other export options.
It's better to add some logic during import and keep the data unchanged.
Of course for MOHO that change makes sense, since we can't update the importer in their software.
Well, we could ask them if they want to support the new JSON format for input, and while they are at it they could add the logic to insert rest frames during import depending on user choice, like my Blender and Krita plugins do.
Maybe they will be responsive to this since Mike Clifton is back at the wheel of MOHO.
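A simplified, hypothetical sketch of that export logic (illustrative names, not the actual exporter): insert a rest frame between two words only when there is a free frame between them, and shift everything by one because MOHO counts frames from 1.

```python
# Hypothetical sketch of a MOHO switch export with rest frames between words.
MOHO_FRAME_OFFSET = 1  # MOHO starts at frame 1, most other software at 0

def moho_lines(words, rest="rest"):
    """words: list of (start_frame, end_frame, [(frame, phoneme), ...]) tuples."""
    lines = ["MohoSwitch1"]
    previous_end = None
    for start, end, phonemes in words:
        # Only insert a rest frame if there is an unused frame between two words.
        if previous_end is not None and start - previous_end > 1:
            lines.append("{} {}".format(previous_end + 1 + MOHO_FRAME_OFFSET, rest))
        for frame, phoneme in phonemes:
            lines.append("{} {}".format(frame + MOHO_FRAME_OFFSET, phoneme))
        previous_end = end
    return lines

print("\n".join(moho_lines([(0, 3, [(0, "HH"), (2, "AH")]),
                            (10, 12, [(10, "M")])])))
```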

@aziagiles

@steveway Hello Steve. I'm very happy with your last comment because it goes along with my line of thought. I believe 'Show Rest Frames after Phonemes' should be taken off and replaced with something like 'Show Rest Frames after Sentences', whereby, while the words within a sentence are being said, phonemes are held back, but when it encounters a silence in the dialogue of, say, 8 frames or more (1/3 of a second), the rest frame appears. And 'Show Rest Frames after Words' should just be the same type of phoneme breakdown as in Papagayo-NG version 1.6.3 and lower, where phonemes are not held back. Below is a test video of a lip sync exercise I did illustrating the concern.

0001-0891.mp4

@steveway
Collaborator

@aziagiles That sounds like a useful feature.
There is silence detection in Pydub; I've tried before to split the sounds into words based on that.
But it's a bit fiddly, so I haven't gotten a good result yet.
I'll have to do some testing to see how we can combine the information we have from Allosaurus and Pydub into something usable; that might take some time.
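A minimal sketch of the Pydub side of that, using its silence detection to find pauses long enough for a rest frame (here at least 1/3 of a second, matching the 8-frames-at-24-fps suggestion above); how this would be combined with the Allosaurus results is still open:

```python
# Minimal sketch: find pauses of at least 1/3 second with pydub's silence
# detection and turn their start times into candidate rest-frame positions.
from pydub import AudioSegment
from pydub.silence import detect_silence

FPS = 24
sound = AudioSegment.from_file("dialogue.wav")
# "Silence" here is anything at least 16 dB below the clip's average loudness,
# lasting 333 ms or more; returned as [start_ms, end_ms] pairs.
pauses = detect_silence(sound, min_silence_len=333, silence_thresh=sound.dBFS - 16)
rest_frames = [int(round(start / 1000 * FPS)) for start, end in pauses]
print(rest_frames)
```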

@aziagiles

@steveway OK

@aziagiles

@steveway I just downloaded the current master branch of the Papagayo-NG Allosaurus GitHub version, but it failed to load. I guess there is a bug.

@steveway
Collaborator

If you downloaded from my fork then it's best to use the master branch. That is the most complete one.

@aziagiles

@steveway Ok. Let me try once again.

@steveway
Collaborator

Something more related to the original topic:
I just found Vosk, for which there also seems to be some support for getting phonemes out.
alphacep/vosk-api#528
While the results from Allosaurus are very good, this might be an interesting alternative.

@aziagiles

@steveway Adding it to Papagayo-NG alongside Allosaurus and Rhubarb would be a great idea. That would be awesome.

@aziagiles

@Hunanbean Hello bro. I just downloaded the "Papagayo-NG Lipsync Importer For Blender" addon from your GitHub page and noticed it doesn't work in Blender 3.5.x. Please can it be updated?

@aziagiles

@Hunanbean I believe the problem comes as a result of modifications made to the Pose Library function in recent versions of Blender, and the addon script will thus need some retouching in order to function properly, taking this into account.

@Hunanbean

Hunanbean commented May 3, 2023

@aziagiles Howdy! Shoot. Ok, I just tested it and am experiencing the same thing. I will talk to CGPT4 about it, because I am still no programmer :) I'll see what we can do.

@aziagiles

@Hunanbean Ok. I get your point. Best wishes as you fix the issue. I believe you can do this.

@Hunanbean

Hunanbean commented May 3, 2023

@aziagiles Ok, I think I've got it worked out. I will post the fix to the Git as long as it works for you too.
To fix it real quick, replace
from bpy.props import *
with
from bpy.props import EnumProperty, FloatProperty, StringProperty, PointerProperty, IntProperty

@aziagiles

@Hunanbean Ok. will do just that and give you feedback.

@Hunanbean

> @Hunanbean Ok. will do just that and give you feedback.

In case it is easier, I've gone ahead and updated the git at https://github.com/Hunanbean/Papagayo-NGLipsyncImporterForBlender
I will try again if that does not work.

@aziagiles

@Hunanbean What should I fill in for the pose library, or should I leave it empty? The default may be to put "Current File", as that's where my mouth shapes are, but when I do, it doesn't work.

update

@Hunanbean

@aziagiles Ok, the biggest issue is, I use Shape Keys, which got fixed by the change in that line. I am not set up to even try Pose Libraries. Is there a way you can send me a generic file with a pose library I can use to test with?

@aziagiles

@Hunanbean I'm receiving error messages. I'm on a 64-bit Windows computer.
update

@Hunanbean

@aziagiles If you have some generic .blend with a configured pose library and a test model, I will keep trying to fix the code, but I do not even know how to make pose libraries yet. I am not sure if uploads can be done here, so you could send it to hunanbean.learning@gmail.com if you have such a file.

@aziagiles

> @aziagiles Ok, the biggest issue is, I use Shape Keys, which got fixed by the change in that line. I am not set up to even try Pose Libraries. Is there a way you can send me a generic file with a pose library I can use to test with?

Ok. Let me prepare a file and send it. I use Grease Pencil, and my phonemes are controlled by a bone.

@aziagiles

> hunanbean.learning@gmail.com

I've just replied to your email, with the blend file attached.

@Hunanbean

> I've just replied to your email, with the blend file attached.

Ok, I just got it, thanks. I will see what I can do. This is going to take some time though, so I probably will not have much until tomorrow or the day after, as they are working on the power here and it is scheduled to be down for several hours :/

@aziagiles

> I've just replied to your email, with the blend file attached.
>
> Ok, I just got it, thanks. I will see what I can do. This is going to take some time though, so I probably will not have much until tomorrow or the day after, as they are working on the power here and it is scheduled to be down for several hours :/

Ok. Best of luck.

@aziagiles

@Hunanbean Hello Hunanbean. I just sent you two emails. The second one has another blend file attached, containing mouth shapes I made in Blender 3.5.1, for testing the lipsync addon. Please check.
