
Feature Request - Speech / Phonetics automatic generation / alignment #49

Open
merlin2v opened this issue Oct 1, 2018 · 94 comments
@merlin2v

merlin2v commented Oct 1, 2018

I've been wondering why the text file had to be used. Couldn't you separate the sound via phonetics?
This would be better, as it would help match things more accurately than the text alone.
Take the following example:

I do like

This could be said as:

adɪ̈ lik (IPA)

vs. someone using a different pronunciation:

ai dɵ lik (IPA)

These two end up using different mouth movements, and because of that a text-only breakdown can make some of the mouth movements off.

@morevnaproject
Collaborator

I was thinking about that too, and started to investigate. This is what I found - https://cmusphinx.github.io/wiki/phonemerecognition/

> Frequently, people want to use Sphinx to do phoneme recognition. In other words, they would like to convert speech to a stream of phonemes rather than words. This is possible, although the results can be disappointing. The reason is that automatic speech recognition relies heavily on contextual constraints (i.e. language modeling) to guide the search algorithm.

For now, I think integrating with RhubarbLipSync (#44) is a way to go.

@morevnaproject
Collaborator

We've just merged the Rhubarb feature - #50 ^__^

@Hunanbean

Montreal Forced Aligner may be something to look into, but that would be more for automatic alignment from the text, rather than the full shebang.

@steveway
Collaborator

As mentioned, we currently have Rhubarb integrated.
But I just found an interesting project for this called Allosaurus.
It seems to be pretty easy to use, and here on Windows 10 it was very easy to pip install.
The only problem is that it outputs IPA phonemes and does not provide any timestamps (yet).
We should be able to create a mapping from those phonemes to the ones we already support.
And for the timestamps there is already an open issue: xinjli/allosaurus#24
But even without timestamps it might already be usable with some conversion of the phonemes.

@steveway
Collaborator

steveway commented May 19, 2021

Here is a very simple conversion dict from IPA to CMU:
{ "b": "B", "ʧ": "CH", "d": "D", "ð": "DH", "f": "F", "g": "G", "h": "HH", "ʤ": "JH", "k": "K", "l": "L", "m": "M", "n": "N", "ŋ": "NG", "p": "P", "r": "R", "s": "S", "ʃ": "SH", "t": "T", "θ": "TH", "v": "V", "w": "W", "j": "Y", "z": "Z", "ʒ": "ZH", "ɑ": "AA2", "æ": "AE2", "ə": "AH0", "ʌ": "AH2", "ɔ": "AO2", "ɛ": "EH2", "ɚ": "ER0", "ɝ": "ER2", "ɪ": "IH2", "i": "IY2", "ʊ": "UH2", "u": "UW2", "aʊ": "AW2", "aɪ": "AY2", "eɪ": "EY2", "oʊ": "OW2", "ɔɪ": "OY2" }
It's based on this mapping from CMU to IPA.
https://github.com/margonaut/CMU-to-IPA-Converter/blob/master/cmu_ipa_mapping.rb
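For illustration, a minimal sketch of applying such a mapping to Allosaurus-style output (one space-separated symbol per phoneme); the function name is only an example, not Papagayo-NG code:

```python
# Minimal sketch: map space-separated IPA phonemes (as Allosaurus emits them)
# to CMU phonemes with the dictionary above. Unknown symbols pass through
# unchanged so nothing is silently dropped.
IPA_TO_CMU = {
    "b": "B", "ʧ": "CH", "d": "D", "ð": "DH", "f": "F", "g": "G", "h": "HH",
    "ʤ": "JH", "k": "K", "l": "L", "m": "M", "n": "N", "ŋ": "NG", "p": "P",
    "r": "R", "s": "S", "ʃ": "SH", "t": "T", "θ": "TH", "v": "V", "w": "W",
    "j": "Y", "z": "Z", "ʒ": "ZH", "ɑ": "AA2", "æ": "AE2", "ə": "AH0",
    "ʌ": "AH2", "ɔ": "AO2", "ɛ": "EH2", "ɚ": "ER0", "ɝ": "ER2", "ɪ": "IH2",
    "i": "IY2", "ʊ": "UH2", "u": "UW2", "aʊ": "AW2", "aɪ": "AY2",
    "eɪ": "EY2", "oʊ": "OW2", "ɔɪ": "OY2",
}

def ipa_to_cmu(ipa_string):
    """Convert a space-separated IPA phoneme string to a list of CMU phonemes."""
    return [IPA_TO_CMU.get(symbol, symbol) for symbol in ipa_string.split()]

print(ipa_to_cmu("h ə l oʊ"))  # ['HH', 'AH0', 'L', 'OW2']
```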

@Hunanbean

I must have underestimated what Rhubarb actually does. I will take a look at it now.
In the CMU phoneme set I did, I purposely simplified it to remove the specific variants, so that AO1 and AO2 both become just AO. But I am pretty sure I still have the full set from before I truncated it, if you want me to post that on my git. Given the imperceptible differences between AO, AO1, and AO2 in action, though, perhaps it makes more sense to just have the conversion dictionary truncate to the existing set of 39.

@steveway
Collaborator

Yes, Rhubarb is quite nice; it would be awesome if it could also output text besides phonemes.
With our language dictionaries we could try to convert phonemes back to words for that.
The results from Rhubarb are not as exact as our manual methods, but I guess for most animations it's enough.
I don't think we need the untruncated list for CMU.
I just quickly generated that list above based on that little converter from @margonaut.
If we really want to integrate Allosaurus, then we should make a fitting conversion table for our phoneme list.
We can use that information to create a new phoneme_set and phoneme_conversion dictionary for IPA.
And we should add some code to use these to convert between different phoneme sets; that should already be possible in a limited way.
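As a rough sketch of what such a conversion step could look like (hypothetical dictionary and function names, not Papagayo-NG's actual data structures), converting through a CMU39 pivot:

```python
# Hypothetical sketch of converting between phoneme sets via a CMU39 pivot.
# The tiny dictionaries below are only examples; real conversion dicts would
# cover the full phoneme sets.
ipa_to_cmu39 = {"ə": "AH", "m": "M", "b": "B"}
cmu39_to_preston_blair = {"AH": "AI", "M": "MBP", "B": "MBP"}

def convert_phoneme(phoneme, source_to_cmu, cmu_to_target, fallback="rest"):
    """Map a phoneme from a source set to a target set, falling back to 'rest'."""
    cmu = source_to_cmu.get(phoneme)
    if cmu is None:
        return fallback              # unknown in the source set
    return cmu_to_target.get(cmu, fallback)

print(convert_phoneme("m", ipa_to_cmu39, cmu39_to_preston_blair))  # MBP
```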

@steveway
Collaborator

steveway commented Jun 2, 2021

I now have an allosaurus branch: https://github.com/steveway/papagayo-ng/tree/allosaurus
This currently uses pydub to prepare the sound files for allosaurus.
This works very well with our tutorial files, even the Spanish ones.
The results seem to be better than what Rhubarb provides.
Here is a quick test showing the result for running it on the lame.wav file:
https://youtu.be/4hqHaEXo9xU
The phonemes are partially overlapping, so some pruning needs to be done for animation purposes.
But as you can see the results are quite good.
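For anyone who wants to try something similar outside of that branch, a rough sketch of such a pipeline, assuming the pydub and allosaurus packages and a recent Allosaurus release with the timestamp flag (timestamp support was the subject of xinjli/allosaurus#24); the exact options the branch uses may differ:

```python
# Rough sketch: convert any audio file pydub/FFmpeg can read into 16 kHz mono
# WAV and feed it to Allosaurus' universal phoneme recognizer.
from pydub import AudioSegment
from allosaurus.app import read_recognizer

def recognize_phonemes(input_path, prepared_path="prepared.wav", lang="eng"):
    sound = AudioSegment.from_file(input_path)
    sound = sound.set_channels(1).set_frame_rate(16000)
    sound.export(prepared_path, format="wav")

    model = read_recognizer()
    # timestamp=True returns "start duration phoneme" lines instead of a plain
    # phoneme string.
    return model.recognize(prepared_path, lang, timestamp=True)

print(recognize_phonemes("lame.wav"))
```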

@Hunanbean

Hunanbean commented Jun 2, 2021 via email

@steveway
Collaborator

Alright, I made some more progress.
Based on the time between phonemes from the automatic recognition, I chunk the phonemes together into single "words" (a simplified sketch follows below).
This should make editing the results a bit easier.
Also, I added some simple logic to convert between different phoneme sets, but for that we likely need some good hand-made phoneme conversion dicts.
Also, at the moment I convert first to CMU39 and then from that to the desired set.
For the best results we should create conversions between each set manually.

I think the next step would be to add a GUI option to assign the selected phonemes/words/phrases to a different voice.
That way, after the auto recognition has done its work, we just need to separate the parts into the different voices, if there are any, and then it's pretty much done.

And for the future some automatic speaker diarization would be awesome, that way we could automate almost everything.
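A simplified sketch of the chunking idea described above (not the actual code from the branch): split the timestamped phonemes into word-like groups wherever the gap to the next phoneme is well above average.

```python
# Simplified sketch: group timestamped phonemes into word-like chunks by
# splitting at unusually large gaps between phoneme start times.
def chunk_phonemes(phonemes, gap_factor=2.0):
    """phonemes: list of (start_seconds, phoneme) tuples, sorted by start time."""
    if len(phonemes) < 2:
        return [phonemes]
    gaps = [b[0] - a[0] for a, b in zip(phonemes, phonemes[1:])]
    threshold = gap_factor * (sum(gaps) / len(gaps))  # split at gaps well above average
    words, current = [], [phonemes[0]]
    for gap, phoneme in zip(gaps, phonemes[1:]):
        if gap > threshold:
            words.append(current)
            current = []
        current.append(phoneme)
    words.append(current)  # keep the trailing chunk
    return words

example = [(0.0, "HH"), (0.1, "AH0"), (0.2, "L"), (0.3, "OW2"),
           (1.0, "W"), (1.1, "ER2"), (1.25, "L"), (1.4, "D")]
print(chunk_phonemes(example))  # two word-like groups: "hello" and "world"
```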

@steveway
Collaborator

steveway commented Jul 2, 2021

Ok, it's now available.
I've created a first pull request for this: #94
The results are pretty good; I would recommend that everyone test this.
You should first download FFmpeg and the Allosaurus AI model using the actions at the top of Papagayo-NG, then restart Papagayo-NG and it should all work.

@Hunanbean

It is working and Very impressive! Thank you very much!

@aziagiles

@steveway Everything is working except the 'Convert Phonemes' function, which doesn't work well yet. I think it still needs some fixing. I'm on a Windows 10 platform.

@aziagiles

@steveway Also, for the default 'cmu_39' automatic breakdown, the breakdown goes well for the earlier parts of the input audio, but it doesn't break down the end of the audio.

@steveway
Collaborator

steveway commented Jul 3, 2021

Yes, the conversion needs some work, as I mentioned:

> Also, I added some simple logic to convert between different phoneme sets, but for that we likely need some good hand-made phoneme conversion dicts.
> Also, at the moment I convert first to CMU39 and then from that to the desired set.
> For the best results we should create conversions between each set manually.

Can you show which files it does not break down all the way?
The test files worked pretty well. The automatic breakdown is done by https://github.com/xinjli/allosaurus so there might not be much we can do, depending on the cause.

@Hunanbean

Yes, there appears to be a problem with it truncating roughly the last 0.5 seconds of the audio file. I will see if it works if I add some empty time at the end of the audio file.

@aziagiles

@steveway Ok. I made a video of the two issues I had: the conversion from CMU39 to Preston Blair not working properly, as a lot of missing phonemes are reported, and the last phrase or words in the audio not being broken down.

phoneme.conversion.mp4

@Hunanbean

Hunanbean commented Jul 3, 2021

Ok, after I added 1 second of silence to the end of the audio file, it now picks up the last phrase.

Edit: I was mistaken, it still truncates the end. It was just the audio lining up with the end, not the actual conversion. The last word/words are still truncated.

@aziagiles

@Hunanbean I just added a second of silence at the end of my audio, and after the breakdown it ended exactly where my audio ended, but unfortunately it still did not pick up the last phrase.

@Hunanbean

Hunanbean commented Jul 3, 2021

Hmm. Perhaps it just cannot recognize the last phrase, or more silence needs to be added?

Edit: My mistake. You are correct. It is still truncating the end. It was just the audio now finishing at the correct spot

@Hunanbean

Yes, I verified that any file I try, regardless of added silence at the end, still truncates the last phrase.

@steveway
Collaborator

steveway commented Jul 5, 2021

I think I found the cause.
There is of course still the possibility that allosaurus can't recognize all the phonemes.
But my code did accidentally skip the last few phonemes in some cases.
The reason was the logic I used to chunk phonemes into possible "words": I use peak detection on the time between phonemes to decide where to split between possible words.
If that result was uneven, the loop over it would skip the last one.
I changed this a bit now: steveway@fb94377
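A minimal, hypothetical illustration of that kind of off-by-one (not the actual branch code): pairing split boundaries two at a time loses the trailing phonemes whenever the number of boundaries is odd.

```python
# Hypothetical illustration of the off-by-one described above: pairing up the
# split boundaries two at a time drops the trailing chunk when their count is odd.
boundaries = [0, 4, 9, 13, 15]                    # indices where new "words" start
pairs = list(zip(boundaries[::2], boundaries[1::2]))
print(pairs)  # [(0, 4), (9, 13)] -> the chunk starting at 15 is lost
# One fix: always close the final chunk explicitly, e.g. by appending the total
# number of phonemes as the last boundary before pairing.
```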

@Hunanbean

With the files I added one second of silence to, this now picks up that last phrase. However, the problem still remains on the same files without the added silence.

Thank you

@steveway
Collaborator

steveway commented Jul 6, 2021

Allosaurus is likely not picking up that part at all.
Can you send a file which has this problem?
Then I can test whether it is really Allosaurus or something we do.

@Hunanbean

Here is an example. The jenna file says the same words and is recognized without silence added. The salli file says the same words, but the last portion is only recognized with silence added. For the salli voice, both versions, with and without silence, are included.
ZippityDoDa.zip

Thank you

@steveway
Collaborator

steveway commented Jul 7, 2021

I now add half a second of silence at the end for Allosaurus; if you then increase the emission to about 1.4, it recognizes your file to the end.
It's a bit strange: without raising the emission it doesn't work, even if I add 10 seconds of silence.
You can test this by re-downloading and installing this release: https://github.com/steveway/papagayo-ng/releases/tag/1.6.1
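For reference, roughly the same workaround outside the GUI, assuming pydub and the Allosaurus Python API (the emit argument trades precision for recall, so a higher value makes quiet trailing phonemes more likely to appear):

```python
# Sketch of the workaround described above: pad the end of the clip with silence
# and raise the emission so the trailing phonemes are still emitted.
from pydub import AudioSegment
from allosaurus.app import read_recognizer

sound = AudioSegment.from_file("salli.wav").set_channels(1).set_frame_rate(16000)
padded = sound + AudioSegment.silent(duration=500, frame_rate=16000)  # +0.5 s of silence
padded.export("salli_padded.wav", format="wav")

model = read_recognizer()
# emit > 1.0 tells Allosaurus to emit more phonemes (higher recall, more noise).
print(model.recognize("salli_padded.wav", "eng", emit=1.4, timestamp=True))
```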

@Hunanbean

That is working well. Thank you

@steveway
Collaborator

I see, the setting was sometimes not loading correctly because QSettings does not handle bools correctly when loading from .ini files.
I fixed that.
I also changed this one setting and split it into two.
One will display the rest frame between every phoneme and the other will only display rest frames between words.
rest_frames_settings
The descriptions of the settings are also a bit clearer this way.
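For anyone running into the same thing, a small sketch of the QSettings pitfall being described, assuming PySide2 (the exact fix in Papagayo-NG may differ): booleans written to an .ini file come back as the strings "true"/"false", and bool("false") is True.

```python
# Sketch of the QSettings/.ini bool pitfall: values round-trip as strings, so a
# plain truthiness check reads a stored False as True.
from PySide2.QtCore import QSettings

settings = QSettings("papagayo.ini", QSettings.IniFormat)
settings.setValue("rest_after_words", False)
settings.sync()                                    # flush to the .ini file

reloaded = QSettings("papagayo.ini", QSettings.IniFormat)
raw = reloaded.value("rest_after_words")           # "false" (a string)
buggy = bool(raw)                                  # True  -> the bug
fixed = str(raw).lower() in ("true", "1")          # False -> one safe way to read it
```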

@aziagiles

@steveway Thank you very much for the prompt fix. Hope you're having a nice day. I have one last recommendation for the Papagayo-NG software. It's still related to holding back phonemes. I was wondering if the code could be modified so that a user can choose whether the phonemes should be held or not when exporting the file in .dat or other formats. In the animation I'm currently working on, there are many cases where I'll need them not to be held back. I know in your Grease Pencil importer addon there is a place to check that, but for those of us using the Lip Sync Importer addon, we can't modify it at that stage.

@aziagiles

@steveway I just downloaded the recent master branch, and thanks so much for considering the above recommendation. I believe the "Show Rest Frames after Words" function together with the "Apply Rest Frame settings on Export" function were actually it. But I think the "Show Rest Frames after Phonemes" function can be taken off, as I don't believe anyone will ever use it.

@aziagiles

@steveway Just tested the software again, and it seems the "Apply Rest Frame settings on Export" function didn't work. I think it needs fixing.

@steveway
Collaborator

Yes, it's not doing anything yet. 3dcc3b7
It's a little more difficult to add this to the export methods.
But I already have an idea how to add it without changing the code too much.

@steveway
Collaborator

Alright, I've added some functionality.
For now it only adds rest frames between words on export to MOHO.
And it will only add the rest frame if there is a free frame in between.
The whole thing was a bit more confusing than it needed to be, because it seems that MOHO starts at frame 1 while most other software starts at 0.
I'm not sure if we want to add this functionality to the other export options.
It's better to add some logic during import and keep the data unchanged.
Of course for MOHO that change makes sense, since we can't update the importer in their software.
Well, we could ask them if they want to support the new JSON format for input, and while they are at it they could add the logic to insert rest frames during import depending on user choice, like my Blender and Krita plugins do.
Maybe they will be responsive to this since Mike Clifton is back at the wheel of MOHO.
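A simplified, hypothetical sketch of that export logic (illustrative names, not the actual exporter): insert a rest frame between two words only when there is a free frame between them, and shift everything by one because MOHO counts frames from 1.

```python
# Hypothetical sketch of a MOHO switch export with rest frames between words.
MOHO_FRAME_OFFSET = 1  # MOHO starts at frame 1, most other software at 0

def moho_lines(words, rest="rest"):
    """words: list of (start_frame, end_frame, [(frame, phoneme), ...]) tuples."""
    lines = ["MohoSwitch1"]
    previous_end = None
    for start, end, phonemes in words:
        # Only insert a rest frame if there is an unused frame between two words.
        if previous_end is not None and start - previous_end > 1:
            lines.append("{} {}".format(previous_end + 1 + MOHO_FRAME_OFFSET, rest))
        for frame, phoneme in phonemes:
            lines.append("{} {}".format(frame + MOHO_FRAME_OFFSET, phoneme))
        previous_end = end
    return lines

print("\n".join(moho_lines([(0, 3, [(0, "HH"), (2, "AH")]),
                            (10, 12, [(10, "M")])])))
```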

@aziagiles

@steveway Hello Steve. I'm very happy with your last comment because it goes along with my line of thought. I believe 'Show Rest Frames after Phonemes' should be taken off and replaced with something like 'Show Rest Frames after Sentences', whereby, while the words within a sentence are being said, phonemes are held back, but when it encounters a silence in the dialogue of, say, 8 frames or more (1/3 of a second), the rest frame appears. And 'Show Rest Frames after Words' should just be the same type of phoneme breakdown as in Papagayo-NG version 1.6.3 and lower, where phonemes are not held back. Below is a test video of a lip sync exercise I did illustrating the concern.

0001-0891.mp4

@steveway
Collaborator

@aziagiles That sounds like a useful feature.
There is silence detection in Pydub; I've tried before to split the sounds into words based on that.
But it's a bit fiddly, so I haven't gotten a good result yet.
I'll have to do some testing to see how we can combine the information we have from Allosaurus and Pydub into something usable; that might take some time.
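A minimal sketch of the Pydub side of that, using its silence detection to find pauses long enough for a rest frame (here at least 1/3 of a second, matching the 8-frames-at-24-fps suggestion above); how this would be combined with the Allosaurus results is still open:

```python
# Minimal sketch: find pauses of at least 1/3 second with pydub's silence
# detection and turn their start times into candidate rest-frame positions.
from pydub import AudioSegment
from pydub.silence import detect_silence

FPS = 24
sound = AudioSegment.from_file("dialogue.wav")
# "Silence" here is anything at least 16 dB below the clip's average loudness,
# lasting 333 ms or more; returned as [start_ms, end_ms] pairs.
pauses = detect_silence(sound, min_silence_len=333, silence_thresh=sound.dBFS - 16)
rest_frames = [int(round(start / 1000 * FPS)) for start, end in pauses]
print(rest_frames)
```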

@aziagiles

@steveway OK

@aziagiles

@steveway I just downloaded the current master branch of the Papagayo-NG Allosaurus GitHub version, but it failed to load. I guess there is a bug.

@steveway
Collaborator

If you downloaded from my fork then it's best to use the master branch. That is the most complete one.

@aziagiles

@steveway Ok. Let me try once again.

@steveway
Collaborator

Something more related to the original topic:
I just found Vosk, for which there also seems to be some support for getting phonemes out.
alphacep/vosk-api#528
While the results from Allosaurus are very good, this might be an interesting alternative.

@aziagiles

@steveway Adding it to Papagayo-NG alongside Allosaurus and Rhubarb would be a great idea. That would be awesome.

@aziagiles

@Hunanbean Hello bro. I just downloaded the "Papagayo-NG Lipsync Importer For Blender" addon from your GitHub page and noticed it doesn't work in Blender 3.5.x. Please can it be updated?

@aziagiles

@Hunanbean I believe the problem comes as a result of modifications made to the Pose Library function in recent versions of Blender, and the addon script will thus need some retouching in order to function properly, taking this into account.

@Hunanbean

Hunanbean commented May 3, 2023

@aziagiles Howdy! Shoot. Ok, I just tested it and am experiencing the same thing. I will talk to CGPT4 about it, because I am still no programmer :) I'll see what we can do.

@aziagiles

@Hunanbean Ok. I get your point. Best wishes as you fix the issue. I believe you can do this.

@Hunanbean

Hunanbean commented May 3, 2023

@aziagiles Ok, I think I've got it worked out. I will post the fix to the Git as long as it works for you too.
To fix it real quick, replace
from bpy.props import *
with
from bpy.props import EnumProperty, FloatProperty, StringProperty, PointerProperty, IntProperty

@aziagiles

@Hunanbean Ok. will do just that and give you feedback.

@Hunanbean

> @Hunanbean Ok. will do just that and give you feedback.

In case it is easier, I've gone ahead and updated the git at https://github.com/Hunanbean/Papagayo-NGLipsyncImporterForBlender
I will try again if that does not work.

@aziagiles

@Hunanbean What should I fill in for the pose library, or should I leave it empty? The default may be to put "Current File", as that's where my mouth shapes are, but when I do, it doesn't work.

update

@Hunanbean

@aziagiles Ok, the biggest issue is, I use Shape Keys, which got fixed by the change in that line. I am not set up to even try Pose Libraries. Is there a way you can send me a generic file with a pose library I can use to test with?

@aziagiles

@Hunanbean I'm receiving error messages. I'm on a 64-bit Windows computer.
update

@Hunanbean

@aziagiles If you have some generic .blend with a configured pose library and a test model, I will keep trying to fix the code, but I do not even know how to make pose libraries yet. I am not sure if uploads can be done here, so you could send it to hunanbean.learning@gmail.com if you have such a file.

@aziagiles

> @aziagiles Ok, the biggest issue is, I use Shape Keys, which got fixed by the change in that line. I am not set up to even try Pose Libraries. Is there a way you can send me a generic file with a pose library I can use to test with?

Ok. Let me prepare a file and send it. I use Grease Pencil, and my phonemes are controlled by a bone.

@aziagiles

> hunanbean.learning@gmail.com

I've just replied to your email, with the blend file attached.

@Hunanbean

> I've just replied to your email, with the blend file attached.

Ok, I just got it, thanks. I will see what I can do. This is going to take some time though, so I probably will not have much until tomorrow or the day after, as they are working on the power here and it is scheduled to be down for several hours :/

@aziagiles

> I've just replied to your email, with the blend file attached.
>
> Ok, I just got it, thanks. I will see what I can do. This is going to take some time though, so I probably will not have much until tomorrow or the day after, as they are working on the power here and it is scheduled to be down for several hours :/

Ok. Best of luck.

@aziagiles

@Hunanbean Hello Hunanbean. I just sent you two emails. The second one has another blend file attached, containing mouth shapes I made in Blender 3.5.1, for testing the lipsync addon. Please check.
