Feature Request - Speech / Phonetics automatic generation / alignment #49
I was thinking about that too, and started to investigate. This is what I found - https://cmusphinx.github.io/wiki/phonemerecognition/
For now, I think integrating with RhubarbLipSync (#44) is the way to go.
We've got the Rhubarb feature merged just now - #50 ^__^
Montreal Forced Aligner may be something to look into, but that would be more for automatic alignment from text rather than the full shebang.
As mentioned, we currently have Rhubarb integrated.
Here is a very simple conversion dict from IPA to CMU:
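(The dict itself did not survive in this thread. Below is an illustrative sketch of what such an IPA-to-CMU/ARPABET mapping might look like; the entries shown are standard correspondences, but the names and the fallback behavior are assumptions, not the actual Papagayo-NG code.)

```python
# Illustrative sketch only: a handful of IPA-to-CMU (ARPABET) phoneme mappings.
# The dict name and the 'rest' fallback are hypothetical, not project code.
IPA_TO_CMU = {
    "i": "IY",   # as in "beet"
    "ɪ": "IH",   # as in "bit"
    "eɪ": "EY",  # as in "bait"
    "ɛ": "EH",   # as in "bet"
    "æ": "AE",   # as in "bat"
    "ɑ": "AA",   # as in "father"
    "oʊ": "OW",  # as in "boat"
    "u": "UW",   # as in "boot"
    "p": "P",
    "b": "B",
    "t": "T",
    "d": "D",
    "s": "S",
    "z": "Z",
    "ʃ": "SH",   # as in "ship"
    "tʃ": "CH",  # as in "chip"
}

def ipa_to_cmu(phoneme):
    """Map one IPA phoneme to CMU, falling back to 'rest' when unknown."""
    return IPA_TO_CMU.get(phoneme, "rest")
```

A real table would need to cover all 39 CMU phonemes, plus decisions about diacritics and affricates that Allosaurus can emit.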
I must have underestimated what Rhubarb actually does. I will take a look at it now.
Yes, Rhubarb is quite nice; it would be awesome if it could also output text besides phonemes.
I now have an allosaurus branch: https://github.com/steveway/papagayo-ng/tree/allosaurus |
That is very cool! Thank you.
I just ditched Windows and went back to Linux last night, so it is going to take me a little while before I can test it. Seems like now would be a good time to start making some noise on the forums. That looks like a Patreon magnet to me!
…On Wed, Jun 2, 2021 at 1:03 AM Stefan Murawski ***@***.***> wrote:
I now have an allosaurus branch:
https://github.com/steveway/papagayo-ng/tree/allosaurus
This currently uses pydub to prepare the sound files for allosaurus.
This works very well with our Tutorial Files, even the Spanish ones.
The results seem to be better than what Rhubarb provides.
Here is a quick test showing the result of running it on the lame.wav file:
https://youtu.be/4hqHaEXo9xU
The phonemes are partially overlapping, so some pruning needs to be done
for animation purposes.
But as you can see the results are quite good.
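The quoted comment notes that Allosaurus emits partially overlapping phoneme intervals that need pruning before they are usable for animation. One simple approach, shown here only as a sketch under assumed names and data layout (this is not Papagayo-NG's actual code), is to clip each interval at the start of the next one:

```python
def prune_overlaps(phonemes):
    """Clip overlapping phoneme intervals so each one ends no later than
    the start of the next. `phonemes` is a list of (start, end, label)
    tuples in seconds, sorted by start time. Tuple layout is assumed."""
    pruned = []
    for i, (start, end, label) in enumerate(phonemes):
        if i + 1 < len(phonemes):
            next_start = phonemes[i + 1][0]
            end = min(end, next_start)
        if end > start:  # drop intervals squeezed to zero length
            pruned.append((start, end, label))
    return pruned

raw = [(0.00, 0.30, "HH"), (0.25, 0.50, "AH"), (0.45, 0.80, "L")]
print(prune_overlaps(raw))
# → [(0.0, 0.25, 'HH'), (0.25, 0.45, 'AH'), (0.45, 0.8, 'L')]
```

Other policies are possible (e.g. splitting the overlap at its midpoint, or keeping whichever phoneme has the higher confidence); clipping to the next start is just the least invasive option.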
Alright, I made some more progress. I think the direct next step would be to add a GUI option to change the selected phonemes/words/phrases to belong to a different voice. And in the future some automatic speaker diarization would be awesome; that way we could automate almost everything.
Ok, it's now available.
It is working and very impressive! Thank you very much!
@steveway Everything is working except the 'Convert Phonemes' function, which doesn't really work well. I think it still needs some fixing. I'm on a Windows 10 platform.
@steveway Also, for the default 'cmu_39' automatic breakdown, the breakdown goes well for all the earlier parts of the input audio, but it doesn't break down the end of the audio.
Yes, the conversion needs some work, as I mentioned:
Can you show which files it does not break down all the way?
Yes, there appears to be a problem with it truncating the last ~0.5 seconds of the audio file. I will see if it works if I add some empty time at the end of the audio file.
@steveway Ok. I made a video of the 2 worries I had: the conversion from CMU39 to Preston Blair not working properly, as a lot of missing phonemes are reported, and the last phrase or words in the audio not being broken down. phoneme.conversion.mp4
Ok, after I added 1 second of silence to the end of the audio file, it now picks up the last phrase. Edit: I was mistaken, it still truncates the end. It was just the audio lining up to the end, not the actual conversion. The last word/words are still truncated.
@Hunanbean I just added a second of silence at the end of my audio, and after the breakdown it ended exactly where my audio ended, but unfortunately it still did not pick up the last phrase.
Hmm. Perhaps it just cannot recognize the last phrase, or more silence needs to be added? Edit: My mistake. You are correct. It is still truncating the end. It was just the audio now finishing at the correct spot.
Yes, verified that any file I try, regardless of added silence at the end, does truncate the last phrase.
I think I found the cause.
With the files I added one second of silence to, this now picks up that last phrase. However, the problem still remains on the same files without the added silence. Thank you
Allosaurus is likely not picking up that part at all.
Here is an example. The jenna file says the same words and is recognized without silence added. The salli file says the same words, but the last portion is only recognized with silence added. For the salli voice, both versions, with and without silence, are included. Thank you
I now add half a second of silence at the end for Allosaurus; if you then increase the emission to about 1.4, it recognizes your file to the end.
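The thread mentions the project uses pydub for audio preparation; the half-second padding described in the fix can also be sketched with only the standard library, as below. The function name is hypothetical and this is not the actual Papagayo-NG implementation, just an illustration of the padding step (the "emission" setting appears to correspond to Allosaurus's `emit` argument, which scales how eagerly phonemes are emitted).

```python
import wave

def append_silence(in_path, out_path, seconds=0.5):
    """Append `seconds` of digital silence to a PCM WAV file.
    Stdlib-only sketch; the project itself prepares audio with pydub."""
    with wave.open(in_path, "rb") as src:
        params = src.getparams()
        frames = src.readframes(src.getnframes())
    # Silence in PCM is zero-valued samples; size = frames * width * channels.
    pad_frames = int(params.framerate * seconds)
    silence = b"\x00" * (pad_frames * params.sampwidth * params.nchannels)
    with wave.open(out_path, "wb") as dst:
        dst.setparams(params)
        dst.writeframes(frames + silence)
```

Note this assumes uncompressed PCM; compressed WAV variants would need a real audio library like pydub.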
That is working well. Thank you |
@steveway Thank you very much for the prompt fix. Hope you're having a nice day. I have a last recommendation for the Papagayo-NG software. It's still related to Holding back phonemes. I was wondering if the code could be modified in such a way that, a user can choose whether the phonemes should be held or not when exporting the file in .dat format or other formats. In the animation I'm currently working on, there are many cases I'll need them not to be held back. I know in your Grease Pencil importer addon, there is a place to check that, but for us using the Lip Sync Importer addon, we can't modify at that stage. |
@steveway Just downloaded the recent master branch and thanks so much for considering the above recommendation. I believe the ''Show Rest Frames after Words'' function together with the ''Apply Rest Frame settings on Export'' function were actually it. But I think the ''Show Rest Frames after Phonemes'' function can be taken off as I don't believe anyone will ever use it. |
@steveway Just tested the software again, and it seems like the ''Apply Rest Frame settings on Export'' function didn't work. I think it needs fixing. |
Yes, it's not yet doing anything. 3dcc3b7 |
Alright, I've added some functionality. |
@steveway Hello Steve. I'm very happy with your last comment because it goes alongside my line of thought. I believe the 'Show Rest Frames after phonemes' should be taken off, and replace with something like 'Show Rest Frames after Sentences' whereby, when the words within a sentence are being said, phonemes are held back, but when it encounters a silence within the dialogue of say greater or equal to 8 frames (1/3 of a second), the Rest frame appears. And 'Show Rest Frames after Words' should just be the same type of phoneme breakdown as in Papagayo-NG version 1.6.3 and lower where phonemes are not held back. Below, is a test video of a lip sync exercise I did illustrating the concern. 0001-0891.mp4 |
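The suggested "Show Rest Frames after Sentences" behavior could be sketched as follows. All names, the frame-based timing layout, and the 24 fps assumption behind the 8-frame threshold are illustrative, not Papagayo-NG code:

```python
REST_GAP_FRAMES = 8  # 1/3 of a second at 24 fps, per the suggestion above

def insert_rest_frames(words, rest_label="rest", gap=REST_GAP_FRAMES):
    """Insert a rest frame only where the silence between consecutive
    words is at least `gap` frames. `words` is a list of
    (start_frame, end_frame, word) tuples, sorted by start frame."""
    out = []
    for i, (start, end, word) in enumerate(words):
        out.append((start, end, word))
        if i + 1 < len(words):
            next_start = words[i + 1][0]
            if next_start - end >= gap:
                out.append((end, next_start, rest_label))
    return out

words = [(0, 10, "hello"), (12, 20, "there"), (40, 50, "friend")]
print(insert_rest_frames(words))
# The 20-frame gap before "friend" gets a rest; the 2-frame gap
# after "hello" does not, so phonemes stay held there.
```

"Show Rest Frames after Words" then simply corresponds to running the same pass with `gap=1`.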
@aziagiles That sounds like a useful feature.
@steveway OK
@steveway I just downloaded the current master branch of the Papagayo-NG Allosaurus GitHub version, but it failed to load. I guess there is a bug.
If you downloaded from my fork, then it's best to use the master branch. That is the most complete one.
@steveway Ok. Let me try once again.
Something more related to the original topic.
@steveway Adding it in Papagayo-NG alongside Allosaurus and Rhubarb will be a great idea. That will be awesome.
@Hunanbean Hello bro. I just downloaded the "Papagayo-NG Lipsync Importer For Blender" addon from your GitHub page and noticed it doesn't work in Blender 3.5.x. Please can it be updated?
@Hunanbean I believe the problem comes as a result of modifications made to the Pose Library function in recent versions of Blender, and the addon script will thus need some retouching in order to function properly, taking this into account.
@aziagiles Howdy! Shoot. Ok, I just tested it and am experiencing the same thing. I will talk to CGPT4 about it, because I am still no programmer :) I'll see what we can do.
@Hunanbean Ok. I get your point. Best wishes as you fix the issue. I believe you can do this.
@aziagiles Ok, I think I've got it worked out. I will post the fix to the Git as long as it works for you too.
@Hunanbean Ok. Will do just that and give you feedback.
In case it is easier, I've gone ahead and updated the git at https://github.com/Hunanbean/Papagayo-NGLipsyncImporterForBlender
@Hunanbean What should I fill in the space of the pose library, or should I leave it empty? The default may be to put "Current File", as that's where my mouth shapes are, but when I do, it doesn't work.
@aziagiles Ok, the biggest issue is that I use Shape Keys, which got fixed by the change in that line. I am not set up to even try Pose Libraries. Is there a way you can send me a generic file with a pose library I can use to test with?
@Hunanbean Receiving error messages. I'm on a Windows 64-bit computer.
@aziagiles If you have some generic .blend with a configured pose library and a test model, I will keep trying to fix the code, but I do not even know how to make pose libraries yet. I am not sure if uploads can be done here, so you could send it to hunanbean.learning@gmail.com if you have such a file.
Ok. Let me prepare a file and send it. I use grease pencil, and my phonemes are controlled by a bone.
I've just replied to your email, with the blend file attached.
Ok, I just got it, thanks. I will see what I can do. This is going to take some time though, so I probably will not have much until tomorrow or the day after, as they are working on the power here and it is scheduled to be down for several hours :/
Ok. Best of luck.
@Hunanbean Hello Hunanbean. I just sent you 2 emails. The second one has another blend file attached, containing mouth shapes I did in Blender 3.51 for testing of the lipsync addon. Please check.
I've been wondering why the text file had to be used. Couldn't you separate the sound via phonetics?
This would be better, as it would help translate things more accurately than the text alone.
Take the following example:
This could be said as:
vs. someone using a different pronunciation:
Both of these end up using different mouth movements, and because of this some of the mouth movements can be off.