Coqui_TTS needs TTS updating or it will keep downloading the model. Also sounds Strange (FIX) #4723
Comments
I have the same problem. Is there any solution?
EDIT - Manual workaround, along with the strange-sounding-audio fix, is here: #4723 (comment). Run the CMD_yourOS file in the text-gen-webUI folder to put yourself into its Python environment, then run pip install --upgrade tts (once you have done this, it will only download the new model once). BUT the new model it downloads, https://huggingface.co/coqui/XTTS-v2/tree/main, isn't sounding right. A couple of us are talking about it here: coqui-ai/TTS#3301 (comment). I think I'll be updating TTS (as above) but downloading the old model, and probably blocking Python from going out through my firewall (for now) to stop it updating to the new model. The old model is here: https://huggingface.co/coqui/XTTS-v2/tree/v2.0.2
What a mess. I will follow your idea and see how it turns out. I hope someone can fix this. Running it all with the internet turned off, or blocking Python through the firewall, seems messy to me as well. Thank you anyway for your help here.
Just had time to look at this again. The manual workaround is:
The update to TTS will probably be captured by the main update (https://github.com/oobabooga/text-generation-webui#getting-updates) once the Coqui_TTS extension's requirements file is updated, but for now you can do the above. The next time the Coqui_TTS extension loads, it may download the model one last time (or it may not); if it does, let it complete. Quite a few people think the 2.0.3 model sounds strange, though, so to drop back to the 2.0.2 model:
Then it generates audio like the previous version did (without the strange accent), and it doesn't demand you download the model each time! :)
For the devs of the Coqui_TTS extension: after a bit of discussion on the Coqui TTS forum, it looks like you can use --model_path and --config_path to run any model you choose or download manually. My discussion is here: coqui-ai/TTS#3301 (reply in thread). If people don't like the sound of the new 2.0.3 model, it may therefore be best if the Coqui_TTS extension downloads the 2.0.2 model on first run, checks for its existence on later runs, and then loads it via model_path and config_path. That would also let us version-control future model releases, I guess. The 2.0.2 model is hosted here: https://huggingface.co/coqui/XTTS-v2/tree/v2.0.2. So the below would need to be the model_path and config_path.
and
Please also see my suggested updated script.py for low-VRAM handling: #4712 (comment)
/home/${USER}/.local/share/tts/tts_models--multilingual--multi-dataset--xtts_v2
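The download-once-then-check suggestion above could look something like this minimal Python sketch. The directory layout and required file names are assumptions based on the path shown and the Hugging Face repo contents, not the extension's actual code:

```python
from pathlib import Path

# Assumed cache location, based on the path quoted above.
MODEL_DIR = Path.home() / ".local/share/tts/tts_models--multilingual--multi-dataset--xtts_v2"

def model_files_present(model_dir: Path) -> bool:
    """True if the files XTTS needs for a local load appear to be present.
    The file list is an assumption based on the Hugging Face repo contents."""
    required = ("model.pth", "config.json", "vocab.json")
    return all((model_dir / name).is_file() for name in required)

def resolve_model_paths(model_dir: Path):
    """Return (model_path, config_path) suitable for --model_path/--config_path,
    or None if the v2.0.2 files still need downloading (once)."""
    if model_files_present(model_dir):
        return str(model_dir), str(model_dir / "config.json")
    return None  # caller downloads the v2.0.2 files from Hugging Face, once
```

On later runs the check passes, so the model is loaded from disk and never re-downloaded.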
Closing this ticket off. I assume anyone who searches will find it in the closed section.
Any idea how to set this for XTTS in Pinokio browser (macOS)? Apparently XTTS will not start without an internet connection, and it always updates to the most recent model, which is wrong.
@testter21 You can download the updated version that I built, if you like. I've tested it for you, and as long as you have run it with an internet connection once, it then works offline with the "API Local" and "XTTSv2 Local" methods. "API TTS" may well need an internet connection, so don't click that one if offline! I've not yet managed to separate it into its own simple download, so you would have to follow these instructions. Download these files into your existing /extensions/coqui_tts/ folder, from here: https://github.com/erew123/text-generation-webui/tree/main/extensions/coqui_tts. Then create a subfolder called templates, e.g. /extensions/coqui_tts/templates, and download generate_form.html into it, from here: https://github.com/erew123/text-generation-webui/tree/main/extensions/coqui_tts/templates. To download each file individually, click on each file one by one, then click the "download raw file" button. There is a full manual and settings page included in this one; you will see the link on the settings interface!
@erew123, quick question. With oobabooga, as I'm not a programmer, I followed these instructions: Theoretically all went well, until at some point (6'35" in the video) I ended up with an error: ... Thus, the TTS GUI part is not being included. Will your suggestion fix that problem too? As for Pinokio browser, the XTTS folder/data structure is different there.
I've only ever seen one other person with that issue, which strangely I responded to here: #4718. From your error message it's not too clear why it didn't load on your system; perhaps TTS isn't properly installed. If you download the version I suggested, go into the extensions/coqui_tts folder at the command prompt/terminal (on a Mac, I guess Terminal) and type pip install -r requirements.txt, that should ensure all the necessary files are installed... in theory! Beyond that, when the version I sent you starts up, it performs a few checks and should warn you if something is missing or wrong (please see the screenshot in my last post and the "warning" message it gave there). So it's more likely to tell you what to do, if something else needs doing.
The person was not me. I'm guessing very few people attempt to install this stuff, as it usually requires either some programming skills or the ability to navigate file/folder structures and some logic. I did not post more info on this error, because I need to reproduce the steps again to see where something could go wrong. Technically, oobabooga says that it makes an isolated install (i.e. dependencies are stored in its own folder), but during the installation steps I did see something, well... I'm not sure if it was an error or an ambiguity, but it said something related to MacWhisper. Now, I don't know if that is included in oobabooga as scripts, but I do have MacWhisper installed as an app; if the installers interfered with an external app, then this is not an isolated install as stated on the webpage. For now I'm a bit lost with all this, but I will follow your steps and see where it leads.
Okay, I'm reinstalling oobabooga. I unpacked text-generation-webui-main, started start_macos.sh, selected the vendor (Apple M in this case). After that, during installation, I get something like this: ... It's the same error I had previously, and I don't know if it's related to the TTS GUI part not loading later. And later: ...
WARNING: Skipping torch-grammar as it is not installed.

and then at the end again: ... and finally, of course, this:
That's not related to the Coqui_TTS extension itself. It looks like you aren't using the default Python environment that text-generation-webui sets up. Not having a Mac, I've never tried it there, though I assume it's similar to Linux. My suggestion would be to get yourself to a stable, known position.
This will re-download and set up the whole of text-generation-webui and build a new Python environment for it.
If you have an issue beyond that, I would report it on the "issues" page here, as that would be a general issue with the installer.
I used exactly that one and started start_macos.sh as in the description.
If you've done that, gone through the whole install, and the next time you run start_macos.sh it gives errors, I'd post on the issues page here: https://github.com/oobabooga/text-generation-webui/issues. Either oobabooga or someone with a Mac will take a look. You'll need to post the text output of the error you get when you run start_macos.sh, as that should tell them what's wrong. It sounds like it could be a requirements-file issue, maybe, but someone who works on the core code and has access to a Mac would have to look at that. You can also search that issues page for others with the same problem (check closed issues too).
I reposted my notes from here. By the way, do such installs work like Windows portable apps, i.e. can they be migrated between folders or computers?
The things in the extensions folder should move fine. As for moving the whole of text-generation-webui... possibly, though you might have to run the setup_YOUROSHERE file. I've never tried, so I can only speak hypothetically.
In essence, it would be good if you could move the whole thing around relatively freely and separately back up/reuse subfolders. I guess a config file would then be needed to point at the starting path, plus inclusions if new stuff is added. I have no idea whether this works as I just described; I'm guessing it would be a reasonable course of action. By the way, as for Pinokio browser, I just got an enigmatic response on the Coqui Discord:
Whatever that means.
Updating to TTS 0.21.3 stops the continuous download (hence the warning in the version I built, if you look at the screenshot, along with the command-line instruction for performing the update). My version also lets you use both the 2.0.2 and the latest (2.0.3) versions of the model simultaneously (as the 2.0.3 model sounds bad). All of that is detailed in the settings page once it's installed and running. If you just want to stop it re-downloading the model all the time, you need to be in the text-generation-webui Python environment: start text-generation-webui at a command prompt with start_macos, then Ctrl+C to exit (it should have loaded the environment), and then run: pip install --upgrade tts
So far.
and then: ... ERROR: Application startup failed. Exiting. |
Got it! OK, it's looking for CUDA, which you won't have on a Mac, because CUDA is for Nvidia cards. It's something I can code into a new version... though I'm still working on separating this out into its own install, so as not to impact the supplied Coqui_TTS extension (which will have exactly the same issue, as this version is a derivative). I'm 40-60% of the way through splitting it out into its own version, and once that's done and tested I'll see what I can do about the CUDA thing, though I'm guessing there are 2-3 hours of coding involved. I'll let you know when it's done.
OK, thanks! Let me know if anything else is needed. By the way, there is also the missing-TTS-GUI bug; is that related to oobabooga in general, or to the current situation?
I can't say. Possibly because the extension didn't load.
@erew123 - any luck?
@testter21 Not yet, no. It took me until about 2-3 hours ago to finally build a release based on the main code I was rewriting (problems along the way): https://github.com/erew123/alltalk_tts. I was literally at it (coding plus writing documentation) for roughly 14 hours yesterday, so I'm taking a little mental break this evening. I have loosely had a bit of space to think my way around Mac support using system RAM only; now that I've got the main bit out of the way, I'm going to take a little downtime, gather my thoughts, and I'll let you know.
OK, have a nice weekend then. By the way, this is the Pinokio version of XTTS:
@testter21 You're welcome to give it a go: https://github.com/erew123/alltalk_tts. I've done what I can to put logic into the code to ensure it won't attempt to load onto an Nvidia graphics card when a system doesn't have one. Unfortunately it's not as simple as a couple of changes, due to the amount of logic already going on inside the scripts, but, as I say, I've done what I can. I've also separated out the requirements files, so you will be installing requirements_other.txt. All the instructions are written up on the front page of the above link. I guess it only remains to say that I don't have a Mac, so I haven't tested it on one. In theory it should work, but I can't say for certain you won't encounter some error.
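The "don't load onto an Nvidia card that isn't there" guard described above can be sketched as a small device-selection helper. This is an illustrative Python sketch, not AllTalk's actual code, and MPS support in PyTorch varies by version:

```python
def pick_device() -> str:
    """Pick the best available compute device, falling back gracefully.
    Illustrative sketch only; the real project's logic differs."""
    try:
        import torch  # wrapped so the logic runs even without PyTorch installed
    except ImportError:
        return "cpu"
    if torch.cuda.is_available():
        return "cuda"
    mps = getattr(torch.backends, "mps", None)  # Apple Metal backend (Apple-silicon Macs)
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"
```

A model loader can then do `model.to(pick_device())` instead of hard-coding `"cuda"`, which is what trips up Macs.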
OK, thanks for the effort; I'll check it and see where it goes. By the way, what (in this case) is the relation of Coqui_TTS to Bark voice cloning? Is one built on top of the other, or are these two separate workflows?
Okay, so this is the workflow on a Mac, and what you will see on a typical machine. It seems to work; nice job there. In the preview box, when I type something in my language, text generation works (initially I had the wrong output device selected). And now the big question: how can I use it as a regular TTS generator with downloadable output, without bothering with chat? (So far I don't see any multilingual local models that would be suitable for reasonable data processing.) How will it handle longer chunks of text? A recent observation: in Pinokio browser I noticed there is a token limit (I don't know whether it refers to phonemes or words). I'm not sure whether text there is processed sentence by sentence or as-is. Technically, if the TTS itself has an input token limit, sentences could be split in preprocessing (primarily by periods, secondarily by commas).
Hm... So basically it would have to be something like a dummy that sends user input to the output window (a text passthrough).
Thanks for the documents. That at least confirms to me it does what it's supposed to do! It looks like you may have decent speed generating voice samples too, despite being on a CPU/RAM setup.

Your first question: Coqui, Bark and XTTS. Bark is one of their models/methods for voice cloning (https://tts.readthedocs.io/en/latest/models/bark.html), which I believe requires a larger voice sample and "training" on those samples, so it has a higher memory requirement and can't clone voices without creating a trained RVC file ahead of time. It may, however, be able to generate more accurate-sounding results (unsure). XTTSv2 is another model/method for voice cloning (https://tts.readthedocs.io/en/latest/models/xtts.html) that doesn't require training ahead of time; it just needs a 6-12 second voice sample and generates on the fly. So effectively they are two different methods of voice cloning, each with pros and cons. If you scroll down the left index on the links above, under the "TTS" section, you will see the various other models they manage, e.g. Tortoise. AllTalk works with the XTTSv2 method.

Generating TTS content. You can use the command line:

tts --text "this is my text I want to hear" --model_name "tts_models/multilingual/multi-dataset/xtts_v2" --out_path output.wav --speaker_wav voices/female_01.wav --language_idx en

The problem with that method is that each time you run it, it has to load the model into memory, which adds about 15 seconds to each voice generation. The other way is to run my script, which loads the model into memory and keeps it there. It can be controlled via curl commands (from a web page, if you wanted to make one, or from the command line). I have included the format of the curl commands in the documentation built into AllTalk, so have a look there for the commands to use! (curl is almost certainly installed already on a Mac; if not, it's a tiny thing.)
Technically speaking, if you wanted to, you could build a custom Python environment purely for TTS, with no need to have text-generation-webui and its requirements installed, and either use the command-line tts as I showed, OR run my script from its folder ("python script.py") so it loads into memory (though I have a change to make before it will run standalone, so you would need to update AllTalk again once I've made that change, if you want to go down that route).

Longer sentences. The method I showed you above ("tts --text ...", or "API" as I call it) is called model-to-file; it will split generation of very long lines into individual sentences (i.e. processed sentence by sentence) before combining them into one wav file, so it may be better for longer text generation. Long sentences/paragraphs are a problem: when generating speech, the model has to look at the sample wav/voice you provided and try to keep the generation on track so it sounds correct. The longer the generation, the harder it is to keep on track, and the voice starts to waver and sound strange. So there is typically a limit, and you may have to split long paragraphs into individual sentences so each one generates nicely. I believe the "API" method always splits into sentences automatically. For the XTTSv2 method, I had to write code to clean up the text, split it into sentences and do other such things, though that code won't be used if you send text to my engine via curl commands. It's a long and complicated topic and I'm no expert on all of it, to be honest, but hopefully some of the above answers your questions!
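The sentence-by-sentence processing described above can be approximated with a naive splitter like this Python sketch (real TTS pipelines use smarter tokenizers that handle abbreviations, decimals, and so on):

```python
import re

def split_sentences(text: str) -> list[str]:
    """Split on ., ! or ? followed by whitespace. Deliberately naive:
    it will mis-split abbreviations such as 'e.g. this'."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]
```

Each returned sentence can then be sent to the TTS engine separately and the resulting wav files joined afterwards.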
Thanks for the detailed info; it clarifies some things. By the way, so currently MacBook M-series GPUs are not supported by XTTS in any way?
I have no real clue, to be honest. I did a quick search and found this from a year ago: coqui-ai/TTS#2208. You may want to post the question there, in their discussions forum, as they are the people who will know for certain.
As of right now, I tested this part and it works. In the meantime, we tried to run Bark with the project owner within another repo, but for now it doesn't work, at least not on Mac M-series, so that one will have to wait. On a side note, it seems that text has to be pre-filtered of some characters (like "-"), otherwise it will not work.
I would appreciate an update so that this can be used via the GUI.
I noticed that the token limit probably relates to single sentences, not the whole text of a paragraph. Technically, there is a way around over-long sentences: simply prepare and modify the text first. For audiobook production, text preparation is not a big issue, and I guess it could be automated to some degree. The first step would be to flag over-long sentences by token count; then these can be re-edited manually (or split on commas and other separators). Long ago, when I was using TTS to make spoken content, anything that didn't sound right at the end had to be resynthesized and re-pasted. This is easily done in apps like Reaper.
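Flagging over-long sentences by a rough token count, as suggested above, could be sketched like this. Whitespace word count is used here as a crude proxy for model tokens (an assumption; XTTS's real limit is in tokens, not words), and the default threshold is an arbitrary illustration:

```python
def flag_long_sentences(sentences, max_words=40):
    """Return (index, sentence) pairs whose rough size exceeds the limit,
    so they can be re-edited by hand or re-split on commas."""
    return [(i, s) for i, s in enumerate(sentences) if len(s.split()) > max_words]
```

The flagged indices point back into the original sentence list, which makes manual re-editing straightforward.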
On the API calls, I am not currently performing any filtering of the text you input. So yes, clearing out special characters etc. is best done to avoid strange sounds or issues. I may in future provide a separate API call that gives you the option to push any text sent to it through the filter I use for the AI models; it's just not been a core focus at the moment. I was discussing this with someone here today: erew123/alltalk_tts#3.

"I would appreciate some update, so that this can be used via gui." What do you mean? A text box you can type in, with a "generate audio" button?

Token limit: a few days ago I added automatic sentence splitting to the "XTTSv2 Local" method. Obviously you would have to update to have that enabled. Though, saying that, I'm still not sure what the actual limit is; I mentioned this to the person I was talking with above. You may want to check that thread for a few more details on the API (as far as I have gotten).
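A minimal version of the kind of character filtering discussed here might look like the following Python sketch. The allow-list is a guess at a reasonable filter, not AllTalk's actual one:

```python
import re

def clean_for_tts(text: str) -> str:
    """Keep letters, digits, whitespace and sentence punctuation; collapse
    ellipses to a single period and normalise whitespace."""
    text = re.sub(r"\.{2,}", ".", text)       # "..." -> "." (a known artifact source)
    text = re.sub(r"[^\w\s.,!?']", "", text)  # drop hyphens, asterisks, etc.
    return re.sub(r"\s+", " ", text).strip()  # squeeze runs of whitespace
```

Running input through something like this before the API call removes the characters most often blamed for audio artifacts in the thread.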
My apologies for the late reply; flu season. In the meantime I did some proof-of-concept experiments with XTTS in the Pinokio implementation, as I figured out which lines to mute in which files so that I can use the model version I want. Apparently this works. My tests there showed the following, as I pushed a 40-minute-long lecture through it. It's possible that some of these thoughts are valid here too. (I will check your update in a few days.)
In other words, making audiobooks with cloned-voice text generation is possible, but still a bit time-consuming. The results are decent and satisfactory, though. What I'm wondering about this model: are there any modifiers for mood, pitch, speed, accent, etc.?
I hope you are feeling better; flu is never good. Speed, yes: https://docs.coqui.ai/en/dev/models/xtts.html#inference-parameters. The others, not currently, though I understand they are planned. As for pushing other things through it, there have been another 2-3 updates since last week, and there is now a full API suite for it: https://github.com/erew123/alltalk_tts?tab=readme-ov-file#-api-suite-and-json-curl. I think you said you are on a Mac, where (as I understand it, though I could be wrong) DeepSpeed isn't available, which is a shame, as that would give you a 2-3x speed increase. Typically, trash data/noises are (in my experience) mostly caused by anything that isn't basic punctuation, and sometimes even by basic punctuation like ! and ?. Generally, filtering these out gets better results (the new API supports some of those filtering methods, though you just can't avoid everything). Obviously, it would be reasonably simple to have a script that splits sentences by whatever method (periods, etc.) and keeps firing chunks of text at the TTS. Combining the output afterwards is a question; there are methods to iterate through and do that, especially if you number your files so they merge in the correct order (the API supports file naming). I also did a vocabulary update at some point that further improves the quality of the speech output, though I can't recall precisely when; some time in the last 10 days. You may also find that finetuning a voice improves output quality, but you would need a system with an Nvidia card for that.
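Numbering the generated chunks so they merge back in the correct order, as mentioned above, is straightforward with zero-padded file names. The naming scheme here is just an illustration, not the API's actual convention:

```python
def chunk_filename(index: int, stem: str = "chapter") -> str:
    # zero padding keeps lexical order identical to numeric order
    return f"{stem}_{index:04d}.wav"

def ordered_chunks(names: list[str]) -> list[str]:
    # plain sorted() is enough thanks to the zero padding
    return sorted(names)
```

Without the padding, "chapter_10.wav" would sort before "chapter_2.wav" and the merged audio would be out of order.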
Yes, I'm accessing this on Mac ARM. As for trash data, a simple experiment narrows the problem down: just paste the same short sentence multiple times. At some point you will notice strange audio artifacts, in this case in foreign languages from what I saw, so this issue exists outside punctuation too. As for finetuning, I don't see any practical reason for it, at least not at this point. Adding more training data may (or may not?) make the TTS sound more similar to a person's voice, but the same is true of giving XTTS different short voice samples of the same person; it's a matter of finding samples that translate well to cloning. On the other hand, the currently trained model is in some cases better than the original person in terms of articulation, so it's a good trade-off, I think. As for audiobook creation, I did a proof of concept to see how this can be made with some visual content, and how much time it takes. Audio can be adjusted in Reaper, a great tool for this kind of work. Spoken text has to be turned into subtitles; the easiest way seems to be MacWhisper, which re-transcribes the spoken text and creates roughly synced subtitles. Subtitles can be re-processed in DaVinci Resolve (which seems to be the only app that handles subtitle editing well for audiobooks, and is also a great video editor). So this is a manageable and relatively pleasant workflow, although it takes a few hours in total per 30 minutes of material (if you want quality).
@testter21 I think this is what you are after... It's almost finished.
Is this in any way related to why the first few seconds always sound bad? I run into that all the time: the first two seconds or so sound like the voice is drunk, then it gets really clean for a while before getting "drunk" again at the end of long sentences. When using the "demo" option in oobabooga, I've gotten to the point of padding what I need audio for with about 3-4 words at the beginning and end, and just clipping them off in Audacity later. I'll definitely be checking out your work on alltalk_tts; that looks like a very nice, simplified way to get the audio I need.
Multiple things can cause bad audio. Certain characters slipping through, e.g. three dots "...", cause it to make an "oohwowowhhh" kind of sound, so I've done a lot in AllTalk to try to filter things like that. Your audio sample can also cause some of these effects: if you have AllTalk and you try "female_01.wav", the model was finetuned on that sample, so it's very unlikely to produce strange sounds with it (as long as you don't have "..." etc. slipping through). So good-quality samples are important, and AllTalk includes the option to finetune models if you want. Other than that, you do just get the occasional bad sound, and you aren't meant to ask the AI to produce lines longer than 250 characters at a time. In AllTalk I have enabled sentence splitting, but the odd thing can still slip through. This is why I've made a long-form generator that splits things into shorter productions you can merge together at the end... though I've not published it yet. It will be part of AllTalk soon.
@erew123 Thanks for the response. I'll be following your work on AllTalk.
@erew123, sorry for the late reply, but the beginning of the year is always a busy time. In the photo, what are the "chunk sizes" referring to? (Because it's not a sentence or paragraph, from what I see in this example.) By the way, optional (checkbox) splitting of text into "sentences" (period, question mark, exclamation mark, semicolon, colon) seems like a good idea too. That way, the source text can be indexed separately in a completely different workflow (like translation setups, which follow sentence-by-sentence routines well), and if the numbering is the same, the two would match. I said optional because I'm not sure XTTS would handle this as well as longer chunks (timbral and flow continuity). So far, for testing, I made two "visual audiobooks" using XTTS (link1), and the major time-consuming task is regenerating faulty sentences. So listening to sentences one by one, with one-click regeneration of the wrong ones, seems like a good idea. Then, I guess, it would be nice if the playback list was configurable (checkbox single/continuous) to continue from the last clicked segment. Also, and this would be helpful for making visual audiobooks, it might be useful to export text in SRT format, synced to the lengths of the audio segments. I'm not yet sure if I'm right about this, but synced editing (shifting of synced audio and text blocks) in DaVinci Resolve would do the rest. Re-splitting long sentences from the SRT source can be done in DaVinci, as this has to be synced manually anyway (DaVinci doesn't transcribe in many languages, and MacWhisper does a so-so job, not that great). Now let's see the updates in your project.
@testter21 "Chunk sizes" refers to sentences. It is not split on colons, as that would break how the AI pronounces the TTS: splitting on anything other than standard punctuation that ends an entire sentence causes the AI to deviate from its pronunciation, hence it only splits on entire sentences (1, 2, 3, etc.). Obviously you can choose where to play back from. As for this: "Also, and it would be helpful for making visual audiobooks - it might be useful to export text in srt format, synced accordingly with audio segment lengths... synced editing (shifting of synced audio and text blocks) in DaVinci Resolve would do the rest" - that sounds very complicated, and not something I specifically know about. I can make an extra export option (assuming it's not too complex, because it could be). Would you have a better example of what you mean, or some research on this?
Describe the bug
When Coqui_TTS starts up, it downloads the model, saying there is an update, every time it loads the model into memory, resulting in a 1.8GB download each time. (They updated the model and TTS about two hours ago.)
It needs a "pip install --upgrade tts" to bump it to v0.21.1
EDIT - Manual workaround here along with sounding strange FIX #4723 (comment) (for anyone looking for it)
Is there an existing issue for this?
Reproduction
Load the Coqui_TTS tts_models/multilingual/multi-dataset/xtts_v2 model into memory on an older version of the TTS engine, e.g. v0.20.6.
Screenshot
N/A
Logs
System Info