
Coqui_TTS needs TTS updating or it will keep downloading the model. Also sounds Strange (FIX) #4723

Closed
1 task done
erew123 opened this issue Nov 24, 2023 · 46 comments
Labels
bug Something isn't working

Comments

@erew123
Contributor

erew123 commented Nov 24, 2023

Describe the bug

When Coqui_TTS starts up, it downloads the model, saying it has an update, every time it loads the model into memory, resulting in a 1.8GB download each time. (They just updated the model and TTS about 2 hours ago.)

It needs a "pip install --upgrade tts" to bump it to v0.21.1

EDIT - Manual workaround, along with the "sounding strange" fix, here: #4723 (comment) (for anyone looking for it)

Is there an existing issue for this?

  • I have searched the existing issues

Reproduction

Load the Coqui_TTS tts_models/multilingual/multi-dataset/xtts_v2 model into memory on an older version of the TTS engine, e.g. v0.20.6.

Screenshot

N/A

Logs

N/A

System Info

N/A
@erew123 erew123 added the bug Something isn't working label Nov 24, 2023
@erew123 erew123 changed the title Coqui_TTS needs updating or it will keep downloading the model. Coqui_TTS needs TTS updating or it will keep downloading the model. Nov 24, 2023
@HolgerBr65

I have the same problem. Is there any solution?

@erew123
Contributor Author

erew123 commented Nov 24, 2023

EDIT - Manual workaround, along with the "sounding strange" fix, here: #4723 (comment)

Run the CMD_yourOS file in the text-gen-webUI folder to put you into your Python environment.

Then run pip install --upgrade tts (Once you have done this, it will only download the new model once)

BUT..... the new model https://huggingface.co/coqui/XTTS-v2/tree/main which it downloads, isn't sounding right.

A couple of us are talking about it here coqui-ai/TTS#3301 (comment)

I think I'll be updating TTS (as above) but downloading the old model, and probably blocking Python from going out on my firewall (for now) to stop it updating to the new model.

The old model is here: https://huggingface.co/coqui/XTTS-v2/tree/v2.0.2

@HolgerBr65

What a mess. I will follow your idea and see how it turns out. I hope someone can fix this. Just running it all with internet turned off or blocking python through the firewall seems messy to me as well. Thank you anyway for your help here

@erew123
Contributor Author

erew123 commented Nov 24, 2023

Just had time to look at this again. The manual workaround is:

  1. Open a command prompt/terminal
  2. Navigate to your text-generation-webui folder
  3. Run the correct CMD file for your OS, to start the python environment up https://github.com/oobabooga/text-generation-webui#running-commands
  4. Run pip install --upgrade tts

The update to TTS will probably get captured by the main update https://github.com/oobabooga/text-generation-webui#getting-updates when the Coqui_TTS extension requirements file is updated, but for now, you can do the above.

The next time the Coqui_TTS extension is loaded, it will probably download the model one last time (though it may not); let it complete if it does.

Though, quite a few people seem to think the 2.0.3 model sounds strange, so to drop back to the 2.0.2 model:

  1. Download the 2.0.2 model.pth and vocab.json from the 2.0.2 model https://huggingface.co/coqui/XTTS-v2/tree/v2.0.2
  2. Find where the tts_models--multilingual--multi-dataset--xtts_v2 folder is on your computer. On Windows this is:
    C:\Users\YOUR-USER-ACCOUNT\AppData\Local\tts\tts_models--multilingual--multi-dataset--xtts_v2
    The Linux location is probably:
    /home/${USER}/.local/share/tts/tts_models--multilingual--multi-dataset--xtts_v2
  3. Copy those two files over the top of the files in there. Do not change, delete, or touch any other files in there!

Then it starts to generate audio like the previous version did (without the strange accent) and it doesn't demand you download the model each time! :)
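For anyone who wants to script the rollback, here is a minimal Python sketch of the steps above. The Hugging Face "resolve" download URLs and the cache locations are assumptions based on the paths quoted in this thread, so check them before running.

```python
# Sketch of the manual rollback: fetch the 2.0.2 model.pth and vocab.json and
# drop them over the cached model files. URL pattern and cache locations are
# assumptions based on the paths quoted in this thread.
import os
import platform
import urllib.request

REPO = "https://huggingface.co/coqui/XTTS-v2/resolve"
FILES = ("model.pth", "vocab.json")

def model_urls(tag="v2.0.2"):
    """Direct-download URLs for the files we need from a given model tag."""
    return [f"{REPO}/{tag}/{name}" for name in FILES]

def cache_dir():
    """Where the TTS library caches the xtts_v2 model (per this thread)."""
    folder = "tts_models--multilingual--multi-dataset--xtts_v2"
    if platform.system() == "Windows":
        return os.path.join(os.environ["LOCALAPPDATA"], "tts", folder)
    return os.path.join(os.path.expanduser("~"), ".local", "share", "tts", folder)

if __name__ == "__main__":
    # Overwrite only model.pth and vocab.json; leave every other file alone.
    for url, name in zip(model_urls(), FILES):
        urllib.request.urlretrieve(url, os.path.join(cache_dir(), name))
```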

@erew123 erew123 changed the title Coqui_TTS needs TTS updating or it will keep downloading the model. Coqui_TTS needs TTS updating or it will keep downloading the model. Also sounds Strange (FIX) Nov 24, 2023
@erew123
Contributor Author

erew123 commented Nov 25, 2023

For the Devs of the Coqui_TTS Extension

After a bit of discussion on the Coqui_TTS forum, it looks like you can use --model_path and --config_path to run any model you choose or download manually.

My discussion here coqui-ai/TTS#3301 (reply in thread)

Since people don't like the sound of the new 2.0.3 model, it may therefore be best if the Coqui_TTS extension downloads the 2.0.2 model on first run (checking for its existence on subsequent runs) and then uses model_path and config_path to load the model.

This would mean we could version-control future model releases, I guess. The 2.0.2 model is hosted here: https://huggingface.co/coqui/XTTS-v2/tree/v2.0.2

So the below would need to be model_path and config_path.

"model_name": "tts_models/multilingual/multi-dataset/xtts_v2",

and

def load_model():
    model = TTS(params["model_name"]).to(params["device"])
    return model
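As a rough illustration (not the extension's actual code), the change might look something like this. The "local_model_dir" key is hypothetical, and the model_path/config_path keyword arguments are assumed to mirror the CLI flags mentioned above:

```python
# Sketch of the suggested change: prefer a locally pinned 2.0.2 model via
# model_path/config_path, falling back to model_name (the current behaviour).
# The "local_model_dir" key and folder layout are hypothetical.
import os

def resolve_tts_args(params):
    """Build keyword arguments for the TTS() constructor."""
    model_dir = params.get("local_model_dir")
    if model_dir and os.path.isfile(os.path.join(model_dir, "config.json")):
        return {
            "model_path": model_dir,
            "config_path": os.path.join(model_dir, "config.json"),
        }
    return {"model_name": params["model_name"]}

# def load_model():
#     model = TTS(**resolve_tts_args(params)).to(params["device"])
#     return model
```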

Please also see my suggested updated script.py for low VRAM handing #4712 (comment)

@q5sys
Contributor

q5sys commented Nov 25, 2023

6. (not sure on the Linux location)

/home/${USER}/.local/share/tts/tts_models--multilingual--multi-dataset--xtts_v2
(at least on RHEL based distros)

@erew123
Contributor Author

erew123 commented Dec 2, 2023

Closing this ticket off. I assume anyone who searches will find it in the closed section.

@erew123 erew123 closed this as completed Dec 2, 2023
@testter21

testter21 commented Dec 7, 2023

Any idea how to set this in XTTS in the Pinokio browser (macOS)? Apparently XTTS will not start without an internet connection, and it will always update to the most recent model, which is wrong.

@erew123
Contributor Author

erew123 commented Dec 8, 2023

@testter21 you can download the updated version that I built if you like. I've tested it for you, and as long as you have run it with an internet connection once, it then works offline with the "API Local" and "XTTSv2 Local" methods. "API TTS" may well need an internet connection, so don't click on that one if offline!

I've not yet managed to separate it off into its own simple download, so you would have to follow these instructions.

You would need to download these files into your existing /extensions/coqui_tts/ folder:
config.json
modeldownload.json
modeldownload.py
script.py
requirements.txt
tts_server.py

from here: https://github.com/erew123/text-generation-webui/tree/main/extensions/coqui_tts

Then create a subfolder called templates, e.g. /extensions/coqui_tts/templates, and download generate_form.html into it.

from here: https://github.com/erew123/text-generation-webui/tree/main/extensions/coqui_tts/templates

To download each file individually, click on each file one by one, then click the "download raw file" button:
[screenshot]
This version will store a separate model file from the TTS Python service, but that does mean it WILL download a file on first start-up. It may well tell you to "pip install --upgrade tts" at your command prompt. I've made this version verbose at the command prompt for things like that.
[screenshot]

There is a full manual and settings page included in this one. You will see the link on the settings interface!
[screenshots]

@testter21

testter21 commented Dec 8, 2023

@erew123, quick question.

With oobabooga, as I'm not a programmer, I followed these instructions:
https://www.youtube.com/watch?v=lZkQUOpLg6g
adjusting for macOS, as this fellow refers to Windows.

Theoretically all went well until at some point (6'35" on the video) I ended up with an error:

...
Closing server running on port: 7860
2023-12-08 10:37:38 INFO:Loading the extension "gallery"...
2023-12-08 10:37:38 INFO:Loading the extension "coqui_tts"...
[XTTS] Loading XTTS...
tts_models/multilingual/multi-dataset/xtts_v2 is already downloaded.
Using model: xtts
2023-12-08 10:37:56 ERROR:Failed to load the extension "coqui_tts".
...

Thus, the TTS GUI part is not being included. Will your suggestion fix that problem too?

As for the Pinokio browser, the XTTS folder/data structure is different there.

@erew123
Contributor Author

erew123 commented Dec 8, 2023

I've only ever seen one other person with that issue! Which, strangely, I responded to here: #4718

From your error message it's not too clear why it didn't load on your system. Perhaps TTS isn't properly installed.

If you download the one I suggested, then once you have downloaded it, go into the extensions/coqui_tts folder at the command prompt/terminal (on a Mac, I guess the terminal) and type

pip install -r requirements.txt

that should ensure that all the necessary packages are installed... in theory!

Beyond that, when the version I sent you starts up, it performs a few checks and should warn you if there is something missing/wrong (please see the screenshot in my last post and the "warning" message it gave there). So it's more likely to tell you what to do, if something else needs doing.

@testter21

The person was not me. I'm guessing very few people attempt to install this stuff, as it usually requires either some programming skills or the ability to navigate file/folder structures, plus some logic.

I did not post more info on this error, because I need to reproduce the steps again to see where something could go wrong. Technically, oobabooga says that it makes an isolated install (thus, dependencies are stored in its own folder), but during the installation steps I indeed had some, well... not sure if this was an error or an ambiguity issue, but it said something related to macwhisper (now, I don't know if this is included in oobabooga as scripts, but I do have macwhisper installed as an app; if the installers interfered with an external app, then this is not an isolated install as stated on the webpage).

As for now, I'm a bit lost with all this, but I will follow your steps and see where it leads to.

@testter21

testter21 commented Dec 8, 2023

Okay, I'm reinstalling oobabooga.

So I unpacked text-generation-webui-main, started start_macos.sh, and selected the vendor (Apple M in this case). After that, during installation I get something like this:

...
Downloading werkzeug-3.0.1-py3-none-any.whl (226 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 226.7/226.7 kB 2.8 MB/s eta 0:00:00
Installing collected packages: Werkzeug, sniffio, itsdangerous, click, blinker, tiktoken, Flask, anyio, starlette, flask_cloudflared, sse-starlette
Attempting uninstall: tiktoken
Found existing installation: tiktoken 0.3.3
Uninstalling tiktoken-0.3.3:
Successfully uninstalled tiktoken-0.3.3
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
openai-whisper 20230918 requires tiktoken==0.3.3, but you have tiktoken 0.5.2 which is incompatible.
...

It's the same error I had previously, and I don't know if it's related to the fact that the TTS GUI part is not loading later.

and later:

...


  • Installing webui requirements from file: requirements_apple_silicon.txt

WARNING: Skipping torch-grammar as it is not installed.
Uninstalled torch-grammar
Requirement already satisfied:
...

and then at the end again:

...
Uninstalling starlette-0.33.0:
Successfully uninstalled starlette-0.33.0
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
openai-whisper 20230918 requires tiktoken==0.3.3, but you have tiktoken 0.5.2 which is incompatible.
Successfully installed GitPython-3.1.40
...

and at the end of course this:


  • WARNING: You haven't downloaded any model yet.
  • Once the web UI launches, head over to the "Model" tab and download one.

@erew123
Contributor Author

erew123 commented Dec 8, 2023

That's not related to the coqui_tts extension itself. It looks like you aren't using the default Python environment that text-generation-webui sets up... Not having a Mac, I've never tried it there, though I assume it to be similar to Linux.

My suggestion would be, to get yourself to a stable/known position.

  1. Download the one-click installer into a new folder https://github.com/oobabooga/text-generation-webui#one-click-installers and run start_macos.sh

This will re-download and set up the whole of text-generation-webui and build a new Python environment for it.

  2. Every time you want to run it, run start_macos.sh; that ensures it loads into the correct Python environment with all the correct requirements. If you DON'T run it this way, you will load up a different Python environment that could have any settings in it... this may even be your current problem!

If you have an issue beyond that I would report it on the "issues" page on here, as that would be a general issue with the installer for it.

@testter21

testter21 commented Dec 8, 2023

https://github.com/oobabooga/text-generation-webui#one-click-installers and run start_macos.sh

I used exactly this one and started start_macos.sh as in the description.
I re-downloaded just in case, but get exactly the same errors.

@erew123
Contributor Author

erew123 commented Dec 8, 2023

If you've done that, gone through the whole install, and the next time you try running it with start_macos.sh it's giving errors, I'd post on the issues page here: https://github.com/oobabooga/text-generation-webui/issues

Either oobabooga or someone with a Mac will take a look. You'll need to post the text output of the issue you're getting when you run start_macos.sh, as that should tell them what's wrong.

It sounds like it could be a requirements file issue, maybe, but someone who works on the core code and has access to a Mac would have to take a look at that. You can also hunt that issues page for others with the same problem (also look for closed issues).

@testter21

I reposted my notes from here.
I'll see what happens if I apply your steps on a clean install, and whether it will include the TTS GUI or not.

btw, do such installations work like Windows portable apps, i.e. can they be migrated between folders or computers?

@erew123
Contributor Author

erew123 commented Dec 8, 2023

The things in the extensions folder should move fine. As for moving the whole of Text generation webUI... umm... possibly, though you might have to run the setup_YOUROSHERE file. I've never tried, so I can only speak hypothetically.

@testter21

In essence it would be good if you could relatively freely move the whole thing around and separately back up/reuse subfolders. I guess some config file would then be needed to point to the starting path and inclusions if new stuff is added. I have no idea whether this works as I just described; I'm guessing it would be a reasonable course of action.

btw, as for pinokio browser, I just got enigmatic response on coqui discord:

Me: How to disable automatic model downloads in XTTS in the Pinokio browser (macOS)? The most recent model is buggy; I'd like to keep an earlier iteration of 2.x
Reply: tts = TTS("xtts_v2.0.2", gpu=True)
Me: In which file?
Reply: It's for if you're calling TTS programmatically. We don't have anything yet for the CLI.

whatever that means.

@erew123
Contributor Author

erew123 commented Dec 8, 2023

Updating to TTS 0.21.3 stops the continuous download occurring (hence the warning in the version I built, if you look at the screenshot, along with the command line instruction for performing the update).

My version also allows you to use both the 2.0.2 and latest (2.0.3) versions of the model simultaneously (as the 2.0.3 model sounds bad). All of that is detailed in the settings page after it's installed and up and running.

If you just want to stop it re-downloading the model all the time, you need to be in the text-generation-webui Python environment, so start text-generation-webui at a command prompt with start_macos.sh and then Ctrl+C to exit (it should have loaded the environment), then:

pip install --upgrade tts

[screenshot]

@testter21

testter21 commented Dec 8, 2023

So far.

  1. Although with some errors mentioned above, oobabooga is installed, Apple M series selected.
  2. The 6 files from coqui_tts are downloaded and placed in the correct folder (some old files are replaced, so no issue here); the template file/folder is handled too.
  3. pip install -r extensions/coqui_tts/requirements.txt handled.
  4. oobabooga started via start_macos.sh; when enabling coqui_tts in the session tab, after applying, the TTS files (models and so on) were downloaded.

and then:

...
[CoquiTTS Startup] DeepSpeed Not Detected. See https://github.com/microsoft/DeepSpeed
[CoquiTTS Model] XTTSv2 Local Loading xttsv2_2.0.2 into cpu
[CoquiTTS Startup] Warning TTS Subprocess has NOT started up yet, Will keep trying for 60 seconds maximum
[CoquiTTS Startup] Warning TTS Subprocess has NOT started up yet, Will keep trying for 60 seconds maximum
ERROR: Traceback (most recent call last):
File "/Users/xxxx/Downloads/oobabooga/text-generation-webui-main/installer_files/env/lib/python3.11/site-packages/starlette/routing.py", line 677, in lifespan
async with self.lifespan_context(app) as maybe_state:
File "/Users/xxxx/Downloads/oobabooga/text-generation-webui-main/installer_files/env/lib/python3.11/contextlib.py", line 204, in aenter
return await anext(self.gen)
^^^^^^^^^^^^^^^^^^^^^
File "/Users/xxxx/Downloads/oobabooga/text-generation-webui-main/extensions/coqui_tts/tts_server.py", line 46, in startup_shutdown
await setup()
File "/Users/xxxx/Downloads/oobabooga/text-generation-webui-main/extensions/coqui_tts/tts_server.py", line 110, in setup
model = await xtts_manual_load_model()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/xxxx/Downloads/oobabooga/text-generation-webui-main/extensions/coqui_tts/tts_server.py", line 154, in xtts_manual_load_model
model.cuda()
File "/Users/xxxx/Downloads/oobabooga/text-generation-webui-main/installer_files/env/lib/python3.11/site-packages/torch/nn/modules/module.py", line 918, in cuda
return self._apply(lambda t: t.cuda(device))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/xxxx/Downloads/oobabooga/text-generation-webui-main/installer_files/env/lib/python3.11/site-packages/torch/nn/modules/module.py", line 810, in _apply
module._apply(fn)
File "/Users/xxxx/Downloads/oobabooga/text-generation-webui-main/installer_files/env/lib/python3.11/site-packages/torch/nn/modules/module.py", line 810, in _apply
module._apply(fn)
File "/Users/xxxx/Downloads/oobabooga/text-generation-webui-main/installer_files/env/lib/python3.11/site-packages/torch/nn/modules/module.py", line 810, in _apply
module._apply(fn)
File "/Users/xxxx/Downloads/oobabooga/text-generation-webui-main/installer_files/env/lib/python3.11/site-packages/torch/nn/modules/module.py", line 833, in _apply
param_applied = fn(param)
^^^^^^^^^
File "/Users/xxxx/Downloads/oobabooga/text-generation-webui-main/installer_files/env/lib/python3.11/site-packages/torch/nn/modules/module.py", line 918, in
return self._apply(lambda t: t.cuda(device))
^^^^^^^^^^^^^^
File "/Users/xxxx/Downloads/oobabooga/text-generation-webui-main/installer_files/env/lib/python3.11/site-packages/torch/cuda/init.py", line 289, in _lazy_init
raise AssertionError("Torch not compiled with CUDA enabled")
AssertionError: Torch not compiled with CUDA enabled

ERROR: Application startup failed. Exiting.
[CoquiTTS Startup] Warning TTS Subprocess has NOT started up yet, Will keep trying for 60 seconds maximum
... [repeats multiple times]...
[CoquiTTS Startup] Startup timed out. Check the server logs for more information.
2023-12-08 13:35:51 ERROR:Failed to load the extension "coqui_tts".
Traceback (most recent call last):
File "/Users/xxxx/Downloads/oobabooga/text-generation-webui-main/modules/extensions.py", line 36, in load_extensions
exec(f"import extensions.{name}.script")
File "", line 1, in
File "/Users/xxxx/Downloads/oobabooga/text-generation-webui-main/extensions/coqui_tts/script.py", line 206, in
sys.exit(1)
SystemExit: 1

@erew123
Contributor Author

erew123 commented Dec 8, 2023

Got it! OK, it's looking for CUDA, which you won't have on a Mac, because CUDA is for an Nvidia card. It's something that I can code into a new version... though I'm still working on separating this out into its own install, so as not to impact the supplied coqui_tts extension (which will have exactly the same issue, as this version is a derivative). I'm 40-60% of the way through splitting this out into its own version... and once I have that done and tested, I'll see what I can do about the CUDA thing, though I'm guessing there's 2-3 hours of coding involved. I'll let you know when I get it done.
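For reference, a common way to sketch that kind of fallback (this is an assumption about the fix, not the actual AllTalk code) is to probe for CUDA, and optionally Apple's MPS backend, before choosing a device, instead of calling model.cuda() unconditionally:

```python
# Sketch: pick a device safely rather than assuming CUDA exists.
def pick_device():
    """Return "cuda", "mps", or "cpu" depending on what is available."""
    try:
        import torch
        if torch.cuda.is_available():
            return "cuda"
        # Apple Silicon fallback; MPS support in Coqui TTS is not guaranteed.
        if getattr(torch.backends, "mps", None) and torch.backends.mps.is_available():
            return "mps"
    except ImportError:
        pass  # no torch at all -> CPU
    return "cpu"

# model = model.to(pick_device())  # instead of model.cuda()
```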

@testter21

Ok, thanks! Let me know if anything else is needed.

btw, there is also this missing TTS GUI part bug. Is this related to oobabooga in general, or the current situation?

@erew123
Contributor Author

erew123 commented Dec 8, 2023

I can't say. Possibly because the extension didn't load.

@testter21

@erew123 - any luck?

@erew123
Contributor Author

erew123 commented Dec 10, 2023

@testter21 Not yet, no. It took me until about 2-3 hours ago to finally build a release based on the main code I was re-writing (problems along the way): https://github.com/erew123/alltalk_tts

I was literally at it (coding + writing documentation) for roughly 14 hours yesterday, so I'm taking a little mental break this evening. I have loosely had a bit of space to think my way around Mac and using system RAM only. Now that I've got the main bit out of the way, I'm going to take a little downtime, gather my thoughts, and I'll let you know.

@testter21

testter21 commented Dec 10, 2023

Ok, have a nice weekend then.

btw, this is pinokio version of xtts:
https://github.com/cocktailpeanut/xtts.pinokio
It works on Mac M-series (despite the lack of model version management and a token limit of 400), so maybe it will give some help.

@erew123
Contributor Author

erew123 commented Dec 10, 2023

@testter21 You're welcome to go give it a go https://github.com/erew123/alltalk_tts

I've done what I can to put logic into the code to ensure it won't attempt to load into an Nvidia graphics card when a system doesn't have one. Unfortunately it's not as simple as a couple of changes, due to the amount of logic already going on inside the scripts.

But, as I say, I've done what I can. I've also separated out the requirements files, so you will be installing requirements_other.txt.

All the instructions are written up on the front page of the above link.

I guess it only remains to say, I don't have a Mac so I haven't tested it on one. In theory it should work, but I can't say for certain that you won't encounter some error.

@testter21

Ok, thanks for the effort; I will check it and see where it goes.
I'm determined to have this running here, so in any case I can help with testing on Mac.

btw, what's (in this case) the relation of Coqui_TTS to Bark voice cloning? Is one built on top of the other, or are these two separate workflows?

@testter21

Okay, so this is the workflow on Mac, and the things you will see on a typical machine.
workflow.zip

It seems to work, nice job there. In the preview box, when I type something in my language, text generation is present (initially I had the wrong output device selected).

And now the big question: how can I use it as a regular TTS generator with downloadable output, without bothering about chat? (So far, I don't see any multilingual local models that would be suitable for reasonable data processing.)

How will it handle longer chunks of text? This is a recent observation. In the Pinokio browser, I noticed that there is a token limit (I don't know what it refers to, phonemes or words). I'm not sure whether text there is processed 'sentence by sentence' or 'as is'. Technically, if the TTS itself has an input token limit, sentences could be split (primarily by full stops, secondarily by commas) in preprocessing.

@testter21

Hm... So basically it would have to be something like a dummy that sends user input to the output window (text passthrough).

@erew123
Contributor Author

erew123 commented Dec 10, 2023

Thanks for the documents. That at least confirms to me it does what it's supposed to do! It looks like you may have a decent speed on generating voice samples too, despite being on a CPU/RAM setup.

Your first question: Coqui, Bark and XTTS
Coqui are an open-source company/foundation making text-to-speech engines and that kind of thing https://coqui.ai/about. They manage the core TTS code and various different models/methods for generating text to speech. Other companies and people can submit code or models to them, hence you will see other people's/companies' names flying around all over their site.

Bark is one of their models/methods for voice cloning https://tts.readthedocs.io/en/latest/models/bark.html which, I believe, requires a larger voice sample and "training" on those voice samples, so it has a higher memory requirement and can't just clone voices without creating a trained RVC file ahead of time. It may, however, be able to generate more accurate-sounding results (unsure).

XTTSv2 is another model/method for voice cloning https://tts.readthedocs.io/en/latest/models/xtts.html that doesn't require training ahead of time; it just requires a 6-12 second voice sample and generates on the fly.

So effectively, they are just two different ways/methods of doing voice cloning, each with their pros and cons. If you scroll down the left index on the links above, under the "TTS" section, you will see various other models they manage, e.g. Tortoise.

AllTalk is working with the XTTSv2 method.

Generating TTS content.
Now that you have the TTS engine installed within your text-generation-webui Python environment, as long as you are loaded into that Python environment you can actually just use TTS in your terminal window https://tts.readthedocs.io/en/latest/inference.html. Below is an example command line (the bits in bold you would not change):

tts --text "this is my text I want to hear" --model_name "tts_models/multilingual/multi-dataset/xtts_v2" --out_path output.wav --speaker_wav voices/female_01.wav --language_idx en

The problem with that method is that it has to load the model into memory each time you run it, so that adds 15 seconds onto each voice generation.

The other way you could do it is to run my script, which will load a model into memory and keep it there. It can be controlled via curl commands (from a web page, if you wanted to make one, or from the command line). I have included the format of the curl commands in the documentation built into AllTalk, so have a look there for the commands to use! (Pretty sure curl will already be installed on a Mac; if not, it's a tiny thing.)

Technically speaking, if you wanted to, you could build a custom Python environment that is purely for TTS, with no need to have text-generation-webui and its requirements installed, and either use the command-line TTS as I showed OR run my script from its folder ("python script.py") and it would load into memory (though I have a change/update to make before it will run standalone, so you would need to update AllTalk again when I've made that change, if you wanted to go down that route).

Longer Sentences
There are two generation methods the XTTS and TTS engine use for generating text. I list these in my interface as "API" and "XTTSv2" (where you can select the model/method). The XTTS method requires that the model is loaded into memory at all times, and you can throw as much text at it as you like in one long paragraph. *Note below on this.

The other method, like I showed you above ("tts --text "this is my text I w......."), or "API" as I call it, is called model-to-file, and will split the generation of very long lines into individual sentences (aka processed 'sentence by sentence') before combining them into one wav file. So it may be better for longer text generation.

Long sentences/paragraphs are a problem. The reason is that when it is generating the speech, it has to look at the sample wav file/voice you provided and try to keep the generation on track so it sounds correct. The longer the generation, the harder it gets for it to keep on track, and the voice starts to waver and sound strange. So typically there is a limit, hence you may have to split long paragraphs into individual sentences so that each sentence generates nicely. I believe the "API" method will always split into sentences automatically. For the XTTSv2 method, I had to write code to clean up the text, split it into sentences, and do other such things, though that code won't be used if you are sending text to my engine via curl commands.
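As a rough sketch of what 'sentence by sentence' splitting can look like (this is an illustrative example, not the actual extension code):

```python
# A minimal sketch of sentence-by-sentence processing: split a long passage
# on sentence-ending punctuation, then hand each piece to the TTS engine
# separately before joining the audio. The regex is deliberately simple.
import re

def split_sentences(text):
    """Split on ., ! or ? followed by whitespace; keep the punctuation."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]
```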

It's a very long and complicated thing, and I'm no expert on it all, to be honest, but hopefully some of the above answers some of your questions!

@testter21

Thanks for the detailed info, it clarifies some things.
Over the following days, I will see how it works.

btw, so currently MacBook M-series GPUs are not supported by XTTS in any way?

@erew123
Contributor Author

erew123 commented Dec 12, 2023

I have no real clue, to be honest. I did a quick search and found this from a year ago: coqui-ai/TTS#2208

You may want to post the question there, in their discussions forum, as they are the people who will 100% know.

@testter21

tts --text "this is my text I want to hear" --model_name "tts_models/multilingual/multi-dataset/xtts_v2" --out_path output.wav --speaker_wav voices/female_01.wav --language_idx en
The problem is with that method, that each time you run it, it has to load the model into memory each time, so that adds 15 seconds onto each voice generation.

As of right now, I have tested this part, and it works. In the meantime, we tried to run Bark with the project owner within another repo, but for now it doesn't work, at least not on Mac M-series, so that one will have to wait.

On a side note here, it seems that text has to be pre-filtered to remove some characters (like "-"), otherwise it will not work.

The other way you could do it, is to run my script, which will load a model into memory and keep it there. It can be controlled via CURL commands (from a web page if you wanted to make one or from the command line). I have included the format of CURL commands in the documentation built into AllTalk, so have a look there for the commands to use! (pretty sure CURL will be installed already on a mac, if not, its a tiny thing).

Technically speaking, if you wanted to, you could build a custom Python environment that is purely for TTS, no need to have the Text generation webUI and its requirements installed and either use the command line TTS as I showed OR you could run my script from its folder "python script.py" and it would load into memory (though I have a change/update to make before it will run as a standalone, so you would need to update AllTalk again, when ive made that change, if you wanted to go down that route.).

I would appreciate some update so that this can be used via a GUI.

Long sentences/paragraphs are a problem. The reason being that when it is generating the speech, it has to look at the sample wav file/voice you provided it and try to keep the generation on track so it sounds correct. The longer the generation you are making, the harder it gets for it to keep on track and the voice starts to waver and sound strange. So typically there is a limit, hence you may have to split long paragraphs into individual sentences, so that each sentence generates nicely. I believe the "API" method, will always split into sentences automatically. The XTTSv2 method, I had to write code to clean up the text, split it into sentences and do other such things, though, that code wont be used if you are sending text to my engine via CURL commands.

I noticed that the token limit probably relates to single sentences, not the whole text in a paragraph. Technically, there is a way around sentences that are too long: simply prepare and modify the text first. For audiobook production, text preparation is not that big an issue, and I guess it could be automated to some degree. The first step would be to flag over-long sentences by token count. Then these can be re-edited manually (or split at commas and other separators). Long ago, when I was using TTS to make spoken content, whatever didn't sound right at the end had to be resynthesized and re-pasted. This can be easily done in apps like Reaper.

@erew123
Contributor Author

erew123 commented Dec 15, 2023

On the API calls, I am not currently performing any filtering of the text you input. So yes, clearing out special characters etc. is best done to avoid strange sounds or issues. In future I may provide a separate API call that gives you the option to push any text sent to it through the filter I use for the AI models. It's just not been a core focus at the moment.

I was discussing this with someone here, today erew123/alltalk_tts#3

I would appreciate an update so that this can be used via a GUI.

What do you mean? A text box you can type in, with a "generate audio" button to press?

Token limit

A few days ago, I added automatic sentence splitting to the "XTTSv2 Local" method. Obviously you would have to update to have that enabled. That said, I'm still not sure what the actual limit is there... I mentioned this to the person I was talking with above. You may want to check that thread for a few more details on the API (as far as I have gotten).

@erew123
Contributor Author

erew123 commented Dec 16, 2023

If it was a web page, this now exists:
[screenshot]

If it is other methods for playing, I'm working on a separate API that will have all the options, though that probably won't be done for a few days.

@testter21

testter21 commented Dec 24, 2023

My apologies for the late reply, flu season.

In the meantime I did some proof-of-concept experiments with XTTS in the Pinokio implementation, as I figured out which lines to mute in which files so that I can use the model version I want. Apparently this works.

My tests there showed the following, as I pushed a 40-minute-long lecture through it. It's possible that some of these thoughts are valid here too. (I will check your update in a few days.)

  1. I had to prepare the text manually, i.e. I had to shorten some sentences which simply broke the generation entirely. So when I saw that a sentence was very long (these lectures came from the spoken word, so...), I looked for a neighbouring comma to split it at. Automation here would probably count approximate tokens (characters?) per sentence and decide whether and where to split. Maybe a general split at commas rather than full stops is not a bad idea either.
  2. I had to split the generation into chunks (separate generations). The 40-minute-long script breaks (stops) the generation somewhere in the middle, or it's related somehow to the above. Since it's time-consuming, it was better to do that anyway. If feeding the model too many phrases is indeed problematic, then maybe the solution is to reload the model from time to time for a fresh start.
  3. From what I see, per 40-minute generation there are approx. 30 points where a phrase has to be regenerated. The model (all versions?) tends to produce trash data, like half-wordings in foreign languages or human-like huffing and puffing. On a side note, and this is good: regenerating the same phrase produces a different articulation each time (old TTS methods had fixed sounding for these).
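The manual pre-editing in step 1 could be automated roughly as below. This is a sketch under the assumption that a character count is a usable proxy for the token limit; the 250-character cap is illustrative.

```python
def split_long_sentence(sentence, max_len=250):
    """If a sentence exceeds max_len characters, split it at the comma
    nearest the midpoint (as in step 1 above), recursively if needed."""
    if len(sentence) <= max_len:
        return [sentence]
    commas = [i for i, ch in enumerate(sentence) if ch == ',']
    if not commas:
        return [sentence]  # nothing safe to split on; would need manual editing
    mid = len(sentence) // 2
    cut = min(commas, key=lambda i: abs(i - mid))  # comma closest to the middle
    left, right = sentence[:cut + 1].strip(), sentence[cut + 1:].strip()
    return split_long_sentence(left, max_len) + split_long_sentence(right, max_len)
```

Sentences with no comma at all are returned unchanged, mirroring the cases that had to be re-edited by hand.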

In other words, making audiobooks with cloned-voice text generation is possible, but still a bit time-consuming. The results are decent and satisfactory, though.

What I'm wondering about this model: are there any modifiers for controlling mood, pitch, speed, accent, etc.?

@erew123
Contributor Author

erew123 commented Dec 24, 2023

Hope you are feeling better. Flu is never good.

Speed, yes: https://docs.coqui.ai/en/dev/models/xtts.html#inference-parameters. The others, not currently, though I understand they are planned.

As for pushing other things through it, there have been another 2-3 updates since last week, and there is now a full API suite for it: https://github.com/erew123/alltalk_tts?tab=readme-ov-file#-api-suite-and-json-curl

I think you said you are on a Mac, for which (as I understand it, though I could be wrong) DeepSpeed isn't available. Which is a shame, as that would give you a 2-3x speed increase.

Typically, trash data/noises (in my experience) are mostly caused by anything that isn't basic punctuation, and even then sometimes by basic punctuation like ! ? etc. Generally, filtering these out gets better results (the new API supports some of those filtering methods, though you just can't avoid everything).

Obviously, it would be reasonably simple to have some kind of script that splits sentences by whatever method (periods etc.) and just keeps firing chunks of text at the TTS. Combining the audio afterwards is a question, but there are methods to iterate through and do things like that, especially if you number your files, making it easy to process them in the correct order. (The API supports file naming.)
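The "number your files, then merge in order" idea can be sketched with just the standard library. This is a generic illustration, not AllTalk's own merge code, and it assumes all chunks come from the same TTS engine (so they share sample rate and format); the file-name pattern is hypothetical.

```python
# Sketch: concatenate numbered wav chunks (chunk_0.wav, chunk_1.wav, ...)
# into one file, sorted by the number embedded in each file name.
import glob
import os
import re
import wave

def merge_numbered_wavs(pattern, out_path):
    # Sort by the first number in the *basename* so directory names with
    # digits in them do not affect the ordering.
    files = sorted(glob.glob(pattern),
                   key=lambda f: int(re.search(r'(\d+)', os.path.basename(f)).group(1)))
    with wave.open(out_path, 'wb') as out:
        for i, f in enumerate(files):
            with wave.open(f, 'rb') as chunk:
                if i == 0:
                    # Copy channel count, sample width and rate from the first chunk.
                    out.setparams(chunk.getparams())
                out.writeframes(chunk.readframes(chunk.getnframes()))

# Example usage:
# merge_numbered_wavs("chunk_*.wav", "audiobook.wav")
```

Since all chunks share a format, this is a plain frame concatenation; anything fancier (crossfades, silence padding between sentences) would be a job for a tool like Reaper.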

I also did a vocabulary update at some point that further improves the quality of the speech output, but I can't recall precisely when that was done. Some time in the last 10 days.

You may also find that finetuning a voice improves the quality of the output; however, you would need a system with an Nvidia card for that.

@testter21

Yes, I'm accessing this on Mac ARM.

As for trash data, a simple experiment narrows this problem down. Just paste the same short sentence multiple times. At some point you will notice strange audio artifacts, in this case in foreign languages, from what I saw. So this issue is outside punctuation.

As for finetuning, I don't see any practical reason for it, at least not at this point. Adding more training data may (or may not?) make the TTS speech sound more similar to the person's voice, but the same is true for giving XTTS different short voice samples of the same person. So it's a matter of finding samples that translate well to the cloning. On the other hand, the currently trained model is, in some cases, better than the original person in terms of articulation. So it's a good trade-off, I think.

As for audiobook creation, I did a proof of concept to see how this can be made with some visual content, and how much time it takes. Audio can be adjusted in Reaper, a great tool for this kind of work. Spoken text has to be turned into subtitles. The easiest way seems to be MacWhisper, which re-transcribes the spoken text and creates so-so synced subtitles. Subtitles can be re-processed in DaVinci Resolve (this seems to be the only app that handles subtitle editing for audiobooks well), which is also a great video editor. So this is a manageable and relatively pleasant workflow, although it takes a few hours in total per 30 minutes of material (if you want quality output).

@erew123
Contributor Author

erew123 commented Jan 6, 2024

@testter21 I think this is what you are after...

[screenshot]

It's almost finished.

@q5sys
Contributor

q5sys commented Jan 6, 2024

Long sentences/paragraphs are a problem. The reason being that when it is generating the speech, it has to look at the sample wav file/voice you provided it and try to keep the generation on track so it sounds correct. The longer the generation you are making, the harder it gets for it to keep on track and the voice starts to waver and sound strange.

Is this in any way related to why the first few seconds always sound bad? I run into that all the time: the first 2 seconds or so sound like the voice is drunk, then it gets really clean for a while before getting drunk again at the end of long sentences. When using the 'demo' option in oobabooga, I've gotten to the point of padding what I need audio for with about 3-4 words at the beginning and end, and just clipping them off in Audacity later.

I'll definitely be checking out your work on alltalk_tts. That looks like a very nice, simplified way to get the audio I need.

@erew123
Contributor Author

erew123 commented Jan 6, 2024

It can be multiple things that cause bad audio. Certain characters slipping through, e.g. three dots ... (causes it to make a kind of oohwowowhhh sound). So I've done a lot in AllTalk to try to filter things like this.

Your audio sample can also cause some of these effects. If you have AllTalk and you try "female_01.wav", the model was finetuned on that, so you will notice it's very unlikely to produce any strange sounds with it (as long as you don't have ... etc. slipping through). So good-quality samples are important.

AllTalk includes the option to finetune models if you want.

Other than that, you do just get the occasional bad sound, and you aren't meant to ask the AI to produce lines longer than 250 characters at a time. In AllTalk I have enabled sentence splitting, but you can still get the odd thing slipping through. This is why I've made a long-form generator that will split things into shorter productions that you can merge together at the end... though I've not published it yet. It will be part of AllTalk soon.

@q5sys
Contributor

q5sys commented Jan 6, 2024

@erew123 Thanks for the response. I'll be following your work on AllTalk.

@testter21

@erew123, sorry for the late reply, but the beginning of the year is always a busy time.

In the screenshot, what does "chunk sizes" refer to? (Because it's not a sentence or paragraph, from what I see in this example.)

Btw, optional (checkbox) splitting of text by "sentences" (dot, question mark, exclamation mark, semicolon, colon) seems like a good idea too. This way, the source text can also be indexed the same way in a completely different workflow (like translation setups, which follow sentence-by-sentence routines well), and if the numbering is the same, then there would be a match between the two. I said optional because I'm not sure whether XTTS would handle this as well as it does longer chunks (timbral and flow continuity).

So far, for a test, I made two "visual audiobooks" using XTTS ( link1 ), and the major time-consuming task is regenerating faulty sentences. So listening to sentences one by one, with one-click regeneration of the wrong ones, seems like a good idea. Then, I guess, it would be nice if the playback list were configurable (checkbox single/continuous) to continue from the last clicked segment.

Also, and this would be helpful for making visual audiobooks: it might be useful to export the text in srt format, synced with the audio segment lengths. I'm not yet sure if I'm correct with this one, but synced editing (shifting of synced audio and text blocks) in DaVinci Resolve would do the rest. Re-splitting of long text sentences from the srt source can be done in DaVinci, as this has to be synced manually anyway (DaVinci doesn't transcribe in many languages, and MacWhisper does a so-so job, not that great).
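The srt-export idea above is mechanically simple if each generated segment's duration is known: subtitle timing just follows the cumulative audio length. Here is a minimal sketch (not an existing AllTalk feature); the `segments` input shape is an assumption.

```python
# Sketch: emit SubRip (.srt) subtitle blocks from (text, duration_seconds)
# pairs, with each block's timing following the cumulative audio length.
def to_srt(segments):
    """segments: list of (text, duration_seconds) pairs, in playback order."""
    def stamp(t):
        # Format seconds as the SRT timestamp HH:MM:SS,mmm
        h, rem = divmod(int(t * 1000), 3_600_000)
        m, rem = divmod(rem, 60_000)
        s, ms = divmod(rem, 1000)
        return f"{h:02}:{m:02}:{s:02},{ms:03}"
    blocks, t = [], 0.0
    for i, (text, dur) in enumerate(segments, start=1):
        blocks.append(f"{i}\n{stamp(t)} --> {stamp(t + dur)}\n{text}\n")
        t += dur
    return "\n".join(blocks)

srt = to_srt([("Hello.", 1.5), ("World.", 2.0)])
```

Since each subtitle block maps one-to-one to an audio segment, shifting a segment in an editor like DaVinci Resolve keeps text and audio aligned.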

Now let's see the updates in your project.

@erew123
Contributor Author

erew123 commented Jan 15, 2024

@testter21 "Chunk sizes" refers to sentences. It is not split at colons, as that would break how the AI approaches pronouncing the TTS. Splitting by anything other than the standard punctuation that ends a whole sentence will cause the AI to deviate from its pronunciation, hence it only splits into whole sentences (1, 2, 3 etc.).

Obviously you can choose where to play back from:
[screenshot]
I can look at adding a checkbox to stop it resetting the playback location at some point.

As for this: "Also, and it would be helpful for making visual audiobooks - it might be useful to export text in srt format, synced accordingly with audio segments lengths. [...] macwhisper does so-so job, not that great." That sounds very complicated and not something I specifically know much about. I can make an extra export option (assuming it's not too complex, because it could be). Would you have a better example of what you mean, or some research on this?
