Explore the evolution and applications of AI speech generation, from robotic monotone to expressive human-like voices. Learn about text-to-speech technology, recent advancements, and its transformative impact across various industries.

mejbass/Voices-in-the-Machine-AI-Speech-Generation

Voices in the Machine - AI Speech Generation

From monotone march to expressive symphony, AI-powered voices whisper possibilities: audiobooks with the author's soul, stories narrated in forgotten tongues, and connections beyond the veil. This project covers everything you need to get started with Text-to-Speech AI, exploring its technical underpinnings, its recent advancements, and its diverse applications across various industries. We will examine the ethical considerations surrounding voice cloning and the future potential of this technology to reshape how we interact with information and create content.

Understanding Text To Speech

In the heart of the digital soundscape lies the fascinating technology of Text-to-Voice AI, where written words seamlessly transform into spoken expressions. While the final output may sound seamless, there's an intricate interplay of components working behind the scenes. Let's break down this technological symphony:

  • Text Pre-Processing
  • Text to Phoneme Conversion
  • Prosody Prediction
  • Speech Synthesis
  • Post Processing
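As a rough illustration of the first three stages, here is a minimal Python sketch. It is a toy, not how any particular TTS engine works: the lookup tables, the function names (`preprocess`, `to_phonemes`, `predict_prosody`), and the fall-back of spelling unknown words letter by letter are all assumptions made for this example. Speech synthesis and post-processing are left out because they need a trained model.

```python
import re

# Hypothetical lookup tables, kept tiny for illustration only.
ABBREVIATIONS = {"dr.": "doctor", "mr.": "mister", "st.": "street"}
DIGITS = ["zero", "one", "two", "three", "four",
          "five", "six", "seven", "eight", "nine"]
LEXICON = {"hello": ["HH", "AH", "L", "OW"], "world": ["W", "ER", "L", "D"]}


def preprocess(text):
    """Text pre-processing: lowercase, expand abbreviations, spell out digits."""
    words = []
    for token in text.lower().split():
        token = ABBREVIATIONS.get(token, token)
        # Replace each digit with its spoken form, padded with spaces.
        token = re.sub(r"\d", lambda m: " " + DIGITS[int(m.group())] + " ", token)
        # Keep only alphabetic words (punctuation is dropped here).
        words.extend(re.findall(r"[a-z']+", token))
    return words


def to_phonemes(words):
    """Text-to-phoneme conversion: lexicon lookup, else spell the letters."""
    return [LEXICON.get(word, list(word.upper())) for word in words]


def predict_prosody(text):
    """Prosody prediction (toy rule): questions rise, everything else falls."""
    return "rising" if text.strip().endswith("?") else "falling"


if __name__ == "__main__":
    sentence = "Hello world, Dr. Smith has 2 cats."
    words = preprocess(sentence)
    print(words)
    print(to_phonemes(words))
    print(predict_prosody(sentence))
```

Real systems replace each of these hand-written rules with learned models, but the order of the stages is the same as in the list above.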

Evolution of Text-to-Speech

For centuries, the quest to make machines speak sounded like robots stuck on repeat. From bellows and reeds to early digital squawks, text-to-speech was more sci-fi nightmare than technological marvel. But then came the AI symphony. Deep learning algorithms, trained on vast libraries of human voices, now generate speech so nuanced and expressive it rivals the spoken word. This newfound eloquence unlocks a treasure trove of possibilities: from empowering the visually impaired to narrating audiobooks with the author's touch, AI voices are shaping how we consume, create, and even grieve. As ethical frameworks guide its development, this technological symphony promises to reshape communication, amplify diverse voices, and weave a richer tapestry of human connection.

Text-to-Speech Tools

As discussed earlier, TTS technology is not new. However, with advances in AI, generated speech has become much more natural, blurring the line between recorded and synthesized speech.

There are countless tools for trying out Text-to-Speech, both open-source and commercial. Here are some of the most widely used; Bark and HierSpeech++ are open source, while PlayHT and ElevenLabs are commercial services:

  • Bark: Text-Prompted Generative Audio Model
  • PlayHT: AI Voice Generator
  • HierSpeech++: The official implementation of HierSpeech++
  • ElevenLabs: Text to Speech & AI Voice Generator

Tutorial

Text-to-Speech using PlayHT

Let's start with the easiest way to use voice cloning and TTS: PlayHT.

Visit PlayHT and create a free account. The service allows you to clone a single voice for free and generate speech from text.

image-14

PlayHT allows you to generate voices from the existing voices or clone a new voice. To use the existing voices, click on the name of the voice above the text input, and you can search and select any voice you like. They have amazing voices that you can try out to narrate blocks of text you provide.

image-16

image-17

image-18

The real fun is using your own voice or a voice you want to clone. The tool allows you to do just that. Click on "Voice Cloning" and follow the simple steps provided.

Click on "Instant" to create a clone from a "30 Sec" audio recording.

image-15

image-19

Then click on "Create New Model" and select the "PlayHT 2.0" model. Now, when you click the name of the voice as before, you will be able to select your newly cloned voice.

image-20

image-22

Then, add your text and click "Generate Speech" or hit the Play button.

image-24

image-25

Text-to-Speech using Bark

Bark is Suno's text-to-audio model that's capable of generating highly realistic speech from text. Bark goes beyond the basics, effortlessly generating natural-sounding, multilingual speech. But it doesn't stop there – it can create all sorts of audio, from music and background noise to simple sound effects. Bark even adds a human touch with nonverbal cues like laughter, sighs, and crying.

To get started, click the link to visit the Google Colab notebook.

image-13

The interface is straightforward: click the play button beside each cell (the greyed areas that contain code) to run it. You can try out various voices and languages. For the list of supported voices, check out Bark's voice prompt library.

image-26

An interesting feature of Bark is its ability to incorporate non-speech sounds such as laughter, sighs, and music (although music quality is currently limited). It recognizes the following prompt cues:

  • [laughter]
  • [laughs]
  • [sighs]
  • [music]
  • [gasps]
  • [clears throat]
  • ... for hesitations
  • ♪ for song lyrics
  • CAPITALIZATION for emphasis of a word
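If you would rather run Bark from a script than from the notebook, the sketch below shows one way to put these cues into a prompt. `build_prompt` and `speak` are hypothetical helpers written for this example. The Bark calls inside `speak` (`preload_models`, `generate_audio`, `SAMPLE_RATE`) follow the usage shown in Bark's README, but treat the voice name and output path as assumptions and check them against the version you install.

```python
def build_prompt(text, cue=None, emphasize=()):
    """Compose a Bark prompt: capitalize chosen words, append a cue like 'laughs'."""
    words = [
        w.upper() if w.lower().strip(".,!?") in emphasize else w
        for w in text.split()
    ]
    prompt = " ".join(words)
    return f"{prompt} [{cue}]" if cue else prompt


def speak(prompt, voice="v2/en_speaker_6", out_path="bark_out.wav"):
    """Generate speech with Bark and save it as a WAV file.

    Imports are done lazily so the prompt helper above works even when
    Bark and SciPy are not installed. The voice preset is an assumption.
    """
    from bark import SAMPLE_RATE, generate_audio, preload_models
    from scipy.io.wavfile import write as write_wav

    preload_models()  # downloads and caches the model weights on first use
    audio = generate_audio(prompt, history_prompt=voice)
    write_wav(out_path, SAMPLE_RATE, audio)
    return out_path


if __name__ == "__main__":
    # Only the prompt is printed here; call speak(...) to produce audio.
    print(build_prompt("That was really funny.", cue="laughs", emphasize={"really"}))
```

Keeping prompt construction separate from generation makes it easy to experiment with cues without waiting for the model each time.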

Bark has two caveats. First, although it supports voice cloning, it does not provide this feature out of the box. Second, the length of audio you can generate in one pass is limited. The two projects below address these issues:

bark-with-voice-clone

bark
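A common workaround for the length limit is to split long text into sentence-sized chunks, generate each chunk separately, and join the audio. The sketch below is a minimal version of that idea, not the method used by either project above. `split_into_chunks` and `generate_long` are hypothetical names, and the 200-character budget and quarter-second pause are assumptions you would want to tune.

```python
import re


def split_into_chunks(text, max_chars=200):
    """Group whole sentences into chunks of at most max_chars characters.

    A single sentence longer than max_chars is kept whole rather than cut.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + 1 + len(sentence) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks


def generate_long(text, voice="v2/en_speaker_6"):
    """Generate each chunk with Bark and join them with short pauses.

    Imports are lazy so the splitter works without Bark or NumPy installed.
    Assumes text is non-empty; the voice preset is an assumption.
    """
    import numpy as np
    from bark import SAMPLE_RATE, generate_audio

    pause = np.zeros(int(0.25 * SAMPLE_RATE))  # quarter second of silence
    pieces = []
    for chunk in split_into_chunks(text):
        pieces += [generate_audio(chunk, history_prompt=voice), pause.copy()]
    return np.concatenate(pieces)


if __name__ == "__main__":
    print(split_into_chunks("One. Two. Three.", max_chars=9))
```

Passing the same voice preset to every chunk helps keep the speaker consistent across the joined audio.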

Other Useful Tools

  • Adobe Podcast: Clean up the generated voices and make them even more realistic.
  • Mp3Cut: Online MP3 Cutter to cut out a piece of music.
  • Convertio: Easy tool to convert files online.

Contact

Questions? Feedback? Requests? Discord: Samej2023
