**Text-to-speech (TTS)** is a technology that converts written text into spoken words. This task involves generating natural-sounding speech from text input, allowing computers to “read” text aloud.

TTS is a one-to-many problem, which sets it apart from many other tasks. In classification tasks, there is typically only one correct label, or sometimes a few. In automatic speech recognition (ASR), a single correct transcription corresponds to a given utterance.

In TTS, however, there are countless ways to articulate the same sentence, with variations in voices, dialects, and speaking styles. Despite these challenges, some open-source models excel at this task. We will use two of them: the VITS pre-trained model from Kakao Enterprise to convert English text into speech, as well as the speecht5_tts_clartts_ar model from MBZUAI to convert Arabic text into speech.

In [1]:
!pip install timm -q
!pip install inflect -q
!pip install phonemizer -q
!pip install gtts -q


In [3]:
!pip install transformers -q
!pip install -U datasets -q

In [5]:
from transformers import pipeline
from datasets import load_dataset
from IPython.display import Audio as IPythonAudio
import soundfile as sf
import torch

# English Text to Speech using the VITS Model

Variational Inference with adversarial learning for end-to-end Text-to-Speech (VITS) is an end-to-end speech synthesis model that predicts a speech waveform conditional on an input text sequence. It is a conditional variational autoencoder (VAE) comprised of a posterior encoder, decoder, and conditional prior.

A set of spectrogram-based acoustic features is predicted by the flow-based module, which is formed of a Transformer-based text encoder and multiple coupling layers.

The spectrogram is decoded using a stack of transposed convolutional layers, much in the same style as the HiFi-GAN vocoder. Motivated by the one-to-many nature of the TTS problem, where the same text input can be spoken in multiple ways, the model also includes a stochastic duration predictor, which allows the model to synthesize speech with different rhythms from the same input text.
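
One way to hear the effect of the stochastic duration predictor is to synthesize the same sentence under different random seeds. The sketch below assumes the checkpoint is available on the Hugging Face Hub as `kakao-enterprise/vits-ljs` and relies on `VitsModel`, `VitsTokenizer`, and `set_seed` from transformers. The helper name `synthesize_with_seed` is ours and is introduced only for illustration.

```python
def synthesize_with_seed(text, seed, model_id="kakao-enterprise/vits-ljs"):
    """Return (waveform, sampling_rate) for `text`, sampled under `seed`."""
    # Imported inside the function so that defining the helper does not
    # load PyTorch or download the checkpoint.
    import torch
    from transformers import VitsModel, VitsTokenizer, set_seed

    tokenizer = VitsTokenizer.from_pretrained(model_id)
    model = VitsModel.from_pretrained(model_id)
    inputs = tokenizer(text=text, return_tensors="pt")

    # Fix the random noise used during generation, including the noise
    # drawn by the stochastic duration predictor.
    set_seed(seed)
    with torch.no_grad():
        waveform = model(**inputs).waveform[0]
    return waveform.numpy(), model.config.sampling_rate
```

Calling the helper twice with different seeds should typically give waveforms of slightly different lengths (different rhythms), while repeating the same seed should reproduce the same audio.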

To use the VITS model to convert text to speech, we will utilize the Hugging Face pipeline to perform text-to-speech (TTS) using a specific model stored locally (./models/kakao-enterprise/vits-ljs).

The text provided, which discusses the Israeli occupation of Palestine, is passed to the narrator pipeline. The pipeline converts the text into speech, generating audio that narrates the provided text.

The result, stored in the narrated_text variable, contains the audio data produced by the model. This allows the text to be listened to as spoken words, which makes the information accessible in audio form.
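
The pipeline call described above can be sketched as follows. This is a minimal sketch under stated assumptions: it loads the checkpoint from the Hub id `kakao-enterprise/vits-ljs` instead of the local path, and the helper names `build_narrator` and `narrate` are hypothetical. The optional `narrator` argument exists only so the helper can be tried with a stand-in callable before downloading the model.

```python
VITS_MODEL_ID = "kakao-enterprise/vits-ljs"  # assumed Hub id; a local path also works


def build_narrator(model_id=VITS_MODEL_ID):
    """Create a Hugging Face text-to-speech pipeline for the VITS checkpoint."""
    # Imported lazily: loading transformers and the model is the slow part.
    from transformers import pipeline

    return pipeline("text-to-speech", model=model_id)


def narrate(text, narrator=None):
    """Collapse line breaks in `text` and pass it to the TTS pipeline.

    Returns the pipeline output, a dict with "audio" and "sampling_rate".
    """
    if narrator is None:
        narrator = build_narrator()
    cleaned = " ".join(text.split())  # triple-quoted strings carry newlines
    narrated_text = narrator(cleaned)
    return narrated_text
```

In the notebook, `narrated_text = narrate(text)` followed by `IPythonAudio(narrated_text["audio"][0], rate=narrated_text["sampling_rate"])` would play the result. Depending on the transformers version, the audio array may already be one-dimensional, in which case the `[0]` index should be dropped.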

In [12]:
from gtts import gTTS
from IPython.display import Audio

text = """
The Israeli occupation of Palestine began in 1967
during the Six-Day War when Israel captured the West Bank,
Gaza Strip, and East Jerusalem.
These areas, home to many Palestinians, have since been a
focal point of conflict. The international community generally views Israeli settlements there as illegal.
Efforts towards peace continue, with Palestinians seeking
independence and Israelis seeking security.
The situation remains highly complex and contentious.
"""

# Convert text to speech
sound = gTTS(text)  # gTTS defaults to English (lang="en")
sound.save("text.mp3")  # Save as an MP3 file

# Play the audio
Audio("text.mp3", autoplay=True)


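**gTTS**: gTTS stands for Google Text-to-Speech. It converts a given text into a spoken audio file by calling Google Translate's text-to-speech API, so it needs an internet connection. Note that the cell above produces the English narration with gTTS rather than with the VITS pipeline described earlier.

**text**: This is the string you want to convert into speech. For example, if text = "Hello, world!", the library will generate audio saying "Hello, world!".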

# Arabic Text to Speech using ArTST

ArTST is a pre-trained Arabic text and speech transformer that supports open-source speech technologies for the Arabic language. The model architecture in this first edition follows the unified-modal framework SpeechT5, which was recently released for English. ArTST focuses on Modern Standard Arabic (MSA), with plans to extend the model to dialectal and code-switched Arabic in future editions.

The model is pre-trained from scratch on MSA speech and text data and fine-tuned for the following tasks: automatic speech recognition (ASR), TTS, and spoken dialect identification. The SpeechT5 for Arabic TTS checkpoint starts from the ArTST pre-trained weights and is fine-tuned on the Classical Arabic ClArTTS dataset using the Hugging Face implementation of SpeechT5. To convert text to speech with this model, we will use the Hugging Face pipeline for the text-to-speech task with the model MBZUAI/speecht5_tts_clartts_ar.

We will also load speaker embeddings from a dataset (herwoww/arabic_xvector_embeddings) and select a particular embedding to simulate a specific speaker's voice.

The selected text, which describes the Israeli occupation of Palestine, is converted to speech using this embedding. The generated speech audio is then saved to a file called "speech.wav" with the sampling rate reported by the pipeline. ArTST was trained on text without diacritics, so undiacritized input is the closest match to its training data.

https://huggingface.co/MBZUAI/speecht5_tts_clartts_ar
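
Because ArTST is trained without diacritics, it may help to strip them from the input before synthesis. The helper below is a minimal sketch using only the standard library. The function name `strip_arabic_diacritics` and the exact set of code points removed (the marks U+064B to U+0652 plus the superscript alef U+0670) are our assumptions, not part of the model's documented preprocessing.

```python
import re

# Arabic vowel marks and related signs (fathatan .. sukun) plus the
# superscript alef. Base letters are left untouched.
_ARABIC_DIACRITICS = re.compile("[\u064B-\u0652\u0670]")


def strip_arabic_diacritics(text):
    """Return `text` with Arabic diacritic marks removed."""
    return _ARABIC_DIACRITICS.sub("", text)


# A word from the sample text that carries a fathatan:
clean_word = strip_arabic_diacritics("محورًا")  # "محورا"
```

Applying the helper to the full `text` string before calling the pipeline keeps the input closer to what the model saw during training.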

In [13]:
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-to-speech", model="MBZUAI/speecht5_tts_clartts_ar")


In [30]:
synthesiser = pipeline("text-to-speech", "MBZUAI/speecht5_tts_clartts_ar")

embeddings_dataset = load_dataset("herwoww/arabic_xvector_embeddings", split="validation")
speaker_embedding = torch.tensor(embeddings_dataset[170]["speaker_embeddings"]).unsqueeze(0)
# You can replace this embedding with your own as well.
text = """
بدأ الاحتلال الإسرائيلي لفلسطين في عام 1967 خلال حرب الأيام الستة عندما
احتلت إسرائيل الضفة الغربية وقطاع غزة والقدس الشرقية. أصبحت هذه المناطق، التي يعيش فيها العديد من الفلسطينيين، محورًا للصراع منذ ذلك الحين.
يرى المجتمع الدولي عمومًا أن المستوطنات الإسرائيلية هناك غير قانونية.
تستمر الجهود نحو السلام، حيث يسعى الفلسطينيون إلى الاستقلال ويسعى الإسرائيليون إلى الأمن.
لا تزال القضية معقدة للغاية ومثيرة للجدل.
"""
speech = synthesiser(text, forward_params={"speaker_embeddings": speaker_embedding})
# ArTST is trained without diacritics.

sf.write("speech.wav", speech["audio"], samplerate=speech["sampling_rate"])



**speaker_embedding**: This line extracts a speaker embedding from the dataset (here, the entry at index 170). The unsqueeze(0) call adds a batch dimension to the embedding tensor, which the model requires. You can replace this embedding with your own for a customized speaker voice.

**Speaker embedding**: In TTS systems, a speaker embedding represents the unique voice characteristics of a specific speaker. It is a vector that encodes features of a speaker’s voice, such as tone, pitch, accent, and speaking style. By providing a specific speaker embedding to a TTS model, you can generate speech in that speaker’s voice.

Here we specify an Arabic speaker embedding from the dataset, which determines the tone and other voice characteristics of the generated speech.

**sf.write**: This function from the soundfile library is used to save the generated audio to a file.
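
After saving, a quick sanity check is to read the file back and confirm its sampling rate and duration. The sketch below uses Python's built-in `wave` module and assumes `speech.wav` was written as standard PCM WAV, which, as far as we know, is what soundfile produces by default for `.wav` files. The helper name `wav_info` is hypothetical.

```python
import wave


def wav_info(path):
    """Return (sampling_rate_hz, duration_seconds) of a PCM WAV file."""
    with wave.open(path, "rb") as wav_file:
        rate = wav_file.getframerate()
        frames = wav_file.getnframes()
    return rate, frames / rate
```

For example, `wav_info("speech.wav")` returns the rate stored in the file header together with the clip length in seconds, which should match `speech["sampling_rate"]` and the length of `speech["audio"]` divided by that rate.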


In [31]:
# Play the audio
Audio("speech.wav", autoplay=True)

In [None]:
# The Arabic output is not very good, but it is what we have for now. We hope to be able to deploy it.