<a href="https://colab.research.google.com/github/iamr7d/AI---PHD-ASSITANT/blob/main/Madhuri_TTS_Resources_Test.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


# **Text to Speech**

This notebook demonstrates text-to-speech synthesis using the `parler-tts` model. It installs the necessary libraries, loads the model, and generates speech for English and Malayalam text.

## Installing necessary libraries
`!pip install torch transformers soundfile parler-tts` installs the core libraries for TTS.

`!pip install flash-attn --no-build-isolation` installs optimized attention kernels for faster generation.





## **Library Explanation:**
* **`torch`**: Fundamental framework for deep learning and tensor computations, used by `parler-tts`.
* **`transformers`**: Simplifies using pre-trained transformer models; used here to load the tokenizers for `parler-tts`.
* **`soundfile`**: Reads and writes audio files; used here to save the generated speech.
* **`parler-tts`**: Provides the TTS model class. The checkpoint loaded in this notebook, `ai4bharat/indic-parler-tts`, is AI4Bharat's Indic variant.
* **`flash-attn`**: Offers optimized attention mechanisms for faster speech generation.
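
After installation it can help to confirm which versions were actually picked up. The following is a minimal sketch using only the standard library; the helper name `installed_version` is hypothetical, and the package list simply mirrors the `pip` commands in this notebook.

```python
from importlib import metadata

# Distribution names as passed to pip in this notebook.
PACKAGES = ["torch", "transformers", "soundfile", "parler-tts", "flash-attn"]


def installed_version(dist_name):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


for name in PACKAGES:
    print(f"{name}: {installed_version(name) or 'not installed'}")
```

A missing entry in the printout usually means the corresponding `pip install` failed or ran in a different environment.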

In [1]:
!pip install torch transformers soundfile parler-tts
!pip install flash-attn --no-build-isolation


### Hugging Face login

In [2]:
from huggingface_hub import login
import os
# Read your Hugging Face token from the HF_TOKEN environment variable

hf_token = os.environ.get("HF_TOKEN")

# Authenticate with the Hugging Face Hub
login(token=hf_token)
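
If `HF_TOKEN` is unset, `os.environ.get` returns `None`, and `login(token=None)` typically falls back to an interactive prompt. The sketch below checks for the token up front so the situation is visible; `read_hf_token` is a hypothetical helper, not part of `huggingface_hub`.

```python
import os


def read_hf_token(env_var="HF_TOKEN"):
    """Return the token stored in env_var, or None if it is unset or blank."""
    token = os.environ.get(env_var, "").strip()
    return token or None


token = read_hf_token()
if token is None:
    print("HF_TOKEN is not set; login() would fall back to an interactive prompt.")
else:
    print("HF_TOKEN found; it can be passed to login(token=...).")
```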


## English TTS

In [5]:
import torch  # PyTorch handles tensors, GPU acceleration, and model computations.
from parler_tts import ParlerTTSForConditionalGeneration  # Parler-TTS model class for text-to-speech (TTS) synthesis.
from transformers import AutoTokenizer  # AutoTokenizer tokenizes text for input to the model.
import soundfile as sf  # Saves the generated audio as a .wav file.


  "_name_or_path": "google/flan-t5-large",
  "architectures": [
    "T5ForConditionalGeneration"
  ],
  "classifier_dropout": 0.0,
  "d_ff": 2816,
  "d_kv": 64,
  "d_model": 1024,
  "decoder_start_token_id": 0,
  "dense_act_fn": "gelu_new",
  "dropout_rate": 0.1,
  "eos_token_id": 1,
  "feed_forward_proj": "gated-gelu",
  "initializer_factor": 1.0,
  "is_encoder_decoder": true,
  "is_gated_act": true,
  "layer_norm_epsilon": 1e-06,
  "model_type": "t5",
  "n_positions": 512,
  "num_decoder_layers": 24,
  "num_heads": 16,
  "num_layers": 24,
  "output_past": true,
  "pad_token_id": 0,
  "relative_attention_max_distance": 128,
  "relative_attention_num_buckets": 32,
  "tie_word_embeddings": false,
  "transformers_version": "4.46.1",
  "use_cache": true,
  "vocab_size": 32128
}

  "_name_or_path": "ylacombe/dac_44khz",
  "architectures": [
    "DacModel"
  ],
  "codebook_dim": 8,
  "codebook_loss_weight": 1.0,
  "codebook_size": 1024,
  "commitment_loss_weight": 0.25,
  "decoder_hidden_si

#### **Set the Device (CPU or GPU)**
To ensure efficient computation, we check if a GPU (CUDA) is available and assign the device accordingly.



> Example:


✅ If a GPU is available: "cuda:0"

✅ If no GPU, it defaults to "cpu"

In [6]:
# Set device
device = "cuda:0" if torch.cuda.is_available() else "cpu"
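
The same selection rule can be written as a small function, which makes the fallback explicit. This is a sketch only: `pick_device` is a hypothetical helper, and treating a missing `torch` install as "no GPU" is an assumption made so the snippet runs anywhere.

```python
def pick_device(cuda_available, index=0):
    """Mirror the notebook's rule: first CUDA GPU if present, otherwise CPU."""
    return f"cuda:{index}" if cuda_available else "cpu"


try:
    import torch
    cuda_available = torch.cuda.is_available()
except ImportError:
    # Assumption for this sketch only: no torch means no GPU.
    cuda_available = False

device = pick_device(cuda_available)
print(f"Using device: {device}")
```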

### **Load the TTS Model and Tokenizers**
We load a pre-trained TTS model and tokenizers from AI4Bharat's Indic Parler-TTS repository.

> Example:

✅ model now holds Parler-TTS (Text-to-Speech model).

✅ tokenizer will process spoken text input (prompt).

✅ description_tokenizer will process voice style description.

In [7]:
# Load model and tokenizers
# Load the Parler-TTS model from the "ai4bharat/indic-parler-tts" repository and move it to the selected device (CPU/GPU).
model = ParlerTTSForConditionalGeneration.from_pretrained("ai4bharat/indic-parler-tts").to(device)
# Tokenizer for the text to be spoken (the prompt).
tokenizer = AutoTokenizer.from_pretrained("ai4bharat/indic-parler-tts")
# Tokenizer for the voice description; the model config names the text encoder whose tokenizer is needed.
description_tokenizer = AutoTokenizer.from_pretrained(model.config.text_encoder._name_or_path)

### **Define the Speech Prompt and Style Description**

The ***prompt is the text that will be converted into speech***,

and the ***description defines the speaking style.***

In [21]:
# Define input text and description
prompt = "Good morning! This is 70-year-old R. J Madhuri, live on our Radio Station. I'm here with the freshest beats and the hottest vibes to kickstart your day!"

In [22]:
# A description of how the speech should sound (tone, pronunciation, speed, quality, etc.).
description = (
    "A lively and energetic 70 year old Indian English female Radio Jockey with a friendly and expressive tone. "
    "The speech is engaging, moderately paced, and blends Indian and neutral English pronunciation, "
    "with a touch of warmth and enthusiasm. The intonation is dynamic, with smooth transitions between sentences, "
    "and the delivery maintains a conversational and interactive feel. The recording is of high quality, "
    "with minimal background noise and a professional studio-like clarity."
)

### **Tokenize the Prompt and Description**
The model does not understand raw text, so we convert both the prompt and the description into ***token IDs***.

> Example:

Original Text:

```
"Good morning! This is R. J Madhuri, live on our Radio Station."
```
Tokenized output (illustrative values; the actual IDs depend on the tokenizer's vocabulary):

```
[312, 45, 987, 23, 1204, 78, 92, 34, 209, 67, 889, ...]
```





✅ **description_input_ids** contains the voice description in tokenized form.

✅ **prompt_input_ids** contains the speech text in tokenized form.
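
To make the text-to-ID idea concrete, here is a toy word-level tokenizer. It is a deliberately simplified sketch: the real tokenizers loaded in this notebook use learned subword vocabularies, and `build_vocab` and `encode` are hypothetical helpers.

```python
def build_vocab(texts):
    """Assign an integer ID to every distinct lowercase word; 0 is reserved for padding."""
    vocab = {"<pad>": 0}
    for text in texts:
        for word in text.lower().split():
            vocab.setdefault(word, len(vocab))
    return vocab


def encode(text, vocab):
    """Map each word to its ID (toy scheme; real tokenizers use subword units)."""
    return [vocab[word] for word in text.lower().split()]


vocab = build_vocab(["good morning listeners", "good vibes"])
print(encode("good morning listeners", vocab))  # [1, 2, 3]
print(encode("good vibes", vocab))              # [1, 4]
```

The same word always maps to the same ID, which is what lets the model treat text as a sequence of integers.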


In [23]:
# Tokenize description and prompt
# return_tensors="pt" returns PyTorch tensors; .to(device) moves them to the selected device (CPU/GPU).
description_input_ids = description_tokenizer(description, return_tensors="pt").to(device)  # Token IDs for the voice description.
prompt_input_ids = tokenizer(prompt, return_tensors="pt").to(device)  # Token IDs for the text to be spoken.

### **Generate Speech using Parler-TTS**
Now, we pass the tokenized text to the model's **generate()** function.

✅ generate() takes in the ***description (voice style)*** + ***prompt (text to speak)*** and generates audio output.


In [24]:
# Generate speech, passing an attention mask for both the description and the prompt
generation = model.generate(
    input_ids=description_input_ids.input_ids,  # Tokenized description of the speech style.
    attention_mask=description_input_ids.attention_mask,  # Marks which description tokens are real input and which are padding.
    prompt_input_ids=prompt_input_ids.input_ids,  # Tokenized text to be spoken.
    prompt_attention_mask=prompt_input_ids.attention_mask,  # Same role as attention_mask, for the prompt tokens.
)
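
The attention masks passed to `generate()` mark real tokens with 1 and padding with 0. For a single unpadded sequence the mask is all ones; the distinction only matters once several sequences of different lengths are batched together. A minimal pure-Python sketch follows, assuming right-padding with pad ID 0 (real tokenizers may pad differently); `pad_batch` is a hypothetical helper.

```python
def pad_batch(sequences, pad_id=0):
    """Right-pad ID lists to equal length and build matching attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return input_ids, attention_mask


ids, mask = pad_batch([[5, 6, 7], [8, 9]])
print(ids)   # [[5, 6, 7], [8, 9, 0]]
print(mask)  # [[1, 1, 1], [1, 1, 0]]

# A single sequence needs no padding, so its mask is all ones.
single_ids, single_mask = pad_batch([[5, 6, 7]])
print(single_mask)  # [[1, 1, 1]]
```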

### **Convert Audio Tensor to NumPy & Save as WAV**

The model's output is a PyTorch tensor containing the generated speech. We convert it to a NumPy array and save it as a .wav file.

✅ .cpu() moves the tensor to CPU memory

✅ .numpy() converts it to NumPy format

✅ .squeeze() removes unnecessary dimensions

✅ sf.write("English_tts_out.wav", audio_arr, model.config.sampling_rate) saves the audio as a WAV file
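
To show what "saving samples as a WAV file" involves, the sketch below writes a short sine tone with Python's standard `wave` module and reads the header back. It is a stand-in for illustration, not how `sf.write` works internally; the 16 kHz rate, the tone, and the helper `write_wav_16bit` are all assumptions of this example.

```python
import math
import os
import struct
import tempfile
import wave


def write_wav_16bit(path, samples, sampling_rate):
    """Write mono float samples in [-1.0, 1.0] as a 16-bit PCM WAV file."""
    frames = bytearray()
    for s in samples:
        clipped = max(-1.0, min(1.0, s))
        frames += struct.pack("<h", int(round(clipped * 32767)))
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)        # mono
        wav.setsampwidth(2)        # 2 bytes = 16 bits per sample
        wav.setframerate(sampling_rate)
        wav.writeframes(bytes(frames))


sampling_rate = 16000  # illustrative value, not the model's rate
tone = [0.3 * math.sin(2 * math.pi * 440 * n / sampling_rate)
        for n in range(sampling_rate // 2)]  # half a second of a 440 Hz tone

out_path = os.path.join(tempfile.gettempdir(), "tone_sketch.wav")
write_wav_16bit(out_path, tone, sampling_rate)

with wave.open(out_path, "rb") as wav:
    n_frames, rate = wav.getnframes(), wav.getframerate()
print(n_frames / rate)  # duration in seconds
```

The sampling rate stored in the header is what lets a player turn the sample count back into a duration, which is why the notebook passes `model.config.sampling_rate` when saving.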

In [25]:
# Convert the generated audio to a NumPy array
# .cpu() moves the tensor to CPU memory, .numpy() converts it to a NumPy array, and .squeeze() removes extra dimensions.
audio_arr = generation.cpu().numpy().squeeze()

In [26]:
sf.write("English_tts_out.wav", audio_arr, model.config.sampling_rate)  # Saves the generated speech as a WAV file at the model's sampling rate.

In [27]:
from IPython.display import Audio

# Play the generated audio
Audio("English_tts_out.wav", autoplay=True)

## **Malayalam TTS**

In [29]:
import torch  # PyTorch handles tensors, GPU acceleration, and model computations.
from parler_tts import ParlerTTSForConditionalGeneration  # Parler-TTS model class for text-to-speech (TTS) synthesis.
from transformers import AutoTokenizer  # AutoTokenizer tokenizes text for input to the model.
import soundfile as sf  # Saves the generated audio as a .wav file.
from IPython.display import Audio  # Audio() plays the generated speech inline, without an external media player.

#### **Set the Device (CPU or GPU)**
To ensure efficient computation, we check if a GPU (CUDA) is available and assign the device accordingly.



> Example:


✅ If a GPU is available: "cuda:0"

✅ If no GPU, it defaults to "cpu"

In [30]:
# Set device
device = "cuda:0" if torch.cuda.is_available() else "cpu"

### **Load the TTS Model and Tokenizers**
We load a pre-trained TTS model and tokenizers from AI4Bharat's Indic Parler-TTS repository.

> Example:

✅ model now holds Parler-TTS (Text-to-Speech model).

✅ tokenizer will process spoken text input (prompt).

✅ description_tokenizer will process voice style description.

In [31]:
# Load model and tokenizers
model_name = "ai4bharat/indic-parler-tts"
# Load the Parler-TTS model from the repository above and move it to the selected device (CPU/GPU).
model = ParlerTTSForConditionalGeneration.from_pretrained(model_name).to(device)
# Tokenizer for the text to be spoken (the prompt).
tokenizer = AutoTokenizer.from_pretrained(model_name)
# Tokenizer for the voice description; the model config names the text encoder whose tokenizer is needed.
description_tokenizer = AutoTokenizer.from_pretrained(model.config.text_encoder._name_or_path)

### **Define the Speech Prompt and Style Description**

The ***prompt is the text that will be converted into speech***,

and the ***description defines the speaking style.***

In [54]:
# Malayalam text prompt (input sentence)
prompt = "ഇപ്പോൾ സമയം 2 മണി കഴിഞ്ഞ് 10 മിനിറ്റ്. ഇത് നിങ്ങളുടെ സ്വന്തം ആർ. ജെ മാധുരി! എല്ലാരും പെരുന്നാൾ ഒക്കെ ആയിട്ട് മടി പിടിച്ച് കോളേജിൽ വന്നതാണെന്ന് അറിയാം. എങ്ങനുണ്ടായിരിന്നു എല്ലാവരുടെയും പെരുന്നാൾ?"

In [60]:
# Description of a native Malayalam speaker with suitable characteristics
description = (
    "A lively and energetic native Malayalam female speaker, embodying the spirit of a campus RJ, bringing a vibrant and dynamic tone to every word. "
    "Her voice is warm, fresh, and brimming with enthusiasm, capturing the listener's attention with its natural expressiveness. "
    "She speaks with a steady pace and moderate pitch, maintaining an upbeat, engaging rhythm that reflects the lively atmosphere of college life. "
    "Her pronunciation is sharp and crystal clear, filled with youthful energy that makes her sound like the heart of the conversation. "
    "The recording, captured in a professional studio, boasts exceptional quality with minimal background noise, ensuring a crisp, clear, and immersive listening experience."
)
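
Adjacent string literals inside parentheses are joined with no separator, so every sentence except the last needs to end with a space. A short sketch with made-up sentences shows the difference:

```python
without_spaces = (
    "A lively female speaker."
    "Her voice is warm."
)
with_spaces = (
    "A lively female speaker. "
    "Her voice is warm."
)

print(without_spaces)  # A lively female speaker.Her voice is warm.
print(with_spaces)     # A lively female speaker. Her voice is warm.
```

Without the trailing spaces, the description tokenizer receives run-together text such as `speaker.Her`, which may tokenize differently from the intended sentences.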

### **Tokenize the Prompt and Description**
The model does not understand raw text, so we convert both the prompt and the description into ***token IDs***.

> Example:

Original Text:

```
"Good morning! This is R. J Madhuri, live on our Radio Station."
```
Tokenized output (illustrative values; the actual IDs depend on the tokenizer's vocabulary):

```
[312, 45, 987, 23, 1204, 78, 92, 34, 209, 67, 889, ...]
```





✅ **description_input_ids** contains the voice description in tokenized form.

✅ **prompt_input_ids** contains the speech text in tokenized form.


In [61]:
# Tokenize description and prompt
# return_tensors="pt" returns PyTorch tensors; .to(device) moves them to the selected device (CPU/GPU).
description_input_ids = description_tokenizer(description, return_tensors="pt").to(device)  # Token IDs for the voice description.
prompt_input_ids = tokenizer(prompt, return_tensors="pt").to(device)  # Token IDs for the text to be spoken.

### **Generate Speech using Parler-TTS**
Now, we pass the tokenized text to the model's **generate()** function.

✅ generate() takes in the ***description (voice style)*** + ***prompt (text to speak)*** and generates audio output.


In [62]:
# Generate speech, passing an attention mask for both the description and the prompt
generation = model.generate(
    input_ids=description_input_ids.input_ids,  # Tokenized description of the speech style.
    attention_mask=description_input_ids.attention_mask,  # Marks which description tokens are real input and which are padding.
    prompt_input_ids=prompt_input_ids.input_ids,  # Tokenized text to be spoken.
    prompt_attention_mask=prompt_input_ids.attention_mask,  # Same role as attention_mask, for the prompt tokens.
)

### **Convert Audio Tensor to NumPy & Save as WAV**

The model's output is a PyTorch tensor containing the generated speech. We convert it to a NumPy array and save it as a .wav file.

✅ .cpu() moves the tensor to CPU memory

✅ .numpy() converts it to NumPy format

✅ .squeeze() removes unnecessary dimensions

✅ sf.write("malayalam_tts_out.wav", audio_arr, model.config.sampling_rate) saves the audio as a WAV file
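
Once the audio is a flat array, its length and the sampling rate give the clip duration. A minimal sketch, with `duration_seconds` as a hypothetical helper and the numbers chosen purely for illustration (they are not measured from this notebook):

```python
def duration_seconds(num_samples, sampling_rate):
    """Length of a mono clip in seconds."""
    if sampling_rate <= 0:
        raise ValueError("sampling_rate must be positive")
    return num_samples / sampling_rate


print(duration_seconds(88200, 44100))  # 2.0
```

In the notebook this would correspond to `duration_seconds(len(audio_arr), model.config.sampling_rate)`, assuming `audio_arr` is one-dimensional after `.squeeze()`.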

In [63]:
# Convert the generated audio to a NumPy array
# .cpu() moves the tensor to CPU memory, .numpy() converts it to a NumPy array, and .squeeze() removes extra dimensions.
audio_arr = generation.cpu().numpy().squeeze()

In [64]:
sf.write("malayalam_tts_out.wav", audio_arr, model.config.sampling_rate)  # Saves the generated speech as a WAV file at the model's sampling rate.

In [65]:
# Play the synthesized audio
Audio("malayalam_tts_out.wav", autoplay=True)
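
The English and Malayalam sections run the same sequence of steps: tokenize, generate, convert. One possible way to avoid repeating them is a small wrapper like the sketch below. `synthesize` is a hypothetical helper written for this example, not part of `parler_tts`; it assumes the tokenizers return an `attention_mask` alongside `input_ids`, and that the model, tokenizers, and device are created as in the cells above.

```python
def synthesize(model, tokenizer, description_tokenizer, prompt, description, device="cpu"):
    """Return (audio_array, sampling_rate) for one prompt/description pair."""
    # Tokenize the voice description and the text to be spoken.
    desc = description_tokenizer(description, return_tensors="pt").to(device)
    text = tokenizer(prompt, return_tensors="pt").to(device)

    # Generate audio conditioned on the description and the prompt.
    generation = model.generate(
        input_ids=desc.input_ids,
        attention_mask=desc.attention_mask,
        prompt_input_ids=text.input_ids,
        prompt_attention_mask=text.attention_mask,
    )

    # Move to CPU, convert to NumPy, and drop singleton dimensions.
    audio = generation.cpu().numpy().squeeze()
    return audio, model.config.sampling_rate
```

With the objects from this notebook, usage would look like `audio_arr, sr = synthesize(model, tokenizer, description_tokenizer, prompt, description, device)` followed by `sf.write("malayalam_tts_out.wav", audio_arr, sr)`.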