### This notebook requires a GPU runtime to run.
### Please select the menu option "Runtime" -> "Change runtime type", select "Hardware Accelerator" -> "GPU" and click "SAVE"
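
To confirm the runtime change took effect before loading any models, a quick check along these lines can help. This is a minimal sketch: `gpu_runtime_available` is a hypothetical helper name, and the fallback to looking for `nvidia-smi` on the PATH is an assumption about how the driver is exposed in the runtime.

```python
import shutil


def gpu_runtime_available():
    """Return True if a CUDA device looks usable from this runtime."""
    try:
        import torch  # the notebook needs torch anyway; imported lazily here
        return bool(torch.cuda.is_available())
    except ImportError:
        # Assumption: without torch, the presence of the NVIDIA driver tool
        # on PATH is a reasonable hint that a GPU is attached.
        return shutil.which("nvidia-smi") is not None


has_gpu = gpu_runtime_available()
print("GPU runtime detected" if has_gpu else "No GPU detected: change the runtime type first")
```

If this reports no GPU, the later `.to('cuda')` calls are expected to fail.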

----------------------------------------------------------------------

# Tacotron 2

*Author: NVIDIA*

**The Tacotron 2 model for generating mel spectrograms from text**

<img src="https://pytorch.org/assets/images/tacotron2_diagram.png" alt="alt" width="50%"/>



### Model Description

The Tacotron 2 and WaveGlow models form a text-to-speech system that lets users synthesize natural-sounding speech from raw transcripts without any additional prosody information. The Tacotron 2 model produces mel spectrograms from input text using an encoder-decoder architecture. WaveGlow (also available via torch.hub) is a flow-based model that consumes the mel spectrograms to generate speech.

This implementation of the Tacotron 2 model differs from the model described in the paper: it uses Dropout instead of Zoneout to regularize the LSTM layers.
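
To illustrate the difference between the two regularizers, here is a small pure-Python sketch. It is not taken from the NVIDIA code; `dropout` and `zoneout` are hypothetical helpers that operate on plain lists standing in for hidden-state vectors. Dropout zeroes units at random and rescales the rest, while Zoneout keeps the previous time step's value for randomly chosen units.

```python
import random


def dropout(h_new, p, rng):
    """Inverted dropout: zero each unit with probability p, rescale the others."""
    return [0.0 if rng.random() < p else v / (1.0 - p) for v in h_new]


def zoneout(h_prev, h_new, z, rng):
    """Zoneout: with probability z, a unit keeps its previous value."""
    return [prev if rng.random() < z else new for prev, new in zip(h_prev, h_new)]


rng = random.Random(0)
h_prev = [1.0, 1.0, 1.0, 1.0]  # hidden state at step t-1 (toy values)
h_new = [2.0, 2.0, 2.0, 2.0]   # candidate hidden state at step t (toy values)

dropped = dropout(h_new, 0.5, rng)
zoned = zoneout(h_prev, h_new, 0.5, rng)
print("dropout:", dropped)
print("zoneout:", zoned)
```

Both are applied only during training; at inference time the hidden state passes through unchanged.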

### Example

In the example below:
- pretrained Tacotron2 and WaveGlow models are loaded from torch.hub
- given a tensor representation of the input text, Tacotron2 generates a mel spectrogram, as shown in the illustration above
- WaveGlow generates sound from the mel spectrogram
- the output sound is saved in an 'audio.wav' file

To run the example, you need some extra Python packages installed.
These are needed for preprocessing the text and audio, as well as for display and input / output.
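
Before installing, it can be useful to see which of these packages are already present. The sketch below uses only the standard library; `missing_packages` is a hypothetical helper, and the import names are assumed to match the pip package names used in the install cell.

```python
import importlib.util

# Import names assumed to match the pip package names in the install cell.
REQUIRED = ["numpy", "scipy", "librosa", "unidecode", "inflect"]


def missing_packages(names):
    """Return the subset of `names` that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


print("Missing:", missing_packages(REQUIRED) or "none")
```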

In [None]:
%%bash
pip install numpy scipy librosa unidecode inflect
apt-get update
apt-get install -y libsndfile1


Load the Tacotron2 model pre-trained on [LJ Speech dataset](https://keithito.com/LJ-Speech-Dataset/) and prepare it for inference:

In [None]:
import torch
tacotron2 = torch.hub.load('NVIDIA/DeepLearningExamples:torchhub', 'nvidia_tacotron2', model_math='fp16')
tacotron2 = tacotron2.to('cuda')
tacotron2.eval()

Downloading: "https://github.com/NVIDIA/DeepLearningExamples/zipball/torchhub" to /root/.cache/torch/hub/torchhub.zip
Downloading checkpoint from https://api.ngc.nvidia.com/v2/models/nvidia/tacotron2_pyt_ckpt_amp/versions/19.09.0/files/nvidia_tacotron2pyt_fp16_20190427
  ckpt = torch.load(ckpt_file)


Tacotron2(
  (embedding): Embedding(148, 512)
  (encoder): Encoder(
    (convolutions): ModuleList(
      (0-2): 3 x Sequential(
        (0): ConvNorm(
          (conv): Conv1d(512, 512, kernel_size=(5,), stride=(1,), padding=(2,))
        )
        (1): BatchNorm1d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      )
    )
    (lstm): LSTM(512, 256, batch_first=True, bidirectional=True)
  )
  (decoder): Decoder(
    (prenet): Prenet(
      (layers): ModuleList(
        (0): LinearNorm(
          (linear_layer): Linear(in_features=80, out_features=256, bias=False)
        )
        (1): LinearNorm(
          (linear_layer): Linear(in_features=256, out_features=256, bias=False)
        )
      )
    )
    (attention_rnn): LSTMCell(768, 1024)
    (attention_layer): Attention(
      (query_layer): LinearNorm(
        (linear_layer): Linear(in_features=1024, out_features=128, bias=False)
      )
      (memory_layer): LinearNorm(
        (linear_layer): Linear(in_fea

Load the pretrained WaveGlow model:

In [None]:
waveglow = torch.hub.load('NVIDIA/DeepLearningExamples:torchhub', 'nvidia_waveglow', model_math='fp16')
waveglow = waveglow.remove_weightnorm(waveglow)
waveglow = waveglow.to('cuda')
waveglow.eval()

Using cache found in /root/.cache/torch/hub/NVIDIA_DeepLearningExamples_torchhub
Downloading checkpoint from https://api.ngc.nvidia.com/v2/models/nvidia/waveglow_ckpt_amp/versions/19.09.0/files/nvidia_waveglowpyt_fp16_20190427
  ckpt = torch.load(ckpt_file)
  WeightNorm.apply(module, name, dim)


WaveGlow(
  (upsample): ConvTranspose1d(80, 80, kernel_size=(1024,), stride=(256,))
  (WN): ModuleList(
    (0-3): 4 x WN(
      (in_layers): ModuleList(
        (0): Conv1d(512, 1024, kernel_size=(3,), stride=(1,), padding=(1,))
        (1): Conv1d(512, 1024, kernel_size=(3,), stride=(1,), padding=(2,), dilation=(2,))
        (2): Conv1d(512, 1024, kernel_size=(3,), stride=(1,), padding=(4,), dilation=(4,))
        (3): Conv1d(512, 1024, kernel_size=(3,), stride=(1,), padding=(8,), dilation=(8,))
        (4): Conv1d(512, 1024, kernel_size=(3,), stride=(1,), padding=(16,), dilation=(16,))
        (5): Conv1d(512, 1024, kernel_size=(3,), stride=(1,), padding=(32,), dilation=(32,))
        (6): Conv1d(512, 1024, kernel_size=(3,), stride=(1,), padding=(64,), dilation=(64,))
        (7): Conv1d(512, 1024, kernel_size=(3,), stride=(1,), padding=(128,), dilation=(128,))
      )
      (res_skip_layers): ModuleList(
        (0-6): 7 x Conv1d(512, 1024, kernel_size=(1,), stride=(1,))
        (7

Now, let's make the model say:

In [10]:
text = "Here is my React portfolio"

Format the input using the utility methods:

In [11]:
utils = torch.hub.load('NVIDIA/DeepLearningExamples:torchhub', 'nvidia_tts_utils')
sequences, lengths = utils.prepare_input_sequence([text])

Downloading: "https://github.com/NVIDIA/DeepLearningExamples/zipball/torchhub" to /root/.cache/torch/hub/torchhub.zip
  return s in _symbol_to_id and s is not '_' and s is not '~'
  return s in _symbol_to_id and s is not '_' and s is not '~'


RuntimeError: Found no NVIDIA driver on your system. Please check that you have an NVIDIA GPU and installed a driver from http://www.nvidia.com/Download/index.aspx
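
To make the input-formatting step more concrete: `prepare_input_sequence` returns integer ID sequences together with their lengths. The sketch below shows the general idea with a tiny, hypothetical symbol table; `SYMBOLS`, `text_to_ids` and `pad_batch` are illustrative names, not the NVIDIA utilities, which appear to use a larger symbol set and additional text normalization.

```python
# Hypothetical symbol table: index 0 doubles as the padding ID.
SYMBOLS = ["_", "~"] + list("abcdefghijklmnopqrstuvwxyz !',.?")
SYMBOL_TO_ID = {symbol: index for index, symbol in enumerate(SYMBOLS)}


def text_to_ids(text):
    """Map each known character to its integer ID, skipping unknown ones."""
    return [SYMBOL_TO_ID[c] for c in text.lower() if c in SYMBOL_TO_ID]


def pad_batch(texts):
    """Encode a batch and right-pad every sequence to the longest length."""
    sequences = [text_to_ids(t) for t in texts]
    lengths = [len(s) for s in sequences]
    max_len = max(lengths)
    padded = [s + [0] * (max_len - len(s)) for s in sequences]
    return padded, lengths


padded, lengths = pad_batch(["Hi", "Hello"])
print(padded, lengths)
```

The lengths are kept alongside the padded batch so the model can ignore the padding positions.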

Run the chained models:

In [None]:
with torch.no_grad():
    mel, _, _ = tacotron2.infer(sequences, lengths)
    audio = waveglow.infer(mel)
audio_numpy = audio[0].data.cpu().numpy()
rate = 22050
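
As a rough sanity check on the output, the length of the generated audio can be estimated from the number of mel frames. The sketch below assumes 256 audio samples per mel frame, read off the `upsample` stride in the WaveGlow printout above; `approx_duration_seconds` is a hypothetical helper.

```python
SAMPLE_RATE = 22050  # same value as `rate` in the cell above
HOP_LENGTH = 256     # assumption: taken from the WaveGlow upsample stride


def approx_duration_seconds(num_mel_frames):
    """Estimate audio duration from the number of mel spectrogram frames."""
    return num_mel_frames * HOP_LENGTH / SAMPLE_RATE


print(f"{approx_duration_seconds(200):.2f} s for 200 mel frames")
```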

You can write the audio to a file and listen to it:

In [None]:
from scipy.io.wavfile import write
write("audio.wav", rate, audio_numpy)
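
Some audio players handle floating-point WAV files poorly. If that turns out to be an issue, one option is to write 16-bit PCM instead. The following is a minimal standard-library sketch, not part of the original example; `write_pcm16_wav` is a hypothetical helper, and the demo uses a synthetic 440 Hz tone rather than model output.

```python
import math
import os
import struct
import tempfile
import wave


def write_pcm16_wav(path, samples, rate):
    """Peak-normalize float samples and write them as mono 16-bit PCM."""
    peak = max((abs(s) for s in samples), default=0.0) or 1.0
    frames = b"".join(
        struct.pack("<h", int(round(s / peak * 32767))) for s in samples
    )
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)   # mono
        wav_file.setsampwidth(2)   # 2 bytes = 16 bits per sample
        wav_file.setframerate(rate)
        wav_file.writeframes(frames)


# Demo: one second of a 440 Hz tone at the notebook's sample rate.
demo_rate = 22050
tone = [0.5 * math.sin(2 * math.pi * 440 * n / demo_rate) for n in range(demo_rate)]
demo_path = os.path.join(tempfile.gettempdir(), "pcm16_demo.wav")
write_pcm16_wav(demo_path, tone, demo_rate)
```

With the notebook's variables, the call would look like `write_pcm16_wav("audio.wav", audio_numpy.tolist(), rate)`.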

Alternatively, play it right away in the notebook with IPython's `Audio` widget:

In [None]:
from IPython.display import Audio
Audio(audio_numpy, rate=rate)

### Details
For detailed information on model input and output, training recipes, inference and performance, visit [GitHub](https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/SpeechSynthesis/Tacotron2) and/or [NGC](https://ngc.nvidia.com/catalog/resources/nvidia:tacotron_2_and_waveglow_for_pytorch).

### References

 - [Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions](https://arxiv.org/abs/1712.05884)
 - [WaveGlow: A Flow-based Generative Network for Speech Synthesis](https://arxiv.org/abs/1811.00002)
 - [Tacotron2 and WaveGlow on NGC](https://ngc.nvidia.com/catalog/resources/nvidia:tacotron_2_and_waveglow_for_pytorch)
 - [Tacotron2 and WaveGlow on GitHub](https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/SpeechSynthesis/Tacotron2)

In [2]:
import os
import shutil
from google.colab import files
import ipywidgets as widgets
from IPython.display import display, Audio, clear_output
import torch
import torchaudio
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

### Upload .wav files

In [3]:
# Function to handle the uploaded files
def handle_upload(change):
    loading_text.value = 'Uploading files...'  # Show loading text
    wav_folder = 'wav_files'

    # Create the wav folder if it doesn't exist
    os.makedirs(wav_folder, exist_ok=True)

    # Continue numbering after the existing .wav files (ignore other files such as list.txt)
    file_count = len([f for f in os.listdir(wav_folder) if f.endswith('.wav')]) + 1

    for file_name, file_content in change['new'].items():
        if file_name.endswith('.wav'):
            save_path = os.path.join(wav_folder, f'{file_count}.wav')

            # Save the wav file
            with open(save_path, 'wb') as f:
                f.write(file_content['content'])

            print(f'Saved: {file_count}.wav')
            file_count += 1
        else:
            print(f'{file_name} is not a .wav file. Please upload only .wav files.')

    loading_text.value = ''  # Hide loading text after the upload

# Function to clear the wav folder
def clear_directory(b):
    loading_text.value = 'Emptying directory...'  # Show loading text
    wav_folder = 'wav_files'

    if os.path.exists(wav_folder):
        # Remove all files in the folder
        shutil.rmtree(wav_folder)
        os.makedirs(wav_folder)  # Recreate the folder after clearing
        print('Directory emptied.')
    else:
        print('Directory does not exist.')

    loading_text.value = ''  # Hide loading text after clearing

# Create a file upload widget
upload_button = widgets.FileUpload(
    accept='.wav',
    multiple=True
)

# Create a button to clear the directory
clear_button = widgets.Button(
    description="Empty Directory",
    button_style='danger'  # Style the button with a red background
)

# Create a label for loading text
loading_text = widgets.Label(value='')

# Link the upload widget and clear button to their handlers
upload_button.observe(handle_upload, names='value')
clear_button.on_click(clear_directory)

# Arrange buttons and loading text horizontally
buttons = widgets.HBox([upload_button, clear_button])
layout = widgets.VBox([buttons, loading_text])

# Display the buttons and loading text
display(layout)

VBox(children=(HBox(children=(FileUpload(value={}, accept='.wav', description='Upload', multiple=True), Button…

Saved: 1.wav
Saved: 2.wav
Saved: 3.wav
Saved: 4.wav
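
The upload handler derives the next file number from how many files are already in the folder. If individual files are ever removed by hand, a count can collide with an existing name. One alternative is to continue from the highest number in use, as in the sketch below; `next_wav_index` is a hypothetical helper, not part of the cell above.

```python
import os
import re
import tempfile


def next_wav_index(folder):
    """Return one more than the largest N among files named 'N.wav'."""
    pattern = re.compile(r"^(\d+)\.wav$")
    indices = [
        int(match.group(1))
        for name in os.listdir(folder)
        if (match := pattern.match(name))
    ]
    return max(indices, default=0) + 1


# Demo in a throwaway folder: 2.wav is missing and list.txt is ignored.
with tempfile.TemporaryDirectory() as demo_folder:
    for name in ("1.wav", "3.wav", "list.txt"):
        open(os.path.join(demo_folder, name), "wb").close()
    demo_next = next_wav_index(demo_folder)
    print("Next file would be", f"{demo_next}.wav")
```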


### Transcription

#### Generate Transcript

In [9]:
# Define the path where the .wav files are located
wav_directory = "wav_files"  # Assuming files are already uploaded to this directory in Colab

# Count how many .wav files are in the directory
wav_files = [f for f in os.listdir(wav_directory) if f.endswith('.wav')]
num_wav_files = len(wav_files)
print(f"Found {num_wav_files} .wav files in the directory.")

# Define the output file name
output_file = os.path.join(wav_directory, "list.txt")

# Automatically set the range of .wav files based on the number of files found
wav_files_range = range(1, num_wav_files + 1)  # Range starts from 1 to num_wav_files

# Initialize the list to store file paths and transcripts
file_and_transcripts = []

# Initialize the wav2vec model and processor
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-large-960h")
processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-large-960h")

# Iterate through the .wav files
for i in wav_files_range:
    wav_file = os.path.join(wav_directory, f"{i}.wav")

    # Check if the .wav file exists
    if os.path.exists(wav_file):
        # Recognize the speech in the .wav file
        try:
            waveform, sample_rate = torchaudio.load(wav_file)

            if waveform.size(0) > 1:
                waveform = torch.mean(waveform, dim=0, keepdim=True)
                torchaudio.save(wav_file, waveform, sample_rate)

            waveform = waveform.squeeze(0)  # Drop the channel dimension (audio is mono here)
            # Resample to the 16 kHz rate the wav2vec 2.0 model expects
            resampler = torchaudio.transforms.Resample(orig_freq=sample_rate, new_freq=16000)
            waveform = resampler(waveform)
            input_values = processor(waveform, return_tensors="pt", sampling_rate=16000).input_values
            with torch.no_grad():  # Inference only, so skip gradient tracking
                logits = model(input_values).logits
            predicted_ids = torch.argmax(logits, dim=-1)
            transcript = processor.decode(predicted_ids[0])
        except Exception as e:
            print(f"Error processing file {wav_file}: {e}")
            continue

        # Append the desired path format and transcript to the list
        file_and_transcripts.append(f"/content/{wav_directory}/{i}.wav|{transcript}")
    else:
        print(f"File not found: {wav_file}")

# Write the file paths and transcripts to the output file
with open(output_file, "w") as f:
    for line in file_and_transcripts:
        f.write(f"{line}\n")

print(f"File '{output_file}' updated successfully in the Colab folder.")

Found 4 .wav files in the directory.


Some weights of the model checkpoint at facebook/wav2vec2-large-960h were not used when initializing Wav2Vec2ForCTC: ['wav2vec2.encoder.pos_conv_embed.conv.weight_g', 'wav2vec2.encoder.pos_conv_embed.conv.weight_v']
- This IS expected if you are initializing Wav2Vec2ForCTC from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing Wav2Vec2ForCTC from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of Wav2Vec2ForCTC were not initialized from the model checkpoint at facebook/wav2vec2-large-960h and are newly initialized: ['wav2vec2.encoder.pos_conv_embed.conv.parametrizations.weight.original0', 'wav2vec2.encoder.pos_conv_embed.conv.parametrizations.weight.original1', 'wav2vec2.masked_spec_embed']
You s

File 'wav_files/list.txt' updated successfully in the Colab folder.
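
The transcription cell takes the `argmax` over the logits and hands the resulting IDs to `processor.decode`. For a CTC model, decoding conceptually means collapsing repeated IDs and dropping the blank token. The sketch below illustrates that idea with a hypothetical four-entry vocabulary; it is not the `transformers` implementation.

```python
def ctc_greedy_decode(frame_ids, id_to_char, blank_id=0):
    """Collapse consecutive repeats, then drop blanks, to get the text."""
    chars = []
    previous = None
    for idx in frame_ids:
        # Emit a character only when the ID changes and is not the blank.
        if idx != previous and idx != blank_id:
            chars.append(id_to_char[idx])
        previous = idx
    return "".join(chars)


# Hypothetical vocabulary: ID 0 is the CTC blank.
vocab = {0: "", 1: "C", 2: "A", 3: "T"}
decoded = ctc_greedy_decode([0, 1, 1, 0, 2, 2, 2, 0, 3], vocab)
print(decoded)
```

A blank between two identical IDs is what allows genuinely doubled letters to survive the collapse step.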


#### User verifies transcript


In [26]:
# Load the contents of wav_files/list.txt
file_path = 'wav_files/list.txt'

with open(file_path, 'r') as file:
    lines = file.readlines()

# Create a list to hold the editable text inputs
text_inputs_after = []

# Create a list to hold the Output widgets that wrap the audio players
output_widgets = []

# Create an Audio widget and Text widget for each line
for line in lines:
    # Split the line on the "|" character
    parts = line.split('|', 1)

    # Get the part before the "|" and remove "/content/" if present
    audio_file_path = parts[0].strip().replace('/content/', '') if parts[0].startswith('/content/') else parts[0].strip()

    # Create an Output widget for the audio player
    output_widget = widgets.Output()
    with output_widget:
        display(Audio(audio_file_path, autoplay=False))
    output_widgets.append(output_widget)

    # Create a Text widget for the editable input
    after_part = parts[1].strip() if len(parts) > 1 else ''
    text_input_after = widgets.Text(value=after_part, layout=widgets.Layout(flex='1 1 auto', width='auto'))
    text_inputs_after.append(text_input_after)

    # Create a horizontal box layout for audio output and text input
    row = widgets.HBox([output_widget, text_input_after], layout=widgets.Layout(display='flex', flex_flow='row', align_items='center', width='100%'))
    display(row)

# Function to update the list.txt file with new values
def update_list_file():
    new_lines = []
    for i, line in enumerate(lines):
        parts = line.split('|', 1)
        # Keep the path before "|" and replace the transcript with the user input.
        # No spaces around "|", so the file keeps the original "path|transcript" format.
        new_line = f"{parts[0].strip()}|{text_inputs_after[i].value.strip()}\n"
        new_lines.append(new_line)

    # Write the new lines back to the file
    with open(file_path, 'w') as file:
        file.writelines(new_lines)

# Button to update the list.txt file
update_button = widgets.Button(description="Update list.txt")

def on_update_button_click(b):
    update_list_file()
    print("list.txt has been updated!")

update_button.on_click(on_update_button_click)
display(update_button)


HBox(children=(Output(), Text(value="HALLOW THERE I'M JUST TALKING ER FOR THREE SIX SECONDS LIKE THIS", layout…

HBox(children=(Output(), Text(value="Okay SO KNOW I'M TALKING AGAIN AND THIS TIME I'M TALKING ABOUT CATS AND D…

HBox(children=(Output(), Text(value='OKESO A BIG FROG JUMPS OVER A DARK LAKE', layout=Layout(flex='1 1 auto', …

HBox(children=(Output(), Text(value="I'VE WORKED KOIR LOT IN FRONTAND ESPECIALLY IN REACT AND FIRE BASE AND SO…

Button(description='Update list.txt', style=ButtonStyle())
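
Each line of `list.txt` pairs an audio path with its transcript, separated by `|`. Since downstream tools may read the path literally, it can help to parse and write these lines in a way that tolerates stray whitespace. A minimal sketch follows; `parse_line` and `format_line` are hypothetical helpers.

```python
def parse_line(line):
    """Split 'path|transcript' on the first '|' and trim both parts."""
    path, _, transcript = line.rstrip("\n").partition("|")
    return path.strip(), transcript.strip()


def format_line(path, transcript):
    """Build a 'path|transcript' line with no padding around the separator."""
    return f"{path.strip()}|{transcript.strip()}\n"


example = "/content/wav_files/1.wav | HELLO THERE\n"
path, transcript = parse_line(example)
print(repr(format_line(path, transcript)))
```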

### Preprocess .wav files

In [31]:
!pip install pytaglib

import os
import librosa
import soundfile as sf
import shutil
import taglib

Collecting pytaglib
  Downloading pytaglib-3.0.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (5.8 kB)
Downloading pytaglib-3.0.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.9 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.9/1.9 MB 15.9 MB/s eta 0:00:00
Installing collected packages: pytaglib
Successfully installed pytaglib-3.0.0


In [36]:
#@markdown ## <font color="black">**Preprocess .wav files for Tacotron 2**

# Define input and output paths
input_path = "wav_files/"  # Path to the input .wav files
output_path = "wav_processed/"  # Path to save processed .wav files

# Create the output directory if it doesn't exist
if not os.path.exists(output_path):
    os.makedirs(output_path)

# Copy the list.txt file from input_path to output_path
list_file_path = os.path.join(input_path, "list.txt")
if os.path.exists(list_file_path):
    shutil.copy(list_file_path, output_path)
else:
    print("list.txt file not found in the input path.")

# Process .wav files in the input directory (sorted so the track numbers are reproducible)
for i, filename in enumerate(sorted(os.listdir(input_path))):
    if filename.endswith(".wav"):
        # Load the .wav file
        filepath = os.path.join(input_path, filename)
        y, sr = librosa.load(filepath, sr=22050)

        # Trim silence
        trimmed_audio, _ = librosa.effects.trim(y, top_db=20)

        # Normalize audio
        normalized_audio = librosa.util.normalize(trimmed_audio)

        # Save processed .wav file to the output folder
        output_filepath = os.path.join(output_path, filename)
        sf.write(output_filepath, normalized_audio, sr, subtype='PCM_16')

        # Set metadata using taglib
        with taglib.File(output_filepath) as audio:
            # Set the title to match the file name without the extension
            audio.tags["TITLE"] = [os.path.splitext(filename)[0]]
            # Set the track number to match the index of the file in the enumeration
            audio.tags["TRACKNUMBER"] = [str(i + 1)]  # Track number starts at 1

            # Save updated WAV file
            audio.save()

print("All .wav files have been preprocessed and saved to the output folder with metadata.")


All .wav files have been preprocessed and saved to the output folder with metadata.
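
To give a feel for what the trim and normalize steps do, here is a simplified pure-Python sketch. It is not how librosa implements them: `librosa.effects.trim` works on frame-level energy, while this version thresholds individual samples. `trim_silence` and `peak_normalize` are hypothetical helpers. With `top_db=20`, anything quieter than one tenth of the peak amplitude counts as silence.

```python
def peak_normalize(samples):
    """Scale samples so the largest absolute value becomes 1.0."""
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0.0:
        return list(samples)
    return [s / peak for s in samples]


def trim_silence(samples, top_db=20.0):
    """Drop leading/trailing samples more than top_db below the peak."""
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0.0:
        return []
    threshold = peak * 10 ** (-top_db / 20.0)  # dB to linear amplitude
    loud = [i for i, s in enumerate(samples) if abs(s) >= threshold]
    if not loud:
        return []
    return list(samples[loud[0]: loud[-1] + 1])


raw = [0.0, 0.001, 0.5, -0.25, 0.002, 0.0]
trimmed = trim_silence(raw, top_db=20.0)
normalized = peak_normalize(trimmed)
print(trimmed, normalized)
```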
