# Proof of Concept Testing - Pre-trained Tacotron2

This notebook is intended as a proof of concept for using Coqui-AI to convert ebooks into audiobooks. It uses tools from the epub-parser notebook to parse the book, then a TTS engine (in the cells below, the pre-trained Tacotron2 and WaveGlow models from TorchHub) to convert the parsed text into audio files.

**Note:** this notebook is intended for use with the `ttsenv` conda environment.

## Outline of steps

1. Convert .epub file to text string
2. Run text through TTS engine
3. Combine audio files

In [1]:
import ebooklib
import torch
import time
import numpy as np
from ebooklib import epub
from bs4 import BeautifulSoup
from IPython.display import Audio
from scipy.io.wavfile import write
from nltk import tokenize, download
download('punkt')

[nltk_data] Downloading package punkt to /home/paperspace/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [2]:
# ?torch.hub.load

In [3]:
# device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
# map_location=torch.device('cpu')
tacotron2 = torch.hub.load('NVIDIA/DeepLearningExamples:torchhub', 'nvidia_tacotron2', model_math='fp16')
tacotron2 = tacotron2.to('cuda')
tacotron2 = tacotron2.eval()

Using cache found in /home/paperspace/.cache/torch/hub/NVIDIA_DeepLearningExamples_torchhub
  "pytorch_quantization module not found, quantization will not be available"


In [4]:
sum([parameter.numel() for parameter in tacotron2.parameters()])

28193153
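
For a rough sense of scale, the parameter count above can be turned into a weight-memory estimate. The sketch below assumes every parameter is stored at the stated width (2 bytes for fp16, 4 bytes for fp32) and ignores buffers and activations, so it is a lower bound rather than a measurement. The helper name `weight_memory_mib` is hypothetical.

```python
# Parameter count printed by the cell above.
NUM_PARAMETERS = 28_193_153


def weight_memory_mib(num_parameters: int, bytes_per_parameter: int) -> float:
    """Return the approximate size of the weights in MiB."""
    return num_parameters * bytes_per_parameter / (1024 ** 2)


# Assumption: all weights are stored at a single precision.
fp16_mib = weight_memory_mib(NUM_PARAMETERS, 2)  # half precision: 2 bytes each
fp32_mib = weight_memory_mib(NUM_PARAMETERS, 4)  # single precision: 4 bytes each
print(f"fp16: {fp16_mib:.1f} MiB, fp32: {fp32_mib:.1f} MiB")
```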

## Step 1 - Epub to text

Read the EPUB file (`pg2554.epub`, the Project Gutenberg edition of *Crime and Punishment*) with `ebooklib`.

In [5]:
book = epub.read_epub('pg2554.epub')
# book = epub.read_epub('epub-test.epub')

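Iterate over the book's items. For each document item, print its name and its text with the HTML stripped by BeautifulSoup, then print the raw content for comparison.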

In [6]:
for item in book.get_items():
    if item.get_type() == ebooklib.ITEM_DOCUMENT:
        print('==================================')
        print('NAME : ', item.get_name())
        print('----------------------------------')
        print(BeautifulSoup(item.get_content(), "html.parser").text)
        print('==================================')

NAME :  5044171427629287690_2554-h-0.htm.html
----------------------------------




The Project Gutenberg eBook of Crime and Punishment, by Fyodor Dostoevsky
This eBook is for the use of anyone anywhere in the United States and most other parts of the world at no cost and with almost no restrictions whatsoever. You may copy it, give it away or re-use it under the terms of the Project Gutenberg License included with this eBook or online at www.gutenberg.org. If you are not located in the United States, you will have to check the laws of the country where you are located before using this eBook.
Title: Crime and Punishment
Author: Fyodor Dostoevsky
Translator: Constance Garnett
Release Date: March, 2001 [eBook #2554]
[Most recently updated: August 6, 2021]
Language: English
Character set encoding: UTF-8
Produced by: John Bickers, Dagny and David Widger
*** START OF THE PROJECT GUTENBERG EBOOK CRIME AND PUNISHMENT ***
CRIME AND PUNISHMENT
By Fyodor Dostoevsky
Translated By Constance Garn






 CHAPTER IV
Raskolnikov went straight to the house on the canal bank where Sonia lived. It was an old green house of three storeys. He found the porter and obtained from him vague directions as to the whereabouts of Kapernaumov, the tailor. Having found in the corner of the courtyard the entrance to the dark and narrow staircase, he mounted to the second floor and came out into a gallery that ran round the whole second storey over the yard. While he was wandering in the darkness, uncertain where to turn for Kapernaumov’s door, a door opened three paces from him; he mechanically took hold of it.
“Who is there?” a woman’s voice asked uneasily.
“It’s I... come to see you,” answered Raskolnikov and he walked into the tiny entry.
On a broken chair stood a candle in a battered copper candlestick.
“It’s you! Good heavens!” cried Sonia weakly, and she stood rooted to the spot.
“Which is your room? This way?” and Raskolnikov, trying not to look at her, hastened in.
A minute later Sonia, to
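
The extracted text begins with the Project Gutenberg preamble, which an audiobook would probably not want read aloud. A minimal sketch of trimming it is shown here, assuming the text of all items has been joined into one string. The `*** START OF` prefix is taken from the output above; the `*** END OF` prefix is an assumed counterpart that does not appear in the output shown. The function name is hypothetical.

```python
START_MARKER = "*** START OF"  # prefix seen in the output above
END_MARKER = "*** END OF"      # assumed counterpart; not shown in the output above


def strip_gutenberg_boilerplate(text: str) -> str:
    """Keep only the text between the start and end marker lines, if present."""
    lines = text.split("\n")
    start = 0
    end = len(lines)
    for i, line in enumerate(lines):
        stripped = line.strip()
        if stripped.startswith(START_MARKER):
            start = i + 1  # the body begins on the line after the marker
        elif stripped.startswith(END_MARKER):
            end = i        # the body stops before the end marker
            break
    return "\n".join(lines[start:end]).strip()


sample = "\n".join([
    "Title: Crime and Punishment",
    "*** START OF THE PROJECT GUTENBERG EBOOK CRIME AND PUNISHMENT ***",
    "CRIME AND PUNISHMENT",
    "By Fyodor Dostoevsky",
    "*** END OF THE PROJECT GUTENBERG EBOOK CRIME AND PUNISHMENT ***",
    "License text...",
])
body = strip_gutenberg_boilerplate(sample)
print(body)
```

If neither marker is found, the function returns the input unchanged apart from surrounding whitespace.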

In [7]:
for item in book.get_items():
    if item.get_type() == ebooklib.ITEM_DOCUMENT:
        print(item.get_content())

b'<?xml version=\'1.0\' encoding=\'utf-8\'?>\n<!DOCTYPE html>\n<html xmlns="http://www.w3.org/1999/xhtml" xmlns:epub="http://www.idpf.org/2007/ops" epub:prefix="z3998: http://www.daisy.org/z3998/2012/vocab/structure/#" lang="en" xml:lang="en">\n  <head/>\n  <body><div class="c1">The Project Gutenberg eBook of Crime and Punishment, by Fyodor Dostoevsky</div>\n<div class="c2">This eBook is for the use of anyone anywhere in the United States and most other parts of the world at no cost and with almost no restrictions whatsoever. You may copy it, give it away or re-use it under the terms of the Project Gutenberg License included with this eBook or online at <a href="https://www.gutenberg.org">www.gutenberg.org</a>. If you are not located in the United States, you will have to check the laws of the country where you are located before using this eBook.</div>\n<div class="c3">Title: Crime and Punishment</div>\n<div class="c3">Author: Fyodor Dostoevsky</div>\n<div class="c3">Translator: Const

## Step 2 - Tacotron2 from TorchHub
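Load the pre-trained WaveGlow vocoder from TorchHub (Tacotron2 was loaded above), encode a test sentence with the TTS utilities, and synthesize audio from it.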

In [15]:
waveglow = torch.hub.load('NVIDIA/DeepLearningExamples:torchhub', 'nvidia_waveglow', model_math='fp16')
waveglow = waveglow.remove_weightnorm(waveglow)
waveglow = waveglow.to('cuda')
waveglow.eval()

Using cache found in /home/paperspace/.cache/torch/hub/NVIDIA_DeepLearningExamples_torchhub


WaveGlow(
  (upsample): ConvTranspose1d(80, 80, kernel_size=(1024,), stride=(256,))
  (WN): ModuleList(
    (0): WN(
      (in_layers): ModuleList(
        (0): Conv1d(512, 1024, kernel_size=(3,), stride=(1,), padding=(1,))
        (1): Conv1d(512, 1024, kernel_size=(3,), stride=(1,), padding=(2,), dilation=(2,))
        (2): Conv1d(512, 1024, kernel_size=(3,), stride=(1,), padding=(4,), dilation=(4,))
        (3): Conv1d(512, 1024, kernel_size=(3,), stride=(1,), padding=(8,), dilation=(8,))
        (4): Conv1d(512, 1024, kernel_size=(3,), stride=(1,), padding=(16,), dilation=(16,))
        (5): Conv1d(512, 1024, kernel_size=(3,), stride=(1,), padding=(32,), dilation=(32,))
        (6): Conv1d(512, 1024, kernel_size=(3,), stride=(1,), padding=(64,), dilation=(64,))
        (7): Conv1d(512, 1024, kernel_size=(3,), stride=(1,), padding=(128,), dilation=(128,))
      )
      (res_skip_layers): ModuleList(
        (0): Conv1d(512, 1024, kernel_size=(1,), stride=(1,))
        (1): Conv1d(51

In [32]:
# repeat the test sentence five times to get a longer input
text = "Hello world, I missed you so much." * 5


In [13]:
list(text)

['H',
 'e',
 'l',
 'l',
 'o',
 ' ',
 'w',
 'o',
 'r',
 'l',
 'd',
 ',',
 ' ',
 'I',
 ' ',
 'm',
 'i',
 's',
 's',
 'e',
 'd',
 ' ',
 'y',
 'o',
 'u',
 ' ',
 's',
 'o',
 ' ',
 'm',
 'u',
 'c',
 'h',
 '.']

In [33]:
utils = torch.hub.load('NVIDIA/DeepLearningExamples:torchhub', 'nvidia_tts_utils')
sequences, lengths = utils.prepare_input_sequence([text])

Using cache found in /home/paperspace/.cache/torch/hub/NVIDIA_DeepLearningExamples_torchhub


In [34]:
sequences, lengths

(tensor([[45, 42, 49, 49, 52, 11, 60, 52, 55, 49, 41,  6, 11, 46, 11, 50, 46, 56,
          56, 42, 41, 11, 62, 52, 58, 11, 56, 52, 11, 50, 58, 40, 45,  7, 45, 42,
          49, 49, 52, 11, 60, 52, 55, 49, 41,  6, 11, 46, 11, 50, 46, 56, 56, 42,
          41, 11, 62, 52, 58, 11, 56, 52, 11, 50, 58, 40, 45,  7, 45, 42, 49, 49,
          52, 11, 60, 52, 55, 49, 41,  6, 11, 46, 11, 50, 46, 56, 56, 42, 41, 11,
          62, 52, 58, 11, 56, 52, 11, 50, 58, 40, 45,  7, 45, 42, 49, 49, 52, 11,
          60, 52, 55, 49, 41,  6, 11, 46, 11, 50, 46, 56, 56, 42, 41, 11, 62, 52,
          58, 11, 56, 52, 11, 50, 58, 40, 45,  7, 45, 42, 49, 49, 52, 11, 60, 52,
          55, 49, 41,  6, 11, 46, 11, 50, 46, 56, 56, 42, 41, 11, 62, 52, 58, 11,
          56, 52, 11, 50, 58, 40, 45,  7]], device='cuda:0'),
 tensor([170], device='cuda:0'))
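
The tensor above is a character-level encoding of the input. The sketch below reproduces those IDs with a small symbol table. The table is inferred from the printed values, not copied from the library, so treat it as an assumption; the real `prepare_input_sequence` may also clean the text and handle symbols this toy version drops.

```python
from typing import List

# Symbol table inferred from the IDs printed above (an assumption):
# one pad symbol, one special symbol, punctuation, then letters.
PAD = "_"
SPECIAL = "-"
PUNCTUATION = "!'(),.:;? "
LETTERS = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"
SYMBOLS = [PAD] + list(SPECIAL) + list(PUNCTUATION) + list(LETTERS)
SYMBOL_TO_ID = {symbol: index for index, symbol in enumerate(SYMBOLS)}


def encode(text: str) -> List[int]:
    """Lowercase the text and map each known character to its integer ID."""
    return [SYMBOL_TO_ID[ch] for ch in text.lower() if ch in SYMBOL_TO_ID]


ids = encode("Hello world, I missed you so much.")
print(ids[:13])      # compare with the start of the first tensor row above
print(len(ids) * 5)  # compare with the length tensor above
```

Lowercasing explains why `H` and `h` both map to 45 in the output, and the sentence length of 34 characters repeated five times gives the reported length of 170.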

In [35]:
with torch.no_grad():
    mel, _, _ = tacotron2.infer(sequences, lengths)
    audio = waveglow.infer(mel)
audio_numpy = audio[0].data.cpu().numpy()
rate = 22050

In [None]:
write("outputs/audio.wav", rate, audio_numpy)

In [36]:
Audio(audio_numpy, rate=rate)

## Step 3 - Text from epub to .wav file

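Split one chapter file into sentences and count the characters and samples, then synthesize and time a single sentence to estimate the cost of a full conversion.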

In [37]:
max_characters = 0
total_characters = 0
sample_num = 0

for item in book.get_items():
#    if item.get_type() == ebooklib.ITEM_DOCUMENT:
    if item.get_name() == "5044171427629287690_2554-h-1.htm.html":
        # set up text sample and path
        input_text = BeautifulSoup(item.get_content(), "html.parser").text
        input_text = input_text.replace('—', '')
        text_list_tmp = input_text.split('\n')

        text_list = ['']
        for sample in text_list_tmp:
            text_list += tokenize.sent_tokenize(sample)

        # print(input_text)
        # print(text_list)
        for i in range(0,len(text_list)):
#             print("Index:" + str(i) + "Characters:" + str(len(text_list[i])))
#             print(text_list[i])
        
            if len(text_list[i]) >= max_characters:
                max_characters = len(text_list[i])
            
            total_characters += len(text_list[i])
            sample_num += 1
            
print("Max Characters in Sample:" + str(max_characters))
print("Total Characters:" + str(total_characters))
print("Number of Samples:" + str(sample_num))

Max Characters in Sample:336
Total Characters:4419
Number of Samples:40
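
The same statistics can be computed without the index-based loop. This is a sketch only: `split_sentences` is a naive regex stand-in for `tokenize.sent_tokenize` so that the example has no dependencies, and both function names are hypothetical.

```python
import re
from typing import List, Tuple


def split_sentences(paragraph: str) -> List[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace.

    Stand-in for nltk's sent_tokenize; it will mis-split abbreviations.
    """
    parts = re.split(r"(?<=[.!?])\s+", paragraph.strip())
    return [part for part in parts if part]


def sample_stats(samples: List[str]) -> Tuple[int, int, int]:
    """Return (longest sample length, total characters, number of samples)."""
    lengths = [len(sample) for sample in samples]
    return max(lengths, default=0), sum(lengths), len(lengths)


text = "It was an old green house. He found the porter!\nWho is there?"
samples = []
for line in text.split("\n"):
    samples += split_sentences(line)

print(sample_stats(samples))
```

Note that the notebook's `text_list` starts with an empty string, which the loop above counts as one of the 40 samples.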


In [38]:
utils = torch.hub.load('NVIDIA/DeepLearningExamples:torchhub', 'nvidia_tts_utils')
sequences, lengths = utils.prepare_input_sequence([text_list[6]])

Using cache found in /home/paperspace/.cache/torch/hub/NVIDIA_DeepLearningExamples_torchhub


In [41]:
lengths

tensor([125], device='cuda:0')

In [42]:
%%timeit

with torch.no_grad():
    mel, _, _ = tacotron2.infer(sequences, lengths)
    audio = waveglow.infer(mel)
audio_numpy = audio[0].data.cpu().numpy()
rate = 22050

4.92 s ± 369 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [43]:
# rough hours to synthesize 14166 samples at 4.92 s each
14166 * 4.92 / 3600

19.3602
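
The estimate above is a linear extrapolation from the single timed sample. Written as a small helper it looks like the sketch below, which assumes every sample costs the same as the 125-character one that was timed; longer sentences will likely take longer, so the result is an order-of-magnitude figure. The count of 14166 is taken from the cell above; how it was obtained is not shown in this notebook.

```python
SECONDS_PER_SAMPLE = 4.92  # mean reported by the %%timeit run above


def estimated_seconds(num_samples: int, seconds_per_sample: float) -> float:
    """Linear estimate: every sample is assumed to take the same time."""
    return num_samples * seconds_per_sample


book_hours = estimated_seconds(14166, SECONDS_PER_SAMPLE) / 3600
print(f"{book_hours:.2f} hours")
```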

In [40]:
Audio(audio_numpy, rate=rate)

## Step 4 - Convert paragraph

Synthesize every sentence in `text_list` and write each one to its own `.wav` file in `outputs/`.

In [46]:
from tqdm.notebook import tqdm

In [50]:
%%time

# audio_numpy_combined = np.empty(0,)
utils = torch.hub.load('NVIDIA/DeepLearningExamples:torchhub', 'nvidia_tts_utils')
rate = 22050

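for sample_index, sample in enumerate(tqdm(text_list)):
    if sample:
        # use the position from enumerate: text_list.index(sample) returns the
        # first match, so a repeated sentence would overwrite an earlier file
        sample_path = "outputs/output"+str(sample_index)+".wav"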
        
#         print("Sample to convert:" + "\n" + sample)
#         print("Index of sample:" + str(sample_index))
#         print("Output path:" + str(sample_path))
#         print("Characters:" + str(len(sample)) + "\n\n")
        
        sequences, lengths = utils.prepare_input_sequence([sample])
        
        with torch.no_grad():
            mel, _, _ = tacotron2.infer(sequences, lengths)
            audio = waveglow.infer(mel)
        audio_numpy = audio[0].data.cpu().numpy()
        
        write(sample_path, rate, audio_numpy)
        
#         current_wav = AudioSegment.from_file(sample_path)
#         output_wav = AudioSegment.from_file("output_combined.wav")
        
#         output_wav += current_wav
#         output_wav.export(out_f = "output_combined.wav", format = "wav")
        
#         print("Shape of audio_numpy:" + str(audio_numpy.shape))
#         print("Shape of audio_numpy_combined before:" + str(audio_numpy_combined.shape))
        
#         audio_numpy_combined= np.concatenate([audio_numpy_combined, audio_numpy], axis=0)
#         print("Shape of audio_numpy_combined after:" + str(audio_numpy_combined.shape))

# write("output_combined.wav", rate, audio_numpy_combined)

Using cache found in /home/paperspace/.cache/torch/hub/NVIDIA_DeepLearningExamples_torchhub


HBox(children=(FloatProgress(value=0.0, max=40.0), HTML(value='')))


CPU times: user 1min 53s, sys: 57.5 s, total: 2min 50s
Wall time: 2min 50s
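
When output file names are derived from a sample's position, duplicates matter: `list.index` returns the first match, so a repeated sentence maps to the same path, whereas `enumerate` yields a unique index per position. A minimal sketch of the difference follows; the helper name `output_paths` is hypothetical.

```python
from pathlib import Path
from typing import List, Tuple


def output_paths(samples: List[str], out_dir: str = "outputs") -> List[Tuple[str, Path]]:
    """Pair each non-empty sample with a path based on its position in the list."""
    pairs = []
    for position, sample in enumerate(samples):
        if sample:  # skip empty strings
            pairs.append((sample, Path(out_dir) / f"output{position}.wav"))
    return pairs


samples = ["", "Yes.", "Who is there?", "Yes."]
first_match = samples.index("Yes.")  # always the first occurrence
print(first_match)
for sample, path in output_paths(samples):
    print(path.name, "<-", sample)
```

Both occurrences of the repeated sentence get their own file here, which keeps the per-sentence files in reading order for the later combine step.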


In [53]:
Audio("outputs/output7.wav", rate=rate)

In [52]:
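# rough minutes to synthesize 40 samples at 4.92 s each
# (the measured wall time above was 2 min 50 s)
40 * 4.92 / 60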

3.2800000000000002

## Step 5 - Combine output files

Join the per-sentence `.wav` files into a single audio file with `pydub`.

In [54]:
from pydub import AudioSegment

In [None]:
start = time.time()

output_wav = AudioSegment.empty()

for i in range(1,40):
    current_wav = AudioSegment.from_file("outputs/output" + str(i) + ".wav")
    output_wav += current_wav

# export once, after all segments have been appended
output_wav.export(out_f = "outputs/output_combined.wav", format = "wav")

end = time.time()
print("The time of execution of above program is :", end-start)

Audio("outputs/output_combined.wav", rate=rate)

In [None]:
Audio("outputs/output2.wav", rate=rate)

In [57]:
output_wav = AudioSegment.empty()

wav_1 = AudioSegment.from_file("outputs/output1.wav")
output_wav = output_wav.append(wav_1, crossfade=0)

wav_2 = AudioSegment.from_file("outputs/output2.wav")
output_wav = output_wav.append(wav_2, crossfade=0)

output_wav.export(out_f = "outputs/output_combined.wav", format = "wav")

Audio("outputs/output1.wav", rate=rate)

In [58]:
output_wav

In [None]:
Audio("outputs/output_ffmpg.wav", rate=22050)
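
As an alternative to `pydub`, the files could in principle be joined with Python's standard `wave` module. This is a sketch under a significant assumption: `wave` only reads PCM files, so it would not open files written from floating-point arrays without converting them to 16-bit PCM first. The demo therefore generates its own short PCM files in a temporary directory; `concatenate_wavs` and `write_silence` are hypothetical helper names.

```python
import tempfile
import wave
from pathlib import Path
from typing import Iterable


def concatenate_wavs(input_paths: Iterable[Path], output_path: Path) -> int:
    """Append the frames of each input WAV to output_path; return frames written.

    Assumes every input shares the same channel count, sample width and rate.
    """
    frames_written = 0
    params = None
    with wave.open(str(output_path), "wb") as out:
        for path in input_paths:
            with wave.open(str(path), "rb") as src:
                if params is None:
                    params = src.getparams()
                    out.setparams(params)  # frame count is corrected on close
                elif src.getparams()[:3] != params[:3]:
                    raise ValueError(f"{path} has a different audio format")
                out.writeframes(src.readframes(src.getnframes()))
                frames_written += src.getnframes()
    return frames_written


def write_silence(path: Path, num_frames: int, rate: int = 22050) -> None:
    """Write a mono 16-bit PCM file of silence, used only for the demo."""
    with wave.open(str(path), "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * num_frames)


with tempfile.TemporaryDirectory() as tmp:
    tmp_dir = Path(tmp)
    parts = [tmp_dir / "output1.wav", tmp_dir / "output2.wav"]
    write_silence(parts[0], 100)
    write_silence(parts[1], 250)
    total_frames = concatenate_wavs(parts, tmp_dir / "output_combined.wav")
    with wave.open(str(tmp_dir / "output_combined.wav"), "rb") as combined:
        combined_frames = combined.getnframes()

print(total_frames, combined_frames)
```

The output file is written once, after all inputs have been appended, rather than re-exported for every segment.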

In [59]:
tacotron2.infer??

## Step 6 - Convert whole book

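Draft code for converting the whole book, chapter by chapter. Both cells below are commented out and have not been run.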

In [None]:
# start = time.time()

# max_characters = 0
# total_characters = 0
# sample_num = 0
# chap_num = 0

# for item in book.get_items():
# #    if item.get_type() == ebooklib.ITEM_DOCUMENT:
#     if item.get_name() == "5044171427629287690_2554-h-1.htm.html":
#         # set up text sample and path
#         chap_num += 1
#         input_text = BeautifulSoup(item.get_content(), "html.parser").text
#         input_text = input_text.replace('—', '')
#         text_list_tmp = input_text.split('\n')

#         text_list = ['']
#         for sample in text_list_tmp:
#             text_list = tokenize.sent_tokenize(sample)
            
#             sample_index = text_list.index(sample)
#             sample_path = "outputs/chapter" + str(chap_num) + "/output" + str(sample_index) + ".wav"

#             print("Sample to convert:" + "\n" + sample)
#             print("Index of sample:" + str(sample_index))
#             print("Output path:" + str(sample_path))
#             print("Characters:" + str(len(sample)) + "\n\n")

#             utils = torch.hub.load('NVIDIA/DeepLearningExamples:torchhub', 'nvidia_tts_utils')
#             sequences, lengths = utils.prepare_input_sequence([sample])

#             with torch.no_grad():
#                 mel, _, _ = tacotron2.infer(sequences, lengths)
#                 audio = waveglow.infer(mel)
#             audio_numpy = audio[0].data.cpu().numpy()
#             rate = 22050

#             write(sample_path, rate, audio_numpy)

#         # print(input_text)
#         # print(text_list)
#         for i in range(0,len(text_list)):
# #             print("Index:" + str(i) + "Characters:" + str(len(text_list[i])))
# #             print(text_list[i])
        
#             if len(text_list[i]) >= max_characters:
#                 max_characters = len(text_list[i])
            
#             total_characters += len(text_list[i])
#             sample_num += 1
            
#         print("Max Characters in Sample:" + str(max_characters))
#         print("Total Characters:" + str(total_characters))
#         print("Number of Samples:" + str(sample_num))

# end = time.time()
# print("The time of execution of above program is :", end-start)

In [None]:
# start = time.time()

# # audio_numpy_combined = np.empty(0,)

# for sample in text_list:
#     if sample:      
#         sample_index = text_list.index(sample)
#         sample_path = "outputs/output"+str(sample_index)+".wav"
        
#         print("Sample to convert:" + "\n" + sample)
#         print("Index of sample:" + str(sample_index))
#         print("Output path:" + str(sample_path))
#         print("Characters:" + str(len(sample)) + "\n\n")
        
#         utils = torch.hub.load('NVIDIA/DeepLearningExamples:torchhub', 'nvidia_tts_utils')
#         sequences, lengths = utils.prepare_input_sequence([sample])
        
#         with torch.no_grad():
#             mel, _, _ = tacotron2.infer(sequences, lengths)
#             audio = waveglow.infer(mel)
#         audio_numpy = audio[0].data.cpu().numpy()
#         rate = 22050
        
#         write(sample_path, rate, audio_numpy)
        
# #         current_wav = AudioSegment.from_file(sample_path)
# #         output_wav = AudioSegment.from_file("output_combined.wav")
        
# #         output_wav += current_wav
# #         output_wav.export(out_f = "output_combined.wav", format = "wav")
        
# #         print("Shape of audio_numpy:" + str(audio_numpy.shape))
# #         print("Shape of audio_numpy_combined before:" + str(audio_numpy_combined.shape))
        
# #         audio_numpy_combined= np.concatenate([audio_numpy_combined, audio_numpy], axis=0)
# #         print("Shape of audio_numpy_combined after:" + str(audio_numpy_combined.shape))

# # write("output_combined.wav", rate, audio_numpy_combined)
        
# end = time.time()
# print("The time of execution of above program is :", end-start)
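
The draft above writes to paths of the form `outputs/chapter<N>/output<i>.wav`, which requires each chapter folder to exist before the file is written. A minimal sketch of a path helper that creates the folder on demand is shown below. The zero-padded names are an assumption added here so that files sort in reading order; the helper name `chapter_sample_path` is hypothetical.

```python
import tempfile
from pathlib import Path


def chapter_sample_path(root: Path, chapter: int, sample_index: int) -> Path:
    """Build root/chapterNN/outputNNNN.wav, creating the chapter folder if needed."""
    chapter_dir = root / f"chapter{chapter:02d}"
    chapter_dir.mkdir(parents=True, exist_ok=True)
    return chapter_dir / f"output{sample_index:04d}.wav"


# Demo in a temporary directory so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    path = chapter_sample_path(Path(tmp), chapter=1, sample_index=7)
    relative = path.relative_to(tmp).as_posix()
    folder_exists = path.parent.is_dir()

print(relative, folder_exists)
```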