# Audiobook Generator - Proof of Concept

This notebook is a proof of concept for the end-to-end work of generating an audiobook file from an ebook. This includes converting the .epub book files into raw Python text strings, splitting them into items and sentences, then tokenizing them to run through the Silero text-to-speech model used in the cells below.


## Outline of steps

1. Import .epub file
2. Divide ebook into chapters
3. Remove html tags
4. Tokenize text for use in the model
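
Step 3 can be illustrated on a single HTML fragment. The following is a minimal sketch that uses only the standard library's `html.parser` as a stand-in; the cells below use BeautifulSoup for this step. The names `TextExtractor` and `strip_tags` are hypothetical and chosen for illustration.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect only the text nodes of an HTML fragment (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Called for the text between tags; the tags themselves are dropped.
        self.parts.append(data)


def strip_tags(html_text):
    parser = TextExtractor()
    parser.feed(html_text)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.parts)


plain = strip_tags("<p>It is a truth <i>universally</i> acknowledged.</p>")
print(plain)
```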

In [None]:
!pip install -q torchaudio omegaconf ebooklib beautifulsoup4 nltk soundfile

In [None]:
import torch
import numpy as np
import matplotlib.pyplot as plt
import soundfile as sf
from pprint import pprint
from omegaconf import OmegaConf
from IPython.display import Audio, display
from tqdm.notebook import tqdm
from torch.utils.data import DataLoader

torch.hub.download_url_to_file('https://raw.githubusercontent.com/snakers4/silero-models/master/models.yml',
                               'latest_silero_models.yml',
                               progress=False)
models = OmegaConf.load('latest_silero_models.yml')

seed = 1337
np.random.seed(seed)
torch.manual_seed(seed)
torch.cuda.manual_seed(seed)

In [None]:
device = 'cuda' if torch.cuda.is_available() else 'cpu'

In [None]:
# pg2554.epub = Crime and Punishment
# pg174.epub = The Picture of Dorian Gray
# pg1342.epub = Pride And Prejudice
ebook_path = 'pg1342.epub'
rate = 24000
batch_size = 4
max_char_len = 150

In [None]:
language = 'en'
model_id = 'v3_en'

model, example_text = torch.hub.load(repo_or_dir='snakers4/silero-models',
                                     model='silero_tts',
                                     language=language,
                                     speaker=model_id)
model.to(device)  # gpu or cpu

In [None]:
# model.speakers

In [None]:
sample_rate = 24000
speaker = 'en_0'
example_text = 'Hello world, here is a test of Silero.'

audio = model.apply_tts(text=example_text,
                        speaker=speaker,
                        sample_rate=sample_rate)
print(example_text)
display(Audio(audio, rate=sample_rate))

In [None]:
def read_ebook(ebook_path):
    """Return the ebook as a list of items, each a list of text chunks.

    Every chunk is at most `max_char_len` characters long (notebook-level setting).
    """
    import ebooklib
    from ebooklib import epub
    from bs4 import BeautifulSoup
    from tqdm.notebook import tqdm
    from nltk import tokenize, download
    from textwrap import TextWrapper

    download('punkt')  # sentence tokenizer data used by sent_tokenize
    wrapper = TextWrapper(max_char_len, fix_sentence_endings=True)

    book = epub.read_epub(ebook_path)

    corpus = []
    for item in tqdm(list(book.get_items())):
        if item.get_type() == ebooklib.ITEM_DOCUMENT:
            # Strip the HTML markup and keep only the text
            input_text = BeautifulSoup(item.get_content(), "html.parser").text
            text_list = []
            for paragraph in input_text.split('\n'):
                paragraph = paragraph.replace('—', '-')
                sentences = tokenize.sent_tokenize(paragraph)

                # Wrap (not truncate) each sentence into chunks of at most
                # max_char_len characters; no text is discarded
                for sentence in sentences:
                    text_list.extend(wrapper.wrap(sentence))
            corpus.append(text_list)

    return corpus
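
The effect of `TextWrapper` on a long sentence can be seen in isolation. This is a small sketch; the limit of 40 characters (`demo_limit`) is a hypothetical value picked so the split is visible, whereas the notebook's own setting is `max_char_len = 150`.

```python
from textwrap import TextWrapper

# Hypothetical limit for the demo; the notebook's own setting is max_char_len = 150.
demo_limit = 40
wrapper = TextWrapper(demo_limit, fix_sentence_endings=True)

sentence = ("It is a truth universally acknowledged, that a single man in "
            "possession of a good fortune, must be in want of a wife.")
chunks = wrapper.wrap(sentence)

# Each chunk fits within the limit and the words keep their original order.
for chunk in chunks:
    print(len(chunk), repr(chunk))
```

Wrapping splits at whitespace, so a chunk may end mid-clause, which can affect how natural the synthesized speech sounds at chunk boundaries.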

In [None]:
ebook = read_ebook(ebook_path)

In [None]:
len(ebook)

In [None]:
plt.hist([len(sentence) for chapter in ebook for sentence in chapter])
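
The same check the histogram gives visually can be done numerically. The sketch below runs on a hypothetical stand-in (`sample_corpus`) with the same shape as `ebook`, a list of items that each hold a list of chunks, and reports the longest chunk, how many exceed the limit, and how many items are empty.

```python
# Hypothetical stand-in for the `ebook` structure: a list of items, each a list of chunks.
sample_corpus = [
    ["Chapter 1", "It is a truth universally acknowledged."],
    [],
    ["Mr. Bennet made no answer.", "Chapter 2"],
]
limit = 150  # same value as max_char_len above

lengths = [len(chunk) for item in sample_corpus for chunk in item]
summary = {
    "chunks": len(lengths),
    "longest": max(lengths, default=0),
    "over_limit": sum(length > limit for length in lengths),
    "empty_items": sum(1 for item in sample_corpus if not item),
}
print(summary)
```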

In [None]:
import os

os.makedirs("outputs/silero", exist_ok=True)  # sf.write does not create directories

# enumerate gives each chapter and sentence its position; list.index would
# return the first match and reuse a file name for repeated sentences
for chapter_number, chapter in enumerate(tqdm(ebook)):
    chapter_index = f'{chapter_number:03}'
    audio_list = []
    for sentence_number, sentence in enumerate(tqdm(chapter)):
        sample_index = f'{sentence_number:03}'
        sample_path = f"outputs/silero/chapter{chapter_index}-sample{sample_index}.wav"

        audio = model.apply_tts(text=sentence,
                                speaker=speaker,
                                sample_rate=sample_rate)
        audio_list.append(audio)
        sf.write(sample_path, audio, sample_rate, format='wav')
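
When writing one file per sentence, the number in the file name has to come from the sentence's position rather than from a lookup by value: `list.index` returns the first equal element, so repeated sentences would share a file name. A minimal sketch of positional naming follows; `sample_paths` and `toy_chapters` are hypothetical names used only for illustration.

```python
def sample_paths(chapters, out_dir="outputs/silero"):
    """Yield (path, sentence) pairs numbered by position (hypothetical helper)."""
    for chapter_number, chapter in enumerate(chapters):
        for sentence_number, sentence in enumerate(chapter):
            name = f"chapter{chapter_number:03}-sample{sentence_number:03}.wav"
            yield f"{out_dir}/{name}", sentence


# Toy data with a repeated sentence, as happens with short lines of dialogue.
toy_chapters = [["Yes.", "No.", "Yes."]]
paths = [path for path, _ in sample_paths(toy_chapters)]
print(paths)

# list.index always reports the first match, so the third sentence maps to 0 again.
print(toy_chapters[0].index("Yes."))
```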

### Results

##### CPU

Running "Pride and Prejudice" through the Silero model on CPU took **34m42s**. This book is a good representation of the average book length: the average audiobook on Audible runs between 10 and 12 hours, while Pride and Prejudice runs 11h20m.

This is approximately a 20:1 ratio of audio length to processing time.
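
The ratio can be reproduced from the two durations quoted above. This is a small sketch; `to_seconds` and `speed_ratio` are hypothetical helper names, and the duration format (`11h20m`, `34m42s`) simply mirrors how the times are written in this notebook.

```python
import re


def to_seconds(duration):
    """Convert a string such as '11h20m' or '34m42s' to seconds."""
    match = re.fullmatch(r"(?:(\d+)h)?(?:(\d+)m)?(?:(\d+)s)?", duration)
    if match is None or not any(match.groups()):
        raise ValueError(f"unrecognised duration: {duration!r}")
    hours, minutes, seconds = (int(part or 0) for part in match.groups())
    return hours * 3600 + minutes * 60 + seconds


def speed_ratio(audio_length, processing_time):
    """Seconds of audio produced per second of processing."""
    return to_seconds(audio_length) / to_seconds(processing_time)


cpu_ratio = speed_ratio("11h20m", "34m42s")
print(f"{cpu_ratio:.1f}:1")  # 19.6:1
```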

Pride and Prejudice: **34m42s** (1h39m33s on an i7-4650U)

The Picture of Dorian Gray: **18m18s**

Crime and Punishment: **Unknown** - error converting ebook at item 7/50, sentence 19/368

##### GPU

Running the same book through the Silero model on GPU took **5m39s** to convert.

This is approximately a 120:1 ratio of audio length to processing time (11h20m of audio generated in 5m39s).

Pride and Prejudice: **5m39s**

The Picture of Dorian Gray: **4m26s**

Crime and Punishment: **Unknown** - error converting ebook
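
The Crime and Punishment runs stopped at the first failing sentence, and the cause is not recorded here. One way to locate such failures without aborting the whole conversion is sketched below, under assumptions: `synthesize_all` is a hypothetical helper, `synthesize` stands for any callable that turns text into audio (in this notebook it could wrap `model.apply_tts`), and `fake_tts` with its rejection rule is invented purely for the demo and makes no claim about how Silero behaves.

```python
def synthesize_all(sentences, synthesize):
    """Run `synthesize` on each sentence, collecting failures instead of stopping.

    `synthesize` is a hypothetical callable; in the notebook it could wrap
    model.apply_tts with the chosen speaker and sample rate.
    """
    results, failures = [], []
    for position, sentence in enumerate(sentences):
        try:
            results.append((position, synthesize(sentence)))
        except Exception as error:  # the failing exception type is not known here
            failures.append((position, sentence, repr(error)))
    return results, failures


def fake_tts(text):
    """Demo stand-in that rejects text with no letters (assumed behaviour)."""
    if not any(character.isalpha() for character in text):
        raise ValueError("nothing to pronounce")
    return [0.0] * len(text)


results, failures = synthesize_all(["Part One.", "* * *", "Chapter 1"], fake_tts)
print(len(results), failures)
```

Recording the position and text of each failure makes it possible to inspect the offending chunks afterwards and decide whether to skip or rewrite them.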