<a href="https://colab.research.google.com/github/marcosfelt/latex2speech/blob/main/tts_latex.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Text to speech for LaTeX

This notebook converts LaTeX into speech. It is useful for having your papers read back to you while editing or proofreading.

How to use:


1. Click the play button on the "Setup" cell to install the required packages.
2. From the Google Colab menu, select "Runtime" -> "Restart Runtime". This is necessary to make sure the correct versions of certain packages are used.
3. Paste your LaTeX code into the text box of the "Generate speech" cell and click play.
4. Your LaTeX is read out to you!

FAQ:

- **Does this remove citation and reference commands?** Yes, `\cite`, `\citep`, and `\ref` commands are stripped automatically.
- **How long does it take to generate speech?** The full generation pipeline runs at roughly 4x real time, so 1 minute of speech takes ~15 seconds to generate. Note that the first run takes longer because the model has to be downloaded.
- **Can I change the playback speed?** Click on the three dots in the audio player and select "Playback speed."
- **What model does this use?** It uses the [Tacotron2-DDC](https://coqui.ai/blog/tts/solving-attention-problems-of-tts-models-with-double-decoder-consistency) model (`tts_models/en/ljspeech/tacotron2-DDC`) from [Coqui-AI](https://github.com/coqui-ai/TTS).
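
As a rough illustration of the citation and reference clean-up mentioned in the FAQ, the sketch below strips `\cite`, `\citep`, and `\ref` commands with a regular expression. The helper name `strip_citations` and the single combined pattern are assumptions made for this example; the "Generate speech" cell uses its own, separate patterns.

```python
import re

def strip_citations(text: str) -> str:
    """Remove cite, citep and ref commands (hypothetical helper, not the notebook's code)."""
    # Match the command name followed by one brace group without nested braces.
    return re.sub(r"\\(?:citep|cite|ref)\{[^{}]*\}", "", text)

sample = r"We use Adam \cite{Kingma2015} as in Table \ref{tab:hyperparameters}."
cleaned = strip_citations(sample)
print(cleaned)
```

The commands are deleted without touching the surrounding whitespace, so a gap remains wherever a citation used to be.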

In [None]:
#@title Setup - Click the play icon

# Needed for inflect
import locale
locale.getpreferredencoding = lambda: "UTF-8"

# Install packages
!pip install TTS inflect

from TTS.api import TTS
import pysbd
import re
import textwrap
import inflect
from IPython.display import display, clear_output, Audio

# Conversion of numbers
p = inflect.engine()
def convert_numbers(matchobj):
    return p.number_to_words(matchobj.group(0))
clear_output(wait=True)


# Abbreviations
# Inspired by https://github.com/coqui-ai/TTS/discussions/987
abbreviations = {
    "a": "ay",
    "b": "bee",
    "c": "sieh",
    "d": "dea",
    "e": "ee",
    "f": "eff",
    "g": "jie",
    "h": "edge",
    "i": "eye",
    "j": "jay",
    "k": "kaye",
    "l": "elle",
    "m": "emme",
    "n": "en",
    "o": "owe",
    "p": "pea",
    "q": "queue",
    "r": "are",
    "s": "esse",
    "t": "tea",
    "u": "hugh",
    "v": "vee",
    "w": "doub you",
    "x": "ex",
    "y": "why",
    "z": "zee",
}

isin = lambda l, s: any([li in s for li in l])

def abbreviation_preprocessor(text: str):
  # A bit of duplicate work because tts does this as well
  seg = pysbd.Segmenter(language="en", clean=True)
  sentences = seg.segment(text)
  for i, sentence in enumerate(sentences):
    words = sentence.split(" ")
    for j, word in enumerate(words):
      # Only substitute words written in all caps
      if word.upper() == word:
        words[j] = abbreviation_replacement(word)
    sentences[i] = " ".join(words)
  return " ".join(sentences)

def abbreviation_replacement(word: str):
  """Heuristic for abbreviations"""
  subwords = word.split("-")
  for i in range(len(subwords)):
    # Only spell out acronyms without vowels
    if isin(["a", "e", "i", "o", "u"], subwords[i].lower()):
      continue
    # Spell out each character, e.g. "MPNN" -> "emme pea en en"
    subwords[i] = " ".join(
        abbreviations.get(token.lower(), token) for token in subwords[i]
    )
  # Join with spaces so that a hyphenated part kept as-is does not run into its neighbor
  return " ".join(subwords)
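
The abbreviation heuristic above can be tried out in isolation. The following is a minimal sketch under stated assumptions: `LETTER_SOUNDS` is a trimmed-down, hypothetical stand-in for the full `abbreviations` dictionary, `spell_out` and `preprocess` are illustrative names, sentence segmentation with `pysbd` is skipped, and `str.isupper` is used as a simplified all-caps check.

```python
# Hypothetical, trimmed-down letter map; the notebook's dictionary covers a-z.
LETTER_SOUNDS = {"d": "dea", "f": "eff", "m": "emme", "n": "en", "p": "pea"}
VOWELS = set("aeiou")

def spell_out(word: str) -> str:
    """Spell out hyphen-separated parts that contain no vowels."""
    parts = []
    for part in word.split("-"):
        if VOWELS & set(part.lower()):
            # Contains a vowel, so it is assumed to be pronounceable as a word.
            parts.append(part)
        else:
            parts.append(" ".join(LETTER_SOUNDS.get(ch.lower(), ch) for ch in part))
    return " ".join(parts)

def preprocess(sentence: str) -> str:
    """Apply spell_out to all-caps words only."""
    return " ".join(spell_out(w) if w.isupper() else w for w in sentence.split(" "))

result = preprocess("The D-MPNN and FFN models use Adam")
print(result)
```

Acronyms that contain a vowel are left alone on the assumption that the model can pronounce them as words; vowel-free ones such as "MPNN" are replaced letter by letter with spellings the model reads more reliably.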

In [23]:
#@title Generate speech

text = "All neural network models (FFN, MPNN and D-MPNN) were trained for 1000 epochs to minimize the mean squared error loss using the optimizer Adam \\cite{Kingma2015} and the Noam scheduler.\\cite{Vaswani2017} The best model checkpoint according to validation loss was used. The learning rate was tuned for each model. We found that using dropout after the pooling step in the MPNN and D-MPNN improved generalization performance. All the final hyperparameters can be found in Table \\ref{tab:hyperparameters}." #@param {type:"string"}
smart_abbreviations = True #@param {type:"boolean"}
# Clean up latex
# Strip latex citations and references
# Keys may contain letters, digits, commas, whitespace, hyphens, underscores and colons
text = re.sub(r"\\cite\{[A-Za-z\d,\s\-_:]+\}", "", text)
text = re.sub(r"\\citep\{[A-Za-z\d,\s\-_:]+\}", "", text)
text = re.sub(r"\\ref\{[A-Za-z\d,\s\-_:]+\}", "", text)
# Convert numbers to words
text = re.sub(r"\d+(\.\d+)?", convert_numbers, text)
# Remove random latex symbols
for s in ["$", "\\", "{", "}"]:
  text = text.replace(s, "")
text_with_abbrevs = text.replace("_", "-")
if smart_abbreviations:
  text_final = abbreviation_preprocessor(text_with_abbrevs)
else:
  text_final = text_with_abbrevs

model_name = "tts_models/en/ljspeech/tacotron2-DDC"
# gpu=True assumes a GPU runtime is selected in Colab
tts = TTS(model_name, gpu=True, progress_bar=False)
wav = tts.tts(text_final)
clear_output(wait=True)
print(" \n".join(textwrap.wrap(text_with_abbrevs, width=70)))
print()
display(Audio(wav, rate=22050))

All neural network models (FFN, MPNN and D-MPNN) were trained for one 
thousand epochs to minimize the mean squared error loss using the 
optimizer Adam  and the Noam scheduler. The best model checkpoint 
according to validation loss was used. The learning rate was tuned for 
each model. We found that using dropout after the pooling step in the 
MPNN and D-MPNN improved generalization performance. All the final 
hyperparameters can be found in Table .
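
The printed text keeps gaps where citation and reference commands were removed ("Adam  and", "Table ."). A possible post-processing step, sketched below, collapses those gaps before printing. The helper `tidy_whitespace` is hypothetical and not part of the notebook.

```python
import re

def tidy_whitespace(text: str) -> str:
    """Collapse gaps left behind after LaTeX commands are removed (hypothetical helper)."""
    text = re.sub(r"\s{2,}", " ", text)           # runs of whitespace -> one space
    text = re.sub(r"\s+([.,;:!?])", r"\1", text)  # drop whitespace before punctuation
    return text.strip()

tidied = tidy_whitespace("optimizer Adam  and the Noam scheduler. See Table .")
print(tidied)
```

This only repairs spacing. A sentence such as "found in Table." still lacks the number the `\ref` command would have produced.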

