# Title: End-to-end speech synthesis with use of architecture of FastSpeech 2

Warsaw University of Technology, *Deep Neural Networks* Postgraduate Study

Author: Katarzyna Papis, MSc.

Supervisor: Xin Chang, PhD student of Prof. Władysław Skarbek

Online version in Google Colaboratory: [here](https://colab.research.google.com/drive/1LVkl_S5l5vy88GQf4s-t1NOtyIZmx1Ay?usp=sharing)

## Abstract

This thesis presents end-to-end speech synthesis based on artificial neural networks, using the FastSpeech2 architecture. First, the introduction describes the general problem of speech synthesis, followed by descriptions of the dataset and the architectures used. The following parts cover the implementation of the model and present and discuss the results. The last part summarizes the thesis and outlines possible plans for the future.

## Table of contents
1. Introduction
2. The dataset
3. The architecture
4. Results
5. Discussion
6. Conclusion
7. References

## 1. Introduction



### 1.1. Overall view of speech synthesis



The idea for a project in the field of speech synthesis came from my personal interests: I do theoretical research on voice synthesizers, especially those used for singing. In my opinion, voice synthesis may be of major importance in the near future. Speech synthesizers are already widely used for entertainment and accessibility, and they have many other potential uses, for example as a communication tool for people who are unable to speak.

The field of speech synthesis is not new, although synthesis methods are still being developed. The first attempts date back decades, when concatenative synthesis was introduced. A few years ago, artificial neural networks for speech synthesis were introduced and became a tool for achieving similar or even better results. The challenge is to extract the essential features of natural human language in order to generate artificial speech of the best possible quality in a reasonably short time.

### 1.2. Problem definition



The principle of a neural TTS (text-to-speech) system is to generate a waveform from written text. The input character sequence is analyzed to retrieve linguistic features. This text analysis stage is the so-called *TTS front-end*. It seems to be the most challenging and time-consuming part due to the complexity of human language. The text is mapped into a hidden feature representation that contains, among other things, information about context, part of speech, duration, exceptions, and prosody.

Based on the gathered linguistic information, further processed in the form of a tensor (the hidden feature representation), acoustic features are predicted. One such acoustic feature representation is the mel-spectrogram, a spectrogram whose frequencies are converted to the mel scale. This work uses mel-spectrograms. Next, the acoustic features are processed to generate a waveform, which is then played back as audible sound.
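To make the mel-scale conversion mentioned above concrete, here is a minimal sketch of one widely used mapping between Hz and mels (the HTK-style formula). Which mel variant the project's preprocessing actually uses is not stated in this document, so the formula and the function names `hz_to_mel` and `mel_to_hz` are illustrative assumptions.

```python
import math


def hz_to_mel(freq_hz: float) -> float:
    """Convert a frequency in Hz to mels (HTK-style formula, an assumption)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def mel_to_hz(mel: float) -> float:
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


if __name__ == "__main__":
    # Equal steps in Hz become smaller steps in mels at higher frequencies.
    for freq in (100.0, 1000.0, 4000.0, 8000.0):
        print(f"{freq:7.1f} Hz -> {hz_to_mel(freq):7.1f} mel")
```

The mapping is roughly linear at low frequencies and logarithmic at high frequencies, which mirrors how listeners perceive pitch differences.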

The [Survey on Neural Speech Synthesis (2021)](https://arxiv.org/pdf/2106.15561.pdf) summarizes most of the neural TTS solutions known at the time. It shows that several different pipelines have been introduced over the years. Figure 1.1 (taken from that survey) presents the taxonomy and the possible data flows.

Most solutions consist of at least two main parts: the first creates acoustic features and the second generates the actual waveform. However, there are also models that perform the whole process at once. These are the so-called *end-to-end TTS* models: the input is text and the output is a waveform.

![](https://drive.google.com/uc?id=1YHPs1qAX_suoRjlOWI0-G4e4rwlgNJIk)

*Figure 1.1. A taxonomy of neural TTS from the perspectives of key components and data flows.*



Nowadays, most pipelines are constructed in the following way: even though the TTS system consists of a few models that work and can be trained independently, they are coupled so that the workflow closely resembles an end-to-end system. This kind of system was used in this project.





Table 1.1 below combines audio quality test results from the study by [Y. Ren, C. Hu, X. Tan, et al.](https://arxiv.org/abs/2006.04558) on FastSpeech 2 (*version 6*). Audio quality was measured with the Mean Opinion Score (MOS) test. Seven types of audio samples were presented to listeners: *GT*, the ground truth; *GT (Mel + PWG)*, ground-truth audio converted into mel-spectrograms and then back to audio using [*Parallel WaveGAN*](https://arxiv.org/abs/1910.11480); the two-step architectures [*Tacotron 2*](https://arxiv.org/pdf/1712.05884.pdf), [*Transformer TTS*](https://ojs.aaai.org/index.php/AAAI/article/view/4642), [*FastSpeech*](https://arxiv.org/pdf/1905.09263.pdf) and [*FastSpeech 2*](https://arxiv.org/abs/2006.04558); and the end-to-end solution [*FastSpeech 2s*](https://arxiv.org/abs/2006.04558).

![](https://drive.google.com/uc?id=1c33tE84of8aGbI1qiro-6r371do66DaT)

*Table 1.1. The MOS with 95% confidence intervals*

### 1.3. Reasoning for choice of architecture

The choice of architecture was determined by a few factors. One of them was the available hardware (listed in section 3.3.), which is not high-performance by today's standards. The ability to use CUDA was important because an NVIDIA graphics card was present. The first attempts were made with another, more complex architecture, and the hardware was not capable of running that model.

The next reason was the expected training time, which was reasonably short because a pretrained vocoder was used. With hyperparameters set as in 4.2. and 4.3., the training time was about 24 hours.

The author's level of experience is also not without significance. The implementation found and used has a relatively low entry barrier in terms of programming experience.

Finally, FastSpeech 2 was a new solution, published relatively recently (2020, and further developed since), and not much research has been done on it compared to, for example, Tacotron 2. This modern solution addresses some issues found in earlier works (described in sections 3.1. and 3.2.).

For these reasons I chose FastSpeech 2, which seemed the most suitable for my speech synthesis project.

## 2. The dataset






### 2.1. Based on


The structure and parameters of the dataset were based on [LJ Speech Dataset](https://keithito.com/LJ-Speech-Dataset/), which is a public domain speech dataset consisting of 13,100 short audio clips of a single speaker reading passages from 7 non-fiction books. A transcription is provided for each clip. Clips vary in length from 1 to 10 seconds and have a total length of approximately 24 hours. 

The LJ Speech Dataset is in the public domain and is commonly used by researchers and developers. It has precisely defined parameters, which are easy to follow when preparing one's own dataset for a given architecture. That is why I decided to base my own dataset on the LJ Speech Dataset parameters.

### 2.2. Data preparation

The dataset for this project was created from scratch. It consists of **724** audio clips with a total length of **2:14:58** [h:m:s]; individual clips are mostly 1 to 10 seconds long. Around 1 % of the clips are longer, up to 13 seconds.

The audio was recorded in a home environment, soundproofed as far as possible (hardware: *Sony SingStar Wired Microphone* + *USB adapter* + PC).

The audio was then preprocessed for noise reduction and cut into short clips (software: *Audacity*), and afterwards converted to the proper sampling rate (*FFmpeg*).

The texts come from different areas of life and were published at different times in order to vary the vocabulary. The books are available online thanks to *Project Gutenberg*. Additionally, a scientific thesis was used to increase the number of numerals in the data. Parts of the following publications were used:

- [Verne, J. *Twenty Thousand Leagues Under the Seas: An Underwater Tour of the World*](https://www.gutenberg.org/files/2488/2488-h/2488-h.htm). Translated by F. P. Walter. 1999.
- [Betts, G.H. *The Mind and Its Education*](https://www.gutenberg.org/files/20220/20220-h/20220-h.htm). 1906.
- [Tzu, S. *The Art of War*](https://www.gutenberg.org/files/132/132-h/132-h.htm). Translated by L. Giles. 1910.
- Papis, K. *Analysis of the impact of photovoltaic panels shading on the work of the solar power plant in LINTE^2 Laboratory*. 2016.

Each audio file is a single-channel 16-bit PCM WAV with a sampling rate of
**22050 Hz**.
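The format stated above can be checked programmatically. The following is a minimal sketch using only Python's standard `wave` module; the helper names `inspect_wav` and `matches_expected` are hypothetical and are not part of the project's code.

```python
import wave

# Target format from this section: mono, 16-bit PCM, 22050 Hz.
EXPECTED = {"channels": 1, "sample_width_bytes": 2, "sample_rate": 22050}


def inspect_wav(path):
    """Return the format parameters and the duration (seconds) of a PCM WAV file."""
    with wave.open(str(path), "rb") as wav:
        info = {
            "channels": wav.getnchannels(),
            "sample_width_bytes": wav.getsampwidth(),
            "sample_rate": wav.getframerate(),
        }
        duration = wav.getnframes() / wav.getframerate()
    return info, duration


def matches_expected(path):
    """True if the file has the channel count, bit depth and rate listed above."""
    info, _ = inspect_wav(path)
    return info == EXPECTED
```

Summing the returned durations over all clips gives the total length of the dataset, which is a quick way to cross-check the figures reported in this section.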


### 2.3. Metadata

Text data is stored in a *.csv* file with `|`-separated fields, in the following manner.

`CH01-0001|The year 1866 was marked by the bizarre development, an unexplained and downright inexplicable phenomenon that surely no one has forgotten.|The year eighteen sixty six was marked by the bizarre development, an unexplained and downright inexplicable phenomenon that surely no one has forgotten.`

- ID: this is the name of the corresponding .wav file: `CH01-0001`
- Transcription: words spoken by the reader (UTF-8): `(...) 1866 (...)`
- Normalized Transcription: transcription with numbers, ordinals, and monetary units expanded into full words (UTF-8): `(...) eighteen sixty six (...)`


Further examples:

* `B3CH03-0030|The general, unable to control his irritation, will launch his men to the assault like swarming ants.|The general, unable to control his irritation, will launch his men to the assault like swarming ants.`

* `MSC0401|All 132 panels are wired into 6 chains.|All one hundred thirty two panels are wired into six chains.`
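A metadata line in this format can be read with a few lines of Python. This is a minimal sketch; the names `MetadataEntry` and `parse_metadata_line` are hypothetical and only illustrate the three-field layout described above.

```python
from typing import NamedTuple


class MetadataEntry(NamedTuple):
    clip_id: str         # name of the corresponding .wav file, without extension
    transcription: str   # words as written, e.g. "132"
    normalized: str      # numbers expanded into words, e.g. "one hundred thirty two"


def parse_metadata_line(line: str) -> MetadataEntry:
    """Split one '|'-separated metadata line into its three fields."""
    parts = line.rstrip("\n").split("|")
    if len(parts) != 3:
        raise ValueError(f"expected 3 fields, got {len(parts)}: {line!r}")
    return MetadataEntry(*parts)


if __name__ == "__main__":
    entry = parse_metadata_line(
        "MSC0401|All 132 panels are wired into 6 chains."
        "|All one hundred thirty two panels are wired into six chains."
    )
    print(entry.clip_id, "->", entry.normalized)
```

Rejecting lines that do not have exactly three fields catches a missing separator early, before the alignment step.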

### 2.4. Alignment

Data samples were aligned using the **[Montreal Forced Aligner](https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner) (MFA)**, a command-line utility for performing forced alignment of speech datasets using [Kaldi](http://kaldi-asr.org/). Kaldi is a toolkit for speech recognition, intended for use by speech recognition researchers.

Forced alignment is a technique that takes an orthographic transcription of an audio file and generates a time-aligned version, using a pronunciation dictionary to look up the phones for each word. In this project, a pretrained English acoustic model and an English dictionary were used. MFA takes the waveforms and transcriptions as input and adapts to the specified speaker (multiple speakers can be handled in one command). After gathering the speaker information, it performs forced alignment with the Kaldi toolkit and stores the time-aligned result in TextGrid format.
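The alignment gives start and end times for each phone, whereas the acoustic model works on spectrogram frames. The sketch below shows one way to convert time intervals into per-phone frame counts. The hop length of 256 samples is an assumption (it is not stated in this document), the example intervals are invented for illustration, and `intervals_to_frames` is a hypothetical helper name.

```python
SAMPLING_RATE = 22050  # Hz, as given in section 2.2
HOP_LENGTH = 256       # samples per spectrogram frame (assumed value)


def intervals_to_frames(intervals, sampling_rate=SAMPLING_RATE, hop_length=HOP_LENGTH):
    """Convert (phone, start_s, end_s) intervals into (phone, n_frames) pairs.

    Rounding the start and end boundaries separately keeps the total number
    of frames consistent with the rounded end time of the last phone.
    """
    durations = []
    for phone, start, end in intervals:
        start_frame = round(start * sampling_rate / hop_length)
        end_frame = round(end * sampling_rate / hop_length)
        durations.append((phone, end_frame - start_frame))
    return durations


if __name__ == "__main__":
    # Invented alignment for the word "hello" (illustrative values only).
    example = [("HH", 0.00, 0.08), ("AH", 0.08, 0.15), ("L", 0.15, 0.23), ("OW", 0.23, 0.40)]
    for phone, n_frames in intervals_to_frames(example):
        print(phone, n_frames)
```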


## 3. Architecture





The whole architecture consists of two main parts: **FastSpeech2 (encoder-decoder)**, which predicts mel-spectrograms from written text, and a *pretrained* **HiFi-GAN (vocoder)**, which takes those spectrograms as input and generates waveforms. Together they form an *end-to-end speech synthesizer*.

This solution was proposed in version 1 of [FastSpeech 2: Fast and High-Quality End-to-End Text-to-Speech](https://arxiv.org/abs/2006.04558). The data flow in this project follows that solution, although some modifications were introduced, following this [PyTorch implementation](https://github.com/ming024/FastSpeech2).

### 3.1. FastSpeech2

Speech audio conveys more than the pure information in the text. Pronunciation, prosody, pitch, and other features (variance) also matter, because they make speech sound more natural.

In earlier successful TTS solutions such as [Tacotron 2](https://arxiv.org/pdf/1712.05884.pdf) or [DeepVoice 3](https://arxiv.org/pdf/1710.07654.pdf), the mel-spectrogram is generated from the input with autoregressive methods: each step depends on the previous ones. This results in long training time, slow inference, lack of robustness (word skipping and repeating), and limited controllability of the speed and prosody of the synthesized speech. These issues were addressed in [FastSpeech](https://arxiv.org/pdf/1905.09263.pdf), and the solutions were developed further in FastSpeech2 and FastSpeech2s.

The novel approach of FastSpeech2 eliminates autoregression and introduces methods for extracting specific features (phoneme duration, pitch, and energy) that affect the quality and naturalness of synthesized speech.

These features give enough information about the actual input, so there is no need to store information from previous steps in order to make a prediction. The overall architecture of FastSpeech2 is presented in Figure 3.1 below.





![](https://drive.google.com/uc?id=1y6Ya1wCygIvsrDbqoJc2Igg_AK11DK4_)

*Figure 3.1. The overall architecture for FastSpeech 2 and 2s. LR in subfigure (b) denotes the length regulator operation proposed in FastSpeech. LN in subfigure (c) denotes layer normalization. Variance predictor represents duration/pitch/energy predictor.*


First, the preprocessed, aligned data (described in section *2.4.*) is fed to the network.

**Encoder**. Takes the input phoneme (text) sequence and converts it into a hidden sequence: a hidden feature representation in the form of a torch tensor.

**Variance adaptor**. Adds variance information to the hidden sequence. The following features were used in this project (others are possible):

- phoneme duration - represents how long each speech sound lasts. Phoneme duration is extracted with MFA, converted to a phoneme-level duration sequence and fed into the Length Regulator (LR). The Length Regulator expands the hidden sequence so that each phoneme covers as many frames as it does in the mel-spectrogram sequence (some phonemes last longer than others).
- pitch - conveys emotion and affects perception. Pitch is obtained for each frame by extracting F0 from the raw waveform with the same hop size as the target mel-spectrograms.
- energy - indicates the frame-level magnitude of mel-spectrograms and directly affects the loss computed on them. Energy is obtained by computing the L2 norm of the amplitude of each STFT frame.

**Decoder**. Decodes the hidden feature representation, together with the variance information, into a mel-spectrogram sequence in parallel.
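Two of the operations described above are simple enough to sketch in plain Python: the length regulator, which repeats each phoneme-level vector according to its duration, and the frame energy, computed as the L2 norm of the STFT magnitudes of a frame. This is an illustrative sketch on nested lists, not the tensor-based code of the implementation used in the project; `length_regulate` and `frame_energy` are hypothetical names.

```python
import math


def length_regulate(hidden, durations):
    """Expand a phoneme-level sequence to frame level.

    hidden    -- list of per-phoneme feature vectors
    durations -- number of spectrogram frames each phoneme should cover
    """
    if len(hidden) != len(durations):
        raise ValueError("hidden and durations must have the same length")
    expanded = []
    for vector, n_frames in zip(hidden, durations):
        expanded.extend([vector] * n_frames)
    return expanded


def frame_energy(stft_magnitudes):
    """L2 norm of the magnitude values of each STFT frame."""
    return [math.sqrt(sum(m * m for m in frame)) for frame in stft_magnitudes]


if __name__ == "__main__":
    hidden = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]  # three phonemes
    print(len(length_regulate(hidden, [2, 1, 4])))  # seven frames in total
    print(frame_energy([[3.0, 4.0], [0.0, 0.0]]))
```

Because the expanded sequence already has the target length, the decoder can produce all frames at once instead of step by step.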



### 3.2. HiFi-GAN 

Since the output of FastSpeech2 is a mel-spectrogram, it does not produce sound by itself. A separate architecture is needed that takes a mel-spectrogram as input and generates a waveform from it. Various models are used for this purpose, among them [WaveNet](https://arxiv.org/abs/1609.03499) and [WaveGlow](https://arxiv.org/abs/1811.00002), which are commonly used with Tacotron and Tacotron2.

In this implementation, [HiFi-GAN](https://arxiv.org/pdf/2010.05646.pdf) was used. Unlike WaveNet (autoregressive) or WaveGlow (flow-based), HiFi-GAN is a Generative Adversarial Network (GAN) model. It consists of a generator and two composite discriminators, as shown in Figures 3.2 and 3.3 below.

The generator is a fully convolutional network that upsamples the input mel-spectrograms through transposed convolutions until the length of the output sequence matches the temporal resolution of the raw waveform.
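The length matching described above is a matter of simple arithmetic: the product of the upsampling factors has to equal the number of audio samples per spectrogram frame. The sketch below assumes upsampling factors of (8, 8, 2, 2) and a hop length of 256 samples; neither value is stated in this document, so treat them as illustrative.

```python
from math import prod

UPSAMPLE_RATES = (8, 8, 2, 2)  # assumed stride of each transposed convolution
HOP_LENGTH = 256               # assumed number of samples per mel frame


def waveform_length(n_mel_frames, upsample_rates=UPSAMPLE_RATES):
    """Number of audio samples produced for a given number of mel frames."""
    return n_mel_frames * prod(upsample_rates)


if __name__ == "__main__":
    # The total upsampling factor must match the hop length of the spectrogram.
    print(prod(UPSAMPLE_RATES) == HOP_LENGTH)
    n_samples = waveform_length(100)
    print(n_samples, "samples,", round(n_samples / 22050, 2), "s at 22050 Hz")
```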

![](https://drive.google.com/uc?id=1Q7qVZ_wU7lMDQcJIBnvds1hxC6OfTz4n)

*Figure 3.2. HiFi-GAN generator*

Two discriminators are used in this solution. Since audio signals are composed of periodic components with various periods, this design makes it possible to capture both periodic and long-term patterns. The Multi-Period Discriminator (MPD) is a mixture of several sub-discriminators, each handling a portion of the periodic signals of the input audio. Additionally, the Multi-Scale Discriminator (MSD) captures consecutive patterns and long-term dependencies.
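In the HiFi-GAN paper, each MPD sub-discriminator looks at equally spaced samples by folding the one-dimensional audio into a two-dimensional array whose width is the period. The sketch below illustrates only that folding step on a plain list; zero padding of the tail is a simplifying assumption, and `reshape_by_period` is a hypothetical helper name.

```python
def reshape_by_period(samples, period, pad_value=0.0):
    """Fold a 1D signal into rows of `period` samples each.

    Column j of the result holds samples j, j + period, j + 2 * period, ...
    so a filter running down a column only sees samples one period apart.
    """
    samples = list(samples)
    remainder = len(samples) % period
    if remainder:
        # Pad the tail so that the length is a multiple of the period.
        samples += [pad_value] * (period - remainder)
    return [samples[i:i + period] for i in range(0, len(samples), period)]


if __name__ == "__main__":
    rows = reshape_by_period(range(7), period=3)
    for row in rows:
        print(row)
    print("column 0:", [row[0] for row in rows])
```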


![](https://drive.google.com/uc?id=1o4-ctax5Sa92y6bFDh3IqaWo3wF2fEKd)

*Figure 3.3. HiFi-GAN discriminator. a) Second sub-discriminator of Multi Scale Discriminator (MSD). b) Second sub-discriminator of Multi Period Discriminator (MPD).*

In this implementation, a pretrained HiFi-GAN was imported. This significantly decreases training time: only FastSpeech2 is trained, learning to predict mel-spectrograms from text. Another advantage is that HiFi-GAN is integrated into the project in a way that lets it be treated as an inherent part. A single script couples both networks: FastSpeech2 (*text -> mel-spectrogram*) and HiFi-GAN (*mel-spectrogram -> waveform*). Because they are combined into one pipeline, we can talk about ***end-to-end speech synthesis***.

### 3.3. Configuration

**Hardware**. The experiments were conducted on a physical computer. The hardware configuration is listed below.

```
Processor: Intel(R) Core(TM) i7-4510U CPU @ 2.00GHz
GPU: NVIDIA GeForce GT 840M 2048 MB
```
**Software**. A virtual environment was set up in *PyCharm Community Edition 2020.2.2*.
```
Python version: 3.8.10
Cuda version: 11.4
Torch version: 1.9.1+cu111
```


## 4. Results 

Below, the results of a few experiments are presented together with the hyperparameters used. These are not all the experiments conducted; the examples show the progress towards the goal, which is speech synthesis of the highest possible quality.

The first attempts to train the network were conducted on the CPU, without CUDA. Those experiments were interrupted because the estimated time to finish was more than 3000 hours with parameters corresponding to section *4.3.* (Experiment 3).

Due to an oversight, the first few experiments were conducted with an improper train / validation split. Although the difference can be clearly heard, it is quite impressive that with such a small training dataset (`experiment 2, 212 elements`) the model was able to generate results that can be partially understood.

For purposes of this project, the model was named **Hv_demo**. Three versions of Hv_demo (three experiments) are listed below.

### 4.1. Experiment 1





Hyperparameters
```
optimizer:
  batch_size: 2
  betas: [0.9, 0.98]
  eps: 0.000000001
  weight_decay: 0.0
  grad_clip_thresh: 1.0
  grad_acc_step: 1
  warm_up_step: 400
  anneal_steps: [3000, 4000, 5000]
  anneal_rate: 0.3

step:
  total_step: 9000
  log_step: 100
  synth_step: 1000
  val_step: 1000
  save_step: 1000
```

Train dataset: `212 elements`

Validation dataset: `512 elements`

![](https://drive.google.com/uc?id=11ezokZvyk1454NmCqis30y5GVX0oWVkf)

*Figure 4.1. Experiment 1 loss*

![](https://drive.google.com/uc?id=1m7VS60tPqEJYyOtgEWRjEiUEYF1f_af7)

*Figure 4.2. Experiment 1 generated mel-spectrograms. Upper part: generated. Lower part: ground truth*

Very similar results were obtained for total number of steps `total_step: 18000` instead of `9000`.
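The `warm_up_step`, `anneal_steps` and `anneal_rate` settings listed above suggest a learning rate that first rises, then decays, and is additionally reduced at fixed steps. The sketch below shows such a schedule under the assumption that it follows the common Transformer-style warm-up rule; the base value derived from a hidden size of 256 (`d_model`) is also an assumption, since neither detail is stated in this document.

```python
def learning_rate(step, warm_up_step=400, anneal_steps=(3000, 4000, 5000),
                  anneal_rate=0.3, d_model=256):
    """Warm-up followed by inverse-square-root decay and step-wise annealing.

    Default values mirror the Experiment 1 settings; d_model is assumed.
    """
    if step < 1:
        raise ValueError("step must be >= 1")
    base = d_model ** -0.5
    # Linear increase until warm_up_step, then decay proportional to step^-0.5.
    scale = min(step ** -0.5, step * warm_up_step ** -1.5)
    # Multiply by anneal_rate once for every annealing boundary already passed.
    for boundary in anneal_steps:
        if step > boundary:
            scale *= anneal_rate
    return base * scale


if __name__ == "__main__":
    for step in (1, 200, 400, 3000, 3001, 9000):
        print(step, f"{learning_rate(step):.6f}")
```

With this rule, multiplying `warm_up_step` and `anneal_steps` by ten, as in the later experiments, stretches the same curve over ten times as many steps and lowers its peak.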

### 4.2. Experiment 2


Hyperparameters
```
optimizer:
  batch_size: 2
  betas: [0.9, 0.98]
  eps: 0.000000001
  weight_decay: 0.0
  grad_clip_thresh: 1.0
  grad_acc_step: 1
  warm_up_step: 4000
  anneal_steps: [30000, 40000, 50000]
  anneal_rate: 0.3

step:
  total_step: 90000
  log_step: 1000
  synth_step: 10000
  val_step: 10000
  save_step: 10000

```

Train dataset: `212 elements`

Validation dataset: `512 elements`


![](https://drive.google.com/uc?id=1xwwirmrESeSS-NKKI48U8H3fgBuWj8xh)

*Figure 4.3. Experiment 2 loss*

![](https://drive.google.com/uc?id=1QLFD2lou3bWq811En9mVj3zgSLaMMKxa)

*Figure 4.4. Experiment 2 generated mel-spectrograms. Upper part: generated. Lower part: ground truth*

### 4.3. Experiment 3



Hyperparameters
```
optimizer:
  batch_size: 2
  betas: [0.9, 0.98]
  eps: 0.000000001
  weight_decay: 0.0
  grad_clip_thresh: 1.0
  grad_acc_step: 1
  warm_up_step: 4000
  anneal_steps: [30000, 40000, 50000]
  anneal_rate: 0.3

step:
  total_step: 90000
  log_step: 1000
  synth_step: 10000
  val_step: 10000
  save_step: 10000


```

Train dataset: `664 elements`

Validation dataset: `60 elements`

![](https://drive.google.com/uc?id=15MzwpAfVXLeqOqYtq08Pex9jDuNrfxFF)

*Figure 4.5. Experiment 3 loss*

![](https://drive.google.com/uc?id=1gM7XLcaiyvYhsxU3jq1mbzrcHrY89fj1)

*Figure 4.6. Experiment 3 generated mel-spectrograms. Upper part: generated. Lower part: ground truth*
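A split such as the 664 / 60 one used in this experiment can be produced with a seeded shuffle, which makes the partition reproducible. This is a minimal sketch; the function name `split_dataset`, the seed, and the clip identifiers are hypothetical, and the document does not state how the actual split was drawn.

```python
import random


def split_dataset(items, n_validation, seed=1234):
    """Shuffle the items reproducibly and hold out n_validation of them."""
    if not 0 < n_validation < len(items):
        raise ValueError("n_validation must be between 1 and len(items) - 1")
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    return shuffled[n_validation:], shuffled[:n_validation]


if __name__ == "__main__":
    clip_ids = [f"clip-{i:04d}" for i in range(724)]  # placeholder identifiers
    train, validation = split_dataset(clip_ids, n_validation=60)
    print(len(train), len(validation))
```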

### 4.4. Results comparison

I decided to compare the synthesis capabilities of the models from experiments 2 and 3, which differ mainly in training dataset size. The model from experiment 1 was not capable of speech synthesis.

In both cases, the model was fed written sentences that it had never seen before. Each time, the trained model processed the sequence, extracting deep features, predicting a mel-spectrogram and generating a waveform from it. The comparison of results is presented in the [AUDIO SAMPLES NOTEBOOK](https://colab.research.google.com/drive/1Cgr0z_vuYeb_zHGAsphcTvbg98GuFQyT?usp=sharing).

The sentences fed into the network varied in length and expected intonation, and some contained numbers.

* `Hello World`

* `Hello. I am synthesized and generally cool.`

* `Well everybodys happy so there is no problem right`

* `I have 1 dog and 7 spiders`

### 4.5. Synthesis of numbers

In this section I present the capability of Hv_demo from experiment 3 (the last model) to predict mel-spectrograms for numbers and thus synthesize them into speech. The results can also be heard in the audio samples notebook mentioned above.

* `123. 234. 5964. 10000`

* `96822. 260000. 19. 648`

* `0.5 Watt and 0.46 Volt and also 0.103 Amper`

### 4.6. Speech recognition of generated sentences

An additional experiment was conducted to test whether the generated sentences can be understood by a speech recognition system. The system used was the speech-to-text function of the Google Translate application. The sentences presented in section *4.4.* were played back and captured by the application through the microphone of a smartphone produced in 2019.

* **"Hello World"**

*Experiment 2:* How do I?

*Experiment 3:* Hello world

* **"Hello (again). I am synthesized and generally cool."**

*Experiment 2:* Finger waves in the YMCA

*Experiment 3:* Hello again by I'm synthesized and generally color

* **"Well everybodys happy so there is no problem right"**

*Experiment 2:* Elizabeth said there is nothing in life

*Experiment 3*: Well that's right but this happened so there is no problem.

* **"I have 1 dog and 7 spiders"**

*Experiment 2:* Rolling Stones by David

*Experiment 3:* Is a Wonderland sense fires


It should be mentioned that for most of the played sentences more than one transcription could be obtained. Besides the synthesized audio itself, factors affecting speech recognition quality may include the quality of the speakers and the microphone, and the fact that the speech-to-text system implements algorithms that aim to construct a logical sentence from the given input.
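One way to put a number on comparisons like the ones above is the word error rate (WER): the word-level edit distance between the input sentence and the recognized text, divided by the number of words in the input. WER was not computed in the original experiment; the sketch below is only an illustration, and the tokenization (lowercasing, dropping punctuation) is an assumption.

```python
import re


def tokenize(text):
    """Lowercase the text and keep only words, digits and apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())


def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = tokenize(reference), tokenize(hypothesis)
    if not ref:
        raise ValueError("reference must contain at least one word")
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1] / len(ref)


if __name__ == "__main__":
    print(word_error_rate("Hello World", "Hello world"))
    print(word_error_rate("Hello World", "How do I?"))
```

A value of 0 means a perfect transcription; values above 1 are possible when the recognizer outputs more words than the input contained.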

### 4.7. MOS test

For the latest Hv_demo version (*Experiment 3*), a Mean Opinion Score (MOS) test was conducted. The test group consisted of listeners with various native languages, for whom English was not necessarily the mother tongue. The MOS test consisted of nine questions, each containing an audio sample and a 1-5 voting scale. Within the set, two samples came from the training dataset, two from Experiment 2, and the rest were generated in Experiment 3.

Question constructed: *Please listen to the samples and judge using a five point scale their quality and naturalness.*

Number of testers: *19*

**MOS: *3.0924***
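A MOS value is the mean of all collected votes, and it is often reported together with a 95% confidence interval, as in Table 1.1. The sketch below shows the computation with a normal approximation; the individual votes of this test are not included in this document, so the scores in the example are made up, and `mos_with_ci` is a hypothetical helper name.

```python
import math
import statistics


def mos_with_ci(scores, z=1.96):
    """Return the mean opinion score and the half-width of an approximate 95% CI."""
    if len(scores) < 2:
        raise ValueError("at least two scores are needed")
    mean = statistics.mean(scores)
    half_width = z * statistics.stdev(scores) / math.sqrt(len(scores))
    return mean, half_width


if __name__ == "__main__":
    made_up_scores = [3, 4, 2, 3, 5, 3, 4, 2, 3, 3]  # illustrative only
    mean, half_width = mos_with_ci(made_up_scores)
    print(f"MOS = {mean:.2f} +/- {half_width:.2f}")
```

With a small number of listeners the interval is wide, which is worth keeping in mind when comparing a single MOS value against published results.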

## 5. Discussion



### 5.1. Overfitting

It can be clearly heard that the model from experiment 3 synthesizes speech significantly better than the previous versions. Compared to experiment 2, it is superior in categories such as fluency, intelligibility, and naturalness.

As can be seen from the loss diagrams in section *4.2.*, the model was overfitted. It was unable to generalize, so for input sequences that differ greatly from the training dataset, the synthesized output resembles gibberish. The issue was addressed and partially eliminated in experiment 3, where a roughly three times larger training dataset was used. It can be assumed that with a dataset of even more samples, the model would generalize considerably better.

### 5.2. Numbers

The dataset contains many audio samples with numbers. Because of that, the last model is reasonably good at predicting mel-spectrograms for numeric input, as shown in section *4.5*. It can construct numbers of more than four digits, although precision decreases significantly for more complex numbers. This makes it possible to detect which numbers were not present in the dataset.

Another thing to notice is that during training the dots in decimal numbers were apparently treated as punctuation marks. Although for the decimal number `0.5` the normalized transcription was `zero point five` (which was also reflected in the audio sample), the synthesized output sounds more like `zero five`.


### 5.3. Naturalness

The model has a tendency to skip some words, which breaks the meaning. It can also be noticed that, for numbers, word endings are cut off. As a result, some sentences, even if understandable, may take on an incorrect meaning.

Since the synthesized speech is intended to be understood by people in general, the MOS test was conducted. The test group, although not large, was diverse in origin and mother tongue. The answers varied, in most cases forming a roughly Gaussian distribution over the values 1-4 or 2-5.

An interesting outcome is that even the natural speech samples from the training dataset were not always rated 5 (one vote was even a 2). This might imply that the testers, unaware of which samples were natural, were overly suspicious, and also that the quality of the dataset itself might be questionable.





## 6. Conclusion

In this work, I presented the problem of speech synthesis using the example of the FastSpeech2 architecture trained on a self-created dataset. I showed that even with an input dataset of a few hundred elements it is possible to achieve good results: synthesized sequences that can be understood and that maintain the accent and pronunciation characteristic of the speaker.

It was also shown that, thanks to the maturity of neural network tooling, which lowers the entry threshold, it was reasonably easy for a non-programmer to enter the field of speech synthesis. The topic is far from exhausted and has potential for further research. I therefore consider the synthesizer shown here a demonstration version of the product, and I intend to continue developing it with, among other things, a larger dataset and hyperparameter tuning.

## 7. References

[X. Tan, T. Qin, F. Soong, and T.-Y. Liu, *A Survey on Neural Speech Synthesis*](https://arxiv.org/pdf/2106.15561.pdf)

[K. Ito and L. Johnson, *The LJ Speech Dataset*](https://keithito.com/LJ-Speech-Dataset/)

[*Project Gutenberg*](https://www.gutenberg.org/)

[J. Shen, R. Pang, R. J. Weiss, et al., *Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions*](https://arxiv.org/pdf/1712.05884.pdf)

[R. Yamamoto, E. Song, and J.-M. Kim, *Parallel WaveGAN: A fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram*](https://arxiv.org/abs/1910.11480)

[N. Li, S. Liu, Y. Liu, S. Zhao, and M. Liu, *Neural Speech Synthesis with Transformer Network*](https://ojs.aaai.org/index.php/AAAI/article/view/4642)

[Y. Ren, Y. Ruan, X. Tan, et al., *FastSpeech: Fast, Robust and Controllable Text to Speech*](https://arxiv.org/pdf/1905.09263.pdf)

[Y. Ren, C. Hu, X. Tan, et al., *FastSpeech 2: Fast and High-Quality End-to-End Text-to-Speech*](https://arxiv.org/abs/2006.04558)

[M. McAuliffe, M. Socolof, S. Mihuc, M. Wagner, and M. Sonderegger, *Montreal Forced Aligner*](https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner)

[D. Povey, A. Ghoshal, G. Boulianne, et al., *Kaldi*](http://kaldi-asr.org/)

[C. M. Chien, J. H. Lin, C. Y. Huang, P. C. Hsu, and H. Y. Lee, *FastSpeech 2 - PyTorch Implementation*](https://github.com/ming024/FastSpeech2)

[W. Ping, K. Peng, A. Gibiansky, et al., *Deep Voice 3: Scaling Text-to-Speech with Convolutional Sequence Learning*](https://arxiv.org/pdf/1710.07654.pdf)

[A. van den Oord, S. Dieleman, H. Zen, et al., *WaveNet: A Generative Model for Raw Audio*](https://arxiv.org/abs/1609.03499)

[R. Prenger, R. Valle, and B. Catanzaro, *WaveGlow: A Flow-based Generative Network for Speech Synthesis*](https://arxiv.org/abs/1811.00002)

[J. Kong, J. Kim, and J. Bae, *HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis*](https://arxiv.org/pdf/2010.05646.pdf)

__________________________________________________________________________

Attachment: [K. Papis, *End-to-end speech synthesis with use of architecture of FastSpeech 2. Attachment - Audio samples*](https://colab.research.google.com/drive/1Cgr0z_vuYeb_zHGAsphcTvbg98GuFQyT?usp=sharing).