[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/kan-bayashi/INTERSPEECH19_TUTORIAL/blob/master/notebooks/wavenet_vocoder/wavenet_vocoder.ipynb)

# WaveNet Vocoder Recipe Demonstration

**Tomoki Hayashi**

Department of Informatics, Nagoya University  
Human Dataware Lab. Co., Ltd.

In [None]:
import time
start_time = time.time()

## Environmental setup

First, install the dependencies (this takes several minutes).

In [None]:
!apt-get install -qq -y bc tree
!git clone https://github.com/kan-bayashi/PytorchWaveNetVocoder.git -b IS19TUTORIAL
!git clone https://github.com/k2kobayashi/sprocket.git -b IS19TUTORIAL
!cd sprocket && pip install -q .
!cd PytorchWaveNetVocoder && pip install -q .
!cd PytorchWaveNetVocoder && mkdir -p tools/venv/bin && touch tools/venv/bin/activate
import sprocket, wavenet_vocoder  # check importable
!echo "Setup done!"

## What is PytorchWaveNetVocoder?

Github: [kan-bayashi/PytorchWaveNetVocoder](https://github.com/kan-bayashi/PytorchWaveNetVocoder)  
Samples: https://kan-bayashi.github.io/WaveNetVocoderSamples/

- WaveNet vocoder implementation with PyTorch
- Supports [Kaldi](https://github.com/kaldi-asr/kaldi)-like recipes, which make the results easy to reproduce
- Supports [World](https://github.com/mmorise/World) features / mel-spectrogram based models
- Supports multi-GPU training / decoding
- Supports noise shaping [[Tachibana+ 2018](https://ieeexplore.ieee.org/document/8461332)]




## What is a Kaldi-like recipe?

Key features:
- Prepared for each corpus (e.g. CMU Arctic, LJSpeech)
- Consists of several unified stages  
  (e.g. data preparation, feature extraction, and so on)
- Includes all procedures needed to reproduce the results
- All of the recipes are stored in `egs/<corpus>/<type>`.

Supported corpora:
- [CMU Arctic database](http://www.festvox.org/cmu_arctic/): `egs/arctic`, 16 kHz, English, several speakers.
- [LJ Speech database](https://keithito.com/LJ-Speech-Dataset/): `egs/ljspeech`, 22.05 kHz, English, single female speaker.
- [M-AILABS speech database](http://www.m-ailabs.bayern/en/the-mailabs-speech-dataset/): `egs/m-ailabs-speech`, 16 kHz, various speakers.

For details on the supported recipe types, see https://github.com/kan-bayashi/PytorchWaveNetVocoder/tree/master/egs

## Run the demo recipe

Let us run the demo recipe `egs/arctic/sd-mini`.

- A small version of `egs/arctic/sd`
- Uses only a subset of the utterances
- **Cannot build a good model**, but the flow is **the same**

You can understand each stage within 30 minutes!

In [None]:
# move to the recipe directory
import os
os.chdir("./PytorchWaveNetVocoder/egs/arctic/sd-mini")
!echo $(pwd)

Files in the recipe are as follows:
- `conf`: Directory including config files
- `path.sh`: Script to set the environmental variables.
- `run.sh`: Main script.

In [None]:
!tree -L 1

`conf` includes f0 setting files whose name format is `<speaker_name>.f0`.

In [None]:
!ls conf

`<speaker_name>.f0` contains `min_f0 max_f0`.  
We chose these values in advance, so you can modify them.

In [None]:
!cat conf/slt.f0  # (minf0 maxf0)
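
The `min_f0 max_f0` format described above can be read with a few lines of Python. This is a minimal sketch: the helper `parse_f0_conf` and the numbers in the example are hypothetical and are not taken from the recipe.

```python
def parse_f0_conf(text):
    """Parse the contents of a `<speaker_name>.f0` file into (min_f0, max_f0)."""
    fields = text.split()
    if len(fields) != 2:
        raise ValueError("expected 'min_f0 max_f0', got %r" % text)
    min_f0, max_f0 = (float(v) for v in fields)
    if not 0 < min_f0 < max_f0:
        raise ValueError("min_f0 must be positive and smaller than max_f0")
    return min_f0, max_f0


# hypothetical values, not the ones shipped in conf/slt.f0
print(parse_f0_conf("100 300\n"))
```

In the notebook you could pass `open("conf/slt.f0").read()` instead of the literal string.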

All of the hyperparameters are written in `run.sh`.

In [None]:
!head -n 69 run.sh

We will introduce these parameters in detail later.

In [None]:
# (Optional) here you can add your command to check the file!


### Overview of the recipe

<div align="center">
    <img src=https://github.com/kan-bayashi/INTERSPEECH19_TUTORIAL/blob/master/notebooks/wavenet_vocoder/figs/overview.png?raw=1 width=80%>
</div>

If you run `run.sh` by itself, all of the stages are performed.

You can also specify which stages to run with the `--stage` option.

- `run.sh --stage 0`: Run only stage 0.
- `run.sh --stage 012`: Run stages 0, 1, and 2.
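
One way to read the `--stage` examples above is that every character of the option names one stage. The sketch below illustrates that reading in Python; `stages_to_run` is a hypothetical helper, and the actual logic inside `run.sh` may differ.

```python
def stages_to_run(stage_option, all_stages="0123456"):
    """Return the stage numbers selected by a `--stage` string such as "012".

    Assumption: each character of the option is one stage digit (stages 0-6).
    """
    return [int(s) for s in all_stages if s in stage_option]


print(stages_to_run("0"))    # only stage 0
print(stages_to_run("012"))  # stages 0, 1, and 2
print(stages_to_run("56"))   # stages 5 and 6
```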

Here, let us run each stage step-by-step.

### Stage 0: Data preparation

This stage downloads the corpus and prepares the file lists.

<div align="center">
    <img src=https://github.com/kan-bayashi/INTERSPEECH19_TUTORIAL/blob/master/notebooks/wavenet_vocoder/figs/stage_0.png?raw=1 width=70%>
</div>

In arctic, there are seven speakers.  
Here let us use `slt` to build a model.

In [None]:
# you can specify the speaker via --spk (default=slt)
!./run.sh --stage 0 --spk slt

Corpus is saved in
- `downloads/cmu_us_<spk_name>_arctic_mini`

Two lists of wav files are created.
- `data/tr_slt/wav.scp`: wav list file for training
- `data/ev_slt/wav.scp`: wav list file for evaluation

In [None]:
!tree -L 3 -I local

The list file has the following format:
- Each line contains the path of a wav file
- All of the lines are sorted

In [None]:
!head -n 3 data/*_slt/wav.scp
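
The two properties of the list file (one path per line, sorted lines) can also be checked from Python. This is a small sketch with hypothetical helper names and made-up paths.

```python
def read_scp(text):
    """Return the non-empty lines of a list file as a list of paths."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def is_sorted(paths):
    """True if the paths are already in sorted order."""
    return paths == sorted(paths)


# hypothetical list file contents
example = "wav/arctic_a0001.wav\nwav/arctic_a0002.wav\nwav/arctic_a0003.wav\n"
paths = read_scp(example)
print(len(paths), is_sorted(paths))
```

In the notebook you could pass `open("data/tr_slt/wav.scp").read()` instead of `example`.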

Here we use 32 utterances for training and 4 for evaluation.

In [None]:
!wc -l < data/tr_slt/wav.scp
!wc -l < data/ev_slt/wav.scp

In [None]:
# (Optional) here you can check the file with your commands!


### Stage 1: Feature extraction

This stage performs feature extraction with the list file.

<div align="center">
    <img src=https://github.com/kan-bayashi/INTERSPEECH19_TUTORIAL/blob/master/notebooks/wavenet_vocoder/figs/stage_1.png?raw=1 width=70%>
</div>

In [None]:
# Hyperparameters related to stage 1
!head -n 36 run.sh | tail -n 13

In [None]:
# run stage 1 with default settings
!./run.sh --stage 1

Hyperparameters can be changed via the command line.  
Be careful: rerunning the stage overwrites the existing files.

In [None]:
# example of changing hyperparameters of feature extraction
# !./run.sh --stage 1 --mcep_dim 30 --shiftms 10

Extracted features are saved as `hdf5` in
- `hdf5/tr_slt/*.h5`: Feature file of training data
- `hdf5/ev_slt/*.h5`: Feature file of evaluation data

Lists of feature files are created
- `data/tr_slt/feats.scp`
- `data/ev_slt/feats.scp`

High pass filtered training wav files are saved in
- `wav_hpf/tr_slt/*.wav`: Filtered wav file of training data

A list of the filtered wav files is created
- `data/tr_slt/wav_hpf.scp`: List of filtered wav files

In [None]:
!tree -L 3 -I "*.f0|local|cmu_*"

Let us check the list file format:
- Each line contains the path of a feature or wav file
- All of the lines are sorted
- All of the lists are assumed to have the same order

In [None]:
!head -n 3 data/*_slt/feats.scp
!echo ""
!head -n 3 data/tr_slt/wav_hpf.scp
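
Because the lists are assumed to share the same order, a quick consistency check can catch a mismatch early. The sketch below compares utterance IDs line by line. It assumes that a feature file and its wav file share the same base name (as `arctic_a0001.h5` and `arctic_a0001.wav` do), and the helper names are hypothetical.

```python
import os


def utt_id(path):
    """Return the file name without directory and extension."""
    return os.path.splitext(os.path.basename(path))[0]


def lists_aligned(paths_a, paths_b):
    """True if both lists contain the same utterance IDs in the same order."""
    return [utt_id(p) for p in paths_a] == [utt_id(p) for p in paths_b]


feats = ["hdf5/tr_slt/arctic_a0001.h5", "hdf5/tr_slt/arctic_a0002.h5"]
wavs = ["wav_hpf/tr_slt/arctic_a0001.wav", "wav_hpf/tr_slt/arctic_a0002.wav"]
print(lists_aligned(feats, wavs))        # same order
print(lists_aligned(feats, wavs[::-1]))  # reversed order
```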

Files in hdf5 format can be loaded as `numpy.ndarray` in Python using the `h5py` library.

In [None]:
import h5py
with h5py.File("hdf5/tr_slt/arctic_a0001.h5", "r") as f:  # open read-only
    print(f.keys())
    feat = f["world"][()]
# or you can use our utils
from wavenet_vocoder.utils import read_hdf5
feat = read_hdf5("hdf5/tr_slt/arctic_a0001.h5", "world")
print("Feature shape: (#num_frames=%d, #feature_dims=%d)" % (feat.shape[0], feat.shape[1]))

The features are extracted with World and concatenated in the following order:
- `U/V binary` (1 dim)
- `continuous F0` (1 dim)
- `mcep` (25 dim = `mcep_dim + 1`)
- `ap` (1 dim)
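
The column ranges implied by this layout can be computed rather than hard-coded. The sketch below assumes the features are concatenated in the order listed above and that `mcep_dim` is 24 (inferred from "25 dim = `mcep_dim + 1`"); `world_feature_slices` is a hypothetical helper.

```python
def world_feature_slices(mcep_dim=24, ap_dim=1):
    """Return column slices for each component of the concatenated feature."""
    n_mcep = mcep_dim + 1  # mel-cepstrum includes the 0th coefficient
    return {
        "uv": slice(0, 1),
        "cont_f0": slice(1, 2),
        "mcep": slice(2, 2 + n_mcep),
        "ap": slice(2 + n_mcep, 2 + n_mcep + ap_dim),
    }


slices = world_feature_slices()
frame = list(range(28))  # stand-in for one 28-dimensional feature frame
for name, sl in slices.items():
    print(name, frame[sl])
```

With a loaded feature matrix, `feat[:, slices["mcep"]]` would select the mel-cepstrum columns.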

In [None]:
import matplotlib.pyplot as plt
plt.figure(figsize=(16, 9))
plt.subplot(2, 2, 1)
plt.plot(feat[:, 0])
plt.title("U/V binary")
plt.subplot(2, 2, 3)
plt.plot(feat[:, 1])
plt.title("Continuous F0")
plt.subplot(2, 2, 2)
plt.imshow(feat[:, 2:27].T, aspect="auto")  # all 25 mel-cepstrum dims (columns 2-26)
plt.title("Mel-cepstrum")
plt.subplot(2, 2, 4)
plt.plot(feat[:, -1])
plt.title("Aperiodicity")
plt.tight_layout()
plt.show()

In [None]:
# (Optional) here you can check the file with your commands!


### Stage 2: Statistics calculation

This stage calculates the mean and variance of features.

<div align="center">
    <img src=https://github.com/kan-bayashi/INTERSPEECH19_TUTORIAL/blob/master/notebooks/wavenet_vocoder/figs/stage_2.png?raw=1 width=70%>
</div>

In [None]:
# run stage 2 with default settings
!./run.sh --stage 2

Calculated statistics are saved as `hdf5` format in
- `data/tr_slt/stats.h5`

`stats.h5` is used for:
- Feature normalization during training
- Calculation of noise shaping filter coefficient
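
Feature normalization with such statistics usually means subtracting the mean and dividing by the scale for each dimension. The dependency-free sketch below illustrates that idea on toy data. It assumes that `scale` is the per-dimension standard deviation, which the text does not state explicitly, and both helper names are hypothetical.

```python
import math


def fit_stats(frames):
    """Compute per-dimension mean and scale (standard deviation) of frames."""
    n = len(frames)
    dims = range(len(frames[0]))
    mean = [sum(f[d] for f in frames) / n for d in dims]
    var = [sum((f[d] - mean[d]) ** 2 for f in frames) / n for d in dims]
    # a constant dimension has zero deviation; use 1.0 to avoid dividing by zero
    scale = [math.sqrt(v) or 1.0 for v in var]
    return mean, scale


def normalize(frames, mean, scale):
    """Apply (x - mean) / scale to every frame."""
    return [[(x - m) / s for x, m, s in zip(f, mean, scale)] for f in frames]


frames = [[1.0, 10.0], [3.0, 10.0]]  # toy data: 2 frames, 2 dims
mean, scale = fit_stats(frames)
print(mean, scale)
print(normalize(frames, mean, scale))
```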

In [None]:
!tree -L 3 -I "*.f0|*.wav|*[0-9].h5|local|cmu_*"

`stats.h5` can be loaded as follows:

In [None]:
with h5py.File("data/tr_slt/stats.h5", "r") as f:  # open read-only
    print(f.keys())
    print(f['world'].keys())
    mean = f['world']['mean'][()]
    scale = f['world']['scale'][()]
    print(mean.shape)
    print(scale.shape)

# or you can use our utils
mean = read_hdf5("data/tr_slt/stats.h5", "world/mean")
scale = read_hdf5("data/tr_slt/stats.h5", "world/scale")

In [None]:
# here you can check the file with your commands!


### Stage 3: Noise weighting

This stage applies a noise weighting filter to the training wav files.

<div align="center">
    <img src=https://github.com/kan-bayashi/INTERSPEECH19_TUTORIAL/blob/master/notebooks/wavenet_vocoder/figs/stage_3.png?raw=1 width=70%>
</div>

In [None]:
# Hyperparameters related to stage 3
!head -n 38 run.sh | tail -n 2

In [None]:
# run stage 3 with default settings
!./run.sh --stage 3

If `use_noise_shaping=false`, `stage 3` will be skipped.

Noise weighting filtered wav files are saved in
- `wav_nwf/tr_slt/*.wav`

The list of noise weighting filtered wav files is saved as
- `data/tr_slt/wav_nwf.scp`

In [None]:
!tree -L 3 -I "*.f0|*[0-9].h5|local|cmu_*"

Let us check the difference of waveform here.

In [None]:
# listen to the samples
import IPython.display
IPython.display.display(IPython.display.Audio("wav_hpf/tr_slt/arctic_a0001.wav"))
IPython.display.display(IPython.display.Audio("wav_nwf/tr_slt/arctic_a0001.wav"))

In [None]:
# show spectrogram
import soundfile as sf
import matplotlib.pyplot as plt
x, fs = sf.read("wav_hpf/tr_slt/arctic_a0001.wav")
x_ns, fs = sf.read("wav_nwf/tr_slt/arctic_a0001.wav")
plt.figure(figsize=(16, 7))
plt.subplot(1, 2, 1)
plt.specgram(x, Fs=fs)
plt.title("Original spectrogram")
plt.subplot(1, 2, 2)
plt.specgram(x_ns, Fs=fs)
plt.title("Noise weighting filtered spectrogram")

Filtering-related parameters `mlsa/coef` and `mlsa/alpha` are added to `data/tr_slt/stats.h5`.

In [None]:
with h5py.File("data/tr_slt/stats.h5", "r") as f:  # open read-only
    print(f.keys())
    print(f["mlsa"].keys())
    print(f["mlsa"]["alpha"])
    print(f["mlsa"]["coef"])

`mlsa/coef` is the coefficient of the MLSA filter, which is calculated from the averaged mel-cepstrum and `mag`.  
`mlsa/alpha` is the hyperparameter `alpha`, the all-pass filter coefficient.

In [None]:
# (Optional) here you can check the file with your commands!


### Stage 4: WaveNet training

This stage trains WaveNet using the extracted features and the noise weighting filtered wav files.

<div align="center">
    <img src=https://github.com/kan-bayashi/INTERSPEECH19_TUTORIAL/blob/master/notebooks/wavenet_vocoder/figs/stage_4.png?raw=1 width=70%>
</div>

In [None]:
# Hyperparameters related to stage 4
!head -n 59 run.sh | tail -n 19

In [None]:
# run stage 4 with default settings
!./run.sh --stage 4 --iters 500

Default network structure in `egs/arctic/sd-mini`.
<div align="center">
    <img src=https://github.com/kan-bayashi/INTERSPEECH19_TUTORIAL/blob/master/notebooks/wavenet_vocoder/figs/wavenet.png?raw=1 width=70%>
</div>

Example when `dilation_depth=3` and `dilation_repeat=2`.

<div align="center">
    <img src=https://github.com/kan-bayashi/INTERSPEECH19_TUTORIAL/blob/master/notebooks/wavenet_vocoder/figs/structure_ex.png?raw=1 width=45%>
</div>
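
The dilation pattern for given `dilation_depth` and `dilation_repeat` values can be written out explicitly. The sketch below assumes the usual WaveNet convention in which the dilation doubles at each layer and restarts at every repeat, and it counts only the dilated convolution stack when computing the receptive field. Both helper names are hypothetical.

```python
def dilations(dilation_depth, dilation_repeat):
    """Dilation of every layer: 1, 2, 4, ... repeated `dilation_repeat` times."""
    return [2 ** i for i in range(dilation_depth)] * dilation_repeat


def receptive_field(dilation_depth, dilation_repeat, kernel_size=2):
    """Number of samples seen by the stack of dilated convolutions."""
    return (kernel_size - 1) * sum(dilations(dilation_depth, dilation_repeat)) + 1


print(dilations(3, 2))
print(receptive_field(3, 2))
```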

A batch is made by splitting a waveform into pieces.
- `batch_size`: Number of pieces in a batch
- `batch_length`: Length of each piece

<div align="center">
    <img src=https://github.com/kan-bayashi/INTERSPEECH19_TUTORIAL/blob/master/notebooks/wavenet_vocoder/figs/batch.png?raw=1 width=65%>
</div>
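
The idea of cutting a waveform into fixed-length pieces and grouping them can be sketched as follows. This is a simplified illustration, not the recipe's implementation: it assumes non-overlapping pieces, drops the remainder, and reads `batch_size` as the number of pieces per batch. The helper names are hypothetical.

```python
def split_into_pieces(waveform, batch_length):
    """Cut a waveform into consecutive pieces of `batch_length` samples."""
    n_pieces = len(waveform) // batch_length  # the remainder is dropped
    return [waveform[i * batch_length:(i + 1) * batch_length]
            for i in range(n_pieces)]


def make_batches(pieces, batch_size):
    """Group pieces into batches of at most `batch_size` pieces."""
    return [pieces[i:i + batch_size] for i in range(0, len(pieces), batch_size)]


waveform = list(range(10))  # stand-in for audio samples
pieces = split_into_pieces(waveform, batch_length=3)
print(pieces)
print(make_batches(pieces, batch_size=2))
```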

Model parameters are saved as  
- `exp/tr_arctic_16k_sd_world_slt_*/checkpoint-*.pkl`

Model configuration is saved as  
- `exp/tr_arctic_16k_sd_world_slt_*/model.conf`

The directory name is automatically set to be unique depending on hyperparameters.
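
Since the directory name encodes the hyperparameters, it can be split back into abbreviation and value pairs. The sketch below only separates the tokens and does not interpret what each abbreviation stands for; `parse_tag` is a hypothetical helper.

```python
import re


def parse_tag(tag):
    """Split a tag such as "nq256_ks2" into {"nq": "256", "ks": "2"}.

    Tokens without a numeric value are skipped.
    """
    params = {}
    for token in tag.split("_"):
        m = re.fullmatch(r"([a-z]+)(\d.*)", token)
        if m:
            params[m.group(1)] = m.group(2)
    return params


tag = "nq256_na28_nrc32_nsc16_ks2_dp5_dr1_lr1e-4_wd0.0_bl10000_bs1"
print(parse_tag(tag))
```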

In [None]:
!tree -L 3 -I "*.f0|*.wav|*[0-9].h5|*.scp|*.log|local|cmu_*"

Model configuration file can be loaded as `argparse.Namespace`.

In [None]:
import torch
conf = torch.load("exp/tr_arctic_16k_sd_world_slt_nq256_na28_nrc32_nsc16_ks2_dp5_dr1_lr1e-4_wd0.0_bl10000_bs1_ns_up/model.conf")
print(conf)

Model parameters `checkpoint-*.pkl` can be loaded as a `dict` that contains the following information:
- `iterations`: Number of training iterations at which the parameters were saved
- `optimizer`: `dict` of the optimizer states
- `model`: `OrderedDict` of the model parameters

In [None]:
state_dict = torch.load("exp/tr_arctic_16k_sd_world_slt_nq256_na28_nrc32_nsc16_ks2_dp5_dr1_lr1e-4_wd0.0_bl10000_bs1_ns_up/checkpoint-500.pkl")
print(state_dict.keys())
print(state_dict["iterations"])
print(state_dict["optimizer"].keys())
print(state_dict["model"].keys())

You can resume training from a `checkpoint-*.pkl` file with the `--resume` option.

In [None]:
!./run.sh --stage 4 \
    --iters 1000 \
    --resume exp/tr_arctic_16k_sd_world_slt_nq256_na28_nrc32_nsc16_ks2_dp5_dr1_lr1e-4_wd0.0_bl10000_bs1_ns_up/checkpoint-500.pkl
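
When resuming, you usually want the checkpoint with the largest iteration number. Sorting the file names as strings would place `checkpoint-1000.pkl` before `checkpoint-500.pkl`, so the sketch below compares the numbers instead. `latest_checkpoint` is a hypothetical helper, and names without a number are ignored.

```python
import re


def latest_checkpoint(filenames):
    """Return the `checkpoint-<iterations>.pkl` name with the largest number."""
    best = None
    for name in filenames:
        m = re.fullmatch(r"checkpoint-(\d+)\.pkl", name)
        if m and (best is None or int(m.group(1)) > best[0]):
            best = (int(m.group(1)), name)
    return None if best is None else best[1]


names = ["checkpoint-500.pkl", "checkpoint-1000.pkl",
         "checkpoint-final.pkl", "model.conf"]
print(latest_checkpoint(names))
```

In the notebook you could pass `os.listdir(<experiment directory>)` as `filenames`.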

You can train on multiple GPUs with the `--n_gpus` option.

In [None]:
# In colab, we can use only a single gpu :(
# batch_size must be >= n_gpus
# !./run.sh --stage 4 --n_gpus 2 --batch_size 2

In [None]:
# here you can check the file with your commands!


### Stage 5: WaveNet decoding

This stage performs decoding of evaluation data.

<div align="center">
    <img src=https://github.com/kan-bayashi/INTERSPEECH19_TUTORIAL/blob/master/notebooks/wavenet_vocoder/figs/stage_5.png?raw=1 width=70%>
</div>

In [None]:
# Hyperparameters related to stage 5
!head -n 69 run.sh | tail -n 9

In [None]:
# run stage 5 with default setting
!./run.sh --stage 5

You can specify the `checkpoint-*.pkl` file used for decoding and the output directory via the `--checkpoint` and `--outdir` options.

In [None]:
# this takes time, so it is commented out
# !./run.sh --stage 5 \
#     --checkpoint exp/tr_arctic_16k_sd_world_slt_nq256_na28_nrc32_nsc16_ks2_dp5_dr1_lr1e-4_wd0.0_bl10000_bs1_ns_up/checkpoint-100.pkl \
#     --outdir exp/tr_arctic_16k_sd_world_slt_nq256_na28_nrc32_nsc16_ks2_dp5_dr1_lr1e-4_wd0.0_bl10000_bs1_ns_up/wav_ckpt_100

You can also decode on multiple GPUs with the `--n_gpus` option.

In [None]:
# In colab, we can use only a single gpu :(
# !./run.sh --stage 5 --n_gpus 2

Generated wav files are saved in
- `exp/tr_arctic_16k_sd_world_slt_*/wav`

In [None]:
!tree exp/tr_arctic_16k_sd_world_slt_nq256_na28_nrc32_nsc16_ks2_dp5_dr1_lr1e-4_wd0.0_bl10000_bs1_ns_up

In [None]:
# (Optional) here you can check the file with your commands!


### Stage 6: Noise shaping

This stage applies noise shaping filter to generated wav files.

<div align="center">
    <img src=https://github.com/kan-bayashi/INTERSPEECH19_TUTORIAL/blob/master/notebooks/wavenet_vocoder/figs/stage_6.png?raw=1 width=70%>
</div>

In [None]:
# run stage 6 with default setting
!./run.sh --stage 6

Restored wav files are saved in

- `exp/tr_arctic_16k_sd_world_slt_*/wav_nsf`

In [None]:
!tree exp/tr_arctic_16k_sd_world_slt_nq256_na28_nrc32_nsc16_ks2_dp5_dr1_lr1e-4_wd0.0_bl10000_bs1_ns_up

In [None]:
# (Optional) here you can check the file with your commands!


Finished! Unfortunately, the generated samples sound just like noise.  
So let us check the samples from the model trained with `egs/arctic/sd` at  
https://kan-bayashi.github.io/WaveNetVocoderSamples/

## Use pretrained model as vocoder

Here we show how to use a pretrained model as a vocoder.  
We need to prepare the following three files:

- `model.conf`: Model configuration file.
- `checkpoint-*.pkl`: Model parameter file.
- `stats.h5`: Statistics file.

Let us pack these files into the `pretrained_model/` directory.

In [None]:
# collect the trained model files in one directory
!mkdir pretrained_model
!cp -v exp/tr_arctic_16k_sd_world_slt_nq256_na28_nrc32_nsc16_ks2_dp5_dr1_lr1e-4_wd0.0_bl10000_bs1_ns_up/stats.h5 \
    exp/tr_arctic_16k_sd_world_slt_nq256_na28_nrc32_nsc16_ks2_dp5_dr1_lr1e-4_wd0.0_bl10000_bs1_ns_up/model.conf \
    exp/tr_arctic_16k_sd_world_slt_nq256_na28_nrc32_nsc16_ks2_dp5_dr1_lr1e-4_wd0.0_bl10000_bs1_ns_up/checkpoint-1000.pkl \
    pretrained_model

First, please prepare the list file of feature files to be decoded.

In [None]:
import os
import numpy as np

# here we make dummy features and store them as hdf5 with key "/world"
os.makedirs("dummy", exist_ok=True)
for idx, n_frames in enumerate([10, 20, 30, 40]):
    x = np.random.randn(n_frames, 28)  # (#num_frames, #feature_dims)
    with h5py.File("dummy/dummy_%d.h5" % idx, "w") as f:
        f["world"] = x

# make a list of features to be decoded.
!find dummy -name "*.h5" > dummy_feats.scp

# check
!cat dummy_feats.scp
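
Note that `find` does not guarantee any particular order, whereas the list files created by the recipe are sorted. Whether decoding depends on the order is not stated here, but a sorted list is easy to produce in Python. This is a minimal sketch; `write_feats_scp` is a hypothetical helper, and the example runs in a temporary directory with empty placeholder files.

```python
import tempfile
from pathlib import Path


def write_feats_scp(feat_dir, scp_path):
    """Write the sorted paths of all `*.h5` files under feat_dir, one per line."""
    paths = sorted(str(p) for p in Path(feat_dir).rglob("*.h5"))
    Path(scp_path).write_text("".join(p + "\n" for p in paths))
    return paths


with tempfile.TemporaryDirectory() as tmp:
    # create empty placeholder files in a deliberately unsorted order
    for idx in (2, 0, 1):
        (Path(tmp) / ("dummy_%d.h5" % idx)).touch()
    paths = write_feats_scp(tmp, Path(tmp) / "dummy_feats.scp")
    print([Path(p).name for p in paths])
```

In the notebook the equivalent call would be `write_feats_scp("dummy", "dummy_feats.scp")`.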

Run stages 5 and 6 (`--stage 56`) in the recipe directory, specifying `--feats`.

In [None]:
# decode with pretrained model through the recipe
!./run.sh --stage 56 \
    --outdir dummy_feats_wav \
    --feats dummy_feats.scp \
    --checkpoint pretrained_model/checkpoint-1000.pkl
!ls dummy_feats_wav*

If you want to use the model outside of the recipe, directly call the Python scripts stored in `wavenet_vocoder/bin`.

In [None]:
# decode with pretrained model
!python ../../../wavenet_vocoder/bin/decode.py \
     --feats dummy_feats.scp \
     --outdir dummy_feats_wav_2 \
     --checkpoint pretrained_model/checkpoint-1000.pkl \
     --fs 16000 \
     --n_gpus 1 \
     --batch_size 4
# make list of wav files to be filtered
!find dummy_feats_wav_2 -name "*.wav" > dummy_feats_wav_2/wav.scp
# apply noise shaping filter
!python ../../../wavenet_vocoder/bin/noise_shaping.py \
     --waveforms dummy_feats_wav_2/wav.scp \
     --outdir dummy_feats_wav_2_nsf \
     --stats pretrained_model/stats.h5 \
     --fs 16000 \
     --shiftms 5
!ls dummy_feats_wav_2*

## Combine with Sprocket

Let us show how to combine the WaveNet vocoder with the voice conversion toolkit [sprocket](https://github.com/k2kobayashi/sprocket).  
Here, we generate converted voice with pretrained models.

In [None]:
# change directory
!mkdir ../../../../conversion_example
os.chdir("../../../../conversion_example")
!pwd

First, download pretrained models.

In [None]:
# download sprocket model
!../PytorchWaveNetVocoder/wavenet_vocoder/utils/download_from_google_drive.sh \
    "https://drive.google.com/open?id=1PiGDyYDQt0b4h6KAV1MOmDxHjHUv1cT6" \
    downloads/sprocket_pretrained

# download wavenet vocoder model
!../PytorchWaveNetVocoder/wavenet_vocoder/utils/download_from_google_drive.sh \
    "https://drive.google.com/open?id=1AhtRB0vTkjDrum-dfgaiXnQgsAAiYMGW" \
    downloads/wavenet_vocoder_pretrained

# download wav samples
!../PytorchWaveNetVocoder/wavenet_vocoder/utils/download_from_google_drive.sh \
    "https://drive.google.com/open?id=1kBwF7ejyCR5aI9FitmMSCnWdPCNVouqg"

- Sprocket pretrained model
    - `GMM_mcep.pkl`: GMM param file for mcep conversion.
    - `<src_spk>.yml`: Source speaker yaml file.
    - `<src_spk>-<tar_spk>.yml`: Source-target speaker pair yaml file.
    - `<src_spk>.h5`: Statistics file of source speaker.
    - `<tar_spk>.h5`: Statistics file of target speaker.
    - `cvgv.h5`: Statistics file of global variance for converted features.
    
- Target speaker WaveNet vocoder pretrained model
    - `model.conf`: Model configuration file.
    - `checkpoint-*.pkl`: Model parameter file.
    - `stats.h5`: Statistics file.

In [None]:
!ls downloads/*pretrained

Next, extract features and then convert them to those of the target speaker.

In [None]:
![ ! -e hdf5 ] && mkdir hdf5
![ ! -e wav ] && mkdir wav
!PYTHONPATH=../sprocket/example/src \
    python ../sprocket/sprocket/bin/convert_feats.py \
        --cvmcep0th True \
        --cvcodeap True \
        --cvgvstats downloads/sprocket_pretrained/cvgv.h5 \
        --org_yml downloads/sprocket_pretrained/rms.yml \
        --pair_yml downloads/sprocket_pretrained/rms-slt.yml \
        --org_stats downloads/sprocket_pretrained/rms.h5 \
        --tar_stats downloads/sprocket_pretrained/slt.h5 \
        --mcepgmmf downloads/sprocket_pretrained/GMM_mcep.pkl \
        --iwav downloads/samples/src/arctic_b0536.wav \
        --cvfeats hdf5/arctic_b0536.h5 \
        --owav wav/arctic_b0536.wav
!ls hdf5 wav

Then generate a waveform with the pretrained WaveNet using the converted features.

In [None]:
# NOTE: this takes a long time.
# decode with wavenet vocoder
!find hdf5 -name "*.h5" > hdf5/feats.scp
!python ../PytorchWaveNetVocoder/wavenet_vocoder/bin/decode.py \
     --feats hdf5/feats.scp \
     --outdir wav_wnv \
     --checkpoint downloads/wavenet_vocoder_pretrained/checkpoint-final.pkl \
     --fs 16000 \
     --n_gpus 1 \
     --batch_size 4
# apply noise shaping filter
!find wav_wnv -name "*.wav" > wav_wnv/wav.scp
!python ../PytorchWaveNetVocoder/wavenet_vocoder/bin/noise_shaping.py \
     --waveforms wav_wnv/wav.scp \
     --outdir wav_wnv_nsf \
     --stats downloads/wavenet_vocoder_pretrained/stats.h5 \
     --fs 16000 \
     --shiftms 5

In [None]:
# listen to pre-synthesized ones
import IPython.display
print("Source")
IPython.display.display(IPython.display.Audio("downloads/samples/src/arctic_b0536.wav"))
print("Target")
IPython.display.display(IPython.display.Audio("downloads/samples/tar/arctic_b0536.wav"))
print("Converted voice with vocoder")
IPython.display.display(IPython.display.Audio("downloads/samples/vocoder/arctic_b0536.wav"))
print("Converted voice with wavenet vocoder")
IPython.display.display(IPython.display.Audio("downloads/samples/wavenet_vocoder/arctic_b0536.wav"))

In [None]:
print("running time = %.1f minutes" % ((time.time() - start_time) / 60))

## Conclusion

- Introduced voice conversion with direct waveform modeling
- Introduced Sprocket / PytorchWaveNetVocoder
    - Can build GMM-based VC / DIFFVC & WaveNet vocoder
    - Can combine both modules to generate high-quality converted voices

Thank you for your attendance!  
If you have time, please send us feedback via [Google form](https://forms.gle/28QrvGRBAAiKpWas8).