# Demonstration book of Blow

This demonstration book will:
    
1. Define components in the model
2. Load pre-trained model and do voice conversion


A few notes:
* The official implementation is at https://github.com/joansj/blow
* The model here is re-implemented for tutorial purposes
* It is NOT intended to surpass the official implementation

Blow is very similar to WaveGlow and Glow, but it is for voice conversion.

Modules for Blow are defined in `../sandbox/block_blow.py`. This notebook imports those modules and demonstrates their usage.

The project to train and run Blow on the VCTK database is available in `../project/05-nn-vocoder/blow`. However, I didn't prepare a link to download the training data. Let me know if you are interested in training it on VCTK.


This notebook will not explain the modules in detail.

## 1. Blow structure

Blow is similar to WaveGlow but:
* Does NOT use WaveNet layers to compute affine transformation parameters;
* Does NOT use early outputs;
* Applies a squeeze in every block, which squeezes the tensor by a factor of 2;
* Uses ActNorm in each flow step, as Glow does;
* Input waveform has fixed length $T=4096$;


The official implementation uses 8 blocks; each block contains a squeeze operation and 12 flow steps.

```sh
.          |------- Blow block 1 -------|  |------- Blow block 2 -------|   ... |------- Blow block 8 -------|

                   (B, T/2, 2)                     (B, T/4, 4)                      (B, T/256, 256)         (B, T/256, 256)
            ---------    ---------------    ---------    ---------------        ---------    ---------------    
Waveform -->|squeeze| -> |12 flow steps| -> |squeeze| -> |12 flow steps| -> ... |squeeze| -> |12 flow steps| -> z
(B, T, 1)   ---------    ---------------    ---------    ---------------        ---------    ---------------    
                               ^                               ^                                    ^
            -----------        |                               |                                    |
speaker ID->|embedding|------------------------------------------------------------------------------   
            -----------     (B, 1, D)
```
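
The shapes in the diagram follow from squeezing by a factor of 2 once per block. The sketch below is only an illustration: it assumes a squeeze that folds pairs of adjacent time steps into the channel dimension, `squeeze_by_two` and `unsqueeze_by_two` are hypothetical helpers, and the element ordering used in `block_blow.py` may differ.

```python
import torch


def squeeze_by_two(x):
    """(B, T, C) -> (B, T/2, 2C): fold adjacent time steps into channels."""
    batch, length, channels = x.shape
    assert length % 2 == 0, "length must be even"
    return x.reshape(batch, length // 2, 2 * channels)


def unsqueeze_by_two(x):
    """(B, T/2, 2C) -> (B, T, C): inverse of squeeze_by_two."""
    batch, length, channels = x.shape
    assert channels % 2 == 0, "channel number must be even"
    return x.reshape(batch, length * 2, channels // 2)


# a dummy batch of waveforms with the fixed length T = 4096
wav = torch.randn(2, 4096, 1)

x = wav
shapes = []
for _ in range(8):          # one squeeze per Blow block
    x = squeeze_by_two(x)
    shapes.append(tuple(x.shape))
print(shapes[0], shapes[-1])  # -> (2, 2048, 2) (2, 16, 256)

# undo the 8 squeezes to check that no information is lost
y = x
for _ in range(8):
    y = unsqueeze_by_two(y)
```

With $T=4096$ and 8 blocks, the latent z therefore has 16 time steps and 256 channels, and the squeeze itself is a lossless rearrangement.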

Each flow step is
```sh
.
                                              (B, T/P, P/2)                      ----------------------    
             ------------    ---------     |------------------------------------>| Affine transform   |  
    input -> |invertible| -> |ActNorm|  ---|  (B, T/P, P/2)                      |       ra + b       |    -----------
 (B, T/P, P) | 1x1 conv |    ---------     |        ----------     -----------   |  / a (B, T/P, P/2) | -> | Concate | -> output
             ------------                  |------> | conv1d |  -> | conv1ds |---|->|                 |    -----------  (B, T/P, P)
                                              |     ----------     -----------   |  \ b (B, T/P, P/2) |         ^
                                              |          ^                        ----------------------        |
               ----------     conv1d weights  -----------|------------------------------------------------------|
 speaker_emb ->| FC     | -------------------------------|
 (B, 1, D)     ---------- 
```
The FC layer takes speaker_emb as input and predicts the filter weights of the conv1d layer.
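
A convolution whose kernel is predicted from the speaker embedding can be sketched as follows. This is a minimal illustration and not the code in `block_blow.py`: the class name `HyperConv1d`, the separate FC layer for the bias, and the grouped-convolution trick are all assumptions made for this example.

```python
import torch
import torch.nn as torch_nn
import torch.nn.functional as torch_nn_func


class HyperConv1d(torch_nn.Module):
    """Conv1d whose kernel is predicted from a speaker embedding (sketch)."""
    def __init__(self, emb_dim, in_ch, out_ch, kernel_size):
        super().__init__()
        self.in_ch, self.out_ch, self.ks = in_ch, out_ch, kernel_size
        # FC layers that map the embedding to the conv weights and bias
        self.fc_w = torch_nn.Linear(emb_dim, out_ch * in_ch * kernel_size)
        self.fc_b = torch_nn.Linear(emb_dim, out_ch)

    def forward(self, x, spk_emb):
        # x: (B, T, in_ch), spk_emb: (B, 1, emb_dim)
        batch, length, _ = x.shape
        # one kernel per utterance: (B * out_ch, in_ch, kernel_size)
        weight = self.fc_w(spk_emb).view(
            batch * self.out_ch, self.in_ch, self.ks)
        bias = self.fc_b(spk_emb).view(batch * self.out_ch)
        # move the batch into the channel axis: (1, B * in_ch, T)
        x = x.permute(0, 2, 1).reshape(1, batch * self.in_ch, length)
        # groups=batch: each utterance is filtered by its own kernel
        y = torch_nn_func.conv1d(
            x, weight, bias, padding=self.ks // 2, groups=batch)
        # back to (B, T, out_ch)
        return y.view(batch, self.out_ch, length).permute(0, 2, 1)


# usage with dummy data: P/2 = 4 input channels, D = 128
layer = HyperConv1d(emb_dim=128, in_ch=4, out_ch=8, kernel_size=3)
x = torch.randn(2, 64, 4)
emb = torch.randn(2, 1, 128)
y = layer(x, emb)
print(y.shape)  # -> torch.Size([2, 64, 8])
```

Because the kernel depends only on the embedding, two utterances with different speaker IDs are filtered differently even though the layer's trainable parameters are shared.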

Since most of the techniques have been covered in `s4_demonstration_waveglow`, I will not explain the modules one by one.

## 2. Voice conversion demonstration

Unlike WaveGlow, voice conversion with Blow needs both "analysis" and "synthesis":
* "Analysis": given the input waveform and its speaker ID, apply the forward transformation and get the latent z
* "Synthesis": given the latent z and the target speaker ID, apply the inverse transformation and get the converted waveform

```sh
.
           |------- Blow block 1 -------|  |------- Blow block 2 -------|   ... |------- Blow block 8 -------|

                   (B, T/2, 2)                     (B, T/4, 4)                      (B, T/256, 256)         (B, T/256, 256)
            ---------    ---------------    ---------    ---------------        ---------    ---------------    
Waveform -->|squeeze| -> |12 flow steps| -> |squeeze| -> |12 flow steps| -> ... |squeeze| -> |12 flow steps| -> z
(B, T, 1)   ---------    ---------------    ---------    ---------------        ---------    ---------------    |
                               ^                               ^                                    ^           |
            -----------        |                               |                                    |           |
speaker ID->|embedding|------------------------------------------------------------------------------           |
            -----------     (B, 1, D)                                                                           |
                                                                                                                |
                                                                                                                |
Converted   ---------    ---------------    ---------    ---------------        ---------    ---------------    |
Waveform <--|squeeze| <- |12 flow steps| <- |squeeze| <- |12 flow steps| <- ... |squeeze| <- |12 flow steps| <- |
(B, T, 1)   ---------    ---------------    ---------    ---------------        ---------    ---------------    
                               ^                               ^                                    ^
target      -----------        |                               |                                    |
speaker ID->|embedding|------------------------------------------------------------------------------   
            -----------     (B, 1, D)
```
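
The analysis and synthesis steps can be tried on a toy model before loading Blow. The sketch below is a stand-in under strong assumptions: `ToyConditionalFlow` is a hypothetical one-step flow with a per-speaker scale and shift, not the Blow network, and it only mirrors the `forward` / `reverse` calling pattern.

```python
import torch
import torch.nn as torch_nn


class ToyConditionalFlow(torch_nn.Module):
    """One-step invertible transform conditioned on speaker ID (toy)."""
    def __init__(self, num_speakers):
        super().__init__()
        # per-speaker log-scale and shift, playing the role of the embedding
        self.log_scale = torch_nn.Embedding(num_speakers, 1)
        self.shift = torch_nn.Embedding(num_speakers, 1)

    def forward(self, wav, spk_id):
        """Analysis: waveform (B, T, 1) and speaker ID (B,) -> latent z."""
        a = torch.exp(self.log_scale(spk_id)).unsqueeze(1)   # (B, 1, 1)
        b = self.shift(spk_id).unsqueeze(1)                  # (B, 1, 1)
        return (wav - b) / a

    def reverse(self, z, spk_id):
        """Synthesis: latent z and speaker ID -> waveform (B, T, 1)."""
        a = torch.exp(self.log_scale(spk_id)).unsqueeze(1)
        b = self.shift(spk_id).unsqueeze(1)
        return z * a + b


torch.manual_seed(0)
flow = ToyConditionalFlow(num_speakers=4)
wav = torch.randn(2, 4096, 1)
src_id = torch.tensor([0, 0])
tar_id = torch.tensor([3, 3])

with torch.no_grad():
    z = flow(wav, src_id)                 # analysis with the source speaker
    converted = flow.reverse(z, tar_id)   # synthesis with the target speaker
    same = flow.reverse(z, src_id)        # source speaker again
```

Synthesis with the source speaker ID gives back the input, while synthesis with a different ID changes the signal. In Blow, the same pattern is applied with a much deeper flow.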

In [1]:
# load packages 
from __future__ import absolute_import
from __future__ import print_function
import os
import sys
import numpy as np
import torch
import torch.nn as torch_nn
import torch.nn.functional as torch_nn_func

# basic nn blocks
import sandbox.block_blow as nii_blow
import sandbox.block_nn as nii_nn
# misc functions for this demonstration book

from plot_tools import plot_API
from plot_tools import plot_lib
import tool_lib
import plot_lib as plot_lib_legacy
import matplotlib
import matplotlib.pyplot as plt
matplotlib.rcParams['figure.figsize'] = (10, 5)

### 2.1 Speaker information related to VCTK

In [2]:
# just for convenience

# define a parser to load speaker IDs
class VCTKSpeakerMap:
    def __init__(self):
        speaker_list = 'data_models/pre_trained_blow/vctk-blow/scp/spk.lst'
        self.m_speaker_map = {}
        with open(speaker_list, 'r') as file_ptr:
            for idx, line in enumerate(file_ptr):
                line = line.rstrip('\n')
                self.m_speaker_map[line] = idx
        return

    def num(self):
        # leave one for unseen
        return len(self.m_speaker_map) + 1

    def parse(self, filename, return_idx=True):
        # 
        # filename will be in this format: '8758,p***_***,1,4096,16000'
        # we need to get the p*** part
        spk = filename.split('_')[0].split(',')[-1]
        if return_idx:
            return self.get_idx(spk)
        else:
            return spk
        
    def get_idx(self, spk):
        if spk in self.m_speaker_map:
            return self.m_speaker_map[spk]  
        else:
            return len(self.m_speaker_map)


class PrjConfig:
    def __init__(self):
        self.wav_samp_rate = 16000
        self.options = {'speaker_map': VCTKSpeakerMap(),
                         'conversion_map': {'p361': 'p245',
                                            'p278': 'p287',
                                            'p302': 'p298',
                                            'p361': 'p345',
                                            'p260': 'p267',
                                            'p273': 'p351',
                                            'p245': 'p273',
                                            'p304': 'p238',
                                            'p297': 'p283',
                                            'p246': 'p362'}}



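The conversion_map defines the source : target speaker pairs. Note that `p361` appears twice as a key; a Python dict literal keeps only the last entry, so `p361` is mapped to `p345` and the pair `p361` : `p245` is dropped.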

This conversion map is derived from the samples on https://blowconversions.github.io/
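
The string handling in `VCTKSpeakerMap.parse` and the behavior of a dict with a repeated key can be checked in isolation. In the sketch below, `parse_speaker` is a hypothetical standalone copy of the rule used in `parse`, and the list of pairs is only one possible way to keep two targets for the same source speaker, not the format used by the project.

```python
def parse_speaker(fileinfo):
    """Return the speaker name (p***) from a file-info string or file name."""
    return fileinfo.split('_')[0].split(',')[-1]


# the same rule works for the three kinds of strings used in this notebook
print(parse_speaker('8758,p278_04851,1,4096,16000'))  # -> p278
print(parse_speaker('p278_04851.wav'))                # -> p278
print(parse_speaker('p287'))                          # -> p287

# a repeated key in a dict literal keeps only the last value
demo_map = {'p361': 'p245', 'p361': 'p345'}
print(demo_map)                                       # -> {'p361': 'p345'}

# a list of (source, target) pairs keeps both conversions
demo_pairs = [('p361', 'p245'), ('p361', 'p345')]
targets = [tar for src, tar in demo_pairs if src == 'p361']
print(targets)                                        # -> ['p245', 'p345']
```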

### 2.2 Model Wrapper

This wrapper uses Blow defined in `../sandbox/block_blow.py`. 

This wrapper is used in `../project/05-nn-vocoder/blow`

In [3]:
# Wrapper model
class Model(torch_nn.Module):
    """ Model definition
    """
    def __init__(self, in_dim, out_dim, args, prj_conf, mean_std=None):
        super(Model, self).__init__()

        #################
        ## must-have
        #################
        # mean std of input and output
        in_m, in_s, out_m, out_s = self.prepare_mean_std(in_dim,out_dim,\
                                                         args, mean_std)
        self.input_mean = torch_nn.Parameter(in_m, requires_grad=False)
        self.input_std = torch_nn.Parameter(in_s, requires_grad=False)
        self.output_mean = torch_nn.Parameter(out_m, requires_grad=False)
        self.output_std = torch_nn.Parameter(out_s, requires_grad=False)
        self.input_dim = in_dim
        self.output_dim = out_dim
        
        # a flag for debugging (by default False)    
        self.model_debug = False
        
        #################
        ## model config
        #################        
        # waveform sampling rate
        self.sample_rate = prj_conf.wav_samp_rate

        # load speaker map
        self.speaker_map = prj_conf.options['speaker_map']
        self.speaker_num = self.speaker_map.num()
        if 'conversion_map' in prj_conf.options:
            self.conversion_map = prj_conf.options['conversion_map']
        else:
            self.conversion_map = None

        self.cond_dim = 128
        self.num_block = 8
        self.num_flow_steps_perblock = 12
        self.num_conv_channel_size = 512
        self.num_conv_conv_kernel = 3


        self.m_spk_emd = torch.nn.Embedding(self.speaker_num, self.cond_dim)

        self.m_blow = nii_blow.Blow(
            self.cond_dim, self.num_block, 
            self.num_flow_steps_perblock,
            self.num_conv_channel_size,
            self.num_conv_conv_kernel)

        # only used for synthesis
        self.m_overlap = nii_blow.OverlapAdder(4096, 4096//4, False)
        # done
        return
    
    def prepare_mean_std(self, in_dim, out_dim, args, data_mean_std=None):
        """
        """
        if data_mean_std is not None:
            in_m = torch.from_numpy(data_mean_std[0])
            in_s = torch.from_numpy(data_mean_std[1])
            out_m = torch.from_numpy(data_mean_std[2])
            out_s = torch.from_numpy(data_mean_std[3])
            if in_m.shape[0] != in_dim or in_s.shape[0] != in_dim:
                print("Input dim: {:d}".format(in_dim))
                print("Mean dim: {:d}".format(in_m.shape[0]))
                print("Std dim: {:d}".format(in_s.shape[0]))
                print("Input dimension incompatible")
                sys.exit(1)
            if out_m.shape[0] != out_dim or out_s.shape[0] != out_dim:
                print("Output dim: {:d}".format(out_dim))
                print("Mean dim: {:d}".format(out_m.shape[0]))
                print("Std dim: {:d}".format(out_s.shape[0]))
                print("Output dimension incompatible")
                sys.exit(1)
        else:
            in_m = torch.zeros([in_dim])
            in_s = torch.ones([in_dim])
            out_m = torch.zeros([out_dim])
            out_s = torch.ones([out_dim])
            
        return in_m, in_s, out_m, out_s
        
    def normalize_input(self, x):
        """ normalizing the input data
        """
        return (x - self.input_mean) / self.input_std

    def normalize_target(self, y):
        """ normalizing the target data
        """
        return (y - self.output_mean) / self.output_std

    def denormalize_output(self, y):
        """ denormalizing the generated output from network
        """
        return y * self.output_std + self.output_mean

    def forward(self, wav, fileinfo):
        """loss = forward(self, wav, fileinfo)

        input
        -----
          wav: tensor, input waveform (batchsize, length, 1)
               it should be raw waveform, float valued, between (-1, 1)

          fileinfo: list, file information for each data in the batch

        output
        ------
          loss: list, [[-logp_z, -log_detjac], [True, True]]
        
        Note: returned loss can be directly used as the loss value
        no need to write Loss()
        """
        # prepare speaker IDs
        # (batch, )
        speaker_ids = torch.tensor(
            [self.speaker_map.parse(x) for x in fileinfo],
            dtype=torch.long, device=wav.device)
        # convert to embeddings
        # (batch, 1, cond_dim)
        speaker_emd = self.m_spk_emd(speaker_ids).unsqueeze(1)

        # normalize conditional feature (not used here)
        #input_feat = self.normalize_input(input_feat)
        # compute the latent z and the log-likelihood terms
        z, neg_logp, logp_z, log_detjac = self.m_blow(wav, speaker_emd)
        return [[-logp_z, -log_detjac], [True, True]]

    def convert(self, wav, src_id, tar_id):
        """wav_new = convert(wav, src_id, tar_id)

        input
        -----
          wav: tensor, source waveform (batchsize, length, 1)
          src_id: int, ID of source speaker
          tar_id: int, ID of target speaker
          
        output
        ------
          wav_new: tensor, same shape
        """ 
        # framing the input waveform into frames
        #   m_overlap.forward does framing
        # framed_wav (batchsize, frame_num, frame_length)
        framed_wav = self.m_overlap(wav)
        batch, frame_num, frame_len = framed_wav.shape
        
        # change frames into batch
        # framed_wav (batchsize * frame_num, frame_length, 1)
        framed_wav = framed_wav.view(-1, frame_len).unsqueeze(-1)

        
        # source speaker IDs
        # (batch, )
        speaker_ids = torch.tensor([src_id for x in wav],
                                   dtype=torch.long, device=wav.device)
        # (batch * frame_num)
        speaker_ids = speaker_ids.repeat_interleave(frame_num)
        # get embeddings (batch * frame_num, 1, cond_dim)
        speaker_emd = self.m_spk_emd(speaker_ids).unsqueeze(1)
        
        
        # target speaker IDs
        tar_speaker_ids = torch.tensor([tar_id for x in wav],
                                   dtype=torch.long, device=wav.device)
        # (batch * frame_num)
        tar_speaker_ids = tar_speaker_ids.repeat_interleave(frame_num)
        target_speaker_emb = self.m_spk_emd(tar_speaker_ids).unsqueeze(1)
        
        # analysis 
        z, _, _, _ = self.m_blow(framed_wav, speaker_emd)
        
        # synthesis
        # output_framed (batch * frame, frame_length, 1)
        output_framed = self.m_blow.reverse(z, target_speaker_emb)

        # overlap and add
        # view -> (batch, frame_num, frame_length)
        return self.m_overlap.reverse(
            output_framed.view(batch, -1, frame_len), True)
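
The reshaping in `convert` treats every frame as an independent batch item, so the speaker IDs must be repeated in the same order as the frames. The sketch below illustrates this alignment only. It assumes plain framing with `Tensor.unfold` and a frame shift of 1024 (reading the second argument `4096//4` of `OverlapAdder` as a frame shift is an assumption); the framing inside `nii_blow.OverlapAdder` may pad or window the signal differently.

```python
import torch

frame_len, frame_shift = 4096, 1024   # assumed values for this sketch

# two dummy utterances: (batch, length, 1)
wav = torch.randn(2, 4096 + 3 * 1024, 1)

# framing with Tensor.unfold: (batch, frame_num, frame_len)
framed = wav.squeeze(-1).unfold(1, frame_len, frame_shift)
batch, frame_num, _ = framed.shape

# merge batch and frame axes: (batch * frame_num, frame_len, 1)
# (reshape is used because the output of unfold is not contiguous)
framed = framed.reshape(-1, frame_len).unsqueeze(-1)

# one speaker ID per utterance, repeated for every frame of that utterance
speaker_ids = torch.tensor([7, 9])
frame_ids = speaker_ids.repeat_interleave(frame_num)
print(framed.shape, frame_ids.tolist())
# -> torch.Size([8, 4096, 1]) [7, 7, 7, 7, 9, 9, 9, 9]
```

`repeat_interleave` rather than `repeat` is the right choice here: the frames of the first utterance come first in the merged batch, so its ID must occupy the first `frame_num` positions.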

Load model

In [4]:
prj_config = PrjConfig()

# output dimension = 1 for waveform
wave_dim = 1

# declare the model
m_blow = Model(wave_dim, wave_dim, None, prj_config)

In [5]:
# load pre-trained model
device = torch.device("cpu")
m_blow.to(device, dtype=torch.float32)

pretrained_file = "data_models/pre_trained_blow/__pre_trained/trained_network.pt"
if os.path.isfile(pretrained_file):
    checkpoint = torch.load(pretrained_file, map_location="cpu")
    m_blow.load_state_dict(checkpoint)
else:
    print("Cannot find pre-trained model {:s}".format(pretrained_file))
    print("Please run 00_download_model.sh and download the pre-trained model")

### 2.3 Load waveform and do conversion

In [6]:
import tool_lib

data_dir = 'data_models/pre_trained_blow/vctk-blow/vctk_wav_test_tiny'

wavefiles = ['p361_01198.wav', 'p278_04851.wav', 'p302_01863.wav', 'p361_00375.wav', 'p260_01623.wav', 
             'p273_04605.wav', 'p245_05208.wav', 'p304_00078.wav', 'p297_06758.wav', 'p246_00375.wav']

# Choose one file example:
idx = 1
src_wav_name = wavefiles[idx]

# get source and target speaker ID
src_spk_name = prj_config.options['speaker_map'].parse(src_wav_name, return_idx=False)
src_spk_id = prj_config.options['speaker_map'].parse(src_wav_name)
tar_spk_id = prj_config.options['speaker_map'].parse(prj_config.options['conversion_map'][src_spk_name])


# do conversion
sr, src_wav = tool_lib.waveReadAsFloat(data_dir + '/' + src_wav_name)
with torch.no_grad():
    src_wav_tensor = torch.tensor(src_wav, dtype=torch.float32).unsqueeze(0).unsqueeze(-1)
    converted_wav = m_blow.convert(src_wav_tensor, src_spk_id, tar_spk_id)

In [7]:
import IPython.display

IPython.display.display("Example {:d}".format(idx+1))
IPython.display.display("Source waveform", src_wav_name)
IPython.display.display(IPython.display.Audio(src_wav, rate=sr, normalize=False))


# target speaker waveform (for reference)
tar_data_dir = 'data_models/pre_trained_blow/vctk-blow/target_wav'
for wavename in os.listdir(tar_data_dir):
    tmpID = prj_config.options['speaker_map'].parse(wavename)
    if tmpID == tar_spk_id:
        sr, tar_wav = tool_lib.waveReadAsFloat(tar_data_dir + '/' + wavename)
        IPython.display.display("Target speaker data", wavename)
        IPython.display.display(IPython.display.Audio(tar_wav, rate=sr, normalize=False))

IPython.display.display("Converted waveform")
IPython.display.display(IPython.display.Audio(converted_wav[0, :, 0].numpy(), rate=sr, normalize=False))


'Example 2'

'Source waveform'

'p278_04851.wav'

'Target speaker data'

'p287_11336.wav'

'Converted waveform'

You may compare the samples with those on the official Blow webpage https://blowconversions.github.io/. Just find the corresponding "Example N" on that page.


### 2.4 Convert the voice-converted waveform back?

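Yes. Because the flow is invertible, the converted waveform can be passed through the model again with the source and target speaker IDs swapped.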

```sh
.          |------- Blow block 1 -------|  |------- Blow block 2 -------|   ... |------- Blow block 8 -------|

                   (B, T/2, 2)                     (B, T/4, 4)                      (B, T/256, 256)         (B, T/256, 256)
Converted   ---------    ---------------    ---------    ---------------        ---------    ---------------    
Waveform -->|squeeze| -> |12 flow steps| -> |squeeze| -> |12 flow steps| -> ... |squeeze| -> |12 flow steps| -> z
(B, T, 1)   ---------    ---------------    ---------    ---------------        ---------    ---------------    |
                               ^                               ^                                    ^           |
target      -----------        |                               |                                    |           |
speaker ID->|embedding|------------------------------------------------------------------------------           |
            -----------     (B, 1, D)                                                                           |
                                                                                                                |
                                                                                                                |
Recovered   ---------    ---------------    ---------    ---------------        ---------    ---------------    |
Waveform <--|squeeze| <- |12 flow steps| <- |squeeze| <- |12 flow steps| <- ... |squeeze| <- |12 flow steps| <- |
(B, T, 1)   ---------    ---------------    ---------    ---------------        ---------    ---------------    
                               ^                               ^                                    ^
original    -----------        |                               |                                    |
speaker ID->|embedding|------------------------------------------------------------------------------   
            -----------     (B, 1, D)
```

In [8]:
with torch.no_grad():
    src_wav_tensor_reversed = m_blow.convert(converted_wav, tar_spk_id, src_spk_id)

In [9]:
import IPython.display

IPython.display.display("Example {:d}".format(idx+1))
IPython.display.display("Source waveform", src_wav_name)
IPython.display.display(IPython.display.Audio(src_wav, rate=sr, normalize=False))

IPython.display.display("Converted waveform")
IPython.display.display(IPython.display.Audio(converted_wav[0, :, 0].numpy(), rate=sr, normalize=False))

IPython.display.display("De-Converted waveform")
IPython.display.display(IPython.display.Audio(src_wav_tensor_reversed[0, :, 0].numpy(), rate=sr, normalize=False))


'Example 2'

'Source waveform'

'p278_04851.wav'

'Converted waveform'

'De-Converted waveform'
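
Besides listening, the de-converted waveform can be compared with the source numerically. The helper below is a hypothetical addition for this purpose, not part of the project: `reconstruction_snr` computes a signal-to-noise ratio after trimming both signals to a common length, since framing may pad or truncate the waveform. Whether the round trip through `convert` is numerically exact depends on the overlap-add step, which this sketch makes no assumption about.

```python
import numpy as np


def reconstruction_snr(reference, recovered, eps=1e-12):
    """SNR in dB between a reference waveform and its reconstruction."""
    reference = np.asarray(reference, dtype=np.float64).ravel()
    recovered = np.asarray(recovered, dtype=np.float64).ravel()
    # trim to the common length in case framing padded or truncated the signal
    length = min(reference.shape[0], recovered.shape[0])
    reference, recovered = reference[:length], recovered[:length]
    noise_power = np.sum((reference - recovered) ** 2)
    signal_power = np.sum(reference ** 2)
    return 10.0 * np.log10((signal_power + eps) / (noise_power + eps))


# self-contained check with a synthetic signal (noise at 10% amplitude)
rng = np.random.default_rng(0)
clean = rng.standard_normal(16000)
noisy = clean + 0.1 * rng.standard_normal(16000)
snr_db = reconstruction_snr(clean, noisy)
print("{:.1f} dB".format(snr_db))

# in this notebook the call would be, for example:
# reconstruction_snr(src_wav, src_wav_tensor_reversed[0, :, 0].numpy())
```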

## Final note

This wrapper uses Blow defined in `../sandbox/block_blow.py`. 

This wrapper is used in `../project/05-nn-vocoder/blow`.

The script in `../project/05-nn-vocoder/blow` does not download the VCTK database. It requires some manual work to change file names. Please check `00_demo.sh` in that directory.


That's all!