## 3. INTERMEDIATE BLOCKS



We present several types of intermediate blocks based on different design strategies. We first present time-distributed blocks and then present time-frequency blocks.


### 3.1 Time-Distributed Blocks

Some existing models use CNNs (e.g., [16]) for intermediate blocks to extract timbre features of the target source.
However, the authors of [8] reported that conventional CNN kernels are limited for this task.
They found that long-range correlations exist along the frequency axis in the spectrogram of voice signals, which Fully-connected Neural Networks (FCNs) can efficiently capture.
They proposed a speech enhancement model named Phasen, whose Frequency Transformation Block (FTB) contains a single-layered FCN without bias. This FCN is applied to each frame of the internal representation in a **time-distributed** manner.


Inspired by the FTB, we introduce **time-distributed** blocks, which are applied to each frame of a spectrogram-like feature map independently. These blocks aim to extract time-independent features that help singing voice separation without using inter-frame operations.
We first introduce an FCN-based block and then propose an alternative time-distributed block based on 1-D CNNs. 
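
The time-distributed property can be checked numerically. The following is a minimal sketch, not the authors' code: it assumes a feature map laid out as `[B, C, T, F]` (the layout used by the code in this section), and the tensor sizes are illustrative. A bias-free `nn.Linear` acts only on the last (frequency) axis, so perturbing one frame should change the output of that frame and no other.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

B, C, T, F = 2, 3, 5, 8               # illustrative sizes, not taken from the paper
layer = nn.Linear(F, F, bias=False)   # operates on the last (frequency) axis only

x = torch.randn(B, C, T, F)
x_perturbed = x.clone()
x_perturbed[:, :, 2, :] += 1.0        # modify frame t = 2 only

with torch.no_grad():
    y = layer(x)
    y_perturbed = layer(x_perturbed)

# Collect the frames whose output differs between the two runs.
changed_frames = [
    t for t in range(T)
    if not torch.allclose(y[:, :, t], y_perturbed[:, :, t])
]
print(changed_frames)
```

Only the perturbed frame index should be reported, which is what "without using inter-frame operations" means in practice. Note that batch normalization in training mode shares statistics across frames, so this check is only exact for the linear map itself.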



#### 3.1.1 Time-Distributed Fully-connected networks

We present an FCN-based intermediate block, called Time-Distributed Fully-connected network (TDF). 
As illustrated in Figure 3, a TDF block is applied to each channel of each frame separately and identically. 

![](img/tdf.png)
Figure 3. Time-Distributed Fully-connected network

Suppose that the $l$-th intermediate block in our U-Net structure transforms an input $X^{(l-1)}$ into an output $X^{(l)}$.
A fully-connected network is applied separately and identically to each frame (i.e., $X^{(l-1)}[i,j,:]$) in order to transform the input tensor in a time-distributed fashion. 
While an FTB of Phasen [8] is single-layered, a TDF block can be either single- or multi-layered. Each layer is defined as consecutive operations: a fully-connected layer, Batch Norm (BN) [17], and ReLU [15]. If it is multi-layered, then each internal layer maps an input to the hidden feature space, and its final layer maps the internal vector to $\mathbb{R}^{F^{(l)}}$.
The number of hidden units is $\lfloor F^{(l)}/bf \rfloor$, where $bf$ denotes the bottleneck factor. A two-layered TDF with $bf > 2$ has fewer parameters than a single-layered one, since it needs about $2 (F^{(l)})^2 / bf$ weights instead of $(F^{(l)})^2$. We investigate the effect of adding additional layers in §4.2.
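
The parameter saving can be made concrete with a small calculation. This is a sketch under stated assumptions: the helper `tdf_weight_count` is hypothetical, it counts only the weights of the linear layers (no biases, no BN parameters), and the `min_bn_units` floor mirrors the argument of the same name in the TDF code of this section. The choice of 1024 frequency bins is illustrative.

```python
def tdf_weight_count(f, bf=None, min_bn_units=16):
    """Linear-layer weights in a TDF over f frequency bins (biases and BN excluded)."""
    if bf is None:
        return f * f                      # single layer: f -> f
    hidden = max(f // bf, min_bn_units)   # bottleneck width
    return 2 * f * hidden                 # two layers: f -> hidden -> f

f = 1024                                  # illustrative number of frequency bins
single = tdf_weight_count(f)
two_layer = tdf_weight_count(f, bf=16)
break_even = tdf_weight_count(f, bf=2)

print(single)      # 1048576
print(two_layer)   # 131072
print(break_even)  # 1048576
```

With $bf = 2$ the two-layered block costs exactly as much as the single-layered one, and any larger $bf$ reduces the count (until the `min_bn_units` floor takes effect).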

In [1]:
import torch
import torch.nn as nn

class TDF(nn.Module):
    ''' [B, in_channels, T, F] => [B, in_channels, T, F] '''
    def __init__(self, channels, f, bf=16, bias=False, min_bn_units=16):
        
        '''
        channels: number of channels
        f: number of frequency bins
        bf: bottleneck factor. if None: single layer. else: MLP that maps f => max(f//bf, min_bn_units) => f 
        bias: bias setting of linear layers
        min_bn_units: lower bound on the number of hidden (bottleneck) units
        '''
        
        super(TDF, self).__init__()

        if(bf is None):
            self.tdf = nn.Sequential(
                nn.Linear(f,f, bias),
                nn.BatchNorm2d(channels),
                nn.ReLU()
            )
        
        else:
            bn_units = max(f//bf, min_bn_units)  # hidden size, floored at min_bn_units
            self.tdf = nn.Sequential(
                nn.Linear(f,bn_units, bias),
                nn.BatchNorm2d(channels),
                nn.ReLU(),
                nn.Linear(bn_units,f,bias),
                nn.BatchNorm2d(channels),
                nn.ReLU()
            )
            
    def forward(self, x):
        return self.tdf(x)

#### 3.1.2 Time-Distributed Convolutions


We propose an alternative time-distributed block named Time-Distributed Convolutions (TDC), which is applied separately and identically to each multi-channeled frame. 
It is a series of 1-D convolution layers.
Inspired by [5,6], it takes the form of a **dense block** [18]. A dense block consists of densely connected composite layers, where each composite layer is defined as three consecutive operations: 1-D convolution, BN, and ReLU.
As discussed in [5,6,18], the densely connected structure lets the gradient propagate directly to all preceding layers, which makes training deep CNNs more efficient. 
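
Dense connectivity means that each composite layer receives the block input concatenated with the outputs of all earlier layers, so its input width grows by the growth rate per layer. The sketch below tracks that bookkeeping. The helper names are hypothetical, the sizes are illustrative, and the weight count covers convolution kernels only (no biases, no BN parameters).

```python
def dense_block_channels(in_channels, num_layers, gr):
    """Input channel count seen by each composite layer of a dense block."""
    return [in_channels + i * gr for i in range(num_layers)]

def conv1d_weight_count(in_channels, num_layers, gr, kf):
    """Conv1d kernel weights summed over all layers (biases and BN excluded)."""
    widths = dense_block_channels(in_channels, num_layers, gr)
    return sum(c * gr * kf for c in widths)

channels = dense_block_channels(in_channels=24, num_layers=5, gr=24)
weights = conv1d_weight_count(in_channels=24, num_layers=5, gr=24, kf=3)

print(channels)  # [24, 48, 72, 96, 120]
print(weights)   # 25920
```

Each layer still emits only `gr` channels, which keeps the cost of deep blocks moderate even though later layers see wide inputs.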


![](img/tdc.png)
Figure 4. Time-Distributed Convolutions

In [2]:
class TDC(nn.Module):
    '''
    [B, in_channels, T, F] => [B, out_channels (= gr), T, F] 
    We set the number of output channels to be the same as the growth rate
    '''
    def __init__(self, in_channels, num_layers, gr, kf):
        
        '''
        in_channels: number of input channels
        num_layers: number of densely connected conv layers
        gr: growth rate
        kf: kernel size of the freq. axis
        '''
        
        super(TDC, self).__init__()

        c = in_channels
        self.H = nn.ModuleList()
        for i in range(num_layers):
            self.H.append(
                nn.Sequential(
                    nn.Conv1d(in_channels=c, out_channels=gr,kernel_size=kf,stride=1,padding=kf//2),
                    nn.BatchNorm1d(gr),
                    nn.ReLU(),
                )
            )
            c += gr

    def forward(self, x):
        '''[B, in_channels, T, F] => [B, out_channels (= gr), T, F] '''
        
        B, _, T, F = x.shape
        x = x.transpose(-2,-3)   # B, T, c, F
        x = x.reshape(B*T,-1,F)  # BT, c, F
        
        x_ = self.H[0](x)
        for h in self.H[1:]:
            x = torch.cat((x_, x), 1)
            x_ = h(x)  

        x_ = x_.view(B,T,-1,F)   # B, T, c, F
        x_ = x_.transpose(-2,-3) # B, c, T, F
        return x_
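
The fold-and-unfold reshaping used in `TDC.forward` can be checked in isolation. The following is a minimal sketch with illustrative sizes and a single `nn.Conv1d` standing in for the dense block: it folds the time axis into the batch axis, applies the convolution, unfolds the result, and compares one frame against processing that frame on its own.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

B, C, T, F = 2, 4, 6, 16          # illustrative sizes
out_channels = 8
x = torch.randn(B, C, T, F)

# Fold: [B, C, T, F] -> [B, T, C, F] -> [B*T, C, F], one multi-channel frame per row.
frames = x.transpose(1, 2).reshape(B * T, C, F)

conv = nn.Conv1d(C, out_channels, kernel_size=3, padding=1)
with torch.no_grad():
    y = conv(frames)                                   # [B*T, out_channels, F]

# Unfold: [B*T, out_channels, F] -> [B, T, out_channels, F] -> [B, out_channels, T, F]
y = y.view(B, T, out_channels, F).transpose(1, 2)

# Processing a single frame alone gives the same values as the batched pass.
with torch.no_grad():
    y_single = conv(x[1, :, 3, :].unsqueeze(0))        # [1, out_channels, F]
same = torch.allclose(y[1, :, 3, :], y_single[0], atol=1e-5)
print(y.shape, same)
```

The transpose before the reshape matters: reshaping `[B, C, T, F]` directly to `[B*T, C, F]` would mix channels and frames.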

### 3.2 Time-Frequency Blocks

The performances of U-Nets with time-distributed blocks were above our expectation (see §4.2), but were still inferior considerably to those of current SOTA methods. The reason is that features observed in musical sources include sequential patterns (e.g., vibrato, tremolo, and crescendo) or musical patterns (e.g., rhythm, repetitive structure), which cannot be modeled by time-distributed blocks.

While time-distributed blocks cannot model the temporal context, time-frequency blocks try to extract features considering both the time and the frequency dimensions.
We introduce the Time-Frequency Convolutions (TFC) block, which is used in [5]. 
We also propose two novel blocks that combine two different transformations.




#### 3.2.1 Time-Frequency Convolutions

A Time-Frequency Convolutions (TFC) block is a dense block of 2-D convolution layers, as shown in Figure 5.
The dense block consists of densely connected composite layers, where each layer is defined as three consecutive operations: 2-D convolution, BN, and ReLU.   
It is applied to the spectrogram-like input representation in the time-frequency domain.
Every convolution layer in a dense block has kernels of size $(k_F, k_T)$. Its 2-D filters are trained to jointly capture features along both frequency and temporal axes.

![](img/tfc.png)

Figure 5. Time-Frequency Convolutions 
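
Because a dense block concatenates feature maps along the channel axis, every convolution must preserve the time and frequency sizes. The sketch below applies the standard output-length formula for a convolution axis with dilation 1. The helper `conv_out_len` is hypothetical, and the default padding of `k // 2` is an assumption taken from the convolution code in this section.

```python
def conv_out_len(n, k, stride=1, padding=None):
    """Output length along one convolution axis (dilation 1)."""
    if padding is None:
        padding = k // 2          # padding convention assumed from this section's code
    return (n + 2 * padding - k) // stride + 1

odd_kernel = conv_out_len(128, 3)              # odd kernel: length preserved
even_kernel = conv_out_len(128, 4)             # even kernel: one extra sample
mismatched = conv_out_len(128, 3, padding=2)   # padding chosen for a 5-wide kernel

print(odd_kernel)    # 128
print(even_kernel)   # 129
print(mismatched)    # 130
```

Sizes are preserved only when the kernel size is odd and the padding on each axis is derived from the kernel size on that same axis. Otherwise the channel-wise concatenation in the dense block fails with a shape mismatch.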

In [3]:
class TFC(nn.Module):
    '''
    [B, in_channels, T, F] => [B, out_channels (= gr), T, F] 
    We set the number of output channels to be the same as the growth rate
    '''
    def __init__(self, in_channels, num_layers, gr, kt, kf):
        '''
        in_channels: number of input channels
        num_layers: number of densely connected conv layers
        gr: growth rate
        kt: kernel size of the temporal axis.        
        kf: kernel size of the freq. axis
        '''
        
        super(TFC, self).__init__()
        c = in_channels
        self.H = nn.ModuleList()
        for i in range(num_layers):
            self.H.append(
                nn.Sequential(
                    # input is [B, C, T, F], so kernel and padding are both ordered (time, freq)
                    nn.Conv2d(in_channels=c, out_channels=gr, kernel_size=(kt, kf), stride=1, padding=(kt//2, kf//2)),
                    nn.BatchNorm2d(gr),
                    nn.ReLU(),
                )
            )
            c += gr

    def forward(self, x):
        ''' [B, in_channels, T, F] => [B, gr, T, F] '''
        x_ = self.H[0](x)
        for h in self.H[1:]:
            x = torch.cat((x_, x), 1)
            x_ = h(x)  

        return x_

#### 3.2.2 Time-Frequency Convolutions with TDF


We propose the Time-Frequency Convolutions with Time-Distributed Fully-connected networks (TFC-TDF) block. 
It utilizes two different blocks inside: a TFC block and a TDF block.
Figure 6 describes a TFC-TDF block.
It first maps the input $X^{(l-1)}$ to a same-sized representation with $c_{out}^{(l)}$ channels by applying the TFC block. Then the TDF block is applied to the dense block output. A residual connection is also added for efficient gradient flow.

![](img/tfctdf.png)
Figure 6. Time-Frequency Convolutions with TDF

Phasen [8] has shown that inserting time-distributed operations into intermediate blocks can improve speech enhancement performance.
In §4.3, we examine whether this also holds for singing voice separation (SVS).

In [4]:
class TFC_TDF(nn.Module):
    '''
    [B, in_channels, T, F] => [B, out_channels (= gr), T, F] 
    We set the number of output channels to be the same as the growth rate
    '''
    def __init__(self, in_channels, num_layers, gr, kt, kf, f, bf=16, bias=True):
        '''
        in_channels: number of input channels
        num_layers: number of densely connected conv layers
        gr: growth rate
        kt: kernel size of the temporal axis.        
        kf: kernel size of the freq. axis
        f: num of frequency bins
        
        below are params for TDF 
        bf: bottleneck factor. if None: single layer. else: MLP that maps f => f//bf => f 
        bias: bias setting of linear layers
        '''
        
        super(TFC_TDF, self).__init__()
        self.tfc = TFC(in_channels, num_layers, gr, kt, kf)
        self.tdf = TDF(gr, f, bf, bias)
            
    def forward(self, x):
        x = self.tfc(x)
        
        return x + self.tdf(x)

#### 3.2.3 Time-Distributed Convolutions with RNNs

We propose an alternative way to consider both the time and frequency dimensions.
A Time-Distributed Convolutions with Recurrent Neural Networks (TDC-RNN) block uses two different blocks: a TDC block for extracting timbre features and RNNs for capturing temporal patterns.
It extracts timbre features and temporal features **separately**, unlike a TFC block. We validate whether this approach can outperform the 2-D CNN approach by comparing TDC-RNNs with TFCs in §4.3.

The structure of a TDC-RNN block is similar to that of a TFC-TDF block.
It applies the TDC block to an input $X^{(l-1)}$ and obtains a same-sized hidden representation with $c_{out}^{(l)}$ channels. The RNNs then process this hidden representation along the time axis and output an equally sized tensor. A residual connection is added, as in a TFC-TDF block.
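
The recurrent path can be sketched on its own to make the tensor shapes explicit. This is a minimal sketch under stated assumptions, not the authors' implementation: a random tensor stands in for the TDC output, the sizes are illustrative, and `proj` is a hypothetical single linear layer standing in for whatever module maps the GRU output back to `F` frequency bins.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

B, C, T, F = 2, 4, 10, 32     # illustrative sizes
hidden = 16                   # illustrative GRU hidden size

rnn = nn.GRU(F, hidden, num_layers=1, batch_first=True, bidirectional=True)
proj = nn.Linear(2 * hidden, F)          # hypothetical stand-in for the back-projection

tdc_output = torch.randn(B, C, T, F)     # stands in for the output of a TDC block

with torch.no_grad():
    # Fold channels into the batch axis: each channel becomes a sequence of T frames.
    seq = tdc_output.reshape(B * C, T, F)
    h, _ = rnn(seq)                                      # [B*C, T, 2*hidden]
    rnn_output = proj(h.reshape(B, C, T, 2 * hidden))    # [B, C, T, F]
    out = tdc_output + rnn_output                        # residual connection

print(h.shape, out.shape)
```

Here the GRU treats the `F` frequency bins of a frame as its input features and runs along the time axis, so temporal patterns are modeled after, and separately from, the frame-wise feature extraction.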



In [5]:
class TDC_RNN(nn.Module):
    '''
    [B, in_channels, T, F] => [B, out_channels (= gr), T, F] 
    We set the number of output channels to be the same as the growth rate
    '''
    def __init__(self, 
                 in_channels, 
                 num_layers_tdc, gr, kf, f, 
                 bn_factor_rnn, num_layers_rnn, bidirectional=True, min_bn_units_rnn=16, bias_rnn=True,  ## RNN params
                 bn_factor_tif=16, bias_tif=True, ## TIF params
                 skip_connection=True):
        
        '''
        in_channels: number of input channels
        num_layers_tdc: number of densely connected conv layers
        gr: growth rate
        kf: kernel size of the freq. axis
        f: number of freq. bins
        bn_factor_rnn: bottleneck factor of rnn 
        num_layers_rnn: number of layers of rnn
        bidirectional: if true then bidirectional version rnn 
        min_bn_units_rnn: lower bound on the number of rnn hidden units
        bias_rnn: bias setting of the rnn
        bn_factor_tif: bottleneck factor of tif
        bias_tif: bias setting of tif
        skip_connection: if true then tdc+rnn else rnn
        '''
        
        super(TDC_RNN, self).__init__()

        self.skip_connection = skip_connection
        
        self.tdc = TDC(in_channels, num_layers_tdc, gr, kf)
        self.bn = nn.BatchNorm2d(gr)
        
        hidden_units_rnn = max(f//bn_factor_rnn, min_bn_units_rnn)
        self.rnn = nn.GRU(f, hidden_units_rnn, num_layers_rnn, bias=bias_rnn, batch_first=True, bidirectional=bidirectional)
        
        f_from = hidden_units_rnn * 2 if bidirectional else hidden_units_rnn
        f_to = f
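        # NOTE: TIF_f1_to_f2 is not defined in this section; it must be in scope
        # and map the last axis from f_from to f_to.
        self.tif_f1_to_f2 = TIF_f1_to_f2(gr, f_from, f_to, bn_factor=bn_factor_tif, bias=bias_tif)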


    def forward(self, x):
        ''' [B, in_channels, T, F] => [B, gr, T, F] '''
        
        x = self.tdc(x) # [B, in_channels, T, F] => [B, gr, T, F]
        x = self.bn(x)  # [B, gr, T, F] => [B, gr, T, F]
        tdc_output = x

        B, C, T, F = x.shape
        x = x.reshape(-1, T, F)  # [B, gr, T, F] => [B * gr, T, F]; reshape also handles non-contiguous input
        x, _ = self.rnn(x)       # [B * gr, T, F] => [B * gr, T, f_from], where f_from = 2*hidden_size if bidirectional
        x = x.view(B,C,T, -1)    # [B * gr, T, f_from] => [B, gr, T, f_from]
        rnn_output = self.tif_f1_to_f2(x) # [B, gr, T, f_from] => [B, gr, T, F]
        
        return tdc_output + rnn_output if self.skip_connection else rnn_output