In this notebook we modify Wave-U-Net for our purposes. We need it to take as input a 256x256 tensor and output a 256x256 tensor. 

We start by importing the necessary packages.

In [None]:
# Import same packages as the train script in Wave-U-Net-Pytorch

import argparse
import os
import time
from functools import partial

import torch
import pickle
import numpy as np

import torch.nn as nn
from torch.utils.tensorboard import SummaryWriter
from torch.optim import Adam
from tqdm import tqdm

# add a path to Wave-U-Net
import sys
sys.path.append('../Wave-U-Net-Pytorch')

import model.utils as model_utils
import utils
from model.waveunet import Waveunet

We define the parameters of the model.

In [11]:
model_config = {
    "num_inputs": 256,               # 128 mel bins per spectrogram, but we have two spectrograms
    "num_outputs": 128,              # Output also has 128 mel bins
    "num_channels": [64, 128, 256],    # Example channel progression
    "instruments": ["vocal"],        # Only output vocal, so no music branch
    "kernel_size": 3,                # Must be odd
    "target_output_size": 256,       # Desired output time frames (post-processing may crop)
    "conv_type": "normal",           # Set to "normal" to meet assertion requirements
    "res": "fixed",                  # Use fixed resampling
    "separate": False,               # One shared model rather than a separate model per instrument
    "depth": 1,                      # Number of conv layers per block
    "strides": 2                   # Down/up-sampling stride
}

model = Waveunet(**model_config)
print("input_size (length of input):", model.input_size)
print("num_inputs (number of channels in the input):", model.num_inputs)

Using valid convolutions with 289 inputs and 257 outputs
input_size (length of input): 289
num_inputs (number of channels in the input): 256


Check that the model is working by running it on a random tensor.

In [12]:
batch_size = 2
input_tensor = torch.randn(batch_size, model.num_inputs, model.input_size)
print(input_tensor.shape)
vocal_output = model(input_tensor)
print("Output shape:", vocal_output["vocal"].shape)

torch.Size([2, 256, 289])
Output shape: torch.Size([2, 128, 257])


Check the amount of GPU memory the model and a training batch takes up. Print a summary of the model.

In [13]:
import torch
from torch.optim import Adam
from torch.nn import L1Loss

from torchsummary import summary

# Ensure that you have a CUDA-enabled device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Instantiate and move the model to GPU
model = Waveunet(**model_config).to(device)

# Set up a dummy optimizer and loss function
optimizer = Adam(model.parameters(), lr=1e-3)
loss_fn = L1Loss()

# Define a dummy batch size
batch_size = 256

# Create a dummy input tensor with the required shape
# model.num_inputs corresponds to the number of channels (256 in your config)
# model.input_size is the computed length (289 with this config)
dummy_input = torch.randn(batch_size, model.num_inputs, model.input_size, device=device)

# Create a dummy target tensor with the shape that your model outputs.
# For a single output branch (vocal), the output shape should be:
# (batch_size, num_outputs, model.output_size)
# model.num_outputs is 128 and model.output_size is computed (257 in your case)
dummy_target = torch.randn(batch_size, model.num_outputs, model.output_size, device=device)

# Reset GPU peak memory stats
torch.cuda.reset_peak_memory_stats(device)

# Run a single forward and backward pass
optimizer.zero_grad()
# The model returns a dictionary keyed by instrument; select the "vocal" entry.
output = model(dummy_input)["vocal"]
loss = loss_fn(output, dummy_target)
loss.backward()
optimizer.step()

# Retrieve GPU memory stats
peak_memory = torch.cuda.max_memory_allocated(device)
current_memory = torch.cuda.memory_allocated(device)
print("Peak GPU memory allocated (bytes):", peak_memory)
print("Current GPU memory allocated (bytes):", current_memory)

# Optionally, print a detailed memory summary
print(torch.cuda.memory_summary(device=device))


summary(model, input_size=(model.num_inputs,  model.input_size))


Using valid convolutions with 289 inputs and 257 outputs
Peak GPU memory allocated (bytes): 956684288
Current GPU memory allocated (bytes): 154541568
|                  PyTorch CUDA memory summary, device ID 0                 |
|---------------------------------------------------------------------------|
|            CUDA OOMs: 0            |        cudaMalloc retries: 0         |
|        Metric         | Cur Usage  | Peak Usage | Tot Alloc  | Tot Freed  |
|---------------------------------------------------------------------------|
| Allocated memory      | 150919 KiB |    912 MiB |   9429 MiB |   9282 MiB |
|       from large pool | 139776 KiB |    909 MiB |   9332 MiB |   9195 MiB |
|       from small pool |  11143 KiB |     13 MiB |     96 MiB |     86 MiB |
|---------------------------------------------------------------------------|
| Active memory         | 150919 KiB |    912 MiB |   9429 MiB |   9282 MiB |
|       from large pool | 139776 KiB |    909 MiB |   9332 MiB |   919
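
The byte counts printed above can be cross-checked against the summary table by converting units. The sketch below uses only the two numbers from the output; the helper names `bytes_to_mib` and `bytes_to_kib` are made up for illustration.

```python
# Values copied from the printed output above.
peak_bytes = 956684288
current_bytes = 154541568


def bytes_to_kib(n_bytes):
    """Convert a byte count to kibibytes (1 KiB = 1024 bytes)."""
    return n_bytes / 1024


def bytes_to_mib(n_bytes):
    """Convert a byte count to mebibytes (1 MiB = 1024 * 1024 bytes)."""
    return n_bytes / (1024 * 1024)


peak_mib = bytes_to_mib(peak_bytes)
current_kib = bytes_to_kib(current_bytes)

# Dropping the fractional part gives the figures shown in the
# "Allocated memory" row of the summary (912 MiB peak, 150919 KiB current).
print("Peak (MiB):", int(peak_mib))
print("Current (KiB):", int(current_kib))
```

With a dummy batch size of 256, the peak is therefore a little under 1 GiB for one forward and backward pass.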

We will try padding the input with 17 zeros at the front and 16 at the back (33 in total), so that a length-256 tensor is extended to the 289 samples the model expects. The model outputs a length-257 tensor, so we will drop the 257th value of the output.
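
A minimal sketch of this padding and cropping, assuming the time axis is the last dimension and using the sizes printed earlier (289 in, 257 out). The helper names `pad_time_axis` and `crop_output` are hypothetical, and a random tensor stands in for the model output so that the example runs without Wave-U-Net.

```python
import torch
import torch.nn.functional as F

INPUT_SIZE = 289     # model.input_size printed above
OUTPUT_SIZE = 257    # model.output_size printed above
TARGET_FRAMES = 256  # number of time frames we actually have and want


def pad_time_axis(x, input_size=INPUT_SIZE):
    """Zero-pad the last axis of x up to input_size.

    When the total padding is odd, the extra zero goes at the front,
    which gives 17 in front and 16 at the back for 256 -> 289.
    """
    total = input_size - x.shape[-1]
    back = total // 2
    front = total - back
    return F.pad(x, (front, back))  # constant mode, value 0 by default


def crop_output(y, target=TARGET_FRAMES):
    """Keep the first `target` frames, dropping the 257th value."""
    return y[..., :target]


x = torch.randn(2, 256, 256)  # (batch, channels, time)
padded = pad_time_axis(x)
print(padded.shape)

# Stand-in for model(padded)["vocal"], which has 257 time frames.
fake_output = torch.randn(2, 128, OUTPUT_SIZE)
cropped = crop_output(fake_output)
print(cropped.shape)
```

Whether dropping the last frame rather than the first is the better alignment with the input is not settled here; it is simply the choice described above.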

We need to modify line 221 of the waveunet file:
        if not self.training:  # At test time clip predictions to valid amplitude range
            out = out.clamp(min=-1.0, max=1.0)
        return out
because the mel spectrogram has a different minimum value (I believe its values range from 0 to 1).
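
One possible shape for the change is to make the clipping range a parameter instead of hard-coding -1.0 and 1.0. The sketch below is a standalone illustration, not the code in the waveunet file: `clamp_predictions`, `min_value`, and `max_value` are hypothetical names, and the 0 to 1 default reflects the unverified assumption about the spectrogram range stated above.

```python
import torch


def clamp_predictions(out, training, min_value=0.0, max_value=1.0):
    """Clip predictions to [min_value, max_value] at test time only.

    Mirrors the structure of the original snippet, with the range
    exposed as arguments (assumed 0..1 for mel spectrograms).
    """
    if not training:
        out = out.clamp(min=min_value, max=max_value)
    return out


predictions = torch.tensor([-0.5, 0.25, 1.5])

# In training mode the values pass through unchanged.
train_out = clamp_predictions(predictions, training=True)

# In evaluation mode the values are clipped to the assumed range.
eval_out = clamp_predictions(predictions, training=False)
print(eval_out)
```

If the spectrograms turn out to use a different range (for example log-scaled values), only the two default arguments would need to change.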

We might try increasing the number of channels in the channel progression to account for feeding in 256 channels instead of 1 or 2.
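
To get a rough feel for the trade-off, the sketch below compares the first convolution under the current width (64) and a wider one. It assumes that the first convolution maps `num_inputs` directly to `num_channels[0]` with kernel size 3, which is an assumption about the Wave-U-Net internals; the width 256 is purely illustrative, and `conv1d_param_count` is a hypothetical helper implementing the standard 1D convolution parameter formula.

```python
def conv1d_param_count(in_channels, out_channels, kernel_size, bias=True):
    """Parameters of a plain 1D convolution: weights plus optional bias."""
    weights = in_channels * out_channels * kernel_size
    return weights + (out_channels if bias else 0)


num_inputs = 256   # from model_config
kernel_size = 3    # from model_config

current_width = 64        # num_channels[0] in the config above
hypothetical_width = 256  # illustrative wider choice, not from the source

for width in (current_width, hypothetical_width):
    params = conv1d_param_count(num_inputs, width, kernel_size)
    ratio = num_inputs / width
    print(f"width={width}: {params} parameters, "
          f"{ratio:g}x channel reduction in the first layer")
```

Under these assumptions the current config compresses 256 input channels to 64 in the first layer, while a width of 256 would avoid that compression at four times the parameter cost for that layer.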
