# Time Variant Pitch Shifting by Time Scale Modification via Overlap and Add and Resampling

The goal is a pitch-shifting algorithm that accepts a time variant pitch-shifting factor $\rho[n]$, so that the amount of pitch shift can change over the course of the signal.

First import the needed modules:

In [None]:
import numpy as np
import matplotlib.pyplot as plt

# define default samplerate of 44100Hz and not 22050Hz for librosa
# and fft length and hop size
from presets import Preset
import librosa as _librosa
import librosa.display as _display
_librosa.display = _display
librosa = Preset(_librosa)
librosa['sr'] = 44100
librosa['n_fft'] = 4096
librosa_hop_len = 2048
librosa['hop_length'] = librosa_hop_len

import IPython.display  # makes IPython.display.Audio available for playback

## Time Invariant Algorithm

Let's develop a time invariant pitch-shifting algorithm with constant pitch-shifting factor $\beta$ as opposed to the time variant pitch-shifting factor $\rho[n]$.

We will resample the signal to change its pitch or fundamental frequency.
Because this changes the length of the signal, we use a TSM (time scale modification) method to preserve the duration of the original signal.
The time duration of the resampled signal should be scaled by the time scaling factor $\alpha$.

Let's suppose we want to shift the pitch of a sinusoid up by an octave.
We define the pitch shifting factor $\beta$ as $\beta = \frac{f_{out}}{f_0}$, where $f_0$ is the frequency of the original sinusoid and $f_{out}$ is the target frequency or the frequency of the pitch shifted signal.

### Resampling

A signal's pitch can be shifted by resampling it.
Resampling is done by the resampling factor $r = \frac{f_{s,orig}}{f_{s,target}}$ where $f_{s,orig}$ is the original sampling rate of the signal and $f_{s,target}$ is the target sampling rate.

If we want to increase the frequency of a signal, we want to play it faster.
This is equivalent to increasing the sampling rate of our playing device or decreasing the sampling rate $f_{s,target}$ of our signal.
Since we want to have a fixed sampling rate for our replaying system, we decrease the sampling rate of our signal to increase its frequency.

It follows that $r = \frac{f_{s,orig}}{f_{s,target}} = \frac{1}{\beta} = \frac{f_0}{f_{out}}$.

Let's do an example by increasing the frequency of a signal by an octave with resampling:

In [None]:
f_0 = 500
f_out = 2*f_0 # shift by an octave

beta = f_out / f_0
r = 1 / beta
print("The original signal's frequency is", f_0, "Hz.")
print("The frequency of the pitch shifted signal is", f_out, "Hz.")
print("The pitch shifting factor beta is", beta)
print("and the resampling factor r is", r)

In [None]:
# make the input signal
n_samples = 512
t = np.arange(n_samples) / 44100  # time axis in seconds at the 44.1 kHz sampling rate used below
omega = 2 * np.pi * f_0
x = np.sin(omega * t)

plt.plot(t, x)

In [None]:
f_s_orig = 44100

f_s_target = f_s_orig * r
print("Original sampling rate is", f_s_orig, "and target sampling rate is", f_s_target)

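# keyword arguments, because newer librosa versions no longer accept the rates positionally
x_resampled = librosa.resample(x, orig_sr=f_s_orig, target_sr=f_s_target)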

fig, axs = plt.subplots(2)
axs[0].plot(x)
axs[0].set_ylabel("original")
axs[1].plot(x_resampled)
axs[1].set_ylabel("resampled")
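
As a numerical cross-check of the plots, the following sketch estimates the dominant frequency before and after resampling. It is self-contained and makes some assumptions: `resample_linear` is a hypothetical stand-in for `librosa.resample` that uses plain linear interpolation, `dominant_frequency` is a hypothetical FFT peak picker, and the `_demo` names are chosen so that the notebook's own variables are not overwritten.

```python
import numpy as np

def dominant_frequency(signal, sample_rate):
    """Return the frequency in Hz of the largest peak in the magnitude spectrum."""
    spectrum = np.abs(np.fft.rfft(signal * np.hanning(signal.size)))
    freqs = np.fft.rfftfreq(signal.size, d=1 / sample_rate)
    return freqs[np.argmax(spectrum)]

def resample_linear(signal, r):
    """Hypothetical stand-in for librosa.resample: keep about r * len(signal) samples."""
    n_out = int(round(signal.size * r))
    positions = np.arange(n_out) / r  # read positions in the input, in samples
    return np.interp(positions, np.arange(signal.size), signal)

fs_demo = 44100
f_0_demo = 500
beta_demo = 2.0          # one octave up
r_demo = 1 / beta_demo   # resampling factor

t_demo = np.arange(fs_demo) / fs_demo  # one second of signal
x_demo = np.sin(2 * np.pi * f_0_demo * t_demo)
x_demo_resampled = resample_linear(x_demo, r_demo)

# both signals are interpreted at the same playback rate fs_demo
f_in = dominant_frequency(x_demo, fs_demo)
f_res = dominant_frequency(x_demo_resampled, fs_demo)
print("estimated input frequency:", f_in, "Hz")
print("estimated frequency after resampling:", f_res, "Hz")
```

The resampled signal has half as many samples, and its spectral peak sits close to twice the input frequency when both are played back at the same rate.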

To make this audible, we make an audio example, using the factors from above:

In [None]:
# original signal
dur = 2 # seconds
t = np.arange(0, dur, 1/f_s_orig)
x_audio = np.sin(2 * np.pi * f_0 * t)

IPython.display.Audio(x_audio, rate=f_s_orig)

In [None]:
# resampled signal
x_audio_resampled = librosa.resample(x_audio, orig_sr=f_s_orig, target_sr=f_s_target)

IPython.display.Audio(x_audio_resampled, rate=f_s_orig)

### Time Scale Modification

Note that the resampled signal, which now has a higher frequency (or pitch), is shorter than the original.
You can see this in the number of samples in the plots above and in the signal length shown by the audio player.
This has to be compensated by time scale modification with the factor $\alpha$.
An $\alpha > 1$ corresponds to time stretching and an $\alpha < 1$ to time compression.

For the case of pitch-shifting:

$$\alpha = \beta$$
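
A short derivation, as a sketch, of why the two factors coincide. Here $L$ denotes the length of the original signal in samples (a symbol introduced only for this derivation), and rounding to whole samples is ignored. Resampling by $r$ changes the length to $r \cdot L$, and the time scale modification multiplies that length by $\alpha$. Requiring the output to be as long as the input gives:

$$L_{out} = \alpha \cdot r \cdot L = \frac{\alpha}{\beta} \cdot L = L \quad \Rightarrow \quad \alpha = \beta = \frac{1}{r}$$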

Let's define a time scale modification function using overlap and add:

In [None]:
def tsm_ola(x, alpha, N=1024, overlap_factor=4):
    """
    Time Scale Modification by windowed Overlap and Add
    x...input signal to be time scaled
    alpha...time scale factor (alpha>1: time stretching; alpha<1: time compression)
    N...used block length
    overlap_factor...number of overlapping analysis blocks
    """
    if(alpha > overlap_factor):
        print("Error: alpha must be less than or equal to the overlap factor!")
        print("Reduce alpha or increase the overlap factor.")
        return None

    # window
    win = np.hanning(N)

    # analysis hop size
    Sa = N // overlap_factor
    # synthesis hop size
    Ss = round(alpha * Sa)

    n_blocks = (x.size - N) // Sa
    y = np.zeros(round(x.size * alpha))

    # first block
    last = np.copy(x[:N] * win)
    y[:Ss] = last[:Ss]

    for i in range(1, n_blocks):
        current = x[i*Sa : i*Sa + N] * win

        overlap = last[Ss:] + current[:N-Ss]
        tail = current[N-Ss:]

        last = np.concatenate((overlap, tail))
        y[i*Ss:(i+1)*Ss] = last[:Ss]
    
    return y
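
Because the synthesis hop size `Ss` is rounded to whole samples, the time scale factor that `tsm_ola` realizes can differ slightly from the requested `alpha`. The following sketch computes the realized factor from the same hop-size formulas as the function above. The helper name `effective_alpha` is an assumption made for illustration.

```python
def effective_alpha(alpha, N=1024, overlap_factor=4):
    """Time scale factor realized after rounding the synthesis hop size.

    Hypothetical helper that mirrors the hop-size computation of tsm_ola.
    """
    Sa = N // overlap_factor  # analysis hop size
    Ss = round(alpha * Sa)    # synthesis hop size, rounded to whole samples
    return Ss / Sa

# octave up with the default parameters: Ss = 512, no rounding error
print(effective_alpha(2.0))  # → 2.0
# alpha = 0.8 with N=2048 and overlap_factor=2: Ss = 819 instead of 819.2
print(effective_alpha(0.8, N=2048, overlap_factor=2))  # → 0.7998046875
```

For typical block sizes the deviation is small, but it grows when the analysis hop size gets short.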

Applying the time scaling with $\alpha = \beta = 2$ completes the pitch shift by an octave and restores the original length of the signal.

In [None]:
alpha = beta
y = tsm_ola(x_audio_resampled, alpha)

print("alpha:", alpha, "beta:", beta, "r:", r)
print("original signal length:", x_audio.size)
print("resampled signal length:", x_audio_resampled.size)
print("pitch shifted signal length:", y.size)
plt.plot(y)
plt.title("pitch shifted signal")

IPython.display.Audio(y, rate=f_s_orig)

This algorithm has poor quality and produces a lot of artifacts, but it is simple and it works.
The output signal is a pitch shifted version of the input signal.
We can hear the amplitude modulation of the windowing.
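
Part of this amplitude modulation may come from the windows themselves: overlap-added Hann windows only sum to a nearly constant value for suitable hop sizes. The sketch below overlap-adds the window alone and measures how much the sum fluctuates. The helper `ola_window_ripple` is a hypothetical name, and the second hop size is just an example of a hop that does not divide the block length evenly. Phase mismatches between overlapping blocks are a separate effect that this sketch does not cover.

```python
import numpy as np

def ola_window_ripple(N, Ss, n_blocks=50):
    """Relative ripple of overlap-added Hann windows placed Ss samples apart.

    Hypothetical helper: returns (max - min) / mean of the summed windows,
    measured away from the fade-in and fade-out at the edges.
    """
    win = np.hanning(N)
    envelope = np.zeros((n_blocks - 1) * Ss + N)
    for i in range(n_blocks):
        envelope[i * Ss : i * Ss + N] += win
    steady = envelope[N:-N]  # skip the edges, where fewer windows overlap
    return (steady.max() - steady.min()) / steady.mean()

# synthesis hop of half a block (N=1024, Ss=512): the sum is almost constant
ripple_half_overlap = ola_window_ripple(1024, 512)
# a hop that does not divide the block evenly (N=2048, Ss=819): visible ripple
ripple_odd_hop = ola_window_ripple(2048, 819)
print("ripple for Ss = N/2:", ripple_half_overlap)
print("ripple for Ss = 819, N = 2048:", ripple_odd_hop)
```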

### Pitch-Shifting to a lower pitch

Let's try a pitch-shift to a lower pitch.
We can omit the calculation of the pitch-shifting factor $\beta$ and calculate the resampling factor $r$ directly from the signal frequencies.

In [None]:
f_0 = 500   # current signal frequency
f_out = 400 # target signal frequency
f_s_orig = 44100

# signal to be processed
dur = 2 # seconds
t = np.arange(0, dur, 1/f_s_orig)
x_audio = np.sin(2 * np.pi * f_0 * t)

# resampling factor r and time scaling factor alpha
r = f_0/f_out
alpha = 1/r

print("The original frequency of the signal is", f_0, "Hz.")
print("and the target frequency of the output signal is", f_out, "Hz.")
print("So the resampling factor r is", r, "and the time scaling factor alpha is", alpha, ".")
print("original length:", x_audio.size)

In [None]:
# let's hear the original signal again
IPython.display.Audio(x_audio, rate=f_s_orig)

In [None]:
# resampling
x_audio_resampled = librosa.resample(x_audio, orig_sr=f_s_orig, target_sr=r * f_s_orig)

print("Original sampling rate is", f_s_orig, "and target sampling rate is", r * f_s_orig)
print("resampled length:", x_audio_resampled.size)

IPython.display.Audio(x_audio_resampled, rate=f_s_orig)

In [None]:
y = tsm_ola(x_audio_resampled, alpha, N=2048, overlap_factor=2)

print("pitch-shifted length:", y.size)

IPython.display.Audio(y, rate=f_s_orig)

There is a slight pitch mismatch between the resampled signal `x_audio_resampled` and the final output signal `y`, but otherwise this method works.
This might be caused by the amplitude modulation from the windowing and/or by the overlapping segments.
With a smaller overlap factor, the pitch mismatch decreases.

### Final Test for the time invariant algorithm

Let's make sure this algorithm works well by testing it with a recording of a singer.

In [None]:
# test of the algorithm
x, f_s_orig = librosa.load('../../samples/Toms_diner.wav')

IPython.display.Audio(x, rate=f_s_orig)

In [None]:
# define a pitch-shifting factor
beta = 1.5

# resampling
r = 1 / beta

x_resampled = librosa.resample(x, orig_sr=f_s_orig, target_sr=r*f_s_orig)

# TSM
alpha = beta
y = tsm_ola(x_resampled, alpha, N=4096, overlap_factor=4)

IPython.display.Audio(y, rate=f_s_orig)

So, this works.
Let's go on to the development of a time variant pitch-shifting algorithm.

## Time Variant Pitch-Shifting Algorithm

The time variant pitch-shifting algorithm takes a time variant pitch-shifting factor $\rho[n]$ and shifts a signal's pitch accordingly over time.
Let's generate a distinctive input signal and some random values for $\rho[n]$, and test the algorithm by visually inspecting the output waveform in a plot.

In [None]:
sr = 44100

# generate a unique input signal to check, what is going on
n_samples = 2048

# noise
noise = (np.random.rand(n_samples) - 0.5) * 2

# sine
f = 265
omega = 2 * np.pi * f
t = np.linspace(0, n_samples / sr, n_samples)
sine = np.sin(omega * t)

# sawtooth
a = 500
saw = (((t * a) % 2) - 1)

x_part = np.concatenate((noise, sine, saw, noise, sine, saw[::-1], noise, np.zeros(n_samples)))

# long unique input signal
plt.rcParams['figure.figsize'] = [15, 3]
x = np.concatenate((x_part, x_part))
plt.plot(x)
plt.title("longer unique input signal")

### $\rho$ is analyzed by a pitch tracking algorithm with a certain window-size and hop-size.
So generate some data for rho:

In [None]:
pitch_win_size = 4096
pitch_hop_size = 2048

# how many pitch marks are needed?
n_pitches = (x.size - pitch_win_size) // pitch_hop_size
print(n_pitches)

### generate a rho around the value 1

In [None]:
rho = 1 + (np.random.rand(n_pitches) - 0.5) * 0.5

plt.plot(rho, 'o')

Now we have to resample per block for different resampling factors according to $r = \frac{1}{\rho[n]}$ and compensate the length via TSM according to $\alpha = \beta = \rho[n] = 1 / r$.
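
To make these relations concrete, the following sketch computes the per-block parameters for a few example factors. It rests on assumptions: `block_parameters` is a hypothetical helper, the resampled block length is taken as roughly $N \cdot r$ (rounded up), and `NaN` entries are treated as unvoiced blocks that stay unshifted ($r = \alpha = 1$).

```python
import numpy as np

def block_parameters(rho_blocks, N, Sa):
    """Per-block resampling factor, approximate resampled length and synthesis hop.

    Hypothetical helper. NaN entries are treated as unvoiced blocks that are
    left unshifted (r = alpha = 1), which is an assumption of this sketch.
    """
    alpha = np.where(np.isnan(rho_blocks), 1.0, rho_blocks)  # time-scaling factor
    r = 1.0 / alpha                                          # resampling factor
    resampled_len = np.ceil(N * r).astype(int)               # about N * r samples
    Ss = np.rint(alpha * Sa).astype(int)                     # synthesis hop size
    return r, resampled_len, Ss

rho_demo = np.array([1.0, 2.0, 0.5, np.nan])
r_demo, len_demo, Ss_demo = block_parameters(rho_demo, N=1024, Sa=256)
print("resampling factors:", r_demo)
print("block lengths after resampling:", len_demo)
print("synthesis hop sizes:", Ss_demo)
```

A factor above 1 shortens the block and enlarges the synthesis hop, a factor below 1 does the opposite.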

### Let's calculate the parameters for the algorithm:

In [None]:
# The length of the validity of the pitch marks is the pitch-tracking hop size
rho_N = pitch_hop_size

# choose an analysis window size and an overlap factor for the pitch-shifting algorithm
N = 1024
overlap_factor = 2

# window
win = np.hanning(N)

# analysis hop size
Sa = N // overlap_factor

n_blocks = (x.size - N) // Sa

# How many blocks use the same pitch mark?
# (integer division, because np.repeat needs an integer repeat count)
n_blocks_per_rho = rho_N // Sa

# format pitch-shifting factor to processing parameters
rho_formated = np.repeat(rho, n_blocks_per_rho)

# only blocks that are covered by a pitch mark are processed; the blocks at the end of x are dropped
n_blocks = rho_formated.size

print(n_blocks_per_rho)
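
The expansion of the pitch-shifting factors to one value per processing block can be sketched in isolation. The helper `expand_rho` is hypothetical, and it assumes that the pitch-tracking hop size is an integer multiple of the analysis hop size, since `np.repeat` needs an integer repeat count.

```python
import numpy as np

def expand_rho(rho_values, rho_N, Sa):
    """Repeat each pitch-shifting factor for all blocks that fall into its hop.

    Hypothetical helper; it assumes that rho_N is an integer multiple of Sa.
    """
    if rho_N % Sa != 0:
        raise ValueError("rho_N must be an integer multiple of Sa")
    return np.repeat(rho_values, rho_N // Sa)

rho_demo = np.array([1.0, 1.2])
rho_demo_blocks = expand_rho(rho_demo, rho_N=2048, Sa=512)
print(rho_demo_blocks)  # each factor appears four times
```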

### And now do the main audio processing:

In [None]:
# start with one block of silence, onto which the first block is overlapped
y = np.zeros(N)

# main processing loop
for i in range(n_blocks):
    # pitch-shifting factor is time-scaling factor
    alpha = beta = rho_formated[i]
    # resampling factor
    r = 1 / beta

    current = x[i*Sa : i*Sa + N] * win

    # resampling
    if not np.isnan(r): # voiced
        resampled = librosa.resample(current, orig_sr=sr, target_sr=sr*r)
    else: # unvoiced (no pitch shifting)
        alpha = 1
        resampled = current

    # TSM: overlap-add the resampled block (not the original one) onto the end of y
    Ss = round(alpha * Sa)
    n_overlap = resampled.size - Ss
    overlap = y[-n_overlap:] + resampled[:n_overlap]
    tail = resampled[n_overlap:]
    y = np.concatenate((y[:-n_overlap], overlap, tail))

### Plot the result:

In [None]:
print("x.size:", x.size, "y.size", y.size)
fig, axs = plt.subplots(2)
axs[0].plot(x)
axs[0].set_ylabel("x")
axs[1].plot(y)
axs[1].set_ylabel("y")

The output looks kind of OK.
`y` is a bit shorter than `x` because we discard the last processing blocks.
The most important thing is that the algorithm works.

### So test it with a different signal and check its sound quality:

In [None]:
# generate a sine input signal
sr = 44100
dur = 4 # seconds
t = np.linspace(0, dur, dur*sr)
f_0 = 400
x = np.sin(2 * np.pi * f_0 * t)

# generate rho as low frequency sine signal around 1
pitch_win_size = 4096
pitch_hop_size = 1024
n_pitches = (x.size - pitch_win_size) // pitch_hop_size
t_r = np.linspace(0, dur, n_pitches)
f_r = 1
r_sin = 0.1 * np.sin(2 * np.pi * f_r * t_r)
rho = r_sin + 0.9

# plot
fig, axs = plt.subplots(2)
axs[0].plot(t, x)
axs[0].set_ylabel("x")
axs[1].plot(t_r, rho, 'o')
axs[1].set_ylabel("rho")

### Algorithm Parameters:

In [None]:
# The length of the validity of the pitch marks is the pitch-tracking hop size
rho_N = pitch_hop_size

# choose an analysis window size and an overlap factor for the pitch-shifting algorithm
N = 2048
overlap_factor = 4

# windowing
win = np.hanning(N)

# analysis hop size
Sa = N // overlap_factor

# how many blocks to process
n_blocks = (x.size - N) // Sa

# How many blocks use the same pitch mark?
n_blocks_per_rho = rho_N // Sa  # integer division: np.repeat needs an integer repeat count

# format pitch-shifting factor to processing parameters
rho_formated = np.repeat(rho, n_blocks_per_rho)

# only blocks that are covered by a pitch mark are processed; the blocks at the end of x are dropped
n_blocks = rho_formated.size

### Audio Processing:

In [None]:
# start
y = np.zeros(N)

# main processing loop
for i in range(n_blocks):
    # time-scaling factor (alpha=beta=rho[n])
    alpha = rho_formated[i]
    # resampling factor
    r = 1 / alpha

    current = x[i*Sa : i*Sa + N] * win

    # resampling
    if not np.isnan(r): # voiced
        resampled = librosa.resample(current, orig_sr=sr, target_sr=r*sr)
    else: # unvoiced (no pitch shifting)
        alpha = 1
        resampled = current
    
    # TSM
    Ss = round(alpha * Sa)
    overlap = y[-(resampled.size - Ss) : ] + resampled[ : resampled.size - Ss]
    tail = resampled[resampled.size-Ss:]
    y = np.concatenate((y[:-(resampled.size - Ss)], overlap, tail))
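
Each pass through the loop above lengthens `y` by exactly `Ss` samples, provided that `0 < resampled.size - Ss <= y.size`, so the output length can be predicted from the pitch-shifting factors alone. The following sketch does that with a hypothetical helper, `predicted_output_length`. It also suggests that the output duration follows the average of `rho`.

```python
import numpy as np

def predicted_output_length(rho_blocks, N, Sa):
    """Length of y after the loop: N start samples plus one synthesis hop per block.

    Hypothetical helper; NaN entries count as unshifted blocks (alpha = 1),
    as in the loop above.
    """
    alpha = np.where(np.isnan(rho_blocks), 1.0, rho_blocks)
    Ss = np.rint(alpha * Sa).astype(int)  # synthesis hop size per block
    return N + int(Ss.sum())

# ten blocks with a constant factor of 0.9 and N=2048, Sa=512
print(predicted_output_length(np.full(10, 0.9), N=2048, Sa=512))  # → 6658
# the same ten blocks without pitch shift
print(predicted_output_length(np.ones(10), N=2048, Sa=512))       # → 7168
```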

### Check the result:

In [None]:
# play x
IPython.display.Audio(x, rate=sr)

In [None]:
# play y
IPython.display.Audio(y, rate=sr)

In [None]:
plt.plot(y)

This seems to work, although there are a lot of cancellations with a sine as an input signal.
But this is to be expected from this kind of algorithm.
The overlap leads to a higher amplitude, but this can be normalized if necessary.
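
One simple way to normalize is to rescale the output to a fixed peak level. This is a minimal sketch with a hypothetical helper, `normalize_peak`; other normalizations, such as dividing by the summed window envelope, are possible as well.

```python
import numpy as np

def normalize_peak(signal, peak=1.0):
    """Scale a signal so that its largest absolute sample equals `peak`.

    Hypothetical helper; an all-zero signal is returned unchanged.
    """
    max_abs = np.max(np.abs(signal))
    if max_abs == 0:
        return signal.copy()
    return signal * (peak / max_abs)

y_demo = np.array([0.5, -2.0, 1.0])
y_demo_normalized = normalize_peak(y_demo)
print(y_demo_normalized)  # largest absolute value is now 1.0
```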

So test this algorithm with a musical signal.

## The final time variant pitch shifting algorithm

In [None]:
def tv_pitch_shift_ola(x, sr, rho, rho_N, N, overlap_factor):
    """
    x...input signal to be pitch shifted
    sr...sample rate of x
    rho...time varying pitch-shifting factor
    rho_N...window size of the validity of the pitch-shifting factor
    N...analysis window size of pitch shifting
    overlap_factor...factor of window overlap for OLA
    """

    # window
    win = np.hanning(N)

    # analysis hop size
    Sa = N // overlap_factor
    n_blocks = (x.size - N) // Sa

    # How many blocks use the same pitch mark?
    n_blocks_per_rho = rho_N // Sa  # integer division: np.repeat needs an integer repeat count

    # format pitch-shifting factor to processing parameters
    rho_formated = np.repeat(rho, n_blocks_per_rho)

    # only blocks that are covered by a pitch mark are processed;
    # the blocks at the end of x are dropped
    n_blocks = rho_formated.size

    # start
    y = np.zeros(N)

    # main processing loop
    for i in range(n_blocks):
        # time-scaling factor (alpha=beta=rho[n])
        alpha = rho_formated[i]
        # resampling factor
        r = 1 / alpha
        
        current = x[i*Sa : i*Sa + N] * win

        # resampling
        if not np.isnan(r): # voiced
            resampled = librosa.resample(current, orig_sr=sr, target_sr=r * sr)
        else: # unvoiced (no pitch shifting)
            alpha = 1
            resampled = current

        # TSM
        Ss = round(alpha * Sa)
        overlap = y[-(resampled.size - Ss) : ] + resampled[ : resampled.size - Ss]
        tail = resampled[resampled.size-Ss:]
        y = np.concatenate((y[:-(resampled.size - Ss)], overlap, tail))
    return y

### test the algorithm with a different recording

In [None]:
# original
pos = 15
# dur is reused from above, so that the rho from above matches the length of the excerpt
x_sing, sr = librosa.load("../../samples/ave-maria.wav", offset=pos, duration=dur)

# play
IPython.display.Audio(x_sing, rate=sr)

### Now pitch shift the signal

In [None]:
# The length of the validity of the pitch marks is the pitch-tracking hop size
rho_N = pitch_hop_size

# choose an analysis window size and an overlap factor for the pitch-shifting algorithm
N = 2048
overlap_factor = 2

y_sing = tv_pitch_shift_ola(x_sing, sr, rho, rho_N, N, overlap_factor)
# play
IPython.display.Audio(y_sing, rate=sr)

So, this is a simple time variant pitch-shifting algorithm.