# Algorithms and Parameters

A notebook for testing and finding the right algorithms and parameters for the _Reshift_ frequency discretization effect.
A frequency discretization effect like _Autotune_ consists of

* a pitch tracking algorithm

* a nonlinear frequency scale for the target pitch

* and of a pitch-shifting algorithm.

## Initialization

In [None]:
# define default samplerate of 44100Hz and not 22050Hz for librosa
# and fft length and hop size
from presets import Preset
import librosa as _librosa
import librosa.display as _display
_librosa.display = _display
librosa = Preset(_librosa)

librosa['sr'] = 44100
librosa['n_fft'] = 4096
librosa_hop_len = 2048
librosa['hop_length'] = librosa_hop_len

# other needed modules
import matplotlib.pyplot as plt
import numpy as np
import soundfile as sf

In [None]:
# ignore warnings
import warnings
warnings.filterwarnings('ignore')

# import the reshift algorithms
import sys
sys.path.insert(1, '../py') # insert at 1, 0 is the script path (or '' in REPL)
import reshift

import IPython # for IPython.display.Audio(x, rate=fs)

## Pitch Tracking

We use the _pYIN_ algorithm implemented by librosa for pitch tracking.
Its [parameters](https://librosa.org/doc/latest/generated/librosa.pyin.html) are:

* fmin: minimum frequency to look for

* fmax: maximum frequency to look for

* sr: sampling rate of the input signal

* frame_length: length of the frames in samples. By default, frame_length=2048

* win_length: length of the window for calculating autocorrelation in samples. If None, defaults to frame_length // 2

* hop_length: number of audio samples between adjacent pYIN predictions. If None, defaults to frame_length // 4

* n_thresholds: number of thresholds for peak estimation.

* beta_parameters: shape parameters for the beta distribution prior over thresholds.

* boltzmann_parameter: shape parameter for the Boltzmann distribution prior over troughs. Larger values will assign more mass to smaller periods.

* resolution: Resolution of the pitch bins. 0.01 corresponds to cents.

* max_transition_rate: maximum pitch transition rate in octaves per second.

* switch_prob: probability of switching from voiced to unvoiced or vice versa.

* no_trough_prob: maximum probability to add to global minimum if no trough is below threshold.

* fill_na: (None, float, or np.nan) default value for unvoiced frames of f0. If None, the unvoiced frames will contain a best guess value.

* center: (boolean) If True, the signal y is padded so that frame `D[:, t]` is centered at `y[t * hop_length]`. If False, then `D[:, t]` begins at `y[t * hop_length]`. Defaults to True, which simplifies the alignment of D onto a time grid by means of librosa.core.frames_to_samples.

* pad_mode: (string or function) If center=True, this argument is passed to np.pad for padding the edges of the signal y. By default (pad_mode="reflect"), y is padded on both sides with its own reflection, mirrored around its first and last sample respectively. If center=False, this argument is ignored. See also np.pad.
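
To make the frame parameters concrete, here is a small sketch that derives the default `win_length` and `hop_length` from `frame_length` as described in the list above and converts them to milliseconds at 44100 Hz. The helper name `pyin_frame_defaults` is hypothetical and not part of librosa.

```python
def pyin_frame_defaults(frame_length=2048, sr=44100):
    """Derive the default window and hop sizes from the frame length."""
    win_length = frame_length // 2   # used when win_length is None
    hop_length = frame_length // 4   # used when hop_length is None

    def to_ms(n_samples):
        return 1000.0 * n_samples / sr

    return {
        "win_length": win_length,
        "hop_length": hop_length,
        "frame_ms": to_ms(frame_length),
        "hop_ms": to_ms(hop_length),
    }

defaults = pyin_frame_defaults()
print(defaults)
```

With these values, one pitch estimate covers roughly 46 ms of audio and a new estimate arrives roughly every 12 ms.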


pYIN returns:

* f0: time series of fundamental frequencies in Hertz.

* voiced_flag: time series containing boolean flags indicating whether a frame is voiced or not.

* voiced_prob: time series containing the probability that a frame is voiced.
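
The three return values are typically used together. The following sketch uses synthetic stand-in arrays (not real analysis results) and assumes unvoiced frames are marked with `np.nan`, which is what `fill_na=np.nan` produces. It shows how to select only the voiced frames and how to map frame indices to time when `center=True`.

```python
import numpy as np

# Synthetic stand-ins for the pYIN outputs (assumed values, for illustration only)
f0 = np.array([np.nan, 220.0, 221.5, np.nan, 440.0])
voiced_flag = ~np.isnan(f0)
hop_length, sr = 512, 44100

# With center=True, frame t is centered at sample t * hop_length
times = np.arange(f0.size) * hop_length / sr

voiced_f0 = f0[voiced_flag]        # keep only frames with a pitch estimate
median_f0 = np.median(voiced_f0)   # robust summary of the voiced pitch
print(times[voiced_flag], median_f0)
```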

### pYIN

Let's start by analyzing the pitch of a sweep with the default parameters.

In [None]:
# generate a sine sweep
dur = 10
fmin = 500
fmax = 4*500
print("Sweep goes from", fmin, "to", fmax, "Hz.")
sr = 44100
x_sw = librosa.chirp(fmin=fmin, fmax=fmax, sr=sr, duration=dur)  # keyword arguments work across librosa versions

# plot and play
plt.rcParams['figure.figsize'] = [15, 5]
reshift._my_plot(x_sw, sr, "sine sweep")
IPython.display.Audio(x_sw, rate=sr)

In [None]:
# pYIN analysis
fmin = 60
fmax = 2000
f_0_sw, voiced_flag, voiced_probs = librosa.pyin(x_sw, fmin=fmin, fmax=fmax, sr=sr)

fig, ax = plt.subplots(3)
ax[0].plot(f_0_sw)
ax[0].set_ylabel("f_0")
ax[1].plot(voiced_flag)
ax[1].set_ylabel("voiced flag")
ax[2].plot(voiced_probs)
ax[2].set_ylabel("voiced prob")
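
To judge the tracker on the sweep, a reference curve helps. This sketch assumes that `librosa.chirp` produces an exponential sweep unless its `linear` flag is set, so the instantaneous frequency is $f(t) = f_{min} (f_{max}/f_{min})^{t/dur}$. The helper `expected_sweep_f0` is hypothetical, and `hop_length` has to match the hop that the analysis actually used.

```python
import numpy as np

def expected_sweep_f0(fmin, fmax, duration, sr=44100, hop_length=512):
    """Instantaneous frequency of an exponential sweep, sampled once per analysis frame."""
    n_frames = int(duration * sr) // hop_length + 1
    t = np.arange(n_frames) * hop_length / sr
    return fmin * (fmax / fmin) ** (t / duration)

f_ref = expected_sweep_f0(500, 2000, 10)
print(f_ref[:3], f_ref[-1])
```

Plotting `f_ref` on top of the tracked `f_0` should show how closely the estimate follows the sweep.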

This works, so now let's analyze a singing voice:

In [None]:
pos = 5
dur = 10
x_sing, sr = librosa.load("../../samples/ave-maria.wav", offset=pos, duration=dur)

# plot and play
plt.rcParams['figure.figsize'] = [15, 5]
reshift._my_plot(x_sing, sr, "original signal")
IPython.display.Audio(x_sing, rate=sr)

In [None]:
# pYIN analysis
fmin = 60
fmax = 2000
pyin_frame_length = 2048
pyin_hop_length = pyin_frame_length // 4
f_0_sing, voiced_flag, voiced_probs = librosa.pyin(x_sing, fmin=fmin, fmax=fmax, sr=sr,
                                                   frame_length=pyin_frame_length, hop_length=pyin_hop_length)

fig, ax = plt.subplots(3)
ax[0].plot(f_0_sing)
ax[0].set_ylabel("f_0")
ax[1].plot(voiced_flag)
ax[1].set_ylabel("voiced flag")
ax[2].plot(voiced_probs)
ax[2].set_ylabel("voiced prob")

This seems to work all right too, so let's stick to the default parameters for now.

## Nonlinear Frequency Scales

For mapping the analyzed frequency to a target frequency, we need a scale to get the needed pitch shifting ratio.
So far, we have a chromatic and a whole-tone scale, which should be sufficient for checking parameters and algorithms.
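
The idea behind such a scale can be sketched as rounding on a logarithmic frequency axis. The function below is a hypothetical illustration and not the implementation of `reshift.freq_scale`: it snaps a frequency to a grid of `step` semitones around the tuning reference, so `step=1` gives a chromatic grid and `step=2` a whole-tone grid anchored at the reference pitch.

```python
import numpy as np

def quantize_to_scale(f, step=1, tune=440.0):
    """Snap frequencies (Hz) to a grid of `step` semitones around `tune`."""
    f = np.asarray(f, dtype=float)
    semitones = 12.0 * np.log2(f / tune)          # distance to the reference in semitones
    snapped = step * np.round(semitones / step)   # nearest grid point
    return tune * 2.0 ** (snapped / 12.0)

f_quant = quantize_to_scale([445.0, 460.0, 500.0])
print(f_quant)
```

NaN inputs (unvoiced frames) pass through as NaN in this sketch.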

In [None]:
f_in = np.linspace(500, 2000, 500)
plt.plot(f_in)

f_out = reshift.freq_scale(f_in, scale='chromatic', tune=440)
plt.plot(f_out)

f_out = reshift.freq_scale(f_in, scale='wholetone', tune=440)
plt.plot(f_out, '--')

The discretized frequencies of the analyzed signals are as follows:

In [None]:
fig, axs = plt.subplots(2)

f_0_disc_sw = reshift.freq_scale(f_0_sw, scale='chrom') # default tuning is 440Hz
axs[0].plot(f_0_disc_sw)

f_0_disc_sing = reshift.freq_scale(f_0_sing, scale='chrom')
axs[1].plot(f_0_disc_sing)

The pitches are discretized to semitones.
Since the singer's vibrato spans more than a semitone, the target frequency still jumps between neighboring pitches.
So let's try to get a flat target frequency in the parts with vibrato.
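
Why a coarser scale helps can be checked on a synthetic example. The sketch below assumes a pitch modulation of 0.6 semitones in each direction around a note that sits exactly on the grid, and counts how many different target pitches each grid produces. The `snap` helper is hypothetical, and reading "thirds" as a step of 4 semitones is only a guess.

```python
import numpy as np

def snap(f, step, tune=440.0):
    """Snap frequencies to a grid of `step` semitones around `tune`."""
    semis = 12.0 * np.log2(np.asarray(f, dtype=float) / tune)
    return tune * 2.0 ** (step * np.round(semis / step) / 12.0)

# Synthetic pitch curve: 5.5 Hz modulation, +/- 0.6 semitones around 440 Hz
t = np.linspace(0.0, 1.0, 1000)
f_mod = 440.0 * 2.0 ** (0.6 * np.sin(2 * np.pi * 5.5 * t) / 12.0)

counts = {}
for step, name in [(1, "chromatic"), (2, "whole tone"), (4, "thirds (assumed)")]:
    counts[step] = np.unique(np.round(snap(f_mod, step), 3)).size
    print(f"{name:18s}: {counts[step]} distinct target pitch(es)")
```

A real voice is rarely centered on a grid point, so a coarser grid reduces the jumps but does not guarantee a flat target.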

In [None]:
f_0_disc_sing = reshift.freq_scale(f_0_sing, scale='whole')
plt.plot(f_0_disc_sing)
plt.plot(f_0_sing)

The target still jumps, so let's use a scale built from thirds:

In [None]:
f_0_disc_sing = reshift.freq_scale(f_0_sing, scale='thirds')
plt.plot(f_0_disc_sing)
plt.plot(f_0_sing)

Now this frequency mapping looks pretty flat.
The next step in the frequency discretization algorithm is to calculate the pitch-shifting factor $\rho[n]$ from the analyzed pitch $f_0[n]$ and the target pitch $f_{out}[n]$:

$$\rho[n] = \frac{f_{out}[n]}{f_0[n]}$$

In [None]:
def get_rho(f_0, f_out):
    rho = f_out / f_0
    return rho

rho_sing = get_rho(f_0_sing, f_0_disc_sing)
plt.plot(rho_sing)

This is the pitch-shifting factor which is needed to cancel out the deviation of the input signal from the target pitch.
A $\rho$ below one shifts the pitch down, a $\rho$ above one shifts it up, and $\rho = 1$ leaves the pitch unchanged.
It can be interpreted as the error signal to the target pitch.
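
Two practical details are worth sketching here: unvoiced frames carry NaN in $f_0$, and the size of the correction is easier to read in cents than as a ratio. The hypothetical helper below sets $\rho = 1$ wherever no valid pitch exists, which means no shift in those frames, and converts the factor to cents with $1200 \log_2 \rho$.

```python
import numpy as np

def rho_with_bypass(f_0, f_out):
    """Pitch-shifting factor; frames without a valid pitch get rho = 1 (no shift)."""
    f_0 = np.asarray(f_0, dtype=float)
    f_out = np.asarray(f_out, dtype=float)
    rho = np.ones_like(f_0)
    valid = np.isfinite(f_0) & np.isfinite(f_out) & (f_0 > 0)
    rho[valid] = f_out[valid] / f_0[valid]
    return rho

# Synthetic frames: in tune, unvoiced, one semitone flat
rho = rho_with_bypass([440.0, np.nan, 415.30], [440.0, np.nan, 440.0])
cents = 1200.0 * np.log2(rho)   # correction expressed in cents
print(rho, cents)
```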

The next figure shows how the pitch is compensated to stay at discrete pitches according to the used scale.

In [None]:
original, = plt.plot(f_0_sing, label="original")
shift, = plt.plot(f_0_disc_sing * rho_sing, label="shift")
target, = plt.plot(f_0_disc_sing, label="target")
plt.legend(handles=[original, shift, target])

## Pitch-Shifting

Now we need a time-variant pitch-shifting algorithm that is controlled by the pitch-shifting factor $\rho$.
One of the fastest and simplest pitch-shifting algorithms is based on __Time Scale Modification (TSM) by Overlap and Add (OLA)__.
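
The principle can be sketched for a constant $\rho$: stretch the signal in time by $\rho$ with OLA, then resample it back to its original length, which multiplies the pitch by $\rho$. This is a minimal illustration with hypothetical function names, not the code of `reshift.pitch_shift_ola`, and it ignores time-varying factors. Plain OLA does not align the phases of overlapping frames, so artifacts on tonal signals are expected.

```python
import numpy as np

def ola_time_stretch(x, alpha, N=512, overlap=2):
    """Stretch x by factor alpha (alpha > 1 makes it longer) with plain OLA."""
    hop_syn = N // overlap              # synthesis hop (frame spacing in the output)
    hop_ana = hop_syn / alpha           # analysis hop (frame spacing in the input)
    window = np.hanning(N + 1)[:-1]     # periodic Hann window
    n_frames = int((len(x) - N) / hop_ana) + 1
    y = np.zeros((n_frames - 1) * hop_syn + N)
    norm = np.zeros_like(y)
    for i in range(n_frames):
        a = int(round(i * hop_ana))     # read position in the input
        s = i * hop_syn                 # write position in the output
        y[s:s + N] += window * x[a:a + N]
        norm[s:s + N] += window
    return y / np.maximum(norm, 1e-8)   # undo the gain of the overlapping windows

def ola_pitch_shift_sketch(x, rho, N=512, overlap=2):
    """Constant-ratio pitch shift: stretch by rho, then resample to the input length."""
    stretched = ola_time_stretch(x, rho, N, overlap)
    read_pos = np.arange(len(x)) * rho  # read the stretched signal rho times faster
    return np.interp(read_pos, np.arange(len(stretched)), stretched)

sr = 44100
x = np.sin(2 * np.pi * 440.0 * np.arange(sr) / sr)   # one second of a test tone
y = ola_pitch_shift_sketch(x, rho=1.2)
print(x.shape, y.shape)
```

The window size `N` and the overlap factor control the trade-off between time resolution and audible artifacts in this scheme.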

Additionally, we need a strategy to handle unpitched parts of the signal.
Let's just keep the original signal at unpitched parts for now.

In [None]:
help(reshift.pitch_shift_ola)

The default pYIN _frame length_ is 2048, which is the window size for calculating one pitch value.
The default pYIN _hop size_ is `frame_length // 4`.
The pYIN _hop size_ determines for how many samples each pitch-shifting factor $\rho[n]$ stays valid, since there is one factor per analyzed frequency value.
In general, there should be fewer frequency estimates than audio frames for pitch shifting.
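
One simple way to relate the two rates is a zero-order hold: every frame-wise factor is repeated for one hop of samples. The sketch below uses synthetic $\rho$ values and a hypothetical helper name. It ignores the half-hop offset that centered analysis frames introduce.

```python
import numpy as np

def expand_to_samples(rho, hop_length, n_samples):
    """Hold each frame-wise rho value for hop_length samples (zero-order hold)."""
    rho_per_sample = np.repeat(np.asarray(rho, dtype=float), hop_length)
    return rho_per_sample[:n_samples]

n_samples, hop_length = 8192, 512
rho_frames = np.linspace(0.95, 1.05, n_samples // hop_length + 1)  # synthetic factors
rho_samples = expand_to_samples(rho_frames, hop_length, n_samples)
print(rho_frames.size, rho_samples.size)
```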

The _analysis window size_ $N$ and the _overlap factor_ of the OLA pitch-shifting algorithm determine the sound quality of the pitch-shifted signal and the produced artifacts.


### Synchronization

The pYIN frequency estimates have to be synchronized with the pitch-shifting algorithm.
For this purpose, we next analyze the output data of librosa's pYIN implementation.

In [None]:
# check the output data sizes of pyin
rho_N = pyin_hop_length
actual_pyin_hop_size = (x_sing.size - pyin_frame_length) / rho_sing.size

print("pYIN given frame length:", pyin_frame_length, "given rho_N = pYIN_hop_length:", rho_N)
print("actual average pYIN hop size:", actual_pyin_hop_size)

$x[n]; n = 0, ..., N-1$

$N$...length of the input signal

$f_0[m]; m = 0, ..., M-1$

$M$...number of frequency estimates returned by pYIN

$M_z$...number of $f_0$ estimates with zero padding

$hop$...hop size of pYIN

In [None]:
x_len = 1024*8
x_test = x_sing[:x_len]
f0_test, flag, probs = librosa.pyin(x_test, sr=sr, frame_length=pyin_frame_length, hop_length=pyin_hop_length,
                       fmin=200, fmax=1000)
print("N:", x_test.size, "length of x[n] at sample rate", sr)
print("pYIN frame length:", pyin_frame_length, " and pYIN hop size:", pyin_hop_length)
print( "M:", f0_test.size, "number of f0 estimations")

expected_Mz = x_test.size / pyin_hop_length
print("Expected Mz:", expected_Mz, "calculated number of f0 estimations with zero padding")

__So the number of frequency estimations of the librosa pYIN implementation is__

$$M = \left\lfloor \frac{N}{hop} \right\rfloor + 1$$

for signals of arbitrary length.
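
The formula follows from the centered framing. Assuming the signal is padded by `frame_length // 2` samples on each side before it is cut into frames, the frame length cancels out for even frame lengths and only $N$ and the hop size remain. The helper below is a hypothetical sketch of that arithmetic.

```python
def n_pyin_frames(n_samples, frame_length=2048, hop_length=512, center=True):
    """Number of analysis frames for a signal of n_samples."""
    if center:
        # frame_length // 2 samples of padding on each side
        n_samples = n_samples + 2 * (frame_length // 2)
    return 1 + (n_samples - frame_length) // hop_length

for n in (8192, 441000):
    print(n, n_pyin_frames(n), n // 512 + 1)
```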

In [None]:
x_len = 1234567  # if x_sing is shorter, slicing silently clips to x_sing.size
x_test = x_sing[:x_len]
f0_test, flag, probs = librosa.pyin(x_test, sr=sr, frame_length=pyin_frame_length, hop_length=pyin_hop_length,
                       fmin=200, fmax=1000)
print("N:", x_test.size, "length of x[n] at sample rate", sr)
print("pyin frame length:", pyin_frame_length, " and pyin hop size:", pyin_hop_length)
print( "M:", f0_test.size, "number of f0 estimations")

expected_M = x_test.size // pyin_hop_length + 1
print("Expected M:", expected_M, "calculated number of f0 estimations")

__Solution:__
Let's zero-pad the input signal $x[n]$ to prevent problems later on.
This is simple and it works.

In [None]:
x_sing = np.concatenate((x_sing, np.zeros(x_sing.size % pyin_frame_length)))
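
If the goal of the padding is a signal length that is a whole number of blocks (an assumption, since the required length is not stated here), the number of zeros to append is the amount missing to the next multiple. Note that `len(x) % block` is the remainder beyond the last full block, while `(-len(x)) % block` is the count that is still missing. The helper name is hypothetical.

```python
import numpy as np

def pad_to_multiple(x, block):
    """Append zeros so that len(x) becomes the next multiple of `block`."""
    n_pad = (-len(x)) % block   # samples missing to the next multiple (0 if aligned)
    return np.concatenate((x, np.zeros(n_pad, dtype=x.dtype)))

x = np.ones(441000)             # stand-in for ten seconds of audio at 44100 Hz
x_padded = pad_to_multiple(x, 2048)
print(x.size, x_padded.size)
```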

### Pitch-Shifting

In [None]:
N = 512
overlap_factor = 2
y_sing = reshift.pitch_shift_ola(x_sing, sr, rho_sing, rho_N, N, overlap_factor)

In [None]:
# plot and play
plt.rcParams['figure.figsize'] = [15, 5]
reshift._my_plot(y_sing, sr, "processed signal")
IPython.display.Audio(y_sing, rate=sr)

OK, this works, but there are a lot of artifacts and the pitch is not perceived as clearly discrete.
We can still hear the vibrato and see it in the spectrogram.
So let's analyze the pitch-discretized signal and adjust the algorithm's parameters for a better result.


## Analysis and Parameters

First let's analyze the pitch of the pitch-discretized signal.

In [None]:
f_0_y_sing, voiced_flag, voiced_probs = librosa.pyin(y_sing, fmin=fmin, fmax=fmax, sr=sr,
                                                     frame_length=pyin_frame_length, hop_length=pyin_hop_length)

In [None]:
plt.rcParams['figure.figsize'] = [15, 10]
original, = plt.plot(f_0_sing, label="original")
target, = plt.plot(f_0_disc_sing, label="target")
processed, = plt.plot(f_0_y_sing, label="processed")
plt.legend(handles=[original, target, processed])
IPython.display.Audio(y_sing, rate=sr)

This figure reflects what we can hear.
As a first goal, we can focus on the part at the beginning, where the pitch is flat.
How can we flatten the pitch?
Which parameters can we adjust?
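
To compare parameter settings by more than eye and ear, a single number for the remaining deviation is useful. The sketch below is a hypothetical metric, not part of `reshift`: it computes the deviation of the processed pitch from the target in cents and takes the RMS over all frames where both values are valid.

```python
import numpy as np

def rms_cents(f_processed, f_target):
    """RMS deviation from the target pitch in cents, ignoring invalid frames."""
    f_processed = np.asarray(f_processed, dtype=float)
    f_target = np.asarray(f_target, dtype=float)
    n = min(f_processed.size, f_target.size)   # tolerate slightly different lengths
    with np.errstate(divide="ignore", invalid="ignore"):
        cents = 1200.0 * np.log2(f_processed[:n] / f_target[:n])
    cents = cents[np.isfinite(cents)]
    return float(np.sqrt(np.mean(cents ** 2))) if cents.size else float("nan")

# Synthetic frames: on target, unvoiced, one semitone sharp
err = rms_cents([440.0, np.nan, 466.1638], [440.0, 440.0, 440.0])
print(err)
```

A lower value for the same input and target means the processed pitch stays closer to the discrete target.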

In [None]:
help(reshift.pitch_shift_ola)

__OLA pitch-shifting parameter improvement:__

- bigger overlap factor

- smaller block size N

In [None]:
help(librosa.pyin)

__pYIN pitch-tracking parameter improvement:__

- use a signal with less reverb

- smaller hop_length

- adjust the frame_length around 2048

- adjust the maximum transition rate if the tracker cannot keep up with fast pitch changes

- adjust the pitch resolution around 0.01 (one cent)?

- What exactly does the win_length for calculating the autocorrelation do?

- What is n_thresholds exactly for (peak estimation)?
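
The hop length and the maximum transition rate interact: the rate is given in octaves per second, so the largest pitch step between two adjacent frames depends on the hop. The sketch below assumes librosa's documented default of 35.92 octaves per second, which is worth checking against the installed version. The helper name is hypothetical.

```python
def max_jump_semitones(max_transition_rate=35.92, hop_length=512, sr=44100):
    """Largest pitch change between adjacent frames, in semitones."""
    octaves_per_frame = max_transition_rate * hop_length / sr
    return 12.0 * octaves_per_frame

for hop in (128, 256, 512, 2048):
    print(hop, round(max_jump_semitones(hop_length=hop), 2))
```

A smaller hop therefore tightens the allowed jump per frame unless the transition rate is raised accordingly.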

### Parameter Adjustment of OLA Pitch-Shifting

In [None]:
# parameters
N = 256
overlap_factor = 2

y_sing = reshift.pitch_shift_ola(x_sing, sr, rho_sing, rho_N, N, overlap_factor)

# analysis
f_0_y_sing, voiced_flag, voiced_probs = librosa.pyin(y_sing, fmin=fmin, fmax=fmax, sr=sr,
                                                     frame_length=pyin_frame_length, hop_length=pyin_hop_length)

In [None]:
# plot and play
original, = plt.plot(f_0_sing, label="original")
target, = plt.plot(f_0_disc_sing, label="target")
processed, = plt.plot(f_0_y_sing, label="processed")
plt.legend(handles=[original, target, processed])
IPython.display.Audio(y_sing, rate=sr)

### DEBUGGING:

There is something wrong with the OLA pitch-shifting algorithm.
Let's compare it with librosa's algorithm and correct the OLA algorithm.

In [None]:
N = 512
overlap_factor = 2
y_dbg_rosa = reshift.pitch_shift_rosa(x_sing, rho_sing, rho_N)
y_dbg_ola = reshift.pitch_shift_ola(x_sing, sr, rho_sing, rho_N, N, overlap_factor)

In [None]:
IPython.display.Audio(y_dbg_rosa, rate=sr)

In [None]:
IPython.display.Audio(y_dbg_ola, rate=sr)

Librosa's algorithm produces the expected target pitch shift, but the OLA algorithm does not work yet.
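
For debugging, a synthetic tone with a known frequency is easier to judge than a voice. The following sketch, with a hypothetical helper name, measures the strongest spectral peak of a signal. Feeding a pure tone through a pitch shifter with a constant factor $\rho$ should move this peak close to the input frequency times $\rho$.

```python
import numpy as np

def dominant_frequency(x, sr):
    """Frequency (Hz) of the strongest bin in the magnitude spectrum of x."""
    spectrum = np.abs(np.fft.rfft(x * np.hanning(len(x))))
    freqs = np.fft.rfftfreq(len(x), d=1.0 / sr)
    return freqs[np.argmax(spectrum)]

sr = 44100
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 440.0 * t)   # one second of a 440 Hz test tone
f_peak = dominant_frequency(tone, sr)
print(f_peak)
```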

Let's leave it like that for now and try the Rollers algorithm, which might be implemented in a non-blocking manner.