# Testing the Reshift pitch discretization effect

* explore the effects of different parameter adjustments and check the sound quality and artifacts of the effect

* increase performance of the overall algorithm

In [None]:
import numpy as np
from scipy.io import wavfile
import scipy.signal as signal
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = [15, 3]
from IPython import display as ipd # for IPython.display.Audio(x, rate=fs)

# import the Reshifter
import sys
sys.path.insert(1, '../py') # insert at 1, 0 is the script path (or '' in REPL)
from reshift import Reshifter

fs, x = wavfile.read("../../samples/Toms_diner.wav")
x = x / np.abs(x).max()  # normalize to a peak magnitude of 1 (also correct if the largest sample is negative)

## parameter candidates to explore

### pYIN pitch-tracking

Parameters and their default values in librosa's implementation:

* frame_length=2048

* hop_length=frame_length // 4

These two parameters adjust the minimum latency of pYIN.

In [None]:
reshifter = Reshifter(sr=fs)
help(reshifter)

Let's try the default settings of the reshift algorithm.

In [None]:
# original
plt.plot(x)
plt.title('original')
plt.show()
ipd.Audio(x, rate=fs)

In [None]:
y = reshifter.discretize(x, 'wholetone')

In [None]:
plt.plot(y)
plt.title('discretized wholetone scale, default parameters')
plt.show()
ipd.Audio(y, rate=fs)

This works very well. The minimum latency of the pitch tracking is $t_{track} = \frac{N}{f_s}$ with

$N$...pYIN frame length

$f_s$...audio sampling rate

For CD-quality audio with a sampling rate of $f_s = 44100Hz$, the librosa default is $N = 2048$.

$N = 2048 \rightarrow 46.4ms$

$N = 1024 \rightarrow 23.2ms$

$N = 512 \rightarrow 11.6ms$

$N = 256 \rightarrow 5.8ms$

$N = 128 \rightarrow 2.9ms$
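
The values above follow directly from $t_{track} = \frac{N}{f_s}$. A minimal sketch that reproduces the table, assuming $f_s = 44100Hz$ as in the text (the variable names are only for illustration):

```python
# Minimum pitch-tracking latency t_track = N / fs for several frame lengths.
fs = 44100  # sampling rate in Hz (CD quality)
frame_lengths = [2048, 1024, 512, 256, 128]

# latency in milliseconds for each frame length N
latency_ms = {N: 1000.0 * N / fs for N in frame_lengths}

for N, t in latency_ms.items():
    print(f"N = {N:4d} -> {t:.1f} ms")
```

For a different sampling rate, only `fs` has to be changed.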

In [None]:
# librosa default setting: N=2048
reshifter = Reshifter(sr=fs, a_N=2048, a_hop=512)
y = reshifter.discretize(x, 'wholetone')

In [None]:
plt.plot(y)
plt.title('discretized wholetone scale, librosa default parameters')
plt.show()
ipd.Audio(y, rate=fs)

There is no difference in pitch tracking quality for this example.

In [None]:
reshifter = Reshifter(sr=fs, a_N=512, a_hop=256)
y = reshifter.discretize(x, 'wholetone')

In [None]:
plt.plot(y)
plt.title('discretized wholetone scale, N=512')
plt.show()
ipd.Audio(y, rate=fs)

In [None]:
reshifter = Reshifter(sr=fs, a_N=256, a_hop=256)
y = reshifter.discretize(x, 'wholetone')

In [None]:
plt.plot(y)
plt.title('discretized wholetone scale, N=256')
plt.show()
ipd.Audio(y, rate=fs)

Pitch tracking with a block size $N$ smaller than 512 samples does not work.
This makes sense, since at least one fundamental period of the sound has to be contained in one block.
If we want to analyze the pitch of a signal with $f_0 = 100Hz$, the minimum frame length is $N_{min} = \frac{f_s}{f_0} = 441$ samples.

__So for minimum latency, a frame length of $N = 512$ is a good value.__
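
The estimate above can be sketched in a few lines. The helper names are hypothetical, and rounding up to the next power of two is an assumption made here because frame lengths in this notebook are powers of two. A practical pitch tracker may need more than one period per frame, so treat the result as a lower bound.

```python
import math

def min_frame_length(f0_min, fs):
    """Smallest number of samples that holds one period of f0_min."""
    return math.ceil(fs / f0_min)

def next_power_of_two(n):
    """Smallest power of two that is >= n."""
    return 1 << (n - 1).bit_length()

fs = 44100
n_min = min_frame_length(100.0, fs)   # one period of a 100 Hz tone
n_frame = next_power_of_two(n_min)    # rounded up to a power of two

print(n_min, n_frame)
```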

Now we remove the overlap between the pitch-tracking frames by setting the hop size equal to the frame length.

In [None]:
reshifter = Reshifter(sr=fs, a_N=512, a_hop=512)
y = reshifter.discretize(x, 'wholetone')

In [None]:
plt.plot(y)
plt.title('discretized wholetone scale, N=512, hop=512')
plt.show()
ipd.Audio(y, rate=fs)

This sounds pretty OK.
So for this example, __no overlap is needed__.
The resulting rate of $f_0$ estimates, $f_{f0} = \frac{f_s}{hop} = 44100 / 512 = 86.13Hz$, is sufficient here.
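
To compare hop settings quickly, here is a small sketch with two hypothetical helpers: one for the rate of $f_0$ estimates and one for the frame overlap.

```python
def f0_rate(fs, hop):
    """Number of f0 estimates per second for a given hop size in samples."""
    return fs / hop

def overlap_fraction(frame_length, hop):
    """Fraction of each frame that is shared with the next frame."""
    return 1.0 - hop / frame_length

fs = 44100
for frame_length, hop in [(2048, 512), (512, 256), (512, 512)]:
    print(f"N={frame_length}, hop={hop}: "
          f"{f0_rate(fs, hop):.2f} Hz, overlap {overlap_fraction(frame_length, hop):.0%}")
```

With `hop == frame_length` the overlap is zero, which is the setting tried above.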


### Rollers pitch-shifting

* filter order of filter bank

* number of used filters in filter bank

(* maybe later: notches between bands, if too many detuning artifacts)


The artifacts produced by the Rollers algorithm are governed by a tradeoff between low-order filters with fewer bands and higher-order filters with more bands.
The _detuning_ artifact appears with wider bands and larger overlaps between the bands: frequencies away from a band's center frequency are shifted to the wrong frequencies, and the overlapping bands produce beats.
The _downward chirp_ artifact is produced by the filter resonances.
Narrower bands and higher filter orders produce stronger resonances.
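
To get a feeling for the size of the detuning, here is a rough sketch. It assumes that every band is shifted by one constant frequency offset that is exactly right only for the band's center frequency. The function name and the numbers are illustrative and not taken from the reshift code.

```python
import math

def detuning_cents(f, f_center, ratio):
    """Pitch error in cents of a partial at f when its whole band is shifted
    by the constant offset that is correct only for f_center."""
    offset = f_center * (ratio - 1.0)  # frequency shift applied to the band, in Hz
    f_actual = f + offset              # where the partial ends up
    f_ideal = f * ratio                # where a true pitch shift would put it
    return 1200.0 * math.log2(f_actual / f_ideal)

ratio = 2.0 ** (2.0 / 12.0)  # shift up by a whole tone
for f in (1000.0, 950.0, 900.0):
    print(f"{f:6.1f} Hz in a band centered at 1000 Hz: "
          f"{detuning_cents(f, 1000.0, ratio):+.1f} cents")
```

The error grows with the distance from the center frequency, which is why wider bands detune more.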

The examples up to here used 100 bands with a filter order of 2.
There is hardly any audible detuning, but the downward chirp artifact is audible.

Let's try a higher filter order, to make the chirp more audible.

In [None]:
reshifter = Reshifter(sr=fs, a_N=512, a_hop=512, filt_order=8, filt_num=100)
y = reshifter.discretize(x, 'wholetone')

In [None]:
plt.plot(y)
plt.title('discretized wholetone scale, filter order=8, number=100, downward chirp')
plt.show()
ipd.Audio(y, rate=fs)

The downward chirp is clearly audible and there is a kind of reverb, since the filters are ringing at their resonance frequency.

Let's decrease the number of bands to hear the detuning artifact.

In [None]:
reshifter = Reshifter(sr=fs, a_N=512, a_hop=512, filt_order=2, filt_num=50)
y = reshifter.discretize(x, 'wholetone')

In [None]:
plt.plot(y)
plt.title('discretized wholetone scale, filter order=2, number=50, detuning')
plt.show()
ipd.Audio(y, rate=fs)

This still sounds OK for this example.
Let's reduce the number of bands even further.

In [None]:
reshifter = Reshifter(sr=fs, a_N=512, a_hop=512, filt_order=2, filt_num=25)
y = reshifter.discretize(x, 'wholetone')

In [None]:
plt.plot(y)
plt.title('discretized wholetone scale, filter order=2, number=25, detuning')
plt.show()
ipd.Audio(y, rate=fs)

Now this sounds different: not really like detuning, but rather like a formant mismatch.

In [None]:
reshifter = Reshifter(sr=fs, a_N=512, a_hop=512, filt_order=2, filt_num=12)
y = reshifter.discretize(x, 'wholetone')

In [None]:
plt.plot(y)
plt.title('discretized wholetone scale, filter order=2, number=12, detuning')
plt.show()
ipd.Audio(y, rate=fs)

Now this effect is even more audible with only 12 bands.
This is in fact the detuning artifact and it is not that bad with 50 filters of order 2 for this example.

In [None]:
reshifter = Reshifter(sr=fs, a_N=512, a_hop=512, filt_order=2, filt_num=35)
y = reshifter.discretize(x, 'wholetone')

In [None]:
plt.plot(y)
plt.title('discretized wholetone scale, filter order=2, number=35, detuning')
plt.show()
ipd.Audio(y, rate=fs)

Even with just 35 bands, the result sounds OK.
This might be due to the small pitch-shifting intervals that most scales require.
Since the biggest interval of most scales is a whole tone, the frequency shift is moderately small and we can use few bands with OK results.
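
A small sketch of how large the shift can get. It assumes that discretization snaps each $f_0$ to the nearest scale note, that the whole-tone scale is rooted on A and that A4 = 440 Hz. These are assumptions for illustration, not necessarily how `Reshifter` defines `'wholetone'`.

```python
import math

A4 = 440.0  # reference tuning, an assumption for this sketch

def snap_to_scale(f0, scale_steps=(0, 2, 4, 6, 8, 10)):
    """Return the nearest scale frequency to f0 and the required shift ratio."""
    semis = 12.0 * math.log2(f0 / A4)  # distance from A4 in semitones
    octave = math.floor(semis / 12.0)
    # scale notes (in semitones from A4) in the octave below, the same and above
    candidates = [12 * (octave + k) + s for k in (-1, 0, 1) for s in scale_steps]
    target = min(candidates, key=lambda c: abs(c - semis))
    f_target = A4 * 2.0 ** (target / 12.0)
    return f_target, f_target / f0

for f0 in (220.0, 233.0, 250.0, 300.0):
    f_target, ratio = snap_to_scale(f0)
    print(f"{f0:.1f} Hz -> {f_target:.2f} Hz (ratio {ratio:.4f})")
```

Under these assumptions the shift never exceeds one semitone (a ratio of about 1.06), since the nearest note of a whole-tone scale is at most half a whole tone away.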

Let's increase the filter order with few bands and check the results.

In [None]:
reshifter = Reshifter(sr=fs, a_N=512, a_hop=512, filt_order=4, filt_num=35)
y = reshifter.discretize(x, 'wholetone')

In [None]:
plt.plot(y)
plt.title('discretized wholetone scale, filter order=4, number=35')
plt.show()
ipd.Audio(y, rate=fs)

In [None]:
reshifter = Reshifter(sr=fs, a_N=512, a_hop=512, filt_order=6, filt_num=35)
y = reshifter.discretize(x, 'wholetone')

In [None]:
plt.plot(y)
plt.title('discretized wholetone scale, filter order=6, number=35')
plt.show()
ipd.Audio(y, rate=fs)

In [None]:
reshifter = Reshifter(sr=fs, a_N=512, a_hop=512, filt_order=8, filt_num=35)
y = reshifter.discretize(x, 'wholetone')

In [None]:
plt.plot(y)
plt.title('discretized wholetone scale, filter order=8, number=35')
plt.show()
ipd.Audio(y, rate=fs)

Since the increase of the filter order does not really increase the sound quality of the result with few bands in this example, I would leave it at order 2 filters.
Let's hear order 1 filters.

In [None]:
reshifter = Reshifter(sr=fs, a_N=512, a_hop=512, filt_order=1, filt_num=35)
y = reshifter.discretize(x, 'wholetone')

In [None]:
plt.plot(y)
plt.title('discretized wholetone scale, filter order=1, number=35')
plt.show()
ipd.Audio(y, rate=fs)

There are strong artifacts compared to the other filter orders.

So for pitch-discretization I would recommend

* __filter order: 2__

* __number of bands: 35 for minimum quality__

__Increase the number of bands for better quality results.__


## Summary: Parameters and sound quality

* pYIN frame length = 512

A frame length of 512 samples at $f_s = 44100Hz$ is a good value for accurate pitch tracking of speech signals with the lowest possible latency.
For low-pitched speech, a larger frame length might be necessary.

* pYIN hop-size = frame length

For pitch discretization, the pitch-tracking hop size can be as large as the frame length.
In some cases, the hop size might be even bigger.

* Rollers filter order = 2

In most cases of pitch discretization, a filter order of 2 yielded the best results, even with few bands.

* Rollers number of bands >= 35

For pitch discretization, the number of bands might be as low as 35 for minimum quality results and can be increased for better sound quality.


## Performance

Now that we have the parameter values for minimum sound quality and lowest latency, how can we make this algorithm more efficient?

* Profiling: what consumes the most resources?


### Rollers: Ideas

* 2nd order IIR filters might be as efficient as we can get

Since we just need that low filter order, we could use FIR filters with less resonance, which might decrease the downward chirp artifact.

* __Polyphase or multirate filter bank:__

This might be the biggest performance boost, since we can reduce the sample rate after filtering, do the SSB and increase the rate again at the summation of the bands.

* SSB: Hilbert transform with allpass instead of true Hilbert transform

* more efficient integration algorithm than trapezoid rule?
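
For reference, this is roughly what the SSB step looks like with the FFT-based Hilbert transform from SciPy. It is a minimal sketch with a constant shift and a test tone, not the code from `reshift.py`. An allpass-based approximation would replace the `hilbert` call.

```python
import numpy as np
from scipy.signal import hilbert

def ssb_shift(x, shift_hz, fs):
    """Shift all frequencies of x by shift_hz using single-sideband modulation."""
    analytic = hilbert(x)  # analytic signal x + j*H{x}, computed via FFT
    t = np.arange(len(x)) / fs
    return np.real(analytic * np.exp(2j * np.pi * shift_hz * t))

fs = 8000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 1000 * t)  # 1 kHz test tone, 1 s long
y = ssb_shift(x, 100.0, fs)

# frequency of the strongest spectral peak of the shifted signal
peak_hz = np.argmax(np.abs(np.fft.rfft(y))) * fs / len(y)
print(peak_hz)
```

The test tone moves from 1000 Hz to 1100 Hz. In the filter bank, each band would get its own shift.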


### pYIN: Ideas

At the moment we use librosa's pYIN implementation, where we can only set the minimal parameters.
We can make this more efficient later.
But it is probably already implemented efficiently, and accurate pitch tracking is very important for this effect.


### Profiling

In [None]:
import cProfile
import pstats

In [None]:
reshifter = Reshifter(sr=fs)

profiler = cProfile.Profile()
profiler.enable()

y = reshifter.discretize(x, 'wholetone')

profiler.disable()
stats = pstats.Stats(profiler).sort_stats('tottime')
stats.print_stats()

In the current reshift implementation, we get the following result for the top time consuming tasks:

* 3s in `librosa/sequence.py(_viterbi)`

* 2.2s in `scipy.fft._pocketfft.pypocketfft.c2c`

* 1s in `reshift.py(pitch_shift)`

* 0.16s in `scipy/signal/signaltools.py:4103(sosfilt)`

* 0.083s in `scipy/signal/signaltools.py:2169(hilbert)`

* 0.079s in `method 'cumsum' of 'numpy.ndarray' objects`

* 0.074s in `scipy/integrate/_quadrature.py:282(cumulative_trapezoid)`
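
The full `print_stats()` output is long. A sketch of how to limit it to the top entries and sort by cumulative time, shown here with a dummy `workload` function (hypothetical) instead of the Reshifter call:

```python
import cProfile
import io
import pstats

def workload(n=20000):
    """Stand-in for the function to be profiled."""
    return sum(i * i for i in range(n))

profiler = cProfile.Profile()
profiler.enable()
workload()
profiler.disable()

# collect the report in a string, strip long paths and show only the top 5 entries
buf = io.StringIO()
stats = pstats.Stats(profiler, stream=buf).strip_dirs().sort_stats('cumulative')
stats.print_stats(5)
report = buf.getvalue()
print(report)
```

Sorting by `'cumulative'` includes the time spent in called functions, while `'tottime'` as used above only counts the time inside each function itself.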

In [None]:
%load_ext snakeviz
%snakeviz y = reshifter.discretize(x, 'wholetone')

This means that more than half of the time is spent in the Rollers `pitch_shift` function. About 3/4 of that is due to the Hilbert transform, which uses an FFT, and the remaining quarter is spent in the `pitch_shift` function itself.

A bit less than half of the time is spent in pYIN, and nearly all of that is spent in the Viterbi algorithm.


### A more efficient solution

* replace the true (FFT-based) Hilbert transform with an allpass approximation

This should reduce the processing time by a lot.

* Test following pYIN parameters:

    - fmin, fmax: a smaller bandwidth of possible pitch might be more efficient
    
    - n_thresholds=100 -> less might be more efficient
  
    - resolution=0.01 -> a coarse resolution might be more efficient

* Test a further reduction of pitch tracking $f_0$ rate to minimize the calls to pYIN

* Look for inefficient code directly in the Rollers pitch_shift function

* Later at pYIN implementation: Can the Viterbi algorithm be more efficiently implemented?

The pYIN paper says that there is little overhead compared to the original YIN algorithm.
Since the Viterbi algorithm comes from the HMM in pYIN, a more efficient implementation should be possible, if they are right.
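
To estimate how much the pYIN parameters listed above could help, here is a rough sketch of the size of the HMM state space. It assumes the pitch grid is built from `fmin`, `fmax` and `resolution` as I understand librosa's implementation, with one voiced and one unvoiced state per pitch bin. The function name is hypothetical, and the real cost also depends on details of the transition matrix.

```python
import math

def pyin_state_count(fmin, fmax, resolution):
    """Approximate number of HMM states (assumption, see text above)."""
    bins_per_semitone = math.ceil(1.0 / resolution)
    n_pitch_bins = math.floor(12 * bins_per_semitone * math.log2(fmax / fmin)) + 1
    return 2 * n_pitch_bins  # one voiced and one unvoiced state per pitch bin

wide = pyin_state_count(65.0, 2093.0, 0.1)
narrow = pyin_state_count(80.0, 400.0, 0.1)
coarse = pyin_state_count(80.0, 400.0, 0.5)
print(wide, narrow, coarse)
```

Fewer states mean less work per frame in the Viterbi decoding, so a narrower `fmin`/`fmax` range and a coarser `resolution` are worth testing first.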