In [None]:
%pylab inline
rcParams['figure.figsize'] = (15, 4) #wide graphs by default
from __future__ import print_function
from __future__ import division

## [Cepstrum](https://en.wikipedia.org/wiki/Cepstrum)
> ... is the result of taking the inverse Fourier transform (IFT) of the logarithm of the estimated spectrum of a signal.

- _complex_ cepstrum throws away *no* information, allowing signal reconstruction
- _real_ cepstrum throws away the imaginary part of the log spectrum (the phase) and keeps only the log magnitude
- _phase_ cepstrum...
- _power_ cepstrum is defined as the squared magnitude of the inverse Fourier transform of the logarithm of the squared magnitude of the Fourier transform of a signal. We can use `abs` instead of the squared magnitude, because $\ln|X|^2 = 2\ln|X|$ only rescales the log spectrum by a constant

We will use the _power_ cepstrum calculated by FT → abs() → log → IFT
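
The FT → abs() → log → IFT chain can be sketched as a small NumPy function. This is a minimal illustration, not the notebook's exact code: the name `power_cepstrum`, the `eps` guard against `log(0)` and the synthetic test frame are assumptions, and it uses `irfft` on the one-sided spectrum so that the result is real and its index counts lag in samples.

```python
import numpy as np

def power_cepstrum(frame, n_fft=4096, eps=1e-12):
    """FT -> abs() -> log -> IFT for one frame of a real signal."""
    windowed = frame * np.hanning(len(frame))      # taper the frame edges
    mag = np.abs(np.fft.rfft(windowed, n=n_fft))   # one-sided magnitude spectrum
    log_mag = np.log(mag + eps)                    # eps (assumed) avoids log(0)
    return np.fft.irfft(log_mag, n=n_fft)          # real sequence of length n_fft

# Hypothetical test frame: two sinusoids plus a little noise
rng = np.random.default_rng(0)
t = np.arange(2048) / 16000.0
frame = np.sin(2 * np.pi * 200 * t) + 0.5 * np.sin(2 * np.pi * 400 * t)
frame = frame + 0.01 * rng.standard_normal(len(t))

cepstrum = power_cepstrum(frame)
```

Because the log-magnitude spectrum is real, the resulting cepstrum is real and symmetric, so only its first half needs to be inspected.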

In [None]:
from scipy.io import wavfile
from IPython.display import Audio

In [None]:
sr, e = wavfile.read('media/e.wav')
sr

In [None]:
plot(e)
Audio(e, rate=sr)

In [None]:
start = 4000
fourier_trans = rfft(e[start:start + 2048] * hanning(2048), n=4096)
mag_spectrum_e = abs(fourier_trans)

In [None]:
plot(mag_spectrum_e)
pass

In [None]:
log_mag_spec_e = log(mag_spectrum_e)
plot(log_mag_spec_e[:1000])
pass

In [None]:
freqs = linspace(0, sr / 2, 2049)
plot(freqs, log_mag_spec_e)
xlabel('Frequency (Hz)')
ylabel('Log amplitude')
xlim((0, 4000))
grid()

In [None]:
sr, a = wavfile.read('media/a.wav')
print(sr)
plot(a)
Audio(a, rate=sr)

In [None]:
start = 4000
fourier_trans = rfft(a[start:start + 2048] * hanning(2048), n=4096)
mag_spectrum_a = abs(fourier_trans)
log_mag_spec_a = log(mag_spectrum_a)
freqs = linspace(0, sr/2, 2049)
plot(freqs, log_mag_spec_a)
xlabel('Frequency (Hz)')
ylabel('Log amplitude')
xlim((0, 4000))
grid()

In [None]:
sr, o = wavfile.read('media/o.wav')
print(sr)
plot(o)
Audio(o, rate=sr)

In [None]:
start = 4000
fourier_trans = rfft(o[start:start + 2048] * hanning(2048), n=4096)
subplot(211)
mag_spectrum_o = abs(fourier_trans)
plot(mag_spectrum_o[:1000])
subplot(212)
log_mag_spec_o = log(mag_spectrum_o)
plot(log_mag_spec_o[:1000])
pass

In [None]:
plot(log_mag_spec_a[:1000], alpha=0.5)
plot(log_mag_spec_e[:1000], alpha=0.5)
plot(log_mag_spec_o[:1000], alpha=0.5)
legend(['a','e','o'])
pass

- similar fundamentals
- similar overall contour, but different in detail

In [None]:
#cepstrum_a = abs(ifft(log_mag_spec_a))
#cepstrum_a = irfft(log_mag_spec_a)
cepstrum_a = ifft(log_mag_spec_a)

In [None]:
plot(cepstrum_a[:])
pass

In [None]:
cepstrum_a[0:4]

There's always a huge peak near zero quefrency that we will ignore when estimating pitch.

In [None]:
plot(cepstrum_a[2:])
pass

In [None]:
plot(cepstrum_a[100:200])
pass

In [None]:
16000 / (38 + 100) # Hz

> The independent variable of the cepstrum is nominally time since it is the IDFT of a log-spectrum, but is interpreted as a frequency since we are treating the log spectrum as a waveform. The name of the independent variable of the cepstrum is known as a _quefrency_ and a linear filtering operation is known as _liftering_.

From a lecture on [Cepstral analysis](http://research.cs.tamu.edu/prism/lectures/sp/l9.pdf).

Each quefrency bin in this cepstrum space is called a "cepstral coefficient". We will use these coefficients to build filters and signals.

If you find clear peaks in the cepstrum, it is likely that there is pitched content at those quefrencies; if you find no peaks, the content may be unpitched or noisy.
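
As a rough check of the peak-to-pitch idea, the sketch below builds a synthetic pulse train with a known period and reads the pitch back from the cepstral peak. The pulse train, the 80-400 Hz search range and the use of `irfft` (so that the peak index equals the period in samples) are assumptions made for illustration.

```python
import numpy as np

sr = 16000                      # assumed sample rate (Hz)
period = 128                    # samples per pulse, so f0 = 16000 / 128 = 125 Hz
frame = np.zeros(2048)
frame[::period] = 1.0           # synthetic pulse train standing in for a voiced frame

n_fft = 4096
mag = np.abs(np.fft.rfft(frame * np.hanning(len(frame)), n=n_fft))
cepstrum = np.fft.irfft(np.log(mag + 1e-12), n=n_fft)

# Search only quefrencies that correspond to a plausible pitch range (assumed 80-400 Hz)
lo, hi = sr // 400, sr // 80    # 40 .. 200 samples
n0 = lo + int(np.argmax(cepstrum[lo:hi]))
f0 = sr / n0                    # quefrency (samples) converted to Hz
```

Restricting the search range plays the same role as the `[100:150]` slices used in the cells of this notebook: it skips the large low-quefrency coefficients and the higher multiples of the pitch period.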

### [Source-filter](https://en.wikipedia.org/wiki/Source%E2%80%93filter_model_of_speech_production) separation

Human voices and other signals may be modeled as a combination of a source signal and a filter.

$$s(t) = x(t) * y(t)$$

where $s$ is the voice signal, $x$ is the source or excitation signal, $y$ is the filter and $*$ denotes convolution. In the frequency domain the convolution becomes a product:

$$S(f) = X(f)Y(f)$$

$$|S(f)| = |X(f)||Y(f)|$$

$$\ln|S(f)| = \ln|X(f)| + \ln|Y(f)|$$ (the logarithm turns the product into a sum)


$$\mathcal{F}^{-1}\big[\ln|S(f)|\big] = \mathcal{F}^{-1}\big[\ln|X(f)|\big] + \mathcal{F}^{-1}\big[\ln|Y(f)|\big]$$
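
The chain of identities above can be checked numerically. In this sketch the random `x` and `y` are stand-ins for an excitation and a filter impulse response, and the helper `ceps` is a hypothetical name; the FFT length is chosen at least as long as the convolved signal so that the spectrum of the convolution equals the product of the spectra.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_normal(64)          # stand-in excitation
y = rng.standard_normal(16)          # stand-in filter impulse response
s = np.convolve(x, y)                # s = x * y (linear convolution), length 79

n_fft = 128                          # >= len(s), so S = X * Y holds bin by bin
S, X, Y = (np.fft.rfft(v, n=n_fft) for v in (s, x, y))

def ceps(spectrum):
    """Inverse transform of the log magnitude of a one-sided spectrum."""
    return np.fft.irfft(np.log(np.abs(spectrum)), n=n_fft)

lhs = ceps(S)                        # cepstrum of the combined signal
rhs = ceps(X) + ceps(Y)              # sum of the two separate cepstra
```

`lhs` and `rhs` agree to floating-point precision, which is what makes it possible to separate source and filter by selecting different quefrency ranges.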

- [lyrebird.ai](https://lyrebird.ai/demo)
- [Source Filter Analysis](http://web.science.mq.edu.au/~cassidy/comp449/html/ch07s05.html)
- [Speak and Spell](http://www.datamath.org/Speech_IC.htm)
- [Linear Predictive Coding](https://en.wikipedia.org/wiki/Linear_predictive_coding)

In [None]:
pulse = list(r_[1, zeros(50)]) * 40
plot(pulse)
pass

In [None]:
plot(abs(rfft(pulse)))
pass

In [None]:
# we're shaping the spectrum using cos as a tool
filtered = abs(rfft(pulse)) * cos(linspace(0, 0.5 * pi, len(rfft(pulse)), endpoint=False))
plot(filtered)
pass

### Separate source and filter

Find the first peak in the cepstrum, ignoring the lowest, largest coefficients. This peak corresponds to the fundamental period (pitch) of the original signal. Everything below this peak we treat as filter coefficients; everything above it we treat as part of the source/excitation signal in the source-filter model.
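
One way to sketch this split is shown below, under some assumptions: the function name `split_cepstrum`, the synthetic "voiced" frame (a pulse train through a decaying 1 kHz resonance) and the cut-off `n0=100` are all hypothetical, and `irfft`/`rfft` are used so that the two parts add back up to the original log spectrum exactly.

```python
import numpy as np

def split_cepstrum(log_mag, n0):
    """Split a one-sided log-magnitude spectrum into envelope + excitation parts."""
    n_fft = 2 * (len(log_mag) - 1)
    ceps = np.fft.irfft(log_mag, n=n_fft)
    low = np.zeros_like(ceps)
    low[:n0] = ceps[:n0]                       # coefficients below the pitch peak
    if n0 > 1:
        low[-(n0 - 1):] = ceps[-(n0 - 1):]     # their mirror images at the end
    high = ceps - low                          # everything from n0 upward
    return np.fft.rfft(low).real, np.fft.rfft(high).real

sr = 16000
pulses = np.zeros(2048)
pulses[::128] = 1.0                            # excitation with a 128-sample period
n = np.arange(200)
resonance = np.exp(-n / 40.0) * np.cos(2 * np.pi * 1000 * n / sr)  # stand-in filter
voiced = np.convolve(pulses, resonance)[:2048]

mag = np.abs(np.fft.rfft(voiced * np.hanning(2048), n=4096))
log_mag = np.log(mag + 1e-12)
envelope, excitation = split_cepstrum(log_mag, n0=100)
```

Here `envelope` is the smooth spectral shape attributed to the filter, and `excitation` carries the harmonic ripple attributed to the source.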

In [None]:
n0 = argmax(cepstrum_a[100:200]) + 100 # +100 to make up for starting at 100
n0

In [None]:
16000 / n0 # to Hz

Everything before the peak represents the filter...

In [None]:
cepstrum_filter_a = abs(fft.fft(cepstrum_a[:n0 - 1], n=2048))
plot(cepstrum_filter_a[:1000])
grid()

In [None]:
plot(cepstrum_filter_a, 'green', lw=3)
twinx()
plot(log_mag_spec_a)
xlim((0,1000))

Everything after represents the source (excitation).

In [None]:
source_coeffs = r_[zeros(n0), cepstrum_a[n0:]]

In [None]:
cepstrum_source_a = fft.fft(source_coeffs, n=2048)
plot(abs(cepstrum_source_a)[:1025])
pass

In [None]:
source_spec = np.e ** (abs(cepstrum_source_a)[:1025])
plot(source_spec)
pass

In [None]:
source = fft.ifft(source_spec)
plot(source)
pass

In [None]:
source_cycled = list(source * 500) * 50 # the 500 here is because Audio expects 16bit range
plot(source_cycled)
Audio(source_cycled, rate=sr*2) 

In [None]:
cepstrum_e = ifft(log_mag_spec_e)
n0 = argmax(cepstrum_e[100:150]) + 100
cepstrum_filter_e = abs(fft.fft(cepstrum_e[:n0 - 1], n=2048))
plot(abs(cepstrum_filter_e), 'green', lw=3)
twinx()
plot(log_mag_spec_e)
xlim((0,1000))
16000 / n0

In [None]:
cepstrum_o = ifft(log_mag_spec_o)
n0 = argmax(cepstrum_o[100:150]) + 100 # had to do some minor manual adjustments here!
cepstrum_filter_o = abs(fft.fft(cepstrum_o[:n0 - 1], n=2048))
plot(cepstrum_filter_o, 'green', lw=3)
twinx()
plot(log_mag_spec_o)

xlim((0,1000))
16000 / n0

In [None]:
freqs = linspace(0, sr/2, 2048)
plot(freqs, cepstrum_filter_a)
plot(freqs, cepstrum_filter_e)
plot(freqs, cepstrum_filter_o)

legend(['a','e','o'])
xlabel('Frequency (Hz)')
grid()
xlim((0, 4000))
title('Cepstra extracted filter for vowels');
pass

Different amounts of detail can be preserved by using more or fewer cepstral coefficients.
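
The effect of the coefficient count can also be measured on a made-up log spectrum whose smooth part is known. Everything in this sketch is an assumption for illustration: the two Gaussian "formant" bumps, the harmonic ripple and the helper name `smoothed`.

```python
import numpy as np

n_fft = 512
k = np.arange(n_fft // 2 + 1)
# Hypothetical log spectrum: two broad "formant" bumps plus fast harmonic ripple
envelope = 3 * np.exp(-((k - 40) / 15.0) ** 2) + 2 * np.exp(-((k - 120) / 25.0) ** 2)
log_mag = envelope + 0.5 * np.cos(2 * np.pi * k / 8)

ceps = np.fft.irfft(log_mag, n=n_fft)

def smoothed(ceps, n):
    """Rebuild the log spectrum from the first n cepstral coefficients (n > 1)."""
    low = np.zeros_like(ceps)
    low[:n] = ceps[:n]
    low[-(n - 1):] = ceps[-(n - 1):]     # mirrored half of the kept coefficients
    return np.fft.rfft(low).real

# RMS distance between each smoothed spectrum and the known smooth part
errors = {}
for n in [10, 15, 30, 50]:
    errors[n] = np.sqrt(np.mean((smoothed(ceps, n) - envelope) ** 2))
```

With few coefficients the bumps are blurred; with more, the reconstruction approaches the smooth part while still excluding the ripple, which sits at quefrency 64 in this example.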

In [None]:
num_coeffs = [10, 15, 30, 50]

for n in num_coeffs:
    cepstrum_filter = abs(fft.fft(cepstrum_a[:n], n=512))
    plot(abs(cepstrum_filter)[:250])

legend(num_coeffs)
grid()
title('Different number of coefficients for Cepstral filter ("a")');

In [None]:
num_coeffs = [10, 15, 30,50]

for n in num_coeffs:
    cepstrum_filter = abs(fft.fft(cepstrum_e[:n], n=512))
    plot(abs(cepstrum_filter)[:250])

legend(num_coeffs)
grid()
title('Different number of coefficients for Cepstral filter ("e")');

In [None]:
num_coeffs = [10, 15, 30, 50]

for n in num_coeffs:
    cepstrum_filter = abs(fft.fft(cepstrum_o[:n], n=512))
    plot(abs(cepstrum_filter)[:250])

legend(num_coeffs)
grid()
title('Different number of coefficients for Cepstral filter ("o")');

## Using the [DCT](http://en.wikipedia.org/wiki/Discrete_cosine_transform) instead of the FFT

DCT type II:

$$X_k =
 \sum_{n=0}^{N-1} x_n \cos \left[\frac{\pi}{N} \left(n+\frac{1}{2}\right) k \right] \quad \quad k = 0, \dots, N-1.$$



The DCT is another type of harmonic analysis.

- fewer cosine functions are needed to approximate a typical signal
- uses only real numbers
- reduced computational complexity 
- for DCT type II, each harmonic is shifted by 0.5 "steps" within the analysis window
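
The DCT-II formula above can be evaluated directly and compared with a library routine. This is a sketch: `dct2_direct` is a hypothetical helper that builds the full cosine matrix, which is fine for small `N` but not how a fast implementation works. SciPy's unnormalized `dct(x, type=2)` includes an extra factor of 2 relative to the formula.

```python
import numpy as np
from scipy.fft import dct

def dct2_direct(x):
    """Evaluate the DCT-II sum literally, one cosine per output bin."""
    N = len(x)
    n = np.arange(N)
    k = n.reshape(-1, 1)
    basis = np.cos(np.pi / N * (n + 0.5) * k)   # row k holds the k-th cosine
    return basis @ x

x = np.random.default_rng(2).standard_normal(16)
X_direct = dct2_direct(x)
X_scipy = dct(x, type=2)     # unnormalized DCT-II: twice the sum in the formula
```

The `(n + 0.5)` term in the basis is the half-sample shift mentioned in the list above.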

In [None]:
N = 1024
k = 0
phs = linspace(k * 0.5*pi/N, (k * pi *(N-0.5))/N, N)
plot(cos(phs))
pass

In [None]:
k = 1
phs = linspace(k * 0.5*pi/N, (k * pi *(N-0.5))/N, N)
plot(cos(phs))
pass

In [None]:
k = 2
phs = linspace(k * 0.5*pi/N, (k * pi *(N-0.5))/N, N)
plot(cos(phs))
pass

This produces some asymmetrical aliasing in the second half of the spectrum (i.e. it is not symmetrical like the Fourier transform of a real input).

In [None]:
k = 1023
phs = linspace(k * 0.5 * pi / N, (k * pi * (N - 0.5)) / N, N)
plot(cos(phs))
pass

In [None]:
phs = linspace(k * 0.5*pi/N, (k * pi *(N-0.5))/N, N)
plot(cos(phs)[0:100])
pass

In [None]:
from scipy.fftpack import dct

In [None]:
cepstrum_dct_o = dct(log_mag_spec_o)
n0 = argmax(cepstrum_dct_o[100:150]) + 100 # had to do some minor manual adjustments here!
cepstrum_dct_o = abs(fft.fft(cepstrum_dct_o[:n0 - 1], n=4096))
plot(cepstrum_dct_o, 'green', lw=3)

twinx()
plot((cepstrum_filter_o), 'r')

twinx()
plot(log_mag_spec_o)

xlim((0,1000))

## Pitch estimation

In [None]:
plot(cepstrum_a[1:200])
pass

In [None]:
len(cepstrum_a)

In [None]:
plot(cepstrum_e[1:200])
pass

In [None]:
plot(cepstrum_o[1:200])
pass

The x-axis in a cepstrum plot is called quefrency, but it is in fact a time axis.
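
A short worked conversion, assuming the quefrency index counts samples of lag (as the `sr / n0` conversions in this notebook do) and using the example values `sr = 16000` and `n0 = 138`:

```python
sr = 16000          # sample rate in Hz (value used elsewhere in this notebook)
n0 = 138            # quefrency index of a cepstral peak, e.g. 38 + 100

period_seconds = n0 / sr      # the peak read as a time: 0.008625 s
pitch_hz = sr / n0            # the same peak read as a frequency: about 115.94 Hz
```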

In [None]:
argmax(cepstrum_a[100:150]) + 100

In [None]:
argmax(cepstrum_e[100:150]) + 100

In [None]:
argmax(cepstrum_o[100:150]) + 100

In [None]:
f_a = sr /(argmax(cepstrum_a[100:150]) + 100)
f_a

In [None]:
f_e = sr /(argmax(cepstrum_e[100:150]) + 100)
f_e

In [None]:
f_o = sr /(argmax(cepstrum_o[100:150]) + 100)
f_o

In [None]:
freqs = linspace(0, sr/2, 2049)
plot(freqs, log_mag_spec_a)
xlim((0, 500))
vlines(f_a, 0, 14)
grid()

In [None]:
freqs = linspace(0, sr/2, 2049)
plot(freqs, log_mag_spec_e)
xlim((0, 500))
vlines(f_e, 0, 14)
grid()

In [None]:
freqs = linspace(0, sr/2, 2049)
plot(freqs, log_mag_spec_o)
xlim((0, 500))
vlines(f_o, 0, 16)
grid()

## Harmonic vs. noisy spectra

In [None]:
noise = 5000.0 * (random.random(2048) - 0.5)
fourier_trans = rfft(noise * hanning(2048), n=4096)
mag_spectrum_noise = abs(fourier_trans)
log_mag_spec_noise = log(mag_spectrum_noise)
plot(log_mag_spec_noise)

In [None]:
cepstrum_noise = ifft(log_mag_spec_noise)
plot(abs(cepstrum_noise[1:]))

In [None]:
sinsig = 2500.0 * ((sin(linspace(0, 20*2*pi,2048))) + (sin(linspace(0, 40*2*pi,2048))))
fourier_trans = rfft(sinsig * hanning(2048), n=4096)
mag_spectrum_sinsig = abs(fourier_trans)
log_mag_spec_sinsig = log(mag_spectrum_sinsig)
plot(log_mag_spec_sinsig)

In [None]:
cepstrum_sinsig = ifft(log_mag_spec_sinsig)
plot(abs(cepstrum_sinsig[1:]))

The simplest way to tell harmonic from noisy content is to set a threshold on the maximum value of the cepstrum, but other techniques that measure the flatness or peakedness of the cepstrum can also be used.
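
As one possible peakedness measure, the sketch below compares the largest cepstral value in a pitch-range window with the RMS of that window. The function name `cepstral_peakedness`, the 80-400 Hz range, the synthetic test signals and the threshold of 6 are all assumptions; a threshold for real recordings would need tuning.

```python
import numpy as np

def cepstral_peakedness(frame, sr, fmin=80.0, fmax=400.0, n_fft=4096):
    """Ratio of the largest cepstral value to the RMS over a pitch-range window."""
    mag = np.abs(np.fft.rfft(frame * np.hanning(len(frame)), n=n_fft))
    ceps = np.fft.irfft(np.log(mag + 1e-12), n=n_fft)
    lo, hi = int(sr / fmax), int(sr / fmin)     # quefrency range in samples
    window = ceps[lo:hi]
    return window.max() / np.sqrt(np.mean(window ** 2))

sr = 16000
pulses = np.zeros(2048)
pulses[::128] = 1.0                                         # pitched test signal (125 Hz)
noise = np.random.default_rng(3).uniform(-0.5, 0.5, 2048)   # unpitched test signal

threshold = 6.0   # assumed cut-off; would need tuning on real recordings
is_pitched = {name: bool(cepstral_peakedness(sig, sr) > threshold)
              for name, sig in [("pulses", pulses), ("noise", noise)]}
```

A single dominant peak gives a large ratio, while a noise-like cepstrum with many similar values gives a ratio of only a few.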

# Sinusoidal modeling

A signal is modeled as a sum of time-varying sinusoids. Each sinusoidal component is:

$$P_k(n) = \alpha_k(n)\sin(\phi_k(n))$$

The signal is the sum of each individual sinusoid:

$$ s(n) = \sum\limits_{k}P_k(n)$$

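*n* is the point in time (sample number) and *k* is the index of each sinusoidal component; $\alpha_k(n)$ is the instantaneous amplitude of component *k* and $\phi_k(n)$ is its instantaneous phase.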
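
The model can be turned into a small synthesis routine. This is a sketch under assumptions: the function name `synthesize`, the choice to obtain $\phi_k(n)$ as a running sum of a per-sample frequency in Hz, and the two example partials are illustrative and not part of the notebook.

```python
import numpy as np

def synthesize(amps, freqs, sr):
    """Sum of partials: amps and freqs are (K, n_samples) arrays, freqs in Hz."""
    phases = 2 * np.pi * np.cumsum(freqs, axis=1) / sr   # phi_k(n) as a running sum
    partials = amps * np.sin(phases)                     # P_k(n)
    return partials.sum(axis=0)                          # s(n)

sr = 16000
n = np.arange(sr // 10)                                  # 0.1 s of samples
amps = np.vstack([np.linspace(1.0, 0.0, len(n)),         # partial 0 fades out
                  np.full(len(n), 0.3)])                 # partial 1 stays constant
freqs = np.vstack([np.linspace(220.0, 440.0, len(n)),    # partial 0 glides up an octave
                   np.full(len(n), 660.0)])
s = synthesize(amps, freqs, sr)
```

Analysis goes the other way: the rest of this section estimates the frequency and amplitude of each partial from the peaks of successive spectrogram frames.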


In [None]:
spec, freqs, bins, _ = specgram(e, NFFT=2048, Fs=16000, noverlap=1024, pad_to=8192)

In [None]:
spec.shape

In [None]:
plot(spec[:,0])
pass

First we find the local maxima to identify peaks:

In [None]:
maxima = argwhere((spec[:-2, 0] < spec[1:-1, 0]) & (spec[2:, 0] < spec[1:-1, 0])) + 1

In [None]:
plot(spec[:,0])
plot(maxima, spec[maxima, 0], 'o')
xlim(0, 2000)
pass

Now filter by threshold (let's choose 100000):

In [None]:
peaks = [index for index in maxima if spec[index, 0] > 100000]
peaks

In [None]:
plot(spec[:,0])
plot(peaks, spec[peaks, 0], 'o')
pass

In [None]:
peak_list = []
for s in spec.T:
    maxima = argwhere((s[:-2] < s[1:-1]) & (s[2:] < s[1:-1])) + 1
    peaks = [(freqs[index][0], s[index][0]) for index in maxima if s[index] > 100000]
    peak_list.append(peaks)

In [None]:
A = (4,5)
A[0] = 2 # raises TypeError: tuples are immutable

In [None]:
peak_list[0]

In [None]:
array(peak_list[0])

In [None]:
array(peak_list[0])[0]

In [None]:
array(peak_list[0])[0, 0]

In [None]:
array(peak_list[0])[:,0]

In [None]:
array(peak_list[0])[:,1] # amplitudes

In [None]:
for i, peaks in enumerate(peak_list):
    peak_freqs = array(peaks)[:,0] # separate name so the specgram `freqs` axis is not overwritten
    plot(ones(len(peak_freqs))*i, peak_freqs, 'o')

In [None]:
specgram(e, NFFT=2048, Fs=16000, noverlap=1024, pad_to=8192)
for i, peaks in enumerate(peak_list):
    peak_freqs = array(peaks)[:,0] # separate name so the specgram `freqs` axis is not overwritten
    plot(ones(len(peak_freqs))*bins[i], peak_freqs, 'o')

In [None]:
specgram(e, NFFT=2048, Fs=16000, noverlap=1024, pad_to=8192, interpolation='nearest')
for i, peaks in enumerate(peak_list):
    peak_freqs = array(peaks)[:,0] # separate name so the specgram `freqs` axis is not overwritten
    plot(ones(len(peak_freqs))*bins[i], peak_freqs, 'o')

ylim((0, 1000))

Top part of the spectrum:

In [None]:
specgram(e, NFFT=2048, Fs=16000, noverlap=1024, pad_to=8192, interpolation='nearest')
for i, peaks in enumerate(peak_list):
    peak_freqs = array(peaks)[:,0] # separate name so the specgram `freqs` axis is not overwritten
    plot(ones(len(peak_freqs))*bins[i], peak_freqs, 'o')

ylim((1500, 3000))

Now connect the dots. First start tracks at initial peak list:

In [None]:
tracks = [[r_[freq, amp, bins[0]]] for freq, amp in peak_list.pop(0)]
tracks

In [None]:
tracks_ = array(tracks)
tracks_, tracks_.shape

Then start connecting frame by frame:

In [None]:
new_peaks = peak_list.pop(0)
new_peaks

In [None]:
f = new_peaks[0][0]
f

In [None]:
tracks[:]

In [None]:
last_bps_freq = []
last_bps_time = []

for bps in tracks:
    last_bps_freq.append(bps[-1][0]) # frequency of each track's last breakpoint
    last_bps_time.append(bps[-1][2]) # time of each track's last breakpoint

print(last_bps_freq)

In [None]:
tracks[:]

But the last breakpoint must be from the previous frame!

In [None]:
previous_frame_time = bins[0]
previous_frame_time

In [None]:
active_tracks = argwhere(array(last_bps_time) == previous_frame_time) # explicit array for the element-wise comparison
active_tracks = array(active_tracks)
active_tracks

In [None]:
prev_freqs = array(tracks)[active_tracks,-1,0]
prev_freqs

In [None]:
dists = abs(prev_freqs - f)
dists

In [None]:
argmin(dists)

In [None]:
active_tracks[argmin(dists)]

In [None]:
best_matches = []
for peak in new_peaks:
    f = peak[0]
    dists = abs(prev_freqs - f)
    best_matches.append([active_tracks[argmin(dists)][0], dists.min()]) # [0] unwraps the track index so each row is [track, distance]

best_matches = array(best_matches)
best_matches

Now that we have the best match for each of the new points, we need to decide which ones get connected.
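
One possible decision rule is sketched below: each new peak proposes its nearest track, a proposal is dropped if it is farther away than a maximum distance, and when several peaks propose the same track the closest one wins. The function name `match_peaks`, the default `max_dist` of 100 Hz and the example frequencies are assumptions; the sketch expects at least one existing track.

```python
import numpy as np

def match_peaks(prev_freqs, new_freqs, max_dist=100.0):
    """Map track index -> new-peak index, with at most one peak per track."""
    prev_freqs = np.asarray(prev_freqs, dtype=float)
    new_freqs = np.asarray(new_freqs, dtype=float)
    dists = np.abs(new_freqs[:, None] - prev_freqs[None, :])  # (n_new, n_tracks)
    nearest_track = dists.argmin(axis=1)      # best track for each new peak
    connections = {}
    for peak_idx, track_idx in enumerate(nearest_track):
        d = dists[peak_idx, track_idx]
        if d >= max_dist:
            continue                          # too far: leave this peak unconnected
        best = connections.get(int(track_idx))
        if best is None or d < dists[best, track_idx]:
            connections[int(track_idx)] = peak_idx   # closest claimant wins
    return connections

prev = [110.0, 220.0, 330.0]                  # last frequencies of three tracks
new = [112.0, 218.0, 225.0, 900.0]            # peaks found in the next frame
connections = match_peaks(prev, new)
```

In this example the peaks at 218 Hz and 225 Hz both propose the 220 Hz track and the closer one is kept, while the 900 Hz peak is left unconnected and could start a new track.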

In [None]:
argwhere(best_matches[:, 0] == 0)

In [None]:
argwhere(best_matches[:, 0] == 3)

In [None]:
best_matches[argwhere(best_matches[:, 0] == 3)]

In [None]:
best_matches[argwhere(best_matches[:, 0] == 3)][:,:,1]

In [None]:
best_next = argmin(best_matches[argwhere(best_matches[:, 0] == 3)][:,:,1])
best_next

In [None]:
best_next += argwhere(best_matches[:, 0] == 3)[0]
best_next

Check whether the match is close enough:

In [None]:
dist_th = 100 # Set maximum distance allowed for connection
if best_matches[best_next,1] < dist_th:
    print("Match!")

Now, all together:

In [None]:
spec, freqs, bins, im = specgram(e, 2048, 16000, noverlap=1024, pad_to=8192)

peak_list = []
for s in spec.T:
    maxima = argwhere((s[:-2] < s[1:-1]) & (s[2:] < s[1:-1])) + 1
    peaks = [(freqs[index], s[index]) for index in maxima if s[index] > 1000]
    peak_list.append(peaks)

tracks = [[r_[freq, amp, bins[0]]] for freq, amp in peak_list.pop(0)] # initial tracks from initial peaks
tracks = array(tracks)

In [None]:
new_peaks = peak_list.pop(0)
last_bps = tracks[:,-1,:]
last_bps[:,2]
previous_frame_time = bins[0] # time of the first frame (0.064 s here) instead of a hard-coded value

active_tracks = argwhere(last_bps[:,2] == previous_frame_time)
active_tracks = array(active_tracks)
prev_freqs = tracks[active_tracks,-1,0]

best_matches = []
for peak in new_peaks:
    f = peak[0]
    dists = abs(prev_freqs - f)
    best_matches.append([active_tracks[argmin(dists)][0], dists.min()])

best_matches = array(best_matches)
connections = dict()

for i in set(best_matches[:, 0]):
    best_next = argmin(best_matches[argwhere(best_matches[:, 0] == i)][:,:,1])
    best_next += argwhere(best_matches[:, 0] == i)[0]
    connections[int(i)] = best_next[0]

connections

Finally place breakpoints in tracks. (Exercise left to the reader)

Try sinusoidal modeling:

http://www.klingbeil.com/spear/

http://mtg.upf.edu/technologies/sms

http://www.cerlsoundgroup.org/Loris/

https://ccrma.stanford.edu/~juan/ATS_manual.html

Streaming real-time:

* http://www.csounds.com/manual/html/partials.html
* http://doc.sccode.org/Classes/TPV.html

By: Andrés Cabrera mantaraya36@gmail.com
For MAT course MAT 201A at UCSB

Adapted by Karl Yerkes

This ipython notebook is licensed under the CC-BY-NC-SA license: http://creativecommons.org/licenses/by-nc-sa/4.0/

![http://i.creativecommons.org/l/by-nc-sa/3.0/88x31.png](http://i.creativecommons.org/l/by-nc-sa/3.0/88x31.png)