## Introduction

Concepts
- Query - an excerpt of a song recorded on a device
- Reference - a song in the database

### How the algorithm works (briefly)
#### Data preprocessing
1. We compute a spectrogram
2. We retain only peaks on this spectrogram ($f_1$,$f_2$ ... $f_n$)
3. We use some subset of peak pairs {($f_{t_1}$,$f_{t_2}$,$\Delta_{t_1,t_2}$)} as a hash

#### Prediction

1. Using an inverted index (hash -> songs), we get candidate songs.
2. We choose the song that matches best.
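
The two prediction steps above can be sketched as a lookup in an inverted index followed by vote counting. This is only a minimal illustration: the toy database, the song names and the function names are hypothetical, and counting raw hash hits is a simplification of "matches best".

```python
from collections import Counter, defaultdict

# Hypothetical toy database: song name -> list of (hash, time offset in the song).
reference_hashes = {
    "song_a": [((40, 55, 3), 0), ((55, 60, 2), 3), ((60, 42, 4), 5)],
    "song_b": [((40, 55, 3), 7), ((12, 30, 1), 9)],
}

def build_inverted_index(db):
    """Map every hash to the list of (song, offset) pairs where it occurs."""
    index = defaultdict(list)
    for song, entries in db.items():
        for h, t in entries:
            index[h].append((song, t))
    return index

def candidate_songs(index, query_hashes):
    """Count how many query hashes hit each song in the index."""
    votes = Counter()
    for h, _query_t in query_hashes:
        for song, _ref_t in index.get(h, []):
            votes[song] += 1
    return votes

index = build_inverted_index(reference_hashes)
query = [((40, 55, 3), 0), ((55, 60, 2), 3)]
votes = candidate_songs(index, query)
best = votes.most_common(1)[0][0]
print(votes, best)
```

Here the hash `(40, 55, 3)` occurs in both songs, so the index returns both as candidates, and the vote count separates them.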

There is not much detail on the implementation.


An interesting lecture by the author:

https://www.youtube.com/feed/history

They started working on it in 1999:
<img src="img/shazam_project.png" width=550>

A typical challenge in such applications is noise.
The noise level is usually measured by the signal-to-noise ratio (SNR), for example:
- +20 dB
- +10 dB
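
For reference, SNR in decibels is $10 \log_{10}(P_{signal} / P_{noise})$, where $P$ is the mean power. A small sketch of how one could measure it, and scale a noise track to hit a target SNR such as +10 dB, is below; the helper names and the toy sine signal are assumptions made for illustration.

```python
import math
import random

def power(x):
    """Mean power of a sampled signal."""
    return sum(v * v for v in x) / len(x)

def snr_db(signal, noise):
    """Signal-to-noise ratio in decibels."""
    return 10.0 * math.log10(power(signal) / power(noise))

def scale_noise_to_snr(signal, noise, target_db):
    """Rescale noise so that snr_db(signal, scaled_noise) equals target_db."""
    gain = math.sqrt(power(signal) / (power(noise) * 10 ** (target_db / 10.0)))
    return [gain * v for v in noise]

rng = random.Random(0)
signal = [math.sin(2 * math.pi * 5 * n / 200) for n in range(200)]
noise = [rng.gauss(0.0, 1.0) for _ in range(200)]

noise_10db = scale_noise_to_snr(signal, noise, 10.0)
noisy_query = [s + n for s, n in zip(signal, noise_10db)]
print(round(snr_db(signal, noise_10db), 6))
```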

At that time, state-of-the-art techniques included:
- zero crossings - a metric counting the number of times the signal amplitude changes its sign; it can also be used as a fingerprint in signal matching
<img src="img/zerocrossing.svg" width=250>
- cross-correlation - we compare two signals by sliding the query song along the reference song. If there is a match, there will be a peak
<img src="img/cross_correlation.png" width=300>

    Drawback - cross-correlation breaks when there is even a slight stretch of the signal


- Spectral Flatness - a measure of tonality (how uniform the spectrum is), with 1.0 for white noise and 0.0 for a single frequency.
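
The three techniques listed above can be sketched in a few lines of standard-library Python. This is a toy illustration, not how any production system computes them: the DFT is the naive $O(N^2)$ version, spectral flatness is taken as the ratio of the geometric to the arithmetic mean of the power spectrum, and a unit impulse (which has a perfectly flat spectrum) stands in for idealised white noise.

```python
import cmath
import math

def zero_crossings(x):
    """Count sign changes between consecutive samples."""
    return sum(1 for a, b in zip(x, x[1:]) if (a < 0) != (b < 0))

def cross_correlation(query, reference):
    """Slide the query along the reference; return the score at each lag."""
    m = len(query)
    return [sum(query[i] * reference[lag + i] for i in range(m))
            for lag in range(len(reference) - m + 1)]

def power_spectrum(x):
    """Naive O(N^2) DFT power spectrum (fine for short toy signals)."""
    n = len(x)
    return [abs(sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                    for t in range(n))) ** 2
            for k in range(n)]

def spectral_flatness(x, eps=1e-12):
    """Geometric mean / arithmetic mean of the power spectrum."""
    p = [v + eps for v in power_spectrum(x)]  # eps avoids log(0)
    geometric = math.exp(sum(math.log(v) for v in p) / len(p))
    return geometric / (sum(p) / len(p))

# A tone with 4 full cycles in 64 samples (half-sample phase shift keeps
# samples away from exact zeros).
tone = [math.sin(math.pi * (n + 0.5) / 8) for n in range(64)]
impulse = [1.0] + [0.0] * 63

# A short pattern hidden at position 30 of an otherwise silent reference.
pattern = [1.0, -1.0, 1.0, 1.0, -1.0]
reference = [0.0] * 30 + pattern + [0.0] * 30
scores = cross_correlation(pattern, reference)
best_lag = max(range(len(scores)), key=scores.__getitem__)

print(zero_crossings(tone), best_lag)
print(spectral_flatness(tone), spectral_flatness(impulse))
```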


### Idea #1 - Peaks

Instead of using full spectrograms, they tried using just the peak frequencies (the loudest sound at each moment). It turned out to work very well.

It also made it possible to shrink the signal representation by a factor of 500+ (a typical spectrogram back then had a granularity of 512 bins).
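
A minimal sketch of the idea, assuming the simplest possible reading of "peak" (the single loudest frequency bin in each time frame): the frame size, the non-overlapping frames and the naive DFT below are illustrative choices, not details from the lecture.

```python
import cmath
import math

def magnitude_spectrum(frame):
    """Naive DFT magnitudes for bins 0..N/2 of one frame."""
    n = len(frame)
    return [abs(sum(frame[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                    for t in range(n)))
            for k in range(n // 2 + 1)]

def spectrogram(signal, frame_size=32, hop=32):
    """List of magnitude spectra, one per frame."""
    return [magnitude_spectrum(signal[s:s + frame_size])
            for s in range(0, len(signal) - frame_size + 1, hop)]

def peak_bins(spec):
    """Keep only the index of the loudest frequency bin in each frame."""
    return [max(range(len(frame)), key=frame.__getitem__) for frame in spec]

# Toy signal: 64 samples of a tone in bin 4, then 64 samples of a tone in bin 8.
signal = ([math.sin(2 * math.pi * 4 * n / 32) for n in range(64)]
          + [math.sin(2 * math.pi * 8 * n / 32) for n in range(64)])

spec = spectrogram(signal)
peaks = peak_bins(spec)
print(len(spec), len(spec[0]), peaks)
```

Each frame of 17 magnitudes collapses into a single bin index, which is where the size reduction comes from.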


#### How robust is this representation?

As an experiment, let's add some noise and count the number of peaks that do not change.

No matter how much noise we add, there are always a few peaks that remain the same.

Since only a few points are needed to identify the original song with high precision, peaks are quite robust!
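
The experiment can be reproduced on a toy signal roughly as follows. Everything here is an assumption made for illustration: the two-tone test signal, Gaussian noise, one peak per non-overlapping frame, and the chosen noise levels.

```python
import cmath
import math
import random

def peak_bin(frame):
    """Index of the loudest bin of a naive DFT of one frame."""
    n = len(frame)
    mags = [abs(sum(frame[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                    for t in range(n)))
            for k in range(n // 2 + 1)]
    return max(range(len(mags)), key=mags.__getitem__)

def peaks(signal, frame_size=32):
    """One peak bin per non-overlapping frame."""
    return [peak_bin(signal[s:s + frame_size])
            for s in range(0, len(signal) - frame_size + 1, frame_size)]

def surviving_fraction(clean, noise_std, seed=0):
    """Fraction of per-frame peaks unchanged after adding Gaussian noise."""
    rng = random.Random(seed)
    noisy = [v + rng.gauss(0.0, noise_std) for v in clean]
    before, after = peaks(clean), peaks(noisy)
    return sum(b == a for b, a in zip(before, after)) / len(before)

# Toy signal: a tone in bin 4 for two frames, then a tone in bin 8.
clean = [math.sin(2 * math.pi * (4 if n < 64 else 8) * n / 32)
         for n in range(128)]

for noise_std in (0.0, 0.2, 1.0, 3.0):
    print(noise_std, surviving_fraction(clean, noise_std))
```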



### Idea #2 - Combinatorial Hashing

- Select an anchor point
- Select a target zone - all points in an interval succeeding the anchor point
- Draw an edge from the anchor to each point in the target zone
- Represent each edge as ($f_1$,$f_2$,$\Delta t$)


Only a few edges are needed, and even fewer nodes.
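
The anchor / target zone construction above can be sketched as follows. The parameters `fan_out` (maximum edges per anchor) and `max_dt` (length of the target zone) are hypothetical knobs chosen for this example, and peaks are assumed to be `(time, frequency)` pairs in integer frame and bin units.

```python
def pair_hashes(peaks, fan_out=2, max_dt=10):
    """Pair each anchor peak with up to fan_out later peaks within max_dt.

    Returns a list of ((f1, f2, dt), t1): the hash and the anchor time.
    """
    peaks = sorted(peaks)
    hashes = []
    for i, (t1, f1) in enumerate(peaks):
        paired = 0
        for t2, f2 in peaks[i + 1:]:
            dt = t2 - t1
            if dt > max_dt or paired == fan_out:
                break  # left the target zone or enough edges for this anchor
            if dt > 0:
                hashes.append(((f1, f2, dt), t1))
                paired += 1
    return hashes

peaks = [(0, 40), (3, 55), (5, 60), (9, 42), (30, 70)]
hashes = pair_hashes(peaks)
for h, t1 in hashes:
    print(h, t1)

# The hash itself does not depend on where in the song the pair occurs.
shifted = [(t + 100, f) for t, f in peaks]
shifted_hashes = pair_hashes(shifted)
```

Because a hash stores only $\Delta t$ and not absolute time, the same pair of peaks produces the same hash wherever the excerpt starts; the anchor time is kept alongside for the alignment step.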


### Idea #3 - Temporal Correspondence

How to distinguish between real matches and false matches? If there is a real match, some subset of matched points will be strictly aligned in time.

Suppose there is a number of points that matched (including lots of false alarms). Let's plot them on a query-reference time offsets graph:
- X axis = time offset in original song (reference)
- Y axis = time offset in audio excerpt (query)

If there is a real match, we will see a straight line of matching points:
<img src ="img/time_graph.png" width=1000>

Such a straight line shows up as a peak in a histogram of time-offset differences, similar to a peak in a cross-correlation plot:
<img src ="img/time_histogram.png" width=500>
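
A minimal sketch of this check: for every matched hash take the difference between its reference time and its query time, and look for the most frequent value. The matches below are made-up numbers, with five true matches at a constant offset of 40 and four false alarms.

```python
from collections import Counter

def best_alignment(matches):
    """matches: list of (reference_time, query_time) for matched hashes.

    Returns (offset, count) for the most common reference-minus-query offset.
    """
    histogram = Counter(ref_t - query_t for ref_t, query_t in matches)
    return histogram.most_common(1)[0]

matches = [
    (40, 0), (43, 3), (45, 5), (49, 9), (52, 12),  # true matches, offset 40
    (7, 3), (90, 5), (15, 9), (61, 0),             # false alarms
]
offset, count = best_alignment(matches)
print(offset, count)
```

Points on a straight diagonal line share the same offset, so they pile up in one histogram bin, while false matches scatter across many bins.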

#### Is there any noise reduction in the algorithm?
No

### Algorithm Robustness

Audio modifications that hinder recognition:
- tempo
- pitch

To get modification-invariant fingerprints, you should use ratios instead of tuples of absolute values.

For example, 
$$\frac{f_1}{f_2}$$
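
To see why ratios help, compare a hash built from absolute values with one built from ratios when the audio is pitch-shifted (all frequencies multiplied by a constant) or played at a different tempo (all times multiplied by a constant). The three-peak `ratio_hash` below is a hypothetical construction for illustration only; in particular, using a ratio of time differences to handle tempo is an assumption of this sketch.

```python
import math

def absolute_hash(p1, p2):
    """(f1, f2, dt) built from absolute values."""
    (t1, f1), (t2, f2) = p1, p2
    return (f1, f2, t2 - t1)

def ratio_hash(p1, p2, p3):
    """Frequency ratios and a ratio of time differences."""
    (t1, f1), (t2, f2), (t3, f3) = p1, p2, p3
    return (f1 / f2, f1 / f3, (t2 - t1) / (t3 - t1))

def modify(peaks, pitch=1.0, tempo=1.0):
    """Scale every frequency by pitch and every time by tempo."""
    return [(t * tempo, f * pitch) for t, f in peaks]

original = [(1.0, 440.0), (2.0, 550.0), (4.0, 660.0)]
modified = modify(original, pitch=1.5, tempo=2.0)

abs_same = absolute_hash(*original[:2]) == absolute_hash(*modified[:2])
ratio_same = all(math.isclose(a, b)
                 for a, b in zip(ratio_hash(*original), ratio_hash(*modified)))
print(abs_same, ratio_same)
```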


There is also a good article with an overview:

http://coding-geek.com/how-shazam-works/

Unfortunately, it is a bit outdated.

Here is what data preprocessing looks like:

<img src = "img/shazam_scheme.jpg">


1. Computes a spectrogram
2. Retains only peaks from this spectrogram - this will be the audio fingerprint
3. Constructs a hash from it and compares to song database

## Further Research

### Sonnleitner
[Dissertation](http://www.cp.jku.at/research/papers/Sonnleitner_Dissertation.pdf) (2017)

Goal: propose methods that are more robust to scaling and other signal modifications.

- Spectrograms are constructed using the STFT (short-time Fourier transform).

    It decomposes the signal into frequencies, but only within a local neighborhood of each time point, so the result also depends on time:

    $$\mathbf{STFT}\{x(t)\}(\tau,\omega) \equiv X(\tau, \omega) = \int_{-\infty}^{\infty} x(t) w(t-\tau) e^{-i \omega t} \, d t$$

- They introduce quad (quadruple) fingerprints instead of pairs.

    Consider a peak spectrogram.
    Quads are four-tuples of sorted peaks A, B, C, D such that all of them are contained within the region spanned by A and D.

    The relative nature of this construction makes the fingerprints scale-invariant.

- Only valid peaks are taken into account; weak peaks in uniform regions are rejected.
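
A rough sketch of why a relative construction gives scale invariance, loosely following the description above: B and C are expressed in coordinates normalised to the box spanned by A and D. The function names and the exact normalisation are assumptions of this example; the dissertation's actual construction may differ in its details.

```python
import math

def is_valid_quad(a, b, c, d):
    """B and C must lie inside the axis-parallel box spanned by A and D."""
    (a_t, a_f), (d_t, d_f) = a, d
    return all(a_t < t <= d_t and a_f < f <= d_f for t, f in (b, c))

def quad_hash(a, b, c, d):
    """Coordinates of B and C relative to the box spanned by A and D."""
    (a_t, a_f), (d_t, d_f) = a, d
    width, height = d_t - a_t, d_f - a_f
    return tuple(value
                 for t, f in (b, c)
                 for value in ((t - a_t) / width, (f - a_f) / height))

# Peaks as (time, frequency) pairs.
quad = [(10.0, 100.0), (12.0, 150.0), (16.0, 130.0), (20.0, 200.0)]

# The same quad after a tempo change (time x 1.25) and a pitch change (freq x 0.8).
scaled = [(t * 1.25, f * 0.8) for t, f in quad]

original_hash = quad_hash(*quad)
scaled_hash = quad_hash(*scaled)
print(is_valid_quad(*quad), original_hash, scaled_hash)
```

Since both the numerators and the denominators scale by the same factor, the normalised coordinates stay the same after the time and frequency axes are stretched.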

