# Speaker Recognition with SincNet

### Reference Paper
**Title:** *Speaker Recognition from Raw Waveform with SincNet* (2018)  
**Authors:** Mirco Ravanelli & Yoshua Bengio (MILA)

## 1. Context and Motivation
Traditional speaker recognition systems largely rely on **hand-crafted features** (e.g., FBANK, MFCC) or **i-vector** representations. While robust, these fixed feature extraction methods are not optimized for the task and may smooth out or discard narrow-band cues, such as pitch and formants, that help distinguish speaker identities.

Recent trends have shifted towards **Deep Learning on raw waveforms**, typically using Convolutional Neural Networks (CNNs). Standard CNNs process raw audio using learned Finite Impulse Response (FIR) filters:

$$y[n] = (x \ast h)[n] = \sum_{l=0}^{L-1} h[l] \cdot x[n-l]$$

where $L$ is the filter length and all elements of $h$ are learned parameters.
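
As a minimal sketch of this convolution, the snippet below evaluates the sum over the $L$ filter taps in plain Python. The function name `fir_convolve`, the "valid" output range, and the moving-average example are illustrative assumptions, not part of the paper; a real model would use a framework's 1-D convolution layer.

```python
def fir_convolve(x, h):
    """FIR convolution: y[n] = sum over taps l of h[l] * x[n - l].

    Only positions where the filter fully overlaps the signal are returned.
    """
    L = len(h)
    return [
        sum(h[l] * x[n - l] for l in range(L))
        for n in range(L - 1, len(x))
    ]


# Illustrative input: a 3-tap moving average applied to a short signal.
# In a standard CNN, every entry of h would be a learned parameter.
x = [0.0, 3.0, 6.0, 3.0, 0.0]
h = [1 / 3, 1 / 3, 1 / 3]
y = fir_convolve(x, h)
print(y)
```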

**The Problem:** The first layer of a standard CNN often struggles with raw audio. It operates directly on high-dimensional input samples and must learn all $L$ taps of every filter, so it tends to learn noisy, hard-to-interpret filters, particularly when training data is scarce.

## 2. The SincNet Innovation
SincNet introduces a constrained architectural inductive bias. Instead of learning every point in the filter kernel $h$, SincNet defines the first convolution layer as a set of **parametrized band-pass filters**.

### Mathematical Formulation
The core idea is to convolve the waveform with a function $g$ that depends only on a few learnable parameters $\theta$:
$$y[n] = (x \ast g_{\theta})[n]$$

In the **Frequency Domain**, a generic band-pass filter can be defined as the difference between two rectangular functions:
$$G[f, f_1, f_2] = \text{rect}\left(\frac{f}{2f_2}\right) - \text{rect}\left(\frac{f}{2f_1}\right)$$
where $f_1$ and $f_2$ are the learned low and high cutoff frequencies.

By applying the Inverse Fourier Transform, we obtain the filter in the **Time Domain** (using the property that the transform of a rect is a sinc):
$$g[n, f_1, f_2] = 2f_2 \text{sinc}(2\pi f_2 n) - 2f_1 \text{sinc}(2\pi f_1 n)$$
where $\text{sinc}(x) = \sin(x)/x$.

**Key Advantage:** This reduces the number of parameters per filter from $L$ (filter length, e.g., 251) to just **2** ($f_1, f_2$).
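
The time-domain formula can be sketched in a few lines of plain Python. This is an illustration under stated assumptions rather than the authors' implementation: the cutoffs are taken as normalized frequencies (cycles per sample, between 0 and 0.5), the kernel is centered at $n = 0$ with odd length $L$, and the helper names `sinc`, `sinc_bandpass`, and `gain` are hypothetical.

```python
import math


def sinc(x):
    """Unnormalized sinc, sin(x)/x, with the limit value 1 at x = 0."""
    return 1.0 if x == 0 else math.sin(x) / x


def sinc_bandpass(f1, f2, L):
    """Truncated band-pass kernel g[n] for n = -(L-1)/2 .. (L-1)/2.

    f1 and f2 are cutoffs normalized by the sampling rate (assumption).
    """
    half = (L - 1) // 2
    return [
        2 * f2 * sinc(2 * math.pi * f2 * n) - 2 * f1 * sinc(2 * math.pi * f1 * n)
        for n in range(-half, half + 1)
    ]


def gain(g, f):
    """Magnitude of the kernel's frequency response at normalized frequency f."""
    half = (len(g) - 1) // 2
    re = sum(v * math.cos(2 * math.pi * f * (k - half)) for k, v in enumerate(g))
    im = sum(v * math.sin(2 * math.pi * f * (k - half)) for k, v in enumerate(g))
    return math.hypot(re, im)


# Two numbers (f1, f2) fully determine all 251 taps of the kernel.
g = sinc_bandpass(f1=0.1, f2=0.2, L=251)
print(f"gain inside the band (f=0.15): {gain(g, 0.15):.3f}")
print(f"gain outside the band (f=0.35): {gain(g, 0.35):.3f}")
```

The gain should be close to 1 inside the band and close to 0 outside it, with some residual ripple caused by truncating the kernel.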

## 3. Practical Implementation & Signal Processing Constraints
To make this approach viable in a deep learning framework, several practical engineering constraints are applied:

**A. Parameter Constraints**
To ensure that the cutoff frequencies are physically meaningful ($f_2 \geq f_1 \geq 0$), the filters are not built from the raw learned parameters $f_1, f_2$ directly. Instead, the raw parameters are first mapped to constrained values:
* $f_1^{\text{abs}} = |f_1|$
* $f_2^{\text{abs}} = f_1^{\text{abs}} + |f_2 - f_1^{\text{abs}}|$
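
A minimal sketch of this mapping, assuming normalized frequencies and a hypothetical helper name `constrain_cutoffs`:

```python
def constrain_cutoffs(f1, f2):
    """Map unconstrained parameters to cutoffs with 0 <= f1_abs <= f2_abs."""
    f1_abs = abs(f1)
    f2_abs = f1_abs + abs(f2 - f1_abs)
    return f1_abs, f2_abs


# Raw values that violate the ordering still produce a valid band.
for raw in [(-0.10, 0.05), (0.30, 0.20), (0.10, 0.25)]:
    low, high = constrain_cutoffs(*raw)
    print(f"raw={raw} -> low={low:.2f}, high={high:.2f}")
```

Because both steps use only absolute values and a sum, the mapping stays differentiable almost everywhere, which is what allows the raw parameters to be trained by gradient descent.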

**B. Windowing**
The ideal sinc filter is infinitely long, so in practice it must be truncated to length $L$. Abrupt truncation causes ripples in the passband and limited attenuation in the stopband (Gibbs phenomenon). To mitigate this, a **Hamming window** $w[n]$ is applied:
$$g_w[n, f_1, f_2] = g[n, f_1, f_2] \cdot w[n]$$
$$w[n] = 0.54 - 0.46 \cos\left(\frac{2\pi n}{L}\right)$$
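
The effect of the window can be checked numerically. The sketch below is an illustration, not the paper's code: it assumes normalized cutoffs of 0.1 and 0.2, a filter length of 251, and hypothetical helper names. It builds the truncated kernel, multiplies it by the Hamming window defined above for $n = 0, \dots, L-1$, and compares the worst-case leakage in a region well above the upper cutoff.

```python
import math


def sinc(x):
    """Unnormalized sinc with the limit value 1 at x = 0."""
    return 1.0 if x == 0 else math.sin(x) / x


def sinc_bandpass(f1, f2, L):
    """Truncated band-pass kernel centered at n = 0 (odd L assumed)."""
    half = (L - 1) // 2
    return [
        2 * f2 * sinc(2 * math.pi * f2 * n) - 2 * f1 * sinc(2 * math.pi * f1 * n)
        for n in range(-half, half + 1)
    ]


def hamming(L):
    """w[n] = 0.54 - 0.46 cos(2 pi n / L) for n = 0 .. L-1, as in the text."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / L) for n in range(L)]


def gain_db(g, f):
    """Frequency-response magnitude in dB at normalized frequency f."""
    re = sum(v * math.cos(2 * math.pi * f * k) for k, v in enumerate(g))
    im = sum(v * math.sin(2 * math.pi * f * k) for k, v in enumerate(g))
    return 20 * math.log10(math.hypot(re, im) + 1e-12)


L = 251
g = sinc_bandpass(0.1, 0.2, L)
g_w = [gv * wv for gv, wv in zip(g, hamming(L))]

# Worst-case stopband leakage on a grid of frequencies from 0.25 to 0.45.
stop = [0.25 + 0.001 * i for i in range(200)]
peak_raw = max(gain_db(g, f) for f in stop)
peak_win = max(gain_db(g_w, f) for f in stop)
print(f"peak stopband gain: truncated {peak_raw:.1f} dB, windowed {peak_win:.1f} dB")
```

The windowed kernel should show noticeably lower stopband leakage, at the cost of a slightly wider transition around each cutoff.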

**C. Symmetry**
Since the sinc function and the window are symmetric, the filter itself is symmetric. Its coefficients can therefore be computed for one half only and mirrored to obtain the other half.
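
The mirroring trick can be sketched as follows for the un-windowed kernel. The function name `half_then_mirror`, the normalized cutoffs, and the odd filter length are assumptions made for illustration; the same idea applies to a symmetric window.

```python
import math


def half_then_mirror(f1, f2, L):
    """Build a symmetric sinc band-pass kernel from its right half only."""
    half = (L - 1) // 2
    # At n = 0 the sinc terms take their limit value 1.
    center = 2 * f2 - 2 * f1
    # Evaluate the kernel only for n = 1 .. half, using
    # 2 f sinc(2 pi f n) = sin(2 pi f n) / (pi n).
    right = [
        (math.sin(2 * math.pi * f2 * n) - math.sin(2 * math.pi * f1 * n))
        / (math.pi * n)
        for n in range(1, half + 1)
    ]
    # g[-n] = g[n], so the left half is the right half reversed.
    return right[::-1] + [center] + right


g = half_then_mirror(0.1, 0.2, 251)
print(len(g), g == g[::-1])
```

Only 125 of the 251 taps require evaluating a sine, which roughly halves the cost of regenerating the filters at every training step.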

## 4. Expected Benefits
1.  **Fast Convergence:** By forcing the network to focus only on high-impact parameters (cutoff frequencies), the model converges significantly faster than standard CNNs.
2.  **Parameter Efficiency:** For $F$ filters of length $L$, the first layer's parameter count drops from $F \cdot L$ to $2 \cdot F$.
3.  **Interpretability:** The resulting filters are human-readable band-pass filters, allowing direct analysis of which frequency bands the network uses to identify speakers.
4.  **Robustness with Limited Data:** The constraints act as a regularizer, making SincNet well suited to regimes with limited training data.
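
The parameter-efficiency claim in item 2 can be checked with quick arithmetic. The filter length of 251 is the example used earlier in these notes; the choice of 80 filters and the function name `first_layer_params` are assumptions made for illustration.

```python
def first_layer_params(num_filters, filter_len):
    """Return (standard CNN, SincNet) parameter counts for the first layer."""
    standard = num_filters * filter_len  # every tap is learned
    sincnet = 2 * num_filters            # only f1 and f2 per filter
    return standard, sincnet


standard, sincnet = first_layer_params(num_filters=80, filter_len=251)
print(standard, sincnet, standard / sincnet)  # 20080 160 125.5
```

Note that the SincNet count does not depend on the filter length, so longer (and therefore more frequency-selective) filters come at no extra parameter cost.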