## Utterance Verification for Text-Dependent Speaker Recognition: a Comparative Assessment Using the RedDots Corpus
### http://cs.uef.fi/sipu/pub/1125.pdf

Combining speaker verification (SV) and utterance verification (UV) gives better results than doing them separately.
UV is combined with SV in the HMM approach.
Utterance verification can greatly improve security in the text-prompted case. Here the user is given a phrase to say, and this information can be used to improve verification. In addition, if the phrase is different each time, pre-recording the speaker and replaying the recording is not possible.

In speaker verification we evaluate the log-likelihood ratio score
$$l_s(X, j) = \log \frac{p(X | \text{same speaker})}{p(X | \text{different speaker})}$$

where $X$ - features of the utterance, $j$ - speaker id.

Numerator and denominator can be evaluated with:
* an adapted target speaker GMM and a universal background model (UBM)
* probabilistic linear discriminant analysis (PLDA) with i-vectors as input features

In utterance verification
$$l_u(Y, k) = \log \frac{p(Y | \text{same text})}{p(Y | \text{different text})}$$

where $Y$ - features of the utterance, $k$ - prompted text id

The same methods can be used as in SV.

The authors assume that all phrases are known (they use only 10, as in the first part of RedDots). Under this assumption they can subtract from the $l_u$ of the prompted phrase the mean or max score of the other, wrong phrases.

Combined
$$l_{su}(X, Y, j, k) = l_s(X, j) + l_u(Y, k)$$
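
A minimal sketch of the normalization and fusion described above, assuming the raw UV scores against all known phrases are already available. The function names and the toy scores are hypothetical, not taken from the paper.

```python
import numpy as np

def normalized_uv_score(phrase_scores, prompted, mode="mean"):
    """Subtract the mean (or max) score of the wrong phrases from the prompted phrase's score."""
    scores = np.asarray(phrase_scores, dtype=float)
    others = np.delete(scores, prompted)          # scores of all competing phrases
    competitor = others.mean() if mode == "mean" else others.max()
    return scores[prompted] - competitor

def combined_score(l_s, l_u):
    """Fused score l_su = l_s + l_u."""
    return l_s + l_u

# Hypothetical raw UV scores of one test utterance against 4 known phrases.
raw = [2.0, -1.0, 0.0, 1.0]
l_u = normalized_uv_score(raw, prompted=0, mode="max")
l_su = combined_score(0.5, l_u)                   # 0.5 is a made-up SV score
```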


UV1
* Their system uses MFCC features and a GMM-UBM model (Gaussian mixture model, universal background model).
* MFCC - 20 filters on the Mel scale. RASTA filtering is applied to 19 coefficients, then deltas and double deltas are added to get 57-dimensional features (19 × 3), followed by cepstral mean normalization.
* UBM - 512 components, trained with all male data from TIMIT
* Utterance models are obtained by MAP adaptation with a relevance factor of 3.
* The target-to-UBM log-likelihood ratio is used as the UV score.
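
The UV1 recipe can be sketched as mean-only MAP adaptation of a diagonal-covariance GMM followed by a target-to-UBM score. This is a toy illustration under assumptions: diagonal covariances, mean-only adaptation, and frame-averaged log-likelihoods are my choices, and all function names are hypothetical.

```python
import numpy as np

def log_gauss_diag(X, means, variances):
    """Log density of every frame under every diagonal Gaussian, shape (T, C)."""
    diff = X[:, None, :] - means[None, :, :]
    return -0.5 * (np.sum(diff ** 2 / variances, axis=2)
                   + np.sum(np.log(2 * np.pi * variances), axis=1))

def avg_loglik(X, weights, means, variances):
    """Frame-averaged GMM log-likelihood, computed with log-sum-exp."""
    log_p = np.log(weights) + log_gauss_diag(X, means, variances)
    m = log_p.max(axis=1)
    return float(np.mean(m + np.log(np.exp(log_p - m[:, None]).sum(axis=1))))

def map_adapt_means(X, weights, means, variances, relevance=3.0):
    """Shift the UBM means toward the data; weights and variances stay fixed."""
    log_p = np.log(weights) + log_gauss_diag(X, means, variances)
    log_p -= log_p.max(axis=1, keepdims=True)
    post = np.exp(log_p)
    post /= post.sum(axis=1, keepdims=True)       # responsibilities, (T, C)
    n = post.sum(axis=0)                          # soft counts per component
    ex = (post.T @ X) / np.maximum(n, 1e-10)[:, None]
    alpha = (n / (n + relevance))[:, None]        # data-dependent mixing factor
    return alpha * ex + (1.0 - alpha) * means

def uv_score(X, weights, target_means, ubm_means, variances):
    """Target-to-UBM log-likelihood ratio."""
    return (avg_loglik(X, weights, target_means, variances)
            - avg_loglik(X, weights, ubm_means, variances))
```

With a relevance factor of 3, a component that sees 3 frames worth of soft counts moves halfway from the UBM mean to the data mean.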

UV2
* 2-layer approach with HMM and UBM
* left-to-right 5-state HMM with continuous observation densities modeled with GMMs
* GMMs adapted from a 512-component UBM trained on TIMIT

UV3
* DTW to align the feature vectors of a pair of utterances, with Euclidean distance between frames
* 57 MFCC without RASTA
* the average score against all the reference utterances is used as the score
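
A minimal DTW sketch for the UV3 idea, assuming the standard step pattern (match, insertion, deletion) and a length-normalized accumulated cost. The normalization and the function names are assumptions, not details from the paper.

```python
import numpy as np

def dtw_distance(a, b):
    """Length-normalized DTW cost between two feature sequences of shape (T, D)."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    # Pairwise Euclidean distances between frames, shape (Ta, Tb).
    cost = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)
    acc = np.full((len(a) + 1, len(b) + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            acc[i, j] = cost[i - 1, j - 1] + min(acc[i - 1, j],      # deletion
                                                 acc[i, j - 1],      # insertion
                                                 acc[i - 1, j - 1])  # match
    return acc[len(a), len(b)] / (len(a) + len(b))

def uv3_score(test, references):
    """Average DTW cost of the test utterance against all reference utterances."""
    return float(np.mean([dtw_distance(test, ref) for ref in references]))
```

Lower values mean the test utterance is closer to the reference phrase, so a real system would negate the cost or threshold it from above.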

UV4
* Forced alignment. Created 10 reference transcripts for the 10 sentences using TIMIT. All test segments were force aligned with the references. 

Acoustic phone model
* MFCC with LDA and feature-space maximum likelihood linear regression (fMLLR) were used as DNN inputs, with left and right contexts of 3 frames.
* Training 1: unsupervised training of a stack of RBMs with 1024 hidden units, 6 layers, 13 iterations.
* Training 2: the DNN is trained to classify individual frames into their probability density functions with a cross-entropy objective.
* Training 3: optimize state-level minimum Bayes risk (sMBR) to emphasize state sequences with higher frame accuracy

SV1
* same as UV1

SV2
* same as UV1 but with constant-Q cepstral coefficients as input

SV3
* Speaker-independent HMM (SI-HMM). Speaker models are then derived from the SI-HMM by MAP adaptation of the Gaussian means using the enrollment data.
* During testing, the test data is force-aligned against the target model and the SI-HMM, and the log-likelihood ratio is calculated.
* 14 states and 8 mixtures gave the best results

SV4
* i-vectors: $S = m + Tw$, where $w$ is the i-vector, $S$ is the utterance supervector, $m$ is the UBM supervector, and $T$ is a low-rank matrix.
* Gender-dependent GMM-UBM with 512 mixtures, trained on 157 male speakers from the RSR2015 corpus, consisting of 30 pass phrases from 9 sessions (~42k utterances). i-vector dimension: 400

They did male-only verification because RedDots lacks female subjects.

Evaluation
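* EER - the error rate at the operating point where the false acceptance rate and the false rejection rate are equal
* FAR is reported separately per non-target trial type: FAR(TW) target speaker with wrong phrase, FAR(IC) impostor with correct phrase, FAR(IW) impostor with wrong phrase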
* UV2 
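
The EER definition above can be sketched with a simple threshold sweep. This is an illustrative approximation that picks the threshold where FAR and FRR are closest; the function names and the accept-if-score-is-at-least-threshold convention are assumptions.

```python
import numpy as np

def error_rates(target_scores, nontarget_scores, threshold):
    """FAR and FRR when a trial is accepted if its score >= threshold."""
    far = np.mean(np.asarray(nontarget_scores) >= threshold)  # false acceptances
    frr = np.mean(np.asarray(target_scores) < threshold)      # false rejections
    return far, frr

def equal_error_rate(target_scores, nontarget_scores):
    """Approximate EER: average FAR and FRR at the threshold where they are closest."""
    thresholds = np.sort(np.concatenate([target_scores, nontarget_scores]))
    rates = [error_rates(target_scores, nontarget_scores, t) for t in thresholds]
    far, frr = min(rates, key=lambda r: abs(r[0] - r[1]))
    return (far + frr) / 2
```

A per-type figure such as FAR(TW) would be the same `far` value computed with `nontarget_scores` restricted to trials of that type.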

## Fader Networks: Manipulating Images by Sliding Attributes
### https://arxiv.org/pdf/1706.00409.pdf

The paper is about a generative image model. The model learns attributes of human faces, such as whether the person wears glasses, whether they are old or young, or what their sex is. Given a face, it can generate the same face with some of those attributes changed.

They used an encoder-decoder architecture. The encoder is a convolutional NN $E_{\gamma_{enc}}: X \to R^N$, the decoder is a deconvolutional NN $D_{\gamma_{dec}}: (R^N, y) \to X$. The autoencoder loss is the MSE between an image and its reconstruction.

Training was a problem at first, because the network ignored $y$. They worked around it by forcing the model to produce the same $E(X)$ for all images $X$ of the same person that differ only in the selected attributes $y$. To do this they used adversarial training on the latent space: an additional NN, called the discriminator, is trained to predict $y$ from $E(X)$, while the encoder is trained so that the discriminator cannot identify $y$. This corresponds to a two-player game in which the discriminator tries to learn $y$ and $E$ tries to prevent it.
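
One way to write down the two competing objectives, as a sketch. It assumes binary attributes $y$, a batch of $m$ pairs $(x, y)$, discriminator output $P_{dis}(\cdot \mid E(x))$, and a trade-off weight $\lambda_E$; these symbols are introduced here for illustration and are not in the notes above.

$$L_{dis} = -\frac{1}{m} \sum_{(x, y)} \log P_{dis}(y \mid E(x))$$

$$L_{ae} = \frac{1}{m} \sum_{(x, y)} \left( \lVert D(E(x), y) - x \rVert_2^2 - \lambda_E \log P_{dis}(1 - y \mid E(x)) \right)$$

The discriminator minimizes $L_{dis}$, while the encoder and decoder minimize $L_{ae}$. The second term of $L_{ae}$ rewards the encoder when the discriminator assigns high probability to the wrong attribute value, which pushes attribute information out of $E(x)$.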

## Deep neural networks in speaker recognition
### pl-dydaktyka-kodrzywolek_praca_magisterska_skrot.pdf

Audio parametrization: Voice Activity Detection (removing silent fragments from the recording and keeping the speech), splitting into 20 ms frames, then MFCC, Filter Banks, or Perceptual Linear Prediction (PLP), which is based on Linear Predictive Coding (LPC).

Modeling: Template models: vector quantization, dynamic time warping, nearest neighbors. Stochastic models: Gaussian Mixture Models, Hidden Markov Models. Stochastic models are generally preferred because they have more capacity.

Classification: In template models, the distance from the template; in statistical models, the probability that the sample came from the model.

GMM-UBM: The background model (UBM) is trained with EM on all available recordings. A user model is then generated by adapting the UBM. Adaptation keeps the covariance matrices, but the model means are shifted to maximize the probability of the given speaker's samples. Verification is done by computing the log-likelihood ratio

$$\log P(U | \lambda_m) - \log P(U | \lambda_{UBM})$$ where $U$ is the voice sample, $\lambda_m$ are the parameters of the model of speaker $m$, and $\lambda_{UBM}$ are the parameters of the background model.

i-vectors: An extension of GMM-UBM. The mean vectors of all GMM components for a given speaker are concatenated, which gives a supervector. It can be decomposed into a part shared by all speakers and a speaker-dependent part, $s = m + T i$, where $m$ is the supervector obtained from the UBM, $T$ is the total variability matrix, and $i$ is the i-vector describing the given speaker.

i-vectors are described well in Dehak N., Shum S., Low-dimensional speech representation based on factor analysis and its applications, Johns Hopkins CLSP Lecture, 2011

Evaluation: EER; Detection Cost Function DCF = 0.99 FPR + 0.1 FNR; histograms of the likelihood scores for target/impostor trials; Detection Error Tradeoff (DET) curve, the relationship between FPR on the x axis and FNR on the y axis.
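
A small sketch of the DCF formula above, evaluated at a given threshold and minimized over a threshold sweep. The function names and the accept-if-score-is-at-least-threshold convention are assumptions for illustration.

```python
import numpy as np

def fpr_fnr(target, impostor, threshold):
    """False positive and false negative rates at one decision threshold."""
    fpr = np.mean(np.asarray(impostor) >= threshold)  # impostors accepted
    fnr = np.mean(np.asarray(target) < threshold)     # targets rejected
    return fpr, fnr

def dcf(fpr, fnr):
    """DCF = 0.99 FPR + 0.1 FNR, with the weights quoted in the notes."""
    return 0.99 * fpr + 0.1 * fnr

def min_dcf(target, impostor):
    """Smallest DCF over all candidate thresholds (including 'reject everything')."""
    thresholds = np.concatenate([target, impostor, [np.inf]])
    return min(dcf(*fpr_fnr(target, impostor, t)) for t in thresholds)
```

Because a false positive costs almost ten times as much as a false negative here, the best threshold tends to sit higher than the EER threshold.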

Training stages: Layer-wise pretraining, then fine-tuning as a single network: an MLP, Stacked Denoising Autoencoders, or Deep Belief Networks. ReLU + Dropout makes it possible to skip pretraining. When labeled data is very scarce, pretraining on a larger unlabeled set followed by fine-tuning with labels is still used.

The ReLU activation function has variants: plain ReLU, where a neuron dies once it gets stuck at 0; leaky ReLU, max(ax, x) with a ~ 0.01; PReLU, where a is separate for each neuron and learned; Exponential Linear Unit.
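
The variants listed above, written out in NumPy as a quick reference. The ELU scale `alpha=1.0` is a common default that I am assuming, it is not stated in the thesis notes.

```python
import numpy as np

def relu(x):
    """Plain ReLU: zero output (and zero gradient) for negative inputs."""
    return np.maximum(0.0, x)

def leaky_relu(x, a=0.01):
    """Leaky ReLU: max(ax, x) with a small fixed slope a for negative inputs."""
    return np.maximum(a * x, x)

def prelu(x, a):
    """PReLU: like leaky ReLU, but a is a learned array with one slope per neuron."""
    return np.where(x > 0, x, a * x)

def elu(x, alpha=1.0):
    """Exponential Linear Unit: smooth saturation toward -alpha for negative inputs."""
    return np.where(x > 0, x, alpha * np.expm1(np.minimum(x, 0.0)))
```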

Autoencoders: denoising autoencoders apply a 0-1 mask to the input vector: $r \sim Bernoulli(p), x' = x \odot r$. The encoder and decoder weights can also be tied, $W_d = W_e^T$. They can also be given many layers, which yields a Stacked Autoencoder.
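
A toy sketch of the masking and weight tying described above. It assumes that $p$ is the probability of keeping an input, a sigmoid hidden layer, and a linear decoder; the function names are hypothetical.

```python
import numpy as np

def corrupt(x, p_keep, rng):
    """Multiply the input element-wise by a Bernoulli(p_keep) 0-1 mask."""
    r = rng.binomial(1, p_keep, size=x.shape)
    return x * r

def tied_autoencoder(x, W_e, b_e, b_d):
    """Encode with W_e and decode with its transpose (W_d = W_e^T)."""
    h = 1.0 / (1.0 + np.exp(-(x @ W_e + b_e)))   # sigmoid hidden layer
    return h @ W_e.T + b_d                        # linear reconstruction

def denoising_loss(x, W_e, b_e, b_d, p_keep, rng):
    """MSE between the clean input and the reconstruction of its corrupted copy."""
    recon = tied_autoencoder(corrupt(x, p_keep, rng), W_e, b_e, b_d)
    return float(np.mean((recon - x) ** 2))
```

The loss compares against the clean `x`, not the masked copy, which is what makes the autoencoder "denoising".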

LSTM - he barely described it :(

Regarding DNNs: instead of building a network directly for SV, one usually takes a network trained on a modified task and uses it as a part of the classifier.

[30] is old, [31] is hard, "The next work [33] used Deep Belief Networks as an extractor of pseudo i-vectors, that is, vectors that served the same purpose but did not come from the original method. The network consisted of 5 hidden layers, each containing 1000 neurons. It used the MFCC vectors of eleven consecutive frames of the audio signal, and the pseudo i-vectors were extracted from the network as statistics of the activations of its last layer. Work [34], in turn, was modeled on GMM-UBM. A DBN model was trained on all available i-vectors and named Universal DBN, after the original UBM. Then, for each enrolled speaker, their own DBN model was created by fine-tuning (adapting) the Universal DBN model with that speaker's recordings, first with unsupervised learning and later discriminatively for the target-impostor classes, where the target class represented solely the given speaker. The above works achieved accuracy close to that of the i-vector method, but none managed to surpass it"

Bottleneck features in [35, 36, 37]

Excellent results in [38]. A 7-layer RBM; for a recording, the outputs of one layer are taken, averaged over the whole utterance, and treated as the feature of the whole recording. At the end, the CDS, LDA, and PLDA classifiers are used. The result was that awesome because they trained LDA and PLDA on the test set. xd

There is also the Google model [39, 40]. Speaker verification based on the phrase _Ok, google_. A direct model, because they have a huge amount of data. Filter bank features from the audio. Then N frames go into an LSTM, which outputs a speaker feature vector. This is then compared via CDS (cosine distance similarity) with the vectors from enrollment. Speaker vectors are created by averaging the vectors of the utterances.

d-vector = a vector representing an utterance, obtained from a DNN.

The author sets out to combine [38] and [40]. An i-vector is built the traditional way, then the i-vector plus a few frames of the recording are fed to the input of a DNN. Several DNNs will be tried. The activation vector of one of the DNN layers is taken, averaged over the whole utterance, and reduced in dimensionality. The resulting d-vector is compared with the enrollment ones.

Base parametrization: 13 PLP features, together with delta and delta-delta features, 39 per frame in total. It uses 5 neighboring frames on each side, a context of 11 frames, so the vector has 429 parameters in total.
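
The context-window arithmetic (39 features × 11 frames = 429) can be sketched as follows. Repeating the first and last frame at the utterance edges is my assumption; the thesis notes do not say how the boundaries are handled.

```python
import numpy as np

def stack_context(frames, left=5, right=5):
    """Concatenate each frame with its neighbors: (T, D) -> (T, D * (left + right + 1))."""
    T, _ = frames.shape
    # Edge padding (assumption): repeat the first/last frame at the boundaries.
    padded = np.pad(frames, ((left, right), (0, 0)), mode="edge")
    return np.stack([padded[t:t + left + right + 1].ravel() for t in range(T)])

# 20 hypothetical frames of 39 features (13 PLP + delta + delta-delta).
features = np.arange(20 * 39, dtype=float).reshape(20, 39)
stacked = stack_context(features)   # shape (20, 429)
```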

Dimensionality reduction: first unsupervised PCA to 400, then supervised LDA to 200.

Classification: cosine similarity between the test d-vector and a d-vector built from several enrollment recordings.
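
A minimal sketch of this scoring step, assuming the enrollment d-vector is the plain average of the per-recording d-vectors and that a fixed threshold makes the decision. Function names and the threshold are hypothetical.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def enroll(d_vectors):
    """Speaker model: average of the d-vectors of the enrollment recordings."""
    return np.mean(d_vectors, axis=0)

def verify(test_d_vector, enrollment_d_vectors, threshold):
    """Return the similarity score and the accept/reject decision."""
    score = cosine_similarity(test_d_vector, enroll(enrollment_d_vectors))
    return score, score >= threshold
```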

* 27 - Sak H., Senior A.W., Rao K., Beaufays F., Fast and accurate recurrent neural network acoustic models for speech recognition, CoRR, abs/1507.06947, 2015.
* 28 - Farrell K.R., Mammone R.J., Assaleh K.T., Speaker recognition using neural networks and conventional classifiers, IEEE Transactions on Speech and Audio Processing, 2(1), 194–205, 1994.
* 29 - Wouhaybi R.H., Al-Alaoui M.A., Comparison of neural networks for speaker recognition, Electronics, Circuits and Systems, 1999. Proceedings of ICECS'99. The 6th IEEE International Conference on, vol. 1, 125–128, IEEE, 1999.
* 30 - Konig Y., Heck L., Weintraub M., Sonmez K., et al., Nonlinear discriminant feature extraction for robust text-independent speaker recognition, ESCA workshop on Speaker Recognition and its Commercial and Forensic Applications, 72–75, 1998.
* 31 - Lee H., Pham P., Largman Y., Ng A.Y., Unsupervised feature learning for audio classification using convolutional deep belief networks, Advances in neural information processing systems, 1096–1104, 2009.
* 32 - Senoussaoui M., Dehak N., Kenny P., Dehak R., Dumouchel P., First attempt of boltzmann machines for speaker verification, Odyssey, 117–121, 2012.
* 33 - Vasilakakis V., Cumani S., Laface P., Speaker recognition by means of deep belief networks, 2013.
* 34 - Ghahabi O., Hernando J., Deep belief networks for i-vector based speaker recognition, Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on, 1700–1704, IEEE, 2014.
* 35 - Richardson F., Reynolds D., Dehak N., A unified deep neural network for speaker and language recognition, arXiv preprint arXiv:1504.00923, 2015.
* 36 - Richardson F., Reynolds D., Dehak N., Deep neural network approaches to speaker and language recognition, Signal Processing Letters, IEEE, 22(10), 1671–1675, 2015.
* 37 - Tian Y., Cai M., He L., Lu J., Investigation of bottleneck features and multilingual deep neural networks for speaker verification, INTERSPEECH, 1151–1155, ISCA, 2015.
* 38 - Liu Y., et al., Deep feature for text-dependent speaker verification, ScienceDirect, 2015.
* 39 - Chen Y., Lopez-Moreno I., Sainath T.N., Visontai M., Alvarez R., Parada C., Locally-connected and convolutional neural networks for small footprint speaker recognition, INTERSPEECH, 1136–1140, ISCA, 2015.
* 40 - Heigold G., Moreno I., Bengio S., Shazeer N., End-to-end text-dependent speaker verification, CoRR, abs/1509.08062, 2015.
* 44 - Hinton G., A practical guide to training restricted boltzmann machines, Momentum, 9(1), 926, 2010.
* 45 - Bengio Y., Practical recommendations for gradient-based training of deep architectures, Neural Networks: Tricks of the Trade, 437–478, Springer, 2012.
* 46 - Larochelle H., Bengio Y., Louradour J., Lamblin P., Exploring strategies for training deep neural networks, The Journal of Machine Learning Research, 10, 1–40, 2009.
* 49 - Cumani S., Glembek O., Brümmer N., De Villiers E., Laface P., Gender independent discriminative speaker recognition in i-vector space, Acoustics, Speech and Signal Processing (ICASSP), 2012 IEEE International Conference on, 4361–4364, IEEE, 2012.

## EXPLOITING SEQUENCE INFORMATION FOR TEXT-DEPENDENT SPEAKER VERIFICATION
### https://infoscience.epfl.ch/record/225954/files/Dey_Idiap-RR-04-2017.pdf

> Model-based approaches to Speaker Verification (SV), such as Joint Factor Analysis (JFA), i-vector and relevance Maximum-a-Posteriori (MAP), have shown to provide state-of-the-art performance for text-dependent systems with fixed phrases.
> The performance of i-vector and JFA models has been further enhanced by estimating posteriors from Deep Neural Network (DNN) instead of Gaussian Mixture Model (GMM).

(from abstract)

Most of the techniques to tackle text-dependent SV can be grouped into two main categories: (a) model-based and (b) template-based (Dynamic Time Warping (DTW)) techniques.

PLDA (probabilistic LDA) is a classification method. It turns LDA, a dimensionality-reduction technique, into a probabilistic model that can be used for classification.

---
JFA!!! http://www1.icsi.berkeley.edu/Speech/presentations/AFRL_ICSI_visit2_JFA_tutorial_icsitalk.pdf

$$s = m + Vy + Ux + Dz$$

speaker supervector = speaker-independent component + speaker-dependent component + channel-dependent component + speaker-dependent residual component

* $m$ - speaker-independent supervector (from UBM)
* $V$ - eigenvoice matrix
* $y$ - speaker factors, assumed to have an $N(0, I)$ prior distribution
* $U$ - eigenchannel matrix
* $x$ - channel factors, $N(0, I)$ prior
* $D$ - residual matrix, diagonal
* $z$ - speaker-specific residual factors, $N(0, I)$ prior
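
A toy numerical illustration of the decomposition, with made-up dimensions and random matrices (nothing here is trained). It shows that two sessions of the same speaker share $m + Vy + Dz$ and differ only through the channel term $Ux$.

```python
import numpy as np

rng = np.random.default_rng(0)
SV_DIM, R_VOICE, R_CHANNEL = 12, 3, 2        # toy sizes, chosen arbitrarily

m = rng.normal(size=SV_DIM)                   # speaker-independent supervector
V = rng.normal(size=(SV_DIM, R_VOICE))        # eigenvoice matrix
U = rng.normal(size=(SV_DIM, R_CHANNEL))      # eigenchannel matrix
d = rng.uniform(0.1, 0.5, size=SV_DIM)        # diagonal of the residual matrix D

y = rng.standard_normal(R_VOICE)              # speaker factors
z = rng.standard_normal(SV_DIM)               # speaker-specific residual factors

def supervector(x):
    """s = m + Vy + Ux + Dz for one session with channel factors x."""
    return m + V @ y + U @ x + d * z

s1 = supervector(rng.standard_normal(R_CHANNEL))   # session 1
s2 = supervector(rng.standard_normal(R_CHANNEL))   # session 2, same speaker

# The between-session difference lies in the column space of U.
coef, *_ = np.linalg.lstsq(U, s1 - s2, rcond=None)
residual = np.linalg.norm(U @ coef - (s1 - s2))
```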

---

The paper says you can use MAP adaptation of the GMM-UBM model, i-vectors ($s = u + Tw$, where $u$ is the mean supervector of the GMM-UBM), or JFA ($s = u + Dz + Ux$). Estimating the posteriors with a DNN instead of a GMM gives great results.

The i-vectors for a recording are often just averaged, which drops the information about the order in which they occurred. The authors suggest that DTW or HMM can be used to retain this information.

Trained and evaluated on the male recordings of Part 1 of the RedDots dataset.

20 MFCC, 25 ms frames with a 10 ms shift, appended with delta and acceleration parameters. Short-time Gaussianization is applied to these features using a 3 s sliding window. A Hungarian phoneme recognizer is used to detect voice activity.

GMM-UBM with 1024 components estimated on a ~120 h subset of the Fisher corpus and Part 1 of RSR (?). The 400-dimensional i-vector extractor is trained on the same training set. The rank of the JFA eigenchannel matrix is 50.

DNN with 4 hidden layers, 1200 sigmoid units per layer, 1530 softmax units at output.

> Evaluations are done on three conditions labeled as Cond1, Cond2, Cond3, and an additional condition (Cond-all) with the trials from all three conditions put together.
> More particularly, in condition 1, each trial is associated with determining if the phrases are the same or different. In condition 2, the system is required to differentiate speakers pronouncing the same content. In condition 3, both the speaker and the phrase can be different.

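EER in %, with columns in the order of the conditions listed above:

| System            | Cond1 | Cond2 | Cond3 | Cond-all |
|:----------------- | -----:| -----:| -----:| --------:|
| RMAP              | 5.2   | 4.1   | 1.0   | 1.8      |
| IVec(GMM, PLDA)   | 6.9   | 4.2   | 1.3   | 1.9      |
| JFA(GMM)          | 10.5  | 7.9   | 2.9   | 3.8      |
| IVec(DNN, PLDA)   | 6.9   | 3.4   | 1.2   | 1.6      |
| JFA(DNN)          | 4.1   | 7.0   | 1.2   | 2.5      |
| DTW-MFCC          | 2.1   | 5.6   | 1.2   | 1.9      |
| DTW-post(GMM)     | 1.8   | 6.7   | 1.7   | 2.9      |
| DTW-post(DNN)     | 1.1   | 9.0   | 1.0   | 3.5      |
| DTW-onIvec(GMM)   | 2.6   | 3.8   | 1.3   | 1.8      |
| DTW-onIvec(DNN)   | 1.3   | 3.2   | 0.8   | 1.3      |
| onIvec(GMM, PLDA) | 5.4   | 6.7   | 2.5   | 2.9      |
| onIvec(DNN, PLDA) | 2.8   | 4.8   | 1.8   | 2.1      |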

## COMPARISON OF MULTIPLE FEATURES AND MODELING METHODS FOR TEXT-DEPENDENT SPEAKER VERIFICATION
### https://arxiv.org/pdf/1707.04373.pdf

i-vectors work well for speaker recognition in general, but not for text-dependent speaker recognition.


## WAVENET: A GENERATIVE MODEL FOR RAW AUDIO
### https://arxiv.org/pdf/1609.03499.pdf

They used a NN to generate audio. The network is convolutional: given the past samples it generates the next one, which is then appended to the input when generating the following sample.

They used dilated convolution, that is, a convolution that skips input values with a certain step. With the dilation doubling at every layer, this increases the receptive field to $2^{\text{depth}}$.
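
A small sketch of why the receptive field reaches $2^{\text{depth}}$, assuming a kernel of size 2 and dilations 1, 2, 4, ... (one per layer). The function names are hypothetical and the filters are fixed to ones purely to trace which inputs reach the output.

```python
import numpy as np

def causal_dilated_conv(x, w, dilation):
    """y[t] = w[0] * x[t - dilation] + w[1] * x[t], with zeros before the start."""
    past = np.concatenate([np.zeros(dilation), x[:-dilation]])
    return w[0] * past + w[1] * x

def receptive_field(depth, kernel_size=2):
    """Number of input samples that can influence one output sample."""
    return 1 + sum((kernel_size - 1) * 2 ** i for i in range(depth))

# Push a unit impulse through 4 layers with dilations 1, 2, 4, 8.
signal = np.zeros(32)
signal[0] = 1.0
for layer in range(4):
    signal = causal_dilated_conv(signal, w=(1.0, 1.0), dilation=2 ** layer)

# The impulse now reaches exactly 2**4 = 16 output samples.
reach = int(np.count_nonzero(signal))
```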

They used a softmax output, but quantized the $2^{16} = 65536$ levels of 16-bit audio down to $256$. They used a nonlinear (μ-law) quantization
$$f(x_t) = \text{sign}(x_t) \frac{\ln(1 + \mu|x_t|)}{\ln(1 + \mu)}$$
where $-1 \lt x_t \lt 1$ and $\mu = 255$
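
A sketch of the companding formula above together with one possible mapping to 256 integer classes and back. The rounding scheme and the function names are my assumptions; the paper only gives the companding function.

```python
import numpy as np

def mu_law_encode(x, mu=255):
    """Compand x in [-1, 1] with the mu-law curve and map it to integers 0..mu."""
    compressed = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)
    return np.floor((compressed + 1) / 2 * mu + 0.5).astype(np.int64)

def mu_law_decode(q, mu=255):
    """Invert the mapping: integers 0..mu back to amplitudes in [-1, 1]."""
    compressed = 2 * q.astype(np.float64) / mu - 1
    return np.sign(compressed) * np.expm1(np.abs(compressed) * np.log1p(mu)) / mu

samples = np.linspace(-0.99, 0.99, 101)
classes = mu_law_encode(samples)          # softmax targets in 0..255
restored = mu_law_decode(classes)         # coarse near +-1, fine near 0
```

The quantization step is small for quiet samples and large for loud ones, which is the point of the nonlinear mapping.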

They used gated activation units like in _PixelCNN_.
$$z = \tanh(W_{f,k} * x) \cdot \sigma(W_{g,k} * x)$$
where $*$ is the convolution operator, $\cdot$ is element-wise multiplication, $\sigma$ is the sigmoid function, $k$ is the layer index, $f$ and $g$ denote filter and gate, and $W$ is a learnable convolution filter. It worked better than ReLUs.
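
The gated unit can be sketched for a single-channel signal as below. The causal convolution helper and the fixed example weights are assumptions for illustration; in the real model the filters are learned and multi-channel.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def conv1d_causal(x, w, dilation=1):
    """Causal convolution: the last tap of w multiplies the current sample."""
    out = np.zeros_like(x)
    for k, wk in enumerate(w):
        shift = (len(w) - 1 - k) * dilation       # how far back this tap looks
        out[shift:] += wk * x[:len(x) - shift]
    return out

def gated_unit(x, w_f, w_g, dilation=1):
    """z = tanh(W_f * x) . sigmoid(W_g * x): the gate scales the filter output."""
    return (np.tanh(conv1d_causal(x, w_f, dilation))
            * sigmoid(conv1d_causal(x, w_g, dilation)))
```

The tanh branch produces a value in (-1, 1) and the sigmoid branch decides how much of it passes through.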

You can make the model generate speech with certain characteristics by conditioning it on them, e.g. on speaker identity as a one-hot vector.

Some text-dependent models also exist; they use external models to predict $\log F_0$ and phone durations from linguistic features.

## Parallel Speaker and Content Modelling for Text-dependent Speaker Verification
### https://pdfs.semanticscholar.org/fcb5/af7bbb24e3bea6564e57e3a2cceca6af5481.pdf

They used two systems: one models the speaker given known lexical content, the other models the lexical content given a known speaker. Speaker verification uses an HMM-GMM, lexical content verification uses a GMM. They also created a mixture model based on KL divergence.

TD systems showed higher accuracy than TI ones. Current research focuses on very short utterances, ~1.5 s.

The advantage of TD models comes from knowing the content in advance. Should we even verify the content?

Joint Factor Analysis, HMM, GMM-UBM, DNN, LSTM were all used.

Front-end features: 19-dimensional MFCC with log energy and their first and second derivatives. A vector-quantization-based voice activity detector was used, then feature warping was applied.

SV: N-state HMM-GMM, with the GMMs initialized from the UBM. It is then trained on all data to estimate the background pass-phrase HMM. This HMM is then copied for each speaker and MAP-adapted using their recordings. Scoring: $P(O|\text{HMM})$ is estimated with the Viterbi algorithm, and the final score is $S_{HMM} = \log P(O|\text{Speaker HMM}) - \log P(O|\text{Background pp HMM})$

UV: left-to-right segment model. Each pass phrase is split into $S$ segments and a separate GMM models each segment. The test utterance is also divided into $S$ segments, and the score is $S_{seg} = \frac{1}{S}\sum_{i=1}^S (\log P(O_i|\lambda_{seg_i}) - \log P(O_i|\lambda_{UBM}))$
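
A sketch of the segmental score, assuming equal-length segments and frame-averaged log-likelihoods. The per-segment models are passed in as scoring functions, and `gaussian_loglik` is a toy stand-in for a real GMM; both are hypothetical helpers.

```python
import numpy as np

def segment_score(frames, segment_logliks, ubm_loglik):
    """Average over S segments of log P(O_i | segment model i) - log P(O_i | UBM)."""
    segments = np.array_split(frames, len(segment_logliks))   # S near-equal parts
    diffs = [seg_ll(seg) - ubm_loglik(seg)
             for seg, seg_ll in zip(segments, segment_logliks)]
    return float(np.mean(diffs))

def gaussian_loglik(mean, var=1.0):
    """Toy model: frame-averaged log density of an isotropic Gaussian."""
    def loglik(X):
        per_frame = -0.5 * np.sum((X - mean) ** 2 / var + np.log(2 * np.pi * var), axis=1)
        return float(np.mean(per_frame))
    return loglik

# Toy phrase whose first half sits near 0 and second half near 1.
phrase = np.vstack([np.zeros((4, 2)), np.ones((4, 2))])
ubm = gaussian_loglik(0.5)
right_order = segment_score(phrase, [gaussian_loglik(0.0), gaussian_loglik(1.0)], ubm)
wrong_order = segment_score(phrase, [gaussian_loglik(1.0), gaussian_loglik(0.0)], ubm)
```

Swapping the segment models lowers the score, which is how the left-to-right structure catches a wrong or reordered phrase.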

The baseline system is a GMM-UBM with 19 MFCC and 512 mixtures. Only the means of the GMM are MAP-adapted.

Only Part 1 male speakers are considered. This seems to be an extremely common decision. Results for IC, TW, IW are expressed in terms of EER.

EER on different subsystems:

|    | GMM  | 4seg | 8seg | 4HMM | 8HMM | 8HMM + 4seg |
|:-- | ----:| ----:| ----:| ----:| ----:| -----------:|
| IC | 2.41 | 2.81 | 5.64 | 1.20 | 1.19 | 1.76        |
| TW | 5.11 | 2.78 | 6.29 | 6.42 | 5.92 | 2.72        |
| IW | 0.59 | 0.62 | 2.22 | 1.23 | 1.20 | 0.46        |

EER on various mixtures

|    | GMM  | MS256 | MS128 | MS64 | MS32 | MS256_4seg | MS256_4seg + 8HMM |
|:-- | ----:| -----:| -----:| ----:| ----:| ----------:| -----------------:|
| IC | 2.41 | 2.34  | 2.50  | 2.96 | 4.34 | 2.80       | 1.45              |
| TW | 5.11 | 4.50  | 3.98  | 4.18 | 5.62 | 2.50       | 2.50              |
| IW | 0.59 | 0.48  | 0.52  | 0.77 | 1.24 | 0.56       | 0.37              |


## M. Hébert, Springer Handbook of Speech Processing. Berlin, Heidelberg: Springer Berlin Heidelberg, 2008, ch. Text-Dependent Speaker Recognition, pp. 743–762
DR. Hard to find, old.