# HMM GMM

We use 3-state mono-phone HMMs to construct this recognizer. The emission probability of every state is modeled by a GMM. Say we have F mono-phones (F = 39 in our English lexicon), and a G-mixture GMM for each mono-phone state. Thus the GMM-HMM has 3FG mixture components in total. Compared to GMM-UBM, these mixtures are better separated in the phoneme space.

MFCC: MFCCs are extracted from 16 kHz utterances with 40 filter banks distributed between 0 and 8 kHz. Static 19-dimensional coefficients plus energy, together with their delta and delta-delta, form a 60-dimensional vector. CMVN is applied per utterance.
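A rough sketch of the 60-dimensional layout described above: 20 static coefficients (19 cepstra plus energy) are stacked with first and second time differences, then normalized per utterance. The function names `add_deltas` and `cmvn` and the use of `np.gradient` as the delta estimator are assumptions made for this illustration; the quoted text does not say which delta window was used.

```python
import numpy as np

def add_deltas(static):
    """Stack static features with first and second time differences."""
    delta = np.gradient(static, axis=0)
    delta_delta = np.gradient(delta, axis=0)
    return np.hstack([static, delta, delta_delta])

def cmvn(features, eps=1e-8):
    """Per-utterance cepstral mean and variance normalization."""
    mean = features.mean(axis=0)
    std = features.std(axis=0)
    return (features - mean) / (std + eps)

rng = np.random.default_rng(0)
static = rng.normal(size=(200, 20))  # 200 frames, 19 cepstra + energy
features = cmvn(add_deltas(static))
print(features.shape)
```

Each of the three blocks contributes 20 columns, which gives the 60 dimensions mentioned in the text.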

Hmm, deepspeech has 26 MFCC features and it seems to drop every other one, but that is tied to some RNN training trick. They have: features = mfcc(audio, samplerate=fs, numcep=numcep), where mfcc comes from psf like ours xD, with numcep=26 and fs=16000 by default.

Hmm, another paper: In our HMM-based method, a phoneme recognizer is first trained with 3-state, GMM-based, mono-phone HMMs. This recognizer is the same as in speech recognition. Let F be the total number of mono-phones (i.e. 39), S = 3F be the number of all states, G the number of Gaussian components per state, and C = SG the number of all individual Gaussians, and let (s, g) denote Gaussian component g in state s.
3-state - what are you?

Given a transcription, a graph of HMMs is composed.

The Viterbi and forward-backward (FB) algorithms are two means to align frames to states and mixtures.
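To make the hard (Viterbi) alignment concrete, here is a minimal log-domain sketch that assigns one state to every frame. The function name `viterbi_align` and the toy probabilities are assumptions for illustration only; the forward-backward variant would return per-frame state posteriors instead of a single path.

```python
import numpy as np

def viterbi_align(log_emission, log_start, log_trans):
    """Return the most likely state index for every frame.

    log_emission: (T, S) per-frame log-likelihood of each state
    log_start:    (S,)   log initial state probabilities
    log_trans:    (S, S) log transition probabilities, row = from, column = to
    """
    n_frames, n_states = log_emission.shape
    score = np.empty((n_frames, n_states))
    backpointer = np.zeros((n_frames, n_states), dtype=int)

    score[0] = log_start + log_emission[0]
    for t in range(1, n_frames):
        # candidates[i, j]: best score ending in state i at t-1, then moving to j
        candidates = score[t - 1][:, None] + log_trans
        backpointer[t] = candidates.argmax(axis=0)
        score[t] = candidates.max(axis=0) + log_emission[t]

    # Trace the best path back from the best final state.
    path = np.empty(n_frames, dtype=int)
    path[-1] = score[-1].argmax()
    for t in range(n_frames - 1, 0, -1):
        path[t - 1] = backpointer[t, path[t]]
    return path

with np.errstate(divide='ignore'):  # log(0) = -inf marks forbidden moves
    log_start = np.log(np.array([1.0, 0.0, 0.0]))
    log_trans = np.log(np.array([[0.5, 0.5, 0.0],
                                 [0.0, 0.5, 0.5],
                                 [0.0, 0.0, 1.0]]))
    log_emission = np.log(np.array([[0.9, 0.1, 0.1],
                                    [0.8, 0.2, 0.1],
                                    [0.1, 0.9, 0.1],
                                    [0.1, 0.2, 0.9],
                                    [0.1, 0.1, 0.9]]))

alignment = viterbi_align(log_emission, log_start, log_trans)
print(alignment)
```

In this toy left-to-right model the five frames are aligned to states 0, 0, 1, 2, 2.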

Speaker adaptation is the same as in Eq. 4, except that the mixtures here are phonetically dependent.

During the test phase, the Viterbi-based log-likelihood ratio is expressed as:

$$\sum_t \left[ \log P(x_t \mid \text{model}_{\text{user}}, q_t) - \log P(x_t \mid \text{model}_{\text{ubm}}, q_t) \right]$$
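The ratio can be sketched in a few lines once a state alignment q_t is available. To stay short, this example uses a single diagonal Gaussian per state as a stand-in for the per-state GMM; `gaussian_logpdf`, `viterbi_llr` and the toy parameters are hypothetical names and values, not taken from the notebook.

```python
import math

def gaussian_logpdf(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at vector x."""
    return sum(
        -0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )

def viterbi_llr(frames, alignment, user_states, ubm_states):
    """Sum over frames of log P(x_t | user, q_t) - log P(x_t | ubm, q_t)."""
    total = 0.0
    for x, q in zip(frames, alignment):
        user_ll = gaussian_logpdf(x, *user_states[q])
        ubm_ll = gaussian_logpdf(x, *ubm_states[q])
        total += user_ll - ubm_ll
    return total

# Each state is a (mean, variance) pair; both models share the alignment.
user_states = [([0.0], [1.0]), ([1.0], [1.0])]
ubm_states = [([1.0], [1.0]), ([0.0], [1.0])]
frames = [[0.0], [1.0]]
alignment = [0, 1]

llr = viterbi_llr(frames, alignment, user_states, ubm_states)
print(llr)
```

A positive score means the frames fit the speaker model better than the background model along the aligned state path.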

HMM: To generate the alignment for the HMM-based modeling, we use MFCCs to train the HMM. 39 mono-phones plus a silence model are used, each of which contains 3 states. To model the complexity of silence, a GMM with 16 mixtures is used for every silence state, while other states are all modeled by 8 Gaussians, resulting in 984 Gaussians in total. This HMM is further extended to a triphone system that retains 2142 senones. The transcriptions for DNN training are generated by the senone alignment. Only MFCCs are used for HMM training and alignment.
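The 984 figure follows directly from the numbers in the paragraph above; the variable names below are only for illustration.

```python
n_phones = 39
states_per_model = 3
mix_per_phone_state = 8
mix_per_silence_state = 16

# 39 phones x 3 states x 8 Gaussians, plus 1 silence model x 3 states x 16 Gaussians
phone_gaussians = n_phones * states_per_model * mix_per_phone_state
silence_gaussians = 1 * states_per_model * mix_per_silence_state
total_gaussians = phone_gaussians + silence_gaussians
print(phone_gaussians, silence_gaussians, total_gaussians)  # 936 48 984
```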

GMM-HMM and i-vector/HMM: The GMM of every state is re-estimated using the HMM alignments and different speaker features. The total number of mixtures in our model is 984. The dimension of the i-vector is again set to 600. Viterbi and FB alignments are both investigated.
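One way to picture re-estimation from a hard alignment: group the frames by their aligned state and recompute each state's statistics. This sketch fits a single diagonal Gaussian per state rather than a full GMM, and `reestimate_state_gaussians` is a hypothetical helper, so treat it as an illustration of the idea and not as the procedure used in the quoted paper.

```python
import numpy as np

def reestimate_state_gaussians(frames, alignment, n_states):
    """Per-state mean and diagonal variance from hard-aligned frames."""
    dim = frames.shape[1]
    means = np.zeros((n_states, dim))
    variances = np.ones((n_states, dim))
    for state in range(n_states):
        assigned = frames[alignment == state]
        if len(assigned) > 0:  # keep the defaults for states with no frames
            means[state] = assigned.mean(axis=0)
            variances[state] = assigned.var(axis=0) + 1e-6  # variance floor
    return means, variances

frames = np.array([[0.0, 0.0], [2.0, 2.0], [10.0, 10.0], [12.0, 14.0]])
alignment = np.array([0, 0, 1, 1])
means, variances = reestimate_state_gaussians(frames, alignment, n_states=3)
print(means)
```

State 2 receives no frames in this toy alignment, so it keeps its default zero mean and unit variance.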

* 39 x 3 states

In [1]:
%load_ext autoreload

In [2]:
%autoreload 2

import concurrent.futures as cf
import functools as ft
import itertools as it
import json
import math
import operator as op
import os
import pickle
import re

import fastdtw
from hmmlearn import hmm
from IPython.display import display
from ipywidgets import interact, interact_manual, widgets
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import pomegranate as pg
from scipy import interpolate, linalg, misc, optimize, spatial, stats
from sklearn import metrics, mixture, cluster, utils

from paprotka.dataset import reddots
from paprotka.feature import cepstral

In [3]:
%autoreload 0

In [4]:
import warnings
warnings.filterwarnings('once')

# Load

In [5]:
root = reddots.get_root()
load_pcm = ft.partial(reddots.load_pcm, root)
load_mfcc = ft.partial(reddots.load_npy, root, 'wac2_mfcc13_ddd')

In [6]:
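# Paths are stored relative to the pcm directory as '<speaker_dir>/<file>.pcm'.
# The loop variable is named dirpath so it does not shadow the dataset root.
all_paths = [os.path.join(os.path.basename(dirpath), file)
             for dirpath, _, files in os.walk(root + '/pcm')
             for file in files
             if file.endswith('.pcm')]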
print(len(all_paths), all_paths[0])

15305 f0002/20150224142650384_f0002_14438.pcm


In [7]:
all_mfcc = {path: load_mfcc(path) for path in all_paths}

In [9]:
def load_sets(eid, tid=None):
    tid = tid if tid else eid
    
    enrollments = reddots.load_enrollments(root + '/ndx/f_part_{}.trn'.format(eid), 
                                           root + '/ndx/m_part_{}.trn'.format(eid))
    trials = reddots.load_trials(root + '/ndx/f_part_{}.ndx'.format(tid), 
                                 root + '/ndx/m_part_{}.ndx'.format(tid))
    
    return enrollments, trials

In [None]:
enrollments_1, trials_1 = load_sets('01')
print('Enrollments', enrollments_1.dtypes, sep='\n')
print('Trials', trials_1.dtypes, sep='\n')

In [None]:
enrollments_2, trials_2 = load_sets('02')
enrollments_3, trials_3 = load_sets('03')
enrollments_4_td, trials_4_td = load_sets('04_td', '04')

In [None]:
trialed_paths = set(path for trials in (trials_1, trials_2, trials_3, trials_4_td) for path in trials.pcm_path)
untrialed_paths = [path for path in all_paths if path not in trialed_paths]
print(len(trialed_paths), len(untrialed_paths))

In [None]:
def with_opened_file(mode='r'):
    def decorator(fun):
        @ft.wraps(fun)
        def wrapped(path, *args, **kwargs):
            with open(path, mode=mode) as opened:
                return fun(opened, *args, **kwargs)
        return wrapped
    return decorator

def write_model(path, model):
    with open(path, 'w') as opened:
        opened.write(model.to_json())
        
def read_mvgaussian(path):
    with open(path, 'r') as opened:
        parsed = json.load(opened)
        means, covariance = parsed['parameters']
        return pg.MultivariateGaussianDistribution(means=means, covariance=covariance)
        
def read_gmm(path):
    with open(path, 'r') as opened:
        parsed = json.load(opened)
        distributions_parsed = parsed['distributions']
        weights_parsed = np.array(parsed['weights'])
        
        distributions = []
        for i, distribution in enumerate(distributions_parsed):
            try:
                distribution = pg.MultivariateGaussianDistribution(*distribution['parameters'])
                distributions.append(distribution)
            except Exception as exception:
                raise IOError('Cannot read mixture at {}'.format(i)) from exception
                        
        # The JSON weights are treated as log-weights: exponentiate and renormalize.
        weights_exp = np.exp(weights_parsed)
        weights_exp /= weights_exp.sum()

        return pg.GeneralMixtureModel(distributions=distributions, weights=weights_exp)

# Train

In [None]:
untrialed_features = [all_mfcc[path] for path in untrialed_paths]
untrialed_features_stack = np.vstack(untrialed_features)

print(len(untrialed_features), untrialed_features_stack.shape)

## Make 1024 mixture GMM

In [104]:
ubm_model = pg.GeneralMixtureModel.from_samples(
    pg.MultivariateGaussianDistribution, n_components=1024, 
    X=untrialed_features_stack,
    max_iterations=1e5, verbose=True, n_jobs=20
)

[1] Improvement: 1815337.7782751173	Time (s): 1.526
[2] Improvement: 1120632.2016400695	Time (s): 1.422
[3] Improvement: 479222.07682840526	Time (s): 1.453
[4] Improvement: 247275.77560900152	Time (s): 1.544
[5] Improvement: 169762.99712505937	Time (s): 1.414
[6] Improvement: 137201.11608241498	Time (s): 1.474
[7] Improvement: 114011.64813488722	Time (s): 1.426
[8] Improvement: 90099.76580010355	Time (s): 1.518
[9] Improvement: 70474.43591959774	Time (s): 1.507
[10] Improvement: 55233.85200339556	Time (s): 1.657
[11] Improvement: 43176.56939820945	Time (s): 1.495
[12] Improvement: 35224.70126327872	Time (s): 1.583
[13] Improvement: 31009.52012553811	Time (s): 1.492
[14] Improvement: 29216.22678720951	Time (s): 1.529
[15] Improvement: 29111.62843078375	Time (s): 1.546
[16] Improvement: 29632.50283382833	Time (s): 1.514
[17] Improvement: 29496.378614470363	Time (s): 1.392
[18] Improvement: 27828.899049550295	Time (s): 1.528
[19] Improvement: 24987.958367943764	Time (s): 1.607
[20] Improv

[157] Improvement: 263.00465716421604	Time (s): 1.479
[158] Improvement: 262.09291285276413	Time (s): 1.453
[159] Improvement: 260.9145005643368	Time (s): 1.48
[160] Improvement: 259.6116936802864	Time (s): 1.585
[161] Improvement: 258.27410274744034	Time (s): 1.522
[162] Improvement: 256.9385150372982	Time (s): 1.468
[163] Improvement: 255.5986407995224	Time (s): 1.655
[164] Improvement: 254.22057937085629	Time (s): 1.445
[165] Improvement: 252.76038193702698	Time (s): 1.51
[166] Improvement: 251.18103308975697	Time (s): 1.438
[167] Improvement: 249.46748450398445	Time (s): 1.532
[168] Improvement: 247.63904769718647	Time (s): 1.434
[169] Improvement: 245.75867991149426	Time (s): 1.34
[170] Improvement: 243.93846967816353	Time (s): 1.481
[171] Improvement: 242.34026941657066	Time (s): 1.411
[172] Improvement: 241.17046532034874	Time (s): 1.436
[173] Improvement: 240.66882038116455	Time (s): 1.464
[174] Improvement: 241.09274096786976	Time (s): 1.363
[175] Improvement: 242.699839636683

[311] Improvement: 16.63889430463314	Time (s): 1.411
[312] Improvement: 17.19874608516693	Time (s): 1.536
[313] Improvement: 17.80149246752262	Time (s): 1.525
[314] Improvement: 18.450757265090942	Time (s): 1.461
[315] Improvement: 19.150552108883858	Time (s): 1.538
[316] Improvement: 19.905330941081047	Time (s): 1.6
[317] Improvement: 20.720057532191277	Time (s): 1.444
[318] Improvement: 21.60028800368309	Time (s): 1.781
[319] Improvement: 22.552264288067818	Time (s): 1.579
[320] Improvement: 23.583033934235573	Time (s): 1.366
[321] Improvement: 24.70060685276985	Time (s): 1.338
[322] Improvement: 25.914144530892372	Time (s): 1.461
[323] Improvement: 27.23420675098896	Time (s): 1.348
[324] Improvement: 28.673054918646812	Time (s): 1.402
[325] Improvement: 30.244973927736282	Time (s): 1.361
[326] Improvement: 31.96663948893547	Time (s): 1.301
[327] Improvement: 33.8574163466692	Time (s): 1.286
[328] Improvement: 35.93960756063461	Time (s): 1.366
[329] Improvement: 38.23855631053448	Tim

[465] Improvement: 0.19875448942184448	Time (s): 1.439
[466] Improvement: 0.17369548976421356	Time (s): 1.541
[467] Improvement: 0.15169718861579895	Time (s): 1.491
[468] Improvement: 0.13240525126457214	Time (s): 1.45
[469] Improvement: 0.11550301313400269	Time (s): 1.545
[470] Improvement: 0.10070747137069702	Time (s): 1.477
[471] Improvement: 0.08776654303073883	Time (s): 1.448
Total Improvement: 4937452.903189346
Total Time (s): 694.4912


In [105]:
ubm_model.freeze()

In [109]:
# write_model(root + '/models/ubm_model', ubm_model)

In [113]:
# ubm_model = read_gmm(root + '/models/ubm_model')

## Enroll

In [54]:
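def make_speaker_classifier(inspeaker_classifier):
    # One left-to-right chain with three states per phone.
    n_components = 3 * len(phones)
    initial, transitions = make_matrix_beads(n_components)

    # 's' and 't' are left out of init_params so that fit() keeps the
    # start and transition probabilities assigned below.
    classifier = hmm.GMMHMM(
        n_components=n_components, n_mix=8, covariance_type='diag', init_params='mcw', params='tmcw'
    )

    classifier.startprob_ = initial
    classifier.transmat_ = transitions
    return classifier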

GMMHMM(algorithm='viterbi', covariance_type='diag', covars_prior=0.01,
    init_params='stmcw', n_components=39, n_iter=10, n_mix=8,
    params='stmcw', random_state=None, startprob_prior=1.0, tol=0.01,
    transmat_prior=1.0, verbose=False)
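The body of `make_matrix_beads` is not shown in this notebook. A plausible sketch, assuming it builds a strict left-to-right chain in which every state can either stay or advance by one, could look like this; the `stay_proba` parameter and its default of 0.5 are assumptions.

```python
import numpy as np

def make_matrix_beads(n_components, stay_proba=0.5):
    """Build start and transition probabilities for a left-to-right chain."""
    initial = np.zeros(n_components)
    initial[0] = 1.0  # always start in the first state

    transitions = np.zeros((n_components, n_components))
    for state in range(n_components - 1):
        transitions[state, state] = stay_proba          # self-loop
        transitions[state, state + 1] = 1 - stay_proba  # advance by one state
    transitions[-1, -1] = 1.0  # the last state is absorbing
    return initial, transitions

initial, transitions = make_matrix_beads(3 * 2)  # two 3-state phones chained
print(transitions)
```

The resulting matrix is upper bidiagonal, so states can never be revisited once left, which is what lets a chain of phone models follow a fixed transcription.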

In [None]:
def perform_enrollments(classifier, enrollments):
    labels = enrollments[['is_male', 'speaker_id', 'sentence_id']].values
    # Load features for the enrollments passed in, not for a fixed global set.
    features = [load_mfcc(path) for path in enrollments['pcm_path']]
    classifier.fit(features, labels)
    

## Trial

In [None]:
def save_results(label, results):
    path = os.path.join(root, 'result', label)
    # pickle needs a binary file handle opened for writing.
    with open(path, 'wb') as opened:
        pickle.dump(results, opened)
        
def load_results(label):
    path = os.path.join(root, 'result', label)
    # pickle needs a binary file handle opened for reading.
    with open(path, 'rb') as opened:
        return pickle.load(opened)
    
def perform_trial(classifier, path):
    features = load_mfcc(path)
    return classifier.predict_single_proba(features)

def perform_trials(classifier, trials):
    paths = trials['pcm_path'].unique()
    results = {}
    for path in paths:
        results[path] = perform_trial(classifier, path)
    return results

In [10]:
per_phone_weights = {}

per_phone_weights.get('A', np.eye(3))

array([[ 1.,  0.,  0.],
       [ 0.,  1.,  0.],
       [ 0.,  0.,  1.]])

In [None]:
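class Bakis:
    def __init__(self, phones, per_phone_weights, ignore_stress=True, wrap_silence=True):
        # Phones arrive as a single space-separated string.
        self.phone_seq = phones.split(' ')
        if wrap_silence:
            # Pad the phone sequence with '_' on both sides.
            self.phone_seq = ['_'] + self.phone_seq + ['_']

        # TODO: build the state transition structure.
        state_transitions = None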

In [51]:
enrollments_1.sentence_id.unique()

array([31, 32, 33, 34, 35, 36, 37, 38, 39, 40])

In [52]:
sample_enrollments = enrollments_1.loc[enrollments_1.sentence_id == 31]

In [None]:
class HmmGmmVerifier:
    def __init__(self):
        self.patterns = None
        self.labels = None
        self.unique_labels = None

    def fit(self, features, labels):
        self.patterns = features
        self.labels = np.asarray(labels)
        # Labels may be rows such as (is_male, speaker_id, sentence_id),
        # so deduplicate whole rows rather than individual values.
        self.unique_labels = np.unique(self.labels, axis=0)

    def enroll(self):
        pass

    def trial(self):
        pass

    def predict(self, features, metric=spatial.distance.cosine):
        sequence_label_proba = self.predict_proba(features, metric)
        max_proba_index = sequence_label_proba.argmax(axis=1)
        return self.unique_labels[max_proba_index]

    def predict_proba(self, features, metric=spatial.distance.cosine):
        sequence_n = len(features)
        label_n = len(self.unique_labels)

        # One row per test sequence, one column per unique label.
        sequence_label_proba = np.zeros((sequence_n, label_n), dtype=np.float64)
        for i, sequence in enumerate(features):
            sequence_label_proba[i, :] = self.predict_single_proba(sequence, metric)

        return sequence_label_proba

    def predict_single_proba(self, sequence, metric=spatial.distance.cosine):
        # DTW distance from the test sequence to every enrolled pattern.
        pattern_dists = np.zeros(len(self.patterns), dtype=np.float64)
        for i, pattern in enumerate(self.patterns):
            distance, _ = fastdtw.fastdtw(pattern, sequence, dist=metric)
            pattern_dists[i] = distance

        # Turn distances into unnormalized similarity scores.
        pattern_proba = np.exp(-pattern_dists)

        # Sum the scores of all patterns that share a label.
        label_proba = np.zeros(len(self.unique_labels), dtype=np.float64)
        all_dim = tuple(range(1, self.labels.ndim))
        for i, label in enumerate(self.unique_labels):
            relevant = (self.labels == label).all(axis=all_dim)
            total_proba = pattern_proba[relevant].sum()
            label_proba[i] = total_proba

        return label_proba / label_proba.sum()

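# Hypothetical wrapper name; `markov` is not imported anywhere in this notebook.
def classify(e_features, e_labels, t_features):
    classifier = markov.HMMGMMClassifier()
    classifier.fit(e_features, e_labels, n_components=10, n_mix=2)
    return classifier.predict(t_features)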

In [5]:
display(enrollments_1.groupby(['is_male', 'speaker_id']).size())

is_male  speaker_id
False    2             30
         4             30
         5             30
         6             30
         8             24
         12            30
True     1             30
         2             30
         4             30
         5             30
         6             30
         7             30
         8              6
         9             30
         13            30
         14            24
         15            30
         16            30
         17            30
         18            30
         19            30
         20            30
         21            24
         22            30
         23            30
         26            30
         28            30
         29            30
         32            30
         38            24
         40            30
         41            30
         43            30
         47            30
         48             6
         51            30
         52             6
         53       