\documentclass[english]{HSMW-Thesis}
\usepackage{graphicx}
\usepackage{float}
\usepackage{cite}
\usepackage{listings}
\renewcommand{\lstlistingname}{}
\lstdefinestyle{chstyle}{%
%basicstyle=\ttfamily\small,
commentstyle=\color{green!60!black},
keywordstyle=\color{magenta},
stringstyle=\color{blue!50!red},
showstringspaces=false,
numbers=left,
numberstyle=\footnotesize\color{gray},
numbersep=1pt,
%stepnumber=2,
tabsize=2,
breaklines=true,
inputpath=C:/Users/nana abeka otoo/Downloads/send from/pycode
}
\Art{Master Thesis}
\Anrede{Herr}
\Vorname{Nana Abeka}
\Nachname{Otoo}
\Thema{Determining of Classification Label Security/Certainty}
\Unterthema{}
\Studiengang{Applied Mathematics for Network and Data Sciences }
\Seminargruppe{MA18w1-M}
\Fakultaet{}
\Erstpruefer{Prof. Dr. Thomas Villmann}
\Zweitpruefer{MSc. Jensun Ravichandran}
\Datum{}
\Tag{}
\Monat{}
\Jahr{}
\Anlagen{}
\Copyright{}
\Textsatz{}
\Druck{}
\Verlag{}
\ISBN{}
\begin{document}
\begin{center}
A big thank you goes to\\\vspace{20pt}
my loved ones \\
for their support and love.\\ \vspace{20pt}
Special gratitude goes to \\\vspace{20pt}
Prof. Dr. Thomas Villmann\\
for his supervision and guidance\\
and MSc. Jensun Ravichandran \\ for his supervision, guidance and comments
\end{center}
\begin{Referat}
Classification label security determines the extent to which the predicted labels of a classification result can be trusted. The uncertainty surrounding a classification label is resolved by the security with which the classification is made; classification label security is therefore highly significant for decision-making whenever a classification task is encountered. This thesis investigates the determination of classification label security by utilizing the fuzzy probabilistic assignments of Fuzzy c-means. The investigation is accompanied by implementation, experimentation, visualization and documentation of the results.
\end{Referat}
\begin{Vorwort}
% Vorwort
\end{Vorwort}
\Hauptteil
% Diese Anweisung nicht loeschen!
\chapter{Introduction}
Machine learning as a field of study has gained prominence in academia and industry in recent times. It has become a significant topic of discussion among students, industry players and professionals whose work is in some way influenced by it, which explains why almost every practical process witnessed today either applies machine learning or is migrating towards its adoption. The benefits of machine learning can be witnessed in areas such as medicine, security, engineering, commerce and agriculture, to mention only a few, since the list keeps growing as new innovations and methods are added day by day.
In this regard, a learning machine is a model tasked to learn from given data and make valuable predictions on new data that was not used in the learning process. A well-grounded area of machine learning is that of prototype-based models, which learn prototypes from a given data set by training and classify new data using the dissimilarity between data points and the learned prototypes. Many prototype-based algorithms are commonly applied today. A family of prototype-based models that has piqued the interest of many users is the well-known Learning Vector Quantization (LVQ). This interest can be explained by its easily comprehensible theoretical foundations and practical implementations. At present, Learning Vector Quantization holds an enviable position among the classification algorithms in the area of prototype-based models, owing to its understandable mathematical formulation, easy usability, and high performance coupled with explainable outcomes. Every classifier has the primary duty of making good classifications. From the usage point of view, it is also desirable that a good classifier allows users to know the degree to which its classification results can be trusted. This attribute is very significant because it provides the security with which classification labels can be accepted, and this classification label security remains vital for decision-making.
\section{Motivation}
T. Kohonen introduced Learning Vector Quantization (LVQ) as a prototype-based analog of unsupervised competitive learning, designed to classify different patterns in data \cite{kohonen2001learning}. Even though LVQ yields reference vectors near optimal class borders, it is characterized by issues of divergent reference vectors \cite{sato1996generalized}. This challenge, among others, led to improved variants in \cite{kohonen2001learning}, whose outcomes in practice are, however, not the same \cite{biehl2006learning}.
The introduction of Generalized Learning Vector Quantization (GLVQ) by Sato and Yamada solved the problem of diverging reference vectors: it utilizes a cost-function-based approach and incorporates convergence conditions into the winner-takes-all learning rule \cite{sato1996generalized}. The reliability and robustness of LVQ and its variants depend on the homogeneity of the data used; most importantly, they rely heavily on the Euclidean distance measure, which may not be appropriate for all cases under study \cite{article}.
GLVQ provides a good generalization with convergence conditions, based on any standard distance measure that can be optimized \cite{hammer2005generalization}. A substantial step towards solving this problem was to apply relevance factors to specify a family of distance measures, leading to Relevance GLVQ \cite{hammer2002generalized}.
A variant of Relevance LVQ called Matrix Relevance LVQ utilizes a matrix of relevances that is learned in the same manner as the weights using the GLVQ update rules \cite{schneider2009adaptive}.
It remains to be shown which initialization of the matrix of relevance factors is required to parametrize the distance measure for optimal classification results \cite{hammer2002generalized,bunte2012limited}. Consider the optimal classification results linked with the certainty/security of the classification labels from a fuzzy clustering utilizing a covariance matrix \cite{gath1989unsupervised}. A version of LVQ which utilizes cross-entropy for classification has been introduced \cite{villmann2018probabilistic,kaden2014aspects}; cross-entropy optimization in LVQ was found to result in class positions that ensure classification label security \cite{villmann2018probabilistic}. The computation of the classification label security relies on the converged reference vectors \cite{bezdek1981pattern}, whose optimization, in turn, depends on their initialization \cite{boubezoul2008application}.
The classification label security therefore remains to be investigated.
Consider unsupervised Fuzzy c-means (FCM) by Bezdek, which utilizes fuzzy memberships to ascertain the certainty of cluster members \cite{bezdek1981pattern}. A promising way forward is to investigate the utilization of the fuzzy probabilistic assignments of FCM to determine the classification label security, with applications to GLVQ, Generalized Matrix Learning Vector Quantization (GMLVQ) and Cross-Entropy Learning Vector Quantization (CELVQ).
\section{Brief on Clustering}
The clustering task involves partitioning data without labels into subgroups based on data features representing structure in the data set. The underlying similarity between data patterns is used for arranging data into clusters. We consider the following definitions:
\begin{definition}\cite{bezdek1981pattern}
\emph{Hard c-Partition}\label{def:Hard c-Partition}.\hspace{2pt} $ X=\left\lbrace \mathbf{x}_1,\mathbf{x}_2,\mathbf{x}_3,\ldots,\mathbf{x}_n\right\rbrace $\hspace{2pt} is any finite set;\hspace{2pt} $V_{cn}$\hspace{2pt} is the set of real\hspace{2pt} $c\times n$\hspace{2pt} matrices;\hspace{2pt} $c$\hspace{2pt} is an integer,\hspace{2pt} $2\leq c< n$.\hspace{2pt} \emph{Hard c-partition space for}\hspace{2pt} $X$\hspace{2pt} is the set
\begin{equation*}\label{hard} %remove
M_c= \bigg\{ U\in V_{cn}\Bigm| u_{ik}\in \{0,1\}\hspace{2pt} \forall\hspace{2pt} i,k\hspace{2pt} ; \hspace{2pt}\sum_{i=1}^{c} u_{ik}=1 \hspace{2pt}\forall\hspace{2pt} k \hspace{2pt};\hspace{2pt} 0<\sum_{k=1}^{n} u_{ik}< n \hspace{2pt}\forall \hspace{2pt}i \bigg\}
\end{equation*}
\begin{subequations}
\begin{equation}\label{condition 1}
u_{ik}\in \{0,1\},\hspace{10pt} 1\leq i\leq c,\hspace{10pt} 1\leq k\leq n
\end{equation}
$\left( \ref{condition 1}\right)$\hspace{2pt} means that the membership $u_{ik}$ of pattern\hspace{2pt} $\mathbf{x}_k$\hspace{2pt} in the $i$th partition of\hspace{2pt} $X$\hspace{2pt} is $1$ if\hspace{2pt} $\mathbf{x}_k$\hspace{2pt} belongs to the $i$th partition and $0$ otherwise\cite{bezdek1981pattern}.
\begin{equation}\label{condition 2}
\sum_{i=1}^{c} u_{ik}=1, \hspace{10pt} 1\leq k \leq n
\end{equation}
$\left( \ref{condition 2}\right)$\hspace{2pt} indicates that each pattern\hspace{2pt} $\mathbf{x}_k$\hspace{2pt} is assigned to exactly one of the\hspace{2pt} $c$\hspace{2pt} subsets\cite{bezdek1981pattern}.
\begin{equation}\label{cond 3}
0<\sum_{k=1}^{n} u_{ik}< n ,\hspace{15pt} 1\leq i \leq c
\end{equation}
$\left( \ref{cond 3}\right) $ indicates that no partition subset of\hspace{2pt} $X$\hspace{2pt} is empty and none contains all $n$ patterns; together with \(2\leq c<n\) this guarantees a proper partition\cite{bezdek1981pattern}.
\end{subequations}
\end{definition}
\begin{definition}\cite{bezdek1981pattern}
\emph{Fuzzy c-Partition}\label{def:Fuzzy c-Partition}.\hspace{2pt} $ X $\hspace{2pt} is any finite set;\hspace{2pt} $V_{cn}$ is the set of real\hspace{2pt} $c\times n$\hspace{2pt} matrices;\hspace{2pt} $c$\hspace{2pt} is an integer,\hspace{2pt} $2\leq c< n$.\hspace{2pt} \emph{Fuzzy c-partition space for}\hspace{2pt} $X$\hspace{2pt} is the set
\begin{equation}\label{Fuzzy set space}
M_{fc}= \bigg\{ U\in V_{cn}\Bigm| u_{ik}\in \left[ 0,1\right]\hspace{2pt} \forall\hspace{2pt} i,k\hspace{2pt} \hspace{2pt};\hspace{2pt} \sum_{i=1}^{c} u_{ik}=1\hspace{2pt} \forall\hspace{2pt} k \hspace{2pt};\hspace{2pt} 0<\sum_{k=1}^{n} u_{ik}< n \hspace{2pt} \forall\hspace{2pt} i \bigg\}
\end{equation}
Condition (\ref{condition 1}) is extended to include all values between $0$ and $1$, hence removing the crisp assignments of the membership functions $u_{ik}$\cite{bezdek1981pattern}.
\end{definition}
\begin{definition}\cite{pal2005possibilistic}
\emph{Possibilistic c-Partition}\label{def:Possibilistic c-Partition}.\hspace{2pt} $ X $ \hspace{2pt}is any finite set;\hspace{2pt} $V_{cn}$ is the set of real\hspace{2pt} $c\times n$\hspace{2pt} matrices;\hspace{2pt} $c$\hspace{2pt} is an integer,\hspace{2pt} $2\leq c< n$.\hspace{2pt} \emph{Possibilistic c-partition space for}\hspace{2pt} $X$\hspace{2pt} is the set
\begin{equation}\label{Possibilistic set space}
M_{pc}= \bigg\{ U\in V_{cn}\Bigm| u_{ik}\in \left[ 0,1\right] \hspace{2pt} \forall\hspace{2pt} i,k\hspace{2pt} ;\hspace{2pt} \forall \hspace{2pt}k\hspace{2pt} \exists\hspace{2pt} i \hspace{2pt}\ni u_{ik}> 0\bigg\}
\end{equation}
The column condition in (\ref{condition 2}) is relaxed to\hspace{2pt} $0<\sum_{i=1}^{c} u_{ik}\leq c$,\hspace{2pt} and\hspace{2pt} $u_{ik}$\hspace{2pt} is referred to as the typicality of data pattern\hspace{2pt} $\mathbf{x}_k$\cite{krishnapuram1993possibilistic}.
\end{definition}
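The three partition spaces above differ only in the constraints placed on the membership matrix $U$. As a brief numerical illustration (a hypothetical check with invented function names, not part of the thesis implementation), the hard and fuzzy conditions can be verified as follows:

```python
import numpy as np

def is_hard_partition(U):
    # Hard c-partition: u_ik in {0,1}, each column sums to 1,
    # and every row sum lies strictly between 0 and n.
    n = U.shape[1]
    crisp = np.all((U == 0) | (U == 1))
    cols = np.allclose(U.sum(axis=0), 1.0)
    rows = np.all((U.sum(axis=1) > 0) & (U.sum(axis=1) < n))
    return bool(crisp and cols and rows)

def is_fuzzy_partition(U):
    # Fuzzy c-partition: the crisp condition is relaxed to u_ik in [0,1].
    n = U.shape[1]
    bounded = np.all((U >= 0) & (U <= 1))
    cols = np.allclose(U.sum(axis=0), 1.0)
    rows = np.all((U.sum(axis=1) > 0) & (U.sum(axis=1) < n))
    return bool(bounded and cols and rows)

U_hard = np.array([[1.0, 0.0, 1.0],
                   [0.0, 1.0, 0.0]])   # crisp assignments
U_fuzzy = np.array([[0.7, 0.2, 0.9],
                    [0.3, 0.8, 0.1]])  # graded assignments
```

Note that every hard c-partition is also a fuzzy c-partition, i.e. $M_c \subset M_{fc}$.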
\chapter{Objective Function Clustering}
The primary approach here is to utilize a sum of squares errors function optimized to achieve a minimized error point at which clustering results can be accepted. It is significant to know that optimal clustering, in this case, is achieved at the local extrema of the objective function\cite{bezdek1981pattern}.
\section{Fuzzy c-Means}
As described by Bezdek\cite{bezdek1981pattern}, Fuzzy c-means provides a soft alternative to the Hard c-means clustering algorithm. The difference lies in how the partition matrix $U$ is constrained: the crisp assignments of Hard c-means are relaxed to the full range of probabilistic assignments defined above in $\left( \ref{Fuzzy set space}\right)$ and referred to as fuzzy memberships. The fuzzy memberships determine the degrees to which patterns belong to a partition (cluster).
\begin{theorem}\cite{bezdek1981pattern}
Let the objective function of Fuzzy c-means be
\begin{equation*}\label{FCM Objective} %remove
J_m\left( U,\mathbf{v}\right) =\sum_{k=1}^{n}\sum_{i=1}^{c}\left( u_{ik}\right) ^{m}\left( d_{ik}\right) ^2
\end{equation*}
and assume an inner product norm to be
\begin{align*}\label{inner product norm}
\left( d_{ik}\right)^{2} &= \parallel \mathbf{x}_k - \mathbf{v}_{i}\parallel_{A}^{2}\\
&= \langle \mathbf{x}_k-\mathbf{v}_i,\mathbf{x}_k-\mathbf{v}_i\rangle_A\\
&= \left( \mathbf{x}_k-\mathbf{v}_i\right) ^{T}A\left( \mathbf{x}_k-\mathbf{v}_i\right)
\end{align*}
where
\begin{equation*}
U\in M_{fc}
\end{equation*}
is the fuzzy c- partion of\hspace{3pt} $X$ and
\begin{equation*}
\mathbf{v}=\left(\mathbf{v}_1,\mathbf{v}_2,\ldots,\mathbf{v}_c\right)\in \mathbb{R}^{cp}\hspace{5pt} \text{with}\hspace{5pt} \mathbf{v}_i\in \mathbb{R}^{p}
\end{equation*}
is the cluster center or prototypes of \hspace{3pt} $u_i$,$\hspace{3pt} 1\leq i \leq c$
\begin{equation*}
\text{choose}\hspace{3pt} m \in \left( 1,\infty\right)
\end{equation*}
let\hspace{3pt} $X$\hspace{3pt} have at least\hspace{2pt} $c$\hspace{2pt} (with\hspace{2pt} $c<n$)\hspace{2pt} distinct points, and define for all\hspace{2pt} $k$\hspace{2pt} the sets
$$I_k=\left\lbrace i\mid 1\leq i \leq c;\hspace{2pt} d_{ik}=\parallel \mathbf{x}_k-\mathbf{v}_i\parallel_A=0\right\rbrace$$
$$\tilde{I}_k=\left\lbrace 1,2,\ldots ,c\right\rbrace - I_k$$
then\hspace{2pt} $J_m\left( U,\mathbf{v}\right)$\hspace{2pt} may be globally minimised only if
\begin{subequations}
\begin{equation}\label{membership function}
I_k=\varnothing \Rightarrow u_{ik}=\frac{1}{\left[ \sum_{j=1}^{c}\left( \frac{d_{ik}}{d_{jk}}\right)^{\frac{2}{\left( m-1\right) }}\right]}
\end{equation}
or \begin{equation}\label{membership function1}
I_k\neq\varnothing \Rightarrow u_{ik}= 0 \hspace{3pt}\forall \hspace{3pt} i\in\tilde{I}_k \hspace{10pt}\text{and}\hspace{10pt} \sum_{i\in I_k}u_{ik} =1
\end{equation}
\begin{equation}\label{cluster center}
\mathbf{v}_i=\frac{\sum_{k=1}^{n}\left( u_{ik}\right)^{m} \mathbf{x}_k}{\sum_{k=1}^{n}\left( u_{ik}\right) ^{m}} \hspace{5pt}\forall \hspace{5pt} 1\leq i \leq c
\end{equation}
\end{subequations}
\end{theorem}
\begin{figure}[h!]
\centering
\caption{FCM Algorithm, Bezdek\cite{bezdek1981pattern}}\label{FCM Algorithm}\label{FCM Centers}
\begin{tabular}{ l l }
\hline
\multicolumn{1}{c|}{\emph{Store}} & Unlabelled object data $X= \left\lbrace \mathbf{x}_{1},\mathbf{x}_{2},\ldots,\mathbf{x}_{n}\right\rbrace \subset \mathbb{R}^{p}$ \\ \hline
\multicolumn{1}{c|}{\emph{Pick}} & $\ast$ number of clusters: $1<c<n$ \\
\multicolumn{1}{c|}{} & $\ast$ fuzzifier: $m>1$ \\
\multicolumn{1}{c|}{} & $\ast$ iteration limit: $l_{max}$ \\
\multicolumn{1}{c|}{} & $\ast$ norm for $J_{m}$: $\parallel\mathbf{x}\parallel_{A}=\sqrt{\mathbf{x}^{T}A\mathbf{x}}$ \\
\multicolumn{1}{c|}{} & $\ast$ termination criterion: $\epsilon>0$ \\ \hline
\multicolumn{1}{c|}{} & Initialize $U^{0} \in M_{fc}$ $\left( \ref{Fuzzy set space}\right)$ at iteration $l$, $l= 0,1,2,\ldots$: \\ \hline
\multicolumn{1}{c|}{\emph{Do}} & Calculate the cluster centers $\mathbf{v}_{i}^{l}$ using $\left( \ref{cluster center}\right)$ and $U^{l}$ \\
\multicolumn{1}{c|}{} & Update $U^{l}$ with $\left( \ref{membership function}\right)$, $\left( \ref{membership function1}\right)$ and $\mathbf{v}_{i}^{l}$ \\
\multicolumn{1}{c|}{} & Compare $U^{l}$ to $U^{l+1}$ in a convenient matrix norm: if $\parallel U^{l+1}-U^{l}\parallel\leq \epsilon$, stop; \\
\multicolumn{1}{c|}{} & otherwise return to the first step of the loop. \\ \hline
\end{tabular}
\end{figure}
The parameter\hspace{2pt} $\epsilon$\hspace{2pt} $\left( 0<\epsilon\ll 1 \right) $\hspace{2pt} in Figure \ref{FCM Algorithm} must be chosen to be very small, and the fuzzifier\hspace{2pt} $m$\hspace{2pt} must be chosen cautiously to fit the data. Equation $\left( \ref{membership function1} \right) $ accounts for the rare occurrence of the singularity $\left( \mathbf{x}_k=\mathbf{v}_i\right)$ where $d_{ik}=0$: the memberships\hspace{2pt} $u_{ik}$\hspace{2pt} are spread over the prototypes\hspace{2pt} $\mathbf{v}_{i}$\hspace{2pt} with\hspace{2pt} $d_{ik}=0$, while the memberships for prototypes with\hspace{2pt} $d_{ik}>0$ automatically become $0 \text{'s}$\cite{pal2005possibilistic}. It must be noted that
\begin{subequations}
\begin{align}\label{FCM limit}
\lim_{m\rightarrow 1^+} \left\lbrace u_{ik}\right\rbrace =
\left \{
\begin{aligned}
&1, && \hspace{10pt} d_{ik}\hspace{2pt} <\hspace{2pt} d_{jk}\hspace{3pt} \forall\hspace{2pt} j \neq i \\
&0, && \hspace{10pt }\text{otherwise}
\end{aligned} \right.
\end{align}
and consequently,
\begin{equation}\label{FCM limit1}
\begin{split}
\lim_{m\rightarrow 1^+}\Bigg\{ \Bigg(\mathbf{v}_i&=\frac{\sum_{k=1}^{n}\left( u_{ik}\right) ^{m}\mathbf{x}_k}{\sum_{k=1}^{n}\left( u_{ik}\right) ^m}\Bigg)\Bigg\}\\ &= \frac{\sum_{k\in i}\mathbf{x}_k}{n_i}\\
&= \mathbf{\tilde v}_i ;\hspace{10pt} 1\leq i\leq c
\end{split}
\end{equation}
$\left( \ref{FCM limit}\right) $ and $\left( \ref{FCM limit1}\right) $ show that, as $m\rightarrow 1^{+}$, each cluster centroid moves to the mean of its crisply assigned points and the memberships become crisp with $u_{ik}\in\left\lbrace 0,1\right\rbrace $, which recovers Hard c-means (HCM) \cite{bezdek1981pattern}. This behaviour accounts for the careful choice of\hspace{2pt} $m$.
\end{subequations}
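The alternating optimization described above can be sketched compactly in code. This is a minimal illustration under the inner product norm with $A=I$ (function and variable names are my own, not the thesis implementation):

```python
import numpy as np

def fcm(X, c, m=2.0, max_iter=100, eps=1e-5, seed=0):
    # Alternate the center update and the membership update until U
    # changes by less than eps in the Frobenius norm.
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    U = rng.random((c, n))
    U /= U.sum(axis=0)                       # columns of U sum to 1
    for _ in range(max_iter):
        um = U ** m
        V = um @ X / um.sum(axis=1, keepdims=True)           # cluster centers
        d = np.linalg.norm(X[None, :, :] - V[:, None, :], axis=2)
        d = np.fmax(d, 1e-12)                # guard the singular case d_ik = 0
        # u_ik = 1 / sum_j (d_ik / d_jk)^(2/(m-1))
        U_new = 1.0 / ((d[:, None, :] / d[None, :, :]) ** (2.0 / (m - 1.0))).sum(axis=1)
        if np.linalg.norm(U_new - U) <= eps:
            U = U_new
            break
        U = U_new
    return U, V
```

On two well-separated point clouds, the maximal membership in each column of $U$ recovers the cloud structure, while the membership values themselves grade how certainly each pattern belongs to its cluster.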
\chapter{Learning Vector Quantization}
\section{Introduction to Learning Vector Quantization}
T. Kohonen introduced Learning Vector Quantization (LVQ) as a prototype-based supervised learning model with the characteristics of being robust and intuitive\cite{kohonen2001learning}. LVQ improves on nearest-neighbor classifiers by introducing prototype vectors that are learned and optimized to give improved classification results\cite{kaden2014aspects}. Even though LVQ is characterized by producing optimal borders, it has the weakness of being heuristically motivated, and the instability of its reference vectors becomes a matter of concern in its application to most classification tasks\cite{kohonen2001learning,article}.
Given a training set\hspace{2pt} $X=\left\lbrace \mathbf{x}_1,\mathbf{x}_2,\mathbf{x}_3,\ldots,\mathbf{x}_N\right\rbrace \subseteq \mathbb{R}^n$\hspace{2pt} with its class labels\hspace{2pt} $c\left( \mathbf{x}\right)\in\mathcal{C}=\left\lbrace 1,2,\ldots, C\right\rbrace $,\hspace{2pt} we define a set of prototype vectors\hspace{2pt} $W=\left\lbrace \mathbf{w}_1,\mathbf{w}_2,\ldots,\mathbf{w}_M\right\rbrace\subseteq \mathbb{R}^n $\hspace{2pt} such that every\hspace{2pt} $\mathbf{w}\in W$\hspace{2pt} has a corresponding class\hspace{2pt} $c\left( \mathbf{w}\right)\in\mathcal{C} $.\hspace{2pt} The training of the prototype vectors is based on a competitive learning scheme known as the winner-takes-all rule $\left( \ref{winner takes all rule}\right)$ until the prototype vectors become typical of the classes they represent.
\begin{equation}\label{winner takes all rule}
S\left( \mathbf{x}\right) =\arg\min_k\hspace{3pt} d\left( \mathbf{x},\mathbf{w}_k\right) , 1\leq k \leq M
\end{equation}
Consider a data point\hspace{2pt} $\mathbf{x}$: the prototype vector\hspace{4pt}$\mathbf{w}_{ s\left( \mathbf{x}\right) }$\hspace{2pt} is strengthened (attracted) if\hspace{2pt} $c\left( \mathbf{x}\right) = c\left( \mathbf{w}_{ s\left( \mathbf{x}\right) }\right) $\hspace{2pt} and weakened (repelled) if \hspace{2pt}$c\left( \mathbf{x}\right) \neq c\left( \mathbf{w}_{ s\left( \mathbf{x}\right) }\right) $,\hspace{2pt} based on the update rule defined in $\left( \ref{update rule}\right)$ utilising $\left( \ref{strengthen or weaken}\right)$ and a small positive learning rate \hspace{2pt}$\eta$
\begin{align}\label{strengthen or weaken}
\psi \left( c\left( \mathbf{x}\right) , c\left( \mathbf{w}_{ s\left( \mathbf{x} \right) }\right)\right) =
\left \{
\begin{aligned}
&+1, && \hspace{10pt}c\left( \mathbf{x}\right) = c\left( \mathbf{w}_{ s\left( \mathbf{x}\right) }\right) \\
&-1, && \hspace{10pt} c\left( \mathbf{x}\right) \neq c\left( \mathbf{w}_{ s\left( \mathbf{x}\right) }\right)
\end{aligned} \right.
\end{align}
\begin{equation}\label{update rule}
\mathbf{w}_{t+1}=\mathbf{w}_{t} + \eta\psi\left( \mathbf{x}-\mathbf{w}_t\right) ;\hspace{10pt} \mathbf{w}_t=\mathbf{w}_{s\left( \mathbf{x}\right) } ; \hspace{10pt} 0<\eta\ll 1
\end{equation}
Though the standard Euclidean distance\hspace{2pt} $d\left( \mathbf{x},\mathbf{w}_k\right) $\hspace{2pt} is primarily utilized in LVQ, it is not limited to it: any standard dissimilarity measure is allowed if it fits the data set in question\cite{villmann2017can}.
The heuristic inclination and the instability of the reference vectors led to the development of many LVQ variants\cite{kohonen2001learning}. A more mathematically grounded, generalized version was introduced by Sato and Yamada, which solved the aforementioned problems of LVQ \cite{sato1996generalized}.
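One winner-takes-all training step as described above might be sketched as follows (a simplified illustration with invented names, not Kohonen's original code):

```python
import numpy as np

def lvq1_step(x, y, W, w_labels, eta=0.1):
    # Find the winner s(x) by the squared Euclidean distance, then
    # attract it if its label matches c(x) and repel it otherwise.
    k = int(np.argmin(np.sum((W - x) ** 2, axis=1)))
    psi = 1.0 if w_labels[k] == y else -1.0
    W[k] = W[k] + eta * psi * (x - W[k])
    return W
```

Repeating this step over the training set, usually with a decaying learning rate $\eta$, moves the prototypes towards typical positions of their classes.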
\section{Generalized Learning Vector Quantization}
Sato and Yamada present a generalized version of the LVQ variants which employs a cost function and an update rule that incorporates convergence conditions for the prototype vectors\cite{sato1996generalized}.
Let\hspace{2pt} $d$\hspace{2pt} be any differentiable dissimilarity measure, let\hspace{2pt} $\mathbf{w}^{+}$ \hspace{2pt}be the best matching prototype vector with\hspace{2pt} $c\left( \mathbf{x}\right) = c\left( \mathbf{w}^{+}\right)$\hspace{2pt} (correct class)
and\hspace{2pt} $\mathbf{w}^{-}$\hspace{2pt} the best matching prototype vector with\hspace{2pt} $c\left( \mathbf{x}\right) \neq c\left( \mathbf{w}^{-}\right)$\hspace{2pt} (incorrect class); then the function\hspace{2pt} $\mu\left( \mathbf{x}\right)$,\hspace{2pt} referred to as the classifier function, is
\begin{equation*}%remove
\mu \left( \mathbf{x}\right) =\frac{d\left( \mathbf{x},\mathbf{w}^{+}\right)-d\left( \mathbf{x},\mathbf{w}^{-}\right) }{d\left( \mathbf{x},\mathbf{w}^{+}\right)+d\left( \mathbf{x},\mathbf{w}^{-} \right) }
\end{equation*}
$\mu\left( \mathbf{x}\right)\in\left[ -1,1\right]$: whenever the classification is correct we have\hspace{2pt} $d\left( \mathbf{x},\mathbf{w}^{+}\right)<d\left( \mathbf{x},\mathbf{w}^{-}\right)$,\hspace{2pt} so\hspace{2pt} $\mu\left( \mathbf{x}\right) $\hspace{2pt} is negative, while an incorrect classification yields a positive\hspace{2pt} $\mu\left( \mathbf{x}\right) $. The cost function is given by
\begin{equation}\label{GLVQ cost fucntion}
J_{GLVQ}\left( X,W\right) =\sum_{i=1}^{n}f\left( \mu\left( \mathbf{x}_i\right) \right)
\end{equation}
The non-linear activation function $f$, which increases monotonically, is usually chosen as the sigmoid function
\begin{align*}
f_t\left( x\right) =\frac{1}{1+e^{-\frac{x}{t}}} ;\hspace{10pt} t>0
\end{align*}
Minimization of the cost function in $\left( \ref{GLVQ cost fucntion}\right)$ is performed using stochastic gradient descent learning (SGDL); the resulting update rule is given by $\left( \ref{GLVQ update}\right) $
\begin{equation}\label{GLVQ update w+}
\begin{split}
\frac{\partial J}{\partial \mathbf{w}^+}&=\frac{\partial f}{\partial \mu}\cdot\frac{\partial \mu}{\partial d^{+}\left( \mathbf{x}\right)}\cdot\frac{\partial d^{+}\left( \mathbf{x}\right) }{\partial \mathbf{w}^{+}}\\
&= \frac{\partial f}{\partial \mu}\cdot\frac{2d^{-}\left( \mathbf{x}\right) }{\left( d^{+}\left( \mathbf{x}\right) + d^{-}\left( \mathbf{x}\right) \right) ^2}\cdot \left( -2\right) \left( \mathbf{x}-\mathbf{w}^{+}\right)
\end{split}
\end{equation}
Similarly,
\begin{equation}\label{GLVQ upddate W-}
\begin{split}
\frac{\partial J}{\partial \mathbf{w}^-}&=\frac{\partial f}{\partial \mu}\cdot\frac{\partial \mu}{\partial d^{-}\left( \mathbf{x}\right)}\cdot\frac{\partial d^{-}\left( \mathbf{x}\right) }{\partial \mathbf{w}^{-}}\\
&= \frac{\partial f}{\partial \mu}\cdot\frac{-2d^{+}\left( \mathbf{x}\right) }{\left( d^{+}\left( \mathbf{x}\right) + d^{-}\left( \mathbf{x}\right) \right) ^2}\cdot \left( -2\right) \left( \mathbf{x}-\mathbf{w}^{-}\right)
\end{split}
\end{equation}
From $\left( \ref{GLVQ update w+}\right)$ and $\left( \ref{GLVQ upddate W-}\right)$ we obtain the update rule $\left( \ref{GLVQ update}\right) $
\begin{equation}\label{GLVQ update}
\Delta \mathbf{w}^{\pm}\propto\frac{-\partial f}{\partial \mu}\cdot\frac{\pm 2d^{\mp}\left( \mathbf{x}\right) }{\left( d^{+}\left( \mathbf{x}\right) +d^{-}\left( \mathbf{x}\right) \right)^2 }\cdot\frac{\partial d\left( \mathbf{x},\mathbf{w}^{\pm }\right) }{\partial \mathbf{w}^{\pm}}
\end{equation}
From Equations $\left( \ref{GLVQ update w+}\right)$ and $\left( \ref{GLVQ upddate W-}\right)$ we see that the attraction and repulsion scheme used in LVQ is preserved in GLVQ\cite{villmann2017can}.
However, it must be noted that the dissimilarity measure employed in $\left( \ref{GLVQ update}\right)$ is the squared Euclidean distance\cite{sato1996generalized}.
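To make the classifier function concrete, a small sketch (with helper names of my own choosing) computes $\mu(\mathbf{x})$ from a prototype set; $\mu<0$ indicates a correct classification:

```python
import numpy as np

def glvq_mu(x, y, W, w_labels):
    # mu(x) = (d+ - d-) / (d+ + d-) with the squared Euclidean distance,
    # where d+ (d-) is the distance to the best matching prototype of
    # the correct (an incorrect) class.
    d = np.sum((W - x) ** 2, axis=1)
    same = np.asarray(w_labels) == y
    d_plus = d[same].min()
    d_minus = d[~same].min()
    return (d_plus - d_minus) / (d_plus + d_minus)
```

Summing $f(\mu(\mathbf{x}_i))$ over the training set then gives the GLVQ cost $\left( \ref{GLVQ cost fucntion}\right)$.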
\section{Generalized Matrix Learning Vector Quantization}
GLVQ provides a conceptual framework within which all generalized LVQ variants can be developed. The GLVQ requirement of a differentiable dissimilarity measure, chosen as the standard Euclidean distance in \cite{sato1996generalized}, is not ideal for all problems\cite{villmann2017can}. The search for a dissimilarity measure that works well for different data sets while keeping the generalization requirement of differentiability led to the introduction of Generalized Relevance Learning Vector Quantization (GRLVQ)\cite{article}. The dissimilarity measure used in GRLVQ is specified with relevance factors, which are learned in the same manner as the prototypes in GLVQ\cite{article}. An advanced variant of GRLVQ, which utilizes a full matrix of relevances in specifying the dissimilarity measure used in GLVQ, is referred to as Generalized Matrix Learning Vector Quantization (GMLVQ)\cite{article}.
The dissimilarity measure in matrix-GLVQ is given by
\begin{equation*}%remove
d_\Omega\left( \mathbf{x},\mathbf{w}\right)=\left( \mathbf{x}-\mathbf{w}\right) ^{T}\Omega^{T}\Omega\left( \mathbf{x}-\mathbf{w}\right) ;\hspace{10pt} \Omega \in \mathbb{R}^{m\times n},
\end{equation*}
When\hspace{2pt} $m=n$, the matrix $\Lambda=\Omega^T \Omega \in \mathbb{R}^{n\times n}$ and $\Omega$ \hspace{2pt}serves the purpose of a projection matrix\cite{villmann2017can}:
\begin{equation}\label{GMLVQ distance}
d_\Omega \left( \mathbf{x},\mathbf{w}\right) =\left( \Omega\left( \mathbf{x}- \mathbf{w}\right)\right) ^2
\end{equation}
With a positive definite matrix\hspace{2pt} $\Lambda$,\hspace{2pt} $ \left( \ref{GMLVQ distance} \right)$ can be interpreted as a squared Euclidean distance in the space transformed by $\Omega$.
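To make the role of $\Omega$ concrete, the following is a minimal numpy sketch of this dissimilarity; the helper name \texttt{omega\_distance} is our own choice and not part of any LVQ library.

```python
import numpy as np

def omega_distance(x, w, omega):
    # d_Omega(x, w) = (Omega (x - w))^2, i.e. the squared Euclidean
    # distance measured after mapping the difference vector by Omega
    diff = omega @ (x - w)
    return float(diff @ diff)

x = np.array([1.0, 2.0])
w = np.array([0.0, 0.0])

# With Omega = I, d_Omega reduces to the plain squared Euclidean distance
d_euclid = omega_distance(x, w, np.eye(2))            # 1^2 + 2^2 = 5.0
# A non-trivial Omega re-weights the feature directions
d_scaled = omega_distance(x, w, np.diag([2.0, 1.0]))  # 2^2 + 2^2 = 8.0
```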
Given a classifier of the form
\begin{equation*}%remove
\mu\left( \mathbf{x}\right) =\frac{d_\Omega\left( \mathbf{x},\mathbf{w}^{+}\right)-d_\Omega\left( \mathbf{x},\mathbf{w}^{-}\right) }{d_\Omega\left( \mathbf{x},\mathbf{w}^{+}\right)+d_\Omega\left( \mathbf{x},\mathbf{w}^{-} \right) }
\end{equation*}
the classification certainty is determined by the extent to which\hspace{2pt} $d_\Omega\left( \mathbf{x},\mathbf{w}^{+}\right)<d_\Omega\left( \mathbf{x},\mathbf{w}^{-}\right)$\hspace{2pt} holds\cite{article}.
The cost function is given by
\begin{equation}\label{GMLVQ costfunction}
J_{GMLVQ}\left( X,W\right) =\sum_{i=1}^{n}f\left( \mu\left( \mathbf{x}_i\right) \right)
\end{equation}
Just as in GLVQ, the prototype update in $\left( \ref{GMLVQ weight updation}\right)$ and the matrix adaptation in $\left( \ref{GMLVQ matrix adaptation}\right)$ are performed simultaneously\cite{schneider2009adaptive}, with SGDL used to minimize $\left( \ref{GMLVQ costfunction}\right)$:
\begin{equation}\label{GMLVQ matrix adaptation}
\Delta \Omega\propto \frac{-\partial f}{\partial \mu}\Bigg( \frac{\partial \mu}{\partial d_{\Omega}^{+}\left( \mathbf{x}\right)}\cdot\frac{\partial d_{\Omega}^{+}\left( \mathbf{x}\right) }{\partial \Omega}+\frac{\partial \mu}{\partial d_{\Omega}^{-}\left( \mathbf{x}\right)}\cdot\frac{\partial d_{\Omega}^{-}\left( \mathbf{x}\right) }{\partial \Omega} \Bigg)
\end{equation}
\begin{equation}\label{GMLVQ weight updation}
\Delta \mathbf{w}^{\pm}\propto \frac{-\partial f}{\partial \mu}\cdot\frac{\pm 2d_{\Omega}^{\mp}\left( \mathbf{x}\right) }{\left( d_{\Omega}^{+}\left( \mathbf{x}\right) +d_{\Omega}^{-}\left( \mathbf{x}\right) \right)^2 }\cdot\frac{\partial d_{\Omega}\left( \mathbf{x},\mathbf{w}^{\pm }\right) }{\partial \mathbf{w}^{\pm}}
\end{equation}
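The derivatives of $d_\Omega$ appearing in these update rules can be written in closed form. The sketch below is our own illustration (numpy, hypothetical function names) and checks the analytic prototype gradient against a finite difference:

```python
import numpy as np

def d_omega(x, w, omega):
    diff = omega @ (x - w)
    return float(diff @ diff)

def grad_w(x, w, omega):
    # partial d_Omega / partial w = -2 * Omega^T Omega (x - w)
    return -2.0 * (omega.T @ omega) @ (x - w)

def grad_omega(x, w, omega):
    # partial d_Omega / partial Omega = 2 * Omega (x - w)(x - w)^T
    diff = (x - w)[:, None]
    return 2.0 * omega @ (diff @ diff.T)

x = np.array([1.0, -0.5])
w = np.array([0.2, 0.3])
omega = np.array([[1.0, 0.5], [0.0, 2.0]])

# finite-difference check of the prototype gradient (first component)
eps = 1e-6
e0 = np.array([eps, 0.0])
fd = (d_omega(x, w + e0, omega) - d_omega(x, w - e0, omega)) / (2 * eps)
```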
\section{Cross-Entropy in Learning Vector Quantization}
We refer to the same introduction and parameters as used in GLVQ. Considering an information theoretic approach, the training set employed in the learning process comes along with probabilistic target class information given by \hspace{2pt}$(X,T) = \left\lbrace \mathbf{x}_i , \mathbf{t}_i\right\rbrace _{i=1}^{N}$ with\hspace{2pt} $\mathbf{t}_i$\hspace{2pt} being the probabilistic class targets satisfying the conditions\hspace{3pt} $t_{ij}\in \left[ 0,1\right] $ \hspace{2pt}and \hspace{2pt}$\sum_{j}t_{ij} = 1$\cite{villmann2018probabilistic}.
Given a data point \hspace{2pt}$\mathbf{x}\in X$,\hspace{2pt} consider the class probability vector\hspace{2pt}
$p\left( \mathbf{x}\right) =\left( p_{1}\left( \mathbf{x}\right) ,\ldots,p_{C}\left( \mathbf{x}\right)\right)^{T} $.\hspace{2pt} Assume a model class predictor\hspace{2pt} $p_{W}\left( \mathbf{x}\right) = \left( p_{W}\left( 1|\mathbf{x}\right) ,p_{W}\left( 2|\mathbf{x}\right) ,\ldots,p_{W}\left( C|\mathbf{x}\right)\right) ^{T} $\hspace{2pt} by analogy with Soft Learning Vector Quantization (SLVQ), using model parameters from the set\hspace{2pt} $W$.\hspace{2pt} The objective is to maximize the mutual information between \hspace{2pt}$p\left( \mathbf{x}\right) $\hspace{2pt} and \hspace{2pt}$p_{W}\left( \mathbf{x}\right) $\hspace{2pt}by minimizing the divergence between them\cite{villmann2018probabilistic}. Hence, a function that represents this divergence is
\begin{equation}\label{local errors}
L\left( X,W\right) = D_{KL}\left( p\left( \mathbf{x}\right) ||p_{W}\left( \mathbf{x}\right) \right)
\end{equation}
where the Kullback-Leibler divergence is
\begin{equation*}%remove
D_{KL}\left( p\left( \mathbf{x}\right) ||p_{W}\left( \mathbf{x}\right) \right)= H\left( p\left( \mathbf{x}\right)\right) - Cr\left( p\left( \mathbf{x}\right) ,p_{W}\left( \mathbf{x}\right) \right)
\end{equation*}
$H\left( p\left( \mathbf{x}\right)\right) $ \hspace{2pt}denotes the Shannon entropy and\hspace{2pt} $Cr\left( p\left( \mathbf{x}\right) ,p_{W}\left( \mathbf{x}\right) \right)$\hspace{2pt} denotes the cross-entropy.
It must be noted that alternative divergences such as the R\'{e}nyi $\alpha$-divergence may also be considered in this regard,
\begin{equation}\label{Renyi-divergence}
D_{\alpha}\left( p\left( \mathbf{x}\right) ||p_{W}\left( \mathbf{x}\right) \right) =\frac{1}{1-\alpha}\log\left( \sum_{k}\left( p_{k}\left( \mathbf{x}\right) \right) ^{\alpha}\cdot\left( p_{W}\left( k|\mathbf{x}\right) \right) ^{1-\alpha}\right)
\end{equation}
as $\alpha\rightarrow 1$ we have that
\begin{align*}
D_{\alpha}\left( p\left( \mathbf{x}\right) ||p_{W}\left( \mathbf{x}\right) \right)\rightarrow D_{KL}\left( p\left( \mathbf{x}\right) ||p_{W}\left( \mathbf{x}\right) \right)
\end{align*}
Because the Shannon entropy is independent of the learning parameters, minimizing the local errors in $\left(\ref{local errors}\right) $ takes into consideration only the cross-entropy:
\begin{equation}\label{cross entropy}
\frac{\partial}{\partial w}D_{KL}\left( p\left( \mathbf{x}\right) ||p_{W}\left( \mathbf{x}\right) \right)=\frac{-\partial}{\partial w}Cr\left( p\left( \mathbf{x}\right) ,p_{W}\left( \mathbf{x}\right) \right)
\end{equation}
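A quick numerical illustration of this decomposition (a numpy sketch of our own, using this section's sign conventions $H(p)=\sum_k p_k\log p_k$ and $Cr(p,q)=\sum_k p_k\log q_k$, under which only $Cr$ depends on the model parameters):

```python
import numpy as np

def kl_divergence(p, q):
    return float(np.sum(p * np.log(p / q)))

def entropy(p):
    # H(p) in this section's sign convention: sum_k p_k log p_k
    return float(np.sum(p * np.log(p)))

def cross_entropy(p, q):
    # Cr(p, q) in this section's sign convention: sum_k p_k log q_k
    return float(np.sum(p * np.log(q)))

p = np.array([0.7, 0.2, 0.1])   # true class probabilities
q = np.array([0.5, 0.3, 0.2])   # model class probabilities

# D_KL(p || q) = H(p) - Cr(p, q); since H(p) is constant in the model,
# the gradient of D_KL equals the negative gradient of Cr
lhs = kl_divergence(p, q)
rhs = entropy(p) - cross_entropy(p, q)
```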
\subsection{Soft Learning Vector Quantization}
The primary aim is to model a soft class predictor that follows conventional learning vector quantization (prototype-based with Euclidean dissimilarity measure)\cite{seo2003soft,villmann2018probabilistic,kaden2014aspects}. Hence given\hspace{2pt} $\mathbf{x}\in X$,\hspace{2pt} the probability density is determined by
\begin{align}
P_{W}\left( \mathbf{x}\right) = \sum_{j=1}^{N}p\left( \mathbf{x}|\mathbf{w}_{j}\right)p\left( \mathbf{w}_{j}\right)
\end{align}
where for prototype \hspace{2pt}$\mathbf{w}_{j}\in W$,\hspace{2pt} $p\left( \mathbf{w}_{j}\right)$\hspace{2pt} indicates the prior probability and\hspace{2pt} $p\left( \mathbf{x}|\mathbf{w}_{j}\right)$\hspace{2pt} indicates the probability of prototype\hspace{2pt} $\mathbf{w}_{j}$\hspace{2pt} to induce\hspace{2pt} $ \mathbf{x}$.
We incorporate the fixed classes\hspace{2pt} $c\in\mathcal{C}$ \hspace{2pt} of the data points together with the LVQ principle of best correct and best incorrect matching prototypes, arriving at joint probability densities of the form
\begin{equation}
P_{W}\left( \mathbf{x},c\right) = \sum_{j:c\left( \mathbf{w}_{j}\right) = c}p\left( \mathbf{x}|\mathbf{w}_{j}\right)p\left( \mathbf{w}_{j}\right)
\end{equation}
and
\begin{equation}
P_{W}\left( \mathbf{x},\neg c\right) = \sum_{j:c\left( \mathbf{w}_{j}\right) \neq c}p\left( \mathbf{x}|\mathbf{w}_{j}\right)p\left( \mathbf{w}_{j}\right)
\end{equation}
referred to as the probability that\hspace{2pt} $\mathbf{x}$\hspace{2pt} is induced by a mixture of Gaussians with the correct class and the probability that\hspace{2pt} $\mathbf{x}$\hspace{2pt} is induced by a mixture of Gaussians with the incorrect class, respectively\cite{seo2003soft,villmann2018probabilistic}. Concerning Soft Learning Vector Quantization (SLVQ),
the cost function minimized by stochastic gradient descent learning is given by
\begin{equation}\label{slvq cost}
L_{SLVQ}(X,W) = -\sum_{k}\ln\bigg(\frac{P_{W}(\mathbf{x}_{k},c_{k})}{P_{W}(\mathbf{x}_{k},\neg c_{k})}\bigg)
\end{equation}
and for Robust Soft Learning Vector Quantization (RSLVQ),
\begin{equation}\label{RSLVQ}
L_{RSLVQ}(X,W) = -\sum_{k}\ln\bigg(\frac{P_{W}(\mathbf{x}_{k},c_{k})}{P_{W}(\mathbf{x}_{k})}\bigg)
\end{equation}
where,
\begin{equation}
P_{W}(\mathbf{x}_{k}) = P_{W}(\mathbf{x}_{k},c_{k}) + P_{W}(\mathbf{x}_{k},\neg c_{k})
\end{equation}
In line with Seo and Obermayer\cite{seo2003soft}, the normalization by\hspace{2pt} $P_{W}(\mathbf{x}_{k})$\hspace{2pt} avoids the instability encountered when the cost function in (\ref{slvq cost}) diverges.
The prototype updates for SLVQ are performed using
\begin{equation*}
\Delta \mathbf{w}_{l}\propto \frac{-\partial}{\partial \mathbf{w}_{l}}L_{SLVQ}(X,W)
\end{equation*}
and in the case of RSLVQ,
\begin{equation*}
\Delta \mathbf{w}_{l}\propto \frac{-\partial}{\partial \mathbf{w}_{l}}L_{RSLVQ}(X,W)
\end{equation*}
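The ratio inside the RSLVQ cost can be evaluated directly from the Gaussian mixture ansatz. Below is a minimal sketch (numpy), assuming equal priors $p(\mathbf{w}_j)$ that cancel and unit variance; the function names are our own:

```python
import numpy as np

def gaussian_affinity(x, w):
    # unnormalized Gaussian p(x | w) with unit variance: exp(-||x - w||^2)
    diff = x - w
    return float(np.exp(-diff @ diff))

def rslvq_local_cost(x, c, prototypes, labels):
    # local cost -ln( P_W(x, c) / P_W(x) ); equal priors p(w_j) dropped
    p_correct = sum(gaussian_affinity(x, w)
                    for w, cw in zip(prototypes, labels) if cw == c)
    p_total = sum(gaussian_affinity(x, w) for w in prototypes)
    return -float(np.log(p_correct / p_total))

prototypes = [np.array([0.0, 0.0]), np.array([2.0, 2.0])]
labels = [0, 1]
x = np.array([0.1, -0.1])

cost_correct = rslvq_local_cost(x, 0, prototypes, labels)  # small: x near its class
cost_wrong = rslvq_local_cost(x, 1, prototypes, labels)    # large: x far from class 1
```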
%\begin{align}
% \frac{\partial}{\partial \mathbf{w}_{l}}L_{RSLVQ}(X,W) = \frac{\partial}{\partial \mathbf{w}_{l}}\ln P_{W}(\mathbf{x}_{k},c_{k}) - \frac{\partial}{\partial \mathbf{w}_{l}}\ln \big( P_{W}(\mathbf{x}_{k},c_{k}) + P_{W}(\mathbf{x}_{k},\neg c_{k}\big)
%\end{align}
\subsection{Robust Soft Learning Vector Quantization with Cross-Entropy Optimization}
GLVQ and its variants, together with many other prototype-based classifiers, are generally accepted to be highly robust and optimize the classification error to attain highly interpretable classification results\cite{kaden2014aspects}. A version of LVQ which utilizes cross-entropy maximization, motivated by information-theoretic principles, is introduced as a generalization of RSLVQ\cite{villmann2018probabilistic}.
Hence the cost function of the form,
\begin{equation*}%remove
E\left( X,W\right) =\sum_{\mathbf{x}}D_{KL}\left( t\left( \mathbf{x}\right) ||p_{W}\left(\mathbf {x}\right) \right)
\end{equation*}
is obtained from the cross-entropy relation in $\left(\ref{cross entropy}\right) $. Relating this model to the prototypes\hspace{2pt} $W=\left\lbrace \mathbf{w}_{1},\mathbf{w}_{2},\ldots,\mathbf{w}_{N}\right\rbrace $ \hspace{2pt}and class responsibilities\hspace{2pt} $c\left( \mathbf{w}_{k}\right)$,\hspace{2pt} we have
\begin{equation*}%remove
Cr_{W}\left( \mathbf{x}\right) =\sum_{c=1}^{C}t_{c}\left( \mathbf{x}\right) \cdot\log\left( p_{W}\left(c|\mathbf{x}\right) \right)
\end{equation*}
and using the model class prediction probability from SLVQ,
\begin{align*}
p_{W}\left( c|\mathbf{x}\right)=\frac{P_{W}\left( \mathbf{x},c\right) }{P_{W}\left( \mathbf{x}\right) }
\end{align*}
with
\begin{align*}
P_{W}\left( \mathbf{x},c\right) &= \sum_{j:c\left( \mathbf{w}_{j}\right) = c}\exp\left( -d_{\Omega}\left( \mathbf{x},\mathbf{w}_{j}\right) \right)\\
&= \sum_{j:c\left( \mathbf{w}_{j}\right) = c}\exp\left(-\left( \Omega\left( \mathbf{x}- \mathbf{w}_{j}\right)\right) ^2 \right)
\end{align*}
and
\begin{align*}
P_{W}\left( \mathbf{x}\right) &= \sum_{l}\exp\left( -d_{\Omega}\left( \mathbf{x},\mathbf{w}_{l}\right) \right)\\
&=\sum_{l}\exp\left(-\left( \Omega\left( \mathbf{x}- \mathbf{w}_{l}\right)\right) ^2 \right)
\end{align*}
Here,\hspace{2pt} $d_{\Omega}\left( \mathbf{x},\mathbf{w}_{j}\right)$\hspace{2pt} for all\hspace{2pt} $\mathbf{w}_{j}\in W$\hspace{2pt} is the dissimilarity measure utilized in GMLVQ.
We have the cross-entropy presented as
\begin{align}\label{cross entropy 2}
Cr_{W}\left( \mathbf{x}\right) =\sum_{c=1}^{C}t_{c}\left( \mathbf{x}\right) \cdot\log\left( \frac{P_{W}\left( \mathbf{x},c\right) }{P_{W}\left( \mathbf{x}\right) } \right)
\end{align}
The cost function based on $\left(\ref{cross entropy 2}\right) $ is a generalization of RSLVQ\cite{villmann2018probabilistic}.
For mutually exclusive training targets, the cost function approaches the RSLVQ cost function\cite{villmann2018probabilistic}. We account for this by considering\hspace{2pt} $t_{ij}\in\left\lbrace 0,1\right\rbrace $\hspace{2pt} together with \hspace{2pt}$\sum_{j}t_{ij}=1$, \hspace{2pt}i.e.\ we assume the target probabilities across the classes are mutually exclusive. Considering one prototype per class,\hspace{2pt} $t_{c}\left( \mathbf{x}\right) = 1$\hspace{2pt} in (\ref{cross entropy 2}), we arrive at the same cost function (\ref{RSLVQ}) as for RSLVQ.\\ Mathematically, we have
\begin{equation*}
p\left( t_{i}| \mathbf{x}_{i} \right) = \prod_{j=1}^{C}p_{j}\left( \mathbf{x}_{i}\right) ^{t_{ij}}
\end{equation*}
and
\begin{equation*}
p\left( c_{i}| \mathbf{x}_{i} \right) = \prod_{j=1}^{C}\left( p_{W}\left( j,\mathbf{x}_{i}\right)\right) ^{t_{ij}}
\end{equation*}
referred to as the true conditional target probability for\hspace{2pt} $\mathbf{x}_{i}$\hspace{2pt} and the model conditional target probability for \hspace{2pt}$\mathbf{x}_{i}$, \hspace{2pt}respectively, expressed as multinomial distributions\cite{villmann2018probabilistic}. We further consider the log-likelihood ratio
\begin{equation*}
\log \frac{p\left( T|X\right) }{p_{W}\left( C|X\right) } = \log \bigg(\prod_{i=1}^{N}\frac{p\left( t_{i}|\mathbf{x}_{i}\right) }{p_{W}\left( c_{i}|\mathbf{x}_{i}\right)}\bigg)
\end{equation*}
expanded as
\begin{align*}
&= \sum_{i=1}^{N}\log\left( p\left( t_{i}|\mathbf{x}_{i}\right)\right) - \sum_{i=1}^{N}\log\left( p_{W}\left( c_{i}|\mathbf{x}_{i}\right)\right) \\
&=\sum_{i=1}^{N}\sum_{j=1}^{C}t_{ij}\log\left( p_{j}\left( \mathbf{x}_{i}\right) \right)- \sum_{i=1}^{N}\sum_{j=1}^{C}t_{ij}\log\left( p_{W}\left(j| \mathbf{x}_{i}\right) \right)
\end{align*}
which has the form observed in (\ref{local errors}):
\begin{align*}
=\sum_{i=1}^{N}H\left( t_{i}\right) - Cr\left( t_{i},p_{W}\left( \mathbf{x}_{i}\right) \right)
\end{align*}
The divergence in $\left(\ref{local errors}\right)$ is minimized by gradient descent learning with respect to the parameter\hspace{2pt} $W$,\hspace{2pt} which according to $\left(\ref{cross entropy}\right)$ amounts to maximizing the cross-entropy via\hspace{2pt} $\frac{\partial}{\partial \mathbf{w}_{l}}Cr_{W}\left( \mathbf{x}\right)$,\hspace{2pt} and the prototype updates are done using
\begin{equation}
\Delta \mathbf{w}_{l}\propto \frac{\partial}{\partial \mathbf{w}_{l}}Cr_{W}\left( \mathbf{x}\right)
\end{equation}
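Putting the pieces together, the cross-entropy $Cr_{W}(\mathbf{x})$ of (\ref{cross entropy 2}) can be evaluated as follows. This is a numpy sketch under the GMLVQ dissimilarity; all function and variable names are our own:

```python
import numpy as np

def cr_w(x, targets, prototypes, labels, omega, n_classes):
    # Cr_W(x) = sum_c t_c(x) * log( P_W(x, c) / P_W(x) )
    # with P_W(x, c) = sum_{j: c(w_j)=c} exp(-(Omega(x - w_j))^2)
    aff = np.array([np.exp(-float((omega @ (x - w)) @ (omega @ (x - w))))
                    for w in prototypes])
    per_class = np.array([aff[np.array(labels) == c].sum()
                          for c in range(n_classes)])
    p_cond = per_class / per_class.sum()   # p_W(c | x)
    return float(np.sum(targets * np.log(p_cond)))

prototypes = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
labels = [0, 1]
omega = np.eye(2)
x = np.array([0.0, 0.0])

# crisp target on the correct class gives Cr close to 0 (log of ~1) ...
cr_correct = cr_w(x, np.array([1.0, 0.0]), prototypes, labels, omega, 2)
# ... while a crisp target on the wrong class is strongly negative
cr_wrong = cr_w(x, np.array([0.0, 1.0]), prototypes, labels, omega, 2)
```

Maximizing $Cr_W$ therefore pushes the model class probability of the target class towards one.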
%\section{Cross-Entropy Method Generalized Learning Vector Quantization}
%The incorporation of a cost function approach that is continuous and differentiable together with convergence conditions into the reference vectors update rule of GLVQ remains a groundbreaking feat in the LVQ family of advanced prototype-based classification algorithms\cite{sato1996generalized}. Even though developments in such regard have achieved excellent generalization ability and optimal classification results, a vital point worthy of discussion is the prototype initialization problem associated with the use of GLVQ\cite{boubezoul2008application}. The optimization in this area only seeks to achieve convergence at the global minima, which in theory and practice is linked to optimal classification results for prototype-based models but this scenario remains a challenge whenever the optimization gets stacked in the local minima, which is precisely the case for GLVQ\cite{boubezoul2008application}. Cross-Entropy Method Generalized Learning Vector Quantization has been discovered to overcome the problem associated with prototype initialization sensitiveness of GLVQ\cite{boubezoul2008application}.
%Given an optimization task, the challenge here would be to search for the optimal set of parameters $W$ to which the cost function in $\left( \ref{GLVQ cost fucntion}\right)$ can be minimized.
%\begin{equation*}
% \gamma^{\ast} = \min\limits_{\mathbf{w}\in W}J\left( \mathbf{w}\right)
%\end{equation*}
%Two key iterative steps are considered\cite{boubezoul2008application}\cite{kroese2006cross},
%\begin{enumerate}
% \item Generate sample prototypes using a\hspace{2pt} $p(.; \mathbf{v})$\hspace{2pt} and choose the best of these samples
% \item Using the parameter\hspace{2pt} $\mathbf{v}$ update the distribution family by utilizing best samples selected in $(1)$. Repeate until convergence.
%\end{enumerate}
%The goal is to extend the search spectrum to which an optimal set of parameters can be obtained\cite{kroese2006cross}.
%The Cross-entropy method is applied in the optimization of GLVQ by ensuring the set of parameters to which the GLVQ cost function is minimized are generated by a Gaussian distribution given by way of a respective\hspace{2pt} $\left(d\times P \right)$\hspace{2pt} matrix with components obtained by
%\begin{equation*}
% W^{l}\triangleq W^{l}_{pq} \sim \mathcal{N}t(m_{pq}^{t},(\sigma_{pq}^{t})^2,a_{p},b_{p})
%\end{equation*}
%for
%\begin{equation*}
% l=1,...,V ;\hspace{10pt} p=1,...d\hspace{10pt} \text{and}\hspace{10pt}q=1,...,P
%\end{equation*}
%where the mean and variance of the $pqth$ components at iteration\hspace{2pt} $t$\hspace{2pt} is indicated by\hspace{2pt} $m^{t}_{pq}$ \hspace{2pt} and\hspace{2pt} $ \left( \sigma^{t}_{pq}\right) ^{2}$ respectively with the lower and upper bounding box to which all data points are covered per dimension is indicated by\hspace{2pt} $a_{p}$\hspace{2pt} and\hspace{2pt} $b_{p}$\cite{boubezoul2008application}.
%The lower and upper bounding box as used in pragmatic terms is given by
%\begin{equation*}
% a_{p} = K\min\limits_{j = 1,...,N}\left\{\mathbf{x}_{jp}\right\}
%\end{equation*}
%\begin{equation*}
% b_{p} = K\max\limits_{j = 1,...,N}\left\{\mathbf{x}_{jp}\right\}
%\end{equation*}
%where $K$ is greater or equal to 1 \cite{boubezoul2008application}.
%The smoothed updates of generating parameters in the Generic cross-entropy algorithm and the multi-extremal version as used in GLVQ are given respectively by
%\begin{equation*}
% \widehat{\mathbf{v}}_{t} = \alpha\widetilde{\mathbf{v}}_{t} + (1-\alpha)\widehat{\mathbf{v}}_{t-1} ;
%\end{equation*}
%\begin{equation}\label{dynamic smoothing}
% \beta_{t} = \beta_{0} - \beta_{0}\left(1-\frac{1}{t}\right)^{c}
%\end{equation}
%where
%\begin{equation}
% 0\leqslant \alpha \leqslant 1 ;\hspace{10pt} 0.8\leqslant \beta_{0}\leqslant 0.99 ; \hspace{10pt}5\leqslant c \leqslant 15
%\end{equation}
%with\hspace{2pt} $\beta_{0} $\hspace{2pt} as used here refers to a large smoothing constant,\hspace{2pt} $c$\hspace{2pt} chosen as a small integer and \hspace{2pt}$\alpha$\hspace{2pt} is a fixed smoothing parameter\cite{boubezoul2008application}.
%The avoidance of optimization getting stuck in a local minima coupled with the effect of poor convergence remains the critical account for which the smoothing is done\cite{boubezoul2008application,kroese2006cross}. Consequently, for viability, the selection of variance must be made regarding a broad search spectrum to overcome the unwanted effect of the initial choice of parameters noting that updates for the variance as used in the case of GLVQ in $\left( \ref{dynamic smoothing}\right)$ is referred to as dynamic smoothing\cite{kroese2006cross}.
%A summary of this process is shown below for the Prototypical Cross-Entropy Algorithm in Figure \ref{CE Algorithm} and Cross-Entropy Algorithm for GLVQ in Figure \ref{CE Algorithm for GLVQ} respectively.
%\begin{figure}[h]
% \centering
% \caption{Cross-Entropy Algorithm\cite{kroese2006cross}}\label{CE Algorithm}
%
% \begin{tabular}{ l l }
% \cline{1-2} \hline
%
% \multicolumn{1}{c}{\emph{}} &\multicolumn{1}{c}{Prototypical Cross-Entropy Algorithm for optimization $$ } \\ \hline
% \multicolumn{1}{c}{(1)} & Choose some\hspace{2pt} $\hat{\mathbf{v}}_{0}\in \mathbf{\vartheta}$.\hspace{2pt} Set\hspace{2pt} $t=1$ (level counter) \\
% \multicolumn{1}{c}{(2)} & Generate samples\hspace{2pt} $\mathbf{w}^{1},...,\mathbf{w}^{V}$ \hspace{2pt}from the density$\hspace{2pt}p(.;\hat{\mathbf{v}}_{t-1})$ and compute the \hspace{2pt}$\rho$-quantile \\
% \multicolumn{1}{c}{} &$\widehat{\gamma}_{t-1}$\hspace{2pt} of the samples scores.\\
% \multicolumn{1}{c}{(3)} & Use the same samples to solve the stochastic program by: \\
% \multicolumn{1}{c}{} &$\underset{\mathbf{v}}{\max}\hspace{2pt} \widehat{D}(\mathbf{v}) =\underset{\mathbf{v}} {\max}\left\{{\frac{1}{V}\sum_{l=1}^{V}I_{J(\mathbf{v}^{l})\leq\widehat{\gamma}_{t-1}}\ln p(\mathbf{w}^{l};\mathbf{v})}\right\}$ \\
% \multicolumn{1}{c}{} & Denote the solution by \hspace{2pt}$\tilde{\mathbf{v}}_{t}$ \\
% \multicolumn{1}{c}{(4)} & If predefined stopping criteria is met, then stop; otherwise set $$\\
% \multicolumn{1}{c}{} & $t=t+1$ \hspace{2pt}reiterate from step 2 \\\hline
%
% \end{tabular}
%\end{figure}
%
%\begin{figure}[h]
% \centering
% \caption{Cross-Entropy Algorithm for GLVQ\cite{boubezoul2008application}}\label{CE Algorithm for GLVQ}
%
% \begin{tabular}{ l l }
% \cline{1-2} \hline
%
% \multicolumn{1}{c}{\emph{}} &\multicolumn{1}{c}{Cross-Entropy Algorithm for GLVQ optimization $$ } \\ \hline
% \multicolumn{1}{c}{(1)} & Choose some initial\hspace{2pt} $\left\{M^0,\sum^0\right\} $ \hspace{2pt}for\hspace{2pt} $ p=1,...,d,\hspace{2pt} q=1,...,P.$\hspace{2pt} Set\hspace{2pt} $t=1$ \\
% \multicolumn{1}{c}{} & (level counter) \\
% \multicolumn{1}{c}{(2)} & Draw samples\hspace{2pt} $W^{l} \sim \mathcal{N}t(M^{(t-1)},\sum^{(t-1)},a_{p},b_{p}),$\hspace{2pt} $ l=1,...,V. $ \\
% \multicolumn{1}{c}{(3)} & Compute\hspace{2pt} $S^{l}=J_{GLVQ}(X;W^{l})$\hspace{2pt} scores by applying $(\ref{GLVQ cost fucntion})$ $\forall\hspace{2pt} l$.\\
% \multicolumn{1}{c}{(4)} & Sort\hspace{2pt} $S^{l}$\hspace{2pt} in ascending order and denote by\hspace{2pt} $I$\hspace{2pt} the set of corresponding \\
% \multicolumn{1}{c}{} & indices. Let us denote $\left(\widetilde{M}^{(t-1)},\left(\widetilde{\sum}^{(t-1)}\right)^2\right)$ the mean and the variance\\
% \multicolumn{1}{c}{} & of the best\hspace{2pt} $\lceil \rho V\rceil\hspace{2pt} $ prototypes elite samples of\hspace{2pt} $\left\{W^{I(l)}\right\},\hspace{2pt} l = 1,...,\lceil \rho V\rceil$ \\
% \multicolumn{1}{c}{} & respectively.\\
% \multicolumn{1}{c}{(5)} & $\widehat{M}^{t} = \alpha\widetilde{M}^{t} + (1-\alpha)\widehat{M}^{t-1},\hspace{3pt} \widehat{\sum}^{t} =\beta_{t}\widetilde{\sum}^{t} + (1-\beta_{t})\widehat{\sum}^{(t-1)}$ \\
% \multicolumn{1}{c}{(6)} & If convergence is reached or\hspace{2pt} $t=T$ ($T$ denote the final iteration), then \\
% \multicolumn{1}{c}{} & stop; otherwise set\hspace{2pt} $t = t + 1$ and reiterate from step 2. \\\hline
%
% \end{tabular}
%
%\end{figure}
\newpage
\section{Classification Label Security/Certainty}
We consider a data set\hspace{2pt} $X=\left\lbrace \mathbf{x}_1,\mathbf{x}_2,\mathbf{x}_3,\ldots,\mathbf{x}_N\right\rbrace \subseteq \mathbb{R}^n$\hspace{2pt} with class labels\hspace{2pt} $c\left( \mathbf{x}\right)\in\mathcal{C}=\left\lbrace 1,2,\ldots, C\right\rbrace $\hspace{2pt} and define a set of prototype vectors\hspace{2pt} $W=\left\lbrace \mathbf{w}_1,\mathbf{w}_2,\ldots,\mathbf{w}_M\right\rbrace\subseteq \mathbb{R}^n $\hspace{2pt} such that every\hspace{2pt} $\mathbf{w}\in W$ \hspace{2pt}has a corresponding class \hspace{2pt}$c\left( \mathbf{w}\right)\in\mathcal{C}$.\hspace{2pt} We divide the data set into train and test sets. Using the train set along with the standard LVQ training procedure, the learned prototypes \hspace{2pt}$\mathbf{w}_{k}\in W$ \hspace{2pt}together with their classes \hspace{2pt}$c\left( \mathbf{w}_{k} \right) $ \hspace{2pt} are accessed and applied in accordance with the fuzzy probabilistic assignments of FCM described in $\left( \ref{membership function}\right) $\ to determine the classification label security of the test set. Hence, for every test data point, the classification label security is calculated and returned accordingly.
We further consider the utilization of the computed classification label securities to determine a reject and non-reject classification strategy\cite{hanczar2019performance}. Advancing in this regard, we consider a test sample \hspace{2pt}$\mathbf{x}_{k}\in \mathbb{R}^n$\hspace{2pt} for all\hspace{2pt} $1\leq k \leq N$, a given model classifier function indicated by\hspace{2pt} $M_{c}$\hspace{2pt} and the computed classification label security of\hspace{2pt} $\mathbf{x}_{k}$\hspace{2pt} indicated by\hspace{2pt} $u_{ik}$,\hspace{2pt} $1\leq i \leq |\mathcal{C}|$,\hspace{2pt} and define a non-reject classification strategy based on
\begin{equation}\label{non reject classification}
M_{c}(\mathbf{x}_{k}) = c_{i}\in\mathcal{C}, \hspace{5pt} i=\arg\max_{j}\hspace{2pt}\left\lbrace u_{jk}\right\rbrace
\end{equation}
and a reject classification strategy based on
\begin{align}\label{reject classification strategy}
M_{c}(\mathbf{x}_{k})=
\left \{
\begin{aligned}
&r, && \hspace{10pt}\text{if}\hspace{5pt}u_{ik}< h\hspace{10pt} \forall\hspace{2pt} i \\
&c_{i}\in\mathcal{C}, && \hspace{10pt}i=\arg\max_{j}\hspace{2pt}\left\lbrace u_{jk}\right\rbrace \hspace{10pt}\text{otherwise}
\end{aligned} \right.
\end{align}
with an extended class set\hspace{2pt} $\mathcal{C}^{\ast} = \mathcal{C}\cup \left\lbrace r\right\rbrace $\hspace{2pt} where the decision to reject is indicated by\hspace{2pt} $r$\hspace{2pt} based on a fixed but arbitrarily chosen threshold classification label security \hspace{2pt}$h$,\hspace{2pt} $0\leq h \leq 1$\cite{hanczar2019performance}.
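These decision rules translate directly into code. The following is a small sketch of our own (numpy), with the reject decision $r$ modelled as the string \texttt{"r"}:

```python
import numpy as np

REJECT = "r"  # extends the class set to C* = C  U  {r}

def classify_with_reject(securities, classes, h):
    # reject when every label security u_ik falls below the threshold h,
    # otherwise return the class with maximal security
    securities = np.asarray(securities)
    if np.all(securities < h):
        return REJECT
    return classes[int(np.argmax(securities))]

# all securities below h = 0.7 -> reject
decision_r = classify_with_reject([0.10, 0.25, 0.65], [1, 2, 3], h=0.7)
# one security reaches the threshold -> class with maximal security
decision_c = classify_with_reject([0.10, 0.15, 0.75], [1, 2, 3], h=0.7)
```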
The average model classification certainty, which indicates regions in the data space where the model is confident with respect to the prototypes, is given by
\begin{equation*}\label{model certainty}
\zeta(X,W) = \frac{1}{|W|}\sum_{\mathbf{w}\in W}(\zeta_{\mathbf{w}}(X))
\end{equation*}
where $\zeta_{\mathbf{w}}(X)$\hspace{2pt} in $\left( \ref{prototype certainty}\right) $ measures the classification certainty of respective prototype\hspace{2pt} $\mathbf{w}$ and class responsibilities\hspace{2pt} $c \left(\mathbf{w}\right) $ with regards to equation $\left( \ref{reject classification strategy}\right)$ \cite{villmann2018probabilistic}
\begin{equation}\label{prototype certainty}
\zeta_{\mathbf{w}}(X) = \frac{|\left\{\mathbf{x}\in X|\mathbf{w} = \mathbf{w}_{s(\mathbf{x})}\wedge c(\mathbf{x}) = c(\mathbf{w}_{s(\mathbf{x})})\right\}|}{|\left\{\mathbf{x}\in X|\mathbf{w} = \mathbf{w}_{s(\mathbf{x})}\right\}|}
\end{equation}
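This certainty counts, among the samples for which a prototype wins, the fraction carrying that prototype's class. A small sketch (numpy, squared Euclidean winner rule; names our own):

```python
import numpy as np

def prototype_certainty(X, y, prototypes, proto_labels, j):
    # zeta_w for prototype j: among samples whose nearest prototype
    # (winner s(x)) is j, the fraction whose label matches c(w_j)
    wins, correct = 0, 0
    for x, c in zip(X, y):
        dists = [float((x - w) @ (x - w)) for w in prototypes]
        if int(np.argmin(dists)) == j:
            wins += 1
            correct += int(c == proto_labels[j])
    return correct / wins if wins else 0.0

X = [np.array([0.0, 0.0]), np.array([0.1, 0.0]), np.array([2.0, 2.0])]
y = [0, 1, 1]
prototypes = [np.array([0.0, 0.0]), np.array([2.0, 2.0])]
proto_labels = [0, 1]

zeta_0 = prototype_certainty(X, y, prototypes, proto_labels, 0)  # wins 2, correct 1
zeta_1 = prototype_certainty(X, y, prototypes, proto_labels, 1)  # wins 1, correct 1
```

The average model classification certainty $\zeta(X,W)$ is then the mean of these values over all prototypes.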
The behavior of the models concerning the test accuracy \hspace{2pt}$Acc$\hspace{2pt} and the adjusted test accuracy\hspace{2pt} $Acc_{h} $,\hspace{2pt} which excludes classifications rejected based on a given threshold security \hspace{2pt}$h$,\hspace{2pt} will also be investigated with
\begin{equation}\label{model accuracy}
Acc_{h} = \frac{|\left\{\mathbf{x}\in X|\ c(\mathbf{x}) = c(\mathbf{w}_{s(\mathbf{x})})\right\}|}{| X|}
\end{equation}
For the accuracy disregarding any rejected classification, we drop the threshold security\hspace{2pt}\ $h$.
The model classification certainty\hspace{2pt} $\zeta(X,W)$\hspace{2pt} will be utilized as the primary metric to evaluate the confidence of the GLVQ, GMLVQ and CELVQ models used in this thesis for the determination of the classification label security.
%\begin{equation}\label{reject certainty}
%Accuracy(X) = \frac{|\left\{\mathbf{x}\in X|\hspace{2pt}M_{c}(\mathbf{x}) = c(\mathbf{x})\hspace{2pt}\wedge\hspace{2pt} M_{c}(\mathbf{x})\neq r \right\}|}{|\left\{\mathbf{x}\in X|\hspace{2pt}M_{c}(\mathbf{x}) \neq r\right\}|}
%\end{equation}
\chapter{Experimental Results}
\section{General Overview of Train/Test Procedure }
A standard and generally accepted procedure in machine learning for model training and testing involves splitting the data set under consideration into a train set and a test set. The split puts more weight on the train set than the test set. A good model should undergo rigorous training on the larger part of the data set in order to capture a reasonably representative amount of the variance in the patterns present, with the remaining unused data points used to evaluate how well the model predicts on new data.
It is advisable first to gain exploratory insight into the data set under study, as this informs the decision on which data scaling procedure to apply.
In this thesis, all data sets used in the experimentation were split into the train-test ratio of $4:1$. The data sets were normalized with $\left( \ref{standardization}\right)$ in the data preparation stage.
\begin{equation}\label{standardization}
\mathbf{x}_{s}=\frac{\mathbf{x}-\text{mean}(\mathbf{x})}{\text{standard deviation}(\mathbf{x})}
\end{equation}
where $\mathbf{x}_{s}$\hspace{2pt} is the normalized vector and \hspace{2pt}$\mathbf{x}$\hspace{2pt} is the unnormalized vector.
The train set is first fitted and transformed with the standard feature scaler in $\left( \ref{standardization} \right) $, whilst the mean$\left(\mathbf{x} \right)$ \hspace{2pt}and standard deviation$\left(\mathbf{x}\right)$ of the train set are used to transform the test set. This is done to prevent information leakage from the test set into the model.
The split must be done to ensure that the train and test sets are mutually disjoint.
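The scaling protocol described above (fit on the train set, reuse the train statistics on the test set) can be sketched in a few lines of numpy; the synthetic data here is purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(100, 4))
split = int(0.8 * len(X))            # 4:1 train-test ratio
X_train, X_test = X[:split], X[split:]

# Fit the scaler on the train set only ...
mean = X_train.mean(axis=0)
std = X_train.std(axis=0)
X_train_s = (X_train - mean) / std
# ... and reuse the train statistics on the test set (no leakage)
X_test_s = (X_test - mean) / std
```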
\section{Iris Data Set}
The Iris data set\cite{fisher1936use} is used in this thesis to determine the classification label security by taking into account the fuzzy probabilistic assignment of FCM estimates. This data set is chosen primarily to reflect its prolific usage for most machine learning implementation schemes. The Iris data set holds an unchallenged position of fame in the machine learning community and remains well understood in such regard. The data set has 150 data points present with three uniform classes, each containing 50 data points with four features, namely sepal length in cm, sepal width in cm, petal length in cm and petal width in cm. The three classes are referred to as Iris Setosa, Iris Versicolour and Iris Virginica.
\section{Classification Label Security of Iris Data set}
The Iris data set is normalized as described in $\left( \ref{standardization}\right) $ and, in line with the standard train-test procedure, the train samples consist of $80\%$ of the total data points, with the remaining $20\%$ used as test samples. Prototype initialization is done uniformly across all three classes with one prototype per class. Training is realized using batches of 32 samples for a maximum of 100 epochs with learning rate $\eta =0.01$. The training of the Iris train set was realized using the python implementation\cite{Ravichandran2020}.
The learned prototypes were accessed and used to determine the classification label security of the Iris test set. The GLVQ, GMLVQ and CELVQ models were employed in the learning and classification of the Iris data set. The adjusted test accuracies with and without rejected classifications for the GLVQ, GMLVQ and CELVQ models are summarised in Table \ref{tab:Iris summary2}. The model classification certainties are summarised in Tables \ref{tab:Iris summary} and \ref{tab:Iris summary1}.
\begin{table}[H]
\centering
\begin{tabular}{ |c|c|c|c|c| }
\hline
\multicolumn{5}{|c|}{Model classification certainty of the Iris test set} \\
\hline
Model &$\zeta_{\mathbf{w}}^{0}(X) $ & $\zeta_{\mathbf{w}}^{1}(X)$ &$\zeta_{\mathbf{w}}^{2}(X)$ &$\zeta(X,W)$ \\
\hline
GLVQ & 1.00 &0.69 & 0.89 &0.860 \\
GMLVQ &1.00 &0.69 &0.88 &0.857 \\
CELVQ &1.00 &0.69 &0.89 & 0.860 \\
\hline
\end{tabular}
\caption[Summary of model classification certainty of the Iris test set]{\label{tab:Iris summary}This table contains a summary of the model classification certainty of the Iris test set with non-reject classification.\hspace{2pt} $\zeta_{\mathbf{w}}^{0}(X) $,\hspace{2pt} $\zeta_{\mathbf{w}}^{1}(X)$\hspace{2pt} and\hspace{2pt} $\zeta_{\mathbf{w}}^{2}(X)$\hspace{2pt} indicate the classification certainties of the model prototypes with respect to the Iris Setosa, Iris Versicolour and Iris Virginica classes, respectively. The average model classification certainty for the Iris test set is indicated by\hspace{2pt} $\zeta(X,W)$.}
\end{table}
\begin{table}[H]
\centering
\begin{tabular}{ |c|c|c|c|c| }
\hline
\multicolumn{5}{|c|}{Model classification certainty of the Iris test set $(h=0.7)$} \\
\hline
Model &$\zeta_{\mathbf{w}}^{0}(X) $ & $\zeta_{\mathbf{w}}^{1}(X)$ &$\zeta_{\mathbf{w}}^{2}(X)$ &$\zeta(X,W)$ \\
\hline
GLVQ &1.00 &1.00 &1.00 &1.00 \\
GMLVQ &1.00 &0.69 &1.00 &0.90 \\
CELVQ &1.00 &1.00 &1.00 &1.00 \\
\hline
\end{tabular}
\caption[Summary of model classification certainty of the Iris test set with threshold security]{\label{tab:Iris summary1}This table contains a summary of the model classification certainty of the Iris test set with reject classification based on a threshold classification label security of 0.7.\hspace{2pt} $\zeta_{\mathbf{w}}^{0}(X) $,\hspace{2pt} $\zeta_{\mathbf{w}}^{1}(X)$\hspace{2pt} and\hspace{2pt} $\zeta_{\mathbf{w}}^{2}(X)$\hspace{2pt} indicate the classification certainties of the model prototypes with respect to the Iris Setosa, Iris Versicolour and Iris Virginica classes, respectively. The average model classification certainty for the Iris test set is indicated by\hspace{2pt} $\zeta(X,W)$.}
\end{table}
\begin{table}[H]
\centering
\begin{tabular}{ |c|c|c| }
\hline
%\multicolumn{3}{|c|}{Model classification certainty of the Iris test set $(h=0.7)$} \\
%\hline
Model & Test Accuracy $(Acc)$ & Adjusted Test Accuracy $(Acc_{h})$ \\
\hline
GLVQ &0.83 &1.00 \\
GMLVQ &0.83 &0.84 \\
CELVQ &0.83 &1.00 \\
\hline
\end{tabular}
\caption[Summary of model classification test accuracy of the Iris test set]{\label{tab:Iris summary2}This table contains a summary of the model classification test accuracy of the Iris test set based on a non-reject classification\hspace{2pt}$(\ref{non reject classification})$\hspace{2pt} and a reject classification\hspace{2pt} $(\ref{reject classification strategy})$\hspace{2pt} based on a threshold classification label security\hspace{2pt} $(h=0.7)$.\hspace{2pt}}
\end{table}
The estimates in Tables \ref{tab:Iris summary}, \ref{tab:Iris summary1} and \ref{tab:Iris summary2} indicate how accurate the models are when they are confident about their predictions, that is, the certainty with which the models (GLVQ, GMLVQ and CELVQ) made the observed classifications of the Iris test set. We relate this to the computed classification label securities and observe by way of visualization (Figures \ref{fig:igd1}, \ref{fig:igmd1} and \ref{fig:icd1}) the regions in the Iris data space where the models (GLVQ, GMLVQ and CELVQ) are confident or unconfident about the classification labels. Figures \ref{fig:igd1} to \ref{fig:icd1} further allow us to determine, for any arbitrarily chosen threshold, the regions in the data space where the models are confident or unconfident about the classification labels. The reject classification strategy $\left( \ref{reject classification strategy}\right) $ improved the model classification certainty both at the class level and overall.
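The reject option used above can be sketched in a few lines of Python: predictions whose classification label security falls below the threshold $h$ are withheld, and the adjusted test accuracy $Acc_{h}$ is computed over the accepted samples only. The function name and the toy arrays below are illustrative, not part of the thesis implementation.

```python
import numpy as np

def adjusted_accuracy(y_true, y_pred, securities, h=0.7):
    """Accuracy over the samples whose classification label security
    reaches the threshold h; low-security predictions are rejected
    and excluded from the score."""
    accept = securities >= h
    if not np.any(accept):
        return float("nan")  # every prediction was rejected
    return float(np.mean(y_true[accept] == y_pred[accept]))

# toy example: the two low-security predictions are rejected,
# and the remaining (accepted) predictions are all correct
y_true = np.array([0, 1, 2, 1, 0])
y_pred = np.array([0, 1, 1, 1, 2])
sec    = np.array([0.95, 0.80, 0.55, 0.90, 0.60])
print(adjusted_accuracy(y_true, y_pred, sec, h=0.7))  # -> 1.0
```

In this toy run the plain accuracy is $0.6$, while the adjusted accuracy rises to $1.0$, mirroring the kind of improvement reported in Table \ref{tab:Iris summary2}.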
\begin{figure}[H]
\centering
\includegraphics[width=0.7\linewidth]{"../../../send from/sendddd/iris3/gtr"}
\caption[Iris train set with GLVQ prototypes]{Iris train set with GLVQ prototypes and decision boundary}
\label{fig:ig1}
\end{figure}
\begin{figure}[H]
\centering
\includegraphics[width=0.7\linewidth]{"../../../send from/sendddd/iris3/gt"}
\caption[Iris test set with GLVQ prototypes]{Iris test set with GLVQ prototypes and decision boundary}
\label{fig:ig2}
\end{figure}
%\begin{figure}[H]
%\centering
%\includegraphics[width=0.9\linewidth]{"../../../send from/sendddd/iris1/igld1"}
%\caption[Iris test set classification label security (GLVQ)]{The data space of the Iris test set showing the GLVQ model predicted labels along with the computed classification label securities.}
%\label{fig:igd}
%\end{figure}
\begin{figure}[H]
\centering
\includegraphics[width=0.9\linewidth]{"../../../send from/sendddd/iris3/gf"}
\caption[Iris test set classification label security (GLVQ)]{The data space of the Iris test set showing the GLVQ model computed classification label securities with a threshold security $(h=0.7)$.}
\label{fig:igd1}
\end{figure}
%\begin{figure}[H]
%\centering
%\includegraphics[width=0.9\linewidth]{"../../../send from/send n/iris glvqf"}
%\caption[Iris test set classification label security (GLVQ)]{The data space of the Iris test set showing the GLVQ model predicted labels along with the computed classification label securities.}
%\label{fig:glvqf}
%\end{figure}
\begin{figure}[H]
\centering
\includegraphics[width=0.9\linewidth]{"../../../send from/sendddd/iris3/gmtr"}
\caption[Iris train set with GMLVQ prototypes]{Iris train set with GMLVQ prototypes and decision boundary}
\label{fig:igm1}
\end{figure}
\begin{figure}[H]
\centering
\includegraphics[width=0.9\linewidth]{"../../../send from/sendddd/iris3/gmt"}
\caption[Iris test set with GMLVQ prototypes]{Iris test set with GMLVQ prototypes and decision boundary}
\label{fig:igm2}
\end{figure}
%\begin{figure}[H]
%\centering
%\includegraphics[width=0.9\linewidth]{"../../../send from/sendddd/iris1/igmld"}
%\caption[Iris test set classification label security (GMLVQ)]{The data space of the Iris test set showing the GMLVQ model predicted labels along with the computed classification label securities.}
%\label{fig:igmd}
%\end{figure}
\begin{figure}[H]
\centering
\includegraphics[width=0.9\linewidth]{"../../../send from/sendddd/iris3/gmf"}
\caption[Iris test set classification label security (GMLVQ)]{The data space of the Iris test set showing the GMLVQ model computed label securities with a threshold security $(h=0.7)$.}
\label{fig:igmd1}
\end{figure}
%\begin{figure}[H]
%\centering
%\includegraphics[width=0.9\linewidth]{"../../../send from/send n/iris gmlvqf"}
%\caption[Iris test set classification label security (GMLVQ)]{The data space of the Iris test set showing the GMLVQ model predicted labels along with the computed classification label securities.}
%\label{fig:gmlvqf}
%\end{figure}
\begin{figure}[H]
\centering
\includegraphics[width=0.8\linewidth]{"../../../send from/sendddd/iris3/ctr"}
\caption[Iris train set with CELVQ prototypes]{Iris train set with CELVQ prototypes and decision boundary}
\label{fig:ic1}
\end{figure}
\begin{figure}[H]
\centering
\includegraphics[width=0.8\linewidth]{"../../../send from/sendddd/iris3/ct"}
\caption[Iris test set with CELVQ prototypes]{Iris test set with CELVQ prototypes and decision boundary}
\label{fig:ic2}
\end{figure}
%\begin{figure}[H]
%\centering
%\includegraphics[width=0.9\linewidth]{"../../../send from/sendddd/iris1/iced"}
%\caption[Iris test set classification label security (CELVQ)]{The data space of the Iris test set showing the CELVQ model predicted labels along with the computed classification label securities.}
%\label{fig:icd}
%\end{figure}
\begin{figure}[H]
\centering
\includegraphics[width=0.9\linewidth]{"../../../send from/sendddd/iris3/cf"}
\caption[Iris test set classification label security (CELVQ)]{The data space of the Iris test set showing the CELVQ model computed classification label securities with a threshold security $(h=0.7)$.}
\label{fig:icd1}
\end{figure}
%\begin{figure}[H]
%\centering
%\includegraphics[width=0.9\linewidth]{"../../../send from/send n/iris celvqf"}
%\caption[Iris test set classification label security (CELVQ)]{The data space of the Iris test set showing the CELVQ model predicted labels along with the computed classification label securities.}
%\label{fig:celvqf}
%\end{figure}
From the results in Tables \ref{tab:Iris summary} and \ref{tab:Iris summary1}, we observe for all three models (GLVQ, GMLVQ and CELVQ) that the model classification certainty, with and without rejected classifications, for the Iris Setosa class was $1.0$. By referring to Figures \ref{fig:igd1}, \ref{fig:igmd1} and \ref{fig:icd1}, we can verify whether this recorded certainty agrees with the level of confidence in the regions of the Iris Setosa class labels. For an arbitrarily chosen label security threshold of $0.7$, most of the classification labels in the region of the Iris Setosa class indeed have very high label securities. The same observation applies to the certainties of the Iris Versicolour and Iris Virginica classes. The reject classification strategy $\left( \ref{reject classification strategy}\right) $ improved the test accuracy for all the models. By this implementation, we have determined the extent to which the predicted labels of the Iris test set can be trusted.
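One plausible way to aggregate the per-class statistics reported in the tables is to average the label securities of the test points predicted into each class, and then average those class values for the overall figure $\zeta(X,W)$. The following sketch assumes exactly this reading; the function name and the toy arrays are illustrative only.

```python
import numpy as np

def per_class_certainty(y_pred, securities, classes):
    """Mean label security of the test points assigned to each class,
    plus the overall average across classes (an illustrative reading
    of the zeta statistics reported in the tables)."""
    zeta = {c: float(np.mean(securities[y_pred == c])) for c in classes}
    zeta_avg = float(np.mean(list(zeta.values())))
    return zeta, zeta_avg

# toy example with three classes
y_pred = np.array([0, 0, 1, 1, 2, 2])
sec    = np.array([1.0, 1.0, 0.7, 0.7, 0.9, 0.9])
zeta, avg = per_class_certainty(y_pred, sec, classes=[0, 1, 2])
print(zeta, avg)
```

With per-class values of $1.0$, $0.7$ and $0.9$, the overall average is about $0.87$, analogous in form to the rows of Table \ref{tab:Iris summary}.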
\section{Breast Cancer Wisconsin (Diagnostic) Data set (WDBC)}
This thesis proceeds to test the determination of the classification label security on the well-known WDBC data set \cite{street1993nuclear}, comprising 569 data points with 30 numeric, predictive attributes (given by the mean, standard error and worst value of each measurement). For each data point, measurements are recorded under the attribute designations radius, texture, perimeter, area, smoothness, compactness, concavity, concave points, symmetry and fractal dimension. Two classes, WDBC-Malignant and WDBC-Benign, are considered, with a moderately balanced class distribution of 212:357 for the Malignant and Benign classes, respectively.
\section{Classification Label Security of Breast Cancer Wisconsin(Diagnostic) Data set}
All standard procedures described in Section $4.3$ for training with the Iris data set are maintained and employed to train on the WDBC data set. Using a train-test split of\hspace{2pt} $80\% : 20\%$\hspace{2pt} for the WDBC data set, the prototypes were initialized uniformly across the classes with one prototype per class. Training is realized using batches of 32 samples for a maximum of 100 epochs with $\eta =0.01$. The prototypes learned on the WDBC train set by the GLVQ, GMLVQ and CELVQ models were then used to determine the classification label security of the WDBC test set. The test accuracy and the adjusted test accuracy (excluding rejected classifications) for the GLVQ, GMLVQ and CELVQ models are summarized in Table \ref{tab:WDC summary3}.
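The label security itself derives from fuzzy probabilistic assignments in the style of FCM: the membership of a test point in each prototype's class is governed by its relative distances to all prototypes, with fuzziness parameter $m$ (the default $m=2$ is also used in the appendix listing). A minimal sketch, assuming Euclidean distances and hypothetical array names:

```python
import numpy as np

def fcm_label_security(x, prototypes, m=2):
    """FCM-style membership of a point x in each prototype:
    u_j = 1 / sum_k (d_j / d_k)^(2/(m-1)).
    The security of a predicted label is the membership value of
    the winning (closest) prototype; memberships sum to one."""
    d = np.linalg.norm(prototypes - x, axis=1)
    d = np.maximum(d, 1e-12)  # guard against a zero distance
    ratios = (d[:, None] / d[None, :]) ** (2.0 / (m - 1))
    return 1.0 / ratios.sum(axis=1)

# toy example: a point close to the first of three prototypes
protos = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 4.0]])
u = fcm_label_security(np.array([0.5, 0.1]), protos)
print(u.argmax())  # the closest prototype receives the highest membership
```

Points near a prototype receive a membership close to one (high label security), while points near a decision boundary split their membership almost evenly, which is exactly the behaviour the reject threshold $h$ exploits.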
\begin{table}[H]
\centering
\begin{tabular}{ |c|c|c|c| }
\hline
\multicolumn{4}{|c|}{Model classification certainty of the WDBC test set} \\
\hline
Model &$\zeta_{\mathbf{w}}^{0}(X) $ & $\zeta_{\mathbf{w}}^{1}(X)$ & $\zeta(X,W)$ \\
\hline
GLVQ &0.89 & 0.86 & 0.875 \\
GMLVQ &0.90 & 0.91 & 0.905 \\
CELVQ &0.88 & 0.91 & 0.895 \\
\hline
\end{tabular}
\caption[Summary of model classification certainty of the WDBC test set]{\label{tab:WDBC certainty}This table contains a summary of the model classification certainty of the WDBC test set with non-reject classification.\hspace{2pt} $\zeta_{\mathbf{w}}^{0}(X) $\hspace{2pt} and\hspace{2pt} $\zeta_{\mathbf{w}}^{1}(X)$ \hspace{2pt} indicates the classification certainty of the model prototypes with respect to the WDBC-Malignant and WDBC-Benign classes. The average model classification certainty for the WDBC test set is indicated by\hspace{2pt} $\zeta(X,W)$.}
\end{table}
\begin{table}[H]
\centering
\begin{tabular}{ |c|c|c|c| }
\hline
\multicolumn{4}{|c|}{Model classification certainty of the WDBC test set $ (h=0.7)$} \\
\hline
Model &$\zeta_{\mathbf{w}}^{0}(X) $ & $\zeta_{\mathbf{w}}^{1}(X)$ & $\zeta(X,W)$ \\
\hline
GLVQ &1.00 &0.92 &0.960 \\
GMLVQ &0.93 & 0.94 &0.935 \\
CELVQ &1.00 &1.00 &1.000 \\
\hline
\end{tabular}
\caption[Summary of model classification certainty of the WDBC test set with threshold security]{\label{tab:WDBC certainty_}This table contains a summary of the model classification certainty of the WDBC test set with reject classification based on a threshold classification label security of 0.7.\hspace{2pt} $\zeta_{\mathbf{w}}^{0}(X) $\hspace{2pt} and\hspace{2pt} $\zeta_{\mathbf{w}}^{1}(X)$ \hspace{2pt} indicates the classification certainty of the model prototypes with respect to the WDBC-Malignant and WDBC-Benign classes. The average model classification certainty for the WDBC test set is indicated by\hspace{2pt} $\zeta(X,W)$.}
\end{table}
\begin{table}[H]
\centering
\begin{tabular}{ |c|c|c| }
\hline
%\multicolumn{3}{|c|}{Model classification certainty of the WDBC test set $(h=0.7)$} \\
%\hline
Model & Test Accuracy $(Acc)$ & Adjusted Test Accuracy $(Acc_{h})$ \\
\hline
GLVQ &0.87 &0.94 \\
GMLVQ &0.90 &0.94 \\
CELVQ &0.89 &1.00 \\
\hline
\end{tabular}
\caption[Summary of model classification test accuracy of the WDBC test set]{\label{tab:WDC summary3}This table contains a summary of the model classification test accuracy of the WDBC test set based on a non-reject classification\hspace{2pt}$(\ref{non reject classification})$\hspace{2pt} and a reject classification\hspace{2pt} $(\ref{reject classification strategy})$\hspace{2pt} based on a threshold classification label security\hspace{2pt} $(h=0.7)$.\hspace{2pt}}
\end{table}
Observing the estimates in Tables \ref{tab:WDBC certainty}, \ref{tab:WDBC certainty_} and \ref{tab:WDC summary3}, we can infer how accurate the models are when they are confident about the classification labels of the test set, that is, the certainty with which the models (GLVQ, GMLVQ and CELVQ) made the observed classifications of the WDBC test set. We relate this to the computed classification label securities and show by way of visualization (Figures \ref{fig:wgd1}, \ref{fig:wgmld} and \ref{fig:cdl}) the regions in the WDBC data space where the models (GLVQ, GMLVQ and CELVQ) are confident or unconfident about the classification labels. Figures \ref{fig:wgd1} to \ref{fig:cdl} further allow us to determine, for any arbitrarily chosen threshold, the regions in the WDBC data space where the models are confident or unconfident about the classification labels.
\begin{figure}[H]
\centering
\includegraphics[width=0.7\linewidth]{"../../../send from/sendddd/WDBC3/gtr"}
\caption[WDBC train set with GLVQ prototypes]{WDBC train set with GLVQ prototypes and decision boundary}
\label{fig:wg1}
\end{figure}
\begin{figure}[H]
\centering
\includegraphics[width=0.7\linewidth]{"../../../send from/sendddd/WDBC3/gt"}
\caption[WDBC test set with GLVQ prototypes]{WDBC test set with GLVQ prototypes and decision boundary}
\label{fig:wg2}
\end{figure}
%\begin{figure}[H]
%\centering
%\includegraphics[width=0.9\linewidth]{"../../../send from/sendddd/WDBC/wglf"}
%\caption[WDBC test set classification label security (GLVQ)]{The data space of the WDBC test set showing the GLVQ model predicted labels along with the computed classification label securities.}
%\label{fig:wgd}
%\end{figure}
\begin{figure}[H]
\centering
\includegraphics[width=1.0\linewidth]{"../../../send from/sendddd/WDBC3/gf"}
\caption[WDBC test set classification label security (GLVQ)]{The data space of the WDBC test set showing the GLVQ model computed classification label securities with a threshold security $(h=0.7)$.}
\label{fig:wgd1}
\end{figure}
\begin{figure}[H]
\centering
\includegraphics[width=0.8\linewidth]{"../../../send from/sendddd/WDBC3/gmtr"}
\caption[WDBC train set with GMLVQ prototypes]{WDBC train set with GMLVQ prototypes and decision boundary}
\label{fig:wgm1}
\end{figure}
\begin{figure}[H]
\centering
\includegraphics[width=0.8\linewidth]{"../../../send from/sendddd/WDBC3/gmt"}
\caption[WDBC test set with GMLVQ prototypes]{WDBC test set with GMLVQ prototypes and decision boundary}
\label{fig:wgm2}
\end{figure}
%\begin{figure}[H]
%\centering
%\includegraphics[width=0.9\linewidth]{"../../../send from/sendddd/WDBC/wgmlf"}
%\caption[WDBC test set classification label security (GMLVQ)]{The data space of the WDBC test set showing the GMLVQ model predicted labels along with the computed classification label securities.}
%\label{fig:wgmd}
%\end{figure}
\begin{figure}[H]
\centering
\includegraphics[width=0.9\linewidth]{"../../../send from/sendddd/WDBC3/gmf"}
\caption[WDBC test set classification label security (GMLVQ)]{The data space of the WDBC test set showing the GMLVQ model computed classification label securities with a threshold security $(h=0.7)$.}
\label{fig:wgmld}
\end{figure}
%\begin{figure}[H]
%\centering
%\includegraphics[width=0.9\linewidth]{"../../../send from/send n2/gf"}
%\caption[WDBC test set classification label security (GLVQ)]{The data space of the WDBC test set showing the GLVQ model predicted labels along with the computed classification label securities.}
%\label{fig:gf}
%\end{figure}
\begin{figure}[H]
\centering
\includegraphics[width=0.8\linewidth]{"../../../send from/sendddd/WDBC3/ctr"}
\caption[WDBC train set with CELVQ prototypes]{WDBC train set with CELVQ prototypes and decision boundary}
\label{fig:c1}
\end{figure}
\begin{figure}[H]
\centering
\includegraphics[width=0.8\linewidth]{"../../../send from/sendddd/WDBC3/ct"}
\caption[WDBC test set with CELVQ prototypes]{WDBC test set with CELVQ prototypes and decision boundary}
\label{fig:c2}
\end{figure}
%\begin{figure}[H]
%\centering
%\includegraphics[width=1.0\linewidth]{"../../../send from/sendddd/WDBC/celf"}
%\caption[WDBC test set classification label security (CELVQ)]{The data space of the WDBC test set showing the CELVQ model predicted labels along with the computed classification label securities.}
%\label{fig:cd}
%\end{figure}
\begin{figure}[H]
\centering
\includegraphics[width=1.0\linewidth]{"../../../send from/sendddd/WDBC3/cf"}
\caption[WDBC test set classification label security (CELVQ)]{The data space of the WDBC test set showing the CELVQ model computed label securities with a threshold security $(h=0.7)$.}
\label{fig:cdl}
\end{figure}
%\begin{figure}[H]
%\centering
%\includegraphics[width=0.9\linewidth]{"../../../send from/send n2/gmf"}
%\caption[WDBC test set classification label security (GMLVQ)]{The data space of the WDBC test set showing the GMLVQ model predicted labels along with the computed classification label securities.}
%\label{fig:gmf}
%\end{figure}
%\begin{figure}[H]
%\centering
%\includegraphics[width=0.9\linewidth]{"../../../send from/send n2/cf"}
%\caption[WDBC test set classification label security (CELVQ)]{The data space of the WDBC test set showing the CELVQ model predicted labels along with the computed classification label securities.}
%\label{fig:cf}
%\end{figure}
Similarly, from the results in Tables \ref{tab:WDBC certainty} and \ref{tab:WDBC certainty_}, we observe for all three models (GLVQ, GMLVQ and CELVQ) the model classification certainty with and without rejected classifications for the WDBC test set. By referring to Figures \ref{fig:wgd1}, \ref{fig:wgmld} and \ref{fig:cdl}, we can verify whether this recorded certainty agrees with the levels of confidence in the regions of the WDBC-Malignant and WDBC-Benign class labels. For an arbitrarily chosen label security threshold of\hspace{2pt} $0.7$, we observe improvements in the model classification certainty for both classes of the WDBC test set, and likewise in the average model classification certainty. The adjusted test accuracy, which excludes rejected classifications, shows improvements over the plain test accuracy, giving insight into how accurate the models were when they were confident.
By this implementation, we have determined the extent to which the predicted labels of the WDBC test set can be trusted.
\chapter{Conclusion and Prospective Work}
This thesis investigated the determination of the classification label security using fuzzy probabilistic assignments in the manner of FCM estimates. Chapter 4 demonstrated, by implementation, the computation of the classification label security for the GLVQ, GMLVQ and CELVQ models: for a given test set, the classification label security of every predicted label is computed. The implementation was accompanied by visualizations displaying the regions in the data space for which the considered models were confident or unconfident regarding the classification labels of the data sets used in the experiments. We also determined how accurate the models were when they made classifications with confidence. The classification label security in this regard has been determined.
Concerning future work, a possibilistic approach to the determination of the classification label security will be considered.
\Anhang
\chapter{Reference Implementation in Python}
\begin{lstlisting}[caption=label\textunderscore security1.py ,style=chstyle, language=Python]
"""Module to Determine classification Label Security/Certainty"""
import numpy as np
from scipy.spatial import distance
class LabelSecurity:
    """
    Label Security

    :params
    x_test: array, shape=[num_data, num_features]
        Where num_data is the number of samples and num_features
        refers to the number of features.
    class_labels: array-like, shape=[num_classes]
        Class labels of prototypes
    predict_results: array-like, shape=[num_data]
        Predicted labels of the test-set
    model_prototypes: array-like, shape=[num_prototypes, num_features]
        Prototypes from the trained model using train-set, where
        num_prototypes refers to the number of prototypes
    x_dat: array, shape=[num_data, num_features]
        Input data
    fuzziness_parameter: int, optional (default=2)
    """

    def __init__(self, x_test, class_labels, predict_results,
                 model_prototypes, x_dat, fuzziness_parameter=2):
        self.x_test = x_test
        self.class_labels = class_labels
        self.predict_results = predict_results
        self.model_prototypes = model_prototypes
        self.x_dat = x_dat
        self.fuzziness_parameter = fuzziness_parameter

    def label_sec_f(self, x):
        """
        Computes the label security of each prediction from
        the model using the test set

        :param x:
            predicted labels from the model using the test-set
        :return:
            labels with their security
        """