In [1]:
# Uncomment to install KernelMethods (with the KMS module) and RDatasets.
#using Pkg
#pkg"add https://github.com/kyriox/KernelMethods.jl"
#pkg"add RDatasets"

In [1]:
using KernelMethods: KMS, predict, recall, LabelEncoder
using RDatasets
using Random
using Statistics



# Kernel Model Selection for Classification tasks

Our method, Kernel Model Selection (KMS), integrates algorithms inspired by prototype selection and generation, kernel functions, and the k-Nearest Neighbors and Naive Bayes classifiers. This integration results in the KMS classification pipeline. Model selection is performed by random search over the space formed by the different algorithms, and the performance of each configuration is estimated with k-fold cross-validation. Furthermore, the computational cost of the random search is put to use by building an ensemble; this ensemble outperforms the base classifiers. Table 1 lists the default parameters for the random search process:

**Table 1**

------

|        **Name**        |                **Value**                 |
|------------------------|------------------------------------------|
|Number of prototypes $k$ | $$\{4,8,16,32,64\}$$ |
|Distance function       | $$\{\mathit{Angle}, \mathit{Euclidean}\}$$ |
|Sampling method         | $$\{\mathit{Density}, \mathit{FFT}^*, \mathit{K\text{-}Means}, \mathit{Random}\}$$ |
|Kernel function         | $$\{\mathit{Linear}, \mathit{Gaussian}, \mathit{Sigmoid}, \mathit{Cauchy}\}$$ |
|Reference type          | $$\{\mathit{Centers}, \mathit{Centroids}\}$$ |
|Internal classifiers    | $$\{\mathit{Naive\ Bayes}, k\mathit{NN}\}$$ |
|$k$NN weighting scheme  | $$\{\mathit{Distance}, \mathit{Uniform}\}$$ |
|$k$NN distance function | $$\{\mathit{Cosine}, \mathit{Euclidean}\}$$ |
|Number of neighbors     | $$\{1,5,11,21\}$$ |
|Sample size             | $$128$$ |
|Number of folds         | $$3$$ |

------
\*Farthest First Traversal
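
To make the size of this search space concrete, the following is a minimal sketch of how a random search could draw 128 configurations from the options in Table 1. It does not use KernelMethods; the field names, the seed, and the way the options are combined are assumptions made for illustration. In particular, it assumes the $k$NN options only apply when $k$NN is the internal classifier.

```julia
using Random

# Options copied from Table 1; the field names are illustrative, not KernelMethods identifiers.
space = (
    k        = [4, 8, 16, 32, 64],
    distance = [:angle, :euclidean],
    sampling = [:density, :fft, :kmeans, :random],
    kernel   = [:linear, :gaussian, :sigmoid, :cauchy],
    reftype  = [:centers, :centroids],
)

# Assumption: the kNN options only matter when kNN is the internal classifier,
# so Naive Bayes contributes one choice and kNN contributes 2 * 2 * 4 = 16.
knn_options = [(classifier = :knn, weighting = w, distance = d, neighbors = m)
               for w in [:distance, :uniform],
                   d in [:cosine, :euclidean],
                   m in [1, 5, 11, 21]]
classifiers = Any[(classifier = :naive_bayes,)]
append!(classifiers, vec(knn_options))

# Full grid: every combination of the options above.
grid = vec(collect(Iterators.product(values(space)..., classifiers)))

# Random search: keep only `sample_size` configurations, drawn without replacement.
rng = MersenneTwister(1)
sample_size = 128
chosen = grid[randperm(rng, length(grid))[1:min(sample_size, length(grid))]]

println("grid size: ", length(grid), ", sampled: ", length(chosen))
```

Under these assumptions the grid holds 5 × 2 × 4 × 4 × 2 × 17 = 5440 configurations, so a sample of 128 covers only a small fraction of it.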

# Usage Example

As an example, we run KMS 30 times and report the mean and standard deviation of the recall for the top KMS classifier and for an ensemble of size $t=15$.
Please note that train and test splits are generated randomly and results may vary. 
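
If repeatable splits are needed, the random number generator can be seeded. The sketch below uses only the `Random` standard library; `split_indices` is a hypothetical helper (not part of KernelMethods) and the seed value is arbitrary.

```julia
using Random

# Hypothetical helper: shuffle 1:n and cut it into train/validation index sets.
function split_indices(rng::AbstractRNG, n::Integer; train_fraction = 0.7)
    perm = randperm(rng, n)
    ntrain = round(Int, train_fraction * n)
    return perm[1:ntrain], perm[ntrain+1:end]
end

# Two generators created with the same seed produce the same split.
it, iv = split_indices(MersenneTwister(42), 150)
it2, iv2 = split_indices(MersenneTwister(42), 150)
println(length(it), " training and ", length(iv), " validation indices")
```

Passing an explicit generator, rather than relying on the global one, keeps the split independent of any other random calls made in the session.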

In [2]:
## Load the Iris dataset
iris = dataset("datasets", "iris"); 
lencoder=LabelEncoder(iris.Species);
#labels must be an array of integer values.
labels=[lencoder.imap[x] for x in iris.Species]; 
#data must be an array of 1D arrays.
data=[collect(x) for x  in zip(iris.SepalLength,iris.SepalWidth,iris.PetalLength,iris.PetalWidth)];
n=length(data);

In [None]:
# This may take a couple of minutes
kms_results,kmse15_results=Float64[],Float64[] # score values for each run
ni=30
for i in 1:ni
    ind,idx=randperm(n),trunc(Int, 0.7*n) # 70-30 train/validation split
    it,iv=ind[1:idx], ind[idx+1:end]
    Xt,yt=data[it],labels[it]
    Xv,yv=data[iv],labels[iv]
    kn=KMS(Xt,yt) # instantiate and train KMS
    yp=predict(kn,Xv) # predict validation labels using the top KMS from the training phase
    yp15=predict(kn,Xv, ensemble_k=15) # predict validation labels using the top 15 KMS
    # record the recall score (recall is the default score)
    push!(kms_results, recall(yv,yp))
    push!(kmse15_results, recall(yv,yp15))
    if i%5==0 # report progress every 5 iterations
        @show "iteration $i of $ni"
    end
end

"iteration $(i) of $(ni)" = "iteration 5 of 30"
"iteration $(i) of $(ni)" = "iteration 10 of 30"
"iteration $(i) of $(ni)" = "iteration 15 of 30"
"iteration $(i) of $(ni)" = "iteration 20 of 30"


In [None]:
# Mean and standard deviation for the top classifier
@show mean(kms_results), std(kms_results);

In [None]:
# Mean and standard deviation for an ensemble of size 15
@show mean(kmse15_results), std(kmse15_results);

**Note**: Even though results may vary, the ensemble consistently outperforms the top classifier, exhibiting a higher mean recall.
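
This notebook does not state how the ensemble combines its members' predictions. One common choice is majority voting, sketched below purely as an illustration. `majority` and `combine` are hypothetical helpers, and this is not necessarily what `predict(kn, Xv, ensemble_k=15)` does internally.

```julia
# Hypothetical helper: most frequent label in a vector; ties go to the smallest label.
function majority(votes::AbstractVector{<:Integer})
    counts = Dict{Int,Int}()
    for v in votes
        counts[v] = get(counts, v, 0) + 1
    end
    best = maximum(values(counts))
    return minimum(label for (label, c) in counts if c == best)
end

# predictions[j] holds the labels that classifier j assigns to every validation sample.
combine(predictions) = [majority([p[i] for p in predictions]) for i in eachindex(first(predictions))]

# Three classifiers voting on four samples.
predictions = [[1, 2, 3, 1], [1, 2, 2, 1], [2, 2, 3, 3]]
println(combine(predictions))
```

A vote of this kind can correct an individual classifier's mistake whenever most of the other members get that sample right.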

## KMS function parameters

The parameters for $\mathit{KMS}$ are essentially those described in Table 1, with a few differences: there are some additional parameters, functions are passed as Symbols, and some names differ.

Function $\mathit{KMS}(X,Y;\mathit{op\_function},\mathit{top\_k},\mathit{folds},\mathit{udata},\mathit{nets},\mathit{K},\mathit{distances},\mathit{distancesk},\mathit{sample\_size},\mathit{kernels},\mathit{debug})$

- Positional arguments:
    - $X$ must be an array of 1D arrays containing the training samples.
    - $Y$ must be a 1D array of size $|X|$, where each element $y_i \in Y$ is the label of the corresponding $x_i \in X$.
- Keyword arguments:
    - $\mathit{kernels}$ is a list of kernel functions; each element must be a symbol that can be evaluated as a function. The default value is \[*:gaussian*, *:linear*, *:cauchy*, *:sigmoid*\]. Any custom function may be used.
    - $K$ is a list of integers with the candidate values for the number of references. By default $K=[4,8,16,32,64]$.
    - $\mathit{op\_function}$ is the fitness function; it must be a symbol that can be evaluated as a function. By default it is set to *:recall*, but it can be any of the functions defined in scores, or a user-defined one.
    - $\mathit{top\_k}$ is the number of top classifiers to keep during the training phase. By default it is set to 15.
    - $\mathit{udata}$ allows unlabeled data to be included in the sampling process. It must be an array of 1D arrays, each of the same length as those in $X$. The default value is [].
    - $\mathit{nets}$ is the list of sampling algorithms used to select the references. The list can include any combination of the symbols *{:fft\_sampling, :kmeans\_sampling, :density\_sampling, :random\_sampling}*. By default a list with all four symbols is used.
    - $\mathit{distancesk}$ is the list of distance functions to be used by the sampling methods. The default value is \[*:angle*, *:squared\_l2\_distance*\]; note that any pairwise distance function may be used.
    - $\mathit{distances}$ is the list of distances that may be used when the $k$NN classifier is selected. The default value is $\mathit{distances}=$ \[*:cosine*, *:euclidean*\].
    - $\mathit{sample\_size}$ is the number of configurations to be evaluated by the random search process. The default value is 128. If the grid size is smaller than $\mathit{sample\_size}$, or if $\mathit{sample\_size}=-1$, the whole grid is evaluated.
    - $\mathit{folds}$ is the number of cross-validation folds used during the training phase. The default value is 3.
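
As an illustration of a user-defined fitness function, the sketch below defines a plain accuracy score and, for comparison, a macro-averaged recall. Both are hypothetical examples: the `(y, yp)` argument order follows the `recall(yv, yp)` calls in this notebook, the macro-averaged definition is one common choice and not necessarily the one KernelMethods implements, and the commented-out `KMS` call assumes the symbol is resolved to the function defined here.

```julia
# Hypothetical user-defined score: fraction of correctly predicted labels.
accuracy(y, yp) = count(y .== yp) / length(y)

# Macro-averaged recall: the recall of each class, averaged over the classes in y.
function macro_recall(y, yp)
    classes = unique(y)
    per_class = [count((y .== c) .& (yp .== c)) / count(y .== c) for c in classes]
    return sum(per_class) / length(per_class)
end

y  = [1, 1, 1, 1, 2, 2]   # true labels
yp = [1, 1, 1, 1, 2, 1]   # predicted labels
println("accuracy = ", accuracy(y, yp), ", macro recall = ", macro_recall(y, yp))

# Assumption: a function defined like this could then be passed by name, e.g.
# kn = KMS(Xt, yt; op_function = :accuracy)
```

On this imbalanced toy example the two scores differ (5/6 against 3/4), which is why the choice of fitness function can change which configurations the search ranks highest.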
   



In [6]:
# Using only two sampling strategies FFT and KMeans
kna=KMS(data,labels; op_function=:recall,top_k=15,folds=3,udata=[], 
    nets=[:fft_sampling,:kmeans_sampling], # only these two sampling methods
    K=[4,8,16,32,64],distances=[:angle,:squared_l2_distance],
    distancesk=[:angle,:squared_l2_distance],sample_size=128,
    kernels=[:gaussian,:linear,:cauchy,:sigmoid]);

In [7]:
# Recall on the training data itself, so this is not a held-out estimate
recall(labels,predict(kna,data))

1.0
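
Since the score above is computed on the same samples used for training, a held-out estimate is usually more informative. The following is a minimal sketch of a 3-fold index partition using only the `Random` standard library; `kfold_indices` is a hypothetical helper, not a KernelMethods function, and the seed is arbitrary.

```julia
using Random

# Hypothetical helper: partition a shuffled 1:n into `folds` validation blocks of near-equal size.
function kfold_indices(rng::AbstractRNG, n::Integer, folds::Integer)
    perm = randperm(rng, n)
    return [perm[f:folds:n] for f in 1:folds]
end

n = 150
blocks = kfold_indices(MersenneTwister(7), n, 3)
for (f, val) in enumerate(blocks)
    train = setdiff(1:n, val)   # everything outside the block is used for training
    println("fold $f: ", length(train), " training / ", length(val), " validation samples")
    # With KernelMethods loaded, one could fit on data[train] and score on data[val] here.
end
```

Averaging the per-fold scores then gives an estimate that does not reuse training samples for evaluation.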