In [None]:
# Uncomment the lines below to install KernelMethods (which provides the KMS module), RDatasets and ProgressMeter.
#using Pkg
#pkg"add https://github.com/kyriox/KernelMethods.jl"
#pkg"add RDatasets"
#pkg"add ProgressMeter"

In [1]:
using KernelMethods: KMS, predict, recall, LabelEncoder
using RDatasets
using Random
using Statistics

┌ Info: Recompiling stale cache file /Users/job/.julia/compiled/v1.1/KernelMethods/lt5mb.ji for KernelMethods [d79e8f30-5872-11e9-0dab-2d1842b87615]
└ @ Base loading.jl:1184
┌ Info: Recompiling stale cache file /Users/job/.julia/compiled/v1.1/RDatasets/JyIbx.ji for RDatasets [ce6b1742-4840-55fa-b093-852dadbb1d8b]
└ @ Base loading.jl:1184


# Kernel Model Selection for Classification tasks

Our method, Kernel Model Selection (KMS), integrates algorithms inspired by prototype selection and generation, kernel functions, and the k-Nearest Neighbors and Naive Bayes classifiers. This integration results in the KMS classification pipeline. Model selection is performed with a random search over the space formed by the different algorithms, and the performance of each configuration is estimated with k-fold cross-validation. Furthermore, the computational cost of the random search is exploited to create an ensemble, which outperforms the base classifiers. The following table shows the default parameters of the random search:

------

|        **Name**        |                **Value**                 |
|------------------------|------------------------------------------|
|Number of prototypes $k$ | $$\{4,8,16,32,64\}$$ |
|Distance function       | $$\{\mathit{Angle}, \mathit{Euclidean}\}$$ |
|Sampling method         | $$\{\mathit{Density}, \mathit{FFT}^*, \mathit{K-Means}, \mathit{Random}\}$$ |
|Kernel function         | $$\{\mathit{Linear}, \mathit{Gaussian}, \mathit{Sigmoid}, \mathit{Cauchy}\}$$|
|Reference type          | $$\{\mathit{Centers}, \mathit{Centroids}\}$$|
|Internal classifiers    | $$\{\mathit{Naïve\ Bayes}, k\mathit{NN}\}$$|
|$k$NN weighting scheme  | $$\{\mathit{Distance}, \mathit{Uniform}\}$$|
|$k$NN distance function | $$\{\mathit{Cosine}, \mathit{Euclidean}\}$$|
|Number of neighbors     | $$\{1,5,11,21\}$$|
|Sample size             | $$128$$|
|Number of folds         | $$3$$ |

------
\*Farthest First Traversal
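
To make the search procedure more concrete, the following is a minimal sketch of how configurations could be drawn from the space in the table and how training indices could be split into folds. It is not the KernelMethods.jl implementation: the names `SPACE`, `sample_config` and `kfold_indices` are hypothetical, and every parameter is sampled independently and uniformly, which is an assumption.

```julia
using Random

# Search space copied from the table above. The key names are illustrative
# and are not KernelMethods.jl identifiers.
const SPACE = (
    k            = [4, 8, 16, 32, 64],
    distance     = [:angle, :euclidean],
    sampling     = [:density, :fft, :kmeans, :random],
    kernel       = [:linear, :gaussian, :sigmoid, :cauchy],
    reference    = [:centers, :centroids],
    classifier   = [:naive_bayes, :knn],
    knn_weight   = [:distance, :uniform],
    knn_distance = [:cosine, :euclidean],
    neighbors    = [1, 5, 11, 21],
)

# Draw one configuration by picking one value per parameter uniformly at random.
sample_config(rng::AbstractRNG) = map(options -> rand(rng, options), SPACE)

# Shuffle the indices 1:n and deal them into `folds` disjoint validation sets.
function kfold_indices(rng::AbstractRNG, n::Integer, folds::Integer)
    perm = randperm(rng, n)
    return [perm[f:folds:n] for f in 1:folds]
end

rng = MersenneTwister(0)
configs = [sample_config(rng) for _ in 1:128]  # sample size from the table
folds = kfold_indices(rng, 105, 3)             # 3 folds over 105 training items
```

Each element of `configs` is a named tuple such as `(k = 16, distance = :angle, ...)`. In a full search, every configuration would be scored on each fold and the scores averaged.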

# Usage Example

As an example, an experiment of 30 runs of KMS is performed, and the mean and standard deviation of the recall are reported for the top KMS classifier and for an ensemble of size $t=15$.
Please note that train and test splits are generated randomly and results may vary. 

In [29]:
# Load the Iris dataset
iris = dataset("datasets", "iris"); 
lencoder=LabelEncoder(iris.Species);
#labels must be an array of integer values.
labels=[lencoder.imap[x] for x in iris.Species]; 
#data must be an array of 1D arrays.
data=[collect(x) for x  in zip(iris.SepalLength,iris.SepalWidth,iris.PetalLength,iris.PetalWidth)];
n=length(data);
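
The cell above relies on `LabelEncoder` to turn species names into integers. As a rough illustration of that idea only, the sketch below builds such a mapping in plain Julia. The function `encode_labels` is hypothetical, and assigning codes in order of first appearance is an assumption that may differ from what `LabelEncoder` does.

```julia
# Map each distinct label to an integer code, in order of first appearance.
function encode_labels(raw)
    imap = Dict{eltype(raw),Int}()
    for x in raw
        # Insert a new code only when the label has not been seen before.
        get!(imap, x, length(imap) + 1)
    end
    return imap, [imap[x] for x in raw]
end

species = ["setosa", "versicolor", "setosa", "virginica"]
imap, codes = encode_labels(species)
```

Here `codes` is an integer vector of the same length as `species`, which is the shape the `labels` argument needs.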

In [31]:
# This may take a couple of minutes
kms_results,kmse15_results=[],[] # List to store score values for each run
for i in 1:30
    ind,idx=randperm(n),trunc(Int, 0.7*n) # 70-30 train/validation split
    it,iv=ind[1:idx], ind[idx+1:end]
    Xt,yt=data[it],labels[it]
    Xv,yv=data[iv],labels[iv]
    kn=KMS(Xt,yt) # instantiate and train KMS
    yp=predict(kn,Xv) # predict validation labels using the top KMS model from the training phase
    yp15=predict(kn,Xv, ensemble_k=15) # predict validation labels using the top 15 KMS models
    # record the recall score (recall is used as the default score)
    push!(kms_results, recall(yv,yp))
    push!(kmse15_results, recall(yv,yp15))  
end
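
For reference, a recall score for a multi-class problem can be computed as sketched below. This assumes macro-averaging (the unweighted mean of per-class recall); whether the `recall` function in KernelMethods uses exactly this averaging is not confirmed here, and `macro_recall` is a hypothetical name.

```julia
using Statistics

# Macro-averaged recall: for each class, the fraction of its true members
# that were predicted correctly, then the unweighted mean over classes.
function macro_recall(ytrue, ypred)
    classes = unique(ytrue)
    per_class = [count((ytrue .== c) .& (ypred .== c)) / count(ytrue .== c)
                 for c in classes]
    return mean(per_class)
end

score = macro_recall([1, 1, 2, 2, 3, 3], [1, 1, 2, 3, 3, 3])
```

In this toy case classes 1 and 3 are fully recovered and class 2 is half recovered, so the score is the mean of 1, 0.5 and 1.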

In [25]:
# Average and standard deviation for the top classifier
@show mean(kms_results), std(kms_results);

(mean(kms_results), std(kms_results)) = (0.9676119113657814, 0.023034053452944573)


In [26]:
# Average and standard deviation for an ensemble of size 15
@show mean(kmse15_results), std(kmse15_results);

(mean(kmse15_results), std(kmse15_results)) = (0.9747210215347469, 0.01979729965163568)


**Note**: Even though results may vary, the ensemble consistently outperforms the top classifier, exhibiting higher mean recall (0.975 vs. 0.968 in the run above) and lower standard deviation (0.020 vs. 0.023).
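
Because both scores in each run come from the same train/validation split, the two result lists can also be compared run by run. The sketch below shows one way to do that; `compare_runs` is a hypothetical helper, and the score vectors are made-up values for illustration, not results from KMS.

```julia
using Statistics

# Summarize paired per-run scores: in how many runs b beats a,
# how many runs tie, and the mean of the differences (b - a).
function compare_runs(a, b)
    @assert length(a) == length(b)
    diffs = b .- a
    return (wins = count(diffs .> 0), ties = count(diffs .== 0), mean_diff = mean(diffs))
end

top = [0.96, 0.98, 0.93, 0.97]  # made-up scores, for illustration only
ens = [0.97, 0.98, 0.95, 0.96]
result = compare_runs(top, ens)
```

Calling `compare_runs(kms_results, kmse15_results)` after the experiment loop would give the same summary for the actual runs.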