Extended documentation can be found on the website: https://majkamichal.github.io/naivebayes/

Naïve Bayes

1. Overview

The naivebayes package provides an efficient implementation of the popular Naïve Bayes classifier in R. It is developed and maintained according to three principles: it should be efficient, user friendly, and written in Base R. The last principle implies no dependencies, yet it does not conflict with efficiency, because many functions in the Base R distribution rely on highly efficient routines written in lower-level languages such as C or FORTRAN. In fact, the naivebayes package uses only such functions for resource-intensive calculations.

The general function naive_bayes() detects the class of each feature in the dataset and, depending on the user's choices, assumes a possibly different distribution for each feature. It currently supports the following class conditional distributions:

  • categorical distribution for discrete features
  • Poisson distribution for non-negative integers
  • Gaussian distribution for continuous features
  • non-parametrically estimated densities via Kernel Density Estimation for continuous features
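
As a rough illustration of how class conditional distributions enter the classifier, the following base R sketch computes a posterior by hand for one categorical and one Gaussian feature. The data and all names (toy, x_new, p_colour) are made up for this example, and the code is not taken from the package internals.

```r
# Toy data: one categorical and one continuous feature (hypothetical values)
toy <- data.frame(
  class  = factor(c("a", "a", "a", "b", "b", "b")),
  colour = factor(c("red", "red", "blue", "blue", "blue", "red")),
  size   = c(1.0, 1.2, 0.8, 2.0, 2.3, 1.9)
)
classes <- levels(toy$class)

# Prior probabilities P(class)
prior <- as.numeric(prop.table(table(toy$class)))

# Categorical feature: P(colour | class), one column per class
p_colour <- prop.table(table(toy$colour, toy$class), margin = 2)

# Gaussian feature: class-wise mean and standard deviation
mu    <- as.numeric(tapply(toy$size, toy$class, mean))
sigma <- as.numeric(tapply(toy$size, toy$class, sd))

# New observation to classify
x_new <- list(colour = "red", size = 1.1)

# Naive Bayes: prior times the product of class conditional likelihoods
lik_colour <- as.numeric(p_colour[x_new$colour, classes])
lik_size   <- dnorm(x_new$size, mean = mu, sd = sigma)
joint      <- prior * lik_colour * lik_size

# Normalise to obtain posterior probabilities
posterior <- setNames(joint / sum(joint), classes)
posterior
```

Each feature contributes one factor to the product, so swapping the Gaussian density for a Poisson or kernel density estimate only changes how that single factor is computed.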

In addition, specialized functions are available that implement:

  • Bernoulli Naive Bayes via bernoulli_naive_bayes()
  • Multinomial Naive Bayes via multinomial_naive_bayes()
  • Poisson Naive Bayes via poisson_naive_bayes()
  • Gaussian Naive Bayes via gaussian_naive_bayes()
  • Non-Parametric Naive Bayes via nonparametric_naive_bayes()

They are implemented with linear algebra operations, which makes them efficient on dense matrices. They can also take advantage of sparse matrices to boost performance further. A few helper functions are also provided to improve the user experience. The general naive_bayes() function is also available through the excellent Caret package.
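
To give an idea of what "implemented with linear algebra operations" can mean, here is a minimal base R sketch of a Bernoulli Naive Bayes in which the log-likelihoods of all observations and classes come from two matrix products. It is an illustration under assumptions (simulated data, add-one smoothing, made-up variable names), not the code behind bernoulli_naive_bayes().

```r
set.seed(1)
# Hypothetical binary feature matrix (rows = observations) and class labels
X <- matrix(rbinom(200 * 4, size = 1, prob = 0.4), nrow = 200,
            dimnames = list(NULL, paste0("V", 1:4)))
y <- factor(sample(c("classA", "classB"), 200, replace = TRUE))

# P(feature = 1 | class) with add-one smoothing (an assumption of this sketch)
counts <- rowsum(X, y)                  # classes x features: number of ones
n_k    <- as.numeric(table(y))          # number of observations per class
p      <- t((counts + 1) / (n_k + 2))   # features x classes

# Log-likelihood of every observation under every class via matrix products
loglik <- X %*% log(p) + (1 - X) %*% log(1 - p)

# Add log priors, then normalise row-wise on the log scale for stability
logpost <- sweep(loglik, 2, log(n_k / sum(n_k)), "+")
post    <- exp(logpost - apply(logpost, 1, max))
post    <- post / rowSums(post)
head(post)
```

Because the work is done by `%*%`, the same formulation carries over to sparse matrix classes that provide their own matrix product.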

2. Installation

Like many other R packages, naivebayes can be installed from the CRAN repository by executing the following line in the console:

install.packages("naivebayes")

# Or the development version from GitHub:
devtools::install_github("majkamichal/naivebayes")

3. Usage

The naivebayes package provides a user friendly implementation of the Naïve Bayes algorithm via a formula interface and via the classical combination of a matrix/data.frame containing the features and a vector with the class labels. All functions recognize missing values, give an informative warning and, more importantly, know how to handle them. The following sections demonstrate the basic usage of the main function naive_bayes(). Examples with the specialized Naive Bayes classifiers can be found in the extended documentation: https://majkamichal.github.io/naivebayes/

3.1 Example data

library(naivebayes)

# Simulate example data
n <- 100
set.seed(1)
data <- data.frame(class = sample(c("classA", "classB"), n, TRUE),
                   bern = sample(LETTERS[1:2], n, TRUE),
                   cat  = sample(letters[1:3], n, TRUE),
                   logical = sample(c(TRUE,FALSE), n, TRUE),
                   norm = rnorm(n),
                   count = rpois(n, lambda = c(5,15)))
train <- data[1:95, ]
test <- data[96:100, -1]

3.2 Formula interface

nb <- naive_bayes(class ~ ., train)
summary(nb)
#> 
#> ================================ Naive Bayes ================================= 
#>  
#> - Call: naive_bayes.formula(formula = class ~ ., data = train) 
#> - Laplace: 0 
#> - Classes: 2 
#> - Samples: 95 
#> - Features: 5 
#> - Conditional distributions: 
#>     - Bernoulli: 2
#>     - Categorical: 1
#>     - Gaussian: 2
#> - Prior probabilities: 
#>     - classA: 0.5263
#>     - classB: 0.4737
#> 
#> ------------------------------------------------------------------------------

# Classification
predict(nb, test, type = "class")
#> [1] classB classA classA classA classA
#> Levels: classA classB
nb %class% test
#> [1] classB classA classA classA classA
#> Levels: classA classB

# Posterior probabilities
predict(nb, test, type = "prob")
#>         classA    classB
#> [1,] 0.4998488 0.5001512
#> [2,] 0.5934597 0.4065403
#> [3,] 0.6492845 0.3507155
#> [4,] 0.5813621 0.4186379
#> [5,] 0.5087005 0.4912995
nb %prob% test
#>         classA    classB
#> [1,] 0.4998488 0.5001512
#> [2,] 0.5934597 0.4065403
#> [3,] 0.6492845 0.3507155
#> [4,] 0.5813621 0.4186379
#> [5,] 0.5087005 0.4912995

# Helper functions
tables(nb, 1)
#> 
#> ------------------------------------------------------------------------------ 
#>  ::: bern (Bernoulli) 
#> ------------------------------------------------------------------------------ 
#>     
#> bern    classA    classB
#>    A 0.4400000 0.4888889
#>    B 0.5600000 0.5111111
#> 
#> ------------------------------------------------------------------------------
get_cond_dist(nb)
#>          bern           cat       logical          norm         count 
#>   "Bernoulli" "Categorical"   "Bernoulli"    "Gaussian"    "Gaussian"

# Note: all "numeric" (integer, double) variables are modelled
#       with a Gaussian distribution by default.
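
For intuition about what the conditional distribution tables contain when no Laplace smoothing is applied, the base R sketch below computes class-wise relative frequencies for a two-level feature and class-wise mean and standard deviation for a numeric one. The data frame d and the object names are hypothetical, and the numbers are not meant to match the output above.

```r
set.seed(1)
n <- 100
# Hypothetical data: class label, a two-level feature and an integer count
d <- data.frame(class = sample(c("classA", "classB"), n, TRUE),
                bern  = sample(LETTERS[1:2], n, TRUE),
                count = rpois(n, lambda = c(5, 15)))

# Bernoulli/categorical feature: relative frequencies within each class
cond_tab <- prop.table(table(bern = d$bern, class = d$class), margin = 2)
cond_tab

# Numeric feature treated as Gaussian: class-wise mean and standard deviation
gauss_pars <- sapply(split(d$count, d$class),
                     function(v) c(mean = mean(v), sd = sd(v)))
gauss_pars

# Integer counts are numeric as well, hence the Gaussian default
is.numeric(d$count)
```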

3.3 Matrix/data.frame and class vector

X <- train[-1]
class <- train$class
nb2 <- naive_bayes(x = X, y = class)
nb2 %prob% test
#>         classA    classB
#> [1,] 0.4998488 0.5001512
#> [2,] 0.5934597 0.4065403
#> [3,] 0.6492845 0.3507155
#> [4,] 0.5813621 0.4186379
#> [5,] 0.5087005 0.4912995

3.4 Non-parametric estimation for continuous features

Kernel density estimation can be used to estimate class conditional densities of continuous features. It has to be requested explicitly via the parameter usekernel=TRUE; otherwise a Gaussian distribution is assumed. The estimation is performed with the built-in R function density(). By default, a Gaussian smoothing kernel and Silverman’s rule of thumb as bandwidth selector are used:

nb_kde <- naive_bayes(class ~ ., train, usekernel = TRUE)
summary(nb_kde)
#> 
#> ================================ Naive Bayes ================================= 
#>  
#> - Call: naive_bayes.formula(formula = class ~ ., data = train, usekernel = TRUE) 
#> - Laplace: 0 
#> - Classes: 2 
#> - Samples: 95 
#> - Features: 5 
#> - Conditional distributions: 
#>     - Bernoulli: 2
#>     - Categorical: 1
#>     - KDE: 2
#> - Prior probabilities: 
#>     - classA: 0.5263
#>     - classB: 0.4737
#> 
#> ------------------------------------------------------------------------------
get_cond_dist(nb_kde)
#>          bern           cat       logical          norm         count 
#>   "Bernoulli" "Categorical"   "Bernoulli"         "KDE"         "KDE"
nb_kde %prob% test
#>         classA    classB
#> [1,] 0.6252811 0.3747189
#> [2,] 0.5441986 0.4558014
#> [3,] 0.6515139 0.3484861
#> [4,] 0.6661044 0.3338956
#> [5,] 0.6736159 0.3263841

# Class conditional densities
plot(nb_kde, "norm", arg.num = list(legend.cex = 0.9), prob = "conditional")

# Marginal densities
plot(nb_kde, "norm", arg.num = list(legend.cex = 0.9), prob = "marginal")
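
The idea of a kernel-based class conditional density can be sketched in base R as follows: estimate one density per class with density() and evaluate it at new points. Evaluating by linear interpolation with approx() is an assumption of this sketch and not necessarily what the package does internally. The data and names are made up.

```r
set.seed(1)
# Hypothetical continuous feature observed in two classes
x_a <- rnorm(50, mean = 0)
x_b <- rnorm(50, mean = 2)

# One kernel density estimate per class
# (density() defaults: Gaussian kernel, bandwidth selector "nrd0")
dens_a <- density(x_a)
dens_b <- density(x_b)

# Evaluate both estimated densities at new points by linear interpolation
x_new <- c(0.5, 1, 1.5)
lik_a <- approx(dens_a$x, dens_a$y, xout = x_new)$y
lik_b <- approx(dens_b$x, dens_b$y, xout = x_new)$y

# With equal priors the posterior is the normalised likelihood
post_a <- lik_a / (lik_a + lik_b)
round(cbind(x_new, post_a, post_b = 1 - post_a), 3)
```

Note that approx() returns NA for points outside the grid on which density() was evaluated, so a real implementation needs a rule for such values.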

3.4.1 Changing kernel

In general, there are 7 different smoothing kernels available:

  • gaussian
  • epanechnikov
  • rectangular
  • triangular
  • biweight
  • cosine
  • optcosine

They can be specified in naive_bayes() via the additional parameter kernel. The Gaussian kernel is the default smoothing kernel. Please see density() and bw.nrd() for further details.

# Change Gaussian kernel to biweight kernel
nb_kde_biweight <- naive_bayes(class ~ ., train, usekernel = TRUE,
                               kernel = "biweight")
nb_kde_biweight %prob% test
#>         classA    classB
#> [1,] 0.6237152 0.3762848
#> [2,] 0.5588270 0.4411730
#> [3,] 0.6594737 0.3405263
#> [4,] 0.6650295 0.3349705
#> [5,] 0.6631951 0.3368049
plot(nb_kde_biweight, "norm", arg.num = list(legend.cex = 0.9), prob = "conditional")
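
To see what changing the kernel does at the level of density(), this base R sketch fits all seven kernels to the same simulated sample. The sample x and the object names are hypothetical.

```r
set.seed(1)
x <- rnorm(100)  # hypothetical continuous feature

kernels <- c("gaussian", "epanechnikov", "rectangular", "triangular",
             "biweight", "cosine", "optcosine")

# Estimate the density with each kernel on a common grid
fits <- lapply(kernels, function(k) density(x, kernel = k, from = -4, to = 4))
names(fits) <- kernels

# The default bandwidth selector does not depend on the kernel,
# so all fits share the same bandwidth
sapply(fits, function(f) f$bw)

# The kernels differ in shape, which shows up e.g. in the peak height
sapply(fits, function(f) max(f$y))
```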

3.4.2 Changing bandwidth selector

The density() function offers 5 different bandwidth selectors, which can be specified via the bw parameter:

  • nrd0 (Silverman’s rule-of-thumb)
  • nrd (variation of the rule-of-thumb)
  • ucv (unbiased cross-validation)
  • bcv (biased cross-validation)
  • SJ (Sheather & Jones method)

nb_kde_SJ <- naive_bayes(class ~ ., train, usekernel = TRUE,
                         bw = "SJ")
nb_kde_SJ %prob% test
#>         classA    classB
#> [1,] 0.7279209 0.2720791
#> [2,] 0.4858273 0.5141727
#> [3,] 0.7004134 0.2995866
#> [4,] 0.7005704 0.2994296
#> [5,] 0.7089626 0.2910374
plot(nb_kde_SJ, "norm", arg.num = list(legend.cex = 0.9), prob = "conditional")
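
Each selector corresponds to a function in the stats package, so the bandwidths can be inspected directly. The sketch below compares them on a simulated sample (x is hypothetical); bw.ucv() and bw.bcv() may emit a warning when the minimum lies at the edge of the search range.

```r
set.seed(1)
x <- rnorm(100)  # hypothetical continuous feature

# Bandwidth chosen by each of the five selectors
bws <- c(nrd0 = bw.nrd0(x),
         nrd  = bw.nrd(x),
         ucv  = bw.ucv(x),
         bcv  = bw.bcv(x),
         SJ   = bw.SJ(x))
round(bws, 4)

# density() calls the same selectors when bw is given as a string
all.equal(density(x, bw = "SJ")$bw, bws[["SJ"]])
all.equal(density(x)$bw, bws[["nrd0"]])
```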

3.4.3 Adjusting bandwidth

The adjust parameter rescales the estimated bandwidth and thus adds flexibility to the estimation process. The default value of 1 means no rescaling. For values below 1 the density becomes “wigglier”, and for values above 1 it tends to be “smoother”:

nb_kde_adjust <- naive_bayes(class ~ ., train, usekernel = TRUE,
                             adjust = 0.5)
nb_kde_adjust %prob% test
#>         classA    classB
#> [1,] 0.6636171 0.3363829
#> [2,] 0.4784302 0.5215698
#> [3,] 0.6442293 0.3557707
#> [4,] 0.6745416 0.3254584
#> [5,] 0.7533994 0.2466006
plot(nb_kde_adjust, "norm", arg.num = list(legend.cex = 0.9), prob = "conditional")
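
The effect of adjust can be checked on density() itself, where the bandwidth actually used is the selected bandwidth multiplied by adjust. A small base R sketch with a simulated sample (x is hypothetical):

```r
set.seed(1)
x <- rnorm(100)  # hypothetical continuous feature

d_default <- density(x)                # adjust = 1, no rescaling
d_wiggly  <- density(x, adjust = 0.5)  # half the bandwidth
d_smooth  <- density(x, adjust = 2)    # twice the bandwidth

# Bandwidths actually used by the three fits
c(default = d_default$bw, half = d_wiggly$bw, double = d_smooth$bw)

# The rescaling is a plain multiplication
all.equal(d_wiggly$bw, 0.5 * d_default$bw)
all.equal(d_smooth$bw, 2 * d_default$bw)
```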

3.5 Model non-negative integers with Poisson distribution

Class conditional distributions of non-negative integer predictors can be modelled with a Poisson distribution. This is achieved by setting usepoisson=TRUE in the naive_bayes() function and by making sure that the variables representing counts in the dataset are of class integer.

is.integer(train$count)
#> [1] TRUE
nb_pois <- naive_bayes(class ~ ., train, usepoisson = TRUE)
summary(nb_pois)
#> 
#> ================================ Naive Bayes ================================= 
#>  
#> - Call: naive_bayes.formula(formula = class ~ ., data = train, usepoisson = TRUE) 
#> - Laplace: 0 
#> - Classes: 2 
#> - Samples: 95 
#> - Features: 5 
#> - Conditional distributions: 
#>     - Bernoulli: 2
#>     - Categorical: 1
#>     - Poisson: 1
#>     - Gaussian: 1
#> - Prior probabilities: 
#>     - classA: 0.5263
#>     - classB: 0.4737
#> 
#> ------------------------------------------------------------------------------
get_cond_dist(nb_pois)
#>          bern           cat       logical          norm         count 
#>   "Bernoulli" "Categorical"   "Bernoulli"    "Gaussian"     "Poisson"

nb_pois %prob% test
#>         classA    classB
#> [1,] 0.4815380 0.5184620
#> [2,] 0.4192209 0.5807791
#> [3,] 0.6882270 0.3117730
#> [4,] 0.4794415 0.5205585
#> [5,] 0.5209152 0.4790848

# Class conditional distributions
plot(nb_pois, "count", prob = "conditional")

# Marginal distributions
plot(nb_pois, "count", prob = "marginal")
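
As a minimal base R sketch of the Poisson idea, the class conditional rate can be estimated by the class-wise mean and plugged into dpois(). The simulated counts and all names are hypothetical, equal priors are assumed, and this is not the package's internal code.

```r
set.seed(1)
# Hypothetical counts for two classes; rpois() returns integers
counts_a <- rpois(50, lambda = 5)
counts_b <- rpois(50, lambda = 15)
is.integer(counts_a)

# The maximum likelihood estimate of the Poisson rate is the sample mean
lambda_hat <- c(classA = mean(counts_a), classB = mean(counts_b))

# Class conditional likelihoods of new counts (rows = observations)
x_new <- c(4L, 10L, 16L)
lik   <- sapply(lambda_hat, function(l) dpois(x_new, lambda = l))

# With equal priors the posterior is the row-wise normalised likelihood
post <- lik / rowSums(lik)
round(cbind(x_new, post), 3)
```

Counts stored as doubles can be converted with as.integer() beforehand so that they are recognised as integers.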
