GitHub - msoczi/categorical_naive_bayes: Implementation of Naive Bayes algorithm for categorical data

Categorical Naive Bayes

Table of Contents:

About
Mathematical model
Example
Comparison

About

The script contains an implementation of the Naive Bayes algorithm for categorical variables; the variables do not need to be encoded beforehand. The algorithm can be used for both binary and multiclass classification.

Mathematical model

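Let's assume that we have $n$ independent variables $x_1, \dots, x_n$ and a target $y$. Bayes' theorem states the following relationship

$$P(y \mid x_1, \dots, x_n) = \frac{P(y)\, P(x_1, \dots, x_n \mid y)}{P(x_1, \dots, x_n)}$$

In practice, only the numerator of the fraction is of interest, because the denominator does not depend on $y$, and the values of $x_i$ are given. The denominator is therefore constant. The numerator of the fraction is equivalent to the joint probability model

$$P(y, x_1, \dots, x_n)$$

Using the chain rule for conditional probability we have

$$P(y, x_1, \dots, x_n) = P(x_1 \mid x_2, \dots, x_n, y)\, P(x_2 \mid x_3, \dots, x_n, y) \cdots P(x_n \mid y)\, P(y)$$

Now, using the assumption of conditional independence of the predictors given the class, $P(x_i \mid x_{i+1}, \dots, x_n, y) = P(x_i \mid y)$, we can present the model as follows

$$P(y \mid x_1, \dots, x_n) = \frac{P(y) \prod_{i=1}^{n} P(x_i \mid y)}{P(x_1, \dots, x_n)}$$

Since we want to predict the class, we only need the value of the dependent variable with the highest probability, so we can omit the constant appearing in the denominator. Then the predicted class is

$$\hat{y} = \arg\max_{y} P(y) \prod_{i=1}^{n} P(x_i \mid y)$$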
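
The decision rule described above (pick the class with the largest prior times the product of per-feature conditional probabilities) can be sketched in a few lines of base R. This is only an illustration on made-up toy data, not the code from categoricalNB.R: the helper `nb_log_scores` and its additive smoothing argument `alpha` are hypothetical names introduced here. The scores are accumulated as sums of logarithms, which selects the same class as the product but avoids numerical underflow when there are many predictors.

```r
# Toy training data (made up): two categorical predictors and a binary target.
X <- data.frame(
  color = c("red", "red", "green", "green", "red", "green"),
  shape = c("round", "flat", "round", "flat", "round", "round"),
  stringsAsFactors = FALSE
)
y <- c("edible", "edible", "poisonous", "poisonous", "edible", "poisonous")

# Hypothetical helper: unnormalized log-posterior of one observation per class.
# alpha is an additive (Laplace) smoothing constant; alpha = 0 would give
# plain relative frequencies.
nb_log_scores <- function(X, y, x_new, alpha = 1) {
  classes <- sort(unique(y))
  sapply(classes, function(cl) {
    rows <- y == cl
    # log prior: log P(y = cl)
    score <- log(mean(rows))
    for (feature in names(X)) {
      n_levels <- length(unique(X[[feature]]))
      n_match <- sum(X[[feature]][rows] == x_new[[feature]])
      # smoothed log likelihood: log P(x_feature | y = cl)
      score <- score + log((n_match + alpha) / (sum(rows) + alpha * n_levels))
    }
    score
  })
}

x_new <- list(color = "red", shape = "round")
scores <- nb_log_scores(X, y, x_new)

# The predicted class is the one with the highest score (arg max).
predicted <- names(which.max(scores))
print(scores)
print(predicted)
```

With `alpha = 1`, a category that never co-occurs with a class in the training data still gets a small non-zero probability, so a single unseen combination does not force the whole product to zero.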

Example

# Clean Environment
remove(list=ls())

# Load R script with categoricalNB
source('categoricalNB.R')

# Read the data
data <- read.csv(file.path("data", "agaricus-lepiota.csv"), header = TRUE)

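# Train-test split: hold out a random third of the rows for testing
test_indx <- sample(seq_len(nrow(data)), round(nrow(data) / 3))
y_train <- data$target[-test_indx]
y_test <- data$target[test_indx]
# Column index -1 drops the first column (assumed here to hold the target)
X_train <- data[-test_indx, -1]
X_test <- data[test_indx, -1]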

# Create Naive Bayes model
NBmodel <- naive_bayes(X_data = X_train, y_data = y_train, lambda = 0)

# Predictions on test data
preds <- predict.nb(X_data = X_test, model = NBmodel)

# Accuracy evaluation
cat('ACCURACY\n',
'TRAIN:',sum(NBmodel$preds == as.character(y_train))/length(y_train),
'\n  TEST:',sum(preds == as.character(y_test))/length(y_test)
)
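
Accuracy alone hides which classes get confused with each other. As a possible follow-up to the evaluation above, the sketch below builds a confusion matrix with base R's `table()`. The vectors `y_true` and `y_pred` are made-up stand-ins for `y_test` and `preds`, so the block runs on its own.

```r
# Made-up stand-ins for the true and predicted labels of a test set.
y_true <- c("e", "e", "p", "p", "e", "p", "e", "p")
y_pred <- c("e", "p", "p", "p", "e", "e", "e", "p")

# Use shared factor levels so the table stays square even if some class
# is never predicted.
levels_all <- sort(unique(c(y_true, y_pred)))
confusion <- table(
  predicted = factor(y_pred, levels = levels_all),
  actual    = factor(y_true, levels = levels_all)
)
print(confusion)

# Accuracy is the share of observations on the diagonal.
accuracy <- sum(diag(confusion)) / sum(confusion)

# Per-class recall: correct predictions divided by the actual class counts.
recall <- diag(confusion) / colSums(confusion)
print(accuracy)
print(recall)
```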

Comparison

I compared the results of my own implementation with popular R and Python implementations in terms of accuracy, AUC, and run time.
The comparison was run on the agaricus-lepiota dataset, split with train_test_split(X, y, test_size = 0.33, random_state = 42); the metrics are reported on the test part.

| Implementation | Accuracy | AUC | Time (sec) |
|---|---|---|---|
| categoricalNB (own) | 0.9447967 | 0.9433333 | 0.06680608 |
| CategoricalNB (sklearn) | 0.9447967 | 0.9433333 | 0.05624079 |
| naiveBayes (e1071) | 0.9447967 | 0.9433333 | 0.01396394 |

CategoricalNB from sklearn requires encoded data, so the time needed to encode the data with OrdinalEncoder was included in the measured time.
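
To make the encoding step concrete, here is a rough base R analogue of ordinal encoding, timed with `system.time()`. It is a sketch on made-up data, not the benchmark code: the helper `ordinal_encode` and the column names are hypothetical. It assigns integer codes 0..k-1 to the alphabetically sorted categories of each column, which is also how scikit-learn's OrdinalEncoder behaves with its default settings.

```r
# Made-up categorical data frame standing in for the mushroom features.
X <- data.frame(
  cap_shape = c("x", "b", "x", "f"),
  odor      = c("n", "a", "n", "p"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: map each column's categories to integer codes 0..k-1,
# with the categories sorted within each column.
ordinal_encode <- function(df) {
  as.data.frame(lapply(df, function(col) as.integer(factor(col)) - 1L))
}

# Time the encoding step so it can be added to the model's own run time.
timing <- system.time(X_encoded <- ordinal_encode(X))
print(X_encoded)
print(timing[["elapsed"]])
```

On a data set this small the elapsed time is negligible; the point is only that a fair timing comparison counts preprocessing that one implementation needs and another does not.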
