Naive Bayes Classifier
----------------------
(c) Tim Nugent

Compile by running 'make'. The build uses -std=c++11; on older compilers you may need to change this to -std=c++0x in the Makefile.

Run all tests with 'make test'

Run the train/classify tool as follows:

	./nb_classify -d 1 training.data test.data

Training and test data should be in svm-light/libsvm format, e.g.

+1 1:4.12069 2:11.3896 3:18.5742 4:2.85764 5:53.4406 
-1 1:3.14565 2:17.4338 3:19.3353 4:2.63431 5:56.4233

Sparse data is not (yet) supported. The -d flag selects the decision rule: 1 = Gaussian (default), 2 = Multinomial, 3 = Bernoulli. The -v flag turns verbose mode on; use this to see the classification results. The -a flag sets the smoothing parameter used in the multinomial decision rule (default 1.0, i.e. Laplace smoothing).
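
For reference, a line in this format can be read roughly as follows; this is a minimal sketch of libsvm-style parsing, not the parser actually used by nb_classify:

	#include <iostream>
	#include <sstream>
	#include <string>
	#include <vector>
	
	int main(){
	    std::string line = "+1 1:4.12069 2:11.3896 3:18.5742 4:2.85764 5:53.4406";
	    std::istringstream iss(line);
	    int label;
	    iss >> label;                          // class label (+1 or -1)
	    std::vector<double> features;
	    std::string token;
	    while(iss >> token){                   // tokens of the form index:value
	        auto colon = token.find(':');
	        features.push_back(std::stod(token.substr(colon + 1)));
	    }
	    std::cout << label << " -> " << features.size() << " features\n";
	    return 0;
	}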

Gaussian example
----------------

Generate Gaussian training and test data as follows:

	./generate_gaussian_data 5000 +1 5.3 2.0 12.8 3.1 18.2 1.0 1.2 3.4 56.2 2.3 > training.data
	./generate_gaussian_data 5000 -1 7.3 1.0 10.8 1.1 24.1 2.3 0.8 1.2 53.1 0.8 >> training.data
	./generate_gaussian_data 1000 +1 5.3 5.0 12.8 5.1 18.2 4.0 1.2 5.4 56.2 4.3 > test.data
	./generate_gaussian_data 1000 -1 7.3 4.0 10.8 4.1 24.1 5.3 0.8 4.2 53.1 2.8 >> test.data

Arguments are the number of examples, the class label, then pairs of means and standard deviations. Note that I bumped up the standard deviations in the test data to make classification harder.
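
For illustration, each example amounts to one draw per feature from a normal distribution with the given mean and standard deviation; a minimal sketch using C++11 <random> (not necessarily the generator's exact code):

	#include <iostream>
	#include <random>
	
	int main(){
	    std::mt19937 gen(std::random_device{}());
	    // (mean, standard deviation) pairs for the +1 training set above
	    double params[5][2] = {{5.3, 2.0}, {12.8, 3.1}, {18.2, 1.0}, {1.2, 3.4}, {56.2, 2.3}};
	    std::cout << "+1";
	    for(int j = 0; j < 5; ++j){
	        std::normal_distribution<double> d(params[j][0], params[j][1]);
	        std::cout << " " << j + 1 << ":" << d(gen);  // libsvm-style index:value
	    }
	    std::cout << "\n";
	    return 0;
	}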

Then train and classify as follows:

	./nb_classify -d 1 training.data test.data
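
With -d 1, each class is scored as a log prior plus the sum over features of Gaussian log densities, using per-class means and standard deviations estimated from the training data. A rough sketch of that scoring rule (the standard Gaussian naive Bayes formula, not necessarily the exact code in this repository):

	#include <cmath>
	#include <cstddef>
	#include <iostream>
	#include <vector>
	
	// log N(x; mu, sigma) for a single feature
	double log_gaussian(double x, double mu, double sigma){
	    const double pi = 3.14159265358979323846;
	    double z = (x - mu) / sigma;
	    return -0.5 * z * z - std::log(sigma) - 0.5 * std::log(2.0 * pi);
	}
	
	// score(c) = log P(c) + sum_j log N(x_j; mu_cj, sigma_cj); predict the highest-scoring class
	double score(const std::vector<double>& x, const std::vector<double>& mu,
	             const std::vector<double>& sigma, double log_prior){
	    double s = log_prior;
	    for(std::size_t j = 0; j < x.size(); ++j)
	        s += log_gaussian(x[j], mu[j], sigma[j]);
	    return s;
	}
	
	int main(){
	    std::vector<double> x  = {4.1, 11.4, 18.6, 2.9, 53.4};  // a made-up test example
	    std::vector<double> mu = {5.3, 12.8, 18.2, 1.2, 56.2};  // +1 means from above
	    std::vector<double> sd = {2.0, 3.1, 1.0, 3.4, 2.3};     // +1 standard deviations from above
	    std::cout << score(x, mu, sd, std::log(0.5)) << "\n";
	    return 0;
	}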

Multinomial example
-------------------

Example based on table 13.1 from here:
http://nlp.stanford.edu/IR-book/html/htmledition/naive-bayes-text-classification-1.html

	echo "+1 1:2 2:1 3:0 4:0 5:0 6:0" > training.data
	echo "+1 1:2 2:0 3:1 4:0 5:0 6:0" >> training.data
	echo "+1 1:1 2:0 3:0 4:1 5:0 6:0" >> training.data
	echo "-1 1:1 2:0 3:0 4:0 5:1 6:1" >> training.data
	echo "+1 1:3 2:0 3:0 4:0 5:1 6:1" > test.data

Then train and classify as follows:

	./nb_classify -d 2 training.data test.data
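
With -d 2, each class is scored as a log prior plus, for each term, its count in the test document times the log of a smoothed per-class term probability. A rough sketch with Laplace smoothing (alpha = 1.0, the -a default), using the counts from the training data above; this illustrates the standard multinomial rule rather than the repository's exact code:

	#include <cmath>
	#include <cstddef>
	#include <iostream>
	#include <vector>
	
	// score(c) = log P(c) + sum_j x_j * log theta_cj,
	// where theta_cj = (count_cj + alpha) / (total_c + alpha * vocabulary size)
	double score(const std::vector<double>& x, const std::vector<double>& counts,
	             double log_prior, double alpha = 1.0){
	    double total = 0.0;
	    for(double c : counts) total += c;
	    double s = log_prior;
	    for(std::size_t j = 0; j < x.size(); ++j){
	        double theta = (counts[j] + alpha) / (total + alpha * counts.size());
	        s += x[j] * std::log(theta);
	    }
	    return s;
	}
	
	int main(){
	    std::vector<double> x       = {3, 0, 0, 0, 1, 1};  // the test document above
	    std::vector<double> pos_cnt = {5, 1, 1, 1, 0, 0};  // per-term counts in the +1 class
	    std::vector<double> neg_cnt = {1, 0, 0, 0, 1, 1};  // per-term counts in the -1 class
	    std::cout << score(x, pos_cnt, std::log(3.0 / 4.0)) << " vs "
	              << score(x, neg_cnt, std::log(1.0 / 4.0)) << "\n";  // +1 scores higher
	    return 0;
	}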

Bernoulli example
-----------------

Example taken from here:
http://www.inf.ed.ac.uk/teaching/courses/inf2b/learnnotes/inf2b-learn-note07-2up.pdf

	echo "+1 1:1 2:0 3:0 4:0 5:1 6:1 7:1 8:1" > training.data
	echo "+1 1:0 2:0 3:1 4:0 5:1 6:1 7:0 8:0" >> training.data
	echo "+1 1:0 2:1 3:0 4:1 5:0 6:1 7:1 8:0" >> training.data	
	echo "+1 1:1 2:0 3:0 4:1 5:0 6:1 7:0 8:1" >> training.data	
	echo "+1 1:1 2:0 3:0 4:0 5:1 6:0 7:1 8:1" >> training.data	
	echo "+1 1:0 2:0 3:1 4:1 5:0 6:0 7:1 8:1" >> training.data
	echo "-1 1:0 2:1 3:1 4:0 5:0 6:0 7:1 8:0" >> training.data
	echo "-1 1:1 2:1 3:0 4:1 5:0 6:0 7:1 8:1" >> training.data	
	echo "-1 1:0 2:1 3:1 4:0 5:0 6:1 7:0 8:0" >> training.data	
	echo "-1 1:0 2:0 3:0 4:0 5:0 6:0 7:0 8:0" >> training.data	
	echo "-1 1:0 2:0 3:1 4:0 5:1 6:0 7:1 8:0" >> training.data
	echo "+1 1:1 2:0 3:0 4:1 5:1 6:1 7:0 8:1" > test.data	
	echo "-1 1:0 2:1 3:1 4:0 5:1 6:0 7:1 8:0" >> test.data
	./nb_classify -d 3 training.data test.data
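
With -d 3, features are treated as present or absent, so absent features also contribute to the score through log(1 - p). A rough sketch of the standard Bernoulli rule (again, not necessarily the exact code in this repository):

	#include <cmath>
	#include <cstddef>
	#include <iostream>
	#include <vector>
	
	// score(c) = log P(c) + sum_j [ x_j * log p_cj + (1 - x_j) * log(1 - p_cj) ]
	double score(const std::vector<int>& x, const std::vector<double>& p, double log_prior){
	    double s = log_prior;
	    for(std::size_t j = 0; j < x.size(); ++j)
	        s += x[j] ? std::log(p[j]) : std::log(1.0 - p[j]);
	    return s;
	}
	
	int main(){
	    // unsmoothed estimates for the +1 class above: the fraction of the six
	    // +1 training examples in which each feature is present
	    std::vector<double> p = {3/6.0, 1/6.0, 2/6.0, 3/6.0, 3/6.0, 4/6.0, 4/6.0, 4/6.0};
	    std::vector<int> x = {1, 0, 0, 1, 1, 1, 0, 1};            // the first test example above
	    std::cout << score(x, p, std::log(6.0 / 11.0)) << "\n";   // prior: 6 of the 11 training examples are +1
	    return 0;
	}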

Generating dummy data
---------------------

See above for Gaussian data. For multinomial data:

	./generate_multinomial_data 5000 +1 2 3 4 3 5 > training.data
	./generate_multinomial_data 5000 -1 1 4 3 2 4 >> training.data

The five weights are normalised to probabilities (weight / sum of weights) and used to generate integer feature counts.
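
A rough sketch of how such counts could be drawn with std::discrete_distribution; the number of draws per example is an arbitrary choice here, and this is not necessarily how the bundled generator works:

	#include <cstddef>
	#include <iostream>
	#include <random>
	#include <vector>
	
	int main(){
	    std::mt19937 gen(std::random_device{}());
	    std::vector<double> weights = {2, 3, 4, 3, 5};           // the +1 weights above
	    std::discrete_distribution<int> pick(weights.begin(), weights.end());
	    std::vector<int> counts(weights.size(), 0);
	    for(int draw = 0; draw < 20; ++draw)                     // 20 draws per example (arbitrary)
	        ++counts[pick(gen)];
	    std::cout << "+1";
	    for(std::size_t j = 0; j < counts.size(); ++j)
	        std::cout << " " << j + 1 << ":" << counts[j];
	    std::cout << "\n";
	    return 0;
	}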

For Bernoulli data:	

	./generate_bernoulli_data 5000 +1 0.7 0.1 0.5 > training.data
	./generate_bernoulli_data 5000 -1 0.4 0.2 0.6 >> training.data

The three probabilities are the per-feature probabilities of generating a 1 (rather than a 0).
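
A rough sketch of the same idea with std::bernoulli_distribution (again illustrative, not necessarily the bundled generator's code):

	#include <iostream>
	#include <random>
	
	int main(){
	    std::mt19937 gen(std::random_device{}());
	    double p[3] = {0.7, 0.1, 0.5};                // the +1 probabilities above
	    std::cout << "+1";
	    for(int j = 0; j < 3; ++j){
	        std::bernoulli_distribution d(p[j]);
	        std::cout << " " << j + 1 << ":" << (d(gen) ? 1 : 0);
	    }
	    std::cout << "\n";
	    return 0;
	}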

Note that the prior class probabilities are estimated from the class frequencies in the training data (e.g. with 5000 examples of each class above, both priors are 0.5).

