
Fix Naive Bayes classifier computations in high dimensions. #2022

Merged
merged 3 commits into mlpack:master on Sep 29, 2019

Conversation

@KimSangYeon-DGU (Member) commented Sep 17, 2019

This is a fix for #2017. I found that NaiveBayesClassifier works correctly on low-dimensional datasets like iris_full.dat, but not on high-dimensional ones. Tracing the source of the problem, I found three causes of the undesirable results. First, in the LogLikelihood() method, we replaced zero variances with 1e-50 to make the covariance invertible, but I think we should floor all of the values; therefore, I added an epsilon parameter. Second, in LogLikelihood(), std::log(arma::det(arma::diagmat(variances.col(i)))) produced an infinite log-likelihood in high dimensions, so I changed the equation to a more computationally tractable one (a sum of logs instead of the log of a product). Third, in the Classify() overloads that take a probabilities parameter, the Log-Sum-Exp(x) operation sometimes returned negative infinity when x contained very negative values, so I fixed the code to prevent these potential errors.
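A minimal sketch of the three numerical fixes described above, using plain C++ rather than Armadillo (the function names and the epsilon default here are illustrative assumptions, not mlpack's actual API):

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Fix 1: floor every variance at epsilon (not only the exact zeros), so
// tiny variances cannot blow up the inverse covariance.
std::vector<double> FloorVariances(std::vector<double> v, double eps = 1e-10)
{
  for (double& x : v)
    x = std::max(x, eps);
  return v;
}

// Fix 2: log(det(diag(v))) as a sum of logs; computing det() first would
// underflow to 0 or overflow to infinity in high dimensions.
double LogDetDiag(const std::vector<double>& v)
{
  double sum = 0.0;
  for (double x : v)
    sum += std::log(x);
  return sum;
}

// Fix 3: numerically stable log(sum(exp(x))) via the max-shift trick, so
// very negative inputs no longer collapse to -infinity.
double LogSumExp(const std::vector<double>& x)
{
  const double m = *std::max_element(x.begin(), x.end());
  double sum = 0.0;
  for (double v : x)
    sum += std::exp(v - m);
  return std::log(sum) + m;
}
```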

Below are the results on test sets held out as 20% of each dataset, using the code and datasets provided by @marcovirgolin.
[breastwc_full]

mlpack - mean: 0.93, std: 0.02
sklearn - mean: 0.93, std: 0.02

[cylinderbands_full]

mlpack - mean: 0.31, std: 0.05
sklearn - mean: 0.34, std: 0.05

[ecoli_full]

mlpack - mean: 0.49, std: 0.1
sklearn - mean: 0.5, std: 0.1

[imagesegmentation_full]

mlpack - mean: 0.78, std: 0.01
sklearn - mean: 0.78, std: 0.01

[ionosphere_full]

mlpack - mean: 0.87, std: 0.04
sklearn - mean: 0.88, std: 0.04

[iris_full]

mlpack - mean: 0.95, std: 0.04
sklearn - mean: 0.95, std: 0.04

[madelon_full]

mlpack - mean: 0.6, std: 0.02
sklearn - mean: 0.6, std: 0.02

[sonar_full]

mlpack - mean: 0.65, std: 0.06
sklearn - mean: 0.65, std: 0.06

[vowel_full]

mlpack - mean: 0.66, std: 0.03
sklearn - mean: 0.66, std: 0.03

[yeast_full]

mlpack - mean: 0.25, std: 0.05
sklearn - mean: 0.25, std: 0.04
maxValue = arma::max(logLikelihoods.col(j));
logProbX(j) = log(arma::accu(exp(logLikelihoods.col(j) - maxValue))) +
    maxValue;
predictionLogProb.col(j) = logLikelihoods.col(j) - logProbX(j);

@zoq (Member) commented Sep 18, 2019
Looks like if we store the result into predictionProbs right away we don't need predictionLogProb.

@rcurtin (Member) commented Sep 19, 2019
Right, we could just do predictionProbs.col(j) = arma::exp(logLikelihoods.col(j) - logProbX(j)). Also, there's no need to store logProbX, I think---we can just use a single value, or even inline the current computation of logProbX(j).
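This simplification can be sketched without Armadillo; a minimal standalone version (the function name and plain std::vector types are my own, not mlpack's API) that turns per-class log-likelihoods into normalized probabilities using a single scalar normalizer, with no intermediate log-probability matrix:

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Normalize log-likelihoods into probabilities in one pass. The log of the
// total evidence is computed stably with the max-shift trick and kept as a
// single scalar rather than stored per column.
std::vector<double> ProbabilitiesFromLogLikelihoods(const std::vector<double>& ll)
{
  const double maxValue = *std::max_element(ll.begin(), ll.end());
  double sum = 0.0;
  for (double v : ll)
    sum += std::exp(v - maxValue);
  const double logProbX = std::log(sum) + maxValue;  // one value suffices

  std::vector<double> probs(ll.size());
  for (std::size_t i = 0; i < ll.size(); ++i)
    probs[i] = std::exp(ll[i] - logProbX);
  return probs;
}
```

The probabilities sum to one even when every log-likelihood is far below the range of exp().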

A member left a review comment:

@KimSangYeon-DGU awesome fix! It will slow down the implementation slightly, but I didn't see a way to avoid the extra work. If you can handle Marcus's comment then from my end it looks good, but we should also add something to HISTORY.md. :)


@zoq (Member) approved these changes Sep 19, 2019 and left a review comment:

Looks good to me, no further comments from my side.

@zoq (Member) commented Sep 19, 2019

Wait, one last comment: can you write a test case that uses the dataset from above, or another one that shows the same issue? It could help us avoid such issues in the future.

@zoq zoq removed the s: unanswered label Sep 19, 2019
A member left a review comment:

Looks good to me, nice work again with the great fix. 👍

@rcurtin (Member) commented Sep 19, 2019

> Wait, one last comment: can you write a test case that uses the dataset from above, or another one that shows the same issue? It could help us avoid such issues in the future.

Oh, good point. It should be easy; I think all that's needed is to create a 1000-dimensional linearly separable dataset and then check that NBC can classify it correctly.

@KimSangYeon-DGU (Member, Author) commented Sep 19, 2019

> Wait, one last comment: can you write a test case that uses the dataset from above, or another one that shows the same issue? It could help us avoid such issues in the future.

Great. Also, I found that the last update failed to pass NBCTest; I'll look into it.

@rcurtin (Member) commented Sep 19, 2019

Ah, oops, I approved too early :-D

@zoq (Member) commented Sep 19, 2019

Same here; I haven't checked the Travis result yet.

@KimSangYeon-DGU (Member, Author) commented Sep 19, 2019

Ahh, sorry, my fault. I'll let you know when I'm done; please wait to merge this until then.

@rcurtin (Member) commented Sep 27, 2019

@KimSangYeon-DGU have you had a chance to look at this? We can make it part of the mlpack-3.2.1 patch release in a couple days if we can resolve the issue by then. 👍

@rcurtin rcurtin added this to the mlpack 3.2.1 milestone Sep 27, 2019
@KimSangYeon-DGU KimSangYeon-DGU force-pushed the KimSangYeon-DGU:nbc branch from a16d55f to 57edcef Sep 28, 2019
@KimSangYeon-DGU (Member, Author) commented Sep 28, 2019

@rcurtin Sorry for the slow progress. Yes, I'd like this work to be included in the mlpack-3.2.1 patch release. :)

Also, I made training and test datasets for the high-dimension check. The training set contains 200 points spanning 5 classes, and each point has 1,000 dimensions. The test set contains 50 points and otherwise matches the training set.

Below is the code that I used for dataset generation.
Code for dataset generation.zip
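The attached archive is not reproduced here; purely as an illustration, a generator matching the description above (200 training points across 5 classes, 1,000 dimensions each, plus a 50-point test set; the struct, function name, class spacing, and seed are all assumptions) might look like:

```cpp
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical reconstruction of the attached generator: `count` points,
// each with `dims` features, labels cycling through `classes`. Class c is
// centered at 10 * c in every dimension with unit Gaussian noise, so the
// classes are linearly separable.
struct Dataset
{
  std::vector<std::vector<double>> points;
  std::vector<int> labels;
};

Dataset GenerateSeparable(std::size_t count, std::size_t dims = 1000,
                          int classes = 5, unsigned seed = 42)
{
  std::mt19937 gen(seed);
  std::normal_distribution<double> noise(0.0, 1.0);
  Dataset d;
  for (std::size_t i = 0; i < count; ++i)
  {
    const int label = static_cast<int>(i) % classes;
    std::vector<double> p(dims);
    for (std::size_t j = 0; j < dims; ++j)
      p[j] = 10.0 * label + noise(gen);
    d.points.push_back(p);
    d.labels.push_back(label);
  }
  return d;
}
```

With this setup, GenerateSeparable(200) gives the training set and GenerateSeparable(50) the test set described above.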

@KimSangYeon-DGU KimSangYeon-DGU force-pushed the KimSangYeon-DGU:nbc branch from 57edcef to 259a2af Sep 28, 2019
@KimSangYeon-DGU (Member, Author) commented Sep 28, 2019

Hmm, GmmTrainDiffKmeansMaxIterationsTest failed. However, it is not related to this PR; I opened issue #2035.

A member left a review comment:

Looks great.

@ShikharJ (Member) commented Sep 29, 2019

@KimSangYeon-DGU Please feel free to merge this in.

@KimSangYeon-DGU KimSangYeon-DGU merged commit 42d154f into mlpack:master Sep 29, 2019
6 checks passed:
LaTeX Documentation Checks: Build finished.
Memory Checks: Build finished.
Static Code Analysis Checks: Build finished.
Style Checks: Build finished.
continuous-integration/appveyor/pr: AppVeyor build succeeded
continuous-integration/travis-ci/pr: The Travis CI build passed
@KimSangYeon-DGU (Member, Author) commented Sep 29, 2019

@ShikharJ Yeah :)

Thanks, everyone :)

rcurtin added a commit that referenced this pull request Sep 30, 2019
@rcurtin (Member) commented Sep 30, 2019

👍 I added a note to HISTORY.md in 916f9e8.

@KimSangYeon-DGU KimSangYeon-DGU changed the title Fix some potential infinity errors in Naive Bayes Classifier. Fix Naive Bayes classifier computations in high dimensions. Oct 5, 2019