NAs produced by integer overflow #8

Closed
Melkiades opened this issue Apr 23, 2018 · 6 comments

@Melkiades

Hello everyone! Firstly, thank you for your wonderful work.

Secondly, I wanted to ask how it is possible to get NAs from NMI. I pass a vector of 1s and 2s (40k of each) as the true labels and a vector of 1s and 2s (20k and 60k) as the predicted labels.

The result and the warnings I get are:


---------------------------------------- 
purity                         : 0.667 
entropy                        : 0.6964 
normalized mutual information  : NA 
variation of information       : NA 
normalized var. of information : NA 
---------------------------------------- 
specificity                    : 0.4308 
sensitivity                    : 0.6807 
precision                      : 0.5446 
recall                         : 0.6807 
F-measure                      : 0.6051 
---------------------------------------- 
accuracy OR rand-index         : 0.5558 
adjusted-rand-index            : 0.1115 
jaccard-index                  : 0.4338 
fowlkes-mallows-index          : 0.6089 
mirkin-metric                  : 2843074238 
---------------------------------------- 
[1] NA
Warning messages:
1: In sum(conv_df[i, ]) * sum(conv_df[, j]) :
  NAs produced by integer overflow
2: In sum(tbl) * conv_df[i, j] : NAs produced by integer overflow
3: In sum(conv_df[i, ]) * sum(conv_df[, j]) :
  NAs produced by integer overflow
@mlampros
Owner

@Melkiades I'm sorry for the late reply.

Would you mind sharing a minimal reproducible example, so that I can find out whether it's a bug in the external_validation function?

@Melkiades
Author

Yes, this is the simplest reproducible example I could make:

> a <- c(rep(1,40000), rep(2,40000))
> b <- c(rep(1,20000), rep(2,60000))
> ClusterR::external_validation(a,b, method = "nmi")
[1] NA
Warning messages:
1: In sum(conv_df[i, ]) * sum(conv_df[, j]) :
  NAs produced by integer overflow
2: In sum(tbl) * conv_df[i, j] : NAs produced by integer overflow
3: In sum(conv_df[i, ]) * sum(conv_df[, j]) :
  NAs produced by integer overflow

@mlampros
Owner

@Melkiades, thanks for making me aware of this issue.

Actually, "NAs produced by integer overflow" is a known limitation in R (as mentioned in this Stack Overflow thread): R integers are 32-bit, so any integer result above 2147483647 becomes NA.

In the ClusterR package this occurred when calculating the mutual_information variable. I had to use the gmp::as.bigz() function to allow the multiplication of big integers.
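
For readers who want to see the overflow in isolation, here is a minimal base-R sketch using the counts from the example above. It is an illustration only, not the code inside ClusterR: the variable names (`row_total`, `col_total`, `overflowed`, `as_double`) are made up for this sketch, and it swaps in `as.numeric()` as a base-R stand-in for the `gmp::as.bigz()` fix described above. With these inputs one row total is 40000 and one column total is 60000, and their product (2.4e9) exceeds `.Machine$integer.max` (2147483647).

```r
# Counts from the example above: 40k/40k true labels, 20k/60k predicted labels.
a <- c(rep(1, 40000), rep(2, 40000))
b <- c(rep(1, 20000), rep(2, 60000))
tbl <- table(a, b)               # contingency table, stored as integers

row_total <- sum(tbl[2, ])       # 40000, still an integer
col_total <- sum(tbl[, 2])       # 60000, still an integer

# integer * integer stays integer in R, so 2.4e9 does not fit and becomes NA
# (R also emits the "NAs produced by integer overflow" warning here).
overflowed <- suppressWarnings(row_total * col_total)

# Assumption for this sketch: converting one operand to double is enough,
# because doubles represent integers exactly up to 2^53.
as_double <- as.numeric(row_total) * col_total

print(overflowed)
print(as_double)
```

The double conversion is adequate for counts of this size; an arbitrary-precision type such as the one from gmp would only be needed if products could exceed 2^53.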

I uploaded a new version of the ClusterR package to GitHub. You can install this new version using

devtools::install_github('mlampros/ClusterR')

as it will take some time until I submit the new version to CRAN.

Please test it and let me know.

@Melkiades
Author

Melkiades commented Apr 25, 2018

That is perfect!! Thanks so much :)

I also have another test (a silly one) that returns NaN. Is that the right answer? I tested the same inputs in sklearn and got 1. I don't know, it is probably a matter of convention, right? (It remains a silly test.)
In R:

> ClusterR::external_validation(true_labels = c(1,1,1,1), clusters = c(1,1,1,1), method = 'nmi')
NaN

In Python:

>>> from sklearn.metrics.cluster import normalized_mutual_info_score
>>> normalized_mutual_info_score([1, 1, 1, 1], [1, 1, 1, 1])
1.0

@mlampros
Owner

@Melkiades thanks for testing this case.

The authors of sklearn added a special case for when both the true labels and the clusters match perfectly. In that case normalized_mutual_info_score returns 1.0.

I also added a similar exception to account for this case. I updated the ClusterR package.
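
To show where the NaN comes from, here is a rough base-R sketch of NMI with such a special case. It is not the ClusterR source: `nmi_sketch` is a hypothetical helper, and normalizing by the arithmetic mean of the two entropies is an assumption made for this sketch. When neither labelling splits the data, both entropies and the mutual information are 0, so the ratio is 0 / 0, which R evaluates to NaN unless the case is handled explicitly.

```r
# Hypothetical helper for illustration; not the ClusterR implementation.
nmi_sketch <- function(true_labels, clusters) {
  # Degenerate case: neither labelling splits the data, so both entropies
  # are 0 and the ratio at the end would be 0 / 0 (NaN). Return 1 instead.
  if (length(unique(true_labels)) == 1 && length(unique(clusters)) == 1) {
    return(1)
  }

  n <- length(true_labels)
  joint <- table(true_labels, clusters) / n   # joint probabilities (doubles)
  p_true <- rowSums(joint)                    # marginal of the true labels
  p_clust <- colSums(joint)                   # marginal of the clusters

  # Shannon entropy, skipping zero-probability entries.
  entropy <- function(p) -sum(p[p > 0] * log(p[p > 0]))

  expected <- outer(p_true, p_clust)          # product of the marginals
  nz <- joint > 0                             # empty cells contribute nothing
  mi <- sum(joint[nz] * log(joint[nz] / expected[nz]))

  # Assumed normalization: arithmetic mean of the two entropies.
  mi / ((entropy(p_true) + entropy(p_clust)) / 2)
}

nmi_sketch(c(1, 1, 1, 1), c(1, 1, 1, 1))  # 1, via the special case
```

Because the sketch works with proportions (doubles) rather than raw integer counts, it also avoids the integer overflow discussed earlier in this thread.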

Please test it and let me know.

@Melkiades
Author

I imagined so! It works like a charm. Thanks a lot for the prompt responses :)
