NAs produced by integer overflow #8

Melkiades · 2018-04-23T17:26:58Z

Hello everyone! Firstly, thank you for your wonderful work.

Secondly, I wanted to ask how it is possible to get NA from NMI. I pass a vector of 1s and 2s (40k of each) as the true labels and a vector of 1s and 2s (20k and 60k) as the predicted labels.

The result and warnings I get are:


---------------------------------------- 
purity                         : 0.667 
entropy                        : 0.6964 
normalized mutual information  : NA 
variation of information       : NA 
normalized var. of information : NA 
---------------------------------------- 
specificity                    : 0.4308 
sensitivity                    : 0.6807 
precision                      : 0.5446 
recall                         : 0.6807 
F-measure                      : 0.6051 
---------------------------------------- 
accuracy OR rand-index         : 0.5558 
adjusted-rand-index            : 0.1115 
jaccard-index                  : 0.4338 
fowlkes-mallows-index          : 0.6089 
mirkin-metric                  : 2843074238 
---------------------------------------- 
[1] NA
Warning messages:
1: In sum(conv_df[i, ]) * sum(conv_df[, j]) :
  NAs produced by integer overflow
2: In sum(tbl) * conv_df[i, j] : NAs produced by integer overflow
3: In sum(conv_df[i, ]) * sum(conv_df[, j]) :
  NAs produced by integer overflow


mlampros · 2018-04-24T14:01:43Z

@Melkiades, sorry for the late reply.

Would you mind sharing a minimal reproducible example, so that I can find out whether this is a bug in the external_validation function?

Melkiades · 2018-04-25T09:49:52Z

Yes, this is the simplest reproducible example I could make:

> a <- c(rep(1,40000), rep(2,40000))
> b <- c(rep(1,20000), rep(2,60000))
> ClusterR::external_validation(a,b, method = "nmi")
[1] NA
Warning messages:
1: In sum(conv_df[i, ]) * sum(conv_df[, j]) :
  NAs produced by integer overflow
2: In sum(tbl) * conv_df[i, j] : NAs produced by integer overflow
3: In sum(conv_df[i, ]) * sum(conv_df[, j]) :
  NAs produced by integer overflow

mlampros · 2018-04-25T11:58:27Z

@Melkiades, thanks for making me aware of this issue.

Actually, "NAs produced by integer overflow" is a known issue in R (as mentioned in this Stack Overflow thread).

In the ClusterR package it occurred when calculating the mutual_information variable. I used the gmp::as.bigz() function to allow the multiplication of big integers.

I uploaded a new version of the ClusterR package to GitHub. You can install it using

devtools::install_github('mlampros/ClusterR')

as it will take some time until I submit the new version to CRAN.

Please test it and let me know.
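
To see where the overflow comes from, here is a minimal Python sketch (Python because its integers are arbitrary precision, so the oversized products are easy to inspect). It rebuilds the contingency table for the two vectors from the reproducible example and lists the products that exceed R's 32-bit integer limit of 2147483647. The labels `n * n_ij` and `row_i * col_j` are illustrative stand-ins for `sum(tbl) * conv_df[i, j]` and `sum(conv_df[i, ]) * sum(conv_df[, j])` from the warnings. The package's actual loop is not shown in this thread, so the structure below is an assumption.

```python
from collections import Counter

# Largest value an R integer can hold (.Machine$integer.max).
INT32_MAX = 2**31 - 1

# Same label vectors as in the reproducible example.
a = [1] * 40000 + [2] * 40000
b = [1] * 20000 + [2] * 60000

n = len(a)                 # total number of observations
joint = Counter(zip(a, b))  # non-empty cells of the contingency table
row = Counter(a)            # row sums (counts per true label)
col = Counter(b)            # column sums (counts per predicted label)


def overflowing_products():
    """Return (i, j, label, value) for every product above INT32_MAX."""
    found = []
    for (i, j), n_ij in sorted(joint.items()):
        # Assumed order: numerator-style product first, then the marginals.
        candidates = (
            ("n * n_ij", n * n_ij),
            ("row_i * col_j", row[i] * col[j]),
        )
        for label, value in candidates:
            if value > INT32_MAX:
                found.append((i, j, label, value))
    return found


for i, j, label, value in overflowing_products():
    print(f"cell ({i}, {j}): {label} = {value} exceeds {INT32_MAX}")
```

Under these assumptions three products exceed the limit: 40000 * 60000 for cell (1, 2), then 80000 * 40000 and 40000 * 60000 for cell (2, 2). That lines up with the three warnings reported above. Python's built-in integers play the role that gmp::as.bigz() plays in the R fix.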

Melkiades · 2018-04-25T12:38:56Z

That is perfect!! Thanks so much :)

I also have another test (a silly one) that results in NaN. Is that the right answer? I tested the same inputs in sklearn and got 1. I don't know, it is probably a matter of convention, right? (It remains a silly test.)
In R:

> ClusterR::external_validation(true_labels = c(1,1,1,1), clusters = c(1,1,1,1), method = 'nmi')
NaN

In Python:

>>> from sklearn.metrics.cluster import normalized_mutual_info_score
>>> normalized_mutual_info_score([1, 1, 1, 1], [1, 1, 1, 1])
1.0

mlampros · 2018-04-25T15:13:28Z

@Melkiades thanks for testing this case.

The authors of sklearn added a special case for when the true labels and the clusters match perfectly. In that case normalized_mutual_info_score returns 1.0.

I added a similar special case to ClusterR and updated the package.

Please test it and let me know.
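
For readers wondering why the plain formula gives NaN here: when every point carries the same label in both vectors, both entropies are zero, so the normalization divides zero by zero. The Python sketch below illustrates this and the guard that returns 1.0 instead. It is not the ClusterR or sklearn source. The helper names, the `guard` flag, and the geometric-mean normalization sqrt(H(a) * H(b)) are assumptions made for illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (natural log) of a label vector."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def mutual_information(a, b):
    """Mutual information between two label vectors of equal length."""
    n = len(a)
    joint = Counter(zip(a, b))
    count_a, count_b = Counter(a), Counter(b)
    mi = 0.0
    for (x, y), n_xy in joint.items():
        # Python ints do not overflow, so n * n_xy is safe here.
        mi += (n_xy / n) * math.log((n * n_xy) / (count_a[x] * count_b[y]))
    return mi


def nmi(a, b, guard=True):
    """NMI with an optional guard for the single-cluster case."""
    h_a, h_b = entropy(a), entropy(b)
    if guard and h_a == 0.0 and h_b == 0.0:
        # Both labelings have a single cluster: define the score as 1.0.
        return 1.0
    denom = math.sqrt(h_a * h_b)  # assumed normalization
    if denom == 0.0:
        # 0/0 is NaN in R; Python would raise, so return NaN explicitly.
        return float("nan")
    return mutual_information(a, b) / denom


print(nmi([1, 1, 1, 1], [1, 1, 1, 1], guard=False))  # unguarded: 0/0
print(nmi([1, 1, 1, 1], [1, 1, 1, 1]))               # guarded
```

With `guard=False` the call reproduces the NaN from the R session above, and with the guard it returns 1.0, as in the sklearn output.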

Melkiades · 2018-04-25T15:19:27Z

I thought so! It works like a charm. Thanks a lot for the prompt responses :)

Melkiades closed this as completed Apr 25, 2018
