AdasynClassif - Error in matrix(nrow = sum(unlist(g)), ncol = nC) : invalid 'nrow' value (too large or NA) #4

Closed
ghost opened this issue Apr 19, 2019 · 5 comments

Comments

@ghost

ghost commented Apr 19, 2019

Hi,

I have used the AdasynClassif function several times without any problem, but now I am seeing weird behavior. I got this error:

"Error in matrix(nrow = sum(unlist(g)), ncol = nC) : invalid 'nrow' value (too large or NA)"

It only happens for some combinations of samples and features (please find attached: "a", "b" and "c" are features and "target" is the class, baseClass = 4). Additionally, setting a different base class affects it. I wanted to dive deeper into the function and find out what’s going on, but I couldn’t figure out where the class.freq function comes from.
(That’s another mystery for me: how can AdasynClassif run without a package that provides the class.freq function being installed?)
ADAS_examples.zip

@ghost
Author

ghost commented Apr 19, 2019

I'm adding one more example. It has the same samples as ADAS_fail.csv but with an extra feature column. In this case, the AdasynClassif function works.
ADAS_ok2.csv.zip

@paobranco
Owner

Hi,

I found the problem. It is related to the nearest neighbours in your data.
What is happening is that none of your class 3 examples has nearest neighbours that belong to the majority class.
These neighbours are used to derive the number of new examples to generate for each base example of class 3.
Because these class 3 examples do not have nearest neighbours in the majority class (or baseClass) the algorithm assigns an importance of zero to each case. This importance is then normalized...
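
To illustrate how such an all-zero importance can arise, here is a minimal sketch on made-up data (not the attached CSV files). The variable names, the value of `k`, and the use of plain Euclidean distance are assumptions for illustration only, not the actual UBL implementation.

```r
# Toy data (hypothetical): one feature, two well-separated classes.
x      <- c(0.0, 0.1, 0.2, 0.3, 5.0, 5.1, 5.2, 5.3)
target <- c(3, 3, 3, 3, 4, 4, 4, 4)
base_class <- 4   # plays the role of the majority class / baseClass
k <- 2            # number of nearest neighbours (assumed value)

d <- as.matrix(dist(x))   # pairwise Euclidean distances
diag(d) <- Inf            # a point is not its own neighbour

# For each class 3 example: fraction of its k nearest neighbours
# that belong to the base class.
minority <- which(target == 3)
r <- sapply(minority, function(i) {
  nn <- order(d[i, ])[1:k]
  sum(target[nn] == base_class) / k
})
r   # 0 0 0 0: no class 3 example has a neighbour in the base class
```

Because the two classes are far apart here, every neighbour of a class 3 example is itself of class 3, so every importance is zero.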

I'm attaching part of the ADASYN algorithm below to make this discussion clearer:

[two images: excerpts of the ADASYN algorithm]

So, when the algorithm tries to generate new cases for examples of class 3, it fails because they all have $\hat{r}_i =0$.
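
For reference, the relevant quantities can be written out as follows. This uses the notation of the original ADASYN paper (He et al., 2008) as best recalled, since the attached images are not reproduced here: $\Delta_i$ is the number of majority-class examples among the $K$ nearest neighbours of minority example $i$, $m_s$ is the number of minority examples, and $G$ is the total number of synthetic examples to generate.

$$
r_i = \frac{\Delta_i}{K}, \qquad
\hat{r}_i = \frac{r_i}{\sum_{j=1}^{m_s} r_j}, \qquad
g_i = \hat{r}_i \times G
$$

If every $\Delta_i = 0$, then every $r_i = 0$ and the denominator $\sum_j r_j$ is also zero, so the normalization becomes $0/0$ and the counts $g_i$ are undefined.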

This is a special case where the ADASYN algorithm is not capable of generating a non-uniform distribution for the $\hat{r}_i$ values.
I hadn't thought about that, so AdasynClassif fails when this happens because it generates NAs for these cases.
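
The failure can be reproduced in isolation with a few lines of base R. This is only a sketch of the mechanism under assumed values (`G`, the rounding step and `ncol = 3` are hypothetical); it is not the code inside AdasynClassif.

```r
r <- c(0, 0, 0)          # importance of each class 3 example (all zero)
G <- 10                  # hypothetical number of new examples wanted

r_hat <- r / sum(r)      # 0 / 0 gives NaN for every example
g <- round(r_hat * G)    # the per-example counts are NaN as well

# Allocating the result matrix with a NaN row count raises the reported error.
res <- tryCatch(
  matrix(nrow = sum(g), ncol = 3),
  error = function(e) conditionMessage(e)
)
res
```

Here `res` holds the error message raised by `matrix()`, which matches the one quoted at the top of this issue.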
I think this can be solved by simply generating new cases uniformly in this situation: without nearest neighbours in the majority class, there is nothing to bias the generation of examples towards.
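
One possible shape for such a fallback is sketched below. The helper `alloc_counts` is hypothetical and written only for illustration; the fix that ends up in UBL may look different.

```r
# Hypothetical helper, not part of UBL: turn raw importances r into
# the number of new examples to generate from each base example.
alloc_counts <- function(r, G) {
  if (sum(r) == 0) {
    # No example has majority-class neighbours: use uniform weights.
    r_hat <- rep(1 / length(r), length(r))
  } else {
    r_hat <- r / sum(r)
  }
  round(r_hat * G)
}

alloc_counts(c(0, 0, 0, 0), G = 8)      # 2 2 2 2 (uniform fallback)
alloc_counts(c(0.5, 0, 0.5, 0), G = 8)  # 4 0 4 0 (usual weighting)
```

Note that with rounding the counts do not always add up to exactly `G`; a real implementation would need to decide how to handle the remainder.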

Indeed, when you add more features, the nearest neighbours change and the problem does not occur, because every class then has examples whose nearest neighbours belong to the majority class!

I'll try to push an updated version of ADASYN to GitHub as soon as possible. I'll use the development branch.
Would that be OK for you?

Regarding the "mystery" of the class.freq function, it is implemented in UBL but it is an auxiliary function, so it is not exported. UBL knows what it is but the end-users shouldn't need to know ;-)
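
For anyone curious, base R can show whether a function is internal to a package. The helper `is_internal` below is a hypothetical name introduced for this sketch; the UBL check is guarded so the snippet also runs when UBL is not installed.

```r
# Hypothetical helper: TRUE if `name` is defined in the namespace of
# package `pkg` but is not among its exported objects.
is_internal <- function(name, pkg) {
  ns <- asNamespace(pkg)
  exists(name, envir = ns, inherits = FALSE) &&
    !(name %in% getNamespaceExports(ns))
}

is_internal("median", "stats")   # FALSE: median is exported by stats

if (requireNamespace("UBL", quietly = TRUE)) {
  print(is_internal("class.freq", "UBL"))
  # An internal function can still be inspected with the ::: operator,
  # e.g. UBL:::class.freq
}
```

Functions inside a package are looked up in the package namespace first, which is why AdasynClassif can call an auxiliary function that is not visible from the user's session.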

I'm sorry for the inconvenience.
Thank you for your interest in UBL!

@ghost
Author

ghost commented Apr 26, 2019

Thank you so much for your great explanation!

@ghost
Author

ghost commented May 6, 2019

Hi, I was just wondering when you think the updated version will be available?
Thank you!

@paobranco
Owner

Hi,
Thank you for your patience!
A new version is now available on the development branch.
Please let me know if you have any more issues.
