Added the .copy operator to ensure passing by value. #4
Conversation
Hi Fokko,

Sorry for the late reply; I was on vacation. Thank you very much for catching this. I originally developed SOS in MATLAB, and I guess I didn't pay enough attention when porting it to Python. I applied both the original Python implementation and your version to the following toy data set:

```
$ cat toy.csv
1.00,1.00
3.00,1.25
3.00,3.00
1.00,3.00
2.25,2.25
8.00,2.00
```
```
$ git checkout master
$ < toy.csv ./sos -p 4.5
0.34863044
0.25228690
0.25312064
0.33462241
0.23113441
0.64961863
```

```
$ git checkout Fokko-master
$ < toy.csv ./sos -p 4.5
0.35885434
0.24096533
0.23969990
0.33949115
0.19634946
0.76447110
```

This toy data set was also used in the SOS paper, and because I still had the original outlier probabilities as computed by the MATLAB version (using a perplexity of 4.5), I thought we could use those as a reference. These are the results as computed by the MATLAB version:
As you can see, the outlier probabilities produced by your version are closer to the MATLAB implementation's than mine, but they're still not quite the same. (It might very well be that the MATLAB implementation is incorrect.) What does your Scala implementation produce?

Thanks again,
Jeroen
Hi Jeroen,

Thanks for the reply, and of course many thanks for sharing the Python implementation; it allowed me to write unit tests, which uncovered a difference in results. My results based on the toy data set are:
These are approximately the same as the Python implementation's. The Scala implementation is part of my master's thesis, which focuses on scaling outlier detection; the Stochastic Outlier Selection algorithm lends itself well to this because the Spark framework handles iterative jobs very efficiently. What tolerance are you using with the MATLAB script? In the Scala implementation I set it to zero, which might take some additional iterations and might affect the final output.
It's very reassuring that your Scala implementation produces the same results as the improved Python one; I'd be comfortable merging your changes. I would have to dig up the MATLAB implementation, as it's been a few years since I've touched it. I think the tolerance is not zero but a very small number; it would be interesting to look this up. I'll get back to you on this. I'm honored that you're implementing SOS in Scala. (It's a shame I never submitted this to a journal.) Do you know of any other outlier detection algorithms in Spark?
Sure, I am curious where the difference in results comes from, so please let me know. As far as I know, outlier detection is not yet implemented in Apache Spark. Spark's own MLlib has some general machine learning tools, among them unsupervised algorithms like k-means. Recently a third-party Spark packages repository has been set up; I am planning to submit the SOS algorithm to this repo when it is finished, but I still need to do some testing and cleanup. Algorithms like LOF are quite hard to implement in Apache Spark because the results depend on the n-th nearest neighbors, which would require multiple shuffles across the data set. SOS, by contrast, works on rows of data, which makes it easier to distribute the computations across multiple nodes. The main problem, however, with both LOF and SOS, is the computational cost of the distance matrix, which is quadratic in the input size.
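(Editorially, the quadratic cost mentioned here comes from materializing the full pairwise distance matrix. A minimal NumPy sketch of that step, not the project's actual code, assuming squared Euclidean dissimilarities:)

```python
import numpy as np

def dissimilarity_matrix(X):
    """Full pairwise squared Euclidean distances.

    Both time and memory are O(n^2) in the number of rows of X,
    which is the scaling bottleneck for SOS and LOF alike.
    """
    sq = np.sum(X ** 2, axis=1)
    D = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    np.fill_diagonal(D, 0.0)
    return np.maximum(D, 0.0)  # clamp tiny negatives from rounding

# Example on the toy data set from this conversation:
X = np.array([[1.00, 1.00], [3.00, 1.25], [3.00, 3.00],
              [1.00, 3.00], [2.25, 2.25], [8.00, 2.00]])
D = dissimilarity_matrix(X)  # 6x6 symmetric matrix, zero diagonal
```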
OK, so in the MATLAB version the tolerance was set to 1e-20 with at most 400 tries, whereas in the Python version these are 1e-5 and 5000 respectively. Changing these values in the Python version doesn't influence the results, so I'm not sure what's causing this difference. However, I'm satisfied that the Python and Scala versions now produce the same results, so I'll merge your pull request. Thanks again for your help!
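(Editorially, the tolerance and try counts discussed above bound a binary search over each point's bandwidth `beta`, as in t-SNE-style perplexity calibration. A hypothetical sketch of that loop, with an assumed `hdiff_at(beta)` helper returning the entropy error at a given bandwidth; this is not the repository's actual code:)

```python
import math

def find_beta(hdiff_at, tol=1e-5, max_tries=5000):
    """Binary-search a bandwidth beta until |Hdiff| <= tol,
    or until max_tries is exhausted."""
    beta = 1.0
    betamin, betamax = -math.inf, math.inf
    for _ in range(max_tries):
        hdiff = hdiff_at(beta)
        if abs(hdiff) <= tol:
            break
        if hdiff > 0:
            # Entropy too high: beta is too small, raise it.
            betamin = beta
            beta = beta * 2.0 if betamax == math.inf else (beta + betamax) / 2.0
        else:
            # Entropy too low: beta is too large, lower it.
            betamax = beta
            beta = beta / 2.0 if betamin == -math.inf else (beta + betamin) / 2.0
    return beta
```

With a tolerance of exactly zero (as in the Scala implementation) the loop only stops on an exact hit or on the iteration cap, which is why the final output can differ slightly between settings.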
Added the .copy operator to ensure passing by value.
Hi Jeroen Janssens,
I think I have found a bug in the Python implementation. I am currently working on a Scala implementation for the Apache Spark platform, to scale the algorithm across a large cluster. I used the Python script to create unit tests that verify the workings of the algorithm. I could not find any obvious mistakes, so I did a step-by-step comparison.
In Scala the values are assigned as immutable values, which is different from Python. I found out that this caused a difference in results. To mimic the Scala behavior I added the `.copy()` method on the value, which returns a copy. Otherwise, when for example `beta[i] == 1`, `betamax == inf` and `Hdiff <= 0`, `betamax` gets assigned `1`, which is the value of `beta` at that moment; after that the precision of `beta` gets adjusted to `.5`, and then `betamax` also gets adjusted to `.5`, which does not seem like correct behavior to me.

Before:
After the application of `.copy()`:
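(Editorially, the aliasing described above can be reproduced in a few lines. This is a minimal sketch, assuming `beta` is stored as an `(n, 1)` NumPy array so that `beta[i]` yields a mutable view rather than a scalar; the variable names follow the conversation, not the actual script:)

```python
import numpy as np

# Without .copy(): betamax is a *view* into beta, not a snapshot.
beta = np.ones((3, 1))
betamax = beta[1]        # view of row 1
beta[1] = 0.5            # adjusting beta[i] ...
print(betamax)           # ... silently changes betamax too: [0.5]

# With .copy(): betamax keeps the value it had at assignment time.
beta = np.ones((3, 1))
betamax = beta[1].copy() # independent snapshot
beta[1] = 0.5
print(betamax)           # still [1.]
```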