Added the .copy operator to ensure passing by value. #4
Conversation
Hi Fokko,

Sorry for the late reply; I was on vacation. Thank you very much for catching this. I originally developed SOS in MATLAB, and I guess I didn't pay enough attention when porting it to Python. I applied both the original Python implementation and your version to the following toy data set:

```
$ cat toy.csv
1.00,1.00
3.00,1.25
3.00,3.00
1.00,3.00
2.25,2.25
8.00,2.00
```
```
$ git checkout master
$ < toy.csv ./sos -p 4.5
0.34863044
0.25228690
0.25312064
0.33462241
0.23113441
0.64961863
```

```
$ git checkout Fokko-master
$ < toy.csv ./sos -p 4.5
0.35885434
0.24096533
0.23969990
0.33949115
0.19634946
0.76447110
```

This toy data set was also used in the SOS paper, and because I still had the original outlier probabilities as computed by the MATLAB version (using a perplexity of 4.5), I thought we could use those as a reference. These are the results as computed by the MATLAB version:
As you can see, the outlier probabilities produced by your version are closer to the MATLAB implementation's than mine, but they're still not quite the same. (It might very well be that the MATLAB implementation is incorrect.) What does your Scala implementation produce?

Thanks again,
Jeroen
Hi Jeroen,

Thanks for the reply, and of course many thanks for sharing the Python implementation; it allowed me to write unit tests, which uncovered a difference in results. My results based on the toy data set are:
These are approximately the same as the Python implementation's. The Scala implementation is part of my master's thesis, which focuses on scaling outlier detection; the Stochastic Outlier Selection algorithm lends itself well to this because the Spark framework handles iterative jobs very efficiently. What tolerance are you using with the MATLAB script? In the Scala implementation I set it to zero, which might take some additional iterations and might affect the final output.
It's very reassuring that your Scala implementation produces the same results as the improved Python one; I'd be comfortable merging your changes. I would have to dig up the MATLAB implementation, as it's been a few years since I've touched it. I think the tolerance is not zero but a very small number; it would be interesting to look this up. I'll get back to you on this. I'm honored that you're implementing SOS in Scala. (It's a shame I never submitted this to a journal.) Do you know of any other outlier detection algorithms in Spark?
Sure, I am curious where the difference in results comes from, so please let me know. As far as I know, outlier detection is not yet implemented in Apache Spark. Spark's own MLlib has some general machine learning tools, among them unsupervised algorithms like k-means. Recently a third-party Spark packages repository has been set up; I am planning to submit the SOS algorithm to this repo when it is finished, but I still need to do some testing and cleanup. Algorithms like LOF are quite hard to implement in Apache Spark because the results depend on the n-th nearest neighbors, which would require multiple shuffles across the data set. SOS, by contrast, works on rows of data, which makes it easier to distribute the computations across multiple nodes. The main problem, however, with both LOF and SOS, is the computational cost of the distance matrix, which is quadratic in the input size.
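(Editorially, the quadratic cost mentioned here comes from materializing the full pairwise distance matrix. A minimal NumPy sketch of that step, not the project's actual code, assuming squared Euclidean dissimilarities:)

```python
import numpy as np

def dissimilarity_matrix(X):
    """Full pairwise squared Euclidean distances.

    Both time and memory are O(n^2) in the number of rows of X,
    which is the scaling bottleneck for SOS and LOF alike.
    """
    sq = np.sum(X ** 2, axis=1)
    D = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    np.fill_diagonal(D, 0.0)
    return np.maximum(D, 0.0)  # clamp tiny negatives from rounding

# Example on the toy data set from this conversation:
X = np.array([[1.00, 1.00], [3.00, 1.25], [3.00, 3.00],
              [1.00, 3.00], [2.25, 2.25], [8.00, 2.00]])
D = dissimilarity_matrix(X)  # 6x6 symmetric matrix, zero diagonal
```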
OK, so in the MATLAB version the tolerance was set to 1e-20 with at most 400 tries, whereas in the Python version these are 1e-5 and 5000 respectively. Changing these values in the Python version doesn't influence the results, so I'm not sure what's causing this difference. However, I'm satisfied that the Python and Scala versions now produce the same results, so I'll merge your pull request. Thanks again for your help!
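(Editorially, the tolerance and try counts discussed above bound a binary search over each point's bandwidth `beta`, as in t-SNE-style perplexity calibration. A hypothetical sketch of that loop, with an assumed `hdiff_at(beta)` helper returning the entropy error at a given bandwidth; this is not the repository's actual code:)

```python
import math

def find_beta(hdiff_at, tol=1e-5, max_tries=5000):
    """Binary-search a bandwidth beta until |Hdiff| <= tol,
    or until max_tries is exhausted."""
    beta = 1.0
    betamin, betamax = -math.inf, math.inf
    for _ in range(max_tries):
        hdiff = hdiff_at(beta)
        if abs(hdiff) <= tol:
            break
        if hdiff > 0:
            # Entropy too high: beta is too small, raise it.
            betamin = beta
            beta = beta * 2.0 if betamax == math.inf else (beta + betamax) / 2.0
        else:
            # Entropy too low: beta is too large, lower it.
            betamax = beta
            beta = beta / 2.0 if betamin == -math.inf else (beta + betamin) / 2.0
    return beta
```

With a tolerance of exactly zero (as in the Scala implementation) the loop only stops on an exact hit or on the iteration cap, which is why the final output can differ slightly between settings.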
Added the .copy operator to ensure passing by value.
Hi Jeroen Janssens,
I think I have found a bug in the Python implementation. I am currently working on a Scala implementation for the Apache Spark platform, to scale the algorithm across a large cluster. I used the Python script to create unit tests that verify the workings of the algorithm. I could not find any obvious mistakes, so I did a step-by-step comparison.
In Scala the values are assigned as immutable values, which is different from Python. I found out that this caused a difference in results. To mimic the Scala behavior I added the `.copy()` method on the value, which returns a copy. Otherwise, when for example `beta[i] == 1`, `betamax == inf` and `Hdiff <= 0`, `betamax` gets assigned `1`, which is the value of `beta` at that moment; after that the precision of `beta` gets adjusted to `.5`, and then `betamax` also gets adjusted to `.5`, which does not seem like correct behavior to me.

Before:
After the application of `.copy()`:
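(Editorially, the aliasing described above can be reproduced in a few lines. This is a minimal sketch, assuming `beta` is stored as an `(n, 1)` NumPy array so that `beta[i]` yields a mutable view rather than a scalar; the variable names follow the conversation, not the actual script:)

```python
import numpy as np

# Without .copy(): betamax is a *view* into beta, not a snapshot.
beta = np.ones((3, 1))
betamax = beta[1]        # view of row 1
beta[1] = 0.5            # adjusting beta[i] ...
print(betamax)           # ... silently changes betamax too: [0.5]

# With .copy(): betamax keeps the value it had at assignment time.
beta = np.ones((3, 1))
betamax = beta[1].copy() # independent snapshot
beta[1] = 0.5
print(betamax)           # still [1.]
```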