Added the .copy operator to ensure passing by value. #4

Merged
merged 1 commit into from
Jul 21, 2015

Fokko
Contributor

@Fokko Fokko commented Jun 25, 2015

Hi Jeroen Janssens,

I think I have found a bug in the Python implementation. I am currently working on a Scala implementation for the Apache Spark platform to scale the algorithm across a large number of clusters. I used the Python script to create unit tests that verify the workings of the algorithm. I could not find any obvious mistakes, so I did a step-by-step comparison.

In Scala the values are assigned as immutable values, which differs from Python. I found out that this caused a difference in the results. To mimic the Scala behavior I added a call to .copy() on the value, which returns a copy. Without it, take for example beta[i] == 1, betamax == inf and Hdiff <= 0: betamax is assigned 1, the value of beta[i] at that moment; beta[i] is then halved to 0.5, and betamax changes to 0.5 along with it, which does not seem like correct behavior to me.

```python
# If not, increase or decrease precision
if Hdiff > 0:
    betamin = beta[i].copy()
    if betamax == np.inf or betamax == -np.inf:
        beta[i] = beta[i] * 2.0
    else:
        beta[i] = (beta[i] + betamax) / 2.0
else:
    betamax = beta[i].copy()
    if betamin == np.inf or betamin == -np.inf:
        beta[i] = beta[i] / 2.0
    else:
        beta[i] = (beta[i] + betamin) / 2.0
```
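
The aliasing described above can be reproduced in isolation. The following is a minimal sketch, assuming that beta is stored as an (n, 1) NumPy array, so that beta[i] is a length-1 view into the array rather than a plain float; the variable names betamax_alias and betamax_copy are hypothetical and only serve the illustration.

```python
import numpy as np

# Assumption: beta has shape (n, 1), so beta[i] is a view of shape (1,).
beta = np.ones((3, 1))
i = 0

# Without .copy(): betamax shares memory with beta.
betamax_alias = beta[i]
beta[i] = beta[i] / 2.0      # writes 0.5 into the shared buffer
print(betamax_alias)         # [0.5]  (changed together with beta[i])

# With .copy(): betamax keeps the value beta[i] had when it was assigned.
beta[i] = 1.0
betamax_copy = beta[i].copy()
beta[i] = beta[i] / 2.0
print(betamax_copy)          # [1.]   (unaffected by the update)
```

If beta were a one-dimensional array, beta[i] would already be an independent scalar and the problem would not appear, which may be why the bug is easy to overlook.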

Before:

~/Desktop/sos-correct/bin$ < iris.csv ./sos -p 30 | sort -nr | head
0.92552418
0.91794955
0.81657372
0.79410068
0.77251273
0.76652991
0.71135211
0.69634175
0.69305280
0.68967627

After the application of the .copy():

~/Desktop/sos-correct/bin$ < iris.csv ./sos -p 30 | sort -nr | head
0.88569072
0.82241037
0.74932441
0.71168871
0.69498432
0.68975787
0.67562369
0.66990241
0.65935384
0.65740135

@jeroenjanssens
Owner

Hi Fokko,

Sorry for the late reply; I was on vacation.

Thank you very much for catching this. I originally developed SOS in MATLAB, and I guess I didn't pay enough attention when porting it to Python. I applied both the original Python implementation and your version to the following toy data set:

$ cat toy.csv
1.00,1.00
3.00,1.25
3.00,3.00
1.00,3.00
2.25,2.25
8.00,2.00
$ git checkout master
$ < toy.csv ./sos -p 4.5
0.34863044
0.25228690
0.25312064
0.33462241
0.23113441
0.64961863
$ git checkout Fokko-master
$ < toy.csv ./sos -p 4.5
0.35885434
0.24096533
0.23969990
0.33949115
0.19634946
0.76447110

This toy data set was also used in the SOS paper, and because I still had the original outlier probabilities as computed by the MATLAB version (using a perplexity of 4.5), I thought we could use those as a reference. These are the results as computed by the MATLAB version:

0.334793
0.235116
0.237428
0.322998
0.224368
0.788499

As you can see, the outlier probabilities produced by your version are closer to the MATLAB implementation than mine, but they're still not quite the same. (It might very well be that the MATLAB implementation is incorrect.) What does your Scala implementation produce?
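
One way to quantify "closer" is the mean absolute difference to the MATLAB reference. This is only a sketch using the three result lists quoted above; the helper name mean_abs_diff is hypothetical.

```python
import numpy as np

# Outlier probabilities copied from the three runs above (perplexity 4.5).
matlab = np.array([0.334793, 0.235116, 0.237428, 0.322998, 0.224368, 0.788499])
master = np.array([0.34863044, 0.25228690, 0.25312064,
                   0.33462241, 0.23113441, 0.64961863])
patched = np.array([0.35885434, 0.24096533, 0.23969990,
                    0.33949115, 0.19634946, 0.76447110])

def mean_abs_diff(a, b):
    """Mean absolute element-wise difference between two result vectors."""
    return float(np.mean(np.abs(a - b)))

print(round(mean_abs_diff(master, matlab), 4))   # 0.034
print(round(mean_abs_diff(patched, matlab), 4))  # 0.0168
```

Most of the gap for the master branch comes from the last point, the obvious outlier at (8.00, 2.00).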

Thanks again,

Jeroen

@Fokko
Contributor Author

Fokko commented Jul 16, 2015

Hi Jeroen,

Thanks for the reply, and of course many thanks for sharing the Python implementation; it allowed me to write unit tests, which uncovered a difference in the results. My results on the toy data set are:

0.19634928222978057
0.23970253544493803
0.2409667102197638
0.3394894713889564
0.35885258224479155
0.7644688448450071

These are approximately the same as the results of the Python implementation.
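
The Scala values above appear to be listed in ascending order, whereas the Python output follows the row order of toy.csv, so a direct comparison needs a sort first. A small sketch of that check, with an absolute tolerance of 1e-5 chosen here purely for illustration:

```python
import numpy as np

# Patched Python output (row order of toy.csv) and Scala output as quoted above.
python_patched = np.array([0.35885434, 0.24096533, 0.23969990,
                           0.33949115, 0.19634946, 0.76447110])
scala = np.array([0.19634928222978057, 0.23970253544493803, 0.2409667102197638,
                  0.3394894713889564, 0.35885258224479155, 0.7644688448450071])

# Sort both so the comparison does not depend on output order.
same = np.allclose(np.sort(python_patched), np.sort(scala), rtol=0.0, atol=1e-5)
print(same)  # True
```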

The Scala implementation is part of my master's thesis, which focuses on scaling outlier detection. The Stochastic Outlier Selection algorithm suits this very well, because the Spark framework is very fast on iterative jobs.

What tolerance are you using in the MATLAB script? In the Scala implementation I set it to zero, which might take some additional iterations and might affect the final output.

@jeroenjanssens
Owner

It's very reassuring that your Scala implementation produces the same results as the improved Python one. I'd be comfortable merging your changes. I would have to dig for the MATLAB implementation as it's been a few years since I've touched it. I think it's not zero, but a very small number, but it would be interesting to look this up. I'll get back to you on this.

I'm honored that you're implementing SOS in Scala. (It's a shame I never submitted this to a journal.) Do you know of any other outlier detection algorithms in Spark?

@Fokko
Contributor Author

Fokko commented Jul 16, 2015

Sure, I am curious where the difference in results comes from. Please let me know.

As far as I know, outlier detection is not yet implemented in Apache Spark. Spark's own MLlib has some general machine learning tools, including unsupervised algorithms such as k-means. Recently a repository for third-party Spark packages has been set up. I am planning to submit the SOS algorithm to this repo when it is finished; I still need to do some testing and cleanup.

Algorithms like LOF are quite hard to implement in Apache Spark because the results depend on the n-th nearest neighbors, which would require multiple shuffles across the data set. SOS, by contrast, works on rows of data, which makes it easier to distribute the computations across multiple nodes.

The main problem, however, with both LOF and SOS, is the computational cost of the distance matrix, which is quadratic in the input size.
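
To make the quadratic cost concrete, here is a minimal sketch that builds the full dissimilarity matrix for the toy data set. Squared Euclidean distance is an assumption made for the example, and the function name squared_distance_matrix is hypothetical.

```python
import numpy as np

def squared_distance_matrix(X):
    """Return D with D[i, j] = ||x_i - x_j||^2 for an (n, d) array X.

    Memory and time grow with n * n, which is the bottleneck noted above.
    """
    sq = np.sum(X ** 2, axis=1)
    D = sq[:, None] - 2.0 * (X @ X.T) + sq[None, :]
    return np.maximum(D, 0.0)  # clip tiny negatives caused by rounding

# Toy data set from earlier in this thread.
X = np.array([[1.00, 1.00], [3.00, 1.25], [3.00, 3.00],
              [1.00, 3.00], [2.25, 2.25], [8.00, 2.00]])
D = squared_distance_matrix(X)
print(D.shape)  # (6, 6): n rows, each of which can be processed independently
```

Once D exists, each row D[i] can be handled on its own, which is the property that makes the remaining steps easy to distribute.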

@jeroenjanssens
Owner

OK, so in the MATLAB version, the tolerance was set to 1e-20 with at most 400 tries, whereas in the Python version these are 1e-5 and 5000, respectively. Changing these values in the Python version doesn't influence the results, so I'm not sure what's causing the difference. However, I'm satisfied that the Python and Scala versions now produce the same results, so I'll merge your pull request. Thanks again for your help!
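
For readers unfamiliar with where the tolerance and the maximum number of tries enter, the following is a rough sketch of a per-row binary search for beta. It is not the repository's code: the function names row_entropy and find_beta are hypothetical, beta is a plain float (so no aliasing can occur), and the input is assumed to be the dissimilarities from one point to all other points.

```python
import numpy as np

def row_entropy(d_row, beta):
    """Shannon entropy (in nats) of the affinity distribution for one row."""
    a = np.exp(-d_row * beta)
    sum_a = np.sum(a)
    return np.log(sum_a) + beta * np.sum(d_row * a) / sum_a

def find_beta(d_row, perplexity, tol=1e-5, max_tries=5000):
    """Binary search for the beta whose entropy matches log(perplexity)."""
    log_u = np.log(perplexity)
    beta, betamin, betamax = 1.0, -np.inf, np.inf
    h = row_entropy(d_row, beta)
    tries = 0
    while abs(h - log_u) > tol and tries < max_tries:
        if h - log_u > 0:   # entropy too high: increase precision
            betamin = beta
            beta = beta * 2.0 if np.isinf(betamax) else (beta + betamax) / 2.0
        else:               # entropy too low: decrease precision
            betamax = beta
            beta = beta / 2.0 if np.isinf(betamin) else (beta + betamin) / 2.0
        h = row_entropy(d_row, beta)
        tries += 1
    return beta, float(np.exp(h)), tries

# Squared distances from the first toy point (1.00, 1.00) to the other five.
d_row = np.array([4.0625, 8.0, 4.0, 3.125, 50.0])
beta, reached, tries = find_beta(d_row, perplexity=4.5)
print(beta, reached, tries)
```

With a tolerance as small as 1e-20, floating-point precision will usually end the search at max_tries rather than at the tolerance, so in practice the two settings interact.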

jeroenjanssens added a commit that referenced this pull request Jul 21, 2015
Added the .copy operator to ensure passing by value.
@jeroenjanssens jeroenjanssens merged commit 871b1f7 into jeroenjanssens:master Jul 21, 2015