Strange behavior of apoc.agg.statistics #534

Closed
ivan-kleshnin opened this issue Nov 13, 2023 · 4 comments
Labels
bug Something isn't working

Comments

@ivan-kleshnin

ivan-kleshnin commented Nov 13, 2023

apoc.agg.statistics looks broken, at least with custom percentiles:

[Image: Screenshot 2023-11-13 at 13 41 24]

Please correct me if I'm wrong. To my understanding, passing 0.1 as a percentile asks for the threshold that separates the lowest 10% of the population from the rest, and the 0.5 percentile is the median.

There is no way the values for 0.1 and 0.25 could be below min (or, barring ties, even equal to it).

Has a rounding issue gone too far, or is it something else? The dispersion is large, yes, but the median and the other percentiles should be unaffected by outliers 🤔
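
For reference, the expected ordering can be checked with Cypher's built-in percentileDisc() aggregation on a small literal list. The sample numbers are made up for illustration and are not taken from the screenshot. Since percentileDisc() returns a value that actually occurs in the input, each result should fall between min and max:

```cypher
// Illustrative data only; not from the reported dataset.
UNWIND [3, 1, 4, 1, 5, 9, 2, 6, 5, 3] AS x
RETURN min(x)                 AS min,
       percentileDisc(x, 0.1) AS p10,
       percentileDisc(x, 0.5) AS median,
       max(x)                 AS max
```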

@ivan-kleshnin ivan-kleshnin added the bug Something isn't working label Nov 13, 2023
@mnd999 mnd999 transferred this issue from neo4j/neo4j Nov 14, 2023
@loveleif
Contributor

Thanks for taking the time to report! I agree, this looks wrong. I'll make sure the relevant team is notified. Sorry for the inconvenience.

@ivan-kleshnin
Author

Any news on this?

@gem-neo4j
Contributor

Hi! Sorry for the slow response. I took a look, and APOC simply delegates to another library, HdrHistogram. I would assume this is a rounding error and that the value should be the min value. APOC uses the getValueAtPercentile() method of the DoubleHistogram class: http://hdrhistogram.org/.
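
One way to check the rounding hypothesis might be to run apoc.agg.statistics next to the built-in aggregations on the same rows and compare the results. This is a minimal sketch, not a confirmed diagnosis. It assumes the User/Specialization pattern and the articlerank property used in the query later in this thread, and it assumes the APOC result map exposes percentiles under string keys such as "0.1":

```cypher
MATCH (u:User)--(s:Specialization {name: "Web Engineer"}) // adjust to your data
WITH apoc.agg.statistics(u.articlerank, [0.1, 0.25, 0.5]) AS stats,
     min(u.articlerank)                 AS exactMin,
     percentileDisc(u.articlerank, 0.1) AS exactP10
RETURN exactMin,
       stats.min    AS apocMin,
       stats["0.1"] AS apocP10, // assumed key name in the APOC result map
       exactP10
```

If apocP10 comes out below exactMin while exactP10 does not, that would point at the histogram-based estimate rather than at the data.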

@ivan-kleshnin
Author

ivan-kleshnin commented May 14, 2024

Thank you for the response. Here's an approximate Cypher version of statistics:

MATCH (u:User)--(s:Specialization {name: "Web Engineer"}) // customize this part
WITH u
ORDER BY u.articlerank
WITH collect(u) AS us
WITH us[0] AS `u-0.0`,
     us[size(us) * 1 / 10] AS `u-0.1`,
     us[size(us) * 2 / 10] AS `u-0.2`,
     us[size(us) * 3 / 10] AS `u-0.3`,
     us[size(us) * 4 / 10] AS `u-0.4`,
     us[size(us) * 5 / 10] AS `u-0.5`,
     us[size(us) * 6 / 10] AS `u-0.6`,
     us[size(us) * 7 / 10] AS `u-0.7`,     
     us[size(us) * 75 / 100] AS `u-0.75`, 
     us[size(us) * 8 / 10] AS `u-0.8`,   
     us[size(us) * 85 / 100] AS `u-0.85`,                 
     us[size(us) * 9 / 10] AS `u-0.9`,
     us[size(us) * 95 / 100] AS `u-0.95`,  
     us[size(us) * 97 / 100] AS `u-0.97`,  
     us[size(us) * 98 / 100] AS `u-0.98`,  
     us[size(us) * 99 / 100] AS `u-0.99`,  
     us[size(us) - 1] AS `u-1.0`  
WITH {
  `0.0`: `u-0.0`.articlerank,  
  `0.1`: `u-0.1`.articlerank,
  `0.2`: `u-0.2`.articlerank,
  `0.3`: `u-0.3`.articlerank,
  `0.4`: `u-0.4`.articlerank,  
  `0.5`: `u-0.5`.articlerank,
  `0.6`: `u-0.6`.articlerank,
  `0.7`: `u-0.7`.articlerank,
  `0.75`: `u-0.75`.articlerank,
  `0.8`: `u-0.8`.articlerank,
  `0.85`: `u-0.85`.articlerank,  
  `0.9`: `u-0.9`.articlerank,
  `0.95`: `u-0.95`.articlerank,
  `0.97`: `u-0.97`.articlerank,  
  `0.98`: `u-0.98`.articlerank,    
  `0.99`: `u-0.99`.articlerank,    
  `1.0`: `u-1.0`.articlerank    
} AS r
RETURN r

It returns roughly the same data as apoc.agg.statistics (minus rounding), so the distribution reported by APOC is, in general, correct. The remaining mistake in the APOC output is:

min value (sometimes) > 0.0 value
min value (sometimes) > 0.1 value
...

and

...
max value (sometimes) < 0.99 value
max value (sometimes) < 1.0 value

which are mathematically impossible. The error comes from rounding, which somehow differs between the min/max values and the 0.0/1.0 percentile values.

As a workaround, users should manually:

  • Replace all percentiles < min with min
  • Replace all percentiles > max with max
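
The clamping described above could be done directly in Cypher with CASE expressions. This is a minimal sketch under assumptions: the match pattern and articlerank property are the ones from the query above, the percentile list is arbitrary, the p10Clamped/p99Clamped names are made up for illustration, and the "0.1"/"0.99" keys are assumed to be how the APOC result map names its percentiles.

```cypher
MATCH (u:User)--(s:Specialization {name: "Web Engineer"}) // customize this part
WITH apoc.agg.statistics(u.articlerank, [0.1, 0.99]) AS stats
WITH stats,
     stats["0.1"]  AS p10, // assumed key names
     stats["0.99"] AS p99
RETURN stats.min AS min,
       // Clamp each percentile into the [min, max] range.
       CASE WHEN p10 < stats.min THEN stats.min
            WHEN p10 > stats.max THEN stats.max
            ELSE p10 END AS p10Clamped,
       CASE WHEN p99 < stats.min THEN stats.min
            WHEN p99 > stats.max THEN stats.max
            ELSE p99 END AS p99Clamped,
       stats.max AS max
```

Clamping only hides the inconsistency at the edges. Percentiles that already lie inside the range are returned unchanged, including whatever rounding they carry.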
