Skip to content

kstest uses wrong n for p-value calculation #164

@ecky-l

Description

@ecky-l

The kstest function creates an empirical CDF (using F_n = ecdf(...)) and then calculates the value D_n = sup_x |F_n - F| to a computed or given CDF F. Fn and F have the same size.

The test statistics D_n is the supremum of all differences between F_n and F, and the set of these differences has cardinality n. Therefore, n should be the size of the second output from the call to ecdf (line 142, where x is set to the ordinate of ecdf), because there are just that many CDF-differences calculated (line 191 interpolates the CDF to x). But instead, n is taken as the size of the input x to kstest (line 141, just before the call to ecdf). Then this n is used in the calculation of the pValue (lines 216, 242 or 247).

This does not generate an error - But it produces wrong results. Attached is a script to reproduce it. It creates an array R of poisson distributed random numbers and plots the ecdf(R) together with a calculated CDF of the poisson distribution with same \lambda. The plot shows clearly, that the CDF's are very close. But kstest rejects the H0 hypothesis, saying that R is not poisson distributed, and gives a very low p value:

pkg load statistics

R = poissrnd (1,100,1);
[F,X,Flo,Fup] = ecdf (R);
Xp = linspace (min (X), max (X), 1000)';
P = makedist ("poisson", "lambda", 1);
Fp = cdf(P, Xp);

figure
hold on
stairs (X, F, "-b")
stairs (X, Flo, "--g")
stairs (X, Fup, "--c")
stairs (Xp, Fp, "-r")
legend (
    "edcf(R)",
    "lower bound of ecdf",
    "upper bound of ecdf",
    "calculated poisson CDF",
    "location", "east")

[h,p,ks,cv] = kstest(R, "cdf", @P.cdf, "tail", "unequal")

Output:

h = 1
p = 1.8543e-12
ks = 0.3679
cv = 0.1340

poisson_cdf

When I swap the lines 141 and 142 in kstest, so that n is taken as the length of the ecdf ordinates instead of the input x, the result looks better:

>> [h,p,ks,cv] = kstest(R, "cdf", @P.cdf, "tail", "unequal")
h = 0
p = 0.3108
ks = 0.3679
cv = 0.5193

And this is, btw, also more in line with the result that you get with R, using the KSgeneral package:

> R = rpois(512,1)
> F = stepfun(c(0:100), c(0, ppois(0:100,1)))
> KSgeneral::disc_ks_test(R,F)

	One-sample Kolmogorov-Smirnov test

data:  R
D = 0.047879, p-value = 0.4922
alternative hypothesis: two-sided

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions