kstest uses wrong n for p-value calculation

The `kstest` function creates an empirical CDF (using F_n = ecdf(...)) and then calculates the value D_n = sup_x |F_n - F| to a computed or given CDF F. Fn and F have the same size.

The test statistics D_n is the supremum of all differences between F_n and F, and the set of these differences has cardinality n. Therefore, n should be the size of the second output from the call to `ecdf` (line [142](https://github.com/gnu-octave/statistics/blob/ebbc6ec3317c008008cdf37116dbfe9186e7e7be/inst/kstest.m#L142), where x is set to the ordinate of `ecdf`), because there are just that many CDF-differences calculated (line [191](https://github.com/gnu-octave/statistics/blob/ebbc6ec3317c008008cdf37116dbfe9186e7e7be/inst/kstest.m#L191) interpolates the CDF to x). But instead, n is taken as the size of the _input_ x to `kstest` (line [141](https://github.com/gnu-octave/statistics/blob/ebbc6ec3317c008008cdf37116dbfe9186e7e7be/inst/kstest.m#L141), just before the call to `ecdf`). Then this n is used in the calculation of the pValue (lines [216](https://github.com/gnu-octave/statistics/blob/ebbc6ec3317c008008cdf37116dbfe9186e7e7be/inst/kstest.m#L216), [242](https://github.com/gnu-octave/statistics/blob/ebbc6ec3317c008008cdf37116dbfe9186e7e7be/inst/kstest.m#L242) or [247](https://github.com/gnu-octave/statistics/blob/ebbc6ec3317c008008cdf37116dbfe9186e7e7be/inst/kstest.m#L247)).

This does not generate an error - But it produces wrong results. Attached is a script to reproduce it. It creates an array R of poisson distributed random numbers and plots the `ecdf(R)` together with a calculated CDF of the poisson distribution with same \lambda. The plot shows clearly, that the CDF's are very close. But `kstest` rejects the H0 hypothesis, saying that R is not poisson distributed, and gives a very low p value:

```
pkg load statistics

R = poissrnd (1,100,1);
[F,X,Flo,Fup] = ecdf (R);
Xp = linspace (min (X), max (X), 1000)';
P = makedist ("poisson", "lambda", 1);
Fp = cdf(P, Xp);

figure
hold on
stairs (X, F, "-b")
stairs (X, Flo, "--g")
stairs (X, Fup, "--c")
stairs (Xp, Fp, "-r")
legend (
    "edcf(R)",
    "lower bound of ecdf",
    "upper bound of ecdf",
    "calculated poisson CDF",
    "location", "east")

[h,p,ks,cv] = kstest(R, "cdf", @P.cdf, "tail", "unequal")
```

Output:
```
h = 1
p = 1.8543e-12
ks = 0.3679
cv = 0.1340
```
![poisson_cdf](https://github.com/user-attachments/assets/aff31841-13db-48c6-a682-02837d06e612)


When I swap the lines [141](https://github.com/gnu-octave/statistics/blob/ebbc6ec3317c008008cdf37116dbfe9186e7e7be/inst/kstest.m#L141) and [142](https://github.com/gnu-octave/statistics/blob/ebbc6ec3317c008008cdf37116dbfe9186e7e7be/inst/kstest.m#L142) in `kstest`, so that n is taken as the length of the ecdf ordinates instead of the input x, the result looks better:

```
>> [h,p,ks,cv] = kstest(R, "cdf", @P.cdf, "tail", "unequal")
h = 0
p = 0.3108
ks = 0.3679
cv = 0.5193
```

And this is, btw, also more in line with the result that you get with R, using the KSgeneral package:

```
> R = rpois(512,1)
> F = stepfun(c(0:100), c(0, ppois(0:100,1)))
> KSgeneral::disc_ks_test(R,F)

	One-sample Kolmogorov-Smirnov test

data:  R
D = 0.047879, p-value = 0.4922
alternative hypothesis: two-sided
```

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

kstest uses wrong n for p-value calculation #164

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

kstest uses wrong n for p-value calculation #164

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions