# One Sample Test for Median: The Sign Test

Non-parametric tests are normally based on **ranks** of the data samples, and test hypotheses relating to
quantiles of the probability distribution representing the population from which the data are drawn.
Specifically, tests concern the **population median**, $\eta$, where 

$$ \text{Pr}[\text{Observation} \leq \eta] = 1/2$$

The *sample median*, $x_{\text{MED}}$, is the mid-point of the sorted sample; if the data $x_1, \dots,x_n$ are sorted into *ascending order*, then 

$$ x_{\text{MED}} = \begin{cases}
x_{m+1}  & \text{$n$ odd, $n=2m+1$} \\
\frac{x_m + x_{m+1}}{2} & \text{$n$ even, $n=2m$}.
\end{cases}$$
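As a rough illustration of the sample median, the sketch below sorts a small sample and picks the middle value (odd $n$) or averages the two middle values (even $n$). The function name `sample_median` and the example numbers are made up for this illustration; `np.median` is used only as a cross-check.

```python
import numpy as np

def sample_median(values):
    """Sample median of a list of numbers (hypothetical helper)."""
    xs = sorted(values)          # ascending order
    n = len(xs)
    m = n // 2
    if n % 2 == 1:
        return xs[m]             # odd n: the single middle value
    return (xs[m - 1] + xs[m]) / 2   # even n: mean of the two middle values

odd_sample = [7, 1, 5]           # made-up data
even_sample = [7, 1, 5, 3]       # made-up data
print(sample_median(odd_sample), sample_median(even_sample))
print(np.median(odd_sample), np.median(even_sample))  # cross-check with NumPy
```

Note that Python lists are indexed from 0, so the positions in the code are shifted by one relative to the 1-based subscripts in the formula.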

For a single sample of size $n$, to test the hypothesis $\eta = \eta_0$ for some specified value $\eta_0$ we use the *Sign Test*. The test statistic $S$ depends on the alternative hypothesis $H_a$.

(a) For *one-sided* tests, to test 

- $H_0: \eta = \eta_0$

- $H_a: \eta > \eta_0$

we define test statistic $S$ by $$S = \text{Number of observations greater than $\eta_0$}$$

whereas to  test

- $H_0: \eta = \eta_0$

- $H_a: \eta < \eta_0$

we define $S$ by $$S = \text{Number of observations less than $\eta_0$}$$

If $H_0$ is *true*, it follows that $$S \sim \text{Binom}(n,1/2).$$

The $p$-value is defined by $$p = \text{Pr}[X \geq S]$$

where $X \sim \text{Binom}(n,1/2)$. The rejection region for significance level $\alpha$ is defined implicitly by the rule $$\text{Reject } H_0 \text{ if } p \leq \alpha.$$
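The one-sided procedure can be sketched in a few lines of Python. This is a minimal sketch, not a definitive implementation: the function name `sign_test_one_sided`, its `alternative` argument and the sample values are all invented for illustration, and observations equal to $\eta_0$ are dropped before counting. The tail probability $\text{Pr}[X \geq S]$ is obtained from `scipy.stats.binom.sf`, which returns $\text{Pr}[X > k]$, hence the `s - 1`.

```python
from scipy import stats

def sign_test_one_sided(data, eta0, alternative="greater"):
    """One-sided sign test (hypothetical helper); returns (S, n, p-value)."""
    kept = [x for x in data if x != eta0]     # drop observations equal to eta0
    n = len(kept)
    if alternative == "greater":              # H_a: eta > eta0
        s = sum(x > eta0 for x in kept)
    else:                                     # H_a: eta < eta0
        s = sum(x < eta0 for x in kept)
    # sf(k) = Pr[X > k], so Pr[X >= s] = sf(s - 1) for X ~ Binom(n, 1/2)
    p_value = stats.binom.sf(s - 1, n, 0.5)
    return s, n, p_value

data = [12, 15, 9, 18, 20, 14, 17, 11]        # made-up sample
s, n, p_value = sign_test_one_sided(data, eta0=10, alternative="greater")
print(s, n, p_value)
```

With these made-up numbers, 7 of the 8 observations exceed 10, so the $p$-value is $\text{Pr}[X \geq 7] = 9/256 \approx 0.035$ and $H_0$ would be rejected at $\alpha = 0.05$.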

(b) For a *two-sided* test, 

- $H_0: \eta = \eta_0$

- $H_a: \eta \neq \eta_0$

we define the test statistic by $$S= \max \{ S_1, S_2\}$$

where $S_1$ and $S_2$ are the counts of the number of observations less than, and greater than, $\eta_0$ respectively. The $p$-value is defined by
$$ p = 2\text{Pr}[X \geq S] $$
where $X \sim \text{Binom}(n,1/2)$.
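A possible sketch of the two-sided version follows. The helper name `sign_test_two_sided` and the data are hypothetical. One assumption goes beyond the formula above: the doubled tail probability is capped at 1, since $2\text{Pr}[X \geq S]$ can exceed 1 when $S$ is close to $n/2$.

```python
from scipy import stats

def sign_test_two_sided(data, eta0):
    """Two-sided sign test (hypothetical helper); returns (S, n, p-value)."""
    s1 = sum(x < eta0 for x in data)          # observations below eta0
    s2 = sum(x > eta0 for x in data)          # observations above eta0
    n = s1 + s2                               # observations equal to eta0 are excluded
    s = max(s1, s2)
    tail = stats.binom.sf(s - 1, n, 0.5)      # Pr[X >= s], X ~ Binom(n, 1/2)
    p_value = min(1.0, 2 * tail)              # assumption: cap the p-value at 1
    return s, n, p_value

data = [3, 8, 9, 10, 12, 13, 15, 16, 18, 21]  # made-up sample
s, n, p_value = sign_test_two_sided(data, eta0=7)
print(s, n, p_value)
```

Here one value lies below 7 and nine lie above, so $S = 9$, $n = 10$ and $p = 2 \times 11/1024 \approx 0.021$.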

**Notes:**

1. The only assumption behind the test is that the data are drawn independently from a continuous distribution.

2. If any data are equal to $\eta_0$, we discard them before carrying out the test.

3. *Large sample approximation*. If $n$ is large (say $n \geq 30$) and $X \sim \text{Binom}(n,p)$, then it can be shown that, approximately, 

$$X \sim \text{Normal}(np, np(1-p))$$

where the second parameter is the variance.

Thus for the sign test, where $p = 1/2$, we can use the test statistic

$$ Z = \frac{S - n/2 }{\sqrt{n \times 1/2 \times 1/2 }} = \frac{S- n/2}{\sqrt{n} \times 1/2} $$

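and note that if $H_0$ is true, then approximately $$ Z \sim \text{Normal}(0,1), $$

so that the test at $\alpha = 0.05$ uses the following critical values $C_R$. Because $S$ as defined above counts the observations on the side favoured by $H_a$, each test rejects $H_0$ for large positive $Z$:

- $H_a: \eta > \eta_0$ then $C_R = 1.645$ (reject if $Z \geq 1.645$)

- $H_a: \eta < \eta_0$ then $C_R = 1.645$ with $S$ the number of observations less than $\eta_0$; equivalently $C_R = -1.645$ (reject if $Z \leq -1.645$) if $S$ instead counts the observations greater than $\eta_0$

- $H_a: \eta \neq \eta_0$ then $C_R = 1.960$ with $S = \max\{S_1, S_2\}$; equivalently $C_R = \pm 1.960$ if $S$ counts the observations on one fixed side of $\eta_0$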



4. For the large sample approximation, it is common to make a *continuity correction*, where we replace $S$ by $S - 1/2$ in the definition of $Z$:

$$ Z = \frac{S - 1/2 - n/2}{\sqrt{n} \times 1/2} $$
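To see how the normal approximation compares with the exact binomial tail, here is a small sketch. The helper `sign_test_z` and the values $n = 40$, $S = 27$ are invented for illustration; the upper-tail probability from `scipy.stats.norm.sf` is set against the exact $\text{Pr}[X \geq S]$ from `scipy.stats.binom.sf`.

```python
import math
from scipy import stats

def sign_test_z(s, n, continuity=True):
    """Z statistic for the sign test, optionally continuity-corrected (hypothetical helper)."""
    correction = 0.5 if continuity else 0.0
    return (s - correction - n / 2) / (math.sqrt(n) / 2)

n, s = 40, 27                                 # made-up counts
z_plain = sign_test_z(s, n, continuity=False)
z_corrected = sign_test_z(s, n, continuity=True)

p_plain = stats.norm.sf(z_plain)              # upper-tail normal probability
p_corrected = stats.norm.sf(z_corrected)
p_exact = stats.binom.sf(s - 1, n, 0.5)       # exact Pr[X >= s]

print(z_plain, z_corrected)
print(p_plain, p_corrected, p_exact)
```

For these numbers the continuity-corrected value lies closer to the exact binomial tail probability than the uncorrected one, which is the usual motivation for the correction.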



In [2]:
people = {'salary': [29500,54000,54000,65600,70400,73600,78800,80400,91200,
                      94700,99200,500000]}

In [3]:
import pandas as pd
df = pd.DataFrame(people)

In [6]:
df.index = range(1, len(df) + 1)

In [7]:
df

| | salary |
|---|---|
| 1 | 29500 |
| 2 | 54000 |
| 3 | 54000 |
| 4 | 65600 |
| 5 | 70400 |
| 6 | 73600 |
| 7 | 78800 |
| 8 | 80400 |
| 9 | 91200 |
| 10 | 94700 |
| 11 | 99200 |
| 12 | 500000 |


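Suppose we hypothesize that the median is $60200$. Writing $p$ for the probability that an observation exceeds $60200$, the hypotheses are

- $H_0$: Median $= 60200$, equivalently $H_0: p = 0.50$

- $H_a$: Median $\neq 60200$, equivalently $H_a: p \neq 0.50$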


In [13]:
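import numpy as np
# Label each salary by its position relative to the hypothesized median.
conditions = [(df['salary'] > 60200), (df['salary'] < 60200), (df['salary'] == 60200)]
values = ['+', '-', '0']
# A string default keeps the result dtype consistent with the string labels.
df['sign'] = np.select(conditions, values, default='')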

In [14]:
df

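| | salary | sign |
|---|---|---|
| 1 | 29500 | - |
| 2 | 54000 | - |
| 3 | 54000 | - |
| 4 | 65600 | + |
| 5 | 70400 | + |
| 6 | 73600 | + |
| 7 | 78800 | + |
| 8 | 80400 | + |
| 9 | 91200 | + |
| 10 | 94700 | + |
| 11 | 99200 | + |
| 12 | 500000 | + |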


In [15]:
df['sign'].value_counts()

+    9
-    3
Name: sign, dtype: int64

If this were the actual median, we would expect about half the values to lie above it and half to lie below it; here 9 of the 12 values lie above and 3 lie below.
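To carry the two-sided sign test through for this example, the following sketch recomputes the counts and the $p$-value $2\text{Pr}[X \geq S]$ with $X \sim \text{Binom}(n, 1/2)$. It repeats the salary list so that it runs on its own; the variable names are chosen here for illustration.

```python
from scipy import stats

salaries = [29500, 54000, 54000, 65600, 70400, 73600, 78800, 80400, 91200,
            94700, 99200, 500000]
eta0 = 60200                                   # hypothesized median

below = sum(x < eta0 for x in salaries)        # S_1: number of '-' signs
above = sum(x > eta0 for x in salaries)        # S_2: number of '+' signs
n = below + above                              # no salary equals eta0 here
s = max(below, above)

# Two-sided p-value: 2 * Pr[X >= s] with X ~ Binom(n, 1/2)
p_value = 2 * stats.binom.sf(s - 1, n, 0.5)
print(s, n, round(p_value, 4))
```

With $S = 9$ and $n = 12$, the $p$-value is $2 \times 299/4096 \approx 0.146$, which is larger than $0.05$, so the data do not give grounds to reject the hypothesized median of $60200$ at the 5% level.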

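A car manufacturer claims that no more than 10% of their cars are unsafe. Of 15 cars inspected for safety, 3 were found to be unsafe. Test the manufacturer’s claim, that is, test $H_0: p \leq 0.1$ against $H_a: p > 0.1$, where $p$ here denotes the proportion of unsafe cars.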

In [16]:
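from scipy import stats
# binom_test was deprecated and later removed from SciPy; binomtest returns a result object with a pvalue attribute.
stats.binomtest(3, n=15, p=0.1, alternative='greater').pvalue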

0.18406106910639114

The null hypothesis cannot be rejected at the 5% level of significance because the returned p-value (about 0.184) is greater than the significance level of 0.05.
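The $p$-value for the car example is the upper-tail probability $\text{Pr}[X \geq 3]$ for $X \sim \text{Binom}(15, 0.1)$. As a cross-check that does not rely on SciPy, it can be summed directly from the binomial probability mass function; the variable names below are illustrative.

```python
from math import comb

n, k, p0 = 15, 3, 0.1     # cars inspected, unsafe cars found, claimed proportion

# Pr[X >= k] = sum over j = k..n of C(n, j) * p0^j * (1 - p0)^(n - j)
p_value = sum(comb(n, j) * p0**j * (1 - p0)**(n - j) for j in range(k, n + 1))
print(p_value)
```

The sum agrees with the value returned by SciPy above, up to floating-point rounding.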