## Non-parametric Statistics

Non-parametric statistics are used when data do not meet the assumptions required for parametric tests (e.g., normal distribution). These methods do not assume a specific distribution for the data. This tutorial covers key concepts, mathematical background, and numerical examples.

### 1. Wilcoxon Signed-Rank Test

The Wilcoxon Signed-Rank Test is used to test whether the median of a single sample differs from a specified value or whether the median of the differences between paired samples is zero.

*Example:*

Suppose we have data on the scores of students before and after a tutoring program:

| Student | Before | After |
|---------|--------|-------|
| A       | 60     | 70    |
| B       | 65     | 75    |
| C       | 70     | 68    |
| D       | 75     | 72    |
| E       | 80     | 78    |

We want to test if there is a significant difference in scores before and after tutoring.

1. **Calculate Differences**: Compute the difference After − Before for each student.

| Student | Before | After | Difference |
|---------|--------|-------|------------|
| A       | 60     | 70    | 10         |
| B       | 65     | 75    | 10         |
| C       | 70     | 68    | -2         |
| D       | 75     | 72    | -3         |
| E       | 80     | 78    | -2         |

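2. **Rank the Absolute Differences**: Rank the absolute values of the differences from smallest to largest, ignoring zeros. Tied values share the average of the ranks they occupy: the two 2s occupy ranks 1 and 2, so each gets 1.5, and the two 10s occupy ranks 4 and 5, so each gets 4.5.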

| Difference | Absolute Difference | Rank |
|------------|---------------------|------|
| 10         | 10                  | 4.5  |
| 10         | 10                  | 4.5  |
| -2         | 2                   | 1.5  |
| -3         | 3                   | 3    |
| -2         | 2                   | 1.5  |

3. **Assign Signs to Ranks**: Assign the original sign of the differences to the ranks.

| Difference | Rank |
|------------|------|
| 10         | 4.5  |
| 10         | 4.5  |
| -2         | -1.5 |
| -3         | -3   |
| -2         | -1.5 |

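4. **Calculate Test Statistic**: Sum the ranks of the positive differences and the ranks of the negative differences separately, taking each sum as a magnitude.

$$ W^+ = 4.5 + 4.5 = 9 $$
$$ W^- = 1.5 + 3 + 1.5 = 6 $$

As a check, $W^+ + W^- = n(n+1)/2 = 15$. The test statistic is the smaller of the two sums, $W = \min(W^+, W^-) = 6$.

5. **Determine Significance**: Compare $W$ to the critical value from the Wilcoxon Signed-Rank Table for the given sample size. The null hypothesis is rejected when $W$ is less than or equal to the critical value.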
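
The steps above can be sketched in Python. This is a minimal illustration rather than a reference implementation: the variable names (`diff`, `w_plus`, `w_minus`, `w_stat`) are chosen here for readability, and `scipy.stats.rankdata` is used because its default method assigns average ranks to ties. Both rank sums are reported as non-negative magnitudes.

```python
import numpy as np
from scipy.stats import rankdata

# Scores from the tutoring example
before = np.array([60, 65, 70, 75, 80])
after = np.array([70, 75, 68, 72, 78])

# Step 1: paired differences (After - Before), dropping any zeros
diff = after - before
diff = diff[diff != 0]

# Step 2: rank the absolute differences; ties get the average rank
ranks = rankdata(np.abs(diff))

# Steps 3-4: sum the ranks of positive and negative differences separately
w_plus = ranks[diff > 0].sum()
w_minus = ranks[diff < 0].sum()

# The test statistic is the smaller of the two rank sums
w_stat = min(w_plus, w_minus)

print(f"Ranks: {ranks}")
print(f"W+ = {w_plus}, W- = {w_minus}, W = {w_stat}")
```

The two sums should always add up to $n(n+1)/2$, which is a quick way to catch ranking mistakes.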

**Key Properties:**

1. **Distribution-Free**: Does not assume normality, although it does assume that the differences are distributed symmetrically about their median.
2. **Paired Data**: Suitable for paired samples.
3. **Rank-Based**: Uses ranks rather than actual data values.

### 2. Mann-Whitney U Test

The Mann-Whitney U Test is used to test whether two independent samples come from the same distribution.

*Example:*

Suppose we have test scores from two groups of students:

| Group A | Group B |
|---------|---------|
| 60      | 70      |
| 65      | 75      |
| 70      | 80      |
| 75      | 85      |

We want to test if there is a significant difference between the two groups.

1. **Combine and Rank Data**: Combine the scores from both groups and rank them from smallest to largest. Tied scores share the average of the ranks they occupy: the two 70s occupy ranks 3 and 4, so each gets 3.5, and the two 75s occupy ranks 5 and 6, so each gets 5.5.

| Group | Score | Rank |
|-------|-------|------|
| A     | 60    | 1    |
| A     | 65    | 2    |
| A     | 70    | 3.5  |
| B     | 70    | 3.5  |
| A     | 75    | 5.5  |
| B     | 75    | 5.5  |
| B     | 80    | 7    |
| B     | 85    | 8    |

2. **Calculate U Statistic**:

$$ U_A = n_A n_B + \frac{n_A (n_A + 1)}{2} - R_A $$
$$ U_B = n_A n_B + \frac{n_B (n_B + 1)}{2} - R_B $$

Where:
- $n_A$, $n_B$ = sample sizes of groups A and B
- $R_A$, $R_B$ = sum of ranks for groups A and B

For Group A, $R_A = 1 + 2 + 3.5 + 5.5 = 12$:

$$ U_A = 4 \times 4 + \frac{4 \times (4 + 1)}{2} - 12 $$
$$ U_A = 16 + 10 - 12 = 14 $$

For Group B, $R_B = 3.5 + 5.5 + 7 + 8 = 24$:

$$ U_B = 4 \times 4 + \frac{4 \times (4 + 1)}{2} - 24 $$
$$ U_B = 16 + 10 - 24 = 2 $$

As a check, $U_A + U_B = n_A n_B = 16$. The smaller of $U_A$ and $U_B$ is the test statistic, so $U = 2$.

3. **Determine Significance**: Compare the test statistic to the critical value from the Mann-Whitney U Table for the given sample sizes.
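
The $U$ formulas above can be evaluated directly in Python. The sketch below is illustrative only; the names `r_a`, `r_b`, `u_a`, `u_b`, and `u_stat` are chosen here. It pools the two groups and ranks them with `scipy.stats.rankdata`, which by default gives tied scores the average of the ranks they occupy. That matters for this data set, because 70 and 75 each appear in both groups.

```python
import numpy as np
from scipy.stats import rankdata

group_a = np.array([60, 65, 70, 75])
group_b = np.array([70, 75, 80, 85])
n_a, n_b = len(group_a), len(group_b)

# Rank the pooled scores; tied scores receive average ranks
ranks = rankdata(np.concatenate([group_a, group_b]))

# Rank sums for each group (group A occupies the first n_a positions)
r_a = ranks[:n_a].sum()
r_b = ranks[n_a:].sum()

# U statistics from the formulas in the text
u_a = n_a * n_b + n_a * (n_a + 1) / 2 - r_a
u_b = n_a * n_b + n_b * (n_b + 1) / 2 - r_b

# The test statistic is the smaller of the two
u_stat = min(u_a, u_b)

print(f"R_A = {r_a}, R_B = {r_b}")
print(f"U_A = {u_a}, U_B = {u_b}, U = {u_stat}")
```

Because $U_A + U_B = n_A n_B$ always holds, only one of the two statistics needs to be computed from the ranks.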

**Key Properties:**

1. **Distribution-Free**: Does not assume a specific distribution for the data.
2. **Independent Samples**: Suitable for two independent samples.
3. **Rank-Based**: Uses ranks rather than actual data values.

### 3. Kruskal-Wallis H Test

The Kruskal-Wallis H Test is used to test whether three or more independent samples come from the same distribution.

*Example:*

Suppose we have test scores from three groups of students:

| Group A | Group B | Group C |
|---------|---------|---------|
| 60      | 70      | 80      |
| 65      | 75      | 85      |
| 70      | 80      | 90      |
| 75      | 85      | 95      |

We want to test if there is a significant difference between the three groups.

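1. **Combine and Rank Data**: Combine the scores from all groups and rank them from smallest to largest. Tied scores share the average of the ranks they occupy; here 70, 75, 80, and 85 each appear twice.

| Group | Score | Rank |
|-------|-------|------|
| A     | 60    | 1    |
| A     | 65    | 2    |
| A     | 70    | 3.5  |
| B     | 70    | 3.5  |
| A     | 75    | 5.5  |
| B     | 75    | 5.5  |
| B     | 80    | 7.5  |
| C     | 80    | 7.5  |
| B     | 85    | 9.5  |
| C     | 85    | 9.5  |
| C     | 90    | 11   |
| C     | 95    | 12   |

2. **Calculate H Statistic**:

$$ H = \frac{12}{N (N+1)} \sum \frac{R_i^2}{n_i} - 3 (N+1) $$

Where:
- $N$ = total number of observations
- $R_i$ = sum of ranks for group $i$
- $n_i$ = number of observations in group $i$

The rank sums are $R_A = 1 + 2 + 3.5 + 5.5 = 12$, $R_B = 3.5 + 5.5 + 7.5 + 9.5 = 26$, and $R_C = 7.5 + 9.5 + 11 + 12 = 40$.

$$ H = \frac{12}{12 \times 13} \left( \frac{12^2}{4} + \frac{26^2}{4} + \frac{40^2}{4} \right) - 3 \times 13 $$

$$ H = \frac{12}{156} \left( 36 + 169 + 400 \right) - 39 $$

$$ H = \frac{12}{156} \times 605 - 39 $$

$$ H = 46.5385 - 39 = 7.5385 $$

Because the data contain ties, $H$ is divided by the correction factor

$$ C = 1 - \frac{\sum (t_j^3 - t_j)}{N^3 - N} $$

where $t_j$ is the number of observations sharing the $j$-th tied value. With four tied pairs ($t_j = 2$ each):

$$ C = 1 - \frac{4 \times (2^3 - 2)}{12^3 - 12} = 1 - \frac{24}{1716} = 0.9860 $$

$$ H_{\text{corrected}} = \frac{7.5385}{0.9860} = 7.6454 $$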

3. **Determine Significance**: Compare the test statistic to the critical value from a Kruskal-Wallis H table or, for larger samples, to the chi-square distribution with $k - 1$ degrees of freedom, where $k$ is the number of groups (here $k - 1 = 2$).
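
The $H$ formula can be checked numerically. The following is a minimal sketch under a few assumptions: the variable names are chosen here, ranking uses `scipy.stats.rankdata` (average ranks for ties), and the standard tie correction $1 - \sum (t_j^3 - t_j) / (N^3 - N)$ is applied because several scores in this example occur in more than one group.

```python
import numpy as np
from scipy.stats import rankdata

groups = [
    np.array([60, 65, 70, 75]),  # Group A
    np.array([70, 75, 80, 85]),  # Group B
    np.array([80, 85, 90, 95]),  # Group C
]
sizes = [len(g) for g in groups]
pooled = np.concatenate(groups)
n_total = len(pooled)

# Rank the pooled scores, then split the ranks back into their groups
ranks = rankdata(pooled)
rank_groups = np.split(ranks, np.cumsum(sizes)[:-1])
rank_sums = [r.sum() for r in rank_groups]

# H statistic from the formula in the text (no tie correction yet)
h_raw = (
    12 / (n_total * (n_total + 1))
    * sum(rs**2 / n for rs, n in zip(rank_sums, sizes))
    - 3 * (n_total + 1)
)

# Tie correction: t_j is the number of observations sharing each distinct value
_, tie_counts = np.unique(pooled, return_counts=True)
tie_factor = 1 - (tie_counts**3 - tie_counts).sum() / (n_total**3 - n_total)
h_corrected = h_raw / tie_factor

print(f"Rank sums: {rank_sums}")
print(f"H (uncorrected) = {h_raw:.4f}")
print(f"H (tie-corrected) = {h_corrected:.4f}")
```

Values that occur only once contribute $1^3 - 1 = 0$ to the correction, so the factor equals 1 when there are no ties.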

**Key Properties:**

1. **Distribution-Free**: Does not assume a specific distribution for the data.
2. **Multiple Independent Samples**: Suitable for three or more independent samples.
3. **Rank-Based**: Uses ranks rather than actual data values.

### 4. Spearman's Rank Correlation

Spearman's Rank Correlation measures the strength and direction of the association between two ranked variables.

*Example:*

Suppose we have two sets of rankings for five students in math and science:

| Student | Math Rank | Science Rank |
|---------|-----------|--------------|
| A       | 1         | 2            |
| B       | 2         | 3            |
| C       | 3         | 1            |
| D       | 4         | 5            |
| E       | 5         | 4            |

1. **Calculate Rank Differences**: Compute the difference $d$ between the two ranks for each student (Math Rank minus Science Rank), then square it.

| Student | Math Rank | Science Rank | Difference ($d$) | $d^2$ |
|---------|-----------|--------------|------------------|-------|
| A       | 1         | 2            | -1               | 1     |
| B       | 2         | 3            | -1               | 1     |
| C       | 3         | 1            | 2                | 4     |
| D       | 4         | 5            | -1               | 1     |
| E       | 5         | 4            | 1                | 1     |

2. **Calculate Spearman's Rank Correlation Coefficient ($\rho$)**:

$$ \rho = 1 - \frac{6 \sum d^2}{n(n^2 - 1)} $$

Where:
- $d$ = difference between ranks
- $n$ = number of observations

With $\sum d^2 = 1 + 1 + 4 + 1 + 1 = 8$ and $n = 5$:

$$ \rho = 1 - \frac{6 \times 8}{5(25 - 1)} $$
$$ \rho = 1 - \frac{48}{120} $$
$$ \rho = 1 - 0.4 $$
$$ \rho = 0.6 $$
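
The same arithmetic can be written as a short Python sketch. The variable names are chosen here for illustration, and the shortcut formula used is exact only when neither variable contains tied ranks, which is the case in this example.

```python
import numpy as np

math_rank = np.array([1, 2, 3, 4, 5])
science_rank = np.array([2, 3, 1, 5, 4])

# Rank differences and their squares
d = math_rank - science_rank
sum_d2 = int((d**2).sum())
n = len(d)

# Spearman's rho from the shortcut formula (valid when there are no ties)
rho = 1 - 6 * sum_d2 / (n * (n**2 - 1))

print(f"d = {d}")
print(f"sum of d^2 = {sum_d2}")
print(f"rho = {rho:.4f}")
```

When ties are present, a safer route is to compute the Pearson correlation of the two rank vectors instead of using this shortcut.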

**Key Properties:**

1. **Non-Parametric**: Does not assume normal distribution.
2. **Rank-Based**: Uses ranks rather than actual data values.
3. **Correlation Measure**: Measures monotonic relationships.

### 5. Chi-Square Test for Independence

The Chi-Square Test for Independence tests whether two categorical variables are independent.

*Example:*

Suppose we have data on the preference for two brands of cereal among children and adults:

|          | Brand A | Brand B | Total |
|----------|---------|---------|-------|
| Children | 30      | 20      | 50    |
| Adults   | 20      | 30      | 50    |
| Total    | 50      | 50      | 100   |

We want to test if there is an association between age group and brand preference.

1. **Calculate Expected Frequencies**:

$$ E_{ij} = \frac{(\text{Row Total}_i) \times (\text{Column Total}_j)}{\text{Grand Total}} $$

For Children and Brand A:

$$ E_{11} = \frac{50 \times 50}{100} = 25 $$

For Children and Brand B:

$$ E_{12} = \frac{50 \times 50}{100} = 25 $$

For Adults and Brand A:

$$ E_{21} = \frac{50 \times 50}{100} = 25 $$

For Adults and Brand B:

$$ E_{22} = \frac{50 \times 50}{100} = 25 $$

2. **Calculate Chi-Square Statistic**:

$$ \chi^2 = \sum \frac{(O_{ij} - E_{ij})^2}{E_{ij}} $$

Where $O_{ij}$ and $E_{ij}$ are the observed and expected frequencies.

$$ \chi^2 = \frac{(30-25)^2}{25} + \frac{(20-25)^2}{25} + \frac{(20-25)^2}{25} + \frac{(30-25)^2}{25} $$
$$ \chi^2 = \frac{5^2}{25} + \frac{(-5)^2}{25} + \frac{(-5)^2}{25} + \frac{5^2}{25} $$
$$ \chi^2 = \frac{25}{25} + \frac{25}{25} + \frac{25}{25} + \frac{25}{25} $$
$$ \chi^2 = 1 + 1 + 1 + 1 $$
$$ \chi^2 = 4 $$

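3. **Determine Significance**: Compare the test statistic to the critical value from the Chi-Square Table with $(r - 1)(c - 1)$ degrees of freedom, where $r$ is the number of rows and $c$ is the number of columns. For this $2 \times 2$ table, that is $(2 - 1)(2 - 1) = 1$ degree of freedom.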
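
The calculation above can be reproduced step by step in Python. This is a sketch under stated assumptions: the variable names are chosen here, no continuity correction is applied (matching the hand calculation), and the 0.05 significance level used for the critical value is an illustrative choice, not one specified in the text.

```python
import numpy as np
from scipy.stats import chi2

# Observed counts: rows = Children, Adults; columns = Brand A, Brand B
observed = np.array([[30, 20],
                     [20, 30]])

row_totals = observed.sum(axis=1)
col_totals = observed.sum(axis=0)
grand_total = observed.sum()

# Expected counts: (row total x column total) / grand total for every cell
expected = np.outer(row_totals, col_totals) / grand_total

# Chi-square statistic without continuity correction
chi2_stat = ((observed - expected) ** 2 / expected).sum()

# Degrees of freedom: (rows - 1) * (columns - 1)
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)

# Critical value at an assumed 0.05 significance level, and the p-value
critical_value = chi2.ppf(0.95, dof)
p_value = chi2.sf(chi2_stat, dof)

print(f"Expected counts:\n{expected}")
print(f"Chi-square = {chi2_stat:.2f}, dof = {dof}")
print(f"Critical value (alpha = 0.05) = {critical_value:.3f}")
print(f"P-value = {p_value:.4f}")
```

Using `np.outer` builds the whole table of expected counts in one step, which scales to tables larger than $2 \times 2$ without extra code.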

**Key Properties:**

1. **Categorical Data**: Suitable for categorical data.
2. **Independence Test**: Tests for the independence of two variables.
3. **Observed vs. Expected**: Compares observed and expected frequencies.

### 6. Summary

Non-parametric statistics provide powerful tools for analyzing data without requiring specific distributional assumptions. The Wilcoxon Signed-Rank Test, Mann-Whitney U Test, Kruskal-Wallis H Test, Spearman's Rank Correlation, and Chi-Square Test for Independence offer robust methods for various data analysis scenarios. Mastery of these concepts enables effective analysis when parametric assumptions are not met.


```python
# Wilcoxon Signed-Rank Test
import numpy as np
from scipy.stats import wilcoxon

# Data: Scores before and after a tutoring program
before = np.array([60, 65, 70, 75, 80])
after = np.array([70, 75, 68, 72, 78])

# Perform the Wilcoxon Signed-Rank Test on the paired differences.
# For the default two-sided test, the statistic is the smaller rank sum.
stat, p_value = wilcoxon(before, after)

print(f"Wilcoxon Signed-Rank Test Statistic: {stat}")
print(f"P-value: {p_value}")
```

```
Wilcoxon Signed-Rank Test Statistic: 6.0
P-value: 0.8125
```


```python
# Mann-Whitney U Test
import numpy as np
from scipy.stats import mannwhitneyu

# Data: Test scores from two groups of students
group_A = np.array([60, 65, 70, 75])
group_B = np.array([70, 75, 80, 85])

# Perform a two-sided Mann-Whitney U Test
stat, p_value = mannwhitneyu(group_A, group_B, alternative='two-sided')

print(f"Mann-Whitney U Test Statistic: {stat}")
print(f"P-value: {p_value}")
```

```
Mann-Whitney U Test Statistic: 2.0
P-value: 0.10806337293756858
```


```python
# Kruskal-Wallis H Test
import numpy as np
from scipy.stats import kruskal

# Data: Test scores from three groups of students
group_A = np.array([60, 65, 70, 75])
group_B = np.array([70, 75, 80, 85])
group_C = np.array([80, 85, 90, 95])

# Perform Kruskal-Wallis H Test (the reported statistic is corrected for ties)
stat, p_value = kruskal(group_A, group_B, group_C)

print(f"Kruskal-Wallis H Test Statistic: {stat}")
print(f"P-value: {p_value}")
```

```
Kruskal-Wallis H Test Statistic: 7.645390070921987
P-value: 0.02186878425495871
```


```python
# Spearman's Rank Correlation
import numpy as np
from scipy.stats import spearmanr

# Data: Rankings of five students in math and science
math_rank = np.array([1, 2, 3, 4, 5])
science_rank = np.array([2, 3, 1, 5, 4])

# Compute Spearman's Rank Correlation
corr, p_value = spearmanr(math_rank, science_rank)

print(f"Spearman's Rank Correlation: {corr}")
print(f"P-value: {p_value}")
```

```
Spearman's Rank Correlation: 0.6
P-value: 0.28475697986529375
```


```python
# Chi-Square Test for Independence
import numpy as np
from scipy.stats import chi2_contingency

# Data: Preference for two brands of cereal among children and adults
data = np.array([[30, 20],
                 [20, 30]])

# Perform Chi-Square Test for Independence.
# By default (correction=True), chi2_contingency applies Yates' continuity
# correction to 2x2 tables, so it reports 3.24 rather than the uncorrected
# value of 4 from the hand calculation. Pass correction=False to obtain 4.
chi2, p_value, dof, expected = chi2_contingency(data)

print(f"Chi-Square Test Statistic: {chi2}")
print(f"P-value: {p_value}")
print(f"Degrees of Freedom: {dof}")
print("Expected Frequencies:")
print(expected)
```

```
Chi-Square Test Statistic: 3.24
P-value: 0.07186063822585143
Degrees of Freedom: 1
Expected Frequencies:
[[25. 25.]
 [25. 25.]]
```
