# Mann-Whitney U
---

### Introduction

<a href="https://en.wikipedia.org/wiki/Mann%E2%80%93Whitney_U_test">wikipedia</a>

Mann-Whitney is a non-parametric test that evaluates the hypothesis $H_0$ that two samples A and B originate from the same distribution. It works with ordinal data (only the ordering of values matters, not their absolute magnitudes)

### Overview

Non-parametric test = we do not make any assumptions about the shape of the distributions of the compared variables, so we can compare arbitrary distributions

<img src="img/mw1.png" width=350>

Rank test = we compare only the relative positions of the points from A and B after merging them onto a single scale; absolute values are not important

While other (parametric) tests evaluate the equality of individual moments (like $\mu$ or $\sigma$), this test compares the distributions over the whole scale (both center and tails)

It's a good choice when the distributions are not normal or the sample size is small

Since it does not require any specific conditions it is well suited for a universal metric comparison tool

### Formula

Test value is calculated as a pair $(U_1, U_2)$, where

$$\begin{cases}
    U_1 = n_1 n_2 + \frac{n_1 (n_1 + 1)}{2} - R_1 \\
    U_2 = n_1 n_2 + \frac{n_2 (n_2 + 1)}{2} - R_2
\end{cases}$$

where $R_1$ and $R_2$ are the sums of ranks of the first and the second group in the common pool of both groups (ordered by increasing values)

Notice that $U_1 + U_2 = n_1 \cdot n_2$. Each statistic takes values in the range $U_1, U_2 \in [0, n_1 n_2]$<br> Values $0$ and $n_1 n_2$ correspond to the extreme cases, when $X_A > X_B$ for all pairs of observations and vice versa
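The rank-sum formulas above can be sketched in plain Python. This is a minimal illustration, not a reference implementation: the helper names `midranks` and `u_statistics` are hypothetical, and tied values are assumed to share their average rank (a common convention that the text above does not specify).

```python
def midranks(values):
    """Return 1-based ranks of `values`; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value is tied with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        average_rank = (start + end) / 2 + 1  # positions are 0-based, ranks are 1-based
        for k in range(start, end + 1):
            ranks[order[k]] = average_rank
        start = end + 1
    return ranks


def u_statistics(group_1, group_2):
    """Compute (U_1, U_2) from the rank sums, using the formulas of this section."""
    n1, n2 = len(group_1), len(group_2)
    ranks = midranks(list(group_1) + list(group_2))
    r1 = sum(ranks[:n1])  # rank sum of the first group
    r2 = sum(ranks[n1:])  # rank sum of the second group
    u1 = n1 * n2 + n1 * (n1 + 1) / 2 - r1
    u2 = n1 * n2 + n2 * (n2 + 1) / 2 - r2
    return u1, u2


group_1 = [12, 15, 14, 10, 19, 17, 21]
group_2 = [22, 25, 24, 23, 30, 28, 29]
u1, u2 = u_statistics(group_1, group_2)
print(u1, u2, u1 + u2 == len(group_1) * len(group_2))  # 49.0 0.0 True
```

Here every value of the first group lies below every value of the second one, so the pair lands on the extreme values $(n_1 n_2, 0)$. Some libraries use the mirrored convention $R_1 - \frac{n_1(n_1+1)}{2}$ for the first group, which swaps the two numbers.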

The symmetry is easy to interpret: X is ahead of Y by exactly as much as Y is behind X. It allows the statistic to be written through min:<br>
$U = \min{(U_1, U_2)}$

It is the distribution of this $U$ that is analyzed when p-values are computed

Another important way to write the statistic:
$$U = \sum_{i \in X} \sum_{j \in Y} I[X_i > Y_j]$$
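The double sum simply counts the pairs in which the X observation exceeds the Y observation. A small sketch, assuming no tied values, checks that this count equals the rank-based quantity $R_X - \frac{n_X (n_X + 1)}{2}$; the function names are hypothetical.

```python
def pair_count(x, y):
    """Number of pairs (i, j) with x[i] > y[j]: the indicator double sum."""
    return sum(1 for xi in x for yj in y if xi > yj)


def rank_sum(x, y):
    """Sum of the ranks of x inside the pooled, sorted sample (assumes no ties)."""
    pooled = sorted(list(x) + list(y))
    rank_of = {value: position + 1 for position, value in enumerate(pooled)}
    return sum(rank_of[xi] for xi in x)


x = [3, 5, 8]
y = [1, 4, 9, 10]

u_pairs = pair_count(x, y)
u_ranks = rank_sum(x, y) - len(x) * (len(x) + 1) // 2
print(u_pairs, u_ranks)  # 5 5
```

The pairwise form costs $O(n_1 n_2)$ comparisons, while the rank form needs only one sort, which is why implementations usually prefer ranks.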


#### Derivation

- $n_1 n_2$ = the number of all pairs of points in $A \times B$

- by the arithmetic progression formula, the sum of the ranks of the first $n_1$ elements is $\frac{(1 + n_1)}{2}n_1$

- if the first group occupies the lowest ranks, ahead of every point of the second group, then $R_1=\frac{n_1 (n_1 + 1)}{2}$ as well<br> and $U_1 = n_1 \cdot n_2, U_2 = 0$

The normalized version of the statistic, $\frac{U_1}{n_1 n_2}$, is the ROC AUC metric (it ranges over $[0, 1]$). In machine learning ROC AUC is introduced as a measure of how far the classes are shifted relative to each other; in essence it is the same quantity the U test computes, just in a slightly different context. When the two distributions are equal, its expected value is $0.5$
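To make the ROC AUC connection concrete, the sketch below computes the area under a ROC curve by the trapezoid rule and compares it with the normalized pair count. It is an illustration under stated assumptions: the scores are made up, no two scores are tied, and the function names are hypothetical.

```python
def roc_auc_trapezoid(pos_scores, neg_scores):
    """Area under the ROC curve, sweeping the threshold from high to low scores."""
    labeled = sorted(
        [(s, 1) for s in pos_scores] + [(s, 0) for s in neg_scores],
        reverse=True,
    )
    tp = fp = 0
    prev_tpr = prev_fpr = 0.0
    area = 0.0
    for _, label in labeled:
        if label == 1:
            tp += 1
        else:
            fp += 1
        tpr = tp / len(pos_scores)
        fpr = fp / len(neg_scores)
        # Trapezoid between the previous and the current ROC point.
        area += (fpr - prev_fpr) * (tpr + prev_tpr) / 2
        prev_tpr, prev_fpr = tpr, fpr
    return area


def normalized_u(pos_scores, neg_scores):
    """Share of (positive, negative) pairs in which the positive scores higher."""
    wins = sum(1 for p in pos_scores for n in neg_scores if p > n)
    return wins / (len(pos_scores) * len(neg_scores))


pos = [0.9, 0.8, 0.4]
neg = [0.7, 0.3, 0.2, 0.1]
auc = roc_auc_trapezoid(pos, neg)
u_norm = normalized_u(pos, neg)
print(auc, u_norm)  # both equal 11/12 up to floating-point rounding
```

Which of the two statistics in the pair matches the AUC depends on which class is treated as the first group; the other one gives $1 - \text{AUC}$.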

### U distribution

U is a random variable with its own distribution $P(U|H_0)$

It is discrete, symmetric & converges to normal

<img src="img/mannwhitney_u.png" width=500>

How p-values can be calculated:
- exact (tabulated values)<br>enumerate all ways of splitting the pooled ranks between A and B
- approximated by a normal distribution
- permutation-based
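The exact approach from the list above can be sketched by brute-force enumeration: under $H_0$ every choice of $n_1$ ranks out of $n_1 + n_2$ is equally likely. The sketch assumes no ties and uses the pair-count form of the statistic; the function names and the doubling rule for the two-sided p-value are assumptions of this example, not something the text prescribes.

```python
from itertools import combinations
from math import comb


def exact_null_distribution(n1, n2):
    """Map each value of the pair-count U to its probability under H0 (no ties)."""
    counts = {}
    for ranks_1 in combinations(range(1, n1 + n2 + 1), n1):
        # Pairs in which a group-1 point exceeds a group-2 point.
        u = sum(ranks_1) - n1 * (n1 + 1) // 2
        counts[u] = counts.get(u, 0) + 1
    total = comb(n1 + n2, n1)
    return {u: c / total for u, c in counts.items()}


def exact_two_sided_p(u_observed, n1, n2):
    """Two-sided p-value: twice the smaller tail probability, capped at 1."""
    dist = exact_null_distribution(n1, n2)
    lower = sum(p for u, p in dist.items() if u <= u_observed)
    upper = sum(p for u, p in dist.items() if u >= u_observed)
    return min(1.0, 2 * min(lower, upper))


dist = exact_null_distribution(7, 7)
p_exact = exact_two_sided_p(0, 7, 7)
print(p_exact)  # 2 / 3432, roughly 0.00058
```

The enumeration grows as $\binom{n_1 + n_2}{n_1}$, so it is only practical for small samples; this is the reason larger samples fall back to the normal approximation or to random permutations.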

Recall this form of U
$$U = \sum_{i \in X} \sum_{j \in Y} I[X_i > Y_j]$$

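Why can we approximate with a normal distribution?
1. A CLT-type result tells us that the sum converges to normal $N(\mu, \sigma^2)$ (the terms are dependent, so this is the limit theorem for U-statistics rather than the classical i.i.d. CLT)
2. $\mu$ and $\sigma$ are known (see below) => we can standardize to $N(0,1)$ and get p-values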
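The standardization step can be sketched as follows, using $\mu = \frac{n_1 n_2}{2}$ and $\sigma^2 = \frac{n_1 n_2 (n_1 + n_2 + 1)}{12}$. This is a simplified illustration: it applies neither a continuity correction nor a tie correction, both of which real implementations commonly add, and the function name is hypothetical.

```python
from math import erfc, sqrt


def normal_approx_p(u, n1, n2):
    """Z-score of U under H0 and the two-sided p-value from N(0, 1)."""
    mu = n1 * n2 / 2
    sigma = sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mu) / sigma
    # For a standard normal Z, P(|Z| >= |z|) = erfc(|z| / sqrt(2)).
    p_two_sided = erfc(abs(z) / sqrt(2))
    return z, p_two_sided


# U = 0 with two groups of 7 observations (complete separation).
z, p_approx = normal_approx_p(0, 7, 7)
print(z, p_approx)
```

With samples this small the approximate p-value differs noticeably from the exact one, which illustrates why exact tables are preferred for small $n$.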


<img src="img/mann_whitney_u_2.png" width=300>


U is a sum of individual Bernoulli variables $I[X_i > Y_j]$<br>

#### Expectation and Variance (single comparison)
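Under the null hypothesis (and with no ties) $P(X>Y)=\frac{1}{2}$ for every comparison<br>
Thus the expectation of a single indicator $I_{ij} = I[X_i > Y_j]$ is $E[I_{ij}]= p =\frac{1}{2}$

And its variance is $\mathrm{Var}[I_{ij}] = E\big[(I_{ij} - E[I_{ij}])^2\big] = p(1-p) = \frac{1}{4}$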


#### Expectation and Variance (full statistic)

Mean of the sum U is $\frac{n_1 n_2}{2}$

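Now for the variance. Notice that comparisons sharing an observation are not independent: if $X_i > Y_j$, then every observation larger than $X_i$ must also be greater than $Y_j$. So the covariances are not zero

$\mathrm{Var}{(U)}=\sum_{i,j}\mathrm{Var}{(I_{ij})} + \sum_{(i,j) \neq (k,l)}\mathrm{Cov}{(I_{ij}, I_{kl})}$

The first sum is $\frac{n_1 n_2}{4}$. Using combinatorics we get the value of the covariance sum <br>
$\frac{n_1 n_2 (n_1 + n_2 - 2)}{12}$

Adding the two, the variance is $\sigma^2=\frac{n_1 n_2(n_1 + n_2 + 1)}{12}$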
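The mean $\frac{n_1 n_2}{2}$ and the variance $\frac{n_1 n_2 (n_1 + n_2 + 1)}{12}$ can be checked exactly for small samples by enumerating every equally likely rank assignment under $H_0$. A minimal sketch, assuming no ties; the function name is hypothetical.

```python
from fractions import Fraction
from itertools import combinations


def exact_moments(n1, n2):
    """Exact mean and variance of the pair-count U over all rank assignments."""
    values = [
        sum(ranks_1) - n1 * (n1 + 1) // 2
        for ranks_1 in combinations(range(1, n1 + n2 + 1), n1)
    ]
    mean = Fraction(sum(values), len(values))
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, variance


n1, n2 = 3, 4
mean, variance = exact_moments(n1, n2)
print(mean == Fraction(n1 * n2, 2))                       # True
print(variance == Fraction(n1 * n2 * (n1 + n2 + 1), 12))  # True
```

Using `Fraction` keeps the comparison exact, so the check is not blurred by floating-point rounding.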



### Bucketed variant of the test
In real systems computing the full U test is expensive => an approximation is needed. A "bucketed" variant of the test is often used (roughly speaking, quantized to a fixed dimension of 100 elements):
- split the users uniformly into 100 buckets by hash
- compute the metric of interest separately for each bucket to get a 100-element vector of values
- apply the standard U test to the 100-element vectors of metric values (one vector per compared group)
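The bucketing steps above can be sketched as follows. Everything here is an assumption made for illustration: the MD5-based hash, the per-bucket mean as the metric, the synthetic user ids and values, and the helper names `bucket_of` and `bucket_means`. A production system would use its own hashing scheme and metric.

```python
import hashlib
import random

N_BUCKETS = 100


def bucket_of(user_id, n_buckets=N_BUCKETS):
    """Deterministically map a user id to a bucket via a hash of the id."""
    digest = hashlib.md5(str(user_id).encode("utf-8")).hexdigest()
    return int(digest, 16) % n_buckets


def bucket_means(records, n_buckets=N_BUCKETS):
    """records: iterable of (user_id, metric_value). Returns means of non-empty buckets."""
    sums = [0.0] * n_buckets
    counts = [0] * n_buckets
    for user_id, value in records:
        b = bucket_of(user_id, n_buckets)
        sums[b] += value
        counts[b] += 1
    return [s / c for s, c in zip(sums, counts) if c > 0]


# Synthetic data: 5000 users per group with a small shift in the metric.
rng = random.Random(0)
control = [(f"c{i}", rng.gauss(10.0, 3.0)) for i in range(5000)]
treatment = [(f"t{i}", rng.gauss(10.3, 3.0)) for i in range(5000)]

vec_control = bucket_means(control)
vec_treatment = bucket_means(treatment)
print(len(vec_control), len(vec_treatment))
```

The two vectors are then compared with an ordinary U test, for example with `scipy.stats.mannwhitneyu(vec_control, vec_treatment)`, so the cost of the test no longer depends on the number of users.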

### Result

The p-value is the probability, under $H_0$, of obtaining a U at least as extreme as the computed one. The smaller it is, the better: we can be more confident that the result is not due to chance

MW-test = 100. * (1 - pvalue)<br>Presumably introduced for convenience, so that the value varies on the more familiar scale [0, 100]

```python
import numpy as np
from scipy.stats import mannwhitneyu

# Sample data for two independent groups
group_1 = np.array([12, 15, 14, 10, 19, 17, 21])
group_2 = np.array([22, 25, 24, 23, 30, 28, 29])

# Perform Mann-Whitney U test
stat, p_value = mannwhitneyu(group_1, group_2)

# Output the results
print(f'Mann-Whitney U statistic: {stat}')
print(f'p-value: {p_value}')
```

```
Mann-Whitney U statistic: 0.0
p-value: 0.0005827505827505828
```


# Relation to other tests

Another name for Mann-Whitney is the <u>Wilcoxon rank-sum</u> test, since it was independently designed by Wilcoxon. Not to be confused with the <u>Wilcoxon signed-rank</u> test, which is used to compare two paired datasets (observation = difference within a pair)

#### Student t-test
Unlike MW, the t-test compares means, and it requires (approximate) normality for its formulas to work

Mann-Whitney is preferable when the data distribution is not normal OR the dataset is small (fewer than 30 examples)




# Properties

#### Effect size
In statistics, apart from evaluating the hypothesis, we also want to understand the magnitude of the deviation. This measure is called Effect Size and is often based on the test statistic itself

U metric is not very interpretable => let's adjust it to be more intuitive<br>
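- Common Language Effect Size - ROC AUC<br>$\text{CLES}=\frac{U}{n_1 n_2}$<br><br>
- Rank biserial correlation<br>$r=1 - \frac{2U}{n_1 n_2}$ with $U = \min(U_1, U_2)$<br><br>
- Z-based effect size<br>$r=\frac{Z}{\sqrt{N}}$, where $Z$ is the standardized statistic and $N = n_1 + n_2$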
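Both effect sizes can be computed directly from pair counts. The sketch below is a hedged illustration: it counts ties as half a win and uses the signed form $2 \cdot \text{CLES} - 1$ for the rank-biserial correlation, which is one common convention and an assumption of this example; the function names are hypothetical.

```python
def cles(x, y):
    """Probability that a random x exceeds a random y (ties count as one half)."""
    wins = sum(1.0 for xi in x for yj in y if xi > yj)
    ties = sum(0.5 for xi in x for yj in y if xi == yj)
    return (wins + ties) / (len(x) * len(y))


def rank_biserial(x, y):
    """Signed rank-biserial correlation: favorable minus unfavorable pair share."""
    return 2 * cles(x, y) - 1


group_1 = [12, 15, 14, 10, 19, 17, 21]
group_2 = [22, 25, 24, 23, 30, 28, 29]

effect_cles = cles(group_2, group_1)
effect_r = rank_biserial(group_2, group_1)
print(effect_cles, effect_r)  # 1.0 1.0
```

A CLES of 1.0 means every observation of `group_2` exceeds every observation of `group_1`; a value of 0.5 (correlation 0) would indicate no systematic shift.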

#### Robustness
In small samples <u>outliers</u> affect the distribution (for example its mean and variance) much more than they affect the ranking => rank tests are more robust to outliers and are preferred when the sample size is small
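A small sketch of this robustness claim: replacing the largest value with an extreme outlier leaves the pair-count U unchanged, because the ordering of the points is the same, while the sample mean moves drastically. The outlier value and the helper name `pair_count` are made up for the illustration.

```python
from statistics import mean


def pair_count(x, y):
    """Number of pairs (i, j) with x[i] > y[j]."""
    return sum(1 for xi in x for yj in y if xi > yj)


group_1 = [12, 15, 14, 10, 19, 17, 21]
group_2 = [22, 25, 24, 23, 30, 28, 29]
group_2_outlier = [22, 25, 24, 23, 3000, 28, 29]  # 30 replaced by an outlier

u_clean = pair_count(group_2, group_1)
u_outlier = pair_count(group_2_outlier, group_1)
mean_clean = mean(group_2)
mean_outlier = mean(group_2_outlier)

print(u_clean, u_outlier)        # 49 49
print(mean_clean, mean_outlier)  # the mean jumps from about 25.9 to about 450.1
```

Any statistic built on means and variances, such as the t statistic, inherits that jump, whereas the rank-based U does not.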

#### Exact vs approximate

For small samples p-values are exact<br>
For large samples they are approximated by a normal distribution => slightly less precise, and with that much data the t-test is often preferred

#### Data Types
Since only comparisons matter, ordinal data can be used as well <br>Example: compare the academic degrees of the employees of two companies
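The academic-degree example can be sketched by mapping each category to an integer that only encodes the order. The data, the category names and the mapping are hypothetical. Since ordinal data produces many ties, the sketch counts a tie as half a win, which is a common convention and an assumption here.

```python
# Only the order of the codes matters, not the distances between them.
DEGREE_ORDER = {"bachelor": 0, "master": 1, "phd": 2}


def u_with_ties(x, y):
    """Pair-count U where a tied pair contributes one half."""
    return sum(
        1.0 if xi > yj else 0.5 if xi == yj else 0.0
        for xi in x
        for yj in y
    )


company_a = ["bachelor", "master", "master", "phd"]
company_b = ["bachelor", "bachelor", "master"]

codes_a = [DEGREE_ORDER[d] for d in company_a]
codes_b = [DEGREE_ORDER[d] for d in company_b]

u_a = u_with_ties(codes_a, codes_b)
u_b = u_with_ties(codes_b, codes_a)
print(u_a, u_b, u_a + u_b == len(codes_a) * len(codes_b))  # 9.0 3.0 True
```

Any other order-preserving coding, for example 1, 5, 100, gives the same U values, which is exactly why the test suits ordinal scales. With this many ties the exact null distribution no longer applies, and implementations typically use a tie-corrected normal approximation.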