# Lecture 14: Nonparametric Statistics

## Non-parametric Statistics: The Why

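The normal distribution is nice and friendly: its mean and variance describe it completely, and we have good math for it.

Many real-world distributions are not normal, though. For those, **parameters** (like mean and variance) cannot fully and accurately capture the distribution! Hence, we require **non-parametric statistics**.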

### When To Turn To Non-Parametric Statistics

* When underlying distributions are non-normal, skewed, or cannot be parameterized simply
* When you have ranked (ordinal) data, e.g., preferences
* When you need to build an empirical "null" distribution

### Non-Parametric Statistics: Distribution Free

* Myth: Non-parametric statistics do not use parameters
* Fact: Non-parametric statistics do not make *assumptions about* (i.e., do not parameterize) the underlying distribution generating the data
* "Distribution-Free" Statistics
    * Meaning, they do not assume that the data-generating process (like heights) results in, e.g., normally distributed data

**CLICKER QUESTION**

Which of the following variables contain ordinal data?

A) Favorite pet

B) Distance traveled by car each day

**C) Survey responses (scale from "dislike" to "like")**

D) Human height

E) Human hair color

## Non-parametric Statistics: The What

* Bootstrap (Monte Carlo)
* Rank Statistics (Mann-Whitney U)
* Kolmogorov-Smirnov Test
* Non-parametric prediction models

### Bootstrapping (Resampling)

How can we build a more realistic "null distribution" for the sample estimate without knowing the population it's drawn from?
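One way to approach this question is to resample the observed data with replacement and recompute the estimate each time. The sketch below does this for a sample mean; the data values, the number of resamples, and the helper name `bootstrap_means` are illustrative assumptions, not part of the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical, deliberately skewed sample (made-up numbers)
sample = np.array([1.2, 0.4, 3.1, 0.9, 7.5, 2.2, 0.3, 1.8, 4.6, 0.7])

def bootstrap_means(data, n_resamples=5000, rng=rng):
    """Resample `data` with replacement and return the mean of each resample."""
    n = len(data)
    # Each row of `idx` selects n observations (with replacement) from the sample
    idx = rng.integers(0, n, size=(n_resamples, n))
    return data[idx].mean(axis=1)

boot = bootstrap_means(sample)

# Percentile interval: the middle 95% of the bootstrap means
ci_low, ci_high = np.percentile(boot, [2.5, 97.5])
print(f"sample mean = {sample.mean():.2f}")
print(f"95% bootstrap interval = ({ci_low:.2f}, {ci_high:.2f})")
```

The array `boot` is an empirical distribution of the estimate built only from the observed sample, with no assumption that the population is normal.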

### Rank Statistics

We rank things in the real world *all the time*!

* International rankings (economics, happiness, government performance)
* Sports (teams, players, leagues)
* Search engines
* Academic journal prestige
* Online reviews

#### Rank Statistics

Data are transformed from their quantitative value to their rank:

**Ordinal data**: Categorical data whose categories have a natural order

Quantitative Data [1, 4.5, 6.6, 9.2] -> Ordinal Data [1, 2, 3, 4]

Particularly helpful when data have a ranking but no clear numerical representation (e.g., movie reviews)
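As a small sketch of the rank transform, `scipy.stats.rankdata` can convert quantitative values to ranks. The second list, with tied values, is an added illustration that is not in the lecture; by default SciPy gives tied values the average of the ranks they would occupy.

```python
from scipy.stats import rankdata

# The quantitative values from the example above
values = [1, 4.5, 6.6, 9.2]
ranks = rankdata(values)
print(ranks.tolist())  # [1.0, 2.0, 3.0, 4.0]

# Hypothetical list with a tie: both 20s share rank (2 + 3) / 2 = 2.5
tied = [10, 20, 20, 30]
tied_ranks = rankdata(tied)
print(tied_ranks.tolist())  # [1.0, 2.5, 2.5, 4.0]
```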

**CLICKER QUESTION**

What are the ranks of the values in the following list? [77, 49, 23, 10, 89]

A) [1, 2, 3, 4, 5]

B) [5, 4, 3, 2, 1]

**C) [4, 3, 2, 1, 5]**

D) [2, 1, 1, 1, 2]

### Wilcoxon Rank Sum Test (Mann Whitney U Test)

* Determine whether two independent samples were drawn from the same population (i.e., have the same distribution)
* Similar to a two-sample t-test, but does not require normally distributed data; often interpreted as a test of medians rather than means

Assumptions

* Observations in each group are independent of one another
* Responses are at least ordinal (i.e., they can be ranked)

* $H_{0}$: Distributions of both populations are equal
* $H_{a}$: Distributions are *not* equal

#### Mann-Whitney U: Example

In a clinical trial, is there a difference in the number of episodes of shortness of breath between placebo and treatment?

* Step 1: Participants record number of episodes they have
* Step 2: Episodes from both groups are combined, sorted, and ranked
* Step 3: Resort the ranks into separate samples (placebo vs. treatment)
* Step 4: Carry out statistical test
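The four steps above can be sketched in Python as follows. The episode counts are hypothetical numbers chosen for illustration, and using `scipy.stats.mannwhitneyu` for Step 4 is one possible choice rather than the lecture's prescribed method.

```python
import numpy as np
from scipy.stats import rankdata, mannwhitneyu

# Step 1: hypothetical episode counts per participant (illustrative only)
placebo = np.array([7, 5, 6, 4, 12])
treatment = np.array([3, 6, 4, 2, 1])

# Step 2: combine both groups and rank the pooled values (ties get the average rank)
combined = np.concatenate([placebo, treatment])
ranks = rankdata(combined)

# Step 3: split the ranks back into the two samples
placebo_ranks = ranks[:len(placebo)]
treatment_ranks = ranks[len(placebo):]
T_placebo = placebo_ranks.sum()
T_treatment = treatment_ranks.sum()
print("rank sums:", T_placebo, T_treatment)

# Step 4: carry out the test (two-sided alternative)
result = mannwhitneyu(placebo, treatment, alternative="two-sided")
print("p-value:", result.pvalue)
```

The two rank sums always add up to $N(N+1)/2$ for $N$ pooled observations, which is a quick sanity check on Steps 2 and 3.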

#### Mann-Whitney U: Calculating the U Statistic

$$U_{A} = n_{a}n_{b} + \frac{n_{a}(n_{a}+1)}{2} - T_{A}$$

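* $n_{a}n_{b} + \frac{n_{a}(n_{a}+1)}{2}$ = The max possible value of $T_{A}$
* $T_{A}$ = The observed sum of ranks for sample A
    * $n_{a}$ = Number of elements in Group A
    * $n_{b}$ = Number of elements in Group B
* $U_{B}$ is computed the same way with the roles of A and B swapped; the test statistic $U$ is the smaller of $U_{A}$ and $U_{B}$
* $H_{0}$: Low and high scores are approximately evenly distributed in the two groups
* $H_{a}$: Low and high scores are NOT evenly distributed in the two groups (U <= 2)
* $0 \le U_{A} \le n_{a}n_{b}$ and $U_{A} + U_{B} = n_{a}n_{b}$: $U = 0$ means complete separation of the two groups; larger $U$ means more overlap (less separation)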
* **We reject the null if U is small**
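A minimal sketch of the formula above, assuming hypothetical data and a helper named `u_statistics` (not from the lecture). Taking the smaller of the two U values as the test statistic follows a common textbook convention and is an assumption here.

```python
import numpy as np
from scipy.stats import rankdata

def u_statistics(a, b):
    """Return (U_A, U_B) computed from rank sums, following the formula above."""
    n_a, n_b = len(a), len(b)
    ranks = rankdata(np.concatenate([a, b]))  # joint ranking, ties averaged
    t_a = ranks[:n_a].sum()  # observed rank sum for sample A
    t_b = ranks[n_a:].sum()  # observed rank sum for sample B
    u_a = n_a * n_b + n_a * (n_a + 1) / 2 - t_a
    u_b = n_a * n_b + n_b * (n_b + 1) / 2 - t_b
    return float(u_a), float(u_b)

# Hypothetical samples (illustrative numbers only)
a = [7, 5, 6, 4, 12]
b = [3, 6, 4, 2, 1]

u_a, u_b = u_statistics(a, b)
u = min(u_a, u_b)
print(u_a, u_b, u)  # 3.0 22.0 3.0
```

Note that `scipy.stats.mannwhitneyu` reports its statistic under a different convention (rank sum minus $n(n+1)/2$ for the first sample), so its value can equal $n_{a}n_{b} - U_{A}$ rather than $U_{A}$ as defined here.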

### Kolmogorov-Smirnov (KS) Test

Given (limited) samples from two populations, how do we quantify whether they come from the same distribution?

Compare cumulative distributions empirically.

Find the maximum difference between the CDFs.

Tests:

* Whether a sample is drawn from a given distribution
* Whether two samples are drawn from the same distribution
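The "maximum difference between the CDFs" idea can be sketched directly and checked against `scipy.stats.ks_2samp`. The two simulated samples and the helper name `ks_statistic` are assumptions made for illustration.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)

# Hypothetical samples from two different distributions
x = rng.normal(0.0, 1.0, size=200)
y = rng.exponential(1.0, size=200)

def ks_statistic(a, b):
    """Largest vertical gap between the two empirical CDFs."""
    a, b = np.sort(a), np.sort(b)
    grid = np.concatenate([a, b])  # the gap can only change at observed points
    # Empirical CDF at each grid point: fraction of the sample <= that point
    cdf_a = np.searchsorted(a, grid, side="right") / len(a)
    cdf_b = np.searchsorted(b, grid, side="right") / len(b)
    return float(np.max(np.abs(cdf_a - cdf_b)))

d_manual = ks_statistic(x, y)
result = ks_2samp(x, y)
print("manual D:", d_manual)
print("scipy  D:", result.statistic, "p-value:", result.pvalue)
```

For the one-sample case (a sample against a given distribution), `scipy.stats.kstest` plays the analogous role.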

### Non-Parametric Prediction Models

* When you have lots of data and no prior knowledge
* When you're not focused/worried about choosing the right features
* Goal: Fit training data while being able to generalize to unseen data

Examples:

* k-NN (k-Nearest Neighbors)
* Decision Trees (CART)
* Support Vector Machines (SVM)
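To make the first example concrete, here is a minimal k-NN classifier sketch in plain NumPy. The function name `knn_predict`, the toy data, and the choice of Euclidean distance with majority voting are illustrative assumptions. The model stores the training data and makes no assumption about the distribution that generated it.

```python
import numpy as np

def knn_predict(X_train, y_train, X_new, k=3):
    """Predict a label for each row of X_new by majority vote of its k nearest neighbors."""
    X_train = np.asarray(X_train, dtype=float)
    y_train = np.asarray(y_train)
    X_new = np.asarray(X_new, dtype=float)
    # Pairwise Euclidean distances, shape (n_new, n_train)
    dists = np.linalg.norm(X_new[:, None, :] - X_train[None, :, :], axis=2)
    # Indices of the k closest training points for each new point
    nearest = np.argsort(dists, axis=1)[:, :k]
    preds = []
    for row in nearest:
        labels, counts = np.unique(y_train[row], return_counts=True)
        preds.append(labels[np.argmax(counts)])  # most common label among neighbors
    return np.array(preds)

# Toy data: two well-separated clusters (made-up points)
X_train = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
y_train = [0, 0, 0, 1, 1, 1]

preds = knn_predict(X_train, y_train, [[0.5, 0.5], [5.5, 5.5]])
print(preds.tolist())  # [0, 1]
```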

#### Why Do We Even Teach/Use Parametric Statistics Anyway?

Parametric Approaches:

* Lots of data follow expected patterns
* Requires less data
* More sensitive
* Quicker to run/train/predict
* More resistant to overfitting