In [None]:
import numpy as np

In [3]:
x = np.array([1, 2, 4, 6, 5, 4, 0])  # NumPy array, so x - mean1 below is element-wise
n = len(x)

In [4]:
mean1 = np.mean(x)
mean2 = np.sum(x)/n

In [5]:
print(mean1)
print(mean2)

3.142857142857143
3.142857142857143


## Variance

In [6]:
var1 = np.var(x)
var2 = np.sum((x-mean1)**2) * (1/(n-1))

In [7]:
print(var1)
print(var2)

4.122448979591836
4.809523809523809


## Understanding `ddof` for Variance

**`ddof` (Delta Degrees of Freedom)** is a parameter that sets the divisor used when calculating variance: NumPy divides the sum of squared deviations by `n - ddof`, where `n` is the number of values.

### Simple Explanation:

When you calculate variance, you divide by a number. The question is: **what number?**

- **`ddof=0` (default):** Divide by `n` → Used for **population variance** (you have all the data)
- **`ddof=1`:** Divide by `n-1` → Used for **sample variance** (you have a sample, not the whole population)
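
As a minimal sketch of the two cases above, the snippet below recomputes the variance by hand with a divisor of `n - ddof` and compares it with `np.var`. The names `data` and `manual_var` are illustrative and not part of the notebook.

```python
import numpy as np

data = np.array([1, 2, 4, 6, 5, 4, 0])  # same values as x in the notebook


def manual_var(values, ddof=0):
    """Hypothetical helper: sum of squared deviations divided by (n - ddof)."""
    values = np.asarray(values, dtype=float)
    squared_deviations = (values - values.mean()) ** 2
    return squared_deviations.sum() / (values.size - ddof)


for ddof in (0, 1):
    # The two numbers on each line should agree up to floating-point rounding.
    print(ddof, manual_var(data, ddof), np.var(data, ddof=ddof))
```
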

### Why Does This Matter?

Imagine you're estimating the variance of a large population, but you only have a small sample:

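- If you divide by `n` (the sample size), you'll **underestimate** the true variance on average. The deviations are measured from the sample mean, and the sample mean is the point that minimises the sum of squared deviations, so that sum is never larger than it would be around the true population mean
- If you divide by `n-1`, you get an **unbiased estimate** (for independent draws): one degree of freedom was already spent estimating the mean, so only `n-1` independent deviations remain

The `n-1` divisor makes the variance larger by a factor of `n/(n-1)`, which corrects that bias. This adjustment is known as Bessel's correction.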
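
The underestimate described above can be checked with a small simulation. This is only a sketch under stated assumptions: the population is taken to be normal with variance 4, the sample size is 5, and the seed, trial count, and variable names are all illustrative choices rather than anything from the notebook.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable
true_var = 4.0                  # assumed population variance (std = 2)
sample_size = 5
n_trials = 100_000

# Each row is one small sample drawn from the assumed population.
samples = rng.normal(loc=0.0, scale=np.sqrt(true_var),
                     size=(n_trials, sample_size))

# Average the two variance estimates over all simulated samples.
avg_div_n = np.var(samples, axis=1, ddof=0).mean()   # divide by n
avg_div_n1 = np.var(samples, axis=1, ddof=1).mean()  # divide by n - 1

print(avg_div_n)   # should land near true_var * (n - 1) / n = 3.2
print(avg_div_n1)  # should land near true_var = 4.0
```

With samples this small the bias is large (about 20%), which is why the divisor matters most when `n` is small.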

### In Your Notebook:

- `np.var(x)` uses `ddof=0` → divides by 7
- Your manual formula divides by 6 (which is `n-1`) → equivalent to `ddof=1`

**Rule of thumb:** Use `ddof=1` when working with sample data (most real-world cases), and `ddof=0` only when you have the complete population.

The two variances printed earlier differ because they apply different divisors to the same sum of squared deviations (about 28.857 for this data):

- **var1** uses `np.var(x)`, which calculates **population variance**: 28.857 / 7 ≈ 4.1224
- **var2** uses `np.sum((x-mean1)**2) * (1/(n-1))`, which calculates **sample variance**: 28.857 / 6 ≈ 4.8095

**To match them, use one of these approaches:**

1. **Make var1 use sample variance:**


In [9]:
var1 = np.var(x, ddof=1)  # ddof=1 divides by (n-1)
print(var1)

4.809523809523809




2. **Make var2 use population variance:**


In [10]:
var2 = np.sum((x-mean1)**2) * (1/n)  # divide by n instead of (n-1)
print(var2)

4.122448979591836




Either way, the two calculations now agree: `ddof` only changes the divisor, never the sum of squared deviations.

## When `n` is large, the choice of divisor matters little

In [22]:
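n = 1000

# 1000 random integers in [0, 20); no seed is set, so the values change on every run
x1 = np.random.randint(low=0, high=20, size=n)
mean3 = np.mean(x1)

varnc1 = np.var(x1)  # population variance, divides by n
varnc2 = np.sum((x1-mean3)**2) * (1/(n-1))  # sample variance, divides by n-1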

In [23]:
print(varnc1)
print(varnc2)

33.158479
33.19167067067067
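
The shrinking gap can be made explicit: for any data, the sample variance equals the population variance times `n/(n-1)`, and that factor approaches 1 as `n` grows. The sketch below checks this for a few sizes. The sizes, the seed, and the names `sizes` and `ratios` are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable
sizes = (10, 100, 1000, 100_000)
ratios = {}

for size in sizes:
    values = rng.integers(low=0, high=20, size=size)
    pop_var = np.var(values)             # divides by n
    sample_var = np.var(values, ddof=1)  # divides by n - 1
    ratios[size] = sample_var / pop_var
    # The observed ratio should match n / (n - 1) regardless of the data.
    print(size, ratios[size], size / (size - 1))
```

For `n = 1000` the factor is 1000/999, a difference of about 0.1%, which matches the two values printed above.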
