# Variance, covariance and bias: defaults in NumPy and pandas

An inquiry into how bias is handled in the Python functions that calculate variance, covariance and correlation.
The biased and unbiased variance estimates converge as n grows large. Since I do not usually have data for all people in the world $(N)$, I have to be satisfied with the few $(n)$ individuals in my sample. But if we divide the sum of squares $\sum (x_i-\bar{x})^2$ by $n$, the estimate is biased downward (we underestimate the variance); [see the discussion here](https://stats.stackexchange.com/questions/3931/intuitive-explanation-for-dividing-by-n-1-when-calculating-standard-deviation) and a 2-minute video [here](https://twitter.com/MaxvHaastrecht/status/1509440761477185542?s=20&t=IEoTxxcgsN1MPt--Nw0-vg).
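
To make the downward bias concrete, here is a minimal simulation sketch. The normal distribution, the true variance of 4, the sample size of 5 and the names `true_var`, `n_trials` and `samples` are all illustrative assumptions, not part of the analysis below. It draws many small samples and averages the two estimators.

```python
import numpy as np

rng = np.random.default_rng(0)

true_var = 4.0        # assumed population variance (sigma = 2), for illustration only
n = 5                 # small sample size, where the bias is most visible
n_trials = 100_000    # number of repeated samples

# Each row is one sample of size n drawn from N(0, true_var)
samples = rng.normal(loc=0.0, scale=np.sqrt(true_var), size=(n_trials, n))

# Average each estimator over all repeated samples
mean_biased = samples.var(axis=1, ddof=0).mean()    # divide by n
mean_unbiased = samples.var(axis=1, ddof=1).mean()  # divide by n - 1

print(f"true variance:             {true_var:.3f}")
print(f"average biased estimate:   {mean_biased:.3f}")
print(f"average unbiased estimate: {mean_unbiased:.3f}")
print(f"(n - 1) / n * true_var:    {(n - 1) / n * true_var:.3f}")
```

On average the divide-by-n estimator should land near $(n-1)/n$ times the true variance, while the divide-by-(n-1) estimator should land near the true variance itself.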

Variance formulas (notice the $n$ or $n-1$ in the denominator):
$${\displaystyle s^{2}_{biased}={{{\frac {1}{n}}\sum _{i=1}^{n}\left(x_{i}-{\bar {x}}\right)^{2}}},}$$


$${\displaystyle s^{2}_{unbiased}={{{\frac {1}{n-1}}\sum _{i=1}^{n}\left(x_{i}-{\bar {x}}\right)^{2}}}}$$

If n is large, the difference between the biased and unbiased forms is small. But if n is small, the difference matters. Similarly, if we know that n = N (we have the whole population), we definitely want to divide by n.
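
A small worked example may help here. The four data points and the variable names are made up for illustration. The two forms differ only by the factor $n/(n-1)$, which shrinks towards 1 as n grows.

```python
import numpy as np

data = np.array([1.0, 2.0, 3.0, 4.0])   # toy data, chosen so the arithmetic is easy to follow
n = data.size

# Deviations from the mean (2.5) are -1.5, -0.5, 0.5, 1.5, so the sum of squares is 5.0
sum_sq = ((data - data.mean()) ** 2).sum()

biased = sum_sq / n          # 5.0 / 4
unbiased = sum_sq / (n - 1)  # 5.0 / 3
print(biased, unbiased)

# The ratio unbiased / biased is n / (n - 1); it approaches 1 for large n
for size in (4, 50, 5000):
    print(size, size / (size - 1))
```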


Hence, it is good to know which method Python libraries use to calculate variances, so we can tailor their use to our needs. 

In [165]:
import numpy as np
import statsmodels.api as sm
import pandas as pd

# Set a seed for reproducibility
np.random.seed(0)

# Generate 100 random values for x and y
x = np.random.rand(100)
y = 3*x 

# y = 3*x + np.random.normal(0, 0.1, 100)
# Optional: use the line above instead of `y = 3*x` to add random noise to the relationship between x and y.
# The noise represents the unexplained variation (error)
# that you might find in real-world datasets: we generate 100 random values (errors)
# from a normal distribution with a mean of 0 and a standard deviation of 0.1.
# This will be useful when moving on to correlation coefficients and OLS.
# I plan to reuse this dataset, so I am already preparing for later.
# Note: with y = 3*x exactly, the correlation would be exactly 1. The variance, covariance and
# correlation outputs further down (e.g. r ≈ 0.9934) appear to come from a run with the noise term enabled.


df = pd.DataFrame({'x': x, 'y': y})
df

Unnamed: 0,x,y
0,0.548814,1.646441
1,0.715189,2.145568
2,0.602763,1.808290
3,0.544883,1.634650
4,0.423655,1.270964
...,...,...
95,0.183191,0.549574
96,0.586513,1.759539
97,0.020108,0.060323
98,0.828940,2.486820


In [148]:
df.var() # default: unbiased, divides by n-1

x    0.083957
y    0.762468
dtype: float64

Sidenote: the variance above is unbiased (divide by n-1, n = sample size)

In [149]:
# Calculate variance
var_x = np.var(x)

var_x # default: biased, divides by n

0.08311781545439237

The variance above using numpy is biased (divided by n, n = sample size)

x unbiased variance: `df.var()` or `np.var(x, ddof=1)`: 0.0840 (have to override the default in `np.var`)

x biased variance: `df.var(ddof=0)` or `np.var(x)`: 0.0831 (have to override the default in `df.var`)

**Notice the different default settings with `np.var` as opposed to `df.var()`**
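
The following sketch checks these equivalences directly. The data and the names `values` and `series` are illustrative assumptions. The `ddof` argument ("delta degrees of freedom") is subtracted from n in the denominator, and the same difference in defaults applies to `np.std` versus the pandas `std` method.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
values = rng.random(100)      # illustrative data, not the notebook's x
series = pd.Series(values)

# NumPy default (ddof=0) matches pandas with ddof=0
same_biased = np.isclose(np.var(values), series.var(ddof=0))

# pandas default (ddof=1) matches NumPy with ddof=1
same_unbiased = np.isclose(np.var(values, ddof=1), series.var())

# Standard deviations follow the same defaults
same_std = np.isclose(np.std(values, ddof=1), series.std())

print(same_biased, same_unbiased, same_std)
```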


# covariance

In [150]:
df.cov() #default: unbiased : divide by n-1

Unnamed: 0,x,y
x,0.083957,0.251343
y,0.251343,0.762468


In [151]:
np.cov(x, y) #default: unbiased:  divide by n-1

array([[0.08395739, 0.25134269],
       [0.25134269, 0.76246761]])

For covariance, both `np.cov` and `df.cov` use the unbiased form (divide by n-1) by default. 

Takeaway:


* **The default settings within NumPy are inconsistent** 
* `np.var` is biased by default (divides by n), while `np.cov` is unbiased by default (divides by n-1)
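
One way to see the mismatch is to compare the diagonal of the covariance matrix with `np.var`. This is a sketch on made-up data (the arrays `a` and `b` are assumptions); it also shows that `np.cov` accepts a `ddof` argument alongside `bias`.

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.random(100)                       # illustrative data
b = 3 * a + rng.normal(0, 0.1, 100)

cov_default = np.cov(a, b)                # divides by n - 1
cov_population = np.cov(a, b, ddof=0)     # divides by n, same as bias=True

# The diagonal of the default covariance matrix equals np.var only with ddof=1
diag_matches_ddof1 = np.allclose(np.diag(cov_default), [np.var(a, ddof=1), np.var(b, ddof=1)])
diag_matches_default_var = np.allclose(np.diag(cov_default), [np.var(a), np.var(b)])

# With ddof=0 the diagonal equals the default np.var
population_matches = np.allclose(np.diag(cov_population), [np.var(a), np.var(b)])

print(diag_matches_ddof1, diag_matches_default_var, population_matches)
```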



# driving the points home, offering solutions

## var 
$Var(X) = Cov(X,X)$: the variance is just the covariance of X with itself

In [152]:
cov_xx = np.cov(x, x)[0][1] 
print(f"default df.var : {df.x.var():.4f}",'UNbiased')
print(f"default np.cov xx = var x: {cov_xx:.4f}",'UNbiased')
print(f"default np.var: {np.var(x):.4f}", 'biased')

print('\n','options')
print(f"forcing np.var: {np.var(x, ddof = 1):.4f}",'to be UNbiased')
print(f"forcing np.cov: {np.cov(x, x, bias = True)[0][1]:.4f}", 'to be biased')


default df.var : 0.0840 UNbiased
default np.cov xx = var x: 0.0840 UNbiased
default np.var: 0.0831 biased

 options
forcing np.var: 0.0840 to be UNbiased
forcing np.cov: 0.0831 to be biased


## cov

How about covariance for X **and** Y ? 
$Cov (X,Y)$ 

In [153]:
cov_xy = np.cov(x, y)[0][1] 
print(f"default df.cov : {df.cov().iloc[0, 1]:.4f}" , 'UNbiased')
print(f"default np.cov: {cov_xy:.4f}", 'UNbiased')

print('\n','options')
print(f"forcing df.cov : {df.cov(ddof = 0).iloc[0, 1]:.4f}" , 'to be biased')
print(f"forced np.cov: {np.cov(x, y, bias = True)[0][1]:.4f}", 'to be biased')


default df.cov : 0.2513 UNbiased
default np.cov: 0.2513 UNbiased

 options
forcing df.cov : 0.2488 to be biased
forced np.cov: 0.2488 to be biased


## cov manual calculation

In [154]:
df['x-xbar'] = df["x"]-df["x"].mean() 
df['y-ybar'] = df["y"]-df["y"].mean() 

biased_cov = (df['x-xbar']*df['y-ybar']).sum()/df.shape[0] #biased
unbiased_cov = (df['x-xbar']*df['y-ybar']).sum()/(df.shape[0]-1) #unbiased
biased_cov, unbiased_cov

(0.248829264037903, 0.25134269094737677)
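
The same manual calculation can be wrapped in a small function and checked against `np.cov`. The helper `manual_cov` and the test data are hypothetical, written for illustration only.

```python
import numpy as np

def manual_cov(a, b, ddof=1):
    """Covariance of two 1-D arrays (hypothetical helper).

    ddof=1 divides by n - 1 (unbiased), ddof=0 divides by n (biased).
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Sum of the products of deviations from the respective means
    cross = ((a - a.mean()) * (b - b.mean())).sum()
    return cross / (a.size - ddof)

rng = np.random.default_rng(0)
a = rng.random(100)                       # illustrative data
b = 3 * a + rng.normal(0, 0.1, 100)

print(np.isclose(manual_cov(a, b), np.cov(a, b)[0, 1]))
print(np.isclose(manual_cov(a, b, ddof=0), np.cov(a, b, bias=True)[0, 1]))
```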

In [155]:
# Inspect the deviations that go into the covariance:
# when x moves to the right of its mean, what does y do? Does it also move in the same direction?
df[['x-xbar', 'y-ybar']]

Unnamed: 0,x-xbar,y-ybar
0,0.076020,0.092311
1,0.242396,0.798036
2,0.129970,0.417241
3,0.072089,0.043410
4,-0.049139,-0.017825
...,...,...
95,-0.289602,-0.819881
96,0.113719,0.241583
97,-0.452686,-1.446247
98,0.356146,1.003652


# correlation
$$
{\displaystyle r_{xy}={\frac {\sum _{i=1}^{n}(x_{i}-{\bar {x}})(y_{i}-{\bar {y}})}{{\sqrt {\sum _{i=1}^{n}(x_{i}-{\bar {x}})^{2}}}{\sqrt {\sum _{i=1}^{n}(y_{i}-{\bar {y}})^{2}}}}}}
$$
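
The formula above contains no division by $n$ or $n-1$ because the factor cancels. Writing $r_{xy}$ as covariance over the product of standard deviations, with every term using the same denominator (here $n-1$; the argument is identical for $n$):

$$
r_{xy}
=\frac{\frac{1}{n-1}\sum_{i=1}^{n}(x_{i}-\bar{x})(y_{i}-\bar{y})}
{\sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_{i}-\bar{x})^{2}}\,\sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(y_{i}-\bar{y})^{2}}}
=\underbrace{\frac{\frac{1}{n-1}}{\sqrt{\frac{1}{n-1}}\sqrt{\frac{1}{n-1}}}}_{=\,1}
\cdot\frac{\sum_{i=1}^{n}(x_{i}-\bar{x})(y_{i}-\bar{y})}
{\sqrt{\sum_{i=1}^{n}(x_{i}-\bar{x})^{2}}\,\sqrt{\sum_{i=1}^{n}(y_{i}-\bar{y})^{2}}}
$$

The cancellation only works when the covariance and both standard deviations use the same denominator.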

In [156]:
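# The formula has no explicit division by n (or n-1): the same factor appears in
# the numerator and the denominator, so it cancels out
# numerator
num = (df['x-xbar']*df['y-ybar']).sum() # raw sum of cross-products, not yet divided by n or n-1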
denom = (np.sqrt((df['x-xbar']**2).sum()))*(np.sqrt((df['y-ybar']**2).sum()))
num/denom

0.9934044403650562

In [157]:
np.corrcoef(df.x, df.y) # neither biased nor unbiased: the normalisation factor cancels in r



array([[1.        , 0.99340444],
       [0.99340444, 1.        ]])

In [158]:
biased_sd_x = np.sqrt((df['x-xbar']**2).sum()/df.shape[0]) #biased (divide by n) - the form we get from the expectation definition
unbiased_sd_x = np.sqrt((df['x-xbar']**2).sum()/(df.shape[0]-1)) #unbiased 


In [159]:
biased_sd_y = np.sqrt((df['y-ybar']**2).sum()/df.shape[0]) #biased (divide by n) - the form we get from the expectation definition
unbiased_sd_y = np.sqrt((df['y-ybar']**2).sum()/(df.shape[0]-1)) #unbiased 


In [160]:
biased_cov = (df['x-xbar']*df['y-ybar']).sum()/df.shape[0] #biased
unbiased_cov = (df['x-xbar']*df['y-ybar']).sum()/(df.shape[0]-1) #unbiased
biased_cov, unbiased_cov

(0.248829264037903, 0.25134269094737677)

In [161]:
biased_cov/(biased_sd_x*biased_sd_y)

0.9934044403650563

In [162]:
unbiased_cov/(unbiased_sd_x*unbiased_sd_y)

0.9934044403650562

In [163]:
unbiased_cov/(biased_sd_x*biased_sd_y)

1.003438828651572

In [164]:
biased_cov/(unbiased_sd_x*unbiased_sd_y)

0.9834703959614056
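
The two mixed results are off by an exact factor: with n = 100, 1.0034 is r multiplied by n/(n-1) = 100/99, and 0.9835 is r multiplied by (n-1)/n = 99/100. The sketch below checks this on illustrative data (the arrays `a` and `b` and the variable names are assumptions, not the notebook's dataset).

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.random(100)                       # illustrative data
b = 3 * a + rng.normal(0, 0.1, 100)
n = a.size

r = np.corrcoef(a, b)[0, 1]
cov_unbiased = np.cov(a, b)[0, 1]             # divide by n - 1
cov_biased = np.cov(a, b, bias=True)[0, 1]    # divide by n

# Unbiased covariance over biased standard deviations: too large by n / (n - 1)
mixed_high = cov_unbiased / (np.std(a) * np.std(b))

# Biased covariance over unbiased standard deviations: too small by (n - 1) / n
mixed_low = cov_biased / (np.std(a, ddof=1) * np.std(b, ddof=1))

print(np.isclose(mixed_high, r * n / (n - 1)))
print(np.isclose(mixed_low, r * (n - 1) / n))
```

Note that the first mix can push the result above 1, which is impossible for a genuine correlation coefficient.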

# takeaway

## numpy vs pandas (df, describe)
* if N=n, i.e. we are dealing with the population, use `np.var()`, as it divides by N by default
* if N>n, i.e. we are dealing with a sample, use `df.var()`, as it divides by n-1 by default
* either command can be adjusted to meet our needs with the `ddof` option

## cov
* is the same in both `np.cov()` and `df.cov()`, as both divide by n-1 by default
    * the default settings for `np.var` and `np.cov` therefore differ (the former is biased by default, the latter is unbiased by default) 
    
## corr 
* Bias does not matter for correlation: 
    * we arrive at the same number as long as both covariance and variance are of the same kind (i.e. both biased or both unbiased) 
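
As a final check, this sketch confirms that NumPy and pandas report the same correlation, and that building it by hand from all-biased or all-unbiased pieces gives that same value. The DataFrame `frame` and its columns are illustrative assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
frame = pd.DataFrame({"a": rng.random(100)})          # illustrative data
frame["b"] = 3 * frame["a"] + rng.normal(0, 0.1, 100)

r_numpy = np.corrcoef(frame["a"], frame["b"])[0, 1]
r_pandas = frame["a"].corr(frame["b"])

# Same kind of normalisation in the numerator and the denominator
r_all_unbiased = frame["a"].cov(frame["b"]) / (frame["a"].std() * frame["b"].std())
r_all_biased = (
    np.cov(frame["a"], frame["b"], bias=True)[0, 1]
    / (frame["a"].std(ddof=0) * frame["b"].std(ddof=0))
)

print(np.allclose([r_pandas, r_all_unbiased, r_all_biased], r_numpy))
```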