In [227]:
import pandas as pd
import numpy as np
from IPython.display import display, HTML, Math
from sympy import init_printing, Matrix, symbols, sqrt
init_printing(use_latex = 'mathjax')

# Variance ($\sigma^{2}$)

Variance is the average squared deviation from the mean. It measures the variability, or spread, of a set of data.


$$\sigma^{2}(x) = \frac{\Sigma{(x_{i} - \bar{x})^2}}{N}$$


- $N$ - Number of observations 
- $\bar{x}$ - mean of the given variable
- $x_{i}$ - value of the variable in the $i^{th}$ row
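
As a quick check of the formula above, here is a minimal sketch that computes the variance by hand and compares it with `np.var`. The sample values in `x` are made up for illustration only.

```python
import numpy as np

# Hypothetical sample values, chosen only for illustration
x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

# Deviation of every value from the mean
deviations = x - x.mean()

# Average squared deviation (divide by N, as in the formula above)
variance_manual = np.sum(deviations ** 2) / len(x)

# np.var uses ddof=0 by default, i.e. the same divide-by-N definition
variance_numpy = np.var(x)

print(variance_manual, variance_numpy)
```

Both values should agree, since `np.var` divides by $N$ unless `ddof` is set.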




### Example data

Suppose we run an experiment: observing stars in a galaxy, running a reaction in the lab, or watching the stock market. Each observation measures several parameters. For a star, we can record its mass, radius, flux, and distance. For a chemical reaction in the lab, we can record the temperature, reaction rate, color changes, and so on. For the stock market, we can record the high value, low value, buying price, selling price, and so on. How many parameters we pick depends on the experiment and on the quantities we are interested in. For practical purposes, let us assume all parameters are recorded as floating-point numbers.

We can store these observations in a matrix (table format).

In [228]:
# random normal values
d1 = np.random.randn(3)
d2 = np.random.randn(3)
d3 = np.random.randn(3)
d4 = np.random.randn(3)
d5 = np.random.randn(3)

Observations = pd.DataFrame(
    [d1, d2, d3, d4, d5],
    index = [
        'Observation_1', 
        'Observation_2', 
        'Observation_3',
        'Observation_4',
        'Observation_5'
    ],
    # a list (not a set) keeps the column order well defined
    columns=[
        'parameter_1', 
        'parameter_2', 
        'parameter_3'
    ]
)

The code snippet above generates random values and puts them into a data frame. This dataset has five **observations (rows)**, and each observation records three separate **parameters (columns)**.

{{Observations}}

### Deviation Scores ($x_{i} - \bar{x}$)
A centered variable is obtained by subtracting the mean of the variable from each of its values. Centering data is important because it makes the parameter estimates easier to interpret.


We have 3 parameters in the above dataset, and we have 5 observations. We need to center all three parameters in this data set. We will look at how to center **parameter_1** in detail.

**Centering parameter_1**

- **Step 1**: Calculate the mean value of **parameter_1**

$$\mu_{parameter\_1} = \frac{1}{5}\Big(\sum^{5}_{observation=1}{\big(parameter\_1_{observation}\big)}\Big)$$

In [229]:
mu_parameter_1 = Observations[['parameter_1']].mean()

- **Step 2**: Subtract the mean value from the **parameter_1** values
    - Raw data
    {{(Observations[['parameter_1']])}}
    - **parameter_1** mean: {{mu_parameter_1['parameter_1']}}
    - Centered data
    {{Observations[['parameter_1']] - mu_parameter_1}}

### Use Matrix Algebra 

Deviation scores for all the parameters can be calculated at once with matrix operations. This is a very handy way to manipulate large amounts of data.


- Let us assume the raw data is in the Matrix $X$. Each row is an observation, and each column is a parameter.


$$X = 
\begin{bmatrix}
\vec{x}_{1} \\
\vec{x}_{2} \\
. \\
\vec{x}_{m}
\end{bmatrix} = 
\begin{bmatrix} 
x_{11} & x_{12} & ... & x_{1n} \\ 
x_{21} & x_{22} & ... & x_{2n} \\
. & . & ... & . \\
x_{m1} & x_{m2} & ... & x_{mn} \\
\end{bmatrix}_{m\times n}$$
 
 
- Define a column vector of ones 

$$L = 
\begin{bmatrix}
1 \\
1 \\
. \\
1
\end{bmatrix}_{m\times 1}$$

- Build a square matrix of ones 

$$LL^{T} = \begin{bmatrix}
1 \\
1 \\
. \\
1
\end{bmatrix}
\begin{bmatrix}
1 & 1 & . & 1 \\
\end{bmatrix} = \begin{bmatrix} 
1 & 1 & ... & 1 \\ 
1 & 1 & ... & 1 \\
. & . & ... & . \\
1 & 1 & ... & 1 \\
 \end{bmatrix}_{m\times m}$$


 - Transform the raw scores in matrix $X$ into the deviation scores in matrix $D$. Multiplying $X$ by $\frac{1}{m}LL^{T}$ replaces every entry of a column with that column's mean.
 
 $$D = X-\frac{1}{m}(LL^{T})X$$
 
 $$D = 
\begin{bmatrix} 
x_{11} & x_{12} & ... & x_{1n} \\ 
x_{21} & x_{22} & ... & x_{2n} \\
. & . & ... & . \\
x_{m1} & x_{m2} & ... & x_{mn} \\
\end{bmatrix}_{m\times n} - \frac{1}{m}\begin{bmatrix} 
1 & 1 & ... & 1 \\ 
1 & 1 & ... & 1 \\
. & . & ... & . \\
1 & 1 & ... & 1 \\
 \end{bmatrix}_{m\times m}\begin{bmatrix} 
x_{11} & x_{12} & ... & x_{1n} \\ 
x_{21} & x_{22} & ... & x_{2n} \\
. & . & ... & . \\
x_{m1} & x_{m2} & ... & x_{mn} \\
\end{bmatrix}_{m\times n}$$
 

$$D = 
\begin{bmatrix} 
x_{11} & x_{12} & ... & x_{1n} \\ 
x_{21} & x_{22} & ... & x_{2n} \\
. & . & ... & . \\
x_{m1} & x_{m2} & ... & x_{mn} \\
\end{bmatrix}_{m\times n} -
\begin{bmatrix} 
\frac{1}{m}\sum_{i=1}^{m}x_{i1} & \frac{1}{m}\sum_{i=1}^{m}x_{i2} & ... & \frac{1}{m}\sum_{i=1}^{m}x_{in} \\ 
\frac{1}{m}\sum_{i=1}^{m}x_{i1} & \frac{1}{m}\sum_{i=1}^{m}x_{i2} & ... & \frac{1}{m}\sum_{i=1}^{m}x_{in} \\ 
. & . & ... & . \\
\frac{1}{m}\sum_{i=1}^{m}x_{i1} & \frac{1}{m}\sum_{i=1}^{m}x_{i2} & ... & \frac{1}{m}\sum_{i=1}^{m}x_{in} \\ 
\end{bmatrix}_{m\times n}$$

$$D = 
\begin{bmatrix} 
x_{11} & x_{12} & ... & x_{1n} \\ 
x_{21} & x_{22} & ... & x_{2n} \\
. & . & ... & . \\
x_{m1} & x_{m2} & ... & x_{mn} \\
\end{bmatrix}_{m\times n} -
\begin{bmatrix} 
\mu_{param_1} & \mu_{param_2} & ... & \mu_{param_n} \\ 
\mu_{param_1} & \mu_{param_2} & ... & \mu_{param_n} \\ 
. & . & ... & . \\
\mu_{param_1} & \mu_{param_2} & ... & \mu_{param_n} \\ 
\end{bmatrix}_{m\times n}$$


$$D = \begin{bmatrix} 
(x_{11} - \mu_{param_1}) & (x_{12} - \mu_{param_2}) & ... & (x_{1n} - \mu_{param_n})\\ 
(x_{21} - \mu_{param_1}) & (x_{22} - \mu_{param_2}) & ... & (x_{2n} - \mu_{param_n})\\
. & . & ... & . \\
(x_{m1} - \mu_{param_1}) & (x_{m2} - \mu_{param_2}) & ... & (x_{mn} - \mu_{param_n})\\
\end{bmatrix}_{m\times n}$$

- Centered data
$$D = 
\begin{bmatrix} 
d_{11} & d_{12} & ... & d_{1n} \\ 
d_{21} & d_{22} & ... & d_{2n} \\
. & . & ... & . \\
d_{m1} & d_{m2} & ... & d_{mn} \\
\end{bmatrix}_{m\times n}$$
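
The derivation above can be sketched in NumPy as follows. This is only an illustration under assumed inputs: the matrix `X` is filled with random values from a seeded generator, and the names `D_matrix` and `D_broadcast` are introduced here for comparison.

```python
import numpy as np

# Assumed example data: m = 5 observations (rows), n = 3 parameters (columns)
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
m = X.shape[0]

# Column vector of ones, shape (m, 1)
L = np.ones((m, 1))

# D = X - (1/m) (L L^T) X, exactly as in the derivation
D_matrix = X - (L @ L.T @ X) / m

# The same result using broadcasting: subtract each column's mean
D_broadcast = X - X.mean(axis=0)

print(np.allclose(D_matrix, D_broadcast))
print(D_matrix.mean(axis=0))  # each column mean is numerically zero
```

In practice the broadcasting form is usually preferable, because it avoids building the $m \times m$ matrix of ones.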



### Center Example data 

- Raw Data
{{Observations}}
- Centering data

In [230]:
ObsCenterd = Observations - Observations.mean(axis=0)

{{ObsCenterd}}

# Covariance Matrix

**Covariance** measures the extent to which corresponding elements of two ordered data sets move in the same direction. For two vectors $X$ and $Y$ with $N$ elements each:
 $$V = \sigma^{2}_{XY} = \frac{1}{N-1}\sum(X_i - \bar{X})(Y_i-\bar{Y})$$
 
 Dividing by $N-1$ instead of $N$ gives us the unbiased estimator ([read more](https://en.wikipedia.org/wiki/Unbiased_estimation_of_standard_deviation)).
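
To make the difference between dividing by $N$ and by $N-1$ concrete, here is a small sketch on two short vectors. The values are hypothetical. `np.cov` divides by $N-1$ by default and by $N$ when `bias=True`.

```python
import numpy as np

# Hypothetical paired measurements, for illustration only
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.0, 6.0, 9.0])
n_obs = len(x)

# Sum of products of deviations from the means
s = np.sum((x - x.mean()) * (y - y.mean()))

cov_unbiased = s / (n_obs - 1)  # divide by N - 1
cov_biased = s / n_obs          # divide by N

# NumPy equivalents: entry [0, 1] of the 2 x 2 covariance matrix
cov_numpy_unbiased = np.cov(x, y)[0, 1]
cov_numpy_biased = np.cov(x, y, bias=True)[0, 1]

print(cov_unbiased, cov_numpy_unbiased)
print(cov_biased, cov_numpy_biased)
```

The gap between the two estimates shrinks as the number of observations grows.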

$$V = \frac{1}{m-1}D^{T}D$$

$$V = \frac{1}{m-1}
\begin{bmatrix} 
d_{11} & d_{21} & ... & d_{m1} \\
d_{12} & d_{22} & ... & d_{m2} \\
. 	   & 	.	& ... &	. \\
d_{1n} & d_{2n} & ... & d_{mn} \\
\end{bmatrix}_{n\times m} \times
\begin{bmatrix} 
d_{11} & d_{12} & ... & d_{1n} \\ 
d_{21} & d_{22} & ... & d_{2n} \\
. & . & ... & . \\
d_{m1} & d_{m2} & ... & d_{mn} \\
\end{bmatrix}_{m\times n}
$$




$$V = 
\begin{bmatrix} 
\frac{1}{m-1}\sum_{i=1}^{m}d^{2}_{i1} & \frac{1}{m-1}\sum_{i=1}^{m}d_{i1}d_{i2} & ... & \frac{1}{m-1}\sum_{i=1}^{m}d_{i1}d_{in} \\ 
\frac{1}{m-1}\sum_{i=1}^{m}d_{i2}d_{i1} & \frac{1}{m-1}\sum_{i=1}^{m}d^{2}_{i2} & ... & \frac{1}{m-1}\sum_{i=1}^{m}d_{i2}d_{in} \\ 
. & . & ... & . \\
\frac{1}{m-1}\sum_{i=1}^{m}d_{in}d_{i1} & \frac{1}{m-1}\sum_{i=1}^{m}d_{in}d_{i2} & ... & \frac{1}{m-1}\sum_{i=1}^{m}d^{2}_{in} \\ 
\end{bmatrix}_{n\times n}$$


$$V = 
\begin{bmatrix} 
\sigma^{2}_{d_{i1}d_{i1}} & \sigma^{2}_{d_{i1}d_{i2}} & ... & \sigma^{2}_{d_{i1}d_{in}} \\ 
\sigma^{2}_{d_{i2}d_{i1}} & \sigma^{2}_{d_{i2}d_{i2}} & ... & \sigma^{2}_{d_{i2}d_{in}} \\ 
. & . & ... & . \\
\sigma^{2}_{d_{in}d_{i1}} & \sigma^{2}_{d_{in}d_{i2}} & ... & \sigma^{2}_{d_{in}d_{in}} \\ 
\end{bmatrix}_{n\times n}$$



The **covariance** measures the degree of the linear relationship between two variables.

- $\sigma^2_{XY} \gg 0$: $X$ and $Y$ are positively correlated
- $\sigma^2_{XY} = 0$: $X$ and $Y$ are NOT correlated
- $\sigma^2_{XY} \ll 0$: $X$ and $Y$ are negatively correlated
- $|\sigma^2_{XY}|$: the absolute magnitude of the covariance measures the degree of redundancy
- $\sigma^2_{XY} = \sigma^2_{X}$ if $X=Y$
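
The cases listed above can be illustrated with synthetic data. In this sketch the vectors are generated from a seeded random generator, and the coefficients `2.0` and `0.1` are arbitrary choices made only for the example.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=1000)
noise = rng.normal(size=1000)  # generated independently of x

y_pos = 2.0 * x + 0.1 * noise   # moves in the same direction as x
y_neg = -2.0 * x + 0.1 * noise  # moves in the opposite direction
y_ind = noise                   # no linear relationship with x

# np.cov returns a 2 x 2 matrix; entry [0, 1] is the covariance
cov_pos = np.cov(x, y_pos)[0, 1]
cov_neg = np.cov(x, y_neg)[0, 1]
cov_ind = np.cov(x, y_ind)[0, 1]

# Covariance of a variable with itself is its variance
cov_self = np.cov(x, x)[0, 1]

print(cov_pos, cov_neg, cov_ind)
print(cov_self, np.var(x, ddof=1))
```

With a finite sample, the covariance of the independent pair is close to zero but not exactly zero.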


The $ij^{th}$ element of $V$ ($V_{ij}$) is the dot product of the centered vector of the $i^{th}$ parameter (column $i$ of $D$) with the centered vector of the $j^{th}$ parameter (column $j$ of $D$), divided by $m-1$.

- $V$ is a square symmetric $n\times n$ matrix 
- The diagonal terms of $V$ are the **variances** of the individual parameters
- The off-diagonal terms of $V$ are the **covariances** between pairs of parameters
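
These properties are easy to verify numerically. The sketch below uses an assumed random data matrix `X` (5 observations, 3 parameters) rather than the data frame from this post.

```python
import numpy as np

# Assumed example data: rows are observations, columns are parameters
rng = np.random.default_rng(2)
X = rng.normal(size=(5, 3))
m = X.shape[0]

# Centered data and covariance matrix V = D^T D / (m - 1)
D = X - X.mean(axis=0)
V = D.T @ D / (m - 1)

# V is square (n x n) and symmetric
is_symmetric = np.allclose(V, V.T)

# Diagonal terms equal the per-parameter variances (ddof=1 divides by m - 1)
diag_is_variance = np.allclose(np.diag(V), X.var(axis=0, ddof=1))

# An off-diagonal term is the scaled dot product of two centered columns
v01 = D[:, 0] @ D[:, 1] / (m - 1)

print(V.shape, is_symmetric, diag_is_variance, np.isclose(V[0, 1], v01))
```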



$V$ captures the covariance between all possible pairs of parameters in the observations. The covariance values reflect the noise and redundancy in the parameters.

- In the diagonal terms, large values are assumed to correspond to interesting structure.
- In the off-diagonal terms, large magnitudes correspond to high redundancy.


Suppose we could manipulate this covariance matrix. Which features would we want to optimize? (This will be covered in another post.)


### Covariance Matrix with Example data 

$D^{T}$ = {{ObsCenterd.T}}

$D$ = {{ObsCenterd}}

In [231]:
# http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.dot.html
DtD = ObsCenterd.T.dot(ObsCenterd)/(len(ObsCenterd) - 1)

$V = \frac{1}{m-1}D^{T}D$
{{DtD}}

# Calculate Covariance Matrix directly from the data frame 


In [232]:
covDf = ObsCenterd.cov()

{{covDf}}

# Calculate Covariance Matrix using numpy arrays

It is important to arrange the data (observations) in the proper format before calculating the covariance matrix. Arrange all the vectors (observations) as column vectors: each **column represents an observation and each row represents a parameter**. This is the layout `np.cov` expects by default.

In [233]:
P = np.column_stack([d1, d2, d3, d4, d5])
P

array([[-0.84393945,  0.34945711, -1.32568652,  2.30512988, -0.0785676 ],
       [-0.03653621, -0.62894742, -0.48918021,  1.36789637,  0.15032504],
       [ 1.67682082,  0.26426531, -1.44821653,  0.43092719,  1.25069837]])

In [235]:
np.cov(P)

array([[ 1.96964108,  0.89284061,  0.32885849],
       [ 0.89284061,  0.62587758,  0.27508235],
       [ 0.32885849,  0.27508235,  1.44578854]])
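
The orientation matters because `np.cov` treats each row as a variable by default (`rowvar=True`). The following sketch, on an assumed random matrix with observations in rows, compares the possible calls with the pandas result.

```python
import numpy as np
import pandas as pd

# Assumed example data: 5 observations (rows) x 3 parameters (columns)
rng = np.random.default_rng(3)
X = rng.normal(size=(5, 3))

# Option 1: transpose so that each row is a parameter
cov_transposed = np.cov(X.T)

# Option 2: keep observations in rows and say so explicitly
cov_rowvar_false = np.cov(X, rowvar=False)

# pandas treats columns as variables, so no transpose is needed
cov_pandas = pd.DataFrame(X).cov().to_numpy()

# Passing X unchanged treats the 5 observations as variables instead
wrong_shape = np.cov(X).shape

print(np.allclose(cov_transposed, cov_rowvar_false))
print(np.allclose(cov_transposed, cov_pandas))
print(wrong_shape)
```

If the result has as many rows as there are observations rather than parameters, the input was most likely passed in the wrong orientation.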