# Variance, Covariance, and Correlation

## Introduction

In the previous section, we introduced two important descriptive statistics that are used to summarize a *single* random variable:

* **Mean** ($\mu$) which is the average of all of the data points in a dataset

* **Standard Deviation** ($\sigma$) which describes how far the data points are spread around the mean 

A common goal in data science is to determine if *two* random variables are related. One way to think about the relationship between two variables is to ask yourself, "if one variable changes does the other also change?" 

It turns out we can explicitly quantify the extent to which two variables change together by comparing how each variable varies from its mean. That might sound a little confusing so this lesson will break down the process. 

First, we will introduce the concept of the **Variance** which describes how much a variable varies from its own mean. We will show that the variance is directly related to the standard deviation which we talked about before. 

Second, we will build on the concept of variance to introduce **Covariance** and **Correlation** which describe different quantitative aspects of the relationship between two variables. 

## Learning Objectives

* Develop a conceptual and practical understanding of variance and how it relates to standard deviation
* Learn how to calculate variance
* Develop a conceptual and practical understanding of the covariance and correlation between two random variables
* Learn how to calculate covariance and correlation between two random variables
* Visualize and interpret the covariance and correlation between two random variables 

## What is Variance?

Let's look at some data to get a feel for what the variance is. Consider the following graphs from Conglomo, Inc. and Bilco, Inc. These graphs show the distributions of each company's stock returns expressed as a probability density which we talked about before. The returns are normally distributed with a mean value of $20 for both companies.

<img src="images/var_example2.png" width="400">

We can see that Conglomo's distribution of returns is taller and narrower than Bilco's distribution. In other words, the values of Conglomo's returns tend to be close to their mean value. On the other hand, the values of Bilco's returns are more varied which results in a wider distribution around the mean value. This *variation* about the mean is what the **Variance** represents.

> __Variance quantifies how much a random variable deviates from its mean value__. 

The variance of each company's stock returns reflects the width of its distribution. Since Bilco's stock returns vary more about the mean, it has a wider distribution and thus a larger variance. If you were a cautious investor seeking to minimize risk, you would choose to invest in Conglomo over Bilco, since Conglomo's returns are more likely to fall near the mean value than Bilco's.

## How Do You Calculate Variance?

Ok, the example above described variance but didn't give any numbers. So how do we actually calculate the variance? Well, since we know the variance must reflect how much each individual data point in the dataset differs from the mean value of the data set, we can calculate the variance using the following order of operations:

1. Calculate the mean of the dataset
2. Calculate the differences between each element in the dataset and the mean 
3. Square the differences
4. Sum the squared differences
5. Divide the sum by the number of values in the dataset

So if we consider a random variable, $X$, where

$X = [x_1, x_2, ... , x_n]$

Then the above steps can be conveniently summarized by the formula: 

$$\sigma_X^2 = \frac{\sum_{i=1}^{n}(x_i-\mu_X)^2}{n}$$

where the variance of $X$ is denoted by $\sigma_X^2$. If this notation looks a little strange, recall that the $i$ subscript refers to the index of the ith element, $x_i$, of variable $X$ and $n$ refers to the total number of $x$ values in the dataset. Therefore, the quantity $(x_i-\mu_X)$ is the difference between one data point and the mean of the dataset. This is also called the **error term**. 
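
To make the five steps concrete, here is a minimal sketch that follows them one at a time. The dataset `x` is a made-up toy example, chosen only so the arithmetic is easy to check by hand.

```python
import numpy as np

# Hypothetical toy dataset (not from the lesson), picked for easy arithmetic
x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

mean = x.sum() / len(x)        # Step 1: mean of the dataset
deviations = x - mean          # Step 2: difference between each element and the mean
squared = deviations ** 2      # Step 3: square the differences
total = squared.sum()          # Step 4: sum the squared differences
variance = total / len(x)      # Step 5: divide by the number of values

print(mean, variance)  # 5.0 4.0
```

Writing the steps out this way is only for illustration; the intermediate names (`deviations`, `squared`, `total`) are arbitrary.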

> **A note about n**  
> Recall the use of $n$ in the denominator indicates we are calculating the population variance.  
> If we were calculating the variance of a sample from a population, we would use $n-1$ in the denominator.  
> This is known as "Bessel's Correction" and it is used to remove bias that results from estimating population statistics using a sub-sample from the population.
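
As a small sketch of the difference between the two denominators, NumPy's `np.var` accepts a `ddof` ("delta degrees of freedom") argument: the default `ddof=0` divides by $n$, while `ddof=1` divides by $n-1$. The dataset below is a made-up example.

```python
import numpy as np

# Hypothetical toy dataset, treated first as a full population, then as a sample
x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

population_var = np.var(x)        # ddof=0 (default): divide by n
sample_var = np.var(x, ddof=1)    # ddof=1: divide by n - 1 (Bessel's correction)

print(population_var)  # 4.0
print(sample_var)      # 32/7, roughly 4.57
```

The sample variance is always slightly larger than the population variance computed on the same numbers, and the gap shrinks as $n$ grows.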

This math looks a little cumbersome but Python makes it easy to calculate the variance using the ```var``` function from ```NumPy```:

```python
import numpy as np
import numpy.random as npr

# Generate some normally distributed data with a mean of 5 and a standard deviation of 1
npr.seed(12345)
x_vals = npr.normal(loc=5, scale=1, size=1000)

# Calculate the population variance (np.var divides by n by default)
variance = np.var(x_vals)
print(variance)
```

0.9593541796598006


### Wait...isn't this the same as the standard deviation?

At this point (especially after looking at the code block above) you are probably thinking that the variance sounds very similar to the standard deviation. Indeed, the variance is directly related to the standard deviation. Recall that the standard deviation, $\sigma$, is given by the following equation:

$$\sigma_X = \sqrt{\frac{\sum_{i=1}^{n}(x_i-\mu_X)^2}{n}}$$

We can immediately see that the standard deviation is the square root of the variance. So both the variance and standard deviation report on the spread about the mean.

So why do we need two metrics to tell us essentially the same thing? Well, think about units. We square the error term $(x_i-\mu_X)$ to prevent positive and negative differences from canceling each other out. By doing this we also square the units of the variable, so a variable measured in dollars has a variance measured in dollars squared. Taking the square root restores the original units.

So when you report things like uncertainties in measurements, you would use the standard deviation. You will see later on that the variance becomes more useful in other contexts because the square root can complicate some math that is required in different types of data analysis.
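
The relationship between the two metrics, and the point about units, can be checked with a short sketch. The measurements below are hypothetical values in centimeters.

```python
import numpy as np

# Hypothetical measurements in centimeters
heights_cm = np.array([150.0, 160.0, 170.0, 180.0, 190.0])

variance = np.var(heights_cm)   # units: cm^2
std_dev = np.std(heights_cm)    # units: cm

print(variance)                                  # 200.0
print(np.isclose(std_dev, np.sqrt(variance)))    # True
```

A spread of about 14 cm is easy to interpret next to a mean of 170 cm, whereas 200 cm² is not, which is why the standard deviation is usually the one reported.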

## Interpreting Variance

Take a close look at the equation describing the variance. What do you think the variance would be for a dataset where all of the data points were identical? In that case each data point would be equal to the mean so the variance would be 0. Now consider the other extreme where you have a lot of data points that are far from the mean. In that case the variance would be large since the difference between those values and the mean would be large. Let's simulate the effect of increasing the variance of a normally distributed dataset with mean of 10.

<img src="images/var_example3.png" width="500">

Like the stock return example above, we see that increasing the variance results in a much wider distribution that is more spread out about the mean.
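
Both extremes can be simulated in a few lines. This is a rough sketch: the seed, the sample size, and the three `scale` values are arbitrary choices for illustration, not the settings used to produce the figure above.

```python
import numpy as np

# Every data point identical: each value equals the mean, so the variance is 0
constant = np.full(100, 10.0)
print(np.var(constant))  # 0.0

# Normally distributed data with mean 10 and increasing standard deviation
rng = np.random.default_rng(0)   # arbitrary seed for reproducibility
variances = []
for scale in (1.0, 2.0, 4.0):
    sample = rng.normal(loc=10, scale=scale, size=10_000)
    variances.append(np.var(sample))
    print(scale, variances[-1])   # should land near scale ** 2
```

Doubling the standard deviation roughly quadruples the variance, because the variance is the square of the standard deviation.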

## What is Covariance?

Now that we know what the variance of *one* random variable is, imagine extending the idea to *two* random variables at once. Why would we want to do this? Well, if we had a way to measure how much two variables change together, then we could potentially uncover relationships between variables. **Covariance** allows us to do that.

> __Covariance is a measure of how two random variables change together__. 

Covariance is useful because it provides insight into how two variables are **related** to one another. Think about it conceptually: If two variables are related, then changes in one variable would likely accompany changes in the other. This is an example of high covariance. If two variables are completely unrelated then changes in one would not accompany changes in the other. This is an example of no covariance. 

## How Do You Calculate Covariance?

Let's build on the previous example illustrating variance. Now this time, consider two random variables, $X$ and $Y$, where

$X = [x_1, x_2, ... , x_n]$  

$Y = [y_1, y_2, ... , y_n]$

Let $X$ be the independent variable and let $Y$ be the dependent variable.

If $Y$ depends on $X$, we would expect both variables to exhibit similar deviations from their means. The covariance between $X$ and $Y$, denoted $\sigma_{XY}$, is given by the following formula:

$$ \large \sigma_{XY} = \frac{\sum_{i=1}^{n}(x_i -\mu_X)(y_i - \mu_Y)}{n}$$

This looks a lot like the variance formula above, and that makes sense, right? After all, we want to calculate how much $Y$ depends on $X$ (or vice versa) by measuring how values in $Y$ change with values of $X$. In fact, if we set $Y = X$, the covariance formula reduces exactly to the variance formula.

> **Another note about n**  
> Pay close attention to the way the formula for covariance is written.  
> In order for the math to work out, $n$ must be the same for both $X$ and $Y$.
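
Before reaching for a library routine, it can help to evaluate the formula directly. The sketch below uses two made-up arrays of equal length, with `y` chosen as exactly `2 * x` so the result is easy to verify.

```python
import numpy as np

# Hypothetical paired data; both arrays must have the same length n
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 6.0, 8.0, 10.0])   # y = 2x

x_dev = x - x.mean()    # (x_i - mu_X) for every i
y_dev = y - y.mean()    # (y_i - mu_Y) for every i

# Sum the products of paired deviations, then divide by n (population covariance)
covariance = (x_dev * y_dev).sum() / len(x)
print(covariance)  # 4.0
```

Each product `x_dev[i] * y_dev[i]` is positive when both values sit on the same side of their means, which is why this dataset gives a positive result.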

This math is starting to look a little intimidating but Python makes it easy for us to calculate covariance using the ```cov``` function from ```NumPy```.

```python
import numpy as np
import numpy.random as npr

# Generate some normally distributed data with a mean of 5 and a standard deviation of 1
# This is the independent (X) variable
npr.seed(3214)
x_vals = npr.normal(loc=5, scale=1, size=500)

# Generate the dependent (Y) variable that has a positive linear dependence on x_vals
# This means that y = mx + b. Let's pick m = 10, b = 1
y_vals = x_vals * 10.0 + 1.0

# np.cov returns a 2x2 covariance matrix; element [0][1] is the covariance of X and Y
# Note: np.cov divides by n-1 by default; pass bias=True to divide by n as in the formula above
covariance = np.cov(x_vals, y_vals)[0][1]

print(covariance)
```

10.724901942498041
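
The `[0][1]` indexing is easier to follow once you see the whole matrix that `np.cov` returns. As a small sketch on made-up data: the diagonal holds the variance of each variable, and the off-diagonal entries hold the covariance. By default `np.cov` divides by $n-1$; passing `bias=True` divides by $n$ instead, which matches the population formula given earlier.

```python
import numpy as np

# Hypothetical paired data with y = 2x
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 6.0, 8.0, 10.0])

cov_matrix = np.cov(x, y)        # 2x2 matrix, normalized by n - 1 by default

print(cov_matrix[0, 0])   # sample variance of x: 2.5
print(cov_matrix[1, 1])   # sample variance of y: 10.0
print(cov_matrix[0, 1])   # sample covariance of x and y: 5.0
print(cov_matrix[1, 0])   # same value; the matrix is symmetric

population_cov = np.cov(x, y, bias=True)[0, 1]   # divide by n instead
print(population_cov)     # 4.0
```

For large datasets the two normalizations give nearly the same number, but it is worth knowing which one a function uses before comparing results.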


We see in the code block above that there will be a positive covariance when the two variables have a positive linear relationship. Does this make sense given the equation? Let's break down the interpretation.

## Interpreting Covariance

In the above example, we saw that a positive linear relationship between random variables yields a positive covariance. Let's rationalize this with respect to the equation: If a given value of $X$ is greater than $\mu_X$, then we would expect the corresponding value of $Y$ to be greater than $\mu_Y$ since their relationship is positive linear. Therefore, the numerator is a positive number which yields a positive covariance. 

* A positive covariance indicates that **higher than average** values of one variable tend to pair with **higher than average** values of the other variable and **lower than average** values of one variable tend to pair with **lower than average** values of the other variable.

Similarly, we can construct analogous scenarios where there is a negative linear relationship: 

* A negative covariance indicates that **lower than average** values of one variable tend to pair with **higher than average** values of the other variable, and vice versa.

Or no relationship at all:

* Zero covariance (or a value close to zero) indicates no linear relationship between the variables. Higher than average values of one variable are not consistently paired with either higher or lower than average values of the other variable.

This behavior is illustrated using the scatter plots below:

<img src="images/covariance.gif" width="500">
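
The three cases can also be reproduced numerically. The sketch below uses a small hand-picked `x` and a hypothetical helper, `population_cov`, that applies the population formula from this lesson. The last case is worth a second look: `y = x ** 2` is completely determined by `x`, yet the covariance is zero, because covariance only detects linear trends.

```python
import numpy as np

def population_cov(a, b):
    """Population covariance (divide by n); a helper written for this example."""
    return np.mean((a - a.mean()) * (b - b.mean()))

x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])

positive = population_cov(x, 3 * x + 1)    # positive linear relationship
negative = population_cov(x, -3 * x + 1)   # negative linear relationship
nonlinear = population_cov(x, x ** 2)      # symmetric curve, no linear trend

print(positive)    # positive value
print(negative)    # negative value
print(nonlinear)   # zero
```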

### Pay attention to the scale of the data!

The image above doesn't have any numbers. Let's make things a little more quantitative. Consider the two positive linear relationships below:

<img src="images/cov_example2.png" width="500">

They look pretty similar right? Yet they have very different covariances. Why? Check out the scales of the y-axes. The data shown on the right is on a much larger scale. So even though both relationships look identical, their covariances are very different. 

>__Covariance is not standardized so values can range from negative infinity to positive infinity.__
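
The effect of scale is easy to demonstrate. In this rough sketch the data, the seed, and the factor of 100 are all arbitrary assumptions; the second dataset is simply the first with `y` expressed in units that are 100 times smaller.

```python
import numpy as np

rng = np.random.default_rng(1)   # arbitrary seed
x = rng.normal(5, 1, 500)
y = 10 * x + 1 + rng.normal(0, 1, 500)   # noisy positive linear relationship

y_rescaled = 100 * y   # same relationship, different units (e.g. m -> cm)

cov_original = np.cov(x, y)[0, 1]
cov_rescaled = np.cov(x, y_rescaled)[0, 1]

print(cov_original, cov_rescaled)   # the second is 100 times the first
```

Nothing about the strength of the relationship changed between the two calls, only the units of `y`.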

Do you see any potential problems with using the covariance in data science?

## What is Correlation?

So we now know that we can use covariance to determine the relationship between two variables. However, we also know that very similar relationships can have wildly different covariances due to differences in the scale of the data. This prevents us from comparing data measured in different units or on different scales. Ideally, we would normalize the covariance in such a way that the scale was standardized. It turns out that the **Correlation** does just that!

> __Correlation is a measure of how two random variables change together that is independent of scale__.

## How Do You Calculate Correlation?

So what are we normalizing the covariance by? Again, think of the units of $X$ and $Y$. If we wanted to make correlation completely without units, we would just divide by a term that has the same units of $X$ and a term that has the same units of $Y$. Recall that the standard deviation of a random variable has the same units as the mean. We can therefore use the standard deviations of $X$ and $Y$ as our normalization factors. If we denote the correlation as $\rho_{XY}$, then the correlation is defined as: 

$$ \large \rho_{XY} = \frac{\sigma_{XY}}{\sigma_X\sigma_Y}$$

Again, this is for population measurements. In practice, data scientists almost always work with samples taken from a population, which are then used to infer statistical properties of the population as a whole. To estimate the correlation from a sample, we replace the covariance and standard deviations with their sample estimates, which we discussed when we covered statistical inference. The $n-1$ factors in the numerator and denominator cancel, so no explicit $n-1$ term appears in the result. The resulting metric is called the **Pearson Correlation Coefficient** and it is denoted by $r$:

$$ r = \frac{\sum_{i=1}^{n}(x_i -\bar{x})(y_i - \bar{y})} {\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2} \sqrt{\sum_{i=1}^{n}(y_i-\bar{y})^2}}$$

where $\bar{x}$ and $\bar{y}$ are the sample means of $X$ and $Y$.

After normalization, the correlation can only assume values between -1.0 and 1.0.
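
As a minimal sketch of the normalization, the correlation can be computed by dividing the covariance by the two standard deviations. The arrays below are made-up values, and the factor of 100 is an arbitrary rescaling used to show that the result does not depend on units.

```python
import numpy as np

# Hypothetical paired data with an imperfect positive relationship
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 1.0, 4.0, 3.0, 5.0])

def correlation(a, b):
    """Covariance divided by the product of standard deviations (all using n)."""
    cov_ab = np.mean((a - a.mean()) * (b - b.mean()))
    return cov_ab / (np.std(a) * np.std(b))

rho = correlation(x, y)
rho_rescaled = correlation(x, 100 * y)   # change the units of y

print(rho)            # approximately 0.8
print(rho_rescaled)   # the same value, up to floating-point rounding
```

Because the same normalization appears in the numerator and the denominator, using $n$ or $n-1$ consistently gives the same correlation.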

Again, Python makes it easy to calculate $r$. This time we will use the ```pearsonr``` function from ```SciPy```:

```python
import numpy as np
import numpy.random as npr
import scipy.stats as sps

# Generate some normally distributed data with a mean of 5 and a standard deviation of 1
# This is the independent (X) variable
npr.seed(3214)
x_vals = npr.normal(loc=5, scale=1, size=500)

# Generate the dependent (Y) variable that has a positive linear dependence on x_vals
# This means that y = mx + b. Let's pick m = 10, b = 1
y_vals = x_vals * 10.0 + 1.0

# pearsonr returns the coefficient r and a p-value; index [0] keeps only r
pearson = sps.pearsonr(x_vals, y_vals)[0]

print(pearson)
```

1.0


## Interpreting Correlation

In the example above, $r=1.0$. Recall that after normalization, the correlation can only assume values between -1.0, which indicates a perfect negative linear relationship, and 1.0, which indicates a perfect positive linear relationship. We know the data in the code block above must be perfectly positively correlated because we defined $Y$ to be linearly dependent on $X$ with a positive slope. Therefore $r=1.0$.

Let's look at some noisier data with different values of $r$ in order to get a feel for how we should interpret correlation. The "best fit" line that describes the relationship between the variables is shown. We will talk more about this later when we cover regression. For now just use it as a guide to visualize how "spread out" the data are.

<img src="images/pearson_2.png" width="500">

Similar to covariance, we see that negatively related variables have negative $r$ values and positively related variables have positive $r$ values. Variables that have no linear relationship at all have $r=0$, and as the correlation becomes weaker, the value of $r$ approaches 0.

Since $r$ is not sensitive to scale, it can tell us how strongly two variables change together. We could not access this information from covariance since the scale of the data influenced the value of the covariance. In other words, a bigger covariance does not necessarily mean a stronger relationship between variables. With correlation we do not need to worry about this caveat. Variables with values of $r$ that are closer to 0 will always be more weakly correlated than variables with $r$ values that are closer to 1 or -1.
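
To get a feel for how noise weakens the correlation, the sketch below keeps the same underlying linear relationship and adds progressively more random noise. The slope, seed, sample size, and noise levels are all arbitrary choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)   # arbitrary seed
x = rng.normal(5, 1, 1000)

results = {}
for noise_sd in (0.5, 2.0, 5.0):
    # Same slope every time; only the amount of noise around the line changes
    y = 2 * x + rng.normal(0, noise_sd, 1000)
    results[noise_sd] = np.corrcoef(x, y)[0, 1]
    print(noise_sd, round(results[noise_sd], 3))
```

The slope of the relationship is identical in all three cases, yet $r$ drops toward 0 as the points scatter more widely around the line.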

### Warning! There are many different types of correlations in statistics

It may seem silly that we are calling the sample correlation the "Pearson Correlation Coefficient" instead of just the "correlation" but it is important we specify this. There are several other correlations in statistics including:
 
* Kendall rank correlation 
* Spearman rank correlation
* Point-Biserial correlation 

So why do we have so many types of correlations? Different correlations are used in different contexts. These contexts will be determined by the properties of your data. 
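
One property that matters is whether the relationship is linear or merely monotonic. As a rough sketch, assuming SciPy is available, the example below compares `pearsonr` with the rank-based `spearmanr` and `kendalltau` functions from `scipy.stats` on made-up data where `y` always increases with `x` but not along a straight line.

```python
import numpy as np
import scipy.stats as sps

# Hypothetical data: y grows exponentially with x (monotonic but not linear)
x = np.linspace(0, 5, 50)
y = np.exp(x)

pearson_r = sps.pearsonr(x, y)[0]       # measures linear association
spearman_rho = sps.spearmanr(x, y)[0]   # measures monotonic association via ranks
kendall_tau = sps.kendalltau(x, y)[0]   # also rank-based

print(pearson_r)      # noticeably below 1
print(spearman_rho)   # 1.0, up to rounding: the ranks agree perfectly
print(kendall_tau)    # 1.0, up to rounding
```

The rank correlations only ask whether larger `x` goes with larger `y`, so they report a perfect relationship here, while Pearson penalizes the curvature.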

For now, we shall focus on the Pearson correlation as it is the go-to correlation measure for most scenarios you might encounter. That being said, the validity of the Pearson correlation coefficient is still dependent on the properties of the data. Ideally, for the Pearson correlation coefficient to be valid your variables should:

1. Be continuous
2. Be linearly related
3. Have finite and homogeneous variances

Point 3 is getting at a more technical concept in statistics known as "homoscedasticity." We will revisit this point when we cover regression and residuals. For now, think of homoscedasticity as meaning that the spread of the data around the line that best describes their relationship is roughly the same everywhere along that line.
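
A simulation can make this idea more tangible. In the sketch below, everything is an assumption made for illustration: the true line `y = 2x`, the seed, and the way the noise grows with `x` in the second dataset. The hypothetical helper `spread_by_half` measures the scatter around the known line separately for small and large `x`.

```python
import numpy as np

rng = np.random.default_rng(7)   # arbitrary seed
x = rng.uniform(0, 10, 2000)

# Homoscedastic: the noise has the same spread at every x
y_homo = 2 * x + rng.normal(0, 1.0, x.size)

# Heteroscedastic: the noise spread grows as x grows
y_hetero = 2 * x + rng.normal(0, 1.0, x.size) * (0.2 + 0.3 * x)

def spread_by_half(x, y):
    """Standard deviation of the scatter around y = 2x for x < 5 and x >= 5."""
    residuals = y - 2 * x
    return residuals[x < 5].std(), residuals[x >= 5].std()

homo_low, homo_high = spread_by_half(x, y_homo)
hetero_low, hetero_high = spread_by_half(x, y_hetero)

print(homo_low, homo_high)       # roughly equal
print(hetero_low, hetero_high)   # second value clearly larger
```

In real data the true line is unknown, so this check is normally done on the residuals of a fitted regression line instead.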

## Summary

Wow that was a lot of information! But it's a good thing we covered **variance**, **covariance**, and **correlation** together because there is a lot of conceptual overlap among these topics. Let's recap the key points and then take a much needed break!

* The variance quantifies how much a random variable deviates from its mean value.

* The variance is the square of the standard deviation.

* The covariance measures how two variables change together. A covariance of 0 indicates the two variables have no linear relationship. Positive covariances indicate the two variables change in the same direction and negative covariances indicate the two variables change in opposite directions.

* The covariance is not normalized and can span from negative infinity to infinity. It is very sensitive to the scale of the data and its magnitude should not be interpreted to be a measure of how strongly two variables are coupled.

* The correlation is the covariance of two variables normalized by their respective standard deviations.

* The correlation can span from -1 to 1. It is insensitive to the scale of the data. A correlation of 0 indicates the two variables have no linear relationship. Positive correlations indicate the two variables change in the same direction and negative correlations indicate the two variables change in opposite directions.

This lesson was a little math heavy at times but it does help to take a closer look at the formulas. Even though Python will do the calculations for us, understanding what the equations are telling us will help us develop a deep intuitive understanding that will come in handy when working on the job.