# Variance, Covariance and Correlation

## Introduction 

In this lesson, we will look at how **Variance** of a random variable is used to calculate **Covariance** and **Correlation**, two key measures used in statistics for finding the relationships between random variables. These measures help us identify the degree to which two sets of data tend to deviate from their expected value (i.e. mean), in a similar way.  Based on these measures, we can see if two variables move together, and to what extent.  This lesson will help you develop a conceptual understanding, necessary calculations, and some precautions to keep in mind when using these measures. 

## Learning Objectives

You will be able to:

* Understand and explain data variance and how it relates to standard deviation
* Understand and calculate Covariance and Correlation between two random variables
* Visualize and interpret Covariance and Correlation

---

## What is Variance ($\sigma^2$)

Before we talk about covariance, we need a clear idea of the **Variance** of a random variable. Variance describes the __spread of a variable in a dataset__.

> __Variance is a measure used to quantify how much a random variable deviates from its mean value__. 

When we calculate variance, we are essentially asking, "__Given all of the observed data points, how far from the mean do we expect a data point to be?__"  The distance of a single data point from the mean is called its **deviation** (or **error term**), and variance is the average of the squared deviations.

Variance is denoted by $\sigma^2$. Previously, we have seen $\sigma$ as the notation for standard deviation, which is also a measure of the spread of data. __Variance is simply the square of the standard deviation (equivalently, standard deviation is the square root of variance)__.

### Example Use Case

For example, a simple application of this measure could be associating probabilities with predicted future events in a market research activity, identifying them as "very likely" or "unlikely" etc. Investing is another: most people are risk averse, in that they wish to minimize the amount of risk they must endure to earn a certain level of expected return. An investor who was indifferent to risk would not care how widely a stock's returns vary, whereas a risk-averse investor choosing between two stocks with the same expected return would clearly prefer the one whose returns vary less. Therefore, most people want to know the range or dispersion (the spread, or deviation, as we termed it earlier) of possible outcomes, as well as the likelihood of certain outcomes occurring. Variance summarizes that dispersion in a single number.


Consider the following graphs for Conglomo, Inc. and Bilco, Inc. These graphs show the theoretical frequency distributions of the monthly returns for each firm's common stock as though the returns were normally distributed.

<img src="images/var.png" width=400>

Conglomo's distribution of returns is more concentrated than Bilco's, as illustrated by Conglomo's relatively narrower bell curve. A more concentrated distribution has a smaller standard deviation. Its curve appears higher, steeper, and narrower because more observations occur close to the expected return. Bilco's distribution is rather flat, reflecting that its returns are less concentrated, or more dispersed, than those of Conglomo, Inc.

### Interpreting Variance 

A variance of zero means that all of the values within a data set are identical. Any other variance is a positive number, because it is an average of squared differences. The larger the variance, the more spread out the data set: a large variance means that the numbers in the set are far from the mean and from each other, while a small variance means that they are close together.
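
As a rough illustration of these statements, the sketch below compares three made-up data sets that share the same mean. It assumes NumPy is available; the variable names and values are purely illustrative.

```python
import numpy as np

# Three small, made-up data sets that all have a mean of 10.
identical = np.array([10, 10, 10, 10, 10])
tight = np.array([8, 9, 10, 11, 12])
wide = np.array([0, 5, 10, 15, 20])

# np.var divides by n by default (population variance).
var_identical = np.var(identical)
var_tight = np.var(tight)
var_wide = np.var(wide)

print(var_identical)  # 0.0  -> all values are identical
print(var_tight)      # 2.0  -> values sit close to the mean
print(var_wide)       # 50.0 -> values are far from the mean
```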

### How to Calculate Variance? 

Variance is calculated by:
1. Taking the squared differences between each element in a data set and the mean. 
2. Summing those squared differences.
3. Dividing the sum of the resulting squares by the number of values in the set, *n*.

$$\sigma^2 = \frac{\sum(x-\mu)^2}{n}$$

Here, $x$ represents an individual data point, $\mu$ represents the mean of the data points, and $n$ is the total number of data points. Remember that when calculating a sample's variance in order to estimate a population variance, the denominator of the variance equation becomes $n - 1$. Doing so removes bias from the estimate.
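
The three steps can be sketched in Python as follows. This is a minimal example on made-up values, assuming NumPy is installed; it also shows the $n - 1$ variant and checks the hand calculation against `np.var`, whose `ddof` argument controls the denominator.

```python
import numpy as np

# Made-up data points, used only for illustration.
data = np.array([2, 4, 4, 4, 5, 5, 7, 9])

n = len(data)
mu = data.sum() / n                # mean of the data points

squared_diffs = (data - mu) ** 2   # step 1: squared differences from the mean
sum_sq = squared_diffs.sum()       # step 2: sum of the squared differences
population_var = sum_sq / n        # step 3: divide by n
sample_var = sum_sq / (n - 1)      # sample estimate: divide by n - 1

print(population_var)              # 4.0
print(sample_var)                  # 32 / 7, about 4.57

# NumPy equivalents: ddof=0 (default) divides by n, ddof=1 by n - 1.
np_population_var = np.var(data)
np_sample_var = np.var(data, ddof=1)

# Standard deviation is the square root of the variance.
std = np.std(data)
print(std)                         # 2.0
```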

The following illustration summarizes how the spread of data around the mean (10) relates to the variance. 

<img src="images/var2.png" width=500>


---

## Covariance ($\sigma_{xy}$)

Now that we know what variance is and what quantity it measures, imagine extending the idea to two random variables to get some idea of how they change together (or stay the same) across all of their paired values.

In statistics, if we are trying to figure out how two random variables **tend to vary** together, we are effectively talking about the **Covariance** between these variables. Covariance provides insight into how two variables **move in relation to each other**.

> Covariance is a measure of how two random variables in a data set __change together__. 

### How to Calculate Covariance?
In essence, covariance is used to measure **how much variables change together**, and it's calculated using the formula:


$$ \large \sigma_{XY} = \frac{\sum_{i=1}^{n}(x_i -\mu_x)(y_i - \mu_y)}{n}$$

Here $X$ and $Y$ are two random variables having n elements each. We want to calculate ___how much $Y$ depends on $X$___ (or vice-versa), by measuring how values in $Y$ change with observed changes in $X$ values. 

> This makes $X$ our __independent variable__ and $Y$, the __dependent variable__.  

$x_i$ = ith element of variable $X$

$y_i$ = ith element of variable $Y$

$n$ = number of data points (__$n$ is the same for $X$ and $Y$, because they are paired.__)

$\mu_x$ = mean of the independent variable $X$

$\mu_y$ = mean of the dependent variable $Y$

$\sigma_{XY}$ = Covariance between $X$ and $Y$

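*We can see that the above formula closely mirrors the variance formula: instead of squaring each element's deviation from the mean, it multiplies the deviation of each $x_i$ by the deviation of the corresponding $y_i$. (The covariance of a variable with itself is exactly its variance.) Hence the term __Co-Variance__.*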
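
The covariance formula can be sketched in plain Python as below. The function name `covariance` and the sample values are illustrative assumptions, not part of any library; the function divides by $n$, as in the formula above.

```python
def covariance(x, y):
    """Population covariance of two paired sequences (divides by n)."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same number of elements")
    n = len(x)
    mu_x = sum(x) / n
    mu_y = sum(y) / n
    # Multiply each pair of deviations from the means, then average.
    return sum((xi - mu_x) * (yi - mu_y) for xi, yi in zip(x, y)) / n


# Made-up paired values.
x = [1, 2, 3, 4, 5]
y_up = [2, 4, 6, 8, 10]     # rises as x rises
y_down = [10, 8, 6, 4, 2]   # falls as x rises

print(covariance(x, y_up))    # 4.0
print(covariance(x, y_down))  # -4.0
print(covariance(x, x))       # 2.0, the variance of x
```

Passing the same sequence twice returns its variance, which is one way to check the implementation.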

### Interpreting Covariance values 

* A positive covariance indicates that **higher than average** values of one variable tend to pair with higher than average values of the other variable.

* A negative covariance indicates that higher than average values of one variable tend to pair with **lower than average** values of the other variable. 

* A zero value, or a value close to zero, indicates that the variables show no linear tendency to vary together, i.e. knowing that a value of one variable is above or below its mean tells us little about the paired value of the other variable.

These patterns are shown in the scatter plots below.
<img src="images/covariance.gif" width=500>



A large negative covariance shows an inverse relationship between the values on the x and y axes, i.e. y decreases as x increases. This is shown by the scatter plot on the left. The middle scatter plot shows values spread all over the plot, reflecting the fact that the variables on the x and y axes do not vary together. The covariance for these variables would be very close to zero. 

In the scatter plot on the right, we see a positive relationship between the values on the x and y axes, i.e. y increases as x increases.
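
Covariance carries the units of both variables, so rescaling a variable rescales the covariance. The sketch below uses NumPy's `np.cov` on made-up height and weight values (the names and numbers are assumptions for illustration). Note that `np.cov` returns a 2x2 matrix rather than a single number, and that `bias=True` is passed so that it divides by $n$ like the formula in this lesson.

```python
import numpy as np

# Made-up paired measurements: height in metres, weight in kilograms.
height_m = np.array([1.60, 1.65, 1.70, 1.75, 1.80])
weight_kg = np.array([55.0, 60.0, 66.0, 72.0, 80.0])

# np.cov returns a covariance matrix:
# [[var(height),         cov(height, weight)],
#  [cov(weight, height), var(weight)        ]]
# bias=True divides by n; the default divides by n - 1.
cov_matrix = np.cov(height_m, weight_kg, bias=True)
cov_m_kg = cov_matrix[0, 1]

# The same heights in centimetres: the relationship has not changed,
# but the covariance is about 100 times larger.
height_cm = height_m * 100
cov_cm_kg = np.cov(height_cm, weight_kg, bias=True)[0, 1]

print(cov_matrix)
print(cov_m_kg, cov_cm_kg)
```

The diagonal entries of the matrix are the variances of each variable, and the two off-diagonal entries are equal because covariance is symmetric.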

>__Covariance is not standardized. Therefore, covariance values can range from negative infinity to positive infinity.__


---

## Correlation 

Above, we saw how covariance can identify the degree to which two random variables tend to vary together, using a formula whose result depends on the units of the $X$ and $Y$ variables. During data analysis, covariance values cannot be compared directly, because different experiments may contain underlying data measured in different units. Therefore, we need to scale covariance into a standard, unit-free measure whose results can be interpreted independently of the units of the data. We achieve this with a derived, normalized measure called correlation.

Correlation is defined as covariance divided by the product of the standard deviations of $X$ and $Y$. This scaling restricts the range of possible values to between -1 and 1. So the correlation between $X$ and $Y$ would be calculated as:

$$Correlation(X,Y) = \frac{\sigma_{XY}}{\sigma_X\sigma_Y}$$

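Dividing by $\sigma_X\sigma_Y$ cancels the units: covariance is measured in units of $X$ times units of $Y$, and so is the product of the two standard deviations, which leaves a unit-free ratio.

>When two random variables **correlate**, a change in one variable tends to be **associated with** a change in the other. Correlation on its own does not show that one variable causes the other to change.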

In data science practice, we typically look at correlation rather than covariance because it is more interpretable: it does not depend on the scale of either random variable involved.
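
A minimal sketch of this scale independence, assuming NumPy and using made-up height and weight values: the helper `correlation` below (an illustrative name, not a library function) divides the covariance by the product of the standard deviations, and the result is unchanged when the heights are converted from metres to centimetres.

```python
import numpy as np


def correlation(x, y):
    """Covariance divided by the product of the standard deviations."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    cov_xy = np.mean((x - x.mean()) * (y - y.mean()))  # divides by n
    return cov_xy / (x.std() * y.std())                # std also divides by n


# Made-up paired measurements.
height_m = np.array([1.60, 1.65, 1.70, 1.75, 1.80])
weight_kg = np.array([55.0, 60.0, 66.0, 72.0, 80.0])

r_metres = correlation(height_m, weight_kg)
r_centimetres = correlation(height_m * 100, weight_kg)

print(r_metres, r_centimetres)  # the two values agree up to rounding

# NumPy's built-in gives the same number (off-diagonal of a 2x2 matrix).
r_numpy = np.corrcoef(height_m, weight_kg)[0, 1]
```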


### Use Cases


#### Social Media and Websites
Digital publishers want to understand the potential relationship between social media activity and visits to their website. For example, a digital publisher runs a correlation report between hourly Twitter mentions and website visits over two weeks. The correlation is found to be r = 0.28, which indicates a weak-to-moderate, positive relationship between Twitter mentions and website visits.

#### Optimization for E-retailers
E-retailers are interested in driving increased revenue. For example, an e-retailer wants to compare a number of secondary success events (e.g., file downloads, product detail page views, internal search click-throughs, etc.) with weekly web revenue. They can quickly identify internal search click-throughs as having the highest correlation, which may indicate an area for optimization.

### Types of Correlation Measures

The __linear correlation coefficient__, $r$, measures the strength and the direction of a linear relationship between two variables. It is also called __Pearson's correlation coefficient__. 

In statistics, four types of correlation are commonly used for detailed relationship analysis: 
* Pearson correlation 
* Kendall Rank correlation 
* Spearman correlation
* Point-Biserial correlation


For now, we will focus on Pearson correlation as it is the go-to correlation measure for most situations. 

For the Pearson correlation, both variables should be (approximately) normally distributed. Other assumptions include linearity and homoskedasticity. Linearity assumes a straight-line relationship between the two variables, and homoskedasticity assumes that the data points are spread equally about the regression line across the whole range of values.
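
The linearity assumption matters because Pearson's $r$ only picks up straight-line relationships. As a small sketch on made-up values (assuming NumPy), the second variable below is completely determined by `x`, yet its Pearson correlation with `x` is essentially zero because the relationship is a symmetric curve rather than a line.

```python
import numpy as np

x = np.arange(-5, 6)   # -5, -4, ..., 5 (symmetric around 0)

y_linear = 2 * x + 1   # exact straight-line relationship
y_curved = x ** 2      # exact, but non-linear, relationship

# np.corrcoef returns a 2x2 matrix; [0, 1] is the correlation of the pair.
r_linear = np.corrcoef(x, y_linear)[0, 1]
r_curved = np.corrcoef(x, y_curved)[0, 1]

print(r_linear)  # very close to 1
print(r_curved)  # very close to 0, despite the perfect dependence
```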

### Calculating Coefficient of Correlation (r)


Pearson Correlation (r) is calculated using the following formula:

$$ r = \frac{\sum_{i=1}^{n}(x_i -\mu_x)(y_i - \mu_y)} {\sqrt{\sum_{i=1}^{n}(x_i - \mu_x)^2 \sum_{i=1}^{n}(y_i-\mu_y)^2}}$$

So just like in the case of covariance,  $X$ and $Y$ are paired random variables having n elements each. 


$x_i$ = ith element of variable $X$

$y_i$ = ith element of variable $Y$

$n$ = number of data points (__$n$ must be the same for $X$ and $Y$__)

$\mu_x$ = mean of the independent variable $X$

$\mu_y$ = mean of the dependent variable $Y$

$r$ = Calculated Pearson Correlation


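The numerator is $n$ times the covariance of $X$ and $Y$, and the denominator is $n$ times the product of their standard deviations, so the factors of $n$ cancel and $r$ is the same quantity as the correlation formula given earlier. A detailed mathematical insight into this equation is available [in this paper](http://www.hep.ph.ic.ac.uk/~hallg/UG_2015/Pearsons.pdf).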
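
The formula for $r$ translates almost term by term into plain Python. The sketch below is one possible implementation on made-up values; the function name `pearson_r` is an assumption for illustration, not a library call.

```python
import math


def pearson_r(x, y):
    """Pearson correlation of two paired sequences, following the formula."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same number of elements")
    n = len(x)
    mu_x = sum(x) / n
    mu_y = sum(y) / n
    # Numerator: sum of products of paired deviations.
    numerator = sum((xi - mu_x) * (yi - mu_y) for xi, yi in zip(x, y))
    # Denominator: square root of the product of the sums of squared deviations.
    ss_x = sum((xi - mu_x) ** 2 for xi in x)
    ss_y = sum((yi - mu_y) ** 2 for yi in y)
    return numerator / math.sqrt(ss_x * ss_y)


# Made-up paired values.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
y_reversed = [10, 8, 6, 4, 2]

r_xy = pearson_r(x, y)             # 6 / sqrt(60), about 0.77
r_neg = pearson_r(x, y_reversed)   # -1.0, a perfect inverse line

print(r_xy, r_neg)
```

The function divides by zero if either sequence is constant, since a variable with no spread has no defined correlation.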


### Interpreting Correlation values

> __The correlation formula shown above always gives values in a range between -1 and 1__

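To see why, write $a_i = x_i - \mu_x$ and $b_i = y_i - \mu_y$. The Cauchy-Schwarz inequality (a standard result, stated here without proof) says that

$$\left(\sum_{i=1}^{n} a_i b_i\right)^2 \le \left(\sum_{i=1}^{n} a_i^2\right)\left(\sum_{i=1}^{n} b_i^2\right)$$

Taking square roots shows that the absolute value of the numerator of $r$ can never exceed its denominator, so $-1 \le r \le 1$. The extremes are reached only when all of the points lie exactly on a straight line.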

If two variables have a correlation of +0.9, an increase in one variable is accompanied by a nearly proportional increase in the other. A correlation of -0.9 means that an increase in one variable is accompanied by a nearly proportional decrease in the other. A Pearson correlation near 0 indicates little or no linear relationship. Here are some examples of Pearson correlation values as scatter plots. 

![](images/pearson_2.png)

Think about stock markets in terms of correlation. All the stock market indexes tend to move together in similar directions. When the Dow Jones loses 5%, the S&P 500 usually loses around 5%. When the Dow Jones gains 5%, the S&P 500 usually gains around 5%, because they are **highly correlated**.

On the other hand, there could also be negative correlation. You might observe that as the Dow Jones loses 5% of its value, gold gains 5%. Alternatively, if the Dow Jones gains 5% of its value, gold may lose 5% of its value. That's **negative correlation**.

Stock traders can use information about positive and negative correlations between prices of assets when building their portfolios.

### So how do these measures relate to each other?

Are covariance and correlation the same thing? Simply put, no.

While both covariance and correlation indicate whether variables are positively or inversely related to each other, they are not considered to be the same. This is because correlation also informs about the degree to which the variables tend to move together.

Covariance is expressed in the combined units of the two variables being measured. Using covariance, analysts can determine whether two variables tend to increase together or move in opposite directions, but they cannot easily say to what degree the variables move together, since covariance is not expressed on one standardized scale.

Correlation, on the other hand, standardizes the measure of interdependence between two variables and informs researchers as to how closely the two variables move together.
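
One way to see this standardization, sketched here with NumPy on made-up values: if each variable is first converted to z-scores (subtract the mean, divide by the standard deviation), the covariance of the standardized variables equals the correlation of the originals. The helper name `standardize` is an assumption for illustration.

```python
import numpy as np


def standardize(values):
    """Convert values to z-scores: mean 0 and standard deviation 1."""
    values = np.asarray(values, dtype=float)
    return (values - values.mean()) / values.std()


# Made-up paired values on very different scales.
x = np.array([12.0, 15.0, 11.0, 19.0, 23.0, 17.0])
y = np.array([1400.0, 1650.0, 1200.0, 2100.0, 2300.0, 1800.0])

z_x = standardize(x)
z_y = standardize(y)

# The z-scores have mean 0, so their covariance is the mean of the products.
cov_of_z_scores = np.mean(z_x * z_y)

r = np.corrcoef(x, y)[0, 1]

print(cov_of_z_scores, r)  # the two values agree up to rounding
```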

---

## Summary
In this lesson, we looked at calculating the variance of random variables as a measure of deviation from the mean. We saw how this measure can be used to first calculate covariance, and then correlation, to analyze how the change in one variable is associated with change in another variable. Next, we will see how we can use correlation analysis to run a __regression analysis__ and later, how covariance calculation helps us with dimensionality reduction.