# Variance, Covariance and Correlation

## Introduction 

In this lesson, we will use the **variance** of a variable to calculate **covariance** and **correlation**, two key statistical measures to find the relationship between variables. These measures help identify the degree to which two sets of data tend to deviate from their expected value, i.e. the mean. We can use these measures to identify whether two variables change in relation to each other, and to what extent. This lesson will help you understand these terms conceptually, enable you to calculate them, and equip you with caveats for handling them.

## Objectives

You will be able to:

* Understand and explain data variance and how it relates to standard deviation
* Understand and calculate covariance and correlation between two random variables
* Visualize and interpret the results of covariance and correlation

## What is Variance ($\sigma^2$)?

Before discussing covariance, it is important to understand the **variance** of a random variable. Variance refers to the __spread of a data set__.

> __Variance quantifies how much a random variable deviates from the mean value__. 

When we calculate variance, we are essentially asking, "__Given the relationship of all data points, how far from the mean do we expect the next data point to be?__"

The notation for variance is $\sigma^2$. We have seen that $\sigma$ is the notation for standard deviation within a given dataset. 

  * Remember that standard deviation is also a measure of spread of data. 
  
  * Variance is simply the square of standard deviation. 
  
  * Conversely, standard deviation is the square root of variance.  
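
The relationship between the two measures can be checked numerically. The following is a minimal sketch using Python's built-in `statistics` module; the `values` list is made-up data chosen only for illustration.

```python
import math
import statistics

# Hypothetical data, chosen only to illustrate the relationship
values = [2, 4, 4, 4, 5, 5, 7, 9]

variance = statistics.pvariance(values)  # population variance
std_dev = statistics.pstdev(values)      # population standard deviation

print(f"Variance: {variance}")
print(f"Standard deviation: {std_dev}")

# Squaring the standard deviation recovers the variance,
# and the square root of the variance recovers the standard deviation
print(math.isclose(std_dev ** 2, variance))
print(math.isclose(math.sqrt(variance), std_dev))
```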

### Example Use Case

For example, a simple application of variance as part of market research could be to associate probabilities with predicted future events, categorizing them as "very likely," "unlikely," etc. The goal is to minimize the risk that must be taken in order to obtain a certain amount of expected return. An investor indifferent to risk would not be influenced by the differences between the stocks of the two companies shown below, whereas a risk-averse investor would clearly prefer one stock over the other. Therefore, it's in the investor's interest to understand the spread of possible outcomes (expressed by variance or standard deviation), as well as the likelihoods of certain outcomes. Variance offers a way, given the range and the values of a dataset, to identify the likelihood that a random variable will have a certain value.

Consider the following graphs for Conglomo, Inc. and Bilco, Inc. These graphs show the theoretical frequency distributions of the monthly returns for each firm's common stock as though the returns were normally distributed.

<img src="images/var.png" width=400>

Conglomo's distribution of returns is more concentrated than Bilco's, as illustrated by Conglomo's relatively narrower bell curve. A more concentrated distribution means a smaller standard deviation. The distribution curve appears higher, steeper, and narrower because more data points are found close to the expected return (the mean). Bilco's distribution is flatter, since its returns are more dispersed than those of Conglomo, Inc.

### Interpreting Variance 

A variance of zero means that all of the values within a data set are identical. Any variance not equal to zero will be positive. Why is this?
  
  * Remember that variance is the square of standard deviation, and that the square of any nonzero number, whether positive or negative, will be positive. Equivalently, variance is an average of squared differences, and none of those squares can be negative.
  
The larger the variance, the more widely spread the data set. A large variance means that, on average, the data points in a set are far from the mean and each other. A small variance means that the numbers are closer together in value, and closer to the mean.

### How to Calculate Variance? 

Variance is calculated by:

1. Taking the difference between each element in a data set and the mean,
2. Squaring those differences so that each one is non-negative,
3. Dividing the sum of the resulting squares by the number of values in the set.

$$\sigma^2 = \frac{\sum(x-\mu)^2}{n}$$

Here, $x$ represents an individual data point and $\mu$ represents the mean of the data points. $n$ is the total number of data points. 

  * Remember that when calculating a sample variance in order to estimate a population variance, the denominator of the variance equation becomes $n - 1$. 
  
  * This reduces bias. In other words, it reduces the underestimation of the population variance.
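
The three steps, and the $n - 1$ adjustment for samples, can be sketched in plain Python as follows. The `data` list is a made-up example, and the variable names are illustrative only.

```python
# Hypothetical data, chosen only for illustration
data = [2, 4, 6, 8, 10]
n = len(data)

# Step 1: difference between each element and the mean
mean = sum(data) / n
differences = [x - mean for x in data]

# Step 2: square each difference
squared_differences = [d ** 2 for d in differences]

# Step 3: divide the sum of squares by n (population variance)
population_variance = sum(squared_differences) / n

# For a sample used to estimate a population variance, divide by n - 1
sample_variance = sum(squared_differences) / (n - 1)

print(f"Population variance: {population_variance}")
print(f"Sample variance: {sample_variance}")
```

Because the sample version divides the same sum by a smaller number, it is always at least as large as the population version computed from the same data.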

The following illustration summarizes how the spread of data around the mean (10) relates to the variance.

  * The red curve, with a variance of 1, shows a narrow distribution around the mean.
  
  * The green curve, with a variance of 2, is shorter and flatter, showing a wider distribution than the red curve.
  
  * The purple curve, with a variance of 5, is even shorter and flatter, showing a still wider distribution of its data points.

<img src="images/var2.png" width=500>



### Calculating Variance in Python

Below are the results of a test in a class of 10 students. Identify the mean and the variance.

|  Student ID #| Grade |
|---|---|
|  1 |  60|
|  2 |   85|
|  3 |  70 |
|  4 |  75 |
|  5|   75|
|  6 |  90 |
|  7 |  55 |
|   8|  80 |
| 9  |   85|
|  10 | 75  |

```python
import numpy as np

# Grades from the first test
grades = [60, 85, 70, 75, 75, 90, 55, 80, 85, 75]
grades_mean = np.mean(grades)
print(f"Mean: {grades_mean}")
# np.var divides by n by default (population variance)
grades_variance = np.var(grades)
print(f"Variance: {grades_variance}")
```

```
Mean: 75.0
Variance: 110.0
```


Below are the results of a second test. Identify the mean and the variance.

|  Student ID #| Grade |
|---|---|
|  1 |  75|
|  2 |   75|
|  3 |  70 |
|  4 |  75 |
|  5|   75|
|  6 |  70 |
|  7 |  75 |
|   8|  70 |
| 9  |   75|
|  10 | 75  |

```python
# Grades from the second test
grades2 = [75, 75, 70, 75, 75, 70, 75, 70, 75, 75]
grades2_mean = np.mean(grades2)
print(f"Mean: {grades2_mean}")
grades2_variance = np.var(grades2)
print(f"Variance: {grades2_variance}")
```

```
Mean: 73.5
Variance: 5.25
```
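
To connect these results back to standard deviation and to the sample-variance adjustment mentioned earlier, the sketch below reuses the grades from the first test. It relies on NumPy's `np.std` and on the `ddof` argument of `np.var`; treating the ten grades as a sample is an assumption made only for illustration.

```python
import numpy as np

grades = [60, 85, 70, 75, 75, 90, 55, 80, 85, 75]

# Standard deviation is the square root of the (population) variance
grades_std = np.std(grades)
print(f"Standard deviation: {grades_std}")

# ddof=1 switches the denominator from n to n - 1 (sample variance)
grades_sample_variance = np.var(grades, ddof=1)
print(f"Sample variance: {grades_sample_variance}")
```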


## Covariance ($\sigma_{xy}$)

We have just seen that variance offers insight into the distribution of data points in a set, and the likelihood of the occurrence of a particular value. In other words, variance reveals how a single variable varies in value within its data set.

Covariance provides an insight into how two variables are __related__ to one another. 

More precisely, covariance refers to:

> The measure of how two random variables in a data set will __change together__.
  
### How to Calculate Covariance?

In essence, covariance is used to measure **how variables change *together***, and is calculated using the following formula:

$$\sigma_{xy} = \frac{\sum_{i=1}^{n}(x_i -\mu_x)(y_i - \mu_y)}{n}$$

Here, $X$ and $Y$ are two random variables with $n$ data points each. We want to measure ___how much $X$ and $Y$ vary together___ by checking whether values of $Y$ tend to sit above or below their mean when the paired values of $X$ sit above or below theirs.

> Covariance is symmetric: swapping $X$ and $Y$ gives the same value, so neither variable has to be treated as the __independent__ or the __dependent variable__.

$x_i$ = ith element of variable $X$

$y_i$ = ith element of variable $Y$

$n$ = number of data points (__$n$ must be same for $X$ and $Y$__)

$\mu_x$ = mean of variable $X$

$\mu_y$ = mean of variable $Y$

$\sigma_{xy}$ = covariance between $X$ and $Y$

Note the similarities between the formulae for variance and covariance. 

  * The covariance of $X$ and $Y$ is obtained by multiplying each deviation of $X$ from its mean by the corresponding deviation of $Y$ from its mean, then averaging the products. The covariance of a variable with itself is its variance. Hence the term __co-variance__.
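
As a rough sketch, the population covariance (dividing by $n$) can be computed by hand as shown below. The helper name `population_covariance` and the paired data are hypothetical, introduced only for illustration.

```python
def population_covariance(xs, ys):
    """Average product of paired deviations from each mean (divides by n)."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same number of data points")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n

# Hypothetical paired data
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

cov_xy = population_covariance(x, y)
cov_yx = population_covariance(y, x)
var_x = population_covariance(x, x)

print(f"Covariance of x and y: {cov_xy}")
print(f"Covariance of y and x: {cov_yx}")
print(f"Covariance of x with itself (its variance): {var_x}")
```

The order of the arguments does not change the result, and passing the same variable twice reproduces its variance.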

### Interpreting Covariance Values 

* A positive covariance indicates that the values of one variable tend to go in the same direction as the values of the other variable.

* A negative covariance indicates that the values of one variable tend to go in the opposite direction as the values of the other variable.

* A covariance of zero, or close to zero, indicates that there is no linear relationship between the two variables, i.e. the values of one variable do not tend to rise or fall with the values of the other.
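
The three cases can be illustrated with small made-up data sets. This sketch uses NumPy's `np.cov`, which the lesson introduces further below; by default it divides by $n - 1$, which affects the magnitude of the result but not its sign.

```python
import numpy as np

x = [1, 2, 3, 4, 5]
y_rising = [2, 4, 6, 8, 10]    # rises as x rises
y_falling = [10, 8, 6, 4, 2]   # falls as x rises
y_u_shaped = [4, 1, 0, 1, 4]   # falls, then rises: no overall linear trend

# np.cov returns a 2x2 matrix; element [0][1] is the covariance of the pair
cov_rising = np.cov(x, y_rising)[0][1]
cov_falling = np.cov(x, y_falling)[0][1]
cov_u_shaped = np.cov(x, y_u_shaped)[0][1]

print(f"Rising:   {cov_rising}")
print(f"Falling:  {cov_falling}")
print(f"U-shaped: {cov_u_shaped}")
```

The U-shaped case is worth noting: `y_u_shaped` is fully determined by `x`, yet its covariance with `x` is zero, because covariance only captures a linear tendency.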

This behavior can be further explained using the scatter plots below:
<img src="images/covariance.gif" width=500>

A large negative covariance value shows an inverse relationship between the values on the x and y axes. That is, y decreases as x increases. This is shown by the scatter plot on the left. 

The middle scatter plot shows values spread all over the plot, showing that the variables on the x and y axes cannot be related in terms of how they vary together. The covariance value for such variables would be very close to zero. 

In the scatter plot on the right, we see a strong positive relationship between the values on the x and y axes. That is, y increases as x increases. 

>__Covariance is not standardized. Therefore, covariance values can range from negative infinity to positive infinity.__
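
One consequence of this is that the covariance value changes when the units change, even though the underlying data are the same. The sketch below uses made-up heights and weights and NumPy's `np.cov` to illustrate the effect.

```python
import numpy as np

# Hypothetical heights and weights for four people
heights_m = [1.5, 1.6, 1.7, 1.8]
weights_kg = [50, 60, 65, 80]

# The same heights expressed in centimetres
heights_cm = [h * 100 for h in heights_m]

cov_m = np.cov(heights_m, weights_kg)[0][1]
cov_cm = np.cov(heights_cm, weights_kg)[0][1]

print(f"Covariance with heights in metres:      {cov_m}")
print(f"Covariance with heights in centimetres: {cov_cm}")
```

Rescaling one variable by a factor of 100 rescales the covariance by the same factor, so the magnitude of a covariance cannot be judged without knowing the units.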



### Calculating Covariance in Python

The following table lists the temperature of an oceanside town and the number of surfers in the waters over a sampling of 5 days in a given month.

|Day   |   Temperature, Celsius|  Number of Surfers |
|---|---|---|
| 1  |  20 | 350  |
|  6 |  25 |  450 |
| 13  |  28 |  470 |
|  20 |   27|  455 |
|  27 |   25|  440 |


```python
temp = [20, 25, 28, 27, 25]
surfers = [350, 450, 470, 455, 440]

# np.cov returns a 2x2 matrix; element [0][1] is the covariance of temp and surfers
covariance = np.cov(temp, surfers)[0][1]
print(f"Covariance: {covariance}")
```

```
Covariance: 142.5
```


Because the covariance value of 142.5 is positive, we can conclude that the two variables--the temperature and the number of surfers--move in the same direction.

Note that if the dataset is a sample, the denominator in the covariance formula becomes $n-1$ instead of $n$. 

NumPy's covariance function, `np.cov()`, assumes a sample dataset by default, so its denominator is $n-1$. This is how the value of 142.5 above was calculated.

To calculate the covariance of a population, use the `ddof=0` argument: `np.cov(dataset1, dataset2, ddof=0)`.

See below for further information.

[https://docs.scipy.org/doc/numpy/reference/generated/numpy.cov.html](https://docs.scipy.org/doc/numpy/reference/generated/numpy.cov.html)
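
The difference between the two denominators can be seen on the temperature and surfer data. The sketch below also prints the full matrix returned by `np.cov`, to show where the `[0][1]` index comes from.

```python
import numpy as np

temp = [20, 25, 28, 27, 25]
surfers = [350, 450, 470, 455, 440]

# np.cov returns a 2x2 covariance matrix:
# [[var(temp),          cov(temp, surfers)],
#  [cov(surfers, temp), var(surfers)      ]]
sample_matrix = np.cov(temp, surfers)
print(sample_matrix)

sample_cov = sample_matrix[0][1]                      # denominator n - 1
population_cov = np.cov(temp, surfers, ddof=0)[0][1]  # denominator n

print(f"Sample covariance: {sample_cov}")
print(f"Population covariance: {population_cov}")
```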

## Correlation 

We saw above that covariance can identify the degree to which two random variables tend to vary together. Note that the formula for covariance depends on the units of the $X$ and $Y$ variables. In data analysis, covariance values are therefore hard to interpret or compare directly, because the two variables might come in different units of measurement. Correlation normalizes covariance into a unitless value, giving interpretable results independent of the units of data. 

Correlation normalizes covariance on a scale from -1 to 1. A correlation of 1 between $X$ and $Y$ means that the two variables are perfectly positively correlated: when $X$ increases, $Y$ increases in exact linear proportion. A correlation of -1 means that the two variables are perfectly inversely correlated: when $X$ increases, $Y$ decreases in exact linear proportion. A correlation of 0 means that there is no linear relationship between the two variables.

The correlation between $X$ and $Y$ is calculated as:

$$Correlation(X,Y) = \frac{\sigma_{xy}}{\sigma_X\sigma_Y}$$

>When two random variables **correlate**, a change in the values of one variable tends to be accompanied by a change in the values of the second variable. Correlation alone does not show that one change causes the other.

In data science practice, we typically use correlation rather than covariance because it is more interpretable, since it does not depend on the unit of either random variable involved.
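
As an illustration, the sketch below divides the covariance by the product of the two standard deviations, using the temperature and surfer data from earlier. The helper name `correlation_from_covariance` and the Fahrenheit conversion are additions made only for this example. Note that the covariance and the standard deviations must use the same denominator (here $n$ for both).

```python
import numpy as np

def correlation_from_covariance(xs, ys):
    """Population covariance divided by the product of population standard deviations."""
    covariance = np.cov(xs, ys, ddof=0)[0][1]
    return covariance / (np.std(xs) * np.std(ys))

temp_c = [20, 25, 28, 27, 25]
surfers = [350, 450, 470, 455, 440]

# The same temperatures expressed in Fahrenheit
temp_f = [c * 9 / 5 + 32 for c in temp_c]

r_celsius = correlation_from_covariance(temp_c, surfers)
r_fahrenheit = correlation_from_covariance(temp_f, surfers)

print(f"Correlation (Celsius):    {r_celsius}")
print(f"Correlation (Fahrenheit): {r_fahrenheit}")
```

Changing the temperature unit leaves the correlation unchanged, which is what makes it easier to interpret than covariance.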


### Use Cases


#### Social Media and Websites

Digital publishers want to maximize their understanding of the potential relationship between social media activity and visits to their website. For example, a digital publisher runs a correlation report between hourly Twitter mentions and visits for a two-week period. The correlation is found to be r = 0.28, which indicates a medium, positive relationship between Twitter mentions and website visits.

#### Optimization for E-retailers

E-retailers are interested in increased revenue. For example, an e-retailer wants to compare the number of secondary success events (e.g., file downloads, product detail page views, internal search click-throughs, etc.) with weekly web revenue. Of these secondary success events, an e-retailer might identify internal search click-throughs to have the highest correlation, possibly indicating an area for optimization.

### Types of Correlation Measures

__Coefficient of correlation__, r, measures the strength and the direction of a linear relationship between two variables. It is also called the __Pearson correlation coefficient__. 

In statistics, four types of correlation are commonly measured for detailed relationship analysis: 

* Pearson correlation 
* Kendall rank correlation 
* Spearman correlation
* Point-biserial correlation


We will focus on Pearson correlation here as it is the most commonly used correlation measure. 

Pearson __r__ correlation is the most widely used correlation statistic to measure the degree of the relationship between two linearly related variables. For the Pearson r correlation, both variables should be normally distributed (normally distributed variables have a bell-shaped curve). Other assumptions include linearity and homoscedasticity.   

  * Linearity assumes a straight-line relationship between the two variables. 
  
  * Homoscedasticity assumes that the data is spread evenly about the regression line.
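
The linearity assumption matters in practice. In the hedged sketch below, `y` always increases with `x`, but not along a straight line, so the Pearson coefficient falls short of 1. For comparison, the Spearman coefficient is computed as the Pearson coefficient of the ranks; the `ranks` helper is hypothetical and assumes there are no tied values.

```python
import numpy as np

def ranks(values):
    """Rank values from 1 (smallest) to n; this shortcut assumes no ties."""
    return np.argsort(np.argsort(values)) + 1

x = np.array([1, 2, 3, 4, 5])
y = x ** 3  # always increasing, but not linear

pearson = np.corrcoef(x, y)[0][1]
spearman = np.corrcoef(ranks(x), ranks(y))[0][1]

print(f"Pearson:  {pearson:.4f}")
print(f"Spearman: {spearman:.4f}")
```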


### Calculating Coefficient of Correlation (r)

Pearson correlation (r) is calculated using the following formula:

$$ r = \frac{\sum_{i=1}^{n}(x_i -\mu_x)(y_i - \mu_y)} {\sqrt{\sum_{i=1}^{n}(x_i - \mu_x)^2 \sum_{i=1}^{n}(y_i-\mu_y)^2}}$$

Just as in the case of covariance, $X$ and $Y$ are two random variables having $n$ elements each. 


$x_i$ = ith element of variable $X$

$y_i$ = ith element of variable $Y$

$n$ = number of data points (__$n$ must be same for $X$ and $Y$__)

$\mu_x$ = mean of variable $X$

$\mu_y$ = mean of variable $Y$

$r$ = calculated Pearson correlation


A detailed mathematical insight into this equation is available [in this paper](http://www.hep.ph.ic.ac.uk/~hallg/UG_2015/Pearsons.pdf).
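
The Pearson calculation can be sketched step by step in plain Python: the sum of the products of paired deviations is divided by the square root of the product of the two sums of squared deviations. The data below are made up for illustration.

```python
import math

# Hypothetical paired data
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)

mean_x = sum(x) / n
mean_y = sum(y) / n

# Numerator: sum of products of paired deviations from the means
numerator = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))

# Denominator: square root of the product of the two sums of squared deviations
sum_sq_x = sum((xi - mean_x) ** 2 for xi in x)
sum_sq_y = sum((yi - mean_y) ** 2 for yi in y)
denominator = math.sqrt(sum_sq_x * sum_sq_y)

r = numerator / denominator
print(f"Pearson r: {r:.4f}")
```

Because the same $n$ (or $n - 1$) would appear in both the numerator and the denominator, it cancels out, which is why this form of the formula contains no explicit division by the number of data points.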

### Interpreting Correlation values

> __The correlation formula shown above always gives values in a range between -1 and 1.__

A correlation of +0.9 between two variables means that as one variable changes, the other tends to change in the same direction, following a nearly linear pattern. A correlation value of -0.9 means that as one variable changes, the other tends to change in the opposite direction. A Pearson correlation near 0 means there is little or no linear relationship. Here are some examples of Pearson correlation calculations as scatter plots. 

<img src="images/pearson_2.png" width=500>

Think about stock markets in terms of correlation. All the stock market indexes tend to move together in similar directions. When the Dow Jones loses 5%, the S&P 500 usually loses around 5%. When the Dow Jones gains 5%, the S&P 500 usually gains around 5% because they are **highly correlated**.

On the other hand, there could also be negative correlation, where you might observe that as the Dow Jones loses 5% of its value, gold might gain 5% in value. Alternatively, if the Dow Jones gains 5% of its value, gold may lose 5% of its value. That's **negative correlation**. 

### So how do these measures relate to each other?

Are covariance and correlation the same thing?

No. While both covariance and correlation indicate whether variables are positively or inversely related to each other, they are not the same. Covariance states only whether two variables move in the same direction or in the opposite direction, while correlation also identifies the degree to which the variables move together. 

Covariance is expressed in the units of both variables multiplied together (for example, degrees Celsius times number of surfers). Analysts can use its sign to determine whether the variables tend to increase together or move in opposite directions, but they cannot read the strength of the relationship from its magnitude, because covariance does not use one standardized unit of measurement.

Correlation, on the other hand, standardizes the measure of interdependence between two variables and informs researchers as to how closely the two variables move together.



### Calculating Correlation in Python

The following data on total years of education and total annual book expenditure per customer was gathered by Narnes and Boble, an independent bookseller.

| Customer ID#  |  Years of Education | Annual Spending, Dollars  |
|---|---|---|
|  1 |  12 | 400  |
|  2 | 16  |  700 |
|  3 |  14 | 800  |
|  4 |  20 | 1200  |
|  5 |  16 |  900 |

```python
years_education = [12, 16, 14, 20, 16]
annual_spending = [400, 700, 800, 1200, 900]

# np.corrcoef returns a 2x2 matrix; element [0][1] is the correlation of the pair
correlation = np.corrcoef(years_education, annual_spending)[0][1]
print(f"Correlation: {correlation}")
```

```
Correlation: 0.9249945801257605
```


In this example, there is a high degree of correlation between the two variables: years of education and annual spending on books.

Now let's return to the previous data on temperature and number of surfers. The covariance indicated a positive relationship between the two variables. Try calculating the correlation.

```python
temp = [20, 25, 28, 27, 25]
surfers = [350, 450, 470, 455, 440]

temp_surfer_correlation = np.corrcoef(temp, surfers)[0][1]
print(f"Correlation: {temp_surfer_correlation}")
```

```
Correlation: 0.9703761930030727
```


Unlike covariance, the correlation shows a standardized degree to which the two variables move together.
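
The `[0][1]` index used in these examples selects one entry from the matrix that `np.corrcoef` returns. The sketch below prints the full matrix for the temperature and surfer data to show its layout.

```python
import numpy as np

temp = [20, 25, 28, 27, 25]
surfers = [350, 450, 470, 455, 440]

# np.corrcoef returns a 2x2 correlation matrix:
# [[corr(temp, temp),    corr(temp, surfers)   ],
#  [corr(surfers, temp), corr(surfers, surfers)]]
corr_matrix = np.corrcoef(temp, surfers)
print(corr_matrix)

# Each variable correlates perfectly with itself, so the diagonal holds ones;
# the two off-diagonal entries are equal because correlation is symmetric
print(np.allclose(np.diag(corr_matrix), 1.0))
print(np.isclose(corr_matrix[0][1], corr_matrix[1][0]))
```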

## Summary

In this lesson, we looked at the variance of a random variable as a measure of how far its values spread around the mean. We saw how this idea extends first to covariance, and then to correlation, to analyze how a change in one variable relates to a change in another. Next, we will see how to use correlation analysis to run a __regression analysis__, and later, how covariance calculation helps us with dimensionality reduction.