# Beginner Python and Math for Data Science
## Lecture 5 - Correlation and Covariance
### Statistics

__Purpose:__ The purpose of this lecture is to understand correlation and covariance.

__At the end of this lecture you will be able to:__
> 1. Understand some summary statistics such as Correlation and Covariance

### 1.1.1 Correlation and Covariance:

__Overview:__ 
- We can use a series of statistics to evaluate the relationship between the two variables:
> 1. __[Covariance](https://en.wikipedia.org/wiki/Covariance):__ Covariance measures how two variables vary together. It indicates whether the two variables tend to increase and decrease together (positive covariance), whether one tends to increase when the other decreases (negative covariance), or whether they show no linear tendency to vary together (covariance of 0)
>> - The formula for Covariance can be represented in terms of Expected Value: 
<center> $Cov(X, Y) = \sigma_{XY} = E[XY] - E[X]E[Y] = E[(X - \mu_{X})(Y - \mu_{Y})]$ </center> 

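>> - For a data set of $n$ paired observations, the (population) Covariance can also be computed in the following way. Note that `np.cov` and `DataFrame.cov` divide by $n - 1$ instead of $n$ by default: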
<center> $Cov(X, Y) = \sigma_{XY} = \frac{\sum_{i=1}^{n} (x_{i} - \mu_{x})(y_{i} - \mu_{y})}{n}$ </center>

>> - If two variables $X$ and $Y$ are independent, then $Cov(X,Y) = 0$, but the converse is not always true: a covariance of 0 does not imply independence 
>> - The problem with Covariance is that it is not normalized: its value depends on the units of the two variables, so it is difficult to judge whether a given magnitude is strong or weak. This motivates a normalized measure of association, Correlation, which we define below 
> 2. __[Correlation](https://en.wikipedia.org/wiki/Correlation_and_dependence):__ Correlation is very similar to Covariance but is scaled by the Standard Deviations of the two variables such that Correlation can only range between -1 and +1. 
>> - The formula for Correlation can be represented in the following way:
<center> $Cor(X, Y) = r_{XY} = \frac{\frac{\sum_{i=1}^{n} (x_{i} - \mu_{x})(y_{i} - \mu_{y})}{n}}{\sigma_x \sigma_y}$ </center> 

>> - We can now interpret both the sign and the magnitude of the Correlation: if the Correlation is positive, the variables tend to move in the same direction; if it is negative, they tend to move in opposite directions; and if it is 0, the variables show no linear tendency to move together in either direction. The closer the value is to -1 or +1, the stronger the linear relationship 
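The formulas above can be checked by hand on a few numbers. The following is a minimal sketch, assuming a small made-up data set; the helper names `population_cov` and `population_corr` are hypothetical and only illustrate the divide-by-$n$ formulas from the text. It also shows a pair of variables that are clearly dependent yet have a covariance of 0.

```python
import numpy as np

# Small made-up data set (values are illustrative only)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])


def population_cov(a, b):
    """Covariance using the divide-by-n formula from the text."""
    return np.mean((a - a.mean()) * (b - b.mean()))


def population_corr(a, b):
    """Covariance scaled by both population standard deviations."""
    return population_cov(a, b) / (a.std() * b.std())


cov_xy = population_cov(x, y)
corr_xy = population_corr(x, y)

# E[XY] - E[X]E[Y] gives the same number as the centered form
cov_alt = np.mean(x * y) - x.mean() * y.mean()

# Dependent but uncorrelated: z is fully determined by w, yet Cov(w, z) = 0
w = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
z = w ** 2
cov_wz = population_cov(w, z)

print("Cov(x, y):", cov_xy)
print("Cov(x, y) via E[XY] - E[X]E[Y]:", cov_alt)
print("Cor(x, y):", corr_xy)
print("Cov(w, w^2):", cov_wz)
```

Because the correlation divides the covariance by both standard deviations, the choice between $n$ and $n - 1$ cancels out, so `corr_xy` agrees with `np.corrcoef(x, y)[0, 1]`.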

__Helpful Points:__
1. It is possible to create Covariance and Correlation Matrices, which hold the Covariance and Correlation, respectively, of every pair of variables in a data set, arranged in a symmetric matrix
2. Typically, we calculate the Covariance and Correlation Matrices of a data set so that we can inspect the values for all pairs of variables at once 
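To make the link between the two matrices concrete, here is a minimal sketch on a hypothetical toy DataFrame (the column names `a`, `b`, `c` and the values are placeholders, not part of the NBA data). It builds both matrices with pandas and then recovers the correlation matrix from the covariance matrix by dividing each entry by the product of the two standard deviations.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; column names are placeholders, not from the NBA file
toy_df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, 1.0, 4.0, 3.0, 6.0],
    "c": [9.0, 7.0, 6.0, 4.0, 1.0],
})

cov_matrix = toy_df.cov()    # labeled covariance matrix (n - 1 divisor)
corr_matrix = toy_df.corr()  # labeled Pearson correlation matrix

# The diagonal of the covariance matrix holds the variances,
# so its square root gives the standard deviations
std = np.sqrt(np.diag(cov_matrix))

# Divide each covariance by the product of the two standard deviations
corr_from_cov = cov_matrix / np.outer(std, std)

print(corr_matrix.round(3))
print(corr_from_cov.round(3))
```

Both printed matrices should match, be symmetric, and have 1s on the diagonal.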

__Practice:__ Examples of Correlation and Covariance in Python 

In [1]:
import numpy as np
import pandas as pd
from scipy import stats
import seaborn as sns
import math 
import random
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
# read in data to analyze 
nba_df = pd.read_csv("NBA_GameLog_2010_2017.csv")

### Example 1 (Covariance):

In [3]:
# subset of nba data for cov and cor calculations 
nba_df_subset = nba_df.loc[:, ['Tm.Pts', 'Tm.FG_Perc', 'Tm.3P_Perc', 'Tm.FT_Perc', 
                               'Tm.TRB','Tm.AST', 'Tm.STL', 'Tm.BLK', 'Tm.TOV', 'Tm.PF']]

In [4]:
nba_df_subset_cov = pd.DataFrame(np.cov(nba_df_subset.T)) # np.cov treats each row as a variable by default, so we transpose to get the covariances between columns
nba_df_subset_cov.columns = ['Tm.Pts', 'Tm.FG_Perc', 'Tm.3P_Perc', 'Tm.FT_Perc', 
                               'Tm.TRB','Tm.AST', 'Tm.STL', 'Tm.BLK', 'Tm.TOV', 'Tm.PF']
nba_df_subset_cov.index = ['Tm.Pts', 'Tm.FG_Perc', 'Tm.3P_Perc', 'Tm.FT_Perc', 
                               'Tm.TRB','Tm.AST', 'Tm.STL', 'Tm.BLK', 'Tm.TOV', 'Tm.PF']
nba_df_subset_cov

| | Tm.Pts | Tm.FG_Perc | Tm.3P_Perc | Tm.FT_Perc | Tm.TRB | Tm.AST | Tm.STL | Tm.BLK | Tm.TOV | Tm.PF |
|---|---|---|---|---|---|---|---|---|---|---|
| Tm.Pts | 147.523854 | 0.468805 | 0.593105 | 0.216965 | 10.056589 | 34.856841 | 4.510063 | 1.680997 | -4.704925 | 7.426513 |
| Tm.FG_Perc | 0.468805 | 0.003133 | 0.002757 | 7.3e-05 | -0.072631 | 0.153645 | 0.004388 | 0.005725 | 0.007339 | -0.000621 |
| Tm.3P_Perc | 0.593105 | 0.002757 | 0.012074 | 0.000221 | -0.072651 | 0.198534 | -0.003194 | 0.003556 | 0.005498 | 0.002567 |
| Tm.FT_Perc | 0.216965 | 7.3e-05 | 0.000221 | 0.010602 | -0.037013 | -0.001035 | -0.003037 | -0.000288 | -0.009051 | 0.003774 |
| Tm.TRB | 10.056589 | -0.072631 | -0.072651 | -0.037013 | 42.333663 | 1.098065 | -1.750161 | 2.815002 | 3.186443 | -0.327592 |
| Tm.AST | 34.856841 | 0.153645 | 0.198534 | -0.001035 | 1.098065 | 25.413989 | 1.687982 | 0.881108 | -0.886902 | -0.567222 |
| Tm.STL | 4.510063 | 0.004388 | -0.003194 | -0.003037 | -1.750161 | 1.687982 | 8.386989 | 0.032831 | 1.308719 | 0.273774 |
| Tm.BLK | 1.680997 | 0.005725 | 0.003556 | -0.000288 | 2.815002 | 0.881108 | 0.032831 | 6.647865 | 0.451256 | 0.036096 |
| Tm.TOV | -4.704925 | 0.007339 | 0.005498 | -0.009051 | 3.186443 | -0.886902 | 1.308719 | 0.451256 | 14.616469 | 2.450163 |
| Tm.PF | 7.426513 | -0.000621 | 0.002567 | 0.003774 | -0.327592 | -0.567222 | 0.273774 | 0.036096 | 2.450163 | 18.883561 |


In [5]:
# DataFrame.cov is a method and must be called with parentheses; pandas then
# returns the same column covariance matrix as above, already labeled
cov_matrix = nba_df_subset.cov()
cov_matrix

We can now evaluate the covariance between every pair of variables in the subsetted data set. For example:
- The covariance between Team Points and Team Total Rebounds is 10.06
- The covariance between Team 3 Point Shot Percentage and Team Steals is -0.003

Note:
1. The matrix is symmetric so you only need to consider either the upper right triangle or the lower left triangle 
2. The diagonal elements represent the covariance of each variable with itself, which is just the variance of that variable (see below)

In [None]:
# variance of each column which is the diagonal elements of the covariance matrix 
nba_df_subset.var(axis = 0)
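The relationship between the diagonal and the variances can be verified directly. The sketch below uses randomly generated stand-in data (the DataFrame `demo_df` and its columns `u`, `v`, `w` are hypothetical, since the NBA file is not assumed to be available here). It also compares three ways of requesting column covariances and shows the effect of the `ddof` argument, which switches between the $n - 1$ and $n$ divisors.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable
demo_df = pd.DataFrame(rng.normal(size=(50, 3)), columns=["u", "v", "w"])
values = demo_df.to_numpy()
n = len(demo_df)

# Three equivalent ways to get the covariances between columns
cov_transposed = np.cov(values.T)          # transpose so rows are variables
cov_rowvar = np.cov(values, rowvar=False)  # or tell NumPy columns are variables
cov_pandas = demo_df.cov()                 # pandas works column-wise and keeps labels

# Diagonal of the covariance matrix equals the per-column variance
# (both np.cov and DataFrame.var divide by n - 1 by default)
diag_matches_var = np.allclose(np.diag(cov_pandas), demo_df.var(axis=0))

# ddof=0 switches to the divide-by-n (population) formula
cov_population = np.cov(values, rowvar=False, ddof=0)

print("Diagonal equals variances:", diag_matches_var)
print("Ratio sample / population:", cov_rowvar[0, 0] / cov_population[0, 0])
```

The printed ratio should equal $n / (n - 1)$, which is the only difference between the two divisors.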

### Example 2 (Correlation):

In [None]:
nba_df_subset_corr = pd.DataFrame(np.corrcoef(nba_df_subset.T)) # np.corrcoef treats each row as a variable by default, so we transpose to get the correlations between columns
nba_df_subset_corr.columns = ['Tm.Pts', 'Tm.FG_Perc', 'Tm.3P_Perc', 'Tm.FT_Perc', 
                               'Tm.TRB','Tm.AST', 'Tm.STL', 'Tm.BLK', 'Tm.TOV', 'Tm.PF']
nba_df_subset_corr.index = ['Tm.Pts', 'Tm.FG_Perc', 'Tm.3P_Perc', 'Tm.FT_Perc', 
                               'Tm.TRB','Tm.AST', 'Tm.STL', 'Tm.BLK', 'Tm.TOV', 'Tm.PF']
nba_df_subset_corr

We can now evaluate the correlation between every pair of variables in the subsetted data set. For example:
- The correlation between Team Points and Team Total Rebounds is 0.127
- The correlation between Team 3 Point Shot Percentage and Team Steals is -0.01

Note:
1. The matrix is symmetric so you only need to consider either the upper right triangle or the lower left triangle 
2. The diagonal elements represent the correlation of each variable with itself, which is perfect correlation (1)

Below we are going to analyze graphically the relationship between __Tm.Pts__ and __Tm.FG_Perc__, which are highly correlated, as well as __Tm.3P_Perc__ and __Tm.TOV__, which have a low correlation.

First we will normalize the features (calculate their z-score) by subtracting their mean and dividing by their standard deviation.

In [None]:
Pts_Norm = (nba_df_subset['Tm.Pts']-np.mean(nba_df_subset['Tm.Pts']))/np.std(nba_df_subset['Tm.Pts'])
FG_Perc_Norm = (nba_df_subset['Tm.FG_Perc']-np.mean(nba_df_subset['Tm.FG_Perc']))/np.std(nba_df_subset['Tm.FG_Perc'])

ThreeP_Perc_Norm = (nba_df_subset['Tm.3P_Perc']-np.mean(nba_df_subset['Tm.3P_Perc']))/np.std(nba_df_subset['Tm.3P_Perc'])
TOV_Norm = (nba_df_subset['Tm.TOV']-np.mean(nba_df_subset['Tm.TOV']))/np.std(nba_df_subset['Tm.TOV'])
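Z-scores also give another way to read the Correlation: it is the average product of the two normalized variables. The following is a minimal sketch on synthetic data (the arrays `pts` and `fg` are made-up stand-ins, not the NBA columns, and the helper `z_score` is hypothetical), using the population standard deviation as `np.std` does by default.

```python
import numpy as np


def z_score(values):
    """Subtract the mean and divide by the population standard deviation."""
    values = np.asarray(values, dtype=float)
    return (values - values.mean()) / values.std()  # np.std default: ddof=0


# Synthetic stand-ins for two related features (made-up numbers)
rng = np.random.default_rng(1)
pts = rng.normal(100, 12, size=200)
fg = 0.45 + 0.003 * (pts - 100) + rng.normal(0, 0.03, size=200)

pts_z = z_score(pts)
fg_z = z_score(fg)

# The mean product of z-scores is the Pearson correlation
corr_from_z = np.mean(pts_z * fg_z)
corr_numpy = np.corrcoef(pts, fg)[0, 1]

print("Mean and std of z-scores:", pts_z.mean(), pts_z.std())
print("Correlation from z-scores:", corr_from_z)
print("Correlation from np.corrcoef:", corr_numpy)
```

After normalization each feature has mean 0 and standard deviation 1, which is why the two series can be plotted on the same scale.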

In [None]:
plt.figure(figsize=[10,10])
plt.subplot(3,1,1)
plt.plot(nba_df_subset['Tm.Pts'])
plt.plot(nba_df_subset['Tm.FG_Perc'])
plt.legend(['Pts','FG_Perc'])
plt.subplot(3,1,2)
plt.plot(Pts_Norm,alpha = 0.5)
plt.plot(FG_Perc_Norm,alpha = 0.5)
plt.legend(['Normalized Pts','Normalized FG_Perc'])
plt.subplot(3,1,3)
plt.scatter(Pts_Norm,FG_Perc_Norm,alpha = 0.1)
plt.xlabel('Normalized Pts')
plt.ylabel('Normalized FG_Perc')

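As we can see, there is a clear positive relationship between 'Pts' and 'FG_Perc': in the scatter plot, higher normalized field goal percentages tend to go with higher normalized points.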

In [None]:
plt.figure(figsize=[10,10])
plt.subplot(3,1,1)
plt.plot(nba_df_subset['Tm.3P_Perc'])
plt.plot(nba_df_subset['Tm.TOV'])
plt.legend(['3P_Perc','TOV'])
plt.subplot(3,1,2)
plt.plot(ThreeP_Perc_Norm,alpha = 0.5)
plt.plot(TOV_Norm,alpha = 0.5)
plt.legend(['Normalized ThreeP_Perc','Normalized TOV'])
plt.subplot(3,1,3)
plt.scatter(ThreeP_Perc_Norm,TOV_Norm,alpha = 0.1)
plt.xlabel('Normalized ThreeP_Perc')
plt.ylabel('Normalized TOV')

In this case, there is no clear relationship between '3P_Perc' and 'TOV': the scatter plot shows no visible trend, which is consistent with their low correlation.

### Problem 1

Select the quantitative variables from the Seattle Home Price data and develop both a Covariance and Correlation Matrix as shown above. 

In [None]:
home_df = pd.read_csv("SeattleHomePrices.csv")
# Write your code here 




### Example 3 (Correlation Heat Map):

In [None]:
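# DataFrame.corr() keeps the column names, so the heat map axes are labeled;
# fixing the color scale to [-1, 1] makes the colors comparable across plots
sns.heatmap(nba_df_subset.corr(), vmin=-1, vmax=1)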

### SOLUTIONS

### Problem 1

Select the quantitative variables from the Seattle Home Price data and develop both a Covariance and Correlation Matrix as shown above. 

In [None]:
home_df = pd.read_csv("SeattleHomePrices.csv")

In [None]:
home_df_subset = home_df.loc[:, ['PRICE', 'BEDS', 'BATHS', 'SQUARE FEET', 'LOT SIZE',
                               'YEAR BUILT', 'DAYS ON MARKET', '$/SQUARE FEET', 'HOA/MONTH']]

In [None]:
home_df_subset = home_df_subset.dropna()
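As an alternative to listing the quantitative columns by hand, pandas can pick them by data type. The sketch below runs on a hypothetical miniature table (the DataFrame `homes`, its values, and the `CITY` column are invented for illustration and are not taken from SeattleHomePrices.csv), so treat it as one possible approach rather than the expected solution.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature listing table; NOT the real SeattleHomePrices.csv
homes = pd.DataFrame({
    "PRICE": [500000.0, 750000.0, 620000.0, 910000.0],
    "BEDS": [2, 3, 3, 4],
    "CITY": ["Seattle", "Seattle", "Kirkland", "Bellevue"],
    "HOA/MONTH": [250.0, np.nan, 300.0, 410.0],
})

# Keep only numeric columns, then drop rows with missing values
numeric_homes = homes.select_dtypes(include="number").dropna()

# pandas builds labeled matrices directly, so no manual relabeling is needed
homes_cov = numeric_homes.cov()
homes_corr = numeric_homes.corr()

print(numeric_homes.columns.tolist())
print(homes_corr.round(3))
```

Note that selecting by data type can also pick up numeric identifier columns (for example ZIP codes), so the selected columns are worth reviewing before interpreting the matrices.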

### Part 1 - Covariance Matrix

In [None]:
home_df_subset_cov = pd.DataFrame(np.cov(home_df_subset.T))
home_df_subset_cov.columns = ['PRICE', 'BEDS', 'BATHS', 'SQUARE FEET', 'LOT SIZE',
                               'YEAR BUILT', 'DAYS ON MARKET', '$/SQUARE FEET', 'HOA/MONTH']
home_df_subset_cov.index = ['PRICE', 'BEDS', 'BATHS', 'SQUARE FEET', 'LOT SIZE',
                               'YEAR BUILT', 'DAYS ON MARKET', '$/SQUARE FEET', 'HOA/MONTH']
home_df_subset_cov

### Part 2 - Correlation Matrix

In [None]:
home_df_subset_corr = pd.DataFrame(np.corrcoef(home_df_subset.T))
home_df_subset_corr.columns = ['PRICE', 'BEDS', 'BATHS', 'SQUARE FEET', 'LOT SIZE',
                               'YEAR BUILT', 'DAYS ON MARKET', '$/SQUARE FEET', 'HOA/MONTH']
home_df_subset_corr.index = ['PRICE', 'BEDS', 'BATHS', 'SQUARE FEET', 'LOT SIZE',
                               'YEAR BUILT', 'DAYS ON MARKET', '$/SQUARE FEET', 'HOA/MONTH']
home_df_subset_corr