# Statistics Fundamentals 

_May 12, 2020_

Agenda today:
- Measures of central tendency: mean, median, mode
- Measures of dispersion: variance, standard deviation
- Measures of relationship: covariance and correlation

In [1]:
import numpy as np
import matplotlib.pyplot as plt

## Part I. Mean, Median, and Mode
What are the definitions of these three measures?

In [2]:
array = [10,11,11,12,11,13,14,16,17,18,19,20,22,24,26,22,24]
# plot it out and examine it 
plt.style.use('fivethirtyeight')
plt.hist(array)
plt.ylabel('count')

[Figure: output of `plt.hist(array)`, with the y-axis labeled 'count']

What is the above plot called? What kind of values can it be used to represent?
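
As a quick check on the definitions asked about in this part, here is a minimal sketch that computes the three measures for the same `array` using Python's standard `statistics` module. The variable names `arr_mean`, `arr_median`, and `arr_mode` are illustrative and not part of the original notebook.

```python
from statistics import mean, median, mode

# same data as the plotting cell above
array = [10,11,11,12,11,13,14,16,17,18,19,20,22,24,26,22,24]

arr_mean = mean(array)      # sum of the values divided by how many there are
arr_median = median(array)  # middle value once the data is sorted
arr_mode = mode(array)      # most frequently occurring value

print("mean =", arr_mean)
print("median =", arr_median)
print("mode =", arr_mode)
```

With this data the mean (about 17.06) sits close to the median (17), while the mode (11) falls well below both: the most frequent value need not be near the center of the data.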

## Part II. Measure of Dispersion
The two measures of dispersion we will be concerned with are **variance** and **standard deviation**. Both describe how spread out a dataset is around its mean. Why might we need a measure of variability in addition to central tendency?

#### Variance calculation:
$$ \large \sigma^2 = \dfrac{1}{n}\displaystyle\sum^n_{i=1}(x_i-\mu)^2 $$

#### Standard deviation calculation:
$$ \large \sigma = \sqrt{\dfrac{1}{n}\displaystyle\sum^n_{i=1}(x_i-\mu)^2} $$

In [3]:
# exercises

# can you write functions that take in an array and calculate its variance and standard deviation?
from statistics import mean

def calculate_variance(array):
    '''
    calculate the population variance of an array
    '''
    n = len(array)

    arrmean = mean(array)

    # average of the squared deviations from the mean
    variance = (1/n) * sum((x - arrmean)**2 for x in array)

    return variance
    

In [9]:
def calculate_std(array):
    '''
    calculate the standard deviation of an array
    '''
    # the standard deviation is the square root of the variance
    return calculate_variance(array) ** 0.5

In [18]:
# checking my results...

my_arr = [10,12,36]

print("JP's variance = " + str(calculate_variance(my_arr)),"\nJP's stdev = " +  str(calculate_std(my_arr)) + "\n") 

print("Numpy's variance = " + str(np.var(my_arr)),"\nNumpy's stdev = " +  str(np.std(my_arr)) ) 
   

JP's variance = 139.55555555555554 
JP's stdev = 11.813363431112899

Numpy's variance = 139.555555556 
Numpy's stdev = 11.8133634311
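
The hand-written results agree with NumPy because `np.var` and `np.std` default to the population formulas above, which divide by $n$. As a hedged sketch, the same computation can be written in vectorized form. Passing `ddof=1` switches NumPy to the sample version, which divides by $n-1$. The helper name `population_variance` is hypothetical and only used for illustration.

```python
import numpy as np

def population_variance(values):
    '''
    population variance: mean of the squared deviations from the mean
    (hypothetical helper for illustration)
    '''
    arr = np.asarray(values, dtype=float)
    deviations = arr - arr.mean()
    return (deviations ** 2).mean()

my_arr = [10,12,36]

pop_var = population_variance(my_arr)
pop_std = pop_var ** 0.5

# NumPy's defaults (ddof=0) divide by n, matching the formulas above
print(np.isclose(pop_var, np.var(my_arr)))
print(np.isclose(pop_std, np.std(my_arr)))

# ddof=1 divides by n - 1 instead, giving the larger sample variance
sample_var = np.var(my_arr, ddof=1)
print(sample_var > pop_var)
```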


## Part III. Covariance and Correlation
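Covariance and correlation both measure how strongly two variables move together. Correlation is the covariance rescaled by the two standard deviations, so it always lies between -1 and 1.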

#### Covariance calculation:
$$Cov_{X,Y} = \dfrac{1}{n}\displaystyle\sum_{i=1}^{n}(x_i -\mu_x)(y_i - \mu_y)$$

#### Correlation calculation:
$$ r = \frac{\frac{1}{n}\sum_{i=1}^{n}(x_i -\mu_x)(y_i - \mu_y)} {\sigma_x  \sigma_y} = \dfrac{Cov_{X,Y}}{\sigma_x \sigma_y}$$

<img src= 'https://raw.githubusercontent.com/learn-co-curriculum/dsc-correlation-covariance/master/images/correx.svg'>

In [60]:
## exercises: define functions that calculate covariance & correlation

def calculate_covariance(array1, array2):
    '''
    calculate the covariance of two arrays
    '''
    
    from statistics import mean
    
    # Assuming both arrays are of the same length:
    n = len(array1)     
    
    arr1mean = mean(array1)
    arr2mean = mean(array2)
    
    # average of the products of the paired deviations from each mean
    covariance = (1/n) * sum( [ ((array1[i]-arr1mean) * (array2[i]-arr2mean)) for i in range(n) ] )
    
    return covariance 
                      

In [73]:
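# define the sample arrays first so this cell runs on its own
my_arr1 = [10,21,32]
my_arr2 = [40,51,62]

print(calculate_covariance(my_arr1,my_arr2))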

np.cov(my_arr1,my_arr2, bias=True)[0][1]

80.66666666666666


80.666666666666671

In [71]:
def calculate_correlation(array1, array2):
    '''
    calculate the correlation of two arrays
    '''
    
    from statistics import mean
    
    # Assuming both arrays are of the same length:
    n = len(array1)     
    
    arr1mean = mean(array1)
    arr2mean = mean(array2)
    
    covariance = (1/n) * sum( [ ((array1[i]-arr1mean)*(array2[i]-arr2mean)) for i in range(n) ] )
    
    sigma_x = calculate_std(array1)
    sigma_y = calculate_std(array2)
    
    return covariance / (sigma_x * sigma_y)
                      

In [77]:
my_arr1 = [10,21,32]
my_arr2 = [40,51,62]


calculate_correlation(my_arr1,my_arr2)

print("JP's covariance = " + str(calculate_covariance(my_arr1,my_arr2)),"\nJP's correlation = " + 
      str(calculate_correlation(my_arr1,my_arr2)) + "\n") 

print("Numpy's covariance = " + str(np.cov(my_arr1,my_arr2,bias=True)[0][1]),"\nNumpy's correlation = " + 
      str(np.corrcoef(my_arr1,my_arr2)[0][1]) ) 
   

JP's covariance = 80.66666666666666 
JP's correlation = 1.0000000000000002

Numpy's covariance = 80.6666666667 
Numpy's correlation = 1.0
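
One way to see why correlation is often easier to interpret than covariance: rescaling a variable changes the covariance but leaves the correlation untouched. The sketch below uses NumPy only, and the array `my_arr2_scaled` is a made-up example, not part of the original notebook.

```python
import numpy as np

my_arr1 = np.array([10, 21, 32])
my_arr2 = np.array([40, 51, 62])

# hypothetical rescaling, e.g. a change of units
my_arr2_scaled = my_arr2 * 100

# bias=True divides by n, matching the covariance formula above
cov = np.cov(my_arr1, my_arr2, bias=True)[0][1]
cov_scaled = np.cov(my_arr1, my_arr2_scaled, bias=True)[0][1]

corr = np.corrcoef(my_arr1, my_arr2)[0][1]
corr_scaled = np.corrcoef(my_arr1, my_arr2_scaled)[0][1]

print("covariance:", cov, "->", cov_scaled)
print("correlation:", corr, "->", corr_scaled)

# correlation is covariance divided by the product of the standard deviations
print(np.isclose(corr, cov / (np.std(my_arr1) * np.std(my_arr2))))
```

The covariance grows by the same factor of 100 as the rescaled variable, while the correlation stays at 1 because the two arrays are still perfectly linearly related.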
