# 1. Correlation

## 1.1 What is correlation?

The statistical relationship between two variables is referred to as their **correlation.**

A correlation could be:
- **Positive**: both variables move in the same direction.
- **Negative**: when one variable’s value increases, the other variable’s value decreases.
- **Neutral or zero**: the variables are unrelated.

## 1.2 Example
Correlation between two variables.
Here we create `data1` as 1000 samples drawn from a Gaussian distribution (mean 100, standard deviation 20), and `data2` as `data1` plus Gaussian noise (mean 50, standard deviation 10).

In [None]:
# generate related variables
from numpy import mean
from numpy import std
from numpy.random import randn
from numpy.random import seed
import numpy as np
from matplotlib import pyplot
# seed random number generator
seed(1)
# prepare data
data1 = 20 * randn(1000) + 100
data2 = data1 + (10 * randn(1000) + 50)
# summarize
print('data1: mean=%.3f stdv=%.3f' % (mean(data1), std(data1)))
print('data2: mean=%.3f stdv=%.3f' % (mean(data2), std(data2)))
# plot
pyplot.scatter(data1, data2)
pyplot.show()

A scatter plot of the two variables is created. Because we contrived the dataset, we know there is a relationship between the two variables. This is clear when we review the generated scatter plot where we can see an increasing trend.

In [None]:
# negatively related variable: data3 decreases as data1 increases
data3 = data1 * (-1) + (10 * randn(1000) + 50)
# summarize
print('data1: mean=%.3f stdv=%.3f' % (mean(data1), std(data1)))
print('data3: mean=%.3f stdv=%.3f' % (mean(data3), std(data3)))
# plot
pyplot.scatter(data1, data3)
pyplot.show()

In [None]:
# unrelated variable: data4 is generated independently of data1
data4 = 30 * randn(1000) + 100
# summarize
print('data1: mean=%.3f stdv=%.3f' % (mean(data1), std(data1)))
print('data4: mean=%.3f stdv=%.3f' % (mean(data4), std(data4)))
# plot
pyplot.scatter(data1, data4)
pyplot.show()

## 1.3 Covariance

Variables can be related by a linear relationship. This is a relationship that is consistently additive across the two data samples.

The strength of this relationship between two variables can be summarized by the **covariance**. It is calculated as the average of the products of the paired values from each sample, where the values have been centered (had their mean subtracted). The sample covariance divides by $N - 1$:
$$Cov(x,y)=\frac{\sum_{i=1}^N (x_{i}-\overline{x}) * (y_{i} - \overline{y})}{N - 1}$$
while the population covariance divides by $N$:
$$Cov(x,y)=\frac{\sum_{i=1}^N (x_{i}-\overline{x}) * (y_{i} - \overline{y})}{N}$$


In [None]:
# Write the function manually
def MyCov(X,Y):
    n = len(X)
    return (np.sum ((X - np.mean(X)) * (Y - np.mean(Y)) )) /(n-1)

covariance = MyCov(data1, data2)
covariance

In [None]:
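# Use NumPy: np.cov returns the 2x2 covariance matrix (N - 1 normalization by default);
# the off-diagonal entries are Cov(data1, data2)
cov2 = np.cov(data1, data2)
cov2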

In [None]:
covariance2 = MyCov(data1, data3)
covariance2

In [None]:
covariance3 = MyCov(data1, data4)
covariance3

In [None]:
# Population version: divide by N instead of N - 1
def MyCovN(X,Y):
    n = len(X)
    return (np.sum ((X - np.mean(X)) * (Y - np.mean(Y)) )) /(n)

covariance = MyCovN(data1, data2)
covariance
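
The two normalizations can also be cross-checked against `np.cov`, whose `ddof` argument selects the divisor. The following is a minimal sketch on freshly generated data; the seed and the names `a`, `b`, `cov_sample` and `cov_population` are illustrative assumptions, not part of the cells above.

```python
import numpy as np

# Illustrative data in the spirit of data1/data2 (assumed seed, not the notebook's)
rng = np.random.default_rng(1)
a = 20 * rng.standard_normal(1000) + 100
b = a + 10 * rng.standard_normal(1000) + 50

# Sum of products of the centered values
centered_products = (a - a.mean()) * (b - b.mean())
cov_sample = centered_products.sum() / (len(a) - 1)   # divide by N - 1
cov_population = centered_products.sum() / len(a)     # divide by N

# np.cov returns a 2x2 matrix; entry [0, 1] is Cov(a, b)
print(cov_sample, np.cov(a, b)[0, 1])              # default divisor: N - 1
print(cov_population, np.cov(a, b, ddof=0)[0, 1])  # ddof=0 divisor: N
```

With 1000 samples the two estimates differ only by a factor of 999/1000, so the choice matters mostly for small samples.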

## 1.4 Pearson’s Correlation

A problem with covariance as a statistical tool on its own is that it is ***challenging to interpret***: its magnitude depends on the units and scale of the variables. This leads us to Pearson’s correlation coefficient.

The **Pearson correlation coefficient** (named for Karl Pearson) can be used to summarize the strength of the linear relationship between two data samples.

Pearson’s correlation coefficient is calculated as the covariance of the two variables divided by the product of the standard deviations of the two data samples. It normalizes the covariance between the two variables to give an interpretable score.

$$Corr(X,Y)=\frac{Cov(X,Y)}{std(X) \times std(Y)}$$

The result of the calculation, the correlation coefficient, can be interpreted to understand the relationship.

The coefficient takes a ***value between -1 and 1***, ranging from a perfect negative correlation to a perfect positive correlation. A value of 0 means ***no linear correlation***. As a rule of thumb, a value below -0.5 or above 0.5 often indicates a ***notable correlation***, while values closer to 0 suggest a ***less notable correlation***.

In [None]:
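# Manual implementation: population covariance divided by the product of the
# population standard deviations (np.std uses ddof=0 by default, matching MyCovN)
def Pearson_Correlation(X,Y):
    return MyCovN(X, Y) / (np.std(X) * np.std(Y))
#     return MyCovN(X, Y) / np.sqrt(np.var(X) * np.var(Y))
per_cor = Pearson_Correlation(data1,data2)
per_cor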

In [None]:
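# Using a library (import the stats submodule explicitly;
# a bare `import scipy` does not load it in older SciPy versions)
import scipy.stats
corr, _ = scipy.stats.pearsonr(data1, data2)
corr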

We can see that the two variables are positively correlated, with a correlation of roughly 0.89. This indicates a high level of correlation, i.e. a value above 0.5 and close to 1.0.
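
One detail worth checking is that the choice between $N$ and $N - 1$ does not affect Pearson’s coefficient, provided the same divisor is used for the covariance and for both standard deviations, because the factor cancels. The sketch below illustrates this and compares against `np.corrcoef`; the helper `pearson` and the generated data are assumptions made for this example.

```python
import numpy as np

# Illustrative data (assumed seed, not the notebook's)
rng = np.random.default_rng(1)
a = 20 * rng.standard_normal(1000) + 100
b = a + 10 * rng.standard_normal(1000) + 50

def pearson(x, y, ddof):
    """Pearson's r using the same ddof for the covariance and both standard deviations."""
    cov = np.sum((x - x.mean()) * (y - y.mean())) / (len(x) - ddof)
    return cov / (np.std(x, ddof=ddof) * np.std(y, ddof=ddof))

r_population = pearson(a, b, ddof=0)   # divide by N everywhere
r_sample = pearson(a, b, ddof=1)       # divide by N - 1 everywhere
r_numpy = np.corrcoef(a, b)[0, 1]      # off-diagonal entry of the 2x2 correlation matrix

print(r_population, r_sample, r_numpy)
```

All three values should agree up to floating-point error. Mixing divisors (for example an $N - 1$ covariance with `np.std` at its default `ddof=0`) would instead scale the result by $N / (N - 1)$.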

In [None]:
# Negative correlation
per_cor2 = Pearson_Correlation(data1,data3)
per_cor2

In [None]:
# Neutral correlation
per_cor3 = Pearson_Correlation(data1,data4)
per_cor3

In [None]:
data5 = data1*2
per_cor4 = Pearson_Correlation(data1,data5)
per_cor4
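
The rescaling experiment above also suggests why covariance is harder to interpret than correlation: multiplying one variable by a constant changes the covariance by that constant but leaves Pearson’s coefficient unchanged. A minimal sketch, assuming freshly generated data and an arbitrary scale factor of 1000 (for instance a change of units):

```python
import numpy as np

# Illustrative data (assumed seed, not the notebook's)
rng = np.random.default_rng(1)
a = 20 * rng.standard_normal(1000) + 100
b = a + 10 * rng.standard_normal(1000) + 50
b_scaled = 1000 * b   # same quantity expressed in a unit 1000 times smaller

cov_ab = np.cov(a, b)[0, 1]
cov_scaled = np.cov(a, b_scaled)[0, 1]
r_ab = np.corrcoef(a, b)[0, 1]
r_scaled = np.corrcoef(a, b_scaled)[0, 1]

print('covariance :', cov_ab, cov_scaled)   # differs by a factor of 1000
print('correlation:', r_ab, r_scaled)       # unchanged by the rescaling
```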

# 2. Autocorrelation

## 2.1 Introduction

Autocorrelation measures the degree of similarity between a time series and a lagged version of itself over successive time intervals.

It’s also sometimes referred to as “serial correlation” or “lagged correlation” since it measures the relationship between a variable’s current values and its historical values.

When the autocorrelation in a time series is high, it becomes easy to predict future values by simply referring to past values.

In short, ***autocorrelation is the correlation between a signal and its lagged version***.

In [None]:
# Implement the definition above directly: Pearson correlation of X with itself shifted by k
def autocorrelation_naive(X, k):
    if k == 0:
        return 1
    return Pearson_Correlation(X[:-k],X[k:])

In [None]:
# Testing
ls1 = [1,2,3]
ls2 = ls1*5
x = np.array(ls2)
x

In [None]:
# autocorrelation_naive(x,3)
for i in range(len(x)-1):
    print(f'k = {i}, autocorrelation = {autocorrelation_naive(x,i)}')

In [None]:
x

In [None]:
k = 6
print(x[:-k])
print(x[k:])


## 2.2 Autocorrelation formula

Autocorrelation of signal X at lag k:
$$Autocorrelation(X,k)=\frac{\sum_{i=k+1}^N (x_{i}-\overline{x}) * (x_{i-k}-\overline{x})}{\sum_{i=1}^N (x_{i}-\overline{x})^2}$$


In [None]:
# Manual implementation of the formula above
def Autocorrelation(X,k):
    if k == 0:
        return 1
    X_norm = X - X.mean()
    Num = np.sum( X_norm[k:] * X_norm[:-k])
    De = np.sum( X_norm * X_norm)
    return Num/De

In [None]:
for i in range(len(x)-1):
    print(f'k = {i}, autocorrelation = {Autocorrelation(x,i)}')
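
The two definitions do not give identical numbers. The Pearson-based version recomputes the mean and standard deviation of each slice, while the formula uses the global mean and the full-length sum of squares, so its values shrink as the lag grows and fewer terms enter the numerator. The sketch below compares them on the same periodic test signal; `acf_pearson` and `acf_formula` are stand-in names assumed for this example.

```python
import numpy as np

def acf_pearson(x, k):
    """Pearson correlation between x[:-k] and x[k:]; each slice uses its own mean and std."""
    if k == 0:
        return 1.0
    return np.corrcoef(x[:-k], x[k:])[0, 1]

def acf_formula(x, k):
    """Lag-k autocorrelation using the global mean and the full-length denominator."""
    if k == 0:
        return 1.0
    d = x - x.mean()
    return np.sum(d[k:] * d[:-k]) / np.sum(d * d)

x = np.array([1, 2, 3] * 5)   # period-3 signal of length 15
for k in (3, 6, 9):
    print(f'k = {k}: pearson = {acf_pearson(x, k):.3f}, formula = {acf_formula(x, k):.3f}')
```

At lags that are multiples of the period, the two slices are identical, so the Pearson-based value is 1, whereas the formula yields 0.8, 0.6 and 0.4 for k = 3, 6 and 9 because only 12, 9 and 6 of the 15 terms contribute to the numerator.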

### Using library

In [None]:
import statsmodels.api as sm
# calculate autocorrelations
sm.tsa.acf(x)
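
If statsmodels is not available, the formula from section 2.2 can be evaluated for all lags at once with `np.correlate`. This is a sketch under the assumption that the biased estimator from that formula is wanted; the function name `acf_all_lags` is hypothetical. Its leading values are expected to match `sm.tsa.acf(x)` under that function's default settings, although this is not verified here.

```python
import numpy as np

def acf_all_lags(x):
    """Autocorrelation at lags 0..N-1, using the global mean and full-length denominator."""
    x = np.asarray(x, dtype=float)
    d = x - x.mean()
    # mode='full' returns lags -(N-1)..(N-1); index N-1 corresponds to lag 0
    full = np.correlate(d, d, mode='full')
    return full[len(d) - 1:] / np.sum(d * d)

x = np.array([1, 2, 3] * 5)
acf = acf_all_lags(x)
print(acf[:7])
```

Note that `np.correlate` evaluates every lag directly, so for long signals such as audio an FFT-based approach is usually preferable.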

## 2.3 Exploration on autocorrelation of echo signal

In [None]:
samples = np.random.randint(-10,10,100)
samples

In [None]:
sample_delay = 7
echo = np.zeros_like(samples)
echo[sample_delay:] = samples[:-sample_delay]
echo

In [None]:
samples_echo = samples + echo
samples_echo

In [None]:
# scan lags up to twice the echo delay; the echo should produce a peak at k = sample_delay
for i in range(2 * sample_delay):
    print(f'k = {i}, autocorrelation = {Autocorrelation(samples_echo,i)}')
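
Rather than reading the peak off the printed list, the echo delay can be estimated as the lag (excluding k = 0) with the largest autocorrelation. The following is a minimal sketch, assuming a longer random signal of 2000 samples so that the peak stands clearly above the noise; the seed, the signal length, the 50-lag search range and the lowercase helper `autocorrelation` are choices made for this example.

```python
import numpy as np

def autocorrelation(x, k):
    """Lag-k autocorrelation using the global mean (same formula as section 2.2)."""
    if k == 0:
        return 1.0
    d = x - x.mean()
    return np.sum(d[k:] * d[:-k]) / np.sum(d * d)

rng = np.random.default_rng(0)          # assumed seed for reproducibility
signal = rng.integers(-10, 10, 2000)    # random integer signal
true_delay = 7

# Build the echo by shifting the signal and add it to the original
echo = np.zeros_like(signal)
echo[true_delay:] = signal[:-true_delay]
mixed = signal + echo

# Search lags 1..50 and pick the one with the highest autocorrelation
lags = np.arange(1, 51)
acf = np.array([autocorrelation(mixed, k) for k in lags])
estimated_delay = lags[np.argmax(acf)]
print(estimated_delay, acf.max())
```

For an echo with the same amplitude as the original, the autocorrelation at the true delay is expected to be close to 0.5, since the shared component accounts for half of the variance of the mixed signal.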

## 2.4 Demo with audio signal

Refer to 09-Demo.ipynb (sound) for more information about audio processing.

In [None]:
from scipy.io import wavfile
import numpy as np
import matplotlib.pyplot as plt
from IPython.display import Audio

In [None]:
Audio(filename='09-vocals.wav')

In [None]:
rate, samples = wavfile.read('09-vocals.wav')
rate

In [None]:
samples

In [None]:
plt.plot(samples[:, 0])

In [None]:
time_delay = 0.1 # s
sample_delay = int(time_delay * rate)
sample_delay

In [None]:
echo = np.zeros_like(samples[:, 0])
echo[sample_delay:] = samples[:-sample_delay, 0]

In [None]:
plt.plot(samples[:, 0])
plt.plot(echo)

In [None]:
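# cast to float before adding: integer PCM samples (e.g. int16) could overflow when summed
samples_echo = samples[:, 0].astype(np.float64) + echo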

In [None]:
plt.plot(samples_echo)

In [None]:
acs = []
# Calculate the autocorrelation for the first 10000 lags
for k in range(10000):
    ac = Autocorrelation(samples_echo,k)
    acs.append(ac)

In [None]:
plt.plot(acs)
plt.plot([sample_delay, sample_delay], [-1, 1], '--', alpha=0.5)
plt.xticks([0, sample_delay])
plt.ylim(-1, 1)
plt.xlabel('k'); plt.ylabel('autocorrelation');

# Reference
[1] https://machinelearningmastery.com/how-to-use-correlation-to-understand-the-relationship-between-variables/