# Correlation and Regression Analysis

## Correlation Analysis
<p>
    Correlation analysis is a statistical technique for measuring the strength and direction of the relationship between two variables. It is used to detect patterns and trends in data, and those patterns can in turn support forecasting.
</p>

* Many problems involve several factors that must be weighed together to reach sound conclusions.
* Correlation describes how these variables vary together.
* Correlation quantifies how strong the relationship between two variables is. A larger absolute value of the correlation coefficient implies a stronger association.
* The sign of the correlation coefficient indicates the direction of the relationship between variables. It can be either positive, negative, or zero.
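
To make the points about sign and magnitude concrete, here is a minimal sketch using NumPy's `np.corrcoef`. The arrays are hypothetical toy data chosen only for illustration; they are not part of the dataset used later.

```python
import numpy as np

# Hypothetical toy data, chosen only to illustrate sign and magnitude
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_up = 2 * x + 1                               # rises with x
y_down = -3 * x + 10                           # falls as x rises
y_flat = np.array([2.0, 1.0, 3.0, 1.0, 2.0])   # no linear trend in x

# np.corrcoef returns a 2x2 matrix; entry [0, 1] is the coefficient
r_up = np.corrcoef(x, y_up)[0, 1]
r_down = np.corrcoef(x, y_down)[0, 1]
r_flat = np.corrcoef(x, y_flat)[0, 1]

print(r_up, r_down, r_flat)
```

The first two coefficients sit at the extremes (+1 and -1) because the relationships are exactly linear, while the third is essentially zero.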

### Pearson correlation coefficient
<p>
    The Pearson correlation coefficient is the most commonly used measure of correlation. It expresses the strength and direction of the linear relationship between two variables as a single number. The Pearson correlation coefficient, written as "r", is defined as follows:

$$r = \frac{\sum_{i}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i}(x_i - \bar{x})^2 \, \sum_{i}(y_i - \bar{y})^2}}$$

where,

    $r$: Correlation coefficient
    $x_i$: i-th value of the first dataset X
    $\bar{x}$: Mean of the first dataset X
    $y_i$: i-th value of the second dataset Y
    $\bar{y}$: Mean of the second dataset Y
</p>

### Spearman correlation coefficient
<p>
    Spearman's rank correlation coefficient, developed in 1904 by Charles Edward Spearman, measures the correlation between variables that can be ranked but not measured precisely. It suits attributes such as beauty, ability, or honesty, for which no quantitative measurement is possible. The observations are therefore ranked, or put in order of preference, and the coefficient is computed from those ranks. The formula below is exact when there are no tied ranks.

$$r_k = 1 - \frac{6\sum D^2}{N^3 - N}$$

In the given formula,

$r_k$ = Coefficient of rank correlation

$D$ = Difference between the two ranks of each observation

$N$ = Number of paired observations
</p>
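
The rank formula can be sketched in a few lines. The two rankings below are hypothetical (five contestants scored by two judges, with no ties), and the cross-check relies on the fact that Spearman's coefficient equals Pearson's r computed on the ranks.

```python
import numpy as np
import pandas as pd

# Hypothetical rankings of five contestants by two judges (no ties)
judge_a = pd.Series([1, 2, 3, 4, 5])
judge_b = pd.Series([2, 1, 4, 3, 5])

# Rank each series, then take the per-contestant rank difference D
rank_a = judge_a.rank()
rank_b = judge_b.rank()
d = rank_a - rank_b
n = len(judge_a)

# r_k = 1 - 6 * sum(D^2) / (N^3 - N)
r_k = 1 - (6 * (d ** 2).sum()) / (n ** 3 - n)

# Cross-check: Pearson's r on the ranks gives the same value
r_check = np.corrcoef(rank_a.to_numpy(), rank_b.to_numpy())[0, 1]

print(r_k, r_check)
```

Here the squared rank differences sum to 4, so the coefficient works out to 1 - 24/120 = 0.8.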

In [1]:
# Importing libraries
import seaborn as sns
import pandas as pd
import numpy as np

In [2]:
df = sns.load_dataset('iris')

In [3]:
x = df.drop(['species'], axis=1)
x.corr()

| | sepal_length | sepal_width | petal_length | petal_width |
|---|---|---|---|---|
| sepal_length | 1.0 | -0.11757 | 0.871754 | 0.817941 |
| sepal_width | -0.11757 | 1.0 | -0.42844 | -0.366126 |
| petal_length | 0.871754 | -0.42844 | 1.0 | 0.962865 |
| petal_width | 0.817941 | -0.366126 | 0.962865 | 1.0 |


The result above shows the correlation between each pair of variables.

* A value closer to 1 indicates a positive correlation
* A value closer to -1 indicates a negative correlation

The sign gives the direction of the relationship, and the magnitude gives its strength, that is, how closely the variables are correlated. As a rough guide:

* 1 means perfect correlation
* 0.6 to just under 1 means strong correlation
* 0.3 to 0.6 means moderate correlation
* 0.1 to 0.3 means weak correlation
* 0 means no linear correlation

The same ranges apply to negative correlations, using the absolute value of the coefficient.
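
The guide above can be turned into a small helper. This is only a sketch: `describe_correlation` is a hypothetical function, and the cut-offs at 0.6, 0.3, and 0.1 are one reading of the rough ranges listed, not a standard.

```python
def describe_correlation(r):
    """Label a coefficient by rough strength and direction."""
    magnitude = abs(r)
    # Cut-offs are an assumption based on the rough guide above
    if magnitude == 1:
        strength = "perfect"
    elif magnitude >= 0.6:
        strength = "strong"
    elif magnitude >= 0.3:
        strength = "moderate"
    elif magnitude >= 0.1:
        strength = "weak"
    else:
        return "no meaningful linear correlation"
    direction = "positive" if r > 0 else "negative"
    return f"{strength} {direction}"


# Values taken from the correlation matrix above
print(describe_correlation(0.871754))   # sepal_length vs petal_length
print(describe_correlation(-0.42844))   # sepal_width vs petal_length
print(describe_correlation(-0.11757))   # sepal_length vs sepal_width
```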

Let's calculate the correlation between sepal_length and petal_length ourselves.

In [4]:
sepal_length = df['sepal_length']
petal_length = df['petal_length']

# Using the Pearson formula
sepal_mean = sepal_length.mean()
petal_mean = petal_length.mean()

formula_top = sum([(ele_1-sepal_mean) * (ele_2-petal_mean) for ele_1, ele_2 in zip(sepal_length, petal_length)])
formula_bottom_1 = sum([(ele-sepal_mean) ** 2 for ele in sepal_length])
formula_bottom_2 = sum([(ele-petal_mean) ** 2 for ele in petal_length])
formula_bottom = (formula_bottom_1 * formula_bottom_2) ** (1/2)

result = formula_top / formula_bottom
result

0.8717537758865832

The result matches the value pandas reported for sepal_length and petal_length (0.871754), so the manual calculation agrees with `corr()`.
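
The same calculation can also be written with vectorized NumPy operations instead of list comprehensions. The sketch below assumes a hypothetical helper name, `pearson_r`, and uses small toy arrays so it runs without downloading a dataset; `np.corrcoef` serves as an independent check.

```python
import numpy as np


def pearson_r(x, y):
    """Pearson correlation coefficient via vectorized NumPy operations."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()   # deviations from the mean of x
    dy = y - y.mean()   # deviations from the mean of y
    return (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())


# Hypothetical toy data, used only to exercise the function
a = [2.0, 4.0, 6.0, 8.0, 10.0]
b = [1.0, 3.0, 2.0, 5.0, 4.0]

r_manual = pearson_r(a, b)
r_numpy = np.corrcoef(a, b)[0, 1]
print(r_manual, r_numpy)
```

With the notebook's data frame loaded, `pearson_r(df['sepal_length'], df['petal_length'])` should reproduce the value computed above.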