## Lesson 6

### The correlation of values. Parametric and non-parametric measures of correlation. 

### Correlation analysis

**Correlation** is a statistical measure used to judge whether there is a statistical relationship between random variables. If such a relationship exists, changes in the value of one variable are accompanied by systematic changes in the other.


The **correlation coefficient** indicates how strong the relationship is. It is denoted by $R$ or $r$ and can take values from -1 to 1 inclusive.

When the correlation coefficient is close to 1, there is a **direct** relationship between the quantities: an increase in one quantity is accompanied by an increase in the other, and similarly a decrease in one quantity is accompanied by a decrease in the other.

If the correlation coefficient is close to -1, there is an **inverse** correlation between the quantities: an increase in one quantity is accompanied by a decrease in the other and vice versa.

A correlation coefficient close to 0 indicates that there is **no linear relationship** between the quantities: either they change independently of each other, or the relationship between them is purely non-linear.
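The three cases can be illustrated with a minimal sketch on synthetic data. The variable names, the seed and the helper `pearson_r` are assumptions made for this illustration; they do not come from the hockey dataset used below.

```python
import numpy as np

rng = np.random.default_rng(0)          # fixed seed so the sketch is reproducible
x = rng.normal(size=10_000)

direct = 2.0 * x + 3.0                  # grows together with x
inverse = -0.5 * x + 7.0                # falls as x grows
unrelated = rng.normal(size=x.size)     # generated independently of x

def pearson_r(a, b):
    """Pearson correlation coefficient of two 1-D arrays."""
    return float(np.corrcoef(a, b)[0, 1])

r_direct = pearson_r(x, direct)         # close to 1
r_inverse = pearson_r(x, inverse)       # close to -1
r_unrelated = pearson_r(x, unrelated)   # close to 0

print(r_direct, r_inverse, r_unrelated)
```

Because `direct` and `inverse` are exact linear functions of `x`, their coefficients sit at the ends of the range; real data with noise would give values strictly between -1 and 1.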

**Example 1**

Let's calculate the correlation between height and weight of hockey players from the dataset discussed earlier.

In [None]:
import numpy as np
import pandas as pd

In [None]:
df = pd.read_csv('csv/hockey_players.csv', encoding='cp1251', parse_dates=['birth'])
df = df.drop_duplicates(['name', 'birth'])

To do this, use the **corr** method from the **pandas** library:

In [None]:
corr_matrix = df.loc[:, ['height', 'weight']].corr()
corr_matrix

| | height | weight |
|---|---|---|
| height | 1.0 | 0.693731 |
| weight | 0.693731 | 1.0 |


We obtained a correlation matrix. The correlation between height and weight is quite high and has a value almost equal to 0.7:

In [None]:
corr_matrix.loc['height', 'weight']

0.69373056796630506

The correlation is positive, so we can conclude that the taller a hockey player is, the more he tends to weigh.

#### Relationship of values

If two variables are correlated, this may indicate that there is a **statistical relationship** between them. However, a correlation observed in one sample does not guarantee that the same relationship, or a relationship of the same nature, will be found in another sample.

Correlation analysis is simple to interpret, which makes it tempting to conclude, wrongly, that there is a causal relationship between the characteristics. Such a conclusion cannot be drawn on the basis of the correlation coefficient: it can only be said that there is a statistical relationship between the characteristics.

For example, if we look at data on fires in a city, we can see that there is a strong correlation between the material loss caused by a fire and the number of firefighters who have been involved in putting it out.

It would be false to conclude that a large number of firefighters present at a fire increases the fire damage: larger fires both cause more damage and require more firefighters. The false conclusion can lead to the wrong decision, namely to reduce the fire brigade in order to reduce material losses.

Another example of how correlated values can mislead an analyst: in cities with high crime rates, the number of police officers is often high too, so there is a positive correlation between the number of police officers and crime. The false conclusion in this case would be that the increase in the number of police officers caused the higher crime rate, and that the number of law enforcement officers should be reduced in order to reduce crime.

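If the covariance of two random variables is not zero, the variables are dependent. The converse does not hold: a covariance of zero does not imply that the variables are independent.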

A high **correlation** of the two quantities may indicate that they have a **common cause**, even though there is no direct interaction between the two correlated variables. For example, the onset of winter might be the cause of both colds and higher heating costs. This is exactly the case when the two variables (the number of people getting sick and the heating costs) are correlated although they do not influence one another directly. They do, however, have a common cause: the winter season.
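The winter example can be imitated with a small simulation. Everything in this sketch is hypothetical: the temperature distribution, the coefficients and the noise levels are made-up values chosen only so that two variables share a cause without influencing each other. The second half removes the linear effect of temperature from both variables, one simple way (not the only one) to check what remains of the correlation once the common cause is accounted for.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 5_000

# Hypothetical common cause: daily outdoor temperature in degrees Celsius.
temperature = rng.normal(loc=5.0, scale=10.0, size=n)

# Both quantities depend on temperature plus their own independent noise;
# neither one appears in the formula of the other.
colds = 100.0 - 2.0 * temperature + rng.normal(scale=10.0, size=n)
heating = 50.0 - 1.5 * temperature + rng.normal(scale=10.0, size=n)

r_raw = float(np.corrcoef(colds, heating)[0, 1])

def residuals(y, z):
    """What is left of y after subtracting its least-squares line on z."""
    slope, intercept = np.polyfit(z, y, deg=1)
    return y - (slope * z + intercept)

# Correlate the parts of each variable that temperature does not explain.
r_partial = float(np.corrcoef(residuals(colds, temperature),
                              residuals(heating, temperature))[0, 1])

print(f"raw correlation:         {r_raw:.3f}")
print(f"after removing the cause: {r_partial:.3f}")
```

In this simulation the raw correlation is clearly positive, while the correlation of the residuals stays near zero, which is what one would expect when the whole association comes from the shared cause.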

A lack of correlation between two variables does not mean that there is no relationship between the indicators. 

It is possible that there is a non-linear relationship between the indicators, which the correlation coefficient cannot capture.
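A minimal sketch of such a case, assuming a made-up quadratic dependence: `y` is completely determined by `x`, yet the correlation coefficient is practically zero because the relationship is symmetric rather than linear.

```python
import numpy as np

x = np.linspace(-3.0, 3.0, 61)   # points placed symmetrically around zero
y = x ** 2                       # y is fully determined by x

r = float(np.corrcoef(x, y)[0, 1])
print(f"r = {r:.6f}")
```

A scatter plot of `x` against `y` would reveal the parabola immediately, which is why it is worth looking at the data and not only at the coefficient.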

**Correlation indicators**

Depending on the nature of the variables, a suitable method of calculating the correlation coefficient can be chosen.

For quantitative traits measured on an interval or ratio scale, Pearson's correlation coefficient ($r$), a parametric measure of correlation, is used. If at least one of the two traits is ordinal or its distribution is not normal, Spearman's rank correlation coefficient or Kendall's $\tau$ is used; both are non-parametric measures of correlation.
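The difference between a parametric and a rank-based measure can be sketched as follows. The helpers `ranks`, `pearson_r` and `spearman_rho` are written for this illustration and assume there are no tied values (ties would need average ranks); Kendall's $\tau$ is not shown. In practice the pandas `corr` method accepts `method='spearman'` and `method='kendall'`, so a hand-written version is rarely needed.

```python
import numpy as np

def ranks(values):
    """Ranks 1..n of the values; assumes no ties."""
    order = np.argsort(values)
    result = np.empty(len(values), dtype=float)
    result[order] = np.arange(1, len(values) + 1)
    return result

def pearson_r(a, b):
    """Pearson correlation coefficient of two 1-D arrays."""
    return float(np.corrcoef(a, b)[0, 1])

def spearman_rho(a, b):
    """Spearman's coefficient: Pearson's r computed on the ranks."""
    return pearson_r(ranks(a), ranks(b))

x = np.arange(1.0, 11.0)
y = np.exp(x)                    # monotonic, but strongly non-linear

r = pearson_r(x, y)
rho = spearman_rho(x, y)
print(f"Pearson r = {r:.3f}, Spearman rho = {rho:.3f}")
```

Since `y` always increases with `x`, the ranks agree perfectly and Spearman's coefficient reaches 1, while Pearson's coefficient is noticeably lower because the relationship is not a straight line.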

**Covariance**

Covariance, or the correlation moment, is a parametric measure of the joint variability of two variables. It is equal to the mathematical expectation of the product of the deviations of the random variables from their expected values:

$$cov_{XY} = M[(X - M(X))(Y - M(Y))] = M(XY) - M(X)M(Y)=\overline{X \cdot Y} - \overline{X} \cdot \overline{Y}$$

where $M$ is the expected value, and $\overline{X}$ and $\overline{Y}$ are the sample means.

The units of covariance are the product of the units of the two random variables, so the value of covariance depends on the scale in which the variables are measured. This makes covariance on its own difficult to use in correlation analysis.
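The dependence on units is easy to see on a toy example. The five height and weight values below are hypothetical numbers invented for the illustration, and `covariance` is a helper that follows the "mean of the product minus the product of the means" form of the formula above.

```python
import numpy as np

# Hypothetical measurements, for illustration only.
height_cm = np.array([170.0, 175.0, 180.0, 185.0, 190.0])
weight_kg = np.array([68.0, 74.0, 79.0, 88.0, 91.0])

def covariance(x, y):
    """Mean of the product minus the product of the means."""
    return float(np.mean(x * y) - np.mean(x) * np.mean(y))

cov_cm = covariance(height_cm, weight_kg)          # height in centimetres
cov_m = covariance(height_cm / 100.0, weight_kg)   # same heights in metres

print(cov_cm, cov_m)
```

The data are the same in both calls, yet the covariance changes by a factor of 100 simply because height is expressed in different units.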

Knowing the covariance and the standard deviation of each of the two traits, the Pearson correlation coefficient can be calculated:

$$r_{XY} = \frac{cov_{XY}}{\sigma_{X}\sigma_{Y}}$$
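As a sanity check, the formula can be evaluated by hand and compared with NumPy's `corrcoef`. The data are synthetic: the "heights" and "weights" below are generated from assumed parameters and are not the hockey dataset.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(loc=180.0, scale=7.0, size=1_000)          # synthetic "heights"
y = 0.9 * x - 80.0 + rng.normal(scale=6.0, size=x.size)   # synthetic "weights"

# Covariance as the mean of the product minus the product of the means.
cov_xy = np.mean(x * y) - np.mean(x) * np.mean(y)

# Standard deviations with the same normalisation (ddof=0) as the covariance.
sigma_x = np.std(x)
sigma_y = np.std(y)

r_manual = float(cov_xy / (sigma_x * sigma_y))
r_numpy = float(np.corrcoef(x, y)[0, 1])

print(r_manual, r_numpy)
```

The two values agree up to floating-point error. The only detail to watch is consistency: the covariance and both standard deviations must use the same divisor ($n$ here), otherwise the ratio is slightly off.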

**Correlation analysis**

Correlation analysis is a method of statistical data processing that measures the strength of the relationship between several indicators. It is closely related to regression analysis, and the two are commonly referred to together as correlation and regression analysis: correlation analysis helps decide which indicators to include in or exclude from a multiple regression equation, and the coefficient of determination is then used to assess how well the resulting regression equation agrees with the relationships identified.

**Limitations of correlation analysis**

Consider the following limitations of correlation analysis:

1. Correlation analysis requires a large number of observations.

2. The set of factor characteristics and the resulting trait must have a multivariate normal distribution.

3. Although the method is simple and straightforward, it does not establish a causal relationship.
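The first limitation can be illustrated with a small simulation, given here as a sketch under assumptions: the two variables are generated independently (so the true correlation is zero), and the sample sizes 5 and 500, the seed and the number of repeats are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(7)

def sample_r(n):
    """Pearson's r for n observations of two independent normal variables."""
    x = rng.normal(size=n)
    y = rng.normal(size=n)
    return float(np.corrcoef(x, y)[0, 1])

def spread(n, repeats=2_000):
    """Standard deviation of r across repeated samples of size n."""
    return float(np.std([sample_r(n) for _ in range(repeats)]))

spread_small = spread(5)     # very few observations per sample
spread_large = spread(500)   # many observations per sample

print(f"n = 5:   spread of r = {spread_small:.3f}")
print(f"n = 500: spread of r = {spread_large:.3f}")
```

With only five observations the sample coefficient swings widely around zero, so a large value of $r$ can appear purely by chance; with 500 observations it stays close to the true value.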

Correlation analysis is used in many fields: economics, astrophysics, psychology, political science and sociology.

This method of information processing is popular because it is easy to calculate and interpret, and convenient when processing statistical information.