#### Covariance & Correlation with Python

#### Covariance

Covariance measures the degree to which two variables change together. If the covariance is positive, it means that as one variable increases, the other tends to increase as well. If it's negative, it means that as one variable increases, the other tends to decrease.

The formula for covariance between two variables X and Y, given a sample of size n, is:
![image.png](attachment:image.png)

Where:

•	$X_i$ and $Y_i$ are the individual data points

•	$\bar{X}$ and $\bar{Y}$ are the means of X and Y, respectively.
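
As a minimal sketch, assuming the formula above is the usual sample covariance with an $n - 1$ denominator (the default for both pandas `DataFrame.cov` and `np.cov`):

$$\operatorname{cov}(X, Y) = \frac{1}{n - 1} \sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})$$

The helper `sample_covariance` and the toy data below are hypothetical and only serve to check the formula against NumPy.

```python
import numpy as np

def sample_covariance(x, y):
    """Sample covariance of two 1-D sequences, using the n - 1 denominator."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if x.ndim != 1 or x.shape != y.shape:
        raise ValueError("x and y must be 1-D arrays of the same length")
    n = x.size
    if n < 2:
        raise ValueError("need at least two observations")
    # Sum of products of deviations from the means, divided by n - 1
    return float(np.sum((x - x.mean()) * (y - y.mean())) / (n - 1))

# Hypothetical toy data, not taken from the healthexp dataset
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

manual = sample_covariance(x, y)
reference = np.cov(x, y)[0, 1]  # off-diagonal entry of the 2x2 covariance matrix
print(manual, reference)
```

Both values should agree, which suggests the hand-written formula and the library call use the same definition.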


In [1]:
# Seaborn is a Python data visualization library based on matplotlib. It provides a high-level interface for drawing attractive statistical graphics.
# Seaborn is built on top of matplotlib and closely integrated with pandas data structures.
import seaborn as sns

#### healthexp dataset from https://github.com/mwaskom/seaborn-data/blob/master/healthexp.csv

In [2]:
## loading healthexp dataset from https://github.com/mwaskom/seaborn-data/blob/master/healthexp.csv
df = sns.load_dataset('healthexp')
df.head()

Unnamed: 0,Year,Country,Spending_USD,Life_Expectancy
0,1970,Germany,252.311,70.6
1,1970,France,192.143,72.2
2,1970,Great Britain,123.993,71.9
3,1970,Japan,150.437,72.0
4,1970,USA,326.961,70.9


Note: from this dataset, we focus mainly on the numerical fields Spending_USD and Life_Expectancy when calculating covariance and correlation.

In [3]:
## numpy 
import numpy as np


In [4]:
# Select only the numeric columns for the covariance computation, using select_dtypes
df_number = df.select_dtypes(include=np.number)

In [5]:
# Include specific columns
selected_columns = ['Year', 'Spending_USD', 'Life_Expectancy']
selected_df = df[selected_columns]

In [6]:
## finding the covariance matrix for the selected columns (Year, Spending_USD, Life_Expectancy)
selected_df.cov()

Unnamed: 0,Year,Spending_USD,Life_Expectancy
Year,201.098848,25718.83,41.915454
Spending_USD,25718.827373,4817761.0,4166.800912
Life_Expectancy,41.915454,4166.801,10.733902


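Note: the off-diagonal entry for Spending_USD and Life_Expectancy (4166.800912) is positive, so higher spending tends to go together with higher life expectancy. The diagonal entries are variances (4.817761e+06 is the variance of Spending_USD). Because covariance depends on the units of each variable, its magnitude alone does not indicate how strong the relationship is.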

#### Correlation

Correlation is a standardized measure of the relationship between two variables. It tells us both the strength and direction of the linear relationship between them. Correlation values range from -1 to 1. A correlation of 1 indicates a perfect positive linear relationship, -1 indicates a perfect negative linear relationship, and 0 indicates no linear relationship.
The formula for Pearson correlation coefficient (often denoted by 𝑟r) between two variables X and Y, given a sample of size n, is:

![image.png](attachment:image.png)

Where:

•	$X_i$ and $Y_i$ are the individual data points

•	$\bar{X}$ and $\bar{Y}$ are the means of X and Y, respectively.
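
Written out with the symbols defined above, and assuming the formula is the standard sample form of the Pearson coefficient, it reads:

$$r = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{n} (X_i - \bar{X})^2}\,\sqrt{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}}$$

The numerator is the same sum of products of deviations that appears in the covariance; dividing by the two square roots removes the units, which is why $r$ always lies between -1 and 1.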

##### Spearman Correlation:
Spearman correlation is a non-parametric measure of rank correlation. It assesses how well the relationship between two variables can be described using a monotonic function. It's often used when the variables may not have a linear relationship or when the data are ordinal.

The Spearman correlation coefficient ($\rho$) is computed similarly to Pearson correlation but using the ranks of the observations rather than the actual observations themselves.
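
The "Pearson on ranks" description can be checked directly. The sketch below uses a hypothetical toy DataFrame (not the healthexp data): it ranks each column with `DataFrame.rank()` and compares the Pearson correlation of the ranks with what `corr(method='spearman')` returns.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: mostly increasing, with two neighbouring pairs swapped
toy = pd.DataFrame({
    "x": [1, 2, 3, 4, 5, 6],
    "y": [10, 30, 20, 50, 40, 60],
})

# Spearman as reported by pandas
spearman = toy.corr(method="spearman").loc["x", "y"]

# Pearson correlation computed on the ranks of each column
pearson_of_ranks = toy.rank().corr(method="pearson").loc["x", "y"]

print(spearman, pearson_of_ranks)
```

The two numbers should match up to floating-point error, since ranking first and then applying Pearson is how the Spearman coefficient is defined.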

These formulas provide the mathematical foundation for understanding the relationships between variables in a dataset.

In [7]:
## finding Spearman Rank Correlation
selected_df.corr(method='spearman')

Unnamed: 0,Year,Spending_USD,Life_Expectancy
Year,1.0,0.931598,0.896117
Spending_USD,0.931598,1.0,0.747407
Life_Expectancy,0.896117,0.747407,1.0


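Note: the Spearman rank correlation is computed on ranks rather than raw values. Spending_USD and Life_Expectancy have a rank correlation of 0.747, a fairly strong positive monotonic relationship. The 0.93 entry is the rank correlation between Year and Spending_USD.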

##### Pearson Correlation Coefficient (r):
The Pearson correlation coefficient measures the linear relationship between two continuous variables. It ranges from -1 to 1, where:

•  1 indicates a perfect positive linear relationship,

•  -1 indicates a perfect negative linear relationship, and

•  0 indicates no linear relationship.

It's calculated as the covariance of the two variables divided by the product of their standard deviations.

Here's the formula for the Pearson correlation coefficient between two variables X and Y, given a sample of size n:

![image.png](attachment:image.png)

Where:

•	$X_i$ and $Y_i$ are the individual data points

•	$\bar{X}$ and $\bar{Y}$ are the means of X and Y, respectively.
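
The "covariance divided by the product of the standard deviations" description can be verified numerically. This is a minimal sketch on a hypothetical toy DataFrame (the column names and values are illustrative only). It also rescales one column to show that covariance changes with the units while the Pearson coefficient does not.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; not taken from the healthexp dataset
toy = pd.DataFrame({
    "spending": [100.0, 150.0, 200.0, 300.0, 500.0],
    "life_exp": [70.0, 71.5, 72.0, 73.5, 74.0],
})

# Pearson r from its definition: cov(X, Y) / (std(X) * std(Y))
# pandas cov() and std() both use the n - 1 denominator by default
cov_xy = toy.cov().loc["spending", "life_exp"]
r_manual = cov_xy / (toy["spending"].std() * toy["life_exp"].std())
r_pandas = toy.corr(method="pearson").loc["spending", "life_exp"]

# Rescale spending (for example, dollars to thousands of dollars)
rescaled = toy.assign(spending=toy["spending"] / 1000.0)
cov_rescaled = rescaled.cov().loc["spending", "life_exp"]
r_rescaled = rescaled.corr(method="pearson").loc["spending", "life_exp"]

print(r_manual, r_pandas)    # the two should agree
print(cov_xy, cov_rescaled)  # covariance shrinks by the scale factor
print(r_rescaled)            # correlation is unchanged by rescaling
```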

In [8]:
## finding Pearson correlation
selected_df.corr(method='pearson')

Unnamed: 0,Year,Spending_USD,Life_Expectancy
Year,1.0,0.826273,0.902175
Spending_USD,0.826273,1.0,0.57943
Life_Expectancy,0.902175,0.57943,1.0


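Note: the Pearson correlation ranges from -1 to +1. Spending_USD and Life_Expectancy have a Pearson correlation of 0.579, a moderate positive linear relationship. The 1.0 entries on the diagonal are each variable's correlation with itself.
Year and Life_Expectancy have a correlation of 0.902, so life expectancy has risen strongly over the years.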
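
For Spending_USD and Life_Expectancy, the Spearman value (0.747) is higher than the Pearson value (0.579). One possible reason is a relationship that is monotonic but not linear. The sketch below illustrates that effect on hypothetical synthetic data, where `y` grows exponentially with `x`. It makes no claim about the actual shape of the healthexp relationship.

```python
import numpy as np
import pandas as pd

# Hypothetical synthetic data: y always increases with x, but not along a line
x = np.arange(1, 11, dtype=float)
toy = pd.DataFrame({"x": x, "y": np.exp(x)})

pearson = toy.corr(method="pearson").loc["x", "y"]
spearman = toy.corr(method="spearman").loc["x", "y"]

print(pearson)   # noticeably below 1: the relationship is not linear
print(spearman)  # 1.0 up to rounding: the ranks agree perfectly
```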

#### Another example
##### flights dataset from https://github.com/mwaskom/seaborn-data/blob/master/flights.csv

In [9]:
flight_df = sns.load_dataset('flights')
flight_df.head()


Unnamed: 0,year,month,passengers
0,1949,Jan,112
1,1949,Feb,118
2,1949,Mar,132
3,1949,Apr,129
4,1949,May,121


In [10]:
# Include specific columns (the numeric columns needed to compute covariance and correlation)
selected_columns = ['year', 'passengers']
selected_flight_df = flight_df[selected_columns] # using specific columns

In [11]:
## finding covariance
selected_flight_df.cov()

Unnamed: 0,year,passengers
year,12.0,383.087413
passengers,383.087413,14391.917201


Note: the off-diagonal value (383.087413) is the covariance between the 'year' and 'passengers' columns. It is positive, which suggests that as the year increases, the number of passengers also tends to increase. The diagonal values are variances: 12.0 for 'year' and 14391.917201 for 'passengers'.
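
In the matrix returned by `cov()`, the diagonal holds each column's variance and the off-diagonal entries hold the covariances between pairs of columns. A minimal sketch on a hypothetical toy DataFrame (shaped like the flights columns, with made-up values) shows how to read each kind of entry with `.loc`:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with the same column names as the flights subset
toy = pd.DataFrame({
    "year": [1949, 1950, 1951, 1952],
    "passengers": [112, 115, 145, 171],
})

cov = toy.cov()

# Off-diagonal entry: covariance between the two columns
cov_year_passengers = cov.loc["year", "passengers"]

# Diagonal entry: variance of a single column, same as Series.var()
var_year = cov.loc["year", "year"]

print(cov_year_passengers)
print(var_year, toy["year"].var())
```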

In [12]:
## finding Spearman Rank Correlation
selected_flight_df.corr(method='spearman')

Unnamed: 0,year,passengers
year,1.0,0.950549
passengers,0.950549,1.0
