Pandas Correlation

Correlation is a statistical concept that quantifies the degree to which two variables are related to each other.

In Pandas, correlation is calculated with the corr() method, which is available on both DataFrame and Series objects.

Let's look at an example.



In [1]:
import pandas as pd

# create dataframe
data = {
    "Temperature": [22, 25, 32, 28, 30],
    "Ice_Cream_Sales": [105, 120, 135, 130, 125]
}

df = pd.DataFrame(data)

# calculate correlation matrix
print(df.corr())

                 Temperature  Ice_Cream_Sales
Temperature         1.000000         0.923401
Ice_Cream_Sales     0.923401         1.000000


Positive and Negative Correlation
Positive correlation refers to a relationship between two variables where they both tend to change in the same direction. When one variable increases, the other variable also tends to increase, and when one variable decreases, the other variable also tends to decrease.

Negative correlation, on the other hand, refers to a relationship between two variables where they tend to change in opposite directions. When one variable increases, the other variable tends to decrease, and vice versa.
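
To make the negative case concrete, here is a minimal sketch. The Hot_Chocolate_Sales column is hypothetical data invented for this illustration; it is chosen so that sales fall as temperature rises.

```python
import pandas as pd

# Hypothetical data (not from the original example): hot chocolate
# sales go down as the temperature goes up
data = {
    "Temperature": [22, 25, 32, 28, 30],
    "Hot_Chocolate_Sales": [90, 80, 50, 65, 60],
}

df = pd.DataFrame(data)

# correlation between the two columns; a negative value is expected
negative = df["Temperature"].corr(df["Hot_Chocolate_Sales"])

print(negative)
```

The printed coefficient is below zero, which is the signature of a negative correlation.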

Instead of computing the whole correlation matrix, we can calculate the correlation between two specific columns by calling corr() on one column and passing the other column as the argument.

In [2]:
import pandas as pd

# create dataframe
data = {
    "Temperature": [22, 25, 32, 28, 30],
    "Ice_Cream_Sales": [105, 120, 135, 130, 125]
}

df = pd.DataFrame(data)

# calculate correlation coefficient
correlation = df['Temperature'].corr(df["Ice_Cream_Sales"])

print(correlation)

0.9234007664064656
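
To see where this number comes from, the Pearson coefficient can be reproduced by hand: sum the products of each variable's deviations from its mean, then divide by the square root of the product of the summed squared deviations. The sketch below assumes the same data as above; the variable names are chosen only for illustration.

```python
import pandas as pd

temperature = pd.Series([22, 25, 32, 28, 30])
sales = pd.Series([105, 120, 135, 130, 125])

# deviations of each value from its column mean
temp_dev = temperature - temperature.mean()
sales_dev = sales - sales.mean()

# Pearson formula: sum of products of deviations, divided by the
# square root of the product of the summed squared deviations
manual = (temp_dev * sales_dev).sum() / (
    (temp_dev ** 2).sum() * (sales_dev ** 2).sum()
) ** 0.5

# result of the built-in method, for comparison
builtin = temperature.corr(sales)

print(manual)
print(builtin)
```

Both values should agree up to floating-point rounding.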


Correlation Methods in Pandas
We can calculate correlation using three different methods in Pandas:

Pearson Method (Default): evaluates the linear relationship between two continuous variables
Kendall Method: measures the ordinal association between two variables, based on how many pairs of observations are ranked in the same order
Spearman Method: evaluates the monotonic relationship between two continuous or ordinal variables

By default, corr() computes the Pearson correlation coefficient. To use one of the other methods, pass method='kendall' or method='spearman'.



In [3]:
import pandas as pd

# create dataframe
data = {
    "Temperature": [22, 25, 32, 28, 30],
    "Ice_Cream_Sales": [105, 120, 135, 130, 125]
}

df = pd.DataFrame(data)

# calculate different correlation coefficients
pearson = df['Temperature'].corr(df["Ice_Cream_Sales"])
kendall = df['Temperature'].corr(df["Ice_Cream_Sales"], method='kendall')
spearman = df['Temperature'].corr(df["Ice_Cream_Sales"], method='spearman')

# display the different correlation coefficients
print(f"Pearson's Coefficient: {pearson}")
print(f"Kendall's Coefficient: {kendall}")
print(f"Spearman's Coefficient: {spearman}")

Pearson's Coefficient: 0.9234007664064656
Kendall's Coefficient: 0.7999999999999999
Spearman's Coefficient: 0.8999999999999998
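
One way to understand the Spearman value is that it is the Pearson coefficient computed on the ranks of the data rather than on the raw values. The following sketch checks this on the same data, and also shows that the method argument works on the whole DataFrame. The variable names are illustrative.

```python
import pandas as pd

data = {
    "Temperature": [22, 25, 32, 28, 30],
    "Ice_Cream_Sales": [105, 120, 135, 130, 125]
}

df = pd.DataFrame(data)

# Spearman coefficient from the built-in method
spearman_builtin = df["Temperature"].corr(df["Ice_Cream_Sales"], method="spearman")

# replace each value with its rank, then apply the default Pearson method
ranks = df.rank()
pearson_of_ranks = ranks["Temperature"].corr(ranks["Ice_Cream_Sales"])

print(spearman_builtin)
print(pearson_of_ranks)

# the method argument is also accepted when computing the full matrix
kendall_matrix = df.corr(method="kendall")
print(kendall_matrix)
```

The first two printed values should match up to floating-point rounding.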


Perfect, Good & Bad Correlation
We can interpret the correlation values as:

Perfect Correlation

A perfect positive correlation, indicated by a coefficient of +1, means that the data points lie exactly on a straight line with a positive slope: every increase in one variable is matched by a proportionate increase in the other.

A perfect negative correlation, indicated by -1, means that the data points lie exactly on a straight line with a negative slope: every increase in one variable is matched by a proportionate decrease in the other.
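
As a small illustration with made-up values, any exact straight-line relationship gives a Pearson coefficient of +1 or -1, depending on the sign of the slope. The series names here are hypothetical.

```python
import pandas as pd

x = pd.Series([1, 2, 3, 4, 5])

# exact linear relationships, one with positive and one with negative slope
rising = 2 * x + 3
falling = -3 * x + 10

perfect_positive = x.corr(rising)
perfect_negative = x.corr(falling)

print(perfect_positive)
print(perfect_negative)
```

The two results should be +1 and -1, possibly off by a tiny floating-point rounding error.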

Good Correlation

A good correlation has an absolute value of roughly 0.5 or more but less than 1. It generally indicates a strong relationship between the variables, although not a perfect one. These cutoffs are rules of thumb rather than fixed standards.

Bad Correlation

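A bad correlation is typically close to zero, indicating that there is little or no linear relationship between the two variables. It does not rule out other forms of dependence: two variables can have a strong non-linear relationship and still show a correlation close to zero.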
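
A coefficient near zero is a statement about the linear (Pearson) relationship only. As a minimal sketch with made-up data, the example below builds a variable that is fully determined by another one, yet their Pearson correlation is zero because the relationship is symmetric rather than linear.

```python
import pandas as pd

x = pd.Series([-2, -1, 0, 1, 2])

# y depends entirely on x, but the relationship is a parabola, not a line
y = x ** 2

zero_corr = x.corr(y)

print(zero_corr)
```

Plotting the data is a useful complement to corr() for spotting this kind of pattern.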