# Correlation


Let's revisit our dataset of economic indicators.

We will focus on correlation, determining which of these indicators may be positively or negatively correlated with each other. This will allow us to answer questions like, "Is gold a good hedge against inflation?"

In [None]:
from pandas import read_csv

df = read_csv("https://raw.githubusercontent.com/prof-rossetti/python-for-finance/main/docs/data/monthly-indicators.csv")
df.head()

|   | timestamp | cpi | fed | spy | gld |
|---|---|---|---|---|---|
| 0 | 2024-05-01 | 314.069 | 5.33 | 525.6718 | 215.3 |
| 1 | 2024-04-01 | 313.548 | 5.33 | 500.3636 | 211.87 |
| 2 | 2024-03-01 | 312.332 | 5.33 | 521.3857 | 205.72 |
| 3 | 2024-02-01 | 310.326 | 5.33 | 504.8645 | 189.31 |
| 4 | 2024-01-01 | 308.417 | 5.33 | 479.824 | 188.45 |


In [None]:
print(len(df))
print(df["timestamp"].min(), "...", df["timestamp"].max())

234
2004-12-01 ... 2024-05-01


The primary reason we fetched all these different datasets and merged them together is so we can explore the correlation between them.

**Correlation** is a measure of how two datasets are related to each other.


https://www.investopedia.com/terms/c/correlation.asp

<img src="https://www.investopedia.com/thmb/Xz1Mnf7Ji54AAfAT1fsiwcZvmxM=/750x0/filters:no_upscale():max_bytes(150000):strip_icc():format(webp)/correlation_defintion_-9d2d662781724d61af6d6322a2a294b5.jpg" height=250>


> Investment managers, traders, and analysts find it very important to calculate correlation because the risk reduction benefits of diversification rely on this statistic.

Let's take a quick detour to make a scaled version of this data. Scaling makes it easier to plot all of these series on one graph, so we can start to get an informal sense of how their movements might correlate.

In [None]:
scaled_df = df.copy()
scaled_df.index = df["timestamp"] # keep the timestamps as the index, for charting
scaled_df.drop(columns=["timestamp"], inplace=True) # drop the timestamp column, leaving only numeric columns
scaled_df = scaled_df / scaled_df.max() # divide each column by its own max (one of many possible scaling methods)
scaled_df.head()

import plotly.express as px
px.line(scaled_df, y=["cpi", "fed", "spy", "gld",
                      #"btc"
                      ],
        title="Scaled data over time")
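
The scaling cell above divides each column by its maximum and notes that alternatives exist. As a rough sketch of two common ones (min-max scaling and z-score standardization), here they are applied to a small made-up dataframe. The name `toy_df` and its values are illustrative assumptions, not part of the indicators dataset.

```python
import pandas as pd

# hypothetical toy data, standing in for the numeric indicator columns
toy_df = pd.DataFrame({"a": [10.0, 20.0, 30.0, 40.0], "b": [1.0, 3.0, 2.0, 5.0]})

# max scaling (the approach used in the cell above): each column tops out at 1
max_scaled = toy_df / toy_df.max()

# min-max scaling: each column ranges from exactly 0 to exactly 1
minmax_scaled = (toy_df - toy_df.min()) / (toy_df.max() - toy_df.min())

# z-score standardization: each column has mean 0 and (sample) standard deviation 1
zscore_scaled = (toy_df - toy_df.mean()) / toy_df.std()

print(max_scaled)
print(minmax_scaled)
print(zscore_scaled)
```

Each of these is a linear rescaling of a column, so none of them changes the Pearson or Spearman correlation between columns. They only change how the series look when plotted together.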

Looks like the [...] has been moving [upward/downward] at a time when [...] has been moving [upward/downward]. We might start to suspect they are correlated in a [pos/neg] way.

> NOTE: correlation does not imply causation!

Let's now test for correlation in a more formal way.



## Correlation Considerations

Certain methods for calculating correlation may depend on the normality of our data's distribution, or the sample size, so we should keep these in mind as we determine if we are able to calculate correlation, and which method to use.
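
One possible way to check the normality consideration is a Shapiro-Wilk test, available as `shapiro` in `scipy.stats`. The sketch below uses randomly generated samples (an assumption for illustration, not the indicators data) to show the mechanics. A small p-value suggests the sample departs from a normal distribution.

```python
import numpy as np
from scipy.stats import shapiro

# hypothetical samples: one drawn from a normal distribution, one heavily skewed
rng = np.random.default_rng(seed=0)
normal_sample = rng.normal(loc=0, scale=1, size=200)
skewed_sample = rng.exponential(scale=1.0, size=200)

# shapiro returns a test statistic and a p-value
normal_stat, normal_pvalue = shapiro(normal_sample)
skewed_stat, skewed_pvalue = shapiro(skewed_sample)

print("normal sample p-value:", normal_pvalue)
print("skewed sample p-value:", skewed_pvalue)  # expected to be very small
```

With our own data, we could pass a single column such as `df["spy"]` to `shapiro` in the same way.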



https://www.investopedia.com/terms/n/nonparametric-method.asp


> The nonparametric method refers to a type of statistic that does not make any assumptions about the characteristics of the sample (its parameters) or whether the observed data is quantitative or qualitative.
>
> Nonparametric statistics can include certain descriptive statistics, statistical models, inference, and statistical tests. The model structure of nonparametric methods is not specified a priori but is instead determined from data.
>
> Common nonparametric tests include Chi-Square, Wilcoxon rank-sum test, Kruskal-Wallis test, and Spearman's rank-order correlation.
>
> In contrast, well-known statistical methods such as ANOVA, Pearson's correlation, t-test, and others do make assumptions about the data being analyzed. One of the most common parametric assumptions is that population data have a "normal distribution."


## Correlation with `scipy`



We can always calculate correlation between two lists of numbers, using the `pearsonr` and `spearmanr` functions from the `scipy` package.

One difference between these two correlation methods is that Spearman is more robust to (i.e. less affected by) outliers. Because it is nonparametric, the Spearman method also does not assume our data is normally distributed.
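
To make the robustness claim concrete, here is a minimal sketch using made-up numbers (not the indicators data). We start with a perfectly linear relationship, then replace the last value with a single extreme outlier and compare how each coefficient reacts.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

# hypothetical data: y equals x exactly, for x = 1..20
x = np.arange(1, 21)
y_clean = x.astype(float)

# same data, but with the last observation replaced by an extreme outlier
y_outlier = y_clean.copy()
y_outlier[-1] = -100.0

pearson_clean, _ = pearsonr(x, y_clean)
spearman_clean, _ = spearmanr(x, y_clean)
pearson_outlier, _ = pearsonr(x, y_outlier)
spearman_outlier, _ = spearmanr(x, y_outlier)

print("clean:  ", pearson_clean, spearman_clean)
print("outlier:", pearson_outlier, spearman_outlier)
```

Both coefficients are 1 (up to floating point rounding) for the clean data. After introducing the one outlier, Pearson falls to roughly -0.17, while Spearman stays around 0.71, because Spearman only looks at the rank order of the values rather than their magnitudes.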


https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.pearsonr.html

> Pearson correlation coefficient and p-value for testing non-correlation.
>
> The Pearson correlation coefficient [1] measures the linear relationship between two datasets. Like other correlation coefficients, this one varies between -1 and +1 with 0 implying no correlation. Correlations of -1 or +1 imply an exact linear relationship. Positive correlations imply that as x increases, so does y. Negative correlations imply that as x increases, y decreases.
>
> This function also performs a test of the null hypothesis that the distributions underlying the samples are uncorrelated and normally distributed. (See Kowalski [3] for a discussion of the effects of non-normality of the input on the distribution of the correlation coefficient.) The p-value roughly indicates the probability of an uncorrelated system producing datasets that have a Pearson correlation at least as extreme as the one computed from these datasets.

https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.spearmanr.html

> Calculate a Spearman correlation coefficient with associated p-value.
>
> The Spearman rank-order correlation coefficient is a nonparametric measure of the monotonicity of the relationship between two datasets. Like other correlation coefficients, this one varies between -1 and +1 with 0 implying no correlation. Correlations of -1 or +1 imply an exact monotonic relationship. Positive correlations imply that as x increases, so does y. Negative correlations imply that as x increases, y decreases.
>
> The p-value roughly indicates the probability of an uncorrelated system producing datasets that have a Spearman correlation at least as extreme as the one computed from these datasets. Although calculation of the p-value does not make strong assumptions about the distributions underlying the samples, it is only accurate for very large samples (>500 observations). For smaller sample sizes, consider a permutation test instead (see docs for examples).
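
The permutation test suggested in the quote above can be sketched by hand with `numpy`. The idea is to shuffle one series many times, which breaks any real relationship, and see how often the shuffled data produces a correlation at least as extreme as the observed one. The data, the seed, and the number of resamples below are all illustrative assumptions. scipy also provides a `scipy.stats.permutation_test` function for this purpose.

```python
import numpy as np
from scipy.stats import spearmanr

# hypothetical small sample with a genuine positive relationship
rng = np.random.default_rng(seed=42)
x = rng.normal(size=40)
y = x + rng.normal(scale=0.5, size=40)

# observed Spearman coefficient (first element of the result)
observed = spearmanr(x, y)[0]

# build the null distribution by shuffling x and recomputing the coefficient
n_resamples = 2000
null_stats = np.empty(n_resamples)
for i in range(n_resamples):
    null_stats[i] = spearmanr(rng.permutation(x), y)[0]

# two-sided p-value: share of shuffles at least as extreme as what we observed
# (adding 1 to numerator and denominator counts the observed arrangement itself)
p_value = (np.sum(np.abs(null_stats) >= abs(observed)) + 1) / (n_resamples + 1)

print("observed:", observed)
print("permutation p-value:", p_value)
```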

In [None]:
from scipy.stats import pearsonr

x = df["fed"]
y = df["spy"]

result = pearsonr(x, y)
print(result)

PearsonRResult(statistic=0.17282057382978896, pvalue=0.008062179433931187)


In [None]:
from scipy.stats import spearmanr

x = df["fed"]
y = df["spy"]

result = spearmanr(x, y)
print(result)

SignificanceResult(statistic=0.005936198901328186, pvalue=0.9280322090398303)
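
To read results like these, we can compare the p-value against a chosen significance level. The helper below is a hypothetical convenience function (not part of `scipy`), and the 0.05 threshold is a common convention rather than a rule. The numbers passed in are the Pearson and Spearman results printed above for `fed` versus `spy`.

```python
def describe_correlation(statistic, pvalue, alpha=0.05):
    """Summarize a correlation result in words (illustrative helper)."""
    if statistic > 0:
        direction = "positive"
    elif statistic < 0:
        direction = "negative"
    else:
        direction = "zero"
    if pvalue < alpha:
        verdict = "statistically significant"
    else:
        verdict = "not statistically significant"
    return f"{direction} correlation ({statistic:.3f}), {verdict} at alpha={alpha}"

# values copied from the printed results above
pearson_summary = describe_correlation(0.17282057382978896, 0.008062179433931187)
spearman_summary = describe_correlation(0.005936198901328186, 0.9280322090398303)

print("Pearson: ", pearson_summary)
print("Spearman:", spearman_summary)
```

Here the two methods disagree: the Pearson test reports a weak positive relationship that clears the 0.05 threshold, while the Spearman coefficient is close to zero with a p-value near 0.93. Keep in mind the caveat quoted above about Spearman p-values for samples under 500 observations, since our dataset has 234 rows.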


## Correlation Matrix with `pandas`

OK, so we can calculate correlation between two sets of data. But what if we wanted to calculate correlation between many different datasets? We could perhaps set up a loop, but there is an easier way.

If we have a pandas dataframe, we can use its `corr()` method to produce a "correlation matrix", which shows us the "pairwise correlation of columns". In other words, it shows the correlation of each column with every other column.

https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.corr.html



In [None]:
#df.corr(method="pearson") # method is pearson by default
df.corr(method="pearson", numeric_only=True) # numeric_only skips the non-numeric timestamp column (avoids a warning in older pandas, an error in pandas 2.x)

|   | cpi | fed | spy | gld |
|---|---|---|---|---|
| cpi | 1.0 | 0.078102 | 0.949065 | 0.823717 |
| fed | 0.078102 | 1.0 | 0.172821 | -0.263213 |
| spy | 0.949065 | 0.172821 | 1.0 | 0.71916 |
| gld | 0.823717 | -0.263213 | 0.71916 | 1.0 |


In [None]:
#df.corr(method="spearman")
df.corr(method="spearman", numeric_only=True) # numeric_only skips the non-numeric timestamp column (avoids a warning in older pandas, an error in pandas 2.x)

|   | cpi | fed | spy | gld |
|---|---|---|---|---|
| cpi | 1.0 | -0.102732 | 0.953588 | 0.790661 |
| fed | -0.102732 | 1.0 | 0.005936 | -0.308626 |
| spy | 0.953588 | 0.005936 | 1.0 | 0.714306 |
| gld | 0.790661 | -0.308626 | 0.714306 | 1.0 |


We may begin to notice the diagonal of 1.0 values. This is because each dataset is perfectly positively correlated with itself.

We may also start to notice the symmetry of values mirrored across the diagonal. In other words, the value in column 1, row 4 is the same as the value in column 4, row 1.
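
As a sanity check on these observations, the sketch below builds a correlation matrix the long way (with a loop over column pairs, using `pearsonr`) and compares it with the result of `corr()`. It uses a small randomly generated dataframe; the name `toy_df` and its columns are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from scipy.stats import pearsonr

# hypothetical toy data: 50 rows, three numeric columns
rng = np.random.default_rng(seed=1)
toy_df = pd.DataFrame(rng.normal(size=(50, 3)), columns=["a", "b", "c"])

# the easy way
cor_mat = toy_df.corr(method="pearson")

# the long way: loop over every pair of columns
loop_mat = pd.DataFrame(index=toy_df.columns, columns=toy_df.columns, dtype=float)
for row in toy_df.columns:
    for col in toy_df.columns:
        loop_mat.loc[row, col] = pearsonr(toy_df[row], toy_df[col])[0]

print(np.allclose(cor_mat.values, loop_mat.values))      # both approaches agree
print(np.allclose(np.diag(cor_mat.values), 1.0))         # diagonal of 1.0 values
print(np.allclose(cor_mat.values, cor_mat.values.T))     # symmetric across the diagonal
```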

## Plotting Correlation Matrix

It may not be easy to quickly interpret the rest of the values in the correlation matrix, but if we plot it with colors as a "heat map" then we will be able to use color to more easily interpret the data and tell a story.

### Correlation Heatmap with `plotly`

https://plotly.com/python-api-reference/generated/plotly.express.imshow.html

In [None]:
# https://plotly.com/python/heatmaps/
# https://plotly.com/python-api-reference/generated/plotly.express.imshow.html
import plotly.express as px

cor_mat = df.corr(method="spearman", numeric_only=True) # numeric_only skips the non-numeric timestamp column

title = "Spearman Correlation between Economic Indicators"
fig = px.imshow(cor_mat,
                height=600, # title=title,
                text_auto=".2f", # round to two decimal places
                color_continuous_scale="Blues",
                color_continuous_midpoint=0, # set color midpoint at zero because correlation coefficient ranges from -1 to 1 (see correlation notes)
                labels={"x": "Indicator", "y": "Indicator"}
)
fig.update_layout(title={'text': title, 'x':0.485, 'xanchor': 'center'}) # https://stackoverflow.com/questions/64571789/center-plotly-title-by-default
fig.show()

What stories can we tell with the correlation heatmap? Which indicators are most positively correlated? Which are most negatively correlated?

Is gold a hedge against inflation, or is there another indicator which may be a better hedge?
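
To help answer these questions without reading the matrix cell by cell, one approach is to list each unique pair of indicators once and sort the pairs by their coefficient. The sketch below hard-codes the Spearman values printed earlier so it can run on its own. With the real data we could use `df.corr(method="spearman", numeric_only=True)` in place of the hand-typed matrix.

```python
from itertools import combinations
import pandas as pd

# Spearman correlation values copied from the matrix printed earlier
labels = ["cpi", "fed", "spy", "gld"]
spearman_mat = pd.DataFrame(
    [
        [1.0, -0.102732, 0.953588, 0.790661],
        [-0.102732, 1.0, 0.005936, -0.308626],
        [0.953588, 0.005936, 1.0, 0.714306],
        [0.790661, -0.308626, 0.714306, 1.0],
    ],
    index=labels,
    columns=labels,
)

# one entry per unique pair, skipping the diagonal and the mirrored duplicates
pairs = pd.Series({(a, b): spearman_mat.loc[a, b] for a, b in combinations(labels, 2)})
pairs = pairs.sort_values(ascending=False)

print(pairs)
print("most positive:", pairs.idxmax())
print("most negative:", pairs.idxmin())
```

For these values, the most positively correlated pair is `cpi` and `spy` (about 0.95), and the most negatively correlated pair is `fed` and `gld` (about -0.31).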
