# PDA data science - Correlation
<div class="alert alert-block alert-info"> 
    Notebook 4: by michael.ferrie@edinburghcollege.ac.uk <br> Edinburgh College, March 2022
</div>

## Key concepts

Correlation means association. More precisely, it is a measure of the extent to which two variables are related. There are three possible results of a correlational study: a positive correlation, a negative correlation, and no correlation.

"Correlation is not causation": just because two variables are related, it does not necessarily mean that one causes the other.

Dr Saul McLeod [explains correlation](https://www.simplypsychology.org/correlation.html) in simple terms.

A <b>positive correlation</b> is a relationship between two variables in which both variables move in the same direction: as one variable increases the other increases, and as one decreases the other decreases. An example of positive correlation would be height and weight. Taller people tend to be heavier.

A <b>negative correlation</b> is a relationship between two variables in which an increase in one variable is associated with a decrease in the other. An example of negative correlation would be height above sea level and temperature. As you climb the mountain (increase in height) it gets colder (decrease in temperature).

A <b>zero correlation</b> exists when there is no relationship between two variables. For example there is no relationship between the amount of tea drunk and level of intelligence.

One way to quantify the relationship between two variables is to use the Pearson correlation coefficient, which is a measure of the linear association between two variables. It is calculated as the covariance of the two variables divided by the product of their standard deviations. It always takes a value between -1 and 1, where:

* -1 indicates a perfectly negative linear correlation between two variables
* 0 indicates no linear correlation between two variables
* 1 indicates a perfectly positive linear correlation between two variables
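
The definition above can be turned into a few lines of code. This is only a minimal sketch to show where the number comes from, assuming `numpy` is installed; the function name `pearson_r` and the example lists are made up for illustration, and in practice the library functions used in this notebook are the better choice.

```python
import numpy as np

def pearson_r(x, y):
    """Covariance of x and y divided by the product of their standard deviations."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # distance of every value from the mean of its own list
    dx = x - x.mean()
    dy = y - y.mean()
    # population covariance; np.std also uses the population formula by default
    covariance = (dx * dy).mean()
    return covariance / (x.std() * y.std())

# illustrative data: y is exactly 2 * x, so the relationship is perfectly linear
example_x = [1, 2, 3, 4, 5]
example_y = [2, 4, 6, 8, 10]
print(pearson_r(example_x, example_y))        # very close to 1
print(pearson_r(example_x, example_y[::-1]))  # very close to -1

# y = x squared is a strong relationship, but it is not linear
curve_x = [-2, -1, 0, 1, 2]
curve_y = [4, 1, 0, 1, 4]
print(pearson_r(curve_x, curve_y))            # 0.0
```

The last example is why the list above says "no linear correlation": a coefficient of 0 does not rule out other kinds of relationship.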

To calculate this in Python we can use the `numpy` library. The first step is to install it:

`pip install numpy`

By convention, numpy is imported as `np`.

In [None]:
# import numpy and alias to np
import numpy as np

# create two very small example lists
my_list1 = [1, 2, 13, 41]
my_list2 = [4, 10, 12, 2]

# run correlation function
np.corrcoef(my_list1, my_list2)

# select only the coefficient between the two lists
np.corrcoef(my_list1, my_list2)[0,1]

### Understanding the output

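The `corrcoef` function returns a correlation matrix, and the value at index `[0, 1]` is the coefficient for the two lists. Here it is negative (about -0.51), which means that larger values in `my_list1` tend to be paired with smaller values in `my_list2`.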
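
To see why `[0, 1]` is the index to pick, it can help to look at the layout of the matrix. The sketch below repeats the two small lists under the illustrative names `a` and `b` so that it runs on its own.

```python
import numpy as np

# the same values as my_list1 and my_list2
a = [1, 2, 13, 41]
b = [4, 10, 12, 2]

matrix = np.corrcoef(a, b)
print(matrix.shape)        # (2, 2)

# diagonal entries compare each list with itself, so they are always 1
print(matrix[0, 0], matrix[1, 1])

# off-diagonal entries compare a with b; the matrix is symmetric,
# so [0, 1] and [1, 0] hold the same coefficient
r_ab = matrix[0, 1]
r_ba = matrix[1, 0]
print(round(r_ab, 2))      # -0.51
```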

A second useful library for statistics is `scipy`:

`pip install scipy`

We can use scipy to return Pearson's r for the two datasets:

In [None]:
# import the pearsonr function
from scipy.stats import pearsonr

# run function
pearsonr(my_list1, my_list2)

Note that this returns two values: the first is the correlation and the second is the probability, or p-value. To strip the response down to only the correlation, use the index.

In [None]:
# show only Pearson's r
pearsonr(my_list1, my_list2)[0]
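
An alternative to indexing is to unpack both values into named variables, which can make later code easier to read. This is a small sketch that assumes `scipy` is installed; the names `a`, `b`, `r` and `p_value` are just illustrative choices.

```python
from scipy.stats import pearsonr

# the same values as my_list1 and my_list2
a = [1, 2, 13, 41]
b = [4, 10, 12, 2]

# the result can be unpacked like a tuple of two values
r, p_value = pearsonr(a, b)
print(f"r = {r:.2f}, p-value = {p_value:.2f}")
```

Roughly speaking, a small p-value suggests that a correlation this strong would be unlikely if the two variables were really unrelated. With only four data points, the p-value should be treated with caution.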

### Visualise the data

It would be good to get a picture of the shape of the data. Here is how we can achieve that.

In [None]:
# import pyplot
import matplotlib.pyplot as plt

# plot my_list1
plt.plot(my_list1)

In [None]:
# plot my_list2
plt.plot(my_list2)

In [None]:
# plot both lists on the same axes by calling plot once for each list
plt.plot(my_list1)
plt.plot(my_list2)
plt.show()

In [None]:
# as a scatter plot, with my_list1 on the x-axis and my_list2 on the y-axis
plt.scatter(my_list1, my_list2)

### Using more realistic data

As we can see, there isn't much of a relationship between the variables. Let's look at some more realistic data.

We will look at the relationship between the height and weight of a small group of people.

In [None]:
# set up two lists, using centimetres for height and kilograms for weight
height = [183, 162, 190, 178, 140]
weight = [95, 82, 91, 83, 64]

Our hypothesis is that the taller a person is, the more they weigh. First, calculate the correlation.

In [None]:
# import the pearsonr function
from scipy.stats import pearsonr

# show only Pearson's r
pearsonr(height, weight)[0]

The data supports the hypothesis: there is a very strong positive correlation between height and weight. Now change the data to see how the correlation responds: replace the weight with different values but leave the height unchanged.

In [None]:
# keep the original heights, replace the weights with arbitrary made-up values
height = [183, 162, 190, 178, 140]
weight = [500, 1, 5, 8, -33]

# import the pearsonr function
from scipy.stats import pearsonr

# show only Pearson's r
pearsonr(height, weight)[0]

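Notice that the correlation is weaker but still positive, because the heights are unchanged and the largest new weight sits against an above-average height while the smallest sits against the shortest height. What happens if we change both lists?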

In [None]:
# replace both lists with arbitrary made-up values
height = [-55, 3000, 451, 987, -22]
weight = [500, 1, 5, 8, -33]

# import the pearsonr function
from scipy.stats import pearsonr

# show only Pearson's r
pearsonr(height, weight)[0]

This shows a negative correlation between the lists.
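
Going back to the original measurements, one property worth checking is that Pearson's r has no units: rescaling either variable leaves it unchanged. The sketch below assumes `numpy` is installed and uses illustrative variable names; the conversion factor for pounds is approximate.

```python
import numpy as np

# the original height and weight measurements
height_cm = np.array([183, 162, 190, 178, 140])
weight_kg = np.array([95, 82, 91, 83, 64])

r_cm_kg = np.corrcoef(height_cm, weight_kg)[0, 1]

# convert to metres and (approximately) pounds
height_m = height_cm / 100
weight_lb = weight_kg * 2.20462

r_m_lb = np.corrcoef(height_m, weight_lb)[0, 1]

# both coefficients agree, because r only depends on how the values
# vary relative to their own mean and spread
print(round(r_cm_kg, 2), round(r_m_lb, 2))  # 0.94 0.94
```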

### Using pandas

We can also calculate correlation with pandas dataframes. First, convert our two test lists to dataframes.

In [None]:
# import pandas and alias as pd
import pandas as pd

# convert lists to dataframes
df_height = pd.DataFrame(height)
df_weight = pd.DataFrame(weight)

# print out to visualise
print(df_height)
print(df_weight)

# print type (always check the type you are working with)
print(type(df_height))
print(type(df_weight))

With pandas we can describe the data to get statistics and also print the correlation.

In [None]:
# generate summary statistics on the data
df_height.describe()

It is actually easier if we put our two lists into the same dataframe as separate columns, then we can perform operations on them.

In [None]:
# define the dataframe and put the first list in
df = pd.DataFrame(height)

# add in the next list
df['weight'] = weight

# label the columns
df.columns = ['height in cm', 'weight in kg']
print(df)

In [None]:
# Generate summary statistics and correlation, plot
print(df.describe())
print(df.corr())

# Plot values
df.plot()
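
As a possible alternative, the dataframe can be built in one step from a dictionary, and a single coefficient can be read from two columns with `Series.corr`. This sketch assumes `pandas` is installed; it uses the original height and weight measurements, and the name `people` is made up for illustration.

```python
import pandas as pd

# build the dataframe in one step, with column names as dictionary keys
people = pd.DataFrame({
    "height in cm": [183, 162, 190, 178, 140],
    "weight in kg": [95, 82, 91, 83, 64],
})

# corr() on the dataframe returns the full correlation matrix
corr_matrix = people.corr()
print(corr_matrix)

# corr() on one column, given another column, returns a single number
r = people["height in cm"].corr(people["weight in kg"])
print(round(r, 2))  # 0.94
```

Both approaches use Pearson's method by default, so the single number matches the off-diagonal entry of the matrix.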

## Introducing `mplfinance`

Mplfinance is a fantastic library for visualising financial data. Install it with pip:

`pip install mplfinance`

Get some data with the `yfinance` library (install it with `pip install yfinance` if needed), then we can examine the output.

In [None]:
# import libraries
import mplfinance as mpf
import yfinance as yf

# make a dataframe of the bitcoin value for Feb 2022
# (yfinance treats the end date as exclusive, so use 1 March to include 28 Feb)
df = yf.download(['BTC-USD'],
                 start='2022-02-01', end='2022-03-01', interval='1d')

# plot with mpl
mpf.plot(df)

The [documentation on GitHub](https://github.com/matplotlib/mplfinance) shows different examples.

In [None]:
# candle plot
mpf.plot(df, type='candle')

In [None]:
# line plot
mpf.plot(df, type='line')

In [None]:
# open high low close (OHLC) chart with a 4-period moving average
mpf.plot(df, type='ohlc', mav=4)
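
To get a feel for what the moving average line represents, the calculation can be sketched with pandas alone. This assumes that `mav=4` corresponds to a simple rolling mean of the closing price over 4 periods, which is my understanding of the behaviour rather than something confirmed here. The closing prices below are made up for illustration and are not real market data.

```python
import pandas as pd

# made-up closing prices for six consecutive days (not real market data)
close = pd.Series(
    [10.0, 12.0, 11.0, 13.0, 15.0, 14.0],
    index=pd.date_range("2022-02-01", periods=6, freq="D"),
    name="Close",
)

# each value is the mean of the current day and the three days before it
moving_average = close.rolling(window=4).mean()
print(moving_average)
# the first three days have no value (NaN) because fewer than 4 prices exist yet
# day 4: (10 + 12 + 11 + 13) / 4 = 11.5
# day 5: (12 + 11 + 13 + 15) / 4 = 12.75
# day 6: (11 + 13 + 15 + 14) / 4 = 13.25
```

A larger window gives a smoother line that reacts more slowly to price changes.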

# Questions
Add your code to the code cells to answer the questions.

* USD = United States Dollar
* GBP = Great Britain Pound Sterling
* EUR = Euro

#### Use the `yfinance` library to collect data on the value of GBP in USD for the whole of 2021, with the interval set to 1 day. Plot this on an OHLC chart with `mplfinance` and add a moving average line with a value of 4.

In [None]:
# your answer in this cell


#### With `pandas`, create two dataframes. The first should hold the value of bitcoin in USD for the whole of January 2022, with the interval set to 1d. The second should hold the data for ethereum over the same time period.

In [None]:
# your answer in this cell


#### Use pandas to describe both dataframes.

In [None]:
# your answer in this cell


#### Create a new dataframe using the close value of bitcoin in USD and the close value of ethereum in USD using the two dataframes from the previous question, call this dataframe `bit_eth`, name the columns `bit_close_jan22` and `eth_close_jan22`.

In [None]:
# your answer in this cell


#### Show the correlation between the columns of data. Is there a positive or a negative correlation between these values?

In [None]:
# your answer in this cell


#### Create two new dataframes, one for the value of GBP in USD and one for EUR in USD. Gather all of the data for 2021, with the interval set to one day.

In [None]:
# your answer in this cell


#### Create an OHLC plot for each dataframe.

In [None]:
# your answer in this cell


#### What was the correlation between the value of GBP and EUR, for the data we have from 2021?

In [None]:
# your answer in this cell
