## Correlation

Correlation measures the strength and direction of the relationship between two variables, which can be positive, negative, or neutral (no relationship). For example, height and weight are often positively correlated: taller people tend to weigh more.

### Instructor Demo: Correlation

This program reads the CSV datasets of ice cream sales and drowning incidents, combines them into a single DataFrame, creates a line plot and a scatterplot, and calculates the correlation between the two variables.

In [None]:
# Import libraries and dependencies
import pandas as pd
import seaborn as sns

# Render plots inline in the notebook
%matplotlib inline

### Read CSV in as DataFrame

In [None]:
# Read the ice cream sales data, set the `Month` as the index

ice_cream = pd.read_csv("ice_cream.csv", index_col="Month")
ice_cream.head()

In [None]:
# Read the drowning incident data, set the `Month` as the index

drowning = pd.read_csv('drowning.csv', index_col="Month")
drowning.head()

### Combine the DataFrames

In [None]:
# Use the `concat` function to combine the two DataFrames column-wise, keeping only the index values (`Month`) present in both
combined_df = pd.concat([ice_cream, drowning], axis="columns", join="inner")
combined_df
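To make the effect of `join="inner"` concrete, here is a minimal sketch on two tiny DataFrames. The values and the names `sales`, `incidents`, `inner`, and `outer` are invented for illustration and are not taken from `ice_cream.csv` or `drowning.csv`.

```python
import pandas as pd

# Hypothetical miniature datasets (values are invented for illustration only)
sales = pd.DataFrame(
    {"Ice Cream Sales": [100, 150, 200]},
    index=pd.Index(["Jan", "Feb", "Mar"], name="Month"),
)
incidents = pd.DataFrame(
    {"Drowning Incidents": [2, 3]},
    index=pd.Index(["Feb", "Mar"], name="Month"),
)

# join="inner" keeps only the index labels present in both DataFrames
inner = pd.concat([sales, incidents], axis="columns", join="inner")

# join="outer" (the default) keeps every label and fills the gaps with NaN
outer = pd.concat([sales, incidents], axis="columns", join="outer")

print(inner)
print(outer)
```

In this sketch, `Jan` appears only in `sales`, so the inner join drops it while the outer join keeps it with a missing value for `Drowning Incidents`.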

### Plot Trends

In [None]:
# Plot the data trends of the two variables over time
combined_df.plot()



In the line plot, it is difficult to see how ice cream sales and drowning incidents relate to one another. A scatterplot, with the x and y axes set to the corresponding DataFrame columns, makes the relationship more apparent.

### Plot Relationships

In [None]:
# Plot the relationship between the two variables
combined_df.plot(kind='scatter', x='Ice Cream Sales', y='Drowning Incidents')

### Calculate Correlation

Use the `corr` function to calculate and output a matrix of correlation values for every column-to-column pair in a DataFrame. Correlation values range from -1 to +1.

-1 indicates a perfect negative relationship: the variables move inversely to one another.

0 indicates a neutral relationship: the variables have no linear relationship.

+1 indicates a perfect positive relationship: the variables move in tandem with one another.
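The scale above can be checked by hand. The sketch below assumes the Pearson coefficient, which is what pandas computes by default, and uses invented values; the helper `pearson` and the column names `x` and `y` are hypothetical and not part of the demo datasets.

```python
import numpy as np
import pandas as pd

# Hypothetical values, invented purely for illustration
df = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0, 5.0],
    "y": [2.0, 4.0, 5.0, 4.0, 5.0],
})


def pearson(a: pd.Series, b: pd.Series) -> float:
    """Pearson correlation: co-movement scaled by the spread of each series."""
    a_dev = a - a.mean()
    b_dev = b - b.mean()
    return float(
        (a_dev * b_dev).sum() / np.sqrt((a_dev ** 2).sum() * (b_dev ** 2).sum())
    )


manual = pearson(df["x"], df["y"])
builtin = df["x"].corr(df["y"])  # pandas uses the Pearson method by default
print(manual, builtin)

# Exact linear relationships sit at the two ends of the scale
perfect_pos = pearson(df["x"], 2 * df["x"] + 1)
perfect_neg = pearson(df["x"], -df["x"])
print(perfect_pos, perfect_neg)
```

The hand-rolled value and the pandas value should agree up to floating-point error, and the two exact linear cases should land at +1 and -1.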



In [None]:
# Calculate the correlation between each column
correlation = combined_df.corr()
correlation

### Plot Correlations

The `heatmap` function from the Seaborn library color-codes the values in a correlation table. This is particularly useful when the table contains many variables.



In [None]:
# Use the `heatmap` function from the Seaborn library to visualize correlations
sns.heatmap(correlation, vmin=-1, vmax=1)

Remember that correlation does not imply causation!

Although Ice Cream Sales has a positive correlation of 0.819404 with Drowning Incidents, this does not mean that buying more ice cream causes people to drown; it simply means that the two sets of numbers tend to rise and fall together.

Chances are there is another factor at play that results in this positive correlation. One possible factor is that as temperature increases (during the summer months), people tend to both eat more ice cream and go swimming.
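A small simulation can illustrate how a shared driver produces this kind of correlation. Everything below is an assumption made for illustration: the data is randomly generated, the coefficients are arbitrary, and `remove_linear_effect` is a hypothetical helper. Neither simulated series depends on the other; both depend only on temperature.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)
n = 500

# Simulated data: every number here is invented for illustration
temperature = rng.uniform(0, 35, size=n)
sim = pd.DataFrame({
    "Temperature": temperature,
    # Both series are driven by temperature plus independent noise
    "Ice Cream Sales": 20 * temperature + rng.normal(0, 60, size=n),
    "Drowning Incidents": 0.3 * temperature + rng.normal(0, 1.5, size=n),
})

raw_corr = sim["Ice Cream Sales"].corr(sim["Drowning Incidents"])


def remove_linear_effect(y: pd.Series, x: pd.Series) -> pd.Series:
    """Subtract the best-fit straight line of y on x, leaving the residuals."""
    slope, intercept = np.polyfit(x, y, deg=1)
    return y - (slope * x + intercept)


# Correlate what is left of each series once temperature is accounted for
sales_resid = remove_linear_effect(sim["Ice Cream Sales"], sim["Temperature"])
drowning_resid = remove_linear_effect(sim["Drowning Incidents"], sim["Temperature"])
adjusted_corr = sales_resid.corr(drowning_resid)

print(f"raw correlation:      {raw_corr:.2f}")
print(f"after removing temp.: {adjusted_corr:.2f}")
```

In this simulated setup, the raw correlation is strongly positive, while the correlation of the residuals is close to zero, which is consistent with temperature being the only link between the two series.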

Multiple regression analysis is a method we'll learn in a later unit that can measure multiple relationships at the same time (e.g., the effect of both weather and income on ice cream sales). This does not solve our problem of confusing correlation with causation, but it will help us better tease out economic relationships from multiple influences.

How do these concepts apply to stock investments?

Investigating the correlations of returns among stocks in a portfolio can help analysts properly diversify their portfolios and mitigate risk and volatility.

Uncorrelated stocks in a portfolio tend to dampen large swings in the portfolio's value: one stock may rise in price while another falls, rather than all stocks rising or all stocks falling together.
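The diversification effect can be sketched with simulated returns. The tickers `A`, `B`, and `C`, the 1% daily volatility, and the 0.95 target correlation are all assumptions chosen for illustration, not market data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=1)
n_days = 1000
vol = 0.01   # assumed daily volatility for every simulated stock
rho = 0.95   # assumed correlation between stocks A and C

# Simulated daily returns (invented): A and B are independent of each other
returns = pd.DataFrame({
    "A": rng.normal(0, vol, n_days),
    "B": rng.normal(0, vol, n_days),
})
# C is built to move closely with A while keeping the same volatility
returns["C"] = rho * returns["A"] + np.sqrt(1 - rho ** 2) * rng.normal(0, vol, n_days)

print(returns.corr())

# Equal-weight portfolios: the daily portfolio return is the mean across holdings
diversified_vol = returns[["A", "B"]].mean(axis="columns").std()
concentrated_vol = returns[["A", "C"]].mean(axis="columns").std()

print(f"A + B (uncorrelated) volatility: {diversified_vol:.4f}")
print(f"A + C (correlated) volatility:   {concentrated_vol:.4f}")
```

Although each simulated stock has the same volatility on its own, the portfolio of uncorrelated stocks should show noticeably lower volatility than the portfolio of highly correlated stocks.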