# Field Relationships and Correlations


# Lesson Goals

In this lesson we will learn how to find relationships between columns in a Pandas DataFrame.

# Introduction

In the pre-work module we learned about correlation, particularly linear correlation. Pandas allows us to compute the correlation between two columns as well as a full correlation matrix. Beyond linear correlation, we can also explore other types of relationships between columns.

# Correlation Between Two Columns

In order to compute the linear correlation between two columns, we use the corr function. We apply it to one column and pass the other column as an argument to the function.

Recall our vehicles dataset. We would like to compute the correlation between city MPG and highway MPG.

In [1]:
import numpy as np
import pandas as pd

vehicles = pd.read_csv('data/vehicles.csv')
vehicles['City MPG'].corr(vehicles['Highway MPG'])

0.9238555885288405

Recall that a correlation close to 1 indicates a strong positive linear relationship between two variables. Here the correlation is above 0.92, which indicates a strong positive linear relationship between city and highway MPG.
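To make the number returned by corr less of a black box, here is a minimal sketch that rebuilds the Pearson coefficient from its definition (covariance divided by the product of the two standard deviations) and compares it with the pandas result. The values below are made-up toy data, not rows from the vehicles dataset.

```python
import pandas as pd

# Hypothetical toy data, not taken from the vehicles dataset
city = pd.Series([18, 20, 22, 25, 30], name="City MPG")
highway = pd.Series([25, 27, 31, 33, 40], name="Highway MPG")

# Pearson r = covariance(x, y) / (std(x) * std(y))
manual_r = city.cov(highway) / (city.std() * highway.std())

# The same quantity computed directly by pandas
pandas_r = city.corr(highway)

print(round(manual_r, 6), round(pandas_r, 6))
```

Both values should agree up to floating-point rounding, since cov and std use the same sample (n - 1) normalization by default.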


# Correlation Matrix

Instead of computing the correlation for each pair of columns separately, we can compute all pairwise correlations at once. Note that in our case not all correlations will be meaningful. For example, the year column contains numeric values, but it is ordinal. Therefore, correlating year with values like MPG or fuel cost per year has no real meaning.

In [2]:
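# numeric_only=True (pandas 1.5+) skips any non-numeric columns, which otherwise raise an error in pandas 2.0+
vehicles.corr(numeric_only=True)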

| | Year | Engine Displacement | Cylinders | Fuel Barrels/Year | City MPG | Highway MPG | Combined MPG | CO2 Emission Grams/Mile | Fuel Cost/Year |
|---|---|---|---|---|---|---|---|---|---|
| Year | 1.0 | 0.037876 | 0.082469 | -0.221084 | 0.161818 | 0.267259 | 0.204751 | -0.2223 | -0.091913 |
| Engine Displacement | 0.037876 | 1.0 | 0.901858 | 0.789752 | -0.740317 | -0.715039 | -0.746782 | 0.80352 | 0.769678 |
| Cylinders | 0.082469 | 0.901858 | 1.0 | 0.739517 | -0.703866 | -0.650287 | -0.698648 | 0.752393 | 0.778153 |
| Fuel Barrels/Year | -0.221084 | 0.789752 | 0.739517 | 1.0 | -0.877752 | -0.909664 | -0.909743 | 0.986189 | 0.916208 |
| City MPG | 0.161818 | -0.740317 | -0.703866 | -0.877752 | 1.0 | 0.923856 | 0.985457 | -0.894139 | -0.858645 |
| Highway MPG | 0.267259 | -0.715039 | -0.650287 | -0.909664 | 0.923856 | 1.0 | 0.969392 | -0.926405 | -0.851404 |
| Combined MPG | 0.204751 | -0.746782 | -0.698648 | -0.909743 | 0.985457 | 0.969392 | 1.0 | -0.926229 | -0.875185 |
| CO2 Emission Grams/Mile | -0.2223 | 0.80352 | 0.752393 | 0.986189 | -0.894139 | -0.926405 | -0.926229 | 1.0 | 0.930865 |
| Fuel Cost/Year | -0.091913 | 0.769678 | 0.778153 | 0.916208 | -0.858645 | -0.851404 | -0.875185 | 0.930865 | 1.0 |


The diagonal of a correlation matrix is always 1 since each variable is perfectly correlated with itself. We can see some interesting relationships. For example, engine displacement has a correlation above 0.90 with the number of cylinders. Fuel cost per year has a strong negative correlation (about -0.88) with combined MPG. The three MPG measures (city, highway, and combined) are all strongly correlated with each other.
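Scanning a large matrix by eye gets tedious, so one option is to flatten it into a ranked list of column pairs. The following is a rough sketch on a small made-up DataFrame; the column names and values are hypothetical stand-ins, and the same steps should carry over to the matrix computed from the vehicles data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for a few numeric vehicle columns
df = pd.DataFrame({
    "displacement": [1.6, 2.0, 2.5, 3.5, 5.0],
    "cylinders": [4, 4, 4, 6, 8],
    "combined_mpg": [32, 29, 27, 22, 17],
})

corr = df.corr()

# The matrix is symmetric, so keep only the cells above the diagonal
above_diagonal = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(above_diagonal).stack().dropna()

# Sort the pairs by absolute correlation, strongest first
ranked = pairs.sort_values(key=lambda s: s.abs(), ascending=False)
print(ranked)
```

Sorting by absolute value keeps strong negative relationships near the top of the list alongside strong positive ones.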



# Other Types of Correlation

The pandas .corr function supports three types of correlation: Pearson, Spearman, and Kendall's Tau.

The one we have used so far is Pearson correlation, which is the default method and measures linear correlation. However, we can examine the two other types of correlation as well.



## Spearman Correlation

Spearman correlation is a non-parametric measure of correlation. It captures monotonic relationships between variables, which are not necessarily linear: we only look at whether the two variables tend to move in the same direction or in opposite directions.

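We can compute the Spearman correlation by ranking the data in each column (for example, from largest to smallest, using the same direction for both columns) and then computing the Pearson correlation of those ranks. When two variables have a monotonic but non-linear relationship, their ranks still have a linear relationship. Therefore, with non-linear data it makes sense to perform this transformation.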
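The rank-then-Pearson recipe can be checked directly. The sketch below uses a small made-up DataFrame (hypothetical columns x and y, where y is the cube of x) to compare the built-in Spearman result with the Pearson correlation of the ranks.

```python
import pandas as pd

# Hypothetical toy data: y always grows with x, but not along a straight line
df = pd.DataFrame({"x": [1, 2, 3, 4, 5, 6]})
df["y"] = df["x"] ** 3

pearson = df.corr(method="pearson").loc["x", "y"]
spearman = df.corr(method="spearman").loc["x", "y"]

# Spearman by hand: rank each column, then take the Pearson correlation of the ranks
ranks_pearson = df.rank().corr(method="pearson").loc["x", "y"]

print(pearson, spearman, ranks_pearson)
```

Because y increases whenever x does, the Spearman value and the Pearson correlation of the ranks should both come out as 1 (up to rounding), while the plain Pearson correlation stays below 1 since the relationship is curved.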

Here is an example of Spearman correlation in our vehicles dataset:

In [3]:
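# numeric_only=True (pandas 1.5+) skips any non-numeric columns, which otherwise raise an error in pandas 2.0+
vehicles.corr(method='spearman', numeric_only=True)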

| | Year | Engine Displacement | Cylinders | Fuel Barrels/Year | City MPG | Highway MPG | Combined MPG | CO2 Emission Grams/Mile | Fuel Cost/Year |
|---|---|---|---|---|---|---|---|---|---|
| Year | 1.0 | 0.05137 | 0.068727 | -0.214857 | 0.157137 | 0.266934 | 0.20508 | -0.215108 | -0.091437 |
| Engine Displacement | 0.05137 | 1.0 | 0.927979 | 0.827152 | -0.848167 | -0.75408 | -0.824065 | 0.831333 | 0.794755 |
| Cylinders | 0.068727 | 0.927979 | 1.0 | 0.784595 | -0.818672 | -0.698356 | -0.783362 | 0.788777 | 0.790481 |
| Fuel Barrels/Year | -0.214857 | 0.827152 | 0.784595 | 1.0 | -0.974144 | -0.963335 | -0.990364 | 0.995539 | 0.919069 |
| City MPG | 0.157137 | -0.848167 | -0.818672 | -0.974144 | 1.0 | 0.93012 | 0.985062 | -0.979787 | -0.928713 |
| Highway MPG | 0.266934 | -0.75408 | -0.698356 | -0.963335 | 0.93012 | 1.0 | 0.970769 | -0.968693 | -0.876067 |
| Combined MPG | 0.20508 | -0.824065 | -0.783362 | -0.990364 | 0.985062 | 0.970769 | 1.0 | -0.995258 | -0.926078 |
| CO2 Emission Grams/Mile | -0.215108 | 0.831333 | 0.788777 | 0.995539 | -0.979787 | -0.968693 | -0.995258 | 1.0 | 0.922723 |
| Fuel Cost/Year | -0.091437 | 0.794755 | 0.790481 | 0.919069 | -0.928713 | -0.876067 | -0.926078 | 0.922723 | 1.0 |


## Kendall's Tau

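Kendall's Tau is also a non-parametric measure of correlation. It compares pairs of observations and counts how many pairs are ordered the same way in both variables (concordant) versus the opposite way (discordant). It is less commonly used than the Spearman correlation coefficient.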
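As a rough illustration of what the coefficient measures, the sketch below counts pairs of observations that are ordered the same way in both variables (concordant) and pairs that are ordered the opposite way (discordant). The data is a made-up toy sample with no tied values; with ties, the tau-b variant applies a correction that this simple version omits.

```python
from itertools import combinations

# Hypothetical toy data with no tied values
city = [18, 20, 22, 25, 30]
highway = [27, 25, 31, 33, 40]

concordant = 0
discordant = 0
for (x1, y1), (x2, y2) in combinations(zip(city, highway), 2):
    direction = (x2 - x1) * (y2 - y1)
    if direction > 0:
        concordant += 1   # both variables move the same way
    elif direction < 0:
        discordant += 1   # the variables move in opposite ways

n_pairs = len(city) * (len(city) - 1) // 2
tau = (concordant - discordant) / n_pairs
print(concordant, discordant, tau)
```

Here nine of the ten pairs are concordant and one is discordant, giving a tau of 0.8.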

Here is an example of computing Kendall's Tau between city and highway MPG:

In [4]:
vehicles['City MPG'].corr(vehicles['Highway MPG'], method="kendall")

0.8171408108342495