# Variable Correlation
We can measure how strongly variables are correlated with each other.
### Calculate Correlation Coefficients
To determine how strongly two variables are correlated, we can calculate their correlation coefficient.

#### Correlation Coefficients Range From -1 to 1:  
1. If two variables have a negative coefficient, then they're negatively correlated.
    - If two variables have a coefficient of -1, then they're perfectly negatively correlated (perfectly collinear).
2. If two variables have a positive coefficient, then they're positively correlated.
    - If two variables have a coefficient of 1, then they're perfectly positively correlated (perfectly collinear).
3. If two variables have a coefficient of 0, then they have no linear correlation.

We can create a correlation matrix to help us observe the correlation between variables.
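
As a minimal sketch of the three cases above, the following builds a correlation matrix from a small made-up data set. The column names and values (`x`, `double_x`, `neg_x`, `v_shape`) are hypothetical and chosen only so that the coefficients land on 1, -1, and 0.

```python
import pandas as pd

# Hypothetical toy data, chosen only to illustrate the three cases
toy_df = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0, 5.0],
    "double_x": [2.0, 4.0, 6.0, 8.0, 10.0],   # rises exactly in step with x
    "neg_x": [-1.0, -2.0, -3.0, -4.0, -5.0],  # falls exactly as x rises
    "v_shape": [2.0, 1.0, 0.0, 1.0, 2.0],     # symmetric around the middle of x
})

# Pearson correlation coefficient for every pair of columns
toy_corr = toy_df.corr()
print(toy_corr.round(3))
```

Note that `v_shape` clearly depends on `x`, yet its coefficient with `x` is 0: the coefficient only measures linear correlation.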

### Problem With Perfect Multicollinearity
Perfect multicollinearity distorts the coefficient estimates of a machine learning model. This is because there is an infinite number of equally "good" solutions, so there's no way to tell which solution is actually best for the model.
- For instance, in an OLS regression model, infinitely many sets of coefficients minimize the least-squares error equally well, so the algorithm has no single solution to pick
- The goal is to get a unique, best solution, not an infinite number of solutions!
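
The "equally good solutions" problem can be sketched with NumPy. The data below is hypothetical: `x2` is an exact multiple of `x1`, so the two predictors are perfectly collinear, and two different coefficient vectors reproduce the target with the same (zero) squared error.

```python
import numpy as np

# Hypothetical data: x2 is an exact multiple of x1 (perfectly collinear)
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
x2 = 2.0 * x1
X = np.column_stack([x1, x2])
y = 3.0 * x1  # the target depends on a single underlying signal

# X^T X has rank 1 instead of 2, so the normal equations
# of least squares have no unique solution
rank = np.linalg.matrix_rank(X.T @ X)

# Two different coefficient vectors...
beta_a = np.array([3.0, 0.0])  # all weight on x1
beta_b = np.array([1.0, 1.0])  # 1*x1 + 1*x2 = 3*x1

# ...that both reproduce y with zero squared error
sse_a = np.sum((y - X @ beta_a) ** 2)
sse_b = np.sum((y - X @ beta_b) ** 2)
print(rank, sse_a, sse_b)
```

Any vector of the form `[3 - 2t, t]` fits equally well here, which is why the least-squares criterion alone cannot single out one answer.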

This leads to an unstable model with high variance in its predicted values. A small change in the data can produce a large change in the prediction.
- It becomes difficult to settle on a "precise" prediction since the estimates vary so much
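
A rough sketch of this instability follows. Because perfectly collinear columns cannot be fit uniquely at all, the sketch assumes two *nearly* collinear columns instead. All values are hypothetical and constructed so that a change of only 0.01 in the target swings the fitted coefficients and the prediction for a new point.

```python
import numpy as np

# Hypothetical data: x2 differs from x1 only by a tiny alternating offset
d = np.array([1.0, -1.0, 1.0, -1.0, 1.0])
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
x2 = x1 + 1e-4 * d
X = np.column_stack([x1, x2])

y_base = x1.copy()           # original target
y_shifted = x1 + 0.01 * d    # same target, nudged by at most 0.01

# Ordinary least squares (no intercept) on each version of the target
beta_base, *_ = np.linalg.lstsq(X, y_base, rcond=None)
beta_shifted, *_ = np.linalg.lstsq(X, y_shifted, rcond=None)

# Predict for a new point that does not follow the x1 ~ x2 pattern
x_new = np.array([1.0, 2.0])
pred_base = x_new @ beta_base
pred_shifted = x_new @ beta_shifted

print(beta_base, beta_shifted)
print(pred_base, pred_shifted)
```

The coefficients move from roughly `[1, 0]` to roughly `[-99, 100]`, and the prediction for `x_new` moves from about 1 to about 101, even though the training targets barely changed.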

In [6]:
# import libraries
import pandas as pd

In [7]:
# read the customers csv as a data set
customers_df = pd.read_csv("datasets/customers.csv")

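# fill NaN in each numeric column with that column's mean
# (numeric_only=True skips the text columns, which have no mean)
customers_df = customers_df.fillna(customers_df.mean(numeric_only=True))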

customers_df.head()

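|   | Country | Age | Salary | Purchased |
|---|---------|-----|--------|-----------|
| 0 | France | 44.0 | 72000.0 | No |
| 1 | Spain | 27.0 | 48000.0 | Yes |
| 2 | Germany | 30.0 | 54000.0 | No |
| 3 | Spain | 38.0 | 61000.0 | No |
| 4 | Germany | 40.0 | 63777.777778 | Yes |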


In [8]:
# return a correlation matrix of the numeric columns
# (numeric_only requires pandas 1.5 or newer)
customers_df.corr(numeric_only=True)

|        | Age | Salary |
|--------|-----|--------|
| Age | 1.0 | 0.912577 |
| Salary | 0.912577 | 1.0 |


# Correlation Matrix Conclusion
### Disclaimer About Categorical Variables
Because the "Country" and "Purchased" columns are categorical variables, they are left out of the correlation matrix, which only covers numeric columns.
- We could create dummy variables for those columns, but that would overcomplicate the explanation
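
For readers who want to try the dummy-variable route anyway, here is a minimal sketch using `pd.get_dummies`. It re-types the five rows shown by `customers_df.head()` so that it runs on its own; the names `sample_df`, `encoded_df`, and `sample_corr` are hypothetical, and the coefficients it produces describe only these five rows, not the full data set.

```python
import pandas as pd

# The five rows shown by customers_df.head(), re-typed so the sketch is self-contained
sample_df = pd.DataFrame({
    "Country": ["France", "Spain", "Germany", "Spain", "Germany"],
    "Age": [44.0, 27.0, 30.0, 38.0, 40.0],
    "Salary": [72000.0, 48000.0, 54000.0, 61000.0, 63777.777778],
    "Purchased": ["No", "Yes", "No", "No", "Yes"],
})

# One 0/1 indicator column per category. drop_first=True drops one category
# per variable, since a full set of indicators always sums to 1 and is
# therefore perfectly collinear with an intercept term.
encoded_df = pd.get_dummies(
    sample_df,
    columns=["Country", "Purchased"],
    drop_first=True,
    dtype=float,
)

# The categorical information now takes part in the correlation matrix
sample_corr = encoded_df.corr()
print(sample_corr.round(3))
```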

### Explanation of Correlation Matrix
The "Age" and "Salary" columns have a correlation coefficient of 0.912577.
- This indicates a strong positive correlation

Should we remove the Age or Salary column to prevent redundancy?

I recommend not removing either one. High correlation only becomes a severe problem when perfect multicollinearity exists, that is, when the correlation coefficient equals -1 or 1.
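
One way to apply this rule is to scan the correlation matrix for pairs whose coefficient is exactly -1 or 1. The helper below is a hypothetical sketch, not part of pandas: the name `perfectly_correlated_pairs`, the tolerance `tol`, and the redundant `AgeInMonths` column are all assumptions made for illustration. It also relies on `numeric_only=True`, which assumes pandas 1.5 or newer.

```python
import numpy as np
import pandas as pd


def perfectly_correlated_pairs(df, tol=1e-9):
    """Return (col_a, col_b, r) for numeric column pairs with |r| within tol of 1."""
    corr = df.corr(numeric_only=True)
    cols = list(corr.columns)
    pairs = []
    for i, col_a in enumerate(cols):
        for col_b in cols[i + 1:]:
            r = corr.loc[col_a, col_b]
            if np.isclose(abs(r), 1.0, atol=tol, rtol=0.0):
                pairs.append((col_a, col_b, float(r)))
    return pairs


# Age and Salary values from the first five rows of the customers data
demo_df = pd.DataFrame({
    "Age": [44.0, 27.0, 30.0, 38.0, 40.0],
    "Salary": [72000.0, 48000.0, 54000.0, 61000.0, 63777.777778],
})
# Hypothetical redundant column: the same information in different units
demo_df["AgeInMonths"] = demo_df["Age"] * 12

pairs = perfectly_correlated_pairs(demo_df)
print(pairs)
```

Only the `Age` and `AgeInMonths` pair is flagged. `Age` and `Salary` are highly correlated but not perfectly, so under the recommendation above both would be kept.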