# Variable Correlation
We can measure how strongly variables are correlated with each other.

If one independent variable is highly correlated with another independent variable, we can use it to "predict" the other one. The second variable then adds little new information, which causes redundancy in the machine learning model.

This is a problem because many models, linear regression in particular, work best when each independent variable contributes information the others do not. High correlation between them makes it hard to separate each variable's individual effect when fitting the model.
- This implies that independent variables with low correlation to each other are better for the model
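
To make the redundancy idea concrete, here is a minimal sketch on synthetic data (the numbers, the names `x1` and `x2`, and the noise level are all assumptions for illustration, not part of the customers data set). It fits a straight line that "predicts" one variable from the other and reports how much of its variance that line explains.

```python
import numpy as np

# Hypothetical, synthetic data: x2 is almost a linear function of x1.
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = 3.0 * x1 + 1.0 + rng.normal(scale=0.1, size=200)

# Fit a straight line that "predicts" x2 from x1.
slope, intercept = np.polyfit(x1, x2, deg=1)
x2_predicted = slope * x1 + intercept

# Pearson correlation coefficient between the two variables.
r = np.corrcoef(x1, x2)[0, 1]

# Share of x2's variance explained by x1 alone
# (for a straight-line fit with an intercept this equals r squared).
residual_ss = np.sum((x2 - x2_predicted) ** 2)
total_ss = np.sum((x2 - x2.mean()) ** 2)
r_squared = 1.0 - residual_ss / total_ss

print(f"correlation: {r:.4f}")
print(f"variance of x2 explained by x1: {r_squared:.4f}")
```

When the explained share is close to 1, `x2` carries almost no information that `x1` does not already provide.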

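Typically, high correlation between independent variables (multicollinearity) makes a model's coefficients unstable and hard to interpret, and the redundant features can contribute to overfitting.
- scikit-learn estimators do not remove highly correlated variables automatically; we have to detect and handle them ourselves, for example by dropping a column or using a regularized model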

### Calculate Correlation Coefficients
In order to determine how correlated variables are to each other, we can calculate their correlation coefficients.

#### Correlation Coefficients Range From -1 to 1:
1. If two variables have a negative coefficient, then they're negatively correlated.
    - If two variables have a coefficient of -1, then they're perfectly negatively correlated (perfectly collinear).
2. If two variables have a positive coefficient, then they're positively correlated.
    - If two variables have a coefficient of 1, then they're perfectly positively correlated (perfectly collinear).
3. If two variables have a coefficient of 0, then they have no linear correlation.
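
The three cases above can be sketched with a few made-up series (all values here are invented to hit the special cases; they are not taken from the customers data set). `Series.corr` computes the Pearson coefficient by default.

```python
import pandas as pd

# Made-up values chosen to hit the special cases.
x = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
perfect_positive = 2.0 * x + 10.0   # rises in lockstep with x
perfect_negative = -3.0 * x + 7.0   # falls in lockstep with x
no_linear = (x - 3.0) ** 2          # symmetric U-shape around the mean of x

r_positive = x.corr(perfect_positive)   # approximately 1
r_negative = x.corr(perfect_negative)   # approximately -1
r_none = x.corr(no_linear)              # approximately 0

print(r_positive, r_negative, r_none)
```

Note that the last series is fully determined by `x` yet still has a coefficient of about 0, which is why the third case says "no linear correlation" rather than "no relationship".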

We can create a correlation matrix to help us observe the correlation between variables.

In [6]:
# import libraries
import pandas as pd

In [5]:
# read the customers csv as a data set
customers_df = pd.read_csv("datasets/customers.csv")

customers_df.head()

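|   | Country | Age  | Salary  | Purchased |
|---|---------|------|---------|-----------|
| 0 | France  | 44.0 | 72000.0 | No        |
| 1 | Spain   | 27.0 | 48000.0 | Yes       |
| 2 | Germany | 30.0 | 54000.0 | No        |
| 3 | Spain   | 38.0 | 61000.0 | No        |
| 4 | Germany | 40.0 | NaN     | Yes       |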


In [11]:
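# return a correlation matrix of the numeric columns
# (pandas 2.0+ raises an error on text columns unless numeric_only=True)
customers_df.corr(numeric_only=True)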

|        | Age      | Salary   |
|--------|----------|----------|
| Age    | 1.0      | 0.982495 |
| Salary | 0.982495 | 1.0      |


# Correlation Matrix Conclusion
### Disclaimer About Categorical Variables
Because "Country" and "Purchased" are categorical columns, they do not appear in the correlation matrix, which covers numeric columns only.
- We could create dummy variables for those columns, but that would overcomplicate the explanation
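
For readers who want to see the dummy-variable route anyway, here is a minimal sketch. It assumes only the five rows displayed by `customers_df.head()`, retyped by hand into a hypothetical `sample_df`, so the resulting coefficients will not match the full data set.

```python
import numpy as np
import pandas as pd

# The five rows displayed by customers_df.head(), retyped by hand.
sample_df = pd.DataFrame({
    "Country": ["France", "Spain", "Germany", "Spain", "Germany"],
    "Age": [44.0, 27.0, 30.0, 38.0, 40.0],
    "Salary": [72000.0, 48000.0, 54000.0, 61000.0, np.nan],
    "Purchased": ["No", "Yes", "No", "No", "Yes"],
})

# One 0/1 column per category; dtype=float keeps every column numeric.
encoded_df = pd.get_dummies(
    sample_df, columns=["Country", "Purchased"], dtype=float
)

# Every column is numeric now, so all of them appear in the matrix.
corr_matrix = encoded_df.corr()
print(corr_matrix.round(3))
```

One thing to watch for: `Purchased_No` and `Purchased_Yes` are exact opposites, so their coefficient is -1. Passing `drop_first=True` to `pd.get_dummies` is a common way to avoid keeping both.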

### Explanation of Correlation Matrix
The "Age" and "Salary" columns have a correlation coefficient of 0.982495.
- Therefore, we can conclude that they have a strong positive correlation

Should we remove the Age or Salary column to prevent redundancy?

I recommend not removing either one. High correlation only becomes a severe problem under perfect multicollinearity: when the correlation coefficient equals -1 or 1.
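
A check for that severe case could look like the sketch below. The helper name `perfectly_collinear_pairs`, the tolerance, and the `AgeMonths` column are all hypothetical and added only for illustration; the `Age` and `Salary` values are the five rows shown by `customers_df.head()`. The `numeric_only` argument assumes pandas 1.5 or newer.

```python
import numpy as np
import pandas as pd

def perfectly_collinear_pairs(df, tol=1e-8):
    """Return (col_a, col_b, r) for numeric column pairs with |r| within tol of 1."""
    corr = df.corr(numeric_only=True)
    cols = list(corr.columns)
    pairs = []
    for i, col_a in enumerate(cols):
        for col_b in cols[i + 1:]:
            r = corr.loc[col_a, col_b]
            # NaN coefficients compare as "not close" and are skipped.
            if np.isclose(abs(r), 1.0, rtol=0.0, atol=tol):
                pairs.append((col_a, col_b, r))
    return pairs

# Age and Salary from the five displayed rows, plus a hypothetical
# AgeMonths column that is an exact multiple of Age.
demo_df = pd.DataFrame({
    "Age": [44.0, 27.0, 30.0, 38.0, 40.0],
    "Salary": [72000.0, 48000.0, 54000.0, 61000.0, np.nan],
})
demo_df["AgeMonths"] = demo_df["Age"] * 12

flagged = perfectly_collinear_pairs(demo_df)
print(flagged)
```

Only the `Age`/`AgeMonths` pair is flagged here. `Age` and `Salary` are strongly correlated but not perfectly so, which matches the recommendation to keep both.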