### Example Multiple Linear Regression 4.17

The following **Python** output displays the correlation matrix for the quantitative variables of the **Credit** data set.

In [1]:
import pandas as pd

# Load data
df = pd.read_csv('./data/Credit.csv')

# Drop the index column and all qualitative (categorical) columns
df = df.drop(['Unnamed: 0','Gender','Student','Married','Ethnicity'], 
             axis=1)

# Compute the correlation matrix using DataFrame.corr()
print(df.corr().round(4))

           Income   Limit  Rating   Cards     Age  Education  Balance
Income     1.0000  0.7921  0.7914 -0.0183  0.1753    -0.0277   0.4637
Limit      0.7921  1.0000  0.9969  0.0102  0.1009    -0.0235   0.8617
Rating     0.7914  0.9969  1.0000  0.0532  0.1032    -0.0301   0.8636
Cards     -0.0183  0.0102  0.0532  1.0000  0.0429    -0.0511   0.0865
Age        0.1753  0.1009  0.1032  0.0429  1.0000     0.0036   0.0018
Education -0.0277 -0.0235 -0.0301 -0.0511  0.0036     1.0000  -0.0081
Balance    0.4637  0.8617  0.8636  0.0865  0.0018    -0.0081   1.0000


From the **Python** output we read off that the correlation coefficient between **limit** and **age** is $ 0.101 $, which corresponds to a rather weak linear relationship. In contrast, the correlation between **limit** and **rating** is $ 0.997 $, which is very close to 1: these two predictors are almost perfectly collinear.
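For a larger correlation matrix it can be convenient to list all strongly correlated pairs automatically instead of reading them off by eye. The following is a minimal sketch that works on a hand-typed excerpt of the matrix above; the helper name `strong_pairs` and the threshold of 0.7 are illustrative assumptions and not part of the original example.

```python
import numpy as np

def strong_pairs(corr, names, threshold=0.7):
    """Return (name_i, name_j, r) for every pair with |r| >= threshold."""
    corr = np.asarray(corr, dtype=float)
    pairs = []
    # Loop over the upper triangle only, so each pair is reported once
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if abs(corr[i, j]) >= threshold:
                pairs.append((names[i], names[j], float(corr[i, j])))
    return pairs

# Excerpt of the correlation matrix printed above
names = ['Income', 'Limit', 'Rating', 'Age']
corr = [[1.0000, 0.7921, 0.7914, 0.1753],
        [0.7921, 1.0000, 0.9969, 0.1009],
        [0.7914, 0.9969, 1.0000, 0.1032],
        [0.1753, 0.1009, 0.1032, 1.0000]]

for a, b, r in strong_pairs(corr, names):
    print(f'{a:>7} - {b:<7} r = {r:.4f}')
```

With a `DataFrame`, the same function could be called as `strong_pairs(df.corr().to_numpy(), list(df.columns))`.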

### Example Multiple Linear Regression 4.18

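In the **Credit** data, a regression of **balance** on **age**, **rating**, and **limit** indicates that the predictors have VIF values of 1.01, 160.67, and 160.59, respectively. As a rule of thumb, a VIF value that exceeds 5 or 10 indicates a problematic amount of collinearity, so, as we suspected, there is considerable collinearity in the data.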
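
For reference, the variance inflation factor of the $ j $-th predictor is commonly defined as

$$
\mathrm{VIF}(\hat{\beta}_j) = \frac{1}{1 - R^2_{X_j \mid X_{-j}}},
$$

where $ R^2_{X_j \mid X_{-j}} $ denotes the $ R^2 $ obtained by regressing $ X_j $ onto all other predictors. This formula is not stated in the example itself; it is the standard definition and is included here only as a reading aid. Applied to **rating**, a VIF of 160.67 corresponds to $ R^2_{X_j \mid X_{-j}} = 1 - 1/160.67 \approx 0.994 $, that is, about 99.4% of the variation in **rating** is already explained by **age** and **limit**.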


In [2]:
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor
    
# Define the linear model:
x = pd.DataFrame({
    'Age' : df['Age'], 
    'Rating' : df['Rating'],
    'Limit' : df['Limit']})
y = df['Balance']

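# VIF analysis: add an intercept column, then compute the VIF of each
# predictor (column 0 is the constant, so start at index 1)
x_c = sm.add_constant(x)
VIF = [variance_inflation_factor(x_c.to_numpy(), i)
       for i in range(1, x_c.shape[1])]

print(list(x.columns), '\n', np.round(VIF, 3))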

# R-squared for the full model (all three predictors)
x_sm = sm.add_constant(x)
model = sm.OLS(y, x_sm).fit()

print('\nRsquared for the complete model is given by:\n', 
      np.round(model.rsquared, 4))

['Age', 'Rating', 'Limit'] 
 [  1.011 160.668 160.593]

Rsquared for the complete model is given by:
 0.7536


### Example Multiple Linear Regression 4.19

When faced with the problem of collinearity, there are two simple solutions. The first is to drop one of the problematic variables from the regression. This can usually be done without much compromise to the regression fit, since the presence of collinearity implies that the information that this variable provides about the response is redundant in the presence of the other variables. 

For instance, if we regress **balance** onto **age** and **limit**, without the **rating** predictor, then the resulting VIF values are close to the minimum possible value of 1, and the $ R^2 $ drops only slightly, from 0.754 to 0.750.

In [3]:
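# Remove 'Rating' from the predictors
# (errors='ignore' keeps the cell re-runnable once the column is gone)
x = x.drop('Rating', axis=1, errors='ignore')

# Refit the model with the reduced set of predictors
x_sm = sm.add_constant(x)
model = sm.OLS(y, x_sm).fit()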

# Print result
print("\n Rsquared without 'Rating' is given by:\n", 
      np.round(model.rsquared, 4))


 Rsquared without 'Rating' is given by:
 0.7498


So dropping **rating** from the set of predictors has effectively solved the collinearity problem without compromising the fit.
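
The cell above reports only the $ R^2 $ of the reduced model, not its VIF values. The following minimal sketch illustrates the effect of dropping a nearly redundant predictor on synthetic data (not the Credit data). It computes the VIF directly from the standard definition $ 1/(1 - R_j^2) $, where $ R_j^2 $ comes from regressing predictor $ j $ on the remaining predictors. The helper names `r_squared` and `vif`, the seed, and the simulated variables are assumptions made purely for illustration.

```python
import numpy as np

def r_squared(X, y):
    """R^2 of an OLS fit of y on X (an intercept column is added here)."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)

def vif(X):
    """VIF_j = 1 / (1 - R_j^2), regressing column j on all other columns."""
    X = np.asarray(X, dtype=float)
    return np.array([
        1.0 / (1.0 - r_squared(np.delete(X, j, axis=1), X[:, j]))
        for j in range(X.shape[1])
    ])

# Synthetic data: x2 is almost a copy of x1, x0 is unrelated to both
rng = np.random.default_rng(0)
n = 400
x0 = rng.normal(size=n)
x1 = rng.normal(size=n)
x2 = x1 + 0.1 * rng.normal(size=n)
y = 0.5 * x0 + 2.0 * x1 + rng.normal(size=n)

X_full = np.column_stack([x0, x1, x2])
X_reduced = np.column_stack([x0, x1])   # nearly redundant x2 dropped

print('VIF (full):   ', np.round(vif(X_full), 2))
print('VIF (reduced):', np.round(vif(X_reduced), 2))
print('R^2 (full):   ', round(r_squared(X_full, y), 4))
print('R^2 (reduced):', round(r_squared(X_reduced, y), 4))
```

For the Credit data itself, the same check could be made by rerunning the `variance_inflation_factor` loop from the previous example on the reduced design matrix `x_sm`.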
