#### detect multi-collinearity

Here's how eigenvalues near zero indicate multicollinearity:

- `Small Eigenvalues`: Eigenvalues near zero indicate that the corresponding eigenvectors (principal components) explain very little variance in the data.

- `Multicollinearity`: A small eigenvalue means the data carry almost no independent information along the corresponding eigenvector. Because that eigenvector is a linear combination of the predictors, the predictors with large weights in it are nearly linearly dependent, i.e. highly correlated with one another.

- `Ill-Conditioned Matrix`: In numerical terms, a covariance (or correlation) matrix whose smallest eigenvalue is tiny relative to its largest is ill-conditioned: it is close to being singular (non-invertible). This means its columns are nearly linearly dependent, which is the defining characteristic of multicollinearity.

- `Identifying Multicollinearity`: By examining the magnitudes of the eigenvalues, you can identify multicollinearity. If one or more eigenvalues are much smaller than the others (i.e., close to zero), the predictors with large weights in the corresponding eigenvectors are highly collinear.
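
A minimal sketch of the points above, using synthetic data. The variable names, sample size and noise scale are assumptions chosen for illustration: one column is built as a near-exact linear combination of two others, and the smallest eigenvalue of the correlation matrix drops close to zero.

```python
import numpy as np

rng = np.random.default_rng(0)

# Three independent predictors (illustrative synthetic data)
n = 200
a = rng.normal(size=n)
b = rng.normal(size=n)
c = rng.normal(size=n)

# A fourth predictor that is almost a linear combination of a and b
d = a + b + 0.01 * rng.normal(size=n)

X_indep = np.column_stack([a, b, c])
X_coll = np.column_stack([a, b, c, d])

# eigvalsh is intended for symmetric matrices; it returns real eigenvalues in ascending order
eig_indep = np.linalg.eigvalsh(np.corrcoef(X_indep, rowvar=False))
eig_coll = np.linalg.eigvalsh(np.corrcoef(X_coll, rowvar=False))

print("independent predictors:    ", eig_indep)
print("with near-collinear column:", eig_coll)
```

The eigenvalues of a correlation matrix always sum to the number of predictors, so a near-zero eigenvalue is necessarily offset by larger ones elsewhere in the spectrum.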

#### Practical Implications:

`Model Instability`: Multicollinearity can lead to instability in regression coefficient estimates, making them sensitive to small changes in the data.

`Inflated Standard Errors`: Multicollinearity inflates the standard errors of regression coefficients, reducing the precision of parameter estimates.

`Difficulty in Interpretation`: Multicollinearity makes it difficult to interpret the individual effects of predictors on the response variable because the effects of collinear predictors are confounded.

`Dimensionality Reduction`: In some cases, multicollinearity can be addressed by removing one or more highly correlated predictors or by performing dimensionality reduction techniques like principal component analysis (PCA).
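
The inflation of standard errors can be quantified with the variance inflation factor (VIF). This notebook does not compute it, so the sketch below is an illustrative addition. It relies on the fact that the VIFs equal the diagonal entries of the inverse correlation matrix. The helper name `vif_from_corr` and the synthetic data are assumptions.

```python
import numpy as np

def vif_from_corr(X):
    """Return one VIF per column: the diagonal of the inverse correlation matrix."""
    corr = np.corrcoef(X, rowvar=False)
    return np.diag(np.linalg.inv(corr))

rng = np.random.default_rng(1)
n = 200
x0 = rng.normal(size=n)
x1 = rng.normal(size=n)
x2 = 2 * x1 + 0.1 * rng.normal(size=n)   # nearly a multiple of x1

vifs = vif_from_corr(np.column_stack([x0, x1, x2]))
print(vifs)
```

A VIF close to 1 means a predictor is nearly uncorrelated with the rest. Values above roughly 5 to 10 are often treated as a warning sign, though that cutoff is a convention rather than a rule.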

In [34]:
from sklearn.datasets import load_breast_cancer

import pandas as pd
import numpy as np
np.set_printoptions(precision=3, suppress=True)

import matplotlib.pyplot as plt

In [35]:
# Load the breast cancer dataset
data = load_breast_cancer()
X = data.data

In [36]:
# Compute the correlation matrix
correlation_matrix = np.corrcoef(X, rowvar=False)

In [37]:
# Compute the eigenvalues of the correlation matrix
# (eigvalsh is meant for symmetric matrices and always returns real values)
eigenvalues = np.linalg.eigvalsh(correlation_matrix)

In [38]:
# Sort the eigenvalues in descending order
eigenvalues_sorted = np.sort(eigenvalues)[::-1]

In [39]:
# Print the eigenvalues
print("Eigenvalues of the correlation matrix:")
print(eigenvalues_sorted)

Eigenvalues of the correlation matrix:
[13.282  5.691  2.818  1.981  1.649  1.207  0.675  0.477  0.417  0.351
  0.294  0.261  0.241  0.157  0.094  0.08   0.059  0.053  0.049  0.031
  0.03   0.027  0.024  0.018  0.015  0.008  0.007  0.002  0.001  0.   ]


In [40]:
# Check whether any eigenvalue is numerically zero (i.e. exact linear dependence)
tolerance = 1e-10
multicollinear_indices = np.where(eigenvalues_sorted < tolerance)[0]

if len(multicollinear_indices) > 0:
    print("\nMulticollinearity detected in the dataset.")
    print("Indices of eigenvalues close to zero:", multicollinear_indices)
else:
    print("\nNo multicollinearity detected in the dataset.")


No multicollinearity detected in the dataset.
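
The absolute tolerance of `1e-10` only flags eigenvalues that are zero to numerical precision, which corresponds to exact linear dependence. The printed spectrum above runs from about 13.3 down to values that round to 0.001 and 0., which points to strong near-collinearity even though the check reports none. A relative measure is one way to capture this. The sketch below uses the condition index sqrt(λ_max / λ_min); the helper name `condition_index` is hypothetical and the cutoff of 30 is a commonly quoted rule of thumb, taken here as an assumption. Synthetic data stands in for the dataset so that the block runs on its own.

```python
import numpy as np

def condition_index(X):
    """sqrt(largest / smallest eigenvalue) of the correlation matrix of X."""
    eigenvalues = np.linalg.eigvalsh(np.corrcoef(X, rowvar=False))  # ascending order
    smallest = max(eigenvalues[0], np.finfo(float).eps)  # guard against 0 or tiny negatives
    return np.sqrt(eigenvalues[-1] / smallest)

rng = np.random.default_rng(2)
base = rng.normal(size=(100, 3))
near_copy = base[:, [0]] + 0.01 * rng.normal(size=(100, 1))

X_ok = base                           # independent columns
X_bad = np.hstack([base, near_copy])  # last column nearly duplicates the first

threshold = 30  # rule-of-thumb cutoff (assumption)
for name, data in [("independent", X_ok), ("near-collinear", X_bad)]:
    ci = condition_index(data)
    verdict = "multicollinearity suspected" if ci > threshold else "no strong multicollinearity"
    print(f"{name}: condition index = {ci:.1f} -> {verdict}")
```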


#### using dummy dataset

In [41]:
from numpy.linalg import inv

import scipy 
import scipy.linalg as la

import seaborn as sns

# import the ML algorithm
import statsmodels.formula.api as smf
from sklearn.linear_model import LinearRegression

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import RobustScaler

from sklearn.model_selection import train_test_split

from sklearn import metrics

In [42]:
# generate some random data
np.random.seed(seed=100)

X = np.random.randint(1,   50,  size=(10, 4))
y = np.random.uniform(100, 200, size=10)

In [43]:
np.corrcoef(X, rowvar=False)

array([[ 1.   , -0.237, -0.069, -0.16 ],
       [-0.237,  1.   , -0.044,  0.267],
       [-0.069, -0.044,  1.   , -0.322],
       [-0.16 ,  0.267, -0.322,  1.   ]])

In [44]:
# check the multi collinearity
corr = np.corrcoef(X, rowvar=False)

eigvals, eigvecs = la.eig(corr)

eigvals = eigvals.real

print(eigvals)
print(eigvecs)

[1.52  1.148 0.741 0.591]
[[ 0.39  -0.602  0.661  0.218]
 [-0.538  0.336  0.718 -0.286]
 [ 0.39   0.685  0.203  0.581]
 [-0.638 -0.234 -0.077  0.73 ]]


There appears to be __no multicollinearity__, as none of the eigenvalues is close to 0.

#### compute the beta coefficients

In [45]:
# calculate the coefficients
part1       = inv(np.dot(X.T, X))
part2       = np.dot(X.T, y)

beta_coeffs = np.dot(part1, part2)
beta_coeffs

array([1.977, 1.848, 1.018, 1.145])

In [46]:
# instantiate
linreg = LinearRegression(fit_intercept=False)

# fit the model to the training data (learn the coefficients)
linreg.fit(X, y)

# print the coefficients
print(linreg.intercept_)
print(linreg.coef_)

0.0
[1.977 1.848 1.018 1.145]
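
Forming an explicit inverse is not required to obtain the coefficients. As a hedged alternative (not the method used in this notebook), the normal equations XᵀXβ = Xᵀy can be passed to `np.linalg.solve`, or the least-squares problem can be solved directly with `np.linalg.lstsq`. The sketch regenerates the data with the same seed and calls as the cell above so that it runs on its own.

```python
import numpy as np

np.random.seed(seed=100)
X = np.random.randint(1, 50, size=(10, 4))
y = np.random.uniform(100, 200, size=10)

# Solve (X^T X) beta = X^T y without forming the inverse
beta_solve = np.linalg.solve(X.T @ X, X.T @ y)

# Solve the least-squares problem min ||X beta - y|| directly
beta_lstsq, residuals, rank, singular_values = np.linalg.lstsq(X, y, rcond=None)

print(beta_solve)
print(beta_lstsq)
print("rank of X:", rank)
```

Both routes give the same coefficients when X has full column rank. `lstsq` additionally reports the rank and singular values, which makes a rank deficiency visible instead of raising an error.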


In [47]:
df = pd.DataFrame(X, columns=['c0', 'c1', 'c2', 'c3'])

# white noise
noise = np.random.randn(10)

df['c4'] = 2 * df['c1'] + .5 * noise  + 3

X_new = df.values

In [48]:
# check the multi collinearity
corr = np.corrcoef(X_new, rowvar=False)

eigvals, eigvecs = la.eig(corr)

eigvals = eigvals.real

print(eigvals)
print(eigvecs)

[2.244 1.231 0.923 0.602 0.   ]
[[-0.274  0.192 -0.885 -0.323 -0.004]
 [ 0.635 -0.168 -0.251  0.058 -0.709]
 [-0.113 -0.772  0.093 -0.619 -0.002]
 [ 0.33   0.556  0.278 -0.711  0.007]
 [ 0.632 -0.176 -0.26   0.062  0.706]]


In [49]:
# For comparison, the eigenvalues and eigenvectors before adding c4:
# [1.52  1.148 0.741 0.591]
# [[ 0.39  -0.602  0.661  0.218]
#  [-0.538  0.336  0.718 -0.286]
#  [ 0.39   0.685  0.203  0.581]
#  [-0.638 -0.234 -0.077  0.73 ]]

In [50]:
ev0 = eigvecs[:,0].reshape(5,1)
ev1 = eigvecs[:,1].reshape(5,1)
ev2 = eigvecs[:,2].reshape(5,1)
ev3 = eigvecs[:,3].reshape(5,1)
ev4 = eigvecs[:,4].reshape(5,1)

print(ev4)

[[-0.004]
 [-0.709]
 [-0.002]
 [ 0.007]
 [ 0.706]]


#### Observation
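1. The last eigenvalue (index 4) is approximately 0; it prints as `0.` at three decimals.
2. Look at the eigenvector for that eigenvalue and find the entries that are clearly non-zero.
    - In this case the entries for c1 (-0.709) and c4 (0.706) dominate, so c1 and c4 are collinear.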
    
#### compute the beta coefficients

In [51]:
X_new = df.values

# calculate the coefficients
part1       = inv(np.dot(X_new.T, X_new))
part2       = np.dot(X_new.T, y)

beta_coeffs = np.dot(part1, part2)
beta_coeffs

array([ -0.149, -92.862,  -0.219,   0.605,  46.355])

In [52]:
# For comparison, the coefficients before adding c4:
# array([1.977, 1.848, 1.018, 1.145])

In [53]:
# instantiate
linreg = LinearRegression(fit_intercept=False)

# fit the model to the training data (learn the coefficients)
linreg.fit(X_new, y)

# print the coefficients
print(linreg.intercept_)
print(linreg.coef_)

0.0
[ -0.149 -92.862  -0.219   0.605  46.355]


Without multicollinearity, the coefficients were:

[1.977 1.848 1.018 1.145]

#### observation
1. The coefficients of the collinear pair c1 and c4 have become very large in magnitude (-92.862 and 46.355).
2. The signs of the c0, c1 and c2 coefficients have flipped.

#### Identifying Multicollinearity

**Small Eigenvalues:**

If an eigenvalue is near zero, the corresponding eigenvector points in a direction along which the data show almost no variability. In other words, the (standardized) data points lie almost entirely in the subspace orthogonal to that direction, so the predictors are nearly linearly dependent.

**Eigenvectors and Predictors:**

The eigenvector associated with a near-zero eigenvalue indicates a linear combination of the original predictor variables that has very little variance.
In other words, the predictors involved in this linear combination are highly collinear.

**Non-Zero Values in Eigenvectors**

- Identifying Collinear Variables:

    - The non-zero elements of the eigenvector indicate which predictor variables contribute to the linear combination with low variance.
    - Large (in magnitude) non-zero values in an eigenvector show which variables are most involved in the multicollinear relationship.
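
The procedure described above can be sketched as a small helper. The function name `collinear_variables` and both thresholds (`eig_tol`, `loading_tol`) are assumptions made for illustration and would need tuning on real data. The synthetic `c4` column mirrors the `2 * c1 + noise + 3` construction used earlier.

```python
import numpy as np

def collinear_variables(X, names, eig_tol=0.01, loading_tol=0.3):
    """For each near-zero eigenvalue, list the variables with large eigenvector weights."""
    corr = np.corrcoef(X, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(corr)  # columns of eigvecs are the eigenvectors
    groups = []
    for value, vector in zip(eigvals, eigvecs.T):
        if value < eig_tol:
            involved = [name for name, w in zip(names, vector) if abs(w) > loading_tol]
            groups.append((value, involved))
    return groups

rng = np.random.default_rng(3)
X = rng.normal(size=(50, 4))
c4 = 2 * X[:, 1] + 3 + 0.05 * rng.normal(size=50)  # nearly a linear function of c1
X_new = np.column_stack([X, c4])

groups = collinear_variables(X_new, ["c0", "c1", "c2", "c3", "c4"])
for value, involved in groups:
    print(f"eigenvalue {value:.4f}: {involved}")
```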


#### we will add more multi collinearity

In [54]:
df = pd.DataFrame(X, columns=['c0', 'c1', 'c2', 'c3'])

# white noise
noise = np.random.randn(10)

df['c4'] = 2 * df['c0'] + df['c3'] + .5 * noise 

X_new = df.values

In [55]:
# check the multi collinearity
corr = np.corrcoef(X_new, rowvar=False)

eigvals, eigvecs = la.eig(corr)

eigvals = eigvals.real

print(eigvals)
print(eigvecs)

[1.998 0.    1.512 0.859 0.631]
[[ 0.646 -0.656 -0.26  -0.208 -0.201]
 [-0.175 -0.     0.5   -0.684 -0.502]
 [-0.236  0.002 -0.466 -0.669  0.529]
 [ 0.144 -0.321  0.678 -0.014  0.645]
 [ 0.689  0.683  0.07  -0.204  0.108]]


In [56]:
# calculate the coefficients
part1       = inv(np.dot(X_new.T, X_new))
part2       = np.dot(X_new.T, y)

beta_coeffs = np.dot(part1, part2)
beta_coeffs

array([-2.357,  1.845,  1.027, -1.008,  2.159])

array([  0.451, -71.213,   0.352,   0.141,  35.854])

In [57]:
# instantiate
linreg = LinearRegression(fit_intercept=False)

# fit the model to the training data (learn the coefficients)
linreg.fit(X_new, y)

# print the coefficients
print(linreg.intercept_)
print(linreg.coef_)

0.0
[-2.357  1.845  1.027 -1.008  2.159]


[  0.451 -71.213   0.352   0.141  35.854]

#### we will add PERFECT (worst for the model) multicollinearity
- `X.T @ X` becomes singular, so the inverse does not exist and the beta coefficients cannot be computed with `inv`

In [30]:
df = pd.DataFrame(X, columns=['c0', 'c1', 'c2', 'c3'])

# white noise
noise = np.random.randn(10)

df['c4'] = df['c0'] * 2 

X_new = df.values

In [31]:
# check the multi collinearity
corr = np.corrcoef(X_new, rowvar=False)

eigvals, eigvecs = la.eig(corr)

eigvals = eigvals.real

print(eigvals)
print(eigvecs)

[2.177 0.    1.364 0.861 0.598]
[[-0.647  0.707  0.194 -0.202 -0.05 ]
 [ 0.317 -0.     0.289 -0.826  0.366]
 [-0.004 -0.    -0.666 -0.486 -0.566]
 [ 0.249  0.     0.63  -0.009 -0.736]
 [-0.647 -0.707  0.194 -0.202 -0.05 ]]


In [32]:
X_new

array([[ 9, 25,  4, 40, 18],
       [24, 16, 49, 11, 48],
       [31, 35,  3, 35, 62],
       [15, 35, 49, 25, 30],
       [16, 37, 44, 17, 32],
       [10, 30, 23,  3, 20],
       [28, 45,  5, 32, 56],
       [ 2, 14, 20, 37,  4],
       [ 5, 28,  4,  8, 10],
       [48,  2, 15,  8, 96]])

In [33]:
np.dot(X_new.T, X_new)

array([[ 5316,  4635,  3894,  3780, 10632],
       [ 4635,  8609,  5669,  6193,  9270],
       [ 3894,  5669,  7958,  3898,  7788],
       [ 3780,  6193,  3898,  6390,  7560],
       [10632,  9270,  7788,  7560, 21264]])

In [34]:
np.dot(X_new.T, y)

array([27361.457, 37925.345, 30730.86 , 30194.901, 54722.914])

In [35]:
inv(np.dot(X_new.T, X_new))

LinAlgError: Singular matrix

In [25]:
# calculate the coefficients
part1       = inv(np.dot(X_new.T, X_new))
part2       = np.dot(X_new.T, y)

beta_coeffs = np.dot(part1, part2)
beta_coeffs

LinAlgError: Singular matrix
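
With an exactly duplicated direction, XᵀX has no inverse, so `inv` raises and the normal equations have infinitely many solutions rather than a unique one. The sketch below shows one common workaround, which is an addition and not part of this notebook: `np.linalg.lstsq` returns the minimum-norm least-squares solution and reports the rank deficiency. The data are regenerated with the same seed and calls as the earlier cells so that the block runs on its own.

```python
import numpy as np

np.random.seed(seed=100)
X = np.random.randint(1, 50, size=(10, 4)).astype(float)
y = np.random.uniform(100, 200, size=10)

# Append a column that is exactly twice the first one (perfect collinearity)
X_new = np.column_stack([X, 2 * X[:, 0]])

# lstsq handles rank-deficient matrices and returns the minimum-norm solution
beta_min_norm, _, rank, _ = np.linalg.lstsq(X_new, y, rcond=None)
print("rank:", rank, "of", X_new.shape[1], "columns")
print("minimum-norm coefficients:", beta_min_norm)

# Fitted values match the 4-column model; only the split between c0 and c4 is arbitrary
beta_4, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(X_new @ beta_min_norm, X @ beta_4))
```

The predictions are unaffected by the redundant column, but the individual coefficients of c0 and c4 are no longer identifiable: any split that keeps `beta_c0 + 2 * beta_c4` fixed fits the data equally well.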