### Assumptions in Linear Regression

1. **Normally Distributed Residuals**
   - The residuals (observed values minus predicted values) should be approximately normally distributed. A histogram or Q-Q plot of the residuals is a quick visual check.

2. **Little to No Multicollinearity**
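   - Multiple regression assumes that the independent variables are not highly correlated with each other. This assumption can be checked with Variance Inflation Factor (VIF) values; a VIF above roughly 5 to 10 is a common rule-of-thumb warning sign. Centering the variables (subtracting the mean) reduces the multicollinearity introduced by interaction or polynomial terms, but it does not remove correlation between distinct predictors. In that case, drop or combine the correlated variables.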

3. **Homoscedasticity**
   - This assumption states that the variance of the error terms is roughly constant across the values of the independent variables. In a plot of standardized residuals versus predicted values, the points should form a band of even width across the whole range of predicted values, with no funnel or fan shape.
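
The VIF check from assumption 2 can be sketched with NumPy alone. This is a minimal illustration on made-up data, not a full diagnostic: the helper `vif`, the variables `x1`, `x2`, `x3`, and the seed are all hypothetical. It relies on the identity that the VIF of each predictor equals the corresponding diagonal entry of the inverse correlation matrix, and it also shows the one case where subtracting the mean helps (a variable and its own square).

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, chosen only for repeatability
n = 200

def vif(X):
    """Return the VIF of every column of X.

    Uses the identity VIF_j = [R^-1]_jj, where R is the correlation
    matrix of the predictors.
    """
    corr = np.corrcoef(X, rowvar=False)
    return np.diag(np.linalg.inv(corr))

# Case 1: x3 is almost a copy of x1, x2 is independent of both
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
x3 = x1 + 0.05 * rng.normal(size=n)
vifs = vif(np.column_stack([x1, x2, x3]))
print("VIF for x1, x2, x3:", vifs.round(2))

# Case 2: structural multicollinearity between x and x**2
x = rng.uniform(10, 20, size=n)
vif_raw = vif(np.column_stack([x, x ** 2]))

# Centering x before squaring removes most of that correlation
xc = x - x.mean()
vif_centered = vif(np.column_stack([xc, xc ** 2]))
print("VIF before centering:", vif_raw.round(2))
print("VIF after centering: ", vif_centered.round(2))
```

Centering would not lower the VIFs in case 1, because `x1` and `x3` are two distinct variables that carry the same information; there, one of them has to be dropped or the two combined.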

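The residual checks from assumptions 1 and 3 are usually done visually. As a rough numeric stand-in, the sketch below fits ordinary least squares on synthetic data with constant-variance noise, then prints histogram counts of the standardized residuals and their spread within each quartile of the predicted values. The data, the coefficients in `true_coef`, and the quartile-based spread ratio are assumptions made for this illustration, not a standard test.

```python
import numpy as np

rng = np.random.default_rng(42)  # fixed seed, chosen only for repeatability
n = 500

# Synthetic data: three features, hypothetical coefficients, constant-variance noise
X = rng.normal(size=(n, 3))
true_coef = np.array([2.0, -1.0, 0.5])
y = 1.0 + X @ true_coef + rng.normal(scale=0.5, size=n)

# Ordinary least squares with an intercept column
A = np.column_stack([np.ones(n), X])
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
predicted = A @ beta
residuals = y - predicted

# Standardize using the residual standard deviation (ddof = number of fitted parameters)
standardized = residuals / residuals.std(ddof=A.shape[1])

# Normality: the histogram counts should rise to a single central peak and fall off again
counts, edges = np.histogram(standardized, bins=10)
print("Histogram counts:", counts)

# Homoscedasticity: the residual spread should be similar in every quartile of predictions
order = np.argsort(predicted)
group_std = np.array([standardized[idx].std() for idx in np.array_split(order, 4)])
spread_ratio = group_std.max() / group_std.min()
print("Std per quartile of predicted values:", group_std.round(2))
print("Largest / smallest spread:", round(float(spread_ratio), 2))
```

A spread ratio far above 1, or a spread that grows steadily from the first quartile to the last, would correspond to the funnel shape in a residual plot.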
### Dummy Variable Trap

This occurs when one-hot encoding (for example, with `OneHotEncoder`) produces redundant columns. For example, if there are two cities, New York and California, then a single column `City_New_York` with values 0 or 1 is enough to preserve the information. If you create two columns, `City_New_York` and `City_California`, then both portray the same information, just with opposite values, so each column can be predicted exactly from the other. This introduces multicollinearity. When the model has many other unrelated features, it can still learn a lot from those. With fewer features, however, the model becomes unstable: a small change in the input data can change the fitted coefficients significantly.

The dummy variable trap can be avoided by dropping one dummy column from each encoded categorical variable.
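
A minimal sketch of the trap and its fix, assuming a hypothetical four-row DataFrame with a single `City` column. It uses `pandas.get_dummies`, whose `drop_first=True` option drops the first category of each encoded variable. Note that pandas builds the column names from the raw values, so the column is named `City_New York` here rather than `City_New_York`.

```python
import numpy as np
import pandas as pd

# Hypothetical data for illustration
df = pd.DataFrame({"City": ["New York", "California", "New York", "California"]})

# Full encoding: one column per city (the trap)
full = pd.get_dummies(df, columns=["City"]).astype(int)

# Reduced encoding: the first category is dropped
reduced = pd.get_dummies(df, columns=["City"], drop_first=True).astype(int)

print(full)
print(reduced)

# Add the intercept column that linear regression fits by default
design_full = np.column_stack([np.ones(len(df)), full.to_numpy()])
design_reduced = np.column_stack([np.ones(len(df)), reduced.to_numpy()])

# The full design has 3 columns but they are linearly dependent,
# because the two dummy columns always sum to the intercept column
rank_full = np.linalg.matrix_rank(design_full)
rank_reduced = np.linalg.matrix_rank(design_reduced)
print("Full design:    columns =", design_full.shape[1], " rank =", rank_full)
print("Reduced design: columns =", design_reduced.shape[1], " rank =", rank_reduced)
```

When encoding with scikit-learn instead, `OneHotEncoder(drop="first")` has the same effect.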





In [2]:
# Import necessary libraries
import numpy as np
import pandas as pd

# Import functions and classes from scikit-learn
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Create a synthetic regression dataset with 100 samples, 20 features, and some noise
x, y = make_regression(n_samples=100, n_features=20, noise=0.95)

# Convert the dataset into a pandas DataFrame for better visualization
df = pd.DataFrame(x)

# Display the first few rows of the DataFrame
df.head()


,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19
0,0.660655,0.466621,-0.272171,-1.040826,-0.245894,-0.419403,-0.277848,-0.573909,-0.828155,2.389983,0.711084,0.231527,-0.722313,0.547985,0.629112,0.689861,-0.833304,-0.299322,0.956913,0.609194
1,0.514763,0.196311,-0.305518,-0.665858,-0.433939,0.306565,-1.259641,-0.488029,-1.29672,0.031979,-0.4588,-2.751292,-0.0968,0.561309,-1.230791,-0.771245,1.002987,-0.695356,-0.18975,-2.059868
2,0.835498,1.267293,0.164167,-1.023043,1.500174,-1.156688,-0.995897,-1.053274,-1.477756,1.073379,1.489118,0.712619,-1.697714,1.424876,0.90695,-0.070008,1.947493,0.110727,0.591193,-0.000447
3,-0.735586,1.316215,-0.897825,0.653786,-1.454557,0.534217,0.079728,-0.03234,-0.216081,0.143756,-1.5123,0.148923,-0.0996,0.653759,0.382213,-0.089786,-0.02842,-0.022584,-0.184187,-1.534721
4,0.387919,-1.017217,0.069884,2.409913,0.461058,-0.745499,0.415146,0.291695,-0.16421,0.771544,-1.297284,0.994511,0.841454,0.186958,0.129123,1.617611,1.307831,-0.845798,-0.892965,-0.657802


### Results on dataset with no multicollinearity

In [4]:
# Perform 10-fold cross-validation on the Linear Regression model
# cross_val_score fits the regressor once per fold (10 times here) and returns one score per fold
# For a regressor the default score is R^2, not accuracy
cv = cross_val_score(LinearRegression(), x, y, cv=10)

# Print the mean R^2 score across the folds
print("Mean: {}".format(cv.mean()))

# Print the R^2 score of every fold
print("Values: {}".format(cv))


Mean: 0.9999169841617684
Values: [0.99993713 0.99991164 0.99980716 0.99997706 0.99986302 0.9999655
 0.9998904  0.99994775 0.99993166 0.99993853]


In [6]:
# Create the dataset again, this time with high multicollinearity
# effective_rank=1 makes the input matrix approximately low rank,
# so most of the variation in the features lies along a few directions
x, y = make_regression(n_samples=100, n_features=20, noise=0.95, effective_rank=1)

# Convert the dataset into a pandas DataFrame for better visualization
df = pd.DataFrame(x)

# Display the first few rows of the DataFrame
df.head()


,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19
0,-0.013061,-0.061369,-0.000156,-0.010368,-0.006214,0.020108,-0.005791,0.052993,-0.046331,0.024798,0.020855,0.015057,0.051126,-0.056392,-0.013801,0.01851,-0.079251,0.018598,-0.041208,0.026439
1,0.034189,-0.013488,-0.022446,-0.008836,-0.011681,-0.005614,-0.021592,-0.002434,-0.05276,0.038439,0.036112,0.023598,0.002218,-0.081476,-0.011928,0.010654,-0.032505,0.026807,-0.06549,0.041673
2,0.078963,-0.010856,0.013611,0.034094,-0.036543,0.04512,-0.021598,0.04001,0.002154,0.073436,0.026049,-0.031315,0.017198,-0.020508,-0.008962,0.010008,0.018851,-0.041065,-0.025262,0.047412
3,-0.060033,0.051791,-0.025619,0.002775,-0.013755,-0.001497,-0.008737,-0.037213,0.038245,-0.021438,-0.012701,-0.053746,-0.00277,0.065488,0.02018,-0.000172,0.04938,-0.02291,0.028298,-0.042891
4,-0.003949,-0.064451,0.037618,0.021158,0.028755,0.066083,-0.012995,0.033762,-0.046133,0.010539,-0.007244,0.031732,0.005075,-0.071118,0.032666,-0.00235,-0.006802,0.031028,-0.090243,0.065865


### Results on dataset with high multicollinearity

In [None]:
# Cross-validation fits the regressor once per fold (10 times here) and returns the R^2 score of each fold
cv = cross_val_score(LinearRegression(), x, y, cv=10)
print("Mean: {}".format(cv.mean()))
print("Values: {}".format(cv))
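
One way to quantify how much more collinear the second dataset is than the first is to compare the condition numbers of the two feature matrices. The sketch below regenerates both datasets; the names `x_full` and `x_low` and the `random_state=0` argument are assumptions added here for repeatability and do not appear in the cells above. A larger condition number means the columns are closer to being linearly dependent.

```python
import numpy as np
from sklearn.datasets import make_regression

# Same settings as the cells above; random_state is an added assumption
x_full, _ = make_regression(n_samples=100, n_features=20, noise=0.95, random_state=0)
x_low, _ = make_regression(n_samples=100, n_features=20, noise=0.95,
                           effective_rank=1, random_state=0)

# Condition number = largest singular value / smallest singular value
cond_full = np.linalg.cond(x_full)
cond_low = np.linalg.cond(x_low)

print("Condition number, default dataset:    {:.2f}".format(cond_full))
print("Condition number, effective_rank=1:   {:.2f}".format(cond_low))
```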