# Linear Regression Regularization

## I. Linear Regression Basics

Linear regression is an often underrated tool in the machine learning toolkit. I cannot give linear models the treatment they deserve in this article, but I will give a brief overview. Linear models assume a relationship of the form:

$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_k x_{ik} + \epsilon_i$

where $y_i$ is the label of the *i*-th observation, $\beta_0$ is the intercept, $\beta_j$ is the coefficient for the *j*-th variable, $x_{ij}$ is the value of the *j*-th variable for the *i*-th observation, and $\epsilon_i$ is the error for the *i*-th observation.

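Some assumptions on $\epsilon_i$ make linear regression particularly useful. When the errors are assumed to be independent, with mean zero and constant variance, and normally distributed, the coefficient estimates come with standard errors, t-statistics, and an F-test, like those reported at the end of this section.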

The loss function most often used in linear regression is the sum of squared residuals (SSR). Minimizing this value is equivalent to minimizing the mean squared error (MSE), since the MSE is the SSR divided by the number of observations. There are alternatives such as [Least Absolute Deviation](https://en.wikipedia.org/wiki/Least_absolute_deviations) and [Huber Loss](https://en.wikipedia.org/wiki/Huber_loss), which are less sensitive to outliers. Their details are not discussed here, but they are worth reviewing so you have them in your toolkit.
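To make the outlier sensitivity concrete, here is a minimal sketch that evaluates the three losses on a small set of residuals, with and without one large residual. The function names and the Huber threshold `delta=1.0` are illustrative choices, not something this article prescribes.

```python
import numpy as np

def ssr(residuals):
    """Sum of squared residuals."""
    return float(np.sum(residuals ** 2))

def mse(residuals):
    """Mean squared error: SSR divided by the number of observations."""
    return float(np.mean(residuals ** 2))

def lad(residuals):
    """Least absolute deviation loss: sum of absolute residuals."""
    return float(np.sum(np.abs(residuals)))

def huber(residuals, delta=1.0):
    """Huber loss: quadratic for |r| <= delta, linear beyond it."""
    abs_r = np.abs(residuals)
    quadratic = 0.5 * abs_r ** 2
    linear = delta * (abs_r - 0.5 * delta)
    return float(np.sum(np.where(abs_r <= delta, quadratic, linear)))

# Four well-behaved residuals, then the same set plus one outlier.
clean = np.array([0.5, -0.3, 0.2, -0.4])
with_outlier = np.append(clean, 10.0)

for name, loss in [("SSR", ssr), ("LAD", lad), ("Huber", huber)]:
    print(f"{name}: {loss(clean):.3f} -> {loss(with_outlier):.3f}")
```

The single residual of 10 adds 100 to the SSR but only 10 to the LAD loss and 9.5 to the Huber loss, which is why the latter two are pulled around less by extreme points.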

The loss function in matrix form is:

$\min_{\hat{\beta}} \; (y - X\hat{\beta})^T(y - X\hat{\beta})$

where $y$ is the column vector of true values, $X$ is the design matrix, and $\hat{\beta}$ is the column vector of coefficient estimates in the linear model. A design matrix has rows representing observations and columns representing variables; typically the first column is all 1s for the intercept term.
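As a rough illustration of the matrix form, the sketch below builds a design matrix with a leading column of 1s and finds the minimizing $\hat{\beta}$ in two ways: with `np.linalg.lstsq`, and by solving the normal equations $X^TX\hat{\beta} = X^Ty$, which the minimizer of this loss satisfies. The data, the seed, and the variable names are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 200, 3

# Raw features, and hypothetical true coefficients (intercept first).
X_raw = rng.normal(size=(n, k))
true_beta = np.array([1.0, 2.0, -1.0, 0.5])

# Design matrix: a column of 1s for the intercept, then the features.
X = np.column_stack([np.ones(n), X_raw])
y = X @ true_beta + rng.normal(scale=0.1, size=n)

def ssr(beta_hat):
    """Matrix-form loss: (y - X beta_hat)^T (y - X beta_hat)."""
    resid = y - X @ beta_hat
    return float(resid @ resid)

# Least-squares solution from NumPy's solver.
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

# Same solution from the normal equations X^T X beta = X^T y.
beta_normal = np.linalg.solve(X.T @ X, X.T @ y)

print("lstsq:           ", np.round(beta_hat, 4))
print("normal equations:", np.round(beta_normal, 4))
print("SSR at estimate: ", round(ssr(beta_hat), 4))
```

Solving the normal equations directly is fine for a small, well-conditioned example like this one; `lstsq` is the more numerically robust choice when columns of $X$ are nearly collinear.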

Using summation notation, the loss function is:

$\min_{\hat{\beta}} \; \sum_{i=1}^n \left(y_i - \hat{\beta}_0 - \sum_{j=1}^k \hat{\beta}_j x_{ij}\right)^2$

where $y_i$ is the label for the *i*-th observation, $\hat{\beta}_0$ is the estimated intercept, $\hat{\beta}_j$ is the estimated coefficient for the *j*-th variable, and $x_{ij}$ is the value of the *j*-th variable for the *i*-th observation.
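The summation and matrix forms describe the same quantity once the intercept is treated as the coefficient on a column of 1s. A small check on arbitrary numbers, with all names and values invented for the example:

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 5, 2

# Column 0 is the intercept column of 1s; columns 1..k are the variables.
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
y = rng.normal(size=n)
beta_hat = rng.normal(size=k + 1)  # arbitrary candidate coefficients

# Summation form: loop over observations i and coefficients j.
ssr_sum = 0.0
for i in range(n):
    prediction = sum(beta_hat[j] * X[i, j] for j in range(k + 1))
    ssr_sum += (y[i] - prediction) ** 2

# Matrix form: residual vector dotted with itself.
resid = y - X @ beta_hat
ssr_matrix = float(resid @ resid)

print(np.isclose(ssr_sum, ssr_matrix))
```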



In [1]:
import numpy as np
from LinearModel import OLS

size = 2000
cols = 3
eta = 1

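def make_datasets(size, cols, num_zero=0, beta_scale=1., intercept=1., eta=1., seed=1):
    """Simulate data from a linear model and split it in half.

    size: total number of observations; cols: number of variables;
    num_zero: how many true coefficients to set to exactly 0;
    beta_scale: std. dev. of the true coefficients; eta: std. dev. of the noise.
    """
    np.random.seed(seed)

    # Standard-normal features and normally distributed true coefficients.
    X = np.random.normal(loc=0, scale=1, size=(size, cols))
    beta = np.random.normal(loc=0, scale=beta_scale, size=cols)

    # Zero out a random subset of coefficients so some variables are irrelevant.
    if num_zero > 0:
        beta_indices = np.arange(0, beta.shape[0], 1)
        zero_indices = np.random.choice(beta_indices, num_zero, replace=False)
        beta[zero_indices] = 0

    # Labels: intercept + linear signal + Gaussian noise.
    noise = np.random.normal(loc=0, scale=eta, size=size)
    y = intercept + np.matmul(X, beta) + noise

    # First half is the training set, second half is the test set.
    half = round(size / 2)
    X_train, y_train = X[:half, :], y[:half]
    X_test, y_test = X[half:, :], y[half:]

    return X_train, y_train, X_test, y_test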


X_train, y_train, X_test, y_test = make_datasets(size, cols, num_zero=1, beta_scale=1., eta=1)
 


# Create an array of 1s equal in length to the observations in X.
intercept_column = np.repeat(1, repeats=X_train.shape[0])
# Insert it at the 0-th column index.
X_copy = np.insert(X_train, 0, intercept_column, axis=1)

In [2]:
my_ols = OLS()
my_ols.fit(X_train, y_train)  

print(f"Model df: {my_ols.df_model}")
print(f"Residual df: {my_ols.df_residuals}")
print(f"R-squared: {round(my_ols.R_sq, 4)}")
print(f"Adj. R-squared: {round(my_ols.adj_R_sq, 4)}")
print(f"F-stat: {round(my_ols.F_stat, 4)}")
print(f"F-prob: {round(my_ols.F_prob, 4)}")
print(f"Est. Coef.: {np.round(my_ols.beta_hat, 4)}")
print(f"Est. Coef. Std. Error: {np.round(my_ols.beta_hat_se, 4)}")
print(f"t-stats: {np.round(my_ols.beta_hat_t_stats, 4)}")
print(f"P(>|t|): {np.round(my_ols.beta_hat_prob, 4)}")

Model df: 3
Residual df: 996
R-squared: 0.9002
Adj. R-squared: 0.8999
F-stat: 2993.9391
F-prob: 0.0
Est. Coef.: [ 0.9893 -2.8303  0.7886 -0.035 ]
Est. Coef. Std. Error: [0.0311 0.0315 0.0304 0.0319]
t-stats: [ 31.8331 -89.7663  25.9239  -1.0985]
P(>|t|): [0.     0.     0.     0.2722]
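The `OLS` class above comes from the author's own `LinearModel` module, whose internals are not shown here. As a hedged sketch of how such summary statistics are conventionally computed, the hypothetical helper below uses the textbook formulas with plain NumPy. The function name `ols_summary` and its dictionary keys are assumptions that merely echo the attribute names printed above; this is not the author's implementation.

```python
import numpy as np

def ols_summary(X, y):
    """Textbook OLS statistics. X is (n, k) WITHOUT an intercept column."""
    n, k = X.shape
    design = np.column_stack([np.ones(n), X])  # prepend intercept column

    beta_hat, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta_hat

    ssr = float(resid @ resid)                 # sum of squared residuals
    sst = float(np.sum((y - y.mean()) ** 2))   # total sum of squares

    df_model = k              # number of slope coefficients
    df_residuals = n - k - 1  # observations minus all estimated coefficients

    r_sq = 1 - ssr / sst
    adj_r_sq = 1 - (ssr / df_residuals) / (sst / (n - 1))
    f_stat = ((sst - ssr) / df_model) / (ssr / df_residuals)

    # Coefficient covariance: sigma^2 (X^T X)^{-1}, sigma^2 estimated from residuals.
    sigma_sq = ssr / df_residuals
    cov = sigma_sq * np.linalg.inv(design.T @ design)
    se = np.sqrt(np.diag(cov))
    t_stats = beta_hat / se

    return {
        "df_model": df_model, "df_residuals": df_residuals,
        "R_sq": r_sq, "adj_R_sq": adj_r_sq, "F_stat": f_stat,
        "beta_hat": beta_hat, "beta_hat_se": se, "beta_hat_t_stats": t_stats,
    }

# Toy data: intercept 1, one coefficient set to exactly 0.
rng = np.random.default_rng(2)
X_demo = rng.normal(size=(100, 3))
y_demo = 1.0 + X_demo @ np.array([2.0, 0.0, -1.0]) + rng.normal(size=100)

summary = ols_summary(X_demo, y_demo)
print(f"Residual df: {summary['df_residuals']}")
print(f"R-squared: {round(summary['R_sq'], 4)}")
print(f"t-stats: {np.round(summary['beta_hat_t_stats'], 4)}")
```

These formulas agree with the printed output above: with 1000 training rows and 3 variables, the residual degrees of freedom are 1000 - 3 - 1 = 996, and each t-statistic is the coefficient divided by its standard error. The p-values are omitted from the sketch; they would come from the tail probabilities of the t and F distributions with those degrees of freedom.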
