# Linear Regression

Linear regression is a foundational technique in statistics and machine learning that models the relationship between a dependent variable and one or more independent variables. It aims to find the best-fitting linear equation that predicts the dependent variable based on the independent variables. In this chapter, we will delve into the concepts of linear regression, its assumptions, mathematical representation, and different solution methods.

## Assumptions of Linear Regression

Before applying linear regression, it's important to consider its underlying assumptions:

1. **Linearity**: The relationship between independent variables and the dependent variable is assumed to be linear.

2. **Independence**: The residuals (differences between actual and predicted values) are independent of each other.

3. **Homoscedasticity**: The variance of the residuals is constant across all levels of the independent variables.

4. **Normality**: The residuals follow a normal distribution.

5. **No Multicollinearity**: The independent variables are not highly correlated with each other.

## Mathematical Formulation

Let's define the following terms for the mathematical formulation of linear regression:

- **X**: The matrix of independent variables (features), with an added column of ones for the intercept term.
- **y**: The vector of target values.
- **θ**: The vector of coefficients (weights) that we want to optimize.
- **n**: The number of data points.
- **m**: The number of features.

The extended linear regression equation can be written as:

$$
y = Xθ + ε
$$

where ε represents the error term.

## Direct Solution: Normal Equation

One way to find the optimal coefficients θ directly is by using the normal equation. The normal equation provides an analytical solution to the least squares problem, minimizing the sum of squared errors.

The normal equation is:

$$
θ = (X^TX)^{-1}X^Ty
$$

where:
- $\mathbf{X^T}$: The transpose of the feature matrix X.
- $\mathbf{y}$: The vector of target values.

The normal equation directly calculates the optimal values of θ that minimize the loss function.


## Loss Function

The goal of linear regression is to find the coefficients θ that minimize the difference between the predicted values (Xθ) and the actual target values (y). The most common loss function used is the Mean Squared Error (MSE):

$$
J(θ) = \frac{1}{2n} \sum_{i=1}^{n} (y_i -  \sum_{j=1}^{k}X_{i,j}θ_j)^2
$$

where n is the number of data points, and X_i represents the i-th row of the feature matrix X.

## Gradient Descent for Linear Regression

Gradient Descent is an iterative optimization algorithm used to minimize the loss function. The gradient of the loss function with respect to the coefficients θ is computed, and θ is updated in the opposite direction of the gradient. The update rule is:

$$
θ := θ - α \nabla J(θ)
$$

where α is the learning rate.

The gradient of the loss function can be computed as:

$$
\nabla J(θ) = \frac{1}{n} X^T(Xθ - y)
$$

## Conclusion

Linear regression serves as a powerful tool for modeling relationships between variables. By considering its assumptions and understanding various solution methods, such as gradient descent and the normal equation, we can apply linear regression effectively to various prediction tasks. Its simplicity, interpretability, and versatility make linear regression a cornerstone of both statistical analysis and machine learning.


In [14]:
import numpy as np

x = np.array([2, 4, 7, 3, 5, 6, 1, 9])
ind = np.argpartition(x, 5)
print(ind)
print(x[ind])

x_s = sorted([(val, ind) for ind, val in enumerate(x)], key= lambda x: x[0])
print(x_s)
print([x[1] for x in x_s[:3]])

[6 0 3 4 1 5 2 7]
[1 2 3 5 4 6 7 9]
[(1, 6), (2, 0), (3, 3), (4, 1), (5, 4), (6, 5), (7, 2), (9, 7)]
[6, 0, 3]


1. LR Assumptions.
2. LR Definition.
3. LR Direct Solution.
4. LR Gradient Methods.
5. LR Multicoliniarity.


In [24]:
import numpy as np
from sklearn.datasets import make_regression

C = 1

def loss_fn(X_intercept, y, parameters):
    I = np.eye(parameters.shape[0])
    I[0,0] = 0
    n = y.shape[0]
    return (1/(2*n))*(np.sum((X_intercept @ parameters - y)**2 + np.sum(C * np.concatenate(([0], parameters[1:]))**2)))

def grad_fn(X_intercept, y, parameters):
    I = np.eye(parameters.shape[0])
    I[0,0] = 0
    n = y.shape[0]
    return (1/(2*n))*(X_intercept.T @ (X_intercept @ parameters - y) + C * np.concatenate(([0], parameters[1:])))

def closed_form(X_intercept, y, parameters):
    I = np.eye(parameters.shape[0])
    I[0,0] = 0
    return np.linalg.inv(X_intercept.T @ X_intercept + C*I) @ X_intercept.T @ y

X, y = make_regression(n_samples=10, n_features=2, noise=1, random_state=42)


X_intercept = np.c_[np.ones((X.shape[0], 1)), X]
theta = np.ones(X_intercept.shape[1])
print(f"X shape: {X.shape} y shape: {y.shape} X_intercept shape: {X_intercept.shape} theta shape: {theta.shape}")

print(f" loss fn: {loss_fn(X_intercept, y, theta)}")
print(f" grad fn: {grad_fn(X_intercept, y, theta)}")
solution = closed_form(X_intercept, y, theta)
print(f"Closed form: {solution} y_pred: {X_intercept @ solution} y: {y}" )
print(X_intercept.shape)

X shape: (10, 2) y shape: (10,) X_intercept shape: (10, 3) theta shape: (3,)
 loss fn: 4958.580652311469
 grad fn: [ 10.84934975 -33.04821563 -44.95133608]
Closed form: [-1.8251543  43.12541053 71.25877893] y_pred: [  16.59073627  -23.11103096  134.63592763  120.96550839    9.74330219
  -54.9975661   -28.60742837 -141.62310256 -116.2808667  -127.72844599] y: [  20.03819923  -22.11994877  147.69902512  132.3912091    11.86498924
  -57.32737267  -29.60728634 -150.45066604 -123.10948375 -139.7916313 ]
(10, 3)


In [8]:
np.concatenate(([0], theta[1:]))

array([0., 1., 1.])

Linear Regression links:
https://aunnnn.github.io/ml-tutorial/html/blog_content/linear_regression/linear_regression_regularized.html#which-one-to-use-l1-or-l2


## Assumptions of Linear Regression

Before applying linear regression, it's important to consider its underlying assumptions:

1. **Linearity**: The relationship between independent variables and the dependent variable is assumed to be linear.  

2. **Independence**: The residuals (differences between actual and predicted values) are independent of each other.

3. **Homoscedasticity**: The variance of the residuals is constant across all levels of the independent variables.

4. **Normality**: The residuals follow a normal distribution.

5. **No Multicollinearity**: The independent variables are not highly correlated with each other.

### Multicollinearity:
https://www.stat.cmu.edu/~larry/=stat401/lecture-17.pdf
https://gregorygundersen.com/blog/2021/07/12/multicollinearity/
https://stats.stackexchange.com/questions/479886/unexpected-relative-value-of-eigenvalues-of-a-top-a-and-a-top-a-1-in?noredirect=1&lq=1
https://math.stackexchange.com/questions/1181271/if-ata-is-invertible-then-a-has-linearly-independent-column-vectors
https://stats.stackexchange.com/questions/221902/what-is-an-example-of-perfect-multicollinearity
https://mathformachines.com/posts/least-squares-with-the-mp-inverse/

Multicollinearity in the context of linear regression refers to a situation where two or more predictor variables (features) in your dataset are highly correlated. This high correlation can lead to instability and issues when computing the direct solution of linear regression, $y = X\theta$, where y is the target variable, $X$ is the design matrix of predictor variables, and $\theta$ is the vector of coefficients that we want to estimate.

When multicollinearity is present, it means that some predictor variables can be approximately expressed as a linear combination of other predictor variables. This can cause problems when trying to compute the pseudo-inverse of the matrix $X$, which is needed to calculate the least squares estimates of the coefficients $\theta$.

**Computation of pseudo-inverse of $X$**:
- Direct computation:  

    $X^+ = (X^TX)^{-1}X^T$

- via SVD:
$$ 
\begin{gather*}
X = U\Sigma V^T \\
X^+ = V \Sigma^{-1} U^T
\end{gather*}
$$
Where $\Sigma  = diag(\sigma_1(X), \sigma_2(X), \cdots, \sigma_n(X)) = diag(\sqrt{\lambda_1(X^TX)}, \sqrt{\lambda_2(X^TX)}, \cdots, \sqrt{\lambda_n(X^TX)})$

Since $rank(X^TX) = rank(X)$, if $rank(X_{m \times n}) < \min(m, n)$ than $X^TX$ is singular.  It means that some of its eigenvalues are zero or very close to zero.

When performing matrix operations involving a singular or ill-conditioned matrix, numerical instability can occur. Small numerical errors can lead to large errors in the computed solution.


#### Condition number

Ill-conditioning occurs when the condition number of the matrix $X^T * X$ is very large. The condition number is a measure of how sensitive the solution is to changes in the input data. A high condition number means that small changes in the input can lead to large changes in the output, making the solution unstable.

Lets numerically evaluate the stability of the solution of $y = X\theta$:

We define:
- $\delta \boldsymbol{\theta}$: as perturbation of $\boldsymbol{\theta}$
- $\delta \boldsymbol{y}$: as perturbation of $y$
- $||\boldsymbol{X}||$: as second matrix norm, $||\boldsymbol{X}||_2 = \max \boldsymbol{\sigma}(\boldsymbol{X})$
- $\boldsymbol{\sigma}(\boldsymbol{X})$: singular values of matrix $\boldsymbol{X}$

$$ \boldsymbol{X}(\boldsymbol{\theta} + \delta \boldsymbol{\theta}) = \boldsymbol{y} + \delta \boldsymbol{y} $$

Since $\boldsymbol{y} = \boldsymbol{X}\boldsymbol{\theta}$:

$$
\begin{gather*}
\boldsymbol{X}\boldsymbol{\theta} + \boldsymbol{X}\delta \boldsymbol{\theta} = \boldsymbol{y} + \delta \boldsymbol{y} \\
\boldsymbol{y} + \boldsymbol{X}\delta \boldsymbol{\theta} = \boldsymbol{y} + \delta \boldsymbol{y} \\
\boldsymbol{X}\delta \boldsymbol{\theta} = \delta \boldsymbol{y} \\
\delta \boldsymbol{\theta} = \boldsymbol{X}^{-1} \delta \boldsymbol{y} \\
||\delta \boldsymbol{\theta}|| = ||\boldsymbol{X}^{-1} \delta \boldsymbol{y}|| \\
||\delta \boldsymbol{\theta}|| \leq ||\boldsymbol{X}^{-1}||\cdot||\delta \boldsymbol{y}|| \\
\end{gather*}
$$

From $\boldsymbol{y} = \boldsymbol{X}\boldsymbol{\theta}$ we get:

$$||\boldsymbol{y}|| \leq ||\boldsymbol{X}|| \cdot ||\boldsymbol{\theta}|| $$

Combining both inequalities:

$$ \dfrac{||\delta \boldsymbol{\theta}||}{||\boldsymbol{\theta}||} \leq ||\boldsymbol{X}|| \cdot ||\boldsymbol{X}^{-1}||  \dfrac{||\delta \boldsymbol{y}||}{||\boldsymbol{y}||}$$


### Singular or Near-Singular Matrix:

### Explanation:

When we perform linear regression using the least squares method, we aim to find the coefficients that minimize the residual sum of squares. This involves solving the equation $y = X\theta$, where X is the design matrix of predictor variables, w is the vector of coefficients, and y is the target variable. One key step in this process is calculating the matrix product X^T * X, which is used to estimate the coefficients w.

When multicollinearity is present, it means that some predictor variables are highly correlated, and this can lead to issues with the matrix X^T * X.

### Formula:

The matrix X^T * X is used to compute the coefficients w as follows:

w = (X^T * X)^(-1) * X^T * y

### Example:

Let's consider an example with two highly correlated predictor variables: height in feet and height in inches. These two variables are essentially representing the same information, and therefore they are linearly dependent, resulting in multicollinearity.

Suppose our design matrix X is:

X = [[5, 60],
     [6, 72],
     [5.5, 66]]

Now, let's compute X^T * X:

X^T * X = [[5^2 + 6^2 + 5.5^2, 5*60 + 6*72 + 5.5*66],
           [5*60 + 6*72 + 5.5*66, 60^2 + 72^2 + 66^2]]

As we can see, the first term in the diagonal, which represents the sum of squares of the first predictor (height in feet), is very similar to the second term, which represents the sum of squares of the second predictor (height in inches). This similarity indicates that the matrix is nearly singular.

## Ill-Conditioning:

### Explanation:

Ill-conditioning occurs when the condition number of the matrix X^T * X is very large. The condition number is a measure of how sensitive the solution is to changes in the input data. A high condition number means that small changes in the input can lead to large changes in the output, making the solution unstable.

### Formula:

The condition number (κ) of a matrix A is defined as the ratio of the largest to the smallest singular value of A:

κ(A) = σ_max(A) / σ_min(A)

where σ_max(A) is the largest singular value and σ_min(A) is the smallest singular value of matrix A.

### Example:

Continuing from the previous example, let's compute the condition number of the matrix X^T * X:

Assuming the computed X^T * X is:

X^T * X = [[100, 600],
           [600, 15000]]

The singular values of this matrix are approximately 15018.8 and 81.2. Therefore, the condition number is:

κ(X^T * X) = 15018.8 / 81.2 ≈ 185.03

This high condition number indicates that the solution (coefficients w) will be highly sensitive to small changes in the input data.


### Variance Inflation Factor
Mathematically, multicollinearity can be detected by calculating the variance inflation factor (VIF) for each independent variable. The VIF measures the inflation in the variance of the estimated regression coefficient due to multicollinearity. The formula for VIF is:

$$ VIF(X_i) = \frac{1}{1 - R_{i}^{2}} $$

where $( R_{i}^{2} )$ is the coefficient of determination $( R^{2} )$ obtained by regressing the $ (i) $ th independent variable on all other independent variables. A VIF value greater than 1 indicates the presence of multicollinearity, and a higher VIF suggests a higher degree of multicollinearity.

In [2]:
import numpy as np

v = np.array([2, 5, 7, 8])
X = np.stack([v*coef for coef in [1, 1.25, 1.5, 1.75]], axis=1)
print(f"X shape: {X.shape}")
print(X)
eigenvalues, eigenvectors = np.linalg.eig(X)
u, s, vh = np.linalg.svd(X)
print(f"egenvals: {eigenvalues}")
print(f"singular vals: {s}")
print(f"cond number: {max(s)/min(s)}")

X = np.stack([v*coef for coef in [1, 1.25, 1.5, 1.75]], axis=1)
X_p = 0.01*np.eye(4)
X = X + X_p
print(X)
eigenvalues, eigenvectors = np.linalg.eig(X)
u, s, vh = np.linalg.svd(X)
print(f"egenvals: {eigenvalues}")
print(f"singular vals: {s}")
print(f"cond number: {max(s)/min(s)}")

X = np.stack([v*coef for coef in [1, 1.25, 1.5, 1.75]], axis=1)
X_p = 0.1*np.eye(4)
X = X + X_p
print(X)
eigenvalues, eigenvectors = np.linalg.eig(X)
u, s, vh = np.linalg.svd(X)
print(f"egenvals: {eigenvalues}")
print(f"singular vals: {s}")
print(f"cond number: {max(s)/min(s)}")

X shape: (4, 4)
[[ 2.    2.5   3.    3.5 ]
 [ 5.    6.25  7.5   8.75]
 [ 7.    8.75 10.5  12.25]
 [ 8.   10.   12.   14.  ]]
egenvals: [0.00000000e+00+0.0000000e+00j 3.27500000e+01+0.0000000e+00j
 5.97687542e-16+6.7382728e-16j 5.97687542e-16-6.7382728e-16j]
singular vals: [3.34402452e+01 2.32995685e-15 7.74888563e-16 9.41612800e-32]
cond number: 3.5513796361461816e+32
[[ 2.01  2.5   3.    3.5 ]
 [ 5.    6.26  7.5   8.75]
 [ 7.    8.75 10.51 12.25]
 [ 8.   10.   12.   14.01]]
egenvals: [1.000e-02 3.276e+01 1.000e-02 1.000e-02]
singular vals: [3.34500389e+01 1.00000000e-02 1.00000000e-02 9.79371058e-03]
cond number: 3415.461245676248
[[ 2.1   2.5   3.    3.5 ]
 [ 5.    6.35  7.5   8.75]
 [ 7.    8.75 10.6  12.25]
 [ 8.   10.   12.   14.1 ]]
egenvals: [ 0.1  32.85  0.1   0.1 ]
singular vals: [33.53819325  0.1         0.1         0.09794803]
cond number: 342.40803841192354


a = [1, 2, 3]

In [9]:
a = [1, 1, 1]
b = [2, 2, 2]

print(sum(map(lambda x: x[0]>x[1], zip(a,b))))
print(f"{3/2 :.6f}")

0
1.500000


In [35]:
import pandas as pd
d = {"1st":[203, 122], "2nd":[118, 167], "3rd":[178, 528], "Crew":[212, 673]}
df = pd.DataFrame(data=d, index=["survived", "not survived"])
df.head()

Unnamed: 0,1st,2nd,3rd,Crew
survived,203,118,178,212
not survived,122,167,528,673


In [46]:
total_psg = df.sum().sum()
fst_class = df["1st"].sum()
survived = df.loc["survived", :].sum()
not_survived = df.loc["not survived", :].sum()

print(f"total: {total_psg} 1st: {fst_class} survived: {survived} not survived: {not_survived}")
print("P first class:", round(fst_class/total_psg, 2))
print("P survived:", round(survived/total_psg, 2))
print("P 1st class survived:", round(df.loc["survived", "1st"]/fst_class, 2))
print(round((3/5 +1/6)/3,2), round(23/90, 2))
print(round((3/5)*(1/3)/(23/90),2), round(18/23, 2))
print(round((1/3)/(67/90),2), round(30/67, 2))

total: 2201 1st: 325 survived: 711 not survived: 1490
P first class: 0.15
P survived: 0.32
P 1st class survived: 0.62
0.26 0.26
0.78 0.78
0.45 0.45


In [36]:
import numpy as np

def gini_inpurity(y):
    if not isinstance(y, np.ndarray):
        y = np.array(y)
    probabilities = np.bincount(y)/y.shape[0]
    print(f"probabilities: {probabilities}")
    print(f"probabilities squared: {probabilities**2}")
    print(f"gini inpurity: {1 - np.sum(probabilities**2)}")
    return 1 - np.sum(probabilities**2)

def entropy_inpurity(y):
    if not isinstance(y, np.ndarray):
        y = np.array(y)
    probabilities = np.bincount(y)/y.shape[0]
    print(f"probabilities: {probabilities}")
    print(f"log probabilities: { np.log2(probabilities, where=(probabilities > 0))}")
    print(f"p*log_2 p: { probabilities*np.log2(probabilities, where=(probabilities > 0))}")
    print(f"entropy inpurity: {np.sum(-probabilities * np.log2(probabilities, where=(probabilities > 0)))}")
    return np.sum(-probabilities * np.log2(probabilities, where=(probabilities > 0)))

y = [0, 0, 1, 1]
gini_inpurity(y)
print("===================")
entropy_inpurity(y)
print()
y = [0, 0, 1, 1, 1, 1, 1, 1, 1]
gini_inpurity(y)
print("===================")
entropy_inpurity(y)

print()
y = [1, 1, 1, 1, 1, 1, 1]
gini_inpurity(y)
print("===================")
entropy_inpurity(y)

probabilities: [0.5 0.5]
probabilities squared: [0.25 0.25]
gini inpurity: 0.5
probabilities: [0.5 0.5]
log probabilities: [-1. -1.]
p*log_2 p: [-0.5 -0.5]
entropy inpurity: 1.0

probabilities: [0.22222222 0.77777778]
probabilities squared: [0.04938272 0.60493827]
gini inpurity: 0.345679012345679
probabilities: [0.22222222 0.77777778]
log probabilities: [-2.169925   -0.36257008]
p*log_2 p: [-0.48220556 -0.28199895]
entropy inpurity: 0.7642045065086203

probabilities: [0. 1.]
probabilities squared: [0. 1.]
gini inpurity: 0.0
probabilities: [0. 1.]
log probabilities: [0. 0.]
p*log_2 p: [0. 0.]
entropy inpurity: 0.0


0.0

In [39]:
probabilities = np.array([0, 1])
print(np.unique(probabilities))
entropy = np.sum(-probabilities * np.log2(probabilities, where=(probabilities > 0)))
print(entropy)

[0 1]
0.0


Imagine an experiment with $k$ possible output categories. Category $j$ has a probability of occurrence $p(j|t)$ (where $j=1,\ldots,k$).

Reproduce the experiment two times and make these observations:

1. The probability of obtaining two identical outputs of category $j$ is $p^2(j|t)$.
2. The probability of obtaining two identical outputs, independently of their category, is: $\sum_{j=1}^{k} p^2(j|t)$.
3. The probability of obtaining two different outputs is thus: $1 - \sum_{j=1}^{k} p^2(j|t)$.

That's it: the Gini impurity is simply the probability of obtaining two different outputs, which is an "impurity measure".

https://stats.stackexchange.com/questions/308885/a-simple-clear-explanation-of-the-gini-impurity

In [34]:
def divisibleSumPairs(n, k, ar):
    reminder = dict()
    for i, number in enumerate(ar):
        reminder.setdefault(number%k, set()).update([i])
        print(f"i: {i} number: {number} reminder: {reminder}")
    pairs_count = 0
    for i in range(n-1):
        rem_i = ar[i]%k
        pairs = reminder.get((k-rem_i)%k)
        #print(f"i: {i} pairs: {pairs}")
        if pairs:
            subset = pairs-set(list(range(i+1)))
            print(", ".join([f"({i},{p})" for p in subset]))
            pairs_count += len(subset)
    return pairs_count

n = 6
k = 3
ar = [1, 3, 2, 6, 1, 2]
divisibleSumPairs(n, k, ar)

i: 0 number: 1 reminder: {1: {0}}
i: 1 number: 3 reminder: {1: {0}, 0: {1}}
i: 2 number: 2 reminder: {1: {0}, 0: {1}, 2: {2}}
i: 3 number: 6 reminder: {1: {0}, 0: {1, 3}, 2: {2}}
i: 4 number: 1 reminder: {1: {0, 4}, 0: {1, 3}, 2: {2}}
i: 5 number: 2 reminder: {1: {0, 4}, 0: {1, 3}, 2: {2, 5}}
(0,2), (0,5)
(1,3)
(2,4)

(4,5)


5

In [40]:
y = np.array([0,0,1,2,2,2,3])
print(np.bincount(y))
print(np.argmax(np.bincount(y)))

[2 1 3 1]
2


In [14]:
def introTutorial(V, arr):
    # Write your code here
    n = len(arr)
    if n == 0:
        return 0
    left = 0
    right = n - 1
    iter = 0
    while (iter< 100) and (left != right):
        iter += 1
        ind = left + (right-left)//2
        if arr[ind] == V:
            return ind
        elif arr[ind] > V:
            right = ind-1
        else:
            left = ind+1
    return left

s = [1, 5, 51, 53, 2, 123, 223, 345, 5432, 111]
for n in range(15):
    for j in range(n):
        s = list(range(n))
        v = s[j]
        res = introTutorial(v, s)
        assert j == res, f"fail. j:{j} ans: {res}"

out of while
inside while
out of while
out of while
inside while
out of while
out of while
inside while
inside while
out of while
inside while
out of while
inside while
inside while
out of while
inside while
out of while
inside while
out of while
inside while
out of while
out of while
inside while
out of while
inside while
out of while
inside while
out of while
out of while
inside while
out of while
inside while
out of while
inside while
inside while
out of while
out of while
inside while
inside while
out of while
inside while
out of while
inside while
inside while
out of while
out of while
inside while
inside while
out of while
inside while
inside while
out of while
inside while
inside while
out of while
inside while
out of while
inside while
inside while
out of while
inside while
inside while
out of while
inside while
inside while
out of while
inside while
out of while
inside while
inside while
out of while
inside while
inside while
out of while
inside while
out of while
inside while

Logistic Regression links:  

 - https://ai.plainenglish.io/l1-lasso-and-l2-ridge-regularizations-in-logistic-regression-53ab6c952f15  
 - https://ml-explained.com/blog/logistic-regression-explained