### The math
$\hat{y} = Xb$ where $\hat{y}$ are predictions, $X$ are features and $b$ are trained model parameters

For MSE we have closed form solution($y$ as labels):
$
b=(X^TX)^{-1}X^T y
$

In [40]:
import numpy as np
from numpy.typing import NDArray

n = 10 # 10 data
k = 5 # num of feautres

features = np.random.randn(n, k)
labels = np.random.randn(n, 1)

X = features
y_hat = labels

In [None]:
def MSE(y: NDArray, y_pred: NDArray) -> float:
    return sum(((y-y_pred) ** 2)/len(y))

**Linear Regression with sklearn**

In [104]:
from sklearn.linear_model import LinearRegression

def linear_reg_sklearn(X, y):
    reg = LinearRegression(fit_intercept=False).fit(X, y)
    print(f"Mse: {MSE(y_hat, reg.predict(X))}")

linear_reg_sklearn(X, y_hat)


Mse: [0.24471667]


**Linear regression with closed form**

In [50]:
def linear_reg_closed_form(X, y):
    b_closed_form = np.linalg.inv(X.T.dot(X)).dot(X.T).dot(y_hat)
    # print(b_closed_form)
    print(f"MSE: {MSE(y_hat, X.dot(b_closed_form))}")

linear_reg_closed_form(X, y_hat)

MSE: [0.74354049]


**Linear regression through gradient descent**

The math:
$
L = \frac{1}{n}\sum_{i=1}^n(y_i-\sum_{j=1}^kx_{ij}b_j)^2
\Rightarrow \frac{\partial L}{\partial b_m} = -\frac{2}{n}\sum_{i=1}^n x_{im}(y_i-\sum_{j=1}^kx_{ij}b_j)
\Rightarrow \frac{\partial L}{\partial b} = -\frac{2}{n} X^T(y - Xb)
$

In [53]:
from typing import List


max_epochs = 100
learning_rate = 0.1

def linear_reg_gradient_descent(X, y_hat):
    b_gd = np.random.randn(k, 1)
    loss = []
    for i in range(1, max_epochs+1):
        pred = X.dot(b_gd)
        loss.append(MSE(y_hat, pred))
        gradient = -2/n * X.T.dot(y_hat - pred)
        b_gd -= learning_rate*gradient

        if i%10 == 0:
            print(f'At epoch {i}, loss {loss[-1]}')

    # print(b_gd)
    print(f"MSE: {MSE(y_hat, X.dot(b_gd))}")

linear_reg_gradient_descent(X, y_hat)

At epoch 10, loss [1.23534309]
At epoch 20, loss [0.9463261]
At epoch 30, loss [0.83242008]
At epoch 40, loss [0.7826889]
At epoch 50, loss [0.76080672]
At epoch 60, loss [0.75115875]
At epoch 70, loss [0.74690226]
At epoch 80, loss [0.74502402]
At epoch 90, loss [0.74419518]
At epoch 100, loss [0.74382941]
MSE: [0.74380671]


**Now let's check how logistic regression works with classification problem**
Firstly check normal linear regression and GD based linear regression
$X : n*k, y:n*1$

In [112]:
labels_binary = np.random.randint(low=0, high=2, size=(n,1))

print("--- linear regression by closed form ---")
linear_reg_closed_form(X, labels_binary)
print("--- linear regression by gradient descent ---")
linear_reg_gradient_descent(X, labels_binary)
print("--- linear regression by sklearn ---")
linear_reg_sklearn(X, labels_binary)


--- linear regression by closed form ---
MSE: [0.24471667]
--- linear regression by gradient descent ---
At epoch 10, loss [0.46517566]
At epoch 20, loss [0.28263667]
At epoch 30, loss [0.24675687]
At epoch 40, loss [0.23608397]
At epoch 50, loss [0.23207116]
At epoch 60, loss [0.23039512]
At epoch 70, loss [0.22966838]
At epoch 80, loss [0.22934943]
At epoch 90, loss [0.22920891]
At epoch 100, loss [0.22914693]
MSE: [0.22914309]
--- linear regression by sklearn ---
Mse: [0.69316996]


In [114]:
from sklearn.linear_model import LogisticRegression

def logistic_reg_sklearn(X, y):
    reg = LogisticRegression(fit_intercept=False).fit(X, y.reshape(-1))
    print(f"Mse: {MSE(y.reshape(-1), reg.predict(X))}")

logistic_reg_sklearn(X, labels_binary)

Mse: 0.1


**Logistic regression with gradient boost**
The math: $\hat{y} = \sigma(Xb)$, let $y^p=Xb$. Then we have:

$L = \frac{1}{n}\sum_{i=1}^n - [y_i \log(\hat{y}_i) + (1 - y_i) \log(1-\hat{y}_i)]
\Rightarrow \frac{\partial L}{\partial \hat{y}_i} = \frac{1}{n}(\frac{1-y_i}{1-\hat{y_i}}-\frac{y_i}{\hat{y}_i})
$

$\frac{\partial \hat{y}_i}{\partial y^p_i} = \sigma(y^p_i)(1-\sigma(y^p_i))$. For $i\neq j$, $\frac{\partial \hat{y}_i}{\partial y^p_j} = 0$

$\frac{\partial y^p_i}{\partial b_j} = x_{ij}$

Therefore:

$\frac{\partial L}{\partial y^p_i} = \frac{\partial L}{\partial \hat{y}_i}  \frac{\partial \hat{y}_i}{\partial y^p_i}$

$\frac{\partial L}{\partial b_j} = \sum_{i=1}^n \frac{\partial L}{\partial y^p_i} x_{ij} \Rightarrow \frac{dL}{db} = X^T \frac{dL}{dy^p}$

In [120]:
def logistic_reg_gradient_descent(X, y):
    b_gd = np.random.randn(k, 1)
    loss = []
    for i in range(1, max_epochs+1):
        yp = X.dot(b_gd)
        y_hat =  1 / (1 + np.exp(-yp))
        CE_array = -y*np.log(y_hat)-(1-y)*np.log(1-y_hat)
        ce_loss = np.mean(CE_array)
        loss.append(MSE(y_hat, y))
        
        dL_dYhat = ((1-y)/(1-y_hat) - y/y_hat) / n 
        dYhat_dYp = y_hat * (1-y_hat)
        dL_dYp = dL_dYhat * dYhat_dYp

        gradient = X.T.dot(dL_dYp)

        b_gd -= learning_rate*gradient

        if i%10 == 0:
            print(f'At epoch {i}, CE loss: {ce_loss}, MSE: {loss[-1]}')


logistic_reg_gradient_descent(X, labels_binary)

At epoch 10, CE loss: 0.7188416231639809, MSE: [0.22453573]
At epoch 20, CE loss: 0.5973793703921088, MSE: [0.19748365]
At epoch 30, CE loss: 0.5075192331701573, MSE: [0.17169645]
At epoch 40, CE loss: 0.44453767404852096, MSE: [0.14797108]
At epoch 50, CE loss: 0.4017175836897085, MSE: [0.12904094]
At epoch 60, CE loss: 0.37200949666475475, MSE: [0.11529178]
At epoch 70, CE loss: 0.35023992498163714, MSE: [0.10540733]
At epoch 80, CE loss: 0.33331105251084914, MSE: [0.09804089]
At epoch 90, CE loss: 0.319496133739193, MSE: [0.09229634]
At epoch 100, CE loss: 0.3078253812628044, MSE: [0.08764142]
