# Logistic Regression

Logistic regression is a classification algorithm that is widely used for binary classification tasks. It models the probability that a given input belongs to the positive class. A linear combination of the input features is passed through the sigmoid function, which produces values between 0 and 1 that can be interpreted as class probabilities.

## Mathematical Formulation

Given a dataset with $N$ samples and $M$ features, the goal of logistic regression is to find the optimal set of weights $\theta$ that best separates the two classes. The mathematical formulation of logistic regression involves the following components:

- Inputs: $X \in \mathbb{R}^{N \times (M+1)}$ (a matrix in which each row is a sample and each column is a feature, plus a column of ones for the bias term)
- Weights: $\theta \in \mathbb{R}^{M+1}$ (a vector of $M$ coefficients plus the bias term)
- Logits: $t = X\theta$ ($t \in \mathbb{R}^{N}$, one logit per sample)
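
The shapes of these components can be checked with a minimal NumPy sketch. The helper name `add_bias_column`, the toy values, and the choice to put the column of ones last are assumptions made only for illustration.

```python
import numpy as np

def add_bias_column(features):
    """Append a column of ones so that the last weight acts as the bias term."""
    features = np.asarray(features, dtype=float)
    ones = np.ones((features.shape[0], 1))
    return np.hstack([features, ones])

# Toy data: N = 3 samples, M = 2 features (values chosen only for illustration).
features = np.array([[1.0, 2.0],
                     [0.0, -1.0],
                     [3.0, 0.5]])
X = add_bias_column(features)          # one row per sample, M + 1 columns
theta = np.array([0.5, -1.0, 0.25])    # M coefficients followed by the bias
t = X @ theta                          # one logit per sample

print(X.shape, theta.shape, t.shape)   # (3, 3) (3,) (3,)
```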

Logistic regression assumes that the logit (log-odds) $\log \dfrac{p(x)}{1-p(x)}$, where $p(x) = P(Y=1|X=x)$, is a linear function of $x$:

$$\log \dfrac{p(x)}{1 - p(x)} = \theta^T x$$

Solving for $p$, we get:

$$
\begin{gather*}
\dfrac{p(x)}{1 - p(x)} = e^{\theta^T x}\\
p(x) = (1 - p(x))e^{\theta^T x}\\
p(x) = e^{\theta^T x} - p(x)e^{\theta^T x}\\
p(x)(1 + e^{\theta^T x}) = e^{\theta^T x}\\
p(x) = \dfrac{e^{\theta^T x}}{(1 + e^{\theta^T x})}\\
p(x) = \dfrac{1}{(1 + e^{-\theta^T x})}\\
\end{gather*}
$$

From the last equation we can see that $P(Y=1|X=x) = p(x) = \sigma(t)$, where $t = \theta^T x$ and $\sigma$ is the sigmoid function.

The sigmoid function $\sigma(t)$ is defined as:
$$ \sigma(t) = \frac{1}{1 + e^{-t}} $$
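
As a rough illustration, the sigmoid and its inverse (the logit) might be written as follows. The branch on the sign of `t` is an implementation choice, not part of the definition: it avoids exponentiating large positive numbers. The function names are assumptions for this sketch.

```python
import numpy as np

def sigmoid(t):
    """Sigmoid that never exponentiates a large positive number."""
    t = np.asarray(t, dtype=float)
    e = np.exp(-np.abs(t))                 # always in (0, 1], so no overflow
    # For t >= 0 use 1 / (1 + e^{-t}); for t < 0 use the equal form e^{t} / (1 + e^{t}).
    return np.where(t >= 0, 1.0 / (1.0 + e), e / (1.0 + e))

def logit(p):
    """Log-odds, the inverse of the sigmoid on the open interval (0, 1)."""
    p = np.asarray(p, dtype=float)
    return np.log(p / (1.0 - p))

t = np.array([-2.0, 0.0, 3.0])
p = sigmoid(t)
print(p)          # probabilities strictly between 0 and 1
print(logit(p))   # recovers t up to floating-point error
```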

The likelihood of the observed data under the logistic regression model is given by:
$$ L(\theta) = \prod_{i=1}^{N} P(Y=y_i|X=x_i) = \prod_{i=1}^{N} \left(\sigma(\theta^T x_i)\right)^{y_i} \left(1 - \sigma(\theta^T x_i)\right)^{1 - y_i} $$

The goal is to maximize the likelihood, which is equivalent to minimizing the negative log-likelihood (log loss):
$$ J(\theta) = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log\left(\sigma(\theta^T x_i)\right) + (1 - y_i) \log\left(1 - \sigma(\theta^T x_i)\right) \right] $$
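
A minimal sketch of this log loss in NumPy is shown below. The clipping constant `eps` is an added safeguard against `log(0)` and is not part of the formula; the toy data are invented for illustration.

```python
import numpy as np

def log_loss(theta, X, y, eps=1e-12):
    """Mean negative log-likelihood J(theta) for binary labels y in {0, 1}."""
    p = 1.0 / (1.0 + np.exp(-(X @ theta)))   # sigma(theta^T x_i) for every sample
    p = np.clip(p, eps, 1.0 - eps)           # safeguard only: keeps log() finite
    return -np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))

# Toy data: two features plus a column of ones for the bias term.
X = np.array([[0.5, 1.0, 1.0],
              [-1.5, 0.2, 1.0],
              [2.0, -0.3, 1.0],
              [-0.7, -1.1, 1.0]])
y = np.array([1.0, 0.0, 1.0, 0.0])

# With all weights at zero every predicted probability is 0.5,
# so the loss equals log(2), about 0.6931.
print(log_loss(np.zeros(3), X, y))
```

Weights that push the predicted probabilities toward the true labels give a loss below this `log(2)` baseline.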

## Optimization

Optimizing the logistic regression model means finding the weights $\theta$ (which include the bias term) that minimize the log loss $J(\theta)$. This is typically done with gradient descent or one of its variants. The gradient of the log loss with respect to $\theta$ is computed using the chain rule and is used to update the parameters iteratively.
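
The iterative update can be sketched as plain batch gradient descent, using the gradient $X^T(S - Y)$ derived in the Loss Function and Derivative section, divided by $N$ because the loss above is a mean. The function name, learning rate, iteration count, and toy data are assumptions for illustration, not a reference implementation.

```python
import numpy as np

def fit_logistic_regression(X, y, learning_rate=0.1, n_iterations=1000):
    """Batch gradient descent on the mean log loss; X must include the ones column."""
    n_samples, n_params = X.shape
    theta = np.zeros(n_params)
    for _ in range(n_iterations):
        s = 1.0 / (1.0 + np.exp(-(X @ theta)))   # predicted probabilities
        gradient = X.T @ (s - y) / n_samples      # gradient of the mean log loss
        theta -= learning_rate * gradient         # step against the gradient
    return theta

# Toy, linearly separable data: the label is 1 when the two features sum to > 0.
rng = np.random.default_rng(0)
features = rng.normal(size=(200, 2))
y = (features[:, 0] + features[:, 1] > 0).astype(float)
X = np.hstack([features, np.ones((200, 1))])

theta = fit_logistic_regression(X, y)
predictions = (X @ theta > 0).astype(float)       # sigma(t) > 0.5 exactly when t > 0
print("training accuracy:", np.mean(predictions == y))
```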

## Conclusion

Logistic regression is a fundamental classification algorithm with a well-defined mathematical formulation. It models class probabilities using the sigmoid function and chooses its parameters by minimizing the log loss. The resulting probabilities can then be thresholded to make predictions for binary classification tasks.

Logistic Regression is a widely used classification algorithm in machine learning. Unlike linear regression, which predicts continuous values, logistic regression predicts the probability of an input belonging to a particular class. The logistic function, also known as the sigmoid function, is used to model this probability.

## Loss Function and Derivative

In logistic regression, the sigmoid function is applied to the linear combination of input features, resulting in the predicted probability. The loss function used in logistic regression is the Cross-Entropy Loss, also known as Log Loss or Binary Cross-Entropy. The loss measures the difference between the predicted probabilities and the actual labels.

Given:
- Inputs: $X$ (matrix of features)
- Weights: $\theta$
- Predicted probabilities: $s = \sigma(X\theta)$, where $\sigma$ is the sigmoid function
- Actual labels: $Y$ (binary, 0 or 1)

The Cross-Entropy Loss ($J$) is defined as the negative log-likelihood of the data given the model's predictions. Here it is written as a sum over the $N$ samples; dividing by $N$ only rescales the loss and its gradient:
$$ J(\theta) = -\sum_{n=1}^{N} \left( y_n \log \left(s_n\right) + \left(1 - y_n\right) \log \left(1 - s_n\right) \right) $$

Where:  

$t(\theta) = \theta^T x$ 
 
$s(t) = \sigma (t)$  
  
$\dfrac{d s(t)}{dt} = \sigma(t)(1-\sigma(t)) = s(t)(1-s(t))$

$\dfrac{d t}{d \theta} = x$
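
The identity for the sigmoid derivative can be checked numerically with a central finite difference. This is only a sanity check under an assumed step size `h`, not a proof.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

t = np.linspace(-4.0, 4.0, 9)
h = 1e-6                                                  # assumed step size

# Central difference approximation of d sigma / dt.
numeric = (sigmoid(t + h) - sigmoid(t - h)) / (2.0 * h)
# Closed form from the identity above: s(t) * (1 - s(t)).
analytic = sigmoid(t) * (1.0 - sigmoid(t))

print(np.max(np.abs(numeric - analytic)))                 # should be close to zero
```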

Now differentiate $J$ with respect to $\theta$ using the chain rule:
$$
\begin{align*}
\dfrac{\partial J(\theta)}{\partial \theta} &=  -\sum_{n=1}^{N} \left(y_n  \cdot \dfrac{\partial  \log (s_n)}{\partial s_n} \cdot \dfrac{\partial s_n(t_n)}{\partial t_n}  \cdot \dfrac{\partial t_n(\theta)}{\partial \theta} + (1-y_n)  \cdot \dfrac{\partial  \log (1-s_n)}{\partial s_n}  \cdot \dfrac{\partial s_n(t_n)}{\partial t_n}  \cdot \dfrac{\partial t_n(\theta)}{\partial \theta}\right)\\
&= -\sum_{n=1}^{N} \left( \dfrac{y_n}{s_n} \cdot s_n \cdot (1-s_n) \cdot x_n + \dfrac{1-y_n}{1-s_n} \cdot (-1) \cdot s_n \cdot (1-s_n) \cdot x_n \right) \\
&= -\sum_{n=1}^{N} \left( y_n\cdot (1-s_n) \cdot x_n + (y_n-1)\cdot s_n  \cdot x_n \right) \\
&= -\sum_{n=1}^{N} \left(y_n \cdot x_n - y_n \cdot s_n \cdot x_n + y_n \cdot s_n \cdot x_n - s_n \cdot x_n \right) \\
&= -\sum_{n=1}^{N} \left((y_n - s_n) \cdot x_n \right)
\end{align*}
$$

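Since $-\sum_{n=1}^{N} (y_n - s_n) \cdot x_n = \sum_{n=1}^{N} (s_n - y_n) \cdot x_n$, the sum can be written as a matrix product, which gives the gradient in vector form: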

$$ \frac{\partial J(\theta)}{\partial \theta} = X^T(S - Y) $$

Where:

- $N$ is the number of data points
- $Y$ is the vector of true labels, with $Y_i$ the label (0 or 1) of the $i$th data point
- $S$ is the vector of predicted probabilities, with $S_i$ the predicted probability of the positive class for the $i$th data point
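
One way to gain confidence in the vector form $X^T(S - Y)$ is to compare it with a finite-difference approximation of the summed loss. The sketch below uses randomly generated toy data and hypothetical helper names; it illustrates the check and is not part of any library.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def summed_log_loss(theta, X, y):
    """Negative log-likelihood summed (not averaged) over the samples."""
    s = sigmoid(X @ theta)
    return -np.sum(y * np.log(s) + (1.0 - y) * np.log(1.0 - s))

def analytic_gradient(theta, X, y):
    """Vector form of the gradient: X^T (S - Y)."""
    return X.T @ (sigmoid(X @ theta) - y)

def numeric_gradient(theta, X, y, h=1e-6):
    """Central finite differences, one coordinate of theta at a time."""
    grad = np.zeros_like(theta)
    for i in range(theta.size):
        step = np.zeros_like(theta)
        step[i] = h
        grad[i] = (summed_log_loss(theta + step, X, y)
                   - summed_log_loss(theta - step, X, y)) / (2.0 * h)
    return grad

rng = np.random.default_rng(1)
X = np.hstack([rng.normal(size=(20, 3)), np.ones((20, 1))])   # 3 features + bias column
y = rng.integers(0, 2, size=20).astype(float)
theta = rng.normal(size=4)

difference = analytic_gradient(theta, X, y) - numeric_gradient(theta, X, y)
print(np.max(np.abs(difference)))   # should be close to zero
```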



```python
import numpy as np
from mllib.logistic_regression import LogisticRegression as LogisticRegression_mllib
from sklearn.linear_model import LogisticRegression as LogisticRegression_sklearn
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Synthetic binary classification problem with a 70/30 train/test split.
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Fit the mllib implementation and score it on the held-out test set.
lr_mllib = LogisticRegression_mllib().fit(X_train, y_train)
y_pred = lr_mllib.predict(X_test)
mllib_lr_acc = accuracy_score(y_test, y_pred)

# Fit scikit-learn's implementation on the same split for comparison.
lr_sklearn = LogisticRegression_sklearn().fit(X_train, y_train)
y_pred = lr_sklearn.predict(X_test)
sklearn_lr_acc = accuracy_score(y_test, y_pred)

print(f"Mllib LogisticRegression Accuracy: {mllib_lr_acc:5.3f}")
print(f"Sklearn LogisticRegression Accuracy: {sklearn_lr_acc:5.3f}")
```

Output:

```
Mllib LogisticRegression Accuracy: 0.860
Sklearn LogisticRegression Accuracy: 0.847
```


Logistic Regression links:  

 - https://ai.plainenglish.io/l1-lasso-and-l2-ridge-regularizations-in-logistic-regression-53ab6c952f15  
 - https://ml-explained.com/blog/logistic-regression-explained