# Logistic Regression

## Definition

Logistic regression is a statistical model and a type of regression analysis used for **binary classification tasks**, where the goal is to predict a binary outcome (usually denoted as 0 or 1, Yes or No, True or False, etc.) based on one or more independent variables (features or predictors). 

$$
P(Y = 1|X) = \frac{1}{1 + e^{-(\beta_0 + \beta_1X_1 + \beta_2X_2 + \ldots + \beta_pX_p)}}
$$

In this formula:

- $P(Y = 1|X)$ represents the probability that the binary outcome $Y$ equals 1 given the values of the independent variables $X$.
- $\beta_0$ is the intercept or bias term.
- $\beta_1$, $\beta_2$, $\ldots$, $\beta_p$ are the coefficients associated with the independent variables $X_1$, $X_2$, $\ldots$, $X_p$.

This formula calculates the probability that the binary outcome is 1 based on a linear combination of the independent variables, transformed using the **sigmoid (logistic) function** to ensure that the predicted probability is between 0 and 1. The values of the coefficients $\beta$ are estimated from the training data using maximum likelihood estimation.
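Rearranging the formula for the linear term makes the role of the coefficients more concrete. The following short derivation is a standard identity, written with the same symbols as above:

$$
\ln\left(\frac{P(Y = 1|X)}{1 - P(Y = 1|X)}\right) = \beta_0 + \beta_1X_1 + \beta_2X_2 + \ldots + \beta_pX_p
$$

The left-hand side is the log-odds (logit) of the outcome. Increasing $X_j$ by one unit while holding the other variables fixed therefore adds $\beta_j$ to the log-odds, which multiplies the odds by $e^{\beta_j}$.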

The sigmoid activation function is the following:
$$
\sigma(x) = \frac{1}{1 + e^{-x}}
$$

In this formula:

- $\sigma(x)$ represents the sigmoid function.
- $x$ is the input to the sigmoid function.
- $e$ is the base of the natural logarithm (approximately equal to 2.71828).
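As a rough illustration, the sigmoid and the resulting probability can be computed with NumPy. The coefficient values and the observation below are made up for the example, and the `sigmoid` helper is a sketch (written to avoid overflow for large inputs), not the implementation used later in this document.

```python
import numpy as np

def sigmoid(x):
    """Sigmoid 1 / (1 + exp(-x)), arranged so that exp never overflows."""
    x = np.asarray(x, dtype=float)
    e = np.exp(-np.abs(x))  # always in (0, 1]
    # For x >= 0 use 1 / (1 + exp(-x)); for x < 0 use exp(x) / (1 + exp(x)).
    return np.where(x >= 0, 1.0 / (1.0 + e), e / (1.0 + e))

# Hypothetical coefficients (intercept first) and one observation.
beta = np.array([-1.0, 0.5, 2.0])
x_obs = np.array([1.0, 2.0, 0.25])  # the leading 1.0 multiplies the intercept
prob = sigmoid(beta @ x_obs)        # P(Y = 1 | X) for this observation
print(float(prob))
```

Here the linear combination is $-1 + 0.5 \cdot 2 + 2 \cdot 0.25 = 0.5$, so the printed value is $\sigma(0.5)$.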

## Assumptions

1. **Linearity of the Log-Odds:**
   - Logistic regression assumes that the log-odds of the binary outcome is a linear combination of the predictor variables. The relationship between independent variables and log-odds should be linear.

2. **Independence of Observations:**
   - Observations should be independent of each other. This means that the values of the binary outcome for one observation should not be influenced by the values of the binary outcome for other observations. 

3. **No or Little Multicollinearity:**
   - Multicollinearity, where predictor variables are highly correlated, should be avoided. High multicollinearity can lead to unstable coefficient estimates.

4. **Large Sample Size:**
   - A sufficiently large sample size is important for reliable parameter estimation and hypothesis testing. A common rule of thumb is to have at least 10-20 observations of the less frequent outcome per predictor variable.

5. **Binary Dependent Variable:**
   - Standard logistic regression is designed for a binary dependent variable. Ordinal (ordered categorical) and nominal outcomes call for extensions such as ordinal or multinomial logistic regression. If your dependent variable is continuous, consider other modeling techniques.

6. **No Perfect Separation:**
   - Perfect separation, where one or more independent variables perfectly predict the outcome, should be avoided. In such cases, the maximum likelihood estimates may not converge.
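The multicollinearity assumption above can be checked numerically. One common diagnostic, not mentioned in the original text and shown here only as a sketch, is the variance inflation factor (VIF): regress each predictor on the others and compute $1 / (1 - R^2)$. The function name and the synthetic data are assumptions made for this example.

```python
import numpy as np

def variance_inflation_factors(X):
    """Return one VIF per column of X (rows are observations)."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    vifs = np.empty(p)
    for j in range(p):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        # Regress column j on the remaining columns plus an intercept.
        design = np.column_stack([np.ones(n), others])
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        residual = target - design @ coef
        ss_res = residual @ residual
        ss_tot = np.sum((target - target.mean()) ** 2)
        r_squared = 1.0 - ss_res / ss_tot
        vifs[j] = 1.0 / (1.0 - r_squared)
    return vifs

# Synthetic predictors: `c` is almost a copy of `a`, `b` is unrelated.
rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)
c = a + 0.05 * rng.normal(size=500)
vifs = variance_inflation_factors(np.column_stack([a, b, c]))
print(vifs)
```

A frequently quoted rule of thumb treats a VIF above roughly 5 to 10 as a sign of problematic collinearity. In this example the first and third columns should score far above that range, while the second stays near 1.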

## Algorithm

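Fitting a logistic regression model means finding the parameter values that best fit the data. This is typically done with Maximum Likelihood Estimation (MLE). Unlike ordinary least squares, the maximum has no closed-form solution, so it is found iteratively. The derivation below goes through the steps one at a time. In this section, $\theta$ collects the coefficients $\beta_0, \beta_1, \ldots, \beta_p$ into a single vector, and $x$ includes a leading 1 for the intercept.

Logistic Regression Model.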
$$
   P(y=1|x) = \frac{1}{1 + e^{-\theta^T x}}
$$

Likelihood Function. The likelihood function measures how likely the observed data is, given the parameters of the model. For logistic regression, the likelihood $L(\theta)$ is the product of the probabilities assigned to the individual observations. Here $P(y_i|x_i)$ is shorthand for the model's predicted probability $P(y=1|x_i)$ for observation $i$:
$$
   L(\theta) = \prod_{i=1}^{n} P(y_i|x_i)^{y_i} (1 - P(y_i|x_i))^{(1 - y_i)}
$$

Log-Likelihood. It is easier to maximize the log of the likelihood function, because the logarithm turns the product into a sum and simplifies the computation. The log-likelihood $\ell(\theta) = \ln L(\theta)$ is given by:
$$\ell(\theta) = \sum_{i=1}^{n} \ln\left(P(y_i|x_i)^{y_i}(1 - P(y_i|x_i))^{(1 - y_i)}\right)$$
$$\ell(\theta) = \sum_{i=1}^{n} \left[ y_i \ln(P(y_i|x_i)) + (1 - y_i) \ln(1 - P(y_i|x_i)) \right]$$
$$\ell(\theta) = \sum_{i=1}^{n} \left[ y_i \ln\left(\frac{1}{1 + e^{-\theta^T x_i}}\right) + (1 - y_i) \ln\left(1 - \frac{1}{1 + e^{-\theta^T x_i}}\right) \right]$$
$$\ell(\theta) = \sum_{i=1}^{n} \left[ y_i\left[\ln\left(\frac{1}{1+e^{-\theta^T x_i}}\right) - \ln\left(\frac{e^{-\theta^T x_i}}{1+e^{-\theta^T x_i}}\right)\right] + \ln\left(\frac{e^{-\theta^T x_i}}{1+e^{-\theta^T x_i}}\right) \right]$$
$$\ell(\theta) = \sum_{i=1}^{n} \left[ y_i\ln(e^{\theta^T x_i}) + \ln\left(\frac{1}{1+e^{\theta^T x_i}}\right) \right]$$
$$\ell(\theta) = \sum_{i=1}^{n} \left[ y_i\theta^T x_i - \ln(1+e^{\theta^T x_i}) \right]$$
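The simplification above can be sanity-checked numerically. The sketch below evaluates the log-likelihood both in its original form and in its final simplified form on synthetic data and compares the two. The function names, the data, and the parameter values are all invented for the check.

```python
import numpy as np

def log_likelihood_direct(theta, X, y):
    """sum_i [ y_i ln(p_i) + (1 - y_i) ln(1 - p_i) ], p_i = sigmoid(theta^T x_i)."""
    p = 1.0 / (1.0 + np.exp(-(X @ theta)))
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

def log_likelihood_simplified(theta, X, y):
    """sum_i [ y_i theta^T x_i - ln(1 + exp(theta^T x_i)) ]."""
    z = X @ theta
    # np.logaddexp(0, z) computes ln(exp(0) + exp(z)) = ln(1 + exp(z)) stably.
    return np.sum(y * z - np.logaddexp(0.0, z))

# Synthetic data: 200 rows, an intercept column plus two features.
rng = np.random.default_rng(0)
X_check = np.column_stack([np.ones(200), rng.normal(size=(200, 2))])
y_check = (rng.random(200) < 0.5).astype(float)
theta_check = np.array([0.3, -1.2, 0.8])

ll_direct = log_likelihood_direct(theta_check, X_check, y_check)
ll_simplified = log_likelihood_simplified(theta_check, X_check, y_check)
print(ll_direct, ll_simplified)
```

The two printed values should agree up to floating-point error. The simplified form is also the more robust one in practice, since it never takes the logarithm of a probability that has rounded to 0.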

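Gradient of the Log-Likelihood. Differentiating the last expression with respect to $\theta$ gives:
$$\nabla_\theta \ell(\theta) = \sum_{i=1}^{n} \left[ y_i x_i - \frac{e^{\theta^T x_i}}{1+e^{\theta^T x_i}} x_i \right]$$
Since $\frac{e^{\theta^T x_i}}{1+e^{\theta^T x_i}} = \frac{1}{1+e^{-\theta^T x_i}}$ is the predicted probability $P(y_i|x_i)$, this simplifies to:
$$\nabla_\theta \ell(\theta) = \sum_{i=1}^{n} (y_i - P(y_i|x_i))x_i$$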
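One way to gain confidence in the gradient formula is to compare it with a finite-difference approximation of the log-likelihood. The following is a minimal sketch on synthetic data; the helper names and values are assumptions for the example.

```python
import numpy as np

def log_likelihood(theta, X, y):
    z = X @ theta
    return np.sum(y * z - np.logaddexp(0.0, z))

def gradient(theta, X, y):
    """Analytic gradient: sum_i (y_i - p_i) x_i, written as a matrix product."""
    p = 1.0 / (1.0 + np.exp(-(X @ theta)))
    return X.T @ (y - p)

def numerical_gradient(theta, X, y, eps=1e-6):
    """Central finite differences, one coordinate of theta at a time."""
    grad = np.zeros_like(theta)
    for j in range(theta.size):
        step = np.zeros_like(theta)
        step[j] = eps
        upper = log_likelihood(theta + step, X, y)
        lower = log_likelihood(theta - step, X, y)
        grad[j] = (upper - lower) / (2 * eps)
    return grad

rng = np.random.default_rng(1)
X_grad = np.column_stack([np.ones(100), rng.normal(size=(100, 3))])
y_grad = (rng.random(100) < 0.4).astype(float)
theta_grad = np.array([0.1, -0.5, 0.25, 1.0])

analytic = gradient(theta_grad, X_grad, y_grad)
numeric = numerical_gradient(theta_grad, X_grad, y_grad)
print(np.max(np.abs(analytic - numeric)))
```

The printed maximum difference should be very small, which indicates that the analytic expression matches the slope of the log-likelihood.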

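Gradient Ascent. To find the parameters $\theta$ that maximize the log-likelihood, we use gradient ascent (the counterpart of gradient descent used in minimization problems). Each iteration moves $\theta$ a small step in the direction of the gradient, with the step size set by the learning rate $\alpha$: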
$$
   \theta := \theta + \alpha \nabla_\theta \ell(\theta)
$$

Convergence Criteria. This iterative process continues until convergence, which can be determined by a small change in the value of the log-likelihood between iterations, or until a maximum number of iterations is reached.
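The update rule and the stopping criteria can be combined into a short loop. This is a sketch under stated assumptions rather than the implementation used in the Code section: the function name, default values, and synthetic data are invented here, and the gradient is divided by the number of samples, which only rescales $\alpha$.

```python
import numpy as np

def fit_gradient_ascent(X, y, alpha=0.5, tol=1e-8, max_iter=10_000):
    """Gradient ascent on the log-likelihood with two stopping rules."""
    n_samples, n_features = X.shape
    theta = np.zeros(n_features)
    prev_ll = -np.inf
    for n_iter in range(1, max_iter + 1):
        p = 1.0 / (1.0 + np.exp(-(X @ theta)))
        # Ascent step (assumption: gradient averaged over the samples).
        theta = theta + alpha * (X.T @ (y - p)) / n_samples
        z = X @ theta
        ll = np.sum(y * z - np.logaddexp(0.0, z))
        # Stop when the log-likelihood has effectively stopped changing.
        if abs(ll - prev_ll) < tol:
            break
        prev_ll = ll
    return theta, ll, n_iter

# Synthetic data generated from known (made-up) parameters.
rng = np.random.default_rng(0)
X_fit = np.column_stack([np.ones(2000), rng.normal(size=(2000, 2))])
true_theta = np.array([-0.5, 2.0, -1.0])
p_true = 1.0 / (1.0 + np.exp(-(X_fit @ true_theta)))
y_fit = (rng.random(2000) < p_true).astype(float)

theta_hat, final_ll, n_iter = fit_gradient_ascent(X_fit, y_fit)
print(theta_hat, final_ll, n_iter)
```

On data like this the loop should stop well before `max_iter`, with estimates close to the parameters that generated the data. If $\alpha$ is too large the log-likelihood can oscillate instead of increasing, so the tolerance check alone is not a guarantee of convergence.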

## Code

### Numpy Version

In [1]:
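import warnings
warnings.filterwarnings('ignore')  # silence library warnings in the notebook output

import numpy as np
import pandas as pd
import sklearn
from datasets import load_dataset  # Hugging Face `datasets` package

# Local helper module (not shown here) with the NumPy implementation
from logistic_regression_numpy import model_fit, predict, predict_prob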

In [2]:
dataset = load_dataset("imodels/diabetes-readmission", split='train')
df = dataset.to_pandas()
df.head(5)

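| Unnamed: 0 | time_in_hospital | num_lab_procedures | num_procedures | num_medications | number_outpatient | number_emergency | number_inpatient | number_diagnoses | change | diabetesMed | ... | glyburide-metformin:Up | A1Cresult:>7 | A1Cresult:>8 | A1Cresult:None | A1Cresult:Norm | max_glu_serum:>200 | max_glu_serum:>300 | max_glu_serum:None | max_glu_serum:Norm | readmitted |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 2.0 | 38.0 | 3.0 | 27.0 | 0.0 | 1.0 | 2.0 | 7.0 | 1.0 | 1.0 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0 |
| 1 | 4.0 | 48.0 | 0.0 | 11.0 | 0.0 | 0.0 | 0.0 | 9.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0 |
| 2 | 2.0 | 28.0 | 0.0 | 15.0 | 0.0 | 3.0 | 4.0 | 9.0 | 0.0 | 1.0 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1 |
| 3 | 4.0 | 44.0 | 0.0 | 10.0 | 0.0 | 0.0 | 0.0 | 7.0 | 0.0 | 1.0 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0 |
| 4 | 3.0 | 54.0 | 0.0 | 8.0 | 0.0 | 0.0 | 0.0 | 8.0 | 1.0 | 1.0 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0 |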


In [3]:
X = np.array(df.iloc[:,:-1])
y = np.array(df.iloc[:,-1])
print(X.shape)
print(y.shape)

(81410, 150)
(81410,)


In [None]:
theta = model_fit(X,y,0.001,100000)

0 0.538570200221103
10000 0.6196904557179708
20000 0.6223559759243337
30000 0.6238914138312247
40000 0.6251566146665029
50000 0.6258322073455349
60000 0.6266429185603735
70000 0.6273185112394055


In [None]:
# Prepend a column of ones for the intercept term
intercept = np.ones((X.shape[0], 1))
X_numpy = np.concatenate((intercept, X), axis=1)
# Predict the class of the first observation
predict(theta, X_numpy[0])

In [None]:
predict(theta, X_numpy[0:5])

In [None]:
predict_prob(theta, X_numpy[0])
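The `logistic_regression_numpy` module is not shown in this document. The sketch below is a hypothetical reconstruction of what its three functions might look like, inferred only from how they are called above. It assumes that the third and fourth arguments of `model_fit` are the learning rate and the iteration count, that the intercept column is added inside `model_fit`, that the gradient is averaged over the samples, and that the logged number is training accuracy. None of this is confirmed by the source. The `threshold` and `log_every` parameters are additions for the example.

```python
import numpy as np

def predict_prob(theta, X):
    """P(y = 1 | x) for rows of X that already include the leading 1."""
    return 1.0 / (1.0 + np.exp(-(X @ theta)))

def predict(theta, X, threshold=0.5):
    """Class labels (0 or 1) obtained by thresholding the probability."""
    return (predict_prob(theta, X) >= threshold).astype(int)

def model_fit(X, y, alpha=0.001, iterations=100000, log_every=10000):
    """Gradient ascent on the log-likelihood; returns theta with intercept first."""
    intercept = np.ones((X.shape[0], 1))
    X = np.concatenate((intercept, X), axis=1)
    theta = np.zeros(X.shape[1])
    for i in range(iterations):
        # Assumption: gradient averaged over the samples.
        gradient = X.T @ (y - predict_prob(theta, X)) / X.shape[0]
        theta = theta + alpha * gradient
        if i % log_every == 0:
            # Assumption: the logged value is training accuracy.
            print(i, np.mean(predict(theta, X) == y))
    return theta

# Small synthetic demonstration (not the diabetes data).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 2))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(float)
theta_demo = model_fit(X_demo, y_demo, 0.1, 2000, log_every=1000)
X_demo_aug = np.concatenate((np.ones((300, 1)), X_demo), axis=1)
print(predict(theta_demo, X_demo_aug[0:5]))
```

With features on very different scales, as in the diabetes data, a plain gradient loop like this tends to converge slowly, so standardizing the features first is usually worthwhile.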

### Sklearn Version

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

In [None]:
# Create a Logistic Regression model
model = LogisticRegression()

# Fit the model on the training data
model.fit(X, y)

In [None]:
model.predict(X[0].reshape(1,-1))

In [None]:
model.predict(X[0:2])

In [None]:
y_pred = model.predict(X)

In [None]:
# Calculate accuracy (on the same data used for fitting, so this is training accuracy)
accuracy = accuracy_score(y, y_pred)
print(f'Accuracy: {accuracy}')

# Generate a confusion matrix
conf_matrix = confusion_matrix(y, y_pred)
print('Confusion Matrix:')
print(conf_matrix)

# Generate a classification report
class_report = classification_report(y, y_pred)
print('Classification Report:')
print(class_report)
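The metrics above are computed on the same rows the model was fitted on, so they tend to be optimistic. A minimal sketch of a held-out evaluation follows. It uses synthetic data from `make_classification` rather than the diabetes dataset, and the split size, scaling step, and `max_iter` value are choices made for this example, not part of the original notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary classification data standing in for the real features.
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=20, n_informative=5, random_state=0
)

# Hold out 20% of the rows; stratify keeps the class balance in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

# Standardize the features, then fit; max_iter is raised to help the solver converge.
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clf.fit(X_train, y_train)

test_accuracy = accuracy_score(y_test, clf.predict(X_test))
print(f'Held-out accuracy: {test_accuracy}')
```

Putting the scaler inside the pipeline means it is fitted on the training rows only, which avoids leaking information from the test set.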