# Logistic Regression

Logistic regression is a classification algorithm used when the dependent (target) variable is categorical. In its basic form the output is binary (dichotomous), and the model works best when the two classes are approximately linearly separable.


A dataset is linearly separable when a straight line (or, with more than two features, a hyperplane) can separate the two classes.

### Assumptions

1. Binary response: The response variable can take only two possible outcomes, such as pass/fail, male/female, or malignant/benign.


2. No multicollinearity: The predictor (independent) variables should be independent of each other. Multicollinearity refers to two or more highly correlated independent variables. Such variables do not provide unique information to the regression model and lead to incorrect interpretation.

    This assumption can be checked with the **variance inflation factor (VIF)**, which measures the strength of correlation between the independent variables in a regression model.


3. Large sample size: Logistic regression yields reliable, robust, and valid results when the dataset has a large sample size.


4. No outliers: Another critical assumption of logistic regression is that the dataset contains no extreme outliers.

   This assumption can be verified by calculating **Cook’s distance ($D_i$)** for each observation to identify influential data points that may negatively affect the regression model. When outliers exist, one can apply one of the following solutions:

   1. Remove the outliers,

   2. Replace the outliers with the mean or median, or

   3. Keep the outliers in the model but keep a record of them when reporting the regression results
        
5. Independent observations: The observations in the dataset should be independent of each other. The assumption can be checked by plotting the residuals against time, that is, against the order of the observations. The residuals should look randomly scattered. If a non-random pattern (such as a trend or a cycle) is detected, this assumption may be considered violated.


6. Linear relationship of independent variables to log odds: The independent variables should be linearly related to the log odds of the outcome. Log odds are another way of expressing probabilities, but they are not the same thing: odds are the ratio of success to failure, probability is the ratio of success to everything that can occur, and log odds are the natural logarithm of the odds.
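
The relationship between probability, odds, and log odds described above can be sketched in a few lines of Python. This is only an illustration; the function names (`odds`, `log_odds`, `prob_from_log_odds`) are made up for this example and are not part of any library.

```python
import math

def odds(p):
    """Odds of success: ratio of success to failure, for 0 < p < 1."""
    return p / (1.0 - p)

def log_odds(p):
    """Log odds (logit): natural logarithm of the odds."""
    return math.log(odds(p))

def prob_from_log_odds(z):
    """Invert the logit: map log odds back to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

# Probabilities live in (0, 1); log odds cover the whole real line
for p in (0.2, 0.5, 0.8):
    print(f"p={p:.1f}  odds={odds(p):.2f}  log odds={log_odds(p):+.3f}")
```

A probability of 0.5 corresponds to odds of 1 and log odds of 0, which is why the sign of the log odds indicates which class is more likely.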


### How does logistic regression work

1. It is used to predict probabilities for classification problems.


2. It predicts the probability of occurrence of an event by fitting data to a sigmoid function, the inverse of the logit function; hence it is also known as logit regression. The output value therefore lies between 0 and 1.


3. Similar to linear regression, we have weights and biases here, too. We first multiply the inputs by the weights and add the bias. The result goes into the sigmoid function to give us a probability between 0 and 1.


4. In logistic regression, the weighted sum of the inputs (the linear regression output) is passed through the sigmoid activation function; the resulting curve is called the sigmoid curve (S-shaped curve).

Linear regression output:
$$ z = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_n x_n $$

Sigmoid activation function:
$$ \sigma(z) = \frac{1}{1 + e^{-z}}$$


$$ h_\theta(x) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_n x_n)}}$$


The sigmoid function bounds the hypothesis of logistic regression between 0 and 1:

$$
0 \le h_\theta(x) \le 1
$$

Taking the log odds (logit) of the hypothesis recovers the linear combination of the inputs:

$$\text{logit}(h_\theta(x)) = \log\left(\frac{h_\theta(x)}{1-h_\theta(x)}\right) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_n x_n$$


5. If the output of the sigmoid function is greater than 0.5, the observation is classified as 1; if it is less than 0.5, it is classified as 0. The value 0.5 is only a default, and the threshold can be adjusted.
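
The steps above (weighted sum, sigmoid, threshold) can be sketched as follows. This is a minimal illustration rather than a reference implementation: the coefficients and the input are hypothetical values chosen for the example, and treating a probability exactly equal to the threshold as class 1 is an assumption.

```python
import math

def sigmoid(z):
    """Sigmoid 1 / (1 + e^{-z}), written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def predict_proba(x, weights, bias):
    """Linear part z = bias + sum(w_j * x_j), passed through the sigmoid."""
    z = bias + sum(w * xi for w, xi in zip(weights, x))
    return sigmoid(z)

def predict(x, weights, bias, threshold=0.5):
    """Turn the probability into a class label (ties go to class 1 here)."""
    return 1 if predict_proba(x, weights, bias) >= threshold else 0

# Hypothetical coefficients and input, for illustration only
weights = [0.8, -0.4]
bias = -0.2
x = [2.0, 1.0]

p = predict_proba(x, weights, bias)   # z = -0.2 + 1.6 - 0.4 = 1.0
print(f"probability={p:.4f}  class={predict(x, weights, bias)}")
```

Raising the threshold (for example `predict(x, weights, bias, threshold=0.9)`) makes the model more conservative about predicting class 1.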






![image.png](attachment:image.png)

Img source: https://towardsdatascience.com/introduction-to-logistic-regression-66248243c148

### Cost Function


Cross-entropy loss, or log loss, measures the performance of a classification model whose output is a probability value between 0 and 1. Cross-entropy loss increases as the predicted probability diverges from the actual label. 

<img src="https://miro.medium.com/v2/resize:fit:980/format:webp/1*gAsyT-YdsQZUMF81NTZQdQ.png"  width=" 200"/> 

Actual target = 1: The model is penalized when the predicted value is far from 1, so the loss decreases as the prediction moves closer to 1 (the desired prediction). The loss is 0 when the prediction is 1.

<img src="https://miro.medium.com/v2/resize:fit:1022/format:webp/1*2QLAi8r4BWFZ4AC6aQLzbA.png"  width=" 200"/> 

Actual target = 0: The model is penalized when the predicted value is far from 0, so the loss increases as the prediction moves away from 0 (the desired prediction). The loss is 0 when the prediction is 0.


Using the linear regression cost function (mean squared error) here would result in a non-convex curve with multiple local minima. Logistic regression therefore uses the following cost function, known as the binary cross-entropy loss or logistic loss:

$$
J(\beta) = -\frac{1}{N} \sum_{i=1}^{N} \left[y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i)\right]
$$

For each observation $i$:

1. Let $y_i$ be the actual binary label (0 or 1).

2. Let $\hat{y}_i$ be the predicted probability that the observation belongs to class 1

Compute the log loss for each observation:

1. If $y_i = 1$, the contribution to the loss is $-\log(\hat{y}_i)$. This encourages the predicted probability for class 1 ($\hat{y}_i$) to be close to 1.

2. If $y_i = 0$, the contribution to the loss is $-\log(1 - \hat{y}_i)$. This encourages the predicted probability for class 0 ($(1 - \hat{y}_i)$) to be close to 1.
    
Compute the overall cost function by taking the average (or sum) of the log losses across all observations.
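
The computation described above can be sketched in plain Python. The clipping constant `eps` is an assumption added to avoid taking `log(0)`; the labels and predictions are made-up values for illustration.

```python
import math

def binary_cross_entropy(y_true, y_pred, eps=1e-15):
    """Average log loss over all observations.

    y_true: actual labels (0 or 1)
    y_pred: predicted probabilities of class 1
    """
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)   # clip to keep log() finite
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(y_true)

# Made-up labels and two sets of predictions
y_true = [1, 0, 1, 0]
confident_and_right = [0.9, 0.1, 0.8, 0.2]
mostly_wrong = [0.4, 0.6, 0.3, 0.7]

print(binary_cross_entropy(y_true, confident_and_right))
print(binary_cross_entropy(y_true, mostly_wrong))
```

The second set of predictions sits on the wrong side of 0.5 for every observation, so its loss is noticeably larger than the first.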
    
    
For multiclass classification, the cross-entropy is essentially the negative log-likelihood of the true class labels given the predicted probabilities. For a single observation with $C$ classes, the loss is:

$$
L(y, \hat{y}) = - \sum_{i=1}^{C} y_i \log(\hat{y}_i)
$$

where,

$C$ is the number of classes

$y_i$ is the actual probability of class $i$ (1 for the true class, 0 otherwise)

$\hat{y}_i$ is the predicted probability of class $i$
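
As a small illustration of the multiclass formula, the sketch below computes the loss for one observation with a one-hot label. The function name and the example probabilities are hypothetical, and `eps` is an assumption added to guard against `log(0)`.

```python
import math

def categorical_cross_entropy(y_onehot, y_pred, eps=1e-15):
    """Cross-entropy for one observation: -sum_i y_i * log(y_hat_i)."""
    return -sum(y * math.log(max(p, eps)) for y, p in zip(y_onehot, y_pred))

# Three classes; the true class is the second one
y_onehot = [0, 1, 0]
y_pred = [0.2, 0.7, 0.1]

loss = categorical_cross_entropy(y_onehot, y_pred)
print(f"loss={loss:.4f}")
```

With a one-hot label only the true class contributes, so the loss reduces to the negative log of the probability assigned to that class.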

### Types of Logistic regression

1. Binary logistic regression: Some examples of the output of this regression type may be, success/failure, 0/1, or true/false.
    
eg. Deciding on whether or not to offer a loan to a bank customer: Outcome = yes or no.

2. Multinomial logistic regression: The categorical dependent variable has three or more discrete outcomes with no natural order.

eg. Estimating the type of food consumed by pets, the outcome may be wet food, dry food, or junk food.

3. Ordinal logistic regression: Applies when the dependent variable is in an ordered state (i.e., ordinal). The dependent variable (y) has categories or levels with a natural order, typically three or more.

eg. Scores on a math test: Outcomes = Poor/Average/Good

### Difference between Linear and Logistic Regression

1. Logistic regression is technically a regression model, since it estimates a continuous probability, but in practice it is used almost exclusively as a classification algorithm.

2. In Linear regression, the output should be continuous like price & age, whereas in Logistic regression the output must be categorical like either Yes / No or 0/1.


3. Linear regression uses mean squared error as its cost function. If this is used for logistic regression, it becomes a non-convex function of the parameters (theta). Gradient descent is guaranteed to converge to the global minimum only if the function is convex.


4. Why is it called logistic regression and not logistic classification? The logistic regression model outputs probabilities whose log odds (the logit form) have a linear relationship with the predictor variables. Only when a threshold is applied to these probabilities are the outcomes classified as 1 or 0 (binomial logistic regression). Hence, even though logistic regression is used as a classification algorithm, it has the word regression in its name.


5. Univariate Logistic Regression means the output variable is predicted using only one predictor variable, while Multivariate Logistic Regression means output variable is predicted using multiple predictor variables.


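6. Fitting logistic regression by minimizing the log loss is the same as maximum likelihood estimation (MLE) under a probabilistic model in which each label follows a Bernoulli distribution with probability $h_\theta(x)$.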
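
Points 3 and 6 above can be illustrated with a small gradient descent loop that minimizes the average log loss, which is the same as maximizing the Bernoulli likelihood. This is a sketch under stated assumptions: the toy data (hours studied against pass or fail), the learning rate, and the number of epochs are all made up for the example.

```python
import math

def sigmoid(z):
    """Sigmoid written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def log_loss(X, y, weights, bias, eps=1e-15):
    """Average negative Bernoulli log-likelihood (binary cross-entropy)."""
    total = 0.0
    for xi, yi in zip(X, y):
        p = sigmoid(bias + sum(w * v for w, v in zip(weights, xi)))
        p = min(max(p, eps), 1.0 - eps)
        total += yi * math.log(p) + (1 - yi) * math.log(1.0 - p)
    return -total / len(y)

def fit_logistic(X, y, lr=0.1, epochs=5000):
    """Batch gradient descent on the average log loss."""
    n, n_features = len(X), len(X[0])
    weights, bias = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for xi, yi in zip(X, y):
            p = sigmoid(bias + sum(w * v for w, v in zip(weights, xi)))
            err = p - yi                      # derivative of the loss w.r.t. z
            for j, v in enumerate(xi):
                grad_w[j] += err * v
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias

# Hypothetical toy data: hours studied -> passed (1) or failed (0)
X = [[0.5], [1.0], [1.5], [2.0], [2.5], [3.0], [3.5], [4.0]]
y = [0, 0, 0, 1, 0, 1, 1, 1]

initial_loss = log_loss(X, y, [0.0], 0.0)    # all-zero parameters predict 0.5 everywhere
weights, bias = fit_logistic(X, y)
final_loss = log_loss(X, y, weights, bias)
print(f"loss: {initial_loss:.4f} -> {final_loss:.4f}")
```

Because the log loss is convex in the parameters, a sufficiently small learning rate moves the loss steadily toward its single minimum; the same loop applied to mean squared error on sigmoid outputs would not have that guarantee.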

<img src="https://static.javatpoint.com/tutorial/machine-learning/images/linear-regression-vs-logistic-regression.png"  width=" 400"/> 





### Additional terminologies:

#### Variance inflation factor (VIF)

A variance inflation factor (VIF) is a measure of the amount of multicollinearity in regression analysis. Multicollinearity exists when there is a correlation between multiple independent variables in a multiple regression model. This can adversely affect the regression results.
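
A common way to compute the VIF of predictor $j$ is $\text{VIF}_j = 1 / (1 - R_j^2)$, where $R_j^2$ comes from regressing predictor $j$ on all the other predictors. The sketch below follows that definition using NumPy (assumed to be installed); the function name and the synthetic data are hypothetical.

```python
import numpy as np

def variance_inflation_factors(X):
    """VIF for each column of X, via an auxiliary least-squares regression."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    vifs = []
    for j in range(k):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])      # add an intercept
        coef, *_ = np.linalg.lstsq(A, target, rcond=None)
        resid = target - A @ coef
        ss_res = float(resid @ resid)
        ss_tot = float(((target - target.mean()) ** 2).sum())
        # 1 / (1 - R^2) simplifies to ss_tot / ss_res
        vifs.append(ss_tot / ss_res if ss_res > 0 else float("inf"))
    return vifs

# Synthetic data: x3 is almost a copy of x1, x2 is unrelated
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = rng.normal(size=200)
x3 = x1 + 0.1 * rng.normal(size=200)

vifs = variance_inflation_factors(np.column_stack([x1, x2, x3]))
print([round(v, 2) for v in vifs])
```

A VIF near 1 indicates little correlation with the other predictors. Values above roughly 5 to 10 are often quoted as a sign of problematic multicollinearity, although the exact cutoff is a convention rather than a rule.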



#### Cook’s Distance

Cook’s Distance is a summary of how much a regression model changes when the ith observation is removed. The larger the value for Cook’s distance, the more influential a given observation.

Cook’s distance is commonly used to find outliers. Its formula is:

$$
D_i = \left( \frac{r_i^2}{p \times \text{MSE}} \right) \times \left( \frac{h_{ii}}{(1 - h_{ii})^2} \right)
$$ 

where

- $r_i$ is the $i$-th residual
- $p$ is the number of coefficients in the regression model
- MSE is the mean squared error
- $h_{ii}$ is the $i$-th leverage value

A rule of thumb is to investigate any point whose Cook’s distance is more than 3 times the mean of all the distances. One approach is to compute Cook’s distance on the fitted model, remove the observations above that threshold, refit, and compare the adjusted R-squared. If it increases, the model fit has improved.

Cons: For models other than linear regression, computing Cook’s distance for each row requires retraining the model with that row removed, which makes the method computationally expensive.


Note: Just because an observation is influential doesn’t necessarily mean that it should be deleted from the dataset.
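
The formula and the 3x-mean rule of thumb above can be sketched for an ordinary least-squares fit using NumPy (assumed to be installed). The function name and the data are hypothetical; the MSE here uses $n - p$ degrees of freedom, which is a common convention but an assumption relative to the text.

```python
import numpy as np

def cooks_distance(X, y):
    """Cook's distance for each observation of a linear regression fit."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n = X.shape[0]
    A = np.column_stack([np.ones(n), X])       # design matrix with intercept
    p = A.shape[1]                             # number of coefficients
    # Hat matrix H = A (A^T A)^{-1} A^T; its diagonal holds the leverages h_ii
    H = A @ np.linalg.pinv(A.T @ A) @ A.T
    h = np.diag(H)
    resid = y - H @ y                          # residuals r_i
    mse = float(resid @ resid) / (n - p)
    return (resid ** 2 / (p * mse)) * (h / (1.0 - h) ** 2)

# Made-up data on the line y = 2x + 1, with the last point pushed far off it
x = np.arange(10.0)
y = 2.0 * x + 1.0
y[9] = 40.0

d = cooks_distance(x, y)
flagged = np.where(d > 3 * d.mean())[0]        # rule of thumb: more than 3x the mean
print("flagged observations:", flagged.tolist())
```

Flagged observations are candidates for a closer look rather than automatic removal, in line with the note above.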



#### References:

1. https://towardsdatascience.com/introduction-to-logistic-regression-66248243c148
2. https://www.spiceworks.com/tech/artificial-intelligence/articles/what-is-logistic-regression/
3. https://ml-cheatsheet.readthedocs.io/en/latest/loss_functions.html
4. https://www.javatpoint.com/linear-regression-vs-logistic-regression-in-machine-learning
5. https://www.analyticsvidhya.com/blog/2020/03/what-is-multicollinearity/
6. https://towardsdatascience.com/identifying-outliers-in-linear-regression-cooks-distance-9e212e9136a
7. https://www.statology.org/cooks-distance-python/