# Logistic Regression
Logistic regression is a statistical method used for binary classification, where the outcome or dependent variable is categorical (usually 0 or 1, yes or no, true or false). Unlike linear regression, which predicts continuous values, logistic regression predicts the probability that an observation belongs to a certain class. It uses the logistic (sigmoid) function to model the relationship between the independent variables and the probability of the dependent variable being in a particular class.

# Why Linear Regression Cannot Be Used for Classification
1) Outliers: Linear regression is sensitive to outliers. In a classification setting, a single extreme value in the feature space can tilt the fitted line and shift the point where it crosses the decision threshold, leading to incorrect predictions.
2) Unbounded output: A fitted line can produce values below 0 or above 1, which cannot be interpreted as probabilities. Logistic regression "squashes" the line with the sigmoid function, so outputs are constrained between 0 and 1, making it more appropriate for classification tasks.
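
Both problems can be seen on a tiny example. The sketch below fits an ordinary least-squares line to binary labels using the closed-form formulas; the data values and the helper names `fit_line` and `crossing_point` are made up for illustration.

```python
def fit_line(xs, ys):
    """Closed-form least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def crossing_point(slope, intercept, threshold=0.5):
    """x value at which the fitted line reaches the threshold."""
    return (threshold - intercept) / slope

# Toy data (illustrative): label is 0 for small x and 1 for large x.
xs = [1, 2, 3, 4, 5, 6]
ys = [0, 0, 0, 1, 1, 1]
slope, intercept = fit_line(xs, ys)

# Problem 2: the line is unbounded, so "probabilities" leave [0, 1].
pred_at_10 = slope * 10 + intercept   # greater than 1
pred_at_0 = slope * 0 + intercept     # less than 0

# Problem 1: one extreme (but correctly labelled) point moves the boundary.
slope_o, intercept_o = fit_line(xs + [30], ys + [1])
boundary_clean = crossing_point(slope, intercept)        # 3.5
boundary_outlier = crossing_point(slope_o, intercept_o)  # about 4.5

print(boundary_clean, boundary_outlier)
```

With the outlier included, the 0.5 crossing moves past x = 4, so a point that was classified correctly before is now predicted as class 0.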

# Sigmoid Function

Logistic regression solves classification problems by using the sigmoid function, which maps any real-valued input into a probability between 0 and 1. The sigmoid function ensures that the output can be interpreted as a probability, making it ideal for binary classification. The model then assigns class labels based on a threshold, typically 0.5.

$$
\sigma(z) = \frac{1}{1 + e^{-z}}
$$

Where:
- $z$ is the input to the function (the linear combination of weights and features).
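
A minimal sketch of the sigmoid and the thresholding step, using only the standard library. The function names and the numerically stable branch for negative inputs are choices made for this example, not something the text prescribes.

```python
import math

def sigmoid(z: float) -> float:
    """Logistic function 1 / (1 + e^(-z)), written to avoid overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # For negative z, use the equivalent form e^z / (1 + e^z),
    # so that exp() is never called with a large positive argument.
    ez = math.exp(z)
    return ez / (1.0 + ez)

def predict_label(z: float, threshold: float = 0.5) -> int:
    """Turn the probability into a class label using a threshold."""
    return 1 if sigmoid(z) >= threshold else 0

print(sigmoid(0.0))         # 0.5
print(predict_label(3.0))   # 1
print(predict_label(-3.0))  # 0
```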


# Log-Loss Function

Combining the sigmoid output with a squared-error cost (as used in linear regression) produces a non-convex optimization landscape, which makes gradient descent prone to getting stuck in local minima. This is why logistic regression instead minimizes the log-loss (or cross-entropy) cost function, which is convex in the model parameters, so gradient-based optimization methods can converge to the global minimum.

$$
\text{Log-Loss} = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log(h_\theta(x^{(i)})) + (1 - y^{(i)}) \log(1 - h_\theta(x^{(i)})) \right]
$$

Where:
- $y^{(i)}$ is the true label (0 or 1).
- $h_\theta(x^{(i)})$ is the predicted probability from the sigmoid function.
- $m$ is the number of training examples.
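
The formula translates almost line by line into code. The sketch below is one possible implementation; the clipping constant `eps` is an assumption added here to avoid `log(0)` and is not part of the formula itself.

```python
import math

def log_loss(y_true, y_prob, eps=1e-15):
    """Average binary cross-entropy over m examples."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        # Clip the probability so log() never receives exactly 0.
        p = min(max(p, eps), 1.0 - eps)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(y_true)

# Illustrative values: confident correct predictions give a low loss,
# confident wrong predictions give a high loss.
good = log_loss([1, 0, 1], [0.9, 0.1, 0.8])
bad = log_loss([1, 0, 1], [0.1, 0.9, 0.2])
print(good < bad)  # True
```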

# Performance Metrics

## Confusion Matrix
A confusion matrix is a performance measurement tool used for classification models. It provides a summary of prediction results by comparing the actual labels to the predicted labels. The matrix has four components:

- True Positives (TP): Correct predictions where the model predicted the positive class and the actual class was positive.
- True Negatives (TN): Correct predictions where the model predicted the negative class and the actual class was negative.
- False Positives (FP): Incorrect predictions where the model predicted the positive class but it was actually negative (also called Type I error).
- False Negatives (FN): Incorrect predictions where the model predicted the negative class but it was actually positive (Type II error).

The confusion matrix helps evaluate classification accuracy, precision, recall, and other metrics. Accuracy is the fraction of all predictions that are correct:
$$
\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}
$$
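
As a minimal sketch, the four counts and the accuracy can be computed from two lists of 0/1 labels. The label lists and function names below are made up for illustration.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, TN, FP, FN) for binary labels encoded as 0 and 1."""
    tp = tn = fp = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if actual == 1 and predicted == 1:
            tp += 1
        elif actual == 0 and predicted == 0:
            tn += 1
        elif actual == 0 and predicted == 1:
            fp += 1  # Type I error
        else:
            fn += 1  # Type II error
    return tp, tn, fp, fn

def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
tp, tn, fp, fn = confusion_counts(y_true, y_pred)
print(tp, tn, fp, fn)            # 3 3 1 1
print(accuracy(tp, tn, fp, fn))  # 0.75
```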

## Precision
Precision measures the accuracy of positive predictions. It is the ratio of true positives (TP) to the total number of positive predictions (TP + FP). Precision answers the question: "Of all the instances predicted as positive, how many were correct?"
$$
\text{Precision} = \frac{TP}{TP + FP}
$$

## Recall
Recall (or Sensitivity) measures how well the model identifies actual positives. It is the ratio of true positives (TP) to the total actual positives (TP + FN). Recall answers the question: "Of all the actual positive instances, how many were correctly predicted?"
$$
\text{Recall} = \frac{TP}{TP + FN}
$$
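
The precision and recall formulas can be sketched directly from the confusion-matrix counts. The counts below are hypothetical, and returning 0.0 when a denominator is zero is a convention assumed for this example.

```python
def precision(tp: int, fp: int) -> float:
    """TP / (TP + FP); assumed to be 0.0 when nothing was predicted positive."""
    return tp / (tp + fp) if (tp + fp) > 0 else 0.0

def recall(tp: int, fn: int) -> float:
    """TP / (TP + FN); assumed to be 0.0 when there are no actual positives."""
    return tp / (tp + fn) if (tp + fn) > 0 else 0.0

# Hypothetical counts: 8 true positives, 2 false positives, 8 false negatives.
tp, fp, fn = 8, 2, 8
p = precision(tp, fp)  # 8 / 10 = 0.8
r = recall(tp, fn)     # 8 / 16 = 0.5
print(p, r)
```

Here the model is usually right when it predicts positive (precision 0.8) but finds only half of the actual positives (recall 0.5).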

## F-Beta Score
The F-beta score is a weighted harmonic mean of precision and recall, where the weight is determined by the parameter β. A higher β value (β > 1) gives more importance to recall, while a lower β value (β < 1) gives more weight to precision. The most common case is the F1 score (β = 1), which balances precision and recall equally.
$$
F_{\beta} = (1 + \beta^2) \cdot \frac{\text{Precision} \cdot \text{Recall}}{(\beta^2 \cdot \text{Precision}) + \text{Recall}}
$$
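
To see how β shifts the balance, the sketch below evaluates the formula for β = 0.5, 1 and 2 on one hypothetical precision/recall pair. The function name `fbeta` and the zero-denominator convention are assumptions of this example.

```python
def fbeta(precision: float, recall: float, beta: float = 1.0) -> float:
    """Weighted harmonic mean of precision and recall."""
    b2 = beta ** 2
    denom = b2 * precision + recall
    if denom == 0:
        return 0.0  # assumed convention when both inputs are 0
    return (1 + b2) * precision * recall / denom

# Hypothetical model with high precision and low recall.
p, r = 0.8, 0.5
f_half = fbeta(p, r, beta=0.5)  # about 0.714, pulled toward precision
f_one = fbeta(p, r, beta=1.0)   # about 0.615
f_two = fbeta(p, r, beta=2.0)   # about 0.541, pulled toward recall
print(f_half, f_one, f_two)
```

Because precision is higher than recall in this example, the score drops as β grows and recall is weighted more heavily.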

### F-1 Score
The F1 score is the harmonic mean of precision and recall, with equal importance given to both (β = 1). It is widely used when precision and recall are equally important and you want to balance both.
$$
F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
$$


### F-0.5 Score
This score gives more weight to precision over recall. A low β value like 0.5 favors models that make fewer false positive errors, meaning precision is prioritized. It is useful when false positives are more costly than false negatives.
$$
F_{0.5} = (1 + 0.5^2) \cdot \frac{\text{Precision} \cdot \text{Recall}}{(0.5^2 \cdot \text{Precision}) + \text{Recall}}
$$


### F-2 Score
This score gives more weight to recall over precision (β = 2). It is useful when false negatives are more costly than false positives, meaning that identifying as many positives as possible is more important than avoiding false alarms.
$$
F_2 = (1 + 2^2) \cdot \frac{\text{Precision} \cdot \text{Recall}}{(2^2 \cdot \text{Precision}) + \text{Recall}}
$$

# Logistic Regression One-vs-Rest

**One-vs-Rest (OvR)** is a strategy used in **logistic regression** to extend the binary classification model to handle **multiclass classification** problems. In this approach, a separate binary classifier is trained for each class. Each classifier treats one class as the positive class (label 1) and all other classes as the negative class (label 0).

For a dataset with $n$ classes, OvR creates $n$ logistic regression models. During prediction, each model outputs a probability, and the class with the highest probability is chosen as the final prediction.

This method is simple and widely used because it allows logistic regression to work with multiclass data while maintaining the advantages of the binary classifier.
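
The OvR procedure can be sketched in plain Python. Everything below is illustrative: the toy data, the function names, and the choice of batch gradient descent with a fixed learning rate and epoch count are assumptions of this example, not a reference implementation.

```python
import math

def sigmoid(z):
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def score(w, b, x):
    """Predicted probability of the positive class for one sample."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)

def train_binary(X, y, lr=0.1, epochs=2000):
    """Fit one binary logistic regression with batch gradient descent."""
    m, n_features = len(X), len(X[0])
    w, b = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for xi, yi in zip(X, y):
            err = score(w, b, xi) - yi  # gradient of log-loss w.r.t. z
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * gj / m for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / m
    return w, b

def train_ovr(X, y):
    """Train one classifier per class: that class is 1, all others are 0."""
    models = {}
    for c in sorted(set(y)):
        binary_y = [1 if label == c else 0 for label in y]
        models[c] = train_binary(X, binary_y)
    return models

def predict_ovr(models, x):
    """Pick the class whose classifier outputs the highest probability."""
    probs = {c: score(w, b, x) for c, (w, b) in models.items()}
    return max(probs, key=probs.get)

# Toy data (illustrative): three well-separated clusters in 2-D.
X = [[0.0, 0.0], [0.5, 0.2], [0.2, 0.5],
     [5.0, 0.0], [5.2, 0.4], [4.8, 0.3],
     [0.0, 5.0], [0.3, 5.2], [0.4, 4.8]]
y = [0, 0, 0, 1, 1, 1, 2, 2, 2]

models = train_ovr(X, y)
print(predict_ovr(models, [0.1, 0.1]))
print(predict_ovr(models, [5.0, 0.2]))
print(predict_ovr(models, [0.2, 5.0]))
```

Note that the three probabilities come from independent models and do not sum to 1; only their ranking is used for the final prediction.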