# Classification


## *Logistic Regression*


- Logistic Regression predicts the probability of an instance belonging to a specific class.

- Binary Outcome: The dependent variable is binary, indicating the presence or absence of a characteristic or outcome.
- Logit Function: The logistic regression model uses the logit function to link a linear combination of the input features to the probability of the binary outcome.
- Odds and Probability: The model predicts the log-odds of the outcome, which can be converted to a probability by applying the sigmoid function.


![](https://miro.medium.com/v2/resize:fit:1400/1*dm6ZaX5fuSmuVvM4Ds-vcg.jpeg)

![](https://www.saedsayad.com/images/LogReg_1.png)


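- Sigmoid Function: sigmoid(z) = 1 / (1 + e^(-z)). It maps any real-valued input z to a value between 0 and 1, which is read as a probability.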
![](https://cdn.botpenguin.com/assets/website/Sigmoid_Function_90ec70976d.png)

- Log Odds
 The odds of an event are the probability that it occurs divided by the probability that it does not occur, p/(1-p). Taking the natural logarithm of the odds gives the log-odds.

- When p = sigmoid(f(p)), f is the inverse of the sigmoid, which ends up being the logit function:

- Logit function: the natural logarithm of the odds -> logit(p) = log(p/(1-p))
- Odds: ratio of the probability of success to the probability of failure, p/(1-p)
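The relationship between probability, odds, log-odds, and the sigmoid can be checked numerically. The following is a minimal sketch using only the standard library; the function names `sigmoid`, `odds`, and `logit` are chosen for illustration, and the sigmoid is written in its naive form, which assumes moderately sized inputs (very large negative `z` would overflow `math.exp`).

```python
import math

def sigmoid(z: float) -> float:
    """Map a log-odds value z to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def odds(p: float) -> float:
    """Odds of an event with probability p: p / (1 - p)."""
    return p / (1.0 - p)

def logit(p: float) -> float:
    """Log-odds of p; this is the inverse of the sigmoid."""
    return math.log(odds(p))

p = 0.8
print(odds(p))            # about 4: success is four times as likely as failure
print(logit(p))           # log(4), about 1.386
print(sigmoid(logit(p)))  # recovers 0.8 up to floating-point rounding
```

A probability of 0.5 corresponds to odds of 1 and log-odds of 0, which is why a log-odds of 0 sits on the decision boundary when the classification threshold is 0.5.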


Logistic Regression Optimization:

- Cross-Entropy (CE) Loss 
    Cross-entropy loss, also known as log-loss, is widely used in logistic regression for binary classification problems. It measures the performance of a classification model whose output is a probability value between 0 and 1. The cross-entropy loss increases as the predicted probability diverges from the actual label.

- Maximum Likelihood Estimation (MLE) 
    Maximum Likelihood Estimation (MLE) is a method used to estimate the parameters of a statistical model. In the context of logistic regression, MLE is used to find the values of the coefficients that maximize the likelihood of the observed data under the model.
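Cross-entropy and maximum likelihood are two views of the same objective: the mean cross-entropy is the negative log-likelihood divided by the number of samples, so minimizing one maximizes the other. The sketch below illustrates this with hand-picked probabilities; the helper names and the clipping constant `eps` are assumptions made for the example, not part of any library.

```python
import math

def binary_cross_entropy(y_true, p_pred, eps=1e-15):
    """Mean binary cross-entropy (log-loss) over a list of examples."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

def likelihood(y_true, p_pred):
    """Probability the model assigns to the observed labels (Bernoulli likelihood)."""
    result = 1.0
    for y, p in zip(y_true, p_pred):
        result *= p if y == 1 else (1.0 - p)
    return result

y = [1, 0, 1, 1]
good = [0.9, 0.1, 0.8, 0.7]  # predictions close to the labels
bad = [0.4, 0.6, 0.3, 0.5]   # predictions that lean the wrong way

# Lower loss goes together with higher likelihood.
print(binary_cross_entropy(y, good), likelihood(y, good))
print(binary_cross_entropy(y, bad), likelihood(y, bad))
```

In practice the log-likelihood is summed rather than multiplying raw probabilities, because the product underflows quickly as the number of samples grows.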

Solvers (names as used by scikit-learn's `LogisticRegression`)
- newton-cg (Newton-Conjugate Gradient)
- lbfgs (Limited-memory Broyden-Fletcher-Goldfarb-Shanno)
- liblinear (Library for Large Linear Classification)
- sag (Stochastic Average Gradient)
- saga (Stochastic Average Gradient with L1/L2 Regularization)

newton-cg: Second-order optimization method that typically converges in fewer
iterations than first-order methods. Suitable for small to medium-sized datasets.
Supports L2 regularization

lbfgs: Quasi-Newton method that approximates second-order information
using a limited amount of memory. Suitable for small to medium-sized
datasets. Supports L2 regularization

liblinear: First-order method that uses a coordinate descent
algorithm. Works well for high-dimensional datasets and supports
both L1 and L2 regularization

sag: A first-order method that uses a stochastic gradient
descent algorithm with averaging. Suitable for large-scale datasets.
Supports L2 regularization

saga: A variant of 'sag' that supports L1, L2, and elastic-net regularization.
Suitable for large-scale datasets and sparse features



Choosing the right solver depends on the size of your dataset, the type of regularization you need, and the computational resources available. Here's a guideline:

- Small datasets? → newton-cg or lbfgs for faster convergence

- High-dimensional datasets? → liblinear may be a good choice

- Large-scale datasets? → sag or saga, especially if you have sparse features

- Need L1 regularization? → liblinear or saga. Need elastic net? → saga. L2 is supported by all five solvers
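As an illustration of how a solver is selected, here is a minimal sketch assuming scikit-learn is installed. The synthetic dataset, its size, and `max_iter=1000` are arbitrary choices for the example. Every solver is run with the default L2 penalty, since that is the one setting all five accept.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic binary classification data (illustrative only).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
# sag and saga converge quickly only when features share a similar scale.
X = StandardScaler().fit_transform(X)

scores = {}
for solver in ["newton-cg", "lbfgs", "liblinear", "sag", "saga"]:
    # Default L2 penalty, which all five solvers support.
    model = LogisticRegression(solver=solver, max_iter=1000, random_state=0)
    model.fit(X, y)
    scores[solver] = model.score(X, y)  # accuracy on the training data

for solver, acc in scores.items():
    print(f"{solver}: {acc:.3f}")
```

On a small, well-scaled problem like this one the solvers should reach very similar accuracy; the differences described above show up mainly in training time and memory use on larger data.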



## Classification Performance

- True positive: Model correctly predicts positive for a sample
  (model: positive, actual: positive)

- False positive: Model falsely predicts positive for a sample
  (model: positive, actual: negative)

- True negative: Model correctly predicts negative for a sample
  (model: negative, actual: negative)

- False negative: Model falsely predicts negative for a sample
  (model: negative, actual: positive)

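The two kinds of error rarely cost the same. In disease screening, for example:
- false negative -> very costly (a sick patient goes untreated and may die)
- false positive -> has relatively negligible cost (somebody gets shocked but further tests will confirm no disease)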
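The four outcomes can be counted directly from paired labels and predictions. This is a minimal sketch; the function name `confusion_counts` and the sample data are hypothetical, and labels are assumed to be encoded as 1 for positive and 0 for negative.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, tn, fn) for binary labels where 1 = positive, 0 = negative."""
    tp = fp = tn = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == 1 and actual == 1:
            tp += 1
        elif predicted == 1 and actual == 0:
            fp += 1
        elif predicted == 0 and actual == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn

# Hypothetical screening results: 1 = disease, 0 = no disease.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]

tp, fp, tn, fn = confusion_counts(y_true, y_pred)
print(tp, fp, tn, fn)  # 3 1 3 1
```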


## Confusion matrix

![](https://miro.medium.com/v2/resize:fit:969/1*d0UCCIF10Soi7VQGxdVrWQ.jpeg)


- Precision: Of all the cases the model predicted as positive, how many are actually positive? (i.e., how precise are my predictions). This is the positive predictive value of the classifier: Precision = TP / (TP + FP).
A high precision indicates that the model is effective at making correct
positive predictions and minimizing false positives

- Accuracy: The overall correctness of the model, i.e. the proportion of all predictions that are correct: Accuracy = (TP + TN) / (TP + TN + FP + FN).
- Recall (Sensitivity or True Positive Rate): The proportion of actual positives that were identified correctly: Recall = TP / (TP + FN).
    Out of all the actual positive cases, how many did the model call out? This is the hit rate of the classifier.
    It measures the ability of a model to capture all the relevant instances in the
    dataset
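The metrics above follow directly from the four confusion-matrix counts. The sketch below uses made-up counts; returning 0.0 when a denominator is zero is a convention adopted here for the example, not a universal rule.

```python
def accuracy(tp, fp, tn, fn):
    """Share of all predictions that are correct."""
    return (tp + tn) / (tp + fp + tn + fn)

def precision(tp, fp):
    """Share of predicted positives that are truly positive."""
    return tp / (tp + fp) if (tp + fp) else 0.0

def recall(tp, fn):
    """Share of actual positives that the model found."""
    return tp / (tp + fn) if (tp + fn) else 0.0

# Hypothetical counts: 1000 samples, 100 of them actually positive.
tp, fp, tn, fn = 80, 40, 860, 20
print(accuracy(tp, fp, tn, fn))  # 0.94
print(precision(tp, fp))         # about 0.667
print(recall(tp, fn))            # 0.8

# A model that always predicts "negative" on the same data:
print(accuracy(0, 0, 900, 100))  # 0.9, despite finding no positives
print(recall(0, 100))            # 0.0
```

The second case shows why accuracy alone can mislead on imbalanced data: a model that never predicts the positive class still scores 90% accuracy while its recall is zero.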

## Precision vs Recall
- Cancer Screening (Higher Recall): Higher recall ensures that potential cancer cases are not missed
- Search Engine Results (Higher Precision): Users expect to receive accurate and relevant results, so higher precision is more important
- Manufacturing Quality Control (Higher Precision): When the positive class is "passes inspection", higher precision means fewer defective products are shipped to customers


- F1 Score
The F1 Score (or just F score) is the harmonic mean of Precision and Recall: F1 = 2 * (Precision * Recall) / (Precision + Recall)
    - Ranges from 0 to 1
    - Balanced way of combining precision and recall into a single metric
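Because it is a harmonic mean, the F1 score is pulled toward the smaller of its two inputs. A minimal sketch follows; the function name `f1` is chosen for illustration, and returning 0.0 when both inputs are zero is an assumed convention.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall; 0.0 if both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

print(f1(0.9, 0.9))  # about 0.9: equal inputs give the same value back
print(f1(1.0, 0.1))  # about 0.18, far below the arithmetic mean of 0.55
print(f1(0.0, 1.0))  # 0.0
```

A model cannot reach a high F1 by doing well on only one of precision and recall, which is what makes it a balanced metric.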


## ROC

Definition:
The ROC (Receiver Operating Characteristic) curve plots the True Positive Rate (TPR) against the False Positive Rate (FPR) at various classification thresholds. ROC AUC is the area under that curve.

Interpretation:
- AUC ranges from 0 to 1
- AUC of 0.5 indicates random guessing (no discrimination)
- AUC closer to 1 indicates better model performance
- AUC of 1 represents perfect classification
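To make the threshold sweep concrete, the sketch below builds the ROC points by lowering the threshold one distinct score at a time and then integrates with the trapezoidal rule. The function names and the toy data are assumptions for illustration, and the code assumes both classes are present in `y_true`.

```python
def roc_curve_points(y_true, scores):
    """Return (FPR, TPR) points, one per distinct threshold, from strictest to loosest."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted positive
    tp = fp = 0
    pairs = sorted(zip(scores, y_true), reverse=True)
    i = 0
    while i < len(pairs):
        threshold = pairs[i][0]
        # Admit every sample whose score equals the current threshold (handles ties).
        while i < len(pairs) and pairs[i][0] == threshold:
            if pairs[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / n_neg, tp / n_pos))
    return points

def auc(points):
    """Area under the curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]  # predicted probabilities of the positive class
points = roc_curve_points(y_true, scores)
print(points)       # [(0.0, 0.0), (0.0, 0.5), (0.5, 0.5), (0.5, 1.0), (1.0, 1.0)]
print(auc(points))  # 0.75
```

Here one negative sample (score 0.4) is ranked above one positive sample (score 0.35), so three of the four positive-negative pairs are ordered correctly, which matches the AUC of 0.75.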

![](https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcQRhI6JCzC4ZPY1ls847WCcUe5tm0Cf5UYigw&s)

![](https://miro.medium.com/v2/resize:fit:494/1*EPmzi0GCgdLstsJb6Q8e-w.png)
