# Logistic Regression

**Logistic Regression** is a supervised learning algorithm used for classification tasks. Despite the word "regression" in its name, it is mainly used for binary classification rather than for predicting continuous values. It estimates the probability that a given input belongs to a certain category.

## Analyzing Salaries in Orange
![](./logistic_regression_orange.png) 


## Key Concepts

### 1. **Binary Classification**
Logistic regression is primarily used for binary classification problems where the target variable has two possible outcomes, such as:
- Spam vs. Not Spam
- Purchased vs. Not Purchased
- Disease vs. No Disease

### 2. **Sigmoid Function**
At the core of logistic regression is the **sigmoid function**, which maps any real-valued number to a value between 0 and 1. The function can be expressed as:

$$ \text{sigmoid}(z) = \frac{1}{1 + e^{-z}} $$


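Here $z$ is the linear combination of the input features, $z = \theta_0 + \theta_1 x_1 + \dots + \theta_p x_p$, where $\theta_0$ is the bias and $\theta_1, \dots, \theta_p$ are the weights. The sigmoid function converts this linear score into a probability between 0 and 1.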
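As a minimal sketch of the formula above, the following Python computes a linear score and passes it through the sigmoid. The weights, bias, and input values are made up for illustration, and the helper names (`sigmoid`, `linear_score`) are hypothetical rather than part of any library.

```python
import math


def sigmoid(z: float) -> float:
    """Map any real number to a value in (0, 1)."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # For negative z, use the equivalent form e^z / (1 + e^z)
    # so that exp() never receives a large positive argument.
    ez = math.exp(z)
    return ez / (1.0 + ez)


def linear_score(theta, bias, x):
    """Linear combination of the input features: bias + sum(theta_j * x_j)."""
    return bias + sum(t * xi for t, xi in zip(theta, x))


# Illustrative values only (not taken from the salary analysis).
theta = [0.8, -0.4]
bias = 0.1
x = [2.0, 1.0]

z = linear_score(theta, bias, x)  # 0.1 + 0.8*2.0 - 0.4*1.0
p = sigmoid(z)                    # probability of the positive class
```

A score of 0 maps to a probability of exactly 0.5, large positive scores approach 1, and large negative scores approach 0.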

### 3. **Decision Boundary**
The decision boundary separates the two predicted classes. It is defined by a threshold on the predicted probability, typically 0.5. If the sigmoid function's output is at least 0.5, the data point is classified as class 1 (positive class); otherwise, it is classified as class 0 (negative class).

$$
\hat{y} =
\begin{cases} 
1, & \text{if } P(y=1|X) \geq 0.5 \\
0, & \text{if } P(y=1|X) < 0.5 
\end{cases}
$$
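The thresholding rule above can be sketched in a few lines of Python. The function name `predict_label` and the sample probabilities are assumptions made for illustration; the threshold is exposed as a parameter because 0.5 is only the typical default.

```python
def predict_label(probability: float, threshold: float = 0.5) -> int:
    """Return 1 if the predicted probability reaches the threshold, else 0."""
    return 1 if probability >= threshold else 0


# Illustrative predicted probabilities P(y=1|X).
probabilities = [0.12, 0.5, 0.93]
labels = [predict_label(p) for p in probabilities]  # [0, 1, 1]
```

Because the sigmoid equals 0.5 exactly when its input is 0, a threshold of 0.5 on the probability corresponds to the linear score being 0.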

### 4. **Cost Function**
Logistic regression uses a cost function derived from maximum likelihood estimation, known as the **log loss** or **binary cross-entropy loss**:

$$
J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log(h_\theta(x^{(i)})) + (1 - y^{(i)}) \log(1 - h_\theta(x^{(i)})) \right]
$$

Where:
- $y^{(i)}$ is the actual label (0 or 1),
- $h_\theta(x^{(i)})$ is the predicted probability,
- $m$ is the number of training examples.
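To make the cost function concrete, here is a small sketch that evaluates it for a list of labels and predicted probabilities. The function name `log_loss`, the clipping constant `eps`, and the sample values are assumptions for illustration; clipping is a common practical guard against taking `log(0)` and is not part of the formula itself.

```python
import math


def log_loss(y_true, y_prob, eps=1e-15):
    """Binary cross-entropy averaged over m training examples."""
    m = len(y_true)
    total = 0.0
    for y, p in zip(y_true, y_prob):
        # Keep p strictly inside (0, 1) so both logarithms are defined.
        p = min(max(p, eps), 1.0 - eps)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / m


# Illustrative labels and two sets of predicted probabilities.
y_true = [1, 0, 1, 0]
good_predictions = [0.9, 0.1, 0.8, 0.2]
poor_predictions = [0.4, 0.6, 0.3, 0.7]

good_cost = log_loss(y_true, good_predictions)
poor_cost = log_loss(y_true, poor_predictions)  # larger than good_cost
```

Predictions that put high probability on the correct label give a lower cost, while confident wrong predictions are penalized heavily.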

### 5. **Gradient Descent**
To minimize the cost function, logistic regression can use **gradient descent** to update the model's parameters (weights and bias). It computes the gradient of the cost function with respect to each parameter and iteratively moves the parameters in the opposite direction until the cost stops decreasing.
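For the unregularized log loss, the partial derivative with respect to a weight is $\frac{\partial J}{\partial \theta_j} = \frac{1}{m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) x_j^{(i)}$, and the bias uses the same expression without the $x_j^{(i)}$ factor. The sketch below applies this update with plain batch gradient descent. The function name `fit_logistic`, the learning rate, the iteration count, and the toy dataset are all assumptions chosen for illustration, not the settings used in the Orange analysis.

```python
import math


def sigmoid(z):
    """Numerically stable sigmoid."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def fit_logistic(X, y, learning_rate=0.1, n_iters=5000):
    """Batch gradient descent on the unregularized log loss."""
    m, n = len(X), len(X[0])
    theta = [0.0] * n
    bias = 0.0
    for _ in range(n_iters):
        grad_theta = [0.0] * n
        grad_bias = 0.0
        for xi, yi in zip(X, y):
            # error = predicted probability minus actual label
            error = sigmoid(bias + sum(t * v for t, v in zip(theta, xi))) - yi
            for j in range(n):
                grad_theta[j] += error * xi[j]
            grad_bias += error
        # Step against the averaged gradient.
        theta = [t - learning_rate * g / m for t, g in zip(theta, grad_theta)]
        bias -= learning_rate * grad_bias / m
    return theta, bias


# Toy one-feature dataset: small values are class 0, large values are class 1.
X = [[0.5], [1.0], [1.5], [3.0], [3.5], [4.0]]
y = [0, 0, 0, 1, 1, 1]
theta, bias = fit_logistic(X, y)

p_low = sigmoid(bias + theta[0] * 0.5)   # probability for a small input
p_high = sigmoid(bias + theta[0] * 4.0)  # probability for a large input
```

In practice, libraries often use more sophisticated solvers than plain gradient descent, but the update rule above is the simplest way to see how the parameters are learned.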

### 6. **Multiclass Classification**
Although logistic regression is commonly used for binary classification, it can be extended to multiclass classification using techniques like **One-vs-Rest (OvR)** or **Softmax Regression**.
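As a brief sketch of the softmax idea, the function below turns one linear score per class into a probability distribution over the classes. The scores are invented for illustration, and `softmax` here is a hand-written helper, not a library call.

```python
import math


def softmax(scores):
    """Convert per-class scores into probabilities that sum to 1."""
    # Subtracting the maximum leaves the result unchanged
    # and avoids overflow in exp() for large scores.
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Illustrative scores for three classes.
scores = [2.0, 1.0, 0.1]
probs = softmax(scores)
predicted_class = max(range(len(probs)), key=probs.__getitem__)  # index 0
```

With only two classes and scores `[z, 0]`, the first softmax probability reduces to the sigmoid of `z`, which is why softmax regression is regarded as the multiclass generalization of logistic regression.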

## Regularization in Logistic Regression

### Why Regularization?
In logistic regression, **overfitting** can occur when the model becomes too complex and captures noise in the training data rather than the underlying pattern. To prevent this, we apply regularization techniques like **Lasso (L1 regularization)** and **Ridge (L2 regularization)**, which add a penalty term to the cost function.

### 1. **Lasso Regression (L1 Regularization)**

- **Lasso** stands for **Least Absolute Shrinkage and Selection Operator**. In Lasso, the regularization term is the sum of the absolute values of the coefficients.
  
  The cost function for Lasso Logistic Regression is:

$$
J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log(h_\theta(x^{(i)})) + (1 - y^{(i)}) \log(1 - h_\theta(x^{(i)})) \right] + \lambda \sum_{j=1}^{p} |\theta_j|
$$


  Where $\lambda$ is the regularization parameter controlling the strength of regularization and $p$ is the number of features. Lasso tends to **shrink some coefficients to exactly zero**, which makes it useful for **feature selection**.
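The sketch below computes the L1 penalty term and illustrates why Lasso can produce exact zeros. It uses the soft-thresholding step from proximal gradient methods, which is one common way to optimize an L1 penalty; this is an assumption for illustration and not necessarily the solver used in the Orange analysis. The coefficient values, `lam`, and the shrinkage amount are made up.

```python
def l1_penalty(theta, lam):
    """Lasso penalty: lam times the sum of absolute coefficient values."""
    return lam * sum(abs(t) for t in theta)


def soft_threshold(value, amount):
    """Shrink a coefficient toward zero; small ones become exactly zero."""
    if value > amount:
        return value - amount
    if value < -amount:
        return value + amount
    return 0.0


# Illustrative coefficients (bias excluded from the penalty).
theta = [1.5, -0.03, 0.4, 0.02]
lam = 0.1

penalty = l1_penalty(theta, lam)
shrunk = [soft_threshold(t, 0.05) for t in theta]
kept_features = [j for j, t in enumerate(shrunk) if t != 0.0]  # [0, 2]
```

Coefficients whose magnitude is below the shrinkage amount are set to exactly zero, so the corresponding features drop out of the model.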


#### Analysis Using Lasso

![](./lasso_score.png) 

![](./lasso_confusion_matrix.png) 

### 2. **Ridge Regression (L2 Regularization)**

- **Ridge** regression adds a penalty term based on the sum of the squared coefficients. This is known as **L2 regularization**. The cost function is:

$$
J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log(h_\theta(x^{(i)})) + (1 - y^{(i)}) \log(1 - h_\theta(x^{(i)})) \right] + \lambda \sum_{j=1}^{p} \theta_j^2
$$

  Unlike Lasso, Ridge regression **shrinks the coefficients** but does not force them to be zero. This means it retains all features but reduces the impact of less important ones.
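To illustrate this behavior, the sketch below computes the L2 penalty and its gradient, $2 \lambda \theta_j$, and applies one gradient step on the penalty alone. The coefficient values, `lam`, and the learning rate are assumptions chosen for illustration; a real training step would also include the gradient of the log loss.

```python
def l2_penalty(theta, lam):
    """Ridge penalty: lam times the sum of squared coefficients."""
    return lam * sum(t * t for t in theta)


def l2_penalty_gradient(theta, lam):
    """Gradient of the Ridge penalty with respect to each coefficient."""
    return [2.0 * lam * t for t in theta]


# Illustrative coefficients (bias excluded from the penalty).
theta = [1.5, -0.03, 0.4, 0.02]
lam = 0.1
learning_rate = 0.5

penalty = l2_penalty(theta, lam)
gradient = l2_penalty_gradient(theta, lam)
# One gradient step on the penalty alone multiplies every coefficient
# by (1 - 2 * learning_rate * lam), which is 0.9 with these values.
shrunk = [t - learning_rate * g for t, g in zip(theta, gradient)]
```

Every coefficient is scaled by the same factor, so large and small coefficients all move toward zero proportionally, but none of them reaches exactly zero.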

#### Analysis Using Ridge
![](./ridge_score.png) 

![](./ridge_confusion_matrix.png) 

## Advantages of Regularization
- Helps prevent overfitting by penalizing large coefficients.
- Lasso can perform feature selection by shrinking the coefficients of irrelevant features to exactly zero.
- Ridge can handle multicollinearity, where input features are highly correlated.

## Applications
- Logistic regression with Lasso or Ridge regularization is used in:
  - Medical diagnosis (e.g., predicting disease presence)
  - Spam detection
  - Credit scoring and fraud detection
  - Predicting customer churn

## Conclusion
Logistic regression is a fundamental algorithm for classification problems. Adding regularization such as Lasso or Ridge makes it more robust against overfitting, and Lasso additionally helps select important features. These methods help the model generalize to unseen data, which makes them valuable tools in data science and machine learning.

In the analysis of data scientist salaries, Lasso and Ridge perform similarly, with Lasso slightly better than Ridge.
