<a href="https://colab.research.google.com/github/jman4162/machine-learning-review/blob/main/Review_Python_Tutorial_on_Logistic_Regression.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Review: Python Tutorial on Logistic Regression

Name: John Hodge

Date: 05/09/24

# Introduction
Logistic regression is a statistical model used for binary classification tasks, which predicts the probability that an observation falls into one of two categories. This makes it particularly useful in fields like medicine, finance, and social sciences for outcomes like disease/no disease, default/no default, etc. Unlike linear regression, logistic regression outputs probabilities by applying a logistic function to a linear combination of features.

## Theory Behind Logistic Regression

### The Logistic Function

The logistic function, also known as the sigmoid function, is defined as:

$ \sigma(z) = \frac{1}{1 + e^{-z}} $

Here, $z$ is the linear combination of the input features ($x$), represented as:

$ z = \beta_0 + \beta_1x_1 + \beta_2x_2 + ... + \beta_nx_n $

Where $\beta_0, \beta_1, ..., \beta_n$ are the coefficients of the model.
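As a minimal sketch of the two formulas above, the snippet below computes $z$ and $\sigma(z)$ with NumPy. The intercept, coefficients, and feature values are hypothetical numbers chosen only for illustration; they do not come from the model trained later in this tutorial.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real number to the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical intercept, coefficients, and feature values (illustration only)
beta_0 = 0.5
beta = np.array([1.0, -2.0])
x = np.array([0.3, 0.1])

z = beta_0 + beta @ x   # linear combination of the features
p = sigmoid(z)          # predicted probability of the positive class

print("sigmoid(0) =", sigmoid(0.0))
print("z =", z, "p =", p)
```

Note that $z = 0$ maps to a probability of exactly 0.5, which is why the sign of $z$ decides the predicted class at the usual 0.5 threshold.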

### Probabilities and Odds Ratios

The logistic function outputs values between 0 and 1, which are interpreted as probabilities. The odds of the outcome are $p / (1 - p)$, and the odds ratio associated with a one-unit increase in a variable is the exponential of its coefficient:

$ \text{Odds Ratio} = e^{\beta_i} $

That is, a one-unit increase in $x_i$ multiplies the odds of the outcome by $e^{\beta_i}$, holding the other features constant.
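The relationship between a coefficient and the odds can be checked numerically. The sketch below assumes a hypothetical one-feature model (the values of `beta_0`, `beta_1`, and `x` are made up) and compares the odds before and after a one-unit increase in the feature.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def odds(p):
    """Convert a probability into odds, p / (1 - p)."""
    return p / (1.0 - p)

# Hypothetical one-feature model (illustration only)
beta_0, beta_1 = -1.0, 0.5
x = 2.0

odds_before = odds(sigmoid(beta_0 + beta_1 * x))
odds_after = odds(sigmoid(beta_0 + beta_1 * (x + 1.0)))  # one-unit increase in x

ratio = odds_after / odds_before
print("Observed odds ratio:", ratio)
print("exp(beta_1):        ", np.exp(beta_1))
```

The two printed values agree up to floating-point error, and the result does not depend on the starting value of `x`.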

### Cost Function: Log Loss

To train a logistic regression model, we minimize a cost function called "log loss", which is defined for a single observation as:

$ \text{Log Loss} = -y \log(p) - (1-y) \log(1-p) $

Where $y$ is the actual label (0 or 1) and $p$ is the predicted probability of the outcome being 1. See the Appendix for a detailed explanation and derivation of the log-loss function.



## Implementing Logistic Regression in Python

We'll use the popular `scikit-learn` library to implement logistic regression. Let's start by setting up our environment and loading some data to work with.

### Setup and Data Loading

In [3]:
# Import necessary libraries
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.datasets import make_classification

In [4]:
# Generate synthetic data set
X, y = make_classification(n_samples=1000, n_features=20, n_classes=2, random_state=42)

### Splitting the Data

Before training, we split our dataset into training and testing sets:

In [5]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

### Model Training

Now, let's train our logistic regression model:

In [6]:
# Create a logistic regression model
model = LogisticRegression()

# Fit the model on the training data
model.fit(X_train, y_train)

### Model Evaluation

After training the model, we can evaluate its performance on the test set:

In [7]:
# Predict on the test data
y_pred = model.predict(X_test)

# Print out the confusion matrix and classification report
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

[[104  11]
 [ 26 109]]
              precision    recall  f1-score   support

           0       0.80      0.90      0.85       115
           1       0.91      0.81      0.85       135

    accuracy                           0.85       250
   macro avg       0.85      0.86      0.85       250
weighted avg       0.86      0.85      0.85       250
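The class labels returned by `predict` come from thresholding the predicted probabilities at 0.5. The sketch below rebuilds the same data and model so that it runs on its own, then compares `predict` with a manual threshold on `predict_proba`. The stricter 0.7 threshold is an arbitrary value chosen for illustration, not a recommendation.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Same data, split, and model settings as in the cells above
X, y = make_classification(n_samples=1000, n_features=20, n_classes=2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
model = LogisticRegression().fit(X_train, y_train)

# Column 1 of predict_proba holds the estimated P(y = 1)
proba = model.predict_proba(X_test)[:, 1]

y_pred_default = model.predict(X_test)
y_pred_05 = (proba > 0.5).astype(int)    # manual 0.5 threshold
y_pred_07 = (proba >= 0.7).astype(int)   # stricter, arbitrary threshold

print("Matches predict():", np.array_equal(y_pred_default, y_pred_05))
print("Positives at 0.5:", y_pred_05.sum(), "| positives at 0.7:", y_pred_07.sum())
```

Raising the threshold can only reduce the number of positive predictions, which typically trades recall for precision on the positive class.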



### Interpreting Coefficients

We can also look at the coefficients of the model to interpret the impact of each feature:

In [8]:
# Coefficients and intercept
print("Intercept:", model.intercept_)
print("Coefficients:", model.coef_)

Intercept: [0.12283339]
Coefficients: [[ 0.10919003 -0.49578821  0.16607838  0.14131473  0.05205305  1.59487115
  -0.0959503   0.0280646  -0.02121372  0.01935852  0.1831346   0.43485908
   0.03552626  0.16072815 -0.49521638  0.11335956  0.06258129 -0.13471173
  -0.75742756  0.1332694 ]]


The results from the logistic regression model consist of an intercept and an array of coefficients, which are crucial for understanding how the model makes predictions. Here's a breakdown of what these values represent and how they impact the prediction:

### Intercept: [0.12283339]
- **Intercept (or Bias)**: The intercept in logistic regression represents the log-odds of the outcome when all the predictor variables $x_i$ are set to zero. In practical terms, it is the baseline prediction for an observation whose features are all zero. In this model, the intercept is approximately 0.123, so when $x = 0$ the log-odds of the positive class are slightly positive, which corresponds to a probability of about 0.53.

### Coefficients:
- **Coefficients**: These values represent the change in the log-odds of the positive outcome ($Y = 1$) for a one-unit change in the corresponding independent variable $x_i$, while all other variables remain constant. Note that this is a change in log-odds, not a change in probability. Here is a brief look at a few of the coefficients:

    - $\beta_1 = 0.109$: A positive coefficient indicates that as this feature increases by one unit, the log-odds of the positive class increase by approximately 0.109, making the event more likely.
    - $\beta_2 = -0.496$: A negative coefficient means that increasing this feature by one unit decreases the log-odds of the positive class by about 0.496, making the event less likely.
    - $\beta_6 = 1.595$: This is the largest coefficient in the model. A one-unit increase in this feature raises the log-odds of the positive class by about 1.595, which suggests it is a strong predictor in favor of the positive class.
    - $\beta_{19} = -0.757$: This feature strongly decreases the probability of a positive outcome, as indicated by its large negative coefficient.
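The log-odds changes listed above can be converted into odds ratios by exponentiating each coefficient. The sketch below hard-codes the four coefficient values from the printed output rather than refitting the model; the labels `x1`, `x2`, `x6`, and `x19` are names introduced here for readability, since the synthetic features have no names of their own.

```python
import numpy as np
import pandas as pd

# Coefficients copied from the model output above (features 1, 2, 6, and 19)
coefficients = pd.Series(
    {"x1": 0.10919003, "x2": -0.49578821, "x6": 1.59487115, "x19": -0.75742756}
)

# exp(beta_i) is the factor by which the odds are multiplied per unit increase
odds_ratios = np.exp(coefficients)

summary = pd.DataFrame({"coefficient": coefficients, "odds_ratio": odds_ratios})
print(summary.sort_values("odds_ratio", ascending=False).round(3))
```

An odds ratio above 1 means the feature pushes toward the positive class, and a ratio below 1 means it pushes toward the negative class.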
  
### Understanding Log-Odds:
- **Log-Odds**: The logistic regression model calculates the probability of the dependent variable being 1 (positive class) as the sigmoid function of the linear combination of features. Mathematically, it's expressed as:

  $ p = \frac{1}{1 + e^{-(\beta_0 + \beta_1x_1 + \beta_2x_2 + ... + \beta_nx_n)}} $
  
  Here, $\beta_0, \beta_1, ..., \beta_n$ are the intercept and coefficients, respectively, and $x_1, x_2, ..., x_n$ are the feature values. Solving this equation for the linear combination shows that it equals the log-odds of the positive class: $ \log\frac{p}{1 - p} = \beta_0 + \beta_1x_1 + \beta_2x_2 + ... + \beta_nx_n $.
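The formula above can be checked against `scikit-learn` by computing the probabilities by hand from `model.intercept_` and `model.coef_`. The sketch below rebuilds the same data and model so that it runs on its own; `z` and `p_manual` are names introduced here.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Same data, split, and model settings as in the cells above
X, y = make_classification(n_samples=1000, n_features=20, n_classes=2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
model = LogisticRegression().fit(X_train, y_train)

# Linear combination (log-odds) for every test observation
z = X_test @ model.coef_.ravel() + model.intercept_[0]

# Sigmoid of the log-odds gives the probability of the positive class
p_manual = 1.0 / (1.0 + np.exp(-z))

p_sklearn = model.predict_proba(X_test)[:, 1]
print("Max absolute difference:", np.max(np.abs(p_manual - p_sklearn)))
```

The same `z` values are also what `model.decision_function(X_test)` returns for a binary problem.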

### Implications:
- Each coefficient tells us how much weight the model gives to the corresponding feature when predicting the outcome. Magnitudes are only directly comparable when the features are measured on similar scales.
- Positive coefficients increase the probability of the positive class as the value of the corresponding feature increases, while negative coefficients decrease it.

These coefficients and the intercept form the backbone of the logistic regression model, allowing it to estimate probabilities based on a combination of these inputs. Understanding these can help interpret the model in the context of the specific application, whether it’s in predicting medical outcomes, customer behavior, or any other binary classification task.

## Conclusion

This tutorial introduced the basics of logistic regression, from its theoretical foundations to its implementation in Python using `scikit-learn`. This model is powerful for binary classification problems, providing probabilities and classifications based on the input features.

By analyzing the coefficients and the model's performance metrics, you can gain insights into the relationships between features and the predicted outcome, allowing for better-informed decision-making in practical applications.

# Appendix

## Deriving the Log-Loss Function

The log-loss function, also known as binary cross-entropy loss, is commonly used in binary classification tasks to evaluate a model whose output is a probability between 0 and 1. Unlike accuracy, which only checks whether the predicted class is correct, log-loss measures how close each predicted probability is to the actual label (0 or 1).

### Theoretical Foundation

The log-loss function is derived from the concept of likelihood in statistical modeling, particularly from maximizing the likelihood of the observed data under the model.

#### 1. Model Prediction:
Consider a binary classification problem where each instance $x_i$ has a label $y_i \in \{0, 1\}$ and the model outputs $p_i$, its estimated probability that $y_i = 1$. The probability that $y_i = 0$ is then $1 - p_i$.

#### 2. Likelihood of an Observation:
For a single observation $(x_i, y_i)$, the likelihood of observing $y_i$, given $p_i$, is:

$ P(y_i | p_i) = p_i^{y_i} (1 - p_i)^{(1 - y_i)} $

This equation states that:
- If $y_i = 1$, the likelihood is $p_i$.
- If $y_i = 0$, the likelihood is $1 - p_i$.

#### 3. Log-Likelihood:
To simplify calculations and manage numerical underflow, we use the logarithm of the likelihood. For a single observation, the log-likelihood is:

$ \log P(y_i | p_i) = y_i \log(p_i) + (1 - y_i) \log(1 - p_i) $
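The underflow problem mentioned above is easy to reproduce. In the sketch below, the labels and the constant prediction of 0.9 are hypothetical values chosen for illustration: multiplying 10,000 per-observation likelihoods collapses to zero in floating point, while the sum of their logarithms stays finite.

```python
import numpy as np

n = 10_000

# Hypothetical data: 90% of labels are 1, and the model predicts p = 0.9 for everyone
y = (np.arange(n) % 10 != 0).astype(int)
p = np.full(n, 0.9)

# Per-observation likelihood: p if y == 1, otherwise 1 - p
likelihoods = p**y * (1 - p) ** (1 - y)

product = np.prod(likelihoods)  # product of many numbers below 1 underflows
log_likelihood = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

print("Product of likelihoods:", product)
print("Sum of log-likelihoods:", log_likelihood)
```

Because the logarithm is monotonic, the parameters that maximize the log-likelihood are the same ones that maximize the likelihood itself.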

#### 4. Negative Log-Likelihood for All Observations:
When considering the entire dataset of $n$ observations, the log-likelihood of the dataset is the sum of the log-likelihoods of all individual observations. The negative log-likelihood (which we aim to minimize) is:

$ -\sum_{i=1}^n \left[ y_i \log(p_i) + (1 - y_i) \log(1 - p_i) \right] $

### Log-Loss Function:

For machine learning purposes, particularly in training classification models like logistic regression, we define the log-loss function, which is the average of the negative log-likelihood across all observations. This is also called the cross-entropy loss in the context of binary classifications.

$ \text{Log-Loss} = -\frac{1}{n} \sum_{i=1}^n \left[ y_i \log(p_i) + (1 - y_i) \log(1 - p_i) \right] $
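The averaged formula above can be written in a few lines of NumPy and compared with `sklearn.metrics.log_loss`. This is a minimal sketch: the labels and probabilities are made-up values, `binary_log_loss` is a helper name introduced here, and the `eps` clipping is an added safeguard against `log(0)` rather than part of the derivation.

```python
import numpy as np
from sklearn.metrics import log_loss

def binary_log_loss(y_true, p_pred, eps=1e-15):
    """Average negative log-likelihood for binary labels and predicted P(y = 1)."""
    y_true = np.asarray(y_true, dtype=float)
    # Clip probabilities away from exactly 0 and 1 to avoid log(0)
    p = np.clip(np.asarray(p_pred, dtype=float), eps, 1 - eps)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

# Hypothetical labels and predicted probabilities (illustration only)
y_true = [1, 0, 1, 1, 0]
p_pred = [0.9, 0.2, 0.6, 0.3, 0.1]

manual = binary_log_loss(y_true, p_pred)
reference = log_loss(y_true, p_pred)
print("Manual log-loss: ", manual)
print("sklearn log_loss:", reference)
```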

### Explanation of Log-Loss:

- **Interpretation**: Log-loss penalizes each prediction according to the probability it assigned to the true label, so the penalty is not uniform; it depends on how confident the prediction was and whether it was correct or incorrect.
- **Confident and Wrong**: A prediction that an event occurs with high probability, say $p_i \approx 1$, when it actually does not occur ($y_i = 0$) results in a high loss because $\log(1 - p_i)$ becomes a large negative number (the log of a number approaching zero tends to negative infinity), and the loss is its negative.
- **Confident and Right**: Conversely, if the prediction is highly confident and correct, the penalty is small (the log of a number close to 1 is close to 0).
- **Less Confident**: Predictions that are less confident (e.g., $p_i$ close to 0.5) incur a moderate penalty of about $\log 2 \approx 0.693$ (using the natural logarithm) whether or not they turn out to be correct. This is more than the penalty for a confident correct prediction but far less than the penalty for a confident wrong one.
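The three cases above can be compared directly with the single-observation formula. The probabilities in this sketch (0.99 and 0.5) are arbitrary values chosen to represent "confident" and "unsure" predictions, and `single_log_loss` is a helper name introduced here.

```python
import numpy as np

def single_log_loss(y, p):
    """Log-loss for one observation with label y and predicted P(y = 1) = p."""
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

# (true label, predicted probability of class 1) for each scenario
cases = {
    "confident and right": (1, 0.99),
    "less confident":      (1, 0.50),
    "confident and wrong": (0, 0.99),
}

losses = {name: single_log_loss(y, p) for name, (y, p) in cases.items()}
for name, loss in losses.items():
    print(f"{name:>20}: {loss:.3f}")
```

The losses come out at roughly 0.01, 0.69, and 4.6 respectively, so a single confident mistake costs far more than several hesitant predictions.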

### Conclusion:

Thus, the log-loss function reflects not only whether predictions are correct but also how confident they are. Because confident mistakes are penalized heavily, minimizing log-loss encourages the model to push its predicted probabilities toward 0 or 1 only when the data support that level of certainty. This makes it well suited to training binary classification models whose predicted probabilities, and not just their class labels, can be relied upon.