# Tutorial: Logistic Regression for Binary Classification and Probability Estimation

Logistic Regression is a widely used statistical technique for binary classification and probability estimation tasks. It's especially useful when you need to predict whether an instance belongs to one of two classes or when you want to estimate the probability of an instance belonging to a particular class. In this tutorial, we'll cover the following topics:

1. **Introduction to Logistic Regression**
2. **Logistic Regression Equation**
3. **Sigmoid Function**
4. **Training Logistic Regression**
5. **Making Predictions**
6. **Evaluating the Model**
7. **Probability Estimation**
8. **Code Example in Python**

Let's get started!

## 1. Introduction to Logistic Regression

Logistic Regression is a type of regression analysis used for predicting the probability of a binary outcome. It's named "logistic" because it's based on the logistic function (or sigmoid function), which maps any real-valued number to a value between 0 and 1. Logistic Regression is widely used in various fields, including medicine, finance, and machine learning.

## 2. Logistic Regression Equation

The logistic regression equation is:

$$ P(Y=1|X) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X)}} $$

Where:
- $ P(Y=1|X) $ is the probability that the dependent variable $ Y $ equals 1 given the independent variable $ X $.
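- $ X $ is the input feature (predictor). With several features, the exponent becomes $ \beta_0 + \beta_1 X_1 + \dots + \beta_p X_p $, with one coefficient per feature.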
- $ \beta_0 $ is the intercept.
- $ \beta_1 $ is the coefficient associated with the independent variable $ X $.
- $ e $ is the base of the natural logarithm (approximately 2.71828).
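
Rearranging the equation above shows how the coefficients can be read. Writing $ p = P(Y=1|X) $ (a shorthand introduced here for brevity), the ratio $ p / (1 - p) $ is the odds of the positive class, and its logarithm is linear in $ X $:

$$ \log\left(\frac{p}{1 - p}\right) = \beta_0 + \beta_1 X $$

Under this form, a one-unit increase in $ X $ adds $ \beta_1 $ to the log-odds, which is the same as multiplying the odds by $ e^{\beta_1} $.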

## 3. Sigmoid Function

The sigmoid function $ \sigma(z) $, also known as the logistic function, is the core of logistic regression. It maps any real-valued number $ z $ to a value between 0 and 1. The formula for the sigmoid function is:

$$ \sigma(z) = \frac{1}{1 + e^{-z}} $$

The sigmoid function has an S-shaped curve, which makes it suitable for modeling probabilities in binary classification problems.
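
As a quick check of the formula above, here is a minimal NumPy sketch of the sigmoid function. The function name `sigmoid` and the sample inputs are chosen here for illustration only.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real value into the interval (0, 1)."""
    z = np.asarray(z, dtype=float)
    return 1.0 / (1.0 + np.exp(-z))

# Large negative inputs approach 0, zero maps to 0.5, large positive inputs approach 1
values = sigmoid([-5.0, 0.0, 5.0])
print(values)
```

The curve is symmetric around $ z = 0 $, so $ \sigma(-z) = 1 - \sigma(z) $.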

## 4. Training Logistic Regression

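The goal of training logistic regression is to find the values of the coefficients $ \beta_0 $ and $ \beta_1 $ that make the observed labels most likely under the model. This is maximum likelihood estimation, and it is equivalent to minimizing the log loss (cross-entropy) between the predicted probabilities and the actual labels. Unlike linear regression, there is no closed-form solution, so the coefficients are found with iterative optimization techniques such as gradient descent.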
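
The training idea can be sketched with plain batch gradient descent on the mean log loss for a single feature. This is an illustrative sketch, not how scikit-learn fits its models; the function names, the learning rate, the iteration count, and the toy data are all assumptions made for this example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mean_log_loss(x, y, beta0, beta1):
    """Average negative log-likelihood of the labels under the model."""
    p = sigmoid(beta0 + beta1 * x)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))

def fit_logistic(x, y, learning_rate=0.1, n_iter=5000):
    """Estimate beta0 and beta1 by batch gradient descent on the mean log loss."""
    beta0, beta1 = 0.0, 0.0
    for _ in range(n_iter):
        p = sigmoid(beta0 + beta1 * x)
        error = p - y                                 # gradient of the loss w.r.t. the linear term
        beta0 -= learning_rate * error.mean()         # d(loss)/d(beta0)
        beta1 -= learning_rate * (error * x).mean()   # d(loss)/d(beta1)
    return beta0, beta1

# Hypothetical toy data: the classes overlap slightly, so the optimum is finite
x = np.array([0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0])
y = np.array([0, 0, 0, 1, 0, 1, 1, 1])

beta0, beta1 = fit_logistic(x, y)
print(f"beta0 = {beta0:.2f}, beta1 = {beta1:.2f}")
print(f"log loss before: {mean_log_loss(x, y, 0.0, 0.0):.3f}, after: {mean_log_loss(x, y, beta0, beta1):.3f}")
```

Each update moves the coefficients against the gradient of the loss, so the loss after fitting should be lower than the starting value obtained with both coefficients at zero.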

## 5. Making Predictions

Once the model is trained, you can make predictions on new data points. You plug the values of the independent variables into the logistic regression equation, and the output will be a probability score between 0 and 1. By setting a threshold (e.g., 0.5), you can classify instances into one of the two classes.
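
The prediction step can be sketched as follows. The coefficient values and the helper names `predict_probability` and `predict_class` are hypothetical and chosen only to make the arithmetic easy to follow.

```python
import numpy as np

def predict_probability(x, beta0, beta1):
    """Plug x into the logistic regression equation to get P(Y=1|X)."""
    return 1.0 / (1.0 + np.exp(-(beta0 + beta1 * x)))

def predict_class(x, beta0, beta1, threshold=0.5):
    """Assign class 1 when the probability reaches the threshold, else class 0."""
    return (predict_probability(x, beta0, beta1) >= threshold).astype(int)

beta0, beta1 = -4.0, 2.0          # hypothetical trained coefficients
x_new = np.array([1.0, 2.0, 3.0])

print(predict_probability(x_new, beta0, beta1))
print(predict_class(x_new, beta0, beta1))                 # default threshold of 0.5
print(predict_class(x_new, beta0, beta1, threshold=0.9))  # stricter threshold
```

Raising the threshold makes the model more reluctant to predict the positive class, which trades recall for precision.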

## 6. Evaluating the Model

This section covers the metrics most commonly used to evaluate a logistic regression model in binary classification.

**1. Accuracy:** Accuracy is a common metric used to evaluate classification models, including logistic regression. It measures the percentage of correctly predicted instances out of all instances. The formula for accuracy is:

$$ \text{Accuracy} = \frac{\text{Number of Correct Predictions}}{\text{Total Number of Predictions}} $$

While accuracy is a useful metric, it may not provide a complete picture of a model's performance, especially when dealing with imbalanced datasets.

**2. Precision:** Precision is a metric that focuses on the accuracy of positive predictions. It answers the question: "Of all the instances predicted as positive, how many were actually positive?" Precision is important when false positives are costly. The formula for precision is:

$$ \text{Precision} = \frac{\text{True Positives}}{\text{True Positives + False Positives}} $$

**3. Recall (Sensitivity or True Positive Rate):** Recall measures the ability of the model to correctly identify all relevant instances. It answers the question: "Of all the actual positive instances, how many did the model correctly predict as positive?" Recall is important when false negatives are costly. The formula for recall is:

$$ \text{Recall} = \frac{\text{True Positives}}{\text{True Positives + False Negatives}} $$

**4. F1-Score:** The F1-score is the harmonic mean of precision and recall. Because the harmonic mean is pulled toward the smaller of the two values, a model scores well only when both precision and recall are high, which makes the metric useful on imbalanced datasets where accuracy can mislead. The formula for the F1-score is:

$$ \text{F1-Score} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} $$

**5. ROC Curve (Receiver Operating Characteristic Curve):** The ROC curve is a graphical representation of a model's performance across different thresholds for classifying instances. It plots the true positive rate (recall) against the false positive rate at various threshold values. A model with a higher area under the ROC curve (AUC-ROC) is generally considered better at distinguishing between classes. An AUC-ROC of 0.5 indicates a model that performs no better than random guessing, while an AUC-ROC of 1.0 represents a perfect classifier.
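
The ROC curve and its area can be computed with scikit-learn's `roc_curve` and `roc_auc_score`. The labels and scores below are a small made-up example, not output from the model in this tutorial.

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# Made-up true labels and predicted probabilities for the positive class
y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])

# One (false positive rate, true positive rate) point per candidate threshold
fpr, tpr, thresholds = roc_curve(y_true, scores)
auc_value = roc_auc_score(y_true, scores)

for f, t in zip(fpr, tpr):
    print(f"FPR = {f:.2f}, TPR = {t:.2f}")
print(f"AUC-ROC: {auc_value:.2f}")  # 0.75: three of the four positive/negative pairs are ranked correctly
```

The AUC can be read as the probability that a randomly chosen positive instance receives a higher score than a randomly chosen negative one, which is why it does not depend on any single threshold.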

These metrics provide a comprehensive view of how well a logistic regression model is performing in binary classification tasks:

- **Accuracy** provides an overall view of correctness but may not be suitable for imbalanced datasets.
- **Precision** focuses on minimizing false positives.
- **Recall** focuses on minimizing false negatives.
- **F1-Score** balances precision and recall, making it useful for imbalanced datasets.
- **ROC Curve and AUC-ROC** help assess a model's ability to distinguish between classes and set the appropriate classification threshold.

In practice, the choice of evaluation metrics depends on the specific problem and the relative costs of false positives and false negatives.
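
The accuracy, precision, recall, and F1 formulas above can be checked by hand from the four confusion-matrix counts. The labels below are invented for illustration.

```python
import numpy as np

# Made-up labels: 4 actual positives and 6 actual negatives
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 0, 1, 0, 0, 0, 0, 0])

tp = int(np.sum((y_true == 1) & (y_pred == 1)))  # true positives: 3
fp = int(np.sum((y_true == 0) & (y_pred == 1)))  # false positives: 1
fn = int(np.sum((y_true == 1) & (y_pred == 0)))  # false negatives: 1
tn = int(np.sum((y_true == 0) & (y_pred == 0)))  # true negatives: 5

accuracy = (tp + tn) / (tp + tn + fp + fn)           # 8 / 10 = 0.80
precision = tp / (tp + fp)                           # 3 / 4 = 0.75
recall = tp / (tp + fn)                              # 3 / 4 = 0.75
f1 = 2 * precision * recall / (precision + recall)   # 0.75

print(f"Accuracy: {accuracy:.2f}, Precision: {precision:.2f}, Recall: {recall:.2f}, F1: {f1:.2f}")
```

Accuracy comes out higher than the other three metrics because the five true negatives count toward it, while precision, recall, and F1 ignore them.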

## 7. Probability Estimation

One of the advantages of logistic regression is its ability to provide probability estimates. Instead of just classifying instances into one of two classes, you can obtain the probability that an instance belongs to a particular class. This is useful in applications like risk assessment or fraud detection.
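
In scikit-learn, probability estimates come from `predict_proba`. The sketch below assumes a two-class version of the Iris data in which class 2 is treated as the positive class; that choice, the `_bin` variable names, and `max_iter=1000` are assumptions made for this illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

iris = load_iris()
X_bin = iris.data
# Assumption for illustration: class 2 is the positive class, the other two are negative
y_bin = (iris.target == 2).astype(int)

X_train_bin, X_test_bin, y_train_bin, y_test_bin = train_test_split(
    X_bin, y_bin, test_size=0.3, random_state=42
)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train_bin, y_train_bin)

# One row per instance; columns follow clf.classes_, so column 1 is P(class 1)
proba = clf.predict_proba(X_test_bin)
p_positive = proba[:, 1]
labels = (p_positive >= 0.5).astype(int)

for p, label in zip(p_positive[:5], labels[:5]):
    print(f"P(class 1) = {p:.3f} -> predicted class {label}")
```

Keeping the probabilities rather than only the labels lets you rank instances by risk or choose a threshold other than 0.5 later.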

## 8. Code Example in Python

Let's illustrate logistic regression with a simple Python example using the scikit-learn library:


In [5]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_curve, roc_auc_score, auc, classification_report, confusion_matrix

# Load the Iris dataset (note: iris.target has three classes: 0, 1, and 2)
iris = load_iris()
X = iris.data
y = iris.target

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)


In [6]:

# Create a Logistic Regression model
model = LogisticRegression()

# Train the model
model.fit(X_train, y_train)

# Make predictions on the test set
y_pred = model.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")

# Generate classification report
report = classification_report(y_test, y_pred)
print("Classification Report:\n", report)


Accuracy: 1.00
Classification Report:
               precision    recall  f1-score   support

           0       1.00      1.00      1.00        19
           1       1.00      1.00      1.00        13
           2       1.00      1.00      1.00        13

    accuracy                           1.00        45
   macro avg       1.00      1.00      1.00        45
weighted avg       1.00      1.00      1.00        45



In [4]:
# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")

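# Calculate precision
# Note: precision_score, recall_score, and f1_score default to average="binary",
# so this cell assumes y contains only two classes
precision = precision_score(y_test, y_pred)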
print(f"Precision: {precision:.2f}")

# Calculate recall
recall = recall_score(y_test, y_pred)
print(f"Recall: {recall:.2f}")

# Calculate F1-score
f1 = f1_score(y_test, y_pred)
print(f"F1 Score: {f1:.2f}")

# Generate classification report
report = classification_report(y_test, y_pred)
print("Classification Report:\n", report)



Accuracy: 1.00
Precision: 1.00
Recall: 1.00
F1 Score: 1.00
Classification Report:
               precision    recall  f1-score   support

           0       1.00      1.00      1.00        32
           1       1.00      1.00      1.00        13

    accuracy                           1.00        45
   macro avg       1.00      1.00      1.00        45
weighted avg       1.00      1.00      1.00        45



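This code demonstrates how to train a logistic regression model on the Iris dataset. It loads the dataset, splits it into training and testing sets, trains the model, makes predictions, and evaluates its performance. Note that `iris.target` as loaded has three classes, so the first classification report is a multiclass result. The second report, which lists only classes 0 and 1, apparently comes from a run in which the target had been reduced to two classes.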
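
When the target has more than two classes, as `iris.target` does when loaded unchanged, scikit-learn's `precision_score`, `recall_score`, and `f1_score` need an explicit `average` argument because their default is `average="binary"`. A small sketch with made-up labels:

```python
from sklearn.metrics import precision_score, recall_score

# Made-up three-class labels
y_true = [0, 1, 2, 2, 1, 0]
y_pred = [0, 2, 2, 2, 1, 0]

# The default average="binary" is rejected for multiclass targets
try:
    precision_score(y_true, y_pred)
    raised_for_default = False
except ValueError:
    raised_for_default = True

per_class_precision = precision_score(y_true, y_pred, average=None)   # one value per class
macro_precision = precision_score(y_true, y_pred, average="macro")    # unweighted mean over classes
macro_recall = recall_score(y_true, y_pred, average="macro")

print("Default average raised ValueError:", raised_for_default)
print("Per-class precision:", per_class_precision)
print(f"Macro precision: {macro_precision:.3f}, macro recall: {macro_recall:.3f}")
```

Macro averaging treats every class equally; `average="weighted"` instead weights each class by its support.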

That concludes our tutorial on Logistic Regression for binary classification and probability estimation. Logistic Regression is a fundamental tool in machine learning, and understanding it is crucial for a wide range of data science applications.

In [10]:
confusion = confusion_matrix(y_test, y_pred) 
print("Confusion Matrix")
print(confusion)

Confusion Matrix
[[19  0  0]
 [ 0 13  0]
 [ 0  0 13]]


In [23]:
# Compute the probabilities once, then print each row's highest one (in %) with the predicted class
proba = model.predict_proba(X_test)
for row, label in zip(proba, y_pred):
    print(row.max() * 100, label)

81.22912737035179 1
94.2033531701884 0
99.79057878816691 2
77.32337848068198 1
75.12981894330335 1
95.00924472170655 0
90.41396410357152 1
85.23276524503447 2
78.06092724358322 1
94.08071690241997 1
77.42039003546799 2
96.27361260062257 0
96.92055048526598 0
95.60592918838925 0
97.65446304440218 0
66.33333264206968 1
97.32611376430009 2
94.51508303347799 1
82.00886912089929 1
95.8520559720802 2
95.87439028548367 0
61.18890507434447 2
95.63031849015965 0
94.78641977790096 2
97.96278144998497 2
89.4350821883366 2
93.0132417569576 2
98.05000053096622 2
96.09968177119924 0
94.9518212823561 0
99.32041667476321 0
98.07112588080186 0
89.00237476264081 1
96.46836509859983 0
98.26091560020343 0
71.42442864030222 2
81.21564984613232 1
96.41675194303856 0
97.81195954928307 0
98.30825516568645 0
80.55332306698226 2
73.69852814763581 1
75.1701803957671 1
98.17703822243685 0
96.66547652579736 0


In [1]:
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Load the diabetes dataset
diabetes = load_diabetes()
X = diabetes.data
y = diabetes.target

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create a linear regression model
model = LinearRegression()

# Fit the model to the training data
model.fit(X_train, y_train)

# Make predictions on the test data
y_pred = model.predict(X_test)

# Evaluate the model's performance
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error: {mse:.2f}")


Mean Squared Error: 2900.19


In [2]:
from sklearn.metrics import r2_score

# Calculate the R-squared score
r_squared = r2_score(y_test, y_pred)
print(f"R-squared Score: {r_squared:.2f}")


R-squared Score: 0.45
