### 🔰 Notation Used in This Notebook

In Linear Models, the hypothesis (prediction function) is traditionally written as:

$$
y(x) = mx + c
$$

However, in Machine Learning, especially for **Logistic Regression and Linear Regression**, we use **vectorized parameter notation**:

$$
h_\theta(x) = \theta_0 + \theta_1 x_1
$$

where  
- $\theta_0$ → intercept / bias term  
- $\theta_1$ → weight associated with feature $x_1$

This notation makes the model easier to **extend to multiple features (dimensions)** and **express in matrix form** later.
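
As a minimal sketch of the matrix form mentioned above, the hypothesis for all samples can be computed in one step by prepending a column of ones to the feature matrix, so that $\theta_0$ acts as the intercept. The array values and the names `X`, `theta`, and `X_b` are illustrative assumptions and do not come from the notebook.

```python
import numpy as np

# Hypothetical data: 3 samples, 2 features (values chosen only for illustration)
X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])

# theta = [theta_0 (intercept), theta_1, theta_2]
theta = np.array([0.5, 1.0, -1.0])

# Prepend a column of ones so theta_0 is multiplied by 1 for every sample
X_b = np.column_stack([np.ones(len(X)), X])

# Vectorized hypothesis: one dot product per row of X_b
h = X_b @ theta

# The same computation written out feature by feature
h_explicit = theta[0] + theta[1] * X[:, 0] + theta[2] * X[:, 1]

print(h)
print(np.allclose(h, h_explicit))
```

Adding more features only adds columns to `X` and entries to `theta`; the expression `X_b @ theta` stays the same.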


---



# 🔹 Logistic Regression: Binary Classification

Logistic Regression is a supervised machine learning algorithm used when the **target variable has only two possible classes** (0 or 1).  
It is one of the most popular and foundational algorithms for **classification problems**.

Examples of binary classification:
- 🎬 Spam (1) vs Not Spam (0)
- 🏥 Disease (1) vs No Disease (0)
- 💳 Fraud (1) vs Legit (0)
- 🔐 Login Success (1) vs Failure (0)

---

## 🧩 Why Logistic Regression?
Linear Regression is not suitable for classification because:
- It predicts unbounded continuous values, which can fall below 0 or above 1
- Its outputs therefore cannot be interpreted as probabilities
- It is **very sensitive to outliers**
- Most importantly, **the decision boundary becomes unreliable for binary outcomes**: a single extreme point can tilt the fitted line and shift where it crosses the 0.5 threshold

So instead of predicting any numeric value, logistic regression predicts the **probability that a sample belongs to class 1**.

---

## 🎚 The Need for an Activation Function
If we apply linear regression directly:
$$
z = \theta^T x
$$

or 

$$
h_\theta(x) = \theta_0 + \theta_1 x_1
$$
the output can be any value from **−∞ to +∞**, but for classification we need a probability between **0 and 1**.

To convert linear output into a probability, we apply the **Sigmoid Activation Function**.

---

## 📌 Sigmoid Function
$$
\sigma(z) = \frac{1}{1 + e^{-z}}
$$

- $z = \theta_0 + \theta_1 x_1$
- Compresses $z$ into a range of **0 to 1**
- Helps interpret the output as a **probability**
- Creates a **smooth S-shaped curve**
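
The properties listed above can be checked with a few lines of NumPy. This is a minimal sketch: the function name `sigmoid` and the sample inputs are assumptions made for illustration, and the naive formula is used as-is without guarding against overflow for very large negative inputs.

```python
import numpy as np

def sigmoid(z):
    """Map any real-valued input to the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical linear outputs z = theta_0 + theta_1 * x_1
z = np.array([-5.0, 0.0, 5.0])
probs = sigmoid(z)

print(probs)
# z = 0 sits exactly on the decision threshold
print(sigmoid(0.0))
```

Large negative values of `z` map close to 0, large positive values map close to 1, and the curve is symmetric around `z = 0`.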

---

## 💰 Cost Function

Linear regression uses the squared-error cost:

$$
J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} \left(h_\theta(x^{(i)}) - y^{(i)}\right)^2
$$

For logistic regression we know that $h_\theta(x^{(i)}) = \frac{1}{1 + e^{-z}}$ with $z = \theta_0 + \theta_1 x^{(i)}$.

The per-sample term $\frac{1}{2}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2$ can be denoted as

$$
\text{Cost}(h_\theta(x^{(i)}), y^{(i)})
$$

Therefore,

$$
J(\theta_0, \theta_1) = \frac{1}{m} \sum_{i=1}^{m} \text{Cost}(h_\theta(x^{(i)}), y^{(i)})
$$

With the sigmoid inside $h_\theta$, the squared-error cost is not convex in $\theta$. To obtain a **convex function** (so that Gradient Descent can reach the **global minimum**), we replace the per-sample cost with the **Log Loss (Binary Cross-Entropy Loss)**:

$$
\text{Cost}(h_\theta(x^{(i)}), y^{(i)})
= -\, y^{(i)} \, \log(h_\theta(x^{(i)})) \;-\; (1 - y^{(i)}) \, \log(1 - h_\theta(x^{(i)}))
$$

Written case by case, this is what we call the Log Loss:

$$
\text{Cost}(h_\theta(x^{(i)}), y^{(i)}) =
\begin{cases}
-\log(h_\theta(x^{(i)})) & \text{if } y^{(i)} = 1 \\
-\log(1 - h_\theta(x^{(i)})) & \text{if } y^{(i)} = 0
\end{cases}
$$

Therefore, the cost function can be denoted as 
$$
J(\theta_0, \theta_1) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log(h_\theta(x^{(i)})) + (1 - y^{(i)}) \log(1 - h_\theta(x^{(i)})) \right]
$$
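
As a hedged sketch, the Log Loss can be computed with NumPy as shown below. The function name `log_loss_cost`, the clipping constant `eps`, and the example labels and probabilities are assumptions for illustration. The sketch uses the standard mean over the $m$ samples (a factor of $\frac{1}{m}$ with a leading minus sign).

```python
import numpy as np

def log_loss_cost(y_true, y_prob, eps=1e-15):
    """Mean binary cross-entropy over all samples."""
    y_true = np.asarray(y_true, dtype=float)
    # Clip probabilities away from 0 and 1 so log() never receives 0
    y_prob = np.clip(np.asarray(y_prob, dtype=float), eps, 1 - eps)
    per_sample = -(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))
    return per_sample.mean()

# Hypothetical labels and three sets of predicted probabilities
y = [1, 0, 1, 0]
confident_right = log_loss_cost(y, [0.9, 0.1, 0.9, 0.1])
uninformed = log_loss_cost(y, [0.5, 0.5, 0.5, 0.5])
confident_wrong = log_loss_cost(y, [0.1, 0.9, 0.1, 0.9])

print(confident_right, uninformed, confident_wrong)
```

The cost grows sharply for confident but wrong predictions, which is the behavior that makes Log Loss a good fit for probability outputs.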

---

## 🧠 Logistic Regression Hypothesis
$$
h_\theta(x) = \sigma(\theta^T x)
$$

or

$$
h_\theta(x) = \sigma(\theta_0 + \theta_1 x_1)
$$

---

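## ⚠️ Why the Loss Function Cannot Be Mean Squared Error
If we use **MSE (Mean Squared Error)** for logistic regression:
- The optimization surface becomes **non-convex** in $\theta$
- Gradient Descent may get stuck in **local minima** or on flat regions
- Training becomes **slow and unreliable**, because the gradient shrinks toward zero when the sigmoid saturates, even for confidently wrong predictions

This happens because the sigmoid is a **non-linear transformation**: squaring the error of a sigmoid output yields a cost that is no longer bowl-shaped in $\theta$.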
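
The non-convexity can be observed numerically. The sketch below is an illustration under simplifying assumptions: a single hypothetical training sample with $x = 1$ and $y = 1$, no intercept, and second differences on a grid used as a stand-in for curvature. A convex curve has non-negative curvature everywhere.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Single hypothetical sample with x = 1 and y = 1, so h = sigmoid(theta)
theta = np.linspace(-6.0, 6.0, 241)
h = sigmoid(theta)

mse_cost = (h - 1.0) ** 2   # squared error for y = 1
log_cost = -np.log(h)       # log loss for y = 1

# Second differences approximate the curvature of each cost curve
mse_curvature = np.diff(mse_cost, n=2)
log_curvature = np.diff(log_cost, n=2)

mse_is_convex = bool(np.all(mse_curvature >= 0))
log_is_convex = bool(np.all(log_curvature >= 0))

print("MSE cost convex on this grid:     ", mse_is_convex)
print("Log loss cost convex on this grid:", log_is_convex)
```

On this grid the squared-error curve bends downward for strongly negative `theta`, while the log loss curve bends upward everywhere.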

---

## ✔️ Solution: Log Loss
To ensure a **convex curve** (smooth bowl-shape) that Gradient Descent can minimize efficiently, logistic regression uses **Log Loss (Binary Cross-Entropy Loss)**.


---

## 🎯 Summary
| Concept | Meaning |
|--------|---------|
| Goal | Classify input into 0 or 1 |
| Output | Probability of being class 1 |
| Activation | Sigmoid to squash values into the range 0 to 1 |
| Best Loss Function | Binary Cross-Entropy (Convex) |
| Optimization | Gradient Descent |

---

🚀 Final takeaway:  
Even though the name says **"Regression"**, Logistic Regression is actually used for **classification**, and it works by predicting **probabilities** between 0 and 1 rather than unbounded continuous values.


In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
# make_classification is used to generate a dummy dataset with input features and class labels
# We can control number of samples, features, informative features, redundant features, and number of classes
from sklearn.datasets import make_classification

In [None]:
# Generate synthetic data for classification:
# n_samples = total rows (1000)
# n_features = total input features per row (10)
# n_classes = output classes (binary: 0 or 1)
# random_state = for reproducibility of results
X, y = make_classification(n_samples=1000, n_features=10, n_classes=2, random_state=42)

In [5]:
from sklearn.model_selection import train_test_split

# Split the dataset into training and testing sets:
# X  = input features
# y  = target labels
# test_size = 0.30 → 30% of the data will be used for testing and 70% for training
# random_state = 42 → ensures the split is always the same (reproducible results)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42)


In [8]:
## Model training

from sklearn.linear_model import LogisticRegression

# Initialize the Logistic Regression model
# LogisticRegression() creates a classifier that predicts the probability of a sample
# belonging to class 1 using the sigmoid function
logistic_model = LogisticRegression()

# Train (fit) the model on the training data
# X_train = input features for training
# y_train = actual labels for those samples
# During training, the model learns the optimal weights (θ) that minimize the log-loss cost function
logistic_model.fit(X_train, y_train)


| Parameter | Value |
|-----------|-------|
| penalty | 'l2' |
| dual | False |
| tol | 0.0001 |
| C | 1.0 |
| fit_intercept | True |
| intercept_scaling | 1 |
| class_weight | None |
| random_state | None |
| solver | 'lbfgs' |
| max_iter | 100 |


In [17]:
# Make predictions on the test dataset using the trained model
# X_test = input features that the model has never seen before
# y_pred = predicted class labels (0 or 1) for each sample in X_test
y_pred = logistic_model.predict(X_test)

# Print the prediction outputs in a clean and readable format
print("-" * 75)
print("Prediction:")
print("-" * 75)
print(y_pred)    # Displays the array of predicted labels
print("-" * 75)


---------------------------------------------------------------------------
Prediction:
---------------------------------------------------------------------------
[0 1 0 1 0 1 0 0 0 0 0 1 0 1 0 0 1 1 1 1 0 0 1 0 0 1 0 0 1 1 1 1 0 0 0 0 1
 1 1 1 0 1 1 0 0 0 1 1 1 1 0 1 0 0 1 0 1 0 1 0 1 0 0 1 1 1 0 0 1 1 1 1 1 0
 1 0 0 1 0 1 0 0 1 0 1 0 0 0 0 1 1 1 1 1 1 1 0 0 1 0 1 0 1 0 0 1 0 1 1 1 1
 1 1 1 1 0 0 1 0 0 1 1 0 1 1 1 1 1 1 1 1 0 1 1 1 1 0 0 0 1 1 0 0 0 0 0 1 0
 0 0 1 0 0 1 0 0 1 0 1 1 0 1 0 0 0 0 0 0 0 1 1 0 0 0 1 1 0 0 1 1 1 1 1 1 1
 0 0 0 0 1 0 0 0 0 1 0 0 1 1 1 0 1 0 0 0 1 1 1 1 1 0 0 0 0 0 1 0 1 0 1 1 0
 0 1 1 1 0 1 1 0 0 0 0 0 0 0 0 0 1 1 0 1 0 1 1 0 1 0 1 0 1 1 1 0 0 1 1 1 1
 0 1 0 1 1 0 0 0 1 1 0 1 1 0 0 1 0 0 0 0 1 1 0 1 0 1 1 1 0 0 1 0 1 1 0 1 1
 1 1 1 0]
---------------------------------------------------------------------------


In [20]:
# Get the predicted probabilities for each class
# logistic_model.predict_proba(X_test) returns a 2-column array:
#   - Column 0 → probability that the sample belongs to class 0
#   - Column 1 → probability that the sample belongs to class 1
# These probabilities come from the sigmoid function (for binary logistic regression)
# and are useful when we need confidence scores instead of just class predictions.
logistic_model.predict_proba(X_test)


array([[7.74477909e-01, 2.25522091e-01],
       [3.36684957e-02, 9.66331504e-01],
       [6.70682154e-01, 3.29317846e-01],
       [7.98668032e-02, 9.20133197e-01],
       [9.76616650e-01, 2.33833501e-02],
       [4.13572804e-02, 9.58642720e-01],
       [9.79028767e-01, 2.09712329e-02],
       [9.59367261e-01, 4.06327393e-02],
       [8.08520049e-01, 1.91479951e-01],
       [6.84954318e-01, 3.15045682e-01],
       [9.13669448e-01, 8.63305524e-02],
       [2.63597018e-01, 7.36402982e-01],
       [5.25844192e-01, 4.74155808e-01],
       [2.11912354e-01, 7.88087646e-01],
       [7.93592056e-01, 2.06407944e-01],
       [9.46621678e-01, 5.33783219e-02],
       [2.62957933e-02, 9.73704207e-01],
       [3.24212617e-01, 6.75787383e-01],
       [3.14803584e-01, 6.85196416e-01],
       [2.04956217e-01, 7.95043783e-01],
       [5.04587065e-01, 4.95412935e-01],
       [9.66703630e-01, 3.32963696e-02],
       [2.00514796e-01, 7.99485204e-01],
       [7.77891908e-01, 2.22108092e-01],
       [8.729863
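
The link between these probabilities and the sigmoid can be verified directly. The following is a minimal sketch that rebuilds the same data and model as the cells above so it runs on its own. It assumes a binary problem, where `decision_function` returns $z = \theta^T x$ (intercept included) and the class-1 probability is the sigmoid of that value. The names `z`, `manual_proba`, and `labels_from_threshold` are introduced here for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Recreate the data, split, and model from the earlier cells
X, y = make_classification(n_samples=1000, n_features=10, n_classes=2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42)
logistic_model = LogisticRegression().fit(X_train, y_train)

# Linear score z for every test sample
z = logistic_model.decision_function(X_test)

# Apply the sigmoid by hand
manual_proba = 1.0 / (1.0 + np.exp(-z))

# Column 1 of predict_proba is the class-1 probability
proba_class1 = logistic_model.predict_proba(X_test)[:, 1]

# Thresholding the probability at 0.5 gives the class labels
labels_from_threshold = (proba_class1 >= 0.5).astype(int)

print(np.allclose(manual_proba, proba_class1))
print(np.array_equal(labels_from_threshold, logistic_model.predict(X_test)))
```

Working with the probabilities rather than the labels makes it possible to pick a threshold other than 0.5 when one type of error is more costly than the other.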

In [27]:
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# Evaluate the model performance on test data

# 1️⃣ Accuracy Score
# Measures the fraction of correctly predicted samples:
# (TP + TN) / (TP + TN + FP + FN)
score = accuracy_score(y_test, y_pred)

# 2️⃣ Confusion Matrix
# Summarizes the prediction results in a 2×2 matrix.
# scikit-learn puts actual classes in rows and predicted classes in columns,
# with labels sorted in ascending order (0, then 1):
# [[TN, FP],
#  [FN, TP]]
# The result is stored under a new name so the imported function is not overwritten
cm = confusion_matrix(y_test, y_pred)

# 3️⃣ Classification Report
# Provides precision, recall, F1-score, and support for each class
report = classification_report(y_test, y_pred)

# Print evaluation results in a readable format
print(f"Accuracy score: {score}")
print(f"Confusion matrix: \n{cm}")
print(f"Classification Report: \n{report}")


Accuracy score: 0.8466666666666667
Confusion matrix: 
[[118  17]
 [ 29 136]]
Classification Report: 
              precision    recall  f1-score   support

           0       0.80      0.87      0.84       135
           1       0.89      0.82      0.86       165

    accuracy                           0.85       300
   macro avg       0.85      0.85      0.85       300
weighted avg       0.85      0.85      0.85       300




---
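## 😵‍💫 Confusion Matrix

A confusion matrix is a table used to evaluate the performance of a classification model (supervised learning). It is built by comparing the model's predicted labels with the actual labels and counting each combination.

$$
\begin{array}{c|cc}
 & \text{Predicted Positive} & \text{Predicted Negative} \\
\hline
\text{Actual Positive} & TP & FN \\
\text{Actual Negative} & FP & TN
\end{array}
$$

Note that scikit-learn's `confusion_matrix` sorts labels in ascending order, so for labels 0 and 1 it prints `[[TN, FP], [FN, TP]]`, with actual classes in rows and predicted classes in columns.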


---

## 🔑 Meaning of each term

| Term                    | Meaning                                                                        |
| ----------------------- | ------------------------------------------------------------------------------ |
| **TP (True Positive)**  | Model predicted **Positive** and it **was actually Positive**                  |
| **TN (True Negative)**  | Model predicted **Negative** and it **was actually Negative**                  |
| **FP (False Positive)** | Model predicted **Positive**, but it was **actually Negative** (Type-I error)  |
| **FN (False Negative)** | Model predicted **Negative**, but it was **actually Positive** (Type-II error) |
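
The four counts can be pulled out of a scikit-learn confusion matrix with `ravel()`. This is a small sketch on made-up labels: the arrays `y_true` and `y_hat` are assumptions for illustration. For binary labels 0 and 1, the flattened order is TN, FP, FN, TP.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical actual and predicted labels
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_hat = np.array([1, 0, 0, 1, 0, 1, 1, 0])

# Rows are actual classes, columns are predicted classes, labels sorted as 0, 1
cm = confusion_matrix(y_true, y_hat)

# Flatten row by row: [TN, FP, FN, TP]
tn, fp, fn, tp = cm.ravel()

print(cm)
print(f"TP={tp}, TN={tn}, FP={fp}, FN={fn}")
```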

---


## 🎯 Accuracy Score

Accuracy score can be represented as 

$$
\text{Accuracy} = \frac{TP + TN}{TP + FP + FN + TN}
$$

Accuracy is misleading on an imbalanced dataset: a model that always predicts the majority class can score high accuracy while never detecting the minority class. For such datasets, Precision and Recall are more informative.
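
A small sketch makes the problem concrete. The data below is hypothetical: 95 negative samples, 5 positive samples, and a "model" that always predicts class 0.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical imbalanced labels: 95 negatives and 5 positives
y_true = np.array([0] * 95 + [1] * 5)

# A trivial "model" that always predicts the majority class
y_hat = np.zeros_like(y_true)

acc = accuracy_score(y_true, y_hat)
rec = recall_score(y_true, y_hat)

print(f"Accuracy: {acc:.2f}")
print(f"Recall:   {rec:.2f}")
```

The accuracy looks strong even though the model never finds a single positive sample, which the recall exposes.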

---

## ✅ Precision

Precision can be represented as

$$
\text{Precision} = \frac{TP}{TP + FP}
$$

Out of all the samples **predicted** as positive, how many are actually positive.

---

## 🔁 Recall

Recall can be represented as 

$$
\text{Recall} = \frac{TP}{TP + FN}
$$

Out of all the **actual** positive samples, how many are correctly predicted as positive.
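
As a worked example, the sketch below applies the precision, recall, and accuracy formulas to the counts in the confusion matrix printed by the evaluation cell, reading it in scikit-learn's layout `[[TN, FP], [FN, TP]]`. The variable names are chosen here for illustration.

```python
# Counts read from the printed confusion matrix [[118, 17], [29, 136]],
# interpreted in scikit-learn's layout [[TN, FP], [FN, TP]]
tn, fp = 118, 17
fn, tp = 29, 136

precision = tp / (tp + fp)
recall = tp / (tp + fn)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"Precision: {precision:.2f}")
print(f"Recall:    {recall:.2f}")
print(f"Accuracy:  {accuracy:.4f}")
```

These values should agree with the class 1 row of the classification report and with the accuracy score shown in the same output.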

---

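## F-$\beta$ Score

The F-$\beta$ score can be represented as 
$$
F_\beta = (1 + \beta^2) \frac{\text{Precision} \times \text{Recall}}{\beta^2 \cdot \text{Precision} + \text{Recall}}
$$

Use cases:

- If $FP$ and $FN$ are equally important, then $\beta = 1$

Therefore,
$$
F_1 = (1 + 1^2) \frac{\text{Precision} \times \text{Recall}}{1^2 \cdot \text{Precision} + \text{Recall}} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}
$$

This is referred to as the Harmonic Mean of Precision and Recall.

- If $FP$ is more important than $FN$, then $\beta = 0.5$ (Precision is weighted more heavily)

Therefore,

$$
F_{0.5} = (1 + 0.5^2) \frac{\text{Precision} \times \text{Recall}}{0.5^2 \cdot \text{Precision} + \text{Recall}}
$$

- If $FN$ is more important than $FP$, then $\beta = 2$ (Recall is weighted more heavily)

Therefore,

$$
F_2 = (1 + 2^2) \frac{\text{Precision} \times \text{Recall}}{2^2 \cdot \text{Precision} + \text{Recall}}
$$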
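
The effect of $\beta$ can be checked against scikit-learn's `fbeta_score`. The sketch below uses the standard definition, with $\beta^2 \cdot \text{Precision} + \text{Recall}$ in the denominator, which is what `fbeta_score` computes. The labels and the helper name `f_beta` are assumptions for illustration.

```python
from sklearn.metrics import fbeta_score, precision_score, recall_score

# Hypothetical labels: 4 TP, 1 FP, 2 FN, 3 TN
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]
y_hat = [1, 0, 0, 1, 0, 1, 1, 0, 0, 1]

p = precision_score(y_true, y_hat)
r = recall_score(y_true, y_hat)

def f_beta(precision, recall, beta):
    """F-beta score computed from precision and recall."""
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

for beta in (0.5, 1, 2):
    manual = f_beta(p, r, beta)
    library = fbeta_score(y_true, y_hat, beta=beta)
    print(f"beta={beta}: manual={manual:.4f}, sklearn={library:.4f}")
```

Because precision is higher than recall in this example, the score drops as $\beta$ increases and recall gains weight.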