# Logistic Regression: Classification Made Simple

Welcome to the fifth notebook in our **Machine Learning Basics for Beginners** series! After learning about linear regression for predicting continuous values, let's explore **Logistic Regression**, a supervised learning algorithm used for **classification**—predicting categories or labels instead of numbers.

**What You'll Learn in This Notebook:**
- What logistic regression is and when to use it.
- How logistic regression works in simple terms.
- A hands-on example of classifying emails as spam or not spam using logistic regression.
- An interactive exercise to adjust data and see how predictions change.
- Visualizations to understand the concept of a decision boundary.

Let's get started!

## 1. What is Logistic Regression?

**Logistic Regression** is a supervised learning algorithm used to predict a categorical outcome—typically whether something belongs to one of two categories (binary classification). Despite its name, it’s not about predicting numbers like linear regression; it’s about predicting probabilities that lead to a category decision.

- **Goal**: Predict the probability of an outcome (e.g., the chance an email is spam) and classify it into a category (e.g., spam or not spam) based on a threshold.
- **When to Use It**: Use logistic regression when you want to classify data into categories, especially for binary (two-class) problems. It works well when the classes can be separated reasonably well by a straight line (or a flat plane when there are more than two features).
- **Examples**:
  - Classifying emails as spam or not spam.
  - Predicting if a patient has a disease (yes/no) based on medical data.
  - Determining if a customer will buy a product (yes/no) based on their profile.

**Analogy**: Imagine you're trying to decide if a fruit is an apple or an orange based on its color and size. Logistic regression helps draw a boundary line between "apple" and "orange" characteristics, so you can classify new fruits based on where they fall relative to that line.

## 2. How Does Logistic Regression Work?

Logistic regression might sound complex, but let’s break it down into simple steps:

1. **Start with a Linear Model**: Like linear regression, logistic regression starts by combining input features into a linear equation (something like `score = m1*x1 + m2*x2 + b`), where `x1`, `x2` are features (e.g., email length, number of suspicious words), and `m1`, `m2`, `b` are learned weights.
2. **Convert to Probability**: Instead of predicting a number, logistic regression transforms this score into a probability (a value between 0 and 1) using a special curve called the **sigmoid function**. The sigmoid function looks like an S-shaped curve and squashes any score into a probability:
   - If the score is very high, the probability is close to 1 (e.g., very likely spam).
   - If the score is very low, the probability is close to 0 (e.g., very likely not spam).
3. **Set a Threshold**: By default, if the probability is above 0.5, we classify it as one category (e.g., spam); if below 0.5, the other category (e.g., not spam). This threshold can be adjusted.
4. **Learning**: The algorithm adjusts the weights (`m1`, `m2`, `b`) to minimize a measure of error called **log loss**, which penalizes confident wrong predictions most heavily. This pushes the predicted probabilities as close as possible to the actual labels in the training data.
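
Steps 1 to 3 above can be sketched in a few lines of plain Python. The weights below are hypothetical values picked only to make the arithmetic easy to follow; they are not learned from any data, and the helper names `sigmoid` and `classify` are illustrative.

```python
import math

def sigmoid(score):
    """Squash any real-valued score into a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-score))

def classify(x1, x2, m1, m2, b, threshold=0.5):
    """Linear score -> probability -> class label (steps 1 to 3)."""
    score = m1 * x1 + m2 * x2 + b          # Step 1: linear model
    probability = sigmoid(score)           # Step 2: convert to probability
    label = 1 if probability >= threshold else 0  # Step 3: apply threshold
    return score, probability, label

# Hypothetical weights, chosen only for illustration (not learned)
m1, m2, b = 1.0, 0.01, -4.0

# x1 = suspicious words, x2 = email length
for x1, x2 in [(0, 100), (4, 350)]:
    score, probability, label = classify(x1, x2, m1, m2, b)
    print(f"words={x1}, length={x2}: score={score:+.2f}, "
          f"probability={probability:.3f}, label={label}")
```

Passing a different `threshold` (for example `threshold=0.9`) makes the classifier stricter about calling something spam without changing the probabilities themselves.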

**Analogy**: Think of logistic regression as a judge in a contest. It looks at all the evidence (features), calculates a "score" for guilt or innocence, converts that score into a confidence level (probability), and makes a final verdict (classification) based on whether the confidence is above a certain level.
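
To make the "Learning" step a little more concrete, here is a minimal sketch of how prediction error is commonly measured for logistic regression, assuming the log loss (cross-entropy). The probability lists are made-up values for illustration, not output from a trained model.

```python
import math

def log_loss(y_true, y_prob):
    """Average log loss: small when predicted probabilities match the labels."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        # Only one of the two terms is non-zero for each example
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

labels = [0, 1, 0, 1, 1]

# Hypothetical predicted probabilities of spam (illustrative values only)
mostly_right = [0.1, 0.9, 0.2, 0.95, 0.8]
mostly_wrong = [0.9, 0.1, 0.8, 0.05, 0.2]

print(f"Loss when predictions match the labels: {log_loss(labels, mostly_right):.3f}")
print(f"Loss when predictions contradict them:  {log_loss(labels, mostly_wrong):.3f}")
```

Training searches for the weights that make this number as small as possible on the training data.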

## 3. Example: Classifying Emails as Spam or Not Spam

Let’s see logistic regression in action with a small dataset of emails. We’ll use two features to predict if an email is spam:
- Number of suspicious words (like "lottery" or "win").
- Email length (number of characters).

**Dataset**:
- Suspicious Words: 0, 2, 1, 5, 3
- Email Length: 100, 300, 150, 400, 250
- Is Spam (Label): No (0), Yes (1), No (0), Yes (1), Yes (1)

We’ll use Python’s `scikit-learn` library to create a logistic regression model, train it on this data, and predict whether a new email is spam. Focus on the steps and output, not the code details.

**Instructions**: Run the code below to see how logistic regression classifies emails and visualizes the decision boundary.

In [None]:
# Import necessary libraries
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression

# Our small dataset
X = np.array([[0, 100], [2, 300], [1, 150], [5, 400], [3, 250]])  # Features: suspicious words, email length
y = np.array([0, 1, 0, 1, 1])  # Label: 0 = Not Spam, 1 = Spam

# Create and train the logistic regression model
model = LogisticRegression()
model.fit(X, y)

# Predict for a new email with 4 suspicious words and 350 characters
new_email = np.array([[4, 350]])
prediction = model.predict(new_email)[0]
probability = model.predict_proba(new_email)[0][1]  # Probability of being spam (class 1)
print(f"New Email (4 suspicious words, 350 chars): Predicted as {'Spam' if prediction == 1 else 'Not Spam'}")
print(f"Probability of being Spam: {probability:.2%}")

# Visualize the data and decision boundary
# Create a mesh grid for plotting decision boundary
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 50, X[:, 1].max() + 50
xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.1), np.arange(y_min, y_max, 10))
Z = model.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)

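# Plot decision boundary and points
# Use the reversed colormap so class 0 (Not Spam) is shaded blue and class 1 (Spam) red, matching the dots
plt.contourf(xx, yy, Z, alpha=0.3, cmap=plt.cm.RdYlBu_r)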
plt.scatter(X[y == 0][:, 0], X[y == 0][:, 1], color='blue', label='Not Spam (0)', alpha=0.8)
plt.scatter(X[y == 1][:, 0], X[y == 1][:, 1], color='red', label='Spam (1)', alpha=0.8)
plt.scatter(new_email[0][0], new_email[0][1], color='green', marker='x', s=200, label='New Email')
plt.xlabel('Suspicious Words')
plt.ylabel('Email Length (chars)')
plt.title('Logistic Regression: Spam vs. Not Spam')
plt.legend()
plt.grid(True)
plt.show()

print("Look at the plot above:")
print("- Blue dots are emails labeled as Not Spam.")
print("- Red dots are emails labeled as Spam.")
print("- The colored background shows the decision boundary: blue area predicts Not Spam, red area predicts Spam.")
print("- The green 'X' marks the new email; the background color under it shows its predicted class.")

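To connect the trained model back to the formula from Section 2, the sketch below reads the learned weights from scikit-learn's `coef_` and `intercept_` attributes and recomputes the spam probability by hand. It retrains on the same five emails so it can run on its own; the names `m1`, `m2`, and `b` simply mirror the earlier equation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Same small dataset as above
X = np.array([[0, 100], [2, 300], [1, 150], [5, 400], [3, 250]])
y = np.array([0, 1, 0, 1, 1])
model = LogisticRegression().fit(X, y)

# Learned weights: one per feature, plus the intercept
m1, m2 = model.coef_[0]
b = model.intercept_[0]
print(f"m1 (suspicious words) = {m1:.4f}, m2 (length) = {m2:.4f}, b = {b:.4f}")

# Recompute the probability for the new email by hand
new_email = np.array([[4, 350]])
score = m1 * 4 + m2 * 350 + b
manual_probability = 1.0 / (1.0 + np.exp(-score))
sklearn_probability = model.predict_proba(new_email)[0][1]

print(f"By hand:      {manual_probability:.4f}")
print(f"scikit-learn: {sklearn_probability:.4f}")
```

The two probabilities should agree, which shows that `predict_proba` is applying the sigmoid to the linear score.
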
## 4. Interactive Exercise: Adjust Data and Classify

Now it’s your turn to experiment with logistic regression! In this exercise, you can add a new email to the dataset by specifying its features and label, then see how the decision boundary changes and make a prediction for another email.

**Instructions**:
- Run the code below.
- Enter the number of suspicious words, email length, and whether it’s spam (1 for yes, 0 for no) when prompted to add to the dataset.
- Enter features for a new email you want to predict.
- Observe how the boundary and prediction update with the new data.

In [None]:
# Interactive exercise for logistic regression
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression

print("Welcome to the 'Adjust Data and Classify' Exercise!")
print("You’ll add a new email to the dataset and see how the logistic regression boundary changes.")

# Original dataset
X = np.array([[0, 100], [2, 300], [1, 150], [5, 400], [3, 250]])
y = np.array([0, 1, 0, 1, 1])

# Ask user to add a new data point
try:
    new_words = float(input("Enter number of suspicious words for new email (e.g., 3): "))
    new_length = float(input("Enter email length for new email (chars, e.g., 200): "))
    new_label = int(input("Is this email spam? (1 for Yes, 0 for No): "))
    if new_label not in [0, 1]:
        raise ValueError("Label must be 0 or 1.")
    X = np.vstack([X, [new_words, new_length]])
    y = np.append(y, new_label)
    print(f"Added email: {new_words} suspicious words, {new_length} chars, {'Spam' if new_label == 1 else 'Not Spam'}.")
except ValueError as e:
    print(f"Invalid input: {e}. Using original data without changes.")

# Train the model with updated data
model = LogisticRegression()
model.fit(X, y)

# Ask user for a new email to predict
try:
    predict_words = float(input("Enter suspicious words for an email to predict (e.g., 4): "))
    predict_length = float(input("Enter email length to predict (chars, e.g., 350): "))
    new_email = np.array([[predict_words, predict_length]])
    prediction = model.predict(new_email)[0]
    probability = model.predict_proba(new_email)[0][1]
    print(f"Predicted class for email ({predict_words} words, {predict_length} chars): {'Spam' if prediction == 1 else 'Not Spam'}")
    print(f"Probability of being Spam: {probability:.2%}")
except ValueError:
    new_email = np.array([[4, 350]])
    prediction = model.predict(new_email)[0]
    probability = model.predict_proba(new_email)[0][1]
    print(f"Invalid input. Defaulting to 4 words, 350 chars. Predicted: {'Spam' if prediction == 1 else 'Not Spam'}, Probability of Spam: {probability:.2%}")

# Visualize the updated data and decision boundary
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 50, X[:, 1].max() + 50
xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.1), np.arange(y_min, y_max, 10))
Z = model.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)

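# Reversed colormap so class 0 (Not Spam) is shaded blue and class 1 (Spam) red, matching the dots
plt.contourf(xx, yy, Z, alpha=0.3, cmap=plt.cm.RdYlBu_r)
n_original = 5  # Number of emails in the starting dataset
X_orig, y_orig = X[:n_original], y[:n_original]
plt.scatter(X_orig[y_orig == 0][:, 0], X_orig[y_orig == 0][:, 1], color='blue', label='Original Not Spam (0)', alpha=0.8)
plt.scatter(X_orig[y_orig == 1][:, 0], X_orig[y_orig == 1][:, 1], color='red', label='Original Spam (1)', alpha=0.8)
if len(X) > n_original:  # Only plot an added point if the input above was valid
    plt.scatter(X[-1, 0], X[-1, 1], color='orange', label=f"Your Added Data ({'Spam' if y[-1] == 1 else 'Not Spam'})", alpha=0.8)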
plt.scatter(new_email[0][0], new_email[0][1], color='green', marker='x', s=200, label='Prediction')
plt.xlabel('Suspicious Words')
plt.ylabel('Email Length (chars)')
plt.title('Logistic Regression: Updated Spam vs. Not Spam')
plt.legend()
plt.grid(True)
plt.show()

print("Look at the plot above:")
print("- Blue dots are original emails labeled as Not Spam.")
print("- Red dots are original emails labeled as Spam.")
print("- The orange dot (if shown) is the email you added.")
print("- The colored background shows the updated decision boundary.")
print("- The green 'X' marks the email you chose to classify; the background color under it shows its predicted class.")

## 5. Key Considerations for Logistic Regression

Logistic regression is a powerful and interpretable algorithm for classification, but it has limitations to keep in mind:

- **Assumes a Linear Boundary**: It works best when the two classes can be separated reasonably well by a straight line (or a plane for more than two features). The classes do not need to be perfectly separable, but if the relationship is very complex or non-linear, other algorithms might perform better.
- **Binary Classification by Default**: While it’s primarily for two-class problems, it can be extended to multi-class problems with techniques like "one-vs-rest," though this adds complexity.
- **Sensitive to Imbalanced Data**: If one class (e.g., spam) is much rarer than the other, the model's predictions may lean toward the majority class unless you adjust for it, for example by reweighting the classes or changing the threshold.
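
The imbalance issue can be illustrated with a small sketch. The data here is synthetic and made up for this example (190 "not spam" emails, 10 "spam" emails, one illustrative feature). It compares a default model with one that uses scikit-learn's `class_weight="balanced"` option, which is one possible adjustment rather than the only one.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic, made-up data: the spam class is rare (10 out of 200 emails)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(190, 1)),   # Not Spam
               rng.normal(1.5, 1.0, size=(10, 1))])   # Spam
y = np.array([0] * 190 + [1] * 10)

default_model = LogisticRegression().fit(X, y)
balanced_model = LogisticRegression(class_weight="balanced").fit(X, y)

default_pred = default_model.predict(X)
balanced_pred = balanced_model.predict(X)

def spam_recall(pred):
    """Fraction of the actual spam emails that were predicted as spam."""
    return (pred[y == 1] == 1).mean()

print(f"Default:  predicted spam {default_pred.sum()} times, "
      f"spam recall = {spam_recall(default_pred):.0%}")
print(f"Balanced: predicted spam {balanced_pred.sum()} times, "
      f"spam recall = {spam_recall(balanced_pred):.0%}")
```

Reweighting typically catches more of the rare class at the cost of more false alarms on the common class, so whether it helps depends on which mistake is more costly.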

**Analogy**: Logistic regression is like drawing a straight line to separate two groups of people in a crowd based on height and weight. If the groups are mixed in a way that a straight line can’t separate them (like a circular pattern), you’d need a different method to draw the boundary.
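
The circular pattern from the analogy can be tried out directly. This sketch uses scikit-learn's `make_circles` helper to generate one group inside a ring of the other, fits a plain logistic regression, and then, as one possible workaround not covered in this notebook, adds squared features so that a straight boundary in the new feature space corresponds to a circle in the original one.

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.linear_model import LogisticRegression

# Synthetic data: an inner circle (class 1) surrounded by an outer ring (class 0)
X, y = make_circles(n_samples=200, noise=0.05, factor=0.5, random_state=0)

# A straight line cannot separate a circle from the ring around it
linear_model = LogisticRegression().fit(X, y)
linear_acc = linear_model.score(X, y)

# Assumption for illustration: append x1^2 and x2^2 as extra features
X_squared = np.hstack([X, X ** 2])
curved_model = LogisticRegression().fit(X_squared, y)
curved_acc = curved_model.score(X_squared, y)

print(f"Accuracy with original features:       {linear_acc:.0%}")
print(f"Accuracy with squared features added:  {curved_acc:.0%}")
```

The algorithm is unchanged in the second model; only the features it sees are different.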

Despite these limitations, logistic regression is widely used due to its simplicity, speed, and ability to provide probabilities, making it a great starting point for classification tasks.

## 6. Key Takeaways

- **Logistic Regression** is a supervised learning algorithm for classification, predicting categories (often binary, like yes/no) by estimating probabilities.
- It works by combining features into a score, transforming it into a probability using the sigmoid function, and classifying based on a threshold (usually 0.5).
- Use it for tasks like spam detection, disease prediction, or customer behavior classification when a straight-line (linear) boundary separates the classes reasonably well.
- Be aware of limitations: it draws a linear boundary, works best for binary classification, and can struggle with imbalanced data.

You’ve now learned a key classification algorithm! Logistic regression builds on the ideas of linear regression but adapts them for categorical predictions, expanding your machine learning toolkit.

**What's Next?**
Move on to **Notebook 6: Decision Trees** to learn about another powerful algorithm for both classification and regression tasks, which works by making decisions in a tree-like structure. See you there!