# Linear Regression: Predicting Continuous Values

Welcome to the fourth notebook in our **Machine Learning Basics for Beginners** series! Now that you've learned the foundations of machine learning and its different types, it's time to dive into specific algorithms. We’re starting with **Linear Regression**, one of the simplest and most widely used algorithms in supervised learning for predicting continuous values.

**What You'll Learn in This Notebook:**
- What linear regression is and when to use it.
- How linear regression works in simple terms.
- A hands-on example of predicting house prices using linear regression.
- An interactive exercise to adjust data and see how predictions change.
- Visualizations to understand the concept of a "best-fit line."

Let's get started!

## 1. What is Linear Regression?

**Linear Regression** is a supervised learning algorithm used to predict a continuous numeric value (such as a price or a temperature) from one or more input features. It assumes there is a straight-line (linear) relationship between the input features and the output value.

- **Goal**: Find the "best-fit line" that predicts the output as closely as possible to the actual values in the data.
- **When to Use It**: Use linear regression when you want to predict something numerical, like house prices, temperatures, or sales figures, and you think the relationship between your inputs and output is roughly linear (can be represented by a straight line or plane).
- **Examples**:
  - Predicting house prices based on size (square footage).
  - Estimating a car's fuel efficiency based on its weight.
  - Forecasting sales based on advertising budget.

**Analogy**: Imagine you're trying to draw a straight line through a scatter of points on a graph so that the line gets as close as possible to most points. That line helps you guess where new points might fall. Linear regression does exactly that!

## 2. How Does Linear Regression Work?

Linear regression works by finding a mathematical equation for a straight line that best matches the data. Let’s break it down step by step:

1. **The Line Equation**: The simplest form of linear regression (with one feature) uses the equation of a straight line:
   - `y = mx + b`
     - `y` is the predicted value (e.g., house price).
     - `x` is the input feature (e.g., house size).
     - `m` is the slope of the line (how much `y` changes when `x` increases by one unit).
     - `b` is the y-intercept (the value of `y` when `x` is 0).
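2. **Finding the Best Line**: The algorithm looks for the values of `m` and `b` that minimize the "error": the difference between the predicted values (on the line) and the actual values in the data. It measures this with a "cost" (how bad the predictions are overall, commonly the average of the squared errors) and keeps the line with the lowest cost. You can picture this as trying many different lines and keeping the best one, although libraries such as scikit-learn calculate the best `m` and `b` directly rather than by guessing.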
3. **Prediction**: Once the best line is found, you can plug in a new `x` value (e.g., a new house size) to predict `y` (the price).
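
The three steps above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the four data points are made up so that they fall exactly on `y = 2x + 1`, the helper name `mean_squared_error_for_line` is hypothetical, and the average squared error is used as the "cost" because it is a common choice.

```python
import numpy as np

# Made-up data for illustration: every point lies exactly on y = 2x + 1
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.0, 5.0, 7.0, 9.0])

def mean_squared_error_for_line(m, b):
    """Cost of the line y = m*x + b: the average squared prediction error."""
    predictions = m * x + b
    return np.mean((y - predictions) ** 2)

# Step 2 (intuition): compare the cost of a few candidate lines
for m, b in [(1.0, 0.0), (1.5, 2.0), (2.0, 1.0)]:
    print(f"m={m}, b={b}: cost = {mean_squared_error_for_line(m, b):.3f}")

# Step 2 (in practice): the least-squares formulas give the lowest-cost line directly
best_m = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
best_b = y.mean() - best_m * x.mean()
print(f"Best-fit line: y = {best_m:.2f}x + {best_b:.2f}")

# Step 3: plug a new x value into the line to get a prediction
new_x = 5.0
print(f"Prediction for x = {new_x}: {best_m * new_x + best_b:.2f}")
```

The candidate closest to the true line has the lowest cost, and the formulas land on that same line without any trial and error.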

If you have multiple features (like size, bedrooms, and location), linear regression extends to a plane or hyperplane, but the idea is the same: find the best fit.
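
As a rough sketch of the multi-feature case, the snippet below fits a model on two features. The numbers are invented for illustration and are generated from a known rule (`price = 0.1 * size + 10 * bedrooms + 5`) with no noise, so you can check that the model recovers one coefficient per feature plus an intercept.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Invented data for illustration. Columns: size (sq ft), number of bedrooms
features = np.array([
    [500, 1],
    [1000, 2],
    [1500, 2],
    [2000, 3],
    [2500, 4],
], dtype=float)

# Prices follow a known rule exactly, so the fit should recover it
prices = 0.1 * features[:, 0] + 10 * features[:, 1] + 5

multi_model = LinearRegression()
multi_model.fit(features, prices)

# One coefficient per feature, plus a single intercept
print("Coefficients (size, bedrooms):", np.round(multi_model.coef_, 3))
print("Intercept:", round(float(multi_model.intercept_), 3))

# Predict for a hypothetical 1800 sq ft house with 3 bedrooms
multi_prediction = multi_model.predict(np.array([[1800.0, 3.0]]))[0]
print(f"Predicted price: {multi_prediction:.2f} thousand")
```

With real data the prices would not follow the rule exactly, and the fitted coefficients would be estimates rather than exact matches.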

**Analogy**: Think of linear regression as trying on different pairs of glasses until you find the one that lets you see a blurry image most clearly. Each pair of glasses is like a different line, and the clearest view is the "best-fit line."

## 3. Example: Predicting House Prices

Let’s see linear regression in action with a small dataset of house sizes (in square feet) and their prices (in thousands of dollars). Our goal is to predict the price of a new house based on its size.

**Dataset**:
- House Size (sq ft): 500, 1000, 1500, 2000, 2500
- Price (thousands $): 50, 100, 140, 190, 240

We’ll use Python’s `scikit-learn` library to create a linear regression model, train it on this data, and make a prediction. Don’t worry about the code details—just focus on the steps and output for now.

**Instructions**: Run the code below to see how linear regression predicts house prices and visualizes the best-fit line.

In [None]:
# Import necessary libraries
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# Our small dataset
house_sizes = np.array([500, 1000, 1500, 2000, 2500]).reshape(-1, 1)  # Feature: size in sq ft
house_prices = np.array([50, 100, 140, 190, 240])  # Label: price in thousands $

# Create and train the linear regression model
model = LinearRegression()
model.fit(house_sizes, house_prices)

# Get the slope (m) and intercept (b) of the best-fit line
slope = model.coef_[0]
intercept = model.intercept_
print(f"Best-fit line equation: Price = {slope:.3f} * Size + {intercept:.2f}")
print(f"This means for every 1 sq ft increase, price increases by about ${slope * 1000:.0f}.")

# Predict the price for a new house of 1800 sq ft
new_house_size = np.array([[1800]])
predicted_price = model.predict(new_house_size)[0]
print(f"Predicted price for an 1800 sq ft house: ${predicted_price:.2f} thousand")

# Visualize the data and the best-fit line
plt.scatter(house_sizes, house_prices, color='blue', label='Actual Data')
plt.plot(house_sizes, model.predict(house_sizes), color='red', label='Best-Fit Line')
plt.scatter(new_house_size, predicted_price, color='green', marker='x', s=200, label='Prediction (1800 sq ft)')
plt.xlabel('House Size (sq ft)')
plt.ylabel('Price (thousands $)')
plt.title('Linear Regression: Predicting House Prices')
plt.legend()
plt.grid(True)
plt.show()

print("Look at the plot above:")
print("- Blue dots are the actual house data.")
print("- The red line is the best-fit line found by linear regression.")
print("- The green 'X' is the predicted price for a new 1800 sq ft house.")
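
To connect the model's output back to the line equation, here is a small check (a sketch that reuses the same five houses) showing that `model.predict` gives the same answer as plugging the size into `Price = slope * Size + intercept` by hand. It also prints the residuals, meaning how far each actual price sits above or below the line.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Same dataset as the example above
house_sizes = np.array([500, 1000, 1500, 2000, 2500]).reshape(-1, 1)
house_prices = np.array([50, 100, 140, 190, 240])

model = LinearRegression().fit(house_sizes, house_prices)
slope, intercept = model.coef_[0], model.intercept_

# Plug the new size into the line equation by hand
manual_prediction = slope * 1800 + intercept
model_prediction = model.predict(np.array([[1800]]))[0]
print(f"By hand:  ${manual_prediction:.2f} thousand")
print(f"By model: ${model_prediction:.2f} thousand")

# Residuals: actual price minus the price on the line, for each house
residuals = house_prices - model.predict(house_sizes)
print("Residuals (thousands $):", np.round(residuals, 2))
```

Both predictions should agree at about 172.2 thousand, and the residuals are small because this dataset is close to a straight line.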

## 4. Interactive Exercise: Adjust Data and Predict

Now it’s your turn to play with linear regression! In this exercise, you can add a new house to the dataset by specifying its size and price, then see how the best-fit line changes and make a prediction for another house size.

**Instructions**:
- Run the code below.
- Enter a house size (in sq ft) and price (in thousands $) when prompted to add to the dataset.
- Enter a size for a house you want to predict the price for.
- Observe how the line and prediction update with the new data.

In [None]:
# Interactive exercise for linear regression
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

print("Welcome to the 'Adjust Data and Predict' Exercise!")
print("You’ll add a new house to the dataset and see how the linear regression line changes.")

# Original dataset
house_sizes = np.array([500, 1000, 1500, 2000, 2500]).reshape(-1, 1)
house_prices = np.array([50, 100, 140, 190, 240])

# Ask user to add a new data point
try:
    new_size = float(input("Enter a new house size (sq ft, e.g., 3000): "))
    new_price = float(input("Enter the price for that house (thousands $, e.g., 300): "))
    house_sizes = np.append(house_sizes, new_size).reshape(-1, 1)
    house_prices = np.append(house_prices, new_price)
    print(f"Added house: {new_size:g} sq ft for ${new_price:g} thousand.")
except ValueError:
    print("Invalid input. Using original data without changes.")

# Train the model with updated data
model = LinearRegression()
model.fit(house_sizes, house_prices)

# Get the updated slope and intercept
slope = model.coef_[0]
intercept = model.intercept_
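sign = "+" if intercept >= 0 else "-"  # avoids printing "+ -2.00" when the intercept is negative
print(f"Updated best-fit line equation: Price = {slope:.3f} * Size {sign} {abs(intercept):.2f}")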

# Ask user for a size to predict
try:
    predict_size = float(input("Enter a house size to predict its price (sq ft, e.g., 1800): "))
    predicted_price = model.predict(np.array([[predict_size]]))[0]
    print(f"Predicted price for a {predict_size:g} sq ft house: ${predicted_price:.2f} thousand")
except ValueError:
    predict_size = 1800
    predicted_price = model.predict(np.array([[predict_size]]))[0]
    print(f"Invalid input. Defaulting to 1800 sq ft. Predicted price: ${predicted_price:.2f} thousand")

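# Visualize the updated data and line
n_original = 5  # the original dataset has five houses
plt.scatter(house_sizes[:n_original], house_prices[:n_original], color='blue', label='Original Data')
if len(house_sizes) > n_original:  # only true when a valid house was added above
    plt.scatter(house_sizes[n_original:], house_prices[n_original:], color='orange', label='Your Added Data')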
plt.plot(house_sizes, model.predict(house_sizes), color='red', label='Best-Fit Line')
plt.scatter(predict_size, predicted_price, color='green', marker='x', s=200, label=f'Prediction ({predict_size:g} sq ft)')
plt.xlabel('House Size (sq ft)')
plt.ylabel('Price (thousands $)')
plt.title('Linear Regression: Updated with Your Data')
plt.legend()
plt.grid(True)
plt.show()

print("Look at the plot above:")
print("- Blue dots are the original house data.")
print("- Orange dot is the house data you added.")
print("- The red line is the updated best-fit line.")
print("- The green 'X' is the predicted price for the house size you chose.")

## 5. Key Considerations for Linear Regression

While linear regression is simple and powerful, it’s not perfect for every situation. Here are some things to keep in mind:

- **Assumes a Linear Relationship**: Linear regression works best when the relationship between features and the label is roughly a straight line. If the relationship is curved or complex (like a parabola), it might not predict well.
- **Sensitive to Outliers**: If your data has extreme values (outliers), they can pull the line toward themselves and away from the rest of the data, leading to bad predictions.
- **Limited to Continuous Outputs**: Linear regression predicts numbers, not categories. For categories (like spam vs. not spam), you’d use a different algorithm.
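
To see the outlier effect in numbers, the sketch below fits the house data from Section 3 twice: once as is, and once with a single extra point. The extra point (a 3000 sq ft house priced at only 50 thousand) is a hypothetical outlier invented for this illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# House data from Section 3
sizes = np.array([500, 1000, 1500, 2000, 2500], dtype=float)
prices = np.array([50, 100, 140, 190, 240], dtype=float)

# Hypothetical outlier: a large house recorded at an unusually low price
sizes_with_outlier = np.append(sizes, 3000.0)
prices_with_outlier = np.append(prices, 50.0)

clean_model = LinearRegression().fit(sizes.reshape(-1, 1), prices)
outlier_model = LinearRegression().fit(sizes_with_outlier.reshape(-1, 1), prices_with_outlier)

print(f"Slope without outlier: {clean_model.coef_[0]:.4f}")
print(f"Slope with outlier:    {outlier_model.coef_[0]:.4f}")

# Compare predictions for the same 1800 sq ft house
query = np.array([[1800.0]])
clean_prediction = clean_model.predict(query)[0]
outlier_prediction = outlier_model.predict(query)[0]
print(f"Prediction without outlier: ${clean_prediction:.2f} thousand")
print(f"Prediction with outlier:    ${outlier_prediction:.2f} thousand")
```

A single unusual point flattens the slope noticeably and shifts the prediction for an ordinary house, which is why it is worth checking your data for outliers before fitting.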

**Analogy**: Linear regression is like using a ruler to draw a straight line through points. If the points naturally form a curve, a ruler won’t help much—you’d need a different tool!
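
The ruler analogy can be checked with a small experiment. In this sketch the data is made up to follow a curve exactly (`y = x` squared), and a straight line is fitted to it anyway. The residuals (actual minus predicted) show the tell-tale pattern of a poor linear fit: the line is too low at both ends and too high in the middle.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up curved data for illustration: y = x squared
x = np.arange(6, dtype=float)  # 0, 1, 2, 3, 4, 5
y = x ** 2

# Fit a straight line even though the data is curved
line = LinearRegression().fit(x.reshape(-1, 1), y)

# Residuals: actual value minus the value on the fitted line
residuals = y - line.predict(x.reshape(-1, 1))
print("Residuals:", np.round(residuals, 2))
```

When residuals change sign in a systematic way like this instead of scattering randomly around zero, it is a hint that a straight line is the wrong shape for the data.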

Despite these limitations, linear regression is a great starting point for understanding machine learning and is often used as a baseline to compare more complex models.

## 6. Key Takeaways

- **Linear Regression** is a supervised learning algorithm for predicting continuous values (numbers) by finding a best-fit line.
- It works by minimizing the error between predicted values and actual data points, using an equation like `y = mx + b` for one feature.
- Use it for tasks like predicting prices, temperatures, or sales when the relationship between inputs and outputs is roughly linear.
- Be aware of limitations: it assumes linearity, can be affected by outliers, and isn’t suited for categorical predictions.

You’ve just learned your first machine learning algorithm! Linear regression is a building block for many other techniques, and understanding it sets you up for success with more complex models.

**What's Next?**
Move on to **Notebook 5: Logistic Regression** to learn about a related algorithm used for classification tasks (predicting categories instead of numbers). See you there!