# Linear Regression Tutorial

Welcome to this interactive tutorial on **Linear Regression**! 

In this notebook, we will cover:
1.  **Theory**: Understanding the math behind Linear Regression.
2.  **Implementation**: Building, evaluating, and visualizing a model using `scikit-learn`.
3.  **Exercises**: Hands-on practice to reinforce your learning.
4.  **Solutions**: Worked answers to the exercises.

## 1. Theory

### What is Linear Regression?
Linear Regression is a supervised learning algorithm used to predict a continuous target variable ($y$) based on one or more input features ($X$). It assumes a linear relationship between the input and the output.

### Simple Linear Regression
For a single feature, the relationship is defined as:
$$
y = \beta_0 + \beta_1 x + \epsilon
$$
Where:
- $y$ is the dependent variable (target).
- $x$ is the independent variable (feature).
- $\beta_0$ is the y-intercept.
- $\beta_1$ is the slope (coefficient).
- $\epsilon$ is the error term (noise).

### Cost Function
To find the best-fitting line, we minimize the error between the predicted values ($\hat{y}$) and the actual values ($y$). The most common cost function is the **Mean Squared Error (MSE)**:
$$
MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2
$$
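
To make the formula concrete, here is a minimal sketch that computes the MSE by hand with NumPy. The arrays `y_true` and `y_hat` hold made-up toy values chosen only for illustration; they are not data used elsewhere in this notebook.

```python
import numpy as np

# Hypothetical toy values, chosen only for illustration
y_true = np.array([3.0, 5.0, 7.0, 9.0])   # actual values y_i
y_hat = np.array([2.5, 5.5, 7.0, 10.0])   # predicted values y_hat_i

# Residuals: actual minus predicted
residuals = y_true - y_hat

# MSE: mean of the squared residuals
mse_manual = np.mean(residuals ** 2)
print(f"MSE: {mse_manual:.3f}")  # MSE: 0.375
```

Squaring the residuals keeps positive and negative errors from cancelling and penalizes large errors more heavily than small ones.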

### Optimization
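Two common ways to find the values of $\beta_0$ and $\beta_1$ that minimize the MSE are **Gradient Descent**, an iterative method that repeatedly steps against the gradient of the cost, and the **Normal Equation**, the closed-form solution of Ordinary Least Squares (OLS). scikit-learn's `LinearRegression`, used in the next section, solves the least-squares problem directly rather than by gradient descent.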
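
The sketch below compares the two approaches on a small synthetic dataset. For a single feature, the closed-form OLS estimates are $\hat{\beta}_1 = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2}$ and $\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$. The names `x_demo` and `y_demo`, the true parameters (4 and 3), the learning rate, and the iteration count are all assumptions made for this illustration.

```python
import numpy as np

# Synthetic data with assumed true parameters beta_0 = 4 and beta_1 = 3
rng = np.random.default_rng(0)  # local generator; leaves the global NumPy seed untouched
x_demo = 2 * rng.random(200)
y_demo = 4 + 3 * x_demo + rng.normal(size=200)

# Closed-form OLS estimates for a single feature
x_mean, y_mean = x_demo.mean(), y_demo.mean()
b1_ols = np.sum((x_demo - x_mean) * (y_demo - y_mean)) / np.sum((x_demo - x_mean) ** 2)
b0_ols = y_mean - b1_ols * x_mean

# Batch gradient descent on the MSE
b0_gd, b1_gd = 0.0, 0.0
learning_rate = 0.1   # assumed value, small enough to converge on this data
n_iterations = 5000   # assumed value

for _ in range(n_iterations):
    error = (b0_gd + b1_gd * x_demo) - y_demo   # prediction minus target
    grad_b0 = 2 * error.mean()                  # d(MSE)/d(beta_0)
    grad_b1 = 2 * (error * x_demo).mean()       # d(MSE)/d(beta_1)
    b0_gd -= learning_rate * grad_b0
    b1_gd -= learning_rate * grad_b1

print(f"Closed form:      beta_0 = {b0_ols:.4f}, beta_1 = {b1_ols:.4f}")
print(f"Gradient descent: beta_0 = {b0_gd:.4f}, beta_1 = {b1_gd:.4f}")
```

Both methods minimize the same cost, so their estimates should agree closely. The closed form gets there in one step, while gradient descent depends on a suitable learning rate and enough iterations.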

In [None]:
# Import necessary libraries
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

# Set seed for reproducibility
np.random.seed(42)

## 2. Implementation with scikit-learn

Let's start by generating some synthetic data to work with.

In [None]:
# Generate synthetic data
X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1)

# Visualize the data
plt.scatter(X, y, alpha=0.6)
plt.xlabel("X")
plt.ylabel("y")
plt.title("Synthetic Data")
plt.show()

### Splitting the Data
It's crucial to split our data into a **training set** and a **testing set**. We train the model on the training set and evaluate its performance on the unseen testing set.
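
Conceptually, a random split shuffles the row indices and holds out a fraction of them. The sketch below does this by hand on a hypothetical 10-row dataset. It illustrates the idea only and is not scikit-learn's implementation; the names `X_demo`, `y_demo`, `X_tr`, and `X_te` are made up for this example.

```python
import numpy as np

rng = np.random.default_rng(0)  # local generator with an assumed seed

n_samples = 10
X_demo = np.arange(n_samples).reshape(-1, 1)   # hypothetical feature column
y_demo = 2 * X_demo.ravel() + 1                # hypothetical targets

# Shuffle the row indices, then reserve the first 20% for testing
indices = rng.permutation(n_samples)
n_test = int(round(0.2 * n_samples))
test_idx, train_idx = indices[:n_test], indices[n_test:]

X_tr, X_te = X_demo[train_idx], X_demo[test_idx]
y_tr, y_te = y_demo[train_idx], y_demo[test_idx]

print(f"Train rows: {len(X_tr)}, test rows: {len(X_te)}")  # Train rows: 8, test rows: 2

# No row appears in both sets
overlap = set(train_idx.tolist()) & set(test_idx.tolist())
print(f"Rows in both sets: {len(overlap)}")  # Rows in both sets: 0
```

Because the two index sets are disjoint, the held-out rows never influence the fitted parameters, which is what makes the test score an honest estimate of performance on new data.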

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(f"Training set size: {X_train.shape[0]}")
print(f"Testing set size: {X_test.shape[0]}")

### Training the Model
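Now we instantiate the `LinearRegression` model from scikit-learn and fit it to our training data. Because `y_train` is a 2-D array of shape `(80, 1)`, the fitted `intercept_` is a length-1 array and `coef_` has shape `(1, 1)`, which is why the code below indexes into them before formatting.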

In [None]:
model = LinearRegression()
model.fit(X_train, y_train)

print(f"Intercept (beta_0): {model.intercept_[0]:.2f}")
print(f"Coefficient (beta_1): {model.coef_[0][0]:.2f}")

### Evaluation
Let's evaluate our model using the testing set. We'll look at the **Mean Squared Error (MSE)** and the **R-squared ($R^2$)** score.
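
A lower MSE is better, while $R^2$ is at most 1 and is better the closer it gets to 1. It is defined as $R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}$, the share of the variance in $y$ that the model explains. As a minimal sketch, the block below computes it by hand on made-up toy values (`y_true` and `y_hat` are illustrative, not the notebook's data):

```python
import numpy as np

# Hypothetical toy values, chosen only for illustration
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.5, 5.5, 7.0, 10.0])

ss_res = np.sum((y_true - y_hat) ** 2)          # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares around the mean
r2_manual = 1 - ss_res / ss_tot

print(f"R^2: {r2_manual:.3f}")  # R^2: 0.925
```

A model that always predicts the mean of `y_true` would score 0, and a model that does worse than that scores below 0.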

In [None]:
y_pred = model.predict(X_test)

mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"Mean Squared Error: {mse:.2f}")
print(f"R^2 Score: {r2:.2f}")

### Visualization of the Regression Line
Let's plot the regression line over our test data to see how well it fits.

In [None]:
plt.scatter(X_test, y_test, color='black', label='Actual Data')
plt.plot(X_test, y_pred, color='blue', linewidth=3, label='Regression Line')
plt.xlabel("X")
plt.ylabel("y")
plt.title("Linear Regression Fit")
plt.legend()
plt.show()

## 3. Exercises

### Exercise 1: Different Noise Levels
Create a new synthetic dataset with a higher noise level by multiplying the `np.random.randn(100, 1)` term by a factor greater than 1 (for example, 5).
Train a Linear Regression model on this noisy data and compare the $R^2$ score with the previous model.
Does the model perform better or worse? Why?

In [None]:
# Your code here



### Exercise 2: Multiple Linear Regression
Generate a dataset with **2 features** ($X_1, X_2$) instead of 1.
Hint: Use `X = 2 * np.random.rand(100, 2)` and update the equation for `y` to include both features (e.g., `y = 4 + 3 * X[:, 0] + 5 * X[:, 1] + noise`, where `noise` is a 1-D array such as `np.random.randn(100)`).

Train a model and print the coefficients.

In [None]:
# Your code here



## 4. Solutions

<details>
<summary>Click to see Solution for Exercise 1</summary>

```python
# High noise data
X_noisy = 2 * np.random.rand(100, 1)
y_noisy = 4 + 3 * X_noisy + 5 * np.random.randn(100, 1) # Increased noise multiplier to 5

X_train_n, X_test_n, y_train_n, y_test_n = train_test_split(X_noisy, y_noisy, test_size=0.2, random_state=42)

model_noisy = LinearRegression()
model_noisy.fit(X_train_n, y_train_n)

print(f"R^2 Score (High Noise): {model_noisy.score(X_test_n, y_test_n):.2f}")
# Expect a lower R^2 than before: the noise variance is now 25 times larger,
# so the linear trend explains a smaller share of the variance in y.
```
</details>

<details>
<summary>Click to see Solution for Exercise 2</summary>

```python
# Multiple features
X_multi = 2 * np.random.rand(100, 2)
y_multi = 4 + 3 * X_multi[:, 0] + 5 * X_multi[:, 1] + np.random.randn(100)

X_train_m, X_test_m, y_train_m, y_test_m = train_test_split(X_multi, y_multi, test_size=0.2, random_state=42)

model_multi = LinearRegression()
model_multi.fit(X_train_m, y_train_m)

print(f"Coefficients: {model_multi.coef_}")
print(f"Intercept: {model_multi.intercept_:.2f}")
```
</details>