# Week 6 – Regression
**IIT Delhi AI/ML Course – Applied & Mathematical View**

This notebook covers regression concepts step by step, combining IIT lecture notes with practical scikit-learn examples.

## 1. Introduction
Regression is a supervised learning method for predicting a continuous numeric value from one or more features.
**Examples:**
- Predicting house prices from size, location, rooms
- Predicting sales revenue from advertising spend
- Predicting fare amount from passenger details in the Titanic dataset

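**General equation (linear case, one feature):**
$y = \beta_0 + \beta_1 x + \epsilon$

Here $\beta_0$ is the intercept, $\beta_1$ is the slope, and $\epsilon$ is the random error term.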

## 2. Simple Linear Regression
**Model equation:**
$y = \beta_0 + \beta_1 x + \epsilon$

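**Estimating coefficients (Least Squares):**

$\hat{\beta}_1 = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}$

$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$

where $\bar{x}$ and $\bar{y}$ are the sample means. These estimates minimize the SSE (Sum of Squared Errors) between the observed values $y_i$ and the fitted values $\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$:

$SSE = \sum (y_i - \hat{y}_i)^2$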
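The closed-form estimates above can be checked by hand on a tiny dataset. The sketch below uses made-up numbers (not from the lecture notes or the Titanic data) and plain NumPy, so each term of the formulas is visible.

```python
import numpy as np

# Small made-up dataset (illustrative values only)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

x_mean, y_mean = x.mean(), y.mean()

# Slope: sum of cross-deviations divided by sum of squared deviations of x
beta_1 = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)

# Intercept: makes the fitted line pass through (x_mean, y_mean)
beta_0 = y_mean - beta_1 * x_mean

# Fitted values and the sum of squared errors they produce
y_hat = beta_0 + beta_1 * x
sse = np.sum((y - y_hat) ** 2)

print("beta_0:", beta_0)
print("beta_1:", beta_1)
print("SSE:", sse)
```

For this data the slope works out to about 0.6, the intercept to about 2.2, and the SSE to about 2.4. Fitting `LinearRegression` on the same values should give matching coefficients.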

In [None]:

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# Load the Titanic dataset and drop rows with a missing age or fare
titanic = sns.load_dataset('titanic').dropna(subset=['age', 'fare'])

X = titanic[['age']]  # Feature
y = titanic['fare']   # Target

model = LinearRegression()
model.fit(X, y)

print("Intercept:", model.intercept_)
print("Slope:", model.coef_[0])  # coef_ holds one entry per feature

# Predictions
titanic['pred_fare'] = model.predict(X)

# Plot regression line
plt.figure(figsize=(6,4))
sns.scatterplot(x='age', y='fare', data=titanic, alpha=0.5)
sns.lineplot(x='age', y='pred_fare', data=titanic, color='red')
plt.title("Linear Regression: Age vs Fare")
plt.show()


## 3. Multiple Linear Regression
When there are several independent variables $x_1, \dots, x_n$, each one gets its own coefficient:
$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n + \epsilon$
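With several features, the least-squares coefficients are usually computed in matrix form, where the standard closed-form solution is $\hat{\beta} = (X^T X)^{-1} X^T y$ for a design matrix $X$ whose first column is all ones. The sketch below is a minimal illustration on synthetic data. The coefficient values and variable names are assumptions chosen for the example, and `np.linalg.lstsq` is used instead of an explicit matrix inverse because it is numerically more stable.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: y depends on two features plus a little noise
n_samples = 200
X = rng.normal(size=(n_samples, 2))
true_beta = np.array([1.5, -2.0, 0.5])  # [beta_0, beta_1, beta_2], made up
y = true_beta[0] + X @ true_beta[1:] + rng.normal(scale=0.1, size=n_samples)

# Prepend a column of ones so the intercept is estimated with the slopes
X_design = np.column_stack([np.ones(n_samples), X])

# Least-squares solution of X_design @ beta ~= y
beta_hat, *_ = np.linalg.lstsq(X_design, y, rcond=None)

print("Estimated coefficients:", beta_hat)
```

The estimates should land close to the made-up `true_beta` values, since the noise is small relative to the signal.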

In [None]:

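# Encode sex numerically in a new column so the cell can be re-run safely
# (overwriting 'sex' in place would turn it into NaN on a second run)
titanic['sex_encoded'] = titanic['sex'].map({'male': 0, 'female': 1})
X = titanic[['age', 'pclass', 'sex_encoded']]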
y = titanic['fare']

model = LinearRegression()
model.fit(X, y)

print("Intercept:", model.intercept_)
print("Coefficients:", model.coef_)


## 4. Assumptions of Linear Regression
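1. **Linearity** – The expected value of Y is a linear function of the features X
2. **Independence** – Observations, and therefore their errors, are independent of each other
3. **Homoscedasticity** – The errors have constant variance across all values of the features
4. **Normality of errors** – The errors follow a normal distribution (this matters mainly for confidence intervals and hypothesis tests)
5. **No multicollinearity** – In multiple regression, the features are not strongly correlated with one another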
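One common way to check the multicollinearity assumption is the variance inflation factor (VIF): regress each feature on all the others and compute $1 / (1 - R_j^2)$. The sketch below is a minimal NumPy version on synthetic data. The helper `variance_inflation_factors` is a hypothetical name written for this example, not a library function, and the data is made up so that one column is nearly a copy of another.

```python
import numpy as np

def variance_inflation_factors(X):
    """Return one VIF per column of X (illustrative helper, not a library API)."""
    X = np.asarray(X, dtype=float)
    n_rows, n_cols = X.shape
    vifs = []
    for j in range(n_cols):
        target = X[:, j]
        # Regress column j on an intercept plus all remaining columns
        others = np.column_stack([np.ones(n_rows), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        residuals = target - others @ coef
        r2 = 1.0 - residuals.var() / target.var()
        vifs.append(1.0 / (1.0 - r2))
    return np.array(vifs)

rng = np.random.default_rng(42)
a = rng.normal(size=500)
b = rng.normal(size=500)                # unrelated to a
c = a + 0.1 * rng.normal(size=500)      # nearly a copy of a

vifs = variance_inflation_factors(np.column_stack([a, b, c]))
print("VIF per column (a, b, c):", np.round(vifs, 1))
```

Columns `a` and `c` should show large VIFs while `b` stays near 1. A frequently quoted rule of thumb treats values above roughly 5 to 10 as a warning sign, though the exact cutoff is a judgment call.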

## 5. Evaluation Metrics
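- **R² Score:** Proportion of the variance in $y$ explained by the model, $R^2 = 1 - \frac{\sum (y_i - \hat{y}_i)^2}{\sum (y_i - \bar{y})^2}$
- **MSE:** Mean Squared Error, $\frac{1}{N} \sum (y_i - \hat{y}_i)^2$ for $N$ observations
- **RMSE:** Root Mean Squared Error, $\sqrt{MSE}$, expressed in the same units as $y$
- **MAE:** Mean Absolute Error, $\frac{1}{N} \sum |y_i - \hat{y}_i|$, less sensitive to large errors than MSE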
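To make the definitions concrete, the four metrics can be computed directly from their formulas. The values below are made up for illustration, and the results should agree with `r2_score`, `mean_squared_error`, and `mean_absolute_error` from scikit-learn on the same arrays.

```python
import numpy as np

# Made-up true values and predictions (illustrative only)
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.0, 5.0, 8.0, 10.0])

errors = y_true - y_pred

mse = np.mean(errors ** 2)          # average squared error
rmse = np.sqrt(mse)                 # back in the units of y
mae = np.mean(np.abs(errors))       # average absolute error

ss_res = np.sum(errors ** 2)                      # unexplained variation (SSE)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)    # total variation around the mean
r2 = 1.0 - ss_res / ss_tot

print("MSE:", mse)
print("RMSE:", rmse)
print("MAE:", mae)
print("R²:", r2)
```

Here the squared errors sum to 3 and the total variation is 20, so R² comes out to about 0.85.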

In [None]:

from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
import numpy as np

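# Note: these metrics are computed on the training data, so they are
# optimistic estimates of how the model performs on unseen data
y_pred = model.predict(X)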
print("R²:", r2_score(y, y_pred))
print("MSE:", mean_squared_error(y, y_pred))
print("RMSE:", np.sqrt(mean_squared_error(y, y_pred)))
print("MAE:", mean_absolute_error(y, y_pred))


## 6. Practical Tips
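- Plot residuals against predicted values to check linearity and homoscedasticity
- Scale features when training with gradient descent or using regularization, since both are sensitive to feature scale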
- Use regularization (Ridge/Lasso) if overfitting occurs
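The scaling and regularization tips can be combined in a scikit-learn pipeline. The sketch below is a minimal example on synthetic data where only two of ten features matter. The `alpha` values are arbitrary assumptions for illustration, and in practice they would be tuned, for example with cross-validation.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic data: only the first 2 of 10 features affect y
X = rng.normal(size=(100, 10))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=100)

# Each pipeline standardizes the features before fitting the model
models = {
    "ols": make_pipeline(StandardScaler(), LinearRegression()),
    "ridge": make_pipeline(StandardScaler(), Ridge(alpha=10.0)),   # alpha is an arbitrary choice
    "lasso": make_pipeline(StandardScaler(), Lasso(alpha=0.1)),    # alpha is an arbitrary choice
}

coefs = {}
for name, pipe in models.items():
    pipe.fit(X, y)
    coefs[name] = pipe[-1].coef_  # coefficients of the final estimator
    print(name, np.round(coefs[name], 2))
```

Ridge shrinks all coefficients toward zero, while Lasso can set some of them exactly to zero, which also makes it useful for feature selection.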

## 7. Summary
- Regression predicts continuous values
- Simple regression: one predictor
- Multiple regression: several predictors
- Evaluate with R², MSE, RMSE, MAE
- Check assumptions before trusting results