# Linear Regression

## 🔷 What is Linear Regression?

Linear Regression is one of the simplest and most widely used **supervised learning algorithms** for predicting a continuous numerical value. It seeks to model the relationship between independent variables (features) and a dependent variable (target) by fitting a linear equation to the observed data.

- A linear regression model assumes that the target variable is a **linear combination** of the input features plus an intercept term.
- It is usually fitted with **Ordinary Least Squares (OLS)**, which chooses the coefficients that minimize the sum of squared differences between the predicted values and the actual values.

---

## 🔑 Mathematical Representation

The equation for Linear Regression is:
\[
y = \beta_0 + \beta_1 \cdot X_1 + \beta_2 \cdot X_2 + \dots + \beta_n \cdot X_n
\]
Where:
- \( y \): Predicted value (dependent variable)
- \( X_1, X_2, \dots, X_n \): Features (independent variables)
- \( \beta_0 \): Intercept (value of \( y \) when all features equal 0)
- \( \beta_1, \beta_2, \dots, \beta_n \): Coefficients that represent the weight/contribution of each feature to the prediction

In its simplest form, with a single feature (**simple linear regression**):
\[
y = \beta_0 + \beta_1 \cdot X
\]
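
The equation above can be evaluated for many samples at once as an intercept plus a matrix-vector product. The sketch below only illustrates the arithmetic; the coefficient values and the feature matrix are made up for the example and do not come from any fitted model.

```python
import numpy as np

# Hypothetical coefficients, chosen only for illustration.
beta_0 = 2.0                        # intercept
beta = np.array([0.5, -1.0, 3.0])   # beta_1, beta_2, beta_3

# Two samples with three features each (one row per sample).
X = np.array([
    [1.0, 2.0, 3.0],
    [4.0, 0.0, 1.0],
])

# y = beta_0 + beta_1*X_1 + beta_2*X_2 + beta_3*X_3, computed for every row.
y_hat = beta_0 + X @ beta
print(y_hat)
```

Each entry of `y_hat` is the weighted sum of one row of `X` plus the intercept.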

---

## 🚀 How Linear Regression Works

### 1. **Training - Fitting a Line**
Linear Regression finds the best-fit line (or a hyperplane when there are several features) by minimizing a **loss function**:
- The most common loss function is the **Mean Squared Error (MSE)**:
\[
MSE = \frac{1}{m} \sum_{i=1}^{m} (y_{\text{true}}^{(i)} - y_{\text{predicted}}^{(i)})^2
\]
Where:
  - \( m \): Number of data points
  - \( y_{\text{true}}^{(i)} \): True value for the \(i\)-th sample
  - \( y_{\text{predicted}}^{(i)} \): Predicted value for the \(i\)-th sample

The algorithm calculates coefficients (\( \beta_0 \), \( \beta_1 \), ...) such that this error is minimized.
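
As a rough illustration of this fitting step, the sketch below solves the least-squares problem directly with NumPy and then evaluates the MSE formula from above. It reuses the small toy dataset from the Scikit-Learn example in this document; the variable names and the use of `np.linalg.lstsq` are choices made for this sketch, not a description of how any particular library fits the model internally.

```python
import numpy as np

# Toy data: one feature, five samples.
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.5, 3.5, 4.0, 6.0, 7.5])

# Design matrix with a leading column of ones so that beta_0 is learned too.
A = np.column_stack([np.ones_like(X), X])

# Least-squares solution: the coefficients that minimize the squared error.
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
beta_0, beta_1 = beta

# Mean Squared Error of the fitted line on the training data.
y_pred = A @ beta
mse = np.mean((y - y_pred) ** 2)

print("Intercept:", beta_0)
print("Slope:", beta_1)
print("MSE:", mse)

# Any other line gives a larger MSE on the same data.
y_other = beta_0 + (beta_1 + 0.1) * X
mse_other = np.mean((y - y_other) ** 2)
```

Comparing `mse` with `mse_other` shows that nudging the slope away from the least-squares value increases the error.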

### 2. **Making Predictions**
Once trained, the model uses the learned coefficients to predict new data points: \( y = \beta_0 + \beta_1 \cdot X \) in the single-feature case, or the full weighted sum over all features otherwise.

---

## 🧠 Key Types of Linear Regression

1. **Simple Linear Regression**:
   - Involves **one independent variable** and one dependent variable.
   - Example: Predicting house price based on square footage.

   Equation: \( y = \beta_0 + \beta_1 \cdot X \)

2. **Multiple Linear Regression**:
   - Involves **multiple independent variables** to predict a single dependent variable. (It is sometimes loosely called multivariate regression, although strictly that term refers to models with several dependent variables.)
   - Example: Predicting house price based on square footage, number of bedrooms, and neighborhood.

   Equation: \( y = \beta_0 + \beta_1 \cdot X_1 + \beta_2 \cdot X_2 + \dots + \beta_n \cdot X_n \)
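
To make the difference between the two types concrete, here is a minimal sketch that fits both on the same synthetic data. The data is generated without noise from a known equation (an assumption made purely for illustration), so the multiple regression can recover the true coefficients while the single-feature model cannot explain all of the variation.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic, noise-free data (illustrative assumption):
# y = 1 + 2*X_1 + 3*X_2
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = 1.0 + 2.0 * X[:, 0] + 3.0 * X[:, 1]

# Simple linear regression: only the first feature is used.
simple = LinearRegression().fit(X[:, [0]], y)

# Multiple linear regression: both features are used.
multiple = LinearRegression().fit(X, y)

r2_simple = simple.score(X[:, [0]], y)
r2_multiple = multiple.score(X, y)

print("Simple   R^2:", r2_simple)
print("Multiple R^2:", r2_multiple)
print("Multiple intercept and coefficients:", multiple.intercept_, multiple.coef_)
```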

---

## 🔍 Strengths of Linear Regression

- **Simplicity**:
  - Very easy to understand and interpret.
  - Minimal computational overhead compared to more complex models.

- **Interpretability**:
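  - Each coefficient gives the expected change in the target for a one-unit increase in that feature, holding the other features fixed. Coefficient magnitudes are comparable as a measure of importance only when the features are on the same scale (for example, after standardization).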

- **Fast Training**:
  - Linear regression training is computationally efficient, even for large datasets.

---

## ⚠️ Limitations of Linear Regression

1. **Linearity Assumption**:
   - Assumes that the relationship between features and target is linear, which may not hold for complex real-world problems.

2. **Sensitivity to Outliers**:
   - Because the squared-error loss penalizes large residuals heavily, a few outliers can disproportionately shift the coefficients and reduce accuracy.

3. **Feature Selection**:
   - Linear regression requires careful selection of features. Including irrelevant/noisy features can lead to poor performance.

4. **Multicollinearity**:
   - If features are highly correlated, the individual coefficient estimates become unstable (small changes in the data can change them a lot), which weakens their interpretability.

5. **Underfitting**:
   - Linear regression may fail to capture non-linear relationships in the data, leading to underfitting.
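
The sensitivity to outliers listed above is easy to see on a tiny example. The sketch below fits a line to points that lie exactly on \( y = 2X + 1 \), then corrupts a single point and fits again. The data and the size of the corruption are invented for illustration.

```python
import numpy as np

# Ten points that lie exactly on the line y = 2x + 1.
x = np.arange(10.0)
y_clean = 2.0 * x + 1.0

# Same data, but the last observation is corrupted by a large error.
y_outlier = y_clean.copy()
y_outlier[-1] += 100.0

# np.polyfit with deg=1 returns (slope, intercept) of the least-squares line.
slope_clean, intercept_clean = np.polyfit(x, y_clean, deg=1)
slope_outlier, intercept_outlier = np.polyfit(x, y_outlier, deg=1)

print("Slope without outlier:", slope_clean)
print("Slope with one outlier:", slope_outlier)
```

A single bad point out of ten is enough to pull the fitted slope far away from 2, which is why outlier checks (or a robust loss) are worth considering before trusting the coefficients.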

---

## 🔧 Applications of Linear Regression

- Predicting continuous outcomes:
  - **Housing prices** based on features like size, location, etc.
  - **Stock prices** based on market indicators.
  - **Sales prediction** based on historical data.
  - **Medical applications** such as predicting blood pressure or cholesterol levels.
  
---

## 🧠 Extensions of Linear Regression

1. **Regularized Linear Regression**:
   - Addresses issues like overfitting and multicollinearity.
   - Examples:
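     - **Lasso (L1 Regularization)**: Adds a penalty proportional to the sum of the absolute values of the coefficients, which can shrink some coefficients to exactly zero and so acts as a form of feature selection.
     - **Ridge (L2 Regularization)**: Adds a penalty proportional to the sum of the squared coefficients, which shrinks all coefficients toward zero and improves generalization, especially when features are correlated.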
     - **ElasticNet**: Combines L1 and L2 regularization.

2. **Polynomial Regression**:
   - Transforms features to capture non-linear relationships.
   - Example: \( y = \beta_0 + \beta_1 \cdot X + \beta_2 \cdot X^2 \).

3. **Robust Regression**:
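   - Reduces the influence of outliers on the fitted coefficients, for example by replacing squared error with a loss such as the Huber loss that grows more slowly for large residuals.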
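
Two of these extensions can be sketched with Scikit-Learn in a few lines. The first part fits a polynomial regression to noise-free quadratic data; the second compares the size of the coefficients from plain least squares and from Ridge. All data here is synthetic, and the degree, `alpha`, and true coefficients are arbitrary values picked for the illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# --- Polynomial regression -------------------------------------------
# Noise-free quadratic data: y = 1 + 2x + 3x^2 (illustrative assumption).
x = np.arange(6.0).reshape(-1, 1)
y = 1.0 + 2.0 * x.ravel() + 3.0 * x.ravel() ** 2

# Expand x into [x, x^2], then fit an ordinary linear regression on top.
poly_model = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    LinearRegression(),
)
poly_model.fit(x, y)
pred_at_6 = poly_model.predict(np.array([[6.0]]))[0]
print("Polynomial prediction at x=6:", pred_at_6)

# --- Ridge regression -------------------------------------------------
# Noisy linear data with three features (synthetic).
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 3))
y_noisy = X @ np.array([4.0, -2.0, 1.0]) + rng.normal(scale=0.5, size=40)

ols = LinearRegression().fit(X, y_noisy)
ridge = Ridge(alpha=10.0).fit(X, y_noisy)

# The L2 penalty shrinks the coefficient vector toward zero.
ols_norm = np.linalg.norm(ols.coef_)
ridge_norm = np.linalg.norm(ridge.coef_)
print("OLS coefficient norm:  ", ols_norm)
print("Ridge coefficient norm:", ridge_norm)
```

The polynomial model is still linear in its coefficients, which is why a plain `LinearRegression` can fit it once the squared feature has been added.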

---

## 🛠 Example Code in Python

### Using Scikit-Learn:
```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Sample dataset
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([1.5, 3.5, 4.0, 6.0, 7.5])

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Linear Regression Model
lr_model = LinearRegression()
lr_model.fit(X_train, y_train)

# Making Predictions
y_pred = lr_model.predict(X_test)

# Evaluating Model
mse = mean_squared_error(y_test, y_pred)

print("Predicted values:", y_pred)
print("Mean Squared Error:", mse)

# Coefficients
print("Intercept:", lr_model.intercept_)
print("Slope (Coefficient):", lr_model.coef_)
```

### Using pandas and Scikit-Learn on the Titanic Dataset:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
```

### Let's predict the *Fare*

```python
# load dataset
df = pd.read_csv("../titanic.csv")

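# keep useful columns (copy so later assignments do not act on a view)
df = df[['Fare', 'Pclass', 'Age', 'SibSp', 'Parch', 'Sex', 'Embarked']].copy()

# fill missing values by assigning back to the column;
# calling fillna(..., inplace=True) on df[col] is deprecated in pandas
df['Age'] = df['Age'].fillna(df['Age'].median())
df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])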

# encode categorical variables
df = pd.get_dummies(df, columns=['Sex', 'Embarked'], drop_first=True)

# separate target & features
X = df.drop("Fare", axis=1)
y = df["Fare"]


X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
# scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)


lr = LinearRegression()
lr.fit(X_train_scaled, y_train)


y_pred = lr.predict(X_test_scaled)

mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"Mean Squared Error: {mse:.2f}")
print(f"R² Score: {r2:.2%}")
```

Output:

```
Mean Squared Error: 928.61
R² Score: 39.99%
```

```python
coefficients = pd.Series(lr.coef_, index=X.columns)
print(coefficients.sort_values(ascending=False))
```

Output:

```
Parch          8.592276
SibSp          6.824174
Age           -1.039057
Sex_male      -1.713282
Embarked_Q    -3.871888
Embarked_S    -9.291567
Pclass       -27.960685
dtype: float64
```

```python
import pickle

with open("fare_model.pkl", "wb") as f:
    pickle.dump((scaler, lr), f)
```
