#### **regression**
a statistical and machine learning technique used to model the relationship between **independent variables** (inputs/features) and a **dependent variable** (output/target)

unlike classification, where outputs fall into discrete categories, regression outputs **continuous numeric values**.

the goal of regression is to estimate this mathematical relationship so that future, unseen values can be predicted reliably

applications include:
- Predicting house or car prices

- Forecasting stock trends

- Predicting electricity usage

- Estimating sales or demand

#### **linear regression**
assumes the relationship between the input variables and the output is **linear**: it can be represented by a straight line (one input variable) or a hyperplane (several input variables)

the model tries to draw the line that best fits the data by minimizing the prediction error

this is achieved through the least squares method, which finds the coefficients that minimize the sum of squared differences between the predicted and actual values
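
a minimal sketch of a least squares fit, assuming NumPy is available. The arrays `x` and `y` are made-up illustration data, not from any real dataset:

```python
import numpy as np

# Made-up data lying exactly on y = 1 + 2x (illustrative assumption)
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.0, 3.0, 5.0, 7.0, 9.0])

# Design matrix: a column of ones (intercept) next to the input column
A = np.column_stack([np.ones_like(x), x])

# lstsq returns the coefficients that minimize the sum of squared differences
coeffs, *_ = np.linalg.lstsq(A, y, rcond=None)
intercept, slope = coeffs

print(intercept, slope)  # approximately 1.0 and 2.0
```

with noisy data the same call still works; the fitted line then no longer passes through every point.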

#### **simple linear regression**
involves:
- **one independent variable (X)**
- **one dependent variable (Y)**

$$Y=\beta_0+\beta_1X$$

- **β₀ (Intercept)**: The predicted value of Y when X = 0

- **β₁ (Slope)**: How much Y changes for a unit increase in X

- **X**: Independent variable

- **Y**: Dependent variable
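
for this one-variable model the least squares estimates have a standard closed form. The bar notation for sample means and the hats for estimates are added here for illustration and are not used elsewhere in these notes:

$$\hat{\beta}_1=\frac{\sum_i (x_i-\bar{x})(y_i-\bar{y})}{\sum_i (x_i-\bar{x})^2}, \qquad \hat{\beta}_0=\bar{y}-\hat{\beta}_1\bar{x}$$

worked example with the made-up points (1, 2), (2, 4), (3, 6): the means are 2 and 4, so

$$\hat{\beta}_1=\frac{(-1)(-2)+(0)(0)+(1)(2)}{(-1)^2+0^2+1^2}=\frac{4}{2}=2, \qquad \hat{\beta}_0=4-2\cdot 2=0$$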

#### **residuals** - error calculation in regression
a residual is the difference between the actual and the predicted value

$$\varepsilon_i=y_i-\hat{y}_i$$

$\hat{y}_i$ = predicted value

$y_i$ = actual value

residuals measure how far the model is from the real data.
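
a small sketch of computing residuals, assuming NumPy and made-up data. It uses the common sign convention of actual minus predicted; the magnitude of each residual is the same under either sign convention:

```python
import numpy as np

# Made-up data (illustrative assumption)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Fit a straight line; np.polyfit returns the highest-degree coefficient first
slope, intercept = np.polyfit(x, y, deg=1)
y_pred = intercept + slope * x

# Residual convention used here: actual minus predicted
residuals = y - y_pred

print(residuals)
print(residuals.sum())  # close to 0 for a least squares fit with an intercept
```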

##### **Random Error**

ideally, residuals should:

- Look random

- Have no patterns

- Have constant variance

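_a visible pattern in the residuals suggests the model is missing structure in the data, so these properties are critical for valid regression._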

##### **measuring test accuracy**
test performance in regression is evaluated using metrics such as:
- **R² (Coefficient of Determination)**: Proportion of the variance in the target that the model explains
- **MAE (Mean Absolute Error)**: Average absolute difference between predicted and actual values
- **MSE (Mean Squared Error)**: Average squared difference between predicted and actual values
- **RMSE (Root Mean Squared Error)**: Square root of MSE (same units as the target)

an R² close to 1 and low error values indicate a good regression model
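
the four metrics can be computed directly from their definitions. A minimal sketch, assuming NumPy; `y_true` and `y_pred` are hypothetical test-set values chosen for easy arithmetic:

```python
import numpy as np

# Hypothetical actual and predicted values for a small test set
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.0, 7.5, 9.0])

errors = y_true - y_pred

mae = np.mean(np.abs(errors))   # Mean Absolute Error
mse = np.mean(errors ** 2)      # Mean Squared Error
rmse = np.sqrt(mse)             # Root of MSE, same units as the target

ss_res = np.sum(errors ** 2)                     # variation left unexplained
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation in the target
r2 = 1.0 - ss_res / ss_tot                       # Coefficient of Determination

print(mae, mse, rmse, r2)  # 0.25, 0.125, about 0.354, 0.975
```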

#### **multiple linear regression**
an extension of simple LR where we have **multiple independent variables**

$$Y=\beta_0+\beta_1X_1+\beta_2X_2+\cdots+\beta_pX_p+\varepsilon$$

- There are p features (X₁, X₂, …, Xₚ)
- The model computes one β (coefficient) for each feature, plus the intercept β₀
- Each β shows how much the prediction changes for a unit increase in that feature, with the other features held fixed

for example, predicting house prices based on:
- Area
- Bedrooms
- Age
- Location

each feature gets its own coefficient.
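
a sketch of the house price example, assuming NumPy. All numbers are made up, and the prices are generated from known coefficients so the fit can be checked. Location is left out because a categorical feature would first need a numeric encoding:

```python
import numpy as np

# Made-up housing data (illustrative assumption): area in m^2, bedrooms, age in years
X = np.array([
    [50.0, 1.0, 30.0],
    [80.0, 2.0, 20.0],
    [100.0, 3.0, 15.0],
    [120.0, 3.0, 5.0],
    [150.0, 4.0, 10.0],
    [65.0, 2.0, 40.0],
])

# Prices generated from known coefficients:
# price = 20 + 2*area + 10*bedrooms - 0.5*age
true_beta = np.array([20.0, 2.0, 10.0, -0.5])
A = np.column_stack([np.ones(len(X)), X])  # prepend the intercept column
price = A @ true_beta

# Least squares recovers one intercept plus one coefficient per feature
beta, *_ = np.linalg.lstsq(A, price, rcond=None)
print(beta)  # approximately [20, 2, 10, -0.5]
```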

#####  **considerations in multiple regression**
1. **overfitting**

    adding too many features may allow the model to memorize the training data

    symptoms:
    - very low error on the training data
    - much higher error on the test data

    solutions:
    - remove irrelevant features
    - use regularization **(_ridge_, _lasso_)**
    - use cross-validation

2. **multi-collinearity**
    
    occurs when two or more independent variables are strongly correlated with each other

    problems caused:
    - coefficients become unstable
    - interpretation becomes difficult
    - predictions can become unreliable for inputs that do not follow the correlation pattern seen in training

    detection:
    - correlation matrix
    - VIF (variance inflation factor)

3. **feature selection**
    
    choosing the right set of features improves:
    - accuracy
    - interpretability
    - speed

    methods:
    - filter methods (e.g. correlation with the target)
    - wrapper methods (e.g. recursive feature elimination, RFE)
    - embedded methods (e.g. Lasso)
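
the VIF mentioned under multi-collinearity can be computed from its definition, VIF_j = 1 / (1 - R_j²), where R_j² comes from regressing feature j on all the other features. A minimal sketch, assuming NumPy; the helper name `vif` and the synthetic features are illustrative choices, not a library API:

```python
import numpy as np

def vif(X):
    """Variance inflation factor of each column of X (rows are observations)."""
    n, p = X.shape
    factors = []
    for j in range(p):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])  # intercept + other features
        coef, *_ = np.linalg.lstsq(A, target, rcond=None)
        resid = target - A @ coef
        r2 = 1.0 - resid @ resid / np.sum((target - target.mean()) ** 2)
        factors.append(1.0 / (1.0 - r2))
    return np.array(factors)

# Synthetic features (illustrative assumption): x2 is nearly a copy of x1
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + 0.05 * rng.normal(size=200)  # strongly correlated with x1
x3 = rng.normal(size=200)              # unrelated to the others

factors = vif(np.column_stack([x1, x2, x3]))
print(factors)  # large values for x1 and x2, close to 1 for x3
```

a commonly quoted rule of thumb treats a VIF above roughly 5 to 10 as a sign of problematic multi-collinearity, though the exact cutoff is a judgment call.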

#### **linear regression coefficients**
The line equation is:

$$y=mx+b$$

Where:

- m = slope (coefficient)
- b = intercept

**interpretation:**

- Coefficient (m): Indicates how much the target changes with a unit change in the predictor

- Intercept (b): Predicted target value when all inputs are zero

in multiple LR, each feature has its own coefficient that shows its contribution.

#### **regression plot (regplot)**
visually represents:
- data points (scatter plot)
- regression line (best fit line)

helps judge:
- linearity
- presence of outliers
- strength of relationship

tighter cluster around the line -> strong relationship

wide spread -> weak relationship
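
a sketch of drawing such a plot, assuming seaborn and matplotlib are installed and that "regplot" refers to seaborn's `regplot` function. The data is synthetic and the output file name is an arbitrary choice:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script also runs without a display
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

# Synthetic data (illustrative assumption): a linear trend plus noise
rng = np.random.default_rng(42)
x = np.linspace(0, 10, 50)
y = 3.0 + 2.0 * x + rng.normal(scale=2.0, size=x.size)

# regplot draws the scatter points and overlays the fitted regression line;
# ci=None turns off the confidence band around the line
ax = sns.regplot(x=x, y=y, ci=None)
ax.set_xlabel("x")
ax.set_ylabel("y")
plt.savefig("regplot.png")
```

increasing the noise `scale` in this sketch spreads the points further from the line, which is the visual sign of a weaker relationship.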

#### **linear regression assumptions**
##### **1. linearity**
relationship between the independent and dependent variable must be linear

if the pattern is curved/non-linear -> LR may be inappropriate

##### **2. independence**
observations must be independent of each other

violation example:
- Time-series data where value at t depends on t-1

##### **3. homoscedasticity (constant variance)**
residuals must have constant variance across all predicted values

if the residual spread increases or decreases with the predicted value -> heteroscedasticity
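
a rough, informal check (not a formal statistical test) is to compare the residual spread for low and high predicted values. A sketch assuming NumPy, with synthetic data built so that the noise grows with x:

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(1, 10, 200)
# Synthetic data (illustrative assumption): noise scale grows with x,
# so the variance is deliberately not constant
y = 2.0 + 3.0 * x + rng.normal(scale=0.5 * x)

slope, intercept = np.polyfit(x, y, deg=1)
y_pred = intercept + slope * x
residuals = y - y_pred

# Split observations into low and high predicted values and compare residual spread
order = np.argsort(y_pred)
low, high = residuals[order[:100]], residuals[order[100:]]
ratio = high.var() / low.var()

print(ratio)  # a ratio far from 1 hints at heteroscedasticity
```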

##### **4. normality of residuals**
residuals (errors) should follow a normal distribution

check via:
- histogram
- Q-Q plot

this assumption is needed for valid statistical tests and confidence intervals on the coefficients

##### **5. no perfect multicollinearity**
independent variables should not be perfectly correlated (like X2 = 2 * X1)

with perfect multicollinearity the individual effect of each variable cannot be estimated, because the coefficients are no longer uniquely determined
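
the X2 = 2 * X1 case can be seen in the rank of the design matrix. A minimal sketch, assuming NumPy and made-up values:

```python
import numpy as np

x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
x2 = 2 * x1  # perfectly collinear with x1 (the X2 = 2 * X1 case)

# Design matrix with intercept column, x1 and x2
A = np.column_stack([np.ones_like(x1), x1, x2])

# Three columns but only two independent directions
rank = np.linalg.matrix_rank(A)
print(rank)  # 2
```

since b1 * x1 + b2 * x2 equals (b1 + 2 * b2) * x1, infinitely many pairs (b1, b2) give exactly the same predictions.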

##### **6. no autocorrelation**
residuals should not be correlated with each other

violation is common in time-series data
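
one common diagnostic for this is the Durbin-Watson statistic, which is near 2 when successive residuals are uncorrelated and drops toward 0 under positive autocorrelation. A sketch assuming NumPy; the helper `durbin_watson` is written out here for illustration, and both residual series are simulated rather than taken from a fitted model:

```python
import numpy as np

def durbin_watson(residuals):
    """Sum of squared successive differences divided by the sum of squares."""
    diffs = np.diff(residuals)
    return np.sum(diffs ** 2) / np.sum(residuals ** 2)

rng = np.random.default_rng(0)

# Hypothetical residual series with no dependence between time steps
independent = rng.normal(size=500)

# Hypothetical residual series where each value carries over 90% of the previous one
correlated = np.zeros(500)
for t in range(1, 500):
    correlated[t] = 0.9 * correlated[t - 1] + rng.normal()

dw_independent = durbin_watson(independent)
dw_correlated = durbin_watson(correlated)

print(dw_independent)  # near 2 when there is no autocorrelation
print(dw_correlated)   # well below 2 for positive autocorrelation
```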

##### **7. additivity**
the effects of the independent variables add up: each variable's contribution does not depend on the values of the others

e.g., Effect of horsepower on price does not depend on mileage.

#### **polynomial linear regression**
extends linear regression to capture non-linear patterns

model becomes:
$$y=b_0+b_1x+b_2x^2+\cdots+b_nx^n+\varepsilon$$

- Degree n controls curve complexity

- Higher degree = more flexibility

- Too high degree = overfitting

**advantages:**
- captures non-linear patterns
- more accurate for curved data

**disadvantages:**
- prone to overfitting
- harder to interpret
- sensitive to noise

_Still, polynomial regression is considered "linear" because the coefficients appear linearly in the equation_
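
a sketch of why the model stays linear in its coefficients, assuming NumPy: the powers of x are computed first and treated as ordinary feature columns, then a plain least squares solve finds the coefficients. The data is made up to follow a known quadratic:

```python
import numpy as np

# Made-up data following y = 1 + 2x + 3x^2 exactly (illustrative assumption)
x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0, 3.0])
y = 1.0 + 2.0 * x + 3.0 * x ** 2

# Columns 1, x, x^2: the model is non-linear in x but linear in b_0, b_1, b_2
A = np.vander(x, N=3, increasing=True)

# An ordinary linear least squares solve is enough to find the coefficients
b, *_ = np.linalg.lstsq(A, y, rcond=None)
print(b)  # approximately [1. 2. 3.]
```

raising `N` adds higher powers and more flexibility, which is also how the overfitting risk grows with the degree.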