# Regression

## What is Simple Linear Regression?

Simple Linear Regression is a statistical method used to model the relationship between a dependent variable (Y) and a single independent variable (X). The relationship is represented by the straight-line equation `Y = mX + c`, where `m` is the slope and `c` is the intercept. It is useful when we want to understand how a change in one variable is associated with a change in another.



## What are the key assumptions of Simple Linear Regression?

The key assumptions of Simple Linear Regression include:
- **Linearity**: The relationship between X and Y is linear.
- **Independence**: The residuals (errors) are independent.
- **Homoscedasticity**: The residuals have constant variance.
- **Normality**: The residuals are normally distributed.



## What does the coefficient `m` represent in the equation `Y = mX + c`?

The coefficient `m` represents the slope of the line. It indicates the change in the dependent variable `Y` for every one unit increase in the independent variable `X`.



## What does the intercept `c` represent in the equation `Y = mX + c`?

The intercept `c` is the point where the regression line crosses the Y-axis. It represents the value of `Y` when `X` is zero.



## How do we calculate the slope `m` in Simple Linear Regression?

The slope `m` is calculated using the formula:

`m = sum((Xi - X_mean) * (Yi - Y_mean)) / sum((Xi - X_mean)^2)`

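This is the least-squares estimate of the slope: the value of `m` that minimizes the sum of squared differences between the predicted and actual values. The intercept then follows as `c = Y_mean - m * X_mean`.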
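
The slope formula can be sketched in plain Python as follows. The function name `fit_simple_linear` and the sample data are made up for illustration, and the intercept is computed with the standard companion formula `c = Y_mean - m * X_mean`.

```python
def fit_simple_linear(xs, ys):
    """Return (m, c) for the least-squares line Y = m*X + c."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Numerator: sum of products of deviations from the means
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    # Denominator: sum of squared deviations of X
    sxx = sum((x - x_mean) ** 2 for x in xs)
    m = sxy / sxx
    c = y_mean - m * x_mean
    return m, c


# Made-up data lying exactly on Y = 2X + 1
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.0, 5.0, 7.0, 9.0, 11.0]
m, c = fit_simple_linear(xs, ys)
print(m, c)  # 2.0 1.0
```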



## What is the purpose of the least squares method in Simple Linear Regression?

The least squares method finds the best-fitting line by minimizing the sum of the squared differences (residuals) between the observed values and the predicted values. Squaring keeps positive and negative errors from cancelling out and penalizes large errors more heavily than small ones.
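
As a rough check of this idea, the sketch below fits a line with NumPy's `np.polyfit` and confirms that nudging the slope or intercept in any direction increases the sum of squared errors. The data values and the helper `sse` are assumptions made for illustration.

```python
import numpy as np

# Made-up, slightly noisy data
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])


def sse(m, c):
    """Sum of squared residuals for the line Y = m*X + c."""
    return float(np.sum((y - (m * x + c)) ** 2))


# np.polyfit with deg=1 returns the least-squares slope and intercept
m_hat, c_hat = np.polyfit(x, y, deg=1)
best = sse(m_hat, c_hat)

# Any nudge to the slope or intercept should give a larger SSE
nudged = [
    sse(m_hat + dm, c_hat + dc)
    for dm in (-0.1, 0.0, 0.1)
    for dc in (-0.1, 0.0, 0.1)
    if (dm, dc) != (0.0, 0.0)
]
print(best, min(nudged))
```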



## How is the coefficient of determination (R²) interpreted in Simple Linear Regression?

R² is the proportion of the variation in the dependent variable that is explained by the independent variable. A value close to 1 means the model explains most of the variation, while a value near 0 means it explains very little.
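
A minimal sketch of the usual definition, `R² = 1 - SS_res / SS_tot`, where `SS_res` is the sum of squared residuals and `SS_tot` is the total sum of squares around the mean. The function name `r_squared` and the sample values are illustrative assumptions.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_mean = sum(y_true) / len(y_true)
    ss_res = sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred))
    ss_tot = sum((yt - y_mean) ** 2 for yt in y_true)
    return 1.0 - ss_res / ss_tot


y_true = [3.0, 5.0, 7.0, 9.0]
perfect = [3.0, 5.0, 7.0, 9.0]    # predictions match exactly
mean_only = [6.0, 6.0, 6.0, 6.0]  # always predict the mean of y_true

print(r_squared(y_true, perfect))    # 1.0
print(r_squared(y_true, mean_only))  # 0.0
```

A model that always predicts the mean explains none of the variation, which is why it scores exactly 0.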



## What is Multiple Linear Regression?

Multiple Linear Regression is an extension of Simple Linear Regression where two or more independent variables are used to predict the value of a dependent variable. It helps in modeling more complex relationships where the outcome depends on multiple factors.



## What is the main difference between Simple and Multiple Linear Regression?

The main difference is in the number of independent variables. Simple Linear Regression uses one independent variable, whereas Multiple Linear Regression uses two or more. This allows Multiple Linear Regression to model more complex real-world scenarios.



## What are the key assumptions of Multiple Linear Regression?

The assumptions are the same as those of Simple Linear Regression, with one addition (no multicollinearity):
- **Linearity**: Relationship between independent and dependent variables is linear.
- **Independence**: Observations are independent of each other.
- **Homoscedasticity**: Residuals have constant variance.
- **Normality**: Residuals are normally distributed.
- **No Multicollinearity**: Independent variables should not be highly correlated with each other.
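
One common diagnostic for the multicollinearity assumption, not named in the list above, is the variance inflation factor (VIF): regress each predictor on the others and compute `1 / (1 - R²)`. The sketch below is a hand-rolled version using NumPy; the function name `vif` and the synthetic data are assumptions, and the often-quoted warning threshold of roughly 5 to 10 is only a rule of thumb.

```python
import numpy as np


def vif(X):
    """Variance inflation factor for each column of a 2-D array X."""
    X = np.asarray(X, dtype=float)
    out = []
    for j in range(X.shape[1]):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        # Regress column j on the remaining columns plus an intercept
        A = np.column_stack([np.ones(len(target)), others])
        coef, *_ = np.linalg.lstsq(A, target, rcond=None)
        resid = target - A @ coef
        r2 = 1.0 - (resid @ resid) / np.sum((target - target.mean()) ** 2)
        out.append(1.0 / (1.0 - r2))
    return np.array(out)


rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.1, size=200)  # nearly a copy of x1
x3 = rng.normal(size=200)                  # unrelated to the others
vifs = vif(np.column_stack([x1, x2, x3]))
print(vifs)
```

In this setup the first two columns should show large VIF values and the third a value close to 1.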



## What is heteroscedasticity, and how does it affect the results of a Multiple Linear Regression model?

Heteroscedasticity refers to the condition where the variance of the residuals is not constant across all levels of the independent variables. The coefficient estimates remain unbiased but become inefficient, and the usual standard errors are no longer valid, so hypothesis tests and confidence intervals can be misleading.



## How can you improve a Multiple Linear Regression model with high multicollinearity?

To improve a model with multicollinearity, you can:
- Remove or combine highly correlated independent variables.
- Use dimensionality reduction techniques like PCA.
- Apply regularization techniques such as Ridge or Lasso regression.
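
As an illustration of the regularization option, the sketch below compares ordinary least squares with Ridge on two nearly identical predictors. The data are synthetic and `alpha=1.0` is an arbitrary choice, not a recommendation.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
x2 = x1 + rng.normal(scale=0.01, size=100)  # almost identical to x1
X = np.column_stack([x1, x2])
y = 3 * x1 + rng.normal(scale=0.5, size=100)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

# OLS can split the effect between x1 and x2 erratically;
# Ridge shrinks the coefficient vector toward a more stable solution.
print("OLS coefficients:  ", ols.coef_)
print("Ridge coefficients:", ridge.coef_)
```

The Ridge coefficients should have a smaller overall magnitude than the OLS ones, while their sum stays close to the true combined effect of 3.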



## What are some common techniques for transforming categorical variables for use in regression models?

Categorical variables can be transformed using:
- **One-hot encoding**: Converts categories into binary columns (0 or 1).
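- **Label encoding**: Assigns an integer to each category. It suits ordinal data; for unordered categories it imposes an artificial ordering that a linear model will treat as meaningful.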
- **Dummy variables**: Similar to one-hot but excludes one column to avoid multicollinearity.
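
The three encodings can be sketched with pandas as shown below. The `city` column and its values are made up, and using categorical codes for label encoding is one possible approach among several.

```python
import pandas as pd

df = pd.DataFrame({"city": ["Paris", "Tokyo", "Lima", "Tokyo"]})

# One-hot encoding: one binary column per category
one_hot = pd.get_dummies(df["city"], prefix="city", dtype=int)

# Dummy variables: drop the first category to avoid a redundant column
dummies = pd.get_dummies(df["city"], prefix="city", drop_first=True, dtype=int)

# Label encoding via categorical codes; categories are sorted alphabetically,
# so the integers carry no real order for a nominal variable like this one
labels = df["city"].astype("category").cat.codes

print(one_hot.columns.tolist())  # ['city_Lima', 'city_Paris', 'city_Tokyo']
print(dummies.columns.tolist())  # ['city_Paris', 'city_Tokyo']
print(labels.tolist())           # [1, 2, 0, 2]
```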



## What is the role of interaction terms in Multiple Linear Regression?

Interaction terms are used to model the combined effect of two or more variables on the dependent variable. They help in capturing relationships where the effect of one variable depends on the value of another.
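
A minimal sketch of an interaction term: the product `x1 * x2` is added as an extra column in the design matrix and fitted like any other predictor. The coefficients and data below are invented, and the relationship is noise-free so the fit recovers them almost exactly.

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.normal(size=50)
x2 = rng.normal(size=50)
# Made-up relationship: the effect of x1 on y is (2 + 4 * x2), so it depends on x2
y = 1 + 2 * x1 + 3 * x2 + 4 * x1 * x2

# Design matrix: intercept, x1, x2, and the interaction column x1 * x2
A = np.column_stack([np.ones_like(x1), x1, x2, x1 * x2])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
print(np.round(coef, 6))
```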



## How can the interpretation of intercept differ between Simple and Multiple Linear Regression?

In Simple Linear Regression, the intercept represents the expected value of the dependent variable when the independent variable is zero. In Multiple Linear Regression, it represents the expected value of the dependent variable when **all** independent variables are zero, which might not always be meaningful in a real-world context.



## What is the significance of the slope in regression analysis, and how does it affect predictions?

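The slope shows the rate of change in the dependent variable for a unit change in the independent variable. It directly affects predictions: its sign gives the direction of the relationship, and its magnitude gives how much the prediction changes per unit of the independent variable. Because the magnitude depends on the units of measurement, it is not by itself a measure of how strong the relationship is.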



## How does the intercept in a regression model provide context for the relationship between variables?

The intercept provides a baseline value of the dependent variable when all independent variables are zero. It gives context to the regression line and helps anchor predictions, although in some cases it may not be interpretable.



## What are the limitations of using R² as a sole measure of model performance?

R² shows how much of the variance in the data the model explains, but it doesn't indicate whether the model is appropriate or whether individual variables are significant. Because R² never decreases when a predictor is added, a high value can result from overfitting, especially in models with many predictors. Adjusted R² and other metrics like RMSE should also be considered.



## How would you interpret a large standard error for a regression coefficient?

A large standard error means the coefficient is estimated imprecisely. The true value could plausibly lie anywhere in a wide range, possibly including zero, which reduces confidence in the variable's effect on the dependent variable.
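
For simple linear regression, the textbook standard error of the slope is `sqrt(s² / sum((Xi - X_mean)^2))`, where `s²` is the residual variance with `n - 2` degrees of freedom. The sketch below, using a hypothetical helper `slope_standard_error` and synthetic data, shows that noisier data lead to a larger standard error.

```python
import numpy as np


def slope_standard_error(x, y):
    """Standard error of the slope in a simple linear regression."""
    n = len(x)
    m, c = np.polyfit(x, y, deg=1)
    resid = y - (m * x + c)
    sigma2 = np.sum(resid ** 2) / (n - 2)  # residual variance estimate
    return float(np.sqrt(sigma2 / np.sum((x - x.mean()) ** 2)))


rng = np.random.default_rng(0)
x = np.linspace(0, 10, 40)
noise = rng.normal(size=40)
y_low = 2 * x + 1 + 0.5 * noise   # little noise
y_high = 2 * x + 1 + 5.0 * noise  # same noise pattern, ten times larger

se_low = slope_standard_error(x, y_low)
se_high = slope_standard_error(x, y_high)
print(se_low, se_high)
```

Since the noise is scaled by a factor of ten, the standard error grows by the same factor.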



## How can heteroscedasticity be identified in residual plots, and why is it important to address it?

Heteroscedasticity can be identified by plotting residuals against predicted values. If the spread of the residuals widens or narrows (for example, in a funnel shape) instead of remaining constant, it indicates heteroscedasticity. Addressing it gives more reliable standard errors and therefore valid statistical inferences.
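
As a numeric stand-in for eyeballing a residual plot, the sketch below compares the residual spread for low and high fitted values. The data are synthetic, with noise deliberately made to grow with `x`, and splitting at the median is a crude illustrative check rather than a formal test.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(1, 10, 200)
y = 2 * x + 1 + rng.normal(scale=0.5 * x)  # noise grows with x (made up)

m, c = np.polyfit(x, y, deg=1)
fitted = m * x + c
resid = y - fitted

# Split residuals into the lower and upper half of the fitted values
order = np.argsort(fitted)
low, high = resid[order[:100]], resid[order[100:]]
print(low.std(), high.std())
```

A clearly larger spread in one half is the numeric counterpart of the funnel shape seen in a residuals-vs-predicted plot.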



## What does it mean if a Multiple Linear Regression model has a high R² but low adjusted R²?

It means that some of the predictors may not be contributing meaningfully to the model. Adjusted R² penalizes the addition of irrelevant variables, so a large gap between R² and adjusted R² can indicate overfitting.
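
The usual formula is `adjusted R² = 1 - (1 - R²) * (n - 1) / (n - p - 1)`, where `n` is the number of observations and `p` the number of predictors. A small sketch with made-up numbers:

```python
def adjusted_r2(r2, n_samples, n_predictors):
    """Adjusted R² for a model with an intercept and n_predictors predictors."""
    return 1.0 - (1.0 - r2) * (n_samples - 1) / (n_samples - n_predictors - 1)


# Same R² of 0.90 on 30 observations, with few vs. many predictors
few = adjusted_r2(0.90, n_samples=30, n_predictors=2)
many = adjusted_r2(0.90, n_samples=30, n_predictors=20)
print(few, many)
```

With 2 predictors the adjusted value stays near 0.89, while with 20 predictors it drops to about 0.68, showing the penalty for extra variables.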



## Why is it important to scale variables in Multiple Linear Regression?

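Plain least-squares predictions do not change when a variable is rescaled; only the size of its coefficient does. Scaling matters most when regularization is applied (like Ridge or Lasso), because the penalty depends on coefficient magnitudes and would otherwise treat variables differently depending on their units. It also makes coefficients easier to compare when units of measurement vary widely.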
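
The sketch below illustrates the point for Ridge: changing the units of one column changes the unscaled model's predictions, while a pipeline with `StandardScaler` is unaffected. The data, the factor of 1000, and `alpha=10.0` are arbitrary assumptions.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = X[:, 0] + 2 * X[:, 1] + rng.normal(scale=0.1, size=100)

# Same data with the first column in different units (e.g. metres -> millimetres)
X_rescaled = X * np.array([1000.0, 1.0])

# Without scaling, the Ridge penalty reacts to the change of units
raw_a = Ridge(alpha=10.0).fit(X, y).predict(X)
raw_b = Ridge(alpha=10.0).fit(X_rescaled, y).predict(X_rescaled)

# With standardization first, both versions of the data give the same model
scaled_a = make_pipeline(StandardScaler(), Ridge(alpha=10.0)).fit(X, y).predict(X)
scaled_b = (
    make_pipeline(StandardScaler(), Ridge(alpha=10.0))
    .fit(X_rescaled, y)
    .predict(X_rescaled)
)

print("Unscaled predictions agree:", np.allclose(raw_a, raw_b))
print("Scaled predictions agree:  ", np.allclose(scaled_a, scaled_b))
```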



## What is polynomial regression?

Polynomial regression is a form of regression analysis where the relationship between the independent variable and the dependent variable is modeled as an nth-degree polynomial. It is used when the data shows a non-linear trend.



## How does polynomial regression differ from linear regression?

Linear regression models a straight-line relationship, while polynomial regression allows for curves by introducing powers of the independent variable (e.g., x², x³). This provides flexibility to fit more complex patterns in the data. The model is still linear in its coefficients, so it is fitted with the same least squares machinery.



## When is polynomial regression used?

Polynomial regression is used when the data shows a curved or non-linear relationship and a straight line doesn’t fit well. Examples include modeling population growth, price-demand curves, or trajectories.



## What is the general equation for polynomial regression?

The general equation is:

`Y = b0 + b1*X + b2*X² + b3*X³ + ... + bn*Xⁿ`

Where `n` is the degree of the polynomial, and `b0`, `b1`, ..., `bn` are the coefficients.
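
Because the equation is linear in `b0`, ..., `bn`, the coefficients can be found by ordinary least squares on a matrix whose columns are the powers of `X`. A sketch with a made-up, noise-free cubic:

```python
import numpy as np

x = np.linspace(-2, 2, 20)
y = 1 - 2 * x + 0.5 * x**2 + 3 * x**3  # made-up cubic, no noise

# Columns are X^0, X^1, X^2, X^3, matching b0..b3 in the equation
A = np.vander(x, N=4, increasing=True)
b, *_ = np.linalg.lstsq(A, y, rcond=None)
print(np.round(b, 6))
```

The recovered values should match the coefficients used to generate the data: 1, -2, 0.5, and 3.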



## Can polynomial regression be applied to multiple variables?

Yes, polynomial regression can be extended to multiple variables by including interaction and power terms for each variable. However, it increases the complexity and can lead to overfitting if not handled carefully.



## What are the limitations of polynomial regression?

- Risk of overfitting, especially with high-degree polynomials
- Difficult to interpret coefficients
- Sensitive to outliers
- May not generalize well to unseen data



## What methods can be used to evaluate model fit when selecting the degree of a polynomial?

We can use:
- **Cross-validation**
- **Adjusted R²**
- **AIC/BIC**
- **Residual plots**
- **Validation error on a test set**

These help ensure the model fits well without overfitting.
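
As one way to apply cross-validation here, the sketch below scores polynomial degrees 1 through 6 with 5-fold cross-validation and picks the degree with the lowest mean squared error. The data are synthetic (a noisy quadratic), and the degree range and fold count are arbitrary choices.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = np.linspace(-3, 3, 60).reshape(-1, 1)
y = X.ravel() ** 2 + rng.normal(scale=0.5, size=60)  # noisy quadratic

# Shuffle the folds because X is sorted
cv = KFold(n_splits=5, shuffle=True, random_state=0)

cv_mse = {}
for degree in range(1, 7):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    scores = cross_val_score(model, X, y, cv=cv, scoring="neg_mean_squared_error")
    cv_mse[degree] = -scores.mean()  # flip sign back to a positive MSE

best_degree = min(cv_mse, key=cv_mse.get)
print(cv_mse)
print("Lowest CV error at degree", best_degree)
```

On data like this, degree 1 should score far worse than degree 2, while higher degrees bring little or no further improvement.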



## Why is visualization important in polynomial regression?

Visualization helps in understanding the shape of the fitted curve and identifying whether the model is underfitting or overfitting. It provides intuitive insights into how well the model captures the pattern in the data.



## How is polynomial regression implemented in Python?

Polynomial regression can be implemented using `scikit-learn`:

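```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

# Placeholder data for illustration; replace with your own X (2-D) and y
X = np.linspace(-3, 3, 30).reshape(-1, 1)
y = 0.5 * X.ravel() ** 2 + X.ravel() + 2

# Expand X into [1, X, X²], then fit an ordinary linear regression on it
model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
model.fit(X, y)
```

This transforms the features and fits the polynomial model in a single pipeline.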