1. What is Simple Linear Regression?

**Simple Linear Regression** is a statistical and machine learning method used to model the relationship between **one independent variable (X)** and **one dependent variable (Y)** by fitting a straight line to the data.

It assumes that the relationship between X and Y is **linear**.

### Mathematical Equation

$$
Y = mX + c
$$

Where:

* **Y** = dependent variable (output)
* **X** = independent variable (input)
* **m** = slope (change in Y for a one-unit change in X)
* **c** = intercept (value of Y when X = 0)

### Purpose

* To **predict** the value of Y based on X
* To **understand** how changes in X affect Y

### Example

Predicting **house price (Y)** based on **house size (X)**.

### Key Points (Short)

* Uses **one input variable**
* Fits a **straight line**
* Based on minimizing the **sum of squared errors (least squares method)**
* Widely used for **prediction and trend analysis**
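
The least squares line has a closed form. As a sketch, writing the $n$ observations as $(X_i, Y_i)$ with sample means $\bar{X}$ and $\bar{Y}$ (notation introduced here for illustration), the method chooses $m$ and $c$ to minimize the sum of squared errors:

$$
S(m, c) = \sum_{i=1}^{n} (Y_i - mX_i - c)^2
$$

Setting the partial derivatives of $S$ with respect to $m$ and $c$ to zero gives:

$$
m = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n} (X_i - \bar{X})^2},
\qquad
c = \bar{Y} - m\bar{X}
$$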


2. What are the key assumptions of Simple Linear Regression?

The **key assumptions of Simple Linear Regression** are:

1. **Linearity**
   The relationship between the independent variable (X) and dependent variable (Y) is linear.

2. **Independence**
   Observations are independent of each other (no correlation between errors).

3. **Homoscedasticity**
   The variance of errors is constant across all values of X.

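4. **Normality of Errors**
   The residuals (errors) are approximately normally distributed. This matters mainly for confidence intervals and hypothesis tests, not for the coefficient estimates themselves.

5. **No or Minimal Outliers**
   There are no extreme outliers that strongly influence the regression line. This is a practical requirement rather than a formal assumption of the model.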



3. What is heteroscedasticity, and why is it important to address in regression models?

**Heteroscedasticity** occurs in a regression model when the **variance of the error terms (residuals) is not constant** across all levels of the independent variable(s).

### In Simple Terms

As X changes, the spread of errors increases or decreases instead of remaining uniform.

### Why It Is Important to Address

* **Biased standard errors** → confidence intervals come out too narrow or too wide
* **Unreliable hypothesis tests** → t-tests and p-values become misleading
* **Inefficient estimates** → OLS coefficients remain unbiased but no longer have the smallest possible variance
* **Poor model interpretation** → prediction intervals understate or overstate the uncertainty at some X values

### Common Causes

* Presence of outliers
* Skewed data
* Incorrect model specification
* Large differences in scale of variables

### How to Detect

* Residual vs fitted value plots
* Breusch–Pagan test
* White test

### How to Fix

* Apply transformations (log, square root)
* Use **Weighted Least Squares (WLS)**
* Use **robust standard errors**
* Add missing variables
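
As a rough illustration of the detection step, the sketch below computes a Breusch-Pagan style statistic by hand with NumPy. It uses the studentized (Koenker) form, $n \cdot R^2$ from regressing the squared residuals on the predictors. The synthetic data, the seed, and all variable names are assumptions made for this example only.

```python
import math
import numpy as np

# Synthetic data (an assumption for illustration): the noise standard
# deviation grows with x, so the errors are heteroscedastic by construction.
rng = np.random.default_rng(0)
n = 200
x = np.linspace(1.0, 10.0, n)
y = 2.0 + 3.0 * x + rng.normal(scale=0.5 * x)

# Step 1: ordinary least squares fit and its residuals.
X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
residuals = y - X @ beta

# Step 2: auxiliary regression of the squared residuals on the same predictors.
sq_resid = residuals ** 2
gamma, *_ = np.linalg.lstsq(X, sq_resid, rcond=None)
fitted_sq = X @ gamma
ss_res = np.sum((sq_resid - fitted_sq) ** 2)
ss_tot = np.sum((sq_resid - sq_resid.mean()) ** 2)
r2_aux = 1.0 - ss_res / ss_tot

# Step 3: LM statistic n * R^2, compared with a chi-square distribution
# whose degrees of freedom equal the number of predictors (1 here).
lm_stat = n * r2_aux
# For one degree of freedom, P(chi2 > s) = erfc(sqrt(s / 2)).
p_value = math.erfc(math.sqrt(lm_stat / 2.0))

print(f"LM statistic = {lm_stat:.2f}, p-value = {p_value:.4g}")
```

A small p-value suggests the error variance depends on X. With more than one predictor, the p-value would need a general chi-square tail function rather than the one-degree-of-freedom shortcut used here.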


4. What is Multiple Linear Regression?

**Multiple Linear Regression (MLR)** is a statistical technique used to model the relationship between **one dependent variable (Y)** and **two or more independent variables (X₁, X₂, …, Xₙ)**.

It extends Simple Linear Regression by using multiple predictors to explain or predict the outcome.

### Mathematical Equation

$$
Y = \beta_0 + \beta_1X_1 + \beta_2X_2 + \cdots + \beta_nX_n + \varepsilon
$$

Where:

* **Y** = dependent variable
* **X₁, X₂, …, Xₙ** = independent variables
* **β₀** = intercept
* **β₁, β₂, …, βₙ** = coefficients (effect of each X on Y, holding others constant)
* **ε** = error term

### Purpose

* To **predict** Y using multiple factors
* To **understand the individual impact** of each independent variable

### Example

Predicting **house price** based on **size, location, number of rooms, and age of the house**.

### Key Points (Short)

* Uses **multiple input variables**
* Assumes a **linear relationship**
* Helps control for **confounding variables**
* Widely used in **data analysis and machine learning**


5. What is polynomial regression, and how does it differ from linear regression?

**Polynomial Regression** is a type of regression that models the relationship between the independent variable(s) and the dependent variable as an **nth-degree polynomial**. It is used when the relationship between variables is **non-linear**.

### Polynomial Regression Equation

$$
Y = \beta_0 + \beta_1X + \beta_2X^2 + \beta_3X^3 + \cdots + \beta_nX^n + \varepsilon
$$

---

### How It Differs from Linear Regression

| Aspect       | Linear Regression        | Polynomial Regression                         |
| ------------ | ------------------------ | --------------------------------------------- |
| Relationship | Straight-line (linear)   | Curved (non-linear)                           |
| Model form   | $Y = mX + c$             | $Y = \beta_0 + \beta_1X + \beta_2X^2 + \dots$ |
| Complexity   | Simple                   | More flexible                                 |
| Fit          | May underfit curved data | Fits non-linear patterns better               |
| Risk         | Low overfitting          | Higher overfitting with high degree           |

---

### Key Points

* Polynomial regression is **linear in parameters**, but **non-linear in variables**
* Degree of polynomial controls model flexibility
* Choosing a very high degree can lead to **overfitting**
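
To make the "linear in parameters" point concrete, here is a minimal sketch that fits a quadratic with ordinary least squares on the expanded columns $[1, X, X^2]$. The data is synthetic and noise-free (generated from $1 + 2X + 3X^2$ purely for illustration), so the fit should recover those coefficients.

```python
import numpy as np

# Noise-free synthetic data from a known quadratic (assumed for illustration).
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 1.0 + 2.0 * x + 3.0 * x ** 2

# Design matrix with columns [1, x, x^2]: the model is linear in the
# coefficients even though it is quadratic in x.
design = np.column_stack([np.ones_like(x), x, x ** 2])

# Ordinary least squares on the expanded features.
coeffs, *_ = np.linalg.lstsq(design, y, rcond=None)
print(np.round(coeffs, 6))  # order: beta_0, beta_1, beta_2
```

The same least squares machinery used for linear regression does all the work; only the feature columns change.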

### Example

Predicting **sales growth** over time where growth accelerates or decelerates.


6. Implement a Python program to fit a Simple Linear Regression model to the following sample data:

Here’s a complete **Python implementation of Simple Linear Regression** with a plot of the regression line over the data points.

### Explanation (Short)

* We use **NumPy** for data handling
* **scikit-learn’s LinearRegression** to fit the model
* **Matplotlib** to visualize data points and the regression line
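
A minimal sketch of such a program follows. The sample data is not reproduced in this section, so the `X` and `Y` arrays below are an assumption: they were picked because they reproduce the slope and intercept reported under Model Results. The output file name is also just a placeholder.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script also runs without a display
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LinearRegression

# ASSUMED sample data (not given in this section); chosen because it
# reproduces the reported slope and intercept.
X = np.array([1, 2, 3, 4, 5], dtype=float).reshape(-1, 1)  # scikit-learn expects 2-D inputs
Y = np.array([2.1, 4.3, 6.1, 7.9, 10.2])

# Fit the model by ordinary least squares.
model = LinearRegression()
model.fit(X, Y)
slope = model.coef_[0]
intercept = model.intercept_
print(f"Slope (m): {slope:.2f}")
print(f"Intercept (c): {intercept:.2f}")

# Plot the data points and the fitted line.
Y_pred = model.predict(X)
fig, ax = plt.subplots()
ax.scatter(X.ravel(), Y, color="blue", label="Actual data")
ax.plot(X.ravel(), Y_pred, color="red", label="Regression line")
ax.set_xlabel("X")
ax.set_ylabel("Y")
ax.set_title("Simple Linear Regression")
ax.legend()
fig.savefig("simple_linear_regression.png")  # hypothetical output file name
```

To view the plot interactively instead of saving it, remove the `matplotlib.use("Agg")` line and call `plt.show()`.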

### Model Results

* **Slope (m)** ≈ **1.98**
* **Intercept (c)** ≈ **0.18**

So, the fitted equation is approximately:

$$
Y = 1.98X + 0.18
$$

### What the Plot Shows

* Scatter points → actual data
* Straight line → predicted values from the regression model



7. Fit a Multiple Linear Regression model on this sample data:

Here is the **Multiple Linear Regression model** fitted to the given data.

### Model Used

$$
\text{Price} = \beta_0 + \beta_1(\text{Area}) + \beta_2(\text{Rooms})
$$

### Fitted Model Parameters

* **Intercept (β₀)** ≈ **103,157.89**
* **Coefficient for Area (β₁)** ≈ **63.16**
* **Coefficient for Rooms (β₂)** ≈ **34,736.84**

### Final Regression Equation

$$
\text{Price} = 103157.89 + 63.16 \times \text{Area} + 34736.84 \times \text{Rooms}
$$

### Interpretation

* For every **1 sq ft increase in area**, the predicted price increases by approximately **₹63.16**, holding the number of rooms constant.
* Adding **one extra room** increases the predicted price by approximately **₹34,736.84**, holding area constant.
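
The parameters above can be reproduced with a short least squares fit. The data table is not included in this section, so the `area`, `rooms`, and `price` arrays below are an assumption, chosen because they yield the reported coefficients. The use of `np.linalg.lstsq` is also a choice made for this sketch, since the text does not say which library produced the fit.

```python
import numpy as np

# ASSUMED sample data (not given in this section); chosen because it
# reproduces the reported intercept and coefficients.
area = np.array([1200, 1500, 1800, 2000], dtype=float)
rooms = np.array([2, 3, 3, 4], dtype=float)
price = np.array([250000, 300000, 320000, 370000], dtype=float)

# Design matrix: a column of ones for the intercept, then Area and Rooms.
design = np.column_stack([np.ones_like(area), area, rooms])

# Ordinary least squares solution for (beta_0, beta_1, beta_2).
beta, *_ = np.linalg.lstsq(design, price, rcond=None)
intercept, coef_area, coef_rooms = beta

print(f"Intercept: {intercept:.2f}")
print(f"Area coefficient: {coef_area:.2f}")
print(f"Rooms coefficient: {coef_rooms:.2f}")
```

With only four observations and three parameters, the fit has a single residual degree of freedom, so these coefficients should be read as an exercise result rather than a reliable estimate.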


8. Implement polynomial regression on the following data:

Here is the **implementation of Polynomial Regression** (degree = 2) for the given data.

### Model Used

$$
Y = \beta_0 + \beta_1X + \beta_2X^2
$$

### Fitted Model Parameters

* **Intercept (β₀)** ≈ **0.06**
* **β₁ (X)** ≈ **1.94**
* **β₂ (X²)** ≈ **0.20**

### Final Polynomial Equation

$$
Y \approx 0.06 + 1.94X + 0.20X^2
$$

### Explanation

* Polynomial regression captures the **curved (non-linear) trend** in the data.
* The scatter points show the actual data.
* The smooth curve represents the polynomial regression fit.
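
A degree-2 fit like the one above can be sketched with `np.polyfit`. The data is not reproduced in this section, so the `X` and `Y` arrays are an assumption, chosen because they reproduce the reported coefficients.

```python
import numpy as np

# ASSUMED sample data (not given in this section); chosen because it
# reproduces the reported coefficients.
X = np.array([1, 2, 3, 4, 5], dtype=float)
Y = np.array([2.2, 4.8, 7.5, 11.2, 14.7])

# np.polyfit returns coefficients from the highest power down to the constant.
beta_2, beta_1, beta_0 = np.polyfit(X, Y, deg=2)
print(f"Intercept: {beta_0:.2f}, beta_1: {beta_1:.2f}, beta_2: {beta_2:.2f}")

# Evaluate the fitted curve on a fine grid, e.g. to draw the smooth curve.
X_grid = np.linspace(X.min(), X.max(), 100)
Y_curve = np.polyval([beta_2, beta_1, beta_0], X_grid)
```

Plotting `X_grid` against `Y_curve` over a scatter of the original points gives the curve described in the explanation.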


9. Create a residuals plot for a regression model trained on this data:

### Step 1: Calculate the Residuals

The data, as used in the calculations below, are $X = (10, 20, 30, 40, 50)$ and $Y = (15, 35, 40, 50, 65)$. First fit a linear regression model to obtain the predicted values ($\hat{Y}$), then calculate each residual ($e$) as $e = Y - \hat{Y}$. The least squares fit is $\hat{Y} = 1.15X + 6.5$, so the residuals are:

* For $X=10$: $e = 15 - (1.15 \times 10 + 6.5) = 15 - 18 = -3$
* For $X=20$: $e = 35 - (1.15 \times 20 + 6.5) = 35 - 29.5 = 5.5$
* For $X=30$: $e = 40 - (1.15 \times 30 + 6.5) = 40 - 41 = -1$
* For $X=40$: $e = 50 - (1.15 \times 40 + 6.5) = 50 - 52.5 = -2.5$
* For $X=50$: $e = 65 - (1.15 \times 50 + 6.5) = 65 - 64 = 1$

### Step 2: Plot Residuals Against the Independent Variable [5]

A residuals plot is a scatter plot with the independent variable ($X$) on the horizontal axis and the corresponding residuals ($e$) on the vertical axis, so each point is plotted as $(X, e)$. [6]

### Answer

The residual plot is a scatter plot of the following points:

* (10, -3)
* (20, 5.5)
* (30, -1)
* (40, -2.5)
* (50, 1)

The points fall on both sides of the horizontal axis (the $e=0$ line) with no obvious curve or funnel shape, which is consistent with a linear model being appropriate for this data. [7, 8] With only five observations, this is weak evidence rather than confirmation.
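
The calculation and plot can be sketched in a few lines. The `X` and `Y` values are the ones appearing in the residual calculations above; the choice of `np.polyfit` for the fit and the output file name are assumptions of this sketch.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script also runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Data as it appears in the residual calculations.
X = np.array([10, 20, 30, 40, 50], dtype=float)
Y = np.array([15, 35, 40, 50, 65], dtype=float)

# Fit a straight line by least squares; polyfit returns (slope, intercept).
slope, intercept = np.polyfit(X, Y, deg=1)

# Residual = observed value minus predicted value.
residuals = Y - (slope * X + intercept)
print("Residuals:", np.round(residuals, 2))

# Residuals plot: X on the horizontal axis, residuals on the vertical axis.
fig, ax = plt.subplots()
ax.scatter(X, residuals, color="blue")
ax.axhline(0, color="red", linestyle="--")  # reference line at e = 0
ax.set_xlabel("X")
ax.set_ylabel("Residual (Y - predicted Y)")
ax.set_title("Residuals Plot")
fig.savefig("residuals_plot.png")  # hypothetical output file name
```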



10. Imagine you are a data scientist working for a real estate company. You need to predict house prices using features like area, number of rooms, and location. However, you detect heteroscedasticity and multicollinearity in your regression model. Explain the steps you would take to address these issues and ensure a robust model.

As a data scientist, I would handle **heteroscedasticity** and **multicollinearity** systematically to ensure the regression model is **reliable, interpretable, and accurate**.

---

## 1. Addressing Heteroscedasticity

(Heteroscedasticity = non-constant variance of errors)

### Step 1: Detect the problem

* Plot **residuals vs predicted values**
* Use statistical tests:

  * Breusch–Pagan test
  * White test

### Step 2: Apply solutions

* **Transform the target variable**

  * Log(price), square root, or Box-Cox transformation
* **Use robust standard errors**

  * Heteroscedasticity-consistent (HC) standard errors
* **Weighted Least Squares (WLS)**

  * Give less weight to observations with high variance
* **Improve model specification**

  * Add missing variables (e.g., neighborhood quality, amenities)
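
As one example of these remedies, the sketch below computes heteroscedasticity-consistent standard errors by hand, using the simplest variant (HC0, the "sandwich" estimator), and prints them next to the classical ones. The housing data is synthetic, and every name and number in it is an assumption made for illustration.

```python
import numpy as np

# Synthetic data (hypothetical): price noise grows with area, so the
# errors are heteroscedastic by construction.
rng = np.random.default_rng(42)
n = 500
area = rng.uniform(500, 3000, size=n)
price = 50_000 + 100 * area + rng.normal(scale=20 * area)

# OLS fit with an intercept column.
X = np.column_stack([np.ones(n), area])
beta, *_ = np.linalg.lstsq(X, price, rcond=None)
resid = price - X @ beta
XtX_inv = np.linalg.inv(X.T @ X)

# Classical standard errors assume one common error variance.
sigma2 = resid @ resid / (n - X.shape[1])
se_classical = np.sqrt(np.diag(sigma2 * XtX_inv))

# HC0 sandwich estimator: (X'X)^-1 [sum of e_i^2 * x_i x_i'] (X'X)^-1.
meat = X.T @ (X * resid[:, None] ** 2)
se_robust = np.sqrt(np.diag(XtX_inv @ meat @ XtX_inv))

print("Coefficients:      ", np.round(beta, 2))
print("Classical std errs:", np.round(se_classical, 2))
print("Robust (HC0) errs: ", np.round(se_robust, 2))
```

The coefficients are the same in both cases; only the uncertainty attached to them changes, which is why robust standard errors repair inference but not efficiency.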

### Outcome

* ✔ Correct confidence intervals
* ✔ Reliable hypothesis tests
* ✔ Better prediction stability

---

## 2. Addressing Multicollinearity

(Multicollinearity = high correlation among predictors)

### Step 1: Detect the problem

* Correlation matrix / heatmap
* Variance Inflation Factor (VIF)

  * As a rule of thumb, a VIF above 5 (or, more leniently, above 10) signals problematic multicollinearity
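
The VIF for predictor $j$ is $1 / (1 - R_j^2)$, where $R_j^2$ comes from regressing that predictor on all the others. Here is a minimal NumPy sketch of that definition; the `vif` helper and the synthetic features (including a `carpet_area` column built to be nearly proportional to `area`) are assumptions for illustration.

```python
import numpy as np

def vif(features: np.ndarray) -> np.ndarray:
    """Return VIF_j = 1 / (1 - R_j^2) for each column of `features`."""
    n, k = features.shape
    out = np.empty(k)
    for j in range(k):
        target = features[:, j]
        others = np.delete(features, j, axis=1)
        # Regress column j on the remaining columns plus an intercept.
        design = np.column_stack([np.ones(n), others])
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        resid = target - design @ coef
        r2 = 1.0 - (resid @ resid) / np.sum((target - target.mean()) ** 2)
        out[j] = 1.0 / (1.0 - r2)
    return out

# Synthetic features (hypothetical): carpet_area is almost a scaled copy of area.
rng = np.random.default_rng(1)
n = 300
area = rng.uniform(500, 3000, n)
carpet_area = 0.8 * area + rng.normal(scale=30, size=n)
rooms = rng.integers(1, 6, n).astype(float)

vifs = vif(np.column_stack([area, carpet_area, rooms]))
for name, value in zip(["area", "carpet_area", "rooms"], vifs):
    print(f"{name}: VIF = {value:.1f}")
```

In this setup the two area columns get very large VIFs while `rooms`, generated independently, stays close to 1.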

### Step 2: Apply solutions

* **Remove or combine correlated variables**

  * Example: area and carpet area → keep one
* **Feature engineering**

  * Create ratios or aggregated features
* **Dimensionality reduction**

  * Principal Component Analysis (PCA)
* **Regularization techniques**

  * Ridge Regression (L2)
  * Lasso Regression (L1)
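
To show why regularization helps here, the sketch below compares ordinary least squares with a closed-form ridge fit on two nearly collinear predictors. The `ridge_coefficients` helper, the synthetic data, and the penalty value `alpha=10.0` are all assumptions chosen for illustration, not tuned settings.

```python
import numpy as np

def ridge_coefficients(features, target, alpha):
    """Closed-form ridge fit on standardized features (alpha=0 gives OLS)."""
    Z = (features - features.mean(axis=0)) / features.std(axis=0)
    y_centered = target - target.mean()  # centering removes the intercept
    penalty = alpha * np.eye(Z.shape[1])
    return np.linalg.solve(Z.T @ Z + penalty, Z.T @ y_centered)

# Synthetic data (hypothetical): x2 is almost an exact copy of x1.
rng = np.random.default_rng(3)
n = 100
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.01, size=n)
y = 3.0 * x1 + 3.0 * x2 + rng.normal(size=n)
features = np.column_stack([x1, x2])

ols = ridge_coefficients(features, y, alpha=0.0)
ridge = ridge_coefficients(features, y, alpha=10.0)

print("OLS coefficients (standardized scale):  ", np.round(ols, 2))
print("Ridge coefficients (standardized scale):", np.round(ridge, 2))
```

The OLS coefficients can split the shared effect between the two predictors almost arbitrarily, while the L2 penalty pulls them toward similar, more stable values. In practice the penalty strength would be chosen by cross-validation.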

### Outcome

* ✔ Stable coefficient estimates
* ✔ Improved interpretability
* ✔ Reduced variance in predictions

---

## 3. Model Validation & Final Checks

* Use **train–test split or cross-validation**
* Compare models using **RMSE, MAE, R²**
* Re-check residual plots after fixes
* Ensure assumptions are reasonably satisfied
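
The cross-validation step can be sketched as a hand-rolled k-fold loop that reports RMSE per fold for a linear model. The `kfold_rmse` helper and the synthetic housing data (with a noise standard deviation of 10,000) are assumptions for illustration.

```python
import numpy as np

def kfold_rmse(features, target, k=5, seed=0):
    """Return the test RMSE of an OLS fit for each of k folds."""
    n = len(target)
    indices = np.random.default_rng(seed).permutation(n)
    scores = []
    for test_idx in np.array_split(indices, k):
        train_idx = np.setdiff1d(indices, test_idx)
        # Add an intercept column to the training and test features.
        X_train = np.column_stack([np.ones(len(train_idx)), features[train_idx]])
        X_test = np.column_stack([np.ones(len(test_idx)), features[test_idx]])
        beta, *_ = np.linalg.lstsq(X_train, target[train_idx], rcond=None)
        errors = target[test_idx] - X_test @ beta
        scores.append(np.sqrt(np.mean(errors ** 2)))
    return np.array(scores)

# Synthetic housing data (hypothetical values).
rng = np.random.default_rng(7)
n = 200
area = rng.uniform(500, 3000, n)
rooms = rng.integers(1, 6, n).astype(float)
price = 50_000 + 100 * area + 20_000 * rooms + rng.normal(scale=10_000, size=n)

scores = kfold_rmse(np.column_stack([area, rooms]), price, k=5)
print("RMSE per fold:", np.round(scores))
print("Mean RMSE:", round(float(scores.mean())))
```

Running the same loop before and after each fix (transformation, feature removal, regularization) gives a like-for-like comparison of out-of-sample error.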

