# Supervised Learning: Regression Models and Performance Metrics Assignment

Question 1: What is Simple Linear Regression (SLR)? Explain its purpose.

->Answer:

Simple Linear Regression (SLR) is a statistical method used to study the relationship between two variables — one independent variable (X) and one dependent variable (Y).

It tries to find a linear equation that best predicts the value of the dependent variable based on the independent variable. The equation of SLR is:

$$
Y = \beta_0 + \beta_1 X + \epsilon
$$

Where:

* $Y$ = Dependent variable (what we want to predict)
* $X$ = Independent variable (predictor)
* $\beta_0$ = Intercept (value of Y when X = 0)
* $\beta_1$ = Slope (change in Y for a one-unit change in X)
* $\epsilon$ = Error term (difference between predicted and actual values)

Purpose of Simple Linear Regression:

1. Prediction: To predict the value of the dependent variable based on the independent variable.

2. Relationship Analysis: To understand how changes in one variable affect another.

3. Trend Identification: To find trends or patterns between two continuous variables.

Example:
If we want to predict a student's exam score (Y) based on their study hours (X), SLR can help find the linear relationship, e.g.,

$$
\text{Score} = 40 + 5 \times (\text{Study Hours})
$$
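
The example equation can be turned into a tiny prediction function. This is only a sketch: the coefficients 40 and 5 come from the illustrative equation above, and the function name `predict_score` is made up for this example.

```python
# Coefficients taken from the illustrative equation Score = 40 + 5 * (Study Hours).
INTERCEPT = 40.0  # predicted score when study hours = 0
SLOPE = 5.0       # predicted points gained per extra hour of study


def predict_score(study_hours: float) -> float:
    """Return the score predicted by the example regression line."""
    return INTERCEPT + SLOPE * study_hours


for hours in (0, 2, 6):
    print(f"{hours} h -> predicted score {predict_score(hours)}")
```

The intercept is the prediction at zero hours, and each additional hour raises the prediction by exactly the slope.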


Question 2: What are the key assumptions of Simple Linear Regression?

->**Answer:**

Simple Linear Regression (SLR) is based on several key **assumptions** that ensure the model gives valid and reliable results. These assumptions are:

---

### **1. Linearity**

* The relationship between the independent variable (**X**) and the dependent variable (**Y**) is **linear**.
* This means changes in X lead to proportional changes in Y.
* *Example:* As study hours increase, marks increase in a roughly straight-line pattern.

---

### **2. Independence of Errors**

* The residuals (errors) should be **independent** of each other.
* This means one observation’s error should not influence another’s.
* *Example:* One student's exam score error should not depend on another student's score error.

---

### **3. Homoscedasticity**

* The variance of the errors (residuals) should be **constant** across all values of X.
* If the spread of errors increases or decreases with X, it violates this assumption.
* *Example:* In a plot of residuals against X, the spread should look roughly uniform along the whole regression line.

---

### **4. Normality of Errors**

* The residuals should be **normally distributed** (especially important for hypothesis testing and confidence intervals).
* *Example:* A histogram of the residuals should look roughly bell-shaped and centred on zero.

---

### **5. No Multicollinearity** *(only relevant when more than one predictor is used, i.e., in Multiple Linear Regression)*

* For SLR (only one X), this assumption is automatically satisfied.

---

**In summary:**

| No. | Assumption       | Meaning                                |
| --- | ---------------- | -------------------------------------- |
| 1   | Linearity        | Relationship between X and Y is linear |
| 2   | Independence     | Errors are independent                 |
| 3   | Homoscedasticity | Constant variance of errors            |
| 4   | Normality        | Errors are normally distributed        |

---

**If these assumptions are met,** the least-squares estimates of the intercept and slope are **unbiased**, and the usual standard errors, confidence intervals, and hypothesis tests can be relied on for prediction and interpretation.
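
Some of these assumptions can be checked roughly from the residuals of a fitted line. The sketch below uses hypothetical data and hypothetical helper names (`fit_line`, `variance_low_x`, and so on). It compares residual variance for small and large X (homoscedasticity) and computes the Durbin-Watson statistic (independence). These are quick numeric checks, not formal tests. Normality is usually judged from a histogram or Q-Q plot and is not covered here.

```python
from statistics import mean, pvariance

# Hypothetical data: hours studied (X) and exam scores (Y), sorted by X.
hours = [1, 2, 3, 4, 5, 6, 7, 8]
scores = [44, 51, 54, 61, 64, 71, 74, 81]


def fit_line(xs, ys):
    """Return (intercept, slope) of the ordinary least-squares line."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


intercept, slope = fit_line(hours, scores)
residuals = [y - (intercept + slope * x) for x, y in zip(hours, scores)]

# Homoscedasticity: compare residual variance for small X and for large X.
half = len(residuals) // 2
variance_low_x = pvariance(residuals[:half])
variance_high_x = pvariance(residuals[half:])

# Independence: Durbin-Watson statistic (ranges from 0 to 4); values near 2
# suggest little correlation between consecutive residuals.
durbin_watson = sum(
    (residuals[i] - residuals[i - 1]) ** 2 for i in range(1, len(residuals))
) / sum(r ** 2 for r in residuals)

print(f"residual variance (low X):  {variance_low_x:.3f}")
print(f"residual variance (high X): {variance_high_x:.3f}")
print(f"Durbin-Watson statistic:    {durbin_watson:.3f}")
```

Very different variances in the two halves, or a Durbin-Watson value far from 2, would be a reason to inspect the residual plots more closely.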

Question 3: Write the mathematical equation for a simple linear regression model and
explain each term.


->The **mathematical equation** for a Simple Linear Regression (SLR) model is:

$$
Y = \beta_0 + \beta_1 X + \epsilon
$$

---

### **Explanation of each term:**

* **$Y$** → Dependent variable (or **response/output variable**)

  * The variable we are trying to predict or estimate.

* **$X$** → Independent variable (or **predictor/input variable**)

  * The variable used to make predictions about $Y$.

* **$\beta_0$** → Intercept (constant term)

  * The value of $Y$ when $X = 0$.
  * It represents the point where the regression line crosses the Y-axis.

* **$\beta_1$** → Slope (coefficient of $X$)

  * It shows how much $Y$ changes when $X$ increases by 1 unit.
  * Mathematically, it is the **rate of change** of $Y$ with respect to $X$.

* **$\epsilon$** → Error term (residual)

  * Represents the difference between the actual value and the predicted value of $Y$.
  * Captures all other factors that affect $Y$ but are not included in the model.

---

**In simple words:**
Simple Linear Regression finds the best-fitting straight line that describes the relationship between one independent variable (X) and one dependent variable (Y), so that Y can be predicted from X.
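
The role of each term can be illustrated by splitting an observed value into the part explained by the line and the leftover error. The coefficients and observations below are hypothetical, and `decompose` is a name invented for this sketch.

```python
# Hypothetical coefficients and observations, chosen only for illustration.
BETA_0 = 40.0  # intercept
BETA_1 = 5.0   # slope

observations = [(2.0, 52.0), (4.0, 57.0), (6.0, 71.0)]  # (X, actual Y)


def decompose(x: float, y: float) -> tuple[float, float]:
    """Split an observed Y into the line's prediction and the leftover error."""
    predicted = BETA_0 + BETA_1 * x
    error = y - predicted
    return predicted, error


for x, y in observations:
    predicted, error = decompose(x, y)
    # By construction: actual = BETA_0 + BETA_1 * x + error.
    print(f"X={x}: actual={y}, predicted={predicted}, error={error}")
```

A positive error means the observation lies above the line, and a negative error means it lies below.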

Question 4: Provide a real-world example where simple linear regression can be
applied.

->**Answer:**
A real-world example of **Simple Linear Regression** is predicting a student’s **exam score based on the number of hours studied**.

For example:

* **Independent variable (X):** Hours studied
* **Dependent variable (Y):** Exam score

If we collect data from several students about how many hours they studied and their corresponding scores, we can use simple linear regression to find a line of best fit.
This line can then be used to **predict the exam score** of a new student based on how many hours they study.


Question 5: What is the method of least squares in linear regression?

->The **method of least squares** is a mathematical technique used in **linear regression** to find the **best-fitting line** through a set of data points.

### Definition:

It minimizes the **sum of the squared differences (errors)** between the observed values (actual data points) and the values predicted by the linear model.

---

### Explanation:

In **simple linear regression**, the model is:

$$
Y = a + bX
$$

where:

* $Y$ = dependent variable
* $X$ = independent variable
* $a$ = intercept
* $b$ = slope

The **method of least squares** finds values of $a$ and $b$ such that the sum of squared errors (SSE) is **minimum**:

$$
SSE = \sum (Y_i - \hat{Y}_i)^2 = \sum (Y_i - (a + bX_i))^2
$$

where:

* $Y_i$ = actual value
* $\hat{Y}_i$ = predicted value
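
The minimum can be located with calculus. As a sketch of the standard argument (all sums run over the n data points), setting both partial derivatives of the SSE to zero gives two linear equations in a and b, often called the normal equations:

```latex
\begin{aligned}
\frac{\partial\, SSE}{\partial a} &= -2 \sum (Y_i - a - bX_i) = 0
  &&\Rightarrow\quad \sum Y_i = na + b \sum X_i \\
\frac{\partial\, SSE}{\partial b} &= -2 \sum X_i (Y_i - a - bX_i) = 0
  &&\Rightarrow\quad \sum X_i Y_i = a \sum X_i + b \sum X_i^2
\end{aligned}
```

Solving these two equations simultaneously yields the closed-form expressions for the slope and the intercept.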

---

### Formulae for coefficients:

$$
b = \frac{n(\sum XY) - (\sum X)(\sum Y)}{n(\sum X^2) - (\sum X)^2}
$$

$$
a = \bar{Y} - b\bar{X}
$$
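
These formulas translate directly into code. The sketch below uses hypothetical data and invented function names (`least_squares`, `sse`). It computes a and b from the sums and evaluates the SSE of the resulting line.

```python
def least_squares(xs, ys):
    """Return (a, b) for the line Y = a + bX using the summation formulas."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    denominator = n * sum_x2 - sum_x ** 2
    if denominator == 0:
        raise ValueError("all X values are identical; the slope is undefined")
    b = (n * sum_xy - sum_x * sum_y) / denominator
    a = sum_y / n - b * (sum_x / n)  # a = mean(Y) - b * mean(X)
    return a, b


def sse(xs, ys, a, b):
    """Sum of squared errors of the line Y = a + bX on the data."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))


# Hypothetical data: hours studied and exam scores for five students.
hours = [1, 2, 3, 4, 5]
scores = [46, 49, 56, 59, 65]

a, b = least_squares(hours, scores)
print(f"a = {a:.2f}, b = {b:.2f}")               # a = 40.60, b = 4.80
print(f"SSE = {sse(hours, scores, a, b):.2f}")
```

For this data the sums are n = 5, ΣX = 15, ΣY = 275, ΣXY = 873, and ΣX² = 55, so b = (5·873 − 15·275) / (5·55 − 15²) = 240 / 50 = 4.8 and a = 55 − 4.8·3 = 40.6. Nudging either coefficient away from these values increases the SSE.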

---

### Purpose:

To find the line of best fit that:

* Minimizes prediction errors.
* Represents the overall trend of the data as accurately as possible.

---

### Example:

If you plot the relationship between **hours studied (X)** and **exam score (Y)**, the least squares method helps you find the straight line that best predicts scores based on study hours.



Question 6: What is Logistic Regression? How does it differ from Linear Regression?

->**Answer:**

---

### **Logistic Regression:**

Logistic Regression is a **supervised machine learning algorithm** used for **classification problems**, especially **binary classification** (where the output can take only two possible values, such as 0 or 1, Yes or No, True or False).

It predicts the **probability** that a given input belongs to a particular class using the **logistic (sigmoid) function**.

The logistic (sigmoid) function is:
$$
P(Y=1 \mid X) = \frac{1}{1 + e^{-(b_0 + b_1X)}}
$$

This function maps any real-valued number into a range between **0 and 1**, making it ideal for probability estimation.
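
A minimal sketch of the sigmoid in code. The coefficient values passed to `predict_probability` are hypothetical, and the two-branch form is one common way to avoid floating-point overflow for large negative inputs.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic function 1 / (1 + e^(-z)), written to avoid overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    exp_z = math.exp(z)  # safe: z is negative, so exp(z) is below 1
    return exp_z / (1.0 + exp_z)


def predict_probability(x: float, b0: float, b1: float) -> float:
    """P(Y=1 | X=x) for a logistic model with intercept b0 and slope b1."""
    return sigmoid(b0 + b1 * x)


# Hypothetical coefficients, for illustration only.
for x in (0, 4, 8):
    print(f"x={x}: P(Y=1) = {predict_probability(x, b0=-4.0, b1=1.0):.3f}")
```

With these coefficients the linear part b0 + b1·x equals zero at x = 4, so the predicted probability there is exactly 0.5.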

---

### **Difference Between Logistic and Linear Regression**

| **Aspect**            | **Linear Regression**                                            | **Logistic Regression**                                              |
| --------------------- | ---------------------------------------------------------------- | -------------------------------------------------------------------- |
| **Purpose**           | Used for **predicting continuous values** (e.g., salary, price). | Used for **predicting categorical values** (e.g., spam or not spam). |
| **Output**            | Produces a **real number** (can be negative or positive).        | Produces a **probability** between **0 and 1**.                      |
| **Function Used**     | Uses a **straight-line equation**: $Y = b_0 + b_1X$              | Uses a **sigmoid function**: $P = \frac{1}{1 + e^{-(b_0 + b_1X)}}$   |
| **Decision Boundary** | No fixed boundary; predicts continuous outcomes.                 | Classifies data based on a **threshold** (commonly 0.5).             |
| **Type of Problem**   | **Regression** problem.                                          | **Classification** problem.                                          |

---

**Example:**

* **Linear Regression:** Predicting house prices based on area.
* **Logistic Regression:** Predicting whether a house will be sold (Yes=1 / No=0) based on price and area.
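
The classification side of this contrast can be sketched as follows. The data is hypothetical and uses a single predictor (price) for simplicity. Plain gradient descent on the log-loss is used here only because it is short to write; libraries typically use other optimizers. The names `fit_logistic` and `classify` are invented for this example.

```python
import math

# Hypothetical data: house price (in units of 100,000) and whether it sold.
prices = [1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5]
sold = [1, 1, 1, 1, 0, 1, 0, 0]  # 1 = sold, 0 = not sold


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(xs, ys, learning_rate=0.1, epochs=5000):
    """Fit b0 and b1 by gradient descent on the mean log-loss."""
    b0, b1 = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_b0 = grad_b1 = 0.0
        for x, y in zip(xs, ys):
            error = sigmoid(b0 + b1 * x) - y  # predicted probability minus label
            grad_b0 += error
            grad_b1 += error * x
        b0 -= learning_rate * grad_b0 / n
        b1 -= learning_rate * grad_b1 / n
    return b0, b1


def classify(x, b0, b1, threshold=0.5):
    """Turn the predicted probability into a class label (1 or 0)."""
    return 1 if sigmoid(b0 + b1 * x) >= threshold else 0


b0, b1 = fit_logistic(prices, sold)
for price in (1.0, 4.5):
    probability = sigmoid(b0 + b1 * price)
    print(f"price={price}: P(sold)={probability:.2f}, class={classify(price, b0, b1)}")
```

Unlike a linear regression output, the fitted value here is always a probability between 0 and 1, and the final Yes/No answer comes from comparing it with the threshold.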

