# Simple Linear Regression Assignment


**Question 1 : What is Simple Linear Regression (SLR)? Explain its purpose.**

Simple Linear Regression (SLR) models the relationship between one independent variable (plotted on the x-axis) and one dependent variable (plotted on the y-axis). It describes the strength and direction of that relationship with a straight line fitted to the data.

*Types of Relationship*

1. Direct (positive): an increase in the independent variable (x-axis) is associated with an increase in the dependent variable (y-axis), so the fitted line slopes upward.

2. Indirect (negative): an increase in the independent variable is associated with a decrease in the dependent variable, so the fitted line slopes downward.


*The Purpose of SLR*

1. Model the linear relationship between one independent variable (predictor) and one dependent variable (outcome).

2. Predict values of the dependent variable from known values of the independent variable.

3. Quantify the strength, direction, and significance of the linear association between the variables.

4. Estimate the slope and intercept of the best-fit line.

5. Fit that line by minimizing the sum of squared differences between observed and predicted values, which supports accurate forecasting.




**Question 2: What are the key assumptions of Simple Linear Regression?**

1. Linearity: The relationship between the independent variable (X) and the dependent variable (Y) must be linear, meaning the data points should approximately follow a straight line when plotted on a scatterplot.

2. Independence of errors (residuals): Observations must be independent of each other; residuals from one observation should not influence or correlate with residuals from another, avoiding issues like autocorrelation in time series data.

3. Homoscedasticity: The variance of the residuals must be constant across all levels of the independent variable; residuals should show equal spread around the regression line, not fanning out or narrowing.

4. Normality of residuals: The residuals (errors) should be approximately normally distributed, especially for inference like confidence intervals and hypothesis tests; this ensures reliable p-values and predictions.

5. Variation in the predictor: Multicollinearity is not a concern in simple linear regression (there is only one predictor), but the independent variable should vary and not be constant.

6. Mean of residuals is zero: The residuals should have an average value of zero. A least-squares fit that includes an intercept satisfies this automatically, because the fitted line passes through the point of means (x̄, ȳ).

7. Fixed independent variable: The predictor X is treated as fixed or non-random, not subject to the same error process as Y.
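
Several of the assumptions above can be checked informally from the residuals. The following is a minimal sketch, not a full diagnostic procedure: the data are hypothetical values made up for illustration, and the checks (residual mean, spread in two halves of x, lag-1 correlation) are rough heuristics rather than formal tests.

```python
import numpy as np

# Hypothetical illustrative data (not from any real study).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1])

# Fit a straight line; np.polyfit with degree 1 returns [slope, intercept].
slope, intercept = np.polyfit(x, y, 1)
residuals = y - (slope * x + intercept)

# Mean of residuals: essentially zero for a least-squares fit with an intercept.
print("mean residual:", residuals.mean())

# Rough homoscedasticity check: compare residual spread for low and high x.
half = len(x) // 2
print("residual std (low x): ", residuals[:half].std())
print("residual std (high x):", residuals[half:].std())

# Rough independence check: lag-1 correlation between consecutive residuals.
lag1 = np.corrcoef(residuals[:-1], residuals[1:])[0, 1]
print("lag-1 residual correlation:", lag1)
```

In practice these numbers are usually read alongside a residual-versus-fitted plot and a normal Q-Q plot.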

**Question 3: Write the mathematical equation for a simple linear regression model and
explain each term**

## Simple Linear Regression Model

## 1. Basic Linear Equation

The simplest form of a straight-line equation is:

y = mx + c

This equation represents a linear relationship between two variables.

## 2. Explanation of Each Term in $y = mx + c$

#### **$y$: Dependent Variable**

* $y$ is the **dependent (response) variable**.
* It represents the variable whose value depends on $x$.
* It is the outcome we are trying to **predict or explain**.

---

#### **$x$: Independent Variable**

* $x$ is the **independent (explanatory) variable**.
* It is the variable used to explain changes in $y$.
* It is assumed to be measured without error.

---

#### **$m$: Slope (Gradient)**

* $m$ is the **slope** or **gradient** of the line.
* It measures the **rate of change of $y$ with respect to $x$**.
* It shows how much $y$ changes for a **one-unit increase in $x$**.
* If $m > 0$, the relationship is positive.
* If $m < 0$, the relationship is negative.

---

#### **$c$: Intercept**

* $c$ is the **y-intercept**.
* It represents the value of $y$ when $x = 0$.
* It is the point where the line crosses the **y-axis**.

---

## 3. Converting to a Regression Model

In statistics, the linear equation is adapted to model real-world data by introducing randomness. The regression model is written as:

$$
Y = mX + c + \varepsilon
$$

Where:

* $Y$ is the **observed value** of the dependent variable
* $X$ is the independent variable
* $m$ and $c$ represent the true (but unknown) slope and intercept
* $\varepsilon$ is the **error term**

---

## 4. Error Term ($\varepsilon$)

* The error term accounts for **unexplained variation** in $Y$.
* It represents the influence of:

  * Omitted variables
  * Measurement errors
  * Random noise
* It explains why observed data points do not lie exactly on a straight line.
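
To make the role of the error term concrete, the sketch below simulates data from the model with an error term added to a straight line. The "true" slope, intercept, noise scale, and seed are all assumptions chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # fixed seed so the run is repeatable

# Assumed "true" parameters, chosen only for this illustration.
m_true, c_true = 2.0, 1.0

X = np.linspace(0, 10, 50)
epsilon = rng.normal(loc=0.0, scale=1.5, size=X.size)  # random error term
Y = m_true * X + c_true + epsilon                      # observed values

# Each observation's vertical distance from the true line is its error term.
deviation = Y - (m_true * X + c_true)
print(np.allclose(deviation, epsilon))  # prints True
```

Setting `scale` to a smaller value pulls the simulated points closer to the line; setting it to zero would place every point exactly on it.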

---

## 5. Statistical Notation of the Model

In formal statistical terms, the simple linear regression model is written as:

$$
Y = \beta_0 + \beta_1 X + \varepsilon
$$

Where:

* $\beta_0$ corresponds to **$c$** (intercept)
* $\beta_1$ corresponds to **$m$** (slope)

---

## 6. Estimated Regression Equation

Since the true parameters are unknown, they are estimated from sample data:

$$
\hat{Y} = b_0 + b_1 X
$$

Where:

* $\hat{Y}$ is the **predicted value** of $Y$
* $b_0$ is the estimated intercept
* $b_1$ is the estimated slope






**Question 4: Provide a real-world example where simple linear regression can be
applied.**

### Real-World Example of Simple Linear Regression

A common real-world example of simple linear regression is **predicting a student’s exam score based on the number of hours studied**.

* **Independent variable (x):** Number of hours studied
* **Dependent variable (y):** Exam score

Using simple linear regression, we can model the relationship as:

y = mx + c

This model helps estimate how much a student’s exam score is expected to increase for each additional hour of study. The slope (m) shows the average increase in marks per extra hour studied, while the intercept (c) represents the expected score when no hours are studied.

This type of model is widely used in **education, performance analysis, and forecasting**, where one variable is used to predict another in a simple and interpretable way.
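
As a rough illustration of the study-hours example, the sketch below fits a line to a handful of hypothetical (hours, score) pairs and predicts the score for a new value. The data are invented for this example, and `np.polyfit` is used here simply as one convenient way to fit a straight line.

```python
import numpy as np

# Hypothetical data, invented for illustration only.
hours = np.array([1, 2, 3, 4, 5, 6], dtype=float)
scores = np.array([52, 58, 61, 67, 72, 78], dtype=float)

# Degree-1 fit returns [slope, intercept].
m, c = np.polyfit(hours, scores, 1)

# Predicted score for a student who studies 7 hours (an extrapolation).
predicted = m * 7 + c

print(f"Estimated gain per extra hour (m): {m:.2f}")
print(f"Expected score with 0 hours (c):   {c:.2f}")
print(f"Predicted score for 7 hours:       {predicted:.2f}")
```

Because 7 hours lies outside the observed range of 1 to 6, the prediction should be treated with more caution than one made inside that range.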


**Question 5: What is the method of least squares in linear regression?**

### Method of Least Squares in Linear Regression

The **method of least squares** is a technique used to estimate the parameters of a linear regression model so that the fitted line best represents the observed data.

In simple linear regression, the model is:

y = mx + c



### Idea Behind Least Squares

* For each data point, there is a **residual**, which is the difference between the actual value and the predicted value:

  residual = actual y − predicted y

* The method of least squares chooses the values of **m (slope)** and **c (intercept)** that **minimize the sum of the squared residuals**.



### Objective Function

The quantity minimized is:

Sum of squared errors = Σ (yᵢ − (mxᵢ + c))²

Where:

* yᵢ are the observed values
* mxᵢ + c are the predicted values

Squaring ensures:

* All errors are positive
* Larger errors are penalized more heavily


### Purpose

* To find the **best-fitting straight line** through the data
* To make predictions that are **as close as possible** to the observed values
* To provide **unique and optimal estimates** for the slope and intercept
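
For simple linear regression, minimizing the sum of squared residuals has a well-known closed-form solution: the slope is the sum of cross-deviations divided by the sum of squared deviations of x, and the intercept follows from the means. The sketch below applies those formulas to the same five points used in the scikit-learn answer to Question 9; the comparison line (slope 0.7, intercept 2.0) is an arbitrary alternative chosen only to show a larger error.

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2, 4, 5, 4, 5], dtype=float)

x_mean, y_mean = x.mean(), y.mean()

# Closed-form least-squares estimates for one predictor.
m = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
c = y_mean - m * x_mean

def sse(slope, intercept):
    """Sum of squared residuals for a candidate line."""
    return np.sum((y - (slope * x + intercept)) ** 2)

best = sse(m, c)
other = sse(0.7, 2.0)  # arbitrary alternative line, for comparison only

print(f"m = {m:.3f}, c = {c:.3f}, SSE = {best:.3f}")  # m = 0.600, c = 2.200, SSE = 2.400
print(f"SSE of the alternative line = {other:.3f}")   # 2.550
```

Any other choice of slope and intercept gives a sum of squared errors at least as large as the least-squares line.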



**Question 6: What is Logistic Regression? How does it differ from Linear Regression?**



## What is Logistic Regression?

**Logistic regression** is a statistical and machine learning method used for **classification problems**, especially when the dependent variable is **binary** (e.g., yes/no, 0/1, pass/fail).

Instead of predicting a continuous value, logistic regression predicts the **probability** that an observation belongs to a particular class.

The model is:

p = 1 / (1 + e^-(mx + c))

Where:

* p is the probability of the positive class
* m and c are model coefficients
* e is the base of the natural logarithm



## What is Linear Regression?

**Linear regression** is used to predict a **continuous numerical value**.

The model is:

y = mx + c

It assumes a **linear relationship** between the independent and dependent variables.



## Differences Between Logistic Regression and Linear Regression

| Feature            | Linear Regression         | Logistic Regression                |
| ------------------ | ------------------------- | ---------------------------------- |
| Purpose            | Predict continuous values | Classify outcomes                  |
| Output             | Any real number           | Probability between 0 and 1        |
| Dependent variable | Continuous                | Binary (0 or 1)                    |
| Model equation     | y = mx + c                | p = 1 / (1 + e^-(mx + c))          |
| Estimation method  | Least squares             | Maximum likelihood                 |
| Linearity          | y is linear in x          | Log-odds of p are linear in x      |
| Typical use        | Predicting prices, scores | Spam detection, disease prediction |



## Key Concept: Sigmoid Function

* Logistic regression uses the **sigmoid (logistic) function**.
* This ensures predicted probabilities lie between **0 and 1**.
* A threshold (commonly 0.5) is used to assign class labels.
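
The sigmoid and threshold steps can be sketched in a few lines. The coefficients `m` and `c` below are hypothetical values picked for illustration, not fitted to any data, and `predict_class` is a helper name introduced only for this example.

```python
import math

def sigmoid(z: float) -> float:
    """Map any real number to a value strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_class(x: float, m: float, c: float, threshold: float = 0.5):
    """Return (probability, label) for a one-feature logistic model."""
    p = sigmoid(m * x + c)
    return p, int(p >= threshold)

# Hypothetical coefficients, chosen only for illustration.
m, c = 1.2, -3.0

for x in (0.0, 2.0, 5.0):
    p, label = predict_class(x, m, c)
    print(f"x = {x}: p = {p:.3f}, class = {label}")
```

Raising the threshold makes the model more conservative about predicting the positive class, and lowering it does the opposite.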




**Question 7: Name and briefly describe three common evaluation metrics for regression
models.**


## Three Common Evaluation Metrics for Regression Models

### 1. Mean Absolute Error (MAE)

* Measures the **average absolute difference** between actual and predicted values.
* It shows how far predictions are from true values **on average**.
* Easy to interpret because it is in the **same units as the dependent variable**.
* Less sensitive to large errors than MSE.



### 2. Mean Squared Error (MSE)

* Measures the **average of the squared differences** between actual and predicted values.
* Penalizes **large errors more heavily** due to squaring.
* Commonly used for **model optimization**.



### 3. R-squared (R²)

* Measures the **proportion of variance** in the dependent variable explained by the model.
* Indicates the **goodness of fit** of the regression model.
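* Values typically range from **0 to 1**, with higher values indicating better fit; R² can be negative when a model predicts worse than simply using the mean (for example, on new data).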
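
All three metrics can be computed directly from their definitions. The sketch below uses the five observed values from the Question 9 code and the predictions implied by the fitted line y = 0.6x + 2.2 reported there; the variable names are illustrative.

```python
import numpy as np

y_true = np.array([2, 4, 5, 4, 5], dtype=float)
y_pred = np.array([2.8, 3.4, 4.0, 4.6, 5.2])  # 0.6 * x + 2.2 for x = 1..5

errors = y_true - y_pred

mae = np.mean(np.abs(errors))                    # mean absolute error
mse = np.mean(errors ** 2)                       # mean squared error
ss_res = np.sum(errors ** 2)                     # unexplained variation
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation
r2 = 1 - ss_res / ss_tot                         # coefficient of determination

print(f"MAE = {mae:.2f}, MSE = {mse:.2f}, R^2 = {r2:.2f}")  # MAE = 0.64, MSE = 0.48, R^2 = 0.60
```

MAE is in the same units as y, while MSE is in squared units, which is why the square root of MSE (RMSE) is often reported as well.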






**Question 8: What is the purpose of the R-squared metric in regression analysis?**



## Purpose of the R-squared Metric in Regression Analysis

R-squared (R²), also known as the **coefficient of determination**, measures how well a regression model explains the variability in the dependent variable.

### Key Purposes of R-squared

* It shows the **proportion of variation in the dependent variable** that is explained by the independent variable(s).
* It indicates the **goodness of fit** of the regression model.
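* R² values typically range from **0 to 1** for a least-squares fit with an intercept, evaluated on the data used to fit it. On new data, R² can be negative if the model predicts worse than the mean.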

  * R² = 0 means the model explains none of the variation.
  * R² = 1 means the model explains all the variation.
* A higher R² value suggests the model fits the data **better**.

### Interpretation

* R² = 0.75 means **75% of the variation** in the dependent variable is explained by the model.
* The remaining variation is due to **unexplained factors or random error**.
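
As a worked illustration of this interpretation, R² can be written in terms of the residual and total sums of squares. The numbers below use the five points from the Question 9 code (x = 1, ..., 5 and y = 2, 4, 5, 4, 5) together with the fitted line y = 0.6x + 2.2 reported there; the symbols $SS_{res}$ and $SS_{tot}$ are notation introduced here.

$$
R^2 = 1 - \frac{SS_{res}}{SS_{tot}} = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}
$$

$$
SS_{res} = 0.64 + 0.36 + 1.00 + 0.36 + 0.04 = 2.4, \qquad SS_{tot} = 4 + 0 + 1 + 0 + 1 = 6
$$

$$
R^2 = 1 - \frac{2.4}{6} = 0.6
$$

So for that small data set, about 60% of the variation in y is explained by the fitted line and about 40% is left unexplained.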

### Important Notes

* R-squared does **not** indicate causation.
* A high R² does not guarantee the model is correct or appropriate.
* R² should be interpreted along with other metrics and diagnostic checks.


**Question 9: Write Python code to fit a simple linear regression model using scikit-learn
and print the slope and intercept.
(Include your Python code and output in the code box below.)**

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Sample data
X = np.array([1, 2, 3, 4, 5]).reshape(-1, 1)  # independent variable
y = np.array([2, 4, 5, 4, 5])                # dependent variable

# Create and fit the model
model = LinearRegression()
model.fit(X, y)

# Print slope and intercept
print("Slope (m):", model.coef_[0])
print("Intercept (c):", model.intercept_)
```

Output:

```
Slope (m): 0.6
Intercept (c): 2.2
```


**Question 10: How do you interpret the coefficients in a simple linear regression model?**





## Interpretation of Coefficients in a Simple Linear Regression Model

### Model

y = mx + c



### Intercept (c)

* c is the **expected value of y when x = 0**
* It represents the **baseline level** of the dependent variable
* It is the point where the regression line crosses the **y-axis**
* If x = 0 is outside the data range, c may have **limited practical meaning**
* Still important for **constructing the regression line**



### Slope (m)

* m represents the **average change in y for a one-unit increase in x**
* It measures the **rate of change** between x and y
* m > 0: y increases as x increases (positive relationship)
* m < 0: y decreases as x increases (negative relationship)
* m = 0: no linear relationship between x and y



### Average Interpretation

* The coefficients describe the **average effect**, not exact values
* Individual observations differ due to **random error**
* The relationship is assumed to be **linear**



### Example

y = 10 + 2x

* c = 10 means y = 10 when x = 0
* m = 2 means y increases by 2 units for each 1-unit increase in x
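
The example can be checked with a few lines of code. This is only a small sketch; `predict` is a helper name introduced for illustration.

```python
def predict(x: float, m: float = 2.0, c: float = 10.0) -> float:
    """Evaluate the example line y = 10 + 2x."""
    return m * x + c

for x in (0, 1, 2, 3):
    print(f"x = {x}: y = {predict(x)}")

# The difference between consecutive predictions equals the slope.
print("change in y per unit of x:", predict(3) - predict(2))  # 2.0
```

At x = 0 the prediction equals the intercept, and each one-unit step in x adds exactly the slope to the predicted y.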

