# Assignment Supervised Learning: Regression Models and Performance Metrics

## Question 1: What is Simple Linear Regression (SLR)? Explain its purpose.
Answer:

Simple Linear Regression (SLR) is a type of regression analysis used to establish a relationship between one independent variable (X) and one dependent variable (Y). It assumes that the relationship is linear, meaning the change in Y is proportional to the change in X. The model can be represented graphically as a straight line through data points, called the regression line.

### Purpose:

- Prediction: SLR allows us to predict future values of Y for given values of X.

- Relationship analysis: It quantifies the direction of the relationship and how much Y changes per unit of X. A positive slope indicates that Y increases as X increases; a negative slope indicates the opposite.

- Decision-making: Organizations use it for forecasting and planning, e.g., predicting sales based on advertising expenditure.

### Example:

A company wants to predict revenue (Y) based on advertising budget (X). By fitting an SLR model, the company can estimate how revenue changes for each additional dollar spent on advertising.

## Question 2: What are the key assumptions of Simple Linear Regression?
Answer:

SLR relies on several key assumptions to ensure accurate and reliable results:

1. Linearity: There should be a linear relationship between X and Y. Non-linear relationships violate the model’s validity.

2. Independence of errors: The residuals (differences between observed and predicted values) should be independent of each other.

3. Homoscedasticity: Residuals should have constant variance across all levels of X. Unequal variance (heteroscedasticity) makes standard errors, and therefore confidence intervals and hypothesis tests, unreliable.

4. Normality of residuals: The residuals should follow a normal distribution. This is important for confidence intervals and hypothesis testing.

5. No significant outliers: Extreme values can distort the slope and intercept of the regression line, leading to inaccurate predictions.

### Example:

If you are predicting exam scores based on hours studied, the residuals (errors) should not increase or decrease systematically as hours increase.
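As a rough illustration of this check, the sketch below fits a line to made-up hours and scores (the numbers are hypothetical, not real data) and compares the residual spread in the lower and upper half of X. A large gap between the two spreads would hint at heteroscedasticity; this is an informal check, not a formal test.

```python
import numpy as np

# Hypothetical data: hours studied (X) and exam scores (y)
hours = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
scores = np.array([52, 55, 61, 64, 70, 73, 79, 82], dtype=float)

# Fit a straight line by least squares (np.polyfit returns the slope first)
slope, intercept = np.polyfit(hours, scores, deg=1)
residuals = scores - (intercept + slope * hours)

# With an intercept in the model, least-squares residuals average to zero
mean_residual = residuals.mean()

# Informal homoscedasticity check: residual spread for low vs. high X
low_spread = residuals[:4].std()
high_spread = residuals[4:].std()

print("Mean residual:", round(mean_residual, 6))
print("Residual spread (low X):", round(low_spread, 3))
print("Residual spread (high X):", round(high_spread, 3))
```

In practice a residual-versus-X plot is the more common way to inspect this.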

## Question 3: Write the mathematical equation for a simple linear regression model and explain each term.
Answer:

The mathematical equation of a simple linear regression model is:

$Y = \beta_0 + \beta_1 X + \epsilon$

Where:

- Y: Dependent (target) variable we want to predict.

- X: Independent (predictor) variable.

- β₀ (Intercept): Value of Y when X = 0; it is the point where the regression line crosses the Y-axis.

- β₁ (Slope): Change in Y for every one-unit increase in X. Its sign shows the direction of the relationship and its size shows the rate of change.

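- ε (Error term): The part of Y that the linear term β₀ + β₁X does not explain, accounting for randomness or unobserved factors. For a fitted model it is estimated by the residual, the difference between the observed value and the predicted value.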

### Interpretation Example:

If β₀ = 2 and β₁ = 3, the equation is Y = 2 + 3X.

For every 1-unit increase in X, Y increases by 3 units, and when X = 0, Y = 2.
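The same interpretation can be checked numerically. This is a minimal sketch; the function name `predict` is an illustrative assumption, and the error term is left out because only the deterministic part of the model is being evaluated.

```python
def predict(x, beta0=2.0, beta1=3.0):
    """Return the deterministic part of the model, beta0 + beta1 * x."""
    return beta0 + beta1 * x

print(predict(0))               # 2.0, the intercept
print(predict(1) - predict(0))  # 3.0, the slope
```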

## Question 4: Provide a real-world example where simple linear regression can be applied.
Answer:
### Example 1: House Price Prediction

- Independent variable (X): Size of the house in square feet.

- Dependent variable (Y): Selling price of the house.

If the slope β₁ = 5000, it indicates that for every additional square foot, the house price increases by ₹5000. SLR helps real estate analysts estimate pricing trends.

### Example 2: Salary Prediction

- X: Years of experience

- Y: Annual salary

SLR can predict expected salary increases based on experience. These examples show that SLR is widely applicable in business, finance, and social sciences.


### Other Real-life Examples:

  - Predicting crop yield based on rainfall

  - Predicting sales based on advertising budget


## Question 5: What is the method of least squares in linear regression?
Answer:

The Method of Least Squares is a mathematical approach used to find the best-fitting regression line. It minimizes the sum of the squared differences between actual values (Y) and predicted values ($\hat{Y}$). This sum of squared differences is often referred to as the Residual Sum of Squares (RSS).

- The goal is to find the values of the intercept ($\beta_0$) and the slope ($\beta_1$) that minimize this RSS. The formulas for these values can be derived using calculus.

Steps:

1. **Calculate the predicted value:** $\hat{Y} = \beta_0 + \beta_1 X$
2. **Compute residuals:** $e_i = Y_i - \hat{Y}_i = Y_i - (\beta_0 + \beta_1 X_i)$
3. **Square the residuals and sum them:** $\sum e_i^2$
4. **Adjust $\beta_0$ and $\beta_1$** to minimize this sum.

### Formula:

$\text{Minimize } S(\beta_0, \beta_1) = \sum_{i=1}^n \left(Y_i - (\beta_0 + \beta_1 X_i)\right)^2$


- This ensures the regression line is as close as possible to all data points. Least squares is the foundation of most regression algorithms.

- The resulting values for $\beta_0$ and $\beta_1$ give the equation of the best-fitting line that minimizes the errors between the predicted and actual values.
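For simple linear regression, setting the derivatives of this sum to zero gives the standard closed-form estimates $\beta_1 = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sum (X_i - \bar{X})^2}$ and $\beta_0 = \bar{Y} - \beta_1 \bar{X}$. The sketch below applies them to the sample data used in Question 9; the helper name `rss` is an illustrative assumption.

```python
import numpy as np

# Sample data reused from the scikit-learn answer in Question 9
x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2, 4, 5, 4, 5], dtype=float)

def rss(beta0, beta1):
    """Residual Sum of Squares for a candidate line."""
    residuals = y - (beta0 + beta1 * x)
    return float(np.sum(residuals ** 2))

# Closed-form least-squares estimates
x_mean, y_mean = x.mean(), y.mean()
beta1_hat = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
beta0_hat = y_mean - beta1_hat * x_mean

print("Slope:", round(beta1_hat, 4))
print("Intercept:", round(beta0_hat, 4))
print("RSS at the estimates:", round(rss(beta0_hat, beta1_hat), 4))
print("RSS after nudging the slope:", round(rss(beta0_hat, beta1_hat + 0.1), 4))
```

Moving either coefficient away from its estimate increases the RSS, which is what "best-fitting" means here.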

## Question 6: What is Logistic Regression? How does it differ from Linear Regression?
Answer:

**Logistic Regression** is a statistical model used for binary classification problems. Unlike linear regression, which predicts a continuous outcome, logistic regression predicts the probability of a binary outcome (e.g., yes/no, 0/1, true/false). It uses a logistic function (also known as the sigmoid function) to map any real-valued number to a value between 0 and 1, which can then be interpreted as a probability.

$P(Y=1 \mid X) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X)}}$

**Differences from Linear Regression:**

  |**Feature**      |**Linear Regression**    | **Logistic Regression**      |
  |-----------------|-------------------------|------------------------------|
  | Output          | Continuous values       | Probability (0 to 1)         |
  | Purpose         | Regression              | Classification               |
  | Equation        | Straight line           | Sigmoid curve                |
  | Loss function   | MSE                     | Log Loss / Cross-Entropy     |
  | Example         | Predicting house price  | Predicting if a student passes or fails |


**Example:**

Predicting whether a patient has diabetes based on age and BMI. Linear regression is unsuitable here because its predictions are unbounded and can fall below 0 or above 1, whereas a probability must lie between 0 and 1.


**In summary**, while both are regression techniques, Linear Regression is for predicting continuous values, and Logistic Regression is for predicting probabilities for binary outcomes.
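To make the sigmoid mapping concrete, here is a minimal sketch with a single predictor. The coefficient values and the 0.5 decision threshold are illustrative assumptions, not fitted results.

```python
import math

def sigmoid(z):
    """Map any real number to a value strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(x, beta0, beta1):
    """Estimated P(Y=1 | X=x) for a one-predictor logistic model."""
    return sigmoid(beta0 + beta1 * x)

# Hypothetical coefficients chosen only for illustration
beta0, beta1 = -4.0, 1.0

for x in (0, 4, 8):
    p = predict_proba(x, beta0, beta1)
    label = int(p >= 0.5)  # assumed threshold of 0.5 for the class label
    print(f"x={x}: probability={p:.3f}, predicted class={label}")
```

The linear part `beta0 + beta1 * x` can take any real value, but the output always stays between 0 and 1.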


## Question 7: Name and briefly describe three common evaluation metrics for regression models.
Answer:

1. Mean Absolute Error (MAE):
 - The average of the absolute differences between the actual values ($Y_i$) and the predicted values ($\hat{Y}_i$). It measures the average magnitude of errors without considering their direction.
 - Interpretation: Less sensitive to outliers compared to MSE or RMSE. A lower MAE indicates a better model fit.

 Formula: $MAE = \frac{1}{n} \sum_{i=1}^n |Y_i - \hat{Y}_i|$

2. Mean Squared Error (MSE):
- The average of the squared differences between the actual values ($Y_i$) and the predicted values ($\hat{Y}_i$). Squaring the errors gives more weight to larger errors.
-  Interpretation: Penalizes larger errors more than smaller ones. A lower MSE indicates a better model fit.

Formula: $MSE = \frac{1}{n} \sum_{i=1}^n (Y_i - \hat{Y}_i)^2$



3. Root Mean Squared Error (RMSE):
 - The square root of the Mean Squared Error. It measures the error in the same units as the dependent variable, making it more interpretable than MSE.
 - Interpretation: Provides a measure of the typical error size in the original units of the dependent variable. A lower RMSE indicates a better model fit.

Formula: $RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^n (Y_i - \hat{Y}_i)^2}$

In general, for all three metrics, lower values indicate a better-fitting model. The choice of which metric to use can depend on the specific problem and the importance of penalizing larger errors.
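The three formulas can be computed directly with NumPy. The actual and predicted values below are made up for illustration.

```python
import numpy as np

# Hypothetical actual and predicted values
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.0, 5.0, 8.0, 12.0])

errors = y_true - y_pred

mae = np.mean(np.abs(errors))  # average absolute error
mse = np.mean(errors ** 2)     # average squared error
rmse = np.sqrt(mse)            # back in the units of y

print("MAE:", mae)    # 1.25
print("MSE:", mse)    # 2.75
print("RMSE:", round(rmse, 3))
```

The single large error (9 vs. 12) contributes 3 of the 5 units of total absolute error but 9 of the 11 units of total squared error, which is why MSE and RMSE react more strongly to it than MAE does.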

## Question 8: What is the purpose of the R-squared metric in regression analysis?
Answer:

R-squared (R²), also known as the coefficient of determination, is a statistical measure that represents the proportion of the variance in the dependent variable that is predictable from the independent variable(s) in a regression model.

In simpler terms, R-squared tells you how well the regression model fits the observed data. It indicates how much of the variation in the dependent variable can be explained by the independent variable(s).

The formula for R-squared is:

$R^2 = 1 - \frac{SSE}{SST}$

Where:

*   **SSE (Sum of Squared Errors or Residual Sum of Squares):** The sum of the squared differences between the actual values and the predicted values from the regression model. It represents the unexplained variance.
*   **SST (Total Sum of Squares):** The sum of the squared differences between the actual values and the mean of the dependent variable. It represents the total variance in the dependent variable.

Interpretation:

*   An R² value of 1 means that the model perfectly explains all the variability in the dependent variable.
*   An R² value of 0 means that the model explains none of the variability in the dependent variable.
*   Higher R² values generally indicate a better-fitting model, as they suggest that a larger proportion of the variance in the dependent variable is explained by the independent variable(s).

However, it's important to note that a high R² does not necessarily mean the model is good or that the independent variable(s) are the cause of the changes in the dependent variable. It's just a measure of how well the model fits the data. It's also important to consider other evaluation metrics and the context of the problem.
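The formula can be applied by hand as a sanity check. This sketch uses made-up values, and the function name `r_squared` is an illustrative assumption.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Compute R^2 = 1 - SSE / SST."""
    sse = np.sum((y_true - y_pred) ** 2)         # unexplained variation
    sst = np.sum((y_true - y_true.mean()) ** 2)  # total variation
    return 1.0 - sse / sst

# Hypothetical actual and predicted values
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.0, 5.0, 8.0, 12.0])

print("R^2 of the predictions:", r_squared(y_true, y_pred))

# Always predicting the mean explains none of the variation
mean_only = np.full_like(y_true, y_true.mean())
print("R^2 of the mean-only baseline:", r_squared(y_true, mean_only))
```

Here SSE is 11 and SST is 20, so R² is 0.45, while the mean-only baseline gives exactly 0.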

## Question 9: Write Python code to fit a simple linear regression model using scikit-learn and print the slope and intercept.

Answer:

```python
from sklearn.linear_model import LinearRegression
import numpy as np

# Sample data
X = np.array([[1], [2], [3], [4], [5]])      # Independent variable (2D: one column per feature)
y = np.array([2, 4, 5, 4, 5])                # Dependent variable

# Create model
model = LinearRegression()

# Train model
model.fit(X, y)

# Print slope and intercept
print("Slope (β1):", model.coef_[0])
print("Intercept (β0):", model.intercept_)
```

Output:

```
Slope (β1): 0.6
Intercept (β0): 2.2
```


## Question 10: How do you interpret the coefficients in a simple linear regression model?
Answer:

In a simple linear regression model, the equation is typically represented as $Y = \beta_0 + \beta_1 X + \epsilon$, where:

*   **$\beta_0$ (Intercept):** This is the predicted value of the dependent variable (Y) when the independent variable (X) is equal to 0. It represents the point where the regression line crosses the Y-axis. However, interpreting the intercept only makes sense if X=0 is a meaningful value in the context of your data.

*   **$\beta_1$ (Slope):** This coefficient represents the change in the dependent variable (Y) for every one-unit increase in the independent variable (X).
    *   A **positive slope** ($\beta_1 > 0$) indicates that as X increases, Y also tends to increase.
    *   A **negative slope** ($\beta_1 < 0$) indicates that as X increases, Y tends to decrease.
    *   The magnitude of the slope indicates how much Y changes per unit of X. Because it depends on the units of X and Y, it does not by itself measure the strength of the relationship; R² or the correlation coefficient does that.

**Example:**

Consider a simple linear regression model predicting house price (Y) based on the size of the house in square feet (X), with the equation: `House Price = β₀ + β₁ * Size`.

*   If $\beta_0 = 50000$ and $\beta_1 = 100$, the interpretation would be:
    *   The intercept ($\beta_0 = 50000$) would suggest that a house with 0 square feet is predicted to cost \$50,000. This interpretation is likely not meaningful in this context, as a house cannot have 0 square feet.
    *   The slope ($\beta_1 = 100$) indicates that for every additional square foot of size, the predicted house price increases by \$100.

Understanding the interpretation of these coefficients is crucial for understanding the relationship between the variables in your model and for making predictions.
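The house price example can be turned into a small calculation. The function name `predicted_price` and the 1,500 square foot house are illustrative assumptions.

```python
beta0 = 50_000  # intercept, in dollars
beta1 = 100     # slope, in dollars per square foot

def predicted_price(size_sqft):
    """Predicted house price for a given size, using the example coefficients."""
    return beta0 + beta1 * size_sqft

print(predicted_price(1500))                          # 200000
print(predicted_price(1501) - predicted_price(1500))  # 100, the slope
```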