# **Supervised Learning: Regression Models and Performance Metrics (assignment)**

# **Question 1: What is Simple Linear Regression (SLR)? Explain its purpose.**

### Simple Linear Regression (SLR)

Simple Linear Regression (SLR) is a statistical method used to model the relationship between two continuous variables: a dependent variable (also known as the response or outcome variable) and an independent variable (also known as the predictor or explanatory variable). The core idea is to find a linear equation that best describes how the dependent variable changes as the independent variable changes.

Mathematically, the relationship is expressed as:

\[ y = \beta_0 + \beta_1x + \epsilon \]

Where:
- \( y \) is the dependent variable.
- \( x \) is the independent variable.
- \( \beta_0 \) (beta-zero) is the y-intercept, representing the expected value of \( y \) when \( x \) is 0.
- \( \beta_1 \) (beta-one) is the slope of the regression line, indicating the change in \( y \) for a one-unit change in \( x \).
- \( \epsilon \) (epsilon) is the error term, representing the random variability in \( y \) that cannot be explained by \( x \).
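
To make the roles of the terms concrete, here is a minimal sketch that generates data from the equation above. The parameter values, the noise level, and the variable names are assumptions chosen only for illustration; they do not come from any real dataset.

```python
import numpy as np

# Hypothetical "true" parameters, chosen only for illustration
beta_0, beta_1 = 2.0, 0.5

rng = np.random.default_rng(seed=0)
x = np.linspace(0, 10, 50)               # independent variable
epsilon = rng.normal(0.0, 1.0, x.size)   # error term drawn with mean 0
y = beta_0 + beta_1 * x + epsilon        # dependent variable

# The systematic part of the model is the line without the error term
expected_y = beta_0 + beta_1 * x
print(f"Expected y at x = 0:  {expected_y[0]:.1f}")
print(f"Expected y at x = 10: {expected_y[-1]:.1f}")
```

Each observed `y` scatters around `expected_y`, and the gap between the two is exactly the simulated error term.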

### Purpose of SLR

The primary purposes of Simple Linear Regression are:

1.  **Prediction:** To predict the value of the dependent variable based on the value of the independent variable. For example, predicting house prices based on square footage.

2.  **Understanding Relationships:** To understand the strength and direction of the linear relationship between two variables. It helps determine if there's a positive, negative, or no linear association.

3.  **Explanation:** To explain how much of the variation in the dependent variable can be attributed to the independent variable. The \( R^2 \) (R-squared) value is often used for this purpose.

4.  **Inference:** To make inferences about the population parameters (\( \beta_0 \) and \( \beta_1 \)) based on sample data, allowing us to generalize findings beyond the observed data.

# **Question 2: What are the key assumptions of Simple Linear Regression?**

The key assumptions of Simple Linear Regression are listed below. The first four are often remembered by the acronym **LINE** (Linearity, Independence, Normality, Equal variance):

1.  **Linearity:** The relationship between the independent variable (x) and the dependent variable (y) must be linear. If the relationship is not linear, the model will not accurately represent the data.

2.  **Independence of Errors:** The errors (residuals) should be independent of each other. This means that the error for one observation should not be related to the error for any other observation. This assumption is often violated in time-series data.

3.  **Normality of Errors:** The errors (residuals) should be normally distributed. This assumption matters mainly for hypothesis tests and confidence intervals, not for the coefficient estimates themselves. With a large enough sample, the Central Limit Theorem makes the sampling distribution of the estimated coefficients approximately normal, so moderate departures from normality have little effect on inference.

4.  **Equal Variance of Errors (Homoscedasticity):** The variance of the errors should be constant across all levels of the independent variable. In simpler terms, the spread of the residuals should be roughly the same throughout the range of the predictor variable. If the variance of the errors is not constant, it's called heteroscedasticity.

5.  **No or Little Multicollinearity (Multiple Linear Regression only):** This assumption states that independent variables should not be highly correlated with each other. It does not apply to SLR, which has a single predictor. The related concern in SLR is omitted-variable bias: if a factor left out of the model influences \( y \) and is correlated with \( x \), the estimated slope absorbs part of that factor's effect.

Violations of these assumptions can lead to unreliable or misleading results from the regression model.
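
The assumptions about the errors can be checked informally from the residuals of a fitted line. The sketch below uses simulated data (an assumption, so the checks are expected to pass) and a few rough numeric diagnostics. In practice, residual plots and formal tests are the usual tools, so treat this as an illustration rather than a complete procedure.

```python
import numpy as np

rng = np.random.default_rng(seed=1)
x = np.linspace(0, 10, 200)
y = 3.0 + 1.5 * x + rng.normal(0.0, 1.0, x.size)  # simulated data, illustration only

# np.polyfit with deg=1 returns the slope first, then the intercept
slope, intercept = np.polyfit(x, y, deg=1)
residuals = y - (intercept + slope * x)

# A least squares fit with an intercept has residuals that average to zero
print(f"Mean residual: {residuals.mean():.2e}")

# Independence: the Durbin-Watson statistic is close to 2 when
# consecutive residuals are uncorrelated
durbin_watson = np.sum(np.diff(residuals) ** 2) / np.sum(residuals ** 2)
print(f"Durbin-Watson statistic: {durbin_watson:.2f}")

# Equal variance: compare residual spread in the lower and upper half of x
half = x.size // 2
spread_ratio = residuals[half:].std() / residuals[:half].std()
print(f"Spread ratio (upper half / lower half): {spread_ratio:.2f}")

# Normality: roughly 95% of standardised residuals should lie within +/-1.96
standardised = residuals / residuals.std()
share_within = np.mean(np.abs(standardised) < 1.96)
print(f"Share within +/-1.96 standard deviations: {share_within:.2%}")
```

A spread ratio far from 1 would hint at heteroscedasticity, and a Durbin-Watson value far from 2 would hint at correlated errors.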

# **Question 3: Write the mathematical equation for a simple linear regression model and explain each term.**

The mathematical equation for a simple linear regression model is typically represented as:

\[ y = \beta_0 + \beta_1x + \epsilon \]

Let's break down each term:

*   **\( y \) (Dependent Variable / Response Variable):** This is the variable we are trying to predict or explain. Its value depends on the independent variable.

*   **\( x \) (Independent Variable / Predictor Variable):** This is the variable used to predict \( y \). It is assumed to be an independent factor influencing \( y \).

*   **\( \beta_0 \) (Beta-zero / Y-intercept):** This is the expected value of \( y \) when \( x \) is 0. It represents the point where the regression line crosses the y-axis.

*   **\( \beta_1 \) (Beta-one / Slope):** This coefficient represents the change in \( y \) for every one-unit change in \( x \). It indicates the steepness and direction (positive or negative) of the linear relationship between \( x \) and \( y \).

*   **\( \epsilon \) (Epsilon / Error Term):** This term represents the random error in the model. It accounts for all the variability in \( y \) that cannot be explained by \( x \). It is assumed to have a mean of zero and constant variance. The error term itself is unobservable; the residual \( e = y - \hat{y} \) computed from a fitted line is its estimate.

# **Question 4: Provide a real-world example where simple linear regression can be applied.**

A common real-world example where Simple Linear Regression can be applied is in **predicting house prices based on square footage**.

**Example Scenario:**
Imagine a real estate agent who wants to estimate the selling price of a house based on its size.

*   **Dependent Variable (y):** House Price (e.g., in dollars).
*   **Independent Variable (x):** Square Footage of the house.

**How SLR is Applied:**

1.  **Data Collection:** The agent collects data on recently sold houses, including their square footage and their final selling prices.

2.  **Model Formulation:** A simple linear regression model is built using this data. The model would look like:
    \[ \text{House Price} = \beta_0 + \beta_1 \times \text{Square Footage} + \epsilon \]

3.  **Interpretation:**
    *   \( \beta_0 \) would represent the base price of a house (the price when square footage is 0, though this often isn't directly interpretable as houses can't have 0 square footage; it's more of a mathematical intercept).
    *   \( \beta_1 \) would represent the average increase in house price for every one-unit (e.g., one square foot) increase in the house's size.

4.  **Prediction:** Once the model is trained, the agent can input the square footage of a new house that is on the market and the model will provide a predicted selling price. This helps in pricing new properties competitively or for buyers to estimate what they might pay.

5.  **Understanding Relationship:** SLR also helps confirm the relationship between size and price, showing if larger houses tend to sell for more, and how strong that relationship is.
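
The steps above can be sketched in a few lines. The sales records below are made up for illustration (they are not real market data), and `np.polyfit` is used here as one convenient way to obtain the least squares line.

```python
import numpy as np

# Hypothetical sales records (made up for illustration)
square_feet = np.array([850, 1200, 1500, 1800, 2100, 2400], dtype=float)
price = np.array([155_000, 200_000, 248_000, 290_000, 335_000, 381_000], dtype=float)

# Fit price = intercept + slope * square_feet by least squares
slope, intercept = np.polyfit(square_feet, price, deg=1)
print(f"Estimated price per additional square foot: {slope:.2f}")
print(f"Estimated intercept: {intercept:.2f}")

# Predict the selling price of a new listing
new_size = 2000.0
predicted_price = intercept + slope * new_size
print(f"Predicted price for {new_size:.0f} sq ft: {predicted_price:,.0f}")
```

The prediction is only trustworthy for sizes within the range of the collected data; extrapolating far beyond it relies on the line continuing to hold.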

# **Question 5: What is the method of least squares in linear regression?**

The **Method of Least Squares** is a standard approach in linear regression to estimate the unknown parameters (the coefficients \( \beta_0 \) and \( \beta_1 \)) in a linear regression model. Its primary goal is to find the line that best fits the observed data by minimizing the sum of the squares of the differences between the observed and predicted values.

Here's a breakdown:

1.  **Observed vs. Predicted Values:**
    *   For each data point \( (x_i, y_i) \), \( y_i \) is the actual observed value of the dependent variable.
    *   The regression line, given by \( \hat{y}_i = \beta_0 + \beta_1x_i \), provides a *predicted* value (denoted \( \hat{y}_i \)) for each \( x_i \).

2.  **Residuals (Errors):**
    *   The difference between the observed value and the predicted value for each data point is called the residual, denoted \( e_i = y_i - \hat{y}_i \). It is the observable estimate of the error term \( \epsilon_i \).
    *   A smaller residual indicates a better fit of the model for that particular data point.

3.  **Minimization Objective:**
    *   The goal of the least squares method is to find the values of \( \beta_0 \) and \( \beta_1 \) that minimize the **sum of the squared residuals** (SSR).
    *   Mathematically, this objective function is:
        \[ \text{SSR} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} (y_i - (\beta_0 + \beta_1x_i))^2 \]

4.  **Why Square the Residuals?**
    *   Squaring the residuals serves two main purposes:
        *   It ensures that positive and negative errors do not cancel each other out, preventing a misleadingly small sum of errors.
        *   It penalizes larger errors more heavily than smaller ones, so the fitted line avoids large misses. A side effect is that outliers have a disproportionate influence on the line, which makes least squares sensitive to them.

5.  **Finding the Coefficients:**
    *   To find the values of \( \beta_0 \) and \( \beta_1 \) that minimize the SSR, calculus is typically used. This involves taking partial derivatives of the SSR with respect to \( \beta_0 \) and \( \beta_1 \), setting them to zero, and solving the resulting system of equations.
    *   The solutions for \( \beta_0 \) and \( \beta_1 \) are known as the **least squares estimates** or **ordinary least squares (OLS) estimates**.

In essence, the method of least squares provides a systematic way to draw the "best-fit line" through a set of data points, where "best" is defined by minimizing the sum of the squared vertical distances from each data point to the line.
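
Solving the two equations obtained from the partial derivatives gives the closed-form estimates \( \hat{\beta}_1 = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2} \) and \( \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} \). The sketch below applies them to the small sample dataset from Question 9 and checks that nudging the slope away from the estimate increases the SSR. The helper name `sum_squared_residuals` is introduced here only for illustration.

```python
import numpy as np

# Sample data (same values as the Question 9 example)
x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], dtype=float)
y = np.array([2, 4, 5, 4, 5, 7, 8, 9, 10, 12], dtype=float)

def sum_squared_residuals(b0, b1):
    """SSR for the line y_hat = b0 + b1 * x on the data above."""
    return np.sum((y - (b0 + b1 * x)) ** 2)

# Closed-form ordinary least squares estimates
x_mean, y_mean = x.mean(), y.mean()
beta_1 = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
beta_0 = y_mean - beta_1 * x_mean

print(f"beta_1 = {beta_1:.4f}, beta_0 = {beta_0:.4f}")
print(f"SSR at the OLS estimates: {sum_squared_residuals(beta_0, beta_1):.4f}")
print(f"SSR with a nudged slope:  {sum_squared_residuals(beta_0, beta_1 + 0.1):.4f}")
```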

# **Question 6: What is Logistic Regression? How does it differ from Linear Regression?**



### **Logistic Regression**

Logistic Regression is a statistical model primarily used for **binary classification tasks**. Unlike Linear Regression, which predicts a continuous outcome, Logistic Regression predicts the probability that an observation belongs to a particular category or class. It's called "regression" because it estimates the probability, which is a continuous value between 0 and 1, but this probability is then mapped to a discrete class label (e.g., 0 or 1, Yes or No, True or False).

The core of Logistic Regression is the **logistic function** (also known as the sigmoid function), which transforms the linear combination of independent variables into a probability.

Mathematically, the probability of the dependent variable \( y \) being 1 (the positive class) given \( x \) is:

\[ P(y=1|x) = \frac{1}{1 + e^{-(\beta_0 + \beta_1x)}} \]

Where:
- \( P(y=1|x) \) is the probability that the dependent variable \( y \) is 1.
- \( e \) is the base of the natural logarithm.
- \( \beta_0 \) is the intercept.
- \( \beta_1 \) is the coefficient for the independent variable \( x \).

If the probability \( P(y=1|x) \) is greater than a certain threshold (commonly 0.5), the outcome is classified as 1; otherwise, it's classified as 0.
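
As a minimal sketch of this two-step process (probability first, then class label), the code below evaluates the logistic function for a few inputs. The coefficient values are hypothetical and were not estimated from data.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical coefficients, chosen only for illustration
beta_0, beta_1 = -4.0, 1.0

x = np.array([0.0, 2.0, 4.0, 6.0, 8.0])
probabilities = sigmoid(beta_0 + beta_1 * x)

# Classify as 1 only when the probability is strictly greater than 0.5
labels = (probabilities > 0.5).astype(int)

for xi, p, label in zip(x, probabilities, labels):
    print(f"x = {xi:.0f}: P(y=1|x) = {p:.3f} -> class {label}")
```

With these coefficients the probability is exactly 0.5 at \( x = 4 \), so that point falls on the decision boundary and is assigned to class 0 under the "greater than" rule.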

### **How it Differs from Linear Regression**

The fundamental differences between Logistic Regression and Linear Regression lie in their **purpose**, **output**, **underlying function**, and **assumptions**:

1.  **Type of Problem (Purpose):**
    *   **Linear Regression:** Used for **regression problems** where the goal is to predict a **continuous numerical output** (e.g., house prices, temperature, sales figures).
    *   **Logistic Regression:** Used for **classification problems** where the goal is to predict a **categorical outcome** (most commonly binary classification, but can be extended to multiclass).

2.  **Type of Output:**
    *   **Linear Regression:** Outputs a **continuous value** that can range from negative infinity to positive infinity.
    *   **Logistic Regression:** Outputs a **probability** score that is constrained between 0 and 1. This probability is then converted into a discrete class label.

3.  **Underlying Function/Equation:**
    *   **Linear Regression:** Uses a **linear function** (straight line) to model the relationship between independent and dependent variables: \( y = \beta_0 + \beta_1x + \epsilon \).
    *   **Logistic Regression:** Uses the **sigmoid (logistic) function** to transform the linear combination of inputs into a probability. It models the *log-odds* (logit) of the outcome as a linear combination of predictors.
        \[ \ln\left(\frac{P(y=1|x)}{1 - P(y=1|x)}\right) = \beta_0 + \beta_1x \]

4.  **Assumptions:**
    *   **Linear Regression:** Assumes a linear relationship between independent and dependent variables, normally distributed errors, homoscedasticity, and independence of errors.
    *   **Logistic Regression:** Does *not* assume a linear relationship between the independent and dependent variables themselves, nor does it assume normally distributed errors or homoscedasticity. Instead, it assumes:
        *   The dependent variable is binary.
        *   Observations are independent.
        *   There is a linear relationship between the independent variables and the log-odds of the dependent variable.
        *   Little or no multicollinearity among independent variables.

In summary, while both are "regression" techniques, they serve very different purposes in predictive modeling, with Linear Regression tackling continuous predictions and Logistic Regression handling categorical (especially binary) classifications by estimating probabilities.
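
The difference in output range and the log-odds relationship can be verified numerically. This is a small sketch with hypothetical coefficients: the linear output is unbounded, the logistic output stays between 0 and 1, and taking the log-odds of the probability recovers the linear combination.

```python
import numpy as np

# Hypothetical coefficients, chosen only for illustration
beta_0, beta_1 = -3.0, 0.8
x = np.linspace(-10, 20, 7)

# Linear regression style output: any real number
linear_output = beta_0 + beta_1 * x

# Logistic regression style output: a probability in (0, 1)
probability = 1.0 / (1.0 + np.exp(-linear_output))

# The log-odds (logit) of the probability is linear in x
log_odds = np.log(probability / (1.0 - probability))

print("Linear output range:", linear_output.min(), "to", linear_output.max())
print("Probability range:  ", probability.min(), "to", probability.max())
print("Log-odds match the linear combination:", np.allclose(log_odds, linear_output))
```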

# **Question 7: Name and briefly describe three common evaluation metrics for regression models.**


Three common evaluation metrics for regression models are:

1.  **Mean Absolute Error (MAE):**
    *   **Description:** MAE is the average of the absolute differences between the predicted and actual values. It measures the average magnitude of the errors in a set of predictions, without considering their direction.
    *   **Formula:** \[ MAE = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i| \]
    *   **Interpretation:** A lower MAE indicates a better fit. It is more robust to outliers than MSE because it does not square the errors.

2.  **Mean Squared Error (MSE):**
    *   **Description:** MSE is the average of the squared differences between the predicted and actual values. It penalizes larger errors more heavily than MAE due to the squaring of the errors.
    *   **Formula:** \[ MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \]
    *   **Interpretation:** A lower MSE indicates a better fit. Its units are the square of the dependent variable's units, which can make it less intuitive to interpret than MAE. The square root of MSE is RMSE, which is in the same units as the dependent variable.

3.  **R-squared (Coefficient of Determination):**
    *   **Description:** R-squared represents the proportion of the variance in the dependent variable that can be predicted from the independent variables. It indicates how well the independent variables explain the variability of the dependent variable.
    *   **Formula:** \[ R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2} = 1 - \frac{\text{SSR}}{\text{SST}} \]
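        Where SSR is the Sum of Squared Residuals (the variation left unexplained by the model) and SST is the Total Sum of Squares (the total variation in \( y \) around its mean).
    *   **Interpretation:** For a least squares fit with an intercept, R-squared on the training data ranges from 0 to 1. A value of 1 indicates that the model explains all the variability in the dependent variable, while a value of 0 indicates that the model explains no variability. It can be negative when evaluated on new data or for a model without an intercept, which means the model predicts worse than always predicting the mean. Higher R-squared values generally indicate a better fit, but this can be misleading because adding more independent variables (even irrelevant ones) never decreases R-squared.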
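
The three formulas can be computed directly with NumPy. The actual and predicted values below are made up for illustration, and RMSE is included because it follows from MSE in one step.

```python
import numpy as np

# Made-up actual and predicted values, for illustration only
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 10.0])

errors = y_true - y_pred

mae = np.mean(np.abs(errors))          # Mean Absolute Error
mse = np.mean(errors ** 2)             # Mean Squared Error
rmse = np.sqrt(mse)                    # Root Mean Squared Error

ssr = np.sum(errors ** 2)                      # sum of squared residuals
sst = np.sum((y_true - y_true.mean()) ** 2)    # total sum of squares
r2 = 1.0 - ssr / sst                           # R-squared

print(f"MAE:  {mae:.3f}")    # 0.500
print(f"MSE:  {mse:.3f}")    # 0.375
print(f"RMSE: {rmse:.3f}")   # 0.612
print(f"R^2:  {r2:.3f}")     # 0.925
```

scikit-learn provides the same metrics as `mean_absolute_error`, `mean_squared_error`, and `r2_score` in `sklearn.metrics`.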

# **Question 8: What is the purpose of the R-squared metric in regression analysis?**

The **R-squared (Coefficient of Determination)** metric in regression analysis serves several key purposes:

1.  **Explaining Variance:** Its primary purpose is to quantify the proportion of the variance in the dependent variable (the outcome you are trying to predict) that is predictable from the independent variable(s) (the predictors) in a linear regression model. In simpler terms, it tells you how much of the variability in the 'Y' values can be explained by the 'X' values.

2.  **Goodness of Fit:** R-squared is often interpreted as a measure of the "goodness of fit" of the model. A higher R-squared value indicates that the model fits the data better, as it explains a larger proportion of the dependent variable's variance.

3.  **Model Performance Assessment:** It provides a standardized value (between 0 and 1, or 0% to 100%, for a least squares fit with an intercept evaluated on its training data) that makes it relatively easy to understand how well the independent variables account for the changes in the dependent variable. For example, an R-squared of 0.75 means that 75% of the variation in the dependent variable can be explained by the independent variables in the model.

4.  **Comparison (with caution):** While it can be used to compare different regression models on the same dataset (higher R-squared generally implies a better fit), it's important to use it with caution. R-squared tends to increase as more independent variables are added to the model, even if those variables are not truly significant. For this reason, **Adjusted R-squared** is often preferred for multiple linear regression, as it accounts for the number of predictors in the model.

In essence, R-squared helps researchers and analysts understand the predictive power of their regression model and how much of the observed variation in the outcome can be attributed to the factors included in the model.
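
The caution about adding predictors can be illustrated with a short experiment. The sketch below uses simulated data and adds a feature that is unrelated to \( y \) by construction. The helper names `r_squared` and `adjusted_r_squared` are introduced here for illustration, and the adjusted formula used is \( 1 - (1 - R^2)\frac{n - 1}{n - p - 1} \), where \( n \) is the number of observations and \( p \) the number of predictors.

```python
import numpy as np

rng = np.random.default_rng(seed=42)
n = 40
x = rng.uniform(0, 10, n)
noise_feature = rng.normal(size=n)            # unrelated to y by construction
y = 1.0 + 2.0 * x + rng.normal(0.0, 2.0, n)   # simulated data, illustration only

def r_squared(features, target):
    """In-sample R-squared of a least squares fit with an intercept."""
    design = np.column_stack([np.ones(len(target))] + list(features))
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    residuals = target - design @ coef
    return 1.0 - np.sum(residuals ** 2) / np.sum((target - target.mean()) ** 2)

def adjusted_r_squared(r2, n_obs, n_predictors):
    """Adjusted R-squared, which penalises additional predictors."""
    return 1.0 - (1.0 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)

r2_one = r_squared([x], y)
r2_two = r_squared([x, noise_feature], y)

print(f"R^2 with x only:            {r2_one:.4f}")
print(f"R^2 with x + noise feature: {r2_two:.4f}")
print(f"Adjusted R^2 with x only:            {adjusted_r_squared(r2_one, n, 1):.4f}")
print(f"Adjusted R^2 with x + noise feature: {adjusted_r_squared(r2_two, n, 2):.4f}")
```

The plain R-squared cannot go down when the extra feature is added, whereas the adjusted value only rises if the new feature improves the fit by more than the penalty for the extra parameter.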

# **Question 9: Write Python code to fit a simple linear regression model using scikit-learn and print the slope and intercept.**


```python
import numpy as np
from sklearn.linear_model import LinearRegression

# 1. Create some sample data
X = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10]).reshape(-1, 1) # Independent variable
y = np.array([2, 4, 5, 4, 5, 7, 8, 9, 10, 12]) # Dependent variable

# 2. Instantiate the Linear Regression model
model = LinearRegression()

# 3. Fit the model to the data
model.fit(X, y)

# 4. Print the slope (coefficient) and intercept
print(f"Slope (Coefficient): {model.coef_[0]:.2f}")
print(f"Intercept: {model.intercept_:.2f}")

# Optional: Make a prediction
# new_x = np.array([[11]])
# predicted_y = model.predict(new_x)
# print(f"\nPrediction for x=11: {predicted_y[0]:.2f}")
```

Output:

```
Slope (Coefficient): 1.01
Intercept: 1.07
```


# **Question 10: How do you interpret the coefficients in a simple linear regression model?**

In a simple linear regression model, expressed as \( y = \beta_0 + \beta_1x + \epsilon \), there are two primary coefficients to interpret:

1.  **\( \beta_0 \) (The Intercept):**
    *   **Interpretation:** The intercept represents the predicted value of the dependent variable (\( y \)) when the independent variable (\( x \)) is zero. It's the point where the regression line crosses the y-axis.
    *   **Contextual Relevance:** The practical interpretation of the intercept depends heavily on whether an \( x \) value of zero is meaningful in the context of your data. For example:
        *   If \( x \) is 'years of education' and \( y \) is 'income', then \( \beta_0 \) would be the predicted income for someone with zero years of education. This might be a meaningful baseline.
        *   If \( x \) is 'square footage of a house' and \( y \) is 'house price', a square footage of zero is not physically possible. In such cases, the intercept serves more as a mathematical anchor for the line and might not have a direct, interpretable real-world meaning.

2.  **\( \beta_1 \) (The Slope):**
    *   **Interpretation:** The slope represents the average change in the dependent variable (\( y \)) for every one-unit increase in the independent variable (\( x \)). In simple linear regression there are no other predictors to hold constant, so the "all else equal" qualifier used in multiple regression is not needed.
    *   **Direction and Magnitude:**
        *   If \( \beta_1 \) is positive, it indicates a positive relationship: as \( x \) increases, \( y \) tends to increase.
        *   If \( \beta_1 \) is negative, it indicates a negative relationship: as \( x \) increases, \( y \) tends to decrease.
        *   The magnitude of \( \beta_1 \) tells you how much \( y \) is expected to change for that one-unit change in \( x \).
    *   **Example:** If \( \beta_1 \) for 'square footage' (\( x \)) and 'house price' (\( y \)) is 100, it means that, on average, for every additional square foot, the house price is predicted to increase by $100.
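
Both interpretations can be checked numerically on a fitted line. The sketch below reuses the sample data from Question 9 and fits it with `np.polyfit`; the helper `predict` is a name introduced here for illustration.

```python
import numpy as np

# Sample data (same values as the Question 9 example)
x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], dtype=float)
y = np.array([2, 4, 5, 4, 5, 7, 8, 9, 10, 12], dtype=float)

slope, intercept = np.polyfit(x, y, deg=1)

def predict(x_value):
    """Predicted y on the fitted line for a given x."""
    return intercept + slope * x_value

# Intercept: the predicted value of y when x is zero
print(f"Prediction at x = 0: {predict(0.0):.2f} (intercept = {intercept:.2f})")

# Slope: the change in the prediction for a one-unit increase in x
change = predict(6.0) - predict(5.0)
print(f"Change from x = 5 to x = 6: {change:.2f} (slope = {slope:.2f})")
```

Note that \( x = 0 \) lies outside the observed range of 1 to 10 in this dataset, so the intercept here is an extrapolation rather than a value supported by the data.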