# **QUESTIONS**

#1. What is Simple Linear Regression (SLR)? Explain its purpose.

- **Simple Linear Regression (SLR)** is a statistical method used to model and analyze the relationship between two continuous variables. It involves one independent variable (predictor, denoted as X) and one dependent variable (outcome, denoted as Y).

- Its **purpose** is to find the best-fitting straight line (the "regression line") that describes how the dependent variable Y changes as the independent variable X changes. This line can then be used to:

  -  Understand the strength and direction (positive or negative) of the relationship.
  -  Make predictions about the value of Y for a given value of X.

#2. What are the key assumptions of Simple Linear Regression?

For a simple linear regression model to be valid and reliable, it relies on four key assumptions:

1) **Linearity:** The relationship between the independent variable (X) and the dependent variable (Y) is linear. This means the data points should roughly follow a straight line.

2) **Independence:** The observations (more precisely, their error terms) are independent of each other. One data point's error does not influence another's.

3) **Homoscedasticity:** The variance of the errors is constant for all values of X. In simpler terms, the "scatter" of the data points around the regression line should be roughly the same along the entire line.

4) **Normality of Errors:** The errors (the differences between the actual values and the predicted values) are normally distributed.
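
A rough way to check some of these assumptions is to fit the line and inspect the residuals. The sketch below uses made-up data and only the Python standard library. Splitting the residuals into a lower and an upper half of X is an informal stand-in for a residual plot, not a formal test.

```python
from statistics import pvariance

# Hypothetical data, invented for illustration (already sorted by x).
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1]

# Fit the least-squares line with the closed-form formulas.
x_bar = sum(x) / len(x)
y_bar = sum(y) / len(y)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)
slope = s_xy / s_xx
intercept = y_bar - slope * x_bar

# Residuals: observed value minus fitted value.
residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

# Informal homoscedasticity check: compare residual spread for small vs. large x.
half = len(x) // 2
var_low = pvariance(residuals[:half])
var_high = pvariance(residuals[half:])

print("Residuals:", [round(r, 3) for r in residuals])
print(f"Residual variance (lower half of x): {var_low:.4f}")
print(f"Residual variance (upper half of x): {var_high:.4f}")
```

If the two variances differ by a large factor, or the residuals show a clear curve or trend against x, the linearity or constant-variance assumptions deserve a closer look.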

#3. Write the mathematical equation for a simple linear regression model and explain each term.

The mathematical equation for a simple linear regression model is:

**Y = β₀ + β₁X + ε**

- **Y:** This is the dependent variable (the outcome or the value you are trying to predict).

- **X:** This is the independent variable (the predictor or the value you are using to make the prediction).

- **β₀ (Beta-nought):** This is the intercept of the line. It represents the predicted value of Y when X is equal to 0.

- **β₁ (Beta-one):** This is the slope (the regression coefficient) of the line. It represents the expected change in Y for every one-unit increase in X.

- **ε (Epsilon):** This is the error term. It represents the random variation and the part of Y that cannot be explained by the linear relationship with X.
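
To make the roles of the terms concrete, the sketch below generates data from the equation itself. The parameter values (`beta_0`, `beta_1`, `sigma`) are assumptions chosen only for illustration, and the error term is drawn from a normal distribution to match the normality assumption.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable

# Hypothetical "true" parameters, chosen only for illustration.
beta_0 = 5.0   # intercept
beta_1 = 2.0   # slope
sigma = 1.0    # standard deviation of the error term

xs = [float(i) for i in range(1, 11)]
epsilons = [random.gauss(0.0, sigma) for _ in xs]

# Y = beta_0 + beta_1 * X + epsilon, applied point by point.
ys = [beta_0 + beta_1 * x + eps for x, eps in zip(xs, epsilons)]

for x, eps, y in zip(xs, epsilons, ys):
    line_part = beta_0 + beta_1 * x
    print(f"X={x:4.1f}  line={line_part:5.1f}  eps={eps:+.3f}  Y={y:6.3f}")
```

Each Y is the value on the line plus a random error, so the observed points scatter around the line rather than sitting exactly on it.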

#4. Provide a real-world example where simple linear regression can be applied.

A classic real-world example is predicting a house's price based on its size.

- **Independent Variable (X):** Size of the house (e.g., in square feet).

- **Dependent Variable (Y):** Selling price of the house.

By collecting data on many house sales, you can use simple linear regression to fit a line that models this relationship. The fitted model might estimate, for example, that each additional square foot (a one-unit increase in X) is associated with a $150 increase in price (the slope, β₁).

#5. What is the method of least squares in linear regression?

- The **method of least squares** is the mathematical procedure used to find the "best-fitting" line for the data. This line is defined by the intercept (β₀) and slope (β₁) that **minimize the sum of the squared differences** between the actual observed values (Y) and the values predicted by the model (Ŷ, or "Y-hat").

- These differences (Y - Ŷ) are called **residuals**. The method squares them to prevent positive and negative errors from canceling each other out and to heavily penalize larger errors. The resulting line is the one that has the smallest possible total squared error.
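
For simple linear regression the minimizing values have a closed form: the slope is Σ(x − x̄)(y − ȳ) / Σ(x − x̄)², and the intercept is ȳ − slope · x̄. The sketch below applies these formulas in plain Python to the same six points used in the scikit-learn answer to question 9. The function names are illustrative, not part of any library.

```python
# Same six points as the scikit-learn answer to question 9.
x = [1, 2, 3, 4, 5, 6]
y = [10, 12, 14, 16, 18, 20]

def fit_least_squares(x, y):
    """Return (intercept, slope) minimizing the sum of squared residuals."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    return intercept, slope

def sum_squared_residuals(x, y, intercept, slope):
    """Total squared error of the line y = intercept + slope * x."""
    return sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))

b0, b1 = fit_least_squares(x, y)
print(f"intercept = {b0:.2f}, slope = {b1:.2f}")
print("SSR at the fitted line:", sum_squared_residuals(x, y, b0, b1))
print("SSR with a slightly different slope:", sum_squared_residuals(x, y, b0, b1 + 0.1))
```

Nudging the slope away from the fitted value increases the sum of squared residuals, which is what "best-fitting" means here.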

#6. What is Logistic Regression? How does it differ from Linear Regression?

**Logistic Regression** is a statistical algorithm used for classification tasks, where the goal is to predict a discrete, categorical outcome (e.g., Yes/No, True/False, Spam/Not Spam). It works by modeling the probability that a given input belongs to a specific category.

It differs from Linear Regression in two main ways:

1) **Output Type:**

- **Linear Regression** predicts a continuous numerical value (e.g., $150,000, 25.5°C).

- **Logistic Regression** predicts a probability (a value between 0 and 1), which is then used to classify the output into a discrete category.

2) **Governing Equation:**

- **Linear Regression** uses a straight-line equation (Y = β₀ + β₁X).

- **Logistic Regression** uses the logistic (or sigmoid) function to "squash" the output of a linear equation, forcing the result to be between 0 and 1.
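
The "squashing" step can be sketched in a few lines. The coefficients `b0` and `b1` below are hypothetical values picked for illustration, not fitted to any data, and the 0.5 threshold is a common default rather than a requirement.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real number to a value in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_probability(x, b0, b1):
    """Probability of the positive class for a single input x."""
    return sigmoid(b0 + b1 * x)

def classify(x, b0, b1, threshold=0.5):
    """Turn the probability into a 0/1 class label."""
    return 1 if predict_probability(x, b0, b1) >= threshold else 0

# Hypothetical coefficients, for illustration only.
b0, b1 = -4.0, 1.0

for x in [0, 2, 4, 6, 8]:
    p = predict_probability(x, b0, b1)
    print(f"x={x}  linear={b0 + b1 * x:+.1f}  probability={p:.3f}  class={classify(x, b0, b1)}")
```

The linear part `b0 + b1 * x` can be any real number, but the probability always stays strictly between 0 and 1. This simple version is not hardened against very large negative inputs, where `math.exp` can overflow.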

#7. Name and briefly describe three common evaluation metrics for regression models.

1) **Mean Absolute Error (MAE):** This is the average of the absolute differences between the actual values and the predicted values. It's easy to interpret because it is in the same units as the dependent variable. It tells you, on average, how "far off" your predictions are.

2) **Mean Squared Error (MSE):** This is the average of the squared differences between the actual and predicted values. By squaring the errors, it penalizes larger errors much more heavily than smaller ones.

3) **Root Mean Squared Error (RMSE):** This is the square root of the MSE. It is very popular because, like MAE, it is in the original units of the dependent variable (making it interpretable), but it still retains the property of penalizing large errors (due to the squaring).
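
The three metrics are straightforward to compute by hand. The sketch below uses made-up actual and predicted values in which one prediction is far off, to show how that single large error affects each metric differently.

```python
import math

# Hypothetical actual and predicted values, invented for illustration.
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.0, 5.0, 8.0, 13.0]   # the last prediction is off by 4

errors = [yt - yp for yt, yp in zip(y_true, y_pred)]

mae = sum(abs(e) for e in errors) / len(errors)    # mean absolute error
mse = sum(e ** 2 for e in errors) / len(errors)    # mean squared error
rmse = math.sqrt(mse)                              # root mean squared error

print(f"MAE  = {mae:.3f}")
print(f"MSE  = {mse:.3f}")
print(f"RMSE = {rmse:.3f}")
```

Here the RMSE comes out larger than the MAE because squaring gives the single error of 4 more weight than the three smaller errors.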

#8. What is the purpose of the R-squared metric in regression analysis?

The purpose of **R-squared (R²)**, also called the **coefficient of determination**, is to measure the **proportion of the variance** in the dependent variable (Y) that can be explained by the independent variable (X).

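For a least-squares line with an intercept, evaluated on the data it was fitted to, R² is a value between 0 and 1 (or 0% to 100%). On new data it can fall below 0 if the model predicts worse than simply using the mean of Y.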

- An R-squared of 0 means the model explains none of the variability in Y.

- An R-squared of 1 means the model explains all of the variability in Y.

For example, an R-squared of 0.65 means that 65% of the variation in the dependent variable (e.g., house prices) can be explained by the linear relationship with the independent variable (e.g., square footage). It tells you how well your model "fits" the data.
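
R² can be computed directly from its definition, R² = 1 − SS_res / SS_tot, where SS_res is the sum of squared residuals and SS_tot is the total sum of squared deviations of Y from its mean. The values below are hypothetical and the helper name `r_squared` is illustrative.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_bar = sum(y_true) / len(y_true)
    ss_res = sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred))
    ss_tot = sum((yt - y_bar) ** 2 for yt in y_true)
    return 1.0 - ss_res / ss_tot

# Hypothetical actual values and model predictions.
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.5, 6.5, 9.5]

r2 = r_squared(y_true, y_pred)
print(f"R-squared of the model: {r2:.2f}")

# Baseline: always predict the mean of y_true.
mean_only = [sum(y_true) / len(y_true)] * len(y_true)
print(f"R-squared of the mean-only baseline: {r_squared(y_true, mean_only):.2f}")
```

A model that always predicts the mean of Y scores 0, so R² can be read as the improvement over that baseline.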

#9. Write Python code to fit a simple linear regression model using scikit-learn and print the slope and intercept. (Include your Python code and output in the code box below.)



```python
import numpy as np
from sklearn.linear_model import LinearRegression

# 1. Create sample data
# We use .reshape(-1, 1) because scikit-learn expects X to be a 2D array
X = np.array([1, 2, 3, 4, 5, 6]).reshape(-1, 1)
y = np.array([10, 12, 14, 16, 18, 20])

# 2. Create a Linear Regression model instance
model = LinearRegression()

# 3. Fit the model to the data
model.fit(X, y)

# 4. Get the slope (coefficient) and intercept
slope = model.coef_[0]
intercept = model.intercept_

# 5. Print the results
print(f"Sample X data:\n{X.flatten()}")
print(f"Sample y data:\n{y}")
print("-" * 30)
print(f"Slope (β₁): {slope}")
print(f"Intercept (β₀): {intercept}")
print(f"\nThe model equation is: y = {intercept:.2f} + {slope:.2f}*X")
```

Output:

```
Sample X data:
[1 2 3 4 5 6]
Sample y data:
[10 12 14 16 18 20]
------------------------------
Slope (β₁): 1.9999999999999996
Intercept (β₀): 8.000000000000002

The model equation is: y = 8.00 + 2.00*X
```


#10. How do you interpret the coefficients in a simple linear regression model?

In a simple linear regression model (Y = β₀ + β₁X), the two coefficients are interpreted as follows:

- **Intercept (β₀):** This is the predicted value of the dependent variable (Y) when the independent variable (X) is equal to 0. The practical interpretation of this value depends on the context. For example, in a model predicting weight from height, the intercept would be the predicted weight at 0 height, which is nonsensical. In other cases, like predicting sales (Y) based on ad spend (X), the intercept would be the "baseline" sales you would expect with zero ad spend.

- **Slope (β₁):** This is the primary coefficient of interest. It represents the **average change in the dependent variable (Y) for every one-unit increase in the independent variable (X).**

  -  If β₁ is **positive** (e.g., 2.5), Y increases by 2.5 units on average for every 1-unit increase in X.

  -  If β₁ is **negative** (e.g., -1.2), Y decreases by 1.2 units on average for every 1-unit increase in X.
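
Both interpretations can be checked numerically. The sketch below uses the rounded coefficients from the model fitted in question 9 (intercept 8, slope 2). The `predict` helper is illustrative and not a scikit-learn function.

```python
# Rounded coefficients from the model fitted in question 9.
intercept = 8.0
slope = 2.0

def predict(x):
    """Predicted Y for a given X under the fitted line."""
    return intercept + slope * x

# The intercept is the prediction at X = 0.
print("Prediction at X = 0:", predict(0))

# The slope is the change in the prediction for a one-unit increase in X.
print("Prediction at X = 6:", predict(6))
print("Prediction at X = 7:", predict(7))
print("Change for one extra unit of X:", predict(7) - predict(6))
```

The difference between the predictions at any two X values one unit apart equals the slope, wherever on the line they are taken.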