Question 1: What is Simple Linear Regression (SLR)? Explain its purpose.

Answer: Simple Linear Regression (SLR) is a statistical method used to study the relationship between one independent variable (X) and one dependent variable or target variable (y). It fits a straight line through the data so that the sum of the squared differences between the actual and predicted values is minimized. The fitted line has the form

ŷ = β_0 + β_1 X

where ŷ is the predicted value of y, β_0 is the intercept, and β_1 is the slope. Together, β_0 and β_1 describe the best-fit line.

The purpose of SLR is to understand how a change in the independent variable affects the dependent variable and to use this relationship for prediction. It helps identify trends, measure the strength and direction of the relationship, and estimate values of y for given values of X.

Question 2: What are the key assumptions of Simple Linear Regression?

Answer: Simple Linear Regression (SLR) is based on several important assumptions that must be satisfied for the model results to be valid and reliable. The key assumptions are:

1) Linearity: There is a linear relationship between the independent variable (X) and the target variable (y). This means the change in y is proportional to the change in X.

2) Independence of errors: The residuals (errors) are independent of each other. In other words, one observation’s error does not influence another’s.

3) Homoscedasticity: The variance of the residuals is constant across all levels of the independent variable. This means the spread of errors should be the same throughout the regression line.

4) Normality of errors: The residuals should be approximately normally distributed. This is important for valid hypothesis testing and confidence intervals.

5) No significant outliers: Outliers should not strongly influence the regression line, as they can distort the relationship.

6) X values are measured without error: The independent variable (X) is assumed to have no measurement error.
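Some of these assumptions can be checked informally from the residuals of a fitted line. The sketch below is a rough pure-Python illustration, not a formal test. It reuses the toy data and fitted coefficients from Question 9; the split-halves variance comparison and the helper name `lag1_autocorrelation` are assumptions made for illustration.

```python
from statistics import mean, pvariance

# Toy data and fitted coefficients taken from the Question 9 example.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
intercept, slope = 2.2, 0.6

# Residual = actual value minus predicted value.
residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

# With an intercept in the model, least-squares residuals average to zero.
mean_residual = mean(residuals)
print("Mean residual:", round(mean_residual, 6))

# Rough homoscedasticity check: compare the residual spread
# for the smallest and the largest X values.
half = len(x) // 2
low_spread = pvariance(residuals[:half])
high_spread = pvariance(residuals[-half:])
print("Residual variance (low X, high X):", round(low_spread, 3), round(high_spread, 3))


def lag1_autocorrelation(values):
    """Rough independence check: correlation of each residual with the next one."""
    center = mean(values)
    centered = [v - center for v in values]
    numerator = sum(a * b for a, b in zip(centered, centered[1:]))
    denominator = sum(c * c for c in centered)
    return numerator / denominator


autocorr = lag1_autocorrelation(residuals)
print("Lag-1 autocorrelation:", round(autocorr, 3))
```

With only five observations these numbers are too noisy to support any conclusion. On real data, a plot of residuals against X and a Q-Q plot of the residuals are the usual first checks.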

Question 3: Write the mathematical equation for a simple linear regression model and explain each term.

Answer: The mathematical equation for a Simple Linear Regression (SLR) model is:

y = β_0 + β_1 X + ϵ

Explanation of each term:

y (Dependent variable or target variable): The variable we want to predict or explain.

X (Independent variable): The predictor variable used to explain or predict y.

β_0 (Intercept): The value of y when X = 0. It represents where the regression line crosses the y-axis.

β_1 (Slope): The amount by which y changes for a one-unit increase in X. Its sign gives the direction of the relationship and its magnitude gives the rate of change.

ϵ (Error Term): Represents the difference between the actual value of y and the value given by the line β_0 + β_1 X. It captures the variation in y that cannot be explained by X. Its observed counterpart in a fitted model is the residual, the difference between the actual and predicted values.
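To make the role of each term concrete, the sketch below generates data from the model with known coefficients. The values of `beta_0`, `beta_1`, `noise_sd`, and the random seed are arbitrary assumptions chosen only for illustration.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable (arbitrary choice)

beta_0 = 1.0    # intercept (assumed value)
beta_1 = 2.0    # slope (assumed value)
noise_sd = 0.5  # standard deviation of the error term (assumed value)

x_values = [0, 1, 2, 3, 4]

# Draw one error per observation, then build y = beta_0 + beta_1 * X + error.
errors = [random.gauss(0, noise_sd) for _ in x_values]
y_values = [beta_0 + beta_1 * x + e for x, e in zip(x_values, errors)]

for x, e, y in zip(x_values, errors, y_values):
    systematic = beta_0 + beta_1 * x  # the part of y explained by X
    print(f"X={x}  systematic={systematic:.2f}  error={e:+.2f}  y={y:.2f}")
```

Each printed row splits y into the part explained by X and the part attributed to the error term.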

Question 4: Provide a real-world example where simple linear regression can be applied.

Answer: A real-world example of applying Simple Linear Regression is predicting house prices based on the size of the house (in square feet). In this case, the size of the house (X) is the independent variable, and the house price (y) is the target variable. As the size of the house increases, the price generally increases. By fitting a straight line between these two variables, we can estimate the expected price of a house for any given size.

Question 5: What is the method of least squares in linear regression?

Answer: The method of least squares is a technique used in linear regression to find the best-fitting line through a set of data points. It works by minimizing the sum of the squared differences between the actual values (observed data) and the predicted values given by the regression line. These differences are called residuals, and the method of least squares chooses the values of the intercept (β_0) and slope (β_1) such that the Sum of Squared Errors (SSE) is as small as possible.

In simple terms, of all the straight lines that could be drawn through the data points, the method selects the one with the smallest total squared error. Because the errors are squared, large deviations count more heavily than small ones.
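For a single predictor, the minimizing values have a closed form: the slope is the sum of (x - x̄)(y - ȳ) divided by the sum of (x - x̄)², and the intercept is ȳ minus the slope times x̄. The sketch below applies these formulas to the toy data from Question 9; the function names `fit_least_squares` and `sse` are hypothetical helpers, not part of any library.

```python
def fit_least_squares(x, y):
    """Return (intercept, slope) that minimize the sum of squared errors."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    s_xy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_mean) ** 2 for xi in x)
    slope = s_xy / s_xx
    intercept = y_mean - slope * x_mean
    return intercept, slope


def sse(x, y, intercept, slope):
    """Sum of squared differences between actual and predicted values."""
    return sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))


x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

b0, b1 = fit_least_squares(x, y)
sse_fit = sse(x, y, b0, b1)
sse_nudged = sse(x, y, b0, b1 + 0.1)  # same intercept, slightly different slope

print("Intercept:", round(b0, 4), "Slope:", round(b1, 4))
print("SSE at the least-squares fit:", round(sse_fit, 4))
print("SSE with a nudged slope:", round(sse_nudged, 4))
```

The nudged line has a larger SSE than the fitted line, which is what minimizing the sum of squared errors means in practice.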

Question 6: What is Logistic Regression? How does it differ from Linear Regression?

Answer: Logistic Regression is a statistical and machine learning method used for classification problems, where the target variable is categorical. The target is most commonly binary (0 or 1), although the method can be extended to multiclass problems. Instead of predicting a continuous value, logistic regression predicts the probability that an observation belongs to a particular class. It uses the sigmoid (logistic) function to convert the output into a probability between 0 and 1.

The logistic regression model is:

P(y = 1 | X) = 1 / (1 + e^(-(β_0 + β_1 X)))

Logistic Regression differs from Linear Regression mainly in purpose and output. Linear Regression predicts continuous numerical values with a straight-line equation, whereas Logistic Regression passes a linear combination of the inputs through the sigmoid function to produce a probability between 0 and 1, which is then converted into a class label, typically by applying a threshold such as 0.5. Linear Regression is usually fitted by the method of least squares, while Logistic Regression is usually fitted by maximum likelihood estimation.
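The step from a linear score to a probability and then to a class label can be sketched as follows. The coefficient values and the 0.5 threshold are assumptions chosen for illustration; in practice the coefficients are estimated from data.

```python
import math


def sigmoid(z):
    """Map a real number to the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_probability(x, beta_0, beta_1):
    """P(y = 1 | X = x) under the logistic regression model."""
    return sigmoid(beta_0 + beta_1 * x)


def predict_class(x, beta_0, beta_1, threshold=0.5):
    """Convert the probability into a 0/1 label using a threshold."""
    return 1 if predict_probability(x, beta_0, beta_1) >= threshold else 0


beta_0, beta_1 = -3.0, 1.0  # assumed coefficients, not fitted to any data

for x in [0, 3, 6]:
    p = predict_probability(x, beta_0, beta_1)
    label = predict_class(x, beta_0, beta_1)
    print(f"X={x}  probability={p:.3f}  class={label}")
```

This direct form of the sigmoid is fine for moderate inputs; for very large negative scores `math.exp` can overflow, so production code usually relies on a numerically stable library implementation.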

Question 7: Name and briefly describe three common evaluation metrics for regression models.

Answer: Three common evaluation metrics for regression models are:

1) Mean Absolute Error (MAE): MAE measures the average absolute difference between the actual values and the predicted values. It indicates how much the model’s predictions deviate from the true values on average.

2) Mean Squared Error (MSE): MSE calculates the average of the squared differences between actual and predicted values. Because the errors are squared, larger mistakes are penalized more strongly.

3) R-squared (R²): R-squared represents the proportion of the variance in the dependent variable that is explained by the model. A higher R² value indicates a better fit and stronger explanatory power.
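The three metrics can be computed by hand as in the sketch below, which uses the toy data and fitted line from Question 9. The helper names `mae`, `mse`, and `r2` are hypothetical; scikit-learn offers equivalent functions in `sklearn.metrics` (`mean_absolute_error`, `mean_squared_error`, `r2_score`).

```python
def mae(y_true, y_pred):
    """Mean Absolute Error: average size of the errors."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mse(y_true, y_pred):
    """Mean Squared Error: average of the squared errors."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r2(y_true, y_pred):
    """R-squared: share of the variance in y_true explained by y_pred."""
    y_mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - y_mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


x = [1, 2, 3, 4, 5]
y_true = [2, 4, 5, 4, 5]
y_pred = [2.2 + 0.6 * xi for xi in x]  # fitted line from Question 9

print("MAE:", round(mae(y_true, y_pred), 4))
print("MSE:", round(mse(y_true, y_pred), 4))
print("R-squared:", round(r2(y_true, y_pred), 4))
```

MAE is in the same units as y, while MSE is in squared units, which is why its square root (RMSE) is often reported alongside it.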

Question 8: What is the purpose of the R-squared metric in regression analysis?

Answer: The purpose of the R-squared metric in regression analysis is to measure how well the regression model explains the variability of the dependent variable. It represents the proportion of the total variation in the target variable that is accounted for by the model. For a least-squares model with an intercept, evaluated on the data it was fitted to, R-squared ranges from 0 to 1, and a higher value indicates a better fit. On new data, or for a model fitted without an intercept, it can be negative, meaning the model predicts worse than simply using the mean of y. In simple terms, R-squared shows how effectively the independent variable(s) help in predicting the dependent variable.

R-squared = 1 - (SS_res / SS_tot)

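where

SS_res = Sum of Squared Residuals, the sum of (y_i - ŷ_i)² over all observations

SS_tot = Total Sum of Squares, the sum of (y_i - ȳ)² over all observations, with ȳ the mean of y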
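The sketch below evaluates this formula for three sets of predictions on the toy target values from Question 9: the fitted line, a baseline that always predicts the mean, and a deliberately poor set of predictions. The `poor_model` values are hypothetical and exist only to show that the formula can produce a negative result.

```python
def r_squared(y_true, y_pred):
    """R-squared = 1 - SS_res / SS_tot."""
    y_mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # unexplained variation
    ss_tot = sum((t - y_mean) ** 2 for t in y_true)             # total variation
    return 1 - ss_res / ss_tot


y_true = [2, 4, 5, 4, 5]

fitted_line = [2.8, 3.4, 4.0, 4.6, 5.2]  # 2.2 + 0.6 * X for X = 1..5
mean_only = [4, 4, 4, 4, 4]              # always predict the mean of y
poor_model = [5, 4, 3, 2, 1]             # hypothetical predictions with the wrong trend

r2_fitted = r_squared(y_true, fitted_line)
r2_mean = r_squared(y_true, mean_only)
r2_poor = r_squared(y_true, poor_model)

print("Fitted line:", round(r2_fitted, 4))
print("Mean-only baseline:", round(r2_mean, 4))
print("Poor model:", round(r2_poor, 4))
```

The mean-only baseline scores exactly 0, so R-squared can be read as the improvement over always predicting the mean. The function divides by SS_tot, so it is undefined when every value of y is identical.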

Question 9: Write Python code to fit a simple linear regression model using scikit-learn and print the slope and intercept.

Answer:

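from sklearn.linear_model import LinearRegression
import pandas as pd

# Toy dataset: one feature column (X) and a target (y)
X = pd.DataFrame({'feature': [1, 2, 3, 4, 5]})
y = [2, 4, 5, 4, 5]

# Fit the model by ordinary least squares
model = LinearRegression()
model.fit(X, y)

# coef_ holds one slope per feature; intercept_ is a single number here
print("Slope:", model.coef_[0])
print("Intercept:", model.intercept_)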

Slope: 0.6
Intercept: 2.2


Question 10: How do you interpret the coefficients in a simple linear regression model?

Answer: In a simple linear regression model, the coefficients describe the relationship between the independent variable (X) and the dependent variable (y).

1) Slope (β_1): The slope shows how much the dependent variable (y) is expected to change when the independent variable (X) increases by one unit. If β_1 is positive, y increases as X increases. If β_1 is negative, y decreases as X increases. The sign of the slope gives the direction of the relationship and its magnitude gives the rate of change; the strength of the relationship is measured separately, for example by R-squared.

2) Intercept (β_0): The intercept is the predicted value of y when X = 0. It represents where the regression line crosses the y-axis and sets the baseline value of the model. It has a practical interpretation only when X = 0 is meaningful and lies within or near the range of the observed data.
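As a small illustration, the sketch below reads these interpretations off the coefficients obtained in Question 9 (slope 0.6, intercept 2.2). The `predict` helper is a hypothetical function written for this example.

```python
intercept, slope = 2.2, 0.6  # fitted values from the Question 9 example


def predict(x):
    """Predicted y on the fitted line for a given X."""
    return intercept + slope * x


# The intercept is the prediction at X = 0.
at_zero = predict(0)
print("Prediction at X = 0:", at_zero)

# The slope is the change in the prediction for a one-unit increase in X.
change = predict(4) - predict(3)
print("Change in prediction from X = 3 to X = 4:", round(change, 6))
```

In that example X ranges from 1 to 5, so the prediction at X = 0 is an extrapolation slightly outside the observed data and should be read with some caution.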