### **Question 1:** What is Simple Linear Regression (SLR)? Explain its purpose.

=>

Simple Linear Regression (SLR) is a statistical method used to model the linear relationship between two continuous variables: a dependent variable (the variable you want to predict) and an independent variable (the variable used to make the prediction).

Its purpose is to:

1.  **Understand the Relationship**: Determine the strength and direction of the linear relationship between the two variables. For example, how much does the dependent variable change for a one-unit increase in the independent variable?
2.  **Prediction**: Predict the value of the dependent variable for a given value of the independent variable.
3.  **Inference**: Make inferences about the population based on the sample data, such as determining if the relationship is statistically significant.

### **Question 2:** What are the key assumptions of Simple Linear Regression?

=>

Simple Linear Regression relies on several key assumptions for the results to be valid and reliable. These assumptions are about the relationship between the independent and dependent variables, and the properties of the errors (the differences between the observed and predicted values). The key assumptions are:

1.  **Linearity**: The relationship between the independent variable (x) and the dependent variable (y) is linear. This means that the mean of the dependent variable is a straight-line function of the independent variable. You can check this assumption by creating a scatter plot of the data.
2.  **Independence**: The observations are independent of each other. This means that the value of one observation does not influence the value of another observation. This is particularly important in time series data where consecutive observations might be correlated.
3.  **Homoscedasticity (Constant Variance)**: The variance of the errors is constant across all levels of the independent variable. In other words, the spread of the residuals (the difference between the observed and predicted values) should be roughly the same for all values of the independent variable. You can check this assumption by plotting the residuals against the predicted values. A fan shape or cone shape in the plot indicates heteroscedasticity (non-constant variance).
4.  **Normality**: The errors are normally distributed. Because the true errors are unobserved, this is checked on the residuals. The assumption matters mainly for inference (confidence intervals and hypothesis tests) and is more important for smaller sample sizes. You can check it by looking at a histogram or a Q-Q plot of the residuals. Statistical tests like the Shapiro-Wilk test can also be used.
5.  **No Multicollinearity** (multiple regression only): With a single independent variable there is nothing for it to be collinear with, so this is not an issue in simple linear regression. It becomes an assumption once the model is extended to several independent variables, which should not be perfectly (or very highly) correlated with one another.
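The residual-based checks described above can be sketched numerically. This is a minimal illustration on synthetic data (the data, the seed, and the two summary statistics are assumptions chosen for the example, not a standard test procedure); in practice the plots mentioned in the list are usually more informative.

```python
import numpy as np

# Synthetic data (assumed for illustration): linear trend plus constant-variance noise
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 200)
y = 2.0 + 3.0 * x + rng.normal(0.0, 1.0, size=x.size)

# Fit a straight line by least squares and compute the residuals
slope, intercept = np.polyfit(x, y, deg=1)
residuals = y - (intercept + slope * x)

# Homoscedasticity: compare residual spread for the lower and upper half of x
half = x.size // 2
spread_ratio = residuals[half:].std() / residuals[:half].std()

# Independence: lag-1 autocorrelation of residuals (ordered by x) should be near 0
lag1_autocorr = np.corrcoef(residuals[:-1], residuals[1:])[0, 1]

print(f"spread ratio (upper/lower half): {spread_ratio:.2f}")
print(f"lag-1 residual autocorrelation: {lag1_autocorr:.2f}")
```

A spread ratio far from 1 would hint at heteroscedasticity, and a lag-1 autocorrelation far from 0 would hint at dependent observations.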

### **Question 3:** Write the mathematical equation for a simple linear regression model and explain each term.

=>

The mathematical equation for a simple linear regression model is:

$y = \beta_0 + \beta_1x + \epsilon$

Where:

*   **$y$**: The dependent variable (the variable you are trying to predict).
*   **$x$**: The independent variable (the variable used to make the prediction).
*   **$\beta_0$ (Beta naught)**: The y-intercept. This is the predicted value of $y$ when $x$ is 0.
*   **$\beta_1$ (Beta one)**: The slope of the regression line. This represents the change in $y$ for a one-unit increase in $x$.
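*   **$\epsilon$ (Epsilon)**: The error term. This represents the random variability in $y$ that cannot be explained by the linear relationship with $x$. It is the difference between the observed value of $y$ and the value given by the true (population) regression line $\beta_0 + \beta_1x$. Since the true line is unknown, the error is estimated in practice by the residual: the difference between the observed value and the value predicted by the fitted line.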
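To make the roles of the terms concrete, the sketch below generates data from the model equation. The coefficient values, the noise level, and the seed are hypothetical choices made only for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical "true" parameters (assumed for illustration)
beta_0 = 25000.0   # intercept: value of y when x = 0
beta_1 = 5000.0    # slope: change in y per one-unit increase in x

x = np.arange(1, 11, dtype=float)                 # independent variable
epsilon = rng.normal(0.0, 2000.0, size=x.size)    # random error term

systematic = beta_0 + beta_1 * x                  # the straight-line part
y = systematic + epsilon                          # observed dependent variable

for xi, yi, ei in zip(x, y, epsilon):
    print(f"x={xi:4.0f}  y={yi:10.2f}  error={ei:9.2f}")
```

Each observed $y$ is the point on the line plus its own random error, which is exactly what the equation states.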

### **Question 4:** Provide a real-world example where simple linear regression can be applied.

=>

A common real-world example of simple linear regression is the relationship between **years of experience and salary**.

*   **Dependent Variable (y)**: Salary
*   **Independent Variable (x)**: Years of Experience

We could collect data from a group of individuals, recording their years of experience and their current salary. Simple linear regression could then be used to:

1.  **Model the relationship**: Determine if there is a linear relationship between years of experience and salary, and how strong that relationship is.
2.  **Predict salary**: Predict the expected salary for someone with a certain number of years of experience.
3.  **Understand the impact of experience**: Estimate how much, on average, an additional year of experience contributes to an increase in salary (this would be the slope, $\beta_1$).

Other examples include:

*   The relationship between the number of hours studied and exam score.
*   The relationship between advertising expenditure and sales revenue.
*   The relationship between temperature and ice cream sales.

### **Question 5:** What is the method of least squares in linear regression?

=>

The method of least squares is a standard approach in simple linear regression to find the "best" fitting line through a set of data points. The "best" line is defined as the one that minimizes the sum of the squared vertical distances (residuals) between the actual data points and the line.

Here's a breakdown:

1.  **Residuals**: For each data point, a residual is the difference between the observed value of the dependent variable ($y$) and the value predicted by the regression line ($\hat{y}$). That is, $e_i = y_i - \hat{y}_i$.
2.  **Squaring the Residuals**: The residuals are squared to eliminate negative values and to give more weight to larger errors.
3.  **Sum of Squared Residuals (SSR)**: The squared residuals are summed up for all data points. The goal of the least squares method is to find the regression line (i.e., the values of $\beta_0$ and $\beta_1$) that minimizes this sum.

Mathematically, the objective is to minimize:

$SSR = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$

Where:

*   $y_i$ is the observed value of the dependent variable for the $i$-th data point.
*   $\hat{y}_i$ is the predicted value of the dependent variable for the $i$-th data point.
*   $n$ is the number of data points.

By minimizing the sum of squared residuals, the method of least squares finds the line that is closest to all the data points in a vertical sense. This line represents the best linear approximation of the relationship between the independent and dependent variables based on the given data.
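For simple linear regression, minimizing the SSR has a well-known closed-form solution: the slope is the sum of cross-deviations divided by the sum of squared deviations of $x$, and the intercept makes the line pass through the point of means. The sketch below applies these formulas with NumPy; the small data set is assumed for illustration.

```python
import numpy as np

# Illustrative data (assumed): years of experience and salary
x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], dtype=float)
y = np.array([30000, 35000, 40000, 45000, 50000,
              55000, 60000, 65000, 70000, 75000], dtype=float)

x_mean, y_mean = x.mean(), y.mean()

# Closed-form least squares estimates
s_xy = np.sum((x - x_mean) * (y - y_mean))   # sum of cross-deviations
s_xx = np.sum((x - x_mean) ** 2)             # sum of squared deviations of x
slope = s_xy / s_xx
intercept = y_mean - slope * x_mean

# Sum of squared residuals for the fitted line
y_hat = intercept + slope * x
ssr = np.sum((y - y_hat) ** 2)

print("slope:", slope)
print("intercept:", intercept)
print("SSR:", ssr)
```

Because this data lies exactly on a line, the SSR is zero up to floating-point rounding; with noisy data it would be the smallest value any straight line can achieve.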

### **Question 6:** What is Logistic Regression? How does it differ from Linear Regression?

=>

**Logistic Regression** is a statistical model used for **binary classification**. This means it is used to predict the probability that an observation belongs to one of two categories (e.g., yes/no, pass/fail, spam/not spam). It does this by modeling the probability of the dependent variable being a certain class.

**How it differs from Linear Regression:**

1.  **Type of Dependent Variable**:
    *   **Linear Regression**: The dependent variable is **continuous** (e.g., salary, temperature, sales revenue).
    *   **Logistic Regression**: The dependent variable is **categorical**, specifically **binary** (e.g., 0 or 1, True or False).

2.  **Output**:
    *   **Linear Regression**: The output is a **continuous value** that represents the predicted value of the dependent variable.
    *   **Logistic Regression**: The output is a **probability** (a value between 0 and 1) that the observation belongs to a specific class. This probability is then typically converted into a class prediction (e.g., if the probability is > 0.5, predict class 1).

3.  **Underlying Function**:
    *   **Linear Regression**: Uses a **linear function** to model the relationship between the independent and dependent variables ($y = \beta_0 + \beta_1x$).
    *   **Logistic Regression**: Uses the **logistic function (or sigmoid function)** to map the output of a linear combination of the independent variables to a probability between 0 and 1. The logistic function is S-shaped and squashes any real-valued input into a value between 0 and 1.

4.  **Assumptions**:
    *   While some assumptions are shared (like independence), Logistic Regression has different assumptions, such as the assumption that the independent variables are linearly related to the log-odds of the dependent variable. It does not assume normality of errors or homoscedasticity as Linear Regression does.

In essence, Linear Regression is for predicting a continuous outcome, while Logistic Regression is for predicting the probability of a binary outcome.
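The mapping from a linear score to a probability can be sketched as follows. The coefficients here are hypothetical values chosen for illustration, not fitted to any data.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical coefficients (assumed for illustration)
beta_0, beta_1 = -4.0, 1.0

x = np.array([0.0, 2.0, 4.0, 6.0, 8.0])
log_odds = beta_0 + beta_1 * x                        # linear part, as in linear regression
probabilities = sigmoid(log_odds)                     # probability of class 1
predicted_class = (probabilities > 0.5).astype(int)   # threshold at 0.5

for xi, p, c in zip(x, probabilities, predicted_class):
    print(f"x={xi:.0f}  P(class 1)={p:.3f}  predicted class={c}")
```

The linear part is unbounded, while the sigmoid output always stays strictly between 0 and 1, which is what allows it to be read as a probability.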

### **Question 7:** Name and briefly describe three common evaluation metrics for regression models.

=>

Here are three common evaluation metrics for regression models:

1.  **Mean Absolute Error (MAE)**: MAE measures the average magnitude of the errors in a set of predictions, without considering their direction. It is the average of the absolute differences between the actual values and the predicted values. MAE is less sensitive to outliers than MSE.

    $MAE = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|$

2.  **Mean Squared Error (MSE)**: MSE measures the average of the squares of the errors. It is the average of the squared differences between the actual values and the predicted values. MSE gives more weight to larger errors due to the squaring.

    $MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$

3.  **Root Mean Squared Error (RMSE)**: RMSE is the square root of the MSE. It is a widely used metric because it is in the same units as the dependent variable, making it easier to interpret. Like MSE, it is sensitive to outliers.

    $RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}$

Another important metric, though not strictly an error metric, is **R-squared ($R^2$)**.

*   **R-squared ($R^2$)**: R-squared represents the proportion of the variance in the dependent variable that is predictable from the independent variable(s). It provides a measure of how well the regression model fits the data. An R-squared of 1 indicates that the model perfectly fits the data, while an R-squared of 0 indicates that the model does not explain any of the variability in the dependent variable.

    $R^2 = 1 - \frac{SSR}{SST}$

    Where:
    *   $SSR$ is the Sum of Squared Residuals (explained in Question 5).
    *   $SST$ is the Total Sum of Squares, which is the sum of the squared differences between the actual values and the mean of the dependent variable.
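The formulas above translate directly into NumPy. The actual and predicted values below are made up for illustration.

```python
import numpy as np

# Made-up actual and predicted values (assumed for illustration)
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.0, 5.0, 8.0, 11.0])

errors = y_true - y_pred

mae = np.mean(np.abs(errors))                 # Mean Absolute Error
mse = np.mean(errors ** 2)                    # Mean Squared Error
rmse = np.sqrt(mse)                           # Root Mean Squared Error

ssr = np.sum(errors ** 2)                     # Sum of Squared Residuals
sst = np.sum((y_true - y_true.mean()) ** 2)   # Total Sum of Squares
r2 = 1.0 - ssr / sst                          # R-squared

print(f"MAE={mae:.4f}  MSE={mse:.4f}  RMSE={rmse:.4f}  R2={r2:.4f}")
```

Here the single error of size 2 contributes 4 of the 6 units of squared error, which shows how MSE and RMSE weight large errors more heavily than MAE does.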

### **Question 8:** What is the purpose of the R-squared metric in regression analysis?

=>

The purpose of the R-squared ($R^2$) metric in regression analysis is to measure the **proportion of the variance in the dependent variable that is predictable from the independent variable(s)**. In simpler terms, it indicates how well the independent variable(s) explain the variability in the dependent variable.

Here's what R-squared tells us:

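*   For a least squares model with an intercept, evaluated on the data it was fitted to, $R^2$ ranges from 0 to 1 (or 0% to 100%). On new data, or for a model without an intercept, it can be negative, which means the model predicts worse than simply using the mean.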
*   An $R^2$ of 0 means that the model does not explain any of the variability in the dependent variable. The independent variable(s) do not help in predicting the dependent variable.
*   An $R^2$ of 1 means that the model perfectly explains all of the variability in the dependent variable. The independent variable(s) perfectly predict the dependent variable.
*   An $R^2$ value between 0 and 1 indicates the percentage of the dependent variable's variance that is explained by the model. For example, an $R^2$ of 0.60 means that 60% of the variation in the dependent variable can be explained by the independent variable(s) in the model.

$R^2$ is calculated as:

$R^2 = 1 - \frac{SSR}{SST}$

Where:

*   $SSR$ is the Sum of Squared Residuals (the unexplained variation).
*   $SST$ is the Total Sum of Squares (the total variation in the dependent variable around its mean).

Essentially, R-squared compares the performance of your regression model to a simple model that just predicts the mean of the dependent variable. A higher R-squared generally indicates a better fit, but it's important to consider it in conjunction with other evaluation metrics and domain knowledge.
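The comparison against the mean-only model can be made concrete with a small helper. The function name `r_squared` and the data are assumptions for illustration; the formula is the one given above.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """R^2 = 1 - SSR/SST, computed against the mean of y_true."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ssr = np.sum((y_true - y_pred) ** 2)
    sst = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ssr / sst

y = np.array([3.0, 5.0, 7.0, 9.0])

perfect = r_squared(y, y)                                  # predictions equal the data
mean_only = r_squared(y, np.full_like(y, y.mean()))        # always predict the mean
worse_than_mean = r_squared(y, y[::-1])                    # predictions in reversed order

print("perfect predictions:", perfect)
print("mean-only baseline:", mean_only)
print("worse than the mean:", worse_than_mean)
```

Predicting the mean everywhere gives exactly 0, perfect predictions give 1, and predictions that are worse than the mean give a negative value under this formula.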

### **Question 9:** Write Python code to fit a simple linear regression model using scikit-learn and print the slope and intercept.

(Include your Python code and output in the code box below.)

=>


```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Sample data: years of experience and salary
# X must be 2-D (n_samples, n_features) for scikit-learn, hence the reshape
X = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10]).reshape(-1, 1)  # Independent variable (Years of Experience)
y = np.array([30000, 35000, 40000, 45000, 50000, 55000, 60000, 65000, 70000, 75000])  # Dependent variable (Salary)

# Create a Linear Regression model
model = LinearRegression()

# Fit the model to the data
model.fit(X, y)

# Print the slope and intercept
print("Slope (coefficient):", model.coef_[0])
print("Intercept:", model.intercept_)
```

Output:

```
Slope (coefficient): 5000.000000000002
Intercept: 24999.99999999999
```


### **Question 10:** How do you interpret the coefficients in a simple linear regression model?

=>

In a simple linear regression model with the equation $y = \beta_0 + \beta_1x + \epsilon$, the coefficients $\beta_0$ and $\beta_1$ have specific interpretations:

*   **$\beta_0$ (Intercept)**: The intercept represents the predicted value of the dependent variable ($y$) when the independent variable ($x$) is equal to zero. In the context of the salary example we used earlier, the intercept would be the predicted salary for someone with zero years of experience. However, it's important to note that interpreting the intercept only makes sense if a value of zero for the independent variable is meaningful in the context of your data.

*   **$\beta_1$ (Slope)**: The slope represents the change in the predicted value of the dependent variable ($y$) for a one-unit increase in the independent variable ($x$). In the salary example, the slope would represent the average increase in salary for each additional year of experience. A positive slope indicates a positive linear relationship (as $x$ increases, $y$ increases), while a negative slope indicates a negative linear relationship (as $x$ increases, $y$ decreases).

Using the output from the previous code cell:

*   **Slope (coefficient): 5000.000000000002**
*   **Intercept: 24999.99999999999**

This means that for every additional year of experience, the predicted salary increases by approximately \$5,000. The intercept of approximately \$25,000 suggests that a person with zero years of experience is predicted to have a salary of \$25,000.
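The interpretation can be checked by plugging the rounded coefficients into the line equation. The helper `predict_salary` is a hypothetical name used only for this sketch, and the coefficients are the printed values rounded to whole numbers.

```python
# Rounded coefficients taken from the printed output above
slope = 5000.0       # salary increase per additional year of experience
intercept = 25000.0  # predicted salary at zero years of experience

def predict_salary(years):
    """Predicted salary from the fitted line: intercept + slope * years."""
    return intercept + slope * years

print("0 years:", predict_salary(0))
print("3 years:", predict_salary(3))
print("4 years:", predict_salary(4))
print("Increase from 3 to 4 years:", predict_salary(4) - predict_salary(3))
```

The difference between any two consecutive years equals the slope, and the prediction at zero years equals the intercept. Note that zero years lies outside the sample range of 1 to 10 years, so the intercept is an extrapolation.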