In [1]:
# Question 1: What is Simple Linear Regression (SLR)? Explain its purpose.

"""
Simple Linear Regression (SLR) is a statistical method used to model the relationship between two continuous variables: a dependent variable (also known as the response or outcome variable) and an independent variable (also known as the predictor or explanatory variable).

The purpose of SLR is to:

Predict: Estimate the value of the dependent variable based on the value of the independent variable.
Explain: Understand the strength and direction of the relationship between the two variables. For example, if the independent variable increases, does the dependent variable tend to increase, decrease, or stay the same, and by how much?
Quantify: Provide a mathematical equation (a straight line) that describes this relationship, typically in the form Y = β₀ + β₁X + ε, where:
Y is the dependent variable.
X is the independent variable.
β₀ is the y-intercept (the value of Y when X is 0).
β₁ is the slope of the line (the change in Y for a one-unit change in X).
ε is the error term, representing the difference between the observed and predicted values.
"""

In [None]:
# Question 2: What are the key assumptions of Simple Linear Regression?


"""
The key assumptions of Simple Linear Regression are:

Linearity: There is a linear relationship between the independent variable (X) and the dependent variable (Y). This means that the relationship can be best described by a straight line.
Independence of Errors: The residuals (errors) are independent of each other. This means that the error of one observation does not influence the error of another.
Homoscedasticity: The variance of the residuals is constant across all levels of the independent variable. In simpler terms, the spread of the residuals should be roughly the same along the regression line.
Normality of Errors: The residuals are normally distributed. While less critical for larger sample sizes due to the Central Limit Theorem, it's an important assumption, especially for making inferences.
No Multicollinearity (for Multiple Linear Regression): Although not strictly an assumption for Simple Linear Regression (as there's only one independent variable), it's worth noting for future reference when dealing with multiple predictors: independent variables should not be highly correlated with each other.

"""
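Two of the assumptions above (independence of errors and homoscedasticity) can be probed numerically from the residuals. The following is a rough sketch, not a formal test: the data are made up, the helper names `fit_line`, `durbin_watson`, and `spread_ratio` are hypothetical, and the half-split spread comparison is only a crude heuristic. The Durbin-Watson statistic is meaningful only when the observations have a natural order (for example, time). Linearity and normality are usually checked with residual plots and are not covered here.

```python
def fit_line(xs, ys):
    """Ordinary least-squares intercept and slope for one predictor."""
    n = len(xs)
    x_mean, y_mean = sum(xs) / n, sum(ys) / n
    slope = (sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
             / sum((x - x_mean) ** 2 for x in xs))
    return y_mean - slope * x_mean, slope


def durbin_watson(res):
    """Durbin-Watson statistic (range 0 to 4); values near 2 suggest
    little first-order autocorrelation between consecutive residuals."""
    num = sum((res[i] - res[i - 1]) ** 2 for i in range(1, len(res)))
    return num / sum(e ** 2 for e in res)


def spread_ratio(res):
    """Mean squared residual in the upper half of X divided by that of the
    lower half. A ratio far from 1 hints at non-constant variance."""
    half = len(res) // 2

    def mean_sq(vals):
        return sum(v ** 2 for v in vals) / len(vals)

    return mean_sq(res[half:]) / mean_sq(res[:half])


# Made-up illustrative data, sorted by x (an assumption of this sketch).
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1]

intercept, slope = fit_line(xs, ys)
residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]

dw = durbin_watson(residuals)
ratio = spread_ratio(residuals)
print(f"Durbin-Watson: {dw:.2f}")
print(f"Spread ratio (upper half / lower half): {ratio:.2f}")
```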

In [None]:
# Question 3: Write the mathematical equation for a simple linear regression model and explain each term.

"""
The mathematical equation for a simple linear regression model is typically represented as a straight line:

Y = β₀ + β₁X + ε

Here's an explanation of each term:

Y: This is the dependent variable (also known as the response or outcome variable). It's the variable we are trying to predict or explain.
X: This is the independent variable (also known as the predictor or explanatory variable). It's the variable used to predict Y.
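β₀ (Beta-naught): This is the y-intercept. It represents the expected value of Y when the independent variable X is 0.
β₁ (Beta-one): This is the slope of the line. It represents the change in Y for a one-unit change in X.
ε (Epsilon): This is the error term. It represents the difference between the observed and predicted values of Y.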

"""
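To make the roles of the terms in Y = β₀ + β₁X + ε concrete, here is a small sketch that generates data from that equation. The function name `simulate_slr`, the parameter values, and the choice of a normally distributed error term are assumptions made for illustration only.

```python
import random


def simulate_slr(xs, beta0, beta1, noise_sd, seed=0):
    """Return Y values generated as beta0 + beta1 * X + epsilon,
    where epsilon is drawn from Normal(0, noise_sd)."""
    rng = random.Random(seed)
    return [beta0 + beta1 * x + rng.gauss(0.0, noise_sd) for x in xs]


xs = [1.0, 2.0, 3.0, 4.0, 5.0]

# Illustrative parameter values, not taken from any real dataset.
ys_exact = simulate_slr(xs, beta0=2.0, beta1=0.5, noise_sd=0.0)
ys_noisy = simulate_slr(xs, beta0=2.0, beta1=0.5, noise_sd=1.0)

# With no noise, every point lies exactly on the line 2.0 + 0.5 * x.
print(ys_exact)
# With noise, the points scatter around that line.
print(ys_noisy)
```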

In [None]:
## Question 4: Provide a real-world example where simple linear regression can be applied.

"""
A common real-world example where Simple Linear Regression can be applied is in predicting house prices based on their size.

Dependent Variable (Y): House Price (e.g., in dollars)
Independent Variable (X): House Size (e.g., in square feet)
In this scenario, a real estate analyst could collect data on various houses, noting both their size and their sale price. By applying Simple Linear Regression, they could:

Understand the relationship: Determine if larger houses generally sell for higher prices.
Quantify the relationship: Find an equation that describes how much the price typically increases for every additional square foot.
Predict: Estimate the selling price of a new house based solely on its square footage. For example, if the regression model suggests Price = $50,000 + $150 * Size (sq ft), a 2000 sq ft house would be predicted to sell for $50,000 + $150 * 2000 = $350,000.

 """
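The prediction arithmetic in the house-price example can be written as a one-line function. This sketch uses the example's illustrative coefficients ($50,000 intercept, $150 per square foot); the function name `predict_price` is hypothetical.

```python
def predict_price(size_sqft, intercept=50_000, slope=150):
    """Predicted price = intercept + slope * size, using the example's coefficients."""
    return intercept + slope * size_sqft


price_2000 = predict_price(2000)
print(price_2000)  # 350000

# Each additional square foot adds the slope to the predicted price.
extra_per_sqft = predict_price(2001) - predict_price(2000)
print(extra_per_sqft)  # 150
```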

In [None]:
## Question 5: What is the method of least squares in linear regression?
"""
The Method of Least Squares is a standard approach in linear regression to find the best-fitting straight line (the regression line) through a set of data points. The 'best-fitting' line is defined as the one that minimizes the sum of the squares of the vertical distances (residuals) from each data point to the line.

Here's a breakdown:

Residuals (Errors): For each data point (Xᵢ, Yᵢ), a residual is the difference between the observed actual value (Yᵢ) and the value predicted by the regression line (Ŷᵢ). That is, eᵢ = Yᵢ - Ŷᵢ.
Squaring the Residuals: We square each residual (eᵢ²) for two main reasons:
To prevent positive and negative errors from canceling each other out when they are summed.
To penalize larger errors more heavily, making the fit more sensitive to large deviations.
Sum of Squared Residuals (SSR): The goal is to minimize the sum of all these squared residuals: SSR = Σ(Yᵢ - Ŷᵢ)².
Finding the Best Line: The method of least squares mathematically derives the values for the intercept (β₀) and the slope (β₁) of the regression line Ŷ = β₀ + β₁X that result in the smallest possible SSR.
In essence, the method of least squares provides a systematic way to draw a line through a scatter plot of data points such that the total squared vertical distance between the points and the line is as small as possible. This line then serves as the best linear predictor for the relationship between the independent and dependent variables.

"""
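For one predictor, the least-squares solution has a standard closed form: slope = Σ(Xᵢ - X̄)(Yᵢ - Ȳ) / Σ(Xᵢ - X̄)² and intercept = Ȳ - slope · X̄. The sketch below applies it to made-up data and checks that nearby lines give a larger sum of squared residuals; the function names are hypothetical.

```python
def least_squares_fit(xs, ys):
    """Return (intercept, slope) minimizing the sum of squared residuals."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    s_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    s_xx = sum((x - x_mean) ** 2 for x in xs)
    slope = s_xy / s_xx
    intercept = y_mean - slope * x_mean
    return intercept, slope


def sum_squared_residuals(xs, ys, intercept, slope):
    """SSR = sum of (observed - predicted)^2 for the given line."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))


# Made-up illustrative data.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

b0, b1 = least_squares_fit(xs, ys)
best_ssr = sum_squared_residuals(xs, ys, b0, b1)
print(f"intercept={b0:.2f}, slope={b1:.2f}, SSR={best_ssr:.2f}")

# Nudging either coefficient away from the fitted value increases the SSR.
print(sum_squared_residuals(xs, ys, b0 + 0.1, b1) > best_ssr)
print(sum_squared_residuals(xs, ys, b0, b1 - 0.1) > best_ssr)
```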

In [None]:
## Question 6: What is Logistic Regression? How does it differ from Linear Regression?

"""
Logistic Regression is a statistical model used for binary classification problems. It predicts the probability of a categorical dependent variable (an outcome that can only take on one of a limited number of categories, e.g., 'yes' or 'no', 'true' or 'false'). Instead of directly predicting a value, it predicts the probability that an observation belongs to a particular category.

Here's how it differs from Linear Regression:

Dependent Variable Type: Linear Regression is used when the dependent variable is continuous (e.g., house price, temperature). Logistic Regression is used when the dependent variable is categorical, typically binary (e.g., whether a customer will churn or not, whether an email is spam or not).
Output: Linear Regression outputs a continuous numerical value. Logistic Regression outputs a probability value between 0 and 1, which can then be converted into a binary outcome using a threshold.
Underlying Function: Linear Regression uses a linear function to model the relationship between variables (Y = β₀ + β₁X). Logistic Regression uses the sigmoid (or logistic) function to transform the linear combination of predictors into a probability. The sigmoid function maps any real-valued number into a value between 0 and 1.
Error Distribution: Linear Regression assumes that the residuals are normally distributed. Logistic Regression makes no such assumption; it models the binary outcome itself as following a Bernoulli (binomial) distribution.
Equation: The basic equation for Logistic Regression often looks like this (after applying the log-odds transformation): log(p / (1-p)) = β₀ + β₁X where p is the probability of the event occurring.
In essence, while both are regression techniques, Linear Regression predicts a quantity, and Logistic Regression predicts the probability of belonging to a class.

"""
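The sigmoid function and the log-odds equation described above are inverses of each other, which a few lines of code can show. This is a minimal sketch: the coefficient values are hypothetical, and the 0.5 threshold is one common choice rather than a fixed rule.

```python
import math


def sigmoid(z):
    """Map any real number to a value strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def log_odds(p):
    """Logit: log(p / (1 - p)), the inverse of the sigmoid."""
    return math.log(p / (1.0 - p))


def predict_probability(x, beta0, beta1):
    """Probability of the positive class for a single predictor x."""
    return sigmoid(beta0 + beta1 * x)


# Hypothetical coefficients chosen so that x = 4 sits on the decision boundary.
beta0, beta1 = -4.0, 1.0

p = predict_probability(4.0, beta0, beta1)  # linear part is 0
label = int(p >= 0.5)                       # threshold the probability
print(p, label)

# Applying log_odds to the probability recovers the linear combination.
recovered = log_odds(predict_probability(5.3, beta0, beta1))
print(round(recovered, 6))  # beta0 + beta1 * 5.3 = 1.3
```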

In [None]:
## Question 7: Name and briefly describe three common evaluation metrics for regression models.

"""
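Mean Absolute Error (MAE): The average of the absolute differences between the predicted and actual values. It is expressed in the same units as the dependent variable and treats all errors equally.
Root Mean Squared Error (RMSE): The square root of the average squared difference between the predicted and actual values. Because errors are squared before averaging, large errors are penalized more heavily than in MAE.
R-squared (R²): The proportion of the variance in the dependent variable that is explained by the model. It is unitless, and a value closer to 1 indicates a better fit.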

"""
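A short sketch can show how MAE and RMSE respond differently to the same total error. The data are made up: both sets of predictions have a total absolute error of 4, but one spreads it evenly and the other concentrates it in a single point. The function names are hypothetical.

```python
import math


def mean_absolute_error(y_true, y_pred):
    """Average of |actual - predicted|."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def root_mean_squared_error(y_true, y_pred):
    """Square root of the average of (actual - predicted)^2."""
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    return math.sqrt(mse)


y_true = [10.0, 20.0, 30.0, 40.0]
pred_even = [11.0, 21.0, 31.0, 41.0]     # four errors of size 1
pred_outlier = [10.0, 20.0, 30.0, 44.0]  # one error of size 4

mae_even = mean_absolute_error(y_true, pred_even)
mae_outlier = mean_absolute_error(y_true, pred_outlier)
rmse_even = root_mean_squared_error(y_true, pred_even)
rmse_outlier = root_mean_squared_error(y_true, pred_outlier)

print(mae_even, mae_outlier)    # 1.0 1.0
print(rmse_even, rmse_outlier)  # 1.0 2.0
```

MAE is identical in both cases, while RMSE doubles when the error is concentrated in one observation.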

In [None]:
## Question 8: What is the purpose of the R-squared metric in regression analysis?


"""
The R-squared metric, also known as the coefficient of determination, indicates how well the independent variable(s) explain the variance in the dependent variable. It serves as a measure of goodness of fit for the regression model.

Proportion of variance: R-squared is the proportion of the dependent variable's variance that is accounted for by the independent variable(s) in the model. For example, an R-squared of 0.75 means that 75% of the variation in the dependent variable is explained by the model.

Goodness of fit: R-squared quantifies how closely the data points fit the regression line. A value of 1 means the model perfectly predicts the dependent variable, while a value of 0 means the model explains none of the variability. A higher value means that more of the variance is explained.

Context is key: R-squared should be interpreted alongside other statistical measures and the problem context, as a high R-squared does not automatically mean the relationship is causal or that the model is the best possible one.

"""
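R-squared is commonly computed as 1 - SS_res / SS_tot, where SS_res is the sum of squared residuals and SS_tot is the sum of squared deviations of Y from its mean. The sketch below, on made-up data, reproduces the two boundary cases described above; the function name `r_squared` is hypothetical.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


y_true = [3.0, 5.0, 7.0, 9.0]

# Perfect predictions leave no residual variance.
r2_perfect = r_squared(y_true, [3.0, 5.0, 7.0, 9.0])

# Always predicting the mean of Y (6.0) explains none of the variance.
r2_mean_only = r_squared(y_true, [6.0, 6.0, 6.0, 6.0])

# Imperfect but informative predictions fall in between.
r2_partial = r_squared(y_true, [2.0, 5.0, 8.0, 11.0])

print(r2_perfect)    # 1.0
print(r2_mean_only)  # 0.0
print(round(r2_partial, 2))  # 0.7
```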

In [2]:
##  Question 9: Write Python code to fit a simple linear regression model using scikit-learn and print the slope and intercept.
## (Include your Python code and output in the code box below.)

import numpy as np
from sklearn.linear_model import LinearRegression

# 1. Create some sample data
# Let's say we want to predict 'exam_score' (Y) based on 'study_hours' (X)
X = np.array([2, 3, 5, 7, 9, 10, 12, 14, 15, 17, 18, 20]).reshape(-1, 1) # Independent variable (e.g., hours studied)
Y = np.array([55, 60, 65, 70, 75, 80, 85, 88, 90, 92, 95, 98]) # Dependent variable (e.g., exam score)

# 2. Initialize the Linear Regression model
model = LinearRegression()

# 3. Fit the model to the data
model.fit(X, Y)

# 4. Print the slope (coefficient) and intercept
print(f"Intercept: {model.intercept_:.2f}")
print(f"Slope (Coefficient): {model.coef_[0]:.2f}")

# Optional: Make a prediction
sample_study_hours = np.array([[13]])
predicted_score = model.predict(sample_study_hours)
print(f"\nPredicted exam score for {sample_study_hours[0][0]} hours studied: {predicted_score[0]:.2f}")



Intercept: 53.37
Slope (Coefficient): 2.37

Predicted exam score for 13 hours studied: 84.15


In [None]:
## Question 10: How do you interpret the coefficients in a simple linear regression model?


"""
In a simple linear regression model, the coefficients describe the relationship between the independent and dependent variables.
The slope (β₁) indicates how much the dependent variable changes, on average, for a one-unit increase in the independent variable.
Its sign shows the direction of the relationship (positive or negative), and its magnitude gives the size of that change in units of Y per unit of X.
The intercept (β₀) is the predicted value of the dependent variable when the independent variable is zero; it has little practical meaning when X = 0 lies outside the range of the observed data.

"""
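These interpretations can be checked directly against a fitted line. The sketch below uses the rounded intercept (53.37) and slope (2.37) printed in Question 9; because the coefficients are rounded, its predictions can differ slightly from the fitted model's. The function name `predict_score` is hypothetical.

```python
INTERCEPT = 53.37  # rounded value printed in Question 9
SLOPE = 2.37       # rounded value printed in Question 9


def predict_score(hours):
    """Predicted exam score for a given number of study hours."""
    return INTERCEPT + SLOPE * hours


# Intercept: the predicted score when hours studied is zero.
score_at_zero = predict_score(0)

# Slope: the change in predicted score for one additional hour of study.
change_per_hour = predict_score(11) - predict_score(10)

print(f"Predicted score at 0 hours: {score_at_zero:.2f}")
print(f"Change per extra hour: {change_per_hour:.2f}")
```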