# **MTD - AEC1_S1 - Linear Regression**
**Degree:** Data Science  
**Course:** Decision-Making Models      
**Student name:** Pablo Bas Genís

---
---

**Submission Instructions**  

Please read and complete all tasks within the same **Jupyter Notebook (.ipynb)**. Ensure that your code is well-documented, and all required outputs (tables, graphs, and explanations) are included within the notebook.  

Once you have finished, upload your completed **.ipynb file** to **Canvas** before **February 11th at 23:59**.  

Late submissions may be subject to penalties as per the course policies.

---
---

## **Multiple-Choice Questions on Linear Regression**  

Each question is worth **1 point**. The correct answer is worth **0.6 points**, and the justification for why the selected answer is correct (and why the others are not) is worth **0.4 points**.  



**1. In simple linear regression, what does the slope ($\beta_1$) of the regression equation represent?**  
**a)** The value of $Y$ when $X = 0$.  
**b)** The average change in $Y$ for a one-unit increase in $X$.  
**c)** The percentage of variation in $Y$ explained by $X$.  
**d)** The error term that accounts for randomness in the model.  



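Answer: b. The average change in $Y$ for a one-unit increase in $X$.  
Justification: The slope $\beta_1$ measures how much $Y$ changes, on average, when $X$ increases by one unit. Option a describes the intercept $\beta_0$, option c describes the coefficient of determination $R^2$, and option d describes the error term, not the slope.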

**2. Which of the following assumptions is NOT required for simple linear regression?**  
**a)** The relationship between $X$ and $Y$ is linear.  
**b)** The residuals are normally distributed.  
**c)** The independent variable must be normally distributed.  
**d)** The variance of residuals is constant (homoscedasticity).  



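Answer: c. The independent variable must be normally distributed.  
Justification: The model assumes a linear relationship (a), normally distributed residuals (b), and constant residual variance (d). These conditions concern the relationship and the error term; no particular distribution is required for $X$ itself.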

**3. If the coefficient of determination ($R^2$) is close to 1, what does this imply?**  
**a)** The independent variable perfectly predicts the dependent variable.  
**b)** The model explains most of the variability in the dependent variable.  
**c)** The slope of the regression line is equal to 1.  
**d)** The regression model is guaranteed to be useful for prediction.  



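Answer: b. The model explains most of the variability in the dependent variable.  
Justification: $R^2$ is the proportion of the variance in $Y$ explained by the model, so a value close to 1 means most of that variability is captured. It does not imply perfect prediction (a), which would require $R^2 = 1$; it says nothing about the value of the slope (c); and a good in-sample fit does not by itself guarantee useful predictions or causation (d).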

**4. Which situation indicates that a linear regression model may not be appropriate?**  
**a)** The residuals show a clear pattern when plotted against $X$.  
**b)** The coefficient of determination ($R^2$) is relatively high.  
**c)** The residuals appear randomly scattered around zero.  
**d)** The p-value for the slope coefficient is below 0.05.  

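Answer: a. The residuals show a clear pattern when plotted against $X$.  
Justification: Under a correctly specified linear model, the residuals should scatter randomly around zero (c). A clear pattern, such as curvature or a funnel shape, suggests that the linear form is misspecified or that the variance is not constant. A high $R^2$ (b) and a significant slope (d) support the model rather than argue against it.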
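
As a small illustration of questions 3 and 4, the sketch below fits a straight line to a deliberately curved relationship. The data are synthetic and purely illustrative (they do not come from the course PDF); the point is only that $R^2$ can be high while the residuals still show a clear pattern.

```python
import numpy as np

# Synthetic, purely illustrative data: a quadratic relationship Y = X^2.
x = np.arange(0, 11, dtype=float)
y = x ** 2

# Least-squares straight line (np.polyfit returns the highest-degree coefficient first).
slope, intercept = np.polyfit(x, y, 1)
residuals = y - (intercept + slope * x)

# Coefficient of determination: R^2 = 1 - SSE / SST.
sse = np.sum(residuals ** 2)
sst = np.sum((y - y.mean()) ** 2)
r_squared = 1 - sse / sst

print(f"R^2 = {r_squared:.3f}")
print("Residuals:", np.round(residuals, 1))
```

Here $R^2$ is above 0.9, yet the residuals are positive at both ends of the range and negative in the middle. That U-shape is the kind of pattern that signals a misspecified linear model.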

---
---

## **Problem-Solving Questions (3 Points Each)**  

**1. Statement for Simple Linear Regression (SLR) in Python with pandas and plotly**  

The objective of this exercise is to analyze the relationship between two variables, **billing (Y) and investment in R&D&I (X)**, using a **simple linear regression model**. The dataset required for this analysis is provided in the PDF file **Regresion_ANOVA_MTD_2021_2022.pdf**, specifically in the section **"SLR Example: Billing and Investment in R&D&I"**. We will use **pandas** for data manipulation and analysis, along with **plotly** for interactive visualization of the results.  

This analysis should include an initial statistical exploration of the data, fitting a linear regression model, generating a scatter plot with the fitted line, and adding a **text box in the visualization** that displays the regression equation and the coefficient of determination ($R^2$).  

**Evaluation Criteria**  

1. **Data Extraction and Exploration**  
   - Extract the dataset from the **"SLR Example: Billing and Investment in R&D&I"** section of the **Regresion_ANOVA_MTD_2021_2022.pdf** file.  
   - Enter the **billing (Y) and investment in R&D&I (X)** data into a **pandas DataFrame**.  
   - Display a statistical summary including **mean, variance, and covariance**.  

2. **Initial Visualization**  
   - Generate an **interactive scatter plot using plotly** to observe the relationship between **investment (X)** and **billing (Y)**.  

3. **Linear Regression Calculation**  
   - Use the **least squares method** to determine the regression coefficients:  
     - **Slope ($\beta_1$)**: Measures how much billing increases for each additional euro invested in R&D&I.  
     - **Intercept ($\beta_0$)**: Represents the expected billing when there is no investment in R&D&I.  
   - Compute the **coefficient of determination ($R^2$)** to evaluate model accuracy.  

4. **Final Graph Generation**  
   - Display the **regression line** on the scatter plot.  
   - Include an **enclosed text box** in the plot showing:  
     - The regression equation in the form $Y = \beta_0 + \beta_1 X$.  
     - The coefficient of determination $R^2$.  

5. **Final Results Table**  
   - Present the original dataset (from the **"SLR Example: Billing and Investment in R&D&I"** section) along with three additional rows containing:  
     - **Mean of $X$ (investment in R&D&I) and $Y$ (billing)**.  
     - **Variance of $X$ and $Y$**.  
     - **Covariance between $X$ and $Y$**.  
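
The summary statistics, least-squares formulas, and results table listed above can be sketched with pandas alone. This is a minimal sketch, assuming the twelve observations entered in this notebook's code cell and the sample ($n - 1$) definitions of variance and covariance, which are the pandas defaults. If the course material uses the population ($n$) versions, the variance and covariance values change, but the slope, intercept, and $R^2$ do not. The names `summary` and `results_table` are illustrative.

```python
import pandas as pd

# Data as entered in this notebook (billing Y, R&D&I investment X).
df = pd.DataFrame({
    "Revenue_Y": [8111, 7462, 9030, 13505, 14801, 10664, 6005, 5853, 19720, 11759, 18640, 23388],
    "Investment_X": [373, 4242, 4115, 5860, 5833, 6002, 1837, 1393, 7829, 5278, 3423, 8481],
})
x, y = df["Investment_X"], df["Revenue_Y"]

# Sample statistics (pandas uses ddof=1 by default).
mean_x, mean_y = x.mean(), y.mean()
var_x, var_y = x.var(), y.var()
cov_xy = x.cov(y)

# Least-squares estimates: slope = cov(X, Y) / var(X), intercept = mean(Y) - slope * mean(X).
beta_1 = cov_xy / var_x
beta_0 = mean_y - beta_1 * mean_x

# In simple linear regression, R^2 equals the squared correlation coefficient.
r_squared = cov_xy ** 2 / (var_x * var_y)

# Results table: original data plus three summary rows.
# The covariance is a single value shared by both columns.
summary = pd.DataFrame(
    {"Revenue_Y": [mean_y, var_y, cov_xy], "Investment_X": [mean_x, var_x, cov_xy]},
    index=["Mean", "Variance", "Covariance(X, Y)"],
)
results_table = pd.concat([df, summary])

print(results_table)
print(f"Y = {beta_0:.2f} + {beta_1:.4f} X,  R^2 = {r_squared:.3f}")
```

Computing the coefficients from the covariance and variance makes the link between the summary table and the fitted line explicit, and gives a quick cross-check against the output of a regression library.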

```python
import pandas as pd
import plotly.express as px
import statsmodels.api as sm

# Create DataFrame with data extracted from the PDF
data = {
    "Revenue_Y": [8111, 7462, 9030, 13505, 14801, 10664, 6005, 5853, 19720, 11759, 18640, 23388],
    "Investment_X": [373, 4242, 4115, 5860, 5833, 6002, 1837, 1393, 7829, 5278, 3423, 8481]
}

df = pd.DataFrame(data)

# Fit linear regression model
X = sm.add_constant(df["Investment_X"])  # Add a constant column for the intercept term
y = df["Revenue_Y"]
model = sm.OLS(y, X).fit()

# Look up results by label; positional indexing such as params[0] is deprecated in recent pandas
intercept = model.params["const"]
slope = model.params["Investment_X"]
slope_p_value = model.pvalues["Investment_X"]

# Visualize data in a scatter plot with an OLS trendline
fig = px.scatter(df, x="Investment_X", y="Revenue_Y", trendline="ols",
                 title="Relationship between R&D Investment and Revenue",
                 labels={"Investment_X": "R&D Investment", "Revenue_Y": "Revenue"})
fig.show()

# Display model summary
print(model.summary())

# Regression interpretation
print("\nModel Interpretation:")
print(f"Regression equation: Y = {intercept:.2f} + {slope:.2f}X")
print(f"Coefficient of determination (R²): {model.rsquared:.3f}")
print(f"Slope p-value: {slope_p_value:.5f}")
print(f"The slope coefficient indicates that each additional euro invested in R&D is associated with an average revenue increase of {slope:.2f} euros.")
```


```text
                            OLS Regression Results                            
Dep. Variable:              Revenue_Y   R-squared:                       0.587
Model:                            OLS   Adj. R-squared:                  0.546
Method:                 Least Squares   F-statistic:                     14.22
Date:                Tue, 11 Feb 2025   Prob (F-statistic):            0.00365
Time:                        22:41:55   Log-Likelihood:                -115.04
No. Observations:                  12   AIC:                             234.1
Df Residuals:                      10   BIC:                             235.1
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                   coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------
const         4393.4525   2400.684      1.830   

kurtosistest only valid for n>=20 ... continuing anyway, n=12
```



**2. Statement for Simple Linear Regression (SLR) in Python with pandas and plotly**  

The objective of this exercise is to analyze the relationship between **wheat production (X, in tons)** and the **price of flour (Y, in Euros)** using a **simple linear regression model**. The dataset required for this analysis is provided in the PDF file **Regresion_ANOVA_MTD_2021_2022.pdf**, specifically in the section containing the **table with wheat production and flour price data**. We will use **pandas** for data manipulation and analysis, along with **plotly** for interactive visualization of the results.  

This analysis should include an initial statistical exploration of the data, fitting a linear regression model, generating a scatter plot with the fitted line, and adding a **text box in the visualization** that displays the regression equation and the coefficient of determination ($R^2$). Additionally, we will evaluate the model error and determine the reliability of our predictions.  

**Tasks to Complete**  

1. **Represent both variables in a Scatter Plot**  
   - Extract the dataset from the **table containing wheat production (X) and flour price (Y)** in the **Regresion_ANOVA_MTD_2021_2022.pdf** file.  
   - Enter the **wheat production and flour price** data into a **pandas DataFrame**.  
   - Generate an **interactive scatter plot using plotly** to visualize the relationship between **wheat production (X)** and **flour price (Y)**.  
   - Analyze the scatter plot and describe what kind of relationship you observe between the variables.  

2. **Calculate the Regression Model**  
   - Use the **least squares method** to determine the regression coefficients:  
     - **Slope ($\beta_1$)**: Measures the average change in the price of flour (in euros) for each additional ton of wheat produced.  
     - **Intercept ($\beta_0$)**: Represents the expected price of flour when wheat production is zero.  
   - Compute the **coefficient of determination ($R^2$)** to evaluate the model's accuracy.  
   - Display the **regression equation** on the scatter plot.  

3. **Predict the Price of Flour for 45 Tons of Wheat Production**  
   - Use the regression equation $Y = \beta_0 + \beta_1 X$ to predict the flour price when **wheat production reaches 45 tons**.  
   - Display the predicted value in a **text box within the scatter plot**.  

4. **Calculate the Model Error**  
   - Compute the sum of squared errors (SSE):  
     $$
     \sum (Y - Y')^2
     $$
   - Display the model error in the output.  

5. **Assess the Reliability of the Prediction**  
   - Evaluate the error and $R^2$ value to determine if the model provides reliable predictions.  
   - Conclude whether the prediction for 45 tons of wheat production is **trustworthy** based on the accuracy of the model.  
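
The fit, prediction, and error steps above can be sketched as a single helper. The wheat production and flour price table from the PDF is not reproduced in this notebook, so the example call uses the R&D investment and billing data from problem 1; to apply it to this problem, pass the wheat and flour columns and `x_new=45`. The function name `fit_and_evaluate` and the `extrapolating` flag are hypothetical and only illustrate one way to organize the calculation.

```python
import numpy as np

def fit_and_evaluate(x, y, x_new):
    """Fit Y = b0 + b1*X by least squares and return coefficients, R^2, SSE and a prediction."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)

    # Least-squares estimates from the centered sums of squares and cross-products.
    b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    b0 = y.mean() - b1 * x.mean()

    # Model error: SSE = sum (Y - Y')^2, where Y' is the fitted value.
    fitted = b0 + b1 * x
    sse = np.sum((y - fitted) ** 2)
    sst = np.sum((y - y.mean()) ** 2)

    return {
        "b0": b0,
        "b1": b1,
        "r_squared": 1 - sse / sst,
        "sse": sse,
        "prediction": b0 + b1 * x_new,
        # True when x_new lies outside the observed range of X.
        "extrapolating": not (x.min() <= x_new <= x.max()),
    }

# Example call with the problem 1 data (investment X, billing Y), predicting at X = 7000.
investment = [373, 4242, 4115, 5860, 5833, 6002, 1837, 1393, 7829, 5278, 3423, 8481]
revenue = [8111, 7462, 9030, 13505, 14801, 10664, 6005, 5853, 19720, 11759, 18640, 23388]
result = fit_and_evaluate(investment, revenue, x_new=7000)

for name, value in result.items():
    print(f"{name}: {value}")
```

When judging reliability, a prediction deserves less trust if $R^2$ is low, if the SSE is large relative to the total variation in $Y$, or if the new $X$ value lies outside the range of the observed data.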

```python
# Predict revenue for a new R&D investment of 7000 euros
new_investment = 7000
# One row of inputs in the same column order used to fit `model`: [constant, Investment_X]
predicted_revenue = model.predict([[1, new_investment]])[0]

print("\nPrediction:")
print(f"For an R&D investment of {new_investment} euros, the predicted revenue is {predicted_revenue:.2f} euros.")
```

```text
Prediction:
For an R&D investment of 7000 euros, the predicted revenue is 16714.02 euros.
```


### **Artificial Intelligence Usage Statement**  

"I solemnly declare that in the completion of this activity:  

X I have not used any artificial intelligence tools at any stage of the process.  

☐ I have used artificial intelligence in the following way (specify): _______________."