<a href="https://colab.research.google.com/github/hannahclark5067/qj-f24/blob/main/notebooks/regression/regression_basics_class.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### **Overview**

This Colab Notebook walks you through:

(A) How to fit a linear regression model in Python using the library [statsmodels](https://www.statsmodels.org/stable/index.html). You will be asked to fill in code wherever you see a placeholder such as `### Fill in the Code ###`.

(B) How to think about what the library [statsmodels](https://www.statsmodels.org/stable/index.html) is doing when it fits a linear model.

### **Clone Library**

In [1]:
!git clone https://github.com/pharringtonp19/business-analytics.git

Cloning into 'business-analytics'...
remote: Enumerating objects: 995, done.
remote: Counting objects: 100% (612/612), done.
remote: Compressing objects: 100% (261/261), done.
remote: Total 995 (delta 442), reused 429 (delta 328), pack-reused 383 (from 1)
Receiving objects: 100% (995/995), 18.84 MiB | 16.72 MiB/s, done.
Resolving deltas: 100% (570/570), done.


### **Import Packages**

In [2]:
import pandas as pd
pd.set_option('display.max_columns', None)
import matplotlib.pyplot as plt
import matplotlib.ticker as mticker
import statsmodels.formula.api as smf
from statsmodels.iolib.summary2 import summary_col
import plotly.graph_objects as go
import numpy as np
import jax.numpy as jnp
import jax
import seaborn as sb

### **Read In Data Set**

In [3]:
df = pd.read_csv(### FILL IN THE CORRECT FILE PATH HERE###)
df.head()

SyntaxError: incomplete input (<ipython-input-3-3a90111535ef>, line 2)

\#1. Create a Scatterplot of Size and Price




In [None]:
### ADD SCATTERPLOT HERE ###
ax = plt.gca()
ax.spines['right'].set_visible(False)
ax.spines['top'].set_visible(False)
ax.yaxis.set_major_formatter(mticker.FuncFormatter(lambda x, _: f'{int(x):,}'))
plt.xlabel('Size', size=14)
plt.title('Price', loc='left', size=14)
plt.show()

\#2. Creating a Linear Model



$$\text{Price}_i = \beta_0 + \beta_1 \text{Size}_i + \varepsilon_i$$

To create a linear model in Python we have to specify two components.

(A) What is the relationship of interest? More specifically, what is the outcome (dependent) variable and what is the independent variable (also known as a control or a feature variable)? We specify this in Python by creating a string with the outcome to the left of a tilde sign and the independent variable to the right, as in `Dependent ~ Independent`.

(B) What is the name of the DataFrame where the variables are stored?

To create a linear regression model, we pass both of the components mentioned above as inputs to `smf.ols`, as in `linear_model = smf.ols('Dependent ~ Independent', data = df)`.
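To see the two components in action without touching the exercise data, here is a minimal sketch on synthetic data. The DataFrame `toy` and its columns `x` and `y` are made up for illustration and are not part of the course dataset.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in data (hypothetical; not the course dataset)
rng = np.random.default_rng(0)
toy = pd.DataFrame({"x": rng.uniform(0, 10, size=50)})
toy["y"] = 2.0 + 3.0 * toy["x"] + rng.normal(0, 1, size=50)

# Component (A): the formula string. Component (B): the DataFrame.
toy_model = smf.ols("y ~ x", data=toy)

# At this point the model is specified but not yet estimated
print(toy_model.endog_names)  # name of the outcome variable
print(toy_model.exog_names)   # names of the right-hand-side terms
```

Note that the formula interface adds an intercept term automatically, which is why it shows up among the right-hand-side terms.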

\#3. Estimating a Linear Model

Having specified which linear relationship we're interested in estimating, we can estimate the model by calling the `.fit()` method on the linear model and saving the output in a variable, for example `reg`. The variable name is up to you.
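As a rough illustration of the specify-then-fit pattern, the sketch below fits a model on synthetic data. The names `toy` and `toy_reg` are hypothetical, and the true intercept (2) and slope (3) are chosen arbitrarily so that you can compare them with the estimates.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in data (hypothetical; not the course dataset)
rng = np.random.default_rng(0)
toy = pd.DataFrame({"x": rng.uniform(0, 10, size=50)})
toy["y"] = 2.0 + 3.0 * toy["x"] + rng.normal(0, 1, size=50)

# Specify the model, then estimate it with .fit()
toy_reg = smf.ols("y ~ x", data=toy).fit()

# The estimates are stored in a pandas Series indexed by term name
print(toy_reg.params)
```

The estimates should land close to, but not exactly on, the values used to generate the data, because of the added noise.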

\#4. Showing Results

If you fit the model correctly, then you can adjust the following code to show both the fitted parameters and the overall results.

- We're using an f-string within a `print` statement to show the fitted parameters. Pass in `reg.params.values` within the curly braces.

- To show the results in a table-like format (one that you could use to show your work in a final project), pass `reg` inside the list (`[]`) that is the first argument of `summary_col`.

In [None]:
### PRINT FITTED PARAMETERS
print(f"Fitted Parameters: {### FILL THIS IN###}")

### PRESENT RESULTS
print(summary_col([### FILL THIS IN###],
                  stars=True,
                  float_format='%0.2f'))

The following code visualizes the objective function: the mean squared error as a function of the intercept and the slope.

In [None]:
# Define the model and the loss function
def model(params, x):
    return params[0] + x * params[1]

def loss_function(params, xs, ys_true):
    ys_pred = model(params, xs)
    return jnp.mean((ys_true - ys_pred) ** 2)

# Create data points
xs = jnp.array(df['size'].values)
ys_true = jnp.array(df['price'].values)

# Define parameter ranges and create a grid
intercepts = jnp.linspace(0, 20_000, 100)
slopes = jnp.linspace(0, 1000, 100)
intercepts_grid, slopes_grid = jnp.meshgrid(intercepts, slopes)

# Vectorize the loss computation over the grid
params_grid = jnp.stack([intercepts_grid.ravel(), slopes_grid.ravel()], axis=-1)
loss_values = jax.vmap(lambda params: loss_function(params, xs, ys_true))(params_grid)
loss_values = loss_values.reshape(intercepts_grid.shape)

# Get optimal parameters from regression
eqn = smf.ols('price ~ size', data=df)
results = eqn.fit()
optimal_intercept, optimal_slope = results.params['Intercept'], results.params['size']
optimal_loss = float(loss_function([optimal_intercept, optimal_slope], xs, ys_true))

# Create the Plotly figure
fig = go.Figure()

# Add the surface for loss values
fig.add_trace(go.Surface(
    z=loss_values,
    x=intercepts_grid,
    y=slopes_grid,
    colorscale='Viridis',
    opacity=0.8,
    name="Loss Surface"
))

# Add the optimal point
fig.add_trace(go.Scatter3d(
    x=[optimal_intercept],
    y=[optimal_slope],
    z=[optimal_loss],
    mode='markers',
    marker=dict(size=8, color='red'),
    name="Optimal Parameters"
))

# Customize layout
fig.update_layout(
    title="3D Loss Surface with Optimal Parameters",
    scene=dict(
        xaxis_title="Intercept",
        yaxis_title="Slope",
        zaxis_title="Loss (MSE)"
    )
)

# Show the plot
fig.show()

\#5. Understanding what a linear model captures:

Using the pandas methods that we previously learned, `df[var1].cov(df[var2])` and `df[var1].var()`, divide the covariance of Price and Size by the variance of Size. How does this relate to the slope parameter?
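The comparison can be tried first on synthetic data. In the sketch below, the DataFrame `toy` is made up, and `np.polyfit` is used as a stand-in least-squares fit (an assumption for illustration, not the method used elsewhere in this notebook).

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data (hypothetical; not the course dataset)
rng = np.random.default_rng(0)
toy = pd.DataFrame({"x": rng.uniform(0, 10, size=50)})
toy["y"] = 2.0 + 3.0 * toy["x"] + rng.normal(0, 1, size=50)

# Covariance of outcome and feature divided by variance of the feature
ratio = toy["x"].cov(toy["y"]) / toy["x"].var()

# Least-squares line fit; polyfit returns the highest-degree coefficient first
slope, intercept = np.polyfit(toy["x"], toy["y"], deg=1)

print(ratio, slope)
```

Compare the two printed numbers, then repeat the calculation with Size and Price.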

\#6. Using the Model to Generate Predictions

Having fit a linear model to the data, we have a pair of estimated values for the parameters. We can now create a function `model(params, x)` that takes in the estimated parameters and the size of a house and outputs our prediction. Implement this function.
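One possible shape for such a function is sketched below, following the `params[0] + x * params[1]` convention used in the objective-function cell. The parameter values here are hypothetical placeholders, not estimates from the data.

```python
import numpy as np

def model(params, x):
    """Linear prediction: intercept + slope * x."""
    intercept, slope = params
    return intercept + slope * x

# Hypothetical parameter values, for illustration only
example_params = np.array([10_000.0, 150.0])

# Works for a single size...
print(model(example_params, 2_000))

# ...and for an array of sizes, thanks to NumPy broadcasting
print(model(example_params, np.array([1_000, 1_500, 2_000])))
```

In practice you would pass in `reg.params.values` rather than made-up numbers.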

\#7. Visualizing Model Fit

statsmodels provides us with an easy way to see our model's predictions on the dataset that we used to fit the model. Given that we named the output of `linear_model.fit()` `reg`, we can access the predicted values via `reg.fittedvalues`. Create a scatter plot of Size and Price, and overlay a line plot of Size and the fitted values.

In [None]:
### ADD SCATTER PLOT HERE
### ADD PLOT HERE
ax = plt.gca()
ax.spines['right'].set_visible(False)
ax.spines['top'].set_visible(False)
ax.yaxis.set_major_formatter(mticker.FuncFormatter(lambda x, _: f'{int(x):,}'))
plt.xlabel('Size', size=14)
plt.title('Price', loc='left', size=14)
plt.show()
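To connect the fitted values with the prediction function from the previous step, the following sketch checks on synthetic data that `fittedvalues` is simply the intercept plus the slope times the feature. The names `toy`, `toy_reg`, and `manual` are hypothetical.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in data (hypothetical; not the course dataset)
rng = np.random.default_rng(0)
toy = pd.DataFrame({"x": rng.uniform(0, 10, size=50)})
toy["y"] = 2.0 + 3.0 * toy["x"] + rng.normal(0, 1, size=50)

toy_reg = smf.ols("y ~ x", data=toy).fit()

# Predictions computed by hand from the estimated parameters
manual = toy_reg.params["Intercept"] + toy_reg.params["x"] * toy["x"]

# They should agree with the fitted values stored on the results object
print(np.allclose(manual, toy_reg.fittedvalues))
```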

\#8. Mean Residual

In this notebook, we're fitting a linear model to our data. As the above figures highlight, a linear model will not perfectly predict Price for a given Size of the house. We refer to this "prediction error" as the residual.

Compute the mean residual.

$$r_i(\beta_0, \beta_1) = y_i - (\beta_0 + \beta_1 x_i)$$
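The residual formula above can be applied directly. The sketch below does so on synthetic data, using `np.polyfit` as a stand-in least-squares fit (an assumption for illustration); the names `x`, `y`, `b0`, and `b1` are hypothetical.

```python
import numpy as np

# Synthetic stand-in data (hypothetical; not the course dataset)
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = 2.0 + 3.0 * x + rng.normal(0, 1, size=50)

# Least-squares estimates; polyfit returns the highest-degree coefficient first
b1, b0 = np.polyfit(x, y, deg=1)

# r_i = y_i - (b0 + b1 * x_i)
residuals = y - (b0 + b1 * x)

print(residuals.mean())
```

With a fitted statsmodels result, the same quantity is available as `reg.resid.mean()`.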

\#9. Visualizing the Distribution of the Residual

Using `reg.resid`, create a histogram that captures the distribution of the residuals. Then use `reg.resid` again to create a scatter plot of Size and the residual values.



In [None]:
### ADD HISTOGRAM HERE
ax = plt.gca()
ax.spines['right'].set_visible(False)
ax.spines['top'].set_visible(False)
plt.xlabel('Residuals', size=14)
plt.title('Density', loc='left', size=14)
plt.show()

### ADD SCATTER PLOT HERE
ax = plt.gca()
ax.spines['right'].set_visible(False)
ax.spines['top'].set_visible(False)
ax.yaxis.set_major_formatter(mticker.FuncFormatter(lambda x, _: f'{int(x):,}'))
plt.xlabel('Size', size=14)
plt.title('Residuals', loc='left', size=14)
plt.show()

\#10. Let's explore the relationship between Size and the Residual numerically.

(A) Create a new column in your DataFrame that is equal to the residuals.

(B) Regress the residual on Size.

Do the results surprise you?
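Steps (A) and (B) can be sketched on synthetic data as follows. The DataFrame `toy` and the column name `resid` are hypothetical; with the real data you would use your own DataFrame and `reg`.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in data (hypothetical; not the course dataset)
rng = np.random.default_rng(0)
toy = pd.DataFrame({"x": rng.uniform(0, 10, size=50)})
toy["y"] = 2.0 + 3.0 * toy["x"] + rng.normal(0, 1, size=50)

toy_reg = smf.ols("y ~ x", data=toy).fit()

# (A) Store the residuals as a new column
toy["resid"] = toy_reg.resid

# (B) Regress the residuals on the original feature
resid_reg = smf.ols("resid ~ x", data=toy).fit()
print(resid_reg.params)
```

Look at the size of the estimated slope and intercept, and think about what the first regression has already "used up" from the feature.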


\#11. Visualizing Uncertainty

We can use **new** plotting libraries, like Seaborn, for more advanced plotting techniques such as displaying the uncertainty in our estimate. Fill in the values for `x` and `y` below to show a scatter plot of Size and Price together with a line of best fit and its uncertainty band.

In [None]:
sb.regplot(x = ###FILL THIS IN###, y = ###FILL THIS IN###, data = df,
line_kws = {'color': 'red'})
ax = plt.gca()
ax.spines['right'].set_visible(False)
ax.spines['top'].set_visible(False)
ax.yaxis.set_major_formatter(mticker.FuncFormatter(lambda x, _: f'{int(x):,}'))
plt.xlabel('Size', size=14)
plt.title('Price', loc='left', size=14)
plt.ylabel('')
plt.show()
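Seaborn's documentation describes the shaded band drawn by `regplot` as a confidence interval for the regression estimate, computed by bootstrapping (95% by default). The sketch below imitates that idea by hand with NumPy on synthetic data. It is a simplified illustration under those assumptions, not Seaborn's actual implementation, and all names in it are hypothetical.

```python
import numpy as np

# Synthetic stand-in data (hypothetical; not the course dataset)
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = 2.0 + 3.0 * x + rng.normal(0, 1, size=50)

# Points along the x-axis where the fitted line is evaluated
grid = np.linspace(x.min(), x.max(), 20)

n_boot = 1000
boot_lines = np.empty((n_boot, grid.size))
for b in range(n_boot):
    # Resample rows with replacement and refit the line
    idx = rng.integers(0, x.size, size=x.size)
    slope, intercept = np.polyfit(x[idx], y[idx], deg=1)
    boot_lines[b] = intercept + slope * grid

# Pointwise 95% band: the middle 95% of the refitted lines at each grid point
lower, upper = np.percentile(boot_lines, [2.5, 97.5], axis=0)
print(np.round(upper - lower, 2))
```

The printed band widths show how the uncertainty about the line varies across the range of the feature.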

\#12. Challenge

Read in the WNBA dataset as a new DataFrame and adjust the Seaborn code above to visualize the relationship between Assists and Price. What are the red line and the shaded band around it (the uncertainty) meant to capture?