# Simple Linear Regression Models

In [1]:
import pandas as pd
amazonbooks = pd.read_csv("amazonbooks.csv", encoding="ISO-8859-1")
amazonbooks

Unnamed: 0,Title,Author,List Price,Amazon Price,Hard_or_Paper,NumPages,Publisher,Pub year,ISBN-10,Height,Width,Thick,Weight_oz
0,"1,001 Facts that Will Scare the S#*t Out of Yo...",Cary McNeal,12.95,5.18,P,304.0,Adams Media,2010.0,1605506249,7.8,5.5,0.8,11.2
1,21: Bringing Down the House - Movie Tie-In: Th...,Ben Mezrich,15.00,10.20,P,273.0,Free Press,2008.0,1416564195,8.4,5.5,0.7,7.2
2,100 Best-Loved Poems (Dover Thrift Editions),Smith,1.50,1.50,P,96.0,Dover Publications,1995.0,486285537,8.3,5.2,0.3,4.0
3,1421: The Year China Discovered America,Gavin Menzies,15.99,10.87,P,672.0,Harper Perennial,2008.0,61564893,8.8,6.0,1.6,28.8
4,1493: Uncovering the New World Columbus Created,Charles C. Mann,30.50,16.77,P,720.0,Knopf,2011.0,307265722,8.0,5.2,1.4,22.4
...,...,...,...,...,...,...,...,...,...,...,...,...,...
320,Where the Sidewalk Ends,Shel Silverstein,18.99,12.24,H,192.0,HarperCollins,2004.0,60572345,9.3,6.6,1.1,24.0
321,White Privilege,Paula S. Rothenberg,27.55,27.55,P,160.0,Worth Publishers,2011.0,1429233443,9.1,6.1,0.7,8.0
322,Why I wore lipstick,Geralyn Lucas,12.95,5.18,P,224.0,St Martin's Griffin,2005.0,031233446X,8.0,5.4,0.7,6.4
323,"Worlds Together, Worlds Apart: A History of th...",Robert Tignor,97.50,97.50,P,480.0,W. W. Norton & Company,2010.0,393934942,10.7,8.9,0.9,14.4


In [2]:
amazonbooks['Amazon Price']

0       5.18
1      10.20
2       1.50
3      10.87
4      16.77
       ...  
320    12.24
321    27.55
322     5.18
323    97.50
324     4.95
Name: Amazon Price, Length: 325, dtype: float64

In [9]:
import plotly.express as px
fig = px.scatter(amazonbooks, x="List Price", y="Amazon Price")
fig.update_layout(yaxis_range=[0,1400])
fig.show()

In [10]:
fig = px.scatter(x=amazonbooks["List Price"], y=10*amazonbooks["Amazon Price"])
fig.update_layout(yaxis_range=[0,1400])
fig.show()

In [4]:
amazonbooks[["List Price","Amazon Price"]].dropna()

Unnamed: 0,List Price,Amazon Price
0,12.95,5.18
1,15.00,10.20
2,1.50,1.50
3,15.99,10.87
4,30.50,16.77
...,...,...
320,18.99,12.24
321,27.55,27.55
322,12.95,5.18
323,97.50,97.50


$$\require{enclose} \enclose{horizontalstrike}{\huge \rho} \text{ this would be the population correlation}$$

Instead, we'll do the sample correlation


$$\huge  \sqrt{r^2} = r_{xy} = \frac{\displaystyle \sum_{i=0}^{n-1(=323)} (x_i - \bar x)(y_i - \bar y)}{s_xs_y (n-1[=323])} $$

![](https://upload.wikimedia.org/wikipedia/commons/thumb/d/d4/Correlation_examples2.svg/2880px-Correlation_examples2.svg.png)
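
The sample correlation formula can be checked by hand against `np.corrcoef`. The sketch below uses a small made-up data set (`x_toy` and `y_toy` are hypothetical, not the Amazon books data) so that it runs on its own. Note that NumPy's `.std()` divides by $n$ by default, so `ddof=1` is needed to get the sample standard deviations $s_x$ and $s_y$; pandas' `.std()` already divides by $n-1$.

```python
import numpy as np

# Hypothetical toy data, chosen only for illustration
x_toy = np.array([10.0, 15.0, 20.0, 25.0, 40.0])
y_toy = np.array([6.0, 11.0, 13.0, 20.0, 31.0])

n = len(x_toy)
x_bar, y_bar = x_toy.mean(), y_toy.mean()

# ddof=1 gives the sample standard deviation (divide by n-1), as in the formula
s_x, s_y = x_toy.std(ddof=1), y_toy.std(ddof=1)

# Sum of cross-deviations divided by s_x * s_y * (n-1)
r_manual = np.sum((x_toy - x_bar) * (y_toy - y_bar)) / (s_x * s_y * (n - 1))

# Off-diagonal entry of the 2x2 correlation matrix
r_numpy = np.corrcoef(x_toy, y_toy)[0, 1]

print(r_manual, r_numpy)
```

The two printed values should agree up to floating point error.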


In [13]:
import numpy as np
amazon_prices_nonans = amazonbooks[["List Price","Amazon Price"]].dropna()
np.corrcoef(amazon_prices_nonans["List Price"], amazon_prices_nonans["Amazon Price"])
# correlation .95

array([[1.        , 0.95032858],
       [0.95032858, 1.        ]])

In [14]:
np.corrcoef(amazon_prices_nonans["List Price"], 10*amazon_prices_nonans["Amazon Price"])

array([[1.        , 0.95032858],
       [0.95032858, 1.        ]])

## Correlation measures the strength of a *straight line* relationship between two variables
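
To see what "straight line" means here, the following sketch compares an exact linear relationship with an exact but curved one. The arrays are hypothetical and chosen so that `x_sym` is symmetric around zero; in that case a parabola has a correlation of essentially zero even though `y_parabola` is completely determined by `x_sym`.

```python
import numpy as np

x_sym = np.linspace(-3, 3, 61)   # hypothetical x values, symmetric around 0
y_line = 2 * x_sym + 1           # exact straight-line relationship
y_parabola = x_sym ** 2          # exact, but curved, relationship

r_line = np.corrcoef(x_sym, y_line)[0, 1]
r_parabola = np.corrcoef(x_sym, y_parabola)[0, 1]

print(r_line, r_parabola)
```

A correlation near zero therefore does not mean "no relationship", only "no straight line relationship".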

$$ \huge 
\begin{aligned}
\hat \beta_1  &={} r_{xy} \frac {s_{y}}{s_{x}} \quad \text{ slope coefficient}\\
 &={} {\frac {\sum _{i=1}^{n}(x_{i}-{\bar {x}})(y_{i}-{\bar {y}})}{\sum _{i=1}^{n}(x_{i}-{\bar {x}})^{2}}}\\\\
\hat \beta_0  &={} \bar{y} -  \hat \beta_1 \bar{x} \quad \text{ intercept}
\end{aligned}
$$
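
The two expressions for the slope are algebraically the same, since the $(n-1)$ factors in $r_{xy}$, $s_x$ and $s_y$ cancel. A minimal numerical check, assuming a small hypothetical data set (`x_toy`, `y_toy`) and using `np.polyfit` as an independent least squares fit for comparison:

```python
import numpy as np

# Hypothetical toy data, chosen only for illustration
x_toy = np.array([10.0, 15.0, 20.0, 25.0, 40.0])
y_toy = np.array([6.0, 11.0, 13.0, 20.0, 31.0])

# Form 1: correlation times the ratio of sample standard deviations
r_xy = np.corrcoef(x_toy, y_toy)[0, 1]
slope_from_r = r_xy * y_toy.std(ddof=1) / x_toy.std(ddof=1)

# Form 2: sum of cross-deviations over sum of squared x-deviations
dx = x_toy - x_toy.mean()
dy = y_toy - y_toy.mean()
slope_from_sums = np.sum(dx * dy) / np.sum(dx ** 2)

# Intercept: the fitted line passes through (x_bar, y_bar)
intercept = y_toy.mean() - slope_from_sums * x_toy.mean()

# np.polyfit with deg=1 returns [slope, intercept] of the least squares line
slope_np, intercept_np = np.polyfit(x_toy, y_toy, deg=1)

print(slope_from_r, slope_from_sums, slope_np)
print(intercept, intercept_np)
```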

In [17]:
x = amazon_prices_nonans["List Price"]
y = amazon_prices_nonans["Amazon Price"]

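# 0.95032858 is the sample correlation r_xy reported by np.corrcoef above;
# pandas .std() divides by n-1, so these are the sample standard deviations
beta1_hat = 0.95032858 * (y.std()/x.std())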
beta0_hat = y.mean() - beta1_hat*x.mean()
beta1_hat, beta0_hat

(0.8297803288638311, -2.4069593052705063)

# abline [a=$\beta_0$ and b=$\beta_1$]
$$\huge y = \beta_0+\beta_1x$$
and
$$\huge \hat y = \hat \beta_0+\hat \beta_1x$$


In [52]:
x_ = np.linspace(0,140,141)
beta1 = .7
beta0 = 0
y_ = beta0 + beta1*x_
y_hat = beta0_hat + beta1_hat*x_
#fig = px.line(x=x_, y=y_); fig.show()

line_models = pd.DataFrame({'y_ = 0+.7*x': y_, 'y_hat = -2.4+.83*x': y_hat, 'x': x_})
line_models

Unnamed: 0,y_ = 0+.7*x,y_hat = -2.4+.83*x,x
0,0.0,-2.406959,0.0
1,0.7,-1.577179,1.0
2,1.4,-0.747399,2.0
3,2.1,0.082382,3.0
4,2.8,0.912162,4.0
...,...,...,...
136,95.2,110.443165,136.0
137,95.9,111.272946,137.0
138,96.6,112.102726,138.0
139,97.3,112.932506,139.0


In [53]:
line_models_tall = pd.melt(line_models, id_vars=['x'], value_vars=['y_ = 0+.7*x','y_hat = -2.4+.83*x'])
line_models_tall

Unnamed: 0,x,variable,value
0,0.0,y_ = 0+.7*x,0.000000
1,1.0,y_ = 0+.7*x,0.700000
2,2.0,y_ = 0+.7*x,1.400000
3,3.0,y_ = 0+.7*x,2.100000
4,4.0,y_ = 0+.7*x,2.800000
...,...,...,...
277,136.0,y_hat = -2.4+.83*x,110.443165
278,137.0,y_hat = -2.4+.83*x,111.272946
279,138.0,y_hat = -2.4+.83*x,112.102726
280,139.0,y_hat = -2.4+.83*x,112.932506


In [54]:
fig = px.line(line_models_tall, x="x", y="value", color='variable')
fig.show()

In [56]:
import plotly.graph_objects as go
fig = px.line(line_models_tall, x="x", y="value", color='variable')
fig.add_trace(go.Scatter(x=amazonbooks["List Price"], y=amazonbooks["Amazon Price"],
                         mode='markers', name='observed data'))
fig.show()

$$\huge 
\begin{align}
y &={} \beta_0+\beta_1x \quad \text{(model form)}\\
\hat y_i &={} \hat \beta_0+\hat\beta_1x_i \quad \text{(fit model)}\\
\hat \beta_0, \hat \beta_1 &={} \text{argmin}\quad  
\sum_{i=1}^n (y_i-\hat y_i)^2 \\
&={} \text{argmin} \sum_{i=1}^n \hat \epsilon_i^2  \quad \text{(sum of squared }\enclose{horizontalstrike}{errors}\text{ residuals)}\\
\hat \epsilon_i &={} y_i - \hat y_i \quad \text{(residual)}\\
\hat y_i &\color{white}{=}{} \text{are the predictions of the fit model}
\end{align}$$

- We square the residuals because some are positive and some are negative, so their plain sum can cancel to zero even for a poorly fitting line, and minimizing it is not helpful. Squared residuals are never negative, so minimizing their sum places the line through the data cloud such that the total squared vertical distance from the points to the line is as small as possible.
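
The cancellation problem can be illustrated with a small sketch. It assumes hypothetical toy data (`x_toy`, `y_toy`) and compares the least squares line, obtained here with `np.polyfit`, against a flat line at $\bar y$: both have residuals that sum to zero, but only the sum of squared residuals tells them apart.

```python
import numpy as np

# Hypothetical toy data, chosen only for illustration
x_toy = np.array([10.0, 15.0, 20.0, 25.0, 40.0])
y_toy = np.array([6.0, 11.0, 13.0, 20.0, 31.0])

def residuals(b0, b1):
    """Observed y minus the prediction of the line b0 + b1*x."""
    return y_toy - (b0 + b1 * x_toy)

# Least squares line; polyfit with deg=1 returns [slope, intercept]
slope, intercept = np.polyfit(x_toy, y_toy, deg=1)
res_ols = residuals(intercept, slope)

# A flat line at the mean of y ignores x entirely
res_flat = residuals(y_toy.mean(), 0.0)

# The plain sums both cancel to (numerically) zero
print(res_ols.sum(), res_flat.sum())

# The sums of squares differ, and the least squares line has the smaller one
sse_ols = np.sum(res_ols ** 2)
sse_flat = np.sum(res_flat ** 2)
print(sse_ols, sse_flat)
```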




In [58]:
# ols: ordinary least squares
fig = px.scatter(amazon_prices_nonans, x="List Price", y="Amazon Price", trendline="ols")
fig.show()

In [63]:
import statsmodels.api as sm

Y = y
X = x
X = sm.add_constant(X) # adds the beta0_hat intercept
model = sm.OLS(Y,X) # Y outcome (endogenous); X covariate or feature (exogenous)
results = model.fit()
results.params # should be called estimated parameters or estimated coefficients

const        -2.406959
List Price    0.829780
dtype: float64

In [64]:
beta0_hat, beta1_hat

(-2.4069593052705063, 0.8297803288638311)

In [66]:
results.summary()

0,1,2,3
Dep. Variable:,Amazon Price,R-squared:,0.903
Model:,OLS,Adj. R-squared:,0.903
Method:,Least Squares,F-statistic:,3002.0
Date:,"Mon, 08 May 2023",Prob (F-statistic):,2.82e-165
Time:,09:17:04,Log-Likelihood:,-897.98
No. Observations:,324,AIC:,1800.0
Df Residuals:,322,BIC:,1808.0
Df Model:,1,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,-2.4070,0.354,-6.791,0.000,-3.104,-1.710
List Price,0.8298,0.015,54.789,0.000,0.800,0.860

0,1,2,3
Omnibus:,114.617,Durbin-Watson:,1.978
Prob(Omnibus):,0.0,Jarque-Bera (JB):,1830.463
Skew:,0.992,Prob(JB):,0.0
Kurtosis:,14.474,Cond. No.,38.5
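
In simple linear regression the reported R-squared is the square of the sample correlation between the two variables. A quick check using the correlation value printed by `np.corrcoef` earlier in this notebook (the variable names here are only for illustration):

```python
# Sample correlation between List Price and Amazon Price, as printed by np.corrcoef
r_xy = 0.95032858

# Squaring it reproduces the R-squared entry of the summary table
r_squared = r_xy ** 2
print(round(r_squared, 3))  # 0.903
```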


# $\hat \beta_0$ and $\hat \beta_1$ are statistics and therefore have sampling distributions
- This means that p-values (for the null hypotheses $\beta_0=0$ and $\beta_1=0$) are available
- And further, that confidence intervals can be made for $\beta_0$ and $\beta_1$
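
One way to picture the sampling distribution of $\hat \beta_1$ is by simulation: draw many samples from a known line plus noise, refit each time, and look at how the fitted slopes vary. Everything in this sketch is an assumption made for illustration (the "true" coefficients, the noise level, the sample size, and the use of `np.polyfit` for the fit); it is not the Amazon books data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical population line and simulation settings
beta0_true, beta1_true = -2.0, 0.8
n, n_sims = 50, 2000

slopes = np.empty(n_sims)
for s in range(n_sims):
    # New sample from the same population each time
    x_sim = rng.uniform(0, 100, size=n)
    y_sim = beta0_true + beta1_true * x_sim + rng.normal(0, 5, size=n)
    # polyfit with deg=1 returns [slope, intercept]; keep the slope
    slopes[s] = np.polyfit(x_sim, y_sim, deg=1)[0]

# Center and spread of the simulated sampling distribution of the slope
print(slopes.mean(), slopes.std(ddof=1))
```

The fitted slopes scatter around the true slope, and their spread plays the role of the `std err` column in the summary table. With real data only one sample is available, so that spread has to be estimated from the single fit.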

# Linear Regression Models are good because 
1. They provide statistical uncertainty analysis (e.g., confidence intervals and hypothesis testing)
2. They are interpretable

$$\huge \hat y = -2.4 + 0.83 x$$

About 83 cents of each additional \\$1 of list price makes its way into the final Amazon sale price; that is, for a \\$1 increase in the list price, we expect an 83 cent increase (on average) in the Amazon price. 
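
The interpretation can be made concrete by plugging list prices into the fitted line. This is a small sketch using the rounded coefficients from the summary table; the function name `predict_amazon_price` and the example list prices are made up for illustration.

```python
# Rounded estimates from the regression summary
intercept_est = -2.4070
slope_est = 0.8298

def predict_amazon_price(list_price):
    """Predicted Amazon price from the fitted line."""
    return intercept_est + slope_est * list_price

p20 = predict_amazon_price(20.0)
p21 = predict_amazon_price(21.0)

print(round(p20, 2))        # 14.19
print(round(p21 - p20, 2))  # 0.83
```

The difference between the two predictions is the slope itself, whatever the starting list price. The intercept, in contrast, corresponds to a list price of \\$0, where a predicted price of about -\\$2.41 has no practical meaning.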
