# Linear Regression: A Brief Introduction

Suppose you are given the data in the picture below and asked to draw a straight line that you feel best describes the relationship between the square footage of a house (on the x-axis) and its price (on the y-axis). What might that line look like? Seriously, pause for a moment and imagine a line on that graph. What would it look like?

[figure with point cloud in approximate line]


Most of us would probably end up drawing a line that looks something like the line in the figure below. And that, in a basic sense, is precisely what we are trying to do with linear regression: use mathematics to estimate a "line of best fit" that (we hope) gives a pretty good summary of the relationship between different variables.

[figure with red line]

But how exactly does linear regression do this? Odds are, you probably aren't even quite sure what guided you to think about a line very similar to the red one in the figure above — you just followed your intuition. But the basic idea that probably guided how you drew the line in your head is very similar to the principle used by a linear regression: try to draw a line that, on average, is as close as possible to all the data points plotted.

To be more specific, a linear regression estimates the line that minimizes the sum of squared errors between the line of best fit and each data point. Indeed, linear regression is often called "Ordinary Least Squares" or "Least Squares Regression" precisely because it tries to find the line that minimizes (gives rise to the smallest or least) sum of squared errors. There are reasons that linear regression minimizes the sum of *squared* errors (instead of just the sum of errors), but those reasons aren't crucial to getting an intuitive sense of how linear regression works.
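To make "least squares" concrete, here is a minimal sketch using made-up square footage and price numbers (the data and variable names are assumptions for illustration, not the data behind the figures). It finds the least-squares line with NumPy and then checks that nudging the slope or the intercept away from that line only makes the sum of squared errors larger.

```python
import numpy as np

# Made-up data for illustration only: square footage and price (in $1,000s).
sqft = np.array([850.0, 1100.0, 1400.0, 1650.0, 2000.0, 2400.0])
price = np.array([155.0, 190.0, 260.0, 275.0, 340.0, 410.0])


def sum_squared_errors(intercept, slope):
    """Sum of squared vertical distances between the line and each point."""
    predicted = intercept + slope * sqft
    return float(np.sum((price - predicted) ** 2))


# np.polyfit with deg=1 returns the least-squares slope and intercept.
slope, intercept = np.polyfit(sqft, price, deg=1)
best_sse = sum_squared_errors(intercept, slope)

# Any other line has a larger sum of squared errors.
nudged_slope_sse = sum_squared_errors(intercept, slope * 1.05)
nudged_intercept_sse = sum_squared_errors(intercept + 10.0, slope)

print(f"best fit: price = {intercept:.2f} + {slope:.4f} * sqft")
print(f"SSE at best fit:           {best_sse:.1f}")
print(f"SSE with steeper slope:    {nudged_slope_sse:.1f}")
print(f"SSE with higher intercept: {nudged_intercept_sse:.1f}")
```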

## Representing A Regression Line

While this kind of picture is the easiest way to visualize a simple regression, this is not how most regressions are presented for reasons we'll discuss below. Instead, regressions generally take advantage of the fact that a line can be represented with an intercept (where the line crosses the y-axis) and a slope (the amount the line rises when you move one unit along the x-axis). In math notation, this generally gets written something like:

$$\text{price} = \alpha + \beta * \text{square footage} + \epsilon$$

The variable we're trying to explain (here, price) is on the left-hand side of the equation, and we write that the price of a house is equal to a constant term (the intercept, the value of $\alpha$) plus the house's square footage times the slope of the line of best fit ($\beta$). The last term, $\epsilon$, is the error associated with a given observation (the difference between the house's true price and the value of the line of best fit for that house). Mathematically, it works out that, as long as the regression includes an intercept, the error terms ($\epsilon$) from the fitted regression always add up to zero.
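The claim that the errors add up to zero can be checked numerically. The sketch below uses made-up data (an assumption for illustration) and NumPy's `np.linalg.lstsq` to estimate $\alpha$ and $\beta$, then sums the resulting errors.

```python
import numpy as np

# Made-up data for illustration only (price in $1,000s).
sqft = np.array([900.0, 1200.0, 1500.0, 1800.0, 2100.0, 2600.0])
price = np.array([160.0, 215.0, 240.0, 310.0, 335.0, 430.0])

# Design matrix: a column of ones (for alpha) next to square footage (for beta).
X = np.column_stack([np.ones_like(sqft), sqft])
(alpha, beta), *_ = np.linalg.lstsq(X, price, rcond=None)

# epsilon for each house: true price minus the value on the fitted line.
residuals = price - (alpha + beta * sqft)

print(f"alpha = {alpha:.3f}, beta = {beta:.4f}")
print(f"sum of errors = {residuals.sum():.2e}")  # zero up to floating-point error
```

The column of ones is what gives the model its intercept; without it, the errors would generally not sum to zero.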

So suppose we ran a regression, and the regression model estimated that $\alpha = SOMETHING$ and $\beta = SOMETHING$. From this, we could conclude that the model's estimate is that a 1,500 square foot house would have a price of ..... From this model, we could also infer that if someone owned a 1,500 square foot house and was thinking of building an extension that would add 500 square feet to the house, then the model's best guess would be that the price of the house would increase by $500 * \beta = SOMETHING$.
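The arithmetic behind these two kinds of statements is simple enough to sketch. The coefficients below are hypothetical values picked only to show the calculation; they are not estimates from any data, and the function name is likewise an assumption.

```python
# Hypothetical coefficients chosen only to illustrate the arithmetic;
# they are NOT estimates from any real data.
alpha = 50_000.0  # intercept, in dollars
beta = 120.0      # slope, in dollars per square foot


def predicted_price(square_feet):
    """Value of the regression line at a given square footage."""
    return alpha + beta * square_feet


price_1500 = predicted_price(1500)
price_2000 = predicted_price(2000)

# Adding 500 square feet moves the prediction by exactly 500 * beta,
# regardless of the starting size, because the model is a straight line.
extension_effect = price_2000 - price_1500

print(price_1500)
print(extension_effect)
```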

While the equation above shows us how regressions are often written out in books or papers, that's not quite how regression models are presented in Python. In Python, this regression would look like:


In [None]:
import statsmodels.api as sm
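As a sketch of what the house-price regression could look like with the statsmodels formula interface (the same `smf.ols` call used later in this document), the example below fits the model on simulated data. The data frame, its column names, and the "true" intercept and slope used to generate it are all assumptions for illustration.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated data for illustration only: the "true" intercept and slope
# (40 and 0.15) are arbitrary choices, not real housing estimates.
rng = np.random.default_rng(0)
sqft = rng.uniform(800, 3000, size=200)
price = 40 + 0.15 * sqft + rng.normal(0, 20, size=200)  # price in $1,000s
houses = pd.DataFrame({"price": price, "sqft": sqft})

# The formula mirrors the equation: dependent variable ~ explanatory variable.
# An intercept (alpha) is included automatically.
results = smf.ols("price ~ sqft", data=houses).fit()

alpha_hat = results.params["Intercept"]
beta_hat = results.params["sqft"]
print(f"alpha = {alpha_hat:.2f}, beta = {beta_hat:.4f}")
```

The estimates should land close to, but not exactly on, the values used to simulate the data, since the noise term plays the role of $\epsilon$.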

## Multivariate Regression

Up until now, we've only looked at regressions in the context of two variables: one we're trying to understand (the "dependent variable") and one we think helps to explain variation in the first (our "explanatory variable"). But this doesn't really tell the whole story of linear regression. Indeed, what makes linear regressions powerful is not their ability to model the relationship between two variables, but between a single dependent variable and an arbitrary number of explanatory variables that we think *jointly and simultaneously* explain the variation we observe in the dependent variable. And it is for that reason that linear regression is also often called "multiple regression."
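A minimal sketch of a regression with two explanatory variables is shown below. The second variable (number of bedrooms), the simulated data, and the coefficients used to generate it are assumptions for illustration. The only change from the one-variable case is an extra column in the design matrix, and least squares estimates all coefficients at once.

```python
import numpy as np

# Simulated data for illustration only; the coefficients 30, 0.12 and 15
# are arbitrary choices used to generate the prices (in $1,000s).
rng = np.random.default_rng(1)
n = 500
sqft = rng.uniform(800, 3000, size=n)
bedrooms = rng.integers(1, 6, size=n).astype(float)
price = 30 + 0.12 * sqft + 15 * bedrooms + rng.normal(0, 10, size=n)

# One column per term: intercept, square footage, bedrooms.
X = np.column_stack([np.ones(n), sqft, bedrooms])
coefs, *_ = np.linalg.lstsq(X, price, rcond=None)
intercept_hat, sqft_hat, bedrooms_hat = coefs

print(f"intercept = {intercept_hat:.2f}")
print(f"sqft      = {sqft_hat:.4f}  (price change per extra square foot, bedrooms held fixed)")
print(f"bedrooms  = {bedrooms_hat:.2f}  (price change per extra bedroom, sqft held fixed)")
```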



## Reading Regression Output

In [1]:
import pandas as pd
import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf

pd.set_option("mode.copy_on_write", True)

dat = sm.datasets.get_rdataset("Guerry", "HistData").data
results = smf.ols("Lottery ~ Literacy + np.log(Pop1831)", data=dat).fit()
results.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | Lottery | R-squared: | 0.348 |
| Model: | OLS | Adj. R-squared: | 0.333 |
| Method: | Least Squares | F-statistic: | 22.2 |
| Date: | Tue, 04 Jun 2024 | Prob (F-statistic): | 1.9e-08 |
| Time: | 12:10:29 | Log-Likelihood: | -379.82 |
| No. Observations: | 86 | AIC: | 765.6 |
| Df Residuals: | 83 | BIC: | 773.0 |
| Df Model: | 2 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | 246.4341 | 35.233 | 6.995 | 0.000 | 176.358 | 316.510 |
| Literacy | -0.4889 | 0.128 | -3.832 | 0.000 | -0.743 | -0.235 |
| np.log(Pop1831) | -31.3114 | 5.977 | -5.239 | 0.000 | -43.199 | -19.424 |

| | | | |
|---|---|---|---|
| Omnibus: | 3.713 | Durbin-Watson: | 2.019 |
| Prob(Omnibus): | 0.156 | Jarque-Bera (JB): | 3.394 |
| Skew: | -0.487 | Prob(JB): | 0.183 |
| Kurtosis: | 3.003 | Cond. No. | 702.0 |
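As a rough guide to how the numbers in the coefficient table fit together, the sketch below redoes a few of the calculations by hand using values copied from the table. The Literacy and Pop1831 inputs used for the prediction are made up for illustration and do not describe any actual observation in the data.

```python
import math

# Values copied from the coefficient table above.
intercept = 246.4341
coef_literacy, se_literacy = -0.4889, 0.128
coef_log_pop = -31.3114

# The t column is the coefficient divided by its standard error
# (the table's -3.832 uses unrounded values, so this is only approximate).
t_literacy = coef_literacy / se_literacy

# Predicted Lottery for hypothetical inputs (these two values are made up).
literacy, pop1831 = 40.0, 350.0
predicted = intercept + coef_literacy * literacy + coef_log_pop * math.log(pop1831)

# Because population enters in logs, a 1% increase in Pop1831 shifts the
# prediction by coef_log_pop * log(1.01), holding Literacy fixed.
effect_of_1pct_pop = coef_log_pop * math.log(1.01)

print(f"t for Literacy ~ {t_literacy:.2f}")
print(f"predicted Lottery = {predicted:.2f}")
print(f"effect of +1% population = {effect_of_1pct_pop:.3f}")
```

The coefficient on Literacy reads the same way as $\beta$ in the house-price example: a one-unit increase in Literacy is associated with a change of about -0.49 in the predicted value of Lottery, holding population fixed.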


## Want to Learn More?

Great! Our colleague from the statistics department, Mine Çetinkaya-Rundel, has developed an entire course on linear regression and modeling that we think is terrific (and judging by the ratings the course has received, past students do too!). You can check it out here: [Linear Regression and Modeling](https://www.coursera.org/learn/linear-regression-model). The course uses R for its coding exercises, but the focus of the class is on how linear regression works and how results can be interpreted, which is the same whether you're using R, Python, or doing the matrix algebra on a napkin.