## Linear Regression

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/jpn--/python-for-transportation-modeling/blob/master/course-content/tabular-analysis/linear-regression.ipynb)

In [1]:
import os
import pandas as pd
import numpy as np
import statsmodels.api as sm

A popular package for developing linear regression models in Python
is `statsmodels`.  This package includes an extensive set of tools
for statistical modeling, but in this tutorial we will focus on 
linear regression models.

Generally, linear regression models will be developed using a 
`pandas.DataFrame` containing both independent (explanatory) and 
dependent (target) variables.  We'll work with data in the 
households and trips tables from the Jupiter study area.

In [2]:
def data_url(filename):
    url = "https://github.com/jpn--/python-for-transportation-modeling/raw/master/example-package/transportation_tutorials/data/"
    return url + filename
    
def get_data(filename):
    if not os.path.isfile(filename):
        import urllib.request
        urllib.request.urlretrieve(data_url(filename), filename)
    return filename

hh = pd.read_csv(get_data('SERPM8-BASE2015-HOUSEHOLDS.csv.gz'), index_col=0)
hh.set_index('hh_id', inplace=True)

In [3]:
trips = pd.read_csv(get_data('SERPM8-BASE2015-TRIPS.csv.gz'))

If we want to develop a linear regression model to predict trip generation
by households, we'll need to merge these two data tables, tabulating the number
of trips taken by each household. (See the tutorial on 
[grouping](./basic-analysis-with-pandas.html#Grouping) for more details on how
to do this).

In [4]:
hh = hh.merge(
    trips.groupby(['hh_id']).size().rename('n_trips'), 
    left_on=['hh_id'], 
    right_index=True,
)
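
As a small illustration of the tabulate-and-merge step, the sketch below uses two made-up tables (`toy_hh` and `toy_trips` are hypothetical, not the SERPM data). It also shows one thing to watch for: `merge` performs an inner join by default, so a household with no recorded trips would be dropped unless a left join is requested instead.

```python
import pandas as pd

# Hypothetical toy tables standing in for the household and trip data.
toy_hh = pd.DataFrame(
    {"income": [40_000, 85_000, 120_000]},
    index=pd.Index([1, 2, 3], name="hh_id"),
)
toy_trips = pd.DataFrame({"hh_id": [1, 1, 3, 3, 3]})

# Count trips per household; the result is a Series indexed by hh_id.
trip_counts = toy_trips.groupby("hh_id").size().rename("n_trips")

# Default inner join: household 2 has no trips, so it is dropped.
inner = toy_hh.merge(trip_counts, left_index=True, right_index=True)

# Left join: every household is kept, and missing counts are set to 0.
left = toy_hh.join(trip_counts, how="left")
left = left.fillna({"n_trips": 0}).astype({"n_trips": int})

print(inner)
print(left)
```

Whether zero-trip households belong in a trip generation model is a modeling decision; the sketch only shows how each choice is expressed in pandas.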

We can review what variables we now have in the `hh` DataFrame:

In [5]:
hh.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 17260 entries, 1690841 to 1726370
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   home_mgra     17260 non-null  int64 
 1   income        17260 non-null  int64 
 2   autos         17260 non-null  int64 
 3   transponder   17260 non-null  int64 
 4   cdap_pattern  17260 non-null  object
 5   jtf_choice    17260 non-null  int64 
 6   autotech      17260 non-null  int64 
 7   tncmemb       17260 non-null  int64 
 8   n_trips       17260 non-null  int64 
dtypes: int64(8), object(1)
memory usage: 1.3+ MB


If we suppose that the number of trips made by a household is 
a function of income and the number of automobiles owned, we can
create an ordinary least squares regression model, and find the 
best fitting parameters like this:

In [6]:
mod = sm.OLS(
    hh.n_trips, 
    sm.add_constant(hh[['autos','income']])
)
res = mod.fit()
print(res.summary())

                            OLS Regression Results                            
Dep. Variable:                n_trips   R-squared:                       0.229
Model:                            OLS   Adj. R-squared:                  0.229
Method:                 Least Squares   F-statistic:                     2563.
Date:                Tue, 31 May 2022   Prob (F-statistic):               0.00
Time:                        13:22:34   Log-Likelihood:                -48167.
No. Observations:               17260   AIC:                         9.634e+04
Df Residuals:                   17257   BIC:                         9.636e+04
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          2.4460      0.073     33.694      0.0

Note that the `hh` dataframe contains a variety of other columns of data, but
since we're not interested in using them for this model, they can be implicitly 
omitted by selecting only the columns we do want.

Also, we use `sm.add_constant`, which includes a constant in the regression
function.  By default, the `statsmodels` module does *not* include a constant
in an ordinary least squares (OLS) model, so you must explicitly add one 
to the explanatory variables to include it.

The output of the model `summary()` is relatively extensive and includes a 
large number of statistical measures and tests that may or may not interest
you.  The most important of these measures include the coefficient estimates
shown in the center panel of this report, as well as the R-squared measure at
the upper right.

One other item that may be concerning in this report is the second warning at
the bottom, which reports that there may be some numerical problem with the model.
This problem is also reflected in the coefficients themselves, as the 
coefficient for income is many orders of magnitude different from the others.  
This is reasonable and intuitive: the impact of a unit (single dollar) change in annual
household income is very small compared to a unit (single car) change in
automobile ownership.  If we review the standard deviations of these explanatory
variables, we can also see they vary greatly.

In [7]:
sm.add_constant(hh[['autos','income']]).std()

const          0.000000
autos          0.801841
income    112974.383573
dtype: float64

A difference in scale this large is not a problem in statistical theory,
but it can introduce numerical stability problems when the model is computed
in finite-precision arithmetic.  To address this, we can simply rescale one or more 
variables so that their variances are more comparable:

In [8]:
hh['income_100k'] = hh.income / 100_000

In [9]:
mod = sm.OLS(
    hh.n_trips, 
    sm.add_constant(hh[['autos','income_100k']])
)
res = mod.fit()
print(res.summary())

                            OLS Regression Results                            
Dep. Variable:                n_trips   R-squared:                       0.229
Model:                            OLS   Adj. R-squared:                  0.229
Method:                 Least Squares   F-statistic:                     2563.
Date:                Tue, 31 May 2022   Prob (F-statistic):               0.00
Time:                        13:22:34   Log-Likelihood:                -48167.
No. Observations:               17260   AIC:                         9.634e+04
Df Residuals:                   17257   BIC:                         9.636e+04
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
                  coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------
const           2.4460      0.073     33.694      

The R-squared and t-statistics of this model are the same as before,
as this is in effect the same model as above.  But in this revised model, the
magnitude of the income coefficient is now much closer to that of the other
coefficients, and the "condition number" warning is not present in the summary.
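
The effect of rescaling can be checked on synthetic data. The sketch below uses made-up values and plain `numpy` least squares rather than `statsmodels`. It compares the condition number of the design matrix before and after dividing income by 100,000, and confirms that the income coefficient simply scales by the same factor while the fitted values stay the same.

```python
import numpy as np

# Synthetic data (hypothetical), loosely shaped like the household table.
rng = np.random.default_rng(1)
n = 500
autos = rng.integers(0, 4, size=n).astype(float)
income = rng.uniform(10_000, 300_000, size=n)
y = 2.5 + 1.0 * autos + 1e-5 * income + rng.normal(0, 1, size=n)

# Design matrices with a constant column, before and after rescaling income.
X_raw = np.column_stack([np.ones(n), autos, income])
X_scaled = np.column_stack([np.ones(n), autos, income / 100_000])

beta_raw, *_ = np.linalg.lstsq(X_raw, y, rcond=None)
beta_scaled, *_ = np.linalg.lstsq(X_scaled, y, rcond=None)

# The condition number drops sharply once the columns have similar scales.
print(np.linalg.cond(X_raw), np.linalg.cond(X_scaled))

# The income coefficient is multiplied by 100,000; predictions are unchanged.
print(beta_raw[2] * 100_000, beta_scaled[2])
```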

### Piecewise Linear Functions

OLS linear regression models are by design written as linear-in-parameters
models, but that does not mean that the explanatory data cannot be first
transformed, for example by using a piecewise linear expansion.  We can expand a single
column of data in a `pandas.DataFrame` into multiple columns based on defined
breakpoints, for example like this:

In [10]:
piecewise_income = pd.DataFrame({ 
    'income_to_25k': hh.income.clip(None, 25_000),
    'income_25k_to_75k': hh.income.clip(25_000, 75_000) - 25_000,
    'income_75k_and_up': hh.income.clip(75_000) - 75_000,
})
piecewise_income.head()

| hh_id | income_to_25k | income_25k_to_75k | income_75k_and_up |
|---:|---:|---:|---:|
| 1690841 | 25000 | 50000 | 437000 |
| 1690961 | 25000 | 2500 | 0 |
| 1690866 | 25000 | 50000 | 75000 |
| 1690895 | 25000 | 50000 | 29000 |
| 1690933 | 25000 | 50000 | 20000 |


The result is three columns of data instead of one, with the first giving 
income up to the lower breakpoint, the next giving income between the 
two breakpoints, and the last giving the amount of income above the
top breakpoint.
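
The same expansion can be wrapped in a small helper that accepts any list of breakpoints. The function `piecewise_expand` below is a hypothetical convenience written for this illustration, not part of pandas or `statsmodels`, and the income values are made up. A useful sanity check is that the segment columns always add back up to the original value.

```python
import pandas as pd

def piecewise_expand(s, breakpoints):
    """Hypothetical helper: split Series `s` into len(breakpoints) + 1 segments."""
    edges = [None, *breakpoints, None]
    cols = {}
    for lo, hi in zip(edges[:-1], edges[1:]):
        seg = s.clip(lower=lo, upper=hi)
        if lo is not None:
            seg = seg - lo  # measure each segment from its own lower breakpoint
        lo_label = "min" if lo is None else lo
        hi_label = "max" if hi is None else hi
        cols[f"{s.name}_{lo_label}_to_{hi_label}"] = seg
    return pd.DataFrame(cols)

# Made-up incomes, one in each segment plus one exactly on a breakpoint.
income = pd.Series([10_000, 27_500, 75_000, 512_000], name="income")
pieces = piecewise_expand(income, [25_000, 75_000])
print(pieces)

# The segments sum back to the original income for every row.
print((pieces.sum(axis=1) == income).all())
```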

We can readily concatenate this expanded data with any other explanatory 
variables by using `pandas.concat`.  Note that by default this function
concatenates dataframes vertically (stacking rows and aligning on column names), 
but in this case we want to concatenate horizontally (placing columns side by
side and aligning on the row index).  We can achieve this by also passing `axis=1` to the
function in addition to the list of dataframes to concatenate.
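
The difference between the two directions is easy to see on a pair of tiny made-up dataframes (`a` and `b` below are hypothetical):

```python
import pandas as pd

# Two small frames sharing the same row index but holding different columns.
a = pd.DataFrame({"autos": [2, 1]}, index=[101, 102])
b = pd.DataFrame({"income": [0.5, 0.3]}, index=[101, 102])

# Default (axis=0): rows are stacked, and absent columns are filled with NaN.
stacked = pd.concat([a, b])

# axis=1: columns are placed side by side, aligned on the shared index.
side_by_side = pd.concat([a, b], axis=1)

print(stacked)
print(side_by_side)
```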

In [11]:
hh_edited = pd.concat([
    hh.autos,
    piecewise_income / 100_000,
], axis=1)

hh_edited.head()

| hh_id | autos | income_to_25k | income_25k_to_75k | income_75k_and_up |
|---:|---:|---:|---:|---:|
| 1690841 | 2 | 0.25 | 0.5 | 4.37 |
| 1690961 | 1 | 0.25 | 0.025 | 0.0 |
| 1690866 | 2 | 0.25 | 0.5 | 0.75 |
| 1690895 | 2 | 0.25 | 0.5 | 0.29 |
| 1690933 | 2 | 0.25 | 0.5 | 0.2 |


Then we can use this modified dataframe to construct a piecewise linear OLS regression model.
Because the original and modified dataframes have the same index (i.e. the same row labels in the same order),
we can mix them in the OLS definition, using the `n_trips` column from the original as the dependent 
variable and the explanatory data from the modified dataframe.

In [12]:
mod = sm.OLS(
    hh.n_trips, 
    sm.add_constant(hh_edited)
)
res = mod.fit()
print(res.summary())

                            OLS Regression Results                            
Dep. Variable:                n_trips   R-squared:                       0.231
Model:                            OLS   Adj. R-squared:                  0.231
Method:                 Least Squares   F-statistic:                     1297.
Date:                Tue, 31 May 2022   Prob (F-statistic):               0.00
Time:                        13:22:34   Log-Likelihood:                -48143.
No. Observations:               17260   AIC:                         9.630e+04
Df Residuals:                   17255   BIC:                         9.633e+04
Df Model:                           4                                         
Covariance Type:            nonrobust                                         
                        coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------------
const                 1.9177      0.15
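
In a piecewise linear model, each segment coefficient can be read as the slope of the response within that income range. The sketch below illustrates this on synthetic data with made-up slopes, using plain `numpy` least squares rather than `statsmodels`: the fit recovers a different slope for each of the three segments.

```python
import numpy as np

# Synthetic incomes in units of $100k (hypothetical data).
rng = np.random.default_rng(2)
n = 5_000
income = rng.uniform(0, 2.0, size=n)

# Same three segments as above: up to 25k, 25k to 75k, and 75k and up.
seg1 = np.clip(income, None, 0.25)
seg2 = np.clip(income, 0.25, 0.75) - 0.25
seg3 = np.clip(income, 0.75, None) - 0.75

# Made-up "true" intercept and segment slopes used to generate the response.
true_beta = np.array([2.0, 4.0, 1.0, 0.2])
X = np.column_stack([np.ones(n), seg1, seg2, seg3])
y = X @ true_beta + rng.normal(0, 0.02, size=n)

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat)  # estimates should be close to true_beta
```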

### Polynomial Functions

In addition to piecewise linear terms in the regression equation, 
standard OLS allows for any arbitrary non-linear transformation of the
explanatory data.  Students of statistics will be familiar with fitting a polynomial
function with OLS coefficients, and this can be done using `statsmodels`
for example by explicitly computing the desired polynomial terms
before estimating model parameters.

In [13]:
hh['autos^2'] = hh['autos'] ** 2
hh['income^2'] = hh['income_100k'] ** 2

In [14]:
mod = sm.OLS(
    hh.n_trips, 
    sm.add_constant(hh[['autos','income_100k', 'autos^2', 'income^2']])
)
res = mod.fit()
print(res.summary())

                            OLS Regression Results                            
Dep. Variable:                n_trips   R-squared:                       0.230
Model:                            OLS   Adj. R-squared:                  0.229
Method:                 Least Squares   F-statistic:                     1286.
Date:                Tue, 31 May 2022   Prob (F-statistic):               0.00
Time:                        13:22:34   Log-Likelihood:                -48160.
No. Observations:               17260   AIC:                         9.633e+04
Df Residuals:                   17255   BIC:                         9.637e+04
Df Model:                           4                                         
Covariance Type:            nonrobust                                         
                  coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------
const           2.6596      0.127     20.926      

Alternatively, polynomial terms can be created automatically for every 
column in the source data, as well as for interactions, using 
the `PolynomialFeatures` preprocessor from the `sklearn` package.
This tool doesn't automatically maintain the DataFrame formatting
when applied (instead it outputs a simple array of values), 
but it is simple to write a small function that will do so.

In [15]:
def polynomial(x, **kwargs):
    from sklearn.preprocessing import PolynomialFeatures
    poly = PolynomialFeatures(**kwargs)
    arr = poly.fit_transform(x)
    return pd.DataFrame(arr, columns=poly.get_feature_names_out(x.columns), index=x.index)

Then we can use the function to calculate polynomial terms automatically. 
In this example, by setting the `degree` to 3, we get not only squared and 
cubed versions of the two variables, but also all of their interactions
up to degree 3, plus a constant column (named `1`).
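
To make it concrete which terms a full degree-3 expansion of two variables contains, the sketch below enumerates them with the standard library only. It does not call `sklearn`; it just lists every product of the input variables with total degree from 0 to 3, which is the structure of expansion described above.

```python
from itertools import combinations_with_replacement
from math import comb

names = ["autos", "income_100k"]
degree = 3

# Every multiset of variables of size 0..degree corresponds to one term.
terms = []
for d in range(degree + 1):
    for combo in combinations_with_replacement(names, d):
        terms.append(" * ".join(combo) if combo else "1")

for term in terms:
    print(term)

# With n variables and maximum degree d there are C(n + d, d) terms in total.
print(len(terms), comb(len(names) + degree, degree))
```

The count grows quickly: two variables at degree 3 give 10 terms, and adding variables or raising the degree multiplies the number of coefficients to estimate.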

In [16]:
hh_poly = polynomial(hh[['autos','income_100k']], degree=3)
hh_poly.head()

| hh_id | 1 | autos | income_100k | autos^2 | autos income_100k | income_100k^2 | autos^3 | autos^2 income_100k | autos income_100k^2 | income_100k^3 |
|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| 1690841 | 1.0 | 2.0 | 5.12 | 4.0 | 10.24 | 26.2144 | 8.0 | 20.48 | 52.4288 | 134.217728 |
| 1690961 | 1.0 | 1.0 | 0.275 | 1.0 | 0.275 | 0.075625 | 1.0 | 0.275 | 0.075625 | 0.020797 |
| 1690866 | 1.0 | 2.0 | 1.5 | 4.0 | 3.0 | 2.25 | 8.0 | 6.0 | 4.5 | 3.375 |
| 1690895 | 1.0 | 2.0 | 1.04 | 4.0 | 2.08 | 1.0816 | 8.0 | 4.16 | 2.1632 | 1.124864 |
| 1690933 | 1.0 | 2.0 | 0.95 | 4.0 | 1.9 | 0.9025 | 8.0 | 3.8 | 1.805 | 0.857375 |


Great care should be used with this automatic polynomial expansion of the data, 
as it is easy to end up with an overfitted model, especially when using a tool
like OLS that does not attempt to self-correct to limit overfitting.
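
One way to see the risk is to compare fit quality on the estimation data against fresh data as polynomial terms are added. The sketch below uses synthetic data where the true relationship is linear (all values are made up) and plain `numpy` least squares. The in-sample R-squared cannot go down as terms are added, while the out-of-sample value has no such guarantee.

```python
import numpy as np

# Synthetic data (hypothetical): the true relationship is a straight line.
rng = np.random.default_rng(3)
x = rng.uniform(-1, 1, size=40)
y = 1.0 + 2.0 * x + rng.normal(0, 0.5, size=40)
x_new = rng.uniform(-1, 1, size=40)          # fresh data from the same process
y_new = 1.0 + 2.0 * x_new + rng.normal(0, 0.5, size=40)

def r_squared(y_obs, y_fit):
    """R-squared of fitted values against observations."""
    ss_res = np.sum((y_obs - y_fit) ** 2)
    ss_tot = np.sum((y_obs - y_obs.mean()) ** 2)
    return 1 - ss_res / ss_tot

in_sample, out_of_sample = [], []
for degree in (1, 3, 6, 9):
    # Columns 1, x, x^2, ..., x^degree
    X = np.vander(x, degree + 1, increasing=True)
    X_new = np.vander(x_new, degree + 1, increasing=True)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    in_sample.append(r_squared(y, X @ beta))
    out_of_sample.append(r_squared(y_new, X_new @ beta))

print(in_sample)
print(out_of_sample)
```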

In [17]:
mod = sm.OLS(
    hh.n_trips, 
    hh_poly
)
res = mod.fit()
print(res.summary())

                            OLS Regression Results                            
Dep. Variable:                n_trips   R-squared:                       0.236
Model:                            OLS   Adj. R-squared:                  0.235
Method:                 Least Squares   F-statistic:                     590.5
Date:                Tue, 31 May 2022   Prob (F-statistic):               0.00
Time:                        13:22:34   Log-Likelihood:                -48093.
No. Observations:               17260   AIC:                         9.621e+04
Df Residuals:                   17250   BIC:                         9.628e+04
Df Model:                           9                                         
Covariance Type:            nonrobust                                         
                          coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------------
1                       3.6606    