# Regression



## Fitting linear regression models

The `correg.linregm_fit` function fits a multiple linear regression model to a set of independent variables $\mathbf{x}_1,\ldots,\mathbf{x}_m$ and dependent observations $\mathbf{y}=(y_1,\ldots,y_n)$.

For illustrative purposes the exercises use synthetically generated data.  Here the distribution of $\mathbf{y}$ given $\mathbf{x}_1,\ldots,\mathbf{x}_m$ is normal.

Use the code below to generate synthetic observations. Find the example for `correg.linregm_fit` and use it to help you fit a linear regression model to the synthetic observations. Print the standard errors of the parameter estimates.

How does the standard error of parameter estimates change with the number of observations, $n$?

*Note*: you can locate the source code for all the examples by running

`python -m naginterfaces.library.examples --locate`

from the command line.

In [1]:
# import modules
import numpy as np
from naginterfaces.library import rand, correg
import reg_fun
# make sure modules are reloaded if we make changes to them
import importlib
importlib.reload(reg_fun)

# number of independent variables
m = 5
# number of observations
n = 20
# initialize RNG
# statecomm = rand.init_nonrepeat(genid=3)          # either with a random seed
statecomm = rand.init_repeat(genid=3, seed=[32958]) # or with a fixed seed
    
# The independent variables:
x = reg_fun.gen_multivar_x(m, n, statecomm)
# The observations:
y = reg_fun.gen_obs(x, statecomm)
# The independent variables to include:
isx = np.ones(m,int)

fit_lin = correg.linregm_fit(x, isx, y)

print('Term' + ' '*9 + 'Estimate' + ' '*3 + 'Standard Error')
for i, b_i in enumerate(fit_lin.b):
    print('Variable:' + ' '*2 + '{:d} {:.3e}   {:.3e}'.format(
        i, b_i, fit_lin.se[i]
    ))


Term         Estimate   Standard Error
Variable:  0 7.040e-01   6.836e-01
Variable:  1 4.439e-01   4.639e-01
Variable:  2 1.060e+00   4.336e-01
Variable:  3 1.039e+00   3.173e-01
Variable:  4 9.061e-01   3.425e-01
Variable:  5 7.366e-01   3.040e-01
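
One way to explore the question about $n$ without the NAG library is a small NumPy-only simulation. The sketch below is a stand-in under stated assumptions, not the `reg_fun` generators: it draws standard normal regressors, uses unit coefficients and unit noise variance (all hypothetical choices), fits ordinary least squares with a hypothetical helper `ols_standard_errors`, and records the mean standard error for several values of $n$.

```python
import numpy as np

def ols_standard_errors(x, y):
    """Return OLS estimates and standard errors for y ~ intercept + x."""
    n = x.shape[0]
    design = np.column_stack([np.ones(n), x])       # prepend intercept column
    beta, _, _, _ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    dof = n - design.shape[1]                       # residual degrees of freedom
    sigma2 = resid @ resid / dof                    # unbiased residual variance
    cov = sigma2 * np.linalg.inv(design.T @ design)
    return beta, np.sqrt(np.diag(cov))

rng = np.random.default_rng(0)   # assumed seed, unrelated to the NAG generator
m = 5
mean_se = {}
for n in (20, 80, 320, 1280):    # each step quadruples the sample size
    x = rng.standard_normal((n, m))
    y = 1.0 + x @ np.ones(m) + rng.standard_normal(n)
    _, se = ols_standard_errors(x, y)
    mean_se[n] = se.mean()
    print(n, mean_se[n])
```

Under these assumptions the standard errors should fall roughly in proportion to $1/\sqrt{n}$, so quadrupling $n$ should approximately halve them.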


## Fitting Generalized Linear Models (GLMs)

The `correg.glm_normal` function fits a GLM with normally distributed errors, again for a set of independent variables $\mathbf{x}_1,\ldots,\mathbf{x}_m$ and dependent observations $\mathbf{y}=(y_1,\ldots,y_n)$.

Identify a value for the `link` argument that results in the same linear regression model as the previous exercise.  Use `correg.glm_normal` to fit the same data as in the previous exercise.  Check that you get the same parameter estimates.

In [2]:
from naginterfaces.base import utils

# Other parameters for the glm_normal function:
link = 'I'
iom = utils.FileObjManager()
nout = iom.advunit
tol = 5.e-5
eps = 1.e-6

fit_glm = correg.glm_normal(x, isx, y, nout, link=link, tol=tol, eps=eps, io_manager=iom)

print('Term' + ' '*9 + 'Estimate' + ' '*3 + 'Standard Error')
for i, b_i in enumerate(fit_glm.b):
    print('Variable:' + ' '*2 + '{:d} {:.3e}   {:.3e}'.format(
        i, b_i, fit_glm.se[i]
    ))

Term         Estimate   Standard Error
Variable:  0 7.040e-01   6.836e-01
Variable:  1 4.439e-01   4.639e-01
Variable:  2 1.060e+00   4.336e-01
Variable:  3 1.039e+00   3.173e-01
Variable:  4 9.061e-01   3.425e-01
Variable:  5 7.366e-01   3.040e-01
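
The agreement between the two tables can be motivated with a small NumPy sketch. It assumes the textbook iteratively reweighted least squares (IRLS) scheme for GLMs, which is not necessarily how `correg.glm_normal` is implemented, and the helper `irls_normal_identity` and the synthetic data are hypothetical. With normal errors and an identity link the working weights are all one and the working response equals $\mathbf{y}$, so each iteration is an ordinary least-squares solve.

```python
import numpy as np

def irls_normal_identity(design, y, n_iter=3):
    """Textbook IRLS for a normal-error GLM with identity link (illustrative)."""
    beta = np.zeros(design.shape[1])
    for _ in range(n_iter):
        eta = design @ beta        # linear predictor
        mu = eta                   # identity link: mu = eta
        w = np.ones_like(y)        # working weights (dmu/deta)^2 / V(mu) = 1
        z = eta + (y - mu)         # working response, which reduces to y
        wx = design * w[:, None]
        beta = np.linalg.solve(design.T @ wx, wx.T @ z)   # weighted normal equations
    return beta

rng = np.random.default_rng(0)     # assumed seed and data, for illustration only
n, m = 20, 5
design = np.column_stack([np.ones(n), rng.standard_normal((n, m))])
y = design @ np.ones(m + 1) + rng.standard_normal(n)

beta_glm = irls_normal_identity(design, y)
beta_ols, _, _, _ = np.linalg.lstsq(design, y, rcond=None)
print(np.allclose(beta_glm, beta_ols))
```

Because the weights never change, the estimate is already the least-squares solution after the first iteration; further iterations leave it unchanged.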


## GLM prediction intervals

Generate some more synthetic data.  Use the `correg.glm_predict` function to make predictions for these new observations.  Calculate the Root Mean Squared Error (RMSE) of the predictions.  Compare with the standard errors output by `correg.glm_predict`.

In [3]:
# generate test dataset
x = reg_fun.gen_multivar_x(m, n, statecomm)
y = reg_fun.gen_obs(x, statecomm)

# Parameters for the glm_predict function:
errfn = 'N'
bhat = fit_glm.b
cov_bhat = fit_glm.cov
vfobs = True
s = fit_glm.s

(_, _, yhat, se_yhat) = correg.glm_predict(errfn, x, isx, bhat, cov_bhat, vfobs, link=link, s=s)
rmse = np.sqrt( np.mean( (y - yhat) ** 2 ) )  # root mean squared error
print('Root Mean Squared Prediction Error')
print( rmse )
print('Min, Mean and Max Standard Error in Predictions')
print( [np.min(se_yhat), np.mean(se_yhat), np.max(se_yhat)] )

vfobs = False
(_, _, yhat, se_yhat) = correg.glm_predict(errfn, x, isx, bhat, cov_bhat, vfobs, link=link, s=s)
rmse = np.sqrt( np.mean( (y - yhat) ** 2 ) )  # root mean squared error (not printed again)
print('Min, Mean and Max Standard Error in Mean Estimates')
print( [np.min(se_yhat), np.mean(se_yhat), np.max(se_yhat)] )

Root Mean Squared Prediction Error
3.2564737298012454
Min, Mean and Max Standard Error in Predictions
[2.650601849291536, 2.968689936128219, 3.3540313768594596]
Min, Mean and Max Standard Error in Mean Estimates
[0.8926884922476069, 1.557256555956545, 2.240698341517964]
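
In the output above, the squared prediction standard errors exceed the squared mean-estimate standard errors by the same amount at both the minimum and the maximum (about 6.23), which is consistent with a single residual-variance term being added when `vfobs = True`. The NumPy-only sketch below illustrates that relationship for ordinary least squares. The data, seed and variable names are assumptions made for illustration and do not reproduce the `reg_fun` data or the NAG internals.

```python
import numpy as np

rng = np.random.default_rng(1)     # assumed seed, for illustration only
n, m = 20, 5

# Training data: intercept plus m standard normal regressors (hypothetical).
design = np.column_stack([np.ones(n), rng.standard_normal((n, m))])
y = design @ np.ones(m + 1) + rng.standard_normal(n)

beta, _, _, _ = np.linalg.lstsq(design, y, rcond=None)
resid = y - design @ beta
sigma2 = resid @ resid / (n - m - 1)                 # residual variance estimate
cov_beta = sigma2 * np.linalg.inv(design.T @ design)

# New points at which to predict.
new = np.column_stack([np.ones(n), rng.standard_normal((n, m))])
var_mean = np.einsum('ij,jk,ik->i', new, cov_beta, new)   # x' Cov(beta) x per row

se_mean = np.sqrt(var_mean)             # uncertainty in the fitted mean
se_pred = np.sqrt(sigma2 + var_mean)    # adds the variance of a future observation

print(se_mean.min(), se_mean.mean(), se_mean.max())
print(se_pred.min(), se_pred.mean(), se_pred.max())
```

The prediction standard error is the quantity to compare with the RMSE on new data, since the RMSE includes the noise in the new observations as well as the uncertainty in the fitted mean.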


## Identifying outliers

Suppose that there is an error in one of the entries in your dataset, e.g., someone added a couple of extra zeros to an observation by mistake when recording the data.

Use the code below to explore what effect this has on fitting.  How do the regression coefficients change?  Extract the residuals and the leverages from the auxiliary output of the fitting function.  Which of these can be used to identify the outlier / erroneous observation?

*Hint:* to ease printing of multiple output columns, copy the output you are interested in to a `pandas` DataFrame.

In [7]:
import pandas as pd

xc = x.copy()  # copy x
xc[2,3] = 100.0 * x[2,3] # make 1 data point an outlier (3rd observation, 4th variable)

fit_glm = correg.glm_normal(xc, isx, y, nout, link=link, tol=tol, eps=eps, io_manager=iom)

print('Term' + ' '*9 + 'Estimate' + ' '*3 + 'Standard Error')
for i, b_i in enumerate(fit_glm.b):
    print('Variable:' + ' '*2 + '{:d} {:.3e}   {:.3e}'.format(
        i, b_i, fit_glm.se[i]
    ))
    
pd.DataFrame(data=fit_glm.v[:,[4,5]], columns=['Residuals', 'Leverages'])

Term         Estimate   Standard Error
Variable:  0 -2.604e-01   1.501e+00
Variable:  1 7.213e-01   8.174e-01
Variable:  2 4.092e-01   5.921e-01
Variable:  3 2.098e+00   6.612e-01
Variable:  4 1.219e-02   1.305e-02
Variable:  5 1.482e+00   4.854e-01


|   | Residuals | Leverages |
|---|-----------|-----------|
| 0 | 0.796235  | 0.488932  |
| 1 | 0.885847  | 0.151379  |
| 2 | -0.171904 | 0.999607  |
| 3 | 0.0       | 0.176187  |
| 4 | 1.798595  | 0.458517  |
| 5 | -2.818573 | 0.147085  |
| 6 | 0.0       | 0.2393    |
| 7 | -6.87794  | 0.103011  |
| 8 | 4.654881  | 0.324152  |
| 9 | 0.0       | 0.327565  |
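
In the table above, observation 2 has a leverage close to one but a small residual. A NumPy-only sketch of why this can happen is given below. It computes leverages as the diagonal of the hat matrix $H = X(X^TX)^{-1}X^T$ for an ordinary least-squares fit; the data, seed, and the fixed corrupted value of `100.0` are assumptions chosen so that the sketch does not depend on a particular random draw, and they do not reproduce the `reg_fun` data.

```python
import numpy as np

rng = np.random.default_rng(2)     # assumed seed, for illustration only
n, m = 20, 5
x = rng.standard_normal((n, m))
y = 1.0 + x @ np.ones(m) + rng.standard_normal(n)

xc = x.copy()
xc[2, 3] = 100.0                   # corrupt one entry (3rd observation, 4th variable)

design = np.column_stack([np.ones(n), xc])
# Hat matrix H maps observations to fitted values: yhat = H @ y.
hat = design @ np.linalg.solve(design.T @ design, design.T)
leverage = np.diag(hat)            # leverage of each observation
resid = y - hat @ y                # raw residuals

print('largest leverage at row', leverage.argmax())
print('leverage and residual of row 2:', leverage[2], resid[2])
```

A point that is extreme in the space of the independent variables pulls the fitted surface towards itself, so its residual can stay small while its leverage approaches one. In this sketch the corrupted row is therefore flagged by the leverages rather than by the residuals.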
