![title](../NAG_logo.png)
# Regression

## Fitting linear regression models

The `correg.linregm_fit` function fits a linear (multiple) regression model for a set of independent variables $\mathbf{x}_1,\ldots,\mathbf{x}_m$ and dependent observations $\mathbf{y}=(y_1,\ldots,y_n)$.

For illustrative purposes the exercises use synthetically generated data.  Here the distribution of $\mathbf{y}$ given $\mathbf{x}_1,\ldots,\mathbf{x}_m$ is normal.

Use the code below to generate synthetic observations.  Find the example for `correg.linregm_fit`.  Use it to help you fit a linear regression model to the synthetic observations.  Print out standard errors of the parameter estimates.

How does the standard error of parameter estimates change with the number of observations, $n$?

*Note*: you can locate the source code for all the examples by running

`python -m naginterfaces.library.examples --locate`

from the command line.

In [1]:
# import modules
# The numpy and naginterfaces imports will be used by your solution:
import numpy as np # pylint: disable=unused-import
from naginterfaces.library import correg # pylint: disable=unused-import
from naginterfaces.library import rand
import reg_fun
# make sure modules are reloaded if we are making changes
import importlib
importlib.reload(reg_fun)

# number of independent variables
m = 5
# number of observations
n = 20  # an integer count; 2e1 would be a float
# initialize RNG
# statecomm = rand.init_nonrepeat(genid=3)          # either with a random seed
statecomm = rand.init_repeat(genid=3, seed=[32958]) # or with a fixed seed
    
# The independent variables:
x = reg_fun.gen_multivar_x(m, n, statecomm)
# The observations:
y = reg_fun.gen_obs(x, statecomm)
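
As a library-independent aid for the question about standard errors and $n$, the following is a minimal sketch that uses only the Python standard library. It does not call `correg.linregm_fit` or `reg_fun`. Instead it fits a one-variable regression in closed form on its own synthetic data. The model (intercept 1, slope 2, unit-variance noise), the helper name `slope_standard_error` and the sample sizes are all assumptions made for illustration.

```python
import math
import random


def slope_standard_error(n, rng):
    """Fit y = a + b*x by least squares and return the standard error of b."""
    x = [rng.gauss(0.0, 1.0) for _ in range(n)]
    # Hypothetical true model: intercept 1, slope 2, unit-variance normal noise.
    y = [1.0 + 2.0 * xi + rng.gauss(0.0, 1.0) for xi in x]

    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))

    b = sxy / sxx              # slope estimate
    a = y_mean - b * x_mean    # intercept estimate
    rss = sum((yi - a - b * xi) ** 2 for xi, yi in zip(x, y))
    sigma2 = rss / (n - 2)     # residual variance, two fitted parameters
    return math.sqrt(sigma2 / sxx)


rng = random.Random(32958)     # fixed seed so the run is repeatable
se_by_n = {n: slope_standard_error(n, rng) for n in (20, 200, 2000)}
for n, se in se_by_n.items():
    print(f"n = {n:5d}   se(slope) = {se:.4f}")
```

Comparing the printed values across the three sample sizes gives a reference point for what to expect from the standard errors returned by your own fit.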

## Fitting Generalized Linear Models (GLMs)

The `correg.glm_normal` function fits a GLM with normally distributed errors, again for a set of independent variables $\mathbf{x}_1,\ldots,\mathbf{x}_m$ and dependent observations $\mathbf{y}=(y_1,\ldots,y_n)$.

Identify a value for the `link` argument that results in the same linear regression model as the previous exercise.  Use `correg.glm_normal` to fit the same data as in the previous exercise.  Check that you get the same parameter estimates.
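
For orientation, the standard textbook form of a GLM with normal errors can be written as below. The notation ($\eta_i$, $g$, $\beta_j$, $\sigma^2$, and the presence of an intercept $\beta_0$) is assumed here for illustration and is not taken from the NAG documentation.

$$
y_i \sim N(\mu_i, \sigma^2), \qquad g(\mu_i) = \eta_i = \beta_0 + \sum_{j=1}^{m} \beta_j x_{ij}, \qquad i = 1, \ldots, n
$$

The link function $g$ connects the mean $\mu_i$ to the linear predictor $\eta_i$. The exercise amounts to choosing $g$ so that $\mu_i$ depends on the $x_{ij}$ in the same way as in the linear regression model.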

## GLM prediction intervals

Generate some more synthetic data.  Use the `correg.glm_predict` function to make predictions for these new observations.  Calculate the Root Mean Squared Error (RMSE) of the predictions.  Compare with the standard errors output by `correg.glm_predict`.
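
The RMSE calculation can be sketched as follows, using only the standard library. The helper name `rmse` and the toy numbers are hypothetical. In the exercise the predicted values would come from `correg.glm_predict`, which is not called here.

```python
import math


def rmse(observed, predicted):
    """Root mean squared error between two equal-length sequences."""
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted must have the same length")
    squared_errors = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Toy values for illustration only.
y_new = [1.0, 2.0, 3.0]
y_pred = [1.0, 2.0, 5.0]
print(rmse(y_new, y_pred))
```

With NumPy arrays the same quantity is `np.sqrt(np.mean((y_new - y_pred) ** 2))`.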

## Identifying outliers

Suppose that there is an error in one of the entries in your dataset, e.g., someone added a couple of extra zeros to an observation by mistake when recording the data.

Use the code below to explore what effect this has on fitting.  How do the regression coefficients change?  Extract the residuals and the leverages from the auxiliary information / output of the fitting.  Which of these can be used to identify the outlier / erroneous observation?

*Hint:* to ease printing of multiple output columns copy the output you are interested in to a `pandas` DataFrame.

In [2]:
xc = x.copy()  # copy x
xc[2,3] = 100.0 * x[2,3] # make 1 data point an outlier (3rd observation, 4th variable)
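
To make the terms residual and leverage concrete, here is a minimal standard-library sketch for a regression on a single variable with an intercept. It is not the NAG routine. The data, the names `x_demo`, `y_demo`, `x_bad` and `fit_diagnostics`, and the corrupted index are assumptions chosen to mirror the exercise. For this simple model the leverage of observation $i$ is $h_i = 1/n + (x_i - \bar{x})^2 / \sum_k (x_k - \bar{x})^2$.

```python
import random


def fit_diagnostics(x, y):
    """Least-squares fit of y = a + b*x; return (residuals, leverages)."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = y_mean - b * x_mean
    residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
    # Diagonal of the hat matrix for a one-variable model with intercept.
    leverages = [1.0 / n + (xi - x_mean) ** 2 / sxx for xi in x]
    return residuals, leverages


rng = random.Random(0)
x_demo = [rng.uniform(1.0, 2.0) for _ in range(20)]
y_demo = [1.0 + 2.0 * xi + rng.gauss(0.0, 0.1) for xi in x_demo]

x_bad = list(x_demo)
x_bad[2] = 100.0 * x_demo[2]   # corrupt the 3rd observation, as in the exercise

residuals, leverages = fit_diagnostics(x_bad, y_demo)
for i, (r, h) in enumerate(zip(residuals, leverages)):
    print(f"{i:2d}  residual = {r:8.4f}  leverage = {h:6.4f}")
```

Printing the two columns side by side makes it easy to compare how each diagnostic responds to the corrupted entry. The same comparison can then be repeated with the outputs of the NAG fit.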