# Exercise 4: Introduction to Regression

Tasks:

* Implementing OLS
* Implementing diagnostics: bootstrap CI, R²
* Generating random data following a pattern, checking functionality, plotting the data and the regression line
* Student Performance data set (preprocessing [!], transition to statsmodels)
* Experiments with predictors
* Analysis of predictors
* Residual plots


In [1]:
import numpy as np
import pandas as pd
import statsmodels.api as sm


## A Brief Introduction to Statsmodels

Next to sklearn, statsmodels is probably the most popular Python package for regression. While sklearn (which will be covered later in class) focuses on machine learning applications, statsmodels, as the name suggests, is more statistics-oriented. We briefly present the basic functionality of its regression functions by revisiting the Iris data set.

In [None]:
# read in data; the file has no header row, so we pass the column names explicitly
df = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data", names=["sepal_length", "sepal_width", "petal_length", "petal_width", "species"])
df.head()

#### Example 1: Bivariate Prediction
We want to fit a regression model that estimates sepal length from sepal width.

In [None]:
# specify predictors X and target Y
Y = df.sepal_length
X = df.sepal_width
# most importantly: we have to add a constant term to estimate the intercept
X = sm.add_constant(X)
X[:10]

In [None]:
# initialize model: OLS = ordinary least squares
# (the target defined above is named Y, so we pass Y rather than y)
model = sm.OLS(Y, X)
# fit model: only now is the model, i.e. its parameters, computed
results = model.fit()

# print a summary, i.e. an overview of parameters and diagnostics
results.summary()

In [None]:
# get parameters of model, i.e. beta_0 and beta_1
params = results.params
params

In [None]:
# we can apply parameters to obtain the predictions of Y based on X
np.dot(X,params)

In [None]:
# unsurprisingly, statsmodels also provides a direct prediction function:
results.predict(X)

#### Example 2: Multivariate Regression
Now we want to use all other numerical columns of the data as predictors to estimate sepal length.

In [None]:
# statsmodels also provides a formula syntax, which requires an additional import
from statsmodels.formula.api import ols

# formula syntax: dependent variable ~ predictor1 + predictor2 +.....
# note that intercept is fit automatically
model = ols("sepal_length ~ sepal_width + petal_width + petal_length", data=df)

results = model.fit()
results.summary()

In [None]:
# same formula syntax as before; the intercept is again fit automatically
# now we add an interaction term (a:b) and a squared term (np.square(...))
model = ols("sepal_length ~ sepal_width:petal_width + np.square(petal_length) + sepal_width + petal_width + petal_length", data=df)

results = model.fit()
results.summary()

### Task 1: Fitting an artificial data set

We want to implement OLS regression and test it on artificial data. Thus, in this task you may not yet use the statsmodels functions (except for checking results).

#### a) Creating artificial data
Create an artificial dataset which consists of:
* a vector $x$ consisting of 100 (float) values between 0 and 1
* a vector $y = 10x + \varepsilon$, where each error $\varepsilon_i$ is drawn independently from the standard normal distribution.

Create a scatterplot of $x$ against $y$!
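As a starting point, the data described in a) could be generated as in the following minimal sketch. It assumes that "values between 0 and 1" means uniformly distributed random values (an evenly spaced grid would also fit the description), and the fixed seed is only there for reproducibility.

```python
import numpy as np

# Assumption: x is drawn uniformly at random from [0, 1).
rng = np.random.default_rng(42)  # fixed seed, chosen only for reproducibility

x = rng.uniform(0.0, 1.0, size=100)   # 100 float values between 0 and 1
eps = rng.standard_normal(100)        # one standard normal error per element
y = 10 * x + eps                      # y = 10x + epsilon
```

A scatterplot can then be drawn with any plotting library, for example matplotlib's `plt.scatter(x, y)` (matplotlib is not imported in this notebook, so that import would have to be added).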

#### b) Implementing OLS regression
Write a function that takes as input a numpy vector of target values $y$, and a matrix of predictors $X$, and returns the parameter vector $\beta$ resulting from OLS regression. Apply this function to fit a model on your artificial data, compute the predictions, and add the resulting regression line to the plot from a). Remember to add a constant term!
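For orientation: the OLS estimate solves the normal equations $X^\top X \beta = X^\top y$, i.e. $\hat\beta = (X^\top X)^{-1} X^\top y$ whenever $X^\top X$ is invertible. The sketch below is one possible way to implement this, not a reference solution. The function name `ols_fit` is hypothetical, and it uses `np.linalg.lstsq` instead of an explicit matrix inverse because that is numerically more stable.

```python
import numpy as np

def ols_fit(X, y):
    """Return the parameter vector beta that minimizes ||y - X beta||^2."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # lstsq solves the least-squares problem without forming (X^T X)^{-1}
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    return beta

# artificial data as in a); uniform x and the seed are assumptions
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=100)
y = 10 * x + rng.standard_normal(100)

# constant term: a column of ones in front of the predictor
X = np.column_stack([np.ones_like(x), x])

beta = ols_fit(X, y)   # beta[0] is the intercept, beta[1] the slope
y_hat = X @ beta       # predictions, i.e. points on the regression line
```

Plotting `x` against `y_hat` as a line on top of the scatterplot from a) gives the regression line.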

#### c) Diagnostics 1: The Bootstrap

Based on your regression function, write a function that again takes as input a predictor matrix $X$ and a target column $y$, plus an integer $N$, and bootstraps the data $N$ times to compute the 95% confidence interval for each parameter. Specifically, return one vector containing the lower bounds and one vector containing the upper bounds of the intervals for the $\beta$ values.
Apply this function to estimate the confidence intervals on our artificial data.
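One common variant is the percentile bootstrap: resample the rows of $(X, y)$ with replacement, refit the model on each resample, and take the 2.5% and 97.5% percentiles of the resulting parameter estimates. The sketch below assumes this variant; the function name `bootstrap_ci` and the `seed` argument are hypothetical additions.

```python
import numpy as np

def bootstrap_ci(X, y, N, seed=0):
    """Percentile-bootstrap 95% confidence bounds for each OLS parameter."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n = X.shape[0]
    betas = np.empty((N, X.shape[1]))
    for b in range(N):
        # draw n row indices with replacement and refit on the resample
        idx = rng.integers(0, n, size=n)
        betas[b] = np.linalg.lstsq(X[idx], y[idx], rcond=None)[0]
    lower = np.percentile(betas, 2.5, axis=0)
    upper = np.percentile(betas, 97.5, axis=0)
    return lower, upper

# artificial data as in a); uniform x and the seed are assumptions
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=100)
y = 10 * x + rng.standard_normal(100)
X = np.column_stack([np.ones_like(x), x])

lower, upper = bootstrap_ci(X, y, N=500)
```

Each pair `(lower[j], upper[j])` is the estimated interval for parameter $\beta_j$; a larger $N$ gives more stable bounds at the cost of runtime.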

#### d) Diagnostics 2: The $R^2$ score

Write a function that takes as input a ground truth vector $y$ and its prediction $\hat{y}$, and computes the $R^2$ value of that prediction! Does your model explain most of the variance in the artificial data?
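The usual definition is $R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}$, i.e. one minus the ratio of the residual sum of squares to the total sum of squares. A minimal sketch under that definition (the function name `r_squared` is hypothetical):

```python
import numpy as np

def r_squared(y, y_hat):
    """R^2 = 1 - SS_res / SS_tot for ground truth y and prediction y_hat."""
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    ss_res = np.sum((y - y_hat) ** 2)      # residual sum of squares
    ss_tot = np.sum((y - y.mean()) ** 2)   # total sum of squares
    return 1.0 - ss_res / ss_tot

y_demo = np.array([1.0, 2.0, 3.0, 4.0])
perfect = r_squared(y_demo, y_demo)               # perfect prediction
baseline = r_squared(y_demo, np.full(4, 2.5))     # always predicting the mean
```

A perfect prediction yields 1, while always predicting the mean of $y$ yields 0, which makes these two cases handy sanity checks.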

### Task 2: Predicting Student Performance

We revisit the student performance dataset from last week's exercise and aim to estimate the exam performance in math. In this task you may use statsmodels!

#### a) Data Preprocessing

Load the student performance data into a dataframe. Since we want to estimate the students' performance in math, separate this column from the dataframe. On the remaining columns, dummy-code all categorical columns as explained in the lecture. Further, check for collinearities: if two columns are highly correlated (i.e. Pearson correlation > 0.9), remove one of them from the predictors. Remember to add a constant term afterwards.
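The preprocessing steps can be sketched on a toy dataframe as follows. All column names and values here are made up for illustration and are not the columns of the actual student performance file. The sketch also assumes that one dummy level per categorical column is dropped (to avoid perfect collinearity with the constant) and that the absolute correlation is compared against the threshold.

```python
import numpy as np
import pandas as pd

def preprocess(df, target, threshold=0.9):
    """Split off the target, dummy-code, drop collinear columns, add constant."""
    y = df[target]
    X = df.drop(columns=[target])
    # dummy-code categorical columns; drop_first removes one level per column
    X = pd.get_dummies(X, drop_first=True, dtype=float)
    # absolute Pearson correlations, upper triangle only (each pair once)
    corr = X.corr().abs()
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    X = X.drop(columns=to_drop)
    # add the constant term last, so it is not part of the correlation check
    X.insert(0, "const", 1.0)
    return X, y

# hypothetical toy data, NOT the real student performance columns
toy = pd.DataFrame({
    "group": ["a", "b", "a", "c", "b", "c"],
    "reading": [70.0, 80.0, 65.0, 90.0, 75.0, 85.0],
    "reading_copy": [71.0, 81.0, 66.0, 91.0, 76.0, 86.0],  # collinear with reading
    "math": [68.0, 77.0, 60.0, 95.0, 70.0, 88.0],
})

X, y = preprocess(toy, target="math")
```

Of each highly correlated pair, this sketch drops the column that appears later in the dataframe; which of the two to keep is a judgment call.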

#### b) Learning a simple regression model

Apply statsmodels to estimate the exam performance in math from all other columns, without any column interactions. Remember to use a constant term, and properly transform categorical variables. Which significant effects do you observe?

#### c) Adding interactions
Apply statsmodels to fit a regression model that, in addition to the terms of the previous model, includes an interaction term between the test preparation course and each of the continuous columns that remain after preprocessing. To do so, first add these interaction columns to your predictor matrix, and then fit the corresponding model.
Does this interaction yield an improvement or rather cause problems?
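For a dummy-coded column, an interaction term is simply the element-wise product of the dummy with the continuous column. A small sketch of adding such columns to a predictor dataframe; the function name `add_interactions` and the column names `prep_done` and `reading` are hypothetical placeholders:

```python
import pandas as pd

def add_interactions(X, flag_col, continuous_cols):
    """Return a copy of X with one product column per continuous column."""
    X = X.copy()
    for col in continuous_cols:
        # name follows the "a:b" convention of the statsmodels formula syntax
        X[f"{flag_col}:{col}"] = X[flag_col] * X[col]
    return X

# hypothetical toy predictor matrix
X_toy = pd.DataFrame({
    "const": [1.0, 1.0, 1.0, 1.0],
    "prep_done": [0.0, 1.0, 0.0, 1.0],
    "reading": [70.0, 80.0, 65.0, 90.0],
})

X_inter = add_interactions(X_toy, "prep_done", ["reading"])
```

The new column equals the continuous value where the dummy is 1 and zero elsewhere, so it can be strongly correlated with both of its parent columns.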

#### d) Optimizing adjusted $R^2$

Implement a forward selection routine that takes as input a matrix $X$ of predictors, the index of the constant column in $X$, and a target vector $y$. It should return the submatrix of predictors that produces the optimal adjusted $R^2$ value, a vector of the corresponding column indices, the corresponding parameter vector $\beta$, and the optimal adjusted $R^2$ value.
Apply this function to the predictor matrix from part c). Which predictors are left?
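One way to read this task is as greedy forward selection: start with the constant column only, repeatedly add the predictor that increases the adjusted $R^2$ the most, and stop when no remaining predictor improves it. The sketch below follows that reading with plain numpy; the function names are hypothetical. It uses $R^2_{adj} = 1 - (1 - R^2)\frac{n-1}{n-k}$, where $k$ counts all columns including the constant.

```python
import numpy as np

def adjusted_r2(X, y):
    """Fit OLS on X and return (adjusted R^2, beta); X includes the constant."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ beta
    n, k = X.shape
    r2 = 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k), beta

def forward_selection(X, y, const_idx):
    """Greedily add the column that improves adjusted R^2 the most."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    selected = [const_idx]
    remaining = [j for j in range(X.shape[1]) if j != const_idx]
    best_score, best_beta = adjusted_r2(X[:, selected], y)
    while remaining:
        # score every candidate column when added to the current selection
        scores = [(adjusted_r2(X[:, selected + [j]], y)[0], j) for j in remaining]
        score, j = max(scores)
        if score <= best_score:
            break  # no candidate improves the adjusted R^2
        selected.append(j)
        remaining.remove(j)
        best_score, best_beta = adjusted_r2(X[:, selected], y)
    return X[:, selected], selected, best_beta, best_score

# synthetic check: y depends on columns 1 and 2 only (made-up data)
rng = np.random.default_rng(1)
Z = rng.standard_normal((200, 3))
X_demo = np.column_stack([np.ones(200), Z])
y_demo = 3 * Z[:, 0] - 2 * Z[:, 1] + 0.5 * rng.standard_normal(200)

X_sel, idx, beta, score = forward_selection(X_demo, y_demo, const_idx=0)
```

Because the search is greedy, the result is the best model found along one path and not necessarily the global optimum over all column subsets.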

#### e) Checking residuals
Create a residual plot of your final model and give an interpretation of what you observe!