## 1. Linear Regression on Fish Data

The following data set contains the number of fish caught by groups of campers in a state park (taken from https://stats.idre.ucla.edu/r/dae/zip/). Your task is to predict the number of fish caught by a fishing party from the following information:

* how many people are in the group
* the number of children in the group
* the use of live bait
* whether the group came with a camper to the park.

The data set is small: 250 groups that visited a state park and provided this information. For comparison, the data set is already split into a training set and a test set.

In [13]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
%matplotlib inline

In [14]:
# The Fish Data Set
# See example 2 from https://stats.idre.ucla.edu/r/dae/zip/
#"nofish","livebait","camper","persons","child","xb","zg","count"
import os
from urllib.request import urlretrieve
if not os.path.isfile('fishing.npz'):
    print("Downloading")
    urlretrieve('http://www-home.htwg-konstanz.de/~oduerr/data/fishing.npz',filename = 'fishing.npz')
d = np.load('fishing.npz')
Xt = d['Xt'] #"livebait","camper","persons","child"
Xte = d['Xte']
yt = d['yt']
yte = d['yte']
pd.DataFrame(Xt[0:2])

|   | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| 0 | 1.0 | 0.0 | 4.0 | 0.0 |
| 1 | 1.0 | 1.0 | 2.0 | 0.0 |


a) Do a linear regression by creating a design matrix with the intercept term, and use the formulae given in the lecture to determine the coefficients on the training set.

Intercept: this is the expected value of y when all features are 0.
→ Example: with no persons, no children, no camper, and no live bait, the model still estimates a baseline value.
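To illustrate the role of the intercept column, here is a minimal sketch on small synthetic data (not the fish data). The names `X_raw`, `X_design`, `beta`, and `x_zero` are illustrative assumptions, and the closed-form expression is assumed to be the one meant by "the formulae given in the lecture", since it matches the code used in this section.

```python
import numpy as np

# Synthetic, noise-free data for illustration only (not the fish data set).
rng = np.random.default_rng(0)
X_raw = rng.normal(size=(50, 2))                      # two made-up features
y = 3.0 + 1.5 * X_raw[:, 0] - 2.0 * X_raw[:, 1]       # true intercept is 3.0

# Design matrix: a leading column of ones carries the intercept.
X_design = np.column_stack([np.ones(len(X_raw)), X_raw])

# Closed-form least squares: beta = (X^T X)^{-1} X^T y
beta = np.linalg.inv(X_design.T @ X_design) @ X_design.T @ y

# Prediction when every feature is 0: only the intercept remains.
x_zero = np.array([1.0, 0.0, 0.0])
print("beta:", beta)
print("prediction at all-zero features:", x_zero @ beta)
```

Because the target here is exactly linear, the first entry of `beta` recovers the intercept, and the prediction for the all-zero feature vector equals that entry.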

In [25]:
xTrain = pd.DataFrame(Xt, columns=['livebait', 'camper', 'persons', 'child'])
xTest  = pd.DataFrame(Xte, columns=['livebait', 'camper', 'persons', 'child'])

yTrain = pd.DataFrame(yt, columns=['nofish'])
yTest  = pd.DataFrame(yte, columns=['nofish'])

# Add the intercept column (a constant 1) and move it to the front
xTrain["intercept"] = 1
xTrain = xTrain[['intercept', 'livebait', 'camper', 'persons', 'child']]
xTest["intercept"] = 1
xTest = xTest[['intercept', 'livebait', 'camper', 'persons', 'child']]

In [16]:

# Closed-form least-squares solution: beta = (X^T X)^{-1} X^T y
coefficients = np.linalg.inv(xTrain.T @ xTrain) @ xTrain.T @ yTrain

print("Coefficients (intercept first):")
print(coefficients)

Coefficients (intercept first):
    nofish
0 -8.49223
1  2.48221
2  2.95431
3  4.64954
4 -5.47160
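Explicitly inverting $X^T X$ follows the closed-form expression directly, but it is not the only way to obtain the same coefficients. As a hedged sketch on synthetic data (the arrays `X` and `y` below are made up), `np.linalg.solve` and `np.linalg.lstsq` give the same result and are usually preferred numerically:

```python
import numpy as np

# Synthetic data for illustration only (not the fish data set).
rng = np.random.default_rng(1)
X = np.column_stack([np.ones(40), rng.normal(size=(40, 3))])  # ones column = intercept
y = X @ np.array([1.0, -2.0, 0.5, 3.0]) + rng.normal(scale=0.1, size=40)

# 1) Explicit inverse, as in the closed-form expression
beta_inv = np.linalg.inv(X.T @ X) @ X.T @ y

# 2) Solve the normal equations (X^T X) beta = X^T y without forming the inverse
beta_solve = np.linalg.solve(X.T @ X, X.T @ y)

# 3) Least-squares solver working directly on X
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(beta_inv)
print(beta_solve)
print(beta_lstsq)
```

All three vectors agree up to floating-point error for a well-conditioned design matrix such as this one.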


In [17]:
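# xTest already contains the intercept column (added above), so no further
# column of ones is needed; to_numpy() avoids pandas label alignment in @.
yPred = xTest.to_numpy() @ coefficients
yPred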

|   | nofish |
|---|---|
| 0 | 5.45639 |
| 1 | 2.467 |
| 2 | 10.10593 |
| 3 | 3.28906 |
| 4 | -1.36048 |
| 5 | -2.18254 |
| 6 | 5.42131 |
| 7 | 2.467 |
| 8 | 1.59383 |
| 9 | 6.24337 |


b) Repeat a) but this time with `LinearRegression` from `sklearn.linear_model`

In [27]:
from sklearn.linear_model import LinearRegression

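# Create the model
model = LinearRegression()

# Fit on the training set.
# Note: xTrain still contains the constant intercept column; sklearn fits its
# own intercept by default, so that column receives a coefficient of 0.
model.fit(xTrain, yTrain)

# Results
print("Intercept:", model.intercept_)
print("Coefficients:", model.coef_)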

yPred = model.predict(xTest)
yPred

Intercept: [-8.49222821]
Coefficients: [[ 0.          2.4822138   2.95430727  4.64953914 -5.47160051]]


array([[ 5.45638921],
       [ 2.46700249],
       [10.10592835],
       [ 3.28906387],
       [-1.36047527],
       [-2.18253664],
       [ 5.42130976],
       [ 2.46700249],
       [ 1.593832  ],
       [ 6.24337113],
       [-2.18253664],
       [ 1.64494112],
       [ 6.24337113],
       [ 6.24337113],
       [-3.00459802],
       [-2.18253664],
       [ 3.76115733],
       [ 1.593832  ],
       [-1.36047527],
       [10.89291027],
       [ 0.77177062],
       [ 2.46700249],
       [-3.00459802],
       [ 1.593832  ],
       [ 6.24337113],
       [ 1.593832  ],
       [-0.8883818 ],
       [ 7.11654163],
       [-1.36047527],
       [-2.18253664],
       [ 3.28906387],
       [10.0708489 ],
       [ 0.77177062],
       [ 1.64494112],
       [ 6.24337113],
       [ 6.24337113],
       [ 0.77177062],
       [-1.36047527],
       [10.89291027],
       [15.54244941],
       [ 2.93909596],
       [ 1.64494112],
       [10.0708489 ],
       [10.0708489 ],
       [ 0.77177062],
       [ 7
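The leading `0.` in the printed coefficients arises because the design matrix already carries a column of ones while `LinearRegression` also fits its own intercept. The following sketch on synthetic data (all array names are illustrative assumptions) shows two equivalent ways to avoid the redundant column: pass only the raw features, or keep the ones column and set `fit_intercept=False`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data for illustration only (not the fish data set).
rng = np.random.default_rng(2)
X_raw = rng.normal(size=(60, 2))
y = 4.0 + 2.0 * X_raw[:, 0] - 1.0 * X_raw[:, 1] + rng.normal(scale=0.1, size=60)
X_design = np.column_stack([np.ones(len(X_raw)), X_raw])

# Option 1: raw features only; sklearn estimates the intercept itself (default).
m1 = LinearRegression().fit(X_raw, y)

# Option 2: explicit ones column; tell sklearn not to add a second intercept.
m2 = LinearRegression(fit_intercept=False).fit(X_design, y)

print("Option 1:", m1.intercept_, m1.coef_)
print("Option 2:", m2.coef_)   # first entry plays the role of the intercept
```

With option 2, `coef_[0]` is the intercept and `intercept_` is reported as 0.0, which mirrors the layout of the manual solution from part a).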

c) Determine the root mean square error (RMSE) and the average negative log-likelihood (NLL) on the test set. For the NLL we assume that the conditional probability distribution (CPD) $p(y|x)$ is given by the density of a Gaussian with constant variance $\sigma^2$. The slope and the intercept of the linear model can be estimated as shown in the lecture. To estimate $\sigma^2$, you can use the variance of the residuals. Use the variance estimator with $1/N$.
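For reference, the quantities in this task can be written out as follows. The notation ($\hat y_i$ for the model prediction, $r_i = y_i - \hat y_i$ for the residual, $\bar r$ for the mean residual) is an assumption made here and may differ from the lecture's notation.

$$
\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N} r_i^2},
\qquad
\hat\sigma^2 = \frac{1}{N}\sum_{i=1}^{N} \left(r_i - \bar r\right)^2
$$

$$
\mathrm{NLL} = -\frac{1}{N}\sum_{i=1}^{N} \log p(y_i \mid x_i)
= \frac{1}{2}\log\left(2\pi\hat\sigma^2\right) + \frac{1}{2\hat\sigma^2}\cdot\frac{1}{N}\sum_{i=1}^{N} r_i^2
$$

In the special case where $\hat\sigma^2$ equals the mean squared residual of the same data, the second term reduces to $\tfrac{1}{2}$, so $\mathrm{NLL} = \tfrac{1}{2}\log(2\pi\hat\sigma^2) + \tfrac{1}{2}$ with the natural logarithm. For $\hat\sigma^2 \approx 73.7559$ this evaluates to about $3.569$.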

Result: $\mathrm{RMSE} \approx 8.58812$, $\hat{\sigma}^2 \approx 73.7559$, $\mathrm{NLL} \approx 3.569$
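A minimal sketch of how these metrics could be computed is given below. The helper names `rmse` and `gaussian_nll` and the tiny example arrays are hypothetical; the task text does not spell out which residuals should be used to estimate $\sigma^2$, so the variance is passed in explicitly.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean square error between targets and predictions."""
    residuals = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean(residuals ** 2)))

def gaussian_nll(y_true, y_pred, sigma2):
    """Average negative log-likelihood under N(y_pred, sigma2), natural log."""
    residuals = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    per_sample = 0.5 * np.log(2 * np.pi * sigma2) + residuals ** 2 / (2 * sigma2)
    return float(np.mean(per_sample))

# Tiny made-up example (not the fish data set).
y_true = np.array([0.0, 2.0, 5.0, 1.0])
y_pred = np.array([1.0, 1.0, 3.0, 2.0])

residuals = y_true - y_pred
sigma2 = float(np.var(residuals))   # np.var defaults to ddof=0, i.e. the 1/N estimator

print("RMSE:  ", rmse(y_true, y_pred))
print("sigma2:", sigma2)
print("NLL:   ", gaussian_nll(y_true, y_pred, sigma2))
```

For the exercise, the same two functions would be called with `yte`, the test-set predictions, and the estimated variance.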