In [177]:
# Initialize Otter
import otter

grader = otter.Notebook("10-exercise-pids2024.ipynb")

# Exercise sheet 10

**Hello everyone!**

**Points: 15**

Topics of this exercise sheet are:
* Regression with scikit-learn
* Neural networks
* Overfitting

Please let us know if you have questions or problems! <br>
Contact us during the exercise session or on [Piazza](https://piazza.com/unibas.ch/spring2024/63982).

**Automatic Feedback**

This notebook can be graded automatically using Otter grader. To find out how many points you get, run `grader.check_all()` in a new cell.

In [178]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.neural_network import MLPRegressor, MLPClassifier
from sklearn.preprocessing import StandardScaler
from sklearn import linear_model
from sklearn.metrics import r2_score, mean_squared_error

# Question 1: Diagnosing diabetes using linear regression (5 Points)
In this initial exercise, you will implement a program that learns to diagnose diabetes based on various measurements. The prognosis here is a numerical value that describes the progression of the disease. Therefore, this is a regression problem.

Let's load the data:

In [179]:
df = pd.read_csv("daten/diabetes.csv")
df.head()

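|   | age | sex | bmi | bp | s1 | s2 | s3 | s4 | s5 | s6 | target |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 59 | 2 | 32.1 | 101.0 | 157 | 93.2 | 38.0 | 4.0 | 4.8598 | 87 | 151 |
| 1 | 48 | 1 | 21.6 | 87.0 | 183 | 103.2 | 70.0 | 3.0 | 3.8918 | 69 | 75 |
| 2 | 72 | 2 | 30.5 | 93.0 | 156 | 93.6 | 41.0 | 4.0 | 4.6728 | 85 | 141 |
| 3 | 24 | 1 | 25.3 | 84.0 | 198 | 131.4 | 40.0 | 5.0 | 4.8903 | 89 | 206 |
| 4 | 50 | 1 | 23.0 | 101.0 | 192 | 125.4 | 52.0 | 4.0 | 4.2905 | 80 | 135 |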


Each row represents the measurements for one patient. The last column, `target`, quantifies the progression of diabetes after one year. As the preview above shows, the measurements are given in their original units and are not yet standardized. The other columns correspond to the following measurements:

* `age`: Age in years.
* `sex`: Biological sex, encoded as a numerical value.
* `bmi`: Body Mass Index (BMI).
* `bp`: Average blood pressure (in mm Hg).
* `s1`: Total serum cholesterol level (in mg/dL).
* `s2`: Low-density cholesterol (LDL) (in mg/dL).
* `s3`: High-density cholesterol (HDL) (in mg/dL).
* `s4`: Ratio of total cholesterol to HDL (as documented for the scikit-learn version of this dataset).
* `s5`: Possibly the logarithm of the serum triglyceride level (as documented for the scikit-learn version of this dataset).
* `s6`: Fasting blood sugar level (in mg/dL).

### Question 1a) (1 Point)

Create two DataFrames X and y. Here, y should contain only the target values and X all the other columns.

In [180]:
class Question1a:
    X = df[df.columns.drop(['target'])]
    y = df['target']

In [181]:
grader.check("Question 1a")

### Question 1b) Fit the linear regression model (2 Points)

Create a linear regression model using [scikit-learn](https://scikit-learn.org/stable/index.html) and fit it to the diabetes dataset.
(You can find explanations and examples [here](https://scikit-learn.org/stable/modules/linear_model.html#ordinary-least-squares)). Use your model to make predictions for the data in `X` and store them in the variable `y_pred`.
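
To see what fitting a linear regression computes, here is a minimal sketch of ordinary least squares for a single feature in plain Python. The helper name `fit_line` and the toy data are made up for illustration; scikit-learn's `LinearRegression` solves the same least-squares problem for several features at once.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) that minimize the sum of squared residuals."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance(x, y) / variance(x); the 1/n factors cancel.
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov_xy / var_x
    # The fitted line always passes through the point (mean_x, mean_y).
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Toy data that lies exactly on the line y = 2x + 1 (illustrative only).
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

slope, intercept = fit_line(xs, ys)
predictions = [slope * x + intercept for x in xs]
print(f"slope: {slope}, intercept: {intercept}")
```

Because the toy data has no noise, the fitted line reproduces the targets exactly; on real measurements the predictions will deviate from the targets.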

In [182]:
class Question1b:
    X = Question1a.X
    y = Question1a.y

    reg = linear_model.LinearRegression()
    reg.fit(X, y)

    print(reg)

    y_pred = reg.predict(X)

LinearRegression()


In [183]:
grader.check("Question 1b")

### Question 1c) Evaluate the linear regression model (2 Points)

Next, evaluate your model. Scikit-learn provides various metrics for this purpose. We will use the mean squared error ([mean_squared_error](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_squared_error.html#sklearn.metrics.mean_squared_error)) and the [r2_score](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.r2_score.html#sklearn.metrics.r2_score). While the mean squared error should be familiar, the r2_score may not be. The r2_score (also known as the coefficient of determination, a measure of explained variance) is a dimensionless number that typically lies between 0 and 1. A value of 1 means that the predictions match the data perfectly. A value of 0 means that the prediction is no better than always predicting the mean of the target values. Negative values indicate that the prediction is even worse than ignoring the measurements and always predicting the mean.
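
The three cases described above (1, 0, and negative) can be checked by hand. The following sketch computes both metrics in plain Python from their standard definitions, with R² = 1 - SS_res / SS_tot. The function names and the toy values are made up for illustration.

```python
def mse(y_true, y_pred):
    """Mean of the squared differences between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


y_true = [1.0, 2.0, 3.0, 4.0]          # mean is 2.5
perfect = [1.0, 2.0, 3.0, 4.0]         # reproduces the targets
always_mean = [2.5, 2.5, 2.5, 2.5]     # ignores the inputs entirely
reversed_order = [4.0, 3.0, 2.0, 1.0]  # worse than predicting the mean

r2_perfect = r2(y_true, perfect)          # 1.0
r2_mean = r2(y_true, always_mean)         # 0.0
r2_reversed = r2(y_true, reversed_order)  # -3.0
mse_reversed = mse(y_true, reversed_order)  # 5.0
print(r2_perfect, r2_mean, r2_reversed, mse_reversed)
```

Note that the mean squared error depends on the units of the target, whereas R² does not, which is why R² is easier to compare across datasets.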

Compute the r2 score and the mean squared error using the metrics from scikit-learn. How do you interpret the results?

In [184]:
class Question1c:
    # scikit-learn metrics expect the arguments in the order (y_true, y_pred)
    mse = mean_squared_error(Question1b.y, Question1b.y_pred)
    r2 = r2_score(Question1b.y, Question1b.y_pred)
    print(f"mse: {mse}, r2 score: {r2}")

mse: 2881.0519254794312, r2 score: 0.5032777882050409


In [185]:
grader.check("Question 1c")

# Question 2: Diagnosing diabetes using neural networks (5 Points)

In this task, you will again use the diabetes dataset from the previous task, but this time train a neural network to diagnose the disease. For this purpose, we will again use scikit-learn.

For supervised learning, scikit-learn supports only one type of neural network, the so-called multilayer perceptron. These are neural networks in which each neuron in a layer is connected to all neurons in the next layer. Accordingly, the definition of the network is simple: you specify the number of neurons in each hidden layer through the `hidden_layer_sizes` parameter, for example as a tuple. Multilayer perceptrons can be used for both regression and classification ([MLPClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html)). Take a look at the documentation for [MLPRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPRegressor.html) to get an overview of how to use the model.
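
To make "each neuron is connected to all neurons in the next layer" concrete, here is a minimal sketch of a forward pass through a network with one hidden layer, written in plain Python. The weights are made up for illustration and the helper names are hypothetical. The sketch assumes a ReLU activation in the hidden layer and no activation on the output, which matches the defaults of `MLPRegressor`.

```python
def relu(values):
    """Element-wise max(0, v)."""
    return [max(0.0, v) for v in values]


def dense(inputs, weights, biases):
    """Fully connected layer: every input feeds every output neuron."""
    return [
        sum(w * x for w, x in zip(neuron_weights, inputs)) + b
        for neuron_weights, b in zip(weights, biases)
    ]


def mlp_forward(inputs, hidden_w, hidden_b, out_w, out_b):
    """One hidden layer with ReLU, followed by a single linear output."""
    hidden = relu(dense(inputs, hidden_w, hidden_b))
    return dense(hidden, out_w, out_b)[0]


# Made-up weights: 2 inputs -> 3 hidden neurons -> 1 output.
hidden_w = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
hidden_b = [0.0, 0.0, 0.0]
out_w = [[1.0, 2.0, 3.0]]
out_b = [0.5]

# Hidden pre-activations are [2, 1, -3]; ReLU turns them into [2, 1, 0].
prediction = mlp_forward([2.0, 1.0], hidden_w, hidden_b, out_w, out_b)
print(prediction)  # 1*2 + 2*1 + 3*0 + 0.5 = 4.5
```

Training a network means searching for weights and biases that make such predictions match the targets; in this sketch they are simply fixed by hand.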


### Question 2a) Scaling the data (2 Points)


Since the parameters of a neural network are found through optimization, it is often helpful to scale the data so that the observations for each measurement have a mean of 0 and a standard deviation of 1. You can achieve this using the following code snippet, which utilizes the [StandardScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html) from scikit-learn.
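
As a rough sketch of what this scaling does to a single column, the following plain-Python example subtracts the mean and divides by the population standard deviation. The helper name `standardize` is made up; the values are the first five `bmi` entries from the table above.

```python
import statistics


def standardize(column):
    """Shift a column to mean 0 and rescale it to standard deviation 1."""
    mean = statistics.fmean(column)
    # Population standard deviation (divides by n, not n - 1).
    std = statistics.pstdev(column)
    return [(value - mean) / std for value in column]


bmi = [32.1, 21.6, 30.5, 25.3, 23.0]
bmi_scaled = standardize(bmi)

scaled_mean = statistics.fmean(bmi_scaled)
scaled_std = statistics.pstdev(bmi_scaled)
print(f"mean: {scaled_mean:.6f}, stddev: {scaled_std:.6f}")
```

Up to floating-point rounding, the scaled column has mean 0 and standard deviation 1, while the ordering of the patients is unchanged.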

In [186]:
class Question2a:
    scaler = StandardScaler()
    scaler.fit(Question1a.X)
    X_scaled = scaler.transform(Question1a.X)

    print(f"mean {X_scaled.mean(axis=0)} stddev {X_scaled.std(axis=0)}")

mean [-8.14375628e-17  9.41621820e-17 -8.14375628e-16  8.24555324e-16
  1.32336040e-16  6.71859893e-16  4.07187814e-17 -1.01796954e-16
  2.21917359e-15  4.27547205e-16] stddev [1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]


In [187]:
grader.check("Question 2a")

### Question 2b: Training a neural network (3 Points)

Now define a neural network with 1 hidden layer containing 10 neurons. Compute predictions and calculate the `r2_score` again. Set the number of iterations for optimization (`max_iter`) to 30000 (see [Documentation](https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPRegressor.html)).

In [188]:
class Question2b:
    mlp = MLPRegressor(hidden_layer_sizes=(10,), max_iter=30000, random_state=420)
    mlp.fit(Question2a.X_scaled, np.squeeze(Question1a.y))
    y_pred = mlp.predict(Question2a.X_scaled)
    r2 = r2_score(Question1a.y, y_pred)
    print(f"r2_score: {r2}")

r2_score: 0.6215946209980502


In [189]:
grader.check("Question 2b")

# Question 3: Overfitting (5 Points)


When we define machine learning models with many parameters, there is a risk that they will not learn the patterns in the data, but simply memorize the data. As preparation for the next lecture, we want to investigate this phenomenon in this task.
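
The difference between learning a pattern and memorizing the data can be illustrated with a deliberately extreme model. The sketch below uses a 1-nearest-neighbour lookup as a stand-in for a model that memorizes; it is not a neural network, and the synthetic data (a noisy line) and all names are made up for illustration.

```python
import random


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def memorizer_predict(train_x, train_y, x):
    """Return the stored target of the closest training point."""
    nearest = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    return train_y[nearest]


rng = random.Random(0)  # fixed seed so the run is repeatable


def sample(n):
    """Draw n points from y = 2x plus Gaussian noise."""
    xs = [rng.uniform(0.0, 10.0) for _ in range(n)]
    ys = [2.0 * x + rng.gauss(0.0, 5.0) for x in xs]
    return xs, ys


train_x, train_y = sample(30)
test_x, test_y = sample(30)

train_r2 = r2(train_y, [memorizer_predict(train_x, train_y, x) for x in train_x])
test_r2 = r2(test_y, [memorizer_predict(train_x, train_y, x) for x in test_x])
print(f"train r2: {train_r2}, test r2: {test_r2}")
```

On the training points the lookup reproduces every target, noise included, so the training score is perfect. On new points the memorized noise no longer matches, and the score drops.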

### Question 3a (2 Points)

Copy your solution from above and fit the model again. This time, however, use multiple hidden layers and a larger number of neurons. Can you achieve an `r2_score` of over 0.99?
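
To get a feeling for how quickly the number of parameters grows, the following sketch counts the weights and biases of a fully connected network. The helper `count_parameters` is made up for illustration, and the layer sizes `(40, 40, 40)` are only one example of a larger network. The count assumes 10 input features and a single output.

```python
def count_parameters(n_inputs, hidden_layer_sizes, n_outputs=1):
    """Count weights and biases of a fully connected network."""
    sizes = [n_inputs, *hidden_layer_sizes, n_outputs]
    # Each layer has fan_in * fan_out weights plus one bias per output neuron.
    return sum(
        fan_in * fan_out + fan_out
        for fan_in, fan_out in zip(sizes, sizes[1:])
    )


small = count_parameters(10, (10,))          # 10*10 + 10 + 10*1 + 1 = 121
large = count_parameters(10, (40, 40, 40))   # 440 + 1640 + 1640 + 41 = 3761
print(f"small network: {small} parameters, large network: {large} parameters")
```

A network with thousands of parameters trained on only a few hundred patients has enough freedom to fit individual data points rather than general trends.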


In [190]:
class Question3a:
    mlp = MLPRegressor(hidden_layer_sizes=(40, 40, 40), max_iter=30000, random_state=420)
    mlp.fit(Question2a.X_scaled, np.squeeze(Question1a.y))
    y_pred = mlp.predict(Question2a.X_scaled)
    r2 = r2_score(Question1a.y, y_pred)

    print(Question1a.y)
    print('---')
    print(y_pred)

    print(f"r2_score: {r2}")

     target
0       151
1        75
2       141
3       206
4       135
..      ...
344     200
345     139
346     139
347      88
348     148

[349 rows x 1 columns]
---
[166.77015827  59.047549   149.84396788 204.99355256 134.1493321
  96.94935819 136.5946819   63.38456955 101.89024896 309.79360454
  99.64228952  69.78294889 169.7830803  185.79289519 120.16887594
 168.1566964  166.09157873 148.76363909  96.756097   170.79211513
  68.13617205  50.89106029  69.88752807 246.15948337 181.61883234
 200.35281061 136.58375585  83.75068804 129.55646479 283.60602302
 127.61241312  57.17888538 342.05832149  86.2851749   64.5619283
 102.63506786 265.88911206 275.16625033 249.60387022  90.92416202
 102.8905221   54.04810809  57.38821939  92.37217958 252.51132305
  53.95396694 184.96787173 136.49706816  75.94997293 136.16997267
 153.5917611  184.13217626  60.05216325 104.27855226 181.487435
 129.28235485  51.0449808   37.59392375 170.67764288 170.32186061
  57.83012468 153.75498411  55.76001676 

In [191]:
grader.check("Question 3a")

### Question 3b: Predicting on the test set (3 Points) 

We have created a second dataset `diabetes_test.csv` for you. It contains data from additional patients. Take the fitted model from Question 3a and make predictions for these patients. What is the `r2_score`? What do you observe? Tip: don't forget to scale the measurements again.
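
One common convention when scaling new data is to reuse the mean and standard deviation estimated on the training data, so that the model sees inputs on the same scale it was trained on. With scikit-learn this corresponds to calling `transform` on an already fitted `StandardScaler` rather than fitting a new one. The plain-Python sketch below illustrates the idea for one column; the helper names and the test values are made up, and the training values are the first five `age` entries from the table in Question 1.

```python
import statistics


def fit_scaler(column):
    """Estimate the mean and population standard deviation of a column."""
    return statistics.fmean(column), statistics.pstdev(column)


def transform(column, mean, std):
    """Apply previously estimated statistics to a column."""
    return [(value - mean) / std for value in column]


train_age = [59.0, 48.0, 72.0, 24.0, 50.0]
test_age = [61.0, 35.0]  # hypothetical new patients

# Estimate the statistics once, on the training data only.
mean, std = fit_scaler(train_age)

train_scaled = transform(train_age, mean, std)
# Reuse the same statistics for the new data.
test_scaled = transform(test_age, mean, std)
print(test_scaled)
```

With this convention, a test patient who is older than the average training patient gets a positive scaled age, regardless of which other patients happen to be in the test set.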

In [192]:
class Question3b:
    df_test = pd.read_csv("daten/diabetes_test.csv")

    X_test = df_test[df_test.columns.drop(['target'])]
    y_test = df_test[['target']].copy()

    scaler = StandardScaler()
    scaler.fit(X_test)

    mlp = Question3a.mlp

    X_test_scaled = scaler.transform(X_test)
    y_pred_test = mlp.predict(X_test_scaled)
    r2 = r2_score(y_test, y_pred_test)
    print("r2 test: ", r2)

r2 test:  -0.6686289649121682


In [193]:
grader.check("Question 3b")

What do you observe? Can you explain the behaviour?

Your answer

In [194]:
grader.check_all()

Question 1a results: All test cases passed!

Question 1b results: All test cases passed!

Question 1c results: All test cases passed!

Question 2a results: All test cases passed!

Question 2b results: All test cases passed!

Question 3a results:
    Question 3a - 1 result:
        ❌ Test case failed
        Error at line 14 in test Question 3a:
             assert_true((y - y_pred).abs().mean() < 15)
        ValueError: Unable to coerce to Series, length must be 1: given 349

Question 3b results: All test cases passed!