In [None]:
# Initialize Otter
import otter
grader = otter.Notebook("10-exercise-pids2024.ipynb")

# Exercise sheet 10

**Hello everyone!**

**Points: 15**

Topics of this exercise sheet are:
* Regression with scikit-learn
* Neural networks
* Overfitting

Please let us know if you have questions or problems! <br>
Contact us during the exercise session or on [Piazza](https://piazza.com/unibas.ch/spring2024/63982).

**Automatic Feedback**

This notebook can be graded automatically with Otter grader. To find out how many points you get, run `grader.check_all()` in a new cell.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.neural_network import MLPRegressor, MLPClassifier
from sklearn.preprocessing import StandardScaler
from sklearn import linear_model
from sklearn.metrics import r2_score, mean_squared_error

# Question 1: Diagnosing diabetes using linear regression (5 Points)
In this initial exercise, you will implement a program that learns to diagnose diabetes based on various measurements. The prognosis here is a numerical value that describes the progression of the disease. Therefore, this is a regression problem.

Let's load the data:

In [None]:
df = pd.read_csv("daten/diabetes.csv")
df.head()

Each row contains the measurements for one patient. All measurements have been standardized by subtracting the mean value. The last column, `target`, quantifies the progression of diabetes after one year. The other columns correspond to the following measurements:

* `age`: Age in years.
* `sex`: Biological sex, encoded and standardized as a numerical value.
* `bmi`: Body Mass Index (BMI).
* `bp`: Average blood pressure (in mm Hg).
* `s1`: Total serum cholesterol level (in mg/dL).
* `s2`: Low-density cholesterol (LDL) (in mg/dL).
* `s3`: High-density cholesterol (HDL) (in mg/dL).
* `s4`: Total serum triglyceride level (in mg/dL).
* `s5`: Liver enzyme Alanine Aminotransferase (ALT).
* `s6`: Fasting blood sugar level (in mg/dL).

### Question 1a) (1 Point)

Create two DataFrames `X` and `y`. Here, `y` should contain only the target values and `X` all the other columns.

In [None]:
class Question1a:
    X = ...
    y = ...

In [None]:
grader.check("Question 1a")

### Question 1b) Fit the linear regression model (2 Points)

Create a linear regression model using [scikit-learn](https://scikit-learn.org/stable/index.html) and fit it to the diabetes dataset.
(You can find explanations and examples [here](https://scikit-learn.org/stable/modules/linear_model.html#ordinary-least-squares)). Use your model to make predictions for the data in `X` and store them in the variable `y_pred`.

In [None]:
class Question1b:
    X = Question1a.X
    y = Question1a.y
    ...
    y_pred = ...

In [None]:
grader.check("Question 1b")

### Question 1c) Evaluate the linear regression model (2 Points)

Next, you should evaluate your model. Scikit-learn provides various metrics for this purpose. We will use the mean squared error ([mean_squared_error](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_squared_error.html#sklearn.metrics.mean_squared_error)) and the [r2_score](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.r2_score.html#sklearn.metrics.r2_score). While the mean squared error should be familiar, the r2_score may not be. The r2_score (also known as the coefficient of determination, a measure of explained variance) is a dimensionless number that is at most 1 and typically lies between 0 and 1. A value of 1 means that the model explains the data perfectly. A value of 0 means that the prediction is no better than always predicting the mean of the target values. Negative values indicate that the prediction is even worse than ignoring the measurements and always predicting the mean.
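The three cases described above (1, 0, and a negative value) can be illustrated with a small sketch. The target values and the three sets of predictions are made up for this illustration and are not taken from the diabetes data. The score follows $R^2 = 1 - SS_{res} / SS_{tot}$, where $SS_{res}$ is the sum of squared prediction errors and $SS_{tot}$ is the sum of squared deviations of the targets from their mean.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Toy target values (made up for illustration)
y_true = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

perfect = y_true.copy()                            # predicts every value exactly
mean_only = np.full_like(y_true, y_true.mean())    # always predicts the mean
reversed_pred = y_true[::-1]                       # systematically wrong predictions

r2_perfect = r2_score(y_true, perfect)             # SS_res = 0
r2_mean = r2_score(y_true, mean_only)              # SS_res = SS_tot
r2_reversed = r2_score(y_true, reversed_pred)      # SS_res > SS_tot
mse_reversed = mean_squared_error(y_true, reversed_pred)

print(f"perfect: {r2_perfect}, mean only: {r2_mean}, reversed: {r2_reversed}")
print(f"mse of reversed predictions: {mse_reversed}")
```

Note that `r2_score` and `mean_squared_error` both take the true values as the first argument and the predictions as the second.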

Compute the r2 score and the mean squared error using the metrics from scikit-learn. How do you interpret the results?

In [None]:
class Question1c:
    mse = ...
    r2 = ...
    print(f"mse: {mse}, r2 score: {r2}")

In [None]:
grader.check("Question 1c")

# Question 2: Diagnosing diabetes using neural networks (5 Points)

In this task, you will again use the diabetes dataset from the previous task, but this time train a neural network to diagnose the disease. For this purpose, we will again use scikit-learn.

Scikit-learn supports only one type of neural network, the so-called multilayer perceptron. These are neural networks in which each neuron in a layer is connected to all neurons in the next layer. Accordingly, the definition of the network is simple: you specify the number of neurons in each hidden layer through the `hidden_layer_sizes` parameter, given as a list or tuple with one entry per hidden layer. Multilayer perceptrons can be used for both regression and classification ([MLPClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html)). Take a look at the documentation for [MLPRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPRegressor.html) to get an overview of how to use the model.
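To make the "fully connected" structure concrete, here is a minimal sketch on synthetic data. The data, the names `X_toy` and `y_toy`, and the layer sizes `(4, 2)` are arbitrary choices for illustration. After fitting, the attribute `coefs_` holds one weight matrix per pair of consecutive layers, so its shapes show how the layers are connected.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# Synthetic data (assumption for illustration): 50 samples with 3 features
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(50, 3))
y_toy = X_toy @ np.array([1.0, -2.0, 0.5])

# Two hidden layers with 4 and 2 neurons
mlp = MLPRegressor(hidden_layer_sizes=(4, 2), max_iter=2000, random_state=0)
mlp.fit(X_toy, y_toy)

# One weight matrix per pair of consecutive layers:
# input (3) -> hidden (4) -> hidden (2) -> output (1)
weight_shapes = [w.shape for w in mlp.coefs_]
print(weight_shapes)
```

Each matrix has one row per neuron in the earlier layer and one column per neuron in the later layer, which is exactly the "every neuron connects to every neuron in the next layer" pattern.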


### Question 2a) Scaling the data (2 Points)


Since the parameters of a neural network are found through optimization, it is often helpful to scale the data so that the observations for each measurement have a mean of 0 and a standard deviation of 1. You can achieve this with the [StandardScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html) from scikit-learn.
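A minimal sketch of how `StandardScaler` behaves, using a small made-up array (`X_toy` and `X_new` are hypothetical names and values, not the diabetes data). `fit_transform` learns the mean and standard deviation of each column and applies the scaling. `transform` reuses those learned statistics on other data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up measurements: two columns on very different scales
X_toy = np.array([[1.0, 100.0],
                  [2.0, 200.0],
                  [3.0, 300.0]])

scaler = StandardScaler()
X_toy_scaled = scaler.fit_transform(X_toy)   # learn mean/std per column, then scale

print("mean:  ", X_toy_scaled.mean(axis=0))
print("stddev:", X_toy_scaled.std(axis=0))

# Further data is scaled with the statistics learned from X_toy
X_new = np.array([[2.0, 400.0]])
X_new_scaled = scaler.transform(X_new)
print("new sample scaled:", X_new_scaled)
```

Reusing the fitted scaler keeps all data on the same scale the model was trained on.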

In [None]:
class Question2a:
    X_scaled = ...
    
    print(f"mean {X_scaled.mean(axis=0)} stddev {X_scaled.std(axis=0)}")

In [None]:
grader.check("Question 2a")

### Question 2b: Training a neural network (3 Points)

Now define a neural network with 1 hidden layer containing 10 neurons. Compute predictions and calculate the `r2_score` again. Set the number of iterations for optimization (`max_iter`) to 30000 (see the [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPRegressor.html)).

In [None]:
class Question2b:
    ...
    y_pred = ...
    r2 = ...
    print(f"r2_score: {r2}")

In [None]:
grader.check("Question 2b")

# Question 3: Overfitting (5 Points)


When we define machine learning models with many parameters, there is a risk that they do not learn the patterns in the data but simply memorize the training data. In preparation for the next lecture, we investigate this phenomenon in this task.
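The effect can be sketched outside of neural networks as well. The following illustration uses polynomial fitting with NumPy instead of a multilayer perceptron, on synthetic data. The data-generating function, the noise level, and the polynomial degrees are all assumptions chosen for this example. A polynomial with more coefficients can never fit the training points worse than one with fewer, but that says nothing about how it behaves on new points.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    """Draw n noisy samples from a smooth curve (synthetic, for illustration)."""
    x = rng.uniform(-1, 1, size=n)
    y = np.sin(3 * x) + rng.normal(scale=0.3, size=n)
    return x, y

def mse(coeffs, x, y):
    """Mean squared error of the polynomial with the given coefficients."""
    return np.mean((np.polyval(coeffs, x) - y) ** 2)

x_train, y_train = make_data(15)     # small training set
x_test, y_test = make_data(200)      # unseen data from the same process

simple = np.polyfit(x_train, y_train, deg=3)      # few parameters
flexible = np.polyfit(x_train, y_train, deg=10)   # many parameters

train_simple, test_simple = mse(simple, x_train, y_train), mse(simple, x_test, y_test)
train_flexible, test_flexible = mse(flexible, x_train, y_train), mse(flexible, x_test, y_test)

print(f"degree 3:  train mse {train_simple:.3f}, test mse {test_simple:.3f}")
print(f"degree 10: train mse {train_flexible:.3f}, test mse {test_flexible:.3f}")
```

The flexible model reaches the lower training error by construction. Its error on the unseen points is typically higher, although the exact numbers depend on the random draw.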

### Question 3a (2 Points)

Copy your solution from above and fit the model again. This time, however, use multiple hidden layers and a larger number of neurons. Can you achieve an `r2_score` of over 0.99?


In [None]:
class Question3a:
    ...
    y_pred = ...
    r2 =  ...
    print(f"r2 score {r2}")

In [None]:
grader.check("Question 3a")

### Question 3b: Predicting on the test set (3 Points) 

We have created a second dataset, `diabetes_test.csv`, for you. It contains data from additional patients. Take the fitted model from Question 3a and make predictions for these patients. What do you observe? What is the `r2_score`? Tip: don't forget to scale the measurements again.

In [None]:
class Question3b:
    df_test = pd.read_csv("daten/diabetes_test.csv")
    ...
    X_test_scaled = ...
    y_pred_test= ...
    r2 = ...
    print("r2 test: ", r2)

In [None]:
grader.check("Question 3b")

What do you observe? Can you explain the behaviour?

Your answer

In [None]:
grader.check_all()