# Exercise 4 Forecast diabetes progression

The goal of this exercise is to use Linear Regression to forecast the progression of diabetes. Even when the instructions do not say so explicitly, you should **ALWAYS** start with an exploratory data analysis (EDA) so that you have a good understanding of the data you are modeling. As a reminder, here is an introduction to EDA:

- https://towardsdatascience.com/exploratory-data-analysis-eda-a-practical-guide-and-template-for-structured-data-abfbf3ee3bd9

The data set used is described in https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_diabetes.

```python
from sklearn.datasets import load_diabetes
diabetes = load_diabetes()
X, y = diabetes.data, diabetes.target
```

1. Using `train_test_split`, split the data set into a train set and a test set (20% of the data). Use `random_state=43` so that the results are reproducible.

2. Fit a Linear Regression on all the variables. Give the coefficients and the intercept of the fitted model. What is the resulting equation?

3. Predict on the test set. Predicting on the test set is like receiving new patients for whom you, as a physician, need to forecast the disease progression one year after baseline, given the 10 baseline variables.

4. Compute the MSE on the train set and on the test set. Later this week we will learn about the R2 score, which will help us evaluate the performance of this fitted Linear Regression. The MSE alone is hard to interpret: it is expressed in squared units of the target, so its value depends on the scale of the target.
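
To see why a raw MSE value is hard to interpret, here is a minimal sketch with made-up numbers (the arrays below are illustrative and are not taken from the diabetes data). It computes the MSE by hand with NumPy and shows that expressing the same target in units 10 times larger multiplies the MSE by 100, even though the predictions are equally good in relative terms.

```python
import numpy as np


def mse(y_true, y_pred):
    """Mean squared error: the average of the squared residuals."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean((y_true - y_pred) ** 2)


# Made-up targets and predictions, for illustration only.
y_true = np.array([100.0, 150.0, 200.0])
y_pred = np.array([110.0, 140.0, 190.0])

mse_original = mse(y_true, y_pred)            # every residual is +/-10
mse_rescaled = mse(10 * y_true, 10 * y_pred)  # same data, values 10x larger

print(mse_original)  # 100.0
print(mse_rescaled)  # 10000.0
```

This is why a scale-free score such as R2 is usually reported alongside the MSE.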

**WARNING**: This will be explained later this week, but here we are doing something "dangerous". As you may have read in the data documentation, the features were scaled using the whole data set, whereas we should fit the scaling on the train set only and then apply that same scaling to the test set. This is a toy example, so let's ignore this detail for now.

https://scikit-learn.org/stable/datasets/toy_dataset.html#diabetes-dataset
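
The leakage-free procedure mentioned in the warning can be sketched as follows. This is a minimal illustration on synthetic data (the arrays `X_raw` and `y_raw` are made up), and it assumes scikit-learn's `StandardScaler` as the scaling step. The diabetes data set itself uses a different scaling (each column is mean-centered and scaled so that its sum of squares is 1), so this shows the principle rather than reproducing that preprocessing.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic, unscaled features and target (illustrative only).
rng = np.random.default_rng(0)
X_raw = rng.normal(loc=50.0, scale=10.0, size=(100, 3))
y_raw = X_raw @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=100)

# Split first, so the test set plays no part in the scaling statistics.
X_train, X_test, y_train, y_test = train_test_split(
    X_raw, y_raw, test_size=0.20, random_state=43
)

# Learn the mean and standard deviation from the train set only...
scaler = StandardScaler().fit(X_train)

# ...then apply that same transformation to both sets.
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train columns have mean 0 by construction; test columns only approximately.
print(X_train_scaled.mean(axis=0))
print(X_test_scaled.mean(axis=0))
```

Fitting the scaler before the split would let information from the test set influence the training features, which is the "dangerous" shortcut the warning refers to.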


```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

diabetes = load_diabetes()
X, y = diabetes.data, diabetes.target

X = pd.DataFrame(data=X, columns=diabetes.feature_names)
y = pd.DataFrame(data=y, columns=['target'])

# 1.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=43)

# testing
print('y_train: ', y_train.values[:10])
print('y_test: ', y_test.values[:10])

# 2.
regr = LinearRegression()
regr.fit(X_train, y_train)

coefficients = pd.concat([pd.DataFrame(X.columns), pd.DataFrame(np.transpose(regr.coef_))], axis=1)

print(coefficients)
# testing
print('intercept: \n', regr.intercept_)

# 3.
predictions_on_test = regr.predict(X_test)

# testing
print(predictions_on_test[:10])

# 4.
def compute_mse(y_true, y_pred):
    # we can use numpy to do the same thing
    # MSE = np.square(np.subtract(y_true, y_pred)).mean()
    return mean_squared_error(y_true, y_pred)

print(compute_mse(y_test, predictions_on_test))
```

Output:

```
y_train:  [[202.]
 [ 55.]
 [202.]
 [ 42.]
 [214.]
 [173.]
 [118.]
 [ 90.]
 [129.]
 [151.]]
y_test:  [[ 71.]
 [ 72.]
 [235.]
 [277.]
 [109.]
 [ 61.]
 [109.]
 [ 78.]
 [ 66.]
 [192.]]
     0           0
0  age  -60.401630
1  sex -226.087407
2  bmi  529.383623
3   bp  259.963077
4   s1 -859.121932
5   s2  504.709601
6   s3  157.420349
7   s4  226.295336
8   s5  840.793807
9   s6   34.712226
intercept: 
 [152.05314895]
[[111.74351759]
 [ 98.41335251]
 [168.36373195]
 [255.05882934]
 [168.43764643]
 [117.60982186]
 [198.86966323]
 [126.28961941]
 [117.73121787]
 [224.83346984]]
2858.2551533228366
```
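
Question 2 asks for the equation and question 4 for the MSE on the train set as well, and neither appears in the output above. Here is a minimal sketch of how both could be obtained, assuming the same split parameters as in question 1. In this sketch `y` is kept as a 1-D array so that `coef_` is 1-D and `intercept_` is a scalar, and the names `terms`, `equation`, `mse_train` and `mse_test` are illustrative. No output is shown because these values are not part of the run above.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

diabetes = load_diabetes()
X, y = diabetes.data, diabetes.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=43
)

regr = LinearRegression().fit(X_train, y_train)

# Fitted equation: y = intercept + sum(coef_i * feature_i)
terms = [
    f"{coef:+.2f} * {name}"
    for name, coef in zip(diabetes.feature_names, regr.coef_)
]
equation = f"y = {regr.intercept_:.2f} " + " ".join(terms)
print(equation)

# MSE on both sets: a train MSE far below the test MSE would hint at overfitting.
mse_train = mean_squared_error(y_train, regr.predict(X_train))
mse_test = mean_squared_error(y_test, regr.predict(X_test))
print("MSE train:", mse_train)
print("MSE test: ", mse_test)
```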
