This activity is adapted from the Scikit-Learn example at https://scikit-learn.org/stable/auto_examples/linear_model/plot_ols.html#sphx-glr-auto-examples-linear-model-plot-ols-py

# Introduction to regression: Diabetes dataset

First, we load our dataset. It is a built-in dataset that ships with the Scikit-Learn library, so there is no external file to load and parse.

In [None]:
# Code source: Jaques Grobler
# License: BSD 3 clause

import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets, linear_model
from sklearn.metrics import mean_squared_error, r2_score

# Load the diabetes dataset
diabetes = datasets.load_diabetes()

## What are we working with?
Let's look at the structure of our data. The object returned by `load_diabetes()` is a `Bunch`, a dictionary-like object. Let's see what the keys are.

In [None]:
print(diabetes.keys())

What are the features?

In [None]:
print(diabetes['feature_names'])

What do some of the feature vectors look like?

In [None]:
print(diabetes['data'][:5])

We can also access the *data* and *target* through the attributes `data` and `target`.

In [None]:
print(diabetes.data[:5])
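
As a quick sanity check, the following sketch confirms that key-style access and attribute-style access return the same arrays, and prints their shapes. The variable names `same_data` and `same_target` are illustrative only.

```python
import numpy as np
from sklearn import datasets

diabetes = datasets.load_diabetes()

# Key-style and attribute-style access refer to the same arrays.
same_data = np.array_equal(diabetes['data'], diabetes.data)
same_target = np.array_equal(diabetes['target'], diabetes.target)
print(same_data, same_target)

# One row per patient, one column per feature.
print(diabetes.data.shape, diabetes.target.shape)
```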

What is the output for the first few instances?

In [None]:
print(diabetes.target[:5])

## What's with our feature values?
The output looks like what we saw in the original data set, but the attributes do not. 

For instance, the first patient's age is 0.038 and their sex is 0.05?

*What's going on?*

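According to the documentation, each attribute has been mean centered and scaled by its standard deviation times the square root of the number of samples. As a result, each column has zero mean and a sum of squares of 1.

https://web.stanford.edu/~hastie/Papers/LARS/diabetes.sdata.txt

This is closely related to a common scaling technique, *standard scaling*, which gives each attribute zero mean and a variance of 1.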

https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html
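
One way to see how the features were scaled is to compute a few per-column statistics. This is a minimal sketch with illustrative variable names. For the scaled data that `load_diabetes()` returns by default, the column means should be close to 0 and the column sums of squares close to 1, which means the variance is close to 1 divided by the number of samples rather than 1.

```python
import numpy as np
from sklearn import datasets

diabetes = datasets.load_diabetes()
n_samples = len(diabetes.data)

col_means = diabetes.data.mean(axis=0)         # expected to be close to 0
col_sum_sq = (diabetes.data ** 2).sum(axis=0)  # expected to be close to 1
col_var = diabetes.data.var(axis=0)            # expected to be close to 1 / n_samples

print(np.round(col_means, 6))
print(np.round(col_sum_sq, 6))
print(np.round(col_var * n_samples, 6))
```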

We will talk more about the transformation techniques that we can apply to our data in the future. In the original data, some features have large values, such as 80-130 (BP), and others have small values, such as 1-2 (sex). We can get poor performance if our attributes have significantly different ranges.

## Only use one feature
For an example that is easy to visualize, we will use just one of the features (BMI). Ignoring the other features will most likely give us worse predictions, but our goal at this point is to understand what is going on.



In [None]:
# Use only one feature (BMI). BMI is column 3 (index=2) of the feature vectors.
diabetes_X = diabetes.data[:, np.newaxis, 2]

In [None]:
print(diabetes_X[:5])
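
The `np.newaxis` in the indexing expression keeps the result two-dimensional, which is the shape Scikit-Learn expects for feature data. The sketch below shows the difference on a small toy array; the values and variable names are illustrative only.

```python
import numpy as np

# Toy feature matrix: 4 samples, 3 features (illustrative values only).
X = np.arange(12).reshape(4, 3)

col_1d = X[:, 2]              # shape (4,): a flat vector
col_2d = X[:, np.newaxis, 2]  # shape (4, 1): one column, one row per sample
col_alt = X[:, [2]]           # another way to keep the 2-D shape

print(col_1d.shape, col_2d.shape, col_alt.shape)
```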

## Split our dataset
We need to split our dataset into *training* and *test* sets. 

We split both the feature vectors (`data`) and the output (`target`).


* Training set: used for *learning* the model
* Test set: used for *evaluating* our trained model



In [None]:
# Split the data into training/testing sets
from sklearn.model_selection import train_test_split
diabetes_X_train, diabetes_X_test, diabetes_y_train, diabetes_y_test = train_test_split(
    diabetes_X, diabetes.target, train_size=0.8, random_state=0)

print('Training set size: {}'.format(len(diabetes_X_train)))
print('Test set size: {}'.format(len(diabetes_X_test)))
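
To see what `train_size` and `random_state` do, here is a small sketch on toy data (the arrays are made up for illustration). It checks the split sizes and shows that repeating the call with the same `random_state` reproduces the same split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 10 samples with one feature each (illustrative values only).
X = np.arange(10).reshape(10, 1)
y = np.arange(10) * 2

X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.8, random_state=0)

# Repeating the call with the same random_state gives the same split.
X_train_again, X_test_again, _, _ = train_test_split(
    X, y, train_size=0.8, random_state=0)

print(len(X_train), len(X_test))
print(np.array_equal(X_test, X_test_again))
```

Features and targets are shuffled together, so each row of `X_train` still lines up with its entry in `y_train`.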

## Create instance of the model object
Before training, we must create an instance of our linear regression class.

In [None]:
# Create linear regression object
regr = linear_model.LinearRegression()

## Train the model
We train the model on our training set. 

With Scikit-Learn, *training* is done using the `fit(x, y)` method, where `x` is the array of training feature vectors (*input*) and `y` is the corresponding array of target values (*output*).

All Scikit-Learn classifier and regressor classes provide a `fit` method.

In [None]:
# Train the model using the training sets
regr.fit(diabetes_X_train, diabetes_y_train)

In [None]:
# The coefficients for our line
print('Coefficients: \n', regr.coef_)
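
With a single feature, the fitted model is a line defined by a slope (`coef_`) and an intercept (`intercept_`). The sketch below reproduces `predict` by hand from those two values. For brevity it fits on the whole dataset rather than the training split, so its numbers will differ from the cell above; `slope`, `intercept`, and `manual_pred` are illustrative names.

```python
import numpy as np
from sklearn import datasets, linear_model

diabetes = datasets.load_diabetes()
X = diabetes.data[:, np.newaxis, 2]  # BMI only
y = diabetes.target

regr = linear_model.LinearRegression()
regr.fit(X, y)

# For a single feature, the fitted line is y = slope * x + intercept.
slope = regr.coef_[0]
intercept = regr.intercept_

manual_pred = slope * X[:5, 0] + intercept
print(manual_pred)
print(regr.predict(X[:5]))
```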

## Evaluate the model (Is it good?)
How good is our model in practice? 

Training gives us a model that minimizes error on the training set. But, if the training set is not representative of the real world, then the model will do poorly in practice.

In [None]:
# Make predictions using the testing set
diabetes_y_pred = regr.predict(diabetes_X_test)

In [None]:
# Plot outputs
plt.scatter(diabetes_X_test, diabetes_y_test,  color='black')
plt.plot(diabetes_X_test, diabetes_y_pred, color='blue', linewidth=3)

plt.xticks(())
plt.yticks(())

plt.show()

## Measuring error
While plots are nice, there are many possible lines that fit the data, and some are better than others. Visually, it can be difficult to tell which is a better fit.

We would like to quantify the error. For this, we will use the *Root Mean Square Error* (RMSE).

$$ error = \sqrt{\frac{1}{m}\sum\limits_{i=1}^{m}{(predicted_i - correct_i)^2}} $$


Scikit-Learn has a function for calculating *mean squared error* (MSE). To get RMSE, we just compute the square root of the value returned by this function.



In [None]:
# The root mean squared error (RMSE)
print("Root mean squared error: %.2f"
      % np.sqrt(mean_squared_error(diabetes_y_test, diabetes_y_pred)))
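
To connect the formula to the code, the sketch below computes RMSE directly with NumPy on a few made-up values and compares it to the square root of Scikit-Learn's MSE. The arrays are illustrative and are not taken from the diabetes data.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Illustrative values only, not taken from the diabetes data.
correct = np.array([3.0, -0.5, 2.0, 7.0])
predicted = np.array([2.5, 0.0, 2.0, 8.0])

# RMSE computed directly from the formula.
rmse_manual = np.sqrt(np.mean((predicted - correct) ** 2))

# RMSE computed from Scikit-Learn's mean squared error.
rmse_sklearn = np.sqrt(mean_squared_error(correct, predicted))

print("%.4f %.4f" % (rmse_manual, rmse_sklearn))
```

Here the squared errors are 0.25, 0.25, 0, and 1, so the mean is 0.375 and both values are its square root.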

## Use all features?
With this example, we only used one of the 10 attributes (or features). What if our model used all 10? 

In [None]:
# Split the data into training/testing sets
from sklearn.model_selection import train_test_split
diabetes_X_train, diabetes_X_test, diabetes_y_train, diabetes_y_test = train_test_split(
    diabetes.data, diabetes.target, train_size=0.8, random_state=0)

# Create linear regression object
regr = linear_model.LinearRegression()

# Train the model using the training sets
regr.fit(diabetes_X_train, diabetes_y_train)

# Make predictions using the testing set
diabetes_y_pred = regr.predict(diabetes_X_test)

# The root mean squared error (RMSE)
print("Root mean squared error: %.2f"
      % np.sqrt(mean_squared_error(diabetes_y_test, diabetes_y_pred)))

# (30 pts) Lab activity 02 - Regression with synthetic data (Due Wed, Sept 9 by 11:59 PM)
At times, ML practitioners work with *synthetic* data, that is, data generated using a given mathematical model.

The use of synthetic data allows practitioners to control the characteristics of the datasets in order to learn the capabilities and limits of given ML algorithms.

In this activity, you will use a model to generate a dataset. Then you will perform regression on this dataset.

## (1) Generate the data
Scikit-learn has various generators for creating datasets with different properties. 

We will use the `make_regression` function for creating a dataset for testing regression algorithms.

For details on `make_regression` see:

https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_regression.html

### About the data
Our data will have the following properties.
1. Number of samples (`n_samples`): 1000
2. Number of features (`n_features`): 1 
3. Number of output values (`n_targets`): 1
4. Standard deviation of the Gaussian noise added to generated points (`noise`): 5.0

We are just generating one feature per sample for the sake of this exercise, so we can easily visualize our data.

In [None]:
from sklearn.datasets import make_regression

# !!! Take note of the following statement. !!!
# You will use it later in the lab for datasets with different noise values.
data, targets = make_regression(n_samples=1000, n_features=1, n_targets=1, noise=5.0)
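
To make the idea of an underlying model concrete, here is a separate sketch (not part of the lab steps) that asks `make_regression` to also return the coefficient it used. The settings `noise=0.0`, `coef=True`, and `random_state=0` are choices made for this illustration only and differ from the lab's settings.

```python
import numpy as np
from sklearn.datasets import make_regression

# coef=True also returns the coefficient of the underlying linear model.
# random_state=0 is an arbitrary choice that makes the run repeatable.
X, y, true_coef = make_regression(
    n_samples=1000, n_features=1, n_targets=1,
    noise=0.0, coef=True, random_state=0)

# With no noise (and the default bias of 0), every target lies on the line.
on_line = np.allclose(y, X[:, 0] * true_coef)
print(X.shape, y.shape, on_line)
```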

## (2) What does our data look like?

Display the feature values for the first 5 instances.

Display the output (i.e. targets) for the first 5 instances.

## (3) Split the data
Split the dataset into training (80%) and test (20%) sets. 

Display the sizes of the training and test sets. 

*(you should have 800 training instances and 200 test instances)*

## (4) Train the model
Train a linear regression model on your training set.

In [None]:
# create the model


In [None]:
# train model


## (5) Evaluate the trained model
Output your model's error (RMSE) on the test set.

In [None]:
# Make predictions using the testing set


# The root mean squared error (RMSE)


As we did earlier with the diabetes dataset, plot the samples in the test set along with the predicted values.

In [None]:
# Plot outputs



## (6) Repeat with different amounts of noise

Repeat the steps you just did with data generated with `noise=10.0` and `noise=0.5`.

1. What is the error with data generated with noise = 10.0?


In [None]:
# create the data


# split the data


# create the model


# train model


# Make predictions using the testing set


# The root mean squared error (RMSE)


# Plot outputs


2. What is the error with data generated with noise = 0.5?

In [None]:
# create the data


# split the data


# create the model


# train model


# Make predictions using the testing set


# The root mean squared error (RMSE)


# Plot outputs


3. Do you notice a pattern in the error across these three experiments?



# (10 pts) Participation quiz - Regression

Answer the following questions for participation credit for today's class.

## (1) What is one thing that you learned in class today?

## (2) What is one question that you have related to today's class?

## (3) Given the linear model

$$ w = [ -1, 3, 2]^T $$

Suppose our features are $x_0$ = 1, $x_1$ = age, and $x_2$ = BMI. What is the predicted output for the following patient?

- Age = 53
- BMI = 27