# Assignment: Regression on the Diabetes Dataset

In this assignment we practice regression on [scikit-learn's diabetes dataset](https://scikit-learn.org/stable/datasets/toy_dataset.html#diabetes-dataset). (You are welcome to add cells wherever you need them for coding.)

Execute the following cell to get us started.

In [None]:
# Import scikit-learn's example diabetes dataset
import sklearn.datasets 
diabetes = sklearn.datasets.load_diabetes()

# Print a description of the dataset
print(diabetes.DESCR)

# Get the feature and target arrays
x = diabetes.data
y = diabetes.target

## ML steps

* `x` is now a NumPy array with 442 rows (one record per patient) and 10 columns (one per feature).
* `y` is a NumPy array containing the 442 target values.
  * As stated in the description, `y` is a quantitative measure of disease progression one year after baseline.

Print the array dimensions of `x` and `y` and confirm that they match the sizes above.
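
As a minimal sketch of this check, the `shape` attribute of a NumPy array reports its dimensions as (rows, columns). The cell below reloads the dataset so that it runs on its own.

```python
import sklearn.datasets

diabetes = sklearn.datasets.load_diabetes()
x, y = diabetes.data, diabetes.target

# .shape gives (rows, columns); rows are patients, columns are features
print("x shape:", x.shape)
print("y shape:", y.shape)
```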

Make scatter plots to look at the relationship between `y` and each feature of `x`.

Write a loop that prints the 10 correlation coefficients between `y` and the 10 features in `x`.
* You may find it useful to use either [NumPy's `corrcoef` function](https://numpy.org/doc/stable/reference/generated/numpy.corrcoef.html) or the [pandas `DataFrame.corr` method](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.corr.html) to calculate the correlation coefficient matrix.
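
One possible way to write the loop is sketched below using `numpy.corrcoef`. Called with two 1D arrays, it returns a 2x2 matrix whose off-diagonal entry `[0, 1]` is the Pearson correlation between them. The dictionary name `correlations` is only an illustrative choice.

```python
import numpy as np
import sklearn.datasets

diabetes = sklearn.datasets.load_diabetes()
x, y = diabetes.data, diabetes.target

correlations = {}  # feature name -> Pearson r with the target
for j, name in enumerate(diabetes.feature_names):
    # corrcoef returns a 2x2 matrix; [0, 1] is the x-column vs. y entry
    r = np.corrcoef(x[:, j], y)[0, 1]
    correlations[name] = r
    print(f"{name:>4s}: r = {r:+.3f}")
```

A pandas version would build a `DataFrame` from `x` and `y` and read the target column of `DataFrame.corr()`; either approach should give the same numbers.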

Perform linear regression using the feature in `x` that has the highest correlation coefficient with `y`.

* Begin by stating the name of the most correlated feature and whether that result makes sense given what you know about diabetes
  * The feature names are listed in the documentation above and are accessible in the list `diabetes.feature_names`
* Split your data into a training set and a test set
* Train the model
* Print the coefficients of the model
* Plot the linear model as a red line on top of a scatter plot showing your training data as black circles and your test data as blue circles
* Print the mean squared error and R-squared values for your model applied to the test data
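
The split, fit, and scoring steps above might be sketched as follows. The 80/20 split and `random_state=0` are assumptions made for reproducibility, not requirements of the assignment, and plotting is left out of this sketch.

```python
import numpy as np
import sklearn.datasets
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

diabetes = sklearn.datasets.load_diabetes()
x, y = diabetes.data, diabetes.target

# Pick the column whose correlation with y has the largest magnitude
r = np.array([np.corrcoef(x[:, j], y)[0, 1] for j in range(x.shape[1])])
best = int(np.argmax(np.abs(r)))
x_one = x[:, [best]]  # indexing with a list keeps the array 2D

# Assumed split: 20% of the records held out for testing
x_train, x_test, y_train, y_test = train_test_split(
    x_one, y, test_size=0.2, random_state=0
)

model = LinearRegression().fit(x_train, y_train)
print("feature:", diabetes.feature_names[best])
print("slope:", model.coef_[0], "intercept:", model.intercept_)

# Score on the held-out test data only
y_pred = model.predict(x_test)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print("test MSE:", mse)
print("test R^2:", r2)
```

For the plot, `model.predict` evaluated on a sorted range of feature values gives the points of the fitted line.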

Repeat the linear regression steps, only now use all 10 features when doing your model fit.
* Note that the feature data you pass to the fit method, as in `fit(x, y)`, must be a 2D NumPy array of shape (number_of_samples, number_of_features). When training on one feature, `x` has shape (number_of_samples, 1); when training on n features, `x` has shape (number_of_samples, n).
* I don't expect you to plot anything this time, but print the mean squared error and R-squared values for your model applied to the test data.
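
To illustrate the shape requirement, the sketch below first contrasts a 1D column with its 2D reshaped form, then fits on all 10 features. As before, the 80/20 split and `random_state=0` are assumptions.

```python
import sklearn.datasets
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

diabetes = sklearn.datasets.load_diabetes()
x, y = diabetes.data, diabetes.target

# Selecting one column with an integer index drops a dimension
one_column_1d = x[:, 0]                 # 1D: fit() would raise an error
one_column_2d = x[:, 0].reshape(-1, 1)  # 2D with a single column: accepted
print(one_column_1d.shape, one_column_2d.shape)

# With all 10 features, x is already 2D, so no reshaping is needed
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=0
)
model_all = LinearRegression().fit(x_train, y_train)

y_pred = model_all.predict(x_test)
mse_all = mean_squared_error(y_test, y_pred)
r2_all = r2_score(y_test, y_pred)
print("test MSE:", mse_all)
print("test R^2:", r2_all)
```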

Compare your model's results against the results obtained when training on just one feature.

Repeat the regression one more time using all feature variables, but now:
* Use the k-nearest neighbors algorithm rather than linear regression
* When doing so, use cross-validation to obtain the optimal number of neighbors before training your final model
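
One way to combine these two steps is scikit-learn's `GridSearchCV`, sketched below. The candidate range of 1 to 30 neighbors, the 5 folds, and the MSE-based scoring are all assumptions; other choices are equally valid. Cross-validation runs on the training set only, so the test set stays untouched until the final evaluation.

```python
import sklearn.datasets
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsRegressor

diabetes = sklearn.datasets.load_diabetes()
x, y = diabetes.data, diabetes.target

x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=0
)

# Try each candidate k with 5-fold cross-validation on the training data
search = GridSearchCV(
    KNeighborsRegressor(),
    param_grid={"n_neighbors": list(range(1, 31))},
    cv=5,
    scoring="neg_mean_squared_error",  # higher (less negative) is better
)
search.fit(x_train, y_train)
best_k = search.best_params_["n_neighbors"]
print("best number of neighbors:", best_k)

# By default GridSearchCV refits the best model on the full training set
knn = search.best_estimator_
y_pred = knn.predict(x_test)
mse_knn = mean_squared_error(y_test, y_pred)
r2_knn = r2_score(y_test, y_pred)
print("test MSE:", mse_knn)
print("test R^2:", r2_knn)
```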

Compare the MSE and $R^2$ scores of k-nearest neighbors, linear regression with the most correlated feature, and multilinear regression with all features.

***Optional Bonus Part (2 extra points)***

Perform linear regression with all feature variables, but now use LASSO regularization with an `alpha` large enough to zero out the coefficients of some features.
* Which features get zeroed out?
* Do you think it makes sense from a medical perspective to zero out those features?
* How does it affect the performance relative to linear regression without regularization?
* Can you find an alpha that gives improved performance?
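
A starting point for this experiment might look like the sketch below. The value `alpha=1.0` is scikit-learn's default and is used here only as an assumption to begin exploring; the split settings are likewise assumed. An unregularized fit on the same split serves as the baseline for comparison.

```python
import sklearn.datasets
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

diabetes = sklearn.datasets.load_diabetes()
x, y = diabetes.data, diabetes.target

x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=0
)

# Assumed starting value; vary alpha to see how the results change
lasso = Lasso(alpha=1.0).fit(x_train, y_train)

# Sort feature names by whether LASSO set their coefficient to zero
pairs = list(zip(diabetes.feature_names, lasso.coef_))
zeroed = [name for name, c in pairs if c == 0]
kept = [name for name, c in pairs if c != 0]
print("zeroed out:", zeroed)
print("kept:", kept)

# Unregularized baseline on the same split
ols = LinearRegression().fit(x_train, y_train)
for label, fitted in [("linear", ols), ("LASSO", lasso)]:
    y_pred = fitted.predict(x_test)
    print(label, "test MSE:", mean_squared_error(y_test, y_pred))
    print(label, "test R^2:", r2_score(y_test, y_pred))
```

Looping over several `alpha` values, or using cross-validation on the training set, is one way to search for a setting that improves test performance.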

## Submit

* Save your work (File -> Save Notebook)
* Verify that your notebook runs without error by restarting the kernel (or closing and opening the notebook) and selecting the top menu item for Run -> Run All Cells.  It should run successfully all the way to the bottom.
* Save your notebook again.  Keep all the output visible when saving the final version.
* Submit the file through the Canvas Assignment.