# Part 1: Diabetes

In this part of the assignment, you will build a predictive model for diabetes disease progression over the next year based on currently observed features of disease symptoms.

**Learning objectives.** You will:
1. Train and test a linear model using ordinary least squares regression. 
2. Apply regularization, specifically LASSO, to build a sparse model.

The following code will load the data and preview three examples. The ten features are as follows (in order):

- `age`: age in years
- `sex`
- `bmi`: body mass index
- `bp`: average blood pressure
- `s1`: tc, total serum cholesterol
- `s2`: ldl, low-density lipoproteins
- `s3`: hdl, high-density lipoproteins
- `s4`: tch, total cholesterol / HDL
- `s5`: ltg, log of serum triglycerides level
- `s6`: glu, blood sugar level

The target value is a quantitative measure of disease progression after 1 year, where larger numbers are worse.

The code stores the feature matrix `X` as a two-dimensional NumPy array where each row corresponds to a data point and each column is a feature. The target value is stored as a one-dimensional NumPy array `y`, where element `i` of `y` corresponds to the data point in row `i` of `X`.

Your overall goal in this part is to build and evaluate a linear model to predict the target variable `y` as a function of the ten features in `X`, and to identify which features are more significant for predicting `y`.

In [5]:
# Run but DO NOT MODIFY this code

from sklearn.datasets import load_diabetes

# Load the diabetes dataset
diabetes = load_diabetes(scaled = False)
print(diabetes.feature_names)

# Get the feature data and target variable
X = diabetes.data
y = diabetes.target

# Preview the first 3 data points
print(X[:3])
print(y[:3])

['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']
[[ 59.       2.      32.1    101.     157.      93.2     38.       4.
    4.8598  87.    ]
 [ 48.       1.      21.6     87.     183.     103.2     70.       3.
    3.8918  69.    ]
 [ 72.       2.      30.5     93.     156.      93.6     41.       4.
    4.6728  85.    ]]
[151.  75. 141.]


## Task 1

Randomly split the input data into a [train and test partition](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html), with 30% of the data reserved for testing. Use a random seed of `2024` for reproducibility of the results.

In [9]:
# Write task 1 code here
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test=train_test_split(X,y, test_size=.3, random_state=2024)

## Task 2

Build a baseline prediction by computing the [average](https://numpy.org/doc/stable/reference/generated/numpy.mean.html) target value of the training data. Evaluate the [root mean squared error](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.root_mean_squared_error.html#sklearn.metrics.root_mean_squared_error) between the baseline and the test data.

In [35]:
# Write task 2 code here
import numpy as np
from sklearn.metrics import root_mean_squared_error

# Baseline: predict the training-set mean for every test point
baseline = np.mean(y_train)
print(baseline)

# Size the prediction array from y_test instead of hard-coding its length
baseline_array = np.full(len(y_test), baseline)
rmse = root_mean_squared_error(y_test, baseline_array)
print(rmse)

152.03236245954693
78.17581726028506



## Task 3

Build a linear predictive model using [ordinary least squares regression](https://scikit-learn.org/stable/modules/linear_model.html#ordinary-least-squares) fit on the training data. 

Evaluate the [root mean squared error](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.root_mean_squared_error.html#sklearn.metrics.root_mean_squared_error) of the model on **both** the training data **and** the test data (that is, the training error and the generalization error). Report both and briefly discuss the results: Do you observe underfitting or overfitting?

Note that the model predictions on the test data may not be perfect, but they should improve meaningfully over the simple baseline from Task 2 or something is wrong.

In [39]:
from sklearn import linear_model

# Fit ordinary least squares on the training data
model = linear_model.LinearRegression()
model.fit(X_train, y_train)

# Training error
y_hat = model.predict(X_train)
train_rmse = root_mean_squared_error(y_train, y_hat)

# Generalization (test) error
y_predict = model.predict(X_test)
test_rmse = root_mean_squared_error(y_test, y_predict)
print(f"Training Error:{train_rmse}")
print(f"Test Error:{test_rmse}")

Training Error:52.852354801187886
Test Error:55.61674711723443


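I think we are observing underfitting rather than overfitting. The training RMSE (52.85) and test RMSE (55.62) are close to each other, so the model generalizes about as well as it fits. Both improve meaningfully on the Task 2 baseline (78.18), but they are still relatively high: roughly a third of the mean training target value (152.03).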
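
The comparison behind this discussion can be made explicit with a little arithmetic on the errors reported above. The variable names are illustrative; the numbers are copied from the Task 2 and Task 3 outputs.

```python
# RMSE values copied from the outputs of Task 2 and Task 3
baseline_rmse = 78.17581726028506
ols_train_rmse = 52.852354801187886
ols_test_rmse = 55.61674711723443

# A small train/test gap suggests little overfitting
gap = ols_test_rmse - ols_train_rmse

# Relative reduction in test RMSE compared with the mean baseline
improvement = 1 - ols_test_rmse / baseline_rmse

print(f"train/test gap: {gap:.2f}")                      # train/test gap: 2.76
print(f"improvement over baseline: {improvement:.1%}")   # improvement over baseline: 28.9%
```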

## Task 4

If your goal is to understand which of the input features in `X` are most important for predicting the target `y`, the linear model you built in Task 3 may not be very helpful. Build a new linear model using [Lasso regression](https://scikit-learn.org/stable/modules/linear_model.html#lasso) that achieves a generalization error comparable to that of the Task 3 ordinary least squares model (within 10% of its root mean squared error on the test set), but with **0 for at least three of the model coefficients** (that is, the model does not use these features to make predictions).
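
For reference, the objective that scikit-learn's `Lasso` minimizes, as given in its documentation, is shown below, where $n$ is the number of training points, $w$ is the coefficient vector, and $\alpha$ is the `alpha` hyperparameter:

$$
\min_{w} \; \frac{1}{2n} \lVert Xw - y \rVert_2^2 + \alpha \lVert w \rVert_1
$$

The $\ell_1$ penalty is what drives some coefficients to exactly 0; larger values of $\alpha$ generally zero out more of them, at the cost of a worse fit.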

You may need to try multiple values of the `alpha` *hyperparameter* to find a model that satisfies both the error and *sparsity* constraints (that is, at least three of the coefficients are 0). Nevertheless, you should only evaluate error on the test dataset **once**. Show your work for how you find a good `alpha` in code and explain your work in English below. Standard approaches would be to split the training data into a train and validation set, or to use [cross validation](https://scikit-learn.org/stable/modules/cross_validation.html) on the training data.

For your final fit Lasso model with the chosen `alpha`, report the [root mean squared error](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.root_mean_squared_error.html#sklearn.metrics.root_mean_squared_error) on the test data. Also report the model coefficients and use this to explain which features (see their names/interpretations above) seem less important for predicting the target.

In [127]:
model_lasso = linear_model.Lasso(alpha=4.4)

# Hold out 30% of the training data as a validation set for tuning alpha
X_train1, X_val, y_train1, y_val = train_test_split(X_train, y_train, test_size=.3, random_state=2024)
model_lasso.fit(X_train1, y_train1)
print(model_lasso.coef_)  # coefficients of the fit on the reduced training split
y_hat = model_lasso.predict(X_val)
val_rmse = root_mean_squared_error(y_val, y_hat)
print(f"Val Error:{val_rmse}")

# Refit on the full training data with the chosen alpha, then evaluate on the test set once
model_lasso.fit(X_train, y_train)
y_hat = model_lasso.predict(X_train)
train_rmse = root_mean_squared_error(y_train, y_hat)
y_predict = model_lasso.predict(X_test)
test_rmse = root_mean_squared_error(y_test, y_predict)
print(f"Training Error:{train_rmse}")
print(f"Test Error:{test_rmse}")

[-0.05914569 -0.          6.66461255  1.20451025  1.00923764 -1.18047839
 -1.72911058 -0.          0.          0.07443746]
Val Error:59.23521335484314
Training Error:54.95636472789052
Test Error:57.70418540942942


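I split the training data into a training set and a validation set so I could find a good `alpha` hyperparameter without using the test set. I increased `alpha` until three of the coefficients were 0, and then decreased `alpha` to satisfy the error constraint. With `alpha=4.4`, the test RMSE (57.70) is about 3.8% higher than the Task 3 test RMSE (55.62), which is within the 10% limit. The coefficients printed above, from the fit on the reduced training split, are as follows: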

- `age` (age in years): -0.05914569
- `sex`: 0
- `bmi` (body mass index): 6.66461255
- `bp` (average blood pressure): 1.20451025
- `s1` (tc, total serum cholesterol): 1.00923764
- `s2` (ldl, low-density lipoproteins): -1.18047839
- `s3` (hdl, high-density lipoproteins): -1.72911058
- `s4` (tch, total cholesterol / HDL): 0
- `s5` (ltg, log of serum triglycerides level): 0
- `s6` (glu, blood sugar level): 0.07443746

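It seems that sex, s4 (total cholesterol / HDL), and s5 (log of serum triglycerides level) were the least important for predicting the target value, since the Lasso model set their coefficients to 0.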
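
The features that Lasso dropped can also be read off programmatically. The sketch below copies the feature names and the printed coefficient vector from the outputs above; the names `dropped` and `kept` are illustrative.

```python
# Feature names and coefficients copied from the printed outputs above
feature_names = ['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']
coefs = [-0.05914569, -0.0, 6.66461255, 1.20451025, 1.00923764,
         -1.18047839, -1.72911058, -0.0, 0.0, 0.07443746]

# Features whose coefficient is exactly zero are not used by the model
dropped = [name for name, c in zip(feature_names, coefs) if c == 0]

# Remaining features, ordered by absolute coefficient size
kept = sorted(
    ((name, c) for name, c in zip(feature_names, coefs) if c != 0),
    key=lambda pair: abs(pair[1]),
    reverse=True,
)

print(dropped)   # ['sex', 's4', 's5']
print(kept[0])   # ('bmi', 6.66461255)
```

Because the data was loaded with `scaled = False`, the sizes of the nonzero coefficients depend on each feature's units, so the ordering in `kept` is only a rough guide to relative importance.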