Scikit-learn is one of the most widely used Python frameworks for Data Science and
Machine Learning. It implements a wide range of Machine Learning algorithms covering major
areas such as classification, clustering, and regression. Mainstream Machine Learning
algorithms such as support vector machines, logistic regression, random forests, K-means
clustering, and hierarchical clustering are all implemented efficiently in this library,
which arguably forms the foundation of applied Machine Learning in Python. In addition, its
easy-to-use API and code design patterns have been widely adopted by other frameworks.

# The Dataset
The diabetes dataset is one of the datasets bundled with the scikit-learn library. This small, well-known
dataset lets new users of the library learn and experiment with various Machine Learning concepts.
It contains observations of 10 baseline variables (age, sex, body mass index, average blood pressure,
and six blood serum measurements) for 442 diabetes patients. The features bundled with the package are
already standardized (scaled), i.e. each feature column has zero mean and unit L2 norm. The response
(or target variable) is a quantitative measure of disease progression one year after baseline. The dataset
can be used to answer two questions:
            • What is the baseline prediction of disease progression for future patients?
            • Which independent variables (features) are important factors for predicting disease progression?
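
The scaling claim above can be checked directly. The following is a minimal sketch, assuming the default (scaled) version returned by `load_diabetes()`; the names `col_means` and `col_norms` are illustrative and not part of the original notebook.

```python
import numpy as np
from sklearn import datasets

# Load the bundled (already scaled) diabetes features.
diabetes = datasets.load_diabetes()
X = diabetes.data

# Mean of each of the 10 feature columns; expected to be numerically zero.
col_means = X.mean(axis=0)

# L2 norm of each column (square root of the sum of squares); expected to be 1.
col_norms = np.linalg.norm(X, axis=0)

print("all column means ~ 0:", np.allclose(col_means, 0.0, atol=1e-6))
print("all column norms ~ 1:", np.allclose(col_norms, 1.0, atol=1e-6))
```

Because every column shares the same scale, the magnitudes of linear model coefficients fitted on this data are roughly comparable across features.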

In [1]:
from sklearn import datasets

In [2]:
diabetes = datasets.load_diabetes()
y = diabetes.target
X = diabetes.data

In [5]:
X.shape

(442, 10)

In [6]:
X[:5]

array([[ 0.03807591,  0.05068012,  0.06169621,  0.02187235, -0.0442235 ,
        -0.03482076, -0.04340085, -0.00259226,  0.01990842, -0.01764613],
       [-0.00188202, -0.04464164, -0.05147406, -0.02632783, -0.00844872,
        -0.01916334,  0.07441156, -0.03949338, -0.06832974, -0.09220405],
       [ 0.08529891,  0.05068012,  0.04445121, -0.00567061, -0.04559945,
        -0.03419447, -0.03235593, -0.00259226,  0.00286377, -0.02593034],
       [-0.08906294, -0.04464164, -0.01159501, -0.03665645,  0.01219057,
         0.02499059, -0.03603757,  0.03430886,  0.02269202, -0.00936191],
       [ 0.00538306, -0.04464164, -0.03638469,  0.02187235,  0.00393485,
         0.01559614,  0.00814208, -0.00259226, -0.03199144, -0.04664087]])

In [7]:
y[:10]

array([151.,  75., 141., 206., 135.,  97., 138.,  63., 110., 310.])

In [8]:
feature_names = ['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']

In [9]:
import numpy as np
from sklearn import datasets
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV

In [10]:
diabetes = datasets.load_diabetes()
# Train on the first 310 samples and hold out the remaining 132 for testing.
X_train = diabetes.data[:310]
y_train = diabetes.target[:310]
X_test = diabetes.data[310:]
y_test = diabetes.target[310:]

In [11]:
lasso = Lasso(random_state=0)
alphas = np.logspace(-4, -0.5, 30)

In [12]:
estimator = GridSearchCV(lasso, dict(alpha=alphas))
estimator.fit(X_train, y_train)

GridSearchCV(estimator=Lasso(random_state=0),
             param_grid={'alpha': array([1.00000000e-04, 1.32035178e-04, 1.74332882e-04, 2.30180731e-04,
       3.03919538e-04, 4.01280703e-04, 5.29831691e-04, 6.99564216e-04,
       9.23670857e-04, 1.21957046e-03, 1.61026203e-03, 2.12611233e-03,
       2.80721620e-03, 3.70651291e-03, 4.89390092e-03, 6.46167079e-03,
       8.53167852e-03, 1.12648169e-02, 1.48735211e-02, 1.96382800e-02,
       2.59294380e-02, 3.42359796e-02, 4.52035366e-02, 5.96845700e-02,
       7.88046282e-02, 1.04049831e-01, 1.37382380e-01, 1.81393069e-01,
       2.39502662e-01, 3.16227766e-01])})

In [13]:
estimator.best_score_

0.46170948106181964

In [14]:
estimator.best_estimator_

Lasso(alpha=0.07880462815669913, random_state=0)
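
One way to relate the tuned model back to the two questions posed for the dataset is to score it on the held-out rows and inspect its coefficients. The sketch below repeats the setup from the cells above so that it runs on its own; the names `search` and `test_r2` are illustrative, and treating coefficient magnitude as a proxy for feature importance is a rough heuristic, not a definitive ranking.

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV

diabetes = datasets.load_diabetes()
X_train, y_train = diabetes.data[:310], diabetes.target[:310]
X_test, y_test = diabetes.data[310:], diabetes.target[310:]

# Same alpha grid and cross-validated search as in the cells above.
alphas = np.logspace(-4, -0.5, 30)
search = GridSearchCV(Lasso(random_state=0), dict(alpha=alphas))
search.fit(X_train, y_train)
best = search.best_estimator_

# R^2 of the refitted best model on the held-out samples.
test_r2 = best.score(X_test, y_test)
print(f"held-out R^2: {test_r2:.3f}")

# Coefficients sorted by absolute size; Lasso may shrink some to exactly zero.
ranked = sorted(zip(diabetes.feature_names, best.coef_), key=lambda p: -abs(p[1]))
for name, coef in ranked:
    print(f"{name:>4s} {coef: .1f}")
```

Features whose coefficients are driven to zero contribute nothing to this particular model's predictions, while larger absolute coefficients indicate a stronger linear association with disease progression on the scaled data.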