Hello! This time I'll work with the diabetes dataset from scikit-learn.

In [1]:
# Import libraries
from pandas import DataFrame
import numpy as np
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split

import sklearn.metrics

from sklearn.linear_model import LassoLarsCV

from sklearn.datasets import load_diabetes 
from sklearn.preprocessing import scale

%matplotlib inline

First, I'll load the data and the corresponding response variable (a quantitative measure of disease progression one year after baseline).

In [2]:
data = load_diabetes()

df_data = DataFrame(data=data['data'], columns = data['feature_names']).dropna()
df_target = DataFrame(data=data['target'])

The description of the data:

In [3]:
print(data['DESCR'])

Diabetes dataset

Notes
-----

Ten baseline variables, age, sex, body mass index, average blood
pressure, and six blood serum measurements were obtained for each of n =
442 diabetes patients, as well as the response of interest, a
quantitative measure of disease progression one year after baseline.

Data Set Characteristics:

  :Number of Instances: 442

  :Number of Attributes: First 10 columns are numeric predictive values

  :Target: Column 11 is a quantitative measure of disease progression one year after baseline

  :Attributes:
    :Age:
    :Sex:
    :Body mass index:
    :Average blood pressure:
    :S1:
    :S2:
    :S3:
    :S4:
    :S5:
    :S6:

Note: Each of these 10 feature variables have been mean centered and scaled by the standard deviation times `n_samples` (i.e. the sum of squares of each column totals 1).

Source URL:
http://www4.stat.ncsu.edu/~boos/var.select/diabetes.html

For more information see:
Bradley Efron, Trevor Hastie, Iain Johnstone and Robert Tibshirani

All predictors are real numbers.

In [4]:
df_data.dtypes

age    float64
sex    float64
bmi    float64
bp     float64
s1     float64
s2     float64
s3     float64
s4     float64
s5     float64
s6     float64
dtype: object

Some brief statistics for the explanatory variables.

In [5]:
df_data.describe()

statistic,age,sex,bmi,bp,s1,s2,s3,s4,s5,s6
count,442.0,442.0,442.0,442.0,442.0,442.0,442.0,442.0,442.0,442.0
mean,-3.634285e-16,1.308343e-16,-8.045349e-16,1.281655e-16,-8.835316e-17,1.327024e-16,-4.574646e-16,3.777301e-16,-3.830854e-16,-3.412882e-16
std,0.04761905,0.04761905,0.04761905,0.04761905,0.04761905,0.04761905,0.04761905,0.04761905,0.04761905,0.04761905
min,-0.1072256,-0.04464164,-0.0902753,-0.1123996,-0.1267807,-0.1156131,-0.1023071,-0.0763945,-0.1260974,-0.1377672
25%,-0.03729927,-0.04464164,-0.03422907,-0.03665645,-0.03424784,-0.0303584,-0.03511716,-0.03949338,-0.03324879,-0.03317903
50%,0.00538306,-0.04464164,-0.007283766,-0.005670611,-0.004320866,-0.003819065,-0.006584468,-0.002592262,-0.001947634,-0.001077698
75%,0.03807591,0.05068012,0.03124802,0.03564384,0.02835801,0.02984439,0.0293115,0.03430886,0.03243323,0.02791705
max,0.1107267,0.05068012,0.1705552,0.1320442,0.1539137,0.198788,0.1811791,0.1852344,0.133599,0.1356118
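
The identical standard deviation in every column follows from the scaling described in the dataset notes: a centred column whose squares sum to 1 has a sample standard deviation of 1/sqrt(n - 1), which is 1/21 ≈ 0.047619 for n = 442. Below is a minimal pure-Python sketch of that scaling. The helper name `unit_norm_center` and the toy column are illustrative assumptions, not part of scikit-learn or of the dataset.

```python
import math
import statistics


def unit_norm_center(column):
    """Mean-centre a column and scale it so that its squares sum to 1."""
    mean = sum(column) / len(column)
    centered = [v - mean for v in column]
    norm = math.sqrt(sum(v * v for v in centered))
    return [v / norm for v in centered]


# Toy column with 442 made-up values (same n as the diabetes data).
toy = [float(i % 17) for i in range(442)]
scaled = unit_norm_center(toy)

sum_of_squares = sum(v * v for v in scaled)
sample_std = statistics.stdev(scaled)          # divides by n - 1, like describe()
expected_std = 1 / math.sqrt(len(scaled) - 1)  # 1/21 for n = 442

print(sum_of_squares, sample_std, expected_std)
```

Because every predictor shares this scale, coefficient magnitudes from a linear model fitted on them can be compared directly.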


The predictors are already scaled, so I only standardize the target, then split the data into training and test sets and fit the model on the training set.

In [6]:
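# Standardize the target (the predictors are already scaled)
targ_scaled = scale(df_target.values)

# Hold out 20% of the rows for testing
x_train, x_test, y_train, y_test = train_test_split(df_data.values, targ_scaled, test_size = .2)

# Lasso fitted with least-angle regression; penalty chosen by 5-fold cross-validation
model = LassoLarsCV(cv=5)
model = model.fit(x_train,y_train.ravel())

pred = model.predict(x_test)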
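
For reference, the `scale` call applied to the target amounts to a z-score transform: subtract the mean and divide by the standard deviation. Here is a minimal pure-Python sketch of that idea, assuming the population standard deviation (dividing by n), which is what scikit-learn's `scale` uses. The function name `standardize` and the toy values are made up for illustration.

```python
import math


def standardize(values):
    """Return z-scores: subtract the mean, divide by the population std."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)  # divides by n
    return [(v - mean) / std for v in values]


# Made-up progression scores, for illustration only.
toy_target = [90.0, 120.0, 150.0, 200.0, 240.0]
z = standardize(toy_target)

z_mean = sum(z) / len(z)
z_var = sum(v * v for v in z) / len(z)

print(z_mean, z_var)
```

After this transform the fitted coefficients are expressed in units of the target's standard deviation, while $R^2$ is unaffected by the rescaling.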

Here is the $R^2$ score for the model on the test set.

In [7]:
model.score(x_test, y_test)

0.52535587058438415

The score indicates that the model explains about 53% of the variance of the target in the test set.
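
For a scikit-learn regressor, `score` returns the coefficient of determination, $R^2 = 1 - SS_{res} / SS_{tot}$. The sketch below computes it by hand on a few made-up numbers to show where the "share of variance explained" reading comes from. The function name `r2_score_manual` and the toy values are assumptions for illustration, not the notebook's data.

```python
def r2_score_manual(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    # Squared error left over after using the model's predictions
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Squared error of the baseline that always predicts the mean
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Made-up values, for illustration only.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.5, 1.5, 3.5, 3.5]

r2 = r2_score_manual(y_true, y_pred)
print(r2)
```

With these toy values the residual sum of squares is 1.0 and the total sum of squares is 5.0, so $R^2$ is 0.8. A score of about 0.525 means the residual sum of squares is a little under half of the total.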

Here are all the features ranked by their coefficients, from the largest positive to the largest negative.

Body mass index has the largest positive coefficient, followed by the fifth blood serum measurement (s5) and average blood pressure. Sex, s1, and s3 have the largest negative coefficients, and the lasso shrank the coefficient of s2 to exactly zero.

In [8]:
from operator import itemgetter

sorted(list(zip(data['feature_names'], model.coef_)), key=itemgetter(1), reverse = True)

[('bmi', 7.1389996023553532),
 ('s5', 6.2995457397410464),
 ('bp', 3.7618136383775553),
 ('s6', 1.3683896993631872),
 ('s4', 0.54771197804771166),
 ('age', 0.29342676762715197),
 ('s2', 0.0),
 ('s3', -2.0381202024994098),
 ('s1', -2.4489973085257102),
 ('sex', -2.9672772400612497)]
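
Sorting by signed value places the strongest negative effects at the bottom of the list. Since all predictors share the same scale, it can also help to rank them by absolute value and to separate the features the lasso kept from those it dropped. The sketch below does this with the coefficients copied from the output above. The variable names are illustrative, and the numbers will differ on another run because the train/test split is random.

```python
# Coefficients copied from the output above.
coefs = [
    ('bmi', 7.1389996023553532),
    ('s5', 6.2995457397410464),
    ('bp', 3.7618136383775553),
    ('s6', 1.3683896993631872),
    ('s4', 0.54771197804771166),
    ('age', 0.29342676762715197),
    ('s2', 0.0),
    ('s3', -2.0381202024994098),
    ('s1', -2.4489973085257102),
    ('sex', -2.9672772400612497),
]

# Rank by magnitude, regardless of sign
by_magnitude = sorted(coefs, key=lambda pair: abs(pair[1]), reverse=True)

# Features the lasso kept (non-zero) and dropped (exactly zero)
selected = [name for name, coef in coefs if coef != 0.0]
dropped = [name for name, coef in coefs if coef == 0.0]

for name, coef in by_magnitude:
    print(f"{name:>4} {coef:+.3f}")
print("dropped:", dropped)
```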