# Part 1: Regularization

A) Using the Boston dataset, fit a Ridge regression model with the tuning parameter set to 100 (alpha=100). Find the $R^2$ score and the number of nonzero coefficients.

B) Use Lasso regression instead of Ridge regression, again with the tuning parameter set to 100. Find the $R^2$ score and the number of nonzero coefficients.

C) Change the tuning parameter of the Lasso model to a very low value (alpha=0.001). What is the $R^2$ score?

D) Comment on your results. In this problem, do all features seem important in making predictions?
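
For reference, the two penalties being compared can be written out. This is a sketch that assumes scikit-learn's documented formulations, with $w$ the coefficient vector, $n$ the number of training samples, and $\alpha$ the tuning parameter (the intercept is omitted for brevity):

$$
\text{Ridge:}\quad \min_w \; \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_2^2
$$

$$
\text{Lasso:}\quad \min_w \; \frac{1}{2n} \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_1
$$

The squared L2 penalty shrinks coefficients toward zero but rarely makes them exactly zero, while the L1 penalty can set coefficients exactly to zero. Because the two loss terms are scaled differently, the same numeric value of alpha does not imply the same penalty strength in both models.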

In [2]:
# Note: load_boston was removed in scikit-learn 1.2; this cell needs an older release.
import numpy as np
from sklearn.datasets import load_boston
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import train_test_split

dataset = load_boston()
X = dataset.data
Y = dataset.target
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, random_state=0)

# A) Ridge (L2 penalty) with a strong penalty
RidgeModel = Ridge(alpha=100).fit(X_train, Y_train)
print("R squared score with Ridge: ", RidgeModel.score(X_test, Y_test))
print("non zero coefficients: ", np.sum(RidgeModel.coef_ != 0))
print("\n")
# B) Lasso (L1 penalty) with the same alpha
LassoModel = Lasso(alpha=100).fit(X_train, Y_train)
print("R squared with Lasso: ", LassoModel.score(X_test, Y_test))
print("non zero coefficients: ", np.sum(LassoModel.coef_ != 0))
print("\n")
# C) Lasso with a very weak penalty
LassoModel = Lasso(alpha=0.001).fit(X_train, Y_train)
print("R squared score with Lasso and alpha = 0.001: ", LassoModel.score(X_test, Y_test))

R squared score with Ridge:  0.592535803616
non zero coefficients:  13


R squared with Lasso:  0.118669161755
non zero coefficients:  2


R squared score with Lasso and alpha = 0.001:  0.635035312517


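With alpha=100, Lasso sets 11 of the 13 coefficients to zero and its $R^2$ score falls to about 0.12, whereas Ridge keeps all 13 coefficients nonzero and scores about 0.59. Lasso with alpha=0.001 applies almost no penalty and reaches the highest score, about 0.64. The $R^2$ score is therefore higher when more coefficients are nonzero, which suggests that most of the features contribute to the predictions.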
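
The relationship between alpha, the number of nonzero coefficients, and the $R^2$ score can be explored with a small sweep. The following is a minimal sketch, not the exercise's solution: it assumes a synthetic dataset from `make_regression` (13 features, 5 of them informative) as a stand-in for the Boston data, so its numbers will differ from those above. The helper `lasso_path_summary` and the list of alpha values are hypothetical choices made for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split

# Assumption: synthetic stand-in data with 13 features, 5 of them informative.
X, y = make_regression(n_samples=500, n_features=13, n_informative=5,
                       noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)


def lasso_path_summary(alphas):
    """Return a list of (alpha, test R^2, nonzero coefficient count)."""
    rows = []
    for alpha in alphas:
        model = Lasso(alpha=alpha, max_iter=10000).fit(X_train, y_train)
        r2 = model.score(X_test, y_test)
        n_nonzero = int(np.sum(model.coef_ != 0))
        rows.append((alpha, r2, n_nonzero))
    return rows


summary = lasso_path_summary([0.001, 0.1, 1, 10, 100, 1000])
for alpha, r2, n_nonzero in summary:
    print(f"alpha={alpha:<7} R^2={r2:7.3f} nonzero={n_nonzero}")
```

Reading the printed table from the smallest alpha to the largest shows how a stronger L1 penalty removes coefficients and, past some point, costs predictive accuracy.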

# Part 2: Logistic Regression

In this exercise, you will use logistic regression to classify breast cancer tumors as malignant or benign using the sklearn data set. Run the code below to print and read the description of the data set. Use logistic regression with Lasso regularization (penalty="l1") and the default regularization parameter to build the classifier. What is the accuracy?
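
For context, one common way to write the L1-penalized logistic regression objective, assuming the formulation in scikit-learn's user guide with labels $y_i \in \{-1, 1\}$, weights $w$, and intercept $c$, is:

$$
\min_{w,\, c} \; \lVert w \rVert_1 + C \sum_{i=1}^{n} \log\left(\exp\left(-y_i \left(x_i^\top w + c\right)\right) + 1\right)
$$

Here $C$ is the inverse of the regularization strength, so a smaller $C$ means a stronger penalty. The default value in `LogisticRegression` is `C=1.0`.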


In [3]:
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
import numpy as np

DataCancer=load_breast_cancer()
print(DataCancer.keys())
print(DataCancer.DESCR)

X_features=DataCancer.data
Y_targetClass=DataCancer.target

X_train, X_test, Y_train, Y_test= train_test_split(X_features, Y_targetClass, random_state= 0)


dict_keys(['data', 'target', 'target_names', 'DESCR', 'feature_names'])
Breast Cancer Wisconsin (Diagnostic) Database

Notes
-----
Data Set Characteristics:
    :Number of Instances: 569

    :Number of Attributes: 30 numeric, predictive attributes and the class

    :Attribute Information:
        - radius (mean of distances from center to points on the perimeter)
        - texture (standard deviation of gray-scale values)
        - perimeter
        - area
        - smoothness (local variation in radius lengths)
        - compactness (perimeter^2 / area - 1.0)
        - concavity (severity of concave portions of the contour)
        - concave points (number of concave portions of the contour)
        - symmetry 
        - fractal dimension ("coastline approximation" - 1)

        The mean, standard error, and "worst" or largest (mean of the three
        largest values) of these features were computed for each image,
        resulting in 30 features.  For instance, field 3 is Mean Ra

In [4]:
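# solver="liblinear" supports the L1 penalty; the default solver in newer
# scikit-learn releases (lbfgs) does not and raises an error with penalty="l1".
LogRegModel = LogisticRegression(C=1, penalty="l1", solver="liblinear").fit(X_train, Y_train)
print("The accuracy using Lasso Regularization is: ", LogRegModel.score(X_test, Y_test))
# Each row holds the predicted probability of class 0 and class 1 for one test sample.
Probabilities = LogRegModel.predict_proba(X_test)
print("\n", Probabilities)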

The accuracy using Lasso Regularization is:  0.958041958042

 [[  9.93860932e-01   6.13906844e-03]
 [  2.60280380e-02   9.73971962e-01]
 [  1.49105374e-03   9.98508946e-01]
 [  1.56033658e-01   8.43966342e-01]
 [  5.79392477e-05   9.99942061e-01]
 [  2.29213320e-03   9.97707867e-01]
 [  6.34034179e-03   9.93659658e-01]
 [  1.06929994e-03   9.98930700e-01]
 [  3.65814534e-02   9.63418547e-01]
 [  1.58344868e-04   9.99841655e-01]
 [  4.37763256e-01   5.62236744e-01]
 [  1.47827175e-01   8.52172825e-01]
 [  3.10623829e-03   9.96893762e-01]
 [  7.65672954e-01   2.34327046e-01]
 [  1.89025185e-01   8.10974815e-01]
 [  9.93247768e-01   6.75223202e-03]
 [  2.11257818e-02   9.78874218e-01]
 [  9.99999999e-01   1.04541950e-09]
 [  9.99296476e-01   7.03523940e-04]
 [  1.00000000e+00   1.38356549e-12]
 [  9.99981961e-01   1.80394851e-05]
 [  9.39906285e-01   6.00937146e-02]
 [  1.20697710e-03   9.98793023e-01]
 [  8.59824905e-03   9.91401751e-01]
 [  9.96853880e-01   3.14611968e-03]
 [  8.1692286

Accuracy is high when using Lasso regularization: the classifier labels about 95.8% of the test samples correctly.
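
To put an accuracy figure in context, it can help to compare it with a trivial baseline and to check how many features the L1 penalty keeps. The sketch below is one possible way to do this, not part of the original exercise: the `DummyClassifier` baseline, the `random_state=0` passed to the model, and the variable names are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, random_state=0
)

# Baseline that always predicts the most frequent class in the training set.
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
baseline_acc = baseline.score(X_test, y_test)

# L1-penalized logistic regression; liblinear is used because it supports penalty="l1".
model = LogisticRegression(
    C=1, penalty="l1", solver="liblinear", random_state=0
).fit(X_train, y_train)
model_acc = model.score(X_test, y_test)

# Count the features whose coefficient was not driven to zero.
n_nonzero = int(np.sum(model.coef_ != 0))

print(f"baseline accuracy: {baseline_acc:.3f}")
print(f"L1 logistic regression accuracy: {model_acc:.3f}")
print(f"nonzero coefficients: {n_nonzero} of {data.data.shape[1]}")
```

If the model's accuracy is well above the baseline while using only a subset of the 30 features, that supports the reading that the L1 penalty is selecting informative features instead of merely predicting the majority class.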