In this assignment, you will use LASSO to select features, building on a pre-implemented solver for LASSO (using Turi Create, though you can use other solvers). You will:

-  Write a function to normalize features
-  Implement coordinate descent for LASSO
-  Explore effects of L1 penalty
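
The section does not spell out how features are to be normalized, so the following is only a minimal sketch under one common assumption: each column is divided by its 2-norm. The helper name `normalize_features` and the toy matrix are illustrative, not taken from the assignment.

```python
import numpy as np

def normalize_features(feature_matrix):
    """Scale each column to unit 2-norm; return (normalized_matrix, norms).

    Assumption: no column is all zeros (that would divide by zero).
    """
    norms = np.linalg.norm(feature_matrix, axis=0)
    return feature_matrix / norms, norms

# Toy matrix (made up for illustration)
X = np.array([[3.0, 6.0, 9.0],
              [4.0, 8.0, 12.0]])
X_normalized, norms = normalize_features(X)
print(norms)         # column norms: 5, 10, 15
print(X_normalized)  # every column becomes [0.6, 0.8]
```

Returning the norms alongside the scaled matrix makes it possible to apply the same scaling to other data later.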

0. Load the sales dataset using Pandas:

In [1]:
import pandas as pd

dtype_dict = {'bathrooms':float, 'waterfront':int, 'sqft_above':int, 'sqft_living15':float, 'grade':int, 'yr_renovated':int, 'price':float, 'bedrooms':float, 'zipcode':str, 'long':float, 'sqft_lot15':float, 'sqft_living':float, 'floors':float, 'condition':int, 'lat':float, 'date':str, 'sqft_basement':int, 'yr_built':int, 'id':str, 'sqft_lot':int, 'view':int}

sales = pd.read_csv('kc_house_data.csv', dtype=dtype_dict)

1. Create new features by performing the following transformations on the inputs:

In [2]:
from math import log, sqrt
sales['sqft_living_sqrt'] = sales['sqft_living'].apply(sqrt)
sales['sqft_lot_sqrt'] = sales['sqft_lot'].apply(sqrt)
sales['bedrooms_square'] = sales['bedrooms']*sales['bedrooms']
sales['floors_square'] = sales['floors']*sales['floors']

- Squaring bedrooms increases the separation between houses with few bedrooms (e.g. 1) and houses with many bedrooms (e.g. 4), since 1^2 = 1 but 4^2 = 16. Consequently, this variable will mostly affect houses with many bedrooms.
- On the other hand, taking the square root of sqft_living decreases the separation between big and small houses. The owner may not be exactly twice as happy with a house that is twice as big.
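
The effect of the two transformations can be checked with a few numbers. This is a small illustration only; the living areas of 1000 and 2000 square feet are made-up values, and the variable names are hypothetical.

```python
from math import sqrt

few, many = 1, 4                  # bedroom counts from the example above
gap_raw = many - few
gap_squared = many**2 - few**2
print(gap_raw)                    # 3
print(gap_squared)                # 15

small, big = 1000.0, 2000.0       # hypothetical living areas in square feet
ratio_raw = big / small
ratio_sqrt = sqrt(big) / sqrt(small)
print(ratio_raw)                  # 2.0
print(ratio_sqrt)                 # about 1.414
```

Squaring widens the gap between the two bedroom counts from 3 to 15, while the square root shrinks a factor of 2 in area to a factor of about 1.41.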

2. Using the entire house dataset, learn regression weights using an L1 penalty of 5e2. Make sure to add "normalize=True" when creating the Lasso object. Refer to the following code snippet for the list of features.

#### Note. From here on, the list 'all_features' refers to the list defined in this snippet.



In [3]:
from sklearn import linear_model  # using scikit-learn

all_features = ['bedrooms', 'bedrooms_square',
            'bathrooms',
            'sqft_living', 'sqft_living_sqrt',
            'sqft_lot', 'sqft_lot_sqrt',
            'floors', 'floors_square',
            'waterfront', 'view', 'condition', 'grade',
            'sqft_above',
            'sqft_basement',
            'yr_built', 'yr_renovated']

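# Note: the `normalize` argument was removed in scikit-learn 1.2; this cell needs an older release.
model_all = linear_model.Lasso(alpha=5e2, normalize=True) # set parameters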
model_all.fit(sales[all_features], sales['price']) # learn weights

Lasso(alpha=500.0, copy_X=True, fit_intercept=True, max_iter=1000,
      normalize=True, positive=False, precompute=False, random_state=None,
      selection='cyclic', tol=0.0001, warm_start=False)

#### 3. Quiz Question: Which features have been chosen by LASSO, i.e. which features were assigned nonzero weights?

In [19]:
model_all.coef_

array([    0.        ,     0.        ,     0.        ,   134.43931396,
           0.        ,     0.        ,     0.        ,     0.        ,
           0.        ,     0.        , 24750.00458561,     0.        ,
       61749.10309071,     0.        ,     0.        ,    -0.        ,
           0.        ])

In [25]:
# Pair each feature name with its weight and print only the nonzero ones
for feature, weight in zip(all_features, model_all.coef_):
    if weight != 0:
        print(feature, '     ', weight)

sqft_living       134.43931395541435
view       24750.004585609502
grade       61749.10309070813


4. To find a good L1 penalty, we will explore multiple values using a validation set. Let us do a three-way split into training, validation, and test sets. Download the provided CSV files containing the training, validation, and test sets:

In [26]:
testing = pd.read_csv('wk3_kc_house_test_data.csv', dtype=dtype_dict)
training = pd.read_csv('wk3_kc_house_train_data.csv', dtype=dtype_dict)
validation = pd.read_csv('wk3_kc_house_valid_data.csv', dtype=dtype_dict)

Make sure to create the 4 features as we did in #1:

In [27]:
testing['sqft_living_sqrt'] = testing['sqft_living'].apply(sqrt)
testing['sqft_lot_sqrt'] = testing['sqft_lot'].apply(sqrt)
testing['bedrooms_square'] = testing['bedrooms']*testing['bedrooms']
testing['floors_square'] = testing['floors']*testing['floors']

training['sqft_living_sqrt'] = training['sqft_living'].apply(sqrt)
training['sqft_lot_sqrt'] = training['sqft_lot'].apply(sqrt)
training['bedrooms_square'] = training['bedrooms']*training['bedrooms']
training['floors_square'] = training['floors']*training['floors']

validation['sqft_living_sqrt'] = validation['sqft_living'].apply(sqrt)
validation['sqft_lot_sqrt'] = validation['sqft_lot'].apply(sqrt)
validation['bedrooms_square'] = validation['bedrooms']*validation['bedrooms']
validation['floors_square'] = validation['floors']*validation['floors']
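
Since the same four columns are built for every data set, the repetition could be folded into one helper. The following is a sketch only: `add_transformed_features` is a hypothetical name, and the two-row frame is made up to show the result.

```python
from math import sqrt
import pandas as pd

def add_transformed_features(df):
    """Return a copy of df with the four derived columns from step 1 added."""
    out = df.copy()
    out['sqft_living_sqrt'] = out['sqft_living'].apply(sqrt)
    out['sqft_lot_sqrt'] = out['sqft_lot'].apply(sqrt)
    out['bedrooms_square'] = out['bedrooms'] * out['bedrooms']
    out['floors_square'] = out['floors'] * out['floors']
    return out

# Tiny made-up frame standing in for one of the CSV files
demo = pd.DataFrame({
    'sqft_living': [400.0, 900.0],
    'sqft_lot': [1600, 2500],
    'bedrooms': [1.0, 4.0],
    'floors': [1.0, 2.0],
})
demo = add_transformed_features(demo)
print(demo[['sqft_living_sqrt', 'sqft_lot_sqrt', 'bedrooms_square', 'floors_square']])
```

With such a helper, each of `testing`, `training`, and `validation` could be passed through the same function, which keeps the three sets consistent.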

5. Now, for each l1_penalty in [10^1, 10^1.5, 10^2, 10^2.5, ..., 10^7] (to get this list in Python, use np.logspace(1, 7, num=13)):

    - Learn a model on TRAINING data using the specified l1_penalty. Make sure to specify normalize=True in the constructor:

In [28]:
import numpy as np

In [29]:
np.logspace(1, 7, num=13)

array([1.00000000e+01, 3.16227766e+01, 1.00000000e+02, 3.16227766e+02,
       1.00000000e+03, 3.16227766e+03, 1.00000000e+04, 3.16227766e+04,
       1.00000000e+05, 3.16227766e+05, 1.00000000e+06, 3.16227766e+06,
       1.00000000e+07])

In [52]:
l1_penalty = [.001,.01,.1,1,10,100,1000]
# alpha must be a single number, so build one Lasso model per penalty
models = [linear_model.Lasso(alpha=penalty, normalize=True) for penalty in l1_penalty]

In [53]:
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV

# 5-fold cross-validated grid search over the L1 penalty on the training data
lasso = Lasso()
parameters = {'alpha': [.001, .01, .1, 1, 10, 100, 1000]}
lasso_regressor = GridSearchCV(lasso, parameters, scoring='neg_mean_squared_error', cv=5)
X = training[all_features]
y = training.price
lasso_regressor.fit(X, y)
print(lasso_regressor.best_params_)
print(lasso_regressor.best_score_)

{'alpha': 1000}
-44176291959.80895


    - Compute the RSS on VALIDATION for the current model (print or save the RSS).
    - Report which L1 penalty produced the lowest RSS on VALIDATION.
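
The loop described in step 5 can be sketched as follows. This is a minimal sketch under stated assumptions rather than the notebook's own code: it uses synthetic data in place of the house CSV files, the helper names `rss` and `select_l1_penalty` are hypothetical, and `normalize=True` is left out because recent scikit-learn releases no longer accept it. The penalty grid is scaled to the synthetic data; for the house data the grid would be `np.logspace(1, 7, num=13)`.

```python
import numpy as np
from sklearn.linear_model import Lasso

def rss(model, X, y):
    """Residual sum of squares of a fitted model on (X, y)."""
    residuals = y - model.predict(X)
    return float(np.sum(residuals ** 2))

def select_l1_penalty(penalties, X_train, y_train, X_valid, y_valid):
    """Fit one Lasso per penalty on the training data and score it on validation.

    Returns (best_penalty, rss_by_penalty).
    """
    rss_by_penalty = {}
    for penalty in penalties:
        model = Lasso(alpha=penalty, max_iter=10000)
        model.fit(X_train, y_train)
        rss_by_penalty[float(penalty)] = rss(model, X_valid, y_valid)
    best_penalty = min(rss_by_penalty, key=rss_by_penalty.get)
    return best_penalty, rss_by_penalty

# Synthetic stand-in data: only two of the six features matter
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 6))
true_w = np.array([3.0, 0.0, 0.0, -2.0, 0.0, 0.0])
y = X @ true_w + rng.normal(scale=0.5, size=300)
X_train, y_train = X[:200], y[:200]
X_valid, y_valid = X[200:], y[200:]

penalties = np.logspace(-3, 1, num=9)
best_penalty, rss_by_penalty = select_l1_penalty(
    penalties, X_train, y_train, X_valid, y_valid)
print(best_penalty)
print(rss_by_penalty[best_penalty])
```

Unlike a cross-validated grid search on the training data alone, this loop scores every penalty on the held-out validation set, which is what the step asks for.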

In [35]:
l = lasso_regressor.predict(validation[all_features])

In [37]:
RSS = sum([x**2 for x in l-validation['price'].values])

In [38]:
RSS

402910514991385.2

In [39]:
lasso_regressor.best_params_

{'alpha': 1000.0}

In [40]:
from sklearn.metrics import r2_score

In [41]:
r2_score(validation['price'].values ,l )

0.6703578574706541

6. Quiz Question: Which was the best value for the l1_penalty, i.e. which value of l1_penalty produced the lowest RSS on VALIDATION data?

In [51]:
np.count_nonzero(lasso_regressor.best_estimator_.coef_) + np.count_nonzero(lasso_regressor.best_estimator_.intercept_)

17

In [50]:
lasso_regressor.best_estimator_.coef_

array([-1.24688118e+04,  3.26215700e+02,  4.90429733e+04,  3.05386743e+02,
       -5.31352190e+04,  1.00933430e+00, -8.41468468e+02, -0.00000000e+00,
        5.18429171e+03,  4.92281114e+05,  4.19114540e+04,  2.58066930e+04,
        1.29042207e+05,  4.02241279e+02,  4.04399774e+02, -3.23868278e+03,
        1.51907030e+01])

In [None]:
lasso_regressor.best_estimator_.coef_