In this assignment, you will use LASSO to select features, building on a pre-implemented solver for LASSO (using GraphLab Create, though you can use other solvers). You will:

* Run LASSO with different L1 penalties.
* Choose the best L1 penalty using a validation set.
* Choose the best L1 penalty using a validation set, with an additional constraint on the size of the selected subset.

In the second assignment, you will implement your own LASSO solver, using coordinate descent.

In [1]:
import graphlab
sales = graphlab.SFrame('kc_house_data.gl/')



In [2]:
from math import log, sqrt
sales['sqft_living_sqrt'] = sales['sqft_living'].apply(sqrt)
sales['sqft_lot_sqrt'] = sales['sqft_lot'].apply(sqrt)
sales['bedrooms_square'] = sales['bedrooms']*sales['bedrooms']

# In the dataset, 'floors' is stored as a string,
# so we convert it to float before creating a new feature.
sales['floors'] = sales['floors'].astype(float)
sales['floors_square'] = sales['floors']*sales['floors']

In [3]:
# * Squaring bedrooms increases the separation between houses with few bedrooms (e.g. 1) and houses
#   with many bedrooms (e.g. 4), since 1^2 = 1 but 4^2 = 16. Consequently this variable will mostly
#   affect houses with many bedrooms.
# * Taking the square root of sqft_living, on the other hand, decreases the separation between big
#   and small houses. An owner may not be exactly twice as happy with a house that is twice as big.
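
The effect of the two transformations can be checked with a few numbers. The sketch below is only an illustration: the bedroom counts come from the comment above, while the two house sizes are hypothetical values chosen so the square roots are exact.

```python
from math import sqrt

# Squaring widens the gap between few and many bedrooms.
few, many = 1, 4
gap_raw = many - few                  # gap on the original scale
gap_squared = many ** 2 - few ** 2    # gap after squaring

# The square root narrows the ratio between a big and a small house.
small_sqft, big_sqft = 900.0, 3600.0  # hypothetical living areas
ratio_raw = big_sqft / small_sqft
ratio_sqrt = sqrt(big_sqft) / sqrt(small_sqft)

print(gap_raw, gap_squared)    # the gap grows from 3 to 15
print(ratio_raw, ratio_sqrt)   # the ratio shrinks from 4.0 to 2.0
```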

## Learn regression weights with L1 penalty

In [4]:
all_features = ['bedrooms', 'bedrooms_square',
                'bathrooms',
                'sqft_living', 'sqft_living_sqrt',
                'sqft_lot', 'sqft_lot_sqrt',
                'floors', 'floors_square',
                'waterfront', 'view', 'condition', 'grade',
                'sqft_above',
                'sqft_basement',
                'yr_built', 'yr_renovated']

In [5]:
model_all = graphlab.linear_regression.create(sales, target='price', features=all_features,
                                               validation_set=None, l2_penalty=0., l1_penalty=1e10)

In [74]:
# Note that a majority of the weights have been set to zero. So by setting an L1 penalty that's
# large enough, we are performing a subset selection.

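# QUIZ QUESTION: According to this list of weights, which of the features have been chosen?
# The coefficient list has one more entry than all_features: index 0 is treated as the
# intercept, so coefficient i corresponds to all_features[i-1].
print len(all_features)
print len(model_all.coefficients['value'])
print [all_features[i-1] for i in range(len(model_all.coefficients['value'])) if i != 0 and model_all.coefficients['value'][i] != 0]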

17
18
['bathrooms', 'sqft_living', 'sqft_living_sqrt', 'grade', 'sqft_above']
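
One way to see why a large L1 penalty produces exact zeros is the soft-thresholding step used by coordinate-descent LASSO solvers. The sketch below is a generic illustration, not GraphLab Create's internal solver: `soft_threshold`, the sample values, and the two thresholds are all hypothetical.

```python
def soft_threshold(rho, threshold):
    """Shrink rho toward zero by `threshold`; values inside the band become exactly 0."""
    if rho > threshold:
        return rho - threshold
    if rho < -threshold:
        return rho + threshold
    return 0.0

values = [5.0, -3.0, 0.5, -0.2]   # hypothetical unpenalized weights

shrunk_small = [soft_threshold(v, 1.0) for v in values]  # mild penalty
shrunk_large = [soft_threshold(v, 4.0) for v in values]  # strong penalty

print(shrunk_small)
print(shrunk_large)
```

With the larger threshold, more entries fall inside the band and are set to exactly zero, which is the subset-selection behavior described above.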


## Selecting an L1 penalty

In [21]:
# To find a good L1 penalty, we will explore multiple values using a validation set.
# Let us do a three-way split into train, validation, and test sets:

# * Split our sales data into two sets: training and test
# * Further split our training data into two sets: train and validation

In [22]:
(training_and_validation, testing) = sales.random_split(.9,seed=1)
(training, validation) = training_and_validation.random_split(0.5, seed=1)
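
The two chained splits leave roughly 45% of the rows for training, 45% for validation, and 10% for testing. The pure-Python sketch below mimics that idea on a list of row indices. The helper `random_split` here is a hypothetical stand-in that assigns each row independently; it is not the SFrame method, and its exact row assignments will differ from GraphLab's.

```python
import random

def random_split(rows, fraction, seed):
    """Send each row to the first list with probability `fraction`, else to the second."""
    rng = random.Random(seed)
    first, second = [], []
    for row in rows:
        if rng.random() < fraction:
            first.append(row)
        else:
            second.append(row)
    return first, second

rows = list(range(1000))  # hypothetical row indices
train_valid, test = random_split(rows, 0.9, seed=1)
train, valid = random_split(train_valid, 0.5, seed=1)

print(len(train), len(valid), len(test))
```

The three sets are disjoint and together cover every row, so the test set is never seen while choosing the penalty.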

In [42]:
import numpy as np
min_rss = None
best_l1_penalty = None
for l1_penalty in np.logspace(1, 7, num=13):
    print "l1_penalty: %.f" % l1_penalty
    model = graphlab.linear_regression.create(training, target='price', features=all_features,
                                              validation_set=None, l2_penalty=0., l1_penalty=l1_penalty,
                                              verbose=False)
    
    pred = model.predict(validation)
    errors = pred - validation['price']
    rss = np.dot(errors, errors)
    
    if best_l1_penalty is None or rss < min_rss:
        best_l1_penalty = l1_penalty
        min_rss = rss

print "L1 penalty: " + str(best_l1_penalty) + ", RSS: " + str(min_rss)

l1_penalty: 10
l1_penalty: 32
l1_penalty: 100
l1_penalty: 316
l1_penalty: 1000
l1_penalty: 3162
l1_penalty: 10000
l1_penalty: 31623
l1_penalty: 100000
l1_penalty: 316228
l1_penalty: 1000000
l1_penalty: 3162278
l1_penalty: 10000000
L1 penalty: 10.0, RSS: 625766285142460.6
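
The selection loop has two ingredients: a log-spaced grid of penalties and a residual sum of squares (RSS) computed on the validation set. The sketch below reproduces both without NumPy. The helpers `log_grid` and `rss` are hypothetical names, and the two sets of predictions are made-up numbers, not output from any fitted model.

```python
def log_grid(start_exp, stop_exp, num):
    """Return `num` values evenly spaced on a log10 scale, endpoints included."""
    step = (stop_exp - start_exp) / float(num - 1)
    return [10 ** (start_exp + k * step) for k in range(num)]

def rss(predictions, targets):
    """Residual sum of squares between predictions and targets."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets))

penalties = log_grid(1, 7, 13)  # same grid shape as np.logspace(1, 7, num=13)

# Hypothetical validation targets and predictions from two candidate models.
targets = [3.0, 5.0, 7.0]
candidates = {"model_a": [2.5, 5.5, 7.0], "model_b": [3.0, 4.0, 9.0]}

scores = {name: rss(pred, targets) for name, pred in candidates.items()}
best = min(scores, key=scores.get)  # candidate with the lowest validation RSS

print(best, scores[best])
```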


In [33]:
# QUIZ QUESTION: What was the best value for the l1_penalty?
# 10

In [34]:
# QUIZ QUESTION: Using this value of the L1 penalty, how many nonzero weights do you have?

In [35]:
model = graphlab.linear_regression.create(training, target='price', features=all_features,
                                              validation_set=None, l2_penalty=0., l1_penalty=best_l1_penalty,
                                              verbose=False)

In [43]:
model.coefficients['value'].nnz()

18
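
Note that this count covers the whole coefficient vector, which has 18 entries for 17 features. Assuming the extra entry at index 0 is the intercept, as the feature-listing code in this notebook also assumes, the count of nonzero feature weights can be separated out as sketched below. The feature names and coefficient values here are hypothetical.

```python
feature_names = ['bedrooms', 'bathrooms', 'sqft_living']  # hypothetical subset
coefficients = [12000.0, 0.0, 850.5, 42.1]                # intercept first, then one weight per feature

# Count over the whole vector, intercept included (what a plain nonzero count reports).
nonzero_total = sum(1 for w in coefficients if w != 0)

# Drop the intercept, then pair each remaining weight with its feature name.
feature_weights = coefficients[1:]
selected = [name for name, w in zip(feature_names, feature_weights) if w != 0]

print(nonzero_total, selected)
```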

## Limit the number of nonzero weights

In [44]:
# What if we absolutely wanted to limit ourselves to, say, 7 features? This may be important if we
# want to derive "a rule of thumb": an interpretable model that has only a few features in it.

In [46]:
# In this section, you are going to implement a simple, two-phase procedure to achieve this goal:

# 1) Explore a large range of l1_penalty values to find a narrow region of l1_penalty values where
#    models are likely to have the desired number of non-zero weights.
# 2) Further explore the narrow region you found to find a good value for l1_penalty that achieves
#    the desired sparsity. Here, we will again use a validation set to choose the best value for l1_penalty.
max_nonzeros = 7

In [50]:
l1_penalties = []
for l1_penalty in np.logspace(8, 10, num=20):
    print "l1_penalty = %.f" % l1_penalty
    
    model = graphlab.linear_regression.create(training, target='price', features=all_features,
                                              validation_set=None, l2_penalty=0., l1_penalty=l1_penalty,
                                              verbose=False)
    
    l1_penalties.append( (l1_penalty, model['coefficients']['value'].nnz()) )
    print "(l1_penalty, # nonzeros) = " + str(l1_penalties[-1])

l1_penalty = 100000000
(l1_penalty, # nonzeros) = (100000000.0, 18)
l1_penalty = 127427499
(l1_penalty, # nonzeros) = (127427498.57031322, 18)
l1_penalty = 162377674
(l1_penalty, # nonzeros) = (162377673.91887242, 18)
l1_penalty = 206913808
(l1_penalty, # nonzeros) = (206913808.111479, 18)
l1_penalty = 263665090
(l1_penalty, # nonzeros) = (263665089.87303555, 17)
l1_penalty = 335981829
(l1_penalty, # nonzeros) = (335981828.6283788, 17)
l1_penalty = 428133240
(l1_penalty, # nonzeros) = (428133239.8719396, 17)
l1_penalty = 545559478
(l1_penalty, # nonzeros) = (545559478.1168514, 17)
l1_penalty = 695192796
(l1_penalty, # nonzeros) = (695192796.1775591, 17)
l1_penalty = 885866790
(l1_penalty, # nonzeros) = (885866790.4100832, 16)
l1_penalty = 1128837892
(l1_penalty, # nonzeros) = (1128837891.6846883, 15)
l1_penalty = 1438449888
(l1_penalty, # nonzeros) = (1438449888.2876658, 15)
l1_penalty = 1832980711
(l1_penalty, # nonzeros) = (1832980710.8324375, 13)
l1_penalty = 2335721469
(l1_penalty,

In [58]:
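# QUIZ QUESTION: What values did you find for l1_penalty_min and l1_penalty_max, respectively?
# l1_penalty_min: the largest penalty whose model still has at least max_nonzeros nonzero weights.
# l1_penalty_max: the smallest penalty whose model has fewer than max_nonzeros nonzero weights.
l1_penalty_min = None
l1_penalty_max = None
for l1_penalty,nonzeros in l1_penalties:
    if nonzeros >= max_nonzeros:
        if l1_penalty_min is None or l1_penalty_min < l1_penalty:
            l1_penalty_min = l1_penalty
    else:
        if l1_penalty_max is None or l1_penalty_max > l1_penalty:
            l1_penalty_max = l1_penalty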

print "l1_penalty_min=" + str(l1_penalty_min)
print "l1_penalty_max=" + str(l1_penalty_max)

l1_penalty_min=2976351441.6313133
l1_penalty_max=3792690190.7322536
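
The bracket search can also be written with strict comparisons on both sides, so that penalties giving exactly the target sparsity sit inside the bracket rather than on its lower edge. The sketch below is one such variant; `find_penalty_bracket` is a hypothetical helper and the `(penalty, nonzeros)` pairs are made-up values, not results from this notebook.

```python
def find_penalty_bracket(results, max_nonzeros):
    """Return (largest penalty with more nonzeros, smallest penalty with fewer nonzeros)."""
    too_dense = [p for p, nnz in results if nnz > max_nonzeros]
    too_sparse = [p for p, nnz in results if nnz < max_nonzeros]
    lower = max(too_dense) if too_dense else None
    upper = min(too_sparse) if too_sparse else None
    return lower, upper

# Hypothetical (penalty, number of nonzero weights) pairs from a coarse search.
results = [(1.0, 18), (10.0, 12), (100.0, 10), (1000.0, 6), (10000.0, 3)]
bracket = find_penalty_bracket(results, 7)

print(bracket)
```

When no coarse-grid penalty hits the target sparsity exactly, this variant and a loop using `>=` return the same bracket.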


In [75]:
min_rss = None
best_l1_penalty = None
best_coefficients = None
for l1_penalty in np.linspace(l1_penalty_min, l1_penalty_max, 20):
    print "l1_penalty = %.f" % l1_penalty
    model = graphlab.linear_regression.create(training, target='price', features=all_features,
                                              validation_set=None, l2_penalty=0., l1_penalty=l1_penalty,
                                              verbose=False)
    
    if model['coefficients']['value'].nnz() != max_nonzeros:
        print "skip!"
        continue
        
    pred = model.predict(validation)
    errors = pred - validation['price']
    rss = np.dot(errors, errors)
    
    print "rss: " + str(rss)
    if best_l1_penalty is None or rss < min_rss:
        best_l1_penalty = l1_penalty
        min_rss = rss
        best_coefficients = model['coefficients']['value']

# QUIZ QUESTIONS
# 1. What value of `l1_penalty` in our narrow range has the lowest RSS on the VALIDATION set
#    and has sparsity *equal* to `max_nonzeros`?
# 2. What features in this model have non-zero coefficients?
print "L1 penalty: " + str(best_l1_penalty) + ", RSS: " + str(min_rss)
print [all_features[i-1] for i in range(len(best_coefficients)) if i != 0 and best_coefficients[i] != 0]

l1_penalty = 2976351442
skip!
l1_penalty = 3019316639
skip!
l1_penalty = 3062281836
skip!
l1_penalty = 3105247034
skip!
l1_penalty = 3148212231
skip!
l1_penalty = 3191177428
skip!
l1_penalty = 3234142626
skip!
l1_penalty = 3277107823
skip!
l1_penalty = 3320073020
skip!
l1_penalty = 3363038218
skip!
l1_penalty = 3406003415
skip!
l1_penalty = 3448968612
rss: 1046937488751711.0
l1_penalty = 3491933809
rss: 1051147625612862.2
l1_penalty = 3534899007
rss: 1055992735342999.6
l1_penalty = 3577864204
rss: 1060799531763288.0
l1_penalty = 3620829401
skip!
l1_penalty = 3663794599
skip!
l1_penalty = 3706759796
skip!
l1_penalty = 3749724993
skip!
l1_penalty = 3792690191
skip!
L1 penalty: 3448968612.163437, RSS: 1046937488751711.0
['bedrooms', 'bathrooms', 'sqft_living', 'sqft_living_sqrt', 'grade', 'sqft_above']
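
The second phase boils down to two steps: build an evenly spaced grid between the two bracket values, then keep only the models with exactly the target sparsity and take the one with the lowest validation RSS. The sketch below separates these steps. `lin_grid` and `best_at_sparsity` are hypothetical helpers; the grid endpoints are the bracket values printed earlier, while the `(penalty, nonzeros, rss)` triples are made-up numbers.

```python
def lin_grid(start, stop, num):
    """Return `num` evenly spaced values from start to stop, endpoints included."""
    step = (stop - start) / float(num - 1)
    return [start + k * step for k in range(num)]

def best_at_sparsity(results, max_nonzeros):
    """Among models with exactly `max_nonzeros` nonzeros, return (penalty, rss) with lowest rss."""
    eligible = [(rss, penalty) for penalty, nnz, rss in results if nnz == max_nonzeros]
    if not eligible:
        return None  # no candidate reached the target sparsity
    rss, penalty = min(eligible)
    return penalty, rss

grid = lin_grid(2976351441.6313133, 3792690190.7322536, 20)

# Hypothetical (penalty, nonzeros, validation RSS) triples.
results = [(1.0, 8, 90.0), (2.0, 7, 100.0), (3.0, 7, 120.0), (4.0, 6, 80.0)]
best = best_at_sparsity(results, 7)

print(len(grid), best)
```

In this toy data the lowest RSS overall belongs to a model with the wrong sparsity, so it is excluded; the sparsity constraint takes priority over validation error.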
