# LASSO

## Import Modules & Load Data

In [1]:
import pandas as pd
import numpy as np
from math import sqrt
from sklearn import linear_model  # using scikit-learn

# Column types for read_csv (ids and zip codes are kept as strings)
dtype_dict = {
    'bathrooms': float, 'waterfront': int, 'sqft_above': int,
    'sqft_living15': float, 'grade': int, 'yr_renovated': int,
    'price': float, 'bedrooms': float, 'zipcode': str,
    'long': float, 'sqft_lot15': float, 'sqft_living': float,
    'floors': float, 'condition': int, 'lat': float,
    'date': str, 'sqft_basement': int, 'yr_built': int,
    'id': str, 'sqft_lot': int, 'view': int,
}

sales = pd.read_csv('kc_house_data.csv', dtype=dtype_dict)
testing = pd.read_csv('kc_house_test_data.csv', dtype=dtype_dict)
training = pd.read_csv('kc_house_train_data.csv', dtype=dtype_dict)
validation = pd.read_csv('kc_house_valid_data.csv', dtype=dtype_dict)

## Creating New Features

In [2]:
# Add the same derived features to each data set
for df in (sales, testing, training, validation):
    df['sqft_living_sqrt'] = df['sqft_living'].apply(sqrt)
    df['sqft_lot_sqrt'] = df['sqft_lot'].apply(sqrt)
    df['bedrooms_square'] = df['bedrooms'] * df['bedrooms']
    df['floors_square'] = df['floors'] * df['floors']

## Learn Regression Weights with L1 Penalty

Let us fit a model with all the features available, plus the features we just created above.

In [3]:
all_features = ['bedrooms', 'bedrooms_square',
            'bathrooms',
            'sqft_living', 'sqft_living_sqrt',
            'sqft_lot', 'sqft_lot_sqrt',
            'floors', 'floors_square',
            'waterfront', 'view', 'condition', 'grade',
            'sqft_above',
            'sqft_basement',
            'yr_built', 'yr_renovated']

Applying an L1 penalty in scikit-learn means using `linear_model.Lasso`, whose `alpha` parameter sets the strength of the penalty. (Other tools expose LASSO differently, for example as an `l1_penalty` argument to a general linear regression call.) `Lasso` applies no L2 penalty, so there is no separate L2 term to switch off.
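
For reference, the objective that scikit-learn documents for `Lasso` is shown below, where $n$ is the number of samples, $w$ the weight vector, and $\alpha$ the value passed as `alpha`:

$$
\min_{w} \; \frac{1}{2n} \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_1
$$

A larger $\alpha$ puts more weight on the $\lVert w \rVert_1$ term, which pushes more entries of $w$ to exactly zero.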

In [4]:
# alpha is the L1 penalty strength. Note: normalize=True requires scikit-learn < 1.2,
# where the parameter was removed.
model_all = linear_model.Lasso(alpha=5e2, normalize=True)
model_all.fit(sales[all_features], sales['price'])  # learn weights
coeffs = pd.DataFrame(list(zip(all_features, model_all.coef_)), columns=['features', 'estimated coefficients'])
nonzero_coeffs = coeffs[coeffs['estimated coefficients'] != 0]
print("Non-zero Coefficients: \n", nonzero_coeffs)

Non-zero Coefficients: 
        features  estimated coefficients
3   sqft_living              134.439314
10         view            24750.004586
12        grade            61749.103091


Note that most of the weights have been set to exactly zero: only 3 of the 17 features keep a non-zero coefficient. A sufficiently large L1 penalty therefore performs subset selection.
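
One way to see why an L1 penalty produces exact zeros, rather than merely small weights, is the soft-thresholding operator that coordinate-descent LASSO solvers apply to each weight. The sketch below is a standalone illustration with made-up numbers (`soft_threshold` and `raw_weights` are hypothetical names, not part of scikit-learn): any weight whose magnitude is at or below the penalty becomes exactly zero, and the rest are shrunk toward zero.

```python
def soft_threshold(rho, penalty):
    """Shrink rho toward zero by `penalty`; return 0.0 if |rho| <= penalty."""
    if rho > penalty:
        return rho - penalty
    if rho < -penalty:
        return rho + penalty
    return 0.0


# Hypothetical unpenalized weights, for illustration only
raw_weights = [3.0, -0.5, 0.8, -2.5]
shrunk = [soft_threshold(w, 1.0) for w in raw_weights]
print(shrunk)  # [2.0, 0.0, 0.0, -1.5]
```

Raising the penalty widens the band of values that collapse to zero, which matches the shrinking list of surviving features seen as `alpha` grows.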

## Selecting a L1 Penalty

We write a loop that does the following:

* For `l1_penalty` in [$10^1, 10^{1.5}, \cdots, 10^7$] (to get this in Python, type `np.logspace(1, 7, num=13)`)
* Fit a regression model with a given `l1_penalty` on TRAIN data.
* Compute the RSS on VALIDATION data.
* Report which `l1_penalty` produced the lowest RSS on validation data.
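
To make the penalty grid concrete, here is a small pure-Python sketch that builds the same 13 values by hand. The names `exponents` and `l1_penalties` are illustrative; the notebook itself relies on `np.logspace`.

```python
# 13 exponents from 1 to 7 in steps of 0.5
exponents = [1 + 0.5 * k for k in range(13)]
l1_penalties = [10 ** e for e in exponents]

for e, p in zip(exponents, l1_penalties):
    print(f"10^{e:<4} = {p:,.2f}")
```

Because the grid is logarithmic, consecutive penalties differ by a constant factor of $10^{0.5} \approx 3.16$, so the search covers six orders of magnitude with only 13 fits.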

In [5]:
validation_rss = {}

for l1_penalty in np.logspace(1,7, num=13):
    model = linear_model.Lasso(l1_penalty, normalize = True)
    model.fit(training[all_features],training['price'])
    predictions = model.predict(validation[all_features])
    residuals = validation['price'] - predictions
    RSS = sum(residuals ** 2)
    validation_rss[l1_penalty] = RSS

optimal_l1_penalty = min(validation_rss.items(), key = lambda x: x[1])[0]
print()
print("Optimal L1 Penalty: ", optimal_l1_penalty)


Optimal L1 Penalty:  10.0


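Let's fit a model with this optimal `l1_penalty`, this time on the TEST data, and inspect its non-zero coefficients. (To estimate held-out error instead, fit on TRAIN and compute the RSS on TEST.)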

In [6]:
model_best = linear_model.Lasso(alpha=optimal_l1_penalty,normalize = True, max_iter = 10000)
model_best.fit(testing[all_features], testing['price'])
coeffs = pd.DataFrame(list(zip(testing[all_features],model_best.coef_)),columns=['features','estimated coefficients'])
nonzero_coeffs = coeffs[coeffs['estimated coefficients']!=0]
print("Non-zero Coefficients (Test Set): \n", nonzero_coeffs)
# The count below also includes the intercept when it is non-zero
print("Number of non-zero coeffs: ", np.count_nonzero(model_best.coef_) + np.count_nonzero(model_best.intercept_))

Non-zero Coefficients (Test Set): 
             features  estimated coefficients
0           bedrooms           -14227.610864
2          bathrooms            52721.103327
3        sqft_living              511.774116
4   sqft_living_sqrt           -34618.450510
5           sqft_lot                0.715295
6      sqft_lot_sqrt             -709.691210
7             floors          -153098.458411
8      floors_square            50315.741942
9         waterfront           474202.648826
10              view            37753.365695
11         condition            35879.550409
12             grade           125534.816213
13        sqft_above               16.662389
15          yr_built            -3275.606526
16      yr_renovated                6.661506
Number of non-zero coeffs:  16


## Limit the Number of Nonzero Weights

What if we absolutely wanted to limit ourselves to, say, 7 features? This may be important if we want to derive a "rule of thumb": an interpretable model that uses only a few features.
In this section, you are going to implement a simple two-phase procedure to achieve this goal:

1. Explore a large range of `l1_penalty` values to find a narrow region of `l1_penalty` values where models are likely to have the desired number of non-zero weights.

2. Further explore the narrow region you found to find a good value for `l1_penalty` that achieves the desired sparsity. Here, we will again use a validation set to choose the best value for `l1_penalty`.

In [7]:
max_nonzeros = 7

### Exploring the larger range of values to find a narrow range with sparsity

Let's define a wide range of possible `l1_penalty` values and count the non-zero weights (including the intercept) that each one produces:

In [8]:
l1_penalty_wide = np.logspace(1,4,num=20) # Wide Range
coeffs_dict = {}
for l1_penalty in l1_penalty_wide:
    model = linear_model.Lasso(alpha = l1_penalty, normalize = True)
    model.fit(training[all_features],training['price'])
    nnz = np.count_nonzero(model.coef_) + np.count_nonzero(model.intercept_)
    coeffs_dict[l1_penalty] = nnz

print("L1 penalty / # of non-zero coefficients")
coeffs_dict

L1 penalty / # of non-zero coefficients


{10.0: 15,
 14.384498882876629: 15,
 20.691380811147901: 15,
 29.763514416313178: 15,
 42.813323987193932: 13,
 61.584821106602639: 12,
 88.586679041008225: 11,
 127.42749857031335: 10,
 183.29807108324357: 7,
 263.66508987303581: 6,
 379.26901907322497: 6,
 545.55947811685144: 6,
 784.75997035146065: 5,
 1128.8378916846884: 3,
 1623.776739188721: 3,
 2335.7214690901214: 2,
 3359.8182862837812: 1,
 4832.9302385717519: 1,
 6951.9279617756056: 1,
 10000.0: 1}

From this table, `127.43` is the largest penalty that still leaves more than `max_nonzeros` non-zero weights (10), and `183.30` is the smallest that brings the count down to 7. We use these two values to bound the narrow search:

In [9]:
l1_penalty_min = 127.42749857031335
l1_penalty_max = 183.29807108324357
print("L1_penalty_min: ", l1_penalty_min)
print("L1_penalty_max: ", l1_penalty_max)

L1_penalty_min:  127.42749857031335
L1_penalty_max:  183.29807108324357
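
Rather than copying the two bounds by hand, they could be derived from the penalty-to-count dictionary. The helper below is a hypothetical sketch (`bracket_penalty` is not part of the notebook or of scikit-learn), and it assumes the non-zero count never increases as the penalty grows.

```python
def bracket_penalty(nnz_by_penalty, max_nonzeros):
    """Return (lower, upper): the largest penalty whose model has more than
    max_nonzeros non-zero weights, and the smallest penalty whose model has
    max_nonzeros or fewer."""
    too_dense = [p for p, nnz in nnz_by_penalty.items() if nnz > max_nonzeros]
    sparse_enough = [p for p, nnz in nnz_by_penalty.items() if nnz <= max_nonzeros]
    if not too_dense or not sparse_enough:
        raise ValueError("the penalty grid does not bracket the requested sparsity")
    return max(too_dense), min(sparse_enough)


# A few rows copied from the table above
counts = {
    88.586679041008225: 11,
    127.42749857031335: 10,
    183.29807108324357: 7,
    263.66508987303581: 6,
}
lower, upper = bracket_penalty(counts, max_nonzeros=7)
print(lower, upper)
```

Raising an error when the grid does not straddle the target makes it obvious that the wide range needs to be extended before the narrow search is attempted.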


### Exploring the narrow range of values to find the solution with the right number of non-zeros that has lowest RSS on the validation set

In [10]:
validation_rss = {}
l1_penalty_narrow = np.linspace(l1_penalty_min,l1_penalty_max, num=20)
for l1_penalty in l1_penalty_narrow:
    model = linear_model.Lasso(alpha = l1_penalty, normalize = True)
    model.fit(training[all_features],training['price'])
    predictions = model.predict(validation[all_features])
    residuals = validation['price'] - predictions
    RSS = sum(residuals ** 2)
    validation_rss[l1_penalty] = RSS, np.count_nonzero(model.coef_) + np.count_nonzero(model.intercept_)

In [11]:
# Find the model that has the lowest RSS on the validation set among those with exactly max_nonzeros non-zero weights
print()
bestRSS = np.inf
for i,j in validation_rss.items():
    if (j[0] < bestRSS) and (j[1] == max_nonzeros):
        bestRSS = j[0]
        bestl1 = i
print("BEST RSS: %.4g \nBEST L1 PENALTY: %.4g" %(bestRSS, bestl1))


BEST RSS: 4.398e+14 
BEST L1 PENALTY: 153.9
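
The selection step can also be written as a small function that reports when no candidate has the requested sparsity, instead of leaving the result undefined. This is an illustrative sketch: `pick_best` and the numbers in `toy` are made up and are not results from the data above.

```python
def pick_best(results, max_nonzeros):
    """results maps penalty -> (rss, nnz). Return (penalty, rss) for the
    lowest-RSS entry with exactly max_nonzeros non-zeros, or None."""
    candidates = [
        (rss, penalty)
        for penalty, (rss, nnz) in results.items()
        if nnz == max_nonzeros
    ]
    if not candidates:
        return None
    rss, penalty = min(candidates)
    return penalty, rss


# Made-up values, purely for illustration
toy = {
    100.0: (4.0e14, 10),
    150.0: (4.4e14, 7),
    180.0: (4.5e14, 7),
    260.0: (4.8e14, 6),
}
best = pick_best(toy, max_nonzeros=7)
print(best)
```

Returning `None` lets the caller decide whether to widen the search range or relax the sparsity target.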


In [12]:
# Apply best L1 penalty
model_best = linear_model.Lasso(alpha=bestl1,normalize = True, max_iter = 10000)
model_best.fit(training[all_features], training['price'])
coeffs = pd.DataFrame(list(zip(training[all_features],model_best.coef_)),columns=['features','estimated coefficients'])
nonzero_coeffs = coeffs[coeffs['estimated coefficients']!=0]
print("Non-zero Coefficients (Training Set): \n", nonzero_coeffs)
# The count below also includes the intercept when it is non-zero
print("Number of non-zero coeffs: ", np.count_nonzero(model_best.coef_) + np.count_nonzero(model_best.intercept_))

Non-zero Coefficients (Training Set): 
        features  estimated coefficients
2     bathrooms            11074.253516
3   sqft_living              163.235168
9    waterfront           508217.547491
10         view            41997.973196
12        grade           116484.911366
15     yr_built            -2628.351662
Number of non-zero coeffs:  7
