# Predicting Solar Panel Adoption - Feature Selection
## Gradient Descent Linear Regression with L1 Regularization
#### UC Berkeley MIDS
`Team: Gabriel Hudson, Noah Levy, Laura Williams`

This notebook uses linear regression fit by gradient descent, with a squared-error (OLS) loss and L1 regularization, for the purpose of feature selection. The dataset input into this regression has already had some feature engineering applied (see the Data Set Up notebook).
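
For reference, the objective this kind of model minimizes can be written out explicitly. This is a sketch following the formulation in the scikit-learn SGD documentation rather than anything stated in the original notebook; the symbols are assumptions: $w$ for the coefficients, $b$ for the intercept, $n$ for the number of training rows, and $\alpha$ for the regularization strength.

```latex
E(w, b) = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{2} \left( y_i - (w^\top x_i + b) \right)^2 + \alpha \lVert w \rVert_1
```

The first term is the squared-error loss and the second is the L1 penalty, which is what can push individual coefficients to exactly zero.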

In [1]:
# imports
import time
import statistics as stats
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import KFold
from sklearn.model_selection import train_test_split
from sklearn.linear_model import SGDRegressor

This model is trained on a curated dataset with some features already removed.  See Data Set Up file.

In [2]:
# load curated dataset for Stage 1
deepsolar = pd.read_csv('../Datasets/deepsolar_LW1_new_outcome_var.csv', index_col=0)

In [36]:
# load dataset for Stage 2 model
deepsolar = pd.read_csv('../Datasets/deepsolar_LW1_new_outcome_var_S1.csv', index_col=0)

In [86]:
# load dataset for Stage 3 model
deepsolar = pd.read_csv('../Datasets/deepsolar_LW1_new_outcome_var_S2.csv', index_col=0)

Below are the interim datasets using the old outcome variable.

In [None]:
# load curated dataset for Stage 1
deepsolar = pd.read_csv('../Datasets/deepsolar_LW1_no_CountyState.csv', index_col=0)

In [114]:
# load updated dataset after removing first set of variables
deepsolar = pd.read_csv('../Datasets/deepsolar_LW1_S1.csv', index_col=0)

In [146]:
# load updated dataset after removing two sets of variables
deepsolar = pd.read_csv('../Datasets/deepsolar_LW2.csv', index_col=0)

In [87]:
print("Dataset rows and dimensions:", deepsolar.shape)

Dataset rows and dimensions: (71305, 63)


## Pre-process data

Create a small sample dataset for testing implementation

In [88]:
# define outcome variable being used in current dataset
#primary_outcome_variable = 'number_of_solar_system_per_household'
primary_outcome_variable = 'owner_occupied_solar_system_density'

In [89]:
deepsolar_sample = deepsolar.sample(frac=.10)

In [90]:
print("Small sample dataset rows and dimensions:", deepsolar_sample.shape)

Small sample dataset rows and dimensions: (7130, 63)


Split sample and full datasets into training and test sets

In [91]:
# separate outcome variables and features - sample set
X_sample = deepsolar_sample.drop(labels=[primary_outcome_variable], axis=1).values
Y_sample = deepsolar_sample[primary_outcome_variable].values
print("Sample dataset featureset shape is", X_sample.shape)
print("Sample dataset outcome variable shape:", Y_sample.shape)

Sample dataset featureset shape is (7130, 62)
Sample dataset outcome variable shape: (7130,)


In [92]:
X_sample_train, X_sample_test, \
Y_sample_train, Y_sample_test, = train_test_split(X_sample, Y_sample, test_size=0.2, random_state=None, shuffle=True)
print("{:<35}\t{}".format("Sample training data shape:", X_sample_train.shape))
print("{:<35}\t{}".format("Sample training outcome variable:",Y_sample_train.shape ))
print("{:<35}\t{}".format("Sample test data shape:", X_sample_test.shape))
print("{:<35}\t{}".format("Sample test outcome variable:",Y_sample_test.shape ))

Sample training data shape:        	(5704, 62)
Sample training outcome variable:  	(5704,)
Sample test data shape:            	(1426, 62)
Sample test outcome variable:      	(1426,)


In [93]:
# separate outcome variables and features - full dataset
X = deepsolar.drop(labels=[primary_outcome_variable], axis=1).values
Y = deepsolar[primary_outcome_variable].values
print("Full featureset shape is", X.shape)
print("Outcome variable shape:", Y.shape)

Full featureset shape is (71305, 62)
Outcome variable shape: (71305,)


In [94]:
X_train, X_test, Y_train, Y_test, = train_test_split(X, Y, test_size=0.2, random_state=None, shuffle=True)
print("{:<35}\t{}".format("Training data shape:", X_train.shape))
print("{:<35}\t{}".format("Training outcome variable:",Y_train.shape ))
print("{:<35}\t{}".format("Test data shape:", X_test.shape))
print("{:<35}\t{}".format("Test outcome variable - classifier:",Y_test.shape ))

Training data shape:               	(57044, 62)
Training outcome variable:         	(57044,)
Test data shape:                   	(14261, 62)
Test outcome variable - classifier:	(14261,)


## Train Model

Notes on model parameters:  
* Loss is the squared-error (OLS) loss for linear regression.  
* Penalty is L1 (lasso), which forces the coefficients of redundant features to zero.  
* Tolerance was set to .0001 to allow more iterations before the model is considered converged. When testing the default, tol=None, some of the variables whose coefficients became zero did not seem intuitively correct, because they had appeared among the important features in the random forest model. Setting tol=.0001 seemed to solve this problem.
* Alpha was chosen based on the number of coefficients that were reduced to zero. A smaller alpha reduced fewer coefficients to zero, and a larger alpha reduced more coefficients to zero.
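
The relationship between alpha and the number of zeroed coefficients can be illustrated on synthetic data. The sketch below does not use `SGDRegressor`: to keep the result deterministic it swaps in a plain NumPy proximal gradient solver for the same squared-error plus L1 objective. The data, the `lasso_ista` helper, and the alpha grid are all hypothetical.

```python
import numpy as np

def soft_threshold(w, t):
    """Shrink each entry of w toward zero by t, clipping at zero."""
    return np.sign(w) * np.maximum(np.abs(w) - t, 0.0)

def lasso_ista(X, y, alpha, n_iter=500):
    """Minimize (1/2n)*||y - Xw||^2 + alpha*||w||_1 by proximal gradient descent."""
    n, p = X.shape
    w = np.zeros(p)
    # step size 1/L, where L is the Lipschitz constant of the smooth part
    step = n / (np.linalg.norm(X, 2) ** 2)
    for _ in range(n_iter):
        grad = X.T @ (X @ w - y) / n
        w = soft_threshold(w - step * grad, step * alpha)
    return w

# synthetic data (assumption): 10 features, only the first 3 are informative
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
true_w = np.array([3.0, -2.0, 1.0] + [0.0] * 7)
y = X @ true_w + rng.normal(scale=0.5, size=200)

# count exact zeros for a range of hypothetical alpha values
zero_counts = []
for alpha in [0.001, 0.1, 0.5, 5.0]:
    w = lasso_ista(X, y, alpha)
    zero_counts.append(int(np.sum(w == 0)))
    print(f"alpha={alpha:<6} zero coefficients: {zero_counts[-1]}")
```

With a very large alpha every coefficient is zeroed, which is the extreme case of the "too broad" behaviour noted for the larger alpha values.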

Removing too many variables at once also seems to remove variables that have shown up in our important features list. I started with small, conservative stages (detailed in the Scratch section at the end) and then settled on 3 stages, with more variables removed in the earlier stages.
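
The staged approach can be sketched as a loop that fits, drops the zeroed features, and refits on what remains. This is an illustration only: it uses a small NumPy proximal gradient solver (`fit_l1`, a hypothetical helper) in place of `SGDRegressor`, and the data, feature names, and per-stage alpha values are made up.

```python
import numpy as np

def fit_l1(X, y, alpha, n_iter=500):
    """Proximal gradient descent for squared error + alpha * L1 penalty."""
    n, p = X.shape
    w = np.zeros(p)
    step = n / np.linalg.norm(X, 2) ** 2
    for _ in range(n_iter):
        z = w - step * (X.T @ (X @ w - y) / n)
        w = np.sign(z) * np.maximum(np.abs(z) - step * alpha, 0.0)
    return w

# synthetic data (assumption): only x0 and x1 drive the outcome
rng = np.random.default_rng(1)
names = [f"x{i}" for i in range(8)]        # hypothetical feature names
X = rng.normal(size=(300, 8))
y = 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(scale=0.3, size=300)

stage_alphas = [0.02, 0.05]                # hypothetical per-stage alphas
kept = list(range(X.shape[1]))             # column indices still in the dataset
for stage, alpha in enumerate(stage_alphas, start=1):
    w = fit_l1(X[:, kept], y, alpha)
    # drop only the features whose coefficient is exactly zero
    dropped = [names[j] for j, wj in zip(kept, w) if wj == 0]
    kept = [j for j, wj in zip(kept, w) if wj != 0]
    print(f"Stage {stage}: alpha={alpha}, dropped {dropped}")

kept_names = [names[j] for j in kept]
print("Remaining features:", kept_names)
```

Each stage sees only the columns that survived the previous one, mirroring the way the notebook reloads a reduced dataset before each stage.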


Only variables whose coefficients are reduced to zero will be removed from the dataset before it is tested in the model. Variables with a small coefficient may still have some value in the dataset and will not be removed.


In [143]:
def feature_selection(iterations, features, X_train, Y_train, X_test, Y_test):
    # set variables
    scores = []
    convergences = []
    # use the mean of the outcome as the starting intercept
    outcome_mean = Y_train.mean()
    # train multiple iterations of the gradient descent regressor
    for i in range(iterations):
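        # squared-error (OLS) loss; scikit-learn 1.0+ names it 'squared_error'
        # ('squared_loss' was removed in 1.2)
        L1 = SGDRegressor(loss='squared_loss', penalty='l1', alpha=.000014,
                          max_iter=50, tol=.0001, learning_rate="constant")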
        L1.fit(X_train, Y_train, intercept_init=outcome_mean)
        # record results
        scores.append(L1.score(X_test, Y_test))
        convergences.append(L1.n_iter_)
        coefficients_iteration = pd.DataFrame(L1.coef_, columns=[i+1], index=features)
        if i==0:
            coefficients = coefficients_iteration
        else:
            coefficients = pd.concat([coefficients, coefficients_iteration], axis=1)
    return coefficients, scores, convergences

In [144]:
# set the number of iterations
iterations = 100
# define the features to match with the coefficients
features = deepsolar.drop(labels=[primary_outcome_variable], axis=1).columns.values.tolist()

In [145]:
# Train the model
coefficients, scores, convergences = feature_selection(iterations, features, X_train, Y_train, X_test, Y_test)

In [146]:
print("Average model R squared is:", stats.mean(scores))

Average model R squared is: 0.352862981792


In [147]:
print("Average number of iterations to converge is:", stats.mean(convergences))

Average number of iterations to converge is: 2


In [148]:
# look at the coefficient list
coefficients.head()

| | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | ... | 91 | 92 | 93 | 94 | 95 | 96 | 97 | 98 | 99 | 100 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| per_capita_income | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | -3.8e-05 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | -0.000561 | 0.0 | -0.000612 | 0.0 | -5.1e-05 |
| population_density | -9.4e-05 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | -8e-05 | 0.0 | 0.0 | 0.0 |
| education_high_school_graduate_rate | 0.007651 | 0.006472 | 0.005458 | 0.006984 | 0.006481 | 0.006841 | 0.006576 | 0.00624 | 0.007486 | 0.006486 | ... | 0.007543 | 0.00705 | 0.007049 | 0.007823 | 0.007352 | 0.007825 | 0.006212 | 0.007066 | 0.006077 | 0.006471 |
| education_college_rate | 0.006545 | 0.007668 | 0.005037 | 0.006979 | 0.006765 | 0.005893 | 0.007084 | 0.006694 | 0.005438 | 0.006245 | ... | 0.008118 | 0.006198 | 0.006055 | 0.006084 | 0.006662 | 0.007536 | 0.005797 | 0.00755 | 0.005983 | 0.007522 |
| education_bachelor_rate | -0.003892 | -0.003077 | -0.002809 | -0.003196 | -0.002983 | -0.003007 | -0.004209 | -0.003901 | -0.002611 | -0.003337 | ... | -0.004158 | -0.004499 | -0.004939 | -0.00362 | -0.003889 | -0.003723 | -0.003157 | -0.003843 | -0.003161 | -0.003869 |


In [149]:
# Calculate mean of all coefficients
coefficients_combined = pd.DataFrame(coefficients.mean(axis=1), index=features)
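
A mean of exactly zero will almost always mean the coefficient was zero in every run, but non-zero values of opposite sign could in principle cancel out. As a hedged alternative, the sketch below checks each run directly and also reports how often each feature was zeroed. The small `coefficients` table here is hypothetical toy data with the same layout as the real one (features as rows, runs as columns).

```python
import pandas as pd

# hypothetical coefficients: rows are features, columns are training runs
coefficients = pd.DataFrame(
    {
        1: [0.0, 0.004, 0.0, -0.5],
        2: [0.0, 0.006, -0.0001, 0.5],
        3: [0.0, 0.005, 0.0, 0.0],
    },
    index=["feature_a", "feature_b", "feature_c", "feature_d"],
)

# a feature is dropped only if its coefficient is exactly zero in every run
zero_in_all_runs = (coefficients == 0).all(axis=1)
feature_drop_list = coefficients.index[zero_in_all_runs.to_numpy()].tolist()

# share of runs in which each feature was zeroed, useful for borderline cases
zero_share = (coefficients == 0).mean(axis=1)

print(feature_drop_list)
print(zero_share)
```

In this toy table `feature_d` has a mean of zero because its values cancel, yet it is not zero in every run, so the per-run check keeps it.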

### Stage 1 model using new outcome variable
An alpha value of 0.00001 was used in Stage 1. Larger alpha values of 0.00002 or 0.000015 were too broad and removed variables that were important features in our random forest model.

In [34]:
# Stage 1 from final dataset
# print list of variables whose coefficients dropped completely to zero
coefficients_zero = coefficients_combined[coefficients_combined==0].dropna()
print(coefficients_zero.shape[0], "features have coefficients of zero:")
print(coefficients_zero)
# create list for feature selection
feature_drop_list = coefficients_zero.index.values.tolist()

31 features have coefficients of zero:
                                        0
average_household_income              0.0
gini_index                            0.0
education_less_than_high_school_rate  0.0
education_professional_school_rate    0.0
education_doctoral_rate               0.0
race_indian_alaska_rate               0.0
race_islander_rate                    0.0
employ_rate                           0.0
heating_fuel_other_rate               0.0
electricity_price_residential         0.0
housing_unit_median_value             0.0
elevation                             0.0
cooling_design_temperature            0.0
atmospheric_pressure                  0.0
age_more_than_85_rate                 0.0
age_45_54_rate                        0.0
age_55_64_rate                        0.0
age_15_17_rate                        0.0
occupation_public_rate                0.0
occupation_agriculture_rate           0.0
transportation_home_rate              0.0
transportation_car_alone_rate        

### Stage 2 model using new outcome variable
An alpha value of 0.0000125 was used in Stage 2. A larger alpha value of 0.000015 was too broad and removed variables that were important features in our random forest model.

In [84]:
# Stage 2 from final dataset
# print list of variables whose coefficients dropped completely to zero
coefficients_zero = coefficients_combined[coefficients_combined==0].dropna()
print(coefficients_zero.shape[0], "features have coefficients of zero:")
print(coefficients_zero)
# create list for feature selection
feature_drop_list = coefficients_zero.index.values.tolist()

9 features have coefficients of zero:
                           0
race_asian_rate          0.0
age_75_84_rate           0.0
age_10_14_rate           0.0
age_5_9_rate             0.0
occupation_finance_rate  0.0
travel_time_30_39_rate   0.0
travel_time_60_89_rate   0.0
voting_2016_dem_win      0.0
voting_2012_dem_win      0.0


### Stage 3 model using new outcome variable
An alpha value of 0.000014 was used in Stage 3.

In [150]:
# Stage 3 from final dataset
# print list of variables whose coefficients dropped completely to zero
coefficients_zero = coefficients_combined[coefficients_combined==0].dropna()
print(coefficients_zero.shape[0], "features have coefficients of zero:")
print(coefficients_zero)
# create list for feature selection
feature_drop_list = coefficients_zero.index.values.tolist()

4 features have coefficients of zero:
                                 0
race_white_rate                0.0
age_18_24_rate                 0.0
dropout_16_19_inschool_rate    0.0
occupation_manufacturing_rate  0.0


## The feature selection below is based on the dataset without the outcome variable adjusted to account for owner occupancy rate

### Stage 1 model - remove these variables first
An alpha value of 0.000015 was used in Stage 1. A larger alpha value of 0.00002 seemed to remove variables that have turned up in the important features list.

In [109]:
# print list of variables whose coefficients dropped completely to zero
coefficients_zero = coefficients_combined[coefficients_combined==0].dropna()
print(coefficients_zero.shape[0], "features have coefficients of zero:")
print(coefficients_zero)
# create list for feature selection
feature_drop_list = coefficients_zero.index.values.tolist()

24 features have coefficients of zero:
                                      0
average_household_income            0.0
education_professional_school_rate  0.0
education_doctoral_rate             0.0
race_indian_alaska_rate             0.0
race_islander_rate                  0.0
heating_fuel_other_rate             0.0
electricity_price_residential       0.0
cooling_design_temperature          0.0
atmospheric_pressure                0.0
age_25_34_rate                      0.0
age_more_than_85_rate               0.0
age_75_84_rate                      0.0
age_15_17_rate                      0.0
age_5_9_rate                        0.0
occupation_manufacturing_rate       0.0
occupation_agriculture_rate         0.0
transportation_home_rate            0.0
transportation_car_alone_rate       0.0
transportation_walk_rate            0.0
transportation_bicycle_rate         0.0
health_insurance_public_rate        0.0
travel_time_average                 0.0
number_of_years_of_education        0.0
w

Look at the variables with very small coefficients, out of curiosity. 

In [110]:
# print list of variables between zero and a small coefficient
coefficients_not_zero = coefficients_combined[coefficients_combined!=0].dropna()
cutoff = 0.001
coefficients_small = coefficients_not_zero[coefficients_not_zero < cutoff].dropna()
coefficients_small = coefficients_small[coefficients_small > cutoff*-1].dropna()
print(coefficients_small)

                                       0
gini_index                      0.000004
per_capita_income              -0.000031
race_white_rate                 0.000952
race_asian_rate                -0.000065
employ_rate                    -0.000018
lat                            -0.000311
elevation                      -0.000051
earth_temperature_amplitude     0.000982
age_10_14_rate                 -0.000003
dropout_16_19_inschool_rate    -0.000667
occupation_construction_rate    0.000041
occupation_public_rate          0.000088
occupation_administrative_rate -0.000157
occupation_retail_rate          0.000003
travel_time_less_than_10_rate   0.000096
travel_time_20_29_rate          0.000033
age_median                      0.000447
voting_2016_dem_win             0.000006
voting_2012_dem_win             0.000135
diversity                       0.000199
rebate                          0.000262


### Stage 2 model
This model was trained on the dataset after the variables in Stage 1 were removed.

An alpha value of 0.00002 was used for this stage.

In [144]:
# print list of variables whose coefficients dropped completely to zero
coefficients_zero = coefficients_combined[coefficients_combined==0].dropna()
print(coefficients_zero.shape[0], "features have coefficients of zero:")
print(coefficients_zero)
# create list for feature selection
feature_drop_list = coefficients_zero.index.values.tolist()

16 features have coefficients of zero:
                                  0
gini_index                      0.0
per_capita_income               0.0
employ_rate                     0.0
housing_unit_median_value       0.0
elevation                       0.0
age_10_14_rate                  0.0
dropout_16_19_inschool_rate     0.0
occupation_construction_rate    0.0
occupation_public_rate          0.0
occupation_administrative_rate  0.0
occupation_retail_rate          0.0
transportation_motorcycle_rate  0.0
travel_time_20_29_rate          0.0
age_median                      0.0
voting_2016_dem_win             0.0
diversity                       0.0


### Stage 3 model
This model was trained on the dataset after the variables in Stage 2 were removed.

An alpha value of 0.00002 was used for this stage.

In [177]:
# print list of variables whose coefficients dropped completely to zero
coefficients_zero = coefficients_combined[coefficients_combined==0].dropna()
print(coefficients_zero.shape[0], "features have coefficients of zero:")
print(coefficients_zero)
# create list for feature selection
feature_drop_list = coefficients_zero.index.values.tolist()

4 features have coefficients of zero:
                                 0
race_asian_rate                0.0
earth_temperature_amplitude    0.0
occupation_finance_rate        0.0
travel_time_less_than_10_rate  0.0


## Stage 4 experiment - not used
After about 40 variables had been removed, the model started to reduce coefficients to zero for features that had previously turned up in our important features lists from the random forest. We tried removing these additional variables, and doing so reduced the accuracy of the model, so we did not remove any variables beyond the roughly 40 chosen in the previous stages.
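
The check described above, comparing model fit before and after dropping a candidate set of features, can be sketched as follows. This is an illustration only: it uses ordinary least squares via NumPy rather than the notebook's regressor, and the data, the `r_squared` helper, and the `candidate_drop` list are hypothetical.

```python
import numpy as np

def r_squared(X_train, y_train, X_test, y_test):
    """Fit ordinary least squares with an intercept and return R^2 on the test split."""
    A_train = np.column_stack([np.ones(len(X_train)), X_train])
    A_test = np.column_stack([np.ones(len(X_test)), X_test])
    beta, *_ = np.linalg.lstsq(A_train, y_train, rcond=None)
    residual = y_test - A_test @ beta
    return 1.0 - residual @ residual / np.sum((y_test - y_test.mean()) ** 2)

# synthetic data (assumption): columns 0 and 1 are informative, the rest are noise
rng = np.random.default_rng(2)
X = rng.normal(size=(1000, 5))
y = 1.0 * X[:, 0] + 0.8 * X[:, 1] + rng.normal(scale=0.5, size=1000)
X_train, X_test, y_train, y_test = X[:800], X[800:], y[:800], y[800:]

candidate_drop = [1, 4]   # hypothetical: column 1 is informative, column 4 is noise
keep = [j for j in range(X.shape[1]) if j not in candidate_drop]

score_full = r_squared(X_train, y_train, X_test, y_test)
score_reduced = r_squared(X_train[:, keep], y_train, X_test[:, keep], y_test)
print(f"R^2 full: {score_full:.3f}, reduced: {score_reduced:.3f}")
if score_reduced < score_full:
    print("Dropping these features hurt the fit; keep them.")
```

A clear drop in the test score after removal is the signal that a stage has gone too far.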

In [26]:
# print list of variables whose coefficients dropped completely to zero
coefficients_zero = coefficients_combined[coefficients_combined==0].dropna()
print(coefficients_zero.shape[0], "features have coefficients of zero:")
print(coefficients_zero)
# create list for feature selection
feature_drop_list = coefficients_zero.index.values.tolist()

13 features have coefficients of zero:
                                        0
population_density                    0.0
education_less_than_high_school_rate  0.0
race_white_rate                       0.0
heating_fuel_electricity_rate         0.0
lat                                   0.0
age_18_24_rate                        0.0
age_45_54_rate                        0.0
occupation_wholesale_rate             0.0
transportation_carpool_rate           0.0
travel_time_30_39_rate                0.0
travel_time_40_59_rate                0.0
travel_time_60_89_rate                0.0
voting_2012_dem_win                   0.0


## Scratch and notes

In [151]:
# print current drop list for easy copying elsewhere
for i in feature_drop_list:
    print(i)
print(feature_drop_list)

race_white_rate
age_18_24_rate
dropout_16_19_inschool_rate
occupation_manufacturing_rate
['race_white_rate', 'age_18_24_rate', 'dropout_16_19_inschool_rate', 'occupation_manufacturing_rate']


Notes from removing variables in stages:

**Alpha value results - Stage 1**  
* Alpha value of .00002 returned almost 40 variables with coefficients reduced to zero
* Alpha value of .00001 returned about 20 variables with coefficients reduced to zero
* Alpha value of .000005 returned about 10 variables with coefficients reduced to zero  


**Alpha value results - Stage 2**  
* Alpha value of .00005 returned almost 40 variables with coefficients reduced to zero
* Alpha value of .00002 returned about 20 variables with coefficients reduced to zero
* Alpha value of .000015 returned 10 variables with coefficients reduced to zero
* Alpha value of .00001 returned one variable with its coefficient reduced to zero

**Alpha value results - Stage 3**  
* Alpha value of .000025 returned about 20 variables with coefficients reduced to zero
* Alpha value of .00002 returned about 10 variables with coefficients reduced to zero
* Alpha value of .000015 returned 1 variable with its coefficient reduced to zero

**Stage 1 Feature Selection:**
These 18 features were removed from the dataset:  
average_household_income  
education_professional_school_rate  
race_indian_alaska_rate  
race_islander_rate  
heating_fuel_other_rate  
electricity_price_residential  
age_25_34_rate  
age_more_than_85_rate  
age_5_9_rate  
occupation_manufacturing_rate  
occupation_retail_rate  
occupation_agriculture_rate  
transportation_walk_rate  
transportation_bicycle_rate  
health_insurance_public_rate  
travel_time_average  
number_of_years_of_education  
water_percent  

**Stage 2 Feature Selection:**  
gini_index  
housing_unit_median_value  
cooling_design_temperature  
atmospheric_pressure  
age_10_14_rate  
age_15_17_rate  
occupation_construction_rate  
occupation_public_rate  
occupation_administrative_rate  
transportation_car_alone_rate  
travel_time_less_than_10_rate  

**Stage 3 Feature Selection:**  
per_capita_income  
education_doctoral_rate  
race_white_rate  
race_asian_rate  
employ_rate  
elevation  
occupation_finance_rate  
transportation_home_rate  
transportation_motorcycle_rate  
travel_time_20_29_rate  
diversity  

**Stage 3 repeated (additional):**
earth_temperature_amplitude  
age_75_84_rate  
dropout_16_19_inschool_rate  
age_median  
voting_2016_dem_win  
