# Predicting Solar Panel Adoption - Random Forest Models
#### UC Berkeley MIDS
`Team: Gabriel Hudson, Noah Levy, Laura Williams`

Using the dataset defined in the Data Set Up notebook, train two Random Forest models in sequence:
* Random Forest Classifier to predict the presence or absence of solar panels  
* Random Forest Regressor to predict solar panel density and analyze the most important predictive features.

In [1]:
# imports
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import RandomForestRegressor
from sklearn import preprocessing

%matplotlib inline


In [5]:
# load curated dataset
deepsolar = pd.read_csv('../Datasets/deepsolar_LW1.csv', index_col=0)

In [6]:
print("Dataset rows and dimensions:", deepsolar.shape)

Dataset rows and dimensions: (71305, 108)


## Additional Data Set Up

* Convert string variables to numeric
* Normalize data

In [7]:
# Encode string features (county and state) into numeric features
LE = preprocessing.LabelEncoder()

LE.fit(deepsolar['county'])
deepsolar['county'] = LE.transform(deepsolar['county'])

LE.fit(deepsolar['state'])
deepsolar['state'] = LE.transform(deepsolar['state'])

print("Dataset rows and dimensions:", deepsolar.shape)

Dataset rows and dimensions: (71305, 108)
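The cell above reuses one `LabelEncoder` for both columns, so after the second `fit` the county mapping is no longer available. A minimal sketch of keeping one encoder per column, so the integer codes can be mapped back to names later, might look like this. The toy frame and the `encoders` dictionary are assumptions for illustration, not part of the original notebook.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# toy stand-in for the 'county' and 'state' columns (hypothetical values)
toy = pd.DataFrame({"county": ["Alameda", "Kings", "Alameda"],
                    "state": ["ca", "ny", "ca"]})

encoders = {}
for col in ["county", "state"]:
    enc = LabelEncoder()
    toy[col] = enc.fit_transform(toy[col])  # strings -> integer codes
    encoders[col] = enc                     # keep the fitted encoder per column

# recover the original strings from the integer codes
original_states = encoders["state"].inverse_transform(toy["state"])
```

Note that the integer codes follow alphabetical order of the labels, so they carry no meaningful ordering between counties or states.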


In [8]:
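# Mean-normalize every column: subtract the column mean, divide by the column range.
# Note: this also rescales the target 'number_of_solar_system_per_household' and the
# label-encoded 'county' and 'state' columns, and uses statistics from the full dataset.
deepsolar = (deepsolar - deepsolar.mean())/(deepsolar.max() - deepsolar.min())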
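The normalization above uses the mean and range of the whole dataset. As a rough sketch of one alternative, the same scaling can be fitted on a training portion only and then reused for held-out rows. The helper names `fit_mean_range` and `apply_mean_range` and the toy frame are hypothetical and only illustrate the idea.

```python
import pandas as pd

def fit_mean_range(df):
    """Return per-column mean and (max - min) range of a DataFrame."""
    col_mean = df.mean()
    col_range = df.max() - df.min()
    # constant columns have range 0; use 1 instead to avoid dividing by zero
    col_range = col_range.replace(0, 1)
    return col_mean, col_range

def apply_mean_range(df, col_mean, col_range):
    """Scale a DataFrame with previously fitted statistics."""
    return (df - col_mean) / col_range

# toy data (hypothetical): column 'b' is constant
toy = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0], "b": [10.0, 10.0, 10.0, 10.0]})
train, held_out = toy.iloc[:3], toy.iloc[3:]

mean_, range_ = fit_mean_range(train)          # statistics from training rows only
train_scaled = apply_mean_range(train, mean_, range_)
held_out_scaled = apply_mean_range(held_out, mean_, range_)
```

Tree-based models split on thresholds within each feature, so this kind of per-column rescaling is not expected to change which splits a random forest can find.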

## Pre-process data

* Define outcome variables
* Split into test/train/dev

In [9]:
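# create binary outcome variable for stage 1 RF classifier
# caution: the column was mean-normalized above, so x > 0 flags values above the
# dataset mean rather than the presence of any solar panels
deepsolar['solar_flag']=deepsolar['number_of_solar_system_per_household'].apply(lambda x: int(x>0))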

In [10]:
# Confirm values in new outcome variable
print("New binary outcome variable for Stage 1 random forest classifier 'solar_flag' has", 
      deepsolar['solar_flag'].nunique(), "values:", deepsolar["solar_flag"].min(),
      "and", deepsolar["solar_flag"].max())

New binary outcome variable for Stage 1 random forest classifier 'solar_flag' has 2 values: 0 and 1


Randomly shuffle and split the data into test, training, and development sets. The test data will not be used until the model and dataset have been optimized on the training and development sets.

In [11]:
# separate outcome variables and features
X = deepsolar.drop(labels=['solar_flag', 'number_of_solar_system_per_household'], axis=1).values
Y_classifier = deepsolar['solar_flag'].values
Y_regressor = deepsolar['number_of_solar_system_per_household'].values
print("Full featureset shape is", X.shape)
print("Classifier outcome variable shape:", Y_classifier.shape)
print("Regressor outcome variable shape:", Y_regressor.shape)

Full featureset shape is (71305, 107)
Classifier outcome variable shape: (71305,)
Regressor outcome variable shape: (71305,)


In [12]:
# set a random seed to keep the split the same 
np.random.seed(0)

# shuffle data
shuffle = np.random.permutation(np.arange(X.shape[0]))
X = X[shuffle]
Y_classifier = Y_classifier[shuffle]
Y_regressor = Y_regressor[shuffle]

# split data and labels into test set and initial training set
n_train = int(0.8*X.shape[0])
X_train1 = X[:n_train,:]
X_test = X[n_train:,:]
Y_classifier_train1 = Y_classifier[:n_train]
Y_classifier_test = Y_classifier[n_train:]
Y_regressor_train1 = Y_regressor[:n_train]
Y_regressor_test = Y_regressor[n_train:]

# split training data and labels into training and development sets
n_train = int(0.8*X_train1.shape[0])
X_train = X_train1[:n_train,:]
X_dev = X_train1[n_train:,:]
Y_classifier_train = Y_classifier_train1[:n_train]
Y_classifier_dev = Y_classifier_train1[n_train:]
Y_regressor_train = Y_regressor_train1[:n_train]
Y_regressor_dev = Y_regressor_train1[n_train:]

print("{:<35}\t{}".format("Training data shape:", X_train.shape))
print("{:<35}\t{}".format("Training outcome variable - classifier:",Y_classifier_train.shape ))
print("{:<35}\t{}".format("Training outcome variable - regressor:",Y_regressor_train.shape ))
print("{:<35}\t{}".format("Dev data shape:", X_dev.shape))
print("{:<35}\t{}".format("Dev outcome variable - classifier:",Y_classifier_dev.shape ))
print("{:<35}\t{}".format("Dev outcome variable - regressor:",Y_regressor_dev.shape ))
print("{:<35}\t{}".format("Test data shape:", X_test.shape))
print("{:<35}\t{}".format("Test outcome variable - classifier:",Y_classifier_test.shape ))
print("{:<35}\t{}".format("Test outcome variable - regressor:",Y_regressor_test.shape ))



Training data shape:               	(45635, 107)
Training outcome variable - classifier:	(45635,)
Training outcome variable - regressor:	(45635,)
Dev data shape:                    	(11409, 107)
Dev outcome variable - classifier: 	(11409,)
Dev outcome variable - regressor:  	(11409,)
Test data shape:                   	(14261, 107)
Test outcome variable - classifier:	(14261,)
Test outcome variable - regressor: 	(14261,)
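The manual shuffle and 80/20 splits above could also be written with scikit-learn's `train_test_split`, which shuffles and splits several arrays consistently in one call. The sketch below uses small random arrays as stand-ins for `X`, `Y_classifier`, and `Y_regressor`; the toy names are assumptions, and the resulting row assignment will differ from the permutation used in this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# toy stand-ins for the feature matrix and the two outcome vectors
rng = np.random.RandomState(0)
X_toy = rng.rand(100, 3)
y_cls_toy = rng.randint(0, 2, size=100)
y_reg_toy = rng.rand(100)

# first split: hold out 20% as the test set
X_rest, X_te, yc_rest, yc_te, yr_rest, yr_te = train_test_split(
    X_toy, y_cls_toy, y_reg_toy, test_size=0.2, random_state=0)

# second split: hold out 20% of the remainder as the development set
X_tr, X_dv, yc_tr, yc_dv, yr_tr, yr_dv = train_test_split(
    X_rest, yc_rest, yr_rest, test_size=0.2, random_state=0)

print(X_tr.shape, X_dv.shape, X_te.shape)
```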


## Train the classifier

In [13]:
# Use best parameters from Noah's hyperparameter tuning
n = 100
depth = None
features = 'auto'

In [14]:
# Fit model
RF1_Classifier = RandomForestClassifier(n_estimators=n, max_depth=depth, max_features=features, n_jobs=1)
RF1_Classifier.fit(X_train, Y_classifier_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

In [15]:
# Mean accuracy on the dev set (score() returns accuracy for classifiers, not R squared)
RF1_Classifier.score(X_dev,Y_classifier_dev)

0.9357524761153475
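For a classifier, `score` reports mean accuracy, which can look high when one class dominates. A small sketch on synthetic data shows how the same number can be obtained from `accuracy_score` and compared against a majority-class baseline and a confusion matrix. The toy data and variable names are assumptions, not results from the DeepSolar dataset.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

# synthetic, imbalanced toy problem (roughly 70% positives)
rng = np.random.RandomState(0)
X_toy = rng.rand(200, 4)
y_toy = (X_toy[:, 0] > 0.3).astype(int)

clf = RandomForestClassifier(n_estimators=20, random_state=0)
clf.fit(X_toy[:150], y_toy[:150])

X_hold, y_hold = X_toy[150:], y_toy[150:]
preds = clf.predict(X_hold)

acc_from_score = clf.score(X_hold, y_hold)        # mean accuracy
acc_from_metric = accuracy_score(y_hold, preds)   # same quantity, computed explicitly
baseline = max(y_hold.mean(), 1 - y_hold.mean())  # accuracy of always predicting the majority class
cm = confusion_matrix(y_hold, preds)              # rows: true class, columns: predicted class
```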

In [16]:
RF1_Regressor = RandomForestRegressor(n_estimators=200, max_depth=depth, max_features=features, n_jobs=1)
RF1_Regressor.fit(X_train, Y_regressor_train)

RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=200, n_jobs=1,
           oob_score=False, random_state=None, verbose=0, warm_start=False)

TO DO: Feature Importances for the Classifier, then train the Regressor

In [17]:
from sklearn.metrics import r2_score

In [18]:
dev_preds=RF1_Regressor.predict(X_dev)

In [20]:
r2_score(Y_regressor_dev,dev_preds)

0.7429143031857192

In [21]:
classifier_dev_preds=RF1_Classifier.predict(X_dev)

In [23]:
# combine the two stages: keep the regressor's prediction where the classifier predicts 1, else 0
final_preds=classifier_dev_preds*dev_preds

In [24]:
r2_score(Y_regressor_dev,final_preds)

0.6536429591647941
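The combination step multiplies the regressor's predictions by the classifier's 0/1 predictions, so any row the classifier labels 0 gets a combined prediction of 0. A minimal sketch with made-up numbers illustrates this; it assumes a target on its original scale, where 0 means no solar panels, and all values below are hypothetical.

```python
import numpy as np
from sklearn.metrics import r2_score

# hypothetical true densities and stage-wise predictions
y_true = np.array([0.0, 0.0, 0.2, 0.5, 0.0, 0.1])
flag_preds = np.array([0, 0, 1, 1, 1, 0])                       # stage 1: classifier output
density_preds = np.array([0.05, 0.02, 0.25, 0.4, 0.1, 0.12])    # stage 2: regressor output

# keep the regressor's value only where the classifier predicts solar
combined = np.where(flag_preds == 1, density_preds, 0.0)

r2_regressor_only = r2_score(y_true, density_preds)
r2_combined = r2_score(y_true, combined)
```

In this toy case the last row is a false negative from stage 1, so its combined prediction is forced to 0 even though the regressor's estimate was close to the true value. Errors of that kind are one way the combined score can fall below the regressor-only score.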

In [31]:
cols=list(deepsolar.columns)
cols.remove('solar_flag')
cols.remove('number_of_solar_system_per_household')

In [32]:
# rank regressor features by impurity-based importance and print the top 20
feature_importances = RF1_Regressor.feature_importances_
sorted_features = sorted(zip(cols, feature_importances), reverse=True, key=lambda k: k[1])
for feature in sorted_features[:20]:
    print(feature)

('incentive_count_residential', 0.21332830328278793)
('occupancy_owner_rate', 0.14036380279260402)
('daily_solar_radiation', 0.05367261341707273)
('lon', 0.037536929510154415)
('median_household_income', 0.034108552340376214)
('education_college_rate', 0.025473944660335358)
('lat', 0.022155867591383598)
('housing_unit_median_gross_rent', 0.021513721200182526)
('voting_2016_dem_percentage', 0.020077521259859208)
('population_density', 0.019615206129611783)
('household_type_family_rate', 0.01639784788315769)
('health_insurance_public_rate', 0.014356232235277782)
('occupation_manufacturing_rate', 0.011978338144598881)
('relative_humidity', 0.011934774179001375)
('sales_tax', 0.01077843059119527)
('heating_fuel_coal_coke_rate', 0.01062253071391972)
('housing_unit_median_value', 0.010365581733798661)
('mortgage_with_rate', 0.008484259671878172)
('land_area', 0.008447438297597457)
('county', 0.007887750244379885)


In [33]:
# rank classifier features by impurity-based importance and print the top 20
feature_importances = RF1_Classifier.feature_importances_
sorted_features = sorted(zip(cols, feature_importances), reverse=True, key=lambda k: k[1])
for feature in sorted_features[:20]:
    print(feature)

('lon', 0.050548280741418684)
('daily_solar_radiation', 0.04533691096264032)
('incentive_count_residential', 0.038754985350315685)
('relative_humidity', 0.03841722135125413)
('state', 0.038269524191330485)
('electricity_consume_residential', 0.026694228003234777)
('housing_unit_median_gross_rent', 0.024843467587175577)
('electricity_price_residential', 0.023912121529669938)
('frost_days', 0.021941800314608333)
('occupancy_owner_rate', 0.021927845140966615)
('incentive_residential_state_level', 0.019995138444446673)
('population_density', 0.019721540480844933)
('sales_tax', 0.018670964551530545)
('avg_electricity_retail_rate', 0.018266362392956967)
('median_household_income', 0.017876865831087538)
('land_area', 0.017384474217057987)
('elevation', 0.017351488144585897)
('lat', 0.01596847484597179)
('total_area', 0.015665123591036292)
('average_household_income', 0.015062400733989853)
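The rankings above pair column names with `feature_importances_` by position. One possible alternative is to wrap the importances in a pandas `Series` indexed by column name, which makes sorting and slicing a one-liner. The sketch below uses a synthetic regression problem; the column names `f0` to `f3` and the data are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# synthetic data where only column 'f2' drives the target
rng = np.random.RandomState(0)
toy_cols = ["f0", "f1", "f2", "f3"]
X_toy = rng.rand(300, 4)
y_toy = 3.0 * X_toy[:, 2] + 0.1 * rng.rand(300)

reg = RandomForestRegressor(n_estimators=20, random_state=0).fit(X_toy, y_toy)

# label each importance with its column name, then sort descending
importances = pd.Series(reg.feature_importances_, index=toy_cols)
top = importances.sort_values(ascending=False).head(2)
print(top)
```

Impurity-based importances are normalized to sum to 1 across features, so each value reads as a share of the total impurity reduction attributed to that feature.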
