# Predicting Solar Panel Adoption - Dataset Variations
#### UC Berkeley MIDS
`Team: Gabriel Hudson, Noah Levy, Laura Williams`

This notebook contains only the code that creates two dataset variations, which we used to train models for comparison against our final model. For more information about the final dataset, see the "Data Set Up - Laura" notebook and the "Feature Selection with L1 Regularization" notebook.

This notebook starts with a public dataset from Stanford's DeepSolar team, available here:  
http://web.stanford.edu/group/deepsolar/home  

The two dataset variations created in this notebook are:  
1) A dataset with the original outcome variable used by the DeepSolar team
2) A dataset with our modified outcome variable but with latitude and longitude removed

In [1]:
# imports
import pandas as pd
import numpy as np
from sklearn import preprocessing

## Load dataset

In [2]:
# load full dataset, using FIPS number as index
deepsolar_original = pd.read_csv('../deepsolar_tract.csv', index_col=4, encoding='ISO-8859-1')
# remove unused indexing variable at position 0
deepsolar_original.drop(labels='Unnamed: 0', axis=1, inplace=True)

In [3]:
# print dataset shape
original_shape = deepsolar_original.shape
print("Dataset rows and dimensions after indexing:", original_shape)

Dataset rows and dimensions after indexing: (72537, 167)


## Remove datapoints

In [4]:
# remove census tracts with population under 100 from the dataset
deepsolar_curated = deepsolar_original[deepsolar_original['population'] >= 100]
print("Dataset shape after removing population counts of less than 100:",
      deepsolar_curated.shape)

Dataset shape after removing population counts of less than 100: (71828, 167)


In [5]:
# remove census tracts with household count under 100 from the dataset
# .copy() gives an independent frame so the column assignment below does not act on a slice
deepsolar_curated = deepsolar_curated[deepsolar_curated['household_count'] >= 100].copy()
print("Dataset shape after removing household counts of less than 100:",
      deepsolar_curated.shape)

Dataset shape after removing household counts of less than 100: (71482, 167)


In [6]:
# Create variable to calculate water percentage of each census tract
deepsolar_curated['water_percent'] = deepsolar_curated['water_area']/deepsolar_curated['total_area']

In [7]:
# remove census tracts more than 75% water from the dataset
# .copy() gives an independent frame so later in-place edits do not act on a slice
deepsolar_curated = deepsolar_curated[deepsolar_curated['water_percent'] < 0.75].copy()
print("Dataset shape after removing census tracts more than 75% water:",
      deepsolar_curated.shape)

Dataset shape after removing census tracts more than 75% water: (71305, 168)
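
The three filters above can also be written as a single helper. The following is a minimal sketch on a made-up four-row frame; the function name `filter_tracts`, its keyword arguments, and the toy values are assumptions for illustration and are not part of the DeepSolar data or of this notebook's pipeline.

```python
import pandas as pd


def filter_tracts(df, min_population=100, min_households=100, max_water_share=0.75):
    """Keep rows that pass the population, household, and water-share thresholds."""
    water_share = df['water_area'] / df['total_area']
    keep = (
        (df['population'] >= min_population)
        & (df['household_count'] >= min_households)
        & (water_share < max_water_share)
    )
    # .copy() returns an independent frame that is safe to modify afterwards
    return df[keep].copy()


# toy data (hypothetical values), indexed by a made-up tract id
toy = pd.DataFrame(
    {
        'population': [50, 500, 800, 1200],
        'household_count': [20, 90, 300, 400],
        'water_area': [0.0, 0.1, 9.0, 1.0],
        'total_area': [1.0, 2.0, 10.0, 4.0],
    },
    index=['t1', 't2', 't3', 't4'],
)

kept = filter_tracts(toy)
print(kept.index.tolist())
```

In this toy frame, `t1` fails the population threshold, `t2` fails the household threshold, and `t3` is 90% water, so only `t4` remains.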


In [8]:
# remove the three area variables that contribute to the water_percent variable
deepsolar_curated.drop(labels=['land_area', 'water_area', 'total_area'], axis=1, inplace=True)
deepsolar_curated.shape

(71305, 165)

In [9]:
print("Percentage of datapoints remaining in the curated dataset compared to the original is: "+ \
       str(round((deepsolar_curated.shape[0]/deepsolar_original.shape[0])*100, 2)) + "%")

Percentage of datapoints remaining in the curated dataset compared to the original is: 98.3%


## Define outcome variable

In [10]:
# multiply the owner-occupied rate by the original outcome variable
deepsolar_curated['owner_occupied_solar_system_density'] = deepsolar_curated['occupancy_owner_rate']* \
                                                           deepsolar_curated['number_of_solar_system_per_household']
outcome_var = 'owner_occupied_solar_system_density'

In [11]:
# remove the previously used outcome variable and the occupancy owner rate variable
deepsolar_curated.drop(labels=['occupancy_owner_rate', 'number_of_solar_system_per_household'], 
                       axis=1, inplace=True)
deepsolar_curated.shape

(71305, 164)
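
As a small worked example of the modified outcome variable, the sketch below applies the same multiplication to made-up numbers. The column names match the notebook; the values are hypothetical. A tract where 60% of occupied homes are owner occupied and there are 0.05 solar systems per household gets an outcome value of 0.6 × 0.05 = 0.03, and a tract with no owner-occupied homes gets 0 regardless of its solar system count.

```python
import pandas as pd

# toy values (hypothetical), same column names as in the notebook
toy = pd.DataFrame({
    'occupancy_owner_rate': [0.60, 0.90, 0.00],
    'number_of_solar_system_per_household': [0.05, 0.02, 0.10],
})

# modified outcome: owner-occupied rate times solar systems per household
toy['owner_occupied_solar_system_density'] = (
    toy['occupancy_owner_rate'] * toy['number_of_solar_system_per_household']
)
print(toy['owner_occupied_solar_system_density'].round(3).tolist())
```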

For one dataset variation, we're using the original outcome variable used by DeepSolar.

In [10]:
# VARIATION
# outcome_var = 'number_of_solar_system_per_household'

## Remove variables

In [12]:
# extract variable names from header
deepsolar_header = pd.read_csv('../deepsolar_tract.csv', encoding='ISO-8859-1', nrows=0).columns.tolist()

In [13]:
# load all variables into a dictionary for curating a variable list
variables_dict = {}
for var in deepsolar_header:
    variables_dict[var] = 1

### Remove outcome variables

In [14]:
variables_dict['tile_count'] = 'Unused outcome variable: number of image tiles containing solar power system'
variables_dict['solar_system_count'] = 'Unused outcome variable: number of solar power systems (after merging)'
variables_dict['total_panel_area'] = 'Unused outcome variable: total area of solar panels (m^2)'
variables_dict['solar_panel_area_divided_by_area'] = 'Unused outcome variable: \
    solar panel area divided by total area (m^2/mile^2)'
variables_dict['solar_panel_area_per_capita'] = 'Unused outcome variable: \
    solar panel area per capita (m^2/capita)'
variables_dict['tile_count_residential'] = 'Unused outcome variable: \
    number of image tiles containing residential solar power system'
variables_dict['tile_count_nonresidential'] = 'Unused outcome variable: \
    number of image tiles containing non-residential solar power system'
variables_dict['solar_system_count_residential'] = 'Unused outcome variable: \
    number of residential solar power systems (after merging)'
variables_dict['solar_system_count_nonresidential'] = 'Unused outcome variable: \
    number of non-residential solar power systems (after merging)'
variables_dict['total_panel_area_residential'] = 'Unused outcome variable: \
    total area of residential solar panels (m^2)'
variables_dict['total_panel_area_nonresidential'] = 'Unused outcome variable: \
    total area of non-residential solar panels (m^2)'

### Remove variables that were used to calculate other variables

In [15]:
variables_dict['education_population'] = 'Used for calculating education proportions'
variables_dict['heating_fuel_housing_unit_count'] = 'Used for calculating heating proportions'
variables_dict['population'] = 'Used for calculating population density'
variables_dict['poverty_family_count'] = 'Used for calculating poverty level rate'
variables_dict['household_count'] = 'Used for calculating other household proportions'
variables_dict['housing_unit_count'] = 'Used for calculating other housing unit proportions'
variables_dict['housing_unit_occupied_count'] = 'Used for calculating other occupancy rates'

### Remove count variables

In [16]:
variables_dict['race_asian'] = 'Proportion recorded in another variable: race_asian_rate'
variables_dict['race_black_africa'] = 'Proportion recorded in another variable: race_black_africa_rate'
variables_dict['race_indian_alaska'] = 'Proportion recorded in another variable: race_indian_alaska_rate'
variables_dict['race_islander'] = 'Proportion recorded in another variable: race_islander_rate'
variables_dict['race_other'] = 'Proportion recorded in another variable: race_other_rate'
variables_dict['race_two_more'] = 'Proportion recorded in another variable: race_two_more_rate'
variables_dict['race_white'] = 'Proportion recorded in another variable: race_white_rate'
variables_dict['unemployed'] = 'Proportion recorded in another variable: employ_rate'
variables_dict['poverty_family_below_poverty_level'] = 'Proportion recorded in another variable: \
    poverty_family_below_poverty_level_rate'
variables_dict['heating_fuel_none'] = 'Proportion recorded in another variable: heating_fuel_none_rate'
variables_dict['heating_fuel_other'] = 'Proportion recorded in another variable: heating_fuel_other_rate'
variables_dict['heating_fuel_solar'] = 'Proportion recorded in another variable: heating_fuel_solar_rate'
variables_dict['education_professional_school'] = 'Proportion recorded in another variable: \
    education_professional_school_rate'
variables_dict['employed'] = 'Proportion recorded in another variable: employ_rate'
variables_dict['heating_fuel_coal_coke'] = 'Proportion recorded in another variable: heating_fuel_coal_coke_rate'
variables_dict['heating_fuel_electricity'] = 'Proportion recorded in another variable: heating_fuel_electricity_rate'
variables_dict['heating_fuel_fuel_oil_kerosene'] = 'Proportion recorded in another variable: \
    heating_fuel_fuel_oil_kerosene_rate'
variables_dict['heating_fuel_gas'] = 'Proportion recorded in another variable: heating_fuel_gas_rate'
variables_dict['education_bachelor'] = 'Proportion recorded in another variable: education_bachelor_rate'
variables_dict['education_college'] = 'Proportion recorded in another variable: education_college_rate'
variables_dict['education_doctoral'] = 'Proportion recorded in another variable: education_doctoral_rate'
variables_dict['education_high_school_graduate'] = 'Proportion recorded in another variable: \
    education_high_school_graduate_rate'
variables_dict['education_less_than_high_school'] = 'Proportion recorded in another variable: \
    education_less_than_high_school_rate'
variables_dict['education_master'] = 'Proportion recorded in another variable: education_master_rate'

### Remove variables highly correlated with another variable

In [17]:
variables_dict['earth_temperature'] = 'Positively correlated with lat'
variables_dict['heating_degree_days'] = 'Positively correlated with lat'
variables_dict['cooling_degree_days'] = 'Positively correlated with lat'
# VARIATION
# retain air_temperature when removing latitude
#variables_dict['air_temperature'] = 'Positively correlated with lat'
variables_dict['heating_design_temperature'] = 'Positively correlated with variables highly correlated with lat'
variables_dict['voting_2012_dem_percentage'] = 'Positively correlated with voting_2016_dem_percentage'
variables_dict['voting_2012_gop_percentage'] = 'Positively correlated with voting_2016_gop_percentage'
variables_dict['voting_2016_gop_percentage'] = 'Negatively correlated with voting_2016_dem_percentage'
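
The reasons recorded above refer to pairwise correlations, but this notebook does not show how they were measured. One way such pairs could be screened is sketched below on synthetic data. The helper `high_corr_pairs`, the 0.8 threshold, and the generated values are all assumptions for illustration.

```python
import numpy as np
import pandas as pd


def high_corr_pairs(df, threshold=0.8):
    """Return (col_a, col_b, r) for column pairs with |Pearson r| >= threshold."""
    corr = df.corr()
    cols = corr.columns
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            r = corr.iloc[i, j]
            if abs(r) >= threshold:
                pairs.append((cols[i], cols[j], float(r)))
    return pairs


rng = np.random.default_rng(0)
lat = rng.uniform(25, 49, size=200)
synthetic = pd.DataFrame({
    'lat': lat,
    # built to track latitude closely
    'heating_degree_days': 300 * lat + rng.normal(0, 100, size=200),
    # independent noise, unrelated to latitude
    'wind_speed': rng.normal(5, 1, size=200),
})

for a, b, r in high_corr_pairs(synthetic):
    print(a, b, round(r, 3))
```

Only the upper triangle of the correlation matrix is scanned, so each pair is reported once.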

### Remove electricity price, electricity consumption and incentive variables that are not explicitly residential

In [18]:
variables_dict['electricity_price_commercial'] = "Retaining electricity_price_residential in dataset"
variables_dict['electricity_price_industrial'] = "Retaining electricity_price_residential in dataset"
variables_dict['electricity_price_transportation'] = "Retaining electricity_price_residential in dataset"
variables_dict['electricity_price_overall'] = "Retaining electricity_price_residential in dataset"
variables_dict['electricity_consume_commercial'] = "Retaining electricity_consume_residential in dataset"
variables_dict['electricity_consume_industrial'] = "Retaining electricity_consume_residential in dataset"
variables_dict['electricity_consume_total'] = "Retaining electricity_consume_residential in dataset"
variables_dict['incentive_count_nonresidential'] = "Retaining incentive_count_residential in dataset"
variables_dict['incentive_nonresidential_state_level'] = "Retaining incentive_residential_state_level in dataset"

In [19]:
# Convert variable dictionary to list
drop_variables = []
for key,val in variables_dict.items():
    if val!=1:
        drop_variables.append(key)
dropped_var_number = len(drop_variables)
print("Number of variables to be dropped:", dropped_var_number, "\n")
print("After dropping variables, dataset should have", original_shape[1]-3-dropped_var_number, "variables")

Number of variables to be dropped: 58 

After dropping variables, dataset should have 106 variables
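
The flagging pattern used above (every column starts at `1` and is overwritten with a reason string when it should be dropped) reduces to a list comprehension. A minimal sketch with a made-up three-column header follows; the names `toy_header`, `toy_flags`, and `toy_drop` are assumptions for illustration.

```python
# every column starts as "keep" (1); a string value records why it is dropped
toy_header = ['population', 'lat', 'tile_count']
toy_flags = {name: 1 for name in toy_header}
toy_flags['population'] = 'Used for calculating population density'
toy_flags['tile_count'] = 'Unused outcome variable'

# columns whose flag is no longer 1 are the ones to drop
toy_drop = [name for name, flag in toy_flags.items() if flag != 1]
print(toy_drop)
```

Keeping the reason strings in the dictionary means the rationale for each dropped column stays next to its name.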


In [20]:
deepsolar_curated.drop(labels=drop_variables, axis=1, inplace=True)
print(deepsolar_curated.shape)

(71305, 106)


## Impute missing values

In [21]:
# remove air_temperature from this list when it is removed above
impute_vars = ['elevation',
               'daily_solar_radiation',
               'lat',
               'lon',
               'air_temperature',  # VARIATION
               'cooling_design_temperature',
               'earth_temperature_amplitude',
               'frost_days',
               'relative_humidity',
               'atmospheric_pressure',
               'wind_speed',
               'housing_unit_median_gross_rent',
               'housing_unit_median_value',
               'dropout_16_19_inschool_rate',
               'mortgage_with_rate',
               'median_household_income',
               'travel_time_average']
# fill missing values with the median of each (state, county) group;
# the columns are selected with a list because recent pandas versions
# no longer accept a bare tuple of column names on a groupby
deepsolar_curated[impute_vars] = deepsolar_curated.groupby(['state', 'county'])[impute_vars] \
                                                  .transform(lambda x: x.fillna(x.median()))

In [22]:
# Confirm missing values have all been replaced
missing_val_count = deepsolar_curated.isnull().sum().sort_values(ascending=False)
for a, b in missing_val_count.items():
    if b>0:
        print("{:<30}\t{}".format(a,b))
# No output means there is no remaining missing data

earth_temperature_amplitude   	11
atmospheric_pressure          	11
lon                           	11
elevation                     	11
cooling_design_temperature    	11
frost_days                    	11
air_temperature               	11
relative_humidity             	11
daily_solar_radiation         	11
wind_speed                    	11
lat                           	11
housing_unit_median_gross_rent	1
housing_unit_median_value     	1


Imputing values by county and state leaves a small remaining count of missing values. The national median is used to replace those remaining missing values.
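
The two-stage imputation can be illustrated on a toy frame. This is a sketch with made-up values; it uses `transform('median')` followed by `fillna`, which is an alternative formulation chosen for the example rather than the exact calls used in this notebook. A county with no observed values has no median of its own, so its rows stay missing after the first stage and are filled by the overall median in the second.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'state': ['ca', 'ca', 'ca', 'ca', 'tx', 'tx'],
    'county': ['A', 'A', 'A', 'C', 'B', 'B'],
    'elevation': [10.0, np.nan, 30.0, 100.0, np.nan, np.nan],
})
cols = ['elevation']

# stage 1: fill with the median of the row's own (state, county) group
group_medians = toy.groupby(['state', 'county'])[cols].transform('median')
toy[cols] = toy[cols].fillna(group_medians)
remaining = int(toy[cols].isnull().sum().sum())

# stage 2: groups with no observed values are still missing; use the overall median
toy[cols] = toy[cols].fillna(toy[cols].median())
print(remaining, toy['elevation'].tolist())
```

Here the missing value in county A becomes 20 (the median of 10 and 30), while both rows of county B remain missing after stage 1 and receive 25, the median of the four values available at that point.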

In [23]:
# fill in missing lat/lon and weather data with national medians
# remove air_temperature from this list if removed above
national_median_vars = ['elevation',
                        'daily_solar_radiation',
                        'lat',
                        'lon',
                        'air_temperature',  # VARIATION
                        'cooling_design_temperature',
                        'earth_temperature_amplitude',
                        'frost_days',
                        'relative_humidity',
                        'atmospheric_pressure',
                        'wind_speed',
                        'housing_unit_median_gross_rent',
                        'housing_unit_median_value',
                        'dropout_16_19_inschool_rate',
                        'mortgage_with_rate',
                        'median_household_income',
                        'travel_time_average']
for var in national_median_vars:
    # assign the result back instead of calling fillna(inplace=True) on a column selection
    deepsolar_curated[var] = deepsolar_curated[var].fillna(deepsolar_curated[var].median())

In [24]:
# Confirm missing values have all been replaced
missing_val_count = deepsolar_curated.isnull().sum().sort_values(ascending=False)
for a, b in missing_val_count.items():
    if b>0:
        print("{:<30}\t{}".format(a,b))
# No output means there is no remaining missing data

## Convert string variables (county and state) to numeric values

In [25]:
# Encode string features (county and state) into numeric features
LE = preprocessing.LabelEncoder()

LE.fit(deepsolar_curated['county'])
deepsolar_curated['county'] = LE.transform(deepsolar_curated['county'])

LE.fit(deepsolar_curated['state'])
deepsolar_curated['state'] = LE.transform(deepsolar_curated['state'])

print("Confirm dataset shape has not changed after inspecting and modifying values:", deepsolar_curated.shape)

Confirm dataset shape has not changed after inspecting and modifying values: (71305, 106)


## Normalize variables

In [26]:
# normalize all variables except outcome variable
# remove outcome variable so that only features are normalized
deepsolar_final = deepsolar_curated.drop(labels=outcome_var, axis=1)
print("Dataset shape after removing variables:", deepsolar_final.shape)
# normalize
deepsolar_final = (deepsolar_final - deepsolar_final.mean())/(deepsolar_final.max() - deepsolar_final.min())
# Add outcome variable back in from un-normalized dataset
deepsolar_final[outcome_var] = deepsolar_curated[outcome_var]
print("Dataset shape after adding variables back in:", deepsolar_final.shape)

Dataset shape after removing variables: (71305, 105)
Dataset shape after adding variables back in: (71305, 106)
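
The normalization cell applies mean normalization: each feature is centered on its mean and divided by its range, (x - mean) / (max - min). A minimal sketch on toy data (the values are made up) shows that each resulting column has a mean of 0 and a range of 1:

```python
import pandas as pd

toy = pd.DataFrame({
    'per_capita_income': [20000.0, 30000.0, 70000.0],
    'frost_days': [0.0, 50.0, 100.0],
})

# mean normalization: center on the column mean, scale by the column range
normalized = (toy - toy.mean()) / (toy.max() - toy.min())
print(normalized.round(3))
```

A column with a single constant value has a range of zero, so this formula would yield missing values for it; such columns would need to be handled separately.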


## Remove geographic variables

In [27]:
# remove county and state variables
geographic_vars = ['county', 'state']
deepsolar_final.drop(labels=geographic_vars, axis=1, inplace=True)
print(deepsolar_final.shape)

(71305, 104)


## Remove variables based on L1 regularization linear model

In [28]:
feature_drop_list = ['average_household_income', 'gini_index', 'education_less_than_high_school_rate', 
                     'education_professional_school_rate', 'education_doctoral_rate', 'race_indian_alaska_rate', 
                     'race_islander_rate', 'employ_rate', 'heating_fuel_other_rate', 'electricity_price_residential', 
                     'housing_unit_median_value', 'elevation', 'cooling_design_temperature', 'atmospheric_pressure', 
                     'age_more_than_85_rate', 'age_45_54_rate', 'age_55_64_rate', 'age_15_17_rate', 
                     'occupation_public_rate', 'occupation_agriculture_rate', 'transportation_home_rate', 
                     'transportation_car_alone_rate', 'transportation_walk_rate', 'transportation_motorcycle_rate', 
                     'transportation_bicycle_rate', 'travel_time_less_than_10_rate', 'travel_time_40_59_rate', 
                     'health_insurance_public_rate', 'travel_time_average', 'number_of_years_of_education', 
                     'diversity', 'race_asian_rate', 'age_75_84_rate', 'age_10_14_rate', 'age_5_9_rate', 
                     'occupation_finance_rate', 'travel_time_30_39_rate', 'travel_time_60_89_rate', 
                     'voting_2016_dem_win', 'voting_2012_dem_win', 'race_white_rate', 'age_18_24_rate', 
                     'dropout_16_19_inschool_rate', 'occupation_manufacturing_rate']

deepsolar_final.drop(labels=feature_drop_list, axis=1, inplace=True)
print(deepsolar_final.shape)

(71305, 60)
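
The drop list above comes from the separate "Feature Selection with L1 Regularization" notebook, which is not reproduced here. As a rough illustration of how an L1-penalized linear model can produce such a list, the sketch below fits scikit-learn's `Lasso` to synthetic data and collects the features whose coefficients are shrunk to exactly zero. The data, the feature names, and `alpha=0.1` are assumptions and do not reflect the settings used for the final model.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
names = ['x0', 'x1', 'x2', 'x3', 'x4']
X = pd.DataFrame(rng.normal(size=(500, 5)), columns=names)
# only x0 and x3 influence the synthetic outcome
y = 3.0 * X['x0'] - 2.0 * X['x3'] + 0.1 * rng.normal(size=500)

model = Lasso(alpha=0.1)
model.fit(X, y)

# features with a zero coefficient carry no weight in the fitted model
toy_drop_list = [n for n, c in zip(names, model.coef_) if c == 0]
print(toy_drop_list)
```

A larger `alpha` pushes more coefficients to zero, so the length of the resulting drop list depends on the penalty strength chosen.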


### Remove latitude and longitude to assess value of those variables to the model

In [29]:
# VARIATION
deepsolar_final.drop(labels=['lat', 'lon'], axis=1, inplace=True)
print(deepsolar_final.shape)

(71305, 58)


## Print variables and save final dataset

In [30]:
print("There are", len(deepsolar_final.columns.values), "variables in the dataset:")
for i in deepsolar_final.columns.values:
    print(i)

There are 58 variables in the dataset:
per_capita_income
population_density
education_high_school_graduate_rate
education_college_rate
education_bachelor_rate
education_master_rate
race_black_africa_rate
race_other_rate
race_two_more_rate
poverty_family_below_poverty_level_rate
heating_fuel_gas_rate
heating_fuel_electricity_rate
heating_fuel_fuel_oil_kerosene_rate
heating_fuel_coal_coke_rate
heating_fuel_solar_rate
heating_fuel_none_rate
median_household_income
electricity_consume_residential
average_household_size
housing_unit_median_gross_rent
earth_temperature_amplitude
frost_days
air_temperature
relative_humidity
daily_solar_radiation
wind_speed
age_25_34_rate
age_35_44_rate
age_65_74_rate
household_type_family_rate
occupation_construction_rate
occupation_information_rate
occupation_education_rate
occupation_administrative_rate
occupation_wholesale_rate
occupation_retail_rate
occupation_transportation_rate
occupation_arts_rate
occupancy_vacant_rate
mortgage_with_rate
transportation

In [31]:
# deepsolar_final.to_csv('../Datasets/SolarPrediction_Final_Dataset.csv')

In [29]:
# VARIATION
# save dataset with original outcome variable
deepsolar_final.to_csv('../Datasets/Final_OldOutcomeVar.csv')

In [32]:
# VARIATION
# save dataset with latitude and longitude removed
deepsolar_final.to_csv('../Datasets/Final_NoLatLon.csv')