# Format Core Dataset

A large part of this analysis will focus on incentives/policies, utility rates, and other more "controllable" factors. Before adding those, I want to get the core dataset together.

There are three components to the dataset:
- PV data (aggregated to the ZCTA level)
- Environmental data
- Census data

The environmental data is ready to go. The Census data still needs some exploration and cleaning. More than anything, though, I need to prep the PV data, so I am going to start with that. I will explore the PV data by itself in the future; right now I am more concerned with prepping it to be part of the larger dataset.

In [1]:
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

## PV Data

PV data has been cleaned and aggregated in `01_prep_pv_data.ipynb`. Read it in below and confirm it is formatted properly.

In [2]:
pv_data = pd.read_csv('../data/pv_systems.csv').drop('Unnamed: 0', axis=1)
pv_data.head()

Unnamed: 0,zcta,num_systems,total_capacity,mean_system_size,median_system_size
0,5001,13,74.705,5.746538,4.2
1,5026,1,5.5,5.5,5.5
2,5031,4,21.516,5.379,5.883
3,5032,11,72.133,6.557545,5.4
4,5033,10,71.86,7.186,7.521


## Census Data

In [3]:
census = pd.read_csv('../data/census.csv').drop('Unnamed: 0', axis=1)
census.head()

Unnamed: 0,zcta,state,lat,long,average_household_income,mean_household_income_lowest_quintile,mean_household_income_second_quintile,mean_household_income_third_quintile,mean_household_income_fourth_quintile,mean_household_income_highest_quintile,...,total_population,per_capita_income,median_household_income,household_count,housing_unit_count,housing_unit_occupied_count,housing_unit_median_value,housing_unit_median_gross_rent,average_years_of_education,population_density
0,85610,Arizona,31.744197,109.722324,53713.747228,15735.0,28976.0,41584.0,60403.0,121871.0,...,1071,22831.0,41422,451,691,451,118200,565,12.518724,2.404896
1,85614,Arizona,31.814301,110.9194,67347.031441,15092.0,33942.0,52059.0,78902.0,156740.0,...,23777,37346.0,51449,13104,16986,13104,174800,1011,14.646702,175.250085
2,85624,Arizona,31.504971,110.692999,56508.955224,12085.0,26596.0,40793.0,63481.0,139590.0,...,1289,25608.0,39013,536,860,536,240000,719,14.829431,3.291417
3,85629,Arizona,31.917838,111.019035,91646.185302,24218.0,55100.0,82356.0,109225.0,187332.0,...,25770,30916.0,82334,8559,9260,8559,194200,1410,14.201352,111.842558
4,85630,Arizona,31.886572,110.181046,57186.339381,6123.0,16639.0,37332.0,58660.0,167178.0,...,1757,30005.0,38507,937,1226,937,163600,493,13.923297,9.599983


In [4]:
census.shape

(15740, 88)

In [5]:
pv_data['zcta'].nunique()

10263

Census data has 15,740 zip codes (ZCTAs) while the PV data has only 10,263. The gap presumably reflects ZCTAs with no recorded PV systems. It also means the census data will be the base table I use to build the full dataset.
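The comparison of counts above can be checked directly by comparing the two sets of ZCTAs. This is a minimal sketch on toy frames (`census_toy` and `pv_toy` are hypothetical stand-ins with made-up values, not the real tables); it shows how to list ZCTAs that have no PV record, and also PV ZCTAs that the census table does not cover.

```python
import pandas as pd

# Toy stand-ins for the real tables (illustrative values only)
census_toy = pd.DataFrame({"zcta": [85610, 85614, 85624, 85629]})
pv_toy = pd.DataFrame({"zcta": [85614, 85629, 99999],
                       "num_systems": [1012, 1186, 3]})

census_zctas = set(census_toy["zcta"])
pv_zctas = set(pv_toy["zcta"])

# ZCTAs present in the census table but absent from the PV table
no_pv = sorted(census_zctas - pv_zctas)
# ZCTAs present in the PV table but absent from the census table
pv_only = sorted(pv_zctas - census_zctas)

print(len(no_pv), "census ZCTAs without PV data:", no_pv)
print(len(pv_only), "PV ZCTAs missing from census:", pv_only)
```

If the second list is non-empty on the real data, a left join on the census table would silently leave those PV rows out.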

In [6]:
census.isna().sum()

zcta                                0
state                               0
lat                                 0
long                                0
average_household_income          323
                                 ... 
housing_unit_occupied_count         0
housing_unit_median_value           0
housing_unit_median_gross_rent      0
average_years_of_education        204
population_density                  0
Length: 88, dtype: int64

Handling missing data is going to be tricky with the census data. Imputing will naturally introduce bias (which is dangerous with Census data), but dropping rows with missing values would introduce bias as well, because the affected ZCTAs would no longer be represented. This is something I am going to have to look into quite a bit more. For the purposes of this MVP, I am not going to worry about it too much.
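Before choosing between imputing and dropping, it may help to see how much is actually missing per column. The following is a rough sketch on a toy frame (`census_toy` is hypothetical and its values are made up); it reports the count and share of missing values for only the columns that have any.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the census table (illustrative values only)
census_toy = pd.DataFrame({
    "zcta": [85610, 85614, 85624, 85629],
    "average_household_income": [53713.7, np.nan, 56509.0, 91646.2],
    "average_years_of_education": [12.5, 14.6, np.nan, np.nan],
})

missing = census_toy.isna().sum()
summary = pd.DataFrame({
    "n_missing": missing,
    "share_missing": missing / len(census_toy),
})

# Keep only columns with at least one missing value, worst first
summary = (summary[summary["n_missing"] > 0]
           .sort_values("n_missing", ascending=False))
print(summary)
```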

I am going to leave all of them as missing for now and continue to construct the dataset. Next I want to include the environmental data (`data/zcta_env.json`).

In [7]:
env_df = pd.read_json('../data/zcta_env.json').T.reset_index()
env_df.head()

Unnamed: 0,index,T2M,RH2M,SI_EF_TILTED_SURFACE_HORIZONTAL,PS,CLOUD_AMT_DAY,CDD18_3,HDD18_3,WS2M,TS,FROST_DAYS,TS_AMP,SG_SAA
0,86506,11.15,45.83,5.49,79.0,40.9,34.21,228.16,2.93,11.94,10.64,18.21,-100.91
1,86512,11.15,45.83,5.49,79.0,40.9,34.21,228.16,2.93,11.94,10.64,18.21,-100.91
2,85936,11.15,45.83,5.49,79.0,40.9,34.21,228.16,2.93,11.94,10.64,18.21,-100.91
3,86508,11.15,45.83,5.49,79.0,40.9,34.21,228.16,2.93,11.94,10.64,18.21,-100.91
4,87319,11.15,45.83,5.49,79.0,40.9,34.21,228.16,2.93,11.94,10.64,18.21,-100.91


Rename the columns to descriptive names

In [8]:
ENV_COLS = {'index': 'zcta', 'T2M': 'air_temp', 'RH2M': 'relative_humidity', 'SI_EF_TILTED_SURFACE_HORIZONTAL': 'solar_radiation',
            'PS': 'atmospheric_pressure', 'CLOUD_AMT_DAY': 'cloud_amount', 'CDD18_3': 'cooling_degree_days', 'HDD18_3': 'heating_degree_days',
            'WS2M': 'wind_speed', 'TS': 'earth_temp', 'FROST_DAYS': 'frost_days', 'TS_AMP': 'earth_temp_amplitude', 'SG_SAA': 'solar_azimuth_angle'}

In [9]:
env_df = env_df.rename(columns=ENV_COLS)
env_df.head()

Unnamed: 0,zcta,air_temp,relative_humidity,solar_radiation,atmospheric_pressure,cloud_amount,cooling_degree_days,heating_degree_days,wind_speed,earth_temp,frost_days,earth_temp_amplitude,solar_azimuth_angle
0,86506,11.15,45.83,5.49,79.0,40.9,34.21,228.16,2.93,11.94,10.64,18.21,-100.91
1,86512,11.15,45.83,5.49,79.0,40.9,34.21,228.16,2.93,11.94,10.64,18.21,-100.91
2,85936,11.15,45.83,5.49,79.0,40.9,34.21,228.16,2.93,11.94,10.64,18.21,-100.91
3,86508,11.15,45.83,5.49,79.0,40.9,34.21,228.16,2.93,11.94,10.64,18.21,-100.91
4,87319,11.15,45.83,5.49,79.0,40.9,34.21,228.16,2.93,11.94,10.64,18.21,-100.91


In [10]:
env_df.shape

(15755, 13)

The environmental data has even more rows (15,755) than the Census data, but I am still going to use the Census data as the base dataset. Below I merge the census data and environmental data on the `zcta` column. I perform a `left` join to preserve all observations in the `census` table. The result should have 15,740 rows.

In [11]:
base_df = census.merge(env_df, on='zcta', how='left')
base_df.head()

Unnamed: 0,zcta,state,lat,long,average_household_income,mean_household_income_lowest_quintile,mean_household_income_second_quintile,mean_household_income_third_quintile,mean_household_income_fourth_quintile,mean_household_income_highest_quintile,...,solar_radiation,atmospheric_pressure,cloud_amount,cooling_degree_days,heating_degree_days,wind_speed,earth_temp,frost_days,earth_temp_amplitude,solar_azimuth_angle
0,85610,Arizona,31.744197,109.722324,53713.747228,15735.0,28976.0,41584.0,60403.0,121871.0,...,5.78,86.27,32.13,95.2,108.36,2.46,17.4,3.3,18.02,-100.64
1,85614,Arizona,31.814301,110.9194,67347.031441,15092.0,33942.0,52059.0,78902.0,156740.0,...,5.75,90.59,31.87,142.35,64.42,2.43,20.51,0.8,18.44,-101.18
2,85624,Arizona,31.504971,110.692999,56508.955224,12085.0,26596.0,40793.0,63481.0,139590.0,...,5.75,90.59,31.87,142.35,64.42,2.43,20.51,0.8,18.44,-101.18
3,85629,Arizona,31.917838,111.019035,91646.185302,24218.0,55100.0,82356.0,109225.0,187332.0,...,5.75,90.59,31.87,142.35,64.42,2.43,20.51,0.8,18.44,-101.18
4,85630,Arizona,31.886572,110.181046,57186.339381,6123.0,16639.0,37332.0,58660.0,167178.0,...,5.78,86.27,32.13,95.2,108.36,2.46,17.4,3.3,18.02,-100.64


In [12]:
base_df.shape

(15740, 100)
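The row-count expectation for the left join can also be enforced in code rather than checked by eye. This is a sketch on toy frames (`census_toy` and `env_toy` are hypothetical); it uses the `validate` and `indicator` parameters of `DataFrame.merge`. Whether the real environmental table has unique ZCTA keys is an assumption here; if it does not, `validate="many_to_one"` raises a `MergeError` instead of quietly adding rows.

```python
import pandas as pd

# Toy stand-ins (illustrative values only)
census_toy = pd.DataFrame({"zcta": [85610, 85614, 85624],
                           "state": ["Arizona"] * 3})
env_toy = pd.DataFrame({"zcta": [85610, 85614, 86506],
                        "air_temp": [17.4, 20.51, 11.15]})

merged = census_toy.merge(
    env_toy,
    on="zcta",
    how="left",
    validate="many_to_one",  # right-hand keys must be unique
    indicator=True,          # adds a `_merge` column describing each row's origin
)

# A left join against unique right-hand keys cannot change the row count
assert len(merged) == len(census_toy)

# Census ZCTAs that found no environmental record
unmatched = merged.loc[merged["_merge"] == "left_only", "zcta"].tolist()
print("no environmental data for:", unmatched)

merged = merged.drop(columns="_merge")
```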

The last thing I need to do is add the PV data to `base_df`

In [13]:
pv_data.head()

Unnamed: 0,zcta,num_systems,total_capacity,mean_system_size,median_system_size
0,5001,13,74.705,5.746538,4.2
1,5026,1,5.5,5.5,5.5
2,5031,4,21.516,5.379,5.883
3,5032,11,72.133,6.557545,5.4
4,5033,10,71.86,7.186,7.521


In [14]:
base_df['zcta'].dtype

dtype('int64')

In [15]:
pv_data['zcta'].dtype

dtype('int64')

In [16]:
base_df = base_df.merge(pv_data, on='zcta', how='left')
base_df.head()

Unnamed: 0,zcta,state,lat,long,average_household_income,mean_household_income_lowest_quintile,mean_household_income_second_quintile,mean_household_income_third_quintile,mean_household_income_fourth_quintile,mean_household_income_highest_quintile,...,heating_degree_days,wind_speed,earth_temp,frost_days,earth_temp_amplitude,solar_azimuth_angle,num_systems,total_capacity,mean_system_size,median_system_size
0,85610,Arizona,31.744197,109.722324,53713.747228,15735.0,28976.0,41584.0,60403.0,121871.0,...,108.36,2.46,17.4,3.3,18.02,-100.64,13.0,70.15,5.396154,5.38
1,85614,Arizona,31.814301,110.9194,67347.031441,15092.0,33942.0,52059.0,78902.0,156740.0,...,64.42,2.43,20.51,0.8,18.44,-101.18,1012.0,7015.507,6.932319,5.985
2,85624,Arizona,31.504971,110.692999,56508.955224,12085.0,26596.0,40793.0,63481.0,139590.0,...,64.42,2.43,20.51,0.8,18.44,-101.18,24.0,150.86,6.285833,5.865
3,85629,Arizona,31.917838,111.019035,91646.185302,24218.0,55100.0,82356.0,109225.0,187332.0,...,64.42,2.43,20.51,0.8,18.44,-101.18,1186.0,8934.678,7.533455,7.2975
4,85630,Arizona,31.886572,110.181046,57186.339381,6123.0,16639.0,37332.0,58660.0,167178.0,...,108.36,2.46,17.4,3.3,18.02,-100.64,37.0,258.01,6.973243,6.48


In [17]:
base_df.shape

(16063, 104)

The merge added rows: 16,063 instead of the expected 15,740

In [18]:
# check duplicates
base_df.duplicated(subset=['zcta']).sum()

323

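323 duplicated ZCTA values, exactly the number of extra rows (16,063 - 15,740). A left join only adds rows when a key is repeated in the right-hand table, so these repeats come from `pv_data`.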
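A left join adds rows whenever a key appears more than once in the right-hand table, so the place to look for the source of the duplicates is the PV table itself. A minimal sketch on toy frames (`pv_toy` and `census_toy` are hypothetical, with values borrowed from the rows shown in this notebook for illustration):

```python
import pandas as pd

# Toy PV table in which one ZCTA appears twice
pv_toy = pd.DataFrame({
    "zcta": [6470, 6470, 85610],
    "num_systems": [21, 161, 13],
    "total_capacity": [214.26, 1763.64, 70.15],
})
census_toy = pd.DataFrame({"zcta": [6470, 85610],
                           "state": ["Connecticut", "Arizona"]})

# keep=False flags every row of a repeated key, not only the later ones
dupes = pv_toy[pv_toy.duplicated(subset="zcta", keep=False)]
print(dupes)

# Each census row is matched to every PV row with the same key,
# so 2 census rows become 3 rows after the merge
merged = census_toy.merge(pv_toy, on="zcta", how="left")
print(len(census_toy), "->", len(merged))
```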

In [19]:
duplicate_zctas = base_df[base_df.duplicated(subset=['zcta'])]['zcta']

In [20]:
base_df[base_df['zcta'].isin(duplicate_zctas)]

Unnamed: 0,zcta,state,lat,long,average_household_income,mean_household_income_lowest_quintile,mean_household_income_second_quintile,mean_household_income_third_quintile,mean_household_income_fourth_quintile,mean_household_income_highest_quintile,...,heating_degree_days,wind_speed,earth_temp,frost_days,earth_temp_amplitude,solar_azimuth_angle,num_systems,total_capacity,mean_system_size,median_system_size
2169,6470,Connecticut,41.395083,73.317663,152776.476561,30900.0,76312.0,125581.0,186743.0,344346.0,...,232.55,2.75,12.26,5.61,6.83,-79.49,21.0,214.260,10.202857,9.8400
2170,6470,Connecticut,41.395083,73.317663,152776.476561,30900.0,76312.0,125581.0,186743.0,344346.0,...,232.55,2.75,12.26,5.61,6.83,-79.49,161.0,1763.640,10.954286,9.9200
2171,6472,Connecticut,41.382766,72.775194,129469.583333,41192.0,76412.0,100063.0,139226.0,290454.0,...,232.55,2.75,12.26,5.61,6.83,-79.49,1.0,4.910,4.910000,4.9100
2172,6472,Connecticut,41.382766,72.775194,129469.583333,41192.0,76412.0,100063.0,139226.0,290454.0,...,232.55,2.75,12.26,5.61,6.83,-79.49,11.0,116.260,10.569091,8.8200
2173,6480,Connecticut,41.598834,72.589071,121043.197006,21074.0,57895.0,97127.0,139356.0,289765.0,...,305.34,0.19,8.62,11.46,13.89,-79.26,16.0,124.540,7.783750,7.5850
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
14678,5452,Vermont,44.538872,73.050085,102689.933491,21305.0,53820.0,85352.0,119009.0,233965.0,...,358.04,0.91,6.99,12.58,13.99,-78.58,7.0,45.906,6.558000,6.5400
14682,5477,Vermont,44.416769,72.962422,107859.183673,29507.0,65140.0,99379.0,132808.0,212461.0,...,356.94,0.39,6.50,13.26,14.41,-78.80,32.0,181.310,5.665938,5.1885
14683,5477,Vermont,44.416769,72.962422,107859.183673,29507.0,65140.0,99379.0,132808.0,212461.0,...,356.94,0.39,6.50,13.26,14.41,-78.80,1.0,6.213,6.213000,6.2130
14685,5495,Vermont,44.429068,73.096245,111245.292560,21745.0,55090.0,91120.0,134183.0,254088.0,...,356.94,0.39,6.50,13.26,14.41,-78.80,51.0,294.404,5.772627,4.8000


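Drop duplicated ZCTAs. Note that `drop_duplicates` keeps the first row for each ZCTA, so the PV values in the later rows are discarded. Should have 15,740 rows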

In [21]:
base_df = base_df.drop_duplicates(subset=['zcta']).reset_index(drop=True)
base_df.head()

Unnamed: 0,zcta,state,lat,long,average_household_income,mean_household_income_lowest_quintile,mean_household_income_second_quintile,mean_household_income_third_quintile,mean_household_income_fourth_quintile,mean_household_income_highest_quintile,...,heating_degree_days,wind_speed,earth_temp,frost_days,earth_temp_amplitude,solar_azimuth_angle,num_systems,total_capacity,mean_system_size,median_system_size
0,85610,Arizona,31.744197,109.722324,53713.747228,15735.0,28976.0,41584.0,60403.0,121871.0,...,108.36,2.46,17.4,3.3,18.02,-100.64,13.0,70.15,5.396154,5.38
1,85614,Arizona,31.814301,110.9194,67347.031441,15092.0,33942.0,52059.0,78902.0,156740.0,...,64.42,2.43,20.51,0.8,18.44,-101.18,1012.0,7015.507,6.932319,5.985
2,85624,Arizona,31.504971,110.692999,56508.955224,12085.0,26596.0,40793.0,63481.0,139590.0,...,64.42,2.43,20.51,0.8,18.44,-101.18,24.0,150.86,6.285833,5.865
3,85629,Arizona,31.917838,111.019035,91646.185302,24218.0,55100.0,82356.0,109225.0,187332.0,...,64.42,2.43,20.51,0.8,18.44,-101.18,1186.0,8934.678,7.533455,7.2975
4,85630,Arizona,31.886572,110.181046,57186.339381,6123.0,16639.0,37332.0,58660.0,167178.0,...,108.36,2.46,17.4,3.3,18.02,-100.64,37.0,258.01,6.973243,6.48


In [22]:
base_df.shape

(15740, 104)
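Dropping duplicates keeps only the first row per ZCTA, which discards the PV figures in the other rows (for example, one of the two rows for ZCTA 6470). An alternative, sketched here under the assumption that repeated rows are partial aggregates of the same ZCTA (they could instead be an artifact of the earlier aggregation step, which would need checking in `01_prep_pv_data.ipynb`), is to combine them before merging. `pv_toy` is a hypothetical stand-in.

```python
import pandas as pd

# Toy PV table with one repeated ZCTA (values taken from the rows shown above)
pv_toy = pd.DataFrame({
    "zcta": [6470, 6470, 85610],
    "num_systems": [21, 161, 13],
    "total_capacity": [214.26, 1763.64, 70.15],
})

# Sum the additive columns so each ZCTA ends up on a single row
pv_agg = pv_toy.groupby("zcta", as_index=False).agg(
    num_systems=("num_systems", "sum"),
    total_capacity=("total_capacity", "sum"),
)

# The mean system size can be rebuilt from the two totals
pv_agg["mean_system_size"] = pv_agg["total_capacity"] / pv_agg["num_systems"]
print(pv_agg)
```

The median system size cannot be recovered this way; it would have to be recomputed from the system-level records.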

Save the core dataset as `data/base_data.csv`

In [23]:
base_df.to_csv('../data/base_data.csv')
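As written, `to_csv` also writes the DataFrame index, which is why the files read at the top of this notebook carry an `Unnamed: 0` column that has to be dropped. A small sketch of the difference, using an in-memory buffer and a hypothetical toy frame instead of the real file:

```python
import io
import pandas as pd

df = pd.DataFrame({"zcta": [85610, 85614], "num_systems": [13, 1012]})

# Default behaviour: the index is written as an unnamed first column
with_index = io.StringIO()
df.to_csv(with_index)
with_index.seek(0)
cols_with_index = pd.read_csv(with_index).columns.tolist()

# index=False writes only the data columns
without_index = io.StringIO()
df.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = pd.read_csv(without_index).columns.tolist()

print(cols_with_index)
print(cols_without_index)
```

Switching to `index=False` would mean any downstream notebook that calls `.drop('Unnamed: 0', axis=1)` on this file needs that call removed.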