# Question 1 - Data wrangling

In [5]:
import pandas as pd
import geopandas as gpd
import requests, json

## Joining census data with CalEnviroScreen shapefile

Here are the variables of interest:
- Median household income: B19013_001E
- Median gross rent: B25064_001E
- Units in housing: DP04_0006E to DP04_0013E

First, we get median household income and median gross rent for every census tract in California (state FIPS code 06) from the U.S. Census Bureau's 2020 ACS 5-year estimates via an API request.

In [6]:
# These variables are not in the "profile" dataset, so they need their own request
rs = "https://api.census.gov/data/2020/acs/acs5?get=NAME,B19013_001E,B25064_001E&for=tract:*&in=state:06"
r = requests.get(rs)
r.raise_for_status()  # stop here if the request failed
d = json.loads(r.text)
# The first row of the response holds the column names
IncomeRentDf = pd.DataFrame(d[1:], columns=d[0])
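
The Census API returns a JSON array of arrays whose first row holds the column names, and every value arrives as a string. The sketch below uses a small made-up payload (the rows and values are illustrative, not real ACS figures) to show one way of converting the estimate columns to numbers. It assumes that the large negative code -666666666, which the ACS uses to flag an estimate that could not be computed, should be treated as missing.

```python
import numpy as np
import pandas as pd

# Hypothetical payload shaped like a Census API response (values are made up)
sample = [
    ["NAME", "B19013_001E", "B25064_001E", "state", "county", "tract"],
    ["Tract A", "85000", "1900", "06", "001", "400100"],
    ["Tract B", "-666666666", "1500", "06", "001", "400200"],
]

# First row is the header; the remaining rows are data
df = pd.DataFrame(sample[1:], columns=sample[0])

value_cols = ["B19013_001E", "B25064_001E"]
# Strings -> numbers; anything unparseable becomes NaN
df[value_cols] = df[value_cols].apply(pd.to_numeric, errors="coerce")
# Assumption: treat the -666666666 annotation code as missing
df[value_cols] = df[value_cols].replace(-666666666, np.nan)

print(df[value_cols])
```

The same two conversion lines could be applied to `IncomeRentDf` and to the `DP04_*` columns of `HousingDf` before modeling.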

Then we repeat the same process, this time pulling the housing-unit variables from the ACS data profile dataset (`acs5/profile`).

In [7]:
rs2 = "https://api.census.gov/data/2020/acs/acs5/profile?get=NAME,DP04_0006E,DP04_0007E,DP04_0008E,DP04_0009E,DP04_0010E,DP04_0011E,DP04_0012E,DP04_0013E&for=tract:*&in=state:06"
r2 = requests.get(rs2)
r2.raise_for_status()  # stop here if the request failed
d2 = json.loads(r2.text)
HousingDf = pd.DataFrame(d2[1:], columns=d2[0])

We then rename the columns to make them easier to read.

In [8]:
IncomeRentDf.rename(columns = {'B19013_001E': 'Income', 'B25064_001E': 'Rent'}, inplace= True)

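We create a `GEOID` column by concatenating the `state`, `county`, and `tract` columns in preparation for a tabular join. Casting the result to an integer drops the leading zero of California's state code (`06`), which matches the integer `Tract` IDs used for the CalEnviroScreen join later on.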

In [9]:
columns = ['tract', 'county', 'state']
for i in columns:
    IncomeRentDf[i] = IncomeRentDf[i].astype(str)
    HousingDf[i] = HousingDf[i].astype(str)

HousingDf['GEOID'] = HousingDf['state'] + HousingDf['county'] + HousingDf['tract']
HousingDf['GEOID'] = HousingDf['GEOID'].astype(int)
IncomeRentDf['GEOID'] = IncomeRentDf['state'] + IncomeRentDf['county'] + IncomeRentDf['tract']
IncomeRentDf['GEOID'] = IncomeRentDf['GEOID'].astype(int)
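
A tract GEOID is the 2-digit state FIPS code, the 3-digit county code, and the 6-digit tract code concatenated. The sketch below, on made-up rows, shows the string form next to the integer form. The `zfill` padding is an extra guard (an assumption, not part of the cells above) for data whose leading zeros have already been lost, and `int64` is spelled out so that 10-digit IDs fit regardless of the platform's default integer width.

```python
import pandas as pd

# Illustrative rows; the second one mimics codes that lost their leading zeros
df = pd.DataFrame({
    "state": ["06", "6"],
    "county": ["083", "83"],
    "tract": ["002103", "2103"],
})

# Pad each part to its fixed width before concatenating
geoid_str = (
    df["state"].str.zfill(2)
    + df["county"].str.zfill(3)
    + df["tract"].str.zfill(6)
)

# The integer form drops the leading zero of the state code
geoid_int = geoid_str.astype("int64")

print(geoid_str.tolist())
print(geoid_int.tolist())
```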

Here we join `HousingDf` and `IncomeRentDf` on `GEOID` into a new dataframe, `censusDf`.

In [10]:
censusDf = HousingDf.set_index("GEOID").join(IncomeRentDf.set_index("GEOID"), rsuffix = '_remove')
# drop the duplicated columns and the identifier columns that are no longer needed
censusDf.drop(columns = ['NAME', 'NAME_remove', 'state', 'state_remove', 'county', 'county_remove', 'tract', 'tract_remove'], inplace = True)
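
An alternative to dropping the `_remove` duplicates afterwards is to keep only the needed columns on each side before joining. A minimal sketch with made-up values (the frames are stand-ins, not the real API results):

```python
import pandas as pd

# Stand-ins for HousingDf and IncomeRentDf, both already carrying a GEOID
housing = pd.DataFrame({
    "GEOID": [6001400100, 6001400200],
    "NAME": ["Tract A", "Tract B"],
    "DP04_0006E": ["1200", "950"],
})
income_rent = pd.DataFrame({
    "GEOID": [6001400100, 6001400200],
    "NAME": ["Tract A", "Tract B"],
    "Income": ["85000", "62000"],
    "Rent": ["1900", "1500"],
})

# Select only the value columns on each side, so no overlapping names remain
census = (
    housing.set_index("GEOID")[["DP04_0006E"]]
    .join(income_rent.set_index("GEOID")[["Income", "Rent"]])
)

print(census.columns.tolist())
```

Because no column names overlap, no suffix is needed and nothing has to be dropped after the join.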

Here we create a geodataframe for the CalEnviroScreen 4.0 data by reading a shapefile retrieved from [OEHHA](https://oehha.ca.gov/calenviroscreen/report/calenviroscreen-40).

In [12]:
CalEnviroScreenGdf = gpd.read_file('data/CES4/CES4 Final Shapefile.shp')
CalEnviroScreenGdf['Tract'] = CalEnviroScreenGdf['Tract'].astype(int)

We then left-join `censusDf` onto `CalEnviroScreenGdf` to create `tractsDf`, so every CalEnviroScreen tract is kept.

In [13]:
tractsDf = CalEnviroScreenGdf.set_index('Tract').join(censusDf, how='left')
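
A left join keeps every CalEnviroScreen tract and fills the census columns with `NaN` wherever no `GEOID` matches, so a mismatch (for instance, if the two sources were built on different tract vintages) would pass silently. One way to check, sketched here on made-up frames, is `merge` with `indicator=True`:

```python
import pandas as pd

# Stand-ins: three tracts on the left, census values for only two of them
ces = pd.DataFrame({
    "Tract": [6001400100, 6001400200, 6001400300],
    "CIscore": [12.5, 30.1, 45.8],
})
census = pd.DataFrame({
    "GEOID": [6001400100, 6001400200],
    "Income": [85000, 62000],
})

checked = ces.merge(census, left_on="Tract", right_on="GEOID",
                    how="left", indicator=True)

# "left_only" marks tracts that found no census row
unmatched = checked.loc[checked["_merge"] == "left_only", "Tract"].tolist()
print(unmatched)
```

A long `unmatched` list on the real data would suggest checking that both tables identify tracts the same way.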

## Wrangling EV charger data

First, we request electric charging station data from the NREL alternative fuel stations API, limited to stations located in California.

In [None]:
# Optional: run this cell only if geopandas raises a PROJ error; skip it otherwise.
# The paths below are specific to one machine and must be adjusted for your environment.
# Solution from https://gis.stackexchange.com/questions/375361/zonal-stats-returns-proj-error
import os
import pyproj as p
os.environ['PROJ_LIB'] = '/Users/hfrahn/opt/anaconda3/envs/uds/bin/pyproj'
p.datadir.set_data_dir('/Users/hfrahn/opt/anaconda3/envs/uds/bin/pyproj')

In [15]:
#request chargers from NREL 
apiKey = "eCN7llpPT79TmygqmvC71QdnnWdOquoRdnCR1DXo"
nrelString = "https://developer.nrel.gov/api/alt-fuel-stations/v1.geojson?api_key={}&fuel_type=ELEC&state=CA".format(apiKey)
chargers = gpd.read_file(nrelString)
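
Hard-coding an API key in a notebook exposes it to anyone who can read the file. A minimal sketch of one alternative reads the key from an environment variable and builds the same query string with `urllib.parse.urlencode`. The variable name `NREL_API_KEY` and the helper `build_nrel_url` are hypothetical choices for illustration, not part of the original workflow.

```python
import os
from urllib.parse import urlencode

BASE_URL = "https://developer.nrel.gov/api/alt-fuel-stations/v1.geojson"


def build_nrel_url(api_key, fuel_type="ELEC", state="CA"):
    """Build the stations query URL with the same parameters as the request above."""
    params = {"api_key": api_key, "fuel_type": fuel_type, "state": state}
    return BASE_URL + "?" + urlencode(params)


# Hypothetical variable name; "DEMO_KEY" is only a placeholder fallback for this sketch
api_key = os.environ.get("NREL_API_KEY", "DEMO_KEY")
url = build_nrel_url(api_key)
print(url.replace(api_key, "***"))  # avoid echoing the key in the output
```

The resulting `url` could then be passed to `gpd.read_file` in place of `nrelString`.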

We then keep only the chargers that are accessible to the public.

In [16]:
chargers = chargers[chargers['access_code'] == 'public']

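Next, we spatially join the chargers to the census tracts. The chargers are reprojected to EPSG:3310 (NAD83 / California Albers) for the join, and the inner join returns one row for each tract-charger pair.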

In [17]:
chargers_sjoin = gpd.sjoin(tractsDf, chargers.to_crs('EPSG:3310'), how="inner", predicate='intersects')
chargers_sjoin.head()

Tract,ZIP,County,ApproxLoc,TotPop19,CIscore,CIscoreP,Ozone,OzoneP,PM2_5,PM2_5_P,...,ng_fill_type_code,ng_psi,ng_vehicle_class,access_days_time_fr,intersection_directions_fr,bd_blends_fr,groups_with_access_code_fr,ev_pricing_fr,ev_network_ids,federal_agency
6083002103,93454,Santa Barbara,Santa Maria,4495,36.019653,69.162885,0.03419,10.566273,7.567724,10.031114,...,,,,,,,Public,,"{'station': ['837'], 'posts': ['110112-01', '1...",
6083002103,93454,Santa Barbara,Santa Maria,4495,36.019653,69.162885,0.03419,10.566273,7.567724,10.031114,...,,,,,,,Public,,,
6083002103,93454,Santa Barbara,Santa Maria,4495,36.019653,69.162885,0.03419,10.566273,7.567724,10.031114,...,,,,,,,Public,,"{'station': ['USCPIL7823641'], 'posts': ['1357...",
6083002103,93454,Santa Barbara,Santa Maria,4495,36.019653,69.162885,0.03419,10.566273,7.567724,10.031114,...,,,,,,,Public,,"{'station': ['USCPIL7823791'], 'posts': ['1357...",
6083002103,93454,Santa Barbara,Santa Maria,4495,36.019653,69.162885,0.03419,10.566273,7.567724,10.031114,...,,,,,,,Public,,"{'station': ['USCPIL7821621'], 'posts': ['1357...",


Below we use `groupby` to aggregate the number of EV chargers per census tract.

In [18]:
chargers_by_tract = chargers_sjoin.reset_index().groupby(['Tract'])['Tract'].count()

In [21]:
chargers_by_tract = chargers_by_tract.to_frame().rename(columns = {'Tract': 'Charger count'})
chargers_by_tract['Charger count'] = chargers_by_tract['Charger count'].astype('int16')

We then join the `chargers_by_tract` dataframe to `tractsDf` in preparation for predictive modeling.

In [22]:
joinedDf = tractsDf.join(chargers_by_tract, how = 'left')

Tracts without any public chargers have no match in the join, so we replace the resulting `NaN`s in `Charger count` with `0`s.

In [24]:
joinedDf['Charger count'] = joinedDf['Charger count'].fillna(0)
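
The count, join, and fill steps can be sketched end to end on made-up data. Note that the left join introduces `NaN`, which turns the integer counts into floats, so this sketch casts back to an integer type after filling. That final cast is an addition, not something the cells above do.

```python
import pandas as pd

# Stand-in for the spatial join result: one row per tract-charger pair
pairs = pd.DataFrame({"Tract": [6001400100, 6001400100, 6001400200]})

# Stand-in for tractsDf: three tracts, one of which has no chargers
tracts = pd.DataFrame(
    {"CIscore": [12.5, 30.1, 45.8]},
    index=pd.Index([6001400100, 6001400200, 6001400300], name="Tract"),
)

# Rows per tract = chargers per tract
counts = pairs.groupby("Tract").size().rename("Charger count")

joined = tracts.join(counts, how="left")
# Tracts missing from `counts` get NaN; fill with 0 and restore an integer dtype
joined["Charger count"] = joined["Charger count"].fillna(0).astype("int64")

print(joined["Charger count"].tolist())
```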

We then export `joinedDf` as a GeoJSON file for further visualization and analysis in the main notebook.

In [None]:
joinedDf.to_file('data/joinedDf.geojson', driver='GeoJSON')  