# Environmental Justice: exploring the data behind [CalEnviroScreen](https://oehha.ca.gov/calenviroscreen)

The California Office of Environmental Health Hazard Assessment (OEHHA) has put together a tool that assigns scores to areas in the state based on a set of indicators, including proximity to environmental hazards. The indicators and formula are illustrated in a [fact sheet](https://oehha.ca.gov/media/downloads/calenviroscreen/fact-sheet/ces30factsheetfinal.pdf) and summarized here:
     <img width="220" src="calscreen.png">

The purpose of this notebook is to illustrate using Python to explore the data behind this tool, to evaluate correlations between various indicators and responses, and to explore the combination of spatial and tabular data.

The objectives on the technical side are to:
 * Read in a shapefile and plot it
 * Read in the Excel sheets
 * Associate the geometry of the tracts with all the data
 * Make a function to calculate Pearson’s r between predictors (demographics) and responses (either pollution sources or illness)
 * Make a larger function to make a heatmap of predictors and responses and a scatter plot with a line fit and Pearson’s r
 * Loop over all predictors and responses to make a big pdf with a plot of each predictor vs each response on each page

## first we need to import some tools

In [None]:
import matplotlib.pyplot as plt  # most popular plotting package in python
import numpy as np               # numerical package we will need for Pearson's r
import pandas as pd              # pandas for tabular data
import geopandas as gp           # geopandas - similar to pandas but works with GIS
from matplotlib.backends.backend_pdf import PdfPages

# read the Excel file into pandas
## note that we can read each sheet separately. Also, passing `index_col=0` uses the 0th column (the census tract) as the index

In [None]:
df_pollution = pd.read_excel('social/ces3results.xlsx', sheet_name='CES 3.0 (2018 Update)', index_col=0)
df_demog = pd.read_excel('social/ces3results.xlsx', sheet_name='Demographic profile', index_col=0,
                        skiprows=1) # note skiprows=1 for demographics -- see spreadsheet for reason why
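
To make the effect of `skiprows=1` and `index_col=0` concrete, here is a minimal sketch. It assumes the reason for skipping a row is an extra banner line above the real header, which should be checked against the spreadsheet itself. It uses `pd.read_csv` on an in-memory string as a stand-in for `pd.read_excel` (the two arguments behave the same way in both readers), and the tract IDs and values are made up.

```python
import io
import pandas as pd

# Made-up stand-in for a sheet whose first row is a banner,
# followed by the real header row and then the data.
raw = io.StringIO(
    "Banner row that is not a header\n"
    "Census Tract,Total Population,Hispanic (%)\n"
    "1001,3174,65.3\n"
    "1002,6133,91.1\n"
)

# skiprows=1 drops the banner; index_col=0 makes the tract ID the index
demo = pd.read_csv(raw, skiprows=1, index_col=0)
print(demo)
```

Without `skiprows=1`, the banner would be read as the header row and the column names would be wrong.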

In [None]:
df_demog.head()

### the pandas DataFrame is similar to a data frame in R and has many powerful features

In [None]:
df_pollution.columns

In [None]:
df_pollution.describe()

In [None]:
df_pollution['Drinking Water'].hist(bins=50)

In [None]:
df_pollution['CES 3.0 Score'].plot()

# now read in spatial data with geopandas
## this also is read in as a dataframe but with some special geospatial metadata and a `geometry` that contains a `shapely` geometry object (Point, Polygon, etc.) for each row

## Let's take a quick look at some geographic data - California Counties

In [None]:
ca_co = gp.read_file('social/CA_Counties/CA_Counties_TIGER2016.shp')

In [None]:
ca_co.head()

## plotting is built in!

In [None]:
ca_co.plot()

## geodataframes include data about coordinate reference system

In [None]:
ca_co.crs

In [None]:
# `index_col` is a pandas reader argument, not a geopandas one; the index is set in the next cell
gdf = gp.read_file('social/CESJune2018Update_SHP/CES3June2018Update.shp')

In [None]:
gdf.index = gdf.tract
gdf

In [None]:
gdf.plot()

In [None]:
gdf.crs

In [None]:
gdf.crs==ca_co.crs

### uh-oh - these are not in the same projection. So, let's reproject `gdf` to be in the same projection as `ca_co`

In [None]:
gdf.to_crs(ca_co.crs, inplace=True)

In [None]:
gdf.crs==ca_co.crs

## we can focus in on the Bay Area and clean up axes

In [None]:
gdf.plot()


## other kinds of plots are possible as well

In [None]:
gdf['drink'].plot( kind='hist')

## specifying a column name makes a colorflood (choropleth) map of those data

In [None]:
gdf.plot(column='drink', legend=True, rasterized=True)


# make sure the Excel-derived dataframes and the shapefile-derived dataframe have the same index entries, based on census tracts. Then we can merge the tabular data into the geographic data

In [None]:
assert sorted(gdf.index) == sorted(df_demog.index)

In [None]:
assert sorted(gdf.index) == sorted(df_pollution.index)
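
If one of these assertions fails, it only says that the indexes differ, not where. A minimal sketch of one way to find the mismatched entries uses `Index.difference`. The two indexes below are made-up stand-ins for `gdf.index` and `df_demog.index`.

```python
import pandas as pd

# Made-up tract IDs standing in for gdf.index and df_demog.index
geo_index = pd.Index([1001, 1002, 1003, 1004], name="tract")
table_index = pd.Index([1002, 1003, 1004, 1005], name="Census Tract")

# IDs present in one source but missing from the other
only_in_geo = geo_index.difference(table_index)
only_in_table = table_index.difference(geo_index)

print("only in shapefile:", list(only_in_geo))
print("only in spreadsheet:", list(only_in_table))
```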

# now some data munging

### some of the column names are truncated due to shapefiles being an archaic format!

In [None]:
gdf.columns.values

### but there is a raft of good information in the Excel files

In [None]:
df_demog.columns

In [None]:
df_pollution.columns

### assemble a list of predictors

In [None]:
preds = []
for c in df_demog.columns:
    if '%' in c and 'Other' not in c:
        preds.append(c)
preds

In [None]:
preds_ext = ['Education', 'Linguistic Isolation', 'Poverty', 'Unemployment']
preds_ext

### subset a few columns 

In [None]:
pollution = ['CES 3.0 Score', 'Ozone',
       'PM2.5','Diesel PM', 
       'Drinking Water', 'Pesticides',
       'Tox. Release',  'Traffic',
        'Cleanup Sites',
       'Groundwater Threats', 'Haz. Waste',
        'Imp. Water Bodies',
       'Solid Waste',  'Pollution Burden',
       'Pollution Burden Score']
illness = ['Asthma', 'Low Birth Weight', 'Cardiovascular Disease']

## now some merging to pull data together with the shapefile geometry using `join`

### start out by only keeping 'geometry' column  from gdf and bringing in predictors

In [None]:
gdf = gdf.join(df_demog[preds])[preds + ['geometry']]
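
For readers new to `join`, here is a minimal sketch of what it does: rows are matched on the index, and by default every row of the left frame is kept. The frames and values below are made up, with strings standing in for the geometry objects.

```python
import pandas as pd

# Made-up frames sharing a tract index; strings stand in for geometries
left = pd.DataFrame(
    {"geometry": ["poly_a", "poly_b", "poly_c"]},
    index=pd.Index([1001, 1002, 1003], name="tract"),
)
right = pd.DataFrame(
    {"Poverty": [10.0, 25.5], "Asthma": [40.1, 55.2]},
    index=pd.Index([1001, 1003], name="tract"),
)

# Select the wanted columns first, then join on the shared index.
# Tract 1002 has no match in `right`, so its Poverty value is NaN.
joined = left.join(right[["Poverty"]])
print(joined)
```

This is why the index check matters: a tract missing from the spreadsheet would not raise an error here, it would silently produce missing values.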


### what did that leave us with?

In [None]:
gdf.head()

## let's bring in the other indicators we called `preds_ext`

In [None]:
gdf = gdf.join(df_pollution[preds_ext])

In [None]:
gdf.columns

###  we can bring in the indicators of pollution

In [None]:
gdf = gdf.join(df_pollution[pollution])

In [None]:
gdf.columns.values

### finally we bring in the illnesses

In [None]:
gdf = gdf.join(df_pollution[illness])

In [None]:
gdf.columns.values

## Now, what if we want to focus in on a single county or a subset of counties - we can do that with geopandas!

### let's check out a single county only

### first, how do we select just one county from the `ca_co` geodataframe? We use `.loc`

In [None]:
ca_co.head()

### set a variable with that single county name (Pick one!)

In [None]:
ca_co['NAME'].unique()

In [None]:
subset_county = 'Santa Clara'

In [None]:
ca_co.loc[ca_co.NAME==subset_county]
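
The expression inside `.loc[...]` is a boolean mask: a Series of True/False values, one per row, and `.loc` keeps the rows where it is True. A minimal sketch with a plain pandas DataFrame, using a made-up four-row table in place of `ca_co`:

```python
import pandas as pd

# Made-up table standing in for the county geodataframe
counties = pd.DataFrame({"NAME": ["Alameda", "Fresno", "Santa Clara", "San Mateo"]})

# One True/False value per row
mask = counties.NAME == "Santa Clara"
print(mask.tolist())

# Keep only the rows where the mask is True
one = counties.loc[mask]

# .isin builds the same kind of mask for a list of names
several = counties.loc[counties.NAME.isin(["Alameda", "San Mateo"])]
print(several)
```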

In [None]:
ca_co.loc[ca_co.NAME==subset_county].plot()

### where is this county anyway?

In [None]:
ax=ca_co.plot()
ca_co.loc[ca_co.NAME==subset_county].plot(ax=ax, color='orange')

In [None]:
gdf_s = gp.overlay(gdf, ca_co.loc[ca_co.NAME==subset_county])

In [None]:
gdf_s.plot()

In [None]:
gdf_s.plot(column='CES 3.0 Score', legend=True)

### we can do basically the same idea but with a group of counties

In [None]:
ca_co['NAME'].unique()

### feel free to copy and paste counties into the `list` below. I'm choosing the Bay Area

In [None]:
subset_counties = ['Alameda', 'San Francisco','San Mateo', 'Santa Clara']

In [None]:
ca_co.loc[ca_co.NAME.isin(subset_counties)].plot()

In [None]:
ax=ca_co.plot()
ca_co.loc[ca_co.NAME.isin(subset_counties)].plot(ax=ax, color='orange')

In [None]:
subset_group=True

In [None]:
if subset_group:
    gdf_s = gp.overlay(gdf, ca_co.loc[ca_co.NAME.isin(subset_counties)])

In [None]:
gdf_s.plot(column='CES 3.0 Score', legend=True)

In [None]:
from EJ_helpers import find_correlations
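
The source of `EJ_helpers` is not shown in this notebook. As a rough illustration of the kind of calculation involved, here is a sketch of Pearson's r computed with numpy, wrapped in a function with the same argument order as the call used below. The names `pearson_r` and `correlations_sketch` are hypothetical and this is not necessarily how `find_correlations` is implemented.

```python
import numpy as np
import pandas as pd

def pearson_r(x, y):
    """Pearson's r between two 1-D arrays, ignoring pairs with a NaN."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    keep = ~(np.isnan(x) | np.isnan(y))
    x, y = x[keep], y[keep]
    xd = x - x.mean()
    yd = y - y.mean()
    return float((xd * yd).sum() / np.sqrt((xd ** 2).sum() * (yd ** 2).sum()))

def correlations_sketch(pred, responses, df):
    """Hypothetical stand-in: r between one predictor and each response."""
    return pd.Series({resp: pearson_r(df[pred], df[resp]) for resp in responses})

# Made-up data: one response rises with the predictor, one falls
toy = pd.DataFrame({
    "pred": [1.0, 2.0, 3.0, 4.0],
    "up": [2.0, 4.0, 6.0, 8.0],
    "down": [8.0, 6.0, 4.0, 2.0],
})
result = correlations_sketch("pred", ["up", "down"], toy)
print(result)
```

Returning a Series indexed by response name means it can be assigned directly to a column of a DataFrame whose index is the list of responses.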

In [None]:
# one column per predictor, one row per pollution indicator
pollutioncorrs = pd.DataFrame(index=pollution)
for pred in preds:
    pollutioncorrs[pred] = find_correlations(pred, pollution, gdf_s)

# one column per predictor, one row per illness
illnesscorrs = pd.DataFrame(index=illness)
for pred in preds:
    illnesscorrs[pred] = find_correlations(pred, illness, gdf_s)

In [None]:
pollutioncorrs.plot.bar(figsize=(10,10), subplots=True, legend=False, grid=True)
plt.tight_layout()

In [None]:
illnesscorrs.plot.bar(figsize=(10,10), subplots=True, legend=False, grid=True)
plt.tight_layout()

In [None]:
from EJ_helpers import plot_relation
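
`plot_relation` also comes from `EJ_helpers`, whose source is not shown here. The sketch below illustrates only the scatter-plus-line-fit part of such a plot, using `np.polyfit` for the line and `np.corrcoef` for Pearson's r, with an optional `PdfPages` object to save into. The function name `scatter_with_fit` and the toy data are hypothetical, and the map panel is left out.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from matplotlib.backends.backend_pdf import PdfPages

def scatter_with_fit(df, pred, resp, outpdf=None):
    """Hypothetical stand-in: scatter plot with a least-squares line and r."""
    data = df[[pred, resp]].dropna()
    x = data[pred].to_numpy(dtype=float)
    y = data[resp].to_numpy(dtype=float)
    slope, intercept = np.polyfit(x, y, 1)  # degree-1 least-squares fit
    r = np.corrcoef(x, y)[0, 1]             # Pearson's r
    fig, ax = plt.subplots()
    ax.scatter(x, y, s=10)
    xs = np.array([x.min(), x.max()])
    ax.plot(xs, slope * xs + intercept, color="orange")
    ax.set_xlabel(pred)
    ax.set_ylabel(resp)
    ax.set_title(f"r = {r:.2f}")
    if outpdf is not None:
        outpdf.savefig(fig)  # adds one page to the open PDF
        plt.close(fig)       # free the figure once it is saved
    return slope, intercept, r

# Made-up data lying exactly on y = 2x + 1
toy = pd.DataFrame({"x": [0.0, 1.0, 2.0, 3.0], "y": [1.0, 3.0, 5.0, 7.0]})
pdf_path = os.path.join(tempfile.mkdtemp(), "sketch.pdf")
with PdfPages(pdf_path) as outpdf:
    slope, intercept, r = scatter_with_fit(toy, "x", "y", outpdf)
```

Each call with an open `PdfPages` object adds one page, which is the mechanism a loop over predictors and responses can rely on to build a multipage file.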

In [None]:
plot_relation(gdf_s, 'African American (%)', 'Asthma')

In [None]:
plot_relation(gdf_s,'White (%)','CES 3.0 Score')

In [None]:
plot_relation(gdf_s,'Asian American (%)','CES 3.0 Score')

## We can make a loop over both predictors and responses and put plots into a multipage PDF

In [None]:
with PdfPages('all_plots.pdf') as outpdf:
    for cresp in ['Asthma', 'CES 3.0 Score']:
        for cpred in preds:
            print(f'plotting {cpred} with {cresp}')
            plot_relation(gdf_s, cpred, cresp, outpdf)
            