## Align, combine, subset EPA radon, uranium data and US Census GIS data

### Notebook setup

In [1]:
# import all libraries used in this notebook
import os
import numpy as np
import pandas as pd
# GIS libs
import geopandas as gpd
import libpysal as sa
# plotting libs
import matplotlib.pyplot as plt
import splot as splt
import plotnine as p9
get_ipython().run_line_magic('matplotlib', 'inline')



### State/EPA Residential Radon Survey (SRRS) datasets

The raw radon data was collected and archived by Phil Price and is available here:
http://www.stat.columbia.edu/~gelman/arm/examples/radon_complete


* The documentation is in file http://www.stat.columbia.edu/~gelman/arm/examples/radon_complete/SRRSdoc.pdf

* There are 5 files, srrs1.dat through srrs5.dat, but data is duplicated between them.

* This directory also contains data from both the national survey (NRRS) and the state surveys; cf. https://link.springer.com/article/10.1007/BF02034901. The national survey data is in a different format and is not used in the Gelman and Hill analysis.

* The README notes that the files are old backups, so some things may be missing.

The combined de-duplicated SRRS dataset is in file  [srrs_all.csv](data/srrs_all.csv)
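
Since the same records appear in more than one of the raw files, combining them requires a de-duplication step. The following is only a minimal sketch of that idea on toy data; the column names `idnum` and `activity` are placeholders chosen for illustration, and this is not necessarily how `srrs_all.csv` was produced.

```python
import pandas as pd

# Toy stand-ins for two overlapping raw files (hypothetical columns and values).
part1 = pd.DataFrame({'idnum': [1, 2, 3], 'activity': [0.9, 1.1, 1.0]})
part2 = pd.DataFrame({'idnum': [3, 4], 'activity': [1.0, 2.2]})

# Stack the parts, then keep only the first copy of each fully identical row.
combined = pd.concat([part1, part2], ignore_index=True)
deduped = combined.drop_duplicates().reset_index(drop=True)

print(len(combined), len(deduped))
```

`drop_duplicates()` with no arguments only removes rows that match on every column, so records that differ in any field are kept.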

*State counties and tribal lands*

The SRRS dataset contains observations taken on Indian lands.
The county-level information for these entries doesn't line up with US FIPS data:
neither the names nor the county codes match.
Entries from Indian lands have one of the 'STATE' column codes R5, R6, R7, RB, RC, RN.
These regions cross state boundaries; for example,
EPA Region 5 covers Indian lands in MN, WI, and MI:
https://www.epa.gov/sites/default/files/2015-08/documents/r5-tribal-land-map.pdf.

Data from state counties is in file [radon_all_states.csv](data/radon_all_states.csv).

Data from Indian lands is in file [radon_indg_lands.csv](data/radon_indg.csv).
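
One way to split the combined table into these two files is to test the 'STATE' column against the region codes listed above. This is a sketch on made-up rows, assuming the codes are stored exactly as written (no padding or lowercase variants).

```python
import pandas as pd

# Region codes used for Indian lands, as listed in the text above.
TRIBAL_CODES = ['R5', 'R6', 'R7', 'RB', 'RC', 'RN']

# Toy rows for illustration only; the values are made up.
srrs = pd.DataFrame({'STATE': ['MN', 'R5', 'WI', 'RN'],
                     'activity': [2.2, 1.4, 0.7, 3.1]})

# Boolean mask: True where the row comes from one of the region codes.
is_tribal = srrs['STATE'].isin(TRIBAL_CODES)
tribal = srrs[is_tribal].reset_index(drop=True)
states = srrs[~is_tribal].reset_index(drop=True)

print(tribal)
print(states)
```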


### US census county boundaries GIS files

The US Census provides shapefiles for the US, including Alaska, Hawaii, and territories.  We can use these to visualize radon and uranium levels.

In [2]:
shpfile = os.path.join('geo_data','cb_2018_us_county_20m', 'cb_2018_us_county_20m.shp')
us_geodata = gpd.read_file(shpfile)
# GEOID should be numeric
us_geodata = us_geodata.astype({'GEOID': 'int32'}, copy=False)
print(us_geodata.shape[0])
us_geodata.head(3)

3220


Unnamed: 0,STATEFP,COUNTYFP,COUNTYNS,AFFGEOID,GEOID,NAME,LSAD,ALAND,AWATER,geometry
0,37,17,1026336,0500000US37017,37017,Bladen,6,2265887723,33010866,"POLYGON ((-78.90200 34.83527, -78.79960 34.850..."
1,37,167,1025844,0500000US37167,37167,Stanly,6,1023370459,25242751,"POLYGON ((-80.49737 35.20210, -80.29542 35.502..."
2,39,153,1074088,0500000US39153,39153,Summit,6,1069181981,18958267,"POLYGON ((-81.68699 41.13596, -81.68495 41.277..."


### EPA/State Residential Radon Data

In [3]:
us_radon = pd.read_csv(os.path.join('data','radon_all_counties.csv'),
                     usecols=['state', 'stfips', 'floor', 'activity', 'cntyfips'],
                     skipinitialspace=True,    # CSV file has spaces after delimiter, ignore them
    ).convert_dtypes()
print(us_radon.shape)
us_radon.head(3)

(59395, 5)


Unnamed: 0,state,stfips,floor,activity,cntyfips
0,AK,2,0,0.9,20
1,AK,2,0,1.1,20
2,AK,2,0,1.0,20


**Data cleanup**

Colorado and Connecticut have rows with cntyfips codes '0' and '999', which do not identify a specific county. We drop these rows for now.

In [4]:
us_radon.drop(us_radon[us_radon.cntyfips==0].index, inplace=True)
us_radon.drop(us_radon[us_radon.cntyfips==999].index, inplace=True)
print(us_radon.shape)

(57792, 5)
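
The same filtering can be written as a single boolean mask instead of two in-place drops. This sketch uses toy rows; the constant name `PLACEHOLDER_CODES` is introduced here for illustration only.

```python
import pandas as pd

# Toy rows; only the cntyfips values matter for this example.
radon = pd.DataFrame({'state': ['CO', 'CO', 'CT', 'MN'],
                      'cntyfips': [0, 1, 999, 1]})

# County codes that do not identify a specific county (hypothetical name).
PLACEHOLDER_CODES = [0, 999]

# Keep every row whose county code is not one of the placeholder codes.
cleaned = radon[~radon['cntyfips'].isin(PLACEHOLDER_CODES)].reset_index(drop=True)

print(cleaned.shape)
```

Building a new frame this way avoids mutating `radon`, which makes the cell safe to re-run.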


### US county soil uranium levels

This dataset is also distributed from the Gelman website.

In [5]:
us_uranium = pd.read_csv(os.path.join('data','raw_uranium.csv'),
                        usecols=['st', 'stfips', 'ctfips', 'Uppm'],
                        skipinitialspace=True,
                        ).drop_duplicates().convert_dtypes()
print(us_uranium.shape[0])
us_uranium.head(3)

3111


Unnamed: 0,stfips,ctfips,st,Uppm
0,1,1,AL,1.78331
1,1,3,AL,1.38323
2,1,5,AL,2.10105


### Join and merge tables using US FIPS codes

To join or merge tables, we need to create a common key in both, then
use the [DataFrame.merge](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html) method.


We have three datasets: SRRS survey data, soil uranium measurements, and geodata.
The files use different capitalization and punctuation for county names.
Therefore we rely on
[FIPS codes](https://transition.fcc.gov/oet/info/maps/census/fips/fips.txt),
which uniquely identify geographic areas.
The US Census datasets have a "GEOID" code: the first 2 digits are the state FIPS code and the last 3 are the county-level FIPS code.
The other datasets have separate columns for state and county codes.
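
The relationship between the separate state and county codes and the combined key can be sketched with a few helper functions. The function names are hypothetical and exist only for this illustration.

```python
def make_fips(state_code: int, county_code: int) -> int:
    """Combine a state code and a 3-digit county code into one integer key."""
    return state_code * 1000 + county_code


def split_fips(fips: int) -> tuple[int, int]:
    """Recover (state_code, county_code) from the combined integer key."""
    return divmod(fips, 1000)


def to_geoid(fips: int) -> str:
    """Format the key as a zero-padded 5-character string, as in the GEOID column."""
    return f"{fips:05d}"


print(make_fips(27, 1))    # Aitkin County, MN
print(split_fips(27001))
print(to_geoid(1001))      # Autauga County, AL
```

Converting GEOID to an integer drops the leading zero for states with codes below 10, which is why the integer key for Autauga is 1001 rather than 01001.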

In [6]:
# create merge column
us_uranium['FIPS'] = us_uranium.stfips*1000 + us_uranium.ctfips
us_radon['FIPS'] = us_radon.stfips*1000 + us_radon.cntyfips

### County level information:   uranium, number of homes in radon survey, census county name

We create a new table which contains county-level information from across the three datasets.

In [7]:
us_counties = us_uranium.merge(us_geodata[['GEOID', 'NAME']],
                               how='inner', left_on='FIPS', right_on='GEOID')

homes = us_radon.value_counts(subset=['FIPS'], sort=False).to_frame().reset_index()
homes.rename(columns={0:'homes'}, inplace=True)

us_counties = us_counties.merge(homes, how='left', on='FIPS')
us_counties.fillna(0, inplace=True)

us_counties.drop(columns=['stfips', 'ctfips', 'GEOID'], inplace=True)
us_counties.rename(columns={'st': 'state', 'NAME':'county', 'Uppm':'uranium'}, inplace=True)

print(us_counties.shape[0])
us_counties.head(3)

3105


Unnamed: 0,state,uranium,FIPS,county,count
0,AL,1.78331,1001,Autauga,9.0
1,AL,1.38323,1003,Baldwin,31.0
2,AL,2.10105,1005,Barbour,9.0
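
The uranium table has 3111 rows and the merged table has 3105, so the inner join dropped a few rows, presumably those with no matching GEOID. One way to inspect such rows is a left merge with `indicator=True`. The sketch below uses toy data (the FIPS value 99999 is made up) and also counts homes per county with an explicitly named column. The column name produced by `value_counts().to_frame()` appears to depend on the pandas version (the output above shows `count`), so naming it explicitly is more predictable.

```python
import pandas as pd

# Toy tables; 99999 is a made-up key with no match in the geodata.
uranium = pd.DataFrame({'FIPS': [1001, 1003, 99999], 'Uppm': [1.78, 1.38, 2.0]})
geo = pd.DataFrame({'GEOID': [1001, 1003, 1005],
                    'NAME': ['Autauga', 'Baldwin', 'Barbour']})
radon = pd.DataFrame({'FIPS': [1001, 1001, 1003]})

# A left merge with indicator=True adds a '_merge' column that flags unmatched rows.
check = uranium.merge(geo, how='left', left_on='FIPS', right_on='GEOID', indicator=True)
unmatched = check.loc[check['_merge'] == 'left_only', 'FIPS'].tolist()

# Count survey homes per county, naming the count column explicitly.
homes = radon.groupby('FIPS').size().rename('homes').reset_index()
counties = uranium.merge(homes, how='left', on='FIPS')
counties['homes'] = counties['homes'].fillna(0).astype(int)

print(unmatched)
print(counties)
```

Filling only the `homes` column, rather than the whole frame, leaves missing values in other columns visible.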


In [8]:
us_counties[us_counties.state=='MN'].shape

(87, 5)

#### Put data on log scale

Following Gelman and Hill chapter 4, section 4, we work with data on the log scale,
for two reasons:

+ the outcome variable radon is always positive, so its logarithm is well defined.
+ it provides modeling flexibility.

We know from geology that both radon measurements and soil uranium levels are always greater than zero;
however, a few radon measurements in the EPA dataset are 0.
In order to work with these measurements on the log scale, we replace any value below 0.1 (including 0) with 0.1,
which corresponds to a low radon level (following Gelman and Hill).

In [9]:
us_radon['radon'] = us_radon.activity.apply(lambda x: x if x > 0.1 else 0.1)
us_radon['log_radon'] = np.log(us_radon['radon'])
us_radon.drop(columns=['activity', 'stfips', 'cntyfips'], inplace=True)
us_radon.head(3)

Unnamed: 0,state,floor,FIPS,radon,log_radon
0,AK,0,2020,0.9,-0.105361
1,AK,0,2020,1.1,0.09531
2,AK,0,2020,1.0,0.0
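
The flooring step can also be written without `apply`, using `Series.clip`. This is a sketch on toy values, not a replacement for the cell above. One difference to be aware of: the lambda maps a missing value to 0.1 (because `NaN > 0.1` is false), whereas `clip` leaves missing values missing.

```python
import numpy as np
import pandas as pd

# Toy activity values, including a zero and a value below the floor.
activity = pd.Series([0.0, 0.05, 0.9, 1.1])

# Raise everything below 0.1 to 0.1, then take the natural log.
radon = activity.clip(lower=0.1)
log_radon = np.log(radon)

print(radon.tolist())
print(log_radon.round(6).tolist())
```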


In [10]:
us_counties.uranium.fillna(0.1, inplace=True)
us_counties['u'] = us_counties.uranium.apply(lambda x: x if x > 0.1 else 0.1)
us_counties['log_uranium'] = np.log(us_counties['u'])
us_counties.drop(columns=['u'], inplace=True)
us_counties.head(3)

Unnamed: 0,state,uranium,FIPS,county,count,log_uranium
0,AL,1.78331,1001,Autauga,9.0,0.578471
1,AL,1.38323,1003,Baldwin,31.0,0.324421
2,AL,2.10105,1005,Barbour,9.0,0.742437


### Restrict dataset to Minnesota

In order to work with just the data from Minnesota, we
use a conditional expression to [filter specific rows of a dataframe](https://pandas.pydata.org/docs/getting_started/intro_tutorials/03_subset_data.html#how-do-i-filter-specific-rows-from-a-dataframe), combined with the operation [reset_index(drop=True)](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.reset_index.html?highlight=reset_index#pandas.DataFrame.reset_index) so that the rows are indexed starting from 0.

In [11]:
mn_radon = us_radon[us_radon['state']=='MN'].reset_index(drop=True)
mn_radon.drop(columns=['state'], inplace=True)
mn_radon = mn_radon.merge(us_counties[['FIPS', 'county']], on='FIPS')
mn_radon = mn_radon.sort_values(by='county', axis=0).reset_index(drop=True)
mn_radon.head(3)

Unnamed: 0,floor,FIPS,radon,log_radon,county
0,0,27001,1.0,0.0,Aitkin
1,0,27001,2.2,0.788457,Aitkin
2,0,27001,2.9,1.064711,Aitkin
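
To see why `reset_index(drop=True)` is needed after filtering, compare the index before and after on a toy frame (values made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({'state': ['AK', 'MN', 'WI', 'MN'],
                   'radon': [0.9, 2.2, 1.4, 2.9]})

# Filtering keeps the original row labels of the selected rows.
mn = df[df['state'] == 'MN']
print(mn.index.tolist())

# reset_index(drop=True) relabels the rows 0..n-1 and discards the old labels.
mn_reset = mn.reset_index(drop=True)
print(mn_reset.index.tolist())
```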


In [12]:
mn_counties = us_counties[us_counties['state']=='MN'].reset_index(drop=True)
mn_counties.drop(columns=['state'], inplace=True)
mn_counties.head(3)

Unnamed: 0,uranium,FIPS,county,count,log_uranium
0,0.502054,27001,Aitkin,4.0,-0.689048
1,0.428565,27003,Anoka,52.0,-0.847313
2,0.892741,27005,Becker,3.0,-0.113459


#### Unique county ids

In [13]:
# 1-based county ids; the index was already reset to 0..N-1 in the previous cell
mn_counties['county_id'] = mn_counties.index + 1
mn_counties.head(3)

Unnamed: 0,uranium,FIPS,county,count,log_uranium,county_id
0,0.502054,27001,Aitkin,4.0,-0.689048,1
1,0.428565,27003,Anoka,52.0,-0.847313,2
2,0.892741,27005,Becker,3.0,-0.113459,3


Add county ids to radon data as well.

In [14]:
mn_radon = mn_radon.merge(mn_counties[['FIPS', 'county_id']], on='FIPS')
mn_radon.head(3)

Unnamed: 0,floor,FIPS,radon,log_radon,county,county_id
0,0,27001,1.0,0.0,Aitkin,1
1,0,27001,2.2,0.788457,Aitkin,1
2,0,27001,2.9,1.064711,Aitkin,1
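
The two steps above (assign 1-based ids to counties, then attach them to each observation) can be sketched on toy data as follows. The ids start at 1 presumably because the data is later passed to Stan, which indexes from 1. The `validate='many_to_one'` argument is an optional safeguard added in this sketch; it raises an error if a FIPS code appears more than once in the county table.

```python
import pandas as pd

# Toy county table, already sorted by county name.
counties = pd.DataFrame({'FIPS': [27001, 27003, 27005],
                         'county': ['Aitkin', 'Anoka', 'Becker']})
counties['county_id'] = range(1, len(counties) + 1)

# Toy observations; several rows may refer to the same county.
radon = pd.DataFrame({'FIPS': [27003, 27001, 27003],
                      'radon': [1.0, 2.2, 2.9]})

# Attach the county id to each observation.
radon = radon.merge(counties[['FIPS', 'county_id']], on='FIPS',
                    how='left', validate='many_to_one')

print(radon)
```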


**Save as CSV files**

These files are already distributed with this notebook, so the calls to the [pandas.to_csv](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_csv.html?highlight=to_csv#pandas.DataFrame.to_csv) method have been commented out.

In [15]:
# uncomment as needed
# mn_radon.to_csv(os.path.join('data', 'mn_radon.csv'), index=False)
# mn_counties.to_csv(os.path.join('data', 'mn_counties.csv'), index=False)

# us_radon.to_csv(os.path.join('data', 'us_radon.csv'), index=False)
# us_counties.to_csv(os.path.join('data', 'us_counties.csv'), index=False)

### Add GeoSpatial Information for US Counties

If we want to build a spatial model which allows for local pooling of information between nearby counties, we need to create a neighbor graph over all the counties in Minnesota.

**GIS Data**

Geographic information systems (GIS) data is any item which has a geographic location, either a single point or a set of bounding polygons.  In order to manage, analyze, and visualize GIS data, we use specialized packages which can do the geographic math.  In this notebook we use the following packages:

- GeoPandas - manages a set of GIS records in tabular format
- libpysal - spatial analysis package which can analyze distance between locations

Cartographic data (maps) are encoded as a set of records, one per map region.  The [shapefile format](https://en.wikipedia.org/wiki/Shapefile) is an open specification used to ensure interoperability among GIS software packages.  When items in a dataset contain location labels, it is necessary to obtain a set of shapefiles for the corresponding map.

The shapefiles for US counties are available from the [US Census Bureau](https://www.census.gov/geographies/mapping-files/time-series/geo/carto-boundary-file.html).
For this analysis, we use the shapefiles with the lowest available resolution (1:20,000,000, the `20m` files); this greatly speeds up analysis and plotting.
These can be downloaded via URL: https://www2.census.gov/geo/tiger/GENZ2018/shp/cb_2018_us_county_20m.zip

In [16]:
shpfile = os.path.join('geo_data','cb_2018_us_county_20m', 'cb_2018_us_county_20m.shp')
us_geodata = gpd.read_file(shpfile).convert_dtypes()
us_geodata = us_geodata.astype({'GEOID': 'int32'}, copy=False)
us_geodata.drop(columns=['COUNTYFP', 'COUNTYNS', 'AFFGEOID', 'NAME', 'LSAD', 'ALAND', 'AWATER'], inplace=True)
us_geodata.info()

<class 'geopandas.geodataframe.GeoDataFrame'>
RangeIndex: 3220 entries, 0 to 3219
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype   
---  ------    --------------  -----   
 0   STATEFP   3220 non-null   string  
 1   GEOID     3220 non-null   int32   
 2   geometry  3220 non-null   geometry
dtypes: geometry(1), int32(1), string(1)
memory usage: 63.0 KB


For this analysis, we restrict our attention to the counties in Minnesota.

In [17]:
# get MN subset, add county-level information
mn_geodata = us_geodata[us_geodata['STATEFP']=='27'].copy()
mn_geodata = mn_geodata.merge(mn_counties[['FIPS', 'county', 'county_id', 'log_uranium']],
                               how='inner', left_on='GEOID', right_on='FIPS')
mn_geodata.drop(columns=['STATEFP', 'GEOID'], inplace=True)
mn_geodata = mn_geodata.sort_values(by='county_id', axis=0).reset_index(drop=True)
mn_geodata.head(3)

Unnamed: 0,geometry,FIPS,county,county_id,log_uranium
0,"POLYGON ((-93.81041 46.25095, -93.77790 46.589...",27001,Aitkin,1,-0.689048
1,"POLYGON ((-93.51007 45.41480, -93.01956 45.411...",27003,Anoka,2,-0.847313
2,"POLYGON ((-96.19467 47.15115, -96.06707 47.151...",27005,Becker,3,-0.113459


To check our work, we plot the counties in Minnesota, colored by log uranium level and labeled by county name.

In [18]:
county_pts = mn_geodata.geometry.centroid.to_crs(mn_geodata.crs)
(p9.ggplot()
 + p9.geom_map(data=mn_geodata, mapping=p9.aes(fill='log_uranium'), alpha=0.7)
 + p9.geom_text(mapping=p9.aes(x=county_pts.x, y=county_pts.y, label=list(mn_geodata.county)),
                size=6, ha='center')
 + p9.theme(figure_size=(10, 8))
 + p9.scale_fill_cmap(cmap_name='viridis')
)




AttributeError: 'Figure' object has no attribute 'set_layout_engine'

We need to find the set of pairs of adjacent counties; to do this we use `libpysal`, which parses the bounding polygons to find all counties that share a border of non-zero length ("Rook" contiguity).  To check our work, we plot the resulting neighbor graph.

In [None]:
# neighbors are counties which have a common line boundary
from splot.libpysal import plot_spatial_weights
from libpysal.weights.contiguity import Rook
mn_nbs = Rook(mn_geodata['geometry'])
plot_spatial_weights(mn_nbs, mn_geodata)
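
To illustrate what Rook contiguity means, here is a deliberately simplified pure-Python sketch on a 2x2 grid of unit squares. It treats two polygons as neighbors when they share an edge with identical endpoints. This is not how `libpysal` is implemented and it would miss borders that overlap only partially; the function names are hypothetical.

```python
from itertools import combinations


def edges(polygon):
    """Return the set of undirected edges of a polygon given as a list of vertices."""
    n = len(polygon)
    return {frozenset((polygon[i], polygon[(i + 1) % n])) for i in range(n)}


def rook_neighbors(polygons):
    """Map each polygon name to the names of polygons sharing at least one edge."""
    nbs = {name: set() for name in polygons}
    for a, b in combinations(polygons, 2):
        if edges(polygons[a]) & edges(polygons[b]):
            nbs[a].add(b)
            nbs[b].add(a)
    return nbs


# 2x2 grid: A and D touch only at the corner point (1, 1), as do B and C.
squares = {
    'A': [(0, 0), (1, 0), (1, 1), (0, 1)],
    'B': [(1, 0), (2, 0), (2, 1), (1, 1)],
    'C': [(0, 1), (1, 1), (1, 2), (0, 2)],
    'D': [(1, 1), (2, 1), (2, 2), (1, 2)],
}

print(rook_neighbors(squares))
```

Squares that meet only at a corner are not Rook neighbors; a "Queen" criterion would count them as adjacent.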

As a sanity check, we get the edge list labeled by county name and check that Cook County, at the northeastern tip of the state, is adjacent to Lake County.

In [None]:
mn_nbs_names = Rook(mn_geodata['geometry'], ids=mn_geodata['county'].tolist())
mn_nbs_edges_names =  mn_nbs_names.to_adjlist(remove_symmetric=True).reset_index(drop=True)
mn_nbs_edges_names[mn_nbs_edges_names['focal']=='Cook']

We extract the neighbor relationships as an edge list and write it to a JSON file.

In [None]:
# one row per undirected edge: remove_symmetric=True keeps only one of (i, j) and (j, i)
mn_nbs_edges = mn_nbs.to_adjlist(remove_symmetric=True).reset_index(drop=True)
mn_nbs_edges.head(3)

# libpysal ids are 0-based row positions; add 1 so they match county_id
node1 = (mn_nbs_edges['focal'] + 1).tolist()
node2 = (mn_nbs_edges['neighbor'] + 1).tolist()

mn_nbs_dict = { 'node1' : node1, 'node2' : node2, 'J_edges' : len(node1) }

from cmdstanpy import write_stan_json
write_stan_json("mn_nbs.json", mn_nbs_dict)
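
The structure of the resulting file can be sketched with the standard library alone. The `neighbors` dictionary below is a made-up 4-node graph, and `json.dumps` stands in for `write_stan_json`. The keys `node1`, `node2`, and `J_edges` follow the cell above.

```python
import json

# Hypothetical symmetric neighbor lists with 0-based ids.
neighbors = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2]}

# Keep each undirected edge once by requiring i < j.
pairs = sorted((i, j) for i, nbs in neighbors.items() for j in nbs if i < j)

# Shift to 1-based ids, matching the county_id convention.
data = {
    'node1': [i + 1 for i, _ in pairs],
    'node2': [j + 1 for _, j in pairs],
    'J_edges': len(pairs),
}

text = json.dumps(data)
print(text)
```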