# Spatial merge census and precinct data

This notebook joins census data to voting precincts.

The spatial unit of analysis is the precinct, so the aim is to attach census values to each precinct. The problem is that precinct and block group (bg) boundaries don't match up.

So, census values for each precinct are calculated this way:

For each precinct, the variable value is a weighted average of the values of the block groups that the precinct overlaps, weighted by the share of the precinct's area that falls in each block group. For a precinct A that overlaps block groups 1 and 2:

 x_A =  p_A1 \* x_1 + p_A2 \* x_2
 
 where
 
 x_A = variable x for precinct A
 
 x_1, x_2 = variable x for block groups 1 and 2
 
 p_A1 = proportion of precinct A's area that is in block group 1
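
As a quick check on the formula, here is a minimal sketch in plain Python. The numbers are made up for illustration and `area_weighted_value` is a hypothetical helper, not a function from `spatial_processing_functions`.

```python
# Hypothetical numbers for illustration only; they are not from the census data.
# Precinct A overlaps two block groups.
overlaps = [
    # (block group, share of precinct A's area in that block group, value of x)
    ("bg1", 0.75, 40000.0),
    ("bg2", 0.25, 80000.0),
]


def area_weighted_value(overlaps):
    """Return sum(p_Ai * x_i) over the block groups a precinct overlaps."""
    return sum(p * x for _, p, x in overlaps)


x_A = area_weighted_value(overlaps)
print(x_A)  # 0.75 * 40000 + 0.25 * 80000 = 50000.0
```

The shares for one precinct should sum to about 1.0. If they don't, the result is no longer a proper average, which is why the sums are checked later in the notebook.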
 

In [1]:
%matplotlib inline

from geopandas import GeoDataFrame, read_file
from geopandas.tools import overlay
import pandas as pd
import spatial_processing_functions as spf
#import importlib
#importlib.reload(spf)


San Francisco (SF) voting precincts: boundaries are updated every ten years and become active two years after the census.
We have the 1992, 2002, and 2012 boundaries.
years = ['1990','2000','2010','2009','2014']

- 1990 census data -> 1992 precincts + 1990 bgs (missing)
- 2000, 2009 census data -> 2002 precincts + 2000 bgs
- 2010, 2014 census data -> 2012 precincts + 2010 bgs

## Step 1: load precinct and census geography shapefiles

 We'll need the following combinations of censusXprecinct:
 
- ce2000pre1992, ce2000pre2002, ce2007pre2002, ce2012pre2012  <- for census data
 
- 'bg2000pre1992', 'bg2000pre2002', 'bg2010pre2012'  <- for block groups (since ce2007 data uses 2000 bg boundaries)
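
These key names pack two years into fixed character positions, which is what the slices `yr_key[2:6]` and `yr_key[9:13]` in the code rely on. A small sketch of that convention, where `split_key` is a hypothetical helper written only for this illustration:

```python
def split_key(key):
    """Split a key such as 'bg2000pre1992' into (prefix, data year, precinct year)."""
    prefix = key[:2]           # 'bg' (block groups) or 'ce' (census data)
    data_year = key[2:6]       # four digits right after the prefix
    precinct_year = key[9:13]  # four digits right after 'pre'
    return prefix, data_year, precinct_year


for key in ['bg2000pre1992', 'ce2007pre2002', 'bg2010pre2012']:
    print(key, split_key(key))
```

The slicing only works while every key keeps the same two-letter prefix and four-digit years, so any new key needs to follow the same pattern.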

In [6]:


bgXprec = dict.fromkeys(['bg2000pre1992', 'bg2000pre2002', 'bg2010pre2012'])
for yr_key in bgXprec.keys():
    bgs = spf.load_bg_shp(yr_key[2:6])
    precincts = spf.load_prec_shp(yr_key[9:13])
    precincts = spf.reproject_prec(precincts)
    bgXprec[yr_key] = spf.merge_precinct_bg(precincts,bgs,yr_key)
    

#yr_key ='bg2010pre2012'
#bgs = load_bg_shp(yr_key[2:6])
#precincts = load_prec_shp(yr_key[9:13])
#bgXprec[yr_key] = merge_precinct_bg(precincts,bgs,yr_key)

omitted 1 row(s) with missing geometry
Year 1992: total 712 precincts
working on intersection for year bg2000pre1992

Ring Self-intersection at or near point 550048.52970911062 4183089.8677463285
Ring Self-intersection at or near point 552255.16319397336 4178110.7456194456
Ring Self-intersection at or near point 544442.71665843052 4181270.9088203469
Ring Self-intersection at or near point 544053.45127179928 4178214.2207001671



cols not present
Index(['precname', 'area_m', 'ALAND00', 'AWATER00', 'BKGPIDFP00', 'BLKGRPCE00',
       'COUNTYFP00', 'FUNCSTAT00', 'INTPTLAT00', 'INTPTLON00', 'MTFCC00',
       'NAMELSAD00', 'STATEFP00', 'TRACTCE00', 'geoid', 'area_m_2', 'geometry',
       'intersect_area'],
      dtype='object')
New df has 712 precincts
omitted 0 row(s) with missing geometry
Year 2002: total 586 precincts
working on intersection for year bg2000pre2002

Ring Self-intersection at or near point 551953.0842455877 4184833.1744046635



cols not present
Index(['precname', 'sequenceid', 'area_m', 'ALAND00', 'AWATER00', 'BKGPIDFP00',
       'BLKGRPCE00', 'COUNTYFP00', 'FUNCSTAT00', 'INTPTLAT00', 'INTPTLON00',
       'MTFCC00', 'NAMELSAD00', 'STATEFP00', 'TRACTCE00', 'geoid', 'area_m_2',
       'geometry', 'intersect_area'],
      dtype='object')
New df has 586 precincts
omitted 0 row(s) with missing geometry
Year 2012: total 604 precincts
working on intersection for year bg2010pre2012

TopologyException: found non-noded intersection between LINESTRING (553173 4.17682e+06, 553173 4.17682e+06) and LINESTRING (553173 4.17682e+06, 553173 4.17682e+06) at 553173.17171874153 4176823.9786180397
ERROR:shapely.geos:TopologyException: found non-noded intersection between LINESTRING (553173 4.17682e+06, 553173 4.17682e+06) and LINESTRING (553173 4.17682e+06, 553173 4.17682e+06) at 553173.17171874153 4176823.9786180397



Index(['precname', 'area_m', 'ALAND10', 'AWATER10', 'GEOID10', 'TRACTCE10',
       'geoid', 'area_m_2', 'geometry', 'intersect_area'],
      dtype='object')
New df has 604 precincts


In [7]:
bgXprec.keys()

dict_keys(['bg2000pre1992', 'bg2000pre2002', 'bg2010pre2012'])

## Step 2: merge with census data

In [8]:
# We'll need the following combinations of censusXprecinct
#ce2000pre1992, ce2000pre2002, ce2007pre2002, ce2012pre2012  <- for census data
#'bg2000pre1992', 'bg2000pre2002', 'bg2010pre2012'  <- for block groups


# Map each census data year to the block group boundary year it uses.
# (We don't actually need the 1990 data.)
census2bg_year = {'1990': '1990', '2000': '2000', '2010': '2010', '2007': '2000', '2012': '2010'}

# Map each census/precinct combination to the block group/precinct overlay it uses.
ce2bgXpre = {
    'ce2000pre1992': 'bg2000pre1992',
    'ce2000pre2002': 'bg2000pre2002',
    'ce2007pre2002': 'bg2000pre2002',
    'ce2012pre2012': 'bg2010pre2012',
}

In [9]:

# load census data, for each year. Then merge with the appropriate bg/precinct file. 

census_data_by_precinct = dict.fromkeys(['ce2000pre1992', 'ce2000pre2002', 'ce2007pre2002', 'ce2012pre2012'])
for yr_key in census_data_by_precinct.keys():
    print('\n',yr_key)
    census_yr = yr_key[2:6]
    census_df = spf.load_census_data(census_yr)
    
    #lookup correct bgXprec dataframe to use.
    bg_key = ce2bgXpre[yr_key]
    
    # now merge. 
    print('{} precincts before'.format(len(bgXprec[bg_key].precname.unique())))
    df_merged = pd.merge(bgXprec[bg_key], census_df, on = 'geoid')
    print('{} precincts after'.format(len(df_merged.precname.unique())))
    
    vars_to_use = spf.get_vars_to_use()
    cols_to_keep = vars_to_use + ['precname','area_m','intersect_area','geoid']
    df_merged = df_merged[cols_to_keep]
    df_merged_calc = spf.calc_variables(df_merged, vars_to_use) # leave off geo columns, obviously
    
    # aggregate back to precinct level. 
    df_new = spf.agg_vars_by_prec(df_merged_calc)
    
    # clean up by dropping unweighted and other unneeded columns
    df_new.drop(vars_to_use, axis=1, inplace=True)
    df_new.drop(['intersect_area','prop_area'], axis=1, inplace=True)
    df_new = spf.rename_wgt_cols(df_new, vars_to_use)
    
    # store data frame in a dictionary
    census_data_by_precinct[yr_key] = df_new
    # also save as csv. 
    spf.save_census_data(df_new,yr_key)

    



 ce2000pre2002
586 precincts before
586 precincts after
Sum >1.1 or <.97:
 precname
2009    0.968559
Name: prop_area, dtype: float64
saved as census_by_precinct_ce2000pre2002.csv

 ce2007pre2002
586 precincts before
586 precincts after
Sum >1.1 or <.97:
 precname
2009    0.968559
Name: prop_area, dtype: float64
saved as census_by_precinct_ce2007pre2002.csv

 ce2012pre2012
604 precincts before
604 precincts after
Sum >1.1 or <.97:
 Series([], Name: prop_area, dtype: float64)
saved as census_by_precinct_ce2012pre2012.csv

 ce2000pre1992
712 precincts before
712 precincts after
Sum >1.1 or <.97:
 precname
2002    0.935645
2005    0.914213
2014    0.936640
2059    0.953853
2816    0.960311
Name: prop_area, dtype: float64
saved as census_by_precinct_ce2000pre1992.csv


Let's check the precincts whose area proportions don't sum to 1.0, since something may be wrong.

Precinct 2009 (2002 boundaries) is on the SF border and has 931 registered voters.
The 5 flagged precincts in the 1992 boundaries are all on the southern SF border.

The precincts on the border probably don't sum to 1.0 because their boundaries differ slightly from the census shapefiles.
The sums are close enough to 1.0 that I don't think it's a problem.
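
The "Sum >1.1 or <.97" lines in the output suggest a tolerance check on the summed proportions. I haven't reproduced the actual `spf` code here. The sketch below only assumes those two thresholds, and `flag_bad_sums` is a hypothetical helper.

```python
# Summed area proportions per precinct. The first three values are copied from the
# ce2000pre1992 output above; precinct '9999' is hypothetical and shows a passing case.
prop_area_sums = {
    '2002': 0.935645,
    '2005': 0.914213,
    '2014': 0.936640,
    '9999': 1.0,
}


def flag_bad_sums(sums, low=0.97, high=1.1):
    """Return the precincts whose summed proportion falls outside [low, high]."""
    return {prec: total for prec, total in sums.items() if total < low or total > high}


flagged = flag_bad_sums(prop_area_sums)
print(flagged)
```

A precinct that falls below the lower bound has part of its area covered by no block group, which fits the explanation that border precincts extend slightly past the census shapefile.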

TODO: fix these two precincts.


Precincts 7025 and 7035 (2012 boundaries) are in the Hunter's Point Shipyard area. I wonder if this is messed up because the boundaries changed?
Something's clearly wrong with 7035 because it has 327 registered voters and a total population of only ~34.
7025 has 441 registered voters and a total population of ~1323.
For these it might be more of a problem, because the numbers are really far off.
They probably have to be omitted until I can come back and figure out what to do.
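
One way to catch cases like 7035 automatically is to compare registered voters against the estimated population. This is only a sketch using the figures quoted above. The dictionary layout and `more_voters_than_people` are hypothetical, not part of `spf`.

```python
# Figures quoted in the text above (population values are approximate).
precincts = {
    '7025': {'registered_voters': 441, 'tot_pop': 1323},
    '7035': {'registered_voters': 327, 'tot_pop': 34},
}


def more_voters_than_people(precincts):
    """Return names of precincts whose registered voters exceed estimated population."""
    return [
        name
        for name, d in precincts.items()
        if d['registered_voters'] > d['tot_pop']
    ]


print(more_voters_than_people(precincts))  # ['7035']
```

This check only flags 7035. Precinct 7025 passes it, so whatever is wrong there would need a different test.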

In [20]:
# Look for other missing data: count precincts with a null med_inc_wgt in each year.
# (No other missing data turns up here.)

for yr_key, df in census_data_by_precinct.items():
    print(df['med_inc_wgt'].isnull().sum())

0
0
0
0
