# Step 0:
# Data Preprocessing
* 1, Road Network Preprocessing
* 2, Work Commute Data

## 1, Road Networks

Cleaning the road shapefile and extracting its giant connected component takes three steps:

- Run the GRASS `v.clean.advanced` tool with the cleaning tools `snap,break,rmdupl,rmsa` and tolerance values `0.0001,0.0,0.0,0.0`, and save the result to `cleaned.shp`.
- Run the GRASS `v.net.components` tool (`weak` or `strong` does not matter, since the network is undirected), and save the result as `giant_component.csv`.
- Using GeoPandas, combine the two files (`cleaned.shp` and `giant_component.csv`), keep only the roads in the giant component, and save the result as `gcc.shp`:

In [None]:
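import pandas as pd
import geopandas as gpd

# Component id (column `comp`) for each road segment
components = pd.read_csv('../nWMDmap2/giant_component.csv', usecols=[0])
cleaned = gpd.read_file('../nWMDmap2/cleaned.shp')

# The join is index-based, so it assumes both files list the segments in the same order
roads = cleaned[['LINEARID','MTFCC','STATEFP','COUNTYFP','geometry']].join(components)
# 1610 is the id of the giant component in this run
roads = roads[roads.comp == 1610].drop('comp', axis=1)

roads.to_file('../nWMDmap2/gcc.shp')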
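The component id `1610` above is specific to one run of `v.net.components`. As a minimal sketch of how the giant component's id could be looked up instead of hard-coded, the example below counts segments per component on a toy table. The toy values are made up for illustration, and the only assumption carried over from the cell above is a `comp` column with one row per road segment.

```python
import pandas as pd

# Toy stand-in for giant_component.csv: one component id per road segment
# (made-up values, for illustration only).
components = pd.DataFrame({'comp': [3, 7, 7, 1, 7, 3, 7]})

# Count segments per component; the id with the most segments is the giant component.
sizes = components['comp'].value_counts()
giant_id = sizes.idxmax()
giant_share = sizes.max() / len(components)

print(giant_id, round(giant_share, 2))
```

With the real file, `giant_id` would take the place of the literal `1610` in the filter.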

## 2, Work Commute Data

To get tract-to-tract commuting data:

- Download the datasets (6 states × 2 files = 12 files in total)
- Aggregate them at tract level (the original data is at block level, i.e. more granular)
- Remove tracts that are not included in the census shapefile (`censusclip1.shp`)


In [None]:
# DOWNLOAD SCRIPT
import os

pre = 'https://lehd.ces.census.gov/data/lodes/LODES7/'
# Two files per state: 'main' for workers living in the same state, 'aux' for those who do not
for state in ['ny','nj','ct','pa','ri','ma']:
    for res in ['main','aux']:
        # The 2011 file is used for 'ma'; every other state uses the 2010 file
        post = '_JT00_2010.csv.gz' if state != 'ma' else '_JT00_2011.csv.gz'
        os.system('wget {0}{1}/od/{1}_od_{2}{3}'.format(pre,state,res,post))
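To check what the download loop requests before running it, the URLs can be built without calling `wget`. This is a small sketch following the same URL pattern as the script above. The helper name `lodes_od_url` and its `year` parameter are hypothetical, introduced here only for illustration.

```python
PRE = 'https://lehd.ces.census.gov/data/lodes/LODES7/'

def lodes_od_url(state, part, year):
    """Build one origin-destination file URL (hypothetical helper)."""
    return '{0}{1}/od/{1}_od_{2}_JT00_{3}.csv.gz'.format(PRE, state, part, year)

# Same states and file types as the download script; 'ma' uses 2011, the rest 2010.
urls = [lodes_od_url(state, part, 2011 if state == 'ma' else 2010)
        for state in ['ny', 'nj', 'ct', 'pa', 'ri', 'ma']
        for part in ['main', 'aux']]

print(len(urls))
print(urls[0])
```

Printing the list first makes it easy to confirm that 12 files are requested and that only the `ma` files point at 2011.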

Besides the two geocode columns (`w_geocode` and `h_geocode`), we keep only the following columns; the rest are dropped by `usecols=range(6)`:

- S000: Total number of jobs
- SA01: Number of jobs for workers age 29 or younger
- SA02: Number of jobs for workers age 30 to 54
- SA03: Number of jobs for workers age 55 or older
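As a rough illustration of what `usecols=range(6)` keeps, the sketch below reads a single made-up row from an in-memory CSV. The header is an assumption about the LODES origin-destination layout (the two geocodes first, followed by the job-count columns), and the values are invented. The geocodes are read as strings here by column name so that leading zeros survive.

```python
import io
import pandas as pd

# One invented row; the header order is assumed, not taken from a real file.
csv_text = (
    "w_geocode,h_geocode,S000,SA01,SA02,SA03,SE01,SE02,SE03,SI01,SI02,SI03,createdate\n"
    "090010101011000,090010102021001,2,1,1,0,0,1,1,0,0,2,20160228\n"
)

# Keep the first six columns; read both geocodes as strings to preserve leading zeros.
df = pd.read_csv(io.StringIO(csv_text), usecols=range(6),
                 dtype={'w_geocode': str, 'h_geocode': str})

print(list(df.columns))
print(df.w_geocode[0])
```

Without the `dtype` argument the geocodes would be parsed as integers and a state code such as `09` would lose its leading zero.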

In [None]:
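# CREATE TRACT LEVEL O-D PAIRS
from functools import partial
from glob import glob
import pandas as pd
import geopandas as gpd

# GEOID: state(2)-county(3)-tract(6), e.g. 09-001-030300
census = gpd.read_file('../nWMDmap2/censusclip1.shp').set_index('GEOID10')  # demographic profiles
# Read the first six columns only; keep both geocodes as strings to preserve leading zeros
read_workflow = partial(pd.read_csv, usecols=range(6), dtype={0: str, 1: str})

wf = pd.concat([read_workflow(f) for f in glob('../od/*JT00*')])  # workflow
# The first 11 characters of a block geocode identify its tract
wf['work'] = wf.w_geocode.str[:11]
wf['home'] = wf.h_geocode.str[:11]

# Keep pairs whose work tract or home tract is in the census shapefile
od = wf[(wf.work.isin(census.index)) | (wf.home.isin(census.index))]
# Sum the job counts only, not the block-level geocode strings
od = od.groupby(['work','home'])[['S000','SA01','SA02','SA03']].sum()
od.reset_index().to_csv('../od/tract-od.csv', index=False)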
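The block-to-tract aggregation in the cell above can be illustrated on a toy table. This is a minimal sketch with invented geocodes and job counts. It only shows how truncating the geocodes to 11 characters and grouping collapses block-level rows into tract-level pairs.

```python
import pandas as pd

# Three invented block-level flows; the first two share the same work and home tracts.
wf = pd.DataFrame({
    'w_geocode': ['090010101011000', '090010101011001', '090010101011000'],
    'h_geocode': ['090010102021000', '090010102021005', '360610001001000'],
    'S000': [2, 3, 1],
})

# Truncate block geocodes to the 11-character tract GEOID.
wf['work'] = wf.w_geocode.str[:11]
wf['home'] = wf.h_geocode.str[:11]

# Sum job counts for each (work tract, home tract) pair.
od = wf.groupby(['work', 'home'])[['S000']].sum().reset_index()
print(od)
```

The three block-level rows collapse into two tract-level pairs, with the job counts of the first two rows summed.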