## Data

In this book, we will use a set of public datasets from the Longitudinal Employer-Household Dynamics (LEHD) program at the United States Census Bureau. In particular, we will use the LEHD Origin-Destination Employment Statistics (LODES) data. These data are tabulated from administrative records and describe where workers work and where they live, at the census block level. We will use four main types of files:
- **Workplace Area Characteristics (WAC):** Census block level. Job totals counted at the census block where the job is located.
- **Residence Area Characteristics (RAC):** Census block level. Job totals counted at the census block where the worker lives.
- **Origin-Destination (OD):** Origin census block - Destination census block pair level. Job totals for workers who live in the origin block and work in the destination block.
- **Crosswalk (xwalk):** Census block level. Lists every census block in a state, along with geographic information about each block (e.g. city, county).
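
The WAC, RAC, and OD files are published as gzipped CSVs. The sketch below builds download URLs that follow the naming pattern of the two files loaded later in this section. `lodes_url` is a hypothetical helper, not part of any library, and the assumption that RAC files are named like WAC files is ours; check the technical documentation before relying on it.

```python
# Base path taken from the URLs used in this section.
BASE = "https://lehd.ces.census.gov/data/lodes/LODES7"


def lodes_url(state, kind, year, job_type="JT00", segment="S000", part="main"):
    """Build a LODES download URL (hypothetical helper, pattern inferred from examples).

    state:    two-letter state abbreviation, e.g. "ca"
    kind:     "wac", "rac", or "od"
    segment:  workforce segment code, used by WAC/RAC file names
    part:     OD file part, used by OD file names only
    """
    state = state.lower()
    if kind == "od":
        name = f"{state}_od_{part}_{job_type}_{year}.csv.gz"
    elif kind in ("wac", "rac"):
        # Assumption: RAC file names mirror the WAC pattern.
        name = f"{state}_{kind}_{segment}_{job_type}_{year}.csv.gz"
    else:
        raise ValueError(f"unsupported file kind: {kind!r}")
    return f"{BASE}/{state}/{kind}/{name}"


wac_url = lodes_url("ca", "wac", 2015)
od_url = lodes_url("ca", "od", 2015)
print(wac_url)
print(od_url)
```

Either URL can be passed straight to `pd.read_csv`, as in the cells that follow.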


The WAC and RAC data look like the following:

In [1]:
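import pandas as pd
# California WAC file: all jobs (S000), all job types (JT00), 2015
URL = 'https://lehd.ces.census.gov/data/lodes/LODES7/ca/wac/ca_wac_S000_JT00_2015.csv.gz'
# Download the gzipped CSV and show the first five rows
pd.read_csv(URL, compression='gzip').head()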

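| | w_geocode | C000 | CA01 | CA02 | CA03 | CE01 | CE02 | CE03 | CNS01 | CNS02 | ... | CFA02 | CFA03 | CFA04 | CFA05 | CFS01 | CFS02 | CFS03 | CFS04 | CFS05 | createdate |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 60014001001007 | 30 | 2 | 16 | 12 | 4 | 2 | 24 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 20190826 |
| 1 | 60014001001008 | 4 | 0 | 1 | 3 | 0 | 0 | 4 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 20190826 |
| 2 | 60014001001011 | 3 | 2 | 1 | 0 | 0 | 3 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 20190826 |
| 3 | 60014001001017 | 11 | 3 | 3 | 5 | 2 | 2 | 7 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 20190826 |
| 4 | 60014001001024 | 10 | 3 | 3 | 4 | 7 | 1 | 2 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 20190826 |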


Here, each row represents a **census block** (this particular table contains data from California). The `w_geocode` column holds the **block code**, which uniquely identifies the census block, and the `C000` variable gives the total number of jobs in that census block. The remaining variables break down the number of jobs by various categories. For example, `CA01`, `CA02`, and `CA03` break down the jobs by age group:
- `CA01`: Number of jobs for workers age 29 or younger
- `CA02`: Number of jobs for workers age 30 to 54
- `CA03`: Number of jobs for workers age 55 or older

The sum of these three columns should therefore equal the value in `C000`. In the first row above, for example, 2 + 16 + 12 = 30.
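
As a quick sanity check on the age breakdown, the sketch below rebuilds the five rows shown above (only the total and age columns, typed in by hand rather than downloaded) and compares the row-wise sum of `CA01`, `CA02`, and `CA03` with `C000`. The names `wac`, `age_total`, and `mismatches` are illustrative.

```python
import pandas as pd

# The five WAC rows displayed above, restricted to the total and age columns.
wac = pd.DataFrame({
    "w_geocode": [60014001001007, 60014001001008, 60014001001011,
                  60014001001017, 60014001001024],
    "C000": [30, 4, 3, 11, 10],
    "CA01": [2, 0, 2, 3, 3],
    "CA02": [16, 1, 1, 3, 3],
    "CA03": [12, 3, 0, 5, 4],
})

# Row-wise sum of the three age-group columns.
age_total = wac[["CA01", "CA02", "CA03"]].sum(axis=1)

# Rows where the age groups do not add up to the job total.
mismatches = wac[age_total != wac["C000"]]
print(len(mismatches))  # 0 for these five rows
```

The same comparison can be run on the full table returned by `pd.read_csv` to check every block at once.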

The Origin-Destination file looks like this:

In [3]:
# California OD "main" file: all job types (JT00), 2015
URL = 'https://lehd.ces.census.gov/data/lodes/LODES7/ca/od/ca_od_main_JT00_2015.csv.gz'
pd.read_csv(URL, compression='gzip').head()

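| | w_geocode | h_geocode | S000 | SA01 | SA02 | SA03 | SE01 | SE02 | SE03 | SI01 | SI02 | SI03 | createdate |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 60014001001007 | 60014003004007 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 20190826 |
| 1 | 60014001001007 | 60014027002024 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 20190826 |
| 2 | 60014001001007 | 60014037011000 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 20190826 |
| 3 | 60014001001007 | 60014042001011 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 20190826 |
| 4 | 60014001001007 | 60014042003000 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 20190826 |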


Here, each row represents a `w_geocode`-`h_geocode` pair. That is, each row is a pair of census blocks for which at least one person worked in the `w_geocode` census block and lived in the `h_geocode` census block. The `S000` variable represents the total number of jobs for that pair.
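
Because the OD table has one row per workplace-home pair, it can be rolled up to the workplace block with a `groupby`. The sketch below does this for the five rows shown above, typed in by hand. It assumes `S000` is the job count for each pair, as described in the technical documentation, and the names `od`, `jobs_by_workplace`, and `homes_by_workplace` are illustrative. The totals cover only these five rows, not the whole file.

```python
import pandas as pd

# The five OD rows displayed above: one workplace block, five home blocks.
od = pd.DataFrame({
    "w_geocode": [60014001001007] * 5,
    "h_geocode": [60014003004007, 60014027002024, 60014037011000,
                  60014042001011, 60014042003000],
    "S000": [1, 1, 1, 1, 1],
})

# Jobs per workplace block (assuming S000 counts jobs for each pair).
jobs_by_workplace = od.groupby("w_geocode")["S000"].sum()

# Number of distinct home blocks that send workers to each workplace block.
homes_by_workplace = od.groupby("w_geocode")["h_geocode"].nunique()

print(jobs_by_workplace)
print(homes_by_workplace)
```

Grouping by `h_geocode` instead gives the mirror-image view: jobs held by the residents of each home block.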

For more information about the datasets used in the examples, please refer to the data documentation provided [at this link](https://lehd.ces.census.gov/data/lodes/LODES7/LODESTechDoc7.4.pdf). 