## IV Estimate

I use the 1980 NHGIS counts of families by income level and race for each census tract. I merge these counts with the final Baum-Snow dataset, matching census tracts to MSAs for the year 1980. No correction is made for changes in MSA boundaries (I believe the final Baum-Snow dataset uses 1990 MSA definitions).

I use the dissimilarity index as a quick measure of segregation:
$$\frac{1}{2} \sum_{i=1}^N \left| \frac{p_i}{P} - \frac{r_i}{R} \right|$$
where $p_i$ is the number of families with income $<5,000$ in census tract $i$ and $P$ is the total number of families with income $<5,000$. $r_i$ and $R$ are similar measures for families with income $>50,000$.

I calculate this segregation measure for black and white families separately.
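
A minimal sketch of the dissimilarity formula above, written in Python (an assumption, since the estimates in these notes come from Stata). The function name and the three-tract counts are hypothetical and are not NHGIS data.

```python
def dissimilarity_index(poor_counts, rich_counts):
    """Dissimilarity index: 0.5 * sum_i |p_i / P - r_i / R|."""
    if len(poor_counts) != len(rich_counts):
        raise ValueError("need one poor and one rich count per tract")
    total_poor = sum(poor_counts)  # P
    total_rich = sum(rich_counts)  # R
    if total_poor == 0 or total_rich == 0:
        raise ValueError("index is undefined when a group has no families")
    return 0.5 * sum(
        abs(p / total_poor - r / total_rich)
        for p, r in zip(poor_counts, rich_counts)
    )


# Hypothetical MSA with three tracts (made-up counts).
poor = [100, 50, 50]   # families with income < 5,000 in each tract
rich = [0, 50, 150]    # families with income > 50,000 in each tract
example_d = dissimilarity_index(poor, rich)
```

The index is 0 when both groups are spread identically across tracts and 1 when no tract contains both groups. To get the measure by race, the same function would be called once with the counts for white families and once with the counts for black families.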

Using planned rays as an instrument for the true number of rays, I estimate the effect of highway rays on the segregation of white families and black families. See Stata output below.

First-stage regressions
-----------------------

                                                Number of obs     =        216
                                                F(   1,    214)   =     358.06
                                                Prob > F          =     0.0000
                                                R-squared         =     0.6259
                                                Adj R-squared     =     0.6242
                                                Root MSE          =     1.3756

------------------------------------------------------------------------------
         ray |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
  rays_planm |   1.184025   .0625721    18.92   0.000     1.060688    1.307361
       _cons |   .6713187   .1607168     4.18   0.000      .354528    .9881094
------------------------------------------------------------------------------

For white families

Instrumental variables (2SLS) regression          Number of obs   =        216
                                                  Wald chi2(1)    =      19.43
                                                  Prob > chi2     =     0.0000
                                                  R-squared       =     0.1000
                                                  Root MSE        =     .16123

------------------------------------------------------------------------------
       seg_w |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         ray |   .0273031   .0061937     4.41   0.000     .0151636    .0394426
       _cons |    .981497   .0223478    43.92   0.000     .9376961    1.025298
------------------------------------------------------------------------------
Instrumented:  ray
Instruments:   rays_planm
(437 missing values generated)
(1,238 missing values generated)
(1,297 missing values generated)
(31,156 observations deleted)



For black families

Instrumental variables (2SLS) regression          Number of obs   =        216
                                                  Wald chi2(1)    =       7.28
                                                  Prob > chi2     =     0.0070
                                                  R-squared       =     0.0574
                                                  Root MSE        =     .58842

------------------------------------------------------------------------------
       seg_b |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         ray |   .0610001    .022605     2.70   0.007      .016695    .1053051
       _cons |   .9134696   .0815621    11.20   0.000     .7536108    1.073328
------------------------------------------------------------------------------
Instrumented:  ray
Instruments:   rays_planm


However, the regressions above instrument for the number of highways, whereas we need an instrument for segregation. The model we would like to estimate has the first stage
$$ S_i = \beta_0 + \beta_1 H_i + \beta_2 N_i + \beta_3 H_i \times N_i + e_i $$ 
where $i$ denotes each MSA, $S_i$ is a measure of segregation, $H_i$ is the number of highways (planned or actual?), and $N_i$ is an index of natural heterogeneity (Lee & Lin already have a measure; I need to run their code to recreate it).

The second stage is
$$ y_{i,j} = \theta_0 + \theta_1 S_i + \theta_3 X_{j} + \theta_4 Z_{i} + MSA_i + \epsilon_{i,j} $$ where $y_{i,j}$ is the outcome of individual $j$ living in MSA $i$. $X_{j}$ is a vector of controls for the individual and $Z_{i}$ is a vector of controls for the MSA.  $MSA_i$ is a fixed effect for each MSA.

Goal: Estimate $\theta_1$

Question: These two equations are not exactly right. Need to also check the effect from highways (and use the natural features index somehow)

## Differential Effect from IV

In the above model, we assume that highways affect the outcome only through segregation, i.e. cov$(H_i, \epsilon_{i,j}) = 0$, but this is unlikely. The coefficient estimate for IV is
$$ \beta_{IV} = (Z'X)^{-1}Z'y = (Z'X)^{-1}Z'X\beta + (Z'X)^{-1}Z'\epsilon = \beta + (Z'X)^{-1}Z'\epsilon $$
If cov$(H_i, \epsilon_{i,j}) \neq 0$, then the bias term $(Z'X)^{-1}Z'\epsilon$ does not vanish, even in large samples.
(NOTE: $Z$ is the matrix of instruments)
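
With one instrument, one endogenous regressor, and a constant (the setup in the Stata output above), the IV slope reduces to cov$(z, y)$ / cov$(z, x)$. The sketch below, in Python rather than Stata and with made-up numbers, shows how the estimate recovers $\beta$ when the instrument is uncorrelated with the error and is shifted by cov$(z, \epsilon)$ / cov$(z, x)$ when it is not. All names and data are hypothetical.

```python
def covariance(a, b):
    """Sample covariance (dividing by n) of two equal-length lists."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    return sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b)) / len(a)


def iv_slope(z, x, y):
    """Just-identified IV slope with a constant: cov(z, y) / cov(z, x)."""
    cov_zx = covariance(z, x)
    if cov_zx == 0:
        raise ValueError("instrument is uncorrelated with the regressor")
    return covariance(z, y) / cov_zx


# Made-up data: z is the instrument, x the endogenous regressor.
z = [0.0, 1.0, 2.0, 3.0, 4.0]
x = [1.0, 2.0, 2.0, 4.0, 5.0]
alpha, beta = 1.0, 0.5

# Error with zero sample covariance with z: the exclusion restriction holds.
e_valid = [1.0, -2.0, 2.0, -2.0, 1.0]
# Error that moves with z: the exclusion restriction fails.
e_invalid = [-2.0, -1.0, 0.0, 1.0, 2.0]

y_valid = [alpha + beta * xi + ei for xi, ei in zip(x, e_valid)]
y_invalid = [alpha + beta * xi + ei for xi, ei in zip(x, e_invalid)]

slope_valid = iv_slope(z, x, y_valid)      # equals beta up to rounding
slope_invalid = iv_slope(z, x, y_invalid)  # beta plus cov(z, e) / cov(z, x)
```

In this toy example cov$(z, x) = 2$ and, in the invalid case, cov$(z, \epsilon) = 2$, so the estimate is off by exactly 1.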

One suggestion is to estimate $\beta_{IV}$ for different subgroups and then calculate the differential effect. Say, we have group 1 and 2 (this could be gender for example or race). Then, we can estimate
$$\beta_{IV_1} = \beta_1 + (Z'X_1)^{-1}Z'\epsilon_1 $$
$$\beta_{IV_2} = \beta_2 + (Z'X_2)^{-1}Z'\epsilon_2 $$
where $X_1$ is the matrix of covariates for group 1 and $X_2$ is the matrix of covariates for group 2. Then
$$\beta_{IV_1} - \beta_{IV_2} = (\beta_1 - \beta_2) + [(Z'X_1)^{-1}Z'\epsilon_1 - (Z'X_2)^{-1}Z'\epsilon_2]$$

If $(Z'X_1)^{-1}Z'\epsilon_1 = (Z'X_2)^{-1}Z'\epsilon_2$, then we have a good estimate of $(\beta_1 - \beta_2)$. Is this condition met?
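
To make the condition concrete, here is a sketch of the scalar case, assuming one instrument $z$, one endogenous regressor $x_g$ for group $g$, a constant, and the usual regularity conditions (these simplifications are assumptions, not the full model above). In large samples the bias term for each group converges to a ratio of covariances:

$$\beta_{IV_g} \xrightarrow{p} \beta_g + \frac{\text{cov}(z, \epsilon_g)}{\text{cov}(z, x_g)}, \quad g = 1, 2$$

$$\beta_{IV_1} - \beta_{IV_2} \xrightarrow{p} (\beta_1 - \beta_2) + \left[ \frac{\text{cov}(z, \epsilon_1)}{\text{cov}(z, x_1)} - \frac{\text{cov}(z, \epsilon_2)}{\text{cov}(z, x_2)} \right]$$

So the biases cancel only if the two ratios are equal. Having the same violation of the exclusion restriction in both groups, cov$(z, \epsilon_1) = $ cov$(z, \epsilon_2)$, is not enough on its own: the first-stage strength cov$(z, x_g)$ must also match across groups.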

## Aggregated census tract estimates

From "Freeway Revolts!", we know there is some suggestive evidence that higher-income families sort away from highways and from the central business district when transportation costs are lower. Therefore, we expect the change in income to be negative for a neighborhood that is closer to a highway and closer to the inner city.

Since the NLSY data is geocoded to the county/MSA level, our estimates need to be at the county/MSA level. We can estimate the change in a county's income by aggregating the census tract data, weighting each tract by its proportion of rich families.

Let $r_i$ be the count of families with income $>50,000$ in census tract $i$ and $pop_i$ be the population of census tract $i$ in the year 1950, before the highways were built. Let $dhighway_i$ be the distance to the nearest freeway and $dCBD_i$ the distance to the central business district. $I_j$ is the set of census tracts in MSA $j$. Then we can get an aggregate estimate of the segregation level if we run the regression

$$ S_j = \alpha + \beta_1 \sum_{i \in I_j} \frac{r_i}{pop_i} dhighway_i \times dCBD_i  + \beta_2 \sum_{i \in I_j} \frac{r_i}{pop_i} dhighway_i + \beta_3 \sum_{i \in I_j} \frac{r_i}{pop_i} dCBD_i + \epsilon_j$$

where $S_j$ is the level of segregation. We can then use the predicted segregation to measure its effect on the outcome variable $y_{j,k}$ for person $k$ living in county $j$.

$$y_{j,k} =  \theta_0 + \theta_1 \hat{S_j} + \theta_3 X_{k} + \theta_4 Z_{j} + MSA_j + \epsilon_{j,k}$$ 

I am not sure whether the above equation would work. I also think the type of families that stay in or move into this county is correlated with the outcome variable.

Note: Would need to go back to the original GIS files to build distance from census tract to nearest freeway. (CHECK WITH Jeff Lin for data)
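
A minimal sketch of how the three regressors in the $S_j$ equation above could be built for one MSA once the tract distances exist. It is written in Python with hypothetical field names (`rich`, `pop`, `d_highway`, `d_cbd`) and made-up values. Skipping tracts with zero 1950 population is an assumption of the sketch, not something decided in these notes.

```python
def weighted_msa_regressors(tracts):
    """Return the three weighted sums for one MSA.

    Each tract is a dict with hypothetical keys:
      rich      - count of families with income > 50,000 (r_i)
      pop       - 1950 population of the tract (pop_i)
      d_highway - distance to the nearest freeway (dhighway_i)
      d_cbd     - distance to the central business district (dCBD_i)
    """
    interaction = highway = cbd = 0.0
    for tract in tracts:
        if tract["pop"] <= 0:
            continue  # assumption: tracts with no 1950 population are dropped
        weight = tract["rich"] / tract["pop"]  # r_i / pop_i
        interaction += weight * tract["d_highway"] * tract["d_cbd"]
        highway += weight * tract["d_highway"]
        cbd += weight * tract["d_cbd"]
    return interaction, highway, cbd


# Made-up MSA with two tracts.
example_tracts = [
    {"rich": 10, "pop": 100, "d_highway": 2.0, "d_cbd": 5.0},
    {"rich": 50, "pop": 200, "d_highway": 4.0, "d_cbd": 10.0},
]
example_regressors = weighted_msa_regressors(example_tracts)
```

The returned tuple lines up with the terms multiplying $\beta_1$, $\beta_2$, and $\beta_3$, in that order, so one row per MSA would feed the regression for $S_j$.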


## To do
- Check MSA/county/census tract boundaries are consistent
- Build natural features index from Lee & Lin 
- Write function to calculate rank order information theory index for segregation
- Clean NLSY public use data

## Things to think about
- When building segregation index, income brackets change with time because of inflation. How to address this?
- Endogenous sorting/migration to cities (can we get an estimate of migration?)

## Additional Notes
- Boustan's homeownership paper uses predicted rays as an instrument for white flight: "We predict the number of completed rays in each city $i$ at time $t$ by interacting the number of assigned rays in the 1947 plan with the national share of highway construction completed by date $t$."
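
A small sketch of the predicted-rays construction described in the quote, in Python. The function name is hypothetical and the national completion shares are made up for illustration. They are not taken from Boustan's data.

```python
def predicted_rays(planned_rays_1947, national_share_completed):
    """Planned rays scaled by the national share of construction completed."""
    if not 0.0 <= national_share_completed <= 1.0:
        raise ValueError("share must be between 0 and 1")
    return planned_rays_1947 * national_share_completed


# Made-up national completion shares by year.
national_share = {1960: 0.25, 1970: 0.75, 1980: 1.0}
city_plan = 4  # hypothetical city with 4 rays in the 1947 plan

predictions = {
    year: predicted_rays(city_plan, share)
    for year, share in national_share.items()
}
```

Because the national share is common to all cities, the cross-city variation in the instrument at any date comes only from the 1947 plan.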

## Notes from Rebecca 
IV estimate
- Generate summary statistics on the mean of segregation for black and white families
- Generate summary statistics on the mean of the number of highways (planned) for black and white families
- Edit first stage & second stage to only use planned number of highways & natural index

Differential Effect from IV
- Try different controls. If estimates change dramatically with more controls, then likely $(Z'X_1)^{-1}Z'\epsilon_1 \neq (Z'X_2)^{-1}Z'\epsilon_2$
- Think a bit more about this? Go through full algebra

General
- Get 1950s data. Then, conditional on the level of segregation and other characteristics in 1950, cities with more highways should have more segregation, and cities with more highways located in areas with rich families should see more suburbanization/outflow from the inner city.
- Get NHGIS 1950s income data and create segregation index
- Look at change in segregation to verify that the mechanism of planned highways -> actual highways -> change in segregation exists.
- Try isolation index and different segregation indices
- Handle changing income brackets by assuming the distribution of income within a bracket is uniform. Extrapolate the number of people in each bracket.
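
The uniform-within-bracket idea in the last bullet can be sketched as follows, in Python. The function name, bracket edges, and counts are made up. The sketch assumes closed-ended brackets only, so an open-ended top bracket would need separate treatment.

```python
def families_below(threshold, brackets):
    """Estimate the number of families with income below `threshold`.

    `brackets` is a list of (lower, upper, count) tuples. Income is assumed
    to be uniformly distributed within each bracket.
    """
    total = 0.0
    for lower, upper, count in brackets:
        if upper <= lower:
            raise ValueError("each bracket needs upper > lower")
        if threshold >= upper:
            total += count  # whole bracket lies below the threshold
        elif threshold > lower:
            # threshold falls inside the bracket: take the uniform share
            total += count * (threshold - lower) / (upper - lower)
    return total


# Made-up tract: nominal income brackets and family counts.
example_brackets = [
    (0, 5000, 100),
    (5000, 10000, 200),
    (10000, 15000, 50),
]
# Hypothetical inflation-adjusted cutoff that falls mid-bracket.
below_cutoff = families_below(7500, example_brackets)
```

With a cutoff deflated to each census year's dollars, the same function would give comparable "poor" counts across years even though the published brackets differ.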

# 11/27

## Variable List
- Income Segregation in 1950, 1980 by MSA
- Natural Heterogeneity Index by MSA 
- Racial Segregation in 1950 by MSA


## Isolation Index

The isolation index is usually defined for two groups.

For the isolation of a rich household from everyone else, we calculate

$$\frac{ (\sum_{i=1}^N \frac{r_i}{r_{tot} } \times \frac{r_i}{pop_i}) -  \frac{r_{tot}}{pop_{tot}}}{ 1 - \frac{r_{tot}}{pop_{tot}} } $$

For the poor, it is analogous.
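
A minimal sketch of the isolation index formula above, in Python with made-up counts. The function name is hypothetical, and skipping tracts with zero population is an assumption of the sketch.

```python
def isolation_index(rich_counts, populations):
    """Normalized isolation index for the rich group across tracts."""
    if len(rich_counts) != len(populations):
        raise ValueError("need one rich count and one population per tract")
    r_tot = sum(rich_counts)
    pop_tot = sum(populations)
    if r_tot == 0 or r_tot == pop_tot:
        raise ValueError("index is undefined when the group share is 0 or 1")
    overall_share = r_tot / pop_tot
    # Average tract-level rich share, weighted by where rich households live.
    exposure = sum(
        (r / r_tot) * (r / p)
        for r, p in zip(rich_counts, populations)
        if p > 0  # assumption: empty tracts are skipped
    )
    return (exposure - overall_share) / (1 - overall_share)


# Made-up examples: complete isolation and an even spread.
fully_isolated = isolation_index([10, 0], [10, 10])
evenly_spread = isolation_index([5, 5], [10, 10])
```

The normalization subtracts the overall rich share, so the index is 0 when every tract mirrors the MSA-wide composition and 1 when rich households live only with other rich households.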