# EC486 LT Problem Set 9

Edited by Jack Shannon

Based on [Jin, Ginger Zhe, and Phillip Leslie. “Reputational Incentives for Restaurant Hygiene.” American Economic Journal: Microeconomics 1, no. 1 (2009): 237–67.](https://www.jstor.org/stable/25760354)

This document was produced with [JupyterLab](https://jupyterlab.readthedocs.io/en/stable/) running [`stata_kernel`](https://kylebarron.dev/stata_kernel/)

## Question 1

> What is the question and the goal of the paper?


The authors begin by asking:

> When do reputations provide effective incentives for firms to maintain high unobservable effort, and when should we invoke government intervention based on a failure of the market to provide adequate information?

After noting a government intervention (the introduction of hygiene scorecards to be prominently displayed) that reduced hospitalizations from food-borne illness by 20%, they introduce the goal of this paper:

> In this study, we ask whether reputational incentives caused at least some restaurants to provide good hygiene.

They pursue this question with two smaller questions:

1. Chain effects?
    - Chain-affiliated restaurants may share the reputation of the chain as a whole
    - Test the hypothesis that chain restaurants tend to face stronger reputational incentives than independent restaurants
    - Evidence of franchisees exerting less effort to maintain good hygiene would provide verification that chain affiliation is a source of reputational incentives.
2. Regional variation
    - All else equal, two restaurants located beside each other face similar consumer learning
    - Implies geographic clustering in the magnitude of restaurants' information differences


## Question 2

> What is the theoretical foundation upon which the empirical model is based?

### Chain Effects

To test whether reputational incentives differ between chain and non-chain restaurants, and between company-owned and franchised chain units, the authors develop a model of how a change in consumer learning about product quality (i.e. hygiene) affects hygiene effort.

#### Simplest Model

The simple underlying model has the following characteristics:

- Hygiene (quality) is costly to produce
- Hygiene is imperfectly observed by consumers
    - Scorecard intervention improves consumer learning
    - Chain restaurants have different learning effects than non-chain restaurants

This in turn implies:

- Pre-intervention
    - Chains produce higher quality than non-chains
        - Due to differences in consumer learning
    - Company-owned chains produce higher quality than franchised chains
        - Due to externality of learning on other branches in the chain
        - Company-owned restaurants internalize this, franchises do not
    - Represented by ranked marginal revenue curves:
    $${MR}^b_{nc}(h) < {MR}^b_{cf}(h) < {MR}^b_c(h)$$
    for any level of hygiene quality $h$
        - $b$ – before intervention
        - $nc$ – non-chain restaurant
        - $cf$ – franchised chain restaurant
        - $c$ – company-owned chain restaurant
    - NB: this marginal revenue is the change in profit for a marginal increase in *hygiene*, not food output
- Post-intervention
    - Consumer learning is equalized across chains and non-chains
    - All firms then face the same marginal revenue curve
    $${MR}^a(h) > {MR}^b(h)$$
    for any level of hygiene quality $h$ and any pre-intervention marginal revenue ${MR}^b$

Assuming that all firms face the same marginal cost curve, we expect to see the following ranking of equilibrium hygiene levels:
$$h^b_{nc} < h^b_{cf} < h^b_c < h^a$$

This is illustrated in Figure 1 from the paper:
![Figure 1](./images/figure1.png)

#### Cost Heterogeneity

What if chains and non-chains not only face different reputational incentives, but also different cost curves? Suppose that it is less costly (in terms of effort) for chains to produce hygiene quality than non-chains. What will be the effect of the change in consumer learning?

##### Parallel Shift

Suppose that the marginal cost curves are parallel, e.g. ${MC}_c(h) = {MC}_{nc}(h) - \gamma$ for some $\gamma > 0$. Because all curves are linear, the change in hygiene levels due to cost differences is just a constant. In other words, the horizontal distance between the intersection of a marginal revenue curve with the two marginal cost curves is constant (call it $\Delta$) for all marginal revenue curves. This means that if the difference in hygiene levels between chains and non-chains decreases after the intervention, then chains and non-chains must have had different marginal revenue curves pre-intervention. The reasoning can be summarized as:

- Suppose the gap shrinks after the intervention: $h^b_c - h^b_{nc} > h^a_c - h^a_{nc}$
- ${MR}^a_{nc} = {MR}^a_c \Rightarrow h^a_c - h^a_{nc} = \Delta$
- Then $h^b_c - h^b_{nc} > \Delta$
- Therefore ${MR}^b_{nc} \neq {MR}^b_c$

Since a difference in marginal revenue curves is evidence of reputational incentives, any analysis based on how the chain/non-chain hygiene gap changes across the intervention carries over unchanged from the basic model.

As before, franchises and company-owned restaurants still face the same marginal cost curve, so any pre-intervention difference in their hygiene levels must come from a difference in marginal revenue curves, i.e. reputational incentives.
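
The constant gap $\Delta$ can be made explicit with a small linear sketch. The functional forms and the symbols $a$, $b$, $d$, $e$ are my own assumptions for illustration, not the paper's notation: take ${MR}(h) = a - bh$, ${MC}_{nc}(h) = d + eh$ and ${MC}_c(h) = d + eh - \gamma$, with $b, e, \gamma > 0$. Setting ${MR} = {MC}$ for each type gives

$$
\begin{aligned}
h_{nc} &= \frac{a - d}{b + e}, &
h_c &= \frac{a - d + \gamma}{b + e}, &
h_c - h_{nc} &= \frac{\gamma}{b + e} \equiv \Delta
\end{aligned}
$$

The gap does not depend on the intercept $a$, so it is the same for every marginal revenue curve that chains and non-chains share. If instead the pre-intervention intercepts differ, say $a^b_c \neq a^b_{nc}$, the gap becomes

$$h^b_c - h^b_{nc} = \Delta + \frac{a^b_c - a^b_{nc}}{b + e}$$

which exceeds $\Delta$ exactly when chains face the higher marginal revenue curve.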

All this can be seen in Figure 2 from the paper:
![Figure 2](./images/figure2.png)

##### Affine Transformation

Suppose instead that chains face a marginal cost curve that is lower but steeper than the marginal cost of non-chains: ${MC}_c(h) = \gamma_1 \cdot {MC}_{nc}(h) - \gamma_0$ with $\gamma_0 > 0$ and $\gamma_1 > 1$. Under this setting, even if both chains and non-chains faced the same marginal revenue curve pre-intervention, the difference in hygiene levels would shrink after the intervention because the two marginal cost curves would be closer together.
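
A linear sketch shows why. As an illustration only (the functional forms and the symbols $a$, $b$, $d$, $e$ are assumptions, not the paper's notation), let both types face ${MR}(h) = a - bh$, let ${MC}_{nc}(h) = d + eh$ with $b, e > 0$, and model the intervention as an increase in $a$. Then

$$
h_{nc} = \frac{a - d}{b + e}, \qquad
h_c = \frac{a - \gamma_1 d + \gamma_0}{b + \gamma_1 e}, \qquad
\frac{\partial (h_c - h_{nc})}{\partial a} = \frac{1}{b + \gamma_1 e} - \frac{1}{b + e} < 0
$$

Because $\gamma_1 > 1$, chain hygiene responds less to a given rise in marginal revenue, so the gap narrows even though the two types share the same marginal revenue curve both before and after the intervention.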

To correct for this, the authors propose conditioning on the average post-intervention hygiene level. Their reasoning is as follows:

- For two restaurants $1$ and $2$, suppose:
    1. $h_1^a = h_2^a$
    2. $h_1^b \neq h_2^b$
    3. MR curves do not cross, MC curves do not cross
- 1 and 3 imply ${MC}_1^a = {MC}_2^a$
- MC does not change with the intervention
- Therefore ${MC}^b_1 = {MC}^b_2$
- Then $h_1^b \neq h_2^b$ implies ${MR}^b_1 \neq {MR}^b_2$

I don't buy this argument because conditioning on post-intervention scores shifts one MC curve onto another, so even if they do not cross in reality, they can cross if they are made to go through the same point. The reasoning is correct, but only for factual observations, not counterfactuals, which is what the conditioning amounts to.

This is illustrated by Figure 3 in the paper:
![Figure 3](./images/figure3.png)

## Question 3

> Describe the data being used.

Data comes primarily from health inspections carried out in Los Angeles by the Department of Health Services (DHS). The variation to be exploited comes from the introduction of a grade card system for displaying sanitation scores. The grade card must be prominently displayed and assigns a letter grade based on the sanitation inspection score:

| Score  | Card  |
| :---:  | :---: |
| 90-100 |   A   |
| 80-89  |   B   |
| 70-79  |   C   |
| <70    | numeric score displayed |

The timeline of events that led to the introduction of the grade cards was as follows:

1. **17 November 1997** – Hidden-camera news investigation reveals unsanitary conditions in LA restaurants
2. **17 December 1997** – County board of supervisors votes to implement grade card system
3. **16 January 1998** – Inspectors begin issuing grade cards

The data used is:

- Department of Health Services (DHS) inspections
    - hygiene scores
    - restaurant characteristics
- Name and address of each restaurant
    - demographic data
    - information on local businesses
- Name and Yellow Pages
    - cuisine type
- Zagat Survey restaurant guide
    - restaurants included
    - associated review scores
- Ownership information (from DHS)
    - company-owned or franchise

## Question 4

> Is the data sufficient to answer the question?

The data should be sufficient. It provides a way of identifying chains from non-chains and franchises from company-owned restaurants, which is the key heterogeneity needed to evaluate the hypothesis. To answer the question about the regional effects of learning, they have demographic and employment data for the region corresponding to each restaurant.

The key assumption throughout the paper is that the introduction of the inspection grade card system is exogenous, which the authors argue is plausible because its introduction was a) unexpected and b) rapid.

## Question 5

> Summarize the main results.

### Chain Effects

#### Basic Model

The first regression looks at hygiene scores before the introduction of the grade card system and tests to see whether chains and franchises have on average different hygiene scores than non-chains:

$$s^b_{ijt} = 
    \alpha_j 
    + \beta c_i
    + \gamma f_i
    + \delta_1 nchain_i
    + \delta_2 perchain_i
    + X_i \theta 
    + \varepsilon_{ijt}$$
    
| Variable     | Description                                                                         |
| :----------- | :---------------------------------------------------------------------------------- |
| $s^b_{ijt}$  | pre-intervention inspection score of restaurant $i$ in region $j$ at inspection $t$ |
| $\alpha_j$   | region fixed effect                                                                 |
| $c_i$        | indicator for chain restaurant                                                      |
| $f_i$        | indicator for franchised unit                                                       |
| $nchain_i$   | number of restaurants in LA belonging to the same chain as restaurant $i$           |
| $perchain_i$ | percentage of US chain units located in LA                                          |
| $X_i$        | observable characteristics of restaurant $i$                                        |

The model predicts:
- $\beta > 0$ – chains have higher pre-intervention hygiene scores than non-chains
- $\gamma < 0$ – franchised units have lower pre-intervention hygiene scores than company-owned chain units

For the second point, note that the company-owned chain effect is $\beta$ while the franchised chain effect is $\beta + \gamma$. Since we hypothesized that franchised units have lower marginal revenue curves than company-owned units, we expect $\beta + \gamma < \beta$, i.e. $\gamma < 0$.
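
To see how $\beta$ and $\beta + \gamma$ are read off such a regression, here is a minimal sketch on simulated data. Everything in it is hypothetical: the variable names `chain`, `franch` and `score`, the sample size, and the "true" effects (4 and -1) are chosen for illustration and are not the paper's data or estimates. Region fixed effects and the controls $X_i$ are left out to keep it short.

```stata
preserve                                   // leave any data in memory untouched
clear
set seed 486
set obs 3000
* Hypothetical restaurants: chain indicator, franchised units as a subset of chains
gen byte chain  = runiform() < 0.4
gen byte franch = chain * (runiform() < 0.5)
* Assumed effects, for illustration only: beta = 4, gamma = -1
gen score = 75 + 4*chain - 1*franch + rnormal(0, 5)
quietly regress score chain franch
* Company-owned chain effect is _b[chain]
di "beta  (company-owned vs non-chain) = " %6.3f _b[chain]
di "gamma (franchised vs company-owned) = " %6.3f _b[franch]
* Franchised chain effect relative to non-chains is beta + gamma
lincom chain + franch
restore
```

`lincom` reports the sum $\beta + \gamma$ with its own standard error, which is the franchised-versus-non-chain comparison.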

The results of this regression are reported in the first column of Table 3 (included at the end of this heading).

To control for unobserved restaurant-level heterogeneity, the authors run a second regression:

$$s_{it} = 
    \alpha_i
    + \beta_0 g_t
    + \beta g_t c_i
    + \gamma g_t f_i
    + \delta_1 g_t nchain_i
    + \delta_2 g_t perchain_i
    + \varepsilon_{it}$$
    
| Variable     | Description                                                               |
| ------------ | ------------------------------------------------------------------------- |
| $s_{it}$     | inspection score of restaurant $i$ at inspection $t$                      |
| $\alpha_i$   | restaurant fixed effect                                                   |
| $g_t$        | indicator for whether grade cards were in place at inspection $t$         |
| $c_i$        | indicator for chain restaurant                                            |
| $f_i$        | indicator for franchised unit                                             |
| $nchain_i$   | number of restaurants in LA belonging to the same chain as restaurant $i$ |
| $perchain_i$ | percentage of US chain units located in LA                                |

The model predicts:
- $\beta < 0$ - the effect on inspection scores is smaller for chains than for non-chains
- $\gamma > 0$ - the effect on inspection scores is bigger for franchise units than for company-owned units
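
The structure of this second regression can be sketched on a simulated panel. This is an illustration under assumptions of my own: the variable names (`id`, `chain`, `g`, `gc`, `s`), the panel dimensions, and the effects ($\beta_0 = 5$, $\beta = -3$) are hypothetical, and the franchise and chain-size interactions are omitted for brevity.

```stata
preserve                               // leave any data in memory untouched
clear
set seed 486
set obs 500                            // hypothetical restaurants
gen id = _n
gen byte chain = runiform() < 0.4
gen alpha = rnormal(0, 3) + 4*chain    // restaurant effect, higher for chains
expand 6                               // six inspections per restaurant
bysort id: gen t = _n
gen byte g = t > 3                     // grade cards in place for inspections 4-6
gen gc = g * chain                     // interaction g_t * c_i
* Assumed effects, for illustration only: beta_0 = 5, beta = -3
gen s = 80 + alpha + 5*g - 3*gc + rnormal(0, 2)
areg s g gc, absorb(id)
restore
```

The chain indicator itself does not appear as a regressor: it is constant within a restaurant, so the restaurant fixed effect absorbs it, and only its interaction with $g_t$ is identified.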

They also re-estimate the model using only the subsample of restaurants with Zagat reviews and with a dummy for an A grade rather than a continuous inspection score. The results are reported in the second column of Table 3.

#### Cost Heterogeneity

To control for heterogeneity in marginal cost curves, the authors run the following regression that controls for the average post-intervention sanitation score $\overline s_i^a$:

$$s^b_{ijt} = 
    \alpha_j
    + \beta c_i
    + \gamma f_i
    + \delta_1 nchain_i
    + \delta_2 perchain_i
    + \delta \overline s_i^a
    + X_i \theta 
    + \varepsilon_{ijt}$$
    
Since including $\overline s_i^a$ theoretically puts restaurants on the same marginal cost curve, the interpretation of $\beta$ and $\gamma$ is as before, so the model still predicts $\beta > 0$ and $\gamma <0$. They run this regression with and without city fixed effects. The results are reported in columns 3 and 4 of Table 3.

![Table 3](./images/table3.png)

For all specifications, $\beta$ has the correct sign and is significantly different from zero. Furthermore, $\gamma$ has the correct sign in all specifications, and is significantly different from zero in all but column 4.

| Specification | Coefficient |  Prediction  | Estimate | Standard Error | Significance Level |
| :-----------: | :---------: | :----------: | -------: | :------------: | :----------------: |
|       1       |   $\beta$   | $\beta  > 0$ |   3.7283 |     0.8761     |         1%         |
|       1       |  $\gamma$   | $\gamma < 0$ |  -0.5722 |     0.2789     |         5%         |
|       2       |   $\beta$   | $\beta  < 0$ |  -3.9350 |     0.5745     |         1%         |
|       2       |  $\gamma$   | $\gamma > 0$ |   1.0948 |     0.3924     |         1%         |
|       3       |   $\beta$   | $\beta  > 0$ |   4.6846 |     1.2806     |         1%         |
|       3       |  $\gamma$   | $\gamma < 0$ |  -1.5693 |     0.4601     |         1%         |
|       4       |   $\beta$   | $\beta  > 0$ |   2.7024 |     0.8806     |         1%         |
|       4       |  $\gamma$   | $\gamma < 0$ |  -0.1556 |     0.2739     |        none        |

### Regional Variation

The model of regional variation is built up as follows:

1. $I_{ij} = I(c_i, g, r_j)$
    - $I_{ij}$ - level of consumer information about restaurant $i$ in region $j$
    - $c_i$ - indicator for being a chain
    - $g$ - indicator for grade card system being in place
    - $r_j$ - degree of repeat business in region $j$
2. ${MR}(h_i, I(c_i, g, r_j), w_j)$
    - $h_i$ - hygiene level of restaurant $i$
    - $w_j$ - net value of all other local characteristics
    - Note that region $j$ only affects MR through consumer information $I_{ij}$
3. ${MC}(h_i, c_i, w_j)$
4. $I(c_i, 1, r_j) = I(c_i,\overline r)$
    - $\overline r$ - level of consumer learning from grade cards
    - This is the assumption that the grade cards equalize consumer information
5. $h^*(c_i, g, r_j, w_j, \overline r)$ solves ${MR}={MC}$
    - $h^*_{ij}(g=0) = \underbrace{a_1 r_j + a_2 w_j + a_3 r_j w_j}_{\alpha^b_j} + b_1 c_i$
    - $h^*_{ij}(g=1) = \underbrace{a_1 \overline r + a_2 w_j + a_3 \overline r w_j}_{\alpha^a_j} + b_1 c_i$
6. $H_0: r_j = r$
    - Under $H_0$, can rearrange the equations defining $\alpha^b_j$ and $\alpha^a_j$ to get:
    - $\alpha^a_j = \kappa_1 + \kappa_2 \alpha^b_j$
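
The rearrangement in step 6 can be written out explicitly. The expressions for $\kappa_1$ and $\kappa_2$ below are my own algebra from the definitions of $\alpha^b_j$ and $\alpha^a_j$ in step 5, not quoted from the paper, and they assume $a_2 + a_3 r \neq 0$. Under $H_0$:

$$
\begin{aligned}
\alpha^b_j &= a_1 r + (a_2 + a_3 r)\, w_j
\quad\Rightarrow\quad
w_j = \frac{\alpha^b_j - a_1 r}{a_2 + a_3 r} \\
\alpha^a_j &= a_1 \overline r + (a_2 + a_3 \overline r)\, w_j
= \underbrace{a_1 (\overline r - \kappa_2 r)}_{\kappa_1}
+ \underbrace{\frac{a_2 + a_3 \overline r}{a_2 + a_3 r}}_{\kappa_2} \alpha^b_j
\end{aligned}
$$

As a check, setting $a_3 = 0$ gives $\kappa_2 = 1$ and $\kappa_1 = a_1(\overline r - r)$, so the post- and pre-intervention region effects differ by a constant.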

The unrestricted model consists of two separate regressions:

1. $s^b_{ijt} = \alpha^b_j + \beta^b c_i + \gamma^b f_i + X_i \theta^b + \varepsilon_{ijt}$
2. $s^a_{ijt} = \alpha^a_j + \beta^a c_i + \gamma^a f_i + X_i \theta^a + \varepsilon_{ijt}$

They test the null hypothesis in two ways.

First, they assume that $a_3 = 0$ (see the definition of $h^*$). Under the null, this implies that $\alpha^a_j - \alpha^b_j = a_1(\overline r - r)$, which is constant across regions $j$. The test then becomes whether $\alpha^a_j - \alpha^b_j$ is constant across regions, using different definitions for regions.

Second, they allow for $a_3 \neq 0$ and use the fact that $r_j=r$ implies $\alpha^a_j = \kappa_1 + \kappa_2 \alpha^b_j$. The restricted model is the two regressions:

1. $s^b_{ijt} = \alpha^b_j + \beta^b c_i + \gamma^b f_i + X_i \theta^b + \varepsilon_{ijt}$
2. $s^a_{ijt} = \kappa_1 + \kappa_2 \alpha^b_j + \beta^a c_i + \gamma^a f_i + X_i \theta^a + \varepsilon_{ijt}$

Note that this equation is nonlinear in parameters (because of the $\kappa_2 \alpha^b_j$ term), so it must be estimated using nonlinear least squares.

The sum of squared residuals from this equation is the $RSS_r$ used for the F-statistic.
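
For reference, the usual F-statistic for comparing a restricted and an unrestricted least-squares fit has the form below. This is the textbook expression, written with symbols I am introducing here ($RSS_u$, $q$, $n$, $k$); the exact degrees-of-freedom bookkeeping in the paper may differ.

$$F = \frac{(RSS_r - RSS_u)/q}{RSS_u/(n - k)}$$

Here $RSS_u$ is the residual sum of squares from the unrestricted model, $q$ is the number of restrictions, $n$ is the number of observations, and $k$ is the number of parameters in the unrestricted model. On my reading, with $J$ regions the restriction replaces the $J$ free post-intervention region effects $\alpha^a_j$ by the two parameters $\kappa_1$ and $\kappa_2$, so $q = J - 2$.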

The results are reported in Table 5:

![Table 5](./images/table5.png)

Note that all F-statistics lead to rejection of the null of no regional variation in consumer learning at the 1% significance level.

## Question 6

> On the course webpage, you can find the paper’s dataset `data-for-AEJ.dta`, and the file `readme-for-AEJ.txt` describes the variables. Attempt to replicate Table 2 and Table 3 (optional) of the paper.

In [1]:
* Prepare Stata
clear all
cap log close _all
set more off
set linesize 80
use input/data-for-AEJ, clear

Our goal is to replicate Table 2:

![Table 2](./images/table2.png)

From a theoretical perspective, this is straightforward. All we have to do is run the appropriate regressions and report the RSS and $R^2$. The complicated bit is slogging through the codebook to figure out which variables to include in each regression.

### Data to Use

> Table 2 presents variance decompositions in which observation is a restaurant inspection before the introduction of grade cards.

Since we only use observations before the regulation, we can drop all the other data.

In [2]:
keep if reg_yes==0
cap mkdir data
save data/q6, replace


(43,321 observations deleted)


file data/q6.dta saved


### Constant

> Conditioning the observed scores on quarterly dummies and inspection regime dummies explains 4 percent of the score variation.

Which variables contain dummies for quarters and inspection regimes? We've already conditioned on the inspection regime using `reg_yes`. The readme tells us `yyqq_*` contains year-quarterly dummies.

According to the note in the table, all specifications use the dummies `yyqq_*`, so saying we're regressing on a constant here means we regress the inspection scores on the year-quarter dummies and a constant.

In [3]:
reg score yyqq_*, robust notable // display header but not coefficient table

note: yyqq_11 omitted because of collinearity
note: yyqq_12 omitted because of collinearity
note: yyqq_13 omitted because of collinearity
note: yyqq_14 omitted because of collinearity

Linear regression                               Number of obs     =     83,790
                                                F(10, 83779)      =     538.92
                                                Prob > F          =     0.0000
                                                R-squared         =     0.0419
                                                Root MSE          =     14.412



Now my `e()` vector should contain all the information I need for the table. For a full list of everything included in `e()`, type `ereturn list`.

In [4]:
di "RSS = " e(rss)
di "R^2 = " e(r2)
di "df  = " e(df_m) // df for the model (as opposed to df_r for residual)


RSS = 17402286

R^2 = .04191082

df  = 10


If you want to make your output more readable, you can [format the display command](https://www.stata.com/manuals13/dformat.pdf):

In [5]:
di "RSS = " %-13.0fc e(rss)
di "R^2 = " %-5.4f e(r2)
di "df  = " e(df_m)


RSS = 17,402,286   

R^2 = 0.0419

df  = 10


In order to make the table, you need to store these values. You can either do that using a macro or by generating a variable. I'll go through the code first using variables, then at the end show you how to get the same result using local macros and loops with much less code.

In [6]:
quietly {
    gen rss1 = e(rss)
    gen r21  = e(r2)
    gen df1  = e(df_m)
}
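
As a quick check on what `e(rss)` and `e(r2)` hold, the following sketch runs a regression on Stata's built-in `auto` dataset (not the paper's data) and rebuilds $R^2$ from the stored sums of squares. It is wrapped in `preserve`/`restore` so the problem-set data in memory is left alone; note that it does overwrite the current `e()` results, so run it only after the values above have been stored.

```stata
preserve                              // keep the problem-set data untouched
sysuse auto, clear                    // built-in example dataset
quietly regress price mpg weight
ereturn list                          // everything regress leaves behind in e()
* R^2 is the model sum of squares over the total sum of squares
di "R^2 from e(r2)           = " %6.4f e(r2)
di "R^2 from e(mss) and e(rss) = " %6.4f (e(mss) / (e(mss) + e(rss)))
restore
```

The two displayed numbers should agree, which confirms that `e(rss)` is the residual sum of squares reported in Table 2.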

### Restaurant Characteristics

There are 38 observable characteristics (the $X_i$ from the regressions), and finding them all in the codebook is a bit of a pain. I'll type the command with comments for each class of variables:

In [7]:
quietly reg score yyqq_*            ///
    service*                        /// type of inspection
    lflag_*                         /// liquor license dummies
    f_* missfdty                    /// type of food
    sty_* missstyl                  /// style of restaurant
    oldchainyes                     /// chain
    oldindown                       /// independently-owned
    zagat*                          /// zagat survey score and dummy
    restage misage                  /// restaurant age in years
    risk*                           /// risk assessment dummies
    small_rest big_rest mis_bigrest /// restaurant size dummies
    , robust
    
di "RSS = " %-13.0fc e(rss)
di "R^2 = " %-5.4f e(r2)
di "df  = " e(df_m)

quietly {
    gen rss2 = e(rss)
    gen r22  = e(r2)
    gen df2  = e(df_m)
}



RSS = 16,140,837   

R^2 = 0.1114

df  = 47



### City Fixed Effects

This time the variable is actually coded in a sensible way and is easy to find in the readme: `locctyid`. Since this is encoded, we can use factor variables to generate dummies in the regression.
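
If factor-variable notation is unfamiliar, this small sketch on the built-in `auto` dataset (not the paper's data) compares `i.` notation with hand-made dummies. The names `rep_` and `df_fv` are placeholders I chose for the example.

```stata
preserve                               // keep the problem-set data untouched
sysuse auto, clear                     // built-in example dataset
* Factor-variable notation: Stata builds the dummies on the fly
quietly regress price i.rep78
local df_fv = e(df_m)
* Same regression with explicit dummies (rep78 == 1 is the omitted base)
quietly tabulate rep78, generate(rep_)
quietly regress price rep_2 rep_3 rep_4 rep_5
di "df with i.rep78: `df_fv'; df with explicit dummies: " e(df_m)
restore
```

Both versions use one regressor per category beyond the base, which is why `e(df_m)` in the city regression counts the city dummies along with the year-quarter dummies.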

In [8]:
quietly reg score yyqq_* i.locctyid, robust

di "RSS = " %-13.0fc e(rss)
di "R^2 = " %-5.4f e(r2)
di "df  = " e(df_m)

quietly {
    gen rss3 = e(rss)
    gen r23  = e(r2)
    gen df3  = e(df_m)
}



RSS = 14,614,335   

R^2 = 0.1954

df  = 159



### Zip Code Fixed Effects

Again, we can find the variable that stores restaurant zip codes easily in the readme:

In [9]:
quietly reg score yyqq_* i.addrzip, robust

di "RSS = " %-13.0fc e(rss)
di "R^2 = " %-5.4f e(r2)
di "df  = " e(df_m)

quietly {
    gen rss4 = e(rss)
    gen r24  = e(r2)
    gen df4  = e(df_m)
}



RSS = 13,298,310   

R^2 = 0.2679

df  = 313



### Restaurant Fixed Effects

The restaurant identifiers are recorded in `dhsid`, which is a string variable. Remember we can't put string variables into regressions, but we can [encode](https://www.stata.com/manuals13/dencode.pdf) them, which assigns a number to each unique value and labels that number with the original string.

In [10]:
encode dhsid, gen(id)

Note that the two columns look the same, because `list` displays the value labels of `id` rather than its underlying codes:

In [11]:
list dhsid id in 1/10


     +-------------+
     | dhsid    id |
     |-------------|
  1. |     1     1 |
  2. |     1     1 |
  3. |     1     1 |
  4. |     1     1 |
  5. |    10    10 |
     |-------------|
  6. |    10    10 |
  7. |    10    10 |
  8. |    10    10 |
  9. |    10    10 |
 10. |   100   100 |
     +-------------+


But now Stata knows to treat id as a numeric variable:

In [12]:
codebook dhsid id


--------------------------------------------------------------------------------
dhsid                                                    scrambled restaurant id
--------------------------------------------------------------------------------

                  type:  string (str5)

         unique values:  22,202                   missing "":  0/83,790

              examples:  "14399"
                         "18006"
                         "2305"
                         "4953"

--------------------------------------------------------------------------------
id                                                       scrambled restaurant id
--------------------------------------------------------------------------------

                  type:  numeric (long)
                 label:  id, but 1 nonmissing value is not labeled

                 range:  [1,22202]                    units:  1
         unique values:  22,202                   missing .:  0/83,790

              examples:
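
One thing to keep in mind with `encode` is that the underlying codes follow the alphabetical order of the strings, not their numeric value. A toy sketch with a made-up variable `sid` (hypothetical, not from the dataset) shows the difference:

```stata
preserve                       // keep the problem-set data untouched
clear
input str5 sid
"1"
"10"
"100"
"2"
end
encode sid, gen(code)
list sid code                  // value labels are shown, so the columns look identical
list sid code, nolabel         // underlying codes are assigned in alphabetical order
* Count rows where the code happens to equal the number written in the string
quietly count if code == real(sid)
scalar n_match = r(N)
di "Rows where code equals the number in the string: " n_match
restore
```

For fixed effects only the grouping matters, so this is harmless here. If the numeric value itself were needed, `destring` would be the more natural tool.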

We could conceivably use factor variables to generate the dummies we need, but notice how many unique values this takes: 22,202! Remember that Stata uses matrices to run regressions, and this would create a matrix with 22,202 columns (not counting the year-quarter dummies!). Thankfully, Stata has a command to handle regressions with large numbers of dummy variables: `areg`. [See the documentation](https://www.stata.com/manuals13/rareg.pdf) for syntax and examples.
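
To see that `areg` is a shortcut rather than a different estimator, this sketch compares it with explicit dummies on the built-in `auto` dataset (not the paper's data), where the number of categories is small enough to do both. The scalar name `rss_dummies` is a placeholder of mine.

```stata
preserve                                  // keep the problem-set data untouched
sysuse auto, clear                        // built-in example dataset
quietly regress price weight i.rep78      // explicit category dummies
scalar rss_dummies = e(rss)
quietly areg price weight, absorb(rep78)  // same dummies, absorbed
di "RSS with explicit dummies: " %14.0fc rss_dummies
di "RSS with areg:             " %14.0fc e(rss)
di "e(df_m) = " e(df_m) ",  e(df_a) = " e(df_a)
restore
```

The residual sums of squares match, while the degrees of freedom are split between the regressors that are still estimated explicitly and the absorbed categories.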

In [13]:
areg score yyqq_*, absorb(id)

note: yyqq_11 omitted because of collinearity
note: yyqq_12 omitted because of collinearity
note: yyqq_13 omitted because of collinearity
note: yyqq_14 omitted because of collinearity

Linear regression, absorbing indicators         Number of obs     =     83,790
Absorbed variable: id                           No. of categories =     22,202
                                                F(  10,  61578)   =     729.87
                                                Prob > F          =     0.0000
                                                R-squared         =     0.6242
                                                Adj R-squared     =     0.4886
                                                Root MSE          =    10.5290

------------------------------------------------------------------------------
       score |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      yyqq_1 |  -9.49710

The contents of `e()` are slightly different for this command: `e(df_m)` counts only the non-absorbed regressors, while the absorbed restaurant dummies are counted in `e(df_a)`:

In [14]:
di "RSS = " %-13.0fc e(rss)
di "R^2 = " %-5.4f e(r2)
di "dfm = " e(df_m)
di "dfa = " e(df_a)

quietly {
    gen rss5 = e(rss)
    gen r25  = e(r2)
    gen df5  = e(df_a)
}


RSS = 6,826,503    

R^2 = 0.6242

dfm = 10

dfa = 22201



That handles almost all of the rows, but there's one last row for the number of observations. Because that's recorded in the same column as the number of regressors, we'll generate a sixth `df` variable to capture this.

In [15]:
quietly gen df6 = e(N)

Finally, it's a good idea to save this dataset.

In [16]:
cap mkdir data
save data/table2, replace



file data/table2.dta saved


### Creating the Table

Now that we have all of our data, we need to rearrange it into a table that we can print. As usual, we can use some combination of the `reshape` and `collapse` commands to achieve this.

First, we collapse the data.

In [17]:
use data/table2, clear
collapse df* rss* r2*

Next, we reshape the data from wide to long.

In [18]:
gen i = _n
reshape long df rss r2, i(i) j(regression)
drop i



(note: j = 1 2 3 4 5 6)
(note: rss6 not found)
(note: r26 not found)

Data                               wide   ->   long
-----------------------------------------------------------------------------
Number of obs.                        1   ->       6
Number of variables                  17   ->       5
j variable (6 values)                     ->   regression
xij variables:
                        df1 df2 ... df6   ->   df
                     rss1 rss2 ... rss6   ->   rss
                        r21 r22 ... r26   ->   r2
-----------------------------------------------------------------------------



We have all of the data we need in the table format we need it in:

In [19]:
list


     +----------------------------------------+
     | regres~n      df        rss         r2 |
     |----------------------------------------|
  1. |        1      10   1.74e+07   .0419108 |
  2. |        2      47   1.61e+07   .1113604 |
  3. |        3     159   1.46e+07   .1954025 |
  4. |        4     313   1.33e+07   .2678567 |
  5. |        5   22201    6826503   .6241644 |
     |----------------------------------------|
  6. |        6   83790          .          . |
     +----------------------------------------+


Now we just need to format this to make it more readable.

In [20]:
* Format data for display
format df  %15.0fc
format rss %15.0fc
format r2  %15.4f

* Label values of regression variable
lab def reglab ///
    1 "Constant" ///
    2 "Restaurant Characteristics" ///
    3 "City Fixed Effects" ///
    4 "Zip Code Fixed Effects" ///
    5 "Restaurant Fixed Effects" ///
    6 "Observations"
lab val regression reglab
lab var regression "Specification"

* List with no separators
list, sep(6)









     +-----------------------------------------------------------+
     |                 regression       df          rss       r2 |
     |-----------------------------------------------------------|
  1. |                   Constant       10   17,402,286   0.0419 |
  2. | Restaurant Characteristics       47   16,140,837   0.1114 |
  3. |         City Fixed Effects      159   14,614,335   0.1954 |
  4. |     Zip Code Fixed Effects      313   13,298,310   0.2679 |
  5. |   Restaurant Fixed Effects   22,201    6,826,503   0.6242 |
  6. |               Observations   83,790            .        . |
     +-----------------------------------------------------------+


### Concise Method

Now that you know how the code works, I'm going to try cleaning it up a bit using macros and loops.

In [21]:
use data/q6, clear
encode dhsid, gen(id)

First, I want to create macros for the regressors (independent variables) in each regression:

In [22]:
local X1 ""
local X2 "service* lflag_* f_* sty_* missfdty missstyl oldchainyes oldindown zagatyes zagatfood restage misage risk* small_rest big_rest mis_bigrest"
local X3 "i.locctyid"
local X4 "i.addrzip"
local X5 ""

Next, I want to loop through rows of the table and save the data I need in macros. I can do this with a `forval` loop, but I need to include some special code to handle the case where we use the `areg` command instead of `reg`:

In [23]:
forval i = 1/5 {
    * Use `areg` for row 5, `reg` otherwise
    local cmd = cond(`i'==5, "areg", "reg")
    local abs = cond(`i'==5, "absorb(id)", "")
    
    * Run the regression
    quietly `cmd' score yyqq_* `X`i'', `abs' robust
    
    * Store output in macros
    local rss_`i' = e(rss)
    local r2_`i'  = e(r2)
    local df_`i'  = cond(`i'==5, e(df_a), e(df_m)) // use e(df_a) for `areg`
}
local df_6 = e(N)
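
The loop above leans on two tricks: nested macro expansion and `cond()` returning a string. A stripped-down sketch of both follows; the macro names `demo1`, `demo2` and `demo_cmd` are hypothetical, chosen so they do not overwrite `X1`-`X5` or `cmd`.

```stata
local demo1 "mpg"
local demo2 "mpg weight"
forval i = 1/2 {
    * Nested expansion works inside-out: i is substituted first, then demo1 or demo2
    di "Regressors in specification `i': `demo`i''"
    * cond() picks one of two strings depending on the loop index
    local demo_cmd = cond(`i' == 2, "areg", "reg")
    di "Command for specification `i': `demo_cmd'"
}
```

Displaying the expanded strings like this before running the regressions is a cheap way to debug a loop that builds commands from macros.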

Now we can just type out the table:

In [24]:
quietly {
n di "                            Number of  Sum of squared"
n di "                            variables    residuals        R^2"
n di _dup(62) "-"
n di "Constant                    " %8.0fc `df_1' "  " %13.0fc `rss_1' "     " %6.4f `r2_1'
n di "Restaurant Characteristics  " %8.0fc `df_2' "  " %13.0fc `rss_2' "     " %6.4f `r2_2'
n di "City Fixed Effects          " %8.0fc `df_3' "  " %13.0fc `rss_3' "     " %6.4f `r2_3'
n di "Zip Code Fixed Effects      " %8.0fc `df_4' "  " %13.0fc `rss_4' "     " %6.4f `r2_4'
n di "Restaurant Fixed Effects    " %8.0fc `df_5' "  " %13.0fc `rss_5' "     " %6.4f `r2_5'
n di "Observations                " %8.0fc `df_6'
}


                            Number of  Sum of squared
                            variables    residuals        R^2
--------------------------------------------------------------
Constant                          10     17,402,286     0.0419
Restaurant Characteristics        47     16,140,837     0.1114
City Fixed Effects               159     14,614,335     0.1954
Zip Code Fixed Effects           313     13,298,310     0.2679
Restaurant Fixed Effects      22,201      6,826,503     0.6242
Observations                  83,790


Obviously the code is not very clean, but the output looks nice.

For further study, you can download all of the code used for the paper on the [AEA website](https://www.openicpsr.org/openicpsr/project/114314/version/V1/view).