--- 


# Replication of Joshua D. Angrist, Erich Battistin, and Daniela Vuri  (2017) <a class="tocSkip">   
---

This file is a modification of material from (https://github.com/FedericoAlexanderRizzuto). I have not obtained the author's permission to revise and reuse it, so **please do not redistribute this file**.

This notebook introduces the replication of the following paper:

> Angrist, J. D., Battistin, E., and Vuri, D. (2017). In a small moment: Class size and moral hazard in the Italian Mezzogiorno. _American Economic Journal: Applied Economics, 9(4)_, 216-49.

The original paper, data, and code can be accessed [here](https://www.aeaweb.org/articles?id=10.1257/app.20160267).



##### Information :
* Almost all files are the same as in [the original repository](https://www.openicpsr.org/openicpsr/project/113698/version/V1/view), except for the following two points.
    * I converted the dta file into a CSV file and added "paper.pdf" (the original paper) and "appendix.pdf".
    * The variable descriptions are included in varlabel.txt (this information was originally stored in the dta file).
* For the other files, please read "Readme.pdf".

##### Notice :

* The authors apparently provide two datasets that differ in the unit of observation: the class or the individual student.
    * However, I could not find the latter dataset (as of 2023/8/24).
    * The analysis is carried out at the class level, and the latter dataset is used solely to produce Figure 7.
    * **So, please skip Figure 7 when you try to replicate.**
* **Please also replicate Figure 1 later**,
    * because the authors did not provide the replication code for this figure.
* I found Table 1 to be partly incorrect, so please disregard minor differences.


<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#1.-Introduction" data-toc-modified-id="1.-Introduction-1">1. Introduction</a></span></li><li><span><a href="#2.-Identification-Strategy" data-toc-modified-id="2.-Identification-Strategy-2">2. Identification Strategy</a></span></li><li><span><a href="#3.-Empirical-Strategy" data-toc-modified-id="3.-Empirical-Strategy-3">3. Empirical Strategy</a></span></li></ul></div>

---
# 1. Introduction 
---

- The relationship between class size and student achievement has long been at the center of public and academic discussions on educational policy. 
    - This is also the case in Japan (if you are interested, please see the following materials, both in Japanese).
        - [Hideo Akabayashi (2020), "Humony Special Series 4, Part 1: The Questionable Cost-Effectiveness of 'Small Class Sizes'"](https://humonyinter.com/column/eco/edu2-01/)
        - [Masakazu Hojo (2023), *The Economics of Small Class Sizes*, Keio University Press](https://www.kinokuniya.co.jp/f/dsg-01-9784766428889)
    - It is a widely held belief among families, teachers, and educationalists that reducing class size has sizable positive effects on student achievement.
    
    

- Social scientists have devoted their attention to the topic as well, attempting to quantify potential gains from smaller classes. 

    - The evidence provides some support for the view that smaller classes enhance student performance, although the benefits appear to be small and concentrated at an early stage of education.



- Angrist, Battistin, and Vuri (2017) add to the literature by investigating the implications of test score manipulation for research on class-size reduction (CSR), using data on INVALSI tests, yearly standardized tests that cover mathematics and Italian language skills and are administered to certain grades of all Italian schools.
    - Based on Angrist and Lavy (1999), the authors first employ a **parametric fuzzy regression discontinuity (RD) design** that exploits mandatory class-size caps to estimate the effect of smaller classes on student achievement for second and fifth-graders, then examine the role of score manipulation in explaining the substantial gap in estimates of class-size effects between Southern Italy and the rest of the country. In fact, it appears that teachers in the South have engaged to a greater extent in shirking when correcting and transcribing the results of the paper-based tests. The final model, which is augmented with an **instrumental variable** for score manipulation, finds that class-size effects vanish. The main contribution of this paper lies in illustrating how in certain test settings smaller classes can incentivize test score manipulation and thereby drive estimates of class-size effects.

---
# 2. Identification Strategy
---

- Researchers attempting to capture the causal effect of class size (the treatment variable) on student achievement (the outcome variable) are confronted with well-known issues of identification.


- Given that treatment is not randomly assigned, pupils attending smaller classes can be expected to differ in meaningful ways from those attending larger ones.
    - It is not readily apparent whether the **selection bias** should be expected to be positive or negative. 
        - For example, if educational investment in one's child is a positive function of income, better-off families may actively seek out schools with smaller classes and wealthier communities may invest in and/or lobby for CSR interventions. 
        - At the same time, smaller classes may be more frequent in the less densely populated rural areas, which tend to be poorer.    
        
    - It does not suffice to condition on the observable characteristics of the students.
        - There is no assurance that the factors determining selection are all observed by the researcher.
        
        
- Test scores (the outcome variable $Y$) are affected by class size (the treatment variable of interest $D_1$) and observables $W$ like parental education, or immigrant status, but also by unobservables $U_1$, e.g. innate ability, neighborhood or school management.

- While both $W$ and $U_1$ may have a causal effect on selection into treatment $D_1$ *and* the outcome $Y$, only the backdoor path $D_1 \leftarrow W \rightarrow Y$ can be blocked since it is impossible to condition on unobservables.
    - Fortunately, agents have imperfect control over the treatment variable $D_1$.
        - In fact, while families apply to the school of their choice in February, the composition of classes is unknown until shortly before school starts in September, at which point reassignments are difficult and need to be authorized. 
        - In addition, funding and most organizational aspects, especially when it comes to primary education, depend on the central government. Therefore, treatment is not at the full discretion of the school staff either. 
    
    
    
- The nation-wide class-size minima and maxima result in what Angrist and Lavy called **Maimonides' Rule**, which predicts class size to be a nonlinear and discontinuous function of grade enrollment (see Figures 2 and 3 below). This provides **a credible source of the exogenous variation needed to deal with the selection bias, and the nonlinear nature of the function makes it possible to apply the fuzzy RD design.** 
    - As shown in Figures 2 and 3, $D_1$ can be instrumented by the class size predicted by Maimonides' Rule $Z_1$, which in turn is a function of grade enrollment (the running variable $X$). Note that $Z_1$ is assumed to be a conditional IV, meaning that the running variable $X$ might affect the outcome $Y$ and needs to be conditioned on.


- However, Angrist et al. uncover another problematic causal nexus, namely test score manipulation by teachers (the second treatment variable $D_2$) being related to class size. 
    - In particular, class size $D_1$ tends to have a negative effect on $D_2$ because in larger classes it is more likely that tests are corrected and the results transcribed by more than one teacher. It cannot be assumed that $Y$ and $D_2$ are free of common unobserved drivers, as suggested by the fact that the variation in manipulation patterns across regions cannot be accounted for. Hence the indirect path $D_1 \rightarrow D_2 \rightarrow Y$ cannot be closed simply by conditioning on score manipulation $D_2$.
    - Therefore, **another source of exogenous variation is needed to identify the effect of $D_2$ on $Y$.**
    - Since **monitors** are shown to deter manipulation and are **randomly assigned** by the INVALSI institute, their presence at the institution $Z_2$ can be used as an instrument for $D_2$. 

**Figure E1: Causal Graph**
![ERROR: Here causal_graph should be displayed](./causal_graph.JPG)



In short, Maimonides' Rule $Z_1$ is a valid instrumental variable because it arguably affects the outcome variable $Y$ only through $D_1$ once the back-door path $Z_1 \leftarrow X \rightarrow Y$ is closed by conditioning on grade enrollment $X$; monitoring $Z_2$ is a valid IV because random assignment guarantees that it affects $Y$ only via $D_2$. By simultaneously instrumenting $D_1$ and $D_2$ while conditioning on $W$ and $X$, the effect of class size on test scores is finally identified.

---
# 3. Empirical Strategy
---

The starting point for implementing the identification strategy outlined above is the following equation:

$\qquad (2) \qquad \qquad y_{igkt} = \rho_0(t,g) + \beta s_{igkt} + \rho_1 r_{gkt} + \rho_2 r^2_{gkt} + \epsilon_{igkt} $

where $y_{igkt}$, the standardized math or Italian language test score for class $i$ in grade $g$ at school $k$ in year $t$, is modeled as a function of year and grade controls $\rho_0(t,g)$, class size divided by ten $s_{igkt}$, and a second-order polynomial in grade enrollment $r_{gkt}$. Controlling for a quadratic in the running variable makes it possible to exploit the nonlinearity of Maimonides' Rule:

$\qquad (1) \qquad \qquad f_{igkt} = \frac{r_{gkt}}{\mathrm{int}\left[(r_{gkt}-1)/c_{gt}\right]+1} $

where $f_{igkt}$, the predicted size of class $i$, is a function of grade enrollment and of the relevant maximum number of students per class $c_{gt}$ (25 for classes unaffected by the 2009 reform and 27 for those affected). After a benchmark OLS estimation of equation $\text{(2)}$, $s_{igkt}$ is instrumented by $f_{igkt}$. The following table summarizes the set of controls $\rho_0(t,g)$:

|                      **Observables**                                  ||
|------------------------------------|-----------------------------------|
| Female                             | Nonresponses - Female             |
| Immigrant                          | Nonresponses - Immigrant          |
| Father high school dropout         | Father high school graduate       |
| Father college graduate            | Mother employed                   |
| Mother homemaker                   | Mother unemployed                 |
| Nonresponses - Mothers' education  | Region dummy                      | 
| Year dummy                         | Grade dummy                       |
| Reform dummy                       | Region-enrollment interaction term|


To partially account for misreporting, item nonresponses (the percentage of students in a class for which data is missing) are included as observables. The nonresponse variables on fathers' education and mothers' occupation were dropped due to collinearity issues. The reform dummy reflects whether the class was subject to $c_{gt}$ equal to 25 or 27. It should be noted that the data related to the demographic and socioeconomic background of pupils are reported by the school staff, while the other controls are administrative in nature or collected by the INVALSI institute. This distinction will be relevant when discussing misreporting. As the relationship of grade enrollment with the endogenous and the outcome variables is influenced by geographical factors (see Figures 4 to 6), equation $(2)$ is then augmented by adding the polynomial interaction between the running variable and the region dummy and other terms that allow the quadratic function to change over enrollment-windows (the "interacted model"). Table 2 reports the 2SLS estimates for this model in addition to the OLS and 2SLS estimates of equation $(2)$. 
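As a rough illustration of equations $(1)$ and $(2)$, the Python sketch below computes the class size predicted by Maimonides' Rule and then recovers the class-size coefficient in the just-identified case as the ratio of the reduced-form coefficient to the first-stage coefficient on $f_{igkt}$. Everything in it is an assumption made for illustration only: the data are simulated, the function name `maimonides_class_size`, the confounder `u`, and all coefficient values are hypothetical, and the demographic controls in $\rho_0(t,g)$ are omitted. It is not the authors' replication code.

```python
import numpy as np


def maimonides_class_size(enrollment, cap):
    """Predicted class size from equation (1): r / (int[(r - 1) / c] + 1)."""
    enrollment = np.asarray(enrollment)
    n_classes = (enrollment - 1) // cap + 1   # number of classes implied by the cap
    return enrollment / n_classes


def ols(y, X):
    """Least-squares coefficients of y on the columns of X."""
    return np.linalg.lstsq(X, y, rcond=None)[0]


rng = np.random.default_rng(0)
n = 20_000
cap = 25            # pre-reform cap mentioned in the text
true_beta = -0.5    # hypothetical effect of s on y (simulation only)

r = rng.integers(10, 151, size=n)             # simulated grade enrollment
f = maimonides_class_size(r, cap)             # instrument
u = rng.normal(size=n)                        # hypothetical unobserved confounder
s = (f + 1.5 * u + rng.normal(size=n)) / 10   # class size divided by ten
r100 = r / 100                                # rescaled running variable
y = (true_beta * s + 0.3 * r100 - 0.1 * r100**2
     + 0.5 * u + 0.3 * rng.normal(size=n))

controls = np.column_stack([np.ones(n), r100, r100**2])

beta_ols = ols(y, np.column_stack([s, controls]))[0]      # biased by u
first_stage = ols(s, np.column_stack([f, controls]))[0]   # effect of f on s
reduced_form = ols(y, np.column_stack([f, controls]))[0]  # effect of f on y
beta_iv = reduced_form / first_stage                      # just-identified IV

print(f"OLS: {beta_ols:.3f}  IV: {beta_iv:.3f}  truth: {true_beta}")
```

With a cap of 25, `maimonides_class_size(25, 25)` returns 25.0 while `maimonides_class_size(26, 25)` returns 13.0, which is the kind of discontinuity the design exploits. In this simulation the OLS coefficient is pushed away from the assumed value by the confounder, whereas the IV ratio should land close to it.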

Motivated by the finding of particularly large class-size effects in the South, Angrist et al. include a second endogenous variable in the model, namely the cheating dummy $m_{igkt}$. Using a second, larger dataset (in which the unit of observation is the student), the authors construct a dummy that flags classes with abnormally high scores, low variance, and suspicious patterns, i.e. classes where teachers have likely manipulated the outcome variable. As discussed above, this second treatment variable is instrumented by the presence of monitors at the institution $M_{igkt}$. Thus, equation $(2)$ becomes:

$\qquad (4) \qquad \qquad y_{igkt} = \rho_0(t,g) + \beta_1 s_{igkt} + \beta_2 m_{igkt} + \rho_1 r_{gkt} + \rho_2 r^2_{gkt} + \eta_{igkt} $

with the first-stage regressions being:

$\qquad (5) \qquad \qquad s_{igkt} = \lambda_{10}(t,g) + \mu_{11} f_{igkt} + \mu_{12} M_{igkt} + \lambda_{11} r_{gkt} + \lambda_{12} r^2_{gkt} + \xi_{ik} $

$\qquad (6) \qquad \qquad m_{igkt} = \lambda_{20}(t,g) + \mu_{21} f_{igkt} + \mu_{22} M_{igkt} + \lambda_{21} r_{gkt} + \lambda_{22} r^2_{gkt} + v_{ik} $

Estimates for equation $(4)$ are found in Table 8, while those for the first stage are presented in Table 7. The table below summarizes the key elements of the final model. The causal effect of the endogenous variables $s_{igkt}$ and $m_{igkt}$ on math and language scores $y_{igkt}$ is recovered by instrumenting them with the predicted class size $f_{igkt}$, which is a function of the running variable $r_{gkt}$, and the presence of monitors $M_{igkt}$.
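To make the logic of equations $(4)$ to $(6)$ concrete, here is a minimal two-stage least squares sketch on simulated data with two endogenous regressors and two instruments. All names and numbers are assumptions introduced for illustration: the helper `two_sls`, the confounders `u1` and `u2`, the 30% monitoring share, and the "true" effects (zero for class size, one for manipulation) are hypothetical and chosen only to mimic the mechanism described above. The controls in $\rho_0(t,g)$ are again omitted, and no standard errors are computed.

```python
import numpy as np


def two_sls(y, endog, instruments, exog):
    """2SLS coefficients, ordered as [endogenous columns..., exogenous columns...]."""
    X = np.column_stack([endog, exog])                  # structural regressors
    Z = np.column_stack([instruments, exog])            # instruments plus controls
    X_hat = Z @ np.linalg.lstsq(Z, X, rcond=None)[0]    # first-stage fitted values
    return np.linalg.lstsq(X_hat, y, rcond=None)[0]     # second stage


rng = np.random.default_rng(1)
n = 50_000
beta_s, beta_m = 0.0, 1.0    # hypothetical true effects (simulation only)

r = rng.integers(10, 151, size=n)            # simulated grade enrollment
f = r / ((r - 1) // 25 + 1)                  # Maimonides' Rule with a cap of 25
M = rng.binomial(1, 0.3, size=n)             # randomly assigned monitors (assumed share)
u1, u2 = rng.normal(size=n), rng.normal(size=n)   # hypothetical unobservables

s = (f + 1.5 * u1 + rng.normal(size=n)) / 10      # class size divided by ten
# Manipulation is less likely with monitors and in larger classes,
# and shares the unobservable u2 with test scores.
m = (-0.3 - 0.8 * M - 1.0 * (s - 2) + u2 > 0).astype(float)

r100 = r / 100
y = (beta_s * s + beta_m * m + 0.3 * r100 - 0.1 * r100**2
     + 0.5 * u1 + 0.5 * u2 + 0.3 * rng.normal(size=n))

controls = np.column_stack([np.ones(n), r100, r100**2])

# Instrumenting class size only, as in equation (2).
naive_s = two_sls(y, s, f, controls)[0]

# Instrumenting class size and manipulation jointly, as in equation (4).
b_s, b_m = two_sls(y, np.column_stack([s, m]),
                   np.column_stack([f, M]), controls)[:2]

print(f"class size only: {naive_s:.3f}")
print(f"joint model: class size {b_s:.3f}, manipulation {b_m:.3f}")
```

In this simulated setting, instrumenting class size alone yields a negative coefficient because larger classes reduce manipulation, while the joint model should return a class-size coefficient near the assumed value of zero. This mirrors the pattern the text describes but says nothing about the magnitudes in the actual data.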


| **Main Outcome**       | **Treatments**               | **Instruments**                        | **Assignment Variable**             |
|------------------------|------------------------------|----------------------------------------|-------------------------------------|
| Test scores $y_{igkt}$ | Class size $s_{igkt}$        | Maimonides' Rule $f_{igkt}$            | Grade enrollment at school $r_{gkt}$|
|                        | Score manipulation $m_{igkt}$| Monitors at the institution $M_{igkt}$ |                                     |