vignettes/introduction.Rmd

---
title: "getspanel: Getting Started"
output: rmarkdown::html_vignette
#output: prettydoc::html_pretty
#  code_download: true

vignette: >
  %\VignetteIndexEntry{getspanel: Getting Started}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = NA,
  echo = TRUE,
  message = FALSE,
  error = TRUE,
  eval = TRUE,
  out.width = "100%",
  fig.width = 7,
  fig.height = 5,
  dev = "png",
  dpi = 300
)
```

The `getspanel` package can be downloaded and installed from CRAN [here](https://cran.r-project.org/package=getspanel) by simply using: 

```{r, eval=FALSE}
install.packages("getspanel")
```

The source code of the package is on [GitHub](https://github.com/moritzpschwarz/getspanel) and the development version can be installed using:


    # install.packages("devtools")
    devtools::install_github("moritzpschwarz/getspanel", ref = "devel")

Once installed we need to load the library:

```{r setup}
library(getspanel)
library(fixest)
```


Currently the package is called **getspanel** to align with the **gets** package, but it's main function of course remains the **isatpanel** function.

The **isatpanel** function implements the empirical break detection algorithm that is described in a [paper by Felix Pretis and Moritz Schwarz](https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4022745) and was applied to a study by Nico Koch and colleagues on EU Road CO~2~ emissions, which was [published in Nature Energy in 2022](https://www.nature.com/articles/s41560-022-01095-6). 

**A quick overview over what has changed:**

-   We can now use the function approach as well as the traditional gets approach. This means that we can specify a model using `y` and `mxreg` as well as `time` and `id` as vectors, but we can now also simply supply a `data.frame` and a `function` in the form `y ~ x + z + I(x^2)` to e.g. specify polynomials. This means we will then need an `index` argument, which specifies the

-   The `ar` argument now works

-   We can now use the `fixest` package to speed up model estimation with large `i` (for short panels, the default method is still faster).The package can be activated using the new `engine` argument.

-   Using the `fixest` package also allows us to calculate **clustered standard errors**.

-   We can now be certain that unbalanced panels would work as intended, which was not the case before.

-   The `mxbreak` and `break.method` arguments have been removed. Instead the function now produces the break matrix itself. This now implements the following saturation methods in a user friendly way:

-   **iis**: Impulse Indicator Saturation

-   **jsis**: **Joint** Step Indicator Saturation (Common Breaks over time)

-   **csis**: **Coefficient** Step Indicator Saturation (Common Coefficient Breaks over time)

-   **fesis**: **Fixed Effect** Step Indicator Saturation (Breaks in the Group Fixed Effect over time)

-   **cfesis**: **Coefficient Fixed Effect** Step Indicator Saturation (Breaks in the coefficient for each individual)

# The isatpanel function

We first load some data of EU CO2 Emissions in the housing sector.


```{r}
data("EUCO2residential")
head(EUCO2residential)

# let's subset this a little bit to speed this up
EUCO2residential <- EUCO2residential[EUCO2residential$year > 2000 & 
                                       EUCO2residential$country %in% c("Germany", "Austria",
                                                                       "Belgium", "Italy", 
                                                                       "Sweden", "Denmark"),]

# let's create a log emissions per capita variable
EUCO2residential$lagg.directem_pc <- log(EUCO2residential$agg.directem/EUCO2residential$pop)

# and let's also turn off printing the intermediate output from isatpanel
options(print.searchoutput = FALSE)
```

Let's look at how we input what we want to model. Each `isatpanel` command takes:

## Basics

-   A specification of the source data, the group and time variable and the group-time characteristics. This can be entered into the function in two ways:

i.  In the **gets** package style i.e. using vectors and matrices to specify `y`, `mxreg`, `time` and `id`

ii. But also in a form that resembles the `lm` and `plm` specification i.e. inputting a `data.frame` (or `matrix` or `tibble`), a `formula` argument as well as character vectors for `index` (in the form `c("group_variable_name", "time_variable_name")`)

-   A an argument for the Fixed Effect Specification using `effect`.

This already means that the following two commands will give the same result:

Using the new method


```{r}
is_lm <- isatpanel(data = EUCO2residential,
                   formula = lagg.directem_pc ~ lgdp + I(lgdp^2) + pop,
                   index = c("country","year"),
                   
                   effect = "twoways",
                   
                   fesis = TRUE)
```

Using the traditional method

```{r}
is_gets <- isatpanel(y = EUCO2residential$lagg.directem_pc,
                     mxreg = EUCO2residential$lgdp,
                     time = EUCO2residential$year,
                     id = EUCO2residential$country,

                     effect = "twoways",

                     fesis = TRUE)
```


From here onwards, I will use the `lm` notation.

## Plotting

We can plot these simply using the default plotting methods (rely on the **ggplot2** package):

```{r}
plot(is_lm)
```


```{r}
plot_grid(is_lm)
```


```{r}
plot_counterfactual(is_lm)
```


<!-- ![](getspanel_plot.png) -->

## Saturation Methods

### Impulse Indicator Saturation

This argument works just as in the **gets** package. The method simply adds a `0` and `1` dummy for each observation.

Simply set `iis = TRUE`.

```{r}
iis_example <- isatpanel(data = EUCO2residential,
                         formula = lagg.directem_pc ~ lgdp + I(lgdp^2) + pop,
                         index = c("country","year"),

                         effect = "twoways",

                         iis = TRUE,
                         fesis = TRUE)
```


```{r}
plot(iis_example)
```


### Step Indicator Saturation

Traditional Step Indicator Saturation does not make sense in a panel setting. Therefore, the **gets** function of `sis` is disabled.

### Joint Step Indicator Saturation

It is possible, however, to consider Step Indicator Saturation with common breaks across individuals. Such indicators would be collinear, if `effects = c("twoways")` or `effects = c("time")` i.e. if Time Fixed Effects are included.

If, however, `effect = "individual"` then we can use `jsis = TRUE` to select over all individual time fixed effects.

```{r}
jsis_example <- isatpanel(data = EUCO2residential,
                          formula = lagg.directem_pc ~ lgdp + I(lgdp^2) + pop,
                          index = c("country","year"),

                          effect = "individual",

                          jsis = TRUE)
```


### Coefficient Step Indicator Saturation

**Note:** This method has only been tested using the `lm` implementation (using `data`, `formula`, and `index`).

This method allows detection of coefficient breaks that are common across all groups. It is the interaction between `jsis` and the relevant coefficient.

To illustrate this, as well as the advantages of using the `lm` approach, we include a non-linear term of the lgdp variable using `I(lgdp^2)`:

```{r}
csis_example <- isatpanel(data = EUCO2residential,
                          formula = lagg.directem_pc ~ lgdp + I(lgdp^2) + pop,
                          index = c("country","year"),

                          effect = "twoways",
                          t.pval = 0.05,

                          csis = TRUE)
```


By default, all coefficients will be interacted and added to the indicator list - but his can be controlled using the `csis_var`, which takes a character vector of column names i.e. `csis_var = "lgdp"`.

```{r}
csis_example2 <- isatpanel(data = EUCO2residential,
                           formula = lagg.directem_pc ~ lgdp + I(lgdp^2) + pop,
                           index = c("country","year"),

                           effect = "twoways",

                           csis = TRUE,
                           csis_var = "lgdp")
```


### Fixed Effect Step Indicator Saturation

This is equivalent to supplying a constant to the mxbreak argument in the old method. This essentially breaks the group-specific intercept i.e. the individual fixed effect.

```{r}
fesis_example <- isatpanel(data = EUCO2residential,
                           formula = lagg.directem_pc ~ lgdp + I(lgdp^2) + pop,
                           index = c("country","year"),

                           effect = "twoways",

                           fesis = TRUE)

```


```{r}
plot(fesis_example)
```


Similar to the `csis_var` idea, we can specify the `fesis` method for a subset of individuals as well using the `fesis_id` variable, which takes a character vector of individuals. In this case we can use e.g. `fesis_id = c("Austria","Denmark")`.

```{r}
fesis_example2 <- isatpanel(data = EUCO2residential,
                           formula = lagg.directem_pc ~ lgdp + I(lgdp^2) + pop,
                           index = c("country","year"),

                           effect = "twoways",

                           fesis = TRUE,
                           fesis_id = c("Austria","Denmark"))
```


```{r}
plot(fesis_example2)
```


## Post-selection robustness

The options for the `robust_isatpanel` are to use HAC Standard Errors, use a standard White Standard Error Correction (with the option of clustering the S.E. within groups or time):

```{r}
robust_isatpanel(fesis_example, HAC = TRUE, robust = TRUE, cluster = "group")
```


### Coefficient Fixed Effect Step Indicator Saturation

This method combines the `csis` and the `fesis` approach and detects whether coefficients for individual units break over time.

This means we can also combine the subsetting in both the variable and in the individual units using `cfesis_id` and `cfesis_var`.

```{r}
cfesis_example <- isatpanel(data = EUCO2residential,
                            formula = lagg.directem_pc ~ lgdp + I(lgdp^2) + pop,
                            index = c("country","year"),

                            effect = "twoways",

                            cfesis = TRUE,
                            cfesis_id = c("Belgium","Germany"),
                            cfesis_var = "lgdp",
                            t.pval = 0.001)
```


```{r}
plot(cfesis_example)
```

## The `ar` argument

It is now possible to specify an argument to include autoregressive coefficients, using the `ar` argument.

```{r}
fesis_ar1_example <- isatpanel(data = EUCO2residential,
                               formula = lagg.directem_pc ~ lgdp + I(lgdp^2) + pop,
                               index = c("country","year"),

                               effect = "twoways",

                               fesis = TRUE,

                               ar = 1)
```


## The `engine` argument

Another new argument is also the `engine` argument. This allows us to use an external package to estimate our models. At this stage, the **fixest** package can be used.

This also means that we can now use an argument to cluster Standard Errors using `cluster`. The following few chunks are not executed by default in the vignette.

```{r,eval = FALSE}
fixest_example <- isatpanel(data = EUCO2residential,
                            formula = lagg.directem_pc ~ lgdp + I(lgdp^2) + pop,
                            index = c("country","year"),

                            effect = "twoways",

                            fesis = TRUE,

                            engine = "fixest",
                            cluster = "none")
```


We can verify that, using no clustering of Standard Errors at all, using the **fixest** package does not change our estimates:

```{r, eval = FALSE}
head(fixest_example$isatpanel.result$mean.results)
```

Compared to the default estimator:

```{r, eval = FALSE}
head(is_lm$isatpanel.result$mean.results)
```

However, changing the `cluster` specification of course does. **The Standard Error correction with it's current implementation is not valid, so allows for many more indicators than true - clustering is therefore currently not recommended.**

```{r, eval = FALSE}
fixest_example_cluster <- isatpanel(data = EUCO2residential,
                                    formula = lagg.directem_pc ~ lgdp + I(lgdp^2) + pop,
                                    index = c("country","year"),

                                    effect = "twoways",

                                    fesis = TRUE,

                                    engine = "fixest",
                                    cluster = "individual")
```


```{r, eval = FALSE}
plot(fixest_example_cluster)
```