# Advanced Usage

This tutorial showcases advanced functionalities and applications of GETTSIM's interface. For an introductory tutorial see [here](basic_usage.ipynb). The introductory tutorial showcases GETTSIM's two main functions using a minimal working example:

1. `set_up_policy_environment` which loads a policy environment for a specified date.

2. `compute_taxes_and_transfers` which allows you to compute taxes and transfers given a specified policy environment for household or individual observations.

This tutorial dives deeper into the GETTSIM interface to acquaint you with further useful functionalities. Specifically, it shows how to navigate the numerous [input and target](../gettsim_objects/input_variables.rst) variables that the package supports and how GETTSIM processes them internally, using the example of child benefits in the German taxes and transfers system.

In [None]:
import numpy as np
import pandas as pd
import plotly.express as px
from gettsim import (
    compute_taxes_and_transfers,
    create_synthetic_data,
    plot_dag,
    set_up_policy_environment,
)


## Example: Kindergeld (Child Benefits)
    
For this tutorial, we will focus on *Kindergeld*, a child benefit that can be claimed by parents in Germany. *Kindergeld* can be claimed in different ways, and whether a family is eligible depends on several variables. For instance, *Kindergeld* can be claimed as a monthly payment but also as a tax credit (*Kinderfreibetrag*), which is more advantageous for higher income groups. Additionally, eligibility depends on factors like the age and work status of the children. These factors make it a more complex feature of the German taxes and transfers system than one might initially believe.

In the following, we will inspect in detail how the German *Kindergeld* is implemented in GETTSIM to showcase further functionalities of the package. To start off, we load a policy environment to work with.

In [None]:
policy_params, policy_functions = set_up_policy_environment("2020")

In [None]:
policy_params["wohngeld"]

The policy parameters relevant for *Kindergeld* are stored under the key `kindergeld`.

In [None]:
policy_params["kindergeld"]

## DAG Plots for Visualization of the Taxes and Transfers System

To get a better picture of how *Kindergeld* is implemented in GETTSIM and, at the same time, of the structure of the German taxes and transfers system, we can use GETTSIM's visualization capabilities, which are concentrated in the function `plot_dag`. This function creates a directed acyclic graph (DAG) of the taxes and transfers system and offers many different visualization options. The [guide on visualizing the taxes and transfers system](../how_to_guides/visualizing_the_system.ipynb) gives an in-depth explanation of the function.

To figure out which variables are relevant for the child benefit, we plot the corresponding slice of the entire taxes and transfers system implemented in GETTSIM using `plot_dag`. The function was already imported with all other relevant packages at the beginning of this tutorial. To select the relevant plot, we have to define selectors that we pass as an argument to the function. We can check the possible output variables [here](../gettsim_objects/variables_out.rst) to find the relevant variable name for our application.

In [None]:
selectors = {"type": "ancestors", "node": "kindergeld_m"}

Since we are interested in the child benefits, we select the node `kindergeld_m` and plot its `ancestors`, which are all the nodes that `kindergeld_m` depends on directly or indirectly. As the plot below shows, the variable depends on many other nodes, which results in a very large DAG. Clicking on a node links to the corresponding function or variable.

In [None]:
plot_dag(functions=policy_functions, selectors=selectors).show()
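The notion of ancestors can be made concrete with a toy dependency graph. The sketch below is not GETTSIM's implementation: the edges are assumptions made up for illustration, `kindergeld_eligible` is a hypothetical node name, and `ancestors` is a hypothetical helper.

```python
# Toy dependency graph: each node maps to the nodes it directly depends on.
# The edges are illustrative assumptions, not GETTSIM's actual graph.
toy_dependencies = {
    "kindergeld_m_tu": ["kindergeld_m"],
    "kindergeld_m": ["kindergeld_eligible", "alter"],
    "kindergeld_eligible": ["alter", "arbeitsstunden_w"],  # hypothetical node
    "alter": [],
    "arbeitsstunden_w": [],
}


def ancestors(graph, node):
    """Return every node that `node` depends on, directly or indirectly."""
    found = set()
    stack = list(graph[node])
    while stack:
        current = stack.pop()
        if current not in found:
            found.add(current)
            stack.extend(graph[current])
    return found


print(sorted(ancestors(toy_dependencies, "kindergeld_m")))
```

Note that `kindergeld_m_tu` is not part of the result: it depends on `kindergeld_m` and is therefore a descendant, not an ancestor.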

An alternative way to inspect the variable is to look at its neighbors in the DAG. This depiction shows the related variables and functions up to two nodes away from `kindergeld_m`. It reveals two `descendants` of `kindergeld_m`: `kindergeld_m_tu` and `kindergeld_m_hh`. These variables contain the child benefits at the tax unit and household level, respectively.

In [None]:
selectors = {"type": "neighbors", "node": "kindergeld_m", "order": 2}
plot_dag(functions=policy_functions, selectors=selectors).show()

## Computing Variables of Interest

Having inspected the DAG, we now have an impression of the various input variables and functions that influence our variable of interest. As a next step, we load a set of simulated household data and compute the *Kindergeld* using `compute_taxes_and_transfers`, relying on the function's features and error messages to aid us in this process.

### Simulated Data

We simulate a dataset using `create_synthetic_data`. We can easily specify a few variables while all other necessary input variables will be filled with defaults.

The specification chosen here creates a set of households with two adults and two children. The households vary in the variable `bruttolohn_m` and are otherwise identical.

In [None]:
data = create_synthetic_data(
    n_adults=2,
    n_children=2,
    specs_heterogeneous={
        "bruttolohn_m": [[i, 0, 0, 0] for i in np.linspace(1000, 8000, 701)]
    },
)
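To see what this specification generates, the grid can be reproduced in plain Python. The sketch does not call GETTSIM, and it assumes that each inner list holds one value per household member with the adults listed first, which is how the `[i, 0, 0, 0]` pattern reads.

```python
# np.linspace(1000, 8000, 701) yields 701 evenly spaced points, i.e. steps of 10.
n_points = 701
step = (8000 - 1000) / (n_points - 1)
wages = [1000 + k * step for k in range(n_points)]

# One inner list per household: [adult 1, adult 2, child 1, child 2] (assumed order).
bruttolohn_spec = [[wage, 0, 0, 0] for wage in wages]

print(len(bruttolohn_spec))  # number of simulated households
print(bruttolohn_spec[0], bruttolohn_spec[-1])  # lowest and highest wage household
```

Each household therefore differs from the next only by €10 in the gross wage of one member.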

In [None]:
data[["hh_id", "hh_typ", "alter", "kind", "bruttolohn_m"]]

In each household, one adult's monthly gross earnings range between €1,000 and €8,000, while all other household members have zero earnings. The earnings are captured in the variable `bruttolohn_m`. We can use the pandas method [pandas.Series.describe](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.describe.html) to assess the variable in detail.

In [None]:
data["bruttolohn_m"].describe()

The columns contain all the input variables needed to compute `kindergeld_m`.

In [None]:
data.columns

### Using Errors and Warnings

As the DAG and column list above show, a large number of inputs is required to compute child benefits for a family. While the DAG is very useful to understand the structure within GETTSIM behind a variable or function, it might be difficult to infer which inputs exactly are needed in the data to compute a desired output. The function `compute_taxes_and_transfers` thus directly provides multiple mechanisms that help you identify the required input variables to compute certain taxes and transfers.

As shown in the [basic usage tutorial](basic_usage.ipynb), the function requires `data`, one or multiple `targets`, and `policy_params` as well as `policy_functions` to compute taxes and transfers for a given policy environment. 

Since our data set includes all required input columns already, the function does so without problems.

In [None]:
result = compute_taxes_and_transfers(
    data=data, params=policy_params, targets="kindergeld_m", functions=policy_functions
)
result.head(3)

#### Error Messages: Missing Inputs

However, if we have failed to add a required column, the function throws an error with a message that specifies which columns are missing. For example, the variable `arbeitsstunden_w` holds information on weekly working hours and is required to compute child benefits. Dropping it from the data triggers the error.

In [None]:
incomplete_data = data.drop("arbeitsstunden_w", axis=1)
result = compute_taxes_and_transfers(
    data=incomplete_data,
    params=policy_params,
    targets="kindergeld_m",
    functions=policy_functions,
)
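The idea behind such a check can be sketched in a few lines. This is an illustration under assumptions, not GETTSIM's actual code: `check_missing_inputs` is a hypothetical helper, the error message is invented, and the list of required columns is a made-up subset.

```python
def check_missing_inputs(required, available):
    """Raise a ValueError that names every required column absent from the data."""
    missing = sorted(set(required) - set(available))
    if missing:
        raise ValueError(f"The following data columns are missing: {missing}")


# Hypothetical subset of required inputs; the real list is much longer.
required_columns = ["p_id", "alter", "arbeitsstunden_w"]
available_columns = ["p_id", "alter"]

try:
    check_missing_inputs(required_columns, available_columns)
except ValueError as error:
    print(error)
```

Collecting all missing columns before raising, rather than failing on the first one, lets the user fix the data in a single pass.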

Similarly, we can pass a `pandas.DataFrame` that contains nothing but an empty `p_id` column to the function. The resulting error message lists all input columns that are necessary to compute the desired target(s).

In [None]:
result = compute_taxes_and_transfers(
    data=pd.DataFrame({"p_id": []}),
    params=policy_params,
    targets="kindergeld_m",
    functions=policy_functions,
)

#### Error Messages and Warnings: Unused Inputs

The function `compute_taxes_and_transfers` also has an option that allows you to check for unused inputs in your data. This functionality is controlled through the argument `check_minimal_specification`. By default, it is set to `ignore`, meaning no check is conducted. However, it can also be set to `warn` to trigger a warning or to `raise` to throw an error. In both cases, the message states the unused inputs.

In [None]:
result = compute_taxes_and_transfers(
    data=data,
    params=policy_params,
    targets="kindergeld_m",
    functions=policy_functions,
    check_minimal_specification="raise",
)
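The three modes can be mimicked with a small stand-alone function. The sketch is hedged: `check_unused_inputs` is a hypothetical helper, the message text and the column `extra_column` are invented, and GETTSIM's internal logic may differ.

```python
import warnings


def check_unused_inputs(required, available, mode="ignore"):
    """Return unused columns and warn or raise about them depending on `mode`."""
    if mode not in {"ignore", "warn", "raise"}:
        raise ValueError(f"Unknown mode: {mode!r}")
    unused = sorted(set(available) - set(required))
    message = f"The following columns are unused: {unused}"
    if unused and mode == "warn":
        warnings.warn(message)
    elif unused and mode == "raise":
        raise ValueError(message)
    return unused


# With mode="warn", the computation continues and a UserWarning is emitted.
unused = check_unused_inputs(
    required=["p_id", "alter"],
    available=["p_id", "alter", "extra_column"],
    mode="warn",
)
print(unused)
```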

### Debug Mode

In addition to errors and warnings, `compute_taxes_and_transfers` can also be used in debug mode by setting the argument `debug=True`. In this mode, the function returns all inputs and outputs that can be computed while issuing error messages for the parts where the code fails. It is thus a very useful tool to help you set up your code correctly and detect the sources of problems that might arise in the process. Check out the [troubleshooting tutorial](debugging.ipynb) for more information.

### Computing Child Benefits and Taxes

In this section we will compute lump-sum child benefits (*Kindergeld*) for example households. Since households can also claim a tax credit (*Kinderfreibetrag*) instead of the child benefit, we will also compute the income taxes for each household. By default, GETTSIM chooses the financially more favorable option for each case. The results will thus let us inspect how the policy affects different income levels in our data. 
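The logic of choosing the more favorable option can be illustrated with a stylized calculation. All numbers below are invented and do not reflect German tax law or GETTSIM's implementation. The sketch assumes, as a simplification, that the *Kinderfreibetrag* lowers taxable income, so that its value equals the amount multiplied by a flat marginal tax rate; `better_option` is a hypothetical helper.

```python
def better_option(child_benefit_y, freibetrag_y, marginal_tax_rate):
    """Pick the more valuable option: the benefit or the tax saving."""
    tax_saving = freibetrag_y * marginal_tax_rate
    if tax_saving > child_benefit_y:
        return "Kinderfreibetrag", tax_saving
    return "Kindergeld", child_benefit_y


# Invented numbers: a yearly benefit of 2,400 and a Freibetrag of 7,000.
for rate in (0.20, 0.42):
    option, value = better_option(2400, 7000, rate)
    print(f"marginal rate {rate:.0%}: {option}")
```

Because the tax saving grows with the marginal tax rate, the *Kinderfreibetrag* only wins for households with sufficiently high income.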

#### Income Taxes

The income tax of a tax unit depends on the child benefit since the tax credit is only claimed if it is more beneficial than the child benefit. To compare the two, we additionally compute the yearly income tax per tax unit, `eink_st_y_tu`. We also compute the variable `bruttolohn_m_tu`, which gives the monthly gross income per tax unit (in our case, this is the combined income of the two adults in the household).

In [None]:
df = compute_taxes_and_transfers(
    data=data,
    params=policy_params,
    targets=["eink_st_y_tu", "bruttolohn_m_tu", "kindergeld_m_tu"],
    functions=policy_functions,
)

Since the gross income and child benefit per tax unit are computed on a monthly level while taxes are computed on a yearly level, we multiply the former two by 12 and drop unused variables as well as duplicates from our DataFrame. The final DataFrame contains the yearly gross income, income tax, and child benefit per tax unit.

In [None]:
# Multiply variables by 12 to generate yearly values.
df[["bruttolohn_tu", "kindergeld_tu"]] = df[["bruttolohn_m_tu", "kindergeld_m_tu"]] * 12
# Select variables of interest for further steps.
df = df[["bruttolohn_tu", "eink_st_y_tu", "kindergeld_tu"]].drop_duplicates()
df.head().round(2)
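Why dropping duplicates matters can be seen in a toy example. The sketch assumes that results at the tax unit level are repeated for every member of the tax unit, so that each unit appears once per person; the values are invented and only pandas is required.

```python
import pandas as pd

# Two tax units with two members each; tax unit values repeat per person.
toy = pd.DataFrame(
    {
        "bruttolohn_m_tu": [1000.0, 1000.0, 2000.0, 2000.0],
        "kindergeld_m_tu": [400.0, 400.0, 400.0, 400.0],
    }
)

# Convert monthly to yearly values, then keep one row per tax unit.
toy[["bruttolohn_tu", "kindergeld_tu"]] = toy[["bruttolohn_m_tu", "kindergeld_m_tu"]] * 12
yearly = toy[["bruttolohn_tu", "kindergeld_tu"]].drop_duplicates()
print(yearly)
```

Note that `drop_duplicates` compares entire rows, so two distinct tax units with identical values in all selected columns would also collapse into one row.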

At a certain yearly gross income (around €80,000 to €90,000), the tax credit becomes more favorable and GETTSIM assigns the tax break instead of the child benefit. The next cells plot the resulting income tax and child benefits.

In [None]:
def plot_kindergeld(df):
    """Plot yearly income tax and child benefit against yearly gross income."""

    return px.line(
        data_frame=df,
        x="bruttolohn_tu",
        y=["eink_st_y_tu", "kindergeld_tu"],
    )

In [None]:
plot_kindergeld(df).show()

### Columns Overriding Functions

Lastly, it is also possible to substitute internally computed variables using input columns in the data. To override an internal function, it is necessary to specify a column with the same name and pass it to `compute_taxes_and_transfers` using the argument `columns_overriding_functions`.

For instance, for this application we could override the internal function `kindergeld_m` and set the child benefit to 0. 

In [None]:
new_data = data.copy()
new_data["kindergeld_m"] = 0.0

Again, we compute the child benefit and income tax by tax unit. The argument `columns_overriding_functions` also accepts a list with several column names to override multiple functions.

In [None]:
outputs = compute_taxes_and_transfers(
    data=new_data,
    params=policy_params,
    targets=["kindergeld_m_tu", "eink_st_y_tu", "bruttolohn_m_tu"],
    functions=policy_functions,
    columns_overriding_functions=["kindergeld_m"],
)
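Conceptually, overriding means that a value supplied in the data takes precedence over the function that would otherwise compute it. The toy evaluator below sketches this idea. It is not GETTSIM's implementation: `evaluate` is a hypothetical helper, and the benefit rule, the amount of 200 per child, and the input `n_children` are invented.

```python
def evaluate(target, functions, data, overriding=()):
    """Compute `target`, taking columns listed in `overriding` from the data."""
    if target in overriding or target not in functions:
        return data[target]
    func, inputs = functions[target]
    args = [evaluate(name, functions, data, overriding) for name in inputs]
    return func(*args)


# Hypothetical mini-system: each entry is (function, list of input names).
toy_functions = {
    "kindergeld_m": (lambda n_children: 200.0 * n_children, ["n_children"]),
    "kindergeld_m_tu": (lambda kindergeld_m: kindergeld_m, ["kindergeld_m"]),
}

computed = evaluate("kindergeld_m_tu", toy_functions, {"n_children": 2})
overridden = evaluate(
    "kindergeld_m_tu",
    toy_functions,
    {"kindergeld_m": 0.0},
    overriding=["kindergeld_m"],
)
print(computed, overridden)
```

In the second call the data no longer contains `n_children`: once a function is overridden, the inputs that only it required are not needed anymore.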

In [None]:
df_new = outputs.set_index(new_data.tu_id)
df_new[["bruttolohn_tu", "kindergeld_tu"]] = (
    df_new[["bruttolohn_m_tu", "kindergeld_m_tu"]] * 12
)
df_new = df_new[["bruttolohn_tu", "eink_st_y_tu", "kindergeld_tu"]].drop_duplicates()

Since the child benefits are set to zero, GETTSIM computes the tax credit for all households instead. 

In [None]:
plot_kindergeld(df_new).show()

Aside from overriding internal function outputs using data columns, it is also possible to substitute the functions entirely. Please refer to the [policy functions tutorial](policy_functions.ipynb) for more information.

#### Use Case for Columns Overriding Functions: Retirement Earnings

GETTSIM can calculate retirement earnings (`ges_rente_m`), but doing so requires several input variables such as `entgeltp` or `grundr_zeiten`.

However, in most data sets (e.g. the SOEP), retirement earnings are observed while those input variables are not. For some applications, it is hence more straightforward to specify `columns_overriding_functions=["ges_rente_m"]` and use the measured retirement earnings directly. The pension-specific input variables like `entgeltp` or `grundr_zeiten` are then not needed.

