# Data preparation – expenditure

Prepared by Omar A. Guerrero (oguerrero@turing.ac.uk, @guerrero_oa)

This tutorial shows how to prepare a dataset containing expenditure programmes that are linked, in some way, to the development indicators. I will assume that the raw expenditure data already has a certain structure, and I will provide examples of structures with different levels of granularity. The aim is to prepare two files: (1) a disbursement schedule and (2) a relational table.

## Import the necessary Python libraries to manipulate data

In [2]:
import pandas as pd
import numpy as np

## On expenditure linked data

In any impact evaluation of public expenditure, it is necessary to have some information about the level of expenditure that is devoted to an indicator. Usually, broad tranches such as education, public health, or national defense are used for that purpose. In the context of multidimensional impact evaluation, these data need to be more disaggregated. In an ideal scenario, there would be one development indicator directly linked to one expenditure programme. In real life, such a one-to-one mapping does not exist because there can be multiple government programmes designed to affect the same indicator, or several indicators affected by the same programme. PPI was designed with this imperfect matching in mind.

Today, it is still difficult to find expenditure datasets with a high degree of disaggregation. Thus, in this tutorial, I will show two examples. Before elaborating on these examples, I need to explain how PPI uses the expenditure data and some important considerations that one needs to take into account before preparing the final dataset.


## PPI and expenditure data

To accommodate different linkage qualities between expenditure and indicators, PPI relies on a model of how the government prioritises its spending (see more in the Model chapter of the book). For instance, if there is only aggregate data for the tranche of education, but there are several indicators capturing different policy issues within education, the model determines the spending distribution within education endogenously. However, if the user has data on how the educational budget was actually allocated across more fine-grained policy issues, this information can be incorporated into PPI to rely less on the model and more on the data.

## Temporal factors

Before preparing the example datasets, it is important to mention three adjustments that should be made beforehand:

* Accounting for inflation
* Accounting for population growth
* Accounting for spending inertia

Controlling for these (and perhaps other) temporal variables is important to remove their influence from the expenditure-indicator relationship that PPI models. Dealing with inflation is straightforward, as it consists of converting the expenditure time series into constant monetary units. Adjusting for population growth is also easy: one divides the data by the population size (which changes through time) to obtain per capita spending.
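As a minimal sketch of the first two adjustments, the cell below deflates a nominal expenditure table and then converts it to per capita terms. The figures, the category names, the `deflator` series (a price index with base year 2018) and the `population` series are all made up for illustration; they are assumptions and do not come from the tutorial's data.

```python
import pandas as pd

# Hypothetical nominal expenditure: rows are spending categories, columns are years.
nominal = pd.DataFrame(
    {2018: [100.0, 50.0], 2019: [110.0, 55.0], 2020: [121.0, 60.5]},
    index=["education", "health"],
)

# Hypothetical price index (base year 2018 = 1.0) and population size per year.
deflator = pd.Series({2018: 1.0, 2019: 1.1, 2020: 1.21})
population = pd.Series({2018: 10.0, 2019: 10.0, 2020: 11.0})

# Inflation adjustment: divide every year's column by that year's price index
# to express spending in constant (2018) monetary units.
real = nominal.div(deflator, axis="columns")

# Population adjustment: divide by that year's population to get per capita spending.
real_per_capita = real.div(population, axis="columns")

print(real_per_capita.round(2))
```

Both series are aligned to the table's columns by year label, so the order of the years does not matter as long as the labels match.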

Once the two previous adjustments have been made, there may still be some inertia in the spending time series. This should also be removed for technical reasons regarding the model in PPI. In a nutshell, this is necessary because of the calibration of a parameter $\beta_i$ that normalises the expenditure destined to indicator $i$ to a value between 0 and 1 to determine the probability of success of the indicator.

Note that $\beta_i$ does not have a time sub-index, which means that it is a constant parameter. Consequently, if the expenditure data related to $i$ has a positive trend, the later periods of the simulation will tend to have a higher probability of success than the early ones. These inter-temporal differences in success rates are an artifact of not removing the trend component of expenditure, as the indicator data does not suggest a systematic improvement in success rates, but rather that spending on policy issues becomes more expensive, in real terms, with time.

There are various ways in which one could remove the trend component from an expenditure time series. In the book, we use the naive approach of simply calculating the inter-temporal average of each expenditure programme, and applying it in every period. This is a simple approach that meets the technical level of the book, and that is acceptable if one is not concerned about specific points in time during the sample period. However, one may want to consider more nuanced methods like running a linear regression and taking the differences with respect to the predicted values, or a Hodrick–Prescott filter, which is popular in macroeconomics, or any signal-processing method (as we do in http://dx.doi.org/10.2139/ssrn.4101378).
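The first two options can be sketched as follows, using a made-up table of two programmes with perfectly linear spending. The programme names, the numbers and the helper `linear_detrend` are hypothetical. Adding the programme's mean back to the regression residuals is my own assumption to keep the series at its original level; it is not necessarily what the book does.

```python
import numpy as np
import pandas as pd

# Hypothetical real per capita expenditure with a linear upward trend.
years = np.arange(2010, 2020)
spending = pd.DataFrame(
    [10.0 + 0.5 * np.arange(10), 4.0 + 0.2 * np.arange(10)],
    index=["programme_a", "programme_b"],
    columns=years,
)

# Naive approach: replace every period with the programme's inter-temporal average.
means = spending.mean(axis=1)
flat = pd.DataFrame(
    np.tile(means.values[:, None], (1, spending.shape[1])),
    index=spending.index,
    columns=spending.columns,
)

def linear_detrend(row):
    """Subtract a fitted linear trend from one series and restore its mean level."""
    t = np.arange(len(row))
    slope, intercept = np.polyfit(t, row.values, 1)
    residuals = row - (slope * t + intercept)
    # Assumption: re-centre on the mean so that spending stays positive.
    return residuals + row.mean()

# Regression approach: detrend each programme (row) separately.
detrended = spending.apply(linear_detrend, axis=1)
```

With real data the two results differ: the regression approach keeps the fluctuations around the trend, whereas the naive approach discards them.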

In this tutorial, I assume that the raw expenditure data consists of a table with expenditure time series for various SDGs. Hence, in the absence of expenditure programmes, the SDGs provide the imperfect link between spending data and indicators.