# Data preparation – development indicators

Prepared by Omar A. Guerrero (oguerrero@turing.ac.uk, @guerrero_oa)

In this tutorial, we will pre-process a raw dataset that has been prepared for the tutorials. These data come from the Sustainable Development Report 2022, but they do not represent any particular country, as I have chosen a random sample of indicators. The objective of the tutorial is to show you how to normalise these data and extract the relevant features from them to calibrate the PPI model.

## Import the necessary Python libraries to manipulate data

In [1]:
import pandas as pd
import numpy as np

## Import the raw development indicators

In [2]:
data = pd.read_csv('https://raw.githubusercontent.com/oguerrer/ppi/main/tutorials/raw_data/raw_indicators.csv')

In [3]:
data

Unnamed: 0,seriesCode,sdg,2000,2001,2002,2003,2004,2005,2006,2007,...,2019,2020,2021,2022,seriesName,bestBound,worstBound,instrumental,invert,color
0,sdg8_unemp,8,2.853,2.843,2.892,2.940,2.972,3.015,3.039,3.085,...,3.466,3.810,3.982,3.814,"Unemployment rate (% of total labor force, age...",25.9,0.50,0,1.0,#A21942
1,sdg5_familypl,5,51.500,52.400,53.300,54.600,55.900,58.100,60.400,62.800,...,82.000,82.400,82.800,83.100,Demand for family planning satisfied by modern...,17.5,100.00,1,0.0,#FF3A21
2,sdg11_slums,11,0.031,0.031,0.031,0.031,0.031,0.031,0.031,0.031,...,0.002,0.002,0.002,0.002,Proportion of urban population living in slums...,90.0,0.00,1,1.0,#FD9D24
3,sdg1_wpc,1,1.218,1.225,1.231,1.236,1.241,1.246,1.251,1.256,...,0.384,0.427,0.377,0.346,Poverty headcount ratio at $1.90/day (%),72.6,0.00,1,1.0,#E5243B
4,sdg1_320pov,1,33.232,33.342,33.470,33.604,33.724,33.816,33.877,33.914,...,26.765,28.674,28.765,28.310,Poverty headcount ratio at $3.20/day (%),51.5,0.00,1,1.0,#E5243B
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
67,sdg16_rsf,16,36.697,36.617,36.537,36.457,36.379,36.304,36.232,36.162,...,36.190,36.550,36.395,36.340,Press Freedom Index (best 0-100 worst),80.0,10.00,1,1.0,#00689D
68,sdg16_justice,16,0.540,0.540,0.540,0.540,0.540,0.540,0.540,0.540,...,0.527,0.524,0.523,0.523,Access to and affordability of justice (worst ...,0.1,0.75,1,0.0,#00689D
69,sdg17_govex,17,9.003,8.980,8.751,8.802,8.819,8.747,8.682,8.650,...,8.112,8.111,8.110,8.109,Government spending on health and education (%...,0.0,15.00,1,0.0,#19486A
70,sdg17_govrev,17,29.534,30.017,29.195,29.147,29.451,29.885,30.150,30.739,...,20.629,20.574,20.568,20.563,Other countries: Government revenue excluding ...,10.0,40.00,1,0.0,#19486A


As we can see from the previous table, the dataset contains different development indicators in their original units. While normalising the observations is not a requirement to run PPI, it helps with the calibration. Likewise, it is recommended to invert the direction of those indicators in which better outcomes are reflected in lower values, as this makes the analysis easier to interpret.

Next, let me explain the different columns of this dataset:


* <strong>seriesCode</strong>: The code assigned to the indicator. It captures the SDG to which it belongs and the main policy issue that it relates to.
* <strong>sdg</strong>: The sustainable development goal (SDG) in which the indicator is classified.
* <strong>2000...2022</strong>: The value of the indicator in the corresponding year.
* <strong>seriesName</strong>: The complete name of the indicator.
* <strong>bestBound</strong>: The highest value that the indicator can take.
* <strong>worstBound</strong>: The lowest value that the indicator can take.
* <strong>instrumental</strong>: Takes 1 if an indicator is instrumental and 0 if it is collateral.
* <strong>invert</strong>: Takes 1 if the indicator needs to be inverted and 0 if not.
* <strong>color</strong>: The color code of the SDG to which the indicator belongs.

Some of the columns in this dataset may seem odd to the user, as they reflect concepts explained in the book and other prior publications. Let me briefly explain these terms for those not fully acquainted with PPI.

The <strong>bestBound</strong> and <strong>worstBound</strong> are the so-called technical or theoretical limits of an indicator. The former determines the highest possible value and the latter the lowest. In this tutorial, they will help us to normalise each indicator between 0 and 1. Sometimes the technical bounds are provided with the data; at other times, you need to determine them according to prior knowledge or expert advice. In this example, I have taken the values that the Sustainable Development Report declares as the optimum and the worst possible. Therefore, strictly speaking, they are not technical bounds. We will also normalise the technical bounds (turning them into 1s and 0s) to use them in PPI.

An indicator is considered <strong>instrumental</strong> when there is certainty that at least one government programme exists that is designed to impact it (which does not mean that such a programme is necessarily effective). Collateral indicators, on the other hand, are those for which there are no expenditure programmes, either because such programmes simply do not exist or because the indicator is too aggregate for any government to claim that its programmes have a reliable impact on it.

## Normalise values

First, we will normalise the observations between 0 and 1 using the <strong>bestBound</strong> and <strong>worstBound</strong> columns, through the formula

$$ normalisedValue = \frac{observedValue - worstBound}{bestBound - worstBound} .$$

Then, we invert the direction of those indicators whose better outcomes are expressed through lower values. We do this by applying the formula

$$ invertedValue = 1 - normalisedValue $$

to those indicators with a value of 1 in the column <strong>invert</strong>.

In [4]:
years = [column_name for column_name in data.columns if str(column_name).isnumeric()]

# exclude indicators with fewer than 5 datapoints
data['count_valid'] = data[years].notnull().sum(axis='columns')
# reset the index so that the columns added later align with the rows of df
data = data.loc[data.count_valid > 4].reset_index(drop=True)

normalised_series = []
for index, row in data.iterrows():
    time_series = row[years].values.astype(float)
    normalised_serie = (time_series - row.worstBound)/(row.bestBound - row.worstBound)
    # invert the indicators in which lower values reflect better outcomes
    if row.invert == 1:
        final_serie = 1 - normalised_serie
    else:
        final_serie = normalised_serie
    normalised_series.append(final_serie)

df = pd.DataFrame(normalised_series, columns=years)
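
The same two formulas can also be applied to all rows at once, without an explicit loop. The following is only a minimal sketch under stated assumptions: the function name `normalise_indicators` and the toy table are illustrative and not part of the tutorial, and the input is assumed to carry the `worstBound`, `bestBound` and `invert` columns described earlier.

```python
import numpy as np
import pandas as pd

def normalise_indicators(raw, years):
    """Rescale each row with its bounds and flip the rows flagged in `invert`."""
    values = raw[years].to_numpy(dtype=float)
    worst = raw['worstBound'].to_numpy(dtype=float)[:, None]
    best = raw['bestBound'].to_numpy(dtype=float)[:, None]
    normalised = (values - worst) / (best - worst)
    # flip the direction where lower raw values mean better outcomes
    flip = (raw['invert'].to_numpy() == 1)[:, None]
    final = np.where(flip, 1 - normalised, normalised)
    return pd.DataFrame(final, columns=years)

# toy data (hypothetical values, not taken from the dataset)
toy = pd.DataFrame({
    '2000': [20.0, 5.0],
    '2001': [40.0, 2.5],
    'worstBound': [0.0, 0.0],
    'bestBound': [100.0, 10.0],
    'invert': [0, 1],
})
toy_normalised = normalise_indicators(toy, ['2000', '2001'])
```

In the toy table, the second row is flagged for inversion, so its raw value of 2.5 is first normalised and then flipped.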

## Normalise the theoretical bounds and add all the other columns

In [5]:
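# use .values so that columns are copied by position, even if rows were dropped from data
df['seriesCode'] = data.seriesCode.values
df['sdg'] = data.sdg.values
df['minVals'] = np.zeros(len(data))
df['maxVals'] = np.ones(len(data))
df['instrumental'] = data.instrumental.values
df['seriesName'] = data.seriesName.values
df['color'] = data.color.values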

## Building new variables

Now, we will build a couple of additional variables that are necessary to calibrate and run PPI:

* <strong>I0</strong>: the initial values of the indicators in the sample period
* <strong>IF</strong>: the final values of the indicators in the sample period
* <strong>successRates</strong>: the number of times that an indicator improved as a fraction of the number of times it changed in the sample period

Parameter <strong>I0</strong> provides the initial condition of each indicator when running retrospective simulations. This means that <strong>I0</strong> is necessary for calibration. However, once calibrated, <strong>I0</strong> can be changed to perform prospective simulations.

The levels in <strong>IF</strong> correspond to the last value that each indicator achieved in the sample period. Together with <strong>I0</strong>, this vector helps build the trend component that PPI will attempt to calibrate. Nevertheless, it is recommended to check <strong>IF</strong> manually, as the last value in the time series could be the result of an exogenous shock or idiosyncratic factors (it would exhibit a behaviour that is not consistent with the indicator's historical pattern). If that is the case, it is advisable to adjust the last value of the time series so that it better represents the trend component of the indicator. For example, one could fit a regression model or a Gaussian process and correct the last observation using the predicted value. In these tutorials, we will not perform such a verification and will directly use the final values reported in the dataset.
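
As an illustration of the regression-based correction mentioned above, the sketch below fits a straight line to every observation except the last one and uses the fitted line to predict the final period. This is only one possible approach and is not used in the rest of the tutorial. The function name `trend_adjusted_final`, the choice of a linear fit through `numpy.polyfit`, the exclusion of the last point from the fit, and the example series are all assumptions made for illustration.

```python
import numpy as np

def trend_adjusted_final(series):
    """Predict the last period from a linear trend fitted to the earlier observations."""
    series = np.asarray(series, dtype=float)
    t = np.arange(len(series))
    # fit on all points but the last, which is the one suspected of being atypical
    slope, intercept = np.polyfit(t[:-1], series[:-1], deg=1)
    predicted = slope * t[-1] + intercept
    # keep the prediction inside the normalised range
    return float(np.clip(predicted, 0.0, 1.0))

# hypothetical normalised series whose last value drops abruptly
example = [0.10, 0.12, 0.14, 0.16, 0.05]
adjusted_IF = trend_adjusted_final(example)
```

Whether the predicted value is a better representation of the trend than the observed one remains a judgement call that depends on what is known about the indicator.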

The column <strong>successRates</strong> is used to calibrate PPI by minimising a second type of error, different from the one associated with the trend component. The idea is that the model endogenously produces a success rate that, on average, matches the empirical one. This means that one needs to compute, from the data, how often each indicator improves in relation to the number of attempts to improve it. For this dataset, in which we have more than 20 observations per time series, we can simply count how many times an indicator improved from one period to the next, and divide that number by the number of times it changed. Other ways to obtain this success rate include expert assessments or pooling indicators in the same category (like an SDG).


In [6]:
df['I0'] = df[years[0]]
df['IF'] = df[years[-1]]
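# count the periods in which each indicator improved and divide by the number of year-to-year transitions
values = df[years].values.astype(float)
successRates = np.sum(values[:, 1:] > values[:, :-1], axis=1)/(len(years)-1)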

# if a success rate is 0 or 1, it is recommended to replace them by a very low or high value as 
# zeros and ones are usually an artifact of lacking data on enough policy trials in the indicator
successRates[successRates==0] = .05
successRates[successRates==1] = .95
df['successRates'] = successRates
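
The cell above divides the number of improvements by the number of year-to-year transitions, `len(years)-1`. The two denominators coincide only when an indicator changes in every period. The sketch below is a hypothetical variant that follows the verbal definition more literally and divides by the number of periods in which the indicator actually changed. The function name `success_rates_by_changes`, the toy values, and the decision to return `NaN` for indicators that never change are assumptions of this sketch, not part of the tutorial.

```python
import numpy as np

def success_rates_by_changes(values, low=0.05, high=0.95):
    """Share of improvements among the periods in which each indicator changed."""
    values = np.asarray(values, dtype=float)
    diffs = values[:, 1:] - values[:, :-1]
    improvements = np.sum(diffs > 0, axis=1)
    changes = np.sum(diffs != 0, axis=1)
    # indicators that never change get NaN instead of a division by zero
    rates = np.full(len(values), np.nan)
    changed = changes > 0
    rates[changed] = improvements[changed] / changes[changed]
    # same replacement of extreme rates as in the cell above
    rates[rates == 0] = low
    rates[rates == 1] = high
    return rates

# hypothetical normalised series
toy_values = [
    [0.1, 0.2, 0.2, 0.3, 0.25],  # 2 improvements out of 3 changes
    [0.5, 0.5, 0.5, 0.5, 0.5],   # never changes
    [0.1, 0.2, 0.3, 0.4, 0.5],   # always improves
]
toy_rates = success_rates_by_changes(toy_values)
```

Rows that come back as `NaN` would need a rate from another source, such as the expert assessments or pooling mentioned earlier.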

## Development gaps

To capture the trend component of indicators, PPI measures an error with respect to the development gap shown in the historical data, that is, the difference between the final and initial values. It is important to make sure that these two values are different for each indicator, i.e. that the development gap is non-zero; otherwise, the calibration method will not be able to define the gap error. Thus, here I will introduce a slight modification to the final values to ensure that we get non-zero gaps.

In [7]:
df.loc[df.I0==df.IF, 'IF'] = df.loc[df.I0==df.IF, 'IF']*1.05
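
Multiplying by 1.05 leaves a final value of exactly 0 unchanged, and it can push a value close to 1 above the normalised upper limit. The sketch below is a hypothetical alternative that uses a small additive step instead. The function name `ensure_nonzero_gaps`, the step size of 0.01 and the toy table are assumptions for illustration. Note that the step is applied downwards when the indicator already sits at the upper limit, which implies a small negative gap for that row.

```python
import numpy as np
import pandas as pd

def ensure_nonzero_gaps(table, step=0.01):
    """Return a copy of the table in which every row has IF different from I0."""
    out = table.copy()
    flat = (out['I0'] == out['IF']).to_numpy()
    final = out['IF'].to_numpy(dtype=float).copy()
    # nudge upwards, unless that would leave the normalised range
    nudge = np.where(final + step > 1.0, -step, step)
    final[flat] = final[flat] + nudge[flat]
    out['IF'] = final
    return out

# hypothetical initial and final values
toy = pd.DataFrame({'I0': [0.0, 0.3, 1.0, 0.2], 'IF': [0.0, 0.3, 1.0, 0.4]})
fixed = ensure_nonzero_gaps(toy)
```

Rows whose gap was already non-zero, like the last one in the toy table, are left untouched.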

## Governance parameters

PPI takes into account the role of public governance. First, it considers the quality of the monitoring mechanisms that the central authority uses to spot inefficiencies. Second, it accounts for the quality of the rule of law to exercise corrective measures when an inefficiency is discovered. In PPI's empirical applications, these parameters come from the World Bank's Worldwide Governance Indicators. Their values need to be between 0 and 1, and they can be specific to each indicator if the user has information about how heterogeneous public governance is across different policy issues.

For the hypothetical country in these tutorials, let us assume that both parameters are homogeneous across indicators, and that both have a value of 0.5. Note that these parameters only affect (directly) the instrumental indicators. Thus, if you assign values to the collateral ones, PPI will ignore them.

In [8]:
df['qm'] = 0.5 # quality of monitoring
df['rl'] = 0.5 # quality of the rule of law
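
If information on heterogeneous governance were available, the parameters could be assigned per policy issue rather than as a single constant. The sketch below is hypothetical: the function name `assign_governance`, the mapping by SDG and every numerical value in it are invented for illustration and do not come from any governance dataset.

```python
import pandas as pd

def assign_governance(table, qm_by_sdg, rl_by_sdg, default=0.5):
    """Assign qm and rl by SDG, falling back to a default for unlisted SDGs."""
    out = table.copy()
    out['qm'] = out['sdg'].map(qm_by_sdg).fillna(default)
    out['rl'] = out['sdg'].map(rl_by_sdg).fillna(default)
    # both parameters need to lie between 0 and 1
    for column in ('qm', 'rl'):
        if not out[column].between(0, 1).all():
            raise ValueError(f'{column} must lie between 0 and 1')
    return out

# hypothetical indicators and governance values
toy = pd.DataFrame({'seriesCode': ['a', 'b', 'c'], 'sdg': [1, 8, 16]})
toy = assign_governance(toy, qm_by_sdg={1: 0.4, 16: 0.7}, rl_by_sdg={16: 0.6})
```

Any SDG missing from a mapping receives the default of 0.5, which matches the homogeneous assumption used in this tutorial.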

## Export data

Now the data are ready to be exported for use with PPI.

In [9]:
df.to_csv('clean_data/data_indicators.csv', index=False)