# Data preparation – development indicators

In this tutorial, we will pre-process a raw dataset that has been prepared for the tutorials. These data come from the Sustainable Development Report 2022, but they do not represent any particular country, as I have chosen a random sample of indicators. The objective of the tutorial is to show you how to normalise these data and extract the relevant features from them to calibrate the PPI model.

## Import the necessary Python libraries to visualise and manipulate data

In [1]:
import matplotlib.pyplot as plt
import pandas as pd

## Import the raw development indicators

In [15]:
data = pd.read_csv('https://raw.githubusercontent.com/oguerrer/ppi/main/tutorials/raw_data/raw_indicators.csv')

In [21]:
data

| Unnamed: 0 | seriesCode | sdg | 2000 | 2001 | 2002 | 2003 | 2004 | 2005 | 2006 | 2007 | ... | 2019 | 2020 | 2021 | 2022 | seriesName | bestBound | worstBound | instrumental | invert | color |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | sdg8_unemp | 8 | 20.955 | 21.100 | 20.592 | 20.213 | 20.019 | 19.751 | 19.233 | 19.059 | ... | 19.280 | 21.002 | 21.618 | 21.032 | Unemployment rate (% of total labor force, age... | 25.9 | 0.50 | 0 | 1.0 | #A21942 |
| 1 | sdg11_slums | 11 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | ... | 0.000 | 0.000 | 0.000 | 0.000 | Proportion of urban population living in slums... | 90.0 | 0.00 | 1 | 1.0 | #FD9D24 |
| 2 | sdg5_familypl | 5 | 71.900 | 72.100 | 72.400 | 72.700 | 73.000 | 73.300 | 73.600 | 73.800 | ... | 76.900 | 77.100 | 77.200 | 77.400 | Demand for family planning satisfied by modern... | 17.5 | 100.00 | 1 | 0.0 | #FF3A21 |
| 3 | sdg1_wpc | 1 | 17.298 | 17.369 | 17.450 | 17.534 | 17.612 | 17.678 | 17.730 | 17.770 | ... | 13.337 | 14.616 | 14.777 | 14.510 | Poverty headcount ratio at $1.90/day (%) | 72.6 | 0.00 | 1 | 1.0 | #E5243B |
| 4 | sdg1_320pov | 1 | 33.232 | 33.342 | 33.470 | 33.604 | 33.724 | 33.816 | 33.877 | 33.914 | ... | 26.765 | 28.674 | 28.765 | 28.310 | Poverty headcount ratio at $3.20/day (%) | 51.5 | 0.00 | 1 | 1.0 | #E5243B |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 67 | sdg16_rsf | 16 | 27.740 | 27.754 | 27.766 | 27.775 | 27.783 | 27.790 | 27.796 | 27.801 | ... | 30.670 | 30.855 | 31.147 | 31.148 | Press Freedom Index (best 0-100 worst) | 80.0 | 10.00 | 1 | 1.0 | #00689D |
| 68 | sdg16_justice | 16 | 0.544 | 0.545 | 0.545 | 0.545 | 0.545 | 0.545 | 0.545 | 0.545 | ... | 0.579 | 0.585 | 0.585 | 0.585 | Access to and affordability of justice (worst ... | 0.1 | 0.75 | 1 | 0.0 | #00689D |
| 69 | sdg17_govex | 17 | 9.003 | 8.980 | 8.751 | 8.802 | 8.819 | 8.747 | 8.682 | 8.650 | ... | 8.112 | 8.111 | 8.110 | 8.109 | Government spending on health and education (%... | 0.0 | 15.00 | 1 | 0.0 | #19486A |
| 70 | sdg17_govrev | 17 | 29.534 | 30.017 | 29.195 | 29.147 | 29.451 | 29.885 | 30.150 | 30.739 | ... | 20.629 | 20.574 | 20.568 | 20.563 | Other countries: Government revenue excluding ... | 10.0 | 40.00 | 1 | 0.0 | #19486A |


As the previous table shows, the dataset contains different development indicators in their original units. While normalising the observations is not a requirement to run PPI, it helps with the calibration. Likewise, it is recommended to invert the direction of those indicators where better outcomes are reflected in lower values, as this makes the analysis easier to interpret.

Next, let me explain the different columns of this dataset:


* <strong>seriesCode</strong>: The code assigned to the indicator. It captures the SDG to which it belongs and the main policy issue that it relates to.
* <strong>sdg</strong>: The sustainable development goal (SDG) in which the indicator is classified.
* <strong>2000...2022</strong>: The value of the indicator in the corresponding year.
* <strong>seriesName</strong>: The complete name of the indicator.
* <strong>bestBound</strong>: The highest value that the indicator can take.
* <strong>worstBound</strong>: The lowest value that the indicator can take.
* <strong>instrumental</strong>: Takes 1 if an indicator is instrumental and 0 if it is collateral.
* <strong>invert</strong>: Takes 1 if the indicator needs to be inverted and 0 if not.
* <strong>color</strong>: The color of the SDG to which the indicator belongs.
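
To make the column layout above more concrete, here is a minimal sketch that separates the yearly observations from the descriptive columns. It uses a small toy frame (two rows and two years copied from the table above) rather than the full dataset, and the names `toy`, `col_years` and `col_meta` are my own choices for illustration. It assumes that year columns are exactly those whose label consists of digits, which holds when the file is read with `pd.read_csv`.

```python
import pandas as pd

# Toy frame with the same kind of columns as the raw dataset
# (only two indicators and two years, for illustration).
toy = pd.DataFrame({
    'seriesCode': ['sdg8_unemp', 'sdg5_familypl'],
    'sdg': [8, 5],
    '2000': [20.955, 71.900],
    '2001': [21.100, 72.100],
    'seriesName': ['Unemployment rate', 'Demand for family planning'],
    'bestBound': [25.9, 17.5],
    'worstBound': [0.5, 100.0],
    'instrumental': [0, 1],
    'invert': [1.0, 0.0],
    'color': ['#A21942', '#FF3A21'],
})

# Assumption: year columns are the ones whose label is made of digits only.
col_years = [c for c in toy.columns if str(c).isdigit()]

# Everything else is metadata describing the indicator.
col_meta = [c for c in toy.columns if c not in col_years]

print(col_years)
print(col_meta)
```

Keeping the list of year columns in one variable makes it easy to apply the same operation to every observation of an indicator without touching its metadata.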

Some of the columns in this dataset may seem odd to the user, as they reflect concepts explained in the book and other prior publications. Let me briefly explain these terms for those not fully acquainted with PPI.

The <strong>bestBound</strong> and <strong>worstBound</strong> are the so-called technical or theoretical limits of an indicator. The former determines the highest possible value and the latter the lowest. In this tutorial, they will help us to normalise each indicator between 0 and 1. Sometimes, technical bounds are provided with the data; at other times, you need to determine them according to prior knowledge or expert advice. In this example, I have taken the values that the Sustainable Development Report declares as the optimum and the worst possible. Therefore, strictly speaking, they are not technical bounds. We will also normalise the technical bounds (turning them into 1s and 0s) to use them in PPI.
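
The normalisation and inversion described above can be sketched as follows. This is only an illustration under stated assumptions, not necessarily the exact procedure used later in the tutorial: I assume a min-max rescaling of the form (value - worstBound) / (bestBound - worstBound), followed by 1 - value for the indicators flagged with `invert` equal to 1. The first row of the toy frame is copied from the table above; the second row (`toy_indicator`) is hypothetical.

```python
import pandas as pd

# Toy data: the first row comes from the table above,
# the second row is a made-up indicator for illustration.
toy = pd.DataFrame({
    'seriesCode': ['sdg8_unemp', 'toy_indicator'],
    '2000': [20.955, 40.0],
    '2001': [21.100, 50.0],
    'bestBound': [25.9, 100.0],   # highest value the indicator can take
    'worstBound': [0.5, 0.0],     # lowest value the indicator can take
    'invert': [1.0, 0.0],
})

years = ['2000', '2001']

# Assumed min-max rescaling: map worstBound to 0 and bestBound to 1, row by row.
span = toy['bestBound'] - toy['worstBound']
normalised = toy[years].sub(toy['worstBound'], axis=0).div(span, axis=0)

# Flip the indicators where lower raw values mean better outcomes.
flip = toy['invert'] == 1
normalised.loc[flip] = 1 - normalised.loc[flip]

print(normalised)
```

After this step every observation lies between 0 and 1 and higher values mean better outcomes, so the bounds themselves can be represented by 1 and 0.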



