# World Inequality Database

The [World Inequality Database (WID.world)](https://wid.world/wid-world/) aims to provide open and convenient access to the most extensive available database on the historical evolution of the world distribution of income and wealth, both within and between countries. The dataset addresses some of the main limitations of household surveys when they are used for national statistics of this kind: under-coverage at the top of the distribution due to non-response (the richest tend not to answer this kind of survey, or omit their income) and measurement error (the richest underreport their income, either for convenience or because they do not know the exact figure once all their activities are added up). WID handles the problem by combining fiscal and national accounts data with household surveys, building on the work of leading researchers in the area: Anthony B. Atkinson, Thomas Piketty, Emmanuel Saez, Facundo Alvaredo, Gabriel Zucman, and hundreds of others. The initiative is based at the Paris School of Economics (as the [World Inequality Lab](https://inequalitylab.world/)) and compiles the World Inequality Report, a recurring publication on how inequality has evolved up to the most recent year available.

Besides income and wealth distribution data, WID has recently added carbon emissions data to build carbon inequality indices. It also offers decomposed statistics on national income. The data can be obtained from the website or through R and Stata commands.

## Distributions considered in this analysis

Three income distributions are considered, each stored in its own CSV file:
- **wid_pretax_992j_dist.csv** is the pretax income distribution `ptinc`, which includes social insurance benefits (and removes the corresponding contributions) but excludes other forms of redistribution (income tax, social assistance benefits, etc.).
- **wid_posttax_nat_992j_dist.csv** is the post-tax national income distribution `diinc`, which includes both in-kind and in-cash redistribution.
- **wid_posttax_dis_992j_dist.csv** is the post-tax disposable income distribution `cainc`, which excludes in-kind transfers (because distributing in-kind transfers requires a lot of assumptions).

All of these distributions use equal-split adults (`j`) as the population unit: the unit is the individual, but income or wealth is split equally among the adult members of the household. The age group is individuals aged 20 and over (the adult population, code `992`), which excludes children (who have zero income in most cases). Extrapolations and interpolations are excluded from these files, as WID discourages their use at the level of individual countries (see the `exclude` description under `help wid` in Stata). More information about the variables and definitions can be found in [WID's codes dictionary](https://wid.world/codes-dictionary/).
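The `992` and `j` in the file names follow the way WID composes its variable codes. As a minimal sketch, assuming the layout described in the codes dictionary (one letter for the series type, five letters for the income concept, three digits for the age group, one letter for the population unit), a code such as `sptinc992j` can be split into its parts. The function `parse_wid_code` and the label dictionaries are hypothetical helpers for illustration, not part of any WID package, and the labels cover only the codes used in this notebook.

```python
import re

# Assumed layout of a WID variable code, e.g. "sptinc992j":
# 1-letter series type + 5-letter concept + 3-digit age code + 1-letter unit.
WID_CODE = re.compile(
    r"^(?P<series_type>[a-z])(?P<concept>[a-z]{5})(?P<age>\d{3})(?P<unit>[a-z])$"
)

# Labels restricted to the codes that appear in this notebook (illustrative).
SERIES_TYPES = {
    "a": "average",
    "s": "share",
    "t": "threshold",
    "b": "inverted Pareto-Lorenz coefficient",
}
CONCEPTS = {
    "ptinc": "pretax income",
    "diinc": "post-tax national income",
    "cainc": "post-tax disposable income",
}
AGES = {"992": "adults aged 20 and over"}
UNITS = {"j": "equal-split adults"}


def parse_wid_code(code):
    """Split a WID variable code into its four components (hypothetical helper)."""
    match = WID_CODE.match(code)
    if match is None:
        raise ValueError(f"not a recognised WID variable code: {code!r}")
    parts = match.groupdict()
    # Attach a human-readable label where one is known, else "unknown".
    parts["labels"] = {
        "series_type": SERIES_TYPES.get(parts["series_type"], "unknown"),
        "concept": CONCEPTS.get(parts["concept"], "unknown"),
        "age": AGES.get(parts["age"], "unknown"),
        "unit": UNITS.get(parts["unit"], "unknown"),
    }
    return parts


parsed = parse_wid_code("sptinc992j")
print(parsed["concept"], parsed["age"], parsed["unit"])
```

Codes that do not match the assumed layout raise a `ValueError` rather than being silently truncated, which makes it easier to spot series that follow a different naming pattern.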

The distributions analysed in this notebook are produced with the `wid` command in Stata. The exact calls are in the `wid_distribution.do` file in this same folder. Opening that file and pressing the *Execute (do)* button downloads the most recent data from WID. Both `.csv` and `.dta` files are available for analysis.

## Main variables

In [3]:
import pandas as pd
from pathlib import Path
import time

# Pretax income distribution (ptinc)
file = Path('wid_pretax_992j_dist.csv')
wid_pretax = pd.read_csv(file)

# Post-tax national income distribution (diinc)
file = Path('wid_posttax_nat_992j_dist.csv')
wid_posttax_nat = pd.read_csv(file)

# Post-tax disposable income distribution (cainc)
file = Path('wid_posttax_dis_992j_dist.csv')
wid_posttax_dis = pd.read_csv(file)

In [5]:
wid_pretax.describe()

| statistic | year | p | threshold | average | share | inv_paretolorenz | age |
|---|---|---|---|---|---|---|---|
| count | 712337.0 | 712337.0 | 517877.0 | 518535.0 | 712337.0 | 5505.0 | 712337.0 |
| mean | 1997.578781 | 0.602389 | 373743.7 | 481542.0 | 0.007871 | 75665.8 | 992.0 |
| std | 17.7196 | 0.328698 | 2108352.0 | 3450155.0 | 0.009236 | 1180822.0 | 0.0 |
| min | 1900.0 | 0.0 | -139636.4 | -69818.29 | -0.037 | 0.9999999 | 992.0 |
| 25% | 1989.0 | 0.31 | 4489.474 | 5525.011 | 0.0022 | 1.423047 | 992.0 |
| 50% | 2001.0 | 0.63 | 19109.22 | 20989.93 | 0.0054 | 1.498587 | 992.0 |
| 75% | 2010.0 | 0.95 | 69439.93 | 78125.12 | 0.0102 | 1.658319 | 992.0 |
| max | 2021.0 | 0.99999 | 342500900.0 | 871165100.0 | 0.3415 | 31220230.0 | 992.0 |


In [6]:
wid_posttax_nat.describe()

| statistic | year | p | threshold | average | share | inv_paretolorenz | age |
|---|---|---|---|---|---|---|---|
| count | 186563.0 | 186563.0 | 165735.0 | 165735.0 | 186563.0 | 1469.0 | 186563.0 |
| mean | 1998.041525 | 0.601969 | 284436.8 | 363418.7 | 0.007874 | 1.991355 | 992.0 |
| std | 16.24901 | 0.32862 | 1227903.0 | 2194219.0 | 0.006054 | 0.581568 | 0.0 |
| min | 1913.0 | 0.0 | -23311800.0 | -542060.6 | -0.1082 | 1.0 | 992.0 |
| 25% | 1989.0 | 0.31 | 18650.87 | 18916.93 | 0.0036 | 1.600373 | 992.0 |
| 50% | 2001.0 | 0.63 | 36723.36 | 37078.79 | 0.007 | 1.871296 | 992.0 |
| 75% | 2010.0 | 0.95 | 82228.56 | 84235.49 | 0.0107 | 2.216935 | 992.0 |
| max | 2020.0 | 0.99999 | 64226740.0 | 170607600.0 | 0.0616 | 6.889863 | 992.0 |


In [7]:
wid_posttax_dis.describe()

| statistic | year | p | threshold | average | share | inv_paretolorenz | age |
|---|---|---|---|---|---|---|---|
| count | 173101.0 | 173101.0 | 152273.0 | 152273.0 | 173101.0 | 1363.0 | 173101.0 |
| mean | 2000.572267 | 0.601969 | 217220.8 | 291104.3 | 0.007874 | 2.167789 | 992.0 |
| std | 11.090193 | 0.328621 | 837099.0 | 1686922.0 | 0.006316 | 0.596611 | 0.0 |
| min | 1970.0 | 0.0 | -19013570.0 | -442409.1 | -0.1179 | 1.000103 | 992.0 |
| 25% | 1991.0 | 0.31 | 13867.21 | 14474.16 | 0.0034 | 1.68757 | 992.0 |
| 50% | 2002.0 | 0.63 | 27686.16 | 28894.63 | 0.0068 | 2.131961 | 992.0 |
| 75% | 2010.0 | 0.95 | 64007.78 | 68846.5 | 0.0108 | 2.553951 | 992.0 |
| max | 2020.0 | 0.99999 | 51407300.0 | 147768100.0 | 0.0668 | 7.032377 | 992.0 |
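
In all three summaries the `count` row is lower for `threshold`, `average` and `inv_paretolorenz` than for `p` (in the pretax file, 517,877 thresholds and 5,505 inverted Pareto-Lorenz coefficients against 712,337 rows), so those columns contain missing values. A minimal sketch of how the gaps could be quantified per column is shown below. The helper `missing_summary` is hypothetical and is demonstrated on a small toy frame rather than on the WID files, so the numbers it prints are not WID results.

```python
import pandas as pd


def missing_summary(df, columns=("threshold", "average", "inv_paretolorenz")):
    """Count missing values per column (hypothetical helper, not from the notebook)."""
    # Keep only the requested columns that actually exist in the frame.
    present = [col for col in columns if col in df.columns]
    n_missing = df[present].isna().sum()
    return pd.DataFrame(
        {"n_missing": n_missing, "share_missing": n_missing / len(df)}
    )


# Toy data for illustration only; column names mirror the WID files.
toy = pd.DataFrame(
    {
        "p": [0.0, 0.5, 0.99],
        "threshold": [100.0, None, 300.0],
        "average": [150.0, None, None],
    }
)
summary = missing_summary(toy)
print(summary)
```

The same call could be applied to `wid_pretax`, `wid_posttax_nat` and `wid_posttax_dis` to compare how much of each file is affected.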


## Sanity checks for the income distributions