# World Inequality Database

The [World Inequality Database (WID.world)](https://wid.world/wid-world/) aims to provide open and convenient access to the most extensive available database on the historical evolution of the world distribution of income and wealth, both within and between countries. The dataset addresses some of the main limitations of household surveys as a source for national statistics of this kind: under-coverage at the top of the distribution due to non-response (the richest tend not to answer this kind of survey, or omit their income) and measurement error (the richest underreport their income, either for convenience or because they do not know the exact figure once all their activities are added up). The problem is handled by combining fiscal and national accounts data with household surveys, building on the work of leading researchers in the area: Anthony B. Atkinson, Thomas Piketty, Emmanuel Saez, Facundo Alvaredo, Gabriel Zucman, and hundreds of others. The initiative is based at the Paris School of Economics (as the [World Inequality Lab](https://inequalitylab.world/)) and compiles the World Inequality Report, a periodic publication on how inequality has evolved up to the most recent year.

Besides income and wealth distribution data, WID has recently added carbon emissions to generate carbon inequality indices. It also offers decomposed statistics on national income. The data can be obtained from the website or through R and Stata commands.

## Distributions considered in this analysis

Three income distributions are considered, each stored in a separate CSV file:
- **wid_pretax_992j_dist.csv** is the pretax income distribution `ptinc`, which includes social insurance benefits (and removes the corresponding contributions), but excludes other forms of redistribution (income tax, social assistance benefits, etc.).
- **wid_posttax_nat_992j_dist.csv** is the post-tax national income distribution `diinc`, which includes both in-kind and in-cash redistribution.
- **wid_posttax_dis_992j_dist.csv** is the post-tax disposable income distribution `cainc`, which excludes in-kind transfers (because the distribution of in-kind transfers requires a lot of assumptions).

These distributions are the main DINA (distributional national accounts) income variables available at WID. DINA income concepts are distributed income concepts that are consistent with national accounts aggregates. The precise definitions are outlined in the [DINA guidelines](https://wid.world/es/news-article/2020-distributional-national-accounts-guidelines-dina-4/) and country-specific papers. 

All of these distributions are generated using equal-split adults (j) as the population unit, meaning that the unit is the individual, but that income or wealth is distributed equally among all household members. The age group is individuals over age 20 (992, adult population), which excludes children (who have zero income in most cases). Extrapolations and interpolations are excluded from these files, as WID discourages their use at the level of individual countries (see the `exclude` description at `help wid` in Stata). More information about the variables and definitions can be found in [WID's codes dictionary](https://wid.world/codes-dictionary/).

The distributions analysed in this notebook come from commands given in the `wid` function in Stata. These commands are located in the `wid_distribution.do` file from this same folder. Opening the file and pressing the *Execute (do)* button will generate the most recent data from WID. Both `.csv` and `.dta` files are available for analysis.

## Main variables

In [1]:
import pandas as pd
from pathlib import Path
import time
import seaborn as sns

# keep_default_na=False with an explicit na_values list is needed because Namibia's
# country code is "NA", which pandas would otherwise read as a missing value.
# The list keeps the usual missing-value markers but leaves out 'NA'.
NA_VALUES = ['-1.#IND', '1.#QNAN', '1.#IND', '-1.#QNAN', '#N/A N/A', '#N/A', 'N/A', 'n/a', '', '#NA',
             'NULL', 'null', 'NaN', '-NaN', 'nan', '-nan']

file = Path('wid_pretax_992j_dist.csv')
wid_pretax = pd.read_csv(file, keep_default_na=False, na_values=NA_VALUES)

file = Path('wid_posttax_nat_992j_dist.csv')
wid_posttax_nat = pd.read_csv(file, keep_default_na=False, na_values=NA_VALUES)

file = Path('wid_posttax_dis_992j_dist.csv')
wid_posttax_dis = pd.read_csv(file, keep_default_na=False, na_values=NA_VALUES)

#The variable 'country_year' is created, to identify unique distributions:
wid_pretax['country_year'] = wid_pretax['country'] + wid_pretax['year'].astype(str)
wid_posttax_nat['country_year'] = wid_posttax_nat['country'] + wid_posttax_nat['year'].astype(str)
wid_posttax_dis['country_year'] = wid_posttax_dis['country'] + wid_posttax_dis['year'].astype(str)
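
The effect of `keep_default_na` can be illustrated with a small sketch. The CSV text below is made up for the example (it is not WID data); it only shows that pandas reads the string `NA` as missing by default and keeps it as text once the default list is replaced.

```python
import io
import pandas as pd

# Made-up CSV with Namibia's code "NA" and one genuinely empty cell.
csv_text = "country,year,share\nNA,2000,0.01\nUS,2000,\n"

# Default parsing: the string "NA" is treated as a missing value.
default = pd.read_csv(io.StringIO(csv_text))

# Explicit list of missing-value markers that leaves out 'NA'.
NA_VALUES = ['-1.#IND', '1.#QNAN', '1.#IND', '-1.#QNAN', '#N/A N/A', '#N/A',
             'N/A', 'n/a', '', '#NA', 'NULL', 'null', 'NaN', '-NaN', 'nan', '-nan']
safe = pd.read_csv(io.StringIO(csv_text), keep_default_na=False, na_values=NA_VALUES)

print(default['country'].isna().sum())  # rows whose country was lost
print(safe['country'].tolist())         # country codes kept as text
print(safe['share'].isna().sum())       # the empty cell is still missing
```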

The key variables resulting from the transformations in Stata are:
- **country** mostly follows the ISO 3166-1 alpha-2 standard, but also includes world regions, country subregions (rural and urban China, for example), former countries and countries not officially included in the standard. All the countries available are extracted. See https://wid.world/codes-dictionary/#country-code
- **year** is the year of the distribution. All available years are extracted.
- **percentile** is the percentile (or, more broadly, quantile) of the distribution. Values have the format *pXpY*, where X and Y are both numbers between 0 and 100. X corresponds to the percentile for the lower bound of the group, and Y to the percentile for the upper bound (hence X < Y). 130 different quantiles are extracted: the 100 percentiles from p0p1 to p99p100, tenths of a percentile in the top 1% (p99p99.1, p99.1p99.2, p99.2p99.3, …, p99.8p99.9, p99.9p100), hundredths of a percentile in the top 0.1% (p99.9p99.91, p99.91p99.92, …, p99.98p99.99, p99.99p100), and thousandths of a percentile in the top 0.01% (p99.99p99.991, p99.991p99.992, …, p99.998p99.999, p99.999p100). See https://wid.world/codes-dictionary/#percentile-code
- **p** represents the same information as *percentile*, in a form that is simpler to sort: the lower bound X is extracted from *pXpY* and divided by 100, giving numbers from 0 to 1.
- **threshold** is the minimum level of income that gets you into a group. For example, the income threshold of the group p90p100 is the income of the poorest individuals in the top 10%. By definition, it is equal to the income threshold of the groups p90p99 or p90p91.
- **average** is the average income of the people in the group. For example, the income average of the group p90p99 is the average income of the top 10% excluding the top 1%.
- **share** is the income of the group, divided by the total for the whole population. For example, the share of the group p99p100 is the top 1% income share.
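
As an illustration of the *pXpY* format, the sketch below rebuilds the 130 expected labels and derives **p** from a label. In this analysis **p** is produced in Stata; the helper names `label`, `expected_labels` and `lower_bound` are hypothetical and only serve the example.

```python
import re

def label(lo, hi):
    """Format a bracket from bounds given in thousandths of a percentile."""
    return f"p{lo / 1000:g}p{hi / 1000:g}"

def expected_labels():
    """Build the 130 brackets: 100 percentiles plus three finer top splits."""
    labels = [label(1000 * i, 1000 * (i + 1)) for i in range(100)]
    # (start, step) in thousandths: tenths, hundredths and thousandths
    for start, step in ((99000, 100), (99900, 10), (99990, 1)):
        labels += [label(start + step * k, start + step * (k + 1)) for k in range(10)]
    return labels

def lower_bound(percentile):
    """Return the lower bound X of a 'pXpY' label as a number from 0 to 1."""
    match = re.fullmatch(r"p(\d+(?:\.\d+)?)p(\d+(?:\.\d+)?)", percentile)
    if match is None:
        raise ValueError(f"not a pXpY label: {percentile!r}")
    return float(match.group(1)) / 100

labels = expected_labels()
print(len(labels), labels[0], labels[-1])
print(lower_bound('p90p100'))
```

Building the bounds from integers avoids accumulating floating-point error in labels such as `p99.991p99.992`.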

Threshold and average data are converted to 2017 USD PPP using `xlcusp` in Stata (see https://wid.world/codes-dictionary/#exchange-rate). The variables **age** and **pop** (age group and population unit, respectively) are also in the dataset, but mainly for internal reference, as they take the same value for every observation (992 and j). Although more age groups and population units are available to query, most of them return far fewer results than the 992 and j combination, or no data at all (see the options [here](https://wid.world/codes-dictionary/#three-digit-code) and [here](https://wid.world/codes-dictionary/#one-letter-code)).

Basic descriptive statistics are presented for the three distributions:

In [2]:
wid_pretax.describe(include='all')

Unnamed: 0,country,year,percentile,p,threshold,average,share,inv_paretolorenz,age,pop,country_year
count,731157,731157.0,731157,731157.0,530108.0,532089.0,731157.0,22020.0,731157.0,731157,731157
unique,225,,130,,,,,,,1,6454
top,US,,p99.9p100,,,,,,,j,AE1998
freq,13780,,6358,,,,,,,731157,130
mean,,1997.424206,,0.612528,384572.1,525326.2,0.009878,112584.3,992.0,,
std,,17.98679,,0.330383,2101428.0,3507536.0,0.020805,1784960.0,0.0,,
min,,1870.0,,0.0,-139636.4,-69818.29,-0.037,0.9999999,992.0,,
25%,,1989.0,,0.32,4664.6,5764.0,0.0023,1.681378,992.0,,
50%,,2001.0,,0.65,20089.82,22236.88,0.0056,2.208449,992.0,,
75%,,2010.0,,0.97,78365.61,93560.16,0.0107,2.583091,992.0,,


With 731,157 observations, the pretax income distribution file is by far the largest of the three. It also contains 225 different countries/regions, almost 5 times the number in the post-tax files. This amounts to a total of 6,454 different distributions (country-years available). Although the data starts in 1870, it is concentrated mostly in the last three decades (the median of the *year* variable is 2001).

In [3]:
wid_posttax_nat.describe(include='all')

Unnamed: 0,country,year,percentile,p,threshold,average,share,inv_paretolorenz,age,pop,country_year
count,191034,191034.0,191034,191034.0,169650.0,169650.0,191034.0,5876.0,191034.0,191034,191034
unique,48,,130,,,,,,,1,1533
top,US,,p99p100,,,,,,,j,AL1996
freq,13780,,1533,,,,,,,191034,130
mean,,1998.023551,,0.611196,293545.5,391258.3,0.008559,1.972184,992.0,,
std,,16.279737,,0.330177,1223517.0,2213586.0,0.009196,0.528526,0.0,,
min,,1900.0,,0.0,-23311800.0,-542060.6,-0.1082,1.0,992.0,,
25%,,1989.0,,0.32,19050.12,19320.18,0.0037,1.622764,992.0,,
50%,,2001.0,,0.65,37735.12,38106.82,0.0071,1.864115,992.0,,
75%,,2010.0,,0.97,91027.25,94938.97,0.0109,2.194164,992.0,,


The post-tax national income distribution file contains 191,034 observations for only 48 countries, which make 1533 different distributions. The minimum year is 1900, although the distributions are again concentrated more recently (median 2001).

In [4]:
wid_posttax_dis.describe(include='all')

Unnamed: 0,country,year,percentile,p,threshold,average,share,inv_paretolorenz,age,pop,country_year
count,177572,177572.0,177572,177572.0,155870.0,155870.0,177572.0,5452.0,177572.0,177572,177572
unique,48,,130,,,,,,,1,1533
top,FR,,p99p100,,,,,,,j,AL1996
freq,5394,,1533,,,,,,,177572,130
mean,,2000.48921,,0.611895,224561.1,313434.9,0.008684,2.108109,992.0,,
std,,11.308023,,0.330284,836220.6,1701817.0,0.009994,0.540296,0.0,,
min,,1900.0,,0.0,-19013570.0,-442409.1,-0.1179,1.000103,992.0,,
25%,,1991.0,,0.32,14159.95,14774.83,0.0035,1.679212,992.0,,
50%,,2002.0,,0.65,28472.79,29761.64,0.0069,2.072798,992.0,,
75%,,2010.0,,0.97,71008.78,77438.28,0.0111,2.461455,992.0,,


The post-tax disposable income file is the one with the fewest observations (177,572), covering 48 countries and again 1533 different distributions. The minimum year is 1900 (median 2002).

## Sanity checks for the income distributions

The distributions are explored in more detail to find and, where possible, correct errors in the original data.

### The same quantiles available for each country-year

It is very important that each distribution contains all 130 quantiles requested by the original query in Stata, so that inequality statistics can be estimated properly.
One way to see if this holds is by counting the different occurrences of *percentile* for each distribution. The dataframes are grouped by country and year for this purpose.

#### Pretax income

In [5]:
pretax_count = wid_pretax.groupby(['country', 'year', 'country_year']).nunique()
pretax_not130 = pretax_count[pretax_count['percentile']!=130].reset_index()
pretax_not130

Unnamed: 0,country,year,country_year,percentile,p,threshold,average,share,inv_paretolorenz,age,pop
0,AR,1932,AR1932,3,3,0,0,3,0,1,1
1,AR,1933,AR1933,3,3,0,0,3,0,1,1
2,AR,1934,AR1934,3,3,0,0,3,0,1,1
3,AR,1935,AR1935,3,3,0,0,3,0,1,1
4,AR,1936,AR1936,3,3,0,0,3,0,1,1
...,...,...,...,...,...,...,...,...,...,...,...
846,ZW,1974,ZW1974,2,2,0,2,2,0,1,1
847,ZW,1975,ZW1975,2,2,0,2,2,0,1,1
848,ZW,1976,ZW1976,2,2,0,2,2,0,1,1
849,ZW,1977,ZW1977,2,2,0,2,2,0,1,1


In the case of the pretax data there are 851 different distributions that do not have the 130 quantiles. The main stats of this group are in the following table.

In [6]:
pretax_not130.describe(include='all')

Unnamed: 0,country,year,country_year,percentile,p,threshold,average,share,inv_paretolorenz,age,pop
count,851,851.0,851,851.0,851.0,851.0,851.0,851.0,851.0,851.0,851.0
unique,21,,851,,,,,,,,
top,JP,,AR1932,,,,,,,,
freq,93,,1,,,,,,,,
mean,,1942.028202,,3.251469,3.156287,0.0,2.443008,3.162162,0.0,1.0,1.0
std,,23.922555,,5.087954,4.568976,0.0,5.328943,4.623421,0.0,0.0,0.0
min,,1870.0,,1.0,1.0,0.0,0.0,1.0,0.0,1.0,1.0
25%,,1927.0,,2.0,2.0,0.0,0.0,2.0,0.0,1.0,1.0
50%,,1944.0,,3.0,3.0,0.0,2.0,3.0,0.0,1.0,1.0
75%,,1961.0,,3.0,3.0,0.0,3.0,3.0,0.0,1.0,1.0


21 different countries are in this situation, with a range of years from 1870 to 1979 (median 1944). The number of percentiles in this group ranges from 1 to 31 (median 3). The 21 countries are:

In [7]:
pretax_not130.country.value_counts(dropna=False)

JP    93
DE    71
GB    71
DK    65
FI    60
NO    50
SE    48
IE    46
ZW    45
NL    42
ZA    40
HU    30
IN    27
AR    27
CH    24
MW    24
ES    24
SG    21
ID    20
GR    13
KR    10
Name: country, dtype: int64

The list of country-years without 130 quantiles can be extracted and used to filter the original dataset, to see which few quantiles are present.

In [8]:
pretax_not130_list = list(pretax_not130.country_year.unique())
wid_pretax_not130 = wid_pretax[wid_pretax['country_year'].isin(pretax_not130_list)].reset_index(drop=True)
wid_pretax_clean = wid_pretax[~wid_pretax['country_year'].isin(pretax_not130_list)].reset_index(drop=True)

wid_pretax_not130.percentile.value_counts(dropna=False)

p99.9p100         755
p99p100           703
p99.99p100        553
p99.95p99.96       27
p99.998p99.999     27
p99.997p99.998     27
p99.996p99.997     27
p99.995p99.996     27
p99.994p99.995     27
p99.993p99.994     27
p99.992p99.993     27
p99.991p99.992     27
p99.99p99.991      27
p99.98p99.99       27
p99.97p99.98       27
p99.96p99.97       27
p99.93p99.94       27
p99.94p99.95       27
p99.92p99.93       27
p99.91p99.92       27
p99.9p99.91        27
p99.8p99.9         27
p99.7p99.8         27
p99.6p99.7         27
p99.5p99.6         27
p99.4p99.5         27
p99.3p99.4         27
p99.2p99.3         27
p99.1p99.2         27
p99p99.1           27
p99.999p100        27
Name: percentile, dtype: int64

All of them come from the top 1%: either the last percentile as a whole or one of its subdivisions.

#### Post-tax national income

For the post-tax national income distribution there are fewer cases:

In [9]:
posttax_nat_count = wid_posttax_nat.groupby(['country', 'year', 'country_year']).nunique()
posttax_nat_not130 = posttax_nat_count[posttax_nat_count['percentile']!=130].reset_index()
posttax_nat_not130

Unnamed: 0,country,year,country_year,percentile,p,threshold,average,share,inv_paretolorenz,age,pop
0,FR,1900,FR1900,1,1,0,0,1,0,1,1
1,FR,1910,FR1910,1,1,0,0,1,0,1,1
2,FR,1915,FR1915,1,1,0,0,1,0,1,1
3,FR,1916,FR1916,1,1,0,0,1,0,1,1
4,FR,1917,FR1917,1,1,0,0,1,0,1,1
...,...,...,...,...,...,...,...,...,...,...,...
59,FR,1973,FR1973,1,1,0,0,1,0,1,1
60,FR,1974,FR1974,1,1,0,0,1,0,1,1
61,FR,1976,FR1976,1,1,0,0,1,0,1,1
62,FR,1977,FR1977,1,1,0,0,1,0,1,1


64 different distributions do not have 130 percentiles.

In [10]:
posttax_nat_not130.describe(include='all')

Unnamed: 0,country,year,country_year,percentile,p,threshold,average,share,inv_paretolorenz,age,pop
count,64,64.0,64,64.0,64.0,64.0,64.0,64.0,64.0,64.0,64.0
unique,1,,64,,,,,,,,
top,FR,,FR1900,,,,,,,,
freq,64,,1,,,,,,,,
mean,,1944.390625,,1.0,1.0,0.0,0.0,1.0,0.0,1.0,1.0
std,,19.389587,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
min,,1900.0,,1.0,1.0,0.0,0.0,1.0,0.0,1.0,1.0
25%,,1928.75,,1.0,1.0,0.0,0.0,1.0,0.0,1.0,1.0
50%,,1944.5,,1.0,1.0,0.0,0.0,1.0,0.0,1.0,1.0
75%,,1960.25,,1.0,1.0,0.0,0.0,1.0,0.0,1.0,1.0


Only one country is in this situation (France), with a range of years from 1900 to 1978 (median 1944.5). Each of these distributions contains exactly 1 percentile.

All of these 64 percentiles are p99p100, the top 1%:

In [11]:
posttax_nat_not130_list = list(posttax_nat_not130.country_year.unique())
wid_posttax_nat_not130 = wid_posttax_nat[wid_posttax_nat['country_year'].isin(posttax_nat_not130_list)].reset_index(drop=True)
wid_posttax_nat_clean = wid_posttax_nat[~wid_posttax_nat['country_year'].isin(posttax_nat_not130_list)].reset_index(drop=True)

wid_posttax_nat_not130.percentile.value_counts(dropna=False)

p99p100    64
Name: percentile, dtype: int64

#### Post-tax disposable income

More post-tax disposable income distributions fall into this category:

In [13]:
posttax_dis_count = wid_posttax_dis.groupby(['country', 'year', 'country_year']).nunique()
posttax_dis_not130 = posttax_dis_count[posttax_dis_count['percentile']!=130].reset_index()
posttax_dis_not130

Unnamed: 0,country,year,country_year,percentile,p,threshold,average,share,inv_paretolorenz,age,pop
0,FR,1900,FR1900,1,1,0,0,1,0,1,1
1,FR,1910,FR1910,1,1,0,0,1,0,1,1
2,FR,1915,FR1915,1,1,0,0,1,0,1,1
3,FR,1916,FR1916,1,1,0,0,1,0,1,1
4,FR,1917,FR1917,1,1,0,0,1,0,1,1
...,...,...,...,...,...,...,...,...,...,...,...
165,US,2014,US2014,3,3,0,0,3,0,1,1
166,US,2015,US2015,3,3,0,0,3,0,1,1
167,US,2016,US2016,3,3,0,0,3,0,1,1
168,US,2017,US2017,3,3,0,0,3,0,1,1


In the case of the post-tax disposable data there are 170 different distributions that do not have the 130 quantiles. The main stats of this group are in the following table.

In [14]:
posttax_dis_not130.describe(include='all')

Unnamed: 0,country,year,country_year,percentile,p,threshold,average,share,inv_paretolorenz,age,pop
count,170,170.0,170,170.0,170.0,170.0,170.0,170.0,170.0,170.0,170.0
unique,2,,170,,,,,,,,
top,US,,FR1900,,,,,,,,
freq,106,,1,,,,,,,,
mean,,1957.552941,,2.247059,2.247059,0.0,0.0,2.247059,0.0,1.0,1.0
std,,28.854873,,0.971863,0.971863,0.0,0.0,0.971863,0.0,0.0,0.0
min,,1900.0,,1.0,1.0,0.0,0.0,1.0,0.0,1.0,1.0
25%,,1934.0,,1.0,1.0,0.0,0.0,1.0,0.0,1.0,1.0
50%,,1955.0,,3.0,3.0,0.0,0.0,3.0,0.0,1.0,1.0
75%,,1977.0,,3.0,3.0,0.0,0.0,3.0,0.0,1.0,1.0


Only two countries are in this situation (France and the US), with a range of years from 1900 to 2018 (median 1955). The number of different percentiles in these distributions ranges between 1 and 3. The cases are distributed as follows:

In [15]:
posttax_dis_not130.country.value_counts(dropna=False)

US    106
FR     64
Name: country, dtype: int64

In this case the percentiles are the top 1%, 0.1% and 0.01%:

In [16]:
posttax_dis_not130_list = list(posttax_dis_not130.country_year.unique())
wid_posttax_dis_not130 = wid_posttax_dis[wid_posttax_dis['country_year'].isin(posttax_dis_not130_list)].reset_index(drop=True)
wid_posttax_dis_clean = wid_posttax_dis[~wid_posttax_dis['country_year'].isin(posttax_dis_not130_list)].reset_index(drop=True)

wid_posttax_dis_not130.percentile.value_counts(dropna=False)

p99p100       170
p99.9p100     106
p99.99p100    106
Name: percentile, dtype: int64
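
The same steps are repeated for each of the three files above, so they could be wrapped in one helper. The function name `split_by_completeness` and the toy data are assumptions for illustration, not part of the original notebook.

```python
import pandas as pd

def split_by_completeness(df, n_expected=130):
    """Split df into (complete, incomplete) country-years.

    A country-year is complete when it has n_expected distinct percentiles.
    """
    n_quantiles = df.groupby('country_year')['percentile'].transform('nunique')
    complete = df[n_quantiles == n_expected].reset_index(drop=True)
    incomplete = df[n_quantiles != n_expected].reset_index(drop=True)
    return complete, incomplete

# Toy data: AA2000 has three brackets, BB2000 only the top one.
toy = pd.DataFrame({
    'country_year': ['AA2000'] * 3 + ['BB2000'],
    'percentile': ['p0p50', 'p50p99', 'p99p100', 'p99p100'],
})
clean, not_full = split_by_completeness(toy, n_expected=3)
print(clean['country_year'].unique())
print(not_full)
```

Under that assumption, `wid_pretax_clean, wid_pretax_not130 = split_by_completeness(wid_pretax)` should give the same split as the cells above, since `country_year` is built from country and year.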

### Monotonicity

When ordered by **p**, the threshold and average values for each country-year should not decrease: they can increase or stay the same from one quantile to the next. If they do decrease, the construction of the distribution has failed.
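
A minimal sketch of this check follows, on made-up data. It assumes that the aggregate top brackets p99p100, p99.9p100 and p99.99p100 are left out first, because they share their lower bound **p** with a finer bracket and have a higher average, which would otherwise produce false alarms. The function name `non_monotonic` is hypothetical.

```python
import pandas as pd

# Aggregate top brackets that overlap with the finer subdivisions.
OVERLAPPING = ['p99p100', 'p99.9p100', 'p99.99p100']

def non_monotonic(df, columns=('threshold', 'average')):
    """Return the rows whose value is lower than in the previous quantile."""
    subset = df[~df['percentile'].isin(OVERLAPPING)]
    ordered = subset.sort_values(['country_year', 'p'])
    decreasing = pd.Series(False, index=ordered.index)
    for col in columns:
        # Difference with the previous quantile inside each country-year;
        # missing values give NaN differences, which are never flagged.
        step = ordered.groupby('country_year')[col].diff()
        decreasing |= (step < 0)
    return ordered[decreasing]

# Toy data: BB2000 drops at its third bracket.
toy = pd.DataFrame({
    'country_year': ['AA2000'] * 3 + ['BB2000'] * 3,
    'percentile': ['p0p1', 'p1p2', 'p2p3'] * 2,
    'p': [0.00, 0.01, 0.02] * 2,
    'threshold': [0, 10, 20, 0, 10, 8],
    'average': [5, 15, 25, 5, 15, 12],
})
bad = non_monotonic(toy)
print(bad[['country_year', 'percentile', 'threshold', 'average']])
```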

### Comparability of values between periods

This check assesses the robustness of the **threshold** and **average** data across years: the numbers should evolve plausibly, without sudden jumps that might be due to errors in the construction or to the quality of the microdata.
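
One possible way to flag sudden jumps is to compare each value with the previous available year for the same country and percentile. The sketch below uses made-up data; the function name `flag_jumps` and the factor of 2 used as a cut-off are assumptions chosen for illustration, not a rule taken from WID.

```python
import pandas as pd

def flag_jumps(df, column='average', max_ratio=2.0):
    """Flag values that changed by more than max_ratio versus the previous year."""
    ordered = df.sort_values(['country', 'percentile', 'year']).copy()
    current = ordered[column]
    # Value of the previous available year for the same country and percentile
    previous = ordered.groupby(['country', 'percentile'])[column].shift()
    # A ratio is only meaningful when both values are positive
    ratio = (current / previous).where((current > 0) & (previous > 0))
    ordered['ratio'] = ratio
    return ordered[(ratio > max_ratio) | (ratio < 1 / max_ratio)]

# Toy data: AA almost quadruples between 2001 and 2002, BB is stable.
toy = pd.DataFrame({
    'country': ['AA'] * 3 + ['BB'] * 3,
    'year': [2000, 2001, 2002] * 2,
    'percentile': ['p50p51'] * 6,
    'average': [100.0, 105.0, 400.0, 50.0, 52.0, 51.0],
})
jumps = flag_jumps(toy)
print(jumps[['country', 'year', 'average', 'ratio']])
```

Where years are missing, the comparison spans a longer gap, so flagged rows are candidates for inspection rather than confirmed errors.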

### Negative values

### Total sum of shares equalling 1

The shares must sum to 1 whenever the percentile brackets cover the entire population analysed. Four different checks can be done here, using the tenths, hundredths and thousandths of a percentile within the top 1%:
- The shares of the percentiles p0p1 to p99p100 should sum to 1.
- The shares of p0p1 to p98p99 plus p99p99.1 to p99.9p100 should sum to 1.
- The shares of p0p1 to p98p99, p99p99.1 to p99.8p99.9 and p99.9p99.91 to p99.99p100 should sum to 1.
- The shares of p0p1 to p98p99, p99p99.1 to p99.8p99.9, p99.9p99.91 to p99.98p99.99 and p99.99p99.991 to p99.999p100 should sum to 1.
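
The four checks can be sketched as follows, assuming each finer split replaces only the top bracket of the coarser one so that every set of brackets covers the population exactly once. The helper names and the toy distribution (shares proportional to bracket width) are made up for the example.

```python
import pandas as pd

def label(lo, hi):
    """Format a bracket from bounds given in thousandths of a percentile."""
    return f"p{lo / 1000:g}p{hi / 1000:g}"

def brackets(start, step, n):
    return [label(start + step * k, start + step * (k + 1)) for k in range(n)]

WHOLE = brackets(0, 1000, 100)        # p0p1 ... p99p100
TENTHS = brackets(99000, 100, 10)     # p99p99.1 ... p99.9p100
HUNDREDTHS = brackets(99900, 10, 10)  # p99.9p99.91 ... p99.99p100
THOUSANDTHS = brackets(99990, 1, 10)  # p99.99p99.991 ... p99.999p100

# Each partition drops the top bracket of the coarser level and replaces it
# with the ten finer brackets.
PARTITIONS = {
    'percentiles': WHOLE,
    'tenths': WHOLE[:99] + TENTHS,
    'hundredths': WHOLE[:99] + TENTHS[:9] + HUNDREDTHS,
    'thousandths': WHOLE[:99] + TENTHS[:9] + HUNDREDTHS[:9] + THOUSANDTHS,
}

def share_sums(df):
    """One row per country_year, one column per partition: the sum of shares."""
    sums = {
        name: df[df['percentile'].isin(labels)].groupby('country_year')['share'].sum()
        for name, labels in PARTITIONS.items()
    }
    return pd.DataFrame(sums)

# Toy distribution: every bracket's share equals its width.
rows = []
for start, step, n in ((0, 1000, 100), (99000, 100, 10), (99900, 10, 10), (99990, 1, 10)):
    for k in range(n):
        rows.append({'country_year': 'AA2000',
                     'percentile': label(start + step * k, start + step * (k + 1)),
                     'share': step / 100000})
toy = pd.DataFrame(rows)

result = share_sums(toy)
print(result.round(6))
```

With real data the sums will not be exactly 1 because the shares are rounded, so a tolerance would have to be chosen when flagging country-years.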