# Inequality Data Processing (WIID)

## Data Dictionary

| Variable           | Definition                                                                                                                                                                                         |
|--------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| id                 | Identifier                                                                                                                                                                                         |
| country            | Country/area                                                                                                                                                                                       |
| c3                 | 3-letter country code in ISO 3166-1 alpha-3 format                                                                                                                                                 |
| c2                 | 2-letter country code in ISO 3166-1 alpha-2 format                                                                                                                                                 |
| year               | Year. When a survey spans more than one year, the year in which it finished is used                                                                                                                |
| gini_reported      | Gini coefficient as reported by the source (in most cases based on microdata; in some older observations, estimates derive from grouped data)                                                      |
| q1-q5              | Quintile group shares of resource                                                                                                                                                                  |
| d1-d10             | Decile group shares of resource                                                                                                                                                                    |
| bottom5 and top5   | Bottom five and top five percent group shares of resource                                                                                                                                          |
| resource           | Resource concept                                                                                                                                                                                   |
| resource_detailed  | Detailed resource concept                                                                                                                                                                          |
| scale              | Equivalence scale                                                                                                                                                                                  |
| scale_detailed     | Detailed equivalence scale                                                                                                                                                                         |
| sharing_unit       | Income sharing unit/statistical unit                                                                                                                                                               |
| reference_unit     | Unit of analysis, indicates whether the data has been weighted with a person or a household weight                                                                                                 |
| areacovr           | Area coverage. The land area which was included in the original sample surveys etc.                                                                                                                |
| areacovr_detailed  | Detailed area coverage                                                                                                                                                                             |
| popcovr            | Population coverage. The population covered in the sample surveys in the land area (all, rural, urban etc.) which was included                                                                     |
| popcovr_detailed   | Detailed population coverage, including age coverage information in certain cases                                                                                                                  |
| region_un          | Regional grouping based on United Nations geoscheme                                                                                                                                                |
| region_un_sub      | Sub-regional grouping based on United Nations geoscheme                                                                                                                                            |
| region_wb          | Regional grouping based on World Bank classification                                                                                                                                               |
| eu                 | Current EU member state                                                                                                                                                                            |
| oecd               | Current OECD member state                                                                                                                                                                          |
| incomegroup        | World Bank classification by country income                                                                                                                                                        |
| mean               | Survey mean given with the same underlying definitions as the Gini coefficient and the share data                                                                                                  |
| median             | Survey median given with the same underlying definitions as the Gini coefficient and the share data                                                                                                |
| currency           | Currency for the mean and median values. US$2011PPP means the values are in 2011 US dollars per month, adjusted for purchasing power parity.                                                       |
| reference_period   | Time period for measuring mean and median values                                                                                                                                                   |
| exchangerate       | Conversion rate from local currency units (LCU) to United States Dollars (USD)                                                                                                                     |
| mean_usd           | Mean measure in United States Dollar (USD)                                                                                                                                                         |
| median_usd         | Median measure in United States Dollar (USD)                                                                                                                                                       |
| gdp_ppp_pc_usd2011 | Gross Domestic Product (GDP) is converted to United States Dollars (USD) using purchasing power parity rates and divided by total population. Data are in constant 2011 United States Dollar (USD) |
| population         | Population of countries from the UN population prospects                                                                                                                                           |
| revision           | Indicates the revision in which the observation was added to the database                                                                                                                          |
| quality            | Quality assessment                                                                                                                                                                                 |
| quality_score      | Computed quality score                                                                                                                                                                             |
| source             | Source type                                                                                                                                                                                        |
| source_detailed    | Source from which the observation was obtained                                                                                                                                                     |
| source_comments    | Additional source comments                                                                                                                                                                         |
| survey             | Originating survey information                                                                                                                                                                     |
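
The share columns in the dictionary nest by construction: each quintile group covers two decile groups, so `q1` should equal `d1 + d2`, `q2` should equal `d3 + d4`, and so on. The sketch below shows one way to run that sanity check. It uses made-up numbers rather than WIID values, assumes the shares are expressed in percent, and the helper name `quintiles_from_deciles` is hypothetical.

```python
import pandas as pd


def quintiles_from_deciles(frame):
    """Rebuild q1..q5 by summing adjacent decile pairs (d1+d2, d3+d4, ...)."""
    out = pd.DataFrame(index=frame.index)
    for q in range(1, 6):
        out[f"q{q}"] = frame[f"d{2 * q - 1}"] + frame[f"d{2 * q}"]
    return out


# Illustrative decile shares in percent (not real WIID data).
deciles = pd.DataFrame(
    [[2, 4, 5, 6, 7, 9, 10, 13, 17, 27]],
    columns=[f"d{i}" for i in range(1, 11)],
)

quintiles = quintiles_from_deciles(deciles)
print(quintiles.iloc[0].tolist())
```

Comparing the rebuilt quintiles against the reported `q1`-`q5` columns (with a small tolerance for rounding) would flag rows where the two sets of shares disagree.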

In [1]:
import re

import numpy as np
import pandas as pd
import pycountry

%matplotlib inline

pd.set_option('display.float_format', lambda x: '%.3f' % x)
pd.set_option('display.max_columns', None)

## Load The File

In [2]:
df = pd.read_excel('../data/external/Inequality/WIID/WIID_19Dec2018.xlsx')

## Standardize Country Codes

In [3]:
# Keep only rows whose c3 value is a country code that pycountry recognizes.
country_locations = []
for country in df['c3']:
    try:
        pycountry.countries.lookup(country)
        country_locations.append(True)
    except LookupError:
        country_locations.append(False)
# .copy() avoids SettingWithCopyWarning when the frame is modified later.
df = df[country_locations].copy()
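
The same kind of filter can also be written as a single vectorized mask. The sketch below is an illustration under stated assumptions, not the notebook's method: `VALID_ALPHA3` is a hypothetical stand-in for the real code list, and the sample rows are invented. Note that `pycountry.countries.lookup` is more permissive than this, since it also matches country names and other code formats, so the two approaches may not keep exactly the same rows.

```python
import pandas as pd

# Hypothetical stand-in for the set of valid alpha-3 codes. With pycountry
# installed it could be built as {c.alpha_3 for c in pycountry.countries}.
VALID_ALPHA3 = {"FIN", "SWE", "NOR"}

# Invented sample rows, including an unknown code and a missing value.
sample = pd.DataFrame({
    "c3": ["FIN", "XXX", "SWE", None],
    "gini_reported": [27.1, 30.0, 26.5, 40.2],
})

# isin() builds the boolean mask in one step; missing values come out False.
mask = sample["c3"].isin(VALID_ALPHA3)
valid = sample[mask].copy()
print(valid["c3"].tolist())
```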

## Standardize Indexes

In [4]:
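# Rename the key columns; reassigning (rather than inplace=True) avoids
# modifying a filtered view of the original frame.
df = df.rename(columns={"c3": "Country Code", "year": "Year"})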

## Remove out of scope rows (consumption/gross)

In [5]:
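# Drop consumption-based observations. Only the "Consumption" resource
# concept is filtered out in this cell.
df = df[df["resource"] != "Consumption"]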

## Remove out of scope rows by year

In [6]:
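# Keep observations from 1995 through 2017 inclusive.
df = df[(df["Year"] > 1994) & (df["Year"] < 2018)]

# Average the numeric columns over all observations that share a country and
# year. numeric_only=True is needed in pandas 2.0+, where mean() raises on
# text columns instead of silently dropping them.
df = df.groupby(["Country Code", "Year"]).mean(numeric_only=True)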
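
To make the aggregation step concrete, the sketch below collapses duplicate country-year rows on a small invented frame. The values and the `source` labels are made up for illustration; only the column names follow the notebook.

```python
import pandas as pd

# Two observations for FIN in 2000 and one for SWE (invented values).
sample = pd.DataFrame({
    "Country Code": ["FIN", "FIN", "SWE"],
    "Year": [2000, 2000, 2000],
    "gini_reported": [26.0, 28.0, 25.0],
    "source": ["A", "B", "A"],  # text column, excluded by numeric_only=True
})

# One row per (Country Code, Year); numeric columns are averaged.
collapsed = sample.groupby(["Country Code", "Year"]).mean(numeric_only=True)
print(collapsed)
```

The result is indexed by `(Country Code, Year)`, and every text column is gone. Averaging across observations with different resource concepts or equivalence scales mixes definitions, so filtering on those columns first may be worth considering.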

## Save Data

In [7]:
df.to_pickle("../data/processed/Inequality_WIID.pickle")
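
A quick round trip can confirm that the pickle preserves the `(Country Code, Year)` index. The sketch below writes to a temporary directory instead of the project path; the file name and the values are hypothetical.

```python
import os
import tempfile

import pandas as pd

# Small frame shaped like the processed output (invented values).
frame = pd.DataFrame(
    {"gini_reported": [27.0, 25.0]},
    index=pd.MultiIndex.from_tuples(
        [("FIN", 2000), ("SWE", 2000)], names=["Country Code", "Year"]
    ),
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.pickle")  # hypothetical file name
    frame.to_pickle(path)
    restored = pd.read_pickle(path)

print(restored.index.names)
```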