# Who's the biggest tax evader?

#### Imports:

In [61]:
import plotly.plotly as py
import pandas as pd
import pycountry
import numpy as np
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot

init_notebook_mode(connected=True)

## 1. Data cleaning and preprocessing

### 1.1 Loading data

The goal is to better understand how different socioeconomic factors are linked with tax evasion occurrences in different countries. To help us visualize this, we will use choropleth world maps to display information.

In our datasets, some countries appear under different names, for instance "China" vs. "People's Republic of China". Since the data spans multiple years, we also run into countries that have changed their names, for instance Swaziland becoming Eswatini. We therefore use [ISO 3166-1](https://en.wikipedia.org/wiki/ISO_3166-1_alpha-3) three-letter country codes as country identifiers, as these codes are consistent across datasets and over the years. From the Plotly choropleth map [documentation](https://plot.ly/python/choropleth-maps/#world-choropleth-map), we got a list of ISO country codes and country names. We extend this list manually whenever we come across a new spelling of a country name. We load the dataset of country codes as a Pandas `DataFrame`:

In [62]:
# Load country codes
df_countries_codes = pd.read_csv('data/countries_codes.csv', low_memory=False).set_index('COUNTRY')
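
To illustrate why the lookup table is indexed by `COUNTRY`, here is a minimal sketch with made-up rows. The `CODE` column name matches the one used later in this notebook; the sample rows and the `toy_*` names are assumptions for illustration only and are not the contents of `countries_codes.csv`.

```python
import pandas as pd

# Hypothetical excerpt of the lookup table: several spellings may share one code.
toy_codes = pd.DataFrame({
    'COUNTRY': ['China', "People's Republic of China", 'Swaziland', 'Eswatini'],
    'CODE': ['CHN', 'CHN', 'SWZ', 'SWZ'],
}).set_index('COUNTRY')

# Hypothetical dataset that uses inconsistent country names.
toy_data = pd.DataFrame({
    'Country': ["People's Republic of China", 'Eswatini', 'Atlantis'],
    'Value': [1.0, 2.0, 3.0],
})

# Left join on the name: every known spelling resolves to the same ISO code,
# while unknown names get NaN and can be inspected afterwards.
toy_joined = toy_data.join(toy_codes, on='Country')
toy_unmatched = toy_joined.loc[toy_joined['CODE'].isna(), 'Country'].tolist()
print(toy_joined)
print(toy_unmatched)
```

Listing the unmatched names after each join is what tells us which spellings still need to be added to the table.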

We load the Panama Papers, UN and World Bank datasets into Pandas `DataFrame`s:

In [4]:
# Load datasets
## Load panama papers datasets
pp_edges = pd.read_csv('data/panama_papers/panama_papers.edges.csv', low_memory=False)
pp_nodes_address = pd.read_csv('data/panama_papers/panama_papers.nodes.address.csv', low_memory=False)
pp_nodes_entity = pd.read_csv('data/panama_papers/panama_papers.nodes.entity.csv', low_memory=False)
pp_nodes_intermediary = pd.read_csv('data/panama_papers/panama_papers.nodes.intermediary.csv', low_memory=False)
pp_nodes_officer = pd.read_csv('data/panama_papers/panama_papers.nodes.officer.csv', low_memory=False)
## Load UN datasets
un_hdi_components_2014 = pd.read_csv('data/un/hdi_components.csv', low_memory=False)
un_gdp_per_capita = pd.read_csv('data/un/gdp_per_capita.csv', low_memory=False)
un_gdp_per_capita_ppp = pd.read_csv('data/un/gdp_per_capita_PPP.csv', low_memory=False)
## Load world bank datasets
wb_gini = pd.read_csv('data/world_bank/gini_index.csv', low_memory=False)
wb_income_share_20_per = pd.read_csv('data/world_bank/income_share_20_per.csv', low_memory=False)
wb_population_total = pd.read_csv('data/world_bank/population_total.csv', low_memory=False)

### 1.2 Examining data

We look at a few of the UN datasets, to see what sort of preprocessing and cleaning we will have to do.

#### GDP per capita:

In [5]:
un_gdp_per_capita.head()

Unnamed: 0,Country,Year,Item,Value
0,Afghanistan,2016,Gross Domestic Product (GDP),583.882867
1,Afghanistan,2015,Gross Domestic Product (GDP),610.854517
2,Afghanistan,2014,Gross Domestic Product (GDP),651.158326
3,Afghanistan,2013,Gross Domestic Product (GDP),681.033974
4,Afghanistan,2012,Gross Domestic Product (GDP),694.885886


#### HDI components (2014):

In [6]:
un_hdi_components_2014.head()

Unnamed: 0,HDI rank,Country,Human Development Index (HDI),Life expectancy at birth,Expected years of schooling,Mean years of schooling,Gross national income (GNI) per capita,GNI per capita rank minus HDI rank
0,1,Norway,0.944,81.6,17.5,12.6,64992,5
1,2,Australia,0.935,82.4,20.2,13.0,42261,17
2,3,Switzerland,0.93,83.0,15.8,12.8,56431,6
3,4,Denmark,0.923,80.2,18.7,12.7,44025,11
4,5,Netherlands,0.922,81.6,17.9,11.9,45435,9


#### Gini coefficient index:

In [7]:
wb_gini.head()

Unnamed: 0,Country Name,Country Code,Indicator Name,Indicator Code,1960,1961,1962,1963,1964,1965,...,2008,2009,2010,2011,2012,2013,2014,2015,2016,2017
0,Aruba,ABW,GINI index (World Bank estimate),SI.POV.GINI,,,,,,,...,,,,,,,,,,
1,Afghanistan,AFG,GINI index (World Bank estimate),SI.POV.GINI,,,,,,,...,,,,,,,,,,
2,Angola,AGO,GINI index (World Bank estimate),SI.POV.GINI,,,,,,,...,42.7,,,,,,,,,
3,Albania,ALB,GINI index (World Bank estimate),SI.POV.GINI,,,,,,,...,30.0,,,,29.0,,,,,
4,Andorra,AND,GINI index (World Bank estimate),SI.POV.GINI,,,,,,,...,,,,,,,,,,


#### Income share top 20%:

In [8]:
wb_income_share_20_per.head()

Unnamed: 0,Country Name,Country Code,Indicator Name,Indicator Code,1960,1961,1962,1963,1964,1965,...,2008,2009,2010,2011,2012,2013,2014,2015,2016,2017
0,Aruba,ABW,Income share held by highest 20%,SI.DST.05TH.20,,,,,,,...,,,,,,,,,,
1,Afghanistan,AFG,Income share held by highest 20%,SI.DST.05TH.20,,,,,,,...,,,,,,,,,,
2,Angola,AGO,Income share held by highest 20%,SI.DST.05TH.20,,,,,,,...,48.5,,,,,,,,,
3,Albania,ALB,Income share held by highest 20%,SI.DST.05TH.20,,,,,,,...,39.0,,,,37.8,,,,,
4,Andorra,AND,Income share held by highest 20%,SI.DST.05TH.20,,,,,,,...,,,,,,,,,,


We see that only two of the four displayed datasets (the World Bank ones) have a `Country Code` column. Fortunately, we are able to map the `Country` values of the UN datasets to their corresponding country codes. This will allow us to join `DataFrame`s later on, when performing analyses on socioeconomic development factors.

We also observe that there are many NaN values in the Gini coefficient index dataset and the income share top 20% dataset. This is because these values are not measured annually in every country. We will solve this problem by taking, for each country, the most recent value available since 2000 in each of these datasets. This gives us the most recent measurement we have for every country.
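
One possible way to pick the most recent value since 2000 is to forward-fill along the year columns and keep the last one. The sketch below uses made-up values in the World Bank wide layout; `most_recent_since` and the `toy_*` names are hypothetical and not part of the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical wide-format excerpt mimicking the World Bank layout (values made up).
toy_wide = pd.DataFrame({
    'Country Code': ['AAA', 'BBB', 'CCC'],
    '1999': [50.0, np.nan, np.nan],
    '2000': [np.nan, 40.0, np.nan],
    '2008': [42.7, np.nan, np.nan],
    '2012': [np.nan, 38.5, np.nan],
})

def most_recent_since(df, first_year=2000, id_col='Country Code'):
    """Return, per row, the last non-NaN value among year columns >= first_year."""
    year_cols = sorted(c for c in df.columns if c.isdigit() and int(c) >= first_year)
    # Forward-fill along the year axis so the last column holds the latest observation.
    latest = df[year_cols].ffill(axis=1).iloc[:, -1]
    return pd.DataFrame({id_col: df[id_col], 'latest': latest})

toy_latest = most_recent_since(toy_wide)
print(toy_latest)
```

Countries with no measurement since 2000 (`CCC` here) keep a NaN and still have to be dropped or handled before plotting.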

### 1.3 Data cleaning

#### Panama Papers address

In [10]:
pp_nodes_address.head()

Unnamed: 0,node_id,name,address,country_codes,countries,sourceID,valid_until,note
0,14000001,,-\t27 ROSEWOOD DRIVE #16-19 SINGAPORE 737920,SGP,Singapore,Panama Papers,The Panama Papers data is current through 2015,
1,14000002,,"""Almaly Village"" v.5, Almaty Kazakhstan",KAZ,Kazakhstan,Panama Papers,The Panama Papers data is current through 2015,
2,14000003,,"""Cantonia"" South Road St Georges Hill Weybridg...",GBR,United Kingdom,Panama Papers,The Panama Papers data is current through 2015,
3,14000004,,"""CAY-OS"" NEW ROAD; ST.SAMPSON; GUERNSEY; CHANN...",GGY,Guernsey,Panama Papers,The Panama Papers data is current through 2015,
4,14000005,,"""Chirag"" Plot No 652; Mwamba Road; Kizingo; Mo...",KEN,Kenya,Panama Papers,The Panama Papers data is current through 2015,


One thing we might be interested in is how many references to each country there are in the Panama Papers dataset:

In [11]:
pp_references_country = pp_nodes_address.groupby(['country_codes', 'countries']).size().reset_index(name='counts')
pp_references_country.head()

Unnamed: 0,country_codes,countries,counts
0,ABW,Aruba,18
1,AGO,Angola,38
2,AIA,Anguilla,105
3,ALB,Albania,23
4,AND,Andorra,35


We will start by looking at the countries with the most references:

In [12]:
pp_references_country.sort_values('counts', ascending=False).head(20)

Unnamed: 0,country_codes,countries,counts
33,CHN,China,20267
75,HKG,Hong Kong,9147
61,GBR,United Kingdom,3996
193,VGB,British Virgin Islands,3467
155,RUS,Russia,3346
189,USA,United States,3094
90,JEY,Jersey,2852
31,CHE,Switzerland,2827
142,PAN,Panama,2508
184,TWN,Taiwan,2249


We now look at the relative number of occurrences, defined as `1000 * number_of_occurrences / population_size_2014`, i.e. the number of references per 1,000 inhabitants (2014 population):

( TO IMPROVE )

In [55]:
# 2014 population per country code
wb_population_2014 = wb_population_total[['Country Code', '2014']]
# Join reference counts and population on the ISO code
occurence_population = pp_references_country.merge(wb_population_2014, left_on='country_codes', right_on='Country Code')
# References per 1,000 inhabitants
occurence_population['counts_1000'] = 1000 * occurence_population['counts'] / occurence_population['2014']
sortedLst = occurence_population.sort_values('counts_1000', ascending=False)
# Show the World Bank rows that have no 2014 population
wb_population_total[wb_population_total[['Country Code', '2014']].isna().any(axis=1)]

Unnamed: 0,Country Name,Country Code,Indicator Name,Indicator Code,1960,1961,1962,1963,1964,1965,...,2008,2009,2010,2011,2012,2013,2014,2015,2016,2017
67,Eritrea,ERI,"Population, total",SP.POP.TOTL,1397491.0,1432640.0,1469645.0,1508273.0,1548187.0,1589179.0,...,4232636.0,4310334.0,4390840.0,4474690.0,,,,,,
108,Not classified,INX,"Population, total",SP.POP.TOTL,,,,,,,...,,,,,,,,,,


In order to display the values on the maps, we need to join the UN datasets with the corresponding country code. We'll start by trying to automate this process, before looking at possible exceptions:

In [13]:
# Join UN datasets with country codes
un_hdi_components_2014 = un_hdi_components_2014.join(df_countries_codes, on='Country')
un_gdp_per_capita = un_gdp_per_capita.join(df_countries_codes, on='Country')
un_gdp_per_capita_ppp = un_gdp_per_capita_ppp.join(df_countries_codes, on='Country')

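Country names that are still unmatched often carry a parenthesized qualifier (for instance, "Iran (Islamic Republic of)" rather than "Iran"). For these, we fall back on `pycountry`: we first try an exact match on the name, then accept a match only when exactly one country's name or common name appears in the string: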

In [14]:
un_dfs = [un_hdi_components_2014, un_gdp_per_capita, un_gdp_per_capita_ppp]
countries = {}

for country in pycountry.countries:
    countries[country.name] = country.alpha_3  

for df in un_dfs:
    nan_values = df['CODE'].isna()
    input_countries = list(df[nan_values]['Country'].values)
        
    codes = []
    for country in input_countries:
        if country in countries:
            codes.append(countries.get(country))
        else:        
            accepted = []
            str_country = str(country)
            # see if string contains either common_name or name of countries
            for p_country in pycountry.countries:
                if p_country.name in str_country or (hasattr(p_country, 'common_name') and p_country.common_name in str_country):
                    accepted.append(p_country.alpha_3)
            if len(accepted) == 1:
                codes.append(accepted[0])
            else:
                codes.append(None)

    df.loc[nan_values, 'CODE'] = codes
    # Debug output: names that could not be matched to a code
    print(df[df['CODE'].isnull()]['Country'].unique())
    # Drop unmatched rows in place so the change persists outside the loop
    df.dropna(subset=['CODE'], inplace=True)

[]
['Former Czechoslovakia' 'Former USSR' 'Former Yugoslavia']
['Arab World' 'Caribbean small states' 'Central Europe and the Baltics'
 'Early-demographic dividend' 'East Asia & Pacific'
 'East Asia & Pacific (excluding high income)'
 'East Asia & Pacific (IDA & IBRD)' 'Euro area' 'Europe & Central Asia'
 'Europe & Central Asia (excluding high income)'
 'Europe & Central Asia (IDA & IBRD)' 'European Union'
 'Fragile and conflict affected situations'
 'Heavily indebted poor countries (HIPC)' 'High income' 'IBRD only'
 'IDA & IBRD total' 'IDA blend' 'IDA only' 'IDA total' 'Korea'
 'Late-demographic dividend' 'Latin America & Caribbean'
 'Latin America & Caribbean (excluding high income)'
 'Latin America & Caribbean (IDA & IBRD)'
 'Least developed countries: UN classification' 'Low & middle income'
 'Low income' 'Lower middle income' 'Middle East & North Africa'
 'Middle East & North Africa (excluding high income)'
 'Middle East & North Africa (IDA & IBRD)' 'Middle income' 'North America'
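
Parenthesized qualifiers are common in these names. As a possible complement to the substring matching, the qualifier could be stripped before the lookup. This is only a sketch: `strip_parenthesized` is a hypothetical helper, not part of the notebook or of `pycountry`.

```python
import re

def strip_parenthesized(name):
    """Drop any '(...)' groups and collapse the leftover whitespace."""
    without = re.sub(r'\([^)]*\)', '', str(name))
    return re.sub(r'\s+', ' ', without).strip()

print(strip_parenthesized('Iran (Islamic Republic of)'))
print(strip_parenthesized('East Asia & Pacific (excluding high income)'))
```

Regional aggregates stay unmatched after stripping, which is the desired outcome since they are not countries.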

### Visualization

#### Human Development Index (HDI)

In [47]:
data = [ dict(
        type = 'choropleth',
        locations = un_hdi_components_2014['CODE'],
        z = un_hdi_components_2014['Human Development Index (HDI)'],
        text = un_hdi_components_2014['Country'],
        colorscale = [[0,"rgb(5, 10, 172)"],[0.35,"rgb(40, 60, 190)"],[0.5,"rgb(70, 100, 245)"],\
            [0.6,"rgb(90, 120, 245)"],[0.7,"rgb(106, 137, 247)"],[1,"rgb(220, 220, 220)"]],
        autocolorscale = False,
        reversescale = True,
        marker = dict(
            line = dict (
                color = 'rgb(180,180,180)',
                width = 0.5
            ) ),
        colorbar = dict(
            title = 'HDI'),
      ) ]

layout = dict(
    title = 'Human Development Index (HDI)',
    geo = dict(
        showcountries = True,
        countrycolor = 'rgb(180,180,180)',
        showframe = False,
        showcoastlines = False,
        projection = dict(
            type = 'mercator'
        )
    )
)

fig = dict( data=data, layout=layout )

iplot( fig, validate=False)

#### References in the Panama Papers

In [57]:
data = [ dict(
        type = 'choropleth',
        locations = pp_references_country['country_codes'],
        z = pp_references_country['counts'],
        text = pp_references_country['countries'],
        colorscale = [[0,"rgb(5, 10, 172)"],[0.35,"rgb(40, 60, 190)"],[0.5,"rgb(70, 100, 245)"],\
            [0.6,"rgb(90, 120, 245)"],[0.7,"rgb(106, 137, 247)"],[1,"rgb(220, 220, 220)"]],
        autocolorscale = False,
        reversescale = True,
        marker = dict(
            line = dict (
                color = 'rgb(180,180,180)',
                width = 0.5
            ) ),
        colorbar = dict(
            title = 'Number of references'),
      ) ]

layout = dict(
    title = 'References in the Panama Papers',
    geo = dict(
        showcountries = True,
        countrycolor = "rgb(217, 217, 217)",
        showframe = False,
        showcoastlines = False,
        projection = dict(
            type = 'mercator'
        ),
        bgcolor = 'rgba(255, 255, 255, 0.0)',
    )
)

fig = dict( data=data, layout=layout )

iplot( fig, validate=False)
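
The two map cells above differ only in their data columns and titles. A small helper could build the figure dictionary once. The sketch below is one way to do it; `choropleth_figure`, `BLUE_SCALE` and the `toy` rows are hypothetical names and made-up values, and the helper only assembles plain dictionaries.

```python
import pandas as pd

# Colorscale shared by the maps in this notebook.
BLUE_SCALE = [[0, "rgb(5, 10, 172)"], [0.35, "rgb(40, 60, 190)"],
              [0.5, "rgb(70, 100, 245)"], [0.6, "rgb(90, 120, 245)"],
              [0.7, "rgb(106, 137, 247)"], [1, "rgb(220, 220, 220)"]]

def choropleth_figure(df, code_col, value_col, text_col, title, colorbar_title):
    """Build a plain-dict world choropleth figure (hypothetical helper)."""
    trace = dict(
        type='choropleth',
        locations=df[code_col],      # ISO 3166-1 alpha-3 codes
        z=df[value_col],             # value that drives the color
        text=df[text_col],           # hover label
        colorscale=BLUE_SCALE,
        autocolorscale=False,
        reversescale=True,
        marker=dict(line=dict(color='rgb(180,180,180)', width=0.5)),
        colorbar=dict(title=colorbar_title),
    )
    layout = dict(
        title=title,
        geo=dict(showcountries=True, countrycolor='rgb(180,180,180)',
                 showframe=False, showcoastlines=False,
                 projection=dict(type='mercator')),
    )
    return dict(data=[trace], layout=layout)

# Made-up rows, only to show the call.
toy = pd.DataFrame({'code': ['NOR', 'CHE'], 'value': [0.9, 0.8],
                    'name': ['Norway', 'Switzerland']})
toy_fig = choropleth_figure(toy, 'code', 'value', 'name', 'Toy map', 'Value')
```

The returned dictionary can be passed to `iplot(toy_fig, validate=False)` in the same way as the figures above.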