In [4]:
from __future__ import print_function
from ipywidgets import interact, interactive, fixed, interact_manual
import ipywidgets as widgets
import pandas as pd
from IPython.display import display
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import collections
from utils import visualizer, preprocessing, visualizers_1, preprocessing_1, computations_1, visualizer_q3
import sklearn
import json
import folium
%load_ext autoreload
%autoreload 2
%matplotlib inline
# Use the 'bmh' plotting style
plt.style.use('bmh')

# Inside the Leak

As global markets expand and become more interconnected, businesses are increasingly looking for resources to help identify competitive and profitable opportunities. Several data leaks in recent years have shown that a common approach is to create offshore companies, i.e. companies incorporated in low-tax, offshore jurisdictions.
Our goal is to analyze the motivating factors for creating such entities. We believe that a better understanding of these reasons can help to find ways to deal with those tendencies. This has an impact on the social good, because fiscal prudence and openness in international trade can have a powerful effect on improving society.

Our analysis is based on data provided in the [Offshore Leaks Database](https://www.occrp.org/en/panamapapers/database). It contains information about 500,000 offshore companies, foundations and trusts, including links to people and companies in more than 200 countries and territories. The information comes from the Panama Papers, the Offshore Leaks and the Bahamas Leaks investigations conducted in the years 2013 to 2016. While those data leaks contain various kinds of documents, from emails to bank documents, the database provides structured information and excludes the raw files. The latest investigation, the Paradise Papers released in November 2017, is not included in our analysis.

In order to better understand the underlying structures of the offshore business, we analyze the available data at the level of countries. We identify the most involved countries and try to find factors that characterize them. To this end, we enrich the dataset with information about the economic and social background of countries from the [Index of Economic Freedom](http://www.heritage.org/index/about).
Furthermore, we investigate how the different countries are connected and how their presence in the offshore business has evolved in recent years. In particular, we want to see whether the publication of the leaks influenced those developments in some way.

In [5]:
#TODO explain terminology + add things if needed

Before we present our findings, we want to clarify some of the terminology used:
    -  offshore entity (explain that not per se illegal?)
    -  jurisdiction
    -  incorporation

## Most involved countries and their characteristics

In this section we analyze which countries are most involved in the offshore business and try to identify their economic characteristics.

First of all, we want to give an overview of the available data.

In [6]:
# TODO:
# - explain missing data for tax havens
# - most involved jurisdictions
# - most involved tax havens
# - correlation nrofoffshores + index data + CONCLUSION

## International relations

Let us now take a closer look at how the different countries are connected. We measure the connectedness of two countries by the number of offshore entities that originate in one country and are founded in the other.

To begin with, we want to see whether there is a pattern in the way players in origin countries select particular countries for their offshores. To this end, we cluster the origin countries into groups with similar selection patterns using k-means clustering.
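The clustering step described above can be sketched as follows. This is a minimal illustration on made-up toy data, not the code used to produce the results in this notebook: the countries `A` to `D`, the column name `jurisdiction` and the choice of two clusters are assumptions made for the example (the analysis itself uses four clusters).

```python
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical toy data: one row per offshore entity, with its origin country
# and the jurisdiction (destination country) in which it was founded.
toy = pd.DataFrame({
    "Country": ["A"] * 4 + ["B"] * 2 + ["C"] * 4 + ["D"] * 2,
    "jurisdiction": (["BVI", "BVI", "BVI", "Panama"]          # A: mostly BVI
                     + ["BVI", "BVI"]                         # B: only BVI
                     + ["Panama", "Panama", "Panama", "BVI"]  # C: mostly Panama
                     + ["Panama", "Panama"]),                 # D: only Panama
})

# Rows: origin countries, columns: destination countries,
# cells: relative frequency of entities founded in that destination.
selection = pd.crosstab(toy["Country"], toy["jurisdiction"], normalize="index")

# Group origin countries with similar selection patterns.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
clusters = pd.Series(kmeans.fit_predict(selection),
                     index=selection.index, name="cluster")
print(clusters)
```

Normalizing each row to relative frequencies means that countries are compared by where their entities go, not by how many entities they have.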

The selection patterns of the four resulting clusters are visualized in the matrices below. Each row corresponds to an origin country and each column to a destination country. The color of a cell indicates, for the corresponding origin country, the relative frequency of offshores that were founded in the corresponding destination country.

And indeed it is easy to see a pattern that characterizes the countries in the same cluster: the first cluster contains those countries where the majority of offshore entities are founded in the British Virgin Islands. The second cluster contains the countries with most offshore entities in Panama, and the third those countries with a majority of entities in the Seychelles. The countries in the fourth cluster show more diverse distributions of destination countries. However, there are still interesting patterns. For example, for many countries in this cluster the main destination of entities is the country itself, as for the Cook Islands or Samoa.

Now that we know that the countries can be categorized by the way jurisdictions for offshores are selected, an obvious question to ask is what causes those different structures. In other words, we want to know how the countries in the same cluster are similar to each other and different from the countries in other clusters. This in turn could help us to find the underlying factors that motivate the selection of destination countries.

The first possible factor we consider is geographical closeness. As can be seen in the map, most South American countries (Brazil being the most apparent exception) have the majority of their offshore entities in Panama (cluster 2). In North America and the UK, most offshore entities are founded in the British Virgin Islands. So geographical closeness might indeed be a factor.

The next possible factors we consider are of an economic nature. We investigate what influence the economic factors specified in the Index of Economic Freedom dataset have on the assignment to the clusters. There are several ways to tackle this question. One approach we took was to fit a multinomial logit model that predicts the cluster from the economic factors. As economic factors, we considered the gross domestic product (GDP), the GDP growth rate, government expenditures, inflation, foreign direct investment (FDI) inflow and public debt. We then interpreted the coefficients of the classifier. It turned out that none of the mentioned factors has a significant influence. Another approach is a principal component analysis (PCA). A PCA with the first three principal components can be visualized as follows, where every point corresponds to a country and the color indicates its cluster:

As in the previous analysis, it is difficult to see a pattern here. Therefore, we find no evidence that the economic factors we analyzed influence the cluster assignment.
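The two approaches discussed above can be sketched as follows. This is only an illustration under stated assumptions: the feature matrix is random synthetic data, the names in `factor_names` are hypothetical labels for the six economic factors, and scikit-learn is used for both models, which may differ from the tooling behind the reported results.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Hypothetical labels for the six economic factors (assumed names).
factor_names = ["gdp", "gdp_growth", "gov_expenditure",
                "inflation", "fdi_inflow", "public_debt"]

# Synthetic stand-in data: 40 "countries", 6 factors, 4 cluster labels.
rng = np.random.RandomState(0)
X = rng.normal(size=(40, len(factor_names)))
y = np.arange(40) % 4

# Standardize so that coefficients and components are comparable across factors.
X_scaled = StandardScaler().fit_transform(X)

# Multiclass logistic regression: one row of coefficients per cluster,
# one column per economic factor.
logit = LogisticRegression(max_iter=1000).fit(X_scaled, y)
coefficients = logit.coef_

# PCA keeping the first three principal components.
pca = PCA(n_components=3)
components = pca.fit_transform(X_scaled)
print(coefficients.shape, components.shape, pca.explained_variance_ratio_)
```

Note that `LogisticRegression` reports coefficients but no significance tests; a statistical package such as `statsmodels` would be needed for p-values.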

Note that in general this kind of reasoning only enables us to restrict the set of possible factors; it does not enable us to identify the relevant factors with certainty. This would require more background research by experts. We still think that the map is a good basis for detecting interesting tendencies. For example, starting from there one could check whether countries belonging to the Commonwealth of Nations belong to the same cluster. We observe that the UK, India and Australia are in the same cluster. However, Canada is not. One can think of many questions like this when considering the historical and political background of countries.

## Analyzing trends of the countries

In [7]:
entities_path = r'./panama_csv/Entities.csv'
entities = pd.read_csv(entities_path, index_col='name', header=0, low_memory=False)
# Rename the origin-country column to 'Country'
entities = entities.rename(columns={'countries': 'Country'})

## Getting the most involved countries/jurisdictions.
___
From the Entities dataset we select:
- **15 most involved countries**, on which we base our analysis. These are the countries with the highest number of offshore accounts in tax haven jurisdictions.
- **5 most involved jurisdictions**, on which we base our analysis. These are the jurisdictions that manage or managed the highest number of registered offshore accounts worldwide.

### What do we do?

For each of the most involved countries/jurisdictions we analyze how strongly it is involved and how its behavior develops over the years.

1. We measure the involvement of a jurisdiction by the number of offshore accounts registered in it.
2. We measure the involvement of a country by the number of offshore accounts that have this country as country of origin.

Note that we also look at the number of **new incorporations**, **inactivations** and **active offshores** for every year.

The numbers of new incorporations and inactivations per year are easily derived from the dataset, since each account records both its date of incorporation and its date of inactivation.
The number of active offshores in a given year can then be computed from these dates with a simple algorithm.
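One possible version of such an algorithm is sketched below on made-up toy data. It is not taken from the `preprocessing` module used in this notebook; the column names `incorporation_date` and `inactivation_date` and the convention of counting an entity as active from its incorporation year until the year before its inactivation are assumptions made for the example.

```python
import pandas as pd

# Hypothetical toy data: a missing inactivation date means "still active".
toy = pd.DataFrame({
    "incorporation_date": ["2001-03-01", "2003-06-15", "2003-09-30", "2005-01-10"],
    "inactivation_date": ["2004-05-20", None, "2005-11-02", None],
})

# Extract the year of each event, dropping missing dates.
inc_year = pd.to_datetime(toy["incorporation_date"]).dt.year.dropna().astype(int)
ina_year = pd.to_datetime(toy["inactivation_date"]).dt.year.dropna().astype(int)

years = list(range(2001, 2007))

# Events per year, with zeros for years without events.
incorporations = inc_year.value_counts().reindex(years, fill_value=0)
inactivations = ina_year.value_counts().reindex(years, fill_value=0)

# Active at the end of a year = everything incorporated so far
# minus everything inactivated so far.
active = incorporations.cumsum() - inactivations.cumsum()
print(active)
```

With this convention, the running difference of the two cumulative sums gives the number of active offshores at the end of each year.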

### Why?

To assess whether a scandal has visible worldwide consequences, we base our analysis on the following numbers:
- **new incorporations**: this number can indicate whether a scandal actually acted as a marketing tool, inviting people from a specific **COUNTRY** to invest in offshore accounts in a specific tax haven **JURISDICTION**.
- **inactivations**: this number can indicate whether a scandal scared people from a specific **COUNTRY** who are invested in a specific tax haven **JURISDICTION**.

We are also interested in key events or particular years in which the global market underwent a major change. To this end, we look at the behavior of the most involved countries over the years: how they invested, how much, and in which jurisdictions.

## Further assumptions.
Note that all of the following results are based on a fraction of the real data, namely the part that has been leaked and made available. We consider the results realistic since the dataset is large. However, we cannot exclude that further leaks could change them.
### Getting most involved countries and jurisdictions
___
We group by __Country/Jurisdiction__ and count the entries per group, sorting by the count of **node_id**, which is the unique id of each offshore account.

In [8]:
# Top 50 origin countries by number of entities (node_id is the unique id)
most_involved = entities.groupby('Country').count().sort_values('node_id', ascending=False).head(50).index
most_involved = most_involved[most_involved != 'Not identified']

# Replace two country names with their long-form variants
name_mapping = {
    'British Virgin Islands': 'Virgin Islands (British)',
    'United Kingdom': 'United Kingdom of Great Britain and Northern Ireland',
}
most_involved_countries = [name_mapping.get(name, name) for name in most_involved]

# Top jurisdictions by number of entities, without the 'Undetermined' placeholder
most_involved_jur = entities.groupby('jurisdiction_description').count().sort_values('node_id', ascending=False).head().index
most_involved_jur = most_involved_jur[most_involved_jur != 'Undetermined']
most_involved_jur

Index(['Bahamas', 'British Virgin Islands', 'Panama', 'Seychelles'], dtype='object', name='jurisdiction_description')

## Preprocessing dataset
___
We process the entities dataframe to obtain a new dataframe that maps the flows of **inactivations**, **incorporations** and **active** offshores for each year from each **country** to each **jurisdiction**.

The new dataframe has the following columns:
- **jurisdiction**, the name of the jurisdiction where the offshore account is opened.
- **Country**, the name of the origin country of the offshore account.
- **date**, the year that the entry refers to.
- **action**, whether we are counting incorporations, inactivations, active offshores or struck-off offshores in the year given by **date**.
- **offshores**, the count for the given action in the given year.
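To make the structure listed above concrete, the following sketch builds a small dataframe with the same columns from made-up toy data. It is not the implementation of `preprocessing.process_countries`, whose internals are not shown in this notebook; the helper `yearly_counts`, the toy rows and the raw column names `incorporation_date` and `inactivation_date` are assumptions made for the example, and only two of the actions are covered.

```python
import pandas as pd

# Hypothetical toy entities with origin country, jurisdiction and event dates.
toy = pd.DataFrame({
    "jurisdiction": ["Panama", "Panama", "Seychelles"],
    "Country": ["Brazil", "Brazil", "Russia"],
    "incorporation_date": ["2005-02-01", "2005-07-19", "2010-03-03"],
    "inactivation_date": ["2009-01-15", None, None],
})

def yearly_counts(frame, date_column, action):
    """Count events per (jurisdiction, Country, year) for one kind of action."""
    years = pd.to_datetime(frame[date_column]).dt.year
    counts = (frame.assign(date=years)
                   .dropna(subset=["date"])   # skip entities without this event
                   .astype({"date": int})
                   .groupby(["jurisdiction", "Country", "date"])
                   .size()
                   .reset_index(name="offshores"))
    counts["action"] = action
    return counts

# Stack the per-action counts into one long-format dataframe.
long_frame = pd.concat([
    yearly_counts(toy, "incorporation_date", "incorporations"),
    yearly_counts(toy, "inactivation_date", "inactivations"),
], ignore_index=True)[["jurisdiction", "Country", "date", "action", "offshores"]]
print(long_frame)
```

A long format like this, with one row per combination of jurisdiction, country, year and action, is convenient for filtering by action before plotting.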

In [9]:
countries_frame = preprocessing.process_countries(entities=entities, first_involved_countries=most_involved, analisys_on='jurisdiction', from_year=1980, to_year=2016)

In [10]:
time_series = preprocessing.process_countries_unstacked(entities=entities, first_involved_countries=most_involved, analisys_on='jurisdiction', from_year=1980, to_year=2016)

### Visualizations

In [11]:
# NOTE: cluster_info is not defined in any cell above and must be created before running this cell
visualizer.visualize_clusters_with_brush(countries_frame, cluster_info, 'inactivations')

NameError: name 'cluster_info' is not defined

In [None]:
visualizer.visualize_clusters_with_brush(countries_frame, cluster_info, 'incorporations')

In [None]:
#visualizer.visualize_clusters_with_brush(countries_frame, cluster_info, 'strucks')
#visualizer.visualize_clusters_with_brush(countries_frame, cluster_info, 'active offshores')

In [None]:
visualizer.visualize_candle_country(time_series, filter_top_for_story=True)

In [None]:
visualizer.visualize_slider_country(countries_frame, 'incorporations', filter_top_for_story=True)

In [None]:
from IPython.display import IFrame
IFrame("https://tvchart.tradingeconomics.com/c?s=SHCOMP:IND&interval=M&locale=com&originUrl=https://tradingeconomics.com/china/stock-market", width=700, height=350)

## Conclusion

In [None]:
#TODO

## References

In [None]:
#TODO