# Importing libraries

In [1]:
import pandas as pd

In [2]:
import matplotlib.pyplot as plt

In [3]:
from urllib.request import urlopen

In [4]:
from bokeh.io import output_notebook, show
from bokeh.plotting import figure
from bokeh.models import ColumnDataSource, LabelSet, HoverTool

In [5]:
output_notebook()

# Loading the data

In [6]:
h0 = pd.read_stata("H0_cpy_all.dta")

FileNotFoundError: [Errno 2] No such file or directory: 'H0_cpy_all.dta'
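The error above means the `.dta` file was not found relative to the notebook's working directory. A minimal sketch of a guard that fails with a more informative message is shown here; `read_stata_checked` is a hypothetical helper name, not part of pandas or of the original notebook.

```python
from pathlib import Path

import pandas as pd


def read_stata_checked(path):
    """Read a Stata file, reporting the working directory if it is missing.

    Hypothetical convenience wrapper around pd.read_stata.
    """
    path = Path(path)
    if not path.is_file():
        raise FileNotFoundError(
            f"{path} not found; current working directory is {Path.cwd()}"
        )
    return pd.read_stata(path)
```

With this helper, the loading cell would read `h0 = read_stata_checked("H0_cpy_all.dta")`, assuming the file sits next to the notebook.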

In [None]:
# comnames = urlopen("http://unstats.un.org/unsd/tradekb/Attachment439.aspx?AttachmentType=1")

comnames = pd.read_excel("UN Comtrade Commodity Classifications.xlsx")

## Attributes

Selecting columns to explore:

In [None]:
h0 = h0.loc[:,['year',
               'exporter',
               'commoditycode',
               'export_value',
               'population',
               'rca',
               'mcp',
               'eci',
               'pci',
               'oppgain',
               'distance',
               'import_value']]

The attributes of the dataset and its data types:

In [None]:
h0.info()

In [None]:
h0.describe()

The data contains more than 6M rows.

Explanation of some of the attributes created by the Atlas researchers:
- Distance ('distance'): "The extent of a location's existing capabilities to make the product", based on the product's distance to the country's current exports as measured by co-export probabilities.
- Economic complexity index ('eci'): Country rank based on the diversification and complexity of its export basket.
- Opportunity gain ('oppgain'): How much a location could benefit by developing a particular product.
- Product complexity index ('pci'): "Ranks the diversity and sophistication of production know-how required to produce a product", based on the number of other countries producing that product and their economic complexity.
- Revealed comparative advantage ('rca'): Whether a country is an 'effective' exporter of a product (i.e. exports more than its 'fair share'). The larger the value, the more important the country is as an exporter of that product.
- Country-Product connection ('mcp'): Marks whether the particular country exports the specific product with an `rca` greater than 1. This also allows us to measure country diversity and product ubiquity.
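To make the `rca` and `mcp` definitions concrete, here is a minimal sketch on made-up numbers. It assumes the standard Balassa index (a product's share in the country's export basket divided by the product's share in world trade) and the "greater than 1" threshold stated above; the Atlas researchers' exact computation may differ.

```python
import pandas as pd

# Toy export values (invented for illustration, not Atlas data).
exports = pd.DataFrame({
    "exporter": ["A", "A", "B", "B"],
    "commoditycode": ["0409", "8703", "0409", "8703"],
    "export_value": [90.0, 10.0, 10.0, 90.0],
})

world_total = exports["export_value"].sum()
country_total = exports.groupby("exporter")["export_value"].transform("sum")
product_total = exports.groupby("commoditycode")["export_value"].transform("sum")

# Balassa index: share of the product in the country's basket,
# divided by the share of the product in world trade.
exports["rca"] = (exports["export_value"] / country_total) / (product_total / world_total)

# Country-product connection: 1 where the country is an 'effective' exporter.
exports["mcp"] = (exports["rca"] > 1).astype(int)

# Diversity: number of products a country exports effectively.
diversity = exports.groupby("exporter")["mcp"].sum()
# Ubiquity: number of countries that export a product effectively.
ubiquity = exports.groupby("commoditycode")["mcp"].sum()
```

In this toy case each country has a comparative advantage in exactly one product, so both diversity and ubiquity equal 1 everywhere.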

# Example rows

Each row holds the summary data for one exporter country, commodity and year.

In [None]:
h0.sample(10)

In [None]:
h0[(h0.commoditycode == '0409') & (h0.exporter == 'FIN')] # The annual summary data for Finland and 'natural honey'

# Missing values

This is a cleaned dataset and, therefore, there are no missing values.

In [None]:
h0.isna().describe()
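Calling `describe()` on the boolean mask gives summary statistics that take some reading. A more direct check, sketched here on a toy frame with one deliberately missing value, counts the missing entries per column; `toy` and its values are invented for illustration.

```python
import pandas as pd

# Toy frame with one missing export value (illustrative only).
toy = pd.DataFrame({
    "export_value": [0.0, 5.0, float("nan")],
    "rca": [0.0, 1.2, 0.4],
})

# Number of missing entries in each column.
missing_per_column = toy.isna().sum()

# Single yes/no answer for the whole frame.
has_missing = bool(toy.isna().any().any())
```

On the real data, `h0.isna().sum()` should show zeros in every column if the dataset is as clean as stated.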

# Attribute distributions

Examining the distributions of the variables:

In [None]:
h0.hist(figsize=(24, 24))

Because this is a summary dataset that also tries to be consistent, a number of attributes contain many zero values (e.g. import/export values, `rca` and `mcp`). These rows nevertheless give information about the country's distance and possible opportunity gain in relation to those particular products, and therefore we do not drop them.

In [None]:
h0[(h0.export_value == 0) & (h0.import_value == 0)]
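To quantify how common the zeros are, one option is to compute the share of zero values per column. The sketch below uses a small invented frame; the column names follow the dataset, the numbers do not.

```python
import pandas as pd

# Toy rows (illustrative values only).
toy = pd.DataFrame({
    "export_value": [0.0, 0.0, 12.5, 3.0],
    "import_value": [0.0, 4.0, 0.0, 1.0],
    "distance": [0.91, 0.85, 0.40, 0.62],
})

# Fraction of zero entries in each column.
zero_share = (toy == 0).mean()

# Rows with neither exports nor imports still carry a distance value.
both_zero = toy[(toy["export_value"] == 0) & (toy["import_value"] == 0)]
```

Here half of the export and import values are zero, while the single row with no trade at all still has a usable `distance`.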

# First plot

## Getting the data

In [None]:
comtoexp = h0[(h0.year == 2016)].drop(columns=['year', 'population']).copy()

In [None]:
query = comtoexp[(comtoexp.exporter == 'IDN') & (comtoexp.mcp == 0)]
query.sort_values(by='distance').head()

In [None]:
comnames.head()

In [None]:
comnames = comnames[(comnames.Classification == 'H5')]

In [None]:
query = pd.merge(
                comnames[comnames.isLeaf ==  0].loc[:,['Code',
                                                       'Description']],
                query.loc[:,['commoditycode',
                             'mcp',
                             'distance',
                             'pci',
                             'oppgain']], 
    
                left_on='Code',
                right_on='commoditycode'
    
                ).drop(columns=['Code', 'commoditycode'])
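An inner merge like the one above silently drops commodity codes that have no matching description. A sketch of how to surface such codes with `indicator=True`, on invented toy frames, follows. It assumes both key columns hold codes as strings (e.g. `'0409'` with its leading zero); if the Excel file were parsed as integers, the keys would not match.

```python
import pandas as pd

# Toy lookup table and product rows (descriptions are illustrative).
names = pd.DataFrame({
    "Code": ["0409", "8703"],
    "Description": ["Natural honey", "Motor cars"],
})
products = pd.DataFrame({
    "commoditycode": ["0409", "8703", "9999"],
    "distance": [0.7, 0.8, 0.9],
})

# Left merge keeps every product row; validate checks that each
# code appears at most once in the lookup table.
merged = products.merge(
    names,
    how="left",
    left_on="commoditycode",
    right_on="Code",
    validate="many_to_one",
    indicator=True,
)

# Codes that found no description in the lookup table.
unmatched = merged.loc[merged["_merge"] == "left_only", "commoditycode"].tolist()
```

An empty `unmatched` list would confirm that no products are lost when the names are attached.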

In [None]:
query = query.sort_values(by='distance').head(30)
query

## Rendering the plot

In [None]:
source = ColumnDataSource(data = dict(names = list(query.Description),
                                      opg = query.oppgain,                                      
                                      pci = query.pci, 
                                      dist=query.distance))

In [None]:
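hover = HoverTool(
    tooltips=[
        ("Desc", "@names"),
        # "@column" reads the hovered point's value from the data source;
        # "$x"/"$y" would show the cursor position instead.
        ("PCI", "@pci"),
        ("Oppgain", "@opg"),
        ("Distance", "@dist"),
    ],
    formatters={
        '@names': 'printf',
    },
)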

In [None]:
p = figure(plot_width = 600,
           plot_height = 600,
           tools=['pan', hover, 'zoom_out', 'zoom_in', 'reset'])

In [None]:
p.scatter(x = 'pci',
          y = 'opg',
          size = 15,
          color = 'indigo',
          alpha = 0.6,
          source = source)

In [None]:
p.xaxis.axis_label = 'Product Complexity Index'
p.yaxis.axis_label = 'Opportunity Gain'

## Plot 1
The graph can help users to identify opportunities for production within a country.

It shows the 30 least 'distant' products that a country (in this case, Indonesia) does not yet produce. That is, starting to produce them would be relatively easy (in relation to the whole product universe), but the country is nonetheless not an 'effective' exporter of them.
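The selection described here can be sketched as a small function. `nearest_unexported` is a hypothetical name, and the toy rows are invented; the logic mirrors the notebook's filter on `exporter` and `mcp == 0` followed by a sort on `distance`.

```python
import pandas as pd


def nearest_unexported(df, exporter, n=30):
    """Return the n lowest-distance products with mcp == 0 for one exporter."""
    candidates = df[(df["exporter"] == exporter) & (df["mcp"] == 0)]
    return candidates.sort_values(by="distance").head(n)


# Toy rows (illustrative values only).
toy = pd.DataFrame({
    "exporter": ["IDN", "IDN", "IDN", "FIN"],
    "commoditycode": ["0409", "8703", "1001", "0409"],
    "mcp": [0, 1, 0, 0],
    "distance": [0.8, 0.3, 0.5, 0.2],
})

result = nearest_unexported(toy, "IDN", n=2)
```

The product with `mcp == 1` is excluded even though it has the lowest distance, because the country already exports it effectively.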

Axes:
* X: 'Product complexity index': An index showing the relative complexity of that particular product, based on the diversity of the countries producing it and the ubiquity of the products those countries make. That is, products with a high PCI are typically produced by only a few countries with a wide production line.
* Y: 'Opportunity gain': The degree to which producing that particular product opens up new opportunities for making more complex products.

Accordingly, the graph helps users spot product types that would be easy to start producing and are not yet produced within the country, but that can lead to valuable new skills and know-how.

In [None]:
show(p)