Review session: Trade, product space and economic complexity
============================================================



September 16 2021, Matte Hartog



## Notes



Google Colab link:

[https://colab.research.google.com/github/matteha/product-space-eci-workshop/blob/main/product-space-eci-workshop.ipynb](https://colab.research.google.com/github/matteha/product-space-eci-workshop/blob/main/product-space-eci-workshop.ipynb)



## To do first



In Google Colab:

1.  Turn on the Table of Contents (in the browser, click 'View' in the top menu, then 'Table of Contents')

2.  Expand all sections ('View' > 'Expand Sections', if not greyed out)

(Equations render properly in Google Colab; they do not render on GitHub.)



## Outline of lab session



-   Introduction to trade data

-   Calculating RCAs, product co-occurrences and product proximity, density / density regressions

-   Product space visualization

-   Calculating Economic Complexity / Product Complexity



## Trade data



### Background



The product space, as well as derived measures such as economic complexity and the Growth Lab's annual rankings of countries by economic complexity (at [https://atlas.cid.harvard.edu](https://atlas.cid.harvard.edu)), is based on trade data between countries.

The Growth Lab maintains and periodically updates a cleaned version of trade data at Harvard Dataverse:

[https://dataverse.harvard.edu/dataverse/atlas](https://dataverse.harvard.edu/dataverse/atlas)

This dataset contains bilateral trade data among 235 countries and territories in thousands of different product categories (a description of the data can be found at: [http://atlas.cid.harvard.edu/downloads](http://atlas.cid.harvard.edu/downloads)).

What does the data look like? We will explore it in Python using 'pandas', the most popular Python package for data analysis.
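
Before loading the real data, it may help to see the general shape of such a table. The sketch below builds a tiny pandas dataframe by hand; the rows and values are made up for illustration, and only the column layout (one row per year, country and product) mirrors what is used later in this session.

```python
import pandas as pd

# Hypothetical rows for illustration only; the export values are made up.
toy = pd.DataFrame({
    "year": [2012, 2012, 2013, 2013],
    "country_code": ["USA", "NLD", "USA", "NLD"],
    "product_code": ["8703", "2204", "8703", "2204"],
    "export_value": [100.0, 20.0, 120.0, 25.0],
})

# Each row is one year-country-product observation.
print(toy.shape)
print(toy.dtypes)

# Select rows with a boolean mask, then aggregate one column.
usa_total = toy.loc[toy["country_code"] == "USA", "export_value"].sum()
print(usa_total)
```

Product codes are kept as strings here, as in the rest of the session, so that leading zeros are not lost.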



#### Footnote on trade and services (ICT, tourism, etc.):



-   As of September 2018, services and tourism are also included in the Growth Lab's Atlas and trade data. See the announcement at:

[https://atlas.cid.harvard.edu/announcements/2018/services-press-release](https://atlas.cid.harvard.edu/announcements/2018/services-press-release)

Obtained from the IMF, trade in services covers four categories of economic activities between producers and consumers across borders:

-   services supplied from one country to another (e.g. call centers)
-   consumption in other countries (e.g. international tourism)
-   firms with branches in other countries (e.g. bank branches overseas)
-   individuals supplying services in another country (e.g. IT consultant abroad)



### Load necessary Python libraries



In [1]:
# -- Global settings
# - import python libraries necessary for this workshop
# suppress warnings on google colab for now
import warnings
warnings.filterwarnings("ignore")
# to interact with the os, e.g. to execute shell commands such as 'ls', 'pwd' etc.
import os
# to do data processing
import pandas as pd
# backend of pandas, working with matrices
import numpy as np
# to visualize data (in pandas)
import matplotlib.pyplot as plt
import matplotlib.colors as colors
# to process a json file
import json
# work with regex in python
import re
# work with networks in python, to create product space
import networkx as nx
# python tools to work with combinations of arrays
from itertools import count
from itertools import combinations
from itertools import product
# to run regressions
import statsmodels.api as sm
# to download files
import urllib.request
# -- display floats with two decimals instead of scientific notation
pd.set_option('display.float_format', '{:.2f}'.format)
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = 'all' # show the result of every expression in a cell, not only the last one
import seaborn as sns
sns.set_style('whitegrid') # white background with grid lines
# Enlarged pandas display - more columns and rows with greater width
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 100000)
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_colwidth',300)
print('necessary libraries loaded')

### Download trade dataset and load into memory



In [1]:
# Load the necessary data into pandas

# In pandas terminology this is called a 'dataframe' (df)
product_classification = 'hs' # Harmonized System 1992; alternative is 'SITC - Standard International Trade Classification'

N_digits = '4' # alternative is 2 or 6; the higher, the more detailed the product info


# Trade data: we download from Amazon S3 here because the file can be read directly into pandas in Google Colab.
# Note that this S3 copy is no longer maintained by the Growth Lab; for up-to-date data, download from Dataverse instead.

data_url = f"https://intl-atlas-downloads.s3.amazonaws.com/country_{product_classification}product{N_digits}digit_year.csv.zip"
print('Downloading data and loading into memory')
df_orig = pd.read_csv(data_url, compression="zip", low_memory=False)

# Fix product label strings ('hs_product_name_short_en'): some products with different product codes
# erroneously share the same label, so we replace the labels with the original UN ones to remove these duplicates
# e.g. product codes 5209 and 5211 in Zimbabwe have the same product string
# download the original UN classification
with urllib.request.urlopen("https://comtrade.un.org/data/cache/classificationH0.json") as url:
    hs1992_json = json.loads(url.read())
dft = pd.DataFrame.from_dict(hs1992_json['results'])[['text']]
dft['hs_product_code'] = dft['text'].str.split('-').str[0].str.strip()
# split on the first hyphen only (n must be passed by keyword in recent pandas versions)
dft['hs_product_name_short_en'] = dft['text'].str.split('-', n=1).str[1].str.strip()
dft['N_dig'] = dft['hs_product_code'].str.len()
dft2 = dft[dft['N_dig']==int(N_digits)].copy()
df_orig = pd.merge(df_orig,dft2[['hs_product_code','hs_product_name_short_en']],how='left',on='hs_product_code') # unmerged rows are services (obtained from IMF)
# replace product name now with downloaded strings (if not missing in either)
df_orig['hs_product_name_short_en_new'] = df_orig['hs_product_name_short_en_x']
df_orig.loc[ df_orig['hs_product_name_short_en_y'].notnull(),'hs_product_name_short_en_new'] = df_orig['hs_product_name_short_en_y']
df_orig.drop(['hs_product_name_short_en_x'],axis=1,inplace=True,errors='ignore')
df_orig.drop(['hs_product_name_short_en_y'],axis=1,inplace=True,errors='ignore')
df_orig.rename(columns={'hs_product_name_short_en_new':'hs_product_name_short_en'}, inplace=True)

# Cross-check that each row is a unique year-location-product entry
df_orig['count'] = 1
df_orig['sum'] = df_orig.groupby(['year','location_name_short_en','hs_product_name_short_en'])['count'].transform('sum')
if df_orig['sum'].max() != 1:
    # halt execution with an explicit error instead of relying on an undefined name
    raise ValueError('duplicates found, stopping')

# rename variable names for convenience
df_orig.rename(columns={'location_name_short_en':'country_name',
                        'location_code':'country_code',
                        'hs_product_code':'product_code',
                        'hs_product_name_short_en':'product_name'}, inplace=True)

# Keep only relevant columns
df_orig = df_orig[['year',
         'country_code',
         'country_name',
         'product_code',
         'product_name',
         'export_value']]

print('trade dataset ready')
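
Two steps in the cell above are easy to misread: splitting a "code - name" label on the first hyphen only, and checking that the key columns identify each row uniquely. The sketch below illustrates both on hand-made data. The label strings and the toy rows are hypothetical, and `duplicated` is used here as an alternative to the count-and-sum approach above, not as the method used in the workshop.

```python
import pandas as pd

# Hypothetical label strings in "code - name" form (illustrative only).
labels = pd.DataFrame({"text": [
    "0101 - Horses, asses, mules and hinnies; live",
    "2204 - Wine of fresh grapes - including fortified wines",
]})

# Split on the first hyphen only, so hyphens inside the name are preserved.
parts = labels["text"].str.split("-", n=1, expand=True)
labels["product_code"] = parts[0].str.strip()
labels["product_name"] = parts[1].str.strip()
print(labels[["product_code", "product_name"]])

# Uniqueness check on a toy table: the first two rows share the same key.
toy = pd.DataFrame({
    "year": [2012, 2012, 2012],
    "country_name": ["A", "A", "B"],
    "product_name": ["x", "x", "x"],
})
key = ["year", "country_name", "product_name"]

# duplicated() flags every repeat of a key after its first occurrence.
n_dupes = int(toy.duplicated(subset=key).sum())
print(f"{n_dupes} duplicate row(s) found")
```

Without `n=1`, the second label would be cut into three pieces and part of the product name would be lost.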

### Exploring the trade data



#### Structure of dataset



In [1]:
# show 5 random rows
df_orig.sample(n=5)

#### What years are in the data?



In [1]:
df_orig['year'].unique()

#### How many products are in the data?



In [1]:
df_orig['product_name'].nunique()

#### Finding specific countries / products based on partial string matching



In [1]:
STRING = 'Netherland'
df_orig[df_orig['country_name'].str.contains(STRING)][['country_name']].drop_duplicates()

# Regular expressions can also be used here, e.g. to ignore lower/uppercase ('wine' vs 'Wine')
STRING = 'wine'
df_orig[df_orig['product_name'].str.contains(STRING,flags=re.IGNORECASE, regex=True)][['product_name']].drop_duplicates()
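
The matching behaviour can be checked on a small hand-made series. This is a minimal sketch with made-up product names: it uses the `case` and `na` arguments of `str.contains` as an alternative to passing `flags=re.IGNORECASE`, and `na=False` so that missing names count as non-matches instead of producing missing values in the mask.

```python
import pandas as pd

# Hypothetical product names, including one missing value.
names = pd.Series(["Wine", "Sparkling wine", "Cars", None], name="product_name")

# case=False ignores upper/lower case; na=False treats missing names as non-matches.
mask = names.str.contains("wine", case=False, na=False)
matches = names[mask].tolist()
print(matches)

# The pattern is a regular expression by default, so alternation works too.
mask_any = names.str.contains(r"wine|cars", case=False, na=False)
n_any = int(mask_any.sum())
print(n_any)
```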


#### Example: What were the major export products of the USA in 2012?



In [1]:
# create a 'dataframe' called 'df2' with only exports from USA in 2012
df2 = df_orig[ (df_orig['country_code']=='USA') & (df_orig['year'] == 2012) ].copy()
# create another dataframe 'df3' that contains the sum of exports per product
df3 = df2.groupby(['product_code','product_name'],as_index=False)['export_value'].sum()
# sort
df3.sort_values(by=['export_value'],ascending=False,inplace=True)
# show first 10 rows
df3[0:10]
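
The group-sum-sort pattern above can be tried on a toy table. In this sketch the numbers are made up, `nlargest` is used as a shorthand for sorting and slicing, and the `share` column is an illustrative addition that is not part of the cell above.

```python
import pandas as pd

# Hypothetical export records; values are made up for illustration.
toy = pd.DataFrame({
    "product_code": ["8703", "8703", "2204", "1001"],
    "product_name": ["Cars", "Cars", "Wine", "Wheat"],
    "export_value": [60.0, 40.0, 30.0, 70.0],
})

# Sum exports per product, as in the cell above.
totals = toy.groupby(["product_code", "product_name"], as_index=False)["export_value"].sum()

# nlargest(n, column) returns the n rows with the highest values, already sorted.
top2 = totals.nlargest(2, "export_value")
print(top2)

# Share of each product in total exports (illustrative addition).
totals["share"] = totals["export_value"] / totals["export_value"].sum()
print(totals)
```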

#### Example: How did exports of Cars evolve over time in the USA?



US car exports grew from about 10 billion USD to almost 60 billion USD over the years covered by the data.



In [1]:
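# keep only exports from the USA
df2 = df_orig[ (df_orig['country_code']=='USA')].copy()
#df3 = df2[df2['product_name']=='Cars']
# select cars by product code (compared as a string) and sort by year so the line is drawn in time order
df3 = df2[df2['product_code']=='8703'].sort_values('year')
df3.plot(x='year', y='export_value')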
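
To read magnitudes such as "10 billion" off the data directly, the values can be rescaled before plotting. The sketch below uses made-up numbers and assumes `export_value` is expressed in USD; that unit is an assumption here and should be checked against the data description.

```python
import pandas as pd

# Hypothetical yearly export values for one product (made-up numbers, assumed to be in USD).
toy = pd.DataFrame({
    "year": [2013, 2011, 2012],
    "product_code": ["8703", "8703", "8703"],
    "export_value": [3.0e10, 1.0e10, 2.0e10],
})

# Filter on the product code (a string), order by year, and rescale to billions.
series = (
    toy[toy["product_code"] == "8703"]
    .sort_values("year")
    .set_index("year")["export_value"]
    / 1e9
)
print(series)

# Ratio of the last year to the first year.
growth = series.iloc[-1] / series.iloc[0]
print(growth)
```

Calling `series.plot()` on the result would draw the same kind of line chart as above, with the y-axis in billions.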