# Countries
This notebook merges country data from two sources:

- Country codes table from Wikipedia.
- R&D indicators by country, extracted from the OECD.

![logo_oecdilibrary.png](attachment:logo_oecdilibrary.png)

OECD © Organisation for Economic Co-operation and Development: https://www.oecd.org/

    OECD (2023), Researchers (indicator). doi: 10.1787/20ddfb0f-en (Accessed on 30 November 2023)
    


Wikipedia: https://en.wikipedia.org/wiki/List_of_ISO_3166_country_codes


In [1]:
# Libraries
import pandas as pd

# Ignore Warnings
import warnings
warnings.filterwarnings('ignore')

___
## Country Codes
#### Open the data

In [2]:
countries = pd.read_excel('../data/countries/countries_codes.xlsx')

In [3]:
countries.head()

Unnamed: 0,Country,Official_state_name,Sovereignty,Alpha_2_code,Alpha_3_code,Numeric,code_links,internet_ccTLD
0,Afghanistan,The Islamic Republic of Afghanistan,UN member state,AF,AFG,4,ISO 3166-2:AF,.af
1,Åland Islands,Åland,Finland,AX,ALA,248,ISO 3166-2:AX,.ax
2,Albania,The Republic of Albania,UN member state,AL,ALB,8,ISO 3166-2:AL,.al
3,Algeria,The People's Democratic Republic of Algeria,UN member state,DZ,DZA,12,ISO 3166-2:DZ,.dz
4,American Samoa,The Territory of American Samoa,United States,AS,ASM,16,ISO 3166-2:AS,.as


In [4]:
countries.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 249 entries, 0 to 248
Data columns (total 8 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   Country              249 non-null    object
 1   Official_state_name  249 non-null    object
 2   Sovereignty          249 non-null    object
 3   Alpha_2_code         248 non-null    object
 4   Alpha_3_code         249 non-null    object
 5   Numeric              249 non-null    int64 
 6   code_links           249 non-null    object
 7   internet_ccTLD       249 non-null    object
dtypes: int64(1), object(7)
memory usage: 123.8 KB


### Transform

In [5]:
countries[countries['Alpha_2_code'].isna()]

Unnamed: 0,Country,Official_state_name,Sovereignty,Alpha_2_code,Alpha_3_code,Numeric,code_links,internet_ccTLD
153,Namibia,The Republic of Namibia,UN member state,,NAM,516,ISO 3166-2:NA,.na


In [6]:
# The Alpha-2 code 'NA' (Namibia) was parsed as a missing value (NaN) when the file was read

In [7]:
# Use .loc to avoid chained assignment, which may silently fail to update the DataFrame
countries.loc[153, 'Alpha_2_code'] = 'NA'
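
The parsing behaviour behind the Namibia fix can be reproduced on a small in-memory table. This is a minimal sketch, not the notebook's method: the three rows below are toy data, and it uses `pd.read_csv` on a string rather than the Excel file. `keep_default_na` and `na_values` are also accepted by `pd.read_excel`, so the same options could be tried when opening `countries_codes.xlsx`.

```python
import io
import pandas as pd

# Toy data (made up for illustration): the last row has a genuinely empty code
csv_text = "Country,Alpha_2_code\nNamibia,NA\nAlbania,AL\nPlaceholder,\n"

# Default parsing treats the string 'NA' as a missing value
default = pd.read_csv(io.StringIO(csv_text))

# Disable the default NaN markers and treat only empty cells as missing
strict = pd.read_csv(io.StringIO(csv_text),
                     keep_default_na=False,
                     na_values=[''])

print(default['Alpha_2_code'].isna().tolist())
print(strict['Alpha_2_code'].isna().tolist())
```

Reading the file this way would make the manual correction unnecessary, at the cost of having to list every marker that should count as missing.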

In [8]:
countries.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 249 entries, 0 to 248
Data columns (total 8 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   Country              249 non-null    object
 1   Official_state_name  249 non-null    object
 2   Sovereignty          249 non-null    object
 3   Alpha_2_code         249 non-null    object
 4   Alpha_3_code         249 non-null    object
 5   Numeric              249 non-null    int64 
 6   code_links           249 non-null    object
 7   internet_ccTLD       249 non-null    object
dtypes: int64(1), object(7)
memory usage: 15.7+ KB


####  Export as .csv

In [9]:
countries.to_csv('../data/countries/country_codes_db.csv', index = False)

___
## OECD Research index

**Gross domestic spending on R&D**

Gross domestic spending on R&D is defined as the total expenditure (current and capital) on R&D carried out by all resident companies, research institutes, university and government laboratories, etc., in a country. It includes R&D funded from abroad, but excludes domestic funds for R&D performed outside the domestic economy. This indicator is measured in USD constant prices using 2015 base year and Purchasing Power Parities (PPPs) and as percentage of GDP.

**Researchers**

Researchers are professionals engaged in the conception or creation of new knowledge, products, processes, methods and systems, as well as in the management of the projects concerned. This indicator is measured in per 1 000 people employed and in number of researchers; the data are available as a total and broken down by gender.

#### Open Data

In [10]:
oecd_mln = pd.read_csv('../data/research_index/OECD_MLN_USD.csv')
oecd_gdp = pd.read_csv('../data/research_index/OECD_PC_GDP.csv')
oecd_res = pd.read_csv('../data/research_index/OECD_researchers.csv')

In [11]:
display(oecd_mln.head(2), oecd_gdp.head(2), oecd_res.head(2))

Unnamed: 0,LOCATION,INDICATOR,SUBJECT,MEASURE,FREQUENCY,TIME,Value,Flag Codes
0,AUS,GDEXPRD,TOT,MLN_USD,A,2010,21366.681644,
1,AUS,GDEXPRD,TOT,MLN_USD,A,2011,21522.601176,


Unnamed: 0,LOCATION,INDICATOR,SUBJECT,MEASURE,FREQUENCY,TIME,Value,Flag Codes
0,AUS,GDEXPRD,TOT,PC_GDP,A,2010,2.179564,
1,AUS,GDEXPRD,TOT,PC_GDP,A,2011,2.113019,


Unnamed: 0,LOCATION,INDICATOR,SUBJECT,MEASURE,FREQUENCY,TIME,Value,Flag Codes
0,CZE,RESEARCHER,WOMEN,HEADCOUNT,A,2010,12197.97,
1,CZE,RESEARCHER,WOMEN,HEADCOUNT,A,2011,12936.03,


#### Transform Data

##### mln: Million US dollars

In [12]:
# Group the data to have one row per country
mln_df = pd.pivot_table(oecd_mln,
                        index = ['LOCATION'],
                        columns = ['TIME'],
                        values = ['Value'])

display(mln_df.head(2))

# Create a list with the names of the columns:
colnames = []

for year in range(2010, 2023):
    name = f'mln_{year}'
    colnames.append(name)

# Assign the new names to the columns
mln_df.columns = colnames

# Remove Multiindex
mln_df = mln_df.reset_index()

mln_df = mln_df.rename(columns={'LOCATION': 'country'})

mln_df.head(2)

Unnamed: 0_level_0,Value,Value,Value,Value,Value,Value,Value,Value,Value,Value,Value,Value,Value
TIME,2010,2011,2012,2013,2014,2015,2016,2017,2018,2019,2020,2021,2022
LOCATION,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2,Unnamed: 13_level_2
ARG,4545.805493,4863.909996,5401.974043,5388.675657,5001.362356,5363.843618,4502.054682,4857.017029,4203.059951,3983.823581,4063.512748,4288.019039,
AUS,21366.681644,21522.601176,,22441.069224,,21157.076647,,21236.873383,,21738.736802,,,


Unnamed: 0,country,mln_2010,mln_2011,mln_2012,mln_2013,mln_2014,mln_2015,mln_2016,mln_2017,mln_2018,mln_2019,mln_2020,mln_2021,mln_2022
0,ARG,4545.805493,4863.909996,5401.974043,5388.675657,5001.362356,5363.843618,4502.054682,4857.017029,4203.059951,3983.823581,4063.512748,4288.019039,
1,AUS,21366.681644,21522.601176,,22441.069224,,21157.076647,,21236.873383,,21738.736802,,,


##### gdp: Gross domestic spending on R&D as % of GDP

In [13]:
# Group the data to have one row per country
gdp_df = pd.pivot_table(oecd_gdp,
                        index = ['LOCATION'],
                        columns = ['TIME'],
                        values = ['Value'])

display(gdp_df.head(2))

# Create a list with the names of the columns:
colnames = []

for year in range(2010, 2023):
    name = f'gdp_{year}'
    colnames.append(name)
    
# Assign the new names to the columns
gdp_df.columns = colnames

# Remove Multiindex
gdp_df = gdp_df.reset_index()

gdp_df = gdp_df.rename(columns={'LOCATION': 'country'})

gdp_df.head(2)

Unnamed: 0_level_0,Value,Value,Value,Value,Value,Value,Value,Value,Value,Value,Value,Value,Value
TIME,2010,2011,2012,2013,2014,2015,2016,2017,2018,2019,2020,2021,2022
LOCATION,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2,Unnamed: 13_level_2
ARG,0.56405,0.569338,0.638878,0.622336,0.592493,0.618541,0.530192,0.556315,0.494351,0.478132,0.541543,0.517637,
AUS,2.179564,2.113019,,2.093805,,1.881043,,1.794278,,1.798561,,,


Unnamed: 0,country,gdp_2010,gdp_2011,gdp_2012,gdp_2013,gdp_2014,gdp_2015,gdp_2016,gdp_2017,gdp_2018,gdp_2019,gdp_2020,gdp_2021,gdp_2022
0,ARG,0.56405,0.569338,0.638878,0.622336,0.592493,0.618541,0.530192,0.556315,0.494351,0.478132,0.541543,0.517637,
1,AUS,2.179564,2.113019,,2.093805,,1.881043,,1.794278,,1.798561,,,


##### res: number of researchers
The `SUBJECT` column takes two values:
* TOT: Total number of researchers
* WOMEN: Number of women researchers

In [14]:
oecd_res['SUBJECT'].value_counts()

TOT      395
WOMEN    376
Name: SUBJECT, dtype: int64

In [15]:
total = oecd_res[oecd_res['SUBJECT'] == 'TOT']

# Group the data to have one row per country
tot_df = pd.pivot_table(total,
                        index = ['LOCATION'],
                        columns = ['TIME'],
                        values = ['Value'])

display(tot_df.head(2))

# Create a list with the names of the columns:
colnames = []

for year in range(2010, 2022):
    name = f'tot_{year}'
    colnames.append(name)
    
# Assign the new names to the columns
tot_df.columns = colnames

# Remove Multiindex
tot_df = tot_df.reset_index()

tot_df = tot_df.rename(columns={'LOCATION': 'country'})

tot_df.head(2)

Unnamed: 0_level_0,Value,Value,Value,Value,Value,Value,Value,Value,Value,Value,Value,Value
TIME,2010,2011,2012,2013,2014,2015,2016,2017,2018,2019,2020,2021
LOCATION,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2
ARG,72208.0,77354.0,80245.0,81964.0,83837.0,82407.0,86562.0,84284.0,88872.0,90656.0,91243.0,93925.0
AUT,,65609.0,,71448.0,,78051.0,,83648.0,,93179.0,,96270.0


Unnamed: 0,country,tot_2010,tot_2011,tot_2012,tot_2013,tot_2014,tot_2015,tot_2016,tot_2017,tot_2018,tot_2019,tot_2020,tot_2021
0,ARG,72208.0,77354.0,80245.0,81964.0,83837.0,82407.0,86562.0,84284.0,88872.0,90656.0,91243.0,93925.0
1,AUT,,65609.0,,71448.0,,78051.0,,83648.0,,93179.0,,96270.0


In [16]:
women = oecd_res[oecd_res['SUBJECT'] == 'WOMEN']

# Group the data to have one row per country
wom_df = pd.pivot_table(women,
                        index = ['LOCATION'],
                        columns = ['TIME'],
                        values = ['Value'])

display(wom_df.head(2))

# Create a list with the names of the columns:
colnames = []

for year in range(2010, 2022):
    name = f'wom_{year}'
    colnames.append(name)
    
# Assign the new names to the columns
wom_df.columns = colnames

# Remove Multiindex
wom_df = wom_df.reset_index()

wom_df = wom_df.rename(columns={'LOCATION': 'country'})

wom_df.head(2)

Unnamed: 0_level_0,Value,Value,Value,Value,Value,Value,Value,Value,Value,Value,Value,Value
TIME,2010,2011,2012,2013,2014,2015,2016,2017,2018,2019,2020,2021
LOCATION,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2
ARG,37709.0,40763.0,42170.0,42978.0,44387.0,43350.0,45875.0,45311.0,47407.0,48174.0,48586.0,50519.0
AUT,,19020.0,,21145.0,,23020.0,,25144.0,,28319.0,,30086.0


Unnamed: 0,country,wom_2010,wom_2011,wom_2012,wom_2013,wom_2014,wom_2015,wom_2016,wom_2017,wom_2018,wom_2019,wom_2020,wom_2021
0,ARG,37709.0,40763.0,42170.0,42978.0,44387.0,43350.0,45875.0,45311.0,47407.0,48174.0,48586.0,50519.0
1,AUT,,19020.0,,21145.0,,23020.0,,25144.0,,28319.0,,30086.0
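
The four pivot cells above repeat the same steps with a different prefix and a hard-coded year range. A possible refactoring is sketched below. The helper name `pivot_wide` and the sample rows are hypothetical. Unlike the cells above, it builds the column names from the years actually present in the data, so it does not depend on `range(2010, 2023)` matching the file.

```python
import pandas as pd

def pivot_wide(df, prefix):
    """Return one row per LOCATION with columns named '<prefix>_<year>'."""
    # Passing values as a string (not a list) yields single-level columns
    wide = df.pivot_table(index='LOCATION', columns='TIME', values='Value')
    # Name the columns after the years found in the data
    wide.columns = [f'{prefix}_{year}' for year in wide.columns]
    return wide.reset_index().rename(columns={'LOCATION': 'country'})

# Toy long-format data (made up for illustration)
sample = pd.DataFrame({
    'LOCATION': ['AUS', 'AUS', 'ARG'],
    'TIME': [2010, 2011, 2010],
    'Value': [1.0, 2.0, 3.0],
})

wide = pivot_wide(sample, 'mln')
print(wide)
```

With such a helper, each indicator would reduce to a single call, for example `pivot_wide(oecd_mln, 'mln')`.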


#### Export Data

In [17]:
mln_df.to_csv('../data/research_index/mln_df.csv', index = False)
gdp_df.to_csv('../data/research_index/gdp_df.csv', index = False)
tot_df.to_csv('../data/research_index/tot_df.csv', index = False)
wom_df.to_csv('../data/research_index/wom_df.csv', index = False)

___
## Merge Data from both sources

In [18]:
countries.head(2)

Unnamed: 0,Country,Official_state_name,Sovereignty,Alpha_2_code,Alpha_3_code,Numeric,code_links,internet_ccTLD
0,Afghanistan,The Islamic Republic of Afghanistan,UN member state,AF,AFG,4,ISO 3166-2:AF,.af
1,Åland Islands,Åland,Finland,AX,ALA,248,ISO 3166-2:AX,.ax


In [19]:
countries.shape, mln_df.shape, gdp_df.shape, tot_df.shape, wom_df.shape

((249, 8), (47, 14), (47, 14), (41, 13), (38, 13))
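
Before merging, it can help to list which OECD location codes have no match among the ISO Alpha-3 codes. The sketch below uses toy data. The `'OECD'` entry is a hypothetical aggregate code, and whether the real files contain such non-country codes is an assumption to check against the data.

```python
import pandas as pd

# Toy stand-ins for countries['Alpha_3_code'] and mln_df['country']
iso_codes = pd.Series(['ARG', 'AUS', 'USA'], name='Alpha_3_code')
locations = pd.Series(['ARG', 'AUS', 'OECD'], name='country')

# Location codes that would not match any country in the merge
unmatched = sorted(set(locations) - set(iso_codes))
print(unmatched)
```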

In [20]:
countries = countries.copy()

In [21]:
# mln_df
output = pd.merge(countries, mln_df, how='outer', left_on='Alpha_3_code', right_on='country')

# gdp_df
output = pd.merge(output, gdp_df, how='outer', left_on='Alpha_3_code', right_on='country')

# tot_df
output = pd.merge(output, tot_df, how='outer', left_on='Alpha_3_code', right_on='country')

# wom_df
output = pd.merge(output, wom_df, how='outer', left_on='Alpha_3_code', right_on='country')
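
The repeated merges above leave duplicate key columns (`country_x`, `country_y`) in the result, and recent pandas versions may reject the duplicated names. One alternative is sketched here with toy frames: merge in a loop and drop the right-hand key after each step. It assumes a left join is acceptable, since the notebook later drops rows that have no country code anyway. `validate='one_to_one'` is an optional check that each key appears only once on both sides.

```python
import pandas as pd

# Toy stand-ins for the notebook's frames (indicator values are made up)
countries = pd.DataFrame({'Alpha_3_code': ['ARG', 'AUS', 'AFG'],
                          'Numeric': [32, 36, 4]})
mln_df = pd.DataFrame({'country': ['ARG', 'AUS'], 'mln_2010': [1.0, 2.0]})
gdp_df = pd.DataFrame({'country': ['ARG'], 'gdp_2010': [0.5]})

output = countries
for frame in (mln_df, gdp_df):
    output = (
        output.merge(frame, how='left',
                     left_on='Alpha_3_code', right_on='country',
                     validate='one_to_one')
              # The right-hand key duplicates Alpha_3_code, so drop it
              .drop(columns='country')
    )

print(output)
```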

In [22]:
# Check
output[output['Alpha_3_code'] == 'USA']

Unnamed: 0,Country,Official_state_name,Sovereignty,Alpha_2_code,Alpha_3_code,Numeric,code_links,internet_ccTLD,country_x,mln_2010,...,wom_2012,wom_2013,wom_2014,wom_2015,wom_2016,wom_2017,wom_2018,wom_2019,wom_2020,wom_2021
236,United States of America (the),The United States of America,UN member state,US,USA,840.0,ISO 3166-2:US,.us,USA,444708.58189,...,,,,,,,,,,


In [23]:
output = output.rename(columns={'Numeric': 'index'})

In [24]:
output.columns

Index(['Country', 'Official_state_name', 'Sovereignty', 'Alpha_2_code',
       'Alpha_3_code', 'index', 'code_links', 'internet_ccTLD', 'country_x',
       'mln_2010', 'mln_2011', 'mln_2012', 'mln_2013', 'mln_2014', 'mln_2015',
       'mln_2016', 'mln_2017', 'mln_2018', 'mln_2019', 'mln_2020', 'mln_2021',
       'mln_2022', 'country_y', 'gdp_2010', 'gdp_2011', 'gdp_2012', 'gdp_2013',
       'gdp_2014', 'gdp_2015', 'gdp_2016', 'gdp_2017', 'gdp_2018', 'gdp_2019',
       'gdp_2020', 'gdp_2021', 'gdp_2022', 'country_x', 'tot_2010', 'tot_2011',
       'tot_2012', 'tot_2013', 'tot_2014', 'tot_2015', 'tot_2016', 'tot_2017',
       'tot_2018', 'tot_2019', 'tot_2020', 'tot_2021', 'country_y', 'wom_2010',
       'wom_2011', 'wom_2012', 'wom_2013', 'wom_2014', 'wom_2015', 'wom_2016',
       'wom_2017', 'wom_2018', 'wom_2019', 'wom_2020', 'wom_2021'],
      dtype='object')

In [25]:
output = output[['index', 'Country', 'Official_state_name', 'Sovereignty', 'Alpha_2_code', 'Alpha_3_code', 
                 'mln_2010', 'mln_2011', 'mln_2012', 'mln_2013', 'mln_2014', 'mln_2015', 
                 'mln_2016', 'mln_2017', 'mln_2018', 'mln_2019', 'mln_2020', 'mln_2021', 'mln_2022', 
                 'gdp_2010', 'gdp_2011', 'gdp_2012', 'gdp_2013', 'gdp_2014', 'gdp_2015', 
                 'gdp_2016', 'gdp_2017', 'gdp_2018', 'gdp_2019', 'gdp_2020', 'gdp_2021', 'gdp_2022', 
                 'tot_2010', 'tot_2011', 'tot_2012', 'tot_2013', 'tot_2014', 'tot_2015', 
                 'tot_2016', 'tot_2017', 'tot_2018', 'tot_2019', 'tot_2020', 'tot_2021', 
                 'wom_2010', 'wom_2011', 'wom_2012', 'wom_2013', 'wom_2014', 'wom_2015', 
                 'wom_2016', 'wom_2017', 'wom_2018', 'wom_2019', 'wom_2020', 'wom_2021']]

In [26]:
output = output.dropna(subset=['index'])
output.shape

(249, 56)

In [27]:
output['index'] = output['index'].astype(int)

In [28]:
output = output.fillna(0)

In [29]:
output.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 249 entries, 0 to 248
Data columns (total 56 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   index                249 non-null    int32  
 1   Country              249 non-null    object 
 2   Official_state_name  249 non-null    object 
 3   Sovereignty          249 non-null    object 
 4   Alpha_2_code         249 non-null    object 
 5   Alpha_3_code         249 non-null    object 
 6   mln_2010             249 non-null    float64
 7   mln_2011             249 non-null    float64
 8   mln_2012             249 non-null    float64
 9   mln_2013             249 non-null    float64
 10  mln_2014             249 non-null    float64
 11  mln_2015             249 non-null    float64
 12  mln_2016             249 non-null    float64
 13  mln_2017             249 non-null    float64
 14  mln_2018             249 non-null    float64
 15  mln_2019             249 non-null    flo

In [30]:
output.head(2)

Unnamed: 0,index,Country,Official_state_name,Sovereignty,Alpha_2_code,Alpha_3_code,mln_2010,mln_2011,mln_2012,mln_2013,...,wom_2012,wom_2013,wom_2014,wom_2015,wom_2016,wom_2017,wom_2018,wom_2019,wom_2020,wom_2021
0,4,Afghanistan,The Islamic Republic of Afghanistan,UN member state,AF,AFG,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,248,Åland Islands,Åland,Finland,AX,ALA,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [12]:
# Rename the columns
cols_old = ['index', 'Country', 'Official_state_name', 'Sovereignty', 
            'Alpha_2_code', 'Alpha_3_code']

cols_new = ['country_id', 'country_name', 'official_state_name', 'sovereignty', 
            'alpha_2_code', 'alpha_3_code']

In [13]:
# Rename all columns in a single call
output = output.rename(columns=dict(zip(cols_old, cols_new)))

In [14]:
output.columns

Index(['country_id', 'country_name', 'official_state_name', 'sovereignty',
       'alpha_2_code', 'alpha_3_code', 'mln_2010', 'mln_2011', 'mln_2012',
       'mln_2013', 'mln_2014', 'mln_2015', 'mln_2016', 'mln_2017', 'mln_2018',
       'mln_2019', 'mln_2020', 'mln_2021', 'mln_2022', 'gdp_2010', 'gdp_2011',
       'gdp_2012', 'gdp_2013', 'gdp_2014', 'gdp_2015', 'gdp_2016', 'gdp_2017',
       'gdp_2018', 'gdp_2019', 'gdp_2020', 'gdp_2021', 'gdp_2022', 'tot_2010',
       'tot_2011', 'tot_2012', 'tot_2013', 'tot_2014', 'tot_2015', 'tot_2016',
       'tot_2017', 'tot_2018', 'tot_2019', 'tot_2020', 'tot_2021', 'wom_2010',
       'wom_2011', 'wom_2012', 'wom_2013', 'wom_2014', 'wom_2015', 'wom_2016',
       'wom_2017', 'wom_2018', 'wom_2019', 'wom_2020', 'wom_2021'],
      dtype='object')

## Export as .csv

In [7]:
output.to_csv('../data/neuropapers_db/countries.csv', index=False)