When I browse publicly available datasets, for example on Kaggle, I find that most of them are already tidy. Sure, to use the data for machine learning you must do feature engineering, but most of the data cleaning is already done. I believe this is a problem, since most of the data in the "real" world is messy. So, in this post, I will use R to clean messy data from the IMF data homepage (International Financial Statistics) with the ultimate goal of producing a tidy dataframe. I will use the same data as in the post [Python - Cleaning excel files from the International Monetary Fund](https://htp84.github.io/blog/posts/python-cleaning-excel-files-from-imf/).
<!-- TEASER_END -->

I have downloaded 8 excel files and they contain the following data:

* Financial Market Prices, Equities, End of Period, Index
* Prices, Consumer Price Index, All items, Index
* Prices, Producer Price Index, All Commodities, Index
* Economic Activity, Industrial Production, Index
* Labor Force, Persons, Number of
* Unemployment, Persons, Number of
* Employment, Persons, Number of
* Labor Markets, Unemployment Rate, Percent

The post consists of the following four parts:

1. Setup and a first look at one file
2. Parse and clean all files
3. Merge with external data
4. Save data

## 1. Setup and a first look at one file
First I load the libraries I need. Below I list some examples of what I use the libraries for in this post. Follow the links to see more use cases.
* [dplyr](https://dplyr.tidyverse.org/) is used to manipulate data, e.g. add new variables and select variables.
* [tidyr](https://tidyr.tidyverse.org/) is used to create tidy data.
* [stringr](https://stringr.tidyverse.org/) consists of functions that make working with strings as easy as possible.
* [purrr](https://purrr.tidyverse.org/) is used to apply a function to every element of a vector (map_chr).
* [tibble](https://tibble.tidyverse.org/) is used to add columns to a dataframe (add_column) and to turn row names into a column (rownames_to_column).
* [readxl](https://readxl.tidyverse.org/) makes it easy to get data out of Excel and into R.
* [readr](https://readr.tidyverse.org/) is used to read csv files (read_csv).
* [janitor](https://github.com/sfirke/janitor) is used to format column names.

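I also create two small helper functions. The first, catn, is a custom cat (print) function that adds newlines; I use it to add explanations to the output. The second, shape, prints the number of rows and columns of a dataframe.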

In [3]:
library(dplyr)
library(tidyr)
library(stringr)
library(purrr)
library(tibble)
library(readxl)
library(readr)
library(janitor)

In [4]:
catn <- function(x){
    cat("\n\n", x, "\n\n", sep = "")
}

shape <- function(df){
    cat("\n\n", dim(df)[1],",", dim(df)[2], sep="")
}

To see which excel files I have in the data directory I use the [list.files](https://stat.ethz.ch/R-manual/R-devel/library/base/html/list.files.html) function. The files are named poorly, as seen below. However, I know there is information in the files that will help to identify them. Therefore, I parse the first file in files and view its first 6 rows using the head function.

In [7]:
catn("Files:")
files <- list.files(path = "../../../R/imf/", pattern = "\\.xlsx$", full.names = TRUE)
files

catn("Data from the first file:")
head(read_excel(files[1]))



Files:





Data from the first file:



New names:
* `` -> ...2
* `` -> ...3
* `` -> ...4
* `` -> ...5
* `` -> ...6
* … and 99 more problems


"Prices, Production and Labor selected indicators",...2,...3,...4,...5,...6,...7,...8,...9,...10,⋯,...96,...97,...98,...99,...100,...101,...102,...103,...104,...105
<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,⋯,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>
"Prices, Consumer Price Index, All items, Index","getSelectionEl(0,1,""Indicator"")","getSelectionEl(0,1,""Indicator"")",,,,,,,,⋯,,,,,,,,,,
,,,,,,,,,,⋯,,,,,,,,,,
Source: International Financial Statistics (IFS),,,,,,,,,,⋯,,,,,,,,,,
,,,,,,,,,,⋯,,,,,,,,,,
Country,Scale,Base Year,2013.0,2013Q1,2013M01,2013M02,2013M03,2013Q2,2013M04,⋯,2018M05,2018M06,2018Q3,2018M07,2018M08,2018M09,2018Q4,2018M10,2018M11,2018M12
"Afghanistan, Islamic Republic of",Units,2010=100,127.795223073732,125.223096076787,125.191195214315,125.515928270858,124.962164745187,126.716990862207,126.997097858061,⋯,145.34294838289301,145.19114224173501,145.51246349769201,145.38508863027101,145.297193117007,145.855108745797,...,147.25501270754501,148.21762074521999,...


The column names are in row 5 and the actual data starts at row 6. The rows above the header only contain metadata, but the KPI name in row 1 of the first column is useful, since it identifies the file.

I will import the data again and then clean the dataframe. In the cell below I explain the purpose of every line.

In [8]:
data <- read_excel(files[1]) # parse the excel file
kpi <- str_c(data[1, 1])  # get the kpi name (row 1, column 1)
df <- data[-(1:5),] # drop the metadata and header rows, keep the data
cols <- data[5,] # get the column names from row 5
colnames(df) <- cols # set the column names
df <- df %>% add_column(kpi, .after=1) # add kpi as the second column
catn("Cleaned data v1 (one file) and shape:")
head(df)
shape(df)

New names:
* `` -> ...2
* `` -> ...3
* `` -> ...4
* `` -> ...5
* `` -> ...6
* … and 99 more problems




Cleaned data v1 (one file) and shape:



Country,kpi,Scale,Base Year,2013,2013Q1,2013M01,2013M02,2013M03,2013Q2,⋯,2018M05,2018M06,2018Q3,2018M07,2018M08,2018M09,2018Q4,2018M10,2018M11,2018M12
<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,⋯,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>
"Afghanistan, Islamic Republic of","Prices, Consumer Price Index, All items, Index",Units,2010=100,127.795223073732,125.223096076787,125.191195214315,125.515928270858,124.962164745187,126.716990862207,⋯,145.34294838289301,145.19114224173501,145.51246349769201,145.38508863027101,145.297193117007,145.855108745797,...,147.25501270754501,148.21762074521999,...
Albania,"Prices, Consumer Price Index, All items, Index",Units,2010=100,107.581662954714,108.463251670379,107.57238307349699,108.79732739420901,109.02004454343,107.943578322197,⋯,116.843122494432,116.777638084633,116.844479955457,116.470825167038,116.896940979955,117.165673719376,117.18271752041601,116.963307349666,116.67275055679301,117.91209465478801
Algeria,"Prices, Consumer Price Index, All items, Index",Units,2010=100,117.521838067973,117.773862830018,117.69067019012,117.514497540923,118.11642075901101,117.345665418777,⋯,149.533876532335,151.15613301035,148.865888570799,148.25662482566301,148.93195331424801,149.409087572488,...,151.19283564559899,150.51750715701399,...
Angola,"Prices, Consumer Price Index, All items, Index",Units,2010=100,136.131179284704,132.310430069409,131.29906311983601,132.378692172749,133.253534915644,135.114554341117,⋯,329.14408793675602,332.96174100655298,342.88282726555701,337.05145577843803,341.11130241019299,350.48572360804201,...,355.286581821288,360.20852718653498,...
Anguilla,"Prices, Consumer Price Index, All items, Index",Units,2010=100,106.355019921305,106.065480462273,...,...,...,106.62971120295,⋯,...,...,...,...,...,...,...,...,...,...
Antigua and Barbuda,"Prices, Consumer Price Index, All items, Index",Units,2010=100,108.083497160332,108.142201834862,108.232306684142,108.346985583224,107.847313237222,108.073940585408,⋯,113.83519003931799,113.990825688073,114.05362603757099,113.83519003931799,114.162844036697,114.162844036697,...,...,...,...




203,106

As can be seen in the output above, missing data is coded with three dots (...). These must be replaced with NA. Also, in my opinion, the kpi column and all column names look better if commas and spaces are replaced with underscores and everything is set to lowercase. Since I have already added the kpi column, I remove it before cleaning the column names and then add it back, this time with a cleaned version of the kpi variable set in the previous code cell. To clean the names I use the [make_clean_names and clean_names](https://cran.r-project.org/web/packages/janitor/vignettes/janitor.html) functions from janitor.

In [9]:
df <- df %>% select(-c("kpi")) %>% clean_names()
kpi <- kpi %>% make_clean_names()
df <- df %>% add_column(kpi, .after=1) %>%
  na_if("...")
catn("Cleaned data v2 (one file):")
head(df)



Cleaned data v2 (one file):



country,kpi,scale,base_year,x2013,x2013q1,x2013m01,x2013m02,x2013m03,x2013q2,⋯,x2018m05,x2018m06,x2018q3,x2018m07,x2018m08,x2018m09,x2018q4,x2018m10,x2018m11,x2018m12
<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,⋯,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>
"Afghanistan, Islamic Republic of",prices_consumer_price_index_all_items_index,Units,2010=100,127.795223073732,125.223096076787,125.191195214315,125.515928270858,124.962164745187,126.716990862207,⋯,145.342948382893,145.191142241735,145.512463497692,145.385088630271,145.297193117007,145.855108745797,,147.255012707545,148.21762074522,
Albania,prices_consumer_price_index_all_items_index,Units,2010=100,107.581662954714,108.463251670379,107.572383073497,108.797327394209,109.02004454343,107.943578322197,⋯,116.843122494432,116.777638084633,116.844479955457,116.470825167038,116.896940979955,117.165673719376,117.182717520416,116.963307349666,116.672750556793,117.912094654788
Algeria,prices_consumer_price_index_all_items_index,Units,2010=100,117.521838067973,117.773862830018,117.69067019012,117.514497540923,118.116420759011,117.345665418777,⋯,149.533876532335,151.15613301035,148.865888570799,148.256624825663,148.931953314248,149.409087572488,,151.192835645599,150.517507157014,
Angola,prices_consumer_price_index_all_items_index,Units,2010=100,136.131179284704,132.310430069409,131.299063119836,132.378692172749,133.253534915644,135.114554341117,⋯,329.144087936756,332.961741006553,342.882827265557,337.051455778438,341.11130241019293,350.485723608042,,355.286581821288,360.208527186535,
Anguilla,prices_consumer_price_index_all_items_index,Units,2010=100,106.355019921305,106.065480462273,,,,106.62971120295,⋯,,,,,,,,,,
Antigua and Barbuda,prices_consumer_price_index_all_items_index,Units,2010=100,108.083497160332,108.142201834862,108.232306684142,108.346985583224,107.847313237222,108.073940585408,⋯,113.835190039318,113.990825688073,114.053626037571,113.835190039318,114.162844036697,114.162844036697,,,,
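
The cell above applies na_if to the whole dataframe. Depending on the installed dplyr version, this may no longer be accepted, so as a hedged alternative the same replacement can be applied column by column with across. The toy tibble below is made up for illustration and is not part of the IMF data.

```r
library(dplyr)

# Hypothetical toy data standing in for one parsed IMF sheet
toy <- tibble(
  country = c("Albania", "Anguilla"),
  x2013   = c("107.58", "..."),
  x2013q1 = c("...", "106.07")
)

# Replace the "..." placeholder with NA in every column
toy_clean <- toy %>%
  mutate(across(everything(), ~ na_if(.x, "...")))

print(toy_clean)
```

read_excel from readxl also has an na argument, so another option may be to declare "..." as missing already when the file is read.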


## 2. Parse and clean all files

Now the dataframe looks pretty good. But I have only imported one excel file. Let's summarise the steps above in one cell and parse all excel files. This can be done with a for loop iterating over the files list.

In [10]:
financial_data <- tibble()
for (i in files){
  data <- read_excel(i)
  kpi <- str_c(data[1, 1]) %>% 
    make_clean_names()
  cols <- data[5,] %>% 
    make_clean_names()
  colnames(data) = cols
  df <- data[-(1:5),] %>% 
    add_column(kpi, .after=1) %>% 
    na_if("...")
  financial_data <- bind_rows(financial_data, df)
  #head(df)
}
colnames(financial_data) <- colnames(df)

catn("Cleaned data v1 (all files) and shape:")
sample_n(financial_data, 10)
shape(financial_data)

New names:
* `` -> ...2
* `` -> ...3
* `` -> ...4
* `` -> ...5
* `` -> ...6
* … and 99 more problems




Cleaned data v1 (all files) and shape:



country,kpi,scale,base_year,x2013,x2013q1,x2013m01,x2013m02,x2013m03,x2013q2,⋯,x2018m05,x2018m06,x2018q3,x2018m07,x2018m08,x2018m09,x2018q4,x2018m10,x2018m11,x2018m12
<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,⋯,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>
Netherlands Antilles,prices_consumer_price_index_all_items_index,Units,,,106.742473941249,,,,107.675486551498,⋯,,,,,,,,,,
"Serbia, Republic of",unemployment_persons_number_of,Thousands,,774.8742499999998,787.0716666666668,778.5789999999998,790.292,792.344,784.374,⋯,,,,,,,,,,
Trinidad and Tobago,prices_consumer_price_index_all_items_index,Units,2010=100,120.811837601268,119.563210325074,119.239139123619,119.582273336924,119.868218514678,122.136716924864,⋯,141.302497472048,141.170562367032,141.742281155434,141.698302787095,141.698302787095,141.830237892111,,142.489913417191,,
Moldova,labor_markets_unemployment_rate_percent,Units,,5.2,8.099999999999998,,,,4.7,⋯,,,,,,,,,,
Nigeria,prices_consumer_price_index_all_items_index,Units,2010=100,134.924642338,131.252390788769,130.273123708974,131.282992885013,132.20105577232,133.792364776987,⋯,236.209652658316,239.132055640098,244.207193936844,241.839425179456,244.374077677667,246.408078953408,,248.239507183695,,
Nigeria,employment_persons_number_of,Thousands,,,,,,,,⋯,,,,,,,,,,
Philippines,unemployment_persons_number_of,Thousands,,2904.5,2894.0,,,,3087.0,⋯,,,,,,,,,,
San Marino,prices_consumer_price_index_all_items_index,Units,2010=100,107.506708961308,106.854112056411,106.776642488346,106.887794477309,106.897899203579,107.271774075545,⋯,112.445393925458,112.607069545768,113.04830925953,112.788954618617,113.263876753277,113.092096406697,,,,
Gabon,prices_consumer_price_index_all_items_index,Units,2010=100,104.474645977231,103.226521248783,104.563126732099,102.389772289699,102.726664724553,102.210840700811,⋯,119.293278872144,120.576922475503,120.890702022991,120.833651196175,121.004803676623,120.833651196175,,121.860566078862,,
Czech Republic,prices_producer_price_index_all_commodities_index,Units,2010=100,108.66388308977,109.041282838622,109.041282838622,109.041282838622,109.041282838622,108.336176504038,⋯,104.682225344412,105.309065017133,,105.622484853494,,,,,,




860,106
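
As a sketch of an alternative to the loop above, the per-file steps can be wrapped in a function and applied to every file. The function name clean_imf_sheet and the toy input are my own assumptions for illustration; the layout (KPI name in row 1, header in row 5, data from row 6) is the one described earlier in the post.

```r
library(dplyr)
library(tibble)
library(janitor)

# Hypothetical helper: clean one raw sheet laid out like the IMF files
clean_imf_sheet <- function(raw) {
  kpi_name <- make_clean_names(as.character(raw[[1]][1]))            # KPI name from row 1, column 1
  header <- make_clean_names(unlist(raw[5, ], use.names = FALSE))    # column names from row 5
  out <- raw[-(1:5), ]                                               # data starts at row 6
  colnames(out) <- header
  out %>%
    mutate(across(everything(), ~ na_if(.x, "..."))) %>%             # "..." means missing
    add_column(kpi = kpi_name, .after = 1)
}

# Toy sheet mimicking what read_excel returns for one file (made-up values)
raw <- tibble(
  a = c("Prices, Consumer Price Index, All items, Index", NA, "Source: IFS", NA,
        "Country", "Albania", "Anguilla"),
  b = c(NA, NA, NA, NA, "Scale", "Units", "Units"),
  c = c(NA, NA, NA, NA, "2013Q1", "108.4", "...")
)

cleaned <- clean_imf_sheet(raw)
print(cleaned)
```

With the real files, something like `bind_rows(lapply(files, function(f) clean_imf_sheet(read_excel(f))))` should then give one combined dataframe, assuming all files share the same layout.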

All excel files are now parsed and appended to the financial_data dataframe. But the dataframe is still not easy to analyze. Every period has its own column. Period is a variable so it should only be one column. Hence, the dataframe must be unpivoted/melted/gathered. This can be done with the [gather](https://tidyr.tidyverse.org/reference/gather.html) function from tidyr. I also use the unique function to see which periods are included.

In [11]:
cols <- c("country", "kpi", "scale", "base_year")
financial_data <- financial_data %>% gather(key = "period", value = "value", -cols)

catn("Unique periods:")
print(unique(financial_data$period))

catn("Cleaned data v2 (all files) and shape:")
sample_n(financial_data, 10)
shape(financial_data)



Unique periods:

  [1] "x2013"    "x2013q1"  "x2013m01" "x2013m02" "x2013m03" "x2013q2" 
  [7] "x2013m04" "x2013m05" "x2013m06" "x2013q3"  "x2013m07" "x2013m08"
 [13] "x2013m09" "x2013q4"  "x2013m10" "x2013m11" "x2013m12" "x2014"   
 [19] "x2014q1"  "x2014m01" "x2014m02" "x2014m03" "x2014q2"  "x2014m04"
 [25] "x2014m05" "x2014m06" "x2014q3"  "x2014m07" "x2014m08" "x2014m09"
 [31] "x2014q4"  "x2014m10" "x2014m11" "x2014m12" "x2015"    "x2015q1" 
 [37] "x2015m01" "x2015m02" "x2015m03" "x2015q2"  "x2015m04" "x2015m05"
 [43] "x2015m06" "x2015q3"  "x2015m07" "x2015m08" "x2015m09" "x2015q4" 
 [49] "x2015m10" "x2015m11" "x2015m12" "x2016"    "x2016q1"  "x2016m01"
 [55] "x2016m02" "x2016m03" "x2016q2"  "x2016m04" "x2016m05" "x2016m06"
 [61] "x2016q3"  "x2016m07" "x2016m08" "x2016m09" "x2016q4"  "x2016m10"
 [67] "x2016m11" "x2016m12" "x2017"    "x2017q1"  "x2017m01" "x2017m02"
 [73] "x2017m03" "x2017q2"  "x2017m04" "x2017m05" "x2017m06" "x2017q3" 
 [79] "x2017m07" "x2017m08" "x2017m09" "x2017

country,kpi,scale,base_year,period,value
<chr>,<chr>,<chr>,<chr>,<chr>,<chr>
Emerging and Developing Asia,prices_consumer_price_index_all_items_index,Units,2010=100,x2018m10,138.163588411221
Poland,economic_activity_industrial_production_index,Units,2010=100,x2014m08,104.9087423952
Ireland,financial_market_prices_equities_end_of_period_index,Units,,x2015m02,207.168902291082
Moldova,unemployment_persons_number_of,Thousands,,x2015m06,
"Timor-Leste, Dem. Rep. of",labor_markets_unemployment_rate_percent,Units,,x2013q2,
"Venezuela, Republica Bolivariana de",labor_force_persons_number_of,Thousands,,x2015q1,14169.2896666667
Algeria,prices_producer_price_index_all_commodities_index,Units,2010=100,x2017m12,
Hungary,prices_consumer_price_index_all_items_index,Units,2010=100,x2014m04,111.879079354174
Austria,prices_consumer_price_index_all_items_index,Units,2010=100,x2018m10,117.320117431166
Canada,economic_activity_industrial_production_index,Units,2010=100,x2016m09,113.827341834934




87720,6
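
The tidyr documentation now recommends pivot_longer over gather. A minimal sketch of the same reshaping with pivot_longer, on a made-up two-row example with the same identifier columns as above, could look like this:

```r
library(dplyr)
library(tidyr)

# Hypothetical wide data: one column per period, as in financial_data
wide <- tibble(
  country   = c("Albania", "Algeria"),
  kpi       = "prices_consumer_price_index_all_items_index",
  scale     = "Units",
  base_year = "2010=100",
  x2013     = c("107.6", "117.5"),
  x2013q1   = c("108.5", "117.8")
)

id_cols <- c("country", "kpi", "scale", "base_year")

# Every column that is not an identifier becomes a (period, value) pair
long <- wide %>%
  pivot_longer(cols = -all_of(id_cols), names_to = "period", values_to = "value")

print(long)
```

The content is the same as with gather, but the row order differs: pivot_longer keeps the rows of each country together, while gather stacks one period column after the other.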

The periods include quarters (e.g. 2013q1), years (e.g. 2015) and months (e.g. 2018m06). There is one problem: every value in the period column starts with an 'x'. The prefix was added by make_clean_names, since the original column names start with a digit, and it is easily removed with [str_replace](https://stringr.tidyverse.org/reference/str_replace.html) from stringr. I also set the value column to numeric and count the number of NA values in the value column.

In [12]:
financial_data <- financial_data %>% mutate(period = str_replace(period, "x", ""))
financial_data$value <- as.numeric(financial_data$value)

catn("Control periods:")
print(unique(financial_data$period))

catn("Number of nulls in the value column:")
sum(is.na(financial_data$value))



Control periods:

  [1] "2013"    "2013q1"  "2013m01" "2013m02" "2013m03" "2013q2"  "2013m04"
  [8] "2013m05" "2013m06" "2013q3"  "2013m07" "2013m08" "2013m09" "2013q4" 
 [15] "2013m10" "2013m11" "2013m12" "2014"    "2014q1"  "2014m01" "2014m02"
 [22] "2014m03" "2014q2"  "2014m04" "2014m05" "2014m06" "2014q3"  "2014m07"
 [29] "2014m08" "2014m09" "2014q4"  "2014m10" "2014m11" "2014m12" "2015"   
 [36] "2015q1"  "2015m01" "2015m02" "2015m03" "2015q2"  "2015m04" "2015m05"
 [43] "2015m06" "2015q3"  "2015m07" "2015m08" "2015m09" "2015q4"  "2015m10"
 [50] "2015m11" "2015m12" "2016"    "2016q1"  "2016m01" "2016m02" "2016m03"
 [57] "2016q2"  "2016m04" "2016m05" "2016m06" "2016q3"  "2016m07" "2016m08"
 [64] "2016m09" "2016q4"  "2016m10" "2016m11" "2016m12" "2017"    "2017q1" 
 [71] "2017m01" "2017m02" "2017m03" "2017q2"  "2017m04" "2017m05" "2017m06"
 [78] "2017q3"  "2017m07" "2017m08" "2017m09" "2017q4"  "2017m10" "2017m11"
 [85] "2017m12" "2018"    "2018q1"  "2018m01" "2018m02" "2018m03" "2
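
To illustrate where the x prefix comes from, the sketch below runs make_clean_names on a few of the original header values and then strips the prefix again. Anchoring the pattern with ^ is my own small variation; it only removes an x at the start of the string.

```r
library(janitor)
library(stringr)

# Header values as they appear in row 5 of the Excel file
original <- c("Base Year", "2013", "2013Q1", "2013M01")

# Names that start with a digit get an "x" prefix so they are valid R names
cleaned <- make_clean_names(original)
print(cleaned)

# Remove only a leading "x" from the period names
periods <- str_replace(cleaned[-1], "^x", "")
print(periods)
```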

There are 32141 rows with missing values. One should always be careful when handling missing values. But in this case, they are just in the way, so I will drop them.

The period column is not optimal; it would be nice if one could easily distinguish the type of time period. Also, it would be nice if year and period (month, quarter) were separate columns. This can be accomplished with two custom functions.

Both functions work on the period column. I then replace the old period column with a new column that specifies the current month/quarter. NA is set if the data is yearly.

In [13]:
columns <- c("country", "kpi", "value")
year <- substr(financial_data$period, start = 1, stop = 4)

# Return the quarter (last character) or month (last two characters), NA for yearly data
extract_time_period_value <- function(x) {
  if (grepl("q", x)) {
    str_sub(x, start = -1)
  } else if (grepl("m", x)) {
    str_sub(x, start = -2)
  } else {
    NA_character_
  }
}
# Classify a period string as "quarter", "month" or "year"
extract_time_period <- function(x) {
  if (grepl("q", x)) {
    "quarter"
  } else if (grepl("m", x)) {
    "month"
  } else {
    "year"
  }
}

period <- map_chr(financial_data$period, extract_time_period_value)
time_period <- map_chr(financial_data$period, extract_time_period)

financial_data <- financial_data %>%
  select(columns) %>% 
  add_column(year, .after=2) %>% 
  add_column(period, .after=3) %>% 
  add_column(time_period, .after=4) %>% 
  drop_na(value)

financial_data$value <- as.numeric(financial_data$value)

catn("Cleaned data v3 (all files):")
sample_n(financial_data,10)



Cleaned data v3 (all files):



country,kpi,year,period,time_period,value
<chr>,<chr>,<chr>,<chr>,<chr>,<dbl>
Netherlands,unemployment_persons_number_of,2014,10.0,month,634.0
Botswana,financial_market_prices_equities_end_of_period_index,2017,8.0,month,139.588863
Canada,labor_force_persons_number_of,2013,3.0,quarter,19303.2
Qatar,financial_market_prices_equities_end_of_period_index,2013,,year,119.557803
Philippines,financial_market_prices_equities_end_of_period_index,2016,4.0,month,145.803979
Japan,labor_markets_unemployment_rate_percent,2017,,year,2.808333
West Bank and Gaza,prices_consumer_price_index_all_items_index,2014,11.0,month,109.905851
Thailand,labor_force_persons_number_of,2018,9.0,month,38391.08
Cabo Verde,prices_consumer_price_index_all_items_index,2013,9.0,month,109.580618
Egypt,prices_consumer_price_index_all_items_index,2015,7.0,month,155.661275
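
As a hedged alternative to mapping the two custom functions over every row, the same three columns can be derived in a vectorised way with case_when. The input vector is a small made-up sample of period strings in the format shown above.

```r
library(dplyr)
library(stringr)

# A few period strings in the same format as financial_data$period
periods <- c("2013", "2013q1", "2013m01", "2018m12")

parsed <- tibble(raw = periods) %>%
  mutate(
    year = str_sub(raw, 1, 4),
    # "q" marks a quarter, "m" a month, anything else is yearly data
    time_period = case_when(
      str_detect(raw, "q") ~ "quarter",
      str_detect(raw, "m") ~ "month",
      TRUE ~ "year"
    ),
    # last character for quarters, last two for months, NA for years
    period = case_when(
      time_period == "quarter" ~ str_sub(raw, -1),
      time_period == "month" ~ str_sub(raw, -2),
      TRUE ~ NA_character_
    )
  )

print(parsed)
```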


Now there are two steps left until the dataframe is tidy. First, the kpi column needs to be spread/pivoted, because it contains multiple KPIs and every KPI is a variable. For this, I will use the [spread](https://tidyr.tidyverse.org/reference/spread.html) function from tidyr.

Also, the dataframe contains three different time periods (year, month, quarter). The time periods need to be split into three separate dataframes.
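
A minimal sketch of these two steps on made-up data, assuming the column layout of financial_data shown above (the KPI names and values here are invented for illustration):

```r
library(dplyr)
library(tidyr)

# Hypothetical long data with two KPIs and two time periods
long <- tibble(
  country     = rep("Japan", 4),
  kpi         = c("cpi", "unemployment_rate", "cpi", "unemployment_rate"),
  year        = rep("2017", 4),
  period      = c(NA, NA, "1", "1"),
  time_period = c("year", "year", "quarter", "quarter"),
  value       = c(101.2, 2.8, 100.9, 2.9)
)

# Step 1: one column per KPI
wide <- long %>% spread(key = kpi, value = value)

# Step 2: one dataframe per time period, collected in a named list
by_frequency <- split(wide, wide$time_period)

print(names(by_frequency))
print(by_frequency$year)
```

pivot_wider is the newer tidyr function for the first step and could be used in place of spread.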

## 3. Merge with external data
It would help the analysis if the dataframe contained more information about the country, for example, region. The csv file read in the cell below contains countries and their regions. Let's look at the data.

In [14]:
region <- read_csv("https://raw.githubusercontent.com/lukes/ISO-3166-Countries-with-Regional-Codes/master/all/all.csv") %>% 
  clean_names()
catn("Regions:")
sample_n(region, 10)

catn("Regions and sub-regions:")
region %>% select("region", "sub_region") %>% 
  distinct() %>%
  arrange(region, sub_region)

Parsed with column specification:
cols(
  name = col_character(),
  `alpha-2` = col_character(),
  `alpha-3` = col_character(),
  `country-code` = col_character(),
  `iso_3166-2` = col_character(),
  region = col_character(),
  `sub-region` = col_character(),
  `intermediate-region` = col_character(),
  `region-code` = col_character(),
  `sub-region-code` = col_character(),
  `intermediate-region-code` = col_character()
)




Regions:



name,alpha_2,alpha_3,country_code,iso_3166_2,region,sub_region,intermediate_region,region_code,sub_region_code,intermediate_region_code
<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>
North Macedonia,MK,MKD,807,ISO 3166-2:MK,Europe,Southern Europe,,150,39,
Haiti,HT,HTI,332,ISO 3166-2:HT,Americas,Latin America and the Caribbean,Caribbean,19,419,29.0
United Arab Emirates,AE,ARE,784,ISO 3166-2:AE,Asia,Western Asia,,142,145,
Sierra Leone,SL,SLE,694,ISO 3166-2:SL,Africa,Sub-Saharan Africa,Western Africa,2,202,11.0
Guadeloupe,GP,GLP,312,ISO 3166-2:GP,Americas,Latin America and the Caribbean,Caribbean,19,419,29.0
Trinidad and Tobago,TT,TTO,780,ISO 3166-2:TT,Americas,Latin America and the Caribbean,Caribbean,19,419,29.0
Tuvalu,TV,TUV,798,ISO 3166-2:TV,Oceania,Polynesia,,9,61,
Germany,DE,DEU,276,ISO 3166-2:DE,Europe,Western Europe,,150,155,
Montenegro,ME,MNE,499,ISO 3166-2:ME,Europe,Southern Europe,,150,39,
Curaçao,CW,CUW,531,ISO 3166-2:CW,Americas,Latin America and the Caribbean,Caribbean,19,419,29.0




Regions and sub-regions:



region,sub_region
<chr>,<chr>
Africa,Northern Africa
Africa,Sub-Saharan Africa
Americas,Latin America and the Caribbean
Americas,Northern America
Asia,Central Asia
Asia,Eastern Asia
Asia,South-eastern Asia
Asia,Southern Asia
Asia,Western Asia
Europe,Eastern Europe


There are many columns, but I find the name, region and sub-region columns most interesting. The name column can be used as a key when joining with the financial_data dataframe. It would have been better if the original data had contained one of the code columns (for example alpha-3), since that would have increased the likelihood of a perfect match. But I have to take what I have got and deal with it.

So, I select these columns in the region dataframe and rename the name column to country. Then I use the [left_join](https://dplyr.tidyverse.org/reference/join.html) function from dplyr to join the financial_data and region dataframes using the country column as the key. I create a new dataframe called financial_data2. This is a temporary dataframe; I will use it to see which countries in financial_data do not have a match in the region dataframe.

In [15]:
region <- region %>% 
  select("name", "region", "sub_region") %>% 
  rename(country = name)
catn("Cleaned region data:")
sample_n(region, 10)

financial_data2 <- left_join(x=financial_data, y=region, by=c("country"))

catn("Countries with no match in region dataframe:")
financial_data2 %>% select(country, region) %>% 
  filter_all(any_vars(is.na(.))) %>% 
  distinct() %>% 
  rownames_to_column("index")



Cleaned region data:



country,region,sub_region
<chr>,<chr>,<chr>
Hungary,Europe,Eastern Europe
Lebanon,Asia,Western Asia
Israel,Asia,Western Asia
Yemen,Asia,Western Asia
Burundi,Africa,Sub-Saharan Africa
Faroe Islands,Europe,Northern Europe
Albania,Europe,Southern Europe
Paraguay,Americas,Latin America and the Caribbean
Papua New Guinea,Oceania,Melanesia
Bosnia and Herzegovina,Europe,Southern Europe




Countries with no match in region dataframe:



index,country,region
<chr>,<chr>,<chr>
1,"Afghanistan, Islamic Republic of",
2,"Armenia, Republic of",
3,"Azerbaijan, Republic of",
4,"Bahamas, The",
5,"Bahrain, Kingdom of",
6,Bolivia,
7,"China, P.R.: Hong Kong",
8,"China, P.R.: Macao",
9,"China, P.R.: Mainland",
10,"Congo, Democratic Republic of",


There are 48 "countries" that do not have a match. But, as can be seen above, the "countries" from index 37 to 47 are not countries. They are regions. These can be removed. This is done in the missing dataframe below.

Also, many of the countries in the financial_data dataframe that have no match in the region dataframe have longer names, often split by a comma, e.g. Bahamas, The. Therefore I create a new column in the missing dataframe called country2 where I split the country name on comma (,).

In [16]:
missing <- financial_data2 %>%
  select(country, region) %>% 
  filter_all(any_vars(is.na(.))) %>% 
  distinct() %>% 
  select(country)
  
missing <- missing %>% bind_cols(missing %>% separate(country, c("country2"), ",")) %>% 
  rownames_to_column("index") 
missing$index <- as.numeric(missing$index)
  
missing <- missing %>% filter(index < 37 | index==48)
catn("Countries with no match v2:")
missing

“Expected 1 pieces. Additional pieces discarded in 21 rows [1, 2, 3, 4, 5, 7, 8, 9, 10, 11, 13, 15, 16, 17, 19, 20, 23, 30, 33, 36, ...].”



Countries with no match v2:



index,country,country2
<dbl>,<chr>,<chr>
1,"Afghanistan, Islamic Republic of",Afghanistan
2,"Armenia, Republic of",Armenia
3,"Azerbaijan, Republic of",Azerbaijan
4,"Bahamas, The",Bahamas
5,"Bahrain, Kingdom of",Bahrain
6,Bolivia,Bolivia
7,"China, P.R.: Hong Kong",China
8,"China, P.R.: Macao",China
9,"China, P.R.: Mainland",China
10,"Congo, Democratic Republic of",Congo
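
The warning above is raised because separate finds more pieces than there are target columns. As a sketch, passing extra = "drop" tells separate to discard the additional pieces silently, and remove = FALSE keeps the original column, which avoids the bind_cols step. The three names are taken from the output above.

```r
library(dplyr)
library(tidyr)

missing_toy <- tibble(
  country = c("Bahamas, The", "Bolivia", "China, P.R.: Hong Kong")
)

# Keep the text before the first comma in a new column, drop the rest without a warning
missing_toy <- missing_toy %>%
  separate(country, into = "country2", sep = ",", remove = FALSE, extra = "drop")

print(missing_toy)
```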


Now I will probably get a match on, for example, Afghanistan, but many of the countries in the missing dataframe did not have a comma. Most likely, this means there is still no exact match for them. To solve this problem I will use the [agrep](https://www.rdocumentation.org/packages/base/versions/3.6.0/topics/agrep) function. This function searches for approximate matches and returns the indices of all elements that match; I keep the first one.

I map the result of agrep to a new column, country3, in the missing dataframe.

In [17]:
matches <- c()
for (i in missing$country2) {
  row <- agrep(i, region$country)[1]
  matches <- c(matches, region$country[row])
}
catn("Matched countries:")
matches

missing <- missing %>% mutate(country3 = matches)
catn("Countries with no match v3:")
missing



Matched countries:





Countries with no match v3:



index,country,country2,country3
<dbl>,<chr>,<chr>,<chr>
1,"Afghanistan, Islamic Republic of",Afghanistan,Afghanistan
2,"Armenia, Republic of",Armenia,Armenia
3,"Azerbaijan, Republic of",Azerbaijan,Azerbaijan
4,"Bahamas, The",Bahamas,Bahamas
5,"Bahrain, Kingdom of",Bahrain,Bahrain
6,Bolivia,Bolivia,Bolivia (Plurinational State of)
7,"China, P.R.: Hong Kong",China,China
8,"China, P.R.: Macao",China,China
9,"China, P.R.: Mainland",China,China
10,"Congo, Democratic Republic of",Congo,Congo


Now it looks pretty good. But it is not perfect. There are some NA's and one error (Iran is not the same as France). I set the errors to NA. Then I create a new dataframe called corrected which contains all corrected countries. I then overwrite the missing dataframe with the countries that do have NA in the country3 column.

In [18]:
missing[16, 4] <- NA #Iran
corrected <- drop_na(missing)
missing <- missing %>%  filter(is.na(country3))

catn("Corrected:")
corrected
catn("Countries with no match v4:")
missing



Corrected:



index,country,country2,country3
<dbl>,<chr>,<chr>,<chr>
1,"Afghanistan, Islamic Republic of",Afghanistan,Afghanistan
2,"Armenia, Republic of",Armenia,Armenia
3,"Azerbaijan, Republic of",Azerbaijan,Azerbaijan
4,"Bahamas, The",Bahamas,Bahamas
5,"Bahrain, Kingdom of",Bahrain,Bahrain
6,Bolivia,Bolivia,Bolivia (Plurinational State of)
7,"China, P.R.: Hong Kong",China,China
8,"China, P.R.: Macao",China,China
9,"China, P.R.: Mainland",China,China
10,"Congo, Democratic Republic of",Congo,Congo




Countries with no match v4:



index,country,country2,country3
<dbl>,<chr>,<chr>,<chr>
12,Czech Republic,Czech Republic,
14,French Territories: New Caledonia,French Territories: New Caledonia,
16,"Iran, Islamic Republic of",Iran,
17,"Kosovo, Republic of",Kosovo,
18,Kyrgyz Republic,Kyrgyz Republic,
25,Slovak Republic,Slovak Republic,
27,St. Lucia,St. Lucia,
34,Vietnam,Vietnam,
35,West Bank and Gaza,West Bank and Gaza,
48,Netherlands Antilles,Netherlands Antilles,


There are only 10 countries left. Below I try to find a match using only the first 4 letters of the country2 column, which I extract with the [str_sub](https://stringr.tidyverse.org/reference/str_sub.html) function from stringr and then pass to agrep. The matches are stored in a character vector named matches. Then I replace the country3 column in the missing dataframe with this vector.

In [19]:
matches <- c()
for (i in missing$country2) {
  i_ <- str_sub(i,1,4)
  row <- agrep(i_, region$country)[1]
  matches <- c(matches, region$country[row])
}

missing <- missing %>% mutate(country3=matches) %>% 
  select(-index) %>% 
  rownames_to_column("index")
missing$index <- as.numeric(missing$index)

catn("Countries with no match v5:")
missing



Countries with no match v5:



index,country,country2,country3
<dbl>,<chr>,<chr>,<chr>
1,Czech Republic,Czech Republic,Czechia
2,French Territories: New Caledonia,French Territories: New Caledonia,France
3,"Iran, Islamic Republic of",Iran,France
4,"Kosovo, Republic of",Kosovo,
5,Kyrgyz Republic,Kyrgyz Republic,Kyrgyzstan
6,Slovak Republic,Slovak Republic,Slovakia
7,St. Lucia,St. Lucia,
8,Vietnam,Vietnam,Viet Nam
9,West Bank and Gaza,West Bank and Gaza,"Palestine, State of"
10,Netherlands Antilles,Netherlands Antilles,Netherlands


In the output above it is visible that most countries find a correct match in the country3 column. However, French Territories: New Caledonia and Iran are not France and Netherlands Antilles is not the Netherlands. Also, Kosovo and St. Lucia are NA. 

I replace the errors with NA and move the correct rows to the corrected dataframe. Then I remove all rows in the missing dataframe except for the NA's.

In [20]:
missing[2, 4] <- NA
missing[3, 4] <- NA
missing[10, 4] <- NA
corrected <- corrected %>% bind_rows(drop_na(missing)) %>% 
  select(-index) %>% 
  rownames_to_column("index")
corrected$index <- as.numeric(corrected$index)
  
missing <- missing %>% filter(is.na(country3))

catn("Corrected v2:")
corrected
catn("Countries with no match v6:")
missing



Corrected v2:



index,country,country2,country3
<dbl>,<chr>,<chr>,<chr>
1,"Afghanistan, Islamic Republic of",Afghanistan,Afghanistan
2,"Armenia, Republic of",Armenia,Armenia
3,"Azerbaijan, Republic of",Azerbaijan,Azerbaijan
4,"Bahamas, The",Bahamas,Bahamas
5,"Bahrain, Kingdom of",Bahrain,Bahrain
6,Bolivia,Bolivia,Bolivia (Plurinational State of)
7,"China, P.R.: Hong Kong",China,China
8,"China, P.R.: Macao",China,China
9,"China, P.R.: Mainland",China,China
10,"Congo, Democratic Republic of",Congo,Congo




Countries with no match v6:



index,country,country2,country3
<dbl>,<chr>,<chr>,<chr>
2,French Territories: New Caledonia,French Territories: New Caledonia,
3,"Iran, Islamic Republic of",Iran,
4,"Kosovo, Republic of",Kosovo,
7,St. Lucia,St. Lucia,
10,Netherlands Antilles,Netherlands Antilles,
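
Overwriting cells by position, as in `missing[2, 4] <- NA`, depends on the current row order. As a hedged alternative, the sketch below marks the wrong matches by country name instead. The dataframe `missing_toy` and the vector `wrong` are made-up illustrations, not objects from this post:

```r
library(dplyr)
library(tidyr)

# Toy version of the missing dataframe (assumed values).
missing_toy <- tibble(
  country  = c("Czech Republic", "Iran, Islamic Republic of", "St. Lucia"),
  country3 = c("Czechia", "France", NA)
)

# Names whose fuzzy match was judged wrong by inspection.
wrong <- c("Iran, Islamic Republic of")

# Blank out the wrong matches by name rather than by row number.
missing_toy <- missing_toy %>%
  mutate(country3 = if_else(country %in% wrong, NA_character_, country3))

resolved   <- drop_na(missing_toy, country3)        # rows to move to corrected
unresolved <- filter(missing_toy, is.na(country3))  # rows still to fix
```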


Only 5 countries left! Now I will use agrep again, but instead of looping over all rows I look up one country at a time with a hand-picked search string.

In [21]:
catn("New Caledonia")
region$country[agrep("Caled", region$country)]
catn("Iran")
region$country[agrep("Iran", region$country)]
catn("Kosovo")
region$country[agrep("Kos", region$country)]
catn("St.Lucia")
region$country[agrep("Luci", region$country)]
catn("Antilles 1")
region$country[agrep("Antill", region$country)]
catn("Antilles 2")
region$country[agrep("Cura", region$country)]



New Caledonia





Iran





Kosovo





St.Lucia





Antilles 1





Antilles 2



There are matches for all countries except Kosovo and Antilles 1. For the Netherlands Antilles I use the Antilles 2 match, Curaçao, since it is the main island. For Kosovo, I use Serbia as a proxy. This is an acceptable alternative since country3 is only used as a key to look up the region and sub-region in the region dataframe.
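
The one-at-a-time lookups can also be collected in a single call. This is a minimal sketch on a toy vector (`region_country` holds assumed values, and the names in `fragments` are only labels), using the `value = TRUE` argument of agrep to return the matching strings instead of their positions:

```r
# Toy stand-in for region$country (assumed values).
region_country <- c("Iran (Islamic Republic of)", "New Caledonia",
                    "Saint Lucia", "Serbia")

# Hand-picked search fragments, named by the country they are meant to find.
fragments <- c("New Caledonia" = "Caled", "St. Lucia" = "Luci", "Kosovo" = "Kos")

# One agrep() call per fragment; the result is a named list of matches.
hits <- lapply(fragments, agrep, x = region_country, value = TRUE)
hits
```

Fragments without a match, such as the Kosovo one in this toy example, show up as an empty character vector.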

In [22]:
missing[1, 4] <- "New Caledonia"
missing[2, 4] <- "Iran (Islamic Republic of)"
missing[3, 4] <- "Serbia"
missing[4, 4] <- "Saint Lucia"
missing[5, 4] <- "Curaçao"

corrected <- corrected %>% bind_rows(drop_na(missing)) %>% 
  select(-index) %>% 
  rownames_to_column("index")
corrected$index <- as.numeric(corrected$index)

missing <- missing %>%  filter(is.na(country3))

catn("Corrected v3:")
corrected
catn("Countries with no match v7:")
missing



Corrected v3:



index,country,country2,country3
<dbl>,<chr>,<chr>,<chr>
1,"Afghanistan, Islamic Republic of",Afghanistan,Afghanistan
2,"Armenia, Republic of",Armenia,Armenia
3,"Azerbaijan, Republic of",Azerbaijan,Azerbaijan
4,"Bahamas, The",Bahamas,Bahamas
5,"Bahrain, Kingdom of",Bahrain,Bahrain
6,Bolivia,Bolivia,Bolivia (Plurinational State of)
7,"China, P.R.: Hong Kong",China,China
8,"China, P.R.: Macao",China,China
9,"China, P.R.: Mainland",China,China
10,"Congo, Democratic Republic of",Congo,Congo




Countries with no match v7:



index,country,country2,country3
<dbl>,<chr>,<chr>,<chr>


To add the corrected countries, with their regions, to the region dataframe, I first join the corrected dataframe with the region dataframe. I join the country3 column in the corrected dataframe on the country column in the region dataframe. Then I append the result to the region dataframe.

In [23]:
corrected <- left_join(x=corrected, y=region, by = c("country3" = "country")) %>% 
  select(country, region, sub_region)

region <- bind_rows(region, corrected)

sample_n(region, 20)

country,region,sub_region
<chr>,<chr>,<chr>
Saudi Arabia,Asia,Western Asia
Philippines,Asia,South-eastern Asia
Samoa,Oceania,Polynesia
Zambia,Africa,Sub-Saharan Africa
Uganda,Africa,Sub-Saharan Africa
United States,Americas,Northern America
Pitcairn,Oceania,Polynesia
Argentina,Americas,Latin America and the Caribbean
Dominica,Americas,Latin America and the Caribbean
Kuwait,Asia,Western Asia
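
To make the join on differently named keys concrete, here is a small sketch on toy data (`region_toy` and `corrected_toy` hold assumed values). In `by = c("country3" = "country")` the left-hand name refers to the first dataframe and the right-hand name to the second:

```r
library(dplyr)

# Toy lookup table and toy corrections (assumed values).
region_toy <- tibble(
  country    = c("Serbia", "Saint Lucia"),
  region     = c("Europe", "Americas"),
  sub_region = c("Southern Europe", "Latin America and the Caribbean")
)
corrected_toy <- tibble(
  country  = c("Kosovo, Republic of", "St. Lucia"),
  country3 = c("Serbia", "Saint Lucia")
)

# country3 (left) is matched against country (right). The left table's own
# country column, which holds the IMF spelling, is the one that is kept.
joined <- left_join(corrected_toy, region_toy, by = c("country3" = "country")) %>%
  select(country, region, sub_region)
joined
```

The result carries the IMF spelling in country together with the region of the proxy name, which is what allows it to be appended to the region dataframe.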


Now the region dataframe is cleaned. I join the financial dataframe with the region dataframe. Then I check that no countries are left without a region. As can be seen below there are none, except for the aggregate regions that were purposely left out.

In [24]:
financial_data2 <- left_join(financial_data, region, by=c("country"))

catn("Countries with no match in region dataframe:")
financial_data2 %>% select(country, region) %>% 
  filter_all(any_vars(is.na(.))) %>% 
  distinct()



Countries with no match in region dataframe:



country,region
<chr>,<chr>
Advanced Economies,
CIS,
Emerging and Developing Asia,
Emerging and Developing Countries,
Emerging and Developing Europe,
Europe,
"Middle East, North Africa, Afghanistan, and Pakistan",
Sub-Saharan Africa,
Western Hemisphere,
World,
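
Another way to run this kind of check is `anti_join` from dplyr, which returns the rows of the first dataframe that have no match in the second. A minimal sketch on toy data (both dataframes hold assumed values):

```r
library(dplyr)

# Toy stand-ins for financial_data and region (assumed values).
financial_toy <- tibble(country = c("Japan", "World", "Japan", "Europe"),
                        value   = c(1, 2, 3, 4))
region_toy <- tibble(country = "Japan", region = "Asia")

# Keep only rows whose country is absent from region_toy, one row per name.
unmatched <- financial_toy %>%
  anti_join(region_toy, by = "country") %>%
  distinct(country)
unmatched
```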


Now I need to spread the financial dataframe to get all the KPIs in separate columns. I only keep the monthly observations. Personally, I like to have a column that contains both year and period. Therefore I create a new column called year_period, based on the year and period columns. To do this I use the [pull](https://dplyr.tidyverse.org/reference/pull.html) function from dplyr in combination with the [unite](https://tidyr.tidyverse.org/reference/unite.html) function from tidyr.

In [25]:
imf <- financial_data2 %>% spread(key = kpi, value = value) %>% 
  filter(time_period == "month") %>% 
  select(-time_period)
year_period <- pull(imf %>% unite(year_period, year, period, sep = "-") %>% 
  select(year_period))


imf <- imf %>% add_column(year_period, .after=3)
sample_n(imf, 10)

country,year,period,year_period,region,sub_region,economic_activity_industrial_production_index,employment_persons_number_of,financial_market_prices_equities_end_of_period_index,labor_force_persons_number_of,labor_markets_unemployment_rate_percent,prices_consumer_price_index_all_items_index,prices_producer_price_index_all_commodities_index,unemployment_persons_number_of
<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
El Salvador,2017,10,2017-10,Americas,Latin America and the Caribbean,109.55846,,,,,110.1523,103.1882,
"Armenia, Republic of",2013,7,2013-07,Asia,Western Asia,,,,,,116.6198,119.6779,57.9
"Iran, Islamic Republic of",2014,5,2014-05,Asia,Southern Asia,,,,,,241.6339,276.9071,
Myanmar,2018,11,2018-11,Asia,South-eastern Asia,,,,,,160.9204,,
Cambodia,2015,6,2015-06,Asia,South-eastern Asia,,,,,,117.4136,,
Fiji,2017,6,2017-06,Oceania,Melanesia,,,187.3843,,,124.1257,,
Japan,2016,8,2016-08,Asia,Eastern Asia,92.69228,64890.0,,67020.0,3.2,103.2893,98.5542,2120.0
West Bank and Gaza,2014,10,2014-10,Asia,Western Asia,,,,,,109.8241,,
Djibouti,2018,9,2018-09,Africa,Sub-Saharan Africa,,,,,,116.1368,,
Montserrat,2015,8,2015-08,Americas,Latin America and the Caribbean,,,,,,110.3699,,
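
For readers on newer package versions, the same reshaping can be sketched with `pivot_wider` (tidyr 1.0.0 or later), which supersedes spread, and with `unite(..., remove = FALSE)`, which keeps the year and period columns so that no separate pull step is needed. The toy dataframe and its short KPI names below are assumptions for illustration:

```r
library(dplyr)
library(tidyr)

# Toy long-format data (assumed values and KPI names).
long_toy <- tibble(
  country = c("Japan", "Japan", "Fiji"),
  year    = c("2016", "2016", "2017"),
  period  = c("08", "08", "06"),
  kpi     = c("cpi", "ppi", "cpi"),
  value   = c(103.3, 98.6, 124.1)
)

wide_toy <- long_toy %>%
  pivot_wider(names_from = kpi, values_from = value) %>%          # one column per KPI
  unite(year_period, year, period, sep = "-", remove = FALSE) %>% # keep year and period
  relocate(year_period, .after = period)                          # needs dplyr >= 1.0.0
wide_toy
```

KPIs that a country does not report for a given month become NA, as in the sample above.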


Now the dataframe is tidy, and all that remains is to save it.

## 4. Save data

In [26]:
write.csv(imf, "../../../R/imf/imf_monthly_r.csv")
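
Note that write.csv writes the row names as an extra, unnamed first column by default. If that column is not wanted, `row.names = FALSE` leaves it out. A small round-trip sketch on a toy dataframe (assumed values), written to a temporary file rather than the path used above:

```r
# Toy dataframe and a temporary file (assumed values).
toy <- data.frame(country = c("Japan", "Fiji"), cpi = c(103.3, 124.1))
path <- tempfile(fileext = ".csv")

# Without row.names = FALSE the file would start with a column of 1, 2, ...
write.csv(toy, path, row.names = FALSE)

# Reading the file back gives the same two columns.
round_trip <- read.csv(path)
names(round_trip)
```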