**SYPA: Fundamental Analysis of Foreign Direct Investment** <br>
*2_ETL (Extract, Transform, Load)* <br>
Harvard SYPA <br>
User: Jake Schneider <br>
Date Created: February 8, 2020 <br>
Date Updated: February 9, 2020

____

**Load Packages and Libraries**

In [1]:
## Install packages
#
#install.packages("plyr")
#install.packages("dplyr")
#install.packages("tidyverse")
#install.packages("stringr")
#install.packages("readxl")
#install.packages("data.table")
#install.packages("reshape2")
#install.packages("hablar")
#install.packages("naniar")
#install.packages("DataCombine")
#install.packages("panelaggregation")
#install.packages("jsonlite", repos = 'https://cran.r-project.org')

In [2]:
# Load relevant libraries

library(plyr)
library(dplyr)
library(tidyverse)
library(stringr)
library(readxl)
library(data.table)
library(reshape2)
library(hablar)
library(naniar)
library(DataCombine)
library(panelaggregation)


Attaching package: ‘dplyr’

The following objects are masked from ‘package:plyr’:

    arrange, count, desc, failwith, id, mutate, rename, summarise,
    summarize

The following objects are masked from ‘package:stats’:

    filter, lag

The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
✔ ggplot2 3.2.1     ✔ purrr   0.3.3
✔ tibble  2.1.3     ✔ stringr 1.4.0
✔ tidyr   1.0.2     ✔ forcats 0.4.0
✔ readr   1.3.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::arrange()   masks plyr::arrange()
✖ purrr::compact()   masks plyr::compact()
✖ dplyr::count()     masks plyr::count()
✖ dplyr::failwith()  masks plyr::failwith()
✖ dplyr::filter()    masks stats::filter()
✖ dplyr::id()        masks plyr::id()
✖ dplyr::lag()       masks stats::lag()
✖ dplyr::mutate()    masks plyr::mutate()
✖ dplyr::rename()    masks plyr::rename()

In [3]:
# Check the working directory (uncomment setwd() below to change it)

getwd()

#setwd('.')

----

**Load WDI Data** <br>
Downloaded from the World Bank website <br>
*Note: Will need to be cleaned and transformed*

In [4]:
# Load WDI Data

wdi <- read_csv("../../2_Inputs/WDI_csv/WDIData.csv")

“Missing column names filled in: 'X65' [65]”
Parsed with column specification:
cols(
  .default = col_double(),
  `Country Name` = col_character(),
  `Country Code` = col_character(),
  `Indicator Name` = col_character(),
  `Indicator Code` = col_character(),
  X65 = col_logical()
)
See spec(...) for full column specifications.


In [5]:
# View WDI Data

wdi[c(1:5),]

Country Name,Country Code,Indicator Name,Indicator Code,1960,1961,1962,1963,1964,1965,⋯,2011,2012,2013,2014,2015,2016,2017,2018,2019,X65
Arab World,ARB,"2005 PPP conversion factor, GDP (LCU per international $)",PA.NUS.PPP.05,,,,,,,⋯,,,,,,,,,,
Arab World,ARB,"2005 PPP conversion factor, private consumption (LCU per international $)",PA.NUS.PRVT.PP.05,,,,,,,⋯,,,,,,,,,,
Arab World,ARB,Access to clean fuels and technologies for cooking (% of population),EG.CFT.ACCS.ZS,,,,,,,⋯,82.78329,83.1203,83.53346,83.8976,84.1716,84.51017,,,,
Arab World,ARB,Access to electricity (% of population),EG.ELC.ACCS.ZS,,,,,,,⋯,86.42827,87.07058,88.17684,87.34274,89.13012,89.67869,90.27369,,,
Arab World,ARB,"Access to electricity, rural (% of rural population)",EG.ELC.ACCS.RU.ZS,,,,,,,⋯,73.9421,75.2441,77.1623,75.53898,78.74115,79.66564,80.74929,,,


In [6]:
# Subset WDI Data into WDI Individual and WDI Aggregates

wdi_individual = wdi[-c(1:67164),]
wdi_aggregate = wdi[c(1:67164),]
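
The split above relies on a hard-coded row position, which will shift if the World Bank adds or removes indicators. A minimal sketch of an alternative, assuming a list of aggregate codes is available: split on code membership instead. The toy table and the `aggregate_codes` vector are hypothetical stand-ins, not part of the original pipeline.

```r
# Toy stand-in for the WDI table (hypothetical rows, not real data)
wdi_toy <- data.frame(
  `Country Name` = c("Arab World", "World", "Afghanistan", "Albania"),
  `Country Code` = c("ARB", "WLD", "AFG", "ALB"),
  value = c(1, 2, 3, 4),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Assumed list of aggregate codes; in practice this would come from a
# country metadata table rather than being typed by hand
aggregate_codes <- c("ARB", "WLD")

# Split on membership rather than on row position
is_aggregate <- wdi_toy$`Country Code` %in% aggregate_codes
wdi_toy_aggregate  <- wdi_toy[is_aggregate, ]
wdi_toy_individual <- wdi_toy[!is_aggregate, ]
```

Because the split is keyed on codes, it would keep working if the number of rows per entity changes between WDI releases.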

In [7]:
# View WDI Individual and WDI Aggregates

wdi_individual[c(1:5),]
#wdi_aggregate

Country Name,Country Code,Indicator Name,Indicator Code,1960,1961,1962,1963,1964,1965,⋯,2011,2012,2013,2014,2015,2016,2017,2018,2019,X65
Afghanistan,AFG,"2005 PPP conversion factor, private consumption (LCU per international $)",PA.NUS.PRVT.PP.05,,,,,,,⋯,,,,,,,,,,
Afghanistan,AFG,Access to clean fuels and technologies for cooking (% of population),EG.CFT.ACCS.ZS,,,,,,,⋯,22.33,24.08,26.17,27.99,30.1,32.44,,,,
Afghanistan,AFG,Access to electricity (% of population),EG.ELC.ACCS.ZS,,,,,,,⋯,43.22202,69.1,70.15348,89.5,71.5,97.7,97.7,,,
Afghanistan,AFG,"Access to electricity, rural (% of rural population)",EG.ELC.ACCS.RU.ZS,,,,,,,⋯,29.57288,60.84916,62.87569,86.50051,64.57335,97.09936,97.09197,,,
Afghanistan,AFG,"Access to electricity, urban (% of urban population)",EG.ELC.ACCS.UR.ZS,,,,,,,⋯,86.56778,95.0,92.73573,98.7,92.5,99.5,99.5,,,


In [8]:
# Rename Variable Columns and Omit Series Code

colnames(wdi_individual)
setnames(wdi_individual, c("Country Name", "Country Code", "Indicator Name"), c("country", "code", "indicator"))

wdi_individual$`Indicator Code` <- NULL
wdi_individual$X65 <- NULL

colnames(wdi_individual)

In [9]:
# Reshape

wdi_individual = melt(setDT(wdi_individual), id.vars = c("country", "code", "indicator"), variable.name = "date")
wdi_df = dcast(wdi_individual, country + code + date ~ indicator, fun.aggregate = mean)
wdi_df[c(1:5),]


country,code,date,"2005 PPP conversion factor, GDP (LCU per international $)","2005 PPP conversion factor, private consumption (LCU per international $)",Access to clean fuels and technologies for cooking (% of population),Access to electricity (% of population),"Access to electricity, rural (% of rural population)","Access to electricity, urban (% of urban population)",Account ownership at a financial institution or with a mobile-money-service provider (% of population ages 15+),⋯,"Women participating in the three decisions (own health care, major household purchases, and visiting family) (% of women age 15-49)",Women who believe a husband is justified in beating his wife (any of five reasons) (%),Women who believe a husband is justified in beating his wife when she argues with him (%),Women who believe a husband is justified in beating his wife when she burns the food (%),Women who believe a husband is justified in beating his wife when she goes out without telling him (%),Women who believe a husband is justified in beating his wife when she neglects the children (%),Women who believe a husband is justified in beating his wife when she refuses sex with him (%),Women who were first married by age 15 (% of women ages 20-24),Women who were first married by age 18 (% of women ages 20-24),Women's share of population ages 15+ living with HIV (%)
Afghanistan,AFG,1960,,,,,,,,⋯,,,,,,,,,,
Afghanistan,AFG,1961,,,,,,,,⋯,,,,,,,,,,
Afghanistan,AFG,1962,,,,,,,,⋯,,,,,,,,,,
Afghanistan,AFG,1963,,,,,,,,⋯,,,,,,,,,,
Afghanistan,AFG,1964,,,,,,,,⋯,,,,,,,,,,
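
The reshape step can be hard to follow on a table with more than a thousand indicators. The sketch below repeats the same melt-then-cast pattern on a four-row toy table, using `melt` and `dcast` from `data.table` only (the notebook also loads `reshape2`). All values and names in the toy table are made up for illustration.

```r
library(data.table)

# Toy table in the WDI layout: one row per country-indicator, one column per year
toy <- data.table(
  country   = c("A", "A", "B", "B"),
  code      = c("AAA", "AAA", "BBB", "BBB"),
  indicator = c("gdp", "pop", "gdp", "pop"),
  `2000`    = c(1, 10, 2, 20),
  `2001`    = c(3, 30, 4, 40)
)

# Melt: year columns become rows, one row per country-indicator-year
toy_long <- melt(toy, id.vars = c("country", "code", "indicator"),
                 variable.name = "date", value.name = "value")

# Cast: indicators become columns, one row per country-year
toy_wide <- dcast(toy_long, country + code + date ~ indicator,
                  value.var = "value")
```

The result has one row per country and year, with `gdp` and `pop` as columns, which is the panel layout the later merges expect.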


In [10]:
# Save Transformed WDI Data

#save(wdi_individual, file = "../../2_Inputs/Culled/wdi_df.RData")
write.csv(wdi_df, file = "../../2_Inputs/Culled/wdi_df.csv")

---

**Load UN Comtrade Data** <br>
Downloaded from the UN website <br>
*Note: Will need to be cleaned and transformed*

In [11]:
# Read in Comtrade data
# 2019 data not available

comtrade_list = as.character(excel_sheets("../../2_Inputs/UNCTAD/UNCTAD_comtrade_data.xlsx"))

str(comtrade_list)

comtrade = vector("list", length(comtrade_list))

 chr [1:23] "comtrade_1996" "comtrade_1997" "comtrade_1998" "comtrade_1999" ...


In [12]:
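# Loop over the sheets and stack them into one long data frame
# Printing is commented out: it was not working due to the notebook's data rate limit
# To raise the limit, launch with:
#jupyter notebook --NotebookApp.iopub_data_rate_limit=10000000000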

comtrade_long = data.frame()

for (i in seq_along(comtrade_list)) {
  comtrade <- as.data.frame(lapply(i, read_excel, path = "../../2_Inputs/UNCTAD/UNCTAD_comtrade_data.xlsx"))
  #print(comtrade)
  comtrade_long <- rbind(comtrade_long, comtrade)
  #print(comtrade_long)
}
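
Growing a data frame with `rbind` inside a loop copies the accumulated rows on every pass. A minimal sketch of a common alternative: collect the per-sheet tables in a list and bind them once. `read_sheet_stub` is a hypothetical stand-in for the Excel reader so the example runs without the workbook; the sheet names follow the pattern printed by `str(comtrade_list)`.

```r
# Hypothetical stand-in for reading one sheet of the workbook
read_sheet_stub <- function(sheet_name) {
  data.frame(
    Year     = as.integer(sub("comtrade_", "", sheet_name)),
    Reporter = c("Albania", "Algeria"),
    stringsAsFactors = FALSE
  )
}

sheet_names <- c("comtrade_1996", "comtrade_1997", "comtrade_1998")

# Read every sheet into a list, then bind all rows in a single call
sheets <- lapply(sheet_names, read_sheet_stub)
comtrade_stub <- do.call(rbind, sheets)
```

With the real workbook, the stub would be replaced by a call that reads the named sheet; the column names it returns may differ from those produced by the loop above, so the later renaming step would need checking.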

In [13]:
# Keep Only the Relevant Columns and Rename Them

colnames(comtrade_long)
comtrade_long <- comtrade_long[c(2, 8, 10:11, 32)]
setnames(comtrade_long, c("Year", "Reporter", "Reporter.ISO", "Trade.Value..US..", "Trade.Flow"), c("date", "country", "code", "value", "indicator"))
colnames(comtrade_long)

comtrade_long

date,indicator,country,code,value
1996,Import,Albania,ALB,938481792
1996,Export,Albania,ALB,211140400
1996,Import,Algeria,DZA,9105595392
1996,Export,Algeria,DZA,11099222016
1996,Import,Andorra,AND,1061358400
1996,Export,Andorra,AND,46242984
1996,Import,Argentina,ARG,23761555456
1996,Export,Argentina,ARG,23809628160
1996,Import,Australia,AUS,61400054057
1996,Export,Australia,AUS,60206784511


In [14]:
# Reshape

comtrade_wide = dcast(comtrade_long, country + code + date ~ indicator, value.var = "value")

# Rename Data Set as Final

comtrade_df = comtrade_wide

In [15]:
# View Comtrade

comtrade_df[c(1:5),]

country,code,date,Export,Import,Re-Export,Re-Import
Afghanistan,AFG,2008,540065594,3019860129,,
Afghanistan,AFG,2009,403441006,3336434781,,
Afghanistan,AFG,2010,388483635,5154249867,,
Afghanistan,AFG,2011,375850935,6390310947,,
Afghanistan,AFG,2012,428902710,6204984101,,


In [16]:
# Save Transformed Comtrade Data

write.csv(comtrade_df, file = "../../2_Inputs/Culled/comtrade_df.csv")

----

**Load Culled Countries List** <br>
Culled from 1_API

In [17]:
# Read country_df csv file

country_df = read.csv(file = "../../2_Inputs/Culled/country_df.csv")
country_df$X = NULL

country_df = country_df %>% 
  rename(
    code = id
    )

country_df[c(1:5),]

code,iso2Code,country,region,adminregion,incomeLevel,lendingType,capitalCity,longitude,latitude
AFG,AF,Afghanistan,South Asia,South Asia,Low income,IDA,Kabul,69.1761,34.5228
ALB,AL,Albania,Europe & Central Asia,Europe & Central Asia (excluding high income),Upper middle income,IBRD,Tirane,19.8172,41.3317
DZA,DZ,Algeria,Middle East & North Africa,Middle East & North Africa (excluding high income),Upper middle income,IBRD,Algiers,3.05097,36.7397
ASM,AS,American Samoa,East Asia & Pacific,East Asia & Pacific (excluding high income),Upper middle income,Not classified,Pago Pago,-170.691,-14.2846
AND,AD,Andorra,Europe & Central Asia,,High income,Not classified,Andorra la Vella,1.5218,42.5075


----

**Load Culled Data Sets: Doing Business, WGI, Debt, CPIA, Bureaucracy, ESG** <br>
Culled from 1_API

In [18]:
# Read bureaucracy_df csv file

bureaucracy_df = read.csv(file = "../../2_Inputs/Culled/bureaucracy_df.csv")
bureaucracy_df[c(1:5),]

country,date,Public.sector.employment.as.a.share.of.formal.employment,Public.sector.employment.as.a.share.of.paid.employment.by.gender..Female.,Public.sector.employment.as.a.share.of.paid.employment.by.gender..Male.,Public.sector.employment.as.a.share.of.paid.employment.by.location..Rural.,Public.sector.employment.as.a.share.of.paid.employment.by.location..Urban.,Public.sector.employment.as.a.share.of.paid.employment,Number.of.employed.individuals,Public.sector.employment.as.a.share.of.total.employment.by.gender..Female.,⋯,Relative.wage.of.Professionals.in.private.sector..using.clerk.as.reference.,Relative.wage.of.Senior.officials.in.private.sector..using.clerk.as.reference.,Relative.wage.of.Technicians.in.private.sector...using.clerk.as.reference.,Female.to.male.wage.ratio.in.the.public.sector..using.median.,Female.to.male.wage.ratio.in.the.public.sector..using.mean.,Relative.wage.of.Professionals.in.public.sector..using.clerk.as.reference.,Relative.wage.of.Senior.officials.in.public.sector..using.clerk.as.reference.,Relative.wage.of.Technicians.in.public.sector..using.clerk.as.reference.,Wage.bill.as.a.percentage.of.GDP,Wage.bill.as.a.percentage.of.Public.Expenditure
Afghanistan,2016,,,,,,,,,⋯,,,,,,,,,,
Afghanistan,2015,,,,,,,,,⋯,,,,,,,,,,
Afghanistan,2014,,,,,,,,,⋯,,,,,,,,,,
Afghanistan,2013,,,,,,,,,⋯,,,,,,,,,,
Afghanistan,2012,,,,,,,,,⋯,,,,,,,,,,


In [19]:
# Read cpia_df csv file

cpia_df = read.csv(file = "../../2_Inputs/Culled/cpia_df.csv")
cpia_df[c(1:5),]

country,date,CPIA.business.regulatory.environment.rating..1.low.to.6.high.,CPIA.debt.policy.rating..1.low.to.6.high.,CPIA.economic.management.cluster.average..1.low.to.6.high.,CPIA.policy.and.institutions.for.environmental.sustainability.rating..1.low.to.6.high.,CPIA.quality.of.budgetary.and.financial.management.rating..1.low.to.6.high.,CPIA.financial.sector.rating..1.low.to.6.high.,CPIA.fiscal.policy.rating..1.low.to.6.high.,CPIA.gender.equality.rating..1.low.to.6.high.,⋯,CPIA.quality.of.public.administration.rating..1.low.to.6.high.,CPIA.equity.of.public.resource.use.rating..1.low.to.6.high.,CPIA.property.rights.and.rule.based.governance.rating..1.low.to.6.high.,CPIA.social.protection.rating..1.low.to.6.high.,CPIA.public.sector.management.and.institutions.cluster.average..1.low.to.6.high.,CPIA.efficiency.of.revenue.mobilization.rating..1.low.to.6.high.,CPIA.policies.for.social.inclusion.equity.cluster.average..1.low.to.6.high.,CPIA.structural.policies.cluster.average..1.low.to.6.high.,CPIA.trade.rating..1.low.to.6.high.,CPIA.transparency..accountability..and.corruption.in.the.public.sector.rating..1.low.to.6.high.
Aruba,2019,,,,,,,,,⋯,,,,,,,,,,
Aruba,2018,,,,,,,,,⋯,,,,,,,,,,
Aruba,2017,,,,,,,,,⋯,,,,,,,,,,
Aruba,2016,,,,,,,,,⋯,,,,,,,,,,
Aruba,2015,,,,,,,,,⋯,,,,,,,,,,


In [20]:
# Read debt_df csv file

debt_df = read.csv(file = "../../2_Inputs/Culled/debt_df.csv")
debt_df = debt_df[debt_df$date <= 2019,] 
debt_df[c(60:65),]

Unnamed: 0,country,date,Imports.of.goods..services.and.primary.income..BoP..current.US..,Current.account.balance..BoP..current.US..,Grants..excluding.technical.cooperation..current.US..,Technical.cooperation.grants..current.US..,Exports.of.goods..services.and.primary.income..BoP..current.US..,Foreign.direct.investment..net.inflows.in.reporting.economy..DRS..current.US..,Primary.income.on.FDI..current.US..,Portfolio.investment..equity..DRS..current.US..,⋯,PRVG..private.creditors..TDS..current.US..,PS..private.creditors..TDS..current.US..,Total.amount.of.debt.rescheduled..current.US..,Undisbursed.external.debt..total..UND..current.US..,Undisbursed.external.debt..official.creditors..UND..current.US..,Undisbursed.external.debt..private.creditors..UND..current.US..,Total.reserves..includes.gold..current.US..,Total.reserves....of.total.external.debt.,Total.reserves.in.months.of.imports,GNI..current.US..
60,Afghanistan,2019,,,,,,,,,⋯,,,,,,,,,,
68,Albania,1960,,,,,,,,,⋯,,,,,,,,,,
69,Albania,1961,,,,,,,,,⋯,,,,,,,,,,
70,Albania,1962,,,,,,,,,⋯,,,,,,,,,,
71,Albania,1963,,,,,,,,,⋯,,,,,,,,,,
72,Albania,1964,,,,,,,,,⋯,,,,,,,,,,


In [21]:
# Read doing_business_df csv file

doing_business_df = read.csv(file = "../../2_Inputs/Culled/doing_business_df.csv")
doing_business_df = doing_business_df[doing_business_df$date <= 2019,] 
doing_business_df[c(60:65),]

Unnamed: 0,country,date,Enforcing.contracts..Alternative.dispute.resolution..0.3...DB16.20.methodology.,Enforcing.contracts..Attorney.fees....of.claim.,Enforcing.contracts..Cost....of.claim.,Enforcing.contracts..Cost....of.claim....Score,Enforcing.contracts..Case.management..0.6...DB16.20.methodology.,Enforcing.contracts..Court.automation..0.4...DB17.20.methodology.,Enforcing.contracts..Court.fees....of.claim.,Enforcing.contracts..Court.structure.and.proceedings..0.5...DB16.methodology.,⋯,Trading.across.borders..Cost.to.import..US..per.container.deflated..DB06.15.methodology.,Trading.across.borders..Cost.to.import..US..per.container..DB06.15.methodology....Score,Trading.across.borders..Cost.to.import..Documentary.compliance..USD...DB16.20.methodology.,Trading.across.borders..Cost.to.import..Documentary.compliance..USD...DB16.20.methodology....Score,Time.to.import..Documentary.compliance..hours...DB16.20.methodology.,Time.to.import..days...DB06.15.methodology.,Trading.across.borders..Time.to.import..Border.compliance..hours...DB16.20.methodology....Score,Trading.across.borders..Time.to.import..Documentary.compliance..hours...DB16.20.methodology....Score,Trading.across.borders..Time.to.import..days...DB06.15.methodology....Score,Rank..Trading.across.borders..1.most.business.friendly.regulations.
60,Afghanistan,2019,2.0,24.0,29.0,67.49156,1.0,0.0,5.0,,⋯,,,900.0,0.0,324.0,,65.94982,0.0,,
62,Albania,1960,,,,,,,,,⋯,,,,,,,,,,
63,Albania,1961,,,,,,,,,⋯,,,,,,,,,,
64,Albania,1962,,,,,,,,,⋯,,,,,,,,,,
65,Albania,1963,,,,,,,,,⋯,,,,,,,,,,
66,Albania,1964,,,,,,,,,⋯,,,,,,,,,,


In [22]:
# Read esg_df csv file

esg_df = read.csv(file = "../../2_Inputs/Culled/esg_df.csv")
esg_df = esg_df[esg_df$date <= 2019,] 
esg_df[c(1:5),]

country,date,Agricultural.land....of.land.area.,Forest.area....of.land.area.,Food.production.index..2004.2006...100.,Control.of.Corruption..Estimate,Access.to.clean.fuels.and.technologies.for.cooking....of.population.,Energy.intensity.level.of.primary.energy..MJ..2011.PPP.GDP.,Access.to.electricity....of.population.,Electricity.production.from.coal.sources....of.total.,⋯,Labor.force.participation.rate..total....of.total.population.ages.15.64...modeled.ILO.estimate.,Ratio.of.female.to.male.labor.force.participation.rate......modeled.ILO.estimate.,Unemployment..total....of.total.labor.force...modeled.ILO.estimate.,Net.migration,Prevalence.of.undernourishment....of.population.,Life.expectancy.at.birth..total..years.,Fertility.rate..total..births.per.woman.,Population.ages.65.and.above....of.total.population.,Unmet.need.for.contraception....of.married.women.ages.15.49.,Voice.and.Accountability..Estimate
Afghanistan,1960,,,,,,,,,⋯,,,,,,32.446,7.45,2.798308,,
Afghanistan,1961,57.74592,,53.21,,,,,,⋯,,,,,,32.962,7.45,2.808131,,
Afghanistan,1962,57.83782,,53.86,,,,,,⋯,,,,-20000.0,,33.471,7.45,2.804113,,
Afghanistan,1963,57.91441,,53.83,,,,,,⋯,,,,,,33.971,7.45,2.786171,,
Afghanistan,1964,58.01091,,58.23,,,,,,⋯,,,,,,34.463,7.45,2.754223,,


In [23]:
# Read wgi_df csv file

wgi_df = read.csv(file = "../../2_Inputs/Culled/wgi_df.csv")
wgi_df[c(1:5),]

country,date,Control.of.Corruption..Estimate,Control.of.Corruption..Number.of.Sources,Control.of.Corruption..Percentile.Rank,Control.of.Corruption..Percentile.Rank..Lower.Bound.of.90..Confidence.Interval,Control.of.Corruption..Percentile.Rank..Upper.Bound.of.90..Confidence.Interval,Control.of.Corruption..Standard.Error,Government.Effectiveness..Estimate,Government.Effectiveness..Number.of.Sources,⋯,Regulatory.Quality..Percentile.Rank,Regulatory.Quality..Percentile.Rank..Lower.Bound.of.90..Confidence.Interval,Regulatory.Quality..Percentile.Rank..Upper.Bound.of.90..Confidence.Interval,Regulatory.Quality..Standard.Error,Voice.and.Accountability..Estimate,Voice.and.Accountability..Number.of.Sources,Voice.and.Accountability..Percentile.Rank,Voice.and.Accountability..Percentile.Rank..Lower.Bound.of.90..Confidence.Interval,Voice.and.Accountability..Percentile.Rank..Upper.Bound.of.90..Confidence.Interval,Voice.and.Accountability..Standard.Error
Aruba,2018,1.252027,2,87.01923,79.32692,92.30769,0.2826696,1.058392,2,⋯,77.88461,62.01923,90.38461,0.3641367,1.305925,1,91.62562,77.3399,99.50739,0.2493472
Aruba,2017,1.291643,2,87.98077,79.80769,92.30769,0.2823816,0.9177416,2,⋯,84.13461,72.11539,94.71154,0.3507003,1.294318,1,92.61084,73.89162,100.0,0.2823006
Aruba,2016,1.285848,2,89.42308,76.92308,92.78846,0.3071863,0.8932294,2,⋯,88.94231,75.48077,98.07692,0.3596922,1.279304,1,92.11823,71.92118,100.0,0.291442
Aruba,2015,1.297111,2,88.46154,76.92308,92.78846,0.2905574,0.8835687,2,⋯,90.38461,76.44231,99.03846,0.3363636,1.273942,1,91.62562,72.4138,100.0,0.2764198
Aruba,2014,1.018918,2,81.73077,69.23077,90.86539,0.3047443,0.8916514,2,⋯,88.46154,73.07692,97.11539,0.3530255,1.276822,1,92.11823,71.92118,100.0,0.2671471


----

**Load and Transform Bloomberg Data**

In [24]:
# Read Bloomberg Excel file (sheet "CSDR")

bloomberg_df = read_excel("../../2_Inputs/Bloomberg/FDI Input Data.xlsx", sheet = "CSDR")
bloomberg_df[c(1:5),]

Country,Country Code,Indicator,1969,1970,1971,1972,1973,1974,1975,⋯,2010,2011,2012,2013,2014,2015,2016,2017,2018,2019
Afghanistan,AFG,sp,,,,,,,,⋯,,,,,,,,,,
Afghanistan,AFG,mdy,,,,,,,,⋯,,,,,,,,,,
Afghanistan,AFG,fitch,,,,,,,,⋯,,,,,,,,,,
Albania,ALB,sp,,,,,,,,⋯,,16.0,16.0,16.0,17.0,17.0,17.0,16.0,16.0,
Albania,ALB,mdy,,,,,,,,⋯,15.0,15.0,15.0,15.0,15.0,15.0,15.0,15.0,15.0,


In [25]:
# Rename variables

setnames(bloomberg_df, c("Country", "Country Code", "Indicator"), c("country", "code", "indicator"))
bloomberg_df[c(1:5),]

country,code,indicator,1969,1970,1971,1972,1973,1974,1975,⋯,2010,2011,2012,2013,2014,2015,2016,2017,2018,2019
Afghanistan,AFG,sp,,,,,,,,⋯,,,,,,,,,,
Afghanistan,AFG,mdy,,,,,,,,⋯,,,,,,,,,,
Afghanistan,AFG,fitch,,,,,,,,⋯,,,,,,,,,,
Albania,ALB,sp,,,,,,,,⋯,,16.0,16.0,16.0,17.0,17.0,17.0,16.0,16.0,
Albania,ALB,mdy,,,,,,,,⋯,15.0,15.0,15.0,15.0,15.0,15.0,15.0,15.0,15.0,


In [26]:
# Reshape: Melt

bloomberg_df = melt(setDT(bloomberg_df), id.vars = c("country", "code", "indicator"), variable.name = "date")
bloomberg_df[c(30200:30220),]

“'measure.vars' [1969, 1970, 1971, 1972, ...] are not all of the same type. By order of hierarchy, the molten data value column will be of type 'character'. All measure variables not of type 'character' will be coerced too. Check DETAILS in ?melt.data.table for more on coercion.”

country,code,indicator,date,value
Chad,TCD,mdy,2015,
Chad,TCD,fitch,2015,
Channel Islands,CHI,sp,2015,
Channel Islands,CHI,mdy,2015,
Channel Islands,CHI,fitch,2015,
Chile,CHL,sp,2015,6.0
Chile,CHL,mdy,2015,4.0
Chile,CHL,fitch,2015,7.0
China,CHN,sp,2015,6.0
China,CHN,mdy,2015,4.0


In [27]:
# Set as Numeric

str(bloomberg_df$value)
bloomberg_df$value = as.numeric(as.character(bloomberg_df$value))
str(bloomberg_df$value)

 chr [1:33354] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA ...
 num [1:33354] NA NA NA NA NA NA NA NA NA NA ...


In [28]:
# Reshape: Cast

bloomberg_df <- dcast(bloomberg_df, country + code + date ~ indicator, mean, value.var = "value")
bloomberg_df[c(1000:1010),]


Unnamed: 0,country,code,date,fitch,mdy,sp
1000,Belize,BLZ,1999,,,
1001,Belize,BLZ,2000,,,
1002,Belize,BLZ,2001,,,14.0
1003,Belize,BLZ,2002,,,15.0
1004,Belize,BLZ,2003,,12.0,16.0
1005,Belize,BLZ,2004,,13.0,16.0
1006,Belize,BLZ,2005,,16.0,18.0
1007,Belize,BLZ,2006,,20.0,21.0
1008,Belize,BLZ,2007,,20.0,
1009,Belize,BLZ,2008,,18.0,17.0


In [29]:
# Create Average csdr

bloomberg_df$csdr_avg <- rowMeans(bloomberg_df[,c("fitch", "mdy", "sp")], na.rm = TRUE)
bloomberg_df[c(1000:1010),]

Unnamed: 0,country,code,date,fitch,mdy,sp,csdr_avg
1000,Belize,BLZ,1999,,,,
1001,Belize,BLZ,2000,,,,
1002,Belize,BLZ,2001,,,14.0,14.0
1003,Belize,BLZ,2002,,,15.0,15.0
1004,Belize,BLZ,2003,,12.0,16.0,14.0
1005,Belize,BLZ,2004,,13.0,16.0,14.5
1006,Belize,BLZ,2005,,16.0,18.0,17.0
1007,Belize,BLZ,2006,,20.0,21.0,20.5
1008,Belize,BLZ,2007,,20.0,,20.0
1009,Belize,BLZ,2008,,18.0,17.0,17.5


In [30]:
# Clean csdr_avg

bloomberg_df$csdr_avg <- as.numeric(bloomberg_df$csdr_avg)
bloomberg_df$csdr_avg[is.nan(bloomberg_df$csdr_avg)] <- NA
bloomberg_df[c(1000:1010),]

Unnamed: 0,country,code,date,fitch,mdy,sp,csdr_avg
1000,Belize,BLZ,1999,,,,
1001,Belize,BLZ,2000,,,,
1002,Belize,BLZ,2001,,,14.0,14.0
1003,Belize,BLZ,2002,,,15.0,15.0
1004,Belize,BLZ,2003,,12.0,16.0,14.0
1005,Belize,BLZ,2004,,13.0,16.0,14.5
1006,Belize,BLZ,2005,,16.0,18.0,17.0
1007,Belize,BLZ,2006,,20.0,21.0,20.5
1008,Belize,BLZ,2007,,20.0,,20.0
1009,Belize,BLZ,2008,,18.0,17.0,17.5
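
The cleaning step is needed because `rowMeans(..., na.rm = TRUE)` returns `NaN`, not `NA`, for a row in which every rating is missing. A small sketch on made-up ratings (the values are illustrative, not Bloomberg data):

```r
# Toy ratings: row 1 has no ratings at all, rows 2 and 3 have two each
ratings <- data.frame(
  fitch = c(NA, NA, 12),
  mdy   = c(NA, 13, 14),
  sp    = c(NA, 16, NA)
)

# Average the available ratings; the all-missing row yields NaN
ratings$csdr_avg <- rowMeans(ratings[, c("fitch", "mdy", "sp")], na.rm = TRUE)
had_nan <- is.nan(ratings$csdr_avg[1])

# Convert NaN to NA so missing averages are coded consistently
ratings$csdr_avg[is.nan(ratings$csdr_avg)] <- NA
```

After the replacement, the first row is a plain `NA`, while the other rows keep the mean of whichever agencies rated the country that year.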


In [31]:
# Save bloomberg_df

write.csv(bloomberg_df, file = "../../2_Inputs/Culled/bloomberg_df.csv")

---

**Merge Data Sets**

In [32]:
# Merge Country_df and WDI

country_wdi_df <- join(country_df, wdi_df, by = c("code"), type = "left", match = "all")

In [33]:
# View country_wdi_df

country_wdi_df[11] = NULL
country_wdi_df[c(1:5, 60:65), c(1:3,8:12, 100:120)]

Unnamed: 0,code,iso2Code,country,capitalCity,longitude,latitude,date,"2005 PPP conversion factor, GDP (LCU per international $)",Antiretroviral therapy coverage (% of people living with HIV),Antiretroviral therapy coverage for PMTCT (% of pregnant women living with HIV),⋯,"Automated teller machines (ATMs) (per 100,000 adults)",Average number of visits or required meetings with tax officials (for affected firms),Average precipitation in depth (mm per year),Average time to clear exports through customs (days),Average transaction cost of sending remittances from a specific country (%),Average transaction cost of sending remittances to a specific country (%),"Average working hours of children, study and work, ages 7-14 (hours per week)","Average working hours of children, study and work, female, ages 7-14 (hours per week)","Average working hours of children, study and work, male, ages 7-14 (hours per week)","Average working hours of children, working only, ages 7-14 (hours per week)"
1,AFG,AF,Afghanistan,Kabul,69.1761,34.5228,1960,,,,⋯,,,,,,,,,,
2,AFG,AF,Afghanistan,Kabul,69.1761,34.5228,1961,,,,⋯,,,,,,,,,,
3,AFG,AF,Afghanistan,Kabul,69.1761,34.5228,1962,,,,⋯,,,327.0,,,,,,,
4,AFG,AF,Afghanistan,Kabul,69.1761,34.5228,1963,,,,⋯,,,,,,,,,,
5,AFG,AF,Afghanistan,Kabul,69.1761,34.5228,1964,,,,⋯,,,,,,,,,,
60,AFG,AF,Afghanistan,Kabul,69.1761,34.5228,2019,,,,⋯,,,,,,,,,,
61,ALB,AL,Albania,Tirane,19.8172,41.3317,1960,,,,⋯,,,,,,,,,,
62,ALB,AL,Albania,Tirane,19.8172,41.3317,1961,,,,⋯,,,,,,,,,,
63,ALB,AL,Albania,Tirane,19.8172,41.3317,1962,,,,⋯,,,1485.0,,,,,,,
64,ALB,AL,Albania,Tirane,19.8172,41.3317,1963,,,,⋯,,,,,,,,,,


In [34]:
# Merge Country_wdi_df and comtrade_df

country_wdi_comtrade_df <- join(country_wdi_df, comtrade_df, by = c("code", "date"), type = "left", match = "all")

In [35]:
# Find the number of columns in country_wdi_comtrade_df

dim(country_wdi_comtrade_df)

In [36]:
# Delete country.1

#names(country_wdi_comtrade_df)[names(country_wdi_comtrade_df[1441])] <- "un_country_name"

country_wdi_comtrade_df[1441] = NULL

In [37]:
# View country_wdi_comtrade_df

country_wdi_comtrade_df[c(60:65),c(1:3, 11, 1440:1444)]

Unnamed: 0,code,iso2Code,country,date,Women's share of population ages 15+ living with HIV (%),Export,Import,Re-Export,Re-Import
60,AFG,AF,Afghanistan,2019,,,,,
61,ALB,AL,Albania,1960,,,,,
62,ALB,AL,Albania,1961,,,,,
63,ALB,AL,Albania,1962,,,,,
64,ALB,AL,Albania,1963,,,,,
65,ALB,AL,Albania,1964,,,,,
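
Deleting the duplicated country column by position (`[1441]`) depends on the exact column count at that point. One possible alternative, sketched here with base `merge` rather than `plyr::join`, is to drop the redundant column by name before merging. The two toy tables and their values are hypothetical.

```r
# Toy left table (country metadata plus one indicator)
left <- data.frame(
  code    = c("AFG", "ALB"),
  date    = c(2008, 2008),
  country = c("Afghanistan", "Albania"),
  gdp     = c(1, 2),
  stringsAsFactors = FALSE
)

# Toy right table that repeats the country name
right <- data.frame(
  country = "Afghanistan",
  code    = "AFG",
  date    = 2008,
  Export  = 100,
  stringsAsFactors = FALSE
)

# Drop the redundant column by name, then left-join on the keys
right_keyed <- right[, setdiff(names(right), "country")]
merged <- merge(left, right_keyed, by = c("code", "date"), all.x = TRUE)
```

This keeps a single `country` column regardless of how many indicator columns sit between the keys and the trade values.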


In [38]:
# Merge country_wdi_comtrade_df and bloomberg_df

country_wdi_comtrade_bloomberg_df <- join(country_wdi_comtrade_df, bloomberg_df, by = c("code", "date"), type = "left", match = "all")

In [39]:
dim(country_wdi_comtrade_bloomberg_df)

In [40]:
country_wdi_comtrade_bloomberg_df[1445] = NULL

In [41]:
country_wdi_comtrade_bloomberg_df[c(1105:1110),]

Unnamed: 0,code,iso2Code,country,region,adminregion,incomeLevel,lendingType,capitalCity,longitude,latitude,⋯,Women who were first married by age 18 (% of women ages 20-24),Women's share of population ages 15+ living with HIV (%),Export,Import,Re-Export,Re-Import,fitch,mdy,sp,csdr_avg
1105,BEL,BE,Belgium,Europe & Central Asia,,High income,Not classified,Brussels,4.36761,50.8371,⋯,,,,,,,,,,
1106,BEL,BE,Belgium,Europe & Central Asia,,High income,Not classified,Brussels,4.36761,50.8371,⋯,,,,,,,,,,
1107,BEL,BE,Belgium,Europe & Central Asia,,High income,Not classified,Brussels,4.36761,50.8371,⋯,,,,,,,,,,
1108,BEL,BE,Belgium,Europe & Central Asia,,High income,Not classified,Brussels,4.36761,50.8371,⋯,,,,,,,,,,
1109,BEL,BE,Belgium,Europe & Central Asia,,High income,Not classified,Brussels,4.36761,50.8371,⋯,,,,,,,,,,
1110,BEL,BE,Belgium,Europe & Central Asia,,High income,Not classified,Brussels,4.36761,50.8371,⋯,,,,,,,,2.0,,2.0


In [42]:
# Merge All of the Datasets

df = Reduce(function(x, y) merge(x, y, by = c("country", "date"), all = TRUE), 
            list(country_wdi_comtrade_bloomberg_df,
                 cpia_df, debt_df, wgi_df,
                 doing_business_df, bureaucracy_df, esg_df))
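
The `Reduce` call folds the list into one table by merging pairwise from left to right. The toy sketch below (made-up tables) shows the pattern and one side effect worth knowing: when two inputs share a non-key column name, `merge` appends `.x` and `.y` suffixes, which is where a name like `Voice.and.Accountability..Estimate.y` comes from when both `wgi_df` and `esg_df` carry that column.

```r
# Three toy tables keyed on country and date; b and c3 share the column "score"
a  <- data.frame(country = c("X", "Y"), date = c(2000, 2000), gdp = c(1, 2))
b  <- data.frame(country = c("X", "Z"), date = c(2000, 2000), score = c(5, 6))
c3 <- data.frame(country = "X", date = 2000, score = 9)

# Full outer merge of all tables, applied pairwise from left to right
merged_toy <- Reduce(
  function(x, y) merge(x, y, by = c("country", "date"), all = TRUE),
  list(a, b, c3)
)
```

Because `all = TRUE` keeps unmatched rows from both sides, every country-year that appears in any input survives, with `NA` where a source has no value.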

In [46]:
# Drop rows 13043-13154 by position, then view df

df = df[-c(13043:13154),]
df[c(60:65),]

Unnamed: 0,country,date,code,iso2Code,region,adminregion,incomeLevel,lendingType,capitalCity,longitude,⋯,Labor.force.participation.rate..total....of.total.population.ages.15.64...modeled.ILO.estimate.,Ratio.of.female.to.male.labor.force.participation.rate......modeled.ILO.estimate.,Unemployment..total....of.total.labor.force...modeled.ILO.estimate.,Net.migration,Prevalence.of.undernourishment....of.population.,Life.expectancy.at.birth..total..years.,Fertility.rate..total..births.per.woman.,Population.ages.65.and.above....of.total.population.,Unmet.need.for.contraception....of.married.women.ages.15.49.,Voice.and.Accountability..Estimate.y
60,Afghanistan,2019,AFG,AF,South Asia,South Asia,Low income,IDA,Kabul,69.1761,⋯,67.772,59.47911,1.519,,,,,,,
61,Albania,1960,ALB,AL,Europe & Central Asia,Europe & Central Asia (excluding high income),Upper middle income,IBRD,Tirane,19.8172,⋯,,,,,,62.283,6.489,5.410827,,
62,Albania,1961,ALB,AL,Europe & Central Asia,Europe & Central Asia (excluding high income),Upper middle income,IBRD,Tirane,19.8172,⋯,,,,,,63.301,6.401,5.390893,,
63,Albania,1962,ALB,AL,Europe & Central Asia,Europe & Central Asia (excluding high income),Upper middle income,IBRD,Tirane,19.8172,⋯,,,,-99.0,,64.19,6.282,5.405407,,
64,Albania,1963,ALB,AL,Europe & Central Asia,Europe & Central Asia (excluding high income),Upper middle income,IBRD,Tirane,19.8172,⋯,,,,,,64.914,6.133,5.43502,,
65,Albania,1964,ALB,AL,Europe & Central Asia,Europe & Central Asia (excluding high income),Upper middle income,IBRD,Tirane,19.8172,⋯,,,,,,65.463,5.96,5.456235,,


----

**Save Data**

In [47]:
write.csv(df, file = "../../2_Inputs/Final/analysis_df.csv")