**SYPA: Fundamental Analysis of Foreign Direct Investment** <br>
*2_ETL (Extract, Transform, Load)* <br>
Harvard SYPA <br>
User: Jake Schneider <br>
Date Created: February 8, 2020 <br>
Date Updated: February 9, 2020

____

**Load Packages and Libraries**

In [1]:
## Install packages
#
#install.packages("plyr")
#install.packages("dplyr")
#install.packages("tidyverse")
#install.packages("stringr")
#install.packages("readxl")
#install.packages("data.table")
#install.packages("reshape2")
#install.packages("hablar")
#install.packages("naniar")
#install.packages("DataCombine")
#install.packages("panelaggregation")
#install.packages("jsonlite", repos = 'https://cran.r-project.org')
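
The commented `install.packages()` calls above have to be uncommented by hand on a fresh machine. A minimal sketch of an alternative, assuming the same package list as the `library()` calls in the next cell, checks which packages are missing and would install only those. The names `required_pkgs` and `missing_pkgs` are illustrative and not part of the original notebook; the install call is left commented so nothing is installed by accident.

```r
# Packages this notebook loads (taken from the library() calls in the next cell)
required_pkgs <- c("plyr", "dplyr", "tidyverse", "stringr", "readxl",
                   "data.table", "reshape2", "hablar", "naniar",
                   "DataCombine", "panelaggregation")

# Names of packages not yet installed in the current library paths
missing_pkgs <- setdiff(required_pkgs, rownames(installed.packages()))

# Uncomment to install only what is missing
# if (length(missing_pkgs) > 0) install.packages(missing_pkgs)

missing_pkgs
```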

In [2]:
# Load relevant libraries

library(plyr)
library(dplyr)
library(tidyverse)
library(stringr)
library(readxl)
library(data.table)
library(reshape2)
library(hablar)
library(naniar)
library(DataCombine)
library(panelaggregation)


Attaching package: ‘dplyr’

The following objects are masked from ‘package:plyr’:

    arrange, count, desc, failwith, id, mutate, rename, summarise,
    summarize

The following objects are masked from ‘package:stats’:

    filter, lag

The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
✔ ggplot2 3.2.1     ✔ purrr   0.3.3
✔ tibble  2.1.3     ✔ stringr 1.4.0
✔ tidyr   1.0.2     ✔ forcats 0.4.0
✔ readr   1.3.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::arrange()   masks plyr::arrange()
✖ purrr::compact()   masks plyr::compact()
✖ dplyr::count()     masks plyr::count()
✖ dplyr::failwith()  masks plyr::failwith()
✖ dplyr::filter()    masks stats::filter()
✖ dplyr::id()        masks plyr::id()
✖ dplyr::lag()       masks stats::lag()
✖ dplyr::mutate()    masks plyr::mutate()
✖ dplyr::rename()    masks plyr::rename()

In [3]:
# Setting A Working Directory 

getwd()

#setwd('.')

----

**Load WDI Data** <br>
Downloaded from the World Bank website <br>
*Note: The data need to be cleaned and transformed*

In [17]:
# Load WDI Data

wdi <- read_csv("../../2_Inputs/WDI_csv/WDIData.csv")

“Missing column names filled in: 'X65' [65]”
Parsed with column specification:
cols(
  .default = col_double(),
  `Country Name` = col_character(),
  `Country Code` = col_character(),
  `Indicator Name` = col_character(),
  `Indicator Code` = col_character(),
  X65 = col_logical()
)
See spec(...) for full column specifications.
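
The `X65` column is parsed as logical, which is consistent with a column that holds no values (often the result of a trailing delimiter at the end of each line). A minimal sketch for spotting and dropping such all-missing columns, using a toy data frame rather than the real WDI file; `toy_wdi`, `all_missing`, and `toy_clean` are hypothetical names.

```r
# Toy stand-in for the WDI table: X65 contains no values at all
toy_wdi <- data.frame(
  `Country Name` = c("Arab World", "Afghanistan"),
  `1960` = c(NA, 1.5),
  X65 = c(NA, NA),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# TRUE for every column whose values are all missing
all_missing <- vapply(toy_wdi, function(col) all(is.na(col)), logical(1))

# Keep only columns that contain at least one value
toy_clean <- toy_wdi[, !all_missing, drop = FALSE]
names(toy_clean)
```

On the real data this check would also show whether any year columns are entirely empty before they are reshaped.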


In [27]:
# View WDI Data

wdi[c(1:5),]

country,code,indicator,Indicator Code,1960,1961,1962,1963,1964,1965,⋯,2011,2012,2013,2014,2015,2016,2017,2018,2019,X65
Arab World,ARB,"2005 PPP conversion factor, GDP (LCU per international $)",PA.NUS.PPP.05,,,,,,,⋯,,,,,,,,,,
Arab World,ARB,"2005 PPP conversion factor, private consumption (LCU per international $)",PA.NUS.PRVT.PP.05,,,,,,,⋯,,,,,,,,,,
Arab World,ARB,Access to clean fuels and technologies for cooking (% of population),EG.CFT.ACCS.ZS,,,,,,,⋯,82.78329,83.1203,83.53346,83.8976,84.1716,84.51017,,,,
Arab World,ARB,Access to electricity (% of population),EG.ELC.ACCS.ZS,,,,,,,⋯,86.42827,87.07058,88.17684,87.34274,89.13012,89.67869,90.27369,,,
Arab World,ARB,"Access to electricity, rural (% of rural population)",EG.ELC.ACCS.RU.ZS,,,,,,,⋯,73.9421,75.2441,77.1623,75.53898,78.74115,79.66564,80.74929,,,


In [19]:
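# Subset WDI Data into WDI Individual and WDI Aggregates
# Rows 1-67164 are treated as the aggregate block (entries such as Arab World).
# Caution: the individual subset shown below starts at Afghanistan's second
# indicator, so this cutoff may be one row too high (67163 may be intended).

wdi_individual = wdi[-c(1:67164),]
wdi_aggregate = wdi[c(1:67164),]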

In [21]:
# View WDI Individual and WDI Aggregates

wdi_individual[c(1:5),]
#wdi_aggregate

Country Name,Country Code,Indicator Name,Indicator Code,1960,1961,1962,1963,1964,1965,⋯,2011,2012,2013,2014,2015,2016,2017,2018,2019,X65
Afghanistan,AFG,"2005 PPP conversion factor, private consumption (LCU per international $)",PA.NUS.PRVT.PP.05,,,,,,,⋯,,,,,,,,,,
Afghanistan,AFG,Access to clean fuels and technologies for cooking (% of population),EG.CFT.ACCS.ZS,,,,,,,⋯,2.233000e+01,2.408000e+01,2.617000e+01,2.799000e+01,3.010000e+01,3.244000e+01,,,,
Afghanistan,AFG,Access to electricity (% of population),EG.ELC.ACCS.ZS,,,,,,,⋯,4.322202e+01,6.910000e+01,7.015348e+01,8.950000e+01,7.150000e+01,9.770000e+01,9.770000e+01,,,
Afghanistan,AFG,"Access to electricity, rural (% of rural population)",EG.ELC.ACCS.RU.ZS,,,,,,,⋯,2.957288e+01,6.084916e+01,6.287569e+01,8.650051e+01,6.457335e+01,9.709936e+01,9.709197e+01,,,
Afghanistan,AFG,"Access to electricity, urban (% of urban population)",EG.ELC.ACCS.UR.ZS,,,,,,,⋯,8.656778e+01,9.500000e+01,9.273573e+01,9.870000e+01,9.250000e+01,9.950000e+01,9.950000e+01,,,
Afghanistan,AFG,Account ownership at a financial institution or with a mobile-money-service provider (% of population ages 15+),FX.OWN.TOTL.ZS,,,,,,,⋯,9.005013e+00,,,9.961000e+00,,,1.489331e+01,,,
Afghanistan,AFG,"Account ownership at a financial institution or with a mobile-money-service provider, female (% of population ages 15+)",FX.OWN.TOTL.FE.ZS,,,,,,,⋯,2.616230e+00,,,3.812426e+00,,,7.160685e+00,,,
Afghanistan,AFG,"Account ownership at a financial institution or with a mobile-money-service provider, male (% of population ages 15+)",FX.OWN.TOTL.MA.ZS,,,,,,,⋯,1.541546e+01,,,1.578467e+01,,,2.253650e+01,,,
Afghanistan,AFG,"Account ownership at a financial institution or with a mobile-money-service provider, older adults (% of population ages 25+)",FX.OWN.TOTL.OL.ZS,,,,,,,⋯,1.053888e+01,,,1.150844e+01,,,1.801650e+01,,,
Afghanistan,AFG,"Account ownership at a financial institution or with a mobile-money-service provider, poorest 40% (% of population ages 15+)",FX.OWN.TOTL.40.ZS,,,,,,,⋯,1.125380e+00,,,6.070460e+00,,,1.380250e+01,,,
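
The split into individual and aggregate rows depends on a hard-coded row number, which is easy to get wrong by one and breaks when the World Bank adds indicators. A minimal sketch of an index-free split, assuming the file lists all aggregates first and that Afghanistan (`AFG`) is the first individual country, as the outputs here suggest. The toy data and all `toy_*` names are hypothetical.

```r
# Toy stand-in: two aggregates followed by two countries, two indicators each
toy_wdi <- data.frame(
  code = c("ARB", "ARB", "WLD", "WLD", "AFG", "AFG", "ALB", "ALB"),
  indicator = rep(c("indicator_a", "indicator_b"), times = 4),
  stringsAsFactors = FALSE
)

# Row of the first individual country (assumed to be Afghanistan)
first_country_row <- match("AFG", toy_wdi$code)

# Everything before that row is an aggregate; everything from it on is a country
toy_aggregate  <- toy_wdi[seq_len(first_country_row - 1), ]
toy_individual <- toy_wdi[first_country_row:nrow(toy_wdi), ]

# Each country should keep all of its indicators
table(toy_individual$code)
```

In the notebook itself the lookup column would be `Country Code` (or `code` after renaming).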


In [22]:
# Rename identifier columns; drop the indicator code and the empty X65 column

colnames(wdi_individual)
setnames(wdi_individual, c("Country Name", "Country Code", "Indicator Name"), c("country", "code", "indicator"))

wdi_individual$`Indicator Code` <- NULL
wdi_individual$X65 <- NULL

colnames(wdi_individual)

In [23]:
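# Reshape: melt the year columns to long format, then cast so that each
# indicator becomes a column (one row per country, code, and date)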

wdi_individual <- melt(setDT(wdi_individual), id.vars = c("country", "code", "indicator"), variable.name = "date")
wdi_individual <- dcast(wdi_individual, country + code + date ~ indicator, fun.aggregate = mean)
wdi_individual[c(1:5),]


country,code,date,"2005 PPP conversion factor, GDP (LCU per international $)","2005 PPP conversion factor, private consumption (LCU per international $)",Access to clean fuels and technologies for cooking (% of population),Access to electricity (% of population),"Access to electricity, rural (% of rural population)","Access to electricity, urban (% of urban population)",Account ownership at a financial institution or with a mobile-money-service provider (% of population ages 15+),⋯,"Women participating in the three decisions (own health care, major household purchases, and visiting family) (% of women age 15-49)",Women who believe a husband is justified in beating his wife (any of five reasons) (%),Women who believe a husband is justified in beating his wife when she argues with him (%),Women who believe a husband is justified in beating his wife when she burns the food (%),Women who believe a husband is justified in beating his wife when she goes out without telling him (%),Women who believe a husband is justified in beating his wife when she neglects the children (%),Women who believe a husband is justified in beating his wife when she refuses sex with him (%),Women who were first married by age 15 (% of women ages 20-24),Women who were first married by age 18 (% of women ages 20-24),Women's share of population ages 15+ living with HIV (%)
Afghanistan,AFG,1960,,,,,,,,⋯,,,,,,,,,,
Afghanistan,AFG,1961,,,,,,,,⋯,,,,,,,,,,
Afghanistan,AFG,1962,,,,,,,,⋯,,,,,,,,,,
Afghanistan,AFG,1963,,,,,,,,⋯,,,,,,,,,,
Afghanistan,AFG,1964,,,,,,,,⋯,,,,,,,,,,
Afghanistan,AFG,1965,,,,,,,,⋯,,,,,,,,,,
Afghanistan,AFG,1966,,,,,,,,⋯,,,,,,,,,,
Afghanistan,AFG,1967,,,,,,,,⋯,,,,,,,,,,
Afghanistan,AFG,1968,,,,,,,,⋯,,,,,,,,,,
Afghanistan,AFG,1969,,,,,,,,⋯,,,,,,,,,,
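
After `melt()`, the `date` column is typically a factor built from the year column names (the default for the variable column in both reshape2 and data.table). Calling `as.integer()` directly on a factor returns level codes rather than years, so a conversion through character is needed before merging on year or doing arithmetic. A minimal illustration with a toy factor; `date_factor` and `year` are illustrative names.

```r
# Toy factor standing in for the melted `date` column
date_factor <- factor(c("1960", "1961", "2019"))

# Direct conversion returns the internal level codes, not the years
as.integer(date_factor)                        # 1 2 3

# Converting through character recovers the actual years
year <- as.integer(as.character(date_factor))
year                                           # 1960 1961 2019
```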


In [25]:
# Save Transformed WDI Data

#save(wdi_individual, file = "../../2_Inputs/Culled/wdi_df.RData")
write.csv(wdi_individual, file = "../../2_Inputs/Culled/wdi_df.csv")
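
By default `write.csv()` also writes the row names as an unnamed first column, which comes back as an extra column (named `X` by `read.csv()`) when the file is reloaded. A minimal sketch of a round trip with `row.names = FALSE`, using a toy data frame and a temporary file rather than the notebook's paths; `toy_df`, `tmp_file`, and `round_trip` are hypothetical names.

```r
# Toy stand-in for the transformed WDI table
toy_df <- data.frame(country = c("Afghanistan", "Albania"),
                     code = c("AFG", "ALB"),
                     stringsAsFactors = FALSE)

# Write without the row-name column, then read the file back
tmp_file <- tempfile(fileext = ".csv")
write.csv(toy_df, file = tmp_file, row.names = FALSE)
round_trip <- read.csv(tmp_file, stringsAsFactors = FALSE)

# Same columns as toy_df, with no extra index column
names(round_trip)
unlink(tmp_file)
```

Whether to drop the row names depends on what the downstream notebooks expect from `wdi_df.csv`.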

---

**Load UN Comtrade Data** <br>
Downloaded from the UN website <br>
*Note: Will need to be cleaned and transformed*

In [30]:
# Read in Comtrade data

comtrade_list = as.character(excel_sheets("../../2_Inputs/UNCTAD/UNCTAD_comtrade_data.xlsx"))

str(comtrade_list)

comtrade = vector("list", length(comtrade_list))

 chr [1:23] "comtrade_1996" "comtrade_1997" "comtrade_1998" "comtrade_1999" ...


In [33]:
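# Read every Comtrade sheet and stack the sheets into one long data frame
# Not working right now due to Jupyter's output data rate limit, which
# print(comtrade) exceeds. One workaround is to launch the notebook from a
# shell with a higher limit:
#jupyter notebook --NotebookApp.iopub_data_rate_limit=10000000000

comtrade_long = data.frame()

for (i in seq_along(comtrade_list)) {
  # i is passed to read_excel() as the sheet index
  comtrade <- as.data.frame(lapply(i, read_excel, path = "../../2_Inputs/UNCTAD/UNCTAD_comtrade_data.xlsx"))
  print(comtrade)
  comtrade_long <- rbind(comtrade_long, comtrade)
  #print(comtrade_long)
}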

IOPub data rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_data_rate_limit`.

Current values:
NotebookApp.iopub_data_rate_limit=1000000.0 (bytes/sec)
NotebookApp.rate_limit_window=3.0 (secs)
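
The rate-limit message comes from printing each full sheet inside the loop, so one way around it is to stack the sheets without printing and inspect only a small summary afterwards. A minimal sketch, assuming all sheets share the same columns. `stack_sheets()` and `toy_reader()` are hypothetical helpers, not part of the notebook; in the notebook the reader would be something like `function(s) read_excel(path, sheet = s)`.

```r
# Hypothetical helper: read each sheet with `reader` and stack the results
stack_sheets <- function(sheet_names, reader) {
  frames <- lapply(sheet_names, function(s) {
    df <- as.data.frame(reader(s))
    df$sheet <- s  # record which sheet each row came from
    df
  })
  do.call(rbind, frames)
}

# Toy reader standing in for a call to read_excel(); returns the same
# two-row table for every sheet name
toy_reader <- function(s) {
  data.frame(reporter = c("A", "B"), value = c(1, 2), stringsAsFactors = FALSE)
}

toy_sheets <- c("comtrade_1996", "comtrade_1997")
comtrade_demo <- stack_sheets(toy_sheets, toy_reader)

# Inspect a summary instead of printing the full data
dim(comtrade_demo)
```

Collecting the frames in a list and binding once also avoids growing `comtrade_long` with `rbind()` on every iteration.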



----

**Load Culled Countries List** <br>
Culled from 1_API

----

**Load Culled Data Sets: Doing Business, WGI, Debt, CPIA, Bureaucracy, ESG** <br>
Culled from 1_API

---

**Merge Data Sets**

----

**Save Data**