In [None]:
options(jupyter.rich_display = F)

# Data import, wrangling and analysis

**by Serhat Çevikel**

Today we will import an external data set:

In [None]:
weo <- read.csv("../file/weo_clean.csv")

In [None]:
str(weo)

We have 162 countries and 12 variables:

In [None]:
names(weo)

Keeping the old names, we will change the variable names to more descriptive ones:

In [None]:
old_names <- names(weo)

In [None]:
new_names <- c("Country", "GDP_growth", "GDP", "GDP_per_capita",
               "Output_gap", "Investment", "Saving", "Inflation",
               "Unemployment", "Primary_balance", "Net_debt", "Current_account")

In [None]:
names(weo) <- new_names

In [None]:
weo

## Add a new variable

Add a new variable Saving - Investment:

In [None]:
weo$Saving_gap <- with(weo, Saving - Investment)

## Discretize variables

Now we will create three categories of income level: low, medium and high

In [None]:
weo$Income_level <- with(weo, cut(GDP_per_capita,
              breaks = c(0, 5000, 20000,
                         max(GDP_per_capita, na.rm = T)),
              labels = c("low", "medium", "high")))
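
As a minimal sketch of how `cut` assigns the bins, the toy vector below (made-up values, not taken from the WEO file) is cut at the same break points. Intervals are closed on the right by default, so a value of exactly 5000 lands in "low". Using `Inf` as the last break is an assumption made here for illustration, as an alternative to computing the maximum:

```r
# Toy per-capita incomes (hypothetical values, not taken from the WEO file)
gdp_pc <- c(800, 5000, 5001, 20000, 45000, NA)

# Intervals are (0, 5000], (5000, 20000], (20000, Inf] because right = TRUE by default
income_level <- cut(gdp_pc,
                    breaks = c(0, 5000, 20000, Inf),
                    labels = c("low", "medium", "high"))

# Show each value next to its assigned level; NA stays NA
data.frame(gdp_pc, income_level)

# Count observations per level, including the missing one
table(income_level, useNA = "ifany")
```

Note that a value of exactly 0 would fall outside the first interval and become `NA` unless `include.lowest = TRUE` is set.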

See the distribution across levels:

In [None]:
barplot(table(weo$Income_level))

## Get summaries

In [None]:
weo

Aggregate variables for each income level:

In [None]:
t(with(weo, aggregate(weo[,2:13], by = list(Income_level), FUN = median, na.rm = T)))

Please interpret this table ...

Create scatterplots of selected variables:

In [None]:
palette(c("green", "red", "blue"))

In [None]:
plot(weo[,c("GDP_growth", "Primary_balance", "Net_debt", "Current_account", "Saving_gap")],
    col = as.numeric(weo$Income_level))

In [None]:
plot(weo[,c("Current_account", "Saving_gap")],
    col = as.numeric(weo$Income_level))

We see Saving Gap is nearly identical to Current Account Balance for many countries
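
This near-equality is what national accounting predicts. Writing $GNDI$ for gross national disposable income, $C$ and $G$ for private and government consumption, $I$ for investment, $S$ for gross national saving and $CA$ for the current account balance (textbook symbols, not column names in the data):

$$
\begin{aligned}
GNDI &= C + G + I + CA \\
S &= GNDI - C - G \\
S - I &= CA
\end{aligned}
$$

In reported data the two sides can differ because of statistical discrepancies and differences in sources, which may explain why the match holds for many countries rather than all of them.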

We may look at correlations:

In [None]:
cor(weo[,c("GDP_growth", "Primary_balance", "Net_debt", "Current_account", "Saving_gap")],
    use = "pairwise.complete.obs")

We can check the scatter between primary balance and current account:

In [None]:
plot(weo[,c("Primary_balance", "Current_account")],
    col = as.numeric(weo$Income_level))

Though there are outliers, for the majority of low income countries both the current account balance and the primary balance are lower than in the other income groups

# WRANGLING AN ECONOMIC DATA SET: IMF WORLD ECONOMIC OUTLOOK, CONTINUED

We continue to wrangle and analyze the 2016 data from the IMF's World Economic Outlook dataset

First, please download the following two data files:

[weo_2016_wide_2.csv](~/file/weo_2016_wide_2.csv)

[weo_description.csv](~/file/weo_description.csv)

And read the data into R as such:

In [None]:
weo_data <- read.csv("~/file/weo_2016_wide_2.csv")
weo_desc <- read.csv("~/file/weo_description.csv")

Let's take a quick snapshot of the data:

In [None]:
str(weo_data)

In [None]:
str(weo_desc)

In [None]:
weo_desc

There are 45 numeric variables for 194 countries (some of the data might be missing). We will be interested in only a few of those series

## ADDING SAVINGS GAP

- NID_NGDP is "Total investment"
- NGSD_NGDP is "Gross national savings"

Now let's add a new variable "savings_gap":

In [None]:
weo_data$savings_gap <- with(weo_data, NGSD_NGDP - NID_NGDP)

## DISCRETIZE INCOME VARIABLE

PPPPC is gross domestic product per capita at current prices, based on purchasing power parity, in international dollars

Now we will create three categories of income level: low, medium and high

In [None]:
weo_data$income_level <- with(weo_data, cut(PPPPC,
              breaks = c(0, 5000, 20000,
                         max(PPPPC, na.rm = T)),
              labels = c("low", "medium", "high")))

Let's see in a single plot:
- the distribution across levels
- and the dispersion of income within income levels
- a scatterplot of income against savings gap

In [None]:
options(repr.plot.width=15, repr.plot.height=15)
par(mfrow = c(2,2))
barplot(table(weo_data$income_level))
with(weo_data, boxplot(PPPPC ~ income_level))
with(weo_data, plot(savings_gap, PPPPC, col = income_level))

Note that the high income level has more outliers

## SUMMARIZE DATA

Let's get the median of all variables across income categories

In [None]:
weo_data_subset1 <- weo_data[,setdiff(names(weo_data),
                                             c("WEO.Country.Code",
                                               "ISO",
                                               "Country",
                                              "income_level"))]

weo_sum <- aggregate(weo_data_subset1,
          by = weo_data["income_level"],
          FUN = median,
          na.rm = T)

In [None]:
weo_sum
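
To see what `aggregate` does with missing values, here is a small sketch on a made-up data frame (the column names `growth`, `debt` and `level` are hypothetical). Extra arguments such as `na.rm = TRUE` are handed on to `FUN`, so each median is computed over the non-missing values of its group:

```r
# Toy data: two numeric columns and a grouping factor (made-up values)
toy <- data.frame(
  growth = c(1.0, 3.0, NA, 6.0, 2.0),
  debt   = c(40, 60, 80, NA, 20),
  level  = factor(c("low", "low", "high", "high", "high"))
)

# Median of every numeric column within each level;
# na.rm = TRUE is passed on to median() so missing values are skipped
toy_sum <- aggregate(toy[c("growth", "debt")],
                     by = toy["level"],
                     FUN = median,
                     na.rm = TRUE)
toy_sum
```

Passing the grouping column as `toy["level"]` (a one-column data frame) keeps its name in the result, which is why the WEO summary has a column called `income_level`.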

Now let's reshape this data frame two times and merge with descriptions so that it is in a more interpretable format

First melt it:

In [None]:
cols <- names(weo_sum)[-1]

weo_long <- reshape(weo_sum,
                    idvar = "income_level",
                    varying = cols,
                    times = cols,
                    timevar = "variable",
                    v.names = "value",
                    direction = "long")

rownames(weo_long) <- NULL

In [None]:
weo_long

And then cast it:

In [None]:
weo_wide <- reshape(weo_long,
                      idvar = c("variable"),
                      v.names = "value",
                      timevar = "income_level",
                      direction = "wide")

weo_wide
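
The melt-then-cast round trip is easier to follow on a tiny table. The sketch below uses made-up numbers and hypothetical column names (`level`, `growth`, `debt`); it mirrors the two `reshape` calls above and effectively transposes the summary so that variables become rows and groups become columns:

```r
# Toy summary table: one row per group, one column per variable (made-up numbers)
toy_sum <- data.frame(level  = c("low", "high"),
                      growth = c(4.1, 1.9),
                      debt   = c(35, 60))

cols <- names(toy_sum)[-1]

# Wide -> long: stack the variable columns into a single "value" column
toy_long <- reshape(toy_sum,
                    idvar     = "level",
                    varying   = cols,
                    times     = cols,
                    timevar   = "variable",
                    v.names   = "value",
                    direction = "long")
rownames(toy_long) <- NULL

# Long -> wide: one row per variable, one column per group
toy_wide <- reshape(toy_long,
                    idvar     = "variable",
                    timevar   = "level",
                    v.names   = "value",
                    direction = "wide")
toy_wide
```

The new column names are built from `v.names` and the values of `timevar`, here `value.low` and `value.high`.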

And finally merge it:

In [None]:
weo_sum_merged <- merge(
    weo_desc[c("WEO.Subject.Code", "Subject.Descriptor", "Units")],
    weo_wide,
    by.y = "variable",
    by.x= "WEO.Subject.Code",
    all.y = T)

weo_sum_merged
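
A small sketch of what `all.y = T` does in this merge, using two made-up tables (all names and values below are hypothetical). A row of the second table with no matching key in the first is kept, and its description columns are filled with `NA`. In the WEO data this would presumably apply to `savings_gap`, which was computed here and so has no entry in the description file:

```r
# Hypothetical lookup table: code -> description
toy_desc <- data.frame(code = c("GDP", "INF"),
                       descriptor = c("Output growth", "Inflation"))

# Hypothetical summary table; "GAP" has no entry in the lookup table
toy_wide <- data.frame(variable = c("GDP", "INF", "GAP"),
                       value.low = c(4.1, 6.0, -3.2))

# Key columns have different names, so they are given separately;
# all.y = TRUE keeps rows of the second table that find no match
toy_merged <- merge(toy_desc, toy_wide,
                    by.x = "code", by.y = "variable",
                    all.y = TRUE)
toy_merged
```

The key column takes its name from the first table (`by.x`), and the result is sorted by the key.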

Take some time to interpret this data frame

## SCATTERPLOTS

Create scatterplots of selected variables:

In [None]:
cols2 <- c("NGDP_RPCH", "GGXONLB_NGDP", "GGXWDN_NGDP", "BCA_NGDPD", "savings_gap")
newnames <- c("GDP_growth", "Primary_balance", "Net_debt", "Current_account", "Saving_gap")

weo_data_subset2 <- weo_data[cols2]
names(weo_data_subset2) <- newnames

plot(weo_data_subset2, col = weo_data$income_level)

We see Saving Gap is nearly identical to Current Account Balance for many countries

## CORRELATIONS

We may look at correlations:

In [None]:
weo_desc

In [None]:
vars3 <- with(weo_desc, WEO.Subject.Code[which(Units != "National currency" & !(Scale %in% c("Millions", "Billions")))])
vars3 <- as.character(vars3)
length(vars3)

weo_data_subset3 <- weo_data[,vars3] 
weo_data_subset3

In [None]:
cormat <- round(cor(weo_data_subset3, use = "pairwise.complete.obs"), 2)
cormat
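
The `use` argument matters when columns have missing values in different rows. The sketch below, on made-up numbers, contrasts the default behaviour with pairwise and listwise deletion:

```r
# Toy data with missing values in different rows (made-up numbers)
toy <- data.frame(a = c(1, 2, 3, 4, NA),
                  b = c(2, 4, 6, NA, 10),
                  c = c(5, 3, NA, 1, 0))

# Default use = "everything": a pair involving a column with NA yields NA
cor(toy)

# Pairwise deletion: each coefficient uses the rows complete for that pair
cor(toy, use = "pairwise.complete.obs")

# Listwise deletion: only rows complete on all columns are used
cor(na.omit(toy))
```

With pairwise deletion each coefficient can be based on a different set of countries, so the entries of the matrix are not strictly comparable with one another.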

Order the descriptions according to the correlation matrix:

In [None]:
desc_ordered <- weo_desc[match(rownames(cormat), weo_desc$WEO.Subject.Code),]
desc_ordered

And combine descriptor and units into a single column:

In [None]:
desc_ordered$longdesc <- with(desc_ordered, paste(Subject.Descriptor, Units, sep = " - "))

desc_ordered

Get the indices of high (but less than perfect) correlations:

In [None]:
highcor <- which(abs(cormat) > 0.5 & abs(cormat) < 0.8, arr.ind = T)
rownames(highcor) <- NULL
highcor

And get unique rows (eliminate second row of each pair):

In [None]:
highcor2 <- unique(t(apply(highcor, 1, sort)))
highcor2
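
Why the sort-then-unique step removes duplicates can be seen on a 3 x 3 toy matrix (made-up values, hypothetical names `x`, `y`, `z`). Because a correlation matrix is symmetric, every qualifying cell shows up twice in the index table:

```r
# Small symmetric toy "correlation" matrix (made-up values)
m <- matrix(c(1.0, 0.7, 0.1,
              0.7, 1.0, 0.6,
              0.1, 0.6, 1.0),
            nrow = 3, byrow = TRUE,
            dimnames = list(c("x", "y", "z"), c("x", "y", "z")))

# Every qualifying cell appears twice: (row, col) and (col, row)
idx <- which(abs(m) > 0.5 & abs(m) < 0.8, arr.ind = TRUE)
rownames(idx) <- NULL

# Sorting each index pair makes the two mirror images identical,
# so unique() keeps one row per pair
idx_unique <- unique(t(apply(idx, 1, sort)))
idx_unique

# A two-column index matrix picks out one cell per row
m[idx_unique]

# Alternative giving the same set of pairs: look only above the diagonal
which(abs(m) > 0.5 & abs(m) < 0.8 & upper.tri(m), arr.ind = TRUE)
```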

And see which variable pairs are highly correlated:

In [None]:
pairs <- as.data.frame(t(apply(highcor2, 1, function(x) desc_ordered$longdesc[x])))

And add the correlations:

In [None]:
pairs$cor <- cormat[highcor2]

In [None]:
pairs[order(-abs(pairs$cor)),]

Now some definitions:

> The output gap is actual minus potential output, as a percentage of potential output. Structural balances are expressed as a percentage of potential output. The structural balance is the actual net lending/borrowing minus the effects of cyclical output from potential output, corrected for one-time and other factors, such as asset and commodity prices and output composition effects. Changes in the structural balance consequently include effects of temporary fiscal measures, the impact of fluctuations in interest rates and debt-service costs, and other noncyclical fluctuations in net lending/borrowing. The computations of structural balances are based on the IMF staff’s estimates of potential GDP and revenue and expenditure elasticities. (See Annex I of the October 1993 WEO.) Net debt is calculated as gross debt minus financial assets corresponding to debt instruments. Estimates of the output gap and of the structural balance are subject to significant margins of uncertainty.

(https://www.elibrary.imf.org/view/IMF081/28248-9781513508214/28248-9781513508214/ch04.xml?redirect=true)

> The output gap is an economic measure of the difference between the actual output of an economy and its potential output. Potential output is the maximum amount of goods and services an economy can turn out when it is most efficient—that is, at full capacity. Often, potential output is referred to as the production capacity of the economy. ...

> Various methodologies are used to estimate potential output, but they all assume that output can be divided into a trend and a cyclical component. The trend is interpreted as a measure of the economy’s potential output and the cycle as a measure of the output gap. The trick to estimating potential output, therefore, is to estimate trends—that is, to remove the cyclical changes. ...

> A common method of measuring potential output is the application of statistical techniques that differentiate between the short-term ups and downs and the long-term trend. The Hodrick-Prescott filter is one popular technique for separating the short from the long term. Other methods estimate the production function, a mathematical equation that calculates output based on an economy’s inputs, such as labor and capital. Trends are estimated by removing the cyclical changes in the inputs.

(https://www.imf.org/external/pubs/ft/fandd/2013/09/basics.htm)

Now let's create the scatterplots of just those variable pairs into a single grid:

In [None]:
nrow(highcor2)

In row major order with mfrow:

In [None]:
options(repr.plot.width=20, repr.plot.height=20)
highcor3 <- cbind(highcor2, seq_along(highcor2[,1]))
par(mfrow = c(4, 2))
apply(highcor3, 1, function(x) plot(weo_data_subset3[,x[-3]],
                                    xlab = desc_ordered$longdesc[x[1]],
                                    ylab = strwrap(desc_ordered$longdesc[x[2]], width=35, simplify=FALSE),
                                    main = paste("Plot:", x[3]),
                                    col = weo_data$income_level,
                                    cex.lab = 1.5))

And then column major order with mfcol:

In [None]:
options(repr.plot.width=20, repr.plot.height=20)
highcor3 <- cbind(highcor2, seq_along(highcor2[,1]))
par(mfcol = c(4, 2))
apply(highcor3, 1, function(x) plot(weo_data_subset3[,x[-3]],
                                    xlab = desc_ordered$longdesc[x[1]],
                                    ylab = strwrap(desc_ordered$longdesc[x[2]], width=35, simplify=FALSE),
                                    main = paste("Plot:", x[3]),
                                    col = weo_data$income_level,
                                    cex.lab = 1.5))
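
The difference between the two layouts is easiest to see with numbered empty panels. The helper `show_order` below is a hypothetical name introduced only for this sketch:

```r
# Draw six labelled empty panels to make the fill order visible
show_order <- function() {
  for (i in 1:6) {
    plot.new()
    text(0.5, 0.5, labels = i, cex = 3)
    box()
  }
}

old_par <- par(mfrow = c(3, 2))   # fills row by row: 1 2 / 3 4 / 5 6
show_order()

par(mfcol = c(3, 2))              # fills column by column: 1 4 / 2 5 / 3 6
show_order()

par(old_par)                      # restore the previous layout
```

Both settings take the same `c(rows, columns)` vector; only the order in which successive plots fill the grid changes.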

## REGRESSION MODELING

Now we will conduct a multiple linear regression analysis to understand the relationship among variables

Let's try to explain:

- NGDP_RPCH: Gross domestic product, constant prices (percent change)

with

- PPPSH: Gross domestic product based on purchasing-power-parity (PPP) share of world total (percent)
- NID_NGDP: Total investment (percent of GDP)
- PCPIPCH: Inflation, average consumer prices (percent change)
- GGXONLB_NGDP: General government primary net lending/borrowing (percent of GDP)
- GGXWDN_NGDP: General government net debt (percent of GDP)

Let's use weo_data_subset3

In [None]:
independent_vars <- c("PPPSH", "NID_NGDP", "PCPIPCH", "GGXONLB_NGDP", "GGXWDN_NGDP")
dependent_var <- "NGDP_RPCH"

In [None]:
names(weo_data_subset3)

Let's first subset necessary variables 

In [None]:
weo_data_subset4 <- weo_data_subset3[, c(dependent_var, independent_vars)]

And delete rows with missing values

In [None]:
weo_data_subset5 <- na.omit(weo_data_subset4)

In [None]:
str(weo_data_subset5)

In [None]:
modelx <- as.formula(paste(dependent_var, paste(independent_vars, collapse = " + "), sep = " ~ "))

In [None]:
reg_mod <- lm(modelx, weo_data_subset5)
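
An alternative to pasting the formula together by hand is `reformulate()` from the stats package, which builds a formula object from a character vector of term labels and a response name. The sketch below uses the built-in `mtcars` data as a stand-in, since the WEO files are not assumed to be available:

```r
# Built-in mtcars data stands in for the WEO subset here
response   <- "mpg"
predictors <- c("wt", "hp", "qsec")

# reformulate() builds the formula object mpg ~ wt + hp + qsec
f <- reformulate(predictors, response = response)
f

# The formula object can be passed to lm() directly
fit <- lm(f, data = mtcars)
coef(fit)
```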

In [None]:
reg_mod

In [None]:
summary(reg_mod)

As we see, only NID_NGDP (Total investment, Percent of GDP) is a significant independent variable for explaining the variation in NGDP_RPCH (Gross domestic product, constant prices, Percent change)

The coefficient is 0.156906.

So a one percentage point increase in the investment/GDP ratio is associated with an increase of about 0.16 percentage points in GDP growth

Let's keep only this single significant variable:

In [None]:
reg_mod2 <- lm(NGDP_RPCH ~ NID_NGDP, weo_data_subset5)

In [None]:
summary(reg_mod2)

The investment/GDP ratio alone explains around 20% of the cross-country variation in GDP growth rates
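
The figures read off the printed summary can also be extracted programmatically. This is a minimal sketch on the built-in `cars` data (a stand-in, not WEO data), showing where the coefficient table, the p-values and the R-squared live in the summary object:

```r
# Built-in cars data used as a stand-in (not WEO data)
fit <- lm(dist ~ speed, data = cars)
s <- summary(fit)

# Coefficient table: estimate, standard error, t value, p value
coef_table <- coef(s)
coef_table

# Terms significant at the 5% level (intercept excluded)
p_vals <- coef_table[-1, "Pr(>|t|)", drop = FALSE]
rownames(p_vals)[p_vals[, 1] < 0.05]

# Share of variance explained by the model
s$r.squared
```

The 5% threshold is a conventional choice made for this example; the same pattern would apply to `reg_mod` and `reg_mod2`.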

# MULTIPLE LINE PLOTS

Let's draw sine and cosine curves on the same plot and with a legend:

In [None]:
deg <- 1:1000
rad <- deg / 180 * pi

sinx <- sin(rad)
cosx <- cos(rad)

In [None]:
plot(deg, sinx, type = "l", col = "blue")
lines(deg, cosx, col = "red")
legend("topright", legend = c("sin", "cos"),
       col = c("blue", "red"), lty = 1, cex = 0.8)