This package contains R functions corresponding to useful Stata commands.
The package includes:
- data.frame functions (summarize, tabulate, merge)
- panel data functions (monthly/quarterly dates, lead/lag, fillin)
- vector functions (xtile, pctile, winsorize)
- graph functions (binscatter)
sum_up
prints detailed summary statistics (corresponds to Stata summarize
)
N <- 100
df <- data_frame(
id = 1:N,
v1 = sample(5, N, TRUE),
v2 = sample(1e6, N, TRUE)
)
sum_up(df)
df %>% sum_up(starts_with("v"), d = TRUE)
df %>% group_by(v1) %>% sum_up()
tab
prints distinct rows with their count. Compared to the dplyr function count
, this command adds frequency, percent, and cumulative percent.
N <- 1e2 ; K = 10
df <- data_frame(
id = sample(c(NA,1:5), N/K, TRUE),
v1 = sample(1:5, N/K, TRUE)
)
tab(df, id)
tab(df, id, na.rm = TRUE)
tab(df, id, v1)
join
is a wrapper for dplyr merge functionalities, with two added functions
-
The option
check
checks there are no duplicates in the master or using data.tables (as in Stata).# merge m:1 v1 join(x, y, kind = "full", check = m~1)
-
The option
gen
specifies the name of a new variable that identifies non matched and matched rows (as in Stata).# merge m:1 v1, gen(_merge) join(x, y, kind = "full", gen = "_merge")
-
The option
update
allows to update missing values of the master dataset by the value in the using dataset
The classes "monthly" and "quarterly" print as dates and are compatible with usual time extraction (ie month
, year
, etc). Yet, they are stored as integers representing the number of elapsed periods since 1970/01/0 (resp in week, months, quarters). This is particularly handy for simple algebra:
# elapsed dates
library(lubridate)
date <- mdy(c("04/03/1992", "01/04/1992", "03/15/1992"))
datem <- as.monthly(date)
# displays as a period
datem
#> [1] "1992m04" "1992m01" "1992m03"
# behaves as an integer for numerical operations:
datem + 1
#> [1] "1992m05" "1992m02" "1992m04"
# behaves as a date for period extractions:
year(datem)
#> [1] 1992 1992 1992
tlag
/tlead
a vector with respect to a number of periods, not with respect to the number of rows
year <- c(1989, 1991, 1992)
value <- c(4.1, 4.5, 3.3)
tlag(value, 1, time = year)
library(lubridate)
date <- mdy(c("01/04/1992", "03/15/1992", "04/03/1992"))
datem <- as.monthly(date)
value <- c(4.1, 4.5, 3.3)
tlag(value, time = datem)
In constrast to comparable functions in zoo
and xts
, these functions can be applied to any vector and be used within a dplyr
chain:
df <- data_frame(
id = c(1, 1, 1, 2, 2),
year = c(1989, 1991, 1992, 1991, 1992),
value = c(4.1, 4.5, 3.3, 3.2, 5.2)
)
df %>% group_by(id) %>% mutate(value_l = tlag(value, time = year))
is.panel
checks whether a dataset is a panel i.e. the time variable is never missing and the combinations (id, time) are unique.
df <- data_frame(
id1 = c(1, 1, 1, 2, 2),
id2 = 1:5,
year = c(1991, 1993, NA, 1992, 1992),
value = c(4.1, 4.5, 3.3, 3.2, 5.2)
)
df %>% group_by(id1) %>% is.panel(year)
df1 <- df %>% filter(!is.na(year))
df1 %>% is.panel(year)
df1 %>% group_by(id1) %>% is.panel(year)
df1 %>% group_by(id1, id2) %>% is.panel(year)
fill_gap transforms a unbalanced panel into a balanced panel. It corresponds to the stata command tsfill
. Missing observations are added as rows with missing values.
df <- data_frame(
id = c(1, 1, 1, 2),
datem = as.monthly(mdy(c("04/03/1992", "01/04/1992", "03/15/1992", "05/11/1992"))),
value = c(4.1, 4.5, 3.3, 3.2)
)
df %>% group_by(id) %>% fill_gap(datem)
df %>% group_by(id) %>% fill_gap(datem, full = TRUE)
df %>% group_by(id) %>% fill_gap(datem, roll = "nearest")
# sample_mode returns the statistical mode
sample_mode(c(1, 2, 2))
sample_mode(c(1, 2))
sample_mode(c(NA, NA, 1))
sample_mode(c(NA, NA, 1), na.rm = TRUE)
# pctile computes quantile and weighted quantile of type 2 (similarly to Stata _pctile)
v <- c(NA, 1:10)
pctile(v, probs = c(0.3, 0.7), na.rm = TRUE)
# xtile creates integer variable for quantile categories (corresponds to Stata xtile)
v <- c(NA, 1:10)
xtile(v, n_quantiles = 3) # 3 groups based on terciles
xtile(v, probs = c(0.3, 0.7)) # 3 groups based on two quantiles
xtile(v, cutpoints = c(2, 3)) # 3 groups based on two cutpoints
# winsorize (default based on 5 x interquartile range)
v <- c(1:4, 99)
winsorize(v)
winsorize(v, replace = NA)
winsorize(v, probs = c(0.01, 0.99))
winsorize(v, cutpoints = c(1, 50))
stat_binmean()
is a stat
for ggplot2. It returns the mean of y
and x
within bins of x
. It's a bareborne version of the Stata command binscatter
ggplot(iris, aes(x = Sepal.Width , y = Sepal.Length)) + stat_binmean()
ggplot(iris, aes(x = Sepal.Width , y = Sepal.Length, color = Species)) + stat_binmean(n=10)
ggplot(iris, aes(x = Sepal.Width , y = Sepal.Length, color = Species)) + stat_binmean(n=10) + stat_smooth(method = "lm", se = FALSE)
You can install