Skip to content

The second example

Manuel Reif edited this page Mar 6, 2014 · 5 revisions

The example we uses again the Austrian PIAAC public use file. We are considering only persons who are labeled as employed or unemployed and skipping those who are out of labour force. We are doing this because we want to estimate a logistic regression to get an idea which variables increase the risk of being unemployed and which don't. So we have to build a subsample of the data set in a first step as shown in the code below. We are using this data.frame now to build another survey design with the svrepdesign() function.

d3 <- subset(d2, subset=C_D05 != "Out of the labour force")

d3$emp <- droplevels(d3$C_D05)
sd3 <- svrepdesign(variables=d3, repweights=d3[,nr.SPFWT_rep], weights=d3[,nr.SPFWT0], type="JK1")

After we built this object, we have already everything we need to start our estimation procedure. We use the svyPVglm() function to fit a logistic regression with two variables. That is not a very sophisticated model (the purpose is here to demonstrate the routine). But feel free to estimate any model you like - we try to keep the output as short as possible.

Note that the notation for this function is slightly different to the notation we have seen before. We have to provide a formula statement. In this example the variable emp should be explained by GENDER_R and PVLIT... The notation PVLIT.. means that the variable starts with PVLIT and for the ellipses the numbers of the vector placeholder is imputed. So this function successively uses PVLIT1, PVLIT2, ... , PVLIT10 as predictor variables. This results in 10 different estimations of the model. Fortunately, the svyPVglm() function neatly summarizes this model into one output.

res2 <- svyPVglm(formula= emp  ~ GENDER_R +  PVLIT.., design=sd3, placeholder=1:10, family=binomial())

res2

$coef
                     mean        se   t.value       Pr.t
(Intercept)    -1.1739159 0.5220458 -2.248684 0.02731835
GENDER_RFemale  0.1801614 0.1796719  1.002724 0.31905533
PVLIT1         -0.0074055 0.0019518 -3.794150 0.00028875

$mod.fit
 Working.2.logLR          p
        16.55812 0.00062525

Keep in mind which was the reference category for the employment variable (you can check this by typing levels(d3$emp) - R uses the first category as reference category as default).

To visualize the results of the logistic regression, we use the ggplot2 package again. This kind of plot can help to find significant variables quickly.

coef2 <- res2$coef
coef3 <- data.frame("what"=factor(rownames(coef2), levels=rownames(coef2)),coef2)

p3 <- ggplot(coef3,aes(x=what,ymin=0,ymax=t.value))
p3a <- p3 + geom_linerange() + geom_point(aes(y=t.value,color=ifelse(abs(t.value) > qt(0.975,79),"yes","no")), size=3,shape=15) + geom_hline(yintercept=c(0,-qt(0.975,79),qt(0.975,79)), linetype=2) + coord_cartesian(ylim=c(-4.5,5.5)) + scale_color_manual("significant",values=c("black","red")) + theme_bw(base_size=15) + xlab("") + ylab("t value") + scale_x_discrete(labels=c("intercept","gender","Literacy"))

print(p3a)

vislogR

To choose one result from this logistic regression model: When our skills in Literacy are increasing, our chance of being unemployed is decreasing. This sounds reasonable. For example, if we compare two guys who differ about 30 points on the Literacy scale, the risk for the 'smarter' guy being unemployed is smaller by a factor of ~ 0.80 compared to the less 'smarter' guy.

exp(-0.0074055*30)
[1] 0.8007832

Here ends the second tutorial - keep exploring the other functions which are estimating for instance:

  • proficiency levels
  • t-Test
  • ...

Clone this wiki locally