-
Notifications
You must be signed in to change notification settings - Fork 1
The first example
The first example, which includes an original PIAAC public use file, demonstrates the purpose and functionality of this package. The functions in the package and especially the type of output are widely following those of the idb analyzer (http://www.iea.nl/data.html) so that the change from idb+SPSS+Excel to R is as comfortable as possible for the user.
First of all we are loading all the packages we need for this example. Note that these packages must be installed previously. The memisc package will help us to read SPSS files properly and the survey package is a well known and proved package which handles all kinds of data obtained from a survey sampling design.
library(memisc)
library(survey)
library(svyPVpack)
library(ggplot2)Before sourcing the next two lines, make sure that you working directory is set correctly to the location you want to save the files of this tutorial. To find out what your current working directory is, type getwd() in your R console.
getwd()
dir.create("firstEX")
download.file("http://vs-web-fs-1.oecd.org/piaac/puf-data/SPSS/prgautp1.sav","./firstEX/prgautp1.sav", method="auto", mode="wb")The last line downloads the Austrian PIAAC public use SPSS file into the firstEX folder. This can take a few seconds (depending on your internet connection speed). Now you have got a large .sav file in your folder, and it seems tempting to start SPSS (as long as you have this software) to make some analyses. But we will continue using R -- therefore we have to push this file into the R workspace. For this task we will use the memisc package. Follow the code lines, and be patient -- the transfer process will take a few seconds. After reading the SPSS file into the R workspace (object d1), we explore this object and examine whether it has been read correctly for further analyses. In the next step, we explore which columns are containing only missing values and create a new R object d2, which includes a data.frame without these columns.
As always, converting data, data manipulation, cleaning data etc. makes > 50% of a data analysts workload.
setwd("./firstEX")
#1
d <- as.data.set(spss.system.file("prgautp1.sav"))
d1 <- data.frame(d)
head(d1)
dim(d1)
#2
hmNA <- sapply(d1, function(x) sum(is.na(x)))
onlyNAcols <- which(hmNA == nrow(d1))
length(onlyNAcols)
d2 <- d1[,-onlyNAcols]
Now we have to care about two things before we can start our first analysis with the svyPV package:
- Which columns contain the weights?
- Which columns contain the plausible values?
So we try to find them by searching the column names with the magic of regular expressions. We know that the main "weight" is denoted as SPFWT0 and the other 80 weights are denoted as SPFWT1 to SPFWT80. With regard to the plausible values we know that the appropriate variable names start with PV. This information is very useful because we have to handle more than 1500 variables in this data set. The positions of the weight-columns are found rather quickly, and saved as 2 new objects. The plausible values are saved as the three objects LITERACY, NUMERACY and PROBLEMSOLVING.
In the last step, we create a svrepdesign, applying of a function of the survey package. In this statement you have to define the data, the weights, the replicated weights, and the type of resampling method used to compute the SE. (Have a closer look at the survey package (?survey) for details.)
# get the columns which contain the weights
nr.SPFWT0 <- grep("^SPFWT0$",colnames(d2),value=F)
nr.SPFWT_rep <- grep("^SPFWT[1-9][0-9]*$",colnames(d2),value=F)
# get the variable names of the plausible values
LITERACY <- grep("^PVLIT[0-9]+", colnames(d2),value=TRUE)
NUMERACY <- grep("^PVNUM[0-9]+", colnames(d2),value=TRUE)
PROBLEMSOLVING <- grep("^PVPSL[0-9]+", colnames(d2),value=TRUE)
# create the survey design
sd2 <- svrepdesign(variables=d2, repweights=d2[,nr.SPFWT_rep],
weights=d2[,nr.SPFWT0], type="JK1")
The creation of a survey design is always necessary if you want to use the svyPVpack package for your analysis. After we created the sd2 object, which is a svrepdesign object, we can start with our actual computations. We call the svyPVpm() function, by setting the arguments:
- a grouping variable (by)
- a svrepdesign object which contains at least the plausible values, the weights and the grouping variables
- a character vector with the variable names of the plausible values
The function now computes the mean, standard deviation, proportion of population totals and their standard-errors, population totals and sum of weights. Note that the standard errors include sampling variance as well as imputation variance and are therefore unbiased estimates! That is, because the function uses all plausible values for the computations.
As can be seen from the output table, male participants achieved a slightly better result than female participants did.
res1 <- svyPVpm(by = ~ GENDER_R, svydat=sd2, pvs=LITERACY)
res1
Group1 Number.of.cases Sum.of.weights Proportion SE.Proportion PVLIT_mean
1 Male 2479 2764088 0.4985604 0.0009794639 271.5255
2 Female 2546 2780051 0.5014396 0.0009794639 267.3887
PVLIT_mean:SE PVLIT_stddev PVLIT_stddev:SE
1 1.0447129 44.63807 0.8634961
2 0.9320693 43.17261 0.7521920To visualize this output, we are using the great ggplot2 package (https://github.com/hadley/ggplot2).
# rename the colnames of the output slightly
colnames(res1) <- gsub(":SE","_SE",colnames(res1))
p1 <- ggplot(res1, aes(x=Group1, y=PVLIT_mean, ymin=PVLIT_mean - PVLIT_mean_SE*qt(0.975,79) , ymax=PVLIT_mean + PVLIT_mean_SE*qt(0.975,79), color=Group1)) + scale_color_manual("Gender",values=c("blue","red"))
p2 <- p1 + geom_crossbar(width=0.8, size=1.2) + theme_bw(base_size=15) + coord_cartesian(ylim=c(264,276)) + ylab("Literacy\n") + xlab("")
p2
The computations continue in the Second Tutorial.
