Calibration

Antoine R edited this page Jun 22, 2018 · 8 revisions

Calibration on margins

Example survey

Let's perform a simple calibration on margins using icarus and the example dataset data_ex2. First, load the package:

library(icarus)

The example dataset is loaded along with icarus:

> data_ex2

    id service categ sexe salaire cinema poids
1  a01       1     1    1    1000      1    10
2  a02       1     2    2    1100      2    10
3  a03       2     2    2    1500      4    10
4  a04       2     3    1    2300     15    10
5  a05       2     1    1    1000      2    10
6  a06       1     1    2     500      3    10
7  a07       2     2    2    1000      1    10
8  b01       1     3    2    2000      0    20
9  b02       1     1    1    2100      0    20
10 b03       2     2    1    2000      3    20
11 b04       2     1    2    3200      6    20
12 b05       1     1    2    1800      0    20
13 b06       1     2    1    2800      0    20
14 b07       1     3    1    1100      1    20
15 b08       2     1    2    2500      1    20

Our example dataset is the result of a survey conducted among the 300 employees of a firm to measure how many times per month employees go to the movies (quantitative variable, column cinema).

Sampling weights are given in the column poids. Auxiliary variables here are:

  • service, the department in which the employee works (categorical, 2 modalities)
  • categ, the hierarchical level of the employee in the company (categorical, 3 modalities)
  • sexe, sex of the employee (categorical, 2 modalities)
  • salaire, salary of the employee (quantitative)

Horvitz-Thompson estimate

The total number of employees in the firm is 300:

N <- 300

The mean number of times employees go to the movies each month can be estimated using the Horvitz-Thompson estimator:

> 1/N * HTtotal(data_ex2$cinema, data_ex2$poids)
    1.666667
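
As a cross-check, the same figure can be reproduced in base R, assuming HTtotal simply returns the weighted sum of the variable (the usual Horvitz-Thompson total when the weights are inverse inclusion probabilities). The two vectors are retyped from the data_ex2 listing above so that the snippet runs without icarus; ht_total and ht_mean are names introduced here:

```r
# cinema and poids columns, retyped from the data_ex2 listing
cinema <- c(1, 2, 4, 15, 2, 3, 1, 0, 0, 3, 6, 0, 0, 1, 1)
poids  <- rep(c(10, 20), times = c(7, 8))

N <- 300

# Horvitz-Thompson total: each respondent stands for `poids` employees
ht_total <- sum(cinema * poids)   # 500
ht_mean  <- ht_total / N          # 1.666667

print(ht_mean)
```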

Estimation using calibration on margins

The goal of using calibration on auxiliary variables (margins) is to improve the Horvitz-Thompson estimate by using known totals of these auxiliary variables. In this case, we know the number of employees in each category for the categorical auxiliary variables:

  • categ: 80 (modality 1) ; 90 (modality 2) ; 60 (modality 3)
  • sexe: 140 (modality 1) ; 90 (modality 2)
  • service: 100 (modality 1) ; 130 (modality 2)

We also know that the total of the salaries paid by the company is 470000.
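
A quick tabulation in base R shows why calibration is useful here: the totals estimated with the initial weights do not match the known margins. The vectors are retyped from the data_ex2 listing; estimated, known and comparison are names introduced for this illustration:

```r
# Auxiliary variables and weights, retyped from the data_ex2 listing
service <- c(1, 1, 2, 2, 2, 1, 2, 1, 1, 2, 2, 1, 1, 1, 2)
categ   <- c(1, 2, 2, 3, 1, 1, 2, 3, 1, 2, 1, 1, 2, 3, 1)
sexe    <- c(1, 2, 2, 1, 1, 2, 2, 2, 1, 1, 2, 2, 1, 1, 2)
salaire <- c(1000, 1100, 1500, 2300, 1000, 500, 1000, 2000,
             2100, 2000, 3200, 1800, 2800, 1100, 2500)
poids   <- rep(c(10, 20), times = c(7, 8))

# Totals estimated with the initial sampling weights
estimated <- as.numeric(c(tapply(poids, categ, sum),
                          tapply(poids, sexe, sum),
                          tapply(poids, service, sum),
                          sum(poids * salaire)))

# Known population totals quoted in the text
known <- c(80, 90, 60, 140, 90, 100, 130, 470000)

comparison <- data.frame(
  margin    = c("categ 1", "categ 2", "categ 3", "sexe 1", "sexe 2",
                "service 1", "service 2", "salaire"),
  estimated = estimated,
  known     = known
)
print(comparison)
```

Calibration adjusts the weights so that the estimated column coincides with the known column.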

To compute the calibration estimator, we create the margin matrix, which contains the totals of the auxiliary variables. The format of the margin matrix is very similar to the margin table in the SAS macro "Calmar":

## Calibration margins
mar1 <- c("categ",3,80,90,60)
mar2 <- c("sexe",2,140,90,0)
mar3 <- c("service",2,100,130,0)
mar4 <- c("salaire", 0, 470000,0,0)
margins <- rbind(mar1, mar2, mar3, mar4)
wCalesRaking <- calibration(data=data_ex2, marginMatrix=margins, colWeights="poids"
                           , method="raking", description=FALSE)
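
Each margin row above follows the same layout: variable name, number of modalities (0 for a quantitative variable), then the totals, padded with zeros so that all rows have the same length. The sketch below makes that padding explicit. margin_row is a hypothetical helper written for this page, not a function from icarus:

```r
# Hypothetical helper: build one row of a margin matrix.
# `max_modalities` is the largest number of modalities among all variables.
margin_row <- function(name, totals, quantitative = FALSE, max_modalities = 3) {
  stopifnot(length(totals) <= max_modalities)
  n_mod   <- if (quantitative) 0 else length(totals)
  padding <- rep(0, max_modalities - length(totals))
  c(name, n_mod, totals, padding)
}

margins_alt <- rbind(
  margin_row("categ",   c(80, 90, 60)),
  margin_row("sexe",    c(140, 90)),
  margin_row("service", c(100, 130)),
  margin_row("salaire", 470000, quantitative = TRUE)
)
print(margins_alt)
```

Apart from the row names, margins_alt should hold the same character matrix as the one built by hand.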

Setting description=TRUE would output statistics on the calibration (distribution of initial vs. calibrated weights). In our example, the calibrated weights are stored in the vector wCalesRaking. We can now compute the calibration estimator:

> 1/N * HTtotal(data_ex2$cinema, wCalesRaking)
   2.471917
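
For readers curious about what a raking-ratio calibration solves, here is a minimal base-R sketch. It assumes the textbook formulation: calibrated weights of the form poids * exp(x'lambda), with lambda chosen so that the weighted totals of the auxiliary variables hit the margins, found by a damped Newton iteration. This is not the icarus implementation, and X, targets, calib_gap and w_raking are names introduced here. If icarus solves the same equations, the last printed value should land close to the estimate above.

```r
# Data retyped from the data_ex2 listing
service <- c(1, 1, 2, 2, 2, 1, 2, 1, 1, 2, 2, 1, 1, 1, 2)
categ   <- c(1, 2, 2, 3, 1, 1, 2, 3, 1, 2, 1, 1, 2, 3, 1)
sexe    <- c(1, 2, 2, 1, 1, 2, 2, 2, 1, 1, 2, 2, 1, 1, 2)
salaire <- c(1000, 1100, 1500, 2300, 1000, 500, 1000, 2000,
             2100, 2000, 3200, 1800, 2800, 1100, 2500)
cinema  <- c(1, 2, 4, 15, 2, 3, 1, 0, 0, 3, 6, 0, 0, 1, 1)
poids   <- rep(c(10, 20), times = c(7, 8))
N <- 300

# Calibration variables: one indicator per categ modality, one indicator each
# for sexe and service (the second modality is implied, since every block of
# margins sums to the same total), and salaire in thousands for conditioning.
X <- 1 * cbind(categ == 1, categ == 2, categ == 3,
               sexe == 1, service == 1, salaire / 1000)
targets <- c(80, 90, 60, 140, 100, 470)

# Gap between the calibrated totals and the targets for a given lambda
calib_gap <- function(lambda) {
  w <- as.vector(poids * exp(X %*% lambda))
  as.vector(crossprod(X, w)) - targets
}

lambda <- rep(0, ncol(X))
for (iter in 1:50) {
  gap <- calib_gap(lambda)
  if (max(abs(gap)) < 1e-8) break
  w    <- as.vector(poids * exp(X %*% lambda))
  step <- -solve(crossprod(X, w * X), gap)   # Newton step, Jacobian X' diag(w) X
  t <- 1
  repeat {                                   # halve the step until the gap shrinks
    new_gap <- calib_gap(lambda + t * step)
    if (all(is.finite(new_gap)) && sum(new_gap^2) < sum(gap^2)) break
    t <- t / 2
    if (t < 1e-8) stop("line search failed")
  }
  lambda <- lambda + t * step
}

w_raking <- as.vector(poids * exp(X %*% lambda))
print(tapply(w_raking, categ, sum))    # compare with 80, 90, 60
print(sum(w_raking * salaire))         # compare with 470000
print(sum(w_raking * cinema) / N)      # calibrated mean of cinema
```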

Margins in percentage

Just like in Calmar, it is also possible to enter the margins of categorical variables as percentages (which is convenient when dealing with huge numbers, for example on large populations):

mar1_2 <- c("categ",3,0.35,0.4,0.25)
mar2_2 <- c("sexe",2,0.60,0.40,0)
mar3_2 <- c("service",2,0.45,0.55,0)
mar4_2 <- c("salaire", 0, 470000,0,0)
margins_2 <- rbind(mar1_2, mar2_2, mar3_2, mar4_2)

In this case, set parameter pct to TRUE when performing calibration:

wCalRakingPct <- calibration(data=data_ex2, marginMatrix=margins_2, colWeights="poids"
                              , method="logit", description=FALSE, bounds=c(0.4,2.2), pct=TRUE
                              , popTotal=N)

As of version 0.3.0, icarus also supports margins written as percentages adding up to 100:

mar1_3 <- c("categ",3,35,40,25)
mar2_3 <- c("sexe",2,60,40,0)
mar3_3 <- c("service",2,45,55,0)
mar4_3 <- c("salaire", 0, 470000,0,0)
margins_3 <- rbind(mar1_3, mar2_3, mar3_3, mar4_3)

wCalRakingPct <- calibration(data=data_ex2, marginMatrix=margins_3, colWeights="poids"
                              , method="logit", description=FALSE, bounds=c(0.4,2.2), pct=TRUE
                              , popTotal=N)

Using other distance functions

Distances other than the raking ratio are implemented in icarus. For example, we can use the logit method, which allows us to set bounds on the ratio of calibrated weights to initial weights:

wCalesLogit1 <- calibration(data=data_ex2, marginMatrix=margins, colWeights="poids"
                            , method="logit", description=FALSE, bounds=c(0.4,2.2))
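
To see how the bounds act, here is a sketch of the logit calibration function as it is usually written in the calibration literature (the bounded method of Deville and Särndal), with L and U the lower and upper bounds. That icarus uses exactly this functional form is an assumption, and logit_ratio is a name introduced here:

```r
# Ratio calibrated weight / initial weight as a function of the linear
# predictor u, for the logit method with bounds L < 1 < U
logit_ratio <- function(u, L = 0.4, U = 2.2) {
  A <- (U - L) / ((1 - L) * (U - 1))
  (L * (U - 1) + U * (1 - L) * exp(A * u)) /
    ((U - 1) + (1 - L) * exp(A * u))
}

u      <- seq(-5, 5, by = 0.5)
ratios <- logit_ratio(u)

print(range(ratios))     # stays inside the interval (0.4, 2.2)
print(logit_ratio(0))    # equals 1 up to rounding: no adjustment at u = 0
```

Whatever value the linear predictor takes, the ratio cannot leave the interval given in bounds, which is what distinguishes this method from the raking ratio, where the ratio is unbounded above.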