The DALSM R-package enables to fit a nonparametric double location-scale model to right- and interval-censored data, see Lambert (2021) for methodological details.
Consider a vector
A location-scale model is considered to describe the distribution of the
response conditionally on the covariates,
Assume that independent copies
Gaussian Markov random field (GMRF) priors penalizing changes in
neighbour B-spline parameters are taken,
where
The error term
Marginal posterior distributions of the penalty parameters in the error
density and in the location and dispersion submodels are approximated by
starting from the following identity,
where
Model fitting alternates the estimation of the regression and spline parameters in the location-scale model (see Algorithm 1 in Lambert 2021), the selection of the penalty parameters (see Algorithm 2) and the update of the error density (see Algorithm 3). The whole procedure is implemented in function DALSM of the package.
The data of interest come from the European Social Survey (ESS 2016)
with a focus on the money available per person in Belgian households for
respondents aged 25-55 when the main source of income comes from wages
or salaries (
We model the relationship between disposable income per person (
Let us load the package and visualize typical response values:
## Package installation from R-CRAN
## install.packages("DALSM")
## Alternatively, installation from GitHub:
## install.packages("devtools")
## devtools::install_github("plambertULiege/DALSM")
## Package loading
library(DALSM)
## Data reading
data(DALSM_IncomeData)
resp = DALSM_IncomeData[,1:2]
head(resp,20) ## Visualize first 20 censored income values
## inc.low inc.up
## 1 1.5300000 1.8700000
## 2 1.1333333 1.3851852
## 3 2.0400000 2.4933333
## 4 2.4933333 3.0200000
## 5 1.8120000 2.2320000
## 6 1.7200000 2.1000000
## 7 1.7200000 2.1000000
## 8 1.7200000 2.1000000
## 9 1.2240000 1.4960000
## 10 2.4933333 3.0200000
## 11 1.7200000 2.1000000
## 12 2.0777778 2.5166667
## 13 1.4000000 1.7000000
## 14 2.0400000 2.4933333
## 15 1.8120000 2.2320000
## 16 1.1666667 1.4000000
## 17 1.6800000 2.0400000
## 18 2.1571429 2.6571429
## 19 0.5384615 0.6615385
## 20 2.6571429 Inf
summary(DALSM_IncomeData[,3:5]) ## Summary stats for covariates
## twoincomes age eduyrs
## Min. :0.0000 Min. :25.00 Min. : 0.00
## 1st Qu.:0.0000 1st Qu.:34.00 1st Qu.:12.00
## Median :1.0000 Median :41.00 Median :15.00
## Mean :0.6415 Mean :41.03 Mean :14.88
## 3rd Qu.:1.0000 3rd Qu.:49.00 3rd Qu.:17.00
## Max. :1.0000 Max. :55.00 Max. :26.00
Most reported income values (in thousand euros) are interval-censored (with finite values for the upper bound of the interval), some are right-censored (with an Inf value for the upper bound of the interval) for respondents belonging to households whose net monthly income exceeds the last decile.
The nonparametric double location-scale model (with additive age and eduyrs terms and a fixed effect for the twoincomes indicator can be fitted using the DALSM function:
fit = DALSM(y=resp,
formula1 = ~twoincomes+s(age)+s(eduyrs),
formula2 = ~twoincomes+s(age)+s(eduyrs),
data = DALSM_IncomeData)
print(fit)
## -----------------------------------------------------------------------
## Double Additive Location-SCALE Model
## -----------------------------------------------------------------------
## Fixed effects for Location:
## est se low up Z Pval
## (Intercept) 1.572 0.062 1.450 1.695 25.234 0
## twoincomes 0.252 0.051 0.152 0.352 4.957 0
##
## 2 additive term(s) in Location: Eff.dim / Test No effect
## ED.hat low up Chi2 Pval
## age 4.247 2.277 4.977 13.839 0.01
## eduyrs 3.554 1.811 4.679 118.901 0.00
##
## Fixed effects for Dispersion:
## est se low up Z Pval
## (Intercept) -0.436 0.086 -0.604 -0.268 -5.092 0.000
## twoincomes -0.042 0.070 -0.179 0.094 -0.610 0.542
##
## 2 additive term(s) in Dispersion: Eff.dim / Test No effect
## ED.hat low up Chi2 Pval
## age 3.318 1.456 4.409 15.249 0.002
## eduyrs 3.311 1.508 4.329 41.345 0.000
##
## 10 B-splines per additive component in location
## 10 B-splines per additive component in dispersion
## 20 B-splines for the error density on (-3.76,6.64)
##
## Total sample size: 756 ; Credible level for CI: 0.95
## Uncensored data: 0 (0 percents)
## Interval Censored data: 691 (91.4 percents)
## Right censored data: 65 (8.6 percents)
## -----------------------------------------------------------------------
## Convergence status: TRUE
## Elapsed time: 0.6 seconds (5 iterations)
## -----------------------------------------------------------------------
It suggests an average increase of 252 euros (available per person in
the household) when the respondent and his/her partner are in paid work
(conditionally on Age and Educ), while the effect on dispersion is
not statistically significant.
The effects of Age and Educ on the conditional mean and dispersion
can be visualized on the plots below with the estimated additive terms.
plot(fit, new.dev=FALSE)
The amount of money available per person in the household tends to
decrease with age, see
Besides fitting Nonparametric Double Additive Location-Scale Model to censored data, the DALSM package contains an independent and very fast function, densityLPS, for density estimation from right- or interval-censored data with possible constraints on the mean and variance using Laplace P-splines.
Let us generate interval-censored (IC) data from a Gamma(10,2) distribution with mean 5.0 and variance 2.5. The mean width of the simulated IC intervals is 2.0. Part of the data are also right-censored (RC) with RC values generated from an exponential distribution with mean 15.0.
## Generation of right- and interval-censored data
set.seed(123)
n = 500 ## Sample size
x = rgamma(n,10,2) ## Exact (unobserved) data
width = runif(n,1,3) ## Width of the IC data (mean width = 2)
w = runif(n) ## Positioning of the exact data within the interval
xmat = cbind(pmax(0,x-w*width),x+(1-w)*width) ## Generated IC data
t.cens = rexp(n,1/15) ## Right-censoring values
idx.RC = (1:n)[t.cens<x] ## Id's of the right-censored units
xmat[idx.RC,] = cbind(t.cens[idx.RC],Inf) ## Data for RC units: (t.cens,Inf)
head(xmat,15)
## [,1] [,2]
## [1,] 0.8279792 Inf
## [2,] 6.6526685 8.204466
## [3,] 1.1579633 Inf
## [4,] 4.1784981 6.277058
## [5,] 6.6143492 8.306603
## [6,] 3.7696512 5.799582
## [7,] 2.4766237 Inf
## [8,] 2.3888812 4.441426
## [9,] 5.9867783 7.392858
## [10,] 3.6321041 Inf
## [11,] 4.7570358 6.498024
## [12,] 4.3883940 5.995044
## [13,] 2.0614651 Inf
## [14,] 5.1459398 Inf
## [15,] 3.3849118 Inf
The density can be estimated from the censored data using function densityLPS. Optionally, the mean and variance of the estimated density can also be forced to some fixed values, here 5.0 and 2.5, respectively. We also choose to force the left end of the distribution support to be 0:
## Density estimation from IC data
obj.data = Dens1d(xmat,ymin=0) ## Prepare the IC data for estimation
obj = densityLPS(obj.data, Mean0=10/2, Var0=10/4) ## Estimation with fixed mean and variance
print(obj)
## -----------------------------------------------------------------------
## Constrained Density/Hazard estimation from censored data using LPS
## -----------------------------------------------------------------------
## INPUT:
## Total sample size: 500
## Uncensored data: 0 (0 percents)
## Interval-censored (IC) data: 367 (73.4 percents)
## Right-censored (RC) data: 133 (26.6 percents)
## ---
## Range of the IC data: (0.1708629,12.56085)
## Range of the RC data: (0.009145886,7.785088)
## ---
## Assumed support: (0,14.69175)
## Number of small bins on the support: 501
## Number of B-splines: 25 ; Penalty order: 2
##
## OUTPUT:
## Returned functions: ddist, pdist, hdist, Hdist(x)
## Parameter estimates: phi, tau
## Value of the estimated cdf at +infty: 1
## Constraint on the Mean: 5 ; Fitted mean: 5
## Constraint on the Variance: 2.5 ; Fitted variance: 2.499669
## Selected penalty parameter <tau>: 22.5
## Effective number of parameters: 5.4
## -----------------------------------------------------------------------
## Elapsed time: 0.1 seconds (6 iterations)
## -----------------------------------------------------------------------
The estimated density and cdf can also be visualized and compared to their ‘true’ Gamma(10,2) counterparts used to generate the data:
plot(obj) ## Plot the estimated density
curve(dgamma(x,10,2), ## ... and compare it to the true density (in red)
add=TRUE,col="red",lwd=2,lty=2)
legend("topright",col=c("black","red"),lwd=c(2,2),lty=c(1,2),
legend=c("Estimated density","True density"),bty="n")
## Same story for the cdf
with(obj, curve(pdist(x),ymin,ymax,lwd=2,xlab="",ylab="F(x)"))
curve(pgamma(x,10,2),add=TRUE,col="red",lwd=2,lty=2)
legend("right",col=c("black","red"),lwd=c(2,2),lty=c(1,2),
legend=c("Estimated cdf","True cdf"),bty="n")
Estimated density (pdf), distribution (cdf), hazard and cumulative hazard functions are also directly available:
xvals = seq(2,10,by=2)
with(obj, cbind(x=xvals, fx=ddist(xvals), Fx=pdist(xvals),
hx=hdist(xvals), Hx=Hdist(xvals)))
## x fx Fx hx Hx
## [1,] 2 0.031622032 0.01849226 0.03221816 0.01866538
## [2,] 4 0.234756524 0.26468056 0.31927749 0.30745026
## [3,] 6 0.186674735 0.76413790 0.79144393 1.44450795
## [4,] 8 0.037761707 0.95797288 0.89847167 3.16944027
## [5,] 10 0.006598059 0.99509435 1.34498959 5.31736848
DALSM: Nonparametric Double Additive Location-Scale Model (DALSM). Copyright (C) 2021-2023 Philippe Lambert
This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with this program. If not, see https://www.gnu.org/licenses/.
[1] Lambert, P. (2021) Fast Bayesian inference using Laplace approximations in nonparametric double additive location-scale models with right- and interval-censored data. Computational Statistics and Data Analysis, 161: 107250. doi:10.1016/j.csda.2021.107250
[2] Lambert, P. (2021). DALSM: Nonparametric Double Additive Location-Scale Model - R package version 0.9.1. https://CRAN.R-project.org/package=DALSM ; GitHub: https://github.com/plambertULiege/DALSM