/
verif_OT.Rd
163 lines (138 loc) · 9.97 KB
/
verif_OT.Rd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/verif_OT.R
\name{verif_OT}
\alias{verif_OT}
\title{verif_OT()}
\usage{
verif_OT(
ot_out,
group.class = FALSE,
ordinal = TRUE,
stab.prob = FALSE,
min.neigb = 1
)
}
\arguments{
\item{ot_out}{an otres object from \code{\link{OT_outcome}} or \code{\link{OT_joint}}}
\item{group.class}{a boolean indicating if the results related to the proximity between outcomes by grouping levels are requested in output (\code{FALSE} by default).}
\item{ordinal}{a boolean that indicates if \eqn{Y} and \eqn{Z} are ordinal (\code{TRUE} by default) or not. This argument is only useful in the context of groups of levels (\code{group.class}=TRUE).}
\item{stab.prob}{a boolean indicating if the results related to the stability of the algorithm are requested in output (\code{FALSE} by default).}
\item{min.neigb}{a value indicating the minimal required number of neighbors to consider in the estimation of stability (1 by default).}
}
\value{
A list of 7 objects is returned:
\item{nb.profil}{the number of profiles of covariates}
\item{conf.mat}{the global confusion matrix between \eqn{Y} and \eqn{Z}}
\item{res.prox}{a summary table related to the association measures between \eqn{Y} and \eqn{Z}}
\item{res.grp}{a summary table related to the study of the proximity of \eqn{Y} and \eqn{Z} using group of levels. Only if the \code{group.class} argument is set to TRUE.}
\item{hell}{Hellinger distances between observed and predicted distributions}
\item{eff.neig}{a table which corresponds to a count of conditional probabilities according to the number of neighbors used in their computation (only the first ten values)}
\item{res.stab}{a summary table related to the stability of the algorithm}
}
\description{
This function proposes post-process verifications after data fusion by optimal transportation algorithms.
}
\details{
In a context of data fusion, where information from a same target population is summarized via two specific variables \eqn{Y} and \eqn{Z} (two ordinal or nominal factors with different number of levels \eqn{n_Y} and \eqn{n_Z}), never jointly observed and respectively stored in two distinct databases A and B,
Optimal Transportation (OT) algorithms (see the models \code{OUTCOME}, \code{R_OUTCOME}, \code{JOINT}, and \code{R_JOINT} of the reference (2) for more details)
propose methods for the recoding of \eqn{Y} in B and/or \eqn{Z} in A. Outputs from the functions \code{OT_outcome} and \code{OT_joint} so provides the related predictions to \eqn{Y} in B and/or \eqn{Z} in A,
and from these results, the function \code{verif_OT} provides a set of tools (optional or not, depending on the choices done by user in input) to estimate:
\enumerate{
\item the association between \eqn{Y} and \eqn{Z} after recoding
\item the similarities between observed and predicted distributions
\item the stability of the predictions proposed by the algorithm
}
A. PAIRWISE ASSOCIATION BETWEEN \eqn{Y} AND \eqn{Z}
The first step uses standard criterions (Cramer's V, and Spearman's rank correlation coefficient) to evaluate associations between two ordinal variables in both databases or in only one database.
When the argument \code{group.class = TRUE}, these informations can be completed by those provided by the function \code{\link{error_group}}, which is directly integrate in the function \code{verif_OT}.
Assuming that \eqn{n_Y > n_Z}, and that one of the two scales of \eqn{Y} or \eqn{Z} is unknown, this function gives additional informations about the potential link between the levels of the unknown scale.
The function proceeds to this result in two steps. Firsty, \code{\link{error_group}} groups combinations of modalities of \eqn{Y} to build all possible variables \eqn{Y'} verifying \eqn{n_{Y'} = n_Z}.
Secondly, the function studies the fluctuations in the association of \eqn{Z} with each new variable \eqn{Y'} by using adapted comparisons criterions (see the documentation of \code{\link{error_group}} for more details).
If grouping successive classes of \eqn{Y} leads to an improvement in the initial association between \eqn{Y} and \eqn{Z} then it is possible to conclude in favor of an ordinal coding for \eqn{Y} (rather than nominal)
but also to emphasize the consistency in the predictions proposed by the algorithm of fusion.
B. SIMILARITIES BETWEEN OBSERVED AND PREDICTED DISTRIBUTIONS
When the predictions of \eqn{Y} in B and/or \eqn{Z} in A are available in the \code{datab} argument, the similarities between the observed and predicted probabilistic distributions of \eqn{Y} and/or \eqn{Z} are quantified from the Hellinger distance (see (1)).
This measure varies between 0 and 1: a value of 0 corresponds to a perfect similarity while a value close to 1 (the maximum) indicates a great dissimilarity.
Using this distance, two distributions will be considered as close as soon as the observed measure will be less than 0.05.
C. STABILITY OF THE PREDICTIONS
These results are based on the decision rule which defines the stability of an algorithm in A (or B) as its average ability to assign a same prediction
of \eqn{Z} (or \eqn{Y}) to individuals that have a same given profile of covariates \eqn{X} and a same given level of \eqn{Y} (or \eqn{Z} respectively).
Assuming that the missing information of \eqn{Z} in base A was predicted from an OT algorithm (the reasoning will be identical with the prediction of \eqn{Y} in B, see (2) and (3) for more details), the function \code{verif_OT} uses the conditional probabilities stored in the
object \code{estimatorZA} (see outputs of the functions \code{\link{OT_outcome}} and \code{\link{OT_joint}}) which contains the estimates of all the conditional probabilities of \eqn{Z} in A, given a profile of covariates \eqn{x} and given a level of \eqn{Y = y}.
Indeed, each individual (or row) from A, is associated with a conditional probability \eqn{P(Z= z|Y= y, X= x)} and averaging all the corresponding estimates can provide an indicator of the predictions stability.
The function \code{\link{OT_joint}} provides the individual predictions for subject \eqn{i}: \eqn{\widehat{z}_i}, \eqn{i=1,\ldots,n_A} according to the the maximum a posteriori rule:
\deqn{\widehat{z}_i= \mbox{argmax}_{z\in \mathcal{Z}} P(Z= z| Y= y_i, X= x_i)}
The function \code{\link{OT_outcome}} directly deduces the individual predictions from the probablities \eqn{P(Z= z|Y= y, X= x)} computed in the second part of the algorithm (see (3)).
It is nevertheless common that conditional probabilities are estimated from too rare covariates profiles to be considered as a reliable estimate of the reality.
In this context, the use of trimmed means and standard deviances is suggested by removing the corresponding probabilities from the final computation.
In this way, the function provides in output a table (\code{eff.neig} object) that provides the frequency of these critical probabilities that must help the user to choose.
According to this table, a minimal number of profiles can be imposed for a conditional probability to be part of the final computation by filling in the \code{min.neigb} argument.
Notice that these results are optional and available only if the argument \code{stab.prob = TRUE}.
When the predictions of \eqn{Z} in A and \eqn{Y} in B are available, the function \code{verif_OT} provides in output, global results and results by database.
The \code{res.stab} table can produce NA with \code{OT_outcome} output in presence of incomplete shared variables: this problem appears when the \code{prox.dist} argument is set to 0 and can
be simply solved by increasing this value.
}
\examples{
### Example 1
#-----
# - Using the data simu_data
# - Studying the proximity between Y and Z using standard criterions
# - When Y and Z are predicted in B and A respectively
# - Using an outcome model (individual assignment with knn)
#-----
data(simu_data)
outc1 <- OT_outcome(simu_data,
quanti = c(3, 8), nominal = c(1, 4:5, 7), ordinal = c(2, 6),
dist.choice = "G", percent.knn = 0.90, maxrelax = 0,
convert.num = 8, convert.class = 3,
indiv.method = "sequential", which.DB = "BOTH", prox.dist = 0.30
)
verif_outc1 <- verif_OT(outc1)
verif_outc1
\donttest{
### Example 2
#-----
# - Using the data simu_data
# - Studying the proximity between Y and Z using standard criterions and studying
# associations by grouping levels of Z
# - When only Y is predicted in B
# - Tolerated distance between a subject and a profile: 0.30 * distance max
# - Using an outcome model (individual assignment with knn)
#-----
data(simu_data)
outc2 <- OT_outcome(simu_data,
quanti = c(3, 8), nominal = c(1, 4:5, 7), ordinal = c(2, 6),
dist.choice = "G", percent.knn = 0.90, maxrelax = 0, prox.dist = 0.3,
convert.num = 8, convert.class = 3,
indiv.method = "sequential", which.DB = "B"
)
verif_outc2 <- verif_OT(outc2, group.class = TRUE, ordinal = TRUE)
verif_outc2
### Example 3
#-----
# - Using the data simu_data
# - Studying the proximity between Y and Z using standard criterions and studying
# associations by grouping levels of Z
# - Studying the stability of the conditional probabilities
# - When Y and Z are predicted in B and A respectively
# - Using an outcome model (individual assignment with knn)
#-----
verif_outc2b <- verif_OT(outc2, group.class = TRUE, ordinal = TRUE, stab.prob = TRUE, min.neigb = 5)
verif_outc2b
}
}
\references{
\enumerate{
\item Liese F, Miescke K-J. (2008). Statistical Decision Theory: Estimation, Testing, and Selection. Springer
\item Gares V, Dimeglio C, Guernec G, Fantin F, Lepage B, Korosok MR, savy N (2019). On the use of optimal transportation theory to recode variables and application to database merging. The International Journal of Biostatistics.
Volume 16, Issue 1, 20180106, eISSN 1557-4679. doi:10.1515/ijb-2018-0106
\item Gares V, Omer J (2020) Regularized optimal transport of covariates and outcomes in data recoding. Journal of the American Statistical Association. \doi{10.1080/01621459.2020.1775615}
}
}
\seealso{
\code{\link{OT_outcome}}, \code{\link{OT_joint}}, \code{\link{proxim_dist}}, \code{\link{error_group}}
}
\author{
Gregory Guernec
\email{otrecod.pkg@gmail.com}
}