
Margins Logregr: New interface + functionality for interaction terms

Pivotal Tracker: 67684630, 60733090

Shengwen Yang <syang@gopivotal.com>
Qian, Hai <hqian@gopivotal.com>

Changes:
- Deprecated the old interface and introduced a new single 'margins'
function.
- The new function takes the model table from a regression as input and
does not rerun the underlying regression. The 'margins' function detects
the regression method from the model summary table and runs the
appropriate calculation.
- If interaction terms are present among the independent variables, an
x_design string describing the interactions is expected.
@@ -5,11 +5,11 @@ \begin{moduleinfo}
 \item[Authors] {Rahul Iyer and Hai Qian}
 \item[History]
-\begin{modulehistory}
+\begin{modulehistory}
 \item[v0.3] Added section on Clustered Sandwich Estimators
 \item[v0.2] Added section on Marginal Effects
-\item[v0.1] Initial version, including background of regularization
-\end{modulehistory}
+\item[v0.1] Initial version, including background of regularization
+\end{modulehistory}
 \end{moduleinfo}

 \newcommand{\bS}[1]{\boldsymbol{#1}}

@@ -718,19 +718,29 @@ \section{Marginal Effects} % (fold)
 linear function of $(x_1, \dots, x_m) = X$ and $y$ is a continuous variable, a
 linear regression model can be stated as follows:
 \begin{align*}
-    & y = X' \beta \\
+    & y = X^T\beta \\
     & \text{or} \\
     & y = \beta_0 + \beta_1 x_1 + \dots + \beta_m x_m.
 \end{align*}
 From the above equation it is straightforward to see that the marginal effect of
-variable $x_k$ on the dependent variable is $\partial y / \partial x_k = \beta_k$.
+variable $x_k$ on the dependent variable is $\partial y / \partial x_k =
+\beta_k$. However, this holds only when there are no
+interactions between the variables. If there are interactions, the
+model becomes
+\begin{align*}
+    & y = F^T\beta \\
+    & \text{or} \\
+    & y = \beta_0 + \beta_1 f_1 + \dots + \beta_m f_m,
+\end{align*}
+where each $f_i$ is a function of the base variables $x_1, x_2, \dots, x_l$
+that describes the interaction between the base variables.

 The standard approach to modeling dichotomous/binary variables
 (so $y \in \{0, 1\}$) is to estimate a generalized linear model under the
 assumption that $y$ follows some form of Bernoulli distribution. Thus the expected
 value of $y$ becomes,
 \begin{equation*}
-    y = G(X' \beta),
+    y = G(X^T \beta),
 \end{equation*}
 where $G$ is the specified binomial distribution. Here we assume logistic
 regression and use $g$ to refer to the inverse logit function.
@@ -739,15 +749,34 @@ \subsection{Logistic regression} % (fold)
 \label{sub:logistic_regression}
 In logistic regression:
 \begin{align*}
-    P &= \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \dots + \beta_m x_m)}} \\
+    P &= \frac{1}{1 + e^{-(\beta_0 + \beta_1 f_1 + \dots + \beta_m f_m)}} \\
       &= \frac{1}{1 + e^{-z}}
 \end{align*}
 \begin{align*}
-    \implies \frac{\partial P}{\partial x_k} &= \beta_k \cdot \frac{1}{1 + e^{-z}} \cdot
-        \frac{e^{-z}}{1 + e^{-z}} \\
-    &= \beta_k \cdot P \cdot (1-P)
+    \implies \frac{\partial P}{\partial x_k} &= P \cdot (1-P) \cdot
+    \frac{\partial z}{\partial x_k},
 \end{align*}
-
+where the partial derivative in the last equation equals $\beta_k$
+when there are no interaction terms. In general, however, it has no
+simpler closed form, so we leave it as it is.
+
+For categorical variables, things are slightly more complicated. Dummy
+variables are created for each categorical variable, and there
+are two options. First, we can treat the dummy variables as if they
+were continuous and use the above equation to compute their
+marginal effects. Second, for each dummy variable we can compute the discrete change with
+respect to the reference level of the categorical variable:
+\begin{align*}
+    \Delta_{x_k^{(v)}} P = \left.P\right\vert_{x_k^{(v)}=1,\ x_k^{(w)}=0\ (w\neq
+    v)} - \left.P\right\vert_{x_k^{(0)}=1,\ x_k^{(w)}=0\ (w \neq 0)} = P_{set} - P_{unset},
+\end{align*}
+where $x_k$ is a categorical variable, $v$ and $w$ denote its levels,
+and $0$ denotes its reference level. Note that in many cases the dummy
+variable for the reference level does not appear in the regression
+model, so setting $x_k^{(w)}$ for $w\neq 0$ suffices in the second
+term of the above equation.
+Both options are valid. MADlib defaults to the second.

 There are two main methods of calculating the marginal effects for dichotomous
 dependent variables.
@@ -755,9 +784,9 @@ \subsection{Logistic regression} % (fold)
 \item The first uses the average of the marginal effects at every sample
 observation. This is calculated as follows:
 \begin{gather*}
-    \frac{\partial y}{\partial x_k} = \beta_k \frac{\sum_{i=1}^{n} P(y_i = 1)(1-P(y_i = 1))}{n}, \\
-    \text{where, } P(y_i=1) = g(X^{(i)} \beta) \\
-    \text{and, } g(z) = \frac{1}{1 + e^{-z}} \\
+    \left\langle \frac{\partial P(y_i = 1)}{\partial x_k} \right\rangle =
+        \frac{1}{n}\sum_{i=1}^{n} P(y_i = 1)(1-P(y_i = 1))\cdot\frac{\partial z_i}{\partial x_k}, \\
+    \text{where } P(y_i=1) = \frac{1}{1 + e^{-z_i}}, \\
+    \text{and } z_i = F(X_i)^T\beta \\
 \end{gather*}

 \item The second approach calculates the marginal effect for $x_k$ by taking
@@ -765,12 +794,19 @@ \subsection{Logistic regression} % (fold)
 value from the same formulation with the exception of adding one unit to $x_k$.
 The derivation of this marginal effect is captured by the following:
 \begin{gather*}
-    \frac{\partial y}{\partial x_k} = \beta_k P(y=1|\bar{X})(1-P(y=1|\bar{X})) \\
-    \text{where, } \bar{X} = \frac{\sum_{i=1}^{n}X^{(i)}}{n}
+    \left.\frac{\partial P(y=1)}{\partial x_k}\right\vert_{X=\bar{X}} =
+        P(y=1|\bar{X})(1-P(y=1|\bar{X}))
+        \left.\frac{\partial z}{\partial x_k}\right\vert_{X=\bar{X}} \\
+    \text{where } \bar{X} = \frac{\sum_{i=1}^{n}X_i}{n}
 \end{gather*}
 \end{enumerate}
 % subsection logistic_regression (end)

+For categorical variables, we do the same thing: either evaluate the
+marginal effect for each data record and average, or
+evaluate the marginal effect at the means of the variables.
+
 \subsection{Discrete change effect} % (fold)
 \label{sub:discrete_change_effect}
 Along with marginal effects we can also compute the following discrete change

@@ -840,68 +876,74 @@ \subsection{Standard Errors} % (fold)
 function. The delta method therefore relies on finding a linear approximation
 of the function by using a first-order Taylor expansion.
-We can approximate a function $f(x)$ about a value $a$ as,
+We can approximate a function $g(x)$ about a value $a$ as,
-\[ f(x) \approx f(a) + (x-a)f'(a) \]
+\[ g(x) \approx g(a) + (x-a)g'(a) \]
 Taking the variance and setting $a = \mu_x$,
-\[ Var(f(X)) \approx \left[f'(\mu_x)\right]^2 Var(X) \]
+\[ Var(g(X)) \approx \left[g'(\mu_x)\right]^2 Var(X) \]

 \subsubsection*{Logistic Regression}
-Using this technique, to compute the variance of the marginal effects at the
-mean observation value in \emph{logistic regression}, we obtain:
-\begin{gather*}
-    Var(ME_k) = \frac{\partial (\beta_k \bar{P} (1- \bar{P}))}{\partial \beta_k} Var(\beta_k),\\
-    \text{where, } \bar{P} = g(\bar{X}' \beta) = \frac{1}{1 + e^{-\bar{z}}} \\
-    \text{and } \bar{z} = \beta_0 + \beta_1 \bar{x}_1 + \dots + \beta_m \bar{x}_m
-\end{gather*}
-
-Thus, using the rule for differentiating compositions of functions, we get
+Using this technique, to compute the variance of the marginal effects
+at the mean observation value in \emph{logistic regression}, we obtain
+the standard error by first computing the derivative of the marginal
+effects with respect to the coefficients, which is an $n\times m$ matrix
+$S_{kl} = \frac{\partial \mathit{ME}_k}{\partial \beta_l}$:
+\begin{eqnarray*}
+    S_{kl} &=& \frac{\partial}{\partial\beta_l} \left[P (1- P)
+        \cdot \frac{\partial z}{\partial x_k}\right]\\
+    &=& P (1- P) \cdot \frac{\partial}{\partial\beta_l}\left(\frac{\partial z}{\partial x_k}\right) +
+        \frac{\partial \left[P (1- P)\right]}{\partial\beta_l} \cdot \frac{\partial z}{\partial x_k}\\
+    &=& P(1-P)\cdot\frac{\partial^2 z}{\partial x_k\partial\beta_l} +
+        P(1-P)(1-2P) \cdot \frac{\partial z}{\partial \beta_l} \cdot \frac{\partial z}{\partial x_k},\\
+    \text{where } P &=& \frac{1}{1 + e^{-z}} \\
+    \text{and } z &=& \beta_0 + \beta_1 f_1(X)+ \dots + \beta_m f_m(X),
+        \quad X = (x_1, x_2, \dots, x_n).
+\end{eqnarray*}
+For categorical variables, we replace $P(1-P)\cdot(\partial z/\partial x_k)$
+with $\Delta_{x_k^{(v)}}P$ in the first equation above, which gives
+\begin{eqnarray*}
+    S_{kl} &=& \frac{\partial(P_{set}-P_{unset})}{\partial\beta_l} \\
+    &=& P_{set} (1 - P_{set}) \cdot f_{l_{set}} - P_{unset} (1 - P_{unset}) \cdot f_{l_{unset}}
+\end{eqnarray*}
+
+Thus, the variance of the marginal effects is
 \begin{align*}
-    Var(ME_k) & = \left(-\beta_k \bar{P} \frac{\partial \bar{P}}{\partial \beta_k} +
-        \beta_k (1-\bar{P})\frac{\partial \bar{P}}{\partial \beta_k} +
-        \bar{P}(1-\bar{P}) \right) Var(\beta_k) \\
-    & = \left( (1-2\bar{P})\beta_k \frac{\partial \bar{P}}{\partial \beta_k} + \bar{P}(1-\bar{P}) \right) Var(\beta_k)
-\end{align*}
-We have,
-\begin{align*}
-    \frac{\partial \bar{P}}{\partial \beta_k} & = \frac{\partial (\frac{1}{1 + e^{-z}})}{\partial \beta_k} \\
-    & = \frac{1}{(1+e^{-z})^2} e^{-z} \frac{\partial z}{\partial \beta_k} \\
-    & = \frac{x_k e^{-z}}{(1+e^{-z})^2} \\
-    & = x_k \bar{P} (1 - \bar{P})
-\end{align*}
-Replacing this in the equation for $Var(ME_k)$,
-
-\begin{align*}
-    Var(ME_k) = \bar{P}(1-\bar{P}) \left(1 + (1-2\bar{P})\beta_k x_k \right) Var(\beta_k)
+    Var(\mathit{ME}) = S \cdot Var(\beta)\cdot S^T,
 \end{align*}
+where $Var(\beta)$ is an $m\times m$ matrix and $S$ is an $n\times m$
+matrix; $n$ is the number of distinct base variables, and $m$ is the
+number of coefficients $\beta_i$.

-Since $\beta$ is a multivariate variable, we will have to use the variance-
-covariance matrix of $\beta$ to compute the variance of the marginal effects.
-Thus for the vector of marginal effects the equation becomes,
+Note: $Var(\beta)$ is computed with respect to the training data
+used for the logistic regression, not the data used to compute the
+marginal effects (if a different data set is used for the
+marginal effects).
-\begin{gather*}
-    Var(ME) = \bar{P}^2(1-\bar{P})^2 \left[I + (1-2\bar{P})\beta \bar{X}' \right] V \left[I+ (1-2\bar{P}) \bar{X} \beta' \right],
-\end{gather*}
-where $V$ is the estimated variance-covariance matrix of $\beta$.
+Using the definition of $z$, we can simplify $S$ a little:
+\begin{equation}
+    S_{kl} = P(1-P)\left(\frac{\partial f_l}{\partial x_k} +
+        (1-2P)\cdot f_l\sum_{i=1}^{m}\beta_i\frac{\partial f_i(X)}{\partial x_k}\right)
+\end{equation}
+So we only need to compute $\partial f_i/\partial x_k$; all the
+other derivatives follow.

 \subsubsection*{Multinomial Logistic Regression}
 For multinomial logistic regression, the coefficients $\beta$ form a matrix of
 dimension $(J-1) \times K$ where $J$ is the number of categories and $K$ is the
 number of features. In order to compute the standard errors on the marginal
 effects of category $j$ for independent variable $k$, we need to compute
-the term $\frac{\partial ME_{k,j}} {\partial \beta_{k_1, j_1}}$ for each
+the term $\frac{\partial \mathit{ME}_{k,j}} {\partial \beta_{k_1, j_1}}$ for each
 $k_1 \in \{1 \ldots K \}$ and $j_1 \in \{1 \ldots J-1 \}$.
 The result is a column vector of length $K \times (J-1)$ denoted by
-$\frac{\partial ME_{k,j}}{\partial \vec{\beta}}$. Hence, for each category
+$\frac{\partial \mathit{ME}_{k,j}}{\partial \vec{\beta}}$. Hence, for each category
 $j \in \{1 \ldots J\}$ and independent variable $k \in \{1 \ldots K\}$, we
 perform the following computation
 \begin{equation}
-    Var(ME_{j,k}) = \frac{\partial ME_{k,j}}{\partial \vec{\beta}}^T V \frac{\partial ME_{k,j}}{\partial \vec{\beta}}.
+    Var(\mathit{ME}_{j,k}) = \frac{\partial \mathit{ME}_{k,j}}{\partial \vec{\beta}}^T V \frac{\partial \mathit{ME}_{k,j}}{\partial \vec{\beta}}.
 \end{equation}
 where $V$ is the variance-covariance matrix of the multinomial logistic
 regression.

@@ -911,13 +953,13 @@ \subsubsection*{Multinomial Logistic Regression}
 From our earlier derivation, we know that the marginal effects for multinomial
 logistic regression are:
 \begin{gather*}
-    \frac{ME_{j,k}}{\partial x} = \bar{P}_j \left[ \beta_{kj} - \sum_{l=1}^{j}\beta_{kl} \bar{P}_l \right]
+    \frac{\mathit{ME}_{j,k}}{\partial x} = \bar{P}_j \left[ \beta_{kj} - \sum_{l=1}^{j}\beta_{kl} \bar{P}_l \right]
 \end{gather*}
 where
 \begin{gather*}
     \bar{P}_j = \frac{e^{X\beta_{j,.}}}{\sum_{l=1}^{j} e^{X\beta_{l,.}}} \ \ \ \forall j \in \{ 1 \ldots J \}
 \end{gather*}
-We now compute the term $\frac{\partial ME_{k,j}}{\partial \vec{\beta}}$. First,
+We now compute the term $\frac{\partial \mathit{ME}_{k,j}}{\partial \vec{\beta}}$. First,
 we define the following three indicator variables:
 \begin{gather*}
     e_{j,j_1} = \begin{cases} 1 & \mbox{if } j=j_1 \\ 0 & \mbox{otherwise} \end{cases}
 \end{gather*}
@@ -933,7 +975,7 @@ \subsubsection*{Multinomial Logistic Regression}
 Using the above definition, we can show that for each $j_1 \in \{ 1 \ldots J \}$
 and $k_1 \in \{1 \ldots K\}$, the partial derivative
 \begin{align*}
-    \frac{\partial ME_{k,j}}{\partial \beta_{j_1, k_1}} &= \frac{\partial \bar{P}_{j}}{\partial \beta_{j_1, k_1}}
+    \frac{\partial \mathit{ME}_{k,j}}{\partial \beta_{j_1, k_1}} &= \frac{\partial \bar{P}_{j}}{\partial \beta_{j_1, k_1}}
         + \bar{P}_j \Bigg{[} e_{j,j_1,k,k_1} - e_{k,k_1} \bar{P}_{j_1} -
         \sum_{l=1}^{j} \beta_{l,k} \frac{\partial \bar{P}_l}{\partial \beta_{j_1, k_1}} \Bigg{]}
 \end{align*}
@@ -944,16 +986,16 @@ \subsubsection*{Multinomial Logistic Regression}
 The two expressions above can be simplified to obtain
 \begin{align*}
-    \frac{\partial ME_{k,j}}{\partial \beta_{j_1, k_1}} &= \bar{P}_j \bar{X}_{k_1} [e_{j,j_1} - \bar{P}_{j_1}] [\beta_{j,k} - \beta_{.k}^T \bar{P}]
+    \frac{\partial \mathit{ME}_{k,j}}{\partial \beta_{j_1, k_1}} &= \bar{P}_j \bar{X}_{k_1} [e_{j,j_1} - \bar{P}_{j_1}] [\beta_{j,k} - \beta_{.k}^T \bar{P}]
         + \bar{P}_j \Bigg{[} e_{j,j_1,k,k_1} - e_{k,k_1} \bar{P}_{j_1} -
         \bar{X}_{k_1} \bar{P}_{j_1} ( \beta_{k,k_1} - \beta_{.k}^T \bar{P}) \Bigg{]} \\
-    &= \bar{X}_{k_1} \bar{ME}_{j,k} [ e_{j,j_1} - \bar{P}_{j_1} ] +
-        \bar{P}_j [e_{k,k_1,j,j_1} - e_{k,k_1}\bar{P}_{j_1} - \bar{X}_{k_1} \bar{ME}_{j_1,k} ]
+    &= \bar{X}_{k_1} \bar{\mathit{ME}}_{j,k} [ e_{j,j_1} - \bar{P}_{j_1} ] +
+        \bar{P}_j [e_{k,k_1,j,j_1} - e_{k,k_1}\bar{P}_{j_1} - \bar{X}_{k_1} \bar{\mathit{ME}}_{j_1,k} ]
 \end{align*}
-where $\bar{ME}$ is the marginal effects computed at the mean observation $\bar{X}$.
+where $\bar{\mathit{ME}}$ is the marginal effects computed at the mean observation $\bar{X}$.
 % subsection standard_errors (end)

src/config/Version.yml (2 changes)
@@ -1 +1 @@
-version: 1.5
+version: 1.5dev

src/modules/regress/logistic.cpp (116 changes)
@@ -35,7 +35,7 @@
 enum { IN_PROCESS, COMPLETED, TERMINATED, NULL_EMPTY };

 // Internal functions
 AnyType stateToResult(const Allocator &inAllocator,
     const HandleMap<const ColumnVector, TransparentHandle<double> >& inCoef,
-    const ColumnVector &diagonal_of_inverse_of_X_transp_AX,
+    const Matrix &hessian,
     const double &logLikelihood,
     const double &conditionNo, int status,
     const uint64_t &numRows);

@@ -200,7 +200,7 @@ class LogRegrCGTransitionState {
 /**
  * @brief Logistic function
  */
 inline double sigma(double x) {
-    return 1. / (1. + std::exp(-x));
+    return 1. / (1. + std::exp(-x));
 }

@@ -308,16 +308,16 @@ logregr_cg_step_final::run(AnyType &args) {
     // Note: k = state.iteration
     if (state.iteration == 0) {
-        // Iteration computes the gradient
+        // Iteration computes the gradient
-        state.dir = state.gradNew;
-        state.grad = state.gradNew;
-    } else {
+        state.dir = state.gradNew;
+        state.grad = state.gradNew;
+    } else {
         // We use the Hestenes-Stiefel update formula:
         //
-        //            g_k^T (g_k - g_{k-1})
-        // beta_k = -------------------------
-        //          d_{k-1}^T (g_k - g_{k-1})
+        //            g_k^T (g_k - g_{k-1})
+        // beta_k = -------------------------
+        //          d_{k-1}^T (g_k - g_{k-1})
         ColumnVector gradNewMinusGrad = state.gradNew - state.grad;
         state.beta = dot(state.gradNew, gradNewMinusGrad)
@@ -341,8 +341,8 @@ logregr_cg_step_final::run(AnyType &args) {
         // d_k = g_k - beta_k * d_{k-1}
         state.dir = state.gradNew - state.beta * state.dir;
-        state.grad = state.gradNew;
-    }
+        state.grad = state.gradNew;
+    }

     // H_k = - X^T A_k X
     // where A_k = diag(a_1, ..., a_n) and a_i = sigma(x_i c_{k-1}) sigma(-x_i c_{k-1})
@@ -405,7 +405,7 @@ internal_logregr_cg_result::run(AnyType &args) {
         state.X_transp_AX, EigenvaluesOnly, ComputePseudoInverse);

     return stateToResult(*this, state.coef,
-        decomposition.pseudoInverse().diagonal(), state.logLikelihood,
+        state.X_transp_AX, state.logLikelihood,
         decomposition.conditionNo(), state.status, state.numRows);
 }

@@ -693,9 +693,9 @@ AnyType logregr_irls_step_final::run(AnyType &args) {
     // of X^T A X, so that we don't have to recompute it in the result function.
     // Likewise, we store the condition number.
     // FIXME: This feels a bit like a hack.
-    state.X_transp_Az = inverse_of_X_transp_AX.diagonal();
-    state.X_transp_AX(0,0) = decomposition.conditionNo();
-
+    // state.X_transp_Az = inverse_of_X_transp_AX.diagonal();
+    // state.X_transp_AX(0,0) = decomposition.conditionNo();
+    state.X_transp_Az(0) = decomposition.conditionNo();

     return state;
 }

@@ -724,8 +724,8 @@ AnyType internal_logregr_irls_result::run(AnyType &args) {
     if (state.status == NULL_EMPTY)
         return Null();

-    return stateToResult(*this, state.coef, state.X_transp_Az,
-        state.logLikelihood, state.X_transp_AX(0,0),
+    return stateToResult(*this, state.coef, state.X_transp_AX,
+        state.logLikelihood, state.X_transp_Az(0),
         state.status, state.numRows);
 }

@@ -800,16 +800,16 @@ class LogRegrIGDTransitionState {
             throw std::logic_error("Internal error: Incompatible transition "
                 "states");

-        // Compute the average of the models. Note: The following remains an
+        // Compute the average of the models. Note: The following remains an
         // invariant, also after more than one merge:
         // The model is a linear combination of the per-segment models
         // where the coefficient (weight) for each per-segment model is the
         // ratio "# rows in segment / total # rows of all segments merged so
        // far".
-        double totalNumRows = static_cast<double>(numRows)
+        double totalNumRows = static_cast<double>(numRows)
             + static_cast<double>(inOtherState.numRows);
-        coef = double(numRows) / totalNumRows * coef
-            + double(inOtherState.numRows) / totalNumRows * inOtherState.coef;
+        coef = double(numRows) / totalNumRows * coef +
+            double(inOtherState.numRows) / totalNumRows * inOtherState.coef;

         numRows += inOtherState.numRows;
         X_transp_AX += inOtherState.X_transp_AX;
@@ -824,7 +824,7 @@ class LogRegrIGDTransitionState {
     /**
      * @brief Reset the inter-iteration fields.
      */
     inline void reset() {
-        // FIXME: HAYING: stepsize is hard-coded here now
+        // FIXME: HAYING: stepsize is hard-coded here now
         stepsize = .01;
         numRows = 0;
         X_transp_AX.fill(0);
@@ -870,7 +870,7 @@ class LogRegrIGDTransitionState {
     typename HandleTraits<Handle>::ColumnVectorTransparentHandleMap coef;
     typename HandleTraits<Handle>::ReferenceToUInt64 numRows;
-    typename HandleTraits<Handle>::MatrixTransparentHandleMap X_transp_AX;
+    typename HandleTraits<Handle>::MatrixTransparentHandleMap X_transp_AX;
     typename HandleTraits<Handle>::ReferenceToDouble logLikelihood;
     typename HandleTraits<Handle>::ReferenceToUInt16 status;
 };

@@ -1025,7 +1025,7 @@ internal_logregr_igd_result::run(AnyType &args) {
         state.X_transp_AX, EigenvaluesOnly, ComputePseudoInverse);

     return stateToResult(*this, state.coef,
-        decomposition.pseudoInverse().diagonal(),
+        state.X_transp_AX,
         state.logLikelihood, decomposition.conditionNo(),
         state.status, state.numRows);
 }

@@ -1039,12 +1039,18 @@ internal_logregr_igd_result::run(AnyType &args) {
 AnyType stateToResult(
     const Allocator &inAllocator,
     const HandleMap<const ColumnVector, TransparentHandle<double> > &inCoef,
-    const ColumnVector &diagonal_of_inverse_of_X_transp_AX,
+    const Matrix &hessian,
     const double &logLikelihood,
     const double &conditionNo, int status,
     const uint64_t &numRows) {

+    SymmetricPositiveDefiniteEigenDecomposition<Matrix> decomposition(
+        hessian, EigenvaluesOnly, ComputePseudoInverse);
+
+    const Matrix &inverse_of_X_transp_AX = decomposition.pseudoInverse();
+    const ColumnVector &diagonal_of_X_transp_AX = inverse_of_X_transp_AX.diagonal();
+
     MutableNativeColumnVector stdErr(
         inAllocator.allocateArray<double>(inCoef.size()));
     MutableNativeColumnVector waldZStats(
@@ -1055,23 +1061,22 @@ AnyType stateToResult(
         inAllocator.allocateArray<double>(inCoef.size()));

     for (Index i = 0; i < inCoef.size(); ++i) {
-        stdErr(i) = std::sqrt(diagonal_of_inverse_of_X_transp_AX(i));
+        stdErr(i) = std::sqrt(diagonal_of_X_transp_AX(i));
         waldZStats(i) = inCoef(i) / stdErr(i);
         waldPValues(i) = 2. * prob::cdf( prob::normal(),
             -std::abs(waldZStats(i)));
         oddsRatios(i) = std::exp( inCoef(i) );
     }

+    // Return all coefficients, standard errors, etc. in a tuple
     AnyType tuple;
     tuple << inCoef << logLikelihood << stdErr << waldZStats << waldPValues
-        << oddsRatios << sqrt(conditionNo) << status << numRows;
+        << oddsRatios << inverse_of_X_transp_AX
+        << sqrt(decomposition.conditionNo()) << status << numRows;
     return tuple;
 }

-
-
-
 // ---------------------------------------------------------------------------
 //             Robust Logistic Regression States
 // ---------------------------------------------------------------------------

@@ -1187,7 +1192,7 @@ class RobustLogRegrTransitionState {
      */
     void rebind(uint16_t inWidthOfX) {
-        iteration.rebind(&mStorage[0]);
+        iteration.rebind(&mStorage[0]);
         widthOfX.rebind(&mStorage[1]);
         coef.rebind(&mStorage[2], inWidthOfX);
         numRows.rebind(&mStorage[2 + inWidthOfX]);
@@ -1202,7 +1207,7 @@ class RobustLogRegrTransitionState {

-    typename HandleTraits<Handle>::ReferenceToUInt32 iteration;
+    typename HandleTraits<Handle>::ReferenceToUInt32 iteration;
     typename HandleTraits<Handle>::ReferenceToUInt16 widthOfX;
     typename HandleTraits<Handle>::ColumnVectorTransparentHandleMap coef;

@@ -1219,13 +1224,13 @@ class RobustLogRegrTransitionState {
 AnyType robuststateToResult(
     const Allocator &inAllocator,
-    const ColumnVector &inCoef,
+    const ColumnVector &inCoef,
     const ColumnVector &diagonal_of_varianceMat) {

-    MutableNativeColumnVector variance(
+    MutableNativeColumnVector variance(
         inAllocator.allocateArray<double>(inCoef.size()));
-    MutableNativeColumnVector coef(
+    MutableNativeColumnVector coef(
         inAllocator.allocateArray<double>(inCoef.size()));
     MutableNativeColumnVector stdErr(

@@ -1332,23 +1337,23 @@ robust_logregr_step_final::run(AnyType &args) {
     // We request a mutable object. Depending on the backend, this might perform
     // a deep copy.
     RobustLogRegrTransitionState<MutableArrayHandle<double> > state = args[0];
-    // Aggregates that haven't seen any data just return Null.
+    // Aggregates that haven't seen any data just return Null.
     if (state.numRows == 0)
         return Null();

-    //Compute the robust variance with the White sandwich estimator
-    SymmetricPositiveDefiniteEigenDecomposition<Matrix> decomposition(
+    // Compute the robust variance with the White sandwich estimator
+    SymmetricPositiveDefiniteEigenDecomposition<Matrix> decomposition(
         state.X_transp_AX, EigenvaluesOnly, ComputePseudoInverse);
-    Matrix bread = decomposition.pseudoInverse();
+    Matrix bread = decomposition.pseudoInverse();

-    /*
+    /*
         This is written a little strangely because it prevents Eigen warnings.
         The following two lines are equivalent to:
             Matrix variance = bread*state.meat*bread;
         but eigen throws a warning on that.
-    */
-    Matrix varianceMat;// = meat;
+    */
+    Matrix varianceMat; // = meat;
     varianceMat = bread*state.meat*bread;

     /*
@@ -1478,7 +1483,7 @@ class MarginalLogRegrTransitionState {
      *   - 3 + widthOfX: X_transp_AX (X^T A X)
      */
     void rebind(uint16_t inWidthOfX) {
-        iteration.rebind(&mStorage[0]);
+        iteration.rebind(&mStorage[0]);
         widthOfX.rebind(&mStorage[1]);
         coef.rebind(&mStorage[2], inWidthOfX);
         numRows.rebind(&mStorage[2 + inWidthOfX]);
@@ -1491,7 +1496,7 @@ class MarginalLogRegrTransitionState {
   public:
-    typename HandleTraits<Handle>::ReferenceToUInt32 iteration;
+    typename HandleTraits<Handle>::ReferenceToUInt32 iteration;
     typename HandleTraits<Handle>::ReferenceToUInt16 widthOfX;
     typename HandleTraits<Handle>::ColumnVectorTransparentHandleMap coef;
     typename HandleTraits<Handle>::ReferenceToUInt64 numRows;

@@ -1601,8 +1606,8 @@ marginal_logregr_step_transition::run(AnyType &args) {
     }

     // Standard error according to the delta method
-    state.delta += p * (1 - p) * delta;
-
+    state.delta += p * (1 - p) * delta;
+
     return state;
 }

@@ -1636,20 +1641,23 @@ marginal_logregr_step_final::run(AnyType &args) {
     // We request a mutable object.
     // Depending on the backend, this might perform a deep copy.
     MarginalLogRegrTransitionState<MutableArrayHandle<double> > state = args[0];
-    // Aggregates that haven't seen any data just return Null.
+    // Aggregates that haven't seen any data just return Null.
     if (state.numRows == 0)
         return Null();

     // Compute variance matrix of logistic regression
-    SymmetricPositiveDefiniteEigenDecomposition<Matrix> decomposition(
+    SymmetricPositiveDefiniteEigenDecomposition<Matrix> decomposition(
         state.X_transp_AX, EigenvaluesOnly, ComputePseudoInverse);

-    Matrix variance = decomposition.pseudoInverse();
+    Matrix variance = decomposition.pseudoInverse();
+
+    // dberr << "Delta = " << state.delta << " numrows = " << state.numRows << std::endl;
+    // dberr << "Variance = " << variance << std::endl;

     // Standard error according to the delta method
-    Matrix std_err;
-    std_err = state.delta * variance * trans(state.delta) / (state.numRows*state.numRows);
-
+    Matrix std_err;
+    std_err = state.delta * variance * trans(state.delta) / (state.numRows*state.numRows);
+
     // Computing the marginal effects
     return marginalstateToResult(*this, state.coef,

src/modules/regress/marginal.cpp (381 additions, new file)
@@ -0,0 +1,381 @@
+/* ------------------------------------------------------
+ *
+ * @file marginal.cpp
+ *
+ * @brief Marginal-effects functions for logistic regression
+ *
+ * We implement the conjugate-gradient method and the iteratively-reweighted-
+ * least-squares method.
+ *
+ *//* ----------------------------------------------------------------------- */
+#include
+#include
+#include
+#include
+#include
+#include
+#include "marginal.hpp"
+
+namespace madlib {
+
+// Use Eigen
+using namespace dbal::eigen_integration;
+
+namespace modules {
+
+// Import names from other MADlib modules
+using dbal::NoSolutionFoundException;
+
+namespace regress {
+
+// FIXME this enum should be accessed by all modules that may need grouping
+// valid status values
+enum { IN_PROCESS, COMPLETED, TERMINATED, NULL_EMPTY };
+
+inline double logistic(double x) {
+    return 1. / (1. + std::exp(-x));
+}
+
+// ---------------------------------------------------------------------------
+//      Marginal Effects Logistic Regression States
+// ---------------------------------------------------------------------------
+/**
+ * @brief State for marginal effects calculation for logistic regression
+ *
+ * TransitionState encapsulates the transition state during the
+ * marginal effects calculation for the logistic-regression aggregate function.
+ * To the database, the state is exposed as a single DOUBLE PRECISION array,
+ * to the C++ code it is a proper object containing scalars and vectors.
+ *
+ * Note: We assume that the DOUBLE PRECISION array is initialized by the
+ * database with length at least 5, and all elements are 0.
+ */
+template <class Handle>
+class MarginsLogregrInteractionState {
+    template <class OtherHandle>
+    friend class MarginsLogregrInteractionState;
+
+  public:
+    MarginsLogregrInteractionState(const AnyType &inArray)
+        : mStorage(inArray.getAs<Handle>()) {
+
+        rebind(static_cast<uint16_t>(mStorage[1]),
+               static_cast<uint16_t>(mStorage[2]),
+               static_cast<uint16_t>(mStorage[3]));
+    }
+
+    /**
+     * @brief Convert to backend representation
+     *
+     * We define this function so that we can use State in the
+     * argument list and as a return type.
+     */
+    inline operator AnyType() const {
+        return mStorage;
+    }
+
+    /**
+     * @brief Initialize the marginal variance calculation state.
+     *
+     * This function is only called for the first iteration, for the first row.
+     */
+    inline void initialize(const Allocator &inAllocator,
+                           const uint16_t inWidthOfX,
+                           const uint16_t inNumBasis,
+                           const uint16_t inNumCategoricals) {
+        mStorage = inAllocator.allocateArray<double>(
+            arraySize(inWidthOfX, inNumBasis, inNumCategoricals));
+        rebind(inWidthOfX, inNumBasis, inNumCategoricals);
+        widthOfX = inWidthOfX;
+        numBasis = inNumBasis;
+        numCategoricals = inNumCategoricals;
+    }
+
+    /**
+     * @brief We need to support assigning the previous state
+     */
+    template <class OtherHandle>
+    MarginsLogregrInteractionState &operator=(
+        const MarginsLogregrInteractionState<OtherHandle> &inOtherState) {
+
+        for (size_t i = 0; i < mStorage.size(); i++)
+            mStorage[i] = inOtherState.mStorage[i];
+        return *this;
+    }
+
+    /**
+     * @brief Merge with another State object by copying the intra-iteration
+     * fields
+     */
+    template <class OtherHandle>
+    MarginsLogregrInteractionState &operator+=(
+        const MarginsLogregrInteractionState<OtherHandle> &inOtherState) {
+
+        if (mStorage.size() != inOtherState.mStorage.size() ||
+            widthOfX != inOtherState.widthOfX)
+            throw std::logic_error("Internal error: Incompatible transition "
+                                   "states");
+
+        numRows += inOtherState.numRows;
+        marginal_effects += inOtherState.marginal_effects;
+        delta += inOtherState.delta;
+        return *this;
+    }
+
+    /**
+     * @brief Reset the inter-iteration fields.
+     */
+    inline void reset() {
+        numRows = 0;
+        marginal_effects.fill(0);
+        categorical_indices.fill(0);
+        training_data_vcov.fill(0);
+        delta.fill(0);
+    }
+
+  private:
+    static inline size_t arraySize(const uint16_t inWidthOfX,
+                                   const uint16_t inNumBasis,
+                                   const uint16_t inNumCategoricals) {
+        return 5 + inNumBasis + inNumCategoricals +
+            (inWidthOfX + inNumBasis) * inWidthOfX;
+    }
+
+    /**
+     * @brief Rebind to a new storage array
+     *
+     * @param inWidthOfX The number of independent variables.
+     */
+    void rebind(uint16_t inWidthOfX, uint16_t inNumBasis, uint16_t inNumCategoricals) {
+        iteration.rebind(&mStorage[0]);
+        widthOfX.rebind(&mStorage[1]);
+        numBasis.rebind(&mStorage[2]);
+        numCategoricals.rebind(&mStorage[3]);
+        numRows.rebind(&mStorage[4]);
+        marginal_effects.rebind(&mStorage[5], inNumBasis);
+        training_data_vcov.rebind(&mStorage[5 + inNumBasis], inWidthOfX, inWidthOfX);
+        delta.rebind(&mStorage[5 + inNumBasis + inWidthOfX * inWidthOfX],
+                     inNumBasis, inWidthOfX);
+        if (inNumCategoricals > 0)
+            categorical_indices.rebind(&mStorage[5 + inNumBasis +
+                                           (inWidthOfX + inNumBasis) * inWidthOfX],
+                                       inNumCategoricals);
+    }
+
+    Handle mStorage;
+
+  public:
+    typename HandleTraits<Handle>::ReferenceToUInt32 iteration;
+    typename HandleTraits<Handle>::ReferenceToUInt16 widthOfX;
+    typename HandleTraits<Handle>::ReferenceToUInt16 numBasis;
+    typename HandleTraits<Handle>::ReferenceToUInt16 numCategoricals;
+    typename HandleTraits<Handle>::ReferenceToUInt64 numRows;
+    typename HandleTraits<Handle>::ColumnVectorTransparentHandleMap marginal_effects;
+    typename HandleTraits<Handle>::ColumnVectorTransparentHandleMap categorical_indices;
+    typename HandleTraits<Handle>::MatrixTransparentHandleMap training_data_vcov;
+    typename HandleTraits<Handle>::MatrixTransparentHandleMap delta;
+};
+// ----------------------------------------------------------------------
+
+/**
+ * @brief Helper function that computes the final statistics for the marginal variance
+ */
+AnyType margins_stateToResult(
+    const Allocator &inAllocator,
+    const ColumnVector &diagonal_of_variance_matrix,
+    const ColumnVector inmarginal_effects_per_observation,
+    const double numRows) {
+
+    uint16_t n_basis_terms = inmarginal_effects_per_observation.size();
+    MutableNativeColumnVector marginal_effects(
+        inAllocator.allocateArray<double>(n_basis_terms));
+    MutableNativeColumnVector stdErr(
+        inAllocator.allocateArray<double>(n_basis_terms));
+    MutableNativeColumnVector tStats(
+        inAllocator.allocateArray<double>(n_basis_terms));
+    MutableNativeColumnVector pValues(
+        inAllocator.allocateArray<double>(n_basis_terms));
+
+    for (Index i = 0; i < n_basis_terms; ++i) {
+        marginal_effects(i) = inmarginal_effects_per_observation(i) / numRows;
+        stdErr(i) = std::sqrt(diagonal_of_variance_matrix(i));
+        tStats(i) = marginal_effects(i) / stdErr(i);
+
+        // P-values only make sense if numRows > coef.size()
+        if (numRows > n_basis_terms)
+            pValues(i) = 2. * prob::cdf( prob::normal(),
+                                         -std::abs(tStats(i)));
+    }
+
+    // Return all coefficients, standard errors, etc. in a tuple
+    // Note: p-values will return NULL if numRows <= coef.size
+    AnyType tuple;
+    tuple << marginal_effects
+          << stdErr
+          << tStats
+          << (numRows > n_basis_terms ? pValues : Null());
+    return tuple;
+}
+
+
+/**
+ * @brief Perform the marginal effects transition step
+ */
+AnyType
+margins_logregr_int_transition::run(AnyType &args) {
+    MarginsLogregrInteractionState<MutableArrayHandle<double> > state = args[0];
+    if (args[1].isNull()) { return args[0]; }
+    MappedColumnVector x;
+    try {
+        // an exception is raised in the backend if args[2] contains nulls
+        MappedColumnVector xx = args[1].getAs<MappedColumnVector>();
+        // x is a const reference, we can only rebind to change its pointer
+        x.rebind(xx.memoryHandle(), xx.size());
+    } catch (const ArrayWithNullException &e) {
+        return args[0];
+    }
+
+    // The following check was added with MADLIB-138.
+    if (!dbal::eigen_integration::isfinite(x))
+        throw std::domain_error("Design matrix is not finite.");
+
+    MappedColumnVector coef = args[2].getAs<MappedColumnVector>();
+
+    // matrix is read in as a column-order matrix; the input is passed in as row-order.
+    Matrix derivative_matrix = args[4].getAs<MappedMatrix>();
+    derivative_matrix.transposeInPlace();
+
+    if (state.numRows == 0) {
+        if (x.size() > std::numeric_limits<uint16_t>::max())
+            throw std::domain_error("Number of independent variables cannot be "
+                                    "larger than 65535.");
+        Matrix training_data_vcov = args[3].getAs<MappedMatrix>();
+
+        MappedColumnVector categorical_indices;
+        uint16_t numCategoricals = 0;
+        if (!args[5].isNull()) {
+            MappedColumnVector xx = args[5].getAs<MappedColumnVector>();
+            categorical_indices.rebind(xx.memoryHandle(), xx.size());
+            numCategoricals = static_cast<uint16_t>(categorical_indices.size());
+        }
+        state.initialize(*this,
+                         static_cast<uint16_t>(coef.size()),
+                         static_cast<uint16_t>(derivative_matrix.rows()),
+                         numCategoricals);
+        state.training_data_vcov = training_data_vcov;
+        if (numCategoricals > 0)
+            state.categorical_indices = categorical_indices;
+    }
+
+    // Now do the transition step
+    state.numRows++;
+    double xc = dot(x, coef);
+    double p = std::exp(xc) / (1 + std::exp(xc));
+
+    // compute marginal effects and delta using 1st and 2nd derivatives
+    ColumnVector coef_interaction_sum = derivative_matrix * coef;
+    ColumnVector current_me = coef_interaction_sum * p * (1 - p);
+
+    Matrix current_delta = p * (1 - p) * (
+        (1 - 2 * p) * coef_interaction_sum * trans(x) +
+        derivative_matrix);
+
+    // update marginal effects and delta using discrete differences for the
+    // categorical variables
+    Matrix x_set;
+    Matrix x_unset;
+    if (!args[6].isNull() && !args[7].isNull()) {
+        // the matrix is read in column-order but passed in row-order
+        x_set = args[6].getAs<MappedMatrix>();
+        x_set.transposeInPlace();
+
+        x_unset = args[7].getAs<MappedMatrix>();
+        x_unset.transposeInPlace();
+    }
+    for (Index i = 0; i < state.numCategoricals; ++i) {
+        // Note: categorical_indices are assumed to be zero-based
+        Index cat_index = static_cast<Index>(state.categorical_indices(i));
+        double p_set = logistic(dot(x_set.row(i), coef));
+        double p_unset = logistic(dot(x_unset.row(i), coef));
+
+        current_me(cat_index) = p_set - p_unset;
+        current_delta.row(cat_index) =
+            p_set * (1 - p_set) * x_set.row(i) -
+            p_unset * (1 - p_unset) * x_unset.row(i);
+    }
+
+    state.marginal_effects += current_me;
+    state.delta += current_delta;
+    return state;
+}
+
+/**
+ * @brief Marginal effects: Merge transition states
+ */
+AnyType
+margins_logregr_int_merge::run(AnyType &args) {
+    MarginsLogregrInteractionState<MutableArrayHandle<double> > stateLeft = args[0];
+    MarginsLogregrInteractionState<ArrayHandle<double> > stateRight = args[1];
+
+    // We first handle the trivial case where this function is called with one
+    // of the states being the initial state
+    if (stateLeft.numRows == 0)
+        return stateRight;
+    else if (stateRight.numRows == 0)
+        return stateLeft;
+
+    // Merge states together and return
+    stateLeft += stateRight;
+    return stateLeft;
+}
+
+/**
+ * @brief Marginal effects: Final step
+ */
+AnyType
+margins_logregr_int_final::run(AnyType &args) {
+    // We request a mutable object.
+    // Depending on the backend, this might perform a deep copy.
+    MarginsLogregrInteractionState<MutableArrayHandle<double> > state = args[0];
+
+    // Aggregates that haven't seen any data just return Null.
+    if (state.numRows == 0)
+        return Null();
+
+    // Standard errors for continuous variables according to the delta method
+    Matrix std_err = state.delta * state.training_data_vcov * trans(state.delta) /
+        static_cast<double>(state.numRows * state.numRows);
+
+    // Compute the marginal effects and final statistics
+    return margins_stateToResult(*this, std_err.diagonal(),
+                                 state.marginal_effects,
+                                 state.numRows);
+}
+// ------------------------ End of Marginal ------------------------------------
+
+} // namespace regress
+
+} // namespace modules
+
+} // namespace madlib

src/modules/regress/marginal.hpp
@@ -0,0 +1,14 @@
+/**
+ * @brief Marginal Effects Logistic regression step: Transition function
+ */
+DECLARE_UDF(regress, margins_logregr_int_transition)
+
+/**
+ * @brief Marginal effects Logistic regression: State merge function
+ */
+DECLARE_UDF(regress, margins_logregr_int_merge)
+
+/**
+ * @brief Marginal Effects Logistic regression: Final function
+ */
+DECLARE_UDF(regress, margins_logregr_int_final)

src/modules/regress/regress.hpp
@@ -9,5 +9,6 @@
 #include "linear.hpp"
 #include "clustered_errors.hpp"
 #include "logistic.hpp"
+#include "marginal.hpp"
 #include "multilogistic.hpp"
 #include "mlogr_margins.hpp"

src/ports/postgres/modules/regress/linear.py_in
@@ -132,12 +132,13 @@ def linregr_train(schema_madlib, source_table, out_table,
         """
         create table {out_table}_summary as
             select
-                '{source_table}'::varchar as source_table,
-                '{out_table}'::varchar as out_table,
-                '{dependent_varname}'::varchar as dependent_varname,
-                '{independent_varname}'::varchar as independent_varname,
-                {num_rows_processed}::integer as num_rows_processed,
-                {num_rows_skipped}::integer as num_missing_rows_skipped
+                'linregr'::varchar as method,
+                '{source_table}'::varchar as source_table,
+                '{out_table}'::varchar as out_table,
+                '{dependent_varname}'::varchar as dependent_varname,
+                '{independent_varname}'::varchar as independent_varname,
+                {num_rows_processed}::integer as num_rows_processed,
+                {num_rows_skipped}::integer as num_missing_rows_skipped
         """.format(out_table=out_table, source_table=source_table,
                    dependent_varname=dependent_varname,
                    independent_varname=independent_varname,

src/ports/postgres/modules/regress/logistic.py_in
@@ -279,7 +279,9 @@ def __logregr_train_compute(schema_madlib, tbl_source, tbl_output, dep_col,
                 (case when (result).status = 1 then num_rows - (result).num_processed
                       when result is null then num_rows
                       else NULL::bigint
                  end) as num_missing_rows_skipped,
-                {col_grp_iteration} as num_iterations
+                {col_grp_iteration} as num_iterations,
+                (case when (result).status = 1 then (result).vcov
+                      else NULL::double precision[]
+                 end) as variance_covariance
             from
             (
                 select
@@ -349,14 +351,15 @@ def __logregr_train_compute(schema_madlib, tbl_source, tbl_output, dep_col,
        """
        create table {tbl_output}_summary as
            select
-               '{tbl_source}'::varchar as source_table,
-               '{tbl_output}'::varchar as out_table,
-               '{dep_col}'::varchar as dependent_varname,
-               '{ind_col}'::varchar as independent_varname,
+               'logregr'::varchar as method,
+               '{tbl_source}'::varchar as source_table,
+               '{tbl_output}'::varchar as out_table,
+               '{dep_col}'::varchar as dependent_varname,
+               '{ind_col}'::varchar as independent_varname,
                'optimizer={optimizer}, max_iter={max_iter}, tolerance={tolerance}'::varchar as optimizer_params,
               {all_groups}::integer as num_all_groups,
-              {failed_groups}::integer as num_failed_groups,
-              {num_rows_processed}::integer as num_rows_processed,
+              {failed_groups}::integer as num_failed_groups,
+              {num_rows_processed}::integer as num_rows_processed,
               {num_missing_rows_skipped}::integer as num_missing_rows_skipped
""".format(all_groups=all_groups, failed_groups=failed_groups, **args)) @@ -384,7 +387,7 @@ def logregr_help_msg(schema_madlib, message, **kwargs): Returns: A string, contains the help message """ - if message is None: + if not message: help_string = """ ---------------------------------------------------------------- @@ -437,6 +440,7 @@ The output table ('out_table' above) has the following columns: 'num_iterations' double precision -- how many iterations are used in the computation per group A summary table named _summary is also created at the same time, which has: + 'method' varchar, -- modeling method name ('logregr') 'source_table' varchar, -- the data source table name 'out_table' varchar, -- the output table name 'dependent_varname' varchar, -- the dependent variable @@ -493,7 +497,7 @@ SELECT madlib.logregr_train( 'patients', SELECT * from patients_logregr; """ else: - help_string = "No such option. Use {schema_madlib}.logregr_train()" + help_string = "No such option. Use {schema_madlib}.logregr_train('help')" return help_string.format(schema_madlib=schema_madlib) ## ======================================================================== 3  src/ports/postgres/modules/regress/logistic.sql_in  @@ -517,6 +517,7 @@ CREATE TYPE MADLIB_SCHEMA.__logregr_result AS ( z_stats DOUBLE PRECISION[], p_values DOUBLE PRECISION[], odds_ratios DOUBLE PRECISION[], + vcov DOUBLE PRECISION[], condition_no DOUBLE PRECISION, status INTEGER, num_processed BIGINT, @@ -894,7 +895,7 @@ m4_ifdef(__HAS_FUNCTION_PROPERTIES__', CONTAINS SQL', '); CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.logregr_train() RETURNS TEXT AS - SELECT MADLIB_SCHEMA.logregr_train(''::TEXT); + SELECT MADLIB_SCHEMA.logregr_train(NULL::TEXT); $$LANGUAGE SQL IMMUTABLE m4_ifdef(__HAS_FUNCTION_PROPERTIES__', CONTAINS SQL', '); ------------------------------------------------------------------------ 29 src/ports/postgres/modules/regress/marginal.py_in  @@ -139,6 +139,8 @@ def margins_logregr(schema_madlib, 
                     source_table, out_table,
     For function usage information. Run sql> select margins_logregr('usage');
     """
+    plpy.warning("This function has been deprecated and replaced by 'margins'")
+
     # Reset the message level to avoid random messages
     old_msg_level = plpy.execute("""
         SELECT setting
@@ -156,9 +158,6 @@ def margins_logregr(schema_madlib, source_table, out_table,
                                  tolerance)
     _margins_logregr_validate_args(optimizer)

-    # NOTICE: * support was removed because other modules did not have it.
-    # Uncomment the following code if you want to re-add '*' support
-
     group_col_str = 'NULL' if grouping_cols is None else "'" + grouping_cols + "'"
     optimizer_str = 'NULL' if optimizer is None else "'" + optimizer + "'"
     maxiter_str = 'NULL' if max_iter is None else max_iter
@@ -179,8 +178,8 @@ def margins_logregr(schema_madlib, source_table, out_table,
                    maxiter_str=maxiter_str, optimizer_str=optimizer_str,
                    tolerance_str=tolerance_str, verbose=verbose_mode))
-    m4_changequote(`>>>', `<<<')
-    m4_ifdef(>>>__HAWQ__<<<, >>>
+    m4_changequote(<!', `!>)
+    m4_ifdef(<!__HAWQ__!>, <!
+    !>, <!
+    !>)
+    m4_changequote(`<!', `!>')

     coef = plpy.execute("select coef from {0}".format(logr_out_table))[0]['coef']
     if coef is None:
@@ -255,6 +254,8 @@ def margins_logregr(schema_madlib, source_table, out_table,

 def margins_logregr_help(schema_madlib, message, **kwargs):
+    plpy.warning("This function has been deprecated and replaced by 'margins'")
+
     if not message:
         help_string = """
-----------------------------------------------------------------------
@@ -303,6 +304,8 @@
The output summary table is the same as logregr_train(), see also:

     return help_string.format(schema_madlib=schema_madlib)

+
+# ========================================================================
 # -----------------------------------------------------------------------
 # Marginal Effects for multinomial logistic regression
 # -----------------------------------------------------------------------
@@ -551,8 +554,8 @@ def margins_mlogregr(schema_madlib,
                                                      source_table, out_table,
             'max_iter={max_iter}, optimizer={optimizer}, tolerance={tolerance}')
         """.format(**all_arguments))
-    m4_changequote(`>>>', `<<<')
-    m4_ifdef(>>>__HAWQ__<<<, >>>
+    m4_changequote(<!', `!>)
+    m4_ifdef(<!__HAWQ__!>, <!
+    !>, <!
+    !>)
+    m4_changequote(`<!', `!>')

     num_categories = plpy.execute(
         "SELECT count(DISTINCT {0}) as n_cat FROM {1}".

src/ports/postgres/modules/regress/marginal.sql_in
@@ -18,8 +18,8 @@
m4_include(`SQLCommon.m4')

-@brief Calculates marginal effects for the coefficients in logistic and multinomial logistic regression problems.
+@brief Calculates marginal effects for the coefficients in regression problems.

A marginal effect (ME) or partial effect measures the effect on the
-conditional mean of \f$ y \f$ of a change in one of the regressors, say
+conditional mean of \f$ y \f$ for a change in one of the regressors, say
\f$ X_k \f$. In the linear regression model, the ME equals the relevant slope
coefficient, greatly simplifying analysis. For nonlinear models, specialized
algorithms are required for calculating ME. The marginal effect
@@ -41,10 +41,82 @@ source table. MADlib provides marginal effects regression functions
for logistic and multinomial logistic regressions.

+@warning The margins_logregr function has been deprecated in favor of the 'margins' function.
+
+@anchor margins
+@par Marginal Effects for Logistic Regression with Interaction Terms
+
+margins( model_table,
+         output_table,
+         x_design,
+         source_table,
+         marginal_vars
+       )
+
+\b Arguments
+
+  model_table
+    VARCHAR. Name of the model table, which is the output of
+    'logregr_train' or 'mlogregr_train'.
+
+  output_table
+    VARCHAR. Name of the result table. The output table has the following columns:
+      variables   INTEGER[].          The indices of the basis variables.
+      margins     DOUBLE PRECISION[]. The marginal effects.
+      std_err     DOUBLE PRECISION[]. An array of the standard errors,
+                                      computed using the delta method.
+      z_stats     DOUBLE PRECISION[]. An array of the z-stats of the marginal effects.
+      p_values    DOUBLE PRECISION[]. An array of the Wald p-values of the marginal effects.
+
+  x_design (optional)
+    VARCHAR, default: NULL. The design of the independent variables, necessary
+    only if interaction terms are present. This is necessary since the
+    independent variables in the underlying regression could be a flat array,
+    making it difficult to read interaction effects between variables.
+
+    Example:
+    The user can provide the independent_varname in the regression method
+    in either of the following ways:
+      - 'array[1, color_blue, color_green, gender_female, gpa, gpa^2, gender_female*gpa, gender_female*gpa^2, weight]'
+      - 'x'
+
+    In the second version, the column 'x' is an array containing data
+    identical to that expressed in the first version, computed in a prior
+    data preparation step.
+
+    The user would then supply an x_design argument to the margins()
+    function in the following way:
+      - '1, i.2.color, i.3.color, i.4.gender, 5, 5^2, 5*4, 5^2*4, 9'
+
+  source_table (optional)
+    VARCHAR, default: NULL. Name of the data table to apply marginal effects on.
+    If not provided or NULL, the marginal effects are computed on the training table.
+
+  marginal_vars (optional)
+    INTEGER[], default: NULL. The indices (base 1) of the variables to calculate
+    marginal effects for. When NULL, marginal effects are computed for all variables.
+
@anchor logregr_train
-@par Logistic Regression Training Function
+@par Marginal Effects for Logistic Regression (Deprecated)

margins_logregr( source_table,
                 output_table,
@@ -105,7 +177,7 @@ margins_logregr( source_table,

@anchor mlogregr_train
-@par Multinomial Logistic Regression Training Function
+@par Marginal Effects for Multinomial Logistic Regression

margins_mlogregr( source_table,
                  out_table,
@@ -172,9 +244,10 @@ margins_mlogregr( source_table,

@anchor examples
@examp
--# View online help for the marginal effects logistic regression function.
+
+-# View online help for the marginal effects.
-SELECT madlib.margins_logregr();
+SELECT madlib.margins();

-# Create the sample data set.
@@ -197,47 +270,48 @@
Result:
...

--# Run the logistic regression function and then compute the marginal effects of all variables in the regression.
-
-SELECT madlib.margins_logregr( 'patients',
-                               'result_table',
-                               'second_attack',
-                               'ARRAY[1, treatment, trait_anxiety]'
-                             );
-
--# View the regression results.
+-# Run logistic regression to get the model, then compute the marginal effects of all variables, and view the results.
+SELECT madlib.logregr_train( 'patients',
+                             'model_table',
+                             'second_attack',
+                             'ARRAY[1, treatment, trait_anxiety, treatment^2, treatment * trait_anxiety]'
+                           );
+SELECT madlib.margins( 'model_table',
+                       'margins_table',
+                       '1, 2, 3, 2^2, 2*3',
+                       NULL,
+                       NULL
+                     );
\x on
-SELECT * FROM result_table;
+SELECT * FROM margins_table;
Result:
-margins   | {-0.970665392796,-0.156214190168,0.0181587690137}
-coef      | {-6.36346994178179,-1.02410605239327,0.119044916668605}
-std_err   | {0.802871454422,0.292691682191,0.0137459874022}
-z_stats   | {-1.2089922832,-0.533715850748,1.32102325446}
-p_values  | {0.2266658, 0.5935381, 0.1864936}
+variables | {1,2,3}
+margins   | {-0.876046514609573,-0.0648833521465306,0.0177196513589633}
+std_err   | {0.551714275062467,0.373592457067442,0.00458001207971933}
+z_stats   | {-1.58786269307674,-0.173674149247659,3.86890930646828}
+p_values  | {0.112317391159946,0.862121554662231,0.000109323294026272}

--# Run the logistic regression function and then compute the marginal effects of the first variable in the regression.
+-# Compute the marginal effects of the first variable using the previous model and view the results.
-SELECT madlib.margins_logregr( 'patients',
-                               'result_table',
-                               'second_attack',
-                               'ARRAY[1, treatment, trait_anxiety]',
-                               NULL,
-                               ARRAY[1]
-                             );
+SELECT madlib.margins( 'model_table',
+                       'result_table',
+                       '1, 2, 3, 2^2, 2*3',
+                       NULL,
+                       '1'
+                     );
SELECT * FROM result_table;
Result:
-margins   | {-0.970665392796}
-coef      | {-6.36346994178179}
-std_err   | {0.802871454422}
-z_stats   | {-1.2089922832}
-p_values  | {0.2266658}
+variables | {1}
+margins   | {-0.876046514609573}
+std_err   | {0.551714275062467}
+z_stats   | {-1.58786269307674}
+p_values  | {0.112317391159946}

-# View online help for marginal effects multinomial logistic regression.

@@ -388,6 +462,10 @@
File marginal.sql_in documenting the SQL functions.
@endinternal
*/

+--------------------------------------------------------------------------------
+-- DEPRECATION NOTICE -----------------------------------------------------------
+-- The udt/udf/uda below have been deprecated and should be removed in the next major release.
+
------------------ Marginal Logistic Regression ------------------------------
DROP TYPE IF EXISTS MADLIB_SCHEMA.marginal_logregr_result CASCADE;
CREATE TYPE MADLIB_SCHEMA.marginal_logregr_result AS (
@@ -659,9 +737,9 @@
RETURNS VOID AS $$
$$ LANGUAGE sql VOLATILE
m4_ifdef(`__HAS_FUNCTION_PROPERTIES__', `MODIFIES SQL DATA', `');
-- End of Default Variable calls for margins_logregr
-------------------------------------------------------------------------------
+-------------------------------------------------------------------------------

------------------- Marginal Multi-Logistic Regression ------------------------------
+------------------ Marginal Multi-Logistic Regression -------------------------

DROP TYPE IF EXISTS MADLIB_SCHEMA.marginal_mlogregr_result CASCADE;
CREATE TYPE MADLIB_SCHEMA.marginal_mlogregr_result AS (
@@ -1010,7 +1088,6 @@
$$ LANGUAGE plpgsql VOLATILE
m4_ifdef(`__HAS_FUNCTION_PROPERTIES__', `MODIFIES SQL DATA', `');
-- End of Default Variable calls for margins_mlogregr
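The continuous-variable effects above come from the chain rule (the transition step computes `p(1-p) * derivative_matrix * coef`), while dummy variables use the discrete difference `P(y=1 | x_k set) - P(y=1 | x_k unset)`. A minimal per-row Python sketch of that logic (hypothetical helper names, not MADlib code):

```python
import math

def logistic(z):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-z))

def marginal_effects_row(f, coef, dfdx, categorical=None):
    """Per-row marginal effects for logit P(y=1) = g(f(x)' coef).

    f           : basis-term values f_j(x) for this row
    coef        : fitted coefficients, one per basis term
    dfdx        : dfdx[k][j] = d f_j / d x_k (the "derivative matrix")
    categorical : optional {k: (f_set, f_unset)} giving the basis vector
                  with x_k set to 1 and to 0; those entries use the
                  discrete difference instead of the derivative.
    """
    p = logistic(sum(fj * b for fj, b in zip(f, coef)))
    # chain rule: dp/dx_k = p(1-p) * sum_j coef_j * df_j/dx_k
    me = [p * (1 - p) * sum(b * d for b, d in zip(coef, row)) for row in dfdx]
    for k, (f_set, f_unset) in (categorical or {}).items():
        p_set = logistic(sum(fj * b for fj, b in zip(f_set, coef)))
        p_unset = logistic(sum(fj * b for fj, b in zip(f_unset, coef)))
        me[k] = p_set - p_unset  # discrete difference for the dummy
    return me
```

Averaging `marginal_effects_row` over all rows gives the average marginal effects that margins() reports; without interaction terms, `dfdx` reduces to an identity-like matrix and each effect collapses to `p(1-p) * coef_k`.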
------------------------------------------------------------------------------
--- END OF DEPRECATED NOTICE --------------------------------------------------

-- Default calls for margins_mlogregr (Overloaded functions)
@@ -1090,6 +1167,8 @@
m4_ifdef(`__HAS_FUNCTION_PROPERTIES__', `MODIFIES SQL DATA', `');
-- End of Default Variable calls for margins_mlogregr
------------------------------------------------------------------------------
+-- END OF DEPRECATED NOTICE ---------------------------------------------------
+
CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.__sub_array(
    value_array DOUBLE PRECISION[],    -- The array containing values to be selected
@@ -1098,3 +1177,157 @@ CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.__sub_array(
AS 'MODULE_PATHNAME'
LANGUAGE C IMMUTABLE
m4_ifdef(`__HAS_FUNCTION_PROPERTIES__', `NO SQL', `');
+
+-----------------------------------------------------------------------
+-- Marginal effects with interaction terms
+-----------------------------------------------------------------------
+
+------------------ Marginal Logistic Regression w/ Interaction -----------------
+----------------------- New interface for marginal -----------------------------
+
+DROP TYPE IF EXISTS MADLIB_SCHEMA.margins_int_logregr_result;
+CREATE TYPE MADLIB_SCHEMA.margins_int_logregr_result AS (
+    margins     DOUBLE PRECISION[],
+    std_err     DOUBLE PRECISION[],
+    z_stats     DOUBLE PRECISION[],
+    p_values    DOUBLE PRECISION[]
+);
+--------------------------------------
+
+CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.__margins_logregr_int_transition(
+    state               DOUBLE PRECISION[],
+    x                   DOUBLE PRECISION[],
+    coef                DOUBLE PRECISION[],
+    vcov                DOUBLE PRECISION[],
+    derivative          DOUBLE PRECISION[],
+    categorical_indices DOUBLE PRECISION[],
+    x_set               DOUBLE PRECISION[],
+    x_unset             DOUBLE PRECISION[])
+RETURNS DOUBLE PRECISION[]
+AS 'MODULE_PATHNAME', 'margins_logregr_int_transition'
+LANGUAGE C IMMUTABLE;
+
+--------------------------------------
+
+CREATE OR REPLACE FUNCTION
+MADLIB_SCHEMA.__margins_logregr_int_merge(
+    state1 DOUBLE PRECISION[],
+    state2 DOUBLE PRECISION[])
+RETURNS DOUBLE PRECISION[]
+AS 'MODULE_PATHNAME', 'margins_logregr_int_merge'
+LANGUAGE C IMMUTABLE STRICT;
+
+--------------------------------------
+
+CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.__margins_logregr_int_final(
+    state DOUBLE PRECISION[])
+RETURNS MADLIB_SCHEMA.margins_int_logregr_result
+AS 'MODULE_PATHNAME', 'margins_logregr_int_final'
+LANGUAGE C IMMUTABLE STRICT;
+
+--------------------------------------
+
+/**
+ * @brief Compute marginal effects for logistic regression.
+ *
+ * @param dependentVariable Column containing the dependent variable
+ * @param independentVariables Column containing the array of independent variables
+ * @param coef Column containing the array of coefficients (as obtained by logregr)
+ *
+ * @return A composite value:
+ *  - margins  FLOAT8[] - Array of marginal effects
+ *  - std_err  FLOAT8[] - Array of standard errors (calculated by the delta method)
+ *  - z_stats  FLOAT8[] - Array of z-statistics
+ *  - p_values FLOAT8[] - Array of p-values
+ *
+ * @usage
+ *  - Get all the diagnostic statistics:\n
+ *
+ */
+CREATE AGGREGATE MADLIB_SCHEMA.__margins_int_logregr_agg(
+    /*+ "independentVariables" */   DOUBLE PRECISION[],
+    /*+ "coef" */                   DOUBLE PRECISION[],
+    /*+ "vcov" */                   DOUBLE PRECISION[],
+    /*+ "derivative matrix" */      DOUBLE PRECISION[],
+    /*+ "categorical_indices" */    DOUBLE PRECISION[],
+    /*+ "x_set" */                  DOUBLE PRECISION[],
+    /*+ "x_unset" */                DOUBLE PRECISION[]
+)(
+    STYPE=DOUBLE PRECISION[],
+    SFUNC=MADLIB_SCHEMA.__margins_logregr_int_transition,
+    m4_ifdef(`__POSTGRESQL__', `', `PREFUNC=MADLIB_SCHEMA.__margins_logregr_int_merge,')
+    FINALFUNC=MADLIB_SCHEMA.__margins_logregr_int_final,
+    INITCOND='{0, 0, 0, 0, 0, 0, 0, 0, 0}'
+);
+
+-------------------------------------------------------------------------
+-- New interface for margins logregr
+-------------------------------------------------------------------------
+
+/**
+ * @brief Marginal effects with default variable names
+ **/
+CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.margins(
+    model_table     VARCHAR,    -- name of table containing the logistic regression model
+    out_table       VARCHAR,    -- name of output table for the marginal effect values
+    x_design        VARCHAR,    -- design of the independent variables
+    source_table    VARCHAR,    -- source table to apply marginal effects on
+                                -- (optional; if not provided or NULL, the
+                                -- training table is taken as the source)
+    marginal_vars   VARCHAR     -- indices of variables to calculate marginal effects for
+                                -- (optional; if not provided or NULL, marginal
+                                -- effects are computed for all basis variables)
+)
+RETURNS VOID AS $$
+PythonFunction(regress, margins, margins)
+$$ LANGUAGE plpythonu VOLATILE;
+
+CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.margins(
+    model_table     VARCHAR,    -- name of table containing the logistic regression model
+    out_table       VARCHAR,    -- name of output table
+    x_design        VARCHAR,    -- design of the independent variables
+    source_table    VARCHAR     -- source table to apply marginal effects on
+                                -- (optional; if not provided or NULL, the
+                                -- training table is taken as the source)
+)
+RETURNS VOID AS $$
+    SELECT MADLIB_SCHEMA.margins($1, $2, $3, $4, NULL)
+$$ LANGUAGE sql VOLATILE;
+
+CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.margins(
+    model_table     VARCHAR,    -- name of table containing the logistic regression model
+    out_table       VARCHAR,    -- name of output table
+    x_design        VARCHAR     -- design of the independent variables
+)
+RETURNS VOID AS $$
+    SELECT MADLIB_SCHEMA.margins($1, $2, $3, NULL, NULL)
+$$ LANGUAGE sql VOLATILE;
+
+CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.margins(
+    model_table     VARCHAR,    -- name of table containing the logistic regression model
+    out_table       VARCHAR     -- name of output table
+)
+RETURNS VOID AS $$
+    SELECT MADLIB_SCHEMA.margins($1, $2, NULL, NULL, NULL)
+$$ LANGUAGE sql VOLATILE;
+-------------------------------------------------------------------------
+
+CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.margins(
+    message VARCHAR
+) RETURNS VARCHAR AS $$
+    PythonFunction(regress, margins, margins_help)
+$$ LANGUAGE plpythonu IMMUTABLE
+m4_ifdef(`__HAS_FUNCTION_PROPERTIES__', `CONTAINS SQL', `');
+
+--------------------------------------
+
+CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.margins()
+RETURNS VARCHAR AS $$
+    SELECT MADLIB_SCHEMA.margins('');
+$$ LANGUAGE sql IMMUTABLE
+m4_ifdef(`__HAS_FUNCTION_PROPERTIES__', `CONTAINS SQL', `');
+
+-------------------------------------------------------------------------------------
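The final step of the aggregate computes standard errors as the diagonal of `delta * vcov * delta' / numRows^2`, i.e. the delta method applied to the averaged effects, where `delta` accumulates the per-row Jacobian of the marginal effects with respect to the coefficients. A plain-Python sketch of that computation (hypothetical names, not MADlib code):

```python
import math

def delta_method_stderr(delta_sum, vcov, n):
    """Delta-method standard errors for average marginal effects.

    delta_sum[i][k] : sum over all rows of d(ME_i)/d(beta_k); dividing
                      by n gives the Jacobian J of the averaged effects
    vcov            : variance-covariance matrix of the coefficients
    n               : number of rows aggregated

    Returns sqrt of the diagonal of J * vcov * J^T, which matches the
    final function's delta * vcov * delta' / n^2.
    """
    J = [[v / n for v in row] for row in delta_sum]
    dim = len(vcov)
    std_err = []
    for i in range(len(J)):
        # row i of J * vcov
        jv = [sum(J[i][k] * vcov[k][m] for k in range(dim)) for m in range(dim)]
        # diagonal entry (J * vcov * J^T)[i][i]
        std_err.append(math.sqrt(sum(jv[m] * J[i][m] for m in range(dim))))
    return std_err
```

Dividing each accumulated effect by `n` (as `margins_stateToResult` does) and pairing it with these standard errors yields the z-stats and Wald p-values reported in the output table.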