
Margins Logregr: New interface + functionality for interaction terms

Pivotal Tracker: 67684630, 60733090

Additional authors:
Shengwen Yang <syang@gopivotal.com>
Qian, Hai <hqian@gopivotal.com>

Changes:
    - Deprecated the old interface and introduced a new single 'margins'
    function.
    - The new function takes the model table from regression as an input and
    does not run the underlying regression again. The 'margins' function detects
    the regression method from the model summary table and runs the appropriate
    calculation.
    - If interaction terms are present in the independent variables, then
    an x_design string is expected that describes the interactions.
commit 77e3bfe772443688e982929d956b6e6eb4a1c98b 1 parent 9fa5d0b
Rahul Iyer authored
1  ReadMe.txt
@@ -51,3 +51,4 @@ upgrading to the next major version:
- Overloaded forms of function 'margins_mlogregr' that accept
individual optimizer parameters (max_iter, optimizer, tolerance). These
parameters have been replaced with a single optimizer parameter.
+ - All overloaded forms of function 'margins_logregr'.
164 doc/design/modules/regression.tex
@@ -5,11 +5,11 @@
\begin{moduleinfo}
\item[Authors] {Rahul Iyer and Hai Qian}
\item[History]
- \begin{modulehistory}
+ \begin{modulehistory}
\item[v0.3] Added section on Clustered Sandwich Estimators
\item[v0.2] Added section on Marginal Effects
- \item[v0.1] Initial version, including background of regularization
- \end{modulehistory}
+ \item[v0.1] Initial version, including background of regularization
+ \end{modulehistory}
\end{moduleinfo}
\newcommand{\bS}[1]{\boldsymbol{#1}}
@@ -718,19 +718,29 @@ \section{Marginal Effects} % (fold)
linear function of $(x_1, \dots, x_m) = X$ and $y$ is a continuous variable, a
linear regression model can be stated as follows:
\begin{align*}
- & y = X' \beta \\
+ & y = X^T\beta \\
& \text{or} \\
    & y = \beta_0 + \beta_1 x_1 + \dots + \beta_m x_m.
\end{align*}
From the above equation it is straightforward to see that the marginal effect of
-variable $x_k$ on the dependent variable is $\partial y / \partial x = \beta_k$.
+variable $x_k$ on the dependent variable is $\partial y / \partial x_k =
+\beta_k$. However, this holds only when there are no
+interactions between the variables. If interactions are present, the
+model becomes
+\begin{align*}
+ & y = F^T\beta \\
+ & \text{or} \\
+    & y = \beta_0 + \beta_1 f_1 + \dots + \beta_m f_m.
+\end{align*}
+where $f_i$ is a function of the base variables $x_1, x_2, \dots, x_n$ and describes an
+interaction between the base variables.
The standard approach to modeling dichotomous/binary variables
(so $y \in {0, 1}$) is to estimate a generalized linear model under the
assumption that y follows some form of Bernoulli distribution. Thus the expected
value of $y$ becomes,
\begin{equation*}
- y = G(X' \beta),
+ y = G(X^T \beta),
\end{equation*}
where G is the specified binomial distribution. Here we assume to use
logistic regression and use $g$ to refer to the inverse logit function.
@@ -739,15 +749,34 @@ \subsection{Logistic regression} % (fold)
\label{sub:logistic_regression}
In logistic regression:
\begin{align*}
- P &= \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \dots \beta_m x_m)}} \\
+ P &= \frac{1}{1 + e^{-(\beta_0 + \beta_1 f_1 + \dots \beta_m f_m)}} \\
&= \frac{1}{1 + e^{-z}}
\end{align*}
\begin{align*}
- \implies \frac{\partial P}{\partial X_k} &= \beta_k \cdot \frac{1}{1 + e^{-z}} \cdot
- \frac{e^{-z}}{1 + e^{-z}} \\
- &= \beta_k \cdot P \cdot (1-P)
+ \implies \frac{\partial P}{\partial X_k} &= P \cdot (1-P) \cdot
+ \frac{\partial z}{\partial x_k},
\end{align*}
-
+where the partial derivative in the last equation equals $\beta_k$
+when there are no interaction terms. In the general case there is
+no simple closed-form expression for it, so we keep it as a derivative.
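
To make the chain rule above concrete, here is a minimal standalone sketch (illustrative only, not part of this diff; the function name is made up) of the marginal effect of $x_1$ in a logistic model with one interaction term, $z = b_0 + b_1 x_1 + b_2 x_2 + b_3 x_1 x_2$, so that $\partial z/\partial x_1 = b_1 + b_3 x_2$:

```cpp
#include <cassert>
#include <cmath>

// Illustrative sketch (not MADlib code): marginal effect of x1 in a
// logistic model with an interaction term x1*x2, i.e.
//   z = b0 + b1*x1 + b2*x2 + b3*x1*x2,
// so dz/dx1 = b1 + b3*x2 and dP/dx1 = P(1-P)*(b1 + b3*x2).
static double sigma(double z) { return 1.0 / (1.0 + std::exp(-z)); }

double logregr_marginal_effect_x1(double b0, double b1, double b2, double b3,
                                  double x1, double x2) {
    double z = b0 + b1 * x1 + b2 * x2 + b3 * x1 * x2;
    double p = sigma(z);
    double dz_dx1 = b1 + b3 * x2;  // not just b1 once an interaction is present
    return p * (1.0 - p) * dz_dx1;
}
```

Note how the result depends on $x_2$ through the interaction, which is exactly why no single $\beta_k$ summarizes the effect.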
+
+For categorical variables, the situation is more involved. A
+categorical variable is encoded as a set of dummy variables. There
+are two options. First, we can treat the dummy variables as if they
+were continuous ones, and use the above equation to compute their
+marginal effects. Second, for each dummy variable we can compute the discrete change with
+respect to the reference level of this categorical variable:
+\begin{align*}
+ \Delta_{x_k^{(v)}} P = \left.P\right\vert_{x_k^{(v)}=1, x_k^{(w)}=0\ (w\neq
+ v)} - \left.P\right\vert_{x_k^{(0)}=1, x_k^{(w)}=0 \ (w \neq 0)}\ = P_{set} - P_{unset},
+\end{align*}
+where $x_k$ is a categorical variable, and $v$, $w$ denote the levels
+of the categorical variable, with $0$ denoting its reference level.
+Note that in many cases, the dummy variable for
+the reference level does not appear in the regression model, and
+setting $x_k^{(w)}=0$ for all $w\neq 0$ suffices for the second term of the
+above equation.
+Both options are valid. The default for MADlib is the second one.
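
A minimal sketch of the second option (illustrative, with an assumed toy model $z = b_0 + b_1 x + b_2 d$, where $d$ is the dummy for one non-reference level):

```cpp
#include <cassert>
#include <cmath>

// Illustrative sketch: discrete-change effect of a dummy variable in
// logistic regression, computed as P_set - P_unset per the equation above.
// Assumed model: z = b0 + b1*x + b2*d, d the dummy for one level.
static double logistic_p(double z) { return 1.0 / (1.0 + std::exp(-z)); }

double discrete_change(double b0, double b1, double b2, double x) {
    double p_set   = logistic_p(b0 + b1 * x + b2 * 1.0);  // level set
    double p_unset = logistic_p(b0 + b1 * x + b2 * 0.0);  // reference level
    return p_set - p_unset;
}
```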
There are two main methods of calculating the marginal effects for dichotomous
dependent variables.
@@ -755,9 +784,9 @@ \subsection{Logistic regression} % (fold)
\item The first uses the average of the marginal effects at every sample
observation. This is calculated as follows:
\begin{gather*}
- \frac{\partial y}{\partial x_k} = \beta_k \frac{\sum_{i=1}^{n} P(y_i = 1)(1-P(y_i = 1))}{n}, \\
- \text{where, } P(y_i=1) = g(X^{(i)} \beta) \\
- \text{and, } g(z) = \frac{1}{1 + e^{-z}} \\
+ \langle \frac{\partial P(y_i = 1)}{\partial x_k} \rangle = \frac{1}{n}\sum_{i=1}^{n} P(y_i = 1)(1-P(y_i = 1))\cdot\frac{\partial z_i}{\partial x_k}, \\
+ \text{where, } P(y_i=1) = \frac{1}{1 + e^{-z_i}}, \\
+ \text{and, } z_i = F(X_i)^T\beta \\
\end{gather*}
\item The second approach calculates the marginal effect for $x_k$ by taking
@@ -765,12 +794,19 @@ \subsection{Logistic regression} % (fold)
value from the same formulation with the exception of adding one unit to $x_k$.
The derivation of this marginal effect is captured by the following:
\begin{gather*}
- \frac{\partial y}{\partial x_k} = \quad \beta_k P(y=1|\bar{X})(1-P(y=1|\bar{X})) \\
- \text{where, } \bar{X} = \frac{\sum_{i=1}^{n}X^{(i)}}{n}
+ \left.\frac{\partial P(y=1)}{\partial
+    x_k}\right\vert_{X=\bar{X}} =
+ P(y=1|\bar{X})(1-P(y=1|\bar{X}))
+ \left.\frac{\partial z}{\partial x_k}\right\vert_{X=\bar{X}} \\
+ \text{where, } \bar{X} = \frac{\sum_{i=1}^{n}X_i}{n}
\end{gather*}
\end{enumerate}
% subsection logistic_regression (end)
+For categorical variables, we do the same thing: either evaluate the
+marginal effect for each data record and compute the average, or
+evaluate the marginal effect at the means of the variables.
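
The two approaches enumerated above can be contrasted in a small sketch (illustrative, assuming a one-variable model $z = b_0 + b_1 x$ with no interactions, so $\partial z/\partial x = b_1$): the average of per-observation marginal effects versus the marginal effect at the mean observation.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Illustrative sketch: average marginal effect (method 1) vs. marginal
// effect at the mean observation (method 2) for z = b0 + b1*x.
static double prob(double z) { return 1.0 / (1.0 + std::exp(-z)); }

double avg_marginal_effect(double b0, double b1, const std::vector<double>& xs) {
    double sum = 0.0;
    for (double x : xs) {
        double p = prob(b0 + b1 * x);
        sum += p * (1.0 - p) * b1;  // per-observation marginal effect
    }
    return sum / static_cast<double>(xs.size());
}

double marginal_effect_at_mean(double b0, double b1,
                               const std::vector<double>& xs) {
    double xbar = 0.0;
    for (double x : xs) xbar += x;
    xbar /= static_cast<double>(xs.size());
    double p = prob(b0 + b1 * xbar);  // evaluate once, at the mean
    return p * (1.0 - p) * b1;
}
```

The two quantities generally differ because $P(1-P)$ is nonlinear in $x$.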
+
\subsection{Discrete change effect} % (fold)
\label{sub:discrete_change_effect}
Along with marginal effects we can also compute the following discrete change
@@ -840,68 +876,74 @@ \subsection{Standard Errors} % (fold)
function. The delta method therefore relies on finding a linear approximation
of the function by using a first-order Taylor expansion.
-We can approximate a function $f(x)$ about a value $a$ as,
+We can approximate a function $g(x)$ about a value $a$ as,
\[
- f(x) \approx f(a) + (x-a)f'(a)
+g(x) \approx g(a) + (x-a)g'(a)
\]
Taking the variance and setting $a = \mu_x$,
\[
- Var(f(X)) \approx \left[f'(\mu_x)\right]^2 Var(X)
+Var(g(X)) \approx \left[g'(\mu_x)\right]^2 Var(X)
\]
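
A numeric sketch of this approximation (illustrative; taking $g$ to be the logistic function, so $g'(\mu) = g(\mu)(1-g(\mu))$):

```cpp
#include <cassert>
#include <cmath>

// Illustrative delta-method sketch: for g(x) = 1/(1+e^{-x}),
// Var(g(X)) is approximated by [g'(mu)]^2 * Var(X).
double delta_method_var(double mu, double var_x) {
    double p = 1.0 / (1.0 + std::exp(-mu));
    double gprime = p * (1.0 - p);  // derivative of the logistic function
    return gprime * gprime * var_x;
}
```

At $\mu = 0$ the logistic slope is $1/4$, so the approximate variance is $\mathrm{Var}(X)/16$; far in the tails the slope, and hence the approximate variance, shrinks toward zero.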
\subsubsection*{Logistic Regression}
-Using this technique, to compute the variance of the marginal effects at the
-mean observation value in \emph{logistic regression}, we obtain:
-\begin{gather*}
- Var(ME_k) = \frac{\partial (\beta_k \bar{P} (1- \bar{P}))}{\partial \beta_k} Var(\beta_k),\\
- \text{where, } \bar{P} = g(\bar{X}' \beta) = \frac{1}{1 + e^{-\bar{z}}} \\
- \text{and } \bar{z} = \beta_0 + \beta_1 \bar{x}_1 + \dots + \beta_m \bar{x}_m
-\end{gather*}
-
-Thus, using the rule for differentiating compositions of functions, we get
+Using this technique, to compute the variance of the marginal effects
+at the mean observation value in \emph{logistic regression}, we obtain
+the standard error by first computing the derivative of the marginal
+effects with respect to the coefficients, an $n\times m$ matrix $S_{kl}
+= \frac{\partial \mathit{ME}_k}{\partial \beta_l}$:
+\begin{eqnarray*}
+ S_{kl} &=& \frac{\partial}{\partial\beta_l} \left[P (1- P)
+ \cdot \frac{\partial z}{\partial x_k}\right]\\
+ &=& P (1- P) \cdot \frac{\partial}{\partial\beta_l}\left(\frac{\partial z}{\partial x_k}\right)
+ + \frac{\partial \left[P (1- P)\right]}{\partial\beta_l} \cdot \frac{\partial z}{\partial x_k}\\
+ &=& P(1-P)\cdot\frac{\partial^2 z}{\partial x_k\partial\beta_l}
+ + P(1-P)(1-2P) \cdot \frac{\partial z}{\partial \beta_l} \cdot \frac{\partial z}{\partial x_k},\\
+ \text{where } P &=& \frac{1}{1 + e^{-z}} \\
+ \text{and } z &=& \beta_0 + \beta_1 f_1(X)+ \dots + \beta_m f_m(X),
+    X = (x_1, x_2, \dots, x_n).
+\end{eqnarray*}
+For categorical variables, we replace $P(1-P)\cdot(\partial z/\partial x_k)$
+with $\Delta_{x_k^{(v)}}P$ in the first equation above, which gives
+\begin{eqnarray*}
+ S_{kl} &=& \frac{\partial(P_{set}-P_{unset})}{\partial\beta_l} \\
+ &=& P_{set} (1 - P_{set}) \cdot f_{l_{set}} - P_{unset} (1 - P_{unset}) \cdot f_{l_{unset}}
+\end{eqnarray*}
+
+Thus, the variance of the marginal effects is
\begin{align*}
- Var(ME_k) & = \left(-\beta_k \bar{P} \frac{\partial \bar{P}}{\partial \beta_k} +
- \beta_k (1-\bar{P})\frac{\partial \bar{P}}{\partial \beta_k} +
- \bar{P}(1-\bar{P}) \right) Var(\beta_k) \\
- & = \left( (1-2\bar{P})\beta_k \frac{\partial \bar{P}}{\partial \beta_k} + \bar{P}(1-\bar{P}) \right) Var(\beta_k)
-\end{align*}
-We have,
-\begin{align*}
- \frac{\partial \bar{P}}{\partial \beta_k} & = \frac{\partial (\frac{1}{1 + e^{-z}})}{\partial \beta_k} \\
- & = \frac{1}{(1+e^{-z})^2} e^{-z} \frac{\partial z}{\partial \beta_k} \\
- & = \frac{x_k e^{-z}}{(1+e^{-z})^2} \\
- & = x_k \bar{P} (1 - \bar{P})
-\end{align*}
-Replacing this in the equation for $Var(ME_k)$,
-
-\begin{align*}
- Var(ME_k) = \bar{P}(1-\bar{P}) \left(1 + (1-2\bar{P})\beta_k x_k \right) Var(\beta_k)
+ Var(\mathit{ME}) = S \cdot Var(\beta)\cdot S^T\,
\end{align*}
+where $Var(\beta)$ is an $m\times m$ matrix and $S$ is an $n\times m$
+matrix; $n$ is the number of distinct base variables, and $m$ is the
+number of coefficients $\beta_i$.
-Since $\beta$, is a multivariate variable, we will have to use the variance-
-covariance matrix of $\beta$ to compute the variance of the marginal effects.
-Thus for the vector of marginal effects the equation becomes,
+Note: $Var(\beta)$ is computed from the training data
+used for the logistic regression, not from the data used to compute the
+marginal effects (when a different data set is used for the
+marginal effects).
-\begin{gather*}
- Var(ME) = \bar{P}^2(1-\bar{P})^2 \left[I + (1-2\bar{P})\beta \bar{X}' \right] V \left[I+ (1-2\bar{P}) \bar{X} \beta' \right],
-\end{gather*}
-where $V$ is the estimated variance-covariance matrix of $\beta$.
+Using the definition of $z$, we can simplify $S$:
+\begin{equation}
+    S_{kl} = P(1-P)\left(\frac{\partial f_l}{\partial x_k} + (1-2P)\, f_l\sum_{i=1}^{m}\beta_i\frac{\partial f_i(X)}{\partial x_k}\right)
+\end{equation}
+Thus we only need to compute $\partial f_i/\partial x_k$; all the
+other derivatives follow from it.
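
One row of $S$ can be sketched as follows (illustrative, not the MADlib implementation; assumed basis $f = \{1, x_1, x_2, x_1 x_2\}$, with $\sum_i \beta_i\,\partial f_i/\partial x_k = \partial z/\partial x_k$):

```cpp
#include <array>
#include <cassert>
#include <cmath>

// Illustrative sketch of the row of S for x_k = x1, for the model
//   z = b0 + b1*x1 + b2*x2 + b3*x1*x2   (basis f = {1, x1, x2, x1*x2}),
// using S_kl = P(1-P) * ( df_l/dx_k + (1-2P) * f_l * dz/dx_k ).
std::array<double, 4> s_row_x1(const std::array<double, 4>& b,
                               double x1, double x2) {
    std::array<double, 4> f      = {{1.0, x1, x2, x1 * x2}};
    std::array<double, 4> df_dx1 = {{0.0, 1.0, 0.0, x2}};  // d f_i / d x1
    double z = 0.0, dz_dx1 = 0.0;
    for (int i = 0; i < 4; ++i) {
        z += b[i] * f[i];
        dz_dx1 += b[i] * df_dx1[i];
    }
    double p = 1.0 / (1.0 + std::exp(-z));
    std::array<double, 4> s;
    for (int l = 0; l < 4; ++l)
        s[l] = p * (1.0 - p) * (df_dx1[l] + (1.0 - 2.0 * p) * f[l] * dz_dx1);
    return s;
}
```

Each entry agrees with the finite-difference derivative of the marginal effect with respect to the corresponding coefficient, which is a convenient sanity check on the simplification.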
\subsubsection*{Multinomial Logistic Regression}
For multinomial logistic regression, the coefficients $\beta$ form a matrix of
dimension $(J-1) \times K$ where $J$ is the number of categories and $K$ is the
number of features. In order to compute the standard errors on the marginal
effects of category $j$ for independent variable $k$, we need to compute
-the term $\frac{\partial ME_{k,j}} {\partial \beta_{k_1, j_1}}$ for each
+the term $\frac{\partial \mathit{ME}_{k,j}} {\partial \beta_{k_1, j_1}}$ for each
$k_1 \in \{1 \ldots K \}$ and $j_1 \in \{1 \ldots J-1 \}$. The result is a column
vector of length $K \times (J-1)$ denoted by
-$\frac{\partial ME_{k,j}}{\partial \vec{\beta}}$. Hence, for each category
+$\frac{\partial \mathit{ME}_{k,j}}{\partial \vec{\beta}}$. Hence, for each category
$j \in \{1 \ldots J\}$ and independent variable $k \in \{1 \ldots K\}$, we
perform the following computation
\begin{equation}
- Var(ME_{j,k}) = \frac{\partial ME_{k,j}}{\partial \vec{\beta}}^T V \frac{\partial ME_{k,j}}{\partial \vec{\beta}}.
+ Var(\mathit{ME}_{j,k}) = \frac{\partial \mathit{ME}_{k,j}}{\partial \vec{\beta}}^T V \frac{\partial \mathit{ME}_{k,j}}{\partial \vec{\beta}}.
\end{equation}
where $V$ is the variance-covariance matrix of the multinomial logistic
@@ -911,13 +953,13 @@ \subsubsection*{Multinomial Logistic Regression}
From our earlier derivation, we know that the marginal effects for multinomial
logistic regression are:
\begin{gather*}
- \frac{ME_{j,k}}{\partial x} = \bar{P}_j \left[ \beta_{kj} - \sum_{l=1}^{j}\beta_{kl} \bar{P}_l \right]
+ \frac{\mathit{ME}_{j,k}}{\partial x} = \bar{P}_j \left[ \beta_{kj} - \sum_{l=1}^{j}\beta_{kl} \bar{P}_l \right]
\end{gather*}
where
\begin{gather*}
\bar{P}_j = \frac{e^{X\beta_{j,.}}}{\sum_{l=1}^{j} e^{X\beta_{l,.}}} \ \ \ \forall j \in \{ 1 \ldots J \}
\end{gather*}
-We now compute the term $\frac{\partial ME_{k,j}}{\partial \vec{\beta}}$. First,
+We now compute the term $\frac{\partial \mathit{ME}_{k,j}}{\partial \vec{\beta}}$. First,
we define the following three indicator variables:
$$
e_{j,j_1} = \begin{cases} 1 & \mbox{if } j=j_1 \\
@@ -933,7 +975,7 @@ \subsubsection*{Multinomial Logistic Regression}
Using the above definition, we can show that for each $j_1 \in \{ 1 \ldots J \}$ and
$k_1 \in \{1 \ldots K\}$, the partial derivative
\begin{align*}
- \frac{\partial ME_{k,j}}{\partial \beta_{j_1, k_1}} &= \frac{\partial \bar{P}_{j}}{\partial \beta_{j_1, k_1}}
+ \frac{\partial \mathit{ME}_{k,j}}{\partial \beta_{j_1, k_1}} &= \frac{\partial \bar{P}_{j}}{\partial \beta_{j_1, k_1}}
+ \bar{P}_j \Bigg{[} e_{j,j_1,k,k_1} - e_{k,k_1} \bar{P}_{j_1} -
\sum_{l=1}^{j} \beta_{l,k} \frac{\partial \bar{P}_l}{\partial \beta_{j_1, k_1}} \Bigg{]}
\end{align*}
@@ -944,16 +986,16 @@ \subsubsection*{Multinomial Logistic Regression}
\end{align*}
The two expressions above can be simplified to obtain
\begin{align*}
- \frac{\partial ME_{k,j}}{\partial \beta_{j_1, k_1}} &= \bar{P}_j
+ \frac{\partial \mathit{ME}_{k,j}}{\partial \beta_{j_1, k_1}} &= \bar{P}_j
\bar{X}_{k_1} [e_{j,j_1} - \bar{P}_{j_1}] [\beta_{j,k} -
\beta_{. k}^T \bar{P}] + \bar{P}_j \Bigg{[} e_{j,j_1,k,k_1} -
e_{k,k_1} \bar{P}_{j_1} -
\bar{X}_{k_1} \bar{P}_{j_1} ( \beta_{k,k_1} - \beta_{. k}^T \bar{P}) \Bigg{]} \\
- &= \bar{X}_{k_1} \bar{ME}_{j,k} [ e_{j,j_1} - \bar{P}_{j_1} ] +
+ &= \bar{X}_{k_1} \bar{\mathit{ME}}_{j,k} [ e_{j,j_1} - \bar{P}_{j_1} ] +
\bar{P}_j [e_{k,k_1,j,j_1} - e_{k,k_1}\bar{P}_{j_1} - \bar{X}_{k_1}
- \bar{ME}_{j_1,k} ]
+ \bar{\mathit{ME}}_{j_1,k} ]
\end{align*}
-where $\bar{ME}$ is the marginal effects computed at the mean observation $\bar{X}$.
+where $\bar{\mathit{ME}}$ denotes the marginal effects computed at the mean observation $\bar{X}$.
% subsection standard_errors (end)
2  src/config/Version.yml
@@ -1 +1 @@
-version: 1.5
+version: 1.5dev
116 src/modules/regress/logistic.cpp
@@ -35,7 +35,7 @@ enum { IN_PROCESS, COMPLETED, TERMINATED, NULL_EMPTY };
// Internal functions
AnyType stateToResult(const Allocator &inAllocator,
const HandleMap<const ColumnVector, TransparentHandle<double> >& inCoef,
- const ColumnVector &diagonal_of_inverse_of_X_transp_AX,
+ const Matrix & hessian,
const double &logLikelihood, const double &conditionNo, int status,
const uint64_t &numRows);
@@ -200,7 +200,7 @@ class LogRegrCGTransitionState {
* @brief Logistic function
*/
inline double sigma(double x) {
- return 1. / (1. + std::exp(-x));
+ return 1. / (1. + std::exp(-x));
}
/**
@@ -308,16 +308,16 @@ logregr_cg_step_final::run(AnyType &args) {
// Note: k = state.iteration
if (state.iteration == 0) {
- // Iteration computes the gradient
+ // Iteration computes the gradient
- state.dir = state.gradNew;
- state.grad = state.gradNew;
- } else {
+ state.dir = state.gradNew;
+ state.grad = state.gradNew;
+ } else {
// We use the Hestenes-Stiefel update formula:
//
- // g_k^T (g_k - g_{k-1})
- // beta_k = -------------------------
- // d_{k-1}^T (g_k - g_{k-1})
+ // g_k^T (g_k - g_{k-1})
+ // beta_k = -------------------------
+ // d_{k-1}^T (g_k - g_{k-1})
ColumnVector gradNewMinusGrad = state.gradNew - state.grad;
state.beta
= dot(state.gradNew, gradNewMinusGrad)
@@ -341,8 +341,8 @@ logregr_cg_step_final::run(AnyType &args) {
// d_k = g_k - beta_k * d_{k-1}
state.dir = state.gradNew - state.beta * state.dir;
- state.grad = state.gradNew;
- }
+ state.grad = state.gradNew;
+ }
// H_k = - X^T A_k X
// where A_k = diag(a_1, ..., a_n) and a_i = sigma(x_i c_{k-1}) sigma(-x_i c_{k-1})
@@ -405,7 +405,7 @@ internal_logregr_cg_result::run(AnyType &args) {
state.X_transp_AX, EigenvaluesOnly, ComputePseudoInverse);
return stateToResult(*this, state.coef,
- decomposition.pseudoInverse().diagonal(), state.logLikelihood,
+ state.X_transp_AX, state.logLikelihood,
decomposition.conditionNo(), state.status, state.numRows);
}
@@ -693,9 +693,9 @@ AnyType logregr_irls_step_final::run(AnyType &args) {
// of X^T A X, so that we don't have to recompute it in the result function.
// Likewise, we store the condition number.
// FIXME: This feels a bit like a hack.
- state.X_transp_Az = inverse_of_X_transp_AX.diagonal();
- state.X_transp_AX(0,0) = decomposition.conditionNo();
-
+ // state.X_transp_Az = inverse_of_X_transp_AX.diagonal();
+ // state.X_transp_AX(0,0) = decomposition.conditionNo();
+ // state.X_transp_Az(0) = decomposition.conditionNo();
return state;
}
@@ -724,8 +724,8 @@ AnyType internal_logregr_irls_result::run(AnyType &args) {
if (state.status == NULL_EMPTY)
return Null();
- return stateToResult(*this, state.coef, state.X_transp_Az,
- state.logLikelihood, state.X_transp_AX(0,0),
+ return stateToResult(*this, state.coef, state.X_transp_AX,
+ state.logLikelihood, state.X_transp_Az(0),
state.status, state.numRows);
}
@@ -800,16 +800,16 @@ class LogRegrIGDTransitionState {
throw std::logic_error("Internal error: Incompatible transition "
"states");
- // Compute the average of the models. Note: The following remains an
+ // Compute the average of the models. Note: The following remains an
// invariant, also after more than one merge:
// The model is a linear combination of the per-segment models
// where the coefficient (weight) for each per-segment model is the
// ratio "# rows in segment / total # rows of all segments merged so
// far".
- double totalNumRows = static_cast<double>(numRows)
+ double totalNumRows = static_cast<double>(numRows)
+ static_cast<double>(inOtherState.numRows);
- coef = double(numRows) / totalNumRows * coef
- + double(inOtherState.numRows) / totalNumRows * inOtherState.coef;
+ coef = double(numRows) / totalNumRows * coef
+ + double(inOtherState.numRows) / totalNumRows * inOtherState.coef;
numRows += inOtherState.numRows;
X_transp_AX += inOtherState.X_transp_AX;
@@ -824,7 +824,7 @@ class LogRegrIGDTransitionState {
* @brief Reset the inter-iteration fields.
*/
inline void reset() {
- // FIXME: HAYING: stepsize is hard-coded here now
+ // FIXME: HAYING: stepsize is hard-coded here now
stepsize = .01;
numRows = 0;
X_transp_AX.fill(0);
@@ -870,7 +870,7 @@ class LogRegrIGDTransitionState {
typename HandleTraits<Handle>::ColumnVectorTransparentHandleMap coef;
typename HandleTraits<Handle>::ReferenceToUInt64 numRows;
- typename HandleTraits<Handle>::MatrixTransparentHandleMap X_transp_AX;
+ typename HandleTraits<Handle>::MatrixTransparentHandleMap X_transp_AX;
typename HandleTraits<Handle>::ReferenceToDouble logLikelihood;
typename HandleTraits<Handle>::ReferenceToUInt16 status;
};
@@ -1025,7 +1025,7 @@ internal_logregr_igd_result::run(AnyType &args) {
state.X_transp_AX, EigenvaluesOnly, ComputePseudoInverse);
return stateToResult(*this, state.coef,
- decomposition.pseudoInverse().diagonal(),
+ state.X_transp_AX,
state.logLikelihood, decomposition.conditionNo(),
state.status, state.numRows);
}
@@ -1039,12 +1039,18 @@ internal_logregr_igd_result::run(AnyType &args) {
AnyType stateToResult(
const Allocator &inAllocator,
const HandleMap<const ColumnVector, TransparentHandle<double> > &inCoef,
- const ColumnVector &diagonal_of_inverse_of_X_transp_AX,
+ const Matrix & hessian,
const double &logLikelihood,
const double &conditionNo,
int status,
const uint64_t &numRows) {
+ SymmetricPositiveDefiniteEigenDecomposition<Matrix> decomposition(
+ hessian, EigenvaluesOnly, ComputePseudoInverse);
+
+ const Matrix &inverse_of_X_transp_AX = decomposition.pseudoInverse();
+ const ColumnVector &diagonal_of_X_transp_AX = inverse_of_X_transp_AX.diagonal();
+
MutableNativeColumnVector stdErr(
inAllocator.allocateArray<double>(inCoef.size()));
MutableNativeColumnVector waldZStats(
@@ -1055,23 +1061,22 @@ AnyType stateToResult(
inAllocator.allocateArray<double>(inCoef.size()));
for (Index i = 0; i < inCoef.size(); ++i) {
- stdErr(i) = std::sqrt(diagonal_of_inverse_of_X_transp_AX(i));
+ stdErr(i) = std::sqrt(diagonal_of_X_transp_AX(i));
waldZStats(i) = inCoef(i) / stdErr(i);
waldPValues(i) = 2. * prob::cdf( prob::normal(),
-std::abs(waldZStats(i)));
oddsRatios(i) = std::exp( inCoef(i) );
}
+
// Return all coefficients, standard errors, etc. in a tuple
AnyType tuple;
tuple << inCoef << logLikelihood << stdErr << waldZStats << waldPValues
- << oddsRatios << sqrt(conditionNo) << status << numRows;
+ << oddsRatios << inverse_of_X_transp_AX
+ << sqrt(decomposition.conditionNo()) << status << numRows;
return tuple;
}
-
-
-
// ---------------------------------------------------------------------------
// Robust Logistic Regression States
// ---------------------------------------------------------------------------
@@ -1187,7 +1192,7 @@ class RobustLogRegrTransitionState {
*/
void rebind(uint16_t inWidthOfX) {
- iteration.rebind(&mStorage[0]);
+ iteration.rebind(&mStorage[0]);
widthOfX.rebind(&mStorage[1]);
coef.rebind(&mStorage[2], inWidthOfX);
numRows.rebind(&mStorage[2 + inWidthOfX]);
@@ -1202,7 +1207,7 @@ class RobustLogRegrTransitionState {
- typename HandleTraits<Handle>::ReferenceToUInt32 iteration;
+ typename HandleTraits<Handle>::ReferenceToUInt32 iteration;
typename HandleTraits<Handle>::ReferenceToUInt16 widthOfX;
typename HandleTraits<Handle>::ColumnVectorTransparentHandleMap coef;
@@ -1219,13 +1224,13 @@ class RobustLogRegrTransitionState {
AnyType robuststateToResult(
const Allocator &inAllocator,
- const ColumnVector &inCoef,
+ const ColumnVector &inCoef,
const ColumnVector &diagonal_of_varianceMat) {
- MutableNativeColumnVector variance(
+ MutableNativeColumnVector variance(
inAllocator.allocateArray<double>(inCoef.size()));
- MutableNativeColumnVector coef(
+ MutableNativeColumnVector coef(
inAllocator.allocateArray<double>(inCoef.size()));
MutableNativeColumnVector stdErr(
@@ -1332,23 +1337,23 @@ robust_logregr_step_final::run(AnyType &args) {
// We request a mutable object. Depending on the backend, this might perform
// a deep copy.
RobustLogRegrTransitionState<MutableArrayHandle<double> > state = args[0];
- // Aggregates that haven't seen any data just return Null.
+ // Aggregates that haven't seen any data just return Null.
if (state.numRows == 0)
return Null();
- //Compute the robust variance with the White sandwich estimator
- SymmetricPositiveDefiniteEigenDecomposition<Matrix> decomposition(
+ //Compute the robust variance with the White sandwich estimator
+ SymmetricPositiveDefiniteEigenDecomposition<Matrix> decomposition(
state.X_transp_AX, EigenvaluesOnly, ComputePseudoInverse);
- Matrix bread = decomposition.pseudoInverse();
+ Matrix bread = decomposition.pseudoInverse();
- /*
+ /*
This is written a little strangely because it prevents Eigen warnings.
The following two lines are equivalent to:
Matrix variance = bread*state.meat*bread;
but eigen throws a warning on that.
- */
- Matrix varianceMat;// = meat;
+ */
+ Matrix varianceMat;// = meat;
varianceMat = bread*state.meat*bread;
/*
@@ -1478,7 +1483,7 @@ class MarginalLogRegrTransitionState {
* - 3 + widthOfX: X_transp_AX (X^T A X)
*/
void rebind(uint16_t inWidthOfX) {
- iteration.rebind(&mStorage[0]);
+ iteration.rebind(&mStorage[0]);
widthOfX.rebind(&mStorage[1]);
coef.rebind(&mStorage[2], inWidthOfX);
numRows.rebind(&mStorage[2 + inWidthOfX]);
@@ -1491,7 +1496,7 @@ class MarginalLogRegrTransitionState {
public:
- typename HandleTraits<Handle>::ReferenceToUInt32 iteration;
+ typename HandleTraits<Handle>::ReferenceToUInt32 iteration;
typename HandleTraits<Handle>::ReferenceToUInt16 widthOfX;
typename HandleTraits<Handle>::ColumnVectorTransparentHandleMap coef;
typename HandleTraits<Handle>::ReferenceToUInt64 numRows;
@@ -1601,8 +1606,8 @@ marginal_logregr_step_transition::run(AnyType &args) {
}
// Standard error according to the delta method
- state.delta += p * (1 - p) * delta;
-
+ state.delta += p * (1 - p) * delta;
+
return state;
}
@@ -1636,20 +1641,23 @@ marginal_logregr_step_final::run(AnyType &args) {
// We request a mutable object.
// Depending on the backend, this might perform a deep copy.
MarginalLogRegrTransitionState<MutableArrayHandle<double> > state = args[0];
- // Aggregates that haven't seen any data just return Null.
+ // Aggregates that haven't seen any data just return Null.
if (state.numRows == 0)
return Null();
-
+
// Compute variance matrix of logistic regression
- SymmetricPositiveDefiniteEigenDecomposition<Matrix> decomposition(
+ SymmetricPositiveDefiniteEigenDecomposition<Matrix> decomposition(
state.X_transp_AX, EigenvaluesOnly, ComputePseudoInverse);
-
- Matrix variance = decomposition.pseudoInverse();
+
+ Matrix variance = decomposition.pseudoInverse();
+
+ // dberr << "Delta = " << state.delta << " numrows = " << state.numRows << std::endl;
+ // dberr << "Variance = " << variance << std::endl;
// Standard error according to the delta method
- Matrix std_err;
- std_err = state.delta * variance * trans(state.delta) / (state.numRows*state.numRows);
-
+ Matrix std_err;
+ std_err = state.delta * variance * trans(state.delta) / (state.numRows*state.numRows);
+
// Computing the marginal effects
return marginalstateToResult(*this,
state.coef,
381 src/modules/regress/marginal.cpp
@@ -0,0 +1,381 @@
+/* ------------------------------------------------------
+ *
+ * @file marginal.cpp
+ *
+ * @brief Marginal effects for logistic regression with interaction terms
+ *
+ * We implement marginal effects and their standard errors for
+ * logistic-regression models, including interaction terms.
+ *
+ *//* ----------------------------------------------------------------------- */
+#include <limits>
+#include <dbconnector/dbconnector.hpp>
+#include <modules/shared/HandleTraits.hpp>
+#include <modules/prob/boost.hpp>
+#include <boost/math/distributions.hpp>
+#include <modules/prob/student.hpp>
+#include "marginal.hpp"
+
+namespace madlib {
+
+// Use Eigen
+using namespace dbal::eigen_integration;
+
+namespace modules {
+
+// Import names from other MADlib modules
+using dbal::NoSolutionFoundException;
+
+namespace regress {
+
+// FIXME this enum should be accessed by all modules that may need grouping
+// valid status values
+enum { IN_PROCESS, COMPLETED, TERMINATED, NULL_EMPTY };
+
+inline double logistic(double x) {
+ return 1. / (1. + std::exp(-x));
+}
+
+// ---------------------------------------------------------------------------
+// Marginal Effects Logistic Regression States
+// ---------------------------------------------------------------------------
+/**
+ * @brief State for marginal effects calculation for logistic regression
+ *
+ * TransitionState encapsulates the transition state during the
+ * marginal effects calculation for the logistic-regression aggregate function.
+ * To the database, the state is exposed as a single DOUBLE PRECISION array,
+ * to the C++ code it is a proper object containing scalars and vectors.
+ *
+ * Note: We assume that the DOUBLE PRECISION array is initialized by the
+ * database with length at least 5, and all elements are 0.
+ *
+ */
+template <class Handle>
+class MarginsLogregrInteractionState {
+ template <class OtherHandle>
+ friend class MarginsLogregrInteractionState;
+
+ public:
+ MarginsLogregrInteractionState(const AnyType &inArray)
+ : mStorage(inArray.getAs<Handle>()) {
+
+ rebind(static_cast<uint16_t>(mStorage[1]),
+ static_cast<uint16_t>(mStorage[2]),
+ static_cast<uint16_t>(mStorage[3]));
+ }
+
+ /**
+ * @brief Convert to backend representation
+ *
+ * We define this function so that we can use State in the
+ * argument list and as a return type.
+ */
+ inline operator AnyType() const {
+ return mStorage;
+ }
+
+ /**
+ * @brief Initialize the marginal variance calculation state.
+ *
+ * This function is only called for the first iteration, for the first row.
+ */
+ inline void initialize(const Allocator &inAllocator,
+ const uint16_t inWidthOfX,
+ const uint16_t inNumBasis,
+ const uint16_t inNumCategoricals) {
+ mStorage = inAllocator.allocateArray<double, dbal::AggregateContext,
+ dbal::DoZero, dbal::ThrowBadAlloc>(
+ arraySize(inWidthOfX, inNumBasis, inNumCategoricals));
+ rebind(inWidthOfX, inNumBasis, inNumCategoricals);
+ widthOfX = inWidthOfX;
+ numBasis = inNumBasis;
+ numCategoricals = inNumCategoricals;
+ }
+
+ /**
+ * @brief We need to support assigning the previous state
+ */
+ template <class OtherHandle>
+ MarginsLogregrInteractionState &operator=(
+ const MarginsLogregrInteractionState<OtherHandle> &inOtherState) {
+
+ for (size_t i = 0; i < mStorage.size(); i++)
+ mStorage[i] = inOtherState.mStorage[i];
+ return *this;
+ }
+
+ /**
+ * @brief Merge with another State object by copying the intra-iteration
+ * fields
+ */
+ template <class OtherHandle>
+ MarginsLogregrInteractionState &operator+=(
+ const MarginsLogregrInteractionState<OtherHandle> &inOtherState) {
+
+ if (mStorage.size() != inOtherState.mStorage.size() ||
+ widthOfX != inOtherState.widthOfX)
+ throw std::logic_error("Internal error: Incompatible transition "
+ "states");
+
+ numRows += inOtherState.numRows;
+ marginal_effects += inOtherState.marginal_effects;
+ delta += inOtherState.delta;
+ return *this;
+ }
+
+ /**
+ * @brief Reset the inter-iteration fields.
+ */
+ inline void reset() {
+ numRows = 0;
+ marginal_effects.fill(0);
+ categorical_indices.fill(0);
+ training_data_vcov.fill(0);
+ delta.fill(0);
+ }
+
+ private:
+ static inline size_t arraySize(const uint16_t inWidthOfX,
+ const uint16_t inNumBasis,
+ const uint16_t inNumCategoricals) {
+ return 5 + inNumBasis + inNumCategoricals + (inWidthOfX + inNumBasis) * inWidthOfX;
+ }
+
+ /**
+ * @brief Rebind to a new storage array
+ *
+ * @param inWidthOfX The number of independent variables.
+ *
+ */
+ void rebind(uint16_t inWidthOfX, uint16_t inNumBasis, uint16_t inNumCategoricals) {
+ iteration.rebind(&mStorage[0]);
+ widthOfX.rebind(&mStorage[1]);
+ numBasis.rebind(&mStorage[2]);
+ numCategoricals.rebind(&mStorage[3]);
+ numRows.rebind(&mStorage[4]);
+ marginal_effects.rebind(&mStorage[5], inNumBasis);
+ training_data_vcov.rebind(&mStorage[5 + inNumBasis], inWidthOfX, inWidthOfX);
+ delta.rebind(&mStorage[5 + inNumBasis + inWidthOfX * inWidthOfX],
+ inNumBasis, inWidthOfX);
+ if (inNumCategoricals > 0)
+ categorical_indices.rebind(&mStorage[5 + inNumBasis +
+ (inWidthOfX + inNumBasis) * inWidthOfX],
+ inNumCategoricals);
+ }
+ Handle mStorage;
+
+ public:
+
+ typename HandleTraits<Handle>::ReferenceToUInt32 iteration;
+ typename HandleTraits<Handle>::ReferenceToUInt16 widthOfX;
+ typename HandleTraits<Handle>::ReferenceToUInt16 numBasis;
+ typename HandleTraits<Handle>::ReferenceToUInt16 numCategoricals;
+ typename HandleTraits<Handle>::ReferenceToUInt64 numRows;
+ typename HandleTraits<Handle>::ColumnVectorTransparentHandleMap marginal_effects;
+ typename HandleTraits<Handle>::ColumnVectorTransparentHandleMap categorical_indices;
+ typename HandleTraits<Handle>::MatrixTransparentHandleMap training_data_vcov;
+ typename HandleTraits<Handle>::MatrixTransparentHandleMap delta;
+};
+// ----------------------------------------------------------------------
+
+/**
+ * @brief Helper function that computes the final statistics for the marginal effects
+ */
+
+AnyType margins_stateToResult(
+ const Allocator &inAllocator,
+ const ColumnVector &diagonal_of_variance_matrix,
+ const ColumnVector &inmarginal_effects_per_observation,
+ const double numRows) {
+
+ uint16_t n_basis_terms = inmarginal_effects_per_observation.size();
+ MutableNativeColumnVector marginal_effects(
+ inAllocator.allocateArray<double>(n_basis_terms));
+ MutableNativeColumnVector stdErr(
+ inAllocator.allocateArray<double>(n_basis_terms));
+ MutableNativeColumnVector tStats(
+ inAllocator.allocateArray<double>(n_basis_terms));
+ MutableNativeColumnVector pValues(
+ inAllocator.allocateArray<double>(n_basis_terms));
+
+ for (Index i = 0; i < n_basis_terms; ++i) {
+ marginal_effects(i) = inmarginal_effects_per_observation(i) / numRows;
+ stdErr(i) = std::sqrt(diagonal_of_variance_matrix(i));
+ tStats(i) = marginal_effects(i) / stdErr(i);
+
+ // P-values only make sense if numRows > coef.size()
+ if (numRows > n_basis_terms)
+ pValues(i) = 2. * prob::cdf( prob::normal(),
+ -std::abs(tStats(i)));
+ }
+
+ // Return all marginal effects, standard errors, etc. in a tuple
+ // Note: p-values will return NULL if numRows <= coef.size()
+ AnyType tuple;
+ tuple << marginal_effects
+ << stdErr
+ << tStats
+ << (numRows > n_basis_terms? pValues: Null());
+ return tuple;
+}
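For intuition, the averaging and delta-method statistics that `margins_stateToResult` computes can be sketched in a few lines of NumPy. This is a hand-rolled illustration, not MADlib code; `erfc` stands in for the normal CDF so the sketch needs no SciPy:

```python
import math
import numpy as np

def margins_state_to_result(diag_variance, me_sum, num_rows):
    """Mirror the final-statistics helper: average the accumulated
    marginal effects and derive delta-method standard errors,
    z-statistics and two-sided p-values."""
    margins = me_sum / num_rows
    std_err = np.sqrt(diag_variance)
    z_stats = margins / std_err
    # 2 * Phi(-|z|), written via erfc to avoid a scipy dependency
    p_values = np.array([math.erfc(abs(z) / math.sqrt(2.0)) for z in z_stats])
    return margins, std_err, z_stats, p_values
```

As in the C++ above, the p-values are only meaningful when the number of rows exceeds the number of basis terms.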
+
+
+/**
+ * @brief Perform the marginal effects transition step
+ */
+AnyType
+margins_logregr_int_transition::run(AnyType &args) {
+ MarginsLogregrInteractionState<MutableArrayHandle<double> > state = args[0];
+ if (args[1].isNull()) { return args[0]; }
+ MappedColumnVector x;
+ try {
+ // an exception is raised in the backend if args[2] contains nulls
+ MappedColumnVector xx = args[1].getAs<MappedColumnVector>();
+ // x is a const reference, we can only rebind to change its pointer
+ x.rebind(xx.memoryHandle(), xx.size());
+ } catch (const ArrayWithNullException &e) {
+ return args[0];
+ }
+
+ // The following check was added with MADLIB-138.
+ if (!dbal::eigen_integration::isfinite(x))
+ throw std::domain_error("Design matrix is not finite.");
+
+ MappedColumnVector coef = args[2].getAs<MappedColumnVector>();
+
+ // matrix is read in as a column-order matrix, the input is passed-in as row-order.
+ Matrix derivative_matrix = args[4].getAs<MappedMatrix>();
+ derivative_matrix.transposeInPlace();
+
+ if (state.numRows == 0) {
+ if (x.size() > std::numeric_limits<uint16_t>::max())
+ throw std::domain_error("Number of independent variables cannot be "
+ "larger than 65535.");
+ Matrix training_data_vcov = args[3].getAs<MappedMatrix>();
+
+ MappedColumnVector categorical_indices;
+ uint16_t numCategoricals = 0;
+ if (!args[5].isNull()) {
+ MappedColumnVector xx = args[5].getAs<MappedColumnVector>();
+ categorical_indices.rebind(xx.memoryHandle(), xx.size());
+ numCategoricals = categorical_indices.size();
+ }
+ state.initialize(*this,
+ static_cast<uint16_t>(coef.size()),
+ static_cast<uint16_t>(derivative_matrix.rows()),
+ static_cast<uint16_t>(numCategoricals));
+ state.training_data_vcov = training_data_vcov;
+ if (numCategoricals > 0)
+ state.categorical_indices = categorical_indices;
+ }
+
+ // Now do the transition step
+ state.numRows++;
+ double xc = dot(x, coef);
+ double p = std::exp(xc)/ (1 + std::exp(xc));
+
+ // compute marginal effects and delta using 1st and 2nd derivatives
+ ColumnVector coef_interaction_sum;
+ coef_interaction_sum = derivative_matrix * coef;
+ ColumnVector current_me = coef_interaction_sum * p * (1 - p);
+
+ Matrix current_delta;
+ current_delta = p * (1 - p) * (
+ (1 - 2 * p) * coef_interaction_sum * trans(x) +
+ derivative_matrix);
+
+ // update marginal effects and delta using discrete differences just for
+ // categorical variables
+ Matrix x_set;
+ Matrix x_unset;
+ if (!args[6].isNull() && !args[7].isNull()){
+ // the matrix is read in column-order but passed in row-order
+ x_set = args[6].getAs<MappedMatrix>();
+ x_set.transposeInPlace();
+
+ x_unset = args[7].getAs<MappedMatrix>();
+ x_unset.transposeInPlace();
+ }
+ for (Index i = 0; i < state.numCategoricals; ++i) {
+ // Note: categorical_indices are assumed to be zero-based
+ double xc_set = dot(x_set.row(i), coef);
+ double p_set = logistic(xc_set);
+ double xc_unset = dot(x_unset.row(i), coef);
+ double p_unset = logistic(xc_unset);
+ current_me(static_cast<uint16_t>(state.categorical_indices(i))) = p_set - p_unset;
+
+ current_delta.row(static_cast<uint16_t>(state.categorical_indices(i))) = (
+ (p_set * (1 - p_set) * x_set.row(i) -
+ p_unset * (1 - p_unset) * x_unset.row(i)));
+
+ }
+
+ state.marginal_effects += current_me;
+ state.delta += current_delta;
+ return state;
+}
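The continuous-variable part of the transition step above computes, per row, the marginal effects `p(1-p) * (J beta)` and the delta matrix `p(1-p) * ((1-2p)(J beta) x' + J)`. A minimal NumPy sketch of one such row update (names and shapes are illustrative, mirroring the C++):

```python
import numpy as np

def logregr_int_row_update(x, coef, deriv):
    """One row of the marginal-effects transition step for logistic
    regression with interaction terms. `deriv` holds the partial
    derivatives of every term w.r.t. each basis variable
    (shape: n_basis x n_terms)."""
    p = 1.0 / (1.0 + np.exp(-x @ coef))          # logistic probability
    jb = deriv @ coef                            # coef_interaction_sum
    me = p * (1.0 - p) * jb                      # per-row marginal effects
    delta = p * (1.0 - p) * ((1.0 - 2.0 * p) * np.outer(jb, x) + deriv)
    return me, delta
```

The aggregate sums `me` and `delta` across rows; categorical variables then overwrite their entries with discrete differences, as in the loop above.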
+
+
+/**
+ * @brief Marginal effects: Merge transition states
+ */
+AnyType
+margins_logregr_int_merge::run(AnyType &args) {
+ MarginsLogregrInteractionState<MutableArrayHandle<double> > stateLeft = args[0];
+ MarginsLogregrInteractionState<ArrayHandle<double> > stateRight = args[1];
+ // We first handle the trivial case where this function is called with one
+ // of the states being the initial state
+ if (stateLeft.numRows == 0)
+ return stateRight;
+ else if (stateRight.numRows == 0)
+ return stateLeft;
+
+ // Merge states together and return
+ stateLeft += stateRight;
+ return stateLeft;
+}
+
+/**
+ * @brief Marginal effects: Final step
+ */
+AnyType
+margins_logregr_int_final::run(AnyType &args) {
+ // We request a mutable object.
+ // Depending on the backend, this might perform a deep copy.
+ MarginsLogregrInteractionState<MutableArrayHandle<double> > state = args[0];
+ // Aggregates that haven't seen any data just return Null.
+ if (state.numRows == 0)
+ return Null();
+
+
+ // Standard error for continuous variables according to the delta method
+ Matrix std_err;
+ std_err = state.delta * state.training_data_vcov * trans(state.delta) / (state.numRows*state.numRows);
+
+ // Standard error for categorical variables according to the delta method
+
+ // Combining the two standard error vectors
+
+ // Computing the marginal effects
+ return margins_stateToResult(*this, std_err.diagonal(),
+ state.marginal_effects,
+ state.numRows);
+}
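The final step's delta-method covariance, `delta * Vcov * delta' / n^2`, amounts to the following (an illustrative sketch with hypothetical names, where `delta_sum` is the matrix accumulated over all rows):

```python
import numpy as np

def margins_delta_stderr(delta_sum, vcov, num_rows):
    """Delta-method standard errors of the averaged marginal effects:
    sqrt(diag((D/n) * Vcov * (D/n)^T)), with D the accumulated delta
    matrix and Vcov the coefficient variance-covariance matrix."""
    avg_delta = delta_sum / num_rows
    return np.sqrt(np.diag(avg_delta @ vcov @ avg_delta.T))
```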
+// ------------------------ End of Marginal ------------------------------------
+
+} // namespace regress
+
+} // namespace modules
+
+} // namespace madlib
View
14 src/modules/regress/marginal.hpp
@@ -0,0 +1,14 @@
+/**
+ * @brief Marginal Effects Logistic regression step: Transition function
+ */
+DECLARE_UDF(regress, margins_logregr_int_transition)
+
+/**
+ * @brief Marginal effects Logistic regression: State merge function
+ */
+DECLARE_UDF(regress, margins_logregr_int_merge)
+
+/**
+ * @brief Marginal Effects Logistic regression: Final function
+ */
+DECLARE_UDF(regress, margins_logregr_int_final)
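The three UDFs declared above form the standard PostgreSQL aggregate triple: a transition function folds rows into a state, a merge function combines partial states from parallel segments, and a final function turns the state into the result. A toy Python analogue of that control flow, with a deliberately simplified (count, sum) state:

```python
from functools import reduce

def transition(state, row):
    n, total = state
    return (n + 1, total + row)

def merge(s1, s2):
    # trivial-state handling mirrors margins_logregr_int_merge
    if s1[0] == 0:
        return s2
    if s2[0] == 0:
        return s1
    return (s1[0] + s2[0], s1[1] + s2[1])

def final(state):
    n, total = state
    return None if n == 0 else total / n

# two "segments" aggregated independently, then merged
seg1 = reduce(transition, [1.0, 2.0], (0, 0.0))
seg2 = reduce(transition, [3.0], (0, 0.0))
result = final(merge(seg1, seg2))  # mean of 1.0, 2.0, 3.0
```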
View
1  src/modules/regress/regress.hpp
@@ -9,5 +9,6 @@
#include "linear.hpp"
#include "clustered_errors.hpp"
#include "logistic.hpp"
+#include "marginal.hpp"
#include "multilogistic.hpp"
#include "mlogr_margins.hpp"
View
13 src/ports/postgres/modules/regress/linear.py_in
@@ -132,12 +132,13 @@ def linregr_train(schema_madlib, source_table, out_table,
"""
create table {out_table}_summary as
select
- '{source_table}'::varchar as source_table,
- '{out_table}'::varchar as out_table,
- '{dependent_varname}'::varchar as dependent_varname,
- '{independent_varname}'::varchar as independent_varname,
- {num_rows_processed}::integer as num_rows_processed,
- {num_rows_skipped}::integer as num_missing_rows_skipped
+ 'linregr'::varchar as method,
+ '{source_table}'::varchar as source_table,
+ '{out_table}'::varchar as out_table,
+ '{dependent_varname}'::varchar as dependent_varname,
+ '{independent_varname}'::varchar as independent_varname,
+ {num_rows_processed}::integer as num_rows_processed,
+ {num_rows_skipped}::integer as num_missing_rows_skipped
""".format(out_table=out_table, source_table=source_table,
dependent_varname=dependent_varname,
independent_varname=independent_varname,
View
22 src/ports/postgres/modules/regress/logistic.py_in
@@ -279,7 +279,9 @@ def __logregr_train_compute(schema_madlib, tbl_source, tbl_output, dep_col,
(case when (result).status = 1 then num_rows - (result).num_processed
when result is null then num_rows
else NULL::bigint end) as num_missing_rows_skipped,
- {col_grp_iteration} as num_iterations
+ {col_grp_iteration} as num_iterations,
+ (case when (result).status = 1 then (result).vcov
+ else NULL::double precision[] end) as variance_covariance
from
(
select
@@ -349,14 +351,15 @@ def __logregr_train_compute(schema_madlib, tbl_source, tbl_output, dep_col,
"""
create table {tbl_output}_summary as
select
- '{tbl_source}'::varchar as source_table,
- '{tbl_output}'::varchar as out_table,
- '{dep_col}'::varchar as dependent_varname,
- '{ind_col}'::varchar as independent_varname,
+ 'logregr'::varchar as method,
+ '{tbl_source}'::varchar as source_table,
+ '{tbl_output}'::varchar as out_table,
+ '{dep_col}'::varchar as dependent_varname,
+ '{ind_col}'::varchar as independent_varname,
'optimizer={optimizer}, max_iter={max_iter}, tolerance={tolerance}'::varchar as optimizer_params,
{all_groups}::integer as num_all_groups,
- {failed_groups}::integer as num_failed_groups,
- {num_rows_processed}::integer as num_rows_processed,
+ {failed_groups}::integer as num_failed_groups,
+ {num_rows_processed}::integer as num_rows_processed,
{num_missing_rows_skipped}::integer as num_missing_rows_skipped
""".format(all_groups=all_groups, failed_groups=failed_groups,
**args))
@@ -384,7 +387,7 @@ def logregr_help_msg(schema_madlib, message, **kwargs):
Returns:
A string, contains the help message
"""
- if message is None:
+ if not message:
help_string = """
----------------------------------------------------------------
@@ -437,6 +440,7 @@ The output table ('out_table' above) has the following columns:
'num_iterations' double precision -- how many iterations are used in the computation per group
A summary table named <out_table>_summary is also created at the same time, which has:
+ 'method' varchar, -- modeling method name ('logregr')
'source_table' varchar, -- the data source table name
'out_table' varchar, -- the output table name
'dependent_varname' varchar, -- the dependent variable
@@ -493,7 +497,7 @@ SELECT madlib.logregr_train( 'patients',
SELECT * from patients_logregr;
"""
else:
- help_string = "No such option. Use {schema_madlib}.logregr_train()"
+ help_string = "No such option. Use {schema_madlib}.logregr_train('help')"
return help_string.format(schema_madlib=schema_madlib)
## ========================================================================
View
3  src/ports/postgres/modules/regress/logistic.sql_in
@@ -517,6 +517,7 @@ CREATE TYPE MADLIB_SCHEMA.__logregr_result AS (
z_stats DOUBLE PRECISION[],
p_values DOUBLE PRECISION[],
odds_ratios DOUBLE PRECISION[],
+ vcov DOUBLE PRECISION[],
condition_no DOUBLE PRECISION,
status INTEGER,
num_processed BIGINT,
@@ -894,7 +895,7 @@ m4_ifdef(`__HAS_FUNCTION_PROPERTIES__', `CONTAINS SQL', `');
CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.logregr_train()
RETURNS TEXT
AS $$
- SELECT MADLIB_SCHEMA.logregr_train(''::TEXT);
+ SELECT MADLIB_SCHEMA.logregr_train(NULL::TEXT);
$$ LANGUAGE SQL IMMUTABLE
m4_ifdef(`__HAS_FUNCTION_PROPERTIES__', `CONTAINS SQL', `');
------------------------------------------------------------------------
View
29 src/ports/postgres/modules/regress/marginal.py_in
@@ -139,6 +139,8 @@ def margins_logregr(schema_madlib, source_table, out_table,
For function usage information. Run
sql> select margins_logregr('usage');
"""
+ plpy.warning("This function has been deprecated and replaced by 'margins'")
+
# Reset the message level to avoid random messages
old_msg_level = plpy.execute("""
SELECT setting
@@ -156,9 +158,6 @@ def margins_logregr(schema_madlib, source_table, out_table,
tolerance)
_margins_logregr_validate_args(optimizer)
- # NOTICE: * support was removed because other modules did not have it.
- # Uncomment the following code if you want to re-add '*' support
-
group_col_str = 'NULL' if grouping_cols is None else "'" + grouping_cols + "'"
optimizer_str = 'NULL' if optimizer is None else "'" + optimizer + "'"
maxiter_str = 'NULL' if max_iter is None else max_iter
@@ -179,8 +178,8 @@ def margins_logregr(schema_madlib, source_table, out_table,
maxiter_str=maxiter_str, optimizer_str=optimizer_str,
tolerance_str=tolerance_str, verbose=verbose_mode))
- m4_changequote(`>>>', `<<<')
- m4_ifdef(>>>__HAWQ__<<<, >>>
+ m4_changequote(`<!', `!>')
+ m4_ifdef(<!__HAWQ__!>, <!
plpy.execute("""
CREATE TABLE {out_table}_summary AS
SELECT
@@ -196,7 +195,7 @@ def margins_logregr(schema_madlib, source_table, out_table,
FROM
{old}_summary
""".format(out_table=out_table, old=logr_out_table))
- <<<, >>>
+ !>, <!
# Rename the output summary table
rename_table(schema_madlib,
"{old}_summary".format(old=logr_out_table),
@@ -204,8 +203,8 @@ def margins_logregr(schema_madlib, source_table, out_table,
plpy.execute("""UPDATE {out_table}_summary SET out_table = '{out_table}'
""".format(out_table=out_table))
- <<<)
- m4_changequote(>>>`<<<, >>>'<<<)
+ !>)
+ m4_changequote(<!`!>, <!'!>)
coef = plpy.execute("select coef from {0}".format(logr_out_table))[0]['coef']
if coef is None:
@@ -255,6 +254,8 @@ def margins_logregr(schema_madlib, source_table, out_table,
def margins_logregr_help(schema_madlib, message, **kwargs):
+ plpy.warning("This function has been deprecated and replaced by 'margins'")
+
if not message:
help_string = """
-----------------------------------------------------------------------
@@ -303,6 +304,8 @@ The output summary table is the same as logregr_train(), see also:
return help_string.format(schema_madlib=schema_madlib)
+
+# ========================================================================
# -----------------------------------------------------------------------
# Marginal Effects for multinomial logistic regression
# -----------------------------------------------------------------------
@@ -551,8 +554,8 @@ def margins_mlogregr(schema_madlib, source_table, out_table,
'max_iter={max_iter}, optimizer={optimizer}, tolerance={tolerance}')
""".format(**all_arguments))
- m4_changequote(`>>>', `<<<')
- m4_ifdef(>>>__HAWQ__<<<, >>>
+ m4_changequote(`<!', `!>')
+ m4_ifdef(<!__HAWQ__!>, <!
plpy.execute("""
CREATE TABLE {out_table}_summary AS
SELECT
@@ -567,7 +570,7 @@ def margins_mlogregr(schema_madlib, source_table, out_table,
FROM
{old}_summary
""".format(out_table=out_table, old=mlog_out_table))
- <<<, >>>
+ !>, <!
# Rename the output summary table
rename_table(schema_madlib,
"{old}_summary".format(old=mlog_out_table),
@@ -575,8 +578,8 @@ def margins_mlogregr(schema_madlib, source_table, out_table,
plpy.execute("UPDATE {out_table}_summary SET out_table = '{out_table}'".
format(out_table=out_table))
- <<<)
- m4_changequote(>>>`<<<, >>>'<<<)
+ !>)
+ m4_changequote(<!`!>, <!'!>)
num_categories = plpy.execute(
"SELECT count(DISTINCT {0}) as n_cat FROM {1}".
View
313 src/ports/postgres/modules/regress/marginal.sql_in
@@ -18,8 +18,8 @@ m4_include(`SQLCommon.m4')
<div class="toc"><b>Contents</b>
<ul>
-<li><a href="#logregr_train">Logistic Regression Training Function</a></li>
-<li><a href="#mlogregr_train">Multinomial Logistic Regression Training Function</a></li>
+<li><a href="#margins">Marginal Effects for Logistic Regression with Interaction Terms</a></li>
+<li><a href="#mlogregr_train">Marginal Effects for Multinomial Logistic Regression</a></li>
<li><a href="#examples">Examples</a></li>
<li><a href="#notes">Notes</a></li>
<li><a href="#background">Technical Background</a></li>
@@ -28,10 +28,10 @@ m4_include(`SQLCommon.m4')
</ul>
</div>
-@brief Calculates marginal effects for the coefficients in logistic and multinomial logistic regression problems.
+@brief Calculates marginal effects for the coefficients in regression problems.
A marginal effect (ME) or partial effect measures the effect on the
-conditional mean of \f$ y \f$ of a change in one of the regressors, say
+conditional mean of \f$ y \f$ for a change in one of the regressors, say
\f$X_k\f$. In the linear regression model, the ME equals the
relevant slope coefficient, greatly simplifying analysis. For nonlinear models,
specialized algorithms are required for calculating ME. The marginal effect
@@ -41,10 +41,82 @@ source table.
MADlib provides marginal effects regression functions for logistic and
multinomial logistic regressions.
+@warning The 'margins_logregr' function has been deprecated in favor of the new 'margins' function.
+
+@anchor margins
+@par Marginal Effects for Logistic Regression with Interaction Terms
+<pre class="syntax">
+margins( model_table,
+ output_table,
+ x_design,
+ source_table,
+ marginal_vars
+ )
+</pre>
+\b Arguments
+<dl class="arglist">
+ <dt>modele_table</dt>
+ <dd>VARCHAR. Name of the model table, which is the output of
+ '<em>logregr_train</em>' and '<em>mlogregr_train</em>'.</dd>
+ <dt>output_table</dt>
+ <dd>VARCHAR. Name of result table. The output table has the following columns.
+ <table class="output">
+ <tr>
+ <th>variables</th>
+ <td>INTEGER[]. The indices of the basis variables.</td>
+ </tr>
+ <tr>
+ <th>margins</th>
+ <td>DOUBLE PRECISION[]. The marginal effects.</td>
+ </tr>
+ <tr>
+ <th>std_err</th>
+ <td>DOUBLE PRECISION[]. An array of the standard errors,
+ computed using the delta method.</td>
+ </tr>
+ <tr>
+ <th>z_stats</th>
+ <td>DOUBLE PRECISION[]. An array of the z-stats of the marginal effects.</td>
+ </tr>
+ <tr>
+ <th>p_values</th>
+ <td>DOUBLE PRECISION[]. An array of the Wald p-values of the marginal effects.</td>
+ </tr>
+ </table>
+ </dd>
+ <dt>x_design (optional)</dt>
+ <dd>VARCHAR, default: NULL. The design of the independent variables, needed
+ only if interaction terms are present. This is necessary because the
+ independent variables in the underlying regression
+ could be a flat array, making it impossible to infer interaction
+ effects between variables.
+
+ Example:
+ The user can provide independent_varname in the regression method
+ either of the following ways:
+ - <code> ‘array[1, color_blue, color_green, gender_female, gpa, gpa^2, gender_female*gpa, gender_female*gpa^2, weight]’ </code>
+ - <code> ‘x’ </code>
+
+ In the second version, the column ‘x’ is an array containing data
+ identical to that expressed in the first version, computed in a prior
+ data preparation step.
+
+ The user would then supply an x_design argument to the <em>margins()</em>
+ function in the following way:
+ - <code> ‘1, i.2.color, i.3.color, i.4.gender, 5, 5^2, 5*4, 5^2*4, 9’ </code>
+ </dd>
+ <dt>source_table (optional)</dt>
+ <dd>VARCHAR, default: NULL. Name of data table to apply marginal effects on.
+ If not provided or NULL then the marginal effects are computed on the training table.</dd>
+
+ <dt>marginal_vars (optional)</dt>
+ <dd>VARCHAR, default: NULL. A comma-separated list of the 1-based indices of
+ variables to calculate marginal effects on. When NULL, computes marginal effects on all variables.</dd>
+</dl>
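To make the x_design semantics concrete: for the example terms ARRAY[1, treatment, trait_anxiety, treatment^2, treatment*trait_anxiety] (x_design '1, 2, 3, 2^2, 2*3'), the derivative matrix the aggregate consumes is the Jacobian of the term vector with respect to the two basis variables. It can be written out by hand; the helper below is hypothetical, purely for illustration:

```python
import numpy as np

def design_jacobian(t, a):
    """Jacobian of the terms [1, t, a, t^2, t*a] with respect to the
    basis variables (t, a); each row feeds one marginal-effect entry."""
    return np.array([
        [0.0, 1.0, 0.0, 2.0 * t, a],   # d(term)/dt
        [0.0, 0.0, 1.0, 0.0,     t],   # d(term)/da
    ])
```

Dotting a row with the coefficient vector gives the derivative of the linear predictor, e.g. `beta_1 + 2*beta_3*t + beta_4*a` for the treatment row.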
@anchor logregr_train
-@par Logistic Regression Training Function
+@par Marginal Effects for Logistic Regression (Deprecated)
<pre class="syntax">
margins_logregr( source_table,
output_table,
@@ -105,7 +177,7 @@ margins_logregr( source_table,
</dl>
@anchor mlogregr_train
-@par Multinomial Logistic Regression Training Function
+@par Marginal Effects for Multinomial Logistic Regression
<pre class="syntax">
margins_mlogregr( source_table,
out_table,
@@ -172,9 +244,10 @@ margins_mlogregr( source_table,
@anchor examples
@examp
--# View online help for the marginal effects logistic regression function.
+
+-# View online help for the marginal effects.
<pre class="example">
-SELECT madlib.margins_logregr();
+SELECT madlib.margins();
</pre>
-# Create the sample data set.
@@ -197,47 +270,48 @@ Result:
...
</pre>
--# Run the logistic regression function and then compute the marginal effects of all variables in the regression.
-<pre class="example">
-SELECT madlib.margins_logregr( 'patients',
- 'result_table',
- 'second_attack',
- 'ARRAY[1, treatment, trait_anxiety]'
- );
-</pre>
-
--# View the regression results.
+-# Run logistic regression to get the model, then compute the marginal effects of all variables, and view the results.
<pre class="example">
+SELECT madlib.logregr_train( 'patients',
+ 'model_table',
+ 'second_attack',
+ 'ARRAY[1, treatment, trait_anxiety, treatment^2, treatment * trait_anxiety]'
+ );
+SELECT madlib.margins( 'model_table',
+ 'margins_table',
+ '1, 2, 3, 2^2, 2*3',
+ NULL,
+ NULL
+ );
\\x ON
-SELECT * FROM result_table;
+SELECT * FROM margins_table;
</pre>
Result:
<pre class="result">
-margins | {-0.970665392796,-0.156214190168,0.0181587690137}
-coef | {-6.36346994178179,-1.02410605239327,0.119044916668605}
-std_err | {0.802871454422,0.292691682191,0.0137459874022}
-z_stats | {-1.2089922832,-0.533715850748,1.32102325446}
-p_values | {0.2266658, 0.5935381, 0.1864936}
+variables | {1,2,3}
+margins | {-0.876046514609573,-0.0648833521465306,0.0177196513589633}
+std_err | {0.551714275062467,0.373592457067442,0.00458001207971933}
+z_stats | {-1.58786269307674,-0.173674149247659,3.86890930646828}
+p_values | {0.112317391159946,0.862121554662231,0.000109323294026272}
</pre>
--# Run the logistic regression function and then compute the marginal effects of the first variable in the regression.
+-# Compute the marginal effects of the first variable using the previous model and view the results.
<pre class="example">
-SELECT madlib.margins_logregr( 'patients',
- 'result_table',
- 'second_attack',
- 'ARRAY[1, treatment, trait_anxiety]',
- NULL,
- ARRAY[1]
- );
+SELECT madlib.margins( 'model_table',
+ 'result_table',
+ '1, 2, 3, 2^2, 2*3',
+ NULL,
+ '1'
+ );
SELECT * FROM result_table;
</pre>
Result:
<pre class="result">
-margins | {-0.970665392796}
-coef | {-6.36346994178179}
-std_err | {0.802871454422}
-z_stats | {-1.2089922832}
-p_values | {0.2266658}
+variables | {1}
+margins | {-0.876046514609573}
+std_err | {0.551714275062467}
+z_stats | {-1.58786269307674}
+p_values | {0.112317391159946}
</pre>
-# View online help for marginal effects multinomial logistic regression.
@@ -388,6 +462,10 @@ File marginal.sql_in documenting the SQL functions.
@endinternal
*/
+--------------------------------------------------------------------------------
+-- DEPRECATION NOTICE ----------------------------------------------------------
+-- The below udt/udf/uda have been deprecated and should be removed in next major
+
------------------ Marginal Logistic Regression ------------------------------
DROP TYPE IF EXISTS MADLIB_SCHEMA.marginal_logregr_result CASCADE;
CREATE TYPE MADLIB_SCHEMA.marginal_logregr_result AS (
@@ -659,9 +737,9 @@ RETURNS VOID AS $$
$$ LANGUAGE sql VOLATILE
m4_ifdef(`__HAS_FUNCTION_PROPERTIES__', `MODIFIES SQL DATA', `');
-- End of Default Variable calls for margins_logregr
-------------------------------------------------------------------------------
+-------------------------------------------------------------------------------
------------------- Marginal Multi-Logistic Regression ------------------------------
+------------------ Marginal Multi-Logistic Regression -------------------------
DROP TYPE IF EXISTS MADLIB_SCHEMA.marginal_mlogregr_result CASCADE;
CREATE TYPE MADLIB_SCHEMA.marginal_mlogregr_result AS (
@@ -1010,7 +1088,6 @@ $$ LANGUAGE plpgsql VOLATILE
m4_ifdef(`__HAS_FUNCTION_PROPERTIES__', `MODIFIES SQL DATA', `');
-- End of Default Variable calls for margins_mlogregr
------------------------------------------------------------------------------
--- END OF DEPRECATED NOTICE -----------------------------------------------------------
-- Default calls for margins_mlogregr (Overloaded functions)
@@ -1090,6 +1167,8 @@ m4_ifdef(`__HAS_FUNCTION_PROPERTIES__', `MODIFIES SQL DATA', `');
-- End of Default Variable calls for margins_mlogregr
------------------------------------------------------------------------------
+-- END OF DEPRECATED NOTICE -----------------------------------------------------------
+
CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.__sub_array(
value_array DOUBLE PRECISION[], -- The array containing values to be selected
@@ -1098,3 +1177,157 @@ CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.__sub_array(
AS 'MODULE_PATHNAME'
LANGUAGE C IMMUTABLE
m4_ifdef(`__HAS_FUNCTION_PROPERTIES__', `NO SQL', `');
+
+-----------------------------------------------------------------------
+-- Marginal effects with Interactions terms
+-----------------------------------------------------------------------
+
+------------------ Marginal Logistic Regression w/ Interaction -----------------
+----------------------- New interface for marginal -----------------------------
+
+DROP TYPE IF EXISTS MADLIB_SCHEMA.margins_int_logregr_result;
+CREATE TYPE MADLIB_SCHEMA.margins_int_logregr_result AS (
+ margins DOUBLE PRECISION[],
+ std_err DOUBLE PRECISION[],
+ z_stats DOUBLE PRECISION[],
+ p_values DOUBLE PRECISION[]
+);
+--------------------------------------
+
+CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.__margins_logregr_int_transition(
+ state DOUBLE PRECISION[],
+ x DOUBLE PRECISION[],
+ coef DOUBLE PRECISION[],
+ vcov DOUBLE PRECISION[],
+ derivative DOUBLE PRECISION[],
+ categorical_indices DOUBLE PRECISION[],
+ x_set DOUBLE PRECISION[],
+ x_unset DOUBLE PRECISION[])
+RETURNS DOUBLE PRECISION[]
+AS 'MODULE_PATHNAME', 'margins_logregr_int_transition'
+LANGUAGE C IMMUTABLE;
+
+--------------------------------------
+
+CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.__margins_logregr_int_merge(
+ state1 DOUBLE PRECISION[],
+ state2 DOUBLE PRECISION[])
+RETURNS DOUBLE PRECISION[]
+AS 'MODULE_PATHNAME', 'margins_logregr_int_merge'
+LANGUAGE C IMMUTABLE STRICT;
+
+--------------------------------------
+
+CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.__margins_logregr_int_final(
+ state DOUBLE PRECISION[])
+RETURNS MADLIB_SCHEMA.margins_int_logregr_result
+AS 'MODULE_PATHNAME', 'margins_logregr_int_final'
+LANGUAGE C IMMUTABLE STRICT;
+
+--------------------------------------
+
+/**
+ * @brief Compute marginal effects for logistic regression.
+ *
+ * @param independentVariables Column containing the array of independent variables
+ * @param coef Array of coefficients (as obtained from logregr_train)
+ * @param vcov Variance-covariance matrix of the coefficients
+ * @param derivative Matrix of derivatives of the terms w.r.t. the basis variables
+ * @param categorical_indices Indices of the categorical basis variables
+ * @param x_set Independent variables with each categorical variable set
+ * @param x_unset Independent variables with each categorical variable unset
+ *
+ * @return A composite value:
+ *  - <tt>margins FLOAT8[]</tt> - Array of marginal effects
+ *  - <tt>std_err FLOAT8[]</tt> - Array of standard errors (calculated by the delta method)
+ *  - <tt>z_stats FLOAT8[]</tt> - Array of z-statistics
+ *  - <tt>p_values FLOAT8[]</tt> - Array of p-values
+ */
+
+CREATE AGGREGATE MADLIB_SCHEMA.__margins_int_logregr_agg(
+ /*+ "independentVariables" */ DOUBLE PRECISION[],
+ /*+ "coef" */ DOUBLE PRECISION[],
+ /*+ "vcov" */ DOUBLE PRECISION[],
+ /*+ "derivative matrix" */ DOUBLE PRECISION[],
+ /*+ "categorical_indices" */ DOUBLE PRECISION[],
+ /*+ "x_set" */ DOUBLE PRECISION[],
+ /*+ "x_unset" */ DOUBLE PRECISION[]
+ )(
+ STYPE=DOUBLE PRECISION[],
+ SFUNC=MADLIB_SCHEMA.__margins_logregr_int_transition,
+ m4_ifdef(`__POSTGRESQL__', `', `PREFUNC=MADLIB_SCHEMA.__margins_logregr_int_merge,')
+ FINALFUNC=MADLIB_SCHEMA.__margins_logregr_int_final,
+ INITCOND='{0, 0, 0, 0, 0, 0, 0, 0, 0}'
+);
+
+-------------------------------------------------------------------------
+-- New interface for margins logregr
+-------------------------------------------------------------------------
+
+/**
+ * @brief Marginal effects with default variable_names
+ **/
+CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.margins(
+ model_table VARCHAR, -- name of table containing logistic regression model
+ out_table VARCHAR, -- name of output table to return marginal effect values
+ x_design VARCHAR, -- design of the independent variables
+ source_table VARCHAR, -- Source table to apply marginal effects on
+ -- (Optional, if not provided or NULL then
+ -- training table is taken as the source)
+ marginal_vars VARCHAR -- indices of variables to calculate marginal effects on
+ -- (Optional, if not provided or NULL then compute
+ -- marginal effects for all basis variables)
+ )
+RETURNS VOID AS $$
+PythonFunction(regress, margins, margins)
+$$ LANGUAGE plpythonu VOLATILE;
+
+
+CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.margins(
+ model_table VARCHAR, -- name of table containing logistic regression model
+ out_table VARCHAR, -- name of output table
+ x_design VARCHAR, -- design of the independent variables
+ source_table VARCHAR -- Source table to apply marginal effects on
+ -- (Optional, if not provided or NULL then
+ -- training table is taken as the source)
+ )
+RETURNS VOID AS $$
+ SELECT MADLIB_SCHEMA.margins($1, $2, $3, $4, NULL)
+$$ LANGUAGE sql VOLATILE;
+
+CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.margins(
+ model_table VARCHAR, -- name of table containing logistic regression model
+ out_table VARCHAR, -- name of output table
+ x_design VARCHAR -- design of the independent variables
+ )
+RETURNS VOID AS $$
+ SELECT MADLIB_SCHEMA.margins($1, $2, $3, NULL, NULL)
+$$ LANGUAGE sql VOLATILE;
+
+CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.margins(
+ model_table VARCHAR, -- name of table containing logistic regression model
+ out_table VARCHAR -- name of output table
+ )
+RETURNS VOID AS $$
+ SELECT MADLIB_SCHEMA.margins($1, $2, NULL, NULL, NULL)
+$$ LANGUAGE sql VOLATILE;
+-------------------------------------------------------------------------
+
+CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.margins(
+ message VARCHAR
+) RETURNS VARCHAR AS $$
+ PythonFunction(regress, margins, margins_help)
+$$ LANGUAGE plpythonu IMMUTABLE
+m4_ifdef(`__HAS_FUNCTION_PROPERTIES__', `CONTAINS SQL', `');
+
+--------------------------------------
+
+CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.margins()
+RETURNS VARCHAR AS $$
+ SELECT MADLIB_SCHEMA.margins('');
+$$ LANGUAGE sql IMMUTABLE
+m4_ifdef(`__HAS_FUNCTION_PROPERTIES__', `CONTAINS SQL', `');
+
+-------------------------------------------------------------------------------------
View
109 src/ports/postgres/modules/regress/margins.py_in
@@ -0,0 +1,109 @@
+"""
+@file margins.py_in
+
+@brief Marginal Effects with Interactions: Dispatch to the appropriate margins function
+
+@namespace marginal
+"""
+
+import plpy
+from margins_common import margins_validate_model_table
+from margins_logregr import margins_int_logregr
+#------------------------------------------------------------------------------
+
+
+def margins(schema_madlib, model_table, out_table, x_design=None,
+ source_table=None, marginal_vars=None, *args, **kwargs):
+ """
+ Call the appropriate margins functions depending on the regression type
+ found in the model table.
+
+ Args:
+ @param schema_madlib
+ @param model_table
+ @param out_table
+ @param x_design
+ @param source_table
+ @param marginal_vars
+ @param args
+ @param kwargs
+
+ Returns:
+ None
+ """
+ margins_validate_model_table(model_table)
+ # get the relevant regression type
+ if model_table:
+ model_summary_table = model_table + "_summary"
+ else:
+ raise ValueError("Margins error: Invalid regression model table name!")
+
+ reg_type = plpy.execute("SELECT method from {0}".format(model_summary_table))[0]['method']
+ if not reg_type:
+ plpy.error("Margins error: Regression type cannot be obtained from the model table")
+ elif reg_type in ("logregr", "logistic", "logistic_regression"):
+ margins_int_logregr(schema_madlib, model_table, out_table, x_design,
+ source_table, marginal_vars, *args, **kwargs)
+ else:
+ plpy.error("Margins not supported for {0}".format(str(reg_type)))
+#------------------------------------------------------------------------------
+
+
+def margins_help(schema_madlib, message, **kwargs):
+ """
+ Help function for margins
+
+ Args:
+ @param schema_madlib
+ @param message: string, Help message string
+ @param kwargs
+
+ Returns:
+ String. Help/usage information
+ """
+ if not message:
+ help_string = """
+-----------------------------------------------------------------------
+ SUMMARY
+-----------------------------------------------------------------------
+Functionality: Calculate marginal effects for binomial/multinomial logistic
+regression with interaction terms.
+A marginal effect (ME) or partial effect measures the effect on the
+conditional mean of a response (dependent variable) for a change in one of the
+regressors (independent variable).
+
+For more details on function usage:
+ SELECT {schema_madlib}.margins('usage')
+ """
+ elif message in ['usage', 'help', '?']:
+ help_string = """
+-----------------------------------------------------------------------
+ USAGE
+-----------------------------------------------------------------------
+ SELECT {schema_madlib}.margins(
+ 'model_table', -- Name of table containing regression model
+ 'output_table', -- Name of result table
+ 'x_design', -- Design of the independent variables
+                                 -- (Optional, if not provided or NULL then the independent
+                                 --  variable list is assumed to have no interaction terms)
+ 'source_table', -- Source table to apply marginal effects on
+                                 -- (Optional, if not provided or NULL then the training table is used as the source)
+ 'marginal_vars' -- Indices of variables to calculate marginal effects on
+ -- (Optional, if not provided or NULL then compute marginal effects for all basis variables)
+ );
+
+-----------------------------------------------------------------------
+                   OUTPUT
+-----------------------------------------------------------------------
+The output table ('output_table' above) has the following columns
+ variables INTEGER[], -- Indices of the basis variables,
+ -- will be same as marginal vars if provided
+ margins DOUBLE PRECISION[], -- Marginal effects
+ std_err DOUBLE PRECISION[], -- Standard errors using delta method
+    z_stats          DOUBLE PRECISION[],   -- z-statistics of the marginal effects
+    p_values         DOUBLE PRECISION[],   -- p-values of the marginal effects
+ """
+ else:
+ help_string = "No such option. Use {schema_madlib}.margins()"
+
+ return help_string.format(schema_madlib=schema_madlib)
View
888 src/ports/postgres/modules/regress/margins_common.py_in
@@ -0,0 +1,888 @@
+"""
+@file margins_common.py_in
+
+@brief Marginal Effects with Interactions: Common functions
+
+@namespace marginal
+"""
+import re
+from collections import defaultdict
+from itertools import chain
+
+if __name__ != "__main__":
+ ## The import statements are called only when this file is not executed
+ ## as a standalone (for unit testing)
+ from utilities.utilities import _assert
+ from utilities.validate_args import table_exists
+ from utilities.validate_args import columns_exist_in_table
+ from utilities.validate_args import table_is_empty
+ from utilities.utilities import py_list_to_sql_string
+# ========================================================================
+
+def margins_validate_model_table(model_table):
+    """
+    Validate that the model table and its summary table exist and contain
+    all columns required to compute marginal effects.
+
+    Args:
+        @param model_table
+
+    Returns:
+        True, if all validations pass
+    """
+ _assert(model_table and
+ model_table.strip().lower() not in ('null', ''),
+ "Margins error: Invalid regression model table name!")
+ _assert(table_exists(model_table),
+ "Margins error: Specified Model table ({0}) is missing! "
+ "Rerun underlying regression".format(model_table))
+
+ model_summary_table = model_table + '_summary'
+ _assert(table_exists(model_summary_table),
+ "Margins error: Summary for model table ({0}) is missing! "
+ "Rerun underlying regression".format(model_summary_table))
+
+ _assert(columns_exist_in_table(model_summary_table,
+ ['method', 'source_table', 'out_table', 'dependent_varname',
+ 'independent_varname']),
+ "Margins error: Invalid model summary table ({0})"
+ " - some required columns missing".format(model_summary_table))
+ return True
+#------------------------------------------------------------------------------
+
+
+def margins_validate_args(out_table, source_table, x_design=None,
+ **kwargs):
+ _assert(out_table and
+ out_table.strip().lower() not in ('null', ''),
+ "Margins error: Invalid output table name!")
+ _assert(not table_exists(out_table),
+ "Margins error: Output table already exists!")
+ _assert(not table_exists(out_table + '_summary'),
+ "Margins error: Output summary table already exists!")
+
+ _assert(source_table and source_table.strip().lower() not in ('null', ''),
+ "Margins error: Invalid data table name!")
+ _assert(table_exists(source_table),
+ "Margins error: Data table ({0}) is missing!".format(source_table))
+ _assert(not table_is_empty(source_table),
+ "Margins error: Data table ({0}) is empty!".format(source_table))
+# -------------------------------------------------------------------------
+
+
+def parse_marginal_vars(marginal_var_str):
+ """
+    Marginal vars is supposed to be a list of integers separated by ','.
+    It's possible that the user inputs it as an array string,
+    eg: ARRAY[1, 2, 3, 4] or '{2, 4, 5}'. We strip out the 'ARRAY' prefix if
+    present and also strip any square/curly brackets.
+ """
+ if str.lower(marginal_var_str.strip()[:6]) == "array[":
+ marginal_var_str = marginal_var_str[5:]
+ marginal_var_str = marginal_var_str.strip(" {}[]")
+ try:
+ marginal_var_list = map(int, marginal_var_str.split(','))
+ except Exception:
+ raise ValueError("Margins error: Invalid input for marginal vars: {0}".
+ format(marginal_var_str))
+ return marginal_var_list
+#------------------------------------------------------------------------------
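As a standalone illustration (not part of the module), the accepted `marginal_vars` formats can be exercised with a simplified sketch of the parsing logic above (error handling omitted):

```python
def parse_vars(s):
    """Parse 'ARRAY[...]', '{...}', '[...]' or a bare comma-separated list."""
    if s.strip().lower().startswith("array["):
        s = s.strip()[5:]        # drop 'ARRAY', keep '[' for the strip below
    s = s.strip(" {}[]")
    return [int(v) for v in s.split(',')]

print(parse_vars('ARRAY[1, 2, 3]'))  # [1, 2, 3]
print(parse_vars('{2, 4, 5}'))       # [2, 4, 5]
```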
+
+
+class InteractionTermBuilder(object):
+
+ def __init__(self, design_str):
+ """
+ Args:
+ @param self
+            @param design_str
+ """
+ self.design_str = design_str
+
+ # basis_terms (list): Indices from independent_variable that are basis variables
+ self.basis_terms = []
+
+ # indicator_terms (dict): Elements in 'basis_terms' that are indicator
+ # key = category of the indicator variable,
+ # value = list of all variables for this category
+ self.indicator_terms = defaultdict(list)
+
+ # reference_terms (dict): For each category in indicator terms, if the
+ # reference category is provided then it is added to this dictionary
+ # key = category of the indicator variable,
+ # value = index (int) of reference variable
+ self.reference_terms = defaultdict(lambda: None)
+
+ # interaction_terms (dict): Each interaction term
+ # is placed in this dictionary with the index from independent_variable
+ # as the key and an interaction_term represented as a dictionary.
+ # see _parse_design_string() for an example of the format
+ # of a single interaction term.
+ self.interaction_terms = defaultdict(dict)
+ self.n_basis_terms, self.n_int_terms = self._parse_design_string()
+ self.n_terms = self.n_basis_terms + self.n_int_terms
+ #------------------------------------------------------------------------------
+
+ @staticmethod
+ def _strip_indicator(indicator_term):
+ """
+ Find if input is an indicator term and strip the 'i.' prefix and the
+ category suffix. Return the underlying index and the category string.
+ If the category string is None then the variable is not an indicator term.
+ (see usage examples)
+
+ Args:
+ @param indicator_term: string, Index of variable, possibly prefixed by
+ 'i.' to indicate it as an indicator variable.
+ If prefix exists then a suffix should also be
+ present giving the category of the indicator term
+
+ Returns:
+            Tuple. (Stripped indicator index, category string or None,
+            True if it was a reference ('ir.') indicator variable)
+
+ Raises:
+ ValueError, if input is invalid
+
+ Examples:
+ (Input -- Output):
+ 'i.1.color' -- (1, 'color', False)
+ 'i.2.gender' -- (2, 'gender', False)
+ 'ir.2.gender' -- (2, 'gender', True)
+ '1' -- (1, None, False)
+ 'i.1' -- ValueError
+ 'i*1' -- ValueError
+ '2*3' -- ValueError
+ """
+ t = re.match(r'(^i[r]?\.)(\d+)(\.)(.*)', indicator_term.strip())
+ # (1) is the prefix, (2) is the index, (4) is the suffix
+ if t:
+ indicator_var = t.group(2)
+ indicator_category = t.group(4)
+ is_reference = (t.group(1) == 'ir.')
+ else:
+ indicator_var = indicator_term
+ indicator_category = None
+ is_reference = False
+
+ try:
+ indicator_index = int(indicator_var)
+ except (ValueError, TypeError):
+ raise ValueError("Invalid input for indicator term: {0}".format(indicator_term))
+ return indicator_index, indicator_category, is_reference
+ # ------------------------------------------------------------------------------
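The `i.<index>.<category>` / `ir.<index>.<category>` grammar recognized by `_strip_indicator` can be sketched on its own as follows (a simplified re-implementation for illustration):

```python
import re

# Prefix 'i.' marks an indicator variable, 'ir.' a reference indicator;
# both require a numeric index and a category suffix.
PAT = re.compile(r'(^i[r]?\.)(\d+)(\.)(.*)')

def strip_indicator(term):
    m = PAT.match(term.strip())
    if m:
        return int(m.group(2)), m.group(4), m.group(1) == 'ir.'
    # not an indicator: must be a plain index (raises ValueError otherwise)
    return int(term), None, False

print(strip_indicator('ir.2.gender'))  # (2, 'gender', True)
print(strip_indicator('7'))            # (7, None, False)
```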
+
+ @classmethod
+ def _parse_interaction_term(cls, interaction_term):
+ """
+ Parse each interaction term containing product of powers of various basis terms
+ to generate a list of the terms involved in the product.
+
+ Args:
+ @param interaction_term: string, Interaction between variables represented
+ as a product of powers of variables.
+
+ Returns:
+ Dict: Dictionary of interaction elements. Each key is the variable index
+ and the value is the power for that variable
+
+ Example
+ interaction_term = "1*3^2"
+ output = {1: 1, 3: 2}
+
+ interaction_term = "i.1.color^2 * 2^2"
+ output = {1: 2, 2: 2}
+
+ Here the key in the dictionary is an index in the
+ original independent variable list (represents a basis term).
+ """
+ # split string by '*'; eg: '1^2 * 2^2' would give two element list
+ product_terms = interaction_term.strip().split('*')
+ if not product_terms:
+ raise ValueError("Invalid input for interaction term: {0}".
+ format(interaction_term))
+ if len(product_terms) == 1:
+ if len(product_terms[0].strip().split("^")) == 1:
+ # a single variable is input without any power or product operation.
+ # this should be treated as a basis variable instead of interaction term
+                raise ValueError("Input interaction term {0} is actually a "
+                                 "basis term".format(interaction_term))
+
+ all_inter_terms = defaultdict(int)
+ for each_term in product_terms:
+ try:
+ power_terms = each_term.strip().split('^')
+ term_index = cls._strip_indicator(power_terms[0])[0]
+ if len(power_terms) > 1:
+ term_power = int(power_terms[1])
+ if term_power < 0:
+ # negative powers not allowed
+ raise ValueError("Invalid input for interaction term: {0}".
+ format(each_term))
+ else:
+ # if no power is provided (eg. '1 * 2') then default power = 1
+ term_power = 1
+ all_inter_terms[term_index] += term_power
+ except (TypeError, ValueError):
+ raise ValueError("Invalid input for interaction term: {0}".
+ format(each_term))
+ return all_inter_terms
+ # -------------------------------------------------------------------------
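The `'*'`-separated product-of-powers grammar handled by `_parse_interaction_term` can be sketched in isolation (indicator prefixes and error handling omitted for brevity):

```python
from collections import defaultdict

def parse_int_term(term):
    """Parse '1*3^2' style products into {index: power}; repeated
    occurrences of the same index accumulate their powers."""
    powers = defaultdict(int)
    for factor in term.strip().split('*'):
        parts = factor.strip().split('^')
        index = int(parts[0])
        powers[index] += int(parts[1]) if len(parts) > 1 else 1
    return dict(powers)

print(parse_int_term('1*3^2'))      # {1: 1, 3: 2}
print(parse_int_term('1^2 * 2^2'))  # {1: 2, 2: 2}
```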
+
+ @staticmethod
+ def str_int_term(int_term, array_name=None, quoted=False):
+ """
+ Create a string representation from the dict representation of an
+ interaction term.
+
+ Args:
+ @param int_term: dict, Dictionary representation of a single interaction term.
+ @param array_name: string. If the int_term indices are to be treated as
+ indices of an array then 'array_name' can be used
+ while building the string. If None, then the
+ indices are printed as is.
+ @param quoted: bool. If true, double quotes are added around the
+ array_name. Ignored if array_name is None.
+ Returns:
+ String: String representation of interaction term.
+
+ Example:
+            int_term_list = {1: 1, 3: 2}
+            array_name = 'x'
+            quoted = True
+            output = '"x"[1] * "x"[3]^2'
+
+            int_term_list = {0: 3, 1: 1, 3: 2}
+            array_name = 'x'
+            quoted = False
+            output = '3 * x[1] * x[3]^2'
+
+ int_term_list = {1: 1, 3: 2}
+ array_name = None
+ output = "1 * 3^2"
+ """
+ if not int_term:
+ return ""
+ str_rst = []
+ for var, power in int_term.items():
+ if power < 1:
+ continue
+ if var == 0:
+ # var = 0 indicates a constant term whose value is 'power'
+ prefix_str = ""
+ power_suffix = "{0}".format(power)
+ else:
+ power_suffix = "^" + str(power) if power > 1 else ""
+ if array_name:
+ quote_str = "\"" if quoted else ""
+ prefix_str = "{q}{0}{q}[{1}]".format(array_name, var, q=quote_str)
+ else:
+ prefix_str = "{0}".format(var)
+ str_append = ("{prefix}{suffix}".
+ format(prefix=prefix_str, suffix=power_suffix))
+ str_rst.append(str_append)
+ return ' * '.join(str_rst)
+ #------------------------------------------------------------------------------
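The string building in `str_int_term` can be summarized with this standalone sketch (quoting support dropped; `sorted()` added for deterministic output, whereas the method above relies on dict iteration order):

```python
def render(term, name='x'):
    """Render {index: power} as a SQL product expression over array `name`;
    key 0 holds a constant factor, power 1 gets no '^' suffix."""
    parts = []
    for var, power in sorted(term.items()):
        if power < 1:
            continue
        if var == 0:
            parts.append(str(power))   # constant factor
        else:
            suffix = '^%d' % power if power > 1 else ''
            parts.append('%s[%d]%s' % (name, var, suffix))
    return ' * '.join(parts)

print(render({1: 1, 3: 2}))        # x[1] * x[3]^2
print(render({0: 3, 1: 1, 3: 2}))  # 3 * x[1] * x[3]^2
```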
+
+ def str_sum_int_terms(self, coef_array_name, indep_array_name=None, quoted=False):
+ """
+        Generate the string representation for computing \sum_i (f_i * beta_i),
+        where f_i is an interaction term and beta_i its coefficient
+ """
+ return self.create_sum_int_terms(self.interaction_terms,
+ coef_array_name, indep_array_name,
+ quoted)
+ #------------------------------------------------------------------------------
+
+ def str_sum_deriv_int_terms(self, ref, coef_array_name, indep_array_name=None, quoted=False):
+ return self.create_sum_int_terms(self.interaction_partial_deriv(ref),
+ coef_array_name, indep_array_name,
+ quoted)
+ #------------------------------------------------------------------------------
+
+ @classmethod
+ def create_sum_int_terms(cls, interaction_terms, coef_array_name, indep_array_name=None, quoted=False):
+ if not interaction_terms:
+ return ""
+ result = []
+ for each_index, each_term in interaction_terms.items():
+ quote_str = "\"" if quoted else ""
+ coef_str = "{q}{0}{q}[{1}]".format(coef_array_name, str(each_index), q=quote_str)
+ result.append(coef_str + " * " + cls.str_int_term(each_term, indep_array_name, quoted))
+ return " + ".join(result)
+ #------------------------------------------------------------------------------
+
+ def create_2nd_derivative_matrix(self, indep_array_name=None, quoted=False):
+ """
+        Compute the string representation of \frac{\partial f_i}{\partial x_k},
+        where i is the index of an independent variable
+        and k is the index of a basis variable
+ """
+ derivative_matrix_str_list = []
+ if not self.interaction_terms:
+ # return an identity matrix of size num_basis_terms x num_basis_terms
+ for i in range(self.n_basis_terms):
+ derivative_matrix_str_list.append(py_list_to_sql_string(
+ [1 if j==i else 0 for j in range(self.n_basis_terms)]))
+ else:
+ for index, each_basis in enumerate(self.basis_terms):
+ all_partial_deriv_list = ['0'] * self.n_terms
+ all_partial_deriv_list[each_basis - 1] = 1
+ curr_partial_deriv_dict = self.interaction_partial_deriv(each_basis)
+ for each_index, each_term in curr_partial_deriv_dict.items():
+ all_partial_deriv_list[each_index - 1] = \
+ self.str_int_term(each_term, indep_array_name, quoted)
+ derivative_matrix_str_list.append(py_list_to_sql_string(all_partial_deriv_list) + "::double precision[]")
+ return py_list_to_sql_string(derivative_matrix_str_list)
+
+ #------------------------------------------------------------------------------
+
+ def interaction_partial_deriv(self, ref):
+ """
+ Args:
+ @param ref: int. Variable for which to compute partial derivative
+
+ Returns:
+ Dict. Partial derivative of all the interaction terms that contain the
+ 'ref' variable in the same format as the interaction terms
+ """
+ if not self.interaction_terms:
+ return self.interaction_terms
+ result = defaultdict(lambda: defaultdict(int))
+ for index, int_term in self.interaction_terms.items():
+ if ref in int_term:
+ result[index] = self._int_term_partial_deriv(index, ref)
+ return result
+ #------------------------------------------------------------------------------
+
+ def _parse_design_string(self):
+ """
+ Parse the design string to compute the basis terms,
+ interaction terms, and any indicator variables in the independent
+ variables list.
+
+ Updates:
+ self.basis_terms, self.indicator_terms, self.interaction_terms
+
+ Raises:
+ ValueError
+
+ Example:
+ self.design_str = "ir.1.color, i.2.color, i.3.color, i.4.gender, i.5.degree,
+ i.5.degree*i.2.color, i.5.degree*i.3.color, i.5.degree*i.4.gender,
+ i.5.degree*i.4.gender*i.2.color, i.5.degree*i.4.gender*i.3.color,
+ 11, 11^2, i.5.degree*11, i.5.degree*11^2, 15"
+
+ self.basis_terms = [1, 2, 3, 4, 5, 11, 15],
+ self.interaction_terms = {6: {5:1, 2:1}, 7: {5: 1, 3: 1},
+ 8: {5:1, 4:1}, 9: {5: 1, 4: 1, 2: 1},
+                                      10: {5: 1, 4: 1, 3: 1}, 12: {11: 2},
+ 13: {5: 1, 11: 1}, 14: {5:1, 11: 2}}
+ self.indicator_terms = {'color': [1, 2, 3], 'gender': [4], 'degree': [5]}
+ self.reference_terms = {'color': 1}
+
+ A constant is not allowed as a multiplicative term in interaction_terms.
+ The partial derivative of an interaction term, however, can contain a
+ constant. This is represented as the value for key '0'.
+ Example: {0:2, 5: 1, 3: 1} corresponds to 2*x[5]*x[3] where 'x'
+ represents the original array for which '5' and '3' are indices.
+        The indices are assumed to be 1-based, thus no index of value '0' should
+ be present.
+ """
+ if not self.design_str:
+ raise ValueError("Invalid input for x_design: {0}".format(str(self.design_str)))
+
+ indicator_term_chars = r"[\*\^]"
+
+ design_terms = self.design_str.strip().split(",")
+ for index, each_term in enumerate(design_terms):
+ # find if it is an interaction term
+ if re.search(indicator_term_chars, each_term):
+ self.interaction_terms[index + 1] = self._parse_interaction_term(each_term)
+ # Database indexing starts from 1 instead of 0
+ else:
+                # get the basis terms (should be int after stripping any "i.")
+ basis_var, indicator_cat, is_reference = self._strip_indicator(each_term)
+ self.basis_terms.append(basis_var)
+ if indicator_cat:
+ self.indicator_terms[indicator_cat].append(basis_var)
+ if is_reference:
+                    if indicator_cat not in self.reference_terms:
+ self.reference_terms[indicator_cat] = basis_var
+ else:
+ raise ValueError(
+ "Invalid input in x_design. Multiple reference terms "
+ "present for category '{0}'".format(indicator_cat))
+ self.basis_terms.sort()
+
+        # Verify that each variable in interaction_terms is found in basis_terms
+ for each_term in self.interaction_terms.values():
+ # each term is represented as a dictionary of dictionaries
+ for each_var in each_term:
+ if each_var not in self.basis_terms:
+ raise ValueError(
+ "Invalid input in x_design. Interaction term '{0}' "
+ "contains variable '{1}' not defined as a basis "
+ "variable".format(self.str_int_term(each_term), each_var))
+ return len(self.basis_terms), len(self.interaction_terms)
+ # -------------------------------------------------------------------------
+
+ def get_indicator_category(self, term):
+ """
+ Get the category for a given indicator term
+ Args:
+ @param term
+
+ Returns:
+ str.
+ """
+ if not self.indicator_terms or not term:
+ return None
+ for each_cat, all_ind_terms in self.indicator_terms.items():
+ if term in all_ind_terms:
+ return each_cat
+ return None
+ #------------------------------------------------------------------------------
+
+ def get_indicator_reference(self, term):
+ """
+ Get the reference variable for a given indicator term
+ Args:
+ @param term
+
+ Returns:
+ str.
+ """
+ if not self.indicator_terms or not term:
+ return None
+ return self.reference_terms[self.get_indicator_category(term)]
+ #------------------------------------------------------------------------------
+
+ def n_total_terms(self):
+ """
+ Get the total number of terms in the x_design string
+ """
+ return len(self.basis_terms) + len(self.interaction_terms)
+ #------------------------------------------------------------------------------
+
+ def get_siblings(self, ind_term):
+ """
+ Args:
+ @param self
+ @param ind_term
+
+ Returns:
+ tuple (list, int). A tuple containing a list of terms that are
+ in the same category as the ind_term (without the reference) and an
+ int representing the index of the reference term. If no reference
+ term is provided then the 2nd output is None.
+ """
+ term_siblings = None
+ for each_cat, each_ind_list in self.indicator_terms.items():
+ if ind_term in each_ind_list:
+ term_siblings = list(each_ind_list)
+ term_siblings.remove(ind_term)
+ ref_term = self.get_indicator_reference(ind_term)
+ if ref_term and ref_term in term_siblings:
+ term_siblings.remove(ref_term)
+ break
+ if term_siblings is None: # term_siblings can be an empty list
+ raise ValueError("Computing discrete difference for a non-indicator term")
+ return (term_siblings, ref_term)
+ #------------------------------------------------------------------------------
+
+ def get_discrete_diff_arrays(self, term_to_explore, array_name='x', quoted=False):
+ """
+ Create a pair of strings that corresponds to the array expressions for
+ computing the marginal effect for an indicator term using
+ discrete differences
+
+ Args:
+ @param term_to_explore: int, For which term to create discrete
+ difference expression
+ @param array_name: str, All variables that are not siblings of the
+ indicator variable in question are expressed
+ as they were in original expression.
+ 'array_name' is the name of the array that
+ contains the values for the variables.
+ Returns:
+ tuple. A pair of strings each representing an array of data.
+ First one corresponding to term_to_explore
+ being 'set' (=1) with other siblings unset (=0) and the second one
+ corresponding to term_to_explore as 'unset' and other siblings also
+ 'unset' (reference variable, if present, is 'set'). In both cases,
+ variables unrelated to term_to_explore are unchanged.
+ """
+ quote_str = "\"" if quoted else ""
+ term_siblings, ref_term = self.get_siblings(term_to_explore)
+ n_terms = self.n_total_terms() # len(self.basis_terms) + len(self.interaction_terms)
+ indicator_set_list = [None] * n_terms
+ indicator_unset_list = [None] * n_terms
+
+ for each_basis_index in self.basis_terms:
+            # each_basis_index is an index in the original array (in DB), which is 1-based
+ if each_basis_index == term_to_explore:
+ indicator_set_list[each_basis_index - 1] = '1'
+ indicator_unset_list[each_basis_index - 1] = '0'
+ elif each_basis_index in term_siblings:
+ # unset all siblings in both cases
+ indicator_set_list[each_basis_index - 1] = '0'
+ indicator_unset_list[each_basis_index - 1] = '0'
+ elif ref_term and each_basis_index == ref_term:
+ # set the reference when unsetting the term_to_explore
+ indicator_set_list[each_basis_index - 1] = '0'
+ indicator_unset_list[each_basis_index - 1] = '1'
+ else:
+ # all other variables are added in as is
+ basis_str = "{q}{0}{q}[{1}]".format(array_name, each_basis_index,
+ q=quote_str)
+ indicator_set_list[each_basis_index - 1] = basis_str
+ indicator_unset_list[each_basis_index - 1] = basis_str
+
+ for index, each_int_term in self.interaction_terms.items():
+            # again 'index' is for the original array (in DB), which is 1-based
+ for each_sibling in term_siblings:
+ if each_sibling in each_int_term:
+ indicator_set_list[index - 1] = '0'
+ indicator_unset_list[index - 1] = '0'
+ break
+ else:
+ if term_to_explore in each_int_term:
+ indicator_unset_list[index - 1] = '0'
+ copy_int_term = dict(each_int_term)
+ copy_int_term.pop(term_to_explore)
+ indicator_set_list[index - 1] = self.str_int_term(copy_int_term,
+ array_name,
+ quoted)
+ else:
+ term_as_is = self.str_int_term(each_int_term, array_name, quoted)
+ indicator_set_list[index - 1] = term_as_is
+ indicator_unset_list[index - 1] = term_as_is
+ return (py_list_to_sql_string(indicator_set_list),
+ py_list_to_sql_string(indicator_unset_list))
+ #------------------------------------------------------------------------------
+
+ def get_all_ind_indices(self):
+ """
+ Get the indices for all indicator terms as a sorted list
+
+ Returns:
+ List
+ """
+ return list(sorted(chain(* self.indicator_terms.values())))
+ #------------------------------------------------------------------------------
+
+ def get_all_reference_indices(self):
+ """
+ Get the indices for all reference indicator terms as a sorted list
+
+ Returns:
+ List
+ """
+ return list(sorted(self.reference_terms.values()))
+ #------------------------------------------------------------------------------
+
+ def _int_term_partial_deriv(self, int_term_index, ref):
+ """
+        Compute the derivative of a product of polynomial terms (together making
+        up a single interaction term)
+
+        Args:
+            @param int_term_index: int, Index of the interaction term, which is
+                               represented as a dictionary mapping each basis
+                               term index to its power
+ @param ref: int, reference variable (index) to compute partial derivative
+ Returns:
+ Dict.
+ """
+ int_term = self.interaction_terms[int_term_index]
+ if not int_term:
+ return int_term
+ if ref in int_term and int_term[ref] > 0:
+ result = defaultdict(int)
+ result.update(int_term)
+ if int_term[ref] > 1:
+ result[0] = int_term[ref]
+ result[ref] -= 1
+ return result
+ else:
+ return defaultdict(int)
+ #------------------------------------------------------------------------------
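The power rule applied by `_int_term_partial_deriv` can be sketched standalone: for a term prod_k x_k^{p_k}, d/dx_r multiplies by p_r and decrements that power, with key 0 holding the constant factor (matching the convention documented above):

```python
def partial_deriv(term, ref):
    """Partial derivative of {index: power} with respect to index `ref`."""
    if term.get(ref, 0) < 1:
        return {}                   # term does not contain x_ref
    result = dict(term)
    if term[ref] > 1:
        result[0] = term[ref]       # pulled-down constant factor p_r
    result[ref] -= 1
    return result

print(partial_deriv({11: 2}, 11))      # {0: 2, 11: 1}  i.e. 2*x[11]
print(partial_deriv({5: 1, 3: 1}, 5))  # {5: 0, 3: 1}   i.e. x[3]
```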
+
+ def _get_subset_int_terms(self, ref):
+ """
+ Args:
+ @param ref: int. The reference term to search for in each interaction term
+
+ Returns:
+ Dict. A subset of 'interaction_terms' where each interaction term contains 'ref'
+ """
+ if not self.interaction_terms:
+ return self.interaction_terms
+ result = defaultdict(lambda: defaultdict(int))
+ for index, int_terms in self.interaction_terms.items():
+ if ref in int_terms:
+ result[index] = int_terms
+ return result
+ #------------------------------------------------------------------------------
+
+
+import unittest
+
+
+class MarginsTestCase(unittest.TestCase):
+
+ def setUp(self):
+ self.xd = ("ir.1.color, i.2.color, i.3.color, i.4.gender, i.5.degree,"
+ "i.5.degree*i.2.color, i.5.degree*i.3.color, i.5.degree*i.4.gender,"
+ "i.5.degree*i.4.gender*i.2.color, i.5.degree*i.4.gender*i.3.color,"
+ "11, 11^2, i.5.degree*11, i.5.degree*11^2, 15")
+ self.int_obj = InteractionTermBuilder(self.xd)
+
+ self.xd2 = '1, 2, 3, 4, 5, 6, 7, 8, 3*2, 4*3*5, 5^2*4*6, 6^2, 7^3*8'
+ self.int_obj2 = InteractionTermBuilder(self.xd2)
+
+ def tearDown(self):
+ pass
+
+ def test_ind_prefix(self):
+ self.assertEqual(InteractionTermBuilder._strip_indicator('i.1.color'), (1, 'color', False))
+ self.assertEqual(InteractionTermBuilder._strip_indicator('1'), (1, None, False))
+ self.assertEqual(InteractionTermBuilder._strip_indicator('10'), (10, None, False))
+ self.assertEqual(InteractionTermBuilder._strip_indicator('i.10.gender'), (10, 'gender', False))
+ self.assertEqual(InteractionTermBuilder._strip_indicator('ir.10.gender'), (10, 'gender', True))
+ self.assertEqual(InteractionTermBuilder._strip_indicator('i.10.1'), (10, '1', False))
+ self.assertRaises(ValueError, InteractionTermBuilder._strip_indicator, 'i.21*2')
+ self.assertRaises(ValueError, InteractionTermBuilder._strip_indicator, 'i.1')
+
+ def test_parse_interaction(self):
+ self.assertEqual(InteractionTermBuilder._parse_interaction_term('1*2'), {1: 1, 2: 1})
+ self.assertEqual(InteractionTermBuilder._parse_interaction_term('1^2'), {1: 2})
+ self.assertEqual(InteractionTermBuilder._parse_interaction_term('i.1.color*2*i.3.size^4'), {1: 1, 2: 1, 3: 4})
+ self.assertEqual(InteractionTermBuilder._parse_interaction_term('1^2*i.2.2^7*3^4'), {1: 2, 2: 7, 3: 4})
+ self.assertEqual(InteractionTermBuilder._parse_interaction_term('1^2*i.2.2^7*3*3^2'), {1: 2, 2: 7, 3: 3})
+ self.assertRaises(ValueError, InteractionTermBuilder._parse_interaction_term, '1^-2*2^7*3^4')
+ self.assertRaises(ValueError, InteractionTermBuilder._parse_interaction_term, '1^2*2^7.1*3^4')
+ self.assertRaises(ValueError, InteractionTermBuilder._parse_interaction_term, '1^2/2^7.1*3^4')
+
+ def test_str_interaction(self):
+ self.assertEqual(InteractionTermBuilder.str_int_term(
+ InteractionTermBuilder._parse_interaction_term('1*2')), "1 * 2")
+ self.assertEqual(InteractionTermBuilder.str_int_term(
+ InteractionTermBuilder._parse_interaction_term('1*2^2')), "1 * 2^2")
+ self.assertEqual(InteractionTermBuilder.str_int_term(
+ InteractionTermBuilder._parse_interaction_term('1^1*2^2*2^3')), "1 * 2^5")
+ self.assertEqual(InteractionTermBuilder.str_int_term(
+ InteractionTermBuilder._parse_interaction_term('i.1.color*2^2')), "1 * 2^2")
+ self.assertEqual(InteractionTermBuilder.str_int_term(
+ InteractionTermBuilder._parse_interaction_term('i.1.color*2^2'),
+ 'x', quoted=False), 'x[1] * x[2]^2')
+ self.assertEqual(InteractionTermBuilder.str_int_term(
+ InteractionTermBuilder._parse_interaction_term('i.1.color*2^2'),
+ 'Abc', quoted=True), '"Abc"[1] * "Abc"[2]^2')
+ self.assertEqual(
+ self.int_obj.str_sum_int_terms('B', 'A', True),
+ '"B"[6] * "A"[2] * "A"[5] + "B"[7] * "A"[3] * "A"[5] + '
+ '"B"[8] * "A"[4] * "A"[5] + "B"[9] * "A"[2] * "A"[4] * "A"[5] + '
+ '"B"[10] * "A"[3] * "A"[4] * "A"[5] + "B"[12] * "A"[11]^2 + '
+ '"B"[13] * "A"[11] * "A"[5] + "B"[14] * "A"[11]^2 * "A"[5]')
+
+ def test_x_design(self):
+ self.assertEqual(self.int_obj2.basis_terms, [1, 2, 3, 4, 5, 6, 7, 8])
+ self.assertEqual(self.int_obj2.interaction_terms,
+ {9: {2: 1, 3: 1},
+ 10: {3: 1, 4: 1, 5: 1},
+ 11: {5: 2, 4: 1, 6: 1},
+ 12: {6: 2},
+ 13: {7: 3, 8: 1}})
+ self.assertEqual(self.int_obj2.indicator_terms, {})
+ self.assertEqual(self.int_obj2.reference_terms, {})
+ self.assertEqual(self.int_obj2.interaction_partial_deriv(5),
+ {10: {3: 1, 4: 1, 5: 0}, 11: {0: 2, 5: 1, 4: 1, 6: 1}})
+
+ def test_x_design1(self):
+ xd = '1, 2, 1*2, 1^1*i.2.color^2* 5^4 , 5, 6, 5^2'
+ int_obj = InteractionTermBuilder(xd)
+ self.assertEqual(int_obj.basis_terms, [1, 2, 5, 6])
+ self.assertEqual(int_obj.interaction_terms, {3: {1: 1, 2: 1},
+ 4: {1: 1, 2: 2, 5: 4},
+ 7: {5: 2}})
+ self.assertEqual(int_obj.indicator_terms, {})
+ self.assertEqual(int_obj.reference_terms, {})
+
+ def test_x_design2(self):
+ xd = '1, i.2.color, 1*2, 1^1*2^2* 5^4 , 5, 6, 5^2'
+ int_obj = InteractionTermBuilder(xd)
+ self.assertEqual(int_obj.basis_terms, [1, 2, 5, 6])
+ self.assertEqual(int_obj.interaction_terms, {3: {1: 1, 2: 1},
+ 4: {1: 1, 2: 2, 5: 4},
+ 7: {5: 2}})
+ self.assertEqual(int_obj.indicator_terms, {'color': [2]})
+ self.assertEqual(int_obj.reference_terms, {})
+
+ def test_x_design3(self):
+ xd = '1, 2, 3'
+ int_obj = InteractionTermBuilder(xd)
+ self.assertEqual(int_obj.basis_terms, [1, 2, 3])
+ self.assertEqual(int_obj.interaction_terms, {})
+ self.assertEqual(int_obj.indicator_terms, {})
+ self.assertEqual(int_obj.reference_terms, {})
+
+ def test_x_design4(self):
+ xd = '1, 2, i.1*i.2, 1*2^2, i.5, i.6'
+ # i.1, i.2 is invalid input
+ self.assertRaises(ValueError, InteractionTermBuilder, xd)
+
+ def test_x_design5(self):
+ xd = '1, i.2, 1*i.2.color, 1^1*2^2*3^4'
+ # '3' in interaction term is invalid
+ self.assertRaises(ValueError, InteractionTermBuilder, xd)
+
+ def test_x_design6(self):
+ self.assertEqual(self.int_obj.basis_terms, [1, 2, 3, 4, 5, 11, 15])
+ self.assertEqual(self.int_obj.interaction_terms,
+ {6: {5: 1, 2: 1}, 7: {5: 1, 3: 1},
+ 8: {5: 1, 4: 1}, 9: {5: 1, 4: 1, 2: 1},
+ 10: {5: 1, 4: 1, 3: 1}, 12: {11: 2},
+ 13: {5: 1, 11: 1}, 14: {5: 1, 11: 2}})
+ self.assertEqual(self.int_obj.indicator_terms, {'color': [1, 2, 3], 'gender': [4], 'degree': [5]})
+ self.assertEqual(self.int_obj.reference_terms, {'color': 1})
+
+ def test_int_term_partial_deriv(self):
+ self.assertEqual(self.int_obj._int_term_partial_deriv(6, 2), {5: 1, 2: 0})
+ self.assertEqual(self.int_obj._int_term_partial_deriv(7, 5), {5: 0, 3: 1})
+ self.assertEqual(self.int_obj._int_term_partial_deriv(10, 5), {4: 1, 3: 1, 5: 0})
+ self.assertEqual(self.int_obj._int_term_partial_deriv(12, 11), {0: 2, 11: 1})
+ self.assertEqual(self.int_obj._int_term_partial_deriv(14, 11), {0: 2, 5: 1, 11: 1})
+ self.assertEqual(self.int_obj._int_term_partial_deriv(15, 2), {})
+
+ def test_get_subset_int_terms(self):
+ self.assertEqual(self.int_obj._get_subset_int_terms(5),
+ {6: {5: 1, 2: 1}, 7: {5: 1, 3: 1}, 8: {5: 1, 4: 1},
+ 9: {5: 1, 4: 1, 2: 1}, 10: {5: 1, 4: 1, 3: 1},
+ 13: {5: 1, 11: 1}, 14: {5: 1, 11: 2}})
+
+ def test_partial_deriv_all(self):
+ # self.xd = ("ir.1.color, i.2.color, i.3.color, i.4.gender, i.5.degree,"
+ # "i.5.degree*i.2.color, i.5.degree*i.3.color, i.5.degree*i.4.gender,"
+ # "i.5.degree*i.4.gender*i.2.color, i.5.degree*i.4.gender*i.3.color,"
+ # "11, 11^2, i.5.degree*11, i.5.degree*11^2, 15")
+
+ self.assertEqual(self.int_obj.interaction_partial_deriv(5),
+ {6: {5: 0, 2: 1}, 7: {5: 0, 3: 1}, 8: {5: 0, 4: 1},