Margins: multinomial logistic

Additional authors:
- Shengwen Yang <syang@gopivotal.com>
- Feng, Xixuan (Aaron) <xfeng@gopivotal.com>

Pivotal Tracker: #67684748

Changes:
- added support for marginal effects in multinomial logistic regression
- appended the covariance matrix and coefficients to the summary table of mlogregr_train()
- updated the design doc to better reflect the implementation
@@ -683,11 +683,11 @@ \subsubsection{Multi-Logistic Regression Derivatives}
 \end{align}
 The derivatives are then
-\begin{equation}\label{eq:first_derivative}
+\begin{equation}\label{eq:first_derivative2}
 \frac{\partial l}{\partial \beta_{k,j}} = \sum_{i=1}^{N} Y_{i,j}x_{i,k} - \pi_{i,j}x_{i,k} \ \ \ \ \forall k \ \forall j
 \end{equation}
 The Hessian is then
-\begin{align}\label{eq:second_derivative}
+\begin{align}\label{eq:second_derivative2}
 \frac{\partial^2 l({\beta})}{\partial \beta_{k_2,j_2} \partial \beta_{k_1,j_1}} &= \sum_{i=1}^{N} -\pi_{i,j_2}x_{i,k_2}(1-\pi_{i,j_1})x_{i,k_1} &&j_1 = j_2 \\
 &= \sum_{i=1}^{N} \pi_{i,j_2}x_{i,k_2}\pi_{i,j_1}x_{i,k_1} &&j_1 \neq j_2
@@ -920,21 +920,30 @@ \subsection{Marginal effects for regression methods} % (fold)
 and let $J(m) = \pder[\vec{f}]{x_m}$ denote the $m$-th column of $J$,
 where $\vec{f} = (f_1, \ldots f_N)^T$.
 If $x_k$ is a categorical variable and if discrete differences are required
 for it, then the column corresponding to $x_k$ can be replaced by,
-\pder[\vec{f}]{x_k} =
-  \begin{bmatrix}
-    f_0^{set} - f_0^{unset}, & f_1^{set} - f_1^{unset}, &
-    f_2^{set} - f_2^{unset}, & \ldots, & f_{N - 1}^{set} - f_{N - 1}^{unset}
+\begin{align*}
+\pder[\vec{f}]{x_k} &= \vec{f^{set_k}} - \vec{f^{unset_k}} \\
+  &= \begin{bmatrix}
+      f_0^{set_k}, & f_1^{set_k}, & f_2^{set_k}, & \ldots, & f_{N - 1}^{set_k}
+     \end{bmatrix}^T -
+     \begin{bmatrix}
+      f_0^{unset_k}, & f_1^{unset_k}, & f_2^{unset_k}, & \ldots, & f_{N - 1}^{unset_k}
+     \end{bmatrix}^T \\
+  &= \begin{bmatrix}
+    f_0^{set_k} - f_0^{unset_k}, & f_1^{set_k} - f_1^{unset_k}, &
+    f_2^{set_k} - f_2^{unset_k}, & \ldots, & f_{N - 1}^{set_k} - f_{N - 1}^{unset_k}
   \end{bmatrix}^T,
-
+\end{align*}
 where
 \begin{align*}
-  & f_i^{set} = f_i(\ldots, x_k=1, x_l=0, x_r=0) \\
+  & f_i^{set_k} = f_i(\ldots, x_k=1, x_l=0, x_r=0) \\
   & \text{and} \\
-  & f_i^{unset} = f_i(\ldots, x_k=0, x_l=0, x_r=1),
+  & f_i^{unset_k} = f_i(\ldots, x_k=0, x_l=0, x_r=1),
 \end{align*}
-$\forall x_l \in$ (set of dummy variables related to $x_k$ excluding the reference variable) and
-$x_r =$ reference variable (if present).
+$\forall x_l \in$ (set of dummy variables related to $x_k$ excluding the
+reference variable) and $x_r =$ reference variable of $x_k$ (if present). The
+response probability corresponding to $f^{set_k}$ is denoted as $P^{set_k}$.
+
 \subsubsection{Linear regression} % (fold)
 \label{ssub:linear_regression}
@@ -977,11 +986,14 @@ \subsubsection{Multilogistic regression} % (fold)
 \label{ssub:multilogistic_regression}
 The probabilities of different outcomes for multilogistic regression are
 expressed as,
 \begin{gather*}
-    P(y=l | \vec{x}) = \frac{e^{\vec{f}^T \vec{\beta^l}}}{\displaystyle \sum_{q=1}^{L} e^{\vec{f}^T \vec{\beta^q}}},
+    P^l = P(y=l | \vec{x})
+        = \frac{e^{\vec{f}^T \vec{\beta^l}}}{1 + \displaystyle \sum_{q=1}^{L-1} e^{\vec{f}^T \vec{\beta^q}}}
+        = \frac{e^{\vec{f}^T \vec{\beta^l}}}{\displaystyle \sum_{q=1}^{L} e^{\vec{f}^T \vec{\beta^q}}},
 \end{gather*}
-where $\vec{\beta^l}$ represents the coefficient vector for category $l$,
-with $L$ being the total number of categories. The $\vec{\beta}$ vector is set to zero for one of the
-outcomes, called the ``base outcome'' or the ``reference category''.
+where $\vec{\beta^l}$ represents the coefficient vector for category $l$, with
+$L$ being the total number of categories. The coefficients are set to zero for
+one of the outcomes, called the ``base outcome'' or the ``reference category''.
+Here, without loss of generality, we let $\vec{\beta^L} = \vec{0}$.
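Reviewer note: the two forms of $P^l$ added above (the $(L-1)$-term denominator with $\vec{\beta^L} = \vec{0}$ versus the full $L$-term softmax) can be sanity-checked numerically. A minimal sketch; the `mlogit_probs` helper and the random data are illustrative, not part of this patch:

```python
import numpy as np

# Hypothetical helper (not from the patch): response probabilities for
# multinomial logit with the base outcome's coefficients fixed at zero.
def mlogit_probs(f, B):
    # B is N x (L-1); category L is the base outcome with beta^L = 0.
    expf = np.exp(f @ B)                       # e^{f^T beta^q}, q = 1 .. L-1
    return np.append(expf, 1.0) / (1.0 + expf.sum())

rng = np.random.default_rng(0)
f = rng.normal(size=4)                         # N = 4 features
B = rng.normal(size=(4, 2))                    # L = 3 categories

p = mlogit_probs(f, B)
# Equivalent full-softmax form: append beta^L = 0 as an explicit column.
B_full = np.hstack([B, np.zeros((4, 1))])
p_full = np.exp(f @ B_full) / np.exp(f @ B_full).sum()
print(np.allclose(p, p_full))                  # True: the two forms agree
```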
 % The odds of outcome $j$ versus outcome $m$ are
 % \begin{align*}
@@ -1000,7 +1012,7 @@ \subsubsection{Multilogistic regression} % (fold)
 Thus,
 \begin{align*}
     \displaystyle
-    \pder[P(y=l|\vec{x})]{x_m} &=
+    \pder[P^l]{x_m} &=
     \frac{1}{\displaystyle \sum_{q=1}^{L} e^{\vec{f}^T \vec{\beta^q}}}
         \pder[e^{\vec{f}^T \vec{\beta^l}}]{x_m} -
     \frac{e^{\vec{f}^T \vec{\beta^l}} }
@@ -1020,15 +1032,26 @@ \subsubsection{Multilogistic regression} % (fold)
 Hence, for every $m$-th variable $x_m$, we have the marginal effect for
 category $l$ as,
 \begin{gather*}
-    \mathit{ME}_m^l = P(y=l)\left(\displaystyle J(m)^T \vec{\beta^l} -
-        \sum_{q=1}^{L} P(y=q) \cdot J(m)^T \vec{\beta^q} \right).
+    \mathit{ME}_m^l = P^l\left(\displaystyle J(m)^T \vec{\beta^l} -
+        \sum_{q=1}^{L-1} P^q \cdot J(m)^T \vec{\beta^q} \right).
 \end{gather*}
 Vectorizing the above equation, we get
-$$
-    \mathit{ME}^l =
-        P(y=l)\left(\displaystyle J^T \vec{\beta^l} -
-        \sum_{q=1}^{L} P(y=q) \cdot J^T \vec{\beta^q} \right).
-$$
+$$
+\mathit{ME}^l = P^l \left(\displaystyle J^T \vec{\beta^l} - J^T B \vec{p} \right),
+$$
+where $\vec{p} = (P^1, \ldots, P^{L-1})^T$ is a column vector and
+$B = (\vec{\beta}^1, \dots, \vec{\beta}^{L-1})$ is an $N \times (L-1)$ matrix.
+Finally, we can simplify the computation of the marginal effects matrix as
+$$
+\mathit{ME} = J^T B \, \mathit{diag}(\vec{p}) - J^T B \vec{p} \vec{p}^T.
+$$
+
+Once again, for categorical variables, we compute the discrete difference
+as described in \ref{sub:categorical_variables}.
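Reviewer note: the vectorized form $\mathit{ME} = J^T B \,\mathit{diag}(\vec{p}) - J^T B \vec{p}\vec{p}^T$ introduced in this hunk can be spot-checked against finite differences. A minimal sketch assuming the identity feature map $\vec{f}(\vec{x}) = \vec{x}$, so that $J$ is the identity; the names and data are illustrative, not from the patch:

```python
import numpy as np

def probs(x, B):
    # P^1 .. P^{L-1}; the base outcome (beta^L = 0) is omitted.
    e = np.exp(x @ B)
    return e / (1.0 + e.sum())

rng = np.random.default_rng(1)
N, L = 3, 4                                    # 3 features, 4 categories
x = rng.normal(size=N)
B = rng.normal(size=(N, L - 1))

p = probs(x, B)
# ME = J^T B diag(p) - J^T B p p^T, with J = I since f(x) = x.
ME = B @ np.diag(p) - np.outer(B @ p, p)

# Finite-difference check: ME[m, l] should approximate dP^l / dx_m.
eps, FD = 1e-6, np.empty_like(ME)
for m in range(N):
    d = np.zeros(N); d[m] = eps
    FD[m] = (probs(x + d, B) - probs(x - d, B)) / (2 * eps)
print(np.max(np.abs(ME - FD)))                 # small (finite-difference error)
```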
+For categorical variable $x_k$, the $k$-th row of $\mathit{ME}$ will be,
+$$
+    \mathit{ME_k} = \left( \vec{p^{set_k}} - \vec{p^{unset_k}} \right)^T
+$$
+
 % subsubsection multilogistic_regression (end)
 % subsection marginal_effects_for_regression_methods (end)
@@ -1047,7 +1070,7 @@ \subsection{Standard Errors} % (fold)
 $Var(g(X)) \approx \left[g'(\mu_x)\right]^2 Var(X)$
 \subsubsection*{Linear Regression} % (fold)
-\label{ssub:linear_regression}
+\label{ssub:std_linear_regression}
 Using this technique, to compute the variance of the marginal effects
 at a given observation value in \emph{linear regression}, we obtain the
 standard error by first computing the marginal effect's derivative
@@ -1121,7 +1144,7 @@ \subsubsection*{Logistic Regression}
 \subsubsection*{Multinomial Logistic Regression}
 For multinomial logistic regression, the coefficients $\vec{\beta}$ form a matrix of
-dimension $(L-1) \times N$ where $L$ is the number of categories and $N$ is the
+dimension $N \times (L - 1)$ where $L$ is the number of categories and $N$ is the
 number of features (including interaction terms). In order to compute the standard
 errors on the marginal effects of category $l$ for independent variable $x_m$,
 we need to compute the term $\pder[\mathit{ME}_{m}^{l}]{\beta_{n}^{l'}}$
@@ -1144,12 +1167,12 @@ \subsubsection*{Multinomial Logistic Regression}
 logistic regression for the $m$-th index of data vector $\vec{x}$ is given as:
 \begin{gather*}
     \mathit{ME}_m^l = P^l \left[ J(m)^T \vec{\beta^l} -
-        \sum_{q=1}^{L} P^q J(m)^T \vec{\beta^q} \right]
+        \sum_{q=1}^{L-1} P^q J(m)^T \vec{\beta^q} \right]
 \end{gather*}
 where
 \begin{gather*}
     P^l = P(y=l| \vec{x}) = \frac{e^{\vec{f}\vec{\beta^l}}}{\displaystyle \sum_{q=1}^{L} e^{\vec{f}\vec{\beta^q}}}
-        \qquad \forall l \in \{ 1 \ldots L \}
+        \qquad \forall l \in \{ 1 \ldots (L-1) \}.
 \end{gather*}
 We now compute the term $\pder[\mathit{ME}_m^l]{\vec{\beta}}$.
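Reviewer note: the key ingredient of the standard-error derivation that follows is $\pder[P^l]{\beta_n^{l'}} = P^l f_n (\delta_{l,l'} - P^{l'})$. A small finite-difference check of that identity; the helper, indices, and data are illustrative, not from the patch:

```python
import numpy as np

def probs(f, B):
    # P^1 .. P^{L-1} with the base outcome (beta^L = 0) omitted.
    e = np.exp(f @ B)
    return e / (1.0 + e.sum())

rng = np.random.default_rng(2)
N, L = 3, 3
f = rng.normal(size=N)
B = rng.normal(size=(N, L - 1))
p = probs(f, B)

l, lp, n = 0, 1, 2                             # categories l, l'; coefficient index n
analytic = p[l] * f[n] * ((l == lp) - p[lp])   # P^l f_n (delta_{l,l'} - P^{l'})

# Central difference in the single coefficient beta_n^{l'}.
eps = 1e-6
Bp, Bm = B.copy(), B.copy()
Bp[n, lp] += eps
Bm[n, lp] -= eps
numeric = (probs(f, Bp)[l] - probs(f, Bm)[l]) / (2 * eps)
print(abs(analytic - numeric))                 # small (finite-difference error)
```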
 First,
@@ -1160,48 +1183,70 @@ \subsubsection*{Multinomial Logistic Regression}
         0 & \mbox{otherwise}
     \end{cases}.
 \end{equation*}
-We can show that for each $j_1 \in \{ 1 \ldots L \}$ and $k_1 \in \{1 \ldots N\}$,
+We can show that for each $l' \in \{ 1 \ldots (L-1) \}$ and $n \in \{1 \ldots N\}$,
 the partial derivative will be
 \begin{align*}
     \pder[\mathit{ME}_m^l]{\beta_n^{l'}} &=
         \pder[P^l]{\beta_n^{l'}} \left[ J(m)^T \vec{\beta^l} -
-            \sum_{q=1}^{L} P^q J(m)^T \vec{\beta^q}
+            \sum_{q=1}^{L-1} P^q J(m)^T \vec{\beta^q}
         \right] +
         P^l \left[ \pder[]{\beta_n^{l'}}(J(m)^T \vec{\beta^l}) -
-            \pder[]{\beta_n^{l'}} \left( \sum_{q=1}^{L} P^q J(m)^T \beta^q \right)
+            \pder[]{\beta_n^{l'}} \left( \sum_{q=1}^{L-1} P^q J(m)^T \beta^q \right)
         \right]
 \end{align*}
 where
 \begin{align*}
-    \pder[P^l]{\beta_n^{l'}} &= P^l x_n (\delta_{l,l'} - P^{l'})
+    \pder[P^l]{\beta_n^{l'}} &= P^l f_n (\delta_{l,l'} - P^{l'})
 \end{align*}
 The expression above can be simplified to obtain
 \begin{align*}
     \pder[\mathit{ME}_m^l]{\beta_n^{l'}} &=
-        P^l x_n (\delta_{l,l'} - P^{l'})
+        P^l f_n (\delta_{l,l'} - P^{l'})
         \left[ J(m)^T \vec{\beta^l} -
-            \sum_{q=1}^{L} P^q J(m)^T \vec{\beta^q}
+            \sum_{q=1}^{L-1} P^q J(m)^T \vec{\beta^q}
         \right] + \\
-        & \qquad P^l \left[ \delta_{l,l'} \pder[f_n]{x_m} - P^{l'} x_n J(m)^T \vec{\beta^{l'}} +
-            P^{l'} x_n \sum_{q=1}^{L} P^q J(m)^T \vec{\beta^q} -
+        & \qquad P^l \left[ \delta_{l,l'} \pder[f_n]{x_m} - P^{l'} f_n J(m)^T \vec{\beta^{l'}} +
+            P^{l'} f_n \sum_{q=1}^{L-1} P^q J(m)^T \vec{\beta^q} -
             P^{l'} \pder[f_n]{x_m} \right] \\
-        &= x_n (\delta_{l,l'} - P^{l'}) \cdot P^l \left[J(m)^T\vec{\beta^l} -
-            \sum_{q=1}^{L} P^q J(m)^T \vec{\beta^q} \right] + \\
+        &= f_n (\delta_{l,l'} - P^{l'}) \cdot P^l \left[J(m)^T\vec{\beta^l} -
+            \sum_{q=1}^{L-1} P^q J(m)^T \vec{\beta^q} \right] + \\
         & \qquad P^l \left[\delta_{l,l'} \pder[f_n]{x_m} -
-            x_n \cdot P^{l'}
-            \left( J(m)^T \vec{\beta^{l'}} - \sum_{q=1}^{L} P^q J(m)^T \vec{\beta^q} \right)
+            f_n \cdot P^{l'}
+            \left( J(m)^T \vec{\beta^{l'}} - \sum_{q=1}^{L-1} P^q J(m)^T \vec{\beta^q} \right)
             - P^{l'} \pder[f_n]{x_m}
         \right] \\[7pt]
-        &= x_n (\delta_{l, l'} - P^{l'})\mathit{ME}_m^l +
-            P^l \left[ \delta_{l, l'} \pder[f_n]{x_m} - x_n \mathit{ME}_k^{l'} - P^{l'} \pder[f_n]{x_m} \right].
+        &= f_n (\delta_{l, l'} - P^{l'})\mathit{ME}_m^l +
+            P^l \left[ \delta_{l, l'} \pder[f_n]{x_m} - f_n \mathit{ME}_m^{l'} - P^{l'} \pder[f_n]{x_m} \right].
 \end{align*}
 Again, the above computation is performed for every $l \in \{1 \ldots (L-1)\}$
 (base outcome is skipped) and every $m \in \{1 \ldots M\}$, with each
 computation returning a column vector of size $(L-1) \times N$.
+
+For categorical variables, we use the discrete difference value of $\mathit{ME}_m^l$
+to compute the standard error as,
+
+\begin{eqnarray*}
+    \pder[\mathit{ME}_m^l]{\beta_n^{l'}} &=&
+        \pder[P^{set_m, l}]{\beta_n^{l'}} - \pder[P^{unset_m, l}]{\beta_n^{l'}} \\[7pt]
+    &=& P^{set_m, l} f_n^{set_m} (\delta_{l,l'} - P^{set_m, l'}) -
+        P^{unset_m, l} f_n^{unset_m} (\delta_{l,l'} - P^{unset_m, l'}) \\[7pt]
+    &=& - \left(P^{set_m, l} f_n^{set_m} P^{set_m, l'} -
+        P^{unset_m, l} f_n^{unset_m} P^{unset_m, l'} \right) + \\
+    && \qquad \delta_{l,l'} \left(P^{set_m, l} f_n^{set_m} - P^{unset_m, l} f_n^{unset_m} \right),
+\end{eqnarray*}
+where
+\begin{align*}
+    & P^{set_m, l} = \frac{e^{\vec{f^{set_m}}\vec{\beta^l}}}{\displaystyle \sum_{q=1}^{L} e^{\vec{f^{set_m}}\vec{\beta^q}}} \\
+    & \text{and} \\
+    & P^{unset_m, l} = \frac{e^{\vec{f^{unset_m}}\vec{\beta^l}}}{\displaystyle \sum_{q=1}^{L} e^{\vec{f^{unset_m}}\vec{\beta^q}}}
+        \qquad \qquad \forall l \in \{ 1 \ldots (L-1) \}.
+\end{align*}
+
+
 % subsection standard_errors (end)
 % section marginal_effects (end)
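Reviewer note: the discrete-difference marginal effect for a categorical variable reduces to $\vec{p}^{set_k} - \vec{p}^{unset_k}$. A minimal sketch with a single hypothetical dummy at index `k`, assuming its reference level is absorbed into the intercept rather than coded as an explicit dummy (so "unset" simply zeroes it); everything here is illustrative, not from the patch:

```python
import numpy as np

def probs_full(f, B):
    # All L probabilities, appending the base outcome (beta^L = 0).
    e = np.append(np.exp(f @ B), 1.0)
    return e / e.sum()

rng = np.random.default_rng(3)
B = rng.normal(size=(4, 2))                    # N = 4, L = 3
f = rng.normal(size=4)

k = 1                                          # hypothetical dummy-variable index
f_set, f_unset = f.copy(), f.copy()
f_set[k], f_unset[k] = 1.0, 0.0                # toggle the dummy on / off
ME_k = probs_full(f_set, B) - probs_full(f_unset, B)
print(ME_k)                                    # differences across all L categories sum to 0
```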