
Margins: multinomial logistic

Additional authors:
- Shengwen Yang <syang@gopivotal.com>
- Feng, Xixuan (Aaron) <xfeng@gopivotal.com>

Pivotal Tracker: #67684748

Changes:
- added support for marginal effects in multinomial logistic regression
- appended covariance matrix and coefficients to summary table of mlogregr_train()
- updated the design doc to better reflect the implementation
1 parent 52332bf commit 8cdce809a1ebcfe14fa1e2f6e2c243f7927dedb4 Rahul Iyer committed with haying May 8, 2014
@@ -22,6 +22,7 @@ auto
_region_.tex
auto
*.swp
+*.fdb_latexmk
# Biblatex temporary files
*-blx.bib
@@ -683,11 +683,11 @@ \subsubsection{Multi-Logistic Regression Derivatives}
\end{align}
The derivatives are then
-\begin{equation}\label{eq:first_derivative}
+\begin{equation}\label{eq:first_derivative2}
\frac{\partial l}{\partial \beta_{k,j}} = \sum_{i=1}^{N} Y_{i,j}x_{i,k} - \pi_{i,j}x_{i,k} \ \ \ \ \forall k \ \forall j
\end{equation}
The Hessian is then
-\begin{align}\label{eq:second_derivative}
+\begin{align}\label{eq:second_derivative2}
\frac{\partial^2 l({\beta})}{\partial \beta_{k_2,j_2} \partial \beta_{k_1,j_1}}
&= \sum_{i=1}^{N} -\pi_{i,j_2}x_{i,k_2}(1-\pi_{i,j_1})x_{i,k_1} &&j_1 = j_2 \\
&= \sum_{i=1}^{N} \pi_{i,j_2}x_{i,k_2}\pi_{i,j_1}x_{i,k_1} &&j_1 \neq j_2
@@ -920,21 +920,30 @@ \subsection{Marginal effects for regression methods} % (fold)
and let $J(m) = \pder[\vec{f}]{x_m}$ denote the $m$-th column of $J$, where $\vec{f} = (f_1, \ldots f_N)^T$.
If $x_k$ is a categorical variable and if discrete differences are required
for it, then the column corresponding to $x_k$ can be replaced by,
-$$\pder[\vec{f}]{x_k} =
- \begin{bmatrix}
- f_0^{set} - f_0^{unset}, & f_1^{set} - f_1^{unset}, &
- f_2^{set} - f_2^{unset}, & \ldots, & f_{N - 1}^{set} - f_{N - 1}^{unset}
+\begin{align*}
+\pder[\vec{f}]{x_k} &= \vec{f^{set_k}} - \vec{f^{unset_k}} \\
+ &= \begin{bmatrix}
+ f_0^{set_k} , & f_1^{set_k}, & f_2^{set_k} , & \ldots, & f_{N - 1}^{set_k}
+ \end{bmatrix}^T -
+ \begin{bmatrix}
+ f_0^{unset_k} , & f_1^{unset_k}, & f_2^{unset_k} , & \ldots, & f_{N - 1}^{unset_k}
+ \end{bmatrix}^T \\
+ &= \begin{bmatrix}
+ f_0^{set_k} - f_0^{unset_k}, & f_1^{set_k} - f_1^{unset_k}, &
+ f_2^{set_k} - f_2^{unset_k}, & \ldots, & f_{N - 1}^{set_k} - f_{N - 1}^{unset_k}
\end{bmatrix}^T,
-$$
+\end{align*}
where
\begin{align*}
- & f_i^{set} = f_i(\ldots, x_k=1, x_l=0, x_r=0) \\
+ & f_i^{set_k} = f_i(\ldots, x_k=1, x_l=0, x_r=0) \\
& \text{and} \\
- & f_i^{unset} = f_i(\ldots, x_k=0, x_l=0, x_r=1),
+ & f_i^{unset_k} = f_i(\ldots, x_k=0, x_l=0, x_r=1),
\end{align*}
-$\forall x_l \in$ (set of dummy variables related to $x_k$ excluding the reference variable) and
-$x_r = $ reference variable (if present).
+$\forall x_l \in$ (set of dummy variables related to $x_k$ excluding the
+reference variable) and $x_r = $ reference variable of $x_k$ (if present). The
+response probability corresponding to $f^{set_k}$ is denoted as $P^{set_k}$.
+
\subsubsection{Linear regression} % (fold)
\label{ssub:linear_regression}
@@ -977,11 +986,14 @@ \subsubsection{Multilogistic regression} % (fold)
\label{ssub:multilogistic_regression}
The probabilities of different outcomes for multilogistic regression are expressed as,
\begin{gather*}
- P(y=l | \vec{x}) = \frac{e^{\vec{f}^T \vec{\beta^l}}}{\displaystyle \sum_{q=1}^{L} e^{\vec{f}^T \vec{\beta^q}}},
+ P^l = P(y=l | \vec{x})
+ = \frac{e^{\vec{f}^T \vec{\beta^l}}}{1 + \displaystyle \sum_{q=1}^{L-1} e^{\vec{f}^T \vec{\beta^q}}}
+ = \frac{e^{\vec{f}^T \vec{\beta^l}}}{\displaystyle \sum_{q=1}^{L} e^{\vec{f}^T \vec{\beta^q}}},
\end{gather*}
-where $\vec{\beta^l}$ represents the coefficient vector for category $l$,
-with $L$ being the total number of categories. The $\vec{\beta}$ vector is set to zero for one of the
-outcomes, called the ``base outcome'' or the ``reference category''.
+where $\vec{\beta^l}$ represents the coefficient vector for category $l$, with
+$L$ being the total number of categories. The coefficients are set to zero for
+one of the outcomes, called the ``base outcome'' or the ``reference category''.
+Here, without loss of generality, we let $\vec{\beta^L} = \vec{0}$.
% The odds of outcome $j$ versus outcome $m$ are
% \begin{align*}
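As a quick sanity check on the probability expression above (outside the patch itself), the softmax with the base outcome's coefficients pinned to zero can be sketched in NumPy. The function name and shapes here are illustrative assumptions, not MADlib's actual API:

```python
import numpy as np

# Hypothetical sketch of P^l = exp(f'b^l) / sum_q exp(f'b^q), where the
# base outcome L has its coefficients fixed at zero (beta^L = 0).
# `f` is the length-N feature vector, `B` the N x (L-1) coefficient matrix.
def multinomial_probs(f, B):
    # Linear predictors for the L-1 non-reference categories,
    # plus an implicit 0 for the base outcome.
    eta = np.append(f @ B, 0.0)
    e = np.exp(eta - eta.max())  # subtract max for numerical stability
    return e / e.sum()           # length-L vector summing to 1
```

The base outcome's probability equals $1 / (1 + \sum_{q=1}^{L-1} e^{\vec{f}^T \vec{\beta^q}})$, matching the two equivalent denominators in the formula above.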
@@ -1000,7 +1012,7 @@ \subsubsection{Multilogistic regression} % (fold)
Thus,
\begin{align*}
\displaystyle
- \pder[P(y=l|\vec{x})]{x_m} &=
+\pder[P^l]{x_m} &=
\frac{1}{\displaystyle \sum_{q=1}^{L} e^{\vec{f}^T \vec{\beta^q}}}
\pder[e^{\vec{f}^T \vec{\beta^l}}]{x_m} -
\frac{e^{\vec{f}^T \vec{\beta^l}} }
@@ -1020,15 +1032,26 @@ \subsubsection{Multilogistic regression} % (fold)
Hence, for each variable $x_m$, we have the marginal effect for category
$l$ as,
\begin{gather*}
- \mathit{ME}_m^l = P(y=l)\left(\displaystyle J(m)^T \vec{\beta^l} -
- \sum_{q=1}^{L} P(y=q) \cdot J(m)^T \vec{\beta^q} \right).
+ \mathit{ME}_m^l = P^l\left(\displaystyle J(m)^T \vec{\beta^l} -
+ \sum_{q=1}^{L-1} P^q \cdot J(m)^T \vec{\beta^q} \right).
\end{gather*}
Vectorizing the above equation, we get
- $$
- \mathit{ME}^l =
- P(y=l)\left(\displaystyle J^T \vec{\beta^l} -
- \sum_{q=1}^{L} P(y=q) \cdot J^T \vec{\beta^q} \right).
- $$
+$$
+\mathit{ME}^l = P^l \left(\displaystyle J^T \vec{\beta^l} - J^T B \vec{p} \right),
+$$
+where $\vec{p} = (P^1, \ldots, P^{L-1})^T$ is a column vector and $B = (\vec{\beta}^1, \dots, \vec{\beta}^{L-1})$ is an $N \times (L-1)$ matrix.
+Finally, we can simplify the computation of the marginal effects matrix as
+$$
+\mathit{ME} = J^T B \mathit{diag}(\vec{p}) - J^T B \vec{p} \vec{p}^T.
+$$
+
+Once again, for categorical variables, we compute the discrete difference
+as described in \ref{sub:categorical_variables}.
+For categorical variable $x_k$, the $k$-th row of $\mathit{ME}$ will be,
+$$
+ \mathit{ME_k} = \left( \vec{p^{set_k}} - \vec{p^{unset_k}} \right)^T.
+$$
+
% subsubsection multilogistic_regression (end)
% subsection marginal_effects_for_regression_methods (end)
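As a sanity check (not part of the patch), the vectorized formula $\mathit{ME} = J^T B\, \mathit{diag}(\vec{p}) - J^T B \vec{p} \vec{p}^T$ can be sketched in NumPy and compared column-by-column against the per-category expression $\mathit{ME}^l = P^l (J^T \vec{\beta^l} - J^T B \vec{p})$. Names and shapes are illustrative assumptions:

```python
import numpy as np

# Sketch of the simplified marginal-effects matrix
#   ME = J^T B diag(p) - J^T B p p^T
# under the notation above: J is the N x M Jacobian of f w.r.t. x,
# B is the N x (L-1) coefficient matrix, and p holds the L-1
# non-reference response probabilities.
def marginal_effects(J, B, p):
    JB = J.T @ B                          # M x (L-1)
    # Column l of (JB * p) is p_l * J^T beta^l; the outer-product term
    # subtracts p_l * J^T B p from each column l.
    return JB * p - np.outer(JB @ p, p)   # M x (L-1), entry (m, l) = ME_m^l
```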
@@ -1047,7 +1070,7 @@ \subsection{Standard Errors} % (fold)
\[ Var(g(X)) \approx \left[g'(\mu_x)\right]^2 Var(X) \]
\subsubsection*{Linear Regression} % (fold)
-\label{ssub:linear_regression}
+\label{ssub:std_linear_regression}
Using this technique, to compute the variance of the marginal effects
at a given observation value in \emph{linear regression}, we obtain
the standard error by first computing the marginal effect's derivative
@@ -1121,7 +1144,7 @@ \subsubsection*{Logistic Regression}
\subsubsection*{Multinomial Logistic Regression}
For multinomial logistic regression, the coefficients $\vec{\beta}$ form a matrix of
-dimension $(L-1) \times N$ where $L$ is the number of categories and $N$ is the
+dimension $N \times (L - 1)$ where $L$ is the number of categories and $N$ is the
number of features (including interaction terms). In order to compute the
standard errors on the marginal effects of category $l$ for independent variable
$x_m$, we need to compute the term $\pder[\mathit{ME}_{m}^{l}]{\beta_{n}^{l'}}$
@@ -1144,12 +1167,12 @@ \subsubsection*{Multinomial Logistic Regression}
logistic regression for the $m$-th index of data vector $\vec{x}$ is given as:
\begin{gather*}
\mathit{ME}_m^l = P^l \left[ J(m)^T \vec{\beta^l} -
- \sum_{q=1}^{L} P^q J(m)^T \vec{\beta^q} \right]
+ \sum_{q=1}^{L-1} P^q J(m)^T \vec{\beta^q} \right]
\end{gather*}
where
\begin{gather*}
P^l = P(y=l| \vec{x}) = \frac{e^{\vec{f}^T \vec{\beta^l}}}{\displaystyle \sum_{q=1}^{L} e^{\vec{f}^T \vec{\beta^q}}}
- \qquad \forall l \in \{ 1 \ldots L \}
+ \qquad \forall l \in \{ 1 \ldots (L-1) \}.
\end{gather*}
We now compute the term $\pder[\mathit{ME}_m^l]{\vec{\beta}}$. First,
@@ -1160,48 +1183,70 @@ \subsubsection*{Multinomial Logistic Regression}
0 & \mbox{otherwise} \end{cases}.
\end{equation*}
-We can show that for each $j_1 \in \{ 1 \ldots L \}$ and $k_1 \in \{1 \ldots N\}$,
+We can show that for each $l' \in \{ 1 \ldots (L-1) \}$ and $n \in \{1 \ldots N\}$,
the partial derivative will be
\begin{align*}
\pder[\mathit{ME}_m^l]{\beta_n^{l'}} &=
\pder[P^l]{\beta_n^{l'}}
\left[ J(m)^T \vec{\beta^l} -
- \sum_{q=1}^{L} P^q J(m)^T \vec{\beta^q}
+ \sum_{q=1}^{L-1} P^q J(m)^T \vec{\beta^q}
\right] +
P^l \left[ \pder[]{\beta_n^{l'}}(J(m)^T \vec{\beta^l}) -
- \pder[]{\beta_n^{l'}} \left( \sum_{q=1}^{L} P^q J(m)^T \beta^q \right)
+ \pder[]{\beta_n^{l'}} \left( \sum_{q=1}^{L-1} P^q J(m)^T \beta^q \right)
\right]
\end{align*}
where
\begin{align*}
- \pder[P^l]{\beta_n^{l'}} &= P^l x_n (\delta_{l,l'} - P^{l'})
+ \pder[P^l]{\beta_n^{l'}} &= P^l f_n (\delta_{l,l'} - P^{l'})
\end{align*}
The expression above can be simplified to obtain
\begin{align*}
\pder[\mathit{ME}_m^l]{\beta_n^{l'}} &=
- P^l x_n (\delta_{l,l'} - P^{l'})
+ P^l f_n (\delta_{l,l'} - P^{l'})
\left[ J(m)^T \vec{\beta^l} -
- \sum_{q=1}^{L} P^q J(m)^T \vec{\beta^q}
+ \sum_{q=1}^{L-1} P^q J(m)^T \vec{\beta^q}
\right] + \\
- & \qquad P^l \left[ \delta_{l,l'} \pder[f_n]{x_m} - P^{l'} x_n J(m)^T \vec{\beta^{l'}} +
- P^{l'} x_n \sum_{q=1}^{L} P^q J(m)^T \vec{\beta^q} -
+ & \qquad P^l \left[ \delta_{l,l'} \pder[f_n]{x_m} - P^{l'} f_n J(m)^T \vec{\beta^{l'}} +
+ P^{l'} f_n \sum_{q=1}^{L-1} P^q J(m)^T \vec{\beta^q} -
P^{l'} \pder[f_n]{x_m}
\right] \\
- &= x_n (\delta_{l,l'} - P^{l'}) \cdot P^l \left[J(m)^T\vec{\beta^l} -
- \sum_{q=1}^{L} P^q J(m)^T \vec{\beta^q} \right] + \\
+ &= f_n (\delta_{l,l'} - P^{l'}) \cdot P^l \left[J(m)^T\vec{\beta^l} -
+ \sum_{q=1}^{L-1} P^q J(m)^T \vec{\beta^q} \right] + \\
& \qquad P^l \left[\delta_{l,l'} \pder[f_n]{x_m} -
- x_n \cdot P^{l'}
- \left( J(m)^T \vec{\beta^{l'}} - \sum_{q=1}^{L} P^q J(m)^T \vec{\beta^q} \right)
+ f_n \cdot P^{l'}
+ \left( J(m)^T \vec{\beta^{l'}} - \sum_{q=1}^{L-1} P^q J(m)^T \vec{\beta^q} \right)
- P^{l'} \pder[f_n]{x_m}
\right] \\[7pt]
- &= x_n (\delta_{l, l'} - P^{l'})\mathit{ME}_m^l +
- P^l \left[ \delta_{l, l'} \pder[f_n]{x_m} - x_n \mathit{ME}_k^{l'} - P^{l'} \pder[f_n]{x_m} \right].
+ &= f_n (\delta_{l, l'} - P^{l'})\mathit{ME}_m^l +
+ P^l \left[ \delta_{l, l'} \pder[f_n]{x_m} - f_n \mathit{ME}_m^{l'} - P^{l'} \pder[f_n]{x_m} \right].
\end{align*}
Again, the above computation is performed for every $l \in \{1 \ldots (L-1)\}$ (the base outcome is skipped)
and every $m \in \{1 \ldots M\}$, with each computation returning the $(L-1) \times N$
matrix of partial derivatives with respect to all coefficients.
+
+For categorical variables, we use the discrete difference value of $\mathit{ME}_m^l$
+to compute the standard error as,
+
+\begin{eqnarray*}
+ \pder[\mathit{ME}_m^l]{\beta_n^{l'}} &=&
+ \pder[P^{set_m, l}]{\beta_n^{l'}} - \pder[P^{unset_m, l}]{\beta_n^{l'}} \\[7pt]
+ &=& P^{set_m, l} f_n^{set_m} (\delta_{l,l'} - P^{set_m, l'}) -
+ P^{unset_m, l} f_n^{unset_m} (\delta_{l,l'} - P^{unset_m, l'}) \\[7pt]
+ &=& - \left(P^{set_m, l} f_n^{set_m} P^{set_m, l'} -
+ P^{unset_m, l} f_n^{unset_m} P^{unset_m, l'} \right) + \\
+ && \qquad \delta_{l,l'} \left(P^{set_m, l} f_n^{set_m} - P^{unset_m, l} f_n^{unset_m} \right),
+\end{eqnarray*}
+where
+\begin{align*}
+ & P^{set_m, l} = \frac{e^{\vec{f^{set_m}}^T \vec{\beta^l}}}{\displaystyle \sum_{q=1}^{L} e^{\vec{f^{set_m}}^T \vec{\beta^q}}} \\
+ & \text{and} \\
+ & P^{unset_m, l} = \frac{e^{\vec{f^{unset_m}}^T \vec{\beta^l}}}{\displaystyle \sum_{q=1}^{L} e^{\vec{f^{unset_m}}^T \vec{\beta^q}}} \qquad \qquad \forall l \in \{ 1 \ldots (L-1) \}.
+\end{align*}
+
+
+
% subsection standard_errors (end)
% section marginal_effects (end)
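The identity $\pder[P^l]{\beta_n^{l'}} = P^l f_n (\delta_{l,l'} - P^{l'})$, which the whole standard-error derivation above rests on, is easy to verify numerically. The sketch below (illustrative names, not MADlib code) compares the analytic expression against a central finite difference:

```python
import numpy as np

# Probabilities as defined in the text, with beta^L = 0 for the base outcome.
def probs(f, B):
    eta = np.append(f @ B, 0.0)
    e = np.exp(eta - eta.max())
    return e / e.sum()

# Hypothetical finite-difference approximation of dP^l / d beta_n^{l'},
# perturbing the single coefficient B[n, lp] by +/- eps.
def dP_dbeta(f, B, l, n, lp, eps=1e-6):
    Bp = B.copy(); Bp[n, lp] += eps
    Bm = B.copy(); Bm[n, lp] -= eps
    return (probs(f, Bp)[l] - probs(f, Bm)[l]) / (2.0 * eps)
```

The analytic value $P^l f_n (\delta_{l,l'} - P^{l'})$ should agree with `dP_dbeta` to within the finite-difference error for every $l, l' \in \{1 \ldots (L-1)\}$ and $n \in \{1 \ldots N\}$.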