# Offline-to-Online Training

Our model is trained generatively---the observed data log-likelihood is maximized using the EM algorithm. However, our goal is to deploy the model in a predictive setting. We want to predict the most likely future trajectory given (1) any baseline information and (2) the noisy marker values observed so far. The focus of this notebook is to understand how we can adjust the parameters of a generative trajectory model in order to improve the performance on the trajectory prediction task.

## Related Work

The paper by [Raina and Ng (2003)](http://ai.stanford.edu/~rajatr/papers/nips03-hybrid.pdf) describes a hybrid generative/discriminative model. One of its key ideas concerns the relative importance of random variables in a generative model when it is applied in a predictive context (i.e. when the generative model is used to derive a conditional probability through Bayes' rule). On page 3 there is an interesting point: they show that the decision rule for binary classification of Usenet documents can be formulated as a comparison between sums of log-likelihood terms. They note that if features are extracted from, say, the message title and the message body, there are many more log-likelihood terms for the body than for the title. The title, however, may be informative for making the decision. A naive Bayes classifier (NBC), or more generally any generatively trained classifier, nonetheless gives every term equal weight, so the many body terms can dominate the decision.
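To make the imbalance concrete, here is a toy sketch. The numbers are made up for illustration and do not come from Raina and Ng, and the `title_weight` / `body_weight` arguments are hypothetical knobs in the spirit of the hybrid model rather than the paper's exact parameterization. A naive Bayes log-odds score is a sum of per-word log-likelihood ratios, so a long body can swamp a short title even when each body word is only weakly informative.

```python
import numpy as np

# Hypothetical per-word log-likelihood ratios: log p(word | A) - log p(word | B).
title_llr = np.array([1.5, 1.2])   # 2 title words, each strongly favoring class A
body_llr = np.full(50, -0.1)       # 50 body words, each weakly favoring class B

def log_odds(title_llr, body_llr, title_weight=1.0, body_weight=1.0):
    """Weighted sum of log-likelihood ratios; weights of 1 give plain naive Bayes."""
    return title_weight * title_llr.sum() + body_weight * body_llr.sum()

plain = log_odds(title_llr, body_llr)                        # body terms dominate
reweighted = log_odds(title_llr, body_llr, body_weight=0.2)  # body down-weighted
print(plain, reweighted)
```

With equal weights the score is negative (class B wins on the strength of many weak terms); down-weighting the body flips the decision toward the title's evidence.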

## Experimental Setup

The metric of interest is the mean absolute error aggregated within the usual buckets we've defined: (1,2], (2,4], (4,8], and (8,25]. We'll begin by looking at predictions made after observing one year of data (i.e. we will only train a single online-adapted model). We will compare to two baselines. The first baseline is the prediction made using the MAP estimate of the subtype under full information (i.e. observing all of the individual's pFVC data). The second baseline is the prediction made using the MAP estimate of the subtype under one year of data (i.e. the standard conditional prediction obtained by applying Bayes' rule to the generative model).
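A minimal sketch of the bucketed metric follows. The helper name `bucketed_mae` and its argument layout are assumptions for illustration. It pools all predictions that fall in a bucket; whether the actual evaluation pools predictions or averages per individual first is not stated here.

```python
import numpy as np

def bucketed_mae(t, y_true, y_pred, edges=(1, 2, 4, 8, 25)):
    """Mean absolute error within the half-open buckets (edges[i-1], edges[i]]."""
    t, y_true, y_pred = map(np.asarray, (t, y_true, y_pred))
    err = np.abs(y_true - y_pred)
    # With right=True, digitize returns i such that edges[i-1] < t <= edges[i].
    idx = np.digitize(t, edges, right=True)
    out = {}
    for i in range(1, len(edges)):
        mask = idx == i
        # Empty buckets are reported as NaN rather than 0.
        out[(edges[i - 1], edges[i])] = err[mask].mean() if mask.any() else np.nan
    return out

# Hypothetical measurement times (years) with true and predicted pFVC values.
result = bucketed_mae(
    t=[0.5, 1.5, 2.0, 3.0, 10.0],
    y_true=[80.0, 70.0, 60.0, 50.0, 40.0],
    y_pred=[80.0, 72.0, 61.0, 47.0, 45.0],
)
print(result)
```

Times at or before year 1 and after year 25 fall outside every bucket and are ignored.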

## Methods

For each individual $i$, let $y_i$ denote the vector of observed measurements, $t_i$ the measurement times, and $x_i$ the vector of covariates used in the population and subpopulation models. Each individual is associated with a subtype, which we denote using $z_i \in \{1, \ldots, K\}$. Let $\Phi_1(t_i)$ denote the population feature matrix, $\Phi_2(t_i)$ denote the subpopulation feature matrix, and $\Phi_3(t_i)$ denote the individual-specific long-term effects feature matrix.

In the generative model, we specify the marginal probability of subtype membership and the conditional probability of observed markers given subtype membership. The marginal probability of subtype membership is modeled using softmax multiclass regression:

$$ p(z_i = k \mid w_{1:K}) \propto \exp \{ x_i^\top w_k \}. $$

The conditional probability of a marker sequence given subtype membership is

$$ p(y_i \mid z_i = k, \beta_{1:K}) = \mathcal{N} ( m_i(k), \Sigma_i ), $$

where

$$ m_i(k) = \Phi_1(t_i) \Lambda x_i + \Phi_2(t_i) \beta_k $$
and
$$ \Sigma_i = \Phi_3(t_i) \Sigma_b \Phi_3^\top(t_i) + K_{\text{OU}}(t_i) + \sigma^2 \mathbf{I}. $$

Given some observed data $y_i$, the posterior over subtype membership $z_i$ is

$$ p(z_i = k \mid y_i) \propto p(z_i = k \mid w_{1:K}) p(y_i \mid z_i = k, \beta_{1:K}). $$
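As a small worked instance of this formula, the sketch below combines a softmax prior with Gaussian likelihoods in log space. All function names and the toy inputs are assumptions for illustration; the notebook's own implementation appears in the Code section.

```python
import numpy as np

def log_softmax(a):
    """Numerically stable log of the softmax of a 1-D array."""
    a = a - a.max()
    return a - np.log(np.exp(a).sum())

def gaussian_logpdf(y, m, S):
    """Log density of N(m, S) at y, via a Cholesky factorization of S."""
    L = np.linalg.cholesky(S)
    r = np.linalg.solve(L, y - m)
    return -0.5 * r @ r - np.log(np.diag(L)).sum() - 0.5 * y.size * np.log(2 * np.pi)

def subtype_posterior(x, W, y, means, S):
    """p(z = k | y) from prior weights W (K x d) and per-subtype means (K x n)."""
    log_prior = log_softmax(W @ x)
    log_lik = np.array([gaussian_logpdf(y, m, S) for m in means])
    # Normalizing the log joint gives the posterior.
    return np.exp(log_softmax(log_prior + log_lik))

# Toy inputs: two subtypes, uniform prior, two observations at the first subtype's mean.
x = np.array([1.0])
W = np.zeros((2, 1))
y = np.zeros(2)
means = np.array([[0.0, 0.0], [3.0, 3.0]])
post = subtype_posterior(x, W, y, means, np.eye(2))
print(post)
```

Working in log space avoids underflow when the likelihood of a long marker sequence is tiny under some subtypes.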

## Code

In [1]:
import numpy as np
import pandas as pd

from importlib import reload  # `imp` is deprecated and was removed in Python 3.12

In [2]:
import sys
sys.path.append('/Users/pschulam/Git/mypy')

In [140]:
np.set_printoptions(precision=2)

### B-spline Basis

This basis is **hard-coded** to implement the exact basis functions used to fit the model in the R code.

In [4]:
from mypy import bsplines

boundaries   = (-1.0, 23.0)
degree       = 2
num_features = 6
basis = bsplines.universal_basis(boundaries, degree, num_features)

### Kernel Function

In [5]:
from mypy.util import as_row, as_col

def kernel(x1, x2=None, a_const=1.0, a_ou=1.0, l_ou=1.0):
    symmetric = x2 is None
    d = differences(x1, x1) if symmetric else differences(x1, x2)
    K = a_const * np.ones_like(d)
    K += ou_kernel(d, a_ou, l_ou)
    if symmetric:
        K += np.eye(x1.size)
    return K

def ou_kernel(d, a, l):
    return a * np.exp( - np.abs(d) / l )

def differences(x1, x2):
    return as_col(x1) - as_row(x2)

In [6]:
x_test = np.linspace(0, 20, 41)
X_test = basis.eval(x_test)
K_test = kernel(x_test, a_const=16.0, a_ou=36.0, l_ou=2.0)

In [7]:
X_test[:5, :]

array([[ 0.6944,  0.2917,  0.0139,  0.    ,  0.    ,  0.    ],
       [ 0.5625,  0.4062,  0.0312,  0.    ,  0.    ,  0.    ],
       [ 0.4444,  0.5   ,  0.0556,  0.    ,  0.    ,  0.    ],
       [ 0.3403,  0.5729,  0.0868,  0.    ,  0.    ,  0.    ],
       [ 0.25  ,  0.625 ,  0.125 ,  0.    ,  0.    ,  0.    ]])
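The `bsplines.universal_basis` implementation lives in the author's `mypy` package and is not shown. As a hedged sketch, the printed rows of `X_test` are consistent with a clamped quadratic B-spline on (-1, 23) with three equally spaced interior knots at 5, 11, and 17. That knot placement is inferred from the output rather than taken from the package, and `bspline_basis` is a hypothetical helper that applies the standard Cox-de Boor recursion.

```python
import numpy as np

def bspline_basis(x, knots, degree):
    """Evaluate all B-spline basis functions at x with the Cox-de Boor recursion."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    t = np.asarray(knots, dtype=float)
    # Degree 0: indicators of the half-open knot intervals [t[j], t[j+1]).
    B = np.array([(t[j] <= x) & (x < t[j + 1]) for j in range(len(t) - 1)],
                 dtype=float).T
    for d in range(1, degree + 1):
        nxt = np.zeros((x.size, len(t) - d - 1))
        for j in range(len(t) - d - 1):
            left = t[j + d] - t[j]
            right = t[j + d + 1] - t[j + 1]
            # Terms with a zero-width knot span are dropped (0/0 := 0 convention).
            if left > 0:
                nxt[:, j] += (x - t[j]) / left * B[:, j]
            if right > 0:
                nxt[:, j] += (t[j + d + 1] - x) / right * B[:, j + 1]
        B = nxt
    return B

# Assumed knot vector: boundary knots repeated degree + 1 times, interior knots uniform.
knots = [-1, -1, -1, 5, 11, 17, 23, 23, 23]
X = bspline_basis(np.linspace(0, 20, 41)[:5], knots, degree=2)
print(np.round(X, 4))
```

Because the degree-0 intervals are half-open, this sketch returns zeros exactly at the right boundary (23), which is outside the range evaluated in this notebook.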

In [8]:
K_test[:5, :5]

array([[ 53.    ,  44.0368,  37.8351,  33.0052,  29.2437],
       [ 44.0368,  53.    ,  44.0368,  37.8351,  33.0052],
       [ 37.8351,  44.0368,  53.    ,  44.0368,  37.8351],
       [ 33.0052,  37.8351,  44.0368,  53.    ,  44.0368],
       [ 29.2437,  33.0052,  37.8351,  44.0368,  53.    ]])
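To connect `kernel` to the expression for $\Sigma_i$ in the Methods section, the sketch below rebuilds the top-left corner of `K_test` term by term. It reads the code as using a random intercept only, so $\Phi_3(t_i)$ is taken to be a column of ones with $\Sigma_b$ equal to `a_const`, and the `np.eye` term is read as a noise variance $\sigma^2 = 1$. Both readings are assumptions drawn from the code above, not from a separate specification.

```python
import numpy as np

t = np.linspace(0, 20, 41)[:3]      # first three test times: 0.0, 0.5, 1.0
Phi3 = np.ones((t.size, 1))         # assumed: random intercept only
Sigma_b = np.array([[16.0]])        # a_const
# Ornstein-Uhlenbeck kernel with a_ou = 36 and l_ou = 2.
K_ou = 36.0 * np.exp(-np.abs(t[:, None] - t[None, :]) / 2.0)
sigma2 = 1.0                        # noise variance implied by np.eye in kernel()

Sigma = Phi3 @ Sigma_b @ Phi3.T + K_ou + sigma2 * np.eye(t.size)
print(np.round(Sigma, 4))
```

The diagonal is 16 + 36 + 1 = 53, and the first off-diagonal entry is 16 + 36 exp(-0.25), which rounds to 44.0368, matching the printed matrix.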

### Softmax Model

In [9]:
import scipy.optimize as opt

from mypy.models import softmax
reload(softmax)

<module 'mypy.models.softmax' from '/Users/pschulam/Git/mypy/mypy/models/softmax.py'>

### Trajectory Model

In [65]:
from scipy.stats import multivariate_normal

a_const = 16.0
a_ou    = 36.0
l_ou    = 2.0

def phi1(x):
    return np.ones((x.size, 1))

def phi2(x):
    return basis.eval(x)

def gp_posterior(tnew, t, y, kern, **kwargs):
    from numpy import dot
    from scipy.linalg import solve
    
    K11 = kern(tnew, **kwargs)
    K12 = kern(tnew, t, **kwargs)
    K22 = kern(t, **kwargs)
    
    m = dot(K12, solve(K22, y))
    K = K11 - dot(K12, solve(K22, K12.T))
    
    return m, K

def trajectory_means(t, x, b, B):
    from numpy import dot
    
    P1 = phi1(t)
    P2 = phi2(t)
    
    m1 = dot(P1, dot(b, x)).ravel()
    m2 = dot(B, P2.T)
    
    return m1 + m2

def trajectory_logl(t, x, y, z, B, b):
    if t.size < 1:
        return 0.0
    
    m = trajectory_means(t, x, b, B)[z]
    S = kernel(t, a_const=a_const, a_ou=a_ou, l_ou=l_ou)
    
    return multivariate_normal.logpdf(y, m, S)

### Load Parameters

In [20]:
b = np.loadtxt('param/pop.dat')
B = np.loadtxt('param/subpop.dat')
W = np.loadtxt('param/marginal.dat')
W = np.r_[ np.zeros((1, W.shape[1])), W ]

In [86]:
from scipy.special import logsumexp  # moved out of scipy.misc in newer SciPy releases

def model_prior(t, x1, x2, y, b, B, W):
    return softmax.regression_log_proba(x2, W)

def model_likelihood(t, x1, x2, y, b, B, W):
    k = B.shape[0]
    return np.array([trajectory_logl(t, x1, y, z, B, b) for z in range(k)])

def model_posterior(t, x1, x2, y, b, B, W):
    prior = model_prior(t, x1, x2, y, b, B, W)
    likel = model_likelihood(t, x1, x2, y, b, B, W)
    lp = prior + likel
    return np.exp(lp - logsumexp(lp))

def model_evidence(t, x1, x2, y, b, B, W):
    prior = model_prior(t, x1, x2, y, b, B, W)
    likel = model_likelihood(t, x1, x2, y, b, B, W)
    lp = prior + likel
    return logsumexp(lp)

### Load Data

In [159]:
from copy import deepcopy

def PatientData(tbl):
    pd = {}
    pd['ptid'] = int(tbl['ptid'].values[0])
    pd['t']    = tbl['years_seen_full'].values.copy()
    pd['y']    = tbl['pfvc'].values.copy()
    pd['x1']   = np.asarray(tbl.loc[:, ['female', 'afram']].drop_duplicates()).ravel()
    pd['x2']   = np.asarray(tbl.loc[:, ['female', 'afram', 'aca', 'scl']].drop_duplicates()).ravel()
    pd['x2']   = np.r_[ 1.0, pd['x2'] ]
    return pd

def truncated_data(pd, censor_time):
    obs = pd['t'] <= censor_time
    pdc = deepcopy(pd)
    pdc['t'] = pd['t'][obs]
    pdc['y'] = pd['y'][obs]
    return pdc, pd['t'][~obs]

def eval_prior(pd, b=b, B=B, W=W):
    return model_prior(pd['t'], pd['x1'], pd['x2'], pd['y'], b, B, W)

def eval_likel(pd, b=b, B=B, W=W):
    return model_likelihood(pd['t'], pd['x1'], pd['x2'], pd['y'], b, B, W)

def run_inference(pd, b=b, B=B, W=W):
    ll = model_evidence(pd['t'], pd['x1'], pd['x2'], pd['y'], b, B, W)
    posterior = model_posterior(pd['t'], pd['x1'], pd['x2'], pd['y'], b, B, W)
    return ll, posterior

In [50]:
pfvc = pd.read_csv('data/benchmark_pfvc.csv')
data = [PatientData(tbl) for _, tbl in pfvc.groupby('ptid')]

In [89]:
ll, pst = run_inference(data[9], b, B, W)
np.round(pst, 3)

array([ 0.   ,  0.   ,  0.   ,  0.01 ,  0.022,  0.117,  0.816,  0.035])

### Online Tuning Algorithm

We're going to tune the posterior predictions of our model at a given time point by adjusting the relative strengths of the prior and the likelihood, the two terms whose log ratios determine the posterior. The goal is to fit the *full information posterior* by modifying the *partial information posterior*. For any observed marker sequence $y_i$, we can express the posterior probabilities by specifying, for each subtype, the log ratio of its joint probability (prior times likelihood) to that of some *pivot* subtype.

$$ r_{11} = \log \frac{p(z = 1)}{p(z = 1)} + \log \frac{p(y_i \mid z = 1)}{p(y_i \mid z = 1)} $$

$$ r_{21} = \log \frac{p(z = 2)}{p(z = 1)} + \log \frac{p(y_i \mid z = 2)}{p(y_i \mid z = 1)} $$

$$ r_{31} = \log \frac{p(z = 3)}{p(z = 1)} + \log \frac{p(y_i \mid z = 3)}{p(y_i \mid z = 1)} $$

$$ \ldots $$

Note that the first ratio is 0 since it is the log of a ratio that is always 1. When making a MAP estimate of an individual's subtype, the subtype with the largest ratio is selected. More generally, if we want to match the partial information posterior to the full information posterior as closely as possible, then we want to match these ratios as closely as possible. This suggests a simple adjustment algorithm: fit $K - 1$ separate regressions, one per non-pivot subtype, in which the target is the full information log ratio and the features are the partial information log ratios of each term in the joint distribution (prior and likelihood).
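The claim that the log ratios carry all of the posterior information can be checked directly: subtracting the pivot's log joint shifts every entry by the same constant, and the softmax is invariant to such shifts. The numbers below are hypothetical.

```python
import numpy as np

def softmax(a):
    """Numerically stable softmax of a 1-D array."""
    a = a - a.max()
    e = np.exp(a)
    return e / e.sum()

log_prior = np.log(np.array([0.5, 0.3, 0.2]))   # hypothetical p(z = k)
log_lik = np.array([-12.0, -10.5, -11.0])       # hypothetical log p(y | z = k)

log_joint = log_prior + log_lik
r = log_joint - log_joint[0]                    # log ratios against pivot subtype 1

posterior_from_joint = softmax(log_joint)
posterior_from_ratios = softmax(r)
print(r, posterior_from_ratios)
```

Both routes give the same posterior, and the MAP subtype is the index of the largest ratio.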

In [233]:
def log_ratio(L, pivot=0):
    R = L - L[:, pivot][:, np.newaxis]
    return R

In [234]:
full_log_priors = np.array([eval_prior(d) for d in data])
full_log_likels = np.array([eval_likel(d) for d in data])

yr01_log_priors = np.array([eval_prior(truncated_data(d, 1.0)[0]) for d in data])
yr01_log_likels = np.array([eval_likel(truncated_data(d, 1.0)[0]) for d in data])

In [235]:
L1 = log_ratio(yr01_log_priors)
L2 = log_ratio(yr01_log_likels)
Y  = log_ratio(full_log_priors) + log_ratio(full_log_likels)

#### Algorithm 1

In [236]:
def fit_adjustment(y, x1, x2):
    from scipy.linalg import lstsq
    n = y.size
    X = np.c_[ np.ones(n), x1, x2 ]
    w, _, _, _ = lstsq(X, y)
    return w

def make_adjustment(x1, x2, w):
    n = x1.size
    X = np.c_[ np.ones(n), x1, x2 ]
    return np.dot(X, w)

In [237]:
Yhat = np.zeros_like(Y)
N, K = Yhat.shape
W    = np.zeros((K, 3))  # note: rebinds W, which previously held the softmax prior weights
for k in range(1, K):
    w = fit_adjustment(Y[:, k], L1[:, k], L2[:, k])
    W[k] = w
    Yhat[:, k] = make_adjustment(L1[:, k], L2[:, k], w)

In [238]:
full_log_ratio = Y
yr01_log_ratio = L1 + L2

In [239]:
P  = np.array([softmax.softmax_func(y) for y in full_log_ratio])
Q1 = np.array([softmax.softmax_func(y) for y in yr01_log_ratio])
Q2 = np.array([softmax.softmax_func(y) for y in Yhat])

In [240]:
np.sum(P * np.log(P))

-430.77831436230178

In [241]:
np.sum(P * np.log(Q1))

-790.52486194944299

In [242]:
np.sum(P * np.log(Q2))

-1244.9793742351435

This simple approach doesn't work very well when evaluated with the multinomial regression objective (-1245.0 for the adjusted posterior versus -790.5 for the unadjusted one), but this makes sense because each set of weights is learned entirely independently. Another option for evaluation is to look at whether the MAP under the adjusted distribution agrees more with the MAP under full information than the MAP under partial information does.
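The two evaluation criteria used in this section can be written as small helpers. The function names are assumptions for illustration. By Gibbs' inequality, `np.sum(P * np.log(P))` is the best value the first score can attain, so the gap to that ceiling is the total KL divergence from the full information posterior. The sketch assumes all entries of `P` and `Q` are strictly positive.

```python
import numpy as np

def expected_log_score(P, Q):
    """sum_i sum_k P[i, k] * log Q[i, k]; values closer to zero are better."""
    return np.sum(P * np.log(Q))

def kl_divergence(P, Q):
    """Total KL(P || Q) over rows; zero only when Q matches P exactly."""
    return expected_log_score(P, P) - expected_log_score(P, Q)

def map_agreement(P, Q):
    """Fraction of rows where P and Q have the same most probable column."""
    return np.mean(np.argmax(P, axis=1) == np.argmax(Q, axis=1))

# Hypothetical posteriors for two individuals and two subtypes.
P_demo = np.array([[0.7, 0.3], [0.2, 0.8]])
Q_demo = np.array([[0.6, 0.4], [0.6, 0.4]])
print(kl_divergence(P_demo, Q_demo), map_agreement(P_demo, Q_demo))
```

The two criteria can disagree: a distribution can be close in KL while flipping a near-tie in the MAP, and vice versa.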

In [243]:
np.mean(np.argmax(P, axis=1) == np.argmax(Q1, axis=1))

0.63690476190476186

In [244]:
np.mean(np.argmax(P, axis=1) == np.argmax(Q2, axis=1))

0.5267857142857143

Again, the results are not good. This isn't hopeless, though, since the way we trained the adjustment is pretty simple. For completeness, however, let's take a look at the confusion matrix to see if any key mistakes are being corrected using this approach.

In [245]:
from sklearn.metrics import confusion_matrix
confusion_matrix(np.argmax(P, axis=1), np.argmax(Q1, axis=1))

array([[ 17,   4,   2,   1,   0,   0,   0,   0],
       [  3, 103,  21,   7,   4,   1,   0,   0],
       [  0,  24,  77,   6,   3,   0,   1,   0],
       [  0,  10,  21,  51,  21,   0,   6,   0],
       [  0,   0,   2,  26,  93,   7,   5,   3],
       [  0,   0,   2,   0,  18,  35,   6,   1],
       [  0,   1,   4,  10,  14,   7,  35,   1],
       [  0,   0,   0,   1,   1,   0,   0,  17]])

In [246]:
confusion_matrix(np.argmax(P, axis=1), np.argmax(Q2, axis=1))

array([[ 5, 16,  0,  0,  3,  0,  0,  0],
       [ 0, 84, 49,  0,  6,  0,  0,  0],
       [ 0, 15, 82,  1, 13,  0,  0,  0],
       [ 0,  6, 41, 26, 35,  1,  0,  0],
       [ 0,  0,  5, 11, 84, 32,  3,  1],
       [ 0,  0,  0,  1,  9, 49,  1,  2],
       [ 0,  1, 11, 10, 11, 30,  9,  0],
       [ 0,  0,  0,  0,  1,  2,  1, 15]])

#### Algorithm 2

In [313]:
def multinom_pred(W, P, X1, X2):
    Z = np.zeros_like(P)
    N, K = Z.shape
    for k in range(1, K):
        w = W[k]
        X = np.c_[ np.ones(N), X1[:, k], X2[:, k] ]
        Z[:, k] = np.dot(X, w)
    
    Q = np.array([softmax.softmax_func(z) for z in Z])
    return Q

def multinom_cost(W, P, X1, X2):
    Q = multinom_pred(W, P, X1, X2)
    return np.sum(P * np.log(Q))

In [314]:
def multinom_grad(W, P, X1, X2):
    Z = np.zeros_like(P)
    N, K = Z.shape
    for k in range(1, K):
        w = W[k]
        X = np.c_[ np.ones(N), X1[:, k], X2[:, k] ]
        Z[:, k] = np.dot(X, w)
    Q = np.array([softmax.softmax_func(z) for z in Z])
        
    D = np.zeros_like(W)
    for k in range(1, K):
        for i, z in enumerate(Z):
            g = softmax.softmax_grad(z)
            x = np.r_[ 1.0, X1[i, k], X2[i, k] ]
            for j in range(K):
                D[k] += P[i, j] / Q[i, j] * g[j, k] * x
            
    return D
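The triple loop in `multinom_grad` can be collapsed. When each row of `P` sums to 1, the derivative of `sum(P * log(Q))` with respect to the score `Z[i, k]` is `P[i, k] - Q[i, k]`, so the gradient for subtype `k` is that residual times the features. The sketch below is a self-contained vectorized version under that assumption; the `_vec` names and `softmax_rows` are hypothetical stand-ins and do not use the `mypy.models.softmax` helpers.

```python
import numpy as np

def softmax_rows(Z):
    """Row-wise numerically stable softmax."""
    Z = Z - Z.max(axis=1, keepdims=True)
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

def scores(W, X1, X2):
    """Z[i, k] = W[k, 0] + W[k, 1] * X1[i, k] + W[k, 2] * X2[i, k]; pivot column is 0."""
    Z = W[:, 0] + W[:, 1] * X1 + W[:, 2] * X2
    Z[:, 0] = 0.0
    return Z

def multinom_cost_vec(W, P, X1, X2):
    return np.sum(P * np.log(softmax_rows(scores(W, X1, X2))))

def multinom_grad_vec(W, P, X1, X2):
    """Gradient of multinom_cost_vec, assuming each row of P sums to 1."""
    R = P - softmax_rows(scores(W, X1, X2))   # d cost / d Z
    D = np.stack([R.sum(axis=0), (R * X1).sum(axis=0), (R * X2).sum(axis=0)], axis=1)
    D[0] = 0.0                                # the pivot row has no free parameters
    return D
```

This replaces the per-observation Jacobian computation with a few array operations, which matters when the gradient is called repeatedly inside BFGS.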

In [251]:
from sklearn.preprocessing import StandardScaler
X1 = StandardScaler().fit_transform(L1)
X2 = StandardScaler().fit_transform(L2)
W0 = np.zeros_like(W)

In [267]:
multinom_cost(W0, P, X1, X2)

-1397.3847160088496

In [253]:
multinom_grad(W0, P, X1, X2)

array([[  0.00e+00,   0.00e+00,   0.00e+00],
       [  5.17e+01,  -2.33e+01,  -1.12e+02],
       [  3.35e+01,  -2.44e+01,  -5.99e+01],
       [  2.63e+01,   9.03e+00,   4.83e-02],
       [  3.78e+01,   2.63e+01,   6.12e+01],
       [ -1.52e+01,   4.51e+01,   7.19e+01],
       [ -1.26e+01,   2.75e+00,   5.78e+01],
       [ -6.35e+01,   6.07e+00,   4.52e+01]])

In [254]:
def check_grad(f, x0, eps=1e-10):
    f0 = f(x0)
    n = x0.size
    g = np.zeros_like(x0)
    for i in range(n):
        dt = np.zeros_like(x0)
        dt[i] += eps
        f1 = f(x0 + dt)
        g[i] = (f1 - f0) / eps
        
    return g

In [278]:
wshape = W0.shape
f = lambda w: -multinom_cost(w.reshape(wshape), P, X1, X2)
g = lambda w: -multinom_grad(w.reshape(wshape), P, X1, X2).ravel()

In [260]:
check_grad(f, W0.ravel(), 1e-11).reshape(wshape)

array([[  0.00e+00,   0.00e+00,   0.00e+00],
       [ -5.17e+01,   2.33e+01,   1.13e+02],
       [ -3.35e+01,   2.44e+01,   5.99e+01],
       [ -2.63e+01,  -9.03e+00,  -4.55e-02],
       [ -3.78e+01,  -2.63e+01,  -6.12e+01],
       [  1.52e+01,  -4.50e+01,  -7.19e+01],
       [  1.26e+01,  -2.75e+00,  -5.78e+01],
       [  6.35e+01,  -6.07e+00,  -4.52e+01]])
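The numerical gradient above agrees with the analytic one only to about two significant figures in some entries (for example 1.13e+02 against 1.12e+02 in magnitude). With an objective of order 1e3, round-off in `f` is of order 1e-13, so dividing by a step of 1e-11 leaves an error of order 1e-2. A central difference with a larger step is usually a tighter check; `central_diff_grad` below is a hypothetical alternative, not a replacement the notebook already uses.

```python
import numpy as np

def central_diff_grad(f, x0, eps=1e-6):
    """Central-difference gradient of a scalar function f at the 1-D point x0."""
    x0 = np.asarray(x0, dtype=float)
    g = np.zeros_like(x0)
    for i in range(x0.size):
        step = np.zeros_like(x0)
        step[i] = eps
        # Truncation error is O(eps**2), versus O(eps) for a forward difference.
        g[i] = (f(x0 + step) - f(x0 - step)) / (2 * eps)
    return g

# Check against a function with a known gradient: d/dx sum(x**3) = 3 * x**2.
x_demo = np.array([1.0, 2.0, -0.5])
g_demo = central_diff_grad(lambda x: np.sum(x ** 3), x_demo)
print(g_demo)
```

It costs two function evaluations per coordinate instead of one, which is negligible for the 24 parameters here.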

In [272]:
s = opt.minimize(f, W.ravel(), jac=g, method='BFGS')

In [276]:
W2 = s.x.reshape(wshape)
W2

array([[  0.  ,   0.  ,   0.  ],
       [  9.69,   5.01,   4.65],
       [ 10.57,   4.87,   6.19],
       [ 11.37,   5.23,   8.61],
       [ 10.9 ,   5.32,  10.51],
       [  8.59,   5.67,  12.58],
       [  9.67,   5.12,  11.51],
       [ -1.6 ,   8.55,  16.67]])

In [273]:
multinom_cost(W2, P, X1, X2)

-777.60273430499228

We have a slight improvement in log-likelihood: -777.6 versus -790.5 for the unadjusted one-year posterior. Let's check the accuracy of MAP subtype estimates.

In [274]:
Q3 = multinom_pred(W2, P, X1, X2)

In [277]:
np.mean(np.argmax(P, axis=1) == np.argmax(Q1, axis=1))

0.63690476190476186

In [275]:
np.mean(np.argmax(P, axis=1) == np.argmax(Q3, axis=1))

0.6339285714285714

No improvement in MAP accuracy (0.634 versus 0.637), but we're also not directly optimizing for that. Let's try altering the objective function by fitting to the degenerate distribution over subtypes that places all of its mass on the full information MAP estimate.

In [281]:
P2 = np.array([softmax.onehot_encode(np.argmax(p), P.shape[1]) for p in P])
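`softmax.onehot_encode` comes from the author's `mypy` package and is not shown. Assuming it returns a length-K indicator vector, a NumPy-only equivalent of the degenerate target could look like the following sketch (`onehot_rows` is a hypothetical name).

```python
import numpy as np

def onehot_rows(P):
    """Return a matrix with a single 1 per row, at the column of that row's maximum."""
    K = P.shape[1]
    # Indexing the identity matrix by the argmax picks out the matching unit vector.
    return np.eye(K)[np.argmax(P, axis=1)]

# Hypothetical posteriors for two individuals and three subtypes.
P_demo = np.array([[0.2, 0.7, 0.1], [0.6, 0.3, 0.1]])
print(onehot_rows(P_demo))
```

With one-hot targets the objective `sum(P * log(Q))` reduces to the log-probability that `Q` assigns to each individual's full information MAP subtype.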

In [289]:
wshape = W0.shape
f = lambda w: -multinom_cost(w.reshape(wshape), P2, X1, X2)
g = lambda w: -multinom_grad(w.reshape(wshape), P2, X1, X2).ravel()

In [290]:
f(W0)

1397.3847160088499

In [291]:
g(W0).reshape(wshape)

array([[  -0.  ,   -0.  ,   -0.  ],
       [ -55.  ,   21.42,  118.91],
       [ -27.  ,   27.64,   62.42],
       [ -25.  ,  -13.03,    3.06],
       [ -52.  ,  -30.04,  -73.47],
       [  22.  ,  -46.03,  -72.72],
       [  12.  ,    1.52,  -58.49],
       [  65.  ,   -5.31,  -43.18]])

In [292]:
check_grad(f, W0.ravel()).reshape(wshape)

array([[   0.  ,    0.  ,    0.  ],
       [ -55.  ,   21.42,  118.91],
       [ -27.  ,   27.64,   62.41],
       [ -25.  ,  -13.03,    3.06],
       [ -52.  ,  -30.04,  -73.47],
       [  22.  ,  -46.04,  -72.72],
       [  12.  ,    1.51,  -58.5 ],
       [  65.  ,   -5.31,  -43.18]])

In [293]:
s2 = opt.minimize(f, x0=W0.ravel(), jac=g, method='BFGS')

In [294]:
W3 = s2.x.reshape(wshape)

In [296]:
Q4 = multinom_pred(W3, P2, X1, X2)

In [297]:
np.mean(np.argmax(P, axis=1) == np.argmax(Q1, axis=1))

0.63690476190476186

In [299]:
np.mean(np.argmax(P, axis=1) == np.argmax(Q4, axis=1))

0.6383928571428571

In [301]:
confusion_matrix(np.argmax(P, axis=1), np.argmax(Q1, axis=1))

array([[ 17,   4,   2,   1,   0,   0,   0,   0],
       [  3, 103,  21,   7,   4,   1,   0,   0],
       [  0,  24,  77,   6,   3,   0,   1,   0],
       [  0,  10,  21,  51,  21,   0,   6,   0],
       [  0,   0,   2,  26,  93,   7,   5,   3],
       [  0,   0,   2,   0,  18,  35,   6,   1],
       [  0,   1,   4,  10,  14,   7,  35,   1],
       [  0,   0,   0,   1,   1,   0,   0,  17]])

In [300]:
confusion_matrix(np.argmax(P, axis=1), np.argmax(Q4, axis=1))

array([[17,  4,  2,  1,  0,  0,  0,  0],
       [ 4, 96, 27,  7,  4,  1,  0,  0],
       [ 0, 20, 77, 11,  3,  0,  0,  0],
       [ 0, 10, 16, 59, 19,  0,  5,  0],
       [ 0,  0,  1, 24, 96,  6,  8,  1],
       [ 0,  0,  2,  0, 20, 33,  6,  1],
       [ 0,  1,  2, 11, 14,  8, 36,  0],
       [ 0,  0,  0,  1,  1,  2,  0, 15]])

#### Future Directions

1. Use a Bayesian multinomial logistic regression classifier to sidestep the inexpressive linear model.

2. Change the objective function to more directly reflect the cost function used to evaluate the model. Not all subtype misclassifications are equal: some mistakes are more costly, and we may want the learning procedure to reflect that.

3. Incorporate additional likelihood ratios based on other longitudinally measured outcomes.
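One hedged way to make the second item concrete is an expected-cost objective: weight each pairing of true and predicted subtype by a cost and sum under both distributions. The cost matrix below (absolute difference of subtype indices) is purely hypothetical; a real choice would come from how far apart the subtypes' trajectories are.

```python
import numpy as np

def expected_cost(P, Q, C):
    """sum_i sum_j sum_k P[i, j] * C[j, k] * Q[i, k].

    P holds target subtype probabilities, Q the adjusted model's probabilities,
    and C[j, k] is the cost of predicting subtype k when the target is j.
    """
    return np.sum((P @ C) * Q)

K = 3
idx = np.arange(K)
C = np.abs(idx[:, None] - idx[None, :]).astype(float)   # hypothetical cost matrix

P_demo = np.array([[1.0, 0.0, 0.0]])
Q_near = np.array([[0.0, 1.0, 0.0]])   # off by one subtype
Q_far = np.array([[0.0, 0.0, 1.0]])    # off by two subtypes
print(expected_cost(P_demo, Q_near, C), expected_cost(P_demo, Q_far, C))
```

Unlike MAP accuracy, this quantity is smooth in `Q`, so it could in principle be minimized with the same BFGS setup used above.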

### Adding Additional Markers

In [302]:
L3 = np.loadtxt('param/gi_ratios.dat')

In [307]:
X3 = StandardScaler().fit_transform(L3)

In [318]:
def multinom_pred2(W, P, X1, X2, X3):
    Z = np.zeros_like(P)
    N, K = Z.shape
    for k in range(1, K):
        w = W[k]
        X = np.c_[ np.ones(N), X1[:, k], X2[:, k], X3[:, k] ]
        Z[:, k] = np.dot(X, w)
    
    Q = np.array([softmax.softmax_func(z) for z in Z])
    return Q

def multinom_cost2(W, P, X1, X2, X3):
    Q = multinom_pred2(W, P, X1, X2, X3)
    return np.sum(P * np.log(Q))

In [319]:
def multinom_grad2(W, P, X1, X2, X3):
    Z = np.zeros_like(P)
    N, K = Z.shape
    for k in range(1, K):
        w = W[k]
        X = np.c_[ np.ones(N), X1[:, k], X2[:, k], X3[:, k] ]
        Z[:, k] = np.dot(X, w)
    Q = np.array([softmax.softmax_func(z) for z in Z])
        
    D = np.zeros_like(W)
    for k in range(1, K):
        for i, z in enumerate(Z):
            g = softmax.softmax_grad(z)
            x = np.r_[ 1.0, X1[i, k], X2[i, k], X3[i, k] ]
            for j in range(K):
                D[k] += P[i, j] / Q[i, j] * g[j, k] * x
            
    return D

In [333]:
W0 = np.zeros((P.shape[1], 4))
f2 = lambda w: -multinom_cost2(w.reshape(W0.shape), P, X1, X2, X3)
g2 = lambda w: -multinom_grad2(w.reshape(W0.shape), P, X1, X2, X3).ravel()

In [334]:
s2 = opt.minimize(f2, W0.ravel(), jac=g2, method='BFGS')

In [336]:
Q5 = multinom_pred2(s2.x.reshape(W0.shape), P, X1, X2, X3)

In [337]:
np.mean(np.argmax(P, axis=1) == np.argmax(Q5, axis=1))

0.6607142857142857