# Gillan Model


$Q_{MF0_{t}}$ = TD(0) Value of action 1 when reaching 2nd stage at time t

$Q_{MF1_{t}}$ = TD(1) MF Value of action 1 when seeing reward after 2nd stage choice

$Q_{MF2_{t}}$ = TD Value of action 2

$Q_{MB_{t}}$ = Model-based value of action 1

$R$ = reward

$\alpha_{1}$ = learning rate for $Q_{MF0}$

$\alpha_{2}$ = learning rate for $Q_{MF1}$ and $Q_{MB}$

$T = \begin{bmatrix}
P(s_1,a_1,s_2) & P(s_1,a_2,s_2) \\
P(s_1,a_1,s_3) & P(s_1,a_2,s_3) 
\end{bmatrix}$

On each trial, a transition counter is updated. For example, if action 1 in state 1 has led to state 2 once and the same transition occurs on the next trial, the counting matrix is updated as follows:

$T_{counting}=\begin{bmatrix}
1+1 & 0\\
0 & 0
\end{bmatrix}$

$T$ can be one of two matrices,
$T_{1} = \begin{bmatrix}
0.7 & 0.3 \\
0.3 & 0.7
\end{bmatrix}$ or $T_{2}=\begin{bmatrix}
0.3 & 0.7 \\
0.7 & 0.3
\end{bmatrix}$ at any given trial. 

This is determined by the $T_{counting}$ matrix. When $T_{counting}(1,1) + T_{counting}(2,2) > T_{counting}(1,2) + T_{counting}(2,1)$, then $T_{1}$ is used.
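The selection rule above can be sketched in Python. This is a minimal illustration, not the fitted implementation: the function name `select_transition_matrix` is hypothetical, and because the text does not say what happens when the two sums are equal, falling back to $T_{2}$ on a tie is an assumption.

```python
def select_transition_matrix(counts):
    """Pick T1 or T2 from a 2x2 transition count matrix.

    counts[i][j] is the number of times action j+1 led to state i+2,
    matching the layout of T (rows: states 2 and 3, columns: actions 1 and 2).
    """
    T1 = [[0.7, 0.3], [0.3, 0.7]]
    T2 = [[0.3, 0.7], [0.7, 0.3]]
    diagonal = counts[0][0] + counts[1][1]
    off_diagonal = counts[0][1] + counts[1][0]
    # Tie handling is not specified in the text; using T2 on a tie is an assumption.
    return T1 if diagonal > off_diagonal else T2


# Action 1 has led to state 2 twice, as in the counting example above.
chosen = select_transition_matrix([[2, 0], [0, 0]])
```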

$M$ = one-hot vector indicating which first-stage action was previously taken.

$Q_{MF0_{t+1}}=Q_{MF0_{t}} + \alpha_{1}(Q_{MF2_{t}}-Q_{MF0_{t}})$

$Q_{MF1_{t+1}}=Q_{MF1_{t}} + \alpha_{2}(R-Q_{MF1_{t}})$

$Q_{MF2_{t+1}}=Q_{MF2_{t}} + \alpha(R-Q_{MF2_{t}})$

$Q_{MB_{t+1}} = argmax(T \cdot Q_{MF2_{t}})$
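The three model-free updates listed above can be sketched for a single trial as follows. This is an illustrative sketch only: the function name is hypothetical, the values are treated as scalars for the actions actually chosen, and `alpha` stands for the unsubscripted $\alpha$ in the $Q_{MF2}$ update, whose definition is not given in the text.

```python
def update_model_free(q_mf0, q_mf1, q_mf2, reward, alpha_1, alpha_2, alpha):
    """Return the updated (Q_MF0, Q_MF1, Q_MF2) for the chosen actions on one trial."""
    # Every update reads the time-t values, so compute all of them before returning.
    new_q_mf0 = q_mf0 + alpha_1 * (q_mf2 - q_mf0)   # TD(0): target is the second-stage value
    new_q_mf1 = q_mf1 + alpha_2 * (reward - q_mf1)  # TD(1): target is the reward
    new_q_mf2 = q_mf2 + alpha * (reward - q_mf2)    # second-stage value: target is the reward
    return new_q_mf0, new_q_mf1, new_q_mf2


updated = update_model_free(0.0, 0.0, 0.5, reward=1.0,
                            alpha_1=0.5, alpha_2=0.5, alpha=0.5)
```

With these inputs each value moves halfway toward its target, which makes the role of each learning rate easy to check by hand.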

Each Q value has its own beta in the following softmax:

$P(a_{1},s_{1}) \propto e^{(\beta_{MF0}Q_{MF0}+\beta_{MF1}Q_{MF1}+\beta_{MB}Q_{MB}+\beta_{st}M)}$

Note this includes a perseveration parameter that has a beta on the last action taken.
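A minimal sketch of this softmax over the two first-stage actions is shown below. It assumes each Q value is held as a length-2 list (one entry per first-stage action) and that $M$ is all zeros when there is no previous action; the function name and the dictionary keys in `betas` are hypothetical.

```python
import math


def first_stage_choice_probs(q_mf0, q_mf1, q_mb, last_action, betas):
    """Softmax over the two first-stage actions with a separate beta per term.

    last_action is 0, 1, or None (no previous trial; an assumption).
    """
    logits = []
    for a in range(2):
        m = 1.0 if a == last_action else 0.0  # one-hot perseveration indicator M
        logits.append(betas["mf0"] * q_mf0[a]
                      + betas["mf1"] * q_mf1[a]
                      + betas["mb"] * q_mb[a]
                      + betas["st"] * m)
    # Subtract the maximum logit before exponentiating for numerical stability.
    max_logit = max(logits)
    exps = [math.exp(x - max_logit) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


betas = {"mf0": 1.0, "mf1": 1.0, "mb": 1.0, "st": 1.0}
probs = first_stage_choice_probs([0.0, 0.0], [0.0, 0.0], [0.0, 0.0],
                                 last_action=0, betas=betas)
```

With all Q values at zero, the only term that separates the two actions is the perseveration term, so the previously chosen action is favored.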


# Gillan + TL Model

$Q_{MF0_{t}}$ = TD(0) Value of action 1 when reaching 2nd stage at time t

$Q_{MF1_{t}}$ = TD(1) MF Value of action 1 when seeing reward after 2nd stage choice

$Q_{MF2_{t}}$ = TD Value of action 2

$Q_{MB_{t}}$ = Model-based value of action 1

$R$ = reward

$\gamma$ = learning rate for state transitions

$\alpha_{1}$ = learning rate for $Q_{MF0}$

$\alpha_{2}$ = learning rate for $Q_{MF1}$ and $Q_{MB}$

$T = \begin{bmatrix}
P(s_1,a_1,s_2) & P(s_1,a_2,s_2) \\
P(s_1,a_1,s_3) & P(s_1,a_2,s_3) 
\end{bmatrix}$

On each trial, the transition estimate is updated with a learning rate, and the probabilities are renormalized at that time. For instance, if action 1 is taken and leads to state 2:

$P(s_1,a_1,s_2)= P(s_1,a_1,s_2) + \gamma(1-P(s_1,a_1,s_2))$

and

$P(s_1,a_1,s_3)= 1-P(s_1,a_1,s_2)$
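The transition-learning step above can be sketched as follows. This is an illustration under assumptions: the function name is hypothetical, indices are 0-based, and the starting value of 0.5 used in the example is not taken from the text.

```python
def update_transition_estimate(T, action, next_state, gamma):
    """Update the column of T for the chosen action after one observed transition.

    T[i][j] estimates P(state i+2 | action j+1); action and next_state are 0-based.
    """
    new_T = [row[:] for row in T]  # copy so the caller's matrix is not modified
    other_state = 1 - next_state
    # Move the observed transition's probability toward 1 at rate gamma.
    new_T[next_state][action] = T[next_state][action] + gamma * (1.0 - T[next_state][action])
    # Renormalize: the two probabilities for this action must sum to 1.
    new_T[other_state][action] = 1.0 - new_T[next_state][action]
    return new_T


# Action 1 (index 0) leads to state 2 (index 0), starting from a flat estimate (assumed).
T_new = update_transition_estimate([[0.5, 0.5], [0.5, 0.5]],
                                   action=0, next_state=0, gamma=0.2)
```

Only the column for the chosen action changes; the estimate for the other action is left as it was.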

$M$ = one-hot vector indicating which first-stage action was previously taken.

$Q_{MF0_{t+1}}=Q_{MF0_{t}} + \alpha_{1}(Q_{MF2_{t}}-Q_{MF0_{t}})$

$Q_{MF1_{t+1}}=Q_{MF1_{t}} + \alpha_{2}(R-Q_{MF1_{t}})$

$Q_{MF2_{t+1}}=Q_{MF2_{t}} + \alpha(R-Q_{MF2_{t}})$

$Q_{MB_{t+1}} = argmax(T \cdot Q_{MF2_{t}})$

Each Q value has its own beta in the following softmax:

$P(a_{1},s_{1}) \propto e^{(\beta_{MF0}Q_{MF0}+\beta_{MF1}Q_{MF1}+\beta_{MB}Q_{MB}+\beta_{st}M)}$

Note this includes a perseveration parameter that has a beta on the last action taken.


# Fixed TL Model

$Q_{MF0_{t}}$ = TD(0) Value of action 1 when reaching 2nd stage at time t

$Q_{MF1_{t}}$ = TD(1) MF Value of action 1 when seeing reward after 2nd stage choice

$Q_{MF2_{t}}$ = TD Value of action 2

$Q_{MB_{t}}$ = Model-based value of action 1

$R$ = reward

$\gamma$ = FIXED (fitted hierarchically to full sample)

$\alpha_{1}$ = learning rate for $Q_{MF0}$

$\alpha_{2}$ = learning rate for $Q_{MF1}$ and $Q_{MB}$

$T = \begin{bmatrix}
P(s_1,a_1,s_2) & P(s_1,a_2,s_2) \\
P(s_1,a_1,s_3) & P(s_1,a_2,s_3) 
\end{bmatrix}$

On each trial, the transition estimate is updated with a learning rate, and the probabilities are renormalized at that time. For instance, if action 1 is taken and leads to state 2:

$P(s_1,a_1,s_2)= P(s_1,a_1,s_2) + \gamma(1-P(s_1,a_1,s_2))$

and

$P(s_1,a_1,s_3)= 1-P(s_1,a_1,s_2)$

$M$ = one-hot vector indicating which first-stage action was previously taken.

$Q_{MF0_{t+1}}=Q_{MF0_{t}} + \alpha_{1}(Q_{MF2_{t}}-Q_{MF0_{t}})$

$Q_{MF1_{t+1}}=Q_{MF1_{t}} + \alpha_{2}(R-Q_{MF1_{t}})$

$Q_{MF2_{t+1}}=Q_{MF2_{t}} + \alpha(R-Q_{MF2_{t}})$

$Q_{MB_{t+1}} = argmax(T \cdot Q_{MF2_{t}})$

Each Q value has its own beta in the following softmax:

$P(a_{1},s_{1}) \propto e^{(\beta_{MF0}Q_{MF0}+\beta_{MF1}Q_{MF1}+\beta_{MB}Q_{MB}+\beta_{st}M)}$

Note this includes a perseveration parameter that has a beta on the last action taken.


# Bayes model


$Q_{MF0_{t}}$ = TD(0) Value of action 1 when reaching 2nd stage at time t

$Q_{MF1_{t}}$ = TD(1) MF Value of action 1 when seeing reward after 2nd stage choice

$Q_{MF2_{t}}$ = TD Value of action 2

$Q_{MB_{t}}$ = Model-based value of action 1

$R$ = reward

$\alpha_{1}$ = learning rate for $Q_{MF0}$

$\alpha_{2}$ = learning rate for $Q_{MF1}$ and $Q_{MB}$

$T = \begin{bmatrix}
P(s_1,a_1,s_2) & P(s_1,a_2,s_2) \\
P(s_1,a_1,s_3) & P(s_1,a_2,s_3) 
\end{bmatrix}$

On each trial, a transition counter is updated. For example, if action 1 in state 1 has led to state 2 once and the same transition occurs on the next trial, the counting matrix is updated as follows:

$T_{counting}=\begin{bmatrix}
1+1 & 0\\
0 & 0
\end{bmatrix}$

$T$ can be one of two matrices,
$T_{1} = \begin{bmatrix}
0.7 & 0.3 \\
0.3 & 0.7
\end{bmatrix}$ or $T_{2}=\begin{bmatrix}
0.3 & 0.7 \\
0.7 & 0.3
\end{bmatrix}$ at any given trial. 

The first column of the matrix is action 1, which can transition to state 2 or 3 (rows 1 and 2 respectively).

Categorical prior ($p$ is a free parameter) on transition matrices = $[p \,\,\,\,\,(1-p)]$ 

Henceforth $p$ is $p_1$ and $1-p$ is $p_2$. Each refers to the probability that either of the two matrices delineated above is the correct transition matrix.

Probabilities are updated according to Bayes rule:

$p_1 = Bernoulli(all,0.7,hits)(p_1)$

$p_2=Bernoulli(all,0.3,hits)(p_2)$
        
$p_{total}=p_1+p_2$

$p_{1}=\frac{p_{1}}{p_{total}}$

$p_{2}=\frac{p_{2}}{p_{total}}$

Here "all" refers to the running sum of experienced transitions.

"Hits" refers to evidence in favor of matrix $T_{1}$, which are defined as the common transitions. For example, hits comprise experiencing state 2 from action 1 (entry (1,1) in the matrix) or state 3 from action 2 (entry (2,2) in the matrix). 

$M$ = one-hot vector indicating which first-stage action was previously taken.

$Q_{MF0_{t+1}}=Q_{MF0_{t}} + \alpha_{1}(Q_{MF2_{t}}-Q_{MF0_{t}})$

$Q_{MF1_{t+1}}=Q_{MF1_{t}} + \alpha_{2}(R-Q_{MF1_{t}})$

$Q_{MF2_{t+1}}=Q_{MF2_{t}} + \alpha(R-Q_{MF2_{t}})$

$Q_{MB_{t+1}} = argmax(T_1 \cdot Q_{MF2_{t}})(p_1)+argmax(T_2 \cdot Q_{MF2_{t}})(p_2)$

Note that the MB action value is a combination of the MB values under each possible transition matrix, weighted by their current posterior probabilities.
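The Bayesian update and the posterior-weighted MB value can be sketched as below. Several points are assumptions rather than statements from the text: the "Bernoulli(all, p, hits)" term is read as the likelihood of observing `hits` common transitions out of `all` transitions with success probability `p`, that likelihood is applied to the initial prior using the running counts, and the function names are hypothetical. The binomial coefficient is omitted because it is the same for both matrices and cancels in the normalization.

```python
def update_matrix_posterior(prior_p1, n_transitions, n_hits, p_common=0.7):
    """Posterior probabilities (p1, p2) that T1 or T2 is the correct matrix."""
    prior_p2 = 1.0 - prior_p1
    n_misses = n_transitions - n_hits
    # Likelihood of the observed counts under each candidate matrix.
    like_t1 = p_common ** n_hits * (1.0 - p_common) ** n_misses
    like_t2 = (1.0 - p_common) ** n_hits * p_common ** n_misses
    unnorm_p1 = like_t1 * prior_p1
    unnorm_p2 = like_t2 * prior_p2
    p_total = unnorm_p1 + unnorm_p2
    return unnorm_p1 / p_total, unnorm_p2 / p_total


def mix_model_based_values(q_mb_under_t1, q_mb_under_t2, p1):
    """Weight the MB values computed under T1 and T2 by their posterior probabilities."""
    p2 = 1.0 - p1
    return [p1 * v1 + p2 * v2 for v1, v2 in zip(q_mb_under_t1, q_mb_under_t2)]


# One transition observed so far, and it was a hit (common under T1).
p1, p2 = update_matrix_posterior(prior_p1=0.5, n_transitions=1, n_hits=1)
# Made-up MB values per first-stage action under each matrix, for illustration only.
q_mb = mix_model_based_values([1.0, 0.0], [0.0, 1.0], p1=0.75)
```

An equal number of hits and misses leaves the prior unchanged, since both matrices then explain the data equally well.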

Each Q value has its own beta in the following softmax:

$P(a_{1},s_{1}) \propto e^{(\beta_{MF0}Q_{MF0}+\beta_{MF1}Q_{MF1}+\beta_{MB}Q_{MB}+\beta_{st}M)}$

Note this includes a perseveration parameter that has a beta on the last action taken.


# MB only model

$Q_{MF2_{t}}$ = TD Value of action 2

$Q_{MB_{t}}$ = Model-based value of action 1

$R$ = reward

$\gamma$ = learning rate for state transitions

$\alpha_{1}$ = learning rate for $Q_{MF1}$ and $Q_{MB}$

$T = \begin{bmatrix}
P(s_1,a_1,s_2) & P(s_1,a_2,s_2) \\
P(s_1,a_1,s_3) & P(s_1,a_2,s_3) 
\end{bmatrix}$

On each trial, the transition estimate is updated with a learning rate, and the probabilities are renormalized at that time. For instance, if action 1 is taken and leads to state 2:

$P(s_1,a_1,s_2)_{t+1}= P(s_1,a_1,s_2)_{t} + \gamma(1-P(s_1,a_1,s_2)_{t})$

and

$P(s_1,a_1,s_3)_{t+1}= 1-P(s_1,a_1,s_2)_{t+1}$

$M$ = one-hot vector indicating which first-stage action was previously taken.

$Q_{MF2_{t+1}}=Q_{MF2_{t}} + \alpha(R-Q_{MF2_{t}})$

$Q_{MB_{t+1}} = argmax(T \cdot Q_{MF2_{t}})$

Each Q value has its own beta in the following softmax:

$P(a_{1},s_{1}) \propto e^{(\beta_{MB}Q_{MB}+\beta_{st}M)}$

Note this includes a perseveration parameter that has a beta on the last action taken.

# Daw Model

$Q_{MF1_{t}}$ = MF Value of action 1 

$Q_{MF2_{t}}$ = TD Value of action 2

$Q_{MB_{t}}$ = Model-based value of action 1

$R$ = reward

$\alpha_{1}$ = learning rate for the update of $Q_{MF1}$ toward $Q_{MF2}$

$\alpha_{2}$ = learning rate for $Q_{MF1}$ and $Q_{MB}$

$\lambda$ = eligibility trace

$T = \begin{bmatrix}
P(s_1,a_1,s_2) & P(s_1,a_2,s_2) \\
P(s_1,a_1,s_3) & P(s_1,a_2,s_3) 
\end{bmatrix}$

On each trial, a transition counter is updated. For example, if action 1 in state 1 has led to state 2 once and the same transition occurs on the next trial, the counting matrix is updated as follows:

$T_{counting}=\begin{bmatrix}
1+1 & 0\\
0 & 0
\end{bmatrix}$

$T$ can be one of two matrices,
$T_{1} = \begin{bmatrix}
0.7 & 0.3 \\
0.3 & 0.7
\end{bmatrix}$ or $T_{2}=\begin{bmatrix}
0.3 & 0.7 \\
0.7 & 0.3
\end{bmatrix}$ at any given trial. 

This is determined by the $T_{counting}$ matrix. When $T_{counting}(1,1) + T_{counting}(2,2) > T_{counting}(1,2) + T_{counting}(2,1)$, then $T_{1}$ is used.

$M$ = one-hot vector indicating which first-stage action was previously taken.

$Q_{MF1_{t+1}}=Q_{MF1_{t}} + \alpha_{1}(Q_{MF2_{t}}-Q_{MF1_{t}})$

$Q_{MF1_{t+1}}=Q_{MF1_{t}} + \lambda\alpha_{2}(R-Q_{MF1_{t}})$

$Q_{MF2_{t+1}}=Q_{MF2_{t}} + \alpha(R-Q_{MF2_{t}})$

$Q_{MB_{t+1}} = argmax(T \cdot Q_{MF2_{t}})$

Q values for first-stage actions are integrated in the following way:

$Q_{integrated} = w(Q_{MB})+(1-w)(Q_{MF1})$

$P(a_{1},s_{1}) \propto e^{\beta[Q_{integrated}+\rho(M)]}$

where $\rho$ is a perseveration parameter. 
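The integration and softmax above can be sketched as follows. This is a minimal illustration: the function name is hypothetical, each Q value is assumed to be a length-2 list over first-stage actions, and $M$ is assumed to be all zeros when there is no previous action.

```python
import math


def daw_first_stage_choice_probs(q_mb, q_mf1, last_action, w, beta, rho):
    """Softmax over first-stage actions using the w-weighted mix of MB and MF values."""
    logits = []
    for a in range(2):
        q_integrated = w * q_mb[a] + (1.0 - w) * q_mf1[a]
        m = 1.0 if a == last_action else 0.0  # one-hot perseveration indicator M
        logits.append(beta * (q_integrated + rho * m))
    # Subtract the maximum logit before exponentiating for numerical stability.
    max_logit = max(logits)
    exps = [math.exp(x - max_logit) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Purely model-based agent (w = 1) with no perseveration.
probs = daw_first_stage_choice_probs([1.0, 0.0], [0.0, 1.0],
                                     last_action=None, w=1.0, beta=1.0, rho=0.0)
```

Unlike the Gillan softmax, a single $\beta$ scales the whole expression, and $w$ alone controls the balance between model-based and model-free values.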

# Daw + TL model

$Q_{MF1_{t}}$ = MF Value of action 1 

$Q_{MF2_{t}}$ = TD Value of action 2

$Q_{MB_{t}}$ = Model-based value of action 1

$R$ = reward

$\alpha_{1}$ = learning rate for the update of $Q_{MF1}$ toward $Q_{MF2}$

$\alpha_{2}$ = learning rate for $Q_{MF1}$ and $Q_{MB}$

$\lambda$ = eligibility trace

$\gamma$ = learning rate for state transitions

$T = \begin{bmatrix}
P(s_1,a_1,s_2) & P(s_1,a_2,s_2) \\
P(s_1,a_1,s_3) & P(s_1,a_2,s_3) 
\end{bmatrix}$

On each trial, the transition estimate is updated with a learning rate, and the probabilities are renormalized at that time. For instance, if action 1 is taken and leads to state 2:

$P(s_1,a_1,s_2)= P(s_1,a_1,s_2) + \gamma(1-P(s_1,a_1,s_2))$

and

$P(s_1,a_1,s_3)= 1-P(s_1,a_1,s_2)$

$M$ = one-hot vector indicating which first-stage action was previously taken.

$Q_{MF1_{t+1}}=Q_{MF1_{t}} + \alpha_{1}(Q_{MF2_{t}}-Q_{MF1_{t}})$

$Q_{MF1_{t+1}}=Q_{MF1_{t}} + \lambda\alpha_{2}(R-Q_{MF1_{t}})$

$Q_{MF2_{t+1}}=Q_{MF2_{t}} + \alpha(R-Q_{MF2_{t}})$

$Q_{MB_{t+1}} = argmax(T \cdot Q_{MF2_{t}})$

Q values for first-stage actions are integrated in the following way:

$Q_{integrated} = w(Q_{MB})+(1-w)(Q_{MF1})$

$P(a_{1},s_{1}) \propto e^{\beta[Q_{integrated}+\rho(M)]}$

where $\rho$ is a perseveration parameter.