In [1]:
from IPython.display import HTML, display

#   How do we decide and learn in a volatile environment? 

## **Theoretical Framework**

## Limitations of current theoretical approaches 
#### Performing successfully in a volatile environment requires making fast, accurate decisions and updating those decisions given environmental feedback. However, accumulator models of choice-making, which model mechanisms internal to the decision, and reinforcement learning models, which describe how the outcomes of those choices influence decision updates, are often studied in isolation despite their complementary goals. Drift diffusion modeling can describe the speed and accuracy of choices, but it assumes that the decision parameters governing those choices remain static across trials. Models of reinforcement learning explain how environmental feedback sculpts choice preferences, but they fail to describe decision-making mechanisms. 

## Neural evidence for integration
#### Much research has shown that both action selection and reinforcement learning rely on the cortico-basal ganglia-thalamic (CBGT) circuit ([Bogacz & Larsen, 2011](https://www.mitpressjournals.org/doi/abs/10.1162/NECO_a_00103)), which contains three pathways with distinct effects on action selection: the direct (action facilitation), indirect (action suppression), and hyperdirect (immediate action cancellation) pathways. 
#### In this circuit, each component of the basal ganglia can be either facilitated or inhibited by direct and indirect action channels. The direct pathway subdues tonic inhibition of the thalamus by the globus pallidus internal segment (GPi) to encourage action execution, whereas the indirect pathway activates the globus pallidus external segment (GPe) and subthalamic nucleus (STN) to increase GPi output and prevent action output. These indirect and direct pathways are also modulated by dopamine released during reinforcement learning via projections from the substantia nigra pars compacta (SNc). 

<tr><td><img src='cbgt_dunovanVerstynen.png' style='width: 700px;'>
*figure from [Dunovan and Verstynen 2018](https://www.biorxiv.org/content/early/2017/10/28/153676)*
    
#### Hyperdirect activation of the STN has been correlated with value-conflict ([Frank et al., 2015](http://www.jneurosci.org/content/35/2/485)), which is thought to affect evidence accumulation within the decision making framework. Additionally, there is evidence that the volatility of environmental feedback affects tonic dopamine levels in the striatum to regulate the explore-exploit tradeoff for potential actions ([Humphries, Khamassi, & Gurney, 2012](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3272648/)). Altogether, this evidence suggests that 1) reinforcement learning and decision making processes act as integrated parts of a whole, and 2) value-conflict and volatility are key learning signals for the integration of decision making and reinforcement learning. 

#### I plan to explore how value conflict between competing actions (the degree to which the value associated with each action is similar) and the volatility of feedback (the change point frequency of mean value-action associations) influence adaptive decision making using a  drift-diffusion reinforcement learning model. 

## **Task**

#### In our task, we operationally define conflict as the mean reward difference between actions and volatility as the hazard rate of change in that mean reward difference. We plan to manipulate factors of volatility and conflict using a 2x2 within-subjects design to form four conditions, with each participant performing 1000 trials per condition and two conditions per day until all conditions are complete.     

#### Reaction time and choice accuracy data will be collected using a two-alternative forced-choice task written in PsychoPy. On each trial, participants will be asked to choose the highest-reward target within 700 ms. Then they will be shown the reward earned on each trial based on their decision. To prevent prepotent selections, the position of the rewarding target will be randomized across trials, with the target identity as the rewarding feature (rather than target position).  The instruction screen and the structure of a sample trial are below: 
<tr><td><img src='instructions.png' style='width: 700px;'></td><td><img src='task.png' style='width: 700px;'></td></tr>

#### Each participant will receive a separate reward schedule. Below is a sample of one high volatility reward schedule, with the reward difference between targets on the y-axis. The  mean reward difference between targets shifts probabilistically, approximately every 30 trials. <br>

<tr><td><img src='hv_lc_sample.png' style='width: 500px;'>
    
#### Below is a sample of a high conflict reward schedule. The reward difference between targets is drawn from a normal distribution centered on a mean difference between 0.01 and 0.2. Reward distributions for all conditions have standard deviations ranging from 0.05 to 0.07. <br>


<tr><td><img src='lv_hc_sample.png' style='width: 500px;'>
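
#### A minimal sketch of how one such schedule could be generated, assuming that change points occur independently on each trial with a fixed hazard rate and that a change point flips the sign of the mean reward difference (the better target switches identity). The function name `make_reward_schedule`, its arguments, the sign-flip rule, and the low-volatility hazard of 1/100 are illustrative assumptions, not the schedule code used for the task.

```python
import random


def make_reward_schedule(n_trials=1000, hazard=1 / 30, mean_diff=0.2,
                         sd=0.06, seed=0):
    """Return (reward_diffs, changepoints) for one hypothetical schedule.

    hazard    : per-trial probability that the mean reward difference shifts
    mean_diff : magnitude of the mean reward difference (smaller = more conflict)
    sd        : standard deviation of the reward-difference distribution
    """
    rng = random.Random(seed)
    mu = mean_diff
    reward_diffs, changepoints = [], []
    for t in range(n_trials):
        # No change point is allowed on the very first trial.
        is_cp = t > 0 and rng.random() < hazard
        if is_cp:
            mu = -mu  # assumption: the better target switches identity
        changepoints.append(int(is_cp))
        reward_diffs.append(rng.gauss(mu, sd))
    return reward_diffs, changepoints


# High volatility, low conflict: shifts about every 30 trials, large difference.
hv_lc, hv_lc_cp = make_reward_schedule(hazard=1 / 30, mean_diff=0.2)
# Low volatility, high conflict: rarer shifts (assumed 1/100), small difference.
lv_hc, lv_hc_cp = make_reward_schedule(hazard=1 / 100, mean_diff=0.01)
```

#### Passing a different `seed` per participant would give each participant a separate schedule while keeping every schedule reproducible.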



## **Hypotheses** 
<br>
#### *Conflict*

#### The rate of evidence accumulation [$v$] affects the speed of a decision, and so may vary with conflict, such that lower conflict conditions show higher drift rates and higher conflict conditions show lower drift rates. 
<br>

#### The degree to which evidence accumulation is biased toward the higher value target may also differ according to conflict. The starting point for evidence accumulation [$z$] may vary with conflict, with larger differences in previous action values biasing the starting point for evidence accumulation toward the target of higher value, and smaller differences in value decreasing starting point bias (so that $z$ is closer to $a/2$). 

<br>
#### *Volatility*  
#### The decision threshold [$a$], which represents a global degree of caution regarding either action and regulates choice accuracy, may increase and decrease with change point probability [$\Omega$]. 
#### When the decision threshold increases with change point probability, the learner has more time to make a decision, so she may be shifted toward a more explorative learning policy. 

#### Increased volatility will increase learning rates [$\beta$]. 

#### Decision threshold and drift rate adaptation will likely combine to drive behavior. The threshold shift may be driven either by change point probability, which affects both targets, or by belief in the reward difference, which changes the starting point and biases the decision toward the high value target. The drift rate change will be driven by belief in the reward difference.


## **Cognitive models**

#### Nomenclature

<center>**Learning signals (estimates from ideal observer)**</center>
$$
\begin{align}
{B} = \textrm{belief in the mean reward difference between targets} && \Omega = \textrm{change point probability}\\
\end{align}
$$ 
<center>**Learning targets (decision parameters)** </center>
$$
\begin{align}
a = \textrm{boundary} && v = \textrm{execution drift rate}\\  
z = \textrm{starting point}\\
\end{align}
$$

<center>**Other parameters**</center> 
$$
\begin{align}
\sigma^2_n = \textrm{variance of the generative distribution} && \sigma^2_t = \textrm{estimated variance}\\ 
\phi = \textrm{model confidence} && H = \textrm{hazard rate}\\ 
r_t = \textrm{reward difference observed} && \alpha = \textrm{Bayesian belief learning rate}\\
\delta = \textrm{reward prediction error}\\
\end{align}
$$
<br>

### Approach
#### We update the targets of learning, decision boundary height, the rate of evidence accumulation, and the starting point, using estimates of the reward difference between targets and the reward changepoint probability as learning signals. <br><br>

### Belief calculation
#### The belief in the mean of the distribution of reward differences on the next trial is calculated as: 
$$B_{t+1} = B_t + \alpha_t\delta_t$$

#### Here, the learning rate of the model [$\alpha$] is influenced by the change point probability [$\Omega$] and the model confidence [$\phi$]. The learning rate should be high if either 1) a change in the mean of the distribution of reward is likely [$\Omega$ is high] or 2) the estimate of the mean is highly imprecise [$\phi$ is low]:
$$\alpha_t = \Omega_t + (1-\Omega_t)(1-\phi_t)$$


#### The prediction error, $\delta$, is the difference between the model belief and the reward difference observed: 
$$\delta_t = r_t - B_t$$
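
#### The belief update can be traced by hand with a small sketch. The function names and the example values below are made up for illustration; only the three equations come from the text above.

```python
def learning_rate(omega, phi):
    """alpha_t = Omega_t + (1 - Omega_t) * (1 - phi_t)."""
    return omega + (1.0 - omega) * (1.0 - phi)


def update_belief(belief, reward_diff, omega, phi):
    """Return (B_{t+1}, delta_t, alpha_t) for one trial."""
    delta = reward_diff - belief          # prediction error
    alpha = learning_rate(omega, phi)     # adaptive learning rate
    return belief + alpha * delta, delta, alpha


# Worked example with made-up values: B_t = 0.10, r_t = 0.30,
# Omega_t = 0.5, phi_t = 0.8.
new_belief, delta, alpha = update_belief(0.10, 0.30, omega=0.5, phi=0.8)
print(round(alpha, 3), round(delta, 3), round(new_belief, 3))
# prints: 0.6 0.2 0.22
```

#### At the extremes, a certain change point ($\Omega_t = 1$) gives $\alpha_t = 1$, so the belief jumps to the observed reward difference, while a fully confident model with no change point ($\Omega_t = 0$, $\phi_t = 1$) gives $\alpha_t = 0$ and the belief does not move.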

#### Estimated variance is calculated as: 
$$\sigma^2_t = \sigma^2_n + \frac{(1-\phi_t)\sigma^2_n}{\phi_t}$$
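
#### As a quick check, the expression above can be collected over the common denominator $\phi_t$, which shows that the estimated variance is the generative variance scaled by the inverse of the model confidence:
$$
\begin{align}
\sigma^2_t = \sigma^2_n + \frac{(1-\phi_t)\sigma^2_n}{\phi_t} = \frac{\phi_t\sigma^2_n + (1-\phi_t)\sigma^2_n}{\phi_t} = \frac{\sigma^2_n}{\phi_t}
\end{align}
$$
#### For example, with illustrative values $\sigma^2_n = 0.0036$ (a standard deviation of 0.06) and $\phi_t = 0.5$, this gives $\sigma^2_t = 0.0072$. As $\phi_t$ approaches 1, the estimated variance shrinks toward $\sigma^2_n$.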

### Changepoint probability calculation
#### The changepoint probability is the likelihood that a new sample is drawn from a uniform distribution (as expected after a change point) relative to the combined likelihood that it is drawn from either that uniform distribution or the Gaussian distribution centered on the model's current belief estimate, with the two likelihoods weighted by the hazard rate [$H$] and its complement, respectively. The changepoint probability will approach 1 as the relative probability of a sample coming from the uniform distribution increases. 
$$\Omega_t = \frac{U(r_t)H}{U(r_t)H + N(r_t|B_t,\sigma^2_t)(1-H)}$$
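
#### A minimal sketch of this calculation, assuming reward differences are bounded so that the uniform density $U(r_t)$ is the constant `1 / (hi - lo)`. The bounds `lo = -1` and `hi = 1`, the function names, and the example values are illustrative assumptions.

```python
import math


def normal_pdf(x, mean, var):
    """Density of a Gaussian with the given mean and variance at x."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def changepoint_probability(r, belief, est_var, hazard, lo=-1.0, hi=1.0):
    """Omega_t = U(r)H / (U(r)H + N(r | B, sigma_t^2)(1 - H))."""
    uniform_density = 1.0 / (hi - lo)      # assumption: bounded reward differences
    change = uniform_density * hazard
    no_change = normal_pdf(r, belief, est_var) * (1.0 - hazard)
    return change / (change + no_change)


# An observation that matches the current belief: Omega stays near 0.
omega_expected = changepoint_probability(r=0.2, belief=0.2, est_var=0.0036, hazard=1 / 30)
# An observation far from the current belief: Omega approaches 1.
omega_surprise = changepoint_probability(r=-0.2, belief=0.2, est_var=0.0036, hazard=1 / 30)
```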

#### The hazard rate is the global probability that the mean of the distribution has changed (calculated as the sum of change points over the total number of trials). 
$$H = \frac{sum(cp_{trials})}{n_{trials}}$$

#### The model confidence [$\phi$] is a function of the changepoint probability [$\Omega$] and the variance of the generative distribution [$\sigma^2_n$], computed as one minus the ratio $RU_t$ below. In the numerator, the first term is the variance when a changepoint is assumed to have occurred. The second term is the variance conditional on no changepoint (slowly decaying uncertainty). The third term is the rise in uncertainty when the model is unsure whether a changepoint has occurred. The same terms appear in the denominator with an added variance term to reflect uncertainty arising from noise. 

$$RU_t = \frac{\Omega_t\sigma^2_n + (1-\Omega_t)(1-\phi_t)\sigma^2_n + \Omega_t(1-\Omega_t)(\delta_t\phi_t)^2}{\Omega_t\sigma^2_n + (1-\Omega_t)(1-\phi_t)\sigma^2_n + \Omega_t(1-\Omega_t)(\delta_t\phi_t)^2+\sigma^2_n}$$
<br>
$$\phi_{t+1} =  1 - RU_t$$
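
#### The confidence update can be sketched term by term as follows. Reading $RU$ as a relative-uncertainty ratio, the function names, and the example values are assumptions made for illustration.

```python
def relative_uncertainty(omega, phi, delta, gen_var):
    """RU_t: uncertainty about the mean relative to total uncertainty."""
    cp_term = omega * gen_var                                # change point occurred
    no_cp_term = (1.0 - omega) * (1.0 - phi) * gen_var       # no change point
    unsure_term = omega * (1.0 - omega) * (delta * phi) ** 2  # unsure either way
    numerator = cp_term + no_cp_term + unsure_term
    return numerator / (numerator + gen_var)                 # added noise variance


def update_confidence(omega, phi, delta, gen_var):
    """phi_{t+1} = 1 - RU_t."""
    return 1.0 - relative_uncertainty(omega, phi, delta, gen_var)


# No change point (Omega = 0), phi = 0.5: RU = 0.5 / 1.5, so confidence rises to 2/3.
phi_stable = update_confidence(omega=0.0, phi=0.5, delta=0.0, gen_var=0.0036)
# Certain change point (Omega = 1): RU = 1/2, so confidence resets to 0.5.
phi_reset = update_confidence(omega=1.0, phi=0.9, delta=0.4, gen_var=0.0036)
```

#### In this sketch a certain change point always resets confidence to 0.5 regardless of its previous value, and a run of trials without change points lets confidence climb again.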

### Models
#### We propose three component cognitive models. 
####  **Adaptive drift rate model.** The drift rate alone may vary as a function of the estimated reward difference between targets, where an increased drift rate would speed evidence accumulation, decreasing reaction time and increasing choice accuracy: 
#### $$v_{t+1} = \hat\beta{\cdot B_t} + v_t$$
<br>
####  **Adaptive decision boundary model.** The boundary alone may adapt as a function of the change point probability, which would increase the window of evidence accumulation, increasing reaction time and choice accuracy: 
####  $$a_{t+1} = \hat\beta\cdot\Omega_t + a_0$$
<br>
####  **Adaptive starting point model.** The starting point may adapt as a function of the estimated reward difference between targets, decreasing reaction time and increasing accuracy: 
####  $$z_{t+1} = \hat\beta \cdot {B_t} + z_0$$
<br>
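
#### The three update rules above can be written as one-line functions. This is a sketch only: the function names, the single shared weight `beta`, and the example values are assumptions, and the rules are copied as written (the drift rule adds to the current $v_t$, while the boundary and starting point rules add to baselines $a_0$ and $z_0$).

```python
def adapt_drift(v_t, belief, beta):
    """v_{t+1} = beta * B_t + v_t."""
    return beta * belief + v_t


def adapt_boundary(a_0, omega, beta):
    """a_{t+1} = beta * Omega_t + a_0."""
    return beta * omega + a_0


def adapt_start(z_0, belief, beta):
    """z_{t+1} = beta * B_t + z_0."""
    return beta * belief + z_0


# Made-up signals for one trial: belief B_t = 0.2, change point probability 0.9.
belief, omega, beta = 0.2, 0.9, 0.5
v_next = adapt_drift(v_t=1.0, belief=belief, beta=beta)      # 0.5 * 0.2 + 1.0
a_next = adapt_boundary(a_0=1.0, omega=omega, beta=beta)     # 0.5 * 0.9 + 1.0
z_next = adapt_start(z_0=0.5, belief=belief, beta=beta)      # 0.5 * 0.2 + 0.5
```

#### Any implementation would also need to keep the starting point between 0 and $a$ so that accumulation begins inside the boundaries.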

#### As mentioned in the Hypotheses section, we hypothesize that a combination of decision threshold adaptation (whether value-driven, affecting the starting point for evidence accumulation, or volatility-driven, affecting the global caution associated with both actions) and value-driven drift rate adaptation will modulate behavior. 

## **Simulated behavior** 

#### In all conditions, if the learner solely uses changepoint probability as the learning signal, then reaction time spikes quickly after the changepoint. However, because the belief in the value difference between actions does not inform a changepoint-driven model, accuracy remains at chance level. 

#### If the starting point for evidence accumulation is modulated by the belief in the value difference between actions, then accuracy increases relative to all other models while the reaction time slowly increases over time. 

#### Adaptive drift model accuracy lies between the accuracy of the starting point and decision boundary models. While reaction times for the drift model exhibit the same temporal incidence as the starting point model, the amplitude of the increase in reaction time following the changepoint is greater for the drift model than for the starting point model when conflict is low. During high conflict conditions, the drift and starting point models exhibit similar reaction time profiles overall. 

#### For all models, high conflict conditions reduce accuracy and reduce the amplitude of increases in reaction time.
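
#### To illustrate how drift rate, boundary, and starting point shape reaction time and accuracy on a single decision, here is a generic drift-diffusion simulation using Euler-Maruyama integration. It is not the simulation code behind the results described here. Treating the upper boundary as the higher-value choice, the step size, the unit noise, the omission of non-decision time, and all names are assumptions.

```python
import random


def simulate_ddm_trial(v, a, z, dt=0.001, noise_sd=1.0, max_time=5.0, rng=random):
    """Simulate one drift-diffusion trial.

    Returns (choice, rt): choice is 1 if the upper boundary (a) is hit, 0 if
    the lower boundary (0) is hit, and None if neither is reached by max_time.
    """
    x, t = z, 0.0
    step_sd = noise_sd * dt ** 0.5
    while t < max_time:
        x += v * dt + rng.gauss(0.0, step_sd)  # drift plus diffusion noise
        t += dt
        if x >= a:
            return 1, t
        if x <= 0.0:
            return 0, t
    return None, t


def summarize(v, a, z, n_trials=300, seed=0):
    """Return (proportion of upper-boundary choices, mean RT in seconds)."""
    rng = random.Random(seed)
    trials = [simulate_ddm_trial(v, a, z, rng=rng) for _ in range(n_trials)]
    finished = [(c, rt) for c, rt in trials if c is not None]
    accuracy = sum(c for c, _ in finished) / len(finished)
    mean_rt = sum(rt for _, rt in finished) / len(finished)
    return accuracy, mean_rt


# Raising the boundary (with an unbiased start at a / 2) slows responses.
acc_low_a, rt_low_a = summarize(v=1.0, a=1.0, z=0.5)
acc_high_a, rt_high_a = summarize(v=1.0, a=2.0, z=1.0)
# With no drift, shifting the start toward the upper boundary biases choices.
acc_unbiased, _ = summarize(v=0.0, a=1.0, z=0.5)
acc_biased, _ = summarize(v=0.0, a=1.0, z=0.8)
```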

#### Below, reaction times increase more quickly under conditions of low conflict than high conflict, with the high conflict condition showing a relatively slow increase in reaction time and accuracy as the learner disambiguates the value difference between targets.  

<table><tr>
<td> <img src="lvlc_acc_tc.png"  style="width: 900px;"/> </td>
<td> <img src="lvlc_rt_tc.png"  style="width: 900px;"/> </td>
</tr></table>

<table><tr>
<td> <img src="lvhc_acc_tc.png"  style="width: 900px;"/> </td>
<td> <img src="lvhc_rt_tc.png"  style="width: 900px;"/> </td>
</tr></table>

#### Reaction time and accuracy profiles will be similar under conditions of low and high volatility.  
<table><tr>
<td> <img src="hvlc_acc_tc.png" style="width: 900px;"/> </td>
<td> <img src="hvlc_rt_tc.png" style="width: 900px;"/> </td>
</tr></table>
<table><tr>
<td> <img src="lvlc_acc_tc.png"  style="width: 900px;"/> </td>
<td> <img src="lvlc_rt_tc.png"  style="width: 900px;"/> </td>
</tr></table>

#### When volatility and conflict are high, reaction time and accuracy profiles will be similar to those in high conflict and low volatility conditions.
<table><tr>
<td> <img src="hvhc_acc_tc.png"  style="width: 900px;"/> </td>
<td> <img src="hvhc_rt_tc.png"  style="width: 900px;"/> </td>
</tr></table>

## **Model Comparison & fitting procedures**

### **Parameter estimates.** Parameter estimates will be derived from behavioral data using hierarchical Bayesian estimation. This procedure was chosen for its ability to recover models for smaller datasets, its robustness to outliers and multiple comparisons, and distribution-level expression of estimation uncertainty (as opposed to using parametric  estimates of confidence around a point estimate). 

### **Model comparison & falsification.** To quantify the likelihood of observing the experimental data given each model, we will compare model deviance information criterion (DIC) scores. In addition, we will simulate the data using the best-fitting parameter values to check 1) whether the same reaction time and accuracy profiles are detectable in the empirical data (the model's generative performance), and 2) whether there is a statistical difference between observed and simulated data (see [Palminteri et al., 2017](https://www.ncbi.nlm.nih.gov/pubmed/28476348) for instances in which relative information criterion scores can be misleading without absolute falsification criteria). 
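
### As a reminder of what a DIC score summarizes, the standard definition can be sketched as the mean posterior deviance plus an effective-parameter penalty. The function name and the deviance values below are made up for illustration and do not come from any fitted model.

```python
def deviance_information_criterion(deviance_samples, deviance_at_posterior_mean):
    """Return (DIC, p_D).

    deviance_samples           : deviance evaluated at each posterior sample
    deviance_at_posterior_mean : deviance evaluated at the posterior mean parameters
    """
    mean_deviance = sum(deviance_samples) / len(deviance_samples)
    p_d = mean_deviance - deviance_at_posterior_mean  # effective number of parameters
    return mean_deviance + p_d, p_d


# Hypothetical deviances for two candidate models; the lower DIC is preferred.
dic_a, p_d_a = deviance_information_criterion([100.0, 102.0, 104.0], 99.0)
dic_b, p_d_b = deviance_information_criterion([98.0, 103.0, 108.0], 95.0)
preferred = "A" if dic_a < dic_b else "B"
```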


## **Data storage and availability**

### All  data, experiment/analysis code, and *a priori* hypotheses will be stored using a project [repository](https://github.com/kmbond/volatileValues) on GitHub, and updated throughout the project timeline. Ultimately, this repository will be linked to a public project using the Open Science Framework.