from IPython.display import HTML, display

#   How do we decide and learn in a volatile environment? 

## **Theoretical Framework**


#### Performing successfully in a volatile environment requires making fast, accurate decisions and updating those decisions given environmental feedback. However, accumulator models of choice-making, which model mechanisms internal to the decision, and reinforcement learning models, which describe how the outcomes of those choices influence decision updates, are often studied in isolation, despite their complementary goals. Drift diffusion modeling can describe the speed and accuracy of choices, but it assumes that the decision parameters governing those choices remain static across trials. Models of reinforcement learning explain how environmental feedback sculpts choice preferences, but they fail to describe decision-making mechanisms. 

#### Much research has shown that both action selection and reinforcement learning rely on the cortico-basal ganglia-thalamocortical (CBGT) circuit ([Bogacz & Larsen, 2011](https://www.mitpressjournals.org/doi/abs/10.1162/NECO_a_00103)). In addition, hyperdirect activation of the subthalamic nucleus has been correlated with value conflict ([Frank et al., 2015](http://www.jneurosci.org/content/35/2/485)), and there is evidence that tonic dopamine levels in the striatum change the probability distribution function for action selection, regulating the explore-exploit tradeoff for potential actions depending on the volatility of environmental feedback ([Humphries, Khamassi, & Gurney, 2012](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3272648/)). Altogether, this evidence points to the importance of value conflict and volatility as learning signals for the integration of decision making and reinforcement learning. 

#### I plan to explore how value conflict between competing actions (the degree to which the value associated with each action is similar) and the volatility of feedback (the change point frequency of mean value-action associations) influence adaptive decision making using a hybrid drift-diffusion reinforcement learning model. 

## **Task**

#### In our task, we operationally define conflict as the mean reward difference between actions and volatility as the hazard rate of change in that mean reward difference. We plan to manipulate factors of volatility and conflict using a 2x2 within-subjects design to form four conditions, with each participant performing 1000 trials per condition and two conditions per day until all conditions are complete.     

#### Reaction time and choice accuracy data will be collected using a two-alternative forced-choice task written in PsychoPy. On each trial, participants will be asked to choose the higher-reward target within 700 ms. They will then be shown the reward earned on that trial based on their decision. To prevent prepotent selections, the position of the rewarding target will be randomized across trials, with the target identity as the rewarding feature (rather than target position). The instruction screen and the structure of a sample trial are below: 
<table><tr><td><img src='instructions.png' style='width: 700px;'></td><td><img src='task.png' style='width: 700px;'></td></tr></table>

#### Each participant will receive a separate reward schedule. Below is a sample of one high volatility reward schedule, with the reward difference between targets on the y-axis. The  mean reward difference between targets shifts probabilistically, approximately every 30 trials. <br>

<table><tr><td><img src='hv_lc_sample.png' style='width: 500px;'></td></tr></table>
    
#### Below is a sample of a high conflict reward schedule. The reward difference between targets is drawn from a normal distribution centered on a mean difference between .01 and .2. Reward distributions for all conditions have standard deviations ranging from 0.05 to 0.07. <br>


<table><tr><td><img src='lv_hc_sample.png' style='width: 500px;'></td></tr></table>
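
#### The schedule properties described above (a hazard rate governing changepoints, and a normally distributed reward difference around a shifting mean) can be illustrated with a minimal sketch. This is not the generator used for the experiment: the function name `make_reward_schedule`, the default standard deviation of 0.06, the hazard rate of 1/30, and the rule that a changepoint redraws the mean magnitude and flips which target is better are all assumptions made for illustration.

```python
import random


def make_reward_schedule(n_trials=1000, hazard=1 / 30,
                         mean_lo=0.01, mean_hi=0.2, sd=0.06, seed=0):
    """Return per-trial reward differences, their generative means,
    and a flag marking the trials on which a changepoint occurred.

    Assumptions (not taken from the task code): a changepoint redraws
    the magnitude of the mean difference uniformly from
    [mean_lo, mean_hi] and flips its sign, i.e. the better target swaps.
    """
    rng = random.Random(seed)
    mean = rng.uniform(mean_lo, mean_hi)
    reward_diff, means, changepoints = [], [], []
    for t in range(n_trials):
        # With probability `hazard`, the mean shifts on this trial.
        is_change = t > 0 and rng.random() < hazard
        if is_change:
            magnitude = rng.uniform(mean_lo, mean_hi)
            mean = -magnitude if mean > 0 else magnitude
        means.append(mean)
        changepoints.append(is_change)
        # Observed reward difference is a noisy sample around the mean.
        reward_diff.append(rng.gauss(mean, sd))
    return reward_diff, means, changepoints


if __name__ == "__main__":
    diffs, means, cps = make_reward_schedule(n_trials=200, seed=1)
    print("number of changepoints:", sum(cps))
```

#### With a hazard rate of 1/30, the expected run length between changepoints is 30 trials, which matches the "approximately every 30 trials" description; a lower hazard rate would give a low volatility schedule.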



## **Hybrid cognitive models**

#### We update the decision boundary height [$a$], the rate of evidence accumulation [$v$], and the starting point [$z$] parameters for the drift diffusion component of our model using estimates of the reward difference between targets [$B$]  and the reward changepoint probability [$\Omega$] from a modified quasi-optimal Bayesian observer ([Vaghi et al., 2017](https://www.ncbi.nlm.nih.gov/labs/articles/28965997/); [original code](https://github.com/BDMLab/Vaghi_Luyckx_et_al_2017)). <br><br>

#### We form three candidate models of behavior, each of which will be either falsified or supported using the empirical data and model simulations ([Palminteri et al., 2017](https://www.ncbi.nlm.nih.gov/pubmed/28476348)).<br>

####  **Adaptive drift rate model.** The drift rate alone may vary as a function of the estimated reward difference between targets, where an increased drift rate would speed evidence accumulation, decreasing reaction time and increasing choice accuracy:
#### $$v_{t+1} = \hat\beta \cdot B_t + v_t$$
<br>
####   **Adaptive decision boundary model.** The boundary alone may adapt as a function of the change point probability, which would increase the window of evidence accumulation, increasing reaction time and decreasing choice accuracy:
####  $$a_{t+1} = \hat\beta\cdot\Omega_t + a_0$$
<br>
####  **Adaptive starting point model.** The starting point may adapt as a function of the estimated reward difference between targets, decreasing reaction time and increasing accuracy:
####  $$z_{t+1} = \hat\beta \cdot {B_t} + z_0$$
<br>
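
#### The three update rules above can be written as one-line functions. This is a sketch for illustration only: the function names and the example values are hypothetical, and `beta` stands for the fitted coefficient $\hat\beta$ of each model.

```python
def update_drift(v_t, B_t, beta):
    """Adaptive drift rate: v_{t+1} = beta * B_t + v_t."""
    return beta * B_t + v_t


def update_boundary(a_0, omega_t, beta):
    """Adaptive boundary: a_{t+1} = beta * Omega_t + a_0."""
    return beta * omega_t + a_0


def update_start(z_0, B_t, beta):
    """Adaptive starting point: z_{t+1} = beta * B_t + z_0."""
    return beta * B_t + z_0


if __name__ == "__main__":
    # Hypothetical values, chosen only to show the direction of each update.
    B_t, omega_t = 0.1, 0.5
    print("next drift rate:", update_drift(v_t=0.5, B_t=B_t, beta=2.0))
    print("next boundary:", update_boundary(a_0=1.0, omega_t=omega_t, beta=0.4))
    print("next starting point:", update_start(z_0=0.5, B_t=B_t, beta=1.0))
```

#### Note that, as written in the equations, the drift rate update accumulates from the previous trial's value $v_t$, whereas the boundary and starting point are recomputed from their baselines $a_0$ and $z_0$ on every trial.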

## **Hypotheses & preliminary simulations** 

### **Mechanism** 
#### *Conflict*
#### Either the rate of evidence accumulation [drift rate, $v$] or the starting point for evidence accumulation [$z$] will vary with conflict, such that larger differences in value either increase the drift rate or bias the starting point toward the higher-value target, and smaller differences in value decrease the drift rate or decrease starting point bias (so that $z$ is closer to $a$/2). The decision threshold [$a$], which represents the degree of caution in the decision, may increase with conflict. 

#### *Volatility*  

#### The decision threshold [$a$] will increase as volatility increases and decrease as volatility decreases. When volatility is increased, the changepoint probability will increase, potentially shifting the learner toward a more explorative learning policy. This could happen in the form of an increase in boundary height or a decrease in drift rate, or a more complex interaction between the two. 

#### Increased volatility will increase learning rates [$\beta$].
### **Behavior** 
#### Reaction times will increase more quickly under conditions of low conflict than high conflict, with the high conflict condition showing a relatively slow increase in reaction time and accuracy as the learner disambiguates the value difference between targets. 
<table><tr>
<td> <img src="lvlc_acc_tc.png" alt="Drawing" style="width: 900px;"/> </td>
<td> <img src="lvlc_rt_tc.png" alt="Drawing" style="width: 900px;"/> </td>
</tr></table>

<table><tr>
<td> <img src="lvhc_acc_tc.png" alt="Drawing" style="width: 900px;"/> </td>
<td> <img src="lvhc_rt_tc.png" alt="Drawing" style="width: 900px;"/> </td>
</tr></table>

#### Reaction time and accuracy profiles will be similar under conditions of low and high volatility.  
<table><tr>
<td> <img src="hvlc_acc_tc.png" alt="Drawing" style="width: 900px;"/> </td>
<td> <img src="hvlc_rt_tc.png" alt="Drawing" style="width: 900px;"/> </td>
</tr></table>
<table><tr>
<td> <img src="lvlc_acc_tc.png" alt="Drawing" style="width: 900px;"/> </td>
<td> <img src="lvlc_rt_tc.png" alt="Drawing" style="width: 900px;"/> </td>
</tr></table>

#### When volatility and conflict are high, reaction time and accuracy profiles will be similar to those in high conflict and low volatility conditions.
<table><tr>
<td> <img src="hvhc_acc_tc.png" alt="Drawing" style="width: 900px;"/> </td>
<td> <img src="hvhc_rt_tc.png" alt="Drawing" style="width: 900px;"/> </td>
</tr></table>

## **Model Comparison & fitting procedures**

### **Parameter estimates.** Parameter estimates will be derived from behavioral data using hierarchical Bayesian estimation. This procedure was chosen for its ability to recover models for smaller datasets, its robustness to outliers and multiple comparisons, and distribution-level expression of estimation uncertainty (as opposed to using parametric  estimates of confidence around a point estimate). 

### **Model comparison & falsification.** To quantify the likelihood of observing the experimental data given each model, we will compare model deviance information criterion (DIC) scores. In addition, we will simulate data using the best-fitting parameter values to check 1) whether the simulated data reproduce the reaction time and accuracy profiles observed in the empirical data (the model's generative performance), and 2) whether there is a statistical difference between observed and simulated data (see [Palminteri et al., 2017](https://www.ncbi.nlm.nih.gov/pubmed/28476348) for instances in which relative information criterion scores can be misleading without absolute falsification criteria). 
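
### As a reminder of how the DIC comparison works, DIC is commonly defined as the posterior mean deviance plus an effective number of parameters, $p_D$, where $p_D$ is the mean deviance minus the deviance evaluated at the posterior mean of the parameters. The sketch below assumes that deviance samples are already available from the fitting software; the function names and the model labels in the usage example are hypothetical, and the numbers are placeholders rather than results.

```python
from statistics import fmean


def deviance(log_likelihood):
    """Deviance of a parameter sample: -2 times its log-likelihood."""
    return -2.0 * log_likelihood


def dic(deviance_samples, deviance_at_posterior_mean):
    """DIC = mean deviance + p_D, with p_D = mean deviance - D(theta_bar)."""
    mean_deviance = fmean(deviance_samples)
    p_d = mean_deviance - deviance_at_posterior_mean
    return mean_deviance + p_d


def best_model(dic_by_model):
    """Return the label of the model with the lowest DIC."""
    return min(dic_by_model, key=dic_by_model.get)


if __name__ == "__main__":
    # Placeholder deviances, for illustration only.
    scores = {
        "adaptive_drift": dic([10.0, 12.0, 14.0], 11.0),
        "adaptive_boundary": dic([11.0, 13.0, 15.0], 12.5),
    }
    print(scores, best_model(scores))
```

### Lower DIC indicates a better trade-off between fit and complexity, but the comparison is only relative, which is why the simulation checks described above are still needed.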


## **Data storage and availability**

### All  data, experiment/analysis code, and *a priori* hypotheses will be stored using a public project [repository](https://github.com/kmbond/volatileValues) on GitHub, and updated throughout the project timeline. Ultimately, this public repository will be linked to a public project using the Open Science Framework.

## **Appendix**<br>


### **Ideal observer equations**<br>
The learning rate of the model [$\alpha$] is influenced by the changepoint probability [$\Omega$, the model's suspicion that the location of the mean has shifted] and the model confidence [$\phi$, which falls as the uncertainty arising from an imprecise estimate of the mean rises]. The learning rate should be high if either 1) a change in the mean of the distribution of reward is likely [$\Omega$ is high] or 2) the estimate of the mean is highly imprecise [$\phi$ is low].
$$\alpha_t = \Omega_t + (1-\Omega_t)(1-\phi_t)$$

The belief estimate of the mean of the distribution of rewards on the next trial: 
$$B_{t+1} = B_t + \alpha_t\delta_t$$

The prediction error, $\delta$, is the difference between the model belief and the current sample: 
$$\delta_t = r_t - B_t$$

If $\alpha_t$ is 0, the current sample will not update the model belief estimate at all, but if 
$\alpha_t$ is 1, the current sample will entirely dictate the model's belief estimate. 
***
<br>
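
The learning rate and belief update above amount to a delta rule with a trial-varying step size. A minimal sketch, assuming $\Omega_t$ and $\phi_t$ have already been computed for the current trial (the function names are hypothetical):

```python
def learning_rate(omega_t, phi_t):
    """alpha_t = Omega_t + (1 - Omega_t) * (1 - phi_t)."""
    return omega_t + (1.0 - omega_t) * (1.0 - phi_t)


def update_belief(B_t, r_t, omega_t, phi_t):
    """Return (B_{t+1}, alpha_t, delta_t) for one observed sample r_t."""
    delta_t = r_t - B_t                      # prediction error
    alpha_t = learning_rate(omega_t, phi_t)  # step size in [0, 1]
    return B_t + alpha_t * delta_t, alpha_t, delta_t


if __name__ == "__main__":
    # A certain changepoint (Omega = 1) makes the belief jump to the sample.
    print(update_belief(B_t=0.2, r_t=0.6, omega_t=1.0, phi_t=0.5))
    # No changepoint and full confidence (phi = 1) leaves the belief unchanged.
    print(update_belief(B_t=0.2, r_t=0.6, omega_t=0.0, phi_t=1.0))
```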

### **Changepoint probability**<br>
The changepoint probability is the likelihood that a new sample is drawn from a uniform distribution (as expected after a changepoint) relative to the likelihood that it is drawn from the Gaussian distribution centered on the model's current belief estimate. The changepoint probability approaches 1 as the relative probability of the sample coming from the uniform distribution increases. $H$ is the hazard rate, the prior probability that the mean of the distribution has changed on a given trial. 

$$\Omega_t = \frac{U(r_t)H}{U(r_t)H + N(r_t|B_t,\sigma^2_t)(1-H)}$$

### **Estimated variance**<br>
$$\sigma^2_t = \sigma^2_n + \frac{(1-\phi_t)\sigma^2_n}{\phi_t}$$
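
The changepoint probability and estimated variance can be sketched as follows. The reward range used for the uniform density (`r_min`, `r_max`) is an assumption, since it is not stated here, and the function names are hypothetical. The sketch also assumes $\phi_t > 0$, because the estimated variance divides by $\phi_t$.

```python
import math


def estimated_variance(sigma2_n, phi_t):
    """sigma^2_t = sigma^2_n + (1 - phi_t) * sigma^2_n / phi_t (phi_t > 0)."""
    return sigma2_n + (1.0 - phi_t) * sigma2_n / phi_t


def normal_pdf(x, mean, var):
    """Density of a normal distribution with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def changepoint_probability(r_t, B_t, sigma2_t, hazard, r_min=0.0, r_max=1.0):
    """Omega_t = U(r_t) H / (U(r_t) H + N(r_t | B_t, sigma^2_t) (1 - H)).

    The uniform density is taken over an assumed reward range
    [r_min, r_max]; adjust this to the range used in the task.
    """
    uniform_density = 1.0 / (r_max - r_min)
    change = uniform_density * hazard
    no_change = normal_pdf(r_t, B_t, sigma2_t) * (1.0 - hazard)
    return change / (change + no_change)


if __name__ == "__main__":
    var_t = estimated_variance(sigma2_n=0.0036, phi_t=0.5)
    near = changepoint_probability(0.12, 0.10, var_t, hazard=1 / 30)
    far = changepoint_probability(0.90, 0.10, var_t, hazard=1 / 30)
    print("near sample:", near, "far sample:", far)
```

A sample far from the current belief is unlikely under the Gaussian, so the uniform term dominates and $\Omega_t$ moves toward 1; lower confidence widens the Gaussian and makes the same sample less surprising.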


### **Model confidence**<br>
The model confidence [$\phi$] is a function of the changepoint probability [$\Omega$] and the variance of the generative distribution [$\sigma^2_n$]. The first term is the variance when a changepoint is assumed to have occurred. The second term is the variance conditional on no changepoint (slowly decaying uncertainty). The third term is the rise in uncertainty when the model is unsure whether a changepoint has occurred. The same terms are in the denominator with an added variance term to reflect uncertainty arising from noise. 

$$RU_t = \frac{\Omega_t\sigma^2_n + (1-\Omega_t)(1-\phi_t)\sigma^2_n + \Omega_t(1-\Omega_t)(\delta_t\phi_t)^2}{\Omega_t\sigma^2_n + (1-\Omega_t)(1-\phi_t)\sigma^2_n + \Omega_t(1-\Omega_t)(\delta_t\phi_t)^2+\sigma^2_n}$$
<br>
$$\phi_{t+1} = 1 - RU_t$$

*Note that the quantity calculated as model confidence in the paper is actually reward uncertainty, so we take its complement, $1 - RU_t$.* <br>
*Vaghi et al., 2017*
<br>
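
The confidence update can be sketched directly from the three numerator terms described above. This is an illustrative transcription of the equations rather than the original code, and the function name is hypothetical.

```python
def update_confidence(omega_t, phi_t, delta_t, sigma2_n):
    """Return (phi_{t+1}, RU_t) given the current trial's quantities."""
    # Term 1: variance if a changepoint is assumed to have occurred.
    changepoint_term = omega_t * sigma2_n
    # Term 2: variance conditional on no changepoint.
    no_changepoint_term = (1.0 - omega_t) * (1.0 - phi_t) * sigma2_n
    # Term 3: extra uncertainty from not knowing which case applies.
    ambiguity_term = omega_t * (1.0 - omega_t) * (delta_t * phi_t) ** 2
    numerator = changepoint_term + no_changepoint_term + ambiguity_term
    ru_t = numerator / (numerator + sigma2_n)
    return 1.0 - ru_t, ru_t


if __name__ == "__main__":
    # A certain changepoint resets confidence to 0.5.
    print(update_confidence(omega_t=1.0, phi_t=0.9, delta_t=0.4, sigma2_n=0.0036))
    # With no changepoint, confidence rises from 0.5 toward 1.
    print(update_confidence(omega_t=0.0, phi_t=0.5, delta_t=0.0, sigma2_n=0.0036))
```

Because the denominator always exceeds the numerator by $\sigma^2_n$, $RU_t$ stays below 1 and the updated confidence stays above 0, which keeps the estimated variance well defined on the next trial.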

