# a) Variables & observations

### Variables, $p$
*Predictors*<br>
> * conflict (high/low, qualitative)
> * volatility (high/low, qualitative)

*Responses*<br>
> *behavioral*
>> * accuracy (qualitative, 0/1)
>> * reaction time (quantitative)

> *parameters from model fits to behavioral data*
>> * decision boundary height [$a$] (quantitative)
>> * drift rate [$v$] (quantitative)
>> * starting point [$z$] (quantitative)
>> * learning rate [$\beta$] (quantitative)

> *learning signals from ideal observer*
>> * change point probability [$\Omega$] (quantitative)
>> * belief in the reward difference between targets [$B$] (quantitative)

### Number of observations, $n$
Six participants, each completing four 1000-trial sessions (6 × 4 × 1000 = 24,000 planned trials in total).

# b) Data architecture 

All behavioral data and metadata will be stored in the lab Dropbox folder for the experiment. Each data file is named according to subject, condition, and trial set ID. For example, if the subject number were 123, the condition number 0, and the trial set ID 0, the data file would be named *123_cond0_trialset0.csv*. Additionally, for each subject, system- and experiment-related metadata (such as the last computer reboot time, the versions of key modules for the experiment, the total length of the experiment for that session, and the length of the mid-experiment break) are recorded in a separate csv file with *runInfo* appended to the file name, as shown below.
![image.png](data_arch.png)
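The naming convention can be sketched in a few lines of Python. This is only an illustration: the function names and the regular expression are my own assumptions, inferred from the two example file names in this section (*123_cond0_trialset0.csv* and the *runInfo* variant), not code from the experiment itself.

```python
import re

# Assumed pattern, inferred from "123_cond0_trialset0.csv" and
# "test_cond0_trialset0_runInfo.csv"; adjust if the real names differ.
FILENAME_PATTERN = re.compile(
    r"^(?P<subject>[^_]+)_cond(?P<condition>\d+)_trialset(?P<trial_set>\d+)"
    r"(?P<run_info>_runInfo)?\.csv$"
)


def build_filename(subject, condition, trial_set, run_info=False):
    """Compose a behavioral data (or runInfo metadata) file name."""
    suffix = "_runInfo" if run_info else ""
    return f"{subject}_cond{condition}_trialset{trial_set}{suffix}.csv"


def parse_filename(name):
    """Split a file name into its parts; raise ValueError if it does not match."""
    match = FILENAME_PATTERN.match(name)
    if match is None:
        raise ValueError(f"unexpected file name: {name!r}")
    return {
        "subject": match.group("subject"),
        "condition": int(match.group("condition")),
        "trial_set": int(match.group("trial_set")),
        "is_run_info": match.group("run_info") is not None,
    }


print(build_filename(123, 0, 0))
print(parse_filename("test_cond0_trialset0_runInfo.csv"))
```

Parsing the names back into fields makes it possible to confirm, before any analysis, that every behavioral file has a matching *runInfo* file.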

```r
# Load the session-level metadata for the example file and preview it
example_metadata <- read.csv("test_cond0_trialset0_runInfo.csv")
head(example_metadata)
```

| psychopy_version | python_version | pythonScipyVersion | pyglet_version | pygame_version | numpy_version | wx_version | window_refresh_time_avg_ms | begin_time | exp_dir | last_sys_reboot | system_platform | internet_access | total_exp_time | break_time |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1.85.2 | 2.7.12 | 0.19.1 | 1.2.4 | 1.9.3 | 1.13.1 | 4.0.0b2 gtk3 (phoenix) | 33.33302 | 2018_02_02 17:49 (Year_Month_Day Hour:Min) | /home/coaxlab/Dropbox/volatileValues/simple_rt_experiment | 2018-02-02 16:32 | linux 4.4.0-112-generic | True | 24.63348 | 1.33224 |


Within the behavioral data file, key variables are stored within columns:
> * the left/right **choice** is coded as 0 or 1
> * the **accuracy** is coded as 0 (incorrect) or 1 (correct)
> * the choice corresponding to the highest point value is stored as the **solution**
> * the number of points earned on each trial is stored as **reward**
> * the reward accumulated across the experiment so far is stored as **cumulative_reward**
> * the reaction time for each trial is stored as **rt**
> * the trial time, including feedback time, is stored as **total_trial_time** 
> * the intertrial interval is stored as **iti**
> * the change point indicator (0/1) is stored with slow trial (-1) and fast trial (-2) indicators as **cp_with_slow_fast**
> * and the ASCII value for the color of the high-value cue is stored as **high_val_cue**

```r
# Load the trial-level behavioral data for the example file and preview it
example_data <- read.csv("test_cond0_trialset0.csv")
head(example_data)
```

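| choice | accuracy | solution | reward | cumulative_reward | rt | total_trial_time | iti | cp_with_slow_fast | high_val_cue |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 0 | 0 | 1.057122 | 2.711002 | 0.6296882 | -1 | 112 |
| 0 | 0 | 1 | 46 | 46 | 0.2342849 | 1.896353 | 0.6759479 | 0 | 112 |
| 0 | 1 | 0 | 55 | 101 | 0.426306 | 2.029746 | 0.622425 | 0 | 112 |
| 1 | 0 | 0 | 38 | 139 | 0.3260748 | 1.660718 | 0.3501943 | 0 | 112 |
| 0 | 1 | 0 | 55 | 194 | 0.3734028 | 1.941348 | 0.5746767 | 0 | 112 |
| 0 | 0 | 1 | 41 | 235 | 0.3061411 | 1.755645 | 0.4637496 | 0 | 112 |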


Because data collection is still ongoing, I don't have a sample data structure to show for the parameters from model fits. However, the learning signals resulting from ideal observer simulations (change point probability, $\Omega$, and belief in the reward difference between targets, $B$) are currently stored within Python objects named for each candidate model of learning. Each object is named for the learning rate (e.g., 'mod' for moderate), the condition (e.g., 'hv_hc' for high volatility and high conflict), the learning signal (e.g., 'B' for the belief in the difference between targets, $B$), and the target of learning (e.g., 'sp' for starting point, $z$):
![image.png](python_objects.png)

# c) Anticipated anomalies & data cleansing

* ## Syntactic 
> > **Lexical**: Because the behavioral data are written to lists which are then concatenated, lists of different sizes would raise a concatenation error and prevent the data from saving, so I don't expect any discrepancies between the intended and actual data structure format. However, it is possible that the values for two variables could be switched without affecting the size of the list, so I will check for this type of error.
<br>
> > **Domain format errors and irregularities**: Because all of the data is written at once, I don't expect there to be formatting inconsistencies, but I will check for this type of error.
<br>
* ## Semantic 
> > **Integrity constraint violations**: I will check that reaction times are within the minimum (0.1 s) and maximum (1 s) set within the experiment. Because a trial should end with a timing message whenever the reaction time falls outside these bounds, a reaction time on a non-repeated trial that is greater than the maximum or less than the minimum means that either a) the reaction time was recorded incorrectly or b) the experiment did not operate as intended. I will also check that accuracy, solution, and choice values are all either 0 or 1 and that the high value cue is always one of the two ASCII values for the colors of the stimuli presented.
<br>
> > **Contradictions**: I will recalculate accuracy from the choice and solution values to ensure that the recorded accuracy does not contradict them. Additionally, I will ensure that if the experimental constraints on reaction time are not met on a given trial, the trial is flagged appropriately (-1 for slow or -2 for fast trials) and repeated (same reward values on trial t+1 as on trial t). I will also check that total trial time is always greater than the recorded reaction time, that cumulative reward is always increasing, and that when the cp_with_slow_fast indicator is 1 (indicating a change point), the ASCII value for the high value cue also changes.
<br>
> > **Duplicates**: While some trials should repeat given an out-of-bounds reaction time, I will check for repeated recordings of the same trial by finding whether any trial within a subject has repeated values for reaction time (which has a high degree of precision) and cumulative reward (which should always increase). 
<br>
> > **Invalid tuples**: I will check that accuracy is moderately variable within a given subject. Given the probabilistic nature of the task, I would not expect any subject to have either perfect accuracy or no correct trials at all.
<br>

* ## Coverage 
> > **Missing values & missing tuples**: Missing values within a variable should not be a problem, because a concatenation error would prevent the data from saving. Missing data vectors also should not be a problem because data collection is automated, but I will check that the data from each subject match the expected size (number of trials by number of variables).
<br>
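As a rough illustration of the integrity-constraint checks listed above, here is a minimal Python sketch. It assumes the column names from the example data file; the function names are hypothetical, and only one cue value (112) appears in the sample, so the set of valid cue values is left as a parameter rather than guessed.

```python
import csv
import io

RT_MIN, RT_MAX = 0.1, 1.0           # reaction time bounds (s) stated in this plan
BINARY_COLUMNS = ("choice", "accuracy", "solution")
SLOW_FLAG, FAST_FLAG = -1, -2       # codes stored in cp_with_slow_fast

# First rows of the example data file shown in section b.
SAMPLE_CSV = """choice,accuracy,solution,reward,cumulative_reward,rt,total_trial_time,iti,cp_with_slow_fast,high_val_cue
0,0,1,0,0,1.057122,2.711002,0.6296882,-1,112
0,0,1,46,46,0.2342849,1.896353,0.6759479,0,112
0,1,0,55,101,0.426306,2.029746,0.622425,0,112
"""


def load_trials(text):
    """Parse CSV text into a list of dicts with numeric values."""
    reader = csv.DictReader(io.StringIO(text))
    return [{key: float(value) for key, value in row.items()} for row in reader]


def integrity_violations(trials, valid_cues):
    """Return (row index, message) pairs for integrity-constraint violations."""
    problems = []
    for i, trial in enumerate(trials):
        flagged = trial["cp_with_slow_fast"] in (SLOW_FLAG, FAST_FLAG)
        in_bounds = RT_MIN <= trial["rt"] <= RT_MAX
        # An out-of-bounds RT is only acceptable on a trial flagged slow/fast.
        if not in_bounds and not flagged:
            problems.append((i, "rt out of bounds on an unflagged trial"))
        for column in BINARY_COLUMNS:
            if trial[column] not in (0, 1):
                problems.append((i, f"{column} is not 0 or 1"))
        if trial["high_val_cue"] not in valid_cues:
            problems.append((i, "unexpected high_val_cue value"))
    return problems


# Only cue value 112 appears in the sample; the second cue value must be added.
print(integrity_violations(load_trials(SAMPLE_CSV), valid_cues={112}))
```

The first sample row has a reaction time above 1 s but carries the slow-trial flag, so it is not reported; the same reaction time on an unflagged row would be.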

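The contradiction checks could be sketched along the following lines. Several points are assumptions on my part rather than settled details of the experiment: accuracy is taken to be 1 exactly when choice equals solution (which holds for every sample row shown earlier), the cue change at a change point is judged against the immediately preceding row, and cumulative reward is only required not to decrease, because the sample contains a flagged trial that earned 0 points. The function name is hypothetical.

```python
COLUMNS = ("choice", "accuracy", "solution", "cumulative_reward",
           "rt", "total_trial_time", "cp_with_slow_fast", "high_val_cue")

# Values copied from the first rows of the example data file.
SAMPLE_ROWS = [
    (0, 0, 1, 0, 1.057122, 2.711002, -1, 112),
    (0, 0, 1, 46, 0.2342849, 1.896353, 0, 112),
    (0, 1, 0, 101, 0.426306, 2.029746, 0, 112),
    (1, 0, 0, 139, 0.3260748, 1.660718, 0, 112),
]
trials = [dict(zip(COLUMNS, row)) for row in SAMPLE_ROWS]


def contradiction_violations(trials):
    """Return (row index, message) pairs for values that contradict each other."""
    problems = []
    for i, trial in enumerate(trials):
        # Assumption: a response is correct when it matches the solution.
        expected_accuracy = int(trial["choice"] == trial["solution"])
        if trial["accuracy"] != expected_accuracy:
            problems.append((i, "accuracy disagrees with choice and solution"))
        if trial["total_trial_time"] <= trial["rt"]:
            problems.append((i, "total_trial_time is not greater than rt"))
        if i == 0:
            continue
        previous = trials[i - 1]
        if trial["cumulative_reward"] < previous["cumulative_reward"]:
            problems.append((i, "cumulative_reward decreased"))
        # Assumption: at a change point the cue differs from the previous row.
        if (trial["cp_with_slow_fast"] == 1
                and trial["high_val_cue"] == previous["high_val_cue"]):
            problems.append((i, "change point without a change in high_val_cue"))
    return problems


print(contradiction_violations(trials))
```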

## Overall approach
Because it is feasible (timewise and computationally) to check all of my data, I will not select a subset for data auditing -- I'll audit all of it. I'll write detection/resolution scripts appropriate to the above anomalies, then re-run the detection scripts to ensure that all anomalies have been found and corrected. More detail/sample code for this will be included in the next edit of my data plan.  
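In the meantime, here is a minimal Python sketch of the detect, resolve, re-detect cycle for two of the anomaly types (duplicates and coverage). The function names are hypothetical, and the resolution rule (dropping the later copy of a duplicated row) is an assumption for illustration, not a settled policy. Note that a legitimately repeated trial should not be caught by the duplicate check, since its reaction time differs from the original attempt.

```python
def find_duplicates(trials):
    """Return indices of rows that repeat an earlier (rt, cumulative_reward) pair."""
    seen = set()
    duplicates = []
    for i, trial in enumerate(trials):
        key = (trial["rt"], trial["cumulative_reward"])
        if key in seen:
            duplicates.append(i)
        else:
            seen.add(key)
    return duplicates


def drop_rows(trials, indices):
    """Assumed resolution step: return a copy of trials without the given rows."""
    skip = set(indices)
    return [trial for i, trial in enumerate(trials) if i not in skip]


def shape_problems(trials, expected_rows, expected_columns):
    """Describe any mismatch with the expected trials-by-variables size."""
    problems = []
    if len(trials) != expected_rows:
        problems.append(f"expected {expected_rows} rows, found {len(trials)}")
    for i, trial in enumerate(trials):
        missing = [c for c in expected_columns if trial.get(c) in (None, "")]
        if missing:
            problems.append(f"row {i} is missing {missing}")
    return problems


# Made-up rows: the third row is an accidental repeat of the second.
trials = [
    {"rt": 1.057122, "cumulative_reward": 0},
    {"rt": 0.2342849, "cumulative_reward": 46},
    {"rt": 0.2342849, "cumulative_reward": 46},
    {"rt": 0.426306, "cumulative_reward": 101},
]

detected = find_duplicates(trials)        # detection pass
cleaned = drop_rows(trials, detected)     # resolution pass
remaining = find_duplicates(cleaned)      # re-run detection to confirm the fix
print(detected, remaining)
print(shape_problems(cleaned, expected_rows=3,
                     expected_columns=("rt", "cumulative_reward")))
```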

# d) Clean data table

My data table will be in the tidy data format, with columns for subject ID, condition, reaction time, accuracy, and choice. Each observation will form a single row, and the header of the data table will refer to the names of the variables. Each column will refer to a single variable, and variables will only be stored in columns (not rows). Because I only have one type of observational unit (the participant), I can store all of the data that I mentioned within a single data table. This approach is compliant with the tidy data format and avoids common problems resulting from messy data storage.
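A minimal Python sketch of assembling such a table from per-session data follows. The function names are hypothetical, the input values are made up, and the `trial` column (position within the session) is an addition of mine to keep rows identifiable; it is not one of the columns listed above.

```python
import csv
import io

# "trial" is an assumed extra column; the rest are the columns named in this plan.
TIDY_COLUMNS = ("subject", "condition", "trial", "rt", "accuracy", "choice")


def to_tidy(sessions):
    """Flatten (subject, condition, trials) tuples into one row per observation."""
    rows = []
    for subject, condition, trials in sessions:
        for trial_number, trial in enumerate(trials):
            rows.append({
                "subject": subject,
                "condition": condition,
                "trial": trial_number,
                "rt": trial["rt"],
                "accuracy": trial["accuracy"],
                "choice": trial["choice"],
            })
    return rows


def write_csv(rows):
    """Serialize tidy rows to CSV text with one header line of variable names."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=TIDY_COLUMNS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Made-up sessions for two hypothetical subjects.
sessions = [
    ("123", 0, [{"rt": 0.23, "accuracy": 0, "choice": 0},
                {"rt": 0.43, "accuracy": 1, "choice": 0}]),
    ("124", 1, [{"rt": 0.33, "accuracy": 0, "choice": 1}]),
]
tidy_rows = to_tidy(sessions)
print(write_csv(tidy_rows))
```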

# e) Hypotheses to test

*Mechanism*
<br>
Either the rate of evidence accumulation [drift rate, $v$] or the starting point for evidence accumulation [$z$] will vary with conflict, such that larger differences in value either increase the drift rate or bias the starting point toward the higher-value target, and smaller differences in value decrease the drift rate or decrease starting point bias (so that $z$ is closer to $a$/2).
<br>
$$v_{t+1} = \hat\beta*B_{t} + v_{t}$$
$$z_{t+1} = \hat\beta*B_{t} + z_{0}$$
<br>
The decision threshold [$a$] will increase as volatility increases and decrease as volatility decreases. Increased volatility will increase learning rates [$\beta$]. 

$$a_{t+1} = \hat\beta*\Omega_{t} + a_{0}$$

Decision threshold and drift rate adaptation will likely combine to drive behavior. The threshold shift would be driven by change point probability and would affect both targets; alternatively, belief in the reward difference could shift the starting point, biasing the decision toward the high-value target. The drift rate change will be driven by belief in the reward difference.
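To make the three update rules concrete, here is a small Python sketch that applies them for a single trial. The numeric values are made up purely for illustration, the function name is hypothetical, and a single $\hat\beta$ is shared across the three rules only because the equations above are written that way; separate weights per parameter would be an equally plausible reading.

```python
def update_parameters(v_t, z_0, a_0, beta_hat, belief, change_point_prob):
    """Apply the three hypothesized update rules for one trial.

    v_{t+1} = beta_hat * B_t     + v_t
    z_{t+1} = beta_hat * B_t     + z_0
    a_{t+1} = beta_hat * Omega_t + a_0
    """
    v_next = beta_hat * belief + v_t
    z_next = beta_hat * belief + z_0
    a_next = beta_hat * change_point_prob + a_0
    return v_next, z_next, a_next


# Made-up values: moderate belief in a reward difference, low change point probability.
v_next, z_next, a_next = update_parameters(
    v_t=1.0, z_0=0.5, a_0=1.0, beta_hat=0.5, belief=0.2, change_point_prob=0.1
)
print(v_next, z_next, a_next)
```

Under these rules a larger belief $B_t$ raises both the drift rate and the starting point, while a larger change point probability $\Omega_t$ raises only the threshold.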

*Behavior*
<br>
As a consequence of the above mechanisms, I predict that accuracy will decrease as volatility and conflict increase. Reaction times will increase more quickly under high volatility than under high conflict; high conflict should instead produce a slow increase in reaction time as the learner disambiguates the value difference between targets.


# f) Data visualization approach

Because I'm interested in learning (i.e., change over time), I plan to use changepoint-locked time series plots of subject- and epoch-averaged reaction times (in seconds) and accuracies (as probability of correct selection), with 95% confidence intervals shown as shaded bands. I may use bootstrapped confidence intervals if my data are not normally distributed. I plan to show the window from the trial before the change point to 20 trials after it. The same approach can be used for my model parameters ($a$, $v$, $z$). Below is an example of this plotting style using simulated data:
![image.png](hvhc_acc_tc.png)
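The averaging behind such a plot could be sketched as follows. This is a minimal Python illustration under stated assumptions: the function names are hypothetical, the data are synthetic, the interval is a normal-approximation 95% interval (mean ± 1.96 standard errors) rather than a bootstrapped one, and epochs that run into the next change point are not truncated.

```python
import statistics


def changepoint_locked(values, change_points, before=1, after=20):
    """Collect values at each lag from `before` trials pre- to `after` post-change."""
    epochs = {lag: [] for lag in range(-before, after + 1)}
    for cp in change_points:
        for lag in epochs:
            index = cp + lag
            if 0 <= index < len(values):   # skip lags that fall outside the session
                epochs[lag].append(values[index])
    return epochs


def mean_ci(samples, z=1.96):
    """Mean with a normal-approximation confidence interval (low, high)."""
    mean = statistics.fmean(samples)
    if len(samples) < 2:
        return mean, mean, mean            # no spread estimate from a single sample
    sem = statistics.stdev(samples) / len(samples) ** 0.5
    return mean, mean - z * sem, mean + z * sem


def summarize(epochs):
    """Map each lag with data to its (mean, low, high) summary."""
    return {lag: mean_ci(samples) for lag, samples in epochs.items() if samples}


# Synthetic values standing in for reaction times; change points are made up.
values = [float(i) for i in range(50)]
epochs = changepoint_locked(values, change_points=[5, 30])
summary = summarize(epochs)
print(summary[0])
```

The `summary` dictionary holds, for each lag, the line value and the lower and upper edges of the shaded band.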

I also plan to show the full distributions of my parameters, reaction time, and accuracy data as a function of condition, perhaps using a separate ridgeline plot for each variable (though I need to look into this further). I think that this type of plot could be used to illustrate distribution-level differences over the full set of conditions. While unrelated to my data, here's a sample of the ridgeline plot style from rstudio.com:
![image.png](https://rviews.rstudio.com/post/2017-10-19-Top-40_files/ggridges.png)

Additionally, I am considering using a deviation plot to show how well ideal observer estimates portray my observed data, but I need to think about this more. I could simply plot the ideal observer values against observations, with the identity line representing a perfect correspondence between the two, but I think that there is likely a more creative/interesting visualization approach that I could use.