# Final project description

The final project should be submitted before the **30th of June**.
The format should be an **R script** (one per group) that runs without error on any laptop after changing only the path of the folder in which you put the data files of the different subjects.
Please comment your script where necessary, and add your conclusions about the statistical tests that you run as comments as well.

The final project consists of three essential steps, plus an extra step as a bonus:

1. Data parsing and loading
2. Data visualization
3. Multi-level regression
4. EZ-Diffusion analysis (as a bonus)

You are divided into 5 groups. Each group has its own folder (GroupA, GroupB, GroupC, GroupD, and GroupE) that you can find on this Dropbox link: https://www.dropbox.com/sh/086fvtv0ivnscso/AADlvjuRxVK1C3HH8nmPqTBSa?dl=0. Each folder consists of a `readme.txt` file and a `Data` folder. The description of each dataset is presented in the `readme.txt` file in each folder and the recorded data for each individual is located in the `Data` folder as separate files.

![alt text](img/groups.png "Title")

# Step 1: Data loading and wrangling

This step has two parts:

- Data wrangling
- Excluding some participants and trials

For doing so, you should load the data files that are located in the `Data` folder. Each group should load and process its own dataset. The output data frame should consist of the following columns:

- participant
- block_number
- trial_number
- condition
- dots_direction
- response
- accuracy
- rt

Calculate a summary that includes the average performance (accuracy) and RT per participant, as well as the percentage of trials below 150 ms (too-fast trials) and above 5000 ms (too-slow trials). Are there any participants with more than 10% fast or slow trials?

Then, exclude participants whose accuracy is below 65% from the dataset. Trials with a reaction time below 150 ms or above 5000 ms should also be excluded.
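As a rough illustration of the summary and the two exclusion rules, here is a minimal sketch using `dplyr`. The toy data frame is made up for the example; it assumes that `accuracy` is coded 0/1 and that `rt` is in milliseconds, which you should check against your own `readme.txt`.

```r
library(dplyr)

# Toy stand-in for the combined data frame (hypothetical values);
# accuracy is assumed to be coded 0/1 and rt to be in milliseconds.
data <- data.frame(
  participant = rep(1:3, each = 20),
  accuracy    = c(rep(c(1, 1, 1, 1, 0), 4), rep(c(1, 0), 10), rep(1, 20)),
  rt          = rep(c(100, 400, 600, 800, 6000, 500, 450, 700, 650, 550), 6)
)

# Per-participant summary: mean accuracy, mean RT, and share of too-fast / too-slow trials
summary_df <- data %>%
  group_by(participant) %>%
  summarise(
    mean_accuracy = mean(accuracy),
    mean_rt       = mean(rt),
    pct_fast      = 100 * mean(rt < 150),
    pct_slow      = 100 * mean(rt > 5000),
    .groups = "drop"
  )
print(summary_df)

# Participants to keep: accuracy of at least 65%
keep <- summary_df$participant[summary_df$mean_accuracy >= 0.65]

# Drop low-accuracy participants, then drop too-fast and too-slow trials
data_clean <- data %>%
  filter(participant %in% keep, rt >= 150, rt <= 5000)
```

In this sketch the participant-level accuracy is computed before the trial-level exclusion; whether you prefer the opposite order is a choice you should state in your comments.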

**Note**: instead of writing out all the data paths yourself, you can use the `list.files` function: https://www.math.ucla.edu/~anderson/rw1001/library/base/html/list.files.html

Here is a little tutorial on how to import the files. Instead of using the `fread` function, you should use `read_delim`. The tutorial also does not cover how to add a column with the participant number; you should find a way to do that yourself.

```r
# example of how to define lists of files
data_folder = '~/Dropbox/teaching/r-course21/GroupData/GroupA/Data/'

list_files = paste(data_folder, list.files(path=data_folder), sep="")

list_files[1:5] # show the first 5
```
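One possible way to read every file with `read_delim` and tag the rows with a participant number is sketched below. To keep the example runnable, it first writes two tiny files to a temporary folder. The file names, column names, and the tab delimiter are assumptions made for the illustration; the real format is described in your group's `readme.txt`.

```r
library(readr)

# Write two tiny example files to a temporary folder so the sketch is runnable;
# file names, columns, and the tab delimiter are assumptions, not the real format.
data_folder <- file.path(tempdir(), "example_data")
dir.create(data_folder, showWarnings = FALSE)
write_delim(data.frame(trial_number = 1:3, rt = c(512, 430, 655)),
            file.path(data_folder, "subject_01.txt"), delim = "\t")
write_delim(data.frame(trial_number = 1:3, rt = c(720, 388, 501)),
            file.path(data_folder, "subject_02.txt"), delim = "\t")

list_files <- list.files(path = data_folder, full.names = TRUE)

# Read each file and add a participant column
# (here simply the position of the file in the list)
data_list <- lapply(seq_along(list_files), function(i) {
  d <- read_delim(list_files[i], delim = "\t", col_types = cols())
  d$participant <- i
  d
})

# Stack all participants into one data frame
data <- do.call(rbind, data_list)
print(data)
```

Using the position in the file list as the participant number is only one option; if the file names encode a subject ID, extracting it from the name may be more robust.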

# Step 2: Data visualization

Now it's time to visualize the data set. In particular, we want to have a look at how the performance and reaction time evolve across the blocks. 

For this purpose, you should make a 2-by-1 grid plot that depicts the reaction time (top panel) and accuracy (bottom panel) across the blocks for each condition. An example of this plot is shown here (each line corresponds to one condition):

![alt text](img/plot-example.png "Title")
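A plot of this kind could be built roughly as follows with `ggplot2`. This is only a sketch on simulated data: the condition labels, the number of blocks, and the choice of `facet_wrap` to stack the two panels are assumptions, and other approaches (for example, combining two separate plots) work just as well.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Simulated stand-in for the cleaned data (hypothetical values and condition labels)
set.seed(1)
data <- expand.grid(participant = 1:10, block_number = 1:5,
                    trial_number = 1:20, condition = c("easy", "hard"))
data$rt       <- rlnorm(nrow(data), meanlog = 6.6 - 0.03 * data$block_number, sdlog = 0.3)
data$accuracy <- rbinom(nrow(data), 1, ifelse(data$condition == "easy", 0.9, 0.75))

# Average per block and condition, then reshape so each measure gets its own panel
plot_df <- data %>%
  group_by(block_number, condition) %>%
  summarise(rt = mean(rt), accuracy = mean(accuracy), .groups = "drop") %>%
  pivot_longer(c(rt, accuracy), names_to = "measure", values_to = "value") %>%
  mutate(measure = factor(measure, levels = c("rt", "accuracy")))  # rt panel on top

# One line per condition, one panel per measure, panels stacked in a single column
p <- ggplot(plot_df, aes(x = block_number, y = value, color = condition)) +
  geom_line() +
  geom_point() +
  facet_wrap(~ measure, ncol = 1, scales = "free_y") +
  labs(x = "Block", y = NULL, color = "Condition")
print(p)
```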

# Step 3: Multi-level regression

Now, run a multi-level regression model. Use reaction time as the predicted variable and block number as the predictor (fixed effect), and include random effects for participant and block number (i.e., intercepts and block-number slopes that vary by participant). Do response times decrease across the blocks?
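One common way to fit such a model in R is `lmer` from the `lme4` package. The sketch below uses simulated data and assumes a random intercept and a random block-number slope per participant; this is one reading of the model description, not the only possible specification.

```r
library(lme4)

# Simulated stand-in data (hypothetical): RTs in ms that shrink slightly over blocks,
# with participant-specific baselines and slopes
set.seed(1)
n_participants <- 20
data <- expand.grid(participant = factor(1:n_participants),
                    block_number = 1:6, trial_number = 1:30)
baseline <- rnorm(n_participants, 700, 80)
slope    <- rnorm(n_participants, -15, 5)
idx      <- as.integer(data$participant)
data$rt  <- baseline[idx] + slope[idx] * data$block_number + rnorm(nrow(data), 0, 100)

# Fixed effect of block number; random intercept and random block slope per participant
model <- lmer(rt ~ block_number + (1 + block_number | participant), data = data)
print(summary(model))

# The sign of the fixed-effect slope indicates whether RTs fall across blocks
print(fixef(model)["block_number"])
```

Note that `lme4` does not report p-values for the fixed effects by default; the `lmerTest` package is one option if you want them.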

Finally, run a repeated-measures ANOVA comparing the first and the last block of trials. Did participants' performance (accuracy level) increase significantly?
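A repeated-measures ANOVA of this kind can be sketched in base R with `aov` and an `Error` term. The data below are simulated, and averaging accuracy per participant and block before running the test is an assumption about how you want to set up the comparison.

```r
# Simulated stand-in data (hypothetical): accuracy (0/1) in a first and a last block
set.seed(1)
data <- expand.grid(participant = 1:15, block_number = c(1, 6), trial_number = 1:40)
data$accuracy <- rbinom(nrow(data), 1, ifelse(data$block_number == 1, 0.70, 0.85))

# Keep only the first and last block, then average accuracy per participant and block
first_last <- data[data$block_number %in% range(data$block_number), ]
agg <- aggregate(accuracy ~ participant + block_number, data = first_last, FUN = mean)
agg$participant  <- factor(agg$participant)
agg$block_number <- factor(agg$block_number)

# Repeated-measures ANOVA with block as the within-participant factor
rm_anova <- aov(accuracy ~ block_number + Error(participant / block_number), data = agg)
print(summary(rm_anova))
```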

# Step 4: EZ-Diffusion analysis


The Drift Diffusion Model (DDM) assumes that, when making a decision between two options, noisy evidence in favor of one over the other option is integrated over time until a pre-set threshold is reached. This threshold indicates how much of this relative evidence is enough to initiate a response. Since the incoming evidence is noisy, the integrated evidence becomes more reliable as time passes. Therefore, higher thresholds lead to more accurate decisions. However, the cost of increasing the threshold is an increase of decision time. In addition, difficulty affects decisions: When confronted with an easier choice (e.g., between a very good and a very bad option), the integration process reaches the threshold faster, meaning that less time is needed to make a decision and that decisions are more accurate. The DDM also assumes that a portion of the RTs reflects processes that are unrelated to the decision time itself, such as motor processes, and that can differ across participants. Because of this dependency between noise in the information, accuracy, and speed of decisions, the DDM is able to simultaneously predict the probability of choosing one option over the other (i.e., accuracy) and the shape of the two RT distributions corresponding to the two choice options. Importantly, by fitting the standard DDM, we assume that repeated choices are independent of each other, and discard information about the order of the choices and the feedback after each choice. 
To formalize the described cognitive processes, the simple DDM has four core parameters: the drift rate $v$, which describes how fast evidence is integrated; the threshold $a$ (with $a > 0$), which is the amount of integrated evidence necessary to initiate a response; the starting-point bias, which is the evidence in favor of one option prior to evidence accumulation; and the non-decision time $T_{er}$ (with $0 \leq T_{er} < RT_{min}$), the part of the response time that is not strictly related to the decision process ($RT = \text{decision time} + T_{er}$). Because, in our case, the position of the options was randomized to either the left or the right screen position, we assume no starting-point bias and only consider drift rate, threshold, and non-decision time. The following figure illustrates this model:

![alt text](img/DDM.png "Title")

The simplest version of the DDM is the EZ-diffusion model. This model assumes no bias toward either option and has no across-trial variability parameters, so its parameters can be obtained directly from the formulas below. For more information, see: Wagenmakers, E. J., Van Der Maas, H. L., & Grasman, R. P. (2007). An EZ-diffusion model for response time and accuracy. Psychonomic Bulletin & Review, 14(1), 3-22.

![alt text](img/EZ.png "Title")

$$\operatorname{logit}(P_c) = \log\Big(\frac{P_c}{1-P_c}\Big)$$

$$v = \operatorname{sign}\Big(P_c - \frac{1}{2}\Big) \Big[\frac{\operatorname{logit}(P_c)\big(P_c^2 \operatorname{logit}(P_c) - P_c \operatorname{logit}(P_c) + P_c - \frac{1}{2}\big)}{VRT}\Big]^{\frac{1}{4}}$$

$$a = \frac{\operatorname{logit}(P_c)}{v}$$

$$MDT = \Big(\frac{a}{2v}\Big)\frac{1 - \exp(-va)}{1 + \exp(-va)}$$

$$NDT = MRT - MDT$$

- $P_c$ : probability of correct answer

- $v$ : drift rate (rate of accumulating the information)

- $a$ : boundary separation (amount of information which is needed for making a decision)

- $NDT$ : non-decision time (the time needed to encode the stimulus plus the motor time needed to press the key)

- $MRT$ : average of reaction times

- $VRT$ : variance of reaction times

- $MDT$ : average of decision times ($MRT = NDT + MDT$)

Define an EZ-diffusion function based on the formulas above. Analyze your dataset and obtain the drift rate ($v$), boundary separation ($a$), and non-decision time ($NDT$) for each participant.

You should define a function called `EZ_diffusion` that takes `P_c`, `MRT`, and `VRT` as arguments and returns `v`, `a`, and `ndt`.

Write a loop that calculates these parameters for each participant and prints the results (the estimated parameters).
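To make the formulas concrete, here is a minimal sketch of such a function and loop. It follows the equations exactly as written above (no separate scaling parameter) and runs on simulated data. The assumptions are that `rt` is in seconds, that `accuracy` is coded 0/1, and that `MRT` and `VRT` are computed over all of a participant's trials. Accuracies of exactly 0, 0.5, or 1 are not handled, because the logit or the division by $v$ is undefined there.

```r
# EZ-diffusion estimates from proportion correct (P_c), mean RT (MRT) and RT variance (VRT).
# Follows the formulas as given in the text; P_c of exactly 0, 0.5 or 1 is not handled.
EZ_diffusion <- function(P_c, MRT, VRT) {
  L   <- log(P_c / (1 - P_c))                          # logit(P_c)
  x   <- L * (P_c^2 * L - P_c * L + P_c - 0.5) / VRT
  v   <- sign(P_c - 0.5) * x^(1 / 4)                   # drift rate
  a   <- L / v                                         # boundary separation
  y   <- -v * a
  MDT <- (a / (2 * v)) * (1 - exp(y)) / (1 + exp(y))   # mean decision time
  ndt <- MRT - MDT                                     # non-decision time
  c(v = v, a = a, ndt = ndt)
}

# Simulated stand-in data (hypothetical): accuracy coded 0/1, rt in seconds
set.seed(1)
data <- data.frame(
  participant = rep(1:3, each = 200),
  accuracy    = rbinom(600, 1, 0.8),
  rt          = 0.3 + rgamma(600, shape = 2, rate = 5)
)

# Estimate the parameters per participant and print them
for (p in unique(data$participant)) {
  d   <- data[data$participant == p, ]
  est <- EZ_diffusion(P_c = mean(d$accuracy), MRT = mean(d$rt), VRT = var(d$rt))
  cat("participant", p, ":", round(est, 3), "\n")
}
```

Returning a named vector keeps the loop short; collecting the estimates in a data frame with one row per participant would be an equally reasonable design.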