**Course organisers**

Jan Grohn (jan.grohn@psy.ox.ac.uk), Miriam Klein-Flügge (miriam.klein-flugge@psy.ox.ac.uk)  


# Introduction

Many experiments in learning and decision making require the estimation of values and how these change over time. Learning which stimuli and actions have higher value allows humans and other animals to select which stimuli to approach or avoid, and which actions to take.

Some of the most commonly used models of learning and reinforcement were first developed by **Bush and Mosteller** in the 1950s, and further elaborated on by **Rescorla and Wagner** in the 1970s.

They have since developed into the field of **Reinforcement Learning** in computer science.

*Our aims for today’s session are:*

1.  get to grips with Python, which we’ll be using throughout the four practicals for modelling and analysing data
2.  set up an experimental paradigm of learning and decision making, which we can use to test how model parameters are affected by manipulating subjects’ stress levels
3.  code a simple reinforcement learning model that can be used to explain behaviour in this task
4.  understand how changing parameters in this model affects its behaviour

Parts of the session where you’re being asked to do something are indicated with an arrow (→).

Use the text cells in this notebook to type in your answers.

Note that the final section (marked ***) is slightly harder. It’s not essential to understand it to progress with the later sessions in the course. However, it provides a theoretical foundation for the analyses that you’ll do in later practicals and can be a fun exercise to do!

# Section 1: Getting started with Python and Colab notebooks

This notebook is designed to run in a hosted Colab runtime. It is also possible to download the notebook and run it locally, but we do not recommend this: some of the code below won't run out of the box, so you would probably have to change it.

To ensure that the edits you make to your copy of the notebook are being saved, you can click on *File > Save a copy in Drive* in the top left corner.

To run a code cell in this notebook, press Shift + Enter or click the play symbol that appears when you hover over the code cell with your cursor. Run the next code cell now, which should produce some text output.

In [None]:
# @title
print('hello hello')

You can open or close code cells with a title (such as the code cell below) by double clicking on the title, by clicking the ►/▼ symbol to the left of the title, or by clicking the 'Show code' hyperlink.

In [None]:
#@title ### This cell has some hidden code

print('hello again')

If you encounter a Python function you don't know yet or you would like to know more about, you can type `?` followed by the function name (e.g. `?print`) into a new code cell. A new code cell can be created by selecting *Insert > Code cell* from the menu bar in the top left, or by hovering between existing cells with your cursor and clicking the *+ Code* icon that appears.

Create a code cell below this cell and view the help text for the `print` function now.



You can view a **Table of Contents** by clicking the icon in the top left with the three dots next to three lines.

In Python we can load in functions that other people wrote so that we don't have to code up everything from scratch. These sets of functions are organised in *libraries*, and they are loaded into Python with the `import` command, followed by the name of the library. We can also specify a different name to call the library by internally, using the `as` keyword. For example, `import numpy as np` loads in the `numpy` library, which we can then access using the `np` prefix (e.g. `np.ones`). We can also load in a specific function from a library by using the `from` statement.
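As a small illustration of the import forms described above, the sketch below uses the standard-library `math` module (chosen here only because it is available everywhere; it is not one of the libraries used in this practical). The variable names are made up for the example.

```python
# Three ways of reaching the same function.
import math              # access functions with the full library name
import math as m         # same library under a shorter alias
from math import sqrt    # load one specific function

root_full = math.sqrt(16)
root_alias = m.sqrt(16)
root_direct = sqrt(16)

# all three calls run the same function, so the results are identical
print(root_full, root_alias, root_direct)
```

Whichever form is used, the function that runs is the same; the forms differ only in how much you have to type and in how obvious it is where a function came from.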

Now run the next code cell, which loads some libraries that we will be using throughout the course.

In [None]:
#@title ## Import libraries and set global parameters

# numpy is a library used to do all kinds of mathematical operations
import numpy as np

# this allows us to make interactive figures
from google.colab import output
output.enable_custom_widget_manager()

# seed the random number generator
rng = np.random.default_rng(12345)

# load in some custom functions for this block practical
!rm -r *
!git clone https://github.com/jangrohn/ComputationalmodelingBlockPractical
!cp -R ComputationalmodelingBlockPractical/session1/ session1
!rm -rf ComputationalmodelingBlockPractical
from session1 import plotting # type: ignore

The last few lines of code in the previous cell load some custom functions we wrote for this block practical, which we will be using throughout this session and the following ones. If you're interested in examining the code further, you can find it at https://github.com/JanGrohn/ComputationalModelingBlockPractical, or you can click the folder icon in the sidebar to the left and look through the scripts there.

# Section 2: Running a simple experiment

You can try out the task we will be analysing in this block practical at https://jangrohn.github.io/volatility_study/. Play through the task as best you can to get a feel for the study. At the beginning of the task, choose one of the two conditions at random; if you have time, restart the task afterwards and play the other condition too. The task doesn't save any of your data, but you will have the option to download it once you've finished.

# Section 3: Coding a simple reinforcement learning model in Python, and understanding the effects of varying learning rates

*How did you learn whether green or orange was more likely to be rewarded?*

→ type your answer here

*How did you weigh this up against the number of points available when making your choices?*

→ type your answer here

*What do you think was the difference between condition 1 and condition 2 in the task?*

→ type your answer here

*Why do you think we included points – rather than simply letting participants learn which option has a higher reward probability?*

→ type your answer here

Computational modelling tries to describe these questions using mathematical equations. We can then provide a *quantitative* answer to each of them, using *parameters* from the model.

*What do you think might be the difference between a model parameter and a variable? Discuss your answer with your partner.*

→ type your answer here

Reinforcement learning models provide a way of tracking the probabilities of different outcomes when different actions are taken.

In this task, one of the key problems is to estimate *the probability that green will be rewarded* on each trial. (Note that this is the *same* as 1 minus the probability that blue will be rewarded, so we only need to track one probability.)

We can learn this probability by calculating a *prediction error* on each trial:

$$
\underbrace{\delta_t}_\textrm{prediction error} = \underbrace{o_t}_\textrm{outcome} - \underbrace{p_t}_\textrm{model prediction} \tag{Equation 1}
$$

where $\delta_t$ is the prediction error on trial $t$, $o_t$ is the outcome (1 if green was rewarded, 0 if blue was rewarded) on that trial, and $p_t$ is the current estimate of the probability that green will be rewarded (this is called `probOpt1` in the Python code). Have a think about what a prediction error might look like on a trial that does give reward and on a trial where no reward is obtained. What sign does it take in each case?

We then use this prediction error to *update our expectation* of how likely it is that green will be rewarded in the future.

$$
\underbrace{p_{t+1}}_\textrm{new prediction} = \underbrace{p_t}_\textrm{old prediction} + \underbrace{\alpha \delta_t}_\textrm{scaled prediction error} \tag{Equation 2}
$$

where $\alpha$ is a parameter called the *learning rate*, which takes values in the range $0 < \alpha \leq 1$. In your head, simulate different scenarios of this equation with a large (1) or small (0.1) learning rate and a positive or negative prediction error. What does the update look like in each case, and how do the cases differ?
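As one worked case (the numbers are chosen purely for illustration), suppose the current estimate is $p_t = 0.5$, green is rewarded so that $o_t = 1$, and the learning rate is small, $\alpha = 0.1$. Applying equations 1 and 2 in turn gives:

$$
\delta_t = 1 - 0.5 = 0.5, \qquad p_{t+1} = 0.5 + 0.1 \times 0.5 = 0.55
$$

The remaining combinations (a large learning rate, or an outcome of 0) follow the same two steps.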

The learning rate sets the *speed* at which the model learns from previous experience. Let’s explore directly in Python what happens when we vary the learning rate.

## Learning rates, and how they affect probability learning

The next code cell sets up, and plots, a simple ‘reward schedule’ where the true probability of green being rewarded is fixed at 0.8. Read through this code and try to understand what it is doing.

Once you have read through the code, run it to produce the figure.

In [None]:
#@title ### Learning with a fixed schedule

def generate_schedule(trueProbability, rng = rng):
  '''
  Returns whether option 1 (1) or option 2 (0) is rewarded on each trial

    Parameters:
        trueProbability(float array): The probability with which option 1 is
          rewarded on each trial
        rng (numpy random number generator, defaults to rng)

    Returns:
        opt1Rewarded(int array): 1 if option 1 is rewarded on a trial, 0 if
          option 2 is rewarded on a trial
  '''
  # We'll simulate whether opt 1 was rewarded on every trial. For each trial, we
  # first generate a uniformly distributed random number between 0 and 1.
  randomNumbers = rng.random(len(trueProbability))

  # The trial is rewarded if that number is smaller than trueProbability, and
  # unrewarded otherwise.
  opt1Rewarded = randomNumbers < trueProbability

  # We return the outcome of this comparison (which is either True or False for
  # each trial) as an integer (which is 0 or 1 for each trial).
  return opt1Rewarded.astype(int)

# this is the true probability that green is rewarded
fixedProb = 0.8

# this is the number of trials
nTrials = 200

# reward probability on each trial
trueProbability = np.ones(nTrials, dtype = float) * fixedProb

# generate outcomes on each trial
opt1Rewarded = generate_schedule(trueProbability)

# visualise which option was rewarded on each trial
plotting.plot_schedule(opt1Rewarded, trueProbability)

The dots are at 1 every time that green is rewarded, and at 0 every time that blue is rewarded. The black dotted line is the true probability of reward (which the subject doesn’t know in the experiment).
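Because each outcome is an independent random draw, the fraction of trials on which green is rewarded only approximates the true probability, and the approximation gets better as the number of trials grows. The sketch below illustrates this using the same comparison rule as `generate_schedule`; the names `demo_rng`, `demo_prob` and `rewarded_fraction` are hypothetical and are used only here, with a separate random number generator so that the notebook's own `rng` is left untouched.

```python
import numpy as np

# Hypothetical names for illustration only.
demo_rng = np.random.default_rng(0)
demo_prob = 0.8  # true probability that option 1 is rewarded

def rewarded_fraction(n_trials):
    # a trial is rewarded if a uniform random draw falls below demo_prob
    draws = demo_rng.random(n_trials)
    return np.mean(draws < demo_prob)

# the observed fraction fluctuates for short schedules and settles
# close to demo_prob for long ones
for n in (20, 200, 20000):
    print(n, rewarded_fraction(n))
```

This is one reason why a learner cannot simply read the true probability off a handful of recent outcomes.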

The function `RL_model` defined in the next cell takes as its input: whether green is rewarded on each trial (`opt1Rewarded`), what the model’s learning rate $\alpha$ is set to, and what the starting probability on the first trial is ($p_1$). It should return `probOpt1`, the model’s estimate of the probability of green being rewarded on each trial, as its output. However, the final two equations in the function haven’t been completed. Open the next cell and complete the missing lines of code.

Once you’ve done this, you should be able to run this cell without receiving any errors, and the figure should now have a red trace that approaches the true probability:

In [None]:
#@title ### Simulating the RL model
def RL_model(opt1Rewarded, alpha, startingProb = 0.5):
  '''
  Returns how likely option 1 is rewarded on each trial.

    Parameters:
        opt1Rewarded(int or bool array): 1 (True) if option 1 is rewarded on a
          trial, 0 (False) if option 2 is rewarded on a trial.
        alpha(float): fixed learning rate, greater than 0, less than/equal to 1
        startingProb(float): starting probability (defaults to 0.5).

    Returns:
        probOpt1(float array): how likely option 1 is rewarded on each trial
          according to the RL model.
  '''

  # check that alpha has been set appropriately
  assert alpha > 0, 'Learning rate (alpha) must be greater than 0'
  assert alpha <= 1,'Learning rate (alpha) must be less than or equal to 1'

  # check that startingProb has been set appropriately
  assert startingProb >= 0, 'startingProb must be greater than or equal to 0'
  assert startingProb <= 1, 'startingProb must be less than or equal to 1'

  # calculate the number of trials
  nTrials = len(opt1Rewarded)

  # pre-create some vectors we're going to assign into
  probOpt1 = np.zeros(nTrials, dtype = float)
  delta    = np.zeros(nTrials, dtype = float)

  # set the first trial's prediction to be equal to the starting probability
  probOpt1[0] = startingProb

  # students, complete this code to finish the reinforcement learning model
  for t in range(nTrials-1): # loop over trials
    delta[t] =      # COMPLETE THIS LINE using opt1Rewarded, probOpt1 and equation 1
    probOpt1[t+1] = # COMPLETE THIS LINE  using probOpt1, delta, alpha and equation 2


  return probOpt1

# this defines the model's estimated probability on the very first trial
startingProb = 0.5

# this is the model's learning rate
alpha = 0.05

# run the RL model
probOpt1 = RL_model(opt1Rewarded, alpha, startingProb)

plotting.plot_schedule(opt1Rewarded, trueProbability, probOpt1)

Now try playing around with the three parameters `trueProb`, `startingProb`, and `alpha`, using the sliders below. In particular, try to understand the effects of varying `alpha`. What are the advantages of having a low $\alpha$? What are the advantages of having a high $\alpha$? If you set $\alpha$ to 1, how does the model behave?


In [None]:
plotting.plot_interactive_RL_model(opt1Rewarded, trueProbability, RL_model, generate_schedule)

→ type your answer here

## Using a reversal schedule
In the experiment you performed earlier, the reward probability isn’t fixed: it reverses at various points during the experiment, so at some points green is more likely to be rewarded, but at other points blue is more likely to be rewarded.

Let’s now generate a schedule in which such reversals take place.

In [None]:
#@title ### Regenerating the schedule and the RL model

# now we use a schedule with some reversals
trueProbability = np.concatenate((np.ones(25,  dtype = float)*0.25,
                                  np.ones(25,  dtype = float)*0.75,
                                  np.ones(25,  dtype = float)*0.25,
                                  np.ones(25,  dtype = float)*0.75,
                                  np.ones(100, dtype = float)*0.25))

opt1Rewarded = generate_schedule(trueProbability)

plotting.plot_interactive_RL_model(opt1Rewarded, trueProbability, RL_model, generate_schedule, change_trueProb = False)

Change the parameters above to see whether your reinforcement learner can keep track of this changing probability.
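The schedule used above is built from five blocks of constant probability. As a sketch of an equivalent, more compact construction, the snippet below uses `np.repeat` and then locates the trials on which the probability reverses. The names `blockProbs`, `blockLengths`, `demoSchedule` and `reversalTrials` are hypothetical and are not used elsewhere in the notebook.

```python
import numpy as np

# Hypothetical names for illustration: one reward probability per block,
# and the number of trials in each block.
blockProbs   = [0.25, 0.75, 0.25, 0.75, 0.25]
blockLengths = [25, 25, 25, 25, 100]

# np.repeat repeats each probability for the length of its block
demoSchedule = np.repeat(blockProbs, blockLengths)

# indices of the first trial after each change in reward probability
reversalTrials = np.flatnonzero(np.diff(demoSchedule)) + 1

print(len(demoSchedule))
print(reversalTrials)
```

Listing the reversal points makes the structure of the schedule explicit: the probability changes every 25 trials during the first 100 trials and then stays constant for the final 100.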

In the above schedule, there are periods where the reward environment is volatile (it reverses quite frequently) and other periods where the reward environment is stable (it doesn't change very frequently). Our hypothesis is that people adjust their learning rates depending upon the stability of the environment.

Let's assume that subjects want to get their estimated probability as close to the true probability as possible. Based upon what you learnt about varying $\alpha$, in which environment might it be helpful to have a lower $\alpha$? In which environment might it be helpful to have a higher $\alpha$?

→ type your answer here

The reasoning behind why you might need to have a different $\alpha$ in different situations is discussed more fully in the following paper:

Behrens, T. E. J., Woolrich, M. W., Walton, M. E., & Rushworth, M. F. S. (2007). Learning the value of information in an uncertain world. Nature Neuroscience, 10(9), 1214–1221. http://doi.org/10.1038/nn1954

# ***Section 4: Understanding learning rates as weighted sums

The key insight here is that the learning rate, $\alpha$, determines how the estimated probability on the next trial, $p_{t+1}$, is influenced by the past history of trials. We can think about this in another way. Look back at equations 1 and 2. $p_{t+1}$ depends upon the outcome of the most recent trial, $o_t$, but also on the last trial’s probability estimate, $p_t$. However, $p_t$ could itself be written in terms of the previous trial’s outcome, $o_{t-1}$, and $p_{t-1}$. In turn, $p_{t-1}$ could be written in terms of $o_{t-2}$ and $p_{t-2}$. And so on.

In short, it becomes possible, when you are at trial $T$, to think of $p_{T+1}$ as a weighted sum of all the previous outcomes that the model experienced:

$$
p_{T+1} = (1-\alpha)^T p_1 + \sum^T_{t=1} w_t o_t \tag{Equation 3}
$$

where $p_1$ is the starting probability (this becomes less and less important as we get further away from it, as $(1-\alpha)^T$ shrinks to zero), and $w_t$ is the weight that the outcome on trial $t$ receives in the current estimate:

$$
w_t = \alpha(1-\alpha)^{T-t} \tag{Equation 4}
$$

$T$ is the current trial number, and so $T-t$ indexes how many trials into the past we are looking. In the final exercise, we will see if you can derive this equation step by step.
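One property of equations 3 and 4 that can be checked numerically is that the weights on the outcomes, together with the weight $(1-\alpha)^T$ on the starting probability, add up to 1, so $p_{T+1}$ is a weighted average of $p_1$ and the past outcomes. The sketch below checks this for one arbitrary choice of values; `demoAlpha`, `T`, `weights` and `startWeight` are hypothetical names used only here.

```python
import numpy as np

# Hypothetical values for illustration only.
demoAlpha = 0.2   # learning rate
T = 10            # current trial number

t = np.arange(1, T + 1)                            # trials 1..T
weights = demoAlpha * (1 - demoAlpha) ** (T - t)   # equation 4
startWeight = (1 - demoAlpha) ** T                 # weight on p_1 in equation 3

# the most recent outcome (t = T) gets weight alpha; earlier outcomes
# get geometrically smaller weights
print(weights)

# outcome weights plus the starting-probability weight add up to 1
print(weights.sum() + startWeight)
```

The sum follows from the geometric series, so it holds for any $0 < \alpha \leq 1$ and any $T$, not just the values chosen here.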

What does $w_t$ actually look like? This is explored in the next cell, which plots equation 4:

In [None]:
plotting.plot_RL_weights()

This plot is also discussed at the beginning of the following book chapter (which is available as a PDF online at http://www.princeton.edu/~ndaw/dt.pdf):

Daw, N. D. and Tobler, P. N. (2014) Value learning through Reinforcement: The Basics of Dopamine and Reinforcement Learning. Chapter 15, Neuroeconomics (2nd edition), edited by Glimcher, P. W. and Fehr, E.

Try varying $\alpha$ and replotting the weights to see how it affects the weight assigned to the history of outcomes.

Last but not least, a bit of maths. Starting with equations 1 and 2, see if you can derive equation 4 by substituting terms from the previous trial's equation for $p_t$. (This is the hardest exercise we've asked you to do today. If you manage to complete it, well done! If not, don't worry: we will reveal how to do it in session 2.)