In [1]:
import sys
sys.path.append('../../src/')

from lib.pysoarlib import *

# Reinforcement Learning

This section will cover:
1. Reinforcement learning (RL) as a new kind of learning.
   RL will replace our use of preferences like *worst* or *better*.
2. Combinatorial rules (using `gp`) as a way of creating many rules at once
   (see `templates` in the manual for other ways to do this).

So far, when we've proposed operators, we tied *preferences* to them.
These preferences would tell Soar which operator to select and apply when multiple operators were proposed.
Instead of symbolic preferences like *better* or *worst*, we can attach numeric values to operators (numeric indifferent preferences).

The agent will tweak these numbers itself through reinforcement learning.

"Reinforcement learning (RL) in Soar allows agents to alter behavior over time by dynamically changing numerical 
indifferent preferences in procedural memory in response to a reward signal. This learning mechanism contrasts 
starkly with chunking. Whereas chunking is a one-shot form of learning that increases agent execution performance 
by summarizing sub-goal results, RL is an incremental form of learning that probabilistically alters agent behavior."

## Left Right Agent

Let's consider a simple agent that chooses to move left or right.
Unbeknownst to the agent, "right" is always the correct choice.
Once the agent makes a move, we will give it a reward.
A +1 reward will be given if the agent moves right and -1 if the agent moves left.

All this agent will do is pick a direction, receive a reward, and then halt.

Let's look at the Soar code.
This first cell is going to be code that is familiar to you.

In [8]:
lr_initialization = """
sp {propose*initialize-left-right 
    (state <s> ^superstate nil
              -^name)
-->
    (<s> ^operator <o> +)
    (<o> ^name initialize-left-right)
}

sp {apply*initialize-left-right 
    (state <s> ^operator <op>)
    (<op>      ^name     initialize-left-right)
-->
    (<s> ^name        left-right
         ^direction    <d1> <d2>
         ^location    start)
    (<d1> ^name left  ^reward -1) 
    (<d2> ^name right ^reward  1)
}
"""

lr_operators = """
sp {left-right*propose*move 
    (state <s> ^name            left-right
               ^direction.name  <dir> 
               ^location        start)
-->
    (<s> ^operator <op> +) 
    (<op> ^name move
          ^dir <dir>)
}

sp {apply*move
    (state <s> ^operator <op>
               ^location start) 
    (<op> ^name move
          ^dir <dir>)
-->
    (<s> ^location start - <dir>) 
    (write (crlf) |Moved: | <dir>)
}
"""

lr_halt_condition = """
sp {elaborate*done
    (state <s> ^name left-right
               ^location {<> start})
--> 
    (halt)
}
"""

Soar needs a couple of rl-specific rules for reinforcement learning to work.

Take a second to guess what is going on. These will be explained in more detail later.

In [9]:
lr_rl_rules = """
sp {left-right*rl*left
    (state <s> ^name left-right
               ^operator <op> +) 
    (<op> ^name move
          ^dir left)
-->
    (<s> ^operator <op> = 0)
}

sp {left-right*rl*right
    (state <s> ^name left-right
               ^operator <op> +) 
    (<op> ^name move
          ^dir right)
-->
    (<s> ^operator <op> = 0)
}
"""

lr_reward = """
sp {elaborate*reward
    (state <s> ^name        left-right
               ^reward-link <r> 
               ^location    <d-name> 
               ^direction   <dir>)
    (<dir> ^name   <d-name> 
           ^reward <d-reward>) 
-->
    (<r> ^reward.value <d-reward>) 
}
"""

By default, RL is disabled. We must enable it by executing `rl --set learning on`. 

Let's run the agent.

In [46]:
agent_raw = f"""
{lr_initialization}
{lr_operators}
{lr_halt_condition}
{lr_rl_rules}
{lr_reward}
"""

lr_agent = SoarAgent(agent_raw=agent_raw)
lr_agent.add_connector('lr-agent', AgentConnector(lr_agent))
lr_agent.connect()

# Enable RL in Soar
lr_agent.execute_command('rl --set learning on', print_res=True)

# Set exploration policy (explained later)
# TODO This returns "Too many args", so using the shorthand `-g` instead
# lr_agent.execute_command('indifferent-selection --epsilon-greedy', print_res=True)
lr_agent.execute_command('indifferent-selection -g', print_res=True)

# Step once
lr_agent.execute_command('step', print_res=True);
# TODO Expected output does not show up
lr_agent.execute_command('print --rl', print_res=True);

--------- SOURCING PRODUCTIONS ------------
Total: 8 productions sourced.
rl --set learning on

indifferent-selection -g

step
--> 1 decision cycle executed. 1 rule fired.
print --rl
left-right*rl*right  0.000000 0
left-right*rl*left  0.000000 0


Each line of the output lists an rl-rule, the number of times its value has been updated, and its current value.
Because both values are equal, you can read the output as saying, "either direction is equally likely to be selected."

`left-right*rl*left  0.000000 0` means that after 0 updates, the rule has a value of 0.

This is expected, as the Soar agent hasn't received any rewards yet.

If we run the agent, it will receive a reward and update the preference value for the direction it chose.

In [47]:
def learn(n):
    for i in range(n):
        # Reinitialize halted agent
        lr_agent.execute_command('init')
        # initialize-left-right
        lr_agent.execute_command('step')
        # Move operator
        lr_agent.execute_command('step')
        # Get agent to halt
        lr_agent.execute_command('step')
    
learn(1)
lr_agent.execute_command('print --rl', print_res=True);

print --rl
left-right*rl*right  1.000000 0.300000
left-right*rl*left  0.000000 0


Even though we reinitialized Soar, the learned preference values were not wiped out.
If they were cleared on every reinitialization, the agent could not build on what it learned in earlier runs.
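
The `0.300000` printed above is what a simple incremental update would produce. The sketch below is not Soar's implementation. It assumes a learning rate of 0.3 (Soar's documented default, which is consistent with the output) and assumes that no future value is added to the target because the agent halts right after the move. The function name `q_update` is made up for illustration.

```python
def q_update(value, reward, learning_rate=0.3):
    """Move `value` a fraction of the way toward `reward`.

    Assumption: the episode ends after this step, so there is no
    discounted future value to add to the target.
    """
    return value + learning_rate * (reward - value)


# One rewarded move to the right, starting from the initial value 0
right_value = q_update(0.0, reward=1)
print(f"{right_value:.6f}")  # 0.300000
```

Under these assumptions, each further +1 reward closes 30% of the remaining gap to 1, so the value for `right` approaches 1 without ever reaching it.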

Let's see what happens when we run the agent 20 more times.

In [48]:
learn(10)
lr_agent.execute_command('print --rl', print_res=True);
learn(10)
lr_agent.execute_command('print --rl', print_res=True);

print --rl
left-right*rl*right  11.000000 0.980227
left-right*rl*left  0.000000 0
print --rl
left-right*rl*right  19.000000 0.998860
left-right*rl*left  2.000000 -0.510000


If you had an output like mine, you may have noticed that even though our agent was pretty confident that `right` was the correct choice, it still tried moving `left` sometimes.

To understand why the agent did this, we have to cover **exploration**.

## Exploration

Now let's cover the *exploration policy* we set.

Take a look at this output.

```
run
Moved: right
This Agent halted.
An agent halted during the run.

init-soar
Agent reinitialized.

run
Moved: right
This Agent halted.
An agent halted during the run.

init-soar
Agent reinitialized.

run
Moved: left
This Agent halted.
An agent halted during the run.
```

The agent moves twice to the right and therefore gets rewarded twice.
Why would the agent then go left? 
Surely it has learned by now to prefer moving to the right.

If you are familiar with optimization techniques like hill climbing or stochastic gradient descent,
you probably know that occasionally taking a step that lowers the current score
can help the search find a better path.
This is known as escaping a local maximum: traveling downhill first in order to reach the global maximum.

What if, for example, `right` was rewarded every time this program ran except
during the first fifteen minutes of every hour?
During that time, `left` is rewarded instead.
(For this example, assume the state also carries an `^hour-quarter` attribute with a value from 1 to 4.)

If we started the agent at 3:16pm, it would learn to pick `right` after only a few moves.
If it were stubborn and never explored other possibilities,
it would still choose `right` at 4:06pm, when `left` is the rewarded move.

By being explorative, the agent can choose to make the seemingly wrong choice in order to learn something new.
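
The `-g` flag used earlier selects an epsilon-greedy exploration policy. A minimal sketch of that idea in plain Python follows. It is not Soar's selection code: the function name `epsilon_greedy`, the dictionary of values, and the `epsilon=0.1` default are assumptions made for illustration.

```python
import random


def epsilon_greedy(values, epsilon=0.1, rng=random):
    """Pick a key from `values` (action -> learned value).

    With probability `epsilon`, explore: pick any action uniformly at random.
    Otherwise, exploit: pick the action with the highest learned value.
    """
    if rng.random() < epsilon:
        return rng.choice(list(values))
    return max(values, key=values.get)


# Values similar to those the agent learned above
values = {"right": 0.98, "left": 0.0}

rng = random.Random(0)  # seeded so repeated runs give the same sequence
picks = [epsilon_greedy(values, epsilon=0.1, rng=rng) for _ in range(1000)]
print(picks.count("left"))  # typically a small share of the 1000 moves
```

With `epsilon=0` the policy never explores and always returns `right`. Any positive `epsilon` leaves room for the occasional `left`, which is the behavior seen in the output above.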


## RL Rules

This all raises the question: how does Soar map state-action pairs to an expected reward?

This is done using *rl-rules*. 
We created two above but didn't explain the motivation.

Soar maintains a number (referred to as the *q-value* in RL terminology) 
that denotes the expected value of an operator for a given state.

Let's say that we added `^hour-quarter` to the `move` operator and modified the above rl-rules to match on it.
With one rl-rule per combination of `^dir` and `^hour-quarter`, Soar would learn a separate preference value for each combination.
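
One way to picture this is to write out one rl-rule per combination of direction and hour quarter. The Python sketch below only builds the rule text. The `^hour-quarter` attribute on the operator follows the hypothetical above, and the rule names and the helper `make_rl_rules` are invented for illustration. The rules are not loaded into an agent here and have not been tested in Soar.

```python
from itertools import product

# Doubled braces are literal braces in the Soar rule; single braces are filled in.
RL_RULE_TEMPLATE = """sp {{left-right*rl*{direction}*q{quarter}
    (state <s> ^name left-right
               ^operator <op> +)
    (<op> ^name move
          ^dir {direction}
          ^hour-quarter {quarter})
-->
    (<s> ^operator <op> = 0)
}}
"""


def make_rl_rules(directions=("left", "right"), quarters=(1, 2, 3, 4)):
    """Return one rl-rule string per (direction, hour-quarter) pair."""
    return [
        RL_RULE_TEMPLATE.format(direction=d, quarter=q)
        for d, q in product(directions, quarters)
    ]


rules = make_rl_rules()
print(len(rules))  # 8
print(rules[0])
```

Each of the eight rules would hold its own value, so the agent could learn that `left` is worth more in quarter 1 and `right` in the other three. The `gp` command mentioned at the start of this section is meant for generating this kind of rule family at once.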