## Training an RL Agent to Learn to Multiply

This notebook presents a worked example of building an RL agent with Kandula. In this example, we build an agent that learns how to multiply through reinforcement learning: the agent itself never performs any arithmetic and only receives a reward for each guess.

### Setup

First, let's import the requirements.

In [1]:
from kandula import logging
from kandula.steps import RLStep
from kandula.qtable import QTable
from kandula.q_learning import QL

from random import randint
from functools import reduce
import torch

### Define the State Space

The first thing to consider is how to map our problem to a state-action space. A multiplication table is a natural 2-dimensional space, which makes it a good example of a reinforcement learning state space. We define the state space as a list in which each entry gives the number of possible values along one dimension. For instance, a one-digit multiplication table in the range 1-9 can be represented by the state space `[9, 9]`. Here, however, we use a smaller space (5x5) to reduce computation and speed up convergence.

In [2]:
state_space = [5, 5]
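
To make the list notation concrete, the sketch below counts the states implied by `state_space` and maps a 1-based state such as `[2, 3]` to a single row number. The helper `flat_index` is hypothetical and written only for illustration; Kandula's own `QTable` may index states differently.

```python
from math import prod

state_space = [5, 5]

# Total number of distinct states: one per (a, b) pair.
num_states = prod(state_space)

def flat_index(state, space):
    """Map a 1-based state such as [2, 3] to a 0-based row number (row-major).

    Hypothetical helper for illustration; not part of Kandula.
    """
    index = 0
    for value, size in zip(state, space):
        if not 1 <= value <= size:
            raise ValueError(f"{value} is outside 1..{size}")
        index = index * size + (value - 1)
    return index

print(num_states)                       # 25
print(flat_index([1, 1], state_space))  # 0
print(flat_index([2, 3], state_space))  # 7
print(flat_index([5, 5], state_space))  # 24
```

With a `[9, 9]` state space the same helper would cover 81 states, which is why the smaller table trains faster.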

### Define the Actions

The next step is to define the actions. In this example, an action is a guess at the multiplication result, and the RL agent is supposed to learn which guess is right. Since the largest product in our state space is 5 x 5 = 25, the actions can be defined as:

In [3]:
actions = list(range(1, 26))  # candidate answers 1..25

If, for instance, the RL agent had to choose between two actions such as shifting gear up or shifting gear down, the actions variable would be `[shift_gear_up, shift_gear_down]`.
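
Because the actions are a plain list, choosing an action amounts to choosing a position in that list, and the guess is the value stored there. The sketch below shows that mapping and counts how many of the 25 candidate answers are ever correct in a 5x5 table. Names other than `actions` are illustrative.

```python
actions = list(range(1, 26))

# Position 0 holds the guess 1, position 24 holds the guess 25.
action_index = actions.index(6)
print(action_index)           # 5
print(actions[action_index])  # 6

# Every product that can occur in a 5x5 multiplication table.
reachable = sorted({a * b for a in range(1, 6) for b in range(1, 6)})
print(reachable)       # [1, 2, 3, 4, 5, 6, 8, 9, 10, 12, 15, 16, 20, 25]
print(len(reachable))  # 14
```

The remaining 11 values (7, 11, 13, and so on) are never the right answer, but the agent is not told this and has to learn it from the reward.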

### Define the RL Step

It's now time to define our RL step. An RL step should be a subclass of `kandula.steps.RLStep` and implement its two abstract methods, `get_state()` and `get_reward()`. Together with the definition of states and actions, this is where you make the RL agent specific to your problem.

In [4]:
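class MyRlStep(RLStep):
    def get_state(self):
        # Each step starts from a fresh random pair; no environment is tracked.
        a, b = gen_rand_nums()
        state = [a, b]
        return state

    def get_reward(self, state, action):
        # True product of the state's numbers, used only to score the guess.
        prod = reduce((lambda x, y: x * y), state)
        # 1 for an exact guess, shrinking toward 0 as the guess moves away.
        reward = 1 / (abs(prod - action) + 1)
        return reward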

In the above code, the state handling is simplified to match this particular problem. To get the current state, we simply pick two random numbers in the range 1-5 (ignoring the previous action and the environment) via `gen_rand_nums()`, defined below. The reward is calculated by comparing the RL agent's prediction to the actual product: it is 1 for an exact guess and decreases as the guess moves further away.

In [5]:
def gen_rand_nums():
    num_1 = randint(1,5)
    num_2 = randint(1,5)
    return num_1, num_2
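
A few hand-checked values show how the reward formula in `get_reward()` behaves. The standalone function `reward_for` below is an illustrative copy of that formula, not a Kandula API.

```python
from functools import reduce

def reward_for(state, action):
    """Same formula as MyRlStep.get_reward, written as a standalone function."""
    prod = reduce(lambda x, y: x * y, state)
    return 1 / (abs(prod - action) + 1)

print(reward_for([2, 3], 6))   # 1.0  (exact answer)
print(reward_for([2, 3], 7))   # 0.5  (off by one)
print(reward_for([2, 3], 10))  # 0.2  (off by four)
```

Near misses still earn partial credit, so the agent gets a signal about how close a guess was rather than only whether it was right.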

### Initiate and Train the RL Agent

Before we can train our RL agent, we need to define one more function. The training loop uses it to decide which action is best in a given state, so it must be implemented for the specific problem you are solving. In this example, the best action in each state is simply the product of the two numbers that make up the state. Hence, we can define the following function:

In [6]:
def get_correct_action_for_multiply(state: list):
    return state[0]*state[1]

With that in place, we can create the Q-table and the Q-learning agent and start training:

In [7]:
%load_ext tensorboard
mrls = MyRlStep()
qt = QTable(state_space=state_space, actions=actions)
ql = QL(qtable=qt, rl_step=mrls)
ql.train(num_epochs=70000,
         get_correct_action=get_correct_action_for_multiply,
         verbose=True,
         log_dir='rl_multiplication')

q_learning      - 135 - INFO - Training the RL agent...
q_learning      - 142 - INFO - Epoch: 1000 - Error: 84.00%
q_learning      - 142 - INFO - Epoch: 2000 - Error: 84.00%
q_learning      - 142 - INFO - Epoch: 3000 - Error: 84.00%
q_learning      - 142 - INFO - Epoch: 4000 - Error: 84.00%
q_learning      - 142 - INFO - Epoch: 5000 - Error: 84.00%
q_learning      - 142 - INFO - Epoch: 6000 - Error: 84.00%
q_learning      - 142 - INFO - Epoch: 7000 - Error: 84.00%
q_learning      - 142 - INFO - Epoch: 8000 - Error: 84.00%
q_learning      - 142 - INFO - Epoch: 9000 - Error: 84.00%
q_learning      - 142 - INFO - Epoch: 10000 - Error: 80.00%
q_learning      - 142 - INFO - Epoch: 11000 - Error: 80.00%
q_learning      - 142 - INFO - Epoch: 12000 - Error: 80.00%
q_learning      - 142 - INFO - Epoch: 13000 - Error: 80.00%
q_learning      - 142 - INFO - Epoch: 14000 - Error: 80.00%
q_learning      - 142 - INFO - Epoch: 15000 - Error: 76.00%
q_learning      - 142 - INFO - Epoch: 16000 - Error: 
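
The error percentages in the training log move in steps of 4%, that is 1/25, which would be consistent with an error measured as the share of the 25 states whose currently preferred action differs from `get_correct_action_for_multiply(state)`. This is only a guess at what Kandula computes. The sketch below shows such a metric in plain Python; `error_rate` and the `always_four` policy are hypothetical and exist only for illustration.

```python
def get_correct_action_for_multiply(state):
    return state[0] * state[1]

def error_rate(greedy_action, states):
    """Share of states whose preferred action is not the correct one.

    `greedy_action` maps a state tuple to the action the agent currently prefers.
    Hypothetical helper; Kandula's own error metric may be defined differently.
    """
    wrong = sum(
        1 for s in states
        if greedy_action[s] != get_correct_action_for_multiply(list(s))
    )
    return wrong / len(states)

states = [(a, b) for a in range(1, 6) for b in range(1, 6)]

# A made-up policy that always answers 4: right for 1x4, 2x2 and 4x1 only.
always_four = {s: 4 for s in states}
print(f"{error_rate(always_four, states):.2%}")  # 88.00%
```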

Setting `verbose=True` makes the train function log the error periodically during training. This is optional, because the error values are always stored in a TensorBoard plot as well. Outside notebooks, you can view the plots by running `tensorboard --logdir=rl_multiplication` and then opening http://localhost:6006/. To view the plots directly in a notebook, run the following command:

In [8]:
%tensorboard --logdir rl_multiplication

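The RL agent now appears to be trained: the error has dropped substantially and fairly quickly, as expected for such a simple problem. It is not perfect, though. In the sample session below, most answers are correct, but a few are off by one (2 x 3 = 7 and 4 x 2 = 9). Let's now write a script that uses the trained RL agent to answer multiplication queries: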

In [12]:
while True:
    inp = input("Enter two numbers separated by a comma, to see their product: ")
    if inp.strip().lower() == "stop":
        print("Have a nice day! Bye.")
        break
    try:
        a, b = (int(x) for x in inp.split(','))  # ValueError on bad format or non-integers
    except ValueError:
        logging.error('Input format seems to be wrong, please try again.')
        continue
    # The Q-table only covers the 5x5 state space defined above,
    # so reject anything outside 1-5 before looking up the state.
    if not (1 <= a <= 5 and 1 <= b <= 5):
        logging.error('The RL agent knows how to multiply only numbers <= 5.')
        continue
    # Find the state's row in the Q-table and pick the highest-valued action.
    state_index = ql.q_table.get_state_index([a, b])
    action_index = torch.argmax(ql.q_table.q_table[state_index]).item()
    res = ql.q_table.actions[action_index]
    print(f'{a} x {b} = {res}')

Enter two numbers separated by a comma, to see their product: 2,3
2 x 3 = 7
Enter two numbers separated by a comma, to see their product: 4,5
4 x 5 = 20
Enter two numbers separated by a comma, to see their product: 6,6


2975179283      - 14 - ERROR - The RL agent knows how to multiply only numbers <= 5.


Enter two numbers separated by a comma, to see their product: 


2975179283      - 9 - ERROR - Input format seems to be wrong, please try again.


Enter two numbers separated by a comma, to see their product: 5,5
5 x 5 = 25
Enter two numbers separated by a comma, to see their product: 4,5
4 x 5 = 20
Enter two numbers separated by a comma, to see their product: 3,3
3 x 3 = 9
Enter two numbers separated by a comma, to see their product: 4,2
4 x 2 = 9
Enter two numbers separated by a comma, to see their product: stop
Have a nice day! Bye.
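
For readers who want to see the idea without Kandula, the following is a minimal plain-Python sketch of a tabular learner for the same problem. It is not Kandula's implementation of `QL.train`: the learning rate, the purely random exploration, and the update rule are assumptions chosen for brevity, and all names are illustrative.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable

SIZE = 5
actions = list(range(1, SIZE * SIZE + 1))
states = [(a, b) for a in range(1, SIZE + 1) for b in range(1, SIZE + 1)]

# One row of action values per state, all starting at zero.
q = {s: [0.0] * len(actions) for s in states}

ALPHA = 0.5  # learning rate (illustrative value)

def reward(state, action):
    # Same shape as get_reward(): 1 for an exact hit, less for misses.
    return 1 / (abs(state[0] * state[1] - action) + 1)

for _ in range(20000):
    s = random.choice(states)           # random state, like gen_rand_nums()
    i = random.randrange(len(actions))  # explore uniformly at random
    # Move the estimate toward the observed reward. The next state does not
    # depend on the action, so no bootstrapped future-value term is used.
    q[s][i] += ALPHA * (reward(s, actions[i]) - q[s][i])

def answer(a, b):
    """Greedy lookup: the action with the highest learned value."""
    row = q[(a, b)]
    return actions[max(range(len(row)), key=row.__getitem__)]

print(answer(2, 3), answer(4, 5))
```

Uniform exploration keeps the sketch short; an epsilon-greedy policy that mostly exploits the current best guess is the more common choice in practice.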
