## Training an RL Agent to Learn to Multiply

This notebook presents a worked example of building an RL agent with bourbon. The agent in this example is meant for a wind turbine: it learns how to position the rotors with respect to the wind parameters so that the output power is maximized.

### Setup

First things first, let's import the requirements. 

In [1]:
from bourbon.steps import RLStep
from bourbon.qtable import QTable
from bourbon.q_learning import QL

from random import randint
from functools import reduce
import numpy as np

In [12]:
wind_speeds = np.random.uniform(10, 15, size=100)  # Simulated wind speeds (m/s)
# For discrete action space
wind_speed_categories = ["Low Wind", "Moderate Wind", "High Wind", "Very High Wind", "Extreme Wind"]
power_curve = np.interp(wind_speeds, [3, 7, 12, 15], [0, 50, 300, 0])  # Power curve (kW)
initial_pitch_angle = 0
# For discrete action space
pitch_angle_categories = ["Category 1", "Category 2", "Category 3", "Category 4", "Category 5"]
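
To make the power curve in the cell above concrete, the following sketch evaluates the same piecewise-linear interpolation at a few hand-picked wind speeds. The sample speeds and the names `speed_points`, `power_points`, and `sample_speeds` are illustrative assumptions; only the breakpoints come from the cell above.

```python
import numpy as np

# Breakpoints copied from the cell above: wind speed (m/s) -> power (kW)
speed_points = [3, 7, 12, 15]
power_points = [0, 50, 300, 0]

# Hypothetical sample speeds, chosen only for illustration
sample_speeds = np.array([2.0, 5.0, 12.0, 13.5, 20.0])
sample_power = np.interp(sample_speeds, speed_points, power_points)

for speed, power in zip(sample_speeds, sample_power):
    print(f"{speed:5.1f} m/s -> {power:6.1f} kW")
```

Outside the 3-15 m/s range `np.interp` clamps to the end values, so this curve yields 0 kW below 3 m/s and above 15 m/s, and it peaks at 300 kW at 12 m/s.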


### Define Actions

In [13]:
# Rotor angle is the angle of the rotor blades
actions = list(range(17))

### Define the State Space

The first thing to consider is how to map our problem to a state-action space. Since a multiplication table is a natural 2-dimensional space, it makes a good example of a reinforcement learning state space. We define the state space as a list in which each entry gives the number of possible values along that dimension. For instance, a one-digit multiplication table over the range 1-9 has the state space `[9, 9]`. Here, however, we make the space smaller (5x5) to reduce computation and speed up convergence.
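
As a rough illustration of what such a state-space list implies, the sketch below allocates a NumPy array with one axis per state dimension plus one axis for the actions. This is an assumption made for illustration, not bourbon's actual `QTable` internals; `n_actions`, `q_values`, and the 0-based index mapping are likewise hypothetical.

```python
import numpy as np

state_space = [5, 5]   # two state dimensions with 5 possible values each
n_actions = 25         # hypothetical: one action per candidate product 1..25

# One value per (state, action) pair
q_values = np.zeros(state_space + [n_actions])
print(q_values.shape)
print(q_values.size)

# Map the operands (2, 3) to 0-based indices and read that state's action values
a, b = 2, 3
row = q_values[a - 1, b - 1]
print(row.shape)
```

Under this layout the number of entries is the product of all dimensions, which is one reason a 5x5 space is cheaper to learn than a 9x9 one.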

In [15]:
# Each state is represented by wind speed, pitch angle, previous action
state_space = [len(wind_speed_categories), len(pitch_angle_categories), len(actions)]

### Discretize Wind Speed and Pitch Angles

In [None]:
# Function to discretize continuous wind speed into categories
def discretize_wind_speed(wind_speed):
    if wind_speed < 5:
        return "Low Wind"
    elif wind_speed < 10:
        return "Moderate Wind"
    elif wind_speed < 15:
        return "High Wind"
    elif wind_speed < 20:
        return "Very High Wind"
    else:
        return "Extreme Wind"

# Function to discretize continuous pitch angle into categories
def discretize_pitch_angle(pitch_angle):
    # Define your discretization logic here, e.g., based on bins or categories
    if pitch_angle < 10:
        return "Category 1"
    elif pitch_angle < 20:
        return "Category 2"
    elif pitch_angle < 30:
        return "Category 3"
    elif pitch_angle < 40:
        return "Category 4"
    else:
        return "Category 5"
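
The same thresholds can also be expressed as bin edges. The sketch below uses `np.digitize` to map a continuous value directly to an integer category index, which may be convenient if the state is later used to index a table. The names `WIND_EDGES`, `PITCH_EDGES`, and `to_index` are hypothetical and not part of the notebook.

```python
import numpy as np

WIND_EDGES = [5, 10, 15, 20]    # same thresholds as discretize_wind_speed
PITCH_EDGES = [10, 20, 30, 40]  # same thresholds as discretize_pitch_angle

wind_speed_categories = ["Low Wind", "Moderate Wind", "High Wind",
                         "Very High Wind", "Extreme Wind"]

def to_index(value, edges):
    """Return the 0-based bin index; a value equal to an edge falls in the upper bin."""
    return int(np.digitize(value, edges))

for speed in (4.9, 5.0, 12.3, 25.0):
    idx = to_index(speed, WIND_EDGES)
    print(speed, idx, wind_speed_categories[idx])

# A pitch angle of 35 lands in the fourth bin (index 3), i.e. "Category 4"
print(to_index(35.0, PITCH_EDGES))
```

With the default `right=False`, the bin boundaries match the strict `<` comparisons in the functions above.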

### Define the RL Step

It's now time to define our RL step. An RL step should subclass `bourbon.steps.RLStep` and implement its two abstract methods, `get_state()` and `get_reward()`. Together with the definition of states and actions, this is where you make the RL agent specific to your problem.

In [4]:


def get_state(wind_speed, pitch_angle, previous_action):
    wind_speed_category = discretize_wind_speed(wind_speed)
    pitch_angle_category = discretize_pitch_angle(pitch_angle)
    return (wind_speed_category, pitch_angle_category, previous_action)



def get_reward(state, action):
    # Extract wind speed category, pitch angle category, and previous action from the state
    wind_speed_category, pitch_angle_category, previous_action = state
    
    # Retrieve the actual wind speed and pitch angle values (if needed)
    wind_speed_value = get_actual_wind_speed(wind_speed_category)  # Implement this function
    pitch_angle_value = get_actual_pitch_angle(pitch_angle_category)  # Implement this function
    
    # Use the actual continuous values for more fine-grained calculations (if needed)
    power_generated = calculate_power_output(wind_speed_value, pitch_angle_value)  # Implement this function

    # Calculate the reward based on power generated and action
    reward = power_generated - 10 * abs(action - pitch_angle_value)
    
    return reward




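class MyRlStep(RLStep):
    def get_state(self):
        # Sample a random state: two integers in the range 1-5
        a, b = gen_rand_nums()
        state = [a, b]
        return state

    def get_reward(self, state, action):
        # The target is the product of the state entries
        prod = reduce(lambda x, y: x * y, state)
        # Reward is 1 for an exact answer and decays with the absolute error
        reward = 1 / (abs(prod - action) + 1)
        return reward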

In the above code, the state handling is deliberately simple, which suits this particular problem. To get the current state we simply choose two random numbers in the range 1-5 (ignoring the previous action and the environment) via `gen_rand_nums()`, defined below. The reward is calculated by comparing the RL agent's prediction (the action) to the actual product: it is 1 for an exact match and shrinks as the absolute error grows.
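
To see how this reward behaves, the following standalone sketch evaluates the same formula for the state `[3, 4]` and a few candidate actions. The function name `multiplication_reward` is hypothetical; the formula is the one used in `get_reward()` above.

```python
from functools import reduce

def multiplication_reward(state, action):
    """Reward is 1 for the exact product and decays with the absolute error."""
    prod = reduce(lambda x, y: x * y, state)
    return 1 / (abs(prod - action) + 1)

state = [3, 4]  # the correct product is 12
for action in (12, 11, 14, 2):
    print(action, multiplication_reward(state, action))
```

An exact answer earns a reward of 1, being off by one earns 0.5, and the reward keeps shrinking toward 0 as the error grows, so near misses still receive a graded signal.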

In [5]:
def gen_rand_nums():
    num_1 = randint(1,5)
    num_2 = randint(1,5)
    return num_1, num_2

### Initiate and Train the RL Agent

Before we can train our RL agent, we need to define one more function that is used in the training loop; its main purpose is to decide which action is best in a given state. This function must be implemented with respect to the problem you are solving. In this example, the best action in each state is simply the product of the two indices that represent the state. Hence, we can define the following function:

In [7]:
def get_correct_action_for_multiply(state: list):
    return state[0]*state[1]

In [8]:
mrls = MyRlStep()
qt = QTable(state_space=state_space, actions=actions)
ql = QL(qtable=qt, rl_step=mrls)
ql.train(num_epochs=120000,
         get_correct_action=get_correct_action_for_multiply,
         log_dir='rl_multiplication')

q_learning      - 150 - INFO - Training the RL agent...
100%|█████████████████████████████████████████████████████████████████████████████████████| 119999/119999 [00:16<00:00, 7364.91it/s, Error (%)=4]
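
For intuition about what a training loop of this kind does, here is a minimal, self-contained tabular sketch with epsilon-greedy exploration. It is not bourbon's implementation: the update rule, the hyperparameters (`alpha`, `epsilon`, `num_epochs`), the action set 1-25, and all names are assumptions chosen for illustration. Because each episode here is a single step, the update has no bootstrapped next-state term.

```python
import numpy as np

rng = np.random.default_rng(0)

n_actions = 25                   # assumed action set: candidate products 1..25
q = np.zeros((5, 5, n_actions))  # one value per (a, b, action)
alpha, epsilon, num_epochs = 0.5, 0.3, 50_000  # illustrative hyperparameters

for _ in range(num_epochs):
    # Sample a random state: two operands in the range 1-5
    a, b = rng.integers(1, 6, size=2)
    # Epsilon-greedy: explore with probability epsilon, otherwise exploit
    if rng.random() < epsilon:
        action_idx = int(rng.integers(n_actions))
    else:
        action_idx = int(np.argmax(q[a - 1, b - 1]))
    action = action_idx + 1
    # Same reward shape as get_reward(): 1 for the exact product
    reward = 1 / (abs(a * b - action) + 1)
    # Move the estimate toward the observed reward (one-step episode)
    q[a - 1, b - 1, action_idx] += alpha * (reward - q[a - 1, b - 1, action_idx])

# Greedy policy: the best-valued action for every state
learned = np.argmax(q, axis=2) + 1
expected = np.outer(np.arange(1, 6), np.arange(1, 6))
print((learned == expected).mean())  # fraction of the 25 products answered correctly
```

The greedy policy read from the table plays the role that `ql.get_best_action(state)` plays in the notebook, under the assumptions stated above.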

The error results are always logged as a TensorBoard plot. Outside notebooks, you can view the plots by running `tensorboard --logdir=rl_multiplication` and then opening http://localhost:6006/. To view the plots directly in a notebook, run the following commands:

In [9]:
%load_ext tensorboard
%tensorboard --logdir rl_multiplication

The RL agent appears to be trained: the error dropped quickly (the progress bar of the training run ends at an error of 4%), which is expected for such a simple problem. Let's now write a script that uses the trained RL agent to answer multiplication queries:

In [10]:
while True:
    inp = input("Enter two numbers separated by a comma, to see their product: ")
    if inp.strip().lower() == "stop":
        print("Have a nice day! Bye.")
        break
    try:
        # Expect exactly two comma-separated integers, e.g. "2,3"
        a, b = (int(x) for x in inp.split(','))
    except ValueError:
        print('Input format seems to be wrong, please try again.')
        continue
    # The agent was trained only on operands in the range 1-5
    if not (1 <= a <= 5 and 1 <= b <= 5):
        print('The RL agent knows how to multiply only numbers from 1 to 5.')
        continue
    state = [a, b]
    best_action = ql.get_best_action(state)
    print(f'{a} x {b} = {best_action}')

Enter two numbers separated by a comma, to see their product: 2,3
2 x 3 = 6
Enter two numbers separated by a comma, to see their product: 3,3
3 x 3 = 9
Enter two numbers separated by a comma, to see their product: 3,4
3 x 4 = 12
Enter two numbers separated by a comma, to see their product: 5,5
5 x 5 = 25
Enter two numbers separated by a comma, to see their product: 4,5
4 x 5 = 20
Enter two numbers separated by a comma, to see their product: 2.2
Input format seems to be wrong, please try again.
Enter two numbers separated by a comma, to see their product: 2,2
2 x 2 = 4
Enter two numbers separated by a comma, to see their product: 1,2
1 x 2 = 2
Enter two numbers separated by a comma, to see their product: 1,1
1 x 1 = 1
Enter two numbers separated by a comma, to see their product: 1,3
1 x 3 = 3
Enter two numbers separated by a comma, to see their product: 3,5
3 x 5 = 15
Enter two numbers separated by a comma, to see their product: Stop
Have a nice day! Bye.
