# What is Reinforcement Learning? 
Reinforcement learning (RL) is a framework for solving control tasks. In RL, an agent learns by interacting with its environment and receiving rewards as feedback.

## What is the goal of reinforcement learning? 
The goal of reinforcement learning is to maximize the expected cumulative reward the agent collects in the environment.

# Two approaches to finding the optimal policy
* Policy-based methods
    * The policy is trained directly: the agent learns which action to take in the current state
* Value-based methods
    * A value function is trained to estimate how valuable each state is, and the agent takes actions that lead to the more valuable states

## Value-based methods

The value of a state s is the expected discounted return the agent obtains if it starts in state s and then acts according to the policy.
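
As a small illustration of the discounted return behind this definition, the sketch below sums one reward sequence with a discount factor. The function name `discounted_return` and the reward values are made up for the example. The value of a state would be the average of this quantity over many trajectories that start in that state.

```python
def discounted_return(rewards, gamma):
    """Sum rewards, weighting the reward k steps ahead by gamma**k."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))


# Hypothetical episode: the only reward arrives on the third step.
rewards = [0.0, 0.0, 1.0]
print(discounted_return(rewards, gamma=0.95))  # roughly 0.9025
```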

### What is the policy in value-based methods? 
In value-based methods, the policy chooses actions based on the value function. The policy itself is not trained, so its behavior must be specified by hand. The simplest choice is the greedy policy, which always picks the action with the highest value.

### Two types of value-based methods
* State-value function
    * For each state s, the function returns the expected return V if the agent starts at s and follows the policy till termination
* Action-value function
    * Given a state s and action a, the function returns the expected return Q if the agent starts at s and takes action a and follows the policy till termination
* In both cases, we need to sum all the rewards an agent can get if it starts at state s   

### The Bellman equation 
To calculate $V(S_t)$, we need the expected return starting at state $S_t$ until termination. To calculate $V(S_{t+1})$, we need the expected return starting at state $S_{t+1}$ until termination. Much of this computation is redundant, because the second return is already contained in the first. The Bellman equation is a recursive relation that removes most of this redundancy.

Under the Bellman equation, the value of a state is the immediate reward $R_{t+1}$ plus the discount factor $\gamma$ times the value of the subsequent state, $\gamma V(S_{t+1})$ (taken in expectation when the policy or the environment is stochastic).

**Bellman Equation**

$V(S_t) = R_{t+1} + \gamma V(S_{t+1})$
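
To see the recursion at work, here is a sketch on a hypothetical deterministic four-state corridor (an assumption made for illustration, not one of the environments used later). Each value is computed from the value of the next state, so no reward sequence is summed twice.

```python
gamma = 0.95

# Hypothetical corridor 0 -> 1 -> 2 -> 3, where state 3 is terminal.
# rewards[s] is the reward R_{t+1} received when leaving state s.
rewards = [0.0, 0.0, 1.0]
V = [0.0] * 4  # V[3] stays 0 because nothing follows the terminal state

# Sweep backwards so V[s + 1] is already known when V[s] is computed.
for s in reversed(range(3)):
    V[s] = rewards[s] + gamma * V[s + 1]

print(V)  # values grow as the state gets closer to the reward
```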

## Monte Carlo vs Temporal Difference Learning

Simply put, the Monte Carlo strategy waits for an entire episode of experience before learning, while the Temporal Difference strategy starts learning after a single step.

### Monte Carlo 
When using Monte Carlo, we wait for an entire episode to complete, then calculate the return $G_t$ and use it as the target for updating the value function. In the update below, lr is the learning rate.

$V(S_t) \leftarrow V(S_t) + \text{lr}\,[G_t - V(S_t)]$

* Each episode starts at the same position
* The agent takes actions according to the policy
* We store each state, action, reward, and next state as a tuple
* We repeat this process until the episode terminates
* We sum up all the rewards the agent received to get the return $G_t$
* We then update $V(S_t)$ toward $G_t$ using the formula above
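
The steps above can be sketched as follows. This is a minimal every-visit variant under stated assumptions: the helper `monte_carlo_update`, the three-state value list, and the trajectory are all made up for illustration, and `gamma=1.0` is chosen so the return is the plain sum of rewards described in the list.

```python
def monte_carlo_update(V, episode, lr, gamma=1.0):
    """Update V in place from one finished episode.

    episode: list of (state, reward) pairs in the order they occurred,
    where reward is the R_{t+1} received after leaving that state.
    """
    G = 0.0
    # Walk the episode backwards so G always holds the return from step t onward.
    for state, reward in reversed(episode):
        G = reward + gamma * G
        V[state] = V[state] + lr * (G - V[state])
    return V


V = [0.0, 0.0, 0.0]                       # one value per state
episode = [(0, 0.0), (1, 0.0), (2, 1.0)]  # made-up trajectory 0 -> 1 -> 2 -> end
monte_carlo_update(V, episode, lr=0.5)
print(V)  # [0.5, 0.5, 0.5]
```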

### Temporal Difference Learning 
When using Temporal Difference learning, we update $V(S_t)$ after every single step instead of waiting for the end of the episode.

Because the episode is not finished, we don't have $G_t$. Instead, we estimate $G_t$ by adding the immediate reward $R_{t+1}$ and the discounted value of the next state, $\gamma V(S_{t+1})$. This estimate is called the TD target.

$V(S_t) \leftarrow V(S_t) + \text{lr}\,[R_{t+1} + \gamma V(S_{t+1}) - V(S_t)]$
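
A single TD update can be sketched like this. The function name `td0_update` and the numbers are hypothetical. The `terminated` flag is an added assumption: it follows the common convention of treating the value after a terminal transition as zero.

```python
def td0_update(V, state, reward, next_state, lr, gamma, terminated=False):
    """Apply one TD update to V[state] in place and return V."""
    # Assumption: nothing follows a terminal transition, so its bootstrap value is 0.
    next_value = 0.0 if terminated else V[next_state]
    td_target = reward + gamma * next_value
    V[state] = V[state] + lr * (td_target - V[state])
    return V


V = [0.0, 0.5]  # made-up value estimates for two states
td0_update(V, state=0, reward=1.0, next_state=1, lr=0.1, gamma=0.9)
print(V[0])  # about 0.145, i.e. 0.1 * (1.0 + 0.9 * 0.5 - 0.0)
```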

## What is Q-Learning? 

Q-Learning is an off-policy, value-based method that uses the Temporal Difference (TD) approach to train an action-value function.

In Q-learning, we train a Q-function, an action-value function which outputs the value of taking an action in a given state. 

The Q-function is encoded into a Q-table, which is a table that stores state-action pair values. 


### Q-learning algorithm 
* Step 1: Initialize the Q-table to zero for every state-action pair
* Step 2: Choose an action using the epsilon-greedy strategy
    * Initialize epsilon to 1.0
    * With probability 1 - epsilon, we take the action prescribed by our Q-table (exploitation)
    * With probability epsilon, we choose a random action (exploration)
    * Since epsilon is 1.0 at the beginning, we start by choosing random actions. As training progresses and the Q-table fills in, we lower epsilon.
* Step 3: Perform action $A_t$, get reward $R_{t+1}$ and next state $S_{t+1}$
* Step 4: Update $Q(S_t,A_t)$
    * $Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \text{lr}\,[R_{t+1} + \gamma \max_a Q(S_{t+1}, a) - Q(S_t, A_t)]$
    * To compute this update, we:
        1. Obtain the reward $R_{t+1}$ after taking action $A_t$
        2. Get the best state-action value in the next state by using the greedy policy, i.e. the action with the highest state-action value in $S_{t+1}$
* Repeat steps 2 to 4 until training ends.
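
As a worked instance of step 4, the sketch below applies one update to a tiny made-up Q-table. The helper name `q_learning_update` and the table contents are assumptions for illustration; the learning rate and discount factor match the values used in the examples that follow.

```python
import numpy as np


def q_learning_update(Q, state, action, reward, next_state, lr, gamma):
    """Apply one Q-learning update to Q[state, action] in place."""
    # The target bootstraps from the best action available in the next state.
    td_target = reward + gamma * np.max(Q[next_state])
    Q[state, action] += lr * (td_target - Q[state, action])
    return Q


Q = np.zeros((3, 2))  # hypothetical table: 3 states, 2 actions
Q[1] = [0.5, 1.0]     # pretend state 1 already has some estimates
q_learning_update(Q, state=0, action=1, reward=0.0, next_state=1, lr=0.7, gamma=0.95)
print(Q[0, 1])        # about 0.665, i.e. 0.7 * (0.0 + 0.95 * 1.0 - 0.0)
```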

#### Off-policy vs On-policy 
* Off-policy: using a different policy for acting (inference) and updating (training)
    * In Q-learning, the epsilon-greedy policy is used for acting, while the greedy policy is used to update the Q-function.
* On-policy: using the same policy for acting and updating
    * In SARSA, the epsilon-greedy policy also selects the next state-action pair used in the update, instead of the greedy policy.

## Q-Learning example: Frozen Lake

In [1]:
import numpy as np 
import gymnasium as gym
import random
import os 
from tqdm.auto import tqdm

env = gym.make("FrozenLake-v1", map_name = "4x4", is_slippery=False,render_mode="rgb_array")
print("_____OBSERVATION SPACE_____ \n")
print("Observation Space",env.observation_space)
print("Sample Observation", env.observation_space.sample())

print("\n _____Action Space_____ \n")
print("Action Space Shape", env.action_space.n)
print("Action Space sample", env.action_space.sample())

_____OBSERVATION SPACE_____ 

Observation Space Discrete(16)
Sample Observation 4

 _____Action Space_____ 

Action Space Shape 4
Action Space sample 1


In [2]:
state_space = env.observation_space.n
print("There are ", state_space, " possible states")
action_space = env.action_space.n
print("There are ", action_space, " possible actions")
def initialize_q_table(state_space, action_space):
    Qtable = np.zeros((state_space,action_space))
    return Qtable

Qtable_frozenlake = initialize_q_table(state_space, action_space)

There are  16  possible states
There are  4  possible actions


In [3]:
def greedy_policy(Qtable,state):
    action = np.argmax(Qtable[state][:])
    return action

In [4]:
def epsilon_greedy_policy(Qtable, state, epsilon):
    random_num = random.uniform(0,1)

    if random_num > epsilon:
        action = greedy_policy(Qtable,state)

    else:
        action = env.action_space.sample()

    return action

In [6]:
# Training parameters 
n_training_episodes = 10000 # total training episodes 
learning_rate = 0.7

# Evaluation parameters 
n_eval_episodes = 100

# Env parameters 
env_id = 'FrozenLake-v1'
max_steps = 99
gamma = 0.95
eval_seed = []

# Exploration params 
max_epsilon = 1.0 
min_epsilon = 0.05
decay_rate = 0.0005
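
To get a feel for the exploration schedule these parameters produce, the sketch below evaluates the exponential decay formula that the training loop uses at a few episode indices. The helper name `epsilon_at` is made up for this illustration; its defaults copy the values above.

```python
import math


def epsilon_at(episode, min_epsilon=0.05, max_epsilon=1.0, decay_rate=0.0005):
    """Exploration rate for a given episode index under exponential decay."""
    return min_epsilon + (max_epsilon - min_epsilon) * math.exp(-decay_rate * episode)


# Epsilon starts at max_epsilon and approaches min_epsilon as training goes on.
for episode in (0, 1000, 5000, 9999):
    print(episode, round(epsilon_at(episode), 3))
```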

In [7]:
def train(n_training_episodes, min_epsilon, max_epsilon, decay_rate, env, max_steps, Qtable):
    # Note: learning_rate and gamma are read from the notebook's global variables.
    for episode in tqdm(range(n_training_episodes)):
        # Decay epsilon exponentially from max_epsilon towards min_epsilon
        epsilon = min_epsilon + (max_epsilon - min_epsilon)*np.exp(-decay_rate*episode)
        state, info = env.reset()

        for step in range(max_steps):
            # Act with the epsilon-greedy policy
            action = epsilon_greedy_policy(Qtable, state, epsilon)

            new_state, reward, terminated, truncated, info = env.step(action)

            # Q-learning update: the target uses the greedy (max) value of the next state
            Qtable[state][action] = Qtable[state][action] + learning_rate * (
                reward + gamma*np.max(Qtable[new_state]) - Qtable[state][action])

            if terminated or truncated:
                break

            state = new_state

    return Qtable

In [8]:
Qtable_frozenlake = train(n_training_episodes,min_epsilon,max_epsilon,decay_rate,env,max_steps,Qtable_frozenlake)

In [9]:
Qtable_frozenlake

array([[0.73509189, 0.77378094, 0.77378094, 0.73509189],
       [0.73509189, 0.        , 0.81450625, 0.77378094],
       [0.77378094, 0.857375  , 0.77378094, 0.81450625],
       [0.81450625, 0.        , 0.77378094, 0.77378094],
       [0.77378094, 0.81450625, 0.        , 0.73509189],
       [0.        , 0.        , 0.        , 0.        ],
       [0.        , 0.9025    , 0.        , 0.81450625],
       [0.        , 0.        , 0.        , 0.        ],
       [0.81450625, 0.        , 0.857375  , 0.77378094],
       [0.81450625, 0.9025    , 0.9025    , 0.        ],
       [0.857375  , 0.95      , 0.        , 0.857375  ],
       [0.        , 0.        , 0.        , 0.        ],
       [0.        , 0.        , 0.        , 0.        ],
       [0.        , 0.9025    , 0.95      , 0.857375  ],
       [0.9025    , 0.95      , 1.        , 0.9025    ],
       [0.        , 0.        , 0.        , 0.        ]])
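
A table of numbers is hard to read as a policy, so here is a sketch that turns a Q-table into a grid of arrows. It relies on the FrozenLake action encoding (0 = left, 1 = down, 2 = right, 3 = up); the helper `greedy_policy_grid` and the small `demo` table are made up for illustration.

```python
import numpy as np

# FrozenLake action encoding: 0 = left, 1 = down, 2 = right, 3 = up
ACTION_ARROWS = np.array(["←", "↓", "→", "↑"])


def greedy_policy_grid(Qtable, n_rows=4, n_cols=4):
    """Return a grid of arrows showing the greedy action in each state."""
    Qtable = np.asarray(Qtable)
    arrows = ACTION_ARROWS[np.argmax(Qtable, axis=1)]
    # All-zero rows were never updated (terminal or unvisited states).
    arrows[~Qtable.any(axis=1)] = "·"
    return arrows.reshape(n_rows, n_cols)


# Tiny made-up check: only state 0 has learned anything (go right).
demo = np.zeros((16, 4))
demo[0, 2] = 1.0
print(greedy_policy_grid(demo))
```

Calling `greedy_policy_grid(Qtable_frozenlake)` on the trained table should show arrows leading from the start tile in the top-left corner to the goal in the bottom-right corner, with dots on the holes and on the goal itself.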

In [10]:
def evaluate_agent(env, max_steps, n_eval_episodes, Q, seed):
    episode_rewards = []
    for episode in tqdm(range(n_eval_episodes)):
        if seed:
            state, info = env.reset(seed =seed[episode])
        else:
            state, info = env.reset()

        step = 0
        truncated = False
        terminated = False
        total_rewards_ep = 0

        for step in range(max_steps):
            action = greedy_policy(Q, state)
            new_state, reward, terminated, truncated, info = env.step(action)
            total_rewards_ep+=reward

            if terminated or truncated:
                break
            state = new_state
        episode_rewards.append(total_rewards_ep)
    mean_reward = np.mean(episode_rewards)
    std_reward = np.std(episode_rewards)

    return mean_reward, std_reward

In [11]:
mean_reward, std_reward = evaluate_agent(env, max_steps, n_eval_episodes, Qtable_frozenlake, eval_seed)
print(f"Mean_reward={mean_reward:.2f} +/- {std_reward:.2f}")

Mean_reward=1.00 +/- 0.00


## Model taking actions

In [12]:
env = gym.make('FrozenLake-v1',map_name = "4x4", is_slippery=False,render_mode='human')
observation,info=env.reset()
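terminated=False
truncated=False
score=0
steps=0
# Stop on either termination or truncation (time limit), as in the training loop
while not (terminated or truncated):
    action=greedy_policy(Qtable_frozenlake,observation)
    observation,reward,terminated,truncated,info=env.step(action)
    score+=reward
    steps+=1
    env.render()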
print(f'Score:{score}')
env.close()

Score:1.0


## Q-Learning example #2: Taxi

In [13]:
env = gym.make("Taxi-v3", render_mode = "rgb_array")
state_space = env.observation_space.n
print("There are ", state_space, " possible states")

action_space = env.action_space.n
print("There are ", action_space, " possible actions")

There are  500  possible states
There are  6  possible actions
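
The 500 states are 25 taxi positions times 5 passenger locations times 4 destinations. The sketch below packs and unpacks a state index, assuming the encoding described in the Gymnasium Taxi documentation: `((taxi_row * 5 + taxi_col) * 5 + passenger_location) * 4 + destination`. The helper names are hypothetical and written here for illustration rather than taken from the library.

```python
def encode_taxi_state(taxi_row, taxi_col, passenger_location, destination):
    """Pack the four state components into a single index in [0, 500)."""
    return ((taxi_row * 5 + taxi_col) * 5 + passenger_location) * 4 + destination


def decode_taxi_state(state):
    """Inverse of encode_taxi_state: unpack an index into its components."""
    state, destination = divmod(state, 4)
    state, passenger_location = divmod(state, 5)
    taxi_row, taxi_col = divmod(state, 5)
    return taxi_row, taxi_col, passenger_location, destination


print(decode_taxi_state(499))  # (4, 4, 4, 3), the largest index
```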


In [14]:
Qtable_taxi = initialize_q_table(state_space, action_space)
print(Qtable_taxi)
print("Q-table shape: ", Qtable_taxi.shape)

[[0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0.]
 ...
 [0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0.]]
Q-table shape:  (500, 6)


In [15]:
# Training parameters 
n_training_episodes = 25000
learning_rate = 0.7

# Evaluation params 
n_eval_episodes = 100

eval_seed = [
    16,
    54,
    165,
    177,
    191,
    191,
    120,
    80,
    149,
    178,
    48,
    38,
    6,
    125,
    174,
    73,
    50,
    172,
    100,
    148,
    146,
    6,
    25,
    40,
    68,
    148,
    49,
    167,
    9,
    97,
    164,
    176,
    61,
    7,
    54,
    55,
    161,
    131,
    184,
    51,
    170,
    12,
    120,
    113,
    95,
    126,
    51,
    98,
    36,
    135,
    54,
    82,
    45,
    95,
    89,
    59,
    95,
    124,
    9,
    113,
    58,
    85,
    51,
    134,
    121,
    169,
    105,
    21,
    30,
    11,
    50,
    65,
    12,
    43,
    82,
    145,
    152,
    97,
    106,
    55,
    31,
    85,
    38,
    112,
    102,
    168,
    123,
    97,
    21,
    83,
    158,
    26,
    80,
    63,
    5,
    81,
    32,
    11,
    28,
    148,
]

# Env params 
env_id = "Taxi-v3"
max_steps = 99
gamma = 0.95

#Exploration params 
max_epsilon = 1.0 
min_epsilon = 0.05
decay_rate = 0.005


In [16]:
Qtable_taxi = train(n_training_episodes,min_epsilon, max_epsilon, decay_rate, env, max_steps, Qtable_taxi)
Qtable_taxi

array([[  0.        ,   0.        ,   0.        ,   0.        ,
          0.        ,   0.        ],
       [  2.75200369,   3.94947756,   2.75200354,   3.94947757,
          5.20997639,  -5.05052243],
       [  7.93349184,   9.40367562,   5.20997639,   9.40367562,
         10.9512375 ,   0.40367562],
       ...,
       [ -2.65639999,   9.40367562,  -2.56532871,  -2.98871184,
         -9.70515   ,  -9.70515   ],
       [  1.95364336,   6.53681725,  -5.73854437,  -5.44492963,
        -13.59739954, -13.87662576],
       [ -1.3755    ,   9.59385   ,   7.12669996,  18.        ,
         -7.        ,  -7.        ]])

In [18]:
mean_reward, std_reward = evaluate_agent(env, max_steps, n_eval_episodes, Qtable_taxi, eval_seed)
print(f"Mean_reward={mean_reward:.2f} +/- {std_reward:.2f}")

Mean_reward=7.40 +/- 2.79


In [21]:
env = gym.make('Taxi-v3',render_mode='human')
observation,info=env.reset()
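terminated=False
truncated=False
score=0
steps=0
# Stop on either termination or truncation (time limit), as in the training loop
while not (terminated or truncated):
    action=greedy_policy(Qtable_taxi,observation)
    observation,reward,terminated,truncated,info=env.step(action)
    score+=reward
    steps+=1
    env.render()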
print(f'Score:{score}')
env.close()

Score:6
