# TD METHODS

In DP methods, we found optimal policies, but we relied on ```transition_map``` to do so. This is often impractical, for several reasons:

- We don't know all the states we can visit
- Environment dynamics can be unknown
- Iterating over all states can be computationally infeasible

In this notebook, we will practice TD methods. Instead of computing exact values, we collect samples and estimate values from them. A natural first approach to sampling is the **Monte Carlo** (MC) method, which is the first algorithm we will implement.
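To make "estimating the value over samples" concrete, here is a minimal sketch that averages the discounted returns of a few sampled episodes. The function names and the reward lists are made up for illustration; they are not part of `mc_agent`.

```python
def discounted_return(rewards, gamma):
    """Return G = r_0 + gamma * r_1 + gamma^2 * r_2 + ... for one episode."""
    g = 0.0
    # Accumulate from the last reward backwards: G_t = r_t + gamma * G_{t+1}.
    for r in reversed(rewards):
        g = r + gamma * g
    return g


def mc_value_estimate(episodes, gamma):
    """Average the sampled returns of episodes that start in the same state."""
    returns = [discounted_return(rewards, gamma) for rewards in episodes]
    return sum(returns) / len(returns)


# Three hypothetical episodes, each given as its list of per-step rewards.
sampled_episodes = [[0.0, 0.0, 1.0], [0.0, 1.0], [0.0, 0.0, 0.0]]
estimate = mc_value_estimate(sampled_episodes, gamma=0.5)
```

The more episodes we sample, the closer this average is expected to get to the true value of the start state under the policy that generated them.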

#### Environment

We will use a new environment called **Warehouse**, where an agent tries to match each item with its corresponding box. Building a transition map for this environment would be painful. Luckily, TD methods do not need one. The ```worldmap``` is given below (you can modify it for fun, but we will use the one below for evaluation).

You can run the cell below to visualize the environment.
- <span style="color:#989898">Dark gray cells</span> are impassable cells.
- <span style="color:#DADADA">Light gray cells</span> are passable empty cells.
- <span style="color:#00B8FA">Blue cell</span> is the agent.
- <span style="color:#A33675"> Darker magenta cell </span> and <span style="color:#48C69F"> darker green cell </span> are items to collect. 
- <span style="color:#B34685"> Lighter magenta cell </span> and <span style="color:#58D6AF"> lighter green cell </span> are the boxes.

The pairing is also given below. Each key in the pairing dictionary is a box (uppercase), and its value is a list of the items (lowercase) that can be delivered to that box.
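As a small illustration of how such a dictionary can be read, the sketch below checks whether a box accepts an item. The helper `accepts` is hypothetical and is not part of the `Warehouse` API.

```python
# Same structure as the pairing used in this notebook.
pairing = {
    "B": ["b"],
    "C": ["c"],
}


def accepts(pairing, box, item):
    """Return True if `item` may be delivered to `box` (hypothetical helper)."""
    # Boxes missing from the dictionary accept nothing.
    return item in pairing.get(box, [])


can_deliver_b_to_B = accepts(pairing, "B", "b")
can_deliver_c_to_B = accepts(pairing, "B", "c")
```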

The state representation is different here. In addition to the agent's position, the state has 4 boolean features: one for each of the two items and the two boxes. A feature is true if the item or box exists on the map and false otherwise. When a box receives its item, it disappears. Note that the code below calls boxes *buckets* and items *balls*.
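One way to picture such a state is as a hashable tuple, which can then be used as a key in a tabular value function. The field names and their order below are assumptions made for illustration; the actual `Warehouse` state may be encoded differently.

```python
from collections import namedtuple

# Hypothetical layout: agent position plus one "exists on the map" flag
# per item (b, c) and per box (B, C).
State = namedtuple("State", ["row", "col", "has_b", "has_c", "has_B", "has_C"])

# The agent at row 4, column 4 with everything still on the map.
start = State(row=4, col=4, has_b=True, has_c=True, has_B=True, has_C=True)

# Same position, but item "b" is no longer on the map: a different state.
other = start._replace(has_b=False)

# Tuples are hashable, so (state, action) pairs can index a Q-table.
q_table = {(start, 0): 0.0, (other, 0): 0.0}

# Each position can appear with 2**4 flag combinations.
flag_combinations = 2 ** 4
```

Because the flags are part of the state, the agent can learn different values for the same cell depending on what is left on the map.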

In [None]:
%load_ext autoreload
%autoreload 2

from env import Warehouse

worldmap = ["#########",
            "#   #   #",
            "# c # C #",
            "#   #   #",
            "#   P   #",
            "#   #   #",
            "# b # B #",
            "#   #   #",
            "#########"]

# Buckets are Uppercase letters while balls are lowercase
# Matching is done so that the ball "b" must be carried to the bucket "B"
pairing = {
    "B": ["b"],
    "C": ["c"]
}

env = Warehouse(balls="cb", buckets="BC",
                pairing=pairing, worldmap=worldmap)
env.init_render()


## Monte Carlo Method

In [None]:
from mc_agent import MCAgent
from collections import namedtuple
import numpy as np
import random

# Initialize agent
agent = MCAgent(n_acts=4)

# Hyperparameters
args = dict(
    iteration = 800,
    gamma = 0.99,
    alpha = 0.03,
    init_eps = 0.9,
    final_eps = 0.1,
    eps_decay_rate = 0.995,
    seed = 12021,            # Current year: (10000 + 2021 = 12021) Holocene calendar
)
args = namedtuple('args', args.keys())(*args.values()) 

# Seed
np.random.seed(args.seed)
random.seed(args.seed)

# Training loop
reward_list = []
epsilon = args.init_eps
for ix in range(args.iteration):
    epsilon = max(args.final_eps, epsilon*args.eps_decay_rate)
    reward = agent.one_episode_train(
        env, lambda x: agent.e_greedy_policy(x, epsilon), args.gamma)
    reward_list.append(reward)
    if ((ix + 1) % 50) == 0:
        print("Episode: {}, mean reward (last 100): {}".format(ix + 1, np.mean(reward_list[-100:])))

### Let's try to render the trained agent

In [None]:
env = Warehouse(balls="cb", buckets="BC",
                pairing=pairing, worldmap=worldmap)

env.init_render()

In [None]:
agent.evaluate(env, render=True)

The MC method waits until the episode terminates before it updates. However, it is possible to update the policy after every transition, which is exactly what Temporal Difference (TD) methods do. There are two popular TD methods that you will implement: Q-learning and SARSA.
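The per-transition idea can be sketched with the simplest TD rule, TD(0) on state values: the estimate of the current state is moved toward the reward plus the discounted estimate of the next state. The function name and the toy states are assumptions for illustration, not the interface of `td_agents`.

```python
from collections import defaultdict


def td0_update(values, state, reward, next_state, done, alpha, gamma):
    """Apply one TD(0) update to `values` and return the TD error."""
    # Terminal transitions have nothing to bootstrap from.
    bootstrap = 0.0 if done else values[next_state]
    td_error = reward + gamma * bootstrap - values[state]
    values[state] += alpha * td_error
    return td_error


values = defaultdict(float)  # unseen states start at 0
# A single made-up transition: s0 -> s1 with reward 1.
td0_update(values, "s0", 1.0, "s1", False, alpha=0.5, gamma=0.9)
```

Unlike the MC update, this needs only one transition, so learning can start before the episode ends.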

## Q-Learning

Q-learning is an off-policy algorithm that uses temporal-difference value estimation: it learns the values of the greedy policy while following an exploratory policy (here, ε-greedy).
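A minimal sketch of the tabular Q-learning update is given below. It bootstraps from the best next action, whichever action the agent actually takes next. The function signature and the dictionary-based table are assumptions for illustration; `QAgent` may be organized differently.

```python
from collections import defaultdict


def q_learning_update(q, state, action, reward, next_state, done,
                      n_acts, alpha, gamma):
    """Apply one Q-learning update to `q` and return the TD error."""
    # Off-policy target: the maximum over next actions.
    best_next = 0.0 if done else max(q[(next_state, a)] for a in range(n_acts))
    td_error = reward + gamma * best_next - q[(state, action)]
    q[(state, action)] += alpha * td_error
    return td_error


q = defaultdict(float)  # unseen (state, action) pairs start at 0
q[("s1", 2)] = 1.0      # pretend action 2 looks good in s1
q_learning_update(q, "s0", 0, 0.0, "s1", False, n_acts=4, alpha=0.4, gamma=0.9)
```

Here `q[("s0", 0)]` moves toward `0.9 * 1.0` because the maximum over the next actions is used as the target.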

In [None]:
# Initialize environment
env = Warehouse(balls="cb", buckets="BC",
                pairing=pairing, worldmap=worldmap)
env.init_render()

In [None]:
from collections import namedtuple
import numpy as np
import random
import matplotlib.pyplot as plt

from td_agents import QAgent

# Initialize agent
q_agent = QAgent(n_acts=4)
# Hyperparameters
args = dict(
    episodes = 600,
    evaluate_period = 50,
    gamma = 0.9,
    alpha = 0.4,
    init_eps = 0.9,
    final_eps = 0.1,
    eps_decay_rate = 0.999,
    seed = 12021,            # Current year: (10000 + 2021 = 12021) Holocene calendar
)
args = namedtuple('args', args.keys())(*args.values()) 

# Seed
np.random.seed(args.seed)
random.seed(args.seed)


q_rewards = q_agent.train(env, q_agent.e_greedy_policy, args)

### Let's try to render the trained agent

In [None]:
env = Warehouse(balls="cb", buckets="BC",
                pairing=pairing, worldmap=worldmap)

env.init_render()

In [None]:
q_agent.evaluate(env, render=True)

## SARSA
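
SARSA is the on-policy counterpart of Q-learning: its target uses the value of the action the agent actually selects in the next state instead of the maximum over actions. The sketch below shows one such update. The function signature and the dictionary-based table are assumptions for illustration; `SarsaAgent` may be organized differently.

```python
from collections import defaultdict


def sarsa_update(q, state, action, reward, next_state, next_action, done,
                 alpha, gamma):
    """Apply one SARSA update to `q` and return the TD error."""
    # On-policy target: the value of the next action that was really chosen.
    next_q = 0.0 if done else q[(next_state, next_action)]
    td_error = reward + gamma * next_q - q[(state, action)]
    q[(state, action)] += alpha * td_error
    return td_error


q = defaultdict(float)  # unseen (state, action) pairs start at 0
q[("s1", 2)] = 1.0      # pretend action 2 looks good in s1

# The agent explores and picks action 1 in s1, so the target ignores action 2.
explore_error = sarsa_update(q, "s0", 0, 0.0, "s1", 1, False, alpha=0.5, gamma=0.9)

# If it picks action 2 instead, the estimate moves toward 0.9 * 1.0.
greedy_error = sarsa_update(q, "s0", 0, 0.0, "s1", 2, False, alpha=0.5, gamma=0.9)
```

Because exploratory actions enter the target, SARSA evaluates the ε-greedy policy it is following, exploration included.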

In [None]:
# Initialize environment
env = Warehouse(balls="cb", buckets="BC",
                pairing=pairing, worldmap=worldmap)
env.init_render()

In [None]:
from collections import namedtuple
import numpy as np
import random
import matplotlib.pyplot as plt

from td_agents import SarsaAgent

# Initialize agent
s_agent = SarsaAgent(n_acts=4)
# Hyperparameters
args = dict(
    episodes = 800,
    evaluate_period = 50,
    gamma = 0.9,
    alpha = 0.5,
    init_eps = 0.9,
    final_eps = 0.1,
    eps_decay_rate = 0.995,
    seed = 12021,            # Current year: (10000 + 2021 = 12021) Holocene calendar
)
args = namedtuple('args', args.keys())(*args.values()) 

# Seed
np.random.seed(args.seed)
random.seed(args.seed)


s_rewards = s_agent.train(env, s_agent.e_greedy_policy, args)

### Let's try to render the trained agent

In [None]:
env = Warehouse(balls="cb", buckets="BC",
                pairing=pairing, worldmap=worldmap)

env.init_render()

In [None]:
s_agent.evaluate(env, render=True)