# Module 4 - Programming Assignment

## Directions

There are general instructions on Blackboard and in the Syllabus for Programming Assignments. This Notebook also has instructions specific to this assignment. Read all the instructions carefully and make sure you understand them. Please ask questions on the discussion boards or email me at `EN605.445@gmail.com` if you do not understand something.

<div style="background: mistyrose; color: firebrick; border: 2px solid darkred; padding: 5px; margin: 10px;">
You must follow the directions *exactly* or you will get a 0 on the assignment.
</div>

You must submit a zip file of your assignment and associated files (if there are any) to Blackboard. The zip file will be named after your JHED ID: `<jhed_id>.zip`. It will not include any other information. Inside this zip file should be the following directory structure:

```
<jhed_id>
    |
    +--module-04-programming.ipynb
    +--module-04-programming.html
    +--world.txt
    +--test01.txt
    +--(any other files)
```

For example, do not name your directory `programming_assignment_01` and do not name your directory `smith122_pr1` or anything else. It must be only your JHED ID.

In [None]:
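import copy    # used by initialize_zero_array
import random  # used by pick_initial_state
from io import StringIO  # Python 3 location of StringIO
from IPython.core.display import *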

# add whatever else you need from the Anaconda packages

## Reinforcement Learning with Q-Learning

This problem is very similar to the one in Module 1 that we solved with A\* search but this time we're going to use a different approach.

We're replacing the deterministic movement from that module with stochastic movement. This means that an action doesn't result in a fixed successor state but in a probability distribution over successor states, so the successor state we want may not be the successor state we get.

There are a variety of ways to handle this problem. If the agent finds itself off the solution, you can simply calculate a new solution from where the agent ended up. Although this sounds like a really bad idea, it has actually been shown to work really well in Video Games that use formal Planning algorithms (which we will cover later).

Another approach is to use Reinforcement Learning. There are a variety of options there: model-based and model-free, Value Iteration, Q-Learning and SARSA. You are going to use Q-Learning.

## The World Representation

As before, we're going to simplify the problem by working in a grid world. The symbols that form the grid have a special meaning as they specify the type of the terrain and the cost to enter a grid cell with that type of terrain:

```
token   terrain    cost 
.       plains     1
*       forest     3
^       hills      5
~       swamp      7
x       mountains  impassible
```

When you go from a plains node to a forest node it costs 3. When you go from a forest node to a plains node, it costs 1. You can think of the grid as a big graph. Each grid cell (terrain symbol) is a node and there are edges to the north, south, east and west (except at the edges).
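
As a small illustration of the entry-cost rule, the sketch below looks up the cost of moving into a cell. The helper name `cost_to_enter`, the `terrain_costs` table and the one-row `sample` grid are made up for this example; they are not part of the assignment's required interface.

```python
# Hypothetical illustration: the cost of a move depends only on the terrain entered.
terrain_costs = {'.': 1, '*': 3, '^': 5, '~': 7}  # 'x' (mountains) has no cost: impassible

def cost_to_enter(world, x, y):
    """Return the cost of entering cell (x, y), or None if it is impassible."""
    return terrain_costs.get(world[y][x])

sample = [['.', '*', 'x']]
print(cost_to_enter(sample, 1, 0))  # entering the forest cell costs 3
print(cost_to_enter(sample, 0, 0))  # entering the plains cell costs 1
print(cost_to_enter(sample, 2, 0))  # mountains: None
```

The cell being left never appears in the lookup, which is why plains to forest costs 3 while forest to plains costs 1.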

There are quite a few differences between A\* Search and Reinforcement Learning, but one of the most salient is that A\* Search returns a plan of N steps that gets us from A to Z, for example, A->C->E->G.... Reinforcement Learning, on the other hand, returns a *policy* that tells us the best thing to do in every state.

For example, the policy might say that the best thing to do in A is go to C. However, we might find ourselves in D. In which case, the policy might say, D->E. Trying this action might land us in C and the policy will say, C->E, etc. At least with offline learning, everything will be learned in advance (in online learning, you can only learn by doing and so you may act according to a known but suboptimal policy).

## Reading the Map

To avoid global variables, we have a `read_world()` function that takes a filename and returns the world as a `List` of `List`s. **The same coordinate reversal applies as in PR01: (x, y) is world[ y][ x].**
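
To make the coordinate reversal concrete, here is a small sketch that uses an inline grid rather than a file. The `tiny_world` name and its contents are invented for illustration; the outer list indexes rows (y) and the inner list indexes columns (x).

```python
# Hypothetical 2-row, 3-column world in the same list-of-lists shape read_world() returns.
tiny_world = [list(".*^"),
              list("~x.")]

rows = len(tiny_world)     # number of y values
cols = len(tiny_world[0])  # number of x values

x, y = 2, 0
print(rows, cols)          # 2 3
print(tiny_world[y][x])    # ^  because (x, y) = (2, 0) is row 0, column 2
```

Using an asymmetric grid like this one makes an accidental `world[x][y]` fail quickly with an `IndexError` instead of silently reading the wrong cell.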

In [None]:
def read_world( filename):
    # Read the map file; the with-block closes the file automatically.
    with open( filename, 'r') as f:
        world_data = f.readlines()
    world = []
    for line in world_data:
        line = line.strip()
        if line == "": continue  # skip blank lines
        world.append(list(line))  # one terrain token per cell
    return world

Next we create a dict of movement costs. Note that we've negated them this time because RL requires negative costs and positive rewards:

In [None]:
costs = { '.': -1, '*': -3, '^': -5, '~': -7}
costs

and a list of offsets for `cardinal_moves`. You'll need to work this into your actions, A, parameter.

In [None]:
cardinal_moves = [(0,-1), (1,0), (0,1), (-1,0)]

And now the confusing bits begin. We must program both the Q-Learning algorithm and a *simulator*. The Q-Learning algorithm doesn't know T but the simulator *must*. Essentially, the *simulator* is the part of your code that applies a move and determines what state you actually end up in (rather than the state you planned to end up in).

The transition function your *simulation* should use, T, is 0.70 for the desired direction, and 0.10 each for the other possible directions. That is, if I select "up" then 70% of the time, I go up but 10% of the time I go left, 10% of the time I go right and 10% of the time I go down. If you're at the edge of the map, you simply bounce back to the current state.
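
One possible shape for such a simulator step is sketched below. The function name `simulate_move`, its parameters and the 3x3 `grid` are hypothetical, and the sketch assumes that an impassible `x` cell bounces the agent back in the same way the map edge does, which the text above does not specify.

```python
import random

def simulate_move(world, state, action, actions, rng=random):
    """Sample the state actually reached when `action` is attempted from `state`."""
    others = [a for a in actions if a != action]
    # 70% for the intended direction; the remaining 30% is split evenly among the others.
    weights = [0.70] + [0.30 / len(others)] * len(others)
    actual = rng.choices([action] + others, weights=weights, k=1)[0]

    x, y = state[0] + actual[0], state[1] + actual[1]
    off_map = y < 0 or y >= len(world) or x < 0 or x >= len(world[0])
    # Assumption: 'x' cells behave like the edge of the map.
    if off_map or world[y][x] == 'x':
        return state  # bounce back to the current state
    return (x, y)

moves = [(0, -1), (1, 0), (0, 1), (-1, 0)]
grid = [list("..."), list(".x."), list("...")]
print(simulate_move(grid, (0, 0), (1, 0), moves, random.Random(0)))
```

Passing a seeded `random.Random` instance as `rng` keeps test runs repeatable; the learner itself should only ever see the returned state, never the weights.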

You need to implement `q_learning()` with the following parameters:

+ world: a `List` of `List`s of terrain (this is S from S, A, T, gamma, R)
+ costs: a `Dict` of costs by terrain (this is part of R)
+ goal: A `Tuple` of (x, y) stating the goal state.
+ reward: The reward for achieving the goal state.
+ actions: a `List` of possible actions, A, as offsets.
+ gamma: the discount rate
+ alpha: the learning rate

You will return a policy:

`{(x1, y1): action1, (x2, y2): action2, ...}`

Remember...a policy is what to do in any state for all the states. Notice how this is different from A\* search, which only returns actions to take from the start to the goal. This also explains why `q_learning` doesn't take a `start` state!

You should also define a function `pretty_print_policy( cols, rows, policy)` that takes a policy and prints it out as a grid using "^" for up, "<" for left, "v" for down and ">" for right. Note that it doesn't need the `world` because the policy has a move for every state. However, you do need to know how big the grid is so you can pull the values out of the `Dict` that is returned.

```
vvvvvvv
vvvvvvv
vvvvvvv
>>>>>>v
^^^>>>v
^^^>>>v
^^^>>>G
```

(Note that that policy is completely made up and only illustrative of the desired output).
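
As a rough sketch of the offset-to-arrow mapping, the snippet below builds the grid rows as strings. The names `ARROWS` and `policy_to_lines` and the `missing` parameter are hypothetical, and treating a state with no entry in the policy as a placeholder character (used here for the goal) is an assumption, not a requirement of the assignment.

```python
# Arrow for each offset in cardinal_moves; y grows downward, so (0, -1) is "up".
ARROWS = {(0, -1): '^', (1, 0): '>', (0, 1): 'v', (-1, 0): '<'}

def policy_to_lines(cols, rows, policy, missing='?'):
    """Build one string per grid row; states without a known action show `missing`."""
    lines = []
    for y in range(rows):
        lines.append(''.join(ARROWS.get(policy.get((x, y)), missing) for x in range(cols)))
    return lines

demo_policy = {(0, 0): (1, 0), (1, 0): (0, 1), (0, 1): (1, 0)}
print('\n'.join(policy_to_lines(2, 2, demo_policy, missing='G')))
# >v
# >G
```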

There are a lot of details that I have left up to you. For example, when do you stop? Is there a strategy for learning the policy? Watch and re-watch the lecture on Q-Learning. Ask questions. You need to implement a way to pick initial states for each iteration and you need a way to balance exploration and exploitation while learning. You may have to experiment with different gamma and alpha values. Be careful with your reward...the best reward is related to the discount rate and the approximate number of actions you need to reach the goal.
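
To get a feel for how the reward, the discount rate and the distance to the goal interact, the arithmetic below may help. It assumes a reward of 100, a gamma of 0.9 and a cost of -1 per step (all plains); these numbers and the helper names are illustrative only.

```python
def discounted(value, gamma, steps):
    """Present value of `value` received `steps` actions in the future."""
    return value * gamma ** steps

def discounted_step_costs(step_cost, gamma, steps):
    """Discounted sum of a constant cost paid on each of the next `steps` actions."""
    return sum(step_cost * gamma ** t for t in range(steps))

gamma = 0.9
for steps in (5, 20):
    goal_part = discounted(100, gamma, steps)
    cost_part = discounted_step_costs(-1, gamma, steps)
    print(f"{steps} steps: reward worth {goal_part:.2f}, costs worth {cost_part:.2f}")
```

The further a state is from the goal, the smaller the discounted reward becomes relative to the accumulated costs, so a reward that works on `test01.txt` may be too small for the full world.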

* If everything is otherwise the same, do you think that the path from (0,0) to the goal would be the same for both A\* Search and Q-Learning?
* What do you think if you have a map that looks like:

```
><>>^
>>>>v
>>>>v
>>>>v
>>>>G
```

has this converged?


**Work with a smaller test world to start!** Name it `test01.txt` and return it with your assignment. I suggest it be asymmetric to avoid corner cases (5x6 for example).

Remember that you should follow the general style guidelines for this course: well-named, short, focused functions with limited indentation using Markdown documentation that explains their implementation and the AI concepts behind them.

This assignment sometimes wreaks havoc with IPython Notebook, so save often. Put your helper functions here, along with their documentation. There should be one Markdown cell for the documentation, followed by one code cell for the implementation.

----

**Pick initial state**

In the q_learning algorithm, an initial state must be selected for each episode.  pick_initial_state selects a random state in the world at which the agent will start its exploration.

In [None]:
def pick_initial_state(world):
    y_coordinate = random.randint(0, len(world) - 1)
    
    x_coordinate = random.randint(0, len(world[0]) - 1)

    return (x_coordinate, y_coordinate)

&nbsp;

**Initialize zero array**

In the q_learning algorithm, Q must be initialized to all zeros.  initialize_zero_array uses the dimensions of the given world to create a two-dimensional array of the same size, initialized with all zeros.

In [None]:
def initialize_zero_array(world):
    zero_array = []
    height = len(world)
    width = len(world[0])
    zeros = [0] * width

    for i in range(height):
        new_zero_array = copy.deepcopy(zeros)
        zero_array.append(new_zero_array)

    return zero_array

&nbsp;

**Initialize zeros**

initialize_zeros utilizes initialize_zero_array to create a dictionary where each action is the key, and the value is a multidimensional array of zeros.  This format of dictionary is used to keep track of q values (variable 'q') as well as a count of which state action pairs have been visited (variable 'visits').  The format is { action1: [[0,0...], [0,0...]...], action2: [[0,0...], [0,0...]...] }.  
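
The layout can be illustrated on a made-up 2x3 world. The sketch below builds the same kind of structure with a comprehension (an alternative to the helper, not a replacement for it) and shows why each row must be a separate list.

```python
moves = [(0, -1), (1, 0), (0, 1), (-1, 0)]
height, width = 2, 3

# One independent grid of zeros per action, indexed as q[action][y][x].
q = {action: [[0] * width for _ in range(height)] for action in moves}
q[(1, 0)][0][2] = 5.0          # value of moving right from state (x=2, y=0)
print(q[(1, 0)])               # [[0, 0, 5.0], [0, 0, 0]]

# Pitfall: repeating the same row object makes every row an alias of one list.
aliased = [[0] * width] * height
aliased[0][2] = 5.0
print(aliased)                 # [[0, 0, 5.0], [0, 0, 5.0]]
```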

In [None]:
def initialize_zeros(world, actions):
    q = {}
    for action in actions:
        q[action] = initialize_zero_array(world)

    return q

&nbsp;

**Is valid**

Given a state, an action, and the world, is_valid determines whether that action is valid for that state.  The action is valid only if taking it would not cause the state to be outside of the world or land on impassible terrain.  This function is used to determine which actions to take from a state, and to get the final policy using the q values.

In [None]:
def is_valid(state, action, world):
    x_coordinate = state[0] + action[0]
    y_coordinate = state[1] + action[1]

    world_depth = len(world)
    if y_coordinate < 0 or y_coordinate >= world_depth:
        return False

    world_width = len(world[0])
    if x_coordinate < 0 or x_coordinate >= world_width:
        return False

    if world[y_coordinate][x_coordinate] == 'x':
        return False

    return True

&nbsp;

**Get actions**

get_actions selects the action to attempt from a given state, along with the other available actions.  For each possible action, it checks whether that action is valid for the state using is_valid.  Of the valid actions, it selects the one that has been tried least often from this state according to the visits data structure, which favors exploration.  It returns two values: selected_action and a list, other_actions.  The simulator uses both when choosing which action is actually executed, based on the transition probabilities defined above.

In [None]:
def get_actions(state, visits, actions, world):
    min_visits = float("inf")
    selected_action = None
    other_actions = []
    for action in actions:
        if is_valid(state, action, world):
            x_coordinate = state[0]
            y_coordinate = state[1]
            num_visits = visits[action][y_coordinate][x_coordinate]
            if num_visits < min_visits:
                min_visits = num_visits

                if selected_action and selected_action not in other_actions:
                    other_actions.append(selected_action)

                selected_action = action

            elif action not in other_actions:
                other_actions.append(action)

    if selected_action in other_actions:
        other_actions.remove(selected_action)

    return selected_action, other_actions

&nbsp;

**Get max action**

get_max_action_value is used for the q value calculation.  The formula for Q[s, a] requires finding the maximum of Q[s', a'] over all actions a', and this function finds that value.  Given a state, actions, and the q data structure, it returns the maximum q value for that state.

In [None]:
def get_max_action_value(state, actions, q):
    # Use a name other than max so the built-in max() is not shadowed.
    max_value = float("-inf")
    x_coordinate = state[0]
    y_coordinate = state[1]

    for action in actions:
        q_value = q[action][y_coordinate][x_coordinate]
        if q_value > max_value:
            max_value = q_value

    return max_value

&nbsp;

**Calculate Q value**

calculate_q_value calculates the Q[s, a] value required for the q_learning algorithm.  Given alpha, gamma, reward, q, action, state, actions, new_state it returns the value for the formula Q[s, a] = (1 - alpha) * q_value + alpha * (reward + gamma * max_action_value)
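
A worked instance of the formula, with made-up numbers, shows how the old estimate is blended with the new target. It also checks the equivalent form Q[s, a] + alpha * (target - Q[s, a]).

```python
alpha, gamma = 0.25, 0.9
old_q = 2.0          # current estimate Q[s, a]
reward = -3.0        # cost of entering a forest cell
max_next_q = 10.0    # best Q value available from the successor state s'

target = reward + gamma * max_next_q            # -3 + 9 = 6
new_q = (1 - alpha) * old_q + alpha * target    # 0.75 * 2 + 0.25 * 6 = 3.0
same = old_q + alpha * (target - old_q)         # 2 + 0.25 * 4 = 3.0
print(new_q, same)                              # 3.0 3.0
```

With alpha at 0.25, the estimate moves a quarter of the way from 2.0 toward the target of 6.0.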

In [None]:
def calculate_q_value(alpha, gamma, reward, q, action, state, actions, new_state):
    max_action_value = get_max_action_value(new_state, actions, q)

    x_coordinate = state[0]
    y_coordinate = state[1]
    q_value = q[action][y_coordinate][x_coordinate]

    new_q_value = (1 - alpha) * q_value + alpha * (reward + gamma * max_action_value)

    return new_q_value

&nbsp;

**Check for convergence**

check_for_convergence determines whether the q_learning algorithm has converged.  It compares the values in the current q structure with the values in the q structure from the previous episode.  If no value differs from its previous value by more than epsilon, it returns True.
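
The comparison only means something if the previous q structure is an independent snapshot. The sketch below uses a made-up one-row q with two actions and `copy.deepcopy` to take the snapshot, then computes the largest change, which is the quantity compared against epsilon.

```python
import copy

q = {(1, 0): [[0.0, 1.0]], (0, 1): [[2.0, 3.0]]}
previous_q = copy.deepcopy(q)      # independent snapshot taken before the episode
q[(0, 1)][0][1] = 3.5              # an update made during the episode

largest_change = max(
    abs(value - previous_q[action][i][j])
    for action, grid in q.items()
    for i, row in enumerate(grid)
    for j, value in enumerate(row)
)
print(largest_change)  # 0.5
```

If `previous_q = q` were used instead, both names would refer to the same lists and the largest change would always be 0.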

In [None]:
def check_for_convergence(q, previous_q, actions):
    epsilon = 0.2

    for action in actions:
        q_values = q[action]
        previous_q_values = previous_q[action]
        for i in range(len(q_values)):
            row = q_values[i]
            for j in range(len(row)):
                if abs(q_values[i][j] - previous_q_values[i][j]) > epsilon:
                    return False

    return True

-----

In [None]:
def q_learning( world, costs, goal, reward, actions, gamma, alpha):
    pass

In [None]:
def pretty_print_policy( cols, rows, policy):
    pass

## Test World

In [None]:
test_world = read_world( "test01.txt")

In [None]:
# goal = ?? # FILL ME IN
# reward = ?? # FILL ME IN
# gamma = ?? # FILL ME IN
# alpha = ?? # FILL ME IN

test_policy = q_learning( test_world, costs, goal, reward, cardinal_moves, gamma, alpha)

In [None]:
# cols = ?? # FILL ME IN
# rows = ?? # FILL ME IN

pretty_print_policy( cols, rows, test_policy)

## Full World

In [None]:
full_world = read_world( "world.txt")

In [None]:
# goal = ?? # FILL ME IN
# reward = ?? # FILL ME IN
# gamma = ?? # FILL ME IN
# alpha = ?? # FILL ME IN

full_policy = q_learning( full_world, costs, goal, reward, cardinal_moves, gamma, alpha)

In [None]:
# cols = ?? # FILL ME IN
# rows = ?? # FILL ME IN

pretty_print_policy( cols, rows, full_policy)