# Reinforcement learning - dynamic programming

At the core of reinforcement learning is the derivation of an optimal policy: a function that maps each state to the action to execute in it. Reinforcement learning is suited to a class of problems termed Markov Decision Processes (MDPs). These processes are characterised by five key parameters: $S$ (states), $A$ (actions), $P$ (transition matrix), $R$ (reward) and $\gamma$ (discount factor). If all five parameters are explicitly known, dynamic programming can be applied to MDPs to solve *(value) prediction* and *(optimal) control*.
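
To make the five parameters concrete, the sketch below stores them in a small Python container. The class name `MDP`, the array layouts (`P[a, s, s']` and `R[s, a]`) and the two-state toy problem are illustrative assumptions, not part of the notebook's own code.

```python
from typing import NamedTuple

import numpy as np


class MDP(NamedTuple):
    """Hypothetical container for the five MDP parameters."""
    states: list       # S: the set of states
    actions: list      # A: the set of actions
    P: np.ndarray      # P[a, s, s']: probability of moving s -> s' under action a
    R: np.ndarray      # R[s, a]: expected immediate reward for action a in state s
    gamma: float       # discount factor


# A made-up two-state problem, used only to show the shapes involved.
toy = MDP(
    states=["s0", "s1"],
    actions=["stay", "move"],
    P=np.array([
        [[1.0, 0.0], [0.0, 1.0]],   # "stay": remain in the current state
        [[0.1, 0.9], [0.9, 0.1]],   # "move": usually switch state
    ]),
    R=np.array([[0.0, -1.0], [1.0, -1.0]]),
    gamma=0.9,
)

# Every row P[a, s, :] must be a probability distribution over next states.
assert np.allclose(toy.P.sum(axis=2), 1.0)
```

When all five fields are known like this, the model is fully specified and dynamic programming can be applied directly.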

Dynamic programming solves a complex problem by breaking it into smaller, simpler subproblems, solving those subproblems, and then combining their solutions into a solution to the overall problem. It relies on two important properties:

1. Optimal substructure: a complex problem can be decomposed into subproblems, and the optimal solutions to those subproblems can be combined to form the optimal solution to the overall problem.
2. Overlapping subproblems: the same subproblems recur many times, so each solution can be computed once, cached, and reused whenever that subproblem reoccurs.
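
Both properties can be seen in a small example outside reinforcement learning. The sketch below finds the cheapest path through a grid when only right and down moves are allowed; the grid values and the function name `min_cost` are made up for illustration.

```python
from functools import lru_cache

# Hypothetical cost of entering each cell.
GRID = (
    (1, 3, 1),
    (1, 5, 1),
    (4, 2, 1),
)


@lru_cache(maxsize=None)  # overlapping subproblems: each cell is solved once and cached
def min_cost(row: int, col: int) -> int:
    """Cheapest total cost of reaching (row, col) from (0, 0) moving right or down."""
    if row == 0 and col == 0:
        return GRID[0][0]
    candidates = []
    if row > 0:
        candidates.append(min_cost(row - 1, col))   # arrive from above
    if col > 0:
        candidates.append(min_cost(row, col - 1))   # arrive from the left
    # Optimal substructure: the best path to this cell extends the best path
    # to one of its predecessors.
    return GRID[row][col] + min(candidates)


best = min_cost(2, 2)
print(best)  # 7
```

Without the cache, the same cells would be recomputed once for every path that passes through them; with it, each of the nine cells is computed exactly once.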
    
    
Problems of the above form can be solved in both a prediction and a control setting. Prediction calculates the total value achievable from each state under a specific policy, whereas control determines the action that should be taken in each state.



## Policy evaluation - prediction
Policy evaluation aims to answer the question: "if I follow a particular policy $\pi$, how 'good' is it to be in a particular state?" An MDP and a policy are given, and the policy is then evaluated by computing the value of every state when following it.
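
As a minimal sketch of the idea, the code below evaluates a uniform random policy on a one-dimensional corridor by repeatedly replacing each state's value with the expected one-step reward plus the discounted value of the next state. The corridor, the names `STEP_REWARD`, `GAMMA` and `evaluate_policy`, and the stopping threshold are assumptions chosen for illustration; they are separate from the grid world set up in the notebook code.

```python
import numpy as np

N_STATES = 5                          # hypothetical corridor with states 0..4
TERMINAL = {0, 4}                     # both ends are terminal
ACTIONS = {"LEFT": -1, "RIGHT": +1}   # deterministic moves
POLICY = {"LEFT": 0.5, "RIGHT": 0.5}  # uniform random policy
STEP_REWARD = -1.0                    # reward for every non-terminal step
GAMMA = 1.0                           # discount factor


def evaluate_policy(theta: float = 1e-8, max_sweeps: int = 10_000) -> np.ndarray:
    """Iterative policy evaluation with synchronous sweeps over all states."""
    values = np.zeros(N_STATES)
    for _ in range(max_sweeps):
        new_values = np.zeros(N_STATES)
        for s in range(N_STATES):
            if s in TERMINAL:
                continue  # terminal states keep a value of zero
            for action, step in ACTIONS.items():
                s_next = s + step
                # Expected backup: weight each action by its policy probability.
                new_values[s] += POLICY[action] * (STEP_REWARD + GAMMA * values[s_next])
        delta = np.max(np.abs(new_values - values))
        values = new_values
        if delta < theta:  # stop once a full sweep barely changes anything
            break
    return values


v = evaluate_policy()
print(np.round(v, 3))
```

For this corridor the values converge towards -3, -4 and -3 for the three interior states, which is the negative of the expected number of steps a random walk needs to reach either end.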

In [1]:
from tkinter import *
import numpy as np

CANVAS_HEIGHT_WIDTH = 600
GRID_DIM = 5
np.set_printoptions(suppress=True)

DISCOUNT_FACTOR = 1.0
REWARD = -1

line_distance = CANVAS_HEIGHT_WIDTH/GRID_DIM
label_offset = line_distance/2


#### Define policy
trans_prob = {"UP":0.25, "DOWN":0.25, "LEFT":0.25, "RIGHT":0.25}