# A tour of Reinforce.jl

The goal of this tutorial is to introduce you to the ```Reinforce.jl``` library, a Reinforcement Learning library written in Julia.

You'll find it particularly beneficial if you're someone who is interested in
1. Reinforcement Learning and what a real library looks like
2. Features of Julia that make writing a library simpler
3. Contributing to the ```Reinforce.jl``` project

We'll explore the library in a breadth-first manner: we start from an example and understand it end to end. Then at each step we record all the abstractions we're relying on and explore them one by one until we reach the core Julia libraries.


PUT IN tree of how everything relates

So let's get started and run 
    
```git clone https://github.com/JuliaML/Reinforce.jl ```

You'll notice that there are 3 folders
1. examples/
2. src/
3. test/

Starting with examples/ there's just the one file called ```mountain_car.jl```

Mountain car is a popular testbed for RL algorithms. In other words, if your algorithm doesn't do well on mountain car then it's unlikely to do well on harder problems like Dota.

INSERT IMAGE OF MOUNTAIN CAR HERE

## Canonical example

In [16]:
# examples/mountain_car.jl

# Import the RL algorithms
using Reinforce

# Import the Mountain car environment
using Reinforce.MountainCarEnv: MountainCar

# Import the Plots plotting library and use GR as the backend 
# https://github.com/jheinen/GR.jl
using Plots
gr()

Plots.GRBackend()

In [19]:
# examples/mountain_car.jl

# Define a deterministic policy that respects the AbstractPolicy interface
mutable struct BasicCarPolicy <: Reinforce.AbstractPolicy end

#We define our policy's actions as
Reinforce.action(policy::BasicCarPolicy, r, s, A) = s.velocity < 0 ? 1 : 3

In [20]:
#Instantiate the MountainCar environment
env = MountainCar()

MountainCar(Reinforce.MountainCarEnv.MountainCarState(0.0, 0.0), 0.0, -1)

Now we have all the building blocks we need to start evaluating our policy \\(\pi \\) on our environment ```env```. More so than in other languages, ```Julia``` developers often use unicode like \\(\pi \\) so that numerical code mirrors its mathematical formulas. 

There is no training happening in this example: the policy is fixed and deterministic. However, the code contains a few artifacts from a training procedure that are worth highlighting.

Generally, a Reinforcement Learning algorithm has access to episodes, which are sequences of observations of the environment it's interacting with. 

At any given timestep \\(t \\) our environment is in some state \\(s \\). Our agent performs an action \\(a \\) which changes the state from \\(s \\) to \\(s' \\). The state change may or may not give the agent a reward \\(r \\), which is negative for punishments.
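To make the state, action, reward cycle concrete, here is a minimal sketch of that loop on a toy corridor. It does not use `Reinforce.jl` at all: the names `toy_step`, `toy_policy` and `run_toy_episode` are made up for illustration, and the corridor itself is an assumption chosen only because it is easy to follow by hand.

```julia
# Toy corridor: states are the integers 1..5 and the goal is state 5.
# Every name in this block is hypothetical and not part of Reinforce.jl.

# Environment dynamics: action 1 moves left, action 2 moves right.
# Each step costs a reward of -1, so shorter episodes score higher.
function toy_step(s::Int, a::Int)
    s′ = clamp(s + (a == 2 ? 1 : -1), 1, 5)
    r = -1.0
    return r, s′
end

# A fixed, deterministic policy: always move right.
toy_policy(s::Int) = 2

# Run one episode from state `s` and accumulate the reward.
function run_toy_episode(s::Int; maxsteps::Int = 100)
    total_reward = 0.0
    t = 0
    while s != 5 && t < maxsteps
        a = toy_policy(s)        # the agent picks an action
        r, s′ = toy_step(s, a)   # the environment answers with (r, s′)
        total_reward += r
        s = s′                   # the next state becomes the current one
        t += 1
    end
    return total_reward, t
end

total_reward, t = run_toy_episode(1)
println("reward: $total_reward, steps: $t")
```

The tuple `(s, a, r, s′)` produced on every pass through the loop is the same shape of data that the episode iterator in the next cell yields.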

In [22]:
# Run next episode
# "!" mark at the end of "episode!" means it modifies its arguments
function episode!(env, π = RandomPolicy())
  ep = Episode(env, π)
  for (s, a, r, s′) in ep
    #visualize the environment
    gui(plot(env))
  end
    
  # "return" keyword is optional in Julia
  # we are choosing to only return the total reward and the total number of iterations 
  # we could return whatever metrics we want here
  ep.total_reward, ep.niter
end

# main() function
R, n = episode!(env, BasicCarPolicy())
println("reward: $R, iter: $n")

reward: -84.0, iter: 84


Now that we have our basic example working, it's time to go one step lower in our layers of abstraction to tie up the loose ends in our understanding. We'll answer the following questions:
1. What does ```Reinforce.AbstractPolicy``` look like and why is it mutable?
2. How does ```Reinforce.action``` work?
3. What does an ```Episode``` look like?

## Reinforce.AbstractPolicy

```AbstractPolicy``` is surprisingly simple: any policy needs to implement an ```action``` function that, given a policy, reward, state and a set of valid actions, returns the next action \\(a \\).

In [24]:
# src/Reinforce.jl


abstract type AbstractPolicy end

"""
    a = action(policy, r, s, A)

Take in the last reward `r`, current state `s`,
and set of valid actions `A = actions(env, s)`,
then return the next action `a`.

Note that a policy could do a 'sarsa-style' update simply by saving the last state and action `(s,a)`.
"""
function action end

action
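The pattern above (an abstract type plus an empty generic function that concrete types add methods to) can be tried in isolation. The sketch below uses local stand-ins, `SketchPolicy` and `sketch_action`, so it runs without `Reinforce.jl`; both names and both example policies are hypothetical. It also shows one plausible reason a policy struct might be declared `mutable`: a policy that remembers the last `(s, a)` pair, as the docstring suggests, has to update its own fields.

```julia
# Local stand-ins for the library's abstract type and generic function.
# All names in this block are hypothetical.
abstract type SketchPolicy end

# A generic function with no methods yet; concrete policies add them.
function sketch_action end

# A stateless policy: always pick the last valid action.
struct AlwaysLast <: SketchPolicy end
sketch_action(::AlwaysLast, r, s, A) = last(A)

# A stateful policy: it stores the last state and action it saw,
# so the struct must be mutable for the fields to be reassigned.
mutable struct RememberingPolicy <: SketchPolicy
    last_state::Union{Nothing,Int}
    last_action::Union{Nothing,Int}
end
RememberingPolicy() = RememberingPolicy(nothing, nothing)

function sketch_action(p::RememberingPolicy, r, s, A)
    a = first(A)          # trivial choice, the point is the bookkeeping
    p.last_state = s
    p.last_action = a
    return a
end

# Julia picks the method from the concrete type of the first argument.
println(sketch_action(AlwaysLast(), 0.0, 1, 1:3))
p = RememberingPolicy()
println(sketch_action(p, 0.0, 7, 1:3), " ", p.last_state, " ", p.last_action)
```

A policy with no fields, like `BasicCarPolicy` in the example, has nothing to mutate, so for it the `mutable` keyword makes no practical difference.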

## Episode

The Episode struct is relatively straightforward and keeps track of what you'd expect

**NEED MORE DETAIL HERE** - esp on ```episodes``` data structure

In [26]:
# src/episodes/iterators.jl

mutable struct Episode{E<:AbstractEnvironment,P<:AbstractPolicy,F<:AbstractFloat}
  env::E
  policy::P
  total_reward::F # total reward of the episode
  last_reward::F
  niter::Int      # current step in this episode
  freq::Int       # number of steps between choosing actions
  maxn::Int       # max steps in an episode - should be constant during an episode
end

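The `for (s, a, r, s′) in ep` loop from the canonical example works because an episode can be iterated. As a rough illustration of how a struct with these fields could support that, here is a sketch built on Julia's `Base.iterate` protocol. It is not the library's implementation: `CountdownEnv` and `SketchEpisode` are hypothetical toy types, and the action is hard-coded instead of coming from a policy.

```julia
# Hypothetical toy environment: the state counts down to zero.
mutable struct CountdownEnv
    state::Int
end

# A reduced episode record with the same bookkeeping fields as above.
mutable struct SketchEpisode
    env::CountdownEnv
    total_reward::Float64
    niter::Int
    maxn::Int
end
SketchEpisode(env; maxn = 100) = SketchEpisode(env, 0.0, 0, maxn)

# Each call to iterate advances the environment by one step and
# yields the transition (s, a, r, s′). Returning `nothing` ends the loop.
function Base.iterate(ep::SketchEpisode, state = nothing)
    env = ep.env
    if env.state <= 0 || ep.niter >= ep.maxn
        return nothing
    end
    s = env.state
    a = 1                 # fixed action: decrement the counter by one
    env.state -= a
    r = -1.0              # every step costs one unit of reward
    s′ = env.state
    ep.total_reward += r
    ep.niter += 1
    return (s, a, r, s′), nothing
end

ep = SketchEpisode(CountdownEnv(3))
for (s, a, r, s′) in ep
    println("s=$s a=$a r=$r s′=$s′")
end
println("total reward: $(ep.total_reward), steps: $(ep.niter)")
```

Keeping the running totals on the struct is what lets the caller read `ep.total_reward` and `ep.niter` after the loop has finished.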

The individual fields are all unsurprising. We've already seen ```AbstractPolicy```, but ```AbstractEnvironment``` is new, so let's look at that next.

## AbstractEnvironment

Any environment needs to subtype ```AbstractEnvironment``` and implement its interface, which is similar to the OpenAI Gym environment interface: https://github.com/openai/gym/blob/master/gym/core.py

In [29]:
# src/Reinforce.jl

abstract type AbstractEnvironment end

function reset! end

"""
r, s′ = step!(env, s, a)
Move the simulation forward, collecting a reward and getting the next state.
"""
function step! end

finished(env::AbstractEnvironment, s′) = false

"""
A = actions(env, s)
Return a list/set/description of valid actions from state `s`.
"""
function actions end

state(env::AbstractEnvironment) = env.state

reward(env::AbstractEnvironment) = env.reward

maxsteps(env::AbstractEnvironment) = 0

maxsteps

## Implementing a custom environment

Great, at this point we understand all the main abstractions in the example. So let's now see how we can create our own environment.

First we define the parameters of the MountainCar simulation environment

In [40]:
# src/envs/mountain_car.jl

const min_position = -1.2
const max_position = 0.6
const max_speed = 0.07
const goal_position = 0.5
const min_start = -0.6
const max_start = 0.4

const car_width = 0.05
const car_height = car_width/2.0
const clearance = 0.2*car_height
const flag_height = 0.05

0.05

Next we define the restricted state representation we want our learning algorithm to use. There is a tradeoff here: we could feed every parameter of the simulator to our learning algorithm, but learning would be a lot slower that way.

In [41]:
mutable struct MountainCarState
  position::Float64
  velocity::Float64
end

mutable struct MountainCar <: AbstractEnvironment
  state::MountainCarState
  reward::Float64
  seed::Int
end
MountainCar(seed=-1) = MountainCar(MountainCarState(0.0, 0.0), 0.0, seed)


Next we need to implement a simple ```reset!()``` function which will randomize the starting position and set the velocity of the car to 0. We randomize the starting position because we want our agent to do well regardless of where it starts, so that its behavior generalizes. A further improvement would be to randomize the hills the car has to climb or randomize the dimensions of the car to make sure that our agent can drive any car on any hill and not just a single car on a single hill at a single starting position.

In [37]:
function reset!(env::MountainCar)
  if env.seed >= 0
    seed!(env.seed)
    env.seed = -1
  end

  env.state.position = rand(Uniform(min_start, max_start))
  env.state.velocity = 0.0

  env
end

reset! (generic function with 1 method)
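The seed handling in `reset!()` exists so that a run can be reproduced. The sketch below shows the same idea with a slightly different technique: instead of seeding the global generator and sampling from a `Uniform` distribution, it passes an explicit `MersenneTwister` from the `Random` standard library and scales `rand` by hand. The names `sketch_min_start`, `sketch_max_start` and `sample_start` are hypothetical; the two bounds are copied from the constants defined earlier.

```julia
using Random

# Start range copied from the tutorial's min_start and max_start constants.
const sketch_min_start = -0.6
const sketch_max_start = 0.4

# Draw a start position uniformly from the start range using only Base.
# rand(rng) returns a Float64 in [0, 1), which is scaled and shifted.
sample_start(rng::AbstractRNG) =
    sketch_min_start + rand(rng) * (sketch_max_start - sketch_min_start)

# Two generators created with the same seed produce the same sequence,
# so two "runs" see identical starting positions.
rng_a = MersenneTwister(42)
rng_b = MersenneTwister(42)
starts_a = [sample_start(rng_a) for _ in 1:5]
starts_b = [sample_start(rng_b) for _ in 1:5]

println(starts_a == starts_b)
```

Passing the generator explicitly keeps the randomness of one environment independent from anything else that calls `rand`, at the cost of threading an extra argument through the code.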

Then we make the simplifying assumption that there are only 3 possible actions: push left, do nothing, or push right, encoded as 1, 2 and 3 (```step!()``` below turns them into a push of ```a - 2```).

In [None]:
actions(env::MountainCar, s) = DiscreteSet(1:3)

We need a goal state that tells us when the agent did a good job. We again make a simplifying assumption: the goal states are on the right side of the screen, at or beyond ```goal_position```.

In [None]:
finished(env::MountainCar, s′) = env.state.position >= goal_position

Now we can get to the meat of the simulator with the ```step!()``` function

In [38]:
function step!(env::MountainCar, s::MountainCarState, a::Int)
  
  # store local variables
  position = env.state.position
  velocity = env.state.velocity

  # update state based on action
  velocity += (a - 2)*0.001 + cos(3*position)*(-0.0025)
  velocity = clamp(velocity, -max_speed, max_speed)
    
  # update position based on change to velocity
  position += velocity
    
  # don't YOLO past the left side of the screen: stop the car at the wall
  if position <= min_position && velocity < 0
    velocity = 0.0
  end
    
  # keep the position within the track boundaries
  position = clamp(position, min_position, max_position)
  env.state = MountainCarState(position, velocity)
  env.reward = -1.0

  return env.reward, env.state
end


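To get a feel for these dynamics without the environment types, the update can be rewritten as a pure function and driven by the same rule as `BasicCarPolicy`: push in whichever direction the car is already moving. This is a standalone sketch, not library code. The `sim_`-prefixed names are hypothetical, the constants are copied from the cell that defines the simulation parameters, and the fixed start position of `-0.5` is an arbitrary choice inside the start range.

```julia
# Constants copied from the tutorial's parameter cell.
const sim_min_position = -1.2
const sim_max_position = 0.6
const sim_max_speed = 0.07
const sim_goal_position = 0.5

# One step of the dynamics as a pure function. `a` is 1, 2 or 3,
# so (a - 2) is a push of -1, 0 or +1.
function sim_step(position::Float64, velocity::Float64, a::Int)
    velocity += (a - 2) * 0.001 + cos(3 * position) * (-0.0025)
    velocity = clamp(velocity, -sim_max_speed, sim_max_speed)
    position += velocity
    if position <= sim_min_position && velocity < 0
        velocity = 0.0    # the car stops at the left wall
    end
    position = clamp(position, sim_min_position, sim_max_position)
    return position, velocity
end

# Same rule as BasicCarPolicy: push left when moving left, else push right.
sim_policy(velocity::Float64) = velocity < 0 ? 1 : 3

# Roll the dynamics forward until the goal or the step budget is reached.
function simulate(position::Float64; maxsteps::Int = 1000)
    velocity = 0.0
    steps = 0
    while position < sim_goal_position && steps < maxsteps
        position, velocity = sim_step(position, velocity, sim_policy(velocity))
        steps += 1
    end
    return position, steps
end

final_position, steps = simulate(-0.5)
println("reached position $final_position after $steps steps")
```

The push of `0.001` is weaker than gravity on the steep part of the hill, which is why the car cannot drive straight up and has to rock back and forth to build momentum first.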

In [36]:

# ------------------------------------------------------------------------
height(xs) = sin(3 * xs)*0.45 + 0.55
rotate(xs::Array{Float64}, ys::Array{Float64}, Θ::Float64) =
  xs.*cos(Θ) .- ys.*sin(Θ), ys.*cos(Θ) .+ xs.*sin(Θ)

translate(xs::Array{Float64}, ys::Array{Float64}, t::Array{Float64}) =
  xs .+ t[1], ys .+ t[2]

@recipe function f(env::MountainCar)
  legend := false
  xlims := (min_position, max_position)
  ylims := (0, 1.1)
  grid := false
  ticks := nothing

  # Mountain
  @series begin
    xs = range(min_position, stop = max_position, length = 100)
    ys = height.(xs)
    seriestype := :path
    linecolor --> :blue
    xs, ys
  end

  # Car
  @series begin
    fillcolor --> :black
    seriestype := :shape

    θ = cos(3 * env.state.position)
    xs = [-car_width/2, -car_width/2, car_width/2, car_width/2]
    ys = [0, car_height, car_height, 0] .+ clearance
    xs, ys = rotate(xs, ys, θ)
    translate(xs, ys, [env.state.position, height(env.state.position)])
  end

  # Flag
  @series begin
    linecolor --> :red
    seriestype := :path

    [goal_position, goal_position], [height(goal_position), height(goal_position) + flag_height]
  end
end
end # closes the enclosing module (its opening line is not shown in this excerpt)



## Learning algorithms - Actor critic algorithm

So far we've covered how to
1. Hardcode a policy
2. Create a custom environment
3. Extract actions from a policy and plug them into an environment
4. Design data structures to make all of the above possible

There's still one final loose end: an example of a reinforcement learning algorithm that can learn a robust policy. We've already covered the ```AbstractPolicy``` type, so let's dive into a specific implementation of an actor critic algorithm by looking at ```src/policies/actor_critic.jl```.
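Before reading the library's file, it may help to see the general shape of an actor critic update. The sketch below is a generic one-step tabular version with a softmax policy, written from the textbook description of the method. It is not the code in `src/policies/actor_critic.jl` and makes no claim about how that file is organized; the function names, the table layout and the step sizes `α`, `β` and discount `γ` are all assumptions chosen for illustration. Terminal states are not treated specially here.

```julia
# Softmax over a vector of action preferences (shifted for stability).
function softmax(prefs::Vector{Float64})
    exps = exp.(prefs .- maximum(prefs))
    return exps ./ sum(exps)
end

# One-step tabular actor critic update for a transition (s, a, r, s′).
# V[s] is the critic's value estimate and θ[s, b] the actor's preference
# for action b in state s. Hypothetical sketch, not library code.
function actor_critic_update!(V::Vector{Float64}, θ::Matrix{Float64},
                              s::Int, a::Int, r::Float64, s′::Int;
                              α::Float64 = 0.5, β::Float64 = 0.1,
                              γ::Float64 = 0.9)
    # TD error: how much better the step went than the critic expected
    δ = r + γ * V[s′] - V[s]

    # Critic: move the value estimate toward the one-step target
    V[s] += α * δ

    # Actor: the gradient of log softmax is onehot(a) - π(·|s)
    π_s = softmax(θ[s, :])
    for b in axes(θ, 2)
        grad = (b == a ? 1.0 : 0.0) - π_s[b]
        θ[s, b] += β * δ * grad
    end
    return δ
end

# Two states, two actions, everything initialized to zero.
V = zeros(2)
θ = zeros(2, 2)
δ = actor_critic_update!(V, θ, 1, 2, 1.0, 2)
println("δ = $δ, V = $V, θ[1, :] = $(θ[1, :])")
```

A positive TD error raises both the value of the state and the preference for the action that was taken, which is the feedback loop between the critic and the actor.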

**Can also look at policy gradients**



## Next steps

I hope this article motivates you to tinker more with Reinforcement Learning. Consider making a PR to the ```Reinforce.jl``` project - any one of the below would be extremely helpful

1. More environments
2. More policy functions
3. YAML model configuration via ```Flux.jl```
4. YAML scene configuration to create 3D scenes for robotics applications

Also consider checking ```test/``` in the ```Reinforce.jl``` project for more experimental code and ideas on what you could contribute.