<a href="https://colab.research.google.com/github/mgite03/bu-ai4all-2019/blob/main/rl/Copy_of_Intro_to_RL.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Intro to Reinforcement Learning

## What is Reinforcement Learning?

Reinforcement learning is a type of machine learning that's different from both supervised and unsupervised learning. In reinforcement learning, we want to teach an agent to succeed in a certain environment by providing rewards to the agent.

For example: say we want to teach a puppy to come when called. The agent is the puppy, and the environment is the world. When we call the puppy's name, it can take some action, either running over or ignoring its owner's voice. If the puppy comes running over, it receives a treat as its reward.

<!-- Here's another example:  -->




## Breaking it down further...

Every RL problem can be broken down into the following components:
* Agent
* Actions (that the agent can take)
* Environment
* State (of the environment)
* Reward

**Agent**: The agent is what we are trying to teach. It encompasses everything we *can* control about the situation.

**Environment**: The environment is the world in which the agent resides. It encompasses everything we *can't* control, although the agent can change the state of its environment by taking actions.

**Actions**: Anything the agent can *do*. Whenever the agent takes an action, it can change the state of the environment.

**State**: A complete description of the current state of the environment. By "complete description", we really mean everything, which includes not only where all of the mountains and trees are, but also that the agent is in a certain location currently climbing up a mountain with a certain speed, for example.

**Reward**: A scalar (number) that the agent receives after each action. We give positive reward to "good" actions and negative reward to "bad" actions.

(Read each of these descriptions over a few times to make sure you understand them.)


### Markov Decision Process

A Markov decision process (MDP) is a formulation of states and actions where the probability of moving to the next state depends only on the current state and the chosen action, and is independent of any earlier states.

The states and actions of our RL problems follow this rule.
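Written as an equation, the rule looks roughly like this. The symbols are an assumption introduced here for illustration: $s_t$ is the state and $a_t$ is the action at step $t$.

$$
P(s_{t+1} \mid s_t, a_t) = P(s_{t+1} \mid s_0, a_0, s_1, a_1, \ldots, s_t, a_t)
$$

In words: knowing the whole history on the right-hand side tells us nothing more about the next state than knowing just the current state and action.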

## Moving through time

![alt text](https://lh3.google.com/u/0/d/1ENx_BQnHLyeDF245SWuOzhk04C5xlFeZ=w2560-h1478-iv2)

The cycle works like this: the agent observes the state of its environment and takes an action; the state of the environment changes and the agent receives a reward; the agent then takes another action based on the new state, and so on.

Each one of these cycles happens in a **timestep**.

Sequences of states and actions are broken down into **episodes**, and each episode lasts a certain number of timesteps. An episode can end when the agent dies or achieves its ultimate goal.
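The cycle of timesteps and episodes described above can be sketched in plain Python. This is only an illustration: `CorridorEnv`, `run_episode`, and `always_right` are hypothetical names made up for this sketch, and the toy corridor is not one of the environments used later in this notebook.

```python
class CorridorEnv:
    """Hypothetical toy environment: a corridor of squares numbered 0..length-1.

    The agent starts at square 0 and the episode ends at the last square.
    """

    def __init__(self, length=5):
        self.length = length
        self.state = 0

    def reset(self):
        """Start a new episode and return the initial state."""
        self.state = 0
        return self.state

    def step(self, action):
        """Apply an action (-1 = left, +1 = right); return (state, reward, done)."""
        self.state = min(max(self.state + action, 0), self.length - 1)
        done = self.state == self.length - 1
        reward = 1.0 if done else 0.0
        return self.state, reward, done


def run_episode(env, policy, max_timesteps=100):
    """Run one episode and return (total_reward, timesteps_used)."""
    state = env.reset()
    total_reward = 0.0
    for t in range(1, max_timesteps + 1):
        action = policy(state)                  # the agent picks an action from the state
        state, reward, done = env.step(action)  # the environment changes, a reward arrives
        total_reward += reward
        if done:                                # goal reached: the episode is over
            return total_reward, t
    return total_reward, max_timesteps


def always_right(state):
    """A very simple policy: ignore the state and always move right."""
    return 1


total, steps = run_episode(CorridorEnv(length=5), always_right)
print("total reward =", total, "timesteps =", steps)
```

Each pass through the `for` loop is one timestep, and one call to `run_episode` is one episode.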

# Representing all this in code

We will start by importing the `gym` module created by OpenAI. `gym` contains many pre-coded environments that we can interact with.

In [None]:
import gym

We will be using the "Frozen Lake" environment. Read about what this environment is [here](https://gym.openai.com/envs/FrozenLake-v0/).

Let's create an instance of the "Frozen Lake" environment.

In [None]:
env = gym.make("FrozenLake-v0", is_slippery=False, map_name="8x8", desc=None)

We can visualize the environment by calling the `render()` function on the environment.

In [None]:
env.render()


[41mS[0mFFFFFFF
FFFFFFFF
FFFHFFFF
FFFFFHFF
FFFHFFFF
FHHFFFHF
FHFFHFHF
FFFHFFFG


The highlighted square shows the current location of the agent. In this environment, the agent always starts on the "start" square (denoted by `S`).

If we assign a number to each square of the environment, we can describe the agent's state with just one number.

Let's look at all the possible states:

In [None]:
env.observation_space

Discrete(64)

`Discrete(64)` means that the agent can be in one of 64 possible states. An 8 by 8 grid has 64 squares.
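To make the numbering concrete, here is a small sketch that converts between a state number and a grid position. It assumes the squares are numbered row by row from the upper left, starting at 0, on the 8x8 map; the helper names `state_to_row_col` and `row_col_to_state` are made up for this illustration and are not part of `gym`.

```python
GRID_SIZE = 8  # the 8x8 map has 8 squares per row


def state_to_row_col(state, size=GRID_SIZE):
    """Convert a state number into (row, column), both counted from 0."""
    return divmod(state, size)


def row_col_to_state(row, col, size=GRID_SIZE):
    """Convert a (row, column) position back into a state number."""
    return row * size + col


print(state_to_row_col(0))     # (0, 0), the upper-left corner
print(state_to_row_col(63))    # (7, 7), the lower-right corner
print(row_col_to_state(2, 3))  # 19
```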

Environments in `gym` have a few functions we can call:
* `step(a)` tells the environment that the agent has chosen to take some action `a`. The environment calculates the next state depending on `a`, and it returns a tuple containing the new state, the reward, whether the episode has finished, and a dictionary of any additional information (which we won't be using).
* `reset()` resets the environment to a fresh state, and returns the current state (which the agent can observe). For the Frozen Lake environment, this always puts the agent back at the upper-left corner.

Let's reset the environment and get the current state:

In [None]:
obs = env.reset()
print("current state =", obs)

current state = 0


This is telling us that right now the agent is in state `0`, which is the upper left corner. We can verify this by rendering the environment again:

In [None]:
env.render()


[41mS[0mFFFFFFF
FFFFFFFF
FFFHFFFF
FFFFFHFF
FFFHFFFF
FHHFFFHF
FHFFHFHF
FFFHFFFG


In [None]:
env.action_space

Discrete(4)

This means that there are 4 possible actions that the agent can take. These happen to be left, down, right, and up. When we use `step()` to tell the environment which action we want to take, we input `0` for left, `1` for down, `2` for right, and `3` for up.

For example, if we want to tell the agent to go down, we call `env.step(1)`.
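As a rough sketch of what each action does to the state number, the function below mimics a non-slippery move on the 8x8 grid. It is an assumption-laden illustration rather than the library's code: `next_state` is a hypothetical helper, it assumes row-by-row numbering from the upper left, and it ignores holes and the goal.

```python
SIZE = 8  # width and height of the 8x8 map

LEFT, DOWN, RIGHT, UP = 0, 1, 2, 3


def next_state(state, action, size=SIZE):
    """Return the state reached by a non-slippery move; the edges block movement."""
    row, col = divmod(state, size)
    if action == LEFT:
        col = max(col - 1, 0)
    elif action == DOWN:
        row = min(row + 1, size - 1)
    elif action == RIGHT:
        col = min(col + 1, size - 1)
    elif action == UP:
        row = max(row - 1, 0)
    else:
        raise ValueError(f"unknown action: {action}")
    return row * size + col


print(next_state(0, DOWN))   # 8
print(next_state(0, RIGHT))  # 1
print(next_state(0, LEFT))   # 0 (the edge blocks the move)
```

Moving left or right changes the state number by 1, and moving up or down changes it by 8, one full row.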

Tell the environment we want the agent to go down, and render the environment again:

In [None]:
env.step(1)
env.render()

  (Down)
SFFFFFFF
[41mF[0mFFFFFFF
FFFHFFFF
FFFFFHFF
FFFHFFFF
FHHFFFHF
FHFFHFHF
FFFHFFFG


Now the agent has moved downwards.

Instead of having to render the environment every time to figure out where the agent is, we want to store all the important information in variables. Let's reset the environment and call `step` again.

In [None]:
# Reset the environment, and store the initial state in the "obs" variable.
# When the agent perceives the state, we call this the agent's observation ("obs" stands for observation).
obs = env.reset()
print("initial state =", obs)


# take a step to the right (remember, right is denoted by "2")
obs, reward, done, info = env.step(2) # (focus on "obs" for now. "reward", "done", and "info" will be explained in a moment)
print("new state =", obs)

initial state = 0
new state = 1


If we count the squares in the environment starting from the upper left (and start counting from 0), we see that the agent is in square # 1. We can verify this by rendering the environment:

In [None]:
env.render()

  (Right)
S[41mF[0mFFFFFF
FFFFFFFF
FFFHFFFF
FFFFFHFF
FFFHFFFF
FHHFFFHF
FHFFHFHF
FFFHFFFG


Rendering the environment also tells us the most recent action we have taken. In this case, we just took action `2`, which is to move right.

Remember, when we call `step()`, it returns more information than just the new state. It also returns the reward, whether the episode is finished, and a dictionary of any extra information.

Let's take another step to the right and look at each of these variables:

In [None]:
obs, reward, done, info = env.step(2)
print("observation =", obs)
print("reward =", reward)
print("is the episode done? done =", done)
print("(we won't be using info, but here it is: info =", info, ")")

observation = 2
reward = 0.0
is the episode done? done = False
(we won't be using info, but here it is: info = {'prob': 1.0} )


What have we learned from these variables? We see that the agent is in state `2`, it has received no reward, and the episode is not finished.

We can interpret zero reward as neutral, neither good nor bad.

Render the environment to make sure you know where state `2` is:

In [None]:
env.render()


  (Right)
SF[41mF[0mFFFFF
FFFFFFFF
FFFHFFFF
FFFFFHFF
FFFHFFFF
FHHFFFHF
FHFFHFHF
FFFHFFFG


Now, you know how to move around, you know how to check where the agent currently is (this is the same as checking the current state), and you know how to check if the episode is done.

Spend some time controlling the agent in this environment. After each step, print out the current state, the reward the agent has just received, and the `done` variable. You may also call `render()` to render the environment.

(You can make lots of code cells to do this.)

In [None]:
obs, reward, done, info = env.step(2)
env.render()
print(obs, reward, done, info)

# The episode is over once `done` is True; the reward tells us how it ended.
if done and reward == 1:
  print("Good job!")
elif done and reward == 0:
  print("Ya dead.")

In [None]:
# you can make more code cells below this one by clicking "[+] Code"

What did you notice about this environment? Specifically:

1. What are the different rewards the agent received?

> At G, the agent gets a reward of 1.0; everywhere else, it gets 0.0.

2. When did the episode end (when did `done = True`)?

> The episode ended when the agent fell into a hole or reached G.

3. What reward did the agent receive when the episode ended?

> If the agent reached G, it got a reward of 1.0. If it fell into a hole, it got 0.0.

Type your answers below each question.


(Continue experimenting above until you know the answers to all of these questions.)
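The pattern behind these answers can be summarised in a short sketch. It copies the 8x8 map from the `render()` output above and assumes row-by-row state numbering; `outcome` is a hypothetical helper written for this illustration, not a `gym` function, and it ignores any step limit the library may impose on an episode.

```python
# The 8x8 map as printed by render(): S = start, F = frozen, H = hole, G = goal.
LAKE_MAP = [
    "SFFFFFFF",
    "FFFFFFFF",
    "FFFHFFFF",
    "FFFFFHFF",
    "FFFHFFFF",
    "FHHFFFHF",
    "FHFFHFHF",
    "FFFHFFFG",
]


def outcome(state, lake_map=LAKE_MAP):
    """Return (reward, done) for arriving at the given state number."""
    row, col = divmod(state, len(lake_map[0]))
    tile = lake_map[row][col]
    if tile == "G":
        return 1.0, True   # reached the goal
    if tile == "H":
        return 0.0, True   # fell into a hole
    return 0.0, False      # still on the ice, the episode continues


print(outcome(2))   # (0.0, False)
print(outcome(19))  # (0.0, True)
print(outcome(63))  # (1.0, True)
```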

There's a more efficient way of interacting with the environment than typing out `env.step()` every time you want the agent to move.

Write a function that takes in a string as an input and renders the current state of the environment. Your function should keep asking for the next action until the episode is over.
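One possible way to handle the string input is sketched below. `parse_action` and `ACTION_NUMBERS` are hypothetical names chosen for this illustration; the sketch accepts either an action number or the start of a direction name, which is just one design choice among many.

```python
ACTION_NUMBERS = {"left": 0, "down": 1, "right": 2, "up": 3}


def parse_action(text):
    """Turn typed input such as 'Left', 'r', or '2' into an action number.

    Returns None when the input is not recognised.
    """
    text = text.strip().lower()
    if text in {"0", "1", "2", "3"}:
        return int(text)
    for name, number in ACTION_NUMBERS.items():
        # Accept a full name ("right") or any prefix of it ("r", "ri", ...).
        if text and name.startswith(text):
            return number
    return None


print(parse_action("Right"))  # 2
print(parse_action("d"))      # 1
print(parse_action("jump"))   # None
```

Returning `None` for unrecognised input lets the calling loop ask again instead of crashing.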

In [None]:
# Your code here

def move():
  """Ask for actions until the episode ends; return the final `done` and `reward`."""
  done = False
  reward = 0.0
  while not done:
    # 0 = left, 1 = down, 2 = right, 3 = up
    action = int(input("Move?"))
    obs, reward, done, info = env.step(action)
    env.render()
  return done, reward


env.reset()
env.render()
DONE, REWARD = move()

# After each finished episode, report the result and offer another try.
while DONE:
  print(REWARD)
  if REWARD == 1:
    print("Good job!")
    answer = input("Play again?")
  else:
    print("Ya dead.")
    answer = input("Do you want to come back to life?")
  if answer.lower() == "yes":
    env.reset()
    env.render()
    DONE, REWARD = move()
  else:
    print("Goodbye!")
    break


[41mS[0mFFFFFFF
FFFFFFFF
FFFHFFFF
FFFFFHFF
FFFHFFFF
FHHFFFHF
FHFFHFHF
FFFHFFFG
Move?2
  (Right)
S[41mF[0mFFFFFF
FFFFFFFF
FFFHFFFF
FFFFFHFF
FFFHFFFF
FHHFFFHF
FHFFHFHF
FFFHFFFG
Move?2
  (Right)
SF[41mF[0mFFFFF
FFFFFFFF
FFFHFFFF
FFFFFHFF
FFFHFFFF
FHHFFFHF
FHFFHFHF
FFFHFFFG
Move?2
  (Right)
SFF[41mF[0mFFFF
FFFFFFFF
FFFHFFFF
FFFFFHFF
FFFHFFFF
FHHFFFHF
FHFFHFHF
FFFHFFFG
Move?1
  (Down)
SFFFFFFF
FFF[41mF[0mFFFF
FFFHFFFF
FFFFFHFF
FFFHFFFF
FHHFFFHF
FHFFHFHF
FFFHFFFG
Move?1
  (Down)
SFFFFFFF
FFFFFFFF
FFF[41mH[0mFFFF
FFFFFHFF
FFFHFFFF
FHHFFFHF
FHFFHFHF
FFFHFFFG
0.0
Ya dead.
Do you want to come back to life?yes

[41mS[0mFFFFFFF
FFFFFFFF
FFFHFFFF
FFFFFHFF
FFFHFFFF
FHHFFFHF
FHFFHFHF
FFFHFFFG
Move?2
  (Right)
S[41mF[0mFFFFFF
FFFFFFFF
FFFHFFFF
FFFFFHFF
FFFHFFFF
FHHFFFHF
FHFFHFHF
FFFHFFFG
Move?2
  (Right)
SF[41mF[0mFFFFF
FFFFFFFF
FFFHFFFF
FFFFFHFF
FFFHFFFF
FHHFFFHF
FHFFHFHF
FFFHFFFG
Move?2
  (Right)
SFF[41mF[0mFFFF
FFFFFFFF
FFFHFFFF
FFFFFHFF
FFFHFFFF
FHHFFFHF
FHFFHFHF
FFFHFFFG
Move