# Lecture 10

### Reinforcement Learning

- Learning to act in complex environments
- Maximize some reward in a complex environment
- Actions taken by the agent influence the environment it exists in
- Continue until "termination", at which point you finally measure how much reward you collected
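
The loop described above (act, observe, repeat until termination, then total up the reward) can be sketched without any library. `Corridor` and `run_episode` are hypothetical names made up for this illustration, not part of Gym, and the one-reward-at-the-goal setup is an assumption.

```python
class Corridor:
    """Hypothetical toy environment: cells 0..length-1, goal at the right end."""

    def __init__(self, length=5):
        self.length = length
        self.position = 0

    def reset(self):
        self.position = 0
        return self.position

    def step(self, action):
        # action: 0 = left, 1 = right; the walls clamp the position.
        move = 1 if action == 1 else -1
        self.position = min(max(self.position + move, 0), self.length - 1)
        done = self.position == self.length - 1
        reward = 1.0 if done else 0.0
        return self.position, reward, done


def run_episode(env, policy, max_steps=100):
    """Act until termination (or max_steps) and return (total reward, steps)."""
    obs = env.reset()
    total_reward, steps = 0.0, 0
    for _ in range(max_steps):
        obs, reward, done = env.step(policy(obs))
        total_reward += reward
        steps += 1
        if done:
            break
    return total_reward, steps


# A policy that always moves right reaches the goal and collects the reward.
print(run_episode(Corridor(length=5), policy=lambda obs: 1))
```

Note that the agent only learns how well it did once the episode ends, which is the "no immediate label" problem in miniature.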

#### Why is this difficult?

- No immediate label
- Reward results from a combination of actions
- I need to determine which of my actions generated the reward
- I need to explore the environment
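
One common way to force exploration is epsilon-greedy action selection. These notes do not name it, so treat this as an illustrative sketch: `epsilon_greedy` and the `estimates` values are made up for the example.

```python
import random


def epsilon_greedy(action_values, epsilon, rng):
    """Pick a random action with probability epsilon, else the best-looking one."""
    if rng.random() < epsilon:
        return rng.randrange(len(action_values))  # explore
    # Exploit: index of the highest estimated value
    return max(range(len(action_values)), key=lambda a: action_values[a])


rng = random.Random(0)  # seeded so the run is repeatable
estimates = [0.1, 0.5, 0.2, 0.0]  # made-up value estimates for four actions
choices = [epsilon_greedy(estimates, epsilon=0.1, rng=rng) for _ in range(1000)]

# Fraction of picks that went to action 1, the current best guess
print(choices.count(1) / len(choices))
```

With a small epsilon the agent mostly takes the action it currently believes is best, but still occasionally tries the others, so a wrong early estimate can be corrected.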

- Value Network: a regression model that estimates the expected return from your current state
- Policy Network: decides your moves for you

#### Discount Rate

- The discount rate indicates that future rewards are worth less than short-term rewards
    - Analogy: interest rates indicate that the same money in the future is worth less
- The discount rate sets your preferred tradeoff between short-term reward and long-term reward
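
A minimal sketch of how discounting changes the value of a delayed reward. The symbol `gamma` (a per-step multiplier between 0 and 1) and the function name `discounted_return` are assumptions for this example; the notes do not fix a notation.

```python
def discounted_return(rewards, gamma):
    """Sum of rewards where the reward at step t is weighted by gamma**t."""
    total = 0.0
    # Work backwards: G_t = r_t + gamma * G_{t+1}
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


rewards = [0.0, 0.0, 0.0, 1.0]  # a single reward that arrives on the fourth step
for gamma in (1.0, 0.9, 0.5):
    print(gamma, discounted_return(rewards, gamma))
```

With `gamma = 1.0` the delayed reward keeps its full value; the smaller `gamma` gets, the less that same reward is worth today, which pushes the agent toward short-term reward.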

- Find the policy that has the greatest expected returns

#### Q-Value
- The expected return from taking a given move in the current state and then continuing to follow your policy
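
To make the definition concrete, here is a sketch on a deterministic toy problem, so the "expected" return is just the return. The 4-cell corridor, `GAMMA = 0.9`, and the names `step`, `always_right`, and `q_value` are all assumptions for illustration.

```python
GOAL = 3     # terminal state of a hypothetical 4-cell corridor (cells 0..3)
GAMMA = 0.9  # assumed discount applied per step


def step(state, action):
    """Deterministic toy dynamics: action 1 moves right, action 0 moves left."""
    next_state = min(max(state + (1 if action == 1 else -1), 0), GOAL)
    reward = 1.0 if next_state == GOAL else 0.0
    return next_state, reward


def always_right(state):
    return 1  # the fixed policy we follow after the first move


def q_value(state, action):
    """Return from taking `action` in `state`, then following the policy."""
    next_state, reward = step(state, action)
    if next_state == GOAL:
        return reward
    return reward + GAMMA * q_value(next_state, always_right(next_state))


for action, name in ((0, "left"), (1, "right")):
    print(name, round(q_value(1, action), 3))
```

From cell 1, moving right scores higher than moving left: the left move wastes steps, so the same final reward is discounted more.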

#### Notes
- If you wanted to do this without deep learning you could just hard-code your policy & run the algo

# OpenAI Gym

In [1]:
!pip install gym



In [2]:
import gym

### Policy Function
This function determines where you should try to go given your current position

In [3]:
def policy(obs):
    """
    Look up the hard-coded move for board position `obs` (0-15, row by row).

    Up = 3
    Right = 2
    Down = 1
    Left = 0
    Hole = -1 (reset board)
    Goal = -2 (you win)
    """
    mapping = [
         2,  2, 1,  0,
         1, -1, 1, -1,
         2,  1, 1, -1,
        -1,  2, 2, -2
    ]
    return mapping[obs]
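
To check the table by eye, it can help to print it as a grid of arrows. `ARROWS` and `render_policy` are hypothetical helpers for this sketch; the table is repeated so the snippet runs on its own.

```python
# Action codes from the docstring above, plus the two special markers
ARROWS = {0: "<", 1: "v", 2: ">", 3: "^", -1: "H", -2: "G"}

# Same table as in policy()
mapping = [
     2,  2, 1,  0,
     1, -1, 1, -1,
     2,  1, 1, -1,
    -1,  2, 2, -2
]


def render_policy(mapping, width=4):
    """Return the policy table as one line of symbols per board row."""
    rows = []
    for start in range(0, len(mapping), width):
        row = mapping[start:start + width]
        rows.append(" ".join(ARROWS[code] for code in row))
    return "\n".join(rows)


print(render_policy(mapping))
```

The `H` and `G` cells line up with the holes and the goal on the 4x4 board, and every arrow points along a path toward `G`.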

### Iterate through the environment making movements depending on your policy

In [8]:
# Watch out: the ice is slippery, so the agent does not always move in the direction you asked for
# Uses the classic Gym API: reset() returns the observation, step() returns a 4-tuple
env = gym.make('FrozenLake-v0')
obs = env.reset()
epochs = 0  # number of moves taken
deaths = 0  # number of times we fell in a hole

env.render()

while True:
    direction = policy(obs)
    if direction == -1:
        # Standing on a hole: count the death and start over from S
        print("You fell in the hole")
        deaths += 1
        obs = env.reset()
        continue
    elif direction == -2:
        # Standing on the goal
        print("You win!")
        break
    obs, reward, is_done, info = env.step(direction)
    env.render()
    print("Reward: {}".format(reward))
    epochs += 1

print("Number of moves: {}".format(epochs))
print("Number of deaths: {}".format(deaths))
env.close()


[41mS[0mFFF
FHFH
FFFH
HFFG
  (Right)
S[41mF[0mFF
FHFH
FFFH
HFFG
Reward: 0.0
  (Right)
SF[41mF[0mF
FHFH
FFFH
HFFG
Reward: 0.0
  (Down)
SFFF
FH[41mF[0mH
FFFH
HFFG
Reward: 0.0
  (Down)
SFFF
F[41mH[0mFH
FFFH
HFFG
Reward: 0.0
You fell in the hole
  (Right)
[41mS[0mFFF
FHFH
FFFH
HFFG
Reward: 0.0
  (Right)
[41mS[0mFFF
FHFH
FFFH
HFFG
Reward: 0.0
  (Right)
SFFF
[41mF[0mHFH
FFFH
HFFG
Reward: 0.0
  (Down)
SFFF
F[41mH[0mFH
FFFH
HFFG
Reward: 0.0
You fell in the hole
  (Right)
[41mS[0mFFF
FHFH
FFFH
HFFG
Reward: 0.0
  (Right)
SFFF
[41mF[0mHFH
FFFH
HFFG
Reward: 0.0
  (Down)
SFFF
FHFH
[41mF[0mFFH
HFFG
Reward: 0.0
  (Right)
SFFF
FHFH
FFFH
[41mH[0mFFG
Reward: 0.0
You fell in the hole
  (Right)
[41mS[0mFFF
FHFH
FFFH
HFFG
Reward: 0.0
  (Right)
[41mS[0mFFF
FHFH
FFFH
HFFG
Reward: 0.0
  (Right)
[41mS[0mFFF
FHFH
FFFH
HFFG
Reward: 0.0
  (Right)
S[41mF[0mFF
FHFH
FFFH
HFFG
Reward: 0.0
  (Right)
SFFF
F[41mH[0mFH
FFFH
HFFG
Reward: 0.0
You fell in the hole
  (Right)
[41mS[0mFFF