# Q* Learning with FrozenLake 8x8

In this Notebook, we'll implement an agent <b>that plays FrozenLake.</b>

![alt text](http://simoninithomas.com/drlc/Qlearning/frozenlake4x4.png)

The goal of this game is <b>to go from the starting state (S) to the goal state (G)</b> by walking only on frozen tiles (F) and avoiding holes (H). In the default environment the ice is slippery, **so you won't always move in the direction you intend (stochastic environment)**. This notebook registers a variant with `is_slippery` set to `False`, so moves are deterministic here.

Thanks to [lukewys](https://github.com/lukewys) for his help

## Prerequisites 🏗️
Before diving into the notebook, **you need to understand**:
- The foundations of Reinforcement Learning (MC, TD, Reward hypothesis...) [Article](https://medium.freecodecamp.org/an-introduction-to-reinforcement-learning-4339519de419)
- Q-learning [Article](https://medium.freecodecamp.org/diving-deeper-into-reinforcement-learning-with-q-learning-c18d0db58efe)
- In the [video version](https://www.youtube.com/watch?v=q2ZOEFAaaI0), we implemented a Q-learning agent that learns to play OpenAI Taxi-v2 🚕 with NumPy.

# This is a notebook from [Deep Reinforcement Learning Course, new version](https://simoninithomas.github.io/Deep_reinforcement_learning_Course/)
<img src="https://raw.githubusercontent.com/simoninithomas/Deep_reinforcement_learning_Course/master/docs/assets/img/DRLC%20Environments.png" alt="Deep Reinforcement Course"/>
<br>
<p>  Deep Reinforcement Learning Course is a free series of articles and video tutorials 🆕 about Deep Reinforcement Learning, where **we'll learn the main algorithms (Q-learning, Deep Q Nets, Dueling Deep Q Nets, Policy Gradients, A2C, Proximal Policy Gradients, Prediction Based rewards agents…), and how to implement them with TensorFlow and PyTorch.**

  ![alt text](http://simoninithomas.com/drlc/libraries.png)
  
  
<br><br>
    
📜The articles explain the architectures from the big picture to the mathematical details behind them.
<br>
📹 The videos explain how to build the agents with Tensorflow </b></p>
<br>
This course will give you a **solid foundation for understanding and implementing future state-of-the-art algorithms**. You'll also build a strong professional portfolio by creating **agents that learn to play awesome environments**: Doom© 👹, Space Invaders 👾, Outrun, Sonic the Hedgehog©, Michael Jackson’s Moonwalker, as well as agents that can navigate 3D environments with DeepMind Lab (Quake) and walk with MuJoCo.
<br><br>
</p> 

## 📚 The complete [Syllabus HERE](https://simoninithomas.github.io/Deep_reinforcement_learning_Course/)


## Any questions 👨‍💻
<p> If you have any questions, feel free to ask me: </p>
<p> 📧: <a href="mailto:hello@simoninithomas.com">hello@simoninithomas.com</a>  </p>
<p> Github: https://github.com/simoninithomas/Deep_reinforcement_learning_Course </p>
<p> 🌐 : https://simoninithomas.github.io/Deep_reinforcement_learning_Course/ </p>
<p> Twitter: <a href="https://twitter.com/ThomasSimonini">@ThomasSimonini</a> </p>
<p> Don't forget to <b> follow me on <a href="https://twitter.com/ThomasSimonini">Twitter</a>, <a href="https://github.com/simoninithomas/Deep_reinforcement_learning_Course">GitHub</a> and <a href="https://medium.com/@thomassimonini">Medium</a> to be notified of the new articles that I publish </b></p>
    
## How to help  🙌
3 ways:
- **Clap our articles and like our videos a lot**: Clapping on Medium means that you really like our articles, and the more claps we have, the more our articles are shared. Liking our videos helps them become much more visible to the deep learning community.
- **Share and speak about our articles and videos**: By sharing our articles and videos you help us spread the word.
- **Improve our notebooks**: if you find a bug or **a better implementation**, you can send a pull request.
<br>

## Important note 🤔
<b> You can run it on your computer but it's better to run it on GPU-based services</b>. Personally, I use Microsoft Azure and their Deep Learning Virtual Machine (they offer $170):
https://azuremarketplace.microsoft.com/en-us/marketplace/apps/microsoft-ads.dsvm-deep-learning
<br>
⚠️ I don't have any business relations with them. I just loved their excellent customer service.

If you have trouble using Microsoft Azure, follow the explanations in this excellent article (skip the last part about fast.ai): https://medium.com/@manikantayadunanda/setting-up-deeplearning-machine-and-fast-ai-on-azure-a22eb6bd6429

## Step -1: Install the dependencies on Google Colab

In [1]:
!pip install numpy
!pip install gym

Collecting gym
  Using cached https://files.pythonhosted.org/packages/87/04/70d4901b7105082c9742acd64728342f6da7cd471572fd0660a73f9cfe27/gym-0.10.11.tar.gz
Collecting scipy (from gym)
  Downloading https://files.pythonhosted.org/packages/c0/1d/eef9d7b34ab8b7ee42d570f2e24d58ee0374064c1ca593bdb02914f66a80/scipy-1.2.0-cp36-cp36m-macosx_10_6_intel.macosx_10_9_intel.macosx_10_9_x86_64.macosx_10_10_intel.macosx_10_10_x86_64.whl (28.8MB)
Collecting requests>=2.0 (from gym)
  Downloading https://files.pythonhosted.org/packages/7d/e3/20f3d364d6c8e5d2353c72a67778eb189176f08e873c9900e10c0287b84b/requests-2.21.0-py2.py3-none-any.whl (57kB)
Collecting pyglet>=1.2.0 (from gym)
  Using cached https://files.pythonhosted.org/packages/1c/fc/dad5eaaab68f0c21e2f906a94ddb98175662cc5a654eee404d59554ce0fa/pyglet-1.3.2-py2.py3-none-any.whl
Collecting url

## Step 0: Import the dependencies 📚
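We use 3 libraries:
- `NumPy` for our Q-table
- `OpenAI Gym` for our FrozenLake environment
- `random` to generate random numbers

We also import `clear_output` from IPython to refresh the cell output when the Q-table is printed during training.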

In [2]:
import numpy as np
import gym
import random
from IPython.display import clear_output

## Step 1: Create the environment 🎮
- Here we'll create the FrozenLake 8x8 environment, registered under the custom id `a1-v3` with `is_slippery` set to `False` (deterministic moves).
- OpenAI Gym is a library <b> composed of many environments that we can use to train our agents.</b>
- In our case we choose to use Frozen Lake.

In [3]:
from gym.envs.registration import register

register(
    id='a1-v3',
    entry_point='gym.envs.toy_text:FrozenLakeEnv',
    kwargs={'map_name' : '8x8', 'is_slippery': False},
    max_episode_steps=100,
    reward_threshold=0.8196, # optimum = .8196; changing this seems to have no influence
)

In [4]:
# env = gym.make("FrozenLakeNotSlippery-v0")

#env = gym.make("FrozenLake-v0")

env = gym.make("a1-v3")

# env = gym.make("Taxi-v2")

In [5]:
env.render()
print("States = ", env.observation_space.n, ", Actions = ", env.action_space.n)


[S]FFFFFFF
FFFFFFFF
FFFHFFFF
FFFFFHFF
FFFHFFFF
FHHFFFHF
FHFFHFHF
FFFHFFFG
States =  64 , Actions =  4
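
The 64 states reported above are the grid cells numbered row by row, so a state index can be converted to a (row, column) pair. A minimal sketch, assuming the row-major numbering used by Gym's FrozenLake (`state = row * n_cols + col`); the helper names are hypothetical and not part of the notebook:

```python
N_COLS = 8  # width of the 8x8 map

def state_to_cell(state, n_cols=N_COLS):
    """Convert a flat state index into (row, col), assuming row-major numbering."""
    return divmod(state, n_cols)

def cell_to_state(row, col, n_cols=N_COLS):
    """Inverse of state_to_cell."""
    return row * n_cols + col

print(state_to_cell(0))    # start tile S, top-left corner
print(state_to_cell(63))   # goal tile G, bottom-right corner
print(cell_to_state(2, 3)) # the hole in the third row of the map
```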


## Step 2: Create the Q-table and initialize it 🗄️
- Now we'll create our Q-table. To know how many rows (states) and columns (actions) we need, we have to compute the state size and the action size.
- OpenAI Gym provides us a way to do that: `env.action_space.n` and `env.observation_space.n`
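
As a minimal sketch of this step, the table can be allocated as a zero matrix with one row per state and one column per action. The sizes are hard-coded here to the values printed by the environment (64 states, 4 actions) so the snippet runs without Gym; in the notebook they would come from `env.observation_space.n` and `env.action_space.n`.

```python
import numpy as np

state_size = 64   # env.observation_space.n for the 8x8 map
action_size = 4   # env.action_space.n

# One row per state, one column per action, all Q-values start at zero
qtable = np.zeros((state_size, action_size))
print(qtable.shape)
```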

## Step 3: Hyperparameters

In [6]:
max_steps = 100
epochs = 100000

learning_rate    = 0.8
discounting      = 0.95  # gamma

max_exploration   = 1
min_exploration   = 0.01
exploration_decay = 0.001


## Step 4: Q-Learning

In [7]:
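def chose_action(state, exploration_rate):
  # Epsilon-greedy: explore with probability exploration_rate, otherwise exploit
  if random.uniform(0, 1) < exploration_rate:
    return env.action_space.sample()  # random action
  else:
    # Greedy action: the column with the highest Q-value for this state
    action = np.argmax(qtable[state, :])
    return action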

In [8]:
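def learn(state, action, next_state, resulting_reward):
  # Q-learning update: Q(s,a) += lr * (r + gamma * max_a' Q(s',a') - Q(s,a))
  predicted = qtable[state, action]
  # Temporal-difference error between the bootstrapped target and the current estimate
  delta_q = resulting_reward + discounting * np.max(qtable[next_state]) - predicted
  qtable[state, action] = predicted + learning_rate*delta_q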
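
To make the update concrete, here is a single update step worked out on a tiny table. This is an illustrative sketch with a hypothetical two-state, two-action table, using the same `learning_rate` (0.8) and `discounting` (0.95) values as the hyperparameters defined earlier.

```python
import numpy as np

learning_rate = 0.8
discounting = 0.95

# Hypothetical toy table: 2 states x 2 actions
q = np.zeros((2, 2))
q[1] = [0.5, 1.0]   # the next state already has some value estimates

state, action, next_state, reward = 0, 1, 1, 0.0

td_target = reward + discounting * np.max(q[next_state])   # 0 + 0.95 * 1.0
td_error = td_target - q[state, action]                     # 0.95 - 0
q[state, action] += learning_rate * td_error                # 0 + 0.8 * 0.95

print(q[state, action])
```

Even though the immediate reward is zero, the entry for `(state, action)` becomes positive because value flows back from the next state. This is how the reward at the goal propagates towards the start over many episodes.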
  

In [9]:
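def learn_epoch(max_steps, exploration_rate):
  state = env.reset()
  total_reward = 0
  for s in range(max_steps):
    action = chose_action(state, exploration_rate)
    new_state, reward, done, info = env.step(action)
    # Reward shaping: the environment itself only returns 1 at the goal and 0 elsewhere
    if reward == 0 and done:
      # Episode ended without reaching the goal (a hole, or the step limit)
      reward = -1
    elif new_state == state:
      # Small penalty for a move that leaves the agent in place (e.g. into a wall)
      reward -= 0.01
    total_reward += reward
    learn(state, action, new_state, reward)
    state = new_state
    if done:
      break
  return total_reward, state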
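
The reward adjustments inside `learn_epoch` can be read in isolation. The sketch below pulls them into a hypothetical helper, `shape_reward`, which is not part of the notebook, so the three cases are easy to check.

```python
def shape_reward(reward, done, state, new_state):
    """Mirror the reward adjustments used in learn_epoch (hypothetical helper)."""
    if reward == 0 and done:
        return -1              # episode ended without reaching the goal
    if new_state == state:
        return reward - 0.01   # the move left the agent where it was
    return reward

print(shape_reward(0, True, 18, 19))    # terminal step with no reward
print(shape_reward(0, False, 0, 0))     # move that did not change the state
print(shape_reward(1, True, 62, 63))    # reached the goal
```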

In [10]:
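def learn_qtable(epochs, max_steps, print_debug=False):
  segment_size = epochs / 20  # report progress 20 times over the run
  segment_reward = 0
  for e in range(epochs):
    # Exponentially decay the exploration rate from max_exploration towards min_exploration
    exploration_rate = min_exploration + (max_exploration-min_exploration)*np.exp(-exploration_decay*e)
    total_reward, final_state = learn_epoch(max_steps, exploration_rate)
    segment_reward += total_reward
    if print_debug and e % segment_size == 0:
      clear_output()
      print(qtable)
    if e % segment_size == 0:
      # segment_reward is never reset, so the value printed as "avg reward" is the
      # cumulative reward so far divided by segment_size, not a per-segment average
      print("Epoch=", e, ", Exploration Rate=", exploration_rate, " avg reward=", segment_reward/segment_size)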
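
The exploration schedule used in `learn_qtable` can be evaluated on its own to see how quickly it decays. A small sketch reusing the hyperparameter values defined earlier; the function name `exploration_rate_at` is hypothetical:

```python
import numpy as np

max_exploration = 1
min_exploration = 0.01
exploration_decay = 0.001

def exploration_rate_at(episode):
    """Exponential decay from max_exploration towards min_exploration."""
    return min_exploration + (max_exploration - min_exploration) * np.exp(-exploration_decay * episode)

for episode in (0, 1000, 5000, 10000):
    print(episode, exploration_rate_at(episode))
```

With these values the rate is already below 0.02 by episode 5000, which agrees with the exploration rates in this notebook's training log. Almost all of the remaining training therefore runs close to the greedy policy.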

In [11]:
np.set_printoptions(precision=3, suppress=True)

qtable = np.zeros((env.observation_space.n, env.action_space.n))
learn_qtable(epochs, max_steps, print_debug=False)

print("final:")
print(qtable)


Epoch= 0 , Exploration Rate= 1.0  avg reward= -0.000218
Epoch= 5000 , Exploration Rate= 0.016670567529094613  avg reward= 0.029374000000003828
Epoch= 10000 , Exploration Rate= 0.010044945930464861  avg reward= 0.9752180000000022
Epoch= 15000 , Exploration Rate= 0.010000302843297297  avg reward= 1.927877999999999
Epoch= 20000 , Exploration Rate= 0.010000002040542086  avg reward= 2.8761679999999967
Epoch= 25000 , Exploration Rate= 0.010000000013749065  avg reward= 3.830838000000007
Epoch= 30000 , Exploration Rate= 0.010000000000092641  avg reward= 4.778298000000028
Epoch= 35000 , Exploration Rate= 0.010000000000000625  avg reward= 5.73130000000006
Epoch= 40000 , Exploration Rate= 0.010000000000000004  avg reward= 6.69031600000008
Epoch= 45000 , Exploration Rate= 0.01  avg reward= 7.644940000000046
Epoch= 50000 , Exploration Rate= 0.01  avg reward= 8.595574000000013
Epoch= 55000 , Exploration Rate= 0.01  avg reward= 9.547433999999985
Epoch= 60000 , Exploration Rate= 0.01  avg reward= 10.5

In [12]:
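def execute(max_steps):
  state = env.reset()
  for s in range(max_steps):
    # A negative exploration rate disables exploration, so the action is always greedy
    action = chose_action(state, -100)
    new_state, reward, done, info = env.step(action)
    state = new_state
    env.render()
    if done:
      # s is zero-based, so the number of steps taken is s + 1
      print("Done after ", s + 1, " steps")
      break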
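
Besides replaying one episode, the learned greedy policy can be inspected for every state at once. A sketch, assuming Gym's FrozenLake action encoding (0 = Left, 1 = Down, 2 = Right, 3 = Up) and using a small hypothetical Q-table in place of the trained one:

```python
import numpy as np

ARROWS = np.array(["<", "v", ">", "^"])  # index = action id (Left, Down, Right, Up)

def greedy_policy_grid(qtable, n_rows, n_cols):
    """Return a grid of arrows showing the highest-valued action in each state."""
    best_actions = np.argmax(qtable, axis=1)
    return ARROWS[best_actions].reshape(n_rows, n_cols)

# Hypothetical 2x2 example; with the notebook's table use greedy_policy_grid(qtable, 8, 8)
toy_q = np.array([
    [0.0, 0.2, 0.9, 0.0],   # state 0: Right
    [0.0, 0.8, 0.1, 0.0],   # state 1: Down
    [0.0, 0.1, 0.7, 0.0],   # state 2: Right
    [0.0, 0.0, 0.0, 0.0],   # state 3: all zero, argmax falls back to action 0
])
for row in greedy_policy_grid(toy_q, 2, 2):
    print(" ".join(row))
```

States whose row is still all zeros, such as holes and the goal, which are never updated, show the first action (`<`) simply because `np.argmax` returns the first maximum.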
  

In [14]:
execute(100)

  (Down)
SFFFFFFF
[41mF[0mFFFFFFF
FFFHFFFF
FFFFFHFF
FFFHFFFF
FHHFFFHF
FHFFHFHF
FFFHFFFG
  (Down)
SFFFFFFF
FFFFFFFF
[41mF[0mFFHFFFF
FFFFFHFF
FFFHFFFF
FHHFFFHF
FHFFHFHF
FFFHFFFG
  (Down)
SFFFFFFF
FFFFFFFF
FFFHFFFF
[41mF[0mFFFFHFF
FFFHFFFF
FHHFFFHF
FHFFHFHF
FFFHFFFG
  (Right)
SFFFFFFF
FFFFFFFF
FFFHFFFF
F[41mF[0mFFFHFF
FFFHFFFF
FHHFFFHF
FHFFHFHF
FFFHFFFG
  (Right)
SFFFFFFF
FFFFFFFF
FFFHFFFF
FF[41mF[0mFFHFF
FFFHFFFF
FHHFFFHF
FHFFHFHF
FFFHFFFG
  (Right)
SFFFFFFF
FFFFFFFF
FFFHFFFF
FFF[41mF[0mFHFF
FFFHFFFF
FHHFFFHF
FHFFHFHF
FFFHFFFG
  (Right)
SFFFFFFF
FFFFFFFF
FFFHFFFF
FFFF[41mF[0mHFF
FFFHFFFF
FHHFFFHF
FHFFHFHF
FFFHFFFG
  (Down)
SFFFFFFF
FFFFFFFF
FFFHFFFF
FFFFFHFF
FFFH[41mF[0mFFF
FHHFFFHF
FHFFHFHF
FFFHFFFG
  (Down)
SFFFFFFF
FFFFFFFF
FFFHFFFF
FFFFFHFF
FFFHFFFF
FHHF[41mF[0mFHF
FHFFHFHF
FFFHFFFG
  (Right)
SFFFFFFF
FFFFFFFF
FFFHFFFF
FFFFFHFF
FFFHFFFF
FHHFF[41mF[0mHF
FHFFHFHF
FFFHFFFG
  (Down)
SFFFFFFF
FFFFFFFF
FFFHFFFF
FFFFFHFF
FFFHFFFF
FHHFFFHF
FHFFH[41mF[0mHF
FFFHFFFG
  (Do